How to do SLAs right

The way most people think about maintenance and support is wrong. Wrong in the same way that the healthcare system is broken. Here’s a brilliant way to fix it, so brilliant that I wish I’d thought it up myself.

Someone told me that in ancient China, the village doctor was paid a wage, except when someone got sick. If someone got sick, it was the doctor’s fault, and the doctor should not get paid… right? It may sound strange, but it is actually quite correct.

In the modern Western (capitalist) medical world, doctors are paid more if they prescribe more drugs. This is a royally bad idea that benefits only drug company investors. Then we pay insurance companies to cover our medical costs. The insurance companies cover more and more drugs, raising the fees, while the doctors prescribe more and more drugs, raising the costs. It’s a perfect vicious cycle.

In software, we build a product and then it goes live. Outages cost money, so our clients seek to mitigate the risk. Enter the modern maintenance and on-call support contract. A typical on-call support deal consists of a fixed fee for the hours spent on call, and that fee goes up if the response time must be shorter or the hours are worse. On top of that, the client is charged per incident: an hourly fee that also depends on the time and the severity of the incident.
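To make the arithmetic of such a deal concrete, here is a minimal sketch. The names (`Incident`, `traditional_invoice`) and all the amounts are made up for illustration; real contracts will have more knobs.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    hours: float        # billable hours spent on the incident (hypothetical)
    hourly_rate: float  # rate already adjusted for time of day and severity


def traditional_invoice(retainer: float, incidents: list[Incident]) -> float:
    """Monthly invoice: fixed on-call retainer plus hourly incident charges."""
    return retainer + sum(i.hours * i.hourly_rate for i in incidents)


# Illustrative numbers only.
quiet_month = traditional_invoice(2000.0, [])
bad_month = traditional_invoice(
    2000.0, [Incident(3, 150.0), Incident(5, 250.0)]
)
print(quiet_month, bad_month)
```

With these made-up numbers, the month with two incidents bills more than the quiet month, which is exactly where the trouble starts.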

The incentives here are all screwed up. A software company that is tempted may put subtle bugs in the software, or just avoid fixing them, so that it makes more money from the incidents. Worse, a timely and adequate response to an incident builds trust with the patient –sorry, client– so they’ll just keep on coming back for more.

What if we did things differently and fixed the incentives? As the story goes, they tried this in China, and the results were very positive. I’d like to simply do the same thing for software.

The base cost of the contract is set up front. Every time there is an incident, the client gets a reduction in the monthly fee. The longer the incident is open, the larger the reduction. The base price should be high enough to incentivise the doctors –sorry, engineers– to reduce the risk of issues. How fault-tolerant do you think such a system would get? The higher the monthly fee, the stronger the incentive to keep the money flowing. The client can play with the base fee and the penalty structure to ensure there is a good balance between sticks and carrots.
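As a rough sketch of how such a fee could be computed: the linear penalty (a flat amount per incident plus an amount per hour it stays open) and the floor at zero are my assumptions here, not part of any standard contract, and the function and parameter names are hypothetical.

```python
def penalty_invoice(
    base_fee: float,
    open_hours: list[float],
    per_incident: float,
    per_hour: float,
) -> float:
    """Monthly fee under the penalty model.

    open_hours holds, for each incident this month, how many hours it
    stayed open. Each incident costs the team a flat penalty plus a
    penalty per open hour (assumed linear). The fee never drops below
    zero (also an assumption).
    """
    penalty = sum(per_incident + per_hour * h for h in open_hours)
    return max(0.0, base_fee - penalty)


# Illustrative numbers only.
print(penalty_invoice(10000.0, [], 500.0, 100.0))          # no incidents
print(penalty_invoice(10000.0, [2.0, 6.5], 500.0, 100.0))  # two incidents
```

Tuning `per_incident` against `per_hour` is one way to play with the balance: a high flat penalty rewards prevention, a high hourly penalty rewards fast resolution.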

You might think that if there are no incidents, the client could lower the base fee. But that’s the wrong kind of incentive again. The only reason to lower the fee should be that the cost of the risk goes down (meaning the company is winding down its business). In a growing business the cost of the risk is going up, so the budget for mitigation should go up as well. So instead of lowering the base price, a client could hire hackers to try to cause an incident. The hackers could be incentivised to cause an outage during a low-load window, making it extra fun for the engineers. Nothing like a Saturday night outage to motivate the engineers to do better next time.

So the rules of the game should be:

  • the client+team set the base fee,
  • the client+team set the penalty structure for incidents,
  • the team is free to do as little or as much as they want in terms of prevention,
  • the client is free to intentionally cause incidents to reduce the fee via penalties.

It might be prudent to negotiate rules for changing the base fee and penalty structure over time as well. The more income security the engineers have if things go well, the more motivated they will be to do long-term prevention.

This way, maintenance and on-call support become an interesting game where the incentives all point in the right direction: a well-hardened system.

I’d love to hear your thoughts on this!
