You’ve been checking your messages, emails and notifications for around an hour and everything has been eerily quiet.
The team are exhausted.
Their brains are mush with the battle that has been going on since lunch. It’s dark outside and folks want to go home to spend some time with their loved ones.
The team go over the wild scribbles on the whiteboard one last time, just to see if anything has been missed. Everything looks fine. The powers that be call it and end the P1 incident. Following a notification to the wider business this battle is over but the war rages on.
It’s time to go home.
‘See you at the postmortem…’
A postmortem (or post-mortem) is a ceremony that gives you and the team a chance to go through a past incident while it is still fresh in everyone’s mind. The end result of a postmortem is learning.
The word ‘postmortem’ does indeed sound a bit scary, as it is usually the result of something dying. How these ceremonies are respected and carried out really does depend on the workplace culture.
https://blog.octanner.com/leadership/how-leaders-are-killing-innovation
If a postmortem includes some kind of death march from middle management intended to throw someone under the bus, then your current culture has far bigger problems than a SQL server outage.
Ultimately, as an Engineering Manager, Development Manager, Ops Manager, VP of Engineering or CTO, you must take accountability for things that go wrong inside your department.
Your ego cannot come before the humans in the team.
If you are the one accountable for the incident or you need to run a postmortem to initiate some learning, here are a few things you should be thinking about to get the most out of a usually tricky situation.
Who to Invite
A practical postmortem should be open to anyone in the company to attend.
If the thought of managers from other teams/departments coming gives you a panic attack then good! If they are coming, it’s likely they want to know what happened and also how it was fixed. They care about the outage as it has affected them in some way.
It’s probably not a good idea to have the postmortem without the members of the team who actually found, dealt with or fixed the problem. You need them to help drive the conversation about what really happened as well as help set the scene for everyone else.
Get as big a space as possible (preferably with a whiteboard) and some sticky notes, as you are going to want to write stuff down. You want to be able to show a visual picture of what happened.
You should also clearly state somewhere that the meeting, discussion and actions will be minuted for anyone who can’t attend.
Set the Scene
As the presenter of the postmortem you should have a clear enough account of what happened to at least open the scene.
Opening with an account of how the incident was flagged to the team is a really good starting point. It also gives you a chance to tell the audience exactly how the incident was discovered: by your team or by your customers.
Present the problems that were first brought to the team’s attention so that everyone in the audience has context for what is about to be discussed.
Guiding The Conversation
Your job in this postmortem is to try and guide the team into learning from failure so that they can reduce the risk of it happening again.
There is no doubt that you and the rest of the team have probably been thinking about solutions since it happened, but it’s good to get this out from the conversation. You want to eke it out of the team to see if they are thinking the same way as you… maybe your way isn’t the best way. If you are leading the discussion you are likely in a more ‘powerful’ position; your ideas may stop others from putting forward their own.
If you can manage to talk openly about what happened and discuss the issue from soup to nuts, there will be loads of chances for you to poke the team into thinking about what could have been done:
Customer called about the problem. ‘Where can we improve monitoring?’
Ticket logged in the system. ‘Who did it go to first and why?’
We’ve had this problem before. ‘Was it documented and easy to find?’
First attempt didn’t work. ‘Was it the right thing to try first?’
Dave knows how to fix it. ‘What if Dave wasn’t here?’
Beware of hindsight and confirmation bias. Hindsight bias is an opinion heavily swayed by knowing what actually happened; confirmation bias is a tendency to look for information which favours an individual’s own opinion on what happened.
Action It Out
So the meeting is complete, you’ve gone through the gory details of what everyone thinks happened, and everyone in the room has learned a ton, but the postmortem isn’t over.
You have two things you need to do:
Summarize the incident to make sure you have the correct details for the meeting minutes and that everyone understood
Dish out any remedial actions
If you’ve managed to get to the end of a postmortem and not found anything that could have been done to prevent it, then it’s likely that the meeting hasn’t gone as well as it could have.
The Database logs filled up? Get the DBAs to check their routines?
Load Balancer dropped from the DNS? Can Ops create another failover?
Feature Toggle not behaving? Find out how we missed it in test?
Depending on the severity of the incident, these actions may take a week to complete or a year. It may very well be the case that the cost of the work needed to prevent the incident is far higher than the cost of the actual issue it’s causing in your organisation.
It’s completely up to you and your team how these are filtered into your regular working practices.
The sky falls every day in most engineering organisations and the sign of a good team is that key processes and procedures are exercised to make sure the mean time to resolution (MTTR) is as small as possible.
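If you want a number to track between postmortems, MTTR is just the average time from detection to resolution. Here is a minimal sketch in Python, assuming you keep a simple log of when each incident was detected and when it was resolved. The function name and the incident timestamps are made up for illustration; your own tooling will likely record this differently.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) across a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual durations, starting from a zero-length timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (detected, resolved) pairs.
incidents = [
    (datetime(2024, 1, 8, 13, 0), datetime(2024, 1, 8, 18, 0)),   # 5 hours
    (datetime(2024, 2, 2, 9, 30), datetime(2024, 2, 2, 10, 30)),  # 1 hour
    (datetime(2024, 3, 15, 22, 0), datetime(2024, 3, 16, 1, 0)),  # 3 hours
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 3:00:00
```

A single average hides a lot, so treat it as a trend to watch over time rather than a target to hit.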
Shit happens. Learn from it.