So you have to go on-call… what now?

Now that the panic is over, let’s have a look at how you can prepare for the inevitable

Before I begin, I’d like to point out that I am aware that on-call rotations differ between teams and companies, and it’s hard to give advice on how to tackle every possible scenario. So I’m going to start out by explaining what my typical on-call rotation looks like.

On my team, we do weekly on-call rotations whereby you are on-call 24/7 😑 for the given week. This is because we don’t have a counterpart team in the US or other time zones that could cover the unsociable hours. As it stands, each person is typically on call once per month (based on our team size).

So what are our duties? The obvious one: if there is a production issue with the service we are on call for, we get paged 📟 (morning/day/night/weekend) in the hope that we can solve the problem or provide some leads on what may be going wrong. We also assist with the day-to-day triages that come in, which may not have been critical enough for a page but are suspicious enough to warrant some investigation. Finally, we do what we call “service owner” duties, whereby we check for any ERROR/WARN logs from our service for the week, any abnormalities in the metrics, and so on. As I’ve said, each team/service/company deals with this differently.
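To make the weekly ERROR/WARN check a little more concrete, here is a minimal sketch in Python. It is not the tooling my team uses. The log format (`<timestamp> <LEVEL> <message>`), the sample lines and the function name `summarize_levels` are all assumptions made up for illustration, so adapt the parsing to whatever your own logs look like.

```python
from collections import Counter

# Assumed (hypothetical) log format: "<timestamp> <LEVEL> <message>".
LEVELS_OF_INTEREST = {"ERROR", "WARN"}


def summarize_levels(lines):
    """Count ERROR and WARN lines, grouped by (level, message)."""
    counts = Counter()
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed format
        _, level, message = parts
        if level in LEVELS_OF_INTEREST:
            counts[(level, message)] += 1
    return counts


# Made-up sample data, only here so the sketch can be run as-is.
sample = [
    "2024-01-01T10:00:00Z INFO request served",
    "2024-01-01T10:00:01Z ERROR database timeout",
    "2024-01-01T10:00:02Z WARN slow response",
    "2024-01-01T10:00:03Z ERROR database timeout",
]

# Print the most frequent messages first.
for (level, message), n in summarize_levels(sample).most_common():
    print(f"{n:>3}  {level:<5}  {message}")
```

Grouping by message rather than just counting lines makes it easier to tell one noisy, recurring error apart from several different problems.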

Before we get started, I’d like to point out that there are many advantages to actually going on-call — it’s not all doom and gloom 🚒

  • Extra 💰 💰 💰 (this is especially sweet when you’ve done a super quiet week of on-call)
  • A sense of responsibility and ownership for your team/service. There is an undeniable sense of pride in being responsible for your team.
  • Potential for learning. Great opportunity to expand your service knowledge, triage capabilities, knowledge of other services and/or your company and customers.
  • Potential for improvement. Issues/bugs may come up during investigation that may have otherwise gone unnoticed (and you get the credit for finding them).

Of course there are downsides to being on-call too. I’ve been woken up in the middle of the night, there can be some pressure associated with certain issues/customers, and nobody wants to give up their weekend (most of the time, no matter the money). But in brutal honesty, you’ve probably signed the contract and there is no escaping it, so let’s embrace it instead.

So without further mumble, here’s what I did to prepare myself for my first on-call rotation… 🥁

First and foremost, if you are preparing to go on-call, there is an important conversation to have with your people leader/manager. You must be open and honest about your fears, your knowledge gaps and your expectations before setting off to go on-call. Your manager should already have a good understanding of what it takes for a person to go on-call for their team. They have watched you grow and learn and are aware of your capabilities, so their feedback is important.

You should take the time to ask the difficult questions. Do they think you are ready? What are you lacking? What areas do you need to improve before taking the role on? Is there a deadline?

Your manager should be your number one supporter in this case (as mine was), and you should come away from your conversations with a clear action plan for how to get there.

The best way to learn to be productive in an actual production issue is (in my humble opinion) practice. This is where triages come in. Volunteer to take on triages (perhaps the less critical ones) during the week. Digging through those logs at your own pace, without the pressure of a production issue, is hugely valuable experience. This is an opportunity for you to dig deep, take notes, notice patterns and ask your teammates all the questions that you otherwise may not get to ask during an actual production issue.

I kept a notebook of triages I solved along the way while prepping. I took notes, made shortcuts, played with logs and asked a million questions. It not only enhanced my knowledge but also increased my confidence.

If you’ve already spoken to your manager, they should already know to point these types of tasks your way.

You’re not always going to be assigned every triage, so I’d recommend looking at other people’s work. Usually, the results of a triage are formally written up or presented in some form. Keep yourself in the loop. Catch up on the conclusion, ask questions if you’re not clear, study the pattern, write it down!

During my prep, I would make sure to catch up on every single issue, no matter how big or small. I would read through the Slack chats (if available), I would comb through the Jira tickets, I would dig through the logs. What would I do in this situation? How would I approach it? I treated all these incidents as “sample exam questions”. If time allowed, I’d try to triage or solve the problem myself and see if I would have come to the same conclusions. If I saw some information I hadn’t been able to gather, I would ask the person involved how they came up with it. Was there a log search I didn’t know about?

Practice, practice, practice.

Ask to shadow a teammate while they are on-call. Ask them to involve you in everything that comes up. This process really makes everything a lot more real. It gives you a sense of urgency and responsibility to learn. It makes your team aware of your intentions. If the person you are shadowing gets paged, it’s a great opportunity for you to get involved in a real-life scenario, with the added protection of having somebody else with you.

Take the time to set up your toolbox for on-call. There are numerous bookmarks you can set up, notes you can take, log shortcuts you can prepare. If you found a way to spot a certain pattern, make it easily accessible. Remember, in an actual production issue, things move fast. You don’t want to be slowed down because you failed to find the right link or the right log search. Be prepared.
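As one idea of what a “log shortcut” could look like, here is a small Python sketch that keeps your favourite searches under short names so you don’t have to remember the exact pattern at 3am. Everything in it is hypothetical: the search names, the regular expressions and the sample log lines are invented for illustration, and your real toolbox may well live in your logging tool’s saved queries instead.

```python
import re

# Hypothetical saved searches; the names and patterns are made up.
SAVED_SEARCHES = {
    "db-timeouts": r"ERROR .*database timeout",
    "slow-requests": r"WARN .*slow response",
}


def run_search(name, lines):
    """Return the lines that match the saved search called `name`."""
    pattern = re.compile(SAVED_SEARCHES[name])
    return [line for line in lines if pattern.search(line)]


# Made-up log lines so the sketch can be run as-is.
sample_lines = [
    "2024-01-01T10:00:00Z INFO request served",
    "2024-01-01T10:00:01Z ERROR database timeout",
    "2024-01-01T10:00:02Z WARN slow response",
]

for hit in run_search("db-timeouts", sample_lines):
    print(hit)
```

Whenever a triage teaches you a new pattern, add it to the dictionary with a name you will actually remember.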

Lastly, as Nike would say, “Just Do It”.

Yes, it’s scary; yes, it’s overwhelming; but it gets easier and you get better. No amount of preparation will truly prepare you like throwing yourself in at the deep end. If you are surrounded by great teammates, you won’t be alone, and you should be able to reach out to anybody in a true crisis that you cannot solve on your own. It all gets easier with time: it becomes so much less nerve-racking and frightening, and suddenly it is second nature.

One last piece of advice: just because you’re on the rota now doesn’t mean your learning time is over. Don’t stop doing any of the above. What got you on-call in the first place is what’s going to allow you to become reliable in future on-call situations.

Happy on-call duty 👩‍🚒 👨‍🚒

Distributed Systems Engineer @ Workday