2.1.On-call engineers are the first line of defense against unplanned work, whether it's a production environment issue or an ad-hoc support request.
2.2.Separating operational work from deep, focused work allows most of the team to concentrate on development tasks.
2.3.Only the on-call engineer needs to focus on unpredictable operational issues and support requests.
3.1.On-call duty rotates among developers according to a schedule.
3.1.1.Each qualified developer participates in the rotation.
3.2.On-call engineers spend most of their time handling ad-hoc support requests.
3.2.1.Bug reports, plus questions about how their team's software works and how to use it.
3.3.Sooner or later, every on-call engineer will encounter an operational incident: a critical issue with production software.
3.3.1.An incident is either an alert raised by an automated monitoring system or a problem observed by a support engineer and reported to the on-call.
3.3.2.The on-call engineer must triage the incident, mitigate its symptoms, and eventually resolve it.
3.4.Every on-call rotation should start and end with a handover.
4.1.Respond at any time.
4.1.1.Larger companies run a "follow the sun" on-call rotation, where on-call responsibility moves between developers in different time zones over time.
4.1.2."Respond at any time" doesn't mean immediately dropping whatever you're doing to fix every new issue.
4.1.3.For many requests it's perfectly fine to start by acknowledging that you've received the inquiry and saying when you expect to be able to look at it.
4.2.Stay focused.
4.3.Prioritize work.
4.3.1.Work on the highest priority tasks first.
4.3.2.As tasks are completed or blocked, work your way down from the highest priority to the lowest.
4.3.3.If you can't tell how urgent a request is, ask what its impact is.
4.3.4.P1: Critical Impact - The service is unavailable in production.
4.3.5.P2: High Impact - Use of the service is severely impaired.
4.3.6.P3: Medium Impact - Use of the service is partially impaired.
4.3.7.P4: Low Impact - The service is fully available.
4.3.8.Service level indicators (SLIs) such as error rate, request latency, and requests per second are among the easiest ways to tell whether an application is healthy.
4.3.9.Service level objectives (SLOs) define the target values of SLIs for healthy application behavior.
4.3.10.If error rate is an application's SLI, its SLO might be a request error rate of less than 0.001%.
4.3.11.A service level agreement (SLA) is an agreement about what happens when an SLO is missed.
4.3.12.Know your application's SLIs, SLOs, and SLAs; they point you to the most important metrics and help you prioritize incidents (see the sketch below).
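As a rough illustration of how an SLI, an SLO, and the priority levels above fit together, here is a minimal Python sketch; the 0.001% SLO value, the priority thresholds, and the function names are assumptions made for this example, not definitions from the source.

```python
# Illustrative sketch: compute an error-rate SLI and map an SLO breach to a priority.
# The SLO value, thresholds, and names below are assumptions for illustration only.

def error_rate_sli(failed_requests: int, total_requests: int) -> float:
    """Error rate as a percentage of all requests (an example SLI)."""
    if total_requests == 0:
        return 0.0
    return 100.0 * failed_requests / total_requests

SLO_ERROR_RATE_PCT = 0.001  # example SLO: error rate stays below 0.001%

def suggest_priority(error_rate_pct: float) -> str:
    """Rough, hypothetical mapping from SLO breach severity to P1-P4."""
    if error_rate_pct >= 100.0:
        return "P1"  # service effectively unavailable
    if error_rate_pct >= 100 * SLO_ERROR_RATE_PCT:
        return "P2"  # SLO badly missed: severely impaired
    if error_rate_pct > SLO_ERROR_RATE_PCT:
        return "P3"  # SLO missed: partially impaired
    return "P4"      # within SLO: fully available

if __name__ == "__main__":
    rate = error_rate_sli(failed_requests=42, total_requests=1_000_000)
    print(f"error rate = {rate:.4f}% -> {suggest_priority(rate)}")
```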
4.4.Clear communication.
4.4.1.Communicate in concise sentences.
4.4.2.Respond quickly to requests.
4.4.2.1.A response isn't necessarily a solution.
4.4.3.Post status updates regularly.
4.4.3.1.Give a new time estimate with each update.
4.5.Keep track of your work.
4.5.1.Keep a record of what you do at work.
4.5.2.Chat is a great way to communicate, but chat logs can be difficult to read later, so make sure to summarize everything in a task ticket or document.
4.5.3.Close resolved issues so that lingering task tickets don't clutter the on-call Kanban board or skew the metrics for the systems the on-call supports.
4.5.3.1.If the requester doesn't respond, say that you will close the task ticket for lack of response within 24 hours, and then actually do it.
4.5.4.Always include timestamps in your notes.
5.1.Incident handling is the on-call's most important responsibility.
5.1.1.The first goal is to mitigate the impact of the problem and restore service.
5.1.2.The second goal is to capture information so that you can later analyze how and why the problem occurred.
5.1.3.The third goal is to determine the cause of the incident, confirm that it is the culprit, and fix the underlying problem.
5.2.Provide support.
5.2.1.Most requests are bug reports, questions about business logic, or technical questions about how to use your software.
5.2.2.Support requests follow a fairly standard process.
5.3.The 5 stages.
5.3.1.Triage
5.3.1.1.Engineers must find the problem, determine its severity, and determine who can fix it.
5.3.1.2.Identify issues and understand their impact so they can be prioritized appropriately.
5.3.1.3.Triage is not the time to prove that you can solve the problem on your own; the most valuable thing is to buy time.
5.3.1.4.Triage is also not the time for troubleshooting.
5.3.1.4.1.Leave troubleshooting for the mitigation and resolution stages.
5.3.2.Coordination
5.3.2.1.The team (and potentially users) must be notified of the issue.
5.3.2.2.Large incidents have dedicated "war rooms" to help with communication, which are virtual or physical spaces used to coordinate incident response.
5.3.2.3.All parties involved join the war room to respond in a coordinated manner.
5.3.2.4.Even if you're working alone, communicate about your work.
5.3.2.4.1.Someone might jump in later and find your log useful, and a detailed record will help reconstruct the timeline afterwards.
5.3.3.Mitigation
5.3.3.1.The engineer must stabilize things as quickly as possible.
5.3.3.2.Mitigation is not a long-term fix; you are just trying to "stop the bleeding".
5.3.3.3.The interim goal of mitigation is to reduce the impact of the problem.
5.3.3.3.1.Mitigation does not solve the problem completely; it reduces its severity.
5.3.3.4.Fixing a problem can take a long time, while a mitigation can usually be put in place quickly.
5.3.3.5.The mitigation for an incident is usually to roll back to the "last known good" software version or to divert traffic away from the problem (a minimal sketch follows this list).
5.3.3.6.Quickly jot down any deficiencies you find so you can troubleshoot with peace of mind, and create new task tickets to address them during the follow-up stage.
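As a minimal sketch of what "diverting traffic away from the problem" could look like, here is an illustrative Python example; the TrafficRouter class, the version labels, and the 5% trigger threshold are hypothetical and not from the source.

```python
# Illustrative sketch of a mitigation step: shift traffic back to the
# "last known good" version when the new version's error rate spikes.
# The TrafficRouter class, version labels, and threshold are hypothetical.

class TrafficRouter:
    """Toy in-memory traffic split between two deployed versions."""

    def __init__(self) -> None:
        self.weights = {"v42-last-known-good": 0.0, "v43-new": 1.0}

    def shift_all_traffic_to(self, version: str) -> None:
        for v in self.weights:
            self.weights[v] = 1.0 if v == version else 0.0

ERROR_RATE_MITIGATION_THRESHOLD_PCT = 5.0  # assumed trigger, not a real SLO

def mitigate_if_needed(router: TrafficRouter, new_version_error_rate_pct: float) -> bool:
    """Divert traffic away from the problem; the real fix comes later, in resolution."""
    if new_version_error_rate_pct >= ERROR_RATE_MITIGATION_THRESHOLD_PCT:
        router.shift_all_traffic_to("v42-last-known-good")
        return True
    return False

if __name__ == "__main__":
    router = TrafficRouter()
    if mitigate_if_needed(router, new_version_error_rate_pct=12.5):
        print("mitigated: traffic shifted to", router.weights)
```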
5.3.4.Resolution
5.3.4.1.After the problem is mitigated, the engineer has some time to catch their breath, think deeply, and work on the real problem.
5.3.4.2.Once mitigation is in place, the incident is no longer an emergency.
5.3.4.3.Use the scientific method to troubleshoot technical issues (a sketch follows this list).
5.3.4.4.Testing is not random guessing; form a hypothesis, then run a test that confirms or rules it out.
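A minimal Python sketch of that hypothesis-driven approach, assuming each hypothesis can be paired with a cheap check; the hypotheses and check functions below are invented for illustration.

```python
# Illustrative sketch of "scientific method" troubleshooting:
# state a hypothesis, run a check that can confirm or rule it out,
# and record the outcome. The hypotheses and checks are invented examples.

from typing import Callable, List, Tuple

Hypothesis = Tuple[str, Callable[[], bool]]

def check_disk_full() -> bool:
    # In a real investigation this would inspect the host; here it is a stub.
    return False

def check_recent_deploy() -> bool:
    # Stub standing in for "did a deploy land just before the incident started?"
    return True

HYPOTHESES: List[Hypothesis] = [
    ("Disk on the database host is full", check_disk_full),
    ("A recent deploy changed behavior", check_recent_deploy),
]

def investigate(hypotheses: List[Hypothesis]) -> None:
    """Test one hypothesis at a time and log the outcome for the timeline."""
    for description, check in hypotheses:
        confirmed = check()
        print(f"[{'confirmed' if confirmed else 'ruled out'}] {description}")

if __name__ == "__main__":
    investigate(HYPOTHESES)
```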
5.3.5.Follow-up
5.3.5.1.Investigate the root cause of the incident: why it happened.
5.3.5.2.The aim is to learn from the incident and prevent it from happening again.
5.3.5.3.Write a post-mortem document, review it, and open new tasks to prevent a recurrence.
5.3.5.3.1.Describe the incident's causes and effects: the failures, impact, detection, response, recovery, timeline, root causes, lessons learned, and required corrective actions.
5.3.5.3.2.A key part of any post-mortem document is the root cause analysis (RCA).
5.3.5.4.Root cause analysis is usually done with the 5 "whys".
5.3.5.4.1.The "5" is just a rule of thumb: most problems take about five rounds of asking "why" before the root cause emerges.
5.3.5.4.2.Root cause analysis is a popular but misleading term.
5.3.5.4.2.1.Incidents are rarely caused by a single problem.
5.3.5.4.2.2.In practice, the 5 "whys" can lead to many different contributing causes (see the hypothetical example at the end of this section).
5.3.5.4.2.3.Just record them all.
5.3.5.5.A good post-mortem also keeps problem solving separate from the review meeting.
5.3.5.6.The follow-up tasks from the post-mortem meeting must actually be completed.
5.3.5.7.Old post-mortem documents are great learning material.
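A hypothetical 5-whys chain, invented purely to illustrate how the questioning can surface more than one contributing cause:
Why did the service go down? The database ran out of connections.
Why did it run out of connections? A new feature opened one connection per request.
Why did that reach production? The change was never load-tested.
Why was it not load-tested? The load-testing environment has been broken for weeks.
Why is it still broken? No team owns it.
The chain ends in both a code fix (pool and reuse connections) and a process fix (give the load-testing environment an owner), and both belong in the post-mortem.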
6.1.Jumping into "firefighting" mode becomes a conditioned reflex.
6.2.Relying on "firefighters" is unhealthy.
6.3.Prolonged, high-stakes firefighting leads to burnout. "Firefighting" engineers can also fall behind on programming and design work because they are constantly interrupted.
6.4.The heroism of the "firefighters" can also push the work of fixing serious underlying problems down the priority list, because a firefighter is always around to patch things up.