Improve Ops Responsiveness

From time to time, New Relic monitoring raises alerts in our WCH Production environments that indicate failures in functionality. When these outages occur, the ADRT team is set up as the first responder team and performs the following for WCH as well as all other SaaS WCE offerings:

- Assess the customer impact

- Alert L2 and update the SaaS Availability dashboard

- Narrow down the root cause and research runbooks for the impacted service

- Perform any resolution steps, or alert Development
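
The four steps above can be sketched as a small checklist routine. This is only an illustration under stated assumptions, not the ADRT team's actual tooling: the `Incident` fields, the `RUNBOOKS` index, and the `authoring-ui` service name are all hypothetical, and the notification step is a log entry rather than a real call to L2 or the dashboard.

```python
from dataclasses import dataclass, field

# Hypothetical runbook index: service name -> ordered resolution steps.
RUNBOOKS = {
    "authoring-ui": ["restart the affected pods", "verify the health check"],
}


@dataclass
class Incident:
    service: str             # service named in the monitoring alert (assumed field)
    customers_affected: int  # outcome of the impact assessment (assumed field)
    log: list = field(default_factory=list)


def first_response(incident, runbooks=RUNBOOKS):
    """Walk the four first-responder steps and return the action log."""
    log = incident.log
    # 1. Assess the customer impact.
    log.append(f"impact: {incident.customers_affected} customers affected")
    # 2. Alert L2 and update the availability dashboard (recorded only, no real call).
    log.append("notify: L2 alerted, dashboard updated")
    # 3. Look up the runbook for the impacted service.
    steps = runbooks.get(incident.service)
    # 4. Apply the runbook steps, or escalate to Development when none exists.
    if steps:
        log.extend(f"runbook: {step}" for step in steps)
    else:
        log.append("escalate: no runbook found, alerting Development")
    return log
```

In this sketch the escalation branch is reached only when no runbook exists for the service, which is the case the idea aims to make rarer.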

Often the ADRT team has too little time and too little knowledge of the services to make it through these steps. As a result, Development ends up responding to the incidents and resolving the problems almost exclusively. Resolving a problem in Production takes 3-4 hours on average, even with expert developers involved.

We need to improve this time to resolution by tackling multiple areas of need:

- Improve monitoring so that squads can identify warning signs before they become incidents

- Improve documentation and training for ADRT, with regular interlocks and training sessions

- Train our Development squads on best practices and procedures during an incident, and share tips from services that have proven more resilient.

- Improve runbooks/scripts so that other online squads or ADRT can resolve incidents without a callout.

- Add Ops dashboarding to quickly pinpoint failure areas for ADRT, L2, and Development

- Improve notification to all customers (trial and paying), possibly within the shell UI, directing them to the SaaS dashboard for information.

  • Guest
  • Feb 25 2020
  • Shipped
What is your industry? Non-Industry Specific
What is the idea priority? High