We've shifted our oncall incident response over to mostly AI at this point. And it works quite well.
One of the main reasons why this works well is because we feed the models our incident playbooks and response knowledge bases.
These playbooks are very carefully written and maintained by people. The current generation of models is pretty much post-human at following them, reasoning through the situation, and suggesting mitigations.
We tried just indexing a bunch of incident Slack channels, and the results were not great. But with explicit documentation, it works well.
Kind of proves what we already know: garbage in, garbage out. But also, other functions (e.g. PM, Design) have tried automating their own workflows, and it doesn't work as well for them.
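For anyone wondering what "feeding the models our playbooks" means mechanically, here's a minimal sketch of the idea — not our actual stack; the SDK call, model name, and file layout are purely illustrative:

```python
# Illustrative only: the curated, human-maintained playbooks go into the
# context verbatim, alongside the live alert, and the model is asked to
# follow them and suggest mitigations.
from pathlib import Path
from openai import OpenAI  # assumes the standard OpenAI Python SDK

client = OpenAI()

def suggest_mitigations(alert_summary: str, playbook_dir: str = "playbooks") -> str:
    # Concatenate every playbook file; these are the carefully written docs.
    playbooks = "\n\n".join(p.read_text() for p in sorted(Path(playbook_dir).glob("*.md")))
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are an oncall assistant. Follow these playbooks step by step, "
                        "explain your reasoning, and suggest mitigations.\n\n" + playbooks},
            {"role": "user", "content": alert_summary},
        ],
    )
    return resp.choices[0].message.content
```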
I'm really curious to hear more about what kind of thing is covered in your playbooks. I've often heard and read about the value of playbooks, but I've yet to see it bear fruit in practice. My main work these past few years has been in platform engineering, so I've been involved in quite a few incidents over that time, and the only standardized action I can think of that has been relevant is comparing SLIs between application versions and rolling back to a previous version if the newer version is failing. Beyond that, it's always been some new failure mode where the resolution wouldn't have been documented because it's never happened before.
On the investigation side of things I can definitely see how an AI-driven troubleshooting process could be valuable. Lots of developers lack debugging skills, so an AI-driven process that looks at the relevant metrics and logs and can reason about what the next line of inquiry should be could definitely speed things up.
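To make that concrete, here's a rough sketch of the kind of loop I have in mind. Everything here is hypothetical: fetch_metrics and fetch_logs stand in for whatever observability API you actually have, and the model call is illustrative.

```python
# Hypothetical investigation loop: the model looks at the evidence gathered
# so far and proposes the next thing to check, until it commits to a hypothesis.
from openai import OpenAI

client = OpenAI()

def investigate(alert: str, fetch_metrics, fetch_logs, max_steps: int = 5) -> str:
    evidence = [f"Alert: {alert}"]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "You are debugging a production incident. Given the evidence "
                            "so far, reply with exactly one line: 'METRICS <service>' or "
                            "'LOGS <service>' to request more data, or "
                            "'HYPOTHESIS <your conclusion>' when you are done."},
                {"role": "user", "content": "\n".join(evidence)},
            ],
        )
        action = resp.choices[0].message.content.strip()
        if action.startswith("HYPOTHESIS"):
            return action
        kind, _, service = action.partition(" ")
        data = fetch_metrics(service) if kind == "METRICS" else fetch_logs(service)
        evidence.append(f"{action}:\n{data}")
    return "No conclusion within step budget"
```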
Playbooks that I've found value in:
- Generic application version SLI comparison. The automated version of this is automated rollbacks (Harness supports this out of the box, but you can certainly find other competitors or build your own; a rough sketch of the comparison logic is below)
- Database performance debugging
- Disaster recovery (bad db delete/update, hardware failure, region failure)
In general, playbooks are useful for either common occurrences that happen frequently (e.g. every week we need to run a script to fix something in the app) or things that happen rarely but need a plan when they do happen (e.g. disaster recovery).
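The first one (version-vs-version SLI comparison) is also the easiest to script yourself if you don't want a vendor for it. A minimal sketch of the decision logic — the Prometheus query, label names, and threshold are made up, so adjust them to whatever your SLIs actually are:

```python
# Sketch of a version-vs-version SLI comparison feeding a rollback decision.
# Endpoint, metric name, labels, and threshold are all placeholders.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # hypothetical endpoint

def error_rate(version: str) -> float:
    # Fraction of requests returning 5xx for the given app version over 10 minutes.
    query = (f'sum(rate(http_requests_total{{version="{version}",code=~"5.."}}[10m]))'
             f' / sum(rate(http_requests_total{{version="{version}"}}[10m]))')
    result = requests.get(PROM_URL, params={"query": query}).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def should_roll_back(old_version: str, new_version: str, tolerance: float = 0.01) -> bool:
    # Roll back if the new version's error rate is meaningfully worse than the old one's.
    return error_rate(new_version) > error_rate(old_version) + tolerance
```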
Expert systems redux?
Being able to provide the expertise in the form of plain written English (or another language) will at least make it much more feasible to build them up. And it can also be meaningfully consumed by a human.
If it works well for incident response, then there are many similar use cases - basically most kinds of diagnostics/troubleshooting of systems. At least the relatively bounded ones, where it is feasible to have documentation on the particular system. Say, debugging a building's HVAC system.
Why won't it hit the same limits, like the frame problem or the qualification problem?
Expert systems failed in part because of their inability to learn. HVAC is ladder logic, which I honestly haven't spent much time in, but LLMs are inductive.
It will be a useful tool, but expert systems had a very restricted solution space.
I have found it rare that an organization has incident "playbooks that are very carefully written and maintained".
If you already have those, how much can an AI add? Or conversely, not surprising that it does well when it's given a pre-digested feed of all the answers in advance.