We've shifted our oncall incident response over to mostly AI at this point. And it works quite well.
One of the main reasons why this works well is because we feed the models our incident playbooks and response knowledge bases.
These playbooks are very carefully written and maintained by people. The current generation of models is pretty much post-human at following them, reasoning through the situation, and suggesting mitigations.
We tried just indexing a bunch of incident Slack channels, and the results were not great. But with explicit documentation, it works well.
Kind of proves what we already know: garbage in, garbage out. But also, other functions (e.g. PM, Design) have tried automating their own workflows, and it doesn't work as well for them.
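For anyone wondering what "feeding the models our playbooks" means mechanically, here's a minimal sketch of the idea — not our actual stack; the SDK call, model name, and file layout are purely illustrative:

```python
# Illustrative only: the curated, human-maintained playbooks go into the
# context verbatim, alongside the live alert, and the model is asked to
# follow them and suggest mitigations.
from pathlib import Path
from openai import OpenAI  # assumes the standard OpenAI Python SDK

client = OpenAI()

def suggest_mitigations(alert_summary: str, playbook_dir: str = "playbooks") -> str:
    # Concatenate every playbook file; these are the carefully written docs.
    playbooks = "\n\n".join(p.read_text() for p in sorted(Path(playbook_dir).glob("*.md")))
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are an oncall assistant. Follow these playbooks step by step, "
                        "explain your reasoning, and suggest mitigations.\n\n" + playbooks},
            {"role": "user", "content": alert_summary},
        ],
    )
    return resp.choices[0].message.content
```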
I'm really curious to hear more about what kind of thing is covered in your playbooks. I've often heard and read about the value of playbooks, but I've yet to see it bear fruit in practice. My main work these past few years has been in platform engineering, so I've been involved in quite a few incidents over that time, and the only standardized action I can think of that has been relevant is comparing SLIs between application versions and rolling back to a previous version if the newer version is failing. Beyond that, it's always been some new failure mode where the resolution wouldn't have been documented because it's never happened before.
On the investigation side of things I can definitely see how an AI-driven troubleshooting process could be valuable. Lots of developers lack debugging skills, so an AI-driven process that looks at the relevant metrics and logs and can reason about what the next line of inquiry should be could definitely speed things up.
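To make that concrete, here's a rough sketch of the kind of loop I have in mind. Everything here is hypothetical: fetch_metrics and fetch_logs stand in for whatever observability API you actually have, and the model call is illustrative.

```python
# Hypothetical investigation loop: the model looks at the evidence gathered
# so far and proposes the next thing to check, until it commits to a hypothesis.
from openai import OpenAI

client = OpenAI()

def investigate(alert: str, fetch_metrics, fetch_logs, max_steps: int = 5) -> str:
    evidence = [f"Alert: {alert}"]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "You are debugging a production incident. Given the evidence "
                            "so far, reply with exactly one line: 'METRICS <service>' or "
                            "'LOGS <service>' to request more data, or "
                            "'HYPOTHESIS <your conclusion>' when you are done."},
                {"role": "user", "content": "\n".join(evidence)},
            ],
        )
        action = resp.choices[0].message.content.strip()
        if action.startswith("HYPOTHESIS"):
            return action
        kind, _, service = action.partition(" ")
        data = fetch_metrics(service) if kind == "METRICS" else fetch_logs(service)
        evidence.append(f"{action}:\n{data}")
    return "No conclusion within step budget"
```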
Playbooks that I've found value in:
- Generic application version SLI comparison. The automated version of this is automated rollbacks (Harness supports this out of the box, but you can certainly find other competitors or build your own; a rough sketch of the comparison logic is below)
- Database performance debugging
- Disaster recovery (bad db delete/update, hardware failure, region failure)
In general, playbooks are useful for either common occurrences that happen frequently (e.g. every week we need to run a script to fix something in the app) or things that happen rarely but need a plan when they do happen (e.g. disaster recovery).
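The first one (version-vs-version SLI comparison) is also the easiest to script yourself if you don't want a vendor for it. A minimal sketch of the decision logic — the Prometheus query, label names, and threshold are made up, so adjust them to whatever your SLIs actually are:

```python
# Sketch of a version-vs-version SLI comparison feeding a rollback decision.
# Endpoint, metric name, labels, and threshold are all placeholders.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # hypothetical endpoint

def error_rate(version: str) -> float:
    # Fraction of requests returning 5xx for the given app version over 10 minutes.
    query = (f'sum(rate(http_requests_total{{version="{version}",code=~"5.."}}[10m]))'
             f' / sum(rate(http_requests_total{{version="{version}"}}[10m]))')
    result = requests.get(PROM_URL, params={"query": query}).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def should_roll_back(old_version: str, new_version: str, tolerance: float = 0.01) -> bool:
    # Roll back if the new version's error rate is meaningfully worse than the old one's.
    return error_rate(new_version) > error_rate(old_version) + tolerance
```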
Expert systems redux?
Being able to provide the expertise in the form of plain written English (or another language) will at least make it much more feasible to build them up. And it can also be meaningfully consumed by a human.
If it works well for incident response, then there are many similar use cases - basically most kinds of diagnostics/troubleshooting of systems. At least the relatively bounded ones, where it is feasible to have documentation on the particular system. Say, debugging a building's HVAC system.
Why won't it hit the same limits, like the frame problem or the qualification problem?
Expert systems failed in part because of their inability to learn. HVAC is ladder logic, which I honestly haven't spent much time in, but LLMs are inductive.
It will be a useful tool, but expert systems had a very restricted solution space.
I have found it rare that an organization has incident "playbooks that are very carefully written and maintained".
If you already have those, how much can an AI add? Or conversely, not surprising that it does well when it's given a pre-digested feed of all the answers in advance.