Hacker News

We've shifted our on-call incident response over to mostly AI at this point. And it works quite well.

One of the main reasons this works well is that we feed the models our incident playbooks and response knowledge bases.

These playbooks are very carefully written and maintained by people. The current generation of models is pretty much post-human at following them, reasoning through the incident, and suggesting mitigations.

We tried just indexing a bunch of incident Slack channels, and the results were not great. But with explicit documentation, it works well.

Kind of proves what we already know: garbage in, garbage out. Other functions (e.g. PM, Design) have also tried automating their own workflows, but it doesn't work as well for them.



I'm really curious to hear more about what kind of thing is covered in your playbooks. I've often heard and read about the value of playbooks, but I've yet to see it bear fruit in practice. My main work these past few years has been in platform engineering, so I've been involved in quite a few incidents, and the only standardized action I can think of that has been relevant is comparing SLIs between application versions and rolling back to the previous version if the newer one is failing. Beyond that, it's always been some new failure mode whose resolution wouldn't have been documented because it had never happened before.

On the investigation side of things, I can definitely see how an AI-driven troubleshooting process could be valuable. Lots of developers lack debugging skills, so an AI-driven process that looks at the relevant metrics and logs and can reason about what the next line of inquiry should be could definitely speed things up.


Playbooks that I've found value in:
- Generic application version SLI comparison. The automated version of this is automated rollbacks (Harness supports this out of the box, but you can certainly find other competitors or build your own)
- Database performance debugging
- Disaster recovery (bad db delete/update, hardware failure, region failure)

In general, playbooks are useful either for things that happen frequently (e.g. every week we need to run a script to fix something in the app) or for things that happen rarely but need a plan when they do (e.g. disaster recovery).
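
For anyone who hasn't seen the SLI comparison playbook written down, here is a rough Python sketch of the rollback decision. The `SliSnapshot` type, its field names, and every threshold are assumptions made up for illustration; this is not how Harness or any other tool implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliSnapshot:
    """Hypothetical SLI summary for one application version over a time window."""
    version: str
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when a version has served no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(
    baseline: SliSnapshot,
    candidate: SliSnapshot,
    max_error_rate_increase: float = 0.01,  # assumed: 1 percentage point
    max_latency_ratio: float = 1.5,         # assumed: p99 may be at most 50% slower
    min_requests: int = 100,                # assumed: minimum traffic before judging
) -> bool:
    """Return True if the candidate version looks worse than the baseline."""
    if candidate.requests < min_requests:
        # Too little traffic to compare; keep watching instead of rolling back.
        return False
    error_regressed = (
        candidate.error_rate - baseline.error_rate > max_error_rate_increase
    )
    latency_regressed = (
        candidate.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
    )
    return error_regressed or latency_regressed
```

In practice the snapshots would come from whatever metrics store you already use, and the thresholds would be tuned per service rather than hard-coded.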


Expert systems redux? Being able to provide the expertise in the form of plain written English (or another language) will at least make it much more feasible to build them up. And it can also meaningfully be consumed by a human.

If it works well for incident response, then there are many similar use cases: basically most kinds of diagnostics/troubleshooting of systems. At least the relatively bounded ones, where it is feasible to have documentation on the particular system. Say, debugging a building HVAC system.


Why won't it hit the same limits of the frame problem or the qualification problem?

Expert systems failed in part because of their inability to learn. HVAC is ladder logic, which I honestly haven't spent much time in, whereas LLMs are inductive.

It will be a useful tool, but expert systems had a very restricted solution space.


I have found it rare that an organization has incident "playbooks that are very carefully written and maintained".

If you already have those, how much can an AI add? Or conversely, it's not surprising that it does well when it's given a pre-digested feed of all the answers in advance.


Meanwhile, we’ve tried AI products just for assigning incidents and are forced to turn them off because of how shitty of a job they do.


That's great to hear. What is your current toolchain for this effort? Do you have a structure for playbooks and KBs you would recommend?


Curious if you explored any external tools before building in-house? Looking to do something similar at my company


What does AI add to your playbooks?


I'm guessing the part about being awake and fresh at 3am within a few seconds of the incident occurring.


I can execute a playbook without AI at 3am in a few seconds using some orchestration tools. Without any AI.


Are you happy about waking up to do so?


If you get compensation for being on-call then why not? Unless it’s the eve of a holiday.


Been automatically executing playbooks (Ansible) since before you were born. I sleep fine.

This is standard SRE/Ops practice. The monitoring system detects failures and automatically runs remediation.

You didn’t read the part where I said “using some orchestration tools”.
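
To make the pattern concrete, here is a minimal Python sketch of an alert handler that maps known alerts to Ansible playbooks and pages a human for anything else. The alert names, playbook paths, and return strings are hypothetical; the only real interface assumed is the `ansible-playbook` CLI with its `--limit` option, and the sketch defaults to a dry run so nothing is executed unless you ask for it.

```python
import subprocess
from typing import Optional

# Hypothetical mapping from alert name to remediation playbook path.
REMEDIATIONS: dict[str, str] = {
    "disk_full": "playbooks/clean_tmp.yml",
    "service_down": "playbooks/restart_service.yml",
}


def build_command(alert_name: str, host: str) -> Optional[list[str]]:
    """Return the ansible-playbook command for a known alert, else None."""
    playbook = REMEDIATIONS.get(alert_name)
    if playbook is None:
        return None
    # --limit restricts the run to the host that raised the alert.
    return ["ansible-playbook", playbook, "--limit", host]


def handle_alert(alert_name: str, host: str, dry_run: bool = True) -> str:
    """Run the matching remediation, or fall back to paging a human."""
    command = build_command(alert_name, host)
    if command is None:
        # No documented remediation: this is where a person gets woken up.
        return "page_human"
    if dry_run:
        return "would_run: " + " ".join(command)
    result = subprocess.run(command, check=False)
    return "remediated" if result.returncode == 0 else "page_human"
```

The fallback branch is the interesting one for this thread: scripted remediation covers the failures you have already seen, and everything else still ends up with a person (or a model) doing the investigation.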


> Been automatically executing playbooks (Ansible) since before you were born.

This made me look up how old Ansible was.

> Initial release: February 20, 2012; 12 years ago

https://en.m.wikipedia.org/wiki/Ansible_(software)



