
RL is not about delayed reward. Multi-armed bandit problems have no credit assignment component, but are often the first RL problem taught.
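To make the bandit point concrete, here is a minimal sketch of an epsilon-greedy agent on a Bernoulli bandit. The function name `run_bandit`, the arm probabilities, and the hyperparameters are all made up for illustration. Note that every reward arrives immediately after the pull that caused it, so there is nothing to assign credit across, yet the agent still has to learn which action to prefer.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Bernoulli bandit (hypothetical setup for illustration)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # The reward is observed right away: no delay, no credit assignment.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(counts)), key=lambda a: counts[a])
print(best_arm, counts)
```

There is no state here at all, so the "policy" collapses to a preference over arms, which is the simplest case of the state -> action mapping.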

In its most general form, RL is about learning a policy (a state -> action mapping), which often requires inferring values, among other things.

But copying a strong reference policy ... is still learning a policy, whether by SFT or not.
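As a rough sketch of what "copying a reference policy" looks like in the simplest tabular case: given (state, action) pairs from a demonstrator, pick the most frequent action per state. The helper `clone_policy` and the toy states and actions are hypothetical. This is plain supervised imitation with no reward signal, and the output is still a state -> action mapping.

```python
from collections import Counter, defaultdict

def clone_policy(demonstrations):
    """Tabular imitation: map each state to the demonstrator's most frequent action."""
    votes = defaultdict(Counter)
    for state, action in demonstrations:
        votes[state][action] += 1
    # For each state, keep the action the reference policy chose most often.
    return {state: c.most_common(1)[0][0] for state, c in votes.items()}

# Hypothetical demonstrations from a reference policy.
demos = [
    ("low", "charge"), ("low", "charge"), ("low", "search"),
    ("high", "search"), ("high", "search"),
]
policy = clone_policy(demos)
print(policy)
```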


