In its most general form, RL is about learning a policy (a state -> action mapping), which often requires inferring values, etc.
But copying a strong reference policy ... is still learning a policy, whether by SFT or not.
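
As a rough illustration of the point above, here is a minimal Python sketch of copying a reference policy with plain supervised fitting. Everything in it is an assumption made for illustration: the toy state and action spaces, the `reference_policy` rule, and the helper names. It only shows that fitting to demonstrations yields a state -> action mapping without using any reward or value estimate.

```python
from collections import Counter, defaultdict

# Hypothetical toy problem: 5 discrete states, 3 discrete actions.
N_STATES, N_ACTIONS = 5, 3


def reference_policy(state):
    """Stand-in for a 'strong' reference policy (an arbitrary rule, assumed here)."""
    return (2 * state + 1) % N_ACTIONS


def collect_demonstrations(n):
    """Visit states in a fixed cycle and record the reference action in each."""
    return [(i % N_STATES, reference_policy(i % N_STATES)) for i in range(n)]


def clone_policy(demos):
    """Fit a tabular policy by picking the most frequent demonstrated action
    per state (the maximum-likelihood choice). No reward signal is involved."""
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


demos = collect_demonstrations(200)
learned_policy = clone_policy(demos)  # dict: state -> action
```

The result, `learned_policy`, is an ordinary state -> action table. Whether the fitting step is called SFT or behavior cloning, what comes out is still a policy.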