Well, if you look at the plot in the "Learning progress" section, you'll see that training took almost two orders of magnitude longer because of the randomizations they were adding to the simulation. Without them, the policy takes a lot less time to train but also isn't very robust.
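
For anyone wondering what "randomizations" means in practice, here is a rough sketch of the general idea: resample the simulator's physical parameters at the start of every episode, so the policy can't overfit to one exact set of physics. The parameter names, nominal values, and the 30% range below are all made up for illustration and are not the ones used in the work being discussed.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    # Hypothetical parameters; a real setup would randomize many more.
    friction: float
    object_mass: float
    actuator_gain: float


# Assumed nominal values, chosen only for this example.
NOMINAL = PhysicsParams(friction=1.0, object_mass=0.05, actuator_gain=1.0)


def sample_params(rng, scale=0.3, nominal=NOMINAL):
    """Scale each nominal value by a factor drawn uniformly from [1 - scale, 1 + scale]."""
    def jitter(value):
        return value * rng.uniform(1.0 - scale, 1.0 + scale)

    return PhysicsParams(
        friction=jitter(nominal.friction),
        object_mass=jitter(nominal.object_mass),
        actuator_gain=jitter(nominal.actuator_gain),
    )


if __name__ == "__main__":
    rng = random.Random(0)
    for episode in range(3):
        params = sample_params(rng)
        # A real training loop would reset the simulator with `params` here
        # and then collect a rollout with the current policy.
        print(episode, params)
```

Setting `scale=0.0` gives you the fixed-physics case, which is the fast-to-train but brittle setup. With a wider range the policy has to cope with many more variations of the environment, which fits with the longer training time shown in the plot.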