Baseline Fails: The Pathetic State of RL Reproducibility

The state of RL benchmarks is pathetic. The minimal thing you would expect is a definitive way to reproduce results. OpenAI Baselines, ShangtongZhang/DeepRL, and rlpyt are all either incomplete or ship with the wrong hyperparameters for reproducing the published results. The OpenAI DQN baseline has in fact been broken for months.

RLlib and stable-baselines are the best of the bunch: both are well documented and explicitly supply the hyperparameter values needed to reproduce results.
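As a concrete illustration of what "explicit supply of hyperparams" looks like, the stable-baselines ecosystem ships per-environment config files in roughly this shape. The values below are hypothetical placeholders, not the actual tuned numbers from any zoo:

```yaml
# Sketch of a per-environment hyperparameter file (illustrative values only).
# Every knob is pinned, so a run can be reproduced from the file alone.
CartPole-v1:
  policy: 'MlpPolicy'
  n_timesteps: 50000        # total env steps for the run
  learning_rate: 0.001      # hypothetical value, not a tuned one
  buffer_size: 10000
  batch_size: 64
  gamma: 0.99
  exploration_fraction: 0.1
  seed: 0                   # fixed seed, the other half of reproducibility
```

This is the bar the broken baselines fail to clear: without a pinned, versioned file like this, "reproduce the paper" degenerates into guessing.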