☁️ How We Conduct Large-scale DRL Experiments
Running thousands of experiments locally is not feasible. In this blog post, we show a workflow that allows us to run experiments at scale: we use AWS Batch to launch experiments and track and manage them with Weights and Biases.
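As a rough illustration of the idea (not the exact scripts from the post), here is a minimal sketch that submits one training run to AWS Batch with boto3; the job queue, job definition, and command are placeholder assumptions, and the Weights and Biases logging is indicated in comments.

```python
import boto3

# Submit one experiment as an AWS Batch job.
# The job queue and job definition names below are hypothetical.
batch = boto3.client("batch")
batch.submit_job(
    jobName="ppo-atari-seed-1",
    jobQueue="drl-experiments",          # assumed job queue
    jobDefinition="drl-training:1",      # assumed job definition
    containerOverrides={
        "command": ["python", "train.py", "--seed", "1", "--track"],
    },
)

# Inside train.py, each run would report metrics to Weights and Biases, e.g.:
#   wandb.init(project="drl-experiments", config={"seed": 1})
#   wandb.log({"episodic_return": episodic_return, "global_step": global_step})
```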
🔮 A Closer Look at Invalid Action Masking in Policy Gradient Algorithms
Invalid action masking is a technique employed most prominently in AlphaStar and OpenAI Five to avoid executing invalid actions. In our paper, we find that the standard working mechanism of invalid action masking corresponds to valid policy gradient updates and, more interestingly, that it works by applying a state-dependent differentiable function during the calculation of the action probability distribution. Furthermore, our investigation finds invalid action masking to be empirically significant to the performance of policy gradient algorithms.
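To make the "state-dependent differentiable function" concrete, here is a minimal sketch (assuming PyTorch and a small discrete action space; the logits and mask are made up) of the common way masking is implemented: invalid-action logits are replaced with a large negative number before the softmax, which drives both the probabilities and the gradients of invalid actions toward zero.

```python
import torch
from torch.distributions import Categorical

# Hypothetical logits for 4 discrete actions; actions 1 and 3 are invalid in this state.
logits = torch.tensor([1.0, 0.5, -0.3, 2.0])
invalid_mask = torch.tensor([False, True, False, True])

# Replace invalid-action logits with a large negative value so the softmax
# assigns them (near-)zero probability -- a state-dependent, differentiable
# transformation applied before the action distribution is formed.
masked_logits = torch.where(invalid_mask, torch.tensor(-1e8), logits)
dist = Categorical(logits=masked_logits)
action = dist.sample()  # only valid actions can be sampled
```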
🎯 The 32 Implementation Details of Proximal Policy Optimization (PPO) Algorithm
Implementation matters for Proximal Policy Optimization. In this post, I compile a list of 32 implementation details that help reproduce the reported results on Atari and MuJoCo.
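As a taste of the kind of detail covered, here is a minimal sketch (PyTorch assumed; variable names are illustrative, not the post's code) of two commonly cited ones: per-minibatch advantage normalization and the clipped surrogate objective.

```python
import torch

def ppo_policy_loss(new_logprob, old_logprob, adv, clip_coef=0.2):
    # Detail: normalize advantages at the minibatch level.
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
    ratio = (new_logprob - old_logprob).exp()
    # Clipped surrogate objective (maximum of the two negated surrogate terms).
    loss_unclipped = -adv * ratio
    loss_clipped = -adv * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    return torch.max(loss_unclipped, loss_clipped).mean()
```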
👾 Understanding why there isn't a log probability in TRPO and PPO's objective
Recently I have been reading quite a lot about off-policy policy gradients, importance sampling, etc. When I was reading about Trust Region Policy Optimization (TRPO), I couldn't help but notice that TRPO's objective doesn't have the log probability normally present in policy gradient methods such as A2C. In this post, we explore the connection between them.
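In brief (a sketch of the standard argument, not the post's full derivation): the surrogate objective uses an importance-sampling ratio, and differentiating it at $\theta = \theta_{\mathrm{old}}$ recovers the familiar log-probability form, since $\nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta$.

```latex
\nabla_\theta \, \mathbb{E}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A(s,a)\right]\Big|_{\theta=\theta_{\mathrm{old}}}
= \mathbb{E}\!\left[\frac{\nabla_\theta \pi_\theta(a \mid s)\big|_{\theta=\theta_{\mathrm{old}}}}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A(s,a)\right]
= \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\big|_{\theta=\theta_{\mathrm{old}}}\, A(s,a)\right]
```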