Costa Huang's Website

☁️ How We Conduct Large-scale DRL Experiments

Running thousands of experiments locally is not viable. In this blog post, we show a workflow that lets us run experiments at scale: we launch experiments with AWS Batch and manage them with Weights and Biases.

Posted on Tue, Aug 25, 2020

🔮 A Closer Look at Invalid Action Masking in Policy Gradient Algorithms

Invalid action masking is a technique employed most prominently in AlphaStar and OpenAI Five to avoid executing invalid actions. In our paper, we find that the standard working mechanism of invalid action masking corresponds to valid policy gradient updates and, more interestingly, that it works by applying a state-dependent differentiable function during the calculation of the action probability distribution. Furthermore, our investigation finds invalid action masking to be empirically significant to the performance of policy gradient algorithms.

Posted on Wed, Jul 1, 2020

🎯 The 32 Implementation Details of Proximal Policy Optimization (PPO) Algorithm

The implementation of Proximal Policy Optimization matters. In this post, I compile a list of 26 implementation details that help to reproduce the reported results on Atari and MuJoCo.

Posted on Wed, Jun 10, 2020

👾 Understanding why there isn't a log probability in TRPO and PPO's objective

Recently I have been reading quite a lot about off-policy policy gradients, importance sampling, etc. When I was reading about Trust Region Policy Optimization (TRPO), I couldn't help but notice that TRPO's objective doesn't have the log probability normally present in policy gradient methods such as A2C. In this post, we explore the connection between them.

Posted on Sat, Aug 17, 2019