NTU Weekly Progress Report 20200217
Byron, 17 February 2020
Completion (2020/02/17 - 2020/03/08)
- Course assignments
- Deep Learning with NLP final project proposal and presentation (the voice-over was generated with the Google Text-to-Speech API)
- Literature review of multi-agent systems
- Read through the book Deep Reinforcement Learning Hands-On - Second Edition
- OpenAI Gym
- The Cross-Entropy Method
- Tabular Learning and the Bellman Equation
- Deep Q-Networks
- Higher-Level RL Libraries
- DQN Extensions
- Ways to Speed up RL
- Policy Gradients – an Alternative
- The Actor-Critic Method
- Asynchronous Advantage Actor-Critic
- Implemented the basic DQN on an Atari game (Pong); a sketch is included after this list
- Other material read:
- A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents
- Variational Autoencoders for Opponent Modeling in Multi-Agent Systems
- REINFORCEMENT LEARNING (DQN) TUTORIAL
- Deep Reinforcement Learning Hands-On
- CoQA: A Conversational Question Answering Challenge
- Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents
- VAE PyTorch implementation
- A GAN study guide: from the principles to building a generative demo
- Variational autoencoder (VAE): so that is what it is all about
- Explaining AutoEncoder and VAE in various ways
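
A minimal sketch of the basic DQN mentioned above: the network is the standard Nature-DQN convolutional architecture, and the hyperparameters, the replay-buffer size, and the assumption of 4x84x84 stacked-frame observations from the usual Atari wrappers are illustrative choices, not the exact values of my run.

```python
# Minimal DQN sketch (assumed: standard Atari wrappers giving 4x84x84 frame stacks;
# hyperparameters and the env id "PongNoFrameskip-v4" are illustrative choices).
import random
from collections import deque

import torch
import torch.nn as nn


class DQN(nn.Module):
    """Nature-DQN style convolutional Q-network: observation -> one Q-value per action."""

    def __init__(self, in_channels: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 feature map for 84x84 input
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.net(x)


def select_action(net, state, epsilon, n_actions):
    """Epsilon-greedy action selection for a single observation."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q = net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q.argmax(dim=1).item())


def dqn_loss(batch, net, target_net, gamma=0.99):
    """1-step TD loss: Q(s, a) vs r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, dones, next_states = batch  # tensors; dones is a bool mask
    q_sa = net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        next_q[dones] = 0.0                 # no bootstrapping past the end of an episode
        target = rewards + gamma * next_q
    return nn.functional.smooth_l1_loss(q_sa, target)


replay_buffer = deque(maxlen=100_000)       # stores (s, a, r, done, s') transitions
```

The training loop then alternates environment steps (pushing transitions into the buffer) with gradient steps on sampled mini-batches, and periodically copies the online network's weights into the target network.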
Ideas
- Nonstationarity means that there is some hidden factor that influences our system dynamics, and this factor is not included in observations.
- Concrete episodes that we observe are randomly sampled from the distribution of the model, so they can differ from episode to episode. However, the probability of any concrete transition being sampled remains the same.
- When training the DQN on Atari, could we cut down the unnecessary actions the agent is allowed to use, or manually penalize the reward for every action taken? (See the wrapper sketch after this list.)
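
A hedged sketch of the idea above, written as standard Gym wrappers: one exposes only a subset of the discrete Atari actions, the other subtracts a small per-step penalty. The kept action indices and the 0.01 penalty are arbitrary assumptions for illustration.

```python
# Sketch of "fewer actions + per-step penalty" as Gym wrappers.
import gym


class ReducedActionWrapper(gym.ActionWrapper):
    """Expose only a subset of the env's discrete actions to the agent."""

    def __init__(self, env, allowed_actions):
        super().__init__(env)
        self.allowed_actions = list(allowed_actions)
        self.action_space = gym.spaces.Discrete(len(self.allowed_actions))

    def action(self, act):
        # Map the agent's small action index back to the full Atari action id.
        return self.allowed_actions[act]


class StepPenaltyWrapper(gym.RewardWrapper):
    """Subtract a small fixed cost from every step's reward."""

    def __init__(self, env, penalty=0.01):
        super().__init__(env)
        self.penalty = penalty

    def reward(self, rew):
        return rew - self.penalty


# Hypothetical usage: keep only NOOP/UP/DOWN for Pong and penalize each step.
# env = StepPenaltyWrapper(ReducedActionWrapper(gym.make("PongNoFrameskip-v4"), [0, 2, 3]))
```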
Questions
Basic
- Q: What is the shape of a data batch in PyTorch?
- A: The first dimension is the batch dimension. For example, for an input batch of shape (3, 10), the batch size is 3 and the input dimension is 10 (see the batch-shape snippet after this list).
- Q: A DQN model (actually, any RL model with randomness) relies on random sampling from the replay buffer and an epsilon-greedy strategy; is it possible to exactly reproduce a trained model later, e.g. through seeds? (See the seeding sketch after this list.)
- Q: For the DQN on Atari Pong, the agent tends to beat the opponent in exactly the same way every time (a local optimum?)
- Q: Why does the book say that the max operation in the Bellman equation can lead to suboptimal policies? (See the DQN/Double-DQN target sketch after this list.)
- Q: How should one understand the policy gradient?
- Q: What is the difference between on-policy (A2C) and off-policy (DQN) methods?
- A: On-policy methods update using actions chosen by the current policy, while off-policy methods do not necessarily use the current policy's actions (see the policy-gradient loss sketch after this list).
- Q: What are the typical higher-level RL libraries?
- A: For research, most of the time we need to build from scratch, because we face conditions that existing libraries have not covered before.
- Q: Are there any more GPUs I can use?
- A: Pending
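
A tiny snippet for the batch-shape question; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 4)      # expects 10 input features per sample
batch = torch.randn(3, 10)    # batch size 3, input dim 10
out = layer(batch)
print(out.shape)              # torch.Size([3, 4]) -- the batch dimension stays first
```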
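For the reproducibility question, the usual seeding calls are sketched below; the seed value and environment id are assumptions, and even with all of them set, CUDA nondeterminism can still make two runs differ slightly.

```python
# Seeding sketch: covers Python, NumPy, PyTorch, and the gym environment.
import random

import gym
import numpy as np
import torch

SEED = 42                        # arbitrary choice

random.seed(SEED)                # epsilon-greedy and replay-buffer sampling
np.random.seed(SEED)
torch.manual_seed(SEED)          # network initialization and torch sampling
torch.cuda.manual_seed_all(SEED)

env = gym.make("PongNoFrameskip-v4")
env.seed(SEED)                   # environment randomness (gym API as of 2020)
env.action_space.seed(SEED)      # sampling from the action space
```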
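For the max-operation question: taking the max over noisy Q-value estimates both selects and evaluates the action with the same estimates, so the estimation noise turns into an upward bias (overestimation), which can steer learning toward suboptimal policies. The sketch below contrasts the plain DQN target with the Double DQN target that decouples selection from evaluation; tensor shapes are assumed as noted in the comments.

```python
import torch


def dqn_target(rewards, dones, next_states, target_net, gamma=0.99):
    """Plain DQN target: r + gamma * max_a' Q_target(s', a')."""
    with torch.no_grad():
        # The same noisy estimates both pick and score the action,
        # so estimation noise becomes an upward bias in the target.
        next_q = target_net(next_states).max(dim=1).values   # (batch,)
        next_q[dones] = 0.0
        return rewards + gamma * next_q


def double_dqn_target(rewards, dones, next_states, net, target_net, gamma=0.99):
    """Double DQN target: the online net selects a', the target net evaluates it."""
    with torch.no_grad():
        next_actions = net(next_states).argmax(dim=1, keepdim=True)        # (batch, 1)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        next_q[dones] = 0.0
        return rewards + gamma * next_q
```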
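For the policy-gradient and on-/off-policy questions, a minimal REINFORCE-style loss: it increases the log-probability of actions that led to high returns. Because the log-probabilities must be taken under the policy that actually produced the samples, the update is on-policy, unlike DQN, which can learn from old replay-buffer transitions.

```python
import torch
import torch.nn.functional as F


def policy_gradient_loss(logits, actions, returns):
    """REINFORCE-style loss: -mean(return * log pi(a | s)).

    logits:  (batch, n_actions) raw outputs of the policy network
    actions: (batch,) actions the *current* policy actually took (hence on-policy)
    returns: (batch,) discounted returns (or advantages) observed for those actions
    """
    log_probs = F.log_softmax(logits, dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(returns * chosen).mean()
```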
Project related
- Q: If one feasible method is to map the sequential decisions of an opponent policy to a low-dimensional representation, could we try dimensionality-reduction methods other than the VAE? Unsupervised learning? (See the trajectory-VAE sketch after this list.)
- A: The VAE itself is an unsupervised learning method
- Q: If one paper is an improvement on another paper, how do we make sure the replication is accurate, especially when we need to compare performance?
- A: Some papers have open-source code, and some are easy to reimplement
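
A hedged sketch of the low-dimensional opponent-representation idea: a small VAE over a flattened, fixed-length window of opponent observation-action features. The window length, hidden size, and latent dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TrajectoryVAE(nn.Module):
    """Encode a flattened window of opponent (obs, action) features into a latent z."""

    def __init__(self, traj_dim: int, latent_dim: int = 8, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(traj_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, traj_dim)
        )

    def forward(self, traj):
        h = self.encoder(traj)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterization trick
        return self.decoder(z), mu, log_var


def vae_loss(recon, traj, mu, log_var):
    """Reconstruction term + KL divergence to the standard normal prior."""
    recon_loss = nn.functional.mse_loss(recon, traj, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl
```

Other unsupervised reductions (PCA, a plain autoencoder) could be tried by swapping out the encoder/decoder and dropping the KL term; the KL term is what gives the VAE a smoother, regularized latent space compared to a plain autoencoder.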
Next Step
- Finish the course project for AI Introduction (RL related)
- Finish the course project for the multi-agent systems course
- Research about the Gym-soccer environment