NTU Weekly Progress Report 20200217
Byron, 17 February 2020
Completion (2020/02/17 - 2020/03/08)
- Course assignments
- Deep Learning with NLP final project proposal and presentation (the voice-over was generated with the Google Text-to-Speech API)
- Literature review of multi-agent systems
- Read through the book Deep Reinforcement Learning Hands-On - Second Edition
- OpenAI Gym
- The Cross-Entropy Method
- Tabular Learning and the Bellman Equation
- Deep Q-Networks
- Higher-Level RL Libraries
- DQN Extensions
- Ways to Speed up RL
- Policy Gradients – an Alternative
- The Actor-Critic Method
- Asynchronous Advantage Actor-Critic
- Implemented the basic DQN on an Atari game (Pong); a sketch is included after this list
- Other material read:
- A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents
- Variational Autoencoders for Opponent Modeling in Multi-Agent Systems
- REINFORCEMENT LEARNING (DQN) TUTORIAL
- Deep Reinforcement Learning Hands-On
- CoQA: A Conversational Question Answering Challenge
- Winning Isn’t Everything: Enhancing Game Development with Intelligent Agents
- VAE PyTorch implementation
- A GAN study guide: from the principles to building a generative demo
- Variational autoencoder (VAE): so that is what it is all about
- Explaining AutoEncoder and VAE in various ways
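
A minimal sketch of the basic DQN mentioned above: the network is the standard Nature-DQN convolutional architecture, and the hyperparameters, the replay-buffer size, and the assumption of 4x84x84 stacked-frame observations from the usual Atari wrappers are illustrative choices, not the exact values of my run.

```python
# Minimal DQN sketch (assumed: standard Atari wrappers giving 4x84x84 frame stacks;
# hyperparameters and the env id "PongNoFrameskip-v4" are illustrative choices).
import random
from collections import deque

import torch
import torch.nn as nn


class DQN(nn.Module):
    """Nature-DQN style convolutional Q-network: observation -> one Q-value per action."""

    def __init__(self, in_channels: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 feature map for 84x84 input
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.net(x)


def select_action(net, state, epsilon, n_actions):
    """Epsilon-greedy action selection for a single observation."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q = net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q.argmax(dim=1).item())


def dqn_loss(batch, net, target_net, gamma=0.99):
    """1-step TD loss: Q(s, a) vs r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, dones, next_states = batch  # tensors; dones is a bool mask
    q_sa = net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        next_q[dones] = 0.0                 # no bootstrapping past the end of an episode
        target = rewards + gamma * next_q
    return nn.functional.smooth_l1_loss(q_sa, target)


replay_buffer = deque(maxlen=100_000)       # stores (s, a, r, done, s') transitions
```

The training loop then alternates environment steps (pushing transitions into the buffer) with gradient steps on sampled mini-batches, and periodically copies the online network's weights into the target network.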
Ideas
- Nonstationarity means that there is some hidden factor that influences our system dynamics, and this factor is not included in observations.
- Concrete episodes that we observe are randomly sampled from the distribution of the model, so they can differ from episode to episode. However, the probability of any concrete transition being sampled remains the same.
- When training the DQN on Atari, could we cut down the unnecessary actions the agent is allowed to use, or manually penalize the reward for every action taken? (See the wrapper sketch after this list.)
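
A hedged sketch of the idea above, written as standard Gym wrappers: one exposes only a subset of the discrete Atari actions, the other subtracts a small per-step penalty. The kept action indices and the 0.01 penalty are arbitrary assumptions for illustration.

```python
# Sketch of "fewer actions + per-step penalty" as Gym wrappers.
import gym


class ReducedActionWrapper(gym.ActionWrapper):
    """Expose only a subset of the env's discrete actions to the agent."""

    def __init__(self, env, allowed_actions):
        super().__init__(env)
        self.allowed_actions = list(allowed_actions)
        self.action_space = gym.spaces.Discrete(len(self.allowed_actions))

    def action(self, act):
        # Map the agent's small action index back to the full Atari action id.
        return self.allowed_actions[act]


class StepPenaltyWrapper(gym.RewardWrapper):
    """Subtract a small fixed cost from every step's reward."""

    def __init__(self, env, penalty=0.01):
        super().__init__(env)
        self.penalty = penalty

    def reward(self, rew):
        return rew - self.penalty


# Hypothetical usage: keep only NOOP/UP/DOWN for Pong and penalize each step.
# env = StepPenaltyWrapper(ReducedActionWrapper(gym.make("PongNoFrameskip-v4"), [0, 2, 3]))
```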
Questions
Basic
- Q: What is the shape of a data batch in PyTorch?
- A: The first dimension is the batch dimension. For example, for an input batch of shape (3, 10), the batch size is 3 and the input dimension is 10 (see the batch-shape snippet after this list).
- Q: A DQN model (actually, any RL model with randomness) relies on random sampling from the replay buffer and an epsilon-greedy strategy; is it possible to exactly reproduce a trained model later, e.g. through seeds? (See the seeding sketch after this list.)
- Q: For the DQN on Atari Pong, the agent tends to beat the opponent in exactly the same way every time (a local optimum?)
- Q: Why does the book say that the max operation in the Bellman equation can lead to suboptimal policies? (See the DQN/Double-DQN target sketch after this list.)
- Q: How should one understand the policy gradient?
- Q: What is the difference between on-policy (A2C) and off-policy (DQN) methods?
- A: On-policy methods update using actions chosen by the current policy, while off-policy methods do not necessarily use the current policy's actions (see the policy-gradient loss sketch after this list).
- Q: What are the typical higher-level RL libraries?
- A: For research, most of the time we need to build from scratch, because we face conditions that existing libraries have not covered before.
- Q: Are there any more GPUs I can use?
- A: Pending
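
A tiny snippet for the batch-shape question; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 4)      # expects 10 input features per sample
batch = torch.randn(3, 10)    # batch size 3, input dim 10
out = layer(batch)
print(out.shape)              # torch.Size([3, 4]) -- the batch dimension stays first
```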
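For the reproducibility question, the usual seeding calls are sketched below; the seed value and environment id are assumptions, and even with all of them set, CUDA nondeterminism can still make two runs differ slightly.

```python
# Seeding sketch: covers Python, NumPy, PyTorch, and the gym environment.
import random

import gym
import numpy as np
import torch

SEED = 42                        # arbitrary choice

random.seed(SEED)                # epsilon-greedy and replay-buffer sampling
np.random.seed(SEED)
torch.manual_seed(SEED)          # network initialization and torch sampling
torch.cuda.manual_seed_all(SEED)

env = gym.make("PongNoFrameskip-v4")
env.seed(SEED)                   # environment randomness (gym API as of 2020)
env.action_space.seed(SEED)      # sampling from the action space
```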
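For the max-operation question: taking the max over noisy Q-value estimates both selects and evaluates the action with the same estimates, so the estimation noise turns into an upward bias (overestimation), which can steer learning toward suboptimal policies. The sketch below contrasts the plain DQN target with the Double DQN target that decouples selection from evaluation; tensor shapes are assumed as noted in the comments.

```python
import torch


def dqn_target(rewards, dones, next_states, target_net, gamma=0.99):
    """Plain DQN target: r + gamma * max_a' Q_target(s', a')."""
    with torch.no_grad():
        # The same noisy estimates both pick and score the action,
        # so estimation noise becomes an upward bias in the target.
        next_q = target_net(next_states).max(dim=1).values   # (batch,)
        next_q[dones] = 0.0
        return rewards + gamma * next_q


def double_dqn_target(rewards, dones, next_states, net, target_net, gamma=0.99):
    """Double DQN target: the online net selects a', the target net evaluates it."""
    with torch.no_grad():
        next_actions = net(next_states).argmax(dim=1, keepdim=True)        # (batch, 1)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        next_q[dones] = 0.0
        return rewards + gamma * next_q
```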
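For the policy-gradient and on-/off-policy questions, a minimal REINFORCE-style loss: it increases the log-probability of actions that led to high returns. Because the log-probabilities must be taken under the policy that actually produced the samples, the update is on-policy, unlike DQN, which can learn from old replay-buffer transitions.

```python
import torch
import torch.nn.functional as F


def policy_gradient_loss(logits, actions, returns):
    """REINFORCE-style loss: -mean(return * log pi(a | s)).

    logits:  (batch, n_actions) raw outputs of the policy network
    actions: (batch,) actions the *current* policy actually took (hence on-policy)
    returns: (batch,) discounted returns (or advantages) observed for those actions
    """
    log_probs = F.log_softmax(logits, dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(returns * chosen).mean()
```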
Project related
- Q: If one feasible method is to map the sequential decisions of an opponent policy to a low-dimensional representation, could we try dimensionality-reduction methods other than the VAE? Unsupervised learning? (See the trajectory-VAE sketch after this list.)
- A: The VAE itself is an unsupervised learning method
- Q: If one paper is an improvement on another paper, how do we make sure the replication is accurate, especially when we need to compare performance?
- A: Some papers have open-source code, and some are easy to reimplement
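
A hedged sketch of the low-dimensional opponent-representation idea: a small VAE over a flattened, fixed-length window of opponent observation-action features. The window length, hidden size, and latent dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TrajectoryVAE(nn.Module):
    """Encode a flattened window of opponent (obs, action) features into a latent z."""

    def __init__(self, traj_dim: int, latent_dim: int = 8, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(traj_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, traj_dim)
        )

    def forward(self, traj):
        h = self.encoder(traj)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterization trick
        return self.decoder(z), mu, log_var


def vae_loss(recon, traj, mu, log_var):
    """Reconstruction term + KL divergence to the standard normal prior."""
    recon_loss = nn.functional.mse_loss(recon, traj, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl
```

Other unsupervised reductions (PCA, a plain autoencoder) could be tried by swapping out the encoder/decoder and dropping the KL term; the KL term is what gives the VAE a smoother, regularized latent space compared to a plain autoencoder.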
Next Step
- Finish the course project for AI Introduction (RL related)
- Finish the course project for the multi-agent systems course
- Research about the Gym-soccer environment