Learn Reinforcement Learning (3)  DQN improvement and Deep SARSA
16 Apr 2019 â€˘ 0 CommentsIn the previous article, we learned the DQN in Grid World that was developed by Deepmind and used to play Atari games. The average reward of the random action agent was obtained from 14
to 17
, whereas the average reward of the DQN agent was obtained from 11
to 12
.
However, if you move the learned agent by pressing the Run button, you can not show satisfactory movements such as wandering in front of the goal. Today, after DQN was first announced in 2013, we will try to increase the average reward of the ballfind3
problem in Grid World by applying various methods to improve performance. Next, we will look at a more efficient algorithm for solving this problem.
Â
DQN improvement
Hyperparameter adjustment
The easiest way to think about is to adjust the hyper parameters. Hyper parameters include the learning rate, batch size, etc. that affect the learning of the deep learning network. Also specific to DQN is the total size of the memory to store the experience, the memory size at which to start learning, and the discount rate, \(\gamma\).
Figure 1. Hyperparameter adjustments can be likened to adjusting the number of switches to achieve the desired results. Source Link
We can adjust the learn_step
. The previous articleâ€™s DQN agent was set to learn_step=10
, so agent learned every 10 steps. How about reducing this step? Of course the agent will learn more often. Does learning average increase the average reward? If learn_step=1
, the agent will learn 10 times faster by learning every step.
weight initialization
Deep learning networks consist of a number of float value weights and bias. These values determine the output of the deep learning, but before the learning begins, these values are undefined and need to be initialized. The usual way to do this is to initialize to a random number, but we can use Glorot Initialization to create an initial value for the weight that yields a more stable result. This method is also called Xavier Initialization because it was proposed by a person named Xavier Glorot.^{1}
soft update (target network)
As I mentioned in the last article, DQN has an idea of target network. QLearning updates the Qvalue as follows.
\[Q(s, a) = Q(s, a) + \alpha (R + \gamma max Q(s', a')  Q(s, a))\]Where \(\alpha\) is the learning rate. In the update equation, \(Q(s, a)\) is updated by multiplying \(R + \gamma max Q(s', a')  Q(s, a)\) by \(\alpha\). If \(\alpha=0\), \(Q(s, a)\) will not change, and if \(\alpha=1\), \(Q(s, a)\) will be\(R + \gamma max Q(s',a')\). Since \(\alpha\) is usually between 0.0
and 1.0
, you can think of this update expression as:
That is, when learning continues, \(Q(s, a)\) approaches \(R + \gamma max Q(s', a')\). If we denote the weight and bias of Qnetwork as \(\theta\), we can use the above equation as follows.
\[Q(s, a;\theta) \rightarrow R + \gamma max Q(s', a';\theta)\]The problem with this equation is that the current value and the goal use the same network weight. The weight of the Qnetwork changes with each learning. However, since the goal also changes, it may be difficult to converge to a certain value. Therefore, a separate network with the same structure as the Qnetwork named target network is created and marked with the symbol \(\theta^{}\).
\[Q(s, a;\theta) \rightarrow R + \gamma max Q(s', a';\theta^{})\]The target network maintains a fixed value during the learning process of the original Qnetwork^{2}, and then periodically resets it to the original Qnetwork value. This can be effective learning because the Qnetwork can be approached with a fixed target network.
Figure 2. Structure of learning using target network in DQN
A soft update means that we do not update this target network at once, but frequently and very little. The value of \(\tau\) is used. In Deepmindâ€™s paper^{3}, which proposed an algorithm called DPG, they used \(\tau=0.001\). The target network is updated as follows.
\[\theta^{} = \theta \times \tau + \theta^{} \times (1  \tau)\]The target network will move slightly to the value of Qnetwork. Since the value of \(\tau\) is small, the update should be frequent so that the effect will be noticeable. I modified the code to soft update every time agent learned.
Figure 3. The structure of learning to soft update target network in DQN
Letâ€™s try the above three enhancements, the agent that applied learn_step adjustment + soft update (target network) + weight initialization (glorot initialization).
By clicking on the Learn (DQN) button, you can get an average reward of 9
to 10
by running the DQN algorithm with several changes to the 2000 episode. This is slightly better than 11
to 12
of pure DQN.
Â
Â
Double DQN
David Silver of Deepmind cited three major improvements since Nature DQN in his lecture entitled â€śDeep Reinforcement Learningâ€ť.^{4} Three things are Double DQN, Prioritized replay, and Dueling DQN. Letâ€™s look at the double DQN and the Dueling DQN that have changes in the direct calculation.
Double DQN was an algorithm that used two Qnetworks to improve the QLearning problem in 2010, before the original DQN. In the â€śDouble QLearningâ€ť paper by Hado van Hasselt, he pointed out that Qlearning tends to be difficult because of the tendency of overestimating Qvalues.
\[Q(s, a) = Q(s, a) + \alpha (R + \gamma max Q(s', a')  Q(s, a))\]In the original Qlearning equation, when \(max Q(s', a')\) is extracted, this value is obtained by selecting \(a'\) having the highest Q value in the Qnetwork when the state \(s'\) is given, by multiplying the Q value by \(\gamma\), so that will create a target value that \(Q(s, a)\) should be close. However, the problem is that you use the same Qnetwork in the 1. Select, 2. Import Q values. Because the max
operation is used, the selected Q value is usually a large value, and the Q value will go in the increasing direction because it gets its value. However, if the Q value becomes larger than necessary, the performance of the Qnetwork will be greatly degraded.
To prevent this, Hado van Hasseltâ€™s first approach is to use two Qnetworks.
\[Q_{A}(s, a) = Q_{A}(s, a) + \alpha (R + \gamma Q_{B}(s', argmaxQ_{A}(s',a'))  Q_{A}(s, a))\] \[Q_{B}(s, a) = Q_{B}(s, a) + \alpha (R + \gamma Q_{A}(s', argmaxQ_{B}(s',a'))  Q_{B}(s, a))\]The two expressions are the same if A and B are changed. Based on the above equation, the \(argmaxQ_{A}(s',a')\) part is the part of the \(Q_{A}\) network that selects the behavior. And the outer \(Q_{B}(s', argmaxQ_{A}(s',a'))\) gets the Q value from the \(Q_{B}\) network. The use of two Qnetworks together is key to Double DQNâ€™s ability to prevent unnecessarily large Q values.
Figure 4. In the â€śDouble QLearningâ€ť example, the Grid world was a small 3x3. In the lower right, S is the starting point and G is the target point.
However, as we have seen above, DQN already had the idea of â€‹â€‹a target network. Hado van Hasselt, who wrote this paper, went to Deepmind and published â€śDeep Reinforcement Learning with Double Qlearningâ€ť with David Silver, a paper that applied Double QLearning to DQN in 2016.
\[Q(s, a;\theta) \rightarrow R + \gamma max Q(s', a';\theta^{})\]The above equation changes as follows.
\[Q(s, a;\theta) \rightarrow R + \gamma max Q(s', argmaxQ(s',a';\theta);\theta^{})\]Use the existing Qnetwork to select the action, and get the Q value from the target network with the selected behavior. There are so many formulas that I think it might be hard to see. As shown above, the learning structure diagram is as follows.
Figure 5. Learning structure of Double DQN
Letâ€™s look at the results of applying the Double DQN with the interactive example.
Double DQN and the Dueling DQN described later also have soft update, learn step 10>1, and glorot initialization. Since reinforcement learning can vary depending on execution, I have shown graphs of performance of Random, DQN, DQNsoftupdate and Double DQN.
Figure 6. The performance graph of each algorithm in ballfind3
. The X axis is the episode, and the Y axis is the average of the rewards from the last 100 trials.
Each algorithm has five runs, the average value is the line, and the maximum and minimum values are in the area. You can see that the version with Double DQN has better performance than DQNsoftupdate.
Dueling DQN
Dueling DQN is an idea of â€‹â€‹dividing the network result by V and A, then reassembling it and then getting the Q value. V is equivalent to the value function we saw in the first article of this series, and A is the value Advantage. Since A is calculated value for each action, it is easy to think that it is similar to Q value. The important point here is that V can be calculated. Calculate V for each state and add it to Q as shown below.
\[Q(s, a) = V(s) + A(s, a)  \frac{1}{A}\sum_{a} A(s, a)\]Since the V values are added together when the Q value is obtained, the Q value of the state having a high V value will increase as a result, and in the opposite case, the Q value will be lowered. Therefore, the agent moves to a state with a high Q value (high V), avoiding the action of giving a low Q value (low V). This idea of separating information about values and behaviors is similarly applied to the ActorCritic algorithm, which we will introduce next time.
Unlike the above Double DQN, Dueling DQN changes network structure. At the end, the output layer that selects the action is the same, but in the middle, V and A are calculated separately, and finally reassembled and the Q value is calculated to select the action.
Figure 7. The above is the DQN network structure in Grid World, and below is Dueling DQN.
The performance of Dueling DQN is also better than the existing soft update version. Double DQN and Dueling DQN are also called DDDQN(Dueling Double Deep QNetwork).
Figure 8. The performance graph of each algorithm in ballfind3
. The X axis is the episode, and the Y axis is the average of the rewards from the last 100 trials.
Dueling DQN shows better performance than DQN and has similar performance to soft update version. DDDQN, which combines Dueling and Double, did not record very good performance in this problem, but it seems to be able to improve the results through parameter tuning and so on.
Â
Deep SARSA
Some problems may be more effective with simple algorithms than with complex algorithms. In this Grid World, for the ballfind3
problem, the Deep SARSA algorithm performed better than the DQN. SARSA is a combination of state(s), action(a), reward(r), next state(sâ€™), and next action(aâ€™) that we have seen above.
Comparing Qlearning with SARSA, it can be said that SARS uses only state(s), action(a), reward(r), and next state(sâ€™). So the difference between Qlearning and SARSA is whether the next action(aâ€™) is necessary for learning.
\[Qlearning : Q(s, a) = Q(s, a) + \alpha (R + \gamma max Q(s', a')  Q(s, a))\] \[SARSA : Q(s, a) = Q(s, a) + \alpha (R + \gamma Q(s', a')  Q(s, a))\]SARSA expressions include \(Q(s', a')\) instead of missing the max
operator. If Qlearning is a greedy way of finding the maxQ value of the next step, then SARSA is a stochastic way of using the value of the action the agent took in the next step directly. Qlearning is an offpolicy algorithm that must be performed after episode execution because max value must be obtained, but SARSA also has an attribute called onpolicy algorithm that can be learned at any time during episode execution.
Deep SARSA is a deep learning neural network that uses Qnetwork to obtain Q value like DQN. Letâ€™s look at an interactive example of how well Deep SARSA performs in this problem.
The DeepSARSA agent gets an average reward of about 8
to 10
. If you press the RUN (DeepSARSA) button after the learning ends, you can see that agent movement is more efficient than DQN.
So far, we have looked at several ways to improve DQN and DeepSARSA, which is similar to DQN but onpolicy. Next time, we will look at ActorCritic, which has a concept similar to Dueling DQN. In addition, I attached the motion of the agent who trained the ballfind3
problem with the ActorCritic algorithm for 2,000 episodes. Letâ€™s analyze this algorithm next time.
Figure 9. ActorCritic agent. In ballfind3
, the average reward is 2.5~1.5
.
And I have summarized the performance of the algorithms we have seen so far. Because it is an interactive graph, you can turn on / off the top item and compare only the items you want to see.
It seems that you could be burdened by the fact that the formula was much larger than the previous article. It is not easy, but I will try to solve it as easily as possible next time. Thank you for reading the long article.

Paper Link (The coauthor of this paper is professor Yoshua Bengio, a worldrenowned researcher who recently won the 2018 Turing Award.)Â ↩

In Deepmindâ€™s DQN Nature paper, they set the target network update cycle to 10,000. This means that every time Qnetwork learns 10,000 times, it updates target network the same as Qnetwork. If you have a simpler problem than Atari games, you can lower this number. In the last article, we updated the target network once every episode.Â ↩

Lecture Link and Commentary link on lecture. It seems that the lecture is not currently being played.Â ↩