Weg's Tutorials

Double Deep Q Learning 2

The True Origin Story

Prerequisites

A Different Motivation

The Double DQN algorithm was not invented to delay policy updates. Even though it has the effect of delaying the future evaluation updates, the actual purpose of Double DQN is way more straightforward. This is one of those rare opportunities where you already have a small amount of context and there is a paper sitting right at the horizon of your current level of ignorance. Lucky you.
I know this is sort of demanding, but read this paper before you go on. Don't worry if it doesn't make complete sense, because we are about to dive into our own investigation. Really though, if you make it through that paper you're officially in big boy land.

Assigned Reading

Okay, you read it, right?...
What do you mean you didn't read it. Go back and read it or your mother will die in her sleep tonight.
Don't worry, I'm just kidding. The consequences will be much more dire than that. You'll kill thousands more little AI animals than you have to, and maybe never birth one that really survives. (You wish I was joking.) Less importantly, you'll also be stifling your ability to move beyond tutorials. That's right. Mama Weg isn't going to live forever. At some point these mind breasts are going to run dry and you'll have to go forage for brain milk elsewhere. I'm preparing you for that now.

Doctor It Hurts

The patient came in this morning. His name is DQN, age 8 (but born in 1992 somehow). It seems he keeps forgetting how to tie his shoes. His mother explains to you that she must have practiced with him 10,000 times on that new educational VR app for teaching children stuff, but nothing works. "The kids love it, you know. Computer games, technology. I thought for sure gamifying every aspect of his life was the best way to ensure he has a good, stimulated mind."

"Go ahead Johnny, show him how good you can tie your shoe."

He looks down at his shoes for a moment of contemplation, then begins fumbling with his laces. His mother aims the iPad at his feet. The app begins to score his progress as she coaches him along.

ep   57: high-score      150.000, score      158.000, epsilon 0.001
ep   58: high-score      158.000, score      167.000, epsilon 0.001
ep   59: high-score      167.000, score      188.000, epsilon 0.001
ep   60: high-score      188.000, score      228.000, epsilon 0.001
ep   61: high-score      228.000, score      222.000, epsilon 0.001
ep   62: high-score      228.000, score       12.000, epsilon 0.001
ep   63: high-score      228.000, score       12.000, epsilon 0.001
ep   64: high-score      228.000, score       12.000, epsilon 0.001
ep   65: high-score      228.000, score       12.000, epsilon 0.001
ep   66: high-score      228.000, score       11.000, epsilon 0.001
ep   67: high-score      228.000, score       12.000, epsilon 0.001

He's doing so well, getting close to completion, but suddenly he begins laughing uncontrollably, throwing his laces around like he's making stir-fry noodles. After a moment of this he stops and regains his bearings, but he can't even begin to tie his shoes now. It's as if the laces slip right through his fingers. The poor boy keeps trudging onward in frustration, trying over and over and over again, his knuckles smashing into the floor with each attempt.

"Johnny! Oh Johnny that's enough. Stop it now, you've done all you can.", his mother cries.

This is more serious than you thought.

"This is more serious than I thought", you say.
"We're going to need a more scientific investigation. These scores alone just aren't enough to see what's going wrong."

Diagnostics

You grab the power drill and assure the mother it won't hurt (it will), before drilling a small, millimeter-sized hole in his skull.
"Do you have insurance, ma'am?" you ask, as you wipe a disposable pleasure-sensing nodule with an alcohol swab before plunging it deep into his think gelatin. It smells disgusting. It usually does.

"Okay Johnny boy. Please tie your shoes for me, again. From the top."
Johnny isn't quite sure what top you're referring to, but after a questioning glance at his mother he begins tying again. The glass telovis lights up and charts out his expected returns as he works.

The paper claims the intention was to squash q values. Thoroughly investigating that claim would require a special environment designed to explicitly demonstrate the unnecessarily high q values. However, I think the original paper and the more recent paper above did a pretty good job of this, and there is a whole world of more powerful, more interesting add-ons ahead that probably demand your time more. It would be a good idea to at least give the papers a skim and look at the graphs of q values to see what they were talking about. Even if you don't understand most of the math, you can get the general idea. It is mostly just modifications to the Bellman equation and the temporal difference function, both of which you've probably already spent a bit of time looking at by now.

You can get an intuitive understanding from thinking about the vanilla version of the td function, though. The subtraction pushes the present q values to meet the future q values.

q_target = rewards + self.gamma * action_qs_   # where we want the present q values to be
td = q_target - q_values                       # how far off they are right now

More specifically, the difference between the future q values and the present q values is minimized. The reason q values normally go sky high (or ocean floor low) is hidden in this equation. When the present q values move towards the future q values, the future q values immediately move too. That's because, in normal DQL, both predictions come from the same network. It's like chasing your own shadow: every step you take towards it, it moves forward too. Not only does this destabilize the q values by making them self-exaggerating, it inflates them far past where they need to be. By freezing the future q value predictions for 10 learning steps we freeze the shadow. That way you can step on your shadow (creepy) before it moves ahead a little bit. Double Deep Q Learning provides a freeze in that feedback loop that we hadn't even intended.
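
To make the shadow chasing concrete, here's a minimal sketch of just the target computation, written with the same names as the full code further down. The first version is what vanilla DQL does; the second is what our agent does.

# vanilla dql: the same network selects and evaluates the next action,
# so the target (the shadow) moves every time the network moves
action_qs_ = self.net(states_).max(dim=1)[0]

# double dql: the online net selects the next action, the frozen
# future net evaluates it, so the target stays put between weight copies
actions_   = self.net(states_).max(dim=1)[1]
action_qs_ = self.future_net(states_)[batch_indices, actions_]

q_target = rewards + self.gamma * action_qs_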

Downsides

There are two downsides to this. Firstly, we have to deal with an extra network. That means storing an extra network, doing a periodic weight copy, and an additional pass of states_ through it. Before, we only needed to pass states and states_ through one network. Now we pass states once and states_ twice, once through each network. So this first cost is primarily a time cost, a compute cost: three forward passes per learning step instead of two, roughly 50% more forward work (the backward pass is unchanged, so the full learning step grows by less than that). That may not be a big deal now, but once the network has convolutional layers on the front, that extra pass will be pretty big.
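
If you want to feel that cost for yourself, here's a throwaway timing sketch. The networks and batch here are made-up stand-ins shaped like the full code below, and it only measures the forward passes (the backward pass is the same either way), so treat the numbers as a rough comparison, not a benchmark.

import time
import torch
import torch.nn as nn

def make_net():
    # stand-in network sized like the one in the full code below
    return nn.Sequential(nn.Linear(4, 1024), nn.ReLU(),
                         nn.Linear(1024, 512), nn.ReLU(),
                         nn.Linear(512, 2))

net, future_net = make_net(), make_net()
states  = torch.rand(64, 4)   # fake batch of CartPole states
states_ = torch.rand(64, 4)

with torch.no_grad():
    start = time.time()
    for _ in range(1000):
        net(states); net(states_)                       # vanilla: two passes
    vanilla = time.time() - start

    start = time.time()
    for _ in range(1000):
        net(states); net(states_); future_net(states_)  # double: three passes
    double = time.time() - start

print("vanilla: {:.3f}s   double: {:.3f}s".format(vanilla, double))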

Secondly, the future q values really are on a delay. Almost assuredly there will be environments where a new circumstance occurs, and then occurs again the very next frame. The agent could respond to it with its action selection, but it wouldn't adequately value it for up to 10 learning steps. That's not to say it can't learn from circumstances like that. It would. It's more that the first time it sees a circumstance that ends up being a frequent one, the future value will be wrong. In an environment where that sort of thing happens all the time, I suspect Double DQN will exhibit some behavioural artifact of this. I'm not sure what that environment would be, but surely it exists. You might not think this is a big downside, but I would be much happier knowing the algorithm's flaws than not knowing them. At the moment I don't know if we know the specific nature of its flaws. That scares me. I'm afraid of things I don't understand. Like snakes. How do they walk without legs? How does their stomach not wear down? Anyways, a few people I've discussed Double DQN with have anecdotally noted that it gets lower top scores in CartPole. The paper is more conclusive on the Atari environments, and studies since the original paper have been pretty unanimously positive about its prospects.

Full Code

Here's the full code.

import gym       
import math
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class ReplayBuffer:
    def __init__(self, mem_size, state_shape):
        self.mem_size = mem_size
        self.mem_count = 0

        self.states     = np.zeros((self.mem_size, *state_shape),dtype=np.float32)
        self.actions    = np.zeros( self.mem_size,               dtype=np.int64  )
        self.rewards    = np.zeros( self.mem_size,               dtype=np.float32)
        self.states_    = np.zeros((self.mem_size, *state_shape),dtype=np.float32)
        self.dones      = np.zeros( self.mem_size,               dtype=np.bool_  )

    def add(self, state, action, reward, state_, done):
        mem_index = self.mem_count % self.mem_size  # wrap around, overwriting the oldest memories first
        
        self.states[mem_index]  = state
        self.actions[mem_index] = action
        self.rewards[mem_index] = reward
        self.states_[mem_index] = state_
        self.dones[mem_index]   = done

        self.mem_count += 1

    def sample(self, sample_size):
        mem_max = min(self.mem_count, self.mem_size)
        batch_indices = np.random.choice(mem_max, sample_size, replace=True)

        states  = self.states[batch_indices]
        actions = self.actions[batch_indices]
        rewards = self.rewards[batch_indices]
        states_ = self.states_[batch_indices]
        dones   = self.dones[batch_indices]

        return states, actions, rewards, states_, dones

class Network(torch.nn.Module):
    def __init__(self, alpha, input_shape, num_actions):
        super().__init__()
        self.input_shape = input_shape
        self.num_actions = num_actions
        self.fc1_dims = 1024
        self.fc2_dims = 512

        self.fc1 = nn.Linear(*self.input_shape, self.fc1_dims)
        self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
        self.fc3 = nn.Linear(self.fc2_dims, num_actions)

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        # self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.device = torch.device("cpu")
        self.to(self.device)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

class Agent():
    def __init__(self, lr, state_shape, num_actions):
        self.num_actions = num_actions
        self.net = Network(lr, state_shape, num_actions)
        self.future_net = Network(lr, state_shape, num_actions)  # frozen copy that supplies the future q values
        self.memory = ReplayBuffer(mem_size=100000, state_shape=state_shape)
        self.batch_size = 64
        self.gamma = 0.99

        self.epsilon = 0.1
        self.epsilon_decay = 0.00005
        self.epsilon_min = 0.001

        self.learn_step_counter = 0
        self.net_copy_interval = 10

    def choose_action(self, observation):
        if np.random.random() < self.epsilon:
            action = random.randint(0, self.num_actions - 1)  # explore with a random action
        else:
            state = torch.tensor(observation).float().detach()
            state = state.to(self.net.device)
            state = state.unsqueeze(0)

            q_values = self.net(state)
            action = torch.argmax(q_values).item()
        return action

    def store_memory(self, state, action, reward, state_, done):
        self.memory.add(state, action, reward, state_, done)

    def learn(self):
        if self.memory.mem_count < self.batch_size:
            return

        states, actions, rewards, states_, dones = self.memory.sample(self.batch_size)
        states  = torch.tensor(states , dtype=torch.float32).to(self.net.device)
        actions = torch.tensor(actions, dtype=torch.long   ).to(self.net.device)
        rewards = torch.tensor(rewards, dtype=torch.float32).to(self.net.device)
        states_ = torch.tensor(states_, dtype=torch.float32).to(self.net.device)
        dones   = torch.tensor(dones  , dtype=torch.bool   ).to(self.net.device)

        batch_indices = np.arange(self.batch_size, dtype=np.int64)

        # q values of the actions we actually took
        q_values  =   self.net(states)[batch_indices, actions]

        # the online net picks the best next actions,
        # the frozen future net scores them
        q_values_ =   self.future_net(states_)
        actions_ =   self.net(states_).max(dim=1)[1]
        action_qs_ = q_values_[batch_indices, actions_]

        # terminal states have no future value
        action_qs_[dones] = 0.0
        q_target = rewards + self.gamma * action_qs_

        td = q_target - q_values

        self.net.optimizer.zero_grad()
        loss = (td ** 2.0).mean()  # mean squared td error
        loss.backward()
        self.net.optimizer.step()

        self.epsilon -= self.epsilon_decay
        if self.epsilon < self.epsilon_min:
            self.epsilon = self.epsilon_min

        # every net_copy_interval learning steps, refresh the frozen future net with the online weights
        if self.learn_step_counter % self.net_copy_interval == 0:
            self.future_net.load_state_dict(self.net.state_dict())

        self.learn_step_counter += 1

if __name__ == '__main__':
    env = gym.make('CartPole-v1').unwrapped
    agent = Agent(lr=0.001, state_shape=(4,), num_actions=2)

    high_score = -math.inf
    episode = 0

    num_samples = 0         
    samples_processed = 0  
    while True:
        done = False
        state = env.reset()

        score, frame = 0, 1
        while not done:
            # env.render()

            action = agent.choose_action(state)
            state_, reward, done, info = env.step(action)
            agent.store_memory(state, action, reward, state_, done)
            
            agent.learn()
            samples_processed += agent.batch_size
            
            state = state_

            score += reward
            frame += 1
            num_samples += 1 
            
        high_score = max(high_score, score)

        print(( "samples: {}, samps_procd: {}, ep {:4d}: high-score {:12.3f}, "
                "score {:12.3f}, epsilon {:5.3f}").format(
            num_samples, samples_processed, episode, 
            high_score, score, agent.epsilon))

        episode += 1

In Conclusion

This algo is pretty common. And though you took the scenic route, it was only about 7 or so lines of code different from vanilla DQN. That's pretty nice considering the scores. Good job. Now go check out Dueling Deep Q Learning. It's just a little bit harder than this add-on. And a bit more interesting. :^)
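
If you're curious which lines those actually are, here they are pulled out of the full code above; everything else is the same as the vanilla agent.

# in Agent.__init__: the extra network and the copy schedule
self.future_net = Network(lr, state_shape, num_actions)
self.learn_step_counter = 0
self.net_copy_interval = 10

# in Agent.learn(): the online net selects the next actions,
# the frozen future net evaluates them
q_values_  = self.future_net(states_)
actions_   = self.net(states_).max(dim=1)[1]
action_qs_ = q_values_[batch_indices, actions_]

# ...and periodically refresh the frozen net
if self.learn_step_counter % self.net_copy_interval == 0:
    self.future_net.load_state_dict(self.net.state_dict())
self.learn_step_counter += 1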

Tutorial Index