Weg's Tutorials

Double Deep Q Learning

Barely Even An Addon

Prerequisites

This tutorial assumes you've completed the "Foundations" tutorials. It also reuses code from those previous tutorials. If you aren't new to reinforcement learning, I'm sure you can follow along regardless. Less will be explained in the "Upgrades" tutorials than in the prior tutorials.

Getting Started

DQN works okay as-is, but I don't know if you have been looking at the performance graphs... These things are unstable. The rewards go all over the place. Numbers go up, but only in a general, chaotic sort of way. Some of this chaos comes from the stochastic nature of the environment. Some of it comes from the randomly initialized network weights finding their happy place. However, the likely root cause of the chaos in deep reinforcement learning is that the agent is one big feedback loop. The action changes the environment, which changes the action, which changes the environment, which changes the weights, which changes the actions... etc. In a recursive system like this, everything is an exponential whirlpool. So, what if you froze one or more components of the feedback loop temporarily? Would it reduce the insanity?

Code Review

Let's begin by reviewing the code starting point. It should look familiar but also slightly unfamiliar. Recently I started writing names with underscores to match common Python formatting. Consider it an exercise in reading. Also consider it a hand exercise in the painful finger movements used to type underscores.

import gym       
import math
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class ReplayBuffer:
    def __init__(self, mem_size, state_shape):
        self.mem_size = mem_size
        self.mem_count = 0

        self.states     = np.zeros((self.mem_size, *state_shape),dtype=np.float32)
        self.actions    = np.zeros( self.mem_size,               dtype=np.int64  )
        self.rewards    = np.zeros( self.mem_size,               dtype=np.float32)
        self.states_    = np.zeros((self.mem_size, *state_shape),dtype=np.float32)
        self.dones      = np.zeros( self.mem_size,               dtype=bool      )   # np.bool is deprecated; plain bool works

    def add(self, state, action, reward, state_, done):
        mem_index = self.mem_count % self.mem_size 
        
        self.states[mem_index]  = state
        self.actions[mem_index] = action
        self.rewards[mem_index] = reward
        self.states_[mem_index] = state_
        self.dones[mem_index]   = done

        self.mem_count += 1

    def sample(self, sample_size):
        mem_max = min(self.mem_count, self.mem_size)
        batch_indices = np.random.choice(mem_max, sample_size, replace=True)

        states  = self.states[batch_indices]
        actions = self.actions[batch_indices]
        rewards = self.rewards[batch_indices]
        states_ = self.states_[batch_indices]
        dones   = self.dones[batch_indices]

        return states, actions, rewards, states_, dones

class Network(torch.nn.Module):
    def __init__(self, alpha, input_shape, num_actions):
        super().__init__()
        self.input_shape = input_shape
        self.num_actions = num_actions
        self.fc1_dims = 1024
        self.fc2_dims = 512

        self.fc1 = nn.Linear(*self.input_shape, self.fc1_dims)
        self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
        self.fc3 = nn.Linear(self.fc2_dims, num_actions)

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        # self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.device = torch.device("cpu")
        self.to(self.device)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

class Agent():
    def __init__(self, lr, state_shape, num_actions):
        self.net = Network(lr, state_shape, num_actions)
        self.memory = ReplayBuffer(mem_size=10000, state_shape=state_shape)
        self.batch_size = 64
        self.gamma = 0.99

        self.epsilon = 0.1
        self.epsilon_decay = 0.00005
        self.epsilon_min = 0.001

    def choose_action(self, observation):
        if np.random.random() < self.epsilon:
            action = random.randint(0, 1)   # CartPole-v1 has two actions (0 and 1)
        else:
            state = torch.tensor(observation).float().detach()
            state = state.to(self.net.device)
            state = state.unsqueeze(0)

            q_values = self.net(state)
            action = torch.argmax(q_values).item()
        return action

    def store_memory(self, state, action, reward, state_, done):
        self.memory.add(state, action, reward, state_, done)

    def learn(self):
        if self.memory.mem_count < self.batch_size:
            return

        states, actions, rewards, states_, dones = self.memory.sample(self.batch_size)
        states  = torch.tensor(states , dtype=torch.float32).to(self.net.device)
        actions = torch.tensor(actions, dtype=torch.long   ).to(self.net.device)
        rewards = torch.tensor(rewards, dtype=torch.float32).to(self.net.device)
        states_ = torch.tensor(states_, dtype=torch.float32).to(self.net.device)
        dones   = torch.tensor(dones  , dtype=torch.bool   ).to(self.net.device)

        batch_indices = np.arange(self.batch_size, dtype=np.int64)
        q_values  =   self.net(states)[batch_indices, actions]

        q_values_ =   self.net(states_)
        action_qs_ = torch.max(q_values_, dim=1)[0]
        action_qs_[dones] = 0.0
        q_target = rewards + self.gamma * action_qs_

        td = q_target - q_values

        self.net.optimizer.zero_grad()
        loss = ((td ** 2.0)).mean()
        loss.backward()
        self.net.optimizer.step()

        self.epsilon -= self.epsilon_decay
        if self.epsilon < self.epsilon_min:
            self.epsilon = self.epsilon_min

if __name__ == '__main__':
    env = gym.make('CartPole-v1').unwrapped
    agent = Agent(lr=0.001, state_shape=(4,), num_actions=2)

    high_score = -math.inf
    episode = 0
    while True:
        done = False
        state = env.reset()

        score, frame = 0, 1
        while not done:
            # env.render()

            action = agent.choose_action(state)
            state_, reward, done, info = env.step(action)
            agent.store_memory(state, action, reward, state_, done)
            agent.learn()
            
            state = state_

            score += reward
            frame += 1

        high_score = max(high_score, score)

        print(( "ep {:4d}: high-score {:12.3f}, "
                "score {:12.3f}, epsilon {:5.3f}").format(
            episode, high_score, score, agent.epsilon))

        episode += 1

At this point this code should look really familiar to you. The only thing I can think of that might be confusing you is state vs state_. I'm not sure why, but state_ is often used as shorthand for next_state. Anyway, now you'll know what it is when you see it. Basically, any time a variable ends in _, assume it means "for the next time step".

You also might have noticed MSELoss() is gone from the network class. Instead of using the built-in PyTorch loss function, I reimplemented it in learn(). You can see the td error is squared, and then that tensor is averaged.

self.net.optimizer.zero_grad()
loss = ((td ** 2.0)).mean()   # this is what MSELoss did.
loss.backward()
self.net.optimizer.step()

Now there is no more magic hidden beneath. It's good to know how things work.
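If you want to convince yourself the hand-rolled loss really matches the built-in one, here's a tiny standalone check with made-up numbers (not part of the agent):

import torch
import torch.nn as nn

q_values = torch.tensor([1.0, 2.0, 3.0])
q_target = torch.tensor([1.5, 1.0, 3.5])

td = q_target - q_values
manual_loss  = (td ** 2.0).mean()
builtin_loss = nn.MSELoss()(q_values, q_target)

print(manual_loss.item(), builtin_loss.item())   # both print 0.5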

By the way, when you run the code as-is you get typical DQN performance like this.

ep   57: high-score      273.000, score      158.000, epsilon 0.001
ep   58: high-score      273.000, score      167.000, epsilon 0.001
ep   59: high-score      273.000, score      188.000, epsilon 0.001
ep   60: high-score      273.000, score      228.000, epsilon 0.001
ep   61: high-score      273.000, score      222.000, epsilon 0.001
ep   62: high-score      273.000, score      144.000, epsilon 0.001
ep   63: high-score      273.000, score      199.000, epsilon 0.001
ep   64: high-score      300.000, score      300.000, epsilon 0.001
ep   65: high-score      300.000, score      162.000, epsilon 0.001
ep   66: high-score      300.000, score      170.000, epsilon 0.001
ep   67: high-score      312.000, score      312.000, epsilon 0.001

Chef Man

You are a chef. You cook tacos. You are on a quest to cook the most perfect taco that ever existed. You get a bag of supplies and you go to the kitchen. Also, this is the first taco you've ever made and the first time you have ever cooked. For some reason this doesn't seem like a problem to you, so you turn on the "Taco Food Channel" and start cooking random ingredients while you watch the TV. 12 hours later you are tired, sweaty, and defeated, and also your tacos are terrible. You tried to eat one and threw up. You know your friend is a world-renowned taco chef, so you call him and invite him over for an educational cooking party. The next day he teaches you how to make great tacos. First you watch him make one. Then you try to make one. Then he slaps you, and he makes you watch him make one. Then you try. Then he makes one... This goes on for 36 hours and you emerge from taco bootcamp a taco-making god.

At the moment, the agent is not like taco chef man at all. He never takes time to observe. He only tries to learn while doing. This is easily fixed.

...
counter = 0     #   count the lessons
while True:
    done = False
    state = env.reset()

    score, frame = 0, 1
    while not done:
        # env.render()

        action = agent.choose_action(state)
        state_, reward, done, info = env.step(action)
        agent.store_memory(state, action, reward, state_, done)
        
        if counter % 10 == 0:   #   just watch sometimes
            agent.learn()
        
        state = state_

        score += reward
        frame += 1

        counter += 1    #   count count count

This way the agent will take turns with the environment. Sometimes it will sit and learn passively from its past self. Other times it will actively learn by doing. Let's see how the results turn out.

ep  392: high-score      490.000, score      163.000, epsilon 0.001
ep  393: high-score      490.000, score      185.000, epsilon 0.001
ep  394: high-score      490.000, score      187.000, epsilon 0.001
ep  395: high-score      490.000, score      232.000, epsilon 0.001
ep  396: high-score      490.000, score      339.000, epsilon 0.001

Sample Efficiency

It took about 400 episodes for the results to be on par with before. You might think that's pretty good, because the agent only actually trained 1/10th as often. Meaning, it should get the same scores, just after 10 times as many episodes. However, episodes are different lengths, so they aren't that useful as a metric on their own. What works much better is performance per sample. Let's keep track of how many transitions are collected, and how many have been processed by the network.

...
num_samples = 0         #   TRANSITIONS COLLECTED
samples_processed = 0   #   TRANSITIONS PROCESSED
counter = 0
while True:
    done = False
    state = env.reset()

    score, frame = 0, 1
    while not done:
        # env.render()

        action = agent.choose_action(state)
        state_, reward, done, info = env.step(action)
        agent.store_memory(state, action, reward, state_, done)
        
        if counter % 1 == 0:    #   BACK TO 1
            agent.learn()
            samples_processed += agent.batch_size   #   NEW
        
        state = state_

        score += reward
        frame += 1
        num_samples += 1    #   NEW

        counter += 1 
        
    high_score = max(high_score, score)

    ''' SOME NEW PRINTING STUFF '''
    print(( "samples: {}, samps_procd: {}, ep {:4d}: high-score {:12.3f}, "
            "score {:12.3f}, epsilon {:5.3f}").format(
        num_samples, samples_processed, episode, 
        high_score, score, agent.epsilon))

    episode += 1

Okay, let's get a new baseline. Notice the counter modulus is set back to 1, so this is normal DQN without the turn-taking.

samples: 4017, samps_procd: 257088, ep   50: high-score      214.000, score      180.000, epsilon 0.001
samples: 4164, samps_procd: 266496, ep   51: high-score      214.000, score      147.000, epsilon 0.001
samples: 4332, samps_procd: 277248, ep   52: high-score      214.000, score      168.000, epsilon 0.001
samples: 4681, samps_procd: 299584, ep   53: high-score      349.000, score      349.000, epsilon 0.001
samples: 4975, samps_procd: 318400, ep   54: high-score      349.000, score      294.000, epsilon 0.001
samples: 5281, samps_procd: 337984, ep   55: high-score      349.000, score      306.000, epsilon 0.001

It should look about the same. And it does. But something new is standing out now as worrisome.
To get to this point the agent pushed over 300,000 transitions through the network while only having collected about 5,000 of them to work with. That means each transition was reused roughly 60 times on average, and the earliest ones far more than that. In normal machine learning, running that many epochs over a tiny 5,000-sample dataset would make someone slap you and yell "overfitting". But we are ignorant DRL programmers so we can get away with it. Part of this has to do with old transitions being framed within a new "reward prediction state of mind". Later on in your life you can think back to your childhood and learn new things from analysing it. This is because your interpretation is different, and your goals have changed. I am unsure as to whether normal machine learning benefits from this to quite the same degree. Generally the classification, or ground truth, of a sample in a regular ML dataset is fixed. A picture of a dog is still 100% dog a year later. So reviewing the same data over and over isn't quite so useful. But in DRL the q values might have changed since the last exposure to a sample. What was once seen as kind of bad might now be seen as REALLY BAD. Or not as bad. So overfitting in DRL is less of a problem, and more of a feature. Although it can definitely still happen. You will learn more about overfitting in DRL in the future. :)
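If you want to double-check that reuse figure, it falls straight out of the two counters we are already printing. The numbers here are copied from the last log line above:

samples_collected = 5281      # the "samples" column at ep 55
samples_processed = 337984    # the "samps_procd" column at ep 55

print(samples_processed / samples_collected)   # 64.0 exposures per transition, on average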

Anyways, back to the watch-learn-watch-learn idea. Set the agent to learn every 10 steps instead to see what changes.

if counter % 10 == 0:    #   BACK TO 10
      agent.learn()
      samples_processed += agent.batch_size   #   NOT NEW ANYMORE

And results...

samples: 38808, samps_procd: 248384, ep  570: high-score      508.000, score      508.000, epsilon 0.001
samples: 39052, samps_procd: 249984, ep  571: high-score      508.000, score      244.000, epsilon 0.001
samples: 39396, samps_procd: 252160, ep  572: high-score      508.000, score      344.000, epsilon 0.001
samples: 39504, samps_procd: 252864, ep  573: high-score      508.000, score      108.000, epsilon 0.001
samples: 39679, samps_procd: 253952, ep  574: high-score      508.000, score      175.000, epsilon 0.001
samples: 40009, samps_procd: 256064, ep  575: high-score      508.000, score      330.000, epsilon 0.001
samples: 40300, samps_procd: 257920, ep  576: high-score      508.000, score      291.000, epsilon 0.001
samples: 40561, samps_procd: 259648, ep  577: high-score      508.000, score      261.000, epsilon 0.001

The agent jumped from getting 200 scores to 500 scores right away. I ran it a few times and the result is like this each time. Even though learn() ran only once every ten steps, it didn't take a full 10 times as many episodes or samples to catch up. It required a similar number of samples-processed to gain the same performance. Interestingly, the scores are slightly higher and the number of processed samples is slightly lower. We could attribute this to the "turn-taking" learning strategy that worked for Chef Man, but there is an uncontrolled variable here. Specifically, the number of repeated sample exposures is much lower if you collect more transitions and run learn() much less frequently. The ingredients are "more fresh", if you will. To reduce that effect some, let's make the agent collect 50,000 frames before it even starts learning. That way both the normal DQN agent and the version with the periodic learning have similar data freshness.

if counter % 1 == 0:    #   back to 1 again
    if num_samples > 50000: # lots of falling poles...
        agent.learn()
        samples_processed += agent.batch_size

Don't forget to make the replay buffer larger to hold the new samples. I set mine to 100,000.
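For clarity, that's a one-line change in the Agent __init__ from earlier:

self.memory = ReplayBuffer(mem_size=100000, state_shape=state_shape)   # was 10000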
aaaand run.

samples: 55196, samps_procd: 332480, ep 5112: high-score      210.000, score      159.000, epsilon 0.001
samples: 55410, samps_procd: 346176, ep 5113: high-score      214.000, score      214.000, epsilon 0.001
samples: 55535, samps_procd: 354176, ep 5114: high-score      214.000, score      125.000, epsilon 0.001
samples: 55678, samps_procd: 363328, ep 5115: high-score      214.000, score      143.000, epsilon 0.001
samples: 55905, samps_procd: 377856, ep 5116: high-score      227.000, score      227.000, epsilon 0.001

This is about 80 episodes after it started running learn(). Again, you can see a similar number of samples processed was necessary to get to this performance. Interestingly, there were a bunch of episodes where it seemed almost no performance was gained from learn(). Those big stretches of episodes where scores don't really change could be due to the 50,000-sample warmup we just added. The memory fills up with rather simplistic transitions (the cartpole falling in the center of the play area) before the agent really begins to learn. Those transitions are not representative of the new circumstances the more trained agent finds itself in (the cartpole falling near the edges of the play area). It takes time for the more relevant samples to become less of a minority in the replay buffer before they have a decent chance of being chosen for the learn batch. That's good evidence that setting the "minimum samples collected" requirement too high could actually make learning much slower if paired with a giant replay buffer that takes forever to be "cleansed", or become "relevant".
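To get a feel for how slowly a big buffer "cleanses", here is a rough back-of-the-envelope sketch. The numbers are made up for illustration, not measured from the run above:

buffer_fill = 100000    # transitions sitting in the replay buffer
fresh       = 2000      # transitions collected since the agent's behavior last changed meaningfully
batch_size  = 64

p_fresh = fresh / buffer_fill       # 0.02, so 2% of the buffer is "fresh"
print(batch_size * p_fresh)         # ~1.3 fresh transitions in an average 64-sample batch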

On a side note, the epsilon decay happens inside learn(), so this experiment keeps that as a dependent variable: epsilon only decays on learn steps, not on collection steps.

Alright. Now back to if counter % 10 == 0:

samples: 93908, samps_procd: 280960, ep 5736: high-score      420.000, score      140.000, epsilon 0.001
samples: 94122, samps_procd: 282368, ep 5737: high-score      420.000, score      214.000, epsilon 0.001
samples: 94363, samps_procd: 283904, ep 5738: high-score      420.000, score      241.000, epsilon 0.001
samples: 94625, samps_procd: 285568, ep 5739: high-score      420.000, score      262.000, epsilon 0.001
                                                            # 420 haha

I hope by now you can see the pattern. If you have enough data, and fresh data, to work with, performance in DRL largely depends on the number of samples processed. The performance gained per sample processed is called the sample efficiency of an agent. This is generally dependent upon the network architecture, the addons, and the loss function. Whereas learning stability depends more on the policy not changing too much or too frequently, and on the samples being consistently relevant to the agent's current circumstance. Too much instability can lower the sample efficiency of the agent, so they aren't entirely separate aspects of learning, but it can help to look for sources of either independently.
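If you want a crude number for comparing runs, just divide score by transitions processed. This little helper uses the same counters as the main loop above; the two example values are copied from the earlier logs:

def rough_sample_efficiency(score, samples_processed):
    # performance gained per transition pushed through the network
    return score / max(samples_processed, 1)

print(rough_sample_efficiency(306.0, 337984))   # learn every step     -> ~0.0009
print(rough_sample_efficiency(291.0, 257920))   # learn every 10 steps -> ~0.0011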

Underwhelming

Let's consider sample "relevancy" in the context of our watch-learn-watch change. Obviously, to learn something you want to study things relevant to your current understanding, and part of that means adapting what you are studying to the moving horizon of your ignorance. In school, when you get better at multiplication, you have to move on to harder multiplication problems, or different problems, or you won't get any better. One of the results of running learn() only periodically is that the practice material isn't pushed forward as often. The "flash cards" are much less diverse, which was intended, but this slows down the agent "moving on" to the next concept. Every time the policy changes, the agent has to wait for the memory to fill with more current samples. And with a bigger memory it takes longer for that to happen. We made the flash cards less hard, but the flash cards also stay old longer. The end result isn't fantastic. There are a few one-off higher scores that show up earlier, but the average scores are similar per sample-processed.

Almost Double Q Learning

Collecting a bunch of samples before letting the agent learn is a reasonable way to prevent early overfitting, but periodic learning isn't as reasonable. If the environment is really dynamic it could make sense, but generally it's not a technique I often see used. That's usually because the environment is expensive to run. As you can imagine, if you are trying to make the agent learn things in the real world you can't run the environment at 1000 times speed. This downside is just too big. Besides, who likes having to wait 10 times as long to get improved performance, if the performance is only a tiny bit better? I would rather save that tool for a case where the instability prevents learning entirely. What if there was a way to give the environment time to produce fresh data before running learn(), still push the next lesson forward, and not waste all that time just spinning the environment?

Luckily for you there is a way. Actually, there are an infinite number of ways. This is one of them.
What if, instead of only learning every ten steps, we added another neural network, trained it in the "off time", and then swapped it in every 10 steps? This could be the best of both worlds.

First, add another network, and a step counter and update interval.

class Agent():
    def __init__(self, lr, state_shape, num_actions):
        self.net = Network(lr, state_shape, num_actions)
        self.future_net = Network(lr, state_shape, num_actions)   # another network
        self.memory = ReplayBuffer(mem_size=100000, state_shape=state_shape)
        self.batch_size = 64
        self.gamma = 0.99

        self.epsilon = 0.1
        self.epsilon_decay = 0.00005
        self.epsilon_min = 0.001

        self.learn_step_counter = 0     # increments each learn()
        self.net_copy_interval = 10     # NEW

Then, every n training steps, copy the future_net's weights into the main net.

def learn(self):
...
    if self.learn_step_counter % self.net_copy_interval == 0: 
        self.net.load_state_dict(self.future_net.state_dict())  # state_dict() packs the 
                                                                # net params into a dict
    self.learn_step_counter += 1

Okay, the final step is to have the actions chosen by the main network while the learning happens to the future_net. Luckily for us, the actions are already chosen by the main network.

def learn(self):
...
    batch_indices = np.arange(self.batch_size, dtype=np.int64)
    q_values  =   self.future_net(states)[batch_indices, actions]

    q_values_ =   self.future_net(states_)
    action_qs_ = torch.max(q_values_, dim=1)[0]
    action_qs_[dones] = 0.0
    q_target = rewards + self.gamma * action_qs_

    td = q_target - q_values

    self.future_net.optimizer.zero_grad()
    loss = ((td ** 2.0)).mean()
    loss.backward()
    self.future_net.optimizer.step()
...

Really you just swap out net for future_net every time it shows up in learn(). I kind of hope you knew that already. :^) Make sure to get all of them. Also go remove that minimum sample collection stuff in the main loop for now.
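Just to be safe, here is roughly what the whole of learn() looks like at this stage. It's only the fragments above stitched together, nothing new:

def learn(self):
    if self.memory.mem_count < self.batch_size:
        return

    states, actions, rewards, states_, dones = self.memory.sample(self.batch_size)
    states  = torch.tensor(states , dtype=torch.float32).to(self.net.device)
    actions = torch.tensor(actions, dtype=torch.long   ).to(self.net.device)
    rewards = torch.tensor(rewards, dtype=torch.float32).to(self.net.device)
    states_ = torch.tensor(states_, dtype=torch.float32).to(self.net.device)
    dones   = torch.tensor(dones  , dtype=torch.bool   ).to(self.net.device)

    batch_indices = np.arange(self.batch_size, dtype=np.int64)
    q_values  =   self.future_net(states)[batch_indices, actions]

    q_values_ =   self.future_net(states_)
    action_qs_ = torch.max(q_values_, dim=1)[0]
    action_qs_[dones] = 0.0
    q_target = rewards + self.gamma * action_qs_

    td = q_target - q_values

    self.future_net.optimizer.zero_grad()
    loss = ((td ** 2.0)).mean()
    loss.backward()
    self.future_net.optimizer.step()

    self.epsilon -= self.epsilon_decay
    if self.epsilon < self.epsilon_min:
        self.epsilon = self.epsilon_min

    # every few learn steps the acting net catches up to the trained net
    if self.learn_step_counter % self.net_copy_interval == 0:
        self.net.load_state_dict(self.future_net.state_dict())
    self.learn_step_counter += 1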
Time to test it out.

#  run 1
samples: 1647, samps_procd: 105408, ep   28: high-score      157.000, score      157.000, epsilon 0.021
samples: 1769, samps_procd: 113216, ep   29: high-score      157.000, score      122.000, epsilon 0.015
samples: 1970, samps_procd: 126080, ep   30: high-score      201.000, score      201.000, epsilon 0.005
samples: 2134, samps_procd: 136576, ep   31: high-score      201.000, score      164.000, epsilon 0.001
samples: 2252, samps_procd: 144128, ep   32: high-score      201.000, score      118.000, epsilon 0.001
samples: 2455, samps_procd: 157120, ep   33: high-score      203.000, score      203.000, epsilon 0.001

#  run 2
samples: 2066, samps_procd: 132224, ep   50: high-score      148.000, score      110.000, epsilon 0.001
samples: 2167, samps_procd: 138688, ep   51: high-score      148.000, score      101.000, epsilon 0.001
samples: 2423, samps_procd: 155072, ep   52: high-score      256.000, score      256.000, epsilon 0.001
samples: 2716, samps_procd: 173824, ep   53: high-score      293.000, score      293.000, epsilon 0.001
samples: 2837, samps_procd: 181568, ep   54: high-score      293.000, score      121.000, epsilon 0.001

#  run 3
samples: 2910, samps_procd: 186240, ep   51: high-score      170.000, score      170.000, epsilon 0.001
samples: 3018, samps_procd: 193152, ep   52: high-score      170.000, score      108.000, epsilon 0.001
samples: 3225, samps_procd: 206400, ep   53: high-score      207.000, score      207.000, epsilon 0.001
samples: 3405, samps_procd: 217920, ep   54: high-score      207.000, score      180.000, epsilon 0.001
samples: 3586, samps_procd: 229504, ep   55: high-score      207.000, score      181.000, epsilon 0.001
samples: 3789, samps_procd: 242496, ep   56: high-score      207.000, score      203.000, epsilon 0.001

This looks pretty good. On average it's obviously better sample efficiency, and you don't have to waste time letting the environment spin. The only thing that really changed here is how often the policy changes. There are downsides, though. For a more complicated environment that requires frequent policy changes, it will take a lot more tries to push the policy into the right place. The advantage is that the path to the current policy should be refined more easily. That is, it won't be interrupted by so many irrelevant decision changes along the way. It's like making sure to use the same brand of tortilla for a while to get a good feel of how to cook with it, as opposed to buying a new brand of tortilla every time you make a single taco.

Realish Double Deep Q Learning

There's a bit of a problem with the td function like this, though. In the td function, the present q values and the future q values are both expected to be as close as possible to the ground truth (if everything is working ideally). In our current implementation, that's impossible. While actions in choose_action() are chosen by the old network, in learn() the future actions are chosen by the future_net. That doesn't really make sense. In fact, it's only correct to do this 1/10th of the time: right after the parameter copy. The future actions should be chosen by the old network too. The future_net is now responsible for the value of the future, not for the current policy. Our original intention was to delay policy changes anyway. Let's try fixing this by making the old net pick the actions.

def learn(self):
...
    batch_indices = np.arange(self.batch_size, dtype=np.int64)
    q_values  =   self.future_net(states)[batch_indices, actions]  # future net

    ''' let the old net pick the future actions,
        but let the future_net choose their value'''
    q_values_ =   self.future_net(states_)          #   future net picks the values
    actions_ =   self.net(states_).max(dim=1)[1]    #   old net picks the actions
    action_qs_ = q_values_[batch_indices, actions_] #   extract q values according to old nets actions
                                                    #     from future nets values
    action_qs_[dones] = 0.0
    q_target = rewards + self.gamma * action_qs_

    td = q_target - q_values

    self.future_net.optimizer.zero_grad()
    loss = ((td ** 2.0)).mean()
    loss.backward()
    self.future_net.optimizer.step()

Run it a few times.

samples: 5909, samps_procd: 378176, ep  158: high-score      276.000, score      145.000, epsilon 0.001
samples: 6160, samps_procd: 394240, ep  159: high-score      276.000, score      251.000, epsilon 0.001
samples: 6290, samps_procd: 402560, ep  160: high-score      276.000, score      130.000, epsilon 0.001
samples: 6557, samps_procd: 419648, ep  161: high-score      276.000, score      267.000, epsilon 0.001
samples: 6714, samps_procd: 429696, ep  162: high-score      276.000, score      157.000, epsilon 0.001

Those of you familiar with the math might find it amazing that this works at all. Actually, you can see that the sample efficiency went down a bit; I cherry-picked one of the better runs. If you run it a few times you will find it learns at least twice as slowly as normal DQN. However, eventually the scores go up into the thousands just like usual. The q values are learned just the same. As long as predictions sit on both sides of the td function, future prediction and present prediction, the base reward signal will slowly pull them together. Here's proof.

samples: 41042, samps_procd: 2626688, ep  523: high-score     1451.000, score     1187.000, epsilon 0.001
samples: 41330, samps_procd: 2645120, ep  524: high-score     1451.000, score      288.000, epsilon 0.001
samples: 42242, samps_procd: 2703488, ep  525: high-score     1451.000, score      912.000, epsilon 0.001
samples: 43444, samps_procd: 2780416, ep  526: high-score     1451.000, score     1202.000, epsilon 0.001
samples: 52037, samps_procd: 3330368, ep  527: high-score     8593.000, score     8593.000, epsilon 0.001
samples: 61511, samps_procd: 3936704, ep  528: high-score     9474.000, score     9474.000, epsilon 0.001
# there was no next episode. It was still balancing the cartpole for another 20 minutes or so.
# legend has it that somewhere in cyber space the little agent that could is still balancing the pole.
# that legend is full of shit because i hit ctrl-c and murdered him.

So the scores are fine, but look at that sample efficiency. It's terrible. Why does this agent learn so slowly? The issue is probably that the policy changes on a massive delay. Action selection doesn't adapt to the new value estimates until quite a bit later than before. Actually... it wouldn't be unreasonable to guess this version is slightly more stable, since data freshness is more likely, but about 10 times slower to learn per sample. What we need is for the agent to adapt to the present quickly, but to be slow and calculated in its predictions of the future. Maybe future_net really needs to be responsible for the value of the future alone. Let's try that.

def learn(self):
    if self.memory.mem_count < self.batch_size:
        return

    states, actions, rewards, states_, dones = self.memory.sample(self.batch_size)
    states  = torch.tensor(states , dtype=torch.float32).to(self.net.device)
    actions = torch.tensor(actions, dtype=torch.long   ).to(self.net.device)
    rewards = torch.tensor(rewards, dtype=torch.float32).to(self.net.device)
    states_ = torch.tensor(states_, dtype=torch.float32).to(self.net.device)
    dones   = torch.tensor(dones  , dtype=torch.bool   ).to(self.net.device)

    batch_indices = np.arange(self.batch_size, dtype=np.int64)
    q_values  =   self.net(states)[batch_indices, actions]  # "now" net evaluates the present

    ''' let the "now" net pick the future actions,
        but let the future_net choose their value'''
    q_values_ =   self.future_net(states_)          #   future_net
    actions_ =   self.net(states_).max(dim=1)[1]    #   now net
    action_qs_ = q_values_[batch_indices, actions_]

    action_qs_[dones] = 0.0
    q_target = rewards + self.gamma * action_qs_

    td = q_target - q_values

    self.net.optimizer.zero_grad()  # no more future_net learning
    loss = ((td ** 2.0)).mean()
    loss.backward()
    self.net.optimizer.step()       # no more future_net learning

    self.epsilon -= self.epsilon_decay
    if self.epsilon < self.epsilon_min:
        self.epsilon = self.epsilon_min

    ''' notice this copy goes the other direction now'''
    if self.learn_step_counter % self.net_copy_interval == 0:
        self.future_net.load_state_dict(self.net.state_dict())  # LINE CHANGED

    self.learn_step_counter += 1

Results:

samples: 552, samps_procd: 35328, ep   22: high-score      105.000, score      105.000, epsilon 0.076
samples: 865, samps_procd: 55360, ep   23: high-score      313.000, score      313.000, epsilon 0.060
samples: 1105, samps_procd: 70720, ep   24: high-score      313.000, score      240.000, epsilon 0.048
samples: 1346, samps_procd: 86144, ep   25: high-score      313.000, score      241.000, epsilon 0.036
samples: 1585, samps_procd: 101440, ep   26: high-score      313.000, score      239.000, epsilon 0.024

The sample efficiency has improved substantially. I ran it lots of times and it usually gets to these score levels around 100-150k samples processed, sometimes before 100k even. So this is a success. The agent learns and responds to changes in the short term, doesn't spin the environment, and freezes its values of the future to interrupt the feedback loop. Though it doesn't involve as much freezing as we originally thought might be helpful. What you have now is the Double Deep Q Learning Algorithm.
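To sum up the whole change in one place, here are the two target computations side by side, written as a fragment reusing the names from learn() above (the dones masking is omitted here for brevity):

# vanilla DQN: one net both picks and values the next action
q_target = rewards + self.gamma * self.net(states_).max(dim=1)[0]

# double DQN: the "now" net picks the next action,
#             the frozen future_net values it
actions_ = self.net(states_).max(dim=1)[1]
q_target = rewards + self.gamma * self.future_net(states_)[batch_indices, actions_]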

Gotcha

You thought you were done here and that you had an algo ready for primetime, didn't you? Turns out there's some more to read. What I explained above about stopping feedback loops is the right concept, and applies to every aspect of DRL. But that roundabout conception just isn't how DQN came to be...