This tutorial assumes you've completed the "Foundations" tutorials. It also uses code from those previous tutorials. If you aren't new to reinforcement learning, I'm sure you can follow along anyway. Less will be explained in the "Upgrades" tutorials than in the earlier ones.
DQN works okay as is, but I don't know if you have been looking at the performance graphs... These things are unstable. The rewards bounce all over the place. The numbers are going up, but only generally, and chaotically. Some of this chaos comes from the stochastic nature of the environment. Some of it comes from the randomly initialized network weights finding their happy place. However, the likely root cause of the chaos in deep reinforcement learning is that the agent is one big feedback loop. The action changes the environment, which changes the action, which changes the environment, which changes the weights, which changes the actions... etc. In a recursive system like this, everything is an exponential whirlpool. So, what if you froze one or more components of the feedback loop temporarily? Would it reduce the insanity?
Let's begin by reviewing the code starting point. It should look familiar but also slightly unfamiliar. Recently I started coding with underscores to match the common Python formatting (snake_case). Consider it an exercise in reading. Also consider it a hand exercise in the painful finger movements used to type underscores.
import gym
import math
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class ReplayBuffer:
    def __init__(self, mem_size, state_shape):
        self.mem_size = mem_size
        self.mem_count = 0

        self.states  = np.zeros((self.mem_size, *state_shape), dtype=np.float32)
        self.actions = np.zeros( self.mem_size,                dtype=np.int64  )
        self.rewards = np.zeros( self.mem_size,                dtype=np.float32)
        self.states_ = np.zeros((self.mem_size, *state_shape), dtype=np.float32)
        self.dones   = np.zeros( self.mem_size,                dtype=bool      )

    def add(self, state, action, reward, state_, done):
        mem_index = self.mem_count % self.mem_size

        self.states[mem_index]  = state
        self.actions[mem_index] = action
        self.rewards[mem_index] = reward
        self.states_[mem_index] = state_
        self.dones[mem_index]   = done

        self.mem_count += 1

    def sample(self, sample_size):
        mem_max = min(self.mem_count, self.mem_size)
        batch_indices = np.random.choice(mem_max, sample_size, replace=True)

        states  = self.states[batch_indices]
        actions = self.actions[batch_indices]
        rewards = self.rewards[batch_indices]
        states_ = self.states_[batch_indices]
        dones   = self.dones[batch_indices]

        return states, actions, rewards, states_, dones
class Network(torch.nn.Module):
    def __init__(self, alpha, input_shape, num_actions):
        super().__init__()
        self.input_shape = input_shape
        self.num_actions = num_actions
        self.fc1_dims = 1024
        self.fc2_dims = 512

        self.fc1 = nn.Linear(*self.input_shape, self.fc1_dims)
        self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
        self.fc3 = nn.Linear(self.fc2_dims, num_actions)

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        # self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.device = torch.device("cpu")
        self.to(self.device)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x
class Agent():
    def __init__(self, lr, state_shape, num_actions):
        self.net = Network(lr, state_shape, num_actions)
        self.memory = ReplayBuffer(mem_size=10000, state_shape=state_shape)

        self.batch_size = 64
        self.gamma = 0.99
        self.epsilon = 0.1
        self.epsilon_decay = 0.00005
        self.epsilon_min = 0.001

    def choose_action(self, observation):
        if np.random.random() < self.epsilon:
            action = random.randint(0, 1)
        else:
            state = torch.tensor(observation).float().detach()
            state = state.to(self.net.device)
            state = state.unsqueeze(0)

            q_values = self.net(state)
            action = torch.argmax(q_values).item()

        return action

    def store_memory(self, state, action, reward, state_, done):
        self.memory.add(state, action, reward, state_, done)

    def learn(self):
        if self.memory.mem_count < self.batch_size:
            return

        states, actions, rewards, states_, dones = self.memory.sample(self.batch_size)

        states  = torch.tensor(states , dtype=torch.float32).to(self.net.device)
        actions = torch.tensor(actions, dtype=torch.long   ).to(self.net.device)
        rewards = torch.tensor(rewards, dtype=torch.float32).to(self.net.device)
        states_ = torch.tensor(states_, dtype=torch.float32).to(self.net.device)
        dones   = torch.tensor(dones  , dtype=torch.bool   ).to(self.net.device)

        batch_indices = np.arange(self.batch_size, dtype=np.int64)

        q_values   = self.net(states)[batch_indices, actions]

        q_values_  = self.net(states_)
        action_qs_ = torch.max(q_values_, dim=1)[0]
        action_qs_[dones] = 0.0
        q_target = rewards + self.gamma * action_qs_

        td = q_target - q_values

        self.net.optimizer.zero_grad()
        loss = ((td ** 2.0)).mean()
        loss.backward()
        self.net.optimizer.step()

        self.epsilon -= self.epsilon_decay
        if self.epsilon < self.epsilon_min:
            self.epsilon = self.epsilon_min
if __name__ == '__main__':
    env = gym.make('CartPole-v1').unwrapped
    agent = Agent(lr=0.001, state_shape=(4,), num_actions=2)

    high_score = -math.inf
    episode = 0
    while True:
        done = False
        state = env.reset()

        score, frame = 0, 1
        while not done:
            # env.render()

            action = agent.choose_action(state)
            state_, reward, done, info = env.step(action)
            agent.store_memory(state, action, reward, state_, done)
            agent.learn()

            state = state_

            score += reward
            frame += 1

        high_score = max(high_score, score)

        print(( "ep {:4d}: high-score {:12.3f}, "
                "score {:12.3f}, epsilon {:5.3f}").format(
            episode, high_score, score, agent.epsilon))

        episode += 1
At this point this code should look really familiar to you. The only thing I can think of that might be confusing you is state vs state_. I'm not sure why, but state_ is often used as shorthand for next_state. Anyways, now you'll know what that is when you see it. Basically, any time a variable ends in _ it is assumed to be "for the next time step".
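For example, these two lines from the main loop above show the convention in action (the comments are just annotations):

state_, reward, done, info = env.step(action)             # state_ is the state that follows `action`
agent.store_memory(state, action, reward, state_, done)   # one (s, a, r, s', done) transition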
You also might have noticed MSELoss() is gone from the network class. Instead of using the built-in PyTorch loss function, I reimplemented it in learn(). You can see the td error is squared, and then that tensor is averaged with mean().
self.net.optimizer.zero_grad()
loss = ((td ** 2.0)).mean() # this is what MSELoss did.
loss.backward()
self.net.optimizer.step()
Now there is no more magic hidden beneath. It's good to know how things work.
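If you want to convince yourself the two really are the same, here's a tiny standalone sanity-check snippet of my own (not part of the agent):

import torch
import torch.nn as nn

q_values = torch.tensor([1.0, 2.0, 3.0])
q_target = torch.tensor([1.5, 1.0, 3.5])

td = q_target - q_values
manual_loss  = (td ** 2.0).mean()               # the reimplementation from learn()
builtin_loss = nn.MSELoss()(q_values, q_target) # the built-in version
print(manual_loss.item(), builtin_loss.item())  # both print 0.5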
By the way, when you run the code as is, you get typical DQN performance like this.
ep 57: high-score 273.000, score 158.000, epsilon 0.001
ep 58: high-score 273.000, score 167.000, epsilon 0.001
ep 59: high-score 273.000, score 188.000, epsilon 0.001
ep 60: high-score 273.000, score 228.000, epsilon 0.001
ep 61: high-score 273.000, score 222.000, epsilon 0.001
ep 62: high-score 273.000, score 144.000, epsilon 0.001
ep 63: high-score 273.000, score 199.000, epsilon 0.001
ep 64: high-score 300.000, score 300.000, epsilon 0.001
ep 65: high-score 300.000, score 162.000, epsilon 0.001
ep 66: high-score 300.000, score 170.000, epsilon 0.001
ep 67: high-score 312.000, score 312.000, epsilon 0.001
You are a chef. You cook tacos. You are on a quest to cook the most perfect taco that ever existed. You grab a bag of supplies and you go to the kitchen. Also, this is the first taco you've ever made and the first time you have ever cooked. For some reason this doesn't seem like a problem to you, so you turn on the "Taco Food Channel" and start cooking random ingredients while you watch the TV. Twelve hours later you are tired, sweaty, and defeated, and your tacos are terrible. You tried to eat one and threw up. You know your friend is a world-renowned taco chef, so you call him and invite him over for an educational cooking party. The next day he teaches you how to make great tacos. First you watch him make one. Then you try to make one. Then he slaps you, and he makes you watch him make one. Then you try. Then he makes one... This goes on for 36 hours and you emerge from taco bootcamp a taco-making god.
At the moment, the agent is not like taco chef man at all. He never takes time to observe. He only tries to learn while doing. This is easily fixed.
...
counter = 0 # count the lessons
while True:
    done = False
    state = env.reset()

    score, frame = 0, 1
    while not done:
        # env.render()

        action = agent.choose_action(state)
        state_, reward, done, info = env.step(action)
        agent.store_memory(state, action, reward, state_, done)

        if counter % 10 == 0:   # just watch sometimes
            agent.learn()

        state = state_

        score += reward
        frame += 1
        counter += 1 # count count count
This way the agent will take turns with the environment. Sometimes it will sit and learn passively from its past self. And other times it will actively learn by doing. Let's see how the results turn out.
ep 392: high-score 490.000, score 163.000, epsilon 0.001
ep 393: high-score 490.000, score 185.000, epsilon 0.001
ep 394: high-score 490.000, score 187.000, epsilon 0.001
ep 395: high-score 490.000, score 232.000, epsilon 0.001
ep 396: high-score 490.000, score 339.000, epsilon 0.001
It took about 400 episodes for the results to be on par with before. You might think that's pretty good, because the agent only actually trained 1/10th as often; meaning it should get the same scores, but after 10 times as many episodes. However, episodes are different lengths, so they aren't that useful as a metric on their own. What works much better is performance per sample. Let's keep track of how many samples are collected, and how many transitions have been processed by the network.
...
num_samples = 0         # TRANSITIONS COLLECTED
samples_processed = 0   # TRANSITIONS PROCESSED
counter = 0
while True:
    done = False
    state = env.reset()

    score, frame = 0, 1
    while not done:
        # env.render()

        action = agent.choose_action(state)
        state_, reward, done, info = env.step(action)
        agent.store_memory(state, action, reward, state_, done)

        if counter % 1 == 0:    # BACK TO 1
            agent.learn()
            samples_processed += agent.batch_size  # NEW

        state = state_

        score += reward
        frame += 1
        num_samples += 1    # NEW
        counter += 1

    high_score = max(high_score, score)

    ''' SOME NEW PRINTING STUFF '''
    print(( "samples: {}, samps_procd: {}, ep {:4d}: high-score {:12.3f}, "
            "score {:12.3f}, epsilon {:5.3f}").format(
        num_samples, samples_processed, episode,
        high_score, score, agent.epsilon))

    episode += 1
Okay, let's get a new baseline. Notice the interval is set back to 1, so this is normal DQN without the turn-taking.
samples: 4017, samps_procd: 257088, ep 50: high-score 214.000, score 180.000, epsilon 0.001
samples: 4164, samps_procd: 266496, ep 51: high-score 214.000, score 147.000, epsilon 0.001
samples: 4332, samps_procd: 277248, ep 52: high-score 214.000, score 168.000, epsilon 0.001
samples: 4681, samps_procd: 299584, ep 53: high-score 349.000, score 349.000, epsilon 0.001
samples: 4975, samps_procd: 318400, ep 54: high-score 349.000, score 294.000, epsilon 0.001
samples: 5281, samps_procd: 337984, ep 55: high-score 349.000, score 306.000, epsilon 0.001
It should look about the same. And it does. But something new is standing out now as worrisome.
To get to this point the agent pushed about 300,000 transitions through the network but only had about 5,000 collected transitions to work with. That means each transition was reused roughly 60 times on average. In normal machine learning, running 60 epochs over a 5,000-sample dataset would make someone slap you and yell "overfitting". But we are ignorant DRL programmers, so we can get away with it. Part of this has to do with old transitions being framed within a new "reward prediction state of mind". Later on in your life you can think back to your childhood and learn new things from analysing it. This is because your interpretation is different, and your goals have changed. I am unsure whether normal machine learning benefits from this to quite the same degree. Generally the classification, or ground truth, of a sample in a regular ML dataset is fixed. A picture of a dog is still 100% dog a year later, so reviewing the same data over and over isn't quite so useful. But in DRL the q values might have changed since the last exposure to a sample. What was once seen as kind of bad might now be seen as REALLY BAD. Or not as bad. So overfitting in DRL is less of a problem and more of a feature, although it can definitely still happen. You will learn more about overfitting in DRL in the future. :)
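If you want to see that reuse number for your own runs, the two counters we just added are enough to estimate it (a rough average, since newer samples have been reused fewer times than older ones):

reuse_ratio = samples_processed / max(num_samples, 1)
print("each transition was used about {:.0f} times on average".format(reuse_ratio))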
Anyways, back to the watch-learn-watch-learn idea. Set the agent to learn every 10 steps instead to see what changes.
if counter % 10 == 0:   # BACK TO 10
    agent.learn()
    samples_processed += agent.batch_size  # NOT NEW ANYMORE
And results...
samples: 38808, samps_procd: 248384, ep 570: high-score 508.000, score 508.000, epsilon 0.001
samples: 39052, samps_procd: 249984, ep 571: high-score 508.000, score 244.000, epsilon 0.001
samples: 39396, samps_procd: 252160, ep 572: high-score 508.000, score 344.000, epsilon 0.001
samples: 39504, samps_procd: 252864, ep 573: high-score 508.000, score 108.000, epsilon 0.001
samples: 39679, samps_procd: 253952, ep 574: high-score 508.000, score 175.000, epsilon 0.001
samples: 40009, samps_procd: 256064, ep 575: high-score 508.000, score 330.000, epsilon 0.001
samples: 40300, samps_procd: 257920, ep 576: high-score 508.000, score 291.000, epsilon 0.001
samples: 40561, samps_procd: 259648, ep 577: high-score 508.000, score 261.000, epsilon 0.001
The agent jumped from scores around 200 to scores around 500 right away. I ran it a few times and the result is like this each time. Even though learn() ran only once every ten steps, it didn't take 10 times as many episodes, or samples. It required a similar number of samples-processed to reach the same performance. Interestingly, the scores are slightly higher and the number of processed samples is slightly lower. We could attribute this to the "turn-taking" learning strategy that worked for Chef Man, but there is an uncontrolled variable here. Specifically, the number of repeated sample exposures is much lower if you collect more transitions and run learn() much less frequently (roughly 6 reuses per transition here, versus about 60 before). The ingredients are "fresher", if you will.
To reduce that effect some, let's make the agent collect 50,000 frames before it even starts learning.
That way both the normal DQN agent and the version with the periodic learning have similar data freshness.
if counter % 1 == 0:            # back to 1 again
    if num_samples > 50000:     # lots of falling poles...
        agent.learn()
        samples_processed += agent.batch_size
Don't forget to make the replay buffer larger to hold the new samples. I set mine to 100,000.
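That's just the one line in Agent.__init__():

self.memory = ReplayBuffer(mem_size=100000, state_shape=state_shape)   # was 10000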
aaaand run.
samples: 55196, samps_procd: 332480, ep 5112: high-score 210.000, score 159.000, epsilon 0.001
samples: 55410, samps_procd: 346176, ep 5113: high-score 214.000, score 214.000, epsilon 0.001
samples: 55535, samps_procd: 354176, ep 5114: high-score 214.000, score 125.000, epsilon 0.001
samples: 55678, samps_procd: 363328, ep 5115: high-score 214.000, score 143.000, epsilon 0.001
samples: 55905, samps_procd: 377856, ep 5116: high-score 227.000, score 227.000, epsilon 0.001
This is about 80 episodes after it started running learn(). Again, you can see it takes a similar number of samples processed to get to this performance. Interestingly, there were a bunch of episodes where it seemed almost no performance was gained from learn(). Those big stretches of episodes where scores don't really change could be due to the periodic learning just added. The memory will be full of rather simplistic transitions (the pole falling in the center of the play area) before the agent really begins to learn. Those transitions are not representative of the new circumstances the more-trained agent finds itself in (the pole falling near the edges of the play area). It takes time for those more relevant samples to become less of a minority in the replay buffer, and therefore more likely to be chosen for the learn batch. That's good evidence that setting the "minimum samples collected" requirement too high could actually make learning much slower if paired with a giant replay buffer that takes forever to be "cleansed", or become "relevant".
On a side note, because epsilon is decayed inside learn(), this experiment keeps the exploration schedule as a dependent variable: epsilon only decreases on learn steps, not on collection steps.
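If you wanted to control for that too, one option (not something done for the runs in this tutorial) is to move the epsilon decay out of learn() and into the collection loop, so exploration shrinks per environment step no matter how often learn() runs. A minimal sketch:

# inside the "while not done:" loop, right after agent.store_memory(...):
agent.epsilon = max(agent.epsilon - agent.epsilon_decay, agent.epsilon_min)
# ...and delete the three epsilon lines at the bottom of learn()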
Alright. Now back to if counter % 10 == 0:
samples: 93908, samps_procd: 280960, ep 5736: high-score 420.000, score 140.000, epsilon 0.001
samples: 94122, samps_procd: 282368, ep 5737: high-score 420.000, score 214.000, epsilon 0.001
samples: 94363, samps_procd: 283904, ep 5738: high-score 420.000, score 241.000, epsilon 0.001
samples: 94625, samps_procd: 285568, ep 5739: high-score 420.000, score 262.000, epsilon 0.001
# 420 haha
I hope by now you can see the pattern. If you have enough data, and fresh data, to work with, performance in DRL largely depends on the number of samples processed. The performance gained per sample processed is called the sample efficiency of an agent. This is generally dependent upon the network architecture, the add-ons, and the loss function. Whereas learning stability depends more on the policy not changing too much or too frequently, and on the samples being consistently relevant to the agent's current circumstances. Too much instability can lower the sample efficiency of the agent, so they aren't entirely separate aspects of learning, but it can help to look independently for sources of either.
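If you want a rough number for sample efficiency while you experiment, a running average of recent scores divided by samples processed does the job. A minimal sketch (the running-average bookkeeping is my own addition; the counters are the ones from the main loop above):

recent_scores = []

# ...at the end of each episode, next to the existing print:
recent_scores.append(score)
recent_scores = recent_scores[-20:]     # keep only the last 20 episodes
avg_score = sum(recent_scores) / len(recent_scores)
print("avg score {:8.2f} after {:8d} samples processed ({:.3f} score per 1k samples)".format(
    avg_score, samples_processed, 1000.0 * avg_score / max(samples_processed, 1)))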
Let's consider sample "relevancy" in the context of our watch-learn-watch change. Obviously, to learn something you want to study things relevant to your current understanding, and part of that means adapting what you are studying to the moving horizon of your ignorance. In school, when you get better at multiplication, you have to move on to harder multiplication problems, or different problems, or you won't get any better. One result of running learn() only periodically is that the practice material isn't pushed forward as often. The "flash cards" are much less diverse, which was intended, but this slows down the agent "moving on" to the next concept. Every time the policy changes, the agent has to wait for the memory to fill with more current samples, and with a bigger memory that takes longer to happen. We made the flashcards more consistent, but they also stay stale longer. The end result isn't fantastic. There are a few one-off higher scores that show up earlier, but the average scores are similar per sample processed.
Collecting a bunch of samples before letting the agent learn is a reasonable way to prevent early overfitting, but periodic learning isn't as reasonable. If the environment is really dynamic it could make sense, but generally it's not a technique I often see used. That's usually because the environment is expensive to run. As you can imagine, if you are trying to make the agent learn things in the real world, you can't run the environment at 1000 times speed. This downside is just too big. Besides, who likes having to wait 10 times as long for improved performance, if the performance is only a tiny bit better? I would rather save that tool for a case where the instability prevents learning entirely. What if there was a way to ensure the environment has time to collect fresh data before running learn(), still push the next lesson forward, and not waste all that time just spinning the environment?
Luckily for you there is a way. Actually, there are an infinite number of ways. This is one of them.
What if, instead of only learning every ten steps, we add another neural network and train it in the "off time", then swap it in after 10 steps? This could be the best of both worlds.
First, add another network, plus a step counter and an update interval.
class Agent():
    def __init__(self, lr, state_shape, num_actions):
        self.net = Network(lr, state_shape, num_actions)
        self.future_net = Network(lr, state_shape, num_actions)    # another network
        self.memory = ReplayBuffer(mem_size=100000, state_shape=state_shape)

        self.batch_size = 64
        self.gamma = 0.99
        self.epsilon = 0.1
        self.epsilon_decay = 0.00005
        self.epsilon_min = 0.001

        self.learn_step_counter = 0     # increments each learn()
        self.net_copy_interval = 10     # NEW
Then copy the future_net's weights into the main net every n training steps.
def learn(self):
    ...

    if self.learn_step_counter % self.net_copy_interval == 0:
        self.net.load_state_dict(self.future_net.state_dict())     # state_dict() packs the
                                                                    # net params into a dict
    self.learn_step_counter += 1
Okay, the final step is to have the actions chosen by the main network while the learning happens to the future_net. Luckily for us, the actions are already chosen by the main network in choose_action().
def learn(self):
    ...

    batch_indices = np.arange(self.batch_size, dtype=np.int64)

    q_values   = self.future_net(states)[batch_indices, actions]

    q_values_  = self.future_net(states_)
    action_qs_ = torch.max(q_values_, dim=1)[0]
    action_qs_[dones] = 0.0
    q_target = rewards + self.gamma * action_qs_

    td = q_target - q_values

    self.future_net.optimizer.zero_grad()
    loss = ((td ** 2.0)).mean()
    loss.backward()
    self.future_net.optimizer.step()

    ...
Really you just swap out net for future_net every time it shows up in learn(). I kind of hope you knew that already. :^) Make sure to get all of them. Also go remove that minimum-sample-collection stuff in the main loop for now.
Time to test it out.
# run 1
samples: 1647, samps_procd: 105408, ep 28: high-score 157.000, score 157.000, epsilon 0.021
samples: 1769, samps_procd: 113216, ep 29: high-score 157.000, score 122.000, epsilon 0.015
samples: 1970, samps_procd: 126080, ep 30: high-score 201.000, score 201.000, epsilon 0.005
samples: 2134, samps_procd: 136576, ep 31: high-score 201.000, score 164.000, epsilon 0.001
samples: 2252, samps_procd: 144128, ep 32: high-score 201.000, score 118.000, epsilon 0.001
samples: 2455, samps_procd: 157120, ep 33: high-score 203.000, score 203.000, epsilon 0.001
# run 2
samples: 2066, samps_procd: 132224, ep 50: high-score 148.000, score 110.000, epsilon 0.001
samples: 2167, samps_procd: 138688, ep 51: high-score 148.000, score 101.000, epsilon 0.001
samples: 2423, samps_procd: 155072, ep 52: high-score 256.000, score 256.000, epsilon 0.001
samples: 2716, samps_procd: 173824, ep 53: high-score 293.000, score 293.000, epsilon 0.001
samples: 2837, samps_procd: 181568, ep 54: high-score 293.000, score 121.000, epsilon 0.001
# run 3
samples: 2910, samps_procd: 186240, ep 51: high-score 170.000, score 170.000, epsilon 0.001
samples: 3018, samps_procd: 193152, ep 52: high-score 170.000, score 108.000, epsilon 0.001
samples: 3225, samps_procd: 206400, ep 53: high-score 207.000, score 207.000, epsilon 0.001
samples: 3405, samps_procd: 217920, ep 54: high-score 207.000, score 180.000, epsilon 0.001
samples: 3586, samps_procd: 229504, ep 55: high-score 207.000, score 181.000, epsilon 0.001
samples: 3789, samps_procd: 242496, ep 56: high-score 207.000, score 203.000, epsilon 0.001
This looks pretty good. On average the sample efficiency is obviously better, and you don't have to waste time letting the environment spin. The only thing that really changed here is how often the policy changes. There are downsides, though. For a more complicated environment that requires frequent policy changes, it will take a lot more tries to push the policy into the right place. The advantage is that the path to the current policy should be refined more easily. That is, it won't be interrupted by so many irrelevant decision changes along the way. It's like making sure to use the same brand of tortilla for a while to get a good feel for how to cook with it, as opposed to buying a new brand of tortilla every time you make a single taco.
There's a bit of a problem with the td function like this, though. In the td function, the present q values and the future q values are both expected to be as close as possible to the ground truth (if everything is working ideally). In our current implementation, that's impossible. While actions in choose_action() are chosen according to the old network, in learn() the future actions are chosen by the future_net. That doesn't really make sense. In fact, it's only correct to do this 1/10th of the time: right after the parameter copy. The future actions should be chosen by the old network too. The future_net is now responsible for the value of the future, not for the current policy. Our original intention was to delay policy changes anyways. Let's try fixing this by making the old net pick the actions.
def learn(self):
    ...

    batch_indices = np.arange(self.batch_size, dtype=np.int64)

    q_values = self.future_net(states)[batch_indices, actions]     # future net

    ''' let the old net pick the future actions,
        but let the future_net choose their value '''
    q_values_  = self.future_net(states_)               # future net picks the values
    actions_   = self.net(states_).max(dim=1)[1]        # old net picks the actions
    action_qs_ = q_values_[batch_indices, actions_]     # extract q values according to the old
                                                        # net's actions from the future net's values
    action_qs_[dones] = 0.0
    q_target = rewards + self.gamma * action_qs_

    td = q_target - q_values

    self.future_net.optimizer.zero_grad()
    loss = ((td ** 2.0)).mean()
    loss.backward()
    self.future_net.optimizer.step()
Run it a few times.
samples: 5909, samps_procd: 378176, ep 158: high-score 276.000, score 145.000, epsilon 0.001
samples: 6160, samps_procd: 394240, ep 159: high-score 276.000, score 251.000, epsilon 0.001
samples: 6290, samps_procd: 402560, ep 160: high-score 276.000, score 130.000, epsilon 0.001
samples: 6557, samps_procd: 419648, ep 161: high-score 276.000, score 267.000, epsilon 0.001
samples: 6714, samps_procd: 429696, ep 162: high-score 276.000, score 157.000, epsilon 0.001
Those of you familiar with the math might find it amazing that this works at all. Actually, you can see that the sample efficiency went down a bit, and I cherry-picked one of the better runs. If you run it a few times you will find it learns at least twice as slowly as normal DQN. However, eventually the scores go up into the thousands just like usual. The q values are learned just the same. As long as the predictions are on both sides of the td function, the future prediction and the present prediction, the base reward signal will slowly pull them together. Here's proof.
samples: 41042, samps_procd: 2626688, ep 523: high-score 1451.000, score 1187.000, epsilon 0.001
samples: 41330, samps_procd: 2645120, ep 524: high-score 1451.000, score 288.000, epsilon 0.001
samples: 42242, samps_procd: 2703488, ep 525: high-score 1451.000, score 912.000, epsilon 0.001
samples: 43444, samps_procd: 2780416, ep 526: high-score 1451.000, score 1202.000, epsilon 0.001
samples: 52037, samps_procd: 3330368, ep 527: high-score 8593.000, score 8593.000, epsilon 0.001
samples: 61511, samps_procd: 3936704, ep 528: high-score 9474.000, score 9474.000, epsilon 0.001
# there was no next episode. It was still balancing the cartpole for another 20 minutes or so.
# legend has it that somewhere in cyber space the little agent that could is still balancing the pole.
# that legend is full of shit because i hit ctrl-c and murdered him.
So the scores are fine, but look at that sample efficiency. It's terrible. Why does this agent learn so slowly? The issue is probably that the policy changes on a massive delay. Action selection doesn't adapt to the agent's new interpretations until quite a bit later than before. Actually... it wouldn't be unreasonable to guess this would be slightly more stable, since data freshness is more likely, but about 10 times slower to learn per sample. What we need is for the agent to adapt to the present quickly, but to be slow and calculated in its predictions of the future. Maybe future_net really does need to be responsible for the value of the future alone. Let's try that.
def learn(self):
    if self.memory.mem_count < self.batch_size:
        return

    states, actions, rewards, states_, dones = self.memory.sample(self.batch_size)

    states  = torch.tensor(states , dtype=torch.float32).to(self.net.device)
    actions = torch.tensor(actions, dtype=torch.long   ).to(self.net.device)
    rewards = torch.tensor(rewards, dtype=torch.float32).to(self.net.device)
    states_ = torch.tensor(states_, dtype=torch.float32).to(self.net.device)
    dones   = torch.tensor(dones  , dtype=torch.bool   ).to(self.net.device)

    batch_indices = np.arange(self.batch_size, dtype=np.int64)

    q_values = self.net(states)[batch_indices, actions]    # "now" net evaluates the present

    ''' let the "now" net pick the future actions,
        but let the future_net choose their value '''
    q_values_  = self.future_net(states_)           # future_net
    actions_   = self.net(states_).max(dim=1)[1]    # now net
    action_qs_ = q_values_[batch_indices, actions_]
    action_qs_[dones] = 0.0
    q_target = rewards + self.gamma * action_qs_

    td = q_target - q_values

    self.net.optimizer.zero_grad()      # no more future_net learning
    loss = ((td ** 2.0)).mean()
    loss.backward()
    self.net.optimizer.step()           # no more future_net learning

    self.epsilon -= self.epsilon_decay
    if self.epsilon < self.epsilon_min:
        self.epsilon = self.epsilon_min

    ''' notice this copy goes the other direction now '''
    if self.learn_step_counter % self.net_copy_interval == 0:
        self.future_net.load_state_dict(self.net.state_dict())     # LINE CHANGED
    self.learn_step_counter += 1
Results:
samples: 552, samps_procd: 35328, ep 22: high-score 105.000, score 105.000, epsilon 0.076
samples: 865, samps_procd: 55360, ep 23: high-score 313.000, score 313.000, epsilon 0.060
samples: 1105, samps_procd: 70720, ep 24: high-score 313.000, score 240.000, epsilon 0.048
samples: 1346, samps_procd: 86144, ep 25: high-score 313.000, score 241.000, epsilon 0.036
samples: 1585, samps_procd: 101440, ep 26: high-score 313.000, score 239.000, epsilon 0.024
The sample efficiency has improved substantially. I ran it lots of times and it usually gets to these score levels around 100-150k samples processed, sometimes before 100k even. So this is a success. The agent learns and responds to changes in the short term, doesn't spin the environment, and freezes its values of the future to interrupt the feedback loop. Though, it doesn't involve as much freezing as we originally thought might be helpful. What you have now is the Double Deep Q Learning algorithm.
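To make the difference explicit, here are the two target calculations side by side, written with the variable names from learn() above. The first line is what a plain target-network DQN would do; the pair after it is what our final code does:

# plain DQN with a target network: future_net both picks and values the next action
action_qs_ = torch.max(self.future_net(states_), dim=1)[0]

# double DQN (our final version): the "now" net picks the action, future_net values it
actions_   = self.net(states_).max(dim=1)[1]
action_qs_ = self.future_net(states_)[batch_indices, actions_]

# either way, the target is built the same way afterwards:
q_target = rewards + self.gamma * action_qs_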
You thought you were done here and that you had an algo ready for primetime, didn't you? Turns out there's some more to read. What I explained above about stopping feedback loops is the right concept, and applies to every aspect of DRL. But that roundabout conception just isn't how DQN came to be...