I'll be honest with you. The Deep Q Learning Agent we made in the last tutorial was never gonna work.
I wish I could have seen your disappointed face after all that effort. :^)
Fear not. It was not a waste of time.
With just a few changes we can make it great.
Okay, so when you ran it, the terminal looked like this:
...
ep 76: high-score 48.000, score 9.000, last-episode-time 10
ep 77: high-score 48.000, score 10.000, last-episode-time 11
ep 78: high-score 48.000, score 10.000, last-episode-time 11
ep 79: high-score 48.000, score 10.000, last-episode-time 11
ep 80: high-score 48.000, score 9.000, last-episode-time 10
ep 81: high-score 48.000, score 10.000, last-episode-time 11
ep 82: high-score 48.000, score 10.000, last-episode-time 11
ep 83: high-score 48.000, score 10.000, last-episode-time 11
ep 84: high-score 48.000, score 9.000, last-episode-time 10
ep 85: high-score 48.000, score 10.000, last-episode-time 11
ep 86: high-score 48.000, score 10.000, last-episode-time 11
ep 87: high-score 48.000, score 9.000, last-episode-time 10
ep 88: high-score 48.000, score 10.000, last-episode-time 11
ep 89: high-score 48.000, score 9.000, last-episode-time 10
ep 90: high-score 48.000, score 9.000, last-episode-time 10
ep 91: high-score 48.000, score 9.000, last-episode-time 10
ep 92: high-score 48.000, score 8.000, last-episode-time 9
ep 93: high-score 48.000, score 9.000, last-episode-time 10
ep 94: high-score 48.000, score 9.000, last-episode-time 10
ep 95: high-score 48.000, score 9.000, last-episode-time 10
ep 96: high-score 48.000, score 9.000, last-episode-time 10
...
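The environment gives 1 point for every frame the pole doesn't drop beyond a critical angle. Those scores are terrible. A good score is around 200 points. 9 might even be the minimum possible score.
For comparison, let's see how an agent that picks actions completely at random does. Comment out the network stuff and flip a coin instead (you'll need import random at the top of the file):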
def chooseAction(self, observation):
    # state = torch.tensor(observation).float().detach()
    # state = state.to(self.network.device)
    # state = state.unsqueeze(0)
    # qValues = self.network(state)
    # action = torch.argmax(qValues).item()
    action = random.randint(0, 1)
    return action
... and run it ...
ep 0: high-score 10.000, score 10.000, last-episode-time 11
ep 1: high-score 12.000, score 12.000, last-episode-time 13
ep 2: high-score 27.000, score 27.000, last-episode-time 28
ep 3: high-score 27.000, score 13.000, last-episode-time 14
ep 4: high-score 27.000, score 18.000, last-episode-time 19
ep 5: high-score 27.000, score 17.000, last-episode-time 18
ep 6: high-score 27.000, score 12.000, last-episode-time 13
ep 7: high-score 27.000, score 15.000, last-episode-time 16
ep 8: high-score 27.000, score 10.000, last-episode-time 11
ep 9: high-score 27.000, score 15.000, last-episode-time 16
ep 10: high-score 27.000, score 19.000, last-episode-time 20
ep 11: high-score 36.000, score 36.000, last-episode-time 37
ep 12: high-score 36.000, score 12.000, last-episode-time 13
ep 13: high-score 57.000, score 57.000, last-episode-time 58
ep 14: high-score 57.000, score 9.000, last-episode-time 10
ep 15: high-score 57.000, score 19.000, last-episode-time 20
ep 16: high-score 57.000, score 15.000, last-episode-time 16
ep 17: high-score 57.000, score 20.000, last-episode-time 21
ep 18: high-score 57.000, score 13.000, last-episode-time 14
ep 19: high-score 57.000, score 13.000, last-episode-time 14
ep 20: high-score 57.000, score 16.000, last-episode-time 17
ep 21: high-score 57.000, score 56.000, last-episode-time 57
Look at those scores. They are substantially higher than our learning agent's.
Okay so what gives?
It's time to investigate what is going wrong. Undo the random action selection stuff. Then let's see which actions the agent is taking.
def chooseAction(self, observation):
    state = torch.tensor(observation).float().detach()
    state = state.to(self.network.device)
    state = state.unsqueeze(0)
    qValues = self.network(state)
    action = torch.argmax(qValues).item()
    print(action) # here
    return action
... and run ...
...
ep 5: high-score 11.000, score 11.000, last-episode-time 12
0
0
0
0
0
0
0
0
0
ep 6: high-score 11.000, score 9.000, last-episode-time 10
0
0
0
0
0
0
0
0
0
ep 7: high-score 11.000, score 9.000, last-episode-time 10
0
0
0
0
0
0
0
0
0
...
So the agent picks the same action over and over. That sucks. Rerun it a few times to see if the action is different between runs.
Okay I did that. It seems to just pick one action at random early on and mash that one.
Alright, time for more prints. Let's look at the QValues.
Is it picking correctly? Do they look sane?
def chooseAction(self, observation):
    state = torch.tensor(observation).float().detach()
    state = state.to(self.network.device)
    state = state.unsqueeze(0)
    qValues = self.network(state)
    action = torch.argmax(qValues).item()
    print("qValues: {}, action {}".format(qValues, action)) # happy printing
    return action
you know the drill
qValues: tensor([[-0.0304, -0.1123]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.5467, -0.1014]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1103, -0.1368]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.4888, -0.1696]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.4374, -0.1851]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0218, -0.1893]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.4915, -0.1777]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.6207, -0.1962]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1235, -0.2413]], grad_fn=<AddmmBackward>), action 0
ep 0: high-score 9.000, score 9.000, last-episode-time 10
qValues: tensor([[ 1.3182, -0.1829]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.2014, -0.1902]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1133, -0.2033]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1132, -0.2105]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1815, -0.2261]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.2351, -0.2414]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.2081, -0.2509]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0592, -0.2488]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.8426, -0.2346]], grad_fn=<AddmmBackward>), action 0
...
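By the way, that grad_fn=<AddmmBackward> it keeps printing out is autograd's record of the last operation that produced the tensor. PyTorch keeps it around so it can compute gradients back through that operation later.
The last thing our network does to the tensor is presumably its final linear layer, which PyTorch computes as a fused add and matrix multiply (addmm). Hence the name.
You can hide that weird print by calling .detach() on the tensor when you print it. That works because detach returns a tensor that is cut off from gradient tracking.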
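Here is a tiny sketch of what that print is showing. It assumes a stand-in network that is just one torch.nn.Linear layer with made-up sizes (4 inputs, 2 outputs), not the tutorial's actual network. The exact grad_fn name can differ a little between PyTorch versions.

```python
import torch

# Hypothetical stand-in for the tutorial's network: 4 state values in, 2 QValues out.
layer = torch.nn.Linear(4, 2)
state = torch.zeros(1, 4)

qValues = layer(state)

print(qValues)           # the repr carries a grad_fn because autograd is tracking this tensor
print(qValues.grad_fn)   # the backward node for the operation that produced qValues
print(qValues.detach())  # same numbers, but no grad_fn in the repr
```

Detaching only changes what you print here. The original qValues tensor is untouched and can still be used for training.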
Anyways, the q values start kinda random. They look like random floats. That makes sense, because the weights of the network are random. If we skip ahead to a future episode:
...
ep 29: high-score 11.000, score 9.000, last-episode-time 10
qValues: tensor([[ 0.9909, -0.3292]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9975, -0.3475]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9999, -0.3630]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9931, -0.3728]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9912, -0.3755]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9945, -0.3801]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0003, -0.3881]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9978, -0.3959]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9969, -0.4068]], grad_fn=<AddmmBackward>), action 0
ep 30: high-score 11.000, score 9.000, last-episode-time 10
qValues: tensor([[ 1.0100, -0.3460]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0139, -0.3623]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0163, -0.3759]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0087, -0.3879]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0066, -0.3916]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0061, -0.3951]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0039, -0.3992]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9966, -0.4057]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9908, -0.4150]], grad_fn=<AddmmBackward>), action 0
One action's QValue seems to be approaching 1.
The other one seems stable, and hasn't deviated too far from where it started.
Let's try printing out the reward now. You're a big boy, you can figure out how by yourself this time.
...
ep 31: high-score 11.000, score 10.000, last-episode-time 11
qValues: tensor([[-0.0982, 1.0357]]), action 1
reward 1.0
qValues: tensor([[-0.0637, 1.0319]]), action 1
reward 1.0
qValues: tensor([[-0.0530, 1.0257]]), action 1
reward 1.0
qValues: tensor([[-0.0456, 1.0255]]), action 1
reward 1.0
qValues: tensor([[-0.0410, 1.0269]]), action 1
reward 1.0
qValues: tensor([[-0.0491, 1.0256]]), action 1
reward 1.0
qValues: tensor([[-0.0606, 1.0219]]), action 1
reward 1.0
...
It looks like the reward is 1.0 every frame, and the action the agent chooses has a QValue near 1.0. This is good. That means it is learning to correctly estimate the value of the action it tries all the time.
And now we know why it doesn't pick the other action. It starts with one QValue randomly higher than the other. (If you pick 2 random numbers, one has to be higher.) The higher QValue is then raised to approximate the reward, which is 1.0. And since it was raised to 1.0, and the other QValue is probably less than 1.0, when the next frame comes around the agent will just pick the same action. This never ends.
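You can watch that lock-in happen without any neural network. The sketch below is a simplification: it swaps the network for a two-entry table, uses a made-up learning rate, and starts from made-up random QValues. Because it is a table, the untouched QValue stays exactly where it started instead of drifting a little like the real network's does.

```python
import random

def runGreedy(steps=200, learningRate=0.1, seed=0):
    rng = random.Random(seed)
    # Two made-up starting QValues standing in for the untrained network's outputs.
    qValues = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
    chosen = []
    for _ in range(steps):
        # Pure argmax, no exploration, just like chooseAction.
        action = 0 if qValues[0] >= qValues[1] else 1
        reward = 1.0  # cartpole pays 1.0 on every step
        # Only the chosen action's estimate moves toward the reward it received.
        qValues[action] += learningRate * (reward - qValues[action])
        chosen.append(action)
    return qValues, chosen

qValues, chosen = runGreedy()
print("final qValues:", qValues)
print("actions ever taken:", set(chosen))
```

Whichever action starts out ahead climbs toward 1.0 and is the only one that ever gets picked.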
I want one thing to be clear: Even though the performance isn't improving from the perspective of
balancing the pole, the network is not malfunctioning. It is doing exactly what we want it to.
The network is accurately predicting the reward it will get for the action it has "studied".
The reward prediction for the action it has not been taking is obviously wrong though.
The environment returns a reward of 1.0 every step of an episode, so both actions should have value predictions around 1.0. The agent
just hasn't been providing training data for the action that it never picks. And so even though the
network is functioning effectively, and exactly as intended, it isn't being trained on the right material.
Fundamentally this is an issue of not exploring enough.
If the agent were trying both actions, then both QValues would be right
around 1.0. Then the agent would not just pick one action. Its decisions would be better.
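To check that claim about both QValues landing near 1.0, here is a rough sketch that uses a two-entry table in place of the network, with a made-up learning rate and made-up starting values. The only thing that matters is that both actions get tried.

```python
import random

rng = random.Random(0)
# Made-up starting QValues standing in for an untrained network.
qValues = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
learningRate = 0.1

for _ in range(200):
    action = rng.randint(0, 1)  # try both actions instead of always taking the argmax
    reward = 1.0                # cartpole pays 1.0 on every step
    qValues[action] += learningRate * (reward - qValues[action])

print(qValues)
```

Both estimates end up close to 1.0, because both actions finally got training data.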
You've got lots of options to fix this. But all of them involve making our agent pick more diverse actions.
Here's the general idea. We are going to hack the reward such that the agent does what we want it to. (Or what we think we want it to...) If the action causes the agent to fail, the estimated reward for that action should go down, right? Instead of letting the agent figure that out we can just spank it with our own abusive hands.
...
while not done:
    env.render()
    action = agent.chooseAction(state)
    state_, reward, done, info = env.step(action)
    if done: # if you dropped the pole
        reward = -50.0 # that action was bad
    agent.learn(state, reward) # and you should feel bad
    state = state_
...
Nice. Try it out.
...
qValues: tensor([[0.0366, 0.5643]]), action 1
qValues: tensor([[0.0642, 0.6074]]), action 1
qValues: tensor([[0.0559, 1.0360]]), action 1
qValues: tensor([[0.0471, 1.4809]]), action 1
ep 0: high-score -41.000, score -41.000, last-episode-time 11
qValues: tensor([[-0.1129, 1.4063]]), action 1
qValues: tensor([[-0.0601, 1.0136]]), action 1
qValues: tensor([[0.0054, 0.5944]]), action 1
qValues: tensor([[0.0532, 0.2115]]), action 1
qValues: tensor([[ 0.1043, -0.1176]]), action 0
qValues: tensor([[ 0.1087, -0.0426]]), action 0
qValues: tensor([[0.1207, 0.1440]]), action 1
qValues: tensor([[ 0.1917, -0.2176]]), action 0
qValues: tensor([[0.2152, 0.0013]]), action 0
qValues: tensor([[0.2579, 0.3355]]), action 1
qValues: tensor([[ 0.3185, -0.1100]]), action 0
...
Hey, maybe it worked! It looks like it learned that action 1 is bad. The QValue for
action 1 went down, then it picked action 0 instead, and then switched back and forth forever. ))<>((
Let's wait and see if it improves more...
...
ep 143: high-score 26.000, score -41.000, last-episode-time 11
qValues: tensor([[-13.1998, 1.9950]]), action 1
qValues: tensor([[-12.7037, 1.8748]]), action 1
qValues: tensor([[-12.9340, 1.8002]]), action 1
qValues: tensor([[-13.6834, 1.8688]]), action 1
qValues: tensor([[-14.9132, 2.5243]]), action 1
qValues: tensor([[-17.2499, 3.3457]]), action 1
qValues: tensor([[-21.3560, 2.6179]]), action 1
qValues: tensor([[-27.4956, -1.5666]]), action 1
qValues: tensor([[-36.4758, -18.5921]]), action 1
qValues: tensor([[-45.7820, -35.8122]]), action 1
qValues: tensor([[-55.1387, -50.7749]]), action 1
...
aaand it gets stuck again picking one action. Also that score was -41. It got harder to interpret.
Check out the reward predictions as it gets close to the end of the episode.
They become more and more negative. It learns that all actions are gonna be bad.
It's learned the futility of life. (a sign of real intelligence)
Anyways, our strategy didn't work. If you keep running it, you will see it doesn't get better.
Maybe -50.0 is just too big, or too vague. What if we tried to make a more specific reward that really encourages the
network to make the right decisions? We can extract the position and velocity of the cart from the environment state.
And make a really specific reward function.
Here is an example in pseudocode.
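# cartpole's state is [cart position, cart velocity, pole angle, pole angular velocity]
cartPos, cartVel, poleAngle, poleAngleVel = state
# condition if cart is moving left and pole is falling left
if cartVel < 0.0 and poleAngleVel < 0.0:
    if action == 0: # push left, back under the pole
        reward = 10.0
    else: # action == 1
        reward = -10.0
# condition if cart is moving left but pole is falling right
if cartVel < 0.0 and poleAngleVel > 0.0:
    if action == 1: # push right, back under the pole
        reward = 10.0
    else: # action == 0
        reward = -10.0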
...
I think you see where this is going.
This strategy is guaranteed to work in cartpole. All you have to do is keep getting more and more refined with your conditions, and the agent will win the game. But before you go spend an hour making a custom reward function, make sure to read the next section...
Don't do this.
This is called "Reward Shaping" and it is really bad.
Now you might be thinking, "Why is it bad? I want my bot to work. I just want it to do
the task, so who cares how I achieve that?"
I hope I can talk you out of it.
Is the network really learning how to play the game? Or are you? The whole point of deep reinforcement learning is that we should have an agent that learns the task. If your reward function does all the work, then you can just remove the neural network entirely. The policy will just be the conditions you make.
A good learning agent doesn't require your human-curated rewards. It is extremely unlikely that the conditions you come up with will be better than the policy of an actual learning agent. You want the benefits of AI, don't you?
The reward shape you design only works for a specific environment. You would have to do it all over again each time you changed the environment. Again, it just defeats the point of the DRL agent entirely.
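To make the "you can just remove the neural network" point concrete, here is a sketch of what your reward conditions turn into once you cut out the middleman. The rule is a made-up example, and handCodedPolicy is a hypothetical name. I am not claiming this rule balances the pole.

```python
def handCodedPolicy(observation):
    # CartPole observations are [cart position, cart velocity, pole angle, pole angular velocity].
    cartPos, cartVel, poleAngle, poleAngleVel = observation
    # Made-up rule: push the cart toward the side the pole is rotating to.
    if poleAngleVel < 0.0:
        return 0  # push left
    return 1      # push right

print(handCodedPolicy([0.0, -0.1, 0.02, -0.3]))  # pole rotating left
print(handCodedPolicy([0.0, -0.1, 0.02, 0.4]))   # pole rotating right
```

No network, no training, no learning. Every decision in there was made by you.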
You've probably only seen cartpole or lunar lander at this point,
but even amongst the simple environments the reward function necessary to successfully
win the game will be really complicated.
And I'm not talking about an optimal reward shape. I just said "win the game". That means
minimum win condition.
Imagine trying to do reward shaping to make the agent beat Zelda or Mario, or navigate
a room in real life. You might be able to do it with a team of people working nonstop for years.
People have tried. Even under the best circumstances you would likely still fail.
A neural network is a function approximator. Ideally it resembles a function smoothly.
Your complicated reward function is probably full of local minima and maxima.
The network will find them if they exist. And it will get stuck in them, figuring out ways to
exploit your function for massive rewards that you never intended for, and then never
accomplishing the task you wanted it to do in the first place.
It is actually really difficult to make a
reward function that doesn't have traps in it. Most of the OpenAI Gym environments have a
fairly carefully created reward function. If you go look at the lunar-lander reward code
in the Gym repo, you'll see what I mean. It's complicated and deliberate. Each fractional
reward is scaled in such a way to minimize reward gaming.
Imagine that you are an agent.
Consider cartpole. Are 50 good actions worth one deadly action? We kind of made that claim
in our reward function above, didn't we? Should it be 200 since 200 is a winning score? How can we know?
When I was a noob I made a snake environment and I did a lot of reward shaping.
The snake agent carefully calculated how many apples were worth a game over, and would meticulously
plan its path such that it would suicide in a specific number of turns such that the apple was worth it.
Even though... I just would prefer it to survive and get the apple instead.
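The snake's math is easy to reproduce. This is a sketch with invented reward numbers (none of them come from a real environment or from my old snake code), just to show how a shaped reward can make dying look like the smart move.

```python
# Hypothetical shaped rewards for a snake game.
APPLE_REWARD = 100.0
STEP_REWARD = -1.0    # small cost per move, meant to discourage wandering
DEATH_REWARD = -50.0

def episodeReturn(steps, apples, died):
    # Total shaped reward collected over one episode.
    total = steps * STEP_REWARD + apples * APPLE_REWARD
    if died:
        total += DEATH_REWARD
    return total

# Plan A: take the long, safe route to the apple and stay alive.
safe = episodeReturn(steps=80, apples=1, died=False)
# Plan B: dive straight at the apple and die right after eating it.
reckless = episodeReturn(steps=5, apples=1, died=True)

print(safe, reckless)  # 20.0 45.0
```

Under these numbers the reckless plan scores higher, so that is the plan a reward-maximizing agent will learn.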
There's an image somewhere of some giant reward shaping equation written for a paper that was just attempting to get one of the OpenAI Gym robots to do a backflip. The equation is big and looks like it is full of calculus. I have no idea how it works, and can only imagine how long it took to make it. (Also it's suboptimal)
When I was new to deep reinforcement learning I spent about a week on an agent
that plays snake. I carefully tweaked the magnitude of each
sub reward in my reward conditions to try and raise the performance. At one point I even considered letting a
neural network learn the optimal reward magnitudes, but then I realized how stupid that
was because a good agent would have done that without my reward shaping anyways.
I spent almost an entire week on reward shaping snake by the way. Just think about that. Just the
reward shaping. That's what I did. For like 6 hours a day. Don't be me.
At the end of that it turned out I just had a bug in my agent code. Once I fixed that bug, it
worked great without the reward shaping.
When other people make agents to solve the same environment as you,
you want to be able to measure the performance of their agents against yours.
If everyone is making custom reward functions, then is their agent
better than yours or not? It becomes unclear. It could be that none of the
parameters or architecture of their agent matters at all, and it's just their reward function
pulling all the weight.
So for the most part everyone agrees to not do reward shaping, so that
the results they share are comparable in the most basic way.
Actually even just the numbers become different when you shape. You'll notice in our
specific case, once we added the done penalty the episode rewards started
totaling up to negative numbers. It wasn't like that before. People looking at our results
would be really confused. Is 200 still a good score in our reward shape? Is 9 a bad score?
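For what it's worth, you can decode our shaped scores by hand. This sketch assumes the training loop overwrites the reward with -50.0 on whichever step reports done, and that the environment also reports done when it hits its step cap (200 here is an assumption, taken from the "good score" mentioned earlier). shapedScore is a made-up helper name.

```python
def shapedScore(steps, donePenalty=-50.0, stepReward=1.0):
    # Every step pays stepReward, except the final one, which gets replaced by donePenalty.
    return (steps - 1) * stepReward + donePenalty

print(shapedScore(10))   # 9 + (-50) = -41.0, the score from the log
print(shapedScore(200))  # 199 + (-50) = 149.0
```

So under those assumptions even a perfect 200-step episode would show up as 149, and nobody reading the log would know that without seeing the loop.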
Anyways, I hope you don't plan on doing it.
You wanted to make the agent work didn't you?
What were we doing again? Oh yeah the issue was our agent wasn't exploring different actions.
And somehow we have to accomplish that without rigging the reward.
How about we try a real action exploration strategy in the next tutorial?