Weg's Tutorials

Deep Q Learning Tutorial 2:

Electric Boogaloo

How To Make A Pessimistic Bastard

Prerequisites

You Lied To Me

I'll be honest with you. The Deep Q Learning Agent we made in the last tutorial was never gonna work.
I wish I could have seen your disappointed face after all that effort. :^)

Fear not. It was not a waste of time.
With just a few changes we can make it great.

Okay, so when you ran it, the terminal looked like this:

...
ep 76: high-score       48.000, score        9.000, last-episode-time   10
ep 77: high-score       48.000, score       10.000, last-episode-time   11
ep 78: high-score       48.000, score       10.000, last-episode-time   11
ep 79: high-score       48.000, score       10.000, last-episode-time   11
ep 80: high-score       48.000, score        9.000, last-episode-time   10
ep 81: high-score       48.000, score       10.000, last-episode-time   11
ep 82: high-score       48.000, score       10.000, last-episode-time   11
ep 83: high-score       48.000, score       10.000, last-episode-time   11
ep 84: high-score       48.000, score        9.000, last-episode-time   10
ep 85: high-score       48.000, score       10.000, last-episode-time   11
ep 86: high-score       48.000, score       10.000, last-episode-time   11
ep 87: high-score       48.000, score        9.000, last-episode-time   10
ep 88: high-score       48.000, score       10.000, last-episode-time   11
ep 89: high-score       48.000, score        9.000, last-episode-time   10
ep 90: high-score       48.000, score        9.000, last-episode-time   10
ep 91: high-score       48.000, score        9.000, last-episode-time   10
ep 92: high-score       48.000, score        8.000, last-episode-time    9
ep 93: high-score       48.000, score        9.000, last-episode-time   10
ep 94: high-score       48.000, score        9.000, last-episode-time   10
ep 95: high-score       48.000, score        9.000, last-episode-time   10
ep 96: high-score       48.000, score        9.000, last-episode-time   10
      ...

The environment gives 1 point for every frame the pole doesn't drop beyond a critical angle. Those scores are terrible. A good score is around 200 points. 9 might even be the minimum possible score.

Shameful Performance
Needless to say, the agent isn't improving. And no matter how long you run it, it will not improve.
In fact, if you set the agent to pick random actions, you will get better performance than our agent does as-is.
Don't believe me? Watch this...

def chooseAction(self, observation):
    # the network-based action selection from last time, commented out:
    # state = torch.tensor(observation).float().detach()
    # state = state.to(self.network.device)
    # state = state.unsqueeze(0)

    # qValues = self.network(state)
    # action = torch.argmax(qValues).item()

    # just pick 0 or 1 at random (needs `import random` at the top of the file)
    action = random.randint(0, 1)

    return action

... and run it ...

ep 0: high-score       10.000, score       10.000, last-episode-time   11
ep 1: high-score       12.000, score       12.000, last-episode-time   13
ep 2: high-score       27.000, score       27.000, last-episode-time   28
ep 3: high-score       27.000, score       13.000, last-episode-time   14
ep 4: high-score       27.000, score       18.000, last-episode-time   19
ep 5: high-score       27.000, score       17.000, last-episode-time   18
ep 6: high-score       27.000, score       12.000, last-episode-time   13
ep 7: high-score       27.000, score       15.000, last-episode-time   16
ep 8: high-score       27.000, score       10.000, last-episode-time   11
ep 9: high-score       27.000, score       15.000, last-episode-time   16
ep 10: high-score       27.000, score       19.000, last-episode-time   20
ep 11: high-score       36.000, score       36.000, last-episode-time   37
ep 12: high-score       36.000, score       12.000, last-episode-time   13
ep 13: high-score       57.000, score       57.000, last-episode-time   58
ep 14: high-score       57.000, score        9.000, last-episode-time   10
ep 15: high-score       57.000, score       19.000, last-episode-time   20
ep 16: high-score       57.000, score       15.000, last-episode-time   16
ep 17: high-score       57.000, score       20.000, last-episode-time   21
ep 18: high-score       57.000, score       13.000, last-episode-time   14
ep 19: high-score       57.000, score       13.000, last-episode-time   14
ep 20: high-score       57.000, score       16.000, last-episode-time   17
ep 21: high-score       57.000, score       56.000, last-episode-time   57

Look at those scores. They are substantially higher than our learning agent's.
Okay so what gives?

Post Mortem

It's time to investigate what is going wrong. Undo the random action selection stuff. Then let's see which actions the agent is taking.

def chooseAction(self, observation):
    state = torch.tensor(observation).float().detach()
    state = state.to(self.network.device)
    state = state.unsqueeze(0)

    qValues = self.network(state)
    action = torch.argmax(qValues).item()
    print(action) # here
    return action

... and run ...

...
ep 5: high-score       11.000, score       11.000, last-episode-time   12
0
0
0
0
0
0
0
0
0
ep 6: high-score       11.000, score        9.000, last-episode-time   10
0
0
0
0
0
0
0
0
0
ep 7: high-score       11.000, score        9.000, last-episode-time   10
0
0
0
0
0
0
0
0
0
...

So the agent picks the same action over and over. That sucks. Rerun it a few times to see if the action is different between runs.

Okay I did that. It seems to just pick one action at random early on and mash that one.
Alright, time for more prints. Let's look at the QValues. Is it picking correctly? Do they look sane?

def chooseAction(self, observation):
    state = torch.tensor(observation).float().detach()
    state = state.to(self.network.device)
    state = state.unsqueeze(0)

    qValues = self.network(state)
    action = torch.argmax(qValues).item()
    print("qValues: {}, action {}".format(qValues, action)) # happy printing
    return action

you know the drill

qValues: tensor([[-0.0304, -0.1123]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.5467, -0.1014]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1103, -0.1368]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.4888, -0.1696]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.4374, -0.1851]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0218, -0.1893]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.4915, -0.1777]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.6207, -0.1962]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1235, -0.2413]], grad_fn=<AddmmBackward>), action 0
ep 0: high-score        9.000, score        9.000, last-episode-time   10
qValues: tensor([[ 1.3182, -0.1829]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.2014, -0.1902]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1133, -0.2033]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1132, -0.2105]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.1815, -0.2261]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.2351, -0.2414]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.2081, -0.2509]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0592, -0.2488]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.8426, -0.2346]], grad_fn=<AddmmBackward>), action 0
...

By the way, that grad_fn=<AddmmBackward> it keeps printing out is autograd's record of the last operation that produced the tensor, which is what gets used to compute gradients during backpropagation.
In this case the last operation was addmm, the matrix-multiply-plus-bias-add inside the network's final Linear layer. Hence the name.
You can hide that extra bit by calling .detach() on the tensor before you print it. detach returns a tensor that is cut off from the autograd graph, so it carries no grad_fn.
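If you want cleaner prints, a minimal tweak to the chooseAction print above would look something like this (just a sketch, everything else in the method stays the same):

qValues = self.network(state)
action = torch.argmax(qValues).item()
# detach the tensor before printing so the grad_fn part doesn't show up
print("qValues: {}, action {}".format(qValues.detach(), action))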

Anyways, the q values start kinda random. They look like random floats. That makes sense, because the weights of the network are random. If we skip ahead to a future episode:

...
ep 29: high-score       11.000, score        9.000, last-episode-time   10
qValues: tensor([[ 0.9909, -0.3292]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9975, -0.3475]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9999, -0.3630]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9931, -0.3728]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9912, -0.3755]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9945, -0.3801]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0003, -0.3881]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9978, -0.3959]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9969, -0.4068]], grad_fn=<AddmmBackward>), action 0
ep 30: high-score       11.000, score        9.000, last-episode-time   10
qValues: tensor([[ 1.0100, -0.3460]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0139, -0.3623]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0163, -0.3759]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0087, -0.3879]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0066, -0.3916]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0061, -0.3951]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 1.0039, -0.3992]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9966, -0.4057]], grad_fn=<AddmmBackward>), action 0
qValues: tensor([[ 0.9908, -0.4150]], grad_fn=<AddmmBackward>), action 0

One action's QValue seems to be approaching 1.
The other one seems stable, but hasn't deviated far from where it started.

Let's try printing out the reward now. You're a big boy; you can figure out how to do it by yourself this time.
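If you get stuck, here is one way to do it, sketched against the same training loop from the last tutorial (the one with the old 4-value env.step()):

action = agent.chooseAction(state)
state_, reward, done, info = env.step(action)
print("reward {}".format(reward))  # here
agent.learn(state, reward)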

...
ep 31: high-score       11.000, score       10.000, last-episode-time   11
qValues: tensor([[-0.0982,  1.0357]]), action 1
reward 1.0
qValues: tensor([[-0.0637,  1.0319]]), action 1
reward 1.0
qValues: tensor([[-0.0530,  1.0257]]), action 1
reward 1.0
qValues: tensor([[-0.0456,  1.0255]]), action 1
reward 1.0
qValues: tensor([[-0.0410,  1.0269]]), action 1
reward 1.0
qValues: tensor([[-0.0491,  1.0256]]), action 1
reward 1.0
qValues: tensor([[-0.0606,  1.0219]]), action 1
reward 1.0
...

It looks like the reward is 1.0 every frame, and the action the agent keeps choosing has a QValue near 1.0. This is good: it means the network is learning to correctly estimate the value of the action it tries all the time.

And now we know why it doesn't pick the other action. It randomly starts out with one QValue higher than the other. (If you pick 2 random numbers, one has to be higher.) The higher QValue is then raised to approximate the reward, which is 1.0. And since it was raised to 1.0, while the other QValue is probably still well below 1.0, the agent just picks the same action again when the next frame comes around. This never ends.
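If you want to see that feedback loop in isolation, here is a toy sketch (not our actual agent, just two made-up QValues being nudged toward a reward of 1.0 by a greedy argmax):

import random

qValues = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]
learningRate = 0.1

for step in range(20):
    action = 0 if qValues[0] > qValues[1] else 1  # greedy pick, same idea as argmax
    reward = 1.0                                  # cartpole hands out 1.0 every frame
    # only the chosen action's estimate gets nudged toward the reward
    qValues[action] += learningRate * (reward - qValues[action])
    print(step, action, [round(q, 3) for q in qValues])

Whichever value starts out higher gets dragged up toward 1.0 and wins the argmax every single time. The other value never gets touched.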

Exploration

When To Get Help For Addiction

I want one thing to be clear: even though the performance isn't improving from the perspective of balancing the pole, the network is not malfunctioning. It is doing exactly what we want it to. The network is accurately predicting the reward it will get for the action it has "studied". The reward prediction for the action it has not been taking is obviously wrong, though. The environment returns a reward of 1.0 every step of an episode, so both actions should have value predictions around 1.0. The agent just hasn't been providing training data for the action that it never picks. So even though the network is functioning exactly as intended, it isn't being trained on the right material.
Fundamentally this is an issue of not exploring enough.

Is Rehab Right For You?

If the agent were trying both actions, then both QValues would be right around 1.0, and the agent would not just hammer one action. Its decisions would be better. You've got lots of options for fixing this, but all of them involve making our agent pick more diverse actions.
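Just to sanity-check that claim, here is the same toy sketch from before, but with the occasional forced random action (again, purely an illustration of the idea, not the strategy we'll actually end up using):

import random

qValues = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]
learningRate = 0.1

for step in range(500):
    if random.random() < 0.2:
        action = random.randint(0, 1)                # every so often, try something at random
    else:
        action = 0 if qValues[0] > qValues[1] else 1  # otherwise stay greedy
    reward = 1.0
    qValues[action] += learningRate * (reward - qValues[action])

print([round(q, 3) for q in qValues])  # both values end up close to 1.0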

As always, my job is to let you fail first so that you can learn how to do things the right way later. So we are going to try option one first: manually punishing the agent's repeated decisions.

Reward Shaping

A Firm Hand

Here's the general idea. We are going to hack the reward such that the agent does what we want it to. (Or what we think we want it to...) If the action causes the agent to fail, the estimated reward for that action should go down, right? Instead of letting the agent figure that out, we can just spank it with our own abusive hands.

...
while not done:
    env.render()

    action = agent.chooseAction(state)
    state_, reward, done, info = env.step(action)
    if done:  # if you dropped the pole
        reward = -50.0  # that action was bad
    agent.learn(state, reward)  # and you should feel bad

    state = state_
...

Nice. Try it out.

...
qValues: tensor([[0.0366, 0.5643]]), action 1
qValues: tensor([[0.0642, 0.6074]]), action 1
qValues: tensor([[0.0559, 1.0360]]), action 1
qValues: tensor([[0.0471, 1.4809]]), action 1
ep 0: high-score      -41.000, score      -41.000, last-episode-time   11
qValues: tensor([[-0.1129,  1.4063]]), action 1
qValues: tensor([[-0.0601,  1.0136]]), action 1
qValues: tensor([[0.0054, 0.5944]]), action 1
qValues: tensor([[0.0532, 0.2115]]), action 1
qValues: tensor([[ 0.1043, -0.1176]]), action 0
qValues: tensor([[ 0.1087, -0.0426]]), action 0
qValues: tensor([[0.1207, 0.1440]]), action 1
qValues: tensor([[ 0.1917, -0.2176]]), action 0
qValues: tensor([[0.2152, 0.0013]]), action 0
qValues: tensor([[0.2579, 0.3355]]), action 1
qValues: tensor([[ 0.3185, -0.1100]]), action 0
...

Hey, maybe it worked! It looks like it learned that action 1 is bad. The QValue for action 1 went down, then it picked action 0 instead, and then it kept switching back and forth. ))<>((

Let's wait and see if it improves more...

...
ep 143: high-score       26.000, score      -41.000, last-episode-time   11
qValues: tensor([[-13.1998,   1.9950]]), action 1
qValues: tensor([[-12.7037,   1.8748]]), action 1
qValues: tensor([[-12.9340,   1.8002]]), action 1
qValues: tensor([[-13.6834,   1.8688]]), action 1
qValues: tensor([[-14.9132,   2.5243]]), action 1
qValues: tensor([[-17.2499,   3.3457]]), action 1
qValues: tensor([[-21.3560,   2.6179]]), action 1
qValues: tensor([[-27.4956,  -1.5666]]), action 1
qValues: tensor([[-36.4758, -18.5921]]), action 1
qValues: tensor([[-45.7820, -35.8122]]), action 1
qValues: tensor([[-55.1387, -50.7749]]), action 1
...

Aaand it gets stuck picking one action again. Also, that score was -41; the scores got harder to interpret. (An 11-frame episode collects ten rewards, and our hack swaps the final +1 for -50, so 9 - 50 = -41.)

Pessimism

Check out the reward predictions as it gets close to the end of the episode. They become more and more negative. It learns that all actions are gonna be bad. It's learned the futility of life. (a sign of real intelligence)
Anyways, our strategy didn't work. If you keep running it, you will see it doesn't get better.

A More Rigorous Approach

Maybe -50.0 is just too big, or too vague. What if we tried to make a more specific reward that really encourages the network to make the right decisions? We can extract the cart's position and velocity, and the pole's angle and angular velocity, from the environment state and build a really specific reward function.
Here is an example in pseudocode.

# unpack the CartPole observation: cart position, cart velocity, pole angle, pole angular velocity
cartPos, cartVel, poleAngle, poleAngleVel = state_

# the cart is moving left and the pole is falling left
if cartVel < 0.0 and poleAngleVel < 0.0:
    if action == 1:
        reward = 10.0
    else:  # action == 0
        reward = -10.0
# the cart is moving left but the pole is moving right
elif cartVel < 0.0 and poleAngleVel > 0.0:
    if action == 1:
        reward = 10.0
    else:  # action == 0
        reward = -10.0
...

I think you see where this is going.

The Temptations Of Evil

This strategy is guaranteed to work in cartpole. All you have to do is keep getting more and more refined with your conditions, and the agent will win the game. But before you go spend an hour making a custom reward function, make sure to read the next section...

DON'T DO IT

Don't do this.
This is called "Reward Shaping" and it is really bad.
Now you might be thinking, "Why is it bad? I want my bot to work. I just want it to do the task, so who cares how I achieve that?"
I hope I can talk you out of it.

Who Is Learning

Is the network really learning how to play the game? Or are you? The whole point of deep reinforcement learning is that we should have an agent that learns the task. If your reward function does all the work, then you can just remove the neural network entirely. The policy would just be the conditions you wrote.

Limiting Performance

A good learning agent doesn't require your human-curated rewards. It is extremely unlikely that the conditions you come up with will be better than the policy of an actual learning agent. You want the benefits of AI, don't you?

Flexibility

The reward shape you design only works for a specific environment. You would have to do it all over again each time you changed the environment. Again, it just defeats the point of the DRL agent entirely.

Your Brain Isn't Good Enough

You've probably only seen cartpole or lunar lander at this point, but even amongst the simple environments, the reward function necessary to successfully win the game will be really complicated.
And I'm not talking about an optimal reward shape. I just said "win the game". That means the minimum win condition.
Imagine trying to do reward shaping to make the agent beat Zelda or Mario, or navigate a room in real life. Maybe you could do it with a team of people working nonstop for years.
People have tried. Even under the best circumstances you would likely still fail.

Gaming The Reward

A neural network is a function approximator. Ideally it approximates its target function smoothly.

Your complicated reward function is probably full of local minima and maxima. The network will find them if they exist. And it will get stuck in them, figuring out ways to exploit your function for massive rewards that you never intended, and never accomplishing the task you wanted it to do in the first place.
It is actually really difficult to make a reward function that doesn't have traps in it. Most of the ai-gym environments have a fairly carefully created reward function. If you go look at the lunar-lander reward code in the ai-gym repo, you'll see what I mean. It's complicated and deliberate. Each fractional reward is scaled in a way that minimizes reward gaming.

Read this for some fun.

Imagine that you are an agent.
Consider cartpole. Are 50 good actions worth one deadly action? We kind of made that claim in our reward function above, didn't we? Should it be 200, since 200 is a winning score? How can we know?
When I was a noob I made a snake environment and did a lot of reward shaping. The snake agent carefully calculated how many apples a game over was worth, and would meticulously plan its path so that dying after a specific number of turns made the apple worth it. Even though... I would have preferred it to just survive and get the apple instead.

It's Too Much Work

There's an image somewhere of some giant reward shaping equation written for a paper that was just attempting to get one of the ai-gym robots to do a backflip. The equation is big and looks like it is full of calculus. I have no idea how it works, and can only imagine how long it took to make it. (Also it's suboptimal)

When I was new to deep reinforcement learning I spent about a week on an agent that plays snake. I carefully tweaked the magnitude of each sub-reward in my reward conditions to try to raise the performance. At one point I even considered letting a neural network learn the optimal reward magnitudes, but then I realized how stupid that was, because a good agent would have done that without my reward shaping anyway.
I spent almost an entire week on reward shaping snake, by the way. Just think about that. Just the reward shaping. That's what I did. For like 6 hours a day. Don't be me.
At the end of it all, it turned out I just had a bug in my agent code. Once I fixed that bug, it worked great without the reward shaping.

... And Finally, Consistency

When other people make agents to solve the same environment as you, you want to be able to measure the performance of their agents against yours. If everyone is making custom reward functions, then is their agent better than yours or not? It becomes unclear. It could be that none of the parameters or architecture of their agent matters at all, and it's just their reward function pulling all the weight.
So for the most part everyone agrees to not do reward shaping, so that the results they share are comparable in the most basic way.
Even the raw numbers become different when you shape. You'll notice that in our specific case, once we added the done penalty, the episode rewards started totaling up to negative numbers. It wasn't like that before. People looking at our results would be really confused. Is 200 still a good score under our reward shape? Is 9 a bad score?

I Forbid You

Anyways, I hope you don't plan on doing it.

Oh Right I Almost Forgot

You wanted to make the agent work, didn't you?
What were we doing again? Oh yeah, the issue was that our agent wasn't exploring different actions. And somehow we have to accomplish that without rigging the reward.

How about we try a real action exploration strategy in the next tutorial?