This tutorial could probably be your very first deep reinforcement learning program.
That said, I've already got a tutorial aimed at impatient people here.
That one won't answer all your questions, though. It might leave you with quite a lot.
You don't need to fully understand it to get this one, but I recommend you at least read it,
run it, and play with the program.
This one goes a lot more in depth on the reasoning behind the learn function,
which is probably the most complicated and confusing part of the actor-critic tutorial.
If you don't want to read that one and you already know some pytorch, you should be fine here.
Each tutorial makes different assumptions about your prior knowledge.
I take an iterative approach to building up knowledge, so if the agent defies your expectations of how a DQN should behave at first, just bear with me.
Today we're gonna try a different algorithm for some deep reinforcement learning. It's generally a lot better than the actor-critic one, but it has some real limitations.
Deep Q Learning is a reinforcement learning algorithm. It's like Actor-Critic in a lot of ways.
Just like in Actor-Critic, we're gonna have an agent, a network, and an environment.
The network is responsible for correlating environment inputs with the outcomes of the available actions.
In Actor-Critic, we have probabilities of actions, and a separate network that judges and tweaks those probabilities
based on feedback from the environment. Q Learning does not work like this.
In Q Learning, the outcome prediction and the action choice are sort of one and the same.
There is no separate actor and critic. The actor is the critic. (And it's generally not called an actor, either.)
There are other ways of doing it, but basics first. Let's start by predicting the value of an action...
Okay so you have some state that comes into your network from the environment, and you have a finite set of actions you can take. Pass in the state, and out comes the action values.
Consider the following:
Ex: The agent is given a picture of a burger.
It outputs 3 values. One per available action.
Eat: Predict we will get 100.0 Points
Don't Eat: -10.0 Points
Eat, Throw Up, Then Eat Again: 200.0 Points
Each output is a prediction of how good the next state of the environment will be if it does that
action. That prediction is known as a Q Value.
Notice: It does not matter if the input to the Action Outcome Predictor
is a state, action, or some combination of the two.
(People will tell you it matters, but if you aren't balls deep in the math it doesn't matter,
and tutorials you will find online get it wrong half the time anyways.)
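If it helps, here's a rough sketch of both flavors. The layer sizes are arbitrary and the variable names are made up for illustration; this is not code from the tutorial, just a picture of the idea.

import torch
import torch.nn as nn

# flavor 1: state in, one Q value out per action (what this tutorial does)
q_from_state = nn.Linear(4, 3)  # a fake 4-number state -> 3 action values

# flavor 2: state and a one-hot action in, a single Q value out
q_from_state_and_action = nn.Linear(4 + 3, 1)

state = torch.rand(1, 4)
eat = torch.tensor([[1.0, 0.0, 0.0]])  # "Eat", one-hot encoded

print(q_from_state(state))                                          # 3 predictions at once, one per action
print(q_from_state_and_action(torch.cat([state, eat], dim=1)))      # 1 prediction, just for "Eat"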
Why are we predicting action outcomes? Well, the agent will pick the action with the highest estimation of how
many points it will result in.
A DQN is fairly optimistic about life.
predicted_outcomes = network(now_state)
predicted_outcomes is array([100., -10., 200.])
chosen_action = predicted_outcomes.argmax()
chosen_action is 2
Mmmm. Yummy. :^)
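By the way, that pseudocode runs pretty much as-is. Here's a tiny self-contained version using the made-up burger numbers:

import torch

# the made-up Q values from the burger example: [Eat, Don't Eat, Eat/Throw Up/Eat Again]
predicted_outcomes = torch.tensor([100.0, -10.0, 200.0])

chosen_action = predicted_outcomes.argmax().item()
print(chosen_action)  # 2, the index of the highest predicted outcome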
Let's set up our network that will predict action outcomes, Q Values.
class Network(torch.nn.Module):
    def __init__(self, lr, inputShape, numActions):
        super().__init__()
        self.inputShape = inputShape
        self.numActions = numActions
        self.fc1Dims = 1024
        self.fc2Dims = 512

        # layers
        self.fc1 = nn.Linear(*self.inputShape, self.fc1Dims)
        self.fc2 = nn.Linear(self.fc1Dims, self.fc2Dims)
        self.fc3 = nn.Linear(self.fc2Dims, numActions)  # output 1 outcome estimate per action

        # pytorch stuff
        self.optimizer = optim.Adam(self.parameters(), lr=lr)
        self.loss = nn.MSELoss()
        # self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.device = torch.device("cpu")
        self.to(self.device)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
inputShape should be a tuple. In the case of cartpole it's
... inputShape=(4,), numActions=2 ...
The input shape is of size 4 because the cartpole environment state consists of 4 numbers. Those 4 numbers
are the angle and position of the cart and whatnot. Is it weird that I don't really care what they are?
It works anyways. The agent is supposed to do my work for me.
See, in the forward function we just pass in the environment state and out comes an array of numbers, one for each action. Those are the Q Values.
# self.fc3 = nn.Linear(self.fc2Dims, numActions)

def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x
One outcome value prediction for each action.
# call forward by using the network name with ()
predicted_outcomes = network(now_state)
predicted_outcomes is array([100., -10., 200.])
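If you want to poke at it yourself, here's a quick sketch that builds the Network we just wrote for cartpole and pushes a fake state through it. The zeros are just a stand-in for a real observation, and it assumes you have the imports from the full listing at the bottom.

network = Network(lr=0.001, inputShape=(4,), numActions=2)

fake_state = torch.zeros(1, 4)   # stand-in for a real cartpole observation
q_values = network(fake_state)   # shape (1, 2): one Q value per action
print(q_values)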
That's it for the network code.
Let's make the agent now.
As I said before, the agent will pick the action with the highest value estimation.
Why? Well, if you wanted to make an agent that got progressively worse at things, you could pick the lowest. :^)
def chooseAction(self, observation):
    # your env state probably comes in as a numpy array of floats, so we have to put it in a tensor
    state = torch.tensor(observation).float()
    state = state.to(self.network.device)  # and put it on the gpu/cpu
    state = state.unsqueeze(0)  # pytorch likes batched inputs: shape (1, 4) instead of just (4,)

    qValues = self.network(state)  # pass it through the network to get your estimations
    action = torch.argmax(qValues)  # pick the highest
    return action.item()  # return an int (the index of the best action) instead of a tensor
This greedy selection might feel wrong to you. Sometimes we need to pick the action that
is worse in the short term but gives us better options in the long term, right?
Your intuition is good. However, although the QValues start off shortsighted,
if an action gives the agent better options in the long term, and the agent is
training well, then the QValue of that long-term action will slowly increase
throughout training until it beats out the greedier actions.
An example of this might be in pacman, where the agent has to pick between two directions.
One direction might have lots of dots to eat, but lead to a dead end. The other direction might
have no dots, but lead to a new section of the maze with even more dots to eat than the first direction.
At first you can expect your agent to go with the easy reward. In fact... sometimes it might never figure out
the long-term best action. Anyways, the point is that QValues are not straightforward to interpret.
It can be really difficult to tell whether the agent is being nearsighted or thinking long term
from the QValues alone. This greedy action selection ends up being a lot less greedy than you might worry.
Unlike in Actor-Critic, this network only has one error to minimize. That one error is the difference between the reward the environment gave us, and what the QValue guessed it would be.
def learn(self, state, reward):
    self.network.optimizer.zero_grad()  # pytorch stuff: resets the gradients from the last step

    # put our states in tensors and such
    state = torch.tensor(state).float().detach()  # detach it so we dont backprop through it
    state = state.to(self.network.device)  # and put it on the gpu/cpu
    state = state.unsqueeze(0)
    reward = torch.tensor(reward).float().detach()  # have to put the reward in tensor form too
    reward = reward.to(self.network.device)  # and on the proper device

    qValues = self.network(state)  # predict what reward each action will get
    valueOfBestAction = qValues.max()  # assume it took the best action

    # did we accurately predict how much reward the best action would give us?
    loss = self.network.loss(valueOfBestAction, reward)
    loss.backward()  # calculate how much each weight contributed to the error
    self.network.optimizer.step()  # tweak the weights to reduce the error
The network takes the difference between the actual reward and its predicted reward as its
loss.
The agent picks an action, gives that action to the environment. The environment gives a reward.
Then you get to see how wrong it was.
The action's QValue is the reward prediction, right? You want it to be correct.
If the difference between reward and prediction is zero, it means the network perfectly predicted the
reward it would get. That means it "understands reality".
If the difference is high, it means the network really over/under valued its action.
If the difference is low, it means the network only kinda under/over valued its action.
The error being positive means it was an overvalued action.
The error being negative means it was an undervalued action.
Anyways, both of those are bad things.
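Here's a toy example with made-up numbers, just to get a feel for the scale of the loss:

import torch
import torch.nn.functional as F

# made-up numbers: the network guessed the action was worth 200 points,
# but the environment only paid out 5
predicted_q = torch.tensor(200.0)
actual_reward = torch.tensor(5.0)

loss = F.mse_loss(predicted_q, actual_reward)
print(loss)  # tensor(38025.) -- that's (200 - 5)^2, a badly overvalued action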
Remember when we called .detach()
on the state?
We do that every time before we pass it into the network.
Well, that's important. The environment state certainly influences what reward the agent predicts it will get,
but it would be wrong to modify the state according to the loss.
If we don't detach it, then when we backprop the error, the state tensor will absorb a portion of the loss.
So why is that bad? Well, the state is just state. It's life.
You can't change the laws of physics, you can only change your interpretation of the state, and your
choices.
So the loss should be accounted for in the change to the network weights, not the state.
You were expecting a bunch of nonsense about temporal difference,
compounding reward probabilities and whatnot
weren't you?
Don't worry. We will get there.
The agent class holds the network and functions as a nice place to put utilities needed for decision making. You've already seen all this code. It's just in one place now. I'm not a huge fan of unnecessary classes but this Agent will be convenient later when we add stuff to it. Also it is a pretty standard way of doing it.
class Agent():
    def __init__(self, lr, inputShape, numActions=2):
        self.network = Network(lr, inputShape, numActions)

    def chooseAction(self, observation):  # you've seen this code
        state = torch.tensor(observation).float()
        state = state.to(self.network.device)
        state = state.unsqueeze(0)
        qValues = self.network(state)
        action = torch.argmax(qValues)  # there are a few ways to get the max value and its index (google)
        return action.item()  # .item() grabs the number out of the tensor (also worth a google)

    def learn(self, state, reward):  # you've seen this before too
        self.network.optimizer.zero_grad()  # pytorch stuff: resets the gradients
        state = torch.tensor(state).float().detach()  # detach it so we dont backprop through it
        state = state.to(self.network.device)  # and put it on the gpu/cpu
        state = state.unsqueeze(0)
        reward = torch.tensor(reward).float().detach()
        reward = reward.to(self.network.device)

        qValues = self.network(state)
        valueOfBestAction = qValues.max()
        loss = self.network.loss(valueOfBestAction, reward)
        loss.backward()
        self.network.optimizer.step()
All the hard parts are done now. Put the agent into the environment, and trap it in an
infinite loop so it can learn the futility of life and give up.
agent = Agent(lr=0.001, inputShape=(4,), numActions=2)
env = gym.make('CartPole-v1')  # this is how you pick the env from ai gym

highScore = -math.inf
episode = 0
while True:  # keep starting new episodes forever
    observation = env.reset()  # observation is just a commonly used term for the environment state
    score, frame, done = 0, 1, False
    while not done:  # keep going until the env reports the episode is done
        env.render()  # draw it on your screen so you can watch

        action = agent.chooseAction(observation)
        nextObservation, reward, done, info = env.step(action)  # make the environment go one time step
        agent.learn(observation, reward)  # make your network more accurate

        observation = nextObservation
        score += reward
        frame += 1

    highScore = max(highScore, score)
    print(("ep {}: high-score {:12.3f}, "
           "score {:12.3f}, last-episode-time {:4d}").format(
        episode, highScore, score, frame))
    episode += 1
You have the basic DQN framework now. There isn't that much to it.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gym
import math
import numpy as np

class Network(torch.nn.Module):
    def __init__(self, alpha, inputShape, numActions):
        super().__init__()
        self.inputShape = inputShape
        self.numActions = numActions
        self.fc1Dims = 1024
        self.fc2Dims = 512

        self.fc1 = nn.Linear(*self.inputShape, self.fc1Dims)
        self.fc2 = nn.Linear(self.fc1Dims, self.fc2Dims)
        self.fc3 = nn.Linear(self.fc2Dims, numActions)

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.loss = nn.MSELoss()
        # self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.device = torch.device("cpu")
        self.to(self.device)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class Agent():
    def __init__(self, lr, inputShape, numActions):
        self.network = Network(lr, inputShape, numActions)

    def chooseAction(self, observation):
        state = torch.tensor(observation).float().detach()
        state = state.to(self.network.device)
        state = state.unsqueeze(0)
        qValues = self.network(state)
        action = torch.argmax(qValues).item()
        return action

    def learn(self, state, reward):
        self.network.optimizer.zero_grad()
        state = torch.tensor(state).float().detach()
        state = state.to(self.network.device)
        state = state.unsqueeze(0)
        reward = torch.tensor(reward).float().detach()
        reward = reward.to(self.network.device)

        qValues = self.network(state)
        valueOfBestAction = qValues.max()

        loss = self.network.loss(valueOfBestAction, reward)
        loss.backward()
        self.network.optimizer.step()

if __name__ == '__main__':
    env = gym.make('CartPole-v1').unwrapped
    agent = Agent(lr=0.001, inputShape=(4,), numActions=2)

    highScore = -math.inf
    episode = 0
    while True:
        done = False
        state = env.reset()

        score, frame = 0, 1
        while not done:
            env.render()

            action = agent.chooseAction(state)
            state_, reward, done, info = env.step(action)
            agent.learn(state, reward)

            state = state_
            score += reward
            frame += 1

        highScore = max(highScore, score)

        print(("ep {}: high-score {:12.3f}, "
               "score {:12.3f}, last-episode-time {:4d}").format(
            episode, highScore, score, frame))

        episode += 1
It didn't work, did it? Do you have any guesses why?
Go scour the code a bit, and verify each line makes sense to you. Maybe do some printing.
Don't come back until you've at least tried.
Hey. I said stop reading this. Go try.
I hope you didn't spend more than 30 minutes trying to figure out why
it doesn't work. That would be hilarious.
If you did though, don't worry, stress-induced hair loss is temporary.
I set you up. You were doomed to fail from the beginning.
Let's investigate why you are such a failure in the next tutorial.