This tutorial could probably be your very first deep reinforcement learning program.
That said, I've already got a tutorial aimed at impatient people here.
That one won't answer all your questions, though. It might leave you with quite a lot.
You don't need to fully understand it to get this one, but I recommend you at least read it,
run it, and play with the program.
This one goes a lot more in depth on the reasoning behind the learn function,
which is probably the most complicated and confusing part of the actor-critic tutorial.
If you don't want to read that one and you already know some pytorch, you should be fine here.
Each tutorial makes different assumptions about your prior knowledge.
I take an iterative approach to building up knowledge, so if the agent defies your expectations of how a DQN should behave at first, just bear with me.
Today we're gonna try a different algorithm for some deep reinforcement learning. It's generally a lot better than the actor-critic one, but it has some real limitations.
Deep Q Learning is a reinforcement learning algorithm. It's like Actor-Critic in a lot of ways.
Just like in Actor-Critic, we're gonna have an agent, a network, and an environment.
The network is responsible for correlating environment inputs with the outcomes of the available actions.
In Actor-Critic, we have probabilities of actions, and a separate network that judges and tweaks those probabilities
based on feedback from the environment. Q Learning does not work like this.
In Q Learning, the outcome prediction and the action choice are sort of one and the same.
There is no separate actor and critic. The actor is the critic. (And it's generally not called an actor, either.)
There are other ways of doing it, but basics first. Let's start by predicting the value of an action...
Okay so you have some state that comes into your network from the environment, and you have a finite set of actions you can take. Pass in the state, and out comes the action values.
Consider the following:
Ex: The agent is given a picture of a burger.
It outputs 3 values. One per available action.
Eat: Predict we will get 100.0 Points
Don't Eat: -10.0 Points
Eat, Throw Up, Then Eat Again: 200.0 Points
Each output is a prediction of how good the next state of the environment will be if it does that
action. That prediction is known as a Q Value.
Notice: It does not matter if the input to the Action Outcome Predictor
is a state, action, or some combination of the two.
(People will tell you it matters, but if you aren't balls deep in the math it doesn't matter,
and tutorials you will find online get it wrong half the time anyways.)
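If it helps, here's a rough sketch of both flavors. The layer sizes are arbitrary and the variable names are made up for illustration; this is not code from the tutorial, just a picture of the idea.

import torch
import torch.nn as nn

# flavor 1: state in, one Q value out per action (what this tutorial does)
q_from_state = nn.Linear(4, 3)  # a fake 4-number state -> 3 action values

# flavor 2: state and a one-hot action in, a single Q value out
q_from_state_and_action = nn.Linear(4 + 3, 1)

state = torch.rand(1, 4)
eat = torch.tensor([[1.0, 0.0, 0.0]])  # "Eat", one-hot encoded

print(q_from_state(state))                                          # 3 predictions at once, one per action
print(q_from_state_and_action(torch.cat([state, eat], dim=1)))      # 1 prediction, just for "Eat"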
Why are we predicting action outcomes? Well, the agent will pick the action with the highest estimation of how
many points it will result in.
A DQN is fairly optimistic about life.
predicted_outcomes = network(now_state)
predicted_outcomes is array([100., -10., 200.])
chosen_action = predicted_outcomes.argmax()
chosen_action is 2
Mmmm. Yummy. :^)
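By the way, that pseudocode runs pretty much as-is. Here's a tiny self-contained version using the made-up burger numbers:

import torch

# the made-up Q values from the burger example: [Eat, Don't Eat, Eat/Throw Up/Eat Again]
predicted_outcomes = torch.tensor([100.0, -10.0, 200.0])

chosen_action = predicted_outcomes.argmax().item()
print(chosen_action)  # 2, the index of the highest predicted outcome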
Let's set up our network that will predict action outcomes, Q Values.
class Network(torch.nn.Module):
    def __init__(self, lr, inputShape, numActions):
        super().__init__()
        self.inputShape = inputShape
        self.numActions = numActions
        self.fc1Dims = 1024
        self.fc2Dims = 512

        # layers
        self.fc1 = nn.Linear(*self.inputShape, self.fc1Dims)
        self.fc2 = nn.Linear(self.fc1Dims, self.fc2Dims)
        self.fc3 = nn.Linear(self.fc2Dims, numActions)  # output 1 outcome estimate per action

        # pytorch stuff
        self.optimizer = optim.Adam(self.parameters(), lr=lr)
        self.loss = nn.MSELoss()
        # self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.device = torch.device("cpu")
        self.to(self.device)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
inputShape should be a tuple. In the case of cartpole it's
... inputShape=(4,), numActions=2 ...
The input shape is of size 4 because the cartpole environment state consists of 4 numbers. Those 4 numbers
are the angle and position of the cart and whatnot. Is it weird that I don't really care what they are?
It works anyways. The agent is supposed to do my work for me.
See, in the forward function we just pass in the environment state and out comes an array of numbers, one for each action. Those are the Q Values.
# self.fc3 = nn.Linear(self.fc2Dims, numActions)

def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x
One outcome value prediction for each action.
# call forward by using the network name with ()
predicted_outcomes = network(now_state)
predicted_outcomes is array([100., -10., 200.])
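If you want to poke at it yourself, here's a quick sketch that builds the Network we just wrote for cartpole and pushes a fake state through it. The zeros are just a stand-in for a real observation, and it assumes you have the imports from the full listing at the bottom.

network = Network(lr=0.001, inputShape=(4,), numActions=2)

fake_state = torch.zeros(1, 4)   # stand-in for a real cartpole observation
q_values = network(fake_state)   # shape (1, 2): one Q value per action
print(q_values)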
That's it for the network code.
Let's make the agent now.
As I said before, the agent will pick the action with the highest value estimation.
Why? Well, if you wanted to make an agent that got progressively worse at things, you could pick the lowest. :^)
def chooseAction(self, observation):
    # your env state probably comes in as a numpy array of floats, so we have to put it in a tensor
    state = torch.tensor(observation).float()
    state = state.to(self.network.device)  # and put it on the gpu/cpu
    state = state.unsqueeze(0)  # pytorch likes batched inputs: shape (1, 4) instead of just (4,)

    qValues = self.network(state)  # pass it through the network to get your estimations
    action = torch.argmax(qValues)  # pick the highest
    return action.item()  # return an int (the index of the best action) instead of a tensor
This greedy selection might feel wrong to you. Sometimes we need to pick the action that
is worse in the short term but gives us better options in the long term, right?
Your intuition is good. However, although the QValues start off shortsighted,
if an action gives the agent better options in the long term, and the agent is
training well, then the QValue of that long-term action will slowly increase
throughout training until it beats out the greedier actions.
An example of this might be in pacman, where the agent has to pick between two directions.
One direction might have lots of dots to eat, but lead to a dead end. The other direction might
have no dots, but lead to a new section of the maze with even more dots to eat than the first direction.
At first you can expect your agent to go with the easy reward. In fact... sometimes it might never figure out
the long-term best action. Anyways, the point is that QValues are not straightforward to interpret.
It can be really difficult to tell whether the agent is being nearsighted or thinking long term
from the QValues alone. This greedy action selection ends up being a lot less greedy than you might worry.
Unlike in Actor-Critic, this network only has one error to minimize. That one error is the difference between the reward the environment gave us, and what the QValue guessed it would be.
def learn(self, state, reward):
    self.network.optimizer.zero_grad()  # pytorch stuff: resets the gradients from the last step

    # put our states in tensors and such
    state = torch.tensor(state).float().detach()  # detach it so we dont backprop through it
    state = state.to(self.network.device)  # and put it on the gpu/cpu
    state = state.unsqueeze(0)
    reward = torch.tensor(reward).float().detach()  # have to put the reward in tensor form too
    reward = reward.to(self.network.device)  # and on the proper device

    qValues = self.network(state)  # predict what reward each action will get
    valueOfBestAction = qValues.max()  # assume it took the best action

    # did we accurately predict how much reward the best action would give us?
    loss = self.network.loss(valueOfBestAction, reward)
    loss.backward()  # calculate how much each weight contributed to the error
    self.network.optimizer.step()  # tweak the weights to reduce the error
The network takes the difference between the actual reward and its predicted reward as its
loss.
The agent picks an action, gives that action to the environment. The environment gives a reward.
Then you get to see how wrong it was.
The action's QValue is the reward prediction, right? You want it to be correct.
If the difference between reward and prediction is zero, it means the network perfectly predicted the
reward it would get. That means it "understands reality".
If the difference is high, it means the network really over/under valued its action.
If the difference is low, it means the network only kinda under/over valued its action.
The error being positive means it was an overvalued action.
The error being negative means it was an undervalued action.
Anyways, both of those are bad things.
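Here's a toy example with made-up numbers, just to get a feel for the scale of the loss:

import torch
import torch.nn.functional as F

# made-up numbers: the network guessed the action was worth 200 points,
# but the environment only paid out 5
predicted_q = torch.tensor(200.0)
actual_reward = torch.tensor(5.0)

loss = F.mse_loss(predicted_q, actual_reward)
print(loss)  # tensor(38025.) -- that's (200 - 5)^2, a badly overvalued action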
Remember when we called .detach()
on the state?
We do that every time before we pass it into the network.
Well, that's important. The environment state certainly influences what reward the agent predicts it will get,
but it would be wrong to modify the state according to the loss.
If we don't detach it, then when we backprop the error, the state tensor will absorb a portion of the loss.
So why is that bad? Well, the state is just state. It's life.
You can't change the laws of physics, you can only change your interpretation of the state, and your
choices.
So the loss should be accounted for in the change to the network weights, not the state.
You were expecting a bunch of nonsense about temporal difference,
compounding reward probabilities and whatnot
weren't you?
Don't worry. We will get there.
The agent class holds the network and functions as a nice place to put utilities needed for decision making. You've already seen all this code. It's just in one place now. I'm not a huge fan of unnecessary classes but this Agent will be convenient later when we add stuff to it. Also it is a pretty standard way of doing it.
class Agent():
    def __init__(self, lr, inputShape, numActions=2):
        self.network = Network(lr, inputShape, numActions)

    def chooseAction(self, observation):  # you've seen this code
        state = torch.tensor(observation).float()
        state = state.to(self.network.device)
        state = state.unsqueeze(0)
        qValues = self.network(state)
        action = torch.argmax(qValues)  # there are a few ways to get the max value and its index (google)
        return action.item()  # .item() grabs the number out of the tensor (also worth a google)

    def learn(self, state, reward):  # you've seen this before too
        self.network.optimizer.zero_grad()  # pytorch stuff: resets the gradients
        state = torch.tensor(state).float().detach()  # detach it so we dont backprop through it
        state = state.to(self.network.device)  # and put it on the gpu/cpu
        state = state.unsqueeze(0)
        reward = torch.tensor(reward).float().detach()
        reward = reward.to(self.network.device)

        qValues = self.network(state)
        valueOfBestAction = qValues.max()
        loss = self.network.loss(valueOfBestAction, reward)
        loss.backward()
        self.network.optimizer.step()
All the hard parts are done now. Put the agent into the environment, and trap it in an
infinite loop so it can learn the futility of life and give up.
agent = Agent(lr=0.001, inputShape=(4,), numActions=2)
env = gym.make('CartPole-v1')  # this is how you pick the env from ai gym

highScore = -math.inf
episode = 0
while True:  # keep starting new episodes forever
    observation = env.reset()  # observation is just a commonly used term for the environment state
    score, frame, done = 0, 1, False
    while not done:  # keep going until the env reports the episode is done
        env.render()  # draw it on your screen so you can watch

        action = agent.chooseAction(observation)
        nextObservation, reward, done, info = env.step(action)  # make the environment go one time step
        agent.learn(observation, reward)  # make your network more accurate

        observation = nextObservation
        score += reward
        frame += 1

    highScore = max(highScore, score)
    print(("ep {}: high-score {:12.3f}, "
           "score {:12.3f}, last-episode-time {:4d}").format(
        episode, highScore, score, frame))
    episode += 1
You have the basic DQN framework now. There isn't that much to it.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gym
import math
import numpy as np

class Network(torch.nn.Module):
    def __init__(self, alpha, inputShape, numActions):
        super().__init__()
        self.inputShape = inputShape
        self.numActions = numActions
        self.fc1Dims = 1024
        self.fc2Dims = 512

        self.fc1 = nn.Linear(*self.inputShape, self.fc1Dims)
        self.fc2 = nn.Linear(self.fc1Dims, self.fc2Dims)
        self.fc3 = nn.Linear(self.fc2Dims, numActions)

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.loss = nn.MSELoss()
        # self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.device = torch.device("cpu")
        self.to(self.device)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class Agent():
    def __init__(self, lr, inputShape, numActions):
        self.network = Network(lr, inputShape, numActions)

    def chooseAction(self, observation):
        state = torch.tensor(observation).float().detach()
        state = state.to(self.network.device)
        state = state.unsqueeze(0)
        qValues = self.network(state)
        action = torch.argmax(qValues).item()
        return action

    def learn(self, state, reward):
        self.network.optimizer.zero_grad()
        state = torch.tensor(state).float().detach()
        state = state.to(self.network.device)
        state = state.unsqueeze(0)
        reward = torch.tensor(reward).float().detach()
        reward = reward.to(self.network.device)

        qValues = self.network(state)
        valueOfBestAction = qValues.max()

        loss = self.network.loss(valueOfBestAction, reward)
        loss.backward()
        self.network.optimizer.step()

if __name__ == '__main__':
    env = gym.make('CartPole-v1').unwrapped
    agent = Agent(lr=0.001, inputShape=(4,), numActions=2)

    highScore = -math.inf
    episode = 0
    while True:
        done = False
        state = env.reset()

        score, frame = 0, 1
        while not done:
            env.render()

            action = agent.chooseAction(state)
            state_, reward, done, info = env.step(action)
            agent.learn(state, reward)

            state = state_
            score += reward
            frame += 1

        highScore = max(highScore, score)

        print(("ep {}: high-score {:12.3f}, "
               "score {:12.3f}, last-episode-time {:4d}").format(
            episode, highScore, score, frame))

        episode += 1
It didn't work, did it? Do you have any guesses why?
Go scour the code a bit, and verify each line makes sense to you. Maybe do some printing.
Don't come back until you've at least tried.
Hey. I said stop reading this. Go try.
I hope you didn't spend more than 30 minutes trying to figure out why
it doesn't work. That would be hilarious.
If you did though, don't worry, stress-induced hair loss is temporary.
I set you up. You were doomed to fail from the beginning.
Let's investigate why you are such a failure in the next tutorial.