Weg's Tutorials

Cleanup, Epsilon-Greedy, and Numpy Memory

Bringing The Code Up To Modern Standards

Prerequisites

This tutorial assumes you've completed the "Foundations" tutorials. It also uses code from those previous tutorials. The intention is to make the old code look more like common RL code in preparation for the coming upgrades. If you aren't new to reinforcement learning, I'm sure you can follow along. Less will be explained within the "Upgrades" tutorials than in the prior tutorials, though I'll try to link back to the earlier explanations where I remember to.

Getting Started

The DQN implementations in prior tutorials worked as demonstrations of DRL. The intention was to minimize lines, necessary explanation, and complication. However, the simplified code comes at the cost of both agent performance and run speed. It also looked somewhat different from the common code seen on GitHub, so comparing it to other people's agent code might not be so straightforward. It served its purpose as a learning tool, but now that you are familiar with the function of each component in a DRL agent, it's time to modernize the implementation.

Code Review

This is what we are working with. There are some HERE comments marking what needs attention most.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gym       
import math
import numpy as np
import random

'''FINE: this class actually looks mostly fine as is.
Sure, the number of layers and layer dimensions aren't passed in as arguments 
to the constructor, but that's a lot of boilerplate I don't want to inflate the code with.
Yes, that is useful if you want to programmatically try different architectures, but 
it's not necessary generally.'''
class Network(torch.nn.Module):
    def __init__(self, alpha, inputShape, numActions):
        super().__init__()
        self.inputShape = inputShape
        self.numActions = numActions
        self.fc1Dims = 1024
        self.fc2Dims = 512

        self.fc1 = nn.Linear(*self.inputShape, self.fc1Dims)
        self.fc2 = nn.Linear(self.fc1Dims, self.fc2Dims)
        self.fc3 = nn.Linear(self.fc2Dims, numActions)

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.loss = nn.MSELoss()

        # self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.device = torch.device("cpu")
        self.to(self.device)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class Agent():
    def __init__(self, lr, inputShape, numActions):
        self.network = Network(lr, inputShape, numActions)

    def chooseAction(self, observation):
        state = torch.tensor(observation).float().detach()
        state = state.to(self.network.device)
        state = state.unsqueeze(0)

        qValues = self.network(state)
        action = torch.argmax(qValues).item()

        '''HERE: this should probably be changed for a more sane exploration method.'''
        chanceOfAsparagus = random.randint(1, 10)
        if chanceOfAsparagus == 1:  #   10% chance
            action = random.randint(0, 1)

        return action

    def learn(self, memory, batchSize):
        '''HERE: theres no minimum memory size??'''
        if len(memory) < batchSize:
            return 

        self.network.optimizer.zero_grad()

        '''HERE: this entire block runs very slow compared to the common implementation'''
        randomMemories = random.choices(memory, k=batchSize)
        memories = np.stack(randomMemories)
        states, actions, rewards, states_, dones = memories.T
        states, actions, rewards, states_, dones = \
            np.stack(states), np.stack(actions), np.stack(rewards), np.stack(states_), np.stack(dones)
        
        states  =   torch.tensor( states    ).float().to(self.network.device)
        actions =   torch.tensor( actions   ).long().to(self.network.device)
        rewards =   torch.tensor( rewards   ).float().to(self.network.device)
        states_ =   torch.tensor( states_   ).float().to(self.network.device)
        dones   =   torch.tensor( dones     ).to(self.network.device)

        qValues = self.network(states)
        nextQValues = self.network(states_)

        batchIndecies = np.arange(batchSize, dtype=np.int64)

        '''HERE: the names of these variables don't match the common names, but 
        the functionality is all here, and the equation is fine. 
        Except it is missing one part, but that will be explained.'''
        nowValues = qValues[batchIndecies, actions]    #   interpret the past
        futureValues = torch.max(nextQValues, dim=1)[0]    #   interpret the future
        futureValues[dones] = 0.0   #   ignore future actions if there will 
                                    #   be no future actions anyways
        trueValuesOfNow = rewards + futureValues    #   same temporal difference
        loss = self.network.loss(trueValuesOfNow, nowValues)

        loss.backward()
        self.network.optimizer.step()

'''FINE: this "main agent loop" is actually really close to the common drl code.'''
if __name__ == '__main__':
    env = gym.make('CartPole-v1').unwrapped
    agent = Agent(lr=0.001, inputShape=(4,), numActions=2)
    BATCH_SIZE = 64
    '''HERE: the memory deserves an upgrade'''
    memory = []

    highScore = -math.inf
    episode = 0
    numSamples = 0
    while True:
        done = False
        state = env.reset()

        score, frame = 0, 1
        while not done:
            # env.render()

            action = agent.chooseAction(state)
            state_, reward, done, info = env.step(action)

            '''HERE: the memory will change so this will change a bit'''
            transition = [state, action, reward, state_, done]
            memory.append(transition)
            
            agent.learn(memory, BATCH_SIZE)
            state = state_

            numSamples += 1

            score += reward
            frame += 1

        highScore = max(highScore, score)

        print(( "total samples: {}, ep {}: high-score {:12.3f}, "
                "score {:12.3f}").format(
            numSamples, episode, highScore, score, frame))

        episode += 1

Upgrades

The three aspects of this code most in need of improvement are the experience replay buffer, the TD code in the learn function, and the exploration strategy.

Exploration Strategy

At the moment the agent picks a random action 10% of the time. This forces the agent to try different actions. Regular exploration ensures enough data is collected for the neural network to learn the environment's reward function. Mainly, it pushes the agent into circumstances it wouldn't find itself in otherwise. Exploration is covered more deeply in the DQLearning Tutorial and the Actor Critic Tutorial. The primary issue with the code as-is is the exploration rate: it doesn't change. As the Q-values become more refined, the greedy action is often a better choice for the agent. If the exploration rate remains high, the agent might never get to refine its strategy. Also, the code has the word asparagus in it.

class Agent(): # NEW AND IMPROVED
    def __init__(self, lr, inputShape, numActions):
        self.network = Network(lr, inputShape, numActions)

        # exploration parameters
        self.epsilon = 0.1            # chance of random action
        self.epsilon_decay = 0.00005  # how much the chance shrinks each step
        self.epsilon_min = 0.001      # minimum for the chance, so you never fully stop exploring

    def chooseAction(self, observation):
        if np.random.random() < self.epsilon: # generate a num between 0.0 and 1.0 to "roll"
            action = random.randint(0, 1)
        else: # dont bother doing all that torch stuff if you're just gonna choose a random
            state = torch.tensor(observation).float().detach()
            state = state.to(self.network.device)
            state = state.unsqueeze(0)

            qValues = self.network(state)
            action = torch.argmax(qValues).item()
        return action

The new improved code uses the epsilon-greedy exploration strategy. It works just like the old exploration strategy:

chanceOfAsparagus = random.randint(1, 10)
if chanceOfAsparagus == 1:  #   10% chance
    action = random.randint(0, 1)

Except instead of a fixed 10% chance, the chance starts high and shrinks until it hits a minimum. Why is it called epsilon? More math history or something. The exploration rate is generally denoted by the ϵ character, epsilon... It was already standard by the time of the 1998 Sutton and Barto book on reinforcement learning, and it shows up earlier than that. Anyways, epsilon is a number between 0 and 1 that should start high and slowly become smaller. Generate a random number between zero and one, then compare it to the exploration threshold, epsilon. If epsilon is 0.9, then 90% of the time the random number will be smaller than epsilon, so 90% of the time the action will be random. Epsilon-greedy is probably the most common exploration strategy. More generally, there's usually some way the actions get randomized. You will see epsilon-greedy all over reinforcement learning.
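
If you want to convince yourself the roll behaves the way I just described, here's a quick standalone sanity check, just a throwaway script and not part of the agent code:

import numpy as np

epsilon = 0.1
rolls = np.random.random(100000)    # a pile of rolls between 0.0 and 1.0
print((rolls < epsilon).mean())     # prints roughly 0.1, i.e. about 10% of actions would come out random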

Common Concerns

Don't forget to actually shrink epsilon, and cap it at the minimum value. You can do it periodically, but I most commonly see the decay done once per learn step, at the bottom of the learn function.

def learn(self, memory, batchSize):
    ... # bla bla bla the rest of the learn function above
    trueValuesOfNow = rewards + futureValues    #   same temporal difference
    loss = self.network.loss(trueValuesOfNow, nowValues)

    loss.backward()
    self.network.optimizer.step()

    '''SHRINK EPSILON HERE'''
    self.epsilon -= self.epsilon_decay  # shrink
    if self.epsilon < self.epsilon_min: # clamp
        self.epsilon = self.epsilon_min

Greater Than Less Than

Make sure to get the direction of the < vs > correct when comparing the roll against epsilon to pick a random action. If you get it backwards, your exploration chance is inverted.

# epsilon = 0.1

# this is 10% chance
if np.random.random() < self.epsilon:   
    action = random.randint(0, 1)

# this is 90% chance
if np.random.random() > self.epsilon:   
    action = random.randint(0, 1)

And worse, if it is backwards, your exploration chance will actually increase instead of decrease when you shrink epsilon. As you can imagine, this results in terrible scores, as the agent will slowly take more and more random moves.

When In Doubt, Print

If the agent is being dumb, print out epsilon at each step, or episode, to make sure it is doing what you want.
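
For example, something like this at the bottom of the episode loop would do it (using the agent and episode variables from the main loop):

print("ep {}: epsilon {:6.4f}".format(episode, agent.epsilon))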

How To Set Settings

That epsilon_decay number I picked seems kinda random doesn't it?

self.epsilon = 0.1    
self.epsilon_decay = 0.00005  
self.epsilon_min = 0.001

If you set these wrong, the agent won't learn at all. You might think it's an issue with some other part of your code, which could send you on an hours- or days-long bug hunt. Luckily, these new hyperparameters can be set to match the old code exactly, giving a starting point you can be confident will work.

self.epsilon = 0.1    
self.epsilon_decay = 0.0
self.epsilon_min = 0.1

If epsilon is 0.1 and the decay rate is 0.0, then it is a constant, flat 10% chance of a random action. This should yield exactly the same results as the old code. (It does. I tested it.) So you can start from this, then slowly raise the decay and shrink the minimum. Something like self.epsilon_decay = 0.00000001 might be a bit small of a decay rate, but you can also aim epsilon so that it hits a target minimum by a certain point in training (see the sketch below). The right epsilon and decay rate depend heavily on the environment, though. One of the benefits of using this standard epsilon-greedy implementation is that you can now reuse other people's epsilon-greedy settings. Yay.
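
Here's one rough way to aim it, a sketch rather than anything from the original code: decide how many learn steps you want exploration to last, then solve for the per-step decay. The starting value and step count below are made-up numbers purely for illustration.

epsilon_start = 1.0             # hypothetical starting value
epsilon_min = 0.001
target_learn_steps = 50000      # hypothetical; depends entirely on the environment

epsilon_decay = (epsilon_start - epsilon_min) / target_learn_steps
print(epsilon_decay)            # subtract this from epsilon once per learn step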

The Future of Exploration
Epsilon-greedy is not perfect. Notice epsilon can only decrease, never increase. This could be considered a feature, but also a flaw. It makes assumptions about how the best strategy is developed, and about how unchanging the environment is. When a human finds that it is not getting good results, it actively tries new strategies (epsilon goes up). If a human finds its current strategy to be paying off, it won't try new things (epsilon goes down). Real animals use adaptive exploration rates. So, maybe the agent's exploration rate should increase if reward is unexpectedly low. Maybe it should change if the reward stays the same for a while. Perhaps it should decrease based on cues from the environment, and should be a function learned across environments with a neural network. It's a whole other upgrade just waiting to be discovered.
Another issue with this exploration strategy is that it explores by randomizing action, as opposed to randomizing the goal. Humans practice tasks often by learning similar modified versions of the task. Even if somebody doesn't know what optimal chess play looks like, they still know they will get better at chess by playing alternate versions, and chess minigames. Sometimes these minigames have different goals than normal chess, but that doesn't make them misaligned with learning some of the same skills one would use in regular chess. For reinforcement learning this is also the case.
For a mechanical task, I was thinking along the lines of practicing shooting an arrow at a target, by training to miss the target by a certain amount on purpose. A big advantage of this feature is that it allows for training the agent on arbitrary goals, preventing overfitting to the main goal of the environment. Which would you rather have, an agent that can hit any target with an arrow? Or an agent that can only hit one? This goal exploration effect could be achieved by micromanaging which environments the agent plays, but it would be much more valuable to build it into the agent as an exploration strategy. The agent would sometimes try different goals, and ignore the environmental reward. I think I saw a paper along these lines where the goal is augmented to give the agent a better sense of the real primary goal. Another upgrade worth investigating.

Numpy Experience Replay

Maybe you noticed before, but the old code gets slower over time. This is partially due to the agent getting better at the game: as the agent gets better, episodes take more steps to end. However, if you let the game run for a few thousand episodes, you probably noticed the episodes start to get really slow even though the scores are roughly the same. (For the record, a few thousand episodes is not that many. DRL papers often consider agents trained on millions of samples.) What is the cause of this slowdown? Nothing in our agent seems to require more processing at a later episode than at an early episode. It should be a constant amount of processing per learn step... except for the code involving the experience replay buffer. This isn't too much of a problem for "reads", but the moment you try to "write" back to the buffer inside your learn step, runtime performance suffers badly. It starts to get unbearably slow.
Additionally, your options for batch size are limited by the sampling speed of your replay buffer. As the replay buffer gets longer, sampling from it gets slower (relative to numpy arrays, at least). We mostly avoided that effect by being careful up to this point, but it's better to have a high-powered tool to play with. A lot of the agent upgrades involve sampling data from the memory, computing something, and sometimes storing results back (regularly processing the memories and storing information about them). Doing this with python lists starts to become a performance problem, especially once there are some python for loops in there. ew :^) So think of this as an infrastructural investment. The numpy indexing tricks end up being incredibly convenient, and besides, it's the most common way to do memories anyways.
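
To make the for-loop point concrete, here's a rough standalone micro-benchmark sketch. The buffer size and the "rescale every stored reward" pass are made up purely for illustration, and the exact numbers will vary by machine; it just compares one whole-memory pass done with a python loop over a list against the same pass done on a numpy array.

import timeit
import numpy as np

N = 100000
listRewards = [1.0] * N                         # python-list style storage
arrayRewards = np.ones(N, dtype=np.float32)     # numpy style storage

# a made-up "process the whole memory" pass: rescale every stored reward
def processList():
    for i in range(N):
        listRewards[i] = listRewards[i] * 0.99

def processArray():
    arrayRewards[:] = arrayRewards * 0.99

print("python list:", timeit.timeit(processList, number=100))
print("numpy array:", timeit.timeit(processArray, number=100))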

So, this:

memory = []

becomes:

class ReplayBuffer():  # NEW CODE
    def __init__(self, maxSize, stateShape):
        self.memSize = maxSize
        self.memCount = 0

        self.stateMemory        = np.zeros((self.memSize, *stateShape), dtype=np.float32)
        self.actionMemory       = np.zeros( self.memSize,               dtype=np.int64)
        self.rewardMemory       = np.zeros( self.memSize,               dtype=np.float32)
        self.nextStateMemory    = np.zeros((self.memSize, *stateShape), dtype=np.float32)
        self.doneMemory         = np.zeros( self.memSize,               dtype=np.bool_)

    def storeMemory(self, state, action, reward, nextState, done):
        memIndex = self.memCount % self.memSize 
        
        self.stateMemory[memIndex]      = state
        self.actionMemory[memIndex]     = action
        self.rewardMemory[memIndex]     = reward
        self.nextStateMemory[memIndex]  = nextState
        self.doneMemory[memIndex]       = done

        self.memCount += 1

    def sample(self, sampleSize):
        memMax = min(self.memCount, self.memSize)
        batchIndecies = np.random.choice(memMax, sampleSize, replace=False)

        states      = self.stateMemory[batchIndecies]
        actions     = self.actionMemory[batchIndecies]
        rewards     = self.rewardMemory[batchIndecies]
        nextStates  = self.nextStateMemory[batchIndecies]
        dones       = self.doneMemory[batchIndecies]

        return states, actions, rewards, nextStates, dones

Notice how much code this adds. There's a reason I don't encourage people to start with this. I apologize for inflating your code now, but it's time.

Struct Of Arrays

Each piece of the transition is stored in its own array. You pre-allocate the arrays to a fixed size when you create the memory.

self.stateMemory        = np.zeros((self.memSize, *stateShape), dtype=np.float32)
self.actionMemory       = np.zeros( self.memSize,               dtype=np.int64)
self.rewardMemory       = np.zeros( self.memSize,               dtype=np.float32)
self.nextStateMemory    = np.zeros((self.memSize, *stateShape), dtype=np.float32)
self.doneMemory         = np.zeros( self.memSize,               dtype=np.bool_)

Notice the names of these arrays correspond to the same SARS you are familiar with.

Class Inputs

class ReplayBuffer():  # NEW CODE
    def __init__(self, maxSize, stateShape):
        self.memSize = maxSize
        self.memCount = 0

The class takes in the max size of the buffer, and the shape of the states.
Ex: cartpole has a state shape of 4 numbers, so stateShape should be (4,). If you just pass in stateShape=4 it won't work. That's because there is some tuple unpacking going on in the arrays. (See the *stateShape in the array allocation.)

In python (4,) and (4) are not the same thing. Sounds crazy, right? It's a noob trap: (4) is just the integer 4 with parentheses around it, while (4,) is a one-element tuple. Go ahead and try it in the terminal: print((4)) vs. print((4,))
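
Here's a tiny throwaway snippet to make the difference concrete; the buffer shape below is made up just for illustration:

import numpy as np

print((4))      # 4     -> just the integer 4
print((4,))     # (4,)  -> a one-element tuple

stateShape = (4,)
buffer = np.zeros((5, *stateShape), dtype=np.float32)   # same as np.zeros((5, 4))
print(buffer.shape)     # (5, 4)
# np.zeros((5, *4)) would raise a TypeError, because a plain int can't be unpacked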

Sampling

Getting the transitions out isn't so difficult, and most importantly, it takes the same amount of time, regardless of how big the memory is.

batchIndecies = np.random.choice(
    memMax, sampleSize, replace=False)

states      = self.stateMemory[batchIndecies]
actions     = self.actionMemory[batchIndecies]
...

First pick some random indices, and then just use the numpy indexing magic to get all the right elements out at once.
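
Here's a tiny sketch of that indexing magic on its own, with made-up numbers:

import numpy as np

stateMemory = np.arange(20, dtype=np.float32).reshape(10, 2)    # pretend 10 stored states of 2 numbers each
batchIndecies = np.random.choice(10, 4, replace=False)          # 4 random row indices, no repeats

print(batchIndecies)                # e.g. [7 2 9 0]
print(stateMemory[batchIndecies])   # the 4 matching rows, pulled out at once, in that order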

Indexing Complications

You may have noticed a few weird lines.

self.memCount = 0
# and 
memIndex = self.memCount % self.memSize
# and
memMax = min(self.memCount, self.memSize)

The way this replay buffer works is that the arrays are given a length beforehand, so we have to keep track of how many memories are currently stored to know where to put new transitions. That's what self.memCount += 1 is for.

What happens if you store memories after the buffer is full?

# the index rolls over back to the beginning.
memIndex = self.memCount % self.memSize 

#  this overwrites the oldest memory
self.stateMemory[memIndex]      = state
self.actionMemory[memIndex]     = action
self.rewardMemory[memIndex]     = reward
self.nextStateMemory[memIndex]  = nextState
self.doneMemory[memIndex]       = done

Before the memory has been filled all the way, you have to avoid sampling from the parts of the arrays that haven't had data assigned to them yet. They just hold the zeros they were initialized with, which are meaningless as transitions.

#  that should explain this line in sample()
memMax = min(self.memCount, self.memSize)
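
If the modulo and min() bookkeeping feels abstract, this little standalone loop shows the pattern with a tiny made-up memSize:

memSize = 5
for memCount in range(8):
    memIndex = memCount % memSize       # where the next transition gets written
    memMax = min(memCount, memSize)     # how many slots currently hold real data
    print(memCount, memIndex, memMax)
# memIndex goes 0 1 2 3 4 0 1 2 ... wrapping back around,
# and memMax stops growing once the buffer is full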

Now To Use It

To use the new replay buffer we have to do some refactoring. First, remove it from the main loop entirely and put the replay buffer inside the agent. This isn't the only way to do it, but I feel like doing it this way. I'm writing the tutorial, so I am your god now. You have to do what I say. Give me your money.

class Agent():
    def __init__(self, lr, inputShape, numActions):
        self.network = Network(lr, inputShape, numActions)
        self.memory = ReplayBuffer(maxSize=100000, stateShape=inputShape)
        self.batchSize = 64

Also we need to wrap the memory store function so you can call it from the agent.

#    this function is in the agent class
def storeMemory(self, state, action, reward, nextState, done):
    self.memory.storeMemory(state, action, reward, nextState, done)

Inside the main function it's a little bit cleaner now.

...
score, frame = 0, 1
while not done:
    # env.render()

    action = agent.chooseAction(state)
    state_, reward, done, info = env.step(action)
    agent.storeMemory(state, action, reward, state_, done)  #   use the wrapped memory store function
    agent.learn()   #   no more arguments go into learn()
                    #   the agent has everything it needs
    state = state_
...

Learn Function Adjustments

So that's all setup... now to fix the learn function.

def learn(self):    #   no more arguments; the agent has everything it needs
    if self.memory.memCount < self.batchSize:   #   this changed, but does the same thing
        return

    ''' this stuff gets replaced with the ReplayBuffer sample function, 
        which basically does the same thing internally. 
        just without the stacking and python choice'''
    # randomMemories = random.choices(memory, k=batchSize)
    # memories = np.stack(randomMemories)
    # states, actions, rewards, states_, dones = memories.T
    # states, actions, rewards, states_, dones = \
    #     np.stack(states), np.stack(actions), np.stack(rewards), np.stack(states_), np.stack(dones)
    states, actions, rewards, states_, dones = self.memory.sample(self.batchSize)

    #   still need to pass the stuff to the gpu
    states  = torch.tensor(states , dtype=torch.float32).to(self.network.device)
    actions = torch.tensor(actions, dtype=torch.long   ).to(self.network.device)
    rewards = torch.tensor(rewards, dtype=torch.float32).to(self.network.device)
    states_ = torch.tensor(states_, dtype=torch.float32).to(self.network.device)
    dones   = torch.tensor(dones  , dtype=torch.bool   ).to(self.network.device)
    #   PEP 8 python style guidelines and Google don't want you to format things like this.
    #   but I'm god, and this is my tutorial dimension. Mr. Google has no power here
...

The rest of the learn function code is compatible with the new memory. Almost... There are a couple of lines that need changing. And while we are at it, let's try some more standard naming conventions.

'''OLD CODE'''
qValues = self.network(states)
nextQValues = self.network(states_)

batchIndecies = np.arange(self.batchSize, dtype=np.int64)

nowValues = qValues[batchIndecies, actions]    #   interpret the past
futureValues = torch.max(nextQValues, dim=1)[0]    #   interpret the future
futureValues[dones] = 0.0   #   ignore future actions if there will 
                            #   be no future actions anyways
trueValuesOfNow = rewards + futureValues    
loss = self.network.loss(trueValuesOfNow, nowValues)  #   same temporal difference

'''NEW CODE'''
batchIndices = np.arange(self.batchSize, dtype=np.int64)  # I learned how to spell indices
qValue = self.network(states)[batchIndices, actions]

qValues_ = self.network(states_)        #   values of all actions
qValue_ = torch.max(qValues_, dim=1)[0] #   extract greedy action value
qValue_[dones] = 0.0                    #   filter out post-terminal states

qTarget = rewards + self.gamma * qValue_      
loss = self.network.loss(qTarget, qValue)    #   temporal difference
...

It's a little more compact, but more importantly other people who look at your code will recognize the idioms.
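
One of those idioms deserves a closer look if you haven't seen it before: indexing with [batchIndices, actions] pulls out one q-value per row, the value of the action that was actually taken. Here's a small standalone sketch with made-up numbers (torch.arange is used for the row indices here; the agent code uses np.arange, which torch also accepts as an index):

import torch

qValues = torch.tensor([[1.0, 2.0],
                        [3.0, 4.0],
                        [5.0, 6.0]])        # batch of 3 states, 2 actions each
actions = torch.tensor([1, 0, 1])           # the action taken in each of those states
batchIndices = torch.arange(3)

print(qValues[batchIndices, actions])       # tensor([2., 3., 6.]) -> one value per row
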
Also, you may have noticed gamma is back, which you might remember from the actor-critic tutorial. I didn't explain it much then, but basically it's a bias towards the present. Without it you might get circumstances where the agent could wait indefinitely for some anticipated reward. Realistically, that's never going to happen. A more likely scenario is that the agent can't tell which route to the reward is longer, or takes more time. Either way, hey, now you have another hyperparameter to worry about. Lucky you. Picking gamma is a whole separate topic to get into, but generally a gamma value of 0.99 is fine. I've never had to change it. In some environments a tuned lower gamma improves performance. By the way, don't forget to put gamma into the agent class definition.

class Agent():
    def __init__(self, lr, inputShape, numActions):
        self.network = Network(lr, inputShape, numActions)
        self.memory = ReplayBuffer(maxSize=100000, stateShape=inputShape)
        self.batchSize = 64
        self.gamma = 0.99   #   did i really need to show you this?
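
To get a feel for what gamma = 0.99 actually means, here's a quick throwaway calculation: a reward k steps in the future gets weighted by gamma to the power k.

gamma = 0.99
for k in [1, 10, 100, 500]:
    print(k, gamma ** k)
# roughly 0.99, 0.90, 0.37, 0.007 -> rewards ~100 steps away still matter,
# rewards ~500 steps away are nearly invisible to the agent
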
Learned Gamma

This is another case where a hyperparameter imposes assumptions about the nature of the environment. Gamma should probably vary during the agent's life. Maybe it should be learned with a neural network across environments, or just tuned based on TD error and state familiarity. It is yet another upgrade idea that facilitates a new type of adaptation.

Minimum Memory Fullness

Something you will see if you peruse the OpenAI Baselines is a minimum replay buffer fullness. They seem to set it to 20,000 samples for some reason. The agent won't learn from any samples until the memory hits the minimum threshold. It is bad to sample from a barely filled replay buffer: an agent that does that will end up seeing the first few memories tons of times, so they will be overrepresented in the neural network. Also, the buffer isn't very diverse at first, so there is even more risk of overfitting. Setting a minimum memory size reduces both of these issues. Though it does mean the agent will not be doing any learning for the earliest portion of episodes, so don't expect to see any good performance while the buffer is still filling up. In my personal experience, adding a minimum buffer fullness really stabilizes agent performance; there are fewer weird sudden drops in score. My guess is that without a minimum memory the network learns a bunch of noise that it has to unlearn later on. Play with it to see if you can replicate that issue. (Try setting the minimum to a low number like the batch size, then up to a higher number like 2048.)

class Agent():
    def __init__(self, lr, inputShape, numActions):
        self.network = Network(lr, inputShape, numActions)
        self.memory = ReplayBuffer(maxSize=10000, stateShape=inputShape)
        self.minMemorySize = 1024
...
    def learn(self):
        if self.memory.memCount < self.minMemorySize:   #   this replaced self.batchSize
            return
...

Full Code

Welcome your code to the 22nd century.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gym       
import math
import numpy as np
import random

class ReplayBuffer():
    def __init__(self, maxSize, stateShape):
        self.memSize = maxSize
        self.memCount = 0

        self.stateMemory        = np.zeros((self.memSize, *stateShape), dtype=np.float32)
        self.actionMemory       = np.zeros( self.memSize,               dtype=np.int64  )
        self.rewardMemory       = np.zeros( self.memSize,               dtype=np.float32)
        self.nextStateMemory    = np.zeros((self.memSize, *stateShape), dtype=np.float32)
        self.doneMemory         = np.zeros( self.memSize,               dtype=np.bool_  )

    def storeMemory(self, state, action, reward, nextState, done):
        memIndex = self.memCount % self.memSize 
        
        self.stateMemory[memIndex]      = state
        self.actionMemory[memIndex]     = action
        self.rewardMemory[memIndex]     = reward
        self.nextStateMemory[memIndex]  = nextState
        self.doneMemory[memIndex]       = done

        self.memCount += 1

    def sample(self, sampleSize):
        memMax = min(self.memCount, self.memSize)
        batchIndecies = np.random.choice(memMax, sampleSize, replace=False)

        states      = self.stateMemory[batchIndecies]
        actions     = self.actionMemory[batchIndecies]
        rewards     = self.rewardMemory[batchIndecies]
        nextStates  = self.nextStateMemory[batchIndecies]
        dones       = self.doneMemory[batchIndecies]

        return states, actions, rewards, nextStates, dones

class Network(torch.nn.Module):
    def __init__(self, alpha, inputShape, numActions):
        super().__init__()
        self.inputShape = inputShape
        self.numActions = numActions
        self.fc1Dims = 1024
        self.fc2Dims = 512

        self.fc1 = nn.Linear(*self.inputShape, self.fc1Dims)
        self.fc2 = nn.Linear(self.fc1Dims, self.fc2Dims)
        self.fc3 = nn.Linear(self.fc2Dims, numActions)

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.loss = nn.MSELoss()
        # self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.device = torch.device("cpu")
        self.to(self.device)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

class Agent():
    def __init__(self, lr, inputShape, numActions):
        self.network = Network(lr, inputShape, numActions)
        self.memory = ReplayBuffer(maxSize=10000, stateShape=inputShape)
        self.minMemorySize = 1024
        self.batchSize = 64
        self.gamma = 0.99

        self.epsilon = 0.1
        self.epsilon_decay = 0.00005
        self.epsilon_min = 0.001

    def chooseAction(self, observation):
        if np.random.random() < self.epsilon:
            action = random.randint(0, 1)
        else:
            state = torch.tensor(observation).float().detach()
            state = state.to(self.network.device)
            state = state.unsqueeze(0)

            qValues = self.network(state)
            action = torch.argmax(qValues).item()
        return action

    def storeMemory(self, state, action, reward, nextState, done):
        self.memory.storeMemory(state, action, reward, nextState, done)

    def learn(self):
        if self.memory.memCount < self.minMemorySize:
            return

        states, actions, rewards, states_, dones = self.memory.sample(self.batchSize)
        states  = torch.tensor(states , dtype=torch.float32).to(self.network.device)
        actions = torch.tensor(actions, dtype=torch.long   ).to(self.network.device)
        rewards = torch.tensor(rewards, dtype=torch.float32).to(self.network.device)
        states_ = torch.tensor(states_, dtype=torch.float32).to(self.network.device)
        dones   = torch.tensor(dones  , dtype=torch.bool   ).to(self.network.device)

        batchIndices = np.arange(self.batchSize, dtype=np.int64)
        qValue = self.network(states)[batchIndices, actions]

        qValues_ = self.network(states_)
        qValue_ = torch.max(qValues_, dim=1)[0]
        qValue_[dones] = 0.0

        qTarget = rewards + self.gamma * qValue_
        loss = self.network.loss(qTarget, qValue)

        self.network.optimizer.zero_grad()
        loss.backward()
        self.network.optimizer.step()

        self.epsilon -= self.epsilon_decay
        if self.epsilon < self.epsilon_min:
            self.epsilon = self.epsilon_min

if __name__ == '__main__':
    env = gym.make('CartPole-v1').unwrapped
    agent = Agent(lr=0.001, inputShape=(4,), numActions=2)

    highScore = -math.inf
    episode = 0
    numSamples = 0
    while True:
        done = False
        state = env.reset()

        score, frame = 0, 1
        while not done:
            # env.render()

            action = agent.chooseAction(state)
            state_, reward, done, info = env.step(action)
            agent.storeMemory(state, action, reward, state_, done)
            agent.learn()
            
            state = state_

            numSamples += 1

            score += reward
            frame += 1

        highScore = max(highScore, score)

        print(( "ep {:4d}: high-score {:12.3f}, "
                "score {:12.3f}, epsilon {:5.3f}").format(
            episode, highScore, score, agent.epsilon))

        episode += 1

It's about 30 lines longer now, and there's a lot more that can go wrong: a few extra lines for epsilon-greedy, and the rest for the experience replay. But this prep should make the rest of the upgrade tutorials substantially easier, for me to write, for you to read, and ultimately to graduate from. Make sure to run the code to ensure it still gets some good scores. Play with the new settings to get a feel for how they affect performance.

Moving Forward

Honestly, this just isn't a very philosophical tutorial. It's time to move on to the dumptruck of upgrades awaiting your agent.
Good luck.
