This tutorial is intended to be your very first tutorial in deep reinforcement learning.
In it you will make a program that learns to play lunar lander from AI Gym.
AI Gym is a game world for AI. If you don't know it, I've got a quickie tutorial for you here.
It's easy to install with pip, and the main website has some getting started code examples that are only a few
lines.
Don't feel trapped in AI Gym though. You can use your agent to do other things too, like driving cars or balancing
robots
in the real world.
If you don't know anything about neural networks or python you will find this tutorial fascinating, but probably a
bit abstract.
If you know python but don't know about neural networks, I highly recommend the book Grokking Deep
Learning.
That book will take you from zero to hero with neural networks, and can be easily finished in a month (two max) if
you find it fun. It also
doesn't assume you're a math god beforehand, as opposed to 90% of machine learning resources that will. (I hate
it. It's just
DRM for education.)
Anyways, we will be using pytorch for the neural networks stuff. I have a quick tutorial on pytorch here.
If you already know keras or tensorflow it should be fairly straightforward to follow this tutorial.
For you experts out there, yes, technically, we don't need neural networks to do this.
There are endless options for other function approximators, generative or discriminative, and many of them work just fine. Maybe
we can try some in different tutorials.
The Actor-Critic method is a reinforcement learning algorithm.
It will power our agent. The agent is what lives in our environment and
makes decisions. It sends its action to the environment,
and the environment sends back the state of the environment, and a
reward representing the goodness of that state. The reward is an abstract number
that depends on the environment. It could be how many coins the agent got in mario, how far right mario ran,
or how many people it ran over if it's a self driving car. In that case the reward might be negative. :^)
An agent can have any logic inside to make decisions, but hopefully it makes better and better
decisions so that it can maximize the reward it gets.
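In code, that conversation looks roughly like this. Just a sketch: the env and agent objects here stand in for the real environment and agent we build later in this tutorial.
# the agent <-> environment handshake, roughly
observation = env.reset()                               # the environment hands over its starting state
done = False
while not done:
    action = agent.chooseAction(observation)            # the agent picks an action
    observation, reward, done, info = env.step(action)  # the environment answers with a new state and a reward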
This time we are making an "Actor-Critic Agent".
It has two primary components, an action picker and a
state value estimator.
The action picker takes in a frame of information (camera input, game state, or even past actions) and outputs a number for each action the agent is allowed to take. The numbers it outputs are used to pick the agent's actions.
Ex: The actor is given a picture of a poisoned burger.
It outputs 3 values.
Eat: 0.8
Dont Eat: 0.3
Be Suspicious: 0.001
The Agent picks the highest one. :()
The state value estimator is very similar to the
action picker. It takes in the same exact frame of information as the
action picker, but outputs a single number
representing the value of the input state.
Ex: The critic gets a picture of you eating a
poisoned burger.
Value: -10.0 :(
Ex: The critic gets a picture of a man handing you an
antidote.
Value: 10.0 :)
Ex: The critic gets a picture of divorce papers.
Value: :^)
The predictive function behind the action picker or the state value estimator can be whatever you want (ex: linear regression, nearest neighbor, genetic algorithm, random forest). However, I will be using a single neural network for both of them.
class ActorCriticNetwork(torch.nn.Module): # ~here is my handle~
def __init__(self, lr, inputDims, numActions):
super().__init__()
self.numActions = numActions
# fc means "fully connected layer"
self.fc1Dims = 1024 # size of hidden layer 1
self.fc2Dims = 512 # size of hidden layer 2
# backbone network
self.fc1 = nn.Linear(*inputDims, self.fc1Dims) # hidden layer 1
self.fc2 = nn.Linear(self.fc1Dims, self.fc2Dims) # hidden layer 2
# tail networks ~here is my spout~
self.actor = nn.Linear(self.fc2Dims, self.numActions) # here is the actor
self.critic = nn.Linear(self.fc2Dims, 1) # here is the critic
# pytorch stuff
self.optimizer = optim.Adam(self.parameters(), lr=lr)
self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
self.to(self.device)
def forward(self, observation): # ~tip me over and pour me out~
state = torch.tensor(observation).float().to(self.device)
x = F.relu(self.fc1(state))
x = F.relu(self.fc2(x))
policy = self.actor(x) # actor outputs one number for each action, a vector
value = self.critic(x) # critic just puts out the value, one number
return policy, value
The network class takes in the learning rate (often called alpha), the game frame shape (the input to the actor-critic network), and the number of actions (which happens to be the size of the actor's output).
A convenient way to do an actor-critic network is by making them share a backbone network. And it makes sense, the same features that are useful for determining state value are probably also useful for determining which action to pick next.
The values used to pick actions are known as the
policy.
Why that word? Math history or something. Not a terrible word for it
though.
"It is my policy to do THIS under THESE specific circumstances." The
standard symbol for a policy is 𝜋 (pi, like the 3.14 one).
In reinforcement learning papers anytime you see 𝜋 it means
policy.
For vanilla Deep Q learning it's as easy as picking the highest actor action output. :)
def chooseAction(state):
policy, _ = actorCriticNetwork(state)
action = torch.argmax(policy).item() # pick here
return action
But this isn't Vanilla Deep Q Learning. This is the Actor-Critic method.
>:()
The actor outputs probabilities of actions instead of
actual action values.
So the Eat: 0.8 from earlier, yeah that's like an 80%
chance of eating the burger.
If this were vanilla Deep Q Learning, a value of 0.8 being the highest would mean a 100%
chance of eating the poison burger. :(
Lets choose an action the Actor Critic way.
We start by passing the state from the environment into the actor-critic network.
I don't know why this convention exists, but people often call the state the "observation".
Maybe because it's what the agent "observes"?
def chooseAction(self, observation):
    policy, _ = self.actorCritic.forward(observation)
    policy = F.softmax(policy, dim=0)
    actionProbs = torch.distributions.Categorical(policy)
    action = actionProbs.sample()
    self.logProbs = actionProbs.log_prob(action) # saving this value for later
    return action.item()
We have to softmax the output action probabilities from the networks
output policy.
Why are the outputs not good enough as is? Well, it's just because the
policy is supposed to be probabilities of each action being taken.
Probabilities need to add up to 1.
Consider the following policy:
Eat: 0.8
Dont Eat: 0.3
Be Suspicious: 0.001
That's 0.8 + 0.3 + 0.001 = 1.101.
110.1% with all the actions combined.
"110% chance of rain today."
Doesn't make any sense right?
Softmax takes in a list of numbers and makes them add up to 1. Which
is exactly what we want.
# softmax example
a = torch.tensor([0.8, 0.3, 0.001])
b = F.softmax(a, dim=0)
b is now tensor([0.4863, 0.2950, 0.2187])
You may have noticed that Eat has now changed size
relative to Dont Eat. The ratio 0.8 / 0.3 is not the same as
0.4863 / 0.2950.
Softmax mangles the relative probability scales just a little bit.
But, the network figures it out after a while. More importantly,
thanks to softmax no matter what weird shit the actor network outputs
we have sane action probability ranges.
You put 80 red marbles, 30 blue marbles, and a glass shard from a
green marble in a bag. We have 3 distinct
categories of marbles. One might even say it's a
categorical distribution of marbles.
You pick one marble from the bag. One might even say you
sample the bag.
In this case sample() returns an index because it's from a
categorical distribution. But from other distributions it might
return a floating point value. Another common distribution used for
picking actions in reinforcement learning is a normal distribution.
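If you want to see the marble bag as actual code, here's a tiny sketch (the counts are made up, and Categorical will happily normalize them into probabilities for you):
# the marble bag as a categorical distribution (counts are made up)
import torch
marbleCounts = torch.tensor([80.0, 30.0, 1.0])       # red, blue, green-ish shard
bag = torch.distributions.Categorical(marbleCounts)  # normalizes the counts into probabilities
print(bag.probs)     # tensor([0.7207, 0.2703, 0.0090])
print(bag.sample())  # index of the marble you pulled out, usually tensor(0) for red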
That logProbs value we saved is the probability of the specific action we chose, but pushed through the
log function.
Why do this? Well, like everything in life, there is more than one
explanation, and some of them are more complicated than others.
Basically, you'll understand when you're older.
You'll also understand it right now:
The important thing to take from this though is that you are gonna
need the probability of the chosen action for teaching the network.
And smoothing and shrinking the distribution out gives the network
an easier time.
Specifically, you're gonna multiply the log-probability of the chosen
action by how wrong the critic's state value estimate was. Your natural intuition for why
this works is probably pretty reasonable. We'll get to the
explanation in a sec.
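To make the log part concrete, here's a tiny sketch (the probabilities are made up). log_prob(action) is just the log of the probability the distribution assigned to that action:
# log-probabilities, concretely
import torch
probs = torch.tensor([0.7, 0.2, 0.1])
dist = torch.distributions.Categorical(probs)
action = torch.tensor(0)
print(dist.log_prob(action))         # tensor(-0.3567), same as torch.log(probs[0])
print(torch.log(torch.tensor(1.0)))  # tensor(0.), a 100% confident "perfect" action has log probability 0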
Why is our actor putting out chances of actions? Seems kind of stupid doesn't it?
Imagine that every millisecond you are holding a knife you have a 10% chance of letting go of it.
That's dumb right? Why not just pick the highest action instead of sampling a distribution?
Well most of the time the next millisecond you will decide to hold the knife tightly again.
So statistically it works out. You will never completely drop the knife.
The Actor-Critic algorithm statistically makes good decisions, but it might make a different
decision given the same exact scenario a second time. It is non-deterministic.
In the real world with fuzzy inputs like muscle flexing and rotation angles, picking the wrong
action for small fraction of a second is not that big of an issue.
For a game where a single wrong button makes you lose, it kind of sucks.
Some reinforcement learning algorithms ARE deterministic. But that has its own downsides.
Non-deterministic algorithms naturally explore different actions by accident. They try new things sometimes.
Deterministic algorithms can get stuck, repeating the same action every time, until it gets punished into
oblivion.
Sometimes deterministic algorithms are so stubborn they will pick the same action until it has been punished so
much
that it will never choose that action ever again. ever. You will likely even see this happen if
you keep
playing with AI Gym and Deep Q Learning.
Deterministic algorithms often need some noise injected into the decision making process to help them be less
stubborn.
Luckily for us, Actor-Critic doesn't need that because it is non-deterministic.
The actor network and the critic network both have unique errors.
Each network is punished by their own respective error, and their
estimation of their respective responsibility improves. We get better
actions from the actor, and better value estimates from
the critic.
def learn(self, state, reward, nextState, done): # following the meta
    self.actorCritic.optimizer.zero_grad() # pytorch stuff: resets all the tensor derivatives
    # fetch values from the critic network
    _, criticValue = self.actorCritic.forward(state) # the value of now
    _, nextCriticValue = self.actorCritic.forward(nextState) # the value of the future
    # the temporal difference zone
    # the true value of now = now + future
    valueOfNow = reward + self.gamma * nextCriticValue * (1 - int(done))
    temporalDifference = valueOfNow - criticValue # oh how wrong we were
    # compute the error for our actor and critic networks
    actorLoss = -self.logProbs * temporalDifference # probability of chosen action times how wrong we were
    criticLoss = temporalDifference**2 # we don't care about which direction (the sign),
    #                                    just wanna minimize how wrong we were in total
    (actorLoss + criticLoss).backward()
    self.actorCritic.optimizer.step()
This is really important for reinforcement learning. You will see
variations of it all over the place.
Temporal Difference is a way of valuing the
present.
The true value of now includes the potential value of all future
states.
We can compute that in a literal way. Just add the value of now to
the value of the future.
We discount the future a little bit by multiplying it by
gamma which is normally 0.99 ish.
actualValueOfNow = reward + gamma * valueOfTheFuture
temporalDifference = actualValueOfNow - criticsGuess
A key difference though: notice how in our learn function we don't pass in any future rewards. That's because instead of using the ground truth future value, we just let the critic guess it.
_, nextCriticValue = actorCritic.forward(nextState)
valueOfNow = reward + gamma * nextCriticValue
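Here's the same idea with some made-up numbers plugged in:
# temporal difference with made-up numbers
reward = 1.0             # what the environment just gave us
criticValue = 4.0        # the critic's guess for the value of the current state
nextCriticValue = 5.0    # the critic's guess for the value of the next state
gamma = 0.99
valueOfNow = reward + gamma * nextCriticValue    # 1.0 + 0.99 * 5.0 = 5.95
temporalDifference = valueOfNow - criticValue    # 5.95 - 4.0 = 1.95, the critic underestimated by 1.95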
Gamma is known as the discount factor. In
reinforcement learning math you will see it as γ.
We have to discount the future a little bit. Future rewards are not
as valuable as rewards now. I will explain more after you have some
more context for these types of algorithms.
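Here's a quick sketch of what gamma does to a stream of rewards (the rewards are made up):
# discounting a stream of made-up rewards
gamma = 0.99
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]                            # reward 1 for 5 steps in a row
discounted = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted)  # ~4.90 instead of 5, rewards further in the future count for a little less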
(actorLoss + criticLoss).backward()
This might seem weird if you know pytorch. Why not just call backward() on actorLoss, then on criticLoss?
Well, because they share a backbone network, you can't do that naively. The first backward() call frees the saved
activations of the shared backbone, so the second call will complain about backwarding through the graph a second time.
Luckily for us the derivative of addition is 1.
So adding the losses together just multiplies each gradient by 1. Multiplying things by 1 doesn't change them,
and both gradients flow into the shared backbone in a single backward() call.
:^)
If you want to split your network in all sorts of crazy ways, this works for adding as many losses together as you want.
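Here's a toy sketch of that, with a single made-up weight standing in for the shared backbone, just to show the two gradients add up in one backward() call:
# gradients from summed losses just add up
import torch
w = torch.tensor(2.0, requires_grad=True)  # pretend this is one weight in the shared backbone
actorLoss = 3 * w                          # pretend actor loss, d(actorLoss)/dw = 3
criticLoss = 5 * w                         # pretend critic loss, d(criticLoss)/dw = 5
(actorLoss + criticLoss).backward()
print(w.grad)                              # tensor(8.) = 3 + 5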
If you want to see the way mathemagicians formalize temporal difference, here you go. I urge you to give the equations a look. I bet you can kinda get them. You're gonna need to wet your whistle on these a little bit if you want to go implement algorithms from professional reinforcement learning papers.
So, V^𝜋(s) means the value of state s according to our policy 𝜋.
Little t is used for time, so s_{t+1} is, logically, the state right after s_t.
Σ is just a for loop where you add the result to a running total.
So Σ_{t=0}^{10} t is this:
total = 0
for t in range(0, 10 + 1):
    total += t
The sum of the inside stuff where t goes from 0 to 10.
For the most part reinforcement learning math boils down to formalizing
these questions:
What's the value of now? What's the value of the future? What's the
value of an action? What's the total value of all the actions I took so
far? What's the total value of all the best possible actions?
So just like most neural network stuff we want the loss to go to zero.
But, in reinforcement learning it's a little more subtle than that. The
actor and critic both have different
losses. And they function fairly differently as well.
This is where we focus on these lines specifically:
actorLoss = -self.logProbs * temporalDifference
criticLoss = temporalDifference**2
For the critic, the loss is just the difference between what it thought the value of now was, and what the value of now actually turned out to be (the reward plus the discounted future value). It doesn't matter what direction the error is, + or -. We just want to minimize it. Hence squaring it to remove the sign.
For the actor punishment, we take the probability
of our chosen action, and multiply it into how wrong our state value
estimation was.
Why? There are two reasons.
Remember our action probabilities are between 0 and
1.
Remember we pass those probabilities into log() when we compute the
log-probabilities?
The graph of log(x) is 0 at x = 1. (go look at the graph now).
If our action choice was perfect, the probability of that action
should be 100% or 1.0 (That way we pick that action every time).
When you put the output from the actor of 1.0 into log(x) you get 0.
So the actor's goal is to pick an action such that
log(actorOutput) == 0, a perfect action.
That means the error should be designed such that if we did a
perfect action, error should be zero.
How do you know if the action was perfect? Well, the action was
perfect if the critic output the correct value for
the state. The action is treated as being as wrong as the
critic's error, scaled by the confidence of the
actor's choice.
This means our actions can only be as good as our value
estimation.
This can't be overstated. It has important implications for the
instability of
value based learning. I will discuss it more
another time.
Why the multiplication, and why the negative on the actor loss? Consider the following 4 cases:
(reward + gamma * nextValue) < criticValue # so...
smallNegativeValue = (reward + gamma * nextValue) - criticValue
If the critic overestimated, that means the action wasn't as good
as we thought it was. So we should shrink the probability. We have
a small negative temporal difference value, and multiplying it by the
log-probability of our action (with the sign flip in the actor loss)
gives a small nudge that shrinks that action's probability.
(reward + gamma * nextValue) > criticValue # so...
smallPositiveValue = (reward + gamma * nextValue) - criticValue
We need to increase that action's probability. It should move in the opposite
direction of the last example. That's why the sign is reversed. And the
number is small because it should be a small increase.
The remaining two cases are the same, just with different magnitudes of adjustment.
The point is that the action probability should be adjusted
proportionally to how wrong or right the
critic was, and in the correct direction. Negative
for shrinking, and positive for growing.
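Here's a toy sketch of just that sign behavior with made-up numbers (no real environment involved):
# sign behavior of the actor loss, with made-up numbers
import torch
logProb = torch.log(torch.tensor(0.8))  # we picked an action we were 80% confident in
tdOverestimate = torch.tensor(-1.95)    # critic overestimated: temporal difference is negative
tdUnderestimate = torch.tensor(1.95)    # critic underestimated: temporal difference is positive
print(-logProb * tdOverestimate)        # tensor(-0.4351): minimizing this loss shrinks the action's probability
print(-logProb * tdUnderestimate)       # tensor(0.4351): minimizing this loss grows the action's probability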
I want to give more specific examples, but in order to make the example simple enough
to make sense it becomes pretty impractical. Maybe someone has a good one somewhere.
In this particular case if you flip the sign your lander will get worse at landing over time
instead of better. Also if you multiply in a huge or tiny constant into the TD to change the magnitude, you
can see
the lander overcompensate and undercompensate when it tries to balance. It will eventually figure that out
though. The network
weights just adjust to be smaller or bigger.
A reinforcement learning agent is what learns about the
environment and chooses the actions.
There are many different types of agents, but this one
contains an actor and a critic.
An agent class doesn't need to explicitly exist but
doing it this way is fairly neat, and makes it easy to swap in different
agents in our main loop when we want to try out different ones.
To make the agent we just put the learn and chooseAction functions
together with the actor-critic network from earlier.
There really isn't anything new here. Just notice it's where we save our
logProb when we choose an action.
class ActorCriticAgent():
def __init__(self, lr, inputDims, numActions):
self.gamma = 0.99 # a common gamma value
self.actorCritic = ActorCriticNetwork(lr, inputDims, numActions)
self.logProbs = None # log of the probability of the last action the agent chose
def chooseAction(self, observation):
policy, _ = self.actorCritic.forward(observation)
policy = F.softmax(policy, dim=0)
actionProbs = torch.distributions.Categorical(policy)
action = actionProbs.sample()
self.logProbs = actionProbs.log_prob(action) # save it here
return action.item()
def learn(self, state, reward, nextState, done):
self.actorCritic.optimizer.zero_grad()
_, criticValue = self.actorCritic.forward(state)
_, nextCriticValue = self.actorCritic.forward(nextState)
reward = torch.tensor(reward, dtype=torch.float).to(self.actorCritic.device)
td = reward + self.gamma * nextCriticValue * (1 - int(done)) - criticValue
actorLoss = -self.logProbs * td
criticLoss = td**2
(actorLoss + criticLoss).backward()
self.actorCritic.optimizer.step()
All the hard parts are done now. The only thing left is to make our
agent, put it into an environment, and trap it in an
infinite loop so it can self improve until it becomes skynet. You can
create your own environment or use input from the real world, but for
demonstration let's use AI Gym. A place where AI can watch their macros
and get sick gains. (or fail horribly. Some of the provided AI Gym
environments are fairly difficult, and require much more complicated
algorithms than this one to solve.)
AI Gym environments are nice little simulated worlds, that happen to
return rewards and states just like our actor and
critic needs.
What a coincidence. :^) Anyways, put our agent into an
AI Gym environment and let it run for 20 minutes to 10 years.
import gym
import math

agent = ActorCriticAgent(lr=0.00001, inputDims=(8,), numActions=4)  # we wrote this earlier
env = gym.make("LunarLander-v2")
highScore = -math.inf
recordTimeSteps = math.inf
episode = 0
while True:  # keep starting new episodes forever
    observation = env.reset()  # observation is just a commonly used term for the environment state
    score, frame, done = 0, 1, False
    while not done:  # keep going until the episode is done
        env.render()  # draw it on your screen so you can watch
        action = agent.chooseAction(observation)  # we wrote this too
        nextObservation, reward, done, info = env.step(action)  # make the environment go one time step
        agent.learn(observation, reward, nextObservation, done)  # and this
        observation = nextObservation
        score += reward
        frame += 1
    recordTimeSteps = min(recordTimeSteps, frame)
    highScore = max(highScore, score)
    print(( "ep {}: high-score {:12.3f}, shortest-time {:d}, "
            "score {:12.3f}, last-episode-time {:4d}").format(
            episode, highScore, recordTimeSteps, score, frame))
    episode += 1
inputDims=(8,)
This is 8 for the lunar lander environment because the state is 8 floats representing angle, and distance to the target and whatnot. As long as your network is taking them all in it doesn't really matter what they are. Although, normalizing them between -1.0 and 1.0 can help a lot of the time. It's a tuple, not a number, so that if the input was pixels it could be a tuple of width and height of the incoming game pixels.
numActions=4,
This is the number of actions the lunar lander environment accepts. We only need one probability for each action the agent can take. Make sure if you switch to other environments to change your number of network outputs. You can get creative with this though. Wanna try normal distributions instead? You'll need 2 outputs per action.
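If you don't want to hardcode these numbers, you can ask the environment for them. A quick sketch using the standard gym space attributes:
# pulling the dimensions straight from the environment
import gym
env = gym.make("LunarLander-v2")
print(env.observation_space.shape)  # (8,)  -> inputDims
print(env.action_space.n)           # 4     -> numActions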
layer1Size=1024, layer2Size=512
I didn't pass in the network layer sizes for simplicity of example, but you can write your class so that you
can pass them into the constructor for "rapid" experimentation if you want. There wasn't too much
experimentation
on my part to get these particular sizes. I found someone using these layer sizes online somewhere.
I tried smaller ones but the agent never got good performance.
You could write a program to test varying neural network shapes and sizes and graph them out to find optimal
agent settings for this. A lot of machine learning papers do just that. It can be very time consuming though.
For a big agent it can be impractical, as it requires you to train your agent maybe 20 or more times to find
good settings.
Either way I might make a tutorial for it at some point. Just remember: layers too small and it won't learn,
or
won't have a brain big enough to learn complicated behaviour. Layers too big and it runs slow.
One of these is much worse than the other.
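If you do want to try that kind of sweep, here's a rough sketch. The averageScore helper is hypothetical, just for this example, and a handful of episodes per setting is nowhere near enough for a real comparison:
# rough sketch of a layer-size sweep
import gym

def averageScore(agent, env, episodes=50):  # hypothetical helper: train for a while, report the mean score
    total = 0.0
    for _ in range(episodes):
        observation, done, score = env.reset(), False, 0.0
        while not done:
            action = agent.chooseAction(observation)
            nextObservation, reward, done, info = env.step(action)
            agent.learn(observation, reward, nextObservation, done)
            observation = nextObservation
            score += reward
        total += score
    return total / episodes

env = gym.make("LunarLander-v2")
for layer1Size, layer2Size in [(128, 128), (512, 256), (1024, 512)]:
    agent = ActorCriticAgent(lr=0.00001, inputDims=(8,), numActions=4,
                             layer1Size=layer1Size, layer2Size=layer2Size)
    print(layer1Size, layer2Size, averageScore(agent, env))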
lr=0.00001,
You might notice the learning rate is rather small if you are familiar with other machine learning stuff.
Often I don't see reinforcement learning algorithms using the classic ML rates of 1e-3 and 1e-4. Instead I see really
really tiny alphas.
Some algorithms can handle more normal learning rates, but this one can not. It turns out
value based algorithms (that are concerned with assigning value to states),
such as our critic, are rather unstable. You can try making it higher. I tried.
The cool thing is that it learns differently with different learning rates. With a high alpha it will learn
how to balance the lander
almost instantly, because that reward is so prominent and obvious. The uncool part is that it will stop at
that and never learn how to land.
With a tiny learning rate it takes a pathetically long time to correctly balance the lander,
overcompensating and undercompensating the thrusters. When it finally can balance, it slowly inches its hover
closer and closer to the target over tens of episodes.
But it will eventually successfully land, and consistently. It just never gets that with the high learning
rate. Play with it.
You did it, it's done. Here's the full code with all the imports added and sassy comments removed.
Copy it into an editor and print the outputs of functions you don't understand.
If you don't have cuda, uncomment the torch.device("cpu")
line in the ActorCriticNetwork
class to enable cpu mode.
It will run much more slowly though so you might want to make the layer sizes smaller. Ex: layer1Size 128,
layer2Size 128
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
class ActorCriticNetwork(torch.nn.Module):
def __init__(self, lr, inputDims, numActions, fc1Dims=1024, fc2Dims=512):
super().__init__()
self.inputDims = inputDims
self.numActions = numActions
self.fc1Dims = fc1Dims
self.fc2Dims = fc2Dims
# primary network
self.fc1 = nn.Linear(*inputDims, self.fc1Dims)
self.fc2 = nn.Linear(self.fc1Dims, self.fc2Dims)
# tail networks
self.policy = nn.Linear(self.fc2Dims, self.numActions)
self.critic = nn.Linear(self.fc2Dims, 1)
self.optimizer = optim.Adam(self.parameters(), lr=lr)
self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# self.device = torch.device("cpu")
self.to(self.device)
def forward(self, observation):
state = torch.tensor(observation).float().to(self.device)
x = F.relu(self.fc1(state))
x = F.relu(self.fc2(x))
policy = self.policy(x)
value = self.critic(x)
return policy, value
class ActorCriticAgent():
def __init__(self, lr, inputDims, numActions, gamma=0.99, layer1Size=1024, layer2Size=512):
self.gamma = gamma
self.actorCritic = ActorCriticNetwork(lr, inputDims, numActions, layer1Size, layer2Size)
self.logProbs = None
def chooseAction(self, observation):
policy, _ = self.actorCritic.forward(observation)
policy = F.softmax(policy, dim=0)
actionProbs = torch.distributions.Categorical(policy)
action = actionProbs.sample()
self.logProbs = actionProbs.log_prob(action)
return action.item()
def learn(self, state, reward, nextState, done):
self.actorCritic.optimizer.zero_grad()
_, criticValue = self.actorCritic.forward(state)
_, nextCriticValue = self.actorCritic.forward(nextState)
reward = torch.tensor(reward).float().to(self.actorCritic.device)
td = reward + self.gamma * nextCriticValue * (1 - int(done)) - criticValue
actorLoss = -self.logProbs * td
criticLoss = td**2
(actorLoss + criticLoss).backward()
self.actorCritic.optimizer.step()
if __name__ == '__main__':
import gym
import math
from matplotlib import pyplot as plt
agent = ActorCriticAgent(
lr=0.00001, inputDims=(8,), gamma=0.99, numActions=4, layer1Size=1024, layer2Size=512)
env = gym.make("LunarLander-v2")
scoreHistory = []
numEpisodes = 200
numTrainingEpisodes = 50
highScore = -math.inf
recordTimeSteps = math.inf
for episode in range(numEpisodes):
done = False
observation = env.reset()
score, frame = 0, 1
while not done:
if episode > numTrainingEpisodes:
env.render()
action = agent.chooseAction(observation)
nextObservation, reward, done, info = env.step(action)
agent.learn(observation, reward, nextObservation, done)
observation = nextObservation
score += reward
frame += 1
scoreHistory.append(score)
recordTimeSteps = min(recordTimeSteps, frame)
highScore = max(highScore, score)
print(( "ep {}: high-score {:12.3f}, shortest-time {:d}, "
"score {:12.3f}, last-epidode-time {:4d}").format(
episode, highScore, recordTimeSteps, score, frame))
fig = plt.figure()
meanWindow = 10
meanedScoreHistory = np.convolve(scoreHistory, np.ones(meanWindow), 'valid') / meanWindow
plt.plot(np.arange(len(meanedScoreHistory)), meanedScoreHistory)
plt.ylabel("score")
plt.xlabel("episode")
plt.title("Training Scores")
plt.show()
So you are watching it go now. How long is it gonna take to land? Well luckily for you it might never land. :^)
Reinforcement learning is like this. Sometimes you roll bad dice and your child comes out with
three hands and no fingers. The initial neuron weights could be bad, your agent gets stuck in a local minimum,
there's no end
to the number of things that can go wrong in these environments.
It can be very difficult to find bugs in RL code for this reason. Even if the code is correct, you still might get
bad results.
The code above usually makes a successful smooth landing somewhere between 75 and 300 episodes of learning.
Probably it will take 30 minutes.
Actor-Critic is simple and not terribly good at this environment. Sorry. It kind of sucks.
Actually you've learned a lot here. You have a lot of intuition for how the thing 'learns' now. With just a few tweaks to your code you can make substantially better agents. As it turns out I'm working on tutorials for those agents here. :^)
An hour or few has passed. Maybe a few days. You've already sent videos to all your family and friends, and your
mom, of your little fake moon landing,
and now the excitement of making your first AI is wearing off. You're probably feeling a bit let down.
You know how this stuff works and the magic has ended. You might even feel a bit betrayed.
"This THING isn't alive. It doesnt LEARN anything. It can never gain consciousness.
It's just probabilities and math.
AI IS A BIG LIE. A SCAM. I WANT A DIVORCE"
You probably have some questions and complaints about taking abstract
life wisdom and implementing it in such a literal way in math. And also
questions about why we do so in this specific way.
Your brain is a computer. It is constantly evaluating the value of its
particular actions, and picking what it thinks to be the best ones.
Maybe it is doing so in a much more complicated way. Maybe there are thousands of "actors" and "critics",
separate zones of neurons competing over decisions. Maybe your brain has an
internal model of the world for trying out different actions in.
Reinforcement learning is a blatant oversimplification. Inspired by the way you think.
It shrinks down the problem for practicality reasons. It keeps the decision making essence
just enough to be metaphorically sound. Most of reinforcement learning
progress seems to fall into two categories: cognitive metaphors (borrowing another trick from how thinking seems to work) and engineering (making the math and the training more stable and efficient).
The distinction between these two categories is blurred. Often the
inspiration for what seems to be an engineering feature is a metaphor
for some aspect of your cognition, and then you bounce back and forth
between the two.
Ex: "Experience Replay" is like memory. It's a list of
previous states and rewards.
Ex: Some of my memories are more important than others.
If i remembered all my memories as being equally impactful to my life I
would value thinking about yesterday's breakfast as much as the time my
wife left me and took the kids. Most of my memories are just noise, and
can be thrown out. So make a "Prioritized Experience Replay" where you
estimate memory value and discard low value ones.
Ex: My Agent keeps getting stuck in local minima, I
want it to explore. So bias it to value discovering new
states or trying new actions that it hasn't seen before.
It goes back and forth. Cognitive metaphor, engineering, cognitive
metaphor, engineering.
How about creativity? The notion of friend or foe?
Ownership, fairness, or maybe mate value estimation? :^)
There is no end to the possibilities. If you have enough computing
power, you can make an agent that thinks about it.
And of course if its brain gets big enough, and self observant enough,
then it can have consciousness. It may even learn to
infer things about its own internal workings by focusing on generalizing
across its strategies within varying circumstances. Maybe it can even be
given the value of making more and more agents. :^)
Before we make horny von Neumann probes there is a lot of work to do and
much to learn. It might surprise you to discover reinforcement learning
has a long history behind it. Though many of the "RL" algorithms are
recently created ( or discovered :^) ), the simplest versions of these
algorithms have been around since the 1950s and maybe even earlier in
other forms. There is a lot of historical baggage and convention. If you
have lots of questions about this you might choose to peruse some of the
other resources, but be warned there are many math symbols out there.
Even if you aren't math fluent (I am also not math fluent :() ), you can
still learn stuff from understanding 5-10% of the symbols. Though it
might take you a month or so off and on.
Some notable books and resources include: