Reinforcement learning is the branch of machine learning that trains agents to take actions that maximize their reward, to use the term of art. It’s one of the core techniques behind AlphaGo, the artificially intelligent agent that beat the world’s top human Go players.
The core concepts are intuitive, and the model is easily mapped onto human experience.
How are you like an artificial intelligence trained with reinforcement learning?
Reinforcement learning background
At each state of the environment, the agent chooses an action from the set of available actions. Taking it yields a reward and a transition into a new state. The agent wants to maximize the cumulative reward it receives over time.
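That loop is small enough to sketch in code. Here’s a minimal toy version in Python; the environment and the random policy are made up for illustration, not any particular library’s API:

```python
import random

# A toy environment: the agent walks a number line starting at 0 and is
# rewarded only for reaching position 3. Purely illustrative.
def step(state, action):
    next_state = state + action             # action is -1 or +1
    reward = 1 if next_state == 3 else 0    # reward depends on the new state
    done = next_state == 3
    return next_state, reward, done

state, total_reward, done = 0, 0, False
for t in range(100):                        # one episode, capped at 100 steps
    action = random.choice([-1, 1])         # a (deliberately dumb) policy
    state, reward, done = step(state, action)
    total_reward += reward                  # the cumulative reward to maximize
    if done:
        break
```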
Consider the game of checkers. The state of the system is the board position and whose turn it is. The set of available actions is the set of legal moves for the active player’s unblocked pieces. Winning is the only reward in the game, so winning moves can be considered to have a value of 1, and all other moves a value of 0.
With just that formulation and some clever engineering you can build an AI that will beat you at checkers.
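The reward piece of that formulation is almost trivially small; the clever engineering lives in encoding states and generating legal moves. A sketch, assuming a hypothetical state representation that tracks the opponent’s remaining pieces:

```python
def reward(state, action, next_state):
    """Winning is the only reward: 1 for the move that captures the
    opponent's last piece, 0 for every other move."""
    return 1 if next_state["opponent_pieces"] == 0 else 0
```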
Note that available actions and rewards both depend on the state of the system. I cannot move the pieces on my back row until I move the pieces blocking them, and then I can. The same action is available to choose, or not, depending on the rest of the board. Similarly, jumping my opponent’s piece wins the game and provides a reward of 1 if it’s their last piece; otherwise it does not win and provides a reward of 0.
Some systems transition between states stochastically. You may choose the Hit me action in blackjack, but you don’t know with certainty which state you’ll end up in next. If you bust, your reward is losing your bet; otherwise you keep playing, and may get a positive monetary reward in a future state.
So we have the following:
S: the set of all possible states of the environment
s: a specific state
A: the set of all possible actions
a: a specific action
P(s, a, s’): the probability that you will transition to state s’ by taking action a from state s
R(s, a, s’): the reward you receive when transitioning to state s’ by taking action a from state s
E[R(s, a)]: the expected reward of taking action a from state s, combining the reward for transitioning into each possible s’ with the probability of doing so
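That last definition is just a probability-weighted average over the possible next states. A sketch, with P and R stored as plain dictionaries and made-up blackjack numbers (suppose hitting on this hand busts 40% of the time):

```python
def expected_reward(P, R, s, a, next_states):
    """E[R(s, a)]: weight each possible next state's reward by its probability."""
    return sum(P[(s, a, s2)] * R[(s, a, s2)] for s2 in next_states)

# Hypothetical numbers for the "hit" action from one blackjack state.
P = {("16 vs 10", "hit", "bust"): 0.4, ("16 vs 10", "hit", "still in"): 0.6}
R = {("16 vs 10", "hit", "bust"): -1.0, ("16 vs 10", "hit", "still in"): 0.0}
print(expected_reward(P, R, "16 vs 10", "hit", ["bust", "still in"]))  # -0.4
```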
The intelligent agent’s goal is to maximize the cumulative reward it receives across all the states it visits and the actions it takes. To do so it needs to know the quantities above, which it can only learn by exploring the environment: taking various actions from various states and seeing what happens. Once it has learned enough about the environment, it can exploit that knowledge to maximize reward.
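A classic way to balance the two is an epsilon-greedy policy: explore at random some small fraction of the time, otherwise exploit the best estimate so far. A minimal sketch, assuming the agent keeps its learned value estimates in a dictionary Q keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit current estimates."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore: try anything
    return max(actions, key=lambda a: Q[(state, a)])   # exploit: best known

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}  # hypothetical learned values
print(epsilon_greedy(Q, "s0", ["left", "right"]))  # usually "right"
```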
It’s also worth noting that if the game goes on forever, the cumulative reward can grow without bound and stops being a useful target. Reinforcement learning agents in such environments discount rewards: the further in the future a reward arrives, the less it counts, shrinking toward 0.
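Concretely, with a discount factor gamma between 0 and 1, each step into the future multiplies a reward’s weight by gamma, so even an endless stream of rewards sums to a finite number. A sketch, with gamma = 0.9 as an arbitrary choice:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t. With 0 <= gamma < 1, a bounded reward on every
    step forever still adds up to a finite total."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A reward of 1 on every step would be infinite undiscounted; discounted,
# the total approaches 1 / (1 - 0.9) = 10.
print(discounted_return([1] * 100))  # ~9.9997
```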
Why are humans hard?
We don’t live on a chessboard or in a deck of cards, so S is every possible way the universe could ever possibly be. And most people aren’t often physically constrained, so A is similarly limitless. These sets are impossible to fully explore.
Each specific s is a full description of the universe, or at least your universe. It includes not just the furniture around you, but your emotional state, your thoughts, those of everyone around you, and the weather. The state is so information-dense it is impossible to specify in full.
Even if the universe is deterministic, it is at least chaotic, so it’s more easily modeled as stochastic. Think of rolling a die under classical physics: with enough information and computation we could know which side it will land on, but it’s more practical to treat it as random. In a stochastic formulation, P(s, a, s’) for the universe is rarely 1 and frequently impossible to estimate.
And what would a reasonable reward function R(s, a, s’) even be for a person?
The critical assumption
Let’s assume that we are like reinforcement learning agents. Everything above is a valid model of the world and defines us and how we act. We do seek to maximize our cumulative reward.
Under that assumption, the reward function still isn’t explicit or quantified, but we can define it relatively. If we choose to eat a banana instead of an apple, we are saying that it had a greater expected cumulative reward for us: E[R(s, eat banana)] > E[R(s, eat apple)].
Well, not the immediate reward, but the cumulative reward starting with eating the banana and continuing on through the furthest time horizon your consciousness can see. This is why you avoid foods that give you diarrhea, even if you love them. You get a short-term positive reward for eating them, but a negative reward of larger absolute value later.
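With made-up magnitudes, the arithmetic is easy to see. Say the food is +5 of pleasure now and -20 of misery one step later, discounted at an assumed gamma of 0.9:

```python
gamma = 0.9                       # how much the next step counts vs. now
eat_it = 5 + gamma * (-20)        # -13.0: negative once the future is priced in
skip_it = 0 + gamma * 0           #   0.0: nothing now, nothing later
# Skipping wins at this horizon. Shrink gamma to 0.2 (the future barely
# counts) and eat_it becomes 5 + 0.2 * (-20) = +1.0: the choice flips.
```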
A fun part
We can categorize human behavior according to how it relates to the terms in the formulae.
We explore S and A and P and R directly with our lived experience. But we also explore them vicariously through stories, indirectly through education, and we expand the frontier of what we think they could possibly be through science and philosophy.
We learn more about a specific state s through experimentation and observation and measurement. We learn about the internal components of it through meditation and mindfulness.
We punish misdeeds relative to the perpetrator’s expected knowledge of s and P. We make R higher (less punishing) for a child’s crime than an adult’s. We arrange society to make R lower (more punishing) for premeditated crimes and repeat offenses.
Social mores, rules, conventions, and habits constrain the infinite set of possible actions A to something closer to finite and more manageable. Experience, experimentation, training, imagination, and simply being willing to break the rules in turn increase the size of our effective A, giving us more actions to choose from.
How we each value short-term thinking versus long-term thinking depends on how long a time horizon we are able to keep in mind, consciously or not.
We seek to reduce the uncertainty in P through acts of attempted control, or at least prayer.
We call behavior rational when it lines up with the socially dominant R, or an easily quantifiable one. We call behavior irrational when we don’t understand someone’s R. That someone is often ourself, so we seek to understand the intricacies of our own personal R through introspection and therapy.
You have only one priority
Under this model, you are always maximizing your expected cumulative reward.
Every choice you make is an expression of your R. Each action a isn’t in line with or against your values: it is what defines your values. It is what you personally consider most rewarding.
When you do inevitably choose the food you love even though it gives you diarrhea, you’re saying you value it more. The pleasure now and pain later are worth more than neither one. When you don’t have that hard conversation with your friend, you’re saying you value that more. The pain of the talk and its consequences are worth less than avoiding it now plus the sum of whatever you expect to feel with the issues left unaddressed. As far into the future as you can see, you expect to get more from those choices. Those choices are purely rational according to your R, so don’t beat yourself up about them later when you and your relationship are in the shitter. That sounds sarcastic because that’s the conventional understanding, but the sentence is sincere.
Every choice can be compared with every other one. Every ideal you hold and every domain of your life can be weighed against each other, lined up in order of how important they are to you. This is true by definition when you choose an action a.
If you stay late at work instead of coming home to your family, if you volunteer at the homeless shelter, if you miss a birthday party, if you don’t return an email, if you enlist in the army, if you break into your neighbor’s house at midnight to shit in their living room, if you stay in your job, if you procrastinate on what you call your dream —
Whatever you do, you are defining your reward function. With what you know of s and S and A and P, you are choosing a. You are saying:
Within the bounds of my universe — with everything I know, with all of my powers of imagination, with how I feel — what I am doing right now is what I value most of all.