machine-learning-nanodegree

Class notes for the Machine Learning Nanodegree at Udacity

Go to Index

Markov Decision Process

Comparison of Learning Types

In Supervised Learning, we are given y and x and we try to find the best function f that fits f(x) = y.

In Unsupervised Learning, we are given only x and we try to find the best function f that produces a compact description of x, for example by grouping the data into clusters f(x).

In Reinforcement Learning, we are given x and z and we try to find f and y, where f(x) = y. It looks a lot like supervised learning, but the main difference is that we are not given the labels y; instead we are given reinforcers and punishers z, and we use them to adjust the function f.

Markov Decision Process Explained

Explanation of Markov Decision Process

Markov Assumptions

Image Explained
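The core assumption here is that only the present matters: the transition to the next state depends only on the current state (and action), not on the full history. Written out, this is my own rendering of the Markov property, not copied from the lecture image:

$$\Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = \Pr(s_{t+1} \mid s_t, a_t)$$

The course also assumes the world is stationary, i.e. this transition model does not change over time.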

More on rewards

Some examples of Policies affected by different rewards:

Rewards affecting policies

Sequence of Rewards: Assumptions

Infinite Horizons

Infinite horizon means we can take as many steps as we want through the MDP; there is no time limit that would force us to change our behavior as time runs out.

Utility of Sequences

Utility of Sequences measures the total reward value of a given sequence of steps.

Sum of each step reward

The plain sum of the rewards collected at each step is not a good measure of utility because, under an infinite horizon, that sum is unbounded: every sequence ends up with infinite utility, so there is no way to prefer one sequence over another, as shown below.

Quiz example
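Written out, this naive utility is just the undiscounted sum of the per-step rewards; the following is my LaTeX rendering of that sum, assuming per-step rewards R(s_t) as in the lecture:

$$U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} R(s_t)$$

With an infinite horizon and rewards that do not shrink to zero, this sum diverges, so any two such sequences look equally (infinitely) good.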

Discounted Rewards

Discounted rewards weight each step's reward by a geometrically decreasing factor, so the resulting series converges to a finite number that can be used to compare the utility of different (even infinite) sequences.

Discounted Rewards Explained
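In symbols, the discounted utility of a sequence of states (my reconstruction, using the discount factor $\gamma$ as in the lecture) is:

$$U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t), \qquad 0 \leq \gamma < 1$$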

Maximization of discounted rewards:

Maximal Discounted Rewards

which is demonstrated by:

Maximal Discounted Rewards Demonstrated
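The demonstration boils down to bounding the series by a geometric series (my reconstruction, with $R_{\max}$ standing for the largest possible per-step reward):

$$\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\leq\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} = \frac{R_{\max}}{1-\gamma}$$

so, unlike the plain sum, the discounted utility of an infinite sequence is always finite.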

Bellman Equation
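The Bellman equation defines the true utility of a state recursively as the reward for being in that state plus the discounted utility of acting optimally from there (my LaTeX reconstruction of the equation shown in the lecture image):

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')$$

where $T(s, a, s')$ is the probability of landing in state $s'$ after taking action $a$ in state $s$.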

Algorithmic Solution

Algorithmic Solution for Bellman Equation

Algorithmic Solution for Bellman Equation Continuation
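As a minimal, hypothetical sketch of that iterative approach (value iteration: start from arbitrary utilities and repeatedly apply the Bellman update until the values stop changing), here is a Python version. The toy MDP, the dictionaries `T` and `R`, and the constants are all made up for illustration and do not come from the course material:

```python
# Value iteration: start with arbitrary utilities and repeatedly apply
# the Bellman update until the values stop changing.

GAMMA = 0.9        # discount factor (assumed value)
EPSILON = 1e-6     # convergence threshold (assumed value)

# Hypothetical 2-state MDP: T[s][a] is a list of (probability, next_state),
# R[s] is the reward for being in state s.
T = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {"s0": 0.0, "s1": 1.0}

def value_iteration(T, R, gamma=GAMMA, epsilon=EPSILON):
    U = {s: 0.0 for s in T}          # arbitrary initial utilities
    while True:
        delta = 0.0
        new_U = {}
        for s in T:
            # Bellman update: reward now plus discounted expected utility
            # of the best action available in this state.
            best = max(
                sum(p * U[s2] for p, s2 in T[s][a])
                for a in T[s]
            )
            new_U[s] = R[s] + gamma * best
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < epsilon:          # utilities have (approximately) converged
            return U

print(value_iteration(T, R))
```

Because the discount factor is strictly less than 1, each Bellman update is a contraction, so the loop is guaranteed to converge to the true utilities.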

Example (Quiz)

Quiz Example

Finding the Policies

While the solutions presented above find the true utility value of each state, those values are not what we are ultimately after; what we really want is the optimal policy.

The image below shows how the previous solution can be simplified when the goal is only to find that policy.

Finding Policies
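Once the utilities are known (even approximately), the optimal policy can be read off by choosing, in each state, the action with the highest expected utility. This is my rendering of the standard policy-extraction step, not copied from the image:

$$\pi^{*}(s) = \arg\max_{a} \sum_{s'} T(s, a, s')\, U(s')$$

Policy iteration builds on exactly this idea: it alternates between evaluating a fixed policy, which removes the max and leaves a system of linear equations, and improving the policy with the argmax above.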

Summary
