Slot Machine Markov Decision

A slot machine with two levers, also known as a two-armed bandit, is the classic illustration of this setting: each lever is assumed to have its own reward distribution.


In probability theory, the multi-armed bandit problem is that of a gambler who must decide which of several slot machines ('one-armed bandits') with unknown payoff distributions to play. Later, in 'Optimal adaptive policies for Markov decision processes', Burnetas and Katehakis studied the much larger model of Markov decision processes under partial information, where the transition law and/or the expected one-period rewards may depend on unknown parameters.

After reading this article you will learn about: 1. Meaning of Markov Analysis 2. Example on Markov Analysis 3. Applications.

Meaning of Markov Analysis:

Markov analysis is a method of analyzing the current behaviour of some variable in an effort to predict the future behaviour of that same variable. The procedure was developed by the Russian mathematician Andrei A. Markov early in the twentieth century. He first used it to describe and predict the behaviour of particles of gas in a closed container. As a management tool, Markov analysis has been successfully applied to a wide variety of decision situations.

Perhaps its widest use is in examining and predicting the behaviour of customers in terms of their brand loyalty and their switching from one brand to another. Markov processes are a special class of mathematical models which are often applicable to decision problems. In a Markov process, various states are defined. The probability of going to each of the states depends only on the present state and is independent of how we arrived at that state.

Example on Markov Analysis:


A simple Markov process is illustrated in the following example:

Example 1:

A machine which produces parts may either be in adjustment or out of adjustment. If the machine is in adjustment, the probability that it will be in adjustment a day later is 0.7, and the probability that it will be out of adjustment a day later is 0.3. If the machine is out of adjustment, the probability that it will be in adjustment a day later is 0.6, and the probability that it will be out of adjustment a day later is 0.4. If we let state-1 represent the situation in which the machine is in adjustment and let state-2 represent its being out of adjustment, then the probabilities of change are as given in Table 18.1 below. Note that the sum of the probabilities in any row is equal to one.

Table 18.1: Daily transition probabilities

                                  To state-1    To state-2
From state-1 (in adjustment)          0.7           0.3
From state-2 (out of adjustment)      0.6           0.4


Solution:

The process is represented in Fig. 18.4 by two probability trees whose upward branches indicate moving to state-1 and whose downward branches indicate moving to state-2.

Suppose the machine starts out in state-1 (in adjustment). Table 18.1 and Fig. 18.4 show there is a 0.7 probability that the machine will be in state-1 on the second day. Now consider the state of the machine on the third day. The probability that the machine is in state-1 on the third day is 0.49 (= 0.7 × 0.7) plus 0.18 (= 0.3 × 0.6), or 0.67 (Fig. 18.4).

The corresponding probability that the machine will be in state-2 on day 3, given that it started in state-1 on day 1, is 0.21 (= 0.7 × 0.3) plus 0.12 (= 0.3 × 0.4), or 0.33. The probability of being in state-1 plus the probability of being in state-2 add to one (0.67 + 0.33 = 1), since there are only two possible states in this example.

Calculations can similarly be made for subsequent days; the results are given in Table 18.2 below.

Table 18.2: Probability of each state when the machine starts in state-1 on day 1

Day    P(state-1)    P(state-2)
 1       1.000         0.000
 2       0.700         0.300
 3       0.670         0.330
 4       0.667         0.333
 5       0.667         0.333
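These day-by-day figures are easy to reproduce by repeatedly multiplying the state distribution by the transition matrix. The following is a minimal sketch (plain Python with NumPy; the variable names are ours, not from the text) that prints the rows of Table 18.2:

```python
import numpy as np

# Transition probabilities from Table 18.1: rows index the current state,
# columns the next state (state-1 = in adjustment, state-2 = out of adjustment).
P = np.array([[0.7, 0.3],
              [0.6, 0.4]])

state = np.array([1.0, 0.0])   # day 1: the machine starts in state-1
for day in range(1, 6):
    print(f"day {day}: P(state-1) = {state[0]:.4f}, P(state-2) = {state[1]:.4f}")
    state = state @ P          # distribution of states on the following day
```

Starting instead with state = np.array([0.0, 1.0]) reproduces Table 18.3 in the same way.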


Refer to Fig. 18.4 (start in State-2)


The probability that the machine will be in state-1 on day 3, given that it started off in state-2 on day 1, is 0.42 (= 0.6 × 0.7) plus 0.24 (= 0.4 × 0.6), or 0.66; hence Table 18.3 below.

Table 18.3: Probability of each state when the machine starts in state-2 on day 1

Day    P(state-1)    P(state-2)
 1       0.000         1.000
 2       0.600         0.400
 3       0.660         0.340
 4       0.666         0.334
 5       0.667         0.333

Tables 18.2 and 18.3 above show that the probability of the machine being in state-1 on any future day tends towards 2/3, irrespective of the initial state of the machine on day 1. This probability is called the steady-state probability of being in state-1; the corresponding probability of being in state-2 (1 – 2/3 = 1/3) is called the steady-state probability of being in state-2. The steady-state probabilities are often significant for decision purposes.

For example, if we were deciding to lease either this machine or some other machine, the steady-state probability of state-2 would indicate the fraction of time the machine would be out of adjustment in the long run, and this fraction (e.g. 1/3) would be of interest to us in making the decision.
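The steady-state probabilities can also be computed directly, rather than by iterating day by day, by solving the balance equations pi = pi P together with pi_1 + pi_2 = 1. A minimal sketch, using the same transition matrix as above:

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.6, 0.4]])

# Stack the balance equations (P' - I) pi = 0 with the normalisation pi_1 + pi_2 = 1
# and solve the (overdetermined but consistent) system by least squares.
A = np.vstack([P.T - np.eye(2), np.ones((1, 2))])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)   # approximately [0.6667, 0.3333], i.e. 2/3 and 1/3
```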

Applications of Markov Analysis:


Markov analysis has come to be used as a marketing research tool for examining and forecasting the frequency with which customers will remain loyal to one brand or switch to others. It is generally assumed that customers do not shift from one brand to another at random, but instead will choose to buy brands in the future that reflect their choices in the past.

Other applications that have been found for Markov Analysis include the following models:

A model for manpower planning,

A model for human needs,


A model for assessing the behaviour of stock prices,

A model for scheduling hospital admissions,

A model for analyzing internal manpower supply etc.


Markov Decision Process (MDP) Toolbox for Matlab

Written by Kevin Murphy, 1999
Last updated: 23 October, 2002.

This toolbox supports value and policy iteration for discrete MDPs, and includes some grid-world examples from the textbooks by Sutton and Barto, and Russell and Norvig. It does not implement reinforcement learning or POMDPs. For a very similar package, see INRA's matlab MDPtoolbox.

Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards. For example, consider teaching a dog a new trick: you cannot tell it what to do, but you can reward/punish it if it does the right/wrong thing. It has to figure out what it did that made it get the reward/punishment, which is known as the credit assignment problem. We can use a similar method to train computers to do many tasks, such as playing backgammon or chess, scheduling jobs, and controlling robot limbs.

We can formalise the RL problem as follows. The environment is modelled as a stochastic finite state machine with inputs (actions sent from the agent) and outputs (observations and rewards sent to the agent):

  • State transition function P(X(t)|X(t-1),A(t))
  • Observation (output) function P(Y(t) | X(t), A(t))
  • Reward function E(R(t) | X(t), A(t))
(Notice that what the agent sees depends on what it does, which reflects the fact that perception is an active process.) The agent is also modelled as a stochastic FSM, with inputs (observations/rewards sent from the environment) and outputs (actions sent to the environment):
  • State transition function: S(t) = f (S(t-1), Y(t), R(t), A(t))
  • Policy/output function: A(t) = pi(S(t))
The agent's goal is to find a policy and state-update function so as to maximize the expected sum of discounted rewards, E[ sum_t gamma^t R(t) ], where 0 <= gamma <= 1 is a discount factor which models the fact that future reward is worth less than immediate reward (because tomorrow you might die). (Mathematically, we need gamma < 1 to make the infinite sum converge, unless the environment has absorbing states with zero reward.)
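As an illustration of this formalisation (not part of the toolbox), the interaction can be written as a loop in which the agent and the environment exchange actions, observations and rewards while the discounted return is accumulated; the `env` and `agent` objects and their methods below are hypothetical placeholders:

```python
def run_episode(env, agent, gamma=0.9, max_steps=100):
    """Run one agent-environment episode and return the discounted sum of rewards."""
    y, r = env.reset(), 0.0              # initial observation Y(0); no reward yet
    total, discount = 0.0, 1.0
    for t in range(max_steps):
        a = agent.act(y, r)              # agent updates its internal state S(t) and emits A(t)
        y, r, done = env.step(a)         # environment moves to its next state, emits Y and R
        total += discount * r
        discount *= gamma                # future rewards count for less than immediate ones
        if done:                         # absorbing state reached
            break
    return total
```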

In the special case that Y(t)=X(t), we say the world is fully observable, and the model becomes a Markov Decision Process (MDP). In this case, the agent does not need any internal state (memory) to act optimally. In the more realistic case, where the agent only gets to see part of the world state, the model is called a Partially Observable MDP (POMDP), pronounced 'pom-dp'. We give a brief introduction to these topics below.

A Markov Decision Process (MDP) is just like a Markov Chain, except the transition matrix depends on the action taken by the decision maker (agent) at each time step. The agent receives a reward, which depends on the action and the state. The goal is to find a function, called a policy, which specifies which action to take in each state, so as to maximize some function (e.g., the mean or expected discounted sum) of the sequence of rewards. One can formalize this in terms of Bellman's equation, which can be solved iteratively using policy iteration. The unique fixed point of this equation is the optimal value function, and the corresponding greedy policy is the optimal policy.

More precisely, let us define the transition matrix and reward functions as follows:

T(s, a, s') = P(X(t) = s' | X(t-1) = s, A(t) = a)

R(s, a) = E(R(t) | X(t) = s, A(t) = a)

(We are assuming states, actions and time are discrete. Continuous MDPs can also be defined, but are usually solved by discretization.)

We define the value Q(s, a) of performing action a in state s as follows:

Q(s, a) = R(s, a) + g * sum_{s'} T(s, a, s') V(s')

where 0 < g <= 1 is the amount by which we discount future rewards, and V(s) is the overall value of state s, given by Bellman's equation:

V(s) = max_a Q(s, a)

In words, the value of a state is the maximum expected reward we will get in that state, plus the expected discounted value of all possible successor states, s'. If we fix a policy p and abbreviate R(s) = R(s, p(s)) and T(s, s') = T(s, p(s), s'), the above equation simplifies to the more common form

V(s) = R(s) + g * sum_{s'} T(s, s') V(s')

which, for a fixed policy and a tabular (non-parametric) representation of the V/Q/T/R functions, can be rewritten in matrix-vector form as V = R + g T V. Solving these n simultaneous equations is called value determination (n is the number of states).
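For a fixed policy, value determination is just a linear solve of V = R + g T V. Here is a minimal sketch in Python/NumPy (not the Matlab toolbox's code; array layouts and names are our own):

```python
import numpy as np

def value_determination(T, R, policy, g):
    """Solve V = R_pi + g * T_pi @ V for a fixed policy.

    T: (n_states, n_actions, n_states) transition probabilities T(s, a, s')
    R: (n_states, n_actions) expected immediate rewards R(s, a)
    policy: (n_states,) index of the action chosen in each state
    g: discount factor, 0 < g < 1 (so that I - g*T_pi is invertible)
    """
    n = T.shape[0]
    idx = np.arange(n)
    T_pi = T[idx, policy, :]        # transition matrix under the policy
    R_pi = R[idx, policy]           # reward vector under the policy
    # (I - g * T_pi) V = R_pi  is a system of n simultaneous linear equations
    return np.linalg.solve(np.eye(n) - g * T_pi, R_pi)
```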

If V/Q satisfies the Bellman equation, then the greedy policy is optimal. If not, we can set p(s) to argmax_a Q(s,a) and re-evaluate V (and hence Q) and repeat. This is called policy iteration, and is guaranteed to converge to the unique optimal policy. (Here is some Matlab software for solving MDPs using policy iteration.) The best theoretical upper bound on the number of iterations needed by policy iteration is exponential in n (Mansour and Singh, UAI 99), but in practice, the number of steps is O(n). By formulating the problem as a linear program, it can be proved that one can find the optimal policy in polynomial time.
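Building on the value-determination sketch above (and reusing its numpy import and value_determination function), a policy-iteration loop might look like the following; again this is an illustrative Python sketch, not the Matlab software referred to in the text:

```python
def policy_iteration(T, R, g):
    """Alternate value determination and greedy improvement until the policy stops changing."""
    n_states = T.shape[0]
    policy = np.zeros(n_states, dtype=int)      # start from an arbitrary policy
    while True:
        V = value_determination(T, R, policy, g)
        Q = R + g * (T @ V)                     # Q(s,a) = R(s,a) + g * sum_s' T(s,a,s') V(s')
        new_policy = Q.argmax(axis=1)           # greedy policy with respect to the current Q
        if np.array_equal(new_policy, policy):  # no change: the Bellman equation is satisfied
            return policy, V
        policy = new_policy
```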

For AI applications, the state is usually defined in terms of state variables. If there are k binary variables, there are n = 2^k states. Typically, there are some independencies between these variables, so that the T/R functions (and hopefully the V/Q functions, too!) are structured; this can be represented using a Dynamic Bayesian Network (DBN), which is like a probabilistic version of a STRIPS rule used in classical AI planning. For details, see

  • 'Decision Theoretic Planning: Structural Assumptions and Computational Leverage'.
    Craig Boutilier, Thomas Dean and Steve Hanks
    JAIR (Journal of AI Research), 1999. Postscript (87 pages)
If we know the model (i.e., the transition and reward functions), we can solve for the optimal policy in about n^2 time using policy iteration. Unfortunately, if the state is composed of k binary state variables, then n = 2^k, so this is way too slow. In addition, what do we do if we don't know the model?


Reinforcement Learning (RL) solves both problems: we can approximately solve an MDP by replacing the sum over all states with a Monte Carlo approximation. In other words, we only update the V/Q functions (using temporal difference (TD) methods) for states that are actually visited while acting in the world. If we keep track of the transitions made and the rewards received, we can also estimate the model as we go, and then 'simulate' the effects of actions without having to actually perform them.
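As a concrete example of a TD-style method that only updates the values of visited states, here is a sketch of tabular Q-learning; the `env` interface (reset/step/actions) is a hypothetical placeholder, not an API of the toolbox:

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes, g=0.95, alpha=0.1, eps=0.1):
    """Tabular Q-learning: update Q only for (state, action) pairs actually visited.

    `env` is assumed to provide reset() -> state, step(action) -> (state, reward, done),
    and a list `env.actions`.
    """
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (exploration vs exploitation)
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # TD update: move Q(s,a) towards r + g * max_a' Q(s',a')
            best_next = 0.0 if done else max(Q[(s2, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + g * best_next - Q[(s, a)])
            s = s2
    return Q
```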

There are three fundamental problems that RL must tackle: the exploration-exploitation tradeoff, the problem of delayed reward (credit assignment), and the need to generalize. We will discuss each in turn.

We mentioned that in RL, the agent must make trajectories through the state space to gather statistics. The exploration-exploitation tradeoff is the following: should we explore new (and potentially more rewarding) states, or stick with what we know to be good (exploit existing knowledge)? This problem has been extensively studied in the case of k-armed bandits, which are MDPs with a single state and k actions. The goal is to choose the optimal action to perform in that state, which is analogous to deciding which of the k levers to pull in a k-armed bandit (slot machine). There are some theoretical results (e.g., Gittins' indices), but they do not generalise to the multi-state case.
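A simple (though not optimal) way to handle the tradeoff in the k-armed bandit case is an epsilon-greedy rule: pull a random lever with small probability, otherwise pull the lever with the best estimated payoff. A sketch, with `pull(arm)` as a hypothetical stand-in for sampling one lever's reward distribution:

```python
import random

def epsilon_greedy_bandit(pull, k, n_pulls, eps=0.1):
    """Play a k-armed bandit (slot machine with k levers) with an epsilon-greedy strategy."""
    counts = [0] * k
    means = [0.0] * k
    for _ in range(n_pulls):
        if random.random() < eps or all(c == 0 for c in counts):
            arm = random.randrange(k)                    # explore a random lever
        else:
            arm = max(range(k), key=lambda a: means[a])  # exploit the best lever so far
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]     # incremental estimate of the mean payoff
    return means
```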

The problem of delayed reward is well illustrated by games such as chess or backgammon. The player (agent) makes many moves, and only gets rewarded or punished at the end of the game. Which move in that long sequence was responsible for the win or loss? This is called the credit assignment problem. We can solve it by essentially doing stochastic gradient descent on Bellman's equation, backpropagating the reward signal through the trajectory, and averaging over many trials. This is called temporal difference learning.

It is fundamentally impossible to learn the value of a state before a reward signal has been received. In large state spaces, random exploration might take a long time to reach a rewarding state. The only solution is to define higher-level actions, which can reach the goal more quickly. A canonical example is travel: to get from Berkeley to San Francisco, I first plan at a high level (I decide to drive, say), then at a lower level (I walk to my car), then at a still lower level (how to move my feet), etc. Automatically learning action hierarchies (temporal abstraction) is currently a very active research area.

The last problem we will discuss is generalization: given that we can only visit a subset of the (exponential number of) states, how can we know the value of all the states? The most common approach is to approximate the Q/V functions using, say, a neural net. A more promising approach (in my opinion) uses the factored structure of the model to allow safe state abstraction (Dietterich, NIPS'99).
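To make the generalization idea concrete, here is a sketch of the simplest kind of function approximation, a linear value function trained with semi-gradient TD(0); the feature map `phi` and the `episodes` data structure are assumptions for illustration, and a neural net would simply replace the linear model:

```python
import numpy as np

def semi_gradient_td0(episodes, phi, n_features, g=0.95, alpha=0.01):
    """Approximate V(s) by the linear model w . phi(s), trained with semi-gradient TD(0).

    `episodes` is assumed to be an iterable of trajectories, each a list of
    (s, r, s_next, done) transitions; `phi(s)` maps a state to a length-n_features vector.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        for s, r, s_next, done in episode:
            v = w @ phi(s)
            v_next = 0.0 if done else w @ phi(s_next)
            td_error = r + g * v_next - v
            w += alpha * td_error * phi(s)   # the gradient of v with respect to w is phi(s)
    # V(s) can now be estimated for any state, visited or not, as w @ phi(s)
    return w
```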


RL is a huge and active subject, and you are recommended to read the references below for more information.

  • 'Reinforcement Learning: An Introduction',
    Richard Sutton and Andrew Barto, MIT Press, 1998.
  • 'Neuro-dynamic programming'
    Dimitri P. Bertsekas and John Tsitsiklis, Athena Scientific, 1996.
  • 'Reinforcement Learning: A Survey'.
    Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore
    JAIR (Journal of AI Research), Volume 4, 1996. Postscript (40 pages) or HTML version
  • Harmon's tutorial on RL


There have been a few successful applications of RL. The most famous is probably Tesauro's TD-gammon, which learned to play backgammon extremely well, using a neural network function approximator and TD(lambda). Other applications have included controlling robot arms and various scheduling problems. However, these are still very simple problems by the standards of AI, and required a lot of human engineering; we are a far cry from the dream of fully autonomous learning agents.

MDPs assume that the complete state of the world is visible to the agent. This is clearly highly unrealistic (think of a robot in a room with enclosing walls: it cannot see the state of the world outside of the room). POMDPs model the information available to the agent by specifying a function from the hidden state to the observables, just as in an HMM. The goal now is to find a mapping from observations (not states) to actions. Unfortunately, the observations are not Markov (because two different states might look the same), which invalidates all of the MDP solution techniques. The optimal solution to this problem is to construct a belief state MDP, where a belief state is a probability distribution over states. For details on this approach, see
  • 'Planning and Acting in Partially Observable Stochastic Domains'.
    Leslie Pack Kaelbling, Michael L. Littman and Anthony R. Cassandra
    Artificial Intelligence, Vol. 101, 1998. Postscript (45 pages)
Control theory is concerned with solving POMDPs, but in practice, control theorists make strong assumptions about the nature of the model (typically linear-Gaussian) and reward function (typically negative quadratic loss) in order to be able to make theoretical guarantees of optimality, etc. By contrast, optimally solving a generic discrete POMDP is wildly intractable. Finding tractable special cases (e.g., structured models) is a hot research topic.

For more details on POMDPs, see Tony Cassandra's POMDP page.
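To make the belief-state construction concrete, the following sketch shows the Bayes-filter update that maps the current belief, the action taken and the observation received to a new belief; the tabular transition model T(s, a, s') and observation model O(s', a, y) are assumptions with hypothetical array layouts:

```python
import numpy as np

def belief_update(b, a, y, T, O):
    """Update a belief state after taking action a and observing y.

    b: (n_states,) current probability distribution over hidden states
    T: (n_states, n_actions, n_states) transition probabilities T(s, a, s')
    O: (n_states, n_actions, n_obs) observation probabilities O(s', a, y)
    Returns b'(s') proportional to O(s', a, y) * sum_s T(s, a, s') b(s).
    """
    predicted = b @ T[:, a, :]          # sum_s b(s) T(s, a, s')
    new_b = O[:, a, y] * predicted      # weight by the likelihood of the observation
    return new_b / new_b.sum()          # normalise (assumes the observation has nonzero probability)
```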

Books


  • 'Reinforcement Learning: An Introduction',
    Richard Sutton and Andrew Barto, MIT Press, 1998.
  • 'Neuro-dynamic programming'
    Dimitri P. Bertsekas and John Tsitsiklis, Athena Scientific, 1996.

Papers

  • 'Reinforcement Learning: A Survey'.
    Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore
    JAIR (Journal of AI Research), Volume 4, 1996. Postscript (40 pages) or HTML version
  • 'Planning and Acting in Partially Observable Stochastic Domains'.
    Leslie Pack Kaelbling, Michael L. Littman and Anthony R. Cassandra
    Artificial Intelligence, Vol. 101, 1998. Postscript (45 pages)
  • 'Decision Theoretic Planning: Structural Assumptions and Computational Leverage'.
    Craig Boutilier, Thomas Dean and Steve Hanks
    JAIR (Journal of AI Research), 1999. Postscript (87 pages)

Online stuff


  • Michael Kearns' list of recommended reading
  • Stuart Reyonold's bibliography, which contains pointers to many good online articles