Markov Decision Processes (MDPs) model controlled stochastic systems. Like Markov chains, an MDP consists of states and probabilistic transitions; unlike Markov chains, there is assumed to be an outside controller who chooses an action (with its associated transition matrix) at each step of the process, according to some strategy or policy. In addition, each state and action pair has an associated reward. The goal of the controller is to maximize the expected reward. MDPs are used in applications as diverse as wildlife management and robot navigation control. Optimization and approximation strategies for these models constitute a major body of literature in mathematics, operations research, and engineering. We consider the complexity of the following decision problem: for a given MDP and type of policy, is there such a policy for that MDP with positive expected reward? The complexity of this problem depends on at least half a dozen factors, including the information available to the controller, the feedback mechanism, and the succinctness of the representation of the system relative to the number of states. This paper, together with UK CS TR 268-96 shows variations of the decision problem to be complete for NL, PL, P, NP, PP, NP(PP), PSPACE, EXP, NEXP and EXPSPACE. This paper focuses on the proofs of completeness for PL, PP, and NP(PP). All of the problems considered here are for MDPs that run for a fixed, finite time, equal to the number of states of the system or the size of the representation of the system. Papadimitriou and Tsitsiklis showed that the most straightforward MDP decision problem is P-complete, and other variants are NP- or PSPACE-complete. Several others have shown other MDP problems are complete for P, NP, PSPACE, or EXP. We consider a slightly different decision problem than they do; our decision problem allows self-reductions to find the optimal policy, whereas theirs does not.
Back to my homepage