
Optimization and Control

Richard Weber, Lent Term 2016

Contents

  • Schedules
  • 1 Dynamic Programming
    • 1.1 Control as optimization over time
    • 1.2 Example: the shortest path problem
    • 1.3 The principle of optimality
    • 1.4 The optimality equation
    • 1.5 Example: optimization of consumption
  • 2 Markov Decision Problems
    • 2.1 Markov decision processes
    • 2.2 Features of the state-structured case
    • 2.3 Example: exercising a stock option
    • 2.4 Example: secretary problem
  • 3 Dynamic Programming over the Infinite Horizon
    • 3.1 Discounted costs
    • 3.2 Example: job scheduling
    • 3.3 The infinite-horizon case
    • 3.4 The optimality equation in the infinite-horizon case
    • 3.5 Example: selling an asset
  • 4 Positive Programming
    • 4.1 Example: possible lack of an optimal policy
    • 4.2 Characterization of the optimal policy
    • 4.3 Example: optimal gambling
    • 4.4 Value iteration
    • 4.5 D case recast as a N or P case
    • 4.6 Example: pharmaceutical trials
  • 5 Negative Programming
    • 5.1 Example: a partially observed MDP
    • 5.2 Stationary policies
    • 5.3 Characterization of the optimal policy
    • 5.4 Optimal stopping over a finite horizon
    • 5.5 Example: optimal parking
  • 6 Optimal Stopping Problems
    • 6.1 Bruss’s odds algorithm
    • 6.2 Example: stopping a random walk
    • 6.3 Optimal stopping over the infinite horizon
    • 6.4 Example: sequential probability ratio test
    • 6.5 Example: prospecting
  • 7 Bandit Processes and the Gittins Index
    • 7.1 Bandit processes and the multi-armed bandit problem
    • 7.2 The two-armed bandit
    • 7.3 Gittins index theorem
    • 7.4 Example: single machine scheduling
    • 7.5 Proof of the Gittins index theorem
    • 7.6 Example: Weitzman’s problem
    • 7.7 Calculation of the Gittins index
    • 7.8 Forward induction policies
  • 8 Average-cost Programming
    • 8.1 Average-cost optimality equation
    • 8.2 Example: admission control at a queue
    • 8.3 Value iteration bounds
    • 8.4 Policy improvement algorithm
  • 9 Continuous-time Markov Decision Processes
    • 9.1 Stochastic scheduling on parallel machines
    • 9.2 Controlled Markov jump processes
    • 9.3 Example: admission control at a queue
  • 10 LQ Regulation
    • 10.1 The LQ regulation problem
    • 10.2 The Riccati recursion
    • 10.3 White noise disturbances
    • 10.4 Example: control of an inertial system

Schedules

Dynamic programming The principle of optimality. The dynamic programming equation for finite-horizon problems. Interchange arguments. Markov decision processes in discrete time. Infinite-horizon problems: positive, negative and discounted cases. Value iteration. Policy improvement algorithm. Stopping problems. Average-cost programming. [6]

LQG systems Linear dynamics, quadratic costs, Gaussian noise. The Riccati recursion. Controllability. Stabilizability. Infinite-horizon LQ regulation. Observability. Imperfect state observation and the Kalman filter. Certainty equivalence control. [5]

Continuous-time models The optimality equation in continuous time. Pontryagin’s maximum principle. Heuristic proof and connection with Lagrangian methods. Transversality conditions. Optimality equations for Markov jump processes and diffusion processes. [5]

Richard Weber, January 2016


1 Dynamic Programming

Dynamic programming and the principle of optimality. Notation for state-structured models. Optimization of consumption with a bang-bang optimal control.

1.1 Control as optimization over time

Modelling real-life problems is something that humans do all the time. Sometimes an optimal solution to a model can be found. Other times a near-optimal solution is adequate, or there is no single criterion by which a solution can be judged. However, even when an optimal solution is not required it can be useful to follow an optimization approach. If the ‘optimal’ solution is ridiculous then that can suggest ways in which the modelling can be refined.

Control theory is concerned with dynamical systems and their optimization over time. These systems may evolve stochastically and key variables may be unknown or imperfectly observed. The IB Optimization course concerned static problems in which nothing was random or hidden. In this course our problems are dynamic, with stochastic evolution, and even imperfect state observation. These give rise to new types of optimization problem which require new ways of thinking.

The origins of ‘control theory’ can be traced to the wind vane used to face a windmill’s rotor into the wind, and the centrifugal governor invented by James Watt. Such ‘classic control theory’ is largely concerned with the question of stability, and much of this is outside this course, e.g., the Nyquist criterion and dynamic lags. However, control theory is not merely concerned with the control of mechanisms. It is useful in the study of a multitude of dynamical systems, in biology, communications, manufacturing, health services, finance, and economics.

1.2 Example: the shortest path problem

Consider the ‘stagecoach problem’ in which a traveller wishes to minimize the length of a journey from town A to town J by first travelling to one of B, C or D, then onwards to one of E, F or G, then onwards to one of H or I, and then finally to J. Thus there are 4 ‘stages’. The arcs are marked with distances between towns.

[Figure: road system for the stagecoach problem, showing towns A to J joined by arcs labelled with the distances between them.]
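The backward recursion that solves this problem is easy to mechanize. Since the arc lengths in the figure are not legible in this text, the sketch below uses illustrative distances (they are not the ones in the notes); only the stage-by-stage recursion $F(x) = \min_y [d(x, y) + F(y)]$ matters.

```python
# Shortest path by backward dynamic programming for the stagecoach problem.
# The distances below are ILLUSTRATIVE ONLY (the figure's actual arc lengths are
# not reproduced in this text); the stage-by-stage recursion is the point.
arcs = {
    'A': {'B': 2, 'C': 4, 'D': 3},
    'B': {'E': 7, 'F': 4, 'G': 6}, 'C': {'E': 3, 'F': 2, 'G': 4}, 'D': {'E': 4, 'F': 1, 'G': 5},
    'E': {'H': 1, 'I': 4}, 'F': {'H': 6, 'I': 3}, 'G': {'H': 3, 'I': 3},
    'H': {'J': 3}, 'I': {'J': 4},
}

F = {'J': 0.0}        # F[x] = length of the shortest route from town x to J
policy = {}           # policy[x] = best town to travel to next from x
for stage in (['H', 'I'], ['E', 'F', 'G'], ['B', 'C', 'D'], ['A']):   # work backwards
    for x in stage:
        policy[x], F[x] = min(((y, d + F[y]) for y, d in arcs[x].items()),
                              key=lambda pair: pair[1])

route, x = ['A'], 'A'
while x != 'J':
    x = policy[x]
    route.append(x)
print(F['A'], ' -> '.join(route))   # minimal total distance and an optimal route
```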

The state-structured case. The control variable $u_t$ is chosen on the basis of knowing $U_{t-1} = (u_0, \dots, u_{t-1})$ (which determines everything else). But a more economical representation of the past history is often sufficient. For example, we may not need to know the entire path that has been followed up to time $t$, but only the place to which it has taken us. The idea of a state variable $x \in \mathbb{R}^d$ is that its value at $t$, denoted $x_t$, can be found from known quantities and obeys a plant equation (or law of motion)

$$x_{t+1} = a(x_t, u_t, t).$$

Suppose we wish to minimize a separable cost function of the form

$$C = \sum_{t=0}^{h-1} c(x_t, u_t, t) + C_h(x_h), \qquad (1)$$

by choice of controls $\{u_0, \dots, u_{h-1}\}$. Define the cost from time $t$ onwards as

$$C_t = \sum_{\tau=t}^{h-1} c(x_\tau, u_\tau, \tau) + C_h(x_h), \qquad (1)$$

and the minimal cost from time $t$ onwards as an optimization over $\{u_t, \dots, u_{h-1}\}$ conditional on $x_t = x$,

$$F(x, t) = \inf_{u_t, \dots, u_{h-1}} C_t.$$

Here $F(x, t)$ is the minimal future cost from time $t$ onward, given that the state is $x$ at time $t$. By an inductive proof, one can show as in Theorem 1 that

$$F(x, t) = \inf_u \left[ c(x, u, t) + F(a(x, u, t), t+1) \right], \quad t < h, \qquad (1)$$

with terminal condition $F(x, h) = C_h(x)$. Here $x$ is a generic value of $x_t$. The minimizing $u$ in (1) is the optimal control $u(x, t)$, and values of $x_0, \dots, x_{t-1}$ are irrelevant. The optimality equation (1) is also called the dynamic programming equation (DP) or Bellman equation.
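The recursion is straightforward to mechanize when the state and control sets are finite. The following is a minimal sketch (not from the notes) of the backward pass; the arguments `states`, `controls`, `a`, `c` and `C_h` are placeholders for whatever a particular model supplies.

```python
def solve_finite_horizon(states, controls, a, c, C_h, h):
    """Backward DP for F(x,t) = min_u [ c(x,u,t) + F(a(x,u,t), t+1) ], F(x,h) = C_h(x).

    states   -- iterable of possible states x (assumed finite)
    controls -- controls(x, t): iterable of admissible controls u in state x at time t
    a        -- plant equation a(x, u, t) -> next state
    c        -- instantaneous cost c(x, u, t)
    C_h      -- terminal cost C_h(x)
    Returns F as a dict keyed by (x, t), and the optimal control policy u(x, t).
    """
    F = {(x, h): C_h(x) for x in states}
    policy = {}
    for t in range(h - 1, -1, -1):               # work backwards from the horizon
        for x in states:
            u_best, cost_best = min(
                ((u, c(x, u, t) + F[(a(x, u, t), t + 1)]) for u in controls(x, t)),
                key=lambda pair: pair[1])
            F[(x, t)], policy[(x, t)] = cost_best, u_best
    return F, policy
```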

1.5 Example: optimization of consumption

An investor receives annual income of $x_t$ pounds in year $t$. He consumes $u_t$ and adds $x_t - u_t$ to his capital, $0 \le u_t \le x_t$. The capital is invested at interest rate $\theta \times 100\%$, and so his income in year $t+1$ increases to

$$x_{t+1} = a(x_t, u_t) = x_t + \theta(x_t - u_t). \qquad (1)$$

He desires to maximize total consumption over $h$ years,

$$C = \sum_{t=0}^{h-1} c(x_t, u_t, t) + C_h(x_h) = \sum_{t=0}^{h-1} u_t.$$

In the notation we have been using, $c(x_t, u_t, t) = u_t$ and $C_h(x_h) = 0$. This is termed a time-homogeneous model because neither costs nor dynamics depend on $t$.

Solution. Since dynamic programming makes its calculations backwards, from the termination point, it is often advantageous to write things in terms of the ‘time to go’, $s = h - t$. Let $F_s(x)$ denote the maximal reward obtainable, starting in state $x$ when there is time $s$ to go. The dynamic programming equation is

$$F_s(x) = \max_{0 \le u \le x} \left[ u + F_{s-1}(x + \theta(x - u)) \right],$$

where $F_0(x) = 0$ (since nothing more can be consumed once time $h$ is reached). Here $x$ and $u$ are generic values for $x_s$ and $u_s$. We can substitute backwards and so guess the form of the solution. First,

$$F_1(x) = \max_{0 \le u \le x} \left[ u + F_0(x + \theta(x - u)) \right] = \max_{0 \le u \le x} [u + 0] = x.$$

Next,

$$F_2(x) = \max_{0 \le u \le x} \left[ u + F_1(x + \theta(x - u)) \right] = \max_{0 \le u \le x} \left[ u + x + \theta(x - u) \right].$$

Since $u + x + \theta(x - u)$ is linear in $u$, its maximum occurs at $u = 0$ or $u = x$, and so

$$F_2(x) = \max[(1 + \theta)x,\, 2x] = \max[1 + \theta,\, 2]\, x = \rho_2 x.$$

This motivates the guess $F_{s-1}(x) = \rho_{s-1} x$. Trying this, we find

$$F_s(x) = \max_{0 \le u \le x} \left[ u + \rho_{s-1}(x + \theta(x - u)) \right] = \max[(1 + \theta)\rho_{s-1},\, 1 + \rho_{s-1}]\, x = \rho_s x.$$

Thus our guess is verified and $F_s(x) = \rho_s x$, where $\rho_s$ obeys the recursion implicit in the above, i.e. $\rho_s = \rho_{s-1} + \max[\theta \rho_{s-1}, 1]$. This gives

$$\rho_s = \begin{cases} s, & s \le s^*, \\ (1 + \theta)^{s - s^*} s^*, & s \ge s^*, \end{cases}$$

where $s^*$ is the least integer such that $(1 + \theta)s^* \ge 1 + s^*$, i.e. $s^* \ge 1/\theta$, so $s^* = \lceil 1/\theta \rceil$. The optimal strategy is to invest the whole of the income in years $0, \dots, h - s^* - 1$ (to build up capital) and then consume the whole of the income in years $h - s^*, \dots, h - 1$.

There are several things worth learning from this example.

(i) It is often useful to frame things in terms of time to go, $s$.

(ii) The dynamic programming equation may look messy. But try working backwards from $F_0(x)$, which is known. A pattern may emerge from which you can guess the general solution. You can then prove it correct by induction.

(iii) When the dynamics are linear, the optimal control lies at an extreme point of the set of feasible controls. This form of policy, which either consumes nothing or consumes everything, is known as bang-bang control.
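To see the bang-bang solution emerge numerically, here is a minimal sketch (not from the notes; the values of $\theta$, $h$ and the starting income are arbitrary choices) that runs the backward recursion directly and checks it against the closed form $\rho_s x$.

```python
import math

theta, h = 0.25, 12            # illustrative interest rate and horizon
s_star = math.ceil(1 / theta)

def rho(s):
    """Closed form for rho_s derived above."""
    return s if s <= s_star else (1 + theta) ** (s - s_star) * s_star

def F(s, x):
    """Backward recursion F_s(x) = max_{0<=u<=x} [u + F_{s-1}(x + theta*(x-u))].
    Because F_{s-1}(x) = rho_{s-1} x is linear, the maximum is at u = 0 or u = x."""
    if s == 0:
        return 0.0
    return max(0 + F(s - 1, x + theta * x),    # consume nothing: invest everything
               x + F(s - 1, x))                # consume everything

x0 = 1.0
for s in range(1, h + 1):
    assert abs(F(s, x0) - rho(s) * x0) < 1e-9
print("recursion agrees with rho_s * x for s = 1 ..", h)
```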

Proof. The value of $F(W_h)$ is $C_h(x_h)$, so the asserted reduction of $F$ is valid at time $h$. Assume it is valid at time $t + 1$. The DP equation is then

$$F(W_t) = \inf_{u_t} \left\{ c(x_t, u_t, t) + E[F(x_{t+1}, t+1) \mid X_t, U_t] \right\}. \qquad (2)$$

But, by assumption (a), the right-hand side of (2) reduces to the right-hand member of (2). All the assertions then follow.

2.2 Features of the state-structured case

In the state-structured case the DP equation, (1) and (2), provides the optimal control in what is called feedback or closed-loop form, with $u_t = u(x_t, t)$. This contrasts with the open-loop formulation in which $\{u_0, \dots, u_{h-1}\}$ are to be chosen all at once at time 0. To summarise:

(i) The optimal $u_t$ is a function only of $x_t$ and $t$, i.e. $u_t = u(x_t, t)$.

(ii) The DP equation expresses the optimal $u_t$ in closed-loop form. It is optimal whatever the past control policy may have been.

(iii) The DP equation is a backward recursion in time (from which we get the optimum at $h-1$, then $h-2$ and so on). The later policy is decided first.

‘Life must be lived forward and understood backwards.’ (Kierkegaard)

2.3 Example: exercising a stock option

The owner of a call option has the option to buy a share at fixed ‘striking price’ $p$. The option must be exercised by day $h$. If she exercises the option on day $t$, buying for $p$ and then immediately selling at the current price $x_t$, she can make a profit of $x_t - p$. Suppose the price sequence obeys the equation $x_{t+1} = x_t + \epsilon_t$, where the $\epsilon_t$ are i.i.d. random variables for which $E|\epsilon| < \infty$. The aim is to exercise the option optimally.

Let $F_s(x)$ be the value function (maximal expected profit) when the share price is $x$ and there are $s$ days to go. Show that

(i) $F_s(x)$ is non-decreasing in $s$, (ii) $F_s(x) - x$ is non-increasing in $x$, and (iii) $F_s(x)$ is continuous in $x$.

Deduce that the optimal policy can be characterised as follows.

There exists a non-decreasing sequence $\{a_s\}$ such that an optimal policy is to exercise the option the first time that $x \ge a_s$, where $x$ is the current price and $s$ is the number of days to go before expiry of the option.

Solution. The state at time $t$ is, strictly speaking, $x_t$ plus a variable to indicate whether the option has been exercised or not. However, it is only the latter case which is of interest, so $x$ is the effective state variable. As previously, we use time to go, $s = h - t$. Letting $F_s(x)$ be the value function (maximal expected profit) with $s$ days to go, we have

$$F_0(x) = \max\{x - p,\ 0\},$$

and so the dynamic programming equation is

$$F_s(x) = \max\{x - p,\ E[F_{s-1}(x + \epsilon)]\}, \quad s = 1, 2, \dots$$

Note that the expectation operator comes outside, not inside, $F_{s-1}(\cdot)$.

It is easy to show (i), (ii), (iii) by induction on $s$. Of course (i) is obvious, since increasing $s$ means more time over which to exercise the option. However, for a formal proof,

$$F_1(x) = \max\{x - p,\ E[F_0(x + \epsilon)]\} \ge \max\{x - p,\ 0\} = F_0(x).$$

Now suppose, inductively, that $F_{s-1} \ge F_{s-2}$. Then

$$F_s(x) = \max\{x - p,\ E[F_{s-1}(x + \epsilon)]\} \ge \max\{x - p,\ E[F_{s-2}(x + \epsilon)]\} = F_{s-1}(x),$$

whence $F_s$ is non-decreasing in $s$. Similarly, an inductive proof of (ii) follows from

$$\underbrace{F_s(x) - x} = \max\bigl\{-p,\ E[\underbrace{F_{s-1}(x + \epsilon) - (x + \epsilon)}] + E(\epsilon)\bigr\},$$

since the left-hand underbraced term inherits the non-increasing character of the right-hand underbraced term. As $F_s(x) - x$ is therefore non-increasing in $x$, the optimal policy can be characterized as stated: either $a_s$ is the least $x$ such that $F_s(x) = x - p$, or, if no such $x$ exists, $a_s = \infty$. From (i), $F_{s-1}(x) > x - p \implies F_s(x) > x - p$, and so $a_s$ is non-decreasing in $s$.
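As an illustration (not in the notes), the sketch below runs this recursion on a truncated integer price grid, assuming for concreteness that $\epsilon = \pm 1$ with probability $1/2$ each, and reads off the thresholds $a_s$; the distribution, the grid and the striking price are arbitrary choices.

```python
# Backward recursion F_s(x) = max{ x - p, E[F_{s-1}(x + eps)] } on an integer price
# grid, assuming (illustration only) eps = +1 or -1 with probability 1/2 each.
p = 10                            # striking price (illustrative)
top = 40                          # grid truncated at top, so values near it are crude
prices = range(top + 1)

F = {x: max(x - p, 0) for x in prices}                       # F_0
for s in range(1, 11):
    cont = {x: 0.5 * F[min(x + 1, top)] + 0.5 * F[max(x - 1, 0)] for x in prices}
    F = {x: max(x - p, cont[x]) for x in prices}
    a_s = next(x for x in prices if F[x] == x - p)            # least x with F_s(x) = x - p
    print(f"s = {s:2d}: exercise once the price reaches a_s = {a_s}")
```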

2.4 Example: secretary problem

Suppose we are to interview $h$ candidates for a secretarial job. After seeing each candidate we must either hire or permanently reject her. Candidates are seen in random order and can be ranked against those seen previously. The aim is to maximize the probability of choosing the best candidate.

Solution. Let $W_t$ denote the history of observations up to time $t$, i.e. after we have interviewed the $t$-th candidate. All that matters are the value of $t$ and whether the $t$-th candidate is better than all her predecessors. Let $x_t = 1$ if this is true and $x_t = 0$ if it is not. In the case $x_t = 1$, the probability she is the best of all $h$ candidates is

$$P(\text{best of } h \mid \text{best of first } t) = \frac{P(\text{best of } h)}{P(\text{best of first } t)} = \frac{1/h}{1/t} = \frac{t}{h}.$$

Now the fact that the $t$-th candidate is the best of the $t$ candidates seen so far places no restriction on the relative ranks of the first $t - 1$ candidates; thus $x_t = 1$ and $W_{t-1}$ are statistically independent and we have

$$P(x_t = 1 \mid W_{t-1}) = \frac{P(W_{t-1} \mid x_t = 1)}{P(W_{t-1})}\, P(x_t = 1) = P(x_t = 1) = \frac{1}{t}.$$
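Both of these probabilities are easy to sanity-check by simulation. The sketch below (illustrative; $h$, $t$ and the number of trials are arbitrary choices) draws random orderings of $h$ distinctly ranked candidates and estimates $P(x_t = 1)$ and $P(\text{best of } h \mid x_t = 1)$.

```python
import random

# Monte Carlo check of P(x_t = 1) = 1/t and P(best of h | x_t = 1) = t/h,
# for a random ordering of h distinctly ranked candidates (illustrative values).
h, t, trials = 10, 4, 200_000
count_xt1 = count_best_given_xt1 = 0
for _ in range(trials):
    ranks = random.sample(range(h), h)         # ranks[i] = rank of the (i+1)-th candidate
    if ranks[t - 1] == max(ranks[:t]):         # t-th candidate is best of the first t
        count_xt1 += 1
        if ranks[t - 1] == h - 1:              # and in fact best of all h
            count_best_given_xt1 += 1

print(count_xt1 / trials, "should be close to", 1 / t)
print(count_best_given_xt1 / count_xt1, "should be close to", t / h)
```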

3 Dynamic Programming over the Infinite Horizon

Discounting. Interchange arguments. Discounted, negative and positive cases of dynamic programming. Validity of the optimality equation over the infinite horizon. Selling an asset.

3.1 Discounted costs

For a discount factor $\beta \in (0, 1]$, the discounted-cost criterion is defined as

$$C = \sum_{t=0}^{h-1} \beta^t c(x_t, u_t, t) + \beta^h C_h(x_h). \qquad (3)$$

This simplifies things mathematically, particularly when we want to consider an infinite horizon. If costs are uniformly bounded, say $|c(x, u)| < B$, and discounting is strict ($\beta < 1$) then the infinite-horizon cost is bounded by $B/(1 - \beta)$.

In finance, if there is an interest rate of $r\%$ per unit time, then a unit amount of money at time $t$ is worth $\rho = 1 + r/100$ at time $t + 1$. Equivalently, a unit amount at time $t + 1$ has present value $\beta = 1/\rho$. The function $F(x, t)$, which expresses the minimal present value at time $t$ of expected cost from time $t$ up to $h$, is

$$F(x, t) = \inf_\pi E_\pi\left[ \sum_{\tau=t}^{h-1} \beta^{\tau - t} c(x_\tau, u_\tau, \tau) + \beta^{h-t} C_h(x_h) \,\middle|\, x_t = x \right], \qquad (3)$$

where $E_\pi$ denotes expectation over the future path of the process under policy $\pi$. The DP equation is now

$$F(x, t) = \inf_u \left[ c(x, u, t) + \beta E[F(x_{t+1}, t+1) \mid x_t = x, u_t = u] \right], \quad t < h, \qquad (3)$$

where $F(x, h) = C_h(x)$.

3.2 Example: job scheduling

A collection of $n$ jobs is to be processed in arbitrary order by a single machine. Job $i$ has processing time $p_i$ and when it completes a reward $r_i$ is obtained. Find the order of processing that maximizes the sum of the discounted rewards.

Solution. We take ‘time-to-go $k$’ as the point at which the $(n-k)$-th job has just been completed and there remains a set of $k$ uncompleted jobs, say $S_k$. The dynamic programming equation is

$$F_k(S_k) = \max_{i \in S_k} \left[ r_i \beta^{p_i} + \beta^{p_i} F_{k-1}(S_k - \{i\}) \right].$$

Obviously $F_0(\emptyset) = 0$. Applying the method of dynamic programming we first find $F_1(\{i\}) = r_i \beta^{p_i}$. Then, working backwards, we find

$$F_2(\{i, j\}) = \max\left[ r_i \beta^{p_i} + \beta^{p_i + p_j} r_j,\ \ r_j \beta^{p_j} + \beta^{p_j + p_i} r_i \right].$$

There will be $2^n$ equations to evaluate, but with perseverance we can determine $F_n(\{1, 2, \dots, n\})$. However, there is a simpler way.

An interchange argument

Suppose jobs are processed in the order $i_1, \dots, i_k, i, j, i_{k+3}, \dots, i_n$. Compare the reward that is obtained if the order of jobs $i$ and $j$ is reversed: $i_1, \dots, i_k, j, i, i_{k+3}, \dots, i_n$. The rewards under the two schedules are respectively

$$R_1 + \beta^{T + p_i} r_i + \beta^{T + p_i + p_j} r_j + R_2 \quad\text{and}\quad R_1 + \beta^{T + p_j} r_j + \beta^{T + p_j + p_i} r_i + R_2,$$

where $T = p_{i_1} + \cdots + p_{i_k}$, and $R_1$ and $R_2$ are respectively the sums of the rewards due to the jobs coming before and after jobs $i, j$; these are the same under both schedules. The reward of the first schedule is greater if $r_i \beta^{p_i}/(1 - \beta^{p_i}) > r_j \beta^{p_j}/(1 - \beta^{p_j})$. Hence a schedule can be optimal only if the jobs are taken in decreasing order of the indices $r_i \beta^{p_i}/(1 - \beta^{p_i})$. This type of reasoning is known as an interchange argument. The optimal policy we have obtained is an example of an index policy.

Note these points. (i) An interchange argument can be useful when a system evolves in stages. Although one might use dynamic programming, an interchange argument, when it works, is usually easier. (ii) The decision points need not be equally spaced in time. Here they are the times at which jobs complete.
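The index policy is simple to implement. The sketch below (illustrative data, not from the notes) sorts jobs by the index $r_i \beta^{p_i}/(1 - \beta^{p_i})$ and confirms against a brute-force search over all orders that no schedule does better.

```python
from itertools import permutations

# Illustrative jobs: (reward r_i, processing time p_i); beta is the discount factor.
beta = 0.9
jobs = [(10, 3), (7, 1), (4, 2), (12, 5)]

def discounted_reward(order):
    total, elapsed = 0.0, 0
    for r, p in order:
        elapsed += p
        total += r * beta ** elapsed          # reward received when the job completes
    return total

# Index policy: process jobs in decreasing order of r * beta^p / (1 - beta^p).
index_order = sorted(jobs,
                     key=lambda job: job[0] * beta ** job[1] / (1 - beta ** job[1]),
                     reverse=True)

best = max(permutations(jobs), key=discounted_reward)
assert abs(discounted_reward(index_order) - discounted_reward(best)) < 1e-12
print("index order:", index_order, "reward:", discounted_reward(index_order))
```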

3.3 The infinite-horizon case

In the finite-horizon case the value function is obtained simply from (3) by the backward recursion from the terminal point. However, when the horizon is infinite there is no terminal point and so the validity of the optimality equation is no longer obvious.

Consider the time-homogeneous Markov case, in which costs and dynamics do not depend on $t$, i.e. $c(x, u, t) = c(x, u)$. Suppose also that there is no terminal cost, i.e. $C_h(x) = 0$. Define the $s$-horizon cost under policy $\pi$ as

$$F_s(\pi, x) = E_\pi\left[ \sum_{t=0}^{s-1} \beta^t c(x_t, u_t) \,\middle|\, x_0 = x \right].$$

If we take the infimum with respect to $\pi$ we have the infimal $s$-horizon cost

$$F_s(x) = \inf_\pi F_s(\pi, x).$$

Clearly, this always exists and satisfies the optimality equation

$$F_s(x) = \inf_u \left\{ c(x, u) + \beta E[F_{s-1}(x_1) \mid x_0 = x, u_0 = u] \right\}, \qquad (3)$$

with terminal condition $F_0(x) = 0$. The infinite-horizon cost under policy $\pi$ is also quite naturally defined as

$$F(\pi, x) = \lim_{s \to \infty} F_s(\pi, x). \qquad (3)$$

This limit need not exist (e.g. if $\beta = 1$, $x_{t+1} = -x_t$ and $c(x, u) = x$), but it will do so under any of the following three scenarios.

$$F(x) \le F(\pi, x) = c(x, u_0) + \beta E[F(\pi_1, x_1) \mid x_0 = x, u_0] \le c(x, u_0) + \beta E[F(x_1) + \epsilon \mid x_0 = x, u_0] \le c(x, u_0) + \beta E[F(x_1) \mid x_0 = x, u_0] + \beta\epsilon.$$

Minimizing the right-hand side over $u_0$ and recalling that $\epsilon$ is arbitrary gives ‘$\le$’.

3.5 Example: selling an asset

Once a day a speculator has an opportunity to sell her rare collection of tulip bulbs, which she may either accept or reject. The potential sale prices are independently and identically distributed with probability density function $g(x)$, $x \ge 0$. Each day there is a probability $1 - \beta$ that the market for tulip bulbs will collapse, making her bulb collection completely worthless. Find the policy that maximizes her expected return and express it as the unique root of an equation. Show that if $\beta > 1/2$ and $g(x) = 2/x^3$, $x \ge 1$, then she should sell the first time the sale price is at least $\sqrt{\beta/(1 - \beta)}$.

Solution. There are only two states, depending on whether she has sold the collection or not. Let these be 0 and 1, respectively. The optimality equation is

$$F(1) = \int_{y=0}^{\infty} \max[y, \beta F(1)]\, g(y)\, dy = \beta F(1) + \int_{y=0}^{\infty} \max[y - \beta F(1), 0]\, g(y)\, dy = \beta F(1) + \int_{y=\beta F(1)}^{\infty} [y - \beta F(1)]\, g(y)\, dy.$$

Hence

$$(1 - \beta) F(1) = \int_{y=\beta F(1)}^{\infty} [y - \beta F(1)]\, g(y)\, dy. \qquad (3)$$

That this equation has a unique root, $F(1) = F^*$, follows from the fact that the left- and right-hand sides are respectively increasing and decreasing in $F(1)$. Thus she should sell when she can get at least $\beta F^*$. Her maximal reward is $F^*$.

Consider the case $g(y) = 2/y^3$, $y \ge 1$. The left-hand side of (3) is less than the right-hand side at $\beta F(1) = 1$ provided $\beta > 1/2$. In this case the root satisfies $\beta F(1) > 1$ and we compute it from

$$(1 - \beta)F(1) = \frac{2}{\beta F(1)} - \frac{\beta F(1)}{[\beta F(1)]^2} = \frac{1}{\beta F(1)},$$

and thus $F^* = 1/\sqrt{\beta(1 - \beta)}$ and $\beta F^* = \sqrt{\beta/(1 - \beta)}$. If $\beta \le 1/2$ she should sell at any price.

Notice that discounting arises in this problem because at each stage there is a probability $1 - \beta$ that a ‘catastrophe’ will occur that brings things to a sudden end. This characterization of a way in which discounting can arise is often quite useful.
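Equation (3) is also easy to solve numerically. The sketch below (illustrative; the value of $\beta$ and the bisection bracket are arbitrary choices) finds the root for $g(y) = 2/y^3$, $y \ge 1$, and compares it with the closed form $1/\sqrt{\beta(1 - \beta)}$.

```python
import math

beta = 0.8                                    # illustrative, beta > 1/2

def rhs(F):
    """Right-hand side of (3) for g(y) = 2/y^3 on y >= 1, in closed form."""
    a = max(beta * F, 1.0)                    # the integrand is zero below max(beta*F, 1)
    # integral_a^inf (y - beta*F) * 2/y^3 dy = 2/a - beta*F/a^2
    return 2.0 / a - beta * F / a**2

# Bisection on (1 - beta)*F - rhs(F), which is increasing in F.
lo_F, hi_F = 0.0, 100.0
for _ in range(200):
    mid = 0.5 * (lo_F + hi_F)
    if (1 - beta) * mid - rhs(mid) < 0:
        lo_F = mid
    else:
        hi_F = mid

F_star = 0.5 * (lo_F + hi_F)
print(F_star, "vs closed form", 1 / math.sqrt(beta * (1 - beta)))
```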

4 Positive Programming

In the P case, there may be no optimal policy. However, if a policy’s value function satisfies the optimality equation then it is optimal. Value iteration algorithm. Clinical trials.

4.1 Example: possible lack of an optimal policy

Positive programming is about maximizing non-negative rewards, $r(x, u) \ge 0$, or minimizing non-positive costs, $c(x, u) \le 0$. There may be no optimal policy.

Example 4.1. The states are $0, 1, 2, \dots$ and in state $x$ we may either move to state $x + 1$ and receive no reward, or move to state 0, obtain reward $1 - 1/x$, and remain there ever after, obtaining no further reward. The optimality equation is

$$F(x) = \max\left\{ 1 - 1/x,\ F(x + 1) \right\}, \quad x > 0. \qquad (4)$$

Clearly $F(x) = 1$, $x > 0$. But there is no policy that actually achieves a reward of 1.

4.2 Characterization of the optimal policy

For cases P and D, there is a sufficient condition for a policy to be optimal.

Theorem 4.2. Suppose P or D holds and $\pi$ is a policy whose value function $F(\pi, x)$ satisfies the optimality equation

$$F(\pi, x) = \sup_u \left\{ r(x, u) + \beta E[F(\pi, x_1) \mid x_0 = x, u_0 = u] \right\}.$$

Then $\pi$ is optimal.

Proof. Let $\pi'$ be any policy and suppose that in initial state $x$ it takes action $u$. Since $F(\pi, x)$ satisfies the optimality equation,

$$F(\pi, x) \ge r(x, u) + \beta E_{\pi'}[F(\pi, x_1) \mid x_0 = x, u_0 = u].$$

By repeated substitution of this into itself $s$ times, we find

$$F(\pi, x) \ge E_{\pi'}\left[ \sum_{t=0}^{s-1} \beta^t r(x_t, u_t) \,\middle|\, x_0 = x \right] + \beta^s E_{\pi'}[F(\pi, x_s) \mid x_0 = x], \qquad (4)$$

where $u_0, u_1, \dots, u_{s-1}$ are the controls determined by $\pi'$ as the state evolves through $x_0, x_1, \dots, x_{s-1}$. In case P we can drop the final term on the right-hand side of (4) (because it is non-negative) and then let $s \to \infty$; in case D we can let $s \to \infty$ directly, observing that this term tends to zero. Either way, we have $F(\pi, x) \ge F(\pi', x)$.

Now let

$$F_\infty(x) = \lim_{s \to \infty} F_s(x) = \lim_{s \to \infty} \inf_\pi F_s(\pi, x). \qquad (4)$$

This limit exists (by monotone convergence under N or P, or by the fact that under D the cost incurred after time $s$ is vanishingly small). Notice that, given any $\bar\pi$,

$$F_\infty(x) = \lim_{s \to \infty} \inf_\pi F_s(\pi, x) \le \lim_{s \to \infty} F_s(\bar\pi, x) = F(\bar\pi, x).$$

Taking the infimum over $\bar\pi$ gives

$$F_\infty(x) \le F(x). \qquad (4)$$

The following theorem states that $L^s(0) = F_s(x) \to F(x)$ as $s \to \infty$. For case N we need an additional assumption:

F (finite actions): There are only finitely many possible values of $u$ in each state.

Theorem 4.3. Suppose that D, P, or N and F hold. Then $\lim_{s \to \infty} F_s(x) = F(x)$.

Proof. We have (4), so we must prove ‘$\ge$’.

In case P, $c(x, u) \le 0$, so $F_s(x) \ge F(x)$. Letting $s \to \infty$ proves the result.

In case D, the optimal policy is no more costly than a policy that minimizes the expected cost over the first $s$ steps and then behaves arbitrarily thereafter, incurring an expected cost no more than $\beta^s B/(1 - \beta)$. So

$$F(x) \le F_s(x) + \beta^s B/(1 - \beta).$$

It follows that $\lim_{s \to \infty} F_s(x) \ge F(x)$.

In case N and F,

$$F_\infty(x) = \lim_{s \to \infty} \min_u \left\{ c(x, u) + E[F_{s-1}(x_1) \mid x_0 = x, u_0 = u] \right\}
= \min_u \left\{ c(x, u) + \lim_{s \to \infty} E[F_{s-1}(x_1) \mid x_0 = x, u_0 = u] \right\}
= \min_u \left\{ c(x, u) + E[F_\infty(x_1) \mid x_0 = x, u_0 = u] \right\}, \qquad (4)$$

where the first equality holds because the minimum is over a finite number of terms and the second equality is by Lebesgue monotone convergence, noting that $F_s(x)$ increases in $s$. Let $\pi$ be the policy that chooses the minimizing action on the right-hand side of (4). Then by substitution of (4) into itself, and the fact that N implies $F_\infty \ge 0$,

$$F_\infty(x) = E_\pi\left[ \sum_{t=0}^{s-1} c(x_t, u_t) + F_\infty(x_s) \,\middle|\, x_0 = x \right] \ge E_\pi\left[ \sum_{t=0}^{s-1} c(x_t, u_t) \,\middle|\, x_0 = x \right].$$

Letting $s \to \infty$ gives $F_\infty(x) \ge F(\pi, x) \ge F(x)$.
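As a small illustration of Theorem 4.3 in the discounted case, here is value iteration $F_s = L^s(0)$ on a made-up two-state, two-action MDP (all numbers are arbitrary choices); the iterates converge geometrically to the fixed point.

```python
# Value iteration F_s(x) = min_u { c(x,u) + beta * sum_y P[u][x][y] * F_{s-1}(y) }
# on an illustrative two-state, two-action discounted MDP (all numbers made up).
beta = 0.9
c = {0: {'a': 1.0, 'b': 4.0}, 1: {'a': 3.0, 'b': 0.5}}         # costs c(x, u)
P = {'a': {0: [0.8, 0.2], 1: [0.3, 0.7]},                      # P[u][x] = dist over next state
     'b': {0: [0.1, 0.9], 1: [0.6, 0.4]}}

F = {0: 0.0, 1: 0.0}                                           # F_0 = 0
for s in range(1, 501):
    F_new = {x: min(c[x][u] + beta * sum(prob * F[y] for y, prob in enumerate(P[u][x]))
                    for u in ('a', 'b'))
             for x in (0, 1)}
    gap = max(abs(F_new[x] - F[x]) for x in (0, 1))
    F = F_new
    if gap < 1e-10:
        break

print("F approximately", F, "after", s, "iterations")
```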

4.5 D case recast as a N or P case

A D case can always be recast as a P or N case. To see this, recall that in the D case $|c(x, u)| < B$. Imagine subtracting $B > 0$ from every cost. This reduces the infinite-horizon cost under any policy by exactly $B/(1 - \beta)$. That is, in a problem with costs $\tilde c(x, u) = c(x, u) - B$,

$$\tilde F(\pi, x) = F(\pi, x) - \frac{B}{1 - \beta}.$$

So any optimal policy is unchanged. All costs are now negative, so we now have a P case. Similarly, adding $B$ to every cost recasts a D case as an N case.

This means that any result we might prove under conditions for the N or P case will also hold for the D case.
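For instance (a made-up illustration), with $\beta = 0.9$ and $B = 2$: subtracting 2 from every cost lowers every policy's total discounted cost by exactly $2(1 + 0.9 + 0.9^2 + \cdots) = 2/(1 - 0.9) = 20$, the same constant for all policies, so the identity of the optimal policy is unaffected.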

4.6 Example: pharmaceutical trials

A doctor has two drugs available to treat a disease. One is a well-established drug and is known to work for a given patient with probability $p$, independently of its success for other patients. The new drug is untested and has an unknown probability of success $\theta$, which the doctor believes to be uniformly distributed over $[0, 1]$. He treats one patient per day and must choose which drug to use. Suppose he has observed $s$ successes and $f$ failures with the new drug. Let $F(s, f)$ be the maximal expected-discounted number of future patients who are successfully treated if he chooses between the drugs optimally from this point onwards. For example, if he uses only the established drug, the expected-discounted number of patients successfully treated is $p + \beta p + \beta^2 p + \cdots = p/(1 - \beta)$.

The posterior distribution of $\theta$ is

$$f(\theta \mid s, f) = \frac{(s + f + 1)!}{s!\, f!}\, \theta^s (1 - \theta)^f, \quad 0 \le \theta \le 1,$$

and the posterior mean is $\bar\theta(s, f) = (s + 1)/(s + f + 2)$. The optimality equation is

$$F(s, f) = \max\left[ \frac{p}{1 - \beta},\ \ \frac{s + 1}{s + f + 2}\bigl(1 + \beta F(s + 1, f)\bigr) + \frac{f + 1}{s + f + 2}\, \beta F(s, f + 1) \right].$$

Notice that after the first time that the doctor decides it is not optimal to use the new drug, it cannot be optimal for him to return to using it later, since his information about that drug cannot have changed while he was not using it.

It is not possible to give a closed-form expression for $F$, but we can approximate $F$ using value iteration, finding $F \approx L^n(0)$ for large $n$. An alternative is the following. If $s + f$ is very large, say 300, then $\bar\theta(s, f) = (s + 1)/(s + f + 2)$ is a good approximation to $\theta$. Thus we can take $F(s, f) \approx (1 - \beta)^{-1} \max[p, \bar\theta(s, f)]$ for $s + f = 300$ and then work backwards. For $\beta = 0$, one obtains the following table.
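Here is a minimal sketch (not from the notes; the values of $\beta$, $p$ and the cutoff are arbitrary choices) of the approximation just described: fix $F$ on the layer $s + f = 300$ using the posterior mean, then work backwards.

```python
# Approximate F(s, f) by fixing F at the boundary s + f = N (where the posterior
# mean is a good estimate of theta) and working backwards; parameters illustrative.
beta, p, N = 0.9, 0.6, 300

def theta_bar(s, f):
    return (s + 1) / (s + f + 2)

# Boundary layer: F(s, f) ~ max(p, theta_bar) / (1 - beta) when s + f = N.
F = {(s, N - s): max(p, theta_bar(s, N - s)) / (1 - beta) for s in range(N + 1)}

for n in range(N - 1, -1, -1):            # n = s + f, working backwards
    for s in range(n + 1):
        f = n - s
        tb = theta_bar(s, f)
        use_new = tb * (1 + beta * F[(s + 1, f)]) + (1 - tb) * beta * F[(s, f + 1)]
        F[(s, f)] = max(p / (1 - beta), use_new)

# The new drug is worth trying initially iff F(0,0) exceeds the value of the old drug alone.
print(F[(0, 0)], "vs", p / (1 - beta), "-> try new drug first:", F[(0, 0)] > p / (1 - beta))
```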
