Decision theory
Decision rules. Loss functions and risks
D6.1.1 A statistical decision problem consists of the following elements:
- a ps `(bbbX, ccX, ccP)` for a random observable `X`
- an action space `(bbbA, ccA)`. `bbbA` is the set of allowable actions, and `ccA` is the `sigma`-field on `bbbA`
- a decision rule (dr) `d: bbbX -> bbbA`, measurable
- a loss function `L(P, d(x))` which specifies the loss associated with picking `d(x)` when the underlying model is `P in ccP`, `L: ccP xx bbbA -> [0, oo)`, measurable wrt `ccA` for each fixed `P`
- the risk of a decision rule `R(P, d) = E_P(L(P, d(X)))`
The goal of decision theory is to find the best decision rule given a loss function.
If `ccP` is parametric, then we often write `L(theta, d(x))`
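The elements of D6.1.1 can be made concrete numerically. A minimal Monte Carlo sketch (the normal-location setup and all names here are illustrative, not from the notes): take `ccP = {N(theta, 1)}` for `X = (X_1, ..., X_n)` iid, squared-error loss `L(theta, a) = (a - theta)^2`, and the nrdr `d(x) =` sample mean.

```python
import random, statistics

def risk(theta, d, n=5, reps=20000, rng=None):
    # Monte Carlo estimate of R(P_theta, d) = E_theta[ L(theta, d(X)) ]
    # under squared-error loss L(theta, a) = (a - theta)^2.
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(theta, 1) for _ in range(n)]
        total += (d(x) - theta) ** 2
    return total / reps

# For the sample mean, R(theta, d) = 1/n exactly, for every theta.
print(risk(0.0, statistics.mean))  # approx 0.2 for n = 5
```

Here the exact risk `1/n` is known, so the Monte Carlo value only illustrates the definition `R(P, d) = E_P(L(P, d(X)))`.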
D6.1.2 A dr is `d_1`:
- as good as `d_2` if `R(P, d_1) <= R(P, d_2) quad AA P in ccP`
- better than `d_2` if it is as good as `d_2` and `R(P, d_1) < R(P, d_2)` for at least one `P in ccP`
- equivalent to `d_2` if `R(P, d_1) = R(P, d_2) quad AA P in ccP`
Let `ccT` be a class of decision rules. A dr d is `ccT`-optimal if `d` is as good as any other dr in `ccT`. If `ccT` contains all the possible dr's, then `d` is optimal if `d` is `ccT`-optimal
D6.1.2
- A non-random decision rule (nrdr) `d:bbbX -> bbbA` is a measurable function
A behavioural decision rule (bdr) is a function `delta: bbbX xx ccA -> [0,1]` st:
- `AA x in bbbX` `delta(x, *)` is a pm on `(bbbA, ccA)`
- `AA A in ccA` `delta(*, A)` is `ccX`-measurable
The loss function of a bdr is `L(P, delta)(x) = int_bbbA L(P, a) delta(x, da)`. The risk is `R(P, delta) = int_bbbX L(P, delta)(x) dP(x)`
D6.1.3 A randomised decision rule (rdr) `bar delta` is a pm on `(bbbD, ccD)`. The risk is `R(P, bar delta) = int_bbbD R(P, g) d bar delta(g)`
Remarks:
- rdr and bdr are called random drs
- `R(P, bar delta)` is a further average of `R(P, g)` wrt a pm on all `g in bbbD`. `bar delta` is a model for preference of nrdrs - like a prior on `bbbD`
- A bdr is more natural, but an rdr is easier to analyse.
- It can be shown that if `bbbA` is a complete separable (Polish) metric space and `ccA = B(bbbA)` then `AA delta in bbbD_b EE bar delta in bbbD_r "st" R(P, delta) = R(P, bar delta) quad AA P in ccP` for a given decision problem
Admissibility and geometry of decision rules
D6.2.1 Let `ccT` be a class of drs. A dr `delta in ccT` is `ccT`-admissible if there is no dr `in ccT` that is better than `delta`.
The notion of admissibility is a retreat from `ccT`-optimality as the latter may not exist.
T6.2.1 Suppose `bbbA sub RR^k` is convex and `delta in bbbD_b` st `int_bbbA ||a|| delta(x, da) < oo quad AA x in bbbX`. Let `d(x) = int_bbbA a delta(x, da) quad AA x in bbbX` (an nrdr). Then:
- If `L(P, a)` is convex then `R(P, d) <= R(P,delta)`
- If `L(P, a)` is strictly convex in `a`, `R(P, delta) < oo` and `P({x: delta(x, *) "is not degenerate"}) > 0` then `R(P, d) < R(P,delta)`
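T6.2.1 is Jensen's inequality applied to the action distribution. A hypothetical sketch (one observation `X ~ N(theta, 1)`, squared-error loss; the specific bdr is made up for illustration): `delta(x, *)` puts mass 1/2 on each of `x - 1` and `x + 1`, whose mean action is `d(x) = x`; the risk gap here is exactly 1.

```python
import random

def risks(theta, reps=20000, rng=None):
    # Compare a bdr delta(x, .) = uniform on {x - 1, x + 1} against the
    # nrdr d(x) = x obtained by averaging its actions (T6.2.1).
    rng = rng or random.Random(1)
    r_delta = r_d = 0.0
    for _ in range(reps):
        x = rng.gauss(theta, 1)
        a = x + rng.choice([-1.0, 1.0])   # draw an action from delta(x, .)
        r_delta += (a - theta) ** 2
        r_d += (x - theta) ** 2           # the averaged (non-random) rule
    return r_delta / reps, r_d / reps

r_delta, r_d = risks(0.0)
print(r_d < r_delta)  # True: averaging out the randomisation lowers the risk
```

Analytically `R(theta, delta) = 1 + 1 = 2` while `R(theta, d) = 1`, matching the strict inequality in the second bullet since squared-error loss is strictly convex.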
Geometry of decision rules
A helpful device to understand the basics of decision rules. Assume `ccP = {P_1, ..., P_k}` is a finite collection of pms. Given a dr `delta` define the k-dim risk profile `y_delta = (R(P_1, delta), ..., R(P_k, delta))`. Let `ccR_(k,r) = {y_delta in RR^k : delta in bbbD_r}`, `ccR_(k,b) = {y_delta in RR^k : delta in bbbD_b}`
T6.2.2 `ccR_(k,r)` and `ccR_(k,b)` are convex.
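The convexity in T6.2.2 can be seen directly for `ccR_(k,r)`: mixing two rules with probabilities `(a, 1-a)` mixes their risk profiles. A toy finite problem (all numbers made up): `bbbX = {0,1}`, `ccP = {P_1, P_2}`, actions `{0,1}` that "guess" the model, 0-1 loss.

```python
# P_1, P_2 as pmfs on bbbX = {0, 1}; loss(i, a) = 0 iff action a names model i.
P = [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}]
loss = lambda i, a: 0.0 if a == i else 1.0

def risk(i, d):
    # exact risk R(P_i, d) = sum_x P_i(x) L(P_i, d(x))
    return sum(P[i][x] * loss(i, d(x)) for x in (0, 1))

d1 = lambda x: x        # guess the model from the observation
d2 = lambda x: 0        # always guess P_1
y1 = (risk(0, d1), risk(1, d1))
y2 = (risk(0, d2), risk(1, d2))
# The rdr "use d1 with prob a, d2 with prob 1 - a" has risk profile
# a*y1 + (1-a)*y2, so ccR_(k,r) contains the whole segment between y1 and y2.
a = 0.25
y_mix = tuple(a * u + (1 - a) * v for u, v in zip(y1, y2))
print(y1, y2, y_mix)
```

Here `y1 = (0.3, 0.2)`, `y2 = (0.0, 1.0)`, and the mixture lands at the convex combination `(0.075, 0.8)`.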
D6.2.2 Let `X = (X_1, ..., X_k )` and `Y = (Y_1, ..., Y_k)`:
- `X <= Y` if `X_i <= Y_i quad AA i = 1, ..., k`
- `X < Y` if `X <= Y` and `EE i_0 "st" X_(i_0) < Y_(i_0)`
D6.2.3 `X in RR^k`, the lower quadrant of X is the set `Q_x = {Y in RR^k; Y <= X}`
D6.2.4
- `x in RR^k` is a lower boundary point of a set `A sub RR^k` if `Q_x nn A = {x}`
- `lambda(A) = {x ; Q_x nn A = {x}}`
- a set `A sub RR^k` is closed from below if `lambda(A) sub A`
T6.2.3 `y_delta in ccR_(k,b)`. If `y_delta in lambda(ccR_(k,b))` then `delta` is admissible. The converse is true if `ccR_(k,b)` is closed from below.
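For a finite set of risk profiles, `lambda(A)` from D6.2.4 is just the set of points not coordinatewise dominated by any other point, which is the admissibility picture of T6.2.3. A small sketch with made-up risk profiles:

```python
# Hypothetical risk profiles y_delta for four rules, k = 2 models.
pts = [(0.3, 0.2), (0.2, 0.5), (0.5, 0.5), (0.1, 0.9)]

def leq(a, b):
    # a <= b coordinatewise (D6.2.2)
    return all(x <= y for x, y in zip(a, b))

# p is a lower boundary point iff Q_p nn pts = {p}, i.e. no other
# point of the set lies in its lower quadrant.
lower_boundary = [p for p in pts if not any(leq(q, p) and q != p for q in pts)]
print(lower_boundary)  # (0.5, 0.5) is dominated by (0.3, 0.2) and drops out
```

The rule with profile `(0.5, 0.5)` is inadmissible: `(0.3, 0.2)` is better against both models.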
Complete classes of decision rules
D6.3.1 Let `ccC sub bbbD_b` be a class of drs.
- `ccC` is a complete class (CC) if `AA delta !in ccC EE delta_1 in ccC` st `delta_1` is better than `delta`
- `ccC` is an essentially complete class (ECC) if `AA delta !in ccC EE delta_1 in ccC` st `delta_1` is as good as `delta`
- `ccC` is a minimal complete class (MCC) if it is complete and is a subset of every other CC.
- a minimal essentially complete class is defined similarly
T6.3.1 Let `A(bbbD_b)` be the set of admissible drs in `bbbD_b`. If a MCC exists then it is `A(bbbD_b)`
T6.3.2 If `A(bbbD_b)` is complete, then it is a MCC.
T6.3.3 Suppose `ccP = {P_1, ..., P_k}` is finite. If `ccR_(k,b)` is closed from below, then `bbbD_0 = {delta in bbbD_b: y_delta in lambda(ccR_(k,b))}` is a MCC.
T6.3.4 If `T(X)` is sufficient for `ccP` and `delta in bbbD_b` then `delta'(x,A) = E(delta(X,A) | T(X)) in bbbD_b` and `R(P, delta') = R(P, delta) quad AA P in ccP`
L6.3.5 Suppose `bbbA sub RR^k` is convex and `d_1, d_2 in bbbD`. Let `d(x) = (1/2)(d_1(x) + d_2(x))`, then:
- `d in bbbD`
- If `L(P,a)` is convex in `a quad AA P in ccP` and `R(P, d_1) = R(P, d_2)`, then `R(P, d) <= R(P, d_1)`
- If `L(P,a)` is strictly convex in `a quad AA P in ccP` and `R(P, d_1) = R(P, d_2) < oo` and `P(d_1 != d_2) > 0` then `R(P, d) < R(P, d_1)`
C6.3.6 Suppose `bbbA sub RR^k` is convex and `d_1, d_2 in bbbD` with the same risk `AA P in ccP`. If the loss is convex in `a quad AA P in ccP` and strictly convex for one `P_0 in ccP` and `R(P_0, d_1) = R(P_0, d_2) < oo` and `P(d_1 != d_2) > 0` then `d_1` and `d_2` are inadmissible.
T6.3.7 (Rao-Blackwell theorem). Let `bbbA` be a convex subset of `RR^k` and `T` be a sufficient statistic for `ccP`. Let `d` be an nrdr with `E_P||d(X)|| < oo quad AA P in ccP` and `d_0(x) = E(d(X) | T)(x)`. Then
- `d_0` is an nrdr st `E_P||d_0(X)|| < oo`
- if `L(P,a)` is convex in `a quad AA P in ccP`, `R(P, d_0) <= R(P, d)`
- if `L(P,a)` is strictly convex in `a quad AA P in ccP` and `R(P, d) < oo` and `P(d_0 != d) > 0` then `R(P, d_0) < R(P, d)`
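A classic Rao-Blackwell sketch (the Bernoulli setup is the standard textbook example, not one worked in these notes): for `X_1, ..., X_n` iid Bernoulli(`p`), `T = sum X_i` is sufficient; starting from the crude unbiased rule `d(X) = X_1`, conditioning gives `d_0(X) = E(X_1 | T) = T//n`, the sample mean, with strictly smaller risk under squared-error loss.

```python
import random

def mc_risks(p, n=10, reps=40000, rng=None):
    # Monte Carlo risks of d(X) = X_1 and d_0(X) = sum(X)/n = E(X_1 | T)
    # under squared-error loss.
    rng = rng or random.Random(2)
    r_d = r_d0 = 0.0
    for _ in range(reps):
        x = [1 if rng.random() < p else 0 for _ in range(n)]
        r_d += (x[0] - p) ** 2
        r_d0 += (sum(x) / n - p) ** 2
    return r_d / reps, r_d0 / reps

r_d, r_d0 = mc_risks(0.3)
print(r_d, r_d0)  # roughly p(1-p) = 0.21 vs p(1-p)/n = 0.021
```

The tenfold risk reduction is exactly the `1/n` factor predicted by computing `Var(bar X) = p(1-p)/n` against `Var(X_1) = p(1-p)`.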
Bayes and minimax rules
So far we have compared drs `delta_1` and `delta_2` via their risk vectors/profiles; however, this multivariate comparison may not produce a "better" rule. Bayes and minimax rules use a univariate measure of risk.
D6.4.1 Given a statistical decision problem `(bbbX, ccX, ccP)`, `(bbbA, ccA)`, `L(P,a)`, a dr `delta in bbbD` produces a risk `R(P, delta)`. The Bayes risk wrt a prior pm `Pi` on `(ccP, ccF_P)` is `R_Pi(delta) = int_ccP R(P, delta) Pi(dP)`.
Remarks:
- if `ccP` is finite, then this is equivalent to a weighted average
- if `ccP = { P_theta, theta in Theta}`, a parametric space, `ccF_theta` as `sigma`-field on `Theta`, the prior `Pi` can be regarded as a pm on `(Theta, ccF_theta)` and `R_Pi(delta) = int_Theta R(P_theta, delta) Pi(d theta)`
- Bayes risk has the same flavour as the risk of an rdr, but averaging over `P in ccP` rather than `d in bbbD`
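The first remark above, in code: for finite `ccP` the Bayes risk is just a weighted average of the risk profile (numbers made up for illustration).

```python
# Risk profile R(P_i, delta) of one rule against a finite ccP = {P1, P2},
# and a prior Pi = (pi_1, pi_2) with pi_1 + pi_2 = 1.
risk_profile = {"P1": 0.3, "P2": 0.2}
prior = {"P1": 0.4, "P2": 0.6}

# Bayes risk R_Pi(delta) = sum_i pi_i R(P_i, delta)
bayes_risk = sum(prior[p] * risk_profile[p] for p in prior)
print(bayes_risk)  # 0.4*0.3 + 0.6*0.2 = 0.24
```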
D6.4.2 Let `ccT` be a set of drs. A dr `delta_0 in ccT` is the `ccT`-Bayes rule if `R_Pi(delta_0) = inf_(delta in ccT) R_Pi(delta)`
T6.4.1 Let `ccP = {P_1, ..., P_k}` and `ccT` a family of drs. If `delta_0` is `ccT`-Bayes wrt a prior `Pi = (pi_1, ..., pi_k)`, `pi_i > 0, sum pi_i = 1`, then `delta_0` is admissible.
T6.4.2 Let `ccP = {P_theta; theta in Theta sub RR^k}` st every open ball in `Theta` has positive `Pi`-probability and `R(theta, delta) := R(P_theta, delta)` is continuous wrt `theta` on `Theta` for each `delta in ccT`. If:
- `delta_0` is `ccT`-Bayes wrt to a prior `Pi`
- `R_Pi(delta_0) < oo`
then `delta_0` is `ccT`-admissible.
T6.4.5 If `ccP` is finite, and `delta` is `ccT`-admissible then there exists a prior `Pi` on `(ccP, ccF_P)` st `delta` is `ccT`-Bayes wrt `Pi`.
P6.4.4 (Lehmann's theorem) Let `T: (Omega, ccF) -> (Lambda, ccG)` and `phi: (Omega, ccF) -> (RR^k, B(RR^k))` be measurable. Then `phi` is `(Omega, sigma(T)) -> (RR^k, B(RR^k))` measurable iff there exists a measurable `psi: (Lambda, ccG) -> (RR^k, B(RR^k))` st `phi = psi @ T`
D6.4.3 The conditional expectation of `X | Y=y` for some `y in RR^k` is `E(X|Y=y) = h(y)`, where `h` is the measurable function with `E(X|Y) = h(Y)` (which exists by P6.4.4)
P6.4.5 Let `X` and `Y` be n- and m-dimensional r. vectors. Suppose `P_((X,Y))`, the pm of `(X,Y)`, is dominated by `nu xx lambda` with density `f(x,y)`, where `nu` and `lambda` are `sigma`-finite measures on `(RR^n, B(RR^n))` and `(RR^m, B(RR^m))`. Let `g(x,y): (RR^(n+m), B(RR^(n+m))) -> (RR, B(RR))` be measurable, st `E|g(X,Y)| < oo`. Then `E(g(X,Y) | Y) = (int g(x,Y) f(x,Y) nu(dx))/(int f(x,Y) nu(dx))` and `E(g(X,Y) | Y=y) = (int g(x,y) f(x,y) nu(dx))/(int f(x,y) nu(dx))`.
T6.4.6 (Existence of conditional distribution in a general case). Let `X` be an n-dim r. vec on `(Omega, ccF, P)`. `Y: (Omega, ccF) -> (Lambda, ccG)` measurable. Then there exists a regular cond pm `P_(X|Y)( * | y)`, called the conditional distribution of `X|Y=y`, st
- `P_(X|Y)(B | y) = P(X in B | Y = y) "wp1" P_Y quad AA B in B(RR^n)`
- `P_(X|Y)(* | y)` is a pm on `(RR^n, B(RR^n)) quad AA y in Lambda`
Furthermore, if `E|g(X,Y)| < oo` for `g: RR^n xx Lambda -> RR` measurable, then `E(g(X,Y) | Y=y) = E(g(X, y) | Y=y) = int_(RR^n) g(x,y) dP_(X|Y)(x|y) "wp1" P_Y`
Remark:
- the theorem assures existence of cond dist for a wide range of cases, and frees us from the task of proving P6.4.5 (but P6.4.5 does give more details of the cond dist)
- cond dist is a regular cond prob of a rv given another rv `Y` at `y`.
Construction of Bayes Rules
T6.4.7 (Bayes formula). Assume `ccP = {P_theta; theta in Theta}` is dominated by `sigma`-finite `nu` and `f_theta(x) = (dP_(X|theta)(x | theta))/(dnu)` is the cond density, which, as a function of `(x, theta)`, is measurable on `(bbbX xx Theta, sigma(ccX xx ccF_Theta))`. Let `Pi` be a prior pm on `(Theta, ccF_Theta)`. Assume `m(x) = int_Theta f_theta(x) dPi(theta) > 0`, then:
- the posterior distribution of `theta | X` is `P_(theta|X) << Pi ` and `(dP_(theta|X))/dPi = (f_theta(x)) / (m(x))`
- if `Pi << lambda` and `dPi/dlambda = pi(theta)` for `sigma`-finite `lambda` then `(dP_(theta|X))/(dlambda) = (f_theta(x) pi(theta))/(m(x))`
Remarks:
- `M(A) = int_A m(x) nu(dx) quad AA A in ccX` is the marginal distribution of `X`, with density `m` wrt `nu`
The posterior distribution `P_(theta|X)` plays a pivotal role in Bayesian inference and yields the Bayes rule in most situations.
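The Bayes formula is easiest to see with a discrete parameter space, where the integrals become sums. A hypothetical illustration (the three-point `Theta` and binomial model are made up): `Theta = {0.2, 0.5, 0.8}`, `X | theta ~ Binomial(n, theta)`, uniform prior; the posterior is `f_theta(x) pi(theta) / m(x)`.

```python
from math import comb

thetas = [0.2, 0.5, 0.8]
prior = {t: 1 / 3 for t in thetas}                      # pi(theta)
f = lambda t, x, n: comb(n, x) * t ** x * (1 - t) ** (n - x)  # f_theta(x)

def posterior(x, n):
    # m(x) = sum_theta f_theta(x) pi(theta); posterior = f pi / m (T6.4.7)
    m = sum(f(t, x, n) * prior[t] for t in thetas)
    return {t: f(t, x, n) * prior[t] / m for t in thetas}

post = posterior(8, 10)
print(post)  # mass concentrates on theta = 0.8 after seeing 8/10 successes
```

Dividing by `m(x)` normalises the posterior, so its values sum to 1 by construction.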
For an nrdr `R_Pi(delta) = int_bbbX E(L(theta, delta(x)) | X=x) dM(x)` where `E(L(theta, delta(x)) | X=x)` is the posterior expected value given `X=x`
T6.4.8 Under the conditions of T6.4.7, and with `Theta, bbbA` convex. Then
- with squared error loss, and `Pi "st" int theta^2 dPi < oo`, the Bayes rule is `d(x) = E(theta | X = x)`
- with `L_1` loss, and `Pi "st" int |theta| dPi < oo`, the Bayes rule is `d(x) = "median of" P_(theta|X)`
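A hedged Beta-Bernoulli sketch of the first bullet (a standard conjugate example, not worked in the notes): with `theta ~ Beta(a, b)` and `X | theta ~ Binomial(n, theta)`, the posterior is `Beta(a + x, b + n - x)`, so the Bayes rule under squared-error loss is the posterior mean. The grid search below double-checks that the posterior mean does minimise the posterior expected loss.

```python
def bayes_rule_sq(x, n, a=1.0, b=1.0):
    # posterior mean of Beta(a + x, b + n - x): Bayes rule for squared error
    return (a + x) / (a + b + n)

def post_expected_loss(act, x, n, a=1.0, b=1.0, m=2000):
    # numerically average (act - theta)^2 over the Beta(a+x, b+n-x) posterior
    aa, bb = a + x, b + n - x
    num = den = 0.0
    for i in range(1, m):
        t = i / m
        w = t ** (aa - 1) * (1 - t) ** (bb - 1)   # unnormalised posterior
        num += w * (act - t) ** 2
        den += w
    return num / den

grid = [i / 1000 for i in range(1001)]
best = min(grid, key=lambda act: post_expected_loss(act, 7, 10))
print(bayes_rule_sq(7, 10), best)  # both close to 8/12 under a Beta(1,1) prior
```

The grid minimiser agrees with the closed-form posterior mean `(1 + 7)/(2 + 10) = 2/3` up to the grid resolution.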
Minimax rules
D6.4.4 A dr `delta_0 in ccT` is the minimax rule if `sup_(P in ccP) R(P, delta_0) = inf_(delta in ccT) sup_(P in ccP) R(P, delta)`. A minimax rule has the smallest worst-case risk.
D6.4.5 A dr `delta` is an equaliser rule if it has constant risk, ie `R(P, delta) = c quad AA P in ccP`
T6.4.6 If `delta` is an equaliser rule and is admissible then it is minimax.
T6.4.9 Suppose `{delta_i}_(i>=1)` is a sequence of drs, and each `delta_i` is Bayes wrt `Pi_i`. If `R_(Pi_i)(delta_i) -> c < oo` and `delta_0` is a dr with `R(P, delta_0) <= c quad AA P in ccP` then `delta_0` is minimax.
C6.4.10 If `delta_0` is an equaliser rule and is Bayes wrt `Pi_0` then it is minimax.
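The classic example of C6.4.10 (standard in textbooks, not derived in these notes): for `X ~ Binomial(n, p)` under squared-error loss, the Bayes rule wrt the `Beta(sqrt(n)/2, sqrt(n)/2)` prior is `d(x) = (x + sqrt(n)/2)/(n + sqrt(n))`, and its risk is the constant `n/(4(n + sqrt(n))^2)`, so it is an equaliser rule and hence minimax.

```python
import math

def risk(p, n):
    # exact risk of the equaliser rule d(x) = (x + c)/(n + 2c), c = sqrt(n)/2,
    # computed by summing over the binomial pmf
    c = math.sqrt(n) / 2
    d = lambda x: (x + c) / (n + 2 * c)
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) * (d(x) - p) ** 2
               for x in range(n + 1))

n = 16
risks = [risk(p, n) for p in (0.1, 0.3, 0.5, 0.9)]
print(risks)  # all equal n / (4 * (n + sqrt(n))^2) = 0.01 for n = 16
```

The constancy of the risk over `p` is the equaliser property; being Bayes wrt a proper prior then delivers minimaxity via C6.4.10.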
Remarks:
- constant risk together with either admissibility or being Bayes implies minimax, but the converse may not be true
- `ul V = sup_Pi inf_(delta in ccT) R_Pi(delta)`, `bar V = inf_(delta in ccT) sup_(P in ccP) R(P,delta)`. Let `R_Pi = inf_(delta in ccT) R_Pi(delta)` (the minimum Bayes risk wrt `Pi`). It can be shown that `R_Pi <= bar V`
D6.4.6 A prior `Pi_0` is least favourable if `R_(Pi_0) = sup_Pi R_Pi`.
A Bayes rule wrt a least favourable prior has the best chance of being minimax.
C6.4.11 If `delta` is Bayes wrt `Pi` and `R(P, delta) <= R_Pi = inf_(delta' in ccT) R_Pi(delta') quad AA P in ccP` then `delta` is minimax and `Pi` is least favourable.
Unbiased estimators and invariant drs
D6.5.1 `(bbbX, ccX, ccP = {P_theta, theta in Theta})`
- a measurable function `gamma: Theta -> RR^k` is called a parametric function
- an nrdr `d(x)` is an unbiased estimator (UE) of a parametric fn `gamma(theta)` if `E_(P_theta)(d(X)) = gamma(theta) quad AA theta in Theta`
- a parametric function `gamma(theta)` is U-estimable if there exists a UE for `gamma(theta)`
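Unbiasedness in D6.5.1 can be checked by simulation. A minimal sketch (the normal setup is illustrative): `gamma(theta) = theta` is U-estimable for `X_i ~ N(theta, 1)` since the sample mean satisfies `E_(P_theta)(bar X) = theta` for every `theta`.

```python
import random, statistics

def mc_mean_of_estimator(theta, n=5, reps=40000, rng=None):
    # Monte Carlo estimate of E_theta(d(X)) for d = sample mean;
    # unbiasedness means this should match theta for every theta.
    rng = rng or random.Random(3)
    return statistics.mean(
        statistics.mean(rng.gauss(theta, 1) for _ in range(n))
        for _ in range(reps))

print(mc_mean_of_estimator(1.5))  # approx 1.5
```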
T6.5.1 (Lehmann-Scheffe) Suppose that:
- `T(X)` is complete and sufficient for `ccP`
- `bbbA sub RR^k` is convex
- `L(theta, a)` is convex in `a quad AA theta in Theta`
If there exists a UE of a parametric fn `gamma(theta)` then there exists a best UE (BUE) of `gamma(theta)`
Invariant drs
It would be nice to have decision rules that are invariant to transformations of the r. obs.
D6.5.2 Let `G != o/`, and `@` be a binary operator on `G`. `(G, @)` is called a group if:
- `AA g_1, g_2 in G` `g_1 @ g_2 in G`
- `AA g_1, g_2, g_3 in G` `(g_1 @ g_2) @ g_3 = g_1 @ (g_2 @ g_3)`
- `EE e in G` st `g @ e = e @ g = g quad AA g in G`
- `AA g in G EE h in G "st" g @ h = h @ g = e`
`e` is unique and is the identity element of `G`. `h` is the inverse of `g`.
D6.5.3 Let `(bbbX, ccX)` be a measurable space. `G` is a group of measurable transformations (GMT) if :
- `G sub {g | g:bbbX -> bbbX, "measurable and one-to-one"}`
- `(G, @)` is a group, where `@` is the composition operator
D6.5.4 Let `(bbbX, ccX, {P_theta, theta in Theta})` be a ps for r. obs `X` and `G` a GMT on `(bbbX, ccX)`; then `ccP` is invariant under `G` if `AA theta in Theta, g in G` there exists a unique `theta' in Theta` st `P_theta(g^(-1)(A)) = P_(theta')(A) quad AA A in ccX`
- we use `bar g(theta)` to denote the unique `theta'`
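A sketch of D6.5.4 for the standard location-family example (not worked in the notes): for `P_theta = N(theta, 1)` under the shift group `g_c(x) = x + c`, we have `P_theta(g_c^(-1)(A)) = P_(theta + c)(A)`, so `bar g_c(theta) = theta + c`.

```python
import math

# Standard normal cdf and P_theta of an interval (lo, hi]
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
P = lambda theta, lo, hi: Phi(hi - theta) - Phi(lo - theta)

theta, c, lo, hi = 0.7, 2.0, -1.0, 1.0
# g_c^(-1)((lo, hi]) = (lo - c, hi - c], so the two sides of D6.5.4 are:
lhs = P(theta, lo - c, hi - c)   # P_theta(g_c^(-1)(A))
rhs = P(theta + c, lo, hi)       # P_(theta + c)(A), i.e. bar g_c(theta) = theta + c
print(abs(lhs - rhs) < 1e-12)  # True
```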