Exponential smoothing

Given $Y_1, ..., Y_T$ , want to predict $Y_{T+1}$ , denoted by $\hat{Y}_{T+1}$ , and can measure the quality with the MSE $ = E(Y_{T+1} - \hat{Y}_{T+1})^2$ .

Suppose underlying process has constant mean, $\mu$ , which we don’t know but can estimate from the sample. If instead of being constant the mean is slowly varying function of $t$ , then it makes sense to use a weighted mean with recent values having greater weights.

\[ \hat{Y}_{T+1} = c(Y_T + \omega Y_{T-1} + \omega^2 Y_{T-2} + \cdots) \]

For some $|\omega| \lt 1$ (known as the discounting coefficient), with weights summing to 1, ie. $\hat{Y}_{T+1} = (1- \omega)(Y_T + \omega Y_{T-1} + \omega^2 Y_{T-2} + \cdots)$

Updating

Can define exponential smoothing recursively: $\hat{Y}_{T+1} = \hat{Y}_T + \alpha(Y_T - \hat{Y}_T)$ . Often use $\alpha = 1 - \omega$ to make the formula even simpler. Since $|\omega| \lt 1$ , values in the remote past do not affect predictions to any great extent, and the procedure is relatively insensitive to the value of $\omega$ .

In order to get started we need a starting value for $\hat{Y}_1$ , eg. $Y_1$ , $\bar{Y}$ , or a backwards prediction from later values.
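A minimal R sketch of the recursive update; the function name, the value $\alpha = 0.2$ , and the LakeHuron example series are illustrative choices, not part of the notes:

```r
# Exponential smoothing via the recursive update
# yhat[t+1] = yhat[t] + alpha * (y[t] - yhat[t]), with alpha = 1 - omega
exp_smooth <- function(y, alpha, start = y[1]) {
  n <- length(y)
  yhat <- numeric(n + 1)
  yhat[1] <- start                      # starting value: Y_1, the sample mean, etc.
  for (t in 1:n) {
    yhat[t + 1] <- yhat[t] + alpha * (y[t] - yhat[t])
  }
  yhat[n + 1]                           # one-step-ahead prediction of Y_{T+1}
}

exp_smooth(as.numeric(LakeHuron), alpha = 0.2)
```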

Introductory theory

Assume that time series values we observe are realisations of random variables $Y_1, Y_2, ..., Y_T$ , which are in turn part of a larger stochastic process $\{ Y_t: t \in \mathbb{Z}\}$ .

Mean function: $\mu(t) = E[Y_t]$ . Autocovariance function: $\gamma(s,t) = cov(Y_s, Y_t)$ .

These are fundamental parameters and it would be useful to obtain estimates of them; however, in general there are $T + T(T+1) / 2$ parameters and only $T$ observations, so it is not possible to estimate them all. We’ll need to impose some constraints; the most common is stationarity.

Strict stationarity: for any $k \gt 0$ , any $t_1, \ldots, t_k \in \mathbb{Z}$ , and any shift $L$ , the distribution of $(Y_{t_1}, \ldots, Y_{t_k})$ is the same as that of $(Y_{t_1+L}, \ldots, Y_{t_k+L})$ . This implies that $\mu(t) = \mu(0)$ and $\gamma(s,t) = \gamma(s-t, 0)$ . It turns out that these two implications are enough for a reasonable amount of theory.

Weak stationarity: $E[Y_t^2] \lt \infty$ , $\mu(t) = \mu$ , and $\gamma(s,t) = \gamma(s-t, 0)$ , for all $s$ and $t$ .

The two notions are equivalent for Gaussian series. When the series is stationary the notation simplifies: $\mu = E[Y_t]$ , $\gamma(u) = cov(Y_t, Y_{t+u})$ ; we may also be interested in the autocorrelation $\rho(u) = \gamma(u) / \gamma(0)$ .

Hilbert spaces

(Vector space + inner product, completed to include limits of Cauchy sequences = Hilbert space; $Y_n \to Y$ means $E|Y - Y_n|^2 \to 0$ , i.e. mean-squared convergence.) This places many of the operator manipulations to follow on a rigorous basis.

Vector space composed of finite linear combinations $\sum_{i=1}^{N} c_i Y_{t_i}$ . Inner product = covariance. Denoted $\mathcal{H}$ .

Consider the lag operator, L, defined by $LY_t = Y_{t-1}$ . (Defined obviously for linear combinations). As well as being linear, the lag operator preserves inner products, thus it is a unitary operator.

Linear processes

The time series $Y_t$ , defined by $Y_t = \sum_{u=-\infty}^\infty \phi_u \epsilon_{t-u}$ , where $\epsilon_t$ is white noise, and $\sum_{u=-\infty}^\infty |\phi_u|^2 \lt \infty$ , is called a linear process.

Partial autocorrelation function

Given a stretch of time series values, the partial autocorrelation is the correlation between $Y_t$ and $Y_{t-u}$ not conveyed through the intervening values. If the Y-values are normally distributed then $\phi(u)$ can be defined as $cor(Y_t, Y_{t-u} \mid Y_{t-1},\ldots, Y_{t-u+1})$ . A more general approach is based on regression theory, from which we can find $\phi(u) = cor(Y_t - \hat{Y}_t, Y_{t-u} - \hat{Y}_{t-u})$ , where $\hat{Y}_t$ and $\hat{Y}_{t-u}$ are the best linear predictors based on the intervening values.

By convention $\phi(1) = \rho(1)$ . We can find $\phi(2)$ using the theory above and the best predictor of $Y_t$ based on $Y_{t-1}$ , namely $\rho(1) Y_{t-1}$ : $\phi(2) = \frac{\rho(2) - \rho(1)^2}{1 - \rho(1)^2}$ .

For a general AR (p), $\phi(u) = 0$ , $\forall u \gt p$ . For a general MA (q), $\phi(u)$ decays exponentially as $u \to \infty$ .

Computing the PACF

Although the definition above is conceptually simple, it is awkward to compute directly. This section presents an alternative form which is simple to compute.

Consider the $k$ th order autoregressive predictor of $Y_{k+1}$ , $\hat{Y}_{k+1} = \phi_{k1} Y_k + \cdots + \phi_{kk} Y_1$ , obtained by minimising $E(Y_{k+1} - \hat{Y}_{k+1})^2$ . We will show that $\phi(k) = \phi_{kk}$ .

Let $\mathcal{H}_n = span\{Y_1, Y_2, \ldots, Y_k\}$ , with associated projection $P_{\mathcal{H}_n}$ , $\mathcal{H}_1 = span\{Y_2, \ldots, Y_k\}$ , $\mathcal{H}_2 = span\{ Y_1 - P_{\mathcal{H}_1} Y_1\}$ .

Thus $\hat{Y}_{k+1} = P_{\mathcal{H}_n} Y_{k+1} = P_{\mathcal{H}_1} Y_{k+1} + P_{\mathcal{H}_2} Y_{k+1} = P_{\mathcal{H}_1} Y_{k+1} + a(Y_1 - P_{\mathcal{H}_1} Y_1)$ ; rearranging and equating coefficients with the first equation shows $a = \phi_{kk}$ .

The Durbin–Levinson algorithm calculates these coefficients recursively.
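A sketch of the Durbin–Levinson recursion in R, taking a vector rho with rho[u] $= \rho(u)$ and returning $\phi(1), \ldots, \phi(K)$ ; a textbook version for illustration only:

```r
# Durbin-Levinson recursion: partial autocorrelations from autocorrelations
durbin_levinson <- function(rho) {
  K <- length(rho)
  phi <- matrix(0, K, K)                # phi[k, j] stores phi_{kj}
  pacf <- numeric(K)
  phi[1, 1] <- pacf[1] <- rho[1]
  if (K > 1) for (k in 2:K) {
    j <- 1:(k - 1)
    phi[k, k] <- (rho[k] - sum(phi[k - 1, j] * rho[k - j])) /
                 (1 - sum(phi[k - 1, j] * rho[j]))
    phi[k, j] <- phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
    pacf[k] <- phi[k, k]
  }
  pacf
}

durbin_levinson(0.6^(1:5))   # AR(1) with phi = 0.6: gives 0.6, 0, 0, 0, 0
```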

ARMA Series

If $Y_t = \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + \epsilon_t$ where $\epsilon_t$ is white noise, then $Y_t$ is called an autoregressive series of order p, denoted AR(p).

Important because:

AR (1) series

An AR (1) series is defined by $Y_t = \phi Y_{t-1} + \epsilon_t$ , or $(1- \phi L)Y_t = \epsilon_t$ . $Y_{t-1}$ and $\epsilon_t$ are uncorrelated, so $var(Y_t) = \phi^2 var(Y_{t-1}) + \sigma^2_\epsilon$ , and under stationarity $\sigma_Y^2 = \phi^2 \sigma^2_Y + \sigma^2_\epsilon$ , so $\sigma_Y^2 = \sigma^2_\epsilon / (1 - \phi^2)$ , which requires $|\phi| \lt 1$ for stationarity.

Similarly, by inverting the autoregressive operator and applying it to $\{\epsilon_t\}$ we find that $Y_t = \sum_{u=0}^\infty \phi^u \epsilon_{t-u}$ , which converges in mean square because $\sum |\phi|^{2u} \lt \infty$ . An equivalent condition is that the root of $(1- \phi z)$ lies outside the unit circle.

If we multiply both sides of the defining equation by $Y_{t-u}$ (for $u \gt 0$ ) and take expectations: \[E[Y_t Y_{t-u}] = \phi E[Y_{t-1} Y_{t-u}] + E[\epsilon_t Y_{t-u}]\] The last term is zero, so $\gamma(u) = \phi \gamma(u-1)$ , hence $\gamma(u) = \phi^u \gamma(0)$ and $\rho(u) = \phi^u$ .
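A quick numerical check of $\rho(u) = \phi^u$ in R; the value $\phi = 0.7$ and the series length are arbitrary:

```r
# Simulate an AR(1) series and compare the sample ACF with phi^u
set.seed(1)
phi <- 0.7
y <- arima.sim(model = list(ar = phi), n = 2000)
r <- acf(y, lag.max = 5, plot = FALSE)$acf[-1]      # drop lag 0
round(rbind(sample = r, theory = phi^(1:5)), 3)
```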

MA (1) series

Model: $Y_t = \epsilon_t + \theta \epsilon_{t-1}$ , where $\epsilon_t$ is white noise. The autocovariance function is \[ \gamma(u) = \begin{cases} (1+\theta^2)\sigma^2_\epsilon & u = 0 \\ \theta \sigma^2_\epsilon & |u| = 1 \\ 0 & \text{otherwise.} \end{cases} \]

(Can see the invertibility problem here: if we replace $\theta$ by $1 / \theta$ and $\sigma^2_\epsilon$ by $\theta^2 \sigma^2_\epsilon$ the function is unchanged).
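A numerical illustration of the non-identifiability, using the formulas above; the values $\theta = 0.5$ and $\sigma^2_\epsilon = 1$ are arbitrary:

```r
# gamma(0) = (1 + theta^2) * sigma2, gamma(1) = theta * sigma2
ma1_gamma <- function(theta, sigma2) c(gamma0 = (1 + theta^2) * sigma2,
                                       gamma1 = theta * sigma2)

ma1_gamma(theta = 0.5, sigma2 = 1)            # 1.25 0.50
ma1_gamma(theta = 1 / 0.5, sigma2 = 0.5^2)    # 1.25 0.50 -- identical, hence the problem
```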

ARMA

A time series is ARMA if it can be written $\phi(L) Y_t = \theta(L) \epsilon_t$ . It is stationary if the roots of $\phi(z)$ lie outside the unit circle.

ARMA (1,1) series

Model: $Y_t = \phi Y_{t-1} + \epsilon_t + \theta \epsilon_{t-1}$

To derive the ACF note that $E(\epsilon_t Y_t) = \sigma^2_\epsilon$ and $E(\epsilon_{t-1} Y_t) = (\phi + \theta) \sigma^2_\epsilon$ . As above, multiply the model by $Y_{t-u}$ , take expectations, and solve the resulting recurrence to get:

\[ \gamma(u) = \begin{cases} \sigma^2_\epsilon \dfrac{1 + 2\phi\theta + \theta^2}{1 - \phi^2} & u = 0 \\ \sigma^2_\epsilon \dfrac{(1 + \phi\theta)(\phi + \theta)}{1 - \phi^2} & u = 1 \\ \phi \gamma(u-1) & u \ge 2 \end{cases} \]

Similar to AR (1) except for the first term.

ARMA (p,q) series

Model: $\phi(L)Y_t = \theta(L)\epsilon_t$ , or $Y_t = \psi(L) \epsilon_t$ , where $\psi(L) = \theta(L) / \phi(L)$ .

For lags greater than $q$ the ACF behaves like that of an AR (p) series, but the first $q$ terms exhibit additional structure.

Generalisations

ARIMA models

ARMA models are useful for stationary time series, but many real-life series are not stationary, so we need a way to generalise ARMA models. This is difficult to do with complete generality, but it can be done for certain types of non-stationarity (e.g. polynomial trends).

Can define a trend model explicitly, eg. $Y_t = \beta_0 + \beta_1 t + a_t$ , or implicitly by taking differences eg. $Y_t - Y_{t-1} = \beta_1 + a_t$ . Explicit model less variable, but not flexible enough in practice.

Can write $\bigtriangledown Y_t = \beta_1 + a_t$ , where $\bigtriangledown$ is the differencing operator, $1 - L$ . This suggests the generalisation $\bigtriangledown^d Y_t = \beta_1 + a_t$ , which will remove a polynomial trend of degree $d$ .
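A small R illustration of removing a linear trend by differencing; the coefficients and series length are arbitrary:

```r
# Y_t = beta0 + beta1 * t + a_t; differencing removes the trend
set.seed(2)
t <- 1:200
y <- 1 + 0.05 * t + rnorm(200)
dy <- diff(y)             # nabla Y_t = beta1 + a_t - a_{t-1}
mean(dy)                  # approximately beta1 = 0.05
# diff(y, differences = d) applies the operator d times, i.e. nabla^d
```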

Written: ARIMA (p, d, q)

SARIMA models

Many real-life time series show seasonal trends with periodicity $s$ . To keep number of parameters down usually assume seasonal effects are independent, so use multiplicative model

Written: SARIMA (p, d, q) $\times$ (P, D, Q) $_s$ . Model: $\bigtriangledown^d \bigtriangledown_s^D \phi(L) \Phi(L^s) Y_t = \theta(L) \Theta(L^s) \epsilon_t$

Estimation and identification

ACF

Do not usually know $\gamma(u)$ etc. and must estimate from the data, with $\hat{\gamma} (u) = \frac{1}{T} \sum_{t=1}^{T-u} (Y_{t+u} - \bar{Y})(Y_t - \bar{Y})$ . The estimator is biased (it uses $1 / T$ instead of $1 / (T-u)$ ) but is preferred because it gives a positive definite $\hat{\gamma}(u)$ , which ensures that estimated variances of linear combinations are never negative, and it has lower MSE for time series where $\gamma(u) \to 0$ as $u \to \infty$ . The estimated autocorrelation function is $r(u) = \hat{\gamma}(u) / \hat{\gamma}(0)$ .
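A sketch of the estimator in R, checked against the built-in acf() (which uses the same $1/T$ divisor); the LakeHuron series is just an example:

```r
# Sample autocovariance with the 1/T divisor
sample_gamma <- function(y, u) {
  n <- length(y); ybar <- mean(y)
  sum((y[(1 + u):n] - ybar) * (y[1:(n - u)] - ybar)) / n
}

y <- as.numeric(LakeHuron)
r1 <- sample_gamma(y, 1) / sample_gamma(y, 0)
c(by_hand = r1, acf = acf(y, plot = FALSE)$acf[2])   # should agree
```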

To estimate confidence intervals etc. we need some distribution theory (from Anderson, 1971). If $Y_t = \mu + \sum \psi_u \epsilon_{t-u}$ , $\sum |\psi_u| \lt \infty$ , and $\sum u \psi_u^2 \lt \infty$ , then $r(u) \sim AN(0, c_{uu} / T)$ and $cor(r(u), r(v)) = c_{uv} / \sqrt{c_{uu} c_{vv}}$ , where $c_{uv}$ is given by Bartlett's formula, $c_{uv} = \sum_{k=1}^\infty (\rho(k + u) + \rho(k - u) - 2\rho(k)\rho(u))(\rho(k + v) + \rho(k - v) - 2\rho(k)\rho(v))$ .

White noise

If $Y_t$ are iid: $var(r(u)) \approx 1 / T$ , $cor(r(u), r(v)) \approx 0$ .

AR (1) series

In this case $\rho(u) = \phi^u$ for $u \gt 0$ , so $var(r(1)) = \frac{1}{T}(1 - \phi^2)$ , and for large $u$ , $var(r(u)) \approx \frac{1}{T}\frac{1 + \phi^2}{1-\phi^2}$ . When $|\phi|$ is near 1, $r(u)$ is accurate at low lags and inaccurate at high lags.

General MA (q) series

“Easy” to see that $c_{uu} = 1 + 2 \sum_{v=1}^q \rho(v)^2$ for $u \gt q$ .

PACF

Use the Durbin–Levinson recursion on the estimated ACF.

For confidence intervals use Quenouille (1949), which states that if the true model is AR (p) then for $u \gt p$ , $\hat{\phi}(u) \sim AN(0, 1 / T)$ , and so we can use $\pm 2 / \sqrt{T}$ as approximate 95% limits.
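A sketch in R of the estimated PACF against the $\pm 2 / \sqrt{T}$ limits; the AR(2) coefficients are arbitrary:

```r
# Sample PACF compared with the Quenouille limits
set.seed(3)
y <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 500)
p <- pacf(y, lag.max = 10, plot = FALSE)$acf
limit <- 2 / sqrt(length(y))
which(abs(p) > limit)      # for an AR(2), usually only lags 1 and 2 exceed the limit
```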

System identification

Using above results:

Very slow decay in the ACF indicates non-stationarity

Fitting and forecasting

Once we’ve identified a particular model we need to estimate the parameters and assess how well the model fits. Fitting is usually carried out by maximum likelihood, $\min_{\phi, \theta, \sigma} (-\log f(Y_1, \ldots, Y_T \mid \phi, \theta, \sigma))$ , evaluated through a recursive process known as Kalman filtering. Residuals are calculated using one-step-ahead predictions. These are uncorrelated by construction, and will be iid normal if $\epsilon_t \sim NID(0, \sigma^2_\epsilon)$ .

Calculated in R, using z <- arima(y, order=c(p,d,q), seasonal=c(P,D,Q)).

Assessing quality of fit

The usual way of assessing fit is to examine the residuals. It is obvious what to do for an AR (p) model, but for a general ARMA (p,q) we would have to invert the MA characteristic polynomial; in practice the residuals are obtained as a byproduct of the prediction process. If the ARMA model is correct they should be $\sim NID(0, \sigma^2_\epsilon)$ .

Need to check: no remaining autocorrelation in the residuals (ACF, portmanteau test), approximate normality (QQ plot), and constant variance over time; a sketch of these checks in R follows.
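The LakeHuron series and the (1, 0, 1) order below are placeholders for whatever model was identified:

```r
z <- arima(LakeHuron, order = c(1, 0, 1))
res <- residuals(z)
acf(res)                                                  # remaining autocorrelation?
Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 2)    # portmanteau test, fitdf = p + q
qqnorm(res); qqline(res)                                  # approximate normality?
tsdiag(z)                                                 # standard diagnostic plots in one call
```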

Forecasting

Could use one-step-ahead prediction, using previous predictions to extend out further, but in practice forecasts again come as a byproduct of Kalman filtering.

In R: predict(z, n.ahead).
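For example (refitting an illustrative model so the snippet stands alone, and using $\pm 2$ standard errors as rough limits):

```r
z <- arima(LakeHuron, order = c(1, 0, 1))
fc <- predict(z, n.ahead = 10)
cbind(forecast = fc$pred,
      lower = fc$pred - 2 * fc$se,
      upper = fc$pred + 2 * fc$se)
```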

Practical process

  1. Inspect plot. If mean-variance relationship, use variance stabilising transformation if necessary: square root (for count data, or mild relationship) or log (for strong relationship)
  2. If clearly non-stationary difference, otherwise inspect ACF for very slow decay. Difference until this is gone. Check at seasonal intervals as well
  3. Plot ACF and PACF for differenced series. Identify as AR, MA, or ARMA. If AR or MA get p or q from cutoff. Check seasonal lags as well. In ARMA first try p=1 and q=1.
  4. Estimate parameters.
  5. If any non-significant reduce model and estimate again.
  6. Plot diagnostics (if data is seasonal make sure to extend for multiple seasonal lags). If problems try adding extra terms to model.
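A compressed pass through these steps in R, using the built-in AirPassengers series; the (0, 1, 1) $\times$ (0, 1, 1) orders with $s = 12$ are the classic “airline model” choice, used here purely as an illustration:

```r
y <- log(AirPassengers)                       # step 1: log for a strong mean-variance relationship
acf(y)                                        # step 2: very slow decay, so difference
dy <- diff(diff(y), lag = 12)                 # regular + seasonal (s = 12) differencing
acf(dy); pacf(dy)                             # step 3: identify p, q, P, Q
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))   # steps 4-5
tsdiag(fit, gof.lag = 36)                     # step 6: diagnostics over several seasonal lags
```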

Frequency domain analysis

Background

$e^{i\theta} = \cos \theta + i \sin \theta$ . Can use this to prove basic trig identities, and can invert to get $\cos \theta = \frac{e^{i\theta}+e^{-i\theta}}{2}$ and $\sin \theta = \frac{e^{i\theta} - e^{-i\theta}}{2i}$

Sum of geometric series: $\frac{a(1 - r^n)}{1-r}$ .

General cosinusoid: $f(t) = a \cos(\lambda t + \phi)$ ; when $t$ takes integer values it makes sense to restrict $\lambda$ to $[0, 2\pi]$ , because of the aliasing problem

Frequency = Angular frequency / $2\pi$ , usually use angular for theory (no extra $2\pi$ s) and frequency for practice (more interpretable).

We will be considering situations that are in some sense time invariant. One definition is $f(t + u) = f(t)$ , but this is too restrictive and the only solution is a constant. A less restrictive condition is $f(t+u) = C_u f(t)$ , which has general bounded solution $f(t) = Ae^{i \lambda t}$ . This extends to sums of functions of this form.

Filters and filtering

A filter takes one time series and outputs another: $Y(t) = \mathcal{A}X(t)$ . A filter is linear if $ \mathcal{A} [\alpha X + \beta Y](t) = \alpha \mathcal{A}[X](t) + \beta \mathcal{A}[Y](t)$ , and time invariant if $\mathcal{A}[L^u X](t) = L^u \mathcal{A}[X](t)$ . An important class of linear, time-invariant filters can be written $\mathcal{A}[X](t) = \sum_{u=-\infty}^\infty a(u) X(t-u)$ where the $a(u)$ ’s are called filter coefficients.
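For example, a 5-point moving average is a filter of this form, with $a(u) = 1/5$ for $|u| \le 2$ and 0 otherwise; a sketch in R (LakeHuron is just an example series):

```r
x <- as.numeric(LakeHuron)
y5 <- stats::filter(x, rep(1 / 5, 5), method = "convolution", sides = 2)   # sum_u a(u) x(t-u)
plot(x, type = "l"); lines(as.numeric(y5), col = "red")                    # original vs smoothed
```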

Transfer functions

Filter coefficients provide complete description but may not provide much intuition about effect of the filter. Another way of investigating a filter is to look at its effects on complex exponentials. For convenience define $E^\lambda (t) = e^{i \lambda t}$ .

Clearly, $L^u E^\lambda (t) = e^{i \lambda (t-u)} = e^{-i \lambda u} E^\lambda (t)$ , and by linearity and time-invariance $\mathcal{A}[E^\lambda](t+u) = e^{i \lambda u}\mathcal{A}[E^\lambda](t)$ ; setting $t=0$ gives $\mathcal{A}[E^\lambda](u) = e^{i \lambda u}\mathcal{A}[E^\lambda](0)$ . The function $A(\lambda) = \mathcal{A}[E^\lambda](0)$ is called the transfer function, so that $\mathcal{A}[E^\lambda](t) = A(\lambda) e^{i \lambda t}$ : linear time-invariant filtering of a complex exponential produces a constant multiple of a complex exponential with the same frequency.

Note for any integer $k$ , $E^{\lambda + 2\pi k} = E^\lambda$ . In general transfer functions are complex-valued, and can often be useful to write in their polar form $A(\lambda) = G(\lambda)e^{i \phi(\lambda)}$ , where $G(\lambda)$ is the gain function, and $\phi(\lambda)$ the phase function.

The filter acts similarly on sines and cosines, and on sums of complex exponentials.

Computing transfer functions

If the filter coefficients satisfy $\sum_{u=-\infty}^\infty |a(u)| \lt \infty$ , then the transfer function is given by $A(\lambda) = \sum_{u=-\infty}^\infty a(u) e^{-i \lambda u}$ . In mathematical terms, $A(\lambda)$ is the discrete Fourier transform of the filter coefficients.
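A numerical sketch for the 5-point moving average used above; the chosen frequencies are arbitrary:

```r
# A(lambda) = sum_u a(u) * exp(-1i * lambda * u) for the 5-point moving average
a <- rep(1 / 5, 5); u <- -2:2
A <- function(lambda) sum(a * exp(-1i * lambda * u))
lambda <- c(0, pi / 4, pi / 2, 3 * pi / 4, pi)
sapply(lambda, function(l) Mod(A(l)))   # gain: 1 at lambda = 0, smaller at higher frequencies
```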

Sequential filtering

If $\mathcal{C}[X](t) = \mathcal{A}[\mathcal{B}[X]](t)$ , then $C(\lambda) = A(\lambda)B(\lambda)$

Spectral theory

Suppose $X(t)$ is a stationary time series with autocovariance function $\gamma(u)$ , which satisfies $\sum_{u=-\infty}^\infty |\gamma(u)| \lt \infty$ ; then we define the power spectrum of $X(t)$ to be:

\[f_{XX} (\lambda) = \frac{1}{2\pi} \sum_{u=-\infty}^\infty \gamma(u) e^{-i \lambda u}\]

Can be inverted to show:

\[\gamma (u) = \int^{2\pi}_0 e^{i \lambda u} f_{XX} (\lambda) d\lambda\]

This is the spectral decomposition of the autocovariance function. Since $\gamma(0) = var[X(t)]$ , $var[X(t)] = \int^{2\pi}_0 f_{XX} (\lambda) d\lambda$ , which shows the variability in $X(t)$ broken down by frequency. To understand this we need to show that the time series itself can be decomposed into independent frequency components.
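A small worked example, assuming the MA(1) autocovariances derived earlier: only $\gamma(0)$ and $\gamma(\pm 1)$ are non-zero, so $f_{XX}(\lambda) = \frac{\sigma^2_\epsilon}{2\pi}(1 + \theta^2 + 2\theta\cos\lambda)$ , and integrating this over $[0, 2\pi]$ recovers $\gamma(0)$ . A quick numerical check in R (the values of $\theta$ and $\sigma^2_\epsilon$ are arbitrary):

```r
theta <- 0.5; sigma2 <- 1
f <- function(lambda) (sigma2 / (2 * pi)) * (1 + theta^2 + 2 * theta * cos(lambda))
integrate(f, 0, 2 * pi)$value    # 1.25 = (1 + theta^2) * sigma2 = gamma(0)
```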

The Cramér representation

Stieltjes integral: $\int_a^b g(x) dF(x) = \lim \sum g(x_i) [F(x_{i+1}) - F(x_i)]$ . When $F(x)$ is differentiable with $F'(x) = f(x)$ , this reduces to $\int_a^b g(x)f(x) dx$ . When $F(x)$ is a step function with a step of height $c_i$ at $u_i$ , it reduces to $\sum_i c_i g(u_i)$ . The Stieltjes integral thus unifies summation and integration.

To state the Cramér representation we need to extend the Stieltjes integral to deal with stochastic processes.

A stochastic process $Z(\lambda)$ on the interval $[0, 2\pi]$ is an indexed set of random variables, assigning a random variable to each $\lambda \in [0, 2\pi]$ . It is said to have uncorrelated increments if for $\lambda_1 \lt \lambda_2 \lt \lambda_3 \lt \lambda_4$ , $cor(Z(\lambda_2)-Z(\lambda_1), Z(\lambda_4) - Z(\lambda_3)) = 0$ .

Possible to define an integral against a stochastic process:

\[ \int_0^{2\pi} \phi(\lambda) dZ(\lambda) = lim \sum_{n=0}^{N-1} \phi\left(\frac{2\pi n}{N}\right) \left[ Z\left( \frac{2\pi (n +1)}{N}\right) - Z \left( \frac{2\pi n}{N} \right) \right] \]

where $Z_X(\lambda)$ is defined as follows: