Poisson regression

The response is a count with a Poisson distribution and mean $\mu$: $\log(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$. Parameters are estimated by maximum likelihood (ML) using IRLS (iteratively reweighted least squares).

In R: glm(response ~ variables, family=poisson, weights=weight)

Count data are often collected as rates, i.e. the number of events occurring in a given population. We could use a binomial model, but more commonly we use an offset. $\mu = \text{rate} \times \text{size}$, so $\log(\mu) = \log(\text{rate}) + \log(\text{size})$. $\log(\text{size})$ enters the model as just another variable, but its regression coefficient is fixed at 1. In R: glm(response ~ variables, family=poisson, offset=log(population))
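As a sketch of the offset idea, using simulated data (all variable names here are illustrative, not from the notes): the fitted coefficients should recover the parameters of the per-person event rate, while the population size enters only through the offset.

```r
# Sketch: Poisson regression with an offset for population size.
# Simulated data: the true rate per person is exp(-6 + 0.5 * x).
set.seed(1)
n   <- 200
pop <- sample(1000:10000, n, replace = TRUE)  # population size per unit
x   <- rnorm(n)                               # a covariate
y   <- rpois(n, lambda = exp(-6 + 0.5 * x) * pop)  # observed event counts

fit <- glm(y ~ x, family = poisson, offset = log(pop))
coef(fit)  # intercept near -6, slope near 0.5; log(pop) coefficient fixed at 1
```

Note that log(pop) contributes to the linear predictor but appears nowhere in coef(fit): its coefficient is constrained to 1 rather than estimated.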

Contingency tables

Arise when we classify individuals into categories using one or more criteria. Two sampling models are used for contingency tables: multinomial and Poisson.

Multinomial model

Assume: the table has $m$ cells, $n$ individuals are classified independently with probability $\pi_i$ of falling in cell $i$, $\sum_i \pi_i = 1$, and $Y_i$ is the number of individuals in cell $i$.

Then $(Y_1, \ldots, Y_m)$ has a multinomial distribution: $P(Y_1 = y_1, \ldots, Y_m = y_m) = \frac{n!}{y_1! y_2! \cdots y_m!}\pi_1^{y_1}\cdots\pi_m^{y_m}$.
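The pmf above can be checked directly against R's built-in dmultinom (the counts and probabilities below are arbitrary illustrative values):

```r
# Check the multinomial pmf formula against R's built-in dmultinom.
y  <- c(3, 5, 2)             # cell counts, n = 10
pi <- c(0.2, 0.5, 0.3)       # cell probabilities, sum to 1
n  <- sum(y)

manual  <- factorial(n) / prod(factorial(y)) * prod(pi^y)
builtin <- dmultinom(y, prob = pi)
all.equal(manual, builtin)   # TRUE
```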

If there are no restrictions on the $\pi$'s, then the log-likelihood ($\sum_i y_i \log(\pi_i) + \text{const}$) is maximised when $\hat\pi_i = y_i / n$. We can test this maximal model against a more constrained model using the usual $\chi^2$ likelihood ratio test.
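A quick numerical check of the maximiser (the counts and the alternative probability vector are arbitrary): the observed proportions beat any other probability vector.

```r
# Numerical check that pi_i = y_i / n maximises the multinomial log-likelihood.
y <- c(12, 30, 8)
n <- sum(y)
loglik <- function(p) sum(y * log(p))   # up to the constant log(n!/prod(y_i!))

pi_hat <- y / n
other  <- c(0.3, 0.5, 0.2)              # any other probability vector
loglik(pi_hat) > loglik(other)          # TRUE
```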

Poisson model

Assume: each cell count is independently Poisson($\mu_i$); this is equivalent to a Poisson regression model. Conditional on the total count, the cell counts are multinomial with $\pi_i = \mu_i / \sum_i \mu_i$, and the maximal and null likelihoods are the same as for the multinomial – the models are equivalent! So we can estimate multinomial parameters by fitting a Poisson model, and test hypotheses about the multinomial by testing the equivalent Poisson hypotheses.

For example, we can test for independence of the classifying factors by looking at the interaction term in the Poisson model.
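A sketch of this with an illustrative 2×2 table (the counts are made up): the likelihood ratio test of the A:B interaction in the Poisson model is exactly the usual $G^2$ test of independence on the table.

```r
# Sketch: test independence of two factors via a Poisson log-linear model.
counts <- c(20, 30, 25, 25)                  # illustrative 2x2 table, by row
tab <- data.frame(
  A = factor(c("a1", "a1", "a2", "a2")),
  B = factor(c("b1", "b2", "b1", "b2")),
  y = counts
)
full  <- glm(y ~ A * B, family = poisson, data = tab)  # saturated model
indep <- glm(y ~ A + B, family = poisson, data = tab)  # independence model
anova(indep, full, test = "Chisq")  # LR test of the A:B interaction
```

The deviance of the independence model equals $2\sum_i o_i \log(o_i/e_i)$ with $e_i$ the usual row-total × column-total / n expected counts.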

Product multinomial sampling model

So far we have assumed sampling at random from a single population: all factors were “responses”, and we were examining their joint distribution. Sometimes we instead want to think of one factor (A) as representing different populations, and to compare the distribution of another factor (B) between those populations, i.e. to look at the conditional distribution of B given A.

Can compare the conditional distributions visually using a bar or dot plot.
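In R, the conditional distributions to plot are just the row proportions of the table (the counts below are illustrative):

```r
# Conditional distribution of B given A: row proportions of the table.
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(A = c("pop1", "pop2"), B = c("b1", "b2")))
cond <- prop.table(tab, margin = 1)   # each row sums to 1
cond
# barplot(t(cond), beside = TRUE)     # visual comparison across populations
```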

In general, populations may be defined as level combinations of two or more “conditioning factors” (C & D), and we want to compare the joint distribution of the “response factors” (A & B) across them. That is, we want to check whether the joint distributions of A & B conditional on C & D are the same. We do this by comparing the maximal model to the model in which (A, B) is jointly independent of (C, D), i.e. the log-linear model A*B + C*D.
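As a sketch with four illustrative binary factors and made-up counts (a real analysis would use counts from the study design):

```r
# Sketch: does the joint distribution of (A, B) depend on (C, D)?
# Compare the saturated log-linear model to the model A*B + C*D.
set.seed(3)
d <- expand.grid(A = factor(1:2), B = factor(1:2),
                 C = factor(1:2), D = factor(1:2))
d$y <- rpois(nrow(d), lambda = 20)    # made-up cell counts

sat  <- glm(y ~ A * B * C * D, family = poisson, data = d)  # maximal model
null <- glm(y ~ A * B + C * D, family = poisson, data = d)  # homogeneity model
anova(null, sat, test = "Chisq")  # LR test: is (A, B) the same across (C, D)?
```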

General solution

This is an example of a more general type of problem: does our data fit a certain distribution? Basic solution: create a contingency table comparing observed and expected counts under the distributional hypothesis, then use a likelihood ratio (LR) test.

eg. $G^2 = 2\left(\sum_i o_i \log(o_i) - \sum_i o_i \log(e_i)\right) = 2\sum_i o_i \log(o_i / e_i)$, where $o_i$ and $e_i$ are the observed and expected counts.
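A small sketch of this test, using made-up counts for a fair six-sided die as the hypothesised distribution:

```r
# Sketch: LR goodness-of-fit test of observed counts against expected counts
# under a hypothesised distribution (here, a fair six-sided die).
o <- c(18, 22, 21, 19, 25, 15)        # observed counts (made up)
e <- rep(sum(o) / 6, 6)               # expected under the uniform hypothesis
G2 <- 2 * sum(o * log(o / e))         # LR statistic
pval <- pchisq(G2, df = length(o) - 1, lower.tail = FALSE)
c(G2 = G2, p = pval)
```

The degrees of freedom are (number of cells − 1) minus any parameters estimated from the data; here nothing is estimated, so df = 5.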

Related topics

Association reversal

Regression coefficients can change sign when new variables are added to a model. A similar thing happens in contingency tables when collapsing over variables; this is known as Simpson’s paradox, or association reversal. It is dangerous to collapse over variables strongly related to the remaining variables; better to look at conditional tables rather than marginal (collapsed) tables.
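The classic illustration is R's built-in UCBAdmissions data (in the datasets package): collapsed over department, men appear to be admitted at a higher rate, but the conditional (per-department) tables tell a different story.

```r
# Simpson's paradox in the built-in UCBAdmissions table (Admit x Gender x Dept).
# Marginal table, collapsed over Dept:
marg <- margin.table(UCBAdmissions, c(1, 2))
prop.table(marg, margin = 2)            # admission rate by gender, collapsed

# Conditional tables: admission rate by gender, within each department.
prop.table(UCBAdmissions, margin = c(2, 3))[1, , ]
```

Marginally, the male admission rate is higher; within department A (for example), the female rate is higher.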