https://www.ucl.ac.uk/lifesciences-faculty-php/courses/viewcourse.php?coursecode=PSYCGR01

This course provides a thorough introduction to the General Linear Model, which incorporates analyses such as multiple regression, ANOVA, ANCOVA and repeated-measures ANOVA. We will also cover extensions to linear mixed-effects models and logistic regression. All techniques will be discussed within a general framework of building and comparing statistical models. Practical experience in applying the methods will be developed through exercises with the statistics package SPSS.

## Lecture 1

Ignore cookbook approach, do model comparison.

General linear model.

Inference as attempted generalization from sample to population (non-Bayesian?).

Want estimators to be:

• Unbiased - expected value is true value
• Consistent - estimate converges to true value (variance shrinks towards 0) as sample size increases
• Efficient - smallest variance out of all unbiased estimators

Efficient estimators:

• Count of errors -> mode
• Sum of absolute errors -> median
• Sum of squared errors -> mean
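A minimal sketch of the correspondence above, assuming a one-parameter model that predicts the same value for every observation. The data and helper names (`count_errors`, `best`, ...) are made up for illustration; the grid search is just a brute-force check, not how you would estimate anything in practice.

```python
from statistics import mean, median, mode

y = [1, 2, 2, 3, 10]  # made-up sample with one outlier

def count_errors(c):
    """Number of observations not predicted exactly by c."""
    return sum(1 for v in y if v != c)

def sum_abs_errors(c):
    return sum(abs(v - c) for v in y)

def sum_sq_errors(c):
    return sum((v - c) ** 2 for v in y)

# Candidate predictions 0.0, 0.1, ..., 10.0
candidates = [i / 10 for i in range(0, 101)]

def best(loss):
    """Candidate with the smallest total loss."""
    return min(candidates, key=loss)

print(best(count_errors), mode(y))      # count of errors    -> mode
print(best(sum_abs_errors), median(y))  # sum of abs errors  -> median
print(best(sum_sq_errors), mean(y))     # sum of sq errors   -> mean
```

Note how the outlier pulls the mean but not the median or mode.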

$MSE = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - p}$.

TODO Why degrees of freedom?

Review:

• What is inference?
• Three desirable properties of estimators.

Exercises

## Lecture 2

Model is model of population (which implies that we can include sampling method in inference if we think we can accurately model the bias).

Sum of squares reduced $SSR = \operatorname{SSE}(C) - \operatorname{SSE}(A)$

Proportional reduction in error $PRE = \frac{\operatorname{SSE}(C) - \operatorname{SSE}(A)}{\operatorname{SSE}(C)}$. On population is usually denoted $\eta^2$.

F-score for GLM: $F = \frac{\mathrm{PRE} / (\mathrm{PA} - \mathrm{PC})}{(1-\mathrm{PRE})/(n - \mathrm{PA})} \sim F(\mathrm{PA} - \mathrm{PC}, n - \mathrm{PA})$
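A small sketch of the SSR / PRE / F definitions above, assuming model C is the intercept-only model (PC = 1) and model A adds a single slope (PA = 2). The data and function names (`fit_line`, `sse`, `compare`) are hypothetical, and the p-value lookup against the F distribution is left out.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def sse(y, predictions):
    """Sum of squared errors."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))

def compare(sse_c, sse_a, pc, pa, n):
    """Return (SSR, PRE, F) for compact model C vs augmented model A."""
    ssr = sse_c - sse_a
    pre = ssr / sse_c
    f = (pre / (pa - pc)) / ((1 - pre) / (n - pa))
    return ssr, pre, f

x = [1, 2, 3, 4, 5, 6]                      # made-up data
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(y)

mean_y = sum(y) / n
sse_c = sse(y, [mean_y] * n)                # model C: Y = b0 + e
b0, b1 = fit_line(x, y)
sse_a = sse(y, [b0 + b1 * xi for xi in x])  # model A: Y = b0 + b1 X + e

ssr, pre, f = compare(sse_c, sse_a, pc=1, pa=2, n=n)
mse_a = sse_a / (n - 2)                     # MSE of model A
print(ssr, pre, f, mse_a)
```

The same F can be written as $\frac{\mathrm{SSR} / (\mathrm{PA} - \mathrm{PC})}{\mathrm{SSE}(A) / (n - \mathrm{PA})}$, ie SSR per extra parameter over the MSE of model A.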

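F-test: reject null if $P_\mathrm{null}(F > F_\mathrm{observed}) < \alpha$. Fixes the Type 1 error rate, $P_\mathrm{null}(\mathrm{Type1}) = \alpha$. The Type 2 error rate is not fixed: it is defined when the null is false and depends on the real effect size (and on $n$), which produces a tradeoff curve between $P(\mathrm{Type2})$ and real effect size.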

95% confidence interval of estimate = interval constructed so that, on 95% of samples, it contains the true population value. Equivalent to the test: reject null if the $(1-\alpha)$ confidence interval does not contain the null value.
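The "on 95% of samples" reading can be checked by simulation. This is only a sketch under simplifying assumptions: normal data, an interval for the mean, and $\sigma$ treated as known (so a z-interval rather than the F/t-based interval used in the course). The function name `coverage` and all parameter values are made up.

```python
import random
from statistics import NormalDist

def coverage(true_mean=10.0, sigma=2.0, n=25, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated samples whose (1 - alpha) interval contains true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    half_width = z * sigma / n ** 0.5        # sigma treated as known (assumption)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        # The interval sample_mean +/- half_width contains true_mean iff:
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / trials

print(coverage())  # should land close to 0.95
```

The probability statement is about the procedure over repeated samples, not about any single computed interval.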

Review:

• Define F-score.
• Define F-test.
• Define confidence interval.

Exercises

## Lecture 3

Multiple regression.

Test for unique effect of $X_i$ by comparing with model where $\beta_i=0$.

Omnibus test - testing multiple parameters at once. Prefer tests where $PA - PC = 1$ - easier to interpret success/failure.

$R^2$ - squared multiple correlation coefficient - ‘coefficient of determination’ - ‘proportion of variance explained’ - PRE of model over $Y_i = \beta_0 + \epsilon_i$.

$\eta^2$ - true value of PRE in population. Unbiased estimate $\hat{\eta}^2 = 1 - \frac{(1 - \mathrm{PRE})(n - \mathrm{PC})}{n - \mathrm{PA}}$.
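A quick worked example of the adjustment above. The function name and the numbers are made up for illustration.

```python
def adjusted_eta_squared(pre, n, pc, pa):
    """Adjusted estimate of eta^2 from the sample PRE, as in the formula above."""
    return 1 - (1 - pre) * (n - pc) / (n - pa)

# Hypothetical case: PRE = .30 from n = 50, comparing PC = 1 against PA = 3
example = adjusted_eta_squared(0.30, n=50, pc=1, pa=3)
print(example)

# With no reduction in error at all the adjusted value goes below zero
print(adjusted_eta_squared(0.0, n=50, pc=1, pa=3))
```

The adjustment always shrinks PRE when PA > PC, and shrinks it more when $n$ is small relative to the number of parameters.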

Conventionally:

• Small effect $\eta^2=.03$
• Medium effect $\eta^2=.13$
• Large effect $\eta^2=.26$

$1-\alpha$ confidence interval for slope $b_j \pm \sqrt{\frac{F_{1,n-p;\alpha}\mathrm{MSE}}{(n-1)S^2_{X_j}(1-R^2_j)}}$ where:

• $\mathrm{MSE} = \frac{\mathrm{SSE}}{n-p}$
• Sample variance $S^2_{X_j} = \frac{\sum_{i=1}^n(X_{j,i} - \bar{X}_j)^2}{n-1}$
• $R^2_j$ is PRE of model $X_{j,i} = b_0 + \sum_{k \neq j} b_k X_{k,i} + e_i$ vs model $X_{j,i}=b_0 + e_i$ (proportion of variance of $X_j$ that can be explained by other predictors)

$(1 - R^2_j)$ also called tolerance - how uniquely useful is $X_j$

Model search:

• Enter - add variables in blocks
• Backwards - start with all, keep removing worst until PRE becomes significant
• Stepwise - forwards (start with none, keep adding best) but may also remove parameters that fall beneath some threshold

Better to rely on theory

Note, for null model $Y_i = b_0 + \epsilon_i$ we get $SSE = (n - 1)\operatorname{Var}(Y_i)$
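A quick numeric check of this identity on made-up data, taking $\operatorname{Var}$ to be the sample variance (the $n-1$ version, which is what `statistics.variance` computes).

```python
from statistics import mean, variance

y = [3.0, 5.0, 4.0, 8.0, 10.0]  # made-up data
n = len(y)

b0 = mean(y)                                # least-squares estimate for the null model
sse_null = sum((yi - b0) ** 2 for yi in y)  # SSE of Y_i = b0 + e_i

print(sse_null, (n - 1) * variance(y))      # the two should agree
```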

Exercises

## Lecture 4

GLM assumptions:

1. Normality - $\epsilon_i \sim Normal$
• Biased predictions
2. Unbiasedness - $\epsilon_i$ has mean 0
• Biased test results
3. Homoscedasticity - $\epsilon_i$ has constant variance (per i)
• Unbiased parameter estimates (?)
• Biased test results
4. Independence - $\epsilon_i$ are pairwise independent
• Model mis-specification

Histogram of residuals should be roughly normal (1).

Should be no relationship in residual vs predicted graph (2,3).

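Quantile-quantile plot - $Y_i$ vs $Q_i$ where $Q_i$ is the normal quantile s.t. $P_\mathrm{Normal}(Y \leq Q_i) = \hat{p}_i$, with $\hat{p}_i \approx P(Y \leq Y_i)$ the empirical cumulative proportion, ie sample quantiles vs quantiles of a normal distribution. If $Y_i$ are normal then the points should fall roughly on a straight line.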
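A sketch of how the points of a quantile-quantile plot can be computed (plotting itself is left out). Assumptions: $\hat{p}_i = (i + 0.5)/n$ for the $i$-th sorted value (0-based) is one common choice of plotting position among several, and the reference normal uses the sample mean and standard deviation. The residuals are made up.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted value with the normal quantile at its estimated cumulative probability."""
    ordered = sorted(values)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # (i + 0.5) / n estimates P(Y <= ordered[i]); other conventions exist.
    return [(reference.inv_cdf((i + 0.5) / n), v) for i, v in enumerate(ordered)]

residuals = [-1.8, -0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 0.7, 1.1, 1.7]  # made-up
for q, v in qq_points(residuals):
    print(round(q, 2), v)  # roughly equal pairs suggest approximate normality
```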

Shapiro-Wilk or Kolmogorov-Smirnov tests for normality.

Breusch-Pagan or Koenker or Levene test for homoscedasticity.

Randomized control or sequential dependence test for independence.

Transform dependent variables to achieve 1,3. Transform predictor to achieve 2.

Outlier detection:

• Mahalanobis distance - distance of data point from center
• Leverage - weight of data point in parameter estimate
• Studentized deleted residual - residual of a data point under the model fitted without that point, divided by its standard error
• Cook’s distance - does omission of a data point change model predictions

Outlier tests run on all data points, so need multiple comparison correction.

Multicollinearity - as $R^2_j \to 1$ the width of the confidence interval $\to \infty$. Detection:

• Tolerance or variance inflation factor
• Correlation matrix
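A minimal sketch of tolerance and variance inflation factor ($\mathrm{VIF} = 1 / \text{tolerance}$), assuming only two predictors so that $R^2_j$ is just the squared correlation between them. With more predictors $R^2_j$ needs a full regression of $X_j$ on the others. Data and function names are made up.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((y - mb) ** 2 for y in b)
    return cov / (ss_a * ss_b) ** 0.5

def tolerance_and_vif(xj, xk):
    """Tolerance 1 - R^2_j and VIF for X_j when X_k is the only other predictor."""
    r2_j = pearson_r(xj, xk) ** 2
    tolerance = 1 - r2_j
    return tolerance, 1 / tolerance

x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.1]  # nearly a copy of x1
x3 = [5, 1, 4, 2, 8, 3, 7, 6]                  # only loosely related to x1

print(tolerance_and_vif(x1, x2))  # low tolerance, large VIF
print(tolerance_and_vif(x1, x3))  # tolerance nearer 1, VIF nearer 1
```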

Partial correlation between $Y$ and $X_i$ is $\operatorname{sign}(\beta_i) \sqrt{\operatorname{PRE}(M, M-X_i)}$, where $\operatorname{PRE}(M, M-X_i) = \frac{\operatorname{PRE}(M, \mathrm{NULL}) - \operatorname{PRE}(M - X_i, \mathrm{NULL})}{1 - \operatorname{PRE}(M - X_i, \mathrm{NULL})}$

Exercises

## Lecture 5

Moderation

• Effect of $X_1$ varies depending on value of $X_2$
• Fit $Y \sim \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$
• Formula for confidence interval is same as simple model
• Center predictors for moderation
• Easier to interpret
• Reduces redundancy between $X_1$ and $X_1 X_2$ but does not change confidence interval of $\beta_3$, as long as we have simple parameters ($\beta_1$ and $\beta_2$)
• This is true of any linear change to parameters
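A sketch of the centering claim on simulated data: the interaction slope is the same with raw or centered predictors, while the simple slope of $X_1$ changes (it becomes the slope at the mean of $X_2$ instead of at $X_2 = 0$). The `ols` helper is a bare-bones normal-equations solver written for this example, not a library routine, and it only checks point estimates, not confidence intervals.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def fit_moderation(x1, x2, y):
    """Coefficients [b0, b1, b2, b3] of Y = b0 + b1 X1 + b2 X2 + b3 X1 X2."""
    X = [[1.0, a, b, a * b] for a, b in zip(x1, x2)]
    return ols(X, y)

rng = random.Random(1)                      # made-up simulated data
x1 = [rng.gauss(5, 2) for _ in range(40)]
x2 = [rng.gauss(3, 1) for _ in range(40)]
y = [1 + 2 * a - b + 0.5 * a * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

m1, m2 = sum(x1) / 40, sum(x2) / 40
raw = fit_moderation(x1, x2, y)
centered = fit_moderation([a - m1 for a in x1], [b - m2 for b in x2], y)

print(raw[3], centered[3])  # interaction slope: same either way
print(raw[1], centered[1])  # slope of X1: at X2 = 0 vs at the mean of X2
```

The centered slope of $X_1$ equals the raw slope plus $b_3 \bar{X}_2$, which is why it is easier to interpret.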

Mediation (cf Mediation Analysis):

• Want to separate direct effect of $X_1$ on $Y$ vs indirect effect via effect on $X_2$
• Fit $M = i_1 + aX + e_1$, $Y = i_2 + cX + e_2$ and $Y = i_3 + dX + bM + e_3$ (with $X = X_1$ and mediator $M = X_2$)
• Causal steps procedure
• Test a is significant vs null
• Test c is significant vs null
• Test b is significant vs without b
• Test d is not significant vs without d
• Often low power
• Sobel test:
• Test $Z = \frac{ab}{\sqrt{b^2 s_a^2 + a^2 s_b^2}} \sim Normal$, where $s_a$ and $s_b$ are the standard errors of $a$ and $b$
• $Z \sim Normal$ is often a poor approximation - use simulation instead
• Structural Equation Modeling
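A sketch of the Sobel test in its usual first-order form, where the indirect effect $ab$ is divided by the approximate standard error $\sqrt{b^2 s_a^2 + a^2 s_b^2}$. The estimates and standard errors below are invented; in practice they come from the two fitted regressions. As noted above, the normal approximation is often poor, so treat the p-value with suspicion.

```python
from statistics import NormalDist

def sobel_test(a, se_a, b, se_b):
    """First-order Sobel z statistic and two-sided p-value for the indirect effect a*b."""
    se_ab = (b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2) ** 0.5
    z = a * b / se_ab
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical path estimates: X -> M (a) and M -> Y controlling for X (b)
z, p = sobel_test(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(z, p)
```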

Caution - Don’t Expect An Easy Answer

Exercises

## Lecture 6

ANOVA - analysis of variance - modeling differences between group means.

Null model = same means.

Contrast codes:

• Want to compare against a null-model where the parameters are restricted to some hyperplane, but analytic solution to GLM can only handle axis-aligned hyperplanes.
• Eg 2x2 control/diet x male/female. ‘Diet effect does not vary between male/female’ is equiv to ‘control/male - diet/male = control/female - diet/female’
• Solution: change of basis - $Y = A + BLX$
• Rows of $L$ should be orthogonal
• Avoids introducing spurious correlations in transformed data, which would create correlations between confidence intervals
• Allows interpreting as difference of means
• Even when cell sizes are unequal!
• Otherwise null hypothesis is same but error is split differently across parameters
• Allows partitioning out $SSR$ due to each parameter (because SSR is linear function of group means)
• As long as cell sizes are equal - otherwise denominator of SSR is not same across rows
• For given row $\lambda$, comparing against model without that parameter reduces to $\mathrm{SSR} = \frac{(\sum_k \lambda_k \bar{Y}_k) ^2}{\sum_k (\lambda_k^2 / n_k)}$
• If a row sums to 0, parameter can be interpreted as difference of means.
• Formula for confidence interval is same as simple model
• To test for differences between means of $m$ groups, can use $m-1$ orthogonal rows
• Gives $b = \frac{\sum_k \lambda_k \bar{Y}_k}{\sum_k \lambda_k^2}$
• (Means $L$ does not have rank n - can’t reconstruct original parameters - is this ok?)
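A worked example of the SSR and $b$ formulas for contrast rows, using made-up group means with equal cell sizes and two orthogonal rows that each sum to 0. With equal cell sizes the per-contrast SSRs should add up to the between-group sum of squares.

```python
def contrast_ssr(weights, means, sizes):
    """SSR for one contrast row: (sum lambda_k * mean_k)^2 / sum(lambda_k^2 / n_k)."""
    psi = sum(l * m for l, m in zip(weights, means))
    return psi ** 2 / sum(l ** 2 / n for l, n in zip(weights, sizes))

def contrast_slope(weights, means):
    """Parameter estimate b = sum lambda_k * mean_k / sum lambda_k^2."""
    return sum(l * m for l, m in zip(weights, means)) / sum(l ** 2 for l in weights)

means = [10.0, 12.0, 17.0]   # made-up group means
sizes = [8, 8, 8]            # equal cell sizes
L = [[2, -1, -1],            # group 1 vs groups 2 and 3
     [0, 1, -1]]             # group 2 vs group 3

grand_mean = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)
ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))

for row in L:
    print(row, contrast_ssr(row, means, sizes), contrast_slope(row, means))
print(ss_between)  # equals the sum of the two contrast SSRs here
```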

With unequal cell sizes, orthogonal rows can still introduce redundancy (in the degenerate case of only one datapoint Y=0, X+Y and X-Y are orthogonal but perfectly anti-correlated).

Helmert codes - $\lambda_{i,i} = m-i$ and $\forall j > i \ldotp \lambda_{i,j} = -1$
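A small generator for Helmert codes following the definition above, with 1-based indices and the assumption (not stated above, but the usual convention) that $\lambda_{i,j} = 0$ for $j < i$. The function name is made up.

```python
def helmert_codes(m):
    """The m - 1 Helmert contrast rows for m groups."""
    rows = []
    for i in range(1, m):  # 1-based row index, as in the formula
        # zeros before the diagonal (assumed), m - i on it, -1 after it
        rows.append([0] * (i - 1) + [m - i] + [-1] * (m - i))
    return rows

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

codes = helmert_codes(4)
for row in codes:
    print(row, sum(row))           # each row sums to 0

print([dot(codes[i], codes[j])     # all pairs of rows are orthogonal
       for i in range(len(codes)) for j in range(i + 1, len(codes))])
```

Row $i$ compares group $i$ against the average of all later groups.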

Orthogonal polynomial codes:

• $Y$ as polynomial of category
• Each row fits $b_n X^n - \text{previous rows}$
• Differs from simply fitting a polynomial because based on group means rather than individual points - latter weights error towards larger categories

Dummy codes - $\lambda_{i,i+1} = 1$ and $\lambda_{i,j} = 0$ otherwise. Not contrast codes - interpret $\lambda_i$ as comparing case $i$ vs case $0$.

Unequal cell sizes are weird, because mean of group means is not mean of individuals.
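A tiny illustration with made-up scores: the unweighted mean of group means treats each group equally, while the mean of individuals weights groups by size.

```python
groups = {
    "control": [4.0, 6.0, 5.0, 5.0],  # four made-up scores
    "diet": [9.0],                    # a single score
}

group_means = {name: sum(v) / len(v) for name, v in groups.items()}
mean_of_means = sum(group_means.values()) / len(group_means)

all_scores = [s for v in groups.values() for s in v]
grand_mean = sum(all_scores) / len(all_scores)

print(group_means)
print(mean_of_means, grand_mean)  # these differ because cell sizes differ
```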

Multiple comparisons abound.

• In planned comparisons use $.05/m$
• In post-hoc comparison use Scheffé adjusted critical value
• Fixes type 1 rate at $.05$
• Some contrast exceeds critical value iff omnibus test is significant

For power analysis estimate:

\begin{align} \hat{\eta}^2 &= 1 - \frac{(1 - \mathrm{PRE})(n - \mathrm{PC})}{n - \mathrm{PA}} \cr &= \left( \frac{ m \sigma^2 \sum_k (\lambda_k^2 / n_k) }{ (\sum_k \lambda_k \mu_k)^2 } + 1 \right) ^{-1} \end{align}

where $\mu_k$ is predicted group mean and $\sigma^2$ is predicted within-cell variance.

• Power for omnibus test is maximized when all cell sizes are equal.
• Power for contrast is maximized when cell sizes proportional to weights.

With multiple categorical variables a good tactic is:

• Do contrast codes for decomposition
• Map m-1 from each subspace into full space to ask basic questions

Including useful variables often increases power for testing original variables, because reduces error which would obscure small effects.

Tukey-Kramer to test all possible pairs of groups.

## Lecture 7

ANCOVA - analysis of covariance - same as ANOVA but with continuous as well as categorical predictors.

Typical use case - control vs treatment whilst controlling for covariate. Similar to before, can increase power by reducing error that is obscuring small effects.

Eg in pre/post test, typically more powerful than just modeling the difference. Latter effectively fixes the pre-test parameter to 1, so is only more powerful if ANCOVA estimate was close to 1.

Homogeneity of regression assumption = no interaction between categorical variable and continuous covariate.

## Lecture 8

What if $e_i$ are not independent? Eg grouped or sequential data.

Repeated measures ANOVA - for grouped data, use weighted mean of group score.

I can’t find a reason to prefer this over a hierarchical model.

## Lecture 9

Losing interest in the course by this point. Statistical Rethinking is much more useful.

Multi-level models.

## Lecture 10

Bayes factors.

Logistic regression.