Notes on 'PSYCGR01: Statistics'

https://www.ucl.ac.uk/lifesciences-faculty-php/courses/viewcourse.php?coursecode=PSYCGR01

This course provides a thorough introduction to the General Linear Model, which incorporates analyses such as multiple regression, ANOVA, ANCOVA, repeated-measures ANOVA. We will also cover extensions to linear mixed-effects models and logistic regression. All techniques will be discussed within a general framework of building and comparing statistical models. Practical experience in applying the methods will be developed through exercises with the statistics package SPSS.

Lecture 1

Ignore cookbook approach, do model comparison.

General linear model.

Inference as attempted generalization from sample to population (non-Bayesian?).

Want estimators to be:

Unbiased - expected value is true value
Consistent - variance decreases as sample size increases
Efficient - smallest variance out of all unbiased estimators

Efficient estimators:

Count of errors -> mode
Sum of absolute errors -> median
Sum of squared errors -> mean

$MSE = \sum (Y_i - \hat{Y}_i)^2 / n - p$.

TODO Why degrees of freedom?

Review:

What is inference?
Three desirable properties of estimators.

Lecture 2

Model is model of population (which implies that we can include sampling method in inference if we think we can accurately model the bias).

Sum of squares reduced $SSR = \operatorname{SSE}(C) - \operatorname{SSE}(A)$

Proportional reduction in error $PRE = \frac{\operatorname{SSE}(C) - \operatorname{SSE}(A)}{\operatorname{SSE}(C)}$. On population is usually denoted $\eta^2$.

F-score for GLM: $F = \frac{\mathrm{PRE} / (\mathrm{PA} - \mathrm{PC})}{(1-\mathrm{PRE})/(n - \mathrm{PA})} \sim F(\mathrm{PA} - \mathrm{PC}, n - \mathrm{PA})$

F-test: reject null if $P_\mathrm{null}(F > F_\mathrm{observed}) < \alpha$. Fixes $P_\mathrm{null}(\mathrm{Type1}) = \alpha$. Produces tradeoff curve between $P_\mathrm{null}(\mathrm{Type2})$ and real effect size.

95% confidence interval of estimate = on 95% of samples, confidence interval falls around true population value = reject null if (1-$\alpha$) confidence interval does not contain null.

Review:

Define f-score.
Define f-test.
Define confidence interval.

Lecture 3

Multiple regression.

Test for unique effect of $X_i$ by comparing with model where $\beta_i=0$.

Omnibus test - testing multiple parameters at once. Prefer tests where $PA - PC = 1$ - easier to interpret success/failure.

$R^2$ - squared multiple correlation coefficient - ‘coefficient of determination’ - ‘proportion of variance explained’ - PRE of model over $Y_i = \beta_0 + \epsilon_i$.

$\eta^2$ - true value of PRE in population. Unbiased estimate $\hat{\eta}^2 = 1 - \frac{(1 - \mathrm{PRE})(n - \mathrm{PC})}{n - \mathrm{PA}}$.

Conventionally:

Small effect $\eta^2=.03$
Medium effect $\eta^2=.13$
Large effect $\eta^2=.26$

$1-\alpha$ confidence interval for slope $b_j \pm \sqrt{\frac{F_{1,n-p;\alpha}\mathrm{MSE}}{(n-1)S^2_{X_j}(1-R^2_j)}}$ where:

$\mathrm{MSE} = \frac{\mathrm{SSE}}{n-p}$
Sample variance $S^2_{X_j} = \frac{\sum_{i=1}^n(X_j,i - \bar{X}_j)^2}{n-1}$
$R^2_j$ is PRE of model $X_{j,i} = b_0 + \prod_{k \neq j} b_k X_{k,i} + e_i$ vs model $X_{j,i}=b_0 + e_i$ (proportion of variance of $X_j$ that can be explained by other predictors)

$(1 - R^2_j)$ also called tolerance - how uniquely useful is $X_j$

Model search:

Enter - add variables in blocks
Forwards - start with best predictor, keep adding next best until PRE not significant
Backwards - start with all, keep removing worst until PRE becomes significant
Stepwise - forwards but may also remove parameters that fall beneath some threshold

Better to rely on theory

Note, for null model $Y_i = b_0 + \epsilon $ we get $SSE = (n - 1)\operatorname{Var}(Y_i)$

Lecture 4

GLM assumptions:

Normality - $\epsilon_i \sim Normal$
- Biased predictions
Unbiasedness - $\epsilon_i$ has mean 0
- Biased test results
Homoscedasticity - $\epsilon_i$ has constant variance (per i)
- Unbiased parameter estimates (?)
- Biased test results
Independence - $\epsilon_i$ are pairwise independent
- Model mis-specification

Histogram of residuals should be roughly normal (1).

Should be no relationship in residual vs predicted graph (2,3).

Quantile-quantile plot - $Y_i$ vs $Q_i$ where $Q_i$ s.t. $P(Y \leq Q_i) = \hat{p}_i \approx p(Y \leq Y_i)$ ie quantiles vs cdf of normal distribution. If $Y_i$ are normal than should be roughly straight.

Shapiro-Wilk or Kolmogorov-Smirnov tests for normality.

Breush-Pagan or Koenker or Levene test for homoscedasticity.

Randomized control or sequential dependence test for independence.

Transform dependent variables to achieve 1,3. Transform predictor to achieve 2.

Outlier detection:

Mahalanobis distance - distance of data point from center
Leverage - weight of data point in parameter estimate
Studentized deleted residual - ?
Cook’s distance - does omission of a data point change model predictions

Outlier tests run on all data points, so need multiple comparison correction.

Multicollinearity - as $R^2_j \xrightarrow 1$ the confidence interval $\xrightarrow \infty$. Detection:

Tolerance or variance inflation factor
Correlation matrix

Partial correlation between $Y$ and $X_i$ is $\operatorname{sign}(\beta_i) \sqrt{\operatorname{PRE}(M, M-X_i)} = \frac{\operatorname{PRE}(M, NULL) - \operatorname{PRE}(M - X_i, NULL)}{1 - \operatorname{PRE}(M - X_i, NULL)}$

Lecture 5

Moderation

Effect of $X_1$ varies depending on value of $X_2$
Fit $Y \sim \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$
Formula for confidence interval is same as simple model
Center predictors for moderation
- Easier to interpret
- Reduces redundancy between $X_1$ and $X_1 X_2$ but does not change confidence interval of $\beta_3$, as long as we have simple parameters ($\beta_1$ and $\beta_2$)
  - This is true of any linear change to parameters

Mediation (cf Mediation Analysis):

Want to separate direct effect of $X_1$ on $Y$ vs indirect effect via effect on $X_2$
Fit $M = i_1 + aX + e_1\\ Y = i_2 + cX + e_2\\ Y = i_3 + dX + bM + e_3$
Casual steps procedure
- Test a is significant vs null
- Test c is significant vs null
- Test b is significant vs without b
- Test d is not significant vs without d
- Often low power
Sobel test:
- Test $Z = ab \sim Normal$
- $Z \sim Normal$ is often a poor approximation - use simulation instead
Structural Equation Modeling

Caution - Don’t Expect An Easy Answer

Lecture 6

ANOVA - analysis of variance - modeling differences between group means.

Null model = same means.

Contrast codes:

Want to compare against a null-model where the parameters are restricted to some hyperplane, but analytic solution to GLM can only handle axis-aligned hyperplanes.
- Eg 2x2 control/diet x male/female. ‘Diet effect does not vary between male/female’ is equiv to ‘control/male - diet/male = control/female - diet/female’
Solution: change to basis - $Y = A + BLX$
Rows of $L$ should be orthogonal
- Avoids introducing spurious correlations in transformed data, which would create correlations between confidence intervals
- Allows interpreting as difference of means
  - Even when cell sizes are unequal!
  - Otherwise null hypothesis is same but error is split differently across parameters
- Allows partitioning out $SSR$ due to each parameter (because SSR is linear function of group means)
  - As long as cell sizes are equal - otherwise denominator of SSR is not same across rows
For given row $\lambda$, comparing against model without that parameter reduces to $\mathrm{SSR} = \frac{(\sum_k \lambda_k \bar{Y}_k) ^2}{\sum_k (\lambda_k^2 / n_k)}$
If a row sums to 0, parameter can be interpreted as difference of means (source).
Formula for confidence interval is same as simple model
To test for differences between means of $m$ groups, can use $m-1$ orthogonal rows
- Gives $b = \frac{\sum_k \lambda_k \bar{Y}_k}{\sum_k \lambda_k^2}$
- (Means $L$ does not have rank n - can’t reconstruct original parameters - is this ok?)

With unequal cell sizes, orthogonal rows can still introduce redundancy (in generate case of only one datapoint Y=0, X+Y and X-Y are orthogonal but perfectly anti-correlated).

Helmert codes - $\lambda_{i,i} = m-i$ and $\forall j > i \ldotp \lambda_{i,j} = -1$

Orthogonal polynomial codes:

$Y$ as polynomial of category
Each row fits $b_n X^n - \text{previous rows}$
Differs from simply fitting a polynomial because based on group means rather than individual points - latter weights error towards larger categories

Dummy codes - $\lambda_{i,i+1} = 1$ and $\lambda_{i,j} = 0$ otherwise. Not contrast codes - interpret $\lambda_i$ as comparing case $i$ vs case $0$.

Unequal cell sizes are weird, because mean of group means is not mean of individuals.

Multiple comparisons abound.

In planned comparisons use $.05/m$
In post-hoc comparison use Scheffe adjusted critical value
- Fixes type 1 rate at $.05$
- Any contrast exceeds critical value iff omnibus test is significant

For power analysis estimate:

\begin{align} \hat{\eta}^2 &= 1 - \frac{(1 - \mathrm{PRE})(n - \mathrm{PC})}{n - \mathrm{PA}} \cr &= \left( \frac{ m \sigma^2 (\sum \lambda^2 / n_k) }{ (\sum \lambda_k \mu_k)^2 } + 1 \right) ^{-1} \end{align}

where $\mu_k$ is predicted group mean and $\sigma^2$ is predicted within-cell variance.

Power for omnibus test is maximized when all cell sizes are equal.
Power for contrast is maximized when cell sizes proportional to weights.

With multiple categorical variables a good tactic is:

Do contrast codes for decomposition
Map m-1 from each subspace into full space to ask basic questions
Take elementwise products of basic questions to ask about interactions

Including useful variables often increases power for testing original variables, because reduces error which would obscure small effects.

Tukey-Kramer to test all possible pairs of groups.

Lecture 7

ANCOVA - analysis of covariance - same as ANOVA but with continuous as well as categorical predictors.

Typical use case - control vs treatment whilst controlling for covariate. Similar to before, can increase power by reducing error that is obscuring small effects.

Eg in pre/post test, typically more powerful than just modeling the difference. Latter effectively fixes the pre-test parameter to 1, so is only more powerful if ANCOVA estimate was close to 1.

Homogeneity of regression assumption = no interaction between categorical variable and continuous covariate.

Lecture 8

What if $e_i$ are not independent? Eg grouped or sequential data.

Repeated measures ANOVA - for grouped data, use weighted mean of group score.

I can’t find a reason to prefer this over a hierarchical model.

Lecture 9

Losing interest in the course by this point. Statistical Rethinking is much more useful.

Multi-level models.

Lecture 10

Bayes factors.

Logistic regression.