https://www.ucl.ac.uk/lifesciences-faculty-php/courses/viewcourse.php?coursecode=PSYCGR01
This course provides a thorough introduction to the General Linear Model, which incorporates analyses such as multiple regression, ANOVA, ANCOVA, repeated-measures ANOVA. We will also cover extensions to linear mixed-effects models and logistic regression. All techniques will be discussed within a general framework of building and comparing statistical models. Practical experience in applying the methods will be developed through exercises with the statistics package SPSS.
Lecture 1
Ignore cookbook approach, do model comparison.
General linear model.
Inference as attempted generalization from sample to population (non-Bayesian?).
Want estimators to be:
- Unbiased - expected value is true value
- Consistent - estimate converges to the true value as sample size increases (for an unbiased estimator: variance shrinks to 0)
- Efficient - smallest variance out of all unbiased estimators
Which estimator minimizes which error measure:
- Count of errors -> mode
- Sum of absolute errors -> median
- Sum of squared errors -> mean
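A quick numerical check of the three pairings, as a sketch: for a single-parameter model, search a grid of candidate estimates and see which one minimizes each error measure. The data and the grid are made up for illustration.

```python
from statistics import mean, median, mode

# Hypothetical sample, chosen so that mode, median and mean all differ.
data = [1, 1, 2, 3, 10]

def count_errors(b):
    """Number of observations that the single estimate b fails to match."""
    return sum(1 for y in data if y != b)

def sum_abs_errors(b):
    return sum(abs(y - b) for y in data)

def sum_sq_errors(b):
    return sum((y - b) ** 2 for y in data)

# Candidate estimates from 0.0 to 12.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 121)]

best_count = min(candidates, key=count_errors)   # compare with mode(data)
best_abs = min(candidates, key=sum_abs_errors)   # compare with median(data)
best_sq = min(candidates, key=sum_sq_errors)     # compare with mean(data)
```

The grid only contains the exact minimizers because the sample statistics here happen to be multiples of 0.1; for arbitrary data the grid search would only get close.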
$MSE = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - p}$ where $p$ is the number of estimated parameters.
TODO Why degrees of freedom?
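One way to see why the divisor is $n - p$ rather than $n$ is by simulation. The sketch below assumes the simplest model $Y_i = b_0 + \epsilon_i$ (so $p = 1$) with normal errors; the population values, sample size and number of repetitions are arbitrary choices. Because $b_0$ is fitted to the same sample, the residuals are systematically smaller than the true errors, so $SSE / n$ underestimates the error variance on average while $SSE / (n - p)$ does not.

```python
import random
from statistics import mean

random.seed(0)
TRUE_VARIANCE = 4.0  # variance of the simulated errors (sd = 2)
n, p = 5, 1          # small sample; the model estimates one parameter

def sse_of_sample():
    ys = [random.gauss(10.0, 2.0) for _ in range(n)]
    y_hat = mean(ys)  # least-squares estimate for Y_i = b_0 + e_i
    return sum((y - y_hat) ** 2 for y in ys)

sses = [sse_of_sample() for _ in range(20000)]

# Theory: E[SSE] = (n - p) * sigma^2, so the first average should sit near
# 3.2 (too small) and the second near TRUE_VARIANCE.
avg_sse_over_n = mean(s / n for s in sses)
avg_sse_over_n_minus_p = mean(s / (n - p) for s in sses)
```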
Review:
- What is inference?
- Three desirable properties of estimators.
Lecture 2
Model is model of population (which implies that we can include sampling method in inference if we think we can accurately model the bias).
Sum of squares reduced $SSR = \operatorname{SSE}(C) - \operatorname{SSE}(A)$
Proportional reduction in error $PRE = \frac{\operatorname{SSE}(C) - \operatorname{SSE}(A)}{\operatorname{SSE}(C)}$. In the population this is usually denoted $\eta^2$.
F-score for GLM: $F = \frac{\mathrm{PRE} / (\mathrm{PA} - \mathrm{PC})}{(1-\mathrm{PRE})/(n - \mathrm{PA})} \sim F(\mathrm{PA} - \mathrm{PC}, n - \mathrm{PA})$
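A worked sketch of SSR, PRE and F for one model comparison. The compact model C is taken to be the mean-only model ($\mathrm{PC} = 1$) and the augmented model A a simple regression ($\mathrm{PA} = 2$); both choices and the data are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
n = len(y)
x_bar, y_bar = mean(x), mean(y)

# Compact model C: Y_i = b0 + e_i.
PC = 1
sse_c = sum((yi - y_bar) ** 2 for yi in y)

# Augmented model A: Y_i = b0 + b1 * X_i + e_i, fitted by least squares.
PA = 2
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse_a = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

ssr = sse_c - sse_a
pre = ssr / sse_c
f = (pre / (PA - PC)) / ((1 - pre) / (n - PA))
```

For this particular pair of models PRE is just the squared correlation between `x` and `y`. The p-value would come from the upper tail of $F(\mathrm{PA} - \mathrm{PC}, n - \mathrm{PA})$, which needs a stats library and is left out here.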
F-test: reject null if $P_\mathrm{null}(F > F_\mathrm{observed}) < \alpha$. Fixes $P_\mathrm{null}(\mathrm{Type1}) = \alpha$. Produces tradeoff curve between $P(\mathrm{Type2})$ and real effect size (Type 2 errors only happen when the null is false, so this probability is not taken under the null).
95% confidence interval for an estimate: across repeated samples, 95% of the intervals built this way contain the true population value. Equivalent to a test: reject null if the $(1-\alpha)$ confidence interval does not contain the null value.
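The repeated-sampling reading of a confidence interval can be checked by simulation. This sketch makes a simplifying assumption that is not in the lectures: the population standard deviation is treated as known, so the interval for the mean uses a normal critical value rather than the F-based one. The population values and sample size are made up.

```python
import random
from statistics import NormalDist, mean

random.seed(1)
MU, SIGMA, n = 50.0, 10.0, 25     # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value
half_width = z * SIGMA / n ** 0.5

def interval_covers_mu():
    """Draw one sample and report whether its interval contains MU."""
    sample_mean = mean(random.gauss(MU, SIGMA) for _ in range(n))
    return sample_mean - half_width <= MU <= sample_mean + half_width

trials = 10000
coverage = sum(interval_covers_mu() for _ in range(trials)) / trials
```

`coverage` should land close to 0.95. Any single interval either contains `MU` or it does not; the 95% describes the procedure.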
Review:
- Define F-score.
- Define F-test.
- Define confidence interval.
Lecture 3
Multiple regression.
Test for unique effect of $X_i$ by comparing with model where $\beta_i=0$.
Omnibus test - testing multiple parameters at once. Prefer tests where $PA - PC = 1$ - easier to interpret success/failure.
$R^2$ - squared multiple correlation coefficient - ‘coefficient of determination’ - ‘proportion of variance explained’ - PRE of model over $Y_i = \beta_0 + \epsilon_i$.
$\eta^2$ - true value of PRE in population. Unbiased estimate $\hat{\eta}^2 = 1 - \frac{(1 - \mathrm{PRE})(n - \mathrm{PC})}{n - \mathrm{PA}}$.
Conventionally:
- Small effect $\eta^2=.03$
- Medium effect $\eta^2=.13$
- Large effect $\eta^2=.26$
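The adjustment from sample PRE to $\hat{\eta}^2$ is a one-liner. A minimal sketch with made-up numbers (the function name is just for illustration):

```python
def adjusted_eta_squared(pre, n, pc, pa):
    """Bias-adjusted estimate of eta^2 from the sample PRE."""
    return 1 - (1 - pre) * (n - pc) / (n - pa)

# Hypothetical numbers: PRE = 0.5 from n = 22 observations, PC = 1, PA = 2.
eta_hat = adjusted_eta_squared(0.5, n=22, pc=1, pa=2)

# With PRE = 0 the adjusted value drops below zero.
eta_hat_no_effect = adjusted_eta_squared(0.0, n=22, pc=1, pa=2)
```

The adjusted value is always a little below the sample PRE whenever $\mathrm{PA} > \mathrm{PC}$, and the shrinkage is larger for small $n$.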
$1-\alpha$ confidence interval for slope $b_j \pm \sqrt{\frac{F_{1,n-p;\alpha}\mathrm{MSE}}{(n-1)S^2_{X_j}(1-R^2_j)}}$ where:
- $\mathrm{MSE} = \frac{\mathrm{SSE}}{n-p}$
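- $S^2_{X_j}$ is the sample variance of $X_j$
- $R^2_j$ is PRE of model predicting $X_j$ from the other predictors vs model predicting $X_j$ from its mean (proportion of variance of $X_j$ that can be explained by other predictors)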
$(1 - R^2_j)$ also called tolerance - how uniquely useful is $X_j$
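A small sketch of tolerance and its reciprocal, the variance inflation factor. It assumes a model with only two predictors, in which case $R^2_1$ is simply the squared correlation between $X_1$ and $X_2$; the data are invented so that one second predictor is unrelated to $X_1$ and the other is nearly a copy of it.

```python
from statistics import mean

def r_squared_simple(x, y):
    """PRE of Y = b0 + b1*X over Y = b0, i.e. the squared correlation."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

x1 = [1, 2, 3, 4, 5, 6]
x2_unrelated = [3, 1, 2, 2, 1, 3]               # uncorrelated with x1
x2_redundant = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]   # nearly a copy of x1

tolerance_unrelated = 1 - r_squared_simple(x2_unrelated, x1)
tolerance_redundant = 1 - r_squared_simple(x2_redundant, x1)

# Variance inflation factor: how much the squared CI width is multiplied.
vif_redundant = 1 / tolerance_redundant
```

Low tolerance inflates the term under the square root in the confidence interval for $b_j$, which is why the redundant predictor gives a much wider interval.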
Model search:
- Enter - add variables in blocks
- Forwards - start with best predictor, keep adding next best until the PRE from adding it is not significant
- Backwards - start with all, keep removing worst until the PRE lost by removing it becomes significant
- Stepwise - forwards but may also remove parameters that fall beneath some threshold
Better to rely on theory
Note, for null model $Y_i = b_0 + \epsilon_i$ we get $SSE = (n - 1)\operatorname{Var}(Y_i)$, where $\operatorname{Var}$ is the sample variance.
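A direct check of this identity on made-up data, using the fact that the least-squares estimate of $b_0$ in the null model is the sample mean:

```python
from statistics import mean, variance

y = [2, 1, 4, 3, 6, 5]  # hypothetical data
n = len(y)

y_bar = mean(y)  # least-squares estimate of b_0 in the null model
sse_null = sum((yi - y_bar) ** 2 for yi in y)

# statistics.variance is the sample variance (divides by n - 1).
var_times_df = (n - 1) * variance(y)
```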
Lecture 4
GLM assumptions:
- Normality - $\epsilon_i \sim Normal$
- Biased predictions
- Unbiasedness - $\epsilon_i$ has mean 0
- Biased test results
- Homoscedasticity - $\epsilon_i$ has constant variance (per i)
- Unbiased parameter estimates (?)
- Biased test results
- Independence - $\epsilon_i$ are pairwise independent
- Model mis-specification
Histogram of residuals should be roughly normal (1).
Should be no relationship in residual vs predicted graph (2,3).
Quantile-quantile plot - $Y_i$ vs $Q_i$ where $Q_i$ s.t. $P(Y \leq Q_i) = \hat{p}_i \approx P(Y \leq Y_i)$ ie quantiles vs cdf of normal distribution. If $Y_i$ are normal then the plot should be roughly a straight line.
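A sketch of how the points of a normal Q-Q plot can be computed without a plotting library. The plotting positions $\hat{p}_i = (i - 0.5)/n$ are one common convention and an assumption here, as is comparing against a normal distribution with the sample's own mean and standard deviation.

```python
import random
from statistics import NormalDist, mean, stdev

def normal_qq_points(ys):
    """Pair each sorted observation with the matching normal quantile."""
    n = len(ys)
    fitted = NormalDist(mean(ys), stdev(ys))
    # Estimated cumulative probability of the i-th smallest observation.
    quantiles = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(ys)))

# Hypothetical sample that really is normal.
random.seed(2)
sample = [random.gauss(0, 1) for _ in range(200)]
points = normal_qq_points(sample)  # list of (Q_i, Y_i) pairs
```

Plotting `points` for normal data should give something close to the line $Y = Q$; skew or heavy tails show up as curvature at the ends.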
Shapiro-Wilk or Kolmogorov-Smirnov tests for normality.
Breusch-Pagan or Koenker or Levene test for homoscedasticity.
Randomized control or sequential dependence test for independence.
Transform dependent variables to achieve 1,3. Transform predictor to achieve 2.
Outlier detection:
- Mahalanobis distance - distance of data point from center
- Leverage - weight of data point in parameter estimate
- Studentized deleted residual - ?
- Cook’s distance - does omission of a data point change model predictions
Outlier tests run on all data points, so need multiple comparison correction.
Multicollinearity - as $R^2_j \to 1$ the width of the confidence interval $\to \infty$. Detection:
- Tolerance or variance inflation factor
- Correlation matrix
Partial correlation between $Y$ and $X_i$ is $\operatorname{sign}(\beta_i) \sqrt{\operatorname{PRE}(M, M-X_i)}$, where $\operatorname{PRE}(M, M-X_i) = \frac{\operatorname{PRE}(M, NULL) - \operatorname{PRE}(M - X_i, NULL)}{1 - \operatorname{PRE}(M - X_i, NULL)}$
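A sketch that computes the partial correlation through model comparison. The helper `fit_sse` is a hypothetical pure-Python least-squares routine (normal equations solved by Gauss-Jordan elimination), written only to keep the example dependency-free; the data are invented.

```python
def fit_sse(columns, y):
    """Least-squares fit of y on an intercept plus `columns`.

    Returns (coefficients, SSE). Hypothetical helper for this example.
    """
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * w for v, w in zip(a[r], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    sse = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
              for r, yi in zip(rows, y))
    return b, sse

y = [3, 5, 4, 8, 9, 12, 11, 15]
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]

_, sse_null = fit_sse([], y)
_, sse_reduced = fit_sse([x2], y)        # M - X_1
b_full, sse_full = fit_sse([x1, x2], y)  # M

# PRE(M, M - X_1), computed directly and via the two PREs against NULL.
pre_unique = (sse_reduced - sse_full) / sse_reduced
pre_full = 1 - sse_full / sse_null
pre_reduced = 1 - sse_reduced / sse_null
pre_unique_alt = (pre_full - pre_reduced) / (1 - pre_reduced)

sign = 1 if b_full[1] >= 0 else -1
partial_r = sign * pre_unique ** 0.5
```

`pre_unique` and `pre_unique_alt` agree up to rounding, and `partial_r` matches the usual textbook formula for a first-order partial correlation.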
Lecture 5
Moderation
- Effect of $X_1$ varies depending on value of $X_2$
- Fit $Y \sim \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$
- Formula for confidence interval is same as simple model
- Center predictors for moderation
- Easier to interpret
- Reduces redundancy between $X_1$ and $X_1 X_2$ but does not change confidence interval of $\beta_3$, as long as we have simple parameters ($\beta_1$ and $\beta_2$)
- This is true of any linear change to parameters
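A sketch of the centering claim on simulated data: fitting the moderation model on raw and on mean-centered predictors gives the same interaction estimate and the same SSE, while the simple coefficient for $X_1$ shifts by $\beta_3 \bar{X}_2$. The helper `fit` is a hypothetical pure-Python least-squares routine, and all simulation settings are made up. Only the point estimates are compared here, not the confidence intervals.

```python
import random
from statistics import mean

def fit(columns, y):
    """Least-squares fit of y on an intercept plus `columns`: (b, SSE)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * w for v, w in zip(a[r], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    sse = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
              for r, yi in zip(rows, y))
    return b, sse

def with_interaction(u, v):
    """Predictor columns X1, X2 and their product."""
    return [u, v, [p * q for p, q in zip(u, v)]]

random.seed(3)
n = 40
x1 = [random.uniform(0, 5) for _ in range(n)]
x2 = [random.uniform(0, 5) for _ in range(n)]
# Simulated outcome with a true interaction; coefficients are invented.
y = [1 + 0.5 * p + 0.3 * q + 0.2 * p * q + random.gauss(0, 1)
     for p, q in zip(x1, x2)]

b_raw, sse_raw = fit(with_interaction(x1, x2), y)

m1, m2 = mean(x1), mean(x2)
x1c = [p - m1 for p in x1]
x2c = [q - m2 for q in x2]
b_cen, sse_cen = fit(with_interaction(x1c, x2c), y)
```

After centering, `b_cen[1]` is the effect of $X_1$ at the mean of $X_2$ rather than at $X_2 = 0$, which is usually the more interpretable quantity.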
Mediation (cf Mediation Analysis):
- Want to separate direct effect of $X_1$ on $Y$ vs indirect effect via effect on $X_2$
- Fit
- Causal steps procedure
- Test a is significant vs null
- Test c is significant vs null
- Test b is significant vs without b
- Test d is not significant vs without d
- Often low power
- Sobel test:
- Test $Z = \frac{ab}{\mathrm{SE}(ab)} \sim Normal$
- $Z \sim Normal$ is often a poor approximation - use simulation instead
- Structural Equation Modeling
Caution - Don’t Expect An Easy Answer
Lecture 6
ANOVA - analysis of variance - modeling differences between group means.
Null model = same means.
Contrast codes:
- Want to compare against a null-model where the parameters are restricted to some hyperplane, but analytic solution to GLM can only handle axis-aligned hyperplanes.
- Eg 2x2 control/diet x male/female. ‘Diet effect does not vary between male/female’ is equiv to ‘control/male - diet/male = control/female - diet/female’
- Solution: change of basis - $Y = A + BLX$
- Rows of $L$ should be orthogonal
- Avoids introducing spurious correlations in transformed data, which would create correlations between confidence intervals
- Allows interpreting as difference of means
- Even when cell sizes are unequal!
- Otherwise null hypothesis is same but error is split differently across parameters
- Allows partitioning out $SSR$ due to each parameter (because SSR is linear function of group means)
- As long as cell sizes are equal - otherwise denominator of SSR is not same across rows
- For given row $\lambda$, comparing against model without that parameter reduces to $\mathrm{SSR} = \frac{(\sum_k \lambda_k \bar{Y}_k) ^2}{\sum_k (\lambda_k^2 / n_k)}$
- If a row sums to 0, parameter can be interpreted as difference of means (source).
- Formula for confidence interval is same as simple model
- To test for differences between means of $m$ groups, can use $m-1$ orthogonal rows
- Gives $b = \frac{\sum_k \lambda_k \bar{Y}_k}{\sum_k \lambda_k^2}$
- (Means $L$ does not have rank n - can’t reconstruct original parameters - is this ok?)
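A worked sketch of the contrast formulas for SSR and $b$, on invented data with three groups and equal cell sizes. With equal cells and $m - 1$ orthogonal rows, the per-contrast SSRs should add up to the between-group sum of squares, which the last lines compute for comparison.

```python
from statistics import mean

# Hypothetical data: three groups with equal cell sizes.
groups = [[4, 5, 6], [7, 8, 9], [8, 10, 12]]
# Two orthogonal contrast rows, each summing to zero.
contrasts = [[-1, -1, 2], [-1, 1, 0]]

group_means = [mean(g) for g in groups]
group_sizes = [len(g) for g in groups]

def contrast_ssr(lam):
    """SSR = (sum_k lam_k * mean_k)^2 / sum_k (lam_k^2 / n_k)."""
    numerator = sum(l * m for l, m in zip(lam, group_means)) ** 2
    denominator = sum(l ** 2 / n for l, n in zip(lam, group_sizes))
    return numerator / denominator

def contrast_b(lam):
    """b = sum_k lam_k * mean_k / sum_k lam_k^2."""
    return sum(l * m for l, m in zip(lam, group_means)) / sum(l ** 2 for l in lam)

ssr_each = [contrast_ssr(lam) for lam in contrasts]
b_each = [contrast_b(lam) for lam in contrasts]

# Between-group sum of squares (SSR of the omnibus test vs equal means).
grand_mean = mean(y for g in groups for y in g)
ss_between = sum(n * (m - grand_mean) ** 2
                 for n, m in zip(group_sizes, group_means))
```

Here the first contrast compares group 3 with the average of groups 1 and 2, and the second compares group 2 with group 1.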
With unequal cell sizes, orthogonal rows can still introduce redundancy (in the degenerate case of only one datapoint Y=0, X+Y and X-Y are orthogonal but perfectly anti-correlated).
Helmert codes - $\lambda_{i,i} = m-i$ and $\forall j > i \ldotp \lambda_{i,j} = -1$
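A sketch that generates Helmert rows from this definition and checks the two contrast-code properties (rows sum to zero, rows are mutually orthogonal). It assumes $\lambda_{i,j} = 0$ for $j < i$, which the definition leaves implicit.

```python
def helmert_codes(m):
    """Rows 1..m-1 of Helmert contrast codes for m groups."""
    rows = []
    for i in range(1, m):
        # Zeros before the diagonal (assumed), m - i on it, -1 after it.
        rows.append([0] * (i - 1) + [m - i] + [-1] * (m - i))
    return rows

codes = helmert_codes(4)

row_sums = [sum(row) for row in codes]
dot_products = [sum(a * b for a, b in zip(codes[i], codes[j]))
                for i in range(len(codes)) for j in range(i + 1, len(codes))]
```

Row $i$ compares group $i$ with the mean of all later groups, so for four groups the rows are "1 vs 2,3,4", "2 vs 3,4" and "3 vs 4".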
Orthogonal polynomial codes:
- $Y$ as polynomial of category
- Each row fits $b_n X^n - \text{previous rows}$
- Differs from simply fitting a polynomial because based on group means rather than individual points - latter weights error towards larger categories
Dummy codes - $\lambda_{i,i+1} = 1$ and $\lambda_{i,j} = 0$ otherwise. Not contrast codes - interpret $\lambda_i$ as comparing case $i$ vs case $0$.
Unequal cell sizes are weird, because mean of group means is not mean of individuals.
Multiple comparisons abound.
- In planned comparisons use $.05/m$
- In post-hoc comparison use Scheffé adjusted critical value
- Fixes type 1 rate at $.05$
- Some contrast exceeds critical value iff omnibus test is significant
For power analysis estimate:
\begin{align} \hat{\eta}^2 &= 1 - \frac{(1 - \mathrm{PRE})(n - \mathrm{PC})}{n - \mathrm{PA}} \cr &= \left( \frac{ m \sigma^2 \sum_k (\lambda_k^2 / n_k) }{ (\sum_k \lambda_k \mu_k)^2 } + 1 \right) ^{-1} \end{align}
where $\mu_k$ is predicted group mean and $\sigma^2$ is predicted within-cell variance.
- Power for omnibus test is maximized when all cell sizes are equal.
- Power for contrast is maximized when cell sizes proportional to weights.
With multiple categorical variables a good tactic is:
- Do contrast codes for decomposition
- Map m-1 from each subspace into full space to ask basic questions
- Take elementwise products of basic questions to ask about interactions
Including useful variables often increases power for testing original variables, because reduces error which would obscure small effects.
Tukey-Kramer to test all possible pairs of groups.
Lecture 7
ANCOVA - analysis of covariance - same as ANOVA but with continuous as well as categorical predictors.
Typical use case - control vs treatment whilst controlling for covariate. Similar to before, can increase power by reducing error that is obscuring small effects.
Eg in pre/post test, typically more powerful than just modeling the difference. Latter effectively fixes the pre-test parameter to 1, so is only more powerful if ANCOVA estimate was close to 1.
Homogeneity of regression assumption = no interaction between categorical variable and continuous covariate.
Lecture 8
What if $e_i$ are not independent? Eg grouped or sequential data.
Repeated measures ANOVA - for grouped data, use weighted mean of group score.
I can’t find a reason to prefer this over a hierarchical model.
Lecture 9
Losing interest in the course by this point. Statistical Rethinking is much more useful.
Multi-level models.
Lecture 10
Bayes factors.
Logistic regression.