
Behavioral Research: Statistical Methods

CG3.402
Vinoo Alluri · Monsoon 2025-26 · 4 credits

VIF, PCA, EFA/CFA, Scree Plot

Unit 10 — Multicollinearity, PCA & Factor Analysis

Maya Untangles Variables That Move Together

Maya's next study is ambitious. She wants to predict student exam performance using a battery of measures: study hours, IQ, hours of sleep, hours of social media, household income, parents' education, attendance percentage, and physical activity. Eight predictors. She fits a regression model and stares at the output.

The signs on some coefficients are baffling — IQ has a negative coefficient. The standard errors are huge. Two predictors that should matter come out non-significant.

What happened? Her predictors are tangled in each other. IQ correlates with parents' education. Parents' education correlates with household income. Household income predicts attendance. The regression is trying to disentangle effects that the data doesn't allow disentangling — a problem called multicollinearity.

This unit covers (1) what multicollinearity is and how to detect it, (2) why you'd want to reduce many variables to fewer, and (3) the two main techniques: Factor Analysis and Principal Component Analysis. Students often confuse them — get the distinction clear and you'll pick up easy marks.

Multicollinearity — the tangle problem

Multicollinearity is a high degree of correlation among your independent variables. It's *not* about correlation between the IV and the DV — that's what you want. It's correlation among the IVs themselves.

Examples:

  • Height and weight in a study of athletic performance.
  • Household income and water consumption in a household survey.
  • Mileage and price of a car.

In each, two predictors carry overlapping information. Including both in a regression doesn't add much beyond including one.

Consequences (exam staple)

Memorise this list:

1. Saps statistical power — you need more data to detect the same effects.
2. Can cause sign flips of regression coefficients — a variable with a positive bivariate relationship to Y can appear with a *negative* coefficient when its collinear partner is in the model.
3. Inflates standard errors — SEs balloon, CIs widen, tests become non-significant.
4. Reduces precision in estimating each coefficient's unique effect.
5. Increases required sample size.
6. Less reliable inferences overall — the F-test may be highly significant even when no individual coefficient is.

Intuition: when two predictors carry the same information, the model can't tell which is "responsible" for the outcome. Picture trying to weigh two objects on a scale that only shows the total — you can't separate them.

Detecting multicollinearity — VIF

The Variance Inflation Factor: for each predictor j, regress it on all the other predictors, note the R² of that auxiliary regression (Rⱼ²), and compute:

VIFⱼ = 1 / (1 − Rⱼ²)

A code sketch follows the interpretation list below.

Interpretation:

  • VIF = 1 → uncorrelated with others (ideal).
  • 1 < VIF < 5 → moderate. Tolerable.
  • VIF > 5 (some say > 10) → severe. Problematic.
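
Here is a minimal sketch of the VIF check in Python with statsmodels; the DataFrame `df` and its column names are hypothetical stand-ins for Maya's eight predictors, not part of the course materials.

```python
# Minimal VIF sketch; `df` and the column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ["study_hours", "iq", "sleep", "social_media",
              "income", "parent_edu", "attendance", "activity"]
X = sm.add_constant(df[predictors])  # add an intercept column, as statsmodels expects

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const"))  # anything above ~5 (or 10) is a red flag
```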

Remedies

1. Drop one of the correlated predictors.
2. Combine variables — replace several correlated variables with a single composite (PCA/FA scores or a theoretically motivated sum).
3. Ridge regression — penalises coefficient size (see the sketch below).
4. Collect more data — reduces SEs.
5. Centre / standardise variables — helps with interaction-term multicollinearity.
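
To make remedy 3 concrete, here is a hedged scikit-learn sketch; the outcome column `exam_score` and the DataFrame `df` are again hypothetical.

```python
# Ridge regression sketch for collinear predictors (hypothetical `df`, `exam_score`).
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = df.drop(columns="exam_score")
y = df["exam_score"]

# Standardise first (remedy 5), then let RidgeCV choose the penalty strength by
# cross-validation. The penalty shrinks collinear coefficients instead of letting
# them blow up with opposite signs.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0, 10.0]))
model.fit(X, y)
print(model.named_steps["ridgecv"].coef_)  # stabilised coefficients
```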

Why reduce dimensions?

Beyond multicollinearity, there are general reasons:

  • Curse of dimensionality — data needs grow *exponentially* with the number of variables.
  • Increased complexity — more variables, harder to interpret.
  • Overfitting risk — too many predictors fit noise.
  • Computational cost.
  • Easier visualisation (humans handle 2D/3D).

Even without extreme multicollinearity, you may want to compress data into a smaller set of meaningful summary variables. That's what FA and PCA do.

Factor Analysis — uncovering latent constructs

FA starts from a theoretical premise: the variables you observe are caused by underlying, unobserved variables ("latent factors"), and several of your observed variables may be measures of the same factor.

Examples:

  • "Intelligence" isn't directly observable. But verbal, mathematical, spatial, and processing-speed abilities are. They share a common underlying construct — general intelligence (often called *g*).
  • "Customer satisfaction" isn't directly observable. But responses to questions about boarding, in-flight service, baggage, and food are. They might cluster into latent factors like "ground experience," "in-flight experience," and "value perception."

Goal of FA: condense information into a smaller number of factors with minimum information loss.

The factor model

For each observed variable i:

Xᵢ = λᵢ₁F₁ + λᵢ₂F₂ + … + λᵢₘFₘ + εᵢ

  • F₁, …, Fₘ are the latent factors.
  • λᵢⱼ are the factor loadings — the weight of factor j on variable i.
  • εᵢ is the error (unique) term.

Each variable's variance decomposes into **communality (hᵢ², the variance explained by the common factors) and unique variance (1 − hᵢ² for a standardised variable)**.
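
A quick worked example with made-up loadings: if a standardised variable loads 0.8 on Factor 1 and 0.3 on Factor 2 (and the factors are uncorrelated), its communality is 0.8² + 0.3² = 0.73, leaving a unique variance of 1 − 0.73 = 0.27.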

EFA vs CFA — exam-favourite distinction

  • EFA (Exploratory Factor Analysis) — data-driven. You don't know structure of latent factors; let the data reveal it.
  • CFA (Confirmatory Factor Analysis) — theory-driven. You have a pre-established theory; test whether data supports it.

EFA explores; CFA confirms. Best practice: EFA on one sample, CFA on a held-out sample.

R-type vs Q-type FA

  • R-type (standard): correlations between *variables*. Groups variables that move together.
  • Q-type: correlations between *people*. Groups participants whose response patterns are similar.

Factor loadings and scores

  • **Factor loadings (λ):** correlation between each variable and each factor. A high loading (|λ| > 0.4–0.6) means the variable belongs strongly to that factor. Cross-loadings should be < 0.3 — "simple structure."
  • Factor scores: each participant's composite score on each latent factor. Usable as inputs to further analyses.
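
A minimal EFA sketch, assuming the third-party `factor_analyzer` package is installed and that `items` is a hypothetical DataFrame of questionnaire responses:

```python
# Exploratory factor analysis sketch (hypothetical `items` DataFrame).
from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(n_factors=3, rotation="varimax")  # orthogonal rotation (see Rotation below)
fa.fit(items)

loadings = fa.loadings_                  # variables x factors matrix of loadings (λ)
communalities = fa.get_communalities()   # variance each item shares with the factors
scores = fa.transform(items)             # one composite score per participant per factor
```

The loadings tell you which items belong to which factor; the factor scores can then replace the original collinear items in a downstream regression.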

How many factors? — the five criteria

1. A priori — theory-determined (CFA-style).
2. Kaiser rule — keep factors with eigenvalues > 1. Crude; often over-extracts.
3. Scree plot — plot eigenvalues against factor number; retain the factors above the elbow.
4. Parallel analysis — compare observed eigenvalues to those from random data; retain factors whose observed eigenvalues exceed the random benchmark. The most reliable criterion.
5. % variance — keep enough factors to reach a target cumulative variance (≈95% in the natural sciences, > 60% in the social sciences).
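
Here is a from-scratch numpy sketch of criteria 2 (Kaiser) and 4 (parallel analysis), again assuming a hypothetical `items` DataFrame of n participants by p items:

```python
# Kaiser rule and parallel analysis sketch (hypothetical `items` DataFrame).
import numpy as np

R = np.corrcoef(items.values, rowvar=False)      # p x p correlation matrix
obs_eigs = np.sort(np.linalg.eigvalsh(R))[::-1]  # observed eigenvalues, descending

kaiser_k = int((obs_eigs > 1).sum())             # criterion 2: eigenvalues > 1

# Criterion 4: eigenvalues of many random datasets of the same shape.
rng = np.random.default_rng(0)
n, p = items.shape
rand_eigs = np.array([
    np.sort(np.linalg.eigvalsh(
        np.corrcoef(rng.standard_normal((n, p)), rowvar=False)))[::-1]
    for _ in range(200)
])
benchmark = rand_eigs.mean(axis=0)               # or the 95th percentile per position
parallel_k = int((obs_eigs > benchmark).sum())   # factors beating the random benchmark

print(kaiser_k, parallel_k)
```

Plotting `obs_eigs` against factor number gives the scree plot for criterion 3.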

Rotation

After extraction, rotate for interpretability:

  • Varimax (orthogonal): factors stay uncorrelated.
  • Oblimin / Promax (oblique): factors can correlate. Appropriate when real-world constructs overlap.

Assumptions and adequacy

  • KMO > 0.6 (preferably > 0.8) — sampling adequate.
  • Bartlett's test of sphericity significant — correlation matrix has structure to factor.
  • Continuous variables; linear relationships; ~5–10 participants per item (preferably 200+).
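
Both adequacy checks are available as helpers in the `factor_analyzer` package (assumed installed); `items` is the same hypothetical response DataFrame:

```python
# Sampling-adequacy checks before factoring (hypothetical `items` DataFrame).
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

chi_square, p_value = calculate_bartlett_sphericity(items)  # want p < .05
kmo_per_item, kmo_overall = calculate_kmo(items)            # want overall KMO > 0.6
print(round(kmo_overall, 2), p_value)
```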

CFA fit indices

  • CFI > 0.95
  • RMSEA < 0.06
  • SRMR < 0.08
  • χ²/df < 2 (3 acceptable)

Report multiple — single indices can mislead.

Principal Component Analysis (PCA)

PCA pursues the same general goal — reducing variables while preserving information — but with a different mathematical objective and interpretation.

The core idea

PCA constructs new variables (components) that are linear combinations of originals, ordered such that:

1. PC1 captures the maximum possible variance.
2. PC2 is uncorrelated with PC1 and captures the maximum *remaining* variance.
3. PC3 is uncorrelated with both and captures the next maximum.
4. … and so on.

The first few components usually capture most of the variance. Keep them, discard the rest.

Mathematically: PCs are eigenvectors of the covariance matrix; eigenvalues = variance captured per component.
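
A sketch that checks this claim numerically and also pulls out the scores and loadings discussed next; `X` is a hypothetical n × p numpy array of measurements, not course data.

```python
# PCA sketch: components as eigenvectors of the covariance matrix (hypothetical `X`).
import numpy as np
from sklearn.decomposition import PCA

Xc = X - X.mean(axis=0)                          # centre each variable
cov = np.cov(Xc, rowvar=False)                   # p x p covariance matrix
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1] # eigenvalues, descending

pca = PCA().fit(Xc)
print(np.allclose(pca.explained_variance_, eigvals))  # eigenvalue = variance per PC
print(pca.explained_variance_ratio_.cumsum())         # how many PCs to keep?

scores = pca.transform(Xc)                       # component scores per participant
# Loadings: correlations between variables and components when X is standardised.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
```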

PCA outputs

  • Component scores: each participant's projection onto each component axis.
  • Component loadings: correlation between each original variable and each component. Tells you what each component represents.

PCA assumptions

  • At least interval-level data.
  • Linear relationships.
  • KMO > 0.6; Bartlett's significant.
  • Approximately Normal (for significance testing).
  • No major outliers (they dominate variance).

PCA vs FA — the crucial distinction

| | PCA | FA |
| --- | --- | --- |
| Model | None (purely mathematical) | Latent-variable (theoretical) |
| Variance analysed | All variance | Common (shared) variance only |
| Error term? | No | Yes (unique variance) |
| Components / factors are | Mathematical artefacts | Theoretical constructs |
| Use for | Dimensionality reduction, compression, visualisation | Scale validation, psychometric constructs |

PCA reduces dimensions; FA models latent structure.

What you carry into the exam

  • Multicollinearity — high IV-IV correlation. Inflates SEs, flips signs.
  • **VIFⱼ = 1/(1 − Rⱼ²).** Values above 5–10 are severe.
  • Remedies: drop, combine, ridge, more data, centre.
  • FA factor model: Xᵢ = λᵢ₁F₁ + … + λᵢₘFₘ + εᵢ. Variance = communality + unique variance.
  • EFA → discover; CFA → test.
  • Factor loadings > 0.4 = strong; cross-loadings < 0.3.
  • PCA = variance maximisation; components are eigenvectors of the covariance matrix.
  • Choosing # factors: scree plot, Kaiser (eigenvalue > 1), parallel analysis (best).
  • Varimax orthogonal; oblimin/promax oblique.
  • KMO > 0.6, Bartlett significant for adequacy.
  • CFA fit: CFI > 0.95, RMSEA < 0.06, SRMR < 0.08.
  • PCA vs FA distinction — components are mathematical; factors are theoretical with error terms.

When you're ready, send "next" and we'll move into ANOVA — comparing 3+ group means, the SS partition, the F-test, post-hocs (Tukey HSD), and repeated-measures with sphericity corrections.