VIF, PCA, EFA/CFA, Scree Plot
Intuition
Multicollinearity — high correlation among *predictors* (not between predictors and outcome!) — causes coefficient SEs to balloon, signs to flip, and individual coefficients to become non-significant even when the model overall fits well. Detect with VIF. Fix by dropping a predictor, combining into composites, regularising (ridge), or collecting more data. PCA reduces dimensions by maximising variance with orthogonal linear combinations of observed variables — no latent model. Factor Analysis models *latent constructs* causing observed indicators + error. *PCA explains variance; FA explains shared variance with a generative model.*
Explanation
The tangle problem. Multicollinearity is high correlation among the *independent* variables (predictors / IVs / features). It's not about correlation between IV and DV — that's what you want. It's correlation among the IVs, which causes trouble.
Examples. Height and weight in athletic-performance studies. Household income and water consumption. Mileage and price of a car. In each, two predictors carry overlapping information. Including both in regression doesn't add much beyond one.
Consequences of multicollinearity — exam staple. (1) Saps statistical power — need more data for the same effects. (2) Coefficient sign flips — a variable with a positive bivariate relationship to Y can appear with a *negative* coefficient when its collinear partner is in the model. Confusing. (3) Standard errors balloon — CIs widen, t-tests become non-significant. (4) Reduced precision in estimating each coefficient's unique effect. (5) Increased required n to obtain reliable estimates. (6) Less reliable inferences overall — the F-test on the model may be highly significant even when no individual coefficient is.
Intuition. When two predictors carry the same information, the model can't tell which is responsible for the outcome. Picture weighing two objects on a scale showing only the total — you can't separate them. Statistical machinery does its best, but the SEs honestly tell you 'we can't pin this down.'
Detecting multicollinearity — VIF. For each predictor j, regress it against all the other predictors; get $R_j^2$ — how predictable predictor j is from the rest. $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$. Interpretation: VIF = 1 → uncorrelated with others (ideal). 1 < VIF < 5 → moderate. VIF > 5 (some say 10) → severe.
SMC (Squared Multiple Correlation). The maximal proportion of variance in each predictor explainable by the others. $\mathrm{SMC}_j = 1 - 1/[\mathbf{R}^{-1}]_{jj}$, where $\mathbf{R}$ is the predictor correlation matrix. Equivalent to $R_j^2$ from the auxiliary regression. **Related to VIF: $\mathrm{VIF}_j = \frac{1}{1 - \mathrm{SMC}_j}$.** SMC near 1 → predictor is redundant; near 0 → unique.
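A minimal sketch (assuming a numeric predictor matrix `X` as a NumPy array; the variable names are illustrative) showing that VIF and SMC both fall out of the diagonal of the inverse correlation matrix, with no auxiliary regressions needed:

```python
import numpy as np

def vif_and_smc(X):
    """Compute VIF_j and SMC_j for each column of an n x p predictor matrix X.

    Uses the identity diag(inv(R)) = VIF, where R is the predictor
    correlation matrix.
    """
    R = np.corrcoef(X, rowvar=False)          # p x p correlation matrix
    vif = np.diag(np.linalg.inv(R))           # VIF_j = [R^-1]_jj = 1 / (1 - R_j^2)
    smc = 1.0 - 1.0 / vif                     # SMC_j = R_j^2
    return vif, smc

# Toy example: x3 is nearly a linear combination of x1 and x2 -> very large VIFs.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)
vif, smc = vif_and_smc(np.column_stack([x1, x2, x3]))
print(np.round(vif, 1), np.round(smc, 3))     # VIFs well above 5 flag severe collinearity
```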
Visual detection. Correlation matrices and heat maps are your first line of defense. |r| > 0.7 between any two predictors → multicollinearity likely.
Remedies for multicollinearity. (1) Drop one of the correlated predictors — keep the more theoretically important or with less missing data. (2) Combine via composite — PCA scores, FA scores, or a theoretically motivated sum/average. (3) Ridge regression — penalises coefficient size; trades bias for variance reduction. (4) Collect more data — reduces SEs even with multicollinearity. (5) Centre/standardise variables — helps with multicollinearity created by interaction terms.
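A hedged illustration of remedy (3): a ridge fit with scikit-learn on simulated, deliberately collinear data (the `alpha` penalty here is arbitrary — in practice it would be chosen by cross-validation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly collinear predictors plus noise.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)      # almost a copy of x1
y = 2 * x1 + rng.normal(size=200)

X = np.column_stack([x1, x2])
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)             # penalty shrinks unstable coefficients

print("OLS:  ", np.round(ols.coef_, 2))         # can be wild / opposite-signed
print("Ridge:", np.round(ridge.coef_, 2))       # credit shared between the near-copies; biased but stable
```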
Choosing the remedy. Depends on: (a) research inquiry — do you need to interpret each predictor separately, or is a composite OK? (b) interpretability — combined variables are harder to explain to non-technical audiences. (c) model performance — composites may give better predictions but with less individual meaning.
Curse of dimensionality. Beyond multicollinearity, there are general reasons to reduce variables: (1) data needs grow *exponentially* with dimensions. (2) Models with many predictors are harder to estimate and interpret. (3) Overfitting risk — models fit training data well but generalise poorly. (4) Computational cost. (5) Irrelevant variables add noise. (6) Easier visualisation (humans handle 2-3D).
Factor Analysis (FA) — uncovering latent constructs. Theoretical premise: the observed variables are *caused* by underlying unobserved variables — latent factors — and several observed variables may be measures of the same factor. Examples: intelligence isn't directly observable but verbal / mathematical / spatial / processing-speed abilities are; they share a latent 'g'. Customer satisfaction isn't observable but item responses about boarding / food / value cluster into latent factors.
The factor model. For each observed variable $x_i$: $x_i = \lambda_{i1}F_1 + \lambda_{i2}F_2 + \dots + \lambda_{im}F_m + \varepsilon_i$, where $F_1, \dots, F_m$ are latent factors, $\lambda_{ij}$ are factor loadings, $\varepsilon_i$ is item-specific error. Each variable's variance decomposes into communality $h_i^2$ (variance explained by common factors) and unique variance $\psi_i$.
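A minimal simulation (the loading values and sample size below are made up purely for illustration) checking that, under the orthogonal factor model, each standardised item's variance splits into $h_i^2 + \psi_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 100_000, 2                               # many observations, 2 latent factors

# Hypothetical loading matrix: 4 items, items 1-2 mostly on F1, items 3-4 mostly on F2.
Lambda = np.array([[0.8, 0.0],
                   [0.7, 0.2],
                   [0.0, 0.9],
                   [0.1, 0.6]])
psi = 1.0 - (Lambda ** 2).sum(axis=1)           # unique variances chosen so Var(x_i) = 1

F = rng.normal(size=(n, m))                     # uncorrelated, unit-variance factors
E = rng.normal(size=(n, 4)) * np.sqrt(psi)      # item-specific errors
X = F @ Lambda.T + E                            # observed items: x_i = sum_j lambda_ij F_j + eps_i

h2 = (Lambda ** 2).sum(axis=1)                  # communalities = sum of squared loadings
print("communality + uniqueness:", np.round(h2 + psi, 3))      # all 1.0 by construction
print("empirical item variances:", np.round(X.var(axis=0), 3)) # ~1.0 each
```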
Exploratory FA (EFA) vs Confirmatory FA (CFA) — exam-favourite distinction. *EFA:* data-driven, you don't know structure of factors, let the data reveal it. Use at the early stage. *CFA:* theory-driven, pre-specified structure of which items load on which factors; test fit. EFA explores; CFA confirms. Best practice: EFA on one sample, CFA on a held-out sample.
R-type vs Q-type FA. *R-type* (standard): correlations between *variables*; groups variables that move together. Used in surveys. *Q-type:* covariation between *people*; groups participants with similar response patterns. Closer to clustering. Useful for identifying respondent types.
Factor loadings. $\lambda_{ij}$ = correlation between variable i and factor j. High loading (> 0.4 by lenient convention, > 0.6 by stricter) = variable strongly belongs to the factor. Cross-loadings below 0.3 = good item discrimination. A 'simple structure' has each item loading highly on one factor only.
Factor scores. Each participant's composite score on each latent factor. Usable as inputs to further analyses (regression, t-tests) — treat as observed variables.
Sampling adequacy and FA assumptions. *KMO (Kaiser-Meyer-Olkin) measure* > 0.6 (preferably > 0.8) — pattern of correlations is suitable for FA. *Bartlett's test of sphericity* should be significant (i.e., correlation matrix has structure to factor; non-significant = nothing to extract). *Continuous variables*; *linear relationships*; ~ 5–10 participants per item.
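A hedged sketch of both adequacy checks using the standard textbook formulas (a quick NumPy/SciPy translation; implementations such as R's `psych` package may differ in detail):

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test: H0 = the correlation matrix is an identity (nothing to factor)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, df)            # significant p -> structure worth factoring

def kmo(X):
    """KMO: squared correlations relative to squared correlations + squared partials."""
    R = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d                              # partial (anti-image) correlations
    np.fill_diagonal(partial, 0)
    np.fill_diagonal(R, 0)                          # exclude the diagonal from both sums
    return (R ** 2).sum() / ((R ** 2).sum() + (partial ** 2).sum())
```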
How many factors? — five criteria. (1) A priori — theory-determined (CFA-style). Parsimony vs representativeness trade-off. (2) Kaiser rule — keep eigenvalues > 1. Crude; often over-extracts. (3) Scree plot — eigenvalues vs factor #; retain factors *above the elbow*. (4) Parallel analysis — generate random data of the same shape, compute random eigenvalues, retain only factors with observed eigenvalue > random benchmark. Most reliable. (5) % variance criterion — retain enough factors for target cumulative variance (~95% natural sciences, > 60% social).
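A hedged sketch of criterion (4), parallel analysis, in plain NumPy (the 95th-percentile threshold and the number of simulations are conventional choices, not fixed rules):

```python
import numpy as np

def parallel_analysis(X, n_sims=200, percentile=95, seed=0):
    """Suggest how many factors/components to retain.

    Compares eigenvalues of the observed correlation matrix against the chosen
    percentile of eigenvalues from random normal data of the same n x p shape.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]   # largest first

    rand_eig = np.empty((n_sims, p))
    for s in range(n_sims):
        Z = rng.normal(size=(n, p))                                    # pure noise, same shape
        rand_eig[s] = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
    threshold = np.percentile(rand_eig, percentile, axis=0)

    keep = obs_eig > threshold
    # Stop at the first failure so later chance crossings are not counted.
    n_keep = int(np.argmin(keep)) if not keep.all() else p
    return n_keep, obs_eig, threshold
```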
Rotation — for interpretability. After extraction, rotate to make loadings interpretable. *Orthogonal (varimax)*: factors stay uncorrelated, so each loading can be read independently as an item–factor correlation. *Oblique (oblimin / promax)*: factors can correlate; allows real-world constructs that overlap.
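A compact sketch of Kaiser's classic varimax iteration (a widely circulated recipe, shown only to make the rotation idea concrete; `loadings` is any p × k loading matrix from a prior extraction):

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a p x k loading matrix so each variable loads strongly on few factors."""
    p, k = loadings.shape
    R = np.eye(k)                                   # accumulated rotation matrix
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # SVD step of Kaiser's criterion: maximise the variance of squared loadings per factor.
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        )
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:      # stop when the criterion stops improving
            break
    return loadings @ R
```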
PCA — the variance-maximising cousin. Construct new variables (components) that are linear combinations of originals, ordered such that: PC1 captures maximum variance; PC2 is orthogonal to PC1 and captures max remaining variance; etc. First few components capture most variance → keep them, discard the rest. Mathematically: PCs are the eigenvectors of the covariance matrix; eigenvalues = variance per component.
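A minimal NumPy sketch of PCA by eigendecomposition of the covariance matrix (standardising first, as noted under Common mistakes; `X` is any numeric data matrix):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via eigendecomposition: returns component scores and variance explained."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # standardise so units don't dominate
    cov = np.cov(Z, rowvar=False)                        # p x p covariance (= correlation here)
    eigvals, eigvecs = np.linalg.eigh(cov)               # returned in ascending order
    order = np.argsort(eigvals)[::-1]                    # largest eigenvalue first = PC1
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = Z @ eigvecs[:, :n_components]               # project observations onto the PCs
    explained = eigvals[:n_components] / eigvals.sum()   # proportion of variance per component
    return scores, explained
```

Keeping the first few components until the cumulative `explained` passes a target threshold reproduces the % variance retention criterion listed above.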
Component scores and loadings. *Component scores:* each participant's projection onto each PC. *Component loadings:* correlation between each original variable and each component. High loadings tell you which variables contribute most.
PCA vs FA — the crucial distinction. *PCA:* no latent model; all variance partitioned into orthogonal components; *components are mathematical artefacts*. *FA:* latent-variable model with common variance and unique variance + error; *factors are theoretical constructs*. PCA reduces dimensions; FA models latent structure. Use PCA for compression / visualisation / preprocessing; use FA for psychometric scale validation.
PCA assumptions. At least interval-level data; linear relationships; KMO > 0.6; Bartlett's test significant; approximately Normal data (for significance testing on components); no major outliers.
Fit indices for CFA. *CFI* > 0.95 (good fit). *RMSEA* < 0.06 (good), < 0.08 acceptable. *SRMR* < 0.08. *χ²/df* < 2 (3 acceptable). Report multiple — single indices can mislead.
Definitions
- Multicollinearity — High correlation among *predictors* (not predictor-outcome). Inflates SEs of coefficients; signs can flip.
- Variance Inflation Factor (VIF) — $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the R² from regressing predictor j on the others. VIF > 5–10 is severe.
- SMC (Squared Multiple Correlation) — Maximal proportion of variance in a predictor explained by the others. $\mathrm{SMC}_j = 1 - 1/[\mathbf{R}^{-1}]_{jj} = R_j^2$.
- Curse of dimensionality — Data needs grow exponentially with the number of variables. Motivates dimensionality reduction.
- Factor Analysis (FA) — Latent-variable model: observed variables caused by unobserved factors + unique error. Models shared variance only.
- Factor loading — $\lambda_{ij}$ — correlation between variable i and factor j. > 0.4 = strong; cross-loadings < 0.3.
- Communality $h^2$ — Sum of squared loadings of an item — proportion of variance explained by common factors.
- EFA (Exploratory Factor Analysis) — Data-driven, no prior structure. Discover how many factors fit.
- CFA (Confirmatory Factor Analysis) — Theory-driven, pre-specified factor structure. Test fit on independent data.
- Principal Component Analysis (PCA) — Orthogonal linear combinations of variables maximising variance. No latent model; data reduction.
- Eigenvalue — Variance captured by a component / factor. Sum of eigenvalues = total variance.
- Scree plot — Eigenvalues vs factor #. Retain factors *above the elbow*.
- Kaiser rule — Retain factors with eigenvalue > 1. Crude; over-extracts in practice.
- Parallel analysis — Retain factors whose eigenvalues exceed those of random data of the same shape. Best practice.
- KMO (Kaiser-Meyer-Olkin) — Sampling adequacy measure; should be > 0.6 (preferably > 0.8) for FA / PCA.
- Bartlett's test of sphericity — Test that the correlation matrix is not an identity — should be significant for FA / PCA to be appropriate.
- Varimax rotation — Orthogonal rotation; factors stay uncorrelated; simpler simple structure.
- Oblimin / Promax rotation — Oblique rotation; factors can correlate. Appropriate when constructs overlap in reality.
- Heywood case — Factor loading ≥ 1 (impossible for correlation). Indicates misspecification or too little data.
- CFA fit indices — CFI > 0.95, RMSEA < 0.06, SRMR < 0.08, χ²/df < 2-3 for good fit.
Formulas
Derivations
Why VIF measures multicollinearity inflation. In OLS, $\mathrm{Var}(\hat\beta_j) = \frac{\sigma^2}{(n-1)\,s_{x_j}^2} \cdot \frac{1}{1 - R_j^2}$, where $R_j^2$ is the R² from regressing $x_j$ on the other predictors. The factor $\frac{1}{1 - R_j^2} = \mathrm{VIF}_j$ inflates the variance of $\hat\beta_j$. At $R_j^2 = 0$ (no multicollinearity): VIF = 1, no inflation. At $R_j^2 = 0.9$: VIF = 10, and the SE grows by $\sqrt{10} \approx 3.2\times$ — a roughly 3× less precise estimate. Hence the rule of thumb that VIF > 5 or 10 signals a problem.
Why PC1 captures maximum variance. Want to find a unit vector $\mathbf{w}$ such that $\mathrm{Var}(X\mathbf{w}) = \mathbf{w}^\top \Sigma \mathbf{w}$ is maximised subject to $\mathbf{w}^\top \mathbf{w} = 1$. Using Lagrange multipliers: $\Sigma\mathbf{w} = \lambda\mathbf{w}$ — $\mathbf{w}$ is an eigenvector of $\Sigma$. The maximum variance is achieved at the eigenvector with the largest eigenvalue. PC2 is the next eigenvector (next-largest λ), constrained orthogonal to PC1. Etc.
Communality = sum of squared loadings. In orthogonal FA, $\mathrm{Var}(x_i) = \sum_j \lambda_{ij}^2 + \psi_i$. Hence the proportion explained by common factors: $h_i^2 = \sum_j \lambda_{ij}^2$ (for standardised $x_i$) — the communality.
PCA vs FA mathematical difference. PCA: $X = TW^\top$, where the columns of $W$ are orthogonal axes (eigenvectors) and $T$ holds the component scores; *all* variance is decomposed; no error term. FA: $X = F\Lambda^\top + E$, where $F$ is the latent factor matrix, $\Lambda$ is the loading matrix, $E$ is unique variance + error; only *common* variance is modelled. **PCA eigendecomposes the full correlation/covariance matrix; FA (principal-axis) eigendecomposes the *reduced* matrix** (with communalities replacing the diagonal 1's).
Parallel analysis logic. For uncorrelated random data of size $n \times p$, the largest eigenvalue is slightly > 1 by chance (eigenvalues spread above 1 just from random fluctuation). Parallel analysis simulates the random-data eigenvalue distribution and uses the 95th percentile as the threshold. This is more conservative than Kaiser's '> 1' rule, which retains noise factors.
Examples
- Multicollinearity in income regression. Income = β₀ + β₁·Education + β₂·Experience + β₃·Age + ε. VIFs: 9.8, 11.3, 12.1 — severe. Age and Experience are highly correlated (older people have more experience). Coefficient on Age comes out non-significant even though it theoretically matters — its unique variance contribution is tiny once Experience is in. Remedies: drop Age, or combine into 'career stage' composite.
- Airline survey FA. 20 items load on 3 factors after EFA + varimax rotation. F1 (post-boarding experience): seat comfort, in-flight service, food quality. F2 (booking): website usability, loyalty programs, app. F3 (competitive advantage): price, route options, on-time performance. Each item loads > 0.6 on one factor, < 0.3 elsewhere — clean simple structure.
- SPSS anxiety scale FA. Items: 'I feel nervous running SPSS', 'I avoid stats homework', 'I freeze at output'. EFA reveals 1 latent factor — 'SPSS anxiety' — with α = 0.92 internal consistency. Factor scores can be used in subsequent analyses.
- Parallel analysis worked example. 141 individuals surveyed; FA extraction. Real-data eigenvalues: 4.2, 2.1, 1.6, 1.3, 1.1, 0.9, 0.8. Random-data 95th percentile eigenvalues: 1.5, 1.4, 1.3, 1.2, 1.1, 1.0. Compare element-wise: keep factors where real > random → 3 factors (4.2 > 1.5, 2.1 > 1.4, 1.6 > 1.3; the comparison fails at the fourth factor and extraction stops). Kaiser would have kept 5 — over-extraction.
- PCA on food consumption. UK regions × food types matrix. PC1 captures overall North-South diet pattern (fish, fresh fruit, alcohol). PC2 captures Wales-specific cluster. Together explain 80% of variance — reduce 17 food variables to 2 components for visualisation.
- Heywood case. EFA produces a factor loading of 1.04 — impossible since loadings = correlations are bounded in [−1, 1]. Indicates model misspecification or insufficient data. Either drop the variable or re-specify model.
Diagrams
- Correlation heat map of predictors. Coloured matrix; |r| > 0.7 cells highlighted in red — visual multicollinearity check.
- Scree plot. Eigenvalues descending on y-axis vs factor # on x-axis; elbow marked; parallel-analysis horizontal line above noise.
- PCA on 2D data. Scatter of two correlated variables; PC1 axis along main correlation direction; PC2 perpendicular. Project points onto each.
- Factor model path diagram. Latent factor ovals → observed item rectangles with loadings; each item has an error term (epsilon).
- EFA vs CFA visual. EFA: items connected to all factors (all loadings estimated). CFA: items connected only to pre-specified factors (constrained model).
- Rotation effect. Same factor solution with un-rotated loadings (hard to interpret), varimax (orthogonal, cleaner), oblimin (oblique, factors allowed to correlate).
Edge cases
- Heywood case — factor loading ≥ 1 (impossible since loadings are correlations) — indicates model misspecification or insufficient data.
- PCA vs FA confusion — PCA produces components capturing *all* variance; FA factors capture only *shared* variance. Different interpretations.
- Small samples (n < 100) are unreliable for factor analysis — need 5–10 participants per item; preferably 200+.
- Negative loadings in 'reverse-keyed' items must be flagged before computing composite scores. Reverse-code them.
- Orthogonal rotation when factors should be correlated → forced orthogonality misrepresents the structure. Try oblimin.
- Kaiser rule over-extracts for surveys with many items — use parallel analysis instead.
- Multicollinearity is most severe when it involves linear dependencies among three or more predictors (e.g., X, Y, and X+Y in the same model) — pairwise correlations can miss it. Use condition indices in addition to VIF.
Common mistakes
- Reporting individual coefficients while ignoring VIF — multicollinear coefficients are uninterpretable individually.
- Confusing PCA components with FA factors — different models, different goals.
- Using EFA results as confirmatory — overfits to the sample; need an independent CFA.
- Treating Kaiser rule (eigenvalue > 1) as definitive — over-extracts. Use parallel analysis.
- Computing FA on Likert data without checking for non-normality — heavy ceiling effects break the model.
- Ignoring cross-loadings > 0.3 — indicates items don't clearly belong to one factor.
- Computing PCA on un-standardised variables with different units — variables with larger units dominate. Standardise first.
- Using single fit index in CFA — report CFI, RMSEA, SRMR together.
Shortcuts
- VIF = 1/(1 − R²_j). > 5–10 is severe.
- PCA = variance maximisation (no latent variables, all variance).
- FA = latent-variable model (common variance only with error).
- EFA → discover; CFA → test.
- Parallel analysis > Kaiser > scree for choosing # factors.
- Varimax (orthogonal) when factors uncorrelated; oblimin / promax (oblique) when factors can correlate.
- KMO > 0.6 and Bartlett significant for adequacy.
- CFA fit: CFI > 0.95, RMSEA < 0.06, SRMR < 0.08.
- Multicollinearity remedies: drop, combine, ridge, more data, centre.
Proofs / Algorithms
**VIF derivation.** In OLS regression, $\mathrm{Var}(\hat{\boldsymbol\beta}) = \sigma^2 (X^\top X)^{-1}$. The j-th diagonal entry can be written as $\mathrm{Var}(\hat\beta_j) = \frac{\sigma^2}{(n-1)\,s_{x_j}^2} \cdot \frac{1}{1 - R_j^2}$, where $R_j^2$ is from regressing $x_j$ on the other predictors. Hence $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$. With no multicollinearity ($R_j^2 = 0$): $\mathrm{Var}(\hat\beta_j) = \frac{\sigma^2}{(n-1)\,s_{x_j}^2}$ — the baseline. Otherwise the factor $\frac{1}{1 - R_j^2}$ multiplicatively inflates it.
PCA via eigendecomposition. Maximise $\mathbf{w}^\top \Sigma \mathbf{w}$ subject to $\mathbf{w}^\top \mathbf{w} = 1$. Lagrangian: $\mathcal{L} = \mathbf{w}^\top \Sigma \mathbf{w} - \lambda(\mathbf{w}^\top \mathbf{w} - 1)$. Setting $\partial\mathcal{L}/\partial\mathbf{w} = 2\Sigma\mathbf{w} - 2\lambda\mathbf{w} = 0$ gives $\Sigma\mathbf{w} = \lambda\mathbf{w}$. So $\mathbf{w}$ is an eigenvector and $\lambda$ is the corresponding eigenvalue. The variance captured is $\mathbf{w}^\top \Sigma \mathbf{w} = \lambda$. The first PC is the eigenvector with the *largest* eigenvalue; subsequent PCs are next-largest under orthogonality.
Communality formula. Under the orthogonal factor model with unit-variance, uncorrelated factors: $\mathrm{Var}(x_i) = \sum_j \lambda_{ij}^2 + \psi_i$ (using $\mathrm{Var}(F_j) = 1$ and $\mathrm{Cov}(F_j, F_k) = 0$ for $j \neq k$, $\mathrm{Cov}(F_j, \varepsilon_i) = 0$). The proportion of $\mathrm{Var}(x_i)$ explained by common factors: $h_i^2 = \sum_j \lambda_{ij}^2 / \mathrm{Var}(x_i)$. If we standardise so $\mathrm{Var}(x_i) = 1$: $h_i^2 = \sum_j \lambda_{ij}^2$.