mtcars — 32 cars, 5 variables: mpg, hp, wt, qsec, dispa) Compute the correlation matrix. Which pair of variables has the strongest correlation?
b) Run PCA with prcomp(scale.=TRUE). What are the eigenvalues? How much variance does each PC explain?
c) Apply the Kaiser criterion. How many components should you keep?
d) Look at the PC1 loadings. What does this component represent?
cars <- mtcars[, c("mpg", "hp", "wt", "qsec", "disp")]
round(cor(cars), 2)
mpg hp wt qsec disp mpg 1.00 -0.78 -0.87 0.42 -0.85 hp -0.78 1.00 0.66 -0.71 0.79 wt -0.87 0.66 1.00 -0.17 0.89 qsec 0.42 -0.71 -0.17 1.00 -0.43 disp -0.85 0.79 0.89 -0.43 1.00
Answer: The strongest correlation is between wt and disp (r = 0.89).
pca1 <- prcomp(cars, scale. = TRUE) eigenvalues <- pca1$sdev^2 variance_pct <- round(eigenvalues / sum(eigenvalues) * 100, 1) data.frame( PC = 1:5, Eigenvalue = round(eigenvalues, 2), Variance_pct = variance_pct, Cumulative = cumsum(variance_pct) )
PC Eigenvalue Variance_pct Cumulative 1 1 3.64 72.8 72.8 2 2 0.86 17.2 90.0 3 3 0.28 5.6 95.6 4 4 0.15 3.0 98.6 5 5 0.07 1.4 100.0
Answer: PC1 explains 72.8% of total variance. The first two components together capture 90.0%.
cat("Components with eigenvalue > 1:", sum(eigenvalues > 1), "\n")
Components with eigenvalue > 1: 1
Answer: Only PC1 has eigenvalue > 1 (3.64). The Kaiser criterion suggests retaining 1 component. The scree plot may suggest 2 (an "elbow" after PC2), so 2 PCs is also defensible given the 90% cumulative variance.
round(pca1$rotation[, 1], 2)
mpg hp wt qsec disp -0.48 0.46 0.49 -0.25 0.50
Answer: PC1 loads positively on hp, wt, disp (power and size) and negatively on mpg (fuel efficiency). It represents a "car size and power" dimension: big, powerful cars score high; small, efficient cars score low.
USArrests — 50 US states, 4 variables: Murder, Assault, UrbanPop, Rapea) Run PCA (scaled). How many PCs are needed to explain at least 80% of variance?
b) Produce a biplot. Which variables cluster together? What is the direction of UrbanPop relative to the crime variables?
c) Which 3 states score highest on PC1? Which 3 score lowest? What do these groups represent?
d) Run PCA again without scaling. How do the results differ and why?
pca2 <- prcomp(USArrests, scale. = TRUE) ev2 <- pca2$sdev^2 cum_var <- cumsum(ev2 / sum(ev2)) * 100 data.frame( PC = 1:4, Eigenvalue = round(ev2, 3), Cumulative = round(cum_var, 1) )
PC Eigenvalue Cumulative 1 1 2.480 62.0 2 2 0.990 86.8 3 3 0.357 95.7 4 4 0.173 100.0
Answer: 2 PCs are needed to exceed 80% (PC1+PC2 ≈ 86.8%).
biplot(pca2, scale = 0, cex = 0.6)
Answer: Murder, Assault, and Rape point in similar directions (positively correlated; they form a "crime" cluster). UrbanPop points in a clearly distinct direction, nearly perpendicular to the crime cluster, indicating it captures a separate dimension (urbanisation) largely independent of violent crime rates.
scores <- pca2$x[, 1] sort(scores, decreasing = TRUE)[1:3] # Highest sort(scores)[1:3] # Lowest
# Highest PC1 (high crime): Florida North Carolina Maryland 2.986245 2.845432 2.398509 # Lowest PC1 (low crime): North Dakota Vermont West Virginia -2.990736 -2.566988 -2.085710
Answer: States with the highest PC1 scores have high violent crime rates. States with the lowest scores are rural, low-crime states. PC1 is effectively a "violent crime intensity" index.
pca2_unscaled <- prcomp(USArrests, scale. = FALSE) ev_unscaled <- pca2_unscaled$sdev^2 round(ev_unscaled / sum(ev_unscaled) * 100, 1)
[1] 96.6 2.7 0.6 0.1
Assault has a far larger numeric range (45–337) than the other variables. Without scaling, PCA is dominated by raw variance, so the first PC mostly captures Assault. With scaling, all variables contribute equally, and PC1 reflects a balanced "crime intensity" factor.First generate the data:
set.seed(42)
n <- 200
F_lat <- matrix(rnorm(n * 3), n, 3)
Lambda <- matrix(0.15, 9, 3)
Lambda[1:3, 1] <- c(0.75, 0.70, 0.65)
Lambda[4:6, 2] <- c(0.72, 0.68, 0.74)
Lambda[7:9, 3] <- c(0.70, 0.73, 0.67)
X <- F_lat %*% t(Lambda) + matrix(rnorm(n * 9, 0, 0.4), n, 9)
colnames(X) <- c("Math","Reading","Science",
"Teamwork","Communication","Leadership",
"Speed","Endurance","Strength")
data3 <- as.data.frame(X)
a) Run PCA and EFA (3 factors, varimax). Compare the PC1 loadings vs. the first factor loadings. Which method gives a cleaner grouping?
b) Check factorability: compute the KMO statistic and run Bartlett’s test of sphericity. Are the conditions met for EFA?
c) Report the communalities from the EFA solution. What do they tell you about each variable?
d) In 2 sentences, describe the key conceptual difference between PCA and EFA.
library(psych) # PCA loadings on PC1 pca3 <- prcomp(data3, scale. = TRUE) round(pca3$rotation[, 1:3], 2) # EFA with 3 factors and varimax rotation efa3 <- fa(data3, nfactors = 3, rotate = "varimax", fm = "ml") print(efa3$loadings, cutoff = 0.3)
# PCA: PC1 is a general factor that loads on every variable
PC1 PC2 PC3
Math -0.33 -0.45 -0.17
Reading -0.30 -0.48 -0.15
Science -0.31 -0.45 -0.14
Teamwork -0.33 0.35 -0.35
Communication -0.33 0.32 -0.34
Leadership -0.34 0.35 -0.29
Speed -0.36 0.08 0.43
Endurance -0.34 0.12 0.46
Strength -0.35 0.07 0.47
# EFA: Clean simple structure (loadings < 0.30 suppressed)
Loadings:
ML1 ML2 ML3
Math 0.879
Reading 0.815
Science 0.799
Teamwork 0.876
Communication 0.829
Leadership 0.855
Speed 0.861
Endurance 0.872
Strength 0.876
ML1 ML2 ML3
SS loadings 2.440 2.353 2.238
Proportion Var 0.271 0.261 0.249
Cumulative Var 0.271 0.533 0.781
Answer: EFA gives a clean "simple structure" — each variable loads strongly (0.80–0.88) on exactly one of the three latent factors, matching the data-generating design. PCA's PC1 instead spreads roughly equal loadings (≈ −0.33) across all nine variables, behaving as a general factor. (PCA sign is arbitrary; flipping the sign does not change the interpretation.) The three EFA factors together explain 78.1 % of the total variance.
KMO(data3) cortest.bartlett(cor(data3), n = nrow(data3))
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = data3)
Overall MSA = 0.82
MSA for each item =
Math Reading Science Teamwork Communication
0.79 0.81 0.84 0.80 0.84
Leadership Speed Endurance Strength
0.83 0.84 0.83 0.82
$chisq
[1] 1321.844
$p.value
[1] 2.333859e-254
$df
[1] 36
Answer: KMO = 0.82 (> 0.8, "good"; per-item values all > 0.79) and Bartlett's test is highly significant (chi² = 1321.8, df = 36, p ≈ 10⁻²⁵⁴). The correlation matrix is far from an identity matrix — EFA is appropriate.
round(efa3$communalities, 2)
Math Reading Science Teamwork Communication
0.83 0.70 0.69 0.81 0.75
Leadership Speed Endurance Strength
0.80 0.82 0.81 0.82
Answer: Communalities range from 0.69 (Science) to 0.83 (Math). The 3 factors collectively explain 69–83 % of each variable's variance — the remainder is unique variance plus measurement error. Uniformly high communalities indicate that all nine variables are well-represented by the factor solution.
Answer: PCA is a data-reduction technique that creates new uncorrelated composites ($\text{PCs} = \text{linear combinations of observed variables}$) maximising total explained variance, with no underlying model assumptions. EFA is a latent variable model that assumes observed variables are caused by a smaller set of unobserved factors plus unique error ($X = \Lambda F + \varepsilon$); its goal is to uncover the underlying factor structure, not simply to compress variance.
First generate the data:
set.seed(42)
n <- 200
F_lat <- matrix(rnorm(n * 3), n, 3)
Lambda <- matrix(0.1, 12, 3)
Lambda[1:4, 1] <- c(0.78, 0.72, 0.69, 0.74)
Lambda[5:8, 2] <- c(0.75, 0.71, 0.77, 0.68)
Lambda[9:12, 3] <- c(0.73, 0.76, 0.70, 0.72)
X <- F_lat %*% t(Lambda) + matrix(rnorm(n * 12, 0, 0.35), n, 12)
colnames(X) <- paste0("Q", 1:12)
data4 <- as.data.frame(X)
a) Check whether EFA is appropriate: KMO statistic and Bartlett’s test. Interpret the results.
b) Use parallel analysis to determine the number of factors. How many does it suggest?
c) Run EFA with the suggested number of factors, once with varimax and once with promax rotation. Report the factor correlation matrix for promax. Are the factors correlated?
d) Plot the factor scores of the first two factors. Do the observations show any distinct grouping?
library(psych) KMO(data4) cortest.bartlett(cor(data4), n = nrow(data4))
Kaiser-Meyer-Olkin factor adequacy Call: KMO(r = data4) Overall MSA = 0.88 MSA for each item = Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 0.86 0.89 0.90 0.86 0.87 0.88 0.86 0.89 0.86 0.88 0.87 0.92 $chisq [1] 2335.253 $p.value [1] 0 $df [1] 66
Answer: KMO = 0.88 (> 0.8, "meritorious"; every item also above 0.86) and Bartlett's test is highly significant (chi² = 2335.3, df = 66, p < 10⁻³⁰⁰; reported as 0 by R). The 12 items share enough variance to support a factor model.
fa.parallel(data4, fm = "ml", fa = "fa", n.iter = 100)
Parallel analysis suggests that the number of factors = 3 and the number of components = NA
Answer: Parallel analysis compares the eigenvalues of the actual correlation matrix to those obtained from random data of the same size. It suggests 3 factors, matching the true data-generating structure. (The "components = NA" message appears because we passed fa = "fa" and so did not request a parallel analysis for components.)
# Varimax (orthogonal — assumes uncorrelated factors) efa_var <- fa(data4, nfactors = 3, rotate = "varimax", fm = "ml") print(efa_var$loadings, cutoff = 0.3) # Promax (oblique — allows correlated factors) efa_pro <- fa(data4, nfactors = 3, rotate = "promax", fm = "ml") print(efa_pro$loadings, cutoff = 0.3) # Factor correlations (promax only) round(efa_pro$Phi, 2)
# Varimax loadings (cutoff = 0.30; columns reordered by SS contribution)
Loadings:
ML1 ML3 ML2
Q1 0.911
Q2 0.871
Q3 0.865
Q4 0.901
Q5 0.882
Q6 0.881
Q7 0.897
Q8 0.866
Q9 0.913
Q10 0.900
Q11 0.907
Q12 0.851
ML1 ML3 ML2
SS loadings 3.330 3.256 3.226
Proportion Var 0.277 0.271 0.269
Cumulative Var 0.277 0.549 0.818
# Promax loadings (cutoff = 0.30) — slightly larger because oblique
# rotation lets each factor "claim" its variables more strongly
Loadings:
ML1 ML3 ML2
Q1 0.919
Q2 0.883
Q3 0.872
Q4 0.912
Q5 0.905
Q6 0.893
Q7 0.921
Q8 0.879
Q9 0.942
Q10 0.917
Q11 0.937
Q12 0.851
ML1 ML3 ML2
SS loadings 3.339 3.255 3.219
Proportion Var 0.278 0.271 0.268
Cumulative Var 0.278 0.550 0.818
# Factor correlations (Phi matrix) — promax only
ML1 ML3 ML2
ML1 1.00 0.35 0.24
ML3 0.35 1.00 0.22
ML2 0.24 0.22 1.00
Answer: Both rotations produce a clean simple structure — each item loads strongly (≈ 0.85–0.94) on exactly one factor. However, the promax Phi matrix reveals moderate correlations among the latent factors (Phi = 0.22 to 0.35). The orthogonality assumption that varimax imposes is therefore mildly violated, and the slightly larger promax loadings are the consequence of letting each factor explain its items without sharing variance with the others. Promax is the more honest rotation for these data; varimax remains acceptable when one wants a strictly orthogonal solution for downstream interpretation.
scores4 <- factor.scores(data4, efa_var)$scores
plot(scores4[, "ML1"], scores4[, "ML2"],
xlab = "Factor 1 Score", ylab = "Factor 2 Score",
main = "EFA Factor Scores: F1 vs F2",
col = adjustcolor("steelblue", alpha.f = 0.6), pch = 16)
abline(h = 0, v = 0, lty = 2, col = "gray60")
Answer: Varimax is an orthogonal rotation, so its factor scores are mechanically uncorrelated regardless of the underlying factor correlations. The F1 vs. F2 plot is therefore a diffuse, roughly circular cloud with no distinct clusters — exactly what we expect from independent standard-normal scores in a sample of 200 with no group-level structure. (If the data had come from distinct subpopulations, you would see visible clusters even after orthogonal rotation.)