R Exercises: PCA & EFA

4 hands-on exercises with solutions

Exercise 1: PCA Basics

Dataset: mtcars — 32 cars, 5 variables: mpg, hp, wt, qsec, disp

a) Compute the correlation matrix. Which pair of variables has the strongest correlation?

b) Run PCA with prcomp(scale.=TRUE). What are the eigenvalues? How much variance does each PC explain?

c) Apply the Kaiser criterion. How many components should you keep?

d) Look at the PC1 loadings. What does this component represent?

Show Solution

Task (a): Correlation matrix

cars <- mtcars[, c("mpg", "hp", "wt", "qsec", "disp")]
round(cor(cars), 2)
       mpg    hp    wt  qsec  disp
mpg   1.00 -0.78 -0.87  0.42 -0.85
hp   -0.78  1.00  0.66 -0.71  0.79
wt   -0.87  0.66  1.00 -0.17  0.89
qsec  0.42 -0.71 -0.17  1.00 -0.43
disp -0.85  0.79  0.89 -0.43  1.00

Answer: The strongest correlation is between wt and disp (r = 0.89).

Task (b): Run PCA

pca1 <- prcomp(cars, scale. = TRUE)
eigenvalues <- pca1$sdev^2
variance_pct <- round(eigenvalues / sum(eigenvalues) * 100, 1)
data.frame(
  PC         = 1:5,
  Eigenvalue  = round(eigenvalues, 2),
  Variance_pct = variance_pct,
  Cumulative  = cumsum(variance_pct)
)
  PC Eigenvalue Variance_pct Cumulative
1  1       3.64         72.8       72.8
2  2       0.86         17.2       90.0
3  3       0.28          5.6       95.6
4  4       0.15          3.0       98.6
5  5       0.07          1.4      100.0

Answer: PC1 explains 72.8% of total variance. The first two components together capture 90.0%.

Task (c): Kaiser criterion

cat("Components with eigenvalue > 1:", sum(eigenvalues > 1), "\n")
Components with eigenvalue > 1: 1

Answer: Only PC1 has eigenvalue > 1 (3.64). The Kaiser criterion suggests retaining 1 component. The scree plot may suggest 2 (an "elbow" after PC2), so 2 PCs is also defensible given the 90% cumulative variance.

Task (d): Interpret PC1 loadings

round(pca1$rotation[, 1], 2)
  mpg    hp    wt  qsec  disp
-0.48  0.46  0.49 -0.25  0.50

Answer: PC1 loads positively on hp, wt, disp (power and size) and negatively on mpg (fuel efficiency). It represents a "car size and power" dimension: big, powerful cars score high; small, efficient cars score low.

Exercise 2: PCA for Data Reduction (USArrests)

Dataset: USArrests — 50 US states, 4 variables: Murder, Assault, UrbanPop, Rape

a) Run PCA (scaled). How many PCs are needed to explain at least 80% of variance?

b) Produce a biplot. Which variables cluster together? What is the direction of UrbanPop relative to the crime variables?

c) Which 3 states score highest on PC1? Which 3 score lowest? What do these groups represent?

d) Run PCA again without scaling. How do the results differ and why?

Show Solution

Task (a): Variance explained

pca2 <- prcomp(USArrests, scale. = TRUE)
ev2 <- pca2$sdev^2
cum_var <- cumsum(ev2 / sum(ev2)) * 100
data.frame(
  PC         = 1:4,
  Eigenvalue  = round(ev2, 3),
  Cumulative  = round(cum_var, 1)
)
  PC Eigenvalue Cumulative
1  1      2.480       62.0
2  2      0.990       86.8
3  3      0.357       95.7
4  4      0.173      100.0

Answer: 2 PCs are needed to exceed 80% (PC1+PC2 ≈ 86.8%).

Task (b): Biplot

biplot(pca2, scale = 0, cex = 0.6)

Answer: Murder, Assault, and Rape point in similar directions (positively correlated; they form a "crime" cluster). UrbanPop points in a clearly distinct direction, nearly perpendicular to the crime cluster, indicating it captures a separate dimension (urbanisation) largely independent of violent crime rates.

Task (c): Top and bottom states on PC1

scores <- pca2$x[, 1]
sort(scores, decreasing = TRUE)[1:3]   # Highest
sort(scores)[1:3]                       # Lowest
# Highest PC1 (high crime):
   Florida North Carolina       Maryland
  2.986245       2.845432       2.398509

# Lowest PC1 (low crime):
North Dakota      Vermont West Virginia
  -2.990736    -2.566988    -2.085710

Answer: States with the highest PC1 scores have high violent crime rates. States with the lowest scores are rural, low-crime states. PC1 is effectively a "violent crime intensity" index.

Task (d): Unscaled vs. scaled PCA

pca2_unscaled <- prcomp(USArrests, scale. = FALSE)
ev_unscaled <- pca2_unscaled$sdev^2
round(ev_unscaled / sum(ev_unscaled) * 100, 1)
[1] 96.6  2.7  0.6  0.1
Why the difference? Assault has a far larger numeric range (45–337) than the other variables. Without scaling, PCA is dominated by raw variance, so the first PC mostly captures Assault. With scaling, all variables contribute equally, and PC1 reflects a balanced "crime intensity" factor.

Exercise 3: PCA vs. EFA on Synthetic Multi-Factor Data

Synthetic dataset — 200 observations, 9 variables with 3 latent factors: academic ability (Math, Reading, Science), social skills (Teamwork, Communication, Leadership), physical ability (Speed, Endurance, Strength).

First generate the data:

set.seed(42)
n <- 200
F_lat <- matrix(rnorm(n * 3), n, 3)
Lambda <- matrix(0.15, 9, 3)
Lambda[1:3, 1] <- c(0.75, 0.70, 0.65)
Lambda[4:6, 2] <- c(0.72, 0.68, 0.74)
Lambda[7:9, 3] <- c(0.70, 0.73, 0.67)
X <- F_lat %*% t(Lambda) + matrix(rnorm(n * 9, 0, 0.4), n, 9)
colnames(X) <- c("Math","Reading","Science",
                  "Teamwork","Communication","Leadership",
                  "Speed","Endurance","Strength")
data3 <- as.data.frame(X)

a) Run PCA and EFA (3 factors, varimax). Compare the PC1 loadings vs. the first factor loadings. Which method gives a cleaner grouping?

b) Check factorability: compute the KMO statistic and run Bartlett’s test of sphericity. Are the conditions met for EFA?

c) Report the communalities from the EFA solution. What do they tell you about each variable?

d) In 2 sentences, describe the key conceptual difference between PCA and EFA.

Show Solution

Task (a): Loadings comparison

library(psych)

# PCA loadings on PC1
pca3 <- prcomp(data3, scale. = TRUE)
round(pca3$rotation[, 1:3], 2)

# EFA with 3 factors and varimax rotation
efa3 <- fa(data3, nfactors = 3, rotate = "varimax", fm = "ml")
print(efa3$loadings, cutoff = 0.3)
# PCA: PC1 is a general factor that loads on every variable
                PC1   PC2   PC3
Math          -0.33 -0.45 -0.17
Reading       -0.30 -0.48 -0.15
Science       -0.31 -0.45 -0.14
Teamwork      -0.33  0.35 -0.35
Communication -0.33  0.32 -0.34
Leadership    -0.34  0.35 -0.29
Speed         -0.36  0.08  0.43
Endurance     -0.34  0.12  0.46
Strength      -0.35  0.07  0.47

# EFA: Clean simple structure (loadings < 0.30 suppressed)

Loadings:
              ML1   ML2   ML3
Math                      0.879
Reading                   0.815
Science                   0.799
Teamwork            0.876
Communication       0.829
Leadership          0.855
Speed         0.861
Endurance     0.872
Strength      0.876

                 ML1   ML2   ML3
SS loadings    2.440 2.353 2.238
Proportion Var 0.271 0.261 0.249
Cumulative Var 0.271 0.533 0.781

Answer: EFA gives a clean "simple structure" — each variable loads strongly (0.80–0.88) on exactly one of the three latent factors, matching the data-generating design. PCA's PC1 instead spreads roughly equal loadings (≈ −0.33) across all nine variables, behaving as a general factor. (PCA sign is arbitrary; flipping the sign does not change the interpretation.) The three EFA factors together explain 78.1 % of the total variance.

Task (b): KMO and Bartlett

KMO(data3)
cortest.bartlett(cor(data3), n = nrow(data3))
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = data3)
Overall MSA =  0.82
MSA for each item =
         Math       Reading       Science      Teamwork Communication
         0.79          0.81          0.84          0.80          0.84
   Leadership         Speed     Endurance      Strength
         0.83          0.84          0.83          0.82

$chisq
[1] 1321.844

$p.value
[1] 2.333859e-254

$df
[1] 36

Answer: KMO = 0.82 (> 0.8, "good"; per-item values all > 0.79) and Bartlett's test is highly significant (chi² = 1321.8, df = 36, p ≈ 10⁻²⁵⁴). The correlation matrix is far from an identity matrix — EFA is appropriate.

Task (c): Communalities

round(efa3$communalities, 2)
         Math       Reading       Science      Teamwork Communication
         0.83          0.70          0.69          0.81          0.75
   Leadership         Speed     Endurance      Strength
         0.80          0.82          0.81          0.82

Answer: Communalities range from 0.69 (Science) to 0.83 (Math). The 3 factors collectively explain 69–83 % of each variable's variance — the remainder is unique variance plus measurement error. Uniformly high communalities indicate that all nine variables are well-represented by the factor solution.

Task (d): Conceptual difference

Answer: PCA is a data-reduction technique that creates new uncorrelated composites ($\text{PCs} = \text{linear combinations of observed variables}$) maximising total explained variance, with no underlying model assumptions. EFA is a latent variable model that assumes observed variables are caused by a smaller set of unobserved factors plus unique error ($X = \Lambda F + \varepsilon$); its goal is to uncover the underlying factor structure, not simply to compress variance.

Exercise 4: Full EFA Workflow (Synthetic Questionnaire)

Synthetic questionnaire — 200 respondents, 12 items (Q1–Q12) with 3 underlying factors, 4 items each.

First generate the data:

set.seed(42)
n <- 200
F_lat <- matrix(rnorm(n * 3), n, 3)
Lambda <- matrix(0.1, 12, 3)
Lambda[1:4,  1] <- c(0.78, 0.72, 0.69, 0.74)
Lambda[5:8,  2] <- c(0.75, 0.71, 0.77, 0.68)
Lambda[9:12, 3] <- c(0.73, 0.76, 0.70, 0.72)
X <- F_lat %*% t(Lambda) + matrix(rnorm(n * 12, 0, 0.35), n, 12)
colnames(X) <- paste0("Q", 1:12)
data4 <- as.data.frame(X)

a) Check whether EFA is appropriate: KMO statistic and Bartlett’s test. Interpret the results.

b) Use parallel analysis to determine the number of factors. How many does it suggest?

c) Run EFA with the suggested number of factors, once with varimax and once with promax rotation. Report the factor correlation matrix for promax. Are the factors correlated?

d) Plot the factor scores of the first two factors. Do the observations show any distinct grouping?

Show Solution

Task (a): KMO and Bartlett

library(psych)
KMO(data4)
cortest.bartlett(cor(data4), n = nrow(data4))
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = data4)
Overall MSA =  0.88
MSA for each item =
  Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9  Q10  Q11  Q12
0.86 0.89 0.90 0.86 0.87 0.88 0.86 0.89 0.86 0.88 0.87 0.92

$chisq
[1] 2335.253

$p.value
[1] 0

$df
[1] 66

Answer: KMO = 0.88 (> 0.8, "meritorious"; every item also above 0.86) and Bartlett's test is highly significant (chi² = 2335.3, df = 66, p < 10⁻³⁰⁰; reported as 0 by R). The 12 items share enough variance to support a factor model.

Task (b): Parallel analysis

fa.parallel(data4, fm = "ml", fa = "fa", n.iter = 100)
Parallel analysis suggests that the number of factors =  3  and the number of components =  NA

Answer: Parallel analysis compares the eigenvalues of the actual correlation matrix to those obtained from random data of the same size. It suggests 3 factors, matching the true data-generating structure. (The "components = NA" message appears because we passed fa = "fa" and so did not request a parallel analysis for components.)

Task (c): Varimax vs. promax rotation

# Varimax (orthogonal — assumes uncorrelated factors)
efa_var <- fa(data4, nfactors = 3, rotate = "varimax", fm = "ml")
print(efa_var$loadings, cutoff = 0.3)

# Promax (oblique — allows correlated factors)
efa_pro <- fa(data4, nfactors = 3, rotate = "promax", fm = "ml")
print(efa_pro$loadings, cutoff = 0.3)

# Factor correlations (promax only)
round(efa_pro$Phi, 2)
# Varimax loadings (cutoff = 0.30; columns reordered by SS contribution)

Loadings:
    ML1   ML3   ML2
Q1              0.911
Q2              0.871
Q3              0.865
Q4              0.901
Q5        0.882
Q6        0.881
Q7        0.897
Q8        0.866
Q9  0.913
Q10 0.900
Q11 0.907
Q12 0.851

                 ML1   ML3   ML2
SS loadings    3.330 3.256 3.226
Proportion Var 0.277 0.271 0.269
Cumulative Var 0.277 0.549 0.818

# Promax loadings (cutoff = 0.30) — slightly larger because oblique
# rotation lets each factor "claim" its variables more strongly

Loadings:
    ML1    ML3    ML2
Q1                 0.919
Q2                 0.883
Q3                 0.872
Q4                 0.912
Q5          0.905
Q6          0.893
Q7          0.921
Q8          0.879
Q9   0.942
Q10  0.917
Q11  0.937
Q12  0.851

                 ML1   ML3   ML2
SS loadings    3.339 3.255 3.219
Proportion Var 0.278 0.271 0.268
Cumulative Var 0.278 0.550 0.818

# Factor correlations (Phi matrix) — promax only
     ML1  ML3  ML2
ML1 1.00 0.35 0.24
ML3 0.35 1.00 0.22
ML2 0.24 0.22 1.00

Answer: Both rotations produce a clean simple structure — each item loads strongly (≈ 0.85–0.94) on exactly one factor. However, the promax Phi matrix reveals moderate correlations among the latent factors (Phi = 0.22 to 0.35). The orthogonality assumption that varimax imposes is therefore mildly violated, and the slightly larger promax loadings are the consequence of letting each factor explain its items without sharing variance with the others. Promax is the more honest rotation for these data; varimax remains acceptable when one wants a strictly orthogonal solution for downstream interpretation.

Task (d): Factor score plot

scores4 <- factor.scores(data4, efa_var)$scores

plot(scores4[, "ML1"], scores4[, "ML2"],
     xlab = "Factor 1 Score", ylab = "Factor 2 Score",
     main = "EFA Factor Scores: F1 vs F2",
     col = adjustcolor("steelblue", alpha.f = 0.6), pch = 16)
abline(h = 0, v = 0, lty = 2, col = "gray60")

Answer: Varimax is an orthogonal rotation, so its factor scores are mechanically uncorrelated regardless of the underlying factor correlations. The F1 vs. F2 plot is therefore a diffuse, roughly circular cloud with no distinct clusters — exactly what we expect from independent standard-normal scores in a sample of 200 with no group-level structure. (If the data had come from distinct subpopulations, you would see visible clusters even after orthogonal rotation.)

Summary of the full EFA workflow: (1) Check factorability (KMO + Bartlett), (2) Determine number of factors (parallel analysis, scree plot), (3) Extract factors (maximum likelihood or principal axis), (4) Rotate for interpretability (varimax if orthogonality is assumed, promax/oblimin otherwise), (5) Interpret factors from pattern of high loadings, (6) Report communalities and fit indices.
‹ Back to Lesson 3
Statistical Data Analysis  |  Digital AI Finance  |  BSc Data Science  |  © Joerg Osterrieder 2025–2026