Statistical Data Analysis -- Exam Practice
50 Written Questions with Full Solutions -- Designed for Deep Understanding
(a) Write down the sum of squared residuals \(SSR = \sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\).
(b) Take partial derivatives with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\), set them to zero.
(c) Solve the resulting normal equations for \(\hat{\beta}_1\).
(a) We minimize \(SSR = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\) with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
(b) Setting \(\frac{\partial SSR}{\partial \hat{\beta}_0} = 0\) gives \(\sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0\), yielding \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\).
Setting \(\frac{\partial SSR}{\partial \hat{\beta}_1} = 0\) gives \(\sum x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0\).
(c) Substituting \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\) and simplifying: \(\sum x_i(y_i - \bar{y} + \hat{\beta}_1 \bar{x} - \hat{\beta}_1 x_i) = 0\). This gives \(\sum x_i(y_i - \bar{y}) = \hat{\beta}_1 \sum x_i(x_i - \bar{x})\). Since \(\sum(x_i - \bar{x})(y_i - \bar{y}) = \sum x_i(y_i - \bar{y})\), we get \(\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}\).
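The closed-form estimates derived above can be checked numerically. A minimal sketch in plain Python, using a small made-up dataset (the points lie exactly on \(y = 1 + 2x\), so the fit recovers the line exactly):

```python
# Closed-form OLS estimates from the normal equations (pure Python sketch).
def ols_fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # beta1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = ybar - beta1 * xbar  # from the first normal equation
    return beta0, beta1

# Made-up data on the exact line y = 1 + 2x:
b0, b1 = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```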
(a) Calculate adjusted \(R^2\).
(b) Interpret both values.
(c) Is the model significantly better than the intercept-only model? Set up the F-test.
(a) \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - (1 - 0.85)\frac{49}{45} = 1 - 0.15 \times 1.089 = 1 - 0.163 = 0.837\).
(b) \(R^2 = 0.85\): 85% of variance in \(y\) is explained by the 4 predictors. Adjusted \(R^2 = 0.837\): after penalizing for 4 predictors, 83.7% of variance is explained. The small difference (0.013) suggests the predictors are genuinely useful.
(c) \(F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)} = \frac{0.85/4}{0.15/45} = \frac{0.2125}{0.00333} = 63.8\). With \(df_1 = 4\), \(df_2 = 45\), this is highly significant (\(p \ll 0.001\)). Reject \(H_0\): at least one predictor has a non-zero coefficient.
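The arithmetic in parts (a) and (c) can be reproduced directly from the given summary values (\(n = 50\), \(p = 4\), \(R^2 = 0.85\)):

```python
# Adjusted R^2 and overall F statistic from the solution above.
n, p, r2 = 50, 4, 0.85
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)        # ~0.837
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))          # ~63.75
```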
\(R^2\) measures the proportion of variance explained by the model. It always increases (or stays the same) when predictors are added, even if they are irrelevant.
Adjusted \(R^2\) penalizes for the number of predictors: \(R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p-1}\). It can decrease if a useless predictor is added.
They differ substantially when: (1) many predictors are used relative to sample size, (2) some predictors are irrelevant. For example, with \(n=20\) and \(p=15\), \(R^2\) could be artificially high while adjusted \(R^2\) would be much lower.
(a) Calculate the odds ratio.
(b) Interpret it in context.
(c) What happens to the odds ratio if \(\hat{\beta}_1\) is negative?
(a) Odds ratio \(= e^{\hat{\beta}_1} = e^{0.693} = 2.0\).
(b) For each one-unit increase in the predictor, the odds of the outcome are multiplied by 2 (i.e., the odds double). Note this is a statement about odds, not probability: the probability of the event does not generally double.
(c) If \(\hat{\beta}_1 < 0\), then \(OR = e^{\hat{\beta}_1} < 1\). For example, \(\hat{\beta}_1 = -0.693\) gives \(OR = 0.5\), meaning the odds are halved for each unit increase. A negative coefficient means the predictor decreases the probability of the outcome.
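A quick numerical check of parts (a) and (c):

```python
import math

# Odds ratios from the logistic coefficients in this solution.
or_pos = math.exp(0.693)    # beta = +0.693 -> odds roughly double
or_neg = math.exp(-0.693)   # beta = -0.693 -> odds roughly halve
```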
(a) Write the hazard function.
(b) Explain the proportional hazards assumption.
(c) How would you test this assumption?
(d) What is a hazard ratio of 2.5 saying?
(a) \(h(t|\mathbf{x}) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)\), where \(h_0(t)\) is the baseline hazard function (unspecified), and \(\beta_i\) are regression coefficients.
(b) The proportional hazards assumption states that the ratio of hazards for any two individuals is constant over time: \(\frac{h(t|\mathbf{x}_1)}{h(t|\mathbf{x}_2)} = \exp(\boldsymbol{\beta}^T(\mathbf{x}_1 - \mathbf{x}_2))\). The hazard functions are proportional -- they can vary over time but their ratio does not.
(c) Test using: (1) Schoenfeld residuals plotted against time -- should show no trend. (2) Formal test: correlate scaled Schoenfeld residuals with time; significant correlation violates the assumption. (3) Log-log survival plots for categorical covariates -- curves should be parallel.
(d) A hazard ratio of 2.5 means the hazard (instantaneous risk) of the event is 2.5 times higher for a one-unit increase in the covariate, at any point in time. If comparing treated vs. control, the treated group's instantaneous risk of the event is 150% higher.
1. Linearity: (i) \(E[Y|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}\). (ii) Residuals vs. fitted values plot (look for curvature); RESET test. (iii) Biased and inconsistent coefficient estimates. (iv) Add polynomial terms, use non-linear transformations, or generalized additive models.
2. Independence: (i) \(\text{Cov}(\varepsilon_i, \varepsilon_j) = 0\) for \(i \neq j\). (ii) Durbin-Watson test (\(d \approx 2\) means no autocorrelation); plot residuals over time. (iii) OLS estimates remain unbiased but standard errors are incorrect, invalidating inference. (iv) Use GLS, Newey-West HAC standard errors, or model the autocorrelation structure.
3. Homoscedasticity: (i) \(\text{Var}(\varepsilon_i) = \sigma^2\) for all \(i\). (ii) Residuals vs. fitted plot (fan shape indicates heteroscedasticity); Breusch-Pagan or White test. (iii) OLS is still unbiased but no longer BLUE; standard errors and confidence intervals are wrong. (iv) Use weighted least squares (WLS), robust (Huber-White) standard errors, or variance-stabilizing transformations.
4. Normality: (i) \(\varepsilon_i \sim N(0, \sigma^2)\). (ii) Q-Q plot, Shapiro-Wilk test, Jarque-Bera test. (iii) Small-sample inference (t-tests, F-tests) is invalid; large-sample inference is approximately valid via CLT. (iv) Transform the response, use robust regression, or rely on large-sample asymptotics.
The Kaplan-Meier estimator is \(\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\), where \(d_i\) = events and \(n_i\) = at risk.
At \(t=1\): \(1 - \frac{2}{10} = 0.8\).
At \(t=2\): \(1 - \frac{1}{8} = 0.875\).
At \(t=3\): \(1 - \frac{3}{7} = 0.571\).
\(\hat{S}(3) = 0.8 \times 0.875 \times 0.571 = 0.400\).
Interpretation: There is an estimated 40% probability of surviving beyond time 3.
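The product-limit calculation above can be written as a short loop over the \((d_i, n_i)\) pairs:

```python
# Kaplan-Meier product over the event times used above.
def km_survival(steps):
    s = 1.0
    for d, n in steps:
        s *= 1 - d / n  # multiply in each conditional survival factor
    return s

s3 = km_survival([(2, 10), (1, 8), (3, 7)])  # S-hat(3), about 0.40
```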
(a) When should you use each?
(b) Derive the log-likelihood for logistic regression.
(c) Why can't we use OLS for binary outcomes?
(a) Linear regression: continuous outcome variable (e.g., price, weight). Logistic regression: binary outcome (0/1, yes/no). Logistic is also used for probabilities bounded in [0,1].
(b) For logistic regression with \(P(Y_i=1) = p_i = \frac{1}{1+e^{-\mathbf{x}_i^T\boldsymbol{\beta}}}\), the likelihood is \(L(\boldsymbol{\beta}) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\). The log-likelihood is \(\ell(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \log(p_i) + (1-y_i)\log(1-p_i)]\). This is maximized numerically (Newton-Raphson or IRLS) since no closed-form solution exists.
(c) OLS for binary outcomes fails because: (1) Predicted values are not bounded in [0,1], so predicted 'probabilities' can be negative or exceed 1. (2) Residuals are heteroscedastic (variance depends on \(p\)). (3) Residuals cannot be normally distributed when the outcome is binary. The linear probability model violates all OLS assumptions.
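The log-likelihood from part (b) can be evaluated at any candidate \(\boldsymbol{\beta}\). A minimal sketch on a tiny made-up dataset (scalar \(x_i\), no intercept, for brevity), showing that the log-likelihood increases as \(\beta\) moves in the direction suggested by the data:

```python
import math

# Logistic log-likelihood l(beta) = sum[y log(p) + (1-y) log(1-p)].
def log_likelihood(beta, xs, ys):
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-beta * x))  # logistic probability
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

xs, ys = [-2, -1, 1, 2], [0, 0, 1, 1]
better = log_likelihood(1.0, xs, ys) > log_likelihood(0.0, xs, ys)
```

In practice the maximizer is found numerically (Newton-Raphson or IRLS), as noted above; this sketch only evaluates the objective.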
Censoring occurs when the exact event time is unknown for some subjects. Right-censoring means we know the subject survived at least until a certain time, but the actual event time is unknown (it is to the 'right' of the observed time).
Example 1: A clinical trial ends after 5 years. A patient who is still alive at the end of the study is right-censored -- we know they survived at least 5 years, but not when (or if) they will experience the event.
Example 2: A customer churn study tracks users for 12 months. A customer who changes their phone number mid-study and can no longer be contacted (lost to follow-up) is right-censored at their last observed active date.
Censoring is a key challenge because simply ignoring censored observations would bias survival estimates downward.
(a) What threshold indicates an influential point?
(b) What should you do about it?
(a) Common thresholds for Cook's distance: \(D_i > 1\) is the traditional rule (some use \(D_i > 4/n = 4/50 = 0.08\) as a more sensitive cutoff). By either criterion, \(D_{23} = 1.2 > 1\), so observation 23 is highly influential.
(b) Steps: (1) Investigate the observation -- is it a data entry error or a genuine outlier? (2) Fit the model with and without observation 23 and compare coefficients. If they change substantially, the point is driving the results. (3) Consider robust regression methods. (4) Report results both with and without the influential point. Never simply delete observations without justification.
1. State hypotheses: \(H_0\) (null) = status quo or no effect. \(H_1\) (alternative) = what we want to show.
2. Choose significance level \(\alpha\) (typically 0.05).
3. Select and compute the test statistic: a standardized value (e.g., t, z, F) that measures how far the sample result is from the null hypothesis value.
4. Find the p-value: the probability of observing a test statistic at least as extreme as the one computed, assuming \(H_0\) is true.
5. Decision rule: If \(p \leq \alpha\), reject \(H_0\). If \(p > \alpha\), fail to reject \(H_0\). Alternatively, compare the test statistic to a critical value.
6. State conclusion in context of the problem.
(a) Set up hypotheses.
(b) Calculate the t-statistic.
(c) Find the p-value region.
(d) Conclude at \(\alpha = 0.05\).
(a) \(H_0: \mu = 500\) (widgets weigh 500g). \(H_1: \mu \neq 500\) (two-sided test).
(b) \(t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{497 - 500}{8/\sqrt{25}} = \frac{-3}{1.6} = -1.875\).
(c) With \(df = 24\), we look up \(|t| = 1.875\). The critical value at \(\alpha = 0.05\) (two-sided) is \(t_{0.025, 24} = 2.064\). Since \(1.875 < 2.064\), the p-value is between 0.05 and 0.10 (approximately \(p \approx 0.073\)).
(d) Since \(p \approx 0.073 > 0.05\), we fail to reject \(H_0\). There is insufficient evidence at the 5% level to conclude the widgets deviate from 500g.
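The test statistic in part (b), computed directly from the summary values:

```python
import math

# One-sample t statistic: t = (xbar - mu0) / (s / sqrt(n)).
xbar, mu0, s, n = 497, 500, 8, 25
t = (xbar - mu0) / (s / math.sqrt(n))  # -3 / 1.6 = -1.875
```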
(a) Define the p-value formally.
(b) Show \(P(\text{p-value} \leq \alpha \mid H_0) = \alpha\).
(a) The p-value is \(p = P(T \geq t_{obs} \mid H_0)\) for a one-sided test, where \(T\) is the test statistic and \(t_{obs}\) is the observed value. Equivalently, \(p = 1 - F(t_{obs})\) where \(F\) is the CDF of \(T\) under \(H_0\).
(b) Under \(H_0\), let \(U = F(T)\) where \(F\) is the CDF of \(T\). By the probability integral transform, \(U \sim \text{Uniform}(0,1)\) (assuming \(T\) is continuous). The p-value is \(p = 1 - U\), which is also Uniform(0,1).
Therefore: \(P(p \leq \alpha \mid H_0) = P(1 - F(T) \leq \alpha) = P(F(T) \geq 1-\alpha) = \alpha\).
This proves that under \(H_0\), the probability of getting a p-value below \(\alpha\) is exactly \(\alpha\), confirming the Type I error rate is controlled.
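The uniformity result can be checked by simulation. A sketch using a two-sided z-test on \(N(0,1)\) samples (known \(\sigma\)): the fraction of p-values at or below \(\alpha = 0.05\) should be close to 0.05.

```python
import math
import random

# Monte Carlo check: under H0, P(p <= alpha) = alpha.
random.seed(0)

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

alpha, n, reps = 0.05, 20, 5000
hits = 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / math.sqrt(n))  # z = xbar / (sigma/sqrt(n))
    p = 2 * (1 - phi(abs(z)))                   # two-sided p-value
    hits += p <= alpha
rejection_rate = hits / reps  # should be near 0.05
```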
(a) Define Type I and Type II errors in the context of this drug trial.
(b) The FDA requires power of 0.80 to detect a clinically meaningful effect of \(\delta = 5\) units with \(\sigma = 12\). What minimum sample size is needed (per group, two-sample test)?
(c) If the company uses \(\alpha = 0.01\) instead (to reduce false approvals), how does this affect power and required sample size?
(a) Type I error: Approving an ineffective drug (rejecting \(H_0: \mu_{drug} = \mu_{placebo}\) when there is truly no difference). Cost: patients take a useless drug with potential side effects. Type II error: Failing to approve an effective drug. Cost: patients miss a beneficial treatment.
(b) For a two-sample t-test: \(n = \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\delta^2} = \frac{2(1.96 + 0.842)^2 \times 144}{25} = \frac{2 \times 7.85 \times 144}{25} = \frac{2260.8}{25} \approx 91\) per group. Total: 182 subjects.
(c) With \(\alpha = 0.01\): \(z_{\alpha/2} = 2.576\), so \(n = \frac{2(2.576 + 0.842)^2 \times 144}{25} = \frac{2 \times 11.67 \times 144}{25} = \frac{3360.9}{25} \approx 135\) per group. Power decreases (for fixed \(n\)) or sample size must increase by ~48% to maintain 80% power. Stricter \(\alpha\) protects against false approvals but requires more subjects.
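The per-group sample sizes in parts (b) and (c) follow directly from the formula, rounding up to the next whole subject:

```python
import math

# n = 2 (z_{alpha/2} + z_beta)^2 sigma^2 / delta^2, rounded up.
def n_per_group(z_half_alpha, z_beta, sigma, delta):
    return math.ceil(2 * (z_half_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

n_05 = n_per_group(1.960, 0.842, 12, 5)  # alpha = 0.05 -> 91 per group
n_01 = n_per_group(2.576, 0.842, 12, 5)  # alpha = 0.01 -> 135 per group
```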
(a) State hypotheses.
(b) Interpret the result.
(c) What post-hoc test would you use?
(d) Why not just do pairwise t-tests?
(a) \(H_0: \mu_1 = \mu_2 = \mu_3\) (all methods produce the same mean outcome). \(H_1\): At least one mean differs.
(b) With \(p = 0.020 < 0.05\), we reject \(H_0\). There is significant evidence that at least one teaching method produces a different mean outcome. The \(F = 4.52\) indicates the between-group mean square is 4.52 times the within-group mean square.
(c) Tukey's HSD (Honestly Significant Difference) test for all pairwise comparisons. It controls the family-wise error rate. Alternatively, Bonferroni correction or Scheffe's method.
(d) With 3 groups, there are \(\binom{3}{2} = 3\) pairwise comparisons. Each at \(\alpha = 0.05\) gives a family-wise error rate of \(1 - (1-0.05)^3 = 0.143\). The inflated Type I error rate makes individual t-tests unreliable without correction.
(a) Show that rejecting \(H_0: \mu = \mu_0\) at level \(\alpha\) is equivalent to \(\mu_0\) being outside the \((1-\alpha)\) confidence interval.
(a) The two-sided test rejects \(H_0: \mu = \mu_0\) when \(|t| = \left|\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right| > t_{\alpha/2, n-1}\).
This is equivalent to \(\mu_0 < \bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}}\) or \(\mu_0 > \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}\).
But \(\left[\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}},\; \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}\right]\) is exactly the \((1-\alpha)\) confidence interval for \(\mu\).
Therefore, rejecting \(H_0\) at level \(\alpha\) is equivalent to \(\mu_0 \notin CI_{1-\alpha}\). This duality means every confidence interval implicitly tests all possible null values: values inside the CI are 'not rejected' and values outside are 'rejected'.
Power = \(1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})\). It is the probability of correctly detecting a real effect.
Three factors that affect power:
1. Sample size (\(n\)): Larger \(n\) increases power. More data provides more precise estimates, making it easier to detect true differences.
2. Effect size: Larger true effects are easier to detect. A difference of 10 units is easier to find than a difference of 1 unit.
3. Significance level (\(\alpha\)): Larger \(\alpha\) increases power (but also increases Type I error). Using \(\alpha = 0.10\) gives more power than \(\alpha = 0.01\).
Additional factor: variance (\(\sigma^2\)). Lower variance increases power because the signal-to-noise ratio improves.
(a) Set up hypotheses.
(b) Calculate the test statistic.
(c) Is the difference significant at \(\alpha = 0.05\)?
(a) \(H_0: p_A = p_B\) (no difference in conversion rates). \(H_1: p_A \neq p_B\) (two-sided test).
(b) Pooled proportion: \(\hat{p} = \frac{0.12 \times 500 + 0.15 \times 500}{1000} = \frac{60 + 75}{1000} = 0.135\).
Standard error: \(SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)} = \sqrt{0.135 \times 0.865 \times \frac{2}{500}} = \sqrt{0.000467} = 0.0216\).
Test statistic: \(z = \frac{0.12 - 0.15}{0.0216} = \frac{-0.03}{0.0216} = -1.389\).
(c) The critical value for \(\alpha = 0.05\) two-sided is \(z_{0.025} = 1.96\). Since \(|z| = 1.389 < 1.96\), we fail to reject \(H_0\). The 3 percentage point difference is not statistically significant at the 5% level. The p-value is approximately 0.165.
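The pooled two-proportion z-test from parts (b) and (c), computed directly:

```python
import math

# Pooled two-proportion z-test (equal group sizes here).
pA, pB, nA, nB = 0.12, 0.15, 500, 500
p_pool = (pA * nA + pB * nB) / (nA + nB)                 # 0.135
se = math.sqrt(p_pool * (1 - p_pool) * (1 / nA + 1 / nB))
z = (pA - pB) / se                                        # about -1.39
significant = abs(z) > 1.96                               # False at the 5% level
```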
(a) Name a non-parametric alternative to one-way ANOVA.
(b) What assumptions does it relax?
(c) What is the cost of using it when ANOVA assumptions hold?
(a) The Kruskal-Wallis test is the non-parametric alternative to one-way ANOVA.
(b) It relaxes: (1) Normality -- no assumption about the distribution shape. (2) Homoscedasticity -- less sensitive to unequal variances. It works on ranks rather than raw values, so it is robust to outliers and skewed distributions. It still requires independent observations.
(c) When ANOVA assumptions are met, the Kruskal-Wallis test has lower power (approximately 95.5% asymptotic relative efficiency compared to the F-test for normal data). This means you need about 5% more observations to achieve the same power. The 'cost' is a higher probability of Type II errors -- missing real differences.
(a) If you run 20 independent tests at \(\alpha = 0.05\), what is the probability of at least one Type I error?
(b) Describe the Bonferroni correction.
(a) \(P(\text{at least one Type I error}) = 1 - P(\text{no Type I errors}) = 1 - (1 - 0.05)^{20} = 1 - 0.95^{20} = 1 - 0.358 = 0.642\). There is a 64.2% chance of at least one false positive -- far above the nominal 5%.
(b) Bonferroni correction: divide the significance level by the number of tests. Use \(\alpha^* = \alpha / m = 0.05 / 20 = 0.0025\) for each individual test. This ensures the family-wise error rate (FWER) is at most \(\alpha = 0.05\). The correction is conservative -- it may reduce power, especially with many tests. Alternatives like the Holm-Bonferroni method are less conservative while still controlling FWER.
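Both calculations in one short sketch, showing that the Bonferroni level brings the family-wise error rate back under \(\alpha\) (for independent tests):

```python
# FWER for m independent tests, uncorrected vs Bonferroni-corrected.
m, alpha = 20, 0.05
fwer_uncorrected = 1 - (1 - alpha) ** m        # about 0.642
alpha_bonf = alpha / m                          # 0.0025 per test
fwer_corrected = 1 - (1 - alpha_bonf) ** m      # just under 0.05
```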
PCA (Principal Component Analysis) is a dimensionality reduction technique that transforms correlated variables into a smaller set of uncorrelated variables called principal components.
Goal: Find new axes (principal components) that capture the maximum variance in the data, allowing us to reduce dimensionality while retaining as much information as possible.
Principal components are linear combinations of the original variables: \(PC_k = w_{k1}X_1 + w_{k2}X_2 + \cdots + w_{kp}X_p\). PC1 captures the most variance, PC2 (orthogonal to PC1) captures the next most, and so on.
The weights \(w_{ki}\) are the eigenvectors of the covariance (or correlation) matrix, and the variance captured by each PC equals its eigenvalue.
(a) How much variance does PC1 explain?
(b) How many components would you retain using the Kaiser criterion?
(c) Using the 80% cumulative variance rule?
(a) Total variance = \(\sum \lambda_i = 2.8 + 1.3 + 0.5 + 0.3 + 0.1 = 5.0\) (equals the number of variables for a correlation matrix). PC1 explains \(2.8/5.0 = 56\%\) of the total variance.
(b) Kaiser criterion: retain components with \(\lambda > 1\). Here \(\lambda_1 = 2.8 > 1\) and \(\lambda_2 = 1.3 > 1\), but \(\lambda_3 = 0.5 < 1\). Retain 2 components.
(c) Cumulative variance: PC1 = 56%, PC1+PC2 = 56% + 26% = 82%. Since 82% > 80%, retain 2 components. (With only PC1 at 56%, we would not meet the threshold.)
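The variance-explained bookkeeping for parts (a)-(c), from the given eigenvalues:

```python
# Proportion of variance, Kaiser criterion, and cumulative variance.
eigenvalues = [2.8, 1.3, 0.5, 0.3, 0.1]
total = sum(eigenvalues)                          # 5.0 (correlation matrix)
prop = [lam / total for lam in eigenvalues]       # PC1 explains 0.56
kaiser_k = sum(lam > 1 for lam in eigenvalues)    # retain 2 components
cum2 = prop[0] + prop[1]                          # 0.82 >= 0.80
```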
(a) State the optimization problem.
(b) Show it reduces to an eigenvalue problem.
(c) Why must we use the correlation matrix (not covariance) when variables have different scales?
(a) Find weights \(\mathbf{w}_1\) to maximize the variance of \(PC_1 = \mathbf{w}_1^T \mathbf{X}\): \(\max_{\mathbf{w}_1} \text{Var}(\mathbf{w}_1^T \mathbf{X}) = \max_{\mathbf{w}_1} \mathbf{w}_1^T \boldsymbol{\Sigma} \mathbf{w}_1\) subject to \(\mathbf{w}_1^T \mathbf{w}_1 = 1\) (unit length constraint to avoid trivial solution).
(b) Using a Lagrange multiplier: \(\mathcal{L} = \mathbf{w}_1^T \boldsymbol{\Sigma} \mathbf{w}_1 - \lambda(\mathbf{w}_1^T \mathbf{w}_1 - 1)\). Taking the derivative and setting to zero: \(\boldsymbol{\Sigma} \mathbf{w}_1 = \lambda \mathbf{w}_1\). This is an eigenvalue equation. The maximum variance \(\mathbf{w}_1^T \boldsymbol{\Sigma} \mathbf{w}_1 = \lambda\), so the first PC uses the eigenvector corresponding to the largest eigenvalue.
(c) The covariance matrix is sensitive to scale: a variable measured in meters will dominate one measured in centimeters simply due to its larger variance. The correlation matrix standardizes all variables to unit variance, ensuring each variable contributes equally regardless of measurement units.
(a) Apply three different retention criteria (Kaiser, scree/elbow, cumulative variance at 70%) and compare their recommendations.
(b) A parallel analysis generates random eigenvalues \((1.45, 1.25, 1.10, 0.98, 0.87, 0.77, 0.68, 0.50)\). How many components does parallel analysis retain?
(c) The criteria disagree. How do you make a final decision?
(a) Kaiser (\(\lambda > 1\)): Retain 3 components (\(\lambda_1=3.2, \lambda_2=1.8, \lambda_3=1.1\)). Scree: The largest drop is between \(\lambda_1\) and \(\lambda_2\) (1.4), then \(\lambda_2\) to \(\lambda_3\) (0.7), then \(\lambda_3\) to \(\lambda_4\) (0.2) -- elbow at 3 or possibly 2. Cumulative variance: 2 components = (3.2+1.8)/8 = 62.5%; 3 components = (3.2+1.8+1.1)/8 = 76.25% > 70%. Retain 3 components.
(b) Parallel analysis: retain components where actual \(\lambda\) exceeds random \(\lambda\). \(\lambda_1=3.2 > 1.45\) (retain), \(\lambda_2=1.8 > 1.25\) (retain), \(\lambda_3=1.1 = 1.10\) (borderline -- typically not retained since not strictly greater), \(\lambda_4=0.9 < 0.98\) (stop). Parallel analysis retains 2 components.
(c) Consider: (1) Interpretability -- can the components be meaningfully named? (2) Theory -- does domain knowledge suggest 2 or 3 constructs? (3) Stability -- do results change across subsamples? Parallel analysis is generally considered the most accurate criterion. Start with 2 components and examine if adding a third improves interpretability.
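The parallel-analysis rule from part (b) is a simple componentwise comparison, stopping at the first component whose observed eigenvalue does not strictly exceed the random one:

```python
# Parallel analysis: retain while observed eigenvalue > random eigenvalue.
observed = [3.2, 1.8, 1.1, 0.9]
random_ev = [1.45, 1.25, 1.10, 0.98]
retained = 0
for obs, rnd in zip(observed, random_ev):
    if obs > rnd:          # strict inequality: the borderline 1.1 vs 1.10 fails
        retained += 1
    else:
        break
```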
(a) Name PC1 and PC2.
(b) What proportion of variance should these components explain to be useful?
(a) PC1 could be named 'Financial Health' or 'Profitability Factor' since it captures the common variation in financial ratios. PC2 could be named 'Market Conditions' or 'Market Sentiment Factor' since it captures market-related variation.
(b) As a general guideline, the retained components should explain at least 60-80% of the total variance to be considered useful. In social sciences, 60% may be acceptable; in natural sciences, 80%+ is often expected. The exact threshold depends on the field and application. If these two components explain less than 50%, additional components may be needed, or the data may not have a clear low-dimensional structure.
(a) State the mathematical model for each.
(b) When is PCA preferred over EFA?
(c) What is the role of communalities in EFA?
(d) Explain the difference between a loading in PCA vs EFA.
(a) PCA: \(\mathbf{Z} = \mathbf{X}\mathbf{W}\) -- components are linear combinations of observed variables. No error term. EFA: \(\mathbf{X} = \boldsymbol{\Lambda}\mathbf{F} + \boldsymbol{\varepsilon}\) -- observed variables are linear combinations of latent factors plus unique error.
(b) PCA is preferred for: (1) pure dimensionality reduction, (2) creating composite scores, (3) preprocessing for regression (removing multicollinearity). PCA makes no assumptions about underlying structure.
(c) Communality \(h_i^2\) is the proportion of variable \(i\)'s variance explained by the common factors. Uniqueness = \(1 - h_i^2\) represents variance specific to that variable plus error. Low communality means the variable is poorly represented by the factor model.
(d) In PCA, loadings are correlations between variables and components (which are exact linear combinations). In EFA, loadings represent the correlation between observed variables and latent (unobserved) factors, accounting for unique variance. EFA loadings are generally smaller than PCA loadings because they exclude unique variance.
Rotation is a transformation applied to the initial factor solution to achieve a simpler, more interpretable structure.
Why needed: The initial extraction often produces factors where many variables load moderately on multiple factors, making interpretation difficult. Rotation redistributes variance among factors so that each variable loads strongly on one factor and weakly on others ('simple structure').
The total variance explained does not change after rotation -- only the distribution across factors changes. Rotation does not improve the model fit; it only improves interpretability.
(a) Which allows correlated factors?
(b) When would you choose each?
(c) What does 'simple structure' mean?
(a) Varimax is an orthogonal rotation -- factors remain uncorrelated. Promax is an oblique rotation -- factors are allowed to be correlated.
(b) Use varimax when you expect (or want to enforce) uncorrelated factors, or when simplicity is desired. Use promax when you believe the underlying factors are genuinely correlated (common in psychology, e.g., intelligence subfactors). Promax often provides a more realistic solution but is harder to interpret.
(c) Simple structure (Thurstone's criteria): Each variable loads highly on one factor and has near-zero loadings on all other factors. Each factor has a few variables with high loadings and the rest near zero. This makes factors distinctly interpretable.
(a) Which variable is poorly explained?
(b) Calculate its uniqueness.
(c) What would you recommend doing about it?
(a) Variable 4 with \(h^2 = 0.33\) is poorly explained. Only 33% of its variance is accounted for by the two common factors.
(b) Uniqueness = \(1 - h^2 = 1 - 0.33 = 0.67\). This means 67% of variable 4's variance is unique (not shared with the common factors).
(c) Options: (1) Consider removing variable 4 from the analysis -- it may not belong to the factor structure being measured. (2) Add a third factor -- perhaps variable 4 represents a separate construct. (3) Examine the variable's content -- it may be poorly measured or conceptually distinct. (4) If theoretically important, keep it but note the low communality as a limitation. Generally, communalities below 0.40 are considered problematic.
(a) What does this mean about the variables?
(b) If an observation is far from the origin in the direction of these arrows, what does that indicate?
(a) Two arrows pointing in the same direction indicate that the corresponding variables are highly positively correlated. They share similar patterns of variation across observations. The angle between arrows approximates the correlation: small angle means high positive correlation, 90 degrees means no correlation, 180 degrees means strong negative correlation.
(b) An observation far from the origin in the direction of these arrows has high values on both variables (relative to other observations). The distance from the origin represents how extreme the observation is in the PC space. This observation scores highly on the principal component defined by those variable loadings.
1. k-Means (Partitional): Advantage -- fast, scales well to large datasets, \(O(nkt)\) complexity for \(n\) points, \(k\) clusters, and \(t\) iterations. Disadvantage -- requires pre-specifying \(k\), assumes spherical clusters, sensitive to initialization.
2. Hierarchical (Agglomerative): Advantage -- no need to pre-specify \(k\), produces a dendrogram showing cluster relationships at all levels. Disadvantage -- \(O(n^2 \log n)\) complexity, slow for large datasets, cannot undo merges.
3. Density-Based (e.g., DBSCAN): Advantage -- finds arbitrarily shaped clusters, automatically identifies outliers as noise. Disadvantage -- struggles with clusters of varying density, sensitive to epsilon and minPts parameters.
(a) Show your work.
(b) Which metric is more sensitive to outliers?
(c) What is the Minkowski distance with \(p = 3\)?
(a) Euclidean: \(d_E = \sqrt{(3-1)^2 + (4-2)^2 + (1-5)^2} = \sqrt{4 + 4 + 16} = \sqrt{24} = 4.899\).
Manhattan: \(d_M = |3-1| + |4-2| + |1-5| = 2 + 2 + 4 = 8\).
(b) Euclidean is more sensitive to outliers because squaring amplifies large differences. The dimension with difference 4 contributes \(4^2 = 16\) to Euclidean (dominant) but only 4 to Manhattan.
(c) Minkowski with \(p=3\): \(d_3 = \left(|3-1|^3 + |4-2|^3 + |1-5|^3\right)^{1/3} = (8 + 8 + 64)^{1/3} = 80^{1/3} = 4.309\).
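All three distances are special cases of the Minkowski formula (\(p=2\) Euclidean, \(p=1\) Manhattan), so one function covers the whole question:

```python
# Minkowski distance d_p(a, b) = (sum |a_i - b_i|^p)^(1/p).
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A, B = (3, 4, 1), (1, 2, 5)
d_euclid = minkowski(A, B, 2)     # sqrt(24) ~ 4.899
d_manhattan = minkowski(A, B, 1)  # 8
d_mink3 = minkowski(A, B, 3)      # 80^(1/3) ~ 4.309
```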
(a) Write the objective function.
(b) Prove that each iteration decreases or maintains WCSS.
(c) Why can k-Means converge to a local minimum?
(d) How does k-Means++ address the initialization problem?
(a) Objective: minimize Within-Cluster Sum of Squares: \(WCSS = \sum_{k=1}^K \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\), where \(C_k\) is cluster \(k\) and \(\boldsymbol{\mu}_k\) is its centroid.
(b) The algorithm alternates two steps: (1) Assignment: assign each point to the nearest centroid -- this cannot increase WCSS because each point moves to minimize its contribution. (2) Update: recompute centroids as cluster means -- the mean minimizes sum of squared distances within a cluster. Since both steps decrease (or maintain) WCSS and WCSS is bounded below by 0, the algorithm converges.
(c) The objective function is non-convex with multiple local minima. Different initializations lead to different final solutions. The greedy alternating minimization is only guaranteed to find a local minimum, not the global one.
(d) k-Means++ selects initial centroids sequentially: the first is random, each subsequent centroid is chosen with probability proportional to \(D(x)^2\) (squared distance to the nearest existing centroid). This spreads out initial centroids and provides an \(O(\log k)\)-competitive approximation guarantee.
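The two alternating steps from part (b) can be sketched in a few lines. This is plain Lloyd's algorithm with random initialization (not k-Means++), on a made-up 2-D dataset with two well-separated groups:

```python
import random

# Minimal Lloyd's algorithm: alternate assignment and update steps.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive random initialization
    for _ in range(iters):
        # Step 1 (assignment): each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Step 2 (update): each centroid becomes its cluster's mean.
        for j, cl in enumerate(clusters):
            if cl:  # skip empty clusters to avoid division by zero
                centroids[j] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents = kmeans(pts, 2)  # centroids near (1/3, 1/3) and (31/3, 31/3)
```

Since both steps can only decrease (or maintain) WCSS, the loop converges; on badly initialized or less separated data it may stop at a local minimum, which is exactly the problem k-Means++ mitigates.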
A dendrogram is a tree-like diagram that shows the hierarchical clustering of observations. The y-axis shows the distance (or dissimilarity) at which clusters merge. Observations start as individual leaves at the bottom and are progressively merged into larger clusters.
To determine the number of clusters: look for the largest vertical gap (distance jump) in the dendrogram. Cut the dendrogram horizontally at that gap. The number of vertical lines the horizontal cut crosses equals the number of clusters.
Example: If there is a large gap between merge distances 5 and 12, cutting at height 8 might reveal 3 clusters. The large gap suggests those clusters are well-separated.
(a) Define each mathematically.
(b) Which tends to produce 'chaining'?
(c) Which tends to produce compact clusters?
(a) Single linkage: \(d(A,B) = \min_{a \in A, b \in B} d(a,b)\) (minimum distance between any pair).
Complete linkage: \(d(A,B) = \max_{a \in A, b \in B} d(a,b)\) (maximum distance between any pair).
Average linkage: \(d(A,B) = \frac{1}{|A||B|}\sum_{a \in A}\sum_{b \in B} d(a,b)\) (average of all pairwise distances).
Ward's method: merges the pair of clusters that results in the smallest increase in total within-cluster variance (WCSS).
(b) Single linkage tends to produce 'chaining' -- elongated, string-like clusters where points are connected through a chain of close neighbors, even if the endpoints are far apart.
(c) Complete linkage and Ward's method produce compact, roughly spherical clusters. Ward's is particularly good at finding equal-sized, compact clusters.
(a) Define \(a(i)\) and \(b(i)\).
(b) What is the range of \(s(i)\)?
(c) An observation has \(s(i) = -0.3\) -- what does this mean?
(d) How do you use the average silhouette to select \(k\)?
(a) \(a(i)\) = average distance from observation \(i\) to all other points in the same cluster (cohesion). \(b(i)\) = minimum average distance from observation \(i\) to all points in any other cluster (separation to nearest neighboring cluster).
(b) The range is \([-1, 1]\). \(s(i) \approx 1\): well-clustered (far from neighboring clusters). \(s(i) \approx 0\): on the boundary between clusters. \(s(i) \approx -1\): likely assigned to the wrong cluster.
(c) \(s(i) = -0.3\) means \(a(i) > b(i)\): the observation is closer to points in another cluster than to points in its own cluster. It is likely misclassified and would fit better in the neighboring cluster.
(d) Compute the average silhouette width for different values of \(k\). Choose the \(k\) that maximizes the average silhouette. A higher average indicates better-defined clusters. Values above 0.5 suggest reasonable structure; above 0.7 indicates strong structure.
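The silhouette formula \(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\) can be traced on a toy 1-D dataset (distinct values assumed, so excluding the point itself by value is safe):

```python
# Silhouette width for one observation on 1-D data with cluster labels.
def silhouette(i, labels, xs):
    own = [x for x, l in zip(xs, labels) if l == labels[i]]
    # a(i): mean distance to the other members of its own cluster
    a = sum(abs(xs[i] - x) for x in own if x != xs[i]) / (len(own) - 1)
    # b(i): smallest mean distance to any other cluster
    b = min(
        sum(abs(xs[i] - x) for x, l in zip(xs, labels) if l == other)
        / sum(1 for l in labels if l == other)
        for other in set(labels) if other != labels[i]
    )
    return (b - a) / max(a, b)

xs = [1.0, 2.0, 8.0, 9.0]
labels = [0, 0, 1, 1]
s0 = silhouette(0, labels, xs)  # well-clustered point: a=1, b=7.5, s~0.867
```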
(a) Calculate the Euclidean distance between customer A = (80000, 35, 4, 1) and B = (82000, 55, 2, 0) with and without standardization.
(b) Which variable dominates the unstandardized distance? What percentage of the squared distance does it contribute?
(c) How should you handle the binary variable? Is Euclidean distance appropriate for mixed-type data?
(a) Unstandardized: \(d = \sqrt{(80000-82000)^2 + (35-55)^2 + (4-2)^2 + (1-0)^2} = \sqrt{4000000 + 400 + 4 + 1} = \sqrt{4000405} = 2000.1\). With z-score standardization (assuming \(\sigma_{inc}=40000, \sigma_{age}=15, \sigma_{sat}=1.2, \sigma_{gen}=0.5\)): differences become \((-0.05, -1.33, 1.67, 2.0)\), giving \(d = \sqrt{0.0025 + 1.78 + 2.78 + 4.0} = \sqrt{8.56} = 2.93\).
(b) Income contributes \(4000000/4000405 = 99.99\%\) of the squared distance. Age, satisfaction, and gender are effectively ignored. This makes the clustering almost entirely an income-based partition.
(c) Euclidean distance treats the binary variable as continuous, which is problematic. Better approaches: (1) Use Gower's distance, which handles mixed types natively. (2) One-hot encode categorical variables and then standardize. (3) Use k-Prototypes algorithm (k-Means variant for mixed data). For the binary variable specifically, matching/mismatching contributes 0 or 1, but this must be properly scaled relative to continuous variables.
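The numbers in (a) and (b) are easy to reproduce. A short Python check, using the same assumed standard deviations as the solution:

```python
import math

# Euclidean distance between customers A and B, raw and after z-score
# scaling with the assumed standard deviations from the solution.
A = [80000, 35, 4, 1]
B = [82000, 55, 2, 0]
sd = [40000, 15, 1.2, 0.5]   # assumed sigmas (income, age, satisfaction, gender)

d_raw = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))
d_std = math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(A, B, sd)))

# fraction of the raw squared distance contributed by income alone
income_share = (A[0] - B[0]) ** 2 / d_raw ** 2
print(round(d_raw, 1), round(d_std, 2), round(income_share, 4))
# -> 2000.1 2.93 0.9999
```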
(a) Where is the 'elbow'?
(b) Why might the elbow be ambiguous?
(c) What other method could confirm your choice?
(a) The elbow is at \(k = 3\). The largest drop is from \(k=2\) to \(k=3\) (decrease of 170). After \(k=3\), improvements are marginal: \(k=3\) to \(k=4\) drops only 30, and further increases give diminishing returns (10, 5).
(b) The elbow can be ambiguous when: (1) there is no sharp bend (gradual decrease), (2) multiple bends exist, (3) the true number of clusters is not well-defined in the data. In this case, one could also argue for \(k=4\).
(c) The silhouette method: compute average silhouette width for each \(k\) and choose the maximum. The gap statistic is another option: it compares WCSS to that expected under a null reference distribution. These methods provide complementary evidence to support the elbow method choice.
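The elbow reasoning in (a) amounts to comparing successive WCSS drops. A sketch with hypothetical WCSS values chosen to be consistent with the drops cited in the solution (the original table is not reproduced here):

```python
# Hypothetical WCSS values for k = 1..6 (assumed, consistent with the
# drops quoted in the solution: 170 from k=2 to k=3, then 30, 10, 5).
wcss = {1: 400, 2: 250, 3: 80, 4: 50, 5: 40, 6: 35}

drops = {k: wcss[k - 1] - wcss[k] for k in range(2, 7)}
print(drops)  # {2: 150, 3: 170, 4: 30, 5: 10, 6: 5}

# Simple elbow heuristic: the k whose drop most exceeds the next drop,
# i.e. where the curve bends most sharply.
elbow = max(range(2, 6), key=lambda k: drops[k] - drops[k + 1])
print(elbow)  # -> 3
```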
(a) How do their objective functions differ?
(b) Which is more robust to outliers and why?
(c) What is the computational complexity of each?
(d) When would you prefer k-Medoids?
(a) k-Means minimizes \(\sum_{k=1}^K \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\) where \(\boldsymbol{\mu}_k\) is the mean (centroid). k-Medoids minimizes \(\sum_{k=1}^K \sum_{i \in C_k} d(\mathbf{x}_i, \mathbf{m}_k)\) where \(\mathbf{m}_k\) is an actual data point (the medoid) and \(d\) can be any dissimilarity measure.
(b) k-Medoids is more robust because: (1) it uses actual data points as centers, not means that can be pulled by outliers; (2) it minimizes sum of distances (not squared distances), reducing the influence of extreme values; (3) it works with any dissimilarity measure, not just Euclidean.
(c) k-Means: \(O(nk)\) per iteration (so \(O(nkt)\) in total for \(t\) iterations), where \(n\) = points and \(k\) = clusters. k-Medoids (PAM): \(O(k(n-k)^2)\) per iteration -- significantly slower because it evaluates all possible swaps of medoids with non-medoids.
(d) Prefer k-Medoids when: (1) data contains outliers, (2) you need interpretable centers (actual data points), (3) you want to use non-Euclidean distances (e.g., Manhattan, cosine), (4) the dataset is not too large (\(n < 10000\)).
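Point (b) can be seen in one dimension: a single outlier drags the mean far from the bulk of the data but barely affects the medoid. A toy Python illustration:

```python
# Robustness of the medoid vs. the mean on toy data with one outlier.
data = [1, 2, 3, 4, 100]

# k-Means-style center: the mean, pulled strongly toward the outlier.
mean = sum(data) / len(data)

# k-Medoids-style center: the data point minimizing the sum of absolute
# distances to all points -- it stays in the bulk of the data.
medoid = min(data, key=lambda m: sum(abs(x - m) for x in data))

print(mean, medoid)  # -> 22.0 3
```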
(a) How would you validate the clustering?
(b) Name two internal validation metrics.
(c) Name one external validation metric (if labels are available).
(a) Validation approaches: (1) Internal validation -- assess cluster quality using the data itself (compactness, separation). (2) Stability validation -- re-cluster subsamples and check consistency. (3) Visual validation -- examine cluster profiles and check if they make business sense. (4) External validation -- compare to known labels if available.
(b) Internal metrics: (1) Silhouette width: measures how similar points are to their own cluster vs. neighboring clusters (range [-1,1], higher is better). (2) Calinski-Harabasz index (variance ratio criterion): ratio of between-cluster to within-cluster variance (higher is better).
(c) External metric: Adjusted Rand Index (ARI): measures agreement between clustering and true labels, adjusted for chance. Range [-1,1], where 1 = perfect agreement, 0 = random agreement. Other options: Normalized Mutual Information (NMI) or purity.
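The Adjusted Rand Index in (c) follows directly from its contingency-table definition. A self-contained sketch (in practice one would call `sklearn.metrics.adjusted_rand_score`):

```python
from math import comb
from collections import Counter

# ARI from its contingency-table definition: (index - expected index)
# divided by (max index - expected index), using pair counts.
def ari(truth, pred):
    n = len(truth)
    pairs = Counter(zip(truth, pred))                 # contingency cells n_ij
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())  # row sums
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())   # column sums
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
print(ari(truth, [1, 1, 1, 0, 0, 0]))  # same partition, labels swapped -> 1.0
print(ari(truth, [0, 0, 1, 1, 0, 1]))  # partial agreement -> below 1
```

Note that ARI is invariant to relabeling the clusters, which is exactly why it (and not raw accuracy) is used to compare a clustering to ground-truth labels.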
The three components are:
1. Trend: Long-term increase or decrease in the data. Example: GDP growing over decades.
2. Seasonality: Regular, repeating pattern at fixed periods. Example: retail sales peaking in December every year.
3. Noise (Residual/Irregular): Random fluctuations that cannot be attributed to trend or seasonality.
Example exhibiting all three: Monthly airline passenger numbers -- upward trend (growing travel demand), seasonal peaks (summer months), and random month-to-month variation. The additive decomposition is \(Y_t = T_t + S_t + \varepsilon_t\); multiplicative is \(Y_t = T_t \times S_t \times \varepsilon_t\).
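A key step in classical additive decomposition is the centered moving average, which removes a seasonal pattern with period equal to its window. A small Python sketch on a constructed series (toy data, not real passenger numbers), using the standard \(2\times 4\) centered average for an even period:

```python
# Toy additive series: Y_t = 0.5*t + s[t % 4], with seasonal effects
# that sum to zero over one period. A 2x4 centered moving average
# (weights 1/8, 1/4, 1/4, 1/4, 1/8) removes the seasonality exactly,
# recovering the linear trend at interior points.
season = [2.0, -1.0, -3.0, 2.0]
y = [0.5 * t + season[t % 4] for t in range(20)]

def centered_ma4(y, t):
    return (0.5 * y[t - 2] + y[t - 1] + y[t] + y[t + 1] + 0.5 * y[t + 2]) / 4

trend = {t: centered_ma4(y, t) for t in range(2, 18)}
print(all(abs(trend[t] - 0.5 * t) < 1e-9 for t in trend))  # -> True
```

Subtracting this trend estimate and averaging the remainder by season position yields the seasonal component; what is left over is the irregular term.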
(a) What model does this suggest?
(b) Write the model equation.
(c) What are the parameter constraints for stationarity?
(a) An ACF that cuts off sharply after lag 2 (rather than tailing off) is the signature of a pure MA process, pointing to MA(2). A PACF that truly cut off after lag 1 would indicate AR(1), but an AR(1) has an ACF that tails off, contradicting the stated ACF; moreover, for an MA(2) the PACF tails off rather than cuts off. The ACF cut-off is the more decisive pattern here, so an MA(2) model is the natural candidate.
(b) MA(2): \(Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}\), where \(\varepsilon_t \sim WN(0, \sigma^2)\).
(c) MA models are always stationary (they are finite sums of white noise). For invertibility (needed for unique parameter estimation): the roots of \(1 + \theta_1 z + \theta_2 z^2 = 0\) must lie outside the unit circle. Equivalently: \(\theta_1 + \theta_2 > -1\), \(\theta_1 - \theta_2 < 1\), and \(|\theta_2| < 1\).
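The root condition in (c) can be checked numerically. A small Python sketch using the quadratic formula (hypothetical coefficient values):

```python
import cmath

# MA(2) invertibility: both roots of 1 + th1*z + th2*z^2 = 0 must lie
# outside the unit circle. Assumes th2 != 0 so the polynomial is quadratic.
def invertible_ma2(th1, th2):
    disc = cmath.sqrt(th1 ** 2 - 4 * th2)   # complex sqrt handles disc < 0
    roots = [(-th1 + disc) / (2 * th2), (-th1 - disc) / (2 * th2)]
    return all(abs(z) > 1 for z in roots)

print(invertible_ma2(0.5, 0.3))    # -> True  (inside the triangle region)
print(invertible_ma2(0.0, -1.5))   # -> False (|theta_2| > 1)
```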
(a) Find \(\gamma(0) = \text{Var}(Y_t)\).
(b) Find \(\gamma(h)\) for general lag \(h\).
(c) Show that \(\rho(h) = \phi^h\).
(d) What condition on \(\phi\) ensures stationarity?
(a) \(\text{Var}(Y_t) = \text{Var}(\phi Y_{t-1} + \varepsilon_t) = \phi^2 \text{Var}(Y_{t-1}) + \sigma^2\) (since \(\varepsilon_t\) is independent of \(Y_{t-1}\)). For a stationary process, \(\text{Var}(Y_t) = \text{Var}(Y_{t-1}) = \gamma(0)\). So \(\gamma(0) = \phi^2 \gamma(0) + \sigma^2\), giving \(\gamma(0) = \frac{\sigma^2}{1 - \phi^2}\).
(b) \(\gamma(h) = \text{Cov}(Y_t, Y_{t-h}) = \text{Cov}(\phi Y_{t-1} + \varepsilon_t, Y_{t-h}) = \phi \text{Cov}(Y_{t-1}, Y_{t-h}) = \phi \gamma(h-1)\). By induction: \(\gamma(h) = \phi^h \gamma(0) = \frac{\phi^h \sigma^2}{1 - \phi^2}\).
(c) \(\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \frac{\phi^h \gamma(0)}{\gamma(0)} = \phi^h\). The ACF decays exponentially, which is the signature pattern of an AR(1) process.
(d) Stationarity requires \(|\phi| < 1\). If \(|\phi| \geq 1\), the variance \(\gamma(0) = \frac{\sigma^2}{1-\phi^2}\) is undefined (negative or infinite), and the process is non-stationary.
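The result \(\rho(1) = \phi\) from (c) can be verified by simulation. A Python sketch with \(\phi = 0.6\) (toy simulation; the sample autocorrelation should land close to 0.6, up to sampling error):

```python
import random

# Simulate a stationary AR(1) with phi = 0.6 and check that the sample
# lag-1 autocorrelation is close to phi, as rho(1) = phi predicts.
random.seed(42)
phi, n = 0.6, 20000
y = [0.0]
for _ in range(n):
    y.append(phi * y[-1] + random.gauss(0, 1))
y = y[1000:]                     # drop burn-in so the start value is forgotten

mean = sum(y) / len(y)
c0 = sum((v - mean) ** 2 for v in y)                                  # gamma(0)
c1 = sum((y[t] - mean) * (y[t - 1] - mean) for t in range(1, len(y)))  # gamma(1)
print(round(c1 / c0, 2))          # close to 0.6
```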
(a) Explain the difference between trend-stationarity and difference-stationarity. Why does the distinction matter for modeling?
(b) The ADF test gives \(p = 0.42\) and the KPSS test gives \(p = 0.01\). Interpret both results together.
(c) After first differencing, ADF gives \(p = 0.03\) but variance still appears non-constant. What additional transformation is needed?
(d) How would you determine whether \(d = 1\) or \(d = 2\) differencing is appropriate?
(a) Trend-stationary: \(Y_t = \alpha + \beta t + \varepsilon_t\) -- deterministic trend that can be removed by detrending (subtracting the fitted trend). Difference-stationary: \(Y_t = Y_{t-1} + \varepsilon_t\) (unit root) -- stochastic trend removed by differencing. The distinction matters because each remedy fails on the other type: detrending a unit-root process leaves a unit root in the residuals and invalidates standard inference, while differencing a trend-stationary process over-differences it, introducing a non-invertible MA component.
(b) ADF (\(H_0\): unit root): \(p = 0.42\), fail to reject -- evidence of a unit root. KPSS (\(H_0\): stationary): \(p = 0.01\), reject stationarity. Both tests agree: the series is non-stationary. (If they disagreed, more investigation would be needed, such as examining the Phillips-Perron test or structural breaks.)
(c) Non-constant variance after differencing suggests the need for a variance-stabilizing transformation. Apply a log transformation first: \(W_t = \log(Y_t)\), then difference: \(\Delta W_t = \log(Y_t) - \log(Y_{t-1}) \approx\) percentage change. The log transform stabilizes multiplicative variance, and differencing removes the trend.
(d) Apply ADF to the first-differenced series. If \(p < 0.05\), \(d=1\) is sufficient. If still non-stationary, try \(d=2\). Also check: (1) the ACF of the differenced series -- if it drops quickly, \(d\) is adequate; if the first autocorrelation is near \(-0.5\), you may have over-differenced. (2) Compare AIC/BIC of ARIMA(\(p,1,q\)) vs ARIMA(\(p,2,q\)). In practice, \(d > 2\) is almost never needed.
(a) What are the three stages?
(b) How do you use ACF/PACF for model identification?
(c) How do you check if the model is adequate?
(a) Three stages: (1) Identification: determine the order \((p, d, q)\) by examining plots, ACF/PACF, and stationarity tests. (2) Estimation: estimate model parameters using maximum likelihood or least squares. (3) Diagnostic checking: verify model adequacy through residual analysis.
(b) ACF/PACF patterns: AR(\(p\)): ACF tails off, PACF cuts off after lag \(p\). MA(\(q\)): ACF cuts off after lag \(q\), PACF tails off. ARMA(\(p,q\)): both tail off. The differencing order \(d\) is determined by how many times differencing is needed to achieve stationarity.
(c) Model adequacy checks: (1) Residuals should be white noise -- check with ACF of residuals (no significant lags). (2) Ljung-Box test on residuals: \(H_0\): residuals are white noise. (3) Residuals should be approximately normal (Q-Q plot). (4) Compare AIC/BIC across candidate models -- lower is better.
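The Ljung-Box check in (c) computes \(Q = n(n+2)\sum_{k=1}^{h} r_k^2/(n-k)\) from the residual autocorrelations \(r_k\) and compares it to a \(\chi^2\) critical value (18.31 for \(h = 10\) degrees of freedom at the 5% level). A self-contained Python sketch on toy series (in practice one would use a library routine such as statsmodels' `acorr_ljungbox`, which also returns p-values):

```python
import random

# Ljung-Box statistic on a residual series; large Q rejects white noise.
def ljung_box(x, h=10):
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    q = 0.0
    for k in range(1, h + 1):
        rk = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / c0
        q += rk ** 2 / (n - k)
    return n * (n + 2) * q

random.seed(1)
white = [random.gauss(0, 1) for _ in range(1000)]   # adequate-model residuals
ar = [0.0]                                          # residuals with structure
for _ in range(1000):
    ar.append(0.8 * ar[-1] + random.gauss(0, 1))

print(ljung_box(white))           # small for white noise (compare to 18.31)
print(ljung_box(ar[1:]) > 18.31)  # -> True: strong autocorrelation detected
```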
(a) Write the full model equation.
(b) How many parameters need to be estimated?
(c) Explain why differencing is needed.
(d) After fitting, the residuals show significant autocorrelation at lag 4 -- what should you do?
(a) Let \(W_t = Y_t - Y_{t-1}\) (first difference). Then: \(W_t = c + \phi_1 W_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}\). In terms of the original series: \(Y_t = c + (1 + \phi_1)Y_{t-1} - \phi_1 Y_{t-2} + \varepsilon_t + \theta_1 \varepsilon_{t-1}\).
(b) Three parameters: \(\phi_1\) (AR coefficient), \(\theta_1\) (MA coefficient), and \(\sigma^2\) (noise variance). Optionally a constant \(c\) (drift term) for a total of 4.
(c) Differencing (\(d=1\)) removes a unit root (stochastic trend). The original series is non-stationary (e.g., random walk with drift), but the differenced series is stationary. ARMA models require stationarity, so differencing is a preprocessing step.
(d) Significant autocorrelation at lag 4 suggests the model is inadequate. Options: (1) Try ARIMA(1,1,2) or ARIMA(2,1,1) to capture additional structure. (2) If data is quarterly, consider seasonal ARIMA with \(s=4\): SARIMA(1,1,1)(1,0,0)[4]. (3) Check for seasonal patterns that the current model misses. Re-fit and re-check residuals iteratively.
White noise is a sequence of uncorrelated random variables with constant mean and variance: \(\varepsilon_t \sim WN(0, \sigma^2)\).
Properties: (1) \(E[\varepsilon_t] = 0\) for all \(t\). (2) \(\text{Var}(\varepsilon_t) = \sigma^2\) for all \(t\) (constant). (3) \(\text{Cov}(\varepsilon_t, \varepsilon_s) = 0\) for \(t \neq s\) (no autocorrelation). (4) ACF: \(\rho(0) = 1\), \(\rho(h) = 0\) for \(h \neq 0\).
Importance for diagnostics: A well-fitted time series model should have residuals that resemble white noise. If residuals are white noise, all systematic patterns (trend, seasonality, autocorrelation) have been captured by the model. Remaining residuals that show autocorrelation indicate the model is missing structure and should be improved.
(a) Write both formulas.
(b) Which penalizes complexity more?
(c) If AIC selects ARIMA(2,1,2) and BIC selects ARIMA(1,1,1), which would you choose and why?
(a) \(AIC = -2\ln(L) + 2k\), where \(L\) is the maximized likelihood and \(k\) is the number of parameters. \(BIC = -2\ln(L) + k\ln(n)\), where \(n\) is the sample size.
(b) BIC penalizes complexity more heavily. For \(n \geq 8\), \(\ln(n) > 2\), so BIC's penalty per parameter exceeds AIC's. BIC tends to select simpler models, especially for large samples.
(c) This is a common situation. The choice depends on the goal: (1) For forecasting, AIC is often preferred -- it optimizes prediction accuracy and may capture useful patterns that BIC's simpler model misses. (2) For identifying the true model order, BIC is consistent (selects the true model as \(n \to \infty\)). (3) Pragmatically, ARIMA(1,1,1) is simpler, more interpretable, and less prone to overfitting. Unless the ARIMA(2,1,2) provides substantially better out-of-sample forecasts, the simpler model is usually preferred.
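The disagreement in (c) is easy to reproduce with the formulas from (a). A Python sketch with hypothetical likelihoods and parameter counts (the values below are invented for illustration):

```python
import math

# A richer model (B) fits slightly better but pays a larger penalty;
# AIC and BIC can then disagree, exactly as in part (c).
def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    return -2 * loglik + k * math.log(n)

n = 200
logL_A, k_A = -500.0, 4    # e.g. the simpler ARIMA(1,1,1)-style model
logL_B, k_B = -497.0, 6    # e.g. the richer ARIMA(2,1,2)-style model

print(aic(logL_B, k_B) < aic(logL_A, k_A))        # -> True: AIC picks B
print(bic(logL_A, k_A, n) < bic(logL_B, k_B, n))  # -> True: BIC picks A
```

With \(n = 200\), \(\ln(n) \approx 5.3\), so each extra parameter costs 5.3 under BIC but only 2 under AIC, which is the whole source of the disagreement.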
(a) Write the conditional variance equation.
(b) What is volatility clustering?
(c) Derive the unconditional variance.
(d) What constraint ensures the variance is positive and finite?
(a) \(\sigma_t^2 = \omega + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2\), where \(\varepsilon_t = \sigma_t z_t\), \(z_t \sim N(0,1)\), \(\omega > 0\), \(\alpha_1 \geq 0\), \(\beta_1 \geq 0\). The conditional variance at time \(t\) depends on the previous squared shock and the previous conditional variance.
(b) Volatility clustering: large price changes tend to be followed by large changes (of either sign), and small changes tend to be followed by small changes. GARCH captures this because a large \(\varepsilon_{t-1}^2\) increases \(\sigma_t^2\), leading to larger expected magnitudes of \(\varepsilon_t\).
(c) Taking unconditional expectations: \(E[\sigma_t^2] = \omega + \alpha_1 E[\varepsilon_{t-1}^2] + \beta_1 E[\sigma_{t-1}^2]\). Since \(E[\varepsilon_t^2] = E[\sigma_t^2]\) and stationarity implies \(E[\sigma_t^2] = E[\sigma_{t-1}^2] = \bar{\sigma}^2\): \(\bar{\sigma}^2 = \omega + (\alpha_1 + \beta_1)\bar{\sigma}^2\), so \(\bar{\sigma}^2 = \frac{\omega}{1 - \alpha_1 - \beta_1}\).
(d) Constraints: \(\omega > 0\), \(\alpha_1 \geq 0\), \(\beta_1 \geq 0\) ensure \(\sigma_t^2 > 0\). For finite unconditional variance: \(\alpha_1 + \beta_1 < 1\). If \(\alpha_1 + \beta_1 = 1\), the process is Integrated GARCH (IGARCH) with infinite unconditional variance -- shocks persist indefinitely.
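The unconditional-variance derivation in (c) is a fixed-point argument: iterating \(E[\sigma_t^2] = \omega + (\alpha_1 + \beta_1)E[\sigma_{t-1}^2]\) converges to \(\omega/(1 - \alpha_1 - \beta_1)\) whenever \(\alpha_1 + \beta_1 < 1\). A deterministic Python check with hypothetical parameter values:

```python
# Fixed-point iteration for the unconditional GARCH(1,1) variance.
# With omega = 0.1, alpha = 0.1, beta = 0.8 the limit is 0.1 / 0.1 = 1.0.
omega, alpha, beta = 0.1, 0.1, 0.8

v = 5.0                          # arbitrary starting value
for _ in range(300):
    v = omega + (alpha + beta) * v   # contraction since alpha + beta < 1

print(round(v, 6))               # -> 1.0 = omega / (1 - alpha - beta)
```

If \(\alpha_1 + \beta_1 \geq 1\), the map is no longer a contraction and the iteration diverges, mirroring the IGARCH case in (d).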
(a) Show the calculation.
(b) Interpret the result.
(c) Name one advantage and one disadvantage of MAPE vs MAE.
(a) \(MAPE = \frac{1}{n}\sum_{i=1}^n \left|\frac{A_i - F_i}{A_i}\right| \times 100\%\).
\(\frac{|100-105|}{100} = 0.0500\), \(\frac{|120-115|}{120} = 0.0417\), \(\frac{|90-95|}{90} = 0.0556\), \(\frac{|110-108|}{110} = 0.0182\).
\(MAPE = \frac{0.0500 + 0.0417 + 0.0556 + 0.0182}{4} \times 100\% = \frac{0.1654}{4} \times 100\% \approx 4.14\%\).
(b) On average, forecasts deviate from actuals by about 4.1%. This is considered excellent forecasting accuracy (a MAPE below 10% is typically very good).
(c) Advantage of MAPE: scale-independent (expressed as a percentage), making it easy to compare across different series and communicate to stakeholders. Disadvantage: undefined when actual values are zero, and asymmetric -- for a given absolute error, the percentage error is larger when the actual value is smaller, so MAPE effectively penalizes over-forecasts more heavily than under-forecasts. MAE avoids both issues but is scale-dependent.
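The calculation in (a) takes a few lines of Python (carrying full precision rather than rounded intermediate terms; small differences against hand calculations are expected):

```python
# MAPE and MAE for the four forecasts in part (a).
actual   = [100, 120, 90, 110]
forecast = [105, 115, 95, 108]

mape = sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual) * 100
mae  = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
print(round(mape, 2), mae)  # -> 4.14 4.25
```

The MAE of 4.25 units illustrates the scale-dependence discussed in (c): it is only interpretable relative to the magnitude of the series, whereas the MAPE is directly comparable across series.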