Statistical Data Analysis -- Exam Practice
50 Written Questions with Full Solutions -- Designed for Deep Understanding
(a) Write down the sum of squared residuals \(SSR = \sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\).
(b) Take partial derivatives with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\), set them to zero.
(c) Solve the resulting normal equations for \(\hat{\beta}_1\).
(a) We minimize \(SSR = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\) with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
(b) Setting \(\frac{\partial SSR}{\partial \hat{\beta}_0} = 0\) gives \(\sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0\), yielding \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\).
Setting \(\frac{\partial SSR}{\partial \hat{\beta}_1} = 0\) gives \(\sum x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0\).
(c) Substituting \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\) and simplifying: \(\sum x_i(y_i - \bar{y} + \hat{\beta}_1 \bar{x} - \hat{\beta}_1 x_i) = 0\). This gives \(\sum x_i(y_i - \bar{y}) = \hat{\beta}_1 \sum x_i(x_i - \bar{x})\). Since \(\sum(x_i - \bar{x})(y_i - \bar{y}) = \sum x_i(y_i - \bar{y})\), we get \(\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}\).
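The closed-form estimates derived above can be checked numerically. A minimal sketch in plain Python, using a small made-up dataset (the points lie exactly on \(y = 1 + 2x\), so the fit recovers the line exactly):

```python
# Closed-form OLS estimates from the normal equations (pure Python sketch).
def ols_fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # beta1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = ybar - beta1 * xbar  # from the first normal equation
    return beta0, beta1

# Made-up data on the exact line y = 1 + 2x:
b0, b1 = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```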
(a) Calculate adjusted \(R^2\).
(b) Interpret both values.
(c) Is the model significantly better than the intercept-only model? Set up the F-test.
(a) \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - (1 - 0.85)\frac{49}{45} = 1 - 0.15 \times 1.089 = 1 - 0.163 = 0.837\).
(b) \(R^2 = 0.85\): 85% of variance in \(y\) is explained by the 4 predictors. Adjusted \(R^2 = 0.837\): after penalizing for 4 predictors, 83.7% of variance is explained. The small difference (0.013) suggests the predictors are genuinely useful.
(c) \(F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)} = \frac{0.85/4}{0.15/45} = \frac{0.2125}{0.00333} = 63.8\). With \(df_1 = 4\), \(df_2 = 45\), this is highly significant (\(p \ll 0.001\)). Reject \(H_0\): at least one predictor has a non-zero coefficient.
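The arithmetic in parts (a) and (c) can be reproduced directly from the given summary values (\(n = 50\), \(p = 4\), \(R^2 = 0.85\)):

```python
# Adjusted R^2 and overall F statistic from the solution above.
n, p, r2 = 50, 4, 0.85
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)        # ~0.837
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))          # ~63.75
```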
\(R^2\) measures the proportion of variance explained by the model. It always increases (or stays the same) when predictors are added, even if they are irrelevant.
Adjusted \(R^2\) penalizes for the number of predictors: \(R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p-1}\). It can decrease if a useless predictor is added.
They differ substantially when: (1) many predictors are used relative to sample size, (2) some predictors are irrelevant. For example, with \(n=20\) and \(p=15\), \(R^2\) could be artificially high while adjusted \(R^2\) would be much lower.
(a) Calculate the odds ratio.
(b) Interpret it in context.
(c) What happens to the odds ratio if \(\hat{\beta}_1\) is negative?
(a) Odds ratio \(= e^{\hat{\beta}_1} = e^{0.693} = 2.0\).
(b) For each one-unit increase in the predictor, the odds of the outcome are multiplied by 2 (i.e., the odds double). Note this is a statement about odds, not probability: the probability of the event does not generally double.
(c) If \(\hat{\beta}_1 < 0\), then \(OR = e^{\hat{\beta}_1} < 1\). For example, \(\hat{\beta}_1 = -0.693\) gives \(OR = 0.5\), meaning the odds are halved for each unit increase. A negative coefficient means the predictor decreases the probability of the outcome.
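A quick numerical check of parts (a) and (c):

```python
import math

# Odds ratios from the logistic coefficients in this solution.
or_pos = math.exp(0.693)    # beta = +0.693 -> odds roughly double
or_neg = math.exp(-0.693)   # beta = -0.693 -> odds roughly halve
```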
(a) Write the hazard function.
(b) Explain the proportional hazards assumption.
(c) How would you test this assumption?
(d) What is a hazard ratio of 2.5 saying?
(a) \(h(t|\mathbf{x}) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)\), where \(h_0(t)\) is the baseline hazard function (unspecified), and \(\beta_i\) are regression coefficients.
(b) The proportional hazards assumption states that the ratio of hazards for any two individuals is constant over time: \(\frac{h(t|\mathbf{x}_1)}{h(t|\mathbf{x}_2)} = \exp(\boldsymbol{\beta}^T(\mathbf{x}_1 - \mathbf{x}_2))\). The hazard functions are proportional -- they can vary over time but their ratio does not.
(c) Test using: (1) Schoenfeld residuals plotted against time -- should show no trend. (2) Formal test: correlate scaled Schoenfeld residuals with time; significant correlation violates the assumption. (3) Log-log survival plots for categorical covariates -- curves should be parallel.
(d) A hazard ratio of 2.5 means the hazard (instantaneous risk) of the event is 2.5 times higher for a one-unit increase in the covariate, at any point in time. If comparing treated vs. control, the treated group's instantaneous risk of the event is 150% higher.
1. Linearity: (i) \(E[Y|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}\). (ii) Residuals vs. fitted values plot (look for curvature); RESET test. (iii) Biased and inconsistent coefficient estimates. (iv) Add polynomial terms, use non-linear transformations, or generalized additive models.
2. Independence: (i) \(\text{Cov}(\varepsilon_i, \varepsilon_j) = 0\) for \(i \neq j\). (ii) Durbin-Watson test (\(d \approx 2\) means no autocorrelation); plot residuals over time. (iii) OLS estimates remain unbiased but standard errors are incorrect, invalidating inference. (iv) Use GLS, Newey-West HAC standard errors, or model the autocorrelation structure.
3. Homoscedasticity: (i) \(\text{Var}(\varepsilon_i) = \sigma^2\) for all \(i\). (ii) Residuals vs. fitted plot (fan shape indicates heteroscedasticity); Breusch-Pagan or White test. (iii) OLS is still unbiased but no longer BLUE; standard errors and confidence intervals are wrong. (iv) Use weighted least squares (WLS), robust (Huber-White) standard errors, or variance-stabilizing transformations.
4. Normality: (i) \(\varepsilon_i \sim N(0, \sigma^2)\). (ii) Q-Q plot, Shapiro-Wilk test, Jarque-Bera test. (iii) Small-sample inference (t-tests, F-tests) is invalid; large-sample inference is approximately valid via CLT. (iv) Transform the response, use robust regression, or rely on large-sample asymptotics.
The Kaplan-Meier estimator is \(\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\), where \(d_i\) = events and \(n_i\) = at risk.
At \(t=1\): \(1 - \frac{2}{10} = 0.8\).
At \(t=2\): \(1 - \frac{1}{8} = 0.875\).
At \(t=3\): \(1 - \frac{3}{7} = 0.571\).
\(\hat{S}(3) = 0.8 \times 0.875 \times 0.571 = 0.400\).
Interpretation: There is an estimated 40% probability of surviving beyond time 3.
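The product-limit calculation above can be written as a short loop over the \((d_i, n_i)\) pairs:

```python
# Kaplan-Meier product over the event times used above.
def km_survival(steps):
    s = 1.0
    for d, n in steps:
        s *= 1 - d / n  # multiply in each conditional survival factor
    return s

s3 = km_survival([(2, 10), (1, 8), (3, 7)])  # S-hat(3), about 0.40
```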
(a) When should you use each?
(b) Derive the log-likelihood for logistic regression.
(c) Why can't we use OLS for binary outcomes?
(a) Linear regression: continuous outcome variable (e.g., price, weight). Logistic regression: binary outcome (0/1, yes/no). Logistic is also used for probabilities bounded in [0,1].
(b) For logistic regression with \(P(Y_i=1) = p_i = \frac{1}{1+e^{-\mathbf{x}_i^T\boldsymbol{\beta}}}\), the likelihood is \(L(\boldsymbol{\beta}) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\). The log-likelihood is \(\ell(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \log(p_i) + (1-y_i)\log(1-p_i)]\). This is maximized numerically (Newton-Raphson or IRLS) since no closed-form solution exists.
(c) OLS for binary outcomes fails because: (1) Predicted values are not bounded in [0,1], so predicted 'probabilities' can be negative or exceed 1. (2) Residuals are heteroscedastic (variance depends on \(p\)). (3) Residuals cannot be normally distributed when the outcome is binary. The linear probability model violates all OLS assumptions.
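The log-likelihood from part (b) can be evaluated at any candidate \(\boldsymbol{\beta}\). A minimal sketch on a tiny made-up dataset (scalar \(x_i\), no intercept, for brevity), showing that the log-likelihood increases as \(\beta\) moves in the direction suggested by the data:

```python
import math

# Logistic log-likelihood l(beta) = sum[y log(p) + (1-y) log(1-p)].
def log_likelihood(beta, xs, ys):
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-beta * x))  # logistic probability
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

xs, ys = [-2, -1, 1, 2], [0, 0, 1, 1]
better = log_likelihood(1.0, xs, ys) > log_likelihood(0.0, xs, ys)
```

In practice the maximizer is found numerically (Newton-Raphson or IRLS), as noted above; this sketch only evaluates the objective.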
Censoring occurs when the exact event time is unknown for some subjects. Right-censoring means we know the subject survived at least until a certain time, but the actual event time is unknown (it is to the 'right' of the observed time).
Example 1: A clinical trial ends after 5 years. A patient who is still alive at the end of the study is right-censored -- we know they survived at least 5 years, but not when (or if) they will experience the event.
Example 2: A customer churn study tracks users for 12 months. A customer who changes their phone number mid-study and can no longer be contacted (lost to follow-up) is right-censored at their last observed active date.
Censoring is a key challenge because simply ignoring censored observations would bias survival estimates downward.
(a) What threshold indicates an influential point?
(b) What should you do about it?
(a) Common thresholds for Cook's distance: \(D_i > 1\) is the traditional rule (some use \(D_i > 4/n = 4/50 = 0.08\) as a more sensitive cutoff). By either criterion, \(D_{23} = 1.2 > 1\), so observation 23 is highly influential.
(b) Steps: (1) Investigate the observation -- is it a data entry error or a genuine outlier? (2) Fit the model with and without observation 23 and compare coefficients. If they change substantially, the point is driving the results. (3) Consider robust regression methods. (4) Report results both with and without the influential point. Never simply delete observations without justification.
1. State hypotheses: \(H_0\) (null) = status quo or no effect. \(H_1\) (alternative) = what we want to show.
2. Choose significance level \(\alpha\) (typically 0.05).
3. Select and compute the test statistic: a standardized value (e.g., t, z, F) that measures how far the sample result is from the null hypothesis value.
4. Find the p-value: the probability of observing a test statistic at least as extreme as the one computed, assuming \(H_0\) is true.
5. Decision rule: If \(p \leq \alpha\), reject \(H_0\). If \(p > \alpha\), fail to reject \(H_0\). Alternatively, compare the test statistic to a critical value.
6. State conclusion in context of the problem.
(a) Set up hypotheses.
(b) Calculate the t-statistic.
(c) Find the p-value region.
(d) Conclude at \(\alpha = 0.05\).
(a) \(H_0: \mu = 500\) (widgets weigh 500g). \(H_1: \mu \neq 500\) (two-sided test).
(b) \(t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{497 - 500}{8/\sqrt{25}} = \frac{-3}{1.6} = -1.875\).
(c) With \(df = 24\), we look up \(|t| = 1.875\). The critical value at \(\alpha = 0.05\) (two-sided) is \(t_{0.025, 24} = 2.064\). Since \(1.875 < 2.064\), the p-value is between 0.05 and 0.10 (approximately \(p \approx 0.073\)).
(d) Since \(p \approx 0.073 > 0.05\), we fail to reject \(H_0\). There is insufficient evidence at the 5% level to conclude the widgets deviate from 500g.
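The test statistic in part (b), computed directly from the summary values:

```python
import math

# One-sample t statistic: t = (xbar - mu0) / (s / sqrt(n)).
xbar, mu0, s, n = 497, 500, 8, 25
t = (xbar - mu0) / (s / math.sqrt(n))  # -3 / 1.6 = -1.875
```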
(a) Define the p-value formally.
(b) Show \(P(\text{p-value} \leq \alpha \mid H_0) = \alpha\).
(a) The p-value is \(p = P(T \geq t_{obs} \mid H_0)\) for a one-sided test, where \(T\) is the test statistic and \(t_{obs}\) is the observed value. Equivalently, \(p = 1 - F(t_{obs})\) where \(F\) is the CDF of \(T\) under \(H_0\).
(b) Under \(H_0\), let \(U = F(T)\) where \(F\) is the CDF of \(T\). By the probability integral transform, \(U \sim \text{Uniform}(0,1)\) (assuming \(T\) is continuous). The p-value is \(p = 1 - U\), which is also Uniform(0,1).
Therefore: \(P(p \leq \alpha \mid H_0) = P(1 - F(T) \leq \alpha) = P(F(T) \geq 1-\alpha) = \alpha\).
This proves that under \(H_0\), the probability of getting a p-value below \(\alpha\) is exactly \(\alpha\), confirming the Type I error rate is controlled.
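The uniformity result can be checked by simulation. A sketch using a two-sided z-test on \(N(0,1)\) samples (known \(\sigma\)): the fraction of p-values at or below \(\alpha = 0.05\) should be close to 0.05.

```python
import math
import random

# Monte Carlo check: under H0, P(p <= alpha) = alpha.
random.seed(0)

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

alpha, n, reps = 0.05, 20, 5000
hits = 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / math.sqrt(n))  # z = xbar / (sigma/sqrt(n))
    p = 2 * (1 - phi(abs(z)))                   # two-sided p-value
    hits += p <= alpha
rejection_rate = hits / reps  # should be near 0.05
```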
(a) Define Type I and Type II errors in the context of this drug trial.
(b) The FDA requires power of 0.80 to detect a clinically meaningful effect of \(\delta = 5\) units with \(\sigma = 12\). What minimum sample size is needed (per group, two-sample test)?
(c) If the company uses \(\alpha = 0.01\) instead (to reduce false approvals), how does this affect power and required sample size?
(a) Type I error: Approving an ineffective drug (rejecting \(H_0: \mu_{drug} = \mu_{placebo}\) when there is truly no difference). Cost: patients take a useless drug with potential side effects. Type II error: Failing to approve an effective drug. Cost: patients miss a beneficial treatment.
(b) For a two-sample t-test: \(n = \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\delta^2} = \frac{2(1.96 + 0.842)^2 \times 144}{25} = \frac{2 \times 7.85 \times 144}{25} = \frac{2260.8}{25} \approx 91\) per group. Total: 182 subjects.
(c) With \(\alpha = 0.01\): \(z_{\alpha/2} = 2.576\), so \(n = \frac{2(2.576 + 0.842)^2 \times 144}{25} = \frac{2 \times 11.67 \times 144}{25} = \frac{3360.9}{25} \approx 135\) per group. Power decreases (for fixed \(n\)) or sample size must increase by ~48% to maintain 80% power. Stricter \(\alpha\) protects against false approvals but requires more subjects.
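The per-group sample sizes in parts (b) and (c) follow directly from the formula, rounding up to the next whole subject:

```python
import math

# n = 2 (z_{alpha/2} + z_beta)^2 sigma^2 / delta^2, rounded up.
def n_per_group(z_half_alpha, z_beta, sigma, delta):
    return math.ceil(2 * (z_half_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

n_05 = n_per_group(1.960, 0.842, 12, 5)  # alpha = 0.05 -> 91 per group
n_01 = n_per_group(2.576, 0.842, 12, 5)  # alpha = 0.01 -> 135 per group
```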
(a) State hypotheses.
(b) Interpret the result.
(c) What post-hoc test would you use?
(d) Why not just do pairwise t-tests?
(a) \(H_0: \mu_1 = \mu_2 = \mu_3\) (all methods produce the same mean outcome). \(H_1\): At least one mean differs.
(b) With \(p = 0.020 < 0.05\), we reject \(H_0\). There is significant evidence that at least one teaching method produces a different mean outcome. The \(F = 4.52\) indicates the between-group mean square is 4.52 times the within-group mean square.
(c) Tukey's HSD (Honestly Significant Difference) test for all pairwise comparisons. It controls the family-wise error rate. Alternatively, Bonferroni correction or Scheffe's method.
(d) With 3 groups, there are \(\binom{3}{2} = 3\) pairwise comparisons. Each at \(\alpha = 0.05\) gives a family-wise error rate of \(1 - (1-0.05)^3 = 0.143\). The inflated Type I error rate makes individual t-tests unreliable without correction.
(a) Show that rejecting \(H_0: \mu = \mu_0\) at level \(\alpha\) is equivalent to \(\mu_0\) being outside the \((1-\alpha)\) confidence interval.
(a) The two-sided test rejects \(H_0: \mu = \mu_0\) when \(|t| = \left|\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right| > t_{\alpha/2, n-1}\).
This is equivalent to \(\mu_0 < \bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}}\) or \(\mu_0 > \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}\).
But \(\left[\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}},\; \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}\right]\) is exactly the \((1-\alpha)\) confidence interval for \(\mu\).
Therefore, rejecting \(H_0\) at level \(\alpha\) is equivalent to \(\mu_0 \notin CI_{1-\alpha}\). This duality means every confidence interval implicitly tests all possible null values: values inside the CI are 'not rejected' and values outside are 'rejected'.
Power = \(1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})\). It is the probability of correctly detecting a real effect.
Three factors that affect power:
1. Sample size (\(n\)): Larger \(n\) increases power. More data provides more precise estimates, making it easier to detect true differences.
2. Effect size: Larger true effects are easier to detect. A difference of 10 units is easier to find than a difference of 1 unit.
3. Significance level (\(\alpha\)): Larger \(\alpha\) increases power (but also increases Type I error). Using \(\alpha = 0.10\) gives more power than \(\alpha = 0.01\).
Additional factor: variance (\(\sigma^2\)). Lower variance increases power because the signal-to-noise ratio improves.
(a) Set up hypotheses.
(b) Calculate the test statistic.
(c) Is the difference significant at \(\alpha = 0.05\)?
(a) \(H_0: p_A = p_B\) (no difference in conversion rates). \(H_1: p_A \neq p_B\) (two-sided test).
(b) Pooled proportion: \(\hat{p} = \frac{0.12 \times 500 + 0.15 \times 500}{1000} = \frac{60 + 75}{1000} = 0.135\).
Standard error: \(SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)} = \sqrt{0.135 \times 0.865 \times \frac{2}{500}} = \sqrt{0.000467} = 0.0216\).
Test statistic: \(z = \frac{0.12 - 0.15}{0.0216} = \frac{-0.03}{0.0216} = -1.389\).
(c) The critical value for \(\alpha = 0.05\) two-sided is \(z_{0.025} = 1.96\). Since \(|z| = 1.389 < 1.96\), we fail to reject \(H_0\). The 3 percentage point difference is not statistically significant at the 5% level. The p-value is approximately 0.165.
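The pooled two-proportion z-test from parts (b) and (c), computed directly:

```python
import math

# Pooled two-proportion z-test (equal group sizes here).
pA, pB, nA, nB = 0.12, 0.15, 500, 500
p_pool = (pA * nA + pB * nB) / (nA + nB)                 # 0.135
se = math.sqrt(p_pool * (1 - p_pool) * (1 / nA + 1 / nB))
z = (pA - pB) / se                                        # about -1.39
significant = abs(z) > 1.96                               # False at the 5% level
```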
(a) Name a non-parametric alternative to one-way ANOVA.
(b) What assumptions does it relax?
(c) What is the cost of using it when ANOVA assumptions hold?
(a) The Kruskal-Wallis test is the non-parametric alternative to one-way ANOVA.
(b) It relaxes: (1) Normality -- no assumption about the distribution shape. (2) Homoscedasticity -- less sensitive to unequal variances. It works on ranks rather than raw values, so it is robust to outliers and skewed distributions. It still requires independent observations.
(c) When ANOVA assumptions are met, the Kruskal-Wallis test has lower power (approximately 95.5% asymptotic relative efficiency compared to the F-test for normal data). This means you need about 5% more observations to achieve the same power. The 'cost' is a higher probability of Type II errors -- missing real differences.
(a) If you run 20 independent tests at \(\alpha = 0.05\), what is the probability of at least one Type I error?
(b) Describe the Bonferroni correction.
(a) \(P(\text{at least one Type I error}) = 1 - P(\text{no Type I errors}) = 1 - (1 - 0.05)^{20} = 1 - 0.95^{20} = 1 - 0.358 = 0.642\). There is a 64.2% chance of at least one false positive -- far above the nominal 5%.
(b) Bonferroni correction: divide the significance level by the number of tests. Use \(\alpha^* = \alpha / m = 0.05 / 20 = 0.0025\) for each individual test. This ensures the family-wise error rate (FWER) is at most \(\alpha = 0.05\). The correction is conservative -- it may reduce power, especially with many tests. Alternatives like the Holm-Bonferroni method are less conservative while still controlling FWER.
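Both calculations in one short sketch, showing that the Bonferroni level brings the family-wise error rate back under \(\alpha\) (for independent tests):

```python
# FWER for m independent tests, uncorrected vs Bonferroni-corrected.
m, alpha = 20, 0.05
fwer_uncorrected = 1 - (1 - alpha) ** m        # about 0.642
alpha_bonf = alpha / m                          # 0.0025 per test
fwer_corrected = 1 - (1 - alpha_bonf) ** m      # just under 0.05
```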
PCA (Principal Component Analysis) is a dimensionality reduction technique that transforms correlated variables into a smaller set of uncorrelated variables called principal components.
Goal: Find new axes (principal components) that capture the maximum variance in the data, allowing us to reduce dimensionality while retaining as much information as possible.
Principal components are linear combinations of the original variables: \(PC_k = w_{k1}X_1 + w_{k2}X_2 + \cdots + w_{kp}X_p\). PC1 captures the most variance, PC2 (orthogonal to PC1) captures the next most, and so on.
The weights \(w_{ki}\) are the eigenvectors of the covariance (or correlation) matrix, and the variance captured by each PC equals its eigenvalue.
(a) How much variance does PC1 explain?
(b) How many components would you retain using the Kaiser criterion?
(c) Using the 80% cumulative variance rule?
(a) Total variance = \(\sum \lambda_i = 2.8 + 1.3 + 0.5 + 0.3 + 0.1 = 5.0\) (equals the number of variables for a correlation matrix). PC1 explains \(2.8/5.0 = 56\%\) of the total variance.
(b) Kaiser criterion: retain components with \(\lambda > 1\). Here \(\lambda_1 = 2.8 > 1\) and \(\lambda_2 = 1.3 > 1\), but \(\lambda_3 = 0.5 < 1\). Retain 2 components.
(c) Cumulative variance: PC1 = 56%, PC1+PC2 = 56% + 26% = 82%. Since 82% > 80%, retain 2 components. (With only PC1 at 56%, we would not meet the threshold.)
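The variance-explained bookkeeping for parts (a)-(c), from the given eigenvalues:

```python
# Proportion of variance, Kaiser criterion, and cumulative variance.
eigenvalues = [2.8, 1.3, 0.5, 0.3, 0.1]
total = sum(eigenvalues)                          # 5.0 (correlation matrix)
prop = [lam / total for lam in eigenvalues]       # PC1 explains 0.56
kaiser_k = sum(lam > 1 for lam in eigenvalues)    # retain 2 components
cum2 = prop[0] + prop[1]                          # 0.82 >= 0.80
```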
(a) State the optimization problem.
(b) Show it reduces to an eigenvalue problem.
(c) Why must we use the correlation matrix (not covariance) when variables have different scales?
(a) Find weights \(\mathbf{w}_1\) to maximize the variance of \(PC_1 = \mathbf{w}_1^T \mathbf{X}\): \(\max_{\mathbf{w}_1} \text{Var}(\mathbf{w}_1^T \mathbf{X}) = \max_{\mathbf{w}_1} \mathbf{w}_1^T \boldsymbol{\Sigma} \mathbf{w}_1\) subject to \(\mathbf{w}_1^T \mathbf{w}_1 = 1\) (unit length constraint to avoid trivial solution).
(b) Using a Lagrange multiplier: \(\mathcal{L} = \mathbf{w}_1^T \boldsymbol{\Sigma} \mathbf{w}_1 - \lambda(\mathbf{w}_1^T \mathbf{w}_1 - 1)\). Taking the derivative and setting to zero: \(\boldsymbol{\Sigma} \mathbf{w}_1 = \lambda \mathbf{w}_1\). This is an eigenvalue equation. The maximum variance \(\mathbf{w}_1^T \boldsymbol{\Sigma} \mathbf{w}_1 = \lambda\), so the first PC uses the eigenvector corresponding to the largest eigenvalue.
(c) The covariance matrix is sensitive to scale: a variable measured in meters will dominate one measured in centimeters simply due to its larger variance. The correlation matrix standardizes all variables to unit variance, ensuring each variable contributes equally regardless of measurement units.
(a) Apply three different retention criteria (Kaiser, scree/elbow, cumulative variance at 70%) and compare their recommendations.
(b) A parallel analysis generates random eigenvalues \((1.45, 1.25, 1.10, 0.98, 0.87, 0.77, 0.68, 0.50)\). How many components does parallel analysis retain?
(c) The criteria disagree. How do you make a final decision?
(a) Kaiser (\(\lambda > 1\)): Retain 3 components (\(\lambda_1=3.2, \lambda_2=1.8, \lambda_3=1.1\)). Scree: The largest drop is between \(\lambda_1\) and \(\lambda_2\) (1.4), then \(\lambda_2\) to \(\lambda_3\) (0.7), then \(\lambda_3\) to \(\lambda_4\) (0.2) -- elbow at 3 or possibly 2. Cumulative variance: 2 components = (3.2+1.8)/8 = 62.5%; 3 components = (3.2+1.8+1.1)/8 = 76.25% > 70%. Retain 3 components.
(b) Parallel analysis: retain components where actual \(\lambda\) exceeds random \(\lambda\). \(\lambda_1=3.2 > 1.45\) (retain), \(\lambda_2=1.8 > 1.25\) (retain), \(\lambda_3=1.1 = 1.10\) (borderline -- typically not retained since not strictly greater), \(\lambda_4=0.9 < 0.98\) (stop). Parallel analysis retains 2 components.
(c) Consider: (1) Interpretability -- can the components be meaningfully named? (2) Theory -- does domain knowledge suggest 2 or 3 constructs? (3) Stability -- do results change across subsamples? Parallel analysis is generally considered the most accurate criterion. Start with 2 components and examine if adding a third improves interpretability.
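The parallel-analysis rule from part (b) is a simple componentwise comparison, stopping at the first component whose observed eigenvalue does not strictly exceed the random one:

```python
# Parallel analysis: retain while observed eigenvalue > random eigenvalue.
observed = [3.2, 1.8, 1.1, 0.9]
random_ev = [1.45, 1.25, 1.10, 0.98]
retained = 0
for obs, rnd in zip(observed, random_ev):
    if obs > rnd:          # strict inequality: the borderline 1.1 vs 1.10 fails
        retained += 1
    else:
        break
```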
(a) Name PC1 and PC2.
(b) What proportion of variance should these components explain to be useful?
(a) PC1 could be named 'Financial Health' or 'Profitability Factor' since it captures the common variation in financial ratios. PC2 could be named 'Market Conditions' or 'Market Sentiment Factor' since it captures market-related variation.
(b) As a general guideline, the retained components should explain at least 60-80% of the total variance to be considered useful. In social sciences, 60% may be acceptable; in natural sciences, 80%+ is often expected. The exact threshold depends on the field and application. If these two components explain less than 50%, additional components may be needed, or the data may not have a clear low-dimensional structure.
(a) State the mathematical model for each.
(b) When is PCA preferred over EFA?
(c) What is the role of communalities in EFA?
(d) Explain the difference between a loading in PCA vs EFA.
(a) PCA: \(\mathbf{Z} = \mathbf{X}\mathbf{W}\) -- components are linear combinations of observed variables. No error term. EFA: \(\mathbf{X} = \boldsymbol{\Lambda}\mathbf{F} + \boldsymbol{\varepsilon}\) -- observed variables are linear combinations of latent factors plus unique error.
(b) PCA is preferred for: (1) pure dimensionality reduction, (2) creating composite scores, (3) preprocessing for regression (removing multicollinearity). PCA makes no assumptions about underlying structure.
(c) Communality \(h_i^2\) is the proportion of variable \(i\)'s variance explained by the common factors. Uniqueness = \(1 - h_i^2\) represents variance specific to that variable plus error. Low communality means the variable is poorly represented by the factor model.
(d) In PCA, loadings are correlations between variables and components (which are exact linear combinations). In EFA, loadings represent the correlation between observed variables and latent (unobserved) factors, accounting for unique variance. EFA loadings are generally smaller than PCA loadings because they exclude unique variance.
Rotation is a transformation applied to the initial factor solution to achieve a simpler, more interpretable structure.
Why needed: The initial extraction often produces factors where many variables load moderately on multiple factors, making interpretation difficult. Rotation redistributes variance among factors so that each variable loads strongly on one factor and weakly on others ('simple structure').
The total variance explained does not change after rotation -- only the distribution across factors changes. Rotation does not improve the model fit; it only improves interpretability.
(a) Which allows correlated factors?
(b) When would you choose each?
(c) What does 'simple structure' mean?
(a) Varimax is an orthogonal rotation -- factors remain uncorrelated. Promax is an oblique rotation -- factors are allowed to be correlated.
(b) Use varimax when you expect (or want to enforce) uncorrelated factors, or when simplicity is desired. Use promax when you believe the underlying factors are genuinely correlated (common in psychology, e.g., intelligence subfactors). Promax often provides a more realistic solution but is harder to interpret.
(c) Simple structure (Thurstone's criteria): Each variable loads highly on one factor and has near-zero loadings on all other factors. Each factor has a few variables with high loadings and the rest near zero. This makes factors distinctly interpretable.
(a) Which variable is poorly explained?
(b) Calculate its uniqueness.
(c) What would you recommend doing about it?
(a) Variable 4 with \(h^2 = 0.33\) is poorly explained. Only 33% of its variance is accounted for by the two common factors.
(b) Uniqueness = \(1 - h^2 = 1 - 0.33 = 0.67\). This means 67% of variable 4's variance is unique (not shared with the common factors).
(c) Options: (1) Consider removing variable 4 from the analysis -- it may not belong to the factor structure being measured. (2) Add a third factor -- perhaps variable 4 represents a separate construct. (3) Examine the variable's content -- it may be poorly measured or conceptually distinct. (4) If theoretically important, keep it but note the low communality as a limitation. Generally, communalities below 0.40 are considered problematic.
(a) What does this mean about the variables?
(b) If an observation is far from the origin in the direction of these arrows, what does that indicate?
(a) Two arrows pointing in the same direction indicate that the corresponding variables are highly positively correlated. They share similar patterns of variation across observations. The angle between arrows approximates the correlation: small angle means high positive correlation, 90 degrees means no correlation, 180 degrees means strong negative correlation.
(b) An observation far from the origin in the direction of these arrows has high values on both variables (relative to other observations). The distance from the origin represents how extreme the observation is in the PC space. This observation scores highly on the principal component defined by those variable loadings.
1. k-Means (Partitional): Advantage -- fast, scales well to large datasets, \(O(nkt)\) complexity for \(n\) points, \(k\) clusters, and \(t\) iterations. Disadvantage -- requires pre-specifying \(k\), assumes spherical clusters, sensitive to initialization.
2. Hierarchical (Agglomerative): Advantage -- no need to pre-specify \(k\), produces a dendrogram showing cluster relationships at all levels. Disadvantage -- \(O(n^2 \log n)\) complexity, slow for large datasets, cannot undo merges.
3. Density-Based (e.g., DBSCAN): Advantage -- finds arbitrarily shaped clusters, automatically identifies outliers as noise. Disadvantage -- struggles with clusters of varying density, sensitive to epsilon and minPts parameters.
(a) Show your work.
(b) Which metric is more sensitive to outliers?
(c) What is the Minkowski distance with \(p = 3\)?
(a) Euclidean: \(d_E = \sqrt{(3-1)^2 + (4-2)^2 + (1-5)^2} = \sqrt{4 + 4 + 16} = \sqrt{24} = 4.899\).
Manhattan: \(d_M = |3-1| + |4-2| + |1-5| = 2 + 2 + 4 = 8\).
(b) Euclidean is more sensitive to outliers because squaring amplifies large differences. The dimension with difference 4 contributes \(4^2 = 16\) to Euclidean (dominant) but only 4 to Manhattan.
(c) Minkowski with \(p=3\): \(d_3 = \left(|3-1|^3 + |4-2|^3 + |1-5|^3\right)^{1/3} = (8 + 8 + 64)^{1/3} = 80^{1/3} = 4.309\).
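All three distances are special cases of the Minkowski formula (\(p=2\) Euclidean, \(p=1\) Manhattan), so one function covers the whole question:

```python
# Minkowski distance d_p(a, b) = (sum |a_i - b_i|^p)^(1/p).
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A, B = (3, 4, 1), (1, 2, 5)
d_euclid = minkowski(A, B, 2)     # sqrt(24) ~ 4.899
d_manhattan = minkowski(A, B, 1)  # 8
d_mink3 = minkowski(A, B, 3)      # 80^(1/3) ~ 4.309
```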
(a) Write the objective function.
(b) Prove that each iteration decreases or maintains WCSS.
(c) Why can k-Means converge to a local minimum?
(d) How does k-Means++ address the initialization problem?
(a) Objective: minimize Within-Cluster Sum of Squares: \(WCSS = \sum_{k=1}^K \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\), where \(C_k\) is cluster \(k\) and \(\boldsymbol{\mu}_k\) is its centroid.
(b) The algorithm alternates two steps: (1) Assignment: assign each point to the nearest centroid -- this cannot increase WCSS because each point moves to minimize its contribution. (2) Update: recompute centroids as cluster means -- the mean minimizes sum of squared distances within a cluster. Since both steps decrease (or maintain) WCSS and WCSS is bounded below by 0, the algorithm converges.
(c) The objective function is non-convex with multiple local minima. Different initializations lead to different final solutions. The greedy alternating minimization is only guaranteed to find a local minimum, not the global one.
(d) k-Means++ selects initial centroids sequentially: the first is random, each subsequent centroid is chosen with probability proportional to \(D(x)^2\) (squared distance to the nearest existing centroid). This spreads out initial centroids and provides an \(O(\log k)\)-competitive approximation guarantee.
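The two alternating steps from part (b) can be sketched in a few lines. This is plain Lloyd's algorithm with random initialization (not k-Means++), on a made-up 2-D dataset with two well-separated groups:

```python
import random

# Minimal Lloyd's algorithm: alternate assignment and update steps.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive random initialization
    for _ in range(iters):
        # Step 1 (assignment): each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Step 2 (update): each centroid becomes its cluster's mean.
        for j, cl in enumerate(clusters):
            if cl:  # skip empty clusters to avoid division by zero
                centroids[j] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents = kmeans(pts, 2)  # centroids near (1/3, 1/3) and (31/3, 31/3)
```

Since both steps can only decrease (or maintain) WCSS, the loop converges; on badly initialized or less separated data it may stop at a local minimum, which is exactly the problem k-Means++ mitigates.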
A dendrogram is a tree-like diagram that shows the hierarchical clustering of observations. The y-axis shows the distance (or dissimilarity) at which clusters merge. Observations start as individual leaves at the bottom and are progressively merged into larger clusters.
To determine the number of clusters: look for the largest vertical gap (distance jump) in the dendrogram. Cut the dendrogram horizontally at that gap. The number of vertical lines the horizontal cut crosses equals the number of clusters.
Example: If there is a large gap between merge distances 5 and 12, cutting at height 8 might reveal 3 clusters. The large gap suggests those clusters are well-separated.
(a) Define each mathematically.
(b) Which tends to produce 'chaining'?
(c) Which tends to produce compact clusters?
(a) Single linkage: \(d(A,B) = \min_{a \in A, b \in B} d(a,b)\) (minimum distance between any pair).
Complete linkage: \(d(A,B) = \max_{a \in A, b \in B} d(a,b)\) (maximum distance between any pair).
Average linkage: \(d(A,B) = \frac{1}{|A||B|}\sum_{a \in A}\sum_{b \in B} d(a,b)\) (average of all pairwise distances).
Ward's method: merges the pair of clusters that results in the smallest increase in total within-cluster variance (WCSS).
(b) Single linkage tends to produce 'chaining' -- elongated, string-like clusters where points are connected through a chain of close neighbors, even if the endpoints are far apart.
(c) Complete linkage and Ward's method produce compact, roughly spherical clusters. Ward's is particularly good at finding equal-sized, compact clusters.
(a) Define \(a(i)\) and \(b(i)\).
(b) What is the range of \(s(i)\)?
(c) An observation has \(s(i) = -0.3\) -- what does this mean?
(d) How do you use the average silhouette to select \(k\)?
(a) \(a(i)\) = average distance from observation \(i\) to all other points in the same cluster (cohesion). \(b(i)\) = minimum average distance from observation \(i\) to all points in any other cluster (separation to nearest neighboring cluster).
(b) The range is \([-1, 1]\). \(s(i) \approx 1\): well-clustered (far from neighboring clusters). \(s(i) \approx 0\): on the boundary between clusters. \(s(i) \approx -1\): likely assigned to the wrong cluster.
(c) \(s(i) = -0.3\) means \(a(i) > b(i)\): the observation is closer to points in another cluster than to points in its own cluster. It is likely misclassified and would fit better in the neighboring cluster.
(d) Compute the average silhouette width for different values of \(k\). Choose the \(k\) that maximizes the average silhouette. A higher average indicates better-defined clusters. Values above 0.5 suggest reasonable structure; above 0.7 indicates strong structure.
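The silhouette formula \(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\) can be traced on a toy 1-D dataset (distinct values assumed, so excluding the point itself by value is safe):

```python
# Silhouette width for one observation on 1-D data with cluster labels.
def silhouette(i, labels, xs):
    own = [x for x, l in zip(xs, labels) if l == labels[i]]
    # a(i): mean distance to the other members of its own cluster
    a = sum(abs(xs[i] - x) for x in own if x != xs[i]) / (len(own) - 1)
    # b(i): smallest mean distance to any other cluster
    b = min(
        sum(abs(xs[i] - x) for x, l in zip(xs, labels) if l == other)
        / sum(1 for l in labels if l == other)
        for other in set(labels) if other != labels[i]
    )
    return (b - a) / max(a, b)

xs = [1.0, 2.0, 8.0, 9.0]
labels = [0, 0, 1, 1]
s0 = silhouette(0, labels, xs)  # well-clustered point: a=1, b=7.5, s~0.867
```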
(a) Calculate the Euclidean distance between customer A = (80000, 35, 4, 1) and B = (82000, 55, 2, 0) with and without standardization.
(b) Which variable dominates the unstandardized distance? What percentage of the squared distance does it contribute?
(c) How should you handle the binary variable? Is Euclidean distance appropriate for mixed-type data?
(a) Unstandardized: \(d = \sqrt{(80000-82000)^2 + (35-55)^2 + (4-2)^2 + (1-0)^2} = \sqrt{4000000 + 400 + 4 + 1} = \sqrt{4000405} = 2000.1\). With z-score standardization (assuming \(\sigma_{inc}=40000, \sigma_{age}=15, \sigma_{sat}=1.2, \sigma_{gen}=0.5\)): differences become \((-0.05, -1.33, 1.67, 2.0)\), giving \(d = \sqrt{0.0025 + 1.78 + 2.78 + 4.0} = \sqrt{8.56} = 2.93\).
(b) Income contributes \(4000000/4000405 = 99.99\%\) of the squared distance. Age, satisfaction, and gender are effectively ignored. This makes the clustering almost entirely an income-based partition.
(c) Euclidean distance treats the binary variable as continuous, which is problematic. Better approaches: (1) Use Gower's distance, which handles mixed types natively. (2) One-hot encode categorical variables and then standardize. (3) Use k-Prototypes algorithm (k-Means variant for mixed data). For the binary variable specifically, matching/mismatching contributes 0 or 1, but this must be properly scaled relative to continuous variables.
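The numbers in (a) and (b) are easy to reproduce. A short Python check, using the same assumed standard deviations as the solution:

```python
import math

# Euclidean distance between customers A and B, raw and after z-score
# scaling with the assumed standard deviations from the solution.
A = [80000, 35, 4, 1]
B = [82000, 55, 2, 0]
sd = [40000, 15, 1.2, 0.5]   # assumed sigmas (income, age, satisfaction, gender)

d_raw = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))
d_std = math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(A, B, sd)))

# fraction of the raw squared distance contributed by income alone
income_share = (A[0] - B[0]) ** 2 / d_raw ** 2
print(round(d_raw, 1), round(d_std, 2), round(income_share, 4))
# -> 2000.1 2.93 0.9999
```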
(a) Where is the 'elbow'?
(b) Why might the elbow be ambiguous?
(c) What other method could confirm your choice?
(a) The elbow is at \(k = 3\). The largest drop is from \(k=2\) to \(k=3\) (decrease of 170). After \(k=3\), improvements are marginal: \(k=3\) to \(k=4\) drops only 30, and further increases give diminishing returns (10, 5).
(b) The elbow can be ambiguous when: (1) there is no sharp bend (gradual decrease), (2) multiple bends exist, (3) the true number of clusters is not well-defined in the data. In this case, one could also argue for \(k=4\).
(c) The silhouette method: compute average silhouette width for each \(k\) and choose the maximum. The gap statistic is another option: it compares WCSS to that expected under a null reference distribution. These methods provide complementary evidence to support the elbow method choice.
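The elbow reasoning in (a) amounts to comparing successive WCSS drops. A sketch with hypothetical WCSS values chosen to be consistent with the drops cited in the solution (the original table is not reproduced here):

```python
# Hypothetical WCSS values for k = 1..6 (assumed, consistent with the
# drops quoted in the solution: 170 from k=2 to k=3, then 30, 10, 5).
wcss = {1: 400, 2: 250, 3: 80, 4: 50, 5: 40, 6: 35}

drops = {k: wcss[k - 1] - wcss[k] for k in range(2, 7)}
print(drops)  # {2: 150, 3: 170, 4: 30, 5: 10, 6: 5}

# Simple elbow heuristic: the k whose drop most exceeds the next drop,
# i.e. where the curve bends most sharply.
elbow = max(range(2, 6), key=lambda k: drops[k] - drops[k + 1])
print(elbow)  # -> 3
```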
(a) How do their objective functions differ?
(b) Which is more robust to outliers and why?
(c) What is the computational complexity of each?
(d) When would you prefer k-Medoids?
(a) k-Means minimizes \(\sum_{k=1}^K \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\) where \(\boldsymbol{\mu}_k\) is the mean (centroid). k-Medoids minimizes \(\sum_{k=1}^K \sum_{i \in C_k} d(\mathbf{x}_i, \mathbf{m}_k)\) where \(\mathbf{m}_k\) is an actual data point (the medoid) and \(d\) can be any dissimilarity measure.
(b) k-Medoids is more robust because: (1) it uses actual data points as centers, not means that can be pulled by outliers; (2) it minimizes sum of distances (not squared distances), reducing the influence of extreme values; (3) it works with any dissimilarity measure, not just Euclidean.
(c) k-Means: \(O(nk)\) per iteration (so \(O(nkt)\) in total for \(t\) iterations), where \(n\) = points and \(k\) = clusters. k-Medoids (PAM): \(O(k(n-k)^2)\) per iteration -- significantly slower because it evaluates all possible swaps of medoids with non-medoids.
(d) Prefer k-Medoids when: (1) data contains outliers, (2) you need interpretable centers (actual data points), (3) you want to use non-Euclidean distances (e.g., Manhattan, cosine), (4) the dataset is not too large (\(n < 10000\)).
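Point (b) can be seen in one dimension: a single outlier drags the mean far from the bulk of the data but barely affects the medoid. A toy Python illustration:

```python
# Robustness of the medoid vs. the mean on toy data with one outlier.
data = [1, 2, 3, 4, 100]

# k-Means-style center: the mean, pulled strongly toward the outlier.
mean = sum(data) / len(data)

# k-Medoids-style center: the data point minimizing the sum of absolute
# distances to all points -- it stays in the bulk of the data.
medoid = min(data, key=lambda m: sum(abs(x - m) for x in data))

print(mean, medoid)  # -> 22.0 3
```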
(a) How would you validate the clustering?
(b) Name two internal validation metrics.
(c) Name one external validation metric (if labels are available).
(a) Validation approaches: (1) Internal validation -- assess cluster quality using the data itself (compactness, separation). (2) Stability validation -- re-cluster subsamples and check consistency. (3) Visual validation -- examine cluster profiles and check if they make business sense. (4) External validation -- compare to known labels if available.
(b) Internal metrics: (1) Silhouette width: measures how similar points are to their own cluster vs. neighboring clusters (range [-1,1], higher is better). (2) Calinski-Harabasz index (variance ratio criterion): ratio of between-cluster to within-cluster variance (higher is better).
(c) External metric: Adjusted Rand Index (ARI): measures agreement between clustering and true labels, adjusted for chance. Range [-1,1], where 1 = perfect agreement, 0 = random agreement. Other options: Normalized Mutual Information (NMI) or purity.
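The Adjusted Rand Index in (c) follows directly from its contingency-table definition. A self-contained sketch (in practice one would call `sklearn.metrics.adjusted_rand_score`):

```python
from math import comb
from collections import Counter

# ARI from its contingency-table definition: (index - expected index)
# divided by (max index - expected index), using pair counts.
def ari(truth, pred):
    n = len(truth)
    pairs = Counter(zip(truth, pred))                 # contingency cells n_ij
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())  # row sums
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())   # column sums
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
print(ari(truth, [1, 1, 1, 0, 0, 0]))  # same partition, labels swapped -> 1.0
print(ari(truth, [0, 0, 1, 1, 0, 1]))  # partial agreement -> below 1
```

Note that ARI is invariant to relabeling the clusters, which is exactly why it (and not raw accuracy) is used to compare a clustering to ground-truth labels.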
The three components are:
1. Trend: Long-term increase or decrease in the data. Example: GDP growing over decades.
2. Seasonality: Regular, repeating pattern at fixed periods. Example: retail sales peaking in December every year.
3. Noise (Residual/Irregular): Random fluctuations that cannot be attributed to trend or seasonality.
Example exhibiting all three: Monthly airline passenger numbers -- upward trend (growing travel demand), seasonal peaks (summer months), and random month-to-month variation. The additive decomposition is \(Y_t = T_t + S_t + \varepsilon_t\); multiplicative is \(Y_t = T_t \times S_t \times \varepsilon_t\).
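A key step in classical additive decomposition is the centered moving average, which removes a seasonal pattern with period equal to its window. A small Python sketch on a constructed series (toy data, not real passenger numbers), using the standard \(2\times 4\) centered average for an even period:

```python
# Toy additive series: Y_t = 0.5*t + s[t % 4], with seasonal effects
# that sum to zero over one period. A 2x4 centered moving average
# (weights 1/8, 1/4, 1/4, 1/4, 1/8) removes the seasonality exactly,
# recovering the linear trend at interior points.
season = [2.0, -1.0, -3.0, 2.0]
y = [0.5 * t + season[t % 4] for t in range(20)]

def centered_ma4(y, t):
    return (0.5 * y[t - 2] + y[t - 1] + y[t] + y[t + 1] + 0.5 * y[t + 2]) / 4

trend = {t: centered_ma4(y, t) for t in range(2, 18)}
print(all(abs(trend[t] - 0.5 * t) < 1e-9 for t in trend))  # -> True
```

Subtracting this trend estimate and averaging the remainder by season position yields the seasonal component; what is left over is the irregular term.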
(a) What model does this suggest?
(b) Write the model equation.
(c) What are the parameter constraints for stationarity?
(a) An ACF that cuts off sharply after lag 2 (rather than tailing off) is the signature of a pure MA process, pointing to MA(2). A PACF that truly cut off after lag 1 would indicate AR(1), but an AR(1) has an ACF that tails off, contradicting the stated ACF; moreover, for an MA(2) the PACF tails off rather than cuts off. The ACF cut-off is the more decisive pattern here, so an MA(2) model is the natural candidate.
(b) MA(2): \(Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}\), where \(\varepsilon_t \sim WN(0, \sigma^2)\).
(c) MA models are always stationary (they are finite sums of white noise). For invertibility (needed for unique parameter estimation): the roots of \(1 + \theta_1 z + \theta_2 z^2 = 0\) must lie outside the unit circle. Equivalently: \(\theta_1 + \theta_2 > -1\), \(\theta_1 - \theta_2 < 1\), and \(|\theta_2| < 1\).
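The root condition in (c) can be checked numerically. A small Python sketch using the quadratic formula (hypothetical coefficient values):

```python
import cmath

# MA(2) invertibility: both roots of 1 + th1*z + th2*z^2 = 0 must lie
# outside the unit circle. Assumes th2 != 0 so the polynomial is quadratic.
def invertible_ma2(th1, th2):
    disc = cmath.sqrt(th1 ** 2 - 4 * th2)   # complex sqrt handles disc < 0
    roots = [(-th1 + disc) / (2 * th2), (-th1 - disc) / (2 * th2)]
    return all(abs(z) > 1 for z in roots)

print(invertible_ma2(0.5, 0.3))    # -> True  (inside the triangle region)
print(invertible_ma2(0.0, -1.5))   # -> False (|theta_2| > 1)
```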
(a) Find \(\gamma(0) = \text{Var}(Y_t)\).
(b) Find \(\gamma(h)\) for general lag \(h\).
(c) Show that \(\rho(h) = \phi^h\).
(d) What condition on \(\phi\) ensures stationarity?
(a) \(\text{Var}(Y_t) = \text{Var}(\phi Y_{t-1} + \varepsilon_t) = \phi^2 \text{Var}(Y_{t-1}) + \sigma^2\) (since \(\varepsilon_t\) is independent of \(Y_{t-1}\)). For a stationary process, \(\text{Var}(Y_t) = \text{Var}(Y_{t-1}) = \gamma(0)\). So \(\gamma(0) = \phi^2 \gamma(0) + \sigma^2\), giving \(\gamma(0) = \frac{\sigma^2}{1 - \phi^2}\).
(b) \(\gamma(h) = \text{Cov}(Y_t, Y_{t-h}) = \text{Cov}(\phi Y_{t-1} + \varepsilon_t, Y_{t-h}) = \phi \text{Cov}(Y_{t-1}, Y_{t-h}) = \phi \gamma(h-1)\). By induction: \(\gamma(h) = \phi^h \gamma(0) = \frac{\phi^h \sigma^2}{1 - \phi^2}\).
(c) \(\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \frac{\phi^h \gamma(0)}{\gamma(0)} = \phi^h\). The ACF decays exponentially, which is the signature pattern of an AR(1) process.
(d) Stationarity requires \(|\phi| < 1\). If \(|\phi| \geq 1\), the variance \(\gamma(0) = \frac{\sigma^2}{1-\phi^2}\) is undefined (negative or infinite), and the process is non-stationary.
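The result \(\rho(1) = \phi\) from (c) can be verified by simulation. A Python sketch with \(\phi = 0.6\) (toy simulation; the sample autocorrelation should land close to 0.6, up to sampling error):

```python
import random

# Simulate a stationary AR(1) with phi = 0.6 and check that the sample
# lag-1 autocorrelation is close to phi, as rho(1) = phi predicts.
random.seed(42)
phi, n = 0.6, 20000
y = [0.0]
for _ in range(n):
    y.append(phi * y[-1] + random.gauss(0, 1))
y = y[1000:]                     # drop burn-in so the start value is forgotten

mean = sum(y) / len(y)
c0 = sum((v - mean) ** 2 for v in y)                                  # gamma(0)
c1 = sum((y[t] - mean) * (y[t - 1] - mean) for t in range(1, len(y)))  # gamma(1)
print(round(c1 / c0, 2))          # close to 0.6
```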
(a) Explain the difference between trend-stationarity and difference-stationarity. Why does the distinction matter for modeling?
(b) The ADF test gives \(p = 0.42\) and the KPSS test gives \(p = 0.01\). Interpret both results together.
(c) After first differencing, ADF gives \(p = 0.03\) but variance still appears non-constant. What additional transformation is needed?
(d) How would you determine whether \(d = 1\) or \(d = 2\) differencing is appropriate?
(a) Trend-stationary: \(Y_t = \alpha + \beta t + \varepsilon_t\) -- deterministic trend that can be removed by detrending (subtracting the fitted trend). Difference-stationary: \(Y_t = Y_{t-1} + \varepsilon_t\) (unit root) -- stochastic trend removed by differencing. The distinction matters because each remedy fails on the other type: detrending a unit-root process leaves a unit root in the residuals and invalidates standard inference, while differencing a trend-stationary process over-differences it, introducing a non-invertible MA component.
(b) ADF (\(H_0\): unit root): \(p = 0.42\), fail to reject -- evidence of a unit root. KPSS (\(H_0\): stationary): \(p = 0.01\), reject stationarity. Both tests agree: the series is non-stationary. (If they disagreed, more investigation would be needed, such as examining the Phillips-Perron test or structural breaks.)
(c) Non-constant variance after differencing suggests the need for a variance-stabilizing transformation. Apply a log transformation first: \(W_t = \log(Y_t)\), then difference: \(\Delta W_t = \log(Y_t) - \log(Y_{t-1}) \approx\) percentage change. The log transform stabilizes multiplicative variance, and differencing removes the trend.
(d) Apply ADF to the first-differenced series. If \(p < 0.05\), \(d=1\) is sufficient. If still non-stationary, try \(d=2\). Also check: (1) the ACF of the differenced series -- if it drops quickly, \(d\) is adequate; if the first autocorrelation is near \(-0.5\), you may have over-differenced. (2) Compare AIC/BIC of ARIMA(\(p,1,q\)) vs ARIMA(\(p,2,q\)). In practice, \(d > 2\) is almost never needed.
(a) What are the three stages?
(b) How do you use ACF/PACF for model identification?
(c) How do you check if the model is adequate?
(a) Three stages: (1) Identification: determine the order \((p, d, q)\) by examining plots, ACF/PACF, and stationarity tests. (2) Estimation: estimate model parameters using maximum likelihood or least squares. (3) Diagnostic checking: verify model adequacy through residual analysis.
(b) ACF/PACF patterns: AR(\(p\)): ACF tails off, PACF cuts off after lag \(p\). MA(\(q\)): ACF cuts off after lag \(q\), PACF tails off. ARMA(\(p,q\)): both tail off. The differencing order \(d\) is determined by how many times differencing is needed to achieve stationarity.
(c) Model adequacy checks: (1) Residuals should be white noise -- check with ACF of residuals (no significant lags). (2) Ljung-Box test on residuals: \(H_0\): residuals are white noise. (3) Residuals should be approximately normal (Q-Q plot). (4) Compare AIC/BIC across candidate models -- lower is better.
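The Ljung-Box check in (c) computes \(Q = n(n+2)\sum_{k=1}^{h} r_k^2/(n-k)\) from the residual autocorrelations \(r_k\) and compares it to a \(\chi^2\) critical value (18.31 for \(h = 10\) degrees of freedom at the 5% level). A self-contained Python sketch on toy series (in practice one would use a library routine such as statsmodels' `acorr_ljungbox`, which also returns p-values):

```python
import random

# Ljung-Box statistic on a residual series; large Q rejects white noise.
def ljung_box(x, h=10):
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    q = 0.0
    for k in range(1, h + 1):
        rk = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / c0
        q += rk ** 2 / (n - k)
    return n * (n + 2) * q

random.seed(1)
white = [random.gauss(0, 1) for _ in range(1000)]   # adequate-model residuals
ar = [0.0]                                          # residuals with structure
for _ in range(1000):
    ar.append(0.8 * ar[-1] + random.gauss(0, 1))

print(ljung_box(white))           # small for white noise (compare to 18.31)
print(ljung_box(ar[1:]) > 18.31)  # -> True: strong autocorrelation detected
```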
(a) Write the full model equation.
(b) How many parameters need to be estimated?
(c) Explain why differencing is needed.
(d) After fitting, the residuals show significant autocorrelation at lag 4 -- what should you do?
(a) Let \(W_t = Y_t - Y_{t-1}\) (first difference). Then: \(W_t = c + \phi_1 W_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}\). In terms of the original series: \(Y_t = c + (1 + \phi_1)Y_{t-1} - \phi_1 Y_{t-2} + \varepsilon_t + \theta_1 \varepsilon_{t-1}\).
(b) Three parameters: \(\phi_1\) (AR coefficient), \(\theta_1\) (MA coefficient), and \(\sigma^2\) (noise variance). Optionally a constant \(c\) (drift term) for a total of 4.
(c) Differencing (\(d=1\)) removes a unit root (stochastic trend). The original series is non-stationary (e.g., random walk with drift), but the differenced series is stationary. ARMA models require stationarity, so differencing is a preprocessing step.
(d) Significant autocorrelation at lag 4 suggests the model is inadequate. Options: (1) Try ARIMA(1,1,2) or ARIMA(2,1,1) to capture additional structure. (2) If data is quarterly, consider seasonal ARIMA with \(s=4\): SARIMA(1,1,1)(1,0,0)[4]. (3) Check for seasonal patterns that the current model misses. Re-fit and re-check residuals iteratively.
White noise is a sequence of uncorrelated random variables with constant mean and variance: \(\varepsilon_t \sim WN(0, \sigma^2)\).
Properties: (1) \(E[\varepsilon_t] = 0\) for all \(t\). (2) \(\text{Var}(\varepsilon_t) = \sigma^2\) for all \(t\) (constant). (3) \(\text{Cov}(\varepsilon_t, \varepsilon_s) = 0\) for \(t \neq s\) (no autocorrelation). (4) ACF: \(\rho(0) = 1\), \(\rho(h) = 0\) for \(h \neq 0\).
Importance for diagnostics: A well-fitted time series model should have residuals that resemble white noise. If residuals are white noise, all systematic patterns (trend, seasonality, autocorrelation) have been captured by the model. Remaining residuals that show autocorrelation indicate the model is missing structure and should be improved.
(a) Write both formulas.
(b) Which penalizes complexity more?
(c) If AIC selects ARIMA(2,1,2) and BIC selects ARIMA(1,1,1), which would you choose and why?
(a) \(AIC = -2\ln(L) + 2k\), where \(L\) is the maximized likelihood and \(k\) is the number of parameters. \(BIC = -2\ln(L) + k\ln(n)\), where \(n\) is the sample size.
(b) BIC penalizes complexity more heavily. For \(n \geq 8\), \(\ln(n) > 2\), so BIC's penalty per parameter exceeds AIC's. BIC tends to select simpler models, especially for large samples.
(c) This is a common situation. The choice depends on the goal: (1) For forecasting, AIC is often preferred -- it optimizes prediction accuracy and may capture useful patterns that BIC's simpler model misses. (2) For identifying the true model order, BIC is consistent (selects the true model as \(n \to \infty\)). (3) Pragmatically, ARIMA(1,1,1) is simpler, more interpretable, and less prone to overfitting. Unless the ARIMA(2,1,2) provides substantially better out-of-sample forecasts, the simpler model is usually preferred.
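The disagreement in (c) is easy to reproduce with the formulas from (a). A Python sketch with hypothetical likelihoods and parameter counts (the values below are invented for illustration):

```python
import math

# A richer model (B) fits slightly better but pays a larger penalty;
# AIC and BIC can then disagree, exactly as in part (c).
def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    return -2 * loglik + k * math.log(n)

n = 200
logL_A, k_A = -500.0, 4    # e.g. the simpler ARIMA(1,1,1)-style model
logL_B, k_B = -497.0, 6    # e.g. the richer ARIMA(2,1,2)-style model

print(aic(logL_B, k_B) < aic(logL_A, k_A))        # -> True: AIC picks B
print(bic(logL_A, k_A, n) < bic(logL_B, k_B, n))  # -> True: BIC picks A
```

With \(n = 200\), \(\ln(n) \approx 5.3\), so each extra parameter costs 5.3 under BIC but only 2 under AIC, which is the whole source of the disagreement.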
(a) Write the conditional variance equation.
(b) What is volatility clustering?
(c) Derive the unconditional variance.
(d) What constraint ensures the variance is positive and finite?
(a) \(\sigma_t^2 = \omega + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2\), where \(\varepsilon_t = \sigma_t z_t\), \(z_t \sim N(0,1)\), \(\omega > 0\), \(\alpha_1 \geq 0\), \(\beta_1 \geq 0\). The conditional variance at time \(t\) depends on the previous squared shock and the previous conditional variance.
(b) Volatility clustering: large price changes tend to be followed by large changes (of either sign), and small changes tend to be followed by small changes. GARCH captures this because a large \(\varepsilon_{t-1}^2\) increases \(\sigma_t^2\), leading to larger expected magnitudes of \(\varepsilon_t\).
(c) Taking unconditional expectations: \(E[\sigma_t^2] = \omega + \alpha_1 E[\varepsilon_{t-1}^2] + \beta_1 E[\sigma_{t-1}^2]\). Since \(E[\varepsilon_t^2] = E[\sigma_t^2]\) and stationarity implies \(E[\sigma_t^2] = E[\sigma_{t-1}^2] = \bar{\sigma}^2\): \(\bar{\sigma}^2 = \omega + (\alpha_1 + \beta_1)\bar{\sigma}^2\), so \(\bar{\sigma}^2 = \frac{\omega}{1 - \alpha_1 - \beta_1}\).
(d) Constraints: \(\omega > 0\), \(\alpha_1 \geq 0\), \(\beta_1 \geq 0\) ensure \(\sigma_t^2 > 0\). For finite unconditional variance: \(\alpha_1 + \beta_1 < 1\). If \(\alpha_1 + \beta_1 = 1\), the process is Integrated GARCH (IGARCH) with infinite unconditional variance -- shocks persist indefinitely.
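The unconditional-variance derivation in (c) is a fixed-point argument: iterating \(E[\sigma_t^2] = \omega + (\alpha_1 + \beta_1)E[\sigma_{t-1}^2]\) converges to \(\omega/(1 - \alpha_1 - \beta_1)\) whenever \(\alpha_1 + \beta_1 < 1\). A deterministic Python check with hypothetical parameter values:

```python
# Fixed-point iteration for the unconditional GARCH(1,1) variance.
# With omega = 0.1, alpha = 0.1, beta = 0.8 the limit is 0.1 / 0.1 = 1.0.
omega, alpha, beta = 0.1, 0.1, 0.8

v = 5.0                          # arbitrary starting value
for _ in range(300):
    v = omega + (alpha + beta) * v   # contraction since alpha + beta < 1

print(round(v, 6))               # -> 1.0 = omega / (1 - alpha - beta)
```

If \(\alpha_1 + \beta_1 \geq 1\), the map is no longer a contraction and the iteration diverges, mirroring the IGARCH case in (d).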
(a) Show the calculation.
(b) Interpret the result.
(c) Name one advantage and one disadvantage of MAPE vs MAE.
(a) \(MAPE = \frac{1}{n}\sum_{i=1}^n \left|\frac{A_i - F_i}{A_i}\right| \times 100\%\).
\(\frac{|100-105|}{100} = 0.0500\), \(\frac{|120-115|}{120} = 0.0417\), \(\frac{|90-95|}{90} = 0.0556\), \(\frac{|110-108|}{110} = 0.0182\).
\(MAPE = \frac{0.0500 + 0.0417 + 0.0556 + 0.0182}{4} \times 100\% = \frac{0.1654}{4} \times 100\% \approx 4.14\%\).
(b) On average, forecasts deviate from actuals by about 4.1%. This is considered excellent forecasting accuracy (a MAPE below 10% is typically very good).
(c) Advantage of MAPE: scale-independent (expressed as a percentage), making it easy to compare across different series and communicate to stakeholders. Disadvantage: undefined when actual values are zero, and asymmetric -- for a given absolute error, the percentage error is larger when the actual value is smaller, so MAPE effectively penalizes over-forecasts more heavily than under-forecasts. MAE avoids both issues but is scale-dependent.
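The calculation in (a) takes a few lines of Python (carrying full precision rather than rounded intermediate terms; small differences against hand calculations are expected):

```python
# MAPE and MAE for the four forecasts in part (a).
actual   = [100, 120, 90, 110]
forecast = [105, 115, 95, 108]

mape = sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual) * 100
mae  = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
print(round(mape, 2), mae)  # -> 4.14 4.25
```

The MAE of 4.25 units illustrates the scale-dependence discussed in (c): it is only interpretable relative to the magnitude of the series, whereas the MAPE is directly comparable across series.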