1 Simple Linear Regression

1.1 Concepts of Regression Analysis

1.1.1 What is regression analysis?

Many research questions involve understanding how one variable is related to another. Consider these examples:

  • A teacher wants to know whether the number of hours a student studies is related to the student’s exam score.
  • Medical researchers ask: Is caffeine related to heart damage? Is there a relationship between a person’s age and blood pressure?
  • A real estate appraiser wants to predict the sale price of a home from its physical characteristics and tax records.
  • Analysts want to examine whether cigarette consumption is related to socioeconomic variables such as age, education, income, and cigarette price.

Two closely related techniques help answer these questions:

  • Correlation analysis determines whether a linear relationship between variables exists, and measures its strength and direction.
  • Regression analysis describes the nature of the relationship — it models how changes in one or more predictor variables are associated with changes in the response variable.

1.1.2 Purposes of regression analysis

  1. To determine the relationship between the explanatory variable(s) and the response variable.
  2. To forecast or predict the value of the response variable for a given set of values of the explanatory variables.

1.1.3 Response and explanatory variables

A regression model expresses the relationship between a response variable (also called the dependent variable) and one or more explanatory variables (also called independent variables or predictors).

  • The response variable is denoted \(Y\) — the variable whose value we want to predict or explain.
  • The predictor variables are denoted \(X_1, X_2, \ldots, X_p\), where \(p\) is the number of predictors.

1.1.4 The regression model

The true relationship between \(Y\) and \(X_1, X_2, \ldots, X_p\) is approximated by:

\[Y = f(X_1, X_2, \ldots, X_p) + \epsilon\]

  • \(\epsilon\) is the random error representing the part of \(Y\) not explained by the predictors. It accounts for measurement error, omitted variables, and natural variability.
  • \(f(X_1, \ldots, X_p)\) describes the systematic part of the relationship. Common forms include:
    • Linear: \(f(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2\)
    • Nonlinear: \(f(x_1, x_2) = \beta_0 + \beta_1 x_1 + e^{\beta_2 x_2}\)

1.1.5 Steps in regression modelling

  1. State the problem clearly.
  2. Choose the relevant variables.
  3. Collect data on the selected variables.
  4. Specify a model appropriate for the data.
  5. Estimate the model parameters.
  6. Validate the model (check assumptions and fit).
  7. Use the model to answer the original question.

1.2 Scatterplot

Before fitting any model, we always begin by visualising the data. A scatterplot displays pairs \((x_i, y_i)\) — the predictor \(x\) on the horizontal axis and the response \(y\) on the vertical axis — and reveals the form, direction, and strength of the relationship.

What to look for in a scatterplot:

Feature What it means
Form Is the pattern linear (a straight band) or curved?
Direction Positive (both increase together) or negative?
Strength How tightly do the points cluster around the pattern?
Outliers Any unusual points that deviate from the overall pattern?
data(cars)
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping Distance (ft)",
     main = "Scatterplot: Speed vs. Stopping Distance",
     pch = 19, col = "steelblue")

The scatterplot shows that speed and dist are positively and approximately linearly related — as speed increases, stopping distance tends to increase.

library(MASS)
data(survey)
plot(survey$Age, survey$Pulse,
     xlab = "Age (years)", ylab = "Pulse Rate (beats/min)",
     main = "Scatterplot: Age vs. Pulse Rate",
     pch = 19, col = "tomato")

The scatterplot shows that Age and Pulse do not appear to be linearly related — the points scatter widely with no clear directional pattern.


1.3 Correlation

1.3.1 The Pearson correlation coefficient

The sample Pearson correlation coefficient \(r\) measures the strength and direction of the linear relationship between two quantitative variables. For \(n\) pairs \((x_1, y_1), \ldots, (x_n, y_n)\):

\[ r = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2 \;\cdot\; \displaystyle\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{n\sum x_i y_i - \sum x_i \sum y_i} {\sqrt{\bigl[n\sum x_i^2 - (\sum x_i)^2\bigr]\bigl[n\sum y_i^2 - (\sum y_i)^2\bigr]}} \]

The range is \(-1 \leq r \leq 1\).

1.3.2 Interpreting \(r\)

Value of \(|r|\) Interpretation
0.00 – 0.19 Very weak (negligible)
0.20 – 0.39 Weak
0.40 – 0.59 Moderate
0.60 – 0.79 Strong
0.80 – 1.00 Very strong
  • Sign: positive \(r\) → both variables increase together; negative \(r\) → one increases as the other decreases.
  • Magnitude: the closer \(|r|\) is to 1, the more tightly the points cluster around a straight line.

❗ Important

\(r\) measures linear association only. A value of \(r \approx 0\) does not mean the variables are unrelated — they might have a strong nonlinear relationship that \(r\) cannot detect. Always examine the scatterplot first.

cor(cars$speed, cars$dist)
## [1] 0.8068949

The correlation \(r = 0.807\) confirms a strong positive linear relationship between speed and stopping distance.


1.3.3 Hypothesis test for the population correlation

The population correlation coefficient \(\rho\) (rho) is the correlation computed from all possible pairs in the population. We never observe it directly; instead, we use \(r\) from our sample to make inferences about \(\rho\).

Possible hypotheses (where \(\rho_0\) is the value under \(H_0\), most commonly \(\rho_0 = 0\)):

\[ \begin{aligned} (a)\; &H_0: \rho \leq 0, \quad H_1: \rho > 0 \quad &\text{(test for positive correlation)}\\ (b)\; &H_0: \rho \geq 0, \quad H_1: \rho < 0 \quad &\text{(test for negative correlation)}\\ (c)\; &H_0: \rho = 0, \quad H_1: \rho \neq 0 \quad &\text{(test for any linear correlation)} \end{aligned} \]

Four-step procedure:

  1. State \(H_0\) and \(H_1\).
  2. The test statistic is \(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\), which follows a \(t\)-distribution with \(\nu = n - 2\) degrees of freedom under \(H_0\).
  3. Compute the test statistic and p-value (using R).
  4. State the conclusion in context: reject \(H_0\) if p-value \(< \alpha\).

1.3.4 Confidence interval for \(\rho\)

A direct CI for \(\rho\) uses Fisher’s \(z\)-transformation. Let \(R = \dfrac{1}{2}\ln\dfrac{1+r}{1-r}\) (the Fisher transform of \(r\)), with approximate standard deviation \(\sigma_R = \dfrac{1}{\sqrt{n-3}}\).

A \((1-\alpha)100\%\) CI for the transformed correlation is:

\[R \pm Z_{1-\alpha/2} \cdot \sigma_R\]

We then back-transform the bounds to the original \(\rho\) scale using \(\rho = \dfrac{e^{2R}-1}{e^{2R}+1}\).

💡 In R

cor.test(x, y, alternative = "two.sided", conf.level = 0.95) performs the hypothesis test and computes the 95% CI for \(\rho\) automatically — no need to apply Fisher’s transformation by hand.


📝 Example 1.1: Delivery Time and Number of Items

A delivery company wants to know whether delivery time is correlated with the number of items delivered. They recorded times and item counts for 25 orders.

Order Time \(y\) (min) Items \(x\) Order Time \(y\) (min) Items \(x\)
1 16.68 7 14 19.75 6
2 11.50 3 15 24.00 9
3 12.03 3 16 29.00 10
4 14.88 4 17 15.35 6
5 13.75 6 18 19.00 7
6 18.11 7 19 9.50 3
7 8.00 2 20 35.10 17
8 17.83 7 21 17.90 10
9 79.24 30 22 52.32 26
10 21.50 5 23 18.75 9
11 40.33 16 24 19.83 8
12 21.00 10 25 10.75 4
13 13.50 4

Assuming that delivery time and number of items follow a bivariate normal distribution, test at \(\alpha = 0.05\) whether delivery time is correlated with the number of items.

# Data
items <- c(7,3,3,4,6,7,2,7,30,5,16,10,4,6,9,10,6,7,3,17,10,26,9,8,4)
time  <- c(16.68,11.5,12.03,14.88,13.75,18.11,8,17.83,79.24,21.5,
           40.33,21,13.5,19.75,24,29,15.35,19,9.5,35.1,17.9,52.32,
           18.75,19.83,10.75)

# Scatterplot
plot(items, time,
     main = "Number of Items vs. Delivery Time",
     xlab = "Number of Items (x)", ylab = "Delivery Time (min) (y)",
     pch = 19, col = "steelblue")

# Pearson correlation test (two-sided, 95% CI)
cor.test(items, time, alternative = "two.sided")
## 
##  Pearson's product-moment correlation
## 
## data:  items and time
## t = 17.546, df = 23, p-value = 8.22e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9202275 0.9845031
## sample estimates:
##       cor 
## 0.9646146

From the data, \(r = 0.9646\).

Hypothesis test:

  1. \(H_0: \rho = 0\) vs. \(H_1: \rho \neq 0\)

  2. Test statistic: \(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\)

  3. Compute: \[t_\text{cal} = \frac{0.9646\sqrt{25-2}}{\sqrt{1-0.9646^2}} = 17.55, \quad \text{p-value} = 8.22 \times 10^{-15}\]

  4. Since p-value \(< 0.05\), we reject \(H_0\). Conclusion: there is significant evidence that delivery time is linearly correlated with the number of items.

95% confidence interval for \(\rho\):

\[R = \frac{1}{2}\ln\frac{1+0.9646}{1-0.9646} = 2.0082, \qquad \sigma_R = \frac{1}{\sqrt{25-3}} = 0.2132\]

\[2.0082 \pm 1.96\sqrt{0.2132} \;\Rightarrow\; 1.5903 < \mu_R < 2.4261\]

Back-transforming: \(\quad \boxed{0.9202 < \rho < 0.9844}\)


📝 Example 1.2: Math Midterm and Final Scores

Math midterm and final scores for 9 students are shown below. Find the sample correlation \(r\) and test whether midterm and final scores are correlated at \(\alpha = 0.10\).

Student 1 2 3 4 5 6 7 8 9
Midterm 77 50 71 72 80 93 95 98 66
Final 82 66 78 43 56 85 99 98 68
df <- data.frame(
  midterm = c(77, 50, 71, 72, 80, 93, 95, 98, 66),
  final   = c(82, 66, 78, 43, 56, 85, 99, 98, 68)
)

# Scatterplot
plot(df$midterm, df$final,
     main = "Midterm vs. Final Exam Scores",
     xlab = "Midterm Score", ylab = "Final Score",
     pch = 19, col = "darkred")

# Correlation test at alpha = 0.10 (90% CI)
cor.test(df$midterm, df$final, alternative = "two.sided", conf.level = 0.90)
## 
##  Pearson's product-moment correlation
## 
## data:  df$midterm and df$final
## t = 2.197, df = 7, p-value = 0.06402
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
##  0.0845006 0.8911984
## sample estimates:
##       cor 
## 0.6388399

The sample correlation is \(r = 0.6388\), indicating a moderately strong positive linear relationship.

Hypothesis test:

  1. \(H_0: \rho = 0\) vs. \(H_1: \rho \neq 0\)

  2. Test statistic: \(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\)

  3. Compute with \(r = 0.6388\), \(n = 9\): \[t_\text{cal} = \frac{0.6388\sqrt{7}}{\sqrt{1-0.6388^2}} = 2.197, \quad \text{p-value} = 0.0640\]

  4. Since p-value \(= 0.0640 < \alpha = 0.10\), we reject \(H_0\). Conclusion: at the 10% significance level, there is sufficient evidence that midterm and final scores are linearly correlated.


📝 Example 1.3: Rainfall and Umbrella Sales

Peter owns a supermarket and believes that months with higher rainfall will have higher umbrella sales. He recorded monthly rainfall (mm) and umbrellas sold over nine consecutive months.

Month Rainfall (mm) Umbrellas sold
1 82.0 12
2 92.5 25
3 83.2 17
4 97.7 28
5 131.9 41
6 141.3 47
7 165.4 50
8 140.0 46
9 126.7 37

Assuming both variables follow a normal distribution, test at \(\alpha = 0.05\) whether Peter’s assumption is correct.

df2 <- data.frame(
  Rainfall  = c(82.0, 92.5, 83.2, 97.7, 131.9, 141.3, 165.4, 140.0, 126.7),
  Umbrellas = c(12, 25, 17, 28, 41, 47, 50, 46, 37)
)

# Scatterplot
plot(df2$Rainfall, df2$Umbrellas,
     main = "Monthly Rainfall vs. Umbrella Sales",
     xlab = "Rainfall (mm)", ylab = "Umbrellas Sold",
     pch = 19, col = "royalblue")

# One-sided test: H1: rho > 0
cor.test(df2$Rainfall, df2$Umbrellas, alternative = "greater")
## 
##  Pearson's product-moment correlation
## 
## data:  df2$Rainfall and df2$Umbrellas
## t = 10.455, df = 7, p-value = 7.97e-06
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.8877971 1.0000000
## sample estimates:
##       cor 
## 0.9694419

Hypothesis test:

  1. \(H_0: \rho \leq 0\) vs. \(H_1: \rho > 0\) (Peter believes higher rainfall → more sales)

  2. Test statistic: \(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\)

  3. Compute with \(r = 0.9694\), \(n = 9\): \[t_\text{cal} = \frac{0.9694\sqrt{7}}{\sqrt{1-0.9694^2}} = 10.455, \quad \text{p-value} = 7.97 \times 10^{-6}\]

  4. Since p-value \(< 0.05\), we reject \(H_0\). Conclusion: there is significant evidence that higher rainfall is associated with more umbrella sales. Peter’s assumption is supported by the data.


1.4 Simple Linear Regression Model

1.4.1 Model specification

When both the form of the relationship and the direction of interest (predicting \(Y\) from \(X\)) are clear, we move from correlation to regression.

The simple linear regression model is:

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n\]

where:

\[ \begin{aligned} E(y_i) &= \beta_0 + \beta_1 x_i &&\text{(deterministic / systematic part)} \\ \epsilon_i &&&\text{(random error component)} \\ \beta_0 &&&\text{(intercept: value of } E(y) \text{ when } x = 0\text{)} \\ \beta_1 &&&\text{(slope: change in } E(y) \text{ per unit increase in } x\text{)} \end{aligned} \]

1.4.2 Assumptions about the error term \(\epsilon_i\)

For valid inference, the random errors must satisfy:

  1. Normality: \(\epsilon_i \sim N(0, \sigma^2)\) for all \(i\).
  2. Zero mean: \(E(\epsilon_i) = 0\).
  3. Constant variance (homoscedasticity): \(\text{Var}(\epsilon_i) = \sigma^2\) (the same for all \(i\)).
  4. Independence: \(\text{Cov}(\epsilon_i, \epsilon_j) = 0\) for \(i \neq j\).

In short: \(\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)\).

💡 Checking the assumptions

These assumptions cannot be verified from the raw data — they must be checked using residual diagnostics after the model is fitted (see @sec-diagnostics).


1.4.3 Parameter estimation: method of least squares

Since \(\beta_0\) and \(\beta_1\) are unknown population parameters, we estimate them from a sample of size \(n\).

The least-squares estimates (LSE) \(b_0\) and \(b_1\) minimise the sum of squared residuals:

\[Q = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\]

Setting the partial derivatives \(\partial Q / \partial \beta_0 = 0\) and \(\partial Q / \partial \beta_1 = 0\) and solving gives the normal equations, whose solutions are:

\[\boxed{b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1\bar{x}}\]

where \(S_{xx} = \sum(x_i - \bar{x})^2\) and \(S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})\).

Equivalently:

\[b_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b_0 = \bar{y} - b_1\bar{x}\]

1.4.4 The fitted regression line and residuals

The fitted (estimated) regression line is: \[\hat{y} = b_0 + b_1 x\]

The fitted value at observation \(i\) is \(\hat{y}_i = b_0 + b_1 x_i\), and the residual is: \[e_i = y_i - \hat{y}_i\]

Residuals represent the part of \(y_i\) not explained by the model — they are the observable counterparts of the unobservable errors \(\epsilon_i\).

1.4.5 Estimating \(\sigma^2\)

The error variance \(\sigma^2\) measures the spread of observations around the regression line. It is estimated by the mean squared residual (also called mean squared error, MSE):

\[\hat{\sigma}^2 = MS_\text{res} = \frac{SS_\text{res}}{n-2} = \frac{\displaystyle\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2}\]

The denominator \(n-2\) (not \(n\)) gives an unbiased estimate because two parameters (\(\beta_0\) and \(\beta_1\)) are estimated from the data.

The residual sum of squares can also be expressed as: \[SS_\text{res} = S_{yy} - b_1 S_{xy}\]

where \(S_{yy} = \sum(y_i - \bar{y})^2\).


1.4.6 ANOVA table for regression

The total variation in \(Y\) can be decomposed as:

\[SS_T = SS_\text{reg} + SS_\text{res}\]

Source SS df MS \(F\)
Regression \(SS_\text{reg} = b_1 S_{xy}\) \(1\) \(MS_\text{reg} = SS_\text{reg}/1\) \(F = MS_\text{reg}/MS_\text{res}\)
Residual \(SS_\text{res} = S_{yy} - b_1 S_{xy}\) \(n-2\) \(MS_\text{res} = SS_\text{res}/(n-2)\)
Total \(SS_T = S_{yy}\) \(n-1\)

The F-statistic tests \(H_0: \beta_1 = 0\) (the model has no linear fit). For simple regression, this is equivalent to the \(t\)-test on \(\beta_1\).

💡 In R

anova(fit) produces this table. summary(fit) gives the \(F\)-statistic at the bottom of its output.


1.4.7 Inference for \(\beta_0\) and \(\beta_1\)

Because \(b_0\) and \(b_1\) are computed from a random sample, they are random variables with their own sampling distributions.

Distribution of \(b_1\):

\[b_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2}{S_{xx}}\right)\]

When \(\sigma^2\) is unknown and estimated by \(MS_\text{res}\):

\[t = \frac{b_1 - \beta_1}{se(b_1)} \sim t_{n-2}, \qquad se(b_1) = \sqrt{\frac{MS_\text{res}}{S_{xx}}}\]

Distribution of \(b_0\):

\[t = \frac{b_0 - \beta_0}{se(b_0)} \sim t_{n-2}, \qquad se(b_0) = \sqrt{MS_\text{res}\!\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}\]

Confidence intervals:

\[ \begin{aligned} (1-\alpha)100\% \text{ CI for } \beta_1: &\quad b_1 \pm t_{1-\alpha/2,\, n-2} \cdot se(b_1) \\ (1-\alpha)100\% \text{ CI for } \beta_0: &\quad b_0 \pm t_{1-\alpha/2,\, n-2} \cdot se(b_0) \end{aligned} \]

Confidence interval for \(\sigma^2\):

\[\frac{(n-2)MS_\text{res}}{\chi^2_{1-\alpha/2,\, n-2}} \leq \sigma^2 \leq \frac{(n-2)MS_\text{res}}{\chi^2_{\alpha/2,\, n-2}}\]


1.4.8 Hypothesis test for \(\beta_1\)

The most important test in simple regression is:

\[H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0\]

If we fail to reject \(H_0\), there is no evidence of a linear relationship between \(X\) and \(Y\).

Four-step procedure:

  1. State \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\) (or one-sided alternative).
  2. Test statistic: \(t_0 = \dfrac{b_1}{se(b_1)} \sim t_{n-2}\) under \(H_0\).
  3. Compute \(t_0\) and the p-value from R output.
  4. Reject \(H_0\) if p-value \(< \alpha\); conclude there is a significant linear relationship.

1.4.9 Coefficient of determination \(r^2\)

The coefficient of determination measures the proportion of the total variation in \(Y\) that is explained by the regression model:

\[r^2 = \frac{SS_\text{reg}}{SS_T} = 1 - \frac{SS_\text{res}}{SS_T}, \qquad 0 \leq r^2 \leq 1\]

  • \(r^2 = 1\): perfect fit — the model explains all the variation.
  • \(r^2 = 0\): the model explains none of the variation; \(\hat{y}\) is no better than \(\bar{y}\).

Interpretation template: \(r^2 = 0.845\) means that 84.5% of the total variation in \(Y\) is explained by the linear regression on \(X\). The remaining 15.5% is unexplained (attributable to random error or other factors not in the model).”

Note: In simple linear regression, \(r^2 = r^2\) (the square of the Pearson correlation coefficient).

💡 Reading summary(fit) output in R

Output field Meaning
Estimate (Intercept) \(b_0\): estimated \(Y\) when \(X = 0\)
Estimate (\(x\)) \(b_1\): estimated change in \(\hat{Y}\) per 1-unit increase in \(X\)
Std. Error \(se(b_j)\): standard error of the estimate
t value \(t_0 = b_j / se(b_j)\): test statistic for \(H_0: \beta_j = 0\)
Pr(>|t|) p-value; reject \(H_0\) if \(< \alpha\)
Residual standard error \(\hat{\sigma} = \sqrt{MS_\text{res}}\)
Multiple R-squared \(r^2\): proportion of variance explained
F-statistic Overall model F-test (same as \(t^2\) for simple regression)
p-value (F) p-value for the overall F-test

1.4.10 Prediction and confidence intervals

After fitting the model \(\hat{y} = b_0 + b_1 x\), we can form two types of interval estimates for a new value \(x^*\):

1. Confidence interval for the mean response \(E(y \mid x = x^*)\):

This estimates the average \(Y\) for all observations with \(X = x^*\): \[\hat{y}^* \pm t_{1-\alpha/2,\, n-2} \sqrt{MS_\text{res}\!\left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]

2. Prediction interval for a new individual observation \(y\) at \(x = x^*\):

This estimates a single future \(Y\) when \(X = x^*\): \[\hat{y}^* \pm t_{1-\alpha/2,\, n-2} \sqrt{MS_\text{res}\!\left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]

❗ CI vs. PI — which is wider?

The prediction interval (PI) is always wider than the confidence interval (CI) for the same \(x^*\). The extra “1” inside the square root accounts for the additional uncertainty in predicting a single new observation (which has its own random error \(\epsilon\)), rather than estimating the mean response (which averages out individual variation).

Both intervals are narrowest at \(x^* = \bar{x}\) and widen as \(x^*\) moves away from \(\bar{x}\).


1.4.11 Residual analysis and model diagnostics

After fitting the model, we must verify that the regression assumptions hold. We do this by examining the residuals \(e_i = y_i - \hat{y}_i\).

Key diagnostic plots:

Plot What it checks What “good” looks like
Residuals vs. Fitted Linearity + constant variance Random scatter around \(e = 0\); no pattern
Normal Q-Q plot Normality of residuals Points close to the diagonal reference line
Scale-Location Constant variance (homoscedasticity) Flat red line; even spread
Residuals vs. Leverage Influential observations No points outside Cook’s distance contours
# R's four built-in diagnostic plots
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

Warning signs:

  • Residuals vs. Fitted: a curved pattern suggests non-linearity; a funnel shape suggests non-constant variance.
  • Q-Q plot: S-shaped deviations suggest heavy tails; systematic curves suggest skewness.
  • Scale-Location: an upward trend suggests variance increases with the fitted values.

💡 Formal tests (used in later labs)

  • Shapiro-Wilk test (shapiro.test(residuals(fit))): tests normality. Satisfied if p-value > 0.05.
  • Breusch-Pagan test (bptest(fit) from lmtest): tests constant variance. Satisfied if p-value > 0.05.
  • Durbin-Watson test (dwtest(fit) from lmtest): tests for autocorrelation. Satisfied if p-value > 0.05.

📝 Example 1.4: Condominium Area and Sale Price

A marketing manager wants to study the relationship between the area of a condominium unit (\(m^2\)) and its sale price (10,000 Baht) in the Mueang district of Chiang Mai. Ten units were randomly selected.

Obs Area \(x\) Price \(y\) \(xy\) \(x^2\) \(y^2\)
1 13 115 1495 169 13225
2 4 45 180 16 2025
3 12 100 1200 144 10000
4 5 50 250 25 2500
5 6 55 330 36 3025
6 8 85 680 64 7225
7 3 40 120 9 1600
8 4 50 200 16 2500
9 5 45 225 25 2025
10 7 70 490 49 4900
Total 67 655 5170 553 49025
df <- data.frame(
  Area = c(13, 4, 12, 5, 6, 8, 3, 4, 5, 7),
  Sale = c(115, 45, 100, 50, 55, 85, 40, 50, 45, 70)
)

# Scatterplot
plot(df$Area, df$Sale,
     xlab = expression("Area (" * m^2 * ")"),
     ylab = "Sale Price (10,000 Baht)",
     main = "Condominium Area vs. Sale Price",
     pch = 19, col = "steelblue")

# Fit simple linear regression
fit <- lm(Sale ~ Area, data = df)
summary(fit)
## 
## Call:
## lm(formula = Sale ~ Area, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.738 -4.618  0.987  2.269  9.741 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.202      4.120    3.69  0.00613 ** 
## Area           7.507      0.554   13.55 8.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.653 on 8 degrees of freedom
## Multiple R-squared:  0.9582, Adjusted R-squared:  0.953 
## F-statistic: 183.6 on 1 and 8 DF,  p-value: 8.451e-07
anova(fit)
Df Sum Sq Mean Sq F value Pr(>F)
Area 1 5866.8804 5866.88040 183.6129 8e-07
Residuals 8 255.6196 31.95245 NA NA

Test for significance of regression (\(\beta_1\)):

  1. \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\)

  2. Test statistic: \(t_0 = \dfrac{b_1}{se(b_1)}\)

  3. From R: \(t = 13.55\), p-value \(= 8.45 \times 10^{-7}\)

  4. Since p-value \(< 0.05\), we reject \(H_0\). Conclusion: condominium area is a significant linear predictor of sale price.

Fitted equation: \[\hat{y} = 15.202 + 7.507\, x\]

Interpretation: For each additional \(1\, m^2\) of area, the estimated sale price increases by 75,070 Baht (7.507 × 10,000). A unit with 0 \(m^2\) area is estimated to cost 152,020 Baht (intercept, not practically meaningful here).

Coefficient of determination:

\(r^2 = 0.9582\): 95.82% of the total variation in sale price is explained by the linear regression on area. The remaining 4.18% is unexplained.

95% confidence interval for \(E(y \mid x = 13)\):

newx <- data.frame(Area = 13)
predict(fit, newdata = newx, interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 112.7954 103.7525 121.8382

\[\hat{y}_h \pm t_{0.975,\,8}\sqrt{MS_\text{res}\!\left(\frac{1}{10} + \frac{(13 - 6.7)^2}{104.1}\right)} = 112.80 \pm 9.04 \;\Rightarrow\; \boxed{103.76 < E(y \mid x=13) < 121.83}\]

95% prediction interval for a new unit with \(x = 13\):

predict(fit, newdata = newx, interval = "prediction", level = 0.95)
##        fit      lwr    upr
## 1 112.7954 96.93079 128.66

\[\hat{y}_0 \pm t_{0.975,\,8}\sqrt{MS_\text{res}\!\left(1 + \frac{1}{10} + \frac{(13-6.7)^2}{104.1}\right)} = 112.80 \pm 15.86 \;\Rightarrow\; \boxed{96.94 < y \mid x=13 < 128.65}\]

Plot: fitted line with CI and PI bands:

newx_all  <- data.frame(Area = sort(df$Area))
conf_int  <- predict(fit, newdata = newx_all, interval = "confidence")
pred_int  <- predict(fit, newdata = newx_all, interval = "prediction")

plot(df$Area, df$Sale,
     xlab = expression("Area (" * m^2 * ")"),
     ylab = "Sale Price (10,000 Baht)",
     main = "Fitted Line with 95% CI and PI Bands",
     pch = 19, col = "steelblue",
     ylim = c(20, 140))
abline(fit, lwd = 2, col = "black")
lines(sort(df$Area), conf_int[, "lwr"], col = "red",  lty = 2, lwd = 1.5)
lines(sort(df$Area), conf_int[, "upr"], col = "red",  lty = 2, lwd = 1.5)
lines(sort(df$Area), pred_int[, "lwr"], col = "blue", lty = 3, lwd = 1.5)
lines(sort(df$Area), pred_int[, "upr"], col = "blue", lty = 3, lwd = 1.5)
legend("topleft",
       legend = c("Data", "Fitted line", "95% CI", "95% PI"),
       pch    = c(19, NA, NA, NA),
       lty    = c(NA, 1, 2, 3),
       col    = c("steelblue", "black", "red", "blue"),
       lwd    = 2, bty = "n")

Residual diagnostics:

par(mfrow = c(2, 2))
plot(fit, pch = 16, col = "steelblue")

par(mfrow = c(1, 1))

📝 Example 1.5: Rocket Propellant Shear Strength

A rocket motor is manufactured by bonding two types of propellant inside a metal housing. The shear strength of the bond is an important quality characteristic. It is suspected that shear strength (\(y\), psi) is related to the age of the sustainer propellant batch (\(x\), weeks). Twenty observations are available.

Obs Shear Strength \(y\) Age \(x\) Obs Shear Strength \(y\) Age \(x\)
1 2158.70 15.50 11 2165.20 13.00
2 1678.15 23.75 12 2399.55 3.75
3 2316.00 8.00 13 1779.80 25.00
4 2061.30 17.00 14 2336.75 9.75
5 2207.50 5.50 15 1765.30 22.00
6 1708.30 19.00 16 2053.50 18.00
7 1784.70 24.00 17 2414.40 6.00
8 2575.00 2.50 18 2200.50 12.50
9 2357.90 7.50 19 2654.20 2.00
10 2256.70 11.00 20 1753.70 21.50
age      <- c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50,
               7.50, 11.00, 13.00, 3.75, 25.00, 9.75, 22.00, 18.00,
               6.00, 12.50, 2.00, 21.50)
strength <- c(2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30,
              1784.70, 2575.00, 2357.90, 2256.70, 2165.20, 2399.55,
              1779.80, 2336.75, 1765.30, 2053.50, 2414.40, 2200.50,
              2654.20, 1753.70)

dt2 <- data.frame(Age = age, Strength = strength)

# Scatterplot
plot(dt2$Age, dt2$Strength,
     xlab = "Age of Propellant (weeks)",
     ylab = "Shear Strength (psi)",
     main = "Propellant Age vs. Bond Shear Strength",
     pch = 19, col = "darkorange")

The scatterplot shows a strong negative linear relationship — older batches of propellant tend to have lower shear strength. The straight-line model \(y = \beta_0 + \beta_1 x + \epsilon\) appears appropriate.

fit1 <- lm(Strength ~ Age, data = dt2)
summary(fit1)
## 
## Call:
## lm(formula = Strength ~ Age, data = dt2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -215.98  -50.68   28.74   66.61  106.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2627.822     44.184   59.48  < 2e-16 ***
## Age          -37.154      2.889  -12.86 1.64e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared:  0.9018, Adjusted R-squared:  0.8964 
## F-statistic: 165.4 on 1 and 18 DF,  p-value: 1.643e-10
anova(fit1)
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 1527482.7 1527482.743 165.3768 0
Residuals 18 166254.9 9236.381 NA NA

Test for significance of regression (\(\beta_1\)):

  1. \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\)

  2. Test statistic: \(t_0 = \dfrac{b_1}{se(b_1)}\)

  3. From R: \(t = -12.86\), p-value \(= 1.64 \times 10^{-10}\)

  4. Since p-value \(< 0.05\), we reject \(H_0\). Conclusion: propellant age is a significant linear predictor of shear strength.

Fitted equation: \[\hat{y} = 2627.822 - 37.154\, x\]

Interpretation: For each additional week of propellant age, the estimated shear strength decreases by 37.154 psi on average.

90% and 95% confidence intervals for \(\beta_1\):

confint(fit1, level = 0.90)
##                    5 %      95 %
## (Intercept) 2551.20465 2704.4401
## Age          -42.16349  -32.1437
confint(fit1, level = 0.95)
##                  2.5 %    97.5 %
## (Intercept) 2534.99540 2720.6493
## Age          -43.22338  -31.0838

At 90%: \(-37.15 \pm (1.734)(2.89)\), giving \(-43.22 \leq \beta_1 \leq -31.08\)

95% confidence interval for \(\sigma^2\):

\[\frac{(n-2)MS_\text{res}}{\chi^2_{0.975,\,18}} \leq \sigma^2 \leq \frac{(n-2)MS_\text{res}}{\chi^2_{0.025,\,18}} = \frac{18 \times 9244.59}{31.526} \leq \sigma^2 \leq \frac{18 \times 9244.59}{8.231}\]

\[5{,}278.27 \leq \sigma^2 \leq 20{,}216.57\]

Plot: fitted line with CI and PI bands:

new5      <- data.frame(Age = sort(dt2$Age))
ci5       <- predict(fit1, newdata = new5, interval = "confidence")
pi5       <- predict(fit1, newdata = new5, interval = "prediction")

plot(dt2$Age, dt2$Strength,
     xlab = "Age of Propellant (weeks)",
     ylab = "Shear Strength (psi)",
     main = "Fitted Line with 95% CI and PI Bands",
     pch = 19, col = "darkorange")
abline(fit1, lwd = 2, col = "black")
lines(sort(dt2$Age), ci5[, "lwr"], col = "darkgreen", lty = 2, lwd = 1.5)
lines(sort(dt2$Age), ci5[, "upr"], col = "darkgreen", lty = 2, lwd = 1.5)
lines(sort(dt2$Age), pi5[, "lwr"], col = "purple",    lty = 3, lwd = 1.5)
lines(sort(dt2$Age), pi5[, "upr"], col = "purple",    lty = 3, lwd = 1.5)
legend("topright",
       legend = c("Data", "Fitted line", "95% CI", "95% PI"),
       pch    = c(19, NA, NA, NA),
       lty    = c(NA, 1, 2, 3),
       col    = c("darkorange", "black", "darkgreen", "purple"),
       lwd    = 2, bty = "n")

Residual diagnostics:

par(mfrow = c(2, 2))
plot(fit1, pch = 16, col = "darkorange")

par(mfrow = c(1, 1))

2 Multiple Linear Regression

2.1 From Simple to Multiple Regression

In Chapter 1 we modelled a single response variable \(Y\) using one predictor \(X\): \[y = \beta_0 + \beta_1 x + \epsilon\]

In practice, most responses depend on several predictors simultaneously. For example:

  • A researcher wants to predict a student’s nursing board exam score from GPA and age.
  • An engineer wants to model delivery time from number of cases stocked and distance walked.
  • A manufacturer wants to relate product tool life to lathe speed and tool type.

Multiple linear regression extends the simple model to handle \(k \geq 1\) predictors — both quantitative and qualitative — while keeping the linear structure that makes estimation and inference tractable.


2.2 The Multiple Linear Regression Model

2.2.1 Model equation

Given \(n\) observations on a response \(y\) and \(k\) predictors \(x_1, x_2, \ldots, x_k\), the multiple linear regression model is:

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n\]

where:

  • \(\beta_0\) is the intercept (value of \(E(y)\) when all \(x_j = 0\)),
  • \(\beta_j\) (\(j = 1, \ldots, k\)) is the partial regression coefficient for \(x_j\) — the expected change in \(y\) for a one-unit increase in \(x_j\), holding all other predictors constant,
  • \(\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)\) are the random errors (same four assumptions as in simple regression).

The systematic (deterministic) part of the model is: \[E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\]

2.2.2 Data structure

Obs \(y\) \(x_1\) \(x_2\) \(\cdots\) \(x_k\)
1 \(y_1\) \(x_{11}\) \(x_{12}\) \(\cdots\) \(x_{1k}\)
2 \(y_2\) \(x_{21}\) \(x_{22}\) \(\cdots\) \(x_{2k}\)
\(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\ddots\) \(\vdots\)
\(n\) \(y_n\) \(x_{n1}\) \(x_{n2}\) \(\cdots\) \(x_{nk}\)

2.2.3 Matrix form

Writing one equation per observation and stacking them gives the compact matrix form:

\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]

where

\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n\times 1},\quad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}_{n\times(k+1)},\quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}_{(k+1)\times 1},\quad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}_{n\times 1} \]

\(\mathbf{X}\) is called the design matrix. Its first column is all ones, corresponding to the intercept \(\beta_0\).


2.3 Parameter Estimation

2.3.1 Least squares estimator

We estimate \(\boldsymbol{\beta}\) by minimising the residual sum of squares:

\[SS_\text{res} = \boldsymbol{\epsilon}'\boldsymbol{\epsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

Setting \(\partial SS_\text{res}/\partial \boldsymbol{\beta} = \mathbf{0}\) gives the normal equations \(\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{y}\), whose solution is the least squares estimator (LSE):

\[\boxed{\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}}\]

provided \(\mathbf{X}'\mathbf{X}\) is invertible (i.e., no perfect multicollinearity among the predictors).

2.3.2 Fitted values and residuals

Quantity Formula Interpretation
Fitted values \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}\) Model’s predicted \(y\) for each observation
Residuals \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\) Observed minus predicted

2.3.3 Estimating \(\sigma^2\)

\[SS_\text{res} = \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y}\]

\[\boxed{\hat{\sigma}^2 = MS_\text{res} = \frac{SS_\text{res}}{n - k - 1}}\]

The denominator \(n - k - 1\) accounts for estimating \(k + 1\) parameters (\(\beta_0, \beta_1, \ldots, \beta_k\)), giving an unbiased estimator of \(\sigma^2\).


2.4 Inference for the Regression Coefficients

2.4.1 Distribution of \(b_j\)

Each least squares estimate \(b_j\) follows a \(t\)-distribution:

\[t = \frac{b_j - \beta_j}{se(b_j)} \sim t_{n-k-1}, \qquad j = 0, 1, \ldots, k\]

where the standard error of \(b_j\) is: \[se(b_j) = \sqrt{\hat{\sigma}^2 C_{jj}}\]

and \(C_{jj}\) is the \((j, j)\)-th diagonal element of \((\mathbf{X}'\mathbf{X})^{-1}\).

2.4.2 Confidence intervals for \(\beta_j\)

A \((1-\alpha)100\%\) confidence interval for \(\beta_j\) is:

\[b_j \pm t_{1-\alpha/2,\; n-k-1} \cdot se(b_j)\]


2.5 Testing the Fitted Regression

2.5.1 Overall F-test

The overall F-test asks: does the model explain any significant variation in \(y\)?

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad \text{vs.} \qquad H_1: \text{at least one } \beta_j \neq 0\]

Test statistic:

\[F = \frac{SS_\text{reg}/k}{SS_\text{res}/(n-k-1)} = \frac{MS_\text{reg}}{MS_\text{res}} \sim F_{k,\; n-k-1} \quad \text{under } H_0\]

ANOVA table:

Source df SS MS \(F\)
Regression \(k\) \(SS_\text{reg} = \mathbf{b}'\mathbf{X}'\mathbf{y} - n\bar{y}^2\) \(MS_\text{reg} = SS_\text{reg}/k\) \(F = MS_\text{reg}/MS_\text{res}\)
Residual \(n-k-1\) \(SS_\text{res} = \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y}\) \(MS_\text{res} = SS_\text{res}/(n-k-1)\)
Total \(n-1\) \(SS_T = \mathbf{y}'\mathbf{y} - n\bar{y}^2\)

Reject \(H_0\) if \(F_\text{cal} > F_{1-\alpha,\; k,\; n-k-1}\) or if p-value \(< \alpha\).

💡 In R

anova(fit) produces the ANOVA table. summary(fit) reports the \(F\)-statistic and its p-value at the bottom of the output.

2.5.2 Individual t-test for each \(\beta_j\)

Once the overall F-test is significant, we test each predictor individually:

\[H_0: \beta_j = 0 \qquad \text{vs.} \qquad H_1: \beta_j \neq 0\]

Test statistic: \(t_0 = \dfrac{b_j}{se(b_j)} \sim t_{n-k-1}\) under \(H_0\)

This tests whether \(x_j\) contributes significantly to the model given that all other predictors are already included.


2.6 Coefficient of Determination

2.6.1 \(R^2\) — multiple coefficient of determination

\[R^2 = \frac{SS_\text{reg}}{SS_T} = 1 - \frac{SS_\text{res}}{SS_T}, \qquad 0 \leq R^2 \leq 1\]

Limitation: \(R^2\) always increases (never decreases) when a predictor is added, even if it is useless. This makes it unreliable for comparing models with different numbers of predictors.

2.6.2 Adjusted \(R^2\)

The adjusted \(R^2\) penalises for the number of predictors and can decrease if an added predictor does not improve fit sufficiently:

\[R^2_\text{adj} = 1 - \frac{SS_\text{res}/(n-k-1)}{SS_T/(n-1)} = 1 - \left(\frac{n-1}{n-k-1}\right)\frac{SS_\text{res}}{SS_T}\]

\(R^2\) vs. Adjusted \(R^2\)

Use \(R^2\) to describe the fit of a given model. Use adjusted \(R^2\) — or AIC/BIC — when comparing models with different numbers of predictors. A model with higher adjusted \(R^2\) is generally preferred.


2.7 Prediction

For a new observation with predictor values \(\mathbf{x}_0 = (1, x_{01}, x_{02}, \ldots, x_{0k})'\):

Point estimate: \(\hat{y}_0 = \mathbf{x}_0'\mathbf{b}\)

\((1-\alpha)100\%\) confidence interval for \(E(y \mid \mathbf{x}_0)\) (mean response): \[\hat{y}_0 \pm t_{1-\alpha/2,\; n-k-1} \sqrt{\hat{\sigma}^2 \, \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}\]

\((1-\alpha)100\%\) prediction interval for a new individual \(y \mid \mathbf{x}_0\): \[\hat{y}_0 \pm t_{1-\alpha/2,\; n-k-1} \sqrt{\hat{\sigma}^2 \left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)}\]

The prediction interval is always wider than the confidence interval for the same reason as in simple regression — it must account for the random error in a single new observation.

💡 Reading summary(fit) output in R

Output field Meaning
Estimate (Intercept) \(b_0\): estimated \(y\) when all \(x_j = 0\)
Estimate (\(x_j\)) \(b_j\): estimated change in \(\hat{y}\) per 1-unit increase in \(x_j\), holding all others constant
Std. Error \(se(b_j)\): standard error of the estimate
t value \(t_0 = b_j/se(b_j)\): test statistic for \(H_0: \beta_j = 0\)
Pr(>|t|) p-value for individual \(t\)-test; reject \(H_0\) if \(< \alpha\)
Residual standard error \(\hat{\sigma} = \sqrt{MS_\text{res}}\), on \(n-k-1\) df
Multiple R-squared \(R^2\)
Adjusted R-squared \(R^2_\text{adj}\) — use this for model comparison
F-statistic Overall F-test statistic
p-value (F) p-value for the overall F-test

📝 Example 2.1: Delivery Time Data

A soft drink bottler wants to predict the time required by a route driver to service vending machines. The industrial engineer in charge suggests two key predictors: the number of cases stocked (\(x_1\)) and the distance walked by the driver (\(x_2\), in feet). Data were collected from 25 deliveries.

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\]

  • \(\beta_1\): average increase in delivery time per additional case stocked, holding distance constant.
  • \(\beta_2\): average increase in delivery time per additional foot walked, holding cases constant.
Obs Time \(y\) (min) Cases \(x_1\) Distance \(x_2\) (ft) Obs Time \(y\) Cases \(x_1\) Distance \(x_2\)
1 16.68 7 560 14 19.75 6 462
2 11.50 3 220 15 24.00 9 448
3 12.03 3 340 16 29.00 10 776
4 14.88 4 80 17 15.35 6 200
5 13.75 6 150 18 19.00 7 132
6 18.11 7 330 19 9.50 3 36
7 15.93 5 50 20 35.10 17 770
8 14.38 4 357 21 17.90 10 140
9 25.32 11 449 22 52.32 26 810
10 11.52 3 120 23 18.75 9 450
11 11.93 4 220 24 19.83 8 635
12 27.90 10 588 25 10.75 4 150
13 13.13 4 64
dat <- data.frame(
  y  = c(16.68, 11.50, 12.03, 14.88, 13.75, 18.11, 15.93, 14.38, 25.32,
         11.52, 11.93, 27.90, 13.13, 19.75, 24.00, 29.00, 15.35, 19.00,
          9.50, 35.10, 17.90, 52.32, 18.75, 19.83, 10.75),
  x1 = c(7, 3, 3, 4, 6, 7, 5, 4, 11, 3, 4, 10, 4, 6, 9, 10, 6, 7, 3,
         17, 10, 26, 9, 8, 4),
  x2 = c(560, 220, 340, 80, 150, 330, 50, 357, 449, 120, 220, 588, 64,
         462, 448, 776, 200, 132, 36, 770, 140, 810, 450, 635, 150)
)

Exploratory analysis

# Correlation matrix
round(cor(dat), 3)
##        y    x1    x2
## y  1.000 0.968 0.791
## x1 0.968 1.000 0.729
## x2 0.791 0.729 1.000
# Pairwise scatterplot matrix
pairs(dat,
      main  = "Scatterplot Matrix: Delivery Time Data",
      pch   = 19, col = "steelblue")

From the correlation matrix and scatterplots, both \(x_1\) and \(x_2\) are positively correlated with \(y\). The correlation between \(x_1\) and \(x_2\) is moderate, so multicollinearity is unlikely to be a serious problem.

Fit the multiple regression model

y  <- dat$y  
x1 <- dat$x1  
x2 <- dat$x2

fit.full <- lm(y ~ x1 + x2, data = dat)
summary(fit.full)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7550 -1.7329 -0.0388  1.7226  3.1038 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.301098   0.803072   6.601 1.23e-06 ***
## x1          1.537225   0.125982  12.202 2.88e-11 ***
## x2          0.007012   0.002645   2.651   0.0146 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.153 on 22 degrees of freedom
## Multiple R-squared:  0.9517, Adjusted R-squared:  0.9474 
## F-statistic: 216.9 on 2 and 22 DF,  p-value: 3.305e-15
anova(fit.full)
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 1979.40029 1979.400287 426.860658 0.0000000
x2 1 32.59253 32.592526 7.028627 0.0145887
Residuals 22 102.01644 4.637111 NA NA

(a) Overall F-test at \(\alpha = 0.05\)

  1. \(H_0: \beta_1 = \beta_2 = 0\) vs. \(H_1\): at least one \(\beta_j \neq 0\)

  2. Test statistic: \(F = \dfrac{SS_\text{reg}/k}{SS_\text{res}/(n-k-1)}\)

  3. From R: \(F = 261.2\) on \((2, 22)\) df, p-value \(= 4.69 \times 10^{-16}\)

Source df SS MS \(F\)
Regression 2 5550.82 2775.41 261.24
Residual 22 233.73 10.62
Total 24 5784.55
  1. Since p-value \(< 0.05\), we reject \(H_0\). At least one of \(x_1\) or \(x_2\) is a significant predictor of delivery time.

(b) Test whether \(x_1\) should be included when \(x_2\) is already in the model

f1 <- lm(y ~ x2)
f2 <- lm(y ~ x2 + x1)
anova(f1, f2)
Res.Df RSS Df Sum of Sq F Pr(>F)
23 792.4236 NA NA NA NA
22 102.0164 1 690.4071 148.8873 0
  1. \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\) (given \(x_2\) is in the model)
  2. Test statistic: \(t_0 = b_1 / se(b_1)\)
  3. From R: \(b_1 = 1.6159\), \(se(b_1) = 0.1707\), \(t = 9.464\), p-value \(= 3.25 \times 10^{-9}\) or \(F=148.89\) and p-value \(= 2.88 \times 10^{-11}\)
  4. Since p-value \(< 0.05\), reject \(H_0\). \(x_1\) should be included in the model.

(c) Fitted equation

round(coef(fit.full), 4)
## (Intercept)          x1          x2 
##      5.3011      1.5372      0.0070

\[\hat{y} = 5.301 + 1.537\, x_1 + 0.007\, x_2\]

Interpretation:

  • \(b_1 = 1.537\): each additional case stocked is associated with an estimated 1.537-minute increase in delivery time, holding distance constant.
  • \(b_2 = 0.007\): each additional foot walked is associated with an estimated 0.007-minute increase in delivery time, holding cases constant.

(d) 95% confidence intervals for \(\beta_0, \beta_1, \beta_2\)

confint(fit.full, level = 0.95)
##                   2.5 %    97.5 %
## (Intercept) 3.635628107 6.9665670
## x1          1.275953966 1.7984953
## x2          0.001526778 0.0124966

(e) How much variation in \(y\) is explained by \(x_1\) and \(x_2\)?

From summary(), the Adjusted \(R^2 = 0.9487\). That is, the model using number of cases and distance walked explains 94.87% of the total variation in delivery time.

(f) Predict delivery time when \(x_1 = 8\) cases and \(x_2 = 275\) ft

\[\hat{y} = 5.301 + 1.537(8) + 0.007(275) = 19.522 \text{ min}\]

newdata <- data.frame(x1 = 8, x2 = 275)
predict(fit.full, newdata, interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 19.52711 18.51225 20.54197
predict(fit.full, newdata, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 19.52711 14.94738 24.10684

2.8 Dummy Variables for Qualitative Predictors

2.8.1 Why dummy variables?

When a predictor is qualitative (categorical), it cannot be entered directly as a number — the numeric codes 1, 2, 3 would incorrectly imply an ordering or equal spacing between groups.

Instead, we create dummy (indicator) variables: binary (0/1) variables that encode group membership.

Key rule: for a qualitative variable with \(k\) groups, we create \(k - 1\) dummy variables. The omitted group is the reference category (baseline); all dummy coefficients are interpreted as differences relative to that baseline.

Why \(k-1\), not \(k\)? Using \(k\) dummies would make the design matrix \(\mathbf{X}\) singular — the \(k\) dummy columns always sum to 1 (the intercept column), creating perfect multicollinearity. Dropping one group avoids this and lets the intercept represent the reference group’s mean.

2.8.2 General coding scheme

For a variable with \(k = 3\) groups (A = reference, B, C):

Group \(D_B\) \(D_C\) Interpretation
A (reference) 0 0 Intercept alone = mean of A
B 1 0 Intercept \(+ \beta_B\) = mean of B
C 0 1 Intercept \(+ \beta_C\) = mean of C

The model is: \[\hat{y} = b_0 + b_1 x_1 + b_B D_B + b_C D_C\]

This gives parallel regression lines (same slope, different intercepts) for each group:

Group Equation Intercept
A \(\hat{y} = b_0 + b_1 x_1\) \(b_0\)
B \(\hat{y} = (b_0 + b_B) + b_1 x_1\) \(b_0 + b_B\)
C \(\hat{y} = (b_0 + b_C) + b_1 x_1\) \(b_0 + b_C\)

Interpretation of dummy coefficients: \(b_B\) estimates the mean difference in \(y\) between Group B and Group A (the reference), holding \(x_1\) constant. A positive \(b_B\) means Group B has a higher mean response than Group A.

📝 Example 2.2: Tool Life, Lathe Speed, and Tool Type

Twenty observations on tool life (\(y\), hours), lathe speed (\(x_1\), rpm), and tool type (\(x_2\): A or B) are shown below.

obs \(y\) (hrs) \(x_1\) (rpm) Type obs \(y\) (hrs) \(x_1\) (rpm) Type
1 18.73 610 A 11 30.16 670 B
2 14.52 950 A 12 27.09 770 B
3 17.43 720 A 13 25.40 880 B
4 14.54 840 A 14 26.05 1000 B
5 13.44 980 A 15 33.49 760 B
6 24.39 530 A 16 35.62 590 B
7 13.34 680 A 17 26.07 910 B
8 22.71 540 A 18 36.78 650 B
9 12.68 890 A 19 34.95 810 B
10 19.32 730 A 20 43.67 500 B

The model is: \(y = \beta_0 + \beta_1 x_1 + \beta_2 D_B + \epsilon\), where \(D_B = 1\) if Tool Type B, \(0\) if Tool Type A (reference).

dt <- data.frame(
  y        = c(18.73,14.52,17.43,14.54,13.44,24.39,13.34,22.71,12.68,19.32,
               30.16,27.09,25.40,26.05,33.49,35.62,26.07,36.78,34.95,43.67),
  x1       = c(610,950,720,840,980,530,680,540,890,730,
                670,770,880,1000,760,590,910,650,810,500),
  ToolType = c(rep("A",10), rep("B",10))
)

Exploratory analysis

# Scatterplot coloured by Tool Type
par(mfrow = c(1, 2))
plot(dt$x1, dt$y,
     pch = 21,
     bg  = c("steelblue","tomato")[as.integer(factor(dt$ToolType))],
     xlab = "Lathe Speed (rpm)", ylab = "Tool Life (hours)",
     main = "Tool Life vs. Speed by Type")
legend("topright", legend = c("Type A","Type B"),
       pt.bg = c("steelblue","tomato"), pch = 21, bty = "n")

# Boxplot by Tool Type
boxplot(y ~ ToolType, data = dt,
        xlab = "Tool Type", ylab = "Tool Life (hours)",
        main = "Tool Life by Tool Type",
        col  = c("steelblue","tomato"))

par(mfrow = c(1, 1))

The scatterplot shows two distinct clouds of points — one for each tool type — sitting at different vertical levels but following similar downward slopes. This strongly suggests that tool type shifts the intercept (level) of the relationship, making a dummy variable appropriate.

Create dummy variable (\(D_B = 1\) if Type B, \(0\) if Type A):

dt$D_B <- ifelse(dt$ToolType == "B", 1, 0)

# Verify coding
head(dt[, c("ToolType","D_B")], 5)
ToolType D_B
A 0
A 0
A 0
A 0
A 0
tail(dt[, c("ToolType","D_B")], 5)
ToolType D_B
16 B 1
17 B 1
18 B 1
19 B 1
20 B 1

Fit the regression model

fit2 <- lm(y ~ x1 + D_B, data = dt)
summary(fit2)
## 
## Call:
## lm(formula = y ~ x1 + D_B, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5527 -1.7868 -0.0016  1.8395  4.9838 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.98560    3.51038  10.536 7.16e-09 ***
## x1          -0.02661    0.00452  -5.887 1.79e-05 ***
## D_B         15.00425    1.35967  11.035 3.59e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.039 on 17 degrees of freedom
## Multiple R-squared:  0.9003, Adjusted R-squared:  0.8886 
## F-statistic: 76.75 on 2 and 17 DF,  p-value: 3.086e-09
anova(fit2)
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 293.0054 293.005403 31.71568 2.99e-05
D_B 1 1125.0282 1125.028214 121.77602 0.00e+00
Residuals 17 157.0546 9.238504 NA NA
confint(fit2, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 29.57934160 44.39186079
## x1          -0.03614301 -0.01707145
## D_B         12.13559830 17.87290293

Test whether \(x_1\) (lathe speed) should be included, given \(x_2\) (tool type) is in the model:

  1. \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\)
  2. \(t_0 = b_1 / se(b_1)\)
  3. From R: \(t = -5.887\), p-value \(= 1.79 \times 10^{-5}\)
  4. Since p-value \(< 0.05\), reject \(H_0\). \(x_1\) (lathe speed) should be included in the model.

Test whether \(x_2\) (tool type) should be included, given \(x_1\) is in the model:

  1. \(H_0: \beta_2 = 0\) vs. \(H_1: \beta_2 \neq 0\)
  2. \(t_0 = b_2 / se(b_2)\)
  3. From R: \(t = 11.035\), p-value \(= 3.59 \times 10^{-9}\)
  4. Since p-value \(< 0.05\), reject \(H_0\). \(x_2\) (tool type) should be included in the model.

Overall F-test at \(\alpha = 0.05\):

  1. \(H_0: \beta_1 = \beta_2 = 0\) vs. \(H_1\): at least one \(\beta_j \neq 0\)
  2. \(F = SS_\text{reg}(x_1, x_2)/2 \;\div\; SS_\text{res}(x_1,x_2)/(n-3)\)
  3. From R: \(F = 76.75\), p-value \(= 3.09 \times 10^{-9}\)
  4. Reject \(H_0\). The overall model is significant.

Fitted equation (overall):

\[\hat{y} = 36.986 - 0.027\, x_1 + 15.004\, D_B\]

Fitted equations by tool type:

\[ \begin{aligned} \text{Type A } (D_B=0): \quad \hat{y}_A &= 36.986 - 0.027\, x_1 \\ \text{Type B } (D_B=1): \quad \hat{y}_B &= (36.986 + 15.004) - 0.027\, x_1 = 51.990 - 0.027\, x_1 \end{aligned} \]

Both equations share the same slope (\(-0.027\): tool life decreases by 0.027 hours per additional rpm), but Tool Type B has an intercept 15.004 hours higher than Type A. This means, at any given lathe speed, Tool Type B is estimated to last about 15 hours longer than Tool Type A.

Plot the two fitted lines:

b <- coef(fit2)
x_range <- seq(min(dt$x1), max(dt$x1), length.out = 100)

plot(dt$x1, dt$y,
     pch  = 21,
     bg   = c("steelblue","tomato")[as.integer(factor(dt$ToolType))],
     xlab = "Lathe Speed (rpm)", ylab = "Tool Life (hours)",
     main = "Fitted Regression Lines by Tool Type")
lines(x_range, b[1]            + b[2]*x_range, col = "steelblue", lwd = 2)
lines(x_range, b[1] + b[3]     + b[2]*x_range, col = "tomato",    lwd = 2)
legend("topright",
       legend = c("Type A (observed)","Type B (observed)",
                  "Type A (fitted)",  "Type B (fitted)"),
       pt.bg  = c("steelblue","tomato", NA, NA),
       pch    = c(21, 21, NA, NA),
       lty    = c(NA, NA, 1, 1),
       col    = c("black","black","steelblue","tomato"),
       lwd    = 2, bty = "n")

95% confidence interval for \(\beta_2\) (the tool-type effect):

\[b_2 \pm t_{0.975,\;17} \cdot se(b_2) = 15.004 \pm (2.110)(1.360) \;\Rightarrow\; 12.135 \leq \beta_2 \leq 17.873\]

Interpretation: We are 95% confident that Tool Type B lasts between 12.1 and 17.9 hours longer than Tool Type A at the same lathe speed.

Residual diagnostics:

par(mfrow = c(2, 2))
plot(fit2, pch = 16,
     col = c("steelblue","tomato")[as.integer(factor(dt$ToolType))])

par(mfrow = c(1, 1))

2.9 Variable Selection

2.9.1 Why variable selection?

With many candidate predictors:

  • Including irrelevant predictors inflates variance of the estimates, reducing precision.
  • Including too few predictors causes bias (underfitting).
  • Variable selection finds a parsimonious model: small, interpretable, and still fitting well.

All three standard methods use AIC (Akaike Information Criterion) to compare models:

\[\text{AIC} = n\ln(SS_\text{res}/n) + 2(k+1)\]

Lower AIC = better balance between goodness of fit and model complexity.

Method Starting point Strategy Stops when
Forward selection Intercept only Add best predictor at each step No remaining predictor reduces AIC
Backward elimination Full model Remove worst predictor at each step No remaining predictor increases AIC if removed
Stepwise (both) Full model Add or remove at each step No change improves AIC

❗ Which method to prefer?

Stepwise regression is generally preferred because it can reconsider variables added or removed in earlier steps. Forward and backward selections cannot undo earlier decisions. However, none of the three methods guarantees finding the globally optimal model — always check whether the selected model makes practical sense and satisfies the regression assumptions.

2.9.2 Variable selection on the mtcars data

The built-in mtcars dataset records fuel efficiency and characteristics of 32 automobiles. We illustrate variable selection with mpg (fuel efficiency) as the response.

Variable Description Type
mpg Miles per US gallon Response
cyl Number of cylinders Discrete (4, 6, 8)
disp Engine displacement (cu.in.) Numeric
hp Gross horsepower Numeric
drat Rear axle ratio Numeric
wt Weight (1,000 lbs) Numeric
qsec ¼-mile time (seconds) Numeric
vs Engine shape (0 = V, 1 = straight) Binary
am Transmission (0 = auto, 1 = manual) Binary
gear Number of forward gears Discrete
carb Number of carburetors Discrete
data("mtcars")
null_model <- lm(mpg ~ 1, data = mtcars)
full_model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
                 data = mtcars)

2.9.3 Forward selection

Starts with the intercept-only model and adds the predictor that most reduces AIC at each step. Stops when no addition improves AIC.

fw.fit <- step(null_model,
               scope     = list(lower = null_model, upper = full_model),
               direction = "forward",
               trace     = 1)
## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + cyl   1    817.71  308.33  76.494
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## + drat  1    522.48  603.57  97.988
## + vs    1    496.53  629.52  99.335
## + am    1    405.15  720.90 103.672
## + carb  1    341.78  784.27 106.369
## + gear  1    259.75  866.30 109.552
## + qsec  1    197.39  928.66 111.776
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq    RSS    AIC
## + cyl   1    87.150 191.17 63.198
## + hp    1    83.274 195.05 63.840
## + qsec  1    82.858 195.46 63.908
## + vs    1    54.228 224.09 68.283
## + carb  1    44.602 233.72 69.628
## + disp  1    31.639 246.68 71.356
## <none>              278.32 73.217
## + drat  1     9.081 269.24 74.156
## + gear  1     1.137 277.19 75.086
## + am    1     0.002 278.32 75.217
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1   14.5514 176.62 62.665
## + carb  1   13.7724 177.40 62.805
## <none>              191.17 63.198
## + qsec  1   10.5674 180.60 63.378
## + gear  1    3.0281 188.14 64.687
## + disp  1    2.6796 188.49 64.746
## + vs    1    0.7059 190.47 65.080
## + am    1    0.1249 191.05 65.177
## + drat  1    0.0010 191.17 65.198
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## + am    1    6.6228 170.00 63.442
## + disp  1    6.1762 170.44 63.526
## + carb  1    2.5187 174.10 64.205
## + drat  1    2.2453 174.38 64.255
## + qsec  1    1.4010 175.22 64.410
## + gear  1    0.8558 175.76 64.509
## + vs    1    0.0599 176.56 64.654
summary(fw.fit)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## hp          -0.01804    0.01188  -1.519 0.140015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

Result: Forward selection chose wt, cyl, and hp.

Fitted equation: \(\hat{\text{mpg}} = 38.751 - 3.167\,\text{wt} - 0.942\,\text{cyl} - 0.018\,\text{hp}\)

2.9.4 Backward elimination

Starts with the full model and removes the predictor whose removal most reduces AIC at each step.

be.fit <- step(full_model,
               direction = "backward",
               trace     = 1)
## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
summary(be.fit)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Result: Backward elimination chose wt, qsec, and am.

Fitted equation: \(\hat{\text{mpg}} = 9.618 - 3.917\,\text{wt} + 1.226\,\text{qsec} + 2.936\,\text{am}\)

2.9.5 Stepwise regression

Combines both directions — at each step the algorithm may add or remove a variable. Generally the most reliable of the three.

sw.fit <- step(full_model,
               direction = "both",
               trace     = 1)
## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## + cyl   1    0.0799 147.49 70.898
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## + vs    1    0.2685 147.57 68.915
## + cyl   1    0.1883 147.66 68.932
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## + carb  1     0.685 147.84 66.973
## + vs    1     0.434 148.09 67.028
## + cyl   1     0.414 148.11 67.032
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## + gear  1     1.565 148.53 65.121
## + cyl   1     1.003 149.09 65.242
## + vs    1     0.645 149.45 65.319
## + carb  1     0.107 149.99 65.434
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## + drat  1     3.345 150.09 63.457
## + gear  1     2.977 150.46 63.535
## + cyl   1     2.447 150.99 63.648
## + vs    1     1.121 152.32 63.927
## + carb  1     0.011 153.43 64.160
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## + disp  1     6.629 153.44 62.162
## + carb  1     3.227 156.84 62.864
## + drat  1     1.428 158.64 63.229
## - qsec  1    20.225 180.29 63.323
## + cyl   1     0.249 159.82 63.465
## + vs    1     0.249 159.82 63.466
## + gear  1     0.171 159.90 63.481
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## + hp    1     9.219 160.07 61.515
## + carb  1     8.036 161.25 61.751
## + disp  1     3.276 166.01 62.682
## + cyl   1     1.501 167.78 63.022
## + drat  1     1.400 167.89 63.042
## + gear  1     0.123 169.16 63.284
## + vs    1     0.000 169.29 63.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
summary(sw.fit)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Result: Stepwise also chose wt, qsec, and am — same as backward elimination.

Fitted equation: \(\hat{\text{mpg}} = 9.618 - 3.917\,\text{wt} + 1.226\,\text{qsec} + 2.936\,\text{am}\)

💡 Comparing the three results

Method Selected variables Adj. \(R^2\)
Forward wt, cyl, hp ~0.826
Backward wt, qsec, am ~0.834
Stepwise wt, qsec, am ~0.834

Forward selection disagrees here — it chose a different model. When methods disagree, compare adjusted \(R^2\), check diagnostics on each, and prefer the model with better interpretability and better-satisfied assumptions.


2.10 Model Diagnostics

2.10.1 Why check assumptions?

The linear regression model assumes:

\[y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)\]

This implies four conditions — LINE:

Condition Meaning
Linearity \(E(y)\) is a linear function of the predictors
Independence Errors \(\epsilon_i\) are independent of each other
Normality Errors follow a normal distribution
Equal variance \(\text{Var}(\epsilon_i) = \sigma^2\) (constant) for all \(i\)

If these are met, our \(t\)-tests, F-tests, and confidence intervals are valid. If they are violated, inference is unreliable — “garbage in, garbage out.”

2.10.2 Graphical diagnostics

R’s plot(fit) produces four standard diagnostic plots. We use the stepwise-selected model (sw.fit) as the working example.

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
par(mfrow = c(2, 2))
plot(fit, pch = 16, col = "steelblue")

par(mfrow = c(1, 1))
Plot What it checks What “good” looks like Warning signs
Residuals vs. Fitted Linearity + constant variance Random scatter around \(e = 0\); even spread Curve (non-linearity); funnel shape (non-constant variance)
Normal Q-Q Normality of residuals Points close to the diagonal line S-shape (heavy tails); systematic bending
Scale-Location Constant variance (more sensitive) Flat red line; even vertical spread Upward or downward trend in red line
Residuals vs. Leverage Influential observations All points inside Cook’s distance contours Points in upper/lower right corner (high leverage + large residual)

2.10.3 Formal test: constant variance (Breusch-Pagan)

\[H_0: \text{errors have constant variance} \qquad H_1: \text{non-constant variance (heteroscedasticity)}\]

library(lmtest)
bptest(fit)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 6.1871, df = 3, p-value = 0.1029

Since p-value \(> 0.05\), we fail to reject \(H_0\). The constant variance assumption is satisfied.

2.10.4 Formal test: normality (Shapiro-Wilk)

\[H_0: \text{errors are normally distributed} \qquad H_1: \text{errors are not normal}\]

res <- resid(fit)
par(mfrow = c(1, 2))
hist(res,
     main = "Histogram of Residuals",
     xlab = "Residuals", col = "lightblue", border = "white",
     freq = FALSE)
curve(dnorm(x, mean(res), sd(res)), add = TRUE, col = "red", lwd = 2)
qqnorm(res, pch = 16, col = "steelblue", main = "Normal Q-Q Plot")
qqline(res, col = "red", lwd = 2)

par(mfrow = c(1, 1))

shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.9411, p-value = 0.08043

Since p-value \(> 0.05\), we fail to reject \(H_0\). The normality assumption is satisfied.

2.10.5 Formal test: independence (Durbin-Watson)

\[H_0: \text{errors are not autocorrelated (independent)} \qquad H_1: \text{errors are autocorrelated}\]

dwtest(fit)
## 
##  Durbin-Watson test
## 
## data:  fit
## DW = 1.7141, p-value = 0.1263
## alternative hypothesis: true autocorrelation is greater than 0

Since p-value \(> 0.05\), we fail to reject \(H_0\). The independence assumption is satisfied.

2.10.6 Summary of assumption checks

Assumption Graphical check Formal test Satisfied if
Linearity Residuals vs. Fitted No pattern in the plot
Constant variance Scale-Location Breusch-Pagan p-value \(> 0.05\)
Normality Q-Q plot, histogram Shapiro-Wilk p-value \(> 0.05\)
Independence Residuals vs. order Durbin-Watson p-value \(> 0.05\)

2.10.7 Dealing with violated assumptions

If an assumption is violated, consider the remedies in the table below before drawing any conclusions.

Violated assumption Possible remedy
Non-linearity Add polynomial terms (\(x^2\), \(x^3\)); transform \(x\); fit a different model
Non-constant variance Apply a variance-stabilising transformation to \(y\) (log, square root)
Non-normality Transform \(y\); check for and address outliers
Autocorrelation Include a time-trend variable; use time-series regression methods

2.10.8 Box-Cox transformation

When the response \(y\) is strictly positive and the model shows non-normality or non-constant variance, the Box-Cox family of power transformations provides a systematic approach to choosing a variance-stabilising transformation:

\[y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \ln(y) & \lambda = 0 \end{cases}\]

Common values of \(\lambda\) and their corresponding transformations:

\(\lambda\) Transformation Use when
\(-1\) Reciprocal \(1/y\) Variance increases rapidly with mean
\(-0.5\) Reciprocal square root \(1/\sqrt{y}\)
\(0\) Log transformation \(\ln(y)\) Most common remedy for right skew / heteroscedasticity
\(0.5\) Square root \(\sqrt{y}\) Count data; mild variance inflation
\(1\) No transformation Assumptions already satisfied
\(2\) Square \(y^2\) Variance decreases with mean (rare)

Finding the optimal \(\lambda\) in R:

library(MASS)
boxcox(fit, lambda = seq(-3, 3, by = 0.05))

The plot shows the log-likelihood as a function of \(\lambda\). The peak of the curve is the maximum-likelihood estimate of \(\lambda\). The dashed vertical lines mark the 95% confidence interval for \(\lambda\).

  • If the CI contains \(\lambda = 1\): no transformation is needed.
  • If the CI contains \(\lambda = 0\): use a log transformation (\(\ln y\)).
  • If the CI contains \(\lambda = 0.5\): use a square-root transformation (\(\sqrt{y}\)).

For the mtcars model, the peak is near \(\lambda = 1\), suggesting no transformation is necessary — consistent with the diagnostic tests above.

# If lambda = 0 were indicated, apply log transformation:
fit_log <- lm(log(mpg) ~ wt + qsec + am, data = mtcars)
summary(fit_log)
## 
## Call:
## lm(formula = log(mpg) ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13879 -0.08114 -0.03466  0.07030  0.26575 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.69410    0.31326   8.600 2.40e-09 ***
## wt          -0.22456    0.03201  -7.015 1.25e-07 ***
## qsec         0.05329    0.01299   4.101  0.00032 ***
## am           0.08558    0.06351   1.347  0.18863    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1107 on 28 degrees of freedom
## Multiple R-squared:  0.8752, Adjusted R-squared:  0.8619 
## F-statistic: 65.47 on 3 and 28 DF,  p-value: 9.036e-13
# Then re-check diagnostics on the transformed model
par(mfrow = c(2, 2))
plot(fit_log, pch = 16, col = "steelblue")

par(mfrow = c(1, 1))

3 Nonparametric Statistics

Learning Objectives

After completing this chapter, students should be able to:

  • Understand the concept and rationale of nonparametric methods
  • Identify when nonparametric tests are appropriate
  • Apply the sign test, Wilcoxon signed-rank test, Mann–Whitney U test, Kruskal–Wallis test, Friedman test, Spearman rank correlation, and chi-squared test
  • Perform post-hoc comparisons after significant omnibus tests
  • Interpret and report results in practical contexts

3.1 Introduction

3.1.1 What are nonparametric methods?

Nonparametric (or distribution-free) statistical methods do not require the data to follow a specific probability distribution. They are often based on ranks or signs rather than the raw numerical values, making them robust to outliers and skewed data.

Compared to their parametric counterparts, nonparametric methods:

  • Make fewer assumptions about the population distribution
  • Are applicable to ordinal (ranked) data as well as continuous data
  • Are often the only valid choice for small samples where normality cannot be verified
  • Tend to have less statistical power than parametric tests when the parametric assumptions are actually met

3.1.2 When to use nonparametric methods

Situation Why nonparametric?
Small sample size (\(n < 30\)) Cannot verify normality; Central Limit Theorem may not apply
Population distribution unknown or clearly non-normal Parametric inference would be invalid
Presence of extreme outliers Outliers heavily distort means and parametric tests
Ordinal data (rankings, Likert scales) Means are not meaningful for ordinal scales
Data include only directions or signs Magnitude is unknown or unreliable

3.1.3 Which test to use?

Use this table to choose the appropriate test based on your study design:

Design Parametric test Nonparametric equivalent
One sample: test median One-sample \(t\)-test Sign test or Wilcoxon signed-rank test
Two related (paired) samples Paired \(t\)-test Sign test or Wilcoxon signed-rank test
Two independent samples Independent \(t\)-test Mann–Whitney U test
\(k > 2\) related samples (blocks) Repeated-measures ANOVA Friedman test
\(k > 2\) independent samples One-way ANOVA Kruskal–Wallis test
Association between two variables Pearson correlation Spearman rank correlation
Association between two categorical variables Chi-squared test of independence

3.2 Sign Test and Wilcoxon Signed-Rank Test for One Sample

Both tests address the same question — does the population median equal a specified value \(\theta_0\)? — but differ in what information they use:

Test Uses When to prefer
Sign test Only the direction (above/below \(\theta_0\)) Very small samples; data are ordinal; distribution may be asymmetric
Wilcoxon signed-rank Direction and magnitude of differences Symmetric distribution; more powerful than sign test

3.2.1 Sign test for one sample

Hypotheses (choose one):

\[ \begin{aligned} (a)\; &H_0: \theta \leq \theta_0 \quad H_1: \theta > \theta_0 \\ (b)\; &H_0: \theta \geq \theta_0 \quad H_1: \theta < \theta_0 \\ (c)\; &H_0: \theta = \theta_0 \quad H_1: \theta \neq \theta_0 \end{aligned} \]

Procedure:

  1. Compute \(d_i = x_i - \theta_0\) for each observation.
  2. Assign signs: “\(+\)” if \(d_i > 0\), “\(-\)” if \(d_i < 0\); ignore ties (\(d_i = 0\)).
  3. Let \(n\) = number of non-tied observations; let \(T^+\) = number of positive signs.
  4. Under \(H_0\), \(T^+ \sim \text{Binomial}(n,\, 0.5)\).
  5. Compute p-value using binom.test() in R.

📝 Example 3.1: PM2.5 Concentration

Is the median PM2.5 concentration equal to 35? Data from 5 measurements: 30, 40, 35, 50, 45.

x      <- c(30, 40, 35, 50, 45)
theta0 <- 35

par(mfrow = c(1, 2))
boxplot(x, main = "PM2.5 Concentration", ylab = "PM2.5", col = "lightblue")
hist(x, main = "Histogram", xlab = "PM2.5", col = "lightblue", border = "white")

par(mfrow = c(1, 1))

Tpos <- sum(x > theta0)
Tneg <- sum(x < theta0)
cat("T+ =", Tpos, "  T- =", Tneg, "  Ties =", sum(x == theta0), "\n")
## T+ = 3   T- = 1   Ties = 1
binom.test(x = Tpos, n = Tpos + Tneg, alternative = "two.sided")
## 
##  Exact binomial test
## 
## data:  Tpos and Tpos + Tneg
## number of successes = 3, number of trials = 4, p-value = 0.625
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1941204 0.9936905
## sample estimates:
## probability of success 
##                   0.75

Hypothesis test:

  1. \(H_0: \theta = 35\) vs. \(H_1: \theta \neq 35\)
  2. \(T^+ \sim \text{Binomial}(n=4,\; 0.5)\) (one tie excluded)
  3. From R: \(T^+ = 3\), p-value \(= 0.625\)
  4. Since p-value \(> 0.05\), we fail to reject \(H_0\). There is insufficient evidence that the median PM2.5 concentration differs from 35.

📝 Example 3.2: Caffeine in Starbucks Latte

A scientist purchased 16-oz Starbucks Lattes on 10 consecutive days and measured the caffeine content (mg). Does the median caffeine content exceed 40 mg?

Day 1 2 3 4 5 6 7 8 9 10
Caffeine (mg) 41 33 43 52 46 37 44 49 53 30
caffeine <- c(41, 33, 43, 52, 46, 37, 44, 49, 53, 30)
theta0   <- 40

boxplot(caffeine, ylab = "Caffeine (mg)", main = "Caffeine in Starbucks Latte",
        col = "lightyellow")
abline(h = theta0, col = "red", lty = 2)
legend("topright", legend = paste("θ₀ =", theta0), col = "red", lty = 2, bty = "n")

Tpos <- sum(caffeine > theta0)
Tneg <- sum(caffeine < theta0)
cat("T+ =", Tpos, "  T- =", Tneg, "  Ties =", sum(caffeine == theta0), "\n")
## T+ = 7   T- = 3   Ties = 0
binom.test(x = Tpos, n = Tpos + Tneg, alternative = "greater")
## 
##  Exact binomial test
## 
## data:  Tpos and Tpos + Tneg
## number of successes = 7, number of trials = 10, p-value = 0.1719
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.3933758 1.0000000
## sample estimates:
## probability of success 
##                    0.7

Sign table:

Day 1 2 3 4 5 6 7 8 9 10
Caffeine (mg) 41 33 43 52 46 37 44 49 53 30
Sign + + + + + + +

Hypothesis test:

  1. \(H_0: \theta \leq 40\) vs. \(H_1: \theta > 40\)
  2. \(T^+ \sim \text{Binomial}(n = 10,\; 0.5)\)
  3. From R: \(T^+ = 7\), p-value \(= 0.172\)
  4. Since p-value \(> 0.05\), we fail to reject \(H_0\). The data do not support the claim that median caffeine content exceeds 40 mg.

📝 Example 3.3: Snow Cones Sold Per Day

A convenience store owner claims the median number of snow cones sold per day is 40. A random sample of 20 days gives the following counts. Test at \(\alpha = 0.05\).

x <- c(42,43,40,25,22,45,49,32,37,36,39,44,59,40,48,56,50,44,49,52)

boxplot(x, ylab = "Snow Cones Sold", main = "Daily Snow Cone Sales",
        col = "lightcyan")
abline(h = 40, col = "red", lty = 2)

T.pos <- sum(x > 40);  T.neg <- sum(x < 40);  T.ze <- sum(x == 40)
cat("T+ =", T.pos, "  T- =", T.neg, "  Ties =", T.ze, "\n")
## T+ = 12   T- = 6   Ties = 2
binom.test(x = T.pos, n = T.pos + T.neg, alternative = "two.sided")
## 
##  Exact binomial test
## 
## data:  T.pos and T.pos + T.neg
## number of successes = 12, number of trials = 18, p-value = 0.2379
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4099252 0.8665726
## sample estimates:
## probability of success 
##              0.6666667

Sign table:

x 42 43 40 25 22 45 49 32 37 36
Sign + + 0 + +
x 39 44 59 40 48 56 50 44 49 52
Sign + + 0 + + + + + +

Hypothesis test:

  1. \(H_0: \theta = 40\) vs. \(H_1: \theta \neq 40\)
  2. \(T^+ \sim \text{Binomial}(n = 18,\; 0.5)\) (2 ties excluded)
  3. From R: \(T^+ = 12\), p-value \(= 0.238\)
  4. Since p-value \(> 0.05\), we fail to reject \(H_0\). There is no significant evidence that the median number of daily snow cone sales differs from 40.

3.2.2 Wilcoxon signed-rank test for one sample

This test improves on the sign test by incorporating the magnitude of differences, not just their direction. It assumes the distribution of \(X - \theta_0\) is symmetric.

Procedure:

  1. Compute \(d_i = x_i - \theta_0\) for each observation.
  2. Remove zeros; let \(n\) = remaining observations.
  3. Rank the absolute differences \(|d_i|\) from smallest (rank 1) to largest. For ties, assign the average rank.
  4. Attach the original sign of \(d_i\) to each rank.
  5. Test statistic: \(W^+ = \sum_{\{i: d_i > 0\}} R_i\) (sum of positive ranks).
  6. Compute p-value using wilcox.test() in R.

📝 Example 3.4: Employee Weekly Income

A temporary employment agency claims its employees earn a median weekly income greater than 500. A sample of 8 employee records gives: 505, 508, 499, 504, 510, 497, 500, 492. Can the claim be verified at \(\alpha = 0.05\)?

x <- c(505, 508, 499, 504, 510, 497, 500, 492)

boxplot(x, ylab = "Weekly Income", main = "Employee Weekly Income",
        col = "lightgreen")
abline(h = 500, col = "red", lty = 2)

wilcox.test(x = x, mu = 500, alternative = "greater")
## 
##  Wilcoxon signed rank exact test
## 
## data:  x
## V = 23.5, p-value = 0.2266
## alternative hypothesis: true location is greater than 500

Rank table (\(\theta_0 = 500\)):

Obs \(x\) \(d = x - 500\) \(|d|\) Rank of \(|d|\) Signed rank
1 505 +5 5 4 +4
2 508 +8 8 5.5 +5.5
3 499 −1 1 1 −1
4 504 +4 4 3 +3
5 510 +10 10 7 +7
6 497 −3 3 2 −2
7 500 0 (tie, excluded)
8 492 −8 8 5.5 −5.5

\(W^+ = 4 + 5.5 + 3 + 7 = 19.5\)

Hypothesis test:

  1. \(H_0: \theta \leq 500\) vs. \(H_1: \theta > 500\)
  2. Test statistic: \(W^+ = \sum R_i^+\)
  3. From R: \(W^+ = 19.5\) (NOTE that R reports 23.5 as it takes 0 (tie) into ranking) and p-value \(= 0.0469\)
  4. Since p-value \(< 0.05\), we reject \(H_0\). The data support the agency’s claim that the median income exceeds 500.

📝 Example 3.5: Checkout Waiting Times

A supermarket claims checkout waiting times are at most 4 minutes. An angry customer records 6 waiting times (min): 3.8, 5.3, 3.5, 4.5, 7.2, 5.1. Can the customer dispute the claim at \(\alpha = 0.05\)?

x <- c(3.8, 5.3, 3.5, 4.5, 7.2, 5.1)

boxplot(x, ylab = "Waiting Time (min)", main = "Checkout Waiting Times",
        col = "lightyellow")
abline(h = 4, col = "red", lty = 2)

wilcox.test(x = x, mu = 4, alternative = "greater")
## 
##  Wilcoxon signed rank exact test
## 
## data:  x
## V = 17.5, p-value = 0.09375
## alternative hypothesis: true location is greater than 4

Rank table (\(\theta_0 = 4\)):

Obs \(x\) \(d = x-4\) \(|d|\) Rank Signed rank
1 3.8 −0.2 0.2 1 −1
2 5.3 +1.3 1.3 5 +5
3 3.5 −0.5 0.5 2.5 −2.5
4 4.5 +0.5 0.5 2.5 +2.5
5 7.2 +3.2 3.2 6 +6
6 5.1 +1.1 1.1 4 +4

\(W^+ = 5 + 2.5 + 6 + 4 = 17.5\)

Hypothesis test:

  1. \(H_0: \theta \leq 4\) vs. \(H_1: \theta > 4\)
  2. Test statistic: \(W^+ = \sum R_i^+\)
  3. From R: \(W^+ = 17.5\) (same as R reports as it has mo tie), p-value \(= 0.0859\)
  4. Since p-value \(> 0.05\), we fail to reject \(H_0\) at the 5% level. There is insufficient evidence to dispute the store’s claim.

📝 Example 3.6: Law College Exam Scores

A sample of 10 law students’ marks: 71, 79, 40, 70, 82, 72, 60, 76, 69, 75. Test at \(\alpha = 0.05\) whether the median mark is less than or equal to 67.

x <- c(71, 79, 40, 70, 82, 72, 60, 76, 69, 75)

boxplot(x, ylab = "Mark", main = "Law College Exam Scores",
        col = "lightpink")
abline(h = 67, col = "red", lty = 2)

wilcox.test(x = x, mu = 67, alternative = "less")
## 
##  Wilcoxon signed rank exact test
## 
## data:  x
## V = 40, p-value = 0.9033
## alternative hypothesis: true location is less than 67

Hypothesis test:

  1. \(H_0: \theta \geq 67\) vs. \(H_1: \theta < 67\)
  2. Test statistic: \(W^+ = \sum R_i^+\); From R: \(W^+ = 40\), p-value \(= 0.9033\)
  3. Since p-value \(> 0.05\), we fail to reject \(H_0\). The data do not support the claim that the median mark is less than 67.

3.4 Mann–Whitney U Test for Two Independent Samples

The Mann–Whitney U test (also called the Wilcoxon rank-sum test) compares the distributions of two independent groups. It is the nonparametric counterpart of the independent-samples \(t\)-test.

When to use: Two independent groups; response is at least ordinal; normality is not assumed.

3.4.1 Procedure

  1. Pool all \(n + m\) observations from both groups and rank them from smallest (rank 1) to largest. Assign average ranks to ties.
  2. Let \(S = \sum R(X_i)\) = sum of ranks for Group 1 (size \(n\)).
  3. Compute test statistic: \(U = S - \dfrac{n(n+1)}{2}\)
  4. In R: wilcox.test(x, y, paired = FALSE)

Hypotheses:

\[ \begin{aligned} (a)\; &H_0: \theta_X = \theta_Y, \quad H_1: \theta_X \neq \theta_Y \quad \text{(two-sided)} \\ (b)\; &H_0: \theta_X \geq \theta_Y, \quad H_1: \theta_X < \theta_Y \quad \text{(one-sided)} \\ (c)\; &H_0: \theta_X \leq \theta_Y, \quad H_1: \theta_X > \theta_Y \quad \text{(one-sided)} \end{aligned} \]


📝 Example 3.12: Coffee Brand Ratings

Nine participants rated Brand X and nine rated Brand Y on likelihood of purchase. Can we conclude that Brand Y is preferred (rated higher) at \(\alpha = 0.05\)?

Brand X 8.0 8.7 7.7 7.2 8.5 6.0 8.5 6.0 13.5
Brand Y 13.0 10.7 7.7 11.2 13.7 12.0 11.5 14.2 13.7
brand_x <- c(8.0, 8.7, 7.7, 7.2, 8.5, 6.0, 8.5, 6.0, 13.5)
brand_y <- c(13.0, 10.7, 7.7, 11.2, 13.7, 12.0, 11.5, 14.2, 13.7)

boxplot(brand_x, brand_y, names = c("Brand X","Brand Y"),
        col = c("lightblue","lightgreen"),
        main = "Coffee Brand Ratings", ylab = "Rating")

wilcox.test(brand_x, brand_y, paired = FALSE, alternative = "less")
## 
##  Wilcoxon rank sum exact test
## 
## data:  brand_x and brand_y
## W = 10.5, p-value = 0.003085
## alternative hypothesis: true location shift is less than 0

Rank table (combined ranks, \(n=m=9\)):

Brand X 8.0 8.7 7.7 7.2 8.5 6.0 8.5 6.0 13.5
Rank 6 9 4.5 3 7.5 1.5 7.5 1.5 15

\(S = 6+9+4.5+3+7.5+1.5+7.5+1.5+15 = 55.5\), \(\quad U = 55.5 - \dfrac{9 \times 10}{2} = 10.5\)

Hypothesis test:

  1. \(H_0: \theta_X \geq \theta_Y\) vs. \(H_1: \theta_X < \theta_Y\) (Brand Y is preferred)
  2. From R: \(U = 10.5\), p-value \(= 0.0045\)
  3. Since p-value \(< 0.05\), we reject \(H_0\). Customers rate Brand Y significantly higher than Brand X.

📝 Example 3.13: Reaction Times Under Two Drugs

An experimental psychologist compared reaction times (seconds) of adult males under Drug A (\(n=6\)) vs. Drug B (\(n=4\)). Is there a significant difference at \(\alpha = 0.10\)?

Drug A (s) 52 53 73 68 56 74
Drug B (s) 94 64 74 51
drug_a <- c(52,53,73,68,56,74)
drug_b <- c(94,64,74,51)

boxplot(drug_a, drug_b, names = c("Drug A","Drug B"),
        col = c("skyblue","orange"),
        main = "Reaction Time by Drug", ylab = "Reaction Time (s)")

wilcox.test(x = drug_a, y = drug_b, paired = FALSE, alternative = "two.sided")
## 
##  Wilcoxon rank sum exact test
## 
## data:  drug_a and drug_b
## W = 9.5, p-value = 0.6476
## alternative hypothesis: true location shift is not equal to 0

Rank table:

Drug A 52 53 73 68 56 74 Drug B 94 64 74 51
Rank 2 3 7 6 4 8.5 Rank 10 5 8.5 1

\(S = 2+3+7+6+4+8.5 = 30.5\), \(\quad U = 30.5 - \dfrac{6 \times 7}{2} = 9.5\)

Hypothesis test:

  1. \(H_0: \theta_A = \theta_B\) vs. \(H_1: \theta_A \neq \theta_B\)
  2. From R: \(U = 9.5\), p-value \(= 0.6689\)
  3. Since p-value \(> 0.10\), we fail to reject \(H_0\). There is no significant difference in reaction times between the two drugs.

📝 Example 3.14: Prenatal Care Programme and APGAR Scores

A pilot trial assessed whether enhanced prenatal in-home visits improve newborn health (measured by APGAR scores at 5 min). 8 women received usual care; 7 received the new programme.

Usual Care 83 91 94 89 96 91 92 90
New Program 78 82 81 77 81 80 81
usual_care  <- c(83,91,94,89,96,91,92,90)
new_program <- c(78,82,81,77,81,80,81)

boxplot(usual_care, new_program,
        names = c("Usual Care","New Program"),
        col   = c("lightblue","lightgreen"),
        main  = "APGAR Scores by Care Programme",
        ylab  = "APGAR Score")

wilcox.test(usual_care, new_program, paired = FALSE, alternative = "two.sided")
## 
##  Wilcoxon rank sum exact test
## 
## data:  usual_care and new_program
## W = 56, p-value = 0.0003108
## alternative hypothesis: true location shift is not equal to 0

Rank table (\(n=8\), \(m=7\)):

Usual Care 83 91 94 89 96 91 92 90
Rank 8 11.5 14 9 15 11.5 13 10
New Program 78 82 81 77 81 80 81
Rank 2 7 5 1 5 3 5

\(S = 8+11.5+14+9+15+11.5+13+10 = 92\), \(\quad U = 92 - \dfrac{8\times9}{2} = 56\)

Hypothesis test:

  1. \(H_0: \theta_\text{usual} = \theta_\text{new}\) vs. \(H_1: \theta_\text{usual} \neq \theta_\text{new}\)
  2. From R: \(U = 56\), p-value \(= 0.00139\)
  3. Since p-value \(< 0.05\), we reject \(H_0\). There is a significant difference in APGAR scores between the two care programmes. The usual care group had higher scores, suggesting babies in the usual care group were healthier — a surprising result that warrants further investigation.

3.6 Kruskal–Wallis Test for \(k\) Independent Samples (\(k > 2\))

The Kruskal–Wallis test is the nonparametric equivalent of one-way ANOVA. It compares the location (median) of \(k \geq 3\) independent groups.

Test statistic:

\[H = \frac{12}{N(N+1)}\sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)\]

where \(n_i\) = sample size of group \(i\), \(R_i\) = rank sum for group \(i\), \(N = \sum n_i\).

Under \(H_0\), \(H \approx \chi^2_{k-1}\) for sufficiently large \(n_i\).

Hypotheses:

  • \(H_0\): The \(k\) population distributions are identical.
  • \(H_1\): At least two population distributions differ in location (median).

📝 Example 3.17: Potassium Content in Breakfast Drinks

Three brands of breakfast drink were tested for potassium content (milli-equivalents per quart). Is there a difference in potassium content at \(\alpha = 0.10\)?

Brand A Brand B Brand C
5.0 3.0 4.0
1.0 11.5 7.5
2.0 5.0 9.5
6.5 7.0 5.0
4.5 10.0 8.2
1.0 9.4 10.25
df17 <- data.frame(
  Potassium = c(5.0,1.0,2.0,6.5,4.5,1.0,
                3.0,11.5,5.0,7.0,10.0,9.4,
                4.0,7.5,9.5,5.0,8.2,10.25),
  Brand     = rep(c("A","B","C"), each = 6)
)
df17$Brand <- as.factor(df17$Brand)

boxplot(Potassium ~ Brand, data = df17,
        col  = c("tomato","skyblue","lightgreen"),
        main = "Potassium by Brand", ylab = "Potassium (mEq/qt)")

kruskal.test(Potassium ~ Brand, data = df17)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Potassium by Brand
## Kruskal-Wallis chi-squared = 6.4308, df = 2, p-value = 0.04014

Rank table (\(N = 18\), \(k = 3\)):

Brand A Rank Brand B Rank Brand C Rank
5.0 8 3.0 4 4.0 5
1.0 1.5 11.5 18 7.5 12
2.0 3 5.0 8 9.5 15
6.5 10 7.0 11 5.0 8
4.5 6 10.0 16 8.2 13
1.0 1.5 9.4 14 10.25 17
\(R_i\) 30 71 70

\[H = \frac{12}{18 \times 19}\left(\frac{30^2}{6}+\frac{71^2}{6}+\frac{70^2}{6}\right) - 3(19) = 6.39\]

Hypothesis test:

  1. \(H_0\): All three brands have equal potassium content.
    \(H_1\): At least two brands differ.
  2. From R: \(H = 6.43\), p-value \(= 0.0401\)
  3. Since p-value \(< 0.10\), we reject \(H_0\). Potassium content differs significantly among the three brands.

Post-hoc pairwise comparisons:

pairwise.wilcox.test(df17$Potassium, df17$Brand,
                     p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  df17$Potassium and df17$Brand 
## 
##   A     B    
## B 0.091 -    
## C 0.091 1.000
## 
## P value adjustment method: bonferroni

📝 Example 3.18: Caffeine in Four Types of Beverages

Caffeine content (mg) was measured in regular servings of four beverage types (unequal sample sizes). Is there a significant difference at \(\alpha = 0.05\)?

Type 1 Type 2 Type 3 Type 4
57 43 87 60
81 61 69 70
67 42 58 75
64 45 82 72
96 39 90 79
80 91 76
68
56
df18 <- data.frame(
  Caffeine = c(57,81,67,64,96,80,68,56,
               43,61,42,45,39,
               87,69,58,82,90,91,
               60,70,75,72,79,76),
  Type     = factor(c(rep(1,8), rep(2,5), rep(3,6), rep(4,6)))
)

boxplot(Caffeine ~ Type, data = df18,
        col  = c("lightblue","lightgreen","coral","gold"),
        main = "Caffeine Content by Beverage Type",
        xlab = "Type", ylab = "Caffeine (mg)")

kruskal.test(Caffeine ~ Type, data = df18)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Caffeine by Type
## Kruskal-Wallis chi-squared = 11.308, df = 3, p-value = 0.01017

Rank table (\(N=25\), \(k=4\), rank sums: \(R_1=108\), \(R_2=19\), \(R_3=110\), \(R_4=88\)):

\[H = \frac{12}{25 \times 26}\left(\frac{108^2}{8}+\frac{19^2}{5}+\frac{110^2}{6}+\frac{88^2}{6}\right) - 3(26) = 11.308\]

Hypothesis test:

  1. \(H_0\): Caffeine content is equal across all four beverage types.
    \(H_1\): At least two types differ.
  2. From R: \(H = 11.308\), p-value \(= 0.0102\)
  3. Since p-value \(< 0.05\), we reject \(H_0\). There is a significant difference in caffeine content between the four beverage types.

Post-hoc pairwise comparisons:

pairwise.wilcox.test(df18$Caffeine, df18$Type,
                     p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  df18$Caffeine and df18$Type 
## 
##   1     2     3    
## 2 0.037 -     -    
## 3 1.000 0.052 -    
## 4 1.000 0.052 1.000
## 
## P value adjustment method: bonferroni

3.7 Spearman Rank Correlation

Spearman’s rank correlation coefficient \(r_s\) measures the monotone association between two variables. It is the nonparametric counterpart of Pearson’s \(r\), and is appropriate when:

  • Data are ordinal (already ranks), or
  • The relationship may be monotone but not linear, or
  • Outliers or non-normality make Pearson’s \(r\) unreliable.

Formula:

\[r_s = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2-1)}\]

where \(D_i = R(X_i) - R(Y_i)\) is the difference in ranks for observation \(i\).

The range is \(-1 \leq r_s \leq 1\), with the same directional interpretation as Pearson’s \(r\).

Hypotheses:

\[ \begin{aligned} (a)\; &H_0: \rho_s = 0, \quad H_1: \rho_s \neq 0 \quad \text{(two-sided)} \\ (b)\; &H_0: \rho_s \leq 0, \quad H_1: \rho_s > 0 \quad \text{(positive association)} \\ (c)\; &H_0: \rho_s \geq 0, \quad H_1: \rho_s < 0 \quad \text{(negative association)} \end{aligned} \]

In R: cor.test(x, y, method = "spearman", alternative = ...)


📝 Example 3.19: Machinery Productivity and Operator Satisfaction

Ten types of machinery are ranked on productivity (1–10) and operator satisfaction (1–10). Test at \(\alpha = 0.05\) whether productivity and satisfaction are positively related.

Machine 1 2 3 4 5 6 7 8 9 10
Productivity rank 8 3 9 2 7 10 4 6 1 5
Satisfaction rank 9 5 10 1 8 7 3 4 2 6
productivity <- c(8,3,9,2,7,10,4,6,1,5)
satisfaction <- c(9,5,10,1,8,7,3,4,2,6)

plot(productivity, satisfaction,
     main = "Productivity Rank vs. Satisfaction Rank",
     xlab = "Productivity Rank", ylab = "Satisfaction Rank",
     pch = 19, col = "steelblue")
abline(lm(satisfaction ~ productivity), col = "red", lty = 2)

cor.test(productivity, satisfaction, method = "spearman", alternative = "greater")
## 
##  Spearman's rank correlation rho
## 
## data:  productivity and satisfaction
## S = 24, p-value = 0.001752
## alternative hypothesis: true rho is greater than 0
## sample estimates:
##       rho 
## 0.8545455

Calculation table:

Machine \(R(X)\) \(R(Y)\) \(D\) \(D^2\)
1 8 9 −1 1
2 3 5 −2 4
3 9 10 −1 1
4 2 1 +1 1
5 7 8 −1 1
6 10 7 +3 9
7 4 3 +1 1
8 6 4 +2 4
9 1 2 −1 1
10 5 6 −1 1
Total \(\sum D^2 = 24\)

\[r_s = 1 - \frac{6 \times 24}{10(100-1)} = 1 - 0.1455 = 0.8545\]

Hypothesis test:

  1. \(H_0: \rho_s \leq 0\) vs. \(H_1: \rho_s > 0\)
  2. From R: \(r_s = 0.8545\), p-value \(= 0.00351\)
  3. Since p-value \(< 0.05\), we reject \(H_0\). The productivity rank and satisfaction rank of machinery are significantly positively related.

📝 Example 3.20: Cigarette Smoking and Birth Weight

A study recorded the number of cigarettes smoked per day during pregnancy and the birth weight (lbs) of the newborn for 12 women. Is there a significant (two-sided) association?

Woman 1 2 3 4 5 6 7 8 9 10 11 12
Cigarettes/day 12 15 35 21 20 17 19 46 20 25 39 25
Birth weight (lbs) 7.7 8.1 6.9 8.2 8.6 8.3 9.4 7.8 8.3 5.2 6.4 7.9
cigarettes <- c(12,15,35,21,20,17,19,46,20,25,39,25)
weight     <- c(7.7,8.1,6.9,8.2,8.6,8.3,9.4,7.8,8.3,5.2,6.4,7.9)

plot(cigarettes, weight,
     main = "Cigarettes per Day vs. Birth Weight",
     xlab = "Cigarettes per Day", ylab = "Birth Weight (lbs)",
     pch = 19, col = "tomato")

cor.test(cigarettes, weight, method = "spearman", alternative = "two.sided")
## 
##  Spearman's rank correlation rho
## 
## data:  cigarettes and weight
## S = 431.26, p-value = 0.09183
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5079094

Hypothesis test:

  1. \(H_0: \rho_s = 0\) vs. \(H_1: \rho_s \neq 0\)
  2. From R: \(r_s = -0.50\), p-value \(= 0.0918\)
  3. Since p-value \(> 0.05\), we fail to reject \(H_0\). There is no significant linear or monotone association between cigarettes smoked per day and birth weight at the 5% level.

📝 Example 3.21: Tree Diameter and Height

A biologist measures the diameter (inches, at 4.5 ft height) and height (feet) of 10 tall trees. Is there a significant association?

Diameter (in.) 1024 950 451 505 761 644 707 586 442 546
Height (ft) 261 321 219 281 159 83 191 141 232 108
diameter <- c(1024,950,451,505,761,644,707,586,442,546)
height   <- c(261,321,219,281,159,83,191,141,232,108)

plot(diameter, height,
     main = "Tree Diameter vs. Height",
     xlab = "Diameter (in.)", ylab = "Height (ft)",
     pch = 19, col = "darkgreen")

cor.test(diameter, height, method = "spearman", alternative = "two.sided")
## 
##  Spearman's rank correlation rho
## 
## data:  diameter and height
## S = 146, p-value = 0.7588
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1151515

Calculation table:

Diameter 1024 950 451 505 761 644 707 586 442 546
\(R(X)\) 10 9 2 3 8 6 7 5 1 4
\(R(Y)\) 8 10 6 9 4 1 5 3 7 2
\(D\) 2 −1 −4 −6 4 5 2 2 −6 2
\(D^2\) 4 1 16 36 16 25 4 4 36 4

\(\sum D^2 = 146\), \(\quad r_s = 1 - \dfrac{6 \times 146}{10(99)} = 1 - 0.885 = 0.115\)

Hypothesis test:

  1. \(H_0: \rho_s = 0\) vs. \(H_1: \rho_s \neq 0\)
  2. From R: \(r_s = 0.115\), p-value \(= 0.759\)
  3. Since p-value \(> 0.05\), we fail to reject \(H_0\). There is no significant association between tree diameter and height in this sample.

3.8 Chi-Squared Test of Independence

The chi-squared test of independence tests whether two categorical variables are associated. It is used when both variables are nominal or ordinal and the data are presented in a contingency table.

Hypotheses:

  • \(H_0\): Variable A and Variable B are independent (not associated).
  • \(H_1\): Variable A and Variable B are not independent (associated).

Procedure:

  1. Construct the observed frequency table (\(r \times c\) contingency table).
  2. Compute expected frequencies: \(E_{ij} = \dfrac{(\text{Row } i \text{ total}) \times (\text{Col } j \text{ total})}{N}\)
  3. Test statistic: \[\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} \quad \text{under } H_0\]
  4. Reject \(H_0\) if \(\chi^2 > \chi^2_{1-\alpha, (r-1)(c-1)}\) or if p-value \(< \alpha\).

❗ Conditions for validity

All expected cell counts should be \(\geq 5\). If any are \(< 5\):

  • Combine adjacent categories if meaningful, or
  • Use Fisher’s exact test (fisher.test()) instead — valid for any cell counts.

In R: chisq.test(table(var_A, var_B)) or chisq.test(contingency_matrix)


📝 Example 3.22: Smoking Status and Exercise Habit

A researcher surveyed 200 adults on their smoking status (Smoker / Non-smoker) and exercise habit (Regular / Occasional / None). Is smoking status associated with exercise habit at \(\alpha = 0.05\)?

Regular Occasional None Total
Smoker 25 30 45 100
Non-smoker 55 35 10 100
Total 80 65 55 200
obs <- matrix(c(25,30,45, 55,35,10),
              nrow = 2, byrow = TRUE,
              dimnames = list(Status   = c("Smoker","Non-smoker"),
                              Exercise = c("Regular","Occasional","None")))
obs
##             Exercise
## Status       Regular Occasional None
##   Smoker          25         30   45
##   Non-smoker      55         35   10
# Check expected counts before testing
chisq.test(obs)$expected
##             Exercise
## Status       Regular Occasional None
##   Smoker          40       32.5 27.5
##   Non-smoker      40       32.5 27.5
# Chi-squared test
chisq.test(obs)
## 
##  Pearson's Chi-squared test
## 
## data:  obs
## X-squared = 33.907, df = 2, p-value = 4.336e-08

Hypothesis test:

  1. \(H_0\): Smoking status and exercise habit are independent.
    \(H_1\): They are associated.
  2. Test statistic: \(\chi^2 = \sum \dfrac{(O-E)^2}{E}\) with \((2-1)(3-1) = 2\) df
  3. From R: \(\chi^2\), p-value — see output above.
  4. If p-value \(< 0.05\), reject \(H_0\) and conclude that smoking status and exercise habit are significantly associated.

📝 Example 3.23: Gender and Method of Payment

A retail store recorded the gender and preferred method of payment (Cash, Card, Digital) for 120 customers. Is there a significant association at \(\alpha = 0.05\)?

Cash Card Digital Total
Female 20 35 15 70
Male 25 15 10 50
Total 45 50 25 120
obs2 <- matrix(c(20,35,15, 25,15,10),
               nrow = 2, byrow = TRUE,
               dimnames = list(Gender  = c("Female","Male"),
                               Payment = c("Cash","Card","Digital")))
obs2
##         Payment
## Gender   Cash Card Digital
##   Female   20   35      15
##   Male     25   15      10
# Expected counts
expected2 <- chisq.test(obs2)$expected
expected2
##         Payment
## Gender    Cash     Card  Digital
##   Female 26.25 29.16667 14.58333
##   Male   18.75 20.83333 10.41667
# Chi-squared test
chisq.test(obs2)
## 
##  Pearson's Chi-squared test
## 
## data:  obs2
## X-squared = 6.4, df = 2, p-value = 0.04076
# If any expected count < 5, use Fisher's exact test:
# fisher.test(obs2)

Hypothesis test:

  1. \(H_0\): Gender and method of payment are independent.
    \(H_1\): They are associated.
  2. Compute expected counts (printed above). If all \(\geq 5\), the chi-squared test is valid.
  3. From R: \(\chi^2\), df \(= (2-1)(3-1) = 2\), p-value — see output.
  4. State your conclusion based on whether p-value \(< 0.05\).

4 Mock Midterm Exam

❗Instructions

  • Show your working clearly. State hypotheses, test statistics, p-values, and conclusions in context.
  • Use \(\alpha = 0.05\) unless a question specifies otherwise.
  • R output is provided where indicated. You are expected to interpret and use it correctly.
  • Write all conclusions in the context of the research problem.

4.1 Question 1 — Correlation Analysis (16 marks)

A company analyst wants to investigate the relationship between the number of calls an agent makes in a week and the number of sales made. The analyst reasons that sales should depend to some degree on the number of calls. Data for a random sample of 20 agents are given below.

Agent Calls (\(x\)) Sales (\(y\)) Agent Calls (\(x\)) Sales (\(y\))
1 16 5 11 29 12
2 13 7 12 25 7
3 21 8 13 38 18
4 37 20 14 33 14
5 17 9 15 12 2
6 20 12 16 30 15
7 17 4 17 30 10
8 28 16 18 10 4
9 28 11 19 18 7
10 6 2 20 21 10
calls <- c(16,13,21,37,17,20,17,28,28, 6,
           29,25,38,33,12,30,30,10,18,21)
sales <- c( 5, 7, 8,20, 9,12, 4,16,11, 2,
           12, 7,18,14, 2,15,10, 4, 7,10)
dt1 <- data.frame(calls = calls, sales = sales)

The R output below is used to answer all parts of Question 1.

# Scatterplot
plot(calls, sales,
     xlab = "Number of Calls", ylab = "Number of Sales",
     main = "Calls vs. Sales",
     pch = 19, col = "steelblue")

# Pearson correlation test (95% CI by default)
cor.test(calls, sales, alternative = "two.sided", conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  calls and sales
## t = 8.7336, df = 18, p-value = 6.873e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7591250 0.9599183
## sample estimates:
##       cor 
## 0.8994835

(1.1) From the scatterplot, describe the relationship between the two variables. (3 marks)

Answer: The scatterplot shows a strong, positive, linear relationship between the number of calls and the number of sales. As the number of calls an agent makes increases, the number of sales also tends to increase. The points cluster fairly tightly around an upward-sloping straight line, and there are no obvious outliers or influential observations.


(1.2) What is the sample correlation coefficient between the number of sales and the number of calls? Interpret its value. (3 marks)

Answer: From the R output, the sample correlation coefficient is \(r = 0.899\).

Interpretation: This value indicates a very strong positive linear relationship between the number of calls and the number of sales. Agents who make more calls tend to achieve substantially more sales. The value is close to 1, confirming that the linear association is strong.


(1.3) At significance level 0.05, test whether the number of sales and the number of calls are correlated. (6 marks)

Answer:

  1. Hypotheses:

    • \(H_0: \rho = 0\) (no linear correlation between calls and sales)
    • \(H_1: \rho \neq 0\) (significant linear correlation exists)
  2. Test statistic and distribution:
    \(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} = t_{18}\) under \(H_0\)
    From R: \(t_0 = 8.7336\), p-value \(= 6.873 \times 10^{-8}\)

  3. Decision: Since p-value \(= 6.873 \times 10^{-8} < \alpha = 0.05\), we reject \(H_0\).

  4. Conclusion: There is very strong statistical evidence of a significant positive linear correlation between the number of calls made and the number of sales achieved (\(r = 0.899\), p-value \(< 0.001\)).


(1.4) Find the 95% confidence interval for the population correlation \(\rho\) and interpret it. (4 marks)

Answer: From the R output, the 95% confidence interval for \(\rho\) is (0.7591, 0.9599).

Interpretation: We are 95% confident that the true population correlation between calls and sales lies between 0.7591 and 0.9599. Since this interval is entirely above zero and contains only large positive values, it strongly supports the conclusion that calls and sales are positively and substantially correlated in the population.


4.2 Question 2 — Simple Linear Regression (20 marks)

A researcher investigates the relationship between daily cigarette smoking and illness-related work absences. For 15 employees, the number of cigarettes smoked daily (\(x\)) and the number of days absent in the past year (\(y\)) were recorded.

Subject Cigarettes/day (\(x\)) Days absent (\(y\)) Subject Cigarettes/day (\(x\)) Days absent (\(y\))
1 0 1 9 35 8
2 0 3 10 44 9
3 0 8 11 53 5
4 10 10 12 33 12
5 13 14 13 21 15
6 20 14 14 25 13
7 27 11 15 14 9
8 35 6

The simple linear regression model is used: \(y = \beta_0 + \beta_1 x + \epsilon\).

cigs   <- c( 0, 0, 0,10,13,20,27,35,35,44,53,33,21,25,14)
absent <- c( 1, 3, 8,10,14,14,11, 6, 8, 9, 5,12,15,13, 9)
dt2 <- data.frame(cigs = cigs, absent = absent)
# Scatterplot
plot(cigs, absent,
     xlab = "Cigarettes per Day", ylab = "Days Absent",
     main = "Cigarettes Smoked vs. Days Absent",
     pch = 19, col = "tomato")

# Correlation
cor.test(cigs, absent)
## 
##  Pearson's product-moment correlation
## 
## data:  cigs and absent
## t = 0.52436, df = 13, p-value = 0.6089
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3976627  0.6111258
## sample estimates:
##       cor 
## 0.1439172
# Fit regression
fit2 <- lm(absent ~ cigs, data = dt2)
summary(fit2)
## 
## Call:
## lm(formula = absent ~ cigs, data = dt2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3878 -2.6799  0.0953  3.0416  5.8369 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.38784    1.90166   4.411 0.000704 ***
## cigs         0.03692    0.07040   0.524 0.608854    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.273 on 13 degrees of freedom
## Multiple R-squared:  0.02071,    Adjusted R-squared:  -0.05462 
## F-statistic: 0.275 on 1 and 13 DF,  p-value: 0.6089
anova(fit2)
Df Sum Sq Mean Sq F value Pr(>F)
cigs 1 5.02063 5.02063 0.2749531 0.6088536
Residuals 13 237.37937 18.25995 NA NA
confint(fit2, level = 0.90)
##                     5 %       95 %
## (Intercept)  5.02012966 11.7555489
## cigs        -0.08776229  0.1615951
# Model diagnostics
par(mfrow = c(2, 2))
plot(fit2, pch = 16, col = "tomato")

par(mfrow = c(1, 1))

library(lmtest)
bptest(fit2)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 0.97208, df = 1, p-value = 0.3242
shapiro.test(resid(fit2))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit2)
## W = 0.95729, p-value = 0.6454
dwtest(fit2)
## 
##  Durbin-Watson test
## 
## data:  fit2
## DW = 0.83406, p-value = 0.002079
## alternative hypothesis: true autocorrelation is greater than 0

(2.1) Obtain the fitted least-squares regression line for predicting days absent from number of cigarettes smoked. (3 marks)

Answer: From the summary() output, the estimated coefficients are \(b_0 = 8.3878\) and \(b_1 = 0.0369\). The fitted regression line is:

\[\hat{y} = 8.3878 + 0.0369\, x\]

Interpretation:

  • \(b_0 = 8.388\): A non-smoker (\(x = 0\)) is predicted to be absent approximately 8.39 days per year.
  • \(b_1 = 0.037\): For each additional cigarette smoked per day, the predicted number of absent days increases by 0.037 days on average.

(2.2) At significance level 0.10, test whether the number of cigarettes smoked can be used to describe days absent. (5 marks)

Answer:

  1. Hypotheses:

    • \(H_0: \beta_1 = 0\) (no linear relationship between cigarettes and absences)
    • \(H_1: \beta_1 \neq 0\) (a significant linear relationship exists)
  2. Test statistic:
    \(t_0 = \dfrac{b_1}{se(b_1)} \sim t_{n-2} = t_{13}\) under \(H_0\)
    From R: \(t_0 = 0.5244\), p-value \(= 0.6089\)

  3. Decision: Since p-value \(= 0.609 > \alpha = 0.10\), we fail to reject \(H_0\).

  4. Conclusion: There is insufficient evidence at the 10% significance level to conclude that cigarette smoking is a significant linear predictor of work absences. The number of cigarettes smoked daily does not appear to linearly explain the number of days absent.


(2.3) Find the 90% confidence intervals for the regression parameters \(\beta_0\) and \(\beta_1\). (3 marks)

Answer: From confint(fit, level = 0.90):

Parameter Lower bound Upper bound Interpretation
\(\beta_0\) (intercept) 5.0201 11.7555 90% CI for the predicted days absent for a non-smoker
\(\beta_1\) (slope) −0.0878 0.1616 90% CI for the change in absences per additional cigarette

Note that the 90% CI for \(\beta_1\) contains zero, which is consistent with the conclusion in Q2.2 that \(\beta_1\) is not significantly different from zero.


(2.4) What percentage of the variation in days absent can be accounted for by the fitted regression line? (2 marks)

Answer: From summary(), the Multiple R-squared \(= 0.0207\).

Interpretation: Only 2.07% of the total variation in days absent is explained by the linear regression on number of cigarettes smoked. This is a very low value, indicating that the number of cigarettes smoked is a poor predictor of work absences when used alone.


(2.5) At significance level 0.10, is the constant variance assumption violated? (2 marks)

Answer: Test: Breusch-Pagan test
Hypotheses: \(H_0\): constant variance (homoscedasticity) vs. \(H_1\): non-constant variance (heteroscedasticity)
From R: p-value \(= 0.3242\)
Decision: Since p-value \(= 0.324 > \alpha = 0.10\), we fail to reject \(H_0\).
Conclusion: The constant variance assumption is not violated. The errors appear to have constant variance.


(2.6) At significance level 0.10, is the normality assumption violated? (2 marks)

Answer: Test: Shapiro-Wilk test
Hypotheses: \(H_0\): residuals are normally distributed vs. \(H_1\): residuals are not normal
From R: p-value \(= 0.6454\)
Decision: Since p-value \(= 0.645 > \alpha = 0.10\), we fail to reject \(H_0\).
Conclusion: The normality assumption is not violated. The residuals are consistent with a normal distribution.


(2.7) At significance level 0.10, is the independence assumption violated? (3 marks)

Answer: Test: Durbin-Watson test
Hypotheses: \(H_0\): no autocorrelation in residuals vs. \(H_1\): positive autocorrelation present
From R: DW statistic \(= 0.8341\), p-value \(= 0.0021\)
Decision: Since p-value \(= 0.0021 < \alpha = 0.10\), we reject \(H_0\).
Conclusion: The independence assumption is violated. There is significant positive autocorrelation in the residuals.

⚠️ What to do next

A DW statistic well below 2 indicates that consecutive residuals tend to have the same sign (positive autocorrelation). Possible remedies include:

  • Including a time-trend variable or time-ordered predictor in the model
  • Using a time-series regression approach (e.g., AR(1) error model)
  • Checking whether the data were collected in a natural time order that is driving the pattern

In this case, the overall model fit is also poor (\(R^2 = 2.1\%\)), and the slope is not significant. Other predictors of absenteeism (age, health status, working conditions) may be needed.


4.3 Question 3 — Multiple Regression with Dummy Variable and Stepwise Selection (18 marks)

Researchers study the relationship between Gender, Age, Number of working years, and Income. Let \(y\) = Income, \(x_1\) = Age, \(x_2\) = Number of working years, and \(x_3\) = Gender dummy (\(x_3 = 1\) if Female, \(0\) if Male). The full model is:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\]

dt3 <- data.frame(
  Income  = c(27000,35000,42000,31000,26000,45000,38000,
              29000,50000,33000,41000,28000,36000,52000,30000),
  Age     = c(25,35,42,29,24,48,38,27,52,31,40,26,34,50,28),
  Working = c( 2,10,15, 5, 1,20,12, 3,25, 7,14, 2, 9,22, 4),
  Gender  = c("male","male","female","male","female","female","male",
              "female","female","male","male","female","male","female","male")
)
y  <- dt3$Income
x1 <- dt3$Age
x2 <- dt3$Working
x3 <- ifelse(dt3$Gender == "female", 1, 0)

# After stepwise selection
full <- lm(y ~ x1 + x2+ x3)
f1   <- lm(y ~ x1 + x3)
anova(f1,full)
Res.Df RSS Df Sum of Sq F Pr(>F)
12 17585095 NA NA NA NA
11 17569551 1 15544.03 0.0097319 0.9231908
summary(f1)
## 
## Call:
## lm(formula = y ~ x1 + x3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2274.96  -495.53    73.46   323.68  2966.09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5292.11    1260.21   4.199  0.00123 ** 
## x1            879.47      36.47  24.114 1.55e-11 ***
## x3           -231.88     662.78  -0.350  0.73251    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1211 on 12 degrees of freedom
## Multiple R-squared:  0.9817, Adjusted R-squared:  0.9787 
## F-statistic: 322.4 on 2 and 12 DF,  p-value: 3.722e-11
anova(f1)
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 944635535.8 944635535.8 644.615605 0.0000000
x3 1 179369.5 179369.5 0.122401 0.7325084
Residuals 12 17585094.7 1465424.6 NA NA
confint(f1)
##                  2.5 %   97.5 %
## (Intercept)  2546.3481 8037.871
## x1            800.0101  958.937
## x3          -1675.9500 1212.192

(3.1) Conduct the overall fit test of the regression at \(\alpha = 0.05\). (5 marks)

Answer:

  1. Hypotheses:

    • \(H_0: \beta_1 = \beta_3 = 0\) (neither Age nor Gender predicts Income)
    • \(H_1\): at least one of \(\beta_1, \beta_3 \neq 0\)
  2. Test statistic:
    \(F = 322.4\) on \((2, 12)\) df
    p-value \(= 3.722 \times 10^{-11}\)

  3. Decision: Since p-value \(< \alpha = 0.05\), we reject \(H_0\).

  4. Conclusion: The regression model is statistically significant. Age and Gender together are significant predictors of Income at the 5% significance level.


(3.2) Write down the fitted equations for predicting Income of female and male staff separately. (4 marks)

Answer: The stepwise procedure removed \(x_2\) (Number of working years). The final fitted equation is:

\[\hat{y} = 5292.11 + 879.47\, x_1 -231.88\, x_3\]

Substituting the dummy variable values:

Male staff (\(x_3 = 0\)): \[\hat{y}_\text{Male} = 5292.11 + 879.47\, x_1\]

Female staff (\(x_3 = 1\)): \[\hat{y}_\text{Female} = (5292.11-231.88) + 879.47\, x_1 = 5060.23 + 879.47\, x_1\]

Interpretation of \(b_3 = -231.88\): Holding Age constant, female staff earn an estimated 231.88 less than male staff of the same age, on average.

Both lines share the same slope: each additional year of age is associated with an estimated 879.47 increase in income, for both genders.


(3.3) Find the coefficient of determination of the fitted model and interpret its value. (2 marks)

Answer: From summary(), \(R^2 = 0.9817\), i.e., 98.17%.

Interpretation: The model (Age and Gender) explains approximately 98.17% of the total variation in Income. The remaining is attributable to other factors not in the model (such as working years, which were dropped by stepwise selection, or other unmeasured variables).


(3.4) Find the 95% confidence interval for the parameter that represents the income difference between male and female staff. (2 marks)

Answer: The income difference between genders is captured by \(\beta_3\). From confint(fit):

\[(-1675.95,\; 1212.192)\]

Interpretation: We are 95% confident that the true income difference between female and male staff of the same age lies between -1675.95 and 1212.192. Since the interval contains zero, the difference in income between genders is not statistically significant at the 5% level, despite the negative point estimate.


(3.5) Predict the Income of a male staff member aged 28 who has been working for 5 years. (2 marks)

Answer: Since Number of Working Years (\(x_2\)) was removed during stepwise selection, it is not used in the prediction. We use \(x_1 = 28\) (Age) and \(x_3 = 0\) (Male):

\[\hat{y}_\text{Male} = 5292.11 + 879.47(28)=29917.27\]

The predicted monthly income of this male staff member is approximately 29,917 Baht.


4.4 Question 4 — Multiple Regression with Two Dummy Variables (12 marks)

The Faculty of Science, CMU, is piloting a new online English learning programme called SpeedyE. Twenty students were sampled — 10 from Faculty 1 and 10 from Faculty 2, with 5 male and 5 female from each. Students completed a pretest, used the programme for one week, then completed a posttest.

Faculty Gender Pretest Posttest Faculty Gender Pretest Posttest
1 Male 58 71 2 Male 51 69
1 Male 57 69 2 Male 62 69
1 Male 63 71 2 Male 58 66
1 Male 66 70 2 Male 52 61
1 Male 45 65 2 Male 59 63
1 Female 64 71 2 Female 39 65
1 Female 39 71 2 Female 32 66
1 Female 69 71 2 Female 62 70
1 Female 56 76 2 Female 64 68
1 Female 67 71 2 Female 66 68

The model is: \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\]

where \(x_1\) = Pretest score, \(x_2\) = Gender dummy (\(1\) = Male, \(0\) = Female), \(x_3\) = Faculty dummy (\(1\) = Faculty 1, \(0\) = Faculty 2), and \(y\) = Posttest score.

dt4 <- data.frame(
  Faculty  = c(1,1,1,1,1, 2,2,2,2,2, 1,1,1,1,1, 2,2,2,2,2),
  Gender   = c(rep("Male",10), rep("Female",10)),
  Pretest  = c(58,57,63,66,45, 51,62,58,52,59, 64,39,69,56,67, 39,32,62,64,66),
  Posttest = c(71,69,71,70,65, 69,69,66,61,63, 71,71,71,76,71, 65,66,70,68,68)
)

y  <- dt4$Posttest
x1 <- dt4$Pretest
x2 <- ifelse(dt4$Gender   == "Male", 1, 0)
x3 <- ifelse(dt4$Faculty  == 1,      1, 0)

fit4 <- lm(y ~ x1 + x2 + x3)
summary(fit4)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0690 -1.3876 -0.1686  1.3039  4.4061 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 62.84557    3.11985  20.144 8.57e-13 ***
## x1           0.08922    0.05517   1.617  0.12541    
## x2          -2.41598    1.10151  -2.193  0.04340 *  
## x3           3.75205    1.12003   3.350  0.00407 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.458 on 16 degrees of freedom
## Multiple R-squared:  0.5665, Adjusted R-squared:  0.4852 
## F-statistic: 6.969 on 3 and 16 DF,  p-value: 0.003264
anova(fit4)
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 28.31784 28.317844 4.687714 0.0458376
x2 1 30.18700 30.186998 4.997132 0.0399947
x3 1 67.79133 67.791325 11.222123 0.0040678
Residuals 16 96.65383 6.040865 NA NA

(4.1) Write down the fitted linear equation to predict Posttest score. (3 marks)

Answer: From the summary() output, the estimated coefficients are:

Parameter Estimate Variable
\(b_0\) 62.8456 Intercept
\(b_1\) 0.0892 Pretest score
\(b_2\) −2.4160 Gender (Male = 1)
\(b_3\) 3.7520 Faculty (Faculty 1 = 1)

The overall fitted equation is: \[\hat{y} = 62.8456 + 0.0892\, x_1 - 2.4160\, x_2 + 3.7520\, x_3\]

Coefficient interpretations:

  • \(b_1 = 0.089\): For each additional point on the pretest, the predicted posttest score increases by 0.089 points, holding gender and faculty constant.
  • \(b_2 = -2.416\): Male students are predicted to score 2.416 points lower than female students with the same pretest score and faculty.
  • \(b_3 = 3.752\): Faculty 1 students are predicted to score 3.752 points higher than Faculty 2 students with the same pretest score and gender.

(4.2) Write down the fitted equations for each combination of faculty and gender. (6 marks)

Answer: Substituting the dummy variable values gives four parallel equations (all with slope \(b_1 = 0.0892\)):

Group \(x_2\) \(x_3\) Intercept Fitted equation
Faculty 1, Female 0 1 \(62.8456 + 3.7520 = 66.5976\) \(\hat{y} = 66.5976 + 0.0892\, x_1\)
Faculty 1, Male 1 1 \(62.8456 - 2.4160 + 3.7520 = 64.1816\) \(\hat{y} = 64.1816 + 0.0892\, x_1\)
Faculty 2, Female 0 0 \(62.8456\) \(\hat{y} = 62.8456 + 0.0892\, x_1\)
Faculty 2, Male 1 0 \(62.8456 - 2.4160 = 60.4296\) \(\hat{y} = 60.4296 + 0.0892\, x_1\)

All four lines have the same slope (0.0892) — the dummy variables shift the intercept only. Faculty 1 Female students have the highest predicted posttest for any given pretest score; Faculty 2 Male students have the lowest.


(4.3) Compare the predicted posttest scores of male and female students from the same faculty. (3 marks)

Answer: The difference in predicted posttest scores between male and female students from the same faculty with the same pretest score is captured entirely by \(b_2 = -2.416\).

For any fixed pretest score \(x_1\) and faculty: \[\hat{y}_\text{Male} - \hat{y}_\text{Female} = -2.416\]

Conclusion: Male students are predicted to score 2.416 points lower on the posttest than female students from the same faculty with the same pretest score. This gap is the same regardless of faculty or pretest level, because the model assumes no interaction between gender and the other predictors.


4.5 Question 5 — Correlation and Simple Linear Regression (20 marks)

A market research analyst examines whether advertising budget is related to consumer retention of advertisements. For 21 companies, the advertising budget (\(x\), million Baht) and the percentage of customers who remember key details (“retained impressions”, \(y\), %) were recorded.

ad_budget <- c(90.1,85.1,108.6,71.6,92.9,45.1,98.9,36.9,
               90.5,69.8,84.5,42.5,100.4,81.6,57.4,84.6,
               51.5,95.1,53.6,89.8,74.9)
ret_impr  <- c(55.6,42.7,62.2,30.5,45.8,16.4,60.7,10.1,
               45.5,32.9,48.3,15.8,58.2,35.9,26.8,49.4,
               19.3,56.9,21.3,48.2,33.2)
dt5 <- data.frame(x = ad_budget, y = ret_impr)
plot(dt5$x, dt5$y,
     xlab = "Advertising Budget (million Baht)",
     ylab = "Retained Impressions (%)",
     main = "Advertising Budget vs. Retained Impressions",
     pch = 19, col = "darkorchid")

Correlation analysis:

cor.test(dt5$x, dt5$y, alternative = "two.sided", conf.level = 0.90)
## 
##  Pearson's product-moment correlation
## 
## data:  dt5$x and dt5$y
## t = 18.605, df = 19, p-value = 1.18e-13
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
##  0.9436239 0.9877721
## sample estimates:
##       cor 
## 0.9736366

(5.1) Find the sample correlation between advertising budget and retained impressions. Interpret the value. (3 marks)

Answer: From R, \(r = 0.9736\).

Interpretation: There is a moderate-to-strong positive linear relationship between advertising budget and retained impressions. Companies with higher advertising budgets tend to have a higher percentage of customers who remember their advertisement. The value is strong ( close to 1), suggesting that other factors besides budget also influence retention.


(5.2) Test whether advertising budget and retained impressions are correlated in the population at \(\alpha = 0.10\). (5 marks)

Answer:

  1. Hypotheses:

    • \(H_0: \rho = 0\) (advertising budget and retained impressions are uncorrelated)
    • \(H_1: \rho \neq 0\) (they are significantly correlated)
  2. Test statistic:
    \(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} = 18.605\) on \(n - 2 = 19\) df
    p-value \(= 1.18 \times 10^{-13}\)

  3. Decision: Since p-value $ < = 0.10$, we reject \(H_0\).

  4. Conclusion: There is significant evidence at the 10% level that advertising budget and retained impressions are linearly correlated in the population.


(5.3) Find the 90% confidence interval for the population correlation \(\rho\). (2 marks)

Answer: From R (conf.level = 0.90): \((0.9436,\; 0.9877)\)

Interpretation: We are 90% confident that the true population correlation between advertising budget and retained impressions lies between 0.9436 and 0.9877. The interval is entirely above zero, confirming a positive association.


Simple linear regression:

fit5 <- lm(y ~ x, data = dt5)
summary(fit5)
## 
## Call:
## lm(formula = y ~ x, data = dt5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8143 -2.6441 -0.3756  2.4648  6.4990 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.59850    3.19614  -5.819 1.32e-05 ***
## x             0.75138    0.04039  18.605 1.18e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.789 on 19 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9452 
## F-statistic: 346.2 on 1 and 19 DF,  p-value: 1.18e-13
anova(fit5)
Df Sum Sq Mean Sq F value Pr(>F)
x 1 4969.0893 4969.08931 346.161 0
Residuals 19 272.7421 14.35485 NA NA
confint(fit5, level = 0.90)
##                     5 %        95 %
## (Intercept) -24.1250439 -13.0719581
## x             0.6815506   0.8212132

(5.4) Write the fitted regression equation and explain the coefficient estimates. (4 marks)

Answer: From summary(): \(b_0 = -18.598\), \(b_1 = 0.7514\). The fitted equation is:

\[\hat{y} = -18.598 + 0.7514\, x\]

Coefficient interpretations:

  • \(b_1 = 0.7514\): For every additional 1 million Baht spent on advertising, the retained impressions percentage is predicted to increase by 0.7514 percentage points, on average.

(5.5) At significance level 0.10, test whether advertising budget is a significant predictor of retained impressions. (3 marks)

Answer:

  1. Hypotheses: \(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\)
  2. From R: \(t_0 = 18.605\), p-value \(= 1.18 \times 10^{-13}\)
  3. Decision: Since p-value \(< 0.10\), we reject \(H_0\).
  4. Conclusion: Advertising budget is a statistically significant linear predictor of retained impressions at the 10% level.

(5.6) Find the coefficient of determination and interpret its value. (2 marks)

Answer: From summary(), \(r^2 = 0.948\), i.e., 94.8%.

Interpretation: The linear regression on advertising budget explains 94.8% of the total variation in retained impressions. The remaining is due to other factors not included in this model (e.g., quality of the advertisement, target audience, timing) or inherent random variability.


(5.7) Find the 90% confidence intervals for \(\beta_0\) and \(\beta_1\). (2 marks)

Answer: From confint(fit, level = 0.90):

Parameter 90% CI Lower 90% CI Upper
\(\beta_0\) (intercept) -24.1250 -13.0719
\(\beta_1\) (slope) 0.6815 0.8212

Interpretation of \(\beta_1\) CI: We are 90% confident that each additional million Baht of advertising budget is associated with a true increase in retained impressions of between 0.195 and 0.531 percentage points.


(5.8) Predict the retained impressions percentage when advertising budget is 100 million Baht. (1 mark)

Answer:

\[\hat{y} = -18.598 + 0.7514(100) = \mathbf{56.542\%}\]

When the advertising budget is 100 million Baht, the model predicts that approximately 56.542% of customers will retain key details from the advertisement.


4.6 Question 6 — Multiple Linear Regression and Model Diagnostics (14 marks)

A data analyst investigates factors related to box-office gross of book-inspired movies. Ten movies were selected. Variables: \(y\) = box-office gross (million USD), \(x_1\) = production budget, \(x_2\) = advertising budget, \(x_3\) = pre-adaptation book sales (all in million USD).

Movie \(y\) \(x_1\) \(x_2\) \(x_3\) Movie \(y\) \(x_1\) \(x_2\) \(x_3\)
1 85.1 8.5 5.1 4.7 6 30.3 3.5 1.2 3.5
2 106.3 12.9 5.8 8.8 7 79.4 9.2 3.7 9.7
3 50.2 5.2 2.1 15.1 8 91.0 9.0 7.6 5.9
4 130.6 10.7 8.4 12.2 9 135.4 15.1 7.7 20.8
5 54.8 3.1 2.9 10.6 10 89.3 10.2 4.5 7.9

The full model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)

dt6 <- data.frame(
  y  = c( 85.1,106.3, 50.2,130.6, 54.8, 30.3, 79.4, 91.0,135.4, 89.3),
  x1 = c(  8.5, 12.9,  5.2, 10.7,  3.1,  3.5,  9.2,  9.0, 15.1, 10.2),
  x2 = c(  5.1,  5.8,  2.1,  8.4,  2.9,  1.2,  3.7,  7.6,  7.7,  4.5),
  x3 = c(  4.7,  8.8, 15.1, 12.2, 10.6,  3.5,  9.7,  5.9, 20.8,  7.9)
)
# Full model with all three predictors
full6 <- lm(y ~ x1 + x2 + x3, data = dt6)
summary(full6)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = dt6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4384  -3.1695   0.8499   3.5134   9.6207 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   7.6760     6.7602   1.135   0.2995   
## x1            3.6616     1.1178   3.276   0.0169 * 
## x2            7.6211     1.6573   4.598   0.0037 **
## x3            0.8285     0.5394   1.536   0.1754   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.541 on 6 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9502 
## F-statistic: 58.22 on 3 and 6 DF,  p-value: 7.913e-05
anova(full6)
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 8647.3942 8647.39415 152.064149 0.0000173
x2 1 1150.8996 1150.89959 20.238533 0.0041095
x3 1 134.1697 134.16974 2.359371 0.1754396
Residuals 6 341.2005 56.86675 NA NA

(6.1) At \(\alpha = 0.05\), test whether book sales (\(x_3\)) should be included in the model when production budget (\(x_1\)) and advertising budget (\(x_2\)) are already included. (3 marks)

Answer:

  1. Hypotheses:

    • \(H_0: \beta_3 = 0\) (book sales add no explanatory value given \(x_1\) and \(x_2\))
    • \(H_1: \beta_3 \neq 0\) (book sales should be included)
  2. From R: \(t_0 = 1.536\), p-value \(= 0.1754\)

  3. Decision: Since p-value \(= 0.175 > \alpha = 0.05\), we fail to reject \(H_0\).

  4. Conclusion: Book sales (\(x_3\)) do not significantly improve the model when production and advertising budgets are already present. \(x_3\) should be excluded from the model.


The reduced model using only \(x_1\) and \(x_2\):

fit6 <- lm(y ~ x1 + x2, data = dt6)
summary(fit6)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dt6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4168  -2.5696   0.8051   2.1200  11.0463 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   11.848      6.765   1.751  0.12334   
## x1             4.228      1.153   3.667  0.00800 **
## x2             7.436      1.806   4.117  0.00448 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.241 on 7 degrees of freedom
## Multiple R-squared:  0.9537, Adjusted R-squared:  0.9405 
## F-statistic: 72.14 on 2 and 7 DF,  p-value: 2.131e-05
anova(fit6)
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 8647.3942 8647.39415 127.33603 0.0000096
x2 1 1150.8996 1150.89959 16.94742 0.0044779
Residuals 7 475.3703 67.91004 NA NA
confint(fit6)
##                 2.5 %    97.5 %
## (Intercept) -4.148412 27.844835
## x1           1.501766  6.954722
## x2           3.164848 11.707370
# Model diagnostics
par(mfrow = c(2, 2))
plot(fit6, pch = 16, col = "steelblue")

par(mfrow = c(1, 1))

library(lmtest)
bptest(fit6)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit6
## BP = 7.6284, df = 2, p-value = 0.02206
shapiro.test(resid(fit6))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit6)
## W = 0.93429, p-value = 0.4914
dwtest(fit6)
## 
##  Durbin-Watson test
## 
## data:  fit6
## DW = 2.0124, p-value = 0.5113
## alternative hypothesis: true autocorrelation is greater than 0

(6.2) Test the overall fit of the reduced model (\(x_1\) and \(x_2\)) at \(\alpha = 0.05\). Provide the ANOVA table. (4 marks)

Answer:

Source df SS MS \(F\) p-value
Regression 2 9,788.7 4,894.35 70.62 \(<0.0001\)
Residual 7 485.0 69.29
Total 9 10,273.7
  1. Hypotheses: \(H_0: \beta_1 = \beta_2 = 0\) vs. \(H_1\): at least one \(\beta_j \neq 0\)
  2. From R: \(F = 72.14\) on \((2, 7)\) df, p-value \(< 0.0001\)
  3. Decision: Since p-value \(< 0.05\), we reject \(H_0\).
  4. Conclusion: The model with production budget and advertising budget is highly significant. Together, \(x_1\) and \(x_2\) explain a substantial proportion of the variation in box-office gross.

(6.3) Test the normality assumption of the errors at \(\alpha = 0.05\). (2 marks)

Answer: Test: Shapiro-Wilk test on residuals
Hypotheses: \(H_0\): residuals are normally distributed vs. \(H_1\): not normal
From R: W statistic, p-value \(= 0.4914\)
Decision: Since p-value \(= 0.491 > 0.05\), we fail to reject \(H_0\).
Conclusion: The normality assumption is satisfied. The residuals are consistent with a normal distribution.


(6.4) Test the independence assumption of the errors at \(\alpha = 0.05\). (2 marks)

Answer: Test: Durbin-Watson test
Hypotheses: \(H_0\): no autocorrelation vs. \(H_1\): autocorrelation present
From R: DW statistic \(= 2.0124\), p-value \(=0.5113 > 0.05\)
Decision: Since p-value \(> 0.05\), we fail to reject \(H_0\).
Conclusion: The independence assumption is satisfied. A DW value close to 2 indicates no significant autocorrelation in the residuals.


(6.5) Test the constant variance assumption of the errors at \(\alpha = 0.05\). (3 marks)

Answer: Test: Breusch-Pagan test
Hypotheses: \(H_0\): constant variance vs. \(H_1\): non-constant variance (heteroscedasticity)
From R: p-value \(= 0.0226\)
Decision: Since p-value \(= 0.022 < \alpha = 0.05\), we reject \(H_0\).
Conclusion: The constant variance assumption is violated. There is evidence of heteroscedasticity — the spread of the residuals changes with the fitted values.

⚠️ What to do next

Since heteroscedasticity is present, the standard errors and therefore the \(t\)-tests and confidence intervals from this model may be unreliable. Recommended next steps:

  1. Apply a variance-stabilising transformation to \(y\) — use the Box-Cox procedure to identify the optimal \(\lambda\):
library(MASS)
boxcox(fit6, lambda = seq(-3, 3, by = 0.05))

  1. If the Box-Cox plot suggests \(\lambda \approx 0\), re-fit with \(\ln(y)\) as the response and re-check diagnostics:
fit6_log <- lm(log(y) ~ x1 + x2, data = dt6)
par(mfrow = c(2, 2)); plot(fit6_log); par(mfrow = c(1, 1))
bptest(fit6_log)
  1. Alternatively, use heteroscedasticity-consistent (HC) standard errors (sandwich package in R).

5 Mock Final Exam

  • Answer all questions. All questions concern nonparametric statistical methods.
  • For every hypothesis test: state hypotheses, identify the appropriate test, show the calculation of the test statistic step by step, give the p-value, and state a conclusion in the context of the problem.
  • Use \(\alpha = 0.05\) unless a question specifies otherwise.
  • R output is provided to verify your calculations.

5.1 Question 7 — Wilcoxon Signed-Rank Test (20 marks)

High blood pressure can lead to serious health problems including heart attack, stroke, and kidney disease. Researchers selected 10 patients and recorded their blood pressure before and after taking medication.

Patient Before After Patient Before After
1 86 80 6 77 82
2 84 80 7 89 88
3 78 92 8 90 89
4 90 79 9 90 92
5 92 92 10 86 83
dt7 <- data.frame(
  Patient = 1:10,
  Before  = c(86, 84, 78, 90, 92, 77, 89, 90, 90, 86),
  After   = c(80, 80, 92, 79, 92, 82, 88, 89, 92, 83)
)
diff_bp <- dt7$Before - dt7$After
boxplot(diff_bp,
        ylab = "Before − After (mmHg)",
        main = "Blood Pressure Change After Medication",
        col  = "lightblue")
abline(h = 0, col = "red", lty = 2)
legend("topright", legend = "H₀: median diff = 0",
       col = "red", lty = 2, bty = "n")

wilcox.test(x = dt7$Before, y = dt7$After,
            paired = TRUE, alternative = "greater")
## 
##  Wilcoxon signed rank exact test
## 
## data:  dt7$Before and dt7$After
## V = 33, p-value = 0.2852
## alternative hypothesis: true location shift is greater than 0

At \(\alpha = 0.05\), test whether medication helps reduce blood pressure.


Answer Q7:

Test choice: The data are 10 matched pairs (same patients, before and after). The sample is small and normality of differences cannot be assumed. The Wilcoxon signed-rank test is appropriate for paired data when the distribution of differences is approximately symmetric.


Step 1 — Compute differences \(d_i = \text{Before}_i - \text{After}_i\)

Patient Before After \(d_i\) Action
1 86 80 +6 keep
2 84 80 +4 keep
3 78 92 −14 keep
4 90 79 +11 keep
5 92 92 0 exclude (tie)
6 77 82 −5 keep
7 89 88 +1 keep
8 90 89 +1 keep
9 90 92 −2 keep
10 86 83 +3 keep

Patient 5 is excluded (\(d = 0\)), leaving \(n = 9\) non-tied pairs.


Step 2 — Rank the absolute differences \(|d_i|\), assign average ranks to ties

Sorting \(|d_i|\) in ascending order:

\(|d|\) value Patients Raw ranks Average rank
1 7, 8 1, 2 1.5
2 9 3 3
3 10 4 4
4 2 5 5
5 6 6 6
6 1 7 7
11 4 8 8
14 3 9 9

Step 3 — Assign signed ranks and compute \(W^+\)

Patient \(d_i\) \(|d_i|\) Rank of \(|d_i|\) Signed rank
7 +1 1 1.5 +1.5
8 +1 1 1.5 +1.5
9 −2 2 3 −3
10 +3 3 4 +4
2 +4 4 5 +5
6 −5 5 6 −6
1 +6 6 7 +7
4 +11 11 8 +8
3 −14 14 9 −9

\[W^+ = \sum_{\{i:\, d_i > 0\}} \text{rank}_i = 1.5 + 1.5 + 4 + 5 + 7 + 8 = \mathbf{27}\]

\[W^- = \sum_{\{i:\, d_i < 0\}} \text{rank}_i = 3 + 6 + 9 = \mathbf{18}\]

Check: \(W^+ + W^- = 27 + 18 = 45 = \dfrac{9 \times 10}{2}\)


Step 4 — Hypothesis test

  1. Hypotheses:

    • \(H_0\): Medication does not reduce blood pressure; median\((d_i) \leq 0\)
    • \(H_1\): Medication reduces blood pressure; median\((d_i) > 0\)
  2. Test statistic: \(W^+ = 27\) (R reports \(V = 33\) as it takes 0 into ranking)

  3. p-value \(= 0.2852\) (from R, one-sided)

  4. Decision: Since p-value \(=0.2852 > \alpha = 0.05\), we fail to reject \(H_0\).

  5. Conclusion: There is insufficient evidence at the 5% significance level to conclude that the medication significantly reduces blood pressure. The observed differences between before and after readings are consistent with chance variation in this small sample of 10 patients.


5.2 Question 8 — Reading Habits Study (30 marks)

Education researchers studied reading habits among 15 college students from two faculties, recording year of study, hours reading per week, and reading skill score.

St. Fac. Yr Hours Score St. Fac. Yr Hours Score
1 A 2 7.5 70 9 B 3 6.0 73
2 A 2 5.0 65 10 B 3 6.5 82
3 A 2 2.0 68 11 A 4 5.0 90
4 B 2 3.0 80 12 A 4 4.0 95
5 B 2 4.5 81 13 B 4 3.5 80
6 A 3 4.0 77 14 B 4 3.0 70
7 A 3 4.5 90 15 A 4 8.0 95
8 A 3 7.0 75
dt8 <- data.frame(
  Student      = 1:15,
  Faculty      = factor(c("A","A","A","B","B","A","A","A","B","B",
                          "A","A","B","B","A")),
  YearofStudy  = factor(c(2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)),
  NumberHours  = c(7.5,5,2,3,4.5,4,4.5,7,6,6.5,5,4,3.5,3,8),
  ReadingScore = c(70,65,68,80,81,77,90,75,73,82,90,95,80,70,95)
)
par(mfrow = c(1, 3))
boxplot(ReadingScore ~ YearofStudy, data = dt8,
        main = "Score by Year", xlab = "Year of Study",
        ylab = "Reading Score",
        col  = c("lightblue","lightgreen","lightyellow"))
boxplot(NumberHours ~ Faculty, data = dt8,
        main = "Hours by Faculty", xlab = "Faculty",
        ylab = "Hours/week", col = c("lightblue","lightcoral"))
plot(dt8$NumberHours, dt8$ReadingScore,
     pch = 19, col = as.integer(dt8$Faculty) + 1,
     xlab = "Hours/week", ylab = "Reading Score",
     main = "Hours vs. Score")
legend("topleft", legend = c("A","B"), col = c(2,3), pch = 19, bty = "n")

par(mfrow = c(1, 1))

5.2.1 (8.1) Is there a significant difference in reading skills between years of study? (10 marks)

kruskal.test(ReadingScore ~ YearofStudy, data = dt8)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ReadingScore by YearofStudy
## Kruskal-Wallis chi-squared = 4.1245, df = 2, p-value = 0.1272

Answer Q(8.1):

Test choice: Three independent groups (Year 2, 3, 4), small samples (\(n_i = 5\) each), normality unverifiable. Use the Kruskal-Wallis test.


Step 1 — Rank all \(N = 15\) scores together (average ranks for ties)

Student Score Year Rank \(R_i\)
2 65 2 1.0
3 68 2 2.0
1 70 2 3.5
14 70 4 3.5
9 73 3 5.0
8 75 3 6.0
6 77 3 7.0
4 80 2 8.5
13 80 4 8.5
5 81 2 10.0
10 82 3 11.0
7 90 3 12.5
11 90 4 12.5
12 95 4 14.5
15 95 4 14.5

Step 2 — Compute rank sums \(R_j\) per group

Year Students Ranks \(R_j\) \(n_j\) \(R_j^2/n_j\)
2 1,2,3,4,5 3.5, 1.0, 2.0, 8.5, 10.0 25.0 5 125.00
3 6,7,8,9,10 7.0, 12.5, 6.0, 5.0, 11.0 41.5 5 344.45
4 11,12,13,14,15 12.5, 14.5, 8.5, 3.5, 14.5 53.5 5 572.45

Check: \(R_1 + R_2 + R_3 = 25.0 + 41.5 + 53.5 = 120 = \dfrac{N(N+1)}{2} = \dfrac{15 \times 16}{2}\)


Step 3 — Compute the Kruskal-Wallis statistic \(H\)

\[H = \frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j} - 3(N+1)\]

\[H = \frac{12}{15 \times 16}\left(125.00 + 344.45 + 572.45\right) - 3(16)\]

\[H = \frac{12}{240}\times 1041.90 - 48 = 0.05 \times 1041.90 - 48 = 52.095 - 48 = \mathbf{4.095}\]

(R reports \(H = 4.1245\) — the small difference reflects R’s correction for ties.)


Hypothesis test:

  1. \(H_0\): The distribution of reading scores is identical across all three years of study
    \(H_1\): At least one year has a different distribution of reading scores

  2. Test statistic: \(H = 4.1245\) on \(k-1 = 2\) df

  3. p-value \(= 0.1272\)

  4. Decision: Since p-value \(= 0.127 > 0.05\), we fail to reject \(H_0\).

  5. Conclusion: There is no significant difference in reading skill scores between 2nd, 3rd, and 4th year students at the 5% level.


5.2.2 (8.2) Is there a significant difference in reading hours between Faculty A and Faculty B? (10 marks)

wilcox.test(x = dt8$NumberHours[dt8$Faculty == "A"],
            y = dt8$NumberHours[dt8$Faculty == "B"],
            paired = FALSE, alternative = "two.sided")
## 
##  Wilcoxon rank sum exact test
## 
## data:  dt8$NumberHours[dt8$Faculty == "A"] and dt8$NumberHours[dt8$Faculty == "B"]
## W = 35.5, p-value = 0.3381
## alternative hypothesis: true location shift is not equal to 0

Answer Q(8.2):

Test choice: Two independent groups (Faculty A: \(n_A = 9\); Faculty B: \(n_B = 6\)), small samples. Use the Mann-Whitney U test.


Step 1 — Rank all \(N = 15\) hour values together

Value Faculty Rank
2.0 A 1.0
3.0 B 2.5
3.0 B 2.5
3.5 B 4.0
4.0 A 5.5
4.0 A 5.5
4.5 A 7.5
4.5 B 7.5
5.0 A 9.5
5.0 A 9.5
6.0 B 11.0
6.5 B 12.0
7.0 A 13.0
7.5 A 14.0
8.0 A 15.0

Step 2 — Compute rank sum \(S\) for Faculty A and test statistic \(U\)

\[S_A = 1.0 + 5.5 + 5.5 + 7.5 + 9.5 + 9.5 + 13.0 + 14.0 + 15.0 = \mathbf{80.5}\]

\[U_A = S_A - \frac{n_A(n_A+1)}{2} = 80.5 - \frac{9 \times 10}{2} = 80.5 - 45 = \mathbf{35.5}\]

Check: \(S_A + S_B = 80.5 + 39.5 = 120 = \dfrac{N(N+1)}{2}\)


Hypothesis test:

  1. \(H_0\): The median hours reading per week is the same for Faculty A and Faculty B (\(\theta_A = \theta_B\))
    \(H_1\): The medians differ (\(\theta_A \neq \theta_B\))

  2. Test statistic: \(W = U_A = 35.5\)

  3. p-value \(= 0.3440\)

  4. Decision: Since p-value \(= 0.344 > 0.05\), we fail to reject \(H_0\).

  5. Conclusion: There is no significant difference in weekly reading hours between Faculty A and Faculty B students.


5.2.3 (8.3) Is there a significant monotone relationship between reading hours and skill scores? (10 marks)

cor.test(dt8$NumberHours, dt8$ReadingScore,
         method = "spearman", alternative = "two.sided")
## 
##  Spearman's rank correlation rho
## 
## data:  dt8$NumberHours and dt8$ReadingScore
## S = 453.74, p-value = 0.4982
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1897482

Answer Q(8.3):

Test choice: Association between two continuous variables, small sample (\(n = 15\)), normality not assumed. Use Spearman rank correlation.


Step 1 — Rank hours \(R(X)\) and scores \(R(Y)\) separately, then compute \(D_i = R(X_i) - R(Y_i)\)

St. Hours Score \(R(X)\) \(R(Y)\) \(D\) \(D^2\)
1 7.5 70 14.0 3.5 +10.5 110.25
2 5.0 65 9.5 1.0 +8.5 72.25
3 2.0 68 1.0 2.0 −1.0 1.00
4 3.0 80 2.5 8.5 −6.0 36.00
5 4.5 81 7.5 10.0 −2.5 6.25
6 4.0 77 5.5 7.0 −1.5 2.25
7 4.5 90 7.5 12.5 −5.0 25.00
8 7.0 75 13.0 6.0 +7.0 49.00
9 6.0 73 11.0 5.0 +6.0 36.00
10 6.5 82 12.0 11.0 +1.0 1.00
11 5.0 90 9.5 12.5 −3.0 9.00
12 4.0 95 5.5 14.5 −9.0 81.00
13 3.5 80 4.0 8.5 −4.5 20.25
14 3.0 70 2.5 3.5 −1.0 1.00
15 8.0 95 15.0 14.5 +0.5 0.25
Total 450.50

Step 2 — Compute Spearman’s \(r_s\)

\[r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)} = 1 - \frac{6 \times 450.50}{15(225-1)} = 1 - \frac{2703}{3360} = 1 - 0.8045 = \mathbf{0.1955}\]

(R reports \(r_s = 0.1897\) using a more precise tie-correction.)


Hypothesis test:

  1. \(H_0\): No monotone association between reading hours and skill score (\(\rho_s = 0\))
    \(H_1\): Significant monotone association (\(\rho_s \neq 0\))

  2. Test statistic: \(r_s = 0.1897\)

  3. p-value \(= 0.4982\)

  4. Decision: Since p-value \(= 0.498 > 0.05\), we fail to reject \(H_0\).

  5. Conclusion: There is no significant monotone association between weekly reading hours and reading skill scores in this sample (\(r_s = 0.19\), weak and non-significant).

💡 Reflection

The absence of a significant result may reflect the small sample size (low power with \(n = 15\)), or that skill depends more on reading quality and prior ability than on time alone.


5.3 Question 9 — Effect of Jumping on Bone Density (25 marks)

Researchers examined the effect of jumping on bone density in growing rats. Six different rats were randomly assigned to each of three groups: Control, Low Jump (30 cm), High Jump (60 cm). Bone densities (mg/cm³) were recorded after 8 weeks.

Rat 1 2 3 4 5 6
Control 611 621 614 593 593 653
Low Jump 635 605 638 594 599 632
High Jump 650 622 626 626 631 622
dt9 <- data.frame(
  Densities = c(611,621,614,593,593,653,
                635,605,638,594,599,632,
                650,622,626,626,631,622),
  Group = factor(rep(c("Control","LowJump","HighJump"), each = 6),
                 levels = c("Control","LowJump","HighJump"))
)
boxplot(Densities ~ Group, data = dt9,
        col  = c("lightgray","lightblue","lightcoral"),
        main = "Bone Density by Treatment Group",
        xlab = "Treatment", ylab = "Bone Density (mg/cm³)")

❗ Choosing the correct test

Each of the 18 rats is a different animal measured exactly once. This is a completely randomised design with three independent groups — not a repeated-measures design.

  • Correct: Kruskal-Wallis test — for \(k \geq 3\) independent groups.
  • Incorrect: Friedman test — requires the same subjects observed under all conditions (no blocks exist here).
kruskal.test(Densities ~ Group, data = dt9)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Densities by Group
## Kruskal-Wallis chi-squared = 2.6398, df = 2, p-value = 0.2672

At \(\alpha = 0.05\), test whether jumping treatment affects bone density.


Answer Q9:

Test choice: Three independent groups (\(n_i = 6\) each, \(N = 18\)), small samples, normality unverifiable. Use the Kruskal-Wallis test.


Step 1 — Rank all \(N = 18\) bone densities together

Density Group Rank
593 Control 1.5
593 Control 1.5
594 LowJump 3.0
599 LowJump 4.0
605 LowJump 5.0
611 Control 6.0
614 Control 7.0
621 Control 8.0
622 HighJump 9.5
622 HighJump 9.5
626 HighJump 11.5
626 HighJump 11.5
631 HighJump 13.0
632 LowJump 14.0
635 LowJump 15.0
638 LowJump 16.0
650 HighJump 17.0
653 Control 18.0

Step 2 — Compute rank sums \(R_j\) per group

Group Individual ranks \(R_j\) \(n_j\) \(R_j^2 / n_j\)
Control 1.5, 1.5, 6.0, 7.0, 8.0, 18.0 42.0 6 294.00
LowJump 3.0, 4.0, 5.0, 14.0, 15.0, 16.0 57.0 6 541.50
HighJump 9.5, 9.5, 11.5, 11.5, 13.0, 17.0 72.0 6 864.00

Check: \(42.0 + 57.0 + 72.0 = 171 = \dfrac{18 \times 19}{2}\)


Step 3 — Compute the Kruskal-Wallis statistic \(H\)

\[H = \frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j} - 3(N+1)\]

\[H = \frac{12}{18 \times 19}\left(294.00 + 541.50 + 864.00\right) - 3(19)\]

\[H = \frac{12}{342} \times 1699.50 - 57 = 0.03509 \times 1699.50 - 57 = 59.64 - 57 = \mathbf{2.640}\]


Hypothesis test:

  1. \(H_0\): Bone density distributions are identical across Control, Low Jump, and High Jump groups
    \(H_1\): At least one group has a different bone density distribution

  2. Test statistic: \(H = 2.640\) on \(k-1 = 2\) df

  3. p-value \(= 0.267\)

  4. Decision: Since p-value \(= 0.267 > \alpha = 0.05\), we fail to reject \(H_0\).

  5. Conclusion: There is no significant evidence that jumping treatment affects bone density in rats at the 5% significance level. The differences in median bone density across the three groups are not large enough to be statistically significant with \(n = 6\) per group.

📝 Note on test choice

A Friedman test on these data (incorrectly treating rat index as a blocking factor) returns \(\chi^2 = 3.0\), p-value \(= 0.223\) — also non-significant, and produced by the wrong test for this design. Both tests agree there is no significant effect here, but the Kruskal-Wallis is the correct and valid choice.


5.4 Question 10 — Animal Shelter Study (25 marks)

Researchers compared cats and dogs from two shelters in Chiang Mai to examine factors affecting length of stay before adoption.

Type Sex Age (mo.) Weight (kg) Days in Shelter
Cat F 15 9.80 9
Cat F 2 10.50 28
Cat M 0 4.50 28
Cat F 4 10.80 22
Cat M 0 4.30 18
Cat M 3 9.00 30
Cat M 3 10.00 16
Dog F 10 3.30 6
Dog M 0 22.00 7
Dog M 3 54.00 19
Dog F 6 9.50 14
Dog M 1 16.50 11
Dog F 1 46.00 12
Dog F 3 48.25 20
Dog M 3 55.30 16
dt10 <- data.frame(
  Type    = factor(c(rep("Cat",7), rep("Dog",8))),
  Sex     = c("F","F","M","F","M","M","M","F","M","M","F","M","F","F","M"),
  Age     = c(15,2,0,4,0,3,3,10,0,3,6,1,1,3,3),
  Weight  = c(9.8,10.5,4.5,10.8,4.3,9.0,10.0,3.3,22.0,54.0,9.5,16.5,46.0,48.25,55.3),
  Days    = c(9,28,28,22,18,30,16,6,7,19,14,11,12,20,16)
)
par(mfrow = c(1, 2))
boxplot(Days ~ Type, data = dt10,
        col  = c("lightblue","lightyellow"),
        main = "Days in Shelter by Type",
        xlab = "Type", ylab = "Days in Shelter")
plot(dt10$Weight, dt10$Days,
     pch = 19, col = as.integer(dt10$Type) + 1,
     xlab = "Weight (kg)", ylab = "Days in Shelter",
     main = "Weight vs. Days in Shelter")
legend("topright", legend = levels(dt10$Type),
       col = c(2,3), pch = 19, bty = "n")

par(mfrow = c(1, 1))

5.4.1 (10.1) Is there evidence that dogs and cats spend different lengths of time in the shelter? (10 marks)

wilcox.test(x = dt10$Days[dt10$Type == "Dog"],
            y = dt10$Days[dt10$Type == "Cat"],
            paired = FALSE, alternative = "two.sided")
## 
##  Wilcoxon rank sum exact test
## 
## data:  dt10$Days[dt10$Type == "Dog"] and dt10$Days[dt10$Type == "Cat"]
## W = 10.5, p-value = 0.04258
## alternative hypothesis: true location shift is not equal to 0

Answer Q(10.1):

Test choice: Two independent groups (Dogs: \(n_D = 8\); Cats: \(n_C = 7\)), small samples, days-in-shelter is not normally distributed. Use the Mann-Whitney U test.


Step 1 — Rank all \(N = 15\) days values together

Days Type Rank
6 Dog 1.0
7 Dog 2.0
9 Cat 3.0
11 Dog 4.0
12 Dog 5.0
14 Dog 6.0
16 Dog 7.5
16 Cat 7.5
18 Cat 9.0
19 Dog 10.0
20 Dog 11.0
22 Cat 12.0
28 Cat 13.5
28 Cat 13.5
30 Cat 15.0

Step 2 — Compute rank sum \(S\) for Dogs and test statistic \(U\)

\[S_D = 1.0 + 2.0 + 4.0 + 5.0 + 6.0 + 7.5 + 10.0 + 11.0 = \mathbf{46.5}\]

\[U_D = S_D - \frac{n_D(n_D+1)}{2} = 46.5 - \frac{8 \times 9}{2} = 46.5 - 36 = \mathbf{10.5}\]

Verification — rank sum for Cats: \[S_C = 3.0 + 7.5 + 9.0 + 12.0 + 13.5 + 13.5 + 15.0 = 73.5\] \[U_C = S_C - \frac{n_C(n_C+1)}{2} = 73.5 - \frac{7 \times 8}{2} = 73.5 - 28 = 45.5\] \[U_D + U_C = 10.5 + 45.5 = 56 = n_D \times n_C = 8 \times 7 \checkmark\]


Hypothesis test:

  1. \(H_0\): The median days in shelter is the same for dogs and cats (\(\theta_D = \theta_C\))
    \(H_1\): The medians differ (\(\theta_D \neq \theta_C\))

  2. Test statistic: \(W = U_D = 10.5\)

  3. p-value \(= 0.0487\)

  4. Decision: Since p-value \(= 0.049 < \alpha = 0.05\), we reject \(H_0\).

  5. Conclusion: There is significant evidence that dogs and cats spend different lengths of time in the shelter (\(U = 10.5\), p \(= 0.049\)). Cats tend to stay longer (median 22 days) than dogs (median 13.5 days).


5.4.2 (10.2) Is there a significant monotone relationship between weight and days in the shelter? (8 marks)

cor.test(~ Days + Weight, data = dt10,
         method = "spearman", alternative = "two.sided")
## 
##  Spearman's rank correlation rho
## 
## data:  Days and Weight
## S = 573.02, p-value = 0.9344
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.02325585

Answer Q(10.2):

Test choice: Association between two continuous variables (\(n = 15\)), weight has extreme outliers (cats ≤ 11 kg, dogs up to 55 kg). Spearman rank correlation is robust to outliers.


Step 1 — Rank weight \(R(W)\) and days \(R(D)\) separately, compute \(D_i = R(W_i) - R(D_i)\)

Animal Weight Days \(R(W)\) \(R(D)\) \(D\) \(D^2\)
Cat 1 9.80 9 6.0 3.0 +3.0 9.00
Cat 2 10.50 28 8.0 13.5 −5.5 30.25
Cat 3 4.50 28 3.0 13.5 −10.5 110.25
Cat 4 10.80 22 9.0 12.0 −3.0 9.00
Cat 5 4.30 18 2.0 9.0 −7.0 49.00
Cat 6 9.00 30 4.0 15.0 −11.0 121.00
Cat 7 10.00 16 7.0 7.5 −0.5 0.25
Dog 1 3.30 6 1.0 1.0 0.0 0.00
Dog 2 22.00 7 11.0 2.0 +9.0 81.00
Dog 3 54.00 19 14.0 10.0 +4.0 16.00
Dog 4 9.50 14 5.0 6.0 −1.0 1.00
Dog 5 16.50 11 10.0 4.0 +6.0 36.00
Dog 6 46.00 12 12.0 5.0 +7.0 49.00
Dog 7 48.25 20 13.0 11.0 +2.0 4.00
Dog 8 55.30 16 15.0 7.5 +7.5 56.25
Total 572.00

Step 2 — Compute Spearman’s \(r_s\)

\[r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)} = 1 - \frac{6 \times 572}{15(225-1)} = 1 - \frac{3432}{3360} = 1 - 1.0214 = \mathbf{-0.021}\]

(R reports \(r_s = -0.023\) with tie correction.)


Hypothesis test:

  1. \(H_0\): No monotone association between weight and days in shelter (\(\rho_s = 0\))
    \(H_1\): Significant monotone association (\(\rho_s \neq 0\))

  2. Test statistic: \(r_s = -0.023\)

  3. p-value \(= 0.9344\)

  4. Decision: Since p-value \(= 0.934 > 0.05\), we fail to reject \(H_0\).

  5. Conclusion: There is no significant monotone association between an animal’s weight and the number of days it spends in the shelter (\(r_s \approx 0\)). Weight alone does not predict adoption waiting time.

💡 Additional insight

The near-zero correlation may partly reflect confounding by animal type. Cats are light (< 11 kg) and tend to stay longer; dogs are heavier but leave sooner. Within each type, weight-shelter relationships might differ.


5.4.3 (10.3) Is there evidence that the median days cats spend in the shelter exceeds 20 days? (7 marks)

cats    <- dt10[dt10$Type == "Cat", ]
theta0  <- 20
T.pos   <- sum(cats$Days > theta0)
T.neg   <- sum(cats$Days < theta0)
T.zero  <- sum(cats$Days == theta0)
n_eff   <- T.pos + T.neg
cat("Cat days:", cats$Days, "\n")
## Cat days: 9 28 28 22 18 30 16
cat("T+ =", T.pos, " T- =", T.neg, " Ties =", T.zero, " n =", n_eff, "\n")
## T+ = 4  T- = 3  Ties = 0  n = 7
binom.test(T.pos, n_eff, alternative = "greater")
## 
##  Exact binomial test
## 
## data:  T.pos and n_eff
## number of successes = 4, number of trials = 7, p-value = 0.5
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.2253216 1.0000000
## sample estimates:
## probability of success 
##              0.5714286

Answer Q(10.3):

Test choice: Single sample of cats (\(n = 7\)), testing a claim about the median. No symmetry assumption needed. Use the sign test.


Step 1 — Assign signs (\(\theta_0 = 20\) days; cats only)

Cat Days \(d_i = \text{Days} - 20\) Sign
1 9 −11
2 28 +8 +
3 28 +8 +
4 22 +2 +
5 18 −2
6 30 +10 +
7 16 −4

\(T^+ = 4\), \(T^- = 3\), no ties. Effective \(n = 7\).


Step 2 — Compute the test statistic

Under \(H_0: \theta \leq 20\), each observation is equally likely to be above or below 20, so:

\[T^+ \sim \text{Binomial}(n = 7,\; p = 0.5)\]

The p-value for a one-sided test (\(H_1: \theta > 20\)) is the probability of observing \(T^+ \geq 4\):

\[p\text{-value} = P(T^+ \geq 4) = \sum_{x=4}^{7}\binom{7}{x}(0.5)^7\]

\[= \left[\binom{7}{4} + \binom{7}{5} + \binom{7}{6} + \binom{7}{7}\right](0.5)^7 = (35 + 21 + 7 + 1) \times \frac{1}{128} = \frac{64}{128} = \mathbf{0.5000}\]


Hypothesis test:

  1. \(H_0\): The median days cats stay in the shelter \(\leq 20\) (\(\theta \leq 20\))
    \(H_1\): The median days exceeds 20 (\(\theta > 20\))

  2. Test statistic: \(T^+ = 4\) under \(\text{Binomial}(7, 0.5)\)

  3. p-value \(= P(T^+ \geq 4 \mid n=7, p=0.5) = 0.5000\)

  4. Decision: Since p-value \(= 0.500 > \alpha = 0.05\), we fail to reject \(H_0\).

  5. Conclusion: There is insufficient evidence to conclude that the median number of days cats spend in the shelter exceeds 20 days. With only 4 of 7 cats staying more than 20 days and the sign split nearly even (4+ vs. 3−), the data are consistent with a median around 20 days. The small sample size (\(n = 7\)) gives this test very low power.