1 Simple Linear Regression

1.1 Concepts of Regression Analysis

1.1.1 What is regression analysis?

Many research questions involve understanding how one variable is related to another. Consider these examples:

A teacher wants to know whether the number of hours a student studies is related to the student’s exam score.
Medical researchers ask: Is caffeine related to heart damage? Is there a relationship between a person’s age and blood pressure?
A real estate appraiser wants to predict the sale price of a home from its physical characteristics and tax records.
Analysts want to examine whether cigarette consumption is related to socioeconomic variables such as age, education, income, and cigarette price.

Two closely related techniques help answer these questions:

Correlation analysis determines whether a linear relationship between variables exists, and measures its strength and direction.
Regression analysis describes the nature of the relationship — it models how changes in one or more predictor variables are associated with changes in the response variable.

1.1.2 Purposes of regression analysis

To determine the relationship between the explanatory variable(s) and the response variable.
To forecast or predict the value of the response variable for a given set of values of the explanatory variables.

1.1.3 Response and explanatory variables

A regression model expresses the relationship between a response variable (also called the dependent variable) and one or more explanatory variables (also called independent variables or predictors).

The response variable is denoted $Y$ — the variable whose value we want to predict or explain.
The predictor variables are denoted $X_1, X_2, \ldots, X_p$, where $p$ is the number of predictors.

1.1.4 The regression model

The true relationship between $Y$ and $X_1, X_2, \ldots, X_p$ is approximated by:

\[Y = f(X_1, X_2, \ldots, X_p) + \epsilon\]

$\epsilon$ is the random error representing the part of $Y$ not explained by the predictors. It accounts for measurement error, omitted variables, and natural variability.
$f(X_1, \ldots, X_p)$ describes the systematic part of the relationship. Common forms include:
- Linear: $f(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
- Nonlinear: $f(x_1, x_2) = \beta_0 + \beta_1 x_1 + e^{\beta_2 x_2}$

1.1.5 Steps in regression modelling

State the problem clearly.
Choose the relevant variables.
Collect data on the selected variables.
Specify a model appropriate for the data.
Estimate the model parameters.
Validate the model (check assumptions and fit).
Use the model to answer the original question.

1.2 Scatterplot

Before fitting any model, we always begin by visualising the data. A scatterplot displays pairs $(x_i, y_i)$ — the predictor $x$ on the horizontal axis and the response $y$ on the vertical axis — and reveals the form, direction, and strength of the relationship.

What to look for in a scatterplot:

Feature	What it means
Form	Is the pattern linear (a straight band) or curved?
Direction	Positive (both increase together) or negative?
Strength	How tightly do the points cluster around the pattern?
Outliers	Any unusual points that deviate from the overall pattern?

data(cars)
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping Distance (ft)",
     main = "Scatterplot: Speed vs. Stopping Distance",
     pch = 19, col = "steelblue")

The scatterplot shows that speed and dist are positively and approximately linearly related — as speed increases, stopping distance tends to increase.

library(MASS)
data(survey)
plot(survey$Age, survey$Pulse,
     xlab = "Age (years)", ylab = "Pulse Rate (beats/min)",
     main = "Scatterplot: Age vs. Pulse Rate",
     pch = 19, col = "tomato")

The scatterplot shows that Age and Pulse do not appear to be linearly related — the points scatter widely with no clear directional pattern.

1.3 Correlation

1.3.1 The Pearson correlation coefficient

The sample Pearson correlation coefficient $r$ measures the strength and direction of the linear relationship between two quantitative variables. For $n$ pairs $(x_1, y_1), \ldots, (x_n, y_n)$:

\[ r = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2 \;\cdot\; \displaystyle\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{n\sum x_i y_i - \sum x_i \sum y_i} {\sqrt{\bigl[n\sum x_i^2 - (\sum x_i)^2\bigr]\bigl[n\sum y_i^2 - (\sum y_i)^2\bigr]}} \]

The range is $-1 \leq r \leq 1$.

1.3.2 Interpreting $r$

Value of $\|r\|$	Interpretation
0.00 – 0.19	Very weak (negligible)
0.20 – 0.39	Weak
0.40 – 0.59	Moderate
0.60 – 0.79	Strong
0.80 – 1.00	Very strong

Sign: positive $r$ → both variables increase together; negative $r$ → one increases as the other decreases.
Magnitude: the closer $|r|$ is to 1, the more tightly the points cluster around a straight line.

❗ Important

$r$ measures linear association only. A value of $r \approx 0$ does not mean the variables are unrelated — they might have a strong nonlinear relationship that $r$ cannot detect. Always examine the scatterplot first.

cor(cars$speed, cars$dist)

## [1] 0.8068949

The correlation $r = 0.807$ confirms a strong positive linear relationship between speed and stopping distance.

1.3.3 Hypothesis test for the population correlation

The population correlation coefficient $\rho$ (rho) is the correlation computed from all possible pairs in the population. We never observe it directly; instead, we use $r$ from our sample to make inferences about $\rho$.

Possible hypotheses (where $\rho_0$ is the value under $H_0$, most commonly $\rho_0 = 0$):

\[ \begin{aligned} (a)\; &H_0: \rho \leq 0, \quad H_1: \rho > 0 \quad &\text{(test for positive correlation)}\\ (b)\; &H_0: \rho \geq 0, \quad H_1: \rho < 0 \quad &\text{(test for negative correlation)}\\ (c)\; &H_0: \rho = 0, \quad H_1: \rho \neq 0 \quad &\text{(test for any linear correlation)} \end{aligned} \]

Four-step procedure:

State $H_0$ and $H_1$.
The test statistic is $t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, which follows a $t$-distribution with $\nu = n - 2$ degrees of freedom under $H_0$.
Compute the test statistic and p-value (using R).
State the conclusion in context: reject $H_0$ if p-value $< \alpha$.

1.3.4 Confidence interval for $\rho$

A direct CI for $\rho$ uses Fisher’s $z$-transformation. Let $R = \dfrac{1}{2}\ln\dfrac{1+r}{1-r}$ (the Fisher transform of $r$), with approximate standard deviation $\sigma_R = \dfrac{1}{\sqrt{n-3}}$.

A $(1-\alpha)100\%$ CI for the transformed correlation is:

\[R \pm Z_{1-\alpha/2} \cdot \sigma_R\]

We then back-transform the bounds to the original $\rho$ scale using $\rho = \dfrac{e^{2R}-1}{e^{2R}+1}$.

💡 In R

cor.test(x, y, alternative = "two.sided", conf.level = 0.95) performs the hypothesis test and computes the 95% CI for $\rho$ automatically — no need to apply Fisher’s transformation by hand.

📝 Example 1.1: Delivery Time and Number of Items

A delivery company wants to know whether delivery time is correlated with the number of items delivered. They recorded times and item counts for 25 orders.

Order	Time $y$ (min)	Items $x$	Order	Time $y$ (min)	Items $x$
1	16.68	7	14	19.75	6
2	11.50	3	15	24.00	9
3	12.03	3	16	29.00	10
4	14.88	4	17	15.35	6
5	13.75	6	18	19.00	7
6	18.11	7	19	9.50	3
7	8.00	2	20	35.10	17
8	17.83	7	21	17.90	10
9	79.24	30	22	52.32	26
10	21.50	5	23	18.75	9
11	40.33	16	24	19.83	8
12	21.00	10	25	10.75	4
13	13.50	4

Assuming that delivery time and number of items follow a bivariate normal distribution, test at $\alpha = 0.05$ whether delivery time is correlated with the number of items.

# Data
items <- c(7,3,3,4,6,7,2,7,30,5,16,10,4,6,9,10,6,7,3,17,10,26,9,8,4)
time  <- c(16.68,11.5,12.03,14.88,13.75,18.11,8,17.83,79.24,21.5,
           40.33,21,13.5,19.75,24,29,15.35,19,9.5,35.1,17.9,52.32,
           18.75,19.83,10.75)

# Scatterplot
plot(items, time,
     main = "Number of Items vs. Delivery Time",
     xlab = "Number of Items (x)", ylab = "Delivery Time (min) (y)",
     pch = 19, col = "steelblue")

# Pearson correlation test (two-sided, 95% CI)
cor.test(items, time, alternative = "two.sided")

## 
##  Pearson's product-moment correlation
## 
## data:  items and time
## t = 17.546, df = 23, p-value = 8.22e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9202275 0.9845031
## sample estimates:
##       cor 
## 0.9646146

From the data, $r = 0.9646$.

Hypothesis test:

$H_0: \rho = 0$ vs. $H_1: \rho \neq 0$
Test statistic: $t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
Compute: \[t_\text{cal} = \frac{0.9646\sqrt{25-2}}{\sqrt{1-0.9646^2}} = 17.55, \quad \text{p-value} = 8.22 \times 10^{-15}\]
Since p-value $< 0.05$, we reject $H_0$. Conclusion: there is significant evidence that delivery time is linearly correlated with the number of items.

95% confidence interval for $\rho$:

\[R = \frac{1}{2}\ln\frac{1+0.9646}{1-0.9646} = 2.0082, \qquad \sigma_R = \frac{1}{\sqrt{25-3}} = 0.2132\]

\[2.0082 \pm 1.96\sqrt{0.2132} \;\Rightarrow\; 1.5903 < \mu_R < 2.4261\]

Back-transforming: $\quad \boxed{0.9202 < \rho < 0.9844}$

📝 Example 1.2: Math Midterm and Final Scores

Math midterm and final scores for 9 students are shown below. Find the sample correlation $r$ and test whether midterm and final scores are correlated at $\alpha = 0.10$.

Student	1	2	3	4	5	6	7	8	9
Midterm	77	50	71	72	80	93	95	98	66
Final	82	66	78	43	56	85	99	98	68

df <- data.frame(
  midterm = c(77, 50, 71, 72, 80, 93, 95, 98, 66),
  final   = c(82, 66, 78, 43, 56, 85, 99, 98, 68)
)

# Scatterplot
plot(df$midterm, df$final,
     main = "Midterm vs. Final Exam Scores",
     xlab = "Midterm Score", ylab = "Final Score",
     pch = 19, col = "darkred")

# Correlation test at alpha = 0.10 (90% CI)
cor.test(df$midterm, df$final, alternative = "two.sided", conf.level = 0.90)

## 
##  Pearson's product-moment correlation
## 
## data:  df$midterm and df$final
## t = 2.197, df = 7, p-value = 0.06402
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
##  0.0845006 0.8911984
## sample estimates:
##       cor 
## 0.6388399

The sample correlation is $r = 0.6388$, indicating a moderately strong positive linear relationship.

Hypothesis test:

$H_0: \rho = 0$ vs. $H_1: \rho \neq 0$
Test statistic: $t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
Compute with $r = 0.6388$, $n = 9$: \[t_\text{cal} = \frac{0.6388\sqrt{7}}{\sqrt{1-0.6388^2}} = 2.197, \quad \text{p-value} = 0.0640\]
Since p-value $= 0.0640 < \alpha = 0.10$, we reject $H_0$. Conclusion: at the 10% significance level, there is sufficient evidence that midterm and final scores are linearly correlated.

📝 Example 1.3: Rainfall and Umbrella Sales

Peter owns a supermarket and believes that months with higher rainfall will have higher umbrella sales. He recorded monthly rainfall (mm) and umbrellas sold over nine consecutive months.

Month	Rainfall (mm)	Umbrellas sold
1	82.0	12
2	92.5	25
3	83.2	17
4	97.7	28
5	131.9	41
6	141.3	47
7	165.4	50
8	140.0	46
9	126.7	37

Assuming both variables follow a normal distribution, test at $\alpha = 0.05$ whether Peter’s assumption is correct.

df2 <- data.frame(
  Rainfall  = c(82.0, 92.5, 83.2, 97.7, 131.9, 141.3, 165.4, 140.0, 126.7),
  Umbrellas = c(12, 25, 17, 28, 41, 47, 50, 46, 37)
)

# Scatterplot
plot(df2$Rainfall, df2$Umbrellas,
     main = "Monthly Rainfall vs. Umbrella Sales",
     xlab = "Rainfall (mm)", ylab = "Umbrellas Sold",
     pch = 19, col = "royalblue")

# One-sided test: H1: rho > 0
cor.test(df2$Rainfall, df2$Umbrellas, alternative = "greater")

## 
##  Pearson's product-moment correlation
## 
## data:  df2$Rainfall and df2$Umbrellas
## t = 10.455, df = 7, p-value = 7.97e-06
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.8877971 1.0000000
## sample estimates:
##       cor 
## 0.9694419

Hypothesis test:

$H_0: \rho \leq 0$ vs. $H_1: \rho > 0$ (Peter believes higher rainfall → more sales)
Test statistic: $t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
Compute with $r = 0.9694$, $n = 9$: \[t_\text{cal} = \frac{0.9694\sqrt{7}}{\sqrt{1-0.9694^2}} = 10.455, \quad \text{p-value} = 7.97 \times 10^{-6}\]
Since p-value $< 0.05$, we reject $H_0$. Conclusion: there is significant evidence that higher rainfall is associated with more umbrella sales. Peter’s assumption is supported by the data.

1.4 Simple Linear Regression Model

1.4.1 Model specification

When both the form of the relationship and the direction of interest (predicting $Y$ from $X$) are clear, we move from correlation to regression.

The simple linear regression model is:

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n\]

where:

\[ \begin{aligned} E(y_i) &= \beta_0 + \beta_1 x_i &&\text{(deterministic / systematic part)} \\ \epsilon_i &&&\text{(random error component)} \\ \beta_0 &&&\text{(intercept: value of } E(y) \text{ when } x = 0\text{)} \\ \beta_1 &&&\text{(slope: change in } E(y) \text{ per unit increase in } x\text{)} \end{aligned} \]

1.4.2 Assumptions about the error term $\epsilon_i$

For valid inference, the random errors must satisfy:

Normality: $\epsilon_i \sim N(0, \sigma^2)$ for all $i$.
Zero mean: $E(\epsilon_i) = 0$.
Constant variance (homoscedasticity): $\text{Var}(\epsilon_i) = \sigma^2$ (the same for all $i$).
Independence: $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$.

In short: $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.

💡 Checking the assumptions

These assumptions cannot be verified from the raw data — they must be checked using residual diagnostics after the model is fitted (see @sec-diagnostics).

1.4.3 Parameter estimation: method of least squares

Since $\beta_0$ and $\beta_1$ are unknown population parameters, we estimate them from a sample of size $n$.

The least-squares estimates (LSE) $b_0$ and $b_1$ minimise the sum of squared residuals:

\[Q = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\]

Setting the partial derivatives $\partial Q / \partial \beta_0 = 0$ and $\partial Q / \partial \beta_1 = 0$ and solving gives the normal equations, whose solutions are:

\[\boxed{b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1\bar{x}}\]

where $S_{xx} = \sum(x_i - \bar{x})^2$ and $S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})$.

Equivalently:

\[b_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b_0 = \bar{y} - b_1\bar{x}\]

1.4.4 The fitted regression line and residuals

The fitted (estimated) regression line is: \[\hat{y} = b_0 + b_1 x\]

The fitted value at observation $i$ is $\hat{y}_i = b_0 + b_1 x_i$, and the residual is: \[e_i = y_i - \hat{y}_i\]

Residuals represent the part of $y_i$ not explained by the model — they are the observable counterparts of the unobservable errors $\epsilon_i$.

1.4.5 Estimating $\sigma^2$

The error variance $\sigma^2$ measures the spread of observations around the regression line. It is estimated by the mean squared residual (also called mean squared error, MSE):

\[\hat{\sigma}^2 = MS_\text{res} = \frac{SS_\text{res}}{n-2} = \frac{\displaystyle\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2}\]

The denominator $n-2$ (not $n$) gives an unbiased estimate because two parameters ($\beta_0$ and $\beta_1$) are estimated from the data.

The residual sum of squares can also be expressed as: \[SS_\text{res} = S_{yy} - b_1 S_{xy}\]

where $S_{yy} = \sum(y_i - \bar{y})^2$.

1.4.6 ANOVA table for regression

The total variation in $Y$ can be decomposed as:

\[SS_T = SS_\text{reg} + SS_\text{res}\]

Source	SS	df	MS	$F$
Regression	$SS_\text{reg} = b_1 S_{xy}$	$1$	$MS_\text{reg} = SS_\text{reg}/1$	$F = MS_\text{reg}/MS_\text{res}$
Residual	$SS_\text{res} = S_{yy} - b_1 S_{xy}$	$n-2$	$MS_\text{res} = SS_\text{res}/(n-2)$
Total	$SS_T = S_{yy}$	$n-1$

The F-statistic tests $H_0: \beta_1 = 0$ (the model has no linear fit). For simple regression, this is equivalent to the $t$-test on $\beta_1$.

💡 In R

anova(fit) produces this table. summary(fit) gives the $F$-statistic at the bottom of its output.

1.4.7 Inference for $\beta_0$ and $\beta_1$

Because $b_0$ and $b_1$ are computed from a random sample, they are random variables with their own sampling distributions.

Distribution of $b_1$:

\[b_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2}{S_{xx}}\right)\]

When $\sigma^2$ is unknown and estimated by $MS_\text{res}$:

\[t = \frac{b_1 - \beta_1}{se(b_1)} \sim t_{n-2}, \qquad se(b_1) = \sqrt{\frac{MS_\text{res}}{S_{xx}}}\]

Distribution of $b_0$:

\[t = \frac{b_0 - \beta_0}{se(b_0)} \sim t_{n-2}, \qquad se(b_0) = \sqrt{MS_\text{res}\!\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}\]

Confidence intervals:

\[ \begin{aligned} (1-\alpha)100\% \text{ CI for } \beta_1: &\quad b_1 \pm t_{1-\alpha/2,\, n-2} \cdot se(b_1) \\ (1-\alpha)100\% \text{ CI for } \beta_0: &\quad b_0 \pm t_{1-\alpha/2,\, n-2} \cdot se(b_0) \end{aligned} \]

Confidence interval for $\sigma^2$:

\[\frac{(n-2)MS_\text{res}}{\chi^2_{1-\alpha/2,\, n-2}} \leq \sigma^2 \leq \frac{(n-2)MS_\text{res}}{\chi^2_{\alpha/2,\, n-2}}\]

1.4.8 Hypothesis test for $\beta_1$

The most important test in simple regression is:

\[H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0\]

If we fail to reject $H_0$, there is no evidence of a linear relationship between $X$ and $Y$.

Four-step procedure:

State $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ (or one-sided alternative).
Test statistic: $t_0 = \dfrac{b_1}{se(b_1)} \sim t_{n-2}$ under $H_0$.
Compute $t_0$ and the p-value from R output.
Reject $H_0$ if p-value $< \alpha$; conclude there is a significant linear relationship.

1.4.9 Coefficient of determination $r^2$

The coefficient of determination measures the proportion of the total variation in $Y$ that is explained by the regression model:

\[r^2 = \frac{SS_\text{reg}}{SS_T} = 1 - \frac{SS_\text{res}}{SS_T}, \qquad 0 \leq r^2 \leq 1\]

$r^2 = 1$: perfect fit — the model explains all the variation.
$r^2 = 0$: the model explains none of the variation; $\hat{y}$ is no better than $\bar{y}$.

Interpretation template: “$r^2 = 0.845$ means that 84.5% of the total variation in $Y$ is explained by the linear regression on $X$. The remaining 15.5% is unexplained (attributable to random error or other factors not in the model).”

Note: In simple linear regression, $r^2 = r^2$ (the square of the Pearson correlation coefficient).

💡 Reading summary(fit) output in R

Output field	Meaning
`Estimate` (Intercept)	$b_0$: estimated $Y$ when $X = 0$
`Estimate` ($x$)	$b_1$: estimated change in $\hat{Y}$ per 1-unit increase in $X$
`Std. Error`	$se(b_j)$: standard error of the estimate
`t value`	$t_0 = b_j / se(b_j)$: test statistic for $H_0: \beta_j = 0$
`Pr(>\|t\|)`	p-value; reject $H_0$ if $< \alpha$
`Residual standard error`	$\hat{\sigma} = \sqrt{MS_\text{res}}$
`Multiple R-squared`	$r^2$: proportion of variance explained
`F-statistic`	Overall model F-test (same as $t^2$ for simple regression)
`p-value` (F)	p-value for the overall F-test

1.4.10 Prediction and confidence intervals

After fitting the model $\hat{y} = b_0 + b_1 x$, we can form two types of interval estimates for a new value $x^*$:

1. Confidence interval for the mean response $E(y \mid x = x^*)$:

This estimates the average $Y$ for all observations with $X = x^*$: \[\hat{y}^* \pm t_{1-\alpha/2,\, n-2} \sqrt{MS_\text{res}\!\left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]

2. Prediction interval for a new individual observation $y$ at $x = x^*$:

This estimates a single future $Y$ when $X = x^*$: \[\hat{y}^* \pm t_{1-\alpha/2,\, n-2} \sqrt{MS_\text{res}\!\left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]

❗ CI vs. PI — which is wider?

The prediction interval (PI) is always wider than the confidence interval (CI) for the same $x^*$. The extra “1” inside the square root accounts for the additional uncertainty in predicting a single new observation (which has its own random error $\epsilon$), rather than estimating the mean response (which averages out individual variation).

Both intervals are narrowest at $x^* = \bar{x}$ and widen as $x^*$ moves away from $\bar{x}$.

1.4.11 Residual analysis and model diagnostics

After fitting the model, we must verify that the regression assumptions hold. We do this by examining the residuals $e_i = y_i - \hat{y}_i$.

Key diagnostic plots:

Plot	What it checks	What “good” looks like
Residuals vs. Fitted	Linearity + constant variance	Random scatter around $e = 0$; no pattern
Normal Q-Q plot	Normality of residuals	Points close to the diagonal reference line
Scale-Location	Constant variance (homoscedasticity)	Flat red line; even spread
Residuals vs. Leverage	Influential observations	No points outside Cook’s distance contours

# R's four built-in diagnostic plots
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

Warning signs:

Residuals vs. Fitted: a curved pattern suggests non-linearity; a funnel shape suggests non-constant variance.
Q-Q plot: S-shaped deviations suggest heavy tails; systematic curves suggest skewness.
Scale-Location: an upward trend suggests variance increases with the fitted values.

💡 Formal tests (used in later labs)

Shapiro-Wilk test (shapiro.test(residuals(fit))): tests normality. Satisfied if p-value > 0.05.
Breusch-Pagan test (bptest(fit) from lmtest): tests constant variance. Satisfied if p-value > 0.05.
Durbin-Watson test (dwtest(fit) from lmtest): tests for autocorrelation. Satisfied if p-value > 0.05.

📝 Example 1.4: Condominium Area and Sale Price

A marketing manager wants to study the relationship between the area of a condominium unit ($m^2$) and its sale price (10,000 Baht) in the Mueang district of Chiang Mai. Ten units were randomly selected.

Obs	Area $x$	Price $y$	$xy$	$x^2$	$y^2$
1	13	115	1495	169	13225
2	4	45	180	16	2025
3	12	100	1200	144	10000
4	5	50	250	25	2500
5	6	55	330	36	3025
6	8	85	680	64	7225
7	3	40	120	9	1600
8	4	50	200	16	2500
9	5	45	225	25	2025
10	7	70	490	49	4900
Total	67	655	5170	553	49025

df <- data.frame(
  Area = c(13, 4, 12, 5, 6, 8, 3, 4, 5, 7),
  Sale = c(115, 45, 100, 50, 55, 85, 40, 50, 45, 70)
)

# Scatterplot
plot(df$Area, df$Sale,
     xlab = expression("Area (" * m^2 * ")"),
     ylab = "Sale Price (10,000 Baht)",
     main = "Condominium Area vs. Sale Price",
     pch = 19, col = "steelblue")

# Fit simple linear regression
fit <- lm(Sale ~ Area, data = df)
summary(fit)

## 
## Call:
## lm(formula = Sale ~ Area, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.738 -4.618  0.987  2.269  9.741 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.202      4.120    3.69  0.00613 ** 
## Area           7.507      0.554   13.55 8.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.653 on 8 degrees of freedom
## Multiple R-squared:  0.9582, Adjusted R-squared:  0.953 
## F-statistic: 183.6 on 1 and 8 DF,  p-value: 8.451e-07

anova(fit)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
Area	1	5866.8804	5866.88040	183.6129	8e-07
Residuals	8	255.6196	31.95245	NA	NA

Test for significance of regression ($\beta_1$):

$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
Test statistic: $t_0 = \dfrac{b_1}{se(b_1)}$
From R: $t = 13.55$, p-value $= 8.45 \times 10^{-7}$
Since p-value $< 0.05$, we reject $H_0$. Conclusion: condominium area is a significant linear predictor of sale price.

Fitted equation: \[\hat{y} = 15.202 + 7.507\, x\]

Interpretation: For each additional $1\, m^2$ of area, the estimated sale price increases by 75,070 Baht (7.507 × 10,000). A unit with 0 $m^2$ area is estimated to cost 152,020 Baht (intercept, not practically meaningful here).

Coefficient of determination:

$r^2 = 0.9582$: 95.82% of the total variation in sale price is explained by the linear regression on area. The remaining 4.18% is unexplained.

95% confidence interval for $E(y \mid x = 13)$:

newx <- data.frame(Area = 13)
predict(fit, newdata = newx, interval = "confidence", level = 0.95)

##        fit      lwr      upr
## 1 112.7954 103.7525 121.8382

\[\hat{y}_h \pm t_{0.975,\,8}\sqrt{MS_\text{res}\!\left(\frac{1}{10} + \frac{(13 - 6.7)^2}{104.1}\right)} = 112.80 \pm 9.04 \;\Rightarrow\; \boxed{103.76 < E(y \mid x=13) < 121.83}\]

95% prediction interval for a new unit with $x = 13$:

predict(fit, newdata = newx, interval = "prediction", level = 0.95)

##        fit      lwr    upr
## 1 112.7954 96.93079 128.66

\[\hat{y}_0 \pm t_{0.975,\,8}\sqrt{MS_\text{res}\!\left(1 + \frac{1}{10} + \frac{(13-6.7)^2}{104.1}\right)} = 112.80 \pm 15.86 \;\Rightarrow\; \boxed{96.94 < y \mid x=13 < 128.65}\]

Plot: fitted line with CI and PI bands:

newx_all  <- data.frame(Area = sort(df$Area))
conf_int  <- predict(fit, newdata = newx_all, interval = "confidence")
pred_int  <- predict(fit, newdata = newx_all, interval = "prediction")

plot(df$Area, df$Sale,
     xlab = expression("Area (" * m^2 * ")"),
     ylab = "Sale Price (10,000 Baht)",
     main = "Fitted Line with 95% CI and PI Bands",
     pch = 19, col = "steelblue",
     ylim = c(20, 140))
abline(fit, lwd = 2, col = "black")
lines(sort(df$Area), conf_int[, "lwr"], col = "red",  lty = 2, lwd = 1.5)
lines(sort(df$Area), conf_int[, "upr"], col = "red",  lty = 2, lwd = 1.5)
lines(sort(df$Area), pred_int[, "lwr"], col = "blue", lty = 3, lwd = 1.5)
lines(sort(df$Area), pred_int[, "upr"], col = "blue", lty = 3, lwd = 1.5)
legend("topleft",
       legend = c("Data", "Fitted line", "95% CI", "95% PI"),
       pch    = c(19, NA, NA, NA),
       lty    = c(NA, 1, 2, 3),
       col    = c("steelblue", "black", "red", "blue"),
       lwd    = 2, bty = "n")

Residual diagnostics:

par(mfrow = c(2, 2))
plot(fit, pch = 16, col = "steelblue")

par(mfrow = c(1, 1))

📝 Example 1.5: Rocket Propellant Shear Strength

A rocket motor is manufactured by bonding two types of propellant inside a metal housing. The shear strength of the bond is an important quality characteristic. It is suspected that shear strength ($y$, psi) is related to the age of the sustainer propellant batch ($x$, weeks). Twenty observations are available.

Obs	Shear Strength $y$	Age $x$	Obs	Shear Strength $y$	Age $x$
1	2158.70	15.50	11	2165.20	13.00
2	1678.15	23.75	12	2399.55	3.75
3	2316.00	8.00	13	1779.80	25.00
4	2061.30	17.00	14	2336.75	9.75
5	2207.50	5.50	15	1765.30	22.00
6	1708.30	19.00	16	2053.50	18.00
7	1784.70	24.00	17	2414.40	6.00
8	2575.00	2.50	18	2200.50	12.50
9	2357.90	7.50	19	2654.20	2.00
10	2256.70	11.00	20	1753.70	21.50

age      <- c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50,
               7.50, 11.00, 13.00, 3.75, 25.00, 9.75, 22.00, 18.00,
               6.00, 12.50, 2.00, 21.50)
strength <- c(2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30,
              1784.70, 2575.00, 2357.90, 2256.70, 2165.20, 2399.55,
              1779.80, 2336.75, 1765.30, 2053.50, 2414.40, 2200.50,
              2654.20, 1753.70)

dt2 <- data.frame(Age = age, Strength = strength)

# Scatterplot
plot(dt2$Age, dt2$Strength,
     xlab = "Age of Propellant (weeks)",
     ylab = "Shear Strength (psi)",
     main = "Propellant Age vs. Bond Shear Strength",
     pch = 19, col = "darkorange")

The scatterplot shows a strong negative linear relationship — older batches of propellant tend to have lower shear strength. The straight-line model $y = \beta_0 + \beta_1 x + \epsilon$ appears appropriate.

fit1 <- lm(Strength ~ Age, data = dt2)
summary(fit1)

## 
## Call:
## lm(formula = Strength ~ Age, data = dt2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -215.98  -50.68   28.74   66.61  106.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2627.822     44.184   59.48  < 2e-16 ***
## Age          -37.154      2.889  -12.86 1.64e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared:  0.9018, Adjusted R-squared:  0.8964 
## F-statistic: 165.4 on 1 and 18 DF,  p-value: 1.643e-10

anova(fit1)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
Age	1	1527482.7	1527482.743	165.3768	0
Residuals	18	166254.9	9236.381	NA	NA

Test for significance of regression ($\beta_1$):

$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
Test statistic: $t_0 = \dfrac{b_1}{se(b_1)}$
From R: $t = -12.86$, p-value $= 1.64 \times 10^{-10}$
Since p-value $< 0.05$, we reject $H_0$. Conclusion: propellant age is a significant linear predictor of shear strength.

Fitted equation: \[\hat{y} = 2627.822 - 37.154\, x\]

Interpretation: For each additional week of propellant age, the estimated shear strength decreases by 37.154 psi on average.

90% and 95% confidence intervals for $\beta_1$:

confint(fit1, level = 0.90)

##                    5 %      95 %
## (Intercept) 2551.20465 2704.4401
## Age          -42.16349  -32.1437

confint(fit1, level = 0.95)

##                  2.5 %    97.5 %
## (Intercept) 2534.99540 2720.6493
## Age          -43.22338  -31.0838

At 90%: $-37.15 \pm (1.734)(2.89)$, giving $-43.22 \leq \beta_1 \leq -31.08$

95% confidence interval for $\sigma^2$:

\[\frac{(n-2)MS_\text{res}}{\chi^2_{0.975,\,18}} \leq \sigma^2 \leq \frac{(n-2)MS_\text{res}}{\chi^2_{0.025,\,18}} = \frac{18 \times 9244.59}{31.526} \leq \sigma^2 \leq \frac{18 \times 9244.59}{8.231}\]

\[5{,}278.27 \leq \sigma^2 \leq 20{,}216.57\]

Plot: fitted line with CI and PI bands:

new5      <- data.frame(Age = sort(dt2$Age))
ci5       <- predict(fit1, newdata = new5, interval = "confidence")
pi5       <- predict(fit1, newdata = new5, interval = "prediction")

plot(dt2$Age, dt2$Strength,
     xlab = "Age of Propellant (weeks)",
     ylab = "Shear Strength (psi)",
     main = "Fitted Line with 95% CI and PI Bands",
     pch = 19, col = "darkorange")
abline(fit1, lwd = 2, col = "black")
lines(sort(dt2$Age), ci5[, "lwr"], col = "darkgreen", lty = 2, lwd = 1.5)
lines(sort(dt2$Age), ci5[, "upr"], col = "darkgreen", lty = 2, lwd = 1.5)
lines(sort(dt2$Age), pi5[, "lwr"], col = "purple",    lty = 3, lwd = 1.5)
lines(sort(dt2$Age), pi5[, "upr"], col = "purple",    lty = 3, lwd = 1.5)
legend("topright",
       legend = c("Data", "Fitted line", "95% CI", "95% PI"),
       pch    = c(19, NA, NA, NA),
       lty    = c(NA, 1, 2, 3),
       col    = c("darkorange", "black", "darkgreen", "purple"),
       lwd    = 2, bty = "n")

Residual diagnostics:

par(mfrow = c(2, 2))
plot(fit1, pch = 16, col = "darkorange")

par(mfrow = c(1, 1))

2 Multiple Linear Regression

2.1 From Simple to Multiple Regression

In Chapter 1 we modelled a single response variable $Y$ using one predictor $X$: \[y = \beta_0 + \beta_1 x + \epsilon\]

In practice, most responses depend on several predictors simultaneously. For example:

A researcher wants to predict a student’s nursing board exam score from GPA and age.
An engineer wants to model delivery time from number of cases stocked and distance walked.
A manufacturer wants to relate product tool life to lathe speed and tool type.

Multiple linear regression extends the simple model to handle $k \geq 1$ predictors — both quantitative and qualitative — while keeping the linear structure that makes estimation and inference tractable.

2.2 The Multiple Linear Regression Model

2.2.1 Model equation

Given $n$ observations on a response $y$ and $k$ predictors $x_1, x_2, \ldots, x_k$, the multiple linear regression model is:

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n\]

where:

$\beta_0$ is the intercept (value of $E(y)$ when all $x_j = 0$),
$\beta_j$ ($j = 1, \ldots, k$) is the partial regression coefficient for $x_j$ — the expected change in $y$ for a one-unit increase in $x_j$, holding all other predictors constant,
$\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ are the random errors (same four assumptions as in simple regression).

The systematic (deterministic) part of the model is: \[E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\]

2.2.2 Data structure

Obs	$y$	$x_1$	$x_2$	$\cdots$	$x_k$
1	$y_1$	$x_{11}$	$x_{12}$	$\cdots$	$x_{1k}$
2	$y_2$	$x_{21}$	$x_{22}$	$\cdots$	$x_{2k}$
$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\ddots$	$\vdots$
$n$	$y_n$	$x_{n1}$	$x_{n2}$	$\cdots$	$x_{nk}$

2.2.3 Matrix form

Writing one equation per observation and stacking them gives the compact matrix form:

\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]

where

\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n\times 1},\quad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}_{n\times(k+1)},\quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}_{(k+1)\times 1},\quad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}_{n\times 1} \]

$\mathbf{X}$ is called the design matrix. Its first column is all ones, corresponding to the intercept $\beta_0$.

2.3 Parameter Estimation

2.3.1 Least squares estimator

We estimate $\boldsymbol{\beta}$ by minimising the residual sum of squares:

\[SS_\text{res} = \boldsymbol{\epsilon}'\boldsymbol{\epsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

Setting $\partial SS_\text{res}/\partial \boldsymbol{\beta} = \mathbf{0}$ gives the normal equations $\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{y}$, whose solution is the least squares estimator (LSE):

\[\boxed{\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}}\]

provided $\mathbf{X}'\mathbf{X}$ is invertible (i.e., no perfect multicollinearity among the predictors).

2.3.2 Fitted values and residuals

Quantity	Formula	Interpretation
Fitted values	$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$	Model’s predicted $y$ for each observation
Residuals	$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$	Observed minus predicted

2.3.3 Estimating $\sigma^2$

\[SS_\text{res} = \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y}\]

\[\boxed{\hat{\sigma}^2 = MS_\text{res} = \frac{SS_\text{res}}{n - k - 1}}\]

The denominator $n - k - 1$ accounts for estimating $k + 1$ parameters ($\beta_0, \beta_1, \ldots, \beta_k$), giving an unbiased estimator of $\sigma^2$.

2.4 Inference for the Regression Coefficients

2.4.1 Distribution of $b_j$

Each least squares estimate $b_j$ follows a $t$-distribution:

\[t = \frac{b_j - \beta_j}{se(b_j)} \sim t_{n-k-1}, \qquad j = 0, 1, \ldots, k\]

where the standard error of $b_j$ is: \[se(b_j) = \sqrt{\hat{\sigma}^2 C_{jj}}\]

and $C_{jj}$ is the $(j, j)$-th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$.

2.4.2 Confidence intervals for $\beta_j$

A $(1-\alpha)100\%$ confidence interval for $\beta_j$ is:

\[b_j \pm t_{1-\alpha/2,\; n-k-1} \cdot se(b_j)\]

2.5 Testing the Fitted Regression

2.5.1 Overall F-test

The overall F-test asks: does the model explain any significant variation in $y$?

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad \text{vs.} \qquad H_1: \text{at least one } \beta_j \neq 0\]

Test statistic:

\[F = \frac{SS_\text{reg}/k}{SS_\text{res}/(n-k-1)} = \frac{MS_\text{reg}}{MS_\text{res}} \sim F_{k,\; n-k-1} \quad \text{under } H_0\]

ANOVA table:

Source	df	SS	MS	$F$
Regression	$k$	$SS_\text{reg} = \mathbf{b}'\mathbf{X}'\mathbf{y} - n\bar{y}^2$	$MS_\text{reg} = SS_\text{reg}/k$	$F = MS_\text{reg}/MS_\text{res}$
Residual	$n-k-1$	$SS_\text{res} = \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y}$	$MS_\text{res} = SS_\text{res}/(n-k-1)$
Total	$n-1$	$SS_T = \mathbf{y}'\mathbf{y} - n\bar{y}^2$

Reject $H_0$ if $F_\text{cal} > F_{1-\alpha,\; k,\; n-k-1}$ or if p-value $< \alpha$.

💡 In R

anova(fit) produces the ANOVA table. summary(fit) reports the $F$-statistic and its p-value at the bottom of the output.

2.5.2 Individual t-test for each $\beta_j$

Once the overall F-test is significant, we test each predictor individually:

\[H_0: \beta_j = 0 \qquad \text{vs.} \qquad H_1: \beta_j \neq 0\]

Test statistic: $t_0 = \dfrac{b_j}{se(b_j)} \sim t_{n-k-1}$ under $H_0$

This tests whether $x_j$ contributes significantly to the model given that all other predictors are already included.

2.6 Coefficient of Determination

2.6.1 $R^2$ — multiple coefficient of determination

\[R^2 = \frac{SS_\text{reg}}{SS_T} = 1 - \frac{SS_\text{res}}{SS_T}, \qquad 0 \leq R^2 \leq 1\]

Limitation: $R^2$ always increases (never decreases) when a predictor is added, even if it is useless. This makes it unreliable for comparing models with different numbers of predictors.

2.6.2 Adjusted $R^2$

The adjusted $R^2$ penalises for the number of predictors and can decrease if an added predictor does not improve fit sufficiently:

\[R^2_\text{adj} = 1 - \frac{SS_\text{res}/(n-k-1)}{SS_T/(n-1)} = 1 - \left(\frac{n-1}{n-k-1}\right)\frac{SS_\text{res}}{SS_T}\]

❗ $R^2$ vs. Adjusted $R^2$

Use $R^2$ to describe the fit of a given model. Use adjusted $R^2$ — or AIC/BIC — when comparing models with different numbers of predictors. A model with higher adjusted $R^2$ is generally preferred.

2.7 Prediction

For a new observation with predictor values $\mathbf{x}_0 = (1, x_{01}, x_{02}, \ldots, x_{0k})'$:

Point estimate: $\hat{y}_0 = \mathbf{x}_0'\mathbf{b}$

$(1-\alpha)100\%$ confidence interval for $E(y \mid \mathbf{x}_0)$ (mean response): \[\hat{y}_0 \pm t_{1-\alpha/2,\; n-k-1} \sqrt{\hat{\sigma}^2 \, \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}\]

$(1-\alpha)100\%$ prediction interval for a new individual $y \mid \mathbf{x}_0$: \[\hat{y}_0 \pm t_{1-\alpha/2,\; n-k-1} \sqrt{\hat{\sigma}^2 \left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)}\]

The prediction interval is always wider than the confidence interval for the same reason as in simple regression — it must account for the random error in a single new observation.

💡 Reading summary(fit) output in R

Output field	Meaning
`Estimate` (Intercept)	$b_0$: estimated $y$ when all $x_j = 0$
`Estimate` ($x_j$)	$b_j$: estimated change in $\hat{y}$ per 1-unit increase in $x_j$, holding all others constant
`Std. Error`	$se(b_j)$: standard error of the estimate
`t value`	$t_0 = b_j/se(b_j)$: test statistic for $H_0: \beta_j = 0$
`Pr(>\|t\|)`	p-value for individual $t$-test; reject $H_0$ if $< \alpha$
`Residual standard error`	$\hat{\sigma} = \sqrt{MS_\text{res}}$, on $n-k-1$ df
`Multiple R-squared`	$R^2$
`Adjusted R-squared`	$R^2_\text{adj}$ — use this for model comparison
`F-statistic`	Overall F-test statistic
`p-value` (F)	p-value for the overall F-test

📝 Example 2.1: Delivery Time Data

A soft drink bottler wants to predict the time required by a route driver to service vending machines. The industrial engineer in charge suggests two key predictors: the number of cases stocked ($x_1$) and the distance walked by the driver ($x_2$, in feet). Data were collected from 25 deliveries.

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\]

$\beta_1$: average increase in delivery time per additional case stocked, holding distance constant.
$\beta_2$: average increase in delivery time per additional foot walked, holding cases constant.

Obs	Time $y$ (min)	Cases $x_1$	Distance $x_2$ (ft)	Obs	Time $y$	Cases $x_1$	Distance $x_2$
1	16.68	7	560	14	19.75	6	462
2	11.50	3	220	15	24.00	9	448
3	12.03	3	340	16	29.00	10	776
4	14.88	4	80	17	15.35	6	200
5	13.75	6	150	18	19.00	7	132
6	18.11	7	330	19	9.50	3	36
7	15.93	5	50	20	35.10	17	770
8	14.38	4	357	21	17.90	10	140
9	25.32	11	449	22	52.32	26	810
10	11.52	3	120	23	18.75	9	450
11	11.93	4	220	24	19.83	8	635
12	27.90	10	588	25	10.75	4	150
13	13.13	4	64

dat <- data.frame(
  y  = c(16.68, 11.50, 12.03, 14.88, 13.75, 18.11, 15.93, 14.38, 25.32,
         11.52, 11.93, 27.90, 13.13, 19.75, 24.00, 29.00, 15.35, 19.00,
          9.50, 35.10, 17.90, 52.32, 18.75, 19.83, 10.75),
  x1 = c(7, 3, 3, 4, 6, 7, 5, 4, 11, 3, 4, 10, 4, 6, 9, 10, 6, 7, 3,
         17, 10, 26, 9, 8, 4),
  x2 = c(560, 220, 340, 80, 150, 330, 50, 357, 449, 120, 220, 588, 64,
         462, 448, 776, 200, 132, 36, 770, 140, 810, 450, 635, 150)
)

Exploratory analysis

# Correlation matrix
round(cor(dat), 3)

##        y    x1    x2
## y  1.000 0.968 0.791
## x1 0.968 1.000 0.729
## x2 0.791 0.729 1.000

# Pairwise scatterplot matrix
pairs(dat,
      main  = "Scatterplot Matrix: Delivery Time Data",
      pch   = 19, col = "steelblue")

From the correlation matrix and scatterplots, both $x_1$ and $x_2$ are positively correlated with $y$. The correlation between $x_1$ and $x_2$ is moderate, so multicollinearity is unlikely to be a serious problem.

Fit the multiple regression model

y  <- dat$y  
x1 <- dat$x1  
x2 <- dat$x2

fit.full <- lm(y ~ x1 + x2, data = dat)
summary(fit.full)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7550 -1.7329 -0.0388  1.7226  3.1038 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.301098   0.803072   6.601 1.23e-06 ***
## x1          1.537225   0.125982  12.202 2.88e-11 ***
## x2          0.007012   0.002645   2.651   0.0146 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.153 on 22 degrees of freedom
## Multiple R-squared:  0.9517, Adjusted R-squared:  0.9474 
## F-statistic: 216.9 on 2 and 22 DF,  p-value: 3.305e-15

anova(fit.full)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
x1	1	1979.40029	1979.400287	426.860658	0.0000000
x2	1	32.59253	32.592526	7.028627	0.0145887
Residuals	22	102.01644	4.637111	NA	NA

(a) Overall F-test at $\alpha = 0.05$

$H_0: \beta_1 = \beta_2 = 0$ vs. $H_1$: at least one $\beta_j \neq 0$
Test statistic: $F = \dfrac{SS_\text{reg}/k}{SS_\text{res}/(n-k-1)}$
From R: $F = 261.2$ on $(2, 22)$ df, p-value $= 4.69 \times 10^{-16}$

Source	df	SS	MS	$F$
Regression	2	5550.82	2775.41	261.24
Residual	22	233.73	10.62
Total	24	5784.55

Since p-value $< 0.05$, we reject $H_0$. At least one of $x_1$ or $x_2$ is a significant predictor of delivery time.

(b) Test whether $x_1$ should be included when $x_2$ is already in the model

f1 <- lm(y ~ x2)
f2 <- lm(y ~ x2 + x1)
anova(f1, f2)

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
23	792.4236	NA	NA	NA	NA
22	102.0164	1	690.4071	148.8873	0

$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ (given $x_2$ is in the model)
Test statistic: $t_0 = b_1 / se(b_1)$
From R: $b_1 = 1.6159$, $se(b_1) = 0.1707$, $t = 9.464$, p-value $= 3.25 \times 10^{-9}$ or $F=148.89$ and p-value $= 2.88 \times 10^{-11}$
Since p-value $< 0.05$, reject $H_0$. $x_1$ should be included in the model.

(c) Fitted equation

round(coef(fit.full), 4)

## (Intercept)          x1          x2 
##      5.3011      1.5372      0.0070

\[\hat{y} = 5.301 + 1.537\, x_1 + 0.007\, x_2\]

Interpretation:

$b_1 = 1.537$: each additional case stocked is associated with an estimated 1.537-minute increase in delivery time, holding distance constant.
$b_2 = 0.007$: each additional foot walked is associated with an estimated 0.007-minute increase in delivery time, holding cases constant.

(d) 95% confidence intervals for $\beta_0, \beta_1, \beta_2$

confint(fit.full, level = 0.95)

##                   2.5 %    97.5 %
## (Intercept) 3.635628107 6.9665670
## x1          1.275953966 1.7984953
## x2          0.001526778 0.0124966

(e) How much variation in $y$ is explained by $x_1$ and $x_2$?

From summary(), the Adjusted $R^2 = 0.9487$. That is, the model using number of cases and distance walked explains 94.87% of the total variation in delivery time.

(f) Predict delivery time when $x_1 = 8$ cases and $x_2 = 275$ ft

\[\hat{y} = 5.301 + 1.537(8) + 0.007(275) = 19.522 \text{ min}\]

newdata <- data.frame(x1 = 8, x2 = 275)
predict(fit.full, newdata, interval = "confidence", level = 0.95)

##        fit      lwr      upr
## 1 19.52711 18.51225 20.54197

predict(fit.full, newdata, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 19.52711 14.94738 24.10684

2.8 Dummy Variables for Qualitative Predictors

2.8.1 Why dummy variables?

When a predictor is qualitative (categorical), it cannot be entered directly as a number — the numeric codes 1, 2, 3 would incorrectly imply an ordering or equal spacing between groups.

Instead, we create dummy (indicator) variables: binary (0/1) variables that encode group membership.

Key rule: for a qualitative variable with $k$ groups, we create $k - 1$ dummy variables. The omitted group is the reference category (baseline); all dummy coefficients are interpreted as differences relative to that baseline.

Why $k-1$, not $k$? Using $k$ dummies would make the design matrix $\mathbf{X}$ singular — the $k$ dummy columns always sum to 1 (the intercept column), creating perfect multicollinearity. Dropping one group avoids this and lets the intercept represent the reference group’s mean.

2.8.2 General coding scheme

For a variable with $k = 3$ groups (A = reference, B, C):

Group	$D_B$	$D_C$	Interpretation
A (reference)	0	0	Intercept alone = mean of A
B	1	0	Intercept $+ \beta_B$ = mean of B
C	0	1	Intercept $+ \beta_C$ = mean of C

The model is: \[\hat{y} = b_0 + b_1 x_1 + b_B D_B + b_C D_C\]

This gives parallel regression lines (same slope, different intercepts) for each group:

Group	Equation	Intercept
A	$\hat{y} = b_0 + b_1 x_1$	$b_0$
B	$\hat{y} = (b_0 + b_B) + b_1 x_1$	$b_0 + b_B$
C	$\hat{y} = (b_0 + b_C) + b_1 x_1$	$b_0 + b_C$

Interpretation of dummy coefficients: $b_B$ estimates the mean difference in $y$ between Group B and Group A (the reference), holding $x_1$ constant. A positive $b_B$ means Group B has a higher mean response than Group A.

📝 Example 2.2: Tool Life, Lathe Speed, and Tool Type

Twenty observations on tool life ($y$, hours), lathe speed ($x_1$, rpm), and tool type ($x_2$: A or B) are shown below.

obs	$y$ (hrs)	$x_1$ (rpm)	Type	obs	$y$ (hrs)	$x_1$ (rpm)	Type
1	18.73	610	A	11	30.16	670	B
2	14.52	950	A	12	27.09	770	B
3	17.43	720	A	13	25.40	880	B
4	14.54	840	A	14	26.05	1000	B
5	13.44	980	A	15	33.49	760	B
6	24.39	530	A	16	35.62	590	B
7	13.34	680	A	17	26.07	910	B
8	22.71	540	A	18	36.78	650	B
9	12.68	890	A	19	34.95	810	B
10	19.32	730	A	20	43.67	500	B

The model is: $y = \beta_0 + \beta_1 x_1 + \beta_2 D_B + \epsilon$, where $D_B = 1$ if Tool Type B, $0$ if Tool Type A (reference).

dt <- data.frame(
  y        = c(18.73,14.52,17.43,14.54,13.44,24.39,13.34,22.71,12.68,19.32,
               30.16,27.09,25.40,26.05,33.49,35.62,26.07,36.78,34.95,43.67),
  x1       = c(610,950,720,840,980,530,680,540,890,730,
                670,770,880,1000,760,590,910,650,810,500),
  ToolType = c(rep("A",10), rep("B",10))
)

Exploratory analysis

# Scatterplot coloured by Tool Type
par(mfrow = c(1, 2))
plot(dt$x1, dt$y,
     pch = 21,
     bg  = c("steelblue","tomato")[as.integer(factor(dt$ToolType))],
     xlab = "Lathe Speed (rpm)", ylab = "Tool Life (hours)",
     main = "Tool Life vs. Speed by Type")
legend("topright", legend = c("Type A","Type B"),
       pt.bg = c("steelblue","tomato"), pch = 21, bty = "n")

# Boxplot by Tool Type
boxplot(y ~ ToolType, data = dt,
        xlab = "Tool Type", ylab = "Tool Life (hours)",
        main = "Tool Life by Tool Type",
        col  = c("steelblue","tomato"))

par(mfrow = c(1, 1))

The scatterplot shows two distinct clouds of points — one for each tool type — sitting at different vertical levels but following similar downward slopes. This strongly suggests that tool type shifts the intercept (level) of the relationship, making a dummy variable appropriate.

Create dummy variable ($D_B = 1$ if Type B, $0$ if Type A):

dt$D_B <- ifelse(dt$ToolType == "B", 1, 0)

# Verify coding
head(dt[, c("ToolType","D_B")], 5)

ToolType	D_B
A	0
A	0
A	0
A	0
A	0

tail(dt[, c("ToolType","D_B")], 5)

	ToolType	D_B
16	B	1
17	B	1
18	B	1
19	B	1
20	B	1

Fit the regression model

fit2 <- lm(y ~ x1 + D_B, data = dt)
summary(fit2)

## 
## Call:
## lm(formula = y ~ x1 + D_B, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5527 -1.7868 -0.0016  1.8395  4.9838 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.98560    3.51038  10.536 7.16e-09 ***
## x1          -0.02661    0.00452  -5.887 1.79e-05 ***
## D_B         15.00425    1.35967  11.035 3.59e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.039 on 17 degrees of freedom
## Multiple R-squared:  0.9003, Adjusted R-squared:  0.8886 
## F-statistic: 76.75 on 2 and 17 DF,  p-value: 3.086e-09

anova(fit2)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
x1	1	293.0054	293.005403	31.71568	2.99e-05
D_B	1	1125.0282	1125.028214	121.77602	0.00e+00
Residuals	17	157.0546	9.238504	NA	NA

confint(fit2, level = 0.95)

##                   2.5 %      97.5 %
## (Intercept) 29.57934160 44.39186079
## x1          -0.03614301 -0.01707145
## D_B         12.13559830 17.87290293

Test whether $x_1$ (lathe speed) should be included, given $x_2$ (tool type) is in the model:

$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
$t_0 = b_1 / se(b_1)$
From R: $t = -5.887$, p-value $= 1.79 \times 10^{-5}$
Since p-value $< 0.05$, reject $H_0$. $x_1$ (lathe speed) should be included in the model.

Test whether $x_2$ (tool type) should be included, given $x_1$ is in the model:

$H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$
$t_0 = b_2 / se(b_2)$
From R: $t = 11.035$, p-value $= 3.59 \times 10^{-9}$
Since p-value $< 0.05$, reject $H_0$. $x_2$ (tool type) should be included in the model.

Overall F-test at $\alpha = 0.05$:

$H_0: \beta_1 = \beta_2 = 0$ vs. $H_1$: at least one $\beta_j \neq 0$
$F = SS_\text{reg}(x_1, x_2)/2 \;\div\; SS_\text{res}(x_1,x_2)/(n-3)$
From R: $F = 76.75$, p-value $= 3.09 \times 10^{-9}$
Reject $H_0$. The overall model is significant.

Fitted equation (overall):

\[\hat{y} = 36.986 - 0.027\, x_1 + 15.004\, D_B\]

Fitted equations by tool type:

\[ \begin{aligned} \text{Type A } (D_B=0): \quad \hat{y}_A &= 36.986 - 0.027\, x_1 \\ \text{Type B } (D_B=1): \quad \hat{y}_B &= (36.986 + 15.004) - 0.027\, x_1 = 51.990 - 0.027\, x_1 \end{aligned} \]

Both equations share the same slope ($-0.027$: tool life decreases by 0.027 hours per additional rpm), but Tool Type B has an intercept 15.004 hours higher than Type A. This means, at any given lathe speed, Tool Type B is estimated to last about 15 hours longer than Tool Type A.

Plot the two fitted lines:

b <- coef(fit2)
x_range <- seq(min(dt$x1), max(dt$x1), length.out = 100)

plot(dt$x1, dt$y,
     pch  = 21,
     bg   = c("steelblue","tomato")[as.integer(factor(dt$ToolType))],
     xlab = "Lathe Speed (rpm)", ylab = "Tool Life (hours)",
     main = "Fitted Regression Lines by Tool Type")
lines(x_range, b[1]            + b[2]*x_range, col = "steelblue", lwd = 2)
lines(x_range, b[1] + b[3]     + b[2]*x_range, col = "tomato",    lwd = 2)
legend("topright",
       legend = c("Type A (observed)","Type B (observed)",
                  "Type A (fitted)",  "Type B (fitted)"),
       pt.bg  = c("steelblue","tomato", NA, NA),
       pch    = c(21, 21, NA, NA),
       lty    = c(NA, NA, 1, 1),
       col    = c("black","black","steelblue","tomato"),
       lwd    = 2, bty = "n")

95% confidence interval for $\beta_2$ (the tool-type effect):

\[b_2 \pm t_{0.975,\;17} \cdot se(b_2) = 15.004 \pm (2.110)(1.360) \;\Rightarrow\; 12.135 \leq \beta_2 \leq 17.873\]

Interpretation: We are 95% confident that Tool Type B lasts between 12.1 and 17.9 hours longer than Tool Type A at the same lathe speed.

Residual diagnostics:

par(mfrow = c(2, 2))
plot(fit2, pch = 16,
     col = c("steelblue","tomato")[as.integer(factor(dt$ToolType))])

par(mfrow = c(1, 1))

2.9 Variable Selection

2.9.1 Why variable selection?

With many candidate predictors:

Including irrelevant predictors inflates variance of the estimates, reducing precision.
Including too few predictors causes bias (underfitting).
Variable selection finds a parsimonious model: small, interpretable, and still fitting well.

All three standard methods use AIC (Akaike Information Criterion) to compare models:

\[\text{AIC} = n\ln(SS_\text{res}/n) + 2(k+1)\]

Lower AIC = better balance between goodness of fit and model complexity.

Method	Starting point	Strategy	Stops when
Forward selection	Intercept only	Add best predictor at each step	No remaining predictor reduces AIC
Backward elimination	Full model	Remove worst predictor at each step	No remaining predictor increases AIC if removed
Stepwise (both)	Full model	Add or remove at each step	No change improves AIC

❗ Which method to prefer?

Stepwise regression is generally preferred because it can reconsider variables added or removed in earlier steps. Forward and backward selections cannot undo earlier decisions. However, none of the three methods guarantees finding the globally optimal model — always check whether the selected model makes practical sense and satisfies the regression assumptions.

2.9.2 Variable selection on the `mtcars` data

The built-in mtcars dataset records fuel efficiency and characteristics of 32 automobiles. We illustrate variable selection with mpg (fuel efficiency) as the response.

Variable	Description	Type
`mpg`	Miles per US gallon	Response
`cyl`	Number of cylinders	Discrete (4, 6, 8)
`disp`	Engine displacement (cu.in.)	Numeric
`hp`	Gross horsepower	Numeric
`drat`	Rear axle ratio	Numeric
`wt`	Weight (1,000 lbs)	Numeric
`qsec`	¼-mile time (seconds)	Numeric
`vs`	Engine shape (0 = V, 1 = straight)	Binary
`am`	Transmission (0 = auto, 1 = manual)	Binary
`gear`	Number of forward gears	Discrete
`carb`	Number of carburetors	Discrete

data("mtcars")
null_model <- lm(mpg ~ 1, data = mtcars)
full_model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
                 data = mtcars)

2.9.3 Forward selection

Starts with the intercept-only model and adds the predictor that most reduces AIC at each step. Stops when no addition improves AIC.

fw.fit <- step(null_model,
               scope     = list(lower = null_model, upper = full_model),
               direction = "forward",
               trace     = 1)

## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + cyl   1    817.71  308.33  76.494
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## + drat  1    522.48  603.57  97.988
## + vs    1    496.53  629.52  99.335
## + am    1    405.15  720.90 103.672
## + carb  1    341.78  784.27 106.369
## + gear  1    259.75  866.30 109.552
## + qsec  1    197.39  928.66 111.776
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq    RSS    AIC
## + cyl   1    87.150 191.17 63.198
## + hp    1    83.274 195.05 63.840
## + qsec  1    82.858 195.46 63.908
## + vs    1    54.228 224.09 68.283
## + carb  1    44.602 233.72 69.628
## + disp  1    31.639 246.68 71.356
## <none>              278.32 73.217
## + drat  1     9.081 269.24 74.156
## + gear  1     1.137 277.19 75.086
## + am    1     0.002 278.32 75.217
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1   14.5514 176.62 62.665
## + carb  1   13.7724 177.40 62.805
## <none>              191.17 63.198
## + qsec  1   10.5674 180.60 63.378
## + gear  1    3.0281 188.14 64.687
## + disp  1    2.6796 188.49 64.746
## + vs    1    0.7059 190.47 65.080
## + am    1    0.1249 191.05 65.177
## + drat  1    0.0010 191.17 65.198
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## + am    1    6.6228 170.00 63.442
## + disp  1    6.1762 170.44 63.526
## + carb  1    2.5187 174.10 64.205
## + drat  1    2.2453 174.38 64.255
## + qsec  1    1.4010 175.22 64.410
## + gear  1    0.8558 175.76 64.509
## + vs    1    0.0599 176.56 64.654

summary(fw.fit)

## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## hp          -0.01804    0.01188  -1.519 0.140015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

Result: Forward selection chose wt, cyl, and hp.

Fitted equation: $\hat{\text{mpg}} = 38.751 - 3.167\,\text{wt} - 0.942\,\text{cyl} - 0.018\,\text{hp}$

2.9.4 Backward elimination

Starts with the full model and removes the predictor whose removal most reduces AIC at each step.

be.fit <- step(full_model,
               direction = "backward",
               trace     = 1)

## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790

summary(be.fit)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Result: Backward elimination chose wt, qsec, and am.

Fitted equation: $\hat{\text{mpg}} = 9.618 - 3.917\,\text{wt} + 1.226\,\text{qsec} + 2.936\,\text{am}$

2.9.5 Stepwise regression

Combines both directions — at each step the algorithm may add or remove a variable. Generally the most reliable of the three.

sw.fit <- step(full_model,
               direction = "both",
               trace     = 1)

## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## + cyl   1    0.0799 147.49 70.898
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## + vs    1    0.2685 147.57 68.915
## + cyl   1    0.1883 147.66 68.932
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## + carb  1     0.685 147.84 66.973
## + vs    1     0.434 148.09 67.028
## + cyl   1     0.414 148.11 67.032
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## + gear  1     1.565 148.53 65.121
## + cyl   1     1.003 149.09 65.242
## + vs    1     0.645 149.45 65.319
## + carb  1     0.107 149.99 65.434
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## + drat  1     3.345 150.09 63.457
## + gear  1     2.977 150.46 63.535
## + cyl   1     2.447 150.99 63.648
## + vs    1     1.121 152.32 63.927
## + carb  1     0.011 153.43 64.160
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## + disp  1     6.629 153.44 62.162
## + carb  1     3.227 156.84 62.864
## + drat  1     1.428 158.64 63.229
## - qsec  1    20.225 180.29 63.323
## + cyl   1     0.249 159.82 63.465
## + vs    1     0.249 159.82 63.466
## + gear  1     0.171 159.90 63.481
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## + hp    1     9.219 160.07 61.515
## + carb  1     8.036 161.25 61.751
## + disp  1     3.276 166.01 62.682
## + cyl   1     1.501 167.78 63.022
## + drat  1     1.400 167.89 63.042
## + gear  1     0.123 169.16 63.284
## + vs    1     0.000 169.29 63.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790

summary(sw.fit)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Result: Stepwise also chose wt, qsec, and am — same as backward elimination.

Fitted equation: $\hat{\text{mpg}} = 9.618 - 3.917\,\text{wt} + 1.226\,\text{qsec} + 2.936\,\text{am}$

💡 Comparing the three results

Method	Selected variables	Adj. $R^2$
Forward	`wt`, `cyl`, `hp`	~0.826
Backward	`wt`, `qsec`, `am`	~0.834
Stepwise	`wt`, `qsec`, `am`	~0.834

Forward selection disagrees here — it chose a different model. When methods disagree, compare adjusted $R^2$, check diagnostics on each, and prefer the model with better interpretability and better-satisfied assumptions.

2.10 Model Diagnostics

2.10.1 Why check assumptions?

The linear regression model assumes:

\[y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)\]

This implies four conditions — LINE:

Condition	Meaning
Linearity	$E(y)$ is a linear function of the predictors
Independence	Errors $\epsilon_i$ are independent of each other
Normality	Errors follow a normal distribution
Equal variance	$\text{Var}(\epsilon_i) = \sigma^2$ (constant) for all $i$

If these are met, our $t$-tests, F-tests, and confidence intervals are valid. If they are violated, inference is unreliable — “garbage in, garbage out.”

2.10.2 Graphical diagnostics

R’s plot(fit) produces four standard diagnostic plots. We use the stepwise-selected model (sw.fit) as the working example.

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
par(mfrow = c(2, 2))
plot(fit, pch = 16, col = "steelblue")

par(mfrow = c(1, 1))

Plot	What it checks	What “good” looks like	Warning signs
Residuals vs. Fitted	Linearity + constant variance	Random scatter around $e = 0$; even spread	Curve (non-linearity); funnel shape (non-constant variance)
Normal Q-Q	Normality of residuals	Points close to the diagonal line	S-shape (heavy tails); systematic bending
Scale-Location	Constant variance (more sensitive)	Flat red line; even vertical spread	Upward or downward trend in red line
Residuals vs. Leverage	Influential observations	All points inside Cook’s distance contours	Points in upper/lower right corner (high leverage + large residual)

2.10.3 Formal test: constant variance (Breusch-Pagan)

\[H_0: \text{errors have constant variance} \qquad H_1: \text{non-constant variance (heteroscedasticity)}\]

library(lmtest)
bptest(fit)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 6.1871, df = 3, p-value = 0.1029

Since p-value $> 0.05$, we fail to reject $H_0$. The constant variance assumption is satisfied.

2.10.4 Formal test: normality (Shapiro-Wilk)

\[H_0: \text{errors are normally distributed} \qquad H_1: \text{errors are not normal}\]

res <- resid(fit)
par(mfrow = c(1, 2))
hist(res,
     main = "Histogram of Residuals",
     xlab = "Residuals", col = "lightblue", border = "white",
     freq = FALSE)
curve(dnorm(x, mean(res), sd(res)), add = TRUE, col = "red", lwd = 2)
qqnorm(res, pch = 16, col = "steelblue", main = "Normal Q-Q Plot")
qqline(res, col = "red", lwd = 2)

par(mfrow = c(1, 1))

shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.9411, p-value = 0.08043

Since p-value $> 0.05$, we fail to reject $H_0$. The normality assumption is satisfied.

2.10.5 Formal test: independence (Durbin-Watson)

\[H_0: \text{errors are not autocorrelated (independent)} \qquad H_1: \text{errors are autocorrelated}\]

dwtest(fit)

## 
##  Durbin-Watson test
## 
## data:  fit
## DW = 1.7141, p-value = 0.1263
## alternative hypothesis: true autocorrelation is greater than 0

Since p-value $> 0.05$, we fail to reject $H_0$. The independence assumption is satisfied.

2.10.6 Summary of assumption checks

Assumption	Graphical check	Formal test	Satisfied if
Linearity	Residuals vs. Fitted	—	No pattern in the plot
Constant variance	Scale-Location	Breusch-Pagan	p-value $> 0.05$
Normality	Q-Q plot, histogram	Shapiro-Wilk	p-value $> 0.05$
Independence	Residuals vs. order	Durbin-Watson	p-value $> 0.05$

2.10.7 Dealing with violated assumptions

If an assumption is violated, consider the remedies in the table below before drawing any conclusions.

Violated assumption	Possible remedy
Non-linearity	Add polynomial terms ($x^2$, $x^3$); transform $x$; fit a different model
Non-constant variance	Apply a variance-stabilising transformation to $y$ (log, square root)
Non-normality	Transform $y$; check for and address outliers
Autocorrelation	Include a time-trend variable; use time-series regression methods

2.10.8 Box-Cox transformation

When the response $y$ is strictly positive and the model shows non-normality or non-constant variance, the Box-Cox family of power transformations provides a systematic approach to choosing a variance-stabilising transformation:

\[y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \ln(y) & \lambda = 0 \end{cases}\]

Common values of $\lambda$ and their corresponding transformations:

$\lambda$	Transformation	Use when
$-1$	Reciprocal $1/y$	Variance increases rapidly with mean
$-0.5$	Reciprocal square root $1/\sqrt{y}$
$0$	Log transformation $\ln(y)$	Most common remedy for right skew / heteroscedasticity
$0.5$	Square root $\sqrt{y}$	Count data; mild variance inflation
$1$	No transformation	Assumptions already satisfied
$2$	Square $y^2$	Variance decreases with mean (rare)

Finding the optimal $\lambda$ in R:

library(MASS)
boxcox(fit, lambda = seq(-3, 3, by = 0.05))

The plot shows the log-likelihood as a function of $\lambda$. The peak of the curve is the maximum-likelihood estimate of $\lambda$. The dashed vertical lines mark the 95% confidence interval for $\lambda$.

If the CI contains $\lambda = 1$: no transformation is needed.

If the CI contains $\lambda = 0$: use a log transformation ($\ln y$).

If the CI contains $\lambda = 0.5$: use a square-root transformation ($\sqrt{y}$).

For the mtcars model, the peak is near $\lambda = 1$, suggesting no transformation is necessary — consistent with the diagnostic tests above.

# If lambda = 0 were indicated, apply log transformation:
fit_log <- lm(log(mpg) ~ wt + qsec + am, data = mtcars)
summary(fit_log)

## 
## Call:
## lm(formula = log(mpg) ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13879 -0.08114 -0.03466  0.07030  0.26575 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.69410    0.31326   8.600 2.40e-09 ***
## wt          -0.22456    0.03201  -7.015 1.25e-07 ***
## qsec         0.05329    0.01299   4.101  0.00032 ***
## am           0.08558    0.06351   1.347  0.18863    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1107 on 28 degrees of freedom
## Multiple R-squared:  0.8752, Adjusted R-squared:  0.8619 
## F-statistic: 65.47 on 3 and 28 DF,  p-value: 9.036e-13

# Then re-check diagnostics on the transformed model
par(mfrow = c(2, 2))
plot(fit_log, pch = 16, col = "steelblue")

par(mfrow = c(1, 1))

3 Nonparametric Statistics

Learning Objectives

After completing this chapter, students should be able to:

Understand the concept and rationale of nonparametric methods
Identify when nonparametric tests are appropriate
Apply the sign test, Wilcoxon signed-rank test, Mann–Whitney U test, Kruskal–Wallis test, Friedman test, Spearman rank correlation, and chi-squared test
Perform post-hoc comparisons after significant omnibus tests
Interpret and report results in practical contexts

3.1 Introduction

3.1.1 What are nonparametric methods?

Nonparametric (or distribution-free) statistical methods do not require the data to follow a specific probability distribution. They are often based on ranks or signs rather than the raw numerical values, making them robust to outliers and skewed data.

Compared to their parametric counterparts, nonparametric methods:

Make fewer assumptions about the population distribution
Are applicable to ordinal (ranked) data as well as continuous data
Are often the only valid choice for small samples where normality cannot be verified
Tend to have less statistical power than parametric tests when the parametric assumptions are actually met

3.1.2 When to use nonparametric methods

Situation	Why nonparametric?
Small sample size ($n < 30$)	Cannot verify normality; Central Limit Theorem may not apply
Population distribution unknown or clearly non-normal	Parametric inference would be invalid
Presence of extreme outliers	Outliers heavily distort means and parametric tests
Ordinal data (rankings, Likert scales)	Means are not meaningful for ordinal scales
Data include only directions or signs	Magnitude is unknown or unreliable

3.1.3 Which test to use?

Use this table to choose the appropriate test based on your study design:

Design	Parametric test	Nonparametric equivalent
One sample: test median	One-sample $t$-test	Sign test or Wilcoxon signed-rank test
Two related (paired) samples	Paired $t$-test	Sign test or Wilcoxon signed-rank test
Two independent samples	Independent $t$-test	Mann–Whitney U test
$k > 2$ related samples (blocks)	Repeated-measures ANOVA	Friedman test
$k > 2$ independent samples	One-way ANOVA	Kruskal–Wallis test
Association between two variables	Pearson correlation	Spearman rank correlation
Association between two categorical variables	—	Chi-squared test of independence

3.2 Sign Test and Wilcoxon Signed-Rank Test for One Sample

Both tests address the same question — does the population median equal a specified value $\theta_0$? — but differ in what information they use:

Test	Uses	When to prefer
Sign test	Only the direction (above/below $\theta_0$)	Very small samples; data are ordinal; distribution may be asymmetric
Wilcoxon signed-rank	Direction and magnitude of differences	Symmetric distribution; more powerful than sign test

3.2.1 Sign test for one sample

Hypotheses (choose one):

\[ \begin{aligned} (a)\; &H_0: \theta \leq \theta_0 \quad H_1: \theta > \theta_0 \\ (b)\; &H_0: \theta \geq \theta_0 \quad H_1: \theta < \theta_0 \\ (c)\; &H_0: \theta = \theta_0 \quad H_1: \theta \neq \theta_0 \end{aligned} \]

Procedure:

Compute $d_i = x_i - \theta_0$ for each observation.
Assign signs: “$+$” if $d_i > 0$, “$-$” if $d_i < 0$; ignore ties ($d_i = 0$).
Let $n$ = number of non-tied observations; let $T^+$ = number of positive signs.
Under $H_0$, $T^+ \sim \text{Binomial}(n,\, 0.5)$.
Compute p-value using binom.test() in R.

📝 Example 3.1: PM2.5 Concentration

Is the median PM2.5 concentration equal to 35? Data from 5 measurements: 30, 40, 35, 50, 45.

x      <- c(30, 40, 35, 50, 45)
theta0 <- 35

par(mfrow = c(1, 2))
boxplot(x, main = "PM2.5 Concentration", ylab = "PM2.5", col = "lightblue")
hist(x, main = "Histogram", xlab = "PM2.5", col = "lightblue", border = "white")

par(mfrow = c(1, 1))

Tpos <- sum(x > theta0)
Tneg <- sum(x < theta0)
cat("T+ =", Tpos, "  T- =", Tneg, "  Ties =", sum(x == theta0), "\n")

## T+ = 3   T- = 1   Ties = 1

binom.test(x = Tpos, n = Tpos + Tneg, alternative = "two.sided")

## 
##  Exact binomial test
## 
## data:  Tpos and Tpos + Tneg
## number of successes = 3, number of trials = 4, p-value = 0.625
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1941204 0.9936905
## sample estimates:
## probability of success 
##                   0.75

Hypothesis test:

$H_0: \theta = 35$ vs. $H_1: \theta \neq 35$
$T^+ \sim \text{Binomial}(n=4,\; 0.5)$ (one tie excluded)
From R: $T^+ = 3$, p-value $= 0.625$
Since p-value $> 0.05$, we fail to reject $H_0$. There is insufficient evidence that the median PM2.5 concentration differs from 35.

📝 Example 3.2: Caffeine in Starbucks Latte

A scientist purchased 16-oz Starbucks Lattes on 10 consecutive days and measured the caffeine content (mg). Does the median caffeine content exceed 40 mg?

Day	1	2	3	4	5	6	7	8	9	10
Caffeine (mg)	41	33	43	52	46	37	44	49	53	30

caffeine <- c(41, 33, 43, 52, 46, 37, 44, 49, 53, 30)
theta0   <- 40

boxplot(caffeine, ylab = "Caffeine (mg)", main = "Caffeine in Starbucks Latte",
        col = "lightyellow")
abline(h = theta0, col = "red", lty = 2)
legend("topright", legend = paste("θ₀ =", theta0), col = "red", lty = 2, bty = "n")

Tpos <- sum(caffeine > theta0)
Tneg <- sum(caffeine < theta0)
cat("T+ =", Tpos, "  T- =", Tneg, "  Ties =", sum(caffeine == theta0), "\n")

## T+ = 7   T- = 3   Ties = 0

binom.test(x = Tpos, n = Tpos + Tneg, alternative = "greater")

## 
##  Exact binomial test
## 
## data:  Tpos and Tpos + Tneg
## number of successes = 7, number of trials = 10, p-value = 0.1719
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.3933758 1.0000000
## sample estimates:
## probability of success 
##                    0.7

Sign table:

Day	1	2	3	4	5	6	7	8	9	10
Caffeine (mg)	41	33	43	52	46	37	44	49	53	30
Sign	+	−	+	+	+	−	+	+	+	−

Hypothesis test:

$H_0: \theta \leq 40$ vs. $H_1: \theta > 40$
$T^+ \sim \text{Binomial}(n = 10,\; 0.5)$
From R: $T^+ = 7$, p-value $= 0.172$
Since p-value $> 0.05$, we fail to reject $H_0$. The data do not support the claim that median caffeine content exceeds 40 mg.

📝 Example 3.3: Snow Cones Sold Per Day

A convenience store owner claims the median number of snow cones sold per day is 40. A random sample of 20 days gives the following counts. Test at $\alpha = 0.05$.

x <- c(42,43,40,25,22,45,49,32,37,36,39,44,59,40,48,56,50,44,49,52)

boxplot(x, ylab = "Snow Cones Sold", main = "Daily Snow Cone Sales",
        col = "lightcyan")
abline(h = 40, col = "red", lty = 2)

T.pos <- sum(x > 40);  T.neg <- sum(x < 40);  T.ze <- sum(x == 40)
cat("T+ =", T.pos, "  T- =", T.neg, "  Ties =", T.ze, "\n")

## T+ = 12   T- = 6   Ties = 2

binom.test(x = T.pos, n = T.pos + T.neg, alternative = "two.sided")

## 
##  Exact binomial test
## 
## data:  T.pos and T.pos + T.neg
## number of successes = 12, number of trials = 18, p-value = 0.2379
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4099252 0.8665726
## sample estimates:
## probability of success 
##              0.6666667

Sign table:

x	42	43	40	25	22	45	49	32	37	36
Sign	+	+	0	−	−	+	+	−	−	−
x	39	44	59	40	48	56	50	44	49	52
Sign	−	+	+	0	+	+	+	+	+	+

Hypothesis test:

$H_0: \theta = 40$ vs. $H_1: \theta \neq 40$
$T^+ \sim \text{Binomial}(n = 18,\; 0.5)$ (2 ties excluded)
From R: $T^+ = 12$, p-value $= 0.238$
Since p-value $> 0.05$, we fail to reject $H_0$. There is no significant evidence that the median number of daily snow cone sales differs from 40.

3.2.2 Wilcoxon signed-rank test for one sample

This test improves on the sign test by incorporating the magnitude of differences, not just their direction. It assumes the distribution of $X - \theta_0$ is symmetric.

Procedure:

Compute $d_i = x_i - \theta_0$ for each observation.
Remove zeros; let $n$ = remaining observations.
Rank the absolute differences $|d_i|$ from smallest (rank 1) to largest. For ties, assign the average rank.
Attach the original sign of $d_i$ to each rank.
Test statistic: $W^+ = \sum_{\{i: d_i > 0\}} R_i$ (sum of positive ranks).
Compute p-value using wilcox.test() in R.

📝 Example 3.4: Employee Weekly Income

A temporary employment agency claims its employees earn a median weekly income greater than 500. A sample of 8 employee records gives: 505, 508, 499, 504, 510, 497, 500, 492. Can the claim be verified at $\alpha = 0.05$?

x <- c(505, 508, 499, 504, 510, 497, 500, 492)

boxplot(x, ylab = "Weekly Income", main = "Employee Weekly Income",
        col = "lightgreen")
abline(h = 500, col = "red", lty = 2)

wilcox.test(x = x, mu = 500, alternative = "greater")

## 
##  Wilcoxon signed rank exact test
## 
## data:  x
## V = 23.5, p-value = 0.2266
## alternative hypothesis: true location is greater than 500

Rank table ($\theta_0 = 500$):

Obs	$x$	$d = x - 500$	$\|d\|$	Rank of $\|d\|$	Signed rank
1	505	+5	5	4	+4
2	508	+8	8	5.5	+5.5
3	499	−1	1	1	−1
4	504	+4	4	3	+3
5	510	+10	10	7	+7
6	497	−3	3	2	−2
7	500	0	—	—	(tie, excluded)
8	492	−8	8	5.5	−5.5

$W^+ = 4 + 5.5 + 3 + 7 = 19.5$

Hypothesis test:

$H_0: \theta \leq 500$ vs. $H_1: \theta > 500$
Test statistic: $W^+ = \sum R_i^+$
From R: $W^+ = 19.5$ (NOTE that R reports 23.5 as it takes 0 (tie) into ranking) and p-value $= 0.0469$
Since p-value $< 0.05$, we reject $H_0$. The data support the agency’s claim that the median income exceeds 500.

📝 Example 3.5: Checkout Waiting Times

A supermarket claims checkout waiting times are at most 4 minutes. An angry customer records 6 waiting times (min): 3.8, 5.3, 3.5, 4.5, 7.2, 5.1. Can the customer dispute the claim at $\alpha = 0.05$?

x <- c(3.8, 5.3, 3.5, 4.5, 7.2, 5.1)

boxplot(x, ylab = "Waiting Time (min)", main = "Checkout Waiting Times",
        col = "lightyellow")
abline(h = 4, col = "red", lty = 2)

wilcox.test(x = x, mu = 4, alternative = "greater")

## 
##  Wilcoxon signed rank exact test
## 
## data:  x
## V = 17.5, p-value = 0.09375
## alternative hypothesis: true location is greater than 4

Rank table ($\theta_0 = 4$):

Obs	$x$	$d = x-4$	$\|d\|$	Rank	Signed rank
1	3.8	−0.2	0.2	1	−1
2	5.3	+1.3	1.3	5	+5
3	3.5	−0.5	0.5	2.5	−2.5
4	4.5	+0.5	0.5	2.5	+2.5
5	7.2	+3.2	3.2	6	+6
6	5.1	+1.1	1.1	4	+4

$W^+ = 5 + 2.5 + 6 + 4 = 17.5$

Hypothesis test:

$H_0: \theta \leq 4$ vs. $H_1: \theta > 4$
Test statistic: $W^+ = \sum R_i^+$
From R: $W^+ = 17.5$ (same as R reports as it has mo tie), p-value $= 0.0859$
Since p-value $> 0.05$, we fail to reject $H_0$ at the 5% level. There is insufficient evidence to dispute the store’s claim.

📝 Example 3.6: Law College Exam Scores

A sample of 10 law students’ marks: 71, 79, 40, 70, 82, 72, 60, 76, 69, 75. Test at $\alpha = 0.05$ whether the median mark is less than or equal to 67.

x <- c(71, 79, 40, 70, 82, 72, 60, 76, 69, 75)

boxplot(x, ylab = "Mark", main = "Law College Exam Scores",
        col = "lightpink")
abline(h = 67, col = "red", lty = 2)

wilcox.test(x = x, mu = 67, alternative = "less")

## 
##  Wilcoxon signed rank exact test
## 
## data:  x
## V = 40, p-value = 0.9033
## alternative hypothesis: true location is less than 67

Hypothesis test:

$H_0: \theta \geq 67$ vs. $H_1: \theta < 67$
Test statistic: $W^+ = \sum R_i^+$; From R: $W^+ = 40$, p-value $= 0.9033$
Since p-value $> 0.05$, we fail to reject $H_0$. The data do not support the claim that the median mark is less than 67.

3.3 Sign Test and Wilcoxon Signed-Rank Test for Two Related (Paired) Samples

When the same subjects are observed under two conditions (before/after, treatment/control), the observations form matched pairs. We analyse the differences $d_i = x_i - y_i$ (or $y_i - x_i$) and test whether the median difference is zero.

Sign test: uses only the sign (direction) of each difference.
Wilcoxon signed-rank test: uses the sign and magnitude — more powerful when the differences are approximately symmetric.

3.3.1 Sign test for paired samples

📝 Example 3.7: Effectiveness of a Relaxation Advertisement

Fifteen people rated their relaxation level (1 = very tense, 5 = completely relaxed) before and after viewing an advertisement. Is there evidence at $\alpha = 0.05$ that the ad increased relaxation?

Person	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
Before	3	2	3	2	2	5	3	5	3	1	2	3	3	3	3	3	3
After	3	1	5	5	5	4	5	5	5	5	4	4	5	5	3	2	5

dt <- data.frame(
  before = c(3,2,3,2,2,5,3,5,3,1,2,3,3,3,3,3,3),
  after  = c(3,1,5,5,5,4,5,5,5,5,4,4,5,5,3,2,5)
)

# Visualise differences
diff_val <- dt$after - dt$before
boxplot(diff_val, ylab = "After − Before",
        main = "Change in Relaxation Score", col = "lightblue")
abline(h = 0, col = "red", lty = 2)

T.pos <- sum(dt$before > dt$after)
T.neg <- sum(dt$before < dt$after)
cat("T+ (Before > After) =", T.pos, " T- (Before < After) =", T.neg, "\n")

## T+ (Before > After) = 3  T- (Before < After) = 11

# Test H1: before < after (ad increases relaxation → more + after)
binom.test(x = T.pos, n = T.pos + T.neg, alternative = "less")

## 
##  Exact binomial test
## 
## data:  T.pos and T.pos + T.neg
## number of successes = 3, number of trials = 14, p-value = 0.02869
## alternative hypothesis: true probability of success is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.4656566
## sample estimates:
## probability of success 
##              0.2142857

Sign table:

Person	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
Before	3	2	3	2	2	5	3	5	3	1	2	3	3	3	3	3	3
After	3	1	5	5	5	4	5	5	5	5	4	4	5	5	3	2	5
Sign (Before vs After)	0	+	−	−	−	+	−	0	−	−	−	−	−	−	0	+	−

The sign “$+$” here means Before > After (relaxation decreased). We count $T^+ = 3$ (number of cases where Before > After).

Hypothesis test:

$H_0: \theta_D \geq 0$ vs. $H_1: \theta_D < 0$ (ad increases relaxation, so median After > Before)
$T^+ \sim \text{Binomial}(n=14,\; 0.5)$ (3 ties excluded)
From R: $T^+ = 3$, p-value $= 0.0287$
Since p-value $< 0.05$, we reject $H_0$. The ad is effective in increasing relaxation.

📝 Example 3.8: Sales Training Program (Sign Test)

Six sales associates completed a training program. Before and after sales figures are shown below. At $\alpha = 0.10$, test whether the training was effective (After > Before).

Associate	1	2	3	4	5	6
Before	23	35	29	33	43	32
After	28	38	29	37	42	30
Sign	−	−	0	−	+	+

dt2 <- data.frame(before = c(23,35,29,33,43,32),
                  after  = c(28,38,29,37,42,30))

boxplot(dt2$after - dt2$before, ylab = "After − Before",
        main = "Change in Sales After Training", col = "lightgreen")
abline(h = 0, col = "red", lty = 2)

T.pos <- sum(dt2$before > dt2$after)
T.neg <- sum(dt2$before < dt2$after)
binom.test(x = T.pos, n = T.pos + T.neg, alternative = "less")

## 
##  Exact binomial test
## 
## data:  T.pos and T.pos + T.neg
## number of successes = 2, number of trials = 5, p-value = 0.5
## alternative hypothesis: true probability of success is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.8107446
## sample estimates:
## probability of success 
##                    0.4

Hypothesis test:

$H_0: \theta_D \geq 0$ vs. $H_1: \theta_D < 0$ (Before < After)
$T^+ \sim \text{Binomial}(n=5,\; 0.5)$ (1 tie excluded)
From R: $T^+ = 2$, p-value $= 0.500$
Since p-value $> 0.10$, we fail to reject $H_0$. The sign test does not provide evidence that the training was effective.

3.3.2 Wilcoxon signed-rank test for paired samples

📝 Example 3.9: Sales Training Program (Wilcoxon)

Using the same sales training data as Example 3.8, apply the more powerful Wilcoxon signed-rank test at $\alpha = 0.10$.

wilcox.test(x = dt2$before, y = dt2$after,
            paired = TRUE, alternative = "less")

## 
##  Wilcoxon signed rank exact test
## 
## data:  dt2$before and dt2$after
## V = 5, p-value = 0.1875
## alternative hypothesis: true location shift is less than 0

Rank table (differences $d_i =$ Before $-$ After):

Obs	Before	After	$d$	$\|d\|$	Rank	Signed rank
1	23	28	−5	5	5	−5
2	35	38	−3	3	3	−3
3	29	29	0	—	—	(excluded)
4	33	37	−4	4	4	−4
5	43	42	+1	1	1	+1
6	32	30	+2	2	2	+2

$W^+ = 1 + 2 = 3$

Hypothesis test:

$H_0: \theta_D \geq 0$ vs. $H_1: \theta_D < 0$
Test statistic: $W^+ = \sum R_i^+ = 3$ (R reports $V=5$ as it takes tie into ranking); p-value $= 0.1563$
Since p-value $> 0.10$, we fail to reject $H_0$. The Wilcoxon test also does not detect a significant training effect at this sample size.

Comparison: Both the sign test and Wilcoxon test fail to reject $H_0$, partly due to the very small sample size ($n = 5$ after removing the tie). A larger study would provide more reliable conclusions.

📝 Example 3.10: Prison Rehabilitation Program

A psychologist claims a rehabilitation course for first-time offenders reduces the number of repeat offenders. Repeat offender counts at 8 prisons were recorded before and after the programme.

Prison	1	2	3	4	5	6	7	8
Before	68	49	53	75	49	41	58	75
After	63	41	54	71	39	44	58	67

dt3 <- data.frame(Prison = 1:8,
                  Before = c(68,49,53,75,49,41,58,75),
                  After  = c(63,41,54,71,39,44,58,67))

boxplot(dt3$Before - dt3$After, ylab = "Before − After",
        main = "Reduction in Repeat Offenders", col = "lightcoral")
abline(h = 0, col = "red", lty = 2)

wilcox.test(x = dt3$Before, y = dt3$After,
            paired = TRUE, alternative = "greater")

## 
##  Wilcoxon signed rank exact test
## 
## data:  dt3$Before and dt3$After
## V = 30, p-value = 0.04688
## alternative hypothesis: true location shift is greater than 0

Rank table (differences $d_i =$ Before $-$ After):

Prison	1	2	3	4	5	6	7	8
$d$	+5	+8	−1	+4	+10	−3	0	+8
$\|d\|$	5	8	1	4	10	3	—	8
Rank	4	5.5	1	3	7	2	—	5.5

$W^+ = 4 + 5.5 + 3 + 7 + 5.5 = 25$

Hypothesis test:

$H_0: \theta_D \leq 0$ vs. $H_1: \theta_D > 0$ (Before > After, i.e., counts decreased)
From R: $W^+ = 25$ (R reports $V=30$ as it takes tie into ranking), p-value $= 0.0377$
Since p-value $< 0.05$, we reject $H_0$. The rehabilitation programme significantly reduced repeat offending.

📝 Example 3.11: Midterm vs. Final Exam Scores

A statistics professor asks whether there is a significant difference between students’ midterm and final exam scores. Eight students’ scores are shown below. Test at $\alpha = 0.10$.

Student	1	2	3	4	5	6	7	8
Midterm	25	30	28	34	23	32	26	25
Final	19	32	21	34	19	30	29	29

dt4 <- data.frame(Midterm = c(25,30,28,34,23,32,26,25),
                  Final   = c(19,32,21,34,19,30,29,29))

boxplot(dt4$Midterm - dt4$Final, ylab = "Midterm − Final",
        main = "Score Difference: Midterm minus Final", col = "lavender")
abline(h = 0, col = "red", lty = 2)

wilcox.test(x = dt4$Midterm, y = dt4$Final,
            paired = TRUE, alternative = "two.sided")

## 
##  Wilcoxon signed rank exact test
## 
## data:  dt4$Midterm and dt4$Final
## V = 23, p-value = 0.5
## alternative hypothesis: true location shift is not equal to 0

Rank table (differences $d_i =$ Midterm $-$ Final):

Student	1	2	3	4	5	6	7	8
$d$	+6	−2	+7	0	+4	+2	−3	−4
$\|d\|$	6	2	7	—	4	2	3	4
Rank	6	1.5	7	—	4.5	1.5	3	4.5

$W^+ = 6 + 7 + 4.5 + 1.5 = 19$

Hypothesis test:

$H_0: \theta_D = 0$ vs. $H_1: \theta_D \neq 0$
From R: $W^+ = 19$ (R reports $V=23$ as it takes 0 in tie into ranking), p-value $= 0.4453$
Since p-value $> 0.10$, we fail to reject $H_0$. There is no significant difference between midterm and final scores.

3.4 Mann–Whitney U Test for Two Independent Samples

The Mann–Whitney U test (also called the Wilcoxon rank-sum test) compares the distributions of two independent groups. It is the nonparametric counterpart of the independent-samples $t$-test.

When to use: Two independent groups; response is at least ordinal; normality is not assumed.

3.4.1 Procedure

Pool all $n + m$ observations from both groups and rank them from smallest (rank 1) to largest. Assign average ranks to ties.
Let $S = \sum R(X_i)$ = sum of ranks for Group 1 (size $n$).
Compute test statistic: $U = S - \dfrac{n(n+1)}{2}$
In R: wilcox.test(x, y, paired = FALSE)

Hypotheses:

\[ \begin{aligned} (a)\; &H_0: \theta_X = \theta_Y, \quad H_1: \theta_X \neq \theta_Y \quad \text{(two-sided)} \\ (b)\; &H_0: \theta_X \geq \theta_Y, \quad H_1: \theta_X < \theta_Y \quad \text{(one-sided)} \\ (c)\; &H_0: \theta_X \leq \theta_Y, \quad H_1: \theta_X > \theta_Y \quad \text{(one-sided)} \end{aligned} \]

📝 Example 3.12: Coffee Brand Ratings

Nine participants rated Brand X and nine rated Brand Y on likelihood of purchase. Can we conclude that Brand Y is preferred (rated higher) at $\alpha = 0.05$?

Brand X	8.0	8.7	7.7	7.2	8.5	6.0	8.5	6.0	13.5
Brand Y	13.0	10.7	7.7	11.2	13.7	12.0	11.5	14.2	13.7

brand_x <- c(8.0, 8.7, 7.7, 7.2, 8.5, 6.0, 8.5, 6.0, 13.5)
brand_y <- c(13.0, 10.7, 7.7, 11.2, 13.7, 12.0, 11.5, 14.2, 13.7)

boxplot(brand_x, brand_y, names = c("Brand X","Brand Y"),
        col = c("lightblue","lightgreen"),
        main = "Coffee Brand Ratings", ylab = "Rating")

wilcox.test(brand_x, brand_y, paired = FALSE, alternative = "less")

## 
##  Wilcoxon rank sum exact test
## 
## data:  brand_x and brand_y
## W = 10.5, p-value = 0.003085
## alternative hypothesis: true location shift is less than 0

Rank table (combined ranks, $n=m=9$):

Brand X	8.0	8.7	7.7	7.2	8.5	6.0	8.5	6.0	13.5
Rank	6	9	4.5	3	7.5	1.5	7.5	1.5	15

$S = 6+9+4.5+3+7.5+1.5+7.5+1.5+15 = 55.5$, $\quad U = 55.5 - \dfrac{9 \times 10}{2} = 10.5$

Hypothesis test:

$H_0: \theta_X \geq \theta_Y$ vs. $H_1: \theta_X < \theta_Y$ (Brand Y is preferred)
From R: $U = 10.5$, p-value $= 0.0045$
Since p-value $< 0.05$, we reject $H_0$. Customers rate Brand Y significantly higher than Brand X.

📝 Example 3.13: Reaction Times Under Two Drugs

An experimental psychologist compared reaction times (seconds) of adult males under Drug A ($n=6$) vs. Drug B ($n=4$). Is there a significant difference at $\alpha = 0.10$?

Drug A (s)	52	53	73	68	56	74
Drug B (s)	94	64	74	51

drug_a <- c(52,53,73,68,56,74)
drug_b <- c(94,64,74,51)

boxplot(drug_a, drug_b, names = c("Drug A","Drug B"),
        col = c("skyblue","orange"),
        main = "Reaction Time by Drug", ylab = "Reaction Time (s)")

wilcox.test(x = drug_a, y = drug_b, paired = FALSE, alternative = "two.sided")

## 
##  Wilcoxon rank sum exact test
## 
## data:  drug_a and drug_b
## W = 9.5, p-value = 0.6476
## alternative hypothesis: true location shift is not equal to 0

Rank table:

Drug A	52	53	73	68	56	74		Drug B	94	64	74	51
Rank	2	3	7	6	4	8.5		Rank	10	5	8.5	1

$S = 2+3+7+6+4+8.5 = 30.5$, $\quad U = 30.5 - \dfrac{6 \times 7}{2} = 9.5$

Hypothesis test:

$H_0: \theta_A = \theta_B$ vs. $H_1: \theta_A \neq \theta_B$
From R: $U = 9.5$, p-value $= 0.6689$
Since p-value $> 0.10$, we fail to reject $H_0$. There is no significant difference in reaction times between the two drugs.

📝 Example 3.14: Prenatal Care Programme and APGAR Scores

A pilot trial assessed whether enhanced prenatal in-home visits improve newborn health (measured by APGAR scores at 5 min). 8 women received usual care; 7 received the new programme.

Usual Care	83	91	94	89	96	91	92	90
New Program	78	82	81	77	81	80	81

usual_care  <- c(83,91,94,89,96,91,92,90)
new_program <- c(78,82,81,77,81,80,81)

boxplot(usual_care, new_program,
        names = c("Usual Care","New Program"),
        col   = c("lightblue","lightgreen"),
        main  = "APGAR Scores by Care Programme",
        ylab  = "APGAR Score")

wilcox.test(usual_care, new_program, paired = FALSE, alternative = "two.sided")

## 
##  Wilcoxon rank sum exact test
## 
## data:  usual_care and new_program
## W = 56, p-value = 0.0003108
## alternative hypothesis: true location shift is not equal to 0

Rank table ($n=8$, $m=7$):

Usual Care	83	91	94	89	96	91	92	90
Rank	8	11.5	14	9	15	11.5	13	10
New Program	78	82	81	77	81	80	81
Rank	2	7	5	1	5	3	5

$S = 8+11.5+14+9+15+11.5+13+10 = 92$, $\quad U = 92 - \dfrac{8\times9}{2} = 56$

Hypothesis test:

$H_0: \theta_\text{usual} = \theta_\text{new}$ vs. $H_1: \theta_\text{usual} \neq \theta_\text{new}$
From R: $U = 56$, p-value $= 0.00139$
Since p-value $< 0.05$, we reject $H_0$. There is a significant difference in APGAR scores between the two care programmes. The usual care group had higher scores, suggesting babies in the usual care group were healthier — a surprising result that warrants further investigation.

3.5 Friedman Test for $k$ Related Samples ($k > 2$)

The Friedman test is the nonparametric equivalent of one-way repeated-measures ANOVA. It is used when the same subjects (or matched blocks) are measured under $k \geq 3$ treatment conditions.

Test statistic:

\[F_r = \frac{12}{bk(k+1)} \sum_{j=1}^{k} R_j^2 - 3b(k+1)\]

where $b$ = number of blocks (subjects), $k$ = number of treatments, $R_j$ = sum of ranks for treatment $j$.

Under $H_0$, $F_r \approx \chi^2_{k-1}$ for large $b$.

Hypotheses:

$H_0$: The $k$ treatment distributions are identical (same location).
$H_1$: At least two treatment distributions differ in location.

📝 Example 3.15: Farm Production Constraints

Twelve farm managers each ranked four conditions (1 = least severe to 4 = most severe) as constraints on farm production. At $\alpha = 0.05$, test whether the conditions differ in severity.

	Cond 1	Cond 2	Cond 3	Cond 4
Farm 1	4	3	2	1
Farm 2	4	2	3	1
Farm 3	3	1	2	4
Farm 4	3	1	2	4
Farm 5	4	2	1	3
Farm 6	3	1	2	4
Farm 7	1	3	2	4
Farm 8	2	4	1	3
Farm 9	3	1	2	4
Farm 10	4	1	3	2
Farm 11	4	2	3	1
Farm 12	3	1	2	4
Sum of Ranks $R_j$	38	22	25	35

dt8 <- data.frame(
  Farm      = rep(1:12, each = 4),
  Condition = rep(c("Cond1","Cond2","Cond3","Cond4"), times = 12),
  Rank      = c(4,3,2,1, 4,2,3,1, 3,1,2,4, 3,1,2,4, 4,2,1,3,
                3,1,2,4, 1,3,2,4, 2,4,1,3, 3,1,2,4, 4,1,3,2,
                4,2,3,1, 3,1,2,4)
)
dt8$Condition <- as.factor(dt8$Condition)

boxplot(Rank ~ Condition, data = dt8,
        col  = c("lightblue","lightgreen","lightyellow","lightcoral"),
        main = "Severity Ranks by Farm Condition",
        xlab = "Condition", ylab = "Rank (1=least, 4=most severe)")

friedman.test(y = dt8$Rank, groups = dt8$Condition, blocks = dt8$Farm)

## 
##  Friedman rank sum test
## 
## data:  dt8$Rank, dt8$Condition and dt8$Farm
## Friedman chi-squared = 8.9, df = 3, p-value = 0.03065

Hypothesis test:

$H_0$: The four conditions are equal in severity.
$H_1$: At least two conditions differ in severity.
Test statistic: $F_r = \dfrac{12}{12 \cdot 4 \cdot 5}\left(38^2+22^2+25^2+35^2\right) - 3(12)(5) = 8.9$
From R: $F_r = 8.9$, p-value $= 0.0307$
Since p-value $< 0.05$, we reject $H_0$. The conditions differ significantly in the constraints they impose on farm production.

Post-hoc pairwise comparisons:

pairwise.wilcox.test(dt8$Rank, dt8$Condition,
                     paired          = TRUE,
                     p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using Wilcoxon signed rank exact test 
## 
## data:  dt8$Rank and dt8$Condition 
## 
##       Cond1 Cond2 Cond3
## Cond2 0.202 -     -    
## Cond3 0.032 1.000 -    
## Cond4 1.000 0.645 0.656
## 
## P value adjustment method: bonferroni

Inspect the adjusted p-values to identify which specific pairs of conditions differ significantly.

📝 Example 3.16: Wine Type Preference

Eight people tasted red, white, and rosé wine in a randomised order and scored each. Is there a significant difference in scores between wine types at $\alpha = 0.05$?

Person	Red	White	Rosé
1	16	11	14
2	18	14	13
3	15	11	14
4	15	16	9
5	17	12	14
6	14	12	18
7	15	16	12
8	16	14	11
Sum of Ranks	21	14	13

wine <- data.frame(
  Person = rep(1:8, times = 3),
  Type   = rep(c("Red","White","Rose"), each = 8),
  Score  = c(16,18,15,15,17,14,15,16,
             11,14,11,16,12,12,16,14,
             14,13,14, 9,14,18,12,11)
)
wine$Type <- factor(wine$Type, levels = c("Red","White","Rose"))

boxplot(Score ~ Type, data = wine,
        col  = c("firebrick","lightyellow","pink"),
        main = "Wine Scores by Type", ylab = "Score")

friedman.test(y = wine$Score, groups = wine$Type, blocks = wine$Person)

## 
##  Friedman rank sum test
## 
## data:  wine$Score, wine$Type and wine$Person
## Friedman chi-squared = 4.75, df = 2, p-value = 0.09301

Hypothesis test:

$H_0$: Scores are equal for red, white, and rosé wines.
$H_1$: At least two wine types differ in scores.
Test statistic: $F_r = \dfrac{12}{8 \cdot 3 \cdot 4}(21^2+14^2+13^2) - 3(8)(4) = 4.75$
From R: $F_r = 4.75$, p-value $= 0.0930$
Since p-value $> 0.05$, we fail to reject $H_0$. There is no significant difference in preference scores among the three wine types.

3.6 Kruskal–Wallis Test for $k$ Independent Samples ($k > 2$)

The Kruskal–Wallis test is the nonparametric equivalent of one-way ANOVA. It compares the location (median) of $k \geq 3$ independent groups.

Test statistic:

\[H = \frac{12}{N(N+1)}\sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)\]

where $n_i$ = sample size of group $i$, $R_i$ = rank sum for group $i$, $N = \sum n_i$.

Under $H_0$, $H \approx \chi^2_{k-1}$ for sufficiently large $n_i$.

Hypotheses:

$H_0$: The $k$ population distributions are identical.
$H_1$: At least two population distributions differ in location (median).

📝 Example 3.17: Potassium Content in Breakfast Drinks

Three brands of breakfast drink were tested for potassium content (milli-equivalents per quart). Is there a difference in potassium content at $\alpha = 0.10$?

Brand A	Brand B	Brand C
5.0	3.0	4.0
1.0	11.5	7.5
2.0	5.0	9.5
6.5	7.0	5.0
4.5	10.0	8.2
1.0	9.4	10.25

df17 <- data.frame(
  Potassium = c(5.0,1.0,2.0,6.5,4.5,1.0,
                3.0,11.5,5.0,7.0,10.0,9.4,
                4.0,7.5,9.5,5.0,8.2,10.25),
  Brand     = rep(c("A","B","C"), each = 6)
)
df17$Brand <- as.factor(df17$Brand)

boxplot(Potassium ~ Brand, data = df17,
        col  = c("tomato","skyblue","lightgreen"),
        main = "Potassium by Brand", ylab = "Potassium (mEq/qt)")

kruskal.test(Potassium ~ Brand, data = df17)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Potassium by Brand
## Kruskal-Wallis chi-squared = 6.4308, df = 2, p-value = 0.04014

Rank table ($N = 18$, $k = 3$):

	Brand A	Rank	Brand B	Rank	Brand C	Rank
	5.0	8	3.0	4	4.0	5
	1.0	1.5	11.5	18	7.5	12
	2.0	3	5.0	8	9.5	15
	6.5	10	7.0	11	5.0	8
	4.5	6	10.0	16	8.2	13
	1.0	1.5	9.4	14	10.25	17
$R_i$		30		71		70

\[H = \frac{12}{18 \times 19}\left(\frac{30^2}{6}+\frac{71^2}{6}+\frac{70^2}{6}\right) - 3(19) = 6.39\]

Hypothesis test:

$H_0$: All three brands have equal potassium content.
$H_1$: At least two brands differ.
From R: $H = 6.43$, p-value $= 0.0401$
Since p-value $< 0.10$, we reject $H_0$. Potassium content differs significantly among the three brands.

Post-hoc pairwise comparisons:

pairwise.wilcox.test(df17$Potassium, df17$Brand,
                     p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  df17$Potassium and df17$Brand 
## 
##   A     B    
## B 0.091 -    
## C 0.091 1.000
## 
## P value adjustment method: bonferroni

📝 Example 3.18: Caffeine in Four Types of Beverages

Caffeine content (mg) was measured in regular servings of four beverage types (unequal sample sizes). Is there a significant difference at $\alpha = 0.05$?

Type 1	Type 2	Type 3	Type 4
57	43	87	60
81	61	69	70
67	42	58	75
64	45	82	72
96	39	90	79
80		91	76
68
56

df18 <- data.frame(
  Caffeine = c(57,81,67,64,96,80,68,56,
               43,61,42,45,39,
               87,69,58,82,90,91,
               60,70,75,72,79,76),
  Type     = factor(c(rep(1,8), rep(2,5), rep(3,6), rep(4,6)))
)

boxplot(Caffeine ~ Type, data = df18,
        col  = c("lightblue","lightgreen","coral","gold"),
        main = "Caffeine Content by Beverage Type",
        xlab = "Type", ylab = "Caffeine (mg)")

kruskal.test(Caffeine ~ Type, data = df18)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Caffeine by Type
## Kruskal-Wallis chi-squared = 11.308, df = 3, p-value = 0.01017

Rank table ($N=25$, $k=4$, rank sums: $R_1=108$, $R_2=19$, $R_3=110$, $R_4=88$):

\[H = \frac{12}{25 \times 26}\left(\frac{108^2}{8}+\frac{19^2}{5}+\frac{110^2}{6}+\frac{88^2}{6}\right) - 3(26) = 11.308\]

Hypothesis test:

$H_0$: Caffeine content is equal across all four beverage types.
$H_1$: At least two types differ.
From R: $H = 11.308$, p-value $= 0.0102$
Since p-value $< 0.05$, we reject $H_0$. There is a significant difference in caffeine content between the four beverage types.

Post-hoc pairwise comparisons:

pairwise.wilcox.test(df18$Caffeine, df18$Type,
                     p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  df18$Caffeine and df18$Type 
## 
##   1     2     3    
## 2 0.037 -     -    
## 3 1.000 0.052 -    
## 4 1.000 0.052 1.000
## 
## P value adjustment method: bonferroni

3.7 Spearman Rank Correlation

Spearman’s rank correlation coefficient $r_s$ measures the monotone association between two variables. It is the nonparametric counterpart of Pearson’s $r$, and is appropriate when:

Data are ordinal (already ranks), or
The relationship may be monotone but not linear, or
Outliers or non-normality make Pearson’s $r$ unreliable.

Formula:

\[r_s = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2-1)}\]

where $D_i = R(X_i) - R(Y_i)$ is the difference in ranks for observation $i$.

The range is $-1 \leq r_s \leq 1$, with the same directional interpretation as Pearson’s $r$.

Hypotheses:

\[ \begin{aligned} (a)\; &H_0: \rho_s = 0, \quad H_1: \rho_s \neq 0 \quad \text{(two-sided)} \\ (b)\; &H_0: \rho_s \leq 0, \quad H_1: \rho_s > 0 \quad \text{(positive association)} \\ (c)\; &H_0: \rho_s \geq 0, \quad H_1: \rho_s < 0 \quad \text{(negative association)} \end{aligned} \]

In R: cor.test(x, y, method = "spearman", alternative = ...)

📝 Example 3.19: Machinery Productivity and Operator Satisfaction

Ten types of machinery are ranked on productivity (1–10) and operator satisfaction (1–10). Test at $\alpha = 0.05$ whether productivity and satisfaction are positively related.

Machine	1	2	3	4	5	6	7	8	9	10
Productivity rank	8	3	9	2	7	10	4	6	1	5
Satisfaction rank	9	5	10	1	8	7	3	4	2	6

productivity <- c(8,3,9,2,7,10,4,6,1,5)
satisfaction <- c(9,5,10,1,8,7,3,4,2,6)

plot(productivity, satisfaction,
     main = "Productivity Rank vs. Satisfaction Rank",
     xlab = "Productivity Rank", ylab = "Satisfaction Rank",
     pch = 19, col = "steelblue")
abline(lm(satisfaction ~ productivity), col = "red", lty = 2)

cor.test(productivity, satisfaction, method = "spearman", alternative = "greater")

## 
##  Spearman's rank correlation rho
## 
## data:  productivity and satisfaction
## S = 24, p-value = 0.001752
## alternative hypothesis: true rho is greater than 0
## sample estimates:
##       rho 
## 0.8545455

Calculation table:

Machine	$R(X)$	$R(Y)$	$D$	$D^2$
1	8	9	−1	1
2	3	5	−2	4
3	9	10	−1	1
4	2	1	+1	1
5	7	8	−1	1
6	10	7	+3	9
7	4	3	+1	1
8	6	4	+2	4
9	1	2	−1	1
10	5	6	−1	1
Total				$\sum D^2 = 24$

\[r_s = 1 - \frac{6 \times 24}{10(100-1)} = 1 - 0.1455 = 0.8545\]

Hypothesis test:

$H_0: \rho_s \leq 0$ vs. $H_1: \rho_s > 0$
From R: $r_s = 0.8545$, p-value $= 0.00351$
Since p-value $< 0.05$, we reject $H_0$. The productivity rank and satisfaction rank of machinery are significantly positively related.

📝 Example 3.20: Cigarette Smoking and Birth Weight

A study recorded the number of cigarettes smoked per day during pregnancy and the birth weight (lbs) of the newborn for 12 women. Is there a significant (two-sided) association?

Woman	1	2	3	4	5	6	7	8	9	10	11	12
Cigarettes/day	12	15	35	21	20	17	19	46	20	25	39	25
Birth weight (lbs)	7.7	8.1	6.9	8.2	8.6	8.3	9.4	7.8	8.3	5.2	6.4	7.9

cigarettes <- c(12,15,35,21,20,17,19,46,20,25,39,25)
weight     <- c(7.7,8.1,6.9,8.2,8.6,8.3,9.4,7.8,8.3,5.2,6.4,7.9)

plot(cigarettes, weight,
     main = "Cigarettes per Day vs. Birth Weight",
     xlab = "Cigarettes per Day", ylab = "Birth Weight (lbs)",
     pch = 19, col = "tomato")

cor.test(cigarettes, weight, method = "spearman", alternative = "two.sided")

## 
##  Spearman's rank correlation rho
## 
## data:  cigarettes and weight
## S = 431.26, p-value = 0.09183
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5079094

Hypothesis test:

$H_0: \rho_s = 0$ vs. $H_1: \rho_s \neq 0$
From R: $r_s = -0.50$, p-value $= 0.0918$
Since p-value $> 0.05$, we fail to reject $H_0$. There is no significant linear or monotone association between cigarettes smoked per day and birth weight at the 5% level.

📝 Example 3.21: Tree Diameter and Height

A biologist measures the diameter (inches, at 4.5 ft height) and height (feet) of 10 tall trees. Is there a significant association?

Diameter (in.)	1024	950	451	505	761	644	707	586	442	546
Height (ft)	261	321	219	281	159	83	191	141	232	108

diameter <- c(1024,950,451,505,761,644,707,586,442,546)
height   <- c(261,321,219,281,159,83,191,141,232,108)

plot(diameter, height,
     main = "Tree Diameter vs. Height",
     xlab = "Diameter (in.)", ylab = "Height (ft)",
     pch = 19, col = "darkgreen")

cor.test(diameter, height, method = "spearman", alternative = "two.sided")

## 
##  Spearman's rank correlation rho
## 
## data:  diameter and height
## S = 146, p-value = 0.7588
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1151515

Calculation table:

Diameter	1024	950	451	505	761	644	707	586	442	546
$R(X)$	10	9	2	3	8	6	7	5	1	4
$R(Y)$	8	10	6	9	4	1	5	3	7	2
$D$	2	−1	−4	−6	4	5	2	2	−6	2
$D^2$	4	1	16	36	16	25	4	4	36	4

$\sum D^2 = 146$, $\quad r_s = 1 - \dfrac{6 \times 146}{10(99)} = 1 - 0.885 = 0.115$

Hypothesis test:

$H_0: \rho_s = 0$ vs. $H_1: \rho_s \neq 0$
From R: $r_s = 0.115$, p-value $= 0.759$
Since p-value $> 0.05$, we fail to reject $H_0$. There is no significant association between tree diameter and height in this sample.

3.8 Chi-Squared Test of Independence

The chi-squared test of independence tests whether two categorical variables are associated. It is used when both variables are nominal or ordinal and the data are presented in a contingency table.

Hypotheses:

$H_0$: Variable A and Variable B are independent (not associated).
$H_1$: Variable A and Variable B are not independent (associated).

Procedure:

Construct the observed frequency table ($r \times c$ contingency table).
Compute expected frequencies: $E_{ij} = \dfrac{(\text{Row } i \text{ total}) \times (\text{Col } j \text{ total})}{N}$
Test statistic: \[\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} \quad \text{under } H_0\]
Reject $H_0$ if $\chi^2 > \chi^2_{1-\alpha, (r-1)(c-1)}$ or if p-value $< \alpha$.

❗ Conditions for validity

All expected cell counts should be $\geq 5$. If any are $< 5$:

Combine adjacent categories if meaningful, or
Use Fisher’s exact test (fisher.test()) instead — valid for any cell counts.

In R: chisq.test(table(var_A, var_B)) or chisq.test(contingency_matrix)

📝 Example 3.22: Smoking Status and Exercise Habit

A researcher surveyed 200 adults on their smoking status (Smoker / Non-smoker) and exercise habit (Regular / Occasional / None). Is smoking status associated with exercise habit at $\alpha = 0.05$?

	Regular	Occasional	None	Total
Smoker	25	30	45	100
Non-smoker	55	35	10	100
Total	80	65	55	200

obs <- matrix(c(25,30,45, 55,35,10),
              nrow = 2, byrow = TRUE,
              dimnames = list(Status   = c("Smoker","Non-smoker"),
                              Exercise = c("Regular","Occasional","None")))
obs

##             Exercise
## Status       Regular Occasional None
##   Smoker          25         30   45
##   Non-smoker      55         35   10

# Check expected counts before testing
chisq.test(obs)$expected

##             Exercise
## Status       Regular Occasional None
##   Smoker          40       32.5 27.5
##   Non-smoker      40       32.5 27.5

# Chi-squared test
chisq.test(obs)

## 
##  Pearson's Chi-squared test
## 
## data:  obs
## X-squared = 33.907, df = 2, p-value = 4.336e-08

Hypothesis test:

$H_0$: Smoking status and exercise habit are independent.
$H_1$: They are associated.
Test statistic: $\chi^2 = \sum \dfrac{(O-E)^2}{E}$ with $(2-1)(3-1) = 2$ df
From R: $\chi^2$, p-value — see output above.
If p-value $< 0.05$, reject $H_0$ and conclude that smoking status and exercise habit are significantly associated.

📝 Example 3.23: Gender and Method of Payment

A retail store recorded the gender and preferred method of payment (Cash, Card, Digital) for 120 customers. Is there a significant association at $\alpha = 0.05$?

	Cash	Card	Digital	Total
Female	20	35	15	70
Male	25	15	10	50
Total	45	50	25	120

obs2 <- matrix(c(20,35,15, 25,15,10),
               nrow = 2, byrow = TRUE,
               dimnames = list(Gender  = c("Female","Male"),
                               Payment = c("Cash","Card","Digital")))
obs2

##         Payment
## Gender   Cash Card Digital
##   Female   20   35      15
##   Male     25   15      10

# Expected counts
expected2 <- chisq.test(obs2)$expected
expected2

##         Payment
## Gender    Cash     Card  Digital
##   Female 26.25 29.16667 14.58333
##   Male   18.75 20.83333 10.41667

# Chi-squared test
chisq.test(obs2)

## 
##  Pearson's Chi-squared test
## 
## data:  obs2
## X-squared = 6.4, df = 2, p-value = 0.04076

# If any expected count < 5, use Fisher's exact test:
# fisher.test(obs2)

Hypothesis test:

$H_0$: Gender and method of payment are independent.
$H_1$: They are associated.
Compute expected counts (printed above). If all $\geq 5$, the chi-squared test is valid.
From R: $\chi^2$, df $= (2-1)(3-1) = 2$, p-value — see output.
State your conclusion based on whether p-value $< 0.05$.

4 Mock Midterm Exam

❗Instructions

Show your working clearly. State hypotheses, test statistics, p-values, and conclusions in context.
Use $\alpha = 0.05$ unless a question specifies otherwise.
R output is provided where indicated. You are expected to interpret and use it correctly.
Write all conclusions in the context of the research problem.

4.1 Question 1 — Correlation Analysis (16 marks)

A company analyst wants to investigate the relationship between the number of calls an agent makes in a week and the number of sales made. The analyst reasons that sales should depend to some degree on the number of calls. Data for a random sample of 20 agents are given below.

Agent	Calls ($x$)	Sales ($y$)	Agent	Calls ($x$)	Sales ($y$)
1	16	5	11	29	12
2	13	7	12	25	7
3	21	8	13	38	18
4	37	20	14	33	14
5	17	9	15	12	2
6	20	12	16	30	15
7	17	4	17	30	10
8	28	16	18	10	4
9	28	11	19	18	7
10	6	2	20	21	10

calls <- c(16,13,21,37,17,20,17,28,28, 6,
           29,25,38,33,12,30,30,10,18,21)
sales <- c( 5, 7, 8,20, 9,12, 4,16,11, 2,
           12, 7,18,14, 2,15,10, 4, 7,10)
dt1 <- data.frame(calls = calls, sales = sales)

The R output below is used to answer all parts of Question 1.

# Scatterplot
plot(calls, sales,
     xlab = "Number of Calls", ylab = "Number of Sales",
     main = "Calls vs. Sales",
     pch = 19, col = "steelblue")

# Pearson correlation test (95% CI by default)
cor.test(calls, sales, alternative = "two.sided", conf.level = 0.95)

## 
##  Pearson's product-moment correlation
## 
## data:  calls and sales
## t = 8.7336, df = 18, p-value = 6.873e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7591250 0.9599183
## sample estimates:
##       cor 
## 0.8994835

(1.1) From the scatterplot, describe the relationship between the two variables. (3 marks)

Answer: The scatterplot shows a strong, positive, linear relationship between the number of calls and the number of sales. As the number of calls an agent makes increases, the number of sales also tends to increase. The points cluster fairly tightly around an upward-sloping straight line, and there are no obvious outliers or influential observations.

(1.2) What is the sample correlation coefficient between the number of sales and the number of calls? Interpret its value. (3 marks)

Answer: From the R output, the sample correlation coefficient is $r = 0.899$.

Interpretation: This value indicates a very strong positive linear relationship between the number of calls and the number of sales. Agents who make more calls tend to achieve substantially more sales. The value is close to 1, confirming that the linear association is strong.

(1.3) At significance level 0.05, test whether the number of sales and the number of calls are correlated. (6 marks)

Answer:

Hypotheses:
- $H_0: \rho = 0$ (no linear correlation between calls and sales)
- $H_1: \rho \neq 0$ (significant linear correlation exists)
Test statistic and distribution:
$t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} = t_{18}$ under $H_0$
From R: $t_0 = 8.7336$, p-value $= 6.873 \times 10^{-8}$
Decision: Since p-value $= 6.873 \times 10^{-8} < \alpha = 0.05$, we reject $H_0$.
Conclusion: There is very strong statistical evidence of a significant positive linear correlation between the number of calls made and the number of sales achieved ($r = 0.899$, p-value $< 0.001$).

(1.4) Find the 95% confidence interval for the population correlation $\rho$ and interpret it. (4 marks)

Answer: From the R output, the 95% confidence interval for $\rho$ is (0.7591, 0.9599).

Interpretation: We are 95% confident that the true population correlation between calls and sales lies between 0.7591 and 0.9599. Since this interval is entirely above zero and contains only large positive values, it strongly supports the conclusion that calls and sales are positively and substantially correlated in the population.

4.2 Question 2 — Simple Linear Regression (20 marks)

A researcher investigates the relationship between daily cigarette smoking and illness-related work absences. For 15 employees, the number of cigarettes smoked daily ($x$) and the number of days absent in the past year ($y$) were recorded.

Subject	Cigarettes/day ($x$)	Days absent ($y$)	Subject	Cigarettes/day ($x$)	Days absent ($y$)
1	0	1	9	35	8
2	0	3	10	44	9
3	0	8	11	53	5
4	10	10	12	33	12
5	13	14	13	21	15
6	20	14	14	25	13
7	27	11	15	14	9
8	35	6

The simple linear regression model is used: $y = \beta_0 + \beta_1 x + \epsilon$.

cigs   <- c( 0, 0, 0,10,13,20,27,35,35,44,53,33,21,25,14)
absent <- c( 1, 3, 8,10,14,14,11, 6, 8, 9, 5,12,15,13, 9)
dt2 <- data.frame(cigs = cigs, absent = absent)

# Scatterplot
plot(cigs, absent,
     xlab = "Cigarettes per Day", ylab = "Days Absent",
     main = "Cigarettes Smoked vs. Days Absent",
     pch = 19, col = "tomato")

# Correlation
cor.test(cigs, absent)

## 
##  Pearson's product-moment correlation
## 
## data:  cigs and absent
## t = 0.52436, df = 13, p-value = 0.6089
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3976627  0.6111258
## sample estimates:
##       cor 
## 0.1439172

# Fit regression
fit2 <- lm(absent ~ cigs, data = dt2)
summary(fit2)

## 
## Call:
## lm(formula = absent ~ cigs, data = dt2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3878 -2.6799  0.0953  3.0416  5.8369 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.38784    1.90166   4.411 0.000704 ***
## cigs         0.03692    0.07040   0.524 0.608854    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.273 on 13 degrees of freedom
## Multiple R-squared:  0.02071,    Adjusted R-squared:  -0.05462 
## F-statistic: 0.275 on 1 and 13 DF,  p-value: 0.6089

anova(fit2)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
cigs	1	5.02063	5.02063	0.2749531	0.6088536
Residuals	13	237.37937	18.25995	NA	NA

confint(fit2, level = 0.90)

##                     5 %       95 %
## (Intercept)  5.02012966 11.7555489
## cigs        -0.08776229  0.1615951

# Model diagnostics
par(mfrow = c(2, 2))
plot(fit2, pch = 16, col = "tomato")

par(mfrow = c(1, 1))

library(lmtest)
bptest(fit2)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 0.97208, df = 1, p-value = 0.3242

shapiro.test(resid(fit2))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit2)
## W = 0.95729, p-value = 0.6454

dwtest(fit2)

## 
##  Durbin-Watson test
## 
## data:  fit2
## DW = 0.83406, p-value = 0.002079
## alternative hypothesis: true autocorrelation is greater than 0

(2.1) Obtain the fitted least-squares regression line for predicting days absent from number of cigarettes smoked. (3 marks)

Answer: From the summary() output, the estimated coefficients are $b_0 = 8.3878$ and $b_1 = 0.0369$. The fitted regression line is:

\[\hat{y} = 8.3878 + 0.0369\, x\]

Interpretation:

$b_0 = 8.388$: A non-smoker ($x = 0$) is predicted to be absent approximately 8.39 days per year.
$b_1 = 0.037$: For each additional cigarette smoked per day, the predicted number of absent days increases by 0.037 days on average.

(2.2) At significance level 0.10, test whether the number of cigarettes smoked can be used to describe days absent. (5 marks)

Answer:

Hypotheses:
- $H_0: \beta_1 = 0$ (no linear relationship between cigarettes and absences)
- $H_1: \beta_1 \neq 0$ (a significant linear relationship exists)
Test statistic:
$t_0 = \dfrac{b_1}{se(b_1)} \sim t_{n-2} = t_{13}$ under $H_0$
From R: $t_0 = 0.5244$, p-value $= 0.6089$
Decision: Since p-value $= 0.609 > \alpha = 0.10$, we fail to reject $H_0$.
Conclusion: There is insufficient evidence at the 10% significance level to conclude that cigarette smoking is a significant linear predictor of work absences. The number of cigarettes smoked daily does not appear to linearly explain the number of days absent.

(2.3) Find the 90% confidence intervals for the regression parameters $\beta_0$ and $\beta_1$. (3 marks)

Answer: From confint(fit, level = 0.90):

Parameter	Lower bound	Upper bound	Interpretation
$\beta_0$ (intercept)	5.0201	11.7555	90% CI for the predicted days absent for a non-smoker
$\beta_1$ (slope)	−0.0878	0.1616	90% CI for the change in absences per additional cigarette

Note that the 90% CI for $\beta_1$ contains zero, which is consistent with the conclusion in Q2.2 that $\beta_1$ is not significantly different from zero.

(2.4) What percentage of the variation in days absent can be accounted for by the fitted regression line? (2 marks)

Answer: From summary(), the Multiple R-squared $= 0.0207$.

Interpretation: Only 2.07% of the total variation in days absent is explained by the linear regression on number of cigarettes smoked. This is a very low value, indicating that the number of cigarettes smoked is a poor predictor of work absences when used alone.

(2.5) At significance level 0.10, is the constant variance assumption violated? (2 marks)

Answer: Test: Breusch-Pagan test
Hypotheses: $H_0$: constant variance (homoscedasticity) vs. $H_1$: non-constant variance (heteroscedasticity)
From R: p-value $= 0.3242$
Decision: Since p-value $= 0.324 > \alpha = 0.10$, we fail to reject $H_0$.
Conclusion: The constant variance assumption is not violated. The errors appear to have constant variance.

(2.6) At significance level 0.10, is the normality assumption violated? (2 marks)

Answer: Test: Shapiro-Wilk test
Hypotheses: $H_0$: residuals are normally distributed vs. $H_1$: residuals are not normal
From R: p-value $= 0.6454$
Decision: Since p-value $= 0.645 > \alpha = 0.10$, we fail to reject $H_0$.
Conclusion: The normality assumption is not violated. The residuals are consistent with a normal distribution.

(2.7) At significance level 0.10, is the independence assumption violated? (3 marks)

Answer: Test: Durbin-Watson test
Hypotheses: $H_0$: no autocorrelation in residuals vs. $H_1$: positive autocorrelation present
From R: DW statistic $= 0.8341$, p-value $= 0.0021$
Decision: Since p-value $= 0.0021 < \alpha = 0.10$, we reject $H_0$.
Conclusion: The independence assumption is violated. There is significant positive autocorrelation in the residuals.

⚠️ What to do next

A DW statistic well below 2 indicates that consecutive residuals tend to have the same sign (positive autocorrelation). Possible remedies include:

Including a time-trend variable or time-ordered predictor in the model
Using a time-series regression approach (e.g., AR(1) error model)
Checking whether the data were collected in a natural time order that is driving the pattern

In this case, the overall model fit is also poor ($R^2 = 2.1\%$), and the slope is not significant. Other predictors of absenteeism (age, health status, working conditions) may be needed.

4.3 Question 3 — Multiple Regression with Dummy Variable and Stepwise Selection (18 marks)

Researchers study the relationship between Gender, Age, Number of working years, and Income. Let $y$ = Income, $x_1$ = Age, $x_2$ = Number of working years, and $x_3$ = Gender dummy ($x_3 = 1$ if Female, $0$ if Male). The full model is:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\]

dt3 <- data.frame(
  Income  = c(27000,35000,42000,31000,26000,45000,38000,
              29000,50000,33000,41000,28000,36000,52000,30000),
  Age     = c(25,35,42,29,24,48,38,27,52,31,40,26,34,50,28),
  Working = c( 2,10,15, 5, 1,20,12, 3,25, 7,14, 2, 9,22, 4),
  Gender  = c("male","male","female","male","female","female","male",
              "female","female","male","male","female","male","female","male")
)

y  <- dt3$Income
x1 <- dt3$Age
x2 <- dt3$Working
x3 <- ifelse(dt3$Gender == "female", 1, 0)

# After stepwise selection
full <- lm(y ~ x1 + x2+ x3)
f1   <- lm(y ~ x1 + x3)
anova(f1,full)

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
12	17585095	NA	NA	NA	NA
11	17569551	1	15544.03	0.0097319	0.9231908

summary(f1)

## 
## Call:
## lm(formula = y ~ x1 + x3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2274.96  -495.53    73.46   323.68  2966.09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5292.11    1260.21   4.199  0.00123 ** 
## x1            879.47      36.47  24.114 1.55e-11 ***
## x3           -231.88     662.78  -0.350  0.73251    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1211 on 12 degrees of freedom
## Multiple R-squared:  0.9817, Adjusted R-squared:  0.9787 
## F-statistic: 322.4 on 2 and 12 DF,  p-value: 3.722e-11

anova(f1)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
x1	1	944635535.8	944635535.8	644.615605	0.0000000
x3	1	179369.5	179369.5	0.122401	0.7325084
Residuals	12	17585094.7	1465424.6	NA	NA

confint(f1)

##                  2.5 %   97.5 %
## (Intercept)  2546.3481 8037.871
## x1            800.0101  958.937
## x3          -1675.9500 1212.192

(3.1) Conduct the overall fit test of the regression at $\alpha = 0.05$. (5 marks)

Answer:

Hypotheses:
- $H_0: \beta_1 = \beta_3 = 0$ (neither Age nor Gender predicts Income)
- $H_1$: at least one of $\beta_1, \beta_3 \neq 0$
Test statistic:
$F = 322.4$ on $(2, 12)$ df
p-value $= 3.722 \times 10^{-11}$
Decision: Since p-value $< \alpha = 0.05$, we reject $H_0$.
Conclusion: The regression model is statistically significant. Age and Gender together are significant predictors of Income at the 5% significance level.

(3.2) Write down the fitted equations for predicting Income of female and male staff separately. (4 marks)

Answer: The stepwise procedure removed $x_2$ (Number of working years). The final fitted equation is:

\[\hat{y} = 5292.11 + 879.47\, x_1 -231.88\, x_3\]

Substituting the dummy variable values:

Male staff ($x_3 = 0$): \[\hat{y}_\text{Male} = 5292.11 + 879.47\, x_1\]

Female staff ($x_3 = 1$): \[\hat{y}_\text{Female} = (5292.11-231.88) + 879.47\, x_1 = 5060.23 + 879.47\, x_1\]

Interpretation of $b_3 = -231.88$: Holding Age constant, female staff earn an estimated 231.88 less than male staff of the same age, on average.

Both lines share the same slope: each additional year of age is associated with an estimated 879.47 increase in income, for both genders.

(3.3) Find the coefficient of determination of the fitted model and interpret its value. (2 marks)

Answer: From summary(), $R^2 = 0.9817$, i.e., 98.17%.

Interpretation: The model (Age and Gender) explains approximately 98.17% of the total variation in Income. The remaining is attributable to other factors not in the model (such as working years, which were dropped by stepwise selection, or other unmeasured variables).

(3.4) Find the 95% confidence interval for the parameter that represents the income difference between male and female staff. (2 marks)

Answer: The income difference between genders is captured by $\beta_3$. From confint(fit):

\[(-1675.95,\; 1212.192)\]

Interpretation: We are 95% confident that the true income difference between female and male staff of the same age lies between -1675.95 and 1212.192. Since the interval contains zero, the difference in income between genders is not statistically significant at the 5% level, despite the negative point estimate.

(3.5) Predict the Income of a male staff member aged 28 who has been working for 5 years. (2 marks)

Answer: Since Number of Working Years ($x_2$) was removed during stepwise selection, it is not used in the prediction. We use $x_1 = 28$ (Age) and $x_3 = 0$ (Male):

\[\hat{y}_\text{Male} = 5292.11 + 879.47(28)=29917.27\]

The predicted monthly income of this male staff member is approximately 29,917 Baht.

4.4 Question 4 — Multiple Regression with Two Dummy Variables (12 marks)

The Faculty of Science, CMU, is piloting a new online English learning programme called SpeedyE. Twenty students were sampled — 10 from Faculty 1 and 10 from Faculty 2, with 5 male and 5 female from each. Students completed a pretest, used the programme for one week, then completed a posttest.

Faculty	Gender	Pretest	Posttest	Faculty	Gender	Pretest	Posttest
1	Male	58	71	2	Male	51	69
1	Male	57	69	2	Male	62	69
1	Male	63	71	2	Male	58	66
1	Male	66	70	2	Male	52	61
1	Male	45	65	2	Male	59	63
1	Female	64	71	2	Female	39	65
1	Female	39	71	2	Female	32	66
1	Female	69	71	2	Female	62	70
1	Female	56	76	2	Female	64	68
1	Female	67	71	2	Female	66	68

The model is: \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\]

where $x_1$ = Pretest score, $x_2$ = Gender dummy ($1$ = Male, $0$ = Female), $x_3$ = Faculty dummy ($1$ = Faculty 1, $0$ = Faculty 2), and $y$ = Posttest score.

dt4 <- data.frame(
  Faculty  = c(1,1,1,1,1, 2,2,2,2,2, 1,1,1,1,1, 2,2,2,2,2),
  Gender   = c(rep("Male",10), rep("Female",10)),
  Pretest  = c(58,57,63,66,45, 51,62,58,52,59, 64,39,69,56,67, 39,32,62,64,66),
  Posttest = c(71,69,71,70,65, 69,69,66,61,63, 71,71,71,76,71, 65,66,70,68,68)
)

y  <- dt4$Posttest
x1 <- dt4$Pretest
x2 <- ifelse(dt4$Gender   == "Male", 1, 0)
x3 <- ifelse(dt4$Faculty  == 1,      1, 0)

fit4 <- lm(y ~ x1 + x2 + x3)
summary(fit4)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0690 -1.3876 -0.1686  1.3039  4.4061 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 62.84557    3.11985  20.144 8.57e-13 ***
## x1           0.08922    0.05517   1.617  0.12541    
## x2          -2.41598    1.10151  -2.193  0.04340 *  
## x3           3.75205    1.12003   3.350  0.00407 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.458 on 16 degrees of freedom
## Multiple R-squared:  0.5665, Adjusted R-squared:  0.4852 
## F-statistic: 6.969 on 3 and 16 DF,  p-value: 0.003264

anova(fit4)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
x1	1	28.31784	28.317844	4.687714	0.0458376
x2	1	30.18700	30.186998	4.997132	0.0399947
x3	1	67.79133	67.791325	11.222123	0.0040678
Residuals	16	96.65383	6.040865	NA	NA

(4.1) Write down the fitted linear equation to predict Posttest score. (3 marks)

Answer: From the summary() output, the estimated coefficients are:

Parameter	Estimate	Variable
$b_0$	62.8456	Intercept
$b_1$	0.0892	Pretest score
$b_2$	−2.4160	Gender (Male = 1)
$b_3$	3.7520	Faculty (Faculty 1 = 1)

The overall fitted equation is: \[\hat{y} = 62.8456 + 0.0892\, x_1 - 2.4160\, x_2 + 3.7520\, x_3\]

Coefficient interpretations:

$b_1 = 0.089$: For each additional point on the pretest, the predicted posttest score increases by 0.089 points, holding gender and faculty constant.
$b_2 = -2.416$: Male students are predicted to score 2.416 points lower than female students with the same pretest score and faculty.
$b_3 = 3.752$: Faculty 1 students are predicted to score 3.752 points higher than Faculty 2 students with the same pretest score and gender.

(4.2) Write down the fitted equations for each combination of faculty and gender. (6 marks)

Answer: Substituting the dummy variable values gives four parallel equations (all with slope $b_1 = 0.0892$):

Group	$x_2$	$x_3$	Intercept	Fitted equation
Faculty 1, Female	0	1	$62.8456 + 3.7520 = 66.5976$	$\hat{y} = 66.5976 + 0.0892\, x_1$
Faculty 1, Male	1	1	$62.8456 - 2.4160 + 3.7520 = 64.1816$	$\hat{y} = 64.1816 + 0.0892\, x_1$
Faculty 2, Female	0	0	$62.8456$	$\hat{y} = 62.8456 + 0.0892\, x_1$
Faculty 2, Male	1	0	$62.8456 - 2.4160 = 60.4296$	$\hat{y} = 60.4296 + 0.0892\, x_1$

All four lines have the same slope (0.0892) — the dummy variables shift the intercept only. Faculty 1 Female students have the highest predicted posttest for any given pretest score; Faculty 2 Male students have the lowest.

(4.3) Compare the predicted posttest scores of male and female students from the same faculty. (3 marks)

Answer: The difference in predicted posttest scores between male and female students from the same faculty with the same pretest score is captured entirely by $b_2 = -2.416$.

For any fixed pretest score $x_1$ and faculty: \[\hat{y}_\text{Male} - \hat{y}_\text{Female} = -2.416\]

Conclusion: Male students are predicted to score 2.416 points lower on the posttest than female students from the same faculty with the same pretest score. This gap is the same regardless of faculty or pretest level, because the model assumes no interaction between gender and the other predictors.

4.5 Question 5 — Correlation and Simple Linear Regression (20 marks)

A market research analyst examines whether advertising budget is related to consumer retention of advertisements. For 21 companies, the advertising budget ($x$, million Baht) and the percentage of customers who remember key details (“retained impressions”, $y$, %) were recorded.

ad_budget <- c(90.1,85.1,108.6,71.6,92.9,45.1,98.9,36.9,
               90.5,69.8,84.5,42.5,100.4,81.6,57.4,84.6,
               51.5,95.1,53.6,89.8,74.9)
ret_impr  <- c(55.6,42.7,62.2,30.5,45.8,16.4,60.7,10.1,
               45.5,32.9,48.3,15.8,58.2,35.9,26.8,49.4,
               19.3,56.9,21.3,48.2,33.2)
dt5 <- data.frame(x = ad_budget, y = ret_impr)

plot(dt5$x, dt5$y,
     xlab = "Advertising Budget (million Baht)",
     ylab = "Retained Impressions (%)",
     main = "Advertising Budget vs. Retained Impressions",
     pch = 19, col = "darkorchid")

Correlation analysis:

cor.test(dt5$x, dt5$y, alternative = "two.sided", conf.level = 0.90)

## 
##  Pearson's product-moment correlation
## 
## data:  dt5$x and dt5$y
## t = 18.605, df = 19, p-value = 1.18e-13
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
##  0.9436239 0.9877721
## sample estimates:
##       cor 
## 0.9736366

(5.1) Find the sample correlation between advertising budget and retained impressions. Interpret the value. (3 marks)

Answer: From R, $r = 0.9736$.

Interpretation: There is a moderate-to-strong positive linear relationship between advertising budget and retained impressions. Companies with higher advertising budgets tend to have a higher percentage of customers who remember their advertisement. The value is strong ( close to 1), suggesting that other factors besides budget also influence retention.

(5.2) Test whether advertising budget and retained impressions are correlated in the population at $\alpha = 0.10$. (5 marks)

Answer:

Hypotheses:
- $H_0: \rho = 0$ (advertising budget and retained impressions are uncorrelated)
- $H_1: \rho \neq 0$ (they are significantly correlated)
Test statistic:
$t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} = 18.605$ on $n - 2 = 19$ df
p-value $= 1.18 \times 10^{-13}$
Decision: Since p-value $ < = 0.10$, we reject $H_0$.
Conclusion: There is significant evidence at the 10% level that advertising budget and retained impressions are linearly correlated in the population.

(5.3) Find the 90% confidence interval for the population correlation $\rho$. (2 marks)

Answer: From R (conf.level = 0.90): $(0.9436,\; 0.9877)$

Interpretation: We are 90% confident that the true population correlation between advertising budget and retained impressions lies between 0.9436 and 0.9877. The interval is entirely above zero, confirming a positive association.

Simple linear regression:

fit5 <- lm(y ~ x, data = dt5)
summary(fit5)

## 
## Call:
## lm(formula = y ~ x, data = dt5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8143 -2.6441 -0.3756  2.4648  6.4990 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.59850    3.19614  -5.819 1.32e-05 ***
## x             0.75138    0.04039  18.605 1.18e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.789 on 19 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9452 
## F-statistic: 346.2 on 1 and 19 DF,  p-value: 1.18e-13

anova(fit5)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
x	1	4969.0893	4969.08931	346.161	0
Residuals	19	272.7421	14.35485	NA	NA

confint(fit5, level = 0.90)

##                     5 %        95 %
## (Intercept) -24.1250439 -13.0719581
## x             0.6815506   0.8212132

(5.4) Write the fitted regression equation and explain the coefficient estimates. (4 marks)

Answer: From summary(): $b_0 = -18.598$, $b_1 = 0.7514$. The fitted equation is:

\[\hat{y} = -18.598 + 0.7514\, x\]

Coefficient interpretations:

$b_1 = 0.7514$: For every additional 1 million Baht spent on advertising, the retained impressions percentage is predicted to increase by 0.7514 percentage points, on average.

(5.5) At significance level 0.10, test whether advertising budget is a significant predictor of retained impressions. (3 marks)

Answer:

Hypotheses: $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
From R: $t_0 = 18.605$, p-value $= 1.18 \times 10^{-13}$
Decision: Since p-value $< 0.10$, we reject $H_0$.
Conclusion: Advertising budget is a statistically significant linear predictor of retained impressions at the 10% level.

(5.6) Find the coefficient of determination and interpret its value. (2 marks)

Answer: From summary(), $r^2 = 0.948$, i.e., 94.8%.

Interpretation: The linear regression on advertising budget explains 94.8% of the total variation in retained impressions. The remaining is due to other factors not included in this model (e.g., quality of the advertisement, target audience, timing) or inherent random variability.

(5.7) Find the 90% confidence intervals for $\beta_0$ and $\beta_1$. (2 marks)

Answer: From confint(fit, level = 0.90):

Parameter	90% CI Lower	90% CI Upper
$\beta_0$ (intercept)	-24.1250	-13.0719
$\beta_1$ (slope)	0.6815	0.8212

Interpretation of $\beta_1$ CI: We are 90% confident that each additional million Baht of advertising budget is associated with a true increase in retained impressions of between 0.195 and 0.531 percentage points.

(5.8) Predict the retained impressions percentage when advertising budget is 100 million Baht. (1 mark)

Answer:

\[\hat{y} = -18.598 + 0.7514(100) = \mathbf{56.542\%}\]

When the advertising budget is 100 million Baht, the model predicts that approximately 56.542% of customers will retain key details from the advertisement.

4.6 Question 6 — Multiple Linear Regression and Model Diagnostics (14 marks)

A data analyst investigates factors related to box-office gross of book-inspired movies. Ten movies were selected. Variables: $y$ = box-office gross (million USD), $x_1$ = production budget, $x_2$ = advertising budget, $x_3$ = pre-adaptation book sales (all in million USD).

Movie	$y$	$x_1$	$x_2$	$x_3$	Movie	$y$	$x_1$	$x_2$	$x_3$
1	85.1	8.5	5.1	4.7	6	30.3	3.5	1.2	3.5
2	106.3	12.9	5.8	8.8	7	79.4	9.2	3.7	9.7
3	50.2	5.2	2.1	15.1	8	91.0	9.0	7.6	5.9
4	130.6	10.7	8.4	12.2	9	135.4	15.1	7.7	20.8
5	54.8	3.1	2.9	10.6	10	89.3	10.2	4.5	7.9

The full model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$

dt6 <- data.frame(
  y  = c( 85.1,106.3, 50.2,130.6, 54.8, 30.3, 79.4, 91.0,135.4, 89.3),
  x1 = c(  8.5, 12.9,  5.2, 10.7,  3.1,  3.5,  9.2,  9.0, 15.1, 10.2),
  x2 = c(  5.1,  5.8,  2.1,  8.4,  2.9,  1.2,  3.7,  7.6,  7.7,  4.5),
  x3 = c(  4.7,  8.8, 15.1, 12.2, 10.6,  3.5,  9.7,  5.9, 20.8,  7.9)
)

# Full model with all three predictors
full6 <- lm(y ~ x1 + x2 + x3, data = dt6)
summary(full6)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = dt6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4384  -3.1695   0.8499   3.5134   9.6207 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   7.6760     6.7602   1.135   0.2995   
## x1            3.6616     1.1178   3.276   0.0169 * 
## x2            7.6211     1.6573   4.598   0.0037 **
## x3            0.8285     0.5394   1.536   0.1754   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.541 on 6 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9502 
## F-statistic: 58.22 on 3 and 6 DF,  p-value: 7.913e-05

anova(full6)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
x1	1	8647.3942	8647.39415	152.064149	0.0000173
x2	1	1150.8996	1150.89959	20.238533	0.0041095
x3	1	134.1697	134.16974	2.359371	0.1754396
Residuals	6	341.2005	56.86675	NA	NA

(6.1) At $\alpha = 0.05$, test whether book sales ($x_3$) should be included in the model when production budget ($x_1$) and advertising budget ($x_2$) are already included. (3 marks)

Answer:

Hypotheses:
- $H_0: \beta_3 = 0$ (book sales add no explanatory value given $x_1$ and $x_2$)
- $H_1: \beta_3 \neq 0$ (book sales should be included)
From R: $t_0 = 1.536$, p-value $= 0.1754$
Decision: Since p-value $= 0.175 > \alpha = 0.05$, we fail to reject $H_0$.
Conclusion: Book sales ($x_3$) do not significantly improve the model when production and advertising budgets are already present. $x_3$ should be excluded from the model.

The reduced model using only $x_1$ and $x_2$:

fit6 <- lm(y ~ x1 + x2, data = dt6)
summary(fit6)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = dt6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4168  -2.5696   0.8051   2.1200  11.0463 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   11.848      6.765   1.751  0.12334   
## x1             4.228      1.153   3.667  0.00800 **
## x2             7.436      1.806   4.117  0.00448 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.241 on 7 degrees of freedom
## Multiple R-squared:  0.9537, Adjusted R-squared:  0.9405 
## F-statistic: 72.14 on 2 and 7 DF,  p-value: 2.131e-05

anova(fit6)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
x1	1	8647.3942	8647.39415	127.33603	0.0000096
x2	1	1150.8996	1150.89959	16.94742	0.0044779
Residuals	7	475.3703	67.91004	NA	NA

confint(fit6)

##                 2.5 %    97.5 %
## (Intercept) -4.148412 27.844835
## x1           1.501766  6.954722
## x2           3.164848 11.707370

# Model diagnostics
par(mfrow = c(2, 2))
plot(fit6, pch = 16, col = "steelblue")

par(mfrow = c(1, 1))

library(lmtest)
bptest(fit6)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit6
## BP = 7.6284, df = 2, p-value = 0.02206

shapiro.test(resid(fit6))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit6)
## W = 0.93429, p-value = 0.4914

dwtest(fit6)

## 
##  Durbin-Watson test
## 
## data:  fit6
## DW = 2.0124, p-value = 0.5113
## alternative hypothesis: true autocorrelation is greater than 0

(6.2) Test the overall fit of the reduced model ($x_1$ and $x_2$) at $\alpha = 0.05$. Provide the ANOVA table. (4 marks)

Answer:

Source	df	SS	MS	$F$	p-value
Regression	2	9,788.7	4,894.35	70.62	$<0.0001$
Residual	7	485.0	69.29
Total	9	10,273.7

Hypotheses: $H_0: \beta_1 = \beta_2 = 0$ vs. $H_1$: at least one $\beta_j \neq 0$
From R: $F = 72.14$ on $(2, 7)$ df, p-value $< 0.0001$
Decision: Since p-value $< 0.05$, we reject $H_0$.
Conclusion: The model with production budget and advertising budget is highly significant. Together, $x_1$ and $x_2$ explain a substantial proportion of the variation in box-office gross.

(6.3) Test the normality assumption of the errors at $\alpha = 0.05$. (2 marks)

Answer: Test: Shapiro-Wilk test on residuals
Hypotheses: $H_0$: residuals are normally distributed vs. $H_1$: not normal
From R: W statistic, p-value $= 0.4914$
Decision: Since p-value $= 0.491 > 0.05$, we fail to reject $H_0$.
Conclusion: The normality assumption is satisfied. The residuals are consistent with a normal distribution.

(6.4) Test the independence assumption of the errors at $\alpha = 0.05$. (2 marks)

Answer: Test: Durbin-Watson test
Hypotheses: $H_0$: no autocorrelation vs. $H_1$: autocorrelation present
From R: DW statistic $= 2.0124$, p-value $=0.5113 > 0.05$
Decision: Since p-value $> 0.05$, we fail to reject $H_0$.
Conclusion: The independence assumption is satisfied. A DW value close to 2 indicates no significant autocorrelation in the residuals.

(6.5) Test the constant variance assumption of the errors at $\alpha = 0.05$. (3 marks)

Answer: Test: Breusch-Pagan test
Hypotheses: $H_0$: constant variance vs. $H_1$: non-constant variance (heteroscedasticity)
From R: p-value $= 0.0226$
Decision: Since p-value $= 0.022 < \alpha = 0.05$, we reject $H_0$.
Conclusion: The constant variance assumption is violated. There is evidence of heteroscedasticity — the spread of the residuals changes with the fitted values.

⚠️ What to do next

Since heteroscedasticity is present, the standard errors and therefore the $t$-tests and confidence intervals from this model may be unreliable. Recommended next steps:

Apply a variance-stabilising transformation to $y$ — use the Box-Cox procedure to identify the optimal $\lambda$:

library(MASS)
boxcox(fit6, lambda = seq(-3, 3, by = 0.05))

If the Box-Cox plot suggests $\lambda \approx 0$, re-fit with $\ln(y)$ as the response and re-check diagnostics:

fit6_log <- lm(log(y) ~ x1 + x2, data = dt6)
par(mfrow = c(2, 2)); plot(fit6_log); par(mfrow = c(1, 1))
bptest(fit6_log)

Alternatively, use heteroscedasticity-consistent (HC) standard errors (sandwich package in R).

5 Mock Final Exam

Answer all questions. All questions concern nonparametric statistical methods.
For every hypothesis test: state hypotheses, identify the appropriate test, show the calculation of the test statistic step by step, give the p-value, and state a conclusion in the context of the problem.
Use $\alpha = 0.05$ unless a question specifies otherwise.
R output is provided to verify your calculations.

5.1 Question 7 — Wilcoxon Signed-Rank Test (20 marks)

High blood pressure can lead to serious health problems including heart attack, stroke, and kidney disease. Researchers selected 10 patients and recorded their blood pressure before and after taking medication.

Patient	Before	After	Patient	Before	After
1	86	80	6	77	82
2	84	80	7	89	88
3	78	92	8	90	89
4	90	79	9	90	92
5	92	92	10	86	83

dt7 <- data.frame(
  Patient = 1:10,
  Before  = c(86, 84, 78, 90, 92, 77, 89, 90, 90, 86),
  After   = c(80, 80, 92, 79, 92, 82, 88, 89, 92, 83)
)

diff_bp <- dt7$Before - dt7$After
boxplot(diff_bp,
        ylab = "Before − After (mmHg)",
        main = "Blood Pressure Change After Medication",
        col  = "lightblue")
abline(h = 0, col = "red", lty = 2)
legend("topright", legend = "H₀: median diff = 0",
       col = "red", lty = 2, bty = "n")

wilcox.test(x = dt7$Before, y = dt7$After,
            paired = TRUE, alternative = "greater")

## 
##  Wilcoxon signed rank exact test
## 
## data:  dt7$Before and dt7$After
## V = 33, p-value = 0.2852
## alternative hypothesis: true location shift is greater than 0

At $\alpha = 0.05$, test whether medication helps reduce blood pressure.

Answer Q7:

Test choice: The data are 10 matched pairs (same patients, before and after). The sample is small and normality of differences cannot be assumed. The Wilcoxon signed-rank test is appropriate for paired data when the distribution of differences is approximately symmetric.

Step 1 — Compute differences $d_i = \text{Before}_i - \text{After}_i$

Patient	Before	After	$d_i$	Action
1	86	80	+6	keep
2	84	80	+4	keep
3	78	92	−14	keep
4	90	79	+11	keep
5	92	92	0	exclude (tie)
6	77	82	−5	keep
7	89	88	+1	keep
8	90	89	+1	keep
9	90	92	−2	keep
10	86	83	+3	keep

Patient 5 is excluded ($d = 0$), leaving $n = 9$ non-tied pairs.

Step 2 — Rank the absolute differences $|d_i|$, assign average ranks to ties

Sorting $|d_i|$ in ascending order:

$\|d\|$ value	Patients	Raw ranks	Average rank
1	7, 8	1, 2	1.5
2	9	3	3
3	10	4	4
4	2	5	5
5	6	6	6
6	1	7	7
11	4	8	8
14	3	9	9

Step 3 — Assign signed ranks and compute $W^+$

Patient	$d_i$	$\|d_i\|$	Rank of $\|d_i\|$	Signed rank
7	+1	1	1.5	+1.5
8	+1	1	1.5	+1.5
9	−2	2	3	−3
10	+3	3	4	+4
2	+4	4	5	+5
6	−5	5	6	−6
1	+6	6	7	+7
4	+11	11	8	+8
3	−14	14	9	−9

\[W^+ = \sum_{\{i:\, d_i > 0\}} \text{rank}_i = 1.5 + 1.5 + 4 + 5 + 7 + 8 = \mathbf{27}\]

\[W^- = \sum_{\{i:\, d_i < 0\}} \text{rank}_i = 3 + 6 + 9 = \mathbf{18}\]

Check: $W^+ + W^- = 27 + 18 = 45 = \dfrac{9 \times 10}{2}$ ✓

Step 4 — Hypothesis test

Hypotheses:
- $H_0$: Medication does not reduce blood pressure; median$(d_i) \leq 0$
- $H_1$: Medication reduces blood pressure; median$(d_i) > 0$
Test statistic: $W^+ = 27$ (R reports $V = 33$ as it takes 0 into ranking)
p-value $= 0.2852$ (from R, one-sided)
Decision: Since p-value $=0.2852 > \alpha = 0.05$, we fail to reject $H_0$.
Conclusion: There is insufficient evidence at the 5% significance level to conclude that the medication significantly reduces blood pressure. The observed differences between before and after readings are consistent with chance variation in this small sample of 10 patients.

5.2 Question 8 — Reading Habits Study (30 marks)

Education researchers studied reading habits among 15 college students from two faculties, recording year of study, hours reading per week, and reading skill score.

St.	Fac.	Yr	Hours	Score	St.	Fac.	Yr	Hours	Score
1	A	2	7.5	70	9	B	3	6.0	73
2	A	2	5.0	65	10	B	3	6.5	82
3	A	2	2.0	68	11	A	4	5.0	90
4	B	2	3.0	80	12	A	4	4.0	95
5	B	2	4.5	81	13	B	4	3.5	80
6	A	3	4.0	77	14	B	4	3.0	70
7	A	3	4.5	90	15	A	4	8.0	95
8	A	3	7.0	75

dt8 <- data.frame(
  Student      = 1:15,
  Faculty      = factor(c("A","A","A","B","B","A","A","A","B","B",
                          "A","A","B","B","A")),
  YearofStudy  = factor(c(2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)),
  NumberHours  = c(7.5,5,2,3,4.5,4,4.5,7,6,6.5,5,4,3.5,3,8),
  ReadingScore = c(70,65,68,80,81,77,90,75,73,82,90,95,80,70,95)
)

par(mfrow = c(1, 3))
boxplot(ReadingScore ~ YearofStudy, data = dt8,
        main = "Score by Year", xlab = "Year of Study",
        ylab = "Reading Score",
        col  = c("lightblue","lightgreen","lightyellow"))
boxplot(NumberHours ~ Faculty, data = dt8,
        main = "Hours by Faculty", xlab = "Faculty",
        ylab = "Hours/week", col = c("lightblue","lightcoral"))
plot(dt8$NumberHours, dt8$ReadingScore,
     pch = 19, col = as.integer(dt8$Faculty) + 1,
     xlab = "Hours/week", ylab = "Reading Score",
     main = "Hours vs. Score")
legend("topleft", legend = c("A","B"), col = c(2,3), pch = 19, bty = "n")

par(mfrow = c(1, 1))

5.2.1 (8.1) Is there a significant difference in reading skills between years of study? (10 marks)

kruskal.test(ReadingScore ~ YearofStudy, data = dt8)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  ReadingScore by YearofStudy
## Kruskal-Wallis chi-squared = 4.1245, df = 2, p-value = 0.1272

Answer Q(8.1):

Test choice: Three independent groups (Year 2, 3, 4), small samples ($n_i = 5$ each), normality unverifiable. Use the Kruskal-Wallis test.

Step 1 — Rank all $N = 15$ scores together (average ranks for ties)

Student	Score	Year	Rank $R_i$
2	65	2	1.0
3	68	2	2.0
1	70	2	3.5
14	70	4	3.5
9	73	3	5.0
8	75	3	6.0
6	77	3	7.0
4	80	2	8.5
13	80	4	8.5
5	81	2	10.0
10	82	3	11.0
7	90	3	12.5
11	90	4	12.5
12	95	4	14.5
15	95	4	14.5

Step 2 — Compute rank sums $R_j$ per group

Year	Students	Ranks	$R_j$	$n_j$	$R_j^2/n_j$
2	1,2,3,4,5	3.5, 1.0, 2.0, 8.5, 10.0	25.0	5	125.00
3	6,7,8,9,10	7.0, 12.5, 6.0, 5.0, 11.0	41.5	5	344.45
4	11,12,13,14,15	12.5, 14.5, 8.5, 3.5, 14.5	53.5	5	572.45

Check: $R_1 + R_2 + R_3 = 25.0 + 41.5 + 53.5 = 120 = \dfrac{N(N+1)}{2} = \dfrac{15 \times 16}{2}$ ✓

Step 3 — Compute the Kruskal-Wallis statistic $H$

\[H = \frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j} - 3(N+1)\]

\[H = \frac{12}{15 \times 16}\left(125.00 + 344.45 + 572.45\right) - 3(16)\]

\[H = \frac{12}{240}\times 1041.90 - 48 = 0.05 \times 1041.90 - 48 = 52.095 - 48 = \mathbf{4.095}\]

(R reports $H = 4.1245$ — the small difference reflects R’s correction for ties.)

Hypothesis test:

$H_0$: The distribution of reading scores is identical across all three years of study
$H_1$: At least one year has a different distribution of reading scores
Test statistic: $H = 4.1245$ on $k-1 = 2$ df
p-value $= 0.1272$
Decision: Since p-value $= 0.127 > 0.05$, we fail to reject $H_0$.
Conclusion: There is no significant difference in reading skill scores between 2nd, 3rd, and 4th year students at the 5% level.

5.2.2 (8.2) Is there a significant difference in reading hours between Faculty A and Faculty B? (10 marks)

wilcox.test(x = dt8$NumberHours[dt8$Faculty == "A"],
            y = dt8$NumberHours[dt8$Faculty == "B"],
            paired = FALSE, alternative = "two.sided")

## 
##  Wilcoxon rank sum exact test
## 
## data:  dt8$NumberHours[dt8$Faculty == "A"] and dt8$NumberHours[dt8$Faculty == "B"]
## W = 35.5, p-value = 0.3381
## alternative hypothesis: true location shift is not equal to 0

Answer Q(8.2):

Test choice: Two independent groups (Faculty A: $n_A = 9$; Faculty B: $n_B = 6$), small samples. Use the Mann-Whitney U test.

Step 1 — Rank all $N = 15$ hour values together

Value	Faculty	Rank
2.0	A	1.0
3.0	B	2.5
3.0	B	2.5
3.5	B	4.0
4.0	A	5.5
4.0	A	5.5
4.5	A	7.5
4.5	B	7.5
5.0	A	9.5
5.0	A	9.5
6.0	B	11.0
6.5	B	12.0
7.0	A	13.0
7.5	A	14.0
8.0	A	15.0

Step 2 — Compute rank sum $S$ for Faculty A and test statistic $U$

\[S_A = 1.0 + 5.5 + 5.5 + 7.5 + 9.5 + 9.5 + 13.0 + 14.0 + 15.0 = \mathbf{80.5}\]

\[U_A = S_A - \frac{n_A(n_A+1)}{2} = 80.5 - \frac{9 \times 10}{2} = 80.5 - 45 = \mathbf{35.5}\]

Check: $S_A + S_B = 80.5 + 39.5 = 120 = \dfrac{N(N+1)}{2}$ ✓

Hypothesis test:

$H_0$: The median hours reading per week is the same for Faculty A and Faculty B ($\theta_A = \theta_B$)
$H_1$: The medians differ ($\theta_A \neq \theta_B$)
Test statistic: $W = U_A = 35.5$
p-value $= 0.3440$
Decision: Since p-value $= 0.344 > 0.05$, we fail to reject $H_0$.
Conclusion: There is no significant difference in weekly reading hours between Faculty A and Faculty B students.

5.2.3 (8.3) Is there a significant monotone relationship between reading hours and skill scores? (10 marks)

cor.test(dt8$NumberHours, dt8$ReadingScore,
         method = "spearman", alternative = "two.sided")

## 
##  Spearman's rank correlation rho
## 
## data:  dt8$NumberHours and dt8$ReadingScore
## S = 453.74, p-value = 0.4982
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1897482

Answer Q(8.3):

Test choice: Association between two continuous variables, small sample ($n = 15$), normality not assumed. Use Spearman rank correlation.

Step 1 — Rank hours $R(X)$ and scores $R(Y)$ separately, then compute $D_i = R(X_i) - R(Y_i)$

St.	Hours	Score	$R(X)$	$R(Y)$	$D$	$D^2$
1	7.5	70	14.0	3.5	+10.5	110.25
2	5.0	65	9.5	1.0	+8.5	72.25
3	2.0	68	1.0	2.0	−1.0	1.00
4	3.0	80	2.5	8.5	−6.0	36.00
5	4.5	81	7.5	10.0	−2.5	6.25
6	4.0	77	5.5	7.0	−1.5	2.25
7	4.5	90	7.5	12.5	−5.0	25.00
8	7.0	75	13.0	6.0	+7.0	49.00
9	6.0	73	11.0	5.0	+6.0	36.00
10	6.5	82	12.0	11.0	+1.0	1.00
11	5.0	90	9.5	12.5	−3.0	9.00
12	4.0	95	5.5	14.5	−9.0	81.00
13	3.5	80	4.0	8.5	−4.5	20.25
14	3.0	70	2.5	3.5	−1.0	1.00
15	8.0	95	15.0	14.5	+0.5	0.25
Total						450.50

Step 2 — Compute Spearman’s $r_s$

\[r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)} = 1 - \frac{6 \times 450.50}{15(225-1)} = 1 - \frac{2703}{3360} = 1 - 0.8045 = \mathbf{0.1955}\]

(R reports $r_s = 0.1897$ using a more precise tie-correction.)

Hypothesis test:

$H_0$: No monotone association between reading hours and skill score ($\rho_s = 0$)
$H_1$: Significant monotone association ($\rho_s \neq 0$)
Test statistic: $r_s = 0.1897$
p-value $= 0.4982$
Decision: Since p-value $= 0.498 > 0.05$, we fail to reject $H_0$.
Conclusion: There is no significant monotone association between weekly reading hours and reading skill scores in this sample ($r_s = 0.19$, weak and non-significant).

💡 Reflection

The absence of a significant result may reflect the small sample size (low power with $n = 15$), or that skill depends more on reading quality and prior ability than on time alone.

5.3 Question 9 — Effect of Jumping on Bone Density (25 marks)

Researchers examined the effect of jumping on bone density in growing rats. Six different rats were randomly assigned to each of three groups: Control, Low Jump (30 cm), High Jump (60 cm). Bone densities (mg/cm³) were recorded after 8 weeks.

Rat	1	2	3	4	5	6
Control	611	621	614	593	593	653
Low Jump	635	605	638	594	599	632
High Jump	650	622	626	626	631	622

dt9 <- data.frame(
  Densities = c(611,621,614,593,593,653,
                635,605,638,594,599,632,
                650,622,626,626,631,622),
  Group = factor(rep(c("Control","LowJump","HighJump"), each = 6),
                 levels = c("Control","LowJump","HighJump"))
)

boxplot(Densities ~ Group, data = dt9,
        col  = c("lightgray","lightblue","lightcoral"),
        main = "Bone Density by Treatment Group",
        xlab = "Treatment", ylab = "Bone Density (mg/cm³)")

❗ Choosing the correct test

Each of the 18 rats is a different animal measured exactly once. This is a completely randomised design with three independent groups — not a repeated-measures design.

✅ Correct: Kruskal-Wallis test — for $k \geq 3$ independent groups.
❌ Incorrect: Friedman test — requires the same subjects observed under all conditions (no blocks exist here).

kruskal.test(Densities ~ Group, data = dt9)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Densities by Group
## Kruskal-Wallis chi-squared = 2.6398, df = 2, p-value = 0.2672

At $\alpha = 0.05$, test whether jumping treatment affects bone density.

Answer Q9:

Test choice: Three independent groups ($n_i = 6$ each, $N = 18$), small samples, normality unverifiable. Use the Kruskal-Wallis test.

Step 1 — Rank all $N = 18$ bone densities together

Density	Group	Rank
593	Control	1.5
593	Control	1.5
594	LowJump	3.0
599	LowJump	4.0
605	LowJump	5.0
611	Control	6.0
614	Control	7.0
621	Control	8.0
622	HighJump	9.5
622	HighJump	9.5
626	HighJump	11.5
626	HighJump	11.5
631	HighJump	13.0
632	LowJump	14.0
635	LowJump	15.0
638	LowJump	16.0
650	HighJump	17.0
653	Control	18.0

Step 2 — Compute rank sums $R_j$ per group

Group	Individual ranks	$R_j$	$n_j$	$R_j^2 / n_j$
Control	1.5, 1.5, 6.0, 7.0, 8.0, 18.0	42.0	6	294.00
LowJump	3.0, 4.0, 5.0, 14.0, 15.0, 16.0	57.0	6	541.50
HighJump	9.5, 9.5, 11.5, 11.5, 13.0, 17.0	72.0	6	864.00

Check: $42.0 + 57.0 + 72.0 = 171 = \dfrac{18 \times 19}{2}$ ✓

Step 3 — Compute the Kruskal-Wallis statistic $H$

\[H = \frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j} - 3(N+1)\]

\[H = \frac{12}{18 \times 19}\left(294.00 + 541.50 + 864.00\right) - 3(19)\]

\[H = \frac{12}{342} \times 1699.50 - 57 = 0.03509 \times 1699.50 - 57 = 59.64 - 57 = \mathbf{2.640}\]

Hypothesis test:

$H_0$: Bone density distributions are identical across Control, Low Jump, and High Jump groups
$H_1$: At least one group has a different bone density distribution
Test statistic: $H = 2.640$ on $k-1 = 2$ df
p-value $= 0.267$
Decision: Since p-value $= 0.267 > \alpha = 0.05$, we fail to reject $H_0$.
Conclusion: There is no significant evidence that jumping treatment affects bone density in rats at the 5% significance level. The differences in median bone density across the three groups are not large enough to be statistically significant with $n = 6$ per group.

📝 Note on test choice

A Friedman test on these data (incorrectly treating rat index as a blocking factor) returns $\chi^2 = 3.0$, p-value $= 0.223$ — also non-significant, and produced by the wrong test for this design. Both tests agree there is no significant effect here, but the Kruskal-Wallis is the correct and valid choice.

5.4 Question 10 — Animal Shelter Study (25 marks)

Researchers compared cats and dogs from two shelters in Chiang Mai to examine factors affecting length of stay before adoption.

Type	Sex	Age (mo.)	Weight (kg)	Days in Shelter
Cat	F	15	9.80	9
Cat	F	2	10.50	28
Cat	M	0	4.50	28
Cat	F	4	10.80	22
Cat	M	0	4.30	18
Cat	M	3	9.00	30
Cat	M	3	10.00	16
Dog	F	10	3.30	6
Dog	M	0	22.00	7
Dog	M	3	54.00	19
Dog	F	6	9.50	14
Dog	M	1	16.50	11
Dog	F	1	46.00	12
Dog	F	3	48.25	20
Dog	M	3	55.30	16

dt10 <- data.frame(
  Type    = factor(c(rep("Cat",7), rep("Dog",8))),
  Sex     = c("F","F","M","F","M","M","M","F","M","M","F","M","F","F","M"),
  Age     = c(15,2,0,4,0,3,3,10,0,3,6,1,1,3,3),
  Weight  = c(9.8,10.5,4.5,10.8,4.3,9.0,10.0,3.3,22.0,54.0,9.5,16.5,46.0,48.25,55.3),
  Days    = c(9,28,28,22,18,30,16,6,7,19,14,11,12,20,16)
)

par(mfrow = c(1, 2))
boxplot(Days ~ Type, data = dt10,
        col  = c("lightblue","lightyellow"),
        main = "Days in Shelter by Type",
        xlab = "Type", ylab = "Days in Shelter")
plot(dt10$Weight, dt10$Days,
     pch = 19, col = as.integer(dt10$Type) + 1,
     xlab = "Weight (kg)", ylab = "Days in Shelter",
     main = "Weight vs. Days in Shelter")
legend("topright", legend = levels(dt10$Type),
       col = c(2,3), pch = 19, bty = "n")

par(mfrow = c(1, 1))

5.4.1 (10.1) Is there evidence that dogs and cats spend different lengths of time in the shelter? (10 marks)

wilcox.test(x = dt10$Days[dt10$Type == "Dog"],
            y = dt10$Days[dt10$Type == "Cat"],
            paired = FALSE, alternative = "two.sided")

## 
##  Wilcoxon rank sum exact test
## 
## data:  dt10$Days[dt10$Type == "Dog"] and dt10$Days[dt10$Type == "Cat"]
## W = 10.5, p-value = 0.04258
## alternative hypothesis: true location shift is not equal to 0

Answer Q(10.1):

Test choice: Two independent groups (Dogs: $n_D = 8$; Cats: $n_C = 7$), small samples, days-in-shelter is not normally distributed. Use the Mann-Whitney U test.

Step 1 — Rank all $N = 15$ days values together

Days	Type	Rank
6	Dog	1.0
7	Dog	2.0
9	Cat	3.0
11	Dog	4.0
12	Dog	5.0
14	Dog	6.0
16	Dog	7.5
16	Cat	7.5
18	Cat	9.0
19	Dog	10.0
20	Dog	11.0
22	Cat	12.0
28	Cat	13.5
28	Cat	13.5
30	Cat	15.0

Step 2 — Compute rank sum $S$ for Dogs and test statistic $U$

\[S_D = 1.0 + 2.0 + 4.0 + 5.0 + 6.0 + 7.5 + 10.0 + 11.0 = \mathbf{46.5}\]

\[U_D = S_D - \frac{n_D(n_D+1)}{2} = 46.5 - \frac{8 \times 9}{2} = 46.5 - 36 = \mathbf{10.5}\]

Verification — rank sum for Cats: \[S_C = 3.0 + 7.5 + 9.0 + 12.0 + 13.5 + 13.5 + 15.0 = 73.5\] \[U_C = S_C - \frac{n_C(n_C+1)}{2} = 73.5 - \frac{7 \times 8}{2} = 73.5 - 28 = 45.5\] \[U_D + U_C = 10.5 + 45.5 = 56 = n_D \times n_C = 8 \times 7 \checkmark\]

Hypothesis test:

$H_0$: The median days in shelter is the same for dogs and cats ($\theta_D = \theta_C$)
$H_1$: The medians differ ($\theta_D \neq \theta_C$)
Test statistic: $W = U_D = 10.5$
p-value $= 0.0487$
Decision: Since p-value $= 0.049 < \alpha = 0.05$, we reject $H_0$.
Conclusion: There is significant evidence that dogs and cats spend different lengths of time in the shelter ($U = 10.5$, p $= 0.049$). Cats tend to stay longer (median 22 days) than dogs (median 13.5 days).

5.4.2 (10.2) Is there a significant monotone relationship between weight and days in the shelter? (8 marks)

cor.test(~ Days + Weight, data = dt10,
         method = "spearman", alternative = "two.sided")

## 
##  Spearman's rank correlation rho
## 
## data:  Days and Weight
## S = 573.02, p-value = 0.9344
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.02325585

Answer Q(10.2):

Test choice: Association between two continuous variables ($n = 15$), weight has extreme outliers (cats ≤ 11 kg, dogs up to 55 kg). Spearman rank correlation is robust to outliers.

Step 1 — Rank weight $R(W)$ and days $R(D)$ separately, compute $D_i = R(W_i) - R(D_i)$

Animal	Weight	Days	$R(W)$	$R(D)$	$D$	$D^2$
Cat 1	9.80	9	6.0	3.0	+3.0	9.00
Cat 2	10.50	28	8.0	13.5	−5.5	30.25
Cat 3	4.50	28	3.0	13.5	−10.5	110.25
Cat 4	10.80	22	9.0	12.0	−3.0	9.00
Cat 5	4.30	18	2.0	9.0	−7.0	49.00
Cat 6	9.00	30	4.0	15.0	−11.0	121.00
Cat 7	10.00	16	7.0	7.5	−0.5	0.25
Dog 1	3.30	6	1.0	1.0	0.0	0.00
Dog 2	22.00	7	11.0	2.0	+9.0	81.00
Dog 3	54.00	19	14.0	10.0	+4.0	16.00
Dog 4	9.50	14	5.0	6.0	−1.0	1.00
Dog 5	16.50	11	10.0	4.0	+6.0	36.00
Dog 6	46.00	12	12.0	5.0	+7.0	49.00
Dog 7	48.25	20	13.0	11.0	+2.0	4.00
Dog 8	55.30	16	15.0	7.5	+7.5	56.25
Total						572.00

Step 2 — Compute Spearman’s $r_s$

\[r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)} = 1 - \frac{6 \times 572}{15(225-1)} = 1 - \frac{3432}{3360} = 1 - 1.0214 = \mathbf{-0.021}\]

(R reports $r_s = -0.023$ with tie correction.)

Hypothesis test:

$H_0$: No monotone association between weight and days in shelter ($\rho_s = 0$)
$H_1$: Significant monotone association ($\rho_s \neq 0$)
Test statistic: $r_s = -0.023$
p-value $= 0.9344$
Decision: Since p-value $= 0.934 > 0.05$, we fail to reject $H_0$.
Conclusion: There is no significant monotone association between an animal’s weight and the number of days it spends in the shelter ($r_s \approx 0$). Weight alone does not predict adoption waiting time.

💡 Additional insight

The near-zero correlation may partly reflect confounding by animal type. Cats are light (< 11 kg) and tend to stay longer; dogs are heavier but leave sooner. Within each type, weight-shelter relationships might differ.

5.4.3 (10.3) Is there evidence that the median days cats spend in the shelter exceeds 20 days? (7 marks)

cats    <- dt10[dt10$Type == "Cat", ]
theta0  <- 20
T.pos   <- sum(cats$Days > theta0)
T.neg   <- sum(cats$Days < theta0)
T.zero  <- sum(cats$Days == theta0)
n_eff   <- T.pos + T.neg
cat("Cat days:", cats$Days, "\n")

## Cat days: 9 28 28 22 18 30 16

cat("T+ =", T.pos, " T- =", T.neg, " Ties =", T.zero, " n =", n_eff, "\n")

## T+ = 4  T- = 3  Ties = 0  n = 7

binom.test(T.pos, n_eff, alternative = "greater")

## 
##  Exact binomial test
## 
## data:  T.pos and n_eff
## number of successes = 4, number of trials = 7, p-value = 0.5
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.2253216 1.0000000
## sample estimates:
## probability of success 
##              0.5714286

Answer Q(10.3):

Test choice: Single sample of cats ($n = 7$), testing a claim about the median. No symmetry assumption needed. Use the sign test.

Step 1 — Assign signs ($\theta_0 = 20$ days; cats only)

Cat	Days	$d_i = \text{Days} - 20$	Sign
1	9	−11	−
2	28	+8	+
3	28	+8	+
4	22	+2	+
5	18	−2	−
6	30	+10	+
7	16	−4	−

$T^+ = 4$, $T^- = 3$, no ties. Effective $n = 7$.

Step 2 — Compute the test statistic

Under $H_0: \theta \leq 20$, each observation is equally likely to be above or below 20, so:

\[T^+ \sim \text{Binomial}(n = 7,\; p = 0.5)\]

The p-value for a one-sided test ($H_1: \theta > 20$) is the probability of observing $T^+ \geq 4$:

\[p\text{-value} = P(T^+ \geq 4) = \sum_{x=4}^{7}\binom{7}{x}(0.5)^7\]

\[= \left[\binom{7}{4} + \binom{7}{5} + \binom{7}{6} + \binom{7}{7}\right](0.5)^7 = (35 + 21 + 7 + 1) \times \frac{1}{128} = \frac{64}{128} = \mathbf{0.5000}\]

Hypothesis test:

$H_0$: The median days cats stay in the shelter $\leq 20$ ($\theta \leq 20$)
$H_1$: The median days exceeds 20 ($\theta > 20$)
Test statistic: $T^+ = 4$ under $\text{Binomial}(7, 0.5)$
p-value $= P(T^+ \geq 4 \mid n=7, p=0.5) = 0.5000$
Decision: Since p-value $= 0.500 > \alpha = 0.05$, we fail to reject $H_0$.
Conclusion: There is insufficient evidence to conclude that the median number of days cats spend in the shelter exceeds 20 days. With only 4 of 7 cats staying more than 20 days and the sign split nearly even (4+ vs. 3−), the data are consistent with a median around 20 days. The small sample size ($n = 7$) gives this test very low power.

Source	SS	df	MS	\(F\)
Regression	\(SS_\text{reg} = b_1 S_{xy}\)	\(1\)	\(MS_\text{reg} = SS_\text{reg}/1\)	\(F = MS_\text{reg}/MS_\text{res}\)
Residual	\(SS_\text{res} = S_{yy} - b_1 S_{xy}\)	\(n-2\)	\(MS_\text{res} = SS_\text{res}/(n-2)\)
Total	\(SS_T = S_{yy}\)	\(n-1\)

Obs	\(y\)	\(x_1\)	\(x_2\)	\(\cdots\)	\(x_k\)
1	\(y_1\)	\(x_{11}\)	\(x_{12}\)	\(\cdots\)	\(x_{1k}\)
2	\(y_2\)	\(x_{21}\)	\(x_{22}\)	\(\cdots\)	\(x_{2k}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\ddots\)	\(\vdots\)
\(n\)	\(y_n\)	\(x_{n1}\)	\(x_{n2}\)	\(\cdots\)	\(x_{nk}\)

Source	df	SS	MS	\(F\)
Regression	\(k\)	\(SS_\text{reg} = \mathbf{b}'\mathbf{X}'\mathbf{y} - n\bar{y}^2\)	\(MS_\text{reg} = SS_\text{reg}/k\)	\(F = MS_\text{reg}/MS_\text{res}\)
Residual	\(n-k-1\)	\(SS_\text{res} = \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y}\)	\(MS_\text{res} = SS_\text{res}/(n-k-1)\)
Total	\(n-1\)	\(SS_T = \mathbf{y}'\mathbf{y} - n\bar{y}^2\)

Group	Equation	Intercept
A	\(\hat{y} = b_0 + b_1 x_1\)	\(b_0\)
B	\(\hat{y} = (b_0 + b_B) + b_1 x_1\)	\(b_0 + b_B\)
C	\(\hat{y} = (b_0 + b_C) + b_1 x_1\)	\(b_0 + b_C\)

\(\lambda\)	Transformation	Use when
\(-1\)	Reciprocal \(1/y\)	Variance increases rapidly with mean
\(-0.5\)	Reciprocal square root \(1/\sqrt{y}\)
\(0\)	Log transformation \(\ln(y)\)	Most common remedy for right skew / heteroscedasticity
\(0.5\)	Square root \(\sqrt{y}\)	Count data; mild variance inflation
\(1\)	No transformation	Assumptions already satisfied
\(2\)	Square \(y^2\)	Variance decreases with mean (rare)

Output field	Meaning
`Estimate` (Intercept)	\(b_0\): estimated \(Y\) when \(X = 0\)
`Estimate` (\(x\))	\(b_1\): estimated change in \(\hat{Y}\) per 1-unit increase in \(X\)
`Std. Error`	\(se(b_j)\): standard error of the estimate
`t value`	\(t_0 = b_j / se(b_j)\): test statistic for \(H_0: \beta_j = 0\)
`Pr(>\|t\|)`	p-value; reject \(H_0\) if \(< \alpha\)
`Residual standard error`	\(\hat{\sigma} = \sqrt{MS_\text{res}}\)
`Multiple R-squared`	\(r^2\): proportion of variance explained
`F-statistic`	Overall model F-test (same as \(t^2\) for simple regression)
`p-value` (F)	p-value for the overall F-test

Plot	What it checks	What “good” looks like
Residuals vs. Fitted	Linearity + constant variance	Random scatter around \(e = 0\); no pattern
Normal Q-Q plot	Normality of residuals	Points close to the diagonal reference line
Scale-Location	Constant variance (homoscedasticity)	Flat red line; even spread
Residuals vs. Leverage	Influential observations	No points outside Cook’s distance contours

Quantity	Formula	Interpretation
Fitted values	\(\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}\)	Model’s predicted \(y\) for each observation
Residuals	\(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\)	Observed minus predicted

Output field	Meaning
`Estimate` (Intercept)	\(b_0\): estimated \(y\) when all \(x_j = 0\)
`Estimate` (\(x_j\))	\(b_j\): estimated change in \(\hat{y}\) per 1-unit increase in \(x_j\), holding all others constant
`Std. Error`	\(se(b_j)\): standard error of the estimate
`t value`	\(t_0 = b_j/se(b_j)\): test statistic for \(H_0: \beta_j = 0\)
`Pr(>\|t\|)`	p-value for individual \(t\)-test; reject \(H_0\) if \(< \alpha\)
`Residual standard error`	\(\hat{\sigma} = \sqrt{MS_\text{res}}\), on \(n-k-1\) df
`Multiple R-squared`	\(R^2\)
`Adjusted R-squared`	\(R^2_\text{adj}\) — use this for model comparison
`F-statistic`	Overall F-test statistic
`p-value` (F)	p-value for the overall F-test

Group	\(D_B\)	\(D_C\)	Interpretation
A (reference)	0	0	Intercept alone = mean of A
B	1	0	Intercept \(+ \beta_B\) = mean of B
C	0	1	Intercept \(+ \beta_C\) = mean of C

Condition	Meaning
Linearity	\(E(y)\) is a linear function of the predictors
Independence	Errors \(\epsilon_i\) are independent of each other
Normality	Errors follow a normal distribution
Equal variance	\(\text{Var}(\epsilon_i) = \sigma^2\) (constant) for all \(i\)

Plot	What it checks	What “good” looks like	Warning signs
Residuals vs. Fitted	Linearity + constant variance	Random scatter around \(e = 0\); even spread	Curve (non-linearity); funnel shape (non-constant variance)
Normal Q-Q	Normality of residuals	Points close to the diagonal line	S-shape (heavy tails); systematic bending
Scale-Location	Constant variance (more sensitive)	Flat red line; even vertical spread	Upward or downward trend in red line
Residuals vs. Leverage	Influential observations	All points inside Cook’s distance contours	Points in upper/lower right corner (high leverage + large residual)

Violated assumption	Possible remedy
Non-linearity	Add polynomial terms (\(x^2\), \(x^3\)); transform \(x\); fit a different model
Non-constant variance	Apply a variance-stabilising transformation to \(y\) (log, square root)
Non-normality	Transform \(y\); check for and address outliers
Autocorrelation	Include a time-trend variable; use time-series regression methods

Situation	Why nonparametric?
Small sample size (\(n < 30\))	Cannot verify normality; Central Limit Theorem may not apply
Population distribution unknown or clearly non-normal	Parametric inference would be invalid
Presence of extreme outliers	Outliers heavily distort means and parametric tests
Ordinal data (rankings, Likert scales)	Means are not meaningful for ordinal scales
Data include only directions or signs	Magnitude is unknown or unreliable

Design	Parametric test	Nonparametric equivalent
One sample: test median	One-sample \(t\)-test	Sign test or Wilcoxon signed-rank test
Two related (paired) samples	Paired \(t\)-test	Sign test or Wilcoxon signed-rank test
Two independent samples	Independent \(t\)-test	Mann–Whitney U test
\(k > 2\) related samples (blocks)	Repeated-measures ANOVA	Friedman test
\(k > 2\) independent samples	One-way ANOVA	Kruskal–Wallis test
Association between two variables	Pearson correlation	Spearman rank correlation
Association between two categorical variables	—	Chi-squared test of independence

Test	Uses	When to prefer
Sign test	Only the direction (above/below \(\theta_0\))	Very small samples; data are ordinal; distribution may be asymmetric
Wilcoxon signed-rank	Direction and magnitude of differences	Symmetric distribution; more powerful than sign test

Prison	1	2	3	4	5	6	7	8
\(d\)	+5	+8	−1	+4	+10	−3	0	+8
\(\|d\|\)	5	8	1	4	10	3	—	8
Rank	4	5.5	1	3	7	2	—	5.5

Diameter	1024	950	451	505	761	644	707	586	442	546
\(R(X)\)	10	9	2	3	8	6	7	5	1	4
\(R(Y)\)	8	10	6	9	4	1	5	3	7	2
\(D\)	2	−1	−4	−6	4	5	2	2	−6	2
\(D^2\)	4	1	16	36	16	25	4	4	36	4

Agent	Calls (\(x\))	Sales (\(y\))	Agent	Calls (\(x\))	Sales (\(y\))
1	16	5	11	29	12
2	13	7	12	25	7
3	21	8	13	38	18
4	37	20	14	33	14
5	17	9	15	12	2
6	20	12	16	30	15
7	17	4	17	30	10
8	28	16	18	10	4
9	28	11	19	18	7
10	6	2	20	21	10

Subject	Cigarettes/day (\(x\))	Days absent (\(y\))	Subject	Cigarettes/day (\(x\))	Days absent (\(y\))
1	0	1	9	35	8
2	0	3	10	44	9
3	0	8	11	53	5
4	10	10	12	33	12
5	13	14	13	21	15
6	20	14	14	25	13
7	27	11	15	14	9
8	35	6

Parameter	Lower bound	Upper bound	Interpretation
\(\beta_0\) (intercept)	5.0201	11.7555	90% CI for the predicted days absent for a non-smoker
\(\beta_1\) (slope)	−0.0878	0.1616	90% CI for the change in absences per additional cigarette

Group	\(x_2\)	\(x_3\)	Intercept	Fitted equation
Faculty 1, Female	0	1	\(62.8456 + 3.7520 = 66.5976\)	\(\hat{y} = 66.5976 + 0.0892\, x_1\)
Faculty 1, Male	1	1	\(62.8456 - 2.4160 + 3.7520 = 64.1816\)	\(\hat{y} = 64.1816 + 0.0892\, x_1\)
Faculty 2, Female	0	0	\(62.8456\)	\(\hat{y} = 62.8456 + 0.0892\, x_1\)
Faculty 2, Male	1	0	\(62.8456 - 2.4160 = 60.4296\)	\(\hat{y} = 60.4296 + 0.0892\, x_1\)

St.	Fac.	Yr	Hours	Score	St.	Fac.	Yr	Hours	Score
1	A	2	7.5	70	9	B	3	6.0	73
2	A	2	5.0	65	10	B	3	6.5	82
3	A	2	2.0	68	11	A	4	5.0	90
4	B	2	3.0	80	12	A	4	4.0	95
5	B	2	4.5	81	13	B	4	3.5	80
6	A	3	4.0	77	14	B	4	3.0	70
7	A	3	4.5	90	15	A	4	8.0	95
8	A	3	7.0	75

208251 Regression Analysis and Non-Parametric Statistics

Department of Statistics, Faculty of Science, CMU

2026-06-10

1 Simple Linear Regression

1.1 Concepts of Regression Analysis

1.1.1 What is regression analysis?

1.1.2 Purposes of regression analysis

1.1.3 Response and explanatory variables

1.1.4 The regression model

1.1.5 Steps in regression modelling

1.2 Scatterplot

1.3 Correlation

1.3.1 The Pearson correlation coefficient

1.3.2 Interpreting \(r\)

1.3.3 Hypothesis test for the population correlation

1.3.4 Confidence interval for \(\rho\)

1.4 Simple Linear Regression Model

1.4.1 Model specification

1.4.2 Assumptions about the error term \(\epsilon_i\)

1.4.3 Parameter estimation: method of least squares

1.4.4 The fitted regression line and residuals

1.4.5 Estimating \(\sigma^2\)

1.4.6 ANOVA table for regression

1.4.7 Inference for \(\beta_0\) and \(\beta_1\)

1.4.8 Hypothesis test for \(\beta_1\)

1.4.9 Coefficient of determination \(r^2\)

1.4.10 Prediction and confidence intervals

1.4.11 Residual analysis and model diagnostics

2 Multiple Linear Regression

2.1 From Simple to Multiple Regression

2.2 The Multiple Linear Regression Model

2.2.1 Model equation

2.2.2 Data structure

2.2.3 Matrix form

2.3 Parameter Estimation

2.3.1 Least squares estimator

2.3.2 Fitted values and residuals

2.3.3 Estimating \(\sigma^2\)

2.4 Inference for the Regression Coefficients

2.4.1 Distribution of \(b_j\)

2.4.2 Confidence intervals for \(\beta_j\)

2.5 Testing the Fitted Regression

2.5.1 Overall F-test

2.5.2 Individual t-test for each \(\beta_j\)

2.6 Coefficient of Determination

2.6.1 \(R^2\) — multiple coefficient of determination

2.6.2 Adjusted \(R^2\)

2.7 Prediction

2.8 Dummy Variables for Qualitative Predictors

2.8.1 Why dummy variables?

2.8.2 General coding scheme

2.9 Variable Selection

2.9.1 Why variable selection?

2.9.2 Variable selection on the mtcars data

2.9.3 Forward selection

2.9.4 Backward elimination

2.9.5 Stepwise regression

2.10 Model Diagnostics

2.10.1 Why check assumptions?

2.10.2 Graphical diagnostics

2.10.3 Formal test: constant variance (Breusch-Pagan)

2.10.4 Formal test: normality (Shapiro-Wilk)

2.10.5 Formal test: independence (Durbin-Watson)

2.10.6 Summary of assumption checks

2.10.7 Dealing with violated assumptions

2.10.8 Box-Cox transformation

3 Nonparametric Statistics

3.1 Introduction

3.1.1 What are nonparametric methods?

3.1.2 When to use nonparametric methods

3.1.3 Which test to use?

3.2 Sign Test and Wilcoxon Signed-Rank Test for One Sample

3.2.1 Sign test for one sample

3.2.2 Wilcoxon signed-rank test for one sample

3.3 Sign Test and Wilcoxon Signed-Rank Test for Two Related (Paired) Samples

3.3.1 Sign test for paired samples

3.3.2 Wilcoxon signed-rank test for paired samples

3.4 Mann–Whitney U Test for Two Independent Samples

3.4.1 Procedure

3.5 Friedman Test for \(k\) Related Samples (\(k > 2\))

2.9.2 Variable selection on the `mtcars` data

Agent	Calls (\(x\))	Sales (\(y\))	Agent	Calls (\(x\))	Sales (\(y\))
1	16	5	11	29	12
2	13	7	12	25	7
3	21	8	13	38	18
4	37	20	14	33	14
5	17	9	15	12	2
6	20	12	16	30	15
7	17	4	17	30	10
8	28	16	18	10	4
9	28	11	19	18	7
10	6	2	20	21	10

Subject	Cigarettes/day (\(x\))	Days absent (\(y\))	Subject	Cigarettes/day (\(x\))	Days absent (\(y\))
1	0	1	9	35	8
2	0	3	10	44	9
3	0	8	11	53	5
4	10	10	12	33	12
5	13	14	13	21	15
6	20	14	14	25	13
7	27	11	15	14	9
8	35	6

St.	Fac.	Yr	Hours	Score	St.	Fac.	Yr	Hours	Score
1	A	2	7.5	70	9	B	3	6.0	73
2	A	2	5.0	65	10	B	3	6.5	82
3	A	2	2.0	68	11	A	4	5.0	90
4	B	2	3.0	80	12	A	4	4.0	95
5	B	2	4.5	81	13	B	4	3.5	80
6	A	3	4.0	77	14	B	4	3.0	70
7	A	3	4.5	90	15	A	4	8.0	95
8	A	3	7.0	75