Many research questions involve understanding how one variable is related to another. Consider these examples:
Two closely related techniques help answer these questions:
A regression model expresses the relationship between a response variable (also called the dependent variable) and one or more explanatory variables (also called independent variables or predictors).
The true relationship between \(Y\) and \(X_1, X_2, \ldots, X_p\) is approximated by:
\[Y = f(X_1, X_2, \ldots, X_p) + \epsilon\]
Before fitting any model, we always begin by visualising the data. A scatterplot displays pairs \((x_i, y_i)\) — the predictor \(x\) on the horizontal axis and the response \(y\) on the vertical axis — and reveals the form, direction, and strength of the relationship.
What to look for in a scatterplot:
| Feature | What it means |
|---|---|
| Form | Is the pattern linear (a straight band) or curved? |
| Direction | Positive (both increase together) or negative? |
| Strength | How tightly do the points cluster around the pattern? |
| Outliers | Any unusual points that deviate from the overall pattern? |
data(cars)
plot(cars$speed, cars$dist,
xlab = "Speed (mph)", ylab = "Stopping Distance (ft)",
main = "Scatterplot: Speed vs. Stopping Distance",
pch = 19, col = "steelblue")The scatterplot shows that speed and dist
are positively and approximately linearly related — as speed increases,
stopping distance tends to increase.
library(MASS)
data(survey)
plot(survey$Age, survey$Pulse,
xlab = "Age (years)", ylab = "Pulse Rate (beats/min)",
main = "Scatterplot: Age vs. Pulse Rate",
pch = 19, col = "tomato")The scatterplot shows that Age and Pulse do
not appear to be linearly related — the points scatter widely with no
clear directional pattern.
The sample Pearson correlation coefficient \(r\) measures the strength and direction of the linear relationship between two quantitative variables. For \(n\) pairs \((x_1, y_1), \ldots, (x_n, y_n)\):
\[ r = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2 \;\cdot\; \displaystyle\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{n\sum x_i y_i - \sum x_i \sum y_i} {\sqrt{\bigl[n\sum x_i^2 - (\sum x_i)^2\bigr]\bigl[n\sum y_i^2 - (\sum y_i)^2\bigr]}} \]
The range is \(-1 \leq r \leq 1\).
| Value of \(|r|\) | Interpretation |
|---|---|
| 0.00 – 0.19 | Very weak (negligible) |
| 0.20 – 0.39 | Weak |
| 0.40 – 0.59 | Moderate |
| 0.60 – 0.79 | Strong |
| 0.80 – 1.00 | Very strong |
❗ Important
\(r\) measures linear association only. A value of \(r \approx 0\) does not mean the variables are unrelated — they might have a strong nonlinear relationship that \(r\) cannot detect. Always examine the scatterplot first.
## [1] 0.8068949
The correlation \(r = 0.807\) confirms a strong positive linear relationship between speed and stopping distance.
The population correlation coefficient \(\rho\) (rho) is the correlation computed from all possible pairs in the population. We never observe it directly; instead, we use \(r\) from our sample to make inferences about \(\rho\).
Possible hypotheses (where \(\rho_0\) is the value under \(H_0\), most commonly \(\rho_0 = 0\)):
\[ \begin{aligned} (a)\; &H_0: \rho \leq 0, \quad H_1: \rho > 0 \quad &\text{(test for positive correlation)}\\ (b)\; &H_0: \rho \geq 0, \quad H_1: \rho < 0 \quad &\text{(test for negative correlation)}\\ (c)\; &H_0: \rho = 0, \quad H_1: \rho \neq 0 \quad &\text{(test for any linear correlation)} \end{aligned} \]
Four-step procedure:
A direct CI for \(\rho\) uses Fisher’s \(z\)-transformation. Let \(R = \dfrac{1}{2}\ln\dfrac{1+r}{1-r}\) (the Fisher transform of \(r\)), with approximate standard deviation \(\sigma_R = \dfrac{1}{\sqrt{n-3}}\).
A \((1-\alpha)100\%\) CI for the transformed correlation is:
\[R \pm Z_{1-\alpha/2} \cdot \sigma_R\]
We then back-transform the bounds to the original \(\rho\) scale using \(\rho = \dfrac{e^{2R}-1}{e^{2R}+1}\).
💡 In R
cor.test(x, y, alternative = "two.sided", conf.level = 0.95)
performs the hypothesis test and computes the 95% CI for \(\rho\) automatically — no need to apply
Fisher’s transformation by hand.
📝 Example 1.1: Delivery Time and Number of Items
A delivery company wants to know whether delivery time is correlated with the number of items delivered. They recorded times and item counts for 25 orders.
| Order | Time \(y\) (min) | Items \(x\) | Order | Time \(y\) (min) | Items \(x\) | |
|---|---|---|---|---|---|---|
| 1 | 16.68 | 7 | 14 | 19.75 | 6 | |
| 2 | 11.50 | 3 | 15 | 24.00 | 9 | |
| 3 | 12.03 | 3 | 16 | 29.00 | 10 | |
| 4 | 14.88 | 4 | 17 | 15.35 | 6 | |
| 5 | 13.75 | 6 | 18 | 19.00 | 7 | |
| 6 | 18.11 | 7 | 19 | 9.50 | 3 | |
| 7 | 8.00 | 2 | 20 | 35.10 | 17 | |
| 8 | 17.83 | 7 | 21 | 17.90 | 10 | |
| 9 | 79.24 | 30 | 22 | 52.32 | 26 | |
| 10 | 21.50 | 5 | 23 | 18.75 | 9 | |
| 11 | 40.33 | 16 | 24 | 19.83 | 8 | |
| 12 | 21.00 | 10 | 25 | 10.75 | 4 | |
| 13 | 13.50 | 4 |
Assuming that delivery time and number of items follow a bivariate normal distribution, test at \(\alpha = 0.05\) whether delivery time is correlated with the number of items.
# Data
items <- c(7,3,3,4,6,7,2,7,30,5,16,10,4,6,9,10,6,7,3,17,10,26,9,8,4)
time <- c(16.68,11.5,12.03,14.88,13.75,18.11,8,17.83,79.24,21.5,
40.33,21,13.5,19.75,24,29,15.35,19,9.5,35.1,17.9,52.32,
18.75,19.83,10.75)
# Scatterplot
plot(items, time,
main = "Number of Items vs. Delivery Time",
xlab = "Number of Items (x)", ylab = "Delivery Time (min) (y)",
pch = 19, col = "steelblue")##
## Pearson's product-moment correlation
##
## data: items and time
## t = 17.546, df = 23, p-value = 8.22e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9202275 0.9845031
## sample estimates:
## cor
## 0.9646146
From the data, \(r = 0.9646\).
Hypothesis test:
\(H_0: \rho = 0\) vs. \(H_1: \rho \neq 0\)
Test statistic: \(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\)
Compute: \[t_\text{cal} = \frac{0.9646\sqrt{25-2}}{\sqrt{1-0.9646^2}} = 17.55, \quad \text{p-value} = 8.22 \times 10^{-15}\]
Since p-value \(< 0.05\), we reject \(H_0\). Conclusion: there is significant evidence that delivery time is linearly correlated with the number of items.
95% confidence interval for \(\rho\):
\[R = \frac{1}{2}\ln\frac{1+0.9646}{1-0.9646} = 2.0082, \qquad \sigma_R = \frac{1}{\sqrt{25-3}} = 0.2132\]
\[2.0082 \pm 1.96\sqrt{0.2132} \;\Rightarrow\; 1.5903 < \mu_R < 2.4261\]
Back-transforming: \(\quad \boxed{0.9202 < \rho < 0.9844}\)
📝 Example 1.2: Math Midterm and Final Scores
Math midterm and final scores for 9 students are shown below. Find the sample correlation \(r\) and test whether midterm and final scores are correlated at \(\alpha = 0.10\).
| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Midterm | 77 | 50 | 71 | 72 | 80 | 93 | 95 | 98 | 66 |
| Final | 82 | 66 | 78 | 43 | 56 | 85 | 99 | 98 | 68 |
df <- data.frame(
midterm = c(77, 50, 71, 72, 80, 93, 95, 98, 66),
final = c(82, 66, 78, 43, 56, 85, 99, 98, 68)
)
# Scatterplot
plot(df$midterm, df$final,
main = "Midterm vs. Final Exam Scores",
xlab = "Midterm Score", ylab = "Final Score",
pch = 19, col = "darkred")# Correlation test at alpha = 0.10 (90% CI)
cor.test(df$midterm, df$final, alternative = "two.sided", conf.level = 0.90)##
## Pearson's product-moment correlation
##
## data: df$midterm and df$final
## t = 2.197, df = 7, p-value = 0.06402
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
## 0.0845006 0.8911984
## sample estimates:
## cor
## 0.6388399
The sample correlation is \(r = 0.6388\), indicating a moderately strong positive linear relationship.
Hypothesis test:
\(H_0: \rho = 0\) vs. \(H_1: \rho \neq 0\)
Test statistic: \(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\)
Compute with \(r = 0.6388\), \(n = 9\): \[t_\text{cal} = \frac{0.6388\sqrt{7}}{\sqrt{1-0.6388^2}} = 2.197, \quad \text{p-value} = 0.0640\]
Since p-value \(= 0.0640 < \alpha = 0.10\), we reject \(H_0\). Conclusion: at the 10% significance level, there is sufficient evidence that midterm and final scores are linearly correlated.
📝 Example 1.3: Rainfall and Umbrella Sales
Peter owns a supermarket and believes that months with higher rainfall will have higher umbrella sales. He recorded monthly rainfall (mm) and umbrellas sold over nine consecutive months.
| Month | Rainfall (mm) | Umbrellas sold |
|---|---|---|
| 1 | 82.0 | 12 |
| 2 | 92.5 | 25 |
| 3 | 83.2 | 17 |
| 4 | 97.7 | 28 |
| 5 | 131.9 | 41 |
| 6 | 141.3 | 47 |
| 7 | 165.4 | 50 |
| 8 | 140.0 | 46 |
| 9 | 126.7 | 37 |
Assuming both variables follow a normal distribution, test at \(\alpha = 0.05\) whether Peter’s assumption is correct.
df2 <- data.frame(
Rainfall = c(82.0, 92.5, 83.2, 97.7, 131.9, 141.3, 165.4, 140.0, 126.7),
Umbrellas = c(12, 25, 17, 28, 41, 47, 50, 46, 37)
)
# Scatterplot
plot(df2$Rainfall, df2$Umbrellas,
main = "Monthly Rainfall vs. Umbrella Sales",
xlab = "Rainfall (mm)", ylab = "Umbrellas Sold",
pch = 19, col = "royalblue")##
## Pearson's product-moment correlation
##
## data: df2$Rainfall and df2$Umbrellas
## t = 10.455, df = 7, p-value = 7.97e-06
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
## 0.8877971 1.0000000
## sample estimates:
## cor
## 0.9694419
Hypothesis test:
\(H_0: \rho \leq 0\) vs. \(H_1: \rho > 0\) (Peter believes higher rainfall → more sales)
Test statistic: \(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\)
Compute with \(r = 0.9694\), \(n = 9\): \[t_\text{cal} = \frac{0.9694\sqrt{7}}{\sqrt{1-0.9694^2}} = 10.455, \quad \text{p-value} = 7.97 \times 10^{-6}\]
Since p-value \(< 0.05\), we reject \(H_0\). Conclusion: there is significant evidence that higher rainfall is associated with more umbrella sales. Peter’s assumption is supported by the data.
When both the form of the relationship and the direction of interest (predicting \(Y\) from \(X\)) are clear, we move from correlation to regression.
The simple linear regression model is:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n\]
where:
\[ \begin{aligned} E(y_i) &= \beta_0 + \beta_1 x_i &&\text{(deterministic / systematic part)} \\ \epsilon_i &&&\text{(random error component)} \\ \beta_0 &&&\text{(intercept: value of } E(y) \text{ when } x = 0\text{)} \\ \beta_1 &&&\text{(slope: change in } E(y) \text{ per unit increase in } x\text{)} \end{aligned} \]
For valid inference, the random errors must satisfy:
In short: \(\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)\).
💡 Checking the assumptions
These assumptions cannot be verified from the raw data — they must be checked using residual diagnostics after the model is fitted (see @sec-diagnostics).
Since \(\beta_0\) and \(\beta_1\) are unknown population parameters, we estimate them from a sample of size \(n\).
The least-squares estimates (LSE) \(b_0\) and \(b_1\) minimise the sum of squared residuals:
\[Q = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\]
Setting the partial derivatives \(\partial Q / \partial \beta_0 = 0\) and \(\partial Q / \partial \beta_1 = 0\) and solving gives the normal equations, whose solutions are:
\[\boxed{b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1\bar{x}}\]
where \(S_{xx} = \sum(x_i - \bar{x})^2\) and \(S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})\).
Equivalently:
\[b_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b_0 = \bar{y} - b_1\bar{x}\]
The fitted (estimated) regression line is: \[\hat{y} = b_0 + b_1 x\]
The fitted value at observation \(i\) is \(\hat{y}_i = b_0 + b_1 x_i\), and the residual is: \[e_i = y_i - \hat{y}_i\]
Residuals represent the part of \(y_i\) not explained by the model — they are the observable counterparts of the unobservable errors \(\epsilon_i\).
The error variance \(\sigma^2\) measures the spread of observations around the regression line. It is estimated by the mean squared residual (also called mean squared error, MSE):
\[\hat{\sigma}^2 = MS_\text{res} = \frac{SS_\text{res}}{n-2} = \frac{\displaystyle\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2}\]
The denominator \(n-2\) (not \(n\)) gives an unbiased estimate because two parameters (\(\beta_0\) and \(\beta_1\)) are estimated from the data.
The residual sum of squares can also be expressed as: \[SS_\text{res} = S_{yy} - b_1 S_{xy}\]
where \(S_{yy} = \sum(y_i - \bar{y})^2\).
The total variation in \(Y\) can be decomposed as:
\[SS_T = SS_\text{reg} + SS_\text{res}\]
| Source | SS | df | MS | \(F\) |
|---|---|---|---|---|
| Regression | \(SS_\text{reg} = b_1 S_{xy}\) | \(1\) | \(MS_\text{reg} = SS_\text{reg}/1\) | \(F = MS_\text{reg}/MS_\text{res}\) |
| Residual | \(SS_\text{res} = S_{yy} - b_1 S_{xy}\) | \(n-2\) | \(MS_\text{res} = SS_\text{res}/(n-2)\) | |
| Total | \(SS_T = S_{yy}\) | \(n-1\) |
The F-statistic tests \(H_0: \beta_1 = 0\) (the model has no linear fit). For simple regression, this is equivalent to the \(t\)-test on \(\beta_1\).
💡 In R
anova(fit) produces this table.
summary(fit) gives the \(F\)-statistic at the bottom of its
output.
Because \(b_0\) and \(b_1\) are computed from a random sample, they are random variables with their own sampling distributions.
Distribution of \(b_1\):
\[b_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2}{S_{xx}}\right)\]
When \(\sigma^2\) is unknown and estimated by \(MS_\text{res}\):
\[t = \frac{b_1 - \beta_1}{se(b_1)} \sim t_{n-2}, \qquad se(b_1) = \sqrt{\frac{MS_\text{res}}{S_{xx}}}\]
Distribution of \(b_0\):
\[t = \frac{b_0 - \beta_0}{se(b_0)} \sim t_{n-2}, \qquad se(b_0) = \sqrt{MS_\text{res}\!\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}\]
Confidence intervals:
\[ \begin{aligned} (1-\alpha)100\% \text{ CI for } \beta_1: &\quad b_1 \pm t_{1-\alpha/2,\, n-2} \cdot se(b_1) \\ (1-\alpha)100\% \text{ CI for } \beta_0: &\quad b_0 \pm t_{1-\alpha/2,\, n-2} \cdot se(b_0) \end{aligned} \]
Confidence interval for \(\sigma^2\):
\[\frac{(n-2)MS_\text{res}}{\chi^2_{1-\alpha/2,\, n-2}} \leq \sigma^2 \leq \frac{(n-2)MS_\text{res}}{\chi^2_{\alpha/2,\, n-2}}\]
The most important test in simple regression is:
\[H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0\]
If we fail to reject \(H_0\), there is no evidence of a linear relationship between \(X\) and \(Y\).
Four-step procedure:
The coefficient of determination measures the proportion of the total variation in \(Y\) that is explained by the regression model:
\[r^2 = \frac{SS_\text{reg}}{SS_T} = 1 - \frac{SS_\text{res}}{SS_T}, \qquad 0 \leq r^2 \leq 1\]
Interpretation template: “\(r^2 = 0.845\) means that 84.5% of the total variation in \(Y\) is explained by the linear regression on \(X\). The remaining 15.5% is unexplained (attributable to random error or other factors not in the model).”
Note: In simple linear regression, \(r^2 = r^2\) (the square of the Pearson correlation coefficient).
💡 Reading summary(fit) output in R
| Output field | Meaning |
|---|---|
Estimate (Intercept) |
\(b_0\): estimated \(Y\) when \(X = 0\) |
Estimate (\(x\)) |
\(b_1\): estimated change in \(\hat{Y}\) per 1-unit increase in \(X\) |
Std. Error |
\(se(b_j)\): standard error of the estimate |
t value |
\(t_0 = b_j / se(b_j)\): test statistic for \(H_0: \beta_j = 0\) |
Pr(>|t|) |
p-value; reject \(H_0\) if \(< \alpha\) |
Residual standard error |
\(\hat{\sigma} = \sqrt{MS_\text{res}}\) |
Multiple R-squared |
\(r^2\): proportion of variance explained |
F-statistic |
Overall model F-test (same as \(t^2\) for simple regression) |
p-value (F) |
p-value for the overall F-test |
After fitting the model \(\hat{y} = b_0 + b_1 x\), we can form two types of interval estimates for a new value \(x^*\):
1. Confidence interval for the mean response \(E(y \mid x = x^*)\):
This estimates the average \(Y\) for all observations with \(X = x^*\): \[\hat{y}^* \pm t_{1-\alpha/2,\, n-2} \sqrt{MS_\text{res}\!\left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]
2. Prediction interval for a new individual observation \(y\) at \(x = x^*\):
This estimates a single future \(Y\) when \(X = x^*\): \[\hat{y}^* \pm t_{1-\alpha/2,\, n-2} \sqrt{MS_\text{res}\!\left(1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right)}\]
❗ CI vs. PI — which is wider?
The prediction interval (PI) is always wider than the confidence interval (CI) for the same \(x^*\). The extra “1” inside the square root accounts for the additional uncertainty in predicting a single new observation (which has its own random error \(\epsilon\)), rather than estimating the mean response (which averages out individual variation).
Both intervals are narrowest at \(x^* = \bar{x}\) and widen as \(x^*\) moves away from \(\bar{x}\).
After fitting the model, we must verify that the regression assumptions hold. We do this by examining the residuals \(e_i = y_i - \hat{y}_i\).
Key diagnostic plots:
| Plot | What it checks | What “good” looks like |
|---|---|---|
| Residuals vs. Fitted | Linearity + constant variance | Random scatter around \(e = 0\); no pattern |
| Normal Q-Q plot | Normality of residuals | Points close to the diagonal reference line |
| Scale-Location | Constant variance (homoscedasticity) | Flat red line; even spread |
| Residuals vs. Leverage | Influential observations | No points outside Cook’s distance contours |
Warning signs:
💡 Formal tests (used in later labs)
shapiro.test(residuals(fit))): tests normality. Satisfied
if p-value > 0.05.bptest(fit) from
lmtest): tests constant variance. Satisfied if p-value >
0.05.dwtest(fit) from
lmtest): tests for autocorrelation. Satisfied if p-value
> 0.05.📝 Example 1.4: Condominium Area and Sale Price
A marketing manager wants to study the relationship between the area of a condominium unit (\(m^2\)) and its sale price (10,000 Baht) in the Mueang district of Chiang Mai. Ten units were randomly selected.
| Obs | Area \(x\) | Price \(y\) | \(xy\) | \(x^2\) | \(y^2\) |
|---|---|---|---|---|---|
| 1 | 13 | 115 | 1495 | 169 | 13225 |
| 2 | 4 | 45 | 180 | 16 | 2025 |
| 3 | 12 | 100 | 1200 | 144 | 10000 |
| 4 | 5 | 50 | 250 | 25 | 2500 |
| 5 | 6 | 55 | 330 | 36 | 3025 |
| 6 | 8 | 85 | 680 | 64 | 7225 |
| 7 | 3 | 40 | 120 | 9 | 1600 |
| 8 | 4 | 50 | 200 | 16 | 2500 |
| 9 | 5 | 45 | 225 | 25 | 2025 |
| 10 | 7 | 70 | 490 | 49 | 4900 |
| Total | 67 | 655 | 5170 | 553 | 49025 |
df <- data.frame(
Area = c(13, 4, 12, 5, 6, 8, 3, 4, 5, 7),
Sale = c(115, 45, 100, 50, 55, 85, 40, 50, 45, 70)
)
# Scatterplot
plot(df$Area, df$Sale,
xlab = expression("Area (" * m^2 * ")"),
ylab = "Sale Price (10,000 Baht)",
main = "Condominium Area vs. Sale Price",
pch = 19, col = "steelblue")##
## Call:
## lm(formula = Sale ~ Area, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.738 -4.618 0.987 2.269 9.741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.202 4.120 3.69 0.00613 **
## Area 7.507 0.554 13.55 8.45e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.653 on 8 degrees of freedom
## Multiple R-squared: 0.9582, Adjusted R-squared: 0.953
## F-statistic: 183.6 on 1 and 8 DF, p-value: 8.451e-07
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Area | 1 | 5866.8804 | 5866.88040 | 183.6129 | 8e-07 |
| Residuals | 8 | 255.6196 | 31.95245 | NA | NA |
Test for significance of regression (\(\beta_1\)):
\(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\)
Test statistic: \(t_0 = \dfrac{b_1}{se(b_1)}\)
From R: \(t = 13.55\), p-value \(= 8.45 \times 10^{-7}\)
Since p-value \(< 0.05\), we reject \(H_0\). Conclusion: condominium area is a significant linear predictor of sale price.
Fitted equation: \[\hat{y} = 15.202 + 7.507\, x\]
Interpretation: For each additional \(1\, m^2\) of area, the estimated sale price increases by 75,070 Baht (7.507 × 10,000). A unit with 0 \(m^2\) area is estimated to cost 152,020 Baht (intercept, not practically meaningful here).
Coefficient of determination:
\(r^2 = 0.9582\): 95.82% of the total variation in sale price is explained by the linear regression on area. The remaining 4.18% is unexplained.
95% confidence interval for \(E(y \mid x = 13)\):
## fit lwr upr
## 1 112.7954 103.7525 121.8382
\[\hat{y}_h \pm t_{0.975,\,8}\sqrt{MS_\text{res}\!\left(\frac{1}{10} + \frac{(13 - 6.7)^2}{104.1}\right)} = 112.80 \pm 9.04 \;\Rightarrow\; \boxed{103.76 < E(y \mid x=13) < 121.83}\]
95% prediction interval for a new unit with \(x = 13\):
## fit lwr upr
## 1 112.7954 96.93079 128.66
\[\hat{y}_0 \pm t_{0.975,\,8}\sqrt{MS_\text{res}\!\left(1 + \frac{1}{10} + \frac{(13-6.7)^2}{104.1}\right)} = 112.80 \pm 15.86 \;\Rightarrow\; \boxed{96.94 < y \mid x=13 < 128.65}\]
Plot: fitted line with CI and PI bands:
newx_all <- data.frame(Area = sort(df$Area))
conf_int <- predict(fit, newdata = newx_all, interval = "confidence")
pred_int <- predict(fit, newdata = newx_all, interval = "prediction")
plot(df$Area, df$Sale,
xlab = expression("Area (" * m^2 * ")"),
ylab = "Sale Price (10,000 Baht)",
main = "Fitted Line with 95% CI and PI Bands",
pch = 19, col = "steelblue",
ylim = c(20, 140))
abline(fit, lwd = 2, col = "black")
lines(sort(df$Area), conf_int[, "lwr"], col = "red", lty = 2, lwd = 1.5)
lines(sort(df$Area), conf_int[, "upr"], col = "red", lty = 2, lwd = 1.5)
lines(sort(df$Area), pred_int[, "lwr"], col = "blue", lty = 3, lwd = 1.5)
lines(sort(df$Area), pred_int[, "upr"], col = "blue", lty = 3, lwd = 1.5)
legend("topleft",
legend = c("Data", "Fitted line", "95% CI", "95% PI"),
pch = c(19, NA, NA, NA),
lty = c(NA, 1, 2, 3),
col = c("steelblue", "black", "red", "blue"),
lwd = 2, bty = "n")Residual diagnostics:
📝 Example 1.5: Rocket Propellant Shear Strength
A rocket motor is manufactured by bonding two types of propellant inside a metal housing. The shear strength of the bond is an important quality characteristic. It is suspected that shear strength (\(y\), psi) is related to the age of the sustainer propellant batch (\(x\), weeks). Twenty observations are available.
| Obs | Shear Strength \(y\) | Age \(x\) | Obs | Shear Strength \(y\) | Age \(x\) | |
|---|---|---|---|---|---|---|
| 1 | 2158.70 | 15.50 | 11 | 2165.20 | 13.00 | |
| 2 | 1678.15 | 23.75 | 12 | 2399.55 | 3.75 | |
| 3 | 2316.00 | 8.00 | 13 | 1779.80 | 25.00 | |
| 4 | 2061.30 | 17.00 | 14 | 2336.75 | 9.75 | |
| 5 | 2207.50 | 5.50 | 15 | 1765.30 | 22.00 | |
| 6 | 1708.30 | 19.00 | 16 | 2053.50 | 18.00 | |
| 7 | 1784.70 | 24.00 | 17 | 2414.40 | 6.00 | |
| 8 | 2575.00 | 2.50 | 18 | 2200.50 | 12.50 | |
| 9 | 2357.90 | 7.50 | 19 | 2654.20 | 2.00 | |
| 10 | 2256.70 | 11.00 | 20 | 1753.70 | 21.50 |
age <- c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50,
7.50, 11.00, 13.00, 3.75, 25.00, 9.75, 22.00, 18.00,
6.00, 12.50, 2.00, 21.50)
strength <- c(2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30,
1784.70, 2575.00, 2357.90, 2256.70, 2165.20, 2399.55,
1779.80, 2336.75, 1765.30, 2053.50, 2414.40, 2200.50,
2654.20, 1753.70)
dt2 <- data.frame(Age = age, Strength = strength)
# Scatterplot
plot(dt2$Age, dt2$Strength,
xlab = "Age of Propellant (weeks)",
ylab = "Shear Strength (psi)",
main = "Propellant Age vs. Bond Shear Strength",
pch = 19, col = "darkorange")The scatterplot shows a strong negative linear relationship — older batches of propellant tend to have lower shear strength. The straight-line model \(y = \beta_0 + \beta_1 x + \epsilon\) appears appropriate.
##
## Call:
## lm(formula = Strength ~ Age, data = dt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -215.98 -50.68 28.74 66.61 106.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2627.822 44.184 59.48 < 2e-16 ***
## Age -37.154 2.889 -12.86 1.64e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared: 0.9018, Adjusted R-squared: 0.8964
## F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Age | 1 | 1527482.7 | 1527482.743 | 165.3768 | 0 |
| Residuals | 18 | 166254.9 | 9236.381 | NA | NA |
Test for significance of regression (\(\beta_1\)):
\(H_0: \beta_1 = 0\) vs. \(H_1: \beta_1 \neq 0\)
Test statistic: \(t_0 = \dfrac{b_1}{se(b_1)}\)
From R: \(t = -12.86\), p-value \(= 1.64 \times 10^{-10}\)
Since p-value \(< 0.05\), we reject \(H_0\). Conclusion: propellant age is a significant linear predictor of shear strength.
Fitted equation: \[\hat{y} = 2627.822 - 37.154\, x\]
Interpretation: For each additional week of propellant age, the estimated shear strength decreases by 37.154 psi on average.
90% and 95% confidence intervals for \(\beta_1\):
## 5 % 95 %
## (Intercept) 2551.20465 2704.4401
## Age -42.16349 -32.1437
## 2.5 % 97.5 %
## (Intercept) 2534.99540 2720.6493
## Age -43.22338 -31.0838
At 90%: \(-37.15 \pm (1.734)(2.89)\), giving \(-43.22 \leq \beta_1 \leq -31.08\)
95% confidence interval for \(\sigma^2\):
\[\frac{(n-2)MS_\text{res}}{\chi^2_{0.975,\,18}} \leq \sigma^2 \leq \frac{(n-2)MS_\text{res}}{\chi^2_{0.025,\,18}} = \frac{18 \times 9244.59}{31.526} \leq \sigma^2 \leq \frac{18 \times 9244.59}{8.231}\]
\[5{,}278.27 \leq \sigma^2 \leq 20{,}216.57\]
Plot: fitted line with CI and PI bands:
new5 <- data.frame(Age = sort(dt2$Age))
ci5 <- predict(fit1, newdata = new5, interval = "confidence")
pi5 <- predict(fit1, newdata = new5, interval = "prediction")
plot(dt2$Age, dt2$Strength,
xlab = "Age of Propellant (weeks)",
ylab = "Shear Strength (psi)",
main = "Fitted Line with 95% CI and PI Bands",
pch = 19, col = "darkorange")
abline(fit1, lwd = 2, col = "black")
lines(sort(dt2$Age), ci5[, "lwr"], col = "darkgreen", lty = 2, lwd = 1.5)
lines(sort(dt2$Age), ci5[, "upr"], col = "darkgreen", lty = 2, lwd = 1.5)
lines(sort(dt2$Age), pi5[, "lwr"], col = "purple", lty = 3, lwd = 1.5)
lines(sort(dt2$Age), pi5[, "upr"], col = "purple", lty = 3, lwd = 1.5)
legend("topright",
legend = c("Data", "Fitted line", "95% CI", "95% PI"),
pch = c(19, NA, NA, NA),
lty = c(NA, 1, 2, 3),
col = c("darkorange", "black", "darkgreen", "purple"),
lwd = 2, bty = "n")Residual diagnostics:
In Chapter 1 we modelled a single response variable \(Y\) using one predictor \(X\): \[y = \beta_0 + \beta_1 x + \epsilon\]
In practice, most responses depend on several predictors simultaneously. For example:
Multiple linear regression extends the simple model to handle \(k \geq 1\) predictors — both quantitative and qualitative — while keeping the linear structure that makes estimation and inference tractable.
Given \(n\) observations on a response \(y\) and \(k\) predictors \(x_1, x_2, \ldots, x_k\), the multiple linear regression model is:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n\]
where:
The systematic (deterministic) part of the model is: \[E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\]
| Obs | \(y\) | \(x_1\) | \(x_2\) | \(\cdots\) | \(x_k\) |
|---|---|---|---|---|---|
| 1 | \(y_1\) | \(x_{11}\) | \(x_{12}\) | \(\cdots\) | \(x_{1k}\) |
| 2 | \(y_2\) | \(x_{21}\) | \(x_{22}\) | \(\cdots\) | \(x_{2k}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| \(n\) | \(y_n\) | \(x_{n1}\) | \(x_{n2}\) | \(\cdots\) | \(x_{nk}\) |
Writing one equation per observation and stacking them gives the compact matrix form:
\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]
where
\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n\times 1},\quad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}_{n\times(k+1)},\quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}_{(k+1)\times 1},\quad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}_{n\times 1} \]
\(\mathbf{X}\) is called the design matrix. Its first column is all ones, corresponding to the intercept \(\beta_0\).
We estimate \(\boldsymbol{\beta}\) by minimising the residual sum of squares:
\[SS_\text{res} = \boldsymbol{\epsilon}'\boldsymbol{\epsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]
Setting \(\partial SS_\text{res}/\partial \boldsymbol{\beta} = \mathbf{0}\) gives the normal equations \(\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{y}\), whose solution is the least squares estimator (LSE):
\[\boxed{\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}}\]
provided \(\mathbf{X}'\mathbf{X}\) is invertible (i.e., no perfect multicollinearity among the predictors).
| Quantity | Formula | Interpretation |
|---|---|---|
| Fitted values | \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}\) | Model’s predicted \(y\) for each observation |
| Residuals | \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\) | Observed minus predicted |
\[SS_\text{res} = \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y}\]
\[\boxed{\hat{\sigma}^2 = MS_\text{res} = \frac{SS_\text{res}}{n - k - 1}}\]
The denominator \(n - k - 1\) accounts for estimating \(k + 1\) parameters (\(\beta_0, \beta_1, \ldots, \beta_k\)), giving an unbiased estimator of \(\sigma^2\).
Each least squares estimate \(b_j\) follows a \(t\)-distribution:
\[t = \frac{b_j - \beta_j}{se(b_j)} \sim t_{n-k-1}, \qquad j = 0, 1, \ldots, k\]
where the standard error of \(b_j\) is: \[se(b_j) = \sqrt{\hat{\sigma}^2 C_{jj}}\]
and \(C_{jj}\) is the \((j, j)\)-th diagonal element of \((\mathbf{X}'\mathbf{X})^{-1}\).
A \((1-\alpha)100\%\) confidence interval for \(\beta_j\) is:
\[b_j \pm t_{1-\alpha/2,\; n-k-1} \cdot se(b_j)\]
The overall F-test asks: does the model explain any significant variation in \(y\)?
\[H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad \text{vs.} \qquad H_1: \text{at least one } \beta_j \neq 0\]
Test statistic:
\[F = \frac{SS_\text{reg}/k}{SS_\text{res}/(n-k-1)} = \frac{MS_\text{reg}}{MS_\text{res}} \sim F_{k,\; n-k-1} \quad \text{under } H_0\]
ANOVA table:
| Source | df | SS | MS | \(F\) |
|---|---|---|---|---|
| Regression | \(k\) | \(SS_\text{reg} = \mathbf{b}'\mathbf{X}'\mathbf{y} - n\bar{y}^2\) | \(MS_\text{reg} = SS_\text{reg}/k\) | \(F = MS_\text{reg}/MS_\text{res}\) |
| Residual | \(n-k-1\) | \(SS_\text{res} = \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y}\) | \(MS_\text{res} = SS_\text{res}/(n-k-1)\) | |
| Total | \(n-1\) | \(SS_T = \mathbf{y}'\mathbf{y} - n\bar{y}^2\) |
Reject \(H_0\) if \(F_\text{cal} > F_{1-\alpha,\; k,\; n-k-1}\) or if p-value \(< \alpha\).
💡 In R
anova(fit) produces the ANOVA table.
summary(fit) reports the \(F\)-statistic and its p-value at the bottom
of the output.
Once the overall F-test is significant, we test each predictor individually:
\[H_0: \beta_j = 0 \qquad \text{vs.} \qquad H_1: \beta_j \neq 0\]
Test statistic: \(t_0 = \dfrac{b_j}{se(b_j)} \sim t_{n-k-1}\) under \(H_0\)
This tests whether \(x_j\) contributes significantly to the model given that all other predictors are already included.
\[R^2 = \frac{SS_\text{reg}}{SS_T} = 1 - \frac{SS_\text{res}}{SS_T}, \qquad 0 \leq R^2 \leq 1\]
Limitation: \(R^2\) always increases (never decreases) when a predictor is added, even if it is useless. This makes it unreliable for comparing models with different numbers of predictors.
The adjusted \(R^2\) penalises for the number of predictors and can decrease if an added predictor does not improve fit sufficiently:
\[R^2_\text{adj} = 1 - \frac{SS_\text{res}/(n-k-1)}{SS_T/(n-1)} = 1 - \left(\frac{n-1}{n-k-1}\right)\frac{SS_\text{res}}{SS_T}\]
❗ \(R^2\) vs. Adjusted \(R^2\)
Use \(R^2\) to describe the fit of a given model. Use adjusted \(R^2\) — or AIC/BIC — when comparing models with different numbers of predictors. A model with higher adjusted \(R^2\) is generally preferred.
For a new observation with predictor values \(\mathbf{x}_0 = (1, x_{01}, x_{02}, \ldots, x_{0k})'\):
Point estimate: \(\hat{y}_0 = \mathbf{x}_0'\mathbf{b}\)
\((1-\alpha)100\%\) confidence interval for \(E(y \mid \mathbf{x}_0)\) (mean response): \[\hat{y}_0 \pm t_{1-\alpha/2,\; n-k-1} \sqrt{\hat{\sigma}^2 \, \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}\]
\((1-\alpha)100\%\) prediction interval for a new individual \(y \mid \mathbf{x}_0\): \[\hat{y}_0 \pm t_{1-\alpha/2,\; n-k-1} \sqrt{\hat{\sigma}^2 \left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)}\]
The prediction interval is always wider than the confidence interval for the same reason as in simple regression — it must account for the random error in a single new observation.
💡 Reading summary(fit) output in R
| Output field | Meaning |
|---|---|
Estimate (Intercept) |
\(b_0\): estimated \(y\) when all \(x_j = 0\) |
Estimate (\(x_j\)) |
\(b_j\): estimated change in \(\hat{y}\) per 1-unit increase in \(x_j\), holding all others constant |
Std. Error |
\(se(b_j)\): standard error of the estimate |
t value |
\(t_0 = b_j/se(b_j)\): test statistic for \(H_0: \beta_j = 0\) |
Pr(>|t|) |
p-value for individual \(t\)-test; reject \(H_0\) if \(< \alpha\) |
Residual standard error |
\(\hat{\sigma} = \sqrt{MS_\text{res}}\), on \(n-k-1\) df |
Multiple R-squared |
\(R^2\) |
Adjusted R-squared |
\(R^2_\text{adj}\) — use this for model comparison |
F-statistic |
Overall F-test statistic |
p-value (F) |
p-value for the overall F-test |
📝 Example 2.1: Delivery Time Data
A soft drink bottler wants to predict the time required by a route driver to service vending machines. The industrial engineer in charge suggests two key predictors: the number of cases stocked (\(x_1\)) and the distance walked by the driver (\(x_2\), in feet). Data were collected from 25 deliveries.
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\]
| Obs | Time \(y\) (min) | Cases \(x_1\) | Distance \(x_2\) (ft) | Obs | Time \(y\) | Cases \(x_1\) | Distance \(x_2\) | |
|---|---|---|---|---|---|---|---|---|
| 1 | 16.68 | 7 | 560 | 14 | 19.75 | 6 | 462 | |
| 2 | 11.50 | 3 | 220 | 15 | 24.00 | 9 | 448 | |
| 3 | 12.03 | 3 | 340 | 16 | 29.00 | 10 | 776 | |
| 4 | 14.88 | 4 | 80 | 17 | 15.35 | 6 | 200 | |
| 5 | 13.75 | 6 | 150 | 18 | 19.00 | 7 | 132 | |
| 6 | 18.11 | 7 | 330 | 19 | 9.50 | 3 | 36 | |
| 7 | 15.93 | 5 | 50 | 20 | 35.10 | 17 | 770 | |
| 8 | 14.38 | 4 | 357 | 21 | 17.90 | 10 | 140 | |
| 9 | 25.32 | 11 | 449 | 22 | 52.32 | 26 | 810 | |
| 10 | 11.52 | 3 | 120 | 23 | 18.75 | 9 | 450 | |
| 11 | 11.93 | 4 | 220 | 24 | 19.83 | 8 | 635 | |
| 12 | 27.90 | 10 | 588 | 25 | 10.75 | 4 | 150 | |
| 13 | 13.13 | 4 | 64 |
dat <- data.frame(
y = c(16.68, 11.50, 12.03, 14.88, 13.75, 18.11, 15.93, 14.38, 25.32,
11.52, 11.93, 27.90, 13.13, 19.75, 24.00, 29.00, 15.35, 19.00,
9.50, 35.10, 17.90, 52.32, 18.75, 19.83, 10.75),
x1 = c(7, 3, 3, 4, 6, 7, 5, 4, 11, 3, 4, 10, 4, 6, 9, 10, 6, 7, 3,
17, 10, 26, 9, 8, 4),
x2 = c(560, 220, 340, 80, 150, 330, 50, 357, 449, 120, 220, 588, 64,
462, 448, 776, 200, 132, 36, 770, 140, 810, 450, 635, 150)
)Exploratory analysis
## y x1 x2
## y 1.000 0.968 0.791
## x1 0.968 1.000 0.729
## x2 0.791 0.729 1.000
# Pairwise scatterplot matrix
pairs(dat,
main = "Scatterplot Matrix: Delivery Time Data",
pch = 19, col = "steelblue")From the correlation matrix and scatterplots, both \(x_1\) and \(x_2\) are positively correlated with \(y\). The correlation between \(x_1\) and \(x_2\) is moderate, so multicollinearity is unlikely to be a serious problem.
Fit the multiple regression model
##
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7550 -1.7329 -0.0388 1.7226 3.1038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.301098 0.803072 6.601 1.23e-06 ***
## x1 1.537225 0.125982 12.202 2.88e-11 ***
## x2 0.007012 0.002645 2.651 0.0146 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.153 on 22 degrees of freedom
## Multiple R-squared: 0.9517, Adjusted R-squared: 0.9474
## F-statistic: 216.9 on 2 and 22 DF, p-value: 3.305e-15
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| x1 | 1 | 1979.40029 | 1979.400287 | 426.860658 | 0.0000000 |
| x2 | 1 | 32.59253 | 32.592526 | 7.028627 | 0.0145887 |
| Residuals | 22 | 102.01644 | 4.637111 | NA | NA |
(a) Overall F-test at \(\alpha = 0.05\)
\(H_0: \beta_1 = \beta_2 = 0\) vs. \(H_1\): at least one \(\beta_j \neq 0\)
Test statistic: \(F = \dfrac{SS_\text{reg}/k}{SS_\text{res}/(n-k-1)}\)
From R: \(F = 261.2\) on \((2, 22)\) df, p-value \(= 4.69 \times 10^{-16}\)
| Source | df | SS | MS | \(F\) |
|---|---|---|---|---|
| Regression | 2 | 5550.82 | 2775.41 | 261.24 |
| Residual | 22 | 233.73 | 10.62 | |
| Total | 24 | 5784.55 |
(b) Test whether \(x_1\) should be included when \(x_2\) is already in the model
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 23 | 792.4236 | NA | NA | NA | NA |
| 22 | 102.0164 | 1 | 690.4071 | 148.8873 | 0 |
(c) Fitted equation
## (Intercept) x1 x2
## 5.3011 1.5372 0.0070
\[\hat{y} = 5.301 + 1.537\, x_1 + 0.007\, x_2\]
Interpretation:
(d) 95% confidence intervals for \(\beta_0, \beta_1, \beta_2\)
## 2.5 % 97.5 %
## (Intercept) 3.635628107 6.9665670
## x1 1.275953966 1.7984953
## x2 0.001526778 0.0124966
(e) How much variation in \(y\) is explained by \(x_1\) and \(x_2\)?
From summary(), the Adjusted \(R^2 = 0.9487\). That is, the model
using number of cases and distance walked explains
94.87% of the total variation in delivery time.
(f) Predict delivery time when \(x_1 = 8\) cases and \(x_2 = 275\) ft
\[\hat{y} = 5.301 + 1.537(8) + 0.007(275) = 19.522 \text{ min}\]
newdata <- data.frame(x1 = 8, x2 = 275)
predict(fit.full, newdata, interval = "confidence", level = 0.95)## fit lwr upr
## 1 19.52711 18.51225 20.54197
## fit lwr upr
## 1 19.52711 14.94738 24.10684
When a predictor is qualitative (categorical), it cannot be entered directly as a number — the numeric codes 1, 2, 3 would incorrectly imply an ordering or equal spacing between groups.
Instead, we create dummy (indicator) variables: binary (0/1) variables that encode group membership.
Key rule: for a qualitative variable with \(k\) groups, we create \(k - 1\) dummy variables. The omitted group is the reference category (baseline); all dummy coefficients are interpreted as differences relative to that baseline.
Why \(k-1\), not \(k\)? Using \(k\) dummies would make the design matrix \(\mathbf{X}\) singular — the \(k\) dummy columns always sum to 1 (the intercept column), creating perfect multicollinearity. Dropping one group avoids this and lets the intercept represent the reference group’s mean.
For a variable with \(k = 3\) groups (A = reference, B, C):
| Group | \(D_B\) | \(D_C\) | Interpretation |
|---|---|---|---|
| A (reference) | 0 | 0 | Intercept alone = mean of A |
| B | 1 | 0 | Intercept \(+ \beta_B\) = mean of B |
| C | 0 | 1 | Intercept \(+ \beta_C\) = mean of C |
The model is: \[\hat{y} = b_0 + b_1 x_1 + b_B D_B + b_C D_C\]
This gives parallel regression lines (same slope, different intercepts) for each group:
| Group | Equation | Intercept |
|---|---|---|
| A | \(\hat{y} = b_0 + b_1 x_1\) | \(b_0\) |
| B | \(\hat{y} = (b_0 + b_B) + b_1 x_1\) | \(b_0 + b_B\) |
| C | \(\hat{y} = (b_0 + b_C) + b_1 x_1\) | \(b_0 + b_C\) |
Interpretation of dummy coefficients: \(b_B\) estimates the mean difference in \(y\) between Group B and Group A (the reference), holding \(x_1\) constant. A positive \(b_B\) means Group B has a higher mean response than Group A.
📝 Example 2.2: Tool Life, Lathe Speed, and Tool Type
Twenty observations on tool life (\(y\), hours), lathe speed (\(x_1\), rpm), and tool type (\(x_2\): A or B) are shown below.
| obs | \(y\) (hrs) | \(x_1\) (rpm) | Type | obs | \(y\) (hrs) | \(x_1\) (rpm) | Type | |
|---|---|---|---|---|---|---|---|---|
| 1 | 18.73 | 610 | A | 11 | 30.16 | 670 | B | |
| 2 | 14.52 | 950 | A | 12 | 27.09 | 770 | B | |
| 3 | 17.43 | 720 | A | 13 | 25.40 | 880 | B | |
| 4 | 14.54 | 840 | A | 14 | 26.05 | 1000 | B | |
| 5 | 13.44 | 980 | A | 15 | 33.49 | 760 | B | |
| 6 | 24.39 | 530 | A | 16 | 35.62 | 590 | B | |
| 7 | 13.34 | 680 | A | 17 | 26.07 | 910 | B | |
| 8 | 22.71 | 540 | A | 18 | 36.78 | 650 | B | |
| 9 | 12.68 | 890 | A | 19 | 34.95 | 810 | B | |
| 10 | 19.32 | 730 | A | 20 | 43.67 | 500 | B |
The model is: \(y = \beta_0 + \beta_1 x_1 + \beta_2 D_B + \epsilon\), where \(D_B = 1\) if Tool Type B, \(0\) if Tool Type A (reference).
dt <- data.frame(
y = c(18.73,14.52,17.43,14.54,13.44,24.39,13.34,22.71,12.68,19.32,
30.16,27.09,25.40,26.05,33.49,35.62,26.07,36.78,34.95,43.67),
x1 = c(610,950,720,840,980,530,680,540,890,730,
670,770,880,1000,760,590,910,650,810,500),
ToolType = c(rep("A",10), rep("B",10))
)Exploratory analysis
# Scatterplot coloured by Tool Type
par(mfrow = c(1, 2))
plot(dt$x1, dt$y,
pch = 21,
bg = c("steelblue","tomato")[as.integer(factor(dt$ToolType))],
xlab = "Lathe Speed (rpm)", ylab = "Tool Life (hours)",
main = "Tool Life vs. Speed by Type")
legend("topright", legend = c("Type A","Type B"),
pt.bg = c("steelblue","tomato"), pch = 21, bty = "n")
# Boxplot by Tool Type
boxplot(y ~ ToolType, data = dt,
xlab = "Tool Type", ylab = "Tool Life (hours)",
main = "Tool Life by Tool Type",
col = c("steelblue","tomato"))The scatterplot shows two distinct clouds of points — one for each tool type — sitting at different vertical levels but following similar downward slopes. This strongly suggests that tool type shifts the intercept (level) of the relationship, making a dummy variable appropriate.
Create dummy variable (\(D_B = 1\) if Type B, \(0\) if Type A):
| ToolType | D_B |
|---|---|
| A | 0 |
| A | 0 |
| A | 0 |
| A | 0 |
| A | 0 |
| ToolType | D_B | |
|---|---|---|
| 16 | B | 1 |
| 17 | B | 1 |
| 18 | B | 1 |
| 19 | B | 1 |
| 20 | B | 1 |
Fit the regression model
##
## Call:
## lm(formula = y ~ x1 + D_B, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5527 -1.7868 -0.0016 1.8395 4.9838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.98560 3.51038 10.536 7.16e-09 ***
## x1 -0.02661 0.00452 -5.887 1.79e-05 ***
## D_B 15.00425 1.35967 11.035 3.59e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.039 on 17 degrees of freedom
## Multiple R-squared: 0.9003, Adjusted R-squared: 0.8886
## F-statistic: 76.75 on 2 and 17 DF, p-value: 3.086e-09
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| x1 | 1 | 293.0054 | 293.005403 | 31.71568 | 2.99e-05 |
| D_B | 1 | 1125.0282 | 1125.028214 | 121.77602 | 0.00e+00 |
| Residuals | 17 | 157.0546 | 9.238504 | NA | NA |
## 2.5 % 97.5 %
## (Intercept) 29.57934160 44.39186079
## x1 -0.03614301 -0.01707145
## D_B 12.13559830 17.87290293
Test whether \(x_1\) (lathe speed) should be included, given \(x_2\) (tool type) is in the model:
Test whether \(x_2\) (tool type) should be included, given \(x_1\) is in the model:
Overall F-test at \(\alpha = 0.05\):
Fitted equation (overall):
\[\hat{y} = 36.986 - 0.027\, x_1 + 15.004\, D_B\]
Fitted equations by tool type:
\[ \begin{aligned} \text{Type A } (D_B=0): \quad \hat{y}_A &= 36.986 - 0.027\, x_1 \\ \text{Type B } (D_B=1): \quad \hat{y}_B &= (36.986 + 15.004) - 0.027\, x_1 = 51.990 - 0.027\, x_1 \end{aligned} \]
Both equations share the same slope (\(-0.027\): tool life decreases by 0.027 hours per additional rpm), but Tool Type B has an intercept 15.004 hours higher than Type A. This means, at any given lathe speed, Tool Type B is estimated to last about 15 hours longer than Tool Type A.
Plot the two fitted lines:
b <- coef(fit2)
x_range <- seq(min(dt$x1), max(dt$x1), length.out = 100)
plot(dt$x1, dt$y,
pch = 21,
bg = c("steelblue","tomato")[as.integer(factor(dt$ToolType))],
xlab = "Lathe Speed (rpm)", ylab = "Tool Life (hours)",
main = "Fitted Regression Lines by Tool Type")
lines(x_range, b[1] + b[2]*x_range, col = "steelblue", lwd = 2)
lines(x_range, b[1] + b[3] + b[2]*x_range, col = "tomato", lwd = 2)
legend("topright",
legend = c("Type A (observed)","Type B (observed)",
"Type A (fitted)", "Type B (fitted)"),
pt.bg = c("steelblue","tomato", NA, NA),
pch = c(21, 21, NA, NA),
lty = c(NA, NA, 1, 1),
col = c("black","black","steelblue","tomato"),
lwd = 2, bty = "n")95% confidence interval for \(\beta_2\) (the tool-type effect):
\[b_2 \pm t_{0.975,\;17} \cdot se(b_2) = 15.004 \pm (2.110)(1.360) \;\Rightarrow\; 12.135 \leq \beta_2 \leq 17.873\]
Interpretation: We are 95% confident that Tool Type B lasts between 12.1 and 17.9 hours longer than Tool Type A at the same lathe speed.
Residual diagnostics:
par(mfrow = c(2, 2))
plot(fit2, pch = 16,
col = c("steelblue","tomato")[as.integer(factor(dt$ToolType))])With many candidate predictors:
All three standard methods use AIC (Akaike Information Criterion) to compare models:
\[\text{AIC} = n\ln(SS_\text{res}/n) + 2(k+1)\]
Lower AIC = better balance between goodness of fit and model complexity.
| Method | Starting point | Strategy | Stops when |
|---|---|---|---|
| Forward selection | Intercept only | Add best predictor at each step | No remaining predictor reduces AIC |
| Backward elimination | Full model | Remove worst predictor at each step | No remaining predictor increases AIC if removed |
| Stepwise (both) | Full model | Add or remove at each step | No change improves AIC |
❗ Which method to prefer?
Stepwise regression is generally preferred because it can reconsider variables added or removed in earlier steps. Forward and backward selections cannot undo earlier decisions. However, none of the three methods guarantees finding the globally optimal model — always check whether the selected model makes practical sense and satisfies the regression assumptions.
mtcars dataThe built-in mtcars dataset records fuel efficiency and
characteristics of 32 automobiles. We illustrate variable selection with
mpg (fuel efficiency) as the response.
| Variable | Description | Type |
|---|---|---|
mpg |
Miles per US gallon | Response |
cyl |
Number of cylinders | Discrete (4, 6, 8) |
disp |
Engine displacement (cu.in.) | Numeric |
hp |
Gross horsepower | Numeric |
drat |
Rear axle ratio | Numeric |
wt |
Weight (1,000 lbs) | Numeric |
qsec |
¼-mile time (seconds) | Numeric |
vs |
Engine shape (0 = V, 1 = straight) | Binary |
am |
Transmission (0 = auto, 1 = manual) | Binary |
gear |
Number of forward gears | Discrete |
carb |
Number of carburetors | Discrete |
Starts with the intercept-only model and adds the predictor that most reduces AIC at each step. Stops when no addition improves AIC.
fw.fit <- step(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "forward",
trace = 1)## Start: AIC=115.94
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + wt 1 847.73 278.32 73.217
## + cyl 1 817.71 308.33 76.494
## + disp 1 808.89 317.16 77.397
## + hp 1 678.37 447.67 88.427
## + drat 1 522.48 603.57 97.988
## + vs 1 496.53 629.52 99.335
## + am 1 405.15 720.90 103.672
## + carb 1 341.78 784.27 106.369
## + gear 1 259.75 866.30 109.552
## + qsec 1 197.39 928.66 111.776
## <none> 1126.05 115.943
##
## Step: AIC=73.22
## mpg ~ wt
##
## Df Sum of Sq RSS AIC
## + cyl 1 87.150 191.17 63.198
## + hp 1 83.274 195.05 63.840
## + qsec 1 82.858 195.46 63.908
## + vs 1 54.228 224.09 68.283
## + carb 1 44.602 233.72 69.628
## + disp 1 31.639 246.68 71.356
## <none> 278.32 73.217
## + drat 1 9.081 269.24 74.156
## + gear 1 1.137 277.19 75.086
## + am 1 0.002 278.32 75.217
##
## Step: AIC=63.2
## mpg ~ wt + cyl
##
## Df Sum of Sq RSS AIC
## + hp 1 14.5514 176.62 62.665
## + carb 1 13.7724 177.40 62.805
## <none> 191.17 63.198
## + qsec 1 10.5674 180.60 63.378
## + gear 1 3.0281 188.14 64.687
## + disp 1 2.6796 188.49 64.746
## + vs 1 0.7059 190.47 65.080
## + am 1 0.1249 191.05 65.177
## + drat 1 0.0010 191.17 65.198
##
## Step: AIC=62.66
## mpg ~ wt + cyl + hp
##
## Df Sum of Sq RSS AIC
## <none> 176.62 62.665
## + am 1 6.6228 170.00 63.442
## + disp 1 6.1762 170.44 63.526
## + carb 1 2.5187 174.10 64.205
## + drat 1 2.2453 174.38 64.255
## + qsec 1 1.4010 175.22 64.410
## + gear 1 0.8558 175.76 64.509
## + vs 1 0.0599 176.56 64.654
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## cyl -0.94162 0.55092 -1.709 0.098480 .
## hp -0.01804 0.01188 -1.519 0.140015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
Result: Forward selection chose
wt,cyl, andhp.Fitted equation: \(\hat{\text{mpg}} = 38.751 - 3.167\,\text{wt} - 0.942\,\text{cyl} - 0.018\,\text{hp}\)
Starts with the full model and removes the predictor whose removal most reduces AIC at each step.
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.0799 147.57 68.915
## - vs 1 0.1601 147.66 68.932
## - carb 1 0.4067 147.90 68.986
## - gear 1 1.3531 148.85 69.190
## - drat 1 1.6270 149.12 69.249
## - disp 1 3.9167 151.41 69.736
## - hp 1 6.8399 154.33 70.348
## - qsec 1 8.8641 156.36 70.765
## <none> 147.49 70.898
## - am 1 10.5467 158.04 71.108
## - wt 1 27.0144 174.51 74.280
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.2685 147.84 66.973
## - carb 1 0.5201 148.09 67.028
## - gear 1 1.8211 149.40 67.308
## - drat 1 1.9826 149.56 67.342
## - disp 1 3.9009 151.47 67.750
## - hp 1 7.3632 154.94 68.473
## <none> 147.57 68.915
## - qsec 1 10.0933 157.67 69.032
## - am 1 11.8359 159.41 69.384
## - wt 1 27.0280 174.60 72.297
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.6855 148.53 65.121
## - gear 1 2.1437 149.99 65.434
## - drat 1 2.2139 150.06 65.449
## - disp 1 3.6467 151.49 65.753
## - hp 1 7.1060 154.95 66.475
## <none> 147.84 66.973
## - am 1 11.5694 159.41 67.384
## - qsec 1 15.6830 163.53 68.200
## - wt 1 27.3799 175.22 70.410
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.565 150.09 63.457
## - drat 1 1.932 150.46 63.535
## <none> 148.53 65.121
## - disp 1 10.110 158.64 65.229
## - am 1 12.323 160.85 65.672
## - hp 1 14.826 163.35 66.166
## - qsec 1 26.408 174.94 68.358
## - wt 1 69.127 217.66 75.350
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.345 153.44 62.162
## - disp 1 8.545 158.64 63.229
## <none> 150.09 63.457
## - hp 1 13.285 163.38 64.171
## - am 1 20.036 170.13 65.466
## - qsec 1 25.574 175.67 66.491
## - wt 1 67.572 217.66 73.351
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.629 160.07 61.515
## <none> 153.44 62.162
## - hp 1 12.572 166.01 62.682
## - qsec 1 26.470 179.91 65.255
## - am 1 32.198 185.63 66.258
## - wt 1 69.043 222.48 72.051
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## - qsec 1 20.225 180.29 63.323
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Result: Backward elimination chose
wt,qsec, andam.Fitted equation: \(\hat{\text{mpg}} = 9.618 - 3.917\,\text{wt} + 1.226\,\text{qsec} + 2.936\,\text{am}\)
Combines both directions — at each step the algorithm may add or remove a variable. Generally the most reliable of the three.
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.0799 147.57 68.915
## - vs 1 0.1601 147.66 68.932
## - carb 1 0.4067 147.90 68.986
## - gear 1 1.3531 148.85 69.190
## - drat 1 1.6270 149.12 69.249
## - disp 1 3.9167 151.41 69.736
## - hp 1 6.8399 154.33 70.348
## - qsec 1 8.8641 156.36 70.765
## <none> 147.49 70.898
## - am 1 10.5467 158.04 71.108
## - wt 1 27.0144 174.51 74.280
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.2685 147.84 66.973
## - carb 1 0.5201 148.09 67.028
## - gear 1 1.8211 149.40 67.308
## - drat 1 1.9826 149.56 67.342
## - disp 1 3.9009 151.47 67.750
## - hp 1 7.3632 154.94 68.473
## <none> 147.57 68.915
## - qsec 1 10.0933 157.67 69.032
## - am 1 11.8359 159.41 69.384
## + cyl 1 0.0799 147.49 70.898
## - wt 1 27.0280 174.60 72.297
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.6855 148.53 65.121
## - gear 1 2.1437 149.99 65.434
## - drat 1 2.2139 150.06 65.449
## - disp 1 3.6467 151.49 65.753
## - hp 1 7.1060 154.95 66.475
## <none> 147.84 66.973
## - am 1 11.5694 159.41 67.384
## - qsec 1 15.6830 163.53 68.200
## + vs 1 0.2685 147.57 68.915
## + cyl 1 0.1883 147.66 68.932
## - wt 1 27.3799 175.22 70.410
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.565 150.09 63.457
## - drat 1 1.932 150.46 63.535
## <none> 148.53 65.121
## - disp 1 10.110 158.64 65.229
## - am 1 12.323 160.85 65.672
## - hp 1 14.826 163.35 66.166
## + carb 1 0.685 147.84 66.973
## + vs 1 0.434 148.09 67.028
## + cyl 1 0.414 148.11 67.032
## - qsec 1 26.408 174.94 68.358
## - wt 1 69.127 217.66 75.350
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.345 153.44 62.162
## - disp 1 8.545 158.64 63.229
## <none> 150.09 63.457
## - hp 1 13.285 163.38 64.171
## + gear 1 1.565 148.53 65.121
## + cyl 1 1.003 149.09 65.242
## + vs 1 0.645 149.45 65.319
## + carb 1 0.107 149.99 65.434
## - am 1 20.036 170.13 65.466
## - qsec 1 25.574 175.67 66.491
## - wt 1 67.572 217.66 73.351
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.629 160.07 61.515
## <none> 153.44 62.162
## - hp 1 12.572 166.01 62.682
## + drat 1 3.345 150.09 63.457
## + gear 1 2.977 150.46 63.535
## + cyl 1 2.447 150.99 63.648
## + vs 1 1.121 152.32 63.927
## + carb 1 0.011 153.43 64.160
## - qsec 1 26.470 179.91 65.255
## - am 1 32.198 185.63 66.258
## - wt 1 69.043 222.48 72.051
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## + disp 1 6.629 153.44 62.162
## + carb 1 3.227 156.84 62.864
## + drat 1 1.428 158.64 63.229
## - qsec 1 20.225 180.29 63.323
## + cyl 1 0.249 159.82 63.465
## + vs 1 0.249 159.82 63.466
## + gear 1 0.171 159.90 63.481
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## + hp 1 9.219 160.07 61.515
## + carb 1 8.036 161.25 61.751
## + disp 1 3.276 166.01 62.682
## + cyl 1 1.501 167.78 63.022
## + drat 1 1.400 167.89 63.042
## + gear 1 0.123 169.16 63.284
## + vs 1 0.000 169.29 63.307
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Result: Stepwise also chose
wt,qsec, andam— same as backward elimination.Fitted equation: \(\hat{\text{mpg}} = 9.618 - 3.917\,\text{wt} + 1.226\,\text{qsec} + 2.936\,\text{am}\)
💡 Comparing the three results
| Method | Selected variables | Adj. \(R^2\) |
|---|---|---|
| Forward | wt, cyl, hp |
~0.826 |
| Backward | wt, qsec, am |
~0.834 |
| Stepwise | wt, qsec, am |
~0.834 |
Forward selection disagrees here — it chose a different model. When methods disagree, compare adjusted \(R^2\), check diagnostics on each, and prefer the model with better interpretability and better-satisfied assumptions.
The linear regression model assumes:
\[y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)\]
This implies four conditions — LINE:
| Condition | Meaning |
|---|---|
| Linearity | \(E(y)\) is a linear function of the predictors |
| Independence | Errors \(\epsilon_i\) are independent of each other |
| Normality | Errors follow a normal distribution |
| Equal variance | \(\text{Var}(\epsilon_i) = \sigma^2\) (constant) for all \(i\) |
If these are met, our \(t\)-tests, F-tests, and confidence intervals are valid. If they are violated, inference is unreliable — “garbage in, garbage out.”
R’s plot(fit) produces four standard diagnostic plots.
We use the stepwise-selected model (sw.fit) as the working
example.
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
par(mfrow = c(2, 2))
plot(fit, pch = 16, col = "steelblue")| Plot | What it checks | What “good” looks like | Warning signs |
|---|---|---|---|
| Residuals vs. Fitted | Linearity + constant variance | Random scatter around \(e = 0\); even spread | Curve (non-linearity); funnel shape (non-constant variance) |
| Normal Q-Q | Normality of residuals | Points close to the diagonal line | S-shape (heavy tails); systematic bending |
| Scale-Location | Constant variance (more sensitive) | Flat red line; even vertical spread | Upward or downward trend in red line |
| Residuals vs. Leverage | Influential observations | All points inside Cook’s distance contours | Points in upper/lower right corner (high leverage + large residual) |
\[H_0: \text{errors have constant variance} \qquad H_1: \text{non-constant variance (heteroscedasticity)}\]
##
## studentized Breusch-Pagan test
##
## data: fit
## BP = 6.1871, df = 3, p-value = 0.1029
Since p-value \(> 0.05\), we fail to reject \(H_0\). The constant variance assumption is satisfied.
\[H_0: \text{errors are normally distributed} \qquad H_1: \text{errors are not normal}\]
res <- resid(fit)
par(mfrow = c(1, 2))
hist(res,
main = "Histogram of Residuals",
xlab = "Residuals", col = "lightblue", border = "white",
freq = FALSE)
curve(dnorm(x, mean(res), sd(res)), add = TRUE, col = "red", lwd = 2)
qqnorm(res, pch = 16, col = "steelblue", main = "Normal Q-Q Plot")
qqline(res, col = "red", lwd = 2)##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.9411, p-value = 0.08043
Since p-value \(> 0.05\), we fail to reject \(H_0\). The normality assumption is satisfied.
\[H_0: \text{errors are not autocorrelated (independent)} \qquad H_1: \text{errors are autocorrelated}\]
##
## Durbin-Watson test
##
## data: fit
## DW = 1.7141, p-value = 0.1263
## alternative hypothesis: true autocorrelation is greater than 0
Since p-value \(> 0.05\), we fail to reject \(H_0\). The independence assumption is satisfied.
| Assumption | Graphical check | Formal test | Satisfied if |
|---|---|---|---|
| Linearity | Residuals vs. Fitted | — | No pattern in the plot |
| Constant variance | Scale-Location | Breusch-Pagan | p-value \(> 0.05\) |
| Normality | Q-Q plot, histogram | Shapiro-Wilk | p-value \(> 0.05\) |
| Independence | Residuals vs. order | Durbin-Watson | p-value \(> 0.05\) |
If an assumption is violated, consider the remedies in the table below before drawing any conclusions.
| Violated assumption | Possible remedy |
|---|---|
| Non-linearity | Add polynomial terms (\(x^2\), \(x^3\)); transform \(x\); fit a different model |
| Non-constant variance | Apply a variance-stabilising transformation to \(y\) (log, square root) |
| Non-normality | Transform \(y\); check for and address outliers |
| Autocorrelation | Include a time-trend variable; use time-series regression methods |
When the response \(y\) is strictly positive and the model shows non-normality or non-constant variance, the Box-Cox family of power transformations provides a systematic approach to choosing a variance-stabilising transformation:
\[y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \ln(y) & \lambda = 0 \end{cases}\]
Common values of \(\lambda\) and their corresponding transformations:
| \(\lambda\) | Transformation | Use when |
|---|---|---|
| \(-1\) | Reciprocal \(1/y\) | Variance increases rapidly with mean |
| \(-0.5\) | Reciprocal square root \(1/\sqrt{y}\) | |
| \(0\) | Log transformation \(\ln(y)\) | Most common remedy for right skew / heteroscedasticity |
| \(0.5\) | Square root \(\sqrt{y}\) | Count data; mild variance inflation |
| \(1\) | No transformation | Assumptions already satisfied |
| \(2\) | Square \(y^2\) | Variance decreases with mean (rare) |
Finding the optimal \(\lambda\) in R:
The plot shows the log-likelihood as a function of \(\lambda\). The peak of the curve is the maximum-likelihood estimate of \(\lambda\). The dashed vertical lines mark the 95% confidence interval for \(\lambda\).
- If the CI contains \(\lambda = 1\): no transformation is needed.
- If the CI contains \(\lambda = 0\): use a log transformation (\(\ln y\)).
- If the CI contains \(\lambda = 0.5\): use a square-root transformation (\(\sqrt{y}\)).
For the
mtcarsmodel, the peak is near \(\lambda = 1\), suggesting no transformation is necessary — consistent with the diagnostic tests above.
# If lambda = 0 were indicated, apply log transformation:
fit_log <- lm(log(mpg) ~ wt + qsec + am, data = mtcars)
summary(fit_log)##
## Call:
## lm(formula = log(mpg) ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13879 -0.08114 -0.03466 0.07030 0.26575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.69410 0.31326 8.600 2.40e-09 ***
## wt -0.22456 0.03201 -7.015 1.25e-07 ***
## qsec 0.05329 0.01299 4.101 0.00032 ***
## am 0.08558 0.06351 1.347 0.18863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1107 on 28 degrees of freedom
## Multiple R-squared: 0.8752, Adjusted R-squared: 0.8619
## F-statistic: 65.47 on 3 and 28 DF, p-value: 9.036e-13
# Then re-check diagnostics on the transformed model
par(mfrow = c(2, 2))
plot(fit_log, pch = 16, col = "steelblue")Learning Objectives
After completing this chapter, students should be able to:
Nonparametric (or distribution-free) statistical methods do not require the data to follow a specific probability distribution. They are often based on ranks or signs rather than the raw numerical values, making them robust to outliers and skewed data.
Compared to their parametric counterparts, nonparametric methods:
| Situation | Why nonparametric? |
|---|---|
| Small sample size (\(n < 30\)) | Cannot verify normality; Central Limit Theorem may not apply |
| Population distribution unknown or clearly non-normal | Parametric inference would be invalid |
| Presence of extreme outliers | Outliers heavily distort means and parametric tests |
| Ordinal data (rankings, Likert scales) | Means are not meaningful for ordinal scales |
| Data include only directions or signs | Magnitude is unknown or unreliable |
Use this table to choose the appropriate test based on your study design:
| Design | Parametric test | Nonparametric equivalent |
|---|---|---|
| One sample: test median | One-sample \(t\)-test | Sign test or Wilcoxon signed-rank test |
| Two related (paired) samples | Paired \(t\)-test | Sign test or Wilcoxon signed-rank test |
| Two independent samples | Independent \(t\)-test | Mann–Whitney U test |
| \(k > 2\) related samples (blocks) | Repeated-measures ANOVA | Friedman test |
| \(k > 2\) independent samples | One-way ANOVA | Kruskal–Wallis test |
| Association between two variables | Pearson correlation | Spearman rank correlation |
| Association between two categorical variables | — | Chi-squared test of independence |
Both tests address the same question — does the population median equal a specified value \(\theta_0\)? — but differ in what information they use:
| Test | Uses | When to prefer |
|---|---|---|
| Sign test | Only the direction (above/below \(\theta_0\)) | Very small samples; data are ordinal; distribution may be asymmetric |
| Wilcoxon signed-rank | Direction and magnitude of differences | Symmetric distribution; more powerful than sign test |
Hypotheses (choose one):
\[ \begin{aligned} (a)\; &H_0: \theta \leq \theta_0 \quad H_1: \theta > \theta_0 \\ (b)\; &H_0: \theta \geq \theta_0 \quad H_1: \theta < \theta_0 \\ (c)\; &H_0: \theta = \theta_0 \quad H_1: \theta \neq \theta_0 \end{aligned} \]
Procedure:
binom.test() in R.📝 Example 3.1: PM2.5 Concentration
Is the median PM2.5 concentration equal to 35? Data from 5 measurements: 30, 40, 35, 50, 45.
x <- c(30, 40, 35, 50, 45)
theta0 <- 35
par(mfrow = c(1, 2))
boxplot(x, main = "PM2.5 Concentration", ylab = "PM2.5", col = "lightblue")
hist(x, main = "Histogram", xlab = "PM2.5", col = "lightblue", border = "white")par(mfrow = c(1, 1))
Tpos <- sum(x > theta0)
Tneg <- sum(x < theta0)
cat("T+ =", Tpos, " T- =", Tneg, " Ties =", sum(x == theta0), "\n")## T+ = 3 T- = 1 Ties = 1
##
## Exact binomial test
##
## data: Tpos and Tpos + Tneg
## number of successes = 3, number of trials = 4, p-value = 0.625
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.1941204 0.9936905
## sample estimates:
## probability of success
## 0.75
Hypothesis test:
📝 Example 3.2: Caffeine in Starbucks Latte
A scientist purchased 16-oz Starbucks Lattes on 10 consecutive days and measured the caffeine content (mg). Does the median caffeine content exceed 40 mg?
| Day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Caffeine (mg) | 41 | 33 | 43 | 52 | 46 | 37 | 44 | 49 | 53 | 30 |
caffeine <- c(41, 33, 43, 52, 46, 37, 44, 49, 53, 30)
theta0 <- 40
boxplot(caffeine, ylab = "Caffeine (mg)", main = "Caffeine in Starbucks Latte",
col = "lightyellow")
abline(h = theta0, col = "red", lty = 2)
legend("topright", legend = paste("θ₀ =", theta0), col = "red", lty = 2, bty = "n")Tpos <- sum(caffeine > theta0)
Tneg <- sum(caffeine < theta0)
cat("T+ =", Tpos, " T- =", Tneg, " Ties =", sum(caffeine == theta0), "\n")## T+ = 7 T- = 3 Ties = 0
##
## Exact binomial test
##
## data: Tpos and Tpos + Tneg
## number of successes = 7, number of trials = 10, p-value = 0.1719
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.3933758 1.0000000
## sample estimates:
## probability of success
## 0.7
Sign table:
| Day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Caffeine (mg) | 41 | 33 | 43 | 52 | 46 | 37 | 44 | 49 | 53 | 30 |
| Sign | + | − | + | + | + | − | + | + | + | − |
Hypothesis test:
📝 Example 3.3: Snow Cones Sold Per Day
A convenience store owner claims the median number of snow cones sold per day is 40. A random sample of 20 days gives the following counts. Test at \(\alpha = 0.05\).
x <- c(42,43,40,25,22,45,49,32,37,36,39,44,59,40,48,56,50,44,49,52)
boxplot(x, ylab = "Snow Cones Sold", main = "Daily Snow Cone Sales",
col = "lightcyan")
abline(h = 40, col = "red", lty = 2)T.pos <- sum(x > 40); T.neg <- sum(x < 40); T.ze <- sum(x == 40)
cat("T+ =", T.pos, " T- =", T.neg, " Ties =", T.ze, "\n")## T+ = 12 T- = 6 Ties = 2
##
## Exact binomial test
##
## data: T.pos and T.pos + T.neg
## number of successes = 12, number of trials = 18, p-value = 0.2379
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4099252 0.8665726
## sample estimates:
## probability of success
## 0.6666667
Sign table:
| x | 42 | 43 | 40 | 25 | 22 | 45 | 49 | 32 | 37 | 36 |
| Sign | + | + | 0 | − | − | + | + | − | − | − |
| x | 39 | 44 | 59 | 40 | 48 | 56 | 50 | 44 | 49 | 52 |
| Sign | − | + | + | 0 | + | + | + | + | + | + |
Hypothesis test:
This test improves on the sign test by incorporating the magnitude of differences, not just their direction. It assumes the distribution of \(X - \theta_0\) is symmetric.
Procedure:
wilcox.test() in R.📝 Example 3.4: Employee Weekly Income
A temporary employment agency claims its employees earn a median weekly income greater than 500. A sample of 8 employee records gives: 505, 508, 499, 504, 510, 497, 500, 492. Can the claim be verified at \(\alpha = 0.05\)?
x <- c(505, 508, 499, 504, 510, 497, 500, 492)
boxplot(x, ylab = "Weekly Income", main = "Employee Weekly Income",
col = "lightgreen")
abline(h = 500, col = "red", lty = 2)##
## Wilcoxon signed rank exact test
##
## data: x
## V = 23.5, p-value = 0.2266
## alternative hypothesis: true location is greater than 500
Rank table (\(\theta_0 = 500\)):
| Obs | \(x\) | \(d = x - 500\) | \(|d|\) | Rank of \(|d|\) | Signed rank |
|---|---|---|---|---|---|
| 1 | 505 | +5 | 5 | 4 | +4 |
| 2 | 508 | +8 | 8 | 5.5 | +5.5 |
| 3 | 499 | −1 | 1 | 1 | −1 |
| 4 | 504 | +4 | 4 | 3 | +3 |
| 5 | 510 | +10 | 10 | 7 | +7 |
| 6 | 497 | −3 | 3 | 2 | −2 |
| 7 | 500 | 0 | — | — | (tie, excluded) |
| 8 | 492 | −8 | 8 | 5.5 | −5.5 |
\(W^+ = 4 + 5.5 + 3 + 7 = 19.5\)
Hypothesis test:
📝 Example 3.5: Checkout Waiting Times
A supermarket claims checkout waiting times are at most 4 minutes. An angry customer records 6 waiting times (min): 3.8, 5.3, 3.5, 4.5, 7.2, 5.1. Can the customer dispute the claim at \(\alpha = 0.05\)?
x <- c(3.8, 5.3, 3.5, 4.5, 7.2, 5.1)
boxplot(x, ylab = "Waiting Time (min)", main = "Checkout Waiting Times",
col = "lightyellow")
abline(h = 4, col = "red", lty = 2)##
## Wilcoxon signed rank exact test
##
## data: x
## V = 17.5, p-value = 0.09375
## alternative hypothesis: true location is greater than 4
Rank table (\(\theta_0 = 4\)):
| Obs | \(x\) | \(d = x-4\) | \(|d|\) | Rank | Signed rank |
|---|---|---|---|---|---|
| 1 | 3.8 | −0.2 | 0.2 | 1 | −1 |
| 2 | 5.3 | +1.3 | 1.3 | 5 | +5 |
| 3 | 3.5 | −0.5 | 0.5 | 2.5 | −2.5 |
| 4 | 4.5 | +0.5 | 0.5 | 2.5 | +2.5 |
| 5 | 7.2 | +3.2 | 3.2 | 6 | +6 |
| 6 | 5.1 | +1.1 | 1.1 | 4 | +4 |
\(W^+ = 5 + 2.5 + 6 + 4 = 17.5\)
Hypothesis test:
📝 Example 3.6: Law College Exam Scores
A sample of 10 law students’ marks: 71, 79, 40, 70, 82, 72, 60, 76, 69, 75. Test at \(\alpha = 0.05\) whether the median mark is less than or equal to 67.
x <- c(71, 79, 40, 70, 82, 72, 60, 76, 69, 75)
boxplot(x, ylab = "Mark", main = "Law College Exam Scores",
col = "lightpink")
abline(h = 67, col = "red", lty = 2)##
## Wilcoxon signed rank exact test
##
## data: x
## V = 40, p-value = 0.9033
## alternative hypothesis: true location is less than 67
Hypothesis test:
The Mann–Whitney U test (also called the Wilcoxon rank-sum test) compares the distributions of two independent groups. It is the nonparametric counterpart of the independent-samples \(t\)-test.
When to use: Two independent groups; response is at least ordinal; normality is not assumed.
wilcox.test(x, y, paired = FALSE)Hypotheses:
\[ \begin{aligned} (a)\; &H_0: \theta_X = \theta_Y, \quad H_1: \theta_X \neq \theta_Y \quad \text{(two-sided)} \\ (b)\; &H_0: \theta_X \geq \theta_Y, \quad H_1: \theta_X < \theta_Y \quad \text{(one-sided)} \\ (c)\; &H_0: \theta_X \leq \theta_Y, \quad H_1: \theta_X > \theta_Y \quad \text{(one-sided)} \end{aligned} \]
📝 Example 3.12: Coffee Brand Ratings
Nine participants rated Brand X and nine rated Brand Y on likelihood of purchase. Can we conclude that Brand Y is preferred (rated higher) at \(\alpha = 0.05\)?
| Brand X | 8.0 | 8.7 | 7.7 | 7.2 | 8.5 | 6.0 | 8.5 | 6.0 | 13.5 |
|---|---|---|---|---|---|---|---|---|---|
| Brand Y | 13.0 | 10.7 | 7.7 | 11.2 | 13.7 | 12.0 | 11.5 | 14.2 | 13.7 |
brand_x <- c(8.0, 8.7, 7.7, 7.2, 8.5, 6.0, 8.5, 6.0, 13.5)
brand_y <- c(13.0, 10.7, 7.7, 11.2, 13.7, 12.0, 11.5, 14.2, 13.7)
boxplot(brand_x, brand_y, names = c("Brand X","Brand Y"),
col = c("lightblue","lightgreen"),
main = "Coffee Brand Ratings", ylab = "Rating")##
## Wilcoxon rank sum exact test
##
## data: brand_x and brand_y
## W = 10.5, p-value = 0.003085
## alternative hypothesis: true location shift is less than 0
Rank table (combined ranks, \(n=m=9\)):
| Brand X | 8.0 | 8.7 | 7.7 | 7.2 | 8.5 | 6.0 | 8.5 | 6.0 | 13.5 |
|---|---|---|---|---|---|---|---|---|---|
| Rank | 6 | 9 | 4.5 | 3 | 7.5 | 1.5 | 7.5 | 1.5 | 15 |
\(S = 6+9+4.5+3+7.5+1.5+7.5+1.5+15 = 55.5\), \(\quad U = 55.5 - \dfrac{9 \times 10}{2} = 10.5\)
Hypothesis test:
📝 Example 3.13: Reaction Times Under Two Drugs
An experimental psychologist compared reaction times (seconds) of adult males under Drug A (\(n=6\)) vs. Drug B (\(n=4\)). Is there a significant difference at \(\alpha = 0.10\)?
| Drug A (s) | 52 | 53 | 73 | 68 | 56 | 74 |
|---|---|---|---|---|---|---|
| Drug B (s) | 94 | 64 | 74 | 51 |
drug_a <- c(52,53,73,68,56,74)
drug_b <- c(94,64,74,51)
boxplot(drug_a, drug_b, names = c("Drug A","Drug B"),
col = c("skyblue","orange"),
main = "Reaction Time by Drug", ylab = "Reaction Time (s)")##
## Wilcoxon rank sum exact test
##
## data: drug_a and drug_b
## W = 9.5, p-value = 0.6476
## alternative hypothesis: true location shift is not equal to 0
Rank table:
| Drug A | 52 | 53 | 73 | 68 | 56 | 74 | Drug B | 94 | 64 | 74 | 51 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rank | 2 | 3 | 7 | 6 | 4 | 8.5 | Rank | 10 | 5 | 8.5 | 1 |
\(S = 2+3+7+6+4+8.5 = 30.5\), \(\quad U = 30.5 - \dfrac{6 \times 7}{2} = 9.5\)
Hypothesis test:
📝 Example 3.14: Prenatal Care Programme and APGAR Scores
A pilot trial assessed whether enhanced prenatal in-home visits improve newborn health (measured by APGAR scores at 5 min). 8 women received usual care; 7 received the new programme.
| Usual Care | 83 | 91 | 94 | 89 | 96 | 91 | 92 | 90 |
|---|---|---|---|---|---|---|---|---|
| New Program | 78 | 82 | 81 | 77 | 81 | 80 | 81 |
usual_care <- c(83,91,94,89,96,91,92,90)
new_program <- c(78,82,81,77,81,80,81)
boxplot(usual_care, new_program,
names = c("Usual Care","New Program"),
col = c("lightblue","lightgreen"),
main = "APGAR Scores by Care Programme",
ylab = "APGAR Score")##
## Wilcoxon rank sum exact test
##
## data: usual_care and new_program
## W = 56, p-value = 0.0003108
## alternative hypothesis: true location shift is not equal to 0
Rank table (\(n=8\), \(m=7\)):
| Usual Care | 83 | 91 | 94 | 89 | 96 | 91 | 92 | 90 |
|---|---|---|---|---|---|---|---|---|
| Rank | 8 | 11.5 | 14 | 9 | 15 | 11.5 | 13 | 10 |
| New Program | 78 | 82 | 81 | 77 | 81 | 80 | 81 | |
| Rank | 2 | 7 | 5 | 1 | 5 | 3 | 5 |
\(S = 8+11.5+14+9+15+11.5+13+10 = 92\), \(\quad U = 92 - \dfrac{8\times9}{2} = 56\)
Hypothesis test:
The Kruskal–Wallis test is the nonparametric equivalent of one-way ANOVA. It compares the location (median) of \(k \geq 3\) independent groups.
Test statistic:
\[H = \frac{12}{N(N+1)}\sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)\]
where \(n_i\) = sample size of group \(i\), \(R_i\) = rank sum for group \(i\), \(N = \sum n_i\).
Under \(H_0\), \(H \approx \chi^2_{k-1}\) for sufficiently large \(n_i\).
Hypotheses:
📝 Example 3.17: Potassium Content in Breakfast Drinks
Three brands of breakfast drink were tested for potassium content (milli-equivalents per quart). Is there a difference in potassium content at \(\alpha = 0.10\)?
| Brand A | Brand B | Brand C |
|---|---|---|
| 5.0 | 3.0 | 4.0 |
| 1.0 | 11.5 | 7.5 |
| 2.0 | 5.0 | 9.5 |
| 6.5 | 7.0 | 5.0 |
| 4.5 | 10.0 | 8.2 |
| 1.0 | 9.4 | 10.25 |
df17 <- data.frame(
Potassium = c(5.0,1.0,2.0,6.5,4.5,1.0,
3.0,11.5,5.0,7.0,10.0,9.4,
4.0,7.5,9.5,5.0,8.2,10.25),
Brand = rep(c("A","B","C"), each = 6)
)
df17$Brand <- as.factor(df17$Brand)
boxplot(Potassium ~ Brand, data = df17,
col = c("tomato","skyblue","lightgreen"),
main = "Potassium by Brand", ylab = "Potassium (mEq/qt)")##
## Kruskal-Wallis rank sum test
##
## data: Potassium by Brand
## Kruskal-Wallis chi-squared = 6.4308, df = 2, p-value = 0.04014
Rank table (\(N = 18\), \(k = 3\)):
| Brand A | Rank | Brand B | Rank | Brand C | Rank | |
|---|---|---|---|---|---|---|
| 5.0 | 8 | 3.0 | 4 | 4.0 | 5 | |
| 1.0 | 1.5 | 11.5 | 18 | 7.5 | 12 | |
| 2.0 | 3 | 5.0 | 8 | 9.5 | 15 | |
| 6.5 | 10 | 7.0 | 11 | 5.0 | 8 | |
| 4.5 | 6 | 10.0 | 16 | 8.2 | 13 | |
| 1.0 | 1.5 | 9.4 | 14 | 10.25 | 17 | |
| \(R_i\) | 30 | 71 | 70 |
\[H = \frac{12}{18 \times 19}\left(\frac{30^2}{6}+\frac{71^2}{6}+\frac{70^2}{6}\right) - 3(19) = 6.39\]
Hypothesis test:
Post-hoc pairwise comparisons:
##
## Pairwise comparisons using Wilcoxon rank sum exact test
##
## data: df17$Potassium and df17$Brand
##
## A B
## B 0.091 -
## C 0.091 1.000
##
## P value adjustment method: bonferroni
📝 Example 3.18: Caffeine in Four Types of Beverages
Caffeine content (mg) was measured in regular servings of four beverage types (unequal sample sizes). Is there a significant difference at \(\alpha = 0.05\)?
| Type 1 | Type 2 | Type 3 | Type 4 |
|---|---|---|---|
| 57 | 43 | 87 | 60 |
| 81 | 61 | 69 | 70 |
| 67 | 42 | 58 | 75 |
| 64 | 45 | 82 | 72 |
| 96 | 39 | 90 | 79 |
| 80 | 91 | 76 | |
| 68 | |||
| 56 |
df18 <- data.frame(
Caffeine = c(57,81,67,64,96,80,68,56,
43,61,42,45,39,
87,69,58,82,90,91,
60,70,75,72,79,76),
Type = factor(c(rep(1,8), rep(2,5), rep(3,6), rep(4,6)))
)
boxplot(Caffeine ~ Type, data = df18,
col = c("lightblue","lightgreen","coral","gold"),
main = "Caffeine Content by Beverage Type",
xlab = "Type", ylab = "Caffeine (mg)")##
## Kruskal-Wallis rank sum test
##
## data: Caffeine by Type
## Kruskal-Wallis chi-squared = 11.308, df = 3, p-value = 0.01017
Rank table (\(N=25\), \(k=4\), rank sums: \(R_1=108\), \(R_2=19\), \(R_3=110\), \(R_4=88\)):
\[H = \frac{12}{25 \times 26}\left(\frac{108^2}{8}+\frac{19^2}{5}+\frac{110^2}{6}+\frac{88^2}{6}\right) - 3(26) = 11.308\]
Hypothesis test:
Post-hoc pairwise comparisons:
##
## Pairwise comparisons using Wilcoxon rank sum exact test
##
## data: df18$Caffeine and df18$Type
##
## 1 2 3
## 2 0.037 - -
## 3 1.000 0.052 -
## 4 1.000 0.052 1.000
##
## P value adjustment method: bonferroni
Spearman’s rank correlation coefficient \(r_s\) measures the monotone association between two variables. It is the nonparametric counterpart of Pearson’s \(r\), and is appropriate when:
Formula:
\[r_s = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2-1)}\]
where \(D_i = R(X_i) - R(Y_i)\) is the difference in ranks for observation \(i\).
The range is \(-1 \leq r_s \leq 1\), with the same directional interpretation as Pearson’s \(r\).
Hypotheses:
\[ \begin{aligned} (a)\; &H_0: \rho_s = 0, \quad H_1: \rho_s \neq 0 \quad \text{(two-sided)} \\ (b)\; &H_0: \rho_s \leq 0, \quad H_1: \rho_s > 0 \quad \text{(positive association)} \\ (c)\; &H_0: \rho_s \geq 0, \quad H_1: \rho_s < 0 \quad \text{(negative association)} \end{aligned} \]
In R:
cor.test(x, y, method = "spearman", alternative = ...)
📝 Example 3.19: Machinery Productivity and Operator Satisfaction
Ten types of machinery are ranked on productivity (1–10) and operator satisfaction (1–10). Test at \(\alpha = 0.05\) whether productivity and satisfaction are positively related.
| Machine | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Productivity rank | 8 | 3 | 9 | 2 | 7 | 10 | 4 | 6 | 1 | 5 |
| Satisfaction rank | 9 | 5 | 10 | 1 | 8 | 7 | 3 | 4 | 2 | 6 |
productivity <- c(8,3,9,2,7,10,4,6,1,5)
satisfaction <- c(9,5,10,1,8,7,3,4,2,6)
plot(productivity, satisfaction,
main = "Productivity Rank vs. Satisfaction Rank",
xlab = "Productivity Rank", ylab = "Satisfaction Rank",
pch = 19, col = "steelblue")
abline(lm(satisfaction ~ productivity), col = "red", lty = 2)##
## Spearman's rank correlation rho
##
## data: productivity and satisfaction
## S = 24, p-value = 0.001752
## alternative hypothesis: true rho is greater than 0
## sample estimates:
## rho
## 0.8545455
Calculation table:
| Machine | \(R(X)\) | \(R(Y)\) | \(D\) | \(D^2\) |
|---|---|---|---|---|
| 1 | 8 | 9 | −1 | 1 |
| 2 | 3 | 5 | −2 | 4 |
| 3 | 9 | 10 | −1 | 1 |
| 4 | 2 | 1 | +1 | 1 |
| 5 | 7 | 8 | −1 | 1 |
| 6 | 10 | 7 | +3 | 9 |
| 7 | 4 | 3 | +1 | 1 |
| 8 | 6 | 4 | +2 | 4 |
| 9 | 1 | 2 | −1 | 1 |
| 10 | 5 | 6 | −1 | 1 |
| Total | \(\sum D^2 = 24\) |
\[r_s = 1 - \frac{6 \times 24}{10(100-1)} = 1 - 0.1455 = 0.8545\]
Hypothesis test:
📝 Example 3.20: Cigarette Smoking and Birth Weight
A study recorded the number of cigarettes smoked per day during pregnancy and the birth weight (lbs) of the newborn for 12 women. Is there a significant (two-sided) association?
| Woman | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cigarettes/day | 12 | 15 | 35 | 21 | 20 | 17 | 19 | 46 | 20 | 25 | 39 | 25 |
| Birth weight (lbs) | 7.7 | 8.1 | 6.9 | 8.2 | 8.6 | 8.3 | 9.4 | 7.8 | 8.3 | 5.2 | 6.4 | 7.9 |
cigarettes <- c(12,15,35,21,20,17,19,46,20,25,39,25)
weight <- c(7.7,8.1,6.9,8.2,8.6,8.3,9.4,7.8,8.3,5.2,6.4,7.9)
plot(cigarettes, weight,
main = "Cigarettes per Day vs. Birth Weight",
xlab = "Cigarettes per Day", ylab = "Birth Weight (lbs)",
pch = 19, col = "tomato")##
## Spearman's rank correlation rho
##
## data: cigarettes and weight
## S = 431.26, p-value = 0.09183
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5079094
Hypothesis test:
📝 Example 3.21: Tree Diameter and Height
A biologist measures the diameter (inches, at 4.5 ft height) and height (feet) of 10 tall trees. Is there a significant association?
| Diameter (in.) | 1024 | 950 | 451 | 505 | 761 | 644 | 707 | 586 | 442 | 546 |
|---|---|---|---|---|---|---|---|---|---|---|
| Height (ft) | 261 | 321 | 219 | 281 | 159 | 83 | 191 | 141 | 232 | 108 |
diameter <- c(1024,950,451,505,761,644,707,586,442,546)
height <- c(261,321,219,281,159,83,191,141,232,108)
plot(diameter, height,
main = "Tree Diameter vs. Height",
xlab = "Diameter (in.)", ylab = "Height (ft)",
pch = 19, col = "darkgreen")##
## Spearman's rank correlation rho
##
## data: diameter and height
## S = 146, p-value = 0.7588
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1151515
Calculation table:
| Diameter | 1024 | 950 | 451 | 505 | 761 | 644 | 707 | 586 | 442 | 546 |
|---|---|---|---|---|---|---|---|---|---|---|
| \(R(X)\) | 10 | 9 | 2 | 3 | 8 | 6 | 7 | 5 | 1 | 4 |
| \(R(Y)\) | 8 | 10 | 6 | 9 | 4 | 1 | 5 | 3 | 7 | 2 |
| \(D\) | 2 | −1 | −4 | −6 | 4 | 5 | 2 | 2 | −6 | 2 |
| \(D^2\) | 4 | 1 | 16 | 36 | 16 | 25 | 4 | 4 | 36 | 4 |
\(\sum D^2 = 146\), \(\quad r_s = 1 - \dfrac{6 \times 146}{10(99)} = 1 - 0.885 = 0.115\)
Hypothesis test:
The chi-squared test of independence tests whether two categorical variables are associated. It is used when both variables are nominal or ordinal and the data are presented in a contingency table.
Hypotheses:
Procedure:
❗ Conditions for validity
All expected cell counts should be \(\geq 5\). If any are \(< 5\):
fisher.test()) instead — valid for any cell counts.In R: chisq.test(table(var_A, var_B)) or
chisq.test(contingency_matrix)
📝 Example 3.22: Smoking Status and Exercise Habit
A researcher surveyed 200 adults on their smoking status (Smoker / Non-smoker) and exercise habit (Regular / Occasional / None). Is smoking status associated with exercise habit at \(\alpha = 0.05\)?
| Regular | Occasional | None | Total | |
|---|---|---|---|---|
| Smoker | 25 | 30 | 45 | 100 |
| Non-smoker | 55 | 35 | 10 | 100 |
| Total | 80 | 65 | 55 | 200 |
obs <- matrix(c(25,30,45, 55,35,10),
nrow = 2, byrow = TRUE,
dimnames = list(Status = c("Smoker","Non-smoker"),
Exercise = c("Regular","Occasional","None")))
obs## Exercise
## Status Regular Occasional None
## Smoker 25 30 45
## Non-smoker 55 35 10
## Exercise
## Status Regular Occasional None
## Smoker 40 32.5 27.5
## Non-smoker 40 32.5 27.5
##
## Pearson's Chi-squared test
##
## data: obs
## X-squared = 33.907, df = 2, p-value = 4.336e-08
Hypothesis test:
📝 Example 3.23: Gender and Method of Payment
A retail store recorded the gender and preferred method of payment (Cash, Card, Digital) for 120 customers. Is there a significant association at \(\alpha = 0.05\)?
| Cash | Card | Digital | Total | |
|---|---|---|---|---|
| Female | 20 | 35 | 15 | 70 |
| Male | 25 | 15 | 10 | 50 |
| Total | 45 | 50 | 25 | 120 |
obs2 <- matrix(c(20,35,15, 25,15,10),
nrow = 2, byrow = TRUE,
dimnames = list(Gender = c("Female","Male"),
Payment = c("Cash","Card","Digital")))
obs2## Payment
## Gender Cash Card Digital
## Female 20 35 15
## Male 25 15 10
## Payment
## Gender Cash Card Digital
## Female 26.25 29.16667 14.58333
## Male 18.75 20.83333 10.41667
##
## Pearson's Chi-squared test
##
## data: obs2
## X-squared = 6.4, df = 2, p-value = 0.04076
Hypothesis test:
❗Instructions
A company analyst wants to investigate the relationship between the number of calls an agent makes in a week and the number of sales made. The analyst reasons that sales should depend to some degree on the number of calls. Data for a random sample of 20 agents are given below.
| Agent | Calls (\(x\)) | Sales (\(y\)) | Agent | Calls (\(x\)) | Sales (\(y\)) | |
|---|---|---|---|---|---|---|
| 1 | 16 | 5 | 11 | 29 | 12 | |
| 2 | 13 | 7 | 12 | 25 | 7 | |
| 3 | 21 | 8 | 13 | 38 | 18 | |
| 4 | 37 | 20 | 14 | 33 | 14 | |
| 5 | 17 | 9 | 15 | 12 | 2 | |
| 6 | 20 | 12 | 16 | 30 | 15 | |
| 7 | 17 | 4 | 17 | 30 | 10 | |
| 8 | 28 | 16 | 18 | 10 | 4 | |
| 9 | 28 | 11 | 19 | 18 | 7 | |
| 10 | 6 | 2 | 20 | 21 | 10 |
calls <- c(16,13,21,37,17,20,17,28,28, 6,
29,25,38,33,12,30,30,10,18,21)
sales <- c( 5, 7, 8,20, 9,12, 4,16,11, 2,
12, 7,18,14, 2,15,10, 4, 7,10)
dt1 <- data.frame(calls = calls, sales = sales)The R output below is used to answer all parts of Question 1.
# Scatterplot
plot(calls, sales,
xlab = "Number of Calls", ylab = "Number of Sales",
main = "Calls vs. Sales",
pch = 19, col = "steelblue")# Pearson correlation test (95% CI by default)
cor.test(calls, sales, alternative = "two.sided", conf.level = 0.95)##
## Pearson's product-moment correlation
##
## data: calls and sales
## t = 8.7336, df = 18, p-value = 6.873e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7591250 0.9599183
## sample estimates:
## cor
## 0.8994835
(1.1) From the scatterplot, describe the relationship between the two variables. (3 marks)
Answer: The scatterplot shows a strong, positive, linear relationship between the number of calls and the number of sales. As the number of calls an agent makes increases, the number of sales also tends to increase. The points cluster fairly tightly around an upward-sloping straight line, and there are no obvious outliers or influential observations.
(1.2) What is the sample correlation coefficient between the number of sales and the number of calls? Interpret its value. (3 marks)
Answer: From the R output, the sample correlation coefficient is \(r = 0.899\).
Interpretation: This value indicates a very strong positive linear relationship between the number of calls and the number of sales. Agents who make more calls tend to achieve substantially more sales. The value is close to 1, confirming that the linear association is strong.
(1.3) At significance level 0.05, test whether the number of sales and the number of calls are correlated. (6 marks)
Answer:
Hypotheses:
Test statistic and distribution:
\(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim
t_{n-2} = t_{18}\) under \(H_0\)
From R: \(t_0 = 8.7336\), p-value \(= 6.873 \times 10^{-8}\)
Decision: Since p-value \(= 6.873 \times 10^{-8} < \alpha = 0.05\), we reject \(H_0\).
Conclusion: There is very strong statistical evidence of a significant positive linear correlation between the number of calls made and the number of sales achieved (\(r = 0.899\), p-value \(< 0.001\)).
(1.4) Find the 95% confidence interval for the population correlation \(\rho\) and interpret it. (4 marks)
Answer: From the R output, the 95% confidence interval for \(\rho\) is (0.7591, 0.9599).
Interpretation: We are 95% confident that the true population correlation between calls and sales lies between 0.7591 and 0.9599. Since this interval is entirely above zero and contains only large positive values, it strongly supports the conclusion that calls and sales are positively and substantially correlated in the population.
A researcher investigates the relationship between daily cigarette smoking and illness-related work absences. For 15 employees, the number of cigarettes smoked daily (\(x\)) and the number of days absent in the past year (\(y\)) were recorded.
| Subject | Cigarettes/day (\(x\)) | Days absent (\(y\)) | Subject | Cigarettes/day (\(x\)) | Days absent (\(y\)) | |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 9 | 35 | 8 | |
| 2 | 0 | 3 | 10 | 44 | 9 | |
| 3 | 0 | 8 | 11 | 53 | 5 | |
| 4 | 10 | 10 | 12 | 33 | 12 | |
| 5 | 13 | 14 | 13 | 21 | 15 | |
| 6 | 20 | 14 | 14 | 25 | 13 | |
| 7 | 27 | 11 | 15 | 14 | 9 | |
| 8 | 35 | 6 |
The simple linear regression model is used: \(y = \beta_0 + \beta_1 x + \epsilon\).
cigs <- c( 0, 0, 0,10,13,20,27,35,35,44,53,33,21,25,14)
absent <- c( 1, 3, 8,10,14,14,11, 6, 8, 9, 5,12,15,13, 9)
dt2 <- data.frame(cigs = cigs, absent = absent)# Scatterplot
plot(cigs, absent,
xlab = "Cigarettes per Day", ylab = "Days Absent",
main = "Cigarettes Smoked vs. Days Absent",
pch = 19, col = "tomato")##
## Pearson's product-moment correlation
##
## data: cigs and absent
## t = 0.52436, df = 13, p-value = 0.6089
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3976627 0.6111258
## sample estimates:
## cor
## 0.1439172
##
## Call:
## lm(formula = absent ~ cigs, data = dt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3878 -2.6799 0.0953 3.0416 5.8369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.38784 1.90166 4.411 0.000704 ***
## cigs 0.03692 0.07040 0.524 0.608854
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.273 on 13 degrees of freedom
## Multiple R-squared: 0.02071, Adjusted R-squared: -0.05462
## F-statistic: 0.275 on 1 and 13 DF, p-value: 0.6089
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| cigs | 1 | 5.02063 | 5.02063 | 0.2749531 | 0.6088536 |
| Residuals | 13 | 237.37937 | 18.25995 | NA | NA |
## 5 % 95 %
## (Intercept) 5.02012966 11.7555489
## cigs -0.08776229 0.1615951
##
## studentized Breusch-Pagan test
##
## data: fit2
## BP = 0.97208, df = 1, p-value = 0.3242
##
## Shapiro-Wilk normality test
##
## data: resid(fit2)
## W = 0.95729, p-value = 0.6454
##
## Durbin-Watson test
##
## data: fit2
## DW = 0.83406, p-value = 0.002079
## alternative hypothesis: true autocorrelation is greater than 0
(2.1) Obtain the fitted least-squares regression line for predicting days absent from number of cigarettes smoked. (3 marks)
Answer: From the summary() output, the
estimated coefficients are \(b_0 =
8.3878\) and \(b_1 = 0.0369\).
The fitted regression line is:
\[\hat{y} = 8.3878 + 0.0369\, x\]
Interpretation:
(2.2) At significance level 0.10, test whether the number of cigarettes smoked can be used to describe days absent. (5 marks)
Answer:
Hypotheses:
Test statistic:
\(t_0 = \dfrac{b_1}{se(b_1)} \sim t_{n-2} =
t_{13}\) under \(H_0\)
From R: \(t_0 = 0.5244\), p-value \(= 0.6089\)
Decision: Since p-value \(= 0.609 > \alpha = 0.10\), we fail to reject \(H_0\).
Conclusion: There is insufficient evidence at the 10% significance level to conclude that cigarette smoking is a significant linear predictor of work absences. The number of cigarettes smoked daily does not appear to linearly explain the number of days absent.
(2.3) Find the 90% confidence intervals for the regression parameters \(\beta_0\) and \(\beta_1\). (3 marks)
Answer: From
confint(fit, level = 0.90):
| Parameter | Lower bound | Upper bound | Interpretation |
|---|---|---|---|
| \(\beta_0\) (intercept) | 5.0201 | 11.7555 | 90% CI for the predicted days absent for a non-smoker |
| \(\beta_1\) (slope) | −0.0878 | 0.1616 | 90% CI for the change in absences per additional cigarette |
Note that the 90% CI for \(\beta_1\) contains zero, which is consistent with the conclusion in Q2.2 that \(\beta_1\) is not significantly different from zero.
(2.4) What percentage of the variation in days absent can be accounted for by the fitted regression line? (2 marks)
Answer: From summary(), the Multiple
R-squared \(= 0.0207\).
Interpretation: Only 2.07% of the total variation in days absent is explained by the linear regression on number of cigarettes smoked. This is a very low value, indicating that the number of cigarettes smoked is a poor predictor of work absences when used alone.
(2.5) At significance level 0.10, is the constant variance assumption violated? (2 marks)
Answer: Test: Breusch-Pagan
test
Hypotheses: \(H_0\):
constant variance (homoscedasticity) vs. \(H_1\): non-constant variance
(heteroscedasticity)
From R: p-value \(=
0.3242\)
Decision: Since p-value \(=
0.324 > \alpha = 0.10\), we fail to reject \(H_0\).
Conclusion: The constant variance assumption is
not violated. The errors appear to have constant
variance.
(2.6) At significance level 0.10, is the normality assumption violated? (2 marks)
Answer: Test: Shapiro-Wilk
test
Hypotheses: \(H_0\):
residuals are normally distributed vs. \(H_1\): residuals are not normal
From R: p-value \(=
0.6454\)
Decision: Since p-value \(=
0.645 > \alpha = 0.10\), we fail to reject \(H_0\).
Conclusion: The normality assumption is not
violated. The residuals are consistent with a normal
distribution.
(2.7) At significance level 0.10, is the independence assumption violated? (3 marks)
Answer: Test: Durbin-Watson
test
Hypotheses: \(H_0\):
no autocorrelation in residuals vs. \(H_1\): positive autocorrelation
present
From R: DW statistic \(=
0.8341\), p-value \(=
0.0021\)
Decision: Since p-value \(=
0.0021 < \alpha = 0.10\), we reject \(H_0\).
Conclusion: The independence assumption is
violated. There is significant positive autocorrelation in the
residuals.
⚠️ What to do next
A DW statistic well below 2 indicates that consecutive residuals tend to have the same sign (positive autocorrelation). Possible remedies include:
In this case, the overall model fit is also poor (\(R^2 = 2.1\%\)), and the slope is not significant. Other predictors of absenteeism (age, health status, working conditions) may be needed.
Researchers study the relationship between Gender, Age, Number of working years, and Income. Let \(y\) = Income, \(x_1\) = Age, \(x_2\) = Number of working years, and \(x_3\) = Gender dummy (\(x_3 = 1\) if Female, \(0\) if Male). The full model is:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\]
dt3 <- data.frame(
Income = c(27000,35000,42000,31000,26000,45000,38000,
29000,50000,33000,41000,28000,36000,52000,30000),
Age = c(25,35,42,29,24,48,38,27,52,31,40,26,34,50,28),
Working = c( 2,10,15, 5, 1,20,12, 3,25, 7,14, 2, 9,22, 4),
Gender = c("male","male","female","male","female","female","male",
"female","female","male","male","female","male","female","male")
)y <- dt3$Income
x1 <- dt3$Age
x2 <- dt3$Working
x3 <- ifelse(dt3$Gender == "female", 1, 0)
# After stepwise selection
full <- lm(y ~ x1 + x2+ x3)
f1 <- lm(y ~ x1 + x3)
anova(f1,full)| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 12 | 17585095 | NA | NA | NA | NA |
| 11 | 17569551 | 1 | 15544.03 | 0.0097319 | 0.9231908 |
##
## Call:
## lm(formula = y ~ x1 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2274.96 -495.53 73.46 323.68 2966.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5292.11 1260.21 4.199 0.00123 **
## x1 879.47 36.47 24.114 1.55e-11 ***
## x3 -231.88 662.78 -0.350 0.73251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1211 on 12 degrees of freedom
## Multiple R-squared: 0.9817, Adjusted R-squared: 0.9787
## F-statistic: 322.4 on 2 and 12 DF, p-value: 3.722e-11
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| x1 | 1 | 944635535.8 | 944635535.8 | 644.615605 | 0.0000000 |
| x3 | 1 | 179369.5 | 179369.5 | 0.122401 | 0.7325084 |
| Residuals | 12 | 17585094.7 | 1465424.6 | NA | NA |
## 2.5 % 97.5 %
## (Intercept) 2546.3481 8037.871
## x1 800.0101 958.937
## x3 -1675.9500 1212.192
(3.1) Conduct the overall fit test of the regression at \(\alpha = 0.05\). (5 marks)
Answer:
Hypotheses:
Test statistic:
\(F = 322.4\) on \((2, 12)\) df
p-value \(= 3.722 \times
10^{-11}\)
Decision: Since p-value \(< \alpha = 0.05\), we reject \(H_0\).
Conclusion: The regression model is statistically significant. Age and Gender together are significant predictors of Income at the 5% significance level.
(3.2) Write down the fitted equations for predicting Income of female and male staff separately. (4 marks)
Answer: The stepwise procedure removed \(x_2\) (Number of working years). The final fitted equation is:
\[\hat{y} = 5292.11 + 879.47\, x_1 -231.88\, x_3\]
Substituting the dummy variable values:
Male staff (\(x_3 = 0\)): \[\hat{y}_\text{Male} = 5292.11 + 879.47\, x_1\]
Female staff (\(x_3 = 1\)): \[\hat{y}_\text{Female} = (5292.11-231.88) + 879.47\, x_1 = 5060.23 + 879.47\, x_1\]
Interpretation of \(b_3 = -231.88\): Holding Age constant, female staff earn an estimated 231.88 less than male staff of the same age, on average.
Both lines share the same slope: each additional year of age is associated with an estimated 879.47 increase in income, for both genders.
(3.3) Find the coefficient of determination of the fitted model and interpret its value. (2 marks)
Answer: From summary(), \(R^2 = 0.9817\), i.e.,
98.17%.
Interpretation: The model (Age and Gender) explains approximately 98.17% of the total variation in Income. The remaining is attributable to other factors not in the model (such as working years, which were dropped by stepwise selection, or other unmeasured variables).
(3.4) Find the 95% confidence interval for the parameter that represents the income difference between male and female staff. (2 marks)
Answer: The income difference between genders is
captured by \(\beta_3\). From
confint(fit):
\[(-1675.95,\; 1212.192)\]
Interpretation: We are 95% confident that the true income difference between female and male staff of the same age lies between -1675.95 and 1212.192. Since the interval contains zero, the difference in income between genders is not statistically significant at the 5% level, despite the negative point estimate.
(3.5) Predict the Income of a male staff member aged 28 who has been working for 5 years. (2 marks)
Answer: Since Number of Working Years (\(x_2\)) was removed during stepwise selection, it is not used in the prediction. We use \(x_1 = 28\) (Age) and \(x_3 = 0\) (Male):
\[\hat{y}_\text{Male} = 5292.11 + 879.47(28)=29917.27\]
The predicted monthly income of this male staff member is approximately 29,917 Baht.
The Faculty of Science, CMU, is piloting a new online English learning programme called SpeedyE. Twenty students were sampled — 10 from Faculty 1 and 10 from Faculty 2, with 5 male and 5 female from each. Students completed a pretest, used the programme for one week, then completed a posttest.
| Faculty | Gender | Pretest | Posttest | Faculty | Gender | Pretest | Posttest | |
|---|---|---|---|---|---|---|---|---|
| 1 | Male | 58 | 71 | 2 | Male | 51 | 69 | |
| 1 | Male | 57 | 69 | 2 | Male | 62 | 69 | |
| 1 | Male | 63 | 71 | 2 | Male | 58 | 66 | |
| 1 | Male | 66 | 70 | 2 | Male | 52 | 61 | |
| 1 | Male | 45 | 65 | 2 | Male | 59 | 63 | |
| 1 | Female | 64 | 71 | 2 | Female | 39 | 65 | |
| 1 | Female | 39 | 71 | 2 | Female | 32 | 66 | |
| 1 | Female | 69 | 71 | 2 | Female | 62 | 70 | |
| 1 | Female | 56 | 76 | 2 | Female | 64 | 68 | |
| 1 | Female | 67 | 71 | 2 | Female | 66 | 68 |
The model is: \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\]
where \(x_1\) = Pretest score, \(x_2\) = Gender dummy (\(1\) = Male, \(0\) = Female), \(x_3\) = Faculty dummy (\(1\) = Faculty 1, \(0\) = Faculty 2), and \(y\) = Posttest score.
dt4 <- data.frame(
Faculty = c(1,1,1,1,1, 2,2,2,2,2, 1,1,1,1,1, 2,2,2,2,2),
Gender = c(rep("Male",10), rep("Female",10)),
Pretest = c(58,57,63,66,45, 51,62,58,52,59, 64,39,69,56,67, 39,32,62,64,66),
Posttest = c(71,69,71,70,65, 69,69,66,61,63, 71,71,71,76,71, 65,66,70,68,68)
)
y <- dt4$Posttest
x1 <- dt4$Pretest
x2 <- ifelse(dt4$Gender == "Male", 1, 0)
x3 <- ifelse(dt4$Faculty == 1, 1, 0)
fit4 <- lm(y ~ x1 + x2 + x3)
summary(fit4)##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0690 -1.3876 -0.1686 1.3039 4.4061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.84557 3.11985 20.144 8.57e-13 ***
## x1 0.08922 0.05517 1.617 0.12541
## x2 -2.41598 1.10151 -2.193 0.04340 *
## x3 3.75205 1.12003 3.350 0.00407 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.458 on 16 degrees of freedom
## Multiple R-squared: 0.5665, Adjusted R-squared: 0.4852
## F-statistic: 6.969 on 3 and 16 DF, p-value: 0.003264
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| x1 | 1 | 28.31784 | 28.317844 | 4.687714 | 0.0458376 |
| x2 | 1 | 30.18700 | 30.186998 | 4.997132 | 0.0399947 |
| x3 | 1 | 67.79133 | 67.791325 | 11.222123 | 0.0040678 |
| Residuals | 16 | 96.65383 | 6.040865 | NA | NA |
(4.1) Write down the fitted linear equation to predict Posttest score. (3 marks)
Answer: From the summary() output, the
estimated coefficients are:
| Parameter | Estimate | Variable |
|---|---|---|
| \(b_0\) | 62.8456 | Intercept |
| \(b_1\) | 0.0892 | Pretest score |
| \(b_2\) | −2.4160 | Gender (Male = 1) |
| \(b_3\) | 3.7520 | Faculty (Faculty 1 = 1) |
The overall fitted equation is: \[\hat{y} = 62.8456 + 0.0892\, x_1 - 2.4160\, x_2 + 3.7520\, x_3\]
Coefficient interpretations:
(4.2) Write down the fitted equations for each combination of faculty and gender. (6 marks)
Answer: Substituting the dummy variable values gives four parallel equations (all with slope \(b_1 = 0.0892\)):
| Group | \(x_2\) | \(x_3\) | Intercept | Fitted equation |
|---|---|---|---|---|
| Faculty 1, Female | 0 | 1 | \(62.8456 + 3.7520 = 66.5976\) | \(\hat{y} = 66.5976 + 0.0892\, x_1\) |
| Faculty 1, Male | 1 | 1 | \(62.8456 - 2.4160 + 3.7520 = 64.1816\) | \(\hat{y} = 64.1816 + 0.0892\, x_1\) |
| Faculty 2, Female | 0 | 0 | \(62.8456\) | \(\hat{y} = 62.8456 + 0.0892\, x_1\) |
| Faculty 2, Male | 1 | 0 | \(62.8456 - 2.4160 = 60.4296\) | \(\hat{y} = 60.4296 + 0.0892\, x_1\) |
All four lines have the same slope (0.0892) — the dummy variables shift the intercept only. Faculty 1 Female students have the highest predicted posttest for any given pretest score; Faculty 2 Male students have the lowest.
(4.3) Compare the predicted posttest scores of male and female students from the same faculty. (3 marks)
Answer: The difference in predicted posttest scores between male and female students from the same faculty with the same pretest score is captured entirely by \(b_2 = -2.416\).
For any fixed pretest score \(x_1\) and faculty: \[\hat{y}_\text{Male} - \hat{y}_\text{Female} = -2.416\]
Conclusion: Male students are predicted to score 2.416 points lower on the posttest than female students from the same faculty with the same pretest score. This gap is the same regardless of faculty or pretest level, because the model assumes no interaction between gender and the other predictors.
A market research analyst examines whether advertising budget is related to consumer retention of advertisements. For 21 companies, the advertising budget (\(x\), million Baht) and the percentage of customers who remember key details (“retained impressions”, \(y\), %) were recorded.
ad_budget <- c(90.1,85.1,108.6,71.6,92.9,45.1,98.9,36.9,
90.5,69.8,84.5,42.5,100.4,81.6,57.4,84.6,
51.5,95.1,53.6,89.8,74.9)
ret_impr <- c(55.6,42.7,62.2,30.5,45.8,16.4,60.7,10.1,
45.5,32.9,48.3,15.8,58.2,35.9,26.8,49.4,
19.3,56.9,21.3,48.2,33.2)
dt5 <- data.frame(x = ad_budget, y = ret_impr)plot(dt5$x, dt5$y,
xlab = "Advertising Budget (million Baht)",
ylab = "Retained Impressions (%)",
main = "Advertising Budget vs. Retained Impressions",
pch = 19, col = "darkorchid")Correlation analysis:
##
## Pearson's product-moment correlation
##
## data: dt5$x and dt5$y
## t = 18.605, df = 19, p-value = 1.18e-13
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
## 0.9436239 0.9877721
## sample estimates:
## cor
## 0.9736366
(5.1) Find the sample correlation between advertising budget and retained impressions. Interpret the value. (3 marks)
Answer: From R, \(r = 0.9736\).
Interpretation: There is a moderate-to-strong positive linear relationship between advertising budget and retained impressions. Companies with higher advertising budgets tend to have a higher percentage of customers who remember their advertisement. The value is strong ( close to 1), suggesting that other factors besides budget also influence retention.
(5.2) Test whether advertising budget and retained impressions are correlated in the population at \(\alpha = 0.10\). (5 marks)
Answer:
Hypotheses:
Test statistic:
\(t_0 = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} =
18.605\) on \(n - 2 = 19\)
df
p-value \(= 1.18 \times
10^{-13}\)
Decision: Since p-value $ < = 0.10$, we reject \(H_0\).
Conclusion: There is significant evidence at the 10% level that advertising budget and retained impressions are linearly correlated in the population.
(5.3) Find the 90% confidence interval for the population correlation \(\rho\). (2 marks)
Answer: From R (conf.level = 0.90):
\((0.9436,\; 0.9877)\)
Interpretation: We are 90% confident that the true population correlation between advertising budget and retained impressions lies between 0.9436 and 0.9877. The interval is entirely above zero, confirming a positive association.
Simple linear regression:
##
## Call:
## lm(formula = y ~ x, data = dt5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8143 -2.6441 -0.3756 2.4648 6.4990
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.59850 3.19614 -5.819 1.32e-05 ***
## x 0.75138 0.04039 18.605 1.18e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.789 on 19 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9452
## F-statistic: 346.2 on 1 and 19 DF, p-value: 1.18e-13
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| x | 1 | 4969.0893 | 4969.08931 | 346.161 | 0 |
| Residuals | 19 | 272.7421 | 14.35485 | NA | NA |
## 5 % 95 %
## (Intercept) -24.1250439 -13.0719581
## x 0.6815506 0.8212132
(5.4) Write the fitted regression equation and explain the coefficient estimates. (4 marks)
Answer: From summary(): \(b_0 = -18.598\), \(b_1 = 0.7514\). The fitted equation is:
\[\hat{y} = -18.598 + 0.7514\, x\]
Coefficient interpretations:
(5.5) At significance level 0.10, test whether advertising budget is a significant predictor of retained impressions. (3 marks)
Answer:
(5.6) Find the coefficient of determination and interpret its value. (2 marks)
Answer: From summary(), \(r^2 = 0.948\), i.e.,
94.8%.
Interpretation: The linear regression on advertising budget explains 94.8% of the total variation in retained impressions. The remaining is due to other factors not included in this model (e.g., quality of the advertisement, target audience, timing) or inherent random variability.
(5.7) Find the 90% confidence intervals for \(\beta_0\) and \(\beta_1\). (2 marks)
Answer: From
confint(fit, level = 0.90):
| Parameter | 90% CI Lower | 90% CI Upper |
|---|---|---|
| \(\beta_0\) (intercept) | -24.1250 | -13.0719 |
| \(\beta_1\) (slope) | 0.6815 | 0.8212 |
Interpretation of \(\beta_1\) CI: We are 90% confident that each additional million Baht of advertising budget is associated with a true increase in retained impressions of between 0.195 and 0.531 percentage points.
(5.8) Predict the retained impressions percentage when advertising budget is 100 million Baht. (1 mark)
Answer:
\[\hat{y} = -18.598 + 0.7514(100) = \mathbf{56.542\%}\]
When the advertising budget is 100 million Baht, the model predicts that approximately 56.542% of customers will retain key details from the advertisement.
A data analyst investigates factors related to box-office gross of book-inspired movies. Ten movies were selected. Variables: \(y\) = box-office gross (million USD), \(x_1\) = production budget, \(x_2\) = advertising budget, \(x_3\) = pre-adaptation book sales (all in million USD).
| Movie | \(y\) | \(x_1\) | \(x_2\) | \(x_3\) | Movie | \(y\) | \(x_1\) | \(x_2\) | \(x_3\) | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 85.1 | 8.5 | 5.1 | 4.7 | 6 | 30.3 | 3.5 | 1.2 | 3.5 | |
| 2 | 106.3 | 12.9 | 5.8 | 8.8 | 7 | 79.4 | 9.2 | 3.7 | 9.7 | |
| 3 | 50.2 | 5.2 | 2.1 | 15.1 | 8 | 91.0 | 9.0 | 7.6 | 5.9 | |
| 4 | 130.6 | 10.7 | 8.4 | 12.2 | 9 | 135.4 | 15.1 | 7.7 | 20.8 | |
| 5 | 54.8 | 3.1 | 2.9 | 10.6 | 10 | 89.3 | 10.2 | 4.5 | 7.9 |
The full model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
dt6 <- data.frame(
y = c( 85.1,106.3, 50.2,130.6, 54.8, 30.3, 79.4, 91.0,135.4, 89.3),
x1 = c( 8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
x2 = c( 5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
x3 = c( 4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = dt6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4384 -3.1695 0.8499 3.5134 9.6207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6760 6.7602 1.135 0.2995
## x1 3.6616 1.1178 3.276 0.0169 *
## x2 7.6211 1.6573 4.598 0.0037 **
## x3 0.8285 0.5394 1.536 0.1754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.541 on 6 degrees of freedom
## Multiple R-squared: 0.9668, Adjusted R-squared: 0.9502
## F-statistic: 58.22 on 3 and 6 DF, p-value: 7.913e-05
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| x1 | 1 | 8647.3942 | 8647.39415 | 152.064149 | 0.0000173 |
| x2 | 1 | 1150.8996 | 1150.89959 | 20.238533 | 0.0041095 |
| x3 | 1 | 134.1697 | 134.16974 | 2.359371 | 0.1754396 |
| Residuals | 6 | 341.2005 | 56.86675 | NA | NA |
(6.1) At \(\alpha = 0.05\), test whether book sales (\(x_3\)) should be included in the model when production budget (\(x_1\)) and advertising budget (\(x_2\)) are already included. (3 marks)
Answer:
Hypotheses:
From R: \(t_0 = 1.536\), p-value \(= 0.1754\)
Decision: Since p-value \(= 0.175 > \alpha = 0.05\), we fail to reject \(H_0\).
Conclusion: Book sales (\(x_3\)) do not significantly improve the model when production and advertising budgets are already present. \(x_3\) should be excluded from the model.
The reduced model using only \(x_1\) and \(x_2\):
##
## Call:
## lm(formula = y ~ x1 + x2, data = dt6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4168 -2.5696 0.8051 2.1200 11.0463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.848 6.765 1.751 0.12334
## x1 4.228 1.153 3.667 0.00800 **
## x2 7.436 1.806 4.117 0.00448 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.241 on 7 degrees of freedom
## Multiple R-squared: 0.9537, Adjusted R-squared: 0.9405
## F-statistic: 72.14 on 2 and 7 DF, p-value: 2.131e-05
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| x1 | 1 | 8647.3942 | 8647.39415 | 127.33603 | 0.0000096 |
| x2 | 1 | 1150.8996 | 1150.89959 | 16.94742 | 0.0044779 |
| Residuals | 7 | 475.3703 | 67.91004 | NA | NA |
## 2.5 % 97.5 %
## (Intercept) -4.148412 27.844835
## x1 1.501766 6.954722
## x2 3.164848 11.707370
##
## studentized Breusch-Pagan test
##
## data: fit6
## BP = 7.6284, df = 2, p-value = 0.02206
##
## Shapiro-Wilk normality test
##
## data: resid(fit6)
## W = 0.93429, p-value = 0.4914
##
## Durbin-Watson test
##
## data: fit6
## DW = 2.0124, p-value = 0.5113
## alternative hypothesis: true autocorrelation is greater than 0
(6.2) Test the overall fit of the reduced model (\(x_1\) and \(x_2\)) at \(\alpha = 0.05\). Provide the ANOVA table. (4 marks)
Answer:
| Source | df | SS | MS | \(F\) | p-value |
|---|---|---|---|---|---|
| Regression | 2 | 9,788.7 | 4,894.35 | 70.62 | \(<0.0001\) |
| Residual | 7 | 485.0 | 69.29 | ||
| Total | 9 | 10,273.7 |
(6.3) Test the normality assumption of the errors at \(\alpha = 0.05\). (2 marks)
Answer: Test: Shapiro-Wilk test on
residuals
Hypotheses: \(H_0\):
residuals are normally distributed vs. \(H_1\): not normal
From R: W statistic, p-value \(= 0.4914\)
Decision: Since p-value \(=
0.491 > 0.05\), we fail to reject \(H_0\).
Conclusion: The normality assumption is
satisfied. The residuals are consistent with a normal
distribution.
(6.4) Test the independence assumption of the errors at \(\alpha = 0.05\). (2 marks)
Answer: Test: Durbin-Watson
test
Hypotheses: \(H_0\):
no autocorrelation vs. \(H_1\):
autocorrelation present
From R: DW statistic \(=
2.0124\), p-value \(=0.5113 >
0.05\)
Decision: Since p-value \(> 0.05\), we fail to reject
\(H_0\).
Conclusion: The independence assumption is
satisfied. A DW value close to 2 indicates no
significant autocorrelation in the residuals.
(6.5) Test the constant variance assumption of the errors at \(\alpha = 0.05\). (3 marks)
Answer: Test: Breusch-Pagan
test
Hypotheses: \(H_0\):
constant variance vs. \(H_1\):
non-constant variance (heteroscedasticity)
From R: p-value \(=
0.0226\)
Decision: Since p-value \(=
0.022 < \alpha = 0.05\), we reject \(H_0\).
Conclusion: The constant variance assumption is
violated. There is evidence of heteroscedasticity — the spread
of the residuals changes with the fitted values.
⚠️ What to do next
Since heteroscedasticity is present, the standard errors and therefore the \(t\)-tests and confidence intervals from this model may be unreliable. Recommended next steps:
fit6_log <- lm(log(y) ~ x1 + x2, data = dt6)
par(mfrow = c(2, 2)); plot(fit6_log); par(mfrow = c(1, 1))
bptest(fit6_log)sandwich package in R).High blood pressure can lead to serious health problems including heart attack, stroke, and kidney disease. Researchers selected 10 patients and recorded their blood pressure before and after taking medication.
| Patient | Before | After | Patient | Before | After | |
|---|---|---|---|---|---|---|
| 1 | 86 | 80 | 6 | 77 | 82 | |
| 2 | 84 | 80 | 7 | 89 | 88 | |
| 3 | 78 | 92 | 8 | 90 | 89 | |
| 4 | 90 | 79 | 9 | 90 | 92 | |
| 5 | 92 | 92 | 10 | 86 | 83 |
dt7 <- data.frame(
Patient = 1:10,
Before = c(86, 84, 78, 90, 92, 77, 89, 90, 90, 86),
After = c(80, 80, 92, 79, 92, 82, 88, 89, 92, 83)
)diff_bp <- dt7$Before - dt7$After
boxplot(diff_bp,
ylab = "Before − After (mmHg)",
main = "Blood Pressure Change After Medication",
col = "lightblue")
abline(h = 0, col = "red", lty = 2)
legend("topright", legend = "H₀: median diff = 0",
col = "red", lty = 2, bty = "n")##
## Wilcoxon signed rank exact test
##
## data: dt7$Before and dt7$After
## V = 33, p-value = 0.2852
## alternative hypothesis: true location shift is greater than 0
At \(\alpha = 0.05\), test whether medication helps reduce blood pressure.
Answer Q7:
Test choice: The data are 10 matched pairs (same patients, before and after). The sample is small and normality of differences cannot be assumed. The Wilcoxon signed-rank test is appropriate for paired data when the distribution of differences is approximately symmetric.
Step 1 — Compute differences \(d_i = \text{Before}_i - \text{After}_i\)
| Patient | Before | After | \(d_i\) | Action |
|---|---|---|---|---|
| 1 | 86 | 80 | +6 | keep |
| 2 | 84 | 80 | +4 | keep |
| 3 | 78 | 92 | −14 | keep |
| 4 | 90 | 79 | +11 | keep |
| 5 | 92 | 92 | 0 | exclude (tie) |
| 6 | 77 | 82 | −5 | keep |
| 7 | 89 | 88 | +1 | keep |
| 8 | 90 | 89 | +1 | keep |
| 9 | 90 | 92 | −2 | keep |
| 10 | 86 | 83 | +3 | keep |
Patient 5 is excluded (\(d = 0\)), leaving \(n = 9\) non-tied pairs.
Step 2 — Rank the absolute differences \(|d_i|\), assign average ranks to ties
Sorting \(|d_i|\) in ascending order:
| \(|d|\) value | Patients | Raw ranks | Average rank |
|---|---|---|---|
| 1 | 7, 8 | 1, 2 | 1.5 |
| 2 | 9 | 3 | 3 |
| 3 | 10 | 4 | 4 |
| 4 | 2 | 5 | 5 |
| 5 | 6 | 6 | 6 |
| 6 | 1 | 7 | 7 |
| 11 | 4 | 8 | 8 |
| 14 | 3 | 9 | 9 |
Step 3 — Assign signed ranks and compute \(W^+\)
| Patient | \(d_i\) | \(|d_i|\) | Rank of \(|d_i|\) | Signed rank |
|---|---|---|---|---|
| 7 | +1 | 1 | 1.5 | +1.5 |
| 8 | +1 | 1 | 1.5 | +1.5 |
| 9 | −2 | 2 | 3 | −3 |
| 10 | +3 | 3 | 4 | +4 |
| 2 | +4 | 4 | 5 | +5 |
| 6 | −5 | 5 | 6 | −6 |
| 1 | +6 | 6 | 7 | +7 |
| 4 | +11 | 11 | 8 | +8 |
| 3 | −14 | 14 | 9 | −9 |
\[W^+ = \sum_{\{i:\, d_i > 0\}} \text{rank}_i = 1.5 + 1.5 + 4 + 5 + 7 + 8 = \mathbf{27}\]
\[W^- = \sum_{\{i:\, d_i < 0\}} \text{rank}_i = 3 + 6 + 9 = \mathbf{18}\]
Check: \(W^+ + W^- = 27 + 18 = 45 = \dfrac{9 \times 10}{2}\) ✓
Step 4 — Hypothesis test
Hypotheses:
Test statistic: \(W^+ = 27\) (R reports \(V = 33\) as it takes 0 into ranking)
p-value \(= 0.2852\) (from R, one-sided)
Decision: Since p-value \(=0.2852 > \alpha = 0.05\), we fail to reject \(H_0\).
Conclusion: There is insufficient evidence at the 5% significance level to conclude that the medication significantly reduces blood pressure. The observed differences between before and after readings are consistent with chance variation in this small sample of 10 patients.
Education researchers studied reading habits among 15 college students from two faculties, recording year of study, hours reading per week, and reading skill score.
| St. | Fac. | Yr | Hours | Score | St. | Fac. | Yr | Hours | Score | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 2 | 7.5 | 70 | 9 | B | 3 | 6.0 | 73 | |
| 2 | A | 2 | 5.0 | 65 | 10 | B | 3 | 6.5 | 82 | |
| 3 | A | 2 | 2.0 | 68 | 11 | A | 4 | 5.0 | 90 | |
| 4 | B | 2 | 3.0 | 80 | 12 | A | 4 | 4.0 | 95 | |
| 5 | B | 2 | 4.5 | 81 | 13 | B | 4 | 3.5 | 80 | |
| 6 | A | 3 | 4.0 | 77 | 14 | B | 4 | 3.0 | 70 | |
| 7 | A | 3 | 4.5 | 90 | 15 | A | 4 | 8.0 | 95 | |
| 8 | A | 3 | 7.0 | 75 |
dt8 <- data.frame(
Student = 1:15,
Faculty = factor(c("A","A","A","B","B","A","A","A","B","B",
"A","A","B","B","A")),
YearofStudy = factor(c(2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)),
NumberHours = c(7.5,5,2,3,4.5,4,4.5,7,6,6.5,5,4,3.5,3,8),
ReadingScore = c(70,65,68,80,81,77,90,75,73,82,90,95,80,70,95)
)par(mfrow = c(1, 3))
boxplot(ReadingScore ~ YearofStudy, data = dt8,
main = "Score by Year", xlab = "Year of Study",
ylab = "Reading Score",
col = c("lightblue","lightgreen","lightyellow"))
boxplot(NumberHours ~ Faculty, data = dt8,
main = "Hours by Faculty", xlab = "Faculty",
ylab = "Hours/week", col = c("lightblue","lightcoral"))
plot(dt8$NumberHours, dt8$ReadingScore,
pch = 19, col = as.integer(dt8$Faculty) + 1,
xlab = "Hours/week", ylab = "Reading Score",
main = "Hours vs. Score")
legend("topleft", legend = c("A","B"), col = c(2,3), pch = 19, bty = "n")##
## Kruskal-Wallis rank sum test
##
## data: ReadingScore by YearofStudy
## Kruskal-Wallis chi-squared = 4.1245, df = 2, p-value = 0.1272
Answer Q(8.1):
Test choice: Three independent groups (Year 2, 3, 4), small samples (\(n_i = 5\) each), normality unverifiable. Use the Kruskal-Wallis test.
Step 1 — Rank all \(N = 15\) scores together (average ranks for ties)
| Student | Score | Year | Rank \(R_i\) |
|---|---|---|---|
| 2 | 65 | 2 | 1.0 |
| 3 | 68 | 2 | 2.0 |
| 1 | 70 | 2 | 3.5 |
| 14 | 70 | 4 | 3.5 |
| 9 | 73 | 3 | 5.0 |
| 8 | 75 | 3 | 6.0 |
| 6 | 77 | 3 | 7.0 |
| 4 | 80 | 2 | 8.5 |
| 13 | 80 | 4 | 8.5 |
| 5 | 81 | 2 | 10.0 |
| 10 | 82 | 3 | 11.0 |
| 7 | 90 | 3 | 12.5 |
| 11 | 90 | 4 | 12.5 |
| 12 | 95 | 4 | 14.5 |
| 15 | 95 | 4 | 14.5 |
Step 2 — Compute rank sums \(R_j\) per group
| Year | Students | Ranks | \(R_j\) | \(n_j\) | \(R_j^2/n_j\) |
|---|---|---|---|---|---|
| 2 | 1,2,3,4,5 | 3.5, 1.0, 2.0, 8.5, 10.0 | 25.0 | 5 | 125.00 |
| 3 | 6,7,8,9,10 | 7.0, 12.5, 6.0, 5.0, 11.0 | 41.5 | 5 | 344.45 |
| 4 | 11,12,13,14,15 | 12.5, 14.5, 8.5, 3.5, 14.5 | 53.5 | 5 | 572.45 |
Check: \(R_1 + R_2 + R_3 = 25.0 + 41.5 + 53.5 = 120 = \dfrac{N(N+1)}{2} = \dfrac{15 \times 16}{2}\) ✓
Step 3 — Compute the Kruskal-Wallis statistic \(H\)
\[H = \frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j} - 3(N+1)\]
\[H = \frac{12}{15 \times 16}\left(125.00 + 344.45 + 572.45\right) - 3(16)\]
\[H = \frac{12}{240}\times 1041.90 - 48 = 0.05 \times 1041.90 - 48 = 52.095 - 48 = \mathbf{4.095}\]
(R reports \(H = 4.1245\) — the small difference reflects R’s correction for ties.)
Hypothesis test:
\(H_0\): The distribution of
reading scores is identical across all three years of study
\(H_1\): At least one year has a
different distribution of reading scores
Test statistic: \(H = 4.1245\) on \(k-1 = 2\) df
p-value \(= 0.1272\)
Decision: Since p-value \(= 0.127 > 0.05\), we fail to reject \(H_0\).
Conclusion: There is no significant difference in reading skill scores between 2nd, 3rd, and 4th year students at the 5% level.
wilcox.test(x = dt8$NumberHours[dt8$Faculty == "A"],
y = dt8$NumberHours[dt8$Faculty == "B"],
paired = FALSE, alternative = "two.sided")##
## Wilcoxon rank sum exact test
##
## data: dt8$NumberHours[dt8$Faculty == "A"] and dt8$NumberHours[dt8$Faculty == "B"]
## W = 35.5, p-value = 0.3381
## alternative hypothesis: true location shift is not equal to 0
Answer Q(8.2):
Test choice: Two independent groups (Faculty A: \(n_A = 9\); Faculty B: \(n_B = 6\)), small samples. Use the Mann-Whitney U test.
Step 1 — Rank all \(N = 15\) hour values together
| Value | Faculty | Rank |
|---|---|---|
| 2.0 | A | 1.0 |
| 3.0 | B | 2.5 |
| 3.0 | B | 2.5 |
| 3.5 | B | 4.0 |
| 4.0 | A | 5.5 |
| 4.0 | A | 5.5 |
| 4.5 | A | 7.5 |
| 4.5 | B | 7.5 |
| 5.0 | A | 9.5 |
| 5.0 | A | 9.5 |
| 6.0 | B | 11.0 |
| 6.5 | B | 12.0 |
| 7.0 | A | 13.0 |
| 7.5 | A | 14.0 |
| 8.0 | A | 15.0 |
Step 2 — Compute rank sum \(S\) for Faculty A and test statistic \(U\)
\[S_A = 1.0 + 5.5 + 5.5 + 7.5 + 9.5 + 9.5 + 13.0 + 14.0 + 15.0 = \mathbf{80.5}\]
\[U_A = S_A - \frac{n_A(n_A+1)}{2} = 80.5 - \frac{9 \times 10}{2} = 80.5 - 45 = \mathbf{35.5}\]
Check: \(S_A + S_B = 80.5 + 39.5 = 120 = \dfrac{N(N+1)}{2}\) ✓
Hypothesis test:
\(H_0\): The median hours
reading per week is the same for Faculty A and Faculty B (\(\theta_A = \theta_B\))
\(H_1\): The medians differ (\(\theta_A \neq \theta_B\))
Test statistic: \(W = U_A = 35.5\)
p-value \(= 0.3440\)
Decision: Since p-value \(= 0.344 > 0.05\), we fail to reject \(H_0\).
Conclusion: There is no significant difference in weekly reading hours between Faculty A and Faculty B students.
##
## Spearman's rank correlation rho
##
## data: dt8$NumberHours and dt8$ReadingScore
## S = 453.74, p-value = 0.4982
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1897482
Answer Q(8.3):
Test choice: Association between two continuous variables, small sample (\(n = 15\)), normality not assumed. Use Spearman rank correlation.
Step 1 — Rank hours \(R(X)\) and scores \(R(Y)\) separately, then compute \(D_i = R(X_i) - R(Y_i)\)
| St. | Hours | Score | \(R(X)\) | \(R(Y)\) | \(D\) | \(D^2\) |
|---|---|---|---|---|---|---|
| 1 | 7.5 | 70 | 14.0 | 3.5 | +10.5 | 110.25 |
| 2 | 5.0 | 65 | 9.5 | 1.0 | +8.5 | 72.25 |
| 3 | 2.0 | 68 | 1.0 | 2.0 | −1.0 | 1.00 |
| 4 | 3.0 | 80 | 2.5 | 8.5 | −6.0 | 36.00 |
| 5 | 4.5 | 81 | 7.5 | 10.0 | −2.5 | 6.25 |
| 6 | 4.0 | 77 | 5.5 | 7.0 | −1.5 | 2.25 |
| 7 | 4.5 | 90 | 7.5 | 12.5 | −5.0 | 25.00 |
| 8 | 7.0 | 75 | 13.0 | 6.0 | +7.0 | 49.00 |
| 9 | 6.0 | 73 | 11.0 | 5.0 | +6.0 | 36.00 |
| 10 | 6.5 | 82 | 12.0 | 11.0 | +1.0 | 1.00 |
| 11 | 5.0 | 90 | 9.5 | 12.5 | −3.0 | 9.00 |
| 12 | 4.0 | 95 | 5.5 | 14.5 | −9.0 | 81.00 |
| 13 | 3.5 | 80 | 4.0 | 8.5 | −4.5 | 20.25 |
| 14 | 3.0 | 70 | 2.5 | 3.5 | −1.0 | 1.00 |
| 15 | 8.0 | 95 | 15.0 | 14.5 | +0.5 | 0.25 |
| Total | 450.50 |
Step 2 — Compute Spearman’s \(r_s\)
\[r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)} = 1 - \frac{6 \times 450.50}{15(225-1)} = 1 - \frac{2703}{3360} = 1 - 0.8045 = \mathbf{0.1955}\]
(R reports \(r_s = 0.1897\) using a more precise tie-correction.)
Hypothesis test:
\(H_0\): No monotone association
between reading hours and skill score (\(\rho_s = 0\))
\(H_1\): Significant monotone
association (\(\rho_s \neq
0\))
Test statistic: \(r_s = 0.1897\)
p-value \(= 0.4982\)
Decision: Since p-value \(= 0.498 > 0.05\), we fail to reject \(H_0\).
Conclusion: There is no significant monotone association between weekly reading hours and reading skill scores in this sample (\(r_s = 0.19\), weak and non-significant).
💡 Reflection
The absence of a significant result may reflect the small sample size (low power with \(n = 15\)), or that skill depends more on reading quality and prior ability than on time alone.
Researchers examined the effect of jumping on bone density in growing rats. Six different rats were randomly assigned to each of three groups: Control, Low Jump (30 cm), High Jump (60 cm). Bone densities (mg/cm³) were recorded after 8 weeks.
| Rat | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Control | 611 | 621 | 614 | 593 | 593 | 653 |
| Low Jump | 635 | 605 | 638 | 594 | 599 | 632 |
| High Jump | 650 | 622 | 626 | 626 | 631 | 622 |
dt9 <- data.frame(
Densities = c(611,621,614,593,593,653,
635,605,638,594,599,632,
650,622,626,626,631,622),
Group = factor(rep(c("Control","LowJump","HighJump"), each = 6),
levels = c("Control","LowJump","HighJump"))
)boxplot(Densities ~ Group, data = dt9,
col = c("lightgray","lightblue","lightcoral"),
main = "Bone Density by Treatment Group",
xlab = "Treatment", ylab = "Bone Density (mg/cm³)")❗ Choosing the correct test
Each of the 18 rats is a different animal measured exactly once. This is a completely randomised design with three independent groups — not a repeated-measures design.
##
## Kruskal-Wallis rank sum test
##
## data: Densities by Group
## Kruskal-Wallis chi-squared = 2.6398, df = 2, p-value = 0.2672
At \(\alpha = 0.05\), test whether jumping treatment affects bone density.
Answer Q9:
Test choice: Three independent groups (\(n_i = 6\) each, \(N = 18\)), small samples, normality unverifiable. Use the Kruskal-Wallis test.
Step 1 — Rank all \(N = 18\) bone densities together
| Density | Group | Rank |
|---|---|---|
| 593 | Control | 1.5 |
| 593 | Control | 1.5 |
| 594 | LowJump | 3.0 |
| 599 | LowJump | 4.0 |
| 605 | LowJump | 5.0 |
| 611 | Control | 6.0 |
| 614 | Control | 7.0 |
| 621 | Control | 8.0 |
| 622 | HighJump | 9.5 |
| 622 | HighJump | 9.5 |
| 626 | HighJump | 11.5 |
| 626 | HighJump | 11.5 |
| 631 | HighJump | 13.0 |
| 632 | LowJump | 14.0 |
| 635 | LowJump | 15.0 |
| 638 | LowJump | 16.0 |
| 650 | HighJump | 17.0 |
| 653 | Control | 18.0 |
Step 2 — Compute rank sums \(R_j\) per group
| Group | Individual ranks | \(R_j\) | \(n_j\) | \(R_j^2 / n_j\) |
|---|---|---|---|---|
| Control | 1.5, 1.5, 6.0, 7.0, 8.0, 18.0 | 42.0 | 6 | 294.00 |
| LowJump | 3.0, 4.0, 5.0, 14.0, 15.0, 16.0 | 57.0 | 6 | 541.50 |
| HighJump | 9.5, 9.5, 11.5, 11.5, 13.0, 17.0 | 72.0 | 6 | 864.00 |
Check: \(42.0 + 57.0 + 72.0 = 171 = \dfrac{18 \times 19}{2}\) ✓
Step 3 — Compute the Kruskal-Wallis statistic \(H\)
\[H = \frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j} - 3(N+1)\]
\[H = \frac{12}{18 \times 19}\left(294.00 + 541.50 + 864.00\right) - 3(19)\]
\[H = \frac{12}{342} \times 1699.50 - 57 = 0.03509 \times 1699.50 - 57 = 59.64 - 57 = \mathbf{2.640}\]
Hypothesis test:
\(H_0\): Bone density
distributions are identical across Control, Low Jump, and High Jump
groups
\(H_1\): At least one group has a
different bone density distribution
Test statistic: \(H = 2.640\) on \(k-1 = 2\) df
p-value \(= 0.267\)
Decision: Since p-value \(= 0.267 > \alpha = 0.05\), we fail to reject \(H_0\).
Conclusion: There is no significant evidence that jumping treatment affects bone density in rats at the 5% significance level. The differences in median bone density across the three groups are not large enough to be statistically significant with \(n = 6\) per group.
📝 Note on test choice
A Friedman test on these data (incorrectly treating rat index as a blocking factor) returns \(\chi^2 = 3.0\), p-value \(= 0.223\) — also non-significant, and produced by the wrong test for this design. Both tests agree there is no significant effect here, but the Kruskal-Wallis is the correct and valid choice.
Researchers compared cats and dogs from two shelters in Chiang Mai to examine factors affecting length of stay before adoption.
| Type | Sex | Age (mo.) | Weight (kg) | Days in Shelter |
|---|---|---|---|---|
| Cat | F | 15 | 9.80 | 9 |
| Cat | F | 2 | 10.50 | 28 |
| Cat | M | 0 | 4.50 | 28 |
| Cat | F | 4 | 10.80 | 22 |
| Cat | M | 0 | 4.30 | 18 |
| Cat | M | 3 | 9.00 | 30 |
| Cat | M | 3 | 10.00 | 16 |
| Dog | F | 10 | 3.30 | 6 |
| Dog | M | 0 | 22.00 | 7 |
| Dog | M | 3 | 54.00 | 19 |
| Dog | F | 6 | 9.50 | 14 |
| Dog | M | 1 | 16.50 | 11 |
| Dog | F | 1 | 46.00 | 12 |
| Dog | F | 3 | 48.25 | 20 |
| Dog | M | 3 | 55.30 | 16 |
dt10 <- data.frame(
Type = factor(c(rep("Cat",7), rep("Dog",8))),
Sex = c("F","F","M","F","M","M","M","F","M","M","F","M","F","F","M"),
Age = c(15,2,0,4,0,3,3,10,0,3,6,1,1,3,3),
Weight = c(9.8,10.5,4.5,10.8,4.3,9.0,10.0,3.3,22.0,54.0,9.5,16.5,46.0,48.25,55.3),
Days = c(9,28,28,22,18,30,16,6,7,19,14,11,12,20,16)
)par(mfrow = c(1, 2))
boxplot(Days ~ Type, data = dt10,
col = c("lightblue","lightyellow"),
main = "Days in Shelter by Type",
xlab = "Type", ylab = "Days in Shelter")
plot(dt10$Weight, dt10$Days,
pch = 19, col = as.integer(dt10$Type) + 1,
xlab = "Weight (kg)", ylab = "Days in Shelter",
main = "Weight vs. Days in Shelter")
legend("topright", legend = levels(dt10$Type),
col = c(2,3), pch = 19, bty = "n")wilcox.test(x = dt10$Days[dt10$Type == "Dog"],
y = dt10$Days[dt10$Type == "Cat"],
paired = FALSE, alternative = "two.sided")##
## Wilcoxon rank sum exact test
##
## data: dt10$Days[dt10$Type == "Dog"] and dt10$Days[dt10$Type == "Cat"]
## W = 10.5, p-value = 0.04258
## alternative hypothesis: true location shift is not equal to 0
Answer Q(10.1):
Test choice: Two independent groups (Dogs: \(n_D = 8\); Cats: \(n_C = 7\)), small samples, days-in-shelter is not normally distributed. Use the Mann-Whitney U test.
Step 1 — Rank all \(N = 15\) days values together
| Days | Type | Rank |
|---|---|---|
| 6 | Dog | 1.0 |
| 7 | Dog | 2.0 |
| 9 | Cat | 3.0 |
| 11 | Dog | 4.0 |
| 12 | Dog | 5.0 |
| 14 | Dog | 6.0 |
| 16 | Dog | 7.5 |
| 16 | Cat | 7.5 |
| 18 | Cat | 9.0 |
| 19 | Dog | 10.0 |
| 20 | Dog | 11.0 |
| 22 | Cat | 12.0 |
| 28 | Cat | 13.5 |
| 28 | Cat | 13.5 |
| 30 | Cat | 15.0 |
Step 2 — Compute rank sum \(S\) for Dogs and test statistic \(U\)
\[S_D = 1.0 + 2.0 + 4.0 + 5.0 + 6.0 + 7.5 + 10.0 + 11.0 = \mathbf{46.5}\]
\[U_D = S_D - \frac{n_D(n_D+1)}{2} = 46.5 - \frac{8 \times 9}{2} = 46.5 - 36 = \mathbf{10.5}\]
Verification — rank sum for Cats: \[S_C = 3.0 + 7.5 + 9.0 + 12.0 + 13.5 + 13.5 + 15.0 = 73.5\] \[U_C = S_C - \frac{n_C(n_C+1)}{2} = 73.5 - \frac{7 \times 8}{2} = 73.5 - 28 = 45.5\] \[U_D + U_C = 10.5 + 45.5 = 56 = n_D \times n_C = 8 \times 7 \checkmark\]
Hypothesis test:
\(H_0\): The median days in
shelter is the same for dogs and cats (\(\theta_D = \theta_C\))
\(H_1\): The medians differ (\(\theta_D \neq \theta_C\))
Test statistic: \(W = U_D = 10.5\)
p-value \(= 0.0487\)
Decision: Since p-value \(= 0.049 < \alpha = 0.05\), we reject \(H_0\).
Conclusion: There is significant evidence that dogs and cats spend different lengths of time in the shelter (\(U = 10.5\), p \(= 0.049\)). Cats tend to stay longer (median 22 days) than dogs (median 13.5 days).
##
## Spearman's rank correlation rho
##
## data: Days and Weight
## S = 573.02, p-value = 0.9344
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.02325585
Answer Q(10.2):
Test choice: Association between two continuous variables (\(n = 15\)), weight has extreme outliers (cats ≤ 11 kg, dogs up to 55 kg). Spearman rank correlation is robust to outliers.
Step 1 — Rank weight \(R(W)\) and days \(R(D)\) separately, compute \(D_i = R(W_i) - R(D_i)\)
| Animal | Weight | Days | \(R(W)\) | \(R(D)\) | \(D\) | \(D^2\) |
|---|---|---|---|---|---|---|
| Cat 1 | 9.80 | 9 | 6.0 | 3.0 | +3.0 | 9.00 |
| Cat 2 | 10.50 | 28 | 8.0 | 13.5 | −5.5 | 30.25 |
| Cat 3 | 4.50 | 28 | 3.0 | 13.5 | −10.5 | 110.25 |
| Cat 4 | 10.80 | 22 | 9.0 | 12.0 | −3.0 | 9.00 |
| Cat 5 | 4.30 | 18 | 2.0 | 9.0 | −7.0 | 49.00 |
| Cat 6 | 9.00 | 30 | 4.0 | 15.0 | −11.0 | 121.00 |
| Cat 7 | 10.00 | 16 | 7.0 | 7.5 | −0.5 | 0.25 |
| Dog 1 | 3.30 | 6 | 1.0 | 1.0 | 0.0 | 0.00 |
| Dog 2 | 22.00 | 7 | 11.0 | 2.0 | +9.0 | 81.00 |
| Dog 3 | 54.00 | 19 | 14.0 | 10.0 | +4.0 | 16.00 |
| Dog 4 | 9.50 | 14 | 5.0 | 6.0 | −1.0 | 1.00 |
| Dog 5 | 16.50 | 11 | 10.0 | 4.0 | +6.0 | 36.00 |
| Dog 6 | 46.00 | 12 | 12.0 | 5.0 | +7.0 | 49.00 |
| Dog 7 | 48.25 | 20 | 13.0 | 11.0 | +2.0 | 4.00 |
| Dog 8 | 55.30 | 16 | 15.0 | 7.5 | +7.5 | 56.25 |
| Total | 572.00 |
Step 2 — Compute Spearman’s \(r_s\)
\[r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)} = 1 - \frac{6 \times 572}{15(225-1)} = 1 - \frac{3432}{3360} = 1 - 1.0214 = \mathbf{-0.021}\]
(R reports \(r_s = -0.023\) with tie correction.)
Hypothesis test:
\(H_0\): No monotone association
between weight and days in shelter (\(\rho_s =
0\))
\(H_1\): Significant monotone
association (\(\rho_s \neq
0\))
Test statistic: \(r_s = -0.023\)
p-value \(= 0.9344\)
Decision: Since p-value \(= 0.934 > 0.05\), we fail to reject \(H_0\).
Conclusion: There is no significant monotone association between an animal’s weight and the number of days it spends in the shelter (\(r_s \approx 0\)). Weight alone does not predict adoption waiting time.
💡 Additional insight
The near-zero correlation may partly reflect confounding by animal type. Cats are light (< 11 kg) and tend to stay longer; dogs are heavier but leave sooner. Within each type, weight-shelter relationships might differ.
cats <- dt10[dt10$Type == "Cat", ]
theta0 <- 20
T.pos <- sum(cats$Days > theta0)
T.neg <- sum(cats$Days < theta0)
T.zero <- sum(cats$Days == theta0)
n_eff <- T.pos + T.neg
cat("Cat days:", cats$Days, "\n")## Cat days: 9 28 28 22 18 30 16
## T+ = 4 T- = 3 Ties = 0 n = 7
##
## Exact binomial test
##
## data: T.pos and n_eff
## number of successes = 4, number of trials = 7, p-value = 0.5
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.2253216 1.0000000
## sample estimates:
## probability of success
## 0.5714286
Answer Q(10.3):
Test choice: Single sample of cats (\(n = 7\)), testing a claim about the median. No symmetry assumption needed. Use the sign test.
Step 1 — Assign signs (\(\theta_0 = 20\) days; cats only)
| Cat | Days | \(d_i = \text{Days} - 20\) | Sign |
|---|---|---|---|
| 1 | 9 | −11 | − |
| 2 | 28 | +8 | + |
| 3 | 28 | +8 | + |
| 4 | 22 | +2 | + |
| 5 | 18 | −2 | − |
| 6 | 30 | +10 | + |
| 7 | 16 | −4 | − |
\(T^+ = 4\), \(T^- = 3\), no ties. Effective \(n = 7\).
Step 2 — Compute the test statistic
Under \(H_0: \theta \leq 20\), each observation is equally likely to be above or below 20, so:
\[T^+ \sim \text{Binomial}(n = 7,\; p = 0.5)\]
The p-value for a one-sided test (\(H_1: \theta > 20\)) is the probability of observing \(T^+ \geq 4\):
\[p\text{-value} = P(T^+ \geq 4) = \sum_{x=4}^{7}\binom{7}{x}(0.5)^7\]
\[= \left[\binom{7}{4} + \binom{7}{5} + \binom{7}{6} + \binom{7}{7}\right](0.5)^7 = (35 + 21 + 7 + 1) \times \frac{1}{128} = \frac{64}{128} = \mathbf{0.5000}\]
Hypothesis test:
\(H_0\): The median days cats
stay in the shelter \(\leq 20\) (\(\theta \leq 20\))
\(H_1\): The median days exceeds 20
(\(\theta > 20\))
Test statistic: \(T^+ = 4\) under \(\text{Binomial}(7, 0.5)\)
p-value \(= P(T^+ \geq 4 \mid n=7, p=0.5) = 0.5000\)
Decision: Since p-value \(= 0.500 > \alpha = 0.05\), we fail to reject \(H_0\).
Conclusion: There is insufficient evidence to conclude that the median number of days cats spend in the shelter exceeds 20 days. With only 4 of 7 cats staying more than 20 days and the sign split nearly even (4+ vs. 3−), the data are consistent with a median around 20 days. The small sample size (\(n = 7\)) gives this test very low power.