Every dataset used in these labs is already built into R (or generated by a short line of code).
How to use these lab materials
- Read the explanation text before running each code chunk.
- Pay attention to interpretation notes — these help you understand what the output means statistically.
- In each Practice section you will see worked examples; the Assignment section is for you to complete independently.
- Two datasets recur across several labs so you can build familiarity with them:
trees(Case Study A) andmtcars(Case Study B), both built into base R. A few others (attitude, and one simulated dataset) appear once, in the assignments.
Objectives: Students are able to use R to:
Key concepts: The Pearson correlation coefficient \(r\) measures the strength and direction of the linear relationship between two quantitative variables \(X\) and \(Y\). It ranges from −1 to +1:
\[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i-\bar{x})^2 \; \sum(y_i-\bar{y})^2}}\]
| Value of \(|r|\) | Interpretation |
|---|---|
| 0.00 – 0.19 | Very weak |
| 0.20 – 0.39 | Weak |
| 0.40 – 0.59 | Moderate |
| 0.60 – 0.79 | Strong |
| 0.80 – 1.00 | Very strong |
The sign of \(r\) indicates direction: positive means both variables increase together; negative means one increases as the other decreases.
Important: Correlation measures linear association only. A correlation of 0 does not necessarily mean there is no relationship — only that there is no linear relationship.
trees)R comes with a built-in dataset called trees: 31 felled
black cherry trees, each with measured girth (trunk
diameter), height, and volume of
usable timber. No import is needed — trees is available the
moment R starts.
Variables:
Girth(trunk diameter, inches) andVolume(timber volume, cubic feet). Research question: Is there a significant linear correlation between trunk girth and timber volume?
Step 1 — Load and inspect the data
# trees is a built-in dataset - no download or setwd() required
data(trees)
head(trees) # first 6 rows## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
Note:
str()tells you how many observations (rows) and variables (columns) are in your data, and what type each variable is (numeric, character, factor, etc.).
Step 2 — Descriptive statistics
x <- trees$Girth # independent variable (predictor)
y <- trees$Volume # dependent variable (response)
# Summary statistics
summary(x)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.30 11.05 12.90 13.25 15.25 20.60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.20 19.40 24.20 30.17 37.30 77.00
## [1] 3.138139
## [1] 16.43785
Step 3 — Create a scatterplot
A scatterplot is always the first step — it reveals the form, direction, and strength of the relationship before any numerical summary.
plot(x, y,
xlab = "Girth (in.)",
ylab = "Volume (cu. ft.)",
main = "Scatterplot: Volume vs. Girth",
pch = 16, # filled circle points
col = "steelblue")What to look for in the scatterplot:
- Form: Is the pattern linear (straight) or curved?
- Direction: Positive (both increase together) or negative (one increases as the other decreases)?
- Strength: Are the points tightly clustered around a line (strong) or widely scattered (weak)?
- Outliers: Are there any unusual observations far from the general pattern?
Step 4 — Compute the correlation coefficient
## [1] 0.9671194
Interpretation: Report the value of \(r\) and describe its sign and magnitude using the reference table above. Running the code above gives \(r \approx 0.967\): “The sample correlation coefficient is \(r = 0.967\), indicating a very strong positive linear relationship between girth and volume.”
Step 5 — Hypothesis test for the population correlation
We test whether there is a statistically significant linear correlation in the population:
\[H_0: \rho = 0 \quad \text{(no linear correlation)} \qquad \text{vs.} \qquad H_a: \rho \neq 0\]
The test statistic is:
\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \text{ under } H_0\]
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 20.478, df = 29, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9322519 0.9841887
## sample estimates:
## cor
## 0.9671194
Interpretation of
cor.test()output:
Output field What it means corSample correlation coefficient \(r\) tTest statistic value dfDegrees of freedom (\(n - 2\)) p-valueReject \(H_0\) if p-value \(< \alpha = 0.05\) 95 percent confidence interval95% CI for the population correlation \(\rho\) 4-step conclusion template: 1. \(H_0: \rho = 0\) vs. \(H_a: \rho \neq 0\) 2. \(t \approx 20.5\), \(df = 29\) 3. \(p\text{-value} < 0.001\) 4. Since p-value \(< 0.05\), we reject \(H_0\). There is sufficient evidence of a significant linear correlation between girth and volume.
mtcars)R’s built-in mtcars dataset records fuel consumption and
10 design/performance variables for 32 cars (1973–74 models). It will be
used as a running case study across several labs, so
it’s worth getting to know well now.
Variables:
wt(weight, 1000 lbs),mpg(miles per gallon),am(transmission: 0 = automatic, 1 = manual),cyl(number of cylinders: 4, 6, or 8),gear(number of forward gears: 3, 4, or 5). Research question: Is there a significant linear correlation between a car’s weight and its fuel economy?
Step 1 — Load the data
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Step 2 — Explore the data
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Frequency table for transmission type (am: 0 = automatic, 1 = manual)
table1 <- table(mtcars$am)
table1##
## 0 1
## 19 13
##
## 0 1
## 0.59375 0.40625
# Two-way frequency table: number of cylinders by transmission type
table2 <- table(mtcars$cyl, mtcars$am)
table2##
## 0 1
## 4 3 8
## 6 4 3
## 8 12 2
##
## 0 1
## 4 0.09375 0.25000
## 6 0.12500 0.09375
## 8 0.37500 0.06250
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.513 2.581 3.325 3.217 3.610 5.424
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
# Boxplots
par(mfrow = c(1, 2))
boxplot(mtcars$mpg,
main = "Distribution of MPG", ylab = "Miles per Gallon")
boxplot(mtcars$mpg ~ mtcars$am,
xlab = "Transmission (0 = auto, 1 = manual)", ylab = "MPG",
main = "MPG by Transmission",
col = c("lightpink", "lightblue"))par(mfrow = c(1, 1))
boxplot(mtcars$mpg ~ mtcars$gear,
xlab = "Number of Gears", ylab = "MPG",
main = "MPG by Number of Gears")Interpretation: Boxplots let you compare distributions across groups. Look for differences in medians, spread (IQR), and the presence of outliers.
Step 3 — Scatterplot
x2 <- mtcars$wt
y2 <- mtcars$mpg
plot(x2, y2,
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
main = "Scatterplot: MPG vs. Weight",
pch = 16, col = "steelblue")Step 4 — Correlation coefficient and hypothesis test
## [1] -0.8676594
# Hypothesis test: H0: rho = 0 vs Ha: rho ≠ 0
cor.test(x2, y2, alternative = "two.sided", conf.level = 0.95)##
## Pearson's product-moment correlation
##
## data: x2 and y2
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
Running the code gives \(r \approx -0.868\) (\(p < 0.001\)) — a very strong negative linear relationship: heavier cars tend to have lower fuel economy.
attitude)R’s built-in attitude dataset comes from a survey of
clerical employees at a large financial organisation: 30 employees rated
their department (0–100 scale) on several dimensions, including overall
performance rating and perceived opportunity for
learning.
Variables:
learning(perceived learning opportunity score, 0–100) andrating(overall performance rating, 0–100).
Complete all questions using R. Show your R code and output, and write your interpretation in full sentences.
## 'data.frame': 30 obs. of 7 variables:
## $ rating : num 43 63 71 61 81 43 58 71 72 67 ...
## $ complaints: num 51 64 70 63 78 55 67 75 82 61 ...
## $ privileges: num 30 51 68 45 56 49 42 50 72 45 ...
## $ learning : num 39 54 69 47 66 44 56 55 67 47 ...
## $ raises : num 61 63 76 54 71 54 66 70 71 62 ...
## $ critical : num 92 73 86 84 83 49 68 66 83 80 ...
## $ advance : num 45 47 48 35 47 34 35 41 31 41 ...
Perform descriptive statistics for both learning and
rating (mean, SD, min, max). Summarise the
results.
Make a scatterplot of learning (\(x\)) vs. rating (\(y\)). Describe the form, direction,
strength, and any outliers.
Find the sample correlation coefficient \(r\) between learning and
rating. Interpret its value.
Test whether the population correlation is different from zero at \(\alpha = 0.05\). Use the 4-step procedure: (i) state hypotheses, (ii) give the test statistic and degrees of freedom, (iii) give the p-value, (iv) state your conclusion in context.
Find the 95% confidence interval for the population correlation \(\rho\) and interpret it.
Objectives: Students are able to use R to:
Key concepts: The simple linear regression model expresses the relationship between a response variable \(Y\) and a single predictor \(X\) as:
\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]
where:
The least-squares estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) minimise \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\).
We continue with the same trees data used in Lab 1.
Variables:
Girth(in.) as \(X\),Volume(cu. ft.) as \(Y\). Research question: Can we predict timber volume from trunk girth using a linear model?
Step 1 — Load data
Step 2 — Fit the regression model
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## x 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) -43.825953 -30.060965
## x 4.559914 5.571799
Key output from
summary(fit):
Output field What it means Estimate(Intercept)\(\hat{\beta}_0\): predicted \(Y\) when \(X = 0\) Estimate(x)\(\hat{\beta}_1\): change in \(\hat{Y}\) for each 1-unit increase in \(X\) Std. ErrorStandard error of the coefficient estimate t valueTest statistic for \(H_0: \beta_j = 0\) Pr(>|t|)p-value; reject \(H_0\) if \(< 0.05\) Multiple R-squared\(r^2\): proportion of variance in \(Y\) explained by \(X\) F-statisticOverall model F-test (equivalent to t-test in simple regression) Residual standard error\(\hat{\sigma} = \sqrt{MS_{res}}\): estimated SD of errors How to write the fitted equation: Find the
Estimatecolumn in the output and substitute: \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\] Here, \(\hat{\beta}_0 \approx -36.94\) and \(\hat{\beta}_1 \approx 5.07\), so \(\hat{y} = -36.94 + 5.07x\).
Step 3 — Plot the fitted regression line
plot(x, y,
xlab = "Girth (in.)",
ylab = "Volume (cu. ft.)",
main = "Regression Line: Volume vs. Girth",
pch = 16, col = "steelblue")
abline(fit, col = "red", lwd = 2) # overlay regression lineStep 4 — Residual analysis
Residuals are defined as \(e_i = y_i - \hat{y}_i\). Before trusting regression inference, we must verify three assumptions:
res <- resid(fit)
par(mfrow = c(1, 3))
# 1. Residuals vs. Fitted — check linearity and constant variance
plot(fitted(fit), res,
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs. Fitted",
pch = 16, col = "steelblue")
abline(h = 0, lty = 2, col = "red")
# 2. Histogram — check approximate normality
hist(res, main = "Histogram of Residuals",
xlab = "Residuals", col = "lightblue", border = "white")
# 3. Normal Q-Q plot — check normality more carefully
qqnorm(res, main = "Normal Q-Q Plot", pch = 16)
qqline(res, col = "red")What to look for:
- Residuals vs. Fitted: points should scatter randomly around the zero line. A curve suggests non-linearity; a fan shape suggests non-constant variance.
- Histogram: should be roughly bell-shaped and symmetric around zero.
- Q-Q plot: points should fall close to the red diagonal line. Systematic deviations indicate non-normality.
Note: With girth and volume, a slight curve is visible in the residuals vs. fitted plot — a cylinder is only an approximation for a tree trunk’s shape, and real trees taper. This is a nice discussion point: sometimes real data hint that a straight line isn’t the perfect model.
Step 5 — Prediction for a new value of x
Two types of intervals are available:
The PI is always wider than the CI because it includes both estimation uncertainty and individual variability.
## 1
## 28.91267
# 95% Confidence interval for E(Y | x = 13)
predict(fit, newdata = new, interval = "confidence", level = 0.95)## fit lwr upr
## 1 28.91267 27.34573 30.47962
# 95% Prediction interval for a new individual at x = 13
predict(fit, newdata = new, interval = "prediction", level = 0.95)## fit lwr upr
## 1 28.91267 20.07634 37.74901
Output columns:
fit= point estimate,lwr= lower bound,upr= upper bound.
Step 6 — Plot fitted line with CI and PI bands
new_range <- data.frame(x = sort(x)) # sorted x values for smooth curves
pred.CI <- predict(fit, newdata = new_range, interval = "confidence")
pred.PI <- predict(fit, newdata = new_range, interval = "prediction")
plot(x, y,
xlab = "Girth (in.)", ylab = "Volume (cu. ft.)",
main = "Fitted Line with 95% CI and PI Bands",
pch = 16, col = "steelblue")
# Fitted regression line
lines(new_range$x, pred.CI[, "fit"], col = "black", lwd = 2)
# 95% Confidence interval (red dashed)
lines(new_range$x, pred.CI[, "lwr"], col = "red", lwd = 1.5, lty = 2)
lines(new_range$x, pred.CI[, "upr"], col = "red", lwd = 1.5, lty = 2)
# 95% Prediction interval (blue dotted)
lines(new_range$x, pred.PI[, "lwr"], col = "blue", lwd = 1.5, lty = 3)
lines(new_range$x, pred.PI[, "upr"], col = "blue", lwd = 1.5, lty = 3)
legend("topleft",
legend = c("Fitted line", "95% CI", "95% PI"),
col = c("black", "red", "blue"),
lty = c(1, 2, 3), lwd = 2, bty = "n")Why is the PI always wider than the CI? The CI captures only the uncertainty in estimating the mean of \(Y\), while the PI must additionally account for the natural scatter of individual observations around that mean. The gap between CI and PI bands is largest near the extremes of \(x\) and narrowest near \(\bar{x}\).
We continue with the same mtcars data used in Lab 1
Practice II.
Variables:
wt(1000 lbs) as \(X\),mpgas \(Y\). Research question: Can we predict fuel economy from weight using a linear model?
Step 1 — Load data
Step 2 — Fit the model
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## x2 -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
## 2.5 % 97.5 %
## (Intercept) 33.450500 41.119753
## x2 -6.486308 -4.202635
Step 3 — Plot fitted line
plot(x2, y2,
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
main = "Regression Line: MPG vs. Weight",
pch = 16, col = "steelblue")
abline(fit2, col = "red", lwd = 2)Step 4 — Residual analysis
res2 <- resid(fit2)
par(mfrow = c(1, 3))
plot(fitted(fit2), res2,
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs. Fitted", pch = 16, col = "steelblue")
abline(h = 0, lty = 2, col = "red")
hist(res2, main = "Histogram of Residuals",
xlab = "Residuals", col = "lightblue", border = "white")
qqnorm(res2, main = "Normal Q-Q Plot", pch = 16)
qqline(res2, col = "red")Step 5 — CI and PI bands
new2 <- data.frame(x2 = sort(x2))
pred.CI2 <- predict(fit2, newdata = new2, interval = "confidence")
pred.PI2 <- predict(fit2, newdata = new2, interval = "prediction")
plot(x2, y2,
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
main = "Fitted Line with 95% CI and PI Bands",
pch = 16, col = "steelblue",
ylim = range(c(pred.PI2[, "lwr"], pred.PI2[, "upr"])))
lines(new2$x2, pred.CI2[, "fit"], col = "black", lwd = 2)
lines(new2$x2, pred.CI2[, "lwr"], col = "red", lty = 2, lwd = 1.5)
lines(new2$x2, pred.CI2[, "upr"], col = "red", lty = 2, lwd = 1.5)
lines(new2$x2, pred.PI2[, "lwr"], col = "blue", lty = 3, lwd = 1.5)
lines(new2$x2, pred.PI2[, "upr"], col = "blue", lty = 3, lwd = 1.5)
legend("topleft",
legend = c("Fitted line", "95% CI", "95% PI"),
col = c("black", "red", "blue"),
lty = c(1, 2, 3), lwd = 2, bty = "n")We continue with the attitude data from Lab 1. Now we
extend the analysis from correlation to regression.
Obtain the least-squares regression line for predicting
rating (\(Y\)) from
learning (\(X\)). Write
the fitted equation.
Interpret \(\hat{\beta}_0\) and \(\hat{\beta}_1\) in the context of this problem.
Find \(MS_{res}\) (= residual standard error squared) and state its units.
Test whether learning is a significant linear
predictor of rating at \(\alpha =
0.05\). Use the 4-step procedure.
Find the 95% confidence intervals for \(\beta_0\) and \(\beta_1\).
Find \(r^2\) and interpret its value in context.
Check the regression assumptions using residual plots (residuals vs. fitted, histogram, Q-Q plot). Are the assumptions satisfied?
Find the 95% confidence interval and 95% prediction interval for the rating of a department with a learning score of 60. Explain the difference between the two intervals.
Plot the fitted regression line along with its 95% CI and PI bands for all values of \(x\).
Objectives: Students are able to use R to:
Key concepts: The multiple linear regression model with \(p\) predictors is:
\[Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i\]
We estimate the coefficients by least squares and evaluate the model with:
We continue with mtcars. Rather than predicting
mpg from weight alone, we now bring in several quantitative
predictors at once:
| Variable | Description | Type |
|---|---|---|
disp |
Engine displacement (cu. in.) | Quantitative |
hp |
Gross horsepower | Quantitative |
drat |
Rear axle ratio | Quantitative |
wt |
Weight (1000 lbs) | Quantitative |
qsec |
Quarter-mile time (seconds) | Quantitative |
mpg |
Miles per gallon | Response |
Step 1 — Load data
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Step 2 — Descriptive statistics and correlation matrix
Since all variables here are quantitative, we examine pairwise scatterplots and correlations.
## mpg disp hp drat wt qsec
## mpg 1.0000000 -0.8475514 -0.7761684 0.68117191 -0.8676594 0.41868403
## disp -0.8475514 1.0000000 0.7909486 -0.71021393 0.8879799 -0.43369788
## hp -0.7761684 0.7909486 1.0000000 -0.44875912 0.6587479 -0.70822339
## drat 0.6811719 -0.7102139 -0.4487591 1.00000000 -0.7124406 0.09120476
## wt -0.8676594 0.8879799 0.6587479 -0.71244065 1.0000000 -0.17471588
## qsec 0.4186840 -0.4336979 -0.7082234 0.09120476 -0.1747159 1.00000000
# Pairwise scatterplot matrix
pairs(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")],
main = "Scatterplot Matrix for mtcars",
pch = 16, col = "steelblue")Interpretation: Look at the first row (or column) to see which predictors appear most linearly related to
mpg. Also check for high correlations between predictors (e.g.disp,hp, andwtare all strongly related to each other) — this is called multicollinearity and can cause instability in coefficient estimates.
Step 3 — Full model (all five predictors)
##
## Call:
## lm(formula = mpg ~ disp + hp + drat + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5404 -1.6701 -0.4264 1.1320 5.4996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.53357 10.96423 1.508 0.14362
## disp 0.00872 0.01119 0.779 0.44281
## hp -0.02060 0.01528 -1.348 0.18936
## drat 2.01578 1.30946 1.539 0.13579
## wt -4.38546 1.24343 -3.527 0.00158 **
## qsec 0.64015 0.45934 1.394 0.17523
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.558 on 26 degrees of freedom
## Multiple R-squared: 0.8489, Adjusted R-squared: 0.8199
## F-statistic: 29.22 on 5 and 26 DF, p-value: 6.892e-10
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## disp 1 808.89 808.89 123.6185 2.23e-11 ***
## hp 1 33.67 33.67 5.1449 0.031854 *
## drat 1 30.15 30.15 4.6073 0.041340 *
## wt 1 70.51 70.51 10.7754 0.002933 **
## qsec 1 12.71 12.71 1.9422 0.175233
## Residuals 26 170.13 6.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Key outputs:
anova(full)gives the sequential (Type I) ANOVA table, showing the reduction in \(SS_{res}\) as each predictor is added.- The overall F-test p-value tests whether at least one \(\beta_j \neq 0\). Here it is highly significant (\(R^2 \approx 0.85\)), even though several individual predictors are not.
- Some individual predictors may be non-significant even if the overall model is — this is common with correlated predictors (see the multicollinearity note above).
Step 4 — Variable selection
With many predictors we use automatic selection to find the best subset. All three methods use AIC (Akaike Information Criterion) — lower AIC = better model.
null <- lm(mpg ~ 1, data = mtcars) # intercept-only (no predictors)
full <- lm(mpg ~ disp + hp + drat + wt + qsec, data = mtcars)
# Forward selection: starts empty, adds one predictor at a time
fw.fit <- step(null,
scope = list(lower = null, upper = full),
direction = "forward",
trace = 1)## Start: AIC=115.94
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + wt 1 847.73 278.32 73.217
## + disp 1 808.89 317.16 77.397
## + hp 1 678.37 447.67 88.427
## + drat 1 522.48 603.57 97.988
## + qsec 1 197.39 928.66 111.776
## <none> 1126.05 115.943
##
## Step: AIC=73.22
## mpg ~ wt
##
## Df Sum of Sq RSS AIC
## + hp 1 83.274 195.05 63.840
## + qsec 1 82.858 195.46 63.908
## + disp 1 31.639 246.68 71.356
## <none> 278.32 73.217
## + drat 1 9.081 269.24 74.156
##
## Step: AIC=63.84
## mpg ~ wt + hp
##
## Df Sum of Sq RSS AIC
## <none> 195.05 63.840
## + drat 1 11.3659 183.68 63.919
## + qsec 1 8.9885 186.06 64.331
## + disp 1 0.0571 194.99 65.831
# Backward elimination: starts full, removes one predictor at a time
be.fit <- step(full, direction = "backward", trace = 1)## Start: AIC=65.47
## mpg ~ disp + hp + drat + wt + qsec
##
## Df Sum of Sq RSS AIC
## - disp 1 3.974 174.10 64.205
## <none> 170.13 65.466
## - hp 1 11.886 182.01 65.627
## - qsec 1 12.708 182.84 65.772
## - drat 1 15.506 185.63 66.258
## - wt 1 81.394 251.52 75.978
##
## Step: AIC=64.21
## mpg ~ hp + drat + wt + qsec
##
## Df Sum of Sq RSS AIC
## - hp 1 9.418 183.52 63.891
## - qsec 1 9.578 183.68 63.919
## <none> 174.10 64.205
## - drat 1 11.956 186.06 64.331
## - wt 1 113.882 287.99 78.310
##
## Step: AIC=63.89
## mpg ~ drat + wt + qsec
##
## Df Sum of Sq RSS AIC
## <none> 183.52 63.891
## - drat 1 11.942 195.46 63.908
## - qsec 1 85.720 269.24 74.156
## - wt 1 275.686 459.21 91.241
# Stepwise regression: both directions (generally preferred)
sw.fit <- step(full, direction = "both", trace = 1)## Start: AIC=65.47
## mpg ~ disp + hp + drat + wt + qsec
##
## Df Sum of Sq RSS AIC
## - disp 1 3.974 174.10 64.205
## <none> 170.13 65.466
## - hp 1 11.886 182.01 65.627
## - qsec 1 12.708 182.84 65.772
## - drat 1 15.506 185.63 66.258
## - wt 1 81.394 251.52 75.978
##
## Step: AIC=64.21
## mpg ~ hp + drat + wt + qsec
##
## Df Sum of Sq RSS AIC
## - hp 1 9.418 183.52 63.891
## - qsec 1 9.578 183.68 63.919
## <none> 174.10 64.205
## - drat 1 11.956 186.06 64.331
## + disp 1 3.974 170.13 65.466
## - wt 1 113.882 287.99 78.310
##
## Step: AIC=63.89
## mpg ~ drat + wt + qsec
##
## Df Sum of Sq RSS AIC
## <none> 183.52 63.891
## - drat 1 11.942 195.46 63.908
## + hp 1 9.418 174.10 64.205
## + disp 1 1.506 182.02 65.627
## - qsec 1 85.720 269.24 74.156
## - wt 1 275.686 459.21 91.241
Comparing the three methods: They do not always agree. In this dataset, forward selection tends to settle on
{wt, hp}, while backward elimination and stepwise (both) agree on{drat, wt, qsec}. Stepwise regression is generally preferred as it reconsiders variables at each step.
Step 5 — Fitted model from stepwise regression
##
## Call:
## lm(formula = mpg ~ drat + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1152 -1.8273 -0.2696 1.0502 5.5010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3945 8.0689 1.412 0.16892
## drat 1.6561 1.2269 1.350 0.18789
## wt -4.3978 0.6781 -6.485 5.01e-07 ***
## qsec 0.9462 0.2616 3.616 0.00116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.56 on 28 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.8196
## F-statistic: 47.93 on 3 and 28 DF, p-value: 3.723e-11
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## drat 1 522.48 522.48 79.715 1.105e-09 ***
## wt 1 334.33 334.33 51.008 9.009e-08 ***
## qsec 1 85.72 85.72 13.078 0.001163 **
## Residuals 28 183.52 6.55
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2.5 % 97.5 %
## (Intercept) -5.1339149 27.922817
## drat -0.8571278 4.169418
## wt -5.7868161 -3.008776
## qsec 0.4102527 1.482155
How to write the fitted equation: If stepwise selected
drat,wt, andqsec: \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \, \text{drat} + \hat{\beta}_2 \, \text{wt} + \hat{\beta}_3 \, \text{qsec}\] Substitute theEstimatevalues fromsummary()output (approximately \(\hat{y} = 11.39 + 1.66\,\text{drat} - 4.40\,\text{wt} + 0.95\,\text{qsec}\)).
Using the results from the Practice section above, answer the following:
Which predictors were selected by each method (forward, backward, stepwise)? Do they agree?
Write the fitted regression equation from the stepwise model.
Test the overall fit at \(\alpha = 0.05\) using the 4-step hypothesis testing procedure.
Interpret each significant regression coefficient in the context of the problem.
Find the 95% CI for each coefficient in the selected model.
Find \(R^2\) and adjusted \(R^2\). What do they tell you?
Objectives: Students are able to use R to:
Key concepts — Dummy variables: When a predictor is qualitative (categorical), we cannot enter it directly into a regression model as a number. Instead, we create \(k - 1\) dummy (indicator) variables for a variable with \(k\) groups. The omitted group is called the reference category.
Each dummy variable’s coefficient \(\hat{\beta}_j\) represents the estimated difference in the mean response between that group and the reference group, holding all other predictors constant.
Why \(k - 1\) dummies, not \(k\)? Using \(k\) dummies would create perfect multicollinearity (the dummy columns would always sum to 1, which equals the intercept column). Dropping one group — the reference — avoids this problem. The intercept then represents the mean response for the reference group.
General dummy coding scheme for a variable with \(k = 3\) groups (A = reference, B, C):
| Group | \(D_B\) | \(D_C\) |
|---|---|---|
| A (reference) | 0 | 0 |
| B | 1 | 0 |
| C | 0 | 1 |
The model becomes: \[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 D_B + \hat{\beta}_3 D_C\]
This gives separate parallel regression lines for each group:
| Group | Fitted equation | Intercept |
|---|---|---|
| A (reference) | \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1\) | \(\hat{\beta}_0\) |
| B | \(\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 X_1\) | \(\hat{\beta}_0 + \hat{\beta}_2\) |
| C | \(\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_3) + \hat{\beta}_1 X_1\) | \(\hat{\beta}_0 + \hat{\beta}_3\) |
All three lines share the same slope \(\hat{\beta}_1\) but have different intercepts.
Research question: Can fuel economy be predicted from weight and number of cylinders?
Variables:
| Variable | Name in data | Type |
|---|---|---|
| Fuel economy (mpg) | mpg |
Response (\(Y\)) |
| Weight (1000 lbs) | wt |
Quantitative predictor (\(X_1\)) |
| Number of cylinders | cyl |
Qualitative predictor (4 / 6 / 8) |
Step 1 — Load data
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Step 2 — Descriptive statistics
Before fitting, explore the relationship between each predictor and the response separately.
y <- mtcars$mpg
x1 <- mtcars$wt # quantitative predictor
x2 <- factor(mtcars$cyl) # qualitative predictor (4, 6, or 8 cylinders)
# Summaries
summary(y)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.513 2.581 3.325 3.217 3.610 5.424
## x2
## 4 6 8
## 11 7 14
## [1] -0.8676594
# Scatterplot coloured by cylinder group
plot(x1, y,
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
pch = 16, col = as.integer(x2),
main = "MPG vs. Weight, by Cylinders")
legend("topright",
legend = levels(x2),
col = 1:length(levels(x2)),
pch = 16, bty = "n")# Boxplot: mpg by cylinder group
boxplot(y ~ x2,
xlab = "Number of Cylinders", ylab = "Miles per Gallon",
main = "MPG by Cylinders",
col = c("lightblue", "lightgreen", "lightyellow"))What to look for:
- Do the groups (4, 6, 8 cylinders) appear to have different average fuel economy in the boxplot? If yes, cylinder count is likely a useful predictor.
- In the scatterplot, do the three groups of points appear to sit at different vertical levels? If so, they may have different intercepts.
- Is the relationship between weight and mpg roughly linear within each group?
# Alternative using ggplot2 (install once: install.packages("ggplot2"))
library(ggplot2)
# Scatterplot coloured by cylinder group
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
geom_point(size = 2) +
labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
colour = "Cylinders",
title = "MPG vs. Weight, by Cylinders") +
theme_minimal()Step 3 — Create dummy variables
With \(k = 3\) groups (4, 6, 8 cylinders), we need \(k - 1 = 2\) dummy variables. We choose 4 cylinders as the reference category.
x2.dummy <- ifelse(mtcars$cyl == 6, 1, 0) # D_Six
x3.dummy <- ifelse(mtcars$cyl == 8, 1, 0) # D_Eight
# Verify the coding is correct: print side-by-side
data.frame(Cylinders = mtcars$cyl,
D_Six = x2.dummy,
D_Eight = x3.dummy)## Cylinders D_Six D_Eight
## 1 6 1 0
## 2 6 1 0
## 3 4 0 0
## 4 6 1 0
## 5 8 0 1
## 6 6 1 0
## 7 8 0 1
## 8 4 0 0
## 9 4 0 0
## 10 6 1 0
## 11 6 1 0
## 12 8 0 1
## 13 8 0 1
## 14 8 0 1
## 15 8 0 1
## 16 8 0 1
## 17 8 0 1
## 18 4 0 0
## 19 4 0 0
## 20 4 0 0
## 21 4 0 0
## 22 8 0 1
## 23 8 0 1
## 24 8 0 1
## 25 8 0 1
## 26 4 0 0
## 27 4 0 0
## 28 4 0 0
## 29 8 0 1
## 30 6 1 0
## 31 8 0 1
## 32 4 0 0
Dummy coding summary:
Cylinders \(D_{\text{Six}}\) \(D_{\text{Eight}}\) 4 (reference) 0 0 6 1 0 8 0 1 You can verify: when both dummies are 0, the observation must be a 4-cylinder car. ✓
Step 4 — Fit the full model
We fit the model with \(X_1\)
(wt), \(D_{\text{Six}}\),
and \(D_{\text{Eight}}\) as
predictors.
##
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5890 -1.2357 -0.5159 1.3845 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9908 1.8878 18.006 < 2e-16 ***
## x1 -3.2056 0.7539 -4.252 0.000213 ***
## x2.dummy -4.2556 1.3861 -3.070 0.004718 **
## x3.dummy -6.0709 1.6523 -3.674 0.000999 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.82
## F-statistic: 48.08 on 3 and 28 DF, p-value: 3.594e-11
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 847.73 847.73 129.6650 5.079e-12 ***
## x2.dummy 1 7.00 7.00 1.0713 0.3095128
## x3.dummy 1 88.26 88.26 13.4999 0.0009992 ***
## Residuals 28 183.06 6.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2.5 % 97.5 %
## (Intercept) 30.123824 37.857764
## x1 -4.749898 -1.661328
## x2.dummy -7.094824 -1.416341
## x3.dummy -9.455418 -2.686301
Reading the
summary()output:
(Intercept): \(\hat{\beta}_0 \approx 34.0\) — estimated mean mpg for the 4-cylinder group whenwt = 0(a mathematical extrapolation, not a realistic car).x1: \(\hat{\beta}_1 \approx -3.21\) — estimated change in mpg for each additional 1000 lbs, within any cylinder group (slope is shared).x2.dummy: \(\hat{\beta}_2 \approx -4.26\) — estimated difference in mean mpg between 6-cylinder and 4-cylinder cars, holding weight constant.x3.dummy: \(\hat{\beta}_3 \approx -6.07\) — estimated difference in mean mpg between 8-cylinder and 4-cylinder cars, holding weight constant.All three coefficients are statistically significant (\(p < 0.01\)). Both dummy coefficients are negative, meaning that — at the same weight — cars with more cylinders get worse fuel economy.
Step 5 — Write the fitted equation and derive group-specific equations
From the Estimate column of
summary(fit.full), substitute values into:
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 D_{\text{Six}} + \hat{\beta}_3 D_{\text{Eight}}\]
Then derive the three separate equations by substituting the dummy values:
## (Intercept) x1 x2.dummy x3.dummy
## 33.990794 -3.205613 -4.255582 -6.070860
# The three group equations are derived as follows:
# 4-cylinder (D_Six=0, D_Eight=0):
# yhat = b0 + b1*x1
# 6-cylinder (D_Six=1, D_Eight=0):
# yhat = (b0 + b2) + b1*x1
# 8-cylinder (D_Six=0, D_Eight=1):
# yhat = (b0 + b3) + b1*x1Interpretation example (from the actual fitted model): \(\hat{\beta}_0 \approx 34.0\), \(\hat{\beta}_1 \approx -3.21\), \(\hat{\beta}_2 \approx -4.26\), \(\hat{\beta}_3 \approx -6.07\).
- 4-cylinder: \(\hat{y} = 34.0 - 3.21\,x_1\)
- 6-cylinder: \(\hat{y} = 29.7 - 3.21\,x_1\) (intercept decreases by 4.26)
- 8-cylinder: \(\hat{y} = 27.9 - 3.21\,x_1\) (intercept decreases by 6.07)
Interpretation: “At the same weight, 8-cylinder cars are estimated to get about 6.07 fewer miles per gallon on average than 4-cylinder cars.”
Step 6 — Plot the three fitted lines
# Compute fitted values
yhat <- fitted(fit.full)
# Get coefficient estimates
b <- coef(fit.full)
b0 <- b["(Intercept)"]
b1 <- b["x1"]
b2 <- b["x2.dummy"]
b3 <- b["x3.dummy"]
x1_range <- seq(min(x1), max(x1), length.out = 100)
# Base scatterplot coloured by cylinder group
plot(x1, y,
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
pch = 16, col = as.integer(x2),
main = "Fitted Regression Lines by Cylinder Group")
# Add a fitted line for each group
lines(x1_range, b0 + b1 * x1_range, col = 1, lwd = 2) # 4-cylinder
lines(x1_range, b0 + b2 + b1 * x1_range, col = 2, lwd = 2) # 6-cylinder
lines(x1_range, b0 + b3 + b1 * x1_range, col = 3, lwd = 2) # 8-cylinder
legend("topright",
legend = c("4-cylinder", "6-cylinder", "8-cylinder"),
col = 1:3, lty = 1, lwd = 2, pch = 16, bty = "n")What this plot shows: Three parallel lines — same slope, different intercepts. The vertical gap between any two lines equals the corresponding dummy coefficient. If the lines were not parallel, that would suggest an interaction between cylinders and weight (a more advanced model not covered here).
Step 7 — Plot observed vs. predicted
plot(x1, y,
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
pch = 1, main = "Observed vs. Predicted MPG")
points(x1, yhat, pch = 20, col = "red")
legend("topright",
legend = c("Observed y", "Predicted ŷ"),
pch = c(1, 20), col = c("black", "red"), bty = "n")For this assignment we use a simulated dataset of 48
students. It is generated directly in R with set.seed(), so
it is fully reproducible and requires no download.
Variables: Y (post-test score),
Pretest (pre-test score), EQ (emotional
quotient score), Method (teaching method: A, B, C, D).
set.seed(208251)
n <- 48
Method <- factor(rep(c("A", "B", "C", "D"), each = n / 4))
Pretest <- round(rnorm(n, mean = 60, sd = 8), 1)
EQ <- round(rnorm(n, mean = 70, sd = 7), 1)
# True (unknown to students) method effects, added as noise for realism
method_effect <- c(A = 0, B = 4, C = 9, D = 6)[as.character(Method)]
Y <- round(15 + 0.55 * Pretest + 0.30 * EQ + method_effect +
rnorm(n, mean = 0, sd = 5), 1)
student_data <- data.frame(Y, Pretest, EQ, Method)
head(student_data)## Y Pretest EQ Method
## 1 82.9 71.1 74.2 A
## 2 74.9 65.2 78.2 A
## 3 69.3 63.3 72.2 A
## 4 73.9 53.8 77.3 A
## 5 71.3 68.4 60.5 A
## 6 71.0 66.0 70.9 A
## 'data.frame': 48 obs. of 4 variables:
## $ Y : num 82.9 74.9 69.3 73.9 71.3 71 50.3 64.7 61.1 65.6 ...
## $ Pretest: num 71.1 65.2 63.3 53.8 68.4 66 47.1 57.9 51.4 52.9 ...
## $ EQ : num 74.2 78.2 72.2 77.3 60.5 70.9 58.9 65.2 70 67.7 ...
## $ Method : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
Explore the data:
Pretest and \(Y\)
vs. EQ. Describe each relationship (form, direction,
strength).Method. Do the teaching methods appear to differ in mean
score?Method is a qualitative variable with \(k = 4\) groups. Create \(k - 1 = 3\) dummy variables. Clearly
state:
Fit the full model: \(Y = \beta_0 + \beta_1 \cdot \text{Pretest} + \beta_2 \cdot \text{EQ} + \beta_3 D_B + \beta_4 D_C + \beta_5 D_D + \varepsilon\). Show the R output and write the fitted equation.
Interpret each coefficient in the context of the problem. Pay particular attention to the dummy coefficients — what do they tell you about the effect of teaching method?
Write the separate fitted equation for students taught by each of Method A, B, C, and D. Comment on how the methods compare.
Plot all four fitted lines on a single scatterplot of \(Y\) vs. Pretest, with points
coloured by Method.
Objectives: Students are able to use R to:
Key concepts — Variable selection: When we have many candidate predictors, automatic selection methods help find a parsimonious model (few predictors, good fit). All three methods below use AIC (Akaike Information Criterion): lower AIC = better trade-off between fit and complexity.
| Method | Starting model | Strategy |
|---|---|---|
| Forward selection | Intercept only | Add the predictor that reduces AIC the most, repeat |
| Backward elimination | Full model | Remove the predictor that reduces AIC the most, repeat |
| Stepwise (both) | Full model | Add or remove at each step — generally preferred |
Important limitation: Automated selection does not guarantee the “best” model in any absolute sense. Always check whether the selected model makes practical sense and satisfies the regression assumptions.
Key concepts — Model diagnostics: Before using a fitted model for inference or prediction, we must verify four assumptions of the regression model \(Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \varepsilon_i\):
| Assumption | What it means | How to check |
|---|---|---|
| Linearity | \(E(\varepsilon_i) = 0\); no pattern in residuals | Residuals vs. Fitted plot |
| Normality | \(\varepsilon_i \sim N(0, \sigma^2)\) | Histogram, Q-Q plot, Shapiro-Wilk test |
| Constant variance | \(\text{Var}(\varepsilon_i) = \sigma^2\) (homoscedasticity) | Scale-Location plot, Breusch-Pagan test |
| Independence | Residuals are uncorrelated | Residuals vs. order plot, Durbin-Watson test |
Additionally, we check for influential observations — data points that have an unusually large effect on the fitted model.
We continue with the mtcars dummy-variable model from
Lab 4. The goal is now to select the best subset of predictors and
validate the model assumptions.
Step 1 — Re-load data and recreate variables
data(mtcars)
y <- mtcars$mpg
x1 <- mtcars$wt
x2.dummy <- ifelse(mtcars$cyl == 6, 1, 0)
x3.dummy <- ifelse(mtcars$cyl == 8, 1, 0)Step 2 — Variable selection
null <- lm(y ~ 1) # intercept-only model
full <- lm(y ~ x1 + x2.dummy + x3.dummy) # full model from Lab 4
# Forward selection
fw.fit <- step(null,
scope = list(lower = null, upper = full),
direction = "forward",
trace = 1)## Start: AIC=115.94
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x1 1 847.73 278.32 73.217
## + x3.dummy 1 619.89 506.16 92.355
## <none> 1126.05 115.943
## + x2.dummy 1 1.08 1124.96 117.913
##
## Step: AIC=73.22
## y ~ x1
##
## Df Sum of Sq RSS AIC
## + x3.dummy 1 33.635 244.69 71.096
## <none> 278.32 73.217
## + x2.dummy 1 7.004 271.32 74.402
##
## Step: AIC=71.1
## y ~ x1 + x3.dummy
##
## Df Sum of Sq RSS AIC
## + x2.dummy 1 61.628 183.06 63.810
## <none> 244.69 71.096
##
## Step: AIC=63.81
## y ~ x1 + x3.dummy + x2.dummy
## Start: AIC=63.81
## y ~ x1 + x2.dummy + x3.dummy
##
## Df Sum of Sq RSS AIC
## <none> 183.06 63.810
## - x2.dummy 1 61.628 244.69 71.096
## - x3.dummy 1 88.259 271.32 74.402
## - x1 1 118.204 301.26 77.752
# Stepwise regression (both directions) — generally preferred
sw.fit <- step(full,
direction = "both",
trace = 1)## Start: AIC=63.81
## y ~ x1 + x2.dummy + x3.dummy
##
## Df Sum of Sq RSS AIC
## <none> 183.06 63.810
## - x2.dummy 1 61.628 244.69 71.096
## - x3.dummy 1 88.259 271.32 74.402
## - x1 1 118.204 301.26 77.752
Reading the
step()output: Each row shows one candidate model. The columns are:
Df: degrees of freedom used by the variable being added/removedAIC: AIC of the candidate model after the changeThe algorithm selects the change that gives the lowest AIC. It stops when no change improves AIC.
Comparing the three methods: In this dataset, all three methods agree that all three predictors (
x1,x2.dummy,x3.dummy) should be retained — the full model already has the lowest AIC. This won’t always happen, but it’s a useful reminder that selection sometimes confirms your original model rather than simplifying it.
Step 3 — Examine the selected model
##
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5890 -1.2357 -0.5159 1.3845 5.7915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9908 1.8878 18.006 < 2e-16 ***
## x1 -3.2056 0.7539 -4.252 0.000213 ***
## x2.dummy -4.2556 1.3861 -3.070 0.004718 **
## x3.dummy -6.0709 1.6523 -3.674 0.000999 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.82
## F-statistic: 48.08 on 3 and 28 DF, p-value: 3.594e-11
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 847.73 847.73 129.6650 5.079e-12 ***
## x2.dummy 1 7.00 7.00 1.0713 0.3095128
## x3.dummy 1 88.26 88.26 13.4999 0.0009992 ***
## Residuals 28 183.06 6.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2.5 % 97.5 %
## (Intercept) 30.123824 37.857764
## x1 -4.749898 -1.661328
## x2.dummy -7.094824 -1.416341
## x3.dummy -9.455418 -2.686301
Overall F-test (4-step procedure):
- \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) (no predictors are useful) vs. \(H_a\): at least one \(\beta_j \neq 0\)
- \(F = MS_{reg} / MS_{res}\), with \(df_1 = p\) and \(df_2 = n - p - 1\)
- p-value from the
F-statisticline insummary()- Conclusion in context
Step 4 — Comprehensive model diagnostics
R’s built-in plot() for lm objects produces
four diagnostic plots at once. Understanding each one is essential.
Interpreting the four diagnostic plots:
Plot 1 — Residuals vs. Fitted - Checks: linearity and constant variance - Good: points scatter randomly around the horizontal zero line, with roughly equal spread across all fitted values - Warning signs: a curve (non-linearity) or a funnel/fan shape (non-constant variance)
Plot 2 — Normal Q-Q - Checks: normality of residuals - Good: points follow the diagonal reference line closely - Warning signs: S-shaped curve (heavy tails), points bending away at the ends
Plot 3 — Scale-Location (√|standardised residuals| vs. Fitted) - Checks: constant variance (more sensitive than Plot 1) - Good: a roughly horizontal red line with points evenly spread - Warning signs: a clear upward or downward trend in the red line
Plot 4 — Residuals vs. Leverage - Checks: influential observations - Points with high leverage AND large residuals are most dangerous - Points outside the Cook’s distance dashed lines (if visible) are highly influential
Step 5 — Formal tests for each assumption
res <- resid(sw.fit)
# --- Normality ---
# Visual: histogram and Q-Q plot side by side
par(mfrow = c(1, 2))
hist(res, main = "Histogram of Residuals",
xlab = "Residuals", col = "lightblue", border = "white",
freq = FALSE) # density scale
curve(dnorm(x, mean = mean(res), sd = sd(res)),
add = TRUE, col = "red", lwd = 2) # overlay normal curve
qqnorm(res, main = "Normal Q-Q Plot", pch = 16, col = "steelblue")
qqline(res, col = "red", lwd = 2)par(mfrow = c(1, 1))
# Shapiro-Wilk test (best for n < 50; use for up to ~2000 obs)
shapiro.test(res)##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.95907, p-value = 0.259
# H0: residuals are normally distributed
# Satisfied if p-value > 0.05 (here p ≈ 0.26 — satisfied)
# --- Constant variance (homoscedasticity) ---
# install.packages("lmtest") # run once if not yet installed
library(lmtest)
bptest(sw.fit)##
## studentized Breusch-Pagan test
##
## data: sw.fit
## BP = 6.6551, df = 3, p-value = 0.08374
# H0: variance is constant (homoscedastic)
# Satisfied if p-value > 0.05 (here p ≈ 0.08 — satisfied, though borderline)
# --- Independence (no autocorrelation) ---
dwtest(sw.fit)##
## Durbin-Watson test
##
## data: sw.fit
## DW = 1.8064, p-value = 0.2188
## alternative hypothesis: true autocorrelation is greater than 0
# H0: residuals are not autocorrelated
# Satisfied if p-value > 0.05 (DW statistic close to 2 is ideal; here DW ≈ 1.81)Summary of formal tests:
Test \(H_0\) Assumption satisfied if Shapiro-Wilk Residuals are normally distributed p-value > 0.05 Breusch-Pagan Variance is constant (homoscedastic) p-value > 0.05 Durbin-Watson No autocorrelation in residuals p-value > 0.05 If an assumption is violated, consider:
Violation Possible remedy Non-linearity Add a polynomial term (e.g. \(X^2\)) or transform \(X\) Non-constant variance Apply log or square-root transformation to \(Y\) Non-normality Transform \(Y\); check for outliers Autocorrelation Include a time trend; use time-series methods
Step 6 — Checking for influential observations
An observation can be unusual in two ways:
# Standardised residuals — values beyond ±2 are potential outliers
std.res <- rstandard(sw.fit)
which(abs(std.res) > 2) # row indices of potential outliers (e.g. Fiat 128, Toyota Corolla)## 18 20
## 18 20
# Leverage values — rule of thumb: high if > 2*(p+1)/n
n <- nrow(mtcars)
p <- length(coef(sw.fit)) - 1 # number of predictors
hat.vals <- hatvalues(sw.fit)
which(hat.vals > 2 * (p + 1) / n)## named integer(0)
# Cook's distance — rule of thumb: influential if > 4/n
cooks <- cooks.distance(sw.fit)
which(cooks > 4 / n) # e.g. Chrysler Imperial, Fiat 128, Toyota Corolla## 17 18 20
## 17 18 20
# Combined influence plot
plot(sw.fit, which = 4,
main = "Cook's Distance",
sub = paste("Threshold = 4/n =", round(4/n, 4)))
abline(h = 4/n, col = "red", lty = 2)What to do with influential observations: Do NOT automatically delete them. First: 1. Check whether the data point was recorded correctly (data entry error?). 2. Fit the model with and without the point and compare — if results change substantially, report both. 3. If justified, remove the point and note it in your report.
Step 7 — Prediction using the selected model
# Predict mpg for a new car:
# wt = 3.0 (thousand lbs), cyl = 6 (D_Six=1, D_Eight=0)
new_obs <- data.frame(x1 = 3.0,
x2.dummy = 1,
x3.dummy = 0)
# Point estimate
predict(sw.fit, newdata = new_obs)## 1
## 20.11837
# 95% confidence interval for the mean mpg at these values
predict(sw.fit, newdata = new_obs, interval = "confidence", level = 0.95)## fit lwr upr
## 1 20.11837 18.1305 22.10625
# 95% prediction interval for a single new car
predict(sw.fit, newdata = new_obs, interval = "prediction", level = 0.95)## fit lwr upr
## 1 20.11837 14.51622 25.72052
This assignment continues directly from Lab 4 Assignment. Re-generate
the same simulated dataset (the set.seed() guarantees you
get identical data) and use the dummy variables you created there.
set.seed(208251)
n <- 48
Method <- factor(rep(c("A", "B", "C", "D"), each = n / 4))
Pretest <- round(rnorm(n, mean = 60, sd = 8), 1)
EQ <- round(rnorm(n, mean = 70, sd = 7), 1)
method_effect <- c(A = 0, B = 4, C = 9, D = 6)[as.character(Method)]
Y <- round(15 + 0.55 * Pretest + 0.30 * EQ + method_effect +
rnorm(n, mean = 0, sd = 5), 1)
student_data <- data.frame(Y, Pretest, EQ, Method)Apply forward selection, backward elimination, and stepwise
regression to select the best predictors of \(Y\) from
{Pretest, EQ, D_B, D_C, D_D}. Compare the three methods —
do they agree?
Write the fitted equation from the stepwise-selected model. Interpret each coefficient in context.
Test the overall fit of the selected model at \(\alpha = 0.05\) using the 4-step procedure. Report \(R^2\) and adjusted \(R^2\) and explain what they tell you.
Produce R’s four built-in diagnostic plots
(plot(model)). For each plot, describe what you see and
state what assumption it is checking.
Carry out the three formal assumption tests:
For each test: state \(H_0\), report the test statistic and p-value, and conclude whether the assumption is satisfied.
Check for influential observations using Cook’s distance. Report any observations that exceed the threshold \(4/n\) and describe what you would do with them.
Find the 90% and 95% confidence intervals for each regression coefficient in the selected model.
Predict the post-test score for a new student with
Pretest = 60, EQ = 70, taught by Method C.
Give both a 95% confidence interval and a 95% prediction interval, and
explain the difference.
Objectives: Students are able to:
When to use non-parametric tests: Use non-parametric methods when:
| Parametric test | Non-parametric equivalent | Use case |
|---|---|---|
| One-sample t-test | Wilcoxon signed-rank test | One sample, test about median |
| Independent samples t-test | Mann-Whitney U (Wilcoxon rank-sum) | Two independent samples |
| Paired t-test | Wilcoxon signed-rank test (paired) | Two related samples |
| One-way ANOVA | Kruskal-Wallis test | Three or more independent samples |
| Repeated-measures ANOVA | Friedman test | Three or more related samples |
| Pearson correlation | Spearman rank correlation | Association, non-normal data |
Variables:
mpg,qsec,hp,wt,am(transmission),cyl(cylinders)
Load and prepare data
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Convert qualitative variables to factors
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))
mtcars$cyl <- factor(mtcars$cyl)Question A: Is the median mpg equal to 20?
One sample, test about a median → Wilcoxon signed-rank test
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
## [1] 6.026948
par(mfrow = c(1, 2))
boxplot(mtcars$mpg, main = "Boxplot of MPG", ylab = "MPG")
hist(mtcars$mpg, main = "Histogram of MPG",
xlab = "MPG", col = "lightblue", border = "white")par(mfrow = c(1, 1))
# H0: median mpg = 20 Ha: median mpg ≠ 20
wilcox.test(x = mtcars$mpg, mu = 20, alternative = "two.sided")##
## Wilcoxon signed rank exact test
##
## data: mtcars$mpg
## V = 249, p-value = 0.7855
## alternative hypothesis: true location is not equal to 20
Interpretation: The Wilcoxon signed-rank test tests \(H_0\): population median =
mu. - p-value < 0.05: reject \(H_0\); the median mpg is significantly different from 20. - p-value ≥ 0.05: insufficient evidence to conclude the median differs from 20.
Question B: Is the median quarter-mile time
(qsec) less than 18 seconds?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.50 16.89 17.71 17.85 18.90 22.90
par(mfrow = c(1, 2))
boxplot(mtcars$qsec, ylab = "Quarter-mile time (sec)", main = "Boxplot")
hist(mtcars$qsec, xlab = "Quarter-mile time (sec)",
main = "Histogram", col = "lightblue", border = "white")par(mfrow = c(1, 1))
# H0: median = 18 Ha: median < 18
wilcox.test(x = mtcars$qsec, mu = 18, alternative = "less")##
## Wilcoxon signed rank exact test
##
## data: mtcars$qsec
## V = 225.5, p-value = 0.2427
## alternative hypothesis: true location is less than 18
RoundingTimes)This example uses a classic teaching dataset for the Friedman test,
distributed with R’s own documentation (?friedman.test). It
records how long it took 22 different operators to
round the corner of a piece using three measurement
methods: “Round Out”, “Narrow Angle”, and “Wide Angle”. Because
each operator used all three methods, this is a paired (repeated
measures) design.
Step 1 — Enter the data directly
RoundingTimes <-
matrix(c(5.40, 5.50, 5.55,
5.85, 5.70, 5.75,
5.20, 5.60, 5.50,
5.55, 5.50, 5.40,
5.90, 5.85, 5.70,
5.45, 5.55, 5.60,
5.40, 5.40, 5.35,
5.45, 5.50, 5.35,
5.25, 5.15, 5.00,
5.85, 5.80, 5.70,
5.25, 5.20, 5.10,
5.65, 5.55, 5.45,
5.60, 5.35, 5.45,
5.05, 5.00, 4.95,
5.50, 5.50, 5.40,
5.45, 5.55, 5.50,
5.55, 5.55, 5.35,
5.45, 5.50, 5.55,
5.50, 5.45, 5.25,
5.65, 5.60, 5.40,
5.70, 5.65, 5.55,
6.30, 6.30, 6.25),
nrow = 22,
byrow = TRUE,
dimnames = list(1 : 22,
c("RoundOut", "NarrowAngle", "WideAngle")))
rt <- as.data.frame(RoundingTimes)
head(rt)## RoundOut NarrowAngle WideAngle
## 1 5.40 5.50 5.55
## 2 5.85 5.70 5.75
## 3 5.20 5.60 5.50
## 4 5.55 5.50 5.40
## 5 5.90 5.85 5.70
## 6 5.45 5.55 5.60
Step 2 — Explore data
## RoundOut NarrowAngle WideAngle
## Min. :5.050 Min. :5.000 Min. :4.950
## 1st Qu.:5.412 1st Qu.:5.463 1st Qu.:5.350
## Median :5.500 Median :5.525 Median :5.450
## Mean :5.543 Mean :5.534 Mean :5.459
## 3rd Qu.:5.650 3rd Qu.:5.600 3rd Qu.:5.550
## Max. :6.300 Max. :6.300 Max. :6.250
boxplot(rt,
xlab = "Measurement Method", ylab = "Time (sec)",
main = "Rounding Time by Method",
col = c("lightblue", "lightgreen", "lightyellow"))Question A: Is there a difference in timing between “Round Out” and “Narrow Angle”?
# H0: median(RoundOut) = median(NarrowAngle) Ha: ≠
wilcox.test(x = rt$RoundOut,
y = rt$NarrowAngle,
paired = TRUE,
alternative = "two.sided")##
## Wilcoxon signed rank exact test
##
## data: rt$RoundOut and rt$NarrowAngle
## V = 154.5, p-value = 0.288
## alternative hypothesis: true location shift is not equal to 0
Question B: Is the timing under “Narrow Angle” greater than under “Wide Angle”?
# H0: median(NarrowAngle) = median(WideAngle) Ha: NarrowAngle > WideAngle
wilcox.test(x = rt$NarrowAngle,
y = rt$WideAngle,
paired = TRUE,
alternative = "greater")##
## Wilcoxon signed rank exact test
##
## data: rt$NarrowAngle and rt$WideAngle
## V = 222, p-value = 0.0005486
## alternative hypothesis: true location shift is greater than 0
Why
paired = TRUE? Each operator produced a time under all three methods, so observations are matched by operator. Usingpaired = TRUEaccounts for individual differences between operators and increases the power of the test.
Objectives: Students are able to:
Load and prepare data
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
str(mtcars)## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Question a: Is there a difference in average fuel economy between automatic and manual cars?
Two independent groups → Mann-Whitney U test
(Wilcoxon rank-sum, paired = FALSE)
## $automatic
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 14.95 17.30 17.15 19.20 24.40
##
## $manual
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 21.00 22.80 24.39 30.40 33.90
boxplot(mtcars$mpg ~ mtcars$am,
xlab = "Transmission", ylab = "MPG",
main = "MPG by Transmission",
col = c("lightpink", "lightblue"))# H0: median mpg equal for automatic and manual Ha: ≠
wilcox.test(x = mtcars$mpg[mtcars$am == "automatic"],
y = mtcars$mpg[mtcars$am == "manual"],
paired = FALSE,
alternative = "two.sided")##
## Wilcoxon rank sum exact test
##
## data: mtcars$mpg[mtcars$am == "automatic"] and mtcars$mpg[mtcars$am == "manual"]
## W = 42, p-value = 0.001159
## alternative hypothesis: true location shift is not equal to 0
Question b: Is fuel economy related to number of cylinders?
Three or more independent groups → Kruskal-Wallis test
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.40 22.80 26.00 26.66 30.40 33.90
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.80 18.65 19.70 19.74 21.00 21.40
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 14.40 15.20 15.10 16.25 19.20
boxplot(mtcars$mpg ~ mtcars$cyl,
xlab = "Number of Cylinders", ylab = "MPG",
main = "MPG by Cylinders")# H0: median mpg equal across all cylinder groups Ha: at least one differs
kruskal.test(mtcars$mpg ~ mtcars$cyl)##
## Kruskal-Wallis rank sum test
##
## data: mtcars$mpg by mtcars$cyl
## Kruskal-Wallis chi-squared = 25.746, df = 2, p-value = 2.566e-06
Post-hoc testing: If the Kruskal-Wallis test is significant (p < 0.05), we know at least one group differs but not which pairs. Use pairwise Wilcoxon tests with a Bonferroni correction:
##
## Pairwise comparisons using Wilcoxon rank sum exact test
##
## data: mtcars$mpg and mtcars$cyl
##
## 4 6
## 6 0.00038 -
## 8 1.3e-06 0.00072
##
## P value adjustment method: bonferroni
Question c: Is there a (monotone) association between horsepower and quarter-mile time?
Two quantitative variables, non-normal data → Spearman rank correlation
plot(mtcars$hp, mtcars$qsec,
xlab = "Horsepower", ylab = "Quarter-mile Time (sec)",
main = "Quarter-mile Time vs. Horsepower",
pch = 16, col = "steelblue")# H0: ρ_s = 0 (no monotone association) Ha: ρ_s ≠ 0
cor.test(~ hp + qsec,
data = mtcars,
method = "spearman",
alternative = "two.sided")##
## Spearman's rank correlation rho
##
## data: hp and qsec
## S = 9093, p-value = 3.105e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.666606
Spearman vs. Pearson: Spearman measures monotone association (consistently increasing or decreasing), not just linear. It is more robust to outliers and does not require normality.
Question d: Does displacement tend to differ across gear counts?
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 120.1 275.8 318.0 326.3 380.0 472.0
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.10 78.92 130.90 123.02 160.00 167.60
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 95.1 120.3 145.0 202.5 301.0 351.0
boxplot(mtcars$disp ~ mtcars$gear,
xlab = "Number of Gears", ylab = "Displacement (cu. in.)",
main = "Displacement by Number of Gears")##
## Kruskal-Wallis rank sum test
##
## data: mtcars$disp by mtcars$gear
## Kruskal-Wallis chi-squared = 16.578, df = 2, p-value = 0.0002513
# Post-hoc if significant:
pairwise.wilcox.test(mtcars$disp, mtcars$gear,
p.adjust.method = "bonferroni")##
## Pairwise comparisons using Wilcoxon rank sum exact test
##
## data: mtcars$disp and mtcars$gear
##
## 3 4
## 4 1.5e-05 -
## 5 0.24 0.95
##
## P value adjustment method: bonferroni
Question e: Is there a relationship between transmission type and number of cylinders?
Two categorical variables → Chi-squared test of independence
##
## 4 6 8
## automatic 3 4 12
## manual 8 3 2
##
## 4 6 8
## automatic 0.1578947 0.2105263 0.6315789
## manual 0.6153846 0.2307692 0.1538462
# H0: transmission type and cylinders are independent Ha: they are associated
chisq.test(mtcars$am, mtcars$cyl)##
## Pearson's Chi-squared test
##
## data: mtcars$am and mtcars$cyl
## X-squared = 8.7407, df = 2, p-value = 0.01265
Check the chi-squared conditions: All expected cell counts should be ≥ 5.
## mtcars$cyl
## mtcars$am 4 6 8
## automatic 6.53125 4.15625 8.3125
## manual 4.46875 2.84375 5.6875
##
## Fisher's Exact Test for Count Data
##
## data: tab
## p-value = 0.009105
## alternative hypothesis: two.sided
Design: Each operator timed the piece using all three methods → repeated measures with 3+ groups → Friedman test.
RoundingTimes <-
matrix(c(5.40, 5.50, 5.55,
5.85, 5.70, 5.75,
5.20, 5.60, 5.50,
5.55, 5.50, 5.40,
5.90, 5.85, 5.70,
5.45, 5.55, 5.60,
5.40, 5.40, 5.35,
5.45, 5.50, 5.35,
5.25, 5.15, 5.00,
5.85, 5.80, 5.70,
5.25, 5.20, 5.10,
5.65, 5.55, 5.45,
5.60, 5.35, 5.45,
5.05, 5.00, 4.95,
5.50, 5.50, 5.40,
5.45, 5.55, 5.50,
5.55, 5.55, 5.35,
5.45, 5.50, 5.55,
5.50, 5.45, 5.25,
5.65, 5.60, 5.40,
5.70, 5.65, 5.55,
6.30, 6.30, 6.25),
nrow = 22,
byrow = TRUE,
dimnames = list(1 : 22,
c("RoundOut", "NarrowAngle", "WideAngle")))
str(RoundingTimes)## num [1:22, 1:3] 5.4 5.85 5.2 5.55 5.9 5.45 5.4 5.45 5.25 5.85 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:22] "1" "2" "3" "4" ...
## ..$ : chr [1:3] "RoundOut" "NarrowAngle" "WideAngle"
Question: Are timing results affected by the measurement method?
rt <- as.data.frame(RoundingTimes)
boxplot(rt,
xlab = "Measurement Method", ylab = "Time (sec)",
main = "Rounding Time by Method",
col = c("lightblue", "lightgreen", "lightyellow"))# H0: distributions of time are the same under all three methods
# Ha: at least one method leads to a different distribution
friedman.test(RoundingTimes)##
## Friedman rank sum test
##
## data: RoundingTimes
## Friedman chi-squared = 11.143, df = 2, p-value = 0.003805
Friedman test: The non-parametric equivalent of one-way repeated-measures ANOVA. It ranks observations within each block (operator) and tests whether rank sums differ across methods. Here the test is significant (p ≈ 0.004): the three methods are not all the same.
# Post-hoc pairwise comparisons (if Friedman test is significant)
pairwise.wilcox.test(as.vector(RoundingTimes),
rep(colnames(RoundingTimes), each = nrow(RoundingTimes)),
paired = TRUE,
p.adjust.method = "bonferroni")##
## Pairwise comparisons using Wilcoxon signed rank exact test
##
## data: as.vector(RoundingTimes) and rep(colnames(RoundingTimes), each = nrow(RoundingTimes))
##
## NarrowAngle RoundOut
## RoundOut 0.8641 -
## WideAngle 0.0033 0.0365
##
## P value adjustment method: bonferroni
Continue using the mtcars data (re-load it fresh with
data(mtcars) and re-create the factors as in Lab 7 Practice
I). For each test, follow the 4-step procedure: state hypotheses, give
the test statistic and p-value, state your conclusion in context.
Is the median mpg greater than 20?
Is there a difference in quarter-mile time (qsec)
between automatic and manual cars?
Is the median horsepower (hp) different across
gear-count groups? If so, which groups differ significantly? (Use
Bonferroni correction.)
Is there a significant positive monotone association between
weight (wt) and horsepower (hp)?
Is there a relationship between number of cylinders and number of gears? (Check the chi-squared expected count conditions first.)
End of Lab Materials