Every dataset used in these labs is already built into R (or generated by a short line of code).

How to use these lab materials

Read the explanation text before running each code chunk.

Pay attention to interpretation notes — these help you understand what the output means statistically.

In each Practice section you will see worked examples; the Assignment section is for you to complete independently.

Two datasets recur across several labs so you can build familiarity with them: trees (Case Study A) and mtcars (Case Study B), both built into base R. A few others (attitude, and one simulated dataset) appear once, in the assignments.

Lab 1: Correlation Analysis

Objectives: Students are able to use R to:

Perform descriptive statistics for two quantitative variables
Construct and interpret a scatterplot
Compute and interpret the Pearson correlation coefficient
Perform a hypothesis test for the population correlation
Find a confidence interval for the population correlation

Key concepts: The Pearson correlation coefficient \(r\) measures the strength and direction of the linear relationship between two quantitative variables \(X\) and \(Y\). It ranges from −1 to +1:

\[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i-\bar{x})^2 \; \sum(y_i-\bar{y})^2}}\]

Value of \(\|r\|\)	Interpretation
0.00 – 0.19	Very weak
0.20 – 0.39	Weak
0.40 – 0.59	Moderate
0.60 – 0.79	Strong
0.80 – 1.00	Very strong

The sign of \(r\) indicates direction: positive means both variables increase together; negative means one increases as the other decreases.

Important: Correlation measures linear association only. A correlation of 0 does not necessarily mean there is no relationship — only that there is no linear relationship.

Practice I: Cherry Tree Volume (Case Study A — `trees`)

R comes with a built-in dataset called trees: 31 felled black cherry trees, each with measured girth (trunk diameter), height, and volume of usable timber. No import is needed — trees is available the moment R starts.

Variables: Girth (trunk diameter, inches) and Volume (timber volume, cubic feet). Research question: Is there a significant linear correlation between trunk girth and timber volume?

Step 1 — Load and inspect the data

# trees is a built-in dataset - no download or setwd() required
data(trees)

head(trees)      # first 6 rows

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

str(trees)       # variable types and dimensions

## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

summary(trees)   # basic descriptive statistics

##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00

Note: str() tells you how many observations (rows) and variables (columns) are in your data, and what type each variable is (numeric, character, factor, etc.).

Step 2 — Descriptive statistics

x <- trees$Girth    # independent variable (predictor)
y <- trees$Volume    # dependent variable (response)

# Summary statistics
summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.30   11.05   12.90   13.25   15.25   20.60

summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.20   19.40   24.20   30.17   37.30   77.00

# Standard deviation
sd(x)

## [1] 3.138139

sd(y)

## [1] 16.43785

Step 3 — Create a scatterplot

A scatterplot is always the first step — it reveals the form, direction, and strength of the relationship before any numerical summary.

plot(x, y,
     xlab = "Girth (in.)",
     ylab = "Volume (cu. ft.)",
     main = "Scatterplot: Volume vs. Girth",
     pch  = 16,        # filled circle points
     col  = "steelblue")

What to look for in the scatterplot:

Form: Is the pattern linear (straight) or curved?

Direction: Positive (both increase together) or negative (one increases as the other decreases)?

Strength: Are the points tightly clustered around a line (strong) or widely scattered (weak)?

Outliers: Are there any unusual observations far from the general pattern?

Step 4 — Compute the correlation coefficient

# Sample Pearson correlation coefficient
cor(x, y)

## [1] 0.9671194

Interpretation: Report the value of \(r\) and describe its sign and magnitude using the reference table above. Running the code above gives \(r \approx 0.967\): “The sample correlation coefficient is \(r = 0.967\), indicating a very strong positive linear relationship between girth and volume.”

Step 5 — Hypothesis test for the population correlation

We test whether there is a statistically significant linear correlation in the population:

\[H_0: \rho = 0 \quad \text{(no linear correlation)} \qquad \text{vs.} \qquad H_a: \rho \neq 0\]

The test statistic is:

\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \text{ under } H_0\]

# Two-sided test at alpha = 0.05
cor.test(x, y, alternative = "two.sided", conf.level = 0.95)

## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 20.478, df = 29, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9322519 0.9841887
## sample estimates:
##       cor 
## 0.9671194

Interpretation of cor.test() output:

Output field What it means

cor Sample correlation coefficient \(r\)

t Test statistic value

df Degrees of freedom (\(n - 2\))

p-value Reject \(H_0\) if p-value \(< \alpha = 0.05\)

95 percent confidence interval 95% CI for the population correlation \(\rho\)

4-step conclusion template: 1. \(H_0: \rho = 0\) vs. \(H_a: \rho \neq 0\) 2. \(t \approx 20.5\), \(df = 29\) 3. \(p\text{-value} < 0.001\) 4. Since p-value \(< 0.05\), we reject \(H_0\). There is sufficient evidence of a significant linear correlation between girth and volume.

Output field	What it means
`cor`	Sample correlation coefficient \(r\)
`t`	Test statistic value
`df`	Degrees of freedom (\(n - 2\))
`p-value`	Reject \(H_0\) if p-value \(< \alpha = 0.05\)
`95 percent confidence interval`	95% CI for the population correlation \(\rho\)

Practice II: Car Weight and Fuel Economy (Case Study B — `mtcars`)

R’s built-in mtcars dataset records fuel consumption and 10 design/performance variables for 32 cars (1973–74 models). It will be used as a running case study across several labs, so it’s worth getting to know well now.

Variables: wt (weight, 1000 lbs), mpg (miles per gallon), am (transmission: 0 = automatic, 1 = manual), cyl (number of cylinders: 4, 6, or 8), gear (number of forward gears: 3, 4, or 5). Research question: Is there a significant linear correlation between a car’s weight and its fuel economy?

Step 1 — Load the data

data(mtcars)
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Step 2 — Explore the data

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

# Frequency table for transmission type (am: 0 = automatic, 1 = manual)
table1 <- table(mtcars$am)
table1

## 
##  0  1 
## 19 13

prop.table(table1)   # proportions (multiply by 100 for percentages)

## 
##       0       1 
## 0.59375 0.40625

# Two-way frequency table: number of cylinders by transmission type
table2 <- table(mtcars$cyl, mtcars$am)
table2

##    
##      0  1
##   4  3  8
##   6  4  3
##   8 12  2

prop.table(table2)

##    
##           0       1
##   4 0.09375 0.25000
##   6 0.12500 0.09375
##   8 0.37500 0.06250

# Descriptive statistics for quantitative variables
summary(mtcars$wt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.513   2.581   3.325   3.217   3.610   5.424

summary(mtcars$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

# Boxplots
par(mfrow = c(1, 2))
boxplot(mtcars$mpg,
        main = "Distribution of MPG", ylab = "Miles per Gallon")
boxplot(mtcars$mpg ~ mtcars$am,
        xlab = "Transmission (0 = auto, 1 = manual)", ylab = "MPG",
        main = "MPG by Transmission",
        col  = c("lightpink", "lightblue"))

par(mfrow = c(1, 1))

boxplot(mtcars$mpg ~ mtcars$gear,
        xlab = "Number of Gears", ylab = "MPG",
        main = "MPG by Number of Gears")

Interpretation: Boxplots let you compare distributions across groups. Look for differences in medians, spread (IQR), and the presence of outliers.

Step 3 — Scatterplot

x2 <- mtcars$wt
y2 <- mtcars$mpg

plot(x2, y2,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     main = "Scatterplot: MPG vs. Weight",
     pch  = 16, col = "steelblue")

Step 4 — Correlation coefficient and hypothesis test

# Sample correlation
cor(x2, y2)

## [1] -0.8676594

# Hypothesis test: H0: rho = 0  vs  Ha: rho ≠ 0
cor.test(x2, y2, alternative = "two.sided", conf.level = 0.95)

## 
##  Pearson's product-moment correlation
## 
## data:  x2 and y2
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

Running the code gives \(r \approx -0.868\) (\(p < 0.001\)) — a very strong negative linear relationship: heavier cars tend to have lower fuel economy.

Assignment: Employee Survey Ratings (Case Study C — `attitude`)

R’s built-in attitude dataset comes from a survey of clerical employees at a large financial organisation: 30 employees rated their department (0–100 scale) on several dimensions, including overall performance rating and perceived opportunity for learning.

Variables: learning (perceived learning opportunity score, 0–100) and rating (overall performance rating, 0–100).

Complete all questions using R. Show your R code and output, and write your interpretation in full sentences.

data(attitude)
str(attitude)

## 'data.frame':    30 obs. of  7 variables:
##  $ rating    : num  43 63 71 61 81 43 58 71 72 67 ...
##  $ complaints: num  51 64 70 63 78 55 67 75 82 61 ...
##  $ privileges: num  30 51 68 45 56 49 42 50 72 45 ...
##  $ learning  : num  39 54 69 47 66 44 56 55 67 47 ...
##  $ raises    : num  61 63 76 54 71 54 66 70 71 62 ...
##  $ critical  : num  92 73 86 84 83 49 68 66 83 80 ...
##  $ advance   : num  45 47 48 35 47 34 35 41 31 41 ...

Perform descriptive statistics for both learning and rating (mean, SD, min, max). Summarise the results.
Make a scatterplot of learning (\(x\)) vs. rating (\(y\)). Describe the form, direction, strength, and any outliers.
Find the sample correlation coefficient \(r\) between learning and rating. Interpret its value.
Test whether the population correlation is different from zero at \(\alpha = 0.05\). Use the 4-step procedure: (i) state hypotheses, (ii) give the test statistic and degrees of freedom, (iii) give the p-value, (iv) state your conclusion in context.
Find the 95% confidence interval for the population correlation \(\rho\) and interpret it.

Lab 2: Simple Linear Regression

Objectives: Students are able to use R to:

Fit a simple linear regression model using least squares
Interpret the regression coefficients
Perform inference on regression parameters (t-tests and confidence intervals)
Assess model fit using \(r^2\) and \(MS_{res}\)
Check regression assumptions using residual diagnostics
Construct confidence and prediction intervals

Key concepts: The simple linear regression model expresses the relationship between a response variable \(Y\) and a single predictor \(X\) as:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]

where:

\(\beta_0\) = intercept (value of \(E(Y)\) when \(X = 0\))
\(\beta_1\) = slope (change in \(E(Y)\) for a 1-unit increase in \(X\))
\(\varepsilon_i\) = random error term (independent, normally distributed with constant variance)

The least-squares estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) minimise \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\).

Practice I: Cherry Tree Volume (continued)

We continue with the same trees data used in Lab 1.

Variables: Girth (in.) as \(X\), Volume (cu. ft.) as \(Y\). Research question: Can we predict timber volume from trunk girth using a linear model?

Step 1 — Load data

data(trees)
x <- trees$Girth
y <- trees$Volume

Step 2 — Fit the regression model

fit <- lm(y ~ x)       # fit simple linear regression
summary(fit)           # full model summary

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## x             5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

confint(fit, level = 0.95)  # 95% CI for β₀ and β₁

##                  2.5 %     97.5 %
## (Intercept) -43.825953 -30.060965
## x             4.559914   5.571799

Key output from summary(fit):

Output field What it means

Estimate (Intercept) \(\hat{\beta}_0\): predicted \(Y\) when \(X = 0\)

Estimate (x) \(\hat{\beta}_1\): change in \(\hat{Y}\) for each 1-unit increase in \(X\)

Std. Error Standard error of the coefficient estimate

t value Test statistic for \(H_0: \beta_j = 0\)

Pr(>|t|) p-value; reject \(H_0\) if \(< 0.05\)

Multiple R-squared \(r^2\): proportion of variance in \(Y\) explained by \(X\)

F-statistic Overall model F-test (equivalent to t-test in simple regression)

Residual standard error \(\hat{\sigma} = \sqrt{MS_{res}}\): estimated SD of errors

How to write the fitted equation: Find the Estimate column in the output and substitute: \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\] Here, \(\hat{\beta}_0 \approx -36.94\) and \(\hat{\beta}_1 \approx 5.07\), so \(\hat{y} = -36.94 + 5.07x\).

Output field	What it means
`Estimate` (Intercept)	\(\hat{\beta}_0\): predicted \(Y\) when \(X = 0\)
`Estimate` (x)	\(\hat{\beta}_1\): change in \(\hat{Y}\) for each 1-unit increase in \(X\)
`Std. Error`	Standard error of the coefficient estimate
`t value`	Test statistic for \(H_0: \beta_j = 0\)
`Pr(>\|t\|)`	p-value; reject \(H_0\) if \(< 0.05\)
`Multiple R-squared`	\(r^2\): proportion of variance in \(Y\) explained by \(X\)
`F-statistic`	Overall model F-test (equivalent to t-test in simple regression)
`Residual standard error`	\(\hat{\sigma} = \sqrt{MS_{res}}\): estimated SD of errors

Step 3 — Plot the fitted regression line

plot(x, y,
     xlab = "Girth (in.)",
     ylab = "Volume (cu. ft.)",
     main = "Regression Line: Volume vs. Girth",
     pch  = 16, col = "steelblue")
abline(fit, col = "red", lwd = 2)   # overlay regression line

Step 4 — Residual analysis

Residuals are defined as \(e_i = y_i - \hat{y}_i\). Before trusting regression inference, we must verify three assumptions:

Linearity — no systematic pattern in residuals vs. fitted values
Normality — residuals are approximately normally distributed
Constant variance (homoscedasticity) — spread of residuals is roughly equal for all fitted values

res <- resid(fit)

par(mfrow = c(1, 3))

# 1. Residuals vs. Fitted — check linearity and constant variance
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted",
     pch  = 16, col = "steelblue")
abline(h = 0, lty = 2, col = "red")

# 2. Histogram — check approximate normality
hist(res, main = "Histogram of Residuals",
     xlab = "Residuals", col = "lightblue", border = "white")

# 3. Normal Q-Q plot — check normality more carefully
qqnorm(res, main = "Normal Q-Q Plot", pch = 16)
qqline(res, col = "red")

par(mfrow = c(1, 1))

What to look for:

Residuals vs. Fitted: points should scatter randomly around the zero line. A curve suggests non-linearity; a fan shape suggests non-constant variance.

Histogram: should be roughly bell-shaped and symmetric around zero.

Q-Q plot: points should fall close to the red diagonal line. Systematic deviations indicate non-normality.

Note: With girth and volume, a slight curve is visible in the residuals vs. fitted plot — a cylinder is only an approximation for a tree trunk’s shape, and real trees taper. This is a nice discussion point: sometimes real data hint that a straight line isn’t the perfect model.

Step 5 — Prediction for a new value of x

Two types of intervals are available:

Confidence interval (CI): estimates the mean response \(E(Y \mid X = x^*)\) — how high is the average volume for all trees with girth \(= x^*\)?
Prediction interval (PI): estimates a single future observation at \(X = x^*\) — what volume might an individual tree with girth \(= x^*\) have?

The PI is always wider than the CI because it includes both estimation uncertainty and individual variability.

new <- data.frame(x = 13)   # predict at x = 13 in.

# Point estimate
predict(fit, newdata = new)

##        1 
## 28.91267

# 95% Confidence interval for E(Y | x = 13)
predict(fit, newdata = new, interval = "confidence", level = 0.95)

##        fit      lwr      upr
## 1 28.91267 27.34573 30.47962

# 95% Prediction interval for a new individual at x = 13
predict(fit, newdata = new, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 28.91267 20.07634 37.74901

Output columns: fit = point estimate, lwr = lower bound, upr = upper bound.

Step 6 — Plot fitted line with CI and PI bands

new_range <- data.frame(x = sort(x))   # sorted x values for smooth curves

pred.CI <- predict(fit, newdata = new_range, interval = "confidence")
pred.PI <- predict(fit, newdata = new_range, interval = "prediction")

plot(x, y,
     xlab = "Girth (in.)", ylab = "Volume (cu. ft.)",
     main = "Fitted Line with 95% CI and PI Bands",
     pch  = 16, col = "steelblue")

# Fitted regression line
lines(new_range$x, pred.CI[, "fit"], col = "black", lwd = 2)

# 95% Confidence interval (red dashed)
lines(new_range$x, pred.CI[, "lwr"], col = "red",  lwd = 1.5, lty = 2)
lines(new_range$x, pred.CI[, "upr"], col = "red",  lwd = 1.5, lty = 2)

# 95% Prediction interval (blue dotted)
lines(new_range$x, pred.PI[, "lwr"], col = "blue", lwd = 1.5, lty = 3)
lines(new_range$x, pred.PI[, "upr"], col = "blue", lwd = 1.5, lty = 3)

legend("topleft",
       legend = c("Fitted line", "95% CI", "95% PI"),
       col    = c("black", "red", "blue"),
       lty    = c(1, 2, 3), lwd = 2, bty = "n")

Why is the PI always wider than the CI? The CI captures only the uncertainty in estimating the mean of \(Y\), while the PI must additionally account for the natural scatter of individual observations around that mean. The gap between CI and PI bands is largest near the extremes of \(x\) and narrowest near \(\bar{x}\).

Practice II: Car Weight and Fuel Economy (continued)

We continue with the same mtcars data used in Lab 1 Practice II.

Variables: wt (1000 lbs) as \(X\), mpg as \(Y\). Research question: Can we predict fuel economy from weight using a linear model?

Step 1 — Load data

data(mtcars)
x2 <- mtcars$wt
y2 <- mtcars$mpg

Step 2 — Fit the model

fit2 <- lm(y2 ~ x2)
summary(fit2)

## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## x2           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

confint(fit2)

##                 2.5 %    97.5 %
## (Intercept) 33.450500 41.119753
## x2          -6.486308 -4.202635

Step 3 — Plot fitted line

plot(x2, y2,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     main = "Regression Line: MPG vs. Weight",
     pch  = 16, col = "steelblue")
abline(fit2, col = "red", lwd = 2)

Step 4 — Residual analysis

res2 <- resid(fit2)

par(mfrow = c(1, 3))
plot(fitted(fit2), res2,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted", pch = 16, col = "steelblue")
abline(h = 0, lty = 2, col = "red")
hist(res2, main = "Histogram of Residuals",
     xlab = "Residuals", col = "lightblue", border = "white")
qqnorm(res2, main = "Normal Q-Q Plot", pch = 16)
qqline(res2, col = "red")

par(mfrow = c(1, 1))

Step 5 — CI and PI bands

new2     <- data.frame(x2 = sort(x2))
pred.CI2 <- predict(fit2, newdata = new2, interval = "confidence")
pred.PI2 <- predict(fit2, newdata = new2, interval = "prediction")

plot(x2, y2,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     main = "Fitted Line with 95% CI and PI Bands",
     pch  = 16, col = "steelblue",
     ylim = range(c(pred.PI2[, "lwr"], pred.PI2[, "upr"])))

lines(new2$x2, pred.CI2[, "fit"], col = "black", lwd = 2)
lines(new2$x2, pred.CI2[, "lwr"], col = "red",  lty = 2, lwd = 1.5)
lines(new2$x2, pred.CI2[, "upr"], col = "red",  lty = 2, lwd = 1.5)
lines(new2$x2, pred.PI2[, "lwr"], col = "blue", lty = 3, lwd = 1.5)
lines(new2$x2, pred.PI2[, "upr"], col = "blue", lty = 3, lwd = 1.5)

legend("topleft",
       legend = c("Fitted line", "95% CI", "95% PI"),
       col    = c("black", "red", "blue"),
       lty    = c(1, 2, 3), lwd = 2, bty = "n")

Assignment: Employee Survey Ratings (continued)

We continue with the attitude data from Lab 1. Now we extend the analysis from correlation to regression.

data(attitude)

Obtain the least-squares regression line for predicting rating (\(Y\)) from learning (\(X\)). Write the fitted equation.
Interpret \(\hat{\beta}_0\) and \(\hat{\beta}_1\) in the context of this problem.
Find \(MS_{res}\) (= residual standard error squared) and state its units.
Test whether learning is a significant linear predictor of rating at \(\alpha = 0.05\). Use the 4-step procedure.
Find the 95% confidence intervals for \(\beta_0\) and \(\beta_1\).
Find \(r^2\) and interpret its value in context.
Check the regression assumptions using residual plots (residuals vs. fitted, histogram, Q-Q plot). Are the assumptions satisfied?
Find the 95% confidence interval and 95% prediction interval for the rating of a department with a learning score of 60. Explain the difference between the two intervals.
Plot the fitted regression line along with its 95% CI and PI bands for all values of \(x\).

Lab 3: Multiple Linear Regression

Objectives: Students are able to use R to:

Perform descriptive statistics for multiple linear regression
Select independent variables into the model
Perform multiple linear regression analysis and make inference on regression parameters
Interpret the results

Key concepts: The multiple linear regression model with \(p\) predictors is:

\[Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i\]

We estimate the coefficients by least squares and evaluate the model with:

Overall F-test: tests \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)
Individual t-tests: tests \(H_0: \beta_j = 0\) for each predictor
\(R^2\) (multiple): proportion of variance in \(Y\) explained by all predictors together
Adjusted \(R^2\): penalises for the number of predictors; better for comparing models of different sizes

Practice: Predicting Fuel Economy from Several Car Characteristics

We continue with mtcars. Rather than predicting mpg from weight alone, we now bring in several quantitative predictors at once:

Variable	Description	Type
`disp`	Engine displacement (cu. in.)	Quantitative
`hp`	Gross horsepower	Quantitative
`drat`	Rear axle ratio	Quantitative
`wt`	Weight (1000 lbs)	Quantitative
`qsec`	Quarter-mile time (seconds)	Quantitative
`mpg`	Miles per gallon	Response

Step 1 — Load data

data(mtcars)
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Step 2 — Descriptive statistics and correlation matrix

Since all variables here are quantitative, we examine pairwise scatterplots and correlations.

# Pairwise correlation matrix
cor(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])

##             mpg       disp         hp        drat         wt        qsec
## mpg   1.0000000 -0.8475514 -0.7761684  0.68117191 -0.8676594  0.41868403
## disp -0.8475514  1.0000000  0.7909486 -0.71021393  0.8879799 -0.43369788
## hp   -0.7761684  0.7909486  1.0000000 -0.44875912  0.6587479 -0.70822339
## drat  0.6811719 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120476
## wt   -0.8676594  0.8879799  0.6587479 -0.71244065  1.0000000 -0.17471588
## qsec  0.4186840 -0.4336979 -0.7082234  0.09120476 -0.1747159  1.00000000

# Pairwise scatterplot matrix
pairs(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")],
      main = "Scatterplot Matrix for mtcars",
      pch  = 16, col = "steelblue")

Interpretation: Look at the first row (or column) to see which predictors appear most linearly related to mpg. Also check for high correlations between predictors (e.g. disp, hp, and wt are all strongly related to each other) — this is called multicollinearity and can cause instability in coefficient estimates.

Step 3 — Full model (all five predictors)

full <- lm(mpg ~ disp + hp + drat + wt + qsec, data = mtcars)
summary(full)

## 
## Call:
## lm(formula = mpg ~ disp + hp + drat + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5404 -1.6701 -0.4264  1.1320  5.4996 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 16.53357   10.96423   1.508  0.14362   
## disp         0.00872    0.01119   0.779  0.44281   
## hp          -0.02060    0.01528  -1.348  0.18936   
## drat         2.01578    1.30946   1.539  0.13579   
## wt          -4.38546    1.24343  -3.527  0.00158 **
## qsec         0.64015    0.45934   1.394  0.17523   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.558 on 26 degrees of freedom
## Multiple R-squared:  0.8489, Adjusted R-squared:  0.8199 
## F-statistic: 29.22 on 5 and 26 DF,  p-value: 6.892e-10

anova(full)

## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq  F value   Pr(>F)    
## disp       1 808.89  808.89 123.6185 2.23e-11 ***
## hp         1  33.67   33.67   5.1449 0.031854 *  
## drat       1  30.15   30.15   4.6073 0.041340 *  
## wt         1  70.51   70.51  10.7754 0.002933 ** 
## qsec       1  12.71   12.71   1.9422 0.175233    
## Residuals 26 170.13    6.54                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Key outputs:

anova(full) gives the sequential (Type I) ANOVA table, showing the reduction in \(SS_{res}\) as each predictor is added.

The overall F-test p-value tests whether at least one \(\beta_j \neq 0\). Here it is highly significant (\(R^2 \approx 0.85\)), even though several individual predictors are not.

Some individual predictors may be non-significant even if the overall model is — this is common with correlated predictors (see the multicollinearity note above).

Step 4 — Variable selection

With many predictors we use automatic selection to find the best subset. All three methods use AIC (Akaike Information Criterion) — lower AIC = better model.

null <- lm(mpg ~ 1, data = mtcars)                          # intercept-only (no predictors)
full <- lm(mpg ~ disp + hp + drat + wt + qsec, data = mtcars)

# Forward selection: starts empty, adds one predictor at a time
fw.fit <- step(null,
               scope     = list(lower = null, upper = full),
               direction = "forward",
               trace     = 1)

## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## + drat  1    522.48  603.57  97.988
## + qsec  1    197.39  928.66 111.776
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1    83.274 195.05 63.840
## + qsec  1    82.858 195.46 63.908
## + disp  1    31.639 246.68 71.356
## <none>              278.32 73.217
## + drat  1     9.081 269.24 74.156
## 
## Step:  AIC=63.84
## mpg ~ wt + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              195.05 63.840
## + drat  1   11.3659 183.68 63.919
## + qsec  1    8.9885 186.06 64.331
## + disp  1    0.0571 194.99 65.831

# Backward elimination: starts full, removes one predictor at a time
be.fit <- step(full, direction = "backward", trace = 1)

## Start:  AIC=65.47
## mpg ~ disp + hp + drat + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     3.974 174.10 64.205
## <none>              170.13 65.466
## - hp    1    11.886 182.01 65.627
## - qsec  1    12.708 182.84 65.772
## - drat  1    15.506 185.63 66.258
## - wt    1    81.394 251.52 75.978
## 
## Step:  AIC=64.21
## mpg ~ hp + drat + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.418 183.52 63.891
## - qsec  1     9.578 183.68 63.919
## <none>              174.10 64.205
## - drat  1    11.956 186.06 64.331
## - wt    1   113.882 287.99 78.310
## 
## Step:  AIC=63.89
## mpg ~ drat + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## <none>              183.52 63.891
## - drat  1    11.942 195.46 63.908
## - qsec  1    85.720 269.24 74.156
## - wt    1   275.686 459.21 91.241

# Stepwise regression: both directions (generally preferred)
sw.fit <- step(full, direction = "both",     trace = 1)

## Start:  AIC=65.47
## mpg ~ disp + hp + drat + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     3.974 174.10 64.205
## <none>              170.13 65.466
## - hp    1    11.886 182.01 65.627
## - qsec  1    12.708 182.84 65.772
## - drat  1    15.506 185.63 66.258
## - wt    1    81.394 251.52 75.978
## 
## Step:  AIC=64.21
## mpg ~ hp + drat + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.418 183.52 63.891
## - qsec  1     9.578 183.68 63.919
## <none>              174.10 64.205
## - drat  1    11.956 186.06 64.331
## + disp  1     3.974 170.13 65.466
## - wt    1   113.882 287.99 78.310
## 
## Step:  AIC=63.89
## mpg ~ drat + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## <none>              183.52 63.891
## - drat  1    11.942 195.46 63.908
## + hp    1     9.418 174.10 64.205
## + disp  1     1.506 182.02 65.627
## - qsec  1    85.720 269.24 74.156
## - wt    1   275.686 459.21 91.241

Comparing the three methods: They do not always agree. In this dataset, forward selection tends to settle on {wt, hp}, while backward elimination and stepwise (both) agree on {drat, wt, qsec}. Stepwise regression is generally preferred as it reconsiders variables at each step.

Step 5 — Fitted model from stepwise regression

summary(sw.fit)     # coefficients, R², F-test

## 
## Call:
## lm(formula = mpg ~ drat + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1152 -1.8273 -0.2696  1.0502  5.5010 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3945     8.0689   1.412  0.16892    
## drat          1.6561     1.2269   1.350  0.18789    
## wt           -4.3978     0.6781  -6.485 5.01e-07 ***
## qsec          0.9462     0.2616   3.616  0.00116 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.56 on 28 degrees of freedom
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.8196 
## F-statistic: 47.93 on 3 and 28 DF,  p-value: 3.723e-11

anova(sw.fit)       # ANOVA table

## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## drat       1 522.48  522.48  79.715 1.105e-09 ***
## wt         1 334.33  334.33  51.008 9.009e-08 ***
## qsec       1  85.72   85.72  13.078  0.001163 ** 
## Residuals 28 183.52    6.55                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(sw.fit)     # 95% CI for all coefficients

##                  2.5 %    97.5 %
## (Intercept) -5.1339149 27.922817
## drat        -0.8571278  4.169418
## wt          -5.7868161 -3.008776
## qsec         0.4102527  1.482155

How to write the fitted equation: If stepwise selected drat, wt, and qsec: \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \, \text{drat} + \hat{\beta}_2 \, \text{wt} + \hat{\beta}_3 \, \text{qsec}\] Substitute the Estimate values from summary() output (approximately \(\hat{y} = 11.39 + 1.66\,\text{drat} - 4.40\,\text{wt} + 0.95\,\text{qsec}\)).

Assignment: Predicting Fuel Economy (continued)

Using the results from the Practice section above, answer the following:

Which predictors were selected by each method (forward, backward, stepwise)? Do they agree?
Write the fitted regression equation from the stepwise model.
Test the overall fit at \(\alpha = 0.05\) using the 4-step hypothesis testing procedure.
Interpret each significant regression coefficient in the context of the problem.
Find the 95% CI for each coefficient in the selected model.
Find \(R^2\) and adjusted \(R^2\). What do they tell you?

Lab 4: Multiple Linear Regression with Dummy Variables

Objectives: Students are able to use R to:

Recognise when and why dummy variables are needed
Transform a qualitative variable into dummy variables
Fit a multiple regression model with dummy variables
Interpret all regression coefficients, including dummy variable coefficients
Derive and compare the fitted equations for each group

Key concepts — Dummy variables: When a predictor is qualitative (categorical), we cannot enter it directly into a regression model as a number. Instead, we create \(k - 1\) dummy (indicator) variables for a variable with \(k\) groups. The omitted group is called the reference category.

Each dummy variable’s coefficient \(\hat{\beta}_j\) represents the estimated difference in the mean response between that group and the reference group, holding all other predictors constant.

Why \(k - 1\) dummies, not \(k\)? Using \(k\) dummies would create perfect multicollinearity (the dummy columns would always sum to 1, which equals the intercept column). Dropping one group — the reference — avoids this problem. The intercept then represents the mean response for the reference group.

General dummy coding scheme for a variable with \(k = 3\) groups (A = reference, B, C):

Group	\(D_B\)	\(D_C\)
A (reference)	0	0
B	1	0
C	0	1

The model becomes: \[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 D_B + \hat{\beta}_3 D_C\]

This gives separate parallel regression lines for each group:

Group	Fitted equation	Intercept
A (reference)	\(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1\)	\(\hat{\beta}_0\)
B	\(\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 X_1\)	\(\hat{\beta}_0 + \hat{\beta}_2\)
C	\(\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_3) + \hat{\beta}_1 X_1\)	\(\hat{\beta}_0 + \hat{\beta}_3\)

All three lines share the same slope \(\hat{\beta}_1\) but have different intercepts.

Practice I: Fuel Economy by Weight and Number of Cylinders

Research question: Can fuel economy be predicted from weight and number of cylinders?

Variables:

Variable	Name in data	Type
Fuel economy (mpg)	`mpg`	Response (\(Y\))
Weight (1000 lbs)	`wt`	Quantitative predictor (\(X_1\))
Number of cylinders	`cyl`	Qualitative predictor (4 / 6 / 8)

Step 1 — Load data

data(mtcars)
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Step 2 — Descriptive statistics

Before fitting, explore the relationship between each predictor and the response separately.

y  <- mtcars$mpg
x1 <- mtcars$wt              # quantitative predictor
x2 <- factor(mtcars$cyl)     # qualitative predictor (4, 6, or 8 cylinders)

# Summaries
summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

summary(x1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.513   2.581   3.325   3.217   3.610   5.424

table(x2)   # count of each cylinder group

## x2
##  4  6  8 
## 11  7 14

# Correlation between the quantitative predictor and response
cor(x1, y)

## [1] -0.8676594

# Scatterplot coloured by cylinder group
plot(x1, y,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     pch  = 16, col = as.integer(x2),
     main = "MPG vs. Weight, by Cylinders")
legend("topright",
       legend = levels(x2),
       col    = 1:length(levels(x2)),
       pch = 16, bty = "n")

# Boxplot: mpg by cylinder group
boxplot(y ~ x2,
        xlab = "Number of Cylinders", ylab = "Miles per Gallon",
        main = "MPG by Cylinders",
        col  = c("lightblue", "lightgreen", "lightyellow"))

What to look for:

Do the groups (4, 6, 8 cylinders) appear to have different average fuel economy in the boxplot? If yes, cylinder count is likely a useful predictor.

In the scatterplot, do the three groups of points appear to sit at different vertical levels? If so, they may have different intercepts.

Is the relationship between weight and mpg roughly linear within each group?

# Alternative using ggplot2 (install once: install.packages("ggplot2"))
library(ggplot2)

# Scatterplot coloured by cylinder group
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       colour = "Cylinders",
       title = "MPG vs. Weight, by Cylinders") +
  theme_minimal()

Step 3 — Create dummy variables

With \(k = 3\) groups (4, 6, 8 cylinders), we need \(k - 1 = 2\) dummy variables. We choose 4 cylinders as the reference category.

x2.dummy <- ifelse(mtcars$cyl == 6, 1, 0)   # D_Six
x3.dummy <- ifelse(mtcars$cyl == 8, 1, 0)   # D_Eight

# Verify the coding is correct: print side-by-side
data.frame(Cylinders = mtcars$cyl,
           D_Six     = x2.dummy,
           D_Eight   = x3.dummy)

##    Cylinders D_Six D_Eight
## 1          6     1       0
## 2          6     1       0
## 3          4     0       0
## 4          6     1       0
## 5          8     0       1
## 6          6     1       0
## 7          8     0       1
## 8          4     0       0
## 9          4     0       0
## 10         6     1       0
## 11         6     1       0
## 12         8     0       1
## 13         8     0       1
## 14         8     0       1
## 15         8     0       1
## 16         8     0       1
## 17         8     0       1
## 18         4     0       0
## 19         4     0       0
## 20         4     0       0
## 21         4     0       0
## 22         8     0       1
## 23         8     0       1
## 24         8     0       1
## 25         8     0       1
## 26         4     0       0
## 27         4     0       0
## 28         4     0       0
## 29         8     0       1
## 30         6     1       0
## 31         8     0       1
## 32         4     0       0

Dummy coding summary:

Cylinders \(D_{\text{Six}}\) \(D_{\text{Eight}}\)

4 (reference) 0 0

6 1 0

8 0 1

You can verify: when both dummies are 0, the observation must be a 4-cylinder car. ✓

Cylinders	\(D_{\text{Six}}\)	\(D_{\text{Eight}}\)
4 (reference)	0	0
6	1	0
8	0	1

Step 4 — Fit the full model

We fit the model with \(X_1\) (wt), \(D_{\text{Six}}\), and \(D_{\text{Eight}}\) as predictors.

fit.full <- lm(y ~ x1 + x2.dummy + x3.dummy)
summary(fit.full)

## 
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
## x1           -3.2056     0.7539  -4.252 0.000213 ***
## x2.dummy     -4.2556     1.3861  -3.070 0.004718 ** 
## x3.dummy     -6.0709     1.6523  -3.674 0.000999 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11

anova(fit.full)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## x1         1 847.73  847.73 129.6650 5.079e-12 ***
## x2.dummy   1   7.00    7.00   1.0713 0.3095128    
## x3.dummy   1  88.26   88.26  13.4999 0.0009992 ***
## Residuals 28 183.06    6.54                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(fit.full, level = 0.95)

##                 2.5 %    97.5 %
## (Intercept) 30.123824 37.857764
## x1          -4.749898 -1.661328
## x2.dummy    -7.094824 -1.416341
## x3.dummy    -9.455418 -2.686301

Reading the summary() output:

(Intercept): \(\hat{\beta}_0 \approx 34.0\) — estimated mean mpg for the 4-cylinder group when wt = 0 (a mathematical extrapolation, not a realistic car).

x1: \(\hat{\beta}_1 \approx -3.21\) — estimated change in mpg for each additional 1000 lbs, within any cylinder group (slope is shared).

x2.dummy: \(\hat{\beta}_2 \approx -4.26\) — estimated difference in mean mpg between 6-cylinder and 4-cylinder cars, holding weight constant.

x3.dummy: \(\hat{\beta}_3 \approx -6.07\) — estimated difference in mean mpg between 8-cylinder and 4-cylinder cars, holding weight constant.

All three coefficients are statistically significant (\(p < 0.01\)). Both dummy coefficients are negative, meaning that — at the same weight — cars with more cylinders get worse fuel economy.

Step 5 — Write the fitted equation and derive group-specific equations

From the Estimate column of summary(fit.full), substitute values into:

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 D_{\text{Six}} + \hat{\beta}_3 D_{\text{Eight}}\]

Then derive the three separate equations by substituting the dummy values:

coef(fit.full)   # extract all coefficient estimates

## (Intercept)          x1    x2.dummy    x3.dummy 
##   33.990794   -3.205613   -4.255582   -6.070860

# The three group equations are derived as follows:
# 4-cylinder (D_Six=0, D_Eight=0):
#   yhat = b0 + b1*x1

# 6-cylinder (D_Six=1, D_Eight=0):
#   yhat = (b0 + b2) + b1*x1

# 8-cylinder (D_Six=0, D_Eight=1):
#   yhat = (b0 + b3) + b1*x1

Interpretation example (from the actual fitted model): \(\hat{\beta}_0 \approx 34.0\), \(\hat{\beta}_1 \approx -3.21\), \(\hat{\beta}_2 \approx -4.26\), \(\hat{\beta}_3 \approx -6.07\).

4-cylinder: \(\hat{y} = 34.0 - 3.21\,x_1\)

6-cylinder: \(\hat{y} = 29.7 - 3.21\,x_1\) (intercept decreases by 4.26)

8-cylinder: \(\hat{y} = 27.9 - 3.21\,x_1\) (intercept decreases by 6.07)

Interpretation: “At the same weight, 8-cylinder cars are estimated to get about 6.07 fewer miles per gallon on average than 4-cylinder cars.”

Step 6 — Plot the three fitted lines

# Compute fitted values
yhat <- fitted(fit.full)

# Get coefficient estimates
b <- coef(fit.full)
b0 <- b["(Intercept)"]
b1 <- b["x1"]
b2 <- b["x2.dummy"]
b3 <- b["x3.dummy"]

x1_range <- seq(min(x1), max(x1), length.out = 100)

# Base scatterplot coloured by cylinder group
plot(x1, y,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     pch  = 16, col = as.integer(x2),
     main = "Fitted Regression Lines by Cylinder Group")

# Add a fitted line for each group
lines(x1_range, b0            + b1 * x1_range, col = 1, lwd = 2)  # 4-cylinder
lines(x1_range, b0 + b2       + b1 * x1_range, col = 2, lwd = 2)  # 6-cylinder
lines(x1_range, b0       + b3 + b1 * x1_range, col = 3, lwd = 2)  # 8-cylinder

legend("topright",
       legend = c("4-cylinder", "6-cylinder", "8-cylinder"),
       col    = 1:3, lty = 1, lwd = 2, pch = 16, bty = "n")

What this plot shows: Three parallel lines — same slope, different intercepts. The vertical gap between any two lines equals the corresponding dummy coefficient. If the lines were not parallel, that would suggest an interaction between cylinders and weight (a more advanced model not covered here).

Step 7 — Plot observed vs. predicted

plot(x1, y,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     pch  = 1, main = "Observed vs. Predicted MPG")
points(x1, yhat, pch = 20, col = "red")
legend("topright",
       legend = c("Observed y", "Predicted ŷ"),
       pch    = c(1, 20), col = c("black", "red"), bty = "n")

Assignment: Student Performance — Part 1 (Dummy Coding and Interpretation)

For this assignment we use a simulated dataset of 48 students. It is generated directly in R with set.seed(), so it is fully reproducible and requires no download.

Variables: Y (post-test score), Pretest (pre-test score), EQ (emotional quotient score), Method (teaching method: A, B, C, D).

set.seed(208251)

n <- 48
Method  <- factor(rep(c("A", "B", "C", "D"), each = n / 4))
Pretest <- round(rnorm(n, mean = 60, sd = 8), 1)
EQ      <- round(rnorm(n, mean = 70, sd = 7), 1)

# True (unknown to students) method effects, added as noise for realism
method_effect <- c(A = 0, B = 4, C = 9, D = 6)[as.character(Method)]

Y <- round(15 + 0.55 * Pretest + 0.30 * EQ + method_effect +
             rnorm(n, mean = 0, sd = 5), 1)

student_data <- data.frame(Y, Pretest, EQ, Method)
head(student_data)

##      Y Pretest   EQ Method
## 1 82.9    71.1 74.2      A
## 2 74.9    65.2 78.2      A
## 3 69.3    63.3 72.2      A
## 4 73.9    53.8 77.3      A
## 5 71.3    68.4 60.5      A
## 6 71.0    66.0 70.9      A

str(student_data)

## 'data.frame':    48 obs. of  4 variables:
##  $ Y      : num  82.9 74.9 69.3 73.9 71.3 71 50.3 64.7 61.1 65.6 ...
##  $ Pretest: num  71.1 65.2 63.3 53.8 68.4 66 47.1 57.9 51.4 52.9 ...
##  $ EQ     : num  74.2 78.2 72.2 77.3 60.5 70.9 58.9 65.2 70 67.7 ...
##  $ Method : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...

Explore the data:
1. Create scatterplots of \(Y\) vs. Pretest and \(Y\) vs. EQ. Describe each relationship (form, direction, strength).
2. Create a boxplot of \(Y\) by Method. Do the teaching methods appear to differ in mean score?
Method is a qualitative variable with \(k = 4\) groups. Create \(k - 1 = 3\) dummy variables. Clearly state:
1. Which group is the reference category.
2. The complete dummy coding table.
3. The R code used to create the dummies and verify them.
Fit the full model: \(Y = \beta_0 + \beta_1 \cdot \text{Pretest} + \beta_2 \cdot \text{EQ} + \beta_3 D_B + \beta_4 D_C + \beta_5 D_D + \varepsilon\). Show the R output and write the fitted equation.
Interpret each coefficient in the context of the problem. Pay particular attention to the dummy coefficients — what do they tell you about the effect of teaching method?
Write the separate fitted equation for students taught by each of Method A, B, C, and D. Comment on how the methods compare.
Plot all four fitted lines on a single scatterplot of \(Y\) vs. Pretest, with points coloured by Method.

Lab 5: Variable Selection and Model Diagnostics

Objectives: Students are able to use R to:

Apply automatic variable selection methods (forward, backward, stepwise)
Understand and compare AIC-based selection criteria
Perform a comprehensive model diagnostic analysis
Interpret diagnostic plots and formal tests for regression assumptions
Identify influential observations and outliers
Make predictions using the selected model

Key concepts — Variable selection: When we have many candidate predictors, automatic selection methods help find a parsimonious model (few predictors, good fit). All three methods below use AIC (Akaike Information Criterion): lower AIC = better trade-off between fit and complexity.

Method	Starting model	Strategy
Forward selection	Intercept only	Add the predictor that reduces AIC the most, repeat
Backward elimination	Full model	Remove the predictor that reduces AIC the most, repeat
Stepwise (both)	Full model	Add or remove at each step — generally preferred

Important limitation: Automated selection does not guarantee the “best” model in any absolute sense. Always check whether the selected model makes practical sense and satisfies the regression assumptions.

Key concepts — Model diagnostics: Before using a fitted model for inference or prediction, we must verify four assumptions of the regression model \(Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \varepsilon_i\):

Assumption	What it means	How to check
Linearity	\(E(\varepsilon_i) = 0\); no pattern in residuals	Residuals vs. Fitted plot
Normality	\(\varepsilon_i \sim N(0, \sigma^2)\)	Histogram, Q-Q plot, Shapiro-Wilk test
Constant variance	\(\text{Var}(\varepsilon_i) = \sigma^2\) (homoscedasticity)	Scale-Location plot, Breusch-Pagan test
Independence	Residuals are uncorrelated	Residuals vs. order plot, Durbin-Watson test

Additionally, we check for influential observations — data points that have an unusually large effect on the fitted model.

Practice I: Fuel Economy by Weight and Cylinders (continued)

We continue with the mtcars dummy-variable model from Lab 4. The goal is now to select the best subset of predictors and validate the model assumptions.

Step 1 — Re-load data and recreate variables

data(mtcars)

y        <- mtcars$mpg
x1       <- mtcars$wt
x2.dummy <- ifelse(mtcars$cyl == 6, 1, 0)
x3.dummy <- ifelse(mtcars$cyl == 8, 1, 0)

Step 2 — Variable selection

null <- lm(y ~ 1)                           # intercept-only model
full <- lm(y ~ x1 + x2.dummy + x3.dummy)   # full model from Lab 4

# Forward selection
fw.fit <- step(null,
               scope     = list(lower = null, upper = full),
               direction = "forward",
               trace     = 1)

## Start:  AIC=115.94
## y ~ 1
## 
##            Df Sum of Sq     RSS     AIC
## + x1        1    847.73  278.32  73.217
## + x3.dummy  1    619.89  506.16  92.355
## <none>                  1126.05 115.943
## + x2.dummy  1      1.08 1124.96 117.913
## 
## Step:  AIC=73.22
## y ~ x1
## 
##            Df Sum of Sq    RSS    AIC
## + x3.dummy  1    33.635 244.69 71.096
## <none>                  278.32 73.217
## + x2.dummy  1     7.004 271.32 74.402
## 
## Step:  AIC=71.1
## y ~ x1 + x3.dummy
## 
##            Df Sum of Sq    RSS    AIC
## + x2.dummy  1    61.628 183.06 63.810
## <none>                  244.69 71.096
## 
## Step:  AIC=63.81
## y ~ x1 + x3.dummy + x2.dummy

# Backward elimination
be.fit <- step(full,
               direction = "backward",
               trace     = 1)

## Start:  AIC=63.81
## y ~ x1 + x2.dummy + x3.dummy
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  183.06 63.810
## - x2.dummy  1    61.628 244.69 71.096
## - x3.dummy  1    88.259 271.32 74.402
## - x1        1   118.204 301.26 77.752

# Stepwise regression (both directions) — generally preferred
sw.fit <- step(full,
               direction = "both",
               trace     = 1)

## Start:  AIC=63.81
## y ~ x1 + x2.dummy + x3.dummy
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  183.06 63.810
## - x2.dummy  1    61.628 244.69 71.096
## - x3.dummy  1    88.259 271.32 74.402
## - x1        1   118.204 301.26 77.752

Reading the step() output: Each row shows one candidate model. The columns are:

Df: degrees of freedom used by the variable being added/removed

AIC: AIC of the candidate model after the change

The algorithm selects the change that gives the lowest AIC. It stops when no change improves AIC.

Comparing the three methods: In this dataset, all three methods agree that all three predictors (x1, x2.dummy, x3.dummy) should be retained — the full model already has the lowest AIC. This won’t always happen, but it’s a useful reminder that selection sometimes confirms your original model rather than simplifying it.

Step 3 — Examine the selected model

# Use the stepwise-selected model
summary(sw.fit)    # coefficients, R², overall F-test

## 
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
## x1           -3.2056     0.7539  -4.252 0.000213 ***
## x2.dummy     -4.2556     1.3861  -3.070 0.004718 ** 
## x3.dummy     -6.0709     1.6523  -3.674 0.000999 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11

anova(sw.fit)      # ANOVA table: SS, MS, F for each term

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## x1         1 847.73  847.73 129.6650 5.079e-12 ***
## x2.dummy   1   7.00    7.00   1.0713 0.3095128    
## x3.dummy   1  88.26   88.26  13.4999 0.0009992 ***
## Residuals 28 183.06    6.54                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(sw.fit, level = 0.95)   # 95% CI for each coefficient

##                 2.5 %    97.5 %
## (Intercept) 30.123824 37.857764
## x1          -4.749898 -1.661328
## x2.dummy    -7.094824 -1.416341
## x3.dummy    -9.455418 -2.686301

Overall F-test (4-step procedure):

\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) (no predictors are useful) vs. \(H_a\): at least one \(\beta_j \neq 0\)

\(F = MS_{reg} / MS_{res}\), with \(df_1 = p\) and \(df_2 = n - p - 1\)

p-value from the F-statistic line in summary()

Conclusion in context

Step 4 — Comprehensive model diagnostics

R’s built-in plot() for lm objects produces four diagnostic plots at once. Understanding each one is essential.

par(mfrow = c(2, 2))
plot(sw.fit)

par(mfrow = c(1, 1))

Interpreting the four diagnostic plots:

Plot 1 — Residuals vs. Fitted - Checks: linearity and constant variance - Good: points scatter randomly around the horizontal zero line, with roughly equal spread across all fitted values - Warning signs: a curve (non-linearity) or a funnel/fan shape (non-constant variance)

Plot 2 — Normal Q-Q - Checks: normality of residuals - Good: points follow the diagonal reference line closely - Warning signs: S-shaped curve (heavy tails), points bending away at the ends

Plot 3 — Scale-Location (√|standardised residuals| vs. Fitted) - Checks: constant variance (more sensitive than Plot 1) - Good: a roughly horizontal red line with points evenly spread - Warning signs: a clear upward or downward trend in the red line

Plot 4 — Residuals vs. Leverage - Checks: influential observations - Points with high leverage AND large residuals are most dangerous - Points outside the Cook’s distance dashed lines (if visible) are highly influential

Step 5 — Formal tests for each assumption

res <- resid(sw.fit)

# --- Normality ---
# Visual: histogram and Q-Q plot side by side
par(mfrow = c(1, 2))
hist(res, main = "Histogram of Residuals",
     xlab = "Residuals", col = "lightblue", border = "white",
     freq = FALSE)                        # density scale
curve(dnorm(x, mean = mean(res), sd = sd(res)),
      add = TRUE, col = "red", lwd = 2)  # overlay normal curve

qqnorm(res, main = "Normal Q-Q Plot", pch = 16, col = "steelblue")
qqline(res, col = "red", lwd = 2)

par(mfrow = c(1, 1))

# Shapiro-Wilk test (best for n < 50; use for up to ~2000 obs)
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.95907, p-value = 0.259

# H0: residuals are normally distributed
# Satisfied if p-value > 0.05  (here p ≈ 0.26 — satisfied)

# --- Constant variance (homoscedasticity) ---
# install.packages("lmtest")   # run once if not yet installed
library(lmtest)
bptest(sw.fit)

## 
##  studentized Breusch-Pagan test
## 
## data:  sw.fit
## BP = 6.6551, df = 3, p-value = 0.08374

# H0: variance is constant (homoscedastic)
# Satisfied if p-value > 0.05  (here p ≈ 0.08 — satisfied, though borderline)

# --- Independence (no autocorrelation) ---
dwtest(sw.fit)

## 
##  Durbin-Watson test
## 
## data:  sw.fit
## DW = 1.8064, p-value = 0.2188
## alternative hypothesis: true autocorrelation is greater than 0

# H0: residuals are not autocorrelated
# Satisfied if p-value > 0.05 (DW statistic close to 2 is ideal; here DW ≈ 1.81)

Summary of formal tests:

Test \(H_0\) Assumption satisfied if

Shapiro-Wilk Residuals are normally distributed p-value > 0.05

Breusch-Pagan Variance is constant (homoscedastic) p-value > 0.05

Durbin-Watson No autocorrelation in residuals p-value > 0.05

If an assumption is violated, consider:

Violation Possible remedy

Non-linearity Add a polynomial term (e.g. \(X^2\)) or transform \(X\)

Non-constant variance Apply log or square-root transformation to \(Y\)

Non-normality Transform \(Y\); check for outliers

Autocorrelation Include a time trend; use time-series methods

Test	\(H_0\)	Assumption satisfied if
Shapiro-Wilk	Residuals are normally distributed	p-value > 0.05
Breusch-Pagan	Variance is constant (homoscedastic)	p-value > 0.05
Durbin-Watson	No autocorrelation in residuals	p-value > 0.05

Violation	Possible remedy
Non-linearity	Add a polynomial term (e.g. \(X^2\)) or transform \(X\)
Non-constant variance	Apply log or square-root transformation to \(Y\)
Non-normality	Transform \(Y\); check for outliers
Autocorrelation	Include a time trend; use time-series methods

Step 6 — Checking for influential observations

An observation can be unusual in two ways:

Outlier: unusual response value (large residual)
High-leverage point: unusual predictor value (far from the centre of the \(X\) space)
Influential point: changes the fitted model substantially if removed (high Cook’s distance)

# Standardised residuals — values beyond ±2 are potential outliers
std.res <- rstandard(sw.fit)
which(abs(std.res) > 2)   # row indices of potential outliers (e.g. Fiat 128, Toyota Corolla)

## 18 20 
## 18 20

# Leverage values — rule of thumb: high if > 2*(p+1)/n
n <- nrow(mtcars)
p <- length(coef(sw.fit)) - 1   # number of predictors
hat.vals <- hatvalues(sw.fit)
which(hat.vals > 2 * (p + 1) / n)

## named integer(0)

# Cook's distance — rule of thumb: influential if > 4/n
cooks <- cooks.distance(sw.fit)
which(cooks > 4 / n)   # e.g. Chrysler Imperial, Fiat 128, Toyota Corolla

## 17 18 20 
## 17 18 20

# Combined influence plot
plot(sw.fit, which = 4,
     main = "Cook's Distance",
     sub  = paste("Threshold = 4/n =", round(4/n, 4)))
abline(h = 4/n, col = "red", lty = 2)

What to do with influential observations: Do NOT automatically delete them. First: 1. Check whether the data point was recorded correctly (data entry error?). 2. Fit the model with and without the point and compare — if results change substantially, report both. 3. If justified, remove the point and note it in your report.

Step 7 — Prediction using the selected model

# Predict mpg for a new car:
# wt = 3.0 (thousand lbs), cyl = 6 (D_Six=1, D_Eight=0)
new_obs <- data.frame(x1       = 3.0,
                      x2.dummy = 1,
                      x3.dummy = 0)

# Point estimate
predict(sw.fit, newdata = new_obs)

##        1 
## 20.11837

# 95% confidence interval for the mean mpg at these values
predict(sw.fit, newdata = new_obs, interval = "confidence", level = 0.95)

##        fit     lwr      upr
## 1 20.11837 18.1305 22.10625

# 95% prediction interval for a single new car
predict(sw.fit, newdata = new_obs, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 20.11837 14.51622 25.72052

Assignment: Student Performance — Part 2 (Variable Selection and Diagnostics)

This assignment continues directly from Lab 4 Assignment. Re-generate the same simulated dataset (the set.seed() guarantees you get identical data) and use the dummy variables you created there.

set.seed(208251)

n <- 48
Method  <- factor(rep(c("A", "B", "C", "D"), each = n / 4))
Pretest <- round(rnorm(n, mean = 60, sd = 8), 1)
EQ      <- round(rnorm(n, mean = 70, sd = 7), 1)
method_effect <- c(A = 0, B = 4, C = 9, D = 6)[as.character(Method)]
Y <- round(15 + 0.55 * Pretest + 0.30 * EQ + method_effect +
             rnorm(n, mean = 0, sd = 5), 1)

student_data <- data.frame(Y, Pretest, EQ, Method)

Apply forward selection, backward elimination, and stepwise regression to select the best predictors of \(Y\) from {Pretest, EQ, D_B, D_C, D_D}. Compare the three methods — do they agree?
Write the fitted equation from the stepwise-selected model. Interpret each coefficient in context.
Test the overall fit of the selected model at \(\alpha = 0.05\) using the 4-step procedure. Report \(R^2\) and adjusted \(R^2\) and explain what they tell you.
Produce R’s four built-in diagnostic plots (plot(model)). For each plot, describe what you see and state what assumption it is checking.
Carry out the three formal assumption tests:
1. Shapiro-Wilk test for normality
2. Breusch-Pagan test for constant variance
3. Durbin-Watson test for independence
For each test: state \(H_0\), report the test statistic and p-value, and conclude whether the assumption is satisfied.
Check for influential observations using Cook’s distance. Report any observations that exceed the threshold \(4/n\) and describe what you would do with them.
Find the 90% and 95% confidence intervals for each regression coefficient in the selected model.
Predict the post-test score for a new student with Pretest = 60, EQ = 70, taught by Method C. Give both a 95% confidence interval and a 95% prediction interval, and explain the difference.

Lab 6: Non-Parametric Statistics I

Objectives: Students are able to:

Perform descriptive statistics
Apply appropriate non-parametric tests to answer research questions

When to use non-parametric tests: Use non-parametric methods when:

The sample size is small and normality cannot be assumed
Data are ordinal (ranked) rather than continuous
Assumptions of the corresponding parametric test are clearly violated

Parametric test	Non-parametric equivalent	Use case
One-sample t-test	Wilcoxon signed-rank test	One sample, test about median
Independent samples t-test	Mann-Whitney U (Wilcoxon rank-sum)	Two independent samples
Paired t-test	Wilcoxon signed-rank test (paired)	Two related samples
One-way ANOVA	Kruskal-Wallis test	Three or more independent samples
Repeated-measures ANOVA	Friedman test	Three or more related samples
Pearson correlation	Spearman rank correlation	Association, non-normal data

Practice I: Fuel Economy and Car Characteristics

Variables: mpg, qsec, hp, wt, am (transmission), cyl (cylinders)

Load and prepare data

data(mtcars)
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

# Convert qualitative variables to factors
mtcars$am  <- factor(mtcars$am,  labels = c("automatic", "manual"))
mtcars$cyl <- factor(mtcars$cyl)

Question A: Is the median mpg equal to 20?

One sample, test about a median → Wilcoxon signed-rank test

summary(mtcars$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

sd(mtcars$mpg)

## [1] 6.026948

par(mfrow = c(1, 2))
boxplot(mtcars$mpg, main = "Boxplot of MPG", ylab = "MPG")
hist(mtcars$mpg, main = "Histogram of MPG",
     xlab = "MPG", col = "lightblue", border = "white")

par(mfrow = c(1, 1))

# H0: median mpg = 20    Ha: median mpg ≠ 20
wilcox.test(x = mtcars$mpg, mu = 20, alternative = "two.sided")

## 
##  Wilcoxon signed rank exact test
## 
## data:  mtcars$mpg
## V = 249, p-value = 0.7855
## alternative hypothesis: true location is not equal to 20

Interpretation: The Wilcoxon signed-rank test tests \(H_0\): population median = mu. - p-value < 0.05: reject \(H_0\); the median mpg is significantly different from 20. - p-value ≥ 0.05: insufficient evidence to conclude the median differs from 20.

Question B: Is the median quarter-mile time (qsec) less than 18 seconds?

summary(mtcars$qsec)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.50   16.89   17.71   17.85   18.90   22.90

par(mfrow = c(1, 2))
boxplot(mtcars$qsec, ylab = "Quarter-mile time (sec)", main = "Boxplot")
hist(mtcars$qsec, xlab = "Quarter-mile time (sec)",
     main = "Histogram", col = "lightblue", border = "white")

par(mfrow = c(1, 1))

# H0: median = 18    Ha: median < 18
wilcox.test(x = mtcars$qsec, mu = 18, alternative = "less")

## 
##  Wilcoxon signed rank exact test
## 
## data:  mtcars$qsec
## V = 225.5, p-value = 0.2427
## alternative hypothesis: true location is less than 18

Practice II: Comparing Three Timing Methods (Case Study E — `RoundingTimes`)

This example uses a classic teaching dataset for the Friedman test, distributed with R’s own documentation (?friedman.test). It records how long it took 22 different operators to round the corner of a piece using three measurement methods: “Round Out”, “Narrow Angle”, and “Wide Angle”. Because each operator used all three methods, this is a paired (repeated measures) design.

Step 1 — Enter the data directly

RoundingTimes <-
  matrix(c(5.40, 5.50, 5.55,
           5.85, 5.70, 5.75,
           5.20, 5.60, 5.50,
           5.55, 5.50, 5.40,
           5.90, 5.85, 5.70,
           5.45, 5.55, 5.60,
           5.40, 5.40, 5.35,
           5.45, 5.50, 5.35,
           5.25, 5.15, 5.00,
           5.85, 5.80, 5.70,
           5.25, 5.20, 5.10,
           5.65, 5.55, 5.45,
           5.60, 5.35, 5.45,
           5.05, 5.00, 4.95,
           5.50, 5.50, 5.40,
           5.45, 5.55, 5.50,
           5.55, 5.55, 5.35,
           5.45, 5.50, 5.55,
           5.50, 5.45, 5.25,
           5.65, 5.60, 5.40,
           5.70, 5.65, 5.55,
           6.30, 6.30, 6.25),
         nrow = 22,
         byrow = TRUE,
         dimnames = list(1 : 22,
                          c("RoundOut", "NarrowAngle", "WideAngle")))

rt <- as.data.frame(RoundingTimes)
head(rt)

##   RoundOut NarrowAngle WideAngle
## 1     5.40        5.50      5.55
## 2     5.85        5.70      5.75
## 3     5.20        5.60      5.50
## 4     5.55        5.50      5.40
## 5     5.90        5.85      5.70
## 6     5.45        5.55      5.60

Step 2 — Explore data

summary(rt)

##     RoundOut      NarrowAngle      WideAngle    
##  Min.   :5.050   Min.   :5.000   Min.   :4.950  
##  1st Qu.:5.412   1st Qu.:5.463   1st Qu.:5.350  
##  Median :5.500   Median :5.525   Median :5.450  
##  Mean   :5.543   Mean   :5.534   Mean   :5.459  
##  3rd Qu.:5.650   3rd Qu.:5.600   3rd Qu.:5.550  
##  Max.   :6.300   Max.   :6.300   Max.   :6.250

boxplot(rt,
        xlab = "Measurement Method", ylab = "Time (sec)",
        main = "Rounding Time by Method",
        col  = c("lightblue", "lightgreen", "lightyellow"))

Question A: Is there a difference in timing between “Round Out” and “Narrow Angle”?

# H0: median(RoundOut) = median(NarrowAngle)    Ha: ≠
wilcox.test(x           = rt$RoundOut,
            y           = rt$NarrowAngle,
            paired      = TRUE,
            alternative = "two.sided")

## 
##  Wilcoxon signed rank exact test
## 
## data:  rt$RoundOut and rt$NarrowAngle
## V = 154.5, p-value = 0.288
## alternative hypothesis: true location shift is not equal to 0

Question B: Is the timing under “Narrow Angle” greater than under “Wide Angle”?

# H0: median(NarrowAngle) = median(WideAngle)    Ha: NarrowAngle > WideAngle
wilcox.test(x           = rt$NarrowAngle,
            y           = rt$WideAngle,
            paired      = TRUE,
            alternative = "greater")

## 
##  Wilcoxon signed rank exact test
## 
## data:  rt$NarrowAngle and rt$WideAngle
## V = 222, p-value = 0.0005486
## alternative hypothesis: true location shift is greater than 0

Why paired = TRUE? Each operator produced a time under all three methods, so observations are matched by operator. Using paired = TRUE accounts for individual differences between operators and increases the power of the test.

Lab 7: Non-Parametric Statistics II

Objectives: Students are able to:

Perform descriptive statistics
Apply appropriate non-parametric tests (Mann-Whitney, Kruskal-Wallis, Spearman, Chi-squared, Friedman)

Practice I: Fuel Economy and Car Characteristics (continued)

Load and prepare data

data(mtcars)

mtcars$am  <- factor(mtcars$am,  labels = c("automatic", "manual"))
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Question a: Is there a difference in average fuel economy between automatic and manual cars?

Two independent groups → Mann-Whitney U test (Wilcoxon rank-sum, paired = FALSE)

tapply(mtcars$mpg, mtcars$am, summary)

## $automatic
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   14.95   17.30   17.15   19.20   24.40 
## 
## $manual
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   21.00   22.80   24.39   30.40   33.90

boxplot(mtcars$mpg ~ mtcars$am,
        xlab = "Transmission", ylab = "MPG",
        main = "MPG by Transmission",
        col  = c("lightpink", "lightblue"))

# H0: median mpg equal for automatic and manual    Ha: ≠
wilcox.test(x           = mtcars$mpg[mtcars$am == "automatic"],
            y           = mtcars$mpg[mtcars$am == "manual"],
            paired      = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum exact test
## 
## data:  mtcars$mpg[mtcars$am == "automatic"] and mtcars$mpg[mtcars$am == "manual"]
## W = 42, p-value = 0.001159
## alternative hypothesis: true location shift is not equal to 0

Question b: Is fuel economy related to number of cylinders?

Three or more independent groups → Kruskal-Wallis test

tapply(mtcars$mpg, mtcars$cyl, summary)

## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.40   22.80   26.00   26.66   30.40   33.90 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.80   18.65   19.70   19.74   21.00   21.40 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   14.40   15.20   15.10   16.25   19.20

boxplot(mtcars$mpg ~ mtcars$cyl,
        xlab = "Number of Cylinders", ylab = "MPG",
        main = "MPG by Cylinders")

# H0: median mpg equal across all cylinder groups    Ha: at least one differs
kruskal.test(mtcars$mpg ~ mtcars$cyl)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  mtcars$mpg by mtcars$cyl
## Kruskal-Wallis chi-squared = 25.746, df = 2, p-value = 2.566e-06

Post-hoc testing: If the Kruskal-Wallis test is significant (p < 0.05), we know at least one group differs but not which pairs. Use pairwise Wilcoxon tests with a Bonferroni correction:

pairwise.wilcox.test(mtcars$mpg, mtcars$cyl,
                     p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  mtcars$mpg and mtcars$cyl 
## 
##   4       6      
## 6 0.00038 -      
## 8 1.3e-06 0.00072
## 
## P value adjustment method: bonferroni

Question c: Is there a (monotone) association between horsepower and quarter-mile time?

Two quantitative variables, non-normal data → Spearman rank correlation

plot(mtcars$hp, mtcars$qsec,
     xlab = "Horsepower", ylab = "Quarter-mile Time (sec)",
     main = "Quarter-mile Time vs. Horsepower",
     pch = 16, col = "steelblue")

# H0: ρ_s = 0 (no monotone association)    Ha: ρ_s ≠ 0
cor.test(~ hp + qsec,
         data        = mtcars,
         method      = "spearman",
         alternative = "two.sided")

## 
##  Spearman's rank correlation rho
## 
## data:  hp and qsec
## S = 9093, p-value = 3.105e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.666606

Spearman vs. Pearson: Spearman measures monotone association (consistently increasing or decreasing), not just linear. It is more robust to outliers and does not require normality.

Question d: Does displacement tend to differ across gear counts?

tapply(mtcars$disp, mtcars$gear, summary)

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   120.1   275.8   318.0   326.3   380.0   472.0 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   71.10   78.92  130.90  123.02  160.00  167.60 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    95.1   120.3   145.0   202.5   301.0   351.0

boxplot(mtcars$disp ~ mtcars$gear,
        xlab = "Number of Gears", ylab = "Displacement (cu. in.)",
        main = "Displacement by Number of Gears")

kruskal.test(mtcars$disp ~ mtcars$gear)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  mtcars$disp by mtcars$gear
## Kruskal-Wallis chi-squared = 16.578, df = 2, p-value = 0.0002513

# Post-hoc if significant:
pairwise.wilcox.test(mtcars$disp, mtcars$gear,
                     p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  mtcars$disp and mtcars$gear 
## 
##   3       4   
## 4 1.5e-05 -   
## 5 0.24    0.95
## 
## P value adjustment method: bonferroni

Question e: Is there a relationship between transmission type and number of cylinders?

Two categorical variables → Chi-squared test of independence

tab <- table(mtcars$am, mtcars$cyl)
tab

##            
##              4  6  8
##   automatic  3  4 12
##   manual     8  3  2

prop.table(tab, margin = 1)   # row percentages

##            
##                     4         6         8
##   automatic 0.1578947 0.2105263 0.6315789
##   manual    0.6153846 0.2307692 0.1538462

# H0: transmission type and cylinders are independent    Ha: they are associated
chisq.test(mtcars$am, mtcars$cyl)

## 
##  Pearson's Chi-squared test
## 
## data:  mtcars$am and mtcars$cyl
## X-squared = 8.7407, df = 2, p-value = 0.01265

Check the chi-squared conditions: All expected cell counts should be ≥ 5.

chisq.test(mtcars$am, mtcars$cyl)$expected

##            mtcars$cyl
## mtcars$am         4       6      8
##   automatic 6.53125 4.15625 8.3125
##   manual    4.46875 2.84375 5.6875

# If any expected count < 5, use Fisher's exact test:
fisher.test(tab)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 0.009105
## alternative hypothesis: two.sided

Practice II: Comparing Three Timing Methods — Friedman Test

Design: Each operator timed the piece using all three methods → repeated measures with 3+ groups → Friedman test.

RoundingTimes <-
  matrix(c(5.40, 5.50, 5.55,
           5.85, 5.70, 5.75,
           5.20, 5.60, 5.50,
           5.55, 5.50, 5.40,
           5.90, 5.85, 5.70,
           5.45, 5.55, 5.60,
           5.40, 5.40, 5.35,
           5.45, 5.50, 5.35,
           5.25, 5.15, 5.00,
           5.85, 5.80, 5.70,
           5.25, 5.20, 5.10,
           5.65, 5.55, 5.45,
           5.60, 5.35, 5.45,
           5.05, 5.00, 4.95,
           5.50, 5.50, 5.40,
           5.45, 5.55, 5.50,
           5.55, 5.55, 5.35,
           5.45, 5.50, 5.55,
           5.50, 5.45, 5.25,
           5.65, 5.60, 5.40,
           5.70, 5.65, 5.55,
           6.30, 6.30, 6.25),
         nrow = 22,
         byrow = TRUE,
         dimnames = list(1 : 22,
                          c("RoundOut", "NarrowAngle", "WideAngle")))
str(RoundingTimes)

##  num [1:22, 1:3] 5.4 5.85 5.2 5.55 5.9 5.45 5.4 5.45 5.25 5.85 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:22] "1" "2" "3" "4" ...
##   ..$ : chr [1:3] "RoundOut" "NarrowAngle" "WideAngle"

Question: Are timing results affected by the measurement method?

rt <- as.data.frame(RoundingTimes)
boxplot(rt,
        xlab = "Measurement Method", ylab = "Time (sec)",
        main = "Rounding Time by Method",
        col  = c("lightblue", "lightgreen", "lightyellow"))

# H0: distributions of time are the same under all three methods
# Ha: at least one method leads to a different distribution
friedman.test(RoundingTimes)

## 
##  Friedman rank sum test
## 
## data:  RoundingTimes
## Friedman chi-squared = 11.143, df = 2, p-value = 0.003805

Friedman test: The non-parametric equivalent of one-way repeated-measures ANOVA. It ranks observations within each block (operator) and tests whether rank sums differ across methods. Here the test is significant (p ≈ 0.004): the three methods are not all the same.

# Post-hoc pairwise comparisons (if Friedman test is significant)
pairwise.wilcox.test(as.vector(RoundingTimes),
                     rep(colnames(RoundingTimes), each = nrow(RoundingTimes)),
                     paired          = TRUE,
                     p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using Wilcoxon signed rank exact test 
## 
## data:  as.vector(RoundingTimes) and rep(colnames(RoundingTimes), each = nrow(RoundingTimes)) 
## 
##           NarrowAngle RoundOut
## RoundOut  0.8641      -       
## WideAngle 0.0033      0.0365  
## 
## P value adjustment method: bonferroni

Assignment: Non-Parametric Tests — Fuel Economy Data

Continue using the mtcars data (re-load it fresh with data(mtcars) and re-create the factors as in Lab 7 Practice I). For each test, follow the 4-step procedure: state hypotheses, give the test statistic and p-value, state your conclusion in context.

Is the median mpg greater than 20?
Is there a difference in quarter-mile time (qsec) between automatic and manual cars?
Is the median horsepower (hp) different across gear-count groups? If so, which groups differ significantly? (Use Bonferroni correction.)
Is there a significant positive monotone association between weight (wt) and horsepower (hp)?
Is there a relationship between number of cylinders and number of gears? (Check the chi-squared expected count conditions first.)

End of Lab Materials

208251 — Regression Analysis and Non-Parametric Statistics: Lab Materials

Donlapark Ponnoprat
Department of Statistics
Faculty of Science, Chiang Mai University

July 02, 2026

Lab 1: Correlation Analysis

Practice I: Cherry Tree Volume (Case Study A — `trees`)

Practice II: Car Weight and Fuel Economy (Case Study B — `mtcars`)

Assignment: Employee Survey Ratings (Case Study C — `attitude`)

Lab 2: Simple Linear Regression

Practice I: Cherry Tree Volume (continued)

Practice II: Car Weight and Fuel Economy (continued)

Assignment: Employee Survey Ratings (continued)

Lab 3: Multiple Linear Regression

Practice: Predicting Fuel Economy from Several Car Characteristics

Assignment: Predicting Fuel Economy (continued)

Lab 4: Multiple Linear Regression with Dummy Variables

Practice I: Fuel Economy by Weight and Number of Cylinders

Assignment: Student Performance — Part 1 (Dummy Coding and Interpretation)

Lab 5: Variable Selection and Model Diagnostics

Practice I: Fuel Economy by Weight and Cylinders (continued)

Assignment: Student Performance — Part 2 (Variable Selection and Diagnostics)

Lab 6: Non-Parametric Statistics I

Practice I: Fuel Economy and Car Characteristics

Practice II: Comparing Three Timing Methods (Case Study E — `RoundingTimes`)

Lab 7: Non-Parametric Statistics II

Practice I: Fuel Economy and Car Characteristics (continued)

Practice II: Comparing Three Timing Methods — Friedman Test

Assignment: Non-Parametric Tests — Fuel Economy Data

208251 — Regression Analysis and Non-Parametric Statistics: Lab Materials

Donlapark Ponnoprat Department of Statistics Faculty of Science, Chiang Mai University

July 02, 2026

Lab 1: Correlation Analysis

Practice I: Cherry Tree Volume (Case Study A — trees)

Practice II: Car Weight and Fuel Economy (Case Study B — mtcars)

Assignment: Employee Survey Ratings (Case Study C — attitude)

Lab 2: Simple Linear Regression

Practice I: Cherry Tree Volume (continued)

Practice II: Car Weight and Fuel Economy (continued)

Assignment: Employee Survey Ratings (continued)

Lab 3: Multiple Linear Regression

Practice: Predicting Fuel Economy from Several Car Characteristics

Assignment: Predicting Fuel Economy (continued)

Lab 4: Multiple Linear Regression with Dummy Variables

Practice I: Fuel Economy by Weight and Number of Cylinders

Assignment: Student Performance — Part 1 (Dummy Coding and Interpretation)

Lab 5: Variable Selection and Model Diagnostics

Practice I: Fuel Economy by Weight and Cylinders (continued)

Assignment: Student Performance — Part 2 (Variable Selection and Diagnostics)

Lab 6: Non-Parametric Statistics I

Practice I: Fuel Economy and Car Characteristics

Practice II: Comparing Three Timing Methods (Case Study E — RoundingTimes)

Lab 7: Non-Parametric Statistics II

Practice I: Fuel Economy and Car Characteristics (continued)

Practice II: Comparing Three Timing Methods — Friedman Test

Assignment: Non-Parametric Tests — Fuel Economy Data

Donlapark Ponnoprat
Department of Statistics
Faculty of Science, Chiang Mai University

Practice I: Cherry Tree Volume (Case Study A — `trees`)

Practice II: Car Weight and Fuel Economy (Case Study B — `mtcars`)

Assignment: Employee Survey Ratings (Case Study C — `attitude`)

Practice II: Comparing Three Timing Methods (Case Study E — `RoundingTimes`)