📂 Data files — Download all datasets before starting any lab:
⬇ Download all data files from Google Drive
Save all files to one folder on your computer (e.g.
C:\Users\YourName\Downloads), then update thesetwd()path in each lab to match.
How to use these lab materials
- Before your first lab: Download all data files and save them to one folder on your computer.
- Read the explanation text before running each code chunk.
- Pay attention to interpretation notes — these help you understand what the output means statistically.
- In each Practice section you will see worked examples; the Assignment section is for you to complete independently.
- Update the
setwd()path in each lab’s import step to point to the folder where you saved the data files.
Objectives: Students are able to use R to:
Key concepts: The Pearson correlation coefficient \(r\) measures the strength and direction of the linear relationship between two quantitative variables \(X\) and \(Y\). It ranges from −1 to +1:
\[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i-\bar{x})^2 \; \sum(y_i-\bar{y})^2}}\]
| Value of \(|r|\) | Interpretation |
|---|---|
| 0.00 – 0.19 | Very weak |
| 0.20 – 0.39 | Weak |
| 0.40 – 0.59 | Moderate |
| 0.60 – 0.79 | Strong |
| 0.80 – 1.00 | Very strong |
The sign of \(r\) indicates direction: positive means both variables increase together; negative means one increases as the other decreases.
Important: Correlation measures linear association only. A correlation of 0 does not necessarily mean there is no relationship — only that there is no linear relationship.
You can download the data file from Mango CMU → Lab Data.
Data file: SalePrice.csv
Variables:
Area(square metres) andSale.price(Baht).
Research question: Is there a significant linear correlation between house area and sale price?
Step 1 — Set your working directory and import data
📂 Data file needed:
SalePrice.csv— Download from Google Drive
If your data file is in Excel (.xls), save it as .csv first:
Open file → File → Save As → choose CSV format.
# Set working directory to the folder where you saved the data files.
# Example for Windows — replace the path with your own:
setwd("C:\\Users\\YourName\\Downloads")
# Read the CSV file
Mydata <- read.csv("SalePrice.csv", header = TRUE)
# Preview the data
head(Mydata) # first 6 rows
str(Mydata) # variable types and dimensions
summary(Mydata) # basic descriptive statisticsNote:
str()tells you how many observations (rows) and variables (columns) are in your data, and what type each variable is (numeric, character, factor, etc.).
Step 2 — Descriptive statistics
x <- Mydata$Area # independent variable (predictor)
y <- Mydata$Sale.price # dependent variable (response)
# Summary statistics
summary(x)
summary(y)
# Standard deviation
sd(x)
sd(y)Step 3 — Create a scatterplot
A scatterplot is always the first step — it reveals the form, direction, and strength of the relationship before any numerical summary.
plot(x, y,
xlab = "Area (sq.m.)",
ylab = "Sale Price (Baht)",
main = "Scatterplot: Sale Price vs. Area",
pch = 16, # filled circle points
col = "steelblue")What to look for in the scatterplot:
- Form: Is the pattern linear (straight) or curved?
- Direction: Positive (both increase together) or negative (one increases as the other decreases)?
- Strength: Are the points tightly clustered around a line (strong) or widely scattered (weak)?
- Outliers: Are there any unusual observations far from the general pattern?
Step 4 — Compute the correlation coefficient
Interpretation: Report the value of \(r\) and describe its sign and magnitude using the reference table above. For example: “The sample correlation coefficient is \(r = 0.87\), indicating a very strong positive linear relationship between area and sale price.”
Step 5 — Hypothesis test for the population correlation
We test whether there is a statistically significant linear correlation in the population:
\[H_0: \rho = 0 \quad \text{(no linear correlation)} \qquad \text{vs.} \qquad H_a: \rho \neq 0\]
The test statistic is:
\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \text{ under } H_0\]
Interpretation of
cor.test()output:
Output field What it means corSample correlation coefficient \(r\) tTest statistic value dfDegrees of freedom (\(n - 2\)) p-valueReject \(H_0\) if p-value \(< \alpha = 0.05\) 95 percent confidence interval95% CI for the population correlation \(\rho\) 4-step conclusion template:
1. \(H_0: \rho = 0\) vs. \(H_a: \rho \neq 0\)
2. \(t = \ldots\), \(df = \ldots\)
3. \(p\text{-value} = \ldots\)
4. Since p-value [< / ≥] 0.05, we [reject / fail to reject] \(H_0\). There [is / is not] sufficient evidence of a significant linear correlation between area and sale price.
Variables:
Age,Income,Gender,Education,Job.Satisfaction
Research question: Is there a significant linear correlation between age and income?
Step 1 — Import data
📂 Data file needed:
JobSatisfaction.csv— Download from Google Drive
Step 2 — Explore the data
str(Mydata2)
# Frequency table for Gender
table1 <- table(Mydata2$Gender)
table1
prop.table(table1) # proportions (multiply by 100 for percentages)
# Two-way frequency table: Education by Gender
table2 <- table(Mydata2$Education, Mydata2$Gender)
table2
prop.table(table2)
# Descriptive statistics for quantitative variables
summary(Mydata2$Age)
summary(Mydata2$Income)
# Boxplots
par(mfrow = c(1, 2))
boxplot(Mydata2$Income,
main = "Distribution of Income", ylab = "Income (Baht)")
boxplot(Mydata2$Income ~ Mydata2$Gender,
xlab = "Gender", ylab = "Income",
main = "Income by Gender",
col = c("lightpink", "lightblue"))
par(mfrow = c(1, 1))
boxplot(Mydata2$Income ~ Mydata2$Job.Satisfaction,
xlab = "Job Satisfaction", ylab = "Income",
main = "Income by Job Satisfaction Level")Interpretation: Boxplots let you compare distributions across groups. Look for differences in medians, spread (IQR), and the presence of outliers.
Step 3 — Scatterplot
x2 <- Mydata2$Age
y2 <- Mydata2$Income
plot(x2, y2,
xlab = "Age (years)", ylab = "Income (Baht)",
main = "Scatterplot: Income vs. Age",
pch = 16, col = "steelblue")Step 4 — Correlation coefficient and hypothesis test
# Sample correlation
cor(x2, y2)
# Hypothesis test: H0: rho = 0 vs Ha: rho ≠ 0
cor.test(x2, y2, alternative = "two.sided", conf.level = 0.95)A statistics teacher wants to examine the relationship between students’ midterm and final exam scores.
📂 Data file needed:
MidtermandFinal.csv— Download from Google Drive
Complete all questions using R. Show your R code and output, and write your interpretation in full sentences.
Perform descriptive statistics for both midterm and final scores (mean, SD, min, max). Summarise the results.
Make a scatterplot of midterm score (\(x\)) vs. final score (\(y\)). Describe the form, direction, strength, and any outliers.
Find the sample correlation coefficient \(r\) between midterm and final scores. Interpret its value.
Test whether the population correlation is different from zero at \(\alpha = 0.05\). Use the 4-step procedure: (i) state hypotheses, (ii) give the test statistic and degrees of freedom, (iii) give the p-value, (iv) state your conclusion in context.
Find the 95% confidence interval for the population correlation \(\rho\) and interpret it.
Objectives: Students are able to use R to:
Key concepts: The simple linear regression model expresses the relationship between a response variable \(Y\) and a single predictor \(X\) as:
\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]
where:
The least-squares estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) minimise \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\).
We continue with the same data used in Lab 1.
📂 Data file needed:
SalePrice.csv— Download from Google Drive
Variables:
Area(sq.m.) as \(X\),Sale.price(Baht) as \(Y\).
Research question: Can we predict sale price from area using a linear model?
Step 1 — Import data
setwd("C:\\Users\\YourName\\Downloads")
Mydata <- read.csv("SalePrice.csv", header = TRUE)
x <- Mydata$Area
y <- Mydata$Sale.priceStep 2 — Fit the regression model
fit <- lm(y ~ x) # fit simple linear regression
summary(fit) # full model summary
confint(fit, level = 0.95) # 95% CI for β₀ and β₁Key output from
summary(fit):
Output field What it means Estimate(Intercept)\(\hat{\beta}_0\): predicted \(Y\) when \(X = 0\) Estimate(x)\(\hat{\beta}_1\): change in \(\hat{Y}\) for each 1-unit increase in \(X\) Std. ErrorStandard error of the coefficient estimate t valueTest statistic for \(H_0: \beta_j = 0\) Pr(>|t|)p-value; reject \(H_0\) if \(< 0.05\) Multiple R-squared\(r^2\): proportion of variance in \(Y\) explained by \(X\) F-statisticOverall model F-test (equivalent to t-test in simple regression) Residual standard error\(\hat{\sigma} = \sqrt{MS_{res}}\): estimated SD of errors How to write the fitted equation: Find the
Estimatecolumn in the output and substitute: \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]
Step 3 — Plot the fitted regression line
plot(x, y,
xlab = "Area (sq.m.)",
ylab = "Sale Price (Baht)",
main = "Regression Line: Sale Price vs. Area",
pch = 16, col = "steelblue")
abline(fit, col = "red", lwd = 2) # overlay regression lineStep 4 — Residual analysis
Residuals are defined as \(e_i = y_i - \hat{y}_i\). Before trusting regression inference, we must verify three assumptions:
res <- resid(fit)
par(mfrow = c(1, 3))
# 1. Residuals vs. Fitted — check linearity and constant variance
plot(fitted(fit), res,
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs. Fitted",
pch = 16, col = "steelblue")
abline(h = 0, lty = 2, col = "red")
# 2. Histogram — check approximate normality
hist(res, main = "Histogram of Residuals",
xlab = "Residuals", col = "lightblue", border = "white")
# 3. Normal Q-Q plot — check normality more carefully
qqnorm(res, main = "Normal Q-Q Plot", pch = 16)
qqline(res, col = "red")
par(mfrow = c(1, 1))What to look for:
- Residuals vs. Fitted: points should scatter randomly around the zero line. A curve suggests non-linearity; a fan shape suggests non-constant variance.
- Histogram: should be roughly bell-shaped and symmetric around zero.
- Q-Q plot: points should fall close to the red diagonal line. Systematic deviations indicate non-normality.
Step 5 — Prediction for a new value of x
Two types of intervals are available:
The PI is always wider than the CI because it includes both estimation uncertainty and individual variability.
new <- data.frame(x = 13) # predict at x = 13 sq.m.
# Point estimate
predict(fit, newdata = new)
# 95% Confidence interval for E(Y | x = 13)
predict(fit, newdata = new, interval = "confidence", level = 0.95)
# 95% Prediction interval for a new individual at x = 13
predict(fit, newdata = new, interval = "prediction", level = 0.95)Output columns:
fit= point estimate,lwr= lower bound,upr= upper bound.
Step 6 — Plot fitted line with CI and PI bands
new_range <- data.frame(x = sort(x)) # sorted x values for smooth curves
pred.CI <- predict(fit, newdata = new_range, interval = "confidence")
pred.PI <- predict(fit, newdata = new_range, interval = "prediction")
plot(x, y,
xlab = "Area (sq.m.)", ylab = "Sale Price (Baht)",
main = "Fitted Line with 95% CI and PI Bands",
pch = 16, col = "steelblue")
# Fitted regression line
lines(new_range$x, pred.CI[, "fit"], col = "black", lwd = 2)
# 95% Confidence interval (red dashed)
lines(new_range$x, pred.CI[, "lwr"], col = "red", lwd = 1.5, lty = 2)
lines(new_range$x, pred.CI[, "upr"], col = "red", lwd = 1.5, lty = 2)
# 95% Prediction interval (blue dotted)
lines(new_range$x, pred.PI[, "lwr"], col = "blue", lwd = 1.5, lty = 3)
lines(new_range$x, pred.PI[, "upr"], col = "blue", lwd = 1.5, lty = 3)
legend("topleft",
legend = c("Fitted line", "95% CI", "95% PI"),
col = c("black", "red", "blue"),
lty = c(1, 2, 3), lwd = 2, bty = "n")Why is the PI always wider than the CI? The CI captures only the uncertainty in estimating the mean of \(Y\), while the PI must additionally account for the natural scatter of individual observations around that mean. The gap between CI and PI bands is largest near the extremes of \(x\) and narrowest near \(\bar{x}\).
We continue with the same data used in Lab 1 Practice II.
📂 Data file needed:
JobSatisfaction.csv— Download from Google Drive
Variables:
Ageas \(X\),Incomeas \(Y\).
Research question: Can we predict income from age using a linear model?
Step 1 — Import data
setwd("C:\\Users\\YourName\\Downloads")
Mydata2 <- read.csv("JobSatisfaction.csv", header = TRUE)
x2 <- Mydata2$Age
y2 <- Mydata2$IncomeStep 2 — Fit the model
Step 3 — Plot fitted line
plot(x2, y2,
xlab = "Age (years)", ylab = "Income (Baht)",
main = "Regression Line: Income vs. Age",
pch = 16, col = "steelblue")
abline(fit2, col = "red", lwd = 2)Step 4 — Residual analysis
res2 <- resid(fit2)
par(mfrow = c(1, 3))
plot(fitted(fit2), res2,
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs. Fitted", pch = 16, col = "steelblue")
abline(h = 0, lty = 2, col = "red")
hist(res2, main = "Histogram of Residuals",
xlab = "Residuals", col = "lightblue", border = "white")
qqnorm(res2, main = "Normal Q-Q Plot", pch = 16)
qqline(res2, col = "red")
par(mfrow = c(1, 1))Step 5 — CI and PI bands
new2 <- data.frame(x2 = sort(x2))
pred.CI2 <- predict(fit2, newdata = new2, interval = "confidence")
pred.PI2 <- predict(fit2, newdata = new2, interval = "prediction")
plot(x2, y2,
xlab = "Age (years)", ylab = "Income (Baht)",
main = "Fitted Line with 95% CI and PI Bands",
pch = 16, col = "steelblue",
ylim = range(c(pred.PI2[, "lwr"], pred.PI2[, "upr"])))
lines(new2$x2, pred.CI2[, "fit"], col = "black", lwd = 2)
lines(new2$x2, pred.CI2[, "lwr"], col = "red", lty = 2, lwd = 1.5)
lines(new2$x2, pred.CI2[, "upr"], col = "red", lty = 2, lwd = 1.5)
lines(new2$x2, pred.PI2[, "lwr"], col = "blue", lty = 3, lwd = 1.5)
lines(new2$x2, pred.PI2[, "upr"], col = "blue", lty = 3, lwd = 1.5)
legend("topleft",
legend = c("Fitted line", "95% CI", "95% PI"),
col = c("black", "red", "blue"),
lty = c(1, 2, 3), lwd = 2, bty = "n")We continue with the MidtermandFinal.csv data from Lab
1. Now we extend the analysis from correlation to regression.
📂 Data file needed:
MidtermandFinal.csv— Download from Google Drive
Obtain the least-squares regression line for predicting final score (\(Y\)) from midterm score (\(X\)). Write the fitted equation.
Interpret \(\hat{\beta}_0\) and \(\hat{\beta}_1\) in the context of this problem.
Find \(MS_{res}\) (= residual standard error squared) and state its units.
Test whether midterm score is a significant linear predictor of final score at \(\alpha = 0.05\). Use the 4-step procedure.
Find the 95% confidence intervals for \(\beta_0\) and \(\beta_1\).
Find \(r^2\) and interpret its value in context.
Check the regression assumptions using residual plots (residuals vs. fitted, histogram, Q-Q plot). Are the assumptions satisfied?
Find the 95% confidence interval and 95% prediction interval for the final score of a student with a midterm score of 75. Explain the difference between the two intervals.
Plot the fitted regression line along with its 95% CI and PI bands for all values of \(x\).
Objectives: Students are able to use R to:
Key concepts: The multiple linear regression model with \(p\) predictors is:
\[Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i\]
We estimate the coefficients by least squares and evaluate the model with:
A farm owner wants to reduce monthly water costs. He collected data over 12 months on:
| Variable | Description | Type |
|---|---|---|
| \(x_1\) | Average monthly temperature (°F) | Quantitative |
| \(x_2\) | Yield (lb) | Quantitative |
| \(x_3\) | Number of days agricultural operations were carried out | Quantitative |
| \(x_4\) | Average number of workers employed | Quantitative |
| \(x_5\) | Average amount of red light | Quantitative |
| \(y\) | Average monthly water consumption (gallons) | Response |
Step 1 — Import data
📂 Data file needed:
Farming.csv— Download from Google Drive
setwd("C:\\Users\\YourName\\Downloads")
dt <- read.csv("Farming.csv", header = TRUE)
str(dt)
summary(dt)Step 2 — Descriptive statistics and correlation matrix
Since all variables are quantitative, we examine pairwise scatterplots and correlations.
# Pairwise correlation matrix
cor(dt[, c("y", "x1", "x2", "x3", "x4", "x5")])
# Pairwise scatterplot matrix
pairs(dt[, c("y", "x1", "x2", "x3", "x4", "x5")],
main = "Scatterplot Matrix for Farming Data",
pch = 16, col = "steelblue")Interpretation: Look at the first row (or column) to see which predictors appear most linearly related to \(y\). Also check for high correlations between predictors — this is called multicollinearity and can cause instability in coefficient estimates.
Step 3 — Full model (all five predictors)
Key outputs:
anova(full)gives the sequential (Type I) ANOVA table, showing the reduction in \(SS_{res}\) as each predictor is added.- The overall F-test p-value tests whether at least one \(\beta_j \neq 0\).
- Some individual predictors may be non-significant even if the overall model is — this is common with correlated predictors.
Step 4 — Variable selection
With many predictors we use automatic selection to find the best subset. All three methods use AIC (Akaike Information Criterion) — lower AIC = better model.
null <- lm(y ~ 1, data = dt) # intercept-only (no predictors)
full <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dt)
# Forward selection: starts empty, adds one predictor at a time
fw.fit <- step(null,
scope = list(lower = null, upper = full),
direction = "forward",
trace = 1)
# Backward elimination: starts full, removes one predictor at a time
be.fit <- step(full, direction = "backward", trace = 1)
# Stepwise regression: both directions (generally preferred)
sw.fit <- step(full, direction = "both", trace = 1)Comparing the three methods: They do not always agree. Stepwise regression is generally preferred as it reconsiders variables at each step.
Step 5 — Fitted model from stepwise regression
summary(sw.fit) # coefficients, R², F-test
anova(sw.fit) # ANOVA table
confint(sw.fit) # 95% CI for all coefficientsHow to write the fitted equation: If stepwise selected \(x_1\) and \(x_3\), for example: \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_3 x_3\] Substitute the
Estimatevalues fromsummary()output.
Using the results from the Practice section above, answer the following:
Which predictors were selected by each method (forward, backward, stepwise)? Do they agree?
Write the fitted regression equation from the stepwise model.
Test the overall fit at \(\alpha = 0.05\) using the 4-step hypothesis testing procedure.
Interpret each significant regression coefficient in the context of the problem.
Find the 95% CI for each coefficient in the selected model.
Find \(R^2\) and adjusted \(R^2\). What do they tell you?
Objectives: Students are able to use R to:
Key concepts — Dummy variables: When a predictor is qualitative (categorical), we cannot enter it directly into a regression model as a number. Instead, we create \(k - 1\) dummy (indicator) variables for a variable with \(k\) groups. The omitted group is called the reference category.
Each dummy variable’s coefficient \(\hat{\beta}_j\) represents the estimated difference in the mean response between that group and the reference group, holding all other predictors constant.
Why \(k - 1\) dummies, not \(k\)? Using \(k\) dummies would create perfect multicollinearity (the dummy columns would always sum to 1, which equals the intercept column). Dropping one group — the reference — avoids this problem. The intercept then represents the mean response for the reference group.
General dummy coding scheme for a variable with \(k = 3\) groups (A = reference, B, C):
| Group | \(D_B\) | \(D_C\) |
|---|---|---|
| A (reference) | 0 | 0 |
| B | 1 | 0 |
| C | 0 | 1 |
The model becomes: \[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 D_B + \hat{\beta}_3 D_C\]
This gives separate parallel regression lines for each group:
| Group | Fitted equation | Intercept |
|---|---|---|
| A (reference) | \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1\) | \(\hat{\beta}_0\) |
| B | \(\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 X_1\) | \(\hat{\beta}_0 + \hat{\beta}_2\) |
| C | \(\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_3) + \hat{\beta}_1 X_1\) | \(\hat{\beta}_0 + \hat{\beta}_3\) |
All three lines share the same slope \(\hat{\beta}_1\) but have different intercepts.
Research question: Can sale price be predicted from the number of households and location type?
Variables:
| Variable | Name in data | Type |
|---|---|---|
| Sale price (Baht) | SalesPrice |
Response (\(Y\)) |
| Number of households | NumberOfHousehold |
Quantitative predictor (\(X_1\)) |
| Location | Location |
Qualitative predictor (Residential / Street / Mall) |
Step 1 — Import data
📂 Data file needed:
HousePrices.csv— Download from Google Drive
setwd("C:\\Users\\YourName\\Downloads")
data <- read.csv("HousePrices.csv", header = TRUE)
head(data)
str(data)Step 2 — Descriptive statistics
Before fitting, explore the relationship between each predictor and the response separately.
y <- data$SalesPrice
x1 <- data$NumberOfHousehold # quantitative predictor
x2 <- data$Location # qualitative predictor
# Summaries
summary(y)
summary(x1)
table(x2) # count of each Location group
# Correlation between the quantitative predictor and response
cor(x1, y)
# Scatterplot coloured by Location
plot(x1, y,
xlab = "Number of Households", ylab = "Sale Price (Baht)",
pch = 16, col = as.integer(as.factor(x2)),
main = "Sale Price vs. Households, by Location")
legend("topleft",
legend = levels(as.factor(x2)),
col = 1:length(levels(as.factor(x2))),
pch = 16, bty = "n")
# Boxplot: sale price by Location
boxplot(y ~ x2,
xlab = "Location", ylab = "Sale Price (Baht)",
main = "Sale Price by Location",
col = c("lightblue", "lightgreen", "lightyellow"))What to look for:
- Do the groups (Residential, Street, Mall) appear to have different average sale prices in the boxplot? If yes, location is likely a useful predictor.
- In the scatterplot, do the three groups of points appear to sit at different vertical levels? If so, they may have different intercepts.
- Is the relationship between households and sale price roughly linear within each group?
# Alternative using ggplot2 (install once: install.packages("ggplot2"))
library(ggplot2)
# Scatterplot coloured by Location
ggplot(data, aes(x = NumberOfHousehold, y = SalesPrice, colour = Location)) +
geom_point(size = 3) +
labs(title = "Sale Price vs. Households by Location",
x = "Number of Households", y = "Sale Price (Baht)") +
theme_minimal()
# Boxplot by Location
ggplot(data, aes(x = Location, y = SalesPrice, fill = Location)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Sale Price by Location",
x = "Location", y = "Sale Price (Baht)") +
theme_minimal()Step 3 — Create dummy variables
Location has \(k = 3\)
groups → we need \(k - 1 = 2\) dummy
variables. We choose Residential as the reference
category (the baseline group).
# D_Street = 1 if Location is "Street", 0 otherwise
x2.dummy <- ifelse(data$Location == "Street", 1, 0)
# D_Mall = 1 if Location is "Mall", 0 otherwise
x3.dummy <- ifelse(data$Location == "Mall", 1, 0)
# Verify the coding is correct: print side-by-side
data.frame(Location = data$Location,
D_Street = x2.dummy,
D_Mall = x3.dummy)Dummy coding summary:
Location \(D_{\text{Street}}\) \(D_{\text{Mall}}\) Residential (reference) 0 0 Street 1 0 Mall 0 1 You can verify: when both dummies are 0, the observation must be Residential. ✓
Step 4 — Fit the full model
We fit the model with \(X_1\) (NumberOfHousehold), \(D_{\text{Street}}\), and \(D_{\text{Mall}}\) as predictors.
fit.full <- lm(y ~ x1 + x2.dummy + x3.dummy)
summary(fit.full)
anova(fit.full)
confint(fit.full, level = 0.95)Reading the
summary()output:
(Intercept): \(\hat{\beta}_0\) — estimated mean sale price for the Residential group whenNumberOfHousehold = 0.x1: \(\hat{\beta}_1\) — estimated change in sale price for each additional household, within any location group (slope is shared).x2.dummy: \(\hat{\beta}_2\) — estimated difference in mean sale price between Street and Residential, holding \(X_1\) constant.x3.dummy: \(\hat{\beta}_3\) — estimated difference in mean sale price between Mall and Residential, holding \(X_1\) constant.A positive \(\hat{\beta}_2\) means Street locations have higher average sale prices than Residential; a negative \(\hat{\beta}_2\) means lower.
Step 5 — Write the fitted equation and derive group-specific equations
From the Estimate column of
summary(fit.full), substitute values into:
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 D_{\text{Street}} + \hat{\beta}_3 D_{\text{Mall}}\]
Then derive the three separate equations by substituting the dummy values:
coef(fit.full) # extract all coefficient estimates
# The three group equations are derived as follows:
# Residential (D_Street=0, D_Mall=0):
# yhat = b0 + b1*x1
# Street (D_Street=1, D_Mall=0):
# yhat = (b0 + b2) + b1*x1
# Mall (D_Street=0, D_Mall=1):
# yhat = (b0 + b3) + b1*x1Interpretation example (replace with your actual estimates):
Suppose
summary()gives: \(\hat{\beta}_0 = 50{,}000\), \(\hat{\beta}_1 = 3{,}000\), \(\hat{\beta}_2 = 12{,}000\), \(\hat{\beta}_3 = 25{,}000\).
- Residential: \(\hat{y} = 50{,}000 + 3{,}000 \, x_1\)
- Street: \(\hat{y} = 62{,}000 + 3{,}000 \, x_1\) (intercept increases by 12,000)
- Mall: \(\hat{y} = 75{,}000 + 3{,}000 \, x_1\) (intercept increases by 25,000)
Interpretation: “Mall locations are estimated to sell for 25,000 Baht more on average than Residential locations with the same number of households.”
Step 6 — Plot the three fitted lines
# Compute fitted values
yhat <- fitted(fit.full)
# Get coefficient estimates
b <- coef(fit.full)
b0 <- b["(Intercept)"]
b1 <- b["x1"]
b2 <- b["x2.dummy"]
b3 <- b["x3.dummy"]
x1_range <- seq(min(x1), max(x1), length.out = 100)
# Base scatterplot coloured by Location
plot(x1, y,
xlab = "Number of Households", ylab = "Sale Price (Baht)",
pch = 16, col = as.integer(as.factor(x2)),
main = "Fitted Regression Lines by Location")
# Add a fitted line for each group
lines(x1_range, b0 + b1 * x1_range, col = 1, lwd = 2) # Residential
lines(x1_range, b0 + b2 + b1 * x1_range, col = 2, lwd = 2) # Street
lines(x1_range, b0 + b3 + b1 * x1_range, col = 3, lwd = 2) # Mall
legend("topleft",
legend = c("Residential", "Street", "Mall"),
col = 1:3, lty = 1, lwd = 2, pch = 16, bty = "n")What this plot shows: Three parallel lines — same slope, different intercepts. The vertical gap between any two lines equals the corresponding dummy coefficient. If the lines were not parallel, that would suggest an interaction between location and households (a more advanced model not covered here).
Step 7 — Plot observed vs. predicted
plot(x1, y,
xlab = "Number of Households", ylab = "Sale Price (Baht)",
pch = 1, main = "Observed vs. Predicted Sale Price")
points(x1, yhat, pch = 20, col = "red")
legend("topleft",
legend = c("Observed y", "Predicted ŷ"),
pch = c(1, 20), col = c("black", "red"), bty = "n")📂 Data file needed:
ScoreAndEQ.csv— Download from Google Drive
Variables: Y (post-test score),
Pretest (pre-test score), EQ (emotional
quotient score), Method (teaching method: A, B, C, D).
Explore the data:
Pretest and \(Y\)
vs. EQ. Describe each relationship (form, direction,
strength).Method. Do the teaching methods appear to differ in mean
score?Method is a qualitative variable with \(k = 4\) groups. Create \(k - 1 = 3\) dummy variables. Clearly
state:
Fit the full model: \(Y = \beta_0 + \beta_1 \cdot \text{Pretest} + \beta_2 \cdot \text{EQ} + \beta_3 D_B + \beta_4 D_C + \beta_5 D_D + \varepsilon\). Show the R output and write the fitted equation.
Interpret each coefficient in the context of the problem. Pay particular attention to the dummy coefficients — what do they tell you about the effect of teaching method?
Write the separate fitted equation for students taught by each of Method A, B, C, and D. Comment on how the methods compare.
Plot all four fitted lines on a single scatterplot of \(Y\) vs. Pretest, with points
coloured by Method.
Objectives: Students are able to use R to:
Key concepts — Variable selection: When we have many candidate predictors, automatic selection methods help find a parsimonious model (few predictors, good fit). All three methods below use AIC (Akaike Information Criterion): lower AIC = better trade-off between fit and complexity.
| Method | Starting model | Strategy |
|---|---|---|
| Forward selection | Intercept only | Add the predictor that reduces AIC the most, repeat |
| Backward elimination | Full model | Remove the predictor that reduces AIC the most, repeat |
| Stepwise (both) | Full model | Add or remove at each step — generally preferred |
Important limitation: Automated selection does not guarantee the “best” model in any absolute sense. Always check whether the selected model makes practical sense and satisfies the regression assumptions.
Key concepts — Model diagnostics: Before using a fitted model for inference or prediction, we must verify four assumptions of the regression model \(Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \varepsilon_i\):
| Assumption | What it means | How to check |
|---|---|---|
| Linearity | \(E(\varepsilon_i) = 0\); no pattern in residuals | Residuals vs. Fitted plot |
| Normality | \(\varepsilon_i \sim N(0, \sigma^2)\) | Histogram, Q-Q plot, Shapiro-Wilk test |
| Constant variance | \(\text{Var}(\varepsilon_i) = \sigma^2\) (homoscedasticity) | Scale-Location plot, Breusch-Pagan test |
| Independence | Residuals are uncorrelated | Residuals vs. order plot, Durbin-Watson test |
Additionally, we check for influential observations — data points that have an unusually large effect on the fitted model.
We continue with the HousePrices.csv data from Lab 4.
The goal is now to select the best subset of predictors and validate the
model assumptions.
📂 Data file needed:
HousePrices.csv— Download from Google Drive
Step 1 — Re-import data and recreate variables
setwd("C:\\Users\\YourName\\Downloads")
data <- read.csv("HousePrices.csv", header = TRUE)
y <- data$SalesPrice
x1 <- data$NumberOfHousehold
x2.dummy <- ifelse(data$Location == "Street", 1, 0)
x3.dummy <- ifelse(data$Location == "Mall", 1, 0)Step 2 — Variable selection
null <- lm(y ~ 1) # intercept-only model
full <- lm(y ~ x1 + x2.dummy + x3.dummy) # full model from Lab 4
# Forward selection
fw.fit <- step(null,
scope = list(lower = null, upper = full),
direction = "forward",
trace = 1)
# Backward elimination
be.fit <- step(full,
direction = "backward",
trace = 1)
# Stepwise regression (both directions) — generally preferred
sw.fit <- step(full,
direction = "both",
trace = 1)Reading the
step()output: Each row shows one candidate model. The columns are:
Df: degrees of freedom used by the variable being added/removedAIC: AIC of the candidate model after the changeThe algorithm selects the change that gives the lowest AIC. It stops when no change improves AIC.
Comparing the three methods: Write down which variables each method selected. Do they agree? Stepwise is preferred because it can reconsider earlier decisions.
Step 3 — Examine the selected model
# Use the stepwise-selected model
summary(sw.fit) # coefficients, R², overall F-test
anova(sw.fit) # ANOVA table: SS, MS, F for each term
confint(sw.fit, level = 0.95) # 95% CI for each coefficientOverall F-test (4-step procedure):
- \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) (no predictors are useful) vs. \(H_a\): at least one \(\beta_j \neq 0\)
- \(F = MS_{reg} / MS_{res}\), with \(df_1 = p\) and \(df_2 = n - p - 1\)
- p-value from the
F-statisticline insummary()- Conclusion in context
Step 4 — Comprehensive model diagnostics
R’s built-in plot() for lm objects produces
four diagnostic plots at once. Understanding each one is essential.
Interpreting the four diagnostic plots:
Plot 1 — Residuals vs. Fitted - Checks: linearity and constant variance - Good: points scatter randomly around the horizontal zero line, with roughly equal spread across all fitted values - Warning signs: a curve (non-linearity) or a funnel/fan shape (non-constant variance)
Plot 2 — Normal Q-Q - Checks: normality of residuals - Good: points follow the diagonal reference line closely - Warning signs: S-shaped curve (heavy tails), points bending away at the ends
Plot 3 — Scale-Location (√|standardised residuals| vs. Fitted) - Checks: constant variance (more sensitive than Plot 1) - Good: a roughly horizontal red line with points evenly spread - Warning signs: a clear upward or downward trend in the red line
Plot 4 — Residuals vs. Leverage - Checks: influential observations - Points with high leverage AND large residuals are most dangerous - Points outside the Cook’s distance dashed lines (if visible) are highly influential
Step 5 — Formal tests for each assumption
res <- resid(sw.fit)
# --- Normality ---
# Visual: histogram and Q-Q plot side by side
par(mfrow = c(1, 2))
hist(res, main = "Histogram of Residuals",
xlab = "Residuals", col = "lightblue", border = "white",
freq = FALSE) # density scale
curve(dnorm(x, mean = mean(res), sd = sd(res)),
add = TRUE, col = "red", lwd = 2) # overlay normal curve
qqnorm(res, main = "Normal Q-Q Plot", pch = 16, col = "steelblue")
qqline(res, col = "red", lwd = 2)
par(mfrow = c(1, 1))
# Shapiro-Wilk test (best for n < 50; use for up to ~2000 obs)
shapiro.test(res)
# H0: residuals are normally distributed
# Satisfied if p-value > 0.05
# --- Constant variance (homoscedasticity) ---
# install.packages("lmtest") # run once if not yet installed
library(lmtest)
bptest(sw.fit)
# H0: variance is constant (homoscedastic)
# Satisfied if p-value > 0.05
# --- Independence (no autocorrelation) ---
dwtest(sw.fit)
# H0: residuals are not autocorrelated
# Satisfied if p-value > 0.05 (DW statistic close to 2 is ideal)Summary of formal tests:
Test \(H_0\) Assumption satisfied if Shapiro-Wilk Residuals are normally distributed p-value > 0.05 Breusch-Pagan Variance is constant (homoscedastic) p-value > 0.05 Durbin-Watson No autocorrelation in residuals p-value > 0.05 If an assumption is violated, consider:
Violation Possible remedy Non-linearity Add a polynomial term (e.g. \(X^2\)) or transform \(X\) Non-constant variance Apply log or square-root transformation to \(Y\) Non-normality Transform \(Y\); check for outliers Autocorrelation Include a time trend; use time-series methods
Step 6 — Checking for influential observations
An observation can be unusual in two ways:
# Standardised residuals — values beyond ±2 are potential outliers
std.res <- rstandard(sw.fit)
which(abs(std.res) > 2) # row indices of potential outliers
# Leverage values — rule of thumb: high if > 2*(p+1)/n
n <- nrow(data)
p <- length(coef(sw.fit)) - 1 # number of predictors
hat.vals <- hatvalues(sw.fit)
which(hat.vals > 2 * (p + 1) / n)
# Cook's distance — rule of thumb: influential if > 4/n
cooks <- cooks.distance(sw.fit)
which(cooks > 4 / n)
# Combined influence plot
plot(sw.fit, which = 4,
main = "Cook's Distance",
sub = paste("Threshold = 4/n =", round(4/n, 4)))
abline(h = 4/n, col = "red", lty = 2)What to do with influential observations: Do NOT automatically delete them. First: 1. Check whether the data point was recorded correctly (data entry error?). 2. Fit the model with and without the point and compare — if results change substantially, report both. 3. If justified, remove the point and note it in your report.
Step 7 — Prediction using the selected model
# Predict sale price for a new house:
# NumberOfHousehold = 10, Location = "Street" (D_Street=1, D_Mall=0)
new_obs <- data.frame(x1 = 10,
x2.dummy = 1,
x3.dummy = 0)
# Point estimate
predict(sw.fit, newdata = new_obs)
# 95% confidence interval for the mean sale price at these values
predict(sw.fit, newdata = new_obs, interval = "confidence", level = 0.95)
# 95% prediction interval for a single new house
predict(sw.fit, newdata = new_obs, interval = "prediction", level = 0.95)📂 Data file needed:
ScoreAndEQ.csv— Download from Google Drive
This assignment continues directly from Lab 4 Assignment. Use the dummy variables you created there.
Apply forward selection, backward elimination, and stepwise
regression to select the best predictors of \(Y\) from
{Pretest, EQ, D_B, D_C, D_D}. Compare the three methods —
do they agree?
Write the fitted equation from the stepwise-selected model. Interpret each coefficient in context.
Test the overall fit of the selected model at \(\alpha = 0.05\) using the 4-step procedure. Report \(R^2\) and adjusted \(R^2\) and explain what they tell you.
Produce R’s four built-in diagnostic plots
(plot(model)). For each plot, describe what you see and
state what assumption it is checking.
Carry out the three formal assumption tests:
For each test: state \(H_0\), report the test statistic and p-value, and conclude whether the assumption is satisfied.
Check for influential observations using Cook’s distance. Report any observations that exceed the threshold \(4/n\) and describe what you would do with them.
Find the 90% and 95% confidence intervals for each regression coefficient in the selected model.
Predict the post-test score for a new student with
Pretest = 60, EQ = 70, taught by Method C.
Give both a 95% confidence interval and a 95% prediction interval, and
explain the difference.
Objectives: Students are able to:
When to use non-parametric tests: Use non-parametric methods when:
| Parametric test | Non-parametric equivalent | Use case |
|---|---|---|
| One-sample t-test | Wilcoxon signed-rank test | One sample, test about median |
| Independent samples t-test | Mann-Whitney U (Wilcoxon rank-sum) | Two independent samples |
| Paired t-test | Wilcoxon signed-rank test (paired) | Two related samples |
| One-way ANOVA | Kruskal-Wallis test | Three or more independent samples |
| Repeated-measures ANOVA | Friedman test | Three or more related samples |
| Pearson correlation | Spearman rank correlation | Association, non-normal data |
📂 Data file needed:
AStore_data.csv— Download from Google DriveVariables:
Age,ItemsPerchased,Sales,DiscountAmount,Gender,MaritalStatus,MethodOfPayment
Import and prepare data
setwd("C:\\Users\\YourName\\Downloads")
dt <- read.csv("AStore_data.csv", header = TRUE)
head(dt)
str(dt)
# Convert qualitative variables to factors
dt$MethodOfPayment <- as.factor(dt$MethodOfPayment)
dt$Gender <- as.factor(dt$Gender)
dt$MaritalStatus <- as.factor(dt$MaritalStatus)Question A: Is the median age of customers equal to 45?
One sample, test about a median → Wilcoxon signed-rank test
summary(dt$Age)
sd(dt$Age)
par(mfrow = c(1, 2))
boxplot(dt$Age, main = "Boxplot of Age", ylab = "Age")
hist(dt$Age, main = "Histogram of Age",
xlab = "Age", col = "lightblue", border = "white")
par(mfrow = c(1, 1))
# H0: median age = 45 Ha: median age ≠ 45
wilcox.test(x = dt$Age, mu = 45, alternative = "two.sided")Interpretation: The Wilcoxon signed-rank test tests \(H_0\): population median =
mu. - p-value < 0.05: reject \(H_0\); the median age is significantly different from 45. - p-value ≥ 0.05: insufficient evidence to conclude the median differs from 45.
Question B: Is the median number of purchased items less than 7?
summary(dt$ItemsPerchased)
par(mfrow = c(1, 2))
boxplot(dt$ItemsPerchased, ylab = "Items Purchased", main = "Boxplot")
hist(dt$ItemsPerchased, xlab = "Items Purchased",
main = "Histogram", col = "lightblue", border = "white")
par(mfrow = c(1, 1))
# H0: median = 7 Ha: median < 7
wilcox.test(x = dt$ItemsPerchased, mu = 7, alternative = "less")📂 Data file needed:
Music_data.csv— Download from Google DriveDesign: Each runner exercised under three music conditions (None, Classical, Dance). This is a paired (repeated measures) design → use the paired Wilcoxon signed-rank test.
Step 1 — Import and prepare data
setwd("C:\\Users\\YourName\\Downloads")
dt2 <- read.csv("Music_data.csv", header = TRUE)
head(dt2)
str(dt2)
dt2$Type <- as.factor(dt2$Type)Step 2 — Explore data
tapply(dt2$Scale, dt2$Type, summary)
boxplot(dt2$Scale ~ dt2$Type,
xlab = "Type of Music", ylab = "Exertion Scale",
main = "Exertion by Music Type",
col = c("lightblue", "lightgreen", "lightyellow"))Question A: Is there a difference in exertion between No Music and Classical?
# H0: median(None) = median(Classical) Ha: ≠
wilcox.test(x = dt2$Scale[dt2$Type == "None"],
y = dt2$Scale[dt2$Type == "Classical"],
paired = TRUE,
alternative = "two.sided")Question B: Is the exertion scale under Classical music higher than under Dance music?
# H0: median(Classical) = median(Dance) Ha: Classical > Dance
wilcox.test(x = dt2$Scale[dt2$Type == "Classical"],
y = dt2$Scale[dt2$Type == "Dance"],
paired = TRUE,
alternative = "greater")Why
paired = TRUE? Each runner experienced all music conditions, so observations are matched by runner. Usingpaired = TRUEaccounts for individual differences between runners and increases the power of the test.
Objectives: Students are able to:
Import and prepare data
📂 Data file needed:
AStore_data.csv— Download from Google Drive
setwd("C:\\Users\\YourName\\Downloads")
dt <- read.csv("AStore_data.csv", header = TRUE)
dt$MethodOfPayment <- as.factor(dt$MethodOfPayment)
dt$Gender <- as.factor(dt$Gender)
dt$MaritalStatus <- as.factor(dt$MaritalStatus)
str(dt)Question a: Is there a difference in average sales between female and male customers?
Two independent groups → Mann-Whitney U test
(Wilcoxon rank-sum, paired = FALSE)
tapply(dt$Sales, dt$Gender, summary)
boxplot(dt$Sales ~ dt$Gender,
xlab = "Gender", ylab = "Sales",
main = "Sales by Gender",
col = c("lightpink", "lightblue"))
# H0: median sales equal for female and male Ha: ≠
wilcox.test(x = dt$Sales[dt$Gender == "female"],
y = dt$Sales[dt$Gender == "male"],
paired = FALSE,
alternative = "two.sided")Question b: Is the discount amount related to method of payment?
Three or more independent groups → Kruskal-Wallis test
tapply(dt$DiscountAmount, dt$MethodOfPayment, summary)
boxplot(dt$DiscountAmount ~ dt$MethodOfPayment,
xlab = "Method of Payment", ylab = "Discount Amount",
main = "Discount Amount by Payment Method")
# H0: median discount equal across all payment methods Ha: at least one differs
kruskal.test(dt$DiscountAmount ~ dt$MethodOfPayment)Post-hoc testing: If the Kruskal-Wallis test is significant (p < 0.05), we know at least one group differs but not which pairs. Use pairwise Wilcoxon tests with a Bonferroni correction:
Question c: Is there a (monotone) association between items purchased and discount amount?
Two quantitative variables, non-normal data → Spearman rank correlation
plot(dt$ItemsPerchased, dt$DiscountAmount,
xlab = "Items Purchased", ylab = "Discount Amount",
main = "Discount Amount vs. Items Purchased",
pch = 16, col = "steelblue")
# H0: ρ_s = 0 (no monotone association) Ha: ρ_s ≠ 0
cor.test(~ ItemsPerchased + DiscountAmount,
data = dt,
method = "spearman",
alternative = "two.sided")Spearman vs. Pearson: Spearman measures monotone association (consistently increasing or decreasing), not just linear. It is more robust to outliers and does not require normality.
Question d: Does sales tend to increase with customer age?
plot(dt$Age, dt$Sales,
xlab = "Age", ylab = "Sales",
main = "Sales vs. Age",
pch = 16, col = "steelblue")
# H0: ρ_s = 0 Ha: ρ_s > 0 (positive monotone association)
cor.test(~ Age + Sales,
data = dt,
method = "spearman",
alternative = "greater")Question e: Is sales affected by marital status?
tapply(dt$Sales, dt$MaritalStatus, summary)
boxplot(dt$Sales ~ dt$MaritalStatus,
xlab = "Marital Status", ylab = "Sales",
main = "Sales by Marital Status")
kruskal.test(dt$Sales ~ dt$MaritalStatus)
# Post-hoc if significant:
pairwise.wilcox.test(dt$Sales, dt$MaritalStatus,
p.adjust.method = "bonferroni")Question f: Is there a relationship between gender and method of payment?
Two categorical variables → Chi-squared test of independence
tab <- table(dt$Gender, dt$MethodOfPayment)
tab
prop.table(tab, margin = 1) # row percentages
# H0: Gender and MethodOfPayment are independent Ha: they are associated
chisq.test(dt$Gender, dt$MethodOfPayment)Check the chi-squared conditions: All expected cell counts should be ≥ 5.
chisq.test(dt$Gender, dt$MethodOfPayment)$expected
# If any expected count < 5, use Fisher's exact test:
fisher.test(tab)📂 Data file needed:
Music_data.csv— Download from Google DriveDesign: Each runner experienced all music types → repeated measures with 3+ groups → Friedman test.
setwd("C:\\Users\\YourName\\Downloads")
dt2 <- read.csv("Music_data.csv", header = TRUE)
dt2$Type <- as.factor(dt2$Type)
str(dt2)Question: Are exertion scales affected by the type of music?
tapply(dt2$Scale, dt2$Type, summary)
boxplot(dt2$Scale ~ dt2$Type,
xlab = "Type of Music", ylab = "Exertion Scale",
main = "Exertion by Music Type",
col = c("lightblue", "lightgreen", "lightyellow"))
# H0: distributions of Scale are the same under all music types
# Ha: at least one music type leads to a different distribution
friedman.test(y = dt2$Scale, groups = dt2$Type, blocks = dt2$Runner)Friedman test: The non-parametric equivalent of one-way repeated-measures ANOVA. It ranks observations within each block (runner) and tests whether rank sums differ across groups.
# Post-hoc pairwise comparisons (if Friedman test is significant)
pairwise.wilcox.test(dt2$Scale, dt2$Type,
paired = TRUE,
p.adjust.method = "bonferroni")📂 Data file needed:
AStore_data.csv— Download from Google Drive
Using AStore_data.csv, answer the following questions.
For each test, follow the 4-step procedure: state hypotheses, give the
test statistic and p-value, state your conclusion in context.
Is the median age of customers greater than 40?
Is there a difference in the number of items purchased between male and female customers?
Is the median sales amount different across marital status groups? If so, which groups differ significantly? (Use Bonferroni correction.)
Is there a significant positive monotone association between Age and Items Purchased?
Is there a relationship between Marital Status and Method of Payment? (Check the chi-squared expected count conditions first.)
End of Lab Materials