📂 Data files — Download all datasets before starting any lab:

⬇ Download all data files from Google Drive

Save all files to one folder on your computer (e.g. C:\Users\YourName\Downloads), then update the setwd() path in each lab to match.


How to use these lab materials

  • Before your first lab: Download all data files and save them to one folder on your computer.
  • Read the explanation text before running each code chunk.
  • Pay attention to interpretation notes — these help you understand what the output means statistically.
  • In each Practice section you will see worked examples; the Assignment section is for you to complete independently.
  • Update the setwd() path in each lab’s import step to point to the folder where you saved the data files.

Lab 1: Correlation Analysis

Objectives: Students are able to use R to:

  1. Perform descriptive statistics for two quantitative variables
  2. Construct and interpret a scatterplot
  3. Compute and interpret the Pearson correlation coefficient
  4. Perform a hypothesis test for the population correlation
  5. Find a confidence interval for the population correlation

Key concepts: The Pearson correlation coefficient \(r\) measures the strength and direction of the linear relationship between two quantitative variables \(X\) and \(Y\). It ranges from −1 to +1:

\[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i-\bar{x})^2 \; \sum(y_i-\bar{y})^2}}\]

Value of \(|r|\) Interpretation
0.00 – 0.19 Very weak
0.20 – 0.39 Weak
0.40 – 0.59 Moderate
0.60 – 0.79 Strong
0.80 – 1.00 Very strong

The sign of \(r\) indicates direction: positive means both variables increase together; negative means one increases as the other decreases.

Important: Correlation measures linear association only. A correlation of 0 does not necessarily mean there is no relationship — only that there is no linear relationship.


Practice I: Sale Price

You can download the data file from Mango CMU → Lab Data.
Data file: SalePrice.csv

Variables: Area (square metres) and Sale.price (Baht).
Research question: Is there a significant linear correlation between house area and sale price?

Step 1 — Set your working directory and import data

📂 Data file needed: SalePrice.csvDownload from Google Drive

If your data file is in Excel (.xls), save it as .csv first:
Open file → File → Save As → choose CSV format.

# Set working directory to the folder where you saved the data files.
# Example for Windows — replace the path with your own:
setwd("C:\\Users\\YourName\\Downloads")

# Read the CSV file
Mydata <- read.csv("SalePrice.csv", header = TRUE)

# Preview the data
head(Mydata)      # first 6 rows
str(Mydata)       # variable types and dimensions
summary(Mydata)   # basic descriptive statistics

Note: str() tells you how many observations (rows) and variables (columns) are in your data, and what type each variable is (numeric, character, factor, etc.).

Step 2 — Descriptive statistics

x <- Mydata$Area        # independent variable (predictor)
y <- Mydata$Sale.price  # dependent variable (response)

# Summary statistics
summary(x)
summary(y)

# Standard deviation
sd(x)
sd(y)

Step 3 — Create a scatterplot

A scatterplot is always the first step — it reveals the form, direction, and strength of the relationship before any numerical summary.

plot(x, y,
     xlab = "Area (sq.m.)",
     ylab = "Sale Price (Baht)",
     main = "Scatterplot: Sale Price vs. Area",
     pch  = 16,        # filled circle points
     col  = "steelblue")

What to look for in the scatterplot:

  • Form: Is the pattern linear (straight) or curved?
  • Direction: Positive (both increase together) or negative (one increases as the other decreases)?
  • Strength: Are the points tightly clustered around a line (strong) or widely scattered (weak)?
  • Outliers: Are there any unusual observations far from the general pattern?

Step 4 — Compute the correlation coefficient

# Sample Pearson correlation coefficient
cor(x, y)

Interpretation: Report the value of \(r\) and describe its sign and magnitude using the reference table above. For example: “The sample correlation coefficient is \(r = 0.87\), indicating a very strong positive linear relationship between area and sale price.”

Step 5 — Hypothesis test for the population correlation

We test whether there is a statistically significant linear correlation in the population:

\[H_0: \rho = 0 \quad \text{(no linear correlation)} \qquad \text{vs.} \qquad H_a: \rho \neq 0\]

The test statistic is:

\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \text{ under } H_0\]

# Two-sided test at alpha = 0.05
cor.test(x, y, alternative = "two.sided", conf.level = 0.95)

Interpretation of cor.test() output:

Output field What it means
cor Sample correlation coefficient \(r\)
t Test statistic value
df Degrees of freedom (\(n - 2\))
p-value Reject \(H_0\) if p-value \(< \alpha = 0.05\)
95 percent confidence interval 95% CI for the population correlation \(\rho\)

4-step conclusion template:
1. \(H_0: \rho = 0\) vs. \(H_a: \rho \neq 0\)
2. \(t = \ldots\), \(df = \ldots\)
3. \(p\text{-value} = \ldots\)
4. Since p-value [< / ≥] 0.05, we [reject / fail to reject] \(H_0\). There [is / is not] sufficient evidence of a significant linear correlation between area and sale price.


Practice II: Job and Life Satisfaction

Variables: Age, Income, Gender, Education, Job.Satisfaction
Research question: Is there a significant linear correlation between age and income?

Step 1 — Import data

📂 Data file needed: JobSatisfaction.csvDownload from Google Drive

setwd("C:\\Users\\YourName\\Downloads")
Mydata2 <- read.csv("JobSatisfaction.csv", header = TRUE)

Step 2 — Explore the data

str(Mydata2)

# Frequency table for Gender
table1 <- table(Mydata2$Gender)
table1
prop.table(table1)   # proportions (multiply by 100 for percentages)

# Two-way frequency table: Education by Gender
table2 <- table(Mydata2$Education, Mydata2$Gender)
table2
prop.table(table2)

# Descriptive statistics for quantitative variables
summary(Mydata2$Age)
summary(Mydata2$Income)

# Boxplots
par(mfrow = c(1, 2))
boxplot(Mydata2$Income,
        main = "Distribution of Income", ylab = "Income (Baht)")
boxplot(Mydata2$Income ~ Mydata2$Gender,
        xlab = "Gender", ylab = "Income",
        main = "Income by Gender",
        col  = c("lightpink", "lightblue"))
par(mfrow = c(1, 1))

boxplot(Mydata2$Income ~ Mydata2$Job.Satisfaction,
        xlab = "Job Satisfaction", ylab = "Income",
        main = "Income by Job Satisfaction Level")

Interpretation: Boxplots let you compare distributions across groups. Look for differences in medians, spread (IQR), and the presence of outliers.

Step 3 — Scatterplot

x2 <- Mydata2$Age
y2 <- Mydata2$Income

plot(x2, y2,
     xlab = "Age (years)", ylab = "Income (Baht)",
     main = "Scatterplot: Income vs. Age",
     pch  = 16, col = "steelblue")

Step 4 — Correlation coefficient and hypothesis test

# Sample correlation
cor(x2, y2)

# Hypothesis test: H0: rho = 0  vs  Ha: rho ≠ 0
cor.test(x2, y2, alternative = "two.sided", conf.level = 0.95)

Assignment: Midterm and Final Score

A statistics teacher wants to examine the relationship between students’ midterm and final exam scores.

📂 Data file needed: MidtermandFinal.csvDownload from Google Drive

Complete all questions using R. Show your R code and output, and write your interpretation in full sentences.

  1. Perform descriptive statistics for both midterm and final scores (mean, SD, min, max). Summarise the results.

  2. Make a scatterplot of midterm score (\(x\)) vs. final score (\(y\)). Describe the form, direction, strength, and any outliers.

  3. Find the sample correlation coefficient \(r\) between midterm and final scores. Interpret its value.

  4. Test whether the population correlation is different from zero at \(\alpha = 0.05\). Use the 4-step procedure: (i) state hypotheses, (ii) give the test statistic and degrees of freedom, (iii) give the p-value, (iv) state your conclusion in context.

  5. Find the 95% confidence interval for the population correlation \(\rho\) and interpret it.


Lab 2: Simple Linear Regression

Objectives: Students are able to use R to:

  1. Fit a simple linear regression model using least squares
  2. Interpret the regression coefficients
  3. Perform inference on regression parameters (t-tests and confidence intervals)
  4. Assess model fit using \(r^2\) and \(MS_{res}\)
  5. Check regression assumptions using residual diagnostics
  6. Construct confidence and prediction intervals

Key concepts: The simple linear regression model expresses the relationship between a response variable \(Y\) and a single predictor \(X\) as:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]

where:

  • \(\beta_0\) = intercept (value of \(E(Y)\) when \(X = 0\))
  • \(\beta_1\) = slope (change in \(E(Y)\) for a 1-unit increase in \(X\))
  • \(\varepsilon_i\) = random error term (independent, normally distributed with constant variance)

The least-squares estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) minimise \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\).


Practice I: Sale Price

We continue with the same data used in Lab 1.

📂 Data file needed: SalePrice.csvDownload from Google Drive

Variables: Area (sq.m.) as \(X\), Sale.price (Baht) as \(Y\).
Research question: Can we predict sale price from area using a linear model?

Step 1 — Import data

setwd("C:\\Users\\YourName\\Downloads")
Mydata <- read.csv("SalePrice.csv", header = TRUE)

x <- Mydata$Area
y <- Mydata$Sale.price

Step 2 — Fit the regression model

fit <- lm(y ~ x)       # fit simple linear regression
summary(fit)           # full model summary
confint(fit, level = 0.95)  # 95% CI for β₀ and β₁

Key output from summary(fit):

Output field What it means
Estimate (Intercept) \(\hat{\beta}_0\): predicted \(Y\) when \(X = 0\)
Estimate (x) \(\hat{\beta}_1\): change in \(\hat{Y}\) for each 1-unit increase in \(X\)
Std. Error Standard error of the coefficient estimate
t value Test statistic for \(H_0: \beta_j = 0\)
Pr(>|t|) p-value; reject \(H_0\) if \(< 0.05\)
Multiple R-squared \(r^2\): proportion of variance in \(Y\) explained by \(X\)
F-statistic Overall model F-test (equivalent to t-test in simple regression)
Residual standard error \(\hat{\sigma} = \sqrt{MS_{res}}\): estimated SD of errors

How to write the fitted equation: Find the Estimate column in the output and substitute: \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]

Step 3 — Plot the fitted regression line

plot(x, y,
     xlab = "Area (sq.m.)",
     ylab = "Sale Price (Baht)",
     main = "Regression Line: Sale Price vs. Area",
     pch  = 16, col = "steelblue")
abline(fit, col = "red", lwd = 2)   # overlay regression line

Step 4 — Residual analysis

Residuals are defined as \(e_i = y_i - \hat{y}_i\). Before trusting regression inference, we must verify three assumptions:

  1. Linearity — no systematic pattern in residuals vs. fitted values
  2. Normality — residuals are approximately normally distributed
  3. Constant variance (homoscedasticity) — spread of residuals is roughly equal for all fitted values
res <- resid(fit)

par(mfrow = c(1, 3))

# 1. Residuals vs. Fitted — check linearity and constant variance
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted",
     pch  = 16, col = "steelblue")
abline(h = 0, lty = 2, col = "red")

# 2. Histogram — check approximate normality
hist(res, main = "Histogram of Residuals",
     xlab = "Residuals", col = "lightblue", border = "white")

# 3. Normal Q-Q plot — check normality more carefully
qqnorm(res, main = "Normal Q-Q Plot", pch = 16)
qqline(res, col = "red")

par(mfrow = c(1, 1))

What to look for:

  • Residuals vs. Fitted: points should scatter randomly around the zero line. A curve suggests non-linearity; a fan shape suggests non-constant variance.
  • Histogram: should be roughly bell-shaped and symmetric around zero.
  • Q-Q plot: points should fall close to the red diagonal line. Systematic deviations indicate non-normality.

Step 5 — Prediction for a new value of x

Two types of intervals are available:

  • Confidence interval (CI): estimates the mean response \(E(Y \mid X = x^*)\) — how high is the average sale price for all houses with area \(= x^*\)?
  • Prediction interval (PI): estimates a single future observation at \(X = x^*\) — what sale price might an individual house with area \(= x^*\) have?

The PI is always wider than the CI because it includes both estimation uncertainty and individual variability.

new <- data.frame(x = 13)   # predict at x = 13 sq.m.

# Point estimate
predict(fit, newdata = new)

# 95% Confidence interval for E(Y | x = 13)
predict(fit, newdata = new, interval = "confidence", level = 0.95)

# 95% Prediction interval for a new individual at x = 13
predict(fit, newdata = new, interval = "prediction", level = 0.95)

Output columns: fit = point estimate, lwr = lower bound, upr = upper bound.

Step 6 — Plot fitted line with CI and PI bands

new_range <- data.frame(x = sort(x))   # sorted x values for smooth curves

pred.CI <- predict(fit, newdata = new_range, interval = "confidence")
pred.PI <- predict(fit, newdata = new_range, interval = "prediction")

plot(x, y,
     xlab = "Area (sq.m.)", ylab = "Sale Price (Baht)",
     main = "Fitted Line with 95% CI and PI Bands",
     pch  = 16, col = "steelblue")

# Fitted regression line
lines(new_range$x, pred.CI[, "fit"], col = "black", lwd = 2)

# 95% Confidence interval (red dashed)
lines(new_range$x, pred.CI[, "lwr"], col = "red",  lwd = 1.5, lty = 2)
lines(new_range$x, pred.CI[, "upr"], col = "red",  lwd = 1.5, lty = 2)

# 95% Prediction interval (blue dotted)
lines(new_range$x, pred.PI[, "lwr"], col = "blue", lwd = 1.5, lty = 3)
lines(new_range$x, pred.PI[, "upr"], col = "blue", lwd = 1.5, lty = 3)

legend("topleft",
       legend = c("Fitted line", "95% CI", "95% PI"),
       col    = c("black", "red", "blue"),
       lty    = c(1, 2, 3), lwd = 2, bty = "n")

Why is the PI always wider than the CI? The CI captures only the uncertainty in estimating the mean of \(Y\), while the PI must additionally account for the natural scatter of individual observations around that mean. The gap between CI and PI bands is largest near the extremes of \(x\) and narrowest near \(\bar{x}\).


Practice II: Job and Life Satisfaction

We continue with the same data used in Lab 1 Practice II.

📂 Data file needed: JobSatisfaction.csvDownload from Google Drive

Variables: Age as \(X\), Income as \(Y\).
Research question: Can we predict income from age using a linear model?

Step 1 — Import data

setwd("C:\\Users\\YourName\\Downloads")
Mydata2 <- read.csv("JobSatisfaction.csv", header = TRUE)

x2 <- Mydata2$Age
y2 <- Mydata2$Income

Step 2 — Fit the model

fit2 <- lm(y2 ~ x2)
summary(fit2)
confint(fit2)

Step 3 — Plot fitted line

plot(x2, y2,
     xlab = "Age (years)", ylab = "Income (Baht)",
     main = "Regression Line: Income vs. Age",
     pch  = 16, col = "steelblue")
abline(fit2, col = "red", lwd = 2)

Step 4 — Residual analysis

res2 <- resid(fit2)

par(mfrow = c(1, 3))
plot(fitted(fit2), res2,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted", pch = 16, col = "steelblue")
abline(h = 0, lty = 2, col = "red")
hist(res2, main = "Histogram of Residuals",
     xlab = "Residuals", col = "lightblue", border = "white")
qqnorm(res2, main = "Normal Q-Q Plot", pch = 16)
qqline(res2, col = "red")
par(mfrow = c(1, 1))

Step 5 — CI and PI bands

new2     <- data.frame(x2 = sort(x2))
pred.CI2 <- predict(fit2, newdata = new2, interval = "confidence")
pred.PI2 <- predict(fit2, newdata = new2, interval = "prediction")

plot(x2, y2,
     xlab = "Age (years)", ylab = "Income (Baht)",
     main = "Fitted Line with 95% CI and PI Bands",
     pch  = 16, col = "steelblue",
     ylim = range(c(pred.PI2[, "lwr"], pred.PI2[, "upr"])))

lines(new2$x2, pred.CI2[, "fit"], col = "black", lwd = 2)
lines(new2$x2, pred.CI2[, "lwr"], col = "red",  lty = 2, lwd = 1.5)
lines(new2$x2, pred.CI2[, "upr"], col = "red",  lty = 2, lwd = 1.5)
lines(new2$x2, pred.PI2[, "lwr"], col = "blue", lty = 3, lwd = 1.5)
lines(new2$x2, pred.PI2[, "upr"], col = "blue", lty = 3, lwd = 1.5)

legend("topleft",
       legend = c("Fitted line", "95% CI", "95% PI"),
       col    = c("black", "red", "blue"),
       lty    = c(1, 2, 3), lwd = 2, bty = "n")

Assignment: Midterm and Final Score

We continue with the MidtermandFinal.csv data from Lab 1. Now we extend the analysis from correlation to regression.

📂 Data file needed: MidtermandFinal.csvDownload from Google Drive

  1. Obtain the least-squares regression line for predicting final score (\(Y\)) from midterm score (\(X\)). Write the fitted equation.

  2. Interpret \(\hat{\beta}_0\) and \(\hat{\beta}_1\) in the context of this problem.

  3. Find \(MS_{res}\) (= residual standard error squared) and state its units.

  4. Test whether midterm score is a significant linear predictor of final score at \(\alpha = 0.05\). Use the 4-step procedure.

  5. Find the 95% confidence intervals for \(\beta_0\) and \(\beta_1\).

  6. Find \(r^2\) and interpret its value in context.

  7. Check the regression assumptions using residual plots (residuals vs. fitted, histogram, Q-Q plot). Are the assumptions satisfied?

  8. Find the 95% confidence interval and 95% prediction interval for the final score of a student with a midterm score of 75. Explain the difference between the two intervals.

  9. Plot the fitted regression line along with its 95% CI and PI bands for all values of \(x\).


Lab 3: Multiple Linear Regression

Objectives: Students are able to use R to:

  1. Perform descriptive statistics for multiple linear regression
  2. Select independent variables into the model
  3. Perform multiple linear regression analysis and make inference on regression parameters
  4. Interpret the results

Key concepts: The multiple linear regression model with \(p\) predictors is:

\[Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i\]

We estimate the coefficients by least squares and evaluate the model with:

  • Overall F-test: tests \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)
  • Individual t-tests: tests \(H_0: \beta_j = 0\) for each predictor
  • \(R^2\) (multiple): proportion of variance in \(Y\) explained by all predictors together
  • Adjusted \(R^2\): penalises for the number of predictors; better for comparing models of different sizes

Practice: Farming Data

A farm owner wants to reduce monthly water costs. He collected data over 12 months on:

Variable Description Type
\(x_1\) Average monthly temperature (°F) Quantitative
\(x_2\) Yield (lb) Quantitative
\(x_3\) Number of days agricultural operations were carried out Quantitative
\(x_4\) Average number of workers employed Quantitative
\(x_5\) Average amount of red light Quantitative
\(y\) Average monthly water consumption (gallons) Response

Step 1 — Import data

📂 Data file needed: Farming.csvDownload from Google Drive

setwd("C:\\Users\\YourName\\Downloads")
dt <- read.csv("Farming.csv", header = TRUE)
str(dt)
summary(dt)

Step 2 — Descriptive statistics and correlation matrix

Since all variables are quantitative, we examine pairwise scatterplots and correlations.

# Pairwise correlation matrix
cor(dt[, c("y", "x1", "x2", "x3", "x4", "x5")])

# Pairwise scatterplot matrix
pairs(dt[, c("y", "x1", "x2", "x3", "x4", "x5")],
      main = "Scatterplot Matrix for Farming Data",
      pch  = 16, col = "steelblue")

Interpretation: Look at the first row (or column) to see which predictors appear most linearly related to \(y\). Also check for high correlations between predictors — this is called multicollinearity and can cause instability in coefficient estimates.

Step 3 — Full model (all five predictors)

full <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dt)
summary(full)
anova(full)

Key outputs:

  • anova(full) gives the sequential (Type I) ANOVA table, showing the reduction in \(SS_{res}\) as each predictor is added.
  • The overall F-test p-value tests whether at least one \(\beta_j \neq 0\).
  • Some individual predictors may be non-significant even if the overall model is — this is common with correlated predictors.

Step 4 — Variable selection

With many predictors we use automatic selection to find the best subset. All three methods use AIC (Akaike Information Criterion) — lower AIC = better model.

null <- lm(y ~ 1, data = dt)            # intercept-only (no predictors)
full <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dt)

# Forward selection: starts empty, adds one predictor at a time
fw.fit <- step(null,
               scope     = list(lower = null, upper = full),
               direction = "forward",
               trace     = 1)

# Backward elimination: starts full, removes one predictor at a time
be.fit <- step(full, direction = "backward", trace = 1)

# Stepwise regression: both directions (generally preferred)
sw.fit <- step(full, direction = "both",     trace = 1)

Comparing the three methods: They do not always agree. Stepwise regression is generally preferred as it reconsiders variables at each step.

Step 5 — Fitted model from stepwise regression

summary(sw.fit)     # coefficients, R², F-test
anova(sw.fit)       # ANOVA table
confint(sw.fit)     # 95% CI for all coefficients

How to write the fitted equation: If stepwise selected \(x_1\) and \(x_3\), for example: \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_3 x_3\] Substitute the Estimate values from summary() output.


Assignment: Farming Data (continued)

Using the results from the Practice section above, answer the following:

  1. Which predictors were selected by each method (forward, backward, stepwise)? Do they agree?

  2. Write the fitted regression equation from the stepwise model.

  3. Test the overall fit at \(\alpha = 0.05\) using the 4-step hypothesis testing procedure.

  4. Interpret each significant regression coefficient in the context of the problem.

  5. Find the 95% CI for each coefficient in the selected model.

  6. Find \(R^2\) and adjusted \(R^2\). What do they tell you?


Lab 4: Multiple Linear Regression with Dummy Variables

Objectives: Students are able to use R to:

  1. Recognise when and why dummy variables are needed
  2. Transform a qualitative variable into dummy variables
  3. Fit a multiple regression model with dummy variables
  4. Interpret all regression coefficients, including dummy variable coefficients
  5. Derive and compare the fitted equations for each group

Key concepts — Dummy variables: When a predictor is qualitative (categorical), we cannot enter it directly into a regression model as a number. Instead, we create \(k - 1\) dummy (indicator) variables for a variable with \(k\) groups. The omitted group is called the reference category.

Each dummy variable’s coefficient \(\hat{\beta}_j\) represents the estimated difference in the mean response between that group and the reference group, holding all other predictors constant.

Why \(k - 1\) dummies, not \(k\)? Using \(k\) dummies would create perfect multicollinearity (the dummy columns would always sum to 1, which equals the intercept column). Dropping one group — the reference — avoids this problem. The intercept then represents the mean response for the reference group.

General dummy coding scheme for a variable with \(k = 3\) groups (A = reference, B, C):

Group \(D_B\) \(D_C\)
A (reference) 0 0
B 1 0
C 0 1

The model becomes: \[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 D_B + \hat{\beta}_3 D_C\]

This gives separate parallel regression lines for each group:

Group Fitted equation Intercept
A (reference) \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1\) \(\hat{\beta}_0\)
B \(\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 X_1\) \(\hat{\beta}_0 + \hat{\beta}_2\)
C \(\hat{Y} = (\hat{\beta}_0 + \hat{\beta}_3) + \hat{\beta}_1 X_1\) \(\hat{\beta}_0 + \hat{\beta}_3\)

All three lines share the same slope \(\hat{\beta}_1\) but have different intercepts.


Practice I: House Price

Research question: Can sale price be predicted from the number of households and location type?

Variables:

Variable Name in data Type
Sale price (Baht) SalesPrice Response (\(Y\))
Number of households NumberOfHousehold Quantitative predictor (\(X_1\))
Location Location Qualitative predictor (Residential / Street / Mall)

Step 1 — Import data

📂 Data file needed: HousePrices.csvDownload from Google Drive

setwd("C:\\Users\\YourName\\Downloads")
data <- read.csv("HousePrices.csv", header = TRUE)
head(data)
str(data)

Step 2 — Descriptive statistics

Before fitting, explore the relationship between each predictor and the response separately.

y  <- data$SalesPrice
x1 <- data$NumberOfHousehold   # quantitative predictor
x2 <- data$Location            # qualitative predictor

# Summaries
summary(y)
summary(x1)
table(x2)   # count of each Location group

# Correlation between the quantitative predictor and response
cor(x1, y)

# Scatterplot coloured by Location
plot(x1, y,
     xlab = "Number of Households", ylab = "Sale Price (Baht)",
     pch  = 16, col = as.integer(as.factor(x2)),
     main = "Sale Price vs. Households, by Location")
legend("topleft",
       legend = levels(as.factor(x2)),
       col    = 1:length(levels(as.factor(x2))),
       pch = 16, bty = "n")

# Boxplot: sale price by Location
boxplot(y ~ x2,
        xlab = "Location", ylab = "Sale Price (Baht)",
        main = "Sale Price by Location",
        col  = c("lightblue", "lightgreen", "lightyellow"))

What to look for:

  • Do the groups (Residential, Street, Mall) appear to have different average sale prices in the boxplot? If yes, location is likely a useful predictor.
  • In the scatterplot, do the three groups of points appear to sit at different vertical levels? If so, they may have different intercepts.
  • Is the relationship between households and sale price roughly linear within each group?
# Alternative using ggplot2 (install once: install.packages("ggplot2"))
library(ggplot2)

# Scatterplot coloured by Location
ggplot(data, aes(x = NumberOfHousehold, y = SalesPrice, colour = Location)) +
  geom_point(size = 3) +
  labs(title = "Sale Price vs. Households by Location",
       x = "Number of Households", y = "Sale Price (Baht)") +
  theme_minimal()

# Boxplot by Location
ggplot(data, aes(x = Location, y = SalesPrice, fill = Location)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Sale Price by Location",
       x = "Location", y = "Sale Price (Baht)") +
  theme_minimal()

Step 3 — Create dummy variables

Location has \(k = 3\) groups → we need \(k - 1 = 2\) dummy variables. We choose Residential as the reference category (the baseline group).

# D_Street = 1 if Location is "Street", 0 otherwise
x2.dummy <- ifelse(data$Location == "Street", 1, 0)

# D_Mall = 1 if Location is "Mall", 0 otherwise
x3.dummy <- ifelse(data$Location == "Mall", 1, 0)

# Verify the coding is correct: print side-by-side
data.frame(Location  = data$Location,
           D_Street  = x2.dummy,
           D_Mall    = x3.dummy)

Dummy coding summary:

Location \(D_{\text{Street}}\) \(D_{\text{Mall}}\)
Residential (reference) 0 0
Street 1 0
Mall 0 1

You can verify: when both dummies are 0, the observation must be Residential. ✓

Step 4 — Fit the full model

We fit the model with \(X_1\) (NumberOfHousehold), \(D_{\text{Street}}\), and \(D_{\text{Mall}}\) as predictors.

fit.full <- lm(y ~ x1 + x2.dummy + x3.dummy)
summary(fit.full)
anova(fit.full)
confint(fit.full, level = 0.95)

Reading the summary() output:

  • (Intercept): \(\hat{\beta}_0\) — estimated mean sale price for the Residential group when NumberOfHousehold = 0.
  • x1: \(\hat{\beta}_1\) — estimated change in sale price for each additional household, within any location group (slope is shared).
  • x2.dummy: \(\hat{\beta}_2\) — estimated difference in mean sale price between Street and Residential, holding \(X_1\) constant.
  • x3.dummy: \(\hat{\beta}_3\) — estimated difference in mean sale price between Mall and Residential, holding \(X_1\) constant.

A positive \(\hat{\beta}_2\) means Street locations have higher average sale prices than Residential; a negative \(\hat{\beta}_2\) means lower.

Step 5 — Write the fitted equation and derive group-specific equations

From the Estimate column of summary(fit.full), substitute values into:

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 D_{\text{Street}} + \hat{\beta}_3 D_{\text{Mall}}\]

Then derive the three separate equations by substituting the dummy values:

coef(fit.full)   # extract all coefficient estimates

# The three group equations are derived as follows:
# Residential (D_Street=0, D_Mall=0):
#   yhat = b0 + b1*x1

# Street (D_Street=1, D_Mall=0):
#   yhat = (b0 + b2) + b1*x1

# Mall (D_Street=0, D_Mall=1):
#   yhat = (b0 + b3) + b1*x1

Interpretation example (replace with your actual estimates):

Suppose summary() gives: \(\hat{\beta}_0 = 50{,}000\), \(\hat{\beta}_1 = 3{,}000\), \(\hat{\beta}_2 = 12{,}000\), \(\hat{\beta}_3 = 25{,}000\).

  • Residential: \(\hat{y} = 50{,}000 + 3{,}000 \, x_1\)
  • Street: \(\hat{y} = 62{,}000 + 3{,}000 \, x_1\) (intercept increases by 12,000)
  • Mall: \(\hat{y} = 75{,}000 + 3{,}000 \, x_1\) (intercept increases by 25,000)

Interpretation: “Mall locations are estimated to sell for 25,000 Baht more on average than Residential locations with the same number of households.”

Step 6 — Plot the three fitted lines

# Compute fitted values
yhat <- fitted(fit.full)

# Get coefficient estimates
b <- coef(fit.full)
b0 <- b["(Intercept)"]
b1 <- b["x1"]
b2 <- b["x2.dummy"]
b3 <- b["x3.dummy"]

x1_range <- seq(min(x1), max(x1), length.out = 100)

# Base scatterplot coloured by Location
plot(x1, y,
     xlab = "Number of Households", ylab = "Sale Price (Baht)",
     pch  = 16, col = as.integer(as.factor(x2)),
     main = "Fitted Regression Lines by Location")

# Add a fitted line for each group
lines(x1_range, b0            + b1 * x1_range, col = 1, lwd = 2)  # Residential
lines(x1_range, b0 + b2       + b1 * x1_range, col = 2, lwd = 2)  # Street
lines(x1_range, b0       + b3 + b1 * x1_range, col = 3, lwd = 2)  # Mall

legend("topleft",
       legend = c("Residential", "Street", "Mall"),
       col    = 1:3, lty = 1, lwd = 2, pch = 16, bty = "n")

What this plot shows: Three parallel lines — same slope, different intercepts. The vertical gap between any two lines equals the corresponding dummy coefficient. If the lines were not parallel, that would suggest an interaction between location and households (a more advanced model not covered here).

Step 7 — Plot observed vs. predicted

plot(x1, y,
     xlab = "Number of Households", ylab = "Sale Price (Baht)",
     pch  = 1, main = "Observed vs. Predicted Sale Price")
points(x1, yhat, pch = 20, col = "red")
legend("topleft",
       legend = c("Observed y", "Predicted ŷ"),
       pch    = c(1, 20), col = c("black", "red"), bty = "n")

Assignment: Score and EQ — Part 1 (Dummy Coding and Interpretation)

📂 Data file needed: ScoreAndEQ.csvDownload from Google Drive

Variables: Y (post-test score), Pretest (pre-test score), EQ (emotional quotient score), Method (teaching method: A, B, C, D).

  1. Explore the data:

    1. Create scatterplots of \(Y\) vs. Pretest and \(Y\) vs. EQ. Describe each relationship (form, direction, strength).
    2. Create a boxplot of \(Y\) by Method. Do the teaching methods appear to differ in mean score?
  2. Method is a qualitative variable with \(k = 4\) groups. Create \(k - 1 = 3\) dummy variables. Clearly state:

    1. Which group is the reference category.
    2. The complete dummy coding table.
    3. The R code used to create the dummies and verify them.
  3. Fit the full model: \(Y = \beta_0 + \beta_1 \cdot \text{Pretest} + \beta_2 \cdot \text{EQ} + \beta_3 D_B + \beta_4 D_C + \beta_5 D_D + \varepsilon\). Show the R output and write the fitted equation.

  4. Interpret each coefficient in the context of the problem. Pay particular attention to the dummy coefficients — what do they tell you about the effect of teaching method?

  5. Write the separate fitted equation for students taught by each of Method A, B, C, and D. Comment on how the methods compare.

  6. Plot all four fitted lines on a single scatterplot of \(Y\) vs. Pretest, with points coloured by Method.


Lab 5: Variable Selection and Model Diagnostics

Objectives: Students are able to use R to:

  1. Apply automatic variable selection methods (forward, backward, stepwise)
  2. Understand and compare AIC-based selection criteria
  3. Perform a comprehensive model diagnostic analysis
  4. Interpret diagnostic plots and formal tests for regression assumptions
  5. Identify influential observations and outliers
  6. Make predictions using the selected model

Key concepts — Variable selection: When we have many candidate predictors, automatic selection methods help find a parsimonious model (few predictors, good fit). All three methods below use AIC (Akaike Information Criterion): lower AIC = better trade-off between fit and complexity.

Method Starting model Strategy
Forward selection Intercept only Add the predictor that reduces AIC the most, repeat
Backward elimination Full model Remove the predictor that reduces AIC the most, repeat
Stepwise (both) Full model Add or remove at each step — generally preferred

Important limitation: Automated selection does not guarantee the “best” model in any absolute sense. Always check whether the selected model makes practical sense and satisfies the regression assumptions.

Key concepts — Model diagnostics: Before using a fitted model for inference or prediction, we must verify four assumptions of the regression model \(Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \varepsilon_i\):

Assumption What it means How to check
Linearity \(E(\varepsilon_i) = 0\); no pattern in residuals Residuals vs. Fitted plot
Normality \(\varepsilon_i \sim N(0, \sigma^2)\) Histogram, Q-Q plot, Shapiro-Wilk test
Constant variance \(\text{Var}(\varepsilon_i) = \sigma^2\) (homoscedasticity) Scale-Location plot, Breusch-Pagan test
Independence Residuals are uncorrelated Residuals vs. order plot, Durbin-Watson test

Additionally, we check for influential observations — data points that have an unusually large effect on the fitted model.


Practice I: House Price (continued)

We continue with the HousePrices.csv data from Lab 4. The goal is now to select the best subset of predictors and validate the model assumptions.

📂 Data file needed: HousePrices.csvDownload from Google Drive

Step 1 — Re-import data and recreate variables

setwd("C:\\Users\\YourName\\Downloads")
data <- read.csv("HousePrices.csv", header = TRUE)

y        <- data$SalesPrice
x1       <- data$NumberOfHousehold
x2.dummy <- ifelse(data$Location == "Street", 1, 0)
x3.dummy <- ifelse(data$Location == "Mall",   1, 0)

Step 2 — Variable selection

null <- lm(y ~ 1)                           # intercept-only model
full <- lm(y ~ x1 + x2.dummy + x3.dummy)   # full model from Lab 4

# Forward selection
fw.fit <- step(null,
               scope     = list(lower = null, upper = full),
               direction = "forward",
               trace     = 1)

# Backward elimination
be.fit <- step(full,
               direction = "backward",
               trace     = 1)

# Stepwise regression (both directions) — generally preferred
sw.fit <- step(full,
               direction = "both",
               trace     = 1)

Reading the step() output: Each row shows one candidate model. The columns are:

  • Df: degrees of freedom used by the variable being added/removed
  • AIC: AIC of the candidate model after the change

The algorithm selects the change that gives the lowest AIC. It stops when no change improves AIC.

Comparing the three methods: Write down which variables each method selected. Do they agree? Stepwise is preferred because it can reconsider earlier decisions.

Step 3 — Examine the selected model

# Use the stepwise-selected model
summary(sw.fit)    # coefficients, R², overall F-test
anova(sw.fit)      # ANOVA table: SS, MS, F for each term
confint(sw.fit, level = 0.95)   # 95% CI for each coefficient

Overall F-test (4-step procedure):

  1. \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) (no predictors are useful) vs. \(H_a\): at least one \(\beta_j \neq 0\)
  2. \(F = MS_{reg} / MS_{res}\), with \(df_1 = p\) and \(df_2 = n - p - 1\)
  3. p-value from the F-statistic line in summary()
  4. Conclusion in context

Step 4 — Comprehensive model diagnostics

R’s built-in plot() for lm objects produces four diagnostic plots at once. Understanding each one is essential.

par(mfrow = c(2, 2))
plot(sw.fit)
par(mfrow = c(1, 1))

Interpreting the four diagnostic plots:

Plot 1 — Residuals vs. Fitted - Checks: linearity and constant variance - Good: points scatter randomly around the horizontal zero line, with roughly equal spread across all fitted values - Warning signs: a curve (non-linearity) or a funnel/fan shape (non-constant variance)

Plot 2 — Normal Q-Q - Checks: normality of residuals - Good: points follow the diagonal reference line closely - Warning signs: S-shaped curve (heavy tails), points bending away at the ends

Plot 3 — Scale-Location (√|standardised residuals| vs. Fitted) - Checks: constant variance (more sensitive than Plot 1) - Good: a roughly horizontal red line with points evenly spread - Warning signs: a clear upward or downward trend in the red line

Plot 4 — Residuals vs. Leverage - Checks: influential observations - Points with high leverage AND large residuals are most dangerous - Points outside the Cook’s distance dashed lines (if visible) are highly influential

Step 5 — Formal tests for each assumption

res <- resid(sw.fit)

# --- Normality ---
# Visual: histogram and Q-Q plot side by side
par(mfrow = c(1, 2))
hist(res, main = "Histogram of Residuals",
     xlab = "Residuals", col = "lightblue", border = "white",
     freq = FALSE)                        # density scale
curve(dnorm(x, mean = mean(res), sd = sd(res)),
      add = TRUE, col = "red", lwd = 2)  # overlay normal curve

qqnorm(res, main = "Normal Q-Q Plot", pch = 16, col = "steelblue")
qqline(res, col = "red", lwd = 2)
par(mfrow = c(1, 1))

# Shapiro-Wilk test (best for n < 50; use for up to ~2000 obs)
shapiro.test(res)
# H0: residuals are normally distributed
# Satisfied if p-value > 0.05

# --- Constant variance (homoscedasticity) ---
# install.packages("lmtest")   # run once if not yet installed
library(lmtest)
bptest(sw.fit)
# H0: variance is constant (homoscedastic)
# Satisfied if p-value > 0.05

# --- Independence (no autocorrelation) ---
dwtest(sw.fit)
# H0: residuals are not autocorrelated
# Satisfied if p-value > 0.05 (DW statistic close to 2 is ideal)

Summary of formal tests:

Test \(H_0\) Assumption satisfied if
Shapiro-Wilk Residuals are normally distributed p-value > 0.05
Breusch-Pagan Variance is constant (homoscedastic) p-value > 0.05
Durbin-Watson No autocorrelation in residuals p-value > 0.05

If an assumption is violated, consider:

Violation Possible remedy
Non-linearity Add a polynomial term (e.g. \(X^2\)) or transform \(X\)
Non-constant variance Apply log or square-root transformation to \(Y\)
Non-normality Transform \(Y\); check for outliers
Autocorrelation Include a time trend; use time-series methods

Step 6 — Checking for influential observations

An observation can be unusual in two ways:

  • Outlier: unusual response value (large residual)
  • High-leverage point: unusual predictor value (far from the centre of the \(X\) space)
  • Influential point: changes the fitted model substantially if removed (high Cook’s distance)
# Standardised residuals — values beyond ±2 are potential outliers
std.res <- rstandard(sw.fit)
which(abs(std.res) > 2)   # row indices of potential outliers

# Leverage values — rule of thumb: high if > 2*(p+1)/n
n <- nrow(data)
p <- length(coef(sw.fit)) - 1   # number of predictors
hat.vals <- hatvalues(sw.fit)
which(hat.vals > 2 * (p + 1) / n)

# Cook's distance — rule of thumb: influential if > 4/n
cooks <- cooks.distance(sw.fit)
which(cooks > 4 / n)

# Combined influence plot
plot(sw.fit, which = 4,
     main = "Cook's Distance",
     sub  = paste("Threshold = 4/n =", round(4/n, 4)))
abline(h = 4/n, col = "red", lty = 2)

What to do with influential observations: Do NOT automatically delete them. First: 1. Check whether the data point was recorded correctly (data entry error?). 2. Fit the model with and without the point and compare — if results change substantially, report both. 3. If justified, remove the point and note it in your report.

Step 7 — Prediction using the selected model

# Predict sale price for a new house:
# NumberOfHousehold = 10, Location = "Street" (D_Street=1, D_Mall=0)
new_obs <- data.frame(x1       = 10,
                      x2.dummy = 1,
                      x3.dummy = 0)

# Point estimate
predict(sw.fit, newdata = new_obs)

# 95% confidence interval for the mean sale price at these values
predict(sw.fit, newdata = new_obs, interval = "confidence", level = 0.95)

# 95% prediction interval for a single new house
predict(sw.fit, newdata = new_obs, interval = "prediction", level = 0.95)

Assignment: Score and EQ — Part 2 (Variable Selection and Diagnostics)

📂 Data file needed: ScoreAndEQ.csvDownload from Google Drive

This assignment continues directly from Lab 4 Assignment. Use the dummy variables you created there.

  1. Apply forward selection, backward elimination, and stepwise regression to select the best predictors of \(Y\) from {Pretest, EQ, D_B, D_C, D_D}. Compare the three methods — do they agree?

  2. Write the fitted equation from the stepwise-selected model. Interpret each coefficient in context.

  3. Test the overall fit of the selected model at \(\alpha = 0.05\) using the 4-step procedure. Report \(R^2\) and adjusted \(R^2\) and explain what they tell you.

  4. Produce R’s four built-in diagnostic plots (plot(model)). For each plot, describe what you see and state what assumption it is checking.

  5. Carry out the three formal assumption tests:

    1. Shapiro-Wilk test for normality
    2. Breusch-Pagan test for constant variance
    3. Durbin-Watson test for independence

    For each test: state \(H_0\), report the test statistic and p-value, and conclude whether the assumption is satisfied.

  6. Check for influential observations using Cook’s distance. Report any observations that exceed the threshold \(4/n\) and describe what you would do with them.

  7. Find the 90% and 95% confidence intervals for each regression coefficient in the selected model.

  8. Predict the post-test score for a new student with Pretest = 60, EQ = 70, taught by Method C. Give both a 95% confidence interval and a 95% prediction interval, and explain the difference.


Lab 6: Non-Parametric Statistics I

Objectives: Students are able to:

  1. Perform descriptive statistics
  2. Apply appropriate non-parametric tests to answer research questions

When to use non-parametric tests: Use non-parametric methods when:

  • The sample size is small and normality cannot be assumed
  • Data are ordinal (ranked) rather than continuous
  • Assumptions of the corresponding parametric test are clearly violated
Parametric test Non-parametric equivalent Use case
One-sample t-test Wilcoxon signed-rank test One sample, test about median
Independent samples t-test Mann-Whitney U (Wilcoxon rank-sum) Two independent samples
Paired t-test Wilcoxon signed-rank test (paired) Two related samples
One-way ANOVA Kruskal-Wallis test Three or more independent samples
Repeated-measures ANOVA Friedman test Three or more related samples
Pearson correlation Spearman rank correlation Association, non-normal data

Practice I: AStore Data

📂 Data file needed: AStore_data.csvDownload from Google Drive

Variables: Age, ItemsPerchased, Sales, DiscountAmount, Gender, MaritalStatus, MethodOfPayment

Import and prepare data

setwd("C:\\Users\\YourName\\Downloads")
dt <- read.csv("AStore_data.csv", header = TRUE)
head(dt)
str(dt)

# Convert qualitative variables to factors
dt$MethodOfPayment <- as.factor(dt$MethodOfPayment)
dt$Gender          <- as.factor(dt$Gender)
dt$MaritalStatus   <- as.factor(dt$MaritalStatus)

Question A: Is the median age of customers equal to 45?

One sample, test about a median → Wilcoxon signed-rank test

summary(dt$Age)
sd(dt$Age)

par(mfrow = c(1, 2))
boxplot(dt$Age, main = "Boxplot of Age", ylab = "Age")
hist(dt$Age, main = "Histogram of Age",
     xlab = "Age", col = "lightblue", border = "white")
par(mfrow = c(1, 1))

# H0: median age = 45    Ha: median age ≠ 45
wilcox.test(x = dt$Age, mu = 45, alternative = "two.sided")

Interpretation: The Wilcoxon signed-rank test tests \(H_0\): population median = mu. - p-value < 0.05: reject \(H_0\); the median age is significantly different from 45. - p-value ≥ 0.05: insufficient evidence to conclude the median differs from 45.

Question B: Is the median number of purchased items less than 7?

summary(dt$ItemsPerchased)

par(mfrow = c(1, 2))
boxplot(dt$ItemsPerchased, ylab = "Items Purchased", main = "Boxplot")
hist(dt$ItemsPerchased, xlab = "Items Purchased",
     main = "Histogram", col = "lightblue", border = "white")
par(mfrow = c(1, 1))

# H0: median = 7    Ha: median < 7
wilcox.test(x = dt$ItemsPerchased, mu = 7, alternative = "less")

Practice II: Music and Exercise

📂 Data file needed: Music_data.csvDownload from Google Drive

Design: Each runner exercised under three music conditions (None, Classical, Dance). This is a paired (repeated measures) design → use the paired Wilcoxon signed-rank test.

Step 1 — Import and prepare data

setwd("C:\\Users\\YourName\\Downloads")
dt2 <- read.csv("Music_data.csv", header = TRUE)
head(dt2)
str(dt2)

dt2$Type <- as.factor(dt2$Type)

Step 2 — Explore data

tapply(dt2$Scale, dt2$Type, summary)

boxplot(dt2$Scale ~ dt2$Type,
        xlab = "Type of Music", ylab = "Exertion Scale",
        main = "Exertion by Music Type",
        col  = c("lightblue", "lightgreen", "lightyellow"))

Question A: Is there a difference in exertion between No Music and Classical?

# H0: median(None) = median(Classical)    Ha: ≠
wilcox.test(x           = dt2$Scale[dt2$Type == "None"],
            y           = dt2$Scale[dt2$Type == "Classical"],
            paired      = TRUE,
            alternative = "two.sided")

Question B: Is the exertion scale under Classical music higher than under Dance music?

# H0: median(Classical) = median(Dance)    Ha: Classical > Dance
wilcox.test(x           = dt2$Scale[dt2$Type == "Classical"],
            y           = dt2$Scale[dt2$Type == "Dance"],
            paired      = TRUE,
            alternative = "greater")

Why paired = TRUE? Each runner experienced all music conditions, so observations are matched by runner. Using paired = TRUE accounts for individual differences between runners and increases the power of the test.


Lab 7: Non-Parametric Statistics II

Objectives: Students are able to:

  1. Perform descriptive statistics
  2. Apply appropriate non-parametric tests (Mann-Whitney, Kruskal-Wallis, Spearman, Chi-squared, Friedman)

Practice I: AStore Data

Import and prepare data

📂 Data file needed: AStore_data.csvDownload from Google Drive

setwd("C:\\Users\\YourName\\Downloads")
dt <- read.csv("AStore_data.csv", header = TRUE)

dt$MethodOfPayment <- as.factor(dt$MethodOfPayment)
dt$Gender          <- as.factor(dt$Gender)
dt$MaritalStatus   <- as.factor(dt$MaritalStatus)

str(dt)

Question a: Is there a difference in average sales between female and male customers?

Two independent groups → Mann-Whitney U test (Wilcoxon rank-sum, paired = FALSE)

tapply(dt$Sales, dt$Gender, summary)
boxplot(dt$Sales ~ dt$Gender,
        xlab = "Gender", ylab = "Sales",
        main = "Sales by Gender",
        col  = c("lightpink", "lightblue"))

# H0: median sales equal for female and male    Ha: ≠
wilcox.test(x           = dt$Sales[dt$Gender == "female"],
            y           = dt$Sales[dt$Gender == "male"],
            paired      = FALSE,
            alternative = "two.sided")

Question b: Is the discount amount related to method of payment?

Three or more independent groups → Kruskal-Wallis test

tapply(dt$DiscountAmount, dt$MethodOfPayment, summary)
boxplot(dt$DiscountAmount ~ dt$MethodOfPayment,
        xlab = "Method of Payment", ylab = "Discount Amount",
        main = "Discount Amount by Payment Method")

# H0: median discount equal across all payment methods    Ha: at least one differs
kruskal.test(dt$DiscountAmount ~ dt$MethodOfPayment)

Post-hoc testing: If the Kruskal-Wallis test is significant (p < 0.05), we know at least one group differs but not which pairs. Use pairwise Wilcoxon tests with a Bonferroni correction:

pairwise.wilcox.test(dt$DiscountAmount, dt$MethodOfPayment,
                     p.adjust.method = "bonferroni")

Question c: Is there a (monotone) association between items purchased and discount amount?

Two quantitative variables, non-normal data → Spearman rank correlation

plot(dt$ItemsPerchased, dt$DiscountAmount,
     xlab = "Items Purchased", ylab = "Discount Amount",
     main = "Discount Amount vs. Items Purchased",
     pch = 16, col = "steelblue")

# H0: ρ_s = 0 (no monotone association)    Ha: ρ_s ≠ 0
cor.test(~ ItemsPerchased + DiscountAmount,
         data        = dt,
         method      = "spearman",
         alternative = "two.sided")

Spearman vs. Pearson: Spearman measures monotone association (consistently increasing or decreasing), not just linear. It is more robust to outliers and does not require normality.

Question d: Does sales tend to increase with customer age?

plot(dt$Age, dt$Sales,
     xlab = "Age", ylab = "Sales",
     main = "Sales vs. Age",
     pch = 16, col = "steelblue")

# H0: ρ_s = 0    Ha: ρ_s > 0 (positive monotone association)
cor.test(~ Age + Sales,
         data        = dt,
         method      = "spearman",
         alternative = "greater")

Question e: Is sales affected by marital status?

tapply(dt$Sales, dt$MaritalStatus, summary)
boxplot(dt$Sales ~ dt$MaritalStatus,
        xlab = "Marital Status", ylab = "Sales",
        main = "Sales by Marital Status")

kruskal.test(dt$Sales ~ dt$MaritalStatus)

# Post-hoc if significant:
pairwise.wilcox.test(dt$Sales, dt$MaritalStatus,
                     p.adjust.method = "bonferroni")

Question f: Is there a relationship between gender and method of payment?

Two categorical variables → Chi-squared test of independence

tab <- table(dt$Gender, dt$MethodOfPayment)
tab
prop.table(tab, margin = 1)   # row percentages

# H0: Gender and MethodOfPayment are independent    Ha: they are associated
chisq.test(dt$Gender, dt$MethodOfPayment)

Check the chi-squared conditions: All expected cell counts should be ≥ 5.

chisq.test(dt$Gender, dt$MethodOfPayment)$expected
# If any expected count < 5, use Fisher's exact test:
fisher.test(tab)

Practice II: Music Data — Friedman Test

📂 Data file needed: Music_data.csvDownload from Google Drive

Design: Each runner experienced all music types → repeated measures with 3+ groups → Friedman test.

setwd("C:\\Users\\YourName\\Downloads")
dt2 <- read.csv("Music_data.csv", header = TRUE)
dt2$Type <- as.factor(dt2$Type)
str(dt2)

Question: Are exertion scales affected by the type of music?

tapply(dt2$Scale, dt2$Type, summary)
boxplot(dt2$Scale ~ dt2$Type,
        xlab = "Type of Music", ylab = "Exertion Scale",
        main = "Exertion by Music Type",
        col  = c("lightblue", "lightgreen", "lightyellow"))

# H0: distributions of Scale are the same under all music types
# Ha: at least one music type leads to a different distribution
friedman.test(y = dt2$Scale, groups = dt2$Type, blocks = dt2$Runner)

Friedman test: The non-parametric equivalent of one-way repeated-measures ANOVA. It ranks observations within each block (runner) and tests whether rank sums differ across groups.

# Post-hoc pairwise comparisons (if Friedman test is significant)
pairwise.wilcox.test(dt2$Scale, dt2$Type,
                     paired          = TRUE,
                     p.adjust.method = "bonferroni")

Assignment: Non-Parametric Tests — AStore Data

📂 Data file needed: AStore_data.csvDownload from Google Drive

Using AStore_data.csv, answer the following questions. For each test, follow the 4-step procedure: state hypotheses, give the test statistic and p-value, state your conclusion in context.

  1. Is the median age of customers greater than 40?

  2. Is there a difference in the number of items purchased between male and female customers?

  3. Is the median sales amount different across marital status groups? If so, which groups differ significantly? (Use Bonferroni correction.)

  4. Is there a significant positive monotone association between Age and Items Purchased?

  5. Is there a relationship between Marital Status and Method of Payment? (Check the chi-squared expected count conditions first.)


End of Lab Materials