Data files


ปฎิบัติการ 1: Introduction to R

1. Installing R and RStudio

R is a programming language for statistical computing and graphics. RStudio is an IDE (Integrated Development Environment) that makes R much easier to use with a four-panel layout:

Panel Location Purpose
Script Editor Top-left Write and save code
Console Bottom-left Execute code, see output
Environment / History Top-right See objects in memory
Files / Plots / Help Bottom-right Browse files, view plots

Windows

  • Install R: Download the latest .exe installer and run it.
  • Install RStudio: Download the Windows installer and follow the instructions.

Mac

  • Install R: Download the latest .pkg, double-click, and follow instructions.
  • Install RStudio: Download the Mac version, open the file, and drag to Applications.

2. Basic Data Types

Type Example Description
numeric 4.5 Decimal numbers
integer 4L Whole numbers (also numeric)
logical TRUE / FALSE Boolean values
character "hello" Text / strings

Use class() to check the type of any object.

Numeric and Integer Values

# "<-" is the assignment operator (preferred in R; "=" also works)
a <- 7
a
## [1] 7
# c() combines values into a vector (R's basic data structure)
b <- c(1.25, 2.9, 3.0)
b
## [1] 1.25 2.90 3.00
# ":" creates an integer sequence from start to end
c <- 1:10
c
##  [1]  1  2  3  4  5  6  7  8  9 10
# seq(from, to, by) creates a sequence with a custom step size
d <- seq(1, 10, 0.5)
d
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0
# rep() repeats values
rep(99, times = 5)           # repeat a single value 5 times
## [1] 99 99 99 99 99
rep(c(1, 2, 3), times = 5)  # repeat the whole vector: 1 2 3 1 2 3 ...
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1, 2, 3), each = 5)   # repeat each element:  1 1 1 1 1 2 2 2 ...
##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

Character Values

x <- "Monday"
class(x)   # "character"
## [1] "character"
y <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
class(y)
## [1] "character"

Factors

A factor stores categorical data with a fixed set of levels. R uses factors automatically in models and summaries.

# as.factor() converts a character vector to a factor
yy <- as.factor(y)
yy
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
# Note: R shows the unique "Levels" below the values

3. Data Structures

Matrix

All elements must be the same type (e.g., all numeric). Data fills column-by-column by default.

mat0 <- matrix(NA, ncol = 3, nrow = 2)                      # empty matrix of NAs
mat1 <- matrix(1:6, ncol = 3, nrow = 2)                     # fills column-by-column
mat2 <- matrix(1:6, ncol = 3, nrow = 2, byrow = TRUE)       # fills row-by-row
mat0; mat1; mat2
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

List

The most flexible structure — each element can hold a different type of object.

# A list containing a numeric vector, a factor, and a matrix
lst <- list(d, yy, mat2)
lst
## [[1]]
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0
## 
## [[2]]
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
## 
## [[3]]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
# Access elements: lst[[1]], lst[[2]], lst[[3]]

Data Frame

The standard structure for statistical analysis. Each column is a variable; each row is an observation.

data(ToothGrowth)
head(ToothGrowth, n = 6)   # preview first 6 rows
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
str(ToothGrowth)            # variable names, types, and sample values
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

4. Reading Files

# setwd() sets the working directory (folder where data files are stored)
# On Windows: use "\\" (double backslash) or "/" as path separators
setwd("C:\\Users\\ASUS\\Downloads")

# read.csv() reads a comma-separated file into a data frame
# header = TRUE means the first row contains variable names
dt <- read.csv("bank_data.csv", header = TRUE)

head(dt)   # preview first 6 rows
str(dt)    # check variable names and types

After loading, convert categorical variables from "chr" to factor:

dt$job       <- as.factor(dt$job)
dt$marital   <- as.factor(dt$marital)
dt$education <- as.factor(dt$education)
dt$housing   <- as.factor(dt$housing)
dt$loan      <- as.factor(dt$loan)

str(dt)   # verify: these columns now show "Factor w/ N levels"

5. Numerical Summaries: Qualitative Data

summary(dt)   # counts for factors; five-number summary for numerics

# Single-variable frequency tables
table(dt$education)
table(dt$marital)
table(dt$housing)
table(dt$loan)

# Proportions and percentages
tt <- table(dt$marital)
prop.table(tt)        # proportions (sum to 1)
prop.table(tt) * 100  # percentages

# Two-way contingency table (rows = marital, columns = loan)
table(dt$marital, dt$loan)

# Add row/column totals
tb <- table(dt$marital, dt$loan)
addmargins(tb)

# Cell percentages
prop.table(tb) * 100

6. Numerical Summaries: Quantitative Data

# Custom mode function (R has no built-in statistical mode)
getmode <- function(x) {
  u   <- unique(x)
  tab <- tabulate(match(x, u))
  u[tab == max(tab)]
}
summary(dt$age)       # min, Q1, median, mean, Q3, max
mean(dt$age)
getmode(dt$age)       # most frequent value

min(dt$age)
range(dt$age)         # c(min, max)
IQR(dt$age)           # Q3 - Q1
sd(dt$age)            # sample standard deviation

# Quartiles
quantile(dt$age, probs = c(0.25, 0.50, 0.75))

# Coefficient of Variation (CV) = SD / Mean
# Measures relative variability — useful for comparing groups with different means
cv_age <- sd(dt$age) / mean(dt$age)
cv_age

# Compare CV between subgroups using [ ] subsetting
d1  <- dt$age[dt$housing == "no"]
d2  <- dt$age[dt$housing == "yes"]
cv1 <- sd(d1) / mean(d1)   # CV of age for housing = no
cv2 <- sd(d2) / mean(d2)   # CV of age for housing = yes
cv1; cv2

# aggregate(): compute summary statistics by group
# aggregate(response ~ group, data, FUN)
aggregate(age ~ housing, data = dt, FUN = mean)
aggregate(age ~ housing, data = dt, FUN = sd)
aggregate(age ~ housing, data = dt, FUN = summary)

# Two-way grouping: mean age by housing AND marital status
aggregate(age ~ housing + marital, data = dt, FUN = mean)

7. Graphical Presentation: Qualitative Data

Bar Graph

# barplot() requires a frequency table as input
tt <- table(dt$marital)
barplot(tt)
pie(tt, main = "Pie chart of Marital status")

# Stacked bar: composition of one variable within levels of another
counts <- table(dt$marital, dt$loan)
barplot(counts, legend = rownames(counts), xlab = "loan")

# Grouped (side-by-side) bar: easier to compare individual counts
# beside = TRUE places bars next to each other rather than stacking
barplot(counts,
        col    = c("blue", "red", "green"),
        legend = rownames(counts),
        beside = TRUE,
        xlab   = "loan")

8. Graphical Presentation: Quantitative Data

Base R Plots

# Histogram
hist(dt$age)
hist(dt$age, main = "Histogram of Age", xlab = "Age", ylab = "Count")
hist(dt$age, breaks = 20, main = "Histogram of Age", xlab = "Age")

# Boxplot — shows Q1, median, Q3, whiskers, and outliers
boxplot(dt$age, main = "Boxplot of age")

# Side-by-side boxplots by group
boxplot(dt$age ~ dt$marital,
        main = "Boxplot of age by Marital Status",
        xlab = "Marital", ylab = "Age")

# Scatter plot — relationship between two numeric variables
plot(x = dt$age, y = dt$balance, xlab = "Age", ylab = "Balance")

ggplot2 Plots

ggplot2 uses a layered grammar: ggplot(data, aes(...)) + geom_*() + theme_*()

# install.packages("ggplot2")  # run once if not yet installed
library(ggplot2)

# Scatter plot of age vs balance
ggplot(dt, aes(x = age, y = balance)) +
  geom_point() +
  theme_bw()

# Colour points by marital status (adds a third variable to the plot)
ggplot(dt, aes(x = age, y = balance)) +
  geom_point(aes(col = marital)) +
  theme_bw()

# Boxplot of age (no grouping)
ggplot(dt, aes(y = age)) +
  geom_boxplot() +
  theme_bw()

# Boxplot of age grouped by marital status
ggplot(dt, aes(x = marital, y = age)) +
  geom_boxplot() +
  theme_bw()

# Boxplot of age by marital, coloured by housing status
# This simultaneously displays THREE variables in one plot
ggplot(dt, aes(x = marital, y = age)) +
  geom_boxplot(aes(col = housing)) +
  theme_bw()

9. Assignment

Numerical presentation for bank_data.csv:

  1. Frequency table of education with proportions
  2. Two-way table of job × loan with percentages
  3. Summary statistics of balance
  4. Q1, Q2, Q3 and IQR of balance
  5. CV of balance by housing type
  6. Summary statistics of balance by housing
  7. SD of balance by housing and loan

Graphical presentation for bank_data.csv:

  1. Appropriate chart for education
  2. Chart comparing education × marital counts
  3. Appropriate chart for balance
  4. Chart comparing balance across education groups
  5. Chart showing the relationship between age and duration

ปฎิบัติการ 2: Simple Linear Regression (การถดถอยเชิงเส้นอย่างง่าย)

วัตถุประสงค์ นักศึกษาสามารถ:

  • วิเคราะห์การถดถอยเชิงเส้นอย่างง่ายด้วย R ได้
  • หาค่าพยากรณ์พร้อมช่วงความเชื่อมั่นได้
  • จำลองข้อมูลเพื่อตรวจสอบการแจกแจงของตัวประมาณพารามิเตอร์ได้

Task 1: The Rocket Data

Background: This dataset records the shear strength of a rocket propellant bond (Sher.Strength) and the age of the propellant in weeks (Age.of.Propellant). We want to model how age affects shear strength using simple linear regression.

setwd("C:\\Users\\ASUS\\Downloads")
dt1 <- read.csv("TheRocketData.csv", header = TRUE)
str(dt1)

# Assign shorter variable names for convenience
x <- dt1$Age.of.Propellant   # predictor (X)
y <- dt1$Sher.Strength        # response (Y)

Step 1: Scatter Plot — always visualise before fitting

# Look for: linear trend, curvature, outliers, unequal spread
plot(x, y, xlab = "Age of Propellant", ylab = "Shear Strength")

Step 2: Correlation

cor(x, y)   # sample Pearson correlation coefficient r

# cor.test() formally tests H₀: ρ = 0 (no linear association)
# Output: r, t-statistic, p-value, and confidence interval for ρ
cor.test(x, y, alternative = "two.sided")                    # 95% CI (default)
cor.test(x, y, alternative = "two.sided", conf.level = 0.90) # 90% CI

Step 3: Fit the Model \(Y = \beta_0 + \beta_1 X + \epsilon\)

# lm(response ~ predictor) fits the OLS regression line
# Use "~" (Alt + 126 on Thai keyboard)
fit <- lm(y ~ x)

# summary() gives:
#   Coefficients: β̂₀ (Intercept) and β̂₁ (slope), SE, t-value, p-value
#   R-squared: proportion of variance in Y explained by X
#   Residual SE: estimate of σ
summary(fit)

# anova() gives the ANOVA table: SSR, SSE, MSR, MSE, F-statistic
# F = MSR / MSE tests whether β₁ ≠ 0 (overall model significance)
anova(fit)

# Manual verification: SSreg=1527483, SSres=166255, df_res=18
# MSres = 166255/18 = 9236; F = 1527483/9236 = 165.3
1527483 / 9236

Step 4: Confidence Intervals for Regression Coefficients

# confint.lm() returns CIs for β₀ and β₁
confint.lm(fit, level = 0.90)   # 90% CI
confint.lm(fit, level = 0.95)   # 95% CI
confint.lm(fit, level = 0.99)   # 99% CI

Steps 7–8: Prediction for a Specific x Value

# Predict response when x = 10
xnew <- data.frame(x = 10)

# Point estimate
yhat <- predict.lm(fit, newdata = xnew)
yhat

# 95% Confidence Interval for E(Y | x = 10)
# Estimates the MEAN response for ALL units with x = 10
predict.lm(fit, newdata = xnew, interval = "confidence")

# 95% Prediction Interval for a NEW individual Y when x = 10
# Estimates the range for a SINGLE future observation; always wider than CI
predict.lm(fit, newdata = xnew, interval = "prediction")

Key difference:

  • Confidence interval — precision of the estimated mean E(Y|x); narrows as sample size increases
  • Prediction interval — range for a future individual Y; always wider because it includes individual variability σ²

Step 9: Plot with Confidence and Prediction Bands

# Generate a fine grid for smooth band curves
new     <- data.frame(x = seq(min(x), max(x), len = 100))
pred.cf <- predict(fit, new, interval = "confidence")   # CI for E(Y)
pred.pd <- predict(fit, new, interval = "prediction")   # PI for Y

# Build the plot layer by layer
plot(x, y, xlab = "Age of Propellant", ylab = "Shear Strength")
lines(x, predict(fit), col = "red")                         # fitted line

lines(new$x, pred.cf[, "lwr"], col = "blue",  lty = 2)     # CI lower
lines(new$x, pred.cf[, "upr"], col = "blue",  lty = 2)     # CI upper

lines(new$x, pred.pd[, "lwr"], col = "green", lty = 2)     # PI lower
lines(new$x, pred.pd[, "upr"], col = "green", lty = 2)     # PI upper

legend("topright",
       legend = c("Fitted line", "95% CI for E(Y)", "95% PI for Y"),
       col    = c("red", "blue", "green"),
       lty    = c(1, 2, 2))

Model Performance Metrics

# install.packages("Metrics")   # run once if not yet installed
library(Metrics)

yhatall <- predict(fit, newdata = data.frame(x = x))

# RMSE: Root Mean Squared Error — average prediction error in the Y unit
rmse(y, yhatall)

# MAPE: Mean Absolute Percentage Error — average relative prediction error
mape(y, yhatall)

Task 2: Simulation Study — Sampling Distribution of Regression Coefficients

Concept: Each sample from a population gives different estimates \(b_0\) and \(b_1\). This task lets us observe how estimates vary across samples and verify the key property of unbiasedness: \(E(b_0) = \beta_0\) and \(E(b_1) = \beta_1\).

Population data (5 pairs):

\[\begin{array}{c|ccccc} \text{No.} & 1 & 2 & 3 & 4 & 5 \\ \hline X & 8 & 12 & 29.5 & 43 & 53 \\ \hline Y & 8.89 & 8.14 & 6.67 & 6.08 & 5.85 \end{array}\]

Note: Replace the last Y value (5.85) with the last 3 digits of your student ID.

# Define the population
popl <- data.frame(X = c(8, 12, 29.5, 43, 53),
                   Y = c(8.89, 8.14, 6.67, 6.08, 5.85))
str(popl)

# Fitting to the full population gives the TRUE parameter values β₀ and β₁
fit.pop <- lm(Y ~ X, data = popl)
summary(fit.pop)

plot(popl$X, popl$Y, xlab = "X", ylab = "Y",
     main = "True Regression Line (Population)")
abline(fit.pop, col = "red")

There are \(\binom{5}{3} = 10\) possible samples of size \(n = 3\):

# Example: Sample 1 — observations 1, 2, 3
x1 <- c(8, 12, 29.5)
y1 <- c(8.89, 8.14, 6.67)
fit1 <- lm(y1 ~ x1)
summary(fit1)
# Observe: b0 and b1 differ from the true β₀ and β₁ above

# TO DO:
# 1. Fit regression to all 10 possible samples of size n = 3
# 2. Collect b0 and b1 from each fitted model
# 3. Compute mean and SD of b0 estimates, and of b1 estimates
# 4. Verify: mean(b0) ≈ β₀ and mean(b1) ≈ β₁ (unbiasedness)
# 5. Plot histograms of b0 and b1 to visualise their sampling distributions

ปฎิบัติการ 3: Dummy Variable Regression (ตัวแปรหุ่น)

วัตถุประสงค์ นักศึกษาสามารถ:

  • แปลงตัวแปรเชิงคุณภาพเป็นตัวแปรหุ่นได้
  • วิเคราะห์การถดถอยที่มีตัวแปรหุ่นและทดสอบ interaction ได้
  • อธิบายความหมายของสัมประสิทธิ์ได้

Background: For a categorical predictor with \(k\) groups, we create \(k-1\) dummy variables (binary 0/1). One group is the reference category — its effect is absorbed into the intercept. Each dummy coefficient represents the difference relative to the reference group.

rm(list = ls())   # clear the environment before starting a new task

Task 1: Salary, GPA, and Gender

setwd("C:\\Users\\ASUS\\Downloads")
dt <- read.csv("SalaryAndGPA.csv", header = TRUE)
str(dt)

dt$Gender <- as.factor(dt$Gender)
str(dt)   # verify Gender shows "Factor w/ 2 levels"

Exploratory Analysis

# Base R
plot(dt$GPA, dt$Salary, xlab = "GPA", ylab = "Salary")
boxplot(dt$Salary ~ dt$Gender, xlab = "Gender", ylab = "Salary")

# ggplot2
library(ggplot2)

# Scatter coloured by Gender — look for parallel or crossing trends
ggplot(dt, aes(x = GPA, y = Salary, colour = Gender)) +
  geom_point(size = 3) +
  labs(title = "Salary vs GPA by Gender") +
  theme_bw()

# Boxplot comparing salary distributions
ggplot(dt, aes(x = Gender, y = Salary, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Salary by Gender") +
  theme_bw()

# Numerical summary by group
aggregate(Salary ~ Gender, data = dt, FUN = summary)

Create the Dummy Variable for Gender

Let \(Y\) = Salary (thousand $), \(X_1\) = GPA, and:

\[X_2 = \begin{cases} 1, & \text{Female} \\ 0, & \text{Male (reference)} \end{cases}\]

y  <- dt$Salary
x1 <- dt$GPA

# Method 1: ifelse() — concise one-liner
x2 <- ifelse(dt$Gender == "F", 1, 0)

# Method 2: for-loop — same result, shows the logic explicitly
x2 <- c()
for (i in 1:nrow(dt)) {
  if (dt$Gender[i] == "F") x2[i] <- 1
  else x2[i] <- 0
}
x2

Model 1 — with Interaction: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon\)

The interaction \(\beta_3 X_1 X_2\) allows the GPA slope to differ between genders (non-parallel lines):

# "x1 * x2" in R automatically includes x1, x2, AND interaction x1:x2
fit1 <- lm(y ~ x1 + x2 + x1 * x2)
summary(fit1)
# If p-value of x1:x2 > 0.05 → slopes are not significantly different → use Model 2
anova(fit1)

Model 2 — without Interaction (Parallel Lines): \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\)

\(\beta_2\) is the constant vertical shift between the two gender lines:

fit2 <- lm(y ~ x1 + x2)
summary(fit2)
anova(fit2)

# 95% CI for β₀, β₁, β₂
confint.lm(fit2, level = 0.95)

Interpretation of \(\hat{\beta}_2\): Holding GPA constant, females earn \(\hat{\beta}_2\) thousand dollars more (if positive) or less (if negative) than males on average.


Task 2: ปัจจัยที่ส่งผลต่อคะแนนสอบวิชาคณิตศาสตร์

ศึกษาปัจจัยที่ส่งผลต่อคะแนนสอบคณิตศาสตร์ (เต็ม 150 คะแนน) โดยพิจารณาวิธีการสอน 4 รูปแบบ (A, B, C, D), คะแนน Pretest และคะแนน EQ จากกลุ่มตัวอย่าง 36 หน่วย

rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt2 <- read.csv("ScoreAndEQ.csv")
str(dt2)
dt2$Method <- as.factor(dt2$Method)

กำหนดตัวแปรหุ่น 3 ตัวสำหรับวิธีสอน 4 กลุ่ม โดยให้ Method D เป็นกลุ่มอ้างอิง:

\[X_1 = \begin{cases}1, & A \\ 0, & \text{อื่นๆ}\end{cases},\quad X_2 = \begin{cases}1, & B \\ 0, & \text{อื่นๆ}\end{cases},\quad X_3 = \begin{cases}1, & C \\ 0, & \text{อื่นๆ}\end{cases}\]

y  <- dt2$Y
x4 <- dt2$Pretest
x5 <- dt2$EQ

# Method 1: ifelse() — concise
x1 <- ifelse(dt2$Method == "A", 1, 0)
x2 <- ifelse(dt2$Method == "B", 1, 0)
x3 <- ifelse(dt2$Method == "C", 1, 0)

# Method 2: for-loop — explicit logic
n  <- length(y)
x1 <- c(); x2 <- c(); x3 <- c()
for (i in 1:n) {
  x1[i] <- if (dt2$Method[i] == "A") 1 else 0
  x2[i] <- if (dt2$Method[i] == "B") 1 else 0
  x3[i] <- if (dt2$Method[i] == "C") 1 else 0
}

วิเคราะห์ตัวแบบ: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon\)

fit3 <- lm(y ~ x1 + x2 + x3 + x4 + x5)
summary(fit3)
# Interpretation:
#   β₁ = mean difference: Method A vs Method D (reference), holding x4 and x5 constant
#   β₂ = mean difference: Method B vs Method D
#   β₃ = mean difference: Method C vs Method D
#   β₄ = effect of Pretest score on math score
#   β₅ = effect of EQ score on math score

anova(fit3)

ปฏิบัติการ 4: Variable Selection (การคัดเลือกตัวแปรอิสระ)

วัตถุประสงค์ นักศึกษาสามารถแสดงการคัดเลือกตัวแปรด้วยวิธี Forward Selection, Backward Elimination และ Stepwise Regression ได้

AIC (Akaike Information Criterion):
\[AIC = n\ln\!\left(\frac{SS_{Res}}{n}\right) + 2p\] Lower AIC = better model. R’s step() adds/removes predictors to minimise AIC, balancing goodness-of-fit against model complexity.

Method Starting Model Action
Forward Selection Null (intercept only) Add predictor that most reduces AIC
Backward Elimination Full (all predictors) Remove predictor whose removal most reduces AIC
Stepwise Full Add or remove at each step — can undo earlier decisions

Task 1: VO2 Oxygen Consumption

rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt <- read.csv("VO2Data.csv")
str(dt)
dt$Gender <- as.factor(dt$Gender)

# Define all variables
y  <- dt$V02         # response: oxygen consumption
x1 <- dt$Weight
x2 <- dt$Age
x3 <- ifelse(dt$Gender == "Male", 1, 0)   # dummy: Male=1, Female=0
x4 <- dt$HeartRate
x5 <- dt$BodyTemp
x6 <- dt$Height

Step 1: Correlation Matrix

# Pairwise correlations between y and each predictor
cor(y, x1); cor(y, x2); cor(y, x4)

# Correlations among predictors (high values suggest multicollinearity)
cor(x1, x2); cor(x1, x4); cor(x2, x4)

# Full correlation matrix for all numeric columns
cor(dt[, c(2, 3, 4, 6, 7, 8)])

Step 2: VIF (Variance Inflation Factor)

\[VIF_j = \frac{1}{1 - R^2_j}\]

where \(R^2_j\) is from regressing \(X_j\) on all other predictors. VIF > 5 (some use 10) is considered problematic.

full <- lm(y ~ x1 + x2 + x3 + x4)

library(car)   # install.packages("car") if needed
vif(full)      # VIF for each predictor in the full model

# Manual cross-check: regress each predictor on the rest
m0 <- lm(x1 ~ x2 + x3 + x4); summary(m0)
1 / (1 - 0.1555)   # VIF(x1) ≈ 1.184

m1 <- lm(x2 ~ x1 + x3 + x4); summary(m1)
1 / (1 - 0.05297)  # VIF(x2) ≈ 1.056

summary(lm(x3 ~ x1 + x2 + x4))   # VIF(x3) ≈ 1.072
summary(lm(x4 ~ x1 + x2 + x3))   # VIF(x4) ≈ 1.176

# All VIFs are low → no multicollinearity issue here

Step 3: Forward Selection

null <- lm(y ~ 1)                    # null model (intercept only)
full <- lm(y ~ x1 + x2 + x3 + x4)  # full model

# Automated (uses AIC)
step(null,
     scope     = list(lower = null, upper = full),
     direction = "forward")

Step-by-step manually (to understand the process):

# Step 1: Compare adding each predictor to the null model
# Use anova(smaller, larger) — tests whether the added predictor is significant
fit.x1 <- lm(y ~ x1); anova(null, fit.x1)
fit.x2 <- lm(y ~ x2); anova(null, fit.x2)
fit.x3 <- lm(y ~ x3); anova(null, fit.x3)   # x3 enters first (most significant)
fit.x4 <- lm(y ~ x4); anova(null, fit.x4)

# Step 2: x3 is in. Test adding each remaining predictor given x3
fit.x3x1 <- lm(y ~ x3 + x1); anova(fit.x3, fit.x3x1)  # test x1 | x3
fit.x3x2 <- lm(y ~ x3 + x2); anova(fit.x3, fit.x3x2)  # test x2 | x3
fit.x3x4 <- lm(y ~ x3 + x4); anova(fit.x3, fit.x3x4)  # test x4 | x3
# x1 enters (significant contribution given x3)

# Step 3: x3, x1 in model. Test remaining predictors
fit.x3x1x2 <- lm(y ~ x3 + x1 + x2); anova(fit.x3x1, fit.x3x1x2)  # test x2 | x3, x1
fit.x3x1x4 <- lm(y ~ x3 + x1 + x4); anova(fit.x3x1, fit.x3x1x4)  # test x4 | x3, x1
# Neither is significant → STOP

# Final model: y ~ x3 + x1
summary(fit.x3x1)

Step 4: Backward Elimination

# Automated
step(full, direction = "backward")

Step-by-step manually:

# Step 1: Full model. Remove each predictor one at a time and compare with full
ft1 <- lm(y ~ x1 + x2 + x3); anova(ft1, full)   # -x4
ft2 <- lm(y ~ x1 + x2 + x4); anova(ft2, full)   # -x3
ft3 <- lm(y ~ x2 + x3 + x4); anova(ft3, full)   # -x1
ft4 <- lm(y ~ x1 + x3 + x4); anova(ft4, full)   # -x2
# x4 has largest non-significant p-value → remove x4; now: x1, x2, x3

# Step 2: From {x1, x2, x3}, test removing each
ft5 <- lm(y ~ x1 + x2); anova(ft5, ft1)   # -x3
ft6 <- lm(y ~ x1 + x3); anova(ft6, ft1)   # -x2
ft7 <- lm(y ~ x2 + x3); anova(ft7, ft1)   # -x1
# x2 is not significant → remove x2; now: x1, x3

# Step 3: From {x1, x3}, test removing each
ft8 <- lm(y ~ x3); anova(ft8, ft6)   # -x1
ft9 <- lm(y ~ x1); anova(ft9, ft6)   # -x3
# Both are significant → STOP; final model: y ~ x1 + x3

Step 5: Stepwise Regression

# Considers both adding and removing at each step
step(full, direction = "both")

ปฏิบัติการ 5: Model Adequacy Checking (การตรวจสอบความเหมาะสมของตัวแบบ)

วัตถุประสงค์ นักศึกษาสามารถ:

  • ตรวจสอบข้อสมมติของการถดถอยด้วยแผนภาพและสถิติทดสอบ
  • แก้ไขปัญหาด้วยการแปลงตัวแปร (transformation) ได้

Four Assumptions of Linear Regression (LINE):

Assumption Diagnostic Tool Formal Test
Linearity Residuals vs Fitted plot
Independence Residuals vs order plot Durbin-Watson
Normality Normal Q-Q plot Shapiro-Wilk
Equal variance (Homoscedasticity) Scale-Location plot Breusch-Pagan, F-test

Task 1: The Rocket Propellant Data

rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt <- read.csv("TheRocketData.csv")
str(dt)

x <- dt$Sher.Strength
y <- dt$Age.of.Propellant

par(mfrow = c(1, 1))
plot(x, y, xlab = "Shear Strength", ylab = "Age of Propellant")

Step 1: Fit the Simple Linear Regression Model

fit1 <- lm(y ~ x)
summary(fit1)
anova(fit1)

Step 2: Residual Diagnostic Plots

res1 <- resid(fit1)   # residuals: eᵢ = yᵢ − ŷᵢ

par(mfrow = c(2, 2))  # 2×2 grid of diagnostic plots
plot(fit1)
# Plot 1 — Residuals vs Fitted:  random scatter → linearity + equal variance OK
# Plot 2 — Normal Q-Q:           points on diagonal → normality OK
# Plot 3 — Scale-Location:       flat horizontal band → equal variance OK
# Plot 4 — Residuals vs Leverage: no large Cook's distance → no influential outliers

Step 2a: Test Equal Variance

library(lmtest)

# F-test: compare variance of first half vs second half of residuals
# H₀: σ²₁ = σ²₂ (equal variance)
var.test(res1[1:10], res1[11:20])

# Breusch-Pagan test: regresses squared residuals on the predictor(s)
# H₀: homoscedasticity (constant variance); p-value < 0.05 → problem
bptest(fit1)

Step 2b: Test Normality

# Shapiro-Wilk test (best for n < 50)
# H₀: residuals are normally distributed
# p-value > 0.05 → normality is supported
shapiro.test(res1)

Step 2c: Test Independence (Autocorrelation)

# Durbin-Watson test
# H₀: no autocorrelation in residuals
# DW ≈ 2 → no autocorrelation; DW < 2 → positive autocorrelation
library(lmtest)
dwtest(fit1)

Step 3: Box-Cox Transformation — Find Optimal λ

λ Transformation
1 None (Y unchanged)
0.5 \(\sqrt{Y}\)
0 \(\ln(Y)\)
−1 \(1/Y\)
library(MASS)
par(mfrow = c(1, 1))

# The peak of the curve = optimal λ
# The horizontal dashed line = 95% CI for λ
# If CI includes 0.5 → use √Y; if CI includes 0 → use log(Y)
boxcox(fit1, lambda = seq(-3, 3, by = 0.05))

Step 4: Fit the Transformed Model \(\sqrt{Y} = \beta_0 + \beta_1 X + \epsilon\)

fit2 <- lm(sqrt(y) ~ x)
summary(fit2)
anova(fit2)

res2 <- resid(fit2)

# Repeat ALL diagnostic checks and compare with fit1
par(mfrow = c(2, 2))
plot(fit2)

var.test(res2[1:10], res2[11:20])   # F-test for equal variance
bptest(fit2)                         # Breusch-Pagan
dwtest(fit2)                         # Durbin-Watson
shapiro.test(res2)                   # Shapiro-Wilk

Sensitivity: Remove Influential Observations

# Negative indexing removes specific observations
# Here we remove rows 5 and 6 (identified as potential outliers from diagnostics)
xx <- x[-c(5, 6)]
yy <- y[-c(5, 6)]

f1 <- lm(yy ~ xx)        # original model without outliers
summary(f1)

f2 <- lm(sqrt(yy) ~ xx)  # transformed model without outliers
summary(f2)

# If results change substantially, those observations are highly influential

Task 2: Olympics Athletes — Cholesterol and Fat Intake

This dataset contains cholesterol levels (mg/L) and daily saturated fat intake (mg) for 10 Olympic athletes.

rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt2 <- read.csv("TheAthleteData.csv")
str(dt2)

x <- dt2$Fat.Intake
y <- dt2$Cholesterol

par(mfrow = c(1, 1))
plot(x, y, xlab = "Fat Intake", ylab = "Cholesterol",
     main = "Cholesterol vs Fat Intake")

a. Fit the Linear Model and Check Assumptions

fit.lin <- lm(y ~ x)
summary(fit.lin)
anova(fit.lin)

pred.lin <- predict(fit.lin)
res.lin  <- resid(fit.lin)

par(mfrow = c(2, 2))
plot(fit.lin)

# Formal assumption tests
var.test(res.lin[1:5], res.lin[6:10])   # F-test: equal variance
bptest(fit.lin)                          # Breusch-Pagan
dwtest(fit.lin)                          # Durbin-Watson
shapiro.test(res.lin)                    # Shapiro-Wilk

b. Box-Cox to Identify an Appropriate Transformation

library(MASS)
par(mfrow = c(1, 1))
boxcox(fit.lin, lambda = seq(-3, 3, by = 0.05))

c. Fit a Quadratic Model (if the scatter plot shows curvature)

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\]

# I(x^2) protects the squaring operation inside the formula
fit.qua <- lm(y ~ x + I(x^2))
summary(fit.qua)
anova(fit.qua)

res.qua <- resid(fit.qua)

par(mfrow = c(2, 2))
plot(fit.qua)

# Repeat all diagnostics
var.test(res.qua[1:5], res.qua[6:10],
         ratio = 1, alternative = "two.sided", conf.level = 0.95)
bptest(fit.qua)
dwtest(fit.qua)
shapiro.test(res.qua)

d. Summary Comparison Table

Record results from both models:

Statistic Linear Model Quadratic / Transformed
\(R^2\)
Adjusted \(R^2\)
Residual SE
Shapiro-Wilk p-value
Breusch-Pagan p-value
Durbin-Watson statistic
F-test (var.test) p-value