Data files

ปฎิบัติการ 1: Introduction to R

References: Intro to R · Data Exploration · Missing Data · ggplot2 Scatter

1. Installing R and RStudio

R is a programming language for statistical computing and graphics. RStudio is an IDE (Integrated Development Environment) that makes R much easier to use with a four-panel layout:

Panel	Location	Purpose
Script Editor	Top-left	Write and save code
Console	Bottom-left	Execute code, see output
Environment / History	Top-right	See objects in memory
Files / Plots / Help	Bottom-right	Browse files, view plots

Windows

Install R: Download the latest .exe installer and run it.
Install RStudio: Download the Windows installer and follow the instructions.

Mac

Install R: Download the latest .pkg, double-click, and follow instructions.
Install RStudio: Download the Mac version, open the file, and drag to Applications.

2. Basic Data Types

Type	Example	Description
`numeric`	`4.5`	Decimal numbers
`integer`	`4L`	Whole numbers (also numeric)
`logical`	`TRUE` / `FALSE`	Boolean values
`character`	`"hello"`	Text / strings

Use class() to check the type of any object.

Numeric and Integer Values

# "<-" is the assignment operator (preferred in R; "=" also works)
a <- 7
a

## [1] 7

# c() combines values into a vector (R's basic data structure)
b <- c(1.25, 2.9, 3.0)
b

## [1] 1.25 2.90 3.00

# ":" creates an integer sequence from start to end
c <- 1:10
c

##  [1]  1  2  3  4  5  6  7  8  9 10

# seq(from, to, by) creates a sequence with a custom step size
d <- seq(1, 10, 0.5)
d

##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0

# rep() repeats values
rep(99, times = 5)           # repeat a single value 5 times

## [1] 99 99 99 99 99

rep(c(1, 2, 3), times = 5)  # repeat the whole vector: 1 2 3 1 2 3 ...

##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

rep(c(1, 2, 3), each = 5)   # repeat each element:  1 1 1 1 1 2 2 2 ...

##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

Character Values

x <- "Monday"
class(x)   # "character"

## [1] "character"

y <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
class(y)

## [1] "character"

Factors

A factor stores categorical data with a fixed set of levels. R uses factors automatically in models and summaries.

# as.factor() converts a character vector to a factor
yy <- as.factor(y)
yy

## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed

# Note: R shows the unique "Levels" below the values

3. Data Structures

Matrix

All elements must be the same type (e.g., all numeric). Data fills column-by-column by default.

mat0 <- matrix(NA, ncol = 3, nrow = 2)                      # empty matrix of NAs
mat1 <- matrix(1:6, ncol = 3, nrow = 2)                     # fills column-by-column
mat2 <- matrix(1:6, ncol = 3, nrow = 2, byrow = TRUE)       # fills row-by-row
mat0; mat1; mat2

##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

List

The most flexible structure — each element can hold a different type of object.

# A list containing a numeric vector, a factor, and a matrix
lst <- list(d, yy, mat2)
lst

## [[1]]
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0
## 
## [[2]]
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
## 
## [[3]]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

# Access elements: lst[[1]], lst[[2]], lst[[3]]

Data Frame

The standard structure for statistical analysis. Each column is a variable; each row is an observation.

data(ToothGrowth)
head(ToothGrowth, n = 6)   # preview first 6 rows

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

str(ToothGrowth)            # variable names, types, and sample values

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

4. Reading Files

# setwd() sets the working directory (folder where data files are stored)
# On Windows: use "\\" (double backslash) or "/" as path separators
setwd("C:\\Users\\ASUS\\Downloads")

# read.csv() reads a comma-separated file into a data frame
# header = TRUE means the first row contains variable names
dt <- read.csv("bank_data.csv", header = TRUE)

head(dt)   # preview first 6 rows
str(dt)    # check variable names and types

After loading, convert categorical variables from "chr" to factor:

dt$job       <- as.factor(dt$job)
dt$marital   <- as.factor(dt$marital)
dt$education <- as.factor(dt$education)
dt$housing   <- as.factor(dt$housing)
dt$loan      <- as.factor(dt$loan)

str(dt)   # verify: these columns now show "Factor w/ N levels"

5. Numerical Summaries: Qualitative Data

summary(dt)   # counts for factors; five-number summary for numerics

# Single-variable frequency tables
table(dt$education)
table(dt$marital)
table(dt$housing)
table(dt$loan)

# Proportions and percentages
tt <- table(dt$marital)
prop.table(tt)        # proportions (sum to 1)
prop.table(tt) * 100  # percentages

# Two-way contingency table (rows = marital, columns = loan)
table(dt$marital, dt$loan)

# Add row/column totals
tb <- table(dt$marital, dt$loan)
addmargins(tb)

# Cell percentages
prop.table(tb) * 100

6. Numerical Summaries: Quantitative Data

# Custom mode function (R has no built-in statistical mode)
getmode <- function(x) {
  u   <- unique(x)
  tab <- tabulate(match(x, u))
  u[tab == max(tab)]
}

summary(dt$age)       # min, Q1, median, mean, Q3, max
mean(dt$age)
getmode(dt$age)       # most frequent value

min(dt$age)
range(dt$age)         # c(min, max)
IQR(dt$age)           # Q3 - Q1
sd(dt$age)            # sample standard deviation

# Quartiles
quantile(dt$age, probs = c(0.25, 0.50, 0.75))

# Coefficient of Variation (CV) = SD / Mean
# Measures relative variability — useful for comparing groups with different means
cv_age <- sd(dt$age) / mean(dt$age)
cv_age

# Compare CV between subgroups using [ ] subsetting
d1  <- dt$age[dt$housing == "no"]
d2  <- dt$age[dt$housing == "yes"]
cv1 <- sd(d1) / mean(d1)   # CV of age for housing = no
cv2 <- sd(d2) / mean(d2)   # CV of age for housing = yes
cv1; cv2

# aggregate(): compute summary statistics by group
# aggregate(response ~ group, data, FUN)
aggregate(age ~ housing, data = dt, FUN = mean)
aggregate(age ~ housing, data = dt, FUN = sd)
aggregate(age ~ housing, data = dt, FUN = summary)

# Two-way grouping: mean age by housing AND marital status
aggregate(age ~ housing + marital, data = dt, FUN = mean)

7. Graphical Presentation: Qualitative Data

Bar Graph

# barplot() requires a frequency table as input
tt <- table(dt$marital)
barplot(tt)
pie(tt, main = "Pie chart of Marital status")

# Stacked bar: composition of one variable within levels of another
counts <- table(dt$marital, dt$loan)
barplot(counts, legend = rownames(counts), xlab = "loan")

# Grouped (side-by-side) bar: easier to compare individual counts
# beside = TRUE places bars next to each other rather than stacking
barplot(counts,
        col    = c("blue", "red", "green"),
        legend = rownames(counts),
        beside = TRUE,
        xlab   = "loan")

8. Graphical Presentation: Quantitative Data

Base R Plots

# Histogram
hist(dt$age)
hist(dt$age, main = "Histogram of Age", xlab = "Age", ylab = "Count")
hist(dt$age, breaks = 20, main = "Histogram of Age", xlab = "Age")

# Boxplot — shows Q1, median, Q3, whiskers, and outliers
boxplot(dt$age, main = "Boxplot of age")

# Side-by-side boxplots by group
boxplot(dt$age ~ dt$marital,
        main = "Boxplot of age by Marital Status",
        xlab = "Marital", ylab = "Age")

# Scatter plot — relationship between two numeric variables
plot(x = dt$age, y = dt$balance, xlab = "Age", ylab = "Balance")

ggplot2 Plots

ggplot2 uses a layered grammar: ggplot(data, aes(...)) + geom_*() + theme_*()

# install.packages("ggplot2")  # run once if not yet installed
library(ggplot2)

# Scatter plot of age vs balance
ggplot(dt, aes(x = age, y = balance)) +
  geom_point() +
  theme_bw()

# Colour points by marital status (adds a third variable to the plot)
ggplot(dt, aes(x = age, y = balance)) +
  geom_point(aes(col = marital)) +
  theme_bw()

# Boxplot of age (no grouping)
ggplot(dt, aes(y = age)) +
  geom_boxplot() +
  theme_bw()

# Boxplot of age grouped by marital status
ggplot(dt, aes(x = marital, y = age)) +
  geom_boxplot() +
  theme_bw()

# Boxplot of age by marital, coloured by housing status
# This simultaneously displays THREE variables in one plot
ggplot(dt, aes(x = marital, y = age)) +
  geom_boxplot(aes(col = housing)) +
  theme_bw()

9. Assignment

Numerical presentation for bank_data.csv:

Frequency table of education with proportions
Two-way table of job × loan with percentages
Summary statistics of balance
Q1, Q2, Q3 and IQR of balance
CV of balance by housing type
Summary statistics of balance by housing
SD of balance by housing and loan

Graphical presentation for bank_data.csv:

Appropriate chart for education
Chart comparing education × marital counts
Appropriate chart for balance
Chart comparing balance across education groups
Chart showing the relationship between age and duration

ปฎิบัติการ 2: Simple Linear Regression (การถดถอยเชิงเส้นอย่างง่าย)

วัตถุประสงค์ นักศึกษาสามารถ:

วิเคราะห์การถดถอยเชิงเส้นอย่างง่ายด้วย R ได้
หาค่าพยากรณ์พร้อมช่วงความเชื่อมั่นได้
จำลองข้อมูลเพื่อตรวจสอบการแจกแจงของตัวประมาณพารามิเตอร์ได้

Task 1: The Rocket Data

Background: This dataset records the shear strength of a rocket propellant bond (Sher.Strength) and the age of the propellant in weeks (Age.of.Propellant). We want to model how age affects shear strength using simple linear regression.

setwd("C:\\Users\\ASUS\\Downloads")
dt1 <- read.csv("TheRocketData.csv", header = TRUE)
str(dt1)

# Assign shorter variable names for convenience
x <- dt1$Age.of.Propellant   # predictor (X)
y <- dt1$Sher.Strength        # response (Y)

Step 1: Scatter Plot — always visualise before fitting

# Look for: linear trend, curvature, outliers, unequal spread
plot(x, y, xlab = "Age of Propellant", ylab = "Shear Strength")

Step 2: Correlation

cor(x, y)   # sample Pearson correlation coefficient r

# cor.test() formally tests H₀: ρ = 0 (no linear association)
# Output: r, t-statistic, p-value, and confidence interval for ρ
cor.test(x, y, alternative = "two.sided")                    # 95% CI (default)
cor.test(x, y, alternative = "two.sided", conf.level = 0.90) # 90% CI

Step 3: Fit the Model $Y = \beta_0 + \beta_1 X + \epsilon$

# lm(response ~ predictor) fits the OLS regression line
# Use "~" (Alt + 126 on Thai keyboard)
fit <- lm(y ~ x)

# summary() gives:
#   Coefficients: β̂₀ (Intercept) and β̂₁ (slope), SE, t-value, p-value
#   R-squared: proportion of variance in Y explained by X
#   Residual SE: estimate of σ
summary(fit)

# anova() gives the ANOVA table: SSR, SSE, MSR, MSE, F-statistic
# F = MSR / MSE tests whether β₁ ≠ 0 (overall model significance)
anova(fit)

# Manual verification: SSreg=1527483, SSres=166255, df_res=18
# MSres = 166255/18 = 9236; F = 1527483/9236 = 165.3
1527483 / 9236

Step 4: Confidence Intervals for Regression Coefficients

# confint.lm() returns CIs for β₀ and β₁
confint.lm(fit, level = 0.90)   # 90% CI
confint.lm(fit, level = 0.95)   # 95% CI
confint.lm(fit, level = 0.99)   # 99% CI

Steps 7–8: Prediction for a Specific x Value

# Predict response when x = 10
xnew <- data.frame(x = 10)

# Point estimate
yhat <- predict.lm(fit, newdata = xnew)
yhat

# 95% Confidence Interval for E(Y | x = 10)
# Estimates the MEAN response for ALL units with x = 10
predict.lm(fit, newdata = xnew, interval = "confidence")

# 95% Prediction Interval for a NEW individual Y when x = 10
# Estimates the range for a SINGLE future observation; always wider than CI
predict.lm(fit, newdata = xnew, interval = "prediction")

Key difference:

Confidence interval — precision of the estimated mean E(Y|x); narrows as sample size increases

Prediction interval — range for a future individual Y; always wider because it includes individual variability σ²

Step 9: Plot with Confidence and Prediction Bands

# Generate a fine grid for smooth band curves
new     <- data.frame(x = seq(min(x), max(x), len = 100))
pred.cf <- predict(fit, new, interval = "confidence")   # CI for E(Y)
pred.pd <- predict(fit, new, interval = "prediction")   # PI for Y

# Build the plot layer by layer
plot(x, y, xlab = "Age of Propellant", ylab = "Shear Strength")
lines(x, predict(fit), col = "red")                         # fitted line

lines(new$x, pred.cf[, "lwr"], col = "blue",  lty = 2)     # CI lower
lines(new$x, pred.cf[, "upr"], col = "blue",  lty = 2)     # CI upper

lines(new$x, pred.pd[, "lwr"], col = "green", lty = 2)     # PI lower
lines(new$x, pred.pd[, "upr"], col = "green", lty = 2)     # PI upper

legend("topright",
       legend = c("Fitted line", "95% CI for E(Y)", "95% PI for Y"),
       col    = c("red", "blue", "green"),
       lty    = c(1, 2, 2))

Model Performance Metrics

# install.packages("Metrics")   # run once if not yet installed
library(Metrics)

yhatall <- predict(fit, newdata = data.frame(x = x))

# RMSE: Root Mean Squared Error — average prediction error in the Y unit
rmse(y, yhatall)

# MAPE: Mean Absolute Percentage Error — average relative prediction error
mape(y, yhatall)

Task 2: Simulation Study — Sampling Distribution of Regression Coefficients

Concept: Each sample from a population gives different estimates $b_0$ and $b_1$. This task lets us observe how estimates vary across samples and verify the key property of unbiasedness: $E(b_0) = \beta_0$ and $E(b_1) = \beta_1$.

Population data (5 pairs):

\[\begin{array}{c|ccccc} \text{No.} & 1 & 2 & 3 & 4 & 5 \\ \hline X & 8 & 12 & 29.5 & 43 & 53 \\ \hline Y & 8.89 & 8.14 & 6.67 & 6.08 & 5.85 \end{array}\]

Note: Replace the last Y value (5.85) with the last 3 digits of your student ID.

# Define the population
popl <- data.frame(X = c(8, 12, 29.5, 43, 53),
                   Y = c(8.89, 8.14, 6.67, 6.08, 5.85))
str(popl)

# Fitting to the full population gives the TRUE parameter values β₀ and β₁
fit.pop <- lm(Y ~ X, data = popl)
summary(fit.pop)

plot(popl$X, popl$Y, xlab = "X", ylab = "Y",
     main = "True Regression Line (Population)")
abline(fit.pop, col = "red")

There are $\binom{5}{3} = 10$ possible samples of size $n = 3$:

# Example: Sample 1 — observations 1, 2, 3
x1 <- c(8, 12, 29.5)
y1 <- c(8.89, 8.14, 6.67)
fit1 <- lm(y1 ~ x1)
summary(fit1)
# Observe: b0 and b1 differ from the true β₀ and β₁ above

# TO DO:
# 1. Fit regression to all 10 possible samples of size n = 3
# 2. Collect b0 and b1 from each fitted model
# 3. Compute mean and SD of b0 estimates, and of b1 estimates
# 4. Verify: mean(b0) ≈ β₀ and mean(b1) ≈ β₁ (unbiasedness)
# 5. Plot histograms of b0 and b1 to visualise their sampling distributions

ปฎิบัติการ 3: Dummy Variable Regression (ตัวแปรหุ่น)

วัตถุประสงค์ นักศึกษาสามารถ:

แปลงตัวแปรเชิงคุณภาพเป็นตัวแปรหุ่นได้
วิเคราะห์การถดถอยที่มีตัวแปรหุ่นและทดสอบ interaction ได้
อธิบายความหมายของสัมประสิทธิ์ได้

Background: For a categorical predictor with $k$ groups, we create $k-1$ dummy variables (binary 0/1). One group is the reference category — its effect is absorbed into the intercept. Each dummy coefficient represents the difference relative to the reference group.

rm(list = ls())   # clear the environment before starting a new task

Task 1: Salary, GPA, and Gender

setwd("C:\\Users\\ASUS\\Downloads")
dt <- read.csv("SalaryAndGPA.csv", header = TRUE)
str(dt)

dt$Gender <- as.factor(dt$Gender)
str(dt)   # verify Gender shows "Factor w/ 2 levels"

Exploratory Analysis

# Base R
plot(dt$GPA, dt$Salary, xlab = "GPA", ylab = "Salary")
boxplot(dt$Salary ~ dt$Gender, xlab = "Gender", ylab = "Salary")

# ggplot2
library(ggplot2)

# Scatter coloured by Gender — look for parallel or crossing trends
ggplot(dt, aes(x = GPA, y = Salary, colour = Gender)) +
  geom_point(size = 3) +
  labs(title = "Salary vs GPA by Gender") +
  theme_bw()

# Boxplot comparing salary distributions
ggplot(dt, aes(x = Gender, y = Salary, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Salary by Gender") +
  theme_bw()

# Numerical summary by group
aggregate(Salary ~ Gender, data = dt, FUN = summary)

Create the Dummy Variable for Gender

Let $Y$ = Salary (thousand $), $X_1$ = GPA, and:

\[X_2 = \begin{cases} 1, & \text{Female} \\ 0, & \text{Male (reference)} \end{cases}\]

y  <- dt$Salary
x1 <- dt$GPA

# Method 1: ifelse() — concise one-liner
x2 <- ifelse(dt$Gender == "F", 1, 0)

# Method 2: for-loop — same result, shows the logic explicitly
x2 <- c()
for (i in 1:nrow(dt)) {
  if (dt$Gender[i] == "F") x2[i] <- 1
  else x2[i] <- 0
}
x2

Model 1 — with Interaction: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$

The interaction $\beta_3 X_1 X_2$ allows the GPA slope to differ between genders (non-parallel lines):

# "x1 * x2" in R automatically includes x1, x2, AND interaction x1:x2
fit1 <- lm(y ~ x1 + x2 + x1 * x2)
summary(fit1)
# If p-value of x1:x2 > 0.05 → slopes are not significantly different → use Model 2
anova(fit1)

Model 2 — without Interaction (Parallel Lines): $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$

$\beta_2$ is the constant vertical shift between the two gender lines:

fit2 <- lm(y ~ x1 + x2)
summary(fit2)
anova(fit2)

# 95% CI for β₀, β₁, β₂
confint.lm(fit2, level = 0.95)

Interpretation of $\hat{\beta}_2$: Holding GPA constant, females earn $\hat{\beta}_2$ thousand dollars more (if positive) or less (if negative) than males on average.

Task 2: ปัจจัยที่ส่งผลต่อคะแนนสอบวิชาคณิตศาสตร์

ศึกษาปัจจัยที่ส่งผลต่อคะแนนสอบคณิตศาสตร์ (เต็ม 150 คะแนน) โดยพิจารณาวิธีการสอน 4 รูปแบบ (A, B, C, D), คะแนน Pretest และคะแนน EQ จากกลุ่มตัวอย่าง 36 หน่วย

rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt2 <- read.csv("ScoreAndEQ.csv")
str(dt2)
dt2$Method <- as.factor(dt2$Method)

กำหนดตัวแปรหุ่น 3 ตัวสำหรับวิธีสอน 4 กลุ่ม โดยให้ Method D เป็นกลุ่มอ้างอิง:

\[X_1 = \begin{cases}1, & A \\ 0, & \text{อื่นๆ}\end{cases},\quad X_2 = \begin{cases}1, & B \\ 0, & \text{อื่นๆ}\end{cases},\quad X_3 = \begin{cases}1, & C \\ 0, & \text{อื่นๆ}\end{cases}\]

y  <- dt2$Y
x4 <- dt2$Pretest
x5 <- dt2$EQ

# Method 1: ifelse() — concise
x1 <- ifelse(dt2$Method == "A", 1, 0)
x2 <- ifelse(dt2$Method == "B", 1, 0)
x3 <- ifelse(dt2$Method == "C", 1, 0)

# Method 2: for-loop — explicit logic
n  <- length(y)
x1 <- c(); x2 <- c(); x3 <- c()
for (i in 1:n) {
  x1[i] <- if (dt2$Method[i] == "A") 1 else 0
  x2[i] <- if (dt2$Method[i] == "B") 1 else 0
  x3[i] <- if (dt2$Method[i] == "C") 1 else 0
}

วิเคราะห์ตัวแบบ: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon$

fit3 <- lm(y ~ x1 + x2 + x3 + x4 + x5)
summary(fit3)
# Interpretation:
#   β₁ = mean difference: Method A vs Method D (reference), holding x4 and x5 constant
#   β₂ = mean difference: Method B vs Method D
#   β₃ = mean difference: Method C vs Method D
#   β₄ = effect of Pretest score on math score
#   β₅ = effect of EQ score on math score

anova(fit3)

ปฏิบัติการ 4: Variable Selection (การคัดเลือกตัวแปรอิสระ)

วัตถุประสงค์ นักศึกษาสามารถแสดงการคัดเลือกตัวแปรด้วยวิธี Forward Selection, Backward Elimination และ Stepwise Regression ได้

AIC (Akaike Information Criterion):
\[AIC = n\ln\!\left(\frac{SS_{Res}}{n}\right) + 2p\] Lower AIC = better model. R’s step() adds/removes predictors to minimise AIC, balancing goodness-of-fit against model complexity.

Method	Starting Model	Action
Forward Selection	Null (intercept only)	Add predictor that most reduces AIC
Backward Elimination	Full (all predictors)	Remove predictor whose removal most reduces AIC
Stepwise	Full	Add or remove at each step — can undo earlier decisions

Task 1: VO2 Oxygen Consumption

rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt <- read.csv("VO2Data.csv")
str(dt)
dt$Gender <- as.factor(dt$Gender)

# Define all variables
y  <- dt$V02         # response: oxygen consumption
x1 <- dt$Weight
x2 <- dt$Age
x3 <- ifelse(dt$Gender == "Male", 1, 0)   # dummy: Male=1, Female=0
x4 <- dt$HeartRate
x5 <- dt$BodyTemp
x6 <- dt$Height

Step 1: Correlation Matrix

# Pairwise correlations between y and each predictor
cor(y, x1); cor(y, x2); cor(y, x4)

# Correlations among predictors (high values suggest multicollinearity)
cor(x1, x2); cor(x1, x4); cor(x2, x4)

# Full correlation matrix for all numeric columns
cor(dt[, c(2, 3, 4, 6, 7, 8)])

Step 2: VIF (Variance Inflation Factor)

\[VIF_j = \frac{1}{1 - R^2_j}\]

where $R^2_j$ is from regressing $X_j$ on all other predictors. VIF > 5 (some use 10) is considered problematic.

full <- lm(y ~ x1 + x2 + x3 + x4)

library(car)   # install.packages("car") if needed
vif(full)      # VIF for each predictor in the full model

# Manual cross-check: regress each predictor on the rest
m0 <- lm(x1 ~ x2 + x3 + x4); summary(m0)
1 / (1 - 0.1555)   # VIF(x1) ≈ 1.184

m1 <- lm(x2 ~ x1 + x3 + x4); summary(m1)
1 / (1 - 0.05297)  # VIF(x2) ≈ 1.056

summary(lm(x3 ~ x1 + x2 + x4))   # VIF(x3) ≈ 1.072
summary(lm(x4 ~ x1 + x2 + x3))   # VIF(x4) ≈ 1.176

# All VIFs are low → no multicollinearity issue here

Step 3: Forward Selection

null <- lm(y ~ 1)                    # null model (intercept only)
full <- lm(y ~ x1 + x2 + x3 + x4)  # full model

# Automated (uses AIC)
step(null,
     scope     = list(lower = null, upper = full),
     direction = "forward")

Step-by-step manually (to understand the process):

# Step 1: Compare adding each predictor to the null model
# Use anova(smaller, larger) — tests whether the added predictor is significant
fit.x1 <- lm(y ~ x1); anova(null, fit.x1)
fit.x2 <- lm(y ~ x2); anova(null, fit.x2)
fit.x3 <- lm(y ~ x3); anova(null, fit.x3)   # x3 enters first (most significant)
fit.x4 <- lm(y ~ x4); anova(null, fit.x4)

# Step 2: x3 is in. Test adding each remaining predictor given x3
fit.x3x1 <- lm(y ~ x3 + x1); anova(fit.x3, fit.x3x1)  # test x1 | x3
fit.x3x2 <- lm(y ~ x3 + x2); anova(fit.x3, fit.x3x2)  # test x2 | x3
fit.x3x4 <- lm(y ~ x3 + x4); anova(fit.x3, fit.x3x4)  # test x4 | x3
# x1 enters (significant contribution given x3)

# Step 3: x3, x1 in model. Test remaining predictors
fit.x3x1x2 <- lm(y ~ x3 + x1 + x2); anova(fit.x3x1, fit.x3x1x2)  # test x2 | x3, x1
fit.x3x1x4 <- lm(y ~ x3 + x1 + x4); anova(fit.x3x1, fit.x3x1x4)  # test x4 | x3, x1
# Neither is significant → STOP

# Final model: y ~ x3 + x1
summary(fit.x3x1)

Step 4: Backward Elimination

# Automated
step(full, direction = "backward")

Step-by-step manually:

# Step 1: Full model. Remove each predictor one at a time and compare with full
ft1 <- lm(y ~ x1 + x2 + x3); anova(ft1, full)   # -x4
ft2 <- lm(y ~ x1 + x2 + x4); anova(ft2, full)   # -x3
ft3 <- lm(y ~ x2 + x3 + x4); anova(ft3, full)   # -x1
ft4 <- lm(y ~ x1 + x3 + x4); anova(ft4, full)   # -x2
# x4 has largest non-significant p-value → remove x4; now: x1, x2, x3

# Step 2: From {x1, x2, x3}, test removing each
ft5 <- lm(y ~ x1 + x2); anova(ft5, ft1)   # -x3
ft6 <- lm(y ~ x1 + x3); anova(ft6, ft1)   # -x2
ft7 <- lm(y ~ x2 + x3); anova(ft7, ft1)   # -x1
# x2 is not significant → remove x2; now: x1, x3

# Step 3: From {x1, x3}, test removing each
ft8 <- lm(y ~ x3); anova(ft8, ft6)   # -x1
ft9 <- lm(y ~ x1); anova(ft9, ft6)   # -x3
# Both are significant → STOP; final model: y ~ x1 + x3

Step 5: Stepwise Regression

# Considers both adding and removing at each step
step(full, direction = "both")

ปฏิบัติการ 5: Model Adequacy Checking (การตรวจสอบความเหมาะสมของตัวแบบ)

วัตถุประสงค์ นักศึกษาสามารถ:

ตรวจสอบข้อสมมติของการถดถอยด้วยแผนภาพและสถิติทดสอบ
แก้ไขปัญหาด้วยการแปลงตัวแปร (transformation) ได้

Four Assumptions of Linear Regression (LINE):

Assumption Diagnostic Tool Formal Test

Linearity Residuals vs Fitted plot —

Independence Residuals vs order plot Durbin-Watson

Normality Normal Q-Q plot Shapiro-Wilk

Equal variance (Homoscedasticity) Scale-Location plot Breusch-Pagan, F-test

Assumption	Diagnostic Tool	Formal Test
Linearity	Residuals vs Fitted plot	—
Independence	Residuals vs order plot	Durbin-Watson
Normality	Normal Q-Q plot	Shapiro-Wilk
Equal variance (Homoscedasticity)	Scale-Location plot	Breusch-Pagan, F-test

Task 1: The Rocket Propellant Data

rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt <- read.csv("TheRocketData.csv")
str(dt)

x <- dt$Sher.Strength
y <- dt$Age.of.Propellant

par(mfrow = c(1, 1))
plot(x, y, xlab = "Shear Strength", ylab = "Age of Propellant")

Step 1: Fit the Simple Linear Regression Model

fit1 <- lm(y ~ x)
summary(fit1)
anova(fit1)

Step 2: Residual Diagnostic Plots

res1 <- resid(fit1)   # residuals: eᵢ = yᵢ − ŷᵢ

par(mfrow = c(2, 2))  # 2×2 grid of diagnostic plots
plot(fit1)
# Plot 1 — Residuals vs Fitted:  random scatter → linearity + equal variance OK
# Plot 2 — Normal Q-Q:           points on diagonal → normality OK
# Plot 3 — Scale-Location:       flat horizontal band → equal variance OK
# Plot 4 — Residuals vs Leverage: no large Cook's distance → no influential outliers

Step 2a: Test Equal Variance

library(lmtest)

# F-test: compare variance of first half vs second half of residuals
# H₀: σ²₁ = σ²₂ (equal variance)
var.test(res1[1:10], res1[11:20])

# Breusch-Pagan test: regresses squared residuals on the predictor(s)
# H₀: homoscedasticity (constant variance); p-value < 0.05 → problem
bptest(fit1)

Step 2b: Test Normality

# Shapiro-Wilk test (best for n < 50)
# H₀: residuals are normally distributed
# p-value > 0.05 → normality is supported
shapiro.test(res1)

Step 2c: Test Independence (Autocorrelation)

# Durbin-Watson test
# H₀: no autocorrelation in residuals
# DW ≈ 2 → no autocorrelation; DW < 2 → positive autocorrelation
library(lmtest)
dwtest(fit1)

Step 3: Box-Cox Transformation — Find Optimal λ

λ	Transformation
1	None (Y unchanged)
0.5	$\sqrt{Y}$
0	$\ln(Y)$
−1	$1/Y$

library(MASS)
par(mfrow = c(1, 1))

# The peak of the curve = optimal λ
# The horizontal dashed line = 95% CI for λ
# If CI includes 0.5 → use √Y; if CI includes 0 → use log(Y)
boxcox(fit1, lambda = seq(-3, 3, by = 0.05))

Step 4: Fit the Transformed Model $\sqrt{Y} = \beta_0 + \beta_1 X + \epsilon$

fit2 <- lm(sqrt(y) ~ x)
summary(fit2)
anova(fit2)

res2 <- resid(fit2)

# Repeat ALL diagnostic checks and compare with fit1
par(mfrow = c(2, 2))
plot(fit2)

var.test(res2[1:10], res2[11:20])   # F-test for equal variance
bptest(fit2)                         # Breusch-Pagan
dwtest(fit2)                         # Durbin-Watson
shapiro.test(res2)                   # Shapiro-Wilk

Sensitivity: Remove Influential Observations

# Negative indexing removes specific observations
# Here we remove rows 5 and 6 (identified as potential outliers from diagnostics)
xx <- x[-c(5, 6)]
yy <- y[-c(5, 6)]

f1 <- lm(yy ~ xx)        # original model without outliers
summary(f1)

f2 <- lm(sqrt(yy) ~ xx)  # transformed model without outliers
summary(f2)

# If results change substantially, those observations are highly influential

Task 2: Olympics Athletes — Cholesterol and Fat Intake

This dataset contains cholesterol levels (mg/L) and daily saturated fat intake (mg) for 10 Olympic athletes.

rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt2 <- read.csv("TheAthleteData.csv")
str(dt2)

x <- dt2$Fat.Intake
y <- dt2$Cholesterol

par(mfrow = c(1, 1))
plot(x, y, xlab = "Fat Intake", ylab = "Cholesterol",
     main = "Cholesterol vs Fat Intake")

a. Fit the Linear Model and Check Assumptions

fit.lin <- lm(y ~ x)
summary(fit.lin)
anova(fit.lin)

pred.lin <- predict(fit.lin)
res.lin  <- resid(fit.lin)

par(mfrow = c(2, 2))
plot(fit.lin)

# Formal assumption tests
var.test(res.lin[1:5], res.lin[6:10])   # F-test: equal variance
bptest(fit.lin)                          # Breusch-Pagan
dwtest(fit.lin)                          # Durbin-Watson
shapiro.test(res.lin)                    # Shapiro-Wilk

b. Box-Cox to Identify an Appropriate Transformation

library(MASS)
par(mfrow = c(1, 1))
boxcox(fit.lin, lambda = seq(-3, 3, by = 0.05))

c. Fit a Quadratic Model (if the scatter plot shows curvature)

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\]

# I(x^2) protects the squaring operation inside the formula
fit.qua <- lm(y ~ x + I(x^2))
summary(fit.qua)
anova(fit.qua)

res.qua <- resid(fit.qua)

par(mfrow = c(2, 2))
plot(fit.qua)

# Repeat all diagnostics
var.test(res.qua[1:5], res.qua[6:10],
         ratio = 1, alternative = "two.sided", conf.level = 0.95)
bptest(fit.qua)
dwtest(fit.qua)
shapiro.test(res.qua)

d. Summary Comparison Table

Record results from both models:

Statistic	Linear Model	Quadratic / Transformed
$R^2$
Adjusted $R^2$
Residual SE
Shapiro-Wilk p-value
Breusch-Pagan p-value
Durbin-Watson statistic
F-test (var.test) p-value

208329: Lab Material — Regression Analysis

กมลรัตน์ สุภาวรรณ
Kamonrat Suphawan
Department of Statistics
Faculty of Science, Chiang Mai University

Data files

ปฎิบัติการ 1: Introduction to R

References: Intro to R · Data Exploration · Missing Data · ggplot2 Scatter

1. Installing R and RStudio

Windows

Mac

2. Basic Data Types

Numeric and Integer Values

Character Values

Factors

3. Data Structures

Matrix

List

Data Frame

4. Reading Files

5. Numerical Summaries: Qualitative Data

6. Numerical Summaries: Quantitative Data

7. Graphical Presentation: Qualitative Data

Bar Graph

8. Graphical Presentation: Quantitative Data

Base R Plots

ggplot2 Plots

9. Assignment

ปฎิบัติการ 2: Simple Linear Regression (การถดถอยเชิงเส้นอย่างง่าย)

Task 1: The Rocket Data

Task 2: Simulation Study — Sampling Distribution of Regression Coefficients

ปฎิบัติการ 3: Dummy Variable Regression (ตัวแปรหุ่น)

Task 1: Salary, GPA, and Gender

Task 2: ปัจจัยที่ส่งผลต่อคะแนนสอบวิชาคณิตศาสตร์

ปฏิบัติการ 4: Variable Selection (การคัดเลือกตัวแปรอิสระ)

Task 1: VO2 Oxygen Consumption

ปฏิบัติการ 5: Model Adequacy Checking (การตรวจสอบความเหมาะสมของตัวแบบ)

Task 1: The Rocket Propellant Data

Task 2: Olympics Athletes — Cholesterol and Fat Intake

Statistic	Linear Model	Quadratic / Transformed
\(R^2\)
Adjusted \(R^2\)
Residual SE
Shapiro-Wilk p-value
Breusch-Pagan p-value
Durbin-Watson statistic
F-test (var.test) p-value

208329: Lab Material — Regression Analysis

กมลรัตน์ สุภาวรรณ Kamonrat Suphawan Department of Statistics Faculty of Science, Chiang Mai University

Data files

ปฎิบัติการ 1: Introduction to R

References: Intro to R · Data Exploration · Missing Data · ggplot2 Scatter

1. Installing R and RStudio

Windows

Mac

2. Basic Data Types

Numeric and Integer Values

Character Values

Factors

3. Data Structures

Matrix

List

Data Frame

4. Reading Files

5. Numerical Summaries: Qualitative Data

6. Numerical Summaries: Quantitative Data

7. Graphical Presentation: Qualitative Data

Bar Graph

8. Graphical Presentation: Quantitative Data

Base R Plots

ggplot2 Plots

9. Assignment

ปฎิบัติการ 2: Simple Linear Regression (การถดถอยเชิงเส้นอย่างง่าย)

Task 1: The Rocket Data

Task 2: Simulation Study — Sampling Distribution of Regression Coefficients

ปฎิบัติการ 3: Dummy Variable Regression (ตัวแปรหุ่น)

Task 1: Salary, GPA, and Gender

Task 2: ปัจจัยที่ส่งผลต่อคะแนนสอบวิชาคณิตศาสตร์

ปฏิบัติการ 4: Variable Selection (การคัดเลือกตัวแปรอิสระ)

Task 1: VO2 Oxygen Consumption

ปฏิบัติการ 5: Model Adequacy Checking (การตรวจสอบความเหมาะสมของตัวแบบ)

Task 1: The Rocket Propellant Data

Task 2: Olympics Athletes — Cholesterol and Fat Intake

กมลรัตน์ สุภาวรรณ
Kamonrat Suphawan
Department of Statistics
Faculty of Science, Chiang Mai University