R is a programming language for statistical computing and graphics. RStudio is an IDE (Integrated Development Environment) that makes R much easier to use with a four-panel layout:
| Panel | Location | Purpose |
|---|---|---|
| Script Editor | Top-left | Write and save code |
| Console | Bottom-left | Execute code, see output |
| Environment / History | Top-right | See objects in memory |
| Files / Plots / Help | Bottom-right | Browse files, view plots |
| Type | Example | Description |
|---|---|---|
numeric |
4.5 |
Decimal numbers |
integer |
4L |
Whole numbers (also numeric) |
logical |
TRUE / FALSE |
Boolean values |
character |
"hello" |
Text / strings |
Use class() to check the type of any object.
## [1] 7
## [1] 1.25 2.90 3.00
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0
## [1] 99 99 99 99 99
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
## [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
## [1] "character"
## [1] "character"
A factor stores categorical data with a fixed set of levels. R uses factors automatically in models and summaries.
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
All elements must be the same type (e.g., all numeric). Data fills column-by-column by default.
mat0 <- matrix(NA, ncol = 3, nrow = 2) # empty matrix of NAs
mat1 <- matrix(1:6, ncol = 3, nrow = 2) # fills column-by-column
mat2 <- matrix(1:6, ncol = 3, nrow = 2, byrow = TRUE) # fills row-by-row
mat0; mat1; mat2## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
The most flexible structure — each element can hold a different type of object.
## [[1]]
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0
##
## [[2]]
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
##
## [[3]]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
The standard structure for statistical analysis. Each column is a variable; each row is an observation.
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# setwd() sets the working directory (folder where data files are stored)
# On Windows: use "\\" (double backslash) or "/" as path separators
setwd("C:\\Users\\ASUS\\Downloads")
# read.csv() reads a comma-separated file into a data frame
# header = TRUE means the first row contains variable names
dt <- read.csv("bank_data.csv", header = TRUE)
head(dt) # preview first 6 rows
str(dt) # check variable names and typesAfter loading, convert categorical variables from "chr"
to factor:
dt$job <- as.factor(dt$job)
dt$marital <- as.factor(dt$marital)
dt$education <- as.factor(dt$education)
dt$housing <- as.factor(dt$housing)
dt$loan <- as.factor(dt$loan)
str(dt) # verify: these columns now show "Factor w/ N levels"summary(dt) # counts for factors; five-number summary for numerics
# Single-variable frequency tables
table(dt$education)
table(dt$marital)
table(dt$housing)
table(dt$loan)
# Proportions and percentages
tt <- table(dt$marital)
prop.table(tt) # proportions (sum to 1)
prop.table(tt) * 100 # percentages
# Two-way contingency table (rows = marital, columns = loan)
table(dt$marital, dt$loan)
# Add row/column totals
tb <- table(dt$marital, dt$loan)
addmargins(tb)
# Cell percentages
prop.table(tb) * 100# Custom mode function (R has no built-in statistical mode)
getmode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}summary(dt$age) # min, Q1, median, mean, Q3, max
mean(dt$age)
getmode(dt$age) # most frequent value
min(dt$age)
range(dt$age) # c(min, max)
IQR(dt$age) # Q3 - Q1
sd(dt$age) # sample standard deviation
# Quartiles
quantile(dt$age, probs = c(0.25, 0.50, 0.75))
# Coefficient of Variation (CV) = SD / Mean
# Measures relative variability — useful for comparing groups with different means
cv_age <- sd(dt$age) / mean(dt$age)
cv_age
# Compare CV between subgroups using [ ] subsetting
d1 <- dt$age[dt$housing == "no"]
d2 <- dt$age[dt$housing == "yes"]
cv1 <- sd(d1) / mean(d1) # CV of age for housing = no
cv2 <- sd(d2) / mean(d2) # CV of age for housing = yes
cv1; cv2
# aggregate(): compute summary statistics by group
# aggregate(response ~ group, data, FUN)
aggregate(age ~ housing, data = dt, FUN = mean)
aggregate(age ~ housing, data = dt, FUN = sd)
aggregate(age ~ housing, data = dt, FUN = summary)
# Two-way grouping: mean age by housing AND marital status
aggregate(age ~ housing + marital, data = dt, FUN = mean)# barplot() requires a frequency table as input
tt <- table(dt$marital)
barplot(tt)
pie(tt, main = "Pie chart of Marital status")
# Stacked bar: composition of one variable within levels of another
counts <- table(dt$marital, dt$loan)
barplot(counts, legend = rownames(counts), xlab = "loan")
# Grouped (side-by-side) bar: easier to compare individual counts
# beside = TRUE places bars next to each other rather than stacking
barplot(counts,
col = c("blue", "red", "green"),
legend = rownames(counts),
beside = TRUE,
xlab = "loan")# Histogram
hist(dt$age)
hist(dt$age, main = "Histogram of Age", xlab = "Age", ylab = "Count")
hist(dt$age, breaks = 20, main = "Histogram of Age", xlab = "Age")
# Boxplot — shows Q1, median, Q3, whiskers, and outliers
boxplot(dt$age, main = "Boxplot of age")
# Side-by-side boxplots by group
boxplot(dt$age ~ dt$marital,
main = "Boxplot of age by Marital Status",
xlab = "Marital", ylab = "Age")
# Scatter plot — relationship between two numeric variables
plot(x = dt$age, y = dt$balance, xlab = "Age", ylab = "Balance")ggplot2 uses a layered grammar:
ggplot(data, aes(...)) + geom_*() + theme_*()
# install.packages("ggplot2") # run once if not yet installed
library(ggplot2)
# Scatter plot of age vs balance
ggplot(dt, aes(x = age, y = balance)) +
geom_point() +
theme_bw()
# Colour points by marital status (adds a third variable to the plot)
ggplot(dt, aes(x = age, y = balance)) +
geom_point(aes(col = marital)) +
theme_bw()
# Boxplot of age (no grouping)
ggplot(dt, aes(y = age)) +
geom_boxplot() +
theme_bw()
# Boxplot of age grouped by marital status
ggplot(dt, aes(x = marital, y = age)) +
geom_boxplot() +
theme_bw()
# Boxplot of age by marital, coloured by housing status
# This simultaneously displays THREE variables in one plot
ggplot(dt, aes(x = marital, y = age)) +
geom_boxplot(aes(col = housing)) +
theme_bw()Numerical presentation for
bank_data.csv:
education with proportionsjob × loan with
percentagesbalancebalancebalance by housing typebalance by
housingbalance by housing and
loanGraphical presentation for
bank_data.csv:
educationeducation × marital
countsbalancebalance across education
groupsage and
durationวัตถุประสงค์ นักศึกษาสามารถ:
Background: This dataset records the shear strength
of a rocket propellant bond (Sher.Strength) and the age of
the propellant in weeks (Age.of.Propellant). We want to
model how age affects shear strength using simple linear regression.
setwd("C:\\Users\\ASUS\\Downloads")
dt1 <- read.csv("TheRocketData.csv", header = TRUE)
str(dt1)
# Assign shorter variable names for convenience
x <- dt1$Age.of.Propellant # predictor (X)
y <- dt1$Sher.Strength # response (Y)Step 1: Scatter Plot — always visualise before fitting
# Look for: linear trend, curvature, outliers, unequal spread
plot(x, y, xlab = "Age of Propellant", ylab = "Shear Strength")Step 2: Correlation
cor(x, y) # sample Pearson correlation coefficient r
# cor.test() formally tests H₀: ρ = 0 (no linear association)
# Output: r, t-statistic, p-value, and confidence interval for ρ
cor.test(x, y, alternative = "two.sided") # 95% CI (default)
cor.test(x, y, alternative = "two.sided", conf.level = 0.90) # 90% CIStep 3: Fit the Model \(Y = \beta_0 + \beta_1 X + \epsilon\)
# lm(response ~ predictor) fits the OLS regression line
# Use "~" (Alt + 126 on Thai keyboard)
fit <- lm(y ~ x)
# summary() gives:
# Coefficients: β̂₀ (Intercept) and β̂₁ (slope), SE, t-value, p-value
# R-squared: proportion of variance in Y explained by X
# Residual SE: estimate of σ
summary(fit)
# anova() gives the ANOVA table: SSR, SSE, MSR, MSE, F-statistic
# F = MSR / MSE tests whether β₁ ≠ 0 (overall model significance)
anova(fit)
# Manual verification: SSreg=1527483, SSres=166255, df_res=18
# MSres = 166255/18 = 9236; F = 1527483/9236 = 165.3
1527483 / 9236Step 4: Confidence Intervals for Regression Coefficients
# confint.lm() returns CIs for β₀ and β₁
confint.lm(fit, level = 0.90) # 90% CI
confint.lm(fit, level = 0.95) # 95% CI
confint.lm(fit, level = 0.99) # 99% CISteps 7–8: Prediction for a Specific x Value
# Predict response when x = 10
xnew <- data.frame(x = 10)
# Point estimate
yhat <- predict.lm(fit, newdata = xnew)
yhat
# 95% Confidence Interval for E(Y | x = 10)
# Estimates the MEAN response for ALL units with x = 10
predict.lm(fit, newdata = xnew, interval = "confidence")
# 95% Prediction Interval for a NEW individual Y when x = 10
# Estimates the range for a SINGLE future observation; always wider than CI
predict.lm(fit, newdata = xnew, interval = "prediction")Key difference:
- Confidence interval — precision of the estimated mean E(Y|x); narrows as sample size increases
- Prediction interval — range for a future individual Y; always wider because it includes individual variability σ²
Step 9: Plot with Confidence and Prediction Bands
# Generate a fine grid for smooth band curves
new <- data.frame(x = seq(min(x), max(x), len = 100))
pred.cf <- predict(fit, new, interval = "confidence") # CI for E(Y)
pred.pd <- predict(fit, new, interval = "prediction") # PI for Y
# Build the plot layer by layer
plot(x, y, xlab = "Age of Propellant", ylab = "Shear Strength")
lines(x, predict(fit), col = "red") # fitted line
lines(new$x, pred.cf[, "lwr"], col = "blue", lty = 2) # CI lower
lines(new$x, pred.cf[, "upr"], col = "blue", lty = 2) # CI upper
lines(new$x, pred.pd[, "lwr"], col = "green", lty = 2) # PI lower
lines(new$x, pred.pd[, "upr"], col = "green", lty = 2) # PI upper
legend("topright",
legend = c("Fitted line", "95% CI for E(Y)", "95% PI for Y"),
col = c("red", "blue", "green"),
lty = c(1, 2, 2))Model Performance Metrics
# install.packages("Metrics") # run once if not yet installed
library(Metrics)
yhatall <- predict(fit, newdata = data.frame(x = x))
# RMSE: Root Mean Squared Error — average prediction error in the Y unit
rmse(y, yhatall)
# MAPE: Mean Absolute Percentage Error — average relative prediction error
mape(y, yhatall)Concept: Each sample from a population gives different estimates \(b_0\) and \(b_1\). This task lets us observe how estimates vary across samples and verify the key property of unbiasedness: \(E(b_0) = \beta_0\) and \(E(b_1) = \beta_1\).
Population data (5 pairs):
\[\begin{array}{c|ccccc} \text{No.} & 1 & 2 & 3 & 4 & 5 \\ \hline X & 8 & 12 & 29.5 & 43 & 53 \\ \hline Y & 8.89 & 8.14 & 6.67 & 6.08 & 5.85 \end{array}\]Note: Replace the last Y value (5.85) with the last 3 digits of your student ID.
# Define the population
popl <- data.frame(X = c(8, 12, 29.5, 43, 53),
Y = c(8.89, 8.14, 6.67, 6.08, 5.85))
str(popl)
# Fitting to the full population gives the TRUE parameter values β₀ and β₁
fit.pop <- lm(Y ~ X, data = popl)
summary(fit.pop)
plot(popl$X, popl$Y, xlab = "X", ylab = "Y",
main = "True Regression Line (Population)")
abline(fit.pop, col = "red")There are \(\binom{5}{3} = 10\) possible samples of size \(n = 3\):
# Example: Sample 1 — observations 1, 2, 3
x1 <- c(8, 12, 29.5)
y1 <- c(8.89, 8.14, 6.67)
fit1 <- lm(y1 ~ x1)
summary(fit1)
# Observe: b0 and b1 differ from the true β₀ and β₁ above
# TO DO:
# 1. Fit regression to all 10 possible samples of size n = 3
# 2. Collect b0 and b1 from each fitted model
# 3. Compute mean and SD of b0 estimates, and of b1 estimates
# 4. Verify: mean(b0) ≈ β₀ and mean(b1) ≈ β₁ (unbiasedness)
# 5. Plot histograms of b0 and b1 to visualise their sampling distributionsวัตถุประสงค์ นักศึกษาสามารถ:
Background: For a categorical predictor with \(k\) groups, we create \(k-1\) dummy variables (binary 0/1). One group is the reference category — its effect is absorbed into the intercept. Each dummy coefficient represents the difference relative to the reference group.
setwd("C:\\Users\\ASUS\\Downloads")
dt <- read.csv("SalaryAndGPA.csv", header = TRUE)
str(dt)
dt$Gender <- as.factor(dt$Gender)
str(dt) # verify Gender shows "Factor w/ 2 levels"Exploratory Analysis
# Base R
plot(dt$GPA, dt$Salary, xlab = "GPA", ylab = "Salary")
boxplot(dt$Salary ~ dt$Gender, xlab = "Gender", ylab = "Salary")
# ggplot2
library(ggplot2)
# Scatter coloured by Gender — look for parallel or crossing trends
ggplot(dt, aes(x = GPA, y = Salary, colour = Gender)) +
geom_point(size = 3) +
labs(title = "Salary vs GPA by Gender") +
theme_bw()
# Boxplot comparing salary distributions
ggplot(dt, aes(x = Gender, y = Salary, fill = Gender)) +
geom_boxplot() +
labs(title = "Salary by Gender") +
theme_bw()
# Numerical summary by group
aggregate(Salary ~ Gender, data = dt, FUN = summary)Create the Dummy Variable for Gender
Let \(Y\) = Salary (thousand $), \(X_1\) = GPA, and:
\[X_2 = \begin{cases} 1, & \text{Female} \\ 0, & \text{Male (reference)} \end{cases}\]
y <- dt$Salary
x1 <- dt$GPA
# Method 1: ifelse() — concise one-liner
x2 <- ifelse(dt$Gender == "F", 1, 0)
# Method 2: for-loop — same result, shows the logic explicitly
x2 <- c()
for (i in 1:nrow(dt)) {
if (dt$Gender[i] == "F") x2[i] <- 1
else x2[i] <- 0
}
x2Model 1 — with Interaction: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon\)
The interaction \(\beta_3 X_1 X_2\) allows the GPA slope to differ between genders (non-parallel lines):
# "x1 * x2" in R automatically includes x1, x2, AND interaction x1:x2
fit1 <- lm(y ~ x1 + x2 + x1 * x2)
summary(fit1)
# If p-value of x1:x2 > 0.05 → slopes are not significantly different → use Model 2
anova(fit1)Model 2 — without Interaction (Parallel Lines): \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\)
\(\beta_2\) is the constant vertical shift between the two gender lines:
fit2 <- lm(y ~ x1 + x2)
summary(fit2)
anova(fit2)
# 95% CI for β₀, β₁, β₂
confint.lm(fit2, level = 0.95)Interpretation of \(\hat{\beta}_2\): Holding GPA constant, females earn \(\hat{\beta}_2\) thousand dollars more (if positive) or less (if negative) than males on average.
ศึกษาปัจจัยที่ส่งผลต่อคะแนนสอบคณิตศาสตร์ (เต็ม 150 คะแนน) โดยพิจารณาวิธีการสอน 4 รูปแบบ (A, B, C, D), คะแนน Pretest และคะแนน EQ จากกลุ่มตัวอย่าง 36 หน่วย
rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt2 <- read.csv("ScoreAndEQ.csv")
str(dt2)
dt2$Method <- as.factor(dt2$Method)กำหนดตัวแปรหุ่น 3 ตัวสำหรับวิธีสอน 4 กลุ่ม โดยให้ Method D เป็นกลุ่มอ้างอิง:
\[X_1 = \begin{cases}1, & A \\ 0, & \text{อื่นๆ}\end{cases},\quad X_2 = \begin{cases}1, & B \\ 0, & \text{อื่นๆ}\end{cases},\quad X_3 = \begin{cases}1, & C \\ 0, & \text{อื่นๆ}\end{cases}\]
y <- dt2$Y
x4 <- dt2$Pretest
x5 <- dt2$EQ
# Method 1: ifelse() — concise
x1 <- ifelse(dt2$Method == "A", 1, 0)
x2 <- ifelse(dt2$Method == "B", 1, 0)
x3 <- ifelse(dt2$Method == "C", 1, 0)
# Method 2: for-loop — explicit logic
n <- length(y)
x1 <- c(); x2 <- c(); x3 <- c()
for (i in 1:n) {
x1[i] <- if (dt2$Method[i] == "A") 1 else 0
x2[i] <- if (dt2$Method[i] == "B") 1 else 0
x3[i] <- if (dt2$Method[i] == "C") 1 else 0
}วิเคราะห์ตัวแบบ: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon\)
fit3 <- lm(y ~ x1 + x2 + x3 + x4 + x5)
summary(fit3)
# Interpretation:
# β₁ = mean difference: Method A vs Method D (reference), holding x4 and x5 constant
# β₂ = mean difference: Method B vs Method D
# β₃ = mean difference: Method C vs Method D
# β₄ = effect of Pretest score on math score
# β₅ = effect of EQ score on math score
anova(fit3)วัตถุประสงค์ นักศึกษาสามารถแสดงการคัดเลือกตัวแปรด้วยวิธี Forward Selection, Backward Elimination และ Stepwise Regression ได้
AIC (Akaike Information Criterion):
\[AIC = n\ln\!\left(\frac{SS_{Res}}{n}\right) + 2p\] Lower AIC = better model. R’sstep()adds/removes predictors to minimise AIC, balancing goodness-of-fit against model complexity.
| Method | Starting Model | Action |
|---|---|---|
| Forward Selection | Null (intercept only) | Add predictor that most reduces AIC |
| Backward Elimination | Full (all predictors) | Remove predictor whose removal most reduces AIC |
| Stepwise | Full | Add or remove at each step — can undo earlier decisions |
rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt <- read.csv("VO2Data.csv")
str(dt)
dt$Gender <- as.factor(dt$Gender)
# Define all variables
y <- dt$V02 # response: oxygen consumption
x1 <- dt$Weight
x2 <- dt$Age
x3 <- ifelse(dt$Gender == "Male", 1, 0) # dummy: Male=1, Female=0
x4 <- dt$HeartRate
x5 <- dt$BodyTemp
x6 <- dt$HeightStep 1: Correlation Matrix
# Pairwise correlations between y and each predictor
cor(y, x1); cor(y, x2); cor(y, x4)
# Correlations among predictors (high values suggest multicollinearity)
cor(x1, x2); cor(x1, x4); cor(x2, x4)
# Full correlation matrix for all numeric columns
cor(dt[, c(2, 3, 4, 6, 7, 8)])Step 2: VIF (Variance Inflation Factor)
\[VIF_j = \frac{1}{1 - R^2_j}\]
where \(R^2_j\) is from regressing \(X_j\) on all other predictors. VIF > 5 (some use 10) is considered problematic.
full <- lm(y ~ x1 + x2 + x3 + x4)
library(car) # install.packages("car") if needed
vif(full) # VIF for each predictor in the full model
# Manual cross-check: regress each predictor on the rest
m0 <- lm(x1 ~ x2 + x3 + x4); summary(m0)
1 / (1 - 0.1555) # VIF(x1) ≈ 1.184
m1 <- lm(x2 ~ x1 + x3 + x4); summary(m1)
1 / (1 - 0.05297) # VIF(x2) ≈ 1.056
summary(lm(x3 ~ x1 + x2 + x4)) # VIF(x3) ≈ 1.072
summary(lm(x4 ~ x1 + x2 + x3)) # VIF(x4) ≈ 1.176
# All VIFs are low → no multicollinearity issue hereStep 3: Forward Selection
null <- lm(y ~ 1) # null model (intercept only)
full <- lm(y ~ x1 + x2 + x3 + x4) # full model
# Automated (uses AIC)
step(null,
scope = list(lower = null, upper = full),
direction = "forward")Step-by-step manually (to understand the process):
# Step 1: Compare adding each predictor to the null model
# Use anova(smaller, larger) — tests whether the added predictor is significant
fit.x1 <- lm(y ~ x1); anova(null, fit.x1)
fit.x2 <- lm(y ~ x2); anova(null, fit.x2)
fit.x3 <- lm(y ~ x3); anova(null, fit.x3) # x3 enters first (most significant)
fit.x4 <- lm(y ~ x4); anova(null, fit.x4)
# Step 2: x3 is in. Test adding each remaining predictor given x3
fit.x3x1 <- lm(y ~ x3 + x1); anova(fit.x3, fit.x3x1) # test x1 | x3
fit.x3x2 <- lm(y ~ x3 + x2); anova(fit.x3, fit.x3x2) # test x2 | x3
fit.x3x4 <- lm(y ~ x3 + x4); anova(fit.x3, fit.x3x4) # test x4 | x3
# x1 enters (significant contribution given x3)
# Step 3: x3, x1 in model. Test remaining predictors
fit.x3x1x2 <- lm(y ~ x3 + x1 + x2); anova(fit.x3x1, fit.x3x1x2) # test x2 | x3, x1
fit.x3x1x4 <- lm(y ~ x3 + x1 + x4); anova(fit.x3x1, fit.x3x1x4) # test x4 | x3, x1
# Neither is significant → STOP
# Final model: y ~ x3 + x1
summary(fit.x3x1)Step 4: Backward Elimination
Step-by-step manually:
# Step 1: Full model. Remove each predictor one at a time and compare with full
ft1 <- lm(y ~ x1 + x2 + x3); anova(ft1, full) # -x4
ft2 <- lm(y ~ x1 + x2 + x4); anova(ft2, full) # -x3
ft3 <- lm(y ~ x2 + x3 + x4); anova(ft3, full) # -x1
ft4 <- lm(y ~ x1 + x3 + x4); anova(ft4, full) # -x2
# x4 has largest non-significant p-value → remove x4; now: x1, x2, x3
# Step 2: From {x1, x2, x3}, test removing each
ft5 <- lm(y ~ x1 + x2); anova(ft5, ft1) # -x3
ft6 <- lm(y ~ x1 + x3); anova(ft6, ft1) # -x2
ft7 <- lm(y ~ x2 + x3); anova(ft7, ft1) # -x1
# x2 is not significant → remove x2; now: x1, x3
# Step 3: From {x1, x3}, test removing each
ft8 <- lm(y ~ x3); anova(ft8, ft6) # -x1
ft9 <- lm(y ~ x1); anova(ft9, ft6) # -x3
# Both are significant → STOP; final model: y ~ x1 + x3Step 5: Stepwise Regression
วัตถุประสงค์ นักศึกษาสามารถ:
Four Assumptions of Linear Regression (LINE):
Assumption Diagnostic Tool Formal Test Linearity Residuals vs Fitted plot — Independence Residuals vs order plot Durbin-Watson Normality Normal Q-Q plot Shapiro-Wilk Equal variance (Homoscedasticity) Scale-Location plot Breusch-Pagan, F-test
rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt <- read.csv("TheRocketData.csv")
str(dt)
x <- dt$Sher.Strength
y <- dt$Age.of.Propellant
par(mfrow = c(1, 1))
plot(x, y, xlab = "Shear Strength", ylab = "Age of Propellant")Step 1: Fit the Simple Linear Regression Model
Step 2: Residual Diagnostic Plots
res1 <- resid(fit1) # residuals: eᵢ = yᵢ − ŷᵢ
par(mfrow = c(2, 2)) # 2×2 grid of diagnostic plots
plot(fit1)
# Plot 1 — Residuals vs Fitted: random scatter → linearity + equal variance OK
# Plot 2 — Normal Q-Q: points on diagonal → normality OK
# Plot 3 — Scale-Location: flat horizontal band → equal variance OK
# Plot 4 — Residuals vs Leverage: no large Cook's distance → no influential outliersStep 2a: Test Equal Variance
library(lmtest)
# F-test: compare variance of first half vs second half of residuals
# H₀: σ²₁ = σ²₂ (equal variance)
var.test(res1[1:10], res1[11:20])
# Breusch-Pagan test: regresses squared residuals on the predictor(s)
# H₀: homoscedasticity (constant variance); p-value < 0.05 → problem
bptest(fit1)Step 2b: Test Normality
# Shapiro-Wilk test (best for n < 50)
# H₀: residuals are normally distributed
# p-value > 0.05 → normality is supported
shapiro.test(res1)Step 2c: Test Independence (Autocorrelation)
# Durbin-Watson test
# H₀: no autocorrelation in residuals
# DW ≈ 2 → no autocorrelation; DW < 2 → positive autocorrelation
library(lmtest)
dwtest(fit1)Step 3: Box-Cox Transformation — Find Optimal λ
| λ | Transformation |
|---|---|
| 1 | None (Y unchanged) |
| 0.5 | \(\sqrt{Y}\) |
| 0 | \(\ln(Y)\) |
| −1 | \(1/Y\) |
library(MASS)
par(mfrow = c(1, 1))
# The peak of the curve = optimal λ
# The horizontal dashed line = 95% CI for λ
# If CI includes 0.5 → use √Y; if CI includes 0 → use log(Y)
boxcox(fit1, lambda = seq(-3, 3, by = 0.05))Step 4: Fit the Transformed Model \(\sqrt{Y} = \beta_0 + \beta_1 X + \epsilon\)
fit2 <- lm(sqrt(y) ~ x)
summary(fit2)
anova(fit2)
res2 <- resid(fit2)
# Repeat ALL diagnostic checks and compare with fit1
par(mfrow = c(2, 2))
plot(fit2)
var.test(res2[1:10], res2[11:20]) # F-test for equal variance
bptest(fit2) # Breusch-Pagan
dwtest(fit2) # Durbin-Watson
shapiro.test(res2) # Shapiro-WilkSensitivity: Remove Influential Observations
# Negative indexing removes specific observations
# Here we remove rows 5 and 6 (identified as potential outliers from diagnostics)
xx <- x[-c(5, 6)]
yy <- y[-c(5, 6)]
f1 <- lm(yy ~ xx) # original model without outliers
summary(f1)
f2 <- lm(sqrt(yy) ~ xx) # transformed model without outliers
summary(f2)
# If results change substantially, those observations are highly influentialThis dataset contains cholesterol levels (mg/L) and daily saturated fat intake (mg) for 10 Olympic athletes.
rm(list = ls())
setwd("C:\\Users\\ASUS\\Downloads")
dt2 <- read.csv("TheAthleteData.csv")
str(dt2)
x <- dt2$Fat.Intake
y <- dt2$Cholesterol
par(mfrow = c(1, 1))
plot(x, y, xlab = "Fat Intake", ylab = "Cholesterol",
main = "Cholesterol vs Fat Intake")a. Fit the Linear Model and Check Assumptions
fit.lin <- lm(y ~ x)
summary(fit.lin)
anova(fit.lin)
pred.lin <- predict(fit.lin)
res.lin <- resid(fit.lin)
par(mfrow = c(2, 2))
plot(fit.lin)
# Formal assumption tests
var.test(res.lin[1:5], res.lin[6:10]) # F-test: equal variance
bptest(fit.lin) # Breusch-Pagan
dwtest(fit.lin) # Durbin-Watson
shapiro.test(res.lin) # Shapiro-Wilkb. Box-Cox to Identify an Appropriate Transformation
c. Fit a Quadratic Model (if the scatter plot shows curvature)
\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\]
# I(x^2) protects the squaring operation inside the formula
fit.qua <- lm(y ~ x + I(x^2))
summary(fit.qua)
anova(fit.qua)
res.qua <- resid(fit.qua)
par(mfrow = c(2, 2))
plot(fit.qua)
# Repeat all diagnostics
var.test(res.qua[1:5], res.qua[6:10],
ratio = 1, alternative = "two.sided", conf.level = 0.95)
bptest(fit.qua)
dwtest(fit.qua)
shapiro.test(res.qua)d. Summary Comparison Table
Record results from both models:
| Statistic | Linear Model | Quadratic / Transformed |
|---|---|---|
| \(R^2\) | ||
| Adjusted \(R^2\) | ||
| Residual SE | ||
| Shapiro-Wilk p-value | ||
| Breusch-Pagan p-value | ||
| Durbin-Watson statistic | ||
| F-test (var.test) p-value |