library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(trees) # Load the built-in dataset (trees)
model <- lm(Volume ~ Girth + Height, data = trees) # preliminary fit on the raw data; the model is refit on the cleaned data later
head(trees) # the first six observations of the dataset
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
summary(trees) # summary statistics for all variables
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
str(trees) # shows variable names, data types, and number of observations
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
Girth (inches)
Range: 8.3 – 20.6
Median: 12.9
Mean: 13.25
The middle half of the trees have girths between 11.05 and 15.25 inches (first to third quartile).
Height (feet)
Range: 63 – 87
Median: 76
Mean: 76
The middle half of the heights lie between 72 and 80 feet, so height varies less, relative to its mean, than girth.
Volume (cubic feet)
Range: 10.2 – 77.0
Median: 24.2
Mean: 30.17
The middle half of the volumes fall between 19.4 and 37.3 cubic feet. The mean (30.17) exceeds the median (24.2), which points to a right-skewed distribution with a few large trees reaching up to 77 cubic feet.
Interpretation
The dataset contains 31 observations of tree girth, height, and volume.
Relative to its mean, girth varies more than height, and it is expected to be the main driver of volume; the regression later in the report examines this.
Height is fairly consistent across trees, but may still contribute to differences in volume.
Volume has the largest spread, which is consistent with girth and height combining to determine tree size.
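The comparison of spreads above can be made explicit with the interquartile range and the coefficient of variation (standard deviation divided by the mean). This is a minimal sketch; the names `cv` and `spread_summary` are illustrative and not part of the original analysis.

```r
# Compare the spread of each variable in the built-in trees data
data(trees)

# Coefficient of variation: standard deviation relative to the mean (unit-free)
cv <- function(x) sd(x) / mean(x)

spread_summary <- data.frame(
  iqr = sapply(trees, IQR),  # width of the middle half of the data
  cv  = sapply(trees, cv)    # relative variability
)
spread_summary
```

A larger `cv` means more variability relative to the variable's own scale, which is the sense in which girth is more variable than height even though height has the wider absolute range.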
# Load dataset
data(trees)
# Step 1: Check missing values
colSums(is.na(trees))
## Girth Height Volume
## 0 0 0
# Count the number of missing values in each variable
# Step 2: Remove rows with missing values
trees_no_na <- na.omit(trees)
dim(trees_no_na)
## [1] 31 3
# Check the dimensions of the cleaned dataset
# Step 3: Visualize outliers
boxplot(trees_no_na[, c("Girth", "Height", "Volume")],
main = "Boxplots of Tree Measurements",
col = c("lightblue", "lightgreen", "lightpink"))
# Step 4: Detect outliers for each variable
Girth_out <- boxplot.stats(trees_no_na$Girth)$out
Height_out <- boxplot.stats(trees_no_na$Height)$out
Volume_out <- boxplot.stats(trees_no_na$Volume)$out
# Step 5: Remove rows containing any outliers
trees_clean <- trees_no_na[
!trees_no_na$Girth %in% Girth_out &
!trees_no_na$Height %in% Height_out &
!trees_no_na$Volume %in% Volume_out, ]
# Step 6: Check dimensions after removing outliers
dim(trees_clean)
## [1] 30 3
# Step 7: Visualize cleaned dataset
boxplot(trees_clean[, c("Girth", "Height", "Volume")],
main = "Boxplots without Outliers",
col = c("lightblue", "lightgreen", "lightpink"))
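Interpretation
The dataset was already complete (no missing values), so na.omit() removed no rows.
Outlier removal dropped one row, reducing the sample from 31 to 30 observations.
Only Volume has a value beyond the boxplot whiskers (the largest tree, 77.0 cubic feet); Girth and Height have none.
Dropping this tree keeps a single extreme value from dominating the fit, but it also discards a real measurement, so the results are less informative about very large trees.
The cleaned dataset (trees_clean) is used for the regression modeling and correlation analysis that follow.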
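To see which values the boxplot rule actually flags, the fences can be computed directly. The sketch below is illustrative only: `flag_outliers`, `outlier_report`, and `flagged_rows` are hypothetical names, and the fences follow the default 1.5 × IQR rule that `boxplot.stats()` applies to the hinges.

```r
data(trees)

# Return the 1.5 * IQR fences and the values outside them.
# boxplot.stats() works with hinges, which can differ slightly from quantile().
flag_outliers <- function(x) {
  stats  <- boxplot.stats(x)
  hinges <- stats$stats[c(2, 4)]                  # lower and upper hinge
  fences <- hinges + c(-1.5, 1.5) * diff(hinges)  # lower and upper fence
  list(lower = fences[1], upper = fences[2], flagged = stats$out)
}

outlier_report <- lapply(trees, flag_outliers)
str(outlier_report)

# Rows with a flagged value in any variable (the rows the filter above removes)
flagged_rows <- which(Reduce(`|`, lapply(trees, function(x) x %in% boxplot.stats(x)$out)))
trees[flagged_rows, ]
```

Inspecting the flagged rows before deleting them makes it easier to judge whether a value is a measurement error or simply a large tree.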
### PREDICTED VALUES
model <- lm(Volume ~ Girth + Height, data = trees_clean)
predicted_volume <- predict(model) # Obtain predicted volume values from the model
head(predicted_volume) # first few predicted values
## 1 2 3 4 5 6
## 5.866600 5.713969 6.011098 16.314930 19.902849 20.948902
summary(model)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6396 -1.2786 -0.4938 2.0262 6.4599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52.2362 8.0390 -6.498 5.76e-07 ***
## Girth 4.4773 0.2518 17.781 < 2e-16 ***
## Height 0.2992 0.1179 2.538 0.0172 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.49 on 27 degrees of freedom
## Multiple R-squared: 0.9437, Adjusted R-squared: 0.9395
## F-statistic: 226.3 on 2 and 27 DF, p-value: < 2.2e-16
Model Summary
Intercept = -52.24
The predicted volume when Girth and Height are both 0 (not physically meaningful, but required mathematically).
Girth coefficient = 4.48
For each additional inch of girth, predicted tree volume increases by about 4.48 cubic feet, holding height constant.
Very strong predictor (t = 17.8, p < 2e-16).
Height coefficient = 0.30
For each additional foot of height, predicted tree volume increases by about 0.30 cubic feet, holding girth constant.
Weaker predictor, but still significant at the 5% level (t = 2.54, p = 0.017).
Model Fit
Residuals: All residuals lie between -5.6 and +6.5 cubic feet, and the middle half lie between -1.3 and +2.0, so predictions are close to the actual values.
Residual Standard Error = 3.49 → A typical prediction is off by about 3.5 cubic feet.
R² = 0.944 → The model explains 94.4% of the variation in tree volume.
Adjusted R² = 0.940 → Still very strong after adjusting for the number of predictors.
F-statistic = 226.3 on 2 and 27 DF, p < 2.2e-16 → The overall model is highly significant.
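The fit statistics above can be reproduced from their definitions, which helps show what each number measures. This is a sketch under stated assumptions: the cleaned data are rebuilt with a compact helper (`is_out`, a hypothetical name) that should match the outlier filter used earlier, and the `new_tree` values are invented purely for illustration.

```r
data(trees)

# Rebuild the cleaned data: drop rows flagged by the boxplot rule in any variable
is_out <- function(x) x %in% boxplot.stats(x)$out
trees_clean <- trees[!Reduce(`|`, lapply(trees, is_out)), ]

model <- lm(Volume ~ Girth + Height, data = trees_clean)

# R-squared and residual standard error from their definitions
rss <- sum(residuals(model)^2)                                 # residual sum of squares
tss <- sum((trees_clean$Volume - mean(trees_clean$Volume))^2)  # total sum of squares
r2_manual  <- 1 - rss / tss
rse_manual <- sqrt(rss / df.residual(model))
c(r2 = r2_manual, rse = rse_manual)

# 95% confidence intervals for the coefficients
confint(model)

# Prediction interval for a hypothetical tree (values chosen only for illustration)
new_tree <- data.frame(Girth = 12, Height = 75)
predict(model, newdata = new_tree, interval = "prediction")
```

The confidence intervals give a range for each coefficient, and the prediction interval shows how uncertain a single new prediction is, which the point estimates alone do not convey.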
### MODEL DIAGNOSTICS
### Diagnostic plots include:
# Residuals vs Fitted: checks the linearity assumption.
# Normal Q-Q Plot: checks normality of residuals.
# Scale-Location Plot: checks constant variance (homoscedasticity).
# Residuals vs Leverage: identifies influential observations.
par(mfrow = c(2,2)) # Arrange four plots in a 2 × 2 layout
plot(model)
The diagnostic plots provide a visual check of the regression
assumptions. The Residuals vs Fitted plot shows whether the relationship
between predictors and the outcome is linear, while the Normal Q-Q plot
shows whether the residuals follow a normal distribution. The
Scale-Location plot assesses homoscedasticity, that is, whether the
residuals have constant variance across fitted values. The Residuals vs
Leverage plot highlights observations that have an unusually large
influence on the fitted coefficients.
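The plots can be complemented with numeric checks. The sketch below is one possible set, not the method used in the report: it rebuilds the cleaned data with a hypothetical helper `is_out`, and the cut-offs (4/n for Cook's distance, twice the mean leverage) are common rules of thumb rather than fixed standards.

```r
data(trees)

# Rebuild the cleaned data: drop rows flagged by the boxplot rule in any variable
is_out <- function(x) x %in% boxplot.stats(x)$out
trees_clean <- trees[!Reduce(`|`, lapply(trees, is_out)), ]
model <- lm(Volume ~ Girth + Height, data = trees_clean)

# Normality of residuals (complements the Normal Q-Q plot)
shapiro_result <- shapiro.test(residuals(model))

# Influence: Cook's distance, using the 4/n rule of thumb
cooks <- cooks.distance(model)
influential <- which(cooks > 4 / nrow(trees_clean))

# Leverage: hat values, compared with twice the average leverage
leverage <- hatvalues(model)
high_leverage <- which(leverage > 2 * mean(leverage))

shapiro_result$p.value
influential
high_leverage
```

A small Shapiro-Wilk p-value would cast doubt on normality of the residuals, and any rows listed as influential or high-leverage are worth inspecting individually.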
##Relationship between variables
# Simple regression: Volume ~ Girth
plot(trees_clean$Girth, trees_clean$Volume,
main = "Volume vs Girth",
xlab = "Girth (inches)",
ylab = "Volume (cubic feet)",
pch = 19, col = "darkgreen")
abline(lm(Volume ~ Girth, data = trees_clean),
col = "tomato", lwd = 2)
# Simple regression: Volume ~ Height
plot(trees_clean$Height, trees_clean$Volume,
main = "Volume vs Height",
xlab = "Height (feet)",
ylab = "Volume (cubic feet)",
pch = 19, col = "blue")
abline(lm(Volume ~ Height, data = trees_clean),
col = "darkblue", lwd = 2)
# Multiple regression: Volume ~ Girth + Height
model <- lm(Volume ~ Girth + Height, data = trees_clean)
# Predicted values
predicted <- predict(model, trees_clean)
# Actual vs Predicted plot
plot(trees_clean$Volume, predicted,
main = "Actual vs Predicted Volume",
xlab = "Actual Volume",
ylab = "Predicted Volume",
pch = 19, col = "orange")
# Add 45-degree reference line (perfect fit)
abline(a = 0, b = 1, col = "blue", lwd = 2)
# Add regression line of predicted ~ actual
abline(lm(predicted ~ trees_clean$Volume), col = "red", lwd = 2, lty = 2)
# Show R² value
r2 <- summary(model)$r.squared
legend("topleft", legend = paste("R² =", round(r2, 3)),
bty = "n", text.col = "darkblue")
##CORRELATION MATRIX
##The multiple linear regression analysis indicates that both Girth and
##Height are statistically significant predictors of Volume.
#Examine correlations among variables
cor(trees_clean[, c("Volume", "Girth", "Height")])
## Volume Girth Height
## Volume 1.0000000 0.9645078 0.5333608
## Girth 0.9645078 1.0000000 0.4454272
## Height 0.5333608 0.4454272 1.0000000
The correlation analysis shows that tree volume is very strongly correlated with girth (r ≈ 0.96) and moderately correlated with height (r ≈ 0.53), while girth and height are more weakly related to each other (r ≈ 0.45). This agrees with the regression results: girth is the dominant predictor of volume, while height adds a smaller but statistically significant effect.
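Because girth is measured in inches and height in feet, the raw regression coefficients are not directly comparable. One way to compare the two predictors on a common scale is to standardise all variables before fitting. This is a sketch only; `is_out`, `trees_std`, `std_model`, and `std_coefs` are illustrative names that do not appear in the original analysis.

```r
data(trees)

# Rebuild the cleaned data: drop rows flagged by the boxplot rule in any variable
is_out <- function(x) x %in% boxplot.stats(x)$out
trees_clean <- trees[!Reduce(`|`, lapply(trees, is_out)), ]

# Put every variable on the same scale (mean 0, standard deviation 1)
trees_std <- as.data.frame(scale(trees_clean))

# Coefficients now read as: change in Volume (in SDs) per 1 SD change in the predictor
std_model <- lm(Volume ~ Girth + Height, data = trees_std)
std_coefs <- coef(std_model)[c("Girth", "Height")]
std_coefs
```

The predictor with the larger standardised coefficient in absolute value has the stronger association with volume once the other predictor is held constant.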
Question II: The main variable selection methods in R
1. Stepwise Selection (Sequential Search)
Definition:
An automated algorithm that adds or removes variables one by one based on statistical tests or criteria like AIC or BIC.
Role:
To quickly reduce a long list of predictors to a manageable baseline model.
When to use:
Use this when you have a moderate number of predictors (under 30) and need a fast, traditional statistical model where every step is easy to explain.
Example:
``` r
# 1. Fit the full model
full_model <- lm(mpg ~ ., data = mtcars)
# 2. Run bidirectional stepwise selection using AIC
step_model <- step(full_model, direction = "both", trace = 0)
# 3. View the selected final variables
summary(step_model)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
```
2. LASSO
Definition:
Least Absolute Shrinkage and Selection Operator. A regression technique that adds a penalty (λ × Σ|β|) on the size of the coefficients to the loss, shrinking the least important coefficients exactly to zero.
Role:
To perform automated variable selection and regularisation simultaneously, which helps prevent overfitting.
When to use:
Use this when you have more variables than observations (\(p > n\)), or when you suspect only a few variables out of a long list are actually important.
3. Random Forest Variable Importance
Definition:
Trains an ensemble of decision trees and measures each variable’s importance by the mean decrease in impurity (Gini) or by the drop in accuracy when the variable is permuted.
Role:
Screening that does not assume a linear model. It can detect non-linear effects and interactions that regression-based methods miss, which makes it a good first step before modelling.
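The LASSO penalty described above can be illustrated with a small coordinate-descent routine. In practice a dedicated package such as glmnet is commonly used; the sketch below uses base R only so that it runs without extra installs. The function names `soft_threshold` and `lasso_cd`, the fixed number of iterations, and the choice `lambda = 1` are all assumptions made for illustration, and the coefficients are on the standardised scale.

```r
# Soft-thresholding operator: shrinks z toward zero and sets small values to exactly zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Coordinate-descent LASSO minimising (1 / (2n)) * RSS + lambda * sum(|beta|).
# Predictors are standardised and the response is centred, so no intercept is fitted.
lasso_cd <- function(x, y, lambda, n_iter = 200) {
  n <- nrow(x)
  x <- scale(x) * sqrt(n / (n - 1))   # each column: mean 0, mean square 1
  y <- y - mean(y)
  beta <- setNames(numeric(ncol(x)), colnames(x))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Residual with predictor j left out of the current fit
      partial_resid <- y - drop(x[, -j, drop = FALSE] %*% beta[-j])
      beta[j] <- soft_threshold(mean(x[, j] * partial_resid), lambda)
    }
  }
  beta
}

x <- as.matrix(mtcars[, -1])   # all predictors of mpg
y <- mtcars$mpg

beta_lasso <- lasso_cd(x, y, lambda = 1)   # lambda = 1 is an arbitrary illustration
names(beta_lasso)[beta_lasso != 0]         # variables kept by the penalty
```

Raising `lambda` removes more variables, and a sufficiently large value sets every coefficient to zero. A real analysis would choose `lambda` by cross-validation rather than fixing it by hand.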
Thanks