# Load the built-in airquality dataset
data("airquality")
# Display the structure of the dataset
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
# Display summary statistics
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NAs :37 NAs :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
# Remove rows containing missing values (NA)
aq <- na.omit(airquality)
# Show number of rows and columns after removing missing values
dim(aq)
## [1] 111 6
# Display first six observations
head(aq)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
# Create a scatterplot matrix to visualize
# relationships among variables
pairs(
aq[, c("Ozone", "Solar.R", "Wind", "Temp")],
main = "Scatterplot Matrix"
)
# Create a multiple linear regression model
# Ozone is the dependent variable
# Solar.R, Wind, and Temp are independent variables
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)
# Display regression results including
# coefficients, p-values, R-squared, etc.
summary(model)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.219 -3.551 10.097 95.619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208 23.05472 -2.791 0.00623 **
## Solar.R 0.05982 0.02319 2.580 0.01124 *
## Wind -3.33359 0.65441 -5.094 1.52e-06 ***
## Temp 1.65209 0.25353 6.516 2.42e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
# Display estimated coefficients
coef(model)
## (Intercept) Solar.R Wind Temp
## -64.34207893 0.05982059 -3.33359131 1.65209291
# Plot Ozone against Temperature
plot(
aq$Temp,
aq$Ozone,
xlab = "Temperature",
ylab = "Ozone",
main = "Ozone vs Temperature"
)
# Add simple regression line
abline(
lm(Ozone ~ Temp, data = aq),
col = "red",
lwd = 2
)
# Plot Ozone against Wind
plot(
aq$Wind,
aq$Ozone,
xlab = "Wind",
ylab = "Ozone",
main = "Ozone vs Wind"
)
# Add simple regression line
abline(
lm(Ozone ~ Wind, data = aq),
col = "blue",
lwd = 2
)
# Plot Ozone against Solar Radiation
plot(
aq$Solar.R,
aq$Ozone,
xlab = "Solar Radiation",
ylab = "Ozone",
main = "Ozone vs Solar Radiation"
)
# Add simple regression line
abline(
lm(Ozone ~ Solar.R, data = aq),
col = "green",
lwd = 2
)
# Calculate predicted Ozone values from the model
predicted <- predict(model)
# Compare actual and predicted values
plot(
predicted,
aq$Ozone,
xlab = "Predicted Ozone",
ylab = "Observed Ozone",
main = "Observed vs Predicted Ozone"
)
# Add reference line (perfect prediction)
abline(
a = 0,
b = 1,
col = "red",
lwd = 2
)
# Arrange plots in 2x2 layout
par(mfrow = c(2, 2))
# Produce regression diagnostic plots
# Residuals vs Fitted
# Normal Q-Q
# Scale-Location
# Residuals vs Leverage
plot(model)
# Reset plotting area
par(mfrow = c(1, 1))
# Calculate 95% confidence intervals
# for regression coefficients
confint(model)
## 2.5 % 97.5 %
## (Intercept) -110.04538108 -18.6387768
## Solar.R 0.01385613 0.1057851
## Wind -4.63087706 -2.0363055
## Temp 1.14949967 2.1546862
# Generate Analysis of Variance (ANOVA) table
anova(model)
## Analysis of Variance Table
##
## Response: Ozone
## Df Sum Sq Mean Sq F value Pr(>F)
## Solar.R 1 14780 14780 32.944 8.946e-08 ***
## Wind 1 39969 39969 89.094 9.509e-16 ***
## Temp 1 19050 19050 42.463 2.424e-09 ***
## Residuals 107 48003 449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA tests whether the predictors
# collectively explain a significant amount
# of variation in Ozone.
# It provides
# Df= Degrees of Freedom
# Sum Sq= Variation explained
# Mean Sq= Average variation
# F value= Test statistic
# Pr(>F)= p-value
# If p-value < 0.05:
# The predictor significantly contributes
# to explaining Ozone levels.
# Percentage of variability explained by the model
summary(model)$r.squared
## [1] 0.6058946
# R-squared adjusted for number of predictors
summary(model)$adj.r.squared
## [1] 0.5948449
The multiple linear regression model shows that Temperature and Solar Radiation increase Ozone levels, while Wind decreases Ozone levels. The model explains approximately 60% of the variation in Ozone concentration, indicating a reasonably good fit. The ANOVA table confirms that the predictors contribute significantly to explaining changes in Ozone levels.
# Load dataset
data("airquality")
# Remove missing values
aq <- na.omit(airquality)
# Fit Multiple Linear Regression Model
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)
# Display Model Summary
summary(model)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.219 -3.551 10.097 95.619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208 23.05472 -2.791 0.00623 **
## Solar.R 0.05982 0.02319 2.580 0.01124 *
## Wind -3.33359 0.65441 -5.094 1.52e-06 ***
## Temp 1.65209 0.25353 6.516 2.42e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
# Display Coefficients
coef(model)
## (Intercept) Solar.R Wind Temp
## -64.34207893 0.05982059 -3.33359131 1.65209291
# ANOVA Table
anova(model)
## Analysis of Variance Table
##
## Response: Ozone
## Df Sum Sq Mean Sq F value Pr(>F)
## Solar.R 1 14780 14780 32.944 8.946e-08 ***
## Wind 1 39969 39969 89.094 9.509e-16 ***
## Temp 1 19050 19050 42.463 2.424e-09 ***
## Residuals 107 48003 449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# R-squared
summary(model)$r.squared
## [1] 0.6058946
# Adjusted R-squared
summary(model)$adj.r.squared
## [1] 0.5948449
# Create the full model with all predictors
full_model <- lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = aq)
# Create a null model with no predictors
null_model <- lm(Ozone ~ 1, data = aq)
# Forward Selection
# Starts with no predictors and adds the
# most significant variables one at a time
forward_model <- step(
null_model,
scope = list(
lower = null_model,
upper = full_model
),
direction = "forward"
)
## Start: AIC=779.07
## Ozone ~ 1
##
## Df Sum of Sq RSS AIC
## + Temp 1 59434 62367 706.77
## + Wind 1 45694 76108 728.87
## + Solar.R 1 14780 107022 766.71
## + Month 1 2487 119315 778.78
## <none> 121802 779.07
## + Day 1 3 121799 781.07
##
## Step: AIC=706.77
## Ozone ~ Temp
##
## Df Sum of Sq RSS AIC
## + Wind 1 11378.5 50989 686.41
## + Month 1 2824.7 59543 703.63
## + Solar.R 1 2723.1 59644 703.82
## <none> 62367 706.77
## + Day 1 476.5 61891 707.92
##
## Step: AIC=686.41
## Ozone ~ Temp + Wind
##
## Df Sum of Sq RSS AIC
## + Solar.R 1 2986.17 48003 681.71
## + Month 1 2734.79 48254 682.29
## <none> 50989 686.41
## + Day 1 486.59 50502 687.35
##
## Step: AIC=681.71
## Ozone ~ Temp + Wind + Solar.R
##
## Df Sum of Sq RSS AIC
## + Month 1 1701.18 46302 679.71
## <none> 48003 681.71
## + Day 1 564.53 47438 682.40
##
## Step: AIC=679.71
## Ozone ~ Temp + Wind + Solar.R + Month
##
## Df Sum of Sq RSS AIC
## <none> 46302 679.71
## + Day 1 618.68 45683 680.21
# Display the final selected model
summary(forward_model)
##
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R + Month, data = aq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.870 -13.968 -2.671 9.553 97.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.05384 22.97114 -2.527 0.0130 *
## Temp 1.87087 0.27363 6.837 5.34e-10 ***
## Wind -3.31651 0.64579 -5.136 1.29e-06 ***
## Solar.R 0.04960 0.02346 2.114 0.0368 *
## Month -2.99163 1.51592 -1.973 0.0510 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
## F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
# Backward Elimination
# Starts with all predictors and removes
# the least important variables one at a time
backward_model <- step(
full_model,
direction = "backward"
)
## Start: AIC=680.21
## Ozone ~ Solar.R + Wind + Temp + Month + Day
##
## Df Sum of Sq RSS AIC
## - Day 1 618.7 46302 679.71
## <none> 45683 680.21
## - Month 1 1755.3 47438 682.40
## - Solar.R 1 2005.1 47688 682.98
## - Wind 1 11533.9 57217 703.20
## - Temp 1 20845.0 66528 719.94
##
## Step: AIC=679.71
## Ozone ~ Solar.R + Wind + Temp + Month
##
## Df Sum of Sq RSS AIC
## <none> 46302 679.71
## - Month 1 1701.2 48003 681.71
## - Solar.R 1 1952.6 48254 682.29
## - Wind 1 11520.5 57822 702.37
## - Temp 1 20419.5 66721 718.26
# Display the final selected model
summary(backward_model)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = aq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.870 -13.968 -2.671 9.553 97.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.05384 22.97114 -2.527 0.0130 *
## Solar.R 0.04960 0.02346 2.114 0.0368 *
## Wind -3.31651 0.64579 -5.136 1.29e-06 ***
## Temp 1.87087 0.27363 6.837 5.34e-10 ***
## Month -2.99163 1.51592 -1.973 0.0510 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
## F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
# Stepwise Selection
# Combines forward selection and backward
# elimination to find the best model
stepwise_model <- step(
full_model,
direction = "both"
)
## Start: AIC=680.21
## Ozone ~ Solar.R + Wind + Temp + Month + Day
##
## Df Sum of Sq RSS AIC
## - Day 1 618.7 46302 679.71
## <none> 45683 680.21
## - Month 1 1755.3 47438 682.40
## - Solar.R 1 2005.1 47688 682.98
## - Wind 1 11533.9 57217 703.20
## - Temp 1 20845.0 66528 719.94
##
## Step: AIC=679.71
## Ozone ~ Solar.R + Wind + Temp + Month
##
## Df Sum of Sq RSS AIC
## <none> 46302 679.71
## + Day 1 618.7 45683 680.21
## - Month 1 1701.2 48003 681.71
## - Solar.R 1 1952.6 48254 682.29
## - Wind 1 11520.5 57822 702.37
## - Temp 1 20419.5 66721 718.26
# Display the final selected model
summary(stepwise_model)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = aq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.870 -13.968 -2.671 9.553 97.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.05384 22.97114 -2.527 0.0130 *
## Solar.R 0.04960 0.02346 2.114 0.0368 *
## Wind -3.31651 0.64579 -5.136 1.29e-06 ***
## Temp 1.87087 0.27363 6.837 5.34e-10 ***
## Month -2.99163 1.51592 -1.973 0.0510 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
## F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
# Compare model performance using AIC
# Lower AIC indicates a better model
AIC(full_model)
## [1] 997.2188
AIC(forward_model)
## [1] 996.7119
AIC(backward_model)
## [1] 996.7119
AIC(stepwise_model)
## [1] 996.7119
Forward Selection starts with no variables and adds predictors that improve the model. Backward Elimination starts with all predictors and removes those that do not contribute significantly. Stepwise Selection adds and removes predictors to find the best combination of variables. AIC(Akaike Information Criterion) used to compare models; the model with the lowest AIC is generally preferred.