This study uses the built-in airquality dataset available in R to investigate factors associated with air pollution. The dataset contains daily air quality measurements collected in New York, USA, from May to September 1973. It includes ozone concentration, solar radiation, wind speed, temperature, month, and day. These environmental variables are commonly used to understand the factors that influence air quality and atmospheric conditions. The dataset is therefore suitable for building a multiple regression model in R and interpreting how Solar Radiation, Wind Speed, and Temperature affect Ozone concentration.
#Response Variable (Y) (dependent)
Ozone = Ozone concentration
#Predictor Variables (X) (independent)
Solar.R = Solar Radiation
Wind = Wind Speed
Temp = Temperature
# 1. Load the Dataset
# Import the built-in dataset
data(airquality) # Load the built-in airquality dataset
head(airquality) # Show the first six observations of the dataset
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
summary(airquality) # used to display descriptive (summary) statistics for each variable in the dataset
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NAs :37 NAs :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
str(airquality) # Display the structure of the dataset: object type, number of observations and variables, variable names, data types, and sample values
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
The dataset contains 153 observations and 6 variables, but only 4 variables are used in the analysis. The remaining 2 variables (Day and Month) are not used in this model.
# 2. Exploratory Data Analysis (EDA)
Before fitting the multiple linear regression model, I conduct an exploratory data analysis (EDA). This process includes checking for missing values, removing incomplete observations, checking for duplicate observations, identifying outliers, and examining variable distributions.
colSums(is.na(airquality)) # Count the number of missing values in each variable
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
air <- na.omit(airquality) # Remove rows containing missing values
sum(duplicated(airquality)) # Check for duplicate Observations
## [1] 0
dim(air) # Check the dimensions of the cleaned dataset
## [1] 111 6
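The drop from 153 to 111 rows can be checked directly. Ozone and Solar.R have 37 + 7 = 44 missing values in total, but a row such as row 5 in the head() output is missing both, so fewer than 44 rows are removed. A minimal sketch (the name n_incomplete is only illustrative):

```r
data(airquality)

# Rows that contain at least one NA in any column
n_incomplete <- sum(!complete.cases(airquality))
n_incomplete

# na.omit() keeps exactly the complete rows, so the counts must agree
rows_kept <- nrow(airquality) - n_incomplete
rows_kept == nrow(na.omit(airquality))
```

If the two counts agree, na.omit() removed only rows with missing values and nothing else.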
#Boxplots for all numeric variables to identify outliers in the airquality dataset
boxplot(airquality[, c("Ozone", "Solar.R","Wind", "Temp")],
main = "Boxplots of environmental variables (numerical values)",
col = c("blue", "darkgreen", "tomato", "orange"))
The boxplots show some outliers in Ozone and Wind. I can therefore identify and extract them using boxplot.stats().
boxplot.stats(airquality$Ozone)$out
## [1] 135 168
boxplot.stats(airquality$Solar.R)$out
## integer(0)
boxplot.stats(airquality$Wind)$out
## [1] 20.1 18.4 20.7
boxplot.stats(airquality$Temp)$out
## integer(0)
The boxplots and the boxplot.stats() function identified five potential outliers in the environmental variables: two in Ozone (135 and 168) and three in Wind (20.1, 18.4, and 20.7). Solar.R and Temp have no outliers.
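To make the outlier rule explicit, the sketch below reproduces what boxplot.stats() does for Ozone under its default setting (coef = 1.5): values more than 1.5 times the hinge spread beyond the lower or upper hinge are flagged. The names oz, spread, lower_fence, upper_fence, and manual_out are illustrative only.

```r
data(airquality)

# Ozone values with missing entries dropped
oz <- airquality$Ozone[!is.na(airquality$Ozone)]

# fivenum() returns min, lower hinge, median, upper hinge, max
h <- fivenum(oz)
spread <- h[4] - h[2]                  # hinge-based interquartile range
lower_fence <- h[2] - 1.5 * spread
upper_fence <- h[4] + 1.5 * spread

# Values outside the fences are what boxplot.stats() reports in $out
manual_out <- oz[oz < lower_fence | oz > upper_fence]
manual_out
```

The hinges from fivenum() can differ slightly from the quartiles printed by summary(), so the fences should be computed from the hinges when the goal is to match the boxplot.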
#Removing outliers
air_clean <- air[
air$Ozone >= 1 &
air$Ozone <= 122 &
air$Wind >= 1 &
air$Wind <= 18,
]
nrow(air)
## [1] 111
nrow(air_clean)
## [1] 106
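The cutoffs 122 and 18 above are typed in by hand. As an alternative sketch, the same rows can be dropped by reusing the values reported by boxplot.stats(), which avoids hard-coded thresholds. The names oz_out, wind_out, and air_clean2 are hypothetical and not part of the original analysis.

```r
data(airquality)
air <- na.omit(airquality)

# Outlying values as identified on the full dataset
oz_out   <- boxplot.stats(airquality$Ozone)$out
wind_out <- boxplot.stats(airquality$Wind)$out

# Keep rows whose Ozone and Wind values are not among the flagged outliers
keep <- !(air$Ozone %in% oz_out) & !(air$Wind %in% wind_out)
air_clean2 <- air[keep, ]
nrow(air_clean2)
```

This should leave the same number of rows as the manual filter, as long as the outliers are identified on the same data.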
New boxplots after removing the outliers:
boxplot(air_clean[, c("Ozone", "Solar.R","Wind", "Temp")],
main = "Boxplots of environmental variables (numerical values)",
col = c("blue", "darkgreen", "tomato", "orange"))
#Correlation analysis: relationships between the predictor variables (solar radiation, wind speed, temperature) and the response variable (ozone concentration)
# Examine correlations among variables
round(cor(air_clean[, c("Ozone", "Solar.R", "Wind", "Temp")]),2)
## Ozone Solar.R Wind Temp
## Ozone 1.00 0.33 -0.60 0.75
## Solar.R 0.33 1.00 -0.07 0.26
## Wind -0.60 -0.07 1.00 -0.45
## Temp 0.75 0.26 -0.45 1.00
#Interpretation of the Correlation Matrix
The correlation analysis revealed several important relationships among the variables in the airquality dataset. Ozone concentration showed a weak-to-moderate positive correlation with Solar Radiation (r ≈ 0.33), indicating that ozone levels tend to increase on sunnier days. A strong negative correlation was observed between Ozone and Wind Speed (r ≈ -0.60), which is consistent with higher wind speeds dispersing pollutants and reducing ozone concentration. Ozone also exhibited a strong positive correlation with Temperature (r ≈ 0.75), implying that warmer temperatures are associated with increased ozone formation. Solar Radiation and Wind Speed had a very weak negative correlation (r ≈ -0.07), indicating little to no relationship between these variables. Additionally, Solar Radiation and Temperature were weakly positively correlated (r ≈ 0.26), meaning that days with higher solar radiation tend to be slightly warmer. Finally, Wind Speed and Temperature showed a moderate negative correlation (r ≈ -0.45), suggesting that higher temperatures generally occur on days with lower wind speeds. Overall, Temperature and Wind Speed appear to be the factors most strongly associated with Ozone concentration in the dataset.
Now multiple linear regression is fitted using data with outliers removed from both Ozone and Wind.
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_clean)
# Ozone is the response (dependent) variable
# Solar.R, Wind, and Temp are predictor (independent) variables
summary(model) # summary of regression results
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = air_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.328 -12.312 -1.728 9.535 57.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -75.90059 19.47836 -3.897 0.000175 ***
## Solar.R 0.05233 0.01956 2.676 0.008690 **
## Wind -3.27947 0.61231 -5.356 5.28e-07 ***
## Temp 1.77984 0.21094 8.438 2.28e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.45 on 102 degrees of freedom
## Multiple R-squared: 0.6719, Adjusted R-squared: 0.6623
## F-statistic: 69.64 on 3 and 102 DF, p-value: < 2.2e-16
As shown in the model output above, the estimated coefficients are:
(Intercept) = -75.90059
Solar.R = 0.05233
Wind = -3.27947
Temp = 1.77984
The model fitted with lm(Ozone ~ Solar.R + Wind + Temp, data = air_clean) has the form:
Ozone = b0 + b1(Solar.R) + b2(Wind) + b3(Temp) + ε
where b0 is the intercept, b1 is the effect of Solar Radiation, b2 is the effect of Wind Speed, b3 is the effect of Temperature, and ε is the random error.
A multiple linear regression model was fitted to investigate the relationship between Ozone concentration and three environmental factors: Solar Radiation, Wind Speed, and Temperature. Substituting the estimated coefficients, the fitted regression equation is:
Ozone = -75.90 + 0.052(Solar.R) - 3.279(Wind) + 1.780(Temp)
#Interpretation of the Intercept
The intercept coefficient (-75.90) is statistically significant (p = 0.000175) and represents the predicted ozone concentration when Solar Radiation, Wind Speed, and Temperature are all zero. However, since such conditions are unrealistic, the intercept mainly serves as a baseline value for the regression equation rather than having practical meaning.
#Interpretation of Solar Radiation
The coefficient for Solar Radiation is 0.0523 (p = 0.00869), indicating a statistically significant positive relationship with ozone concentration. This means that, holding Wind Speed and Temperature constant, a one-unit (1 langley) increase in Solar Radiation is associated with an increase in Ozone of about 0.052 units (parts per billion). In simple terms, higher sunlight levels are associated with higher ozone concentrations.
#Interpretation of Wind Speed
The coefficient for Wind Speed is -3.2795 (p < 0.001), showing a highly significant negative relationship with ozone concentration. This means that, holding Solar Radiation and Temperature constant, a one-unit (1 mph) increase in Wind Speed is associated with a decrease in Ozone of about 3.28 units. In simple terms, higher wind speeds go with lower ozone levels, which is consistent with wind helping to disperse pollutants in the air. Wind Speed is one of the strongest predictors in the model.
#Interpretation of Temperature
The coefficient for Temperature is 1.7798 (p < 0.001), indicating a highly significant positive relationship with ozone concentration. This means that, holding Solar Radiation and Wind Speed constant, a one-degree (°F) increase in Temperature is associated with an increase in Ozone of about 1.78 units. In simple terms, higher temperatures go with higher ozone levels. Temperature has the largest t value (8.44) of the three predictors, making it the strongest positive predictor in the model.
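Point estimates alone do not show how precisely each effect is estimated. As a hedged illustration, the sketch below refits the same model and uses confint() to obtain 95% confidence intervals for the coefficients. The object name ci is illustrative; the filter is copied from the outlier-removal step above.

```r
data(airquality)
air <- na.omit(airquality)

# Same outlier filter as used earlier in this report
air_clean <- air[air$Ozone >= 1 & air$Ozone <= 122 &
                 air$Wind >= 1 & air$Wind <= 18, ]

model <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_clean)

# 95% confidence intervals for the intercept and the three slopes
ci <- confint(model, level = 0.95)
round(ci, 3)
```

An interval that excludes zero agrees with a p-value below 0.05 for that coefficient.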
predicted_ozone <- predict(model) # Obtain predicted ozone values from the model
head(predicted_ozone) # first few predicted values
## 1 2 3 4 7 8
## 29.02248 32.18647 22.28269 13.11360 27.23102 -10.96658
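The first fitted value can be reproduced by hand from the regression equation, which is a useful check that the equation is being read correctly. The sketch below refits the model, plugs the first observation (Solar.R = 190, Wind = 7.4, Temp = 67) into the equation, and compares the result with predict(). The names b, first_day, by_hand, and by_predict are illustrative.

```r
data(airquality)
air <- na.omit(airquality)
air_clean <- air[air$Ozone >= 1 & air$Ozone <= 122 &
                 air$Wind >= 1 & air$Wind <= 18, ]
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_clean)

b <- coef(model)
first_day <- air_clean[1, ]   # Solar.R = 190, Wind = 7.4, Temp = 67

# Regression equation evaluated term by term
by_hand <- b[["(Intercept)"]] +
  b[["Solar.R"]] * first_day$Solar.R +
  b[["Wind"]]    * first_day$Wind +
  b[["Temp"]]    * first_day$Temp

# Same prediction obtained from predict()
by_predict <- unname(predict(model, newdata = first_day))

c(by_hand = by_hand, by_predict = by_predict)
```

Note that the predicted value for observation 8 in the output above is negative (-10.97), which is not a possible ozone concentration. A linear model does not restrict predictions to non-negative values.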
par(mfrow = c(2,2)) # Arrange the four plots in a 2 × 2 layout
plot(model) # Produce diagnostic plots
Diagnostic plots include:
1. Residuals vs Fitted: checks the linearity assumption.
2. Normal Q-Q Plot: checks normality of residuals.
3. Scale-Location Plot: checks constant variance (homoscedasticity).
4. Residuals vs Leverage: identifies influential observations.
The diagnostic plots indicate that the multiple linear regression model provides a reasonably good fit to the data. The residuals are approximately normally distributed, the linearity assumption is largely satisfied, and there is no evidence of highly influential observations. While the Residuals vs Fitted and Scale-Location plots suggest slight departures from perfect linearity and constant variance, these deviations are relatively minor. Therefore, the assumptions of multiple linear regression are generally satisfied, and the model can be considered appropriate for explaining the relationship between Ozone concentration and the environmental variables Solar Radiation, Wind Speed, and Temperature.
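Visual checks can be complemented with numerical ones. The sketch below is one possible approach, not part of the original analysis: it applies the Shapiro-Wilk test to the residuals and computes Cook's distance for each observation. The 4/n cutoff is a common rule of thumb, used here as an assumption, and the names res, sw, cd, and flagged are illustrative.

```r
data(airquality)
air <- na.omit(airquality)
air_clean <- air[air$Ozone >= 1 & air$Ozone <= 122 &
                 air$Wind >= 1 & air$Wind <= 18, ]
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_clean)

# Shapiro-Wilk test: the null hypothesis is that residuals are normal
res <- residuals(model)
sw <- shapiro.test(res)
sw$p.value

# Cook's distance measures how much each observation moves the fit
cd <- cooks.distance(model)
max(cd)

# Observations above the 4/n rule-of-thumb cutoff (an assumed threshold)
flagged <- which(cd > 4 / nrow(air_clean))
flagged
```

A small Shapiro-Wilk p-value or several flagged observations would suggest looking at the diagnostic plots again more closely.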
# Relationship between Solar Radiation and Ozone
plot(air$Solar.R, air$Ozone,
main = "Ozone vs Solar Radiation",
xlab = "Solar Radiation",
ylab = "Ozone")
abline(lm(Ozone ~ Solar.R, data = air), col = "tomato", lwd = 2)
# Relationship between Wind and Ozone
# Ozone vs Wind
plot(air$Wind, air$Ozone,
main = "Ozone vs Wind",
xlab = "Wind Speed",
ylab = "Ozone")
abline(lm(Ozone ~ Wind, data = air), col = "yellow", lwd = 2)
# Relationship between Temperature and Ozone
# Ozone vs Temperature
plot(air$Temp, air$Ozone,
main = "Ozone vs Temperature",
xlab = "Temperature",
ylab = "Ozone",
pch=17)
abline(lm(Ozone ~ Temp, data = air), col = "darkblue", lwd = 2)
A one-unit increase in Solar Radiation increases Ozone by approximately 0.05 units. Higher solar radiation contributes to ozone formation.
A one-unit increase in Wind Speed decreases Ozone by approximately 3.28 units. Stronger winds disperse pollutants, reducing ozone concentration.
A one-degree increase in Temperature increases Ozone by approximately 1.78 units. Warmer temperatures are associated with higher ozone levels.
Together, this means that all three predictors significantly affect ozone concentration.
# Multiple Correlation Coefficient
round(sqrt(summary(model)$r.squared),2)
## [1] 0.82
The value R = 0.82 shows that the model has a strong ability to predict Ozone concentration using Solar Radiation, Wind Speed, and Temperature. In other words, the predicted ozone values are strongly related to the actual ozone values observed in the dataset.
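The statement that predicted and observed values are strongly related can be verified directly: for a least-squares model with an intercept, the multiple correlation coefficient R equals the correlation between the observed response and the fitted values. A short sketch (r_from_rsq and r_from_cor are illustrative names):

```r
data(airquality)
air <- na.omit(airquality)
air_clean <- air[air$Ozone >= 1 & air$Ozone <= 122 &
                 air$Wind >= 1 & air$Wind <= 18, ]
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_clean)

# R as the square root of R-squared
r_from_rsq <- sqrt(summary(model)$r.squared)

# R as the correlation between observed and fitted ozone values
r_from_cor <- cor(air_clean$Ozone, fitted(model))

round(c(r_from_rsq, r_from_cor), 2)
```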
#Conclusion
The multiple linear regression analysis indicates that Solar Radiation and Temperature positively influence ozone levels, while Wind Speed negatively influences ozone levels. The model explains a substantial proportion of the variability in ozone concentration and can be used to predict ozone levels based on weather conditions.
Introduction
This part uses the same dataset. The built-in airquality dataset contains daily air quality measurements collected in New York from May to September 1973. The dataset includes information on Ozone concentration, Solar Radiation, Wind Speed, Temperature, Month, and Day.
The objective of this homework is to apply different variable selection methods to identify the most important predictors of Ozone concentration. The response variable is Ozone, while the candidate predictors are Solar.R, Wind, Temp, Month, and Day.
# Load dataset
data(airquality)
# Remove missing values
air <- na.omit(airquality)
# View structure (summary(air) would give summary statistics)
str(air)
## 'data.frame': 111 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 23 19 8 16 11 14 ...
## $ Solar.R: int 190 118 149 313 299 99 19 256 290 274 ...
## $ Wind : num 7.4 8 12.6 11.5 8.6 13.8 20.1 9.7 9.2 10.9 ...
## $ Temp : int 67 72 74 62 65 59 61 69 66 68 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 7 8 9 12 13 14 ...
## - attr(*, "na.action")= 'omit' Named int [1:42] 5 6 10 11 25 26 27 32 33 34 ...
## ..- attr(*, "names")= chr [1:42] "5" "6" "10" "11" ...
full_model <- lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = air)
summary(full_model)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.014 -12.284 -3.302 8.454 95.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.11632 23.48249 -2.730 0.00742 **
## Solar.R 0.05027 0.02342 2.147 0.03411 *
## Wind -3.31844 0.64451 -5.149 1.23e-06 ***
## Temp 1.89579 0.27389 6.922 3.66e-10 ***
## Month -3.03996 1.51346 -2.009 0.04714 *
## Day 0.27388 0.22967 1.192 0.23576
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.86 on 105 degrees of freedom
## Multiple R-squared: 0.6249, Adjusted R-squared: 0.6071
## F-statistic: 34.99 on 5 and 105 DF, p-value: < 2.2e-16
Interpretation of the Full Model
The full model includes all available predictors:
Ozone = b0 + b1(Solar.R) + b2(Wind) + b3(Temp) + b4(Month) + b5(Day) + ε
Variables with small p-values contribute significantly to explaining ozone concentration, while variables with large p-values may be candidates for removal.
#Method 1: Forward Selection
Forward Selection starts with no predictors and adds variables one at a time. With step(), the variable added at each step is the one that lowers the AIC the most.
null_model <- lm(Ozone ~ 1, data = air)
forward_model <- step(
null_model,
scope = formula(full_model),
direction= "forward" )
## Start: AIC=779.07
## Ozone ~ 1
##
## Df Sum of Sq RSS AIC
## + Temp 1 59434 62367 706.77
## + Wind 1 45694 76108 728.87
## + Solar.R 1 14780 107022 766.71
## + Month 1 2487 119315 778.78
## <none> 121802 779.07
## + Day 1 3 121799 781.07
##
## Step: AIC=706.77
## Ozone ~ Temp
##
## Df Sum of Sq RSS AIC
## + Wind 1 11378.5 50989 686.41
## + Month 1 2824.7 59543 703.63
## + Solar.R 1 2723.1 59644 703.82
## <none> 62367 706.77
## + Day 1 476.5 61891 707.92
##
## Step: AIC=686.41
## Ozone ~ Temp + Wind
##
## Df Sum of Sq RSS AIC
## + Solar.R 1 2986.17 48003 681.71
## + Month 1 2734.79 48254 682.29
## <none> 50989 686.41
## + Day 1 486.59 50502 687.35
##
## Step: AIC=681.71
## Ozone ~ Temp + Wind + Solar.R
##
## Df Sum of Sq RSS AIC
## + Month 1 1701.18 46302 679.71
## <none> 48003 681.71
## + Day 1 564.53 47438 682.40
##
## Step: AIC=679.71
## Ozone ~ Temp + Wind + Solar.R + Month
##
## Df Sum of Sq RSS AIC
## <none> 46302 679.71
## + Day 1 618.68 45683 680.21
summary(forward_model)
##
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R + Month, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.870 -13.968 -2.671 9.553 97.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.05384 22.97114 -2.527 0.0130 *
## Temp 1.87087 0.27363 6.837 5.34e-10 ***
## Wind -3.31651 0.64579 -5.136 1.29e-06 ***
## Solar.R 0.04960 0.02346 2.114 0.0368 *
## Month -2.99163 1.51592 -1.973 0.0510 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
## F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
Interpretation
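The forward selection procedure adds, at each step, the variable that gives the largest reduction in AIC. Here the order of entry is Temp, Wind, Solar.R, and then Month. Day is not added because it would raise the AIC from 679.71 to 680.21. The final selected model represents the best balance between model simplicity and predictive ability according to AIC.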
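A single forward step can be inspected with add1(), which lists the AIC that would result from adding each candidate to the current model. The sketch below reproduces only the first step, starting from the intercept-only model; the names first_step and best_first are illustrative.

```r
data(airquality)
air <- na.omit(airquality)

null_model <- lm(Ozone ~ 1, data = air)

# AIC after adding each candidate predictor to the intercept-only model
first_step <- add1(null_model,
                   scope = ~ Solar.R + Wind + Temp + Month + Day)
first_step

# Candidate with the lowest AIC is the one forward selection adds first
best_first <- rownames(first_step)[which.min(first_step$AIC)]
best_first
```

Repeating add1() on the updated model would trace the remaining steps shown in the step() output.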
#Method 2: Backward Elimination
Backward Elimination starts with all predictors and removes one variable at a time. With step(), the variable removed is the one whose removal lowers the AIC the most.
backward_model <- step(
full_model,
direction = "backward" )
## Start: AIC=680.21
## Ozone ~ Solar.R + Wind + Temp + Month + Day
##
## Df Sum of Sq RSS AIC
## - Day 1 618.7 46302 679.71
## <none> 45683 680.21
## - Month 1 1755.3 47438 682.40
## - Solar.R 1 2005.1 47688 682.98
## - Wind 1 11533.9 57217 703.20
## - Temp 1 20845.0 66528 719.94
##
## Step: AIC=679.71
## Ozone ~ Solar.R + Wind + Temp + Month
##
## Df Sum of Sq RSS AIC
## <none> 46302 679.71
## - Month 1 1701.2 48003 681.71
## - Solar.R 1 1952.6 48254 682.29
## - Wind 1 11520.5 57822 702.37
## - Temp 1 20419.5 66721 718.26
summary(backward_model)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.870 -13.968 -2.671 9.553 97.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.05384 22.97114 -2.527 0.0130 *
## Solar.R 0.04960 0.02346 2.114 0.0368 *
## Wind -3.31651 0.64579 -5.136 1.29e-06 ***
## Temp 1.87087 0.27363 6.837 5.34e-10 ***
## Month -2.99163 1.51592 -1.973 0.0510 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
## F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
Interpretation
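With step(), removal is based on AIC rather than directly on p-values. Day, which also has the largest p-value in the full model (0.236), is removed first because dropping it lowers the AIC from 680.21 to 679.71. The process then stops at Solar.R + Wind + Temp + Month, because removing any further variable would increase the AIC.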
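The link between AIC-based removal and p-values can be seen with drop1(), which reports both for each single-term deletion. This is an illustrative sketch, not the procedure step() prints; the names first_drop, drop_by_aic, and drop_by_p are hypothetical.

```r
data(airquality)
air <- na.omit(airquality)

full_model <- lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = air)

# AIC and F-test p-value for deleting each term from the full model
first_drop <- drop1(full_model, test = "F")
first_drop

# Term whose removal gives the lowest AIC
drop_by_aic <- rownames(first_drop)[which.min(first_drop$AIC)]

# Term with the largest p-value (the <none> row has NA and is ignored)
drop_by_p <- rownames(first_drop)[which.max(first_drop[["Pr(>F)"]])]

c(drop_by_aic = drop_by_aic, drop_by_p = drop_by_p)
```

In this dataset the two criteria point to the same variable at the first step, but they are not guaranteed to agree in general.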
#Method 3: Stepwise Regression
Stepwise Regression combines Forward Selection and Backward Elimination.
stepwise_model <- step(
full_model,
direction = "both" )
## Start: AIC=680.21
## Ozone ~ Solar.R + Wind + Temp + Month + Day
##
## Df Sum of Sq RSS AIC
## - Day 1 618.7 46302 679.71
## <none> 45683 680.21
## - Month 1 1755.3 47438 682.40
## - Solar.R 1 2005.1 47688 682.98
## - Wind 1 11533.9 57217 703.20
## - Temp 1 20845.0 66528 719.94
##
## Step: AIC=679.71
## Ozone ~ Solar.R + Wind + Temp + Month
##
## Df Sum of Sq RSS AIC
## <none> 46302 679.71
## + Day 1 618.7 45683 680.21
## - Month 1 1701.2 48003 681.71
## - Solar.R 1 1952.6 48254 682.29
## - Wind 1 11520.5 57822 702.37
## - Temp 1 20419.5 66721 718.26
summary(stepwise_model)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.870 -13.968 -2.671 9.553 97.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.05384 22.97114 -2.527 0.0130 *
## Solar.R 0.04960 0.02346 2.114 0.0368 *
## Wind -3.31651 0.64579 -5.136 1.29e-06 ***
## Temp 1.87087 0.27363 6.837 5.34e-10 ***
## Month -2.99163 1.51592 -1.973 0.0510 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
## F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
#Interpretation Based on AIC
The procedure compares models using the Akaike Information Criterion (AIC). Lower AIC values indicate better models. The final model selected by stepwise regression is considered the most efficient according to the AIC criterion.
#Method 4: Best Subset Selection
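Best Subset Selection evaluates all possible combinations of predictors. With five candidate predictors there are 2^5 - 1 = 31 non-empty subsets. The nine candidate models below are compared here.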
m1 <- lm(Ozone ~ Solar.R, data = air)
m2 <- lm(Ozone ~ Wind, data = air)
m3 <- lm(Ozone ~ Temp, data = air)
m4 <- lm(Ozone ~ Solar.R + Wind, data = air)
m5 <- lm(Ozone ~ Solar.R + Temp, data = air)
m6 <- lm(Ozone ~ Wind + Temp, data = air)
m7 <- lm(Ozone ~ Solar.R + Wind + Temp, data = air)
m8 <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = air)
m9 <- lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = air)
AIC(m1,m2,m3,m4,m5,m6,m7,m8,m9)
## df AIC
## m1 3 1083.7144
## m2 3 1045.8759
## m3 3 1023.7751
## m4 4 1033.8155
## m5 4 1020.8197
## m6 4 1003.4160
## m7 5 998.7171
## m8 6 996.7119
## m9 7 997.2188
#Interpretation Based on AIC
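The model with the smallest AIC value is considered the best model according to the AIC criterion. Here that is m8 (Solar.R + Wind + Temp + Month), with AIC = 996.71.
BIC(m1,m2,m3,m4,m5,m6,m7,m8,m9)
#Interpretation Based on BIC
The model with the lowest BIC value is preferred because it balances model fit and model complexity. BIC penalizes each additional parameter by log(n) rather than 2, so it tends to favor smaller models than AIC.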
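The nine models above are a hand-picked subset. A fully exhaustive search over all 31 non-empty predictor subsets can be sketched in base R with combn() and reformulate(), as below. This is one possible implementation under the assumption that no extra package is used; the names predictors, subsets, fits, and results are illustrative.

```r
data(airquality)
air <- na.omit(airquality)

predictors <- c("Solar.R", "Wind", "Temp", "Month", "Day")

# All non-empty subsets of the predictors, grouped by size 1 to 5
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset
fits <- lapply(subsets, function(vars) {
  lm(reformulate(vars, response = "Ozone"), data = air)
})

# Collect AIC and BIC for every model
results <- data.frame(
  model = sapply(subsets, paste, collapse = " + "),
  AIC   = sapply(fits, AIC),
  BIC   = sapply(fits, BIC)
)

# Best models by each criterion
best_aic <- results$model[which.min(results$AIC)]
best_bic <- results$model[which.min(results$BIC)]
c(best_aic = best_aic, best_bic = best_bic)

head(results[order(results$AIC), ], 3)
```

Because BIC applies a heavier penalty per parameter, the two criteria need not choose the same model, so it is worth comparing best_aic and best_bic before settling on a final model.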
#Comparison of Models
data.frame(
Model = c("Forward","Backward","Stepwise"),
AIC = c(
AIC(forward_model),
AIC(backward_model),
AIC(stepwise_model) ) )
## Model AIC
## 1 Forward 996.7119
## 2 Backward 996.7119
## 3 Stepwise 996.7119
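The AIC reported by step() (679.71) differs from the AIC() value in this table (996.71) for the same model. The two functions use different additive constants: step() relies on extractAIC(), which for a linear model drops terms that depend only on the sample size n. The sketch below checks that the gap equals n*log(2*pi) + n + 2; the names m, step_aic, full_aic, and offset are illustrative.

```r
data(airquality)
air <- na.omit(airquality)

m <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = air)
n <- nrow(air)

# AIC as used inside step(): n * log(RSS / n) + 2 * (number of coefficients)
step_aic <- extractAIC(m)[2]

# AIC from the full Gaussian log-likelihood, counting sigma as a parameter
full_aic <- AIC(m)

# Constant that separates the two definitions for a fixed sample size
offset <- n * log(2 * pi) + n + 2

c(step_aic = step_aic, full_aic = full_aic, difference = full_aic - step_aic,
  offset = offset)
```

Since the offset is the same for every model fitted to the same data, both versions rank the models identically.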
#Final Selected Model
summary(stepwise_model)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.870 -13.968 -2.671 9.553 97.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.05384 22.97114 -2.527 0.0130 *
## Solar.R 0.04960 0.02346 2.114 0.0368 *
## Wind -3.31651 0.64579 -5.136 1.29e-06 ***
## Temp 1.87087 0.27363 6.837 5.34e-10 ***
## Month -2.99163 1.51592 -1.973 0.0510 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
## F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
The final regression model shows how each predictor influences ozone concentration while keeping the other variables constant. Predictors with positive coefficients (Temp and Solar.R) are associated with higher ozone levels, while those with negative coefficients (Wind and Month) are associated with lower ozone levels. Variables with p-values less than 0.05 are considered statistically significant and are important contributors to predicting ozone concentration. In the final model, Temp, Wind, and Solar.R meet this threshold, while Month is marginal (p = 0.051) and is retained because it lowers the AIC. Overall, the final model retains the most relevant variables and provides an efficient explanation of variations in ozone levels.
All four variable selection methods were applied to the airquality dataset. The methods were compared using statistical criteria such as p-values, AIC, BIC, and model fit measures. Variables consistently retained across the methods can be considered the most important predictors of ozone concentration.
In most analyses of the airquality dataset, Temperature, Wind Speed, and Solar Radiation remain the strongest predictors of Ozone concentration. Temperature and Solar Radiation generally show positive effects on ozone levels, while Wind Speed shows a negative effect. The final selected model provides a simpler and more interpretable model while maintaining strong predictive performance.