Question 1: Find a data set of which you can fit multiple linear regression and interpret your results

# Load the built-in airquality dataset
data("airquality")

# Display the structure of the dataset
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
# Display summary statistics
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NAs    :37       NAs    :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
# Remove rows containing missing values (NA)
aq <- na.omit(airquality)

# Show number of rows and columns after removing missing values
dim(aq)
## [1] 111   6
# Display first six observations
head(aq)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8
# Create a scatterplot matrix to visualize
# relationships among variables
pairs(
  aq[, c("Ozone", "Solar.R", "Wind", "Temp")],
  main = "Scatterplot Matrix"
)

# Create a multiple linear regression model
# Ozone is the dependent variable
# Solar.R, Wind, and Temp are independent variables
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)

# Display regression results including
# coefficients, p-values, R-squared, etc.
summary(model)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16
# Display estimated coefficients
coef(model)
##  (Intercept)      Solar.R         Wind         Temp 
## -64.34207893   0.05982059  -3.33359131   1.65209291
# Plot Ozone against Temperature
plot(
  aq$Temp,
  aq$Ozone,
  xlab = "Temperature",
  ylab = "Ozone",
  main = "Ozone vs Temperature"
)

# Add simple regression line
abline(
  lm(Ozone ~ Temp, data = aq),
  col = "red",
  lwd = 2
)

# Plot Ozone against Wind
plot(
  aq$Wind,
  aq$Ozone,
  xlab = "Wind",
  ylab = "Ozone",
  main = "Ozone vs Wind"
)

# Add simple regression line
abline(
  lm(Ozone ~ Wind, data = aq),
  col = "blue",
  lwd = 2
)

# Plot Ozone against Solar Radiation
plot(
  aq$Solar.R,
  aq$Ozone,
  xlab = "Solar Radiation",
  ylab = "Ozone",
  main = "Ozone vs Solar Radiation"
)

# Add simple regression line
abline(
  lm(Ozone ~ Solar.R, data = aq),
  col = "green",
  lwd = 2
)

# Calculate predicted Ozone values from the model
predicted <- predict(model)

# Compare actual and predicted values
plot(
  predicted,
  aq$Ozone,
  xlab = "Predicted Ozone",
  ylab = "Observed Ozone",
  main = "Observed vs Predicted Ozone"
)

# Add reference line (perfect prediction)
abline(
  a = 0,
  b = 1,
  col = "red",
  lwd = 2
)

# Arrange plots in 2x2 layout
par(mfrow = c(2, 2))

# Produce regression diagnostic plots
# Residuals vs Fitted
# Normal Q-Q
# Scale-Location
# Residuals vs Leverage
plot(model)

# Reset plotting area
par(mfrow = c(1, 1))

# Calculate 95% confidence intervals
# for regression coefficients
confint(model)
##                     2.5 %      97.5 %
## (Intercept) -110.04538108 -18.6387768
## Solar.R        0.01385613   0.1057851
## Wind          -4.63087706  -2.0363055
## Temp           1.14949967   2.1546862
# Generate Analysis of Variance (ANOVA) table
anova(model)
## Analysis of Variance Table
## 
## Response: Ozone
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Solar.R     1  14780   14780  32.944 8.946e-08 ***
## Wind        1  39969   39969  89.094 9.509e-16 ***
## Temp        1  19050   19050  42.463 2.424e-09 ***
## Residuals 107  48003     449                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ANOVA tests whether the predictors
# collectively explain a significant amount
# of variation in Ozone.

# It provides
# Df= Degrees of Freedom
# Sum Sq= Variation explained
# Mean Sq= Average variation
# F value= Test statistic
# Pr(>F)= p-value

# If p-value < 0.05:
# The predictor significantly contributes
# to explaining Ozone levels.

# Percentage of variability explained by the model
summary(model)$r.squared
## [1] 0.6058946
# R-squared adjusted for number of predictors
summary(model)$adj.r.squared
## [1] 0.5948449

The multiple linear regression model shows that Temperature and Solar Radiation increase Ozone levels, while Wind decreases Ozone levels. The model explains approximately 60% of the variation in Ozone concentration, indicating a reasonably good fit. The ANOVA table confirms that the predictors contribute significantly to explaining changes in Ozone levels.

Question 2: Read about variable selection methods

# Load dataset
data("airquality")

# Remove missing values
aq <- na.omit(airquality)

# Fit Multiple Linear Regression Model
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)

# Display Model Summary
summary(model)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = aq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16
# Display Coefficients
coef(model)
##  (Intercept)      Solar.R         Wind         Temp 
## -64.34207893   0.05982059  -3.33359131   1.65209291
# ANOVA Table
anova(model)
## Analysis of Variance Table
## 
## Response: Ozone
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Solar.R     1  14780   14780  32.944 8.946e-08 ***
## Wind        1  39969   39969  89.094 9.509e-16 ***
## Temp        1  19050   19050  42.463 2.424e-09 ***
## Residuals 107  48003     449                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# R-squared
summary(model)$r.squared
## [1] 0.6058946
# Adjusted R-squared
summary(model)$adj.r.squared
## [1] 0.5948449
# Create the full model with all predictors
full_model <- lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = aq)

# Create a null model with no predictors
null_model <- lm(Ozone ~ 1, data = aq)


# Forward Selection
# Starts with no predictors and adds the
# most significant variables one at a time

forward_model <- step(
  null_model,
  scope = list(
    lower = null_model,
    upper = full_model
  ),
  direction = "forward"
)
## Start:  AIC=779.07
## Ozone ~ 1
## 
##           Df Sum of Sq    RSS    AIC
## + Temp     1     59434  62367 706.77
## + Wind     1     45694  76108 728.87
## + Solar.R  1     14780 107022 766.71
## + Month    1      2487 119315 778.78
## <none>                 121802 779.07
## + Day      1         3 121799 781.07
## 
## Step:  AIC=706.77
## Ozone ~ Temp
## 
##           Df Sum of Sq   RSS    AIC
## + Wind     1   11378.5 50989 686.41
## + Month    1    2824.7 59543 703.63
## + Solar.R  1    2723.1 59644 703.82
## <none>                 62367 706.77
## + Day      1     476.5 61891 707.92
## 
## Step:  AIC=686.41
## Ozone ~ Temp + Wind
## 
##           Df Sum of Sq   RSS    AIC
## + Solar.R  1   2986.17 48003 681.71
## + Month    1   2734.79 48254 682.29
## <none>                 50989 686.41
## + Day      1    486.59 50502 687.35
## 
## Step:  AIC=681.71
## Ozone ~ Temp + Wind + Solar.R
## 
##         Df Sum of Sq   RSS    AIC
## + Month  1   1701.18 46302 679.71
## <none>               48003 681.71
## + Day    1    564.53 47438 682.40
## 
## Step:  AIC=679.71
## Ozone ~ Temp + Wind + Solar.R + Month
## 
##        Df Sum of Sq   RSS    AIC
## <none>              46302 679.71
## + Day   1    618.68 45683 680.21
# Display the final selected model
summary(forward_model)
## 
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R + Month, data = aq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.870 -13.968  -2.671   9.553  97.918 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -58.05384   22.97114  -2.527   0.0130 *  
## Temp          1.87087    0.27363   6.837 5.34e-10 ***
## Wind         -3.31651    0.64579  -5.136 1.29e-06 ***
## Solar.R       0.04960    0.02346   2.114   0.0368 *  
## Month        -2.99163    1.51592  -1.973   0.0510 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared:  0.6199, Adjusted R-squared:  0.6055 
## F-statistic: 43.21 on 4 and 106 DF,  p-value: < 2.2e-16
# Backward Elimination
# Starts with all predictors and removes
# the least important variables one at a time

backward_model <- step(
  full_model,
  direction = "backward"
)
## Start:  AIC=680.21
## Ozone ~ Solar.R + Wind + Temp + Month + Day
## 
##           Df Sum of Sq   RSS    AIC
## - Day      1     618.7 46302 679.71
## <none>                 45683 680.21
## - Month    1    1755.3 47438 682.40
## - Solar.R  1    2005.1 47688 682.98
## - Wind     1   11533.9 57217 703.20
## - Temp     1   20845.0 66528 719.94
## 
## Step:  AIC=679.71
## Ozone ~ Solar.R + Wind + Temp + Month
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 46302 679.71
## - Month    1    1701.2 48003 681.71
## - Solar.R  1    1952.6 48254 682.29
## - Wind     1   11520.5 57822 702.37
## - Temp     1   20419.5 66721 718.26
# Display the final selected model
summary(backward_model)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = aq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.870 -13.968  -2.671   9.553  97.918 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -58.05384   22.97114  -2.527   0.0130 *  
## Solar.R       0.04960    0.02346   2.114   0.0368 *  
## Wind         -3.31651    0.64579  -5.136 1.29e-06 ***
## Temp          1.87087    0.27363   6.837 5.34e-10 ***
## Month        -2.99163    1.51592  -1.973   0.0510 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared:  0.6199, Adjusted R-squared:  0.6055 
## F-statistic: 43.21 on 4 and 106 DF,  p-value: < 2.2e-16
# Stepwise Selection
# Combines forward selection and backward
# elimination to find the best model

stepwise_model <- step(
  full_model,
  direction = "both"
)
## Start:  AIC=680.21
## Ozone ~ Solar.R + Wind + Temp + Month + Day
## 
##           Df Sum of Sq   RSS    AIC
## - Day      1     618.7 46302 679.71
## <none>                 45683 680.21
## - Month    1    1755.3 47438 682.40
## - Solar.R  1    2005.1 47688 682.98
## - Wind     1   11533.9 57217 703.20
## - Temp     1   20845.0 66528 719.94
## 
## Step:  AIC=679.71
## Ozone ~ Solar.R + Wind + Temp + Month
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 46302 679.71
## + Day      1     618.7 45683 680.21
## - Month    1    1701.2 48003 681.71
## - Solar.R  1    1952.6 48254 682.29
## - Wind     1   11520.5 57822 702.37
## - Temp     1   20419.5 66721 718.26
# Display the final selected model
summary(stepwise_model)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = aq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.870 -13.968  -2.671   9.553  97.918 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -58.05384   22.97114  -2.527   0.0130 *  
## Solar.R       0.04960    0.02346   2.114   0.0368 *  
## Wind         -3.31651    0.64579  -5.136 1.29e-06 ***
## Temp          1.87087    0.27363   6.837 5.34e-10 ***
## Month        -2.99163    1.51592  -1.973   0.0510 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 106 degrees of freedom
## Multiple R-squared:  0.6199, Adjusted R-squared:  0.6055 
## F-statistic: 43.21 on 4 and 106 DF,  p-value: < 2.2e-16
# Compare model performance using AIC
# Lower AIC indicates a better model

AIC(full_model)
## [1] 997.2188
AIC(forward_model)
## [1] 996.7119
AIC(backward_model)
## [1] 996.7119
AIC(stepwise_model)
## [1] 996.7119

Forward Selection starts with no variables and adds predictors that improve the model. Backward Elimination starts with all predictors and removes those that do not contribute significantly. Stepwise Selection adds and removes predictors to find the best combination of variables. AIC(Akaike Information Criterion) used to compare models; the model with the lowest AIC is generally preferred.