Intro to Data Science HW 7

Attribution statement: (choose only one and delete the rest)

# 1. I did this homework by myself, with help from the book and the professor.

The chapter on linear models (“Lining Up Our Models”) introduces linear predictive modeling using the tool known as multiple regression. The term “multiple regression” has an odd history, dating back to an early scientific observation of a phenomenon called “regression to the mean.” These days, multiple regression is just an interesting name for using linear modeling to assess the connection between one or more predictor variables and an outcome variable.


In this exercise, you will predict Ozone air levels from three predictors.

  A. We will be using the airquality data set available in R. Copy it into a dataframe called air and use the appropriate functions to summarize the data.
# Copy the 'airquality' dataset
air <- airquality 

# Display a summary of the dataset
summary(air)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
# Print the first 6 rows of the dataset (head() shows 6 rows by default)
head(air)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
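As a complement to summary() and head(), a quick structural check can confirm the size and column types of the data. This is a minimal sketch using only base R; air_check is a hypothetical name chosen so that the air data frame is not overwritten.

```r
# Work on a separate copy so the 'air' data frame above is left untouched
air_check <- airquality

# dim() returns the number of rows and columns (153 rows, 6 columns)
dim(air_check)

# str() lists each column with its type and its first few values
str(air_check)
```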
  B. In the analysis that follows, Ozone will be considered as the outcome variable, and Solar.R, Wind, and Temp as the predictors. Add a comment to briefly explain the outcome and predictor variables in the dataframe using ?airquality.
# Open the help file for the built-in "airquality" dataset.
?airquality

# Outcome variable:
# Ozone (ppb) - Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

# Predictor variables:
# Solar.R (lang) - Solar radiation in Langleys
# Wind (mph) - Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
# Temp (degrees F) - Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport
  C. Inspect the outcome and predictor variables – are there any missing values? Show the code you used to check for that.
# Check for missing values in the outcome variable (Ozone)

# "is.na(air$Ozone)" creates a logical vector where each element is TRUE if the corresponding value in 'air$Ozone' is missing (NA), and FALSE otherwise.

# "sum(is.na(air$Ozone))" returns a total for the resulting number of missing values in the outcome variable 'Ozone'.

sum(is.na(air$Ozone))
## [1] 37
# Check for missing values in the predictor variables (Solar.R, Wind and Temp)

sum(is.na(air$Solar.R))
## [1] 7
sum(is.na(air$Wind))
## [1] 0
sum(is.na(air$Temp))
## [1] 0
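The four separate checks above can also be done in one call. The sketch below uses the original airquality data (before any imputation) so that it runs on its own; vars and na_counts are hypothetical names introduced for illustration.

```r
# Columns used in the analysis: the outcome plus the three predictors
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
na_counts <- colSums(is.na(airquality[, vars]))
print(na_counts)
```

The counts should match the individual results above: 37 for Ozone, 7 for Solar.R, and 0 for Wind and Temp.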
  D. Use the na_interpolation() function from the imputeTS package (remember this was used in a previous HW) to fill in the missing values in each of the 4 columns. Make sure there are no more missing values using the commands from Step C.
# Install and load the 'imputeTS' package (if not already installed)

# The NOT operator (!) reverses the logical value returned by requireNamespace(). If the package is installed, the condition becomes !TRUE, which evaluates to FALSE, so nothing is installed. If the package is not installed, the condition becomes !FALSE, which evaluates to TRUE, and install.packages() runs.

# This conditional statement installs the imputeTS package only if it is not already present on your system, which avoids redundant installations.

# Passing the argument "quietly = TRUE" to requireNamespace() suppresses the messages that would normally be displayed, so a missing package is handled without interrupting the code's execution or cluttering the output.

if (!requireNamespace("imputeTS", quietly = TRUE)) {
  install.packages("imputeTS")
}
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(imputeTS)

# Impute missing values in the outcome and predictor variables using na_interpolation

air$Ozone <- na_interpolation(air$Ozone)
air$Solar.R <- na_interpolation(air$Solar.R)
air$Wind <- na_interpolation(air$Wind)
air$Temp <- na_interpolation(air$Temp)

# Verify that there are no more missing values
sum(is.na(air$Ozone))
## [1] 0
sum(is.na(air$Solar.R))
## [1] 0
sum(is.na(air$Wind))
## [1] 0
sum(is.na(air$Temp))
## [1] 0
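na_interpolation() uses linear interpolation by default (option = "linear"). The base R sketch below illustrates the idea on a made-up vector. It is not the package's own code, and the values in x are hypothetical.

```r
# Toy series with gaps (hypothetical values, not taken from airquality)
x <- c(10, NA, 30, NA, NA, 60)
idx <- seq_along(x)
known <- !is.na(x)

# approx() draws straight lines between the known points;
# rule = 2 reuses the nearest known value for gaps at either end
filled <- approx(idx[known], x[known], xout = idx, rule = 2)$y
print(filled)
```

Each gap is filled with evenly spaced values between its neighbors, so the toy series becomes 10, 20, 30, 40, 50, 60.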
  E. Create 3 bivariate scatterplots (X-Y plots) using ggplot, one for each of the predictors against the outcome. Hint: In each case, put Ozone on the Y-axis, and a predictor on the X-axis. Add a comment to each, describing the plot and explaining whether there appears to be a linear relationship between the outcome variable and the respective predictor.
# I was having issues with RStudio displaying plot objects, so I tried three different ways of displaying a plot to see which would work. Interestingly, all three worked when I knit the file, but the plots did not display in interactive mode until I had knit the file and then tried "Control + Shift + Enter" in the interactive window. After that, the plots displayed.

library(ggplot2)

# Create plot object for Scatterplot 1: Ozone vs. Solar.R
p1 <- ggplot(air, aes(x = Solar.R, y = Ozone)) +
# geom_point() adds the data points to the plot
  geom_point() +
# labs() sets the title and axis labels for clarity
  labs(title = "Ozone vs. Solar Radiation",
       x = "Solar Radiation (lang)", y = "Ozone (ppb)") + 
  theme_minimal() + 
  geom_smooth(method = "lm", se = FALSE, color = "red", formula = y ~ x)
# geom_smooth(method = "lm", color = "red") adds a red linear regression trend line to visualize the overall trend in the data.
# 'se = FALSE' argument removes the confidence interval around the line.

# Display plot object
show(p1)

# Comment: This plot shows a weak positive linear relationship between ozone and solar radiation.
# Higher solar radiation tends to be associated with slightly higher ozone levels.

# Scatterplot 2: Ozone vs. Wind
p2 <- ggplot(air, aes(x = Wind, y = Ozone)) +
  geom_point() +
  labs(title = "Ozone vs. Wind Speed",
       x = "Wind Speed (mph)", y = "Ozone (ppb)") +
  theme_minimal() +
  geom_smooth(method = "lm", se = FALSE, color = "red") 

print(p2)
## `geom_smooth()` using formula = 'y ~ x'

# Comment: This plot suggests a moderate negative linear relationship between ozone and wind speed.
# Higher wind speeds tend to be associated with lower ozone levels.

# Scatterplot 3: Ozone vs. Temp
ggplot(air, aes(x = Temp, y = Ozone)) +
  geom_point() +
  labs(title = "Ozone vs. Temperature",
       x = "Temperature (degrees F)", y = "Ozone (ppb)") +
  theme_minimal() +
  geom_smooth(method = "lm", se = FALSE, color = "red") 
## `geom_smooth()` using formula = 'y ~ x'

# Comment: This plot indicates a strong positive linear relationship between ozone and temperature.
# Higher temperatures tend to be associated with higher ozone levels.
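The words weak, moderate, and strong in the comments above can be backed up with Pearson correlations. The sketch below makes one simplifying assumption: it uses the raw airquality data with pairwise-complete observations rather than the interpolated air data frame, so the values may differ somewhat from what the plots show. The names vars, cor_mat, and ozone_cor are hypothetical.

```r
# Outcome and predictors used in the scatterplots
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Pairwise-complete correlations skip rows where either variable is NA
cor_mat <- cor(airquality[, vars], use = "pairwise.complete.obs")

# Keep only the correlations between Ozone and each predictor
ozone_cor <- cor_mat["Ozone", c("Solar.R", "Wind", "Temp")]
print(round(ozone_cor, 2))
```

A negative value for Wind and positive values for Solar.R and Temp would agree with the direction of the red trend lines.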
  F. Next, create a simple regression model predicting Ozone based on Wind, using the lm( ) command. In a comment, report the coefficient (aka slope or beta weight) of Wind in the regression output and, if it is statistically significant, interpret it with respect to Ozone. Report the adjusted R-squared of the model and try to explain what it means.
# Simple linear regression model: Ozone vs. Wind
model1 <- lm(Ozone ~ Wind, data = air)

# Display the summary of the model
summary(model1)
## 
## Call:
## lm(formula = Ozone ~ Wind, data = air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.332 -18.332  -4.155  14.163  94.594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  89.0205     6.6991  13.288  < 2e-16 ***
## Wind         -4.5925     0.6345  -7.238 2.15e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.56 on 151 degrees of freedom
## Multiple R-squared:  0.2576, Adjusted R-squared:  0.2527 
## F-statistic: 52.39 on 1 and 151 DF,  p-value: 2.148e-11
# The coefficient of Wind is -4.59 (p < 0.001), indicating that for every 1 mph increase in wind speed, the ozone level is expected to decrease, on average, by 4.59 ppb. This is a statistically significant relationship, as the p-value is extremely small.

# The adjusted R-squared of the model is 0.2527, meaning that approximately 25% of the variation in ozone levels can be explained by wind speed alone. This suggests that wind speed is a relevant predictor, but other factors also contribute to ozone levels.
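Adjusted R-squared can be recovered from Multiple R-squared with the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictors. The sketch below plugs in the values printed in the summary above. Here n = 153 is inferred from the 151 residual degrees of freedom plus the 2 estimated coefficients, and adj_r2 is a hypothetical helper name.

```r
# Adjusted R-squared penalizes R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values taken from the summary(model1) output above
model1_adj <- adj_r2(r2 = 0.2576, n = 153, p = 1)
print(round(model1_adj, 4))  # about 0.2527, matching the reported value
```

With a single predictor and 153 observations the penalty is small, which is why the adjusted value sits so close to 0.2576.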
  G. Create a multiple regression model predicting Ozone based on Solar.R, Wind, and Temp.
    Make sure to include all three predictors in one model – NOT three different models each with one predictor.
# Multiple linear regression model: Ozone vs. Solar.R, Wind, Temp
  # This line creates the multiple regression model using the lm() function.
  # 'Ozone' is the outcome variable (the one we're trying to predict).
  # The tilde essentially reads as "is modeled as a function of".
  # 'Solar.R + Wind + Temp' specifies that we're using all three variables as predictors in the model. The + signs indicate that they are included additively.
  # 'data = air' specifies that the data for the model is stored in the air data frame.
model2 <- lm(Ozone ~ Solar.R + Wind + Temp, data = air)

# Display the summary of the model
summary(model2)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.651 -15.622  -4.981  12.422 101.411 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -52.16596   21.90933  -2.381   0.0185 *  
## Solar.R       0.01654    0.02272   0.728   0.4678    
## Wind         -2.69669    0.63085  -4.275 3.40e-05 ***
## Temp          1.53072    0.24115   6.348 2.49e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.26 on 149 degrees of freedom
## Multiple R-squared:  0.4321, Adjusted R-squared:  0.4207 
## F-statistic: 37.79 on 3 and 149 DF,  p-value: < 2.2e-16
  H. Report the adjusted R-Squared in a comment – how does it compare to the adjusted R-squared from Step F? Is this better or worse? Which of the predictors are statistically significant in the model? In a comment, report the coefficient of each predictor that is statistically significant. Do not report the coefficients for predictors that are not significant.
# The adjusted R-squared for the multiple regression model is 0.4207, which is higher than the adjusted R-squared of 0.25 from the simple regression model with only Wind as a predictor. This indicates that the multiple regression model explains more of the variation in ozone levels (42% vs. 25%).

# The predictors Wind (coefficient = -2.70, p < 0.001) and Temp (coefficient = 1.53, p < 0.001) are statistically significant in the model. Solar.R is not a significant predictor.
# The low p-values for Wind and Temp (3.40e-05 and 2.49e-09, respectively) strongly suggest that these predictors have a significant effect on Ozone levels, even after accounting for the effects of other predictors in the model. (The value < 2.2e-16 in the output is the p-value of the overall F-test, not of Wind.)
# The high p-value for Solar.R (0.4678) suggests that, in this particular model and with this dataset, there's not enough evidence to conclude that Solar.R has a significant effect on Ozone levels once the effects of Wind and Temp are accounted for.

# Interpretation of coefficients:
# - For every 1 mph increase in wind speed, the ozone level is expected to decrease by 2.70 ppb, on average, holding solar radiation and temperature constant.
# - For every 1 degree Fahrenheit increase in temperature, the ozone level is expected to increase by 1.53 ppb, on average, holding solar radiation and wind speed constant.

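# In this case, both Wind and Temp seem to have stronger relationships with Ozone than Solar.R, as evidenced by their larger absolute t values (4.28 and 6.35 vs. 0.73) and lower p-values. The raw coefficient magnitudes are not directly comparable, because the predictors are measured in different units.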
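The significant predictors can also be picked out programmatically from the coefficient table. The sketch below is self-contained and rests on an assumption: fill_linear is a hypothetical base R stand-in for na_interpolation(), so the estimates should be close to the output above but are not guaranteed to be identical. The names air_sketch, fit, and significant are likewise made up for illustration.

```r
# Base R stand-in for the interpolation step (an assumption, not imputeTS itself)
fill_linear <- function(v) {
  idx <- seq_along(v)
  known <- !is.na(v)
  approx(idx[known], v[known], xout = idx, rule = 2)$y
}

# Fill the two columns that contain missing values
air_sketch <- airquality
air_sketch$Ozone <- fill_linear(air_sketch$Ozone)
air_sketch$Solar.R <- fill_linear(air_sketch$Solar.R)

fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_sketch)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))

# Drop the intercept row, then keep predictors with p < 0.05
predictors <- coef_table[-1, , drop = FALSE]
significant <- predictors[predictors[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
print(round(significant[, c("Estimate", "Pr(>|t|)"), drop = FALSE], 4))
```

Filtering on the p-value column keeps the report limited to the predictors the question asks about, without copying numbers by hand.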
  I. Create a one-row data frame like this:
predDF <- data.frame(Solar.R=290, Wind=13, Temp=61)

and use it with the predict( ) function to predict the expected value of Ozone:

# Create the data frame with new predictor values
  # A data frame called predDF is created with one row and three columns (Solar.R, Wind, and Temp). 
  # The values for these predictors (290, 13, and 61, respectively) are specified in the data frame.

predDF <- data.frame(Solar.R = 290, Wind = 13, Temp = 61)

# Use the predict function with the multiple regression model
  # The predict() function is used to predict the ozone level (Ozone) based on the values in predDF.
  # The first argument is the fitted multiple regression model (model2).
  # 'newdata = predDF' tells the function to use the values from predDF for the prediction.

predicted_ozone <- predict(model2, newdata = predDF)

# Display the predicted ozone value

print(paste("The predicted ozone level is:", round(predicted_ozone, 2), "ppb"))
## [1] "The predicted ozone level is: 10.95 ppb"
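The prediction can be checked by hand, since predict() simply evaluates the fitted equation: intercept plus each coefficient times its predictor value. The sketch below copies the rounded coefficients from the summary(model2) output above; b, x, and manual_pred are hypothetical names.

```r
# Coefficients as printed in the summary(model2) output above
b <- c(Intercept = -52.16596, Solar.R = 0.01654, Wind = -2.69669, Temp = 1.53072)

# Predictor values from predDF, with a leading 1 for the intercept
x <- c(1, 290, 13, 61)

# Linear prediction: sum of coefficient * value
manual_pred <- sum(b * x)
print(round(manual_pred, 2))  # 10.95, matching predict()
```

Note that Temp = 61 and Wind = 13 describe a cool, windy day, which is consistent with the low predicted ozone level.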
  J. Create an additional multiple regression model, with Temp as the outcome variable, and the other 3 variables as the predictors.

Review the quality of the model by commenting on its adjusted R-Squared.

# Multiple linear regression model: Temp vs. Ozone, Solar.R, Wind
model3 <- lm(Temp ~ Ozone + Solar.R + Wind, data = air)

# Display the summary of the model
summary(model3)
## 
## Call:
## lm(formula = Temp ~ Ozone + Solar.R + Wind, data = air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.831  -4.802   1.174   4.880  18.004 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 74.693222   2.796787  26.707  < 2e-16 ***
## Ozone        0.139055   0.021907   6.348 2.49e-09 ***
## Solar.R      0.015751   0.006737   2.338  0.02072 *  
## Wind        -0.580176   0.195774  -2.963  0.00354 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.313 on 149 degrees of freedom
## Multiple R-squared:  0.4148, Adjusted R-squared:  0.403 
## F-statistic: 35.21 on 3 and 149 DF,  p-value: < 2.2e-16
# The adjusted R-squared for the model predicting temperature (Temp) from ozone, solar radiation, and wind speed is 0.403.

# This means that approximately 40.3% of the variation in daily maximum temperature can be explained by the combined effects of ozone concentration, solar radiation, and wind speed.

# Model Fit: The adjusted R-squared of 0.403 indicates a moderate fit of the model to the data. The model explains a reasonable portion of the variability in temperature, but there's still a substantial amount of unexplained variance.

# Comparison: The adjusted R-squared for this model (0.403) is slightly lower than the adjusted R-squared for the model predicting Ozone (0.4207). The two values are not strictly comparable, because the models have different outcome variables and different predictor sets, but in both cases the three predictors leave roughly 58% to 60% of the variance unexplained.

# Improvements: Exploring additional predictors or interactions could potentially improve the model fit.