Importing the dataset

data("airquality")

Perform exploratory data analysis on the airquality data set. This should include calculating summary statistics, creating visualizations, and identifying outliers.

# Load the necessary libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

View the structure of the dataset

str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

View the summary statistics of the dataset

summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

Create a boxplot of the Ozone variable

ggplot(airquality, aes(y = Ozone)) +
  geom_boxplot()

Create a scatterplot of the Ozone and Wind variables

ggplot(airquality, aes(x = Wind, y = Ozone)) +
  geom_point()
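
As a numeric companion to the scatterplot, a correlation matrix can summarize the pairwise linear associations between ozone and the weather variables. This is a minimal sketch: the names `vars` and `cor_mat` are introduced here for illustration, and `use = "complete.obs"` drops every row with a missing value, so the result describes only the complete rows.

```r
# Pairwise Pearson correlations among ozone and the weather variables.
# `vars` and `cor_mat` are illustrative names, not part of the original analysis.
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
cor_mat <- cor(airquality[, vars], use = "complete.obs")
round(cor_mat, 2)
```

A negative entry for Ozone and Wind would agree with the downward trend visible in the scatterplot.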

Identify outliers using Tukey’s method

outliers <- boxplot.stats(airquality$Ozone)$out
outliers
## [1] 135 168
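
To make the rule behind `boxplot.stats()` explicit, the fences can be computed by hand. The sketch below assumes the default coefficient of 1.5 and uses the hinges from `fivenum()`, which is what `boxplot.stats()` relies on; the variable names are illustrative.

```r
# Reproduce Tukey's fences by hand for the Ozone column.
oz <- airquality$Ozone[!is.na(airquality$Ozone)]  # drop missing values first
hinges <- fivenum(oz)[c(2, 4)]                    # lower and upper hinges
spread <- diff(hinges)                            # hinge-based interquartile range
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread
manual_outliers <- oz[oz < lower_fence | oz > upper_fence]
manual_outliers
```

Any value outside the two fences is flagged, so `manual_outliers` should contain the same observations as `outliers` above.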

Identify missing values

sum(is.na(airquality))
## [1] 44
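
The total of 44 does not show where the gaps are. A per-column count, plus a count of fully observed rows, makes the pattern clearer; the names `na_by_column` and `n_complete` are illustrative.

```r
# Missing values per column, and the number of rows with no missing values.
na_by_column <- colSums(is.na(airquality))
na_by_column
n_complete <- sum(complete.cases(airquality))
n_complete
```

The per-column counts should match the `NA's` row in the `summary()` output above (37 for Ozone, 7 for Solar.R). The number of complete rows is what `lm()` can actually use once it drops incomplete observations.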

Fit a simple linear regression model

model <- lm(Ozone ~ Solar.R, data = airquality)

# View the summary of the model
summary(model)
## 
## Call:
## lm(formula = Ozone ~ Solar.R, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.292 -21.361  -8.864  16.373 119.136 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.59873    6.74790   2.756 0.006856 ** 
## Solar.R      0.12717    0.03278   3.880 0.000179 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.33 on 109 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.1213, Adjusted R-squared:  0.1133 
## F-statistic: 15.05 on 1 and 109 DF,  p-value: 0.0001793

Fit a multiple linear regression model

model <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# View the summary of the model
summary(model)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

R-squared of the simple linear regression model

model1 <- lm(Ozone ~ Solar.R, data = airquality)
summary(model1)$r.squared
## [1] 0.1213419

R-squared of the multiple linear regression model

model2 <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
summary(model2)$r.squared
## [1] 0.6058946
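
Plain R-squared never decreases when predictors are added, so a fairer comparison might use adjusted R-squared and an F-test on the nested models. The following is a sketch under the assumption that both models are fit to the same complete rows; `aq`, `m1`, `m2`, and `adj` are names introduced here.

```r
# Fit both models on the same complete rows so they are directly comparable.
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
m1 <- lm(Ozone ~ Solar.R, data = aq)
m2 <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)

# Adjusted R-squared penalizes the extra predictors.
adj <- c(simple = summary(m1)$adj.r.squared,
         multiple = summary(m2)$adj.r.squared)
round(adj, 4)

# F-test: does adding Wind and Temp significantly reduce the residual error?
anova(m1, m2)
```

A small p-value in the `anova()` table would indicate that Wind and Temp add explanatory power beyond Solar.R alone.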

The regression analysis aimed to predict the ozone level (the response variable) from solar radiation, wind speed, and temperature (the predictor variables) in the airquality dataset. We fit two linear regression models: a simple model using only Solar.R, and a multiple model using all three predictors.

The multiple linear regression model showed that all three predictor variables are statistically significant predictors of ozone levels at the 0.05 level, with positive coefficients for Solar.R and Temp and a negative coefficient for Wind. The adjusted R-squared of the model was 0.5948, so it explains roughly 59% of the variance in ozone levels after adjusting for the number of predictors, compared with an adjusted R-squared of 0.1133 for the simple model.

This analysis has implications for air quality management in New York. Solar radiation, wind speed, and temperature are all environmental factors that can affect air quality. Sunlight drives the photochemical reactions between nitrogen oxides and volatile organic compounds that form ground-level ozone, and these reactions tend to be more active on hot days. Low wind speeds limit dispersion, so pollutants stay concentrated close to the ground. These mechanisms are consistent with the positive Solar.R and Temp coefficients and the negative Wind coefficient.

By using the multiple linear regression model, we can better understand the relationship between these environmental factors and ozone levels. This information can be used to inform air quality management policies and strategies in New York, such as implementing measures to reduce emissions from sources of pollution on days with high solar radiation or high temperatures and low wind speeds. The model can also be used to predict ozone levels based on environmental conditions, allowing air quality officials to take proactive measures to protect public health.
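
As an illustration of the prediction step, `predict()` can be applied to the fitted model with a data frame of new conditions. The values below (Solar.R = 200, Wind = 10, Temp = 80) are made up for this sketch and are not a real forecast, and the names `fit`, `new_day`, and `pred` are introduced here.

```r
# Refit the multiple regression and predict ozone for one hypothetical day.
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# Made-up inputs for illustration only; they sit inside the observed ranges.
new_day <- data.frame(Solar.R = 200, Wind = 10, Temp = 80)

# Point prediction with a 95% prediction interval (columns fit, lwr, upr).
pred <- predict(fit, newdata = new_day, interval = "prediction", level = 0.95)
pred
```

The interval is wide because the residual standard error is about 21, so a single-day prediction from this model is better read as a rough range than as a precise value.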

A brief report summarizing the findings and interpretations of the results.

Introduction:

The airquality dataset provides daily air quality measurements in New York from May to September 1973. In this report, we analyzed the relationship between ozone levels and three environmental factors: solar radiation, wind speed, and temperature. We fit two linear regression models to predict ozone levels from these factors and evaluated their performance.

Methods:

We first performed exploratory data analysis to gain insights into the data, including summary statistics and visualizations. We then fit a simple linear regression model using only the Solar.R variable and a multiple linear regression model using all three predictor variables. We compared the R-squared values of the two models to evaluate their performance and interpreted the results to discuss their implications for air quality management in New York.

Results:

The simple model using only Solar.R had an R-squared of 0.1213, while the multiple model had an R-squared of 0.6059. The multiple linear regression model showed that all three predictor variables are statistically significant predictors of ozone levels at the 0.05 level, with positive coefficients for Solar.R and Temp and a negative coefficient for Wind. Its adjusted R-squared was 0.5948, so the model explains roughly 59% of the variance in ozone levels after adjusting for the number of predictors. The multiple model therefore gives a better picture of the relationship between environmental factors and ozone levels, and this information can be used to inform air quality management policies and strategies in New York.

Conclusion:

Our analysis of the airquality dataset highlights the relationship between environmental factors and ozone levels in New York. By using a multiple linear regression model, we can better understand the complex relationship between these factors and ozone levels and use this information to inform air quality management policies and strategies. The model can also be used to predict ozone levels based on environmental conditions, allowing air quality officials to take proactive measures to protect public health.