Final Project

Introduction

Using the New York air quality dataset (airquality), we will perform:

Data preparation: show the information of the dataset airquality.
Data exploration (EDA): perform data visualization present conclusions.
Data analysis: perform hypothesis testing with null and alternative hypothesis and present a conclusion.
Data analysis: build a linear regression model with subset, assess the model, and perform prediction.

Data preparation

#import all necessary packages
library(ggplot2)
library(tidyverse)
library(MASS)
library(MLmetrics)

## Warning: package 'MLmetrics' was built under R version 4.1.2

# In addition, I'd like to ask `R` to print decimal numbers with 2 digits:
options(scipen=2)

Check the number of observations and features using dim

#load data
data(airquality)

# Check the number of observations and features
dim_d <- dim(airquality)
dim_d

## [1] 153   6

The Air Quality dataset has 153 rows and 6 columns.

Check for missing values

# Check for missing values
missing <- sum(is.na(airquality))
missing

## [1] 44

The Air Quality dataset has 44 missing values.

Correct the missing values

 library(mice)

## Warning: package 'mice' was built under R version 4.1.2

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

# Show the missing values
md.pattern(airquality)

##     Wind Temp Month Day Solar.R Ozone   
## 111    1    1     1   1       1     1  0
## 35     1    1     1   1       1     0  1
## 5      1    1     1   1       0     1  1
## 2      1    1     1   1       0     0  2
##        0    0     0   0       7    37 44

# Input the missing values, m = 5 = # imputed datasets, maxit = num iterations, pmm = predictive mean matching
tempAQ <- mice(airquality,m=5,maxit=50,meth='pmm',seed=500, printFlag = FALSE)
summary(tempAQ)

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##   Ozone Solar.R    Wind    Temp   Month     Day 
##   "pmm"   "pmm"      ""      ""      ""      "" 
## PredictorMatrix:
##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   1
## Solar.R     1       0    1    1     1   1
## Wind        1       1    0    1     1   1
## Temp        1       1    1    0     1   1
## Month       1       1    1    1     0   1
## Day         1       1    1    1     1   0

airquality2 <- complete(tempAQ,1)
md.pattern(airquality2)

##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     Ozone Solar.R Wind Temp Month Day  
## 153     1       1    1    1     1   1 0
##         0       0    0    0     0   0 0

We can see that all missing values have been replaced with predictive imputed values by using the mice package.

Check for duplicated values

# Check for duplicated values
dupl <- sum(duplicated(airquality2))
dupl

## [1] 0

The Air Quality dataset has 0 duplicated values.

Check the structure of the data

# Check the structure of the data
str(airquality2)

## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 14 28 23 19 8 11 ...
##  $ Solar.R: int  190 118 149 313 237 82 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

The Air Quality data set is stored as a data frame with 153 rows and 6 columns, since there are 153 observations of 6 variables. The Ozone, Solar.R, Temp, Month, Day columns contains integer data int, and the Wind column contains numeric data num.

The Air Quality data set consists of values from May 1, 1973 to September 30, 1973:

Ozone: Mean ozone in part per billion from 1300 to 1500 hours at Roosevelt Island.

Solar.R: Solar radiation in Langleys between 4000 and 7700 Angstroms from 0800 to 1200 hours at Central Park.

Wind: Average wind speed in miles per hour at 0700 and 1000 hours at La Guardia Airport.

Temp: Maximimum daily temperature in degrees Fahrenheit at La Guardia Airport.

Exploratory data analysis

Observe the numerical summary

# Observe the numberical summary
summary(airquality2)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 20.00   1st Qu.:112.0   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 36.00   Median :203.0   Median : 9.700   Median :79.00  
##  Mean   : 42.33   Mean   :184.5   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 59.00   3rd Qu.:259.0   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0

The smallest Ozone, Solar.R, Wind, and Temp values are 1, 7, 1.7, and 56, respectively. The largest Ozone, Solar.R, Wind, and Temp values are 168, 334, 20.7, and 97, respectively. Fifty percent of all the Ozone, Solar.R, Wind, and Temp values are less than 36, 203, 9.7, and 79, respectively.

Standardize and indicate if there are any outliers

oz_std <- scale(airquality2$Ozone)
oz_outliers <- airquality2$Ozone[oz_std < -3 | oz_std > 3]
oz_outliers

## [1] 135 168

sr_std <- scale(airquality2$Solar.R)
sr_outliers <- airquality2$Solar.R[sr_std < -3 | sr_std > 3]
sr_outliers

## integer(0)

wn_std <- scale(airquality2$Wind)
wn_outliers <- airquality2$Wind[wn_std < -3 | wn_std > 3]
wn_outliers

## [1] 20.7

tm_std <- scale(airquality2$Temp)
tm_outliers <- airquality2$Temp[tm_std < -3 | tm_std > 3]
tm_outliers

## integer(0)

There are 2 Ozone, 0 Solar.R, 1 Wind, and 0 Temp outliers (greater than 3 or less than -3).

Show boxplots of outliers.

# Show boxplots of outliers
par(mfrow=c(1,4))
boxplot(airquality2$Ozone, main = "Ozone Plot Outliers")
boxplot(airquality2$Solar.R, main = "Solar.R Plot Outliers")
boxplot(airquality2$Wind, main = "Wind Plot Outliers")
boxplot(airquality2$Temp, main = "Temp Plot Outliers")

Observations: The Ozone and Wind variables have outliers while Solar.R and Temp have none. Solar.R and Temp have more variability in the data (especially Solar.R).

Show histogram of Temp

t <- ggplot(data = airquality2) + 
  geom_bar(mapping = aes(x = Temp), fill = "blue") +
  ggtitle("New York Tempurature Readings")
t + theme(
  plot.title = element_text(family = "Helvetica", face = "bold", size = (15)))

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

Temps between 75 and 85 were most common.

Show histogram of Ozone

o <- ggplot(data = airquality2) + 
  geom_freqpoly(mapping = aes(x = Ozone), color = "red") +
  ggtitle("New York Ozone Readings")
o + theme(
  plot.title = element_text(family = "Helvetica", face = "bold", size = (15)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

Ozone readings were mostly between 0 and 50.

Show scatterplot of Ozone and Temp

plot(airquality2$Temp, airquality$Ozone, 
     col = "purple",
     xlab = "Temp (F)", 
     ylab = 'Ozone (PPB)',
     main = 'New York Ozone & Temperature Data')

The scatterplot shows that as temperature rises, so does the Ozone count, possibly indicating a correlation.

Hypothesis Test

Construct a hypothesis testing with null and alternative hypotheses. Use the appropriate test to get the conclusion.

We will do an alternative hypothesis to see if the Ozone count mean is similar or dissimilar between May and September.

\(H_{0}:\) Ozone count (PPB) means between months is equal.
\(H_{1}:\) Ozone count (PPB) means between months is not equal.

boxplot(airquality2$Ozone~airquality2$Month, main = "Ozone Levels Across May - September", xlab = "Month", ylab = "Ozone Count (PPB)")

We can visually see that Ozone means has similarities and differences.

# Utilize a difference-in-means test to check the null hypothesis
May <- subset(airquality2,airquality2$Month=="5")
June <- subset(airquality2,airquality2$Month=="6")
July <- subset(airquality2,airquality2$Month=="7")
August <- subset(airquality2,airquality2$Month=="8")
September <- subset(airquality2,airquality2$Month=="9")
t.test(June$Ozone,May$Ozone, var.equal = FALSE, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  June$Ozone and May$Ozone
## t = 3.887, df = 58.993, p-value = 0.0001299
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  11.60082      Inf
## sample estimates:
## mean of x mean of y 
##  42.83333  22.48387

t.test(July$Ozone,June$Ozone, var.equal = FALSE, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  July$Ozone and June$Ozone
## t = 2.1157, df = 52.795, p-value = 0.01955
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.848508      Inf
## sample estimates:
## mean of x mean of y 
##  56.48387  42.83333

t.test(July$Ozone,August$Ozone, var.equal = FALSE, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  July$Ozone and August$Ozone
## t = -0.15142, df = 57.218, p-value = 0.5599
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -15.53723       Inf
## sample estimates:
## mean of x mean of y 
##  56.48387  57.77419

t.test(August$Ozone,September$Ozone, var.equal = FALSE, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  August$Ozone and September$Ozone
## t = 3.2722, df = 51.348, p-value = 0.0009557
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  12.69408      Inf
## sample estimates:
## mean of x mean of y 
##  57.77419  31.76667

t.test(September$Ozone,May$Ozone, var.equal = FALSE, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  September$Ozone and May$Ozone
## t = 1.6174, df = 57.491, p-value = 0.05563
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.3122599        Inf
## sample estimates:
## mean of x mean of y 
##  31.76667  22.48387

We can see the p-value for the July - August mean comparison is 0.56, and for the May - September mean comparison is 0.056, and are both higher than \(\alpha = 0.05\). Therefore, we must consider not rejecting \(H_{0}:\) Ozone counts (PPB) means between July - August and May - September are equal (or similar).

However, we can also see the the p-value for the May - June mean comparison is 0.0001, the p-value for the June - July mean comparison is 0.019, and the p-value for the August - September mean comparision is 0.0009, and are lower than \(\alpha = 0.05\). Therefore, we must consider rejecting \(H_{0}\) in favor of \(H_{1}:\) Ozone counts (PPB) means between May - June, June - July, and August - September are not equal (or dissimilar).

Linear Regression

Build a linear regression model with subset selection. Please indicate all significant attributes, assess your model, and predict in a test dataset.
Split data set into 80:20 train and test data

# split data set into 80:20
#data("airquality2")
i<- sample(2, nrow(airquality2), replace = TRUE, prob = c(0.8,0.2))
AirTraining<- airquality2[i==1,]
AirTest<- airquality2[i==2,]
pairs(AirTraining[,1:4], lower.panel = NULL)

pairs(AirTest[,1:4], lower.panel = NULL)

Observations: All variables have correlation with, but Ozone and Temperature seem to have higher effect on airquality based on the graph.

Check correlation between variables

cor(AirTraining)

##               Ozone     Solar.R        Wind       Temp       Month         Day
## Ozone    1.00000000  0.33531060 -0.61361202  0.6761906  0.17270264 -0.07456763
## Solar.R  0.33531060  1.00000000 -0.11280067  0.2738459 -0.03594261 -0.14670447
## Wind    -0.61361202 -0.11280067  1.00000000 -0.4701731 -0.24281659  0.03498455
## Temp     0.67619062  0.27384591 -0.47017309  1.0000000  0.42612959 -0.16608135
## Month    0.17270264 -0.03594261 -0.24281659  0.4261296  1.00000000  0.02280542
## Day     -0.07456763 -0.14670447  0.03498455 -0.1660814  0.02280542  1.00000000

Testing the two datasets, AirTraining and AirTest we can comfirm our correlation.

Multiple Linear Regression Model

Please use all the other features/attributes to construct a linear regression model.

fitlm <- lm(Ozone~., data = AirTraining)
summary(fitlm)

## 
## Call:
## lm(formula = Ozone ~ ., data = AirTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.078 -12.146  -2.358   7.269  97.156 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -44.51388   21.46437  -2.074   0.0404 *  
## Solar.R       0.05155    0.02130   2.421   0.0171 *  
## Wind         -3.49638    0.60557  -5.774 7.16e-08 ***
## Temp          1.65376    0.24259   6.817 5.07e-10 ***
## Month        -2.93376    1.41046  -2.080   0.0398 *  
## Day           0.17972    0.21437   0.838   0.4036    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.72 on 111 degrees of freedom
## Multiple R-squared:  0.6125, Adjusted R-squared:  0.595 
## F-statistic: 35.09 on 5 and 111 DF,  p-value: < 2.2e-16

Interpret the coefficients of all the attributes. Which attributes are insignificant?

In looking at the coeffients of the attributes, we see that there are strong correlations (three stars) for Solar radiation, Temperature and Wind. The correlations shows that their p-values’ are close to 0. The insignificant coeffients are Day and Month.

Remove the insignificant attributes and construct a new linear regression model

airattri <- lm(Ozone ~ . -Day -Month, data = airquality2)
summary(airattri)

## 
## Call:
## lm(formula = Ozone ~ . - Day - Month, data = airquality2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.219 -13.203  -3.146   9.479  98.560 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -55.34822   17.99226  -3.076  0.00250 ** 
## Solar.R       0.05771    0.01855   3.110  0.00224 ** 
## Wind         -2.95246    0.51607  -5.721 5.62e-08 ***
## Temp          1.49496    0.20096   7.439 7.44e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.83 on 149 degrees of freedom
## Multiple R-squared:  0.5749, Adjusted R-squared:  0.5663 
## F-statistic: 67.16 on 3 and 149 DF,  p-value: < 2.2e-16

Replace this text with your answer.

Based on the above linear regression model to predict and calculate MAE and MSE.

ypred <-predict(object = airattri, newdata = AirTest[,1:6])
summary(ypred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -9.023  33.330  43.914  43.414  61.101  80.610

MAE(y_pred = ypred, y_true = AirTest$Ozone)

## [1] 15.66733

MSE(y_pred = ypred, y_true = AirTest$Ozone)

## [1] 364.4143

Replace this text with your answer.

Subset Selection

Both Stepwise

Please construct a forward stepwise regression.

#airqualityomit <- na.omit(airquality2)
air_incercept_only <- lm(Ozone ~ 1, data=AirTraining)
airall <- lm(Ozone~., data=AirTraining)
airboth <- stepAIC (air_incercept_only, direction = 'both', scope = formula(airall), trace = 0)
airboth$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Ozone ~ 1
## 
## Final Model:
## Ozone ~ Temp + Wind + Solar.R + Month
## 
## 
##        Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                              116  111405.23 804.4744
## 2    + Temp  1 50938.232       115   60467.00 734.9785
## 3    + Wind  1 12504.399       114   47962.60 709.8723
## 4 + Solar.R  1  2939.319       113   45023.28 704.4730
## 5   + Month  1  1579.864       112   43443.42 702.2938

summary(airboth)

## 
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R + Month, data = AirTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.120 -12.973  -2.267   7.094  98.828 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -39.38540   20.54683  -1.917   0.0578 .  
## Temp          1.62089    0.23909   6.780 5.93e-10 ***
## Wind         -3.51850    0.60420  -5.823 5.60e-08 ***
## Solar.R       0.04994    0.02118   2.358   0.0201 *  
## Month        -2.83229    1.40340  -2.018   0.0460 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.69 on 112 degrees of freedom
## Multiple R-squared:   0.61,  Adjusted R-squared:  0.5961 
## F-statistic:  43.8 on 4 and 112 DF,  p-value: < 2.2e-16

Replace this text with your answer.

par(mfrow=c(2,2))
plot(airboth)

Use this model to predict and calculateMAEandMSE`.

ypred_airboth <- predict(object = airboth, newdata = AirTest[,1:6])
MAE(y_pred = ypred_airboth, y_true = AirTest$Ozone)

## [1] 16.62599

MSE(y_pred = ypred_airboth, y_true = AirTest$Ozone)

## [1] 401.3099

Replace this text with your answer.

Model Assessment

Compare all the linear regression models. You can use plots, assessment measures (\(R^2\), RSS, MAE, MSE,etc.) Which on is the best? Explain your answer.

# Edit me

Replace this text with your answer.

Final Project

Team 10: Greg Eure, Julio Bolano, Sydney Easter

Week 15, ITMD/ITMS/STAT 514, FA21

Introduction

Data preparation

Exploratory data analysis

Hypothesis Test

Linear Regression

Multiple Linear Regression Model

Subset Selection

Both Stepwise

Model Assessment