Using the New York air quality dataset (airquality), we will perform:
Data preparation: show the information of the dataset airquality.
Data exploration (EDA): perform data visualization present conclusions.
Data analysis: perform hypothesis testing with null and alternative hypothesis and present a conclusion.
Data analysis: build a linear regression model with subset, assess the model, and perform prediction.
#import all necessary packages
library(ggplot2)
library(tidyverse)
library(MASS)
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.1.2
# In addition, I'd like to ask `R` to print decimal numbers with 2 digits:
options(scipen=2)
dim
#load data
data(airquality)
# Check the number of observations and features
dim_d <- dim(airquality)
dim_d
## [1] 153 6
The Air Quality dataset has 153 rows and 6 columns.
# Check for missing values
missing <- sum(is.na(airquality))
missing
## [1] 44
The Air Quality dataset has 44 missing values.
library(mice)
## Warning: package 'mice' was built under R version 4.1.2
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
# Show the missing values
md.pattern(airquality)
## Wind Temp Month Day Solar.R Ozone
## 111 1 1 1 1 1 1 0
## 35 1 1 1 1 1 0 1
## 5 1 1 1 1 0 1 1
## 2 1 1 1 1 0 0 2
## 0 0 0 0 7 37 44
# Input the missing values, m = 5 = # imputed datasets, maxit = num iterations, pmm = predictive mean matching
tempAQ <- mice(airquality,m=5,maxit=50,meth='pmm',seed=500, printFlag = FALSE)
summary(tempAQ)
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## Ozone Solar.R Wind Temp Month Day
## "pmm" "pmm" "" "" "" ""
## PredictorMatrix:
## Ozone Solar.R Wind Temp Month Day
## Ozone 0 1 1 1 1 1
## Solar.R 1 0 1 1 1 1
## Wind 1 1 0 1 1 1
## Temp 1 1 1 0 1 1
## Month 1 1 1 1 0 1
## Day 1 1 1 1 1 0
airquality2 <- complete(tempAQ,1)
md.pattern(airquality2)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## Ozone Solar.R Wind Temp Month Day
## 153 1 1 1 1 1 1 0
## 0 0 0 0 0 0 0
We can see that all missing values have been replaced with predictive imputed values by using the mice package.
# Check for duplicated values
dupl <- sum(duplicated(airquality2))
dupl
## [1] 0
The Air Quality dataset has 0 duplicated values.
# Check the structure of the data
str(airquality2)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 14 28 23 19 8 11 ...
## $ Solar.R: int 190 118 149 313 237 82 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
The Air Quality data set is stored as a data frame with 153 rows and 6 columns, since there are 153 observations of 6 variables. The
Ozone
,Solar.R
,Temp
,Month
,Day
columns contains integer dataint
, and theWind
column contains numeric datanum
.The Air Quality data set consists of values from May 1, 1973 to September 30, 1973:
- Ozone: Mean ozone in part per billion from 1300 to 1500 hours at Roosevelt Island.
- Solar.R: Solar radiation in Langleys between 4000 and 7700 Angstroms from 0800 to 1200 hours at Central Park.
- Wind: Average wind speed in miles per hour at 0700 and 1000 hours at La Guardia Airport.
- Temp: Maximimum daily temperature in degrees Fahrenheit at La Guardia Airport.
# Observe the numberical summary
summary(airquality2)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 20.00 1st Qu.:112.0 1st Qu.: 7.400 1st Qu.:72.00
## Median : 36.00 Median :203.0 Median : 9.700 Median :79.00
## Mean : 42.33 Mean :184.5 Mean : 9.958 Mean :77.88
## 3rd Qu.: 59.00 3rd Qu.:259.0 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
The smallest Ozone, Solar.R, Wind, and Temp values are 1, 7, 1.7, and 56, respectively. The largest Ozone, Solar.R, Wind, and Temp values are 168, 334, 20.7, and 97, respectively. Fifty percent of all the Ozone, Solar.R, Wind, and Temp values are less than 36, 203, 9.7, and 79, respectively.
oz_std <- scale(airquality2$Ozone)
oz_outliers <- airquality2$Ozone[oz_std < -3 | oz_std > 3]
oz_outliers
## [1] 135 168
sr_std <- scale(airquality2$Solar.R)
sr_outliers <- airquality2$Solar.R[sr_std < -3 | sr_std > 3]
sr_outliers
## integer(0)
wn_std <- scale(airquality2$Wind)
wn_outliers <- airquality2$Wind[wn_std < -3 | wn_std > 3]
wn_outliers
## [1] 20.7
tm_std <- scale(airquality2$Temp)
tm_outliers <- airquality2$Temp[tm_std < -3 | tm_std > 3]
tm_outliers
## integer(0)
There are 2 Ozone, 0 Solar.R, 1 Wind, and 0 Temp outliers (greater than 3 or less than -3).
# Show boxplots of outliers
par(mfrow=c(1,4))
boxplot(airquality2$Ozone, main = "Ozone Plot Outliers")
boxplot(airquality2$Solar.R, main = "Solar.R Plot Outliers")
boxplot(airquality2$Wind, main = "Wind Plot Outliers")
boxplot(airquality2$Temp, main = "Temp Plot Outliers")
Observations: The Ozone and Wind variables have outliers while Solar.R and Temp have none. Solar.R and Temp have more variability in the data (especially Solar.R).
t <- ggplot(data = airquality2) +
geom_bar(mapping = aes(x = Temp), fill = "blue") +
ggtitle("New York Tempurature Readings")
t + theme(
plot.title = element_text(family = "Helvetica", face = "bold", size = (15)))
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
Temps between 75 and 85 were most common.
o <- ggplot(data = airquality2) +
geom_freqpoly(mapping = aes(x = Ozone), color = "red") +
ggtitle("New York Ozone Readings")
o + theme(
plot.title = element_text(family = "Helvetica", face = "bold", size = (15)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
Ozone readings were mostly between 0 and 50.
plot(airquality2$Temp, airquality$Ozone,
col = "purple",
xlab = "Temp (F)",
ylab = 'Ozone (PPB)',
main = 'New York Ozone & Temperature Data')
The scatterplot shows that as temperature rises, so does the Ozone count, possibly indicating a correlation.
We will do an alternative hypothesis to see if the Ozone count mean is similar or dissimilar between May and September.
\(H_{0}:\) Ozone count (PPB) means between months is equal.
\(H_{1}:\) Ozone count (PPB) means between months is not equal.
boxplot(airquality2$Ozone~airquality2$Month, main = "Ozone Levels Across May - September", xlab = "Month", ylab = "Ozone Count (PPB)")
We can visually see that Ozone means has similarities and differences.
# Utilize a difference-in-means test to check the null hypothesis
May <- subset(airquality2,airquality2$Month=="5")
June <- subset(airquality2,airquality2$Month=="6")
July <- subset(airquality2,airquality2$Month=="7")
August <- subset(airquality2,airquality2$Month=="8")
September <- subset(airquality2,airquality2$Month=="9")
t.test(June$Ozone,May$Ozone, var.equal = FALSE, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: June$Ozone and May$Ozone
## t = 3.887, df = 58.993, p-value = 0.0001299
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 11.60082 Inf
## sample estimates:
## mean of x mean of y
## 42.83333 22.48387
t.test(July$Ozone,June$Ozone, var.equal = FALSE, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: July$Ozone and June$Ozone
## t = 2.1157, df = 52.795, p-value = 0.01955
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 2.848508 Inf
## sample estimates:
## mean of x mean of y
## 56.48387 42.83333
t.test(July$Ozone,August$Ozone, var.equal = FALSE, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: July$Ozone and August$Ozone
## t = -0.15142, df = 57.218, p-value = 0.5599
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -15.53723 Inf
## sample estimates:
## mean of x mean of y
## 56.48387 57.77419
t.test(August$Ozone,September$Ozone, var.equal = FALSE, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: August$Ozone and September$Ozone
## t = 3.2722, df = 51.348, p-value = 0.0009557
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 12.69408 Inf
## sample estimates:
## mean of x mean of y
## 57.77419 31.76667
t.test(September$Ozone,May$Ozone, var.equal = FALSE, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: September$Ozone and May$Ozone
## t = 1.6174, df = 57.491, p-value = 0.05563
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.3122599 Inf
## sample estimates:
## mean of x mean of y
## 31.76667 22.48387
We can see the p-value for the July - August mean comparison is 0.56, and for the May - September mean comparison is 0.056, and are both higher than \(\alpha = 0.05\). Therefore, we must consider not rejecting \(H_{0}:\) Ozone counts (PPB) means between July - August and May - September are equal (or similar).
However, we can also see the the p-value for the May - June mean comparison is 0.0001, the p-value for the June - July mean comparison is 0.019, and the p-value for the August - September mean comparision is 0.0009, and are lower than \(\alpha = 0.05\). Therefore, we must consider rejecting \(H_{0}\) in favor of \(H_{1}:\) Ozone counts (PPB) means between May - June, June - July, and August - September are not equal (or dissimilar).
Build a linear regression model with subset selection. Please indicate all significant attributes, assess your model, and predict in a test dataset.
Split data set into 80:20 train and test data
# split data set into 80:20
#data("airquality2")
i<- sample(2, nrow(airquality2), replace = TRUE, prob = c(0.8,0.2))
AirTraining<- airquality2[i==1,]
AirTest<- airquality2[i==2,]
pairs(AirTraining[,1:4], lower.panel = NULL)
pairs(AirTest[,1:4], lower.panel = NULL)
Observations: All variables have correlation with, but Ozone and Temperature seem to have higher effect on airquality based on the graph.
cor(AirTraining)
## Ozone Solar.R Wind Temp Month Day
## Ozone 1.00000000 0.33531060 -0.61361202 0.6761906 0.17270264 -0.07456763
## Solar.R 0.33531060 1.00000000 -0.11280067 0.2738459 -0.03594261 -0.14670447
## Wind -0.61361202 -0.11280067 1.00000000 -0.4701731 -0.24281659 0.03498455
## Temp 0.67619062 0.27384591 -0.47017309 1.0000000 0.42612959 -0.16608135
## Month 0.17270264 -0.03594261 -0.24281659 0.4261296 1.00000000 0.02280542
## Day -0.07456763 -0.14670447 0.03498455 -0.1660814 0.02280542 1.00000000
Testing the two datasets, AirTraining and AirTest we can comfirm our correlation.
Please use all the other features/attributes to construct a linear regression model.
fitlm <- lm(Ozone~., data = AirTraining)
summary(fitlm)
##
## Call:
## lm(formula = Ozone ~ ., data = AirTraining)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.078 -12.146 -2.358 7.269 97.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -44.51388 21.46437 -2.074 0.0404 *
## Solar.R 0.05155 0.02130 2.421 0.0171 *
## Wind -3.49638 0.60557 -5.774 7.16e-08 ***
## Temp 1.65376 0.24259 6.817 5.07e-10 ***
## Month -2.93376 1.41046 -2.080 0.0398 *
## Day 0.17972 0.21437 0.838 0.4036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.72 on 111 degrees of freedom
## Multiple R-squared: 0.6125, Adjusted R-squared: 0.595
## F-statistic: 35.09 on 5 and 111 DF, p-value: < 2.2e-16
In looking at the coeffients of the attributes, we see that there are strong correlations (three stars) for Solar radiation, Temperature and Wind. The correlations shows that their p-values’ are close to 0. The insignificant coeffients are Day and Month.
airattri <- lm(Ozone ~ . -Day -Month, data = airquality2)
summary(airattri)
##
## Call:
## lm(formula = Ozone ~ . - Day - Month, data = airquality2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.219 -13.203 -3.146 9.479 98.560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -55.34822 17.99226 -3.076 0.00250 **
## Solar.R 0.05771 0.01855 3.110 0.00224 **
## Wind -2.95246 0.51607 -5.721 5.62e-08 ***
## Temp 1.49496 0.20096 7.439 7.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.83 on 149 degrees of freedom
## Multiple R-squared: 0.5749, Adjusted R-squared: 0.5663
## F-statistic: 67.16 on 3 and 149 DF, p-value: < 2.2e-16
Replace this text with your answer.
MAE
and MSE
.ypred <-predict(object = airattri, newdata = AirTest[,1:6])
summary(ypred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.023 33.330 43.914 43.414 61.101 80.610
MAE(y_pred = ypred, y_true = AirTest$Ozone)
## [1] 15.66733
MSE(y_pred = ypred, y_true = AirTest$Ozone)
## [1] 364.4143
Replace this text with your answer.
#airqualityomit <- na.omit(airquality2)
air_incercept_only <- lm(Ozone ~ 1, data=AirTraining)
airall <- lm(Ozone~., data=AirTraining)
airboth <- stepAIC (air_incercept_only, direction = 'both', scope = formula(airall), trace = 0)
airboth$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Ozone ~ 1
##
## Final Model:
## Ozone ~ Temp + Wind + Solar.R + Month
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 116 111405.23 804.4744
## 2 + Temp 1 50938.232 115 60467.00 734.9785
## 3 + Wind 1 12504.399 114 47962.60 709.8723
## 4 + Solar.R 1 2939.319 113 45023.28 704.4730
## 5 + Month 1 1579.864 112 43443.42 702.2938
summary(airboth)
##
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R + Month, data = AirTraining)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.120 -12.973 -2.267 7.094 98.828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.38540 20.54683 -1.917 0.0578 .
## Temp 1.62089 0.23909 6.780 5.93e-10 ***
## Wind -3.51850 0.60420 -5.823 5.60e-08 ***
## Solar.R 0.04994 0.02118 2.358 0.0201 *
## Month -2.83229 1.40340 -2.018 0.0460 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.69 on 112 degrees of freedom
## Multiple R-squared: 0.61, Adjusted R-squared: 0.5961
## F-statistic: 43.8 on 4 and 112 DF, p-value: < 2.2e-16
Replace this text with your answer.
par(mfrow=c(2,2))
plot(airboth)
and calculate
MAEand
MSE`.ypred_airboth <- predict(object = airboth, newdata = AirTest[,1:6])
MAE(y_pred = ypred_airboth, y_true = AirTest$Ozone)
## [1] 16.62599
MSE(y_pred = ypred_airboth, y_true = AirTest$Ozone)
## [1] 401.3099
Replace this text with your answer.
Compare all the linear regression models. You can use plots, assessment measures (\(R^2\), RSS, MAE, MSE,etc.) Which on is the best? Explain your answer.
# Edit me
Replace this text with your answer.