Daily air quality measurements in New York, May to September 1973.
A data frame with 154 observations on 6 variables.
[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1–12)
[,6] Day numeric Day of month (1–31)
Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.
Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
library(dplyr)
library(tidyr)
library(caret)
library(datasets)
data("airquality")
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
help(airquality)
Checking missing values
sapply(airquality, function(y) sum(length(which(is.na(y)))))/nrow(airquality)*100
## Ozone Solar.R Wind Temp Month Day
## 24.183007 4.575163 0.000000 0.000000 0.000000 0.000000
We have 24% of missing values in target variable (Ozone), I will omit them and build the model based on the rest 76%.
# omit rows where 'Ozone' has missing values
airquality<-na.omit(airquality, cols="Ozone")
Checking if there are any missing values in other variables.
sapply(airquality, function(y) sum(length(which(is.na(y)))))/nrow(airquality)*100
## Ozone Solar.R Wind Temp Month Day
## 0 0 0 0 0 0
Checking normality of the data
par(mfrow = c(3,2))
plot(density(airquality$Ozone))
plot(density(airquality$Solar.R))
plot(density(airquality$Wind))
plot(density(airquality$Temp))
plot(density(airquality$Month))
plot(density(airquality$Day))
Variables show close to normal distribution except variable “Month”
Checking relationships of target variable with the dependent variables
par(mfrow = c(3,2))
plot(airquality$Ozone~airquality$Solar.R)
plot(airquality$Ozone~airquality$Wind)
plot(airquality$Ozone~airquality$Temp)
plot(airquality$Ozone~airquality$Month)
plot(airquality$Ozone~airquality$Day)
Ozone has linear relationships with all variables except Month.
Applying pre-process transformations: BoxCox, centering and scaling.
pp <- preProcess(airquality, method = c( "BoxCox", "center","scale"))
airquality_trans<- predict(pp, airquality)
pp$method
## $BoxCox
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
##
## $center
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
##
## $scale
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
##
## $ignore
## character(0)
Checking collinearity.
cor(airquality_trans)
## Ozone Solar.R Wind Temp Month
## Ozone 1.00000000 0.43688091 -0.61018838 0.75946581 0.16146260
## Solar.R 0.43688091 1.00000000 -0.13144639 0.28956555 -0.08115972
## Wind -0.61018838 -0.13144639 1.00000000 -0.51547523 -0.18415341
## Temp 0.75946581 0.28956555 -0.51547523 1.00000000 0.36347553
## Month 0.16146260 -0.08115972 -0.18415341 0.36347553 1.00000000
## Day -0.03043316 -0.05775380 0.05740956 -0.09683535 -0.01107195
## Day
## Ozone -0.03043316
## Solar.R -0.05775380
## Wind 0.05740956
## Temp -0.09683535
## Month -0.01107195
## Day 1.00000000
There is no strong collinearity detected between predictor variables.
Wind, temperature and Solar.R seem a strong Ozone level predictors.
Building multiple linear regression model.
set.seed(123)
model_1<- train(Ozone~., data = airquality_trans, method = "lm", trControl = trainControl("cv", number = 10))
summary(model_1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65001 -0.36968 -0.00675 0.35707 1.35443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.758e-16 5.281e-02 0.000 1.000000
## Solar.R 2.275e-01 5.671e-02 4.011 0.000113 ***
## Wind -3.036e-01 6.193e-02 -4.901 3.48e-06 ***
## Temp 5.728e-01 6.897e-02 8.305 3.73e-13 ***
## Month -8.356e-02 5.825e-02 -1.435 0.154380
## Day 5.467e-02 5.334e-02 1.025 0.307742
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5564 on 105 degrees of freedom
## Multiple R-squared: 0.7045, Adjusted R-squared: 0.6904
## F-statistic: 50.06 on 5 and 105 DF, p-value: < 2.2e-16
model_1
## Linear Regression
##
## 111 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 99, 101, 101, 99, 101, 100, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.5601092 0.7324856 0.4527255
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
10 folds CV of the model_1 shows the following average results: RMSE = 0.5601092 and Adjusted R-squared: 0.6904
Checking residuals.
residuals<-resid(model_1)
plot(residuals)
qqnorm(residuals)
qqline(residuals)
The residuals look almost normally distributed and random, that means that there is no useful information is hidden in residuals to be extracted by the model.