Airquality

Daily air quality measurements in New York, May to September 1973.

A data frame with 154 observations on 6 variables.

[,1] Ozone numeric Ozone (ppb)

[,2] Solar.R numeric Solar R (lang)

[,3] Wind numeric Wind (mph)

[,4] Temp numeric Temperature (degrees F)

[,5] Month numeric Month (1–12)

[,6] Day numeric Day of month (1–31)

Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.

library(dplyr)
library(tidyr)
library(caret)
library(datasets)
data("airquality")
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
help(airquality)

Checking missing values

sapply(airquality, function(y) sum(length(which(is.na(y)))))/nrow(airquality)*100
##     Ozone   Solar.R      Wind      Temp     Month       Day 
## 24.183007  4.575163  0.000000  0.000000  0.000000  0.000000

We have 24% of missing values in target variable (Ozone), I will omit them and build the model based on the rest 76%.

# omit rows where 'Ozone' has missing values
airquality<-na.omit(airquality, cols="Ozone")

Checking if there are any missing values in other variables.

sapply(airquality, function(y) sum(length(which(is.na(y)))))/nrow(airquality)*100
##   Ozone Solar.R    Wind    Temp   Month     Day 
##       0       0       0       0       0       0

Checking normality of the data

par(mfrow = c(3,2))
plot(density(airquality$Ozone))
plot(density(airquality$Solar.R))
plot(density(airquality$Wind))
plot(density(airquality$Temp))
plot(density(airquality$Month))
plot(density(airquality$Day))

Variables show close to normal distribution except variable “Month”

Checking relationships of target variable with the dependent variables

par(mfrow = c(3,2))
plot(airquality$Ozone~airquality$Solar.R)
plot(airquality$Ozone~airquality$Wind)
plot(airquality$Ozone~airquality$Temp)
plot(airquality$Ozone~airquality$Month)
plot(airquality$Ozone~airquality$Day)

Ozone has linear relationships with all variables except Month.

Applying pre-process transformations: BoxCox, centering and scaling.

pp <- preProcess(airquality, method = c( "BoxCox", "center","scale"))
airquality_trans<- predict(pp, airquality)
pp$method
## $BoxCox
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"    
## 
## $center
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"    
## 
## $scale
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"    
## 
## $ignore
## character(0)

Checking collinearity.

cor(airquality_trans)
##               Ozone     Solar.R        Wind        Temp       Month
## Ozone    1.00000000  0.43688091 -0.61018838  0.75946581  0.16146260
## Solar.R  0.43688091  1.00000000 -0.13144639  0.28956555 -0.08115972
## Wind    -0.61018838 -0.13144639  1.00000000 -0.51547523 -0.18415341
## Temp     0.75946581  0.28956555 -0.51547523  1.00000000  0.36347553
## Month    0.16146260 -0.08115972 -0.18415341  0.36347553  1.00000000
## Day     -0.03043316 -0.05775380  0.05740956 -0.09683535 -0.01107195
##                 Day
## Ozone   -0.03043316
## Solar.R -0.05775380
## Wind     0.05740956
## Temp    -0.09683535
## Month   -0.01107195
## Day      1.00000000

There is no strong collinearity detected between predictor variables.

Wind, temperature and Solar.R seem a strong Ozone level predictors.

Building multiple linear regression model.

set.seed(123)
model_1<- train(Ozone~., data = airquality_trans, method = "lm",  trControl = trainControl("cv", number = 10))
summary(model_1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65001 -0.36968 -0.00675  0.35707  1.35443 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.758e-16  5.281e-02   0.000 1.000000    
## Solar.R      2.275e-01  5.671e-02   4.011 0.000113 ***
## Wind        -3.036e-01  6.193e-02  -4.901 3.48e-06 ***
## Temp         5.728e-01  6.897e-02   8.305 3.73e-13 ***
## Month       -8.356e-02  5.825e-02  -1.435 0.154380    
## Day          5.467e-02  5.334e-02   1.025 0.307742    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5564 on 105 degrees of freedom
## Multiple R-squared:  0.7045, Adjusted R-squared:  0.6904 
## F-statistic: 50.06 on 5 and 105 DF,  p-value: < 2.2e-16
model_1
## Linear Regression 
## 
## 111 samples
##   5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 99, 101, 101, 99, 101, 100, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.5601092  0.7324856  0.4527255
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

10 folds CV of the model_1 shows the following average results: RMSE = 0.5601092 and Adjusted R-squared: 0.6904

Checking residuals.

residuals<-resid(model_1)
plot(residuals)

qqnorm(residuals) 
qqline(residuals)

The residuals look almost normally distributed and random, that means that there is no useful information is hidden in residuals to be extracted by the model.