We will be using a Kaggle data set that explores how UFO sightings in the US relate to air quality at the time of the sighting. The data set contains the date, location, and pollution measurements (mean concentration and Air Quality Index) for four main gases - NO2, O3, SO2 and CO - for UFO sightings that occurred in the USA, together with noise observations in which no UFO sighting occurred.
To analyze this data, we will use the following R packages:
library(dplyr)                  # data-manipulation verbs (mutate, filter, summarise, ...)
library(rio)                    # import() for reading the CSV file
library(PerformanceAnalytics)   # chart.Correlation() scatterplot/correlation matrix
library(corrplot)               # corrplot() correlation plots
library(caret)                  # data-partitioning and classification utilities
library(tidyverse)              # tidyr, readr and other tidyverse helpers
library(car)                    # vif() for multicollinearity checks
library(ggplot2)                # plotting
mutate() adds new variables that are functions of existing variables, select() picks variables based on their names, filter() picks cases based on their values, summarise() reduces multiple values down to a single summary, and arrange() changes the ordering of the rows. These all combine naturally with group_by(), which allows you to perform any operation "by group". dplyr is designed to abstract over how the data is stored.
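As a quick illustration of how these verbs compose (a sketch using the UFOdata object imported in the next step; the columns state, ET and SO2.AQI come from this data set):
# average SO2 AQI and number of rows per state, for observations with a sighting
UFOdata %>%
  filter(ET == 1) %>%
  group_by(state) %>%
  summarise(mean_SO2_AQI = mean(SO2.AQI), n = n()) %>%
  arrange(desc(mean_SO2_AQI))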
UFOdata <- import("C:/Users/Ankeeta/Desktop/UFO_Data.csv", header = TRUE)
To view the structure of the data:
str(UFOdata)
## 'data.frame': 15822 obs. of 14 variables:
## $ city : chr "kingman" "tacna" "grand canyon" "benson" ...
## $ state : chr "Arizona" "Arizona" "Arizona" "Arizona" ...
## $ month : int 7 7 3 4 5 4 11 2 10 1 ...
## $ year : int 2000 2000 2006 2005 2003 2005 2008 2001 2005 2004 ...
## $ hour : int 21 21 19 20 19 18 20 23 17 7 ...
## $ NO2.Mean: num 7.38 7.38 11.79 13.33 14.42 ...
## $ NO2.AQI : int 14 14 42 42 38 39 35 39 43 35 ...
## $ O3.Mean : num 0.038 0.038 0.0335 0.0249 0.0368 ...
## $ O3.AQI : int 45 45 41 34 47 40 28 33 34 14 ...
## $ SO2.Mean: num 1 1 0.0417 0.375 0.8333 ...
## $ SO2.AQI : int 1 1 1 3 3 6 1 4 6 7 ...
## $ CO.Mean : num 0.196 0.196 0.242 0.283 0.379 ...
## $ CO.AQI : int 3 3 7 6 7 8 5 15 16 13 ...
## $ ET : int 1 1 1 1 1 1 1 1 1 1 ...
To view the overall summary of the data:
summary(UFOdata)
## city state month year
## Length:15822 Length:15822 Min. : 1.000 Min. :2000
## Class :character Class :character 1st Qu.: 4.000 1st Qu.:2000
## Mode :character Mode :character Median : 7.000 Median :2000
## Mean : 6.701 Mean :2000
## 3rd Qu.: 9.000 3rd Qu.:2000
## Max. :12.000 Max. :2008
## hour NO2.Mean NO2.AQI O3.Mean
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. :0.00000
## 1st Qu.: 5.00 1st Qu.: 9.458 1st Qu.: 18.00 1st Qu.:0.01492
## Median :12.00 Median : 15.292 Median : 28.00 Median :0.02296
## Mean :11.56 Mean : 17.053 Mean : 30.01 Mean :0.02389
## 3rd Qu.:18.00 3rd Qu.: 22.810 3rd Qu.: 40.00 3rd Qu.:0.03167
## Max. :23.00 Max. :139.542 Max. :132.00 Max. :0.09508
## O3.AQI SO2.Mean SO2.AQI CO.Mean
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. :0.0000
## 1st Qu.: 22.00 1st Qu.: 1.045 1st Qu.: 4.00 1st Qu.:0.2500
## Median : 31.00 Median : 2.458 Median : 10.00 Median :0.4417
## Mean : 35.31 Mean : 4.018 Mean : 15.47 Mean :0.5090
## 3rd Qu.: 42.00 3rd Qu.: 5.500 3rd Qu.: 21.00 3rd Qu.:0.6750
## Max. :207.00 Max. :39.043 Max. :172.00 Max. :4.0375
## CO.AQI ET
## Min. : 0.000 Min. :0.00000
## 1st Qu.: 5.000 1st Qu.:0.00000
## Median : 7.000 Median :0.00000
## Mean : 8.608 Mean :0.01106
## 3rd Qu.:11.000 3rd Qu.:0.00000
## Max. :62.000 Max. :1.00000
To view the first few rows of the data:
head(UFOdata)
## city state month year hour NO2.Mean NO2.AQI O3.Mean O3.AQI
## 1 kingman Arizona 7 2000 21 7.37500 14 0.038000 45
## 2 tacna Arizona 7 2000 21 7.37500 14 0.038000 45
## 3 grand canyon Arizona 3 2006 19 11.79167 42 0.033458 41
## 4 benson Arizona 4 2005 20 13.33333 42 0.024875 34
## 5 tucson Arizona 5 2003 19 14.41667 38 0.036833 47
## 6 peoria Arizona 4 2005 18 15.05833 39 0.027375 40
## SO2.Mean SO2.AQI CO.Mean CO.AQI ET
## 1 1.000000 1 0.195833 3 1
## 2 1.000000 1 0.195833 3 1
## 3 0.041667 1 0.241667 7 1
## 4 0.375000 3 0.283333 6 1
## 5 0.833333 3 0.379167 7 1
## 6 1.750000 6 0.570833 8 1
To confirm that there are no missing values:
sum(is.na(UFOdata))
## [1] 0
Next, we plot a histogram of each numeric column to inspect its distribution:
# record the class of every column so that only numeric/integer variables are plotted
col.type <- data.frame("varType" = sapply(UFOdata, class))
# draw a histogram for each numeric or integer column of the supplied data frame
plotHistogram <- function(y)
{
for (a in 1:ncol(y))
{
if (col.type[a,1] == "numeric" | col.type[a,1] == "integer")
{
hist(y[,a],
xlab = colnames(y)[a],
main = colnames(y)[a])
}
}
}
par(mfrow = c(3,4))   # arrange the histograms in a 3 x 4 grid
plotHistogram(UFOdata)
For the correlation chart we keep only the numeric columns:
UFOnumeric <- unlist(lapply(UFOdata, is.numeric))
head(UFOnumeric)
## city state month year hour NO2.Mean
## FALSE FALSE TRUE TRUE TRUE TRUE
UFOnum <- UFOdata[ , UFOnumeric]
head(UFOnum)
## month year hour NO2.Mean NO2.AQI O3.Mean O3.AQI SO2.Mean SO2.AQI
## 1 7 2000 21 7.37500 14 0.038000 45 1.000000 1
## 2 7 2000 21 7.37500 14 0.038000 45 1.000000 1
## 3 3 2006 19 11.79167 42 0.033458 41 0.041667 1
## 4 4 2005 20 13.33333 42 0.024875 34 0.375000 3
## 5 5 2003 19 14.41667 38 0.036833 47 0.833333 3
## 6 4 2005 18 15.05833 39 0.027375 40 1.750000 6
## CO.Mean CO.AQI ET
## 1 0.195833 3 1
## 2 0.195833 3 1
## 3 0.241667 7 1
## 4 0.283333 6 1
## 5 0.379167 7 1
## 6 0.570833 8 1
chart.Correlation(UFOnum, histogram = TRUE, pch = 19)
To see how the observations are distributed across the states:
barCount <- table(UFOdata$state)
par(mfrow = c(1,1))
barplot(barCount, main = "Sightings by state", xlab = "State")
The "state" column is categorical with 18 different levels. To include it in the multiple linear regression, we create a separate 0/1 indicator column for each state (level); this lets us test the significance of every region during model building and adequacy checking. (A more compact way to build the same indicators is sketched after the block below.)
UFOdata$Arizona = ifelse(UFOdata$state == "Arizona", 1,0)
UFOdata$California = ifelse(UFOdata$state == "California", 1,0)
UFOdata$Colorado = ifelse(UFOdata$state == "Colorado", 1,0)
UFOdata$Florida = ifelse(UFOdata$state == "Florida", 1,0)
UFOdata$Illinois = ifelse(UFOdata$state == "Illinois", 1,0)
UFOdata$Kansas = ifelse(UFOdata$state == "Kansas", 1,0)
UFOdata$Kentucky = ifelse(UFOdata$state == "Kentucky", 1,0)
UFOdata$Louisiana = ifelse(UFOdata$state == "Louisiana", 1,0)
UFOdata$Massachusetts = ifelse(UFOdata$state == "Massachusetts", 1,0)
UFOdata$Michigan = ifelse(UFOdata$state == "Michigan", 1,0)
UFOdata$Missouri = ifelse(UFOdata$state == "Missouri", 1,0)
UFOdata$NewJersey = ifelse(UFOdata$state == "New Jersey", 1,0)
UFOdata$NewYork = ifelse(UFOdata$state == "New York", 1,0)
UFOdata$NorthCarolina = ifelse(UFOdata$state == "North Carolina", 1,0)
UFOdata$Oregon = ifelse(UFOdata$state == "Oregon", 1,0)
UFOdata$Pennsylvania = ifelse(UFOdata$state == "Pennsylvania", 1,0)
UFOdata$Texas = ifelse(UFOdata$state == "Texas", 1,0)
UFOdata$Virginia = ifelse(UFOdata$state == "Virginia", 1,0)
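For reference, the same indicator columns can be generated in a single call (a sketch; the explicit ifelse() statements above are kept for readability, and model.matrix() produces slightly different column names such as "stateNew York"):
state_dummies <- model.matrix(~ state - 1, data = UFOdata)   # one 0/1 column per state
head(state_dummies)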
We then split the data set into training and test sets in an 80:20 ratio. The model is fit on the training set, where the outcome is known, and the held-out test set is used to assess how well its predictions generalize to unseen data.
df = subset(UFOdata, select = -c(city,state))
trainData <- sample_frac(df, 0.8)
dim(trainData)
## [1] 12658 30
sid <- as.numeric(rownames(trainData))
testData <- df[-sid,]
dim(testData)
## [1] 3164 30
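Note that sample_frac() draws a random sample, so the split (and therefore the fitted coefficients below) will vary between runs unless a seed is set. A reproducible alternative using the already-loaded caret package could look like this (a sketch; the seed value is arbitrary, and trainData2/testData2 are hypothetical names used so the split above is not overwritten):
set.seed(42)                                               # fix the random number stream
idx <- createDataPartition(factor(df$ET), p = 0.8, list = FALSE)   # indices of the training rows, stratified on ET
trainData2 <- df[idx, ]
testData2  <- df[-idx, ]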
We first fit a multiple linear regression of ET on all of the time, pollutant and state covariates:
model_UFO <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI+Arizona+California+Colorado+Florida+Illinois+Kansas+Kentucky+Louisiana+Massachusetts+Michigan+Missouri+NewJersey+NewYork+NorthCarolina+Oregon+Pennsylvania+Texas+Virginia, data = trainData)
summary(model_UFO)
##
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean +
## O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI + Arizona +
## California + Colorado + Florida + Illinois + Kansas + Kentucky +
## Louisiana + Massachusetts + Michigan + Missouri + NewJersey +
## NewYork + NorthCarolina + Oregon + Pennsylvania + Texas +
## Virginia, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09341 -0.00688 -0.00159 0.00339 1.05998
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.717e+02 1.924e+00 -141.196 < 2e-16 ***
## year 1.359e-01 9.621e-04 141.209 < 2e-16 ***
## month -5.221e-04 1.930e-04 -2.705 0.006842 **
## NO2.Mean -4.787e-04 1.509e-04 -3.173 0.001514 **
## NO2.AQI 5.908e-05 9.047e-05 0.653 0.513738
## O3.Mean -3.480e-01 1.040e-01 -3.347 0.000821 ***
## O3.AQI 9.850e-05 5.222e-05 1.886 0.059271 .
## SO2.Mean -1.471e-04 2.773e-04 -0.530 0.595800
## SO2.AQI 6.838e-05 6.441e-05 1.062 0.288352
## CO.Mean -9.902e-03 4.784e-03 -2.070 0.038517 *
## CO.AQI 3.646e-04 2.461e-04 1.482 0.138400
## Arizona -6.124e-02 3.744e-03 -16.357 < 2e-16 ***
## California -5.228e-02 2.943e-03 -17.764 < 2e-16 ***
## Colorado -2.110e-03 4.812e-03 -0.438 0.661053
## Florida 6.691e-03 4.607e-03 1.452 0.146412
## Illinois 5.044e-03 3.750e-03 1.345 0.178648
## Kansas -2.563e-03 4.891e-03 -0.524 0.600228
## Kentucky -3.975e-03 3.926e-03 -1.012 0.311408
## Louisiana -7.007e-04 4.614e-03 -0.152 0.879311
## Massachusetts 7.331e-01 6.463e-02 11.343 < 2e-16 ***
## Michigan -1.531e-03 6.112e-03 -0.250 0.802254
## Missouri 8.903e-07 3.853e-03 0.000 0.999816
## NewJersey -4.488e-04 4.522e-03 -0.099 0.920941
## NewYork 7.148e-03 3.433e-03 2.082 0.037383 *
## NorthCarolina -3.308e-04 4.653e-03 -0.071 0.943328
## Oregon 4.642e-02 6.494e-02 0.715 0.474757
## Pennsylvania -2.219e-03 2.911e-03 -0.762 0.445899
## Texas 4.871e-03 3.562e-03 1.367 0.171520
## Virginia NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06454 on 12630 degrees of freedom
## Multiple R-squared: 0.6227, Adjusted R-squared: 0.6219
## F-statistic: 772.1 on 27 and 12630 DF, p-value: < 2.2e-16
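The NA row for Virginia appears because the 18 state indicators sum to 1 for every observation and are therefore perfectly collinear with the intercept, so R drops one of them. The aliased term can be confirmed with alias() (a quick check, not part of the original workflow):
alias(model_UFO)$Complete   # reports Virginia as an exact linear combination of the other terms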
Next, we drop the state indicators and fit a model on the time and pollutant covariates only:
model_UFO <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI, data = trainData)
summary(model_UFO)
##
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean +
## O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13297 -0.00179 0.01100 0.01859 1.02734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.491e+02 1.902e+00 -130.966 < 2e-16 ***
## year 1.245e-01 9.509e-04 130.984 < 2e-16 ***
## month -4.730e-04 2.034e-04 -2.326 0.020057 *
## NO2.Mean -1.213e-03 1.510e-04 -8.035 1.02e-15 ***
## NO2.AQI 5.544e-04 9.183e-05 6.038 1.61e-09 ***
## O3.Mean -7.365e-01 1.074e-01 -6.857 7.36e-12 ***
## O3.AQI 2.740e-04 5.386e-05 5.087 3.69e-07 ***
## SO2.Mean 1.115e-03 2.486e-04 4.484 7.40e-06 ***
## SO2.AQI 3.103e-04 6.179e-05 5.021 5.20e-07 ***
## CO.Mean -1.714e-03 4.425e-03 -0.387 0.698498
## CO.AQI -8.701e-04 2.416e-04 -3.601 0.000318 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06825 on 12647 degrees of freedom
## Multiple R-squared: 0.5775, Adjusted R-squared: 0.5771
## F-statistic: 1728 on 10 and 12647 DF, p-value: < 2.2e-16
We also refit the state-level model without the Massachusetts and Virginia indicators (Virginia could not be estimated above because of the singularity):
model_UFO <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI+Arizona+California+Colorado+Florida+Illinois+Kansas+Kentucky+Louisiana+Michigan+Missouri+NewJersey+NewYork+NorthCarolina+Oregon+Pennsylvania+Texas, data = trainData)
summary(model_UFO)
##
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean +
## O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI + Arizona +
## California + Colorado + Florida + Illinois + Kansas + Kentucky +
## Louisiana + Michigan + Missouri + NewJersey + NewYork + NorthCarolina +
## Oregon + Pennsylvania + Texas, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09333 -0.00688 -0.00165 0.00326 1.06014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.723e+02 1.933e+00 -140.878 < 2e-16 ***
## year 1.362e-01 9.665e-04 140.891 < 2e-16 ***
## month -4.871e-04 1.940e-04 -2.511 0.012040 *
## NO2.Mean -4.573e-04 1.516e-04 -3.016 0.002568 **
## NO2.AQI 5.568e-05 9.093e-05 0.612 0.540329
## O3.Mean -3.443e-01 1.045e-01 -3.294 0.000989 ***
## O3.AQI 9.622e-05 5.248e-05 1.834 0.066750 .
## SO2.Mean -1.260e-04 2.787e-04 -0.452 0.651090
## SO2.AQI 6.484e-05 6.473e-05 1.002 0.316533
## CO.Mean -1.050e-02 4.808e-03 -2.184 0.028972 *
## CO.AQI 3.720e-04 2.473e-04 1.504 0.132550
## Arizona -6.227e-02 3.761e-03 -16.554 < 2e-16 ***
## California -5.324e-02 2.957e-03 -18.007 < 2e-16 ***
## Colorado -2.804e-03 4.836e-03 -0.580 0.561978
## Florida 6.026e-03 4.630e-03 1.302 0.193046
## Illinois 4.132e-03 3.768e-03 1.097 0.272872
## Kansas -3.104e-03 4.916e-03 -0.631 0.527819
## Kentucky -4.921e-03 3.945e-03 -1.247 0.212311
## Louisiana -1.365e-03 4.637e-03 -0.294 0.768496
## Michigan -2.558e-03 6.142e-03 -0.416 0.677131
## Missouri -8.104e-04 3.871e-03 -0.209 0.834193
## NewJersey -1.338e-03 4.544e-03 -0.294 0.768460
## NewYork 6.093e-03 3.450e-03 1.766 0.077366 .
## NorthCarolina -1.043e-03 4.676e-03 -0.223 0.823453
## Oregon 4.340e-02 6.527e-02 0.665 0.506113
## Pennsylvania -3.150e-03 2.925e-03 -1.077 0.281392
## Texas 4.024e-03 3.580e-03 1.124 0.260943
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06486 on 12631 degrees of freedom
## Multiple R-squared: 0.6189, Adjusted R-squared: 0.6181
## F-statistic: 788.8 on 26 and 12631 DF, p-value: < 2.2e-16
Reducing the model to the covariates that appeared most useful in the fits above:
model_UFO_3 <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+Arizona+California+NewYork, data = trainData)
summary(model_UFO_3)
##
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean +
## Arizona + California + NewYork, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09136 -0.00550 -0.00217 0.00240 1.06019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.727e+02 1.919e+00 -142.067 < 2e-16 ***
## year 1.364e-01 9.597e-04 142.076 < 2e-16 ***
## month -4.301e-04 1.902e-04 -2.261 0.023763 *
## NO2.Mean -5.530e-04 1.353e-04 -4.088 4.38e-05 ***
## NO2.AQI 1.299e-04 8.518e-05 1.525 0.127374
## O3.Mean -1.834e-01 5.454e-02 -3.362 0.000776 ***
## Arizona -6.225e-02 2.475e-03 -25.156 < 2e-16 ***
## California -5.377e-02 1.508e-03 -35.661 < 2e-16 ***
## NewYork 7.097e-03 2.507e-03 2.831 0.004650 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06488 on 12649 degrees of freedom
## Multiple R-squared: 0.6181, Adjusted R-squared: 0.6179
## F-statistic: 2560 on 8 and 12649 DF, p-value: < 2.2e-16
Returning to the pollutant-only model, we drop CO.Mean, which was not significant:
model_UFO_2 <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.AQI, data = trainData)
summary(model_UFO_2)
##
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean +
## O3.AQI + SO2.Mean + SO2.AQI + CO.AQI, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13323 -0.00181 0.01096 0.01856 1.02745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.491e+02 1.902e+00 -130.971 < 2e-16 ***
## year 1.245e-01 9.508e-04 130.988 < 2e-16 ***
## month -4.675e-04 2.029e-04 -2.304 0.0212 *
## NO2.Mean -1.225e-03 1.478e-04 -8.284 < 2e-16 ***
## NO2.AQI 5.617e-04 8.985e-05 6.252 4.17e-10 ***
## O3.Mean -7.370e-01 1.074e-01 -6.862 7.10e-12 ***
## O3.AQI 2.738e-04 5.385e-05 5.084 3.76e-07 ***
## SO2.Mean 1.110e-03 2.483e-04 4.470 7.88e-06 ***
## SO2.AQI 3.118e-04 6.166e-05 5.057 4.32e-07 ***
## CO.AQI -9.532e-04 1.114e-04 -8.558 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06825 on 12648 degrees of freedom
## Multiple R-squared: 0.5775, Adjusted R-squared: 0.5772
## F-statistic: 1921 on 9 and 12648 DF, p-value: < 2.2e-16
Model adequacy checking verifies that the LINE assumptions (Linearity, Independence, Normality and Equal variance of the errors) are satisfied, i.e. that the model fits the data well. To do so, we examine residual plots and carry out a residual analysis.
par(mfrow = c(2,2))
plot(model_UFO_2)
We attempted a Box-Cox transformation; however, because the response variable takes the value zero (it is a 0/1 indicator) and the transformation requires a strictly positive response, it could not be applied, so we proceeded to the multicollinearity check.
As part of model adequacy checking, we examine multicollinearity to understand the linear dependence among the covariates. We plot each covariate against the response variable, build a correlation plot, and compute variance inflation factors (VIFs).
par(mfrow = c(3,3))
plot(ET~year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.AQI, data = trainData, pch = 20)
ufo <- trainData[c(1:9,11)]
head(ufo)
## month year hour NO2.Mean NO2.AQI O3.Mean O3.AQI SO2.Mean SO2.AQI
## 13788 12 2001 15 43.272727 56 0.004294 13 4.181818 11
## 811 4 2000 0 18.173913 28 0.019542 24 2.000000 4
## 5752 6 2000 17 14.875000 37 0.029125 38 5.625000 59
## 1649 8 2000 4 11.782609 17 0.018875 19 2.000000 3
## 1380 11 2000 18 11.869565 23 0.012375 19 1.681818 3
## 6090 4 2000 22 7.458333 13 0.027125 27 1.166667 6
## CO.AQI
## 13788 36
## 811 7
## 5752 1
## 1649 7
## 1380 7
## 6090 6
UFOcor <- cor(ufo)
par(mfrow = c(1,1))   # reset the plotting layout before drawing the correlation plot
corrplot(UFOcor, method = "number")
vif(model_UFO_2)
## year month NO2.Mean NO2.AQI O3.Mean O3.AQI SO2.Mean SO2.AQI
## 1.045339 1.061787 7.074337 6.139038 4.594494 3.985470 3.253810 3.157985
## CO.AQI
## 1.820585
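For reference, the VIF of a covariate j is VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing covariate j on all of the other covariates. Values near 1 indicate little collinearity, while values above roughly 5-10 are commonly treated as problematic; here NO2.Mean and NO2.AQI have the largest VIFs, consistent with each pollutant's mean being strongly correlated with its AQI.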
Based on the multicollinearity results, we remove one covariate from each correlated pair, dropping the means of NO2, O3 and SO2 and keeping their AQIs:
model_UFO_4 <- lm(ET ~ year+month+NO2.AQI+O3.AQI+SO2.AQI+CO.AQI, data = trainData)
summary(model_UFO_4)
##
## Call:
## lm(formula = ET ~ year + month + NO2.AQI + O3.AQI + SO2.AQI +
## CO.AQI, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11866 0.00072 0.01228 0.01770 1.03174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.475e+02 1.901e+00 -130.186 <2e-16 ***
## year 1.238e-01 9.506e-04 130.199 <2e-16 ***
## month -1.779e-04 1.993e-04 -0.893 0.372
## NO2.AQI 2.611e-05 4.761e-05 0.548 0.583
## O3.AQI 1.732e-05 2.811e-05 0.616 0.538
## SO2.AQI 5.236e-04 3.688e-05 14.198 <2e-16 ***
## CO.AQI -1.038e-03 1.072e-04 -9.685 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06853 on 12651 degrees of freedom
## Multiple R-squared: 0.5739, Adjusted R-squared: 0.5737
## F-statistic: 2840 on 6 and 12651 DF, p-value: < 2.2e-16
Let us now check what happens if we instead remove the AQIs of NO2, O3 and SO2 and keep their means (CO.AQI is retained):
model_UFO_5 <- lm(ET ~ year+month+NO2.Mean+O3.Mean+SO2.Mean+CO.AQI, data = trainData)
summary(model_UFO_5)
##
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + O3.Mean + SO2.Mean +
## CO.AQI, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12153 0.00072 0.01169 0.01704 1.02695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.471e+02 1.902e+00 -129.904 < 2e-16 ***
## year 1.236e-01 9.511e-04 129.920 < 2e-16 ***
## month -3.505e-04 2.016e-04 -1.739 0.082067 .
## NO2.Mean -2.678e-04 7.850e-05 -3.412 0.000647 ***
## O3.Mean -1.462e-01 5.642e-02 -2.592 0.009565 **
## SO2.Mean 2.090e-03 1.502e-04 13.914 < 2e-16 ***
## CO.AQI -9.294e-04 1.109e-04 -8.379 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06858 on 12651 degrees of freedom
## Multiple R-squared: 0.5733, Adjusted R-squared: 0.5731
## F-statistic: 2833 on 6 and 12651 DF, p-value: < 2.2e-16
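A compact way to put the two reduced models side by side (a sketch; the comparison in the text is informal and based on the summaries above):
# AIC and adjusted R-squared for the AQI-based and mean-based models
sapply(list(AQI_model = model_UFO_4, Mean_model = model_UFO_5),
       function(m) c(AIC = AIC(m), adj.R2 = summary(m)$adj.r.squared))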
This is a better model, but month still has a p-value above 0.05, so we remove it to obtain the final linear model:
model_UFO_6 <- lm(ET ~ year+NO2.Mean+O3.Mean+SO2.Mean+CO.AQI, data = trainData)
There should now be little multicollinearity left among these covariates, which we verify with the VIFs:
vif(model_UFO_6)
## year NO2.Mean O3.Mean SO2.Mean CO.AQI
## 1.035714 1.962613 1.223797 1.175899 1.782727
The VIF values are all close to 1, so there is no meaningful multicollinearity among these covariates, and this is our best linear model. We now draw its diagnostic plots:
par(mfrow = c(2,2))
plot(model_UFO_6)
Even after removing the non-significant covariates and keeping only those with p-values below 0.05, the diagnostic plots do not support a linear regression model for this binary response. Before switching models, we check how the fitted linear model behaves on the training data by plotting it over the observations.
plot(trainData$year, trainData$ET, pch = 20, ylim = c(-1,2))
abline(model_UFO_6, col = "red")   # abline() uses only the intercept and the first slope (year)
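Because the response ET is binary (1 = UFO sighting, 0 = no sighting), a linear model is not the natural choice. Logistic regression instead models the log-odds of a sighting as a linear function of the covariates, log(p / (1 - p)) = b0 + b1*year + b2*month + ..., where p is the probability of a sighting; we fit it with glm() and family = "binomial".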
model_UFO <- glm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI+Arizona+California+Colorado+Florida+Illinois+Kansas+Kentucky+Louisiana+Massachusetts+Michigan+Missouri+NewJersey+NewYork+NorthCarolina+Oregon+Pennsylvania+Texas+Virginia, data = trainData, family = "binomial")
summary(model_UFO)
##
## Call:
## glm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean +
## O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI + Arizona +
## California + Colorado + Florida + Illinois + Kansas + Kentucky +
## Louisiana + Massachusetts + Michigan + Missouri + NewJersey +
## NewYork + NorthCarolina + Oregon + Pennsylvania + Texas +
## Virginia, family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2499 -0.0561 -0.0067 -0.0002 4.8204
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.240e+12 6.500e+13 0.081 0.93575
## year 5.838e+00 7.594e-01 7.688 1.49e-14 ***
## month -2.429e-02 7.435e-02 -0.327 0.74391
## NO2.Mean -2.429e-02 7.463e-02 -0.325 0.74482
## NO2.AQI -4.512e-03 4.068e-02 -0.111 0.91169
## O3.Mean -9.550e+01 4.475e+01 -2.134 0.03281 *
## O3.AQI 4.357e-02 1.674e-02 2.603 0.00924 **
## SO2.Mean -2.134e-01 1.860e-01 -1.148 0.25110
## SO2.AQI 3.253e-02 3.019e-02 1.077 0.28129
## CO.Mean 9.629e-03 2.391e+00 0.004 0.99679
## CO.AQI -2.702e-02 1.232e-01 -0.219 0.82639
## Arizona -5.240e+12 6.500e+13 -0.081 0.93575
## California -5.240e+12 6.500e+13 -0.081 0.93575
## Colorado -5.240e+12 6.500e+13 -0.081 0.93575
## Florida -5.240e+12 6.500e+13 -0.081 0.93575
## Illinois -5.240e+12 6.500e+13 -0.081 0.93575
## Kansas -5.240e+12 6.500e+13 -0.081 0.93575
## Kentucky -5.240e+12 6.500e+13 -0.081 0.93575
## Louisiana -5.240e+12 6.500e+13 -0.081 0.93575
## Massachusetts 4.498e+15 6.500e+13 69.207 < 2e-16 ***
## Michigan -4.509e+15 6.500e+13 -69.368 < 2e-16 ***
## Missouri -5.240e+12 6.500e+13 -0.081 0.93575
## NewJersey -5.240e+12 6.500e+13 -0.081 0.93575
## NewYork -5.240e+12 6.500e+13 -0.081 0.93575
## NorthCarolina -5.240e+12 6.500e+13 -0.081 0.93575
## Oregon -5.240e+12 6.500e+13 -0.081 0.93575
## Pennsylvania -5.240e+12 6.500e+13 -0.081 0.93575
## Texas -5.240e+12 6.500e+13 -0.081 0.93575
## Virginia -5.240e+12 6.500e+13 -0.081 0.93575
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1548.66 on 12657 degrees of freedom
## Residual deviance: 332.03 on 12629 degrees of freedom
## AIC: 390.03
##
## Number of Fisher Scoring iterations: 25
The extremely large, unstable coefficients and standard errors on the state indicators suggest (quasi-)complete separation, so we drop the state indicators and refit with only the time and pollutant covariates:
model_UFO_2 <- glm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI, data = trainData, family = "binomial")
summary(model_UFO_2)
##
## Call:
## glm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean +
## O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI, family = "binomial",
## data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4844 -0.0249 -0.0163 -0.0122 4.4700
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.714e+03 1.037e+03 -9.365 <2e-16 ***
## year 4.853e+00 5.185e-01 9.360 <2e-16 ***
## month -1.212e-02 6.682e-02 -0.181 0.8560
## NO2.Mean -2.810e-02 6.513e-02 -0.432 0.6661
## NO2.AQI 1.873e-02 3.360e-02 0.558 0.5771
## O3.Mean -9.108e+01 4.131e+01 -2.205 0.0275 *
## O3.AQI 3.629e-02 1.639e-02 2.214 0.0268 *
## SO2.Mean -8.350e-02 1.297e-01 -0.644 0.5198
## SO2.AQI 4.527e-02 2.155e-02 2.101 0.0357 *
## CO.Mean 2.487e-01 2.199e+00 0.113 0.9099
## CO.AQI -1.057e-01 1.188e-01 -0.890 0.3734
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1548.66 on 12657 degrees of freedom
## Residual deviance: 250.05 on 12647 degrees of freedom
## AIC: 272.05
##
## Number of Fisher Scoring iterations: 11
We then removed the covariates with p-values greater than 0.05 one at a time. The best model we could obtain from the logistic regression fits is shown below.
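This backward elimination can also be automated (a sketch; note that step() selects on AIC rather than on individual p-values, so it will not necessarily reproduce the hand-picked model below):
# start from the pollutant-only logistic model and drop terms by AIC
model_UFO_full <- glm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI,
                      data = trainData, family = "binomial")
model_UFO_step <- step(model_UFO_full, direction = "backward", trace = FALSE)
summary(model_UFO_step)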
model_UFO_4 <- glm(ET ~ year+SO2.AQI, data = trainData, family = "binomial")
summary(model_UFO_4)
##
## Call:
## glm(formula = ET ~ year + SO2.AQI, family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2969 -0.0230 -0.0148 -0.0127 4.3838
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.644e+03 1.019e+03 -9.468 < 2e-16 ***
## year 4.817e+00 5.090e-01 9.464 < 2e-16 ***
## SO2.AQI 3.806e-02 1.232e-02 3.089 0.00201 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1548.7 on 12657 degrees of freedom
## Residual deviance: 259.1 on 12655 degrees of freedom
## AIC: 265.1
##
## Number of Fisher Scoring iterations: 10
par(mfrow = c(2,2))
plot(model_UFO_4)
plot(ET ~ year + SO2.AQI, data = trainData, pch = 20)
Correlation Plot:
ufo <- trainData[c(2,9)]   # year and SO2.AQI, the two covariates in the final model
head(ufo)
## year SO2.AQI
## 13788 2001 11
## 811 2000 4
## 5752 2000 59
## 1649 2000 3
## 1380 2000 3
## 6090 2000 6
UFOcor <- cor(ufo)
par(mfrow = c(1,1))   # reset the plotting layout before drawing the correlation plot
corrplot(UFOcor, method = "number")
The correlation plot shows only a weak linear relationship between the two covariates. We also calculate the VIF values for confirmation.
vif(model_UFO_4)
## year SO2.AQI
## 1.200142 1.200142
Since the VIF values are close to 1, we can confirm that there is no problematic linear relationship between the covariates.
ggplot(trainData, aes(x = year, y = ET)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
ggplot(trainData, aes(x = SO2.AQI, y = ET)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
max(trainData[trainData$ET == 1, ]$SO2.AQI)
## [1] 61
This agrees with the plot: every observation with a sighting (ET = 1) has an SO2.AQI value of 61 or below.
We now predict on the training data, classify an observation as a sighting when its predicted probability exceeds 0.5, and build the confusion matrix:
pred_prob = predict(model_UFO_4, trainData, type = "response")
pred_value = 1 * (pred_prob > 0.5)
actual_value = trainData$ET
confusion_matrix = table(actual_value, pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## 0 12517 0
## 1 18 123
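Since caret is already loaded, confusionMatrix() can report the same table together with accuracy, sensitivity and specificity (a sketch; both arguments must be factors with matching levels):
confusionMatrix(factor(pred_value, levels = c(0, 1)),
                factor(actual_value, levels = c(0, 1)),
                positive = "1")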
We can calculate the misclassification error rate using the confusion matrix.
misclassification_error_rate = 1 - sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.001422026
As we can see, the misclassification error rate is very low (only 18 of the 12,658 training observations are misclassified), so the model describes this data set well.
Finally, we repeat the same checks on the held-out test data:
ggplot(testData, aes(x = year, y = ET)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
ggplot(testData, aes(x = SO2.AQI, y = ET)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
pred_prob_1 = predict(model_UFO_4, testData, type = "response")
pred_value_1 = 1 * (pred_prob_1 > 0.5)
actual_value_1 = testData$ET
confusion_matrix_1 = table(actual_value_1, pred_value_1)
confusion_matrix_1
## pred_value_1
## actual_value_1 0 1
## 0 3130 0
## 1 8 26
We can calculate the misclassification error rate using the confusion matrix.
misclassification_error_rate_1 = 1 - sum(diag(confusion_matrix_1)) / sum(confusion_matrix_1)
misclassification_error_rate_1
## [1] 0.002528445
The misclassification error rate on the test data is also very low (8 of the 3,164 test observations are misclassified), so the model predicts well on data it has not seen before.
Working with a data set that, to our knowledge, had not been analyzed before, we found that logistic regression is the best choice when the response variable is binary. Of all the covariates we began the analysis with, only the SO2 Air Quality Index and the year were found to influence whether a UFO sighting was reported.