Introduction:

We will be using a Kaggle data set that explores how UFO sightings in the US relate to air quality at the time of the sighting. The data set contains dates, locations, and climate information on four main gases (NO2, O3, SO2, and CO) for the UFO sightings that occurred in the USA, together with "noise" rows, i.e. observations with no UFO sighting.

Packages Required:

To analyze this data, we will use the following R packages:

library(dplyr)                  # data manipulation verbs
library(rio)                    # import() for reading data files
library(PerformanceAnalytics)   # chart.Correlation()
library(corrplot)               # correlation plots
library(caret)                  # model training and evaluation utilities
library(tidyverse)              # ggplot2, tidyr and friends
library(car)                    # vif()
library(ggplot2)                # plotting (also loaded by tidyverse)

The key dplyr verbs we use are:

  • mutate() adds new variables that are functions of existing variables.
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.

These all combine naturally with group_by(), which allows you to perform any operation "by group". dplyr is designed to abstract over how the data is stored.
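
As a small, hypothetical illustration of how these verbs chain together (using the UFOdata object imported in the next section):

# Hypothetical example: mean NO2 AQI by state, for rows with a recorded sighting (ET == 1)
UFOdata %>%
  filter(ET == 1) %>%
  group_by(state) %>%
  summarise(mean_NO2_AQI = mean(NO2.AQI)) %>%
  arrange(desc(mean_NO2_AQI))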

Data Preparation:

UFOdata <- import("C:/Users/Ankeeta/Desktop/UFO_Data.csv", header = TRUE)

To view the structure of the data:

str(UFOdata)
## 'data.frame':    15822 obs. of  14 variables:
##  $ city    : chr  "kingman" "tacna" "grand canyon" "benson" ...
##  $ state   : chr  "Arizona" "Arizona" "Arizona" "Arizona" ...
##  $ month   : int  7 7 3 4 5 4 11 2 10 1 ...
##  $ year    : int  2000 2000 2006 2005 2003 2005 2008 2001 2005 2004 ...
##  $ hour    : int  21 21 19 20 19 18 20 23 17 7 ...
##  $ NO2.Mean: num  7.38 7.38 11.79 13.33 14.42 ...
##  $ NO2.AQI : int  14 14 42 42 38 39 35 39 43 35 ...
##  $ O3.Mean : num  0.038 0.038 0.0335 0.0249 0.0368 ...
##  $ O3.AQI  : int  45 45 41 34 47 40 28 33 34 14 ...
##  $ SO2.Mean: num  1 1 0.0417 0.375 0.8333 ...
##  $ SO2.AQI : int  1 1 1 3 3 6 1 4 6 7 ...
##  $ CO.Mean : num  0.196 0.196 0.242 0.283 0.379 ...
##  $ CO.AQI  : int  3 3 7 6 7 8 5 15 16 13 ...
##  $ ET      : int  1 1 1 1 1 1 1 1 1 1 ...

To view the overall summary of the data:

summary(UFOdata)
##      city              state               month             year     
##  Length:15822       Length:15822       Min.   : 1.000   Min.   :2000  
##  Class :character   Class :character   1st Qu.: 4.000   1st Qu.:2000  
##  Mode  :character   Mode  :character   Median : 7.000   Median :2000  
##                                        Mean   : 6.701   Mean   :2000  
##                                        3rd Qu.: 9.000   3rd Qu.:2000  
##                                        Max.   :12.000   Max.   :2008  
##       hour          NO2.Mean          NO2.AQI          O3.Mean       
##  Min.   : 0.00   Min.   :  0.000   Min.   :  0.00   Min.   :0.00000  
##  1st Qu.: 5.00   1st Qu.:  9.458   1st Qu.: 18.00   1st Qu.:0.01492  
##  Median :12.00   Median : 15.292   Median : 28.00   Median :0.02296  
##  Mean   :11.56   Mean   : 17.053   Mean   : 30.01   Mean   :0.02389  
##  3rd Qu.:18.00   3rd Qu.: 22.810   3rd Qu.: 40.00   3rd Qu.:0.03167  
##  Max.   :23.00   Max.   :139.542   Max.   :132.00   Max.   :0.09508  
##      O3.AQI          SO2.Mean         SO2.AQI          CO.Mean      
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :0.0000  
##  1st Qu.: 22.00   1st Qu.: 1.045   1st Qu.:  4.00   1st Qu.:0.2500  
##  Median : 31.00   Median : 2.458   Median : 10.00   Median :0.4417  
##  Mean   : 35.31   Mean   : 4.018   Mean   : 15.47   Mean   :0.5090  
##  3rd Qu.: 42.00   3rd Qu.: 5.500   3rd Qu.: 21.00   3rd Qu.:0.6750  
##  Max.   :207.00   Max.   :39.043   Max.   :172.00   Max.   :4.0375  
##      CO.AQI             ET         
##  Min.   : 0.000   Min.   :0.00000  
##  1st Qu.: 5.000   1st Qu.:0.00000  
##  Median : 7.000   Median :0.00000  
##  Mean   : 8.608   Mean   :0.01106  
##  3rd Qu.:11.000   3rd Qu.:0.00000  
##  Max.   :62.000   Max.   :1.00000

To view first few rows of the data:

head(UFOdata)
##           city   state month year hour NO2.Mean NO2.AQI  O3.Mean O3.AQI
## 1      kingman Arizona     7 2000   21  7.37500      14 0.038000     45
## 2        tacna Arizona     7 2000   21  7.37500      14 0.038000     45
## 3 grand canyon Arizona     3 2006   19 11.79167      42 0.033458     41
## 4       benson Arizona     4 2005   20 13.33333      42 0.024875     34
## 5       tucson Arizona     5 2003   19 14.41667      38 0.036833     47
## 6       peoria Arizona     4 2005   18 15.05833      39 0.027375     40
##   SO2.Mean SO2.AQI  CO.Mean CO.AQI ET
## 1 1.000000       1 0.195833      3  1
## 2 1.000000       1 0.195833      3  1
## 3 0.041667       1 0.241667      7  1
## 4 0.375000       3 0.283333      6  1
## 5 0.833333       3 0.379167      7  1
## 6 1.750000       6 0.570833      8  1
sum(is.na(UFOdata))
## [1] 0
From the output above we see that there are no missing values in our current dataset.
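
(If any values had been missing, a per-column breakdown such as the quick check below would help locate them.)

colSums(is.na(UFOdata))   # number of missing values in each column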

Visualizing Data:

# Plot a histogram for every numeric or integer column of a data frame
plotHistogram <- function(y)
{
  col.type <- sapply(y, class)
  for (a in 1:ncol(y))
  {
    if (col.type[a] == "numeric" | col.type[a] == "integer")
    {
      hist(y[, a],
           xlab = colnames(y)[a],
           main = colnames(y)[a])
    }
  }
}
par(mfrow = c(3,4))
plotHistogram(UFOdata)

UFOnumeric <- unlist(lapply(UFOdata, is.numeric))
head(UFOnumeric)
##     city    state    month     year     hour NO2.Mean 
##    FALSE    FALSE     TRUE     TRUE     TRUE     TRUE
UFOnum <- UFOdata[ , UFOnumeric]
head(UFOnum)
##   month year hour NO2.Mean NO2.AQI  O3.Mean O3.AQI SO2.Mean SO2.AQI
## 1     7 2000   21  7.37500      14 0.038000     45 1.000000       1
## 2     7 2000   21  7.37500      14 0.038000     45 1.000000       1
## 3     3 2006   19 11.79167      42 0.033458     41 0.041667       1
## 4     4 2005   20 13.33333      42 0.024875     34 0.375000       3
## 5     5 2003   19 14.41667      38 0.036833     47 0.833333       3
## 6     4 2005   18 15.05833      39 0.027375     40 1.750000       6
##    CO.Mean CO.AQI ET
## 1 0.195833      3  1
## 2 0.195833      3  1
## 3 0.241667      7  1
## 4 0.283333      6  1
## 5 0.379167      7  1
## 6 0.570833      8  1
chart.Correlation(UFOnum,histogram = TRUE, pch = 19)

barCount <- table(UFOdata$state)
par(mfrow = c(1,1))
barplot(barCount, main = "Sightings by state", xlab = "State")

Dealing with categorical variables:

Now we will create indicator variables for our "state" column, which has 18 different levels. To handle this, we create a separate 0/1 column for each state (level). This lets us include all the regions as covariates in the multiple linear regression model when testing their significance during model building and adequacy checking.

UFOdata$Arizona = ifelse(UFOdata$state == "Arizona", 1,0)
UFOdata$California = ifelse(UFOdata$state == "California", 1,0)
UFOdata$Colorado = ifelse(UFOdata$state == "Colorado", 1,0)
UFOdata$Florida = ifelse(UFOdata$state == "Florida", 1,0)
UFOdata$Illinois = ifelse(UFOdata$state == "Illinois", 1,0)
UFOdata$Kansas = ifelse(UFOdata$state == "Kansas", 1,0)
UFOdata$Kentucky = ifelse(UFOdata$state == "Kentucky", 1,0)
UFOdata$Louisiana = ifelse(UFOdata$state == "Louisiana", 1,0)
UFOdata$Massachusetts = ifelse(UFOdata$state == "Massachusetts", 1,0)
UFOdata$Michigan = ifelse(UFOdata$state == "Michigan", 1,0)
UFOdata$Missouri = ifelse(UFOdata$state == "Missouri", 1,0)
UFOdata$NewJersey = ifelse(UFOdata$state == "New Jersey", 1,0)
UFOdata$NewYork = ifelse(UFOdata$state == "New York", 1,0)
UFOdata$NorthCarolina = ifelse(UFOdata$state == "North Carolina", 1,0)
UFOdata$Oregon = ifelse(UFOdata$state == "Oregon", 1,0)
UFOdata$Pennsylvania = ifelse(UFOdata$state == "Pennsylvania", 1,0)
UFOdata$Texas = ifelse(UFOdata$state == "Texas", 1,0)
UFOdata$Virginia = ifelse(UFOdata$state == "Virginia", 1,0)
An indicator column has been created for each region, taking the value 1 for rows from that state and 0 otherwise.
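
(As an aside, the same indicator columns could be built in a short loop rather than line by line; a minimal sketch that reproduces the columns above, stripping spaces from multi-word state names to match the column names used later:)

states <- c("Arizona", "California", "Colorado", "Florida", "Illinois", "Kansas",
            "Kentucky", "Louisiana", "Massachusetts", "Michigan", "Missouri",
            "New Jersey", "New York", "North Carolina", "Oregon", "Pennsylvania",
            "Texas", "Virginia")
for (s in states) {
  UFOdata[[gsub(" ", "", s)]] <- ifelse(UFOdata$state == s, 1, 0)
}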

Split data into training and test data set:

We then split the data into training and test sets in an 80:20 ratio. The training set contains the known outcome, and the model learns from it so that it can later generalize to other data; the held-out test set is used to evaluate the model's predictions.

df = subset(UFOdata, select = -c(city,state))
trainData <- sample_frac(df, 0.8)
dim(trainData)
## [1] 12658    30
sid <- as.numeric(rownames(trainData))
testData <- df[-sid,]
dim(testData)
## [1] 3164   30
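
A note on reproducibility: the split above is unseeded, so it changes on every run, and recovering the test rows from rownames(trainData) relies on sample_frac() carrying the original row names along. A more explicit sketch of the same kind of 80:20 split, using a seed and row indices (set.seed(1) and the object names are arbitrary choices):

set.seed(1)                                           # arbitrary seed, for reproducibility
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train_set <- df[train_idx, ]                          # hypothetical alternative to trainData
test_set  <- df[-train_idx, ]                         # hypothetical alternative to testData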

Model Specification:

model_UFO <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI+Arizona+California+Colorado+Florida+Illinois+Kansas+Kentucky+Louisiana+Massachusetts+Michigan+Missouri+NewJersey+NewYork+NorthCarolina+Oregon+Pennsylvania+Texas+Virginia, data = trainData)
summary(model_UFO)
## 
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean + 
##     O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI + Arizona + 
##     California + Colorado + Florida + Illinois + Kansas + Kentucky + 
##     Louisiana + Massachusetts + Michigan + Missouri + NewJersey + 
##     NewYork + NorthCarolina + Oregon + Pennsylvania + Texas + 
##     Virginia, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09341 -0.00688 -0.00159  0.00339  1.05998 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -2.717e+02  1.924e+00 -141.196  < 2e-16 ***
## year           1.359e-01  9.621e-04  141.209  < 2e-16 ***
## month         -5.221e-04  1.930e-04   -2.705 0.006842 ** 
## NO2.Mean      -4.787e-04  1.509e-04   -3.173 0.001514 ** 
## NO2.AQI        5.908e-05  9.047e-05    0.653 0.513738    
## O3.Mean       -3.480e-01  1.040e-01   -3.347 0.000821 ***
## O3.AQI         9.850e-05  5.222e-05    1.886 0.059271 .  
## SO2.Mean      -1.471e-04  2.773e-04   -0.530 0.595800    
## SO2.AQI        6.838e-05  6.441e-05    1.062 0.288352    
## CO.Mean       -9.902e-03  4.784e-03   -2.070 0.038517 *  
## CO.AQI         3.646e-04  2.461e-04    1.482 0.138400    
## Arizona       -6.124e-02  3.744e-03  -16.357  < 2e-16 ***
## California    -5.228e-02  2.943e-03  -17.764  < 2e-16 ***
## Colorado      -2.110e-03  4.812e-03   -0.438 0.661053    
## Florida        6.691e-03  4.607e-03    1.452 0.146412    
## Illinois       5.044e-03  3.750e-03    1.345 0.178648    
## Kansas        -2.563e-03  4.891e-03   -0.524 0.600228    
## Kentucky      -3.975e-03  3.926e-03   -1.012 0.311408    
## Louisiana     -7.007e-04  4.614e-03   -0.152 0.879311    
## Massachusetts  7.331e-01  6.463e-02   11.343  < 2e-16 ***
## Michigan      -1.531e-03  6.112e-03   -0.250 0.802254    
## Missouri       8.903e-07  3.853e-03    0.000 0.999816    
## NewJersey     -4.488e-04  4.522e-03   -0.099 0.920941    
## NewYork        7.148e-03  3.433e-03    2.082 0.037383 *  
## NorthCarolina -3.308e-04  4.653e-03   -0.071 0.943328    
## Oregon         4.642e-02  6.494e-02    0.715 0.474757    
## Pennsylvania  -2.219e-03  2.911e-03   -0.762 0.445899    
## Texas          4.871e-03  3.562e-03    1.367 0.171520    
## Virginia              NA         NA       NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06454 on 12630 degrees of freedom
## Multiple R-squared:  0.6227, Adjusted R-squared:  0.6219 
## F-statistic: 772.1 on 27 and 12630 DF,  p-value: < 2.2e-16
model_UFO <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI, data = trainData)
summary(model_UFO)
## 
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean + 
##     O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13297 -0.00179  0.01100  0.01859  1.02734 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -2.491e+02  1.902e+00 -130.966  < 2e-16 ***
## year         1.245e-01  9.509e-04  130.984  < 2e-16 ***
## month       -4.730e-04  2.034e-04   -2.326 0.020057 *  
## NO2.Mean    -1.213e-03  1.510e-04   -8.035 1.02e-15 ***
## NO2.AQI      5.544e-04  9.183e-05    6.038 1.61e-09 ***
## O3.Mean     -7.365e-01  1.074e-01   -6.857 7.36e-12 ***
## O3.AQI       2.740e-04  5.386e-05    5.087 3.69e-07 ***
## SO2.Mean     1.115e-03  2.486e-04    4.484 7.40e-06 ***
## SO2.AQI      3.103e-04  6.179e-05    5.021 5.20e-07 ***
## CO.Mean     -1.714e-03  4.425e-03   -0.387 0.698498    
## CO.AQI      -8.701e-04  2.416e-04   -3.601 0.000318 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06825 on 12647 degrees of freedom
## Multiple R-squared:  0.5775, Adjusted R-squared:  0.5771 
## F-statistic:  1728 on 10 and 12647 DF,  p-value: < 2.2e-16

Starting from these two multiple linear regression models, we will now carry out parameter estimation to arrive at a parsimonious model.

Parameter Estimation:

model_UFO <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI+Arizona+California+Colorado+Florida+Illinois+Kansas+Kentucky+Louisiana+Michigan+Missouri+NewJersey+NewYork+NorthCarolina+Oregon+Pennsylvania+Texas, data = trainData)
summary(model_UFO)
## 
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean + 
##     O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI + Arizona + 
##     California + Colorado + Florida + Illinois + Kansas + Kentucky + 
##     Louisiana + Michigan + Missouri + NewJersey + NewYork + NorthCarolina + 
##     Oregon + Pennsylvania + Texas, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09333 -0.00688 -0.00165  0.00326  1.06014 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -2.723e+02  1.933e+00 -140.878  < 2e-16 ***
## year           1.362e-01  9.665e-04  140.891  < 2e-16 ***
## month         -4.871e-04  1.940e-04   -2.511 0.012040 *  
## NO2.Mean      -4.573e-04  1.516e-04   -3.016 0.002568 ** 
## NO2.AQI        5.568e-05  9.093e-05    0.612 0.540329    
## O3.Mean       -3.443e-01  1.045e-01   -3.294 0.000989 ***
## O3.AQI         9.622e-05  5.248e-05    1.834 0.066750 .  
## SO2.Mean      -1.260e-04  2.787e-04   -0.452 0.651090    
## SO2.AQI        6.484e-05  6.473e-05    1.002 0.316533    
## CO.Mean       -1.050e-02  4.808e-03   -2.184 0.028972 *  
## CO.AQI         3.720e-04  2.473e-04    1.504 0.132550    
## Arizona       -6.227e-02  3.761e-03  -16.554  < 2e-16 ***
## California    -5.324e-02  2.957e-03  -18.007  < 2e-16 ***
## Colorado      -2.804e-03  4.836e-03   -0.580 0.561978    
## Florida        6.026e-03  4.630e-03    1.302 0.193046    
## Illinois       4.132e-03  3.768e-03    1.097 0.272872    
## Kansas        -3.104e-03  4.916e-03   -0.631 0.527819    
## Kentucky      -4.921e-03  3.945e-03   -1.247 0.212311    
## Louisiana     -1.365e-03  4.637e-03   -0.294 0.768496    
## Michigan      -2.558e-03  6.142e-03   -0.416 0.677131    
## Missouri      -8.104e-04  3.871e-03   -0.209 0.834193    
## NewJersey     -1.338e-03  4.544e-03   -0.294 0.768460    
## NewYork        6.093e-03  3.450e-03    1.766 0.077366 .  
## NorthCarolina -1.043e-03  4.676e-03   -0.223 0.823453    
## Oregon         4.340e-02  6.527e-02    0.665 0.506113    
## Pennsylvania  -3.150e-03  2.925e-03   -1.077 0.281392    
## Texas          4.024e-03  3.580e-03    1.124 0.260943    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06486 on 12631 degrees of freedom
## Multiple R-squared:  0.6189, Adjusted R-squared:  0.6181 
## F-statistic: 788.8 on 26 and 12631 DF,  p-value: < 2.2e-16
We keep iterating this process, at each step removing a covariate that is not statistically significant (i.e. has p-value > 0.05) and refitting the model.
model_UFO_3 <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+Arizona+California+NewYork, data = trainData)
summary(model_UFO_3)
## 
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean + 
##     Arizona + California + NewYork, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09136 -0.00550 -0.00217  0.00240  1.06019 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -2.727e+02  1.919e+00 -142.067  < 2e-16 ***
## year         1.364e-01  9.597e-04  142.076  < 2e-16 ***
## month       -4.301e-04  1.902e-04   -2.261 0.023763 *  
## NO2.Mean    -5.530e-04  1.353e-04   -4.088 4.38e-05 ***
## NO2.AQI      1.299e-04  8.518e-05    1.525 0.127374    
## O3.Mean     -1.834e-01  5.454e-02   -3.362 0.000776 ***
## Arizona     -6.225e-02  2.475e-03  -25.156  < 2e-16 ***
## California  -5.377e-02  1.508e-03  -35.661  < 2e-16 ***
## NewYork      7.097e-03  2.507e-03    2.831 0.004650 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06488 on 12649 degrees of freedom
## Multiple R-squared:  0.6181, Adjusted R-squared:  0.6179 
## F-statistic:  2560 on 8 and 12649 DF,  p-value: < 2.2e-16

But if we choose this model, we believe it will be biased, since the regions Arizona, California, and New York account for a larger percentage of the observations than the other regions.
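
(For reference, this manual backward elimination could also be automated with R's built-in step(), which performs stepwise selection using AIC rather than raw p-values; a minimal sketch applied to the refit model_UFO at the start of this section, with trace = FALSE suppressing the per-step output:)

model_step <- step(model_UFO, direction = "backward", trace = FALSE)
summary(model_step)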

  • Parameter Estimation: Model 2 - We will create a new model with all the covariates except the regions. We then first remove the covariate CO.Mean, since its p-value is greater than 0.05.
model_UFO_2 <- lm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.AQI, data = trainData)
summary(model_UFO_2)
## 
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean + 
##     O3.AQI + SO2.Mean + SO2.AQI + CO.AQI, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13323 -0.00181  0.01096  0.01856  1.02745 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -2.491e+02  1.902e+00 -130.971  < 2e-16 ***
## year         1.245e-01  9.508e-04  130.988  < 2e-16 ***
## month       -4.675e-04  2.029e-04   -2.304   0.0212 *  
## NO2.Mean    -1.225e-03  1.478e-04   -8.284  < 2e-16 ***
## NO2.AQI      5.617e-04  8.985e-05    6.252 4.17e-10 ***
## O3.Mean     -7.370e-01  1.074e-01   -6.862 7.10e-12 ***
## O3.AQI       2.738e-04  5.385e-05    5.084 3.76e-07 ***
## SO2.Mean     1.110e-03  2.483e-04    4.470 7.88e-06 ***
## SO2.AQI      3.118e-04  6.166e-05    5.057 4.32e-07 ***
## CO.AQI      -9.532e-04  1.114e-04   -8.558  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06825 on 12648 degrees of freedom
## Multiple R-squared:  0.5775, Adjusted R-squared:  0.5772 
## F-statistic:  1921 on 9 and 12648 DF,  p-value: < 2.2e-16
So far, model 2 is the best multiple linear regression model for the training data. We choose model 2, which excludes the regions, since our main objective is to relate UFO sightings to air quality. Moreover, a model driven by the regions Arizona, California, and New York would be biased, since those regions have a larger percentage of observations than the others. Hence, we move forward and conduct model adequacy checks on model 2.

Model Adequacy Checking:

Model adequacy checking verifies that the LINE assumptions (Linearity, Independence, Normality, and Equal variance of the errors) are satisfied, i.e. that the model fits the data well. To do so, we examine residual plots and carry out residual analysis.

par(mfrow = c(2,2))
plot(model_UFO_2)

Analyzing the scatter plot of residuals against fitted values, we can see that the points are not evenly distributed around zero and do not show constant variance across the x-axis.

Analyzing the Q-Q plot above, we can see a heavy-tailed, positively skewed distribution, which means that the residuals are not normally distributed. A transformation of y, or of both x and y, would be the usual way to try to resolve this.

Transformation:

We tried to conduct a Box-Cox transformation; however, Box-Cox requires a strictly positive response, and our response variable ET takes the value 0 for most observations, so the transformation could not be applied. We therefore decided to go ahead with the multicollinearity check.
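
(For reference, the attempt looked roughly like the sketch below, assuming MASS for boxcox(); the direct call stops because ET is not strictly positive, and while shifting a 0/1 response by a small constant lets the code run, the resulting transformation is hard to interpret.)

library(MASS)           # provides boxcox()
# boxcox(model_UFO_2)   # fails: Box-Cox needs a strictly positive response, but ET contains zeros
boxcox(lm(ET + 0.001 ~ year + month + NO2.Mean + NO2.AQI + O3.Mean + O3.AQI +
            SO2.Mean + SO2.AQI + CO.AQI, data = trainData))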

Multicollinearity Check:

As part of model adequacy checking, we check for multicollinearity in order to understand the linear dependence among the covariates. We plot each covariate against the response variable and create a correlation plot as well.

par(mfrow = c(3,3))
plot(ET~year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.AQI, data = trainData, pch = 20)

ufo <- trainData[c(1:9,11)]
head(ufo)
##       month year hour  NO2.Mean NO2.AQI  O3.Mean O3.AQI SO2.Mean SO2.AQI
## 13788    12 2001   15 43.272727      56 0.004294     13 4.181818      11
## 811       4 2000    0 18.173913      28 0.019542     24 2.000000       4
## 5752      6 2000   17 14.875000      37 0.029125     38 5.625000      59
## 1649      8 2000    4 11.782609      17 0.018875     19 2.000000       3
## 1380     11 2000   18 11.869565      23 0.012375     19 1.681818       3
## 6090      4 2000   22  7.458333      13 0.027125     27 1.166667       6
##       CO.AQI
## 13788     36
## 811        7
## 5752       1
## 1649       7
## 1380       7
## 6090       6
UFOcor <- cor(ufo)
corrplot(UFOcor, method = "number")
par(mfrow = c(1,1))
corrplot(UFOcor, method = "number")

As we can see from these plots, there is some linear relationship between NO2.Mean and NO2.AQI, between O3.Mean and O3.AQI, and between SO2.Mean and SO2.AQI. Hence, we will use the variance inflation factor (VIF) to quantify the multicollinearity for each covariate.
vif(model_UFO_2)
##     year    month NO2.Mean  NO2.AQI  O3.Mean   O3.AQI SO2.Mean  SO2.AQI 
## 1.045339 1.061787 7.074337 6.139038 4.594494 3.985470 3.253810 3.157985 
##   CO.AQI 
## 1.820585
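
(For reference, the VIF of a covariate is 1 / (1 − R²), where R² comes from regressing that covariate on all the other covariates in the model; a manual check for NO2.Mean, which should roughly reproduce the value reported above:)

r2_NO2 <- summary(lm(NO2.Mean ~ year + month + NO2.AQI + O3.Mean + O3.AQI +
                       SO2.Mean + SO2.AQI + CO.AQI, data = trainData))$r.squared
1 / (1 - r2_NO2)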

Based on the multicollinearity results, we remove one covariate from each correlated pair, dropping the means of NO2, O3, and SO2 and keeping the corresponding AQIs:

model_UFO_4 <- lm(ET ~ year+month+NO2.AQI+O3.AQI+SO2.AQI+CO.AQI, data = trainData)
summary(model_UFO_4)
## 
## Call:
## lm(formula = ET ~ year + month + NO2.AQI + O3.AQI + SO2.AQI + 
##     CO.AQI, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11866  0.00072  0.01228  0.01770  1.03174 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -2.475e+02  1.901e+00 -130.186   <2e-16 ***
## year         1.238e-01  9.506e-04  130.199   <2e-16 ***
## month       -1.779e-04  1.993e-04   -0.893    0.372    
## NO2.AQI      2.611e-05  4.761e-05    0.548    0.583    
## O3.AQI       1.732e-05  2.811e-05    0.616    0.538    
## SO2.AQI      5.236e-04  3.688e-05   14.198   <2e-16 ***
## CO.AQI      -1.038e-03  1.072e-04   -9.685   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06853 on 12651 degrees of freedom
## Multiple R-squared:  0.5739, Adjusted R-squared:  0.5737 
## F-statistic:  2840 on 6 and 12651 DF,  p-value: < 2.2e-16

Let us now check what happens if we remove the AQIs instead of the Means:

model_UFO_5 <- lm(ET ~ year+month+NO2.Mean+O3.Mean+SO2.Mean+CO.AQI, data = trainData)
summary(model_UFO_5)
## 
## Call:
## lm(formula = ET ~ year + month + NO2.Mean + O3.Mean + SO2.Mean + 
##     CO.AQI, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12153  0.00072  0.01169  0.01704  1.02695 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -2.471e+02  1.902e+00 -129.904  < 2e-16 ***
## year         1.236e-01  9.511e-04  129.920  < 2e-16 ***
## month       -3.505e-04  2.016e-04   -1.739 0.082067 .  
## NO2.Mean    -2.678e-04  7.850e-05   -3.412 0.000647 ***
## O3.Mean     -1.462e-01  5.642e-02   -2.592 0.009565 ** 
## SO2.Mean     2.090e-03  1.502e-04   13.914  < 2e-16 ***
## CO.AQI      -9.294e-04  1.109e-04   -8.379  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06858 on 12651 degrees of freedom
## Multiple R-squared:  0.5733, Adjusted R-squared:  0.5731 
## F-statistic:  2833 on 6 and 12651 DF,  p-value: < 2.2e-16

This is a better model, but we still need to refit it after removing month, whose p-value is greater than 0.05; the following is the final model:

model_UFO_6 <- lm(ET ~ year+NO2.Mean+O3.Mean+SO2.Mean+CO.AQI, data = trainData)

Ideally, there should be no multicollinearity among these covariates now, so we check the VIF values for this model:

vif(model_UFO_6)
##     year NO2.Mean  O3.Mean SO2.Mean   CO.AQI 
## 1.035714 1.962613 1.223797 1.175899 1.782727

The VIF values for the covariates are all close to 1, so there is no meaningful multicollinearity among them, and this is our best linear model so far. We will plot the diagnostics for this model:

par(mfrow = c(2,2))
plot(model_UFO_6)

Even after removing or adding covariates and keeping only those with p-values below 0.05, the diagnostics still do not support an adequate linear regression model. So, we will check the validity of our model on the training data.

Model Validity:

We plot the fitted multiple linear regression over the training data set.

plot(trainData$year, trainData$ET, pch = 20, ylim = c(-1,2))
abline(model_UFO_6, col = "red")   # abline() draws the line using only the intercept and the year coefficient

Clearly, linear regression does not make sense in this situation, because some predictions fall below 0 or above 1, which is impossible for a binary response. Hence, multiple linear regression is not appropriate for our data set.
Our research led us to conclude that with a binary response variable and multiple covariates we should switch to logistic regression, which models the log-odds of a sighting, logit(p) = log(p / (1 - p)), and therefore keeps the predicted probabilities between 0 and 1.

Logistic Regression:

model_UFO <- glm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI+Arizona+California+Colorado+Florida+Illinois+Kansas+Kentucky+Louisiana+Massachusetts+Michigan+Missouri+NewJersey+NewYork+NorthCarolina+Oregon+Pennsylvania+Texas+Virginia, data = trainData, family = "binomial")
summary(model_UFO)
## 
## Call:
## glm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean + 
##     O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI + Arizona + 
##     California + Colorado + Florida + Illinois + Kansas + Kentucky + 
##     Louisiana + Massachusetts + Michigan + Missouri + NewJersey + 
##     NewYork + NorthCarolina + Oregon + Pennsylvania + Texas + 
##     Virginia, family = "binomial", data = trainData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2499  -0.0561  -0.0067  -0.0002   4.8204  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    5.240e+12  6.500e+13   0.081  0.93575    
## year           5.838e+00  7.594e-01   7.688 1.49e-14 ***
## month         -2.429e-02  7.435e-02  -0.327  0.74391    
## NO2.Mean      -2.429e-02  7.463e-02  -0.325  0.74482    
## NO2.AQI       -4.512e-03  4.068e-02  -0.111  0.91169    
## O3.Mean       -9.550e+01  4.475e+01  -2.134  0.03281 *  
## O3.AQI         4.357e-02  1.674e-02   2.603  0.00924 ** 
## SO2.Mean      -2.134e-01  1.860e-01  -1.148  0.25110    
## SO2.AQI        3.253e-02  3.019e-02   1.077  0.28129    
## CO.Mean        9.629e-03  2.391e+00   0.004  0.99679    
## CO.AQI        -2.702e-02  1.232e-01  -0.219  0.82639    
## Arizona       -5.240e+12  6.500e+13  -0.081  0.93575    
## California    -5.240e+12  6.500e+13  -0.081  0.93575    
## Colorado      -5.240e+12  6.500e+13  -0.081  0.93575    
## Florida       -5.240e+12  6.500e+13  -0.081  0.93575    
## Illinois      -5.240e+12  6.500e+13  -0.081  0.93575    
## Kansas        -5.240e+12  6.500e+13  -0.081  0.93575    
## Kentucky      -5.240e+12  6.500e+13  -0.081  0.93575    
## Louisiana     -5.240e+12  6.500e+13  -0.081  0.93575    
## Massachusetts  4.498e+15  6.500e+13  69.207  < 2e-16 ***
## Michigan      -4.509e+15  6.500e+13 -69.368  < 2e-16 ***
## Missouri      -5.240e+12  6.500e+13  -0.081  0.93575    
## NewJersey     -5.240e+12  6.500e+13  -0.081  0.93575    
## NewYork       -5.240e+12  6.500e+13  -0.081  0.93575    
## NorthCarolina -5.240e+12  6.500e+13  -0.081  0.93575    
## Oregon        -5.240e+12  6.500e+13  -0.081  0.93575    
## Pennsylvania  -5.240e+12  6.500e+13  -0.081  0.93575    
## Texas         -5.240e+12  6.500e+13  -0.081  0.93575    
## Virginia      -5.240e+12  6.500e+13  -0.081  0.93575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1548.66  on 12657  degrees of freedom
## Residual deviance:  332.03  on 12629  degrees of freedom
## AIC: 390.03
## 
## Number of Fisher Scoring iterations: 25
model_UFO_2 <- glm(ET ~ year+month+NO2.Mean+NO2.AQI+O3.Mean+O3.AQI+SO2.Mean+SO2.AQI+CO.Mean+CO.AQI, data = trainData, family = "binomial")
summary(model_UFO_2)
## 
## Call:
## glm(formula = ET ~ year + month + NO2.Mean + NO2.AQI + O3.Mean + 
##     O3.AQI + SO2.Mean + SO2.AQI + CO.Mean + CO.AQI, family = "binomial", 
##     data = trainData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4844  -0.0249  -0.0163  -0.0122   4.4700  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.714e+03  1.037e+03  -9.365   <2e-16 ***
## year         4.853e+00  5.185e-01   9.360   <2e-16 ***
## month       -1.212e-02  6.682e-02  -0.181   0.8560    
## NO2.Mean    -2.810e-02  6.513e-02  -0.432   0.6661    
## NO2.AQI      1.873e-02  3.360e-02   0.558   0.5771    
## O3.Mean     -9.108e+01  4.131e+01  -2.205   0.0275 *  
## O3.AQI       3.629e-02  1.639e-02   2.214   0.0268 *  
## SO2.Mean    -8.350e-02  1.297e-01  -0.644   0.5198    
## SO2.AQI      4.527e-02  2.155e-02   2.101   0.0357 *  
## CO.Mean      2.487e-01  2.199e+00   0.113   0.9099    
## CO.AQI      -1.057e-01  1.188e-01  -0.890   0.3734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1548.66  on 12657  degrees of freedom
## Residual deviance:  250.05  on 12647  degrees of freedom
## AIC: 272.05
## 
## Number of Fisher Scoring iterations: 11

We kept removing the covariates one at a time when their p-values were greater than 0.05. Below is the best model we could obtain from the logistic regression.

model_UFO_4 <- glm(ET ~ year+SO2.AQI, data = trainData, family = "binomial")
summary(model_UFO_4)
## 
## Call:
## glm(formula = ET ~ year + SO2.AQI, family = "binomial", data = trainData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2969  -0.0230  -0.0148  -0.0127   4.3838  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.644e+03  1.019e+03  -9.468  < 2e-16 ***
## year         4.817e+00  5.090e-01   9.464  < 2e-16 ***
## SO2.AQI      3.806e-02  1.232e-02   3.089  0.00201 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1548.7  on 12657  degrees of freedom
## Residual deviance:  259.1  on 12655  degrees of freedom
## AIC: 265.1
## 
## Number of Fisher Scoring iterations: 10
par(mfrow = c(2,2))
plot(model_UFO_4)

plot(ET~year+SO2.AQI, data = trainData, pch = 20)

Correlation Plot:

ufo <- trainData[c(2,9)]
head(ufo)
##       year SO2.AQI
## 13788 2001      11
## 811   2000       4
## 5752  2000      59
## 1649  2000       3
## 1380  2000       3
## 6090  2000       6
UFOcor <- cor(ufo)
corrplot(UFOcor, method = "number")
par(mfrow = c(1,1))
corrplot(UFOcor, method = "number")

As we can see, there is no strong linear relationship between the two covariates. We will also calculate the VIF values to confirm this.

vif(model_UFO_4)
##     year  SO2.AQI 
## 1.200142 1.200142

Since the VIF values are close to 1, we can confirm there is no problematic collinearity between the covariates.

  1. For Year:
ggplot(trainData, aes(x = year, y = ET)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)

From the graph above we can see that most of the UFO sightings occurred from 2003 to 2008, and that there were no UFO sightings in 2001 and 2002.
  2. For SO2.AQI:
ggplot(trainData, aes(x = SO2.AQI, y = ET)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)

As SO2.AQI increases, the number of UFO sightings decreases. It is also clear from the graph that beyond a certain maximum SO2.AQI value no more UFO sightings are recorded, so we calculate the maximum SO2.AQI value at which a UFO sighting was observed.
max(trainData[trainData$ET == "1",]$SO2.AQI)
## [1] 61

The value returned above agrees with the plotted graph.

  • Confusion Matrix - Train Data - We will use a confusion matrix to evaluate the prediction performance.
pred_prob = predict(model_UFO_4, trainData, type = "response")   # predicted probability of a sighting
pred_value = 1 * (pred_prob > 0.5)                               # classify as a sighting when probability > 0.5
actual_value = trainData$ET
confusion_matrix = table(actual_value, pred_value)
confusion_matrix
##             pred_value
## actual_value     0     1
##            0 12517     0
##            1    18   123

We can calculate the misclassification error rate using the confusion matrix.

misclassification_error_rate = 1 - sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.001422026

As we can see, the misclassification error rate is low, which suggests the model predicts well on the training data.
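
(Since only about 1% of the observations are sightings (the mean of ET in the summary above is about 0.011), the overall error rate is dominated by the majority class; the per-class rates below, read from the same confusion matrix, are a useful complementary check.)

sensitivity_train <- confusion_matrix["1", "1"] / sum(confusion_matrix["1", ])  # sightings correctly detected
specificity_train <- confusion_matrix["0", "0"] / sum(confusion_matrix["0", ])  # non-sightings correctly classified
sensitivity_train
specificity_train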

  • Model Use - We will now use the model we created to predict UFO sightings on the test data and check the predictions against the actual values. Since the logistic fit can only be plotted against one covariate at a time, we plot the test data against Year and against SO2.AQI separately.
  1. Year:
ggplot(testData, aes(x = year, y = ET)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)

  2. SO2.AQI:
ggplot(testData, aes(x = SO2.AQI, y = ET)) + geom_point() + stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)

  • Confusion Matrix - Test Data - We will use a confusion matrix to evaluate the prediction performance of our model on the test data.
pred_prob_1 = predict(model_UFO_4, testData, type = "response")
pred_value_1 = 1 * (pred_prob_1 > 0.5)
actua_value_1 = testData$ET
confusion_matrix_1 = table(actua_value_1,pred_value_1)
confusion_matrix_1
##              pred_value_1
## actua_value_1    0    1
##             0 3130    0
##             1    8   26

We can calculate the misclassification error rate using the confusion matrix.

misclassification_error_rate_1 = 1 - sum(diag(confusion_matrix_1)) / sum(confusion_matrix_1)
misclassification_error_rate_1
## [1] 0.002528445

As we can see, the misclassification error rate on the test data is also low, which suggests that our model generalizes well and is a good choice for predictions on this data set.

Conclusion:

Our findings with this previously unanalyzed data set taught us that logistic regression is the right choice when the response variable is binary. Of all the covariates we began our analysis with, only Year and the SO2 Air Quality Index were found to influence the probability of a UFO sighting.
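
As a final, hypothetical illustration of how the fitted model could be used, the predicted probability of a sighting can be computed for a chosen year and SO2 AQI value (the inputs below are arbitrary examples, not observations from the data set):

new_obs <- data.frame(year = 2007, SO2.AQI = 10)             # hypothetical inputs
predict(model_UFO_4, newdata = new_obs, type = "response")   # predicted probability of a sighting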