Data-606-Final-Project---NYC-Air-Quality-Data.utf8.md

title: “Data 606 - Final Project”

author: “Sufian”

date: “11/17/2019”

output: html_document

Research Question

Are there any trend lines and correlations between the 4 observed variables over time?
If so, we can then use one or more to perform as predictors in a regression analysis to predict Ozone

level

There are 154 observations on 6 variables (but only first 4 were actually used for regression purposes):

Ozone numeric Ozone (ppb)
Solar.R numeric Solar R (lang)
Wind numeric Wind (mph)
Temp numeric Temperature (degrees F)
Month numeric Month (1–12)
Day numeric Day of month (1–31)

Data gathering Details:

-Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

-Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

-Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours

at Central Park

-Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

-Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.

Cases

Each case represents an observations of all the 4 variables in a given day. There are 154 observations

Data Collection

The data were obtained from the New York State Department of Conservation (ozone data) and the National

Weather Service (meteorological data)

Type of study

This was an observational study

Data Source

References Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont,

CA: Wadsworth.

Response

The response variables is Ozone levels (it is numerical)

Explanatory

The predictor variables to be tested are: Solar, Temperature and Wind which are all numerical

Relevant summary Statistics

Trendlining key variables to see if there’s any trends
Perform standard statistical summaries (box plots & normality checks)
Check for outliers, and implement several outlier capping strategies (corrective actions, if needed)
Run a collerational study and perform regression
Check and test for power of regression via ANOVAs (example: significance of slope vis T-stat or p-values)

Load all the libraries

# Load necessary libraries
require(ggthemes)
library(tidyverse)
library(magrittr)
library(stringr)
library(tidyr)
library(dplyr)
library(reshape2)
library("mice")
library('e1071')

Load the data from Github

Original dataset contains a bunch of NAs.

In this project, We will deal with NAs two different ways:

The simple (if its in the minority) is to simply zeroed out the NAs with numerical zeros
Or replacing NAs with monthly average

We shall see which strategy is better and utilzed the “better” results to perform our correlational

and linear regression studies; finally, the resultant linear model will form the basis of our predictive tool

url <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'


# Reading & Loading data
df <- read.csv(file = url ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)

glimpse(df)

## Observations: 153
## Variables: 7
## $ X       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, ...
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256,...
## $ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6...
## $ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68...
## $ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...

Preprocess the dataset

Let’s begin by finding which attributes have missing values. We then need to impute those missing

values(NA), which we will be doing simply by replacing NA with zeros first. Let’s begin!

col<- mapply(anyNA,df) # apply function anyNA() on all columns of airquality dataset
col

##       X   Ozone Solar.R    Wind    Temp   Month     Day 
##   FALSE    TRUE    TRUE   FALSE   FALSE   FALSE   FALSE

The output shows that only Ozone and Solar.R attributes have NA i.e. some missing values.

Method A (Replacing NA with Zeros)

head(df,n=3)

##   X Ozone Solar.R Wind Temp Month Day
## 1 1    41     190  7.4   67     5   1
## 2 2    36     118  8.0   72     5   2
## 3 3    12     149 12.6   74     5   3

# First run is replacing NA with zeros 
df <- df %>% mutate_all(funs(replace_na(.,0))) %>%  # replacing NAs with Zero
           select(-c(1))  # drop first irrelevant column from futher analysis

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

head(df,n=2)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2

Wide to long format using melt function

This is to make data analysis more effecient

# Melt:  Wide to long format
df1 <- melt(df, id.vars = c("Month", "Day"),
  variable.name = "climate_type", 
  value.name = "climate_value")

head(df1,n=5)

##   Month Day climate_type climate_value
## 1     5   1        Ozone            41
## 2     5   2        Ozone            36
## 3     5   3        Ozone            12
## 4     5   4        Ozone            18
## 5     5   5        Ozone             0

Bar plots of the climate variables from May 1973 to Sept 1973; to see trends, if any.

  ggplot(df1,aes(x=Month, y=climate_value,fill = climate_type)) + 
  geom_bar(stat = "identity", position = "dodge") +
  xlab("climate variables") + ylab("Climate Values") +
  ggtitle("Climate Variables from May - Sept of 1973") + ylim(0, 400)+
   theme_excel()

Function for Detecting climate variable outliers

Visualization and Mathematical Methods:

For Visualization Methods, Boxplot with range 1.5 and Histogram with break 15 were used to know more

about the data. For instance, Quantile Capping Method was used to detect the outliers(Mathematically) in

the data for each variable

#This function detects Outliers by quantile the capping method
outdetect <- function(c,w=1.5)
{
  h <- w*IQR(c,na.rm = T)
  q <- quantile(c,probs=c(.25, .75),na.rm = T)
  if(length(which(q[1]-h>c))==0)
    cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile\n")
  else
    cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile","on rows",which(q[1]-h>c),"and the values are",boxplot.stats(c)$out,"\n")
  if(length(which(q[2]+h<c))==0)
    cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile\n")
  else
    cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile","on rows",which(q[2]+h<c),"and the values are",boxplot.stats(c)$out,"\n")
}

Individual OZONE, Solar R, Temp and Wind Charts to vizualize the different distribution sets

ozone_dat <- df1 %>% 
             filter(climate_type=='Ozone')

head(ozone_dat) #extracting only ozone

##   Month Day climate_type climate_value
## 1     5   1        Ozone            41
## 2     5   2        Ozone            36
## 3     5   3        Ozone            12
## 4     5   4        Ozone            18
## 5     5   5        Ozone             0
## 6     5   6        Ozone            28

Solar_dat <- df1 %>% 
             filter(climate_type=='Solar.R')

head(Solar_dat) #extracting only Solar Radiation

##   Month Day climate_type climate_value
## 1     5   1      Solar.R           190
## 2     5   2      Solar.R           118
## 3     5   3      Solar.R           149
## 4     5   4      Solar.R           313
## 5     5   5      Solar.R             0
## 6     5   6      Solar.R             0

Wind_dat <- df1 %>% 
             filter(climate_type=='Wind')

head(Wind_dat ) #extracting only Wind_dat

##   Month Day climate_type climate_value
## 1     5   1         Wind           7.4
## 2     5   2         Wind           8.0
## 3     5   3         Wind          12.6
## 4     5   4         Wind          11.5
## 5     5   5         Wind          14.3
## 6     5   6         Wind          14.9

Temp_dat <- df1 %>% 
             filter(climate_type=='Temp')

head(Temp_dat ) #extracting only Wind_dat

##   Month Day climate_type climate_value
## 1     5   1         Temp            67
## 2     5   2         Temp            72
## 3     5   3         Temp            74
## 4     5   4         Temp            62
## 5     5   5         Temp            56
## 6     5   6         Temp            66

#boxplots & Histogram of Climate Variables
par(mfrow=c(1,2))

p <- ggplot(df1, aes(y=climate_value,x=climate_type,fill=climate_type))+geom_boxplot()+
  ggtitle("Boxplot of Climate Variables for Sep 1973")+
    theme_excel()
p <- p + theme(axis.text = element_text(size = 10,angle =45, hjust = 1))

p

pm <- ggplot(df1, aes(x=climate_value)) +
  geom_histogram(fill="yellow", alpha=1,binwidth = 13,position="identity",color='blue')+facet_grid( .~climate_type)+
  labs(title="Climate histogram plots",x="Climate Variables", y = "Frequency")

pm

Quantifying Outliers, if any: {Ozone, Solar Radiation, Wind & Temperature}

my_string <- "Ozone Outliers:"
print(my_string, quote = FALSE)

## [1] Ozone Outliers:

outdetect(ozone_dat$climate_value)

## There are 0 observations below the 1st quantile
## There are 6 observations above the 3rd quantile on rows 30 62 99 101 117 121 and the values are 115 135 122 110 168 118

Ozone OUtlier Observation:

6 outliers were found

my_string <- "Solar Radiation Outliers:"
print(my_string, quote = FALSE)

## [1] Solar Radiation Outliers:

outdetect(Solar_dat$climate_value)

## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile

Solar OUtlier Observation:

None were found

my_string <- "Wind Outliers:"
print(my_string, quote = FALSE)

## [1] Wind Outliers:

outdetect(Wind_dat$climate_value)

## There are 0 observations below the 1st quantile
## There are 3 observations above the 3rd quantile on rows 9 18 48 and the values are 20.1 18.4 20.7

Wind OUtlier Observation:

3 outliers were found

my_string <- "Temp Outliers:"
print(my_string, quote = FALSE)

## [1] Temp Outliers:

outdetect(Temp_dat$climate_value)

## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile

Wind OUtlier Observation:

None were found

Outliers Treatment

Outliers by definition are rare events (should be). The simplest approach is to remove them and carry on with the

analysis but capping method can also be used.

Percentile Capping is a method of imputing the outlier values by:

Replacing those observations outside the lower limit with the value of 5th percentile and those that lie above the

upper limit, with the value of 95th percentile of the same dataset.

# Function to cap outliers
outcap <- function(x)
{
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
sum(x > (qnt[2] + H))
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
x<<-x
}

Outlier replacement (Data cleansing): Before and After capping outliers are provided below

This is applied only to Ozone and Wind since they were both found to contain outliers, see above analysis

Data <- df
# Data cleansing
outcap(ozone_dat$climate_value)
Data$Ozone<- x
outcap(Wind_dat$climate_value)
Data$Wind<- x
Data[c(9,18,48,30,62,99,101,117,121),c(1,3)]

##     Ozone Wind
## 9       8 15.5
## 18      6 15.5
## 48     37 15.5
## 30     97  5.7
## 62     97  4.1
## 99     97  4.0
## 101    97  8.0
## 117    97  3.4
## 121    97  2.3

Data Transformation and Analysis

Function to view if distributions of datasets are normal or not

#vizualization tool for comparing normality distributions
Viz <- function(x){
  h1 <- hist(x,col="yellow",main="Histogram",xlab="Variables");
  xfit<-seq(0,max(x),length=50) 
  yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
  yfit <- yfit*diff(h1$mids[1:2])*length(x)
  lines(xfit, yfit, col="blue", lwd=2)
}

View Ozone distribution

mean_ozone <- ozone_dat %>% 
             summarise(mean(climate_value))
meanozone <- mean_ozone[1,1] # mean of ozone

sd_ozone <- ozone_dat %>% 
             summarise(sd(climate_value))
sdozone <- sd_ozone[1,1] # std dev. of ozone

hist(ozone_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:150
y <- dnorm(x = x, mean = meanozone, sd = sdozone)
lines(x = x, y = y, col = "blue")

Viz(ozone_dat$climate_value)

Ozone is right skewed

(A) “Normalization Ozone thru shifting and log transforms”

# shifting ozone by -15 to make it more "normal"
check <- abs(ozone_dat$climate_value-15)

Ozone_T <- log(check)

Viz(Ozone_T)

Q-Q Plot

After shifting the values and logging it, Ozone is now more normally distributed; its

definitely a step in the right direction

par(mfrow=c(1,2))
qqnorm(ozone_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(ozone_dat$climate_value, col = 2)
qqnorm(Ozone_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Ozone_T , col = 2)

Q-Q Plot - Ozone

The after Q-Q plot showed a more normal distribution set than before, especially at the left tail end.

However, the middle can be further improved for sure, and let’s see if that can be improved by Method (b)

View Solar distribution

mean_solar <- Solar_dat %>% 
             summarise(mean(climate_value))
meansolar <- mean_solar[1,1] # mean of solar

sd_solar <- Solar_dat %>% 
             summarise(sd(climate_value))
sdsolar <- sd_solar[1,1] # Std dev of solar

hist(Solar_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:350
y <- dnorm(x = x, mean = meansolar, sd = sdsolar)
lines(x = x, y = y, col = "blue")

Viz(Solar_dat$climate_value)

Solar transformation - squaring

check <- abs(Solar_dat$climate_value)^2

Solar_T <- (check)

Viz(Solar_T)

par(mfrow=c(1,2))
qqnorm(Solar_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_dat$climate_value, col = 2)
qqnorm(Solar_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_T , col = 2)

Q-Q Plot - Solar Radiation

The after Q-Q plot showed a more normal distribution set than before, especially the middle portion is above the

45 degree line. However again like Ozone, the left can be improved further, and let’s see if that can be

improved by Method (b)

(C) “Wind is already Normal - Q-Q plot ascertains it, see below; no need for transformation”

qqnorm(Wind_dat$climate_value ,pch=16,main="QQplot for Wind",xlab="Sample quantiles of Wind",ylab="Theoretical quantiles")
qqline(Wind_dat$climate_value , col = 2)

Wind looks quite good acutally - therefore, no transforms needed

shapiro.test(Wind_dat$climate_value) #Shapiro test for normality; making sure its really normal

## 
##  Shapiro-Wilk normality test
## 
## data:  Wind_dat$climate_value
## W = 0.98575, p-value = 0.1178

Shapiro Wilk test to make sure Wind is really Normal

Because P-value is greater than 0.05, we can be 95% confident that Wind data is normal for sure

(D) “Normalization of Temperature thru squaring transforms”

Viz(Temp_dat$climate_value)

Temperature is already quite normal, but lets see if squaring can make it better?

check <- abs(Temp_dat$climate_value)^2

temp_T <- (check)

Viz(temp_T)

Squaring, did make it slighlty better. Lets see the before and after Q-Q plots

par(mfrow=c(1,2))
qqnorm(Temp_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(Temp_dat$climate_value, col = 2)
qqnorm(temp_T,pch=16,main="After Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(temp_T, col = 2)

shapiro.test(temp_T) #Shapiro to check for normality

## 
##  Shapiro-Wilk normality test
## 
## data:  temp_T
## W = 0.98546, p-value = 0.1089

Shapiro Wilk test to make sure Temperature is really Normal

Because P-value is greater than 0.05, we can be 95% confident that Temperature data is normal

NOw that we have gone thru Method (a), lets try out Method (b) which is to impute average monthly data to replace

the NA (missing values). This is to see if we get better results and if so, we will proceed with regression

Method B (Replacing NA with average monthly)

url1 <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'


# Reading & Loading data
df_mavg <- read.csv(file = url1 ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)

glimpse(df_mavg)

## Observations: 153
## Variables: 7
## $ X       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, ...
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256,...
## $ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6...
## $ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68...
## $ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...

Since only Ozone and Solar needed improvement the most, we only imputed these 2 variables

# Impute monthly mean in Ozone
for (i in 1:nrow(df_mavg )){
  if(is.na(df_mavg [i,"Ozone"])){
    df_mavg [i,"Ozone"]<- mean(df_mavg [which(df_mavg [,"Month"]==df_mavg [i,"Month"]),"Ozone"],na.rm = TRUE)
  }
# Impute monthly mean in Solar.R
    if(is.na(df_mavg [i,"Solar.R"])){
    df_mavg [i,"Solar.R"]<- mean(df_mavg [which(df_mavg [,"Month"]==df_mavg [i,"Month"]),"Solar.R"],na.rm = TRUE)
  }
  
}

# First remove X column
df_mavg <- df_mavg %>% 
           select(-c(1))  # drop first irrelevant column from futher analysis

head(df_mavg ,n=5)

##      Ozone  Solar.R Wind Temp Month Day
## 1 41.00000 190.0000  7.4   67     5   1
## 2 36.00000 118.0000  8.0   72     5   2
## 3 12.00000 149.0000 12.6   74     5   3
## 4 18.00000 313.0000 11.5   62     5   4
## 5 23.61538 181.2963 14.3   56     5   5

Repeat the wide to long format like before

# Melt:  Wide to long format
df_mavg <- melt(df_mavg, id.vars = c("Month", "Day"),
  variable.name = "climate_type", 
  value.name = "climate_value")

head(df_mavg)

##   Month Day climate_type climate_value
## 1     5   1        Ozone      41.00000
## 2     5   2        Ozone      36.00000
## 3     5   3        Ozone      12.00000
## 4     5   4        Ozone      18.00000
## 5     5   5        Ozone      23.61538
## 6     5   6        Ozone      28.00000

Lets view the distributions after the averaging

#boxplots & Histogram of Climate Variables
par(mfrow=c(1,2))

p1 <- ggplot(df_mavg, aes(y=climate_value,x=climate_type,fill=climate_type))+geom_boxplot()+
  ggtitle("Post averaging: Boxplot of Climate Variables for Sep 1973")+
    theme_excel()
p1 <- p1 + theme(axis.text = element_text(size = 10,angle =45, hjust = 1))

p1

pm1 <- ggplot(df_mavg, aes(x=climate_value)) +
  geom_histogram(fill="yellow", alpha=1,binwidth = 13,position="identity",color='blue')+facet_grid( .~climate_type)+
  labs(title="Post averaging: Climate histogram plots",x="Climate Variables", y = "Frequency")

pm1

Checking for normality:

Starting with Ozone:
Let’s look at the distribution with the before and after Q-Q plots

ozone_dat1 <- df_mavg %>% 
             filter(climate_type=='Ozone')

head(ozone_dat1) #extracting only ozone

##   Month Day climate_type climate_value
## 1     5   1        Ozone      41.00000
## 2     5   2        Ozone      36.00000
## 3     5   3        Ozone      12.00000
## 4     5   4        Ozone      18.00000
## 5     5   5        Ozone      23.61538
## 6     5   6        Ozone      28.00000

Solar_dat1 <- df_mavg %>% 
             filter(climate_type=='Solar.R')

head(Solar_dat1) #extracting only Solar Radiation

##   Month Day climate_type climate_value
## 1     5   1      Solar.R      190.0000
## 2     5   2      Solar.R      118.0000
## 3     5   3      Solar.R      149.0000
## 4     5   4      Solar.R      313.0000
## 5     5   5      Solar.R      181.2963
## 6     5   6      Solar.R      181.2963

my_string <- "Ozone Outliers:"
print(my_string, quote = FALSE)

## [1] Ozone Outliers:

outdetect(ozone_dat1$climate_value)

## There are 0 observations below the 1st quantile
## There are 4 observations above the 3rd quantile on rows 62 99 117 121 and the values are 135 122 168 118

There are less outliers now, so we do not cap the outliers like before and see how the distribution looks

Ozone still looks like before
But did logging it helps with normality?

Viz(ozone_dat1$climate_value)

# Loggng it make it more "normal"
check1 <- abs(ozone_dat1$climate_value)

Ozone_T1 <- log(check1)

Viz(Ozone_T1)

logging it made it more left skewed

par(mfrow=c(1,2))
qqnorm(ozone_dat1$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(ozone_dat1$climate_value, col = 2)
qqnorm(Ozone_T1,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Ozone_T1 , col = 2)

Observation 1 (Ozone):

Imputation of NA with monthly averaging and logging it, made the Ozone much more normal. So we will settle

with this and use this for regression

check1 <- (abs(Solar_dat1$climate_value))

Solar_T1 <- (check1)

Viz(Solar_T1)

par(mfrow=c(1,2))
qqnorm(Solar_dat1$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_dat1$climate_value, col = 2)
qqnorm(Solar_T1 ,pch=16,main="After Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_T1 , col = 2)

Observation 2 (Solar):

Imputation of NA with monthly averaging only, made it more normal; the fact that the Q-Q plots of before and

after did not differ says it all. So we will settle with this and use this for regression

Now that we have all the variables normal or some what more normal:

We will perform the 2nd phase of this project, by first checking for correlation between variables
If correlational relationships exist, we will then perform linear regression

The Correlation Matrix - simplest way to check for correlation

# Easiest for me is to re-read it back again and storing in a new URL2
url2 <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'


# Reading & Loading data
df_mavg2 <- read.csv(file = url2 ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)


# Impute monthly mean in Ozone
for (i in 1:nrow(df_mavg2 )){
  if(is.na(df_mavg2 [i,"Ozone"])){
    df_mavg2 [i,"Ozone"]<- mean(df_mavg2 [which(df_mavg [,"Month"]==df_mavg2 [i,"Month"]),"Ozone"],na.rm = TRUE)
  }
# Impute monthly mean in Solar.R
    if(is.na(df_mavg2 [i,"Solar.R"])){
    df_mavg2 [i,"Solar.R"]<- mean(df_mavg2 [which(df_mavg2 [,"Month"]==df_mavg2 [i,"Month"]),"Solar.R"],na.rm = TRUE)
  }
  
}

# First remove X column
df_mavg3 <- df_mavg2 %>% 
           select(-c(1))  # drop first irrelevant column from futher analysis

df_mavg3<- df_mavg3 %>% 
              select(-c(5,6))

pairs(df_mavg3, panel = panel.smooth, main = "Air Quality Data")

Pearson correlation test

Numerical test on all the original variables and pick the highest one for linear regression

cor_oz_Solar <- cor.test(df_mavg3$Ozone, df_mavg3$Solar.R, 
                    method = "pearson")
cor_oz_Solar

## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Ozone and df_mavg3$Solar.R
## t = 3.763, df = 151, p-value = 0.0002398
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1406624 0.4314379
## sample estimates:
##       cor 
## 0.2928051

cor_oz_Wind <- cor.test(df_mavg3$Ozone, df_mavg3$Wind, 
                    method = "pearson")
cor_oz_Wind

## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Ozone and df_mavg3$Wind
## t = -7.417, df = 151, p-value = 8.034e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6242426 -0.3900557
## sample estimates:
##        cor 
## -0.5167504

cor_oz_Temp <- cor.test(df_mavg3$Ozone, df_mavg3$Temp, 
                    method = "pearson")
cor_oz_Temp

## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Ozone and df_mavg3$Temp
## t = 10.389, df = 151, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5425428 0.7295725
## sample estimates:
##       cor 
## 0.6456381

cor_Solar_Wind <- cor.test(df_mavg3$Solar.R, df_mavg3$Wind, 
                    method = "pearson")
cor_Solar_Wind

## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Solar.R and df_mavg3$Wind
## t = -0.64444, df = 151, p-value = 0.5203
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2093105  0.1071971
## sample estimates:
##         cor 
## -0.05237183

cor_Solar_Temp <- cor.test(df_mavg3$Solar.R, df_mavg3$Temp, 
                    method = "pearson")
cor_Solar_Temp

## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Solar.R and df_mavg3$Temp
## t = 3.3351, df = 151, p-value = 0.001073
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1077306 0.4038252
## sample estimates:
##       cor 
## 0.2619312

cor_wind_Temp <- cor.test(df_mavg3$Wind, df_mavg3$Temp, 
                    method = "pearson")
cor_wind_Temp

## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Wind and df_mavg3$Temp
## t = -6.3308, df = 151, p-value = 2.642e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5748874 -0.3227660
## sample estimates:
##        cor 
## -0.4579879

Observation 3:

Only Ozone vs. Wind and Ozone vs. Temperature has high correlation. Lets use these 2 for linear regression
But first we need to log Ozone as it was shown that the logs of Ozone made it more normal
We shall then use logs of Ozone to regressed against Temp and Wind

df_mavg4 <- df_mavg3 %>% 
  mutate(log_Ozone = log(Ozone)) # Addding the log of ozone columns

library("ggpubr")
ggscatter(df_mavg4, x = "Temp", y = "log_Ozone", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Temperature", ylab = "Log of Ozone")

ggscatter(df_mavg4, x = "Wind", y = "log_Ozone", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Wind", ylab = "Log of Ozone")

1) Building the Linear Regression Model => Temp vs. Log of Ozone

linearMod1 <- lm(Temp ~ log_Ozone , data=df_mavg4)  # build linear regression model on full data
print(linearMod1)

## 
## Call:
## lm(formula = Temp ~ log_Ozone, data = df_mavg4)
## 
## Coefficients:
## (Intercept)    log_Ozone  
##      48.951        8.388

summary(linearMod1)  # model summary

## 
## Call:
## lm(formula = Temp ~ log_Ozone, data = df_mavg4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4737  -3.3242   0.8293   4.6182  15.6758 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   48.951      2.566   19.08   <2e-16 ***
## log_Ozone      8.388      0.726   11.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.918 on 151 degrees of freedom
## Multiple R-squared:  0.4693, Adjusted R-squared:  0.4657 
## F-statistic: 133.5 on 1 and 151 DF,  p-value: < 2.2e-16

Observation 3a:

The equation is => log_ozone = 48.951 + 8.388*Temp

linearMod2 <- lm(Wind ~ log_Ozone, data=df_mavg4)  # build linear regression model on full data
print(linearMod2)

## 
## Call:
## lm(formula = Wind ~ log_Ozone, data = df_mavg4)
## 
## Coefficients:
## (Intercept)    log_Ozone  
##      17.189       -2.097

summary(linearMod2)  # model summary

## 
## Call:
## lm(formula = Wind ~ log_Ozone, data = df_mavg4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3970 -2.2181 -0.3149  1.9922 11.0819 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.1887     1.1639  14.769  < 2e-16 ***
## log_Ozone    -2.0966     0.3293  -6.366  2.2e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.138 on 151 degrees of freedom
## Multiple R-squared:  0.2116, Adjusted R-squared:  0.2064 
## F-statistic: 40.53 on 1 and 151 DF,  p-value: 2.204e-09

Observation 3b:

The equation is => log_ozone = 17.189 - 2.097*Wind

Predicting with Linear Models

Step 1: Create the training and test data

# Create Training and Test data -
set.seed(100)  # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(df_mavg4), 0.8*nrow(df_mavg4))  # row indices for training data
trainingData <- df_mavg4[trainingRowIndex, ]  # model training data
testData  <- df_mavg4[-trainingRowIndex, ]   # test data

Step 2: Fit the model on training data and predict on test data => Temp vs. Log of Ozone

# Build the model on training data
lmMod1 <- lm(log_Ozone~ Temp , data=trainingData)  # build the model
distPred1 <- predict(lmMod1, testData)  # predict log of ozone

Step 3: Review diagnostic measures (Log ozone vs Temp)

summary (lmMod1)  # model summary

## 
## Call:
## lm(formula = log_Ozone ~ Temp, data = trainingData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43361 -0.30422  0.02787  0.36438  1.49313 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.613376   0.381376  -1.608     0.11    
## Temp         0.052398   0.004863  10.774   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5195 on 120 degrees of freedom
## Multiple R-squared:  0.4917, Adjusted R-squared:  0.4875 
## F-statistic: 116.1 on 1 and 120 DF,  p-value: < 2.2e-16

Step 4: Calculate prediction accuracy and error rates

actuals_preds <- data.frame(cbind(actuals=testData$log_Ozone, predicteds=distPred1))  # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)  # 67.5%
correlation_accuracy

##              actuals predicteds
## actuals    1.0000000  0.6745797
## predicteds 0.6745797  1.0000000

head(actuals_preds)

##     actuals predicteds
## 1  3.713572   2.897265
## 6  3.332205   2.844868
## 11 1.945910   3.264049
## 13 2.397895   2.844868
## 21 0.000000   2.478084
## 33 3.382505   3.264049

Observation 4:

A simple correlation between the actuals and predicted values can be used as a form of accuracy measure.
Higher correlation accuracy implies that the actuals and predicted values have similar directional movement,

i.e. when the actuals values increase the predicted values also increase and vice-versa.

Here we have a correlation of 67.5%; Not bad and definitely could be improved but we stop this at this point
RSME = 0.7193

DMwR::regr.eval(actuals_preds$actuals, actuals_preds$predicteds)

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

##       mae       mse      rmse      mape 
## 0.5024264 0.5174255 0.7193229       Inf

Repeat Step 2 from above: Fit the model on training data and predict on test data => Wind vs. Log of Ozone

# Build the model on training data
lmMod2 <- lm(log_Ozone ~ Wind, data=trainingData)  # build the model
distPred2 <- predict(lmMod2, testData)  # predict log of ozone

Repeat Step 2 from above: Review diagnostic measures (Log ozone vs Wind)

summary (lmMod2)  # model summary

## 
## Call:
## lm(formula = log_Ozone ~ Wind, data = trainingData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10874 -0.41262  0.07006  0.44759  1.31907 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.55602    0.17140  26.581  < 2e-16 ***
## Wind        -0.10938    0.01622  -6.743 5.78e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6205 on 120 degrees of freedom
## Multiple R-squared:  0.2748, Adjusted R-squared:  0.2687 
## F-statistic: 45.46 on 1 and 120 DF,  p-value: 5.781e-10

actuals_preds1 <- data.frame(cbind(actuals=testData$log_Ozone, predicteds=distPred2))  # make actuals_predicteds dataframe.
correlation_accuracy1 <- cor(actuals_preds1)  # 28.6%
correlation_accuracy1

##              actuals predicteds
## actuals    1.0000000  0.2863958
## predicteds 0.2863958  1.0000000

head(actuals_preds1)

##     actuals predicteds
## 1  3.713572   3.746608
## 6  3.332205   2.926254
## 11 1.945910   3.801298
## 13 2.397895   3.549723
## 21 0.000000   3.495032
## 33 3.382505   3.495032

DMwR::regr.eval(actuals_preds1$actuals, actuals_preds1$predicteds)

##       mae       mse      rmse      mape 
## 0.6474125 0.8251163 0.9083591       Inf

Observation 5:

RSME = 0.908 is higher than the Temp vs. Ozone
Here we have a really low correlation of 28.6%; definitely could be much improved, but we stop here. We

therefore recommend that the Temp vs. Ozone be the linear model for prediction going forward; since

it had the best correlational relationship amongst all the pertinent predictior variables.

But before we end here, lets double check if the 80-20 split was truly representative of the test data?

Suppose, the model predicts satisfactorily on the 20% split (test data), is that enough to believe that this

model will perform equally well all the time?

This is where k- Fold Cross validation can help

We can ascertain that the linear model we pick, i.e, Temp vs log of Ozone is a good model to use based

on the 80-20 split by performing K-Fold Cross Validation

K-Fold Cross Validation provides rigorous testing; that is it checks if the model equation performs equally

well, when trained and tested on different distinct chunks of data. Let’s take a peek!

Temp vs. log of ozone

library(DAAG)
cvResults <- suppressWarnings(CVlm(data=df_mavg4, form.lm=log_Ozone ~ Temp, m=5, dots=FALSE, seed=29, legend.pos="bottomright",  printit=FALSE, main="Small symbols are predicted values while bigger ones are actuals."));  # performs the CV

attr(cvResults, 'ms')

## [1] 0.323749

cvResults1 <- suppressWarnings(CVlm(data=df_mavg4, form.lm=log_Ozone ~ Wind, m=5, dots=FALSE, seed=29, legend.pos="bottomright",  printit=FALSE, main="Small symbols are predicted values while bigger ones are actuals."));  # performs the CV

attr(cvResults1, 'ms')

## [1] 0.4746991

Observation 6:

What to look for in the K-Fold charts?

To quickly check for model accuracy, the dashed lines should be as parallel as it can be and the small and

big symbols are not dispersed for any one particular color

The Mean Squared Error (“MS”) for Temp vs log of Ozone is still smaller than Wind vs. log of Ozone. This is

the average squared error for the different folds put together

Summary:

There were 4 variables; Ozone (Response variable), Solar Radiation, Wind and Temperatures that was tracked and

both a statistical and a predictive linear model was performed. These climate data were taken in a finite period

of time; May 1,1973 (a Tuesday) to September 30, 1973. The data were taken through a fairly rigorous statistical

process to check for normality and outliers (correction if needed). In addition post data cleansing,

correlational study was performed across all the variables and only the highest correlated variables were used to

predict the response variable; Ozone levels. After corrective actions, the log of Ozone relative to temperature

were found to be the best correlated pairs. Wind were the other variable that statistically proved to have the

2nd highest predictive power. The linear model based on Temperature vs. log of ozone was then used to train and

tested. A two-step process was taken to verify for model accuracy; simple 80-20 split of train vs. test data

sets and the more robust K-Fold Cross Validation procedure. Both methodlogies reinforced the fact that Temp vs

Ozone was the best (in this sample set) to build predictive analytics off from. Finally, The linear equation

was identified to be log_ozone = 48.951 + 8.388*Temp. Lastly, since ozone was logged to obtained better

normality, it should be noted that in the prediction phase, anti-logs of the log_ozone response variables must be

undertaken, to bring it back to the original scale.