title: “Data 606 - Final Project”

author: “Sufian”

date: “11/17/2019”

output: html_document


Research Question

level


There are 154 observations on 6 variables (but only first 4 were actually used for regression purposes):

Data gathering Details:

-Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

-Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

-Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours

at Central Park

-Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

-Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.


Cases

Each case represents an observations of all the 4 variables in a given day. There are 154 observations


Data Collection

The data were obtained from the New York State Department of Conservation (ozone data) and the National

Weather Service (meteorological data)


Type of study

This was an observational study


Data Source

References Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont,

CA: Wadsworth.


Response

The response variables is Ozone levels (it is numerical)


Explanatory

The predictor variables to be tested are: Solar, Temperature and Wind which are all numerical


Relevant summary Statistics


Load all the libraries

# Load necessary libraries
require(ggthemes)
library(tidyverse)
library(magrittr)
library(stringr)
library(tidyr)
library(dplyr)
library(reshape2)
library("mice")
library('e1071')

Load the data from Github

Original dataset contains a bunch of NAs.


  1. The simple (if its in the minority) is to simply zeroed out the NAs with numerical zeros

  2. Or replacing NAs with monthly average

and linear regression studies; finally, the resultant linear model will form the basis of our predictive tool


url <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'


# Reading & Loading data
df <- read.csv(file = url ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)

glimpse(df)
## Observations: 153
## Variables: 7
## $ X       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, ...
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256,...
## $ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6...
## $ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68...
## $ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...

Preprocess the dataset

values(NA), which we will be doing simply by replacing NA with zeros first. Let’s begin!

col<- mapply(anyNA,df) # apply function anyNA() on all columns of airquality dataset
col 
##       X   Ozone Solar.R    Wind    Temp   Month     Day 
##   FALSE    TRUE    TRUE   FALSE   FALSE   FALSE   FALSE

The output shows that only Ozone and Solar.R attributes have NA i.e. some missing values.


Method A (Replacing NA with Zeros)


head(df,n=3)
##   X Ozone Solar.R Wind Temp Month Day
## 1 1    41     190  7.4   67     5   1
## 2 2    36     118  8.0   72     5   2
## 3 3    12     149 12.6   74     5   3
# First run is replacing NA with zeros 
df <- df %>% mutate_all(funs(replace_na(.,0))) %>%  # replacing NAs with Zero
           select(-c(1))  # drop first irrelevant column from futher analysis
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
head(df,n=2)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2

Wide to long format using melt function

# Melt:  Wide to long format
df1 <- melt(df, id.vars = c("Month", "Day"),
  variable.name = "climate_type", 
  value.name = "climate_value")

head(df1,n=5)
##   Month Day climate_type climate_value
## 1     5   1        Ozone            41
## 2     5   2        Ozone            36
## 3     5   3        Ozone            12
## 4     5   4        Ozone            18
## 5     5   5        Ozone             0

Function for Detecting climate variable outliers

Visualization and Mathematical Methods:

For Visualization Methods, Boxplot with range 1.5 and Histogram with break 15 were used to know more

about the data. For instance, Quantile Capping Method was used to detect the outliers(Mathematically) in

the data for each variable

#This function detects Outliers by quantile the capping method
outdetect <- function(c,w=1.5)
{
  h <- w*IQR(c,na.rm = T)
  q <- quantile(c,probs=c(.25, .75),na.rm = T)
  if(length(which(q[1]-h>c))==0)
    cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile\n")
  else
    cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile","on rows",which(q[1]-h>c),"and the values are",boxplot.stats(c)$out,"\n")
  if(length(which(q[2]+h<c))==0)
    cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile\n")
  else
    cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile","on rows",which(q[2]+h<c),"and the values are",boxplot.stats(c)$out,"\n")
}

Individual OZONE, Solar R, Temp and Wind Charts to vizualize the different distribution sets

ozone_dat <- df1 %>% 
             filter(climate_type=='Ozone')

head(ozone_dat) #extracting only ozone
##   Month Day climate_type climate_value
## 1     5   1        Ozone            41
## 2     5   2        Ozone            36
## 3     5   3        Ozone            12
## 4     5   4        Ozone            18
## 5     5   5        Ozone             0
## 6     5   6        Ozone            28
Solar_dat <- df1 %>% 
             filter(climate_type=='Solar.R')

head(Solar_dat) #extracting only Solar Radiation
##   Month Day climate_type climate_value
## 1     5   1      Solar.R           190
## 2     5   2      Solar.R           118
## 3     5   3      Solar.R           149
## 4     5   4      Solar.R           313
## 5     5   5      Solar.R             0
## 6     5   6      Solar.R             0
Wind_dat <- df1 %>% 
             filter(climate_type=='Wind')

head(Wind_dat ) #extracting only Wind_dat 
##   Month Day climate_type climate_value
## 1     5   1         Wind           7.4
## 2     5   2         Wind           8.0
## 3     5   3         Wind          12.6
## 4     5   4         Wind          11.5
## 5     5   5         Wind          14.3
## 6     5   6         Wind          14.9
Temp_dat <- df1 %>% 
             filter(climate_type=='Temp')

head(Temp_dat ) #extracting only Wind_dat 
##   Month Day climate_type climate_value
## 1     5   1         Temp            67
## 2     5   2         Temp            72
## 3     5   3         Temp            74
## 4     5   4         Temp            62
## 5     5   5         Temp            56
## 6     5   6         Temp            66
#boxplots & Histogram of Climate Variables
par(mfrow=c(1,2))

p <- ggplot(df1, aes(y=climate_value,x=climate_type,fill=climate_type))+geom_boxplot()+
  ggtitle("Boxplot of Climate Variables for Sep 1973")+
    theme_excel()
p <- p + theme(axis.text = element_text(size = 10,angle =45, hjust = 1))

p

pm <- ggplot(df1, aes(x=climate_value)) +
  geom_histogram(fill="yellow", alpha=1,binwidth = 13,position="identity",color='blue')+facet_grid( .~climate_type)+
  labs(title="Climate histogram plots",x="Climate Variables", y = "Frequency")

pm

Quantifying Outliers, if any: {Ozone, Solar Radiation, Wind & Temperature}

my_string <- "Ozone Outliers:"
print(my_string, quote = FALSE)
## [1] Ozone Outliers:
outdetect(ozone_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 6 observations above the 3rd quantile on rows 30 62 99 101 117 121 and the values are 115 135 122 110 168 118

Ozone OUtlier Observation:

my_string <- "Solar Radiation Outliers:"
print(my_string, quote = FALSE)
## [1] Solar Radiation Outliers:
outdetect(Solar_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile

Solar OUtlier Observation:

my_string <- "Wind Outliers:"
print(my_string, quote = FALSE)
## [1] Wind Outliers:
outdetect(Wind_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 3 observations above the 3rd quantile on rows 9 18 48 and the values are 20.1 18.4 20.7

Wind OUtlier Observation:

my_string <- "Temp Outliers:"
print(my_string, quote = FALSE)
## [1] Temp Outliers:
outdetect(Temp_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile

Wind OUtlier Observation:


Outliers Treatment

Outliers by definition are rare events (should be). The simplest approach is to remove them and carry on with the

analysis but capping method can also be used.

Percentile Capping is a method of imputing the outlier values by:

  • Replacing those observations outside the lower limit with the value of 5th percentile and those that lie above the

upper limit, with the value of 95th percentile of the same dataset.

# Function to cap outliers
outcap <- function(x)
{
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
sum(x > (qnt[2] + H))
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
x<<-x
}

Outlier replacement (Data cleansing): Before and After capping outliers are provided below

  • This is applied only to Ozone and Wind since they were both found to contain outliers, see above analysis
Data <- df
# Data cleansing
outcap(ozone_dat$climate_value)
Data$Ozone<- x
outcap(Wind_dat$climate_value)
Data$Wind<- x
Data[c(9,18,48,30,62,99,101,117,121),c(1,3)]
##     Ozone Wind
## 9       8 15.5
## 18      6 15.5
## 48     37 15.5
## 30     97  5.7
## 62     97  4.1
## 99     97  4.0
## 101    97  8.0
## 117    97  3.4
## 121    97  2.3

Data Transformation and Analysis

#vizualization tool for comparing normality distributions
Viz <- function(x){
  h1 <- hist(x,col="yellow",main="Histogram",xlab="Variables");
  xfit<-seq(0,max(x),length=50) 
  yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
  yfit <- yfit*diff(h1$mids[1:2])*length(x)
  lines(xfit, yfit, col="blue", lwd=2)
}

View Ozone distribution

mean_ozone <- ozone_dat %>% 
             summarise(mean(climate_value))
meanozone <- mean_ozone[1,1] # mean of ozone

sd_ozone <- ozone_dat %>% 
             summarise(sd(climate_value))
sdozone <- sd_ozone[1,1] # std dev. of ozone

hist(ozone_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:150
y <- dnorm(x = x, mean = meanozone, sd = sdozone)
lines(x = x, y = y, col = "blue")

Viz(ozone_dat$climate_value)


Ozone is right skewed


(A) “Normalization Ozone thru shifting and log transforms”

# shifting ozone by -15 to make it more "normal"
check <- abs(ozone_dat$climate_value-15)

Ozone_T <- log(check)

Viz(Ozone_T)

Q-Q Plot

After shifting the values and logging it, Ozone is now more normally distributed; its

definitely a step in the right direction

par(mfrow=c(1,2))
qqnorm(ozone_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(ozone_dat$climate_value, col = 2)
qqnorm(Ozone_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Ozone_T , col = 2)

Q-Q Plot - Ozone

The after Q-Q plot showed a more normal distribution set than before, especially at the left tail end.

However, the middle can be further improved for sure, and let’s see if that can be improved by Method (b)


View Solar distribution

mean_solar <- Solar_dat %>% 
             summarise(mean(climate_value))
meansolar <- mean_solar[1,1] # mean of solar

sd_solar <- Solar_dat %>% 
             summarise(sd(climate_value))
sdsolar <- sd_solar[1,1] # Std dev of solar

hist(Solar_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:350
y <- dnorm(x = x, mean = meansolar, sd = sdsolar)
lines(x = x, y = y, col = "blue")

Viz(Solar_dat$climate_value)

Solar transformation - squaring

check <- abs(Solar_dat$climate_value)^2

Solar_T <- (check)

Viz(Solar_T)

par(mfrow=c(1,2))
qqnorm(Solar_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_dat$climate_value, col = 2)
qqnorm(Solar_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_T , col = 2)


Q-Q Plot - Solar Radiation

The after Q-Q plot showed a more normal distribution set than before, especially the middle portion is above the

45 degree line. However again like Ozone, the left can be improved further, and let’s see if that can be

improved by Method (b)


(C) “Wind is already Normal - Q-Q plot ascertains it, see below; no need for transformation”

qqnorm(Wind_dat$climate_value ,pch=16,main="QQplot for Wind",xlab="Sample quantiles of Wind",ylab="Theoretical quantiles")
qqline(Wind_dat$climate_value , col = 2)

Wind looks quite good acutally - therefore, no transforms needed

shapiro.test(Wind_dat$climate_value) #Shapiro test for normality; making sure its really normal
## 
##  Shapiro-Wilk normality test
## 
## data:  Wind_dat$climate_value
## W = 0.98575, p-value = 0.1178

Shapiro Wilk test to make sure Wind is really Normal

Because P-value is greater than 0.05, we can be 95% confident that Wind data is normal for sure


(D) “Normalization of Temperature thru squaring transforms”

Viz(Temp_dat$climate_value)

Temperature is already quite normal, but lets see if squaring can make it better?

check <- abs(Temp_dat$climate_value)^2

temp_T <- (check)

Viz(temp_T)

Squaring, did make it slighlty better. Lets see the before and after Q-Q plots

par(mfrow=c(1,2))
qqnorm(Temp_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(Temp_dat$climate_value, col = 2)
qqnorm(temp_T,pch=16,main="After Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(temp_T, col = 2)

shapiro.test(temp_T) #Shapiro to check for normality
## 
##  Shapiro-Wilk normality test
## 
## data:  temp_T
## W = 0.98546, p-value = 0.1089

Shapiro Wilk test to make sure Temperature is really Normal

Because P-value is greater than 0.05, we can be 95% confident that Temperature data is normal


NOw that we have gone thru Method (a), lets try out Method (b) which is to impute average monthly data to replace

the NA (missing values). This is to see if we get better results and if so, we will proceed with regression

Method B (Replacing NA with average monthly)

url1 <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'


# Reading & Loading data
df_mavg <- read.csv(file = url1 ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)

glimpse(df_mavg)
## Observations: 153
## Variables: 7
## $ X       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, ...
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256,...
## $ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6...
## $ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68...
## $ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...

Since only Ozone and Solar needed improvement the most, we only imputed these 2 variables

# Impute monthly mean in Ozone
for (i in 1:nrow(df_mavg )){
  if(is.na(df_mavg [i,"Ozone"])){
    df_mavg [i,"Ozone"]<- mean(df_mavg [which(df_mavg [,"Month"]==df_mavg [i,"Month"]),"Ozone"],na.rm = TRUE)
  }
# Impute monthly mean in Solar.R
    if(is.na(df_mavg [i,"Solar.R"])){
    df_mavg [i,"Solar.R"]<- mean(df_mavg [which(df_mavg [,"Month"]==df_mavg [i,"Month"]),"Solar.R"],na.rm = TRUE)
  }
  
}

# First remove X column
df_mavg <- df_mavg %>% 
           select(-c(1))  # drop first irrelevant column from futher analysis

head(df_mavg ,n=5)
##      Ozone  Solar.R Wind Temp Month Day
## 1 41.00000 190.0000  7.4   67     5   1
## 2 36.00000 118.0000  8.0   72     5   2
## 3 12.00000 149.0000 12.6   74     5   3
## 4 18.00000 313.0000 11.5   62     5   4
## 5 23.61538 181.2963 14.3   56     5   5

Repeat the wide to long format like before

# Melt:  Wide to long format
df_mavg <- melt(df_mavg, id.vars = c("Month", "Day"),
  variable.name = "climate_type", 
  value.name = "climate_value")

head(df_mavg)
##   Month Day climate_type climate_value
## 1     5   1        Ozone      41.00000
## 2     5   2        Ozone      36.00000
## 3     5   3        Ozone      12.00000
## 4     5   4        Ozone      18.00000
## 5     5   5        Ozone      23.61538
## 6     5   6        Ozone      28.00000

Lets view the distributions after the averaging

#boxplots & Histogram of Climate Variables
par(mfrow=c(1,2))

p1 <- ggplot(df_mavg, aes(y=climate_value,x=climate_type,fill=climate_type))+geom_boxplot()+
  ggtitle("Post averaging: Boxplot of Climate Variables for Sep 1973")+
    theme_excel()
p1 <- p1 + theme(axis.text = element_text(size = 10,angle =45, hjust = 1))

p1

pm1 <- ggplot(df_mavg, aes(x=climate_value)) +
  geom_histogram(fill="yellow", alpha=1,binwidth = 13,position="identity",color='blue')+facet_grid( .~climate_type)+
  labs(title="Post averaging: Climate histogram plots",x="Climate Variables", y = "Frequency")

pm1

Checking for normality:

ozone_dat1 <- df_mavg %>% 
             filter(climate_type=='Ozone')

head(ozone_dat1) #extracting only ozone
##   Month Day climate_type climate_value
## 1     5   1        Ozone      41.00000
## 2     5   2        Ozone      36.00000
## 3     5   3        Ozone      12.00000
## 4     5   4        Ozone      18.00000
## 5     5   5        Ozone      23.61538
## 6     5   6        Ozone      28.00000
Solar_dat1 <- df_mavg %>% 
             filter(climate_type=='Solar.R')

head(Solar_dat1) #extracting only Solar Radiation
##   Month Day climate_type climate_value
## 1     5   1      Solar.R      190.0000
## 2     5   2      Solar.R      118.0000
## 3     5   3      Solar.R      149.0000
## 4     5   4      Solar.R      313.0000
## 5     5   5      Solar.R      181.2963
## 6     5   6      Solar.R      181.2963
my_string <- "Ozone Outliers:"
print(my_string, quote = FALSE)
## [1] Ozone Outliers:
outdetect(ozone_dat1$climate_value)
## There are 0 observations below the 1st quantile
## There are 4 observations above the 3rd quantile on rows 62 99 117 121 and the values are 135 122 168 118

There are less outliers now, so we do not cap the outliers like before and see how the distribution looks

Viz(ozone_dat1$climate_value)

# Loggng it make it more "normal"
check1 <- abs(ozone_dat1$climate_value)

Ozone_T1 <- log(check1)

Viz(Ozone_T1)

logging it made it more left skewed

par(mfrow=c(1,2))
qqnorm(ozone_dat1$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(ozone_dat1$climate_value, col = 2)
qqnorm(Ozone_T1,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Ozone_T1 , col = 2)

Observation 1 (Ozone):

with this and use this for regression

check1 <- (abs(Solar_dat1$climate_value))

Solar_T1 <- (check1)

Viz(Solar_T1)

par(mfrow=c(1,2))
qqnorm(Solar_dat1$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_dat1$climate_value, col = 2)
qqnorm(Solar_T1 ,pch=16,main="After Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_T1 , col = 2)

Observation 2 (Solar):

after did not differ says it all. So we will settle with this and use this for regression


Now that we have all the variables normal or some what more normal:


The Correlation Matrix - simplest way to check for correlation

# Easiest for me is to re-read it back again and storing in a new URL2
url2 <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'


# Reading & Loading data
df_mavg2 <- read.csv(file = url2 ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)


# Impute monthly mean in Ozone
for (i in 1:nrow(df_mavg2 )){
  if(is.na(df_mavg2 [i,"Ozone"])){
    df_mavg2 [i,"Ozone"]<- mean(df_mavg2 [which(df_mavg [,"Month"]==df_mavg2 [i,"Month"]),"Ozone"],na.rm = TRUE)
  }
# Impute monthly mean in Solar.R
    if(is.na(df_mavg2 [i,"Solar.R"])){
    df_mavg2 [i,"Solar.R"]<- mean(df_mavg2 [which(df_mavg2 [,"Month"]==df_mavg2 [i,"Month"]),"Solar.R"],na.rm = TRUE)
  }
  
}

# First remove X column
df_mavg3 <- df_mavg2 %>% 
           select(-c(1))  # drop first irrelevant column from futher analysis

df_mavg3<- df_mavg3 %>% 
              select(-c(5,6))

pairs(df_mavg3, panel = panel.smooth, main = "Air Quality Data")

Pearson correlation test

cor_oz_Solar <- cor.test(df_mavg3$Ozone, df_mavg3$Solar.R, 
                    method = "pearson")
cor_oz_Solar 
## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Ozone and df_mavg3$Solar.R
## t = 3.763, df = 151, p-value = 0.0002398
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1406624 0.4314379
## sample estimates:
##       cor 
## 0.2928051
cor_oz_Wind <- cor.test(df_mavg3$Ozone, df_mavg3$Wind, 
                    method = "pearson")
cor_oz_Wind
## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Ozone and df_mavg3$Wind
## t = -7.417, df = 151, p-value = 8.034e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6242426 -0.3900557
## sample estimates:
##        cor 
## -0.5167504
cor_oz_Temp <- cor.test(df_mavg3$Ozone, df_mavg3$Temp, 
                    method = "pearson")
cor_oz_Temp
## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Ozone and df_mavg3$Temp
## t = 10.389, df = 151, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5425428 0.7295725
## sample estimates:
##       cor 
## 0.6456381
cor_Solar_Wind <- cor.test(df_mavg3$Solar.R, df_mavg3$Wind, 
                    method = "pearson")
cor_Solar_Wind 
## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Solar.R and df_mavg3$Wind
## t = -0.64444, df = 151, p-value = 0.5203
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2093105  0.1071971
## sample estimates:
##         cor 
## -0.05237183
cor_Solar_Temp <- cor.test(df_mavg3$Solar.R, df_mavg3$Temp, 
                    method = "pearson")
cor_Solar_Temp 
## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Solar.R and df_mavg3$Temp
## t = 3.3351, df = 151, p-value = 0.001073
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1077306 0.4038252
## sample estimates:
##       cor 
## 0.2619312
cor_wind_Temp <- cor.test(df_mavg3$Wind, df_mavg3$Temp, 
                    method = "pearson")
cor_wind_Temp 
## 
##  Pearson's product-moment correlation
## 
## data:  df_mavg3$Wind and df_mavg3$Temp
## t = -6.3308, df = 151, p-value = 2.642e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5748874 -0.3227660
## sample estimates:
##        cor 
## -0.4579879

Observation 3:

df_mavg4 <- df_mavg3 %>% 
  mutate(log_Ozone = log(Ozone)) # Addding the log of ozone columns
library("ggpubr")
ggscatter(df_mavg4, x = "Temp", y = "log_Ozone", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Temperature", ylab = "Log of Ozone")

ggscatter(df_mavg4, x = "Wind", y = "log_Ozone", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Wind", ylab = "Log of Ozone")

1) Building the Linear Regression Model => Temp vs. Log of Ozone

linearMod1 <- lm(Temp ~ log_Ozone , data=df_mavg4)  # build linear regression model on full data
print(linearMod1)
## 
## Call:
## lm(formula = Temp ~ log_Ozone, data = df_mavg4)
## 
## Coefficients:
## (Intercept)    log_Ozone  
##      48.951        8.388
summary(linearMod1)  # model summary
## 
## Call:
## lm(formula = Temp ~ log_Ozone, data = df_mavg4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4737  -3.3242   0.8293   4.6182  15.6758 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   48.951      2.566   19.08   <2e-16 ***
## log_Ozone      8.388      0.726   11.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.918 on 151 degrees of freedom
## Multiple R-squared:  0.4693, Adjusted R-squared:  0.4657 
## F-statistic: 133.5 on 1 and 151 DF,  p-value: < 2.2e-16

Observation 3a:

The equation is => log_ozone = 48.951 + 8.388*Temp

linearMod2 <- lm(Wind ~ log_Ozone, data=df_mavg4)  # build linear regression model on full data
print(linearMod2)
## 
## Call:
## lm(formula = Wind ~ log_Ozone, data = df_mavg4)
## 
## Coefficients:
## (Intercept)    log_Ozone  
##      17.189       -2.097
summary(linearMod2)  # model summary
## 
## Call:
## lm(formula = Wind ~ log_Ozone, data = df_mavg4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3970 -2.2181 -0.3149  1.9922 11.0819 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.1887     1.1639  14.769  < 2e-16 ***
## log_Ozone    -2.0966     0.3293  -6.366  2.2e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.138 on 151 degrees of freedom
## Multiple R-squared:  0.2116, Adjusted R-squared:  0.2064 
## F-statistic: 40.53 on 1 and 151 DF,  p-value: 2.204e-09

Observation 3b:

The equation is => log_ozone = 17.189 - 2.097*Wind


Predicting with Linear Models


Step 1: Create the training and test data

# Create Training and Test data -
set.seed(100)  # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(df_mavg4), 0.8*nrow(df_mavg4))  # row indices for training data
trainingData <- df_mavg4[trainingRowIndex, ]  # model training data
testData  <- df_mavg4[-trainingRowIndex, ]   # test data

Step 2: Fit the model on training data and predict on test data => Temp vs. Log of Ozone

# Build the model on training data
lmMod1 <- lm(log_Ozone~ Temp , data=trainingData)  # build the model
distPred1 <- predict(lmMod1, testData)  # predict log of ozone

Step 3: Review diagnostic measures (Log ozone vs Temp)

summary (lmMod1)  # model summary
## 
## Call:
## lm(formula = log_Ozone ~ Temp, data = trainingData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43361 -0.30422  0.02787  0.36438  1.49313 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.613376   0.381376  -1.608     0.11    
## Temp         0.052398   0.004863  10.774   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5195 on 120 degrees of freedom
## Multiple R-squared:  0.4917, Adjusted R-squared:  0.4875 
## F-statistic: 116.1 on 1 and 120 DF,  p-value: < 2.2e-16

Step 4: Calculate prediction accuracy and error rates

actuals_preds <- data.frame(cbind(actuals=testData$log_Ozone, predicteds=distPred1))  # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)  # 67.5%
correlation_accuracy
##              actuals predicteds
## actuals    1.0000000  0.6745797
## predicteds 0.6745797  1.0000000
head(actuals_preds)
##     actuals predicteds
## 1  3.713572   2.897265
## 6  3.332205   2.844868
## 11 1.945910   3.264049
## 13 2.397895   2.844868
## 21 0.000000   2.478084
## 33 3.382505   3.264049

Observation 4:

i.e. when the actuals values increase the predicted values also increase and vice-versa.

DMwR::regr.eval(actuals_preds$actuals, actuals_preds$predicteds)
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
##       mae       mse      rmse      mape 
## 0.5024264 0.5174255 0.7193229       Inf

Repeat Step 2 from above: Fit the model on training data and predict on test data => Wind vs. Log of Ozone

# Build the model on training data
lmMod2 <- lm(log_Ozone ~ Wind, data=trainingData)  # build the model
distPred2 <- predict(lmMod2, testData)  # predict log of ozone

Repeat Step 2 from above: Review diagnostic measures (Log ozone vs Wind)

summary (lmMod2)  # model summary
## 
## Call:
## lm(formula = log_Ozone ~ Wind, data = trainingData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10874 -0.41262  0.07006  0.44759  1.31907 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.55602    0.17140  26.581  < 2e-16 ***
## Wind        -0.10938    0.01622  -6.743 5.78e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6205 on 120 degrees of freedom
## Multiple R-squared:  0.2748, Adjusted R-squared:  0.2687 
## F-statistic: 45.46 on 1 and 120 DF,  p-value: 5.781e-10
actuals_preds1 <- data.frame(cbind(actuals=testData$log_Ozone, predicteds=distPred2))  # make actuals_predicteds dataframe.
correlation_accuracy1 <- cor(actuals_preds1)  # 28.6%
correlation_accuracy1
##              actuals predicteds
## actuals    1.0000000  0.2863958
## predicteds 0.2863958  1.0000000
head(actuals_preds1)
##     actuals predicteds
## 1  3.713572   3.746608
## 6  3.332205   2.926254
## 11 1.945910   3.801298
## 13 2.397895   3.549723
## 21 0.000000   3.495032
## 33 3.382505   3.495032
DMwR::regr.eval(actuals_preds1$actuals, actuals_preds1$predicteds)
##       mae       mse      rmse      mape 
## 0.6474125 0.8251163 0.9083591       Inf

Observation 5:

therefore recommend that the Temp vs. Ozone be the linear model for prediction going forward; since

it had the best correlational relationship amongst all the pertinent predictior variables.

But before we end here, lets double check if the 80-20 split was truly representative of the test data?

model will perform equally well all the time?


This is where k- Fold Cross validation can help

on the 80-20 split by performing K-Fold Cross Validation

well, when trained and tested on different distinct chunks of data. Let’s take a peek!

Temp vs. log of ozone

library(DAAG)
cvResults <- suppressWarnings(CVlm(data=df_mavg4, form.lm=log_Ozone ~ Temp, m=5, dots=FALSE, seed=29, legend.pos="bottomright",  printit=FALSE, main="Small symbols are predicted values while bigger ones are actuals."));  # performs the CV

attr(cvResults, 'ms')
## [1] 0.323749
cvResults1 <- suppressWarnings(CVlm(data=df_mavg4, form.lm=log_Ozone ~ Wind, m=5, dots=FALSE, seed=29, legend.pos="bottomright",  printit=FALSE, main="Small symbols are predicted values while bigger ones are actuals."));  # performs the CV

attr(cvResults1, 'ms')
## [1] 0.4746991

Observation 6:

What to look for in the K-Fold charts?

big symbols are not dispersed for any one particular color

the average squared error for the different folds put together


Summary:

There were 4 variables; Ozone (Response variable), Solar Radiation, Wind and Temperatures that was tracked and

both a statistical and a predictive linear model was performed. These climate data were taken in a finite period

of time; May 1,1973 (a Tuesday) to September 30, 1973. The data were taken through a fairly rigorous statistical

process to check for normality and outliers (correction if needed). In addition post data cleansing,

correlational study was performed across all the variables and only the highest correlated variables were used to

predict the response variable; Ozone levels. After corrective actions, the log of Ozone relative to temperature

were found to be the best correlated pairs. Wind were the other variable that statistically proved to have the

2nd highest predictive power. The linear model based on Temperature vs. log of ozone was then used to train and

tested. A two-step process was taken to verify for model accuracy; simple 80-20 split of train vs. test data

sets and the more robust K-Fold Cross Validation procedure. Both methodlogies reinforced the fact that Temp vs

Ozone was the best (in this sample set) to build predictive analytics off from. Finally, The linear equation

was identified to be log_ozone = 48.951 + 8.388*Temp. Lastly, since ozone was logged to obtained better

normality, it should be noted that in the prediction phase, anti-logs of the log_ozone response variables must be

undertaken, to bring it back to the original scale.