title: “Data 606 - Final Project”
author: “Sufian”
date: “11/17/2019”
output: html_document
Research Question
Are there any trend lines and correlations between the 4 observed variables over time?
If so, we can then use one or more to perform as predictors in a regression analysis to predict Ozone
level
There are 154 observations on 6 variables (but only first 4 were actually used for regression purposes):
Ozone numeric Ozone (ppb)
Solar.R numeric Solar R (lang)
Wind numeric Wind (mph)
Temp numeric Temperature (degrees F)
Month numeric Month (1–12)
Day numeric Day of month (1–31)
-Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.
-Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
-Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours
at Central Park
-Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
-Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
Each case represents an observations of all the 4 variables in a given day. There are 154 observations
The data were obtained from the New York State Department of Conservation (ozone data) and the National
Weather Service (meteorological data)
This was an observational study
References Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont,
CA: Wadsworth.
The response variables is Ozone levels (it is numerical)
The predictor variables to be tested are: Solar, Temperature and Wind which are all numerical
Trendlining key variables to see if there’s any trends
Perform standard statistical summaries (box plots & normality checks)
Check for outliers, and implement several outlier capping strategies (corrective actions, if needed)
Run a collerational study and perform regression
Check and test for power of regression via ANOVAs (example: significance of slope vis T-stat or p-values)
# Load necessary libraries
require(ggthemes)
library(tidyverse)
library(magrittr)
library(stringr)
library(tidyr)
library(dplyr)
library(reshape2)
library("mice")
library('e1071')
The simple (if its in the minority) is to simply zeroed out the NAs with numerical zeros
Or replacing NAs with monthly average
and linear regression studies; finally, the resultant linear model will form the basis of our predictive tool
url <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'
# Reading & Loading data
df <- read.csv(file = url ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)
glimpse(df)
## Observations: 153
## Variables: 7
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, ...
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256,...
## $ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6...
## $ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68...
## $ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
values(NA), which we will be doing simply by replacing NA with zeros first. Let’s begin!
col<- mapply(anyNA,df) # apply function anyNA() on all columns of airquality dataset
col
## X Ozone Solar.R Wind Temp Month Day
## FALSE TRUE TRUE FALSE FALSE FALSE FALSE
head(df,n=3)
## X Ozone Solar.R Wind Temp Month Day
## 1 1 41 190 7.4 67 5 1
## 2 2 36 118 8.0 72 5 2
## 3 3 12 149 12.6 74 5 3
# First run is replacing NA with zeros
df <- df %>% mutate_all(funs(replace_na(.,0))) %>% # replacing NAs with Zero
select(-c(1)) # drop first irrelevant column from futher analysis
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
head(df,n=2)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
# Melt: Wide to long format
df1 <- melt(df, id.vars = c("Month", "Day"),
variable.name = "climate_type",
value.name = "climate_value")
head(df1,n=5)
## Month Day climate_type climate_value
## 1 5 1 Ozone 41
## 2 5 2 Ozone 36
## 3 5 3 Ozone 12
## 4 5 4 Ozone 18
## 5 5 5 Ozone 0
ggplot(df1,aes(x=Month, y=climate_value,fill = climate_type)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("climate variables") + ylab("Climate Values") +
ggtitle("Climate Variables from May - Sept of 1973") + ylim(0, 400)+
theme_excel()
Visualization and Mathematical Methods:
For Visualization Methods, Boxplot with range 1.5 and Histogram with break 15 were used to know more
about the data. For instance, Quantile Capping Method was used to detect the outliers(Mathematically) in
the data for each variable
#This function detects Outliers by quantile the capping method
outdetect <- function(c,w=1.5)
{
h <- w*IQR(c,na.rm = T)
q <- quantile(c,probs=c(.25, .75),na.rm = T)
if(length(which(q[1]-h>c))==0)
cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile\n")
else
cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile","on rows",which(q[1]-h>c),"and the values are",boxplot.stats(c)$out,"\n")
if(length(which(q[2]+h<c))==0)
cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile\n")
else
cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile","on rows",which(q[2]+h<c),"and the values are",boxplot.stats(c)$out,"\n")
}
ozone_dat <- df1 %>%
filter(climate_type=='Ozone')
head(ozone_dat) #extracting only ozone
## Month Day climate_type climate_value
## 1 5 1 Ozone 41
## 2 5 2 Ozone 36
## 3 5 3 Ozone 12
## 4 5 4 Ozone 18
## 5 5 5 Ozone 0
## 6 5 6 Ozone 28
Solar_dat <- df1 %>%
filter(climate_type=='Solar.R')
head(Solar_dat) #extracting only Solar Radiation
## Month Day climate_type climate_value
## 1 5 1 Solar.R 190
## 2 5 2 Solar.R 118
## 3 5 3 Solar.R 149
## 4 5 4 Solar.R 313
## 5 5 5 Solar.R 0
## 6 5 6 Solar.R 0
Wind_dat <- df1 %>%
filter(climate_type=='Wind')
head(Wind_dat ) #extracting only Wind_dat
## Month Day climate_type climate_value
## 1 5 1 Wind 7.4
## 2 5 2 Wind 8.0
## 3 5 3 Wind 12.6
## 4 5 4 Wind 11.5
## 5 5 5 Wind 14.3
## 6 5 6 Wind 14.9
Temp_dat <- df1 %>%
filter(climate_type=='Temp')
head(Temp_dat ) #extracting only Wind_dat
## Month Day climate_type climate_value
## 1 5 1 Temp 67
## 2 5 2 Temp 72
## 3 5 3 Temp 74
## 4 5 4 Temp 62
## 5 5 5 Temp 56
## 6 5 6 Temp 66
#boxplots & Histogram of Climate Variables
par(mfrow=c(1,2))
p <- ggplot(df1, aes(y=climate_value,x=climate_type,fill=climate_type))+geom_boxplot()+
ggtitle("Boxplot of Climate Variables for Sep 1973")+
theme_excel()
p <- p + theme(axis.text = element_text(size = 10,angle =45, hjust = 1))
p
pm <- ggplot(df1, aes(x=climate_value)) +
geom_histogram(fill="yellow", alpha=1,binwidth = 13,position="identity",color='blue')+facet_grid( .~climate_type)+
labs(title="Climate histogram plots",x="Climate Variables", y = "Frequency")
pm
my_string <- "Ozone Outliers:"
print(my_string, quote = FALSE)
## [1] Ozone Outliers:
outdetect(ozone_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 6 observations above the 3rd quantile on rows 30 62 99 101 117 121 and the values are 115 135 122 110 168 118
my_string <- "Solar Radiation Outliers:"
print(my_string, quote = FALSE)
## [1] Solar Radiation Outliers:
outdetect(Solar_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile
my_string <- "Wind Outliers:"
print(my_string, quote = FALSE)
## [1] Wind Outliers:
outdetect(Wind_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 3 observations above the 3rd quantile on rows 9 18 48 and the values are 20.1 18.4 20.7
my_string <- "Temp Outliers:"
print(my_string, quote = FALSE)
## [1] Temp Outliers:
outdetect(Temp_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile
Outliers by definition are rare events (should be). The simplest approach is to remove them and carry on with the
analysis but capping method can also be used.
Percentile Capping is a method of imputing the outlier values by:
upper limit, with the value of 95th percentile of the same dataset.
# Function to cap outliers
outcap <- function(x)
{
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
sum(x > (qnt[2] + H))
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
x<<-x
}
Data <- df
# Data cleansing
outcap(ozone_dat$climate_value)
Data$Ozone<- x
outcap(Wind_dat$climate_value)
Data$Wind<- x
Data[c(9,18,48,30,62,99,101,117,121),c(1,3)]
## Ozone Wind
## 9 8 15.5
## 18 6 15.5
## 48 37 15.5
## 30 97 5.7
## 62 97 4.1
## 99 97 4.0
## 101 97 8.0
## 117 97 3.4
## 121 97 2.3
#vizualization tool for comparing normality distributions
Viz <- function(x){
h1 <- hist(x,col="yellow",main="Histogram",xlab="Variables");
xfit<-seq(0,max(x),length=50)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h1$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
}
mean_ozone <- ozone_dat %>%
summarise(mean(climate_value))
meanozone <- mean_ozone[1,1] # mean of ozone
sd_ozone <- ozone_dat %>%
summarise(sd(climate_value))
sdozone <- sd_ozone[1,1] # std dev. of ozone
hist(ozone_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:150
y <- dnorm(x = x, mean = meanozone, sd = sdozone)
lines(x = x, y = y, col = "blue")
Viz(ozone_dat$climate_value)
# shifting ozone by -15 to make it more "normal"
check <- abs(ozone_dat$climate_value-15)
Ozone_T <- log(check)
Viz(Ozone_T)
After shifting the values and logging it, Ozone is now more normally distributed; its
definitely a step in the right direction
par(mfrow=c(1,2))
qqnorm(ozone_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(ozone_dat$climate_value, col = 2)
qqnorm(Ozone_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Ozone_T , col = 2)
The after Q-Q plot showed a more normal distribution set than before, especially at the left tail end.
However, the middle can be further improved for sure, and let’s see if that can be improved by Method (b)
mean_solar <- Solar_dat %>%
summarise(mean(climate_value))
meansolar <- mean_solar[1,1] # mean of solar
sd_solar <- Solar_dat %>%
summarise(sd(climate_value))
sdsolar <- sd_solar[1,1] # Std dev of solar
hist(Solar_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:350
y <- dnorm(x = x, mean = meansolar, sd = sdsolar)
lines(x = x, y = y, col = "blue")
Viz(Solar_dat$climate_value)
check <- abs(Solar_dat$climate_value)^2
Solar_T <- (check)
Viz(Solar_T)
par(mfrow=c(1,2))
qqnorm(Solar_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_dat$climate_value, col = 2)
qqnorm(Solar_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_T , col = 2)
The after Q-Q plot showed a more normal distribution set than before, especially the middle portion is above the
45 degree line. However again like Ozone, the left can be improved further, and let’s see if that can be
improved by Method (b)
qqnorm(Wind_dat$climate_value ,pch=16,main="QQplot for Wind",xlab="Sample quantiles of Wind",ylab="Theoretical quantiles")
qqline(Wind_dat$climate_value , col = 2)
shapiro.test(Wind_dat$climate_value) #Shapiro test for normality; making sure its really normal
##
## Shapiro-Wilk normality test
##
## data: Wind_dat$climate_value
## W = 0.98575, p-value = 0.1178
Because P-value is greater than 0.05, we can be 95% confident that Wind data is normal for sure
Viz(Temp_dat$climate_value)
check <- abs(Temp_dat$climate_value)^2
temp_T <- (check)
Viz(temp_T)
par(mfrow=c(1,2))
qqnorm(Temp_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(Temp_dat$climate_value, col = 2)
qqnorm(temp_T,pch=16,main="After Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(temp_T, col = 2)
shapiro.test(temp_T) #Shapiro to check for normality
##
## Shapiro-Wilk normality test
##
## data: temp_T
## W = 0.98546, p-value = 0.1089
Because P-value is greater than 0.05, we can be 95% confident that Temperature data is normal
NOw that we have gone thru Method (a), lets try out Method (b) which is to impute average monthly data to replace
the NA (missing values). This is to see if we get better results and if so, we will proceed with regression
url1 <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'
# Reading & Loading data
df_mavg <- read.csv(file = url1 ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)
glimpse(df_mavg)
## Observations: 153
## Variables: 7
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, ...
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256,...
## $ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6...
## $ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68...
## $ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
# Impute monthly mean in Ozone
for (i in 1:nrow(df_mavg )){
if(is.na(df_mavg [i,"Ozone"])){
df_mavg [i,"Ozone"]<- mean(df_mavg [which(df_mavg [,"Month"]==df_mavg [i,"Month"]),"Ozone"],na.rm = TRUE)
}
# Impute monthly mean in Solar.R
if(is.na(df_mavg [i,"Solar.R"])){
df_mavg [i,"Solar.R"]<- mean(df_mavg [which(df_mavg [,"Month"]==df_mavg [i,"Month"]),"Solar.R"],na.rm = TRUE)
}
}
# First remove X column
df_mavg <- df_mavg %>%
select(-c(1)) # drop first irrelevant column from futher analysis
head(df_mavg ,n=5)
## Ozone Solar.R Wind Temp Month Day
## 1 41.00000 190.0000 7.4 67 5 1
## 2 36.00000 118.0000 8.0 72 5 2
## 3 12.00000 149.0000 12.6 74 5 3
## 4 18.00000 313.0000 11.5 62 5 4
## 5 23.61538 181.2963 14.3 56 5 5
# Melt: Wide to long format
df_mavg <- melt(df_mavg, id.vars = c("Month", "Day"),
variable.name = "climate_type",
value.name = "climate_value")
head(df_mavg)
## Month Day climate_type climate_value
## 1 5 1 Ozone 41.00000
## 2 5 2 Ozone 36.00000
## 3 5 3 Ozone 12.00000
## 4 5 4 Ozone 18.00000
## 5 5 5 Ozone 23.61538
## 6 5 6 Ozone 28.00000
#boxplots & Histogram of Climate Variables
par(mfrow=c(1,2))
p1 <- ggplot(df_mavg, aes(y=climate_value,x=climate_type,fill=climate_type))+geom_boxplot()+
ggtitle("Post averaging: Boxplot of Climate Variables for Sep 1973")+
theme_excel()
p1 <- p1 + theme(axis.text = element_text(size = 10,angle =45, hjust = 1))
p1
pm1 <- ggplot(df_mavg, aes(x=climate_value)) +
geom_histogram(fill="yellow", alpha=1,binwidth = 13,position="identity",color='blue')+facet_grid( .~climate_type)+
labs(title="Post averaging: Climate histogram plots",x="Climate Variables", y = "Frequency")
pm1
Starting with Ozone:
Let’s look at the distribution with the before and after Q-Q plots
ozone_dat1 <- df_mavg %>%
filter(climate_type=='Ozone')
head(ozone_dat1) #extracting only ozone
## Month Day climate_type climate_value
## 1 5 1 Ozone 41.00000
## 2 5 2 Ozone 36.00000
## 3 5 3 Ozone 12.00000
## 4 5 4 Ozone 18.00000
## 5 5 5 Ozone 23.61538
## 6 5 6 Ozone 28.00000
Solar_dat1 <- df_mavg %>%
filter(climate_type=='Solar.R')
head(Solar_dat1) #extracting only Solar Radiation
## Month Day climate_type climate_value
## 1 5 1 Solar.R 190.0000
## 2 5 2 Solar.R 118.0000
## 3 5 3 Solar.R 149.0000
## 4 5 4 Solar.R 313.0000
## 5 5 5 Solar.R 181.2963
## 6 5 6 Solar.R 181.2963
my_string <- "Ozone Outliers:"
print(my_string, quote = FALSE)
## [1] Ozone Outliers:
outdetect(ozone_dat1$climate_value)
## There are 0 observations below the 1st quantile
## There are 4 observations above the 3rd quantile on rows 62 99 117 121 and the values are 135 122 168 118
Ozone still looks like before
But did logging it helps with normality?
Viz(ozone_dat1$climate_value)
# Loggng it make it more "normal"
check1 <- abs(ozone_dat1$climate_value)
Ozone_T1 <- log(check1)
Viz(Ozone_T1)
par(mfrow=c(1,2))
qqnorm(ozone_dat1$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(ozone_dat1$climate_value, col = 2)
qqnorm(Ozone_T1,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Ozone_T1 , col = 2)
with this and use this for regression
check1 <- (abs(Solar_dat1$climate_value))
Solar_T1 <- (check1)
Viz(Solar_T1)
par(mfrow=c(1,2))
qqnorm(Solar_dat1$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_dat1$climate_value, col = 2)
qqnorm(Solar_T1 ,pch=16,main="After Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_T1 , col = 2)
after did not differ says it all. So we will settle with this and use this for regression
We will perform the 2nd phase of this project, by first checking for correlation between variables
If correlational relationships exist, we will then perform linear regression
# Easiest for me is to re-read it back again and storing in a new URL2
url2 <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'
# Reading & Loading data
df_mavg2 <- read.csv(file = url2 ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)
# Impute monthly mean in Ozone
for (i in 1:nrow(df_mavg2 )){
if(is.na(df_mavg2 [i,"Ozone"])){
df_mavg2 [i,"Ozone"]<- mean(df_mavg2 [which(df_mavg [,"Month"]==df_mavg2 [i,"Month"]),"Ozone"],na.rm = TRUE)
}
# Impute monthly mean in Solar.R
if(is.na(df_mavg2 [i,"Solar.R"])){
df_mavg2 [i,"Solar.R"]<- mean(df_mavg2 [which(df_mavg2 [,"Month"]==df_mavg2 [i,"Month"]),"Solar.R"],na.rm = TRUE)
}
}
# First remove X column
df_mavg3 <- df_mavg2 %>%
select(-c(1)) # drop first irrelevant column from futher analysis
df_mavg3<- df_mavg3 %>%
select(-c(5,6))
pairs(df_mavg3, panel = panel.smooth, main = "Air Quality Data")
cor_oz_Solar <- cor.test(df_mavg3$Ozone, df_mavg3$Solar.R,
method = "pearson")
cor_oz_Solar
##
## Pearson's product-moment correlation
##
## data: df_mavg3$Ozone and df_mavg3$Solar.R
## t = 3.763, df = 151, p-value = 0.0002398
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1406624 0.4314379
## sample estimates:
## cor
## 0.2928051
cor_oz_Wind <- cor.test(df_mavg3$Ozone, df_mavg3$Wind,
method = "pearson")
cor_oz_Wind
##
## Pearson's product-moment correlation
##
## data: df_mavg3$Ozone and df_mavg3$Wind
## t = -7.417, df = 151, p-value = 8.034e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6242426 -0.3900557
## sample estimates:
## cor
## -0.5167504
cor_oz_Temp <- cor.test(df_mavg3$Ozone, df_mavg3$Temp,
method = "pearson")
cor_oz_Temp
##
## Pearson's product-moment correlation
##
## data: df_mavg3$Ozone and df_mavg3$Temp
## t = 10.389, df = 151, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5425428 0.7295725
## sample estimates:
## cor
## 0.6456381
cor_Solar_Wind <- cor.test(df_mavg3$Solar.R, df_mavg3$Wind,
method = "pearson")
cor_Solar_Wind
##
## Pearson's product-moment correlation
##
## data: df_mavg3$Solar.R and df_mavg3$Wind
## t = -0.64444, df = 151, p-value = 0.5203
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2093105 0.1071971
## sample estimates:
## cor
## -0.05237183
cor_Solar_Temp <- cor.test(df_mavg3$Solar.R, df_mavg3$Temp,
method = "pearson")
cor_Solar_Temp
##
## Pearson's product-moment correlation
##
## data: df_mavg3$Solar.R and df_mavg3$Temp
## t = 3.3351, df = 151, p-value = 0.001073
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1077306 0.4038252
## sample estimates:
## cor
## 0.2619312
cor_wind_Temp <- cor.test(df_mavg3$Wind, df_mavg3$Temp,
method = "pearson")
cor_wind_Temp
##
## Pearson's product-moment correlation
##
## data: df_mavg3$Wind and df_mavg3$Temp
## t = -6.3308, df = 151, p-value = 2.642e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5748874 -0.3227660
## sample estimates:
## cor
## -0.4579879
Only Ozone vs. Wind and Ozone vs. Temperature has high correlation. Lets use these 2 for linear regression
But first we need to log Ozone as it was shown that the logs of Ozone made it more normal
We shall then use logs of Ozone to regressed against Temp and Wind
df_mavg4 <- df_mavg3 %>%
mutate(log_Ozone = log(Ozone)) # Addding the log of ozone columns
library("ggpubr")
ggscatter(df_mavg4, x = "Temp", y = "log_Ozone",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Temperature", ylab = "Log of Ozone")
ggscatter(df_mavg4, x = "Wind", y = "log_Ozone",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Wind", ylab = "Log of Ozone")
linearMod1 <- lm(Temp ~ log_Ozone , data=df_mavg4) # build linear regression model on full data
print(linearMod1)
##
## Call:
## lm(formula = Temp ~ log_Ozone, data = df_mavg4)
##
## Coefficients:
## (Intercept) log_Ozone
## 48.951 8.388
summary(linearMod1) # model summary
##
## Call:
## lm(formula = Temp ~ log_Ozone, data = df_mavg4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4737 -3.3242 0.8293 4.6182 15.6758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.951 2.566 19.08 <2e-16 ***
## log_Ozone 8.388 0.726 11.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.918 on 151 degrees of freedom
## Multiple R-squared: 0.4693, Adjusted R-squared: 0.4657
## F-statistic: 133.5 on 1 and 151 DF, p-value: < 2.2e-16
The equation is => log_ozone = 48.951 + 8.388*Temp
linearMod2 <- lm(Wind ~ log_Ozone, data=df_mavg4) # build linear regression model on full data
print(linearMod2)
##
## Call:
## lm(formula = Wind ~ log_Ozone, data = df_mavg4)
##
## Coefficients:
## (Intercept) log_Ozone
## 17.189 -2.097
summary(linearMod2) # model summary
##
## Call:
## lm(formula = Wind ~ log_Ozone, data = df_mavg4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3970 -2.2181 -0.3149 1.9922 11.0819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.1887 1.1639 14.769 < 2e-16 ***
## log_Ozone -2.0966 0.3293 -6.366 2.2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.138 on 151 degrees of freedom
## Multiple R-squared: 0.2116, Adjusted R-squared: 0.2064
## F-statistic: 40.53 on 1 and 151 DF, p-value: 2.204e-09
The equation is => log_ozone = 17.189 - 2.097*Wind
# Create Training and Test data -
set.seed(100) # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(df_mavg4), 0.8*nrow(df_mavg4)) # row indices for training data
trainingData <- df_mavg4[trainingRowIndex, ] # model training data
testData <- df_mavg4[-trainingRowIndex, ] # test data
# Build the model on training data
lmMod1 <- lm(log_Ozone~ Temp , data=trainingData) # build the model
distPred1 <- predict(lmMod1, testData) # predict log of ozone
summary (lmMod1) # model summary
##
## Call:
## lm(formula = log_Ozone ~ Temp, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43361 -0.30422 0.02787 0.36438 1.49313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.613376 0.381376 -1.608 0.11
## Temp 0.052398 0.004863 10.774 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5195 on 120 degrees of freedom
## Multiple R-squared: 0.4917, Adjusted R-squared: 0.4875
## F-statistic: 116.1 on 1 and 120 DF, p-value: < 2.2e-16
actuals_preds <- data.frame(cbind(actuals=testData$log_Ozone, predicteds=distPred1)) # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds) # 67.5%
correlation_accuracy
## actuals predicteds
## actuals 1.0000000 0.6745797
## predicteds 0.6745797 1.0000000
head(actuals_preds)
## actuals predicteds
## 1 3.713572 2.897265
## 6 3.332205 2.844868
## 11 1.945910 3.264049
## 13 2.397895 2.844868
## 21 0.000000 2.478084
## 33 3.382505 3.264049
A simple correlation between the actuals and predicted values can be used as a form of accuracy measure.
Higher correlation accuracy implies that the actuals and predicted values have similar directional movement,
i.e. when the actuals values increase the predicted values also increase and vice-versa.
Here we have a correlation of 67.5%; Not bad and definitely could be improved but we stop this at this point
RSME = 0.7193
DMwR::regr.eval(actuals_preds$actuals, actuals_preds$predicteds)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## mae mse rmse mape
## 0.5024264 0.5174255 0.7193229 Inf
# Build the model on training data
lmMod2 <- lm(log_Ozone ~ Wind, data=trainingData) # build the model
distPred2 <- predict(lmMod2, testData) # predict log of ozone
summary (lmMod2) # model summary
##
## Call:
## lm(formula = log_Ozone ~ Wind, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10874 -0.41262 0.07006 0.44759 1.31907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.55602 0.17140 26.581 < 2e-16 ***
## Wind -0.10938 0.01622 -6.743 5.78e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6205 on 120 degrees of freedom
## Multiple R-squared: 0.2748, Adjusted R-squared: 0.2687
## F-statistic: 45.46 on 1 and 120 DF, p-value: 5.781e-10
actuals_preds1 <- data.frame(cbind(actuals=testData$log_Ozone, predicteds=distPred2)) # make actuals_predicteds dataframe.
correlation_accuracy1 <- cor(actuals_preds1) # 28.6%
correlation_accuracy1
## actuals predicteds
## actuals 1.0000000 0.2863958
## predicteds 0.2863958 1.0000000
head(actuals_preds1)
## actuals predicteds
## 1 3.713572 3.746608
## 6 3.332205 2.926254
## 11 1.945910 3.801298
## 13 2.397895 3.549723
## 21 0.000000 3.495032
## 33 3.382505 3.495032
DMwR::regr.eval(actuals_preds1$actuals, actuals_preds1$predicteds)
## mae mse rmse mape
## 0.6474125 0.8251163 0.9083591 Inf
RSME = 0.908 is higher than the Temp vs. Ozone
Here we have a really low correlation of 28.6%; definitely could be much improved, but we stop here. We
therefore recommend that the Temp vs. Ozone be the linear model for prediction going forward; since
it had the best correlational relationship amongst all the pertinent predictior variables.
model will perform equally well all the time?
on the 80-20 split by performing K-Fold Cross Validation
well, when trained and tested on different distinct chunks of data. Let’s take a peek!
library(DAAG)
cvResults <- suppressWarnings(CVlm(data=df_mavg4, form.lm=log_Ozone ~ Temp, m=5, dots=FALSE, seed=29, legend.pos="bottomright", printit=FALSE, main="Small symbols are predicted values while bigger ones are actuals.")); # performs the CV
attr(cvResults, 'ms')
## [1] 0.323749
cvResults1 <- suppressWarnings(CVlm(data=df_mavg4, form.lm=log_Ozone ~ Wind, m=5, dots=FALSE, seed=29, legend.pos="bottomright", printit=FALSE, main="Small symbols are predicted values while bigger ones are actuals.")); # performs the CV
attr(cvResults1, 'ms')
## [1] 0.4746991
What to look for in the K-Fold charts?
big symbols are not dispersed for any one particular color
the average squared error for the different folds put together
There were 4 variables; Ozone (Response variable), Solar Radiation, Wind and Temperatures that was tracked and
both a statistical and a predictive linear model was performed. These climate data were taken in a finite period
of time; May 1,1973 (a Tuesday) to September 30, 1973. The data were taken through a fairly rigorous statistical
process to check for normality and outliers (correction if needed). In addition post data cleansing,
correlational study was performed across all the variables and only the highest correlated variables were used to
predict the response variable; Ozone levels. After corrective actions, the log of Ozone relative to temperature
were found to be the best correlated pairs. Wind were the other variable that statistically proved to have the
2nd highest predictive power. The linear model based on Temperature vs. log of ozone was then used to train and
tested. A two-step process was taken to verify for model accuracy; simple 80-20 split of train vs. test data
sets and the more robust K-Fold Cross Validation procedure. Both methodlogies reinforced the fact that Temp vs
Ozone was the best (in this sample set) to build predictive analytics off from. Finally, The linear equation
was identified to be log_ozone = 48.951 + 8.388*Temp. Lastly, since ozone was logged to obtained better
normality, it should be noted that in the prediction phase, anti-logs of the log_ozone response variables must be
undertaken, to bring it back to the original scale.