Rpubs:
http://rpubs.com/ssufian/539243
Research Question
Are there any trend lines and correlations between the 6 observed variables below?
If so, we can then use one or more to perform as predictors in a regression analysis to predict Ozone levels
There are 154 observations on 6 variables:
Ozone numeric Ozone (ppb)
Solar.R numeric Solar R (lang)
Wind numeric Wind (mph)
Temp numeric Temperature (degrees F)
Month numeric Month (1–12)
Day numeric Day of month (1–31)
Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.
Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
Each case represents an observations of all the 6 variables in a given day. There are 154 observations
The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather
Service (meteorological data)
This was an observational study
References Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont,
CA: Wadsworth.
The response variables is Ozone level (it is numerical)
The predictor variables to be tested are: Solar, Temperature and Wind are all numerical
Trendlining key variables to see if there’s any trends
Perform standard statistical summaries (box plots, normality checks and corrective actions)
Check for outliers, and implement several outlier capping strategies
Run a collerational study and perform regression
Check and test for power of regression via ANOVAs
# Load necessary libraries
require(ggthemes)
library(tidyverse)
library(magrittr)
library(stringr)
library(tidyr)
library(dplyr)
library(reshape2)
url <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'
# Reading & Loading data
df <- read.csv(file = url ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)
head(df)
## X Ozone Solar.R Wind Temp Month Day
## 1 1 41 190 7.4 67 5 1
## 2 2 36 118 8.0 72 5 2
## 3 3 12 149 12.6 74 5 3
## 4 4 18 313 11.5 62 5 4
## 5 5 NA NA 14.3 56 5 5
## 6 6 28 NA 14.9 66 5 6
df <- df %>% mutate_all(funs(replace_na(.,0))) %>% # replacing NAs with Zero
select(-c(1)) # drop first irrelevant column from futher analysis
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
dim(df)
## [1] 153 6
head(df)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 0 0 14.3 56 5 5
## 6 28 0 14.9 66 5 6
# Melt: Wide to long format
df1 <- melt(df, id.vars = c("Month", "Day"),
variable.name = "climate_type",
value.name = "climate_value")
dim(df1)
## [1] 612 4
head(df1)
## Month Day climate_type climate_value
## 1 5 1 Ozone 41
## 2 5 2 Ozone 36
## 3 5 3 Ozone 12
## 4 5 4 Ozone 18
## 5 5 5 Ozone 0
## 6 5 6 Ozone 28
ggplot(df1,aes(x=Month, y=climate_value,fill = climate_type)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("climate variables") + ylab("Climate Values") +
ggtitle("Climate Variables from May - Sept of 1973") + ylim(0, 300)+
theme_excel()
## Warning: Removed 9 rows containing missing values (geom_bar).
Visualization and Mathematical Methods:
For Visualization Methods, Boxplot with range 1.5 and Histogram with break 15 were used to know more about the
data. For instance, Quantile Capping Method was used to detect the outliers(Mathematically) in the data for each
variable
#This function detects Outliers by quantile capping method
outdetect <- function(c,w=1.5)
{
h <- w*IQR(c,na.rm = T)
q <- quantile(c,probs=c(.25, .75),na.rm = T)
if(length(which(q[1]-h>c))==0)
cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile\n")
else
cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile","on rows",which(q[1]-h>c),"and the values are",boxplot.stats(c)$out,"\n")
if(length(which(q[2]+h<c))==0)
cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile\n")
else
cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile","on rows",which(q[2]+h<c),"and the values are",boxplot.stats(c)$out,"\n")
}
ozone_dat <- df1 %>%
filter(climate_type=='Ozone')
head(ozone_dat) #extracting only ozone
## Month Day climate_type climate_value
## 1 5 1 Ozone 41
## 2 5 2 Ozone 36
## 3 5 3 Ozone 12
## 4 5 4 Ozone 18
## 5 5 5 Ozone 0
## 6 5 6 Ozone 28
Solar_dat <- df1 %>%
filter(climate_type=='Solar.R')
head(Solar_dat) #extracting only Solar Radiation
## Month Day climate_type climate_value
## 1 5 1 Solar.R 190
## 2 5 2 Solar.R 118
## 3 5 3 Solar.R 149
## 4 5 4 Solar.R 313
## 5 5 5 Solar.R 0
## 6 5 6 Solar.R 0
Wind_dat <- df1 %>%
filter(climate_type=='Wind')
head(Wind_dat ) #extracting only Wind_dat
## Month Day climate_type climate_value
## 1 5 1 Wind 7.4
## 2 5 2 Wind 8.0
## 3 5 3 Wind 12.6
## 4 5 4 Wind 11.5
## 5 5 5 Wind 14.3
## 6 5 6 Wind 14.9
Temp_dat <- df1 %>%
filter(climate_type=='Temp')
head(Temp_dat ) #extracting only Wind_dat
## Month Day climate_type climate_value
## 1 5 1 Temp 67
## 2 5 2 Temp 72
## 3 5 3 Temp 74
## 4 5 4 Temp 62
## 5 5 5 Temp 56
## 6 5 6 Temp 66
#boxplots & Histogram of Climate Variables
par(mfrow=c(1,2))
p <- ggplot(df1, aes(y=climate_value,x=climate_type,fill=climate_type))+geom_boxplot()+
ggtitle("Boxplot of Climate Variables for Sep 1973")+
theme_excel()
p <- p + theme(axis.text = element_text(size = 10,angle =45, hjust = 1))
p
pm <- ggplot(df1, aes(x=climate_value)) +
geom_histogram(fill="yellow", alpha=1,binwidth = 13,position="identity",color='blue')+facet_grid( .~climate_type)+
labs(title="Climate histogram plots",x="Climate Variables", y = "Frequency")
pm
my_string <- "Ozone Outliers:"
print(my_string, quote = FALSE)
## [1] Ozone Outliers:
outdetect(ozone_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 6 observations above the 3rd quantile on rows 30 62 99 101 117 121 and the values are 115 135 122 110 168 118
my_string <- "Solar Radiation Outliers:"
print(my_string, quote = FALSE)
## [1] Solar Radiation Outliers:
outdetect(Solar_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile
my_string <- "Wind Outliers:"
print(my_string, quote = FALSE)
## [1] Wind Outliers:
outdetect(Wind_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 3 observations above the 3rd quantile on rows 9 18 48 and the values are 20.1 18.4 20.7
my_string <- "Temp Outliers:"
print(my_string, quote = FALSE)
## [1] Temp Outliers:
outdetect(Temp_dat$climate_value)
## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile
Outliers by definition are rare events (should be). The simplest approach is to remove them and carry on with the
analysis but capping method can also be used.
Percentile Capping is a method of imputing the outlier values by:
upper limit, with the value of 95th percentile of the same dataset.
# Function to cap outliers
outcap <- function(x)
{
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
sum(x > (qnt[2] + H))
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
x<<-x
}
Data <- df
# Data cleansing
outcap(ozone_dat$climate_value)
Data$Ozone<- x
outcap(Wind_dat$climate_value)
Data$Wind<- x
Data[c(9,18,48,30,62,99,101,117,121),c(1,3)]
## Ozone Wind
## 9 8 15.5
## 18 6 15.5
## 48 37 15.5
## 30 97 5.7
## 62 97 4.1
## 99 97 4.0
## 101 97 8.0
## 117 97 3.4
## 121 97 2.3
#vizualization tool for comparing normality distributions
Viz <- function(x){
h1 <- hist(x,col="yellow",main="Histogram",xlab="Variables");
xfit<-seq(0,max(x),length=50)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h1$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
}
mean_ozone <- ozone_dat %>%
summarise(mean(climate_value))
meanozone <- mean_ozone[1,1] # mean of ozone
sd_ozone <- ozone_dat %>%
summarise(sd(climate_value))
sdozone <- sd_ozone[1,1] # mean of ozone
hist(ozone_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:150
y <- dnorm(x = x, mean = meanozone, sd = sdozone)
lines(x = x, y = y, col = "blue")
Viz(ozone_dat$climate_value)
check <- abs(ozone_dat$climate_value-15)
Ozone_T <- log(check)
Viz(Ozone_T)
After shifting the values and logging it, Ozone is now more normally distributed; its
definitely a step in the right direction
par(mfrow=c(1,2))
qqnorm(ozone_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(ozone_dat$climate_value, col = 2)
qqnorm(Ozone_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Ozone_T , col = 2)
The after Q-Q plot showed a more normal distribution set than before, especially at the left
tail end. However, the middle can be further improved for sure, and that is left for future
exercise
mean_solar <- Solar_dat %>%
summarise(mean(climate_value))
meansolar <- mean_solar[1,1] # mean of solar
sd_solar <- Solar_dat %>%
summarise(sd(climate_value))
sdsolar <- sd_solar[1,1] # Std dev of solar
hist(Solar_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:350
y <- dnorm(x = x, mean = meansolar, sd = sdsolar)
lines(x = x, y = y, col = "blue")
Viz(Solar_dat$climate_value)
check <- abs(Solar_dat$climate_value)^2
Solar_T <- (check)
Viz(Solar_T)
par(mfrow=c(1,2))
qqnorm(Solar_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_dat$climate_value, col = 2)
qqnorm(Solar_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_T , col = 2)
### Q-Q Plot - Solar Radiation
The after Q-Q plot showed a more normal distribution set than before, especially the middle
portion is above the 45
degree line. However again like Ozone, the left can be improved further, and that is left for
future exercise
--------------------------------------------------------------------------------
\clearpag
### (C) "Wind is already Normal - Q-Q plot ascertains it, see below; no need for transformation"
qqnorm(Wind_dat$climate_value ,pch=16,main="QQplot for Wind",xlab="Sample quantiles of Wind",ylab="Theoretical quantiles")
qqline(Wind_dat$climate_value , col = 2)
shapiro.test(Wind_dat$climate_value) #Shapiro test for normality; making sure its really normal
##
## Shapiro-Wilk normality test
##
## data: Wind_dat$climate_value
## W = 0.98575, p-value = 0.1178
### Shapiro Wilk test to make sure Wind is really Normal
Because P-value is greater than 0.05, we can be 95% confident that Wind data is normal for sure
Viz(Temp_dat$climate_value)
check <- abs(Temp_dat$climate_value)^2
temp_T <- (check)
Viz(temp_T)
par(mfrow=c(1,2))
qqnorm(Temp_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(Temp_dat$climate_value, col = 2)
qqnorm(temp_T,pch=16,main="After Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(temp_T, col = 2)
shapiro.test(temp_T) #Shapiro to check for normality
##
## Shapiro-Wilk normality test
##
## data: temp_T
## W = 0.98546, p-value = 0.1089
Because P-value is greater than 0.05, we can be 95% confident that Temperature data is normal