Data 606 - Data Project Proposal

Rpubs:

http://rpubs.com/ssufian/539243

Research Question

Are there any trend lines and correlations between the 6 observed variables below?
If so, we can then use one or more to perform as predictors in a regression analysis to predict Ozone levels

There are 154 observations on 6 variables:

Ozone numeric Ozone (ppb)
Solar.R numeric Solar R (lang)
Wind numeric Wind (mph)
Temp numeric Temperature (degrees F)
Month numeric Month (1–12)
Day numeric Day of month (1–31)

Data gathering Details:

Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.

Cases

Each case represents an observations of all the 6 variables in a given day. There are 154 observations

Data Collection

The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather

Service (meteorological data)

Type of study

This was an observational study

Data Source

References Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont,

CA: Wadsworth.

Response

The response variables is Ozone level (it is numerical)

Explanatory

The predictor variables to be tested are: Solar, Temperature and Wind are all numerical

Relevant summary Statistics

Trendlining key variables to see if there’s any trends
Perform standard statistical summaries (box plots, normality checks and corrective actions)
Check for outliers, and implement several outlier capping strategies
Run a collerational study and perform regression
Check and test for power of regression via ANOVAs

Load all the libraries

# Load necessary libraries
require(ggthemes)
library(tidyverse)
library(magrittr)
library(stringr)
library(tidyr)
library(dplyr)
library(reshape2)

Load the data from Github

url <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/NYC_airquality%20(1).csv'


# Reading & Loading data
df <- read.csv(file = url ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)

head(df)

##   X Ozone Solar.R Wind Temp Month Day
## 1 1    41     190  7.4   67     5   1
## 2 2    36     118  8.0   72     5   2
## 3 3    12     149 12.6   74     5   3
## 4 4    18     313 11.5   62     5   4
## 5 5    NA      NA 14.3   56     5   5
## 6 6    28      NA 14.9   66     5   6

df <- df %>% mutate_all(funs(replace_na(.,0))) %>%  # replacing NAs with Zero
           select(-c(1))  # drop first irrelevant column from futher analysis

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

dim(df)

## [1] 153   6

head(df)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5     0       0 14.3   56     5   5
## 6    28       0 14.9   66     5   6

Wide to long format using melt

# Melt:  Wide to long format
df1 <- melt(df, id.vars = c("Month", "Day"),
  variable.name = "climate_type", 
  value.name = "climate_value")

dim(df1)

## [1] 612   4

head(df1)

##   Month Day climate_type climate_value
## 1     5   1        Ozone            41
## 2     5   2        Ozone            36
## 3     5   3        Ozone            12
## 4     5   4        Ozone            18
## 5     5   5        Ozone             0
## 6     5   6        Ozone            28

Bar plots of the climate variables from May 1973 to Sept 1973

  ggplot(df1,aes(x=Month, y=climate_value,fill = climate_type)) + 
  geom_bar(stat = "identity", position = "dodge") +
  xlab("climate variables") + ylab("Climate Values") +
  ggtitle("Climate Variables from May - Sept of 1973") + ylim(0, 300)+
   theme_excel()

## Warning: Removed 9 rows containing missing values (geom_bar).

Detecting climate variable outliers

Visualization and Mathematical Methods:

For Visualization Methods, Boxplot with range 1.5 and Histogram with break 15 were used to know more about the

data. For instance, Quantile Capping Method was used to detect the outliers(Mathematically) in the data for each

variable

#This function detects Outliers by quantile capping method
outdetect <- function(c,w=1.5)
{
  h <- w*IQR(c,na.rm = T)
  q <- quantile(c,probs=c(.25, .75),na.rm = T)
  if(length(which(q[1]-h>c))==0)
    cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile\n")
  else
    cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile","on rows",which(q[1]-h>c),"and the values are",boxplot.stats(c)$out,"\n")
  if(length(which(q[2]+h<c))==0)
    cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile\n")
  else
    cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile","on rows",which(q[2]+h<c),"and the values are",boxplot.stats(c)$out,"\n")
}

OZONE, Solar R, Temp and Wind Chartings to vizualize the different distribution sets

ozone_dat <- df1 %>% 
             filter(climate_type=='Ozone')

head(ozone_dat) #extracting only ozone

##   Month Day climate_type climate_value
## 1     5   1        Ozone            41
## 2     5   2        Ozone            36
## 3     5   3        Ozone            12
## 4     5   4        Ozone            18
## 5     5   5        Ozone             0
## 6     5   6        Ozone            28

Solar_dat <- df1 %>% 
             filter(climate_type=='Solar.R')

head(Solar_dat) #extracting only Solar Radiation

##   Month Day climate_type climate_value
## 1     5   1      Solar.R           190
## 2     5   2      Solar.R           118
## 3     5   3      Solar.R           149
## 4     5   4      Solar.R           313
## 5     5   5      Solar.R             0
## 6     5   6      Solar.R             0

Wind_dat <- df1 %>% 
             filter(climate_type=='Wind')

head(Wind_dat ) #extracting only Wind_dat

##   Month Day climate_type climate_value
## 1     5   1         Wind           7.4
## 2     5   2         Wind           8.0
## 3     5   3         Wind          12.6
## 4     5   4         Wind          11.5
## 5     5   5         Wind          14.3
## 6     5   6         Wind          14.9

Temp_dat <- df1 %>% 
             filter(climate_type=='Temp')

head(Temp_dat ) #extracting only Wind_dat

##   Month Day climate_type climate_value
## 1     5   1         Temp            67
## 2     5   2         Temp            72
## 3     5   3         Temp            74
## 4     5   4         Temp            62
## 5     5   5         Temp            56
## 6     5   6         Temp            66

#boxplots & Histogram of Climate Variables
par(mfrow=c(1,2))

p <- ggplot(df1, aes(y=climate_value,x=climate_type,fill=climate_type))+geom_boxplot()+
  ggtitle("Boxplot of Climate Variables for Sep 1973")+
    theme_excel()
p <- p + theme(axis.text = element_text(size = 10,angle =45, hjust = 1))

p

pm <- ggplot(df1, aes(x=climate_value)) +
  geom_histogram(fill="yellow", alpha=1,binwidth = 13,position="identity",color='blue')+facet_grid( .~climate_type)+
  labs(title="Climate histogram plots",x="Climate Variables", y = "Frequency")

pm

Quantifying Outliers, if any: {Ozone, Solar Radiation, Wind & Temperature}

my_string <- "Ozone Outliers:"
print(my_string, quote = FALSE)

## [1] Ozone Outliers:

outdetect(ozone_dat$climate_value)

## There are 0 observations below the 1st quantile
## There are 6 observations above the 3rd quantile on rows 30 62 99 101 117 121 and the values are 115 135 122 110 168 118

my_string <- "Solar Radiation Outliers:"
print(my_string, quote = FALSE)

## [1] Solar Radiation Outliers:

outdetect(Solar_dat$climate_value)

## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile

my_string <- "Wind Outliers:"
print(my_string, quote = FALSE)

## [1] Wind Outliers:

outdetect(Wind_dat$climate_value)

## There are 0 observations below the 1st quantile
## There are 3 observations above the 3rd quantile on rows 9 18 48 and the values are 20.1 18.4 20.7

my_string <- "Temp Outliers:"
print(my_string, quote = FALSE)

## [1] Temp Outliers:

outdetect(Temp_dat$climate_value)

## There are 0 observations below the 1st quantile
## There are 0 observations above the 3rd quantile

Outliers Treatment

Outliers by definition are rare events (should be). The simplest approach is to remove them and carry on with the

analysis but capping method can also be used.

Percentile Capping is a method of imputing the outlier values by:

Replacing those observations outside the lower limit with the value of 5th percentile and those that lie above the

upper limit, with the value of 95th percentile of the same dataset.

# Function to cap outliers
outcap <- function(x)
{
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
sum(x > (qnt[2] + H))
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
x<<-x
}

Outlier replacement (Data cleansing): Before and After capping outliers are provided below

Data <- df
# Data cleansing
outcap(ozone_dat$climate_value)
Data$Ozone<- x
outcap(Wind_dat$climate_value)
Data$Wind<- x
Data[c(9,18,48,30,62,99,101,117,121),c(1,3)]

##     Ozone Wind
## 9       8 15.5
## 18      6 15.5
## 48     37 15.5
## 30     97  5.7
## 62     97  4.1
## 99     97  4.0
## 101    97  8.0
## 117    97  3.4
## 121    97  2.3

Data Transformation and Analysis

(A) “Normalization Ozone thru shifting and log transforms”

#vizualization tool for comparing normality distributions
Viz <- function(x){
  h1 <- hist(x,col="yellow",main="Histogram",xlab="Variables");
  xfit<-seq(0,max(x),length=50) 
  yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
  yfit <- yfit*diff(h1$mids[1:2])*length(x)
  lines(xfit, yfit, col="blue", lwd=2)
}

mean_ozone <- ozone_dat %>% 
             summarise(mean(climate_value))
meanozone <- mean_ozone[1,1] # mean of ozone

sd_ozone <- ozone_dat %>% 
             summarise(sd(climate_value))
sdozone <- sd_ozone[1,1] # mean of ozone

hist(ozone_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:150
y <- dnorm(x = x, mean = meanozone, sd = sdozone)
lines(x = x, y = y, col = "blue")

Viz(ozone_dat$climate_value)

check <- abs(ozone_dat$climate_value-15)

Ozone_T <- log(check)

Viz(Ozone_T)

Q-Q Plot

After shifting the values and logging it, Ozone is now more normally distributed; its

definitely a step in the right direction

par(mfrow=c(1,2))
qqnorm(ozone_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(ozone_dat$climate_value, col = 2)
qqnorm(Ozone_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Ozone_T , col = 2)

Q-Q Plot - Ozone

The after Q-Q plot showed a more normal distribution set than before, especially at the left

tail end. However, the middle can be further improved for sure, and that is left for future

exercise

(B) “Normalization of Solar Radiation thru squaring transforms”

mean_solar <- Solar_dat %>% 
             summarise(mean(climate_value))
meansolar <- mean_solar[1,1] # mean of solar

sd_solar <- Solar_dat %>% 
             summarise(sd(climate_value))
sdsolar <- sd_solar[1,1] # Std dev of solar

hist(Solar_dat$climate_value, probability = TRUE, ylim = c(0, 0.11), col = "yellow")
x <- 0:350
y <- dnorm(x = x, mean = meansolar, sd = sdsolar)
lines(x = x, y = y, col = "blue")

Viz(Solar_dat$climate_value)

check <- abs(Solar_dat$climate_value)^2

Solar_T <- (check)

Viz(Solar_T)

par(mfrow=c(1,2))
qqnorm(Solar_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_dat$climate_value, col = 2)
qqnorm(Solar_T ,pch=16,main="After Transformation",xlab="Sample quantiles of Solar Radiation",ylab="Theoretical quantiles")
qqline(Solar_T , col = 2)


### Q-Q Plot - Solar Radiation

The after Q-Q plot showed a more normal distribution set than before, especially the middle

portion is above the 45

degree line.  However again like Ozone, the left can be improved further, and that is left for

future exercise

--------------------------------------------------------------------------------

\clearpag

### (C)  "Wind is already Normal - Q-Q plot ascertains it, see below; no need for transformation"

qqnorm(Wind_dat$climate_value ,pch=16,main="QQplot for Wind",xlab="Sample quantiles of Wind",ylab="Theoretical quantiles")
qqline(Wind_dat$climate_value , col = 2)

shapiro.test(Wind_dat$climate_value) #Shapiro test for normality; making sure its really normal

## 
##  Shapiro-Wilk normality test
## 
## data:  Wind_dat$climate_value
## W = 0.98575, p-value = 0.1178


### Shapiro Wilk test to make sure Wind is really Normal

Because P-value is greater than 0.05, we can be 95% confident that Wind data is normal for sure

(D) “Normalization of Temperature thru squaring transforms”

Viz(Temp_dat$climate_value)

Temperature is already quite normal, but lets see if squaring can make it better?

check <- abs(Temp_dat$climate_value)^2

temp_T <- (check)

Viz(temp_T)

Squaring, did make it slighlty better. Lets see the before and after Q-Q plots

par(mfrow=c(1,2))
qqnorm(Temp_dat$climate_value,pch=16,main="Before Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(Temp_dat$climate_value, col = 2)
qqnorm(temp_T,pch=16,main="After Transformation",xlab="Sample quantiles of Temp",ylab="Theoretical quantiles")
qqline(temp_T, col = 2)

shapiro.test(temp_T) #Shapiro to check for normality

## 
##  Shapiro-Wilk normality test
## 
## data:  temp_T
## W = 0.98546, p-value = 0.1089

Shapiro Wilk test to make sure Temperature is really Normal

Because P-value is greater than 0.05, we can be 95% confident that Temperature data is normal