A. About the Dataset

Data Set

New York Air Quality Measurements

Description

Daily air quality measurements in New York, May to September 1973.

Format

A data frame with 153 observations on 6 variables.

Ozone : numeric Ozone (ppb)
Solar.R : numeric Solar R (lang)
Wind : numeric Wind (mph)
Temp : numeric Temperature (degrees F)
Month : numeric Month (1–12)
Day : numeric Day of month (1–31)

Details

Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000-7700 Angstroms from 0800 to 1200 hours at Central Park
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.

Source:

The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data).

B. Analyzing the Structure of the Data

#Packages Deployed
load.libraries <- c('e1071','dplyr','datasets','VIM','mice')
#if not present identify and download
install.lib <- load.libraries[!load.libraries %in% installed.packages()]
sapply(install.lib, install.packages,repos = "http://cran.us.r-project.org")
#Load the packages
sapply(load.libraries, require, character = TRUE)

Here we show the Data Structure of the dataset using the class() function, Dimension of the dataset by dim() and Data types of each variable by using glimpse() function from base and dplyr packages in R.Next identify the data type and category of the variables.

DataStructure

#Data Structure of the Dataset
class(airquality)

[1] "data.frame"

Dimension

#Dimension of the Dataset
dim(airquality)

[1] 153   6

Data Types

#Data Type of Each Variable in the Dataset
glimpse(airquality)

Observations: 153
Variables: 6
$ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, ...
$ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256,...
$ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6...
$ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68...
$ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
$ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...

The Data Structure of the dataset is a dataframe with Dimension of 153 rows and 6 columns and Data types are int and dbl i.e. integer and double(numeric) respectively.

Category of the Variable

The variable with int and dbl data type are of the format of continuous variable.

C. Outlier Detection and Treatment

Outliers

For a given continuous variable, outliers are those observations that lie outside 1.5 * IQR, where IQR, the ‘Inter Quartile Range’ is the difference between 75th and 25th quartiles.

Detecting Outliers

Visualization and Mathematical Methods

For Visualization Methods Boxplot with range 1.5 and Histogram with break 15 is used to get a clear idea about the data. Quantile Capping Method is used to detect the outliers(Mathematically) in the data for each variable after Visualization.

#This function detects Outliers by quantile capping method
outdetect <- function(c,w=1.5)
{
  h <- w*IQR(c,na.rm = T)
  q <- quantile(c,probs=c(.25, .75),na.rm = T)
  if(length(which(q[1]-h>c))==0)
    cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile\n")
  else
    cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile","on rows",which(q[1]-h>c),"and the values are",boxplot.stats(c)$out,"\n")
  if(length(which(q[2]+h<c))==0)
    cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile\n")
  else
    cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile","on rows",which(q[2]+h<c),"and the values are",boxplot.stats(c)$out,"\n")
}

Ozone

par(mfrow=c(1,2))
boxplot(airquality$Ozone,col = "antiquewhite3",main = "Boxplot of Ozone",outcol="Blue",outpch=19,boxwex=0.7,range = 1.5)
hist(airquality$Ozone,col = "antiquewhite3",main = "Histogram of Ozone", xlab = "Observations",breaks = 15)

outdetect(airquality$Ozone)

There are 0 observations below the 1st quantile
There are 2 observations above the 3rd quantile on rows 62 117 and the values are 135 168

Solar.R

par(mfrow=c(1,2))
boxplot(airquality$Solar.R,col = "antiquewhite3",main = "Boxplot of Solar.R",outcol="Blue",outpch=19,boxwex=0.7,range = 1.5)
hist(airquality$Solar.R,col = "antiquewhite3",main = "Histogram of Solar.R", xlab = "Observations",breaks = 15)

outdetect(airquality$Solar.R)

There are 0 observations below the 1st quantile
There are 0 observations above the 3rd quantile

Wind

par(mfrow=c(1,2))
boxplot(airquality$Wind,col = "antiquewhite3",main = "Boxplot of Wind",outcol="Blue",outpch=19,boxwex=0.7,range = 1.5)
hist(airquality$Wind,col = "antiquewhite3",main = "Histogram of Wind", xlab = "Observations",breaks = 15)

outdetect(airquality$Wind)

There are 0 observations below the 1st quantile
There are 3 observations above the 3rd quantile on rows 9 18 48 and the values are 20.1 18.4 20.7

Temp

par(mfrow=c(1,2))
boxplot(airquality$Temp,col = "antiquewhite3",main = "Boxplot of Temp",outcol="Blue",outpch=19,boxwex=0.7,range = 1.5)
hist(airquality$Temp,col = "antiquewhite3",main = "Histogram of Temp", xlab = "Observations",breaks = 15)

outdetect(airquality$Temp)

There are 0 observations below the 1st quantile
There are 0 observations above the 3rd quantile

From the Above Boxplots we can see that in Ozone and Wind Variables there are Outliers(the blue dots). Also on Histogram of both the variables Ozone and Wind we can see that there is a gap between observations at extreme i.e. In Ozone Histogram there is one gap in the chart and in Wind Histogram there are two gaps in chart, so they are Outliers.

Outliers Treatment

Since the number of outliers in the dataset is very small the best approach is to remove them and carry on with the analysis but capping method can also be used. Percentile Capping is a method of imputing the outlier values by replacing those observations outside the lower limit with the value of 5th percentile and those that lie above the upper limit, with the value of 95th percentile of the same dataset.

outcap <- function(x)
{
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
sum(x > (qnt[2] + H))
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
x<<-x
}

Data <- airquality
outcap(Data$Ozone)
Data$Ozone <- x
outcap(Data$Wind)
Data$Wind <- x
Data[c(9,18,48,62,117),c(1,3)]

Now check the 9th, 18th and 48th value of Wind and 62nd 117th value of Ozone variable the Varibles are free of Outliers so we move to treat Missing Values.

D. Missing Value Detection and Treatment

Missing Values

Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model because we have not analysed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.

Detecting Missing Values

Mathematical Methods

To check for Missing Value call the summary() on the dataset.

Summary of the Data

#Summary of the Data
summary(Data[-c(5,6)])

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 41.39   Mean   :185.9   Mean   : 9.875   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :122.00   Max.   :334.0   Max.   :16.600   Max.   :97.00  
 NA's   :37       NA's   :7

See that on Ozone there are 37 NA’s and on Solar.R there are 7 NA’s.

Names of the Columns which contains Missing Values

#Names of the Variables which contains Missing Values
colnames(Data)[colSums(is.na(Data)) > 0]

[1] "Ozone"   "Solar.R"

Percentage of Columns and Rows which contains Missing Values

Assuming the data is Missing Completly At Random, too much missing data can be a problem too. A safe maximum threshold is 5% of the total for large datasets. If missing data for a certain Variable is more than 5% then leave that Variable out. Let’s check the columns and rows where more than 5% of the data is missing using a simple function :-

PerMiss <- function(x){sum(is.na(x))/length(x)*100}

Columns

apply(Data[c(1,2)],2,PerMiss)

    Ozone   Solar.R 
24.183007  4.575163

Rows

apply(Data,1,PerMiss)

  [1]  0.00000  0.00000  0.00000  0.00000 33.33333 16.66667  0.00000
  [8]  0.00000  0.00000 16.66667 16.66667  0.00000  0.00000  0.00000
 [15]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
 [22]  0.00000  0.00000  0.00000 16.66667 16.66667 33.33333  0.00000
 [29]  0.00000  0.00000  0.00000 16.66667 16.66667 16.66667 16.66667
 [36] 16.66667 16.66667  0.00000 16.66667  0.00000  0.00000 16.66667
 [43] 16.66667  0.00000 16.66667 16.66667  0.00000  0.00000  0.00000
 [50]  0.00000  0.00000 16.66667 16.66667 16.66667 16.66667 16.66667
 [57] 16.66667 16.66667 16.66667 16.66667 16.66667  0.00000  0.00000
 [64]  0.00000 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000
 [71]  0.00000 16.66667  0.00000  0.00000 16.66667  0.00000  0.00000
 [78]  0.00000  0.00000  0.00000  0.00000  0.00000 16.66667 16.66667
 [85]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
 [92]  0.00000  0.00000  0.00000  0.00000 16.66667 16.66667 16.66667
 [99]  0.00000  0.00000  0.00000 16.66667 16.66667  0.00000  0.00000
[106]  0.00000 16.66667  0.00000  0.00000  0.00000  0.00000  0.00000
[113]  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000 16.66667
[120]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[127]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[134]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[141]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[148]  0.00000  0.00000 16.66667  0.00000  0.00000  0.00000

We see that Ozone is missing almost 25% of the datapoints, therefore we might consider either dropping it from the analysis or gather more measurements. The Wind variables have below 5% threshold so we can keep them. As far as the Rows are concerned, missing just one feature leads to a 17% missing data per sample. Row Observations that are missing 2 or more Variables (>34%), should be dropped if possible.

Patterns and Visualizations of Missing Values

The mice package provides a nice function md.pattern() to get a better understanding of the pattern of missing data. Patterns

library(mice)
md.pattern(Data)

    Wind Temp Month Day Solar.R Ozone   
111    1    1     1   1       1     1  0
 35    1    1     1   1       1     0  1
  5    1    1     1   1       0     1  1
  2    1    1     1   1       0     0  2
       0    0     0   0       7    37 44

The output tells us that 44 samples are complete, 37 samples miss only the Ozone measurement, 7 samples miss only the Solar.R value.

A perhaps more helpful visual representation can be obtained using the VIM package as follows

Visualizations

library(VIM)
aggr_plot <-  aggr(Data[c(1,2)], col=c('antiquewhite3','antiquewhite1'), numbers=TRUE, sortVars=TRUE, labels=names(Data[c(1,2)]), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))


 Variables sorted by number of missings: 
 Variable      Count
    Ozone 0.24183007
  Solar.R 0.04575163

The plot helps us understanding that almost 72% of the samples are not missing any information, 24% are missing the Ozone value, and the remaining ones show other missing patterns. Through this approach the situation looks a bit clearer in my opinion.

Another helpful visual approach is a special box plot

marginplot(Data[c(1,2)], col=c('antiquewhite3','antiquewhite1'))

Obviously here we are constrained at plotting 2 variables at a time only, but nevertheless we can gather some interesting insights. The red box plot on the left shows the distribution of Solar.R with Ozone missing while the blue box plot shows the distribution of the remaining datapoints. Likewise for the Ozone box plots at the bottom of the graph. If our assumption of MCAR data is correct, then we expect the red and blue box plots to be very similar.

Missing Value Treatment

The mice() function takes care of the imputing process.PMM (Predictive Mean Matching) - technique is used because it is suitable for numeric variables.

tempData <- mice(Data,m=5,maxit=50,meth='pmm',seed=500)

Summary of Missing Values

summary(tempData)

Multiply imputed data set
Call:
mice(data = Data, m = 5, method = "pmm", maxit = 50, seed = 500)
Number of multiple imputations:  5
Missing cells per column:
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0 
Imputation methods:
  Ozone Solar.R    Wind    Temp   Month     Day 
  "pmm"   "pmm"   "pmm"   "pmm"   "pmm"   "pmm" 
VisitSequence:
  Ozone Solar.R 
      1       2 
PredictorMatrix:
        Ozone Solar.R Wind Temp Month Day
Ozone       0       1    1    1     1   1
Solar.R     1       0    1    1     1   1
Wind        0       0    0    0     0   0
Temp        0       0    0    0     0   0
Month       0       0    0    0     0   0
Day         0       0    0    0     0   0
Random generator seed value:  500

Data <- complete(tempData,1)

Summary of the Dataset(Before Treating Missing Values and Outlier)

summary(airquality[c(1,2)])

     Ozone           Solar.R     
 Min.   :  1.00   Min.   :  7.0  
 1st Qu.: 18.00   1st Qu.:115.8  
 Median : 31.50   Median :205.0  
 Mean   : 42.13   Mean   :185.9  
 3rd Qu.: 63.25   3rd Qu.:258.8  
 Max.   :168.00   Max.   :334.0  
 NA's   :37       NA's   :7

Summary of the Dataset(after Treating Missing Values and Outlier)

summary(Data[c(1,2)])

     Ozone           Solar.R     
 Min.   :  1.00   Min.   :  7.0  
 1st Qu.: 16.00   1st Qu.:115.0  
 Median : 30.00   Median :203.0  
 Mean   : 40.15   Mean   :185.3  
 3rd Qu.: 59.00   3rd Qu.:258.0  
 Max.   :122.00   Max.   :334.0

Conclusions If we compare the summaries of both the Datasets we can see that the values are not so much deviated from their respective summaries, so we can conclude that taking Percentile Capping as Outlier Imputation and Predictive Mean Matching as Missing Value Imputation was a right choice.

E. Feature Engineering and Analysis

Skewness Transformation

Viz <- function(x){
  h1 <- hist(x,col="antiquewhite3",main="Histogram",xlab="Variables");
  xfit<-seq(min(x),max(x),length=40) 
  yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
  yfit <- yfit*diff(h1$mids[1:2])*length(x)
  lines(xfit, yfit, col="black", lwd=2)
}

For Ozone

Before Transformation Histogram

Viz(Data$Ozone)

skewness(Data$Ozone)

[1] 0.9819931

The histogram with the density curve of Ozone clearly shows that tail of the distribution lie towards right and thus the variable is Right Skewed.So we need to transform the Variable.

After Transformation Histogram

Ozone_T <- log(Data$Ozone+1)
Viz(Ozone_T)

skewness(Ozone_T)

[1] -0.2845094

After Log Transformation the histogram with the density curve of Ozone clearly shows that maximum frequency of the values lie slightly towards left and thus the variable is nearly skewed and so the data is from normal population. Also the skewness is much close to 0.

QQPlot

par(mfrow=c(1,2))
qqnorm(Data$Ozone,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Data$Ozone, col = 2)
qqnorm(log(Data$Ozone+1),pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(log(Data$Ozone+1), col = 2)

QQPlot The above QQ plot clearly shows that most of the values lies above the normal line but more or less close to it. So we can interpret that the data is surely from a normal distribution.

Hypothesis testing:

shapiro.test(Ozone_T)


    Shapiro-Wilk normality test

data:  Ozone_T
W = 0.97695, p-value = 0.01139

If p is more than .01 {\(\alpha\)}, we can be 99% {\((1-\alpha)100\%\)} certain that the data are normally distributed.

For Solar.R

Before Transformation Histogram

Viz(Data$Solar.R)

skewness(Data$Solar.R)

[1] -0.4044689

The histogram with the density curve of Solar.R clearly shows that tail of the distribution lie towards left and thus the variable is Left Skewed.So we need to transform the Variable.

After Transformation Histogram

Solar_T <- (Data$Solar.R)^2
Viz(Solar_T)

skewness(Solar_T)

[1] 0.2455632

After Square Transformation the histogram with the density curve of Solar.R clearly shows that maximum frequency of the values lie slightly towards left and thus the variable is nearly skewed and so the data is from normal population. Also the skewness is much close to 0.

QQPlot

par(mfrow=c(1,2))
qqnorm(Data$Solar.R,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Data$Solar.R, col = 2)
qqnorm(Solar_T,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Solar_T, col = 2)

QQPlot The above QQ plot clearly shows that most of the values lies above the normal line but more or less close to it. So we can interpret that the data is surely from a normal distribution.

shapiro.test(Solar_T)


    Shapiro-Wilk normality test

data:  Solar_T
W = 0.94474, p-value = 1.015e-05

The distribution is slightly Skewed and is not normal even after transformation but it is not skewed like before.

For Wind

Histogram

Viz(Data$Wind)

skewness(Data$Wind)  #right positive

[1] 0.05791221

The histogram with the density curve of Wind clearly shows that the distribution is normally distributed.

QQPlot

qqnorm(Data$Wind,pch=16,main="QQplot for Wind",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Data$Wind, col = 2)

QQPlot The above QQ plot clearly shows that most of the values lies above the normal line but more or less close to it. So we can interpret that the data is surely from a normal distribution.

Hypothesis testing:

shapiro.test(Data$Wind)


    Shapiro-Wilk normality test

data:  Data$Wind
W = 0.98177, p-value = 0.04044

If p is more than .01 {\(\alpha\)}, we can be 99% {\((1-\alpha)100\%\)} certain that the data are normally distributed.

For Temp

Before Transformation Histogram

Viz(Data$Temp)

skewness(Data$Temp)

[1] -0.3705073

The histogram with the density curve of Ozone clearly shows that tail of the distribution lie towards left and thus the variable is Left Skewed.So we need to transform the Variable.

After Transformation Histogram

Temp_T <- Data$Temp^2
Viz(Temp_T)

skewness(Temp_T)

[1] -0.1080352

After Square Transformation the histogram with the density curve of Temp clearly shows that the variable is slightly skewed and so the data is from normal population. Also the skewness is much close to 0.

QQPlot

par(mfrow=c(1,2))
qqnorm(Data$Temp,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Data$Temp, col = 2)
qqnorm(Temp_T,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Temp_T, col = 2)

QQPlot The above QQ plot clearly shows that most of the values lies above the normal line but more or less close to it. So we can interpret that the data is surely from a normal distribution.

Hypothesis testing:

shapiro.test(Temp_T)


    Shapiro-Wilk normality test

data:  Temp_T
W = 0.98546, p-value = 0.1089

If p is more than .01 {\(\alpha\)}, we can be 99% {\((1-\alpha)100\%\)} certain that the data are normally distributed.

Data$Ozone <- Ozone_T
Data$Solar.R <- Solar_T
Data$Temp <- Temp_T

Scale Transformation

This transformation is a must if you have data in different scales, this transformation does not change the shape of the variable distribution.

T <- Data[c(5,6)]
Data <- apply(Data[-c(5,6)],2,scale)
Data <- cbind(Data,T)
head(Data,50)

Now that the Data has been transformed and made consistent these are very important steps in Exploratory Data Analysis before Model Fitting. The quality and effort invested in data exploration can make a diffrence in building a good model from bad model.

Exploratory Data Analysis on New York Air Quality Dataset

Arjun Dutta

31 Jan 2018

A. About the Dataset

Data Set

Description

Format

Details

Source:

B. Analyzing the Structure of the Data

DataStructure

Dimension

Data Types

Category of the Variable

C. Outlier Detection and Treatment

Outliers

Detecting Outliers

Outliers Treatment

D. Missing Value Detection and Treatment

Missing Values

Detecting Missing Values

Missing Value Treatment

E. Feature Engineering and Analysis

Skewness Transformation

For Ozone

For Solar.R

For Wind

For Temp

Scale Transformation