New York Air Quality Measurements
Daily air quality measurements in New York, May to September 1973.
A data frame with 153 observations on 6 variables.
Ozone : numeric Ozone (ppb)
Solar.R : numeric Solar R (lang)
Wind : numeric Wind (mph)
Temp : numeric Temperature (degrees F)
Month : numeric Month (1–12)
Day : numeric Day of month (1–31)
Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.
Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000-7700 Angstroms from 0800 to 1200 hours at Central Park
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data).
#Packages Deployed
load.libraries <- c('e1071','dplyr','datasets','VIM','mice')
#if not present identify and download
install.lib <- load.libraries[!load.libraries %in% installed.packages()]
sapply(install.lib, install.packages,repos = "http://cran.us.r-project.org")
#Load the packages
sapply(load.libraries, require, character = TRUE)
Here we show the Data Structure of the dataset using the class() function, Dimension of the dataset by dim() and Data types of each variable by using glimpse() function from base and dplyr packages in R.Next identify the data type and category of the variables.
#Data Structure of the Dataset
class(airquality)
[1] "data.frame"
#Dimension of the Dataset
dim(airquality)
[1] 153 6
#Data Type of Each Variable in the Dataset
glimpse(airquality)
Observations: 153
Variables: 6
$ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, ...
$ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256,...
$ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6...
$ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68...
$ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
$ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
The Data Structure of the dataset is a dataframe with Dimension of 153 rows and 6 columns and Data types are int and dbl i.e. integer and double(numeric) respectively.
The variable with int and dbl data type are of the format of continuous variable.
For a given continuous variable, outliers are those observations that lie outside 1.5 * IQR, where IQR, the ‘Inter Quartile Range’ is the difference between 75th and 25th quartiles.
Visualization and Mathematical Methods
For Visualization Methods Boxplot with range 1.5 and Histogram with break 15 is used to get a clear idea about the data. Quantile Capping Method is used to detect the outliers(Mathematically) in the data for each variable after Visualization.
#This function detects Outliers by quantile capping method
outdetect <- function(c,w=1.5)
{
h <- w*IQR(c,na.rm = T)
q <- quantile(c,probs=c(.25, .75),na.rm = T)
if(length(which(q[1]-h>c))==0)
cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile\n")
else
cat("There are",sum(q[1]-h>c,na.rm = T),"observations below the 1st quantile","on rows",which(q[1]-h>c),"and the values are",boxplot.stats(c)$out,"\n")
if(length(which(q[2]+h<c))==0)
cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile\n")
else
cat("There are",sum(q[2]+h<c,na.rm = T),"observations above the 3rd quantile","on rows",which(q[2]+h<c),"and the values are",boxplot.stats(c)$out,"\n")
}
Ozone
par(mfrow=c(1,2))
boxplot(airquality$Ozone,col = "antiquewhite3",main = "Boxplot of Ozone",outcol="Blue",outpch=19,boxwex=0.7,range = 1.5)
hist(airquality$Ozone,col = "antiquewhite3",main = "Histogram of Ozone", xlab = "Observations",breaks = 15)
outdetect(airquality$Ozone)
There are 0 observations below the 1st quantile
There are 2 observations above the 3rd quantile on rows 62 117 and the values are 135 168
Solar.R
par(mfrow=c(1,2))
boxplot(airquality$Solar.R,col = "antiquewhite3",main = "Boxplot of Solar.R",outcol="Blue",outpch=19,boxwex=0.7,range = 1.5)
hist(airquality$Solar.R,col = "antiquewhite3",main = "Histogram of Solar.R", xlab = "Observations",breaks = 15)
outdetect(airquality$Solar.R)
There are 0 observations below the 1st quantile
There are 0 observations above the 3rd quantile
Wind
par(mfrow=c(1,2))
boxplot(airquality$Wind,col = "antiquewhite3",main = "Boxplot of Wind",outcol="Blue",outpch=19,boxwex=0.7,range = 1.5)
hist(airquality$Wind,col = "antiquewhite3",main = "Histogram of Wind", xlab = "Observations",breaks = 15)
outdetect(airquality$Wind)
There are 0 observations below the 1st quantile
There are 3 observations above the 3rd quantile on rows 9 18 48 and the values are 20.1 18.4 20.7
Temp
par(mfrow=c(1,2))
boxplot(airquality$Temp,col = "antiquewhite3",main = "Boxplot of Temp",outcol="Blue",outpch=19,boxwex=0.7,range = 1.5)
hist(airquality$Temp,col = "antiquewhite3",main = "Histogram of Temp", xlab = "Observations",breaks = 15)
outdetect(airquality$Temp)
There are 0 observations below the 1st quantile
There are 0 observations above the 3rd quantile
From the Above Boxplots we can see that in Ozone and Wind Variables there are Outliers(the blue dots). Also on Histogram of both the variables Ozone and Wind we can see that there is a gap between observations at extreme i.e. In Ozone Histogram there is one gap in the chart and in Wind Histogram there are two gaps in chart, so they are Outliers.
Since the number of outliers in the dataset is very small the best approach is to remove them and carry on with the analysis but capping method can also be used. Percentile Capping is a method of imputing the outlier values by replacing those observations outside the lower limit with the value of 5th percentile and those that lie above the upper limit, with the value of 95th percentile of the same dataset.
outcap <- function(x)
{
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
sum(x > (qnt[2] + H))
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
x<<-x
}
Data <- airquality
outcap(Data$Ozone)
Data$Ozone <- x
outcap(Data$Wind)
Data$Wind <- x
Data[c(9,18,48,62,117),c(1,3)]
Now check the 9th, 18th and 48th value of Wind and 62nd 117th value of Ozone variable the Varibles are free of Outliers so we move to treat Missing Values.
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model because we have not analysed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.
Mathematical Methods
To check for Missing Value call the summary() on the dataset.
Summary of the Data
#Summary of the Data
summary(Data[-c(5,6)])
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 41.39 Mean :185.9 Mean : 9.875 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :122.00 Max. :334.0 Max. :16.600 Max. :97.00
NA's :37 NA's :7
See that on Ozone there are 37 NA’s and on Solar.R there are 7 NA’s.
Names of the Columns which contains Missing Values
#Names of the Variables which contains Missing Values
colnames(Data)[colSums(is.na(Data)) > 0]
[1] "Ozone" "Solar.R"
Percentage of Columns and Rows which contains Missing Values
Assuming the data is Missing Completly At Random, too much missing data can be a problem too. A safe maximum threshold is 5% of the total for large datasets. If missing data for a certain Variable is more than 5% then leave that Variable out. Let’s check the columns and rows where more than 5% of the data is missing using a simple function :-
PerMiss <- function(x){sum(is.na(x))/length(x)*100}
Columns
apply(Data[c(1,2)],2,PerMiss)
Ozone Solar.R
24.183007 4.575163
Rows
apply(Data,1,PerMiss)
[1] 0.00000 0.00000 0.00000 0.00000 33.33333 16.66667 0.00000
[8] 0.00000 0.00000 16.66667 16.66667 0.00000 0.00000 0.00000
[15] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[22] 0.00000 0.00000 0.00000 16.66667 16.66667 33.33333 0.00000
[29] 0.00000 0.00000 0.00000 16.66667 16.66667 16.66667 16.66667
[36] 16.66667 16.66667 0.00000 16.66667 0.00000 0.00000 16.66667
[43] 16.66667 0.00000 16.66667 16.66667 0.00000 0.00000 0.00000
[50] 0.00000 0.00000 16.66667 16.66667 16.66667 16.66667 16.66667
[57] 16.66667 16.66667 16.66667 16.66667 16.66667 0.00000 0.00000
[64] 0.00000 16.66667 0.00000 0.00000 0.00000 0.00000 0.00000
[71] 0.00000 16.66667 0.00000 0.00000 16.66667 0.00000 0.00000
[78] 0.00000 0.00000 0.00000 0.00000 0.00000 16.66667 16.66667
[85] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[92] 0.00000 0.00000 0.00000 0.00000 16.66667 16.66667 16.66667
[99] 0.00000 0.00000 0.00000 16.66667 16.66667 0.00000 0.00000
[106] 0.00000 16.66667 0.00000 0.00000 0.00000 0.00000 0.00000
[113] 0.00000 0.00000 16.66667 0.00000 0.00000 0.00000 16.66667
[120] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[127] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[134] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[141] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[148] 0.00000 0.00000 16.66667 0.00000 0.00000 0.00000
We see that Ozone is missing almost 25% of the datapoints, therefore we might consider either dropping it from the analysis or gather more measurements. The Wind variables have below 5% threshold so we can keep them. As far as the Rows are concerned, missing just one feature leads to a 17% missing data per sample. Row Observations that are missing 2 or more Variables (>34%), should be dropped if possible.
Patterns and Visualizations of Missing Values
The mice package provides a nice function md.pattern() to get a better understanding of the pattern of missing data. Patterns
library(mice)
md.pattern(Data)
Wind Temp Month Day Solar.R Ozone
111 1 1 1 1 1 1 0
35 1 1 1 1 1 0 1
5 1 1 1 1 0 1 1
2 1 1 1 1 0 0 2
0 0 0 0 7 37 44
The output tells us that 44 samples are complete, 37 samples miss only the Ozone measurement, 7 samples miss only the Solar.R value.
A perhaps more helpful visual representation can be obtained using the VIM package as follows
Visualizations
library(VIM)
aggr_plot <- aggr(Data[c(1,2)], col=c('antiquewhite3','antiquewhite1'), numbers=TRUE, sortVars=TRUE, labels=names(Data[c(1,2)]), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
Variables sorted by number of missings:
Variable Count
Ozone 0.24183007
Solar.R 0.04575163
The plot helps us understanding that almost 72% of the samples are not missing any information, 24% are missing the Ozone value, and the remaining ones show other missing patterns. Through this approach the situation looks a bit clearer in my opinion.
Another helpful visual approach is a special box plot
marginplot(Data[c(1,2)], col=c('antiquewhite3','antiquewhite1'))
Obviously here we are constrained at plotting 2 variables at a time only, but nevertheless we can gather some interesting insights. The red box plot on the left shows the distribution of Solar.R with Ozone missing while the blue box plot shows the distribution of the remaining datapoints. Likewise for the Ozone box plots at the bottom of the graph. If our assumption of MCAR data is correct, then we expect the red and blue box plots to be very similar.
The mice() function takes care of the imputing process.PMM (Predictive Mean Matching) - technique is used because it is suitable for numeric variables.
tempData <- mice(Data,m=5,maxit=50,meth='pmm',seed=500)
Summary of Missing Values
summary(tempData)
Multiply imputed data set
Call:
mice(data = Data, m = 5, method = "pmm", maxit = 50, seed = 500)
Number of multiple imputations: 5
Missing cells per column:
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
Imputation methods:
Ozone Solar.R Wind Temp Month Day
"pmm" "pmm" "pmm" "pmm" "pmm" "pmm"
VisitSequence:
Ozone Solar.R
1 2
PredictorMatrix:
Ozone Solar.R Wind Temp Month Day
Ozone 0 1 1 1 1 1
Solar.R 1 0 1 1 1 1
Wind 0 0 0 0 0 0
Temp 0 0 0 0 0 0
Month 0 0 0 0 0 0
Day 0 0 0 0 0 0
Random generator seed value: 500
Data <- complete(tempData,1)
Summary of the Dataset(Before Treating Missing Values and Outlier)
summary(airquality[c(1,2)])
Ozone Solar.R
Min. : 1.00 Min. : 7.0
1st Qu.: 18.00 1st Qu.:115.8
Median : 31.50 Median :205.0
Mean : 42.13 Mean :185.9
3rd Qu.: 63.25 3rd Qu.:258.8
Max. :168.00 Max. :334.0
NA's :37 NA's :7
Summary of the Dataset(after Treating Missing Values and Outlier)
summary(Data[c(1,2)])
Ozone Solar.R
Min. : 1.00 Min. : 7.0
1st Qu.: 16.00 1st Qu.:115.0
Median : 30.00 Median :203.0
Mean : 40.15 Mean :185.3
3rd Qu.: 59.00 3rd Qu.:258.0
Max. :122.00 Max. :334.0
Conclusions If we compare the summaries of both the Datasets we can see that the values are not so much deviated from their respective summaries, so we can conclude that taking Percentile Capping as Outlier Imputation and Predictive Mean Matching as Missing Value Imputation was a right choice.
Viz <- function(x){
h1 <- hist(x,col="antiquewhite3",main="Histogram",xlab="Variables");
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h1$mids[1:2])*length(x)
lines(xfit, yfit, col="black", lwd=2)
}
Before Transformation Histogram
Viz(Data$Ozone)
skewness(Data$Ozone)
[1] 0.9819931
The histogram with the density curve of Ozone clearly shows that tail of the distribution lie towards right and thus the variable is Right Skewed.So we need to transform the Variable.
After Transformation Histogram
Ozone_T <- log(Data$Ozone+1)
Viz(Ozone_T)
skewness(Ozone_T)
[1] -0.2845094
After Log Transformation the histogram with the density curve of Ozone clearly shows that maximum frequency of the values lie slightly towards left and thus the variable is nearly skewed and so the data is from normal population. Also the skewness is much close to 0.
QQPlot
par(mfrow=c(1,2))
qqnorm(Data$Ozone,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Data$Ozone, col = 2)
qqnorm(log(Data$Ozone+1),pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(log(Data$Ozone+1), col = 2)
QQPlot The above QQ plot clearly shows that most of the values lies above the normal line but more or less close to it. So we can interpret that the data is surely from a normal distribution.
Hypothesis testing:
shapiro.test(Ozone_T)
Shapiro-Wilk normality test
data: Ozone_T
W = 0.97695, p-value = 0.01139
If p is more than .01 {\(\alpha\)}, we can be 99% {\((1-\alpha)100\%\)} certain that the data are normally distributed.
Before Transformation Histogram
Viz(Data$Solar.R)
skewness(Data$Solar.R)
[1] -0.4044689
The histogram with the density curve of Solar.R clearly shows that tail of the distribution lie towards left and thus the variable is Left Skewed.So we need to transform the Variable.
After Transformation Histogram
Solar_T <- (Data$Solar.R)^2
Viz(Solar_T)
skewness(Solar_T)
[1] 0.2455632
After Square Transformation the histogram with the density curve of Solar.R clearly shows that maximum frequency of the values lie slightly towards left and thus the variable is nearly skewed and so the data is from normal population. Also the skewness is much close to 0.
QQPlot
par(mfrow=c(1,2))
qqnorm(Data$Solar.R,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Data$Solar.R, col = 2)
qqnorm(Solar_T,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Solar_T, col = 2)
QQPlot The above QQ plot clearly shows that most of the values lies above the normal line but more or less close to it. So we can interpret that the data is surely from a normal distribution.
shapiro.test(Solar_T)
Shapiro-Wilk normality test
data: Solar_T
W = 0.94474, p-value = 1.015e-05
The distribution is slightly Skewed and is not normal even after transformation but it is not skewed like before.
Histogram
Viz(Data$Wind)
skewness(Data$Wind) #right positive
[1] 0.05791221
The histogram with the density curve of Wind clearly shows that the distribution is normally distributed.
QQPlot
qqnorm(Data$Wind,pch=16,main="QQplot for Wind",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Data$Wind, col = 2)
QQPlot The above QQ plot clearly shows that most of the values lies above the normal line but more or less close to it. So we can interpret that the data is surely from a normal distribution.
Hypothesis testing:
shapiro.test(Data$Wind)
Shapiro-Wilk normality test
data: Data$Wind
W = 0.98177, p-value = 0.04044
If p is more than .01 {\(\alpha\)}, we can be 99% {\((1-\alpha)100\%\)} certain that the data are normally distributed.
Before Transformation Histogram
Viz(Data$Temp)
skewness(Data$Temp)
[1] -0.3705073
The histogram with the density curve of Ozone clearly shows that tail of the distribution lie towards left and thus the variable is Left Skewed.So we need to transform the Variable.
After Transformation Histogram
Temp_T <- Data$Temp^2
Viz(Temp_T)
skewness(Temp_T)
[1] -0.1080352
After Square Transformation the histogram with the density curve of Temp clearly shows that the variable is slightly skewed and so the data is from normal population. Also the skewness is much close to 0.
QQPlot
par(mfrow=c(1,2))
qqnorm(Data$Temp,pch=16,main="Before Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Data$Temp, col = 2)
qqnorm(Temp_T,pch=16,main="After Transformation",xlab="Sample quantiles of Ozone",ylab="Theoretical quantiles")
qqline(Temp_T, col = 2)
QQPlot The above QQ plot clearly shows that most of the values lies above the normal line but more or less close to it. So we can interpret that the data is surely from a normal distribution.
Hypothesis testing:
shapiro.test(Temp_T)
Shapiro-Wilk normality test
data: Temp_T
W = 0.98546, p-value = 0.1089
If p is more than .01 {\(\alpha\)}, we can be 99% {\((1-\alpha)100\%\)} certain that the data are normally distributed.
Data$Ozone <- Ozone_T
Data$Solar.R <- Solar_T
Data$Temp <- Temp_T
This transformation is a must if you have data in different scales, this transformation does not change the shape of the variable distribution.
T <- Data[c(5,6)]
Data <- apply(Data[-c(5,6)],2,scale)
Data <- cbind(Data,T)
head(Data,50)
Now that the Data has been transformed and made consistent these are very important steps in Exploratory Data Analysis before Model Fitting. The quality and effort invested in data exploration can make a diffrence in building a good model from bad model.