This data set has daily air quality measurements in New York from May to September 1973 over a period of 5 months.
| Ozone |
numeric |
parts per billion |
mean Ozone concentration |
Roosevelt island |
| Solar.R |
numeric |
|
Solar radiaion |
central park |
| Wind |
numeric |
miles per hour |
average wind speed |
LaGuardia airport |
| Temp |
numeric |
Fahrenheit |
maximum daily temperature |
LaGuardia airport |
| Month |
numeric |
|
Month of observation |
5 values |
| Day |
numeric |
|
day of month |
153 days total |
|
|
|
|
|
Necessary packages
require(ggplot2)
require(moments)
Let’s make a copy of the dataset for the following analysis.
aq <- data.frame(airquality)
Names of columns
colnames(aq)
[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
Let’s look at some of the rows:
head(aq)
Several entries are not available. Let’s summarize:
colSums(is.na(aq))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
Day should be a factor variable
aq$Day <- factor(aq$Day, levels=c(1:31), ordered=TRUE)
Month should be a factor variable
aq$Month <- factor(aq$Month, levels=5:9, labels=month.abb[5:9], ordered=TRUE)
Summary statistiscs for all the variables:
summary(aq)
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
Month Day
May:31 1 : 5
Jun:30 2 : 5
Jul:31 3 : 5
Aug:31 4 : 5
Sep:30 5 : 5
6 : 5
(Other):123
Variables
Ozone
summary(aq$Ozone)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.00 18.00 31.50 42.13 63.25 168.00 37
Histogram
ggplot(data=aq) + geom_histogram(mapping=aes(Ozone))

The histogram appears assymetric.
We can check the normalcy of the distribution by measuring its skewness and kurtosis
skewness(aq$Ozone, na.rm = T)
[1] 1.225681
The data is skewed heavily towards right.
Let’s also measure kurtosis:
kurtosis(aq$Ozone, na.rm=T)
[1] 4.184071
This is leptokurtic.
Let’s apply a transformation to see if the histogram becomes more symmetric:
aq$Ozone.sqrt <- sqrt(aq$Ozone)
ggplot(data=aq) + geom_histogram(mapping=aes(Ozone.sqrt))

We have stored the square root of Ozone concentration values in the aq dataframe itself as a new column. This is indeed more symmetric and looking like normal distribution.
Let’s verify the skewness:
skewness(aq$Ozone.sqrt, na.rm=T)
[1] 0.5143488
Also the kurtosis:
kurtosis(aq$Ozone.sqrt, na.rm=T)
[1] 2.561186
There is indeed improvement in skewness.
Seasonal variation in data:
qplot(Month, Ozone, data=aq, geom="boxplot", color=Month)

Solar.R
summary(aq$Solar.R)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
7.0 115.8 205.0 185.9 258.8 334.0 7
Histogram
ggplot(data=aq) + geom_histogram(mapping=aes(Solar.R))

Seasonal variation in data:
qplot(Month, Solar.R, data=aq, geom="boxplot", color=Month)

Wind
summary(aq$Wind)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.700 7.400 9.700 9.958 11.500 20.700
Histogram
ggplot(data=aq) + geom_histogram(mapping=aes(Wind))

Let’s verify the normalcy of the distribution
skewness(aq$Wind)
[1] 0.3443985
kurtosis(aq$Wind)
[1] 3.068849
This is very close to a normal distribution.
Seasonal variation in data:
qplot(Month, Wind, data=aq, geom="boxplot", color=Month)

Temp
summary(aq$Temp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
56.00 72.00 79.00 77.88 85.00 97.00
Seasonal variation in data:
qplot(Month, Temp, data=aq, geom="boxplot", color=Month)

Month
table(aq$Month)
May Jun Jul Aug Sep
31 30 31 31 30
This just reflects the number of days in each month from May to September.
Day
table(aq$Day)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
25 26 27 28 29 30 31
5 5 5 5 5 5 3
As expected, day 31 appears only 3 times.
Correlations
Let us find out the Pearson correlation coefficients between the different variables in the dataset:
# Only numerical variables can be used for correlation
columns <- c('Ozone.sqrt', 'Solar.R', 'Wind', 'Temp')
# We identify non-na rows as rows with na values cause problems
rows <- rowSums(is.na(aq)) == 0
# Compute correlation of variables and round to 2 digits
round(cor(aq[rows, columns]), 2)
Ozone.sqrt Solar.R Wind Temp
Ozone.sqrt 1.00 0.40 -0.61 0.75
Solar.R 0.40 1.00 -0.13 0.29
Wind -0.61 -0.13 1.00 -0.50
Temp 0.75 0.29 -0.50 1.00
- Solar Radiation is moderately correlated with ozone concentration.
- wind and ozone seem strongly correlated.
- Temperature and ozone are strongly correlated.
- Very weak correlation between wind and solar radiation.
- Temperature and solar radition are weakly correlated.
- Temperature and wind are moderately correlated.
Let us verify these correlations using scatter plots.
Very weak correlation:
qplot(Solar.R, Wind, data=aq, geom="point", color=Month)

Weak correlation:
qplot(Temp, Solar.R, data=aq, geom="point", color=Month)

Moderate correlation:
qplot(Solar.R, Ozone.sqrt, data=aq, color=Month, geom="point")

qplot(Temp, Wind, data=aq, color=Month, geom="point")

Strong correlation
qplot(Wind, Ozone.sqrt, data=aq, color=Month, geom='point')

qplot(Temp, Ozone.sqrt, data=aq, color=Month, geom='point')

LS0tDQp0aXRsZTogIkV4cGxvcmF0aW9uIG9mIEFpciBRdWFsaXR5IERhdGFzZXQiDQphdXRob3I6ICJTaGFpbGVzaCBLdW1hciINCm91dHB1dDogDQogIGh0bWxfbm90ZWJvb2s6DQogICAgdG9jOiB0cnVlDQogICAgbnVtYmVyX3NlY3Rpb25zOiB0cnVlDQotLS0NClRoaXMgZGF0YSBzZXQgaGFzIGRhaWx5IGFpciBxdWFsaXR5IG1lYXN1cmVtZW50cyBpbiBOZXcgWW9yayBmcm9tDQpNYXkgdG8gU2VwdGVtYmVyIDE5NzMgb3ZlciBhIHBlcmlvZCBvZiA1IG1vbnRocy4gDQoNCnwgVmFyaWFibGUgfCBUeXBlIHwgVW5pdCB8IERlc2NyaXB0aW9uIHwgRGV0YWlsIHwNCnwtLS0tLS0tLS0tfC0tLS0tLXwtLS0tLS18LS0tLS0tLS0tLS0tLXwtLS0tLS0tLXwNCnwgT3pvbmUgfCBudW1lcmljIHwgcGFydHMgcGVyIGJpbGxpb24gfCBtZWFuIE96b25lIGNvbmNlbnRyYXRpb24gfCAgUm9vc2V2ZWx0IGlzbGFuZCB8DQp8IFNvbGFyLlIgfCBudW1lcmljIHwgIHwgU29sYXIgcmFkaWFpb24gfCBjZW50cmFsIHBhcmsgfCANCnwgV2luZCB8IG51bWVyaWMgfCBtaWxlcyBwZXIgaG91ciB8IGF2ZXJhZ2Ugd2luZCBzcGVlZCB8IExhR3VhcmRpYSBhaXJwb3J0IHwNCnwgVGVtcCB8bnVtZXJpYyB8IEZhaHJlbmhlaXQgfCBtYXhpbXVtIGRhaWx5IHRlbXBlcmF0dXJlIHwgTGFHdWFyZGlhIGFpcnBvcnQgfA0KfCBNb250aCB8IG51bWVyaWMgfCAgfCBNb250aCBvZiBvYnNlcnZhdGlvbiB8IDUgdmFsdWVzIHwNCnwgRGF5IHwgIG51bWVyaWMgfCAgfCBkYXkgb2YgbW9udGggfCAxNTMgZGF5cyB0b3RhbCB8DQp8ICB8IHwgfCB8IHwNCg0KDQpOZWNlc3NhcnkgcGFja2FnZXMNCmBgYHtyfQ0KcmVxdWlyZShnZ3Bsb3QyKQ0KcmVxdWlyZShtb21lbnRzKQ0KYGBgDQpMZXQncyBtYWtlIGEgY29weSBvZiB0aGUgZGF0YXNldCBmb3IgdGhlIA0KZm9sbG93aW5nIGFuYWx5c2lzLiANCmBgYHtyfQ0KYXEgPC0gZGF0YS5mcmFtZShhaXJxdWFsaXR5KQ0KYGBgDQoNCg0KTmFtZXMgb2YgY29sdW1ucw0KYGBge3J9DQpjb2xuYW1lcyhhcSkNCmBgYA0KDQoNCkxldCdzIGxvb2sgYXQgc29tZSBvZiB0aGUgcm93czoNCg0KYGBge3J9DQpoZWFkKGFxKQ0KYGBgDQoNClNldmVyYWwgZW50cmllcyBhcmUgbm90IGF2YWlsYWJsZS4gTGV0J3Mgc3VtbWFyaXplOg0KDQpgYGB7cn0NCmNvbFN1bXMoaXMubmEoYXEpKQ0KYGBgDQoNCg0KRGF5IHNob3VsZCBiZSBhIGZhY3RvciB2YXJpYWJsZQ0KYGBge3J9DQphcSREYXkgPC0gZmFjdG9yKGFxJERheSwgbGV2ZWxzPWMoMTozMSksIG9yZGVyZWQ9VFJVRSkNCmBgYA0KTW9udGggc2hvdWxkIGJlIGEgZmFjdG9yIHZhcmlhYmxlDQpgYGB7cn0NCmFxJE1vbnRoIDwtIGZhY3RvcihhcSRNb250aCwgbGV2ZWxzPTU6OSwgbGFiZWxzPW1vbnRoLmFiYls1OjldLCBvcmRlcmVkPVRSVUUpDQpgYGANCg0KDQpTdW1tYXJ5IHN0YXRpc3Rpc2NzIGZvciBhbGwgdGhlIHZhcmlhYmxlczoNCmBgYHtyfQ0Kc3VtbWFyeShhcSkNCmBgYA0KDQojIFZhcmlhYmxlcw0KDQojIyBPem9uZQ0KDQpgYGB7cn0NCnN1bW1hcnkoYXEkT3pvbmUpDQpgYGANCg0KSGlzdG9ncmFtDQpgYGB7cn0NCmdncGxvdChkYXRhPWFxKSArIGdlb21faGlzdG9ncmFtKG1hcHBpbmc9YWVzKE96b25lKSkNCmBgYA0KDQpUaGUgaGlzdG9ncmFtIGFwcGVhcnMgYXNzeW1ldHJpYy4gDQoNCg0KV2UgY2FuIGNoZWNrIHRoZSBub3JtYWxjeSBvZiB0aGUgZGlzdHJpYnV0aW9uIGJ5IG1lYXN1cmluZyBpdHMgDQpza2V3bmVzcyBhbmQga3VydG9zaXMNCg0KYGBge3J9DQpza2V3bmVzcyhhcSRPem9uZSwgbmEucm0gPSBUKQ0KYGBgDQpUaGUgZGF0YSBpcyBza2V3ZWQgaGVhdmlseSB0b3dhcmRzIHJpZ2h0Lg0KDQpMZXQncyBhbHNvIG1lYXN1cmUga3VydG9zaXM6DQpgYGB7cn0NCmt1cnRvc2lzKGFxJE96b25lLCBuYS5ybT1UKQ0KYGBgDQpUaGlzIGlzICpsZXB0b2t1cnRpYyouDQoNCkxldCdzIGFwcGx5IGEgdHJhbnNmb3JtYXRpb24gdG8gc2VlIGlmIHRoZSBoaXN0b2dyYW0gYmVjb21lcyBtb3JlIHN5bW1ldHJpYzoNCg0KDQpgYGB7cn0NCmFxJE96b25lLnNxcnQgPC0gc3FydChhcSRPem9uZSkNCmdncGxvdChkYXRhPWFxKSArIGdlb21faGlzdG9ncmFtKG1hcHBpbmc9YWVzKE96b25lLnNxcnQpKQ0KYGBgDQpXZSBoYXZlIHN0b3JlZCB0aGUgc3F1YXJlIHJvb3Qgb2YgT3pvbmUgY29uY2VudHJhdGlvbiB2YWx1ZXMgaW4gDQp0aGUgYXEgZGF0YWZyYW1lIGl0c2VsZiBhcyBhIG5ldyBjb2x1bW4uDQpUaGlzIGlzIGluZGVlZCBtb3JlIHN5bW1ldHJpYyBhbmQgbG9va2luZyBsaWtlIG5vcm1hbCBkaXN0cmlidXRpb24uDQoNCkxldCdzIHZlcmlmeSB0aGUgc2tld25lc3M6DQpgYGB7cn0NCnNrZXduZXNzKGFxJE96b25lLnNxcnQsIG5hLnJtPVQpDQpgYGANCkFsc28gdGhlIGt1cnRvc2lzOg0KYGBge3J9DQprdXJ0b3NpcyhhcSRPem9uZS5zcXJ0LCBuYS5ybT1UKQ0KYGBgDQoNClRoZXJlIGlzIGluZGVlZCBpbXByb3ZlbWVudCBpbiBza2V3bmVzcy4NCg0KU2Vhc29uYWwgdmFyaWF0aW9uIGluIGRhdGE6DQpgYGB7cn0NCnFwbG90KE1vbnRoLCBPem9uZSwgZGF0YT1hcSwgZ2VvbT0iYm94cGxvdCIsIGNvbG9yPU1vbnRoKQ0KYGBgDQoNCg0KIyMgU29sYXIuUg0KDQpgYGB7cn0NCnN1bW1hcnkoYXEkU29sYXIuUikNCmBgYA0KDQpIaXN0b2dyYW0NCmBgYHtyfQ0KZ2dwbG90KGRhdGE9YXEpICsgZ2VvbV9oaXN0b2dyYW0obWFwcGluZz1hZXMoU29sYXIuUikpDQpgYGANCg0KU2Vhc29uYWwgdmFyaWF0aW9uIGluIGRhdGE6DQpgYGB7cn0NCnFwbG90KE1vbnRoLCBTb2xhci5SLCBkYXRhPWFxLCBnZW9tPSJib3hwbG90IiwgY29sb3I9TW9udGgpDQpgYGANCg0KIyMgV2luZA0KDQpgYGB7cn0NCnN1bW1hcnkoYXEkV2luZCkNCmBgYA0KDQoNCkhpc3RvZ3JhbQ0KYGBge3J9DQpnZ3Bsb3QoZGF0YT1hcSkgKyBnZW9tX2hpc3RvZ3JhbShtYXBwaW5nPWFlcyhXaW5kKSkNCmBgYA0KDQpMZXQncyB2ZXJpZnkgdGhlIG5vcm1hbGN5IG9mIHRoZSBkaXN0cmlidXRpb24NCmBgYHtyfQ0Kc2tld25lc3MoYXEkV2luZCkNCmt1cnRvc2lzKGFxJFdpbmQpDQpgYGANClRoaXMgaXMgdmVyeSBjbG9zZSB0byBhIG5vcm1hbCBkaXN0cmlidXRpb24uDQoNClNlYXNvbmFsIHZhcmlhdGlvbiBpbiBkYXRhOg0KYGBge3J9DQpxcGxvdChNb250aCwgV2luZCwgZGF0YT1hcSwgZ2VvbT0iYm94cGxvdCIsIGNvbG9yPU1vbnRoKQ0KYGBgDQoNCiMjIFRlbXANCg0KYGBge3J9DQpzdW1tYXJ5KGFxJFRlbXApDQpgYGANCg0KU2Vhc29uYWwgdmFyaWF0aW9uIGluIGRhdGE6DQpgYGB7cn0NCnFwbG90KE1vbnRoLCBUZW1wLCBkYXRhPWFxLCBnZW9tPSJib3hwbG90IiwgY29sb3I9TW9udGgpDQpgYGANCg0KIyMgTW9udGgNCg0KYGBge3J9DQp0YWJsZShhcSRNb250aCkNCmBgYA0KDQpUaGlzIGp1c3QgcmVmbGVjdHMgdGhlIG51bWJlciBvZiBkYXlzIGluIGVhY2ggbW9udGggZnJvbSBNYXkgdG8gU2VwdGVtYmVyLiANCg0KIyMgRGF5DQoNCmBgYHtyfQ0KdGFibGUoYXEkRGF5KQ0KYGBgDQoNCkFzIGV4cGVjdGVkLCBkYXkgMzEgYXBwZWFycyBvbmx5IDMgdGltZXMuIA0KDQoNCiMgQ29ycmVsYXRpb25zDQoNCkxldCB1cyBmaW5kIG91dCB0aGUgUGVhcnNvbiBjb3JyZWxhdGlvbiBjb2VmZmljaWVudHMgYmV0d2Vlbg0KdGhlIGRpZmZlcmVudCB2YXJpYWJsZXMgaW4gdGhlIGRhdGFzZXQ6DQoNCmBgYHtyfQ0KIyBPbmx5IG51bWVyaWNhbCB2YXJpYWJsZXMgY2FuIGJlIHVzZWQgZm9yIGNvcnJlbGF0aW9uDQpjb2x1bW5zIDwtIGMoJ096b25lLnNxcnQnLCAnU29sYXIuUicsICdXaW5kJywgJ1RlbXAnKQ0KIyBXZSBpZGVudGlmeSBub24tbmEgcm93cyBhcyByb3dzIHdpdGggbmEgdmFsdWVzIGNhdXNlIHByb2JsZW1zIA0Kcm93cyA8LSByb3dTdW1zKGlzLm5hKGFxKSkgPT0gMA0KIyBDb21wdXRlIGNvcnJlbGF0aW9uIG9mIHZhcmlhYmxlcyBhbmQgcm91bmQgdG8gMiBkaWdpdHMNCnJvdW5kKGNvcihhcVtyb3dzLCBjb2x1bW5zXSksIDIpDQpgYGANCg0KKiBTb2xhciBSYWRpYXRpb24gaXMgbW9kZXJhdGVseSBjb3JyZWxhdGVkIHdpdGggb3pvbmUgY29uY2VudHJhdGlvbi4NCiogd2luZCBhbmQgb3pvbmUgc2VlbSBzdHJvbmdseSBjb3JyZWxhdGVkLg0KKiBUZW1wZXJhdHVyZSBhbmQgb3pvbmUgYXJlIHN0cm9uZ2x5IGNvcnJlbGF0ZWQuIA0KKiBWZXJ5IHdlYWsgY29ycmVsYXRpb24gYmV0d2VlbiB3aW5kIGFuZCBzb2xhciByYWRpYXRpb24uDQoqIFRlbXBlcmF0dXJlIGFuZCBzb2xhciByYWRpdGlvbiBhcmUgd2Vha2x5IGNvcnJlbGF0ZWQuDQoqIFRlbXBlcmF0dXJlIGFuZCB3aW5kIGFyZSBtb2RlcmF0ZWx5IGNvcnJlbGF0ZWQuDQoNCg0KTGV0IHVzIHZlcmlmeSB0aGVzZSBjb3JyZWxhdGlvbnMgdXNpbmcgc2NhdHRlciBwbG90cy4NCg0KDQojI1Zlcnkgd2VhayBjb3JyZWxhdGlvbjogDQpgYGB7cn0NCnFwbG90KFNvbGFyLlIsIFdpbmQsIGRhdGE9YXEsIGdlb209InBvaW50IiwgY29sb3I9TW9udGgpDQpgYGANCg0KIyNXZWFrIGNvcnJlbGF0aW9uOg0KYGBge3J9DQpxcGxvdChUZW1wLCBTb2xhci5SLCBkYXRhPWFxLCBnZW9tPSJwb2ludCIsIGNvbG9yPU1vbnRoKQ0KYGBgDQoNCiMjTW9kZXJhdGUgY29ycmVsYXRpb246DQpgYGB7cn0NCnFwbG90KFNvbGFyLlIsIE96b25lLnNxcnQsIGRhdGE9YXEsIGNvbG9yPU1vbnRoLCBnZW9tPSJwb2ludCIpDQpgYGANCmBgYHtyfQ0KcXBsb3QoVGVtcCwgV2luZCwgZGF0YT1hcSwgY29sb3I9TW9udGgsIGdlb209InBvaW50IikNCmBgYA0KDQojI1N0cm9uZyBjb3JyZWxhdGlvbg0KYGBge3J9DQpxcGxvdChXaW5kLCBPem9uZS5zcXJ0LCBkYXRhPWFxLCBjb2xvcj1Nb250aCwgZ2VvbT0ncG9pbnQnKQ0KYGBgDQoNCmBgYHtyfQ0KcXBsb3QoVGVtcCwgT3pvbmUuc3FydCwgZGF0YT1hcSwgY29sb3I9TW9udGgsIGdlb209J3BvaW50JykNCmBgYA0KDQo=