This data set has daily air quality measurements in New York from May to September 1973 over a period of 5 months.

Variable Type Unit Description Detail
Ozone numeric parts per billion mean Ozone concentration Roosevelt island
Solar.R numeric Solar radiaion central park
Wind numeric miles per hour average wind speed LaGuardia airport
Temp numeric Fahrenheit maximum daily temperature LaGuardia airport
Month numeric Month of observation 5 values
Day numeric day of month 153 days total

Necessary packages

require(ggplot2)
require(moments)

Let’s make a copy of the dataset for the following analysis.

aq <- data.frame(airquality)

Names of columns

colnames(aq)
[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"    

Let’s look at some of the rows:

head(aq)

Several entries are not available. Let’s summarize:

colSums(is.na(aq))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0 

Day should be a factor variable

aq$Day <- factor(aq$Day, levels=c(1:31), ordered=TRUE)

Month should be a factor variable

aq$Month <- factor(aq$Month, levels=5:9, labels=month.abb[5:9], ordered=TRUE)

Summary statistiscs for all the variables:

summary(aq)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
 Month         Day     
 May:31   1      :  5  
 Jun:30   2      :  5  
 Jul:31   3      :  5  
 Aug:31   4      :  5  
 Sep:30   5      :  5  
          6      :  5  
          (Other):123  

1 Variables

1.1 Ozone

summary(aq$Ozone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00   18.00   31.50   42.13   63.25  168.00      37 

Histogram

ggplot(data=aq) + geom_histogram(mapping=aes(Ozone))

The histogram appears assymetric.

We can check the normalcy of the distribution by measuring its skewness and kurtosis

skewness(aq$Ozone, na.rm = T)
[1] 1.225681

The data is skewed heavily towards right.

Let’s also measure kurtosis:

kurtosis(aq$Ozone, na.rm=T)
[1] 4.184071

This is leptokurtic.

Let’s apply a transformation to see if the histogram becomes more symmetric:

aq$Ozone.sqrt <- sqrt(aq$Ozone)
ggplot(data=aq) + geom_histogram(mapping=aes(Ozone.sqrt))

We have stored the square root of Ozone concentration values in the aq dataframe itself as a new column. This is indeed more symmetric and looking like normal distribution.

Let’s verify the skewness:

skewness(aq$Ozone.sqrt, na.rm=T)
[1] 0.5143488

Also the kurtosis:

kurtosis(aq$Ozone.sqrt, na.rm=T)
[1] 2.561186

There is indeed improvement in skewness.

Seasonal variation in data:

qplot(Month, Ozone, data=aq, geom="boxplot", color=Month)

1.2 Solar.R

summary(aq$Solar.R)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    7.0   115.8   205.0   185.9   258.8   334.0       7 

Histogram

ggplot(data=aq) + geom_histogram(mapping=aes(Solar.R))

Seasonal variation in data:

qplot(Month, Solar.R, data=aq, geom="boxplot", color=Month)

1.3 Wind

summary(aq$Wind)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.700   7.400   9.700   9.958  11.500  20.700 

Histogram

ggplot(data=aq) + geom_histogram(mapping=aes(Wind))

Let’s verify the normalcy of the distribution

skewness(aq$Wind)
[1] 0.3443985
kurtosis(aq$Wind)
[1] 3.068849

This is very close to a normal distribution.

Seasonal variation in data:

qplot(Month, Wind, data=aq, geom="boxplot", color=Month)

1.4 Temp

summary(aq$Temp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  56.00   72.00   79.00   77.88   85.00   97.00 

Seasonal variation in data:

qplot(Month, Temp, data=aq, geom="boxplot", color=Month)

1.5 Month

table(aq$Month)

May Jun Jul Aug Sep 
 31  30  31  31  30 

This just reflects the number of days in each month from May to September.

1.6 Day

table(aq$Day)

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
 5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5 
25 26 27 28 29 30 31 
 5  5  5  5  5  5  3 

As expected, day 31 appears only 3 times.

2 Correlations

Let us find out the Pearson correlation coefficients between the different variables in the dataset:

# Only numerical variables can be used for correlation
columns <- c('Ozone.sqrt', 'Solar.R', 'Wind', 'Temp')
# We identify non-na rows as rows with na values cause problems 
rows <- rowSums(is.na(aq)) == 0
# Compute correlation of variables and round to 2 digits
round(cor(aq[rows, columns]), 2)
           Ozone.sqrt Solar.R  Wind  Temp
Ozone.sqrt       1.00    0.40 -0.61  0.75
Solar.R          0.40    1.00 -0.13  0.29
Wind            -0.61   -0.13  1.00 -0.50
Temp             0.75    0.29 -0.50  1.00

Let us verify these correlations using scatter plots.

2.1 Very weak correlation:

qplot(Solar.R, Wind, data=aq, geom="point", color=Month)

2.2 Weak correlation:

qplot(Temp, Solar.R, data=aq, geom="point", color=Month)

2.3 Moderate correlation:

qplot(Solar.R, Ozone.sqrt, data=aq, color=Month, geom="point")

qplot(Temp, Wind, data=aq, color=Month, geom="point")

2.4 Strong correlation

qplot(Wind, Ozone.sqrt, data=aq, color=Month, geom='point')

qplot(Temp, Ozone.sqrt, data=aq, color=Month, geom='point')

LS0tDQp0aXRsZTogIkV4cGxvcmF0aW9uIG9mIEFpciBRdWFsaXR5IERhdGFzZXQiDQphdXRob3I6ICJTaGFpbGVzaCBLdW1hciINCm91dHB1dDogDQogIGh0bWxfbm90ZWJvb2s6DQogICAgdG9jOiB0cnVlDQogICAgbnVtYmVyX3NlY3Rpb25zOiB0cnVlDQotLS0NClRoaXMgZGF0YSBzZXQgaGFzIGRhaWx5IGFpciBxdWFsaXR5IG1lYXN1cmVtZW50cyBpbiBOZXcgWW9yayBmcm9tDQpNYXkgdG8gU2VwdGVtYmVyIDE5NzMgb3ZlciBhIHBlcmlvZCBvZiA1IG1vbnRocy4gDQoNCnwgVmFyaWFibGUgfCBUeXBlIHwgVW5pdCB8IERlc2NyaXB0aW9uIHwgRGV0YWlsIHwNCnwtLS0tLS0tLS0tfC0tLS0tLXwtLS0tLS18LS0tLS0tLS0tLS0tLXwtLS0tLS0tLXwNCnwgT3pvbmUgfCBudW1lcmljIHwgcGFydHMgcGVyIGJpbGxpb24gfCBtZWFuIE96b25lIGNvbmNlbnRyYXRpb24gfCAgUm9vc2V2ZWx0IGlzbGFuZCB8DQp8IFNvbGFyLlIgfCBudW1lcmljIHwgIHwgU29sYXIgcmFkaWFpb24gfCBjZW50cmFsIHBhcmsgfCANCnwgV2luZCB8IG51bWVyaWMgfCBtaWxlcyBwZXIgaG91ciB8IGF2ZXJhZ2Ugd2luZCBzcGVlZCB8IExhR3VhcmRpYSBhaXJwb3J0IHwNCnwgVGVtcCB8bnVtZXJpYyB8IEZhaHJlbmhlaXQgfCBtYXhpbXVtIGRhaWx5IHRlbXBlcmF0dXJlIHwgTGFHdWFyZGlhIGFpcnBvcnQgfA0KfCBNb250aCB8IG51bWVyaWMgfCAgfCBNb250aCBvZiBvYnNlcnZhdGlvbiB8IDUgdmFsdWVzIHwNCnwgRGF5IHwgIG51bWVyaWMgfCAgfCBkYXkgb2YgbW9udGggfCAxNTMgZGF5cyB0b3RhbCB8DQp8ICB8IHwgfCB8IHwNCg0KDQpOZWNlc3NhcnkgcGFja2FnZXMNCmBgYHtyfQ0KcmVxdWlyZShnZ3Bsb3QyKQ0KcmVxdWlyZShtb21lbnRzKQ0KYGBgDQpMZXQncyBtYWtlIGEgY29weSBvZiB0aGUgZGF0YXNldCBmb3IgdGhlIA0KZm9sbG93aW5nIGFuYWx5c2lzLiANCmBgYHtyfQ0KYXEgPC0gZGF0YS5mcmFtZShhaXJxdWFsaXR5KQ0KYGBgDQoNCg0KTmFtZXMgb2YgY29sdW1ucw0KYGBge3J9DQpjb2xuYW1lcyhhcSkNCmBgYA0KDQoNCkxldCdzIGxvb2sgYXQgc29tZSBvZiB0aGUgcm93czoNCg0KYGBge3J9DQpoZWFkKGFxKQ0KYGBgDQoNClNldmVyYWwgZW50cmllcyBhcmUgbm90IGF2YWlsYWJsZS4gTGV0J3Mgc3VtbWFyaXplOg0KDQpgYGB7cn0NCmNvbFN1bXMoaXMubmEoYXEpKQ0KYGBgDQoNCg0KRGF5IHNob3VsZCBiZSBhIGZhY3RvciB2YXJpYWJsZQ0KYGBge3J9DQphcSREYXkgPC0gZmFjdG9yKGFxJERheSwgbGV2ZWxzPWMoMTozMSksIG9yZGVyZWQ9VFJVRSkNCmBgYA0KTW9udGggc2hvdWxkIGJlIGEgZmFjdG9yIHZhcmlhYmxlDQpgYGB7cn0NCmFxJE1vbnRoIDwtIGZhY3RvcihhcSRNb250aCwgbGV2ZWxzPTU6OSwgbGFiZWxzPW1vbnRoLmFiYls1OjldLCBvcmRlcmVkPVRSVUUpDQpgYGANCg0KDQpTdW1tYXJ5IHN0YXRpc3Rpc2NzIGZvciBhbGwgdGhlIHZhcmlhYmxlczoNCmBgYHtyfQ0Kc3VtbWFyeShhcSkNCmBgYA0KDQojIFZhcmlhYmxlcw0KDQojIyBPem9uZQ0KDQpgYGB7cn0NCnN1bW1hcnkoYXEkT3pvbmUpDQpgYGANCg0KSGlzdG9ncmFtDQpgYGB7cn0NCmdncGxvdChkYXRhPWFxKSArIGdlb21faGlzdG9ncmFtKG1hcHBpbmc9YWVzKE96b25lKSkNCmBgYA0KDQpUaGUgaGlzdG9ncmFtIGFwcGVhcnMgYXNzeW1ldHJpYy4gDQoNCg0KV2UgY2FuIGNoZWNrIHRoZSBub3JtYWxjeSBvZiB0aGUgZGlzdHJpYnV0aW9uIGJ5IG1lYXN1cmluZyBpdHMgDQpza2V3bmVzcyBhbmQga3VydG9zaXMNCg0KYGBge3J9DQpza2V3bmVzcyhhcSRPem9uZSwgbmEucm0gPSBUKQ0KYGBgDQpUaGUgZGF0YSBpcyBza2V3ZWQgaGVhdmlseSB0b3dhcmRzIHJpZ2h0Lg0KDQpMZXQncyBhbHNvIG1lYXN1cmUga3VydG9zaXM6DQpgYGB7cn0NCmt1cnRvc2lzKGFxJE96b25lLCBuYS5ybT1UKQ0KYGBgDQpUaGlzIGlzICpsZXB0b2t1cnRpYyouDQoNCkxldCdzIGFwcGx5IGEgdHJhbnNmb3JtYXRpb24gdG8gc2VlIGlmIHRoZSBoaXN0b2dyYW0gYmVjb21lcyBtb3JlIHN5bW1ldHJpYzoNCg0KDQpgYGB7cn0NCmFxJE96b25lLnNxcnQgPC0gc3FydChhcSRPem9uZSkNCmdncGxvdChkYXRhPWFxKSArIGdlb21faGlzdG9ncmFtKG1hcHBpbmc9YWVzKE96b25lLnNxcnQpKQ0KYGBgDQpXZSBoYXZlIHN0b3JlZCB0aGUgc3F1YXJlIHJvb3Qgb2YgT3pvbmUgY29uY2VudHJhdGlvbiB2YWx1ZXMgaW4gDQp0aGUgYXEgZGF0YWZyYW1lIGl0c2VsZiBhcyBhIG5ldyBjb2x1bW4uDQpUaGlzIGlzIGluZGVlZCBtb3JlIHN5bW1ldHJpYyBhbmQgbG9va2luZyBsaWtlIG5vcm1hbCBkaXN0cmlidXRpb24uDQoNCkxldCdzIHZlcmlmeSB0aGUgc2tld25lc3M6DQpgYGB7cn0NCnNrZXduZXNzKGFxJE96b25lLnNxcnQsIG5hLnJtPVQpDQpgYGANCkFsc28gdGhlIGt1cnRvc2lzOg0KYGBge3J9DQprdXJ0b3NpcyhhcSRPem9uZS5zcXJ0LCBuYS5ybT1UKQ0KYGBgDQoNClRoZXJlIGlzIGluZGVlZCBpbXByb3ZlbWVudCBpbiBza2V3bmVzcy4NCg0KU2Vhc29uYWwgdmFyaWF0aW9uIGluIGRhdGE6DQpgYGB7cn0NCnFwbG90KE1vbnRoLCBPem9uZSwgZGF0YT1hcSwgZ2VvbT0iYm94cGxvdCIsIGNvbG9yPU1vbnRoKQ0KYGBgDQoNCg0KIyMgU29sYXIuUg0KDQpgYGB7cn0NCnN1bW1hcnkoYXEkU29sYXIuUikNCmBgYA0KDQpIaXN0b2dyYW0NCmBgYHtyfQ0KZ2dwbG90KGRhdGE9YXEpICsgZ2VvbV9oaXN0b2dyYW0obWFwcGluZz1hZXMoU29sYXIuUikpDQpgYGANCg0KU2Vhc29uYWwgdmFyaWF0aW9uIGluIGRhdGE6DQpgYGB7cn0NCnFwbG90KE1vbnRoLCBTb2xhci5SLCBkYXRhPWFxLCBnZW9tPSJib3hwbG90IiwgY29sb3I9TW9udGgpDQpgYGANCg0KIyMgV2luZA0KDQpgYGB7cn0NCnN1bW1hcnkoYXEkV2luZCkNCmBgYA0KDQoNCkhpc3RvZ3JhbQ0KYGBge3J9DQpnZ3Bsb3QoZGF0YT1hcSkgKyBnZW9tX2hpc3RvZ3JhbShtYXBwaW5nPWFlcyhXaW5kKSkNCmBgYA0KDQpMZXQncyB2ZXJpZnkgdGhlIG5vcm1hbGN5IG9mIHRoZSBkaXN0cmlidXRpb24NCmBgYHtyfQ0Kc2tld25lc3MoYXEkV2luZCkNCmt1cnRvc2lzKGFxJFdpbmQpDQpgYGANClRoaXMgaXMgdmVyeSBjbG9zZSB0byBhIG5vcm1hbCBkaXN0cmlidXRpb24uDQoNClNlYXNvbmFsIHZhcmlhdGlvbiBpbiBkYXRhOg0KYGBge3J9DQpxcGxvdChNb250aCwgV2luZCwgZGF0YT1hcSwgZ2VvbT0iYm94cGxvdCIsIGNvbG9yPU1vbnRoKQ0KYGBgDQoNCiMjIFRlbXANCg0KYGBge3J9DQpzdW1tYXJ5KGFxJFRlbXApDQpgYGANCg0KU2Vhc29uYWwgdmFyaWF0aW9uIGluIGRhdGE6DQpgYGB7cn0NCnFwbG90KE1vbnRoLCBUZW1wLCBkYXRhPWFxLCBnZW9tPSJib3hwbG90IiwgY29sb3I9TW9udGgpDQpgYGANCg0KIyMgTW9udGgNCg0KYGBge3J9DQp0YWJsZShhcSRNb250aCkNCmBgYA0KDQpUaGlzIGp1c3QgcmVmbGVjdHMgdGhlIG51bWJlciBvZiBkYXlzIGluIGVhY2ggbW9udGggZnJvbSBNYXkgdG8gU2VwdGVtYmVyLiANCg0KIyMgRGF5DQoNCmBgYHtyfQ0KdGFibGUoYXEkRGF5KQ0KYGBgDQoNCkFzIGV4cGVjdGVkLCBkYXkgMzEgYXBwZWFycyBvbmx5IDMgdGltZXMuIA0KDQoNCiMgQ29ycmVsYXRpb25zDQoNCkxldCB1cyBmaW5kIG91dCB0aGUgUGVhcnNvbiBjb3JyZWxhdGlvbiBjb2VmZmljaWVudHMgYmV0d2Vlbg0KdGhlIGRpZmZlcmVudCB2YXJpYWJsZXMgaW4gdGhlIGRhdGFzZXQ6DQoNCmBgYHtyfQ0KIyBPbmx5IG51bWVyaWNhbCB2YXJpYWJsZXMgY2FuIGJlIHVzZWQgZm9yIGNvcnJlbGF0aW9uDQpjb2x1bW5zIDwtIGMoJ096b25lLnNxcnQnLCAnU29sYXIuUicsICdXaW5kJywgJ1RlbXAnKQ0KIyBXZSBpZGVudGlmeSBub24tbmEgcm93cyBhcyByb3dzIHdpdGggbmEgdmFsdWVzIGNhdXNlIHByb2JsZW1zIA0Kcm93cyA8LSByb3dTdW1zKGlzLm5hKGFxKSkgPT0gMA0KIyBDb21wdXRlIGNvcnJlbGF0aW9uIG9mIHZhcmlhYmxlcyBhbmQgcm91bmQgdG8gMiBkaWdpdHMNCnJvdW5kKGNvcihhcVtyb3dzLCBjb2x1bW5zXSksIDIpDQpgYGANCg0KKiBTb2xhciBSYWRpYXRpb24gaXMgbW9kZXJhdGVseSBjb3JyZWxhdGVkIHdpdGggb3pvbmUgY29uY2VudHJhdGlvbi4NCiogd2luZCBhbmQgb3pvbmUgc2VlbSBzdHJvbmdseSBjb3JyZWxhdGVkLg0KKiBUZW1wZXJhdHVyZSBhbmQgb3pvbmUgYXJlIHN0cm9uZ2x5IGNvcnJlbGF0ZWQuIA0KKiBWZXJ5IHdlYWsgY29ycmVsYXRpb24gYmV0d2VlbiB3aW5kIGFuZCBzb2xhciByYWRpYXRpb24uDQoqIFRlbXBlcmF0dXJlIGFuZCBzb2xhciByYWRpdGlvbiBhcmUgd2Vha2x5IGNvcnJlbGF0ZWQuDQoqIFRlbXBlcmF0dXJlIGFuZCB3aW5kIGFyZSBtb2RlcmF0ZWx5IGNvcnJlbGF0ZWQuDQoNCg0KTGV0IHVzIHZlcmlmeSB0aGVzZSBjb3JyZWxhdGlvbnMgdXNpbmcgc2NhdHRlciBwbG90cy4NCg0KDQojI1Zlcnkgd2VhayBjb3JyZWxhdGlvbjogDQpgYGB7cn0NCnFwbG90KFNvbGFyLlIsIFdpbmQsIGRhdGE9YXEsIGdlb209InBvaW50IiwgY29sb3I9TW9udGgpDQpgYGANCg0KIyNXZWFrIGNvcnJlbGF0aW9uOg0KYGBge3J9DQpxcGxvdChUZW1wLCBTb2xhci5SLCBkYXRhPWFxLCBnZW9tPSJwb2ludCIsIGNvbG9yPU1vbnRoKQ0KYGBgDQoNCiMjTW9kZXJhdGUgY29ycmVsYXRpb246DQpgYGB7cn0NCnFwbG90KFNvbGFyLlIsIE96b25lLnNxcnQsIGRhdGE9YXEsIGNvbG9yPU1vbnRoLCBnZW9tPSJwb2ludCIpDQpgYGANCmBgYHtyfQ0KcXBsb3QoVGVtcCwgV2luZCwgZGF0YT1hcSwgY29sb3I9TW9udGgsIGdlb209InBvaW50IikNCmBgYA0KDQojI1N0cm9uZyBjb3JyZWxhdGlvbg0KYGBge3J9DQpxcGxvdChXaW5kLCBPem9uZS5zcXJ0LCBkYXRhPWFxLCBjb2xvcj1Nb250aCwgZ2VvbT0ncG9pbnQnKQ0KYGBgDQoNCmBgYHtyfQ0KcXBsb3QoVGVtcCwgT3pvbmUuc3FydCwgZGF0YT1hcSwgY29sb3I9TW9udGgsIGdlb209J3BvaW50JykNCmBgYA0KDQo=