Introduction

This is an exploratory analysis of an air quality data set. It contains daily air quality measurements in New York City from May 1, 1973 to September 30, 1973. Four measurements were to be taken each day and recorded along with the date. The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data).

First Look at Dataset

We read the dataset from a simple text file (CSV) into a dataframe named airquality. We can take our first look at what we imported with R’s head and tail functions, then follow up by looking at the structure (str) and summary of the data.

airquality <- read.csv("~/CUNY/Bridge Classes/R Programming/Week4/airquality.csv", row.names=1)
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
tail(airquality)
##     Ozone Solar.R Wind Temp Month Day
## 148    14      20 16.6   63     9  25
## 149    30     193  6.9   70     9  26
## 150    NA     145 13.2   77     9  27
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

We see a dataframe of 153 observations of our four variables: ozone level (Ozone), solar radiation (Solar.R), wind speed (Wind), and air temperature (Temp). The date of each observation is stored in two fields, one for the month and one for the day. All values are stored as integers except Wind, which is stored as a floating-point number (shown as num).
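The storage types can be confirmed directly. This is a minimal sketch that assumes R's built-in airquality data set (from the datasets package) matches the CSV file read above; the names aq and col_types are introduced here only for illustration.

```r
# Assumption: the built-in datasets::airquality matches the CSV used above.
aq <- datasets::airquality

# Storage type of each column: "integer" or "double".
col_types <- sapply(aq, typeof)
print(col_types)

# Integer and double columns both count as numeric in R.
print(sapply(aq, is.numeric))
```

The distinction matters because str() labels double columns as num, even though integer columns are numeric as well.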

Looking at the results from the summary function, we see that Ozone and Solar.R are the only fields with NA’s recorded for observations. Solar.R is missing seven (7) values, the equivalent of a week of data, while Ozone is missing thirty-seven (37), over a month of observations in a five (5) month study.
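The missing-value counts reported by summary can also be pulled out on their own. The sketch below assumes the built-in datasets::airquality is the same data as the CSV; aq, na_counts and na_share are illustrative names.

```r
# Assumption: the built-in datasets::airquality matches the CSV used above.
aq <- datasets::airquality

# Number of NA values in each column.
na_counts <- colSums(is.na(aq))
print(na_counts)

# Percentage of the 153 daily rows that are missing, per column.
na_share <- round(100 * colMeans(is.na(aq)), 1)
print(na_share)
```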

Closer study shows that four (4) of the Solar.R NA’s fall in May, with no more than two in a row. The other three fall on consecutive days in August. The Ozone values show five (5) missed days in May, with three in a row. There are twenty-one (21) missed days in June, including a six (6) day run at the beginning of the month and a ten (10) day run at the end. July has five (5) missed Ozone values, with two in a row being the longest streak, and the same is true for August. September has a single NA Ozone value, making it the most complete month.
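One possible way to reproduce this breakdown is to count NA’s per month and use run-length encoding (rle) to find the longest streak in each month. This is a sketch, not necessarily how the counts above were obtained; it assumes the built-in datasets::airquality matches the CSV, and the helper names na_by_month and longest_na_run are hypothetical.

```r
# Assumption: the built-in datasets::airquality matches the CSV used above.
aq <- datasets::airquality

# Number of missing values per month for one column (hypothetical helper).
na_by_month <- function(x, month) tapply(is.na(x), month, sum)

# Longest run of consecutive missing values within each month (hypothetical helper).
# Rows are assumed to be in date order; a run crossing a month boundary is split.
longest_na_run <- function(x, month) {
  tapply(is.na(x), month, function(miss) {
    runs <- rle(miss)
    max(c(0, runs$lengths[runs$values]))  # 0 when the month has no NA
  })
}

print(na_by_month(aq$Ozone, aq$Month))
print(longest_na_run(aq$Ozone, aq$Month))
print(na_by_month(aq$Solar.R, aq$Month))
print(longest_na_run(aq$Solar.R, aq$Month))
```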

Individual Variable Examination

Sometimes examining the values of a variable visually can give additional insight into the data set. One easy way to do this is to plot a histogram of each variable, which gives a feel for the spread of the values and how they cluster. Here are the histograms for ozone, solar radiation, wind speed and air temperature.

library(ggplot2)  # provides ggplot() and the geoms used in the plots below
ggplot(airquality, aes(x = Ozone)) + geom_histogram(binwidth = 5, fill="green", color="black")

ggplot(airquality, aes(x = Solar.R)) + geom_histogram(binwidth = 10, fill="orange", color="black")

ggplot(airquality, aes(x = Wind)) + geom_histogram(binwidth = 1, fill="blue", color="black")

ggplot(airquality, aes(x = Temp)) + geom_histogram(binwidth = 5, fill="red", color="black")

We can also look at the distribution of values for a variable by using a box plot. Here are box plots for our four variables.

ggplot(airquality, aes(y = Ozone, x = 1)) + geom_boxplot(fill="green")
## Warning: Removed 37 rows containing non-finite values (stat_boxplot).

ggplot(airquality, aes(y = Solar.R, x = 1)) + geom_boxplot(fill="orange")
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).

ggplot(airquality, aes(y = Wind, x = 1)) + geom_boxplot(fill="blue")

ggplot(airquality, aes(y = Temp, x = 1)) + geom_boxplot(fill="red")
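The numbers behind a box plot can be listed with base R's boxplot.stats. The sketch below does this for Ozone, assuming the built-in datasets::airquality matches the CSV; ozone_stats is an illustrative name. The hinges computed here follow the base boxplot convention, which may differ slightly from the quartiles that geom_boxplot draws.

```r
# Assumption: the built-in datasets::airquality matches the CSV used above.
aq <- datasets::airquality

# boxplot.stats() drops NA values and returns the box-plot summary.
ozone_stats <- boxplot.stats(aq$Ozone)

print(ozone_stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(ozone_stats$out)    # values lying beyond the whiskers
print(ozone_stats$n)      # number of non-missing observations used
```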

Comparing Data Over Time

To get a quick and dirty view of air quality by month, and of the impact of the missing ozone readings for June, we draw a line and point plot of ozone versus time. To keep the days in their months, the months in order, and stay quick, we multiply the month integer by 100 and add the day. This means May 1st is 501 and May 2nd is 502. This gives us the following view.

ggplot(airquality, aes(x = (Month * 100 + Day), y = Ozone)) + geom_line() + geom_point()
## Warning: Removed 37 rows containing missing values (geom_point).

This plot shows all five months. The space between the months comes from the encoding, which has no observations for 532 to 600, 631 to 700 and so on. What we can see is the highest recorded value for ozone in August, the next highest peak in July, not very many data points in June, and the next highest peak in May. There is an interesting drop off at the end of August with a slight rise going into September followed by a trend to lower ozone readings. This hints at a possible relationship to temperature.
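The arithmetic behind those gaps can be checked in a few lines. This sketch assumes the built-in datasets::airquality matches the CSV; day_code and steps are illustrative names.

```r
# Assumption: the built-in datasets::airquality matches the CSV used above.
aq <- datasets::airquality

# Encode each date as month * 100 + day, e.g. May 1st -> 501.
day_code <- aq$Month * 100 + aq$Day
print(head(day_code, 3))

# Difference between consecutive codes: 1 within a month, larger at month ends.
steps <- diff(day_code)
month_end <- which(steps > 1)

print(day_code[month_end])      # last code of each month before a gap
print(day_code[month_end + 1])  # first code of the following month
print(steps[month_end])         # size of each jump on the x axis
```

Every jump larger than 1 is an artifact of the encoding rather than a real gap in time, which is why the x axis is later rebuilt from true dates.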

Before we continue to explore the data, we will fix the date values to eliminate the gaps between months. We will do that by adding a date column to the dataframe and converting the month and day columns to dates.

airquality$date <- as.Date(paste("1973", airquality$Month, airquality$Day, sep="-"))
ggplot(airquality, aes(x= date, y = Ozone)) + geom_line(color = "green") + geom_point()
## Warning: Removed 37 rows containing missing values (geom_point).

str(airquality)
## 'data.frame':    153 obs. of  7 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ date   : Date, format: "1973-05-01" "1973-05-02" ...
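A quick sanity check on the new column is to confirm that the dates parse cleanly and form an unbroken daily sequence. The sketch below assumes the built-in datasets::airquality matches the CSV and spells out the date format explicitly, which the original call leaves to as.Date's defaults.

```r
# Assumption: the built-in datasets::airquality matches the CSV used above.
aq <- datasets::airquality

# Build the date column; the year 1973 comes from the data set description.
aq$date <- as.Date(paste("1973", aq$Month, aq$Day, sep = "-"), format = "%Y-%m-%d")

print(range(aq$date))           # first and last day covered
print(anyNA(aq$date))           # TRUE would mean some dates failed to parse
print(all(diff(aq$date) == 1))  # TRUE means no day is skipped or repeated
```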

Now let’s overlay our other measured values by day and see how they compare. We will add the air temperature measurements first, then the solar radiation values, and last the wind speed.

graph <- ggplot(airquality, aes(x= date, y = Ozone)) + geom_line(color = "green") + geom_point()
graph <- graph + geom_line(aes(x = date, y = Temp), color = "red")
graph
## Warning: Removed 37 rows containing missing values (geom_point).

graph <- graph + geom_line(aes(x = date, y = Solar.R), color = "orange")
graph
## Warning: Removed 37 rows containing missing values (geom_point).

graph <- graph + geom_line(aes(x = date, y = Wind), color = "blue")
graph
## Warning: Removed 37 rows containing missing values (geom_point).
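Because the four variables sit on very different scales (wind speed tops out near 21 while solar radiation reaches 334), a shared y axis flattens the smaller series. One alternative, offered only as a sketch and not as the layout used in this analysis, is to reshape the data to long form and give each variable its own panel. It assumes the built-in datasets::airquality matches the CSV; measures, long and panel_plot are illustrative names.

```r
library(ggplot2)

# Assumption: the built-in datasets::airquality matches the CSV used above.
aq <- datasets::airquality
aq$date <- as.Date(paste("1973", aq$Month, aq$Day, sep = "-"))

# Reshape to long form: stack() yields one row per (variable, day) pair,
# with the measurement in `values` and the variable name in `ind`.
measures <- c("Ozone", "Temp", "Solar.R", "Wind")
long <- stack(aq[measures])
long$date <- rep(aq$date, times = length(measures))

# One panel per variable, each with its own y scale.
panel_plot <- ggplot(long, aes(x = date, y = values)) +
  geom_line() +
  facet_wrap(~ ind, ncol = 1, scales = "free_y")
panel_plot
```

The panels share the date axis, so peaks and dips can still be compared by eye across variables.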

Conclusion

At first look there does not seem to be any obvious relationship between the observed values. Although temperature seems to track well with ozone during September, the two diverge in August. In fact, the highest ozone level is followed by a relative temperature high. Comparisons with solar radiation and wind have similar problems. We may need more data over a longer period to find any relationships. I will leave you with this panel view, which I would probably use as a cover page image.

airquality <- subset(airquality, select = c(-Month, -Day, -date))
pairs(airquality, panel = panel.smooth, main = "Air Quality Data")
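For readers who want numbers alongside the panel view, a correlation matrix over the same four variables is one option. This is only a sketch: it assumes the built-in datasets::airquality matches the CSV, and handling NA’s with pairwise.complete.obs is a choice made here for illustration, not something taken from the analysis above.

```r
# Assumption: the built-in datasets::airquality matches the CSV used above.
aq <- datasets::airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Pearson correlations; each pair uses the rows where both values are present.
cor_matrix <- cor(aq, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```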