R Bridge Course Final Project

This is a final project to show off what you have learned. Select your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Another good source is found here: https://archive.ics.uci.edu/ml/datasets.html

The presentation approach is up to you but it should contain the following:

Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the dataset. Please include some conclusions in the R Markdown text.

Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example, if it makes sense, you could sum two columns together).

Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

BONUS: Place the original .csv file on GitHub and have R read it from the link. This will be a very useful skill as you progress in your data science education and career.

Please submit your .Rmd file and the .csv file as well as a link to your RPubs.

Data: New York Air Quality Measurements

Taken from: http://vincentarelbundock.github.io/Rdatasets/csv/datasets/airquality.csv

GitHub Raw file location: https://raw.githubusercontent.com/mmunjal/CUNY/master/airquality.csv

Description

Daily air quality measurements in New York, May to September 1973.

Usage

airquality

Format

A data frame with 153 observations on 6 variables.

[,1]  Ozone    numeric  Ozone (ppb)
[,2]  Solar.R  numeric  Solar R (lang)
[,3]  Wind     numeric  Wind (mph)
[,4]  Temp     numeric  Temperature (degrees F)
[,5]  Month    numeric  Month (1–12)
[,6]  Day      numeric  Day of month (1–31)

Details

Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

Solar.R: Solar radiation in Langleys in the frequency band 4000-7700 Angstroms from 0800 to 1200 hours at Central Park

Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.

Source

The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data).

References

Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.

Loading the Dataset

#Link for github -> https://raw.githubusercontent.com/mmunjal/CUNY/master/airquality.csv

# ---Bonus  Internet File
#Original Location of the file
#airquality <- read.csv(url("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/airquality.csv"), header = TRUE)

#GitHub Raw Location of the file
airquality <- read.csv(url("https://raw.githubusercontent.com/mmunjal/CUNY/master/airquality.csv"), header = TRUE)
head(airquality)
##   X Ozone Solar.R Wind Temp Month Day
## 1 1    41     190  7.4   67     5   1
## 2 2    36     118  8.0   72     5   2
## 3 3    12     149 12.6   74     5   3
## 4 4    18     313 11.5   62     5   4
## 5 5    NA      NA 14.3   56     5   5
## 6 6    28      NA 14.9   66     5   6

Operations on Dataset

airquality <- data.frame(airquality)

#Summary Function
summary(airquality)
##        X           Ozone           Solar.R           Wind       
##  Min.   :  1   Min.   :  1.00   Min.   :  7.0   Min.   : 1.700  
##  1st Qu.: 39   1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400  
##  Median : 77   Median : 31.50   Median :205.0   Median : 9.700  
##  Mean   : 77   Mean   : 42.13   Mean   :185.9   Mean   : 9.958  
##  3rd Qu.:115   3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500  
##  Max.   :153   Max.   :168.00   Max.   :334.0   Max.   :20.700  
##                NA's   :37       NA's   :7                       
##       Temp           Month            Day      
##  Min.   :56.00   Min.   :5.000   Min.   : 1.0  
##  1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0  
##  Median :79.00   Median :7.000   Median :16.0  
##  Mean   :77.88   Mean   :6.993   Mean   :15.8  
##  3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :97.00   Max.   :9.000   Max.   :31.0  
## 
#Returning Mean and Median
median(airquality$Wind)
## [1] 9.7
mean(airquality$Wind)
## [1] 9.957516
median(airquality$Temp)
## [1] 79
mean(airquality$Temp)
## [1] 77.88235

We see a data frame of 153 observations of our four variables: ozone level, solar radiation, wind speed, and air temperature. The date of each observation is stored in two fields, one for month and one for day. All the data values are stored as integers except wind speed, which is stored as a floating-point numeric value.
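A quick way to confirm how each column is stored is to ask R for the class of every column. This is only a sketch: it uses the copy of `airquality` that ships with R's `datasets` package (an assumption made to avoid the download; the CSV version adds an `X` index column), and the names `aq` and `column_types` are illustrative.

```r
# Assumption: the built-in dataset matches the CSV, minus the X index column
aq <- datasets::airquality

# Storage class of every column, returned as a named character vector
column_types <- sapply(aq, class)
column_types
```

Columns reported as "integer" hold whole numbers, while "numeric" indicates floating-point storage.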

Looking at the output of the summary function, we see that Ozone and Solar.R are the only fields with NAs recorded. Solar.R is missing seven (7) values, the equivalent of a week of data, while Ozone is missing thirty-seven (37), or over a month of observations in a five (5) month study.

Closer study shows that four (4) of the Solar.R NAs were in May, with only two in a row. The other three (3) were on consecutive days in August. The Ozone values show five (5) missed days in May, with three in a row. There were twenty-one (21) missed days in June, including a six (6) day consecutive run at the beginning of the month and a ten (10) day run at the end. There were five (5) missed Ozone values in July, with two in a row being the longest streak, and the same was true for August. There was one (1) NA Ozone value in September, making it the most complete month.
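The missing-value counts described here can be reproduced programmatically rather than by eye. The following is a minimal sketch, assuming the built-in `datasets::airquality` copy of the data; the helper `longest_na_run` and the variable names are hypothetical, introduced only for illustration.

```r
# Assumption: the built-in dataset matches the CSV, minus the X index column
aq <- datasets::airquality

# Number of missing values in each column
na_counts <- colSums(is.na(aq))

# Missing Ozone readings per month (5 = May ... 9 = September)
ozone_na_by_month <- tapply(is.na(aq$Ozone), aq$Month, sum)

# Hypothetical helper: length of the longest consecutive run of NAs in x
longest_na_run <- function(x) {
  runs <- rle(is.na(x))  # run-length encoding of the TRUE/FALSE NA flags
  if (any(runs$values)) max(runs$lengths[runs$values]) else 0L
}

# Longest streak of missing Ozone readings within each month
ozone_longest_run <- tapply(aq$Ozone, aq$Month, longest_na_run)

na_counts
ozone_na_by_month
ozone_longest_run
```

Swapping `aq$Ozone` for `aq$Solar.R` gives the same breakdown for solar radiation.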

Individual Variable Examination

Sometimes examining the values of a variable visually can give additional insight into the data set. One easy way to inspect data this way is to do a histogram on each variable. This gives you a feel for the spread of the values and how they cluster. Here are the histograms for ozone, solar radiation, wind speed and air temperature.

library("ggplot2")

ggplot(airquality, aes(x = Ozone)) + geom_histogram(binwidth = 5, fill="green", color="black")

ggplot(airquality, aes(x = Solar.R)) + geom_histogram(binwidth = 10, fill="orange", color="black")

ggplot(airquality, aes(x = Wind)) + geom_histogram(binwidth = 1, fill="blue", color="black")

ggplot(airquality, aes(x = Temp)) + geom_histogram(binwidth = 5, fill="red", color="black")

We can also look at the distribution of values for a variable by using a box plot. Here are box plots for our four variables.

ggplot(airquality, aes(y = Ozone, x = 1)) + geom_boxplot(fill="green")
## Warning: Removed 37 rows containing non-finite values (stat_boxplot).

ggplot(airquality, aes(y = Solar.R, x = 1)) + geom_boxplot(fill="orange")
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).

ggplot(airquality, aes(y = Wind, x = 1)) + geom_boxplot(fill="blue")

ggplot(airquality, aes(y = Temp, x = 1)) + geom_boxplot(fill="red")

Comparing Data Over Time

To get a quick and dirty view of air quality by month, and the impact of the missing ozone readings for June, we do a line and point plot of ozone versus time. To keep the days in their months, the months in order, and stay quick, we multiply the month integer by 100 and add the day. This means May 1st is 501 and May 2nd is 502. This gives us this view.

ggplot(airquality, aes(x = (Month * 100 + Day), y = Ozone)) + geom_line() + geom_point()
## Warning: Removed 37 rows containing missing values (geom_point).

This plot shows all five months. The gaps between the months come from the unused values between the last day of one month and the first day of the next (532 to 600, 631 to 700, and so on). We can see that the highest recorded ozone value is in August and the next highest in July, that June has very few data points, and that the next highest peak is in May. There is an interesting drop-off at the end of August, with a slight rise going into September followed by a trend toward lower ozone readings. This hints at a possible relationship to temperature.
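One way to follow up on a hint like this is to compute a correlation coefficient alongside the plots. The sketch below is an illustration, not part of the original analysis: it assumes the built-in `datasets::airquality` copy of the data, and the names `aq`, `r_ozone_temp`, and `cor_matrix` are hypothetical. Rows with missing values are dropped through the `use` argument of `cor()`.

```r
# Assumption: the built-in dataset matches the CSV, minus the X index column
aq <- datasets::airquality

# Pearson correlation between ozone and temperature,
# using only the days on which both values were recorded
r_ozone_temp <- cor(aq$Ozone, aq$Temp, use = "complete.obs")

# Correlations among all four measured variables,
# each pair computed on the days where both are present
measures <- c("Ozone", "Solar.R", "Wind", "Temp")
cor_matrix <- cor(aq[, measures], use = "pairwise.complete.obs")

r_ozone_temp
round(cor_matrix, 2)
```

A coefficient near 0 would suggest little linear association, while values near 1 or -1 would suggest a strong one; either way it only measures linear association and does not replace looking at the plots.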

Before we continue to explore the data, we will fix the date values to eliminate the gaps between months. We will do that by adding a date column to the dataframe and converting the month and day columns to dates.

airquality$date <- as.Date(paste("1973", airquality$Month, airquality$Day, sep="-"))
ggplot(airquality, aes(x= date, y = Ozone)) + geom_line(color = "green") + geom_point()

str(airquality)
## 'data.frame':    153 obs. of  8 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ date   : Date, format: "1973-05-01" "1973-05-02" ...

Now let’s overlay our other measured values by day and see how they look in comparison. We will add the air temperature measurements first, then the solar radiation values, and last the wind speed.

graph <- ggplot(airquality, aes(x= date, y = Ozone)) + geom_line(color = "green") + geom_point()
graph <- graph + geom_line(aes(x = date, y = Temp), color = "red")
graph

graph <- graph + geom_line(aes(x = date, y = Solar.R), color = "orange")
graph

graph <- graph + geom_line(aes(x = date, y = Wind), color = "blue")
graph
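The overlay puts four variables with different units (ppb, Langleys, mph, degrees F) on a single y-axis, which can flatten the smaller-valued series. One possible alternative, sketched here under assumptions, is to stack the variables into long format and give each its own panel with `facet_wrap()`. The sketch uses the built-in `datasets::airquality` copy of the data, and the names `aq`, `long`, and `panel_plot` are hypothetical.

```r
library(ggplot2)

# Assumption: the built-in dataset matches the CSV, minus the X index column
aq <- datasets::airquality
aq$date <- as.Date(paste("1973", aq$Month, aq$Day, sep = "-"))

# stack() returns two columns: `values` (the readings) and `ind` (the variable
# name), column by column, so the dates are repeated once per variable
measures <- c("Ozone", "Solar.R", "Wind", "Temp")
long <- data.frame(
  date = rep(aq$date, times = length(measures)),
  stack(aq[, measures])
)

# One panel per variable, each with its own y-axis scale
panel_plot <- ggplot(long, aes(x = date, y = values)) +
  geom_line(na.rm = TRUE) +
  facet_wrap(~ ind, ncol = 1, scales = "free_y")
panel_plot
```

With `scales = "free_y"` each panel is scaled to its own variable, so the shapes of the series can be compared by date even though their magnitudes differ.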

Conclusion

On first look there does not seem to be any obvious relationship between the observed values. Although temperature seems to track well with ozone during September, it starts to diverge in August. In fact, the highest ozone level is followed by a relative temperature high. Comparisons of solar radiation and wind have similar problems. We may need more data over a longer period to find any relationships. I will leave you with this panel view, which I would probably use as a cover page image:

airquality <- subset(airquality, select = c(-Month, -Day, -date))
pairs(airquality, panel = panel.smooth, main = "Air Quality Data")