DATA 607 PROJECT 2

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

(1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!)

For each of the three chosen datasets:

 Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

 Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

 Perform the analysis requested in the discussion item.

 Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Wide Data Set # 3

Choose a wide data set from the Week 5/6 discussions, convert it into a long data set, and generate an analysis from the long data set. A screenshot of the discussion data set appears below. I chose this dataset because it is wide. It also introduces another set of tools, from the reshape2 package, that can turn a wide dataset into a long one. The dataset, airquality, is built into R.

The following site was used as a reference for turning a wide dataset into a long dataset with the reshape2 tools:

https://seananderson.ca/2013/10/19/reshape/

head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

# Changing the header names to lowercase based on website author recommendation that it enhances readability
colnames(airquality)<-tolower(colnames(airquality))
head(airquality)

##   ozone solar.r wind temp month day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

In the reshape2 package, the melt function turns column names into row values. In the first iteration, melt is called with its default arguments, so no ID variables are identified and every column, including month and day, is treated as a measure variable. In the second iteration, the ID variables are stated explicitly. In the third, the variable and value columns are also given descriptive names. This is very similar to the gather function in the tidyr package.

# requires reshape2 package
library(reshape2)
airquality_long<- melt(airquality)

## No id variables; using all as measure variables

head(airquality_long)

##   variable value
## 1    ozone    41
## 2    ozone    36
## 3    ozone    12
## 4    ozone    18
## 5    ozone    NA
## 6    ozone    28

airquality2_long<-melt(airquality, id.vars = c("month", "day"))
head(airquality2_long)

##   month day variable value
## 1     5   1    ozone    41
## 2     5   2    ozone    36
## 3     5   3    ozone    12
## 4     5   4    ozone    18
## 5     5   5    ozone    NA
## 6     5   6    ozone    28

airquality3_long<- melt(airquality, id.vars = c("month", "day"),
variable.name = "climatetype", value.name = "climatemeasurement")
head(airquality3_long)

##   month day climatetype climatemeasurement
## 1     5   1       ozone                 41
## 2     5   2       ozone                 36
## 3     5   3       ozone                 12
## 4     5   4       ozone                 18
## 5     5   5       ozone                 NA
## 6     5   6       ozone                 28
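Since melt was compared to tidyr's gather function earlier, here is a minimal sketch of what the equivalent gather call might look like. The names aq and aq_gathered are hypothetical and used only for this illustration. It assumes the tidyr package is installed, and it works on a fresh lowercase copy of airquality so that it runs on its own.

```r
library(tidyr)

# Hypothetical working copy so the sketch is self-contained
aq <- airquality
colnames(aq) <- tolower(colnames(aq))

# Gather every column except month and day into key/value pairs,
# mirroring melt(..., id.vars = c("month", "day"))
aq_gathered <- gather(aq, key = "climatetype", value = "climatemeasurement",
                      -month, -day)
head(aq_gathered)
```

One difference to keep in mind: by default gather stores the key column as character, while melt stores it as a factor.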

Rows with NA measurements are then removed with na.omit.

airquality4_long<-na.omit(airquality3_long)
head(airquality4_long)

##   month day climatetype climatemeasurement
## 1     5   1       ozone                 41
## 2     5   2       ozone                 36
## 3     5   3       ozone                 12
## 4     5   4       ozone                 18
## 6     5   6       ozone                 28
## 7     5   7       ozone                 23

str(airquality4_long)

## 'data.frame':    568 obs. of  4 variables:
##  $ month             : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ day               : int  1 2 3 4 6 7 8 9 11 12 ...
##  $ climatetype       : Factor w/ 4 levels "ozone","solar.r",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ climatemeasurement: num  41 36 12 18 28 23 19 8 7 16 ...
##  - attr(*, "na.action")= 'omit' Named int  5 10 25 26 27 32 33 34 35 36 ...
##   ..- attr(*, "names")= chr  "5" "10" "25" "26" ...

summary(airquality4_long)

##      month            day         climatetype  climatemeasurement
##  Min.   :5.000   Min.   : 1.00   ozone  :116   Min.   :  1.00    
##  1st Qu.:6.000   1st Qu.: 8.00   solar.r:146   1st Qu.: 13.00    
##  Median :7.000   Median :16.00   wind   :153   Median : 66.00    
##  Mean   :7.044   Mean   :15.83   temp   :153   Mean   : 80.06    
##  3rd Qu.:8.000   3rd Qu.:23.00                 3rd Qu.: 91.00    
##  Max.   :9.000   Max.   :31.00                 Max.   :334.00
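The row counts in the summary (116 for ozone and 146 for solar.r, against 153 for wind and temp) follow from the missing values in the original wide data. As a rough check, the sketch below counts the NA values per column and confirms that removing them from the 612 long rows leaves 568. The names aq, na_counts, and total_long_rows are hypothetical and used only for this illustration.

```r
# Hypothetical working copy so the sketch is self-contained
aq <- airquality
colnames(aq) <- tolower(colnames(aq))

# Count missing values in each of the four measured columns
na_counts <- colSums(is.na(aq[, c("ozone", "solar.r", "wind", "temp")]))
na_counts

# 153 days x 4 measured columns = 612 long rows before na.omit
total_long_rows <- nrow(aq) * 4
total_long_rows - sum(na_counts)
```

The last line should match the 568 observations reported by str above.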

The analysis visualizes one of the variables and its recorded measurements over three weeks of a given month. The climate variable to subset on is “temp”, and the time period is September (month 9), from day 1 through day 21. The chart shows the temperature trending downward from the low 90s to the 60s and 70s, consistent with a shift from late summer to cooler early-fall weather.

library(ggplot2)
airquality5_long<-subset(airquality4_long,climatetype=="temp" & month==9 & day >=1 & day<=21)
airquality5_long

##     month day climatetype climatemeasurement
## 583     9   1        temp                 91
## 584     9   2        temp                 92
## 585     9   3        temp                 93
## 586     9   4        temp                 93
## 587     9   5        temp                 87
## 588     9   6        temp                 84
## 589     9   7        temp                 80
## 590     9   8        temp                 78
## 591     9   9        temp                 75
## 592     9  10        temp                 73
## 593     9  11        temp                 81
## 594     9  12        temp                 76
## 595     9  13        temp                 77
## 596     9  14        temp                 71
## 597     9  15        temp                 71
## 598     9  16        temp                 78
## 599     9  17        temp                 67
## 600     9  18        temp                 76
## 601     9  19        temp                 68
## 602     9  20        temp                 82
## 603     9  21        temp                 64

# Bar chart of daily September temperature; columns are referenced by bare name inside aes()
ggplot(airquality5_long, aes(x = day, y = climatemeasurement, fill = climatetype)) + geom_bar(stat = "identity", position = "dodge")
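To back up the downward trend seen in the chart with numbers, one option is to average the temperature over each of the three weeks. This is only a sketch: the names aq, sept, week, and weekly_mean are hypothetical, and splitting days 1 to 21 into three 7-day blocks is an assumption made for illustration, not part of the original analysis.

```r
# Hypothetical working copy so the sketch is self-contained
aq <- airquality
colnames(aq) <- tolower(colnames(aq))

# Same window as the chart: September, days 1 through 21
sept <- subset(aq, month == 9 & day >= 1 & day <= 21)

# Assign each day to a 7-day block (1, 2, or 3)
sept$week <- (sept$day - 1) %/% 7 + 1

# Mean temperature per block
weekly_mean <- tapply(sept$temp, sept$week, mean)
round(weekly_mean, 1)
```

The means come out to roughly 88.6, 75.9, and 72.3, which agrees with the cooling pattern described above.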