For this project, the objective is to select three untidy data sets posted by students on the class discussion board, and - for each of the three datasets - read the data into R, tidy the data, and complete the analysis suggested by the classmate in the discussion board post.
This R markdown file explains the steps used to complete the analysis for the first dataset I selected, which is the results of a Crosstab query posted by Jose Zuniga. The dataset and source are shown below:
The Crosstab data was loaded to a MySQL database schema called ‘project_two’ into a table called ‘cross_tab’. The SQL scripts to build the database and load the data are available on GitHub at the link below. Please note that the GitHub file includes the code used for all three datasets selected for the project, not just this dataset:
https://github.com/LelandoSupreme/DATA607/blob/master/Randles_Project2_DATA607.sql
To facilitate the loading of data from MySQL to an R data frame, I loaded the “RMySQL” package.
install.packages("RMySQL",repos='http://cran.wustl.edu/')
library(RMySQL)
Once the package was installed, I connected to the database and created a data frame from the ‘cross_tab’ table.
# Get the MySQL
drv = dbDriver("MySQL")
# Create a connection to the MySQL database
con <- dbConnect(drv, user = 'root', password = 'temp1002!', dbname = 'project_two')
# Create the cross_tab data frame
cross_tab <- dbReadTable(con, "cross_tab")
# View the cross_tab data frame
head(cross_tab)
## mth region1 region2 region3 region4 region5
## 1 April 13 33 76 2 47
## 2 May 17 55 209 1 143
## 3 June 8 63 221 1 127
## 4 July 13 104 240 6 123
## 5 August 18 121 274 9 111
## 6 September 25 160 239 2 88
As you can see, the ‘cross_tab’ table has 6 columns: mth, region1, region2, region3, region4, and region5.
Hadley Wickham defines “tidy data” (http://vita.had.co.nz/papers/tidy-data.pdf.) as data which is structured such that:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
In the case of the cross_tab data, there are three variables: mth, region, and citizenship_cnts. To make each row an observation, we need a row for every month/region combination. In this case, there is only one type of observational unit, so one table will be appropriate.
To tidy and transform the data, I loaded the tidyr package (https://cran.r-project.org/web/packages/tidyr/tidyr.pdf) and the dplyr package (https://cran.r-project.org/web/packages/dplyr/dplyr.pdf).
install.packages("tidyr",repos='http://mirrors.nics.utk.edu/cran/')
library(tidyr)
library(dplyr)
Once the packages were loaded, I pipelined three functions to create the tidy data frame ‘cross_tab’. The steps performed:
1. Turn the five region columns into values in a ‘region’ column using the gather function.
2. Use the rename function to rename the ‘n’ column created in Step 1 to ‘citizenship_cnts’.
3. Use the mutate function to create a mth_num column (will be needed to facilitate plotting).
# Pipelined commands executing steps shown above
cross_tab <- cross_tab %>% gather("region", "n", 2:6) %>% rename(citizenship_cnts = n) %>% mutate(mth_num = match(mth, month.name))
head(cross_tab)
## mth region citizenship_cnts mth_num
## 1 April region1 13 4
## 2 May region1 17 5
## 3 June region1 8 6
## 4 July region1 13 7
## 5 August region1 18 8
## 6 September region1 25 9
The suggested analysis is to “compare monthly citizenship for the given regions”. To do this, I loaded the ggplot2 package and created line charts by region.
# Load ggplot2 to create line chart
library("ggplot2")
# Create line graph showing citizenship counts by month by region
ggplot(data=cross_tab, aes(x=month.abb[mth_num], y=citizenship_cnts, group=region, colour=region)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Month") + ylab("Citizenship Counts")