Project 2 - Dataset 1 (Crosstab query), DATA607 Fall 2016

The Crosstab Dataset Posted by Jose Zuniga

For this project, the objective is to select three untidy data sets posted by students on the class discussion board, and - for each of the three datasets - read the data into R, tidy the data, and complete the analysis suggested by the classmate in the discussion board post.

This R markdown file explains the steps used to complete the analysis for the first dataset I selected, which is the results of a Crosstab query posted by Jose Zuniga. The dataset and source are shown below:

Create Database Containing Crosstab Data

The Crosstab data was loaded to a MySQL database schema called ‘project_two’ into a table called ‘cross_tab’. The SQL scripts to build the database and load the data are available on GitHub at the link below. Please note that the GitHub file includes the code used for all three datasets selected for the project, not just this dataset:

https://github.com/LelandoSupreme/DATA607/blob/master/Randles_Project2_DATA607.sql

Read cross_tab MySQL Table into R

To facilitate the loading of data from MySQL to an R data frame, I loaded the “RMySQL” package.

install.packages("RMySQL",repos='http://cran.wustl.edu/')
library(RMySQL)

Once the package was installed, I connected to the database and created a data frame from the ‘cross_tab’ table.

# Get the MySQL
drv = dbDriver("MySQL")
# Create a connection to the MySQL database
con <- dbConnect(drv, user = 'root', password = 'temp1002!', dbname = 'project_two')
# Create the cross_tab data frame
cross_tab <- dbReadTable(con, "cross_tab")
# View the cross_tab data frame
head(cross_tab)

##         mth region1 region2 region3 region4 region5
## 1     April      13      33      76       2      47
## 2       May      17      55     209       1     143
## 3      June       8      63     221       1     127
## 4      July      13     104     240       6     123
## 5    August      18     121     274       9     111
## 6 September      25     160     239       2      88

As you can see, the ‘cross_tab’ table has 6 columns: mth, region1, region2, region3, region4, and region5.

Tidy and Tranform Data

Hadley Wickham defines “tidy data” (http://vita.had.co.nz/papers/tidy-data.pdf.) as data which is structured such that:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In the case of the cross_tab data, there are three variables: mth, region, and citizenship_cnts. To make each row an observation, we need a row for every month/region combination. In this case, there is only one type of observational unit, so one table will be appropriate.

To tidy and transform the data, I loaded the tidyr package (https://cran.r-project.org/web/packages/tidyr/tidyr.pdf) and the dplyr package (https://cran.r-project.org/web/packages/dplyr/dplyr.pdf).

install.packages("tidyr",repos='http://mirrors.nics.utk.edu/cran/')
library(tidyr)
library(dplyr)

Once the packages were loaded, I pipelined three functions to create the tidy data frame ‘cross_tab’. The steps performed:
1. Turn the five region columns into values in a ‘region’ column using the gather function.
2. Use the rename function to rename the ‘n’ column created in Step 1 to ‘citizenship_cnts’.
3. Use the mutate function to create a mth_num column (will be needed to facilitate plotting).

# Pipelined commands executing steps shown above
cross_tab <- cross_tab %>% gather("region", "n", 2:6) %>% rename(citizenship_cnts = n) %>% mutate(mth_num = match(mth, month.name))
head(cross_tab)

##         mth  region citizenship_cnts mth_num
## 1     April region1               13       4
## 2       May region1               17       5
## 3      June region1                8       6
## 4      July region1               13       7
## 5    August region1               18       8
## 6 September region1               25       9

Citzenship by Month for Each Region

The suggested analysis is to “compare monthly citizenship for the given regions”. To do this, I loaded the ggplot2 package and created line charts by region.

# Load ggplot2 to create line chart
library("ggplot2")
# Create line graph showing citizenship counts by month by region
ggplot(data=cross_tab, aes(x=month.abb[mth_num], y=citizenship_cnts, group=region, colour=region)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Month") + ylab("Citizenship Counts")