Personal notes for future reference, written after completing the JHU Data Science Specialization, offered online through Coursera. The notes paraphrase Roger D. Peng’s book Mastering Software Development in R.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

x <- 2
y <- 3
x + y
## [1] 5

Note that adding the echo = FALSE option to the code chunk prevents the R code that generated the output from being printed, so only the result appears in the knitted document.
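As a rough sketch of where the echo option goes: in the .Rmd source it sits in the chunk header, and the same default can be set for all later chunks through knitr::opts_chunk$set(). The guard around the call is an assumption added here so the snippet also runs where knitr is not installed.

```r
# In the .Rmd source, the option is written in the chunk header:
#   {r, echo = FALSE}
# The same default can be applied to every later chunk from a setup chunk.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(echo = FALSE)
}
```

A per-chunk option still overrides the global default, so a single chunk can show its code again with echo = TRUE.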

Tidy Data

According to Hadley Wickham, a tidy dataset has the following properties:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table
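The third property is the least obvious of the three, so here is a minimal sketch in base R. The two tables, their column names, and every value are made up for illustration and do not come from the book.

```r
# Hypothetical illustration data: all names and values are made up.
# One table per observational unit. First unit: patients.
patients <- data.frame(
  patient_id = c(1, 2),
  birth_year = c(1950, 1962)
)

# Second unit: visits. Each row is one observation (a single visit),
# and each column is one variable.
visits <- data.frame(
  patient_id = c(1, 1, 2),
  visit_day  = c(1, 30, 1),
  weight_kg  = c(70.5, 69.8, 82.1)
)

# The shared key lets the two tables be combined when an analysis needs both.
combined <- merge(visits, patients, by = "patient_id")
```

Keeping patient-level facts out of the visits table avoids repeating them on every visit row.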

Example of Untidy Data

head(VADeaths)
##       Rural Male Rural Female Urban Male Urban Female
## 50-54       11.7          8.7       15.4          8.4
## 55-59       18.1         11.7       24.3         13.6
## 60-64       26.9         20.3       37.0         19.3
## 65-69       41.0         30.9       54.6         35.1
## 70-74       66.0         54.3       71.1         50.0

This format is not tidy because variables are stored in both the row names and the column names. Three variables are encoded this way: age category (the row names), and gender and urban-ness (combined in each column name). The fourth variable, the death rate itself, is held in the cells of the table.
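One quick way to see where those variables are hiding is to inspect the matrix's dimension names. This is only a sketch in base R; the names age_levels and col_parts are introduced here for illustration.

```r
# VADeaths ships with base R (datasets package) as a numeric matrix.
stopifnot(is.matrix(VADeaths))

# The age categories exist only as row names.
age_levels <- rownames(VADeaths)

# Each column name fuses two variables, e.g. "Rural Male".
col_parts <- strsplit(colnames(VADeaths), " ", fixed = TRUE)

# Every column name splits into exactly two pieces: urban-ness and gender.
lengths(col_parts)
```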

Conversion to Tidy Data

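library(tidyverse)
tidyData <- VADeaths %>% 
        # tbl_df() is deprecated; as_tibble() converts the matrix to a tibble
        as_tibble() %>% 
        # the age categories are row names, so copy them into a column
        mutate(age = row.names(VADeaths)) %>% 
        # stack the four rate columns into key/value pairs
        # (gather() still works, but tidyr now recommends pivot_longer())
        gather(key, death_rate, -age) %>% 
        # split keys such as "Rural Male" into two variables
        separate(key, c("urban", "gender"), sep = " ") %>% 
        mutate(age = factor(age), urban = factor(urban), 
               gender = factor(gender))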

head(tidyData)
## # A tibble: 6 x 4
##   age   urban gender death_rate
##   <fct> <fct> <fct>       <dbl>
## 1 50-54 Rural Male         11.7
## 2 55-59 Rural Male         18.1
## 3 60-64 Rural Male         26.9
## 4 65-69 Rural Male         41  
## 5 70-74 Rural Male         66  
## 6 50-54 Rural Female        8.7
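As a cross-check that needs no packages, the same reshaping can be sketched in base R. This is a swapped-in alternative and not the book's approach; the names long, parts, and tidy_base are introduced here for illustration.

```r
# Base R sketch of the same conversion (not the tidyverse pipeline above).
# as.table() + as.data.frame() turns the matrix into one row per cell.
long <- as.data.frame(as.table(VADeaths), stringsAsFactors = FALSE)
names(long) <- c("age", "key", "death_rate")

# Split keys such as "Rural Male" into their two parts.
parts <- do.call(rbind, strsplit(long$key, " ", fixed = TRUE))

tidy_base <- data.frame(
  age        = factor(long$age),
  urban      = factor(parts[, 1]),
  gender     = factor(parts[, 2]),
  death_rate = long$death_rate
)

head(tidy_base)
```

The result is a plain data frame with one row per age, urban, and gender combination, so it can be compared against tidyData row by row.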