Personal notes for future reference, written after completing the JHU Data Science Specialization (online, via Coursera). Paraphrased from Roger D. Peng’s book Mastering Software Development in R.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:
x <- 2
y <- 3
x + y
## [1] 5
Note that adding the echo = FALSE parameter to a code chunk prevents the R code itself from being printed; only the output it generates appears in the knitted document.
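As a rough sketch of a related option: rather than adding echo = FALSE to each chunk, knitr lets you set it as a default for the whole document, usually from a setup chunk near the top. This is an assumption about how one might want to use it, not something these notes rely on, and it requires the knitr package to be installed.

```r
# Sketch: make echo = FALSE the default for every chunk that follows.
# Individual chunks can still override this with echo = TRUE in their header.
knitr::opts_chunk$set(echo = FALSE)

# Read the option back to confirm the new default.
echo_default <- knitr::opts_chunk$get("echo")
```

Setting the default globally hides all code, which suits reports for non-technical readers; for notes like these, the per-chunk form is usually the better fit.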
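According to Hadley Wickham, a tidy dataset has the following properties: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.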
head(VADeaths)
##       Rural Male Rural Female Urban Male Urban Female
## 50-54       11.7          8.7       15.4          8.4
## 55-59       18.1         11.7       24.3         13.6
## 60-64       26.9         20.3       37.0         19.3
## 65-69       41.0         30.9       54.6         35.1
## 70-74       66.0         54.3       71.1         50.0
The above format violates the tidy data principles because variables are stored in both the row names and the column names. The row names hold the age category, and each column name combines two variables: urban-ness and gender. The fourth variable, the death rate itself, appears only as the values inside the table.
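A quick way to see where the variables are hiding is to inspect the structure of VADeaths directly. This is a minimal base R sketch; the name n_tidy_rows is only illustrative.

```r
# VADeaths ships with base R (datasets package) and is a numeric matrix,
# not a data frame.
class(VADeaths)

# The row names carry the age category; each column name packs together
# urban-ness and gender.
dimnames(VADeaths)

# 5 age groups x 4 urban/gender combinations.
dim(VADeaths)

# Every cell is one death-rate observation, so a tidy version of the data
# needs one row per cell.
n_tidy_rows <- nrow(VADeaths) * ncol(VADeaths)
```

Since each of the 20 cells is a separate observation, the tidy version should end up with 20 rows and four columns (age, urban, gender, death_rate).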
library(tidyverse)
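tidyData <- VADeaths %>%
  # as_tibble() replaces the deprecated tbl_df(); it drops the row names
  as_tibble() %>%
  # restore the age category from the matrix row names
  mutate(age = row.names(VADeaths)) %>%
  # stack the four rate columns into key/value pairs
  gather(key, death_rate, -age) %>%
  # split e.g. "Rural Male" into its two variables
  separate(key, c("urban", "gender"), sep = " ") %>%
  mutate(age = factor(age), urban = factor(urban),
         gender = factor(gender))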
head(tidyData)
## # A tibble: 6 x 4
##   age   urban gender death_rate
##   <fct> <fct> <fct>       <dbl>
## 1 50-54 Rural Male         11.7
## 2 55-59 Rural Male         18.1
## 3 60-64 Rural Male         26.9
## 4 65-69 Rural Male         41  
## 5 70-74 Rural Male         66  
## 6 50-54 Rural Female        8.7
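For comparison, the same reshaping can be sketched with pivot_longer(), which newer tidyr releases recommend over gather(). This is an alternative and not the approach taken in the notes above. It assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later, and the name tidyData2 is only illustrative. Note that pivot_longer() returns the rows in a different order from gather() (all four combinations for one age group first), so head() will not match the output above line for line.

```r
library(dplyr)
library(tidyr)
library(tibble)

tidyData2 <- VADeaths %>%
  as.data.frame() %>%
  # move the row names into a regular column
  rownames_to_column("age") %>%
  # reshape and split "Rural Male" etc. into two columns in one step
  pivot_longer(-age,
               names_to = c("urban", "gender"),
               names_sep = " ",
               values_to = "death_rate") %>%
  # convert the three categorical variables to factors
  mutate(across(c(age, urban, gender), factor))

dim(tidyData2)
```

The names_sep argument does the work of the separate() call, so the gather/separate pair collapses into a single step.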