tidyverse in RI highly recommend you download and use RStudio Desktop if you aren’t already.
Please copy the following commands into an R script and run in your RStudio Console:
dl_repo <- "http://cran.rstudio.org"
install.packages("tidyverse", repos = dl_repo)
install.packages("googlesheets", repos = dl_repo)Note that you don’t need to do this if you are using the Reed College RStudio Server as these are already installed there.
Slides available at http://bit.ly/reed-imp-tidy
tidyverse containsFor our work today, it includes the following R packages:
IMPORT
haven for SPSS, SAS, and Stata data filesjsonlite for JSON filesreadxl for XLS and XLSX filesreadr for CSV and TSV files (and R specific RDS files)TIDYING
tidyr for converting “messy” into “tidy” data framesSlides available at http://bit.ly/reed-imp-tidy
Slides available at http://bit.ly/reed-imp-tidy
We will begin by downloading and importing a variety of different “messy” data sets. You can download all of them in a ZIP file at http://bit.ly/reed-messy-data. The links below go to the original sources. I’ve converted these original sources into different formats.
Slides available at http://bit.ly/reed-imp-tidy
Slides available at http://bit.ly/reed-imp-tidy
who_df, county_pop_df, edu_county_df, and potus12_dfSlides available at http://bit.ly/reed-imp-tidy
library(readr)
life_exp_df <- read_csv("le_mess.csv")
View(life_exp_df)Slides available at http://bit.ly/reed-imp-tidy
tidyr packageSlides available at http://bit.ly/reed-imp-tidy
library(tidyr)
library(dplyr)
(life_exp_tidy <- life_exp_df %>%
gather(key = "year", value = "life_exp", -country))## # A tibble: 13,332 x 3
## country year life_exp
## <chr> <chr> <dbl>
## 1 Afghanistan 1951 27.13
## 2 Albania 1951 54.72
## 3 Algeria 1951 43.03
## 4 Angola 1951 31.05
## 5 Antigua and Barbuda 1951 58.26
## 6 Argentina 1951 61.93
## 7 Armenia 1951 62.67
## 8 Aruba 1951 58.96
## 9 Australia 1951 68.71
## 10 Austria 1951 65.24
## # ... with 13,322 more rows
Slides available at http://bit.ly/reed-imp-tidy
library(haven)
who_df <- read_dta("who.dta")
View(who_df)Slides available at http://bit.ly/reed-imp-tidy
country, iso2, and iso3 are redundant and identify the countryyear is a variable and it looks like it corresponds to each countrynew_sp_m014? And the other columns?…Slides available at http://bit.ly/reed-imp-tidy
Like before, we can gather these non-country and non-year variables together:
who_tidy <- who_df %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "value")We can now see what this has done to the data frame:
View(who_tidy)Slides available at http://bit.ly/reed-imp-tidy
key column correspond to new or old cases of TB._) correspond to the type of TB:
rel for relapse, ep for extrapulmonary TBsn for smear negative, sp for smear positive_ corresponds to the sex of the TB patient.014 for 0 to 14 years65 for 65 or olderSlides available at http://bit.ly/reed-imp-tidy
Looking over the entries of key in who_tidy, do you see anything else that doesn’t match the pattern?
It may be easier to remember that the entries of key correspond to column names in who_df:
names(who_df)## [1] "country" "iso2" "iso3" "year"
## [5] "new_sp_m014" "new_sp_m1524" "new_sp_m2534" "new_sp_m3544"
## [9] "new_sp_m4554" "new_sp_m5564" "new_sp_m65" "new_sp_f014"
## [13] "new_sp_f1524" "new_sp_f2534" "new_sp_f3544" "new_sp_f4554"
## [17] "new_sp_f5564" "new_sp_f65" "new_sn_m014" "new_sn_m1524"
## [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564"
## [25] "new_sn_m65" "new_sn_f014" "new_sn_f1524" "new_sn_f2534"
## [29] "new_sn_f3544" "new_sn_f4554" "new_sn_f5564" "new_sn_f65"
## [33] "new_ep_m014" "new_ep_m1524" "new_ep_m2534" "new_ep_m3544"
## [37] "new_ep_m4554" "new_ep_m5564" "new_ep_m65" "new_ep_f014"
## [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554"
## [45] "new_ep_f5564" "new_ep_f65" "newrel_m014" "newrel_m1524"
## [49] "newrel_m2534" "newrel_m3544" "newrel_m4554" "newrel_m5564"
## [53] "newrel_m65" "newrel_f014" "newrel_f1524" "newrel_f2534"
## [57] "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"
Slides available at http://bit.ly/reed-imp-tidy
stringrYou can replace all of the entries in key that contained newrel with new_rel via
library(stringr)
library(dplyr)
who_tidy <- who_tidy %>%
mutate(key = str_replace(string = key,
pattern = "newrel",
replacement = "new_rel"))Slides available at http://bit.ly/reed-imp-tidy
The entry new_sp_m014 is actually four different variables. Use the separate function to pull them apart:
who_tidy <- who_tidy %>%
separate(col = key, into = c("new", "type", "sexage"), sep = "_")
who_tidy## # A tibble: 405,440 x 8
## country iso2 iso3 year new type sexage value
## * <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 1980 new sp m014 NaN
## 2 Afghanistan AF AFG 1981 new sp m014 NaN
## 3 Afghanistan AF AFG 1982 new sp m014 NaN
## 4 Afghanistan AF AFG 1983 new sp m014 NaN
## 5 Afghanistan AF AFG 1984 new sp m014 NaN
## 6 Afghanistan AF AFG 1985 new sp m014 NaN
## 7 Afghanistan AF AFG 1986 new sp m014 NaN
## 8 Afghanistan AF AFG 1987 new sp m014 NaN
## 9 Afghanistan AF AFG 1988 new sp m014 NaN
## 10 Afghanistan AF AFG 1989 new sp m014 NaN
## # ... with 405,430 more rows
Slides available at http://bit.ly/reed-imp-tidy
sexage into two different variables.?separate, you’ll see that the following is an option:who_tidy <- who_tidy %>%
separate(col = sexage, into = c("sex", "age"), sep = 1)
who_tidy## # A tibble: 405,440 x 9
## country iso2 iso3 year new type sex age value
## * <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 1980 new sp m 014 NaN
## 2 Afghanistan AF AFG 1981 new sp m 014 NaN
## 3 Afghanistan AF AFG 1982 new sp m 014 NaN
## 4 Afghanistan AF AFG 1983 new sp m 014 NaN
## 5 Afghanistan AF AFG 1984 new sp m 014 NaN
## 6 Afghanistan AF AFG 1985 new sp m 014 NaN
## 7 Afghanistan AF AFG 1986 new sp m 014 NaN
## 8 Afghanistan AF AFG 1987 new sp m 014 NaN
## 9 Afghanistan AF AFG 1988 new sp m 014 NaN
## 10 Afghanistan AF AFG 1989 new sp m 014 NaN
## # ... with 405,430 more rows
Slides available at http://bit.ly/reed-imp-tidy
iso2 and iso3 are redundant if we have countrynew is constant since this only has new cases of TBvalue is actually cases so we can rename that column.who_tidy <- who_tidy %>% select(-iso2, -iso3, -new) %>%
na.omit() %>% rename("cases" = value)
head(who_tidy, 7)## # A tibble: 7 x 6
## country year type sex age cases
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
Slides available at http://bit.ly/reed-imp-tidy
%>% (pipe)Everything we did above can be chained into one sequence:
who_tidy <- who_df %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "value") %>%
mutate(key = str_replace(key, pattern = "newrel",
replacement = "new_rel")) %>%
separate(col = key, into = c("new", "type", "sexage"), sep = "_") %>%
separate(col = sexage, into = c("sex", "age"), sep = 1) %>%
select(-iso2, -iso3, -new) %>%
na.omit() %>%
rename("cases" = value)Slides available at http://bit.ly/reed-imp-tidy
I’ve updated the gapminder data set available in the gapminder R data package by Jenny Bryan here. Jenny provides instructions for recreating the data here.
You can download the updated data from Google Sheets by running the following in R:
# Link is https://docs.google.com/spreadsheets/d/18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA/edit?usp=sharing
updated_gap <- gs_key("18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA") %>%
gs_read()Slides available at http://bit.ly/reed-imp-tidy
Try to tidy the other three data sets you downloaded and imported.
Slides available at http://bit.ly/reed-imp-tidy
Slides available at http://bit.ly/reed-imp-tidy
sessionInfo()## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.1 (Sierra)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] stringr_1.1.0 dplyr_0.5.0 tidyr_0.6.0 needs_0.0.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 png_0.1-7 revealjs_0.7
## [4] digest_0.6.10 assertthat_0.1 R6_2.1.3
## [7] DBI_0.5 formatR_1.4 magrittr_1.5
## [10] evaluate_0.9 stringi_1.1.1 lazyeval_0.2.0
## [13] rmarkdown_1.0.9013 tools_3.3.1 readr_1.0.0
## [16] yaml_2.1.13 htmltools_0.3.5 haven_1.0.0
## [19] knitr_1.14 tibble_1.1