Disclaimer: The contents of this document come from Chapter 9. Tidy Data with tidyr of R for Data Science (Wickham & Grolemund, 2017). This document is prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any question, contact the instructor, Yongsung Lee, Ph.D. via yongsung.lee(at)gatech.edu.
This document is also published on RPubs.
Examples ([Source](https://www.123rf.com/photo_28508797_before-untidy-and-after-tidy-wardrobe-with-colorful-winter-clothes-and-accessories-messy-clothes-thr.html))

Examples (Source)

install.packages("tidyverse", repos = "http://cran.us.r-project.org", dependencies = TRUE)
install.packages("nycflights13", repos = "http://cran.us.r-project.org", dependencies = TRUE)
library(tidyverse)
library(nycflights13)

Happy families are all alike; every unhappy family is unhappy in its own way. - Leo Tolstoy

Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham

1. Intro

What we do:

  1. Understand tibble in R
  2. Read in files to an R session
  3. Change the structure of dataframes (e.g., long form or wide form)

Why we do:

  1. Consistency
  2. R is good at handling vectors: good to have variables in columns and observations in rows

Terms:

  1. Tidy data Each column is a variable, and each row is an observation (i.e., wide form)

2. Tibbles with tibble

as_tibble(iris)
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ... with 140 more rows
# note that each line inside tibble denotes one column 
tibble(
  x = 1:5, 
  y = 1, 
  z = x^2 + y
)
## # A tibble: 5 x 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26
# nonsyntactic names are possible with backticks  
tb <- tibble( 
  `:)` = "smile", 
  ` ` = "space", 
  `2000` = "number"
)
tb
## # A tibble: 1 x 3
##   `:)`  ` `   `2000`
##   <chr> <chr> <chr> 
## 1 smile space number
# transposed tibble: easier form for data entry 
# note that each line inside triblle denotes one observation  
tribble( 
  ~x, ~y, ~z, 
  #--/--/---- this line is not required; it's for those who work on data entry 
  "a", 1, 3.6, 
  "b", 1, 8.5 
)
## # A tibble: 2 x 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a         1   3.6
## 2 b         1   8.5
test <- tibble(
  a = lubridate::now() + runif(1e3) * 86400, 
  b = lubridate::today() + runif(1e3) * 30, 
  c = 1:1e3, 
  d = runif(1e3), 
  e = sample(letters, 1e3, replace = TRUE)
)

nycflights13::flights %>% 
  print(n = 10, width = Inf)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
##    sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
##             <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
##  1            819        11 UA        1545 N14228  EWR    IAH        227
##  2            830        20 UA        1714 N24211  LGA    IAH        227
##  3            850        33 AA        1141 N619AA  JFK    MIA        160
##  4           1022       -18 B6         725 N804JB  JFK    BQN        183
##  5            837       -25 DL         461 N668DN  LGA    ATL        116
##  6            728        12 UA        1696 N39463  EWR    ORD        150
##  7            854        19 B6         507 N516JB  EWR    FLL        158
##  8            723       -14 EV        5708 N829AS  LGA    IAD         53
##  9            846        -8 B6          79 N593JB  JFK    MCO        140
## 10            745         8 AA         301 N3ALAA  LGA    ORD        138
##    distance  hour minute time_hour          
##       <dbl> <dbl>  <dbl> <dttm>             
##  1     1400     5     15 2013-01-01 05:00:00
##  2     1416     5     29 2013-01-01 05:00:00
##  3     1089     5     40 2013-01-01 05:00:00
##  4     1576     5     45 2013-01-01 05:00:00
##  5      762     6      0 2013-01-01 06:00:00
##  6      719     5     58 2013-01-01 05:00:00
##  7     1065     6      0 2013-01-01 06:00:00
##  8      229     6      0 2013-01-01 06:00:00
##  9      944     6      0 2013-01-01 06:00:00
## 10      733     6      0 2013-01-01 06:00:00
## # ... with 3.368e+05 more rows

Subsetting: extract only part of a dataframe.

df <- tibble(
  x = runif(5), 
  y = rnorm(5)
)
# extract by name 
df$x
## [1] 0.0003648389 0.9236751895 0.8817046548 0.7365425052 0.8366075973
df[["x"]]
## [1] 0.0003648389 0.9236751895 0.8817046548 0.7365425052 0.8366075973
# extract by position 
df[[1]]
## [1] 0.0003648389 0.9236751895 0.8817046548 0.7365425052 0.8366075973
# with pipes, use the placeholder . 
df %>% .$x 
## [1] 0.0003648389 0.9236751895 0.8817046548 0.7365425052 0.8366075973
df %>% .[["x"]]
## [1] 0.0003648389 0.9236751895 0.8817046548 0.7365425052 0.8366075973

Tibbles are new, and sometimes they do not work with old tools.

class(as.data.frame(tb))
## [1] "data.frame"

3. Data Import with readr

How to read in files?

  1. Download a sample data file from Hadley’s Github

  2. Save it in your working directory. To check it, type getwd()

  3. Put the filename with double quotation marks inside read_csv()
# getwd()
heights <- read_csv("heights.csv")
## Parsed with column specification:
## cols(
##   earn = col_double(),
##   height = col_double(),
##   sex = col_character(),
##   ed = col_integer(),
##   age = col_integer(),
##   race = col_character()
## )
# How to check any problems when reading in data? 
challenge <- read_csv(readr_example("challenge.csv")) 
## Parsed with column specification:
## cols(
##   x = col_integer(),
##   y = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 1000 parsing failures.
## row # A tibble: 5 x 5 col     row col   expected       actual      file                              expected   <int> <chr> <chr>          <chr>       <chr>                             actual 1  1001 x     no trailing c~ .238379750~ 'C:/Users/ylee366/Documents/R/R-~ file 2  1002 x     no trailing c~ .411679971~ 'C:/Users/ylee366/Documents/R/R-~ row 3  1003 x     no trailing c~ .746071676~ 'C:/Users/ylee366/Documents/R/R-~ col 4  1004 x     no trailing c~ .723450553~ 'C:/Users/ylee366/Documents/R/R-~ expected 5  1005 x     no trailing c~ .614524137~ 'C:/Users/ylee366/Documents/R/R-~
## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
## See problems(...) for more details.
problems(challenge)
## # A tibble: 1,000 x 5
##      row col   expected       actual      file                            
##    <int> <chr> <chr>          <chr>       <chr>                           
##  1  1001 x     no trailing c~ .238379750~ 'C:/Users/ylee366/Documents/R/R~
##  2  1002 x     no trailing c~ .411679971~ 'C:/Users/ylee366/Documents/R/R~
##  3  1003 x     no trailing c~ .746071676~ 'C:/Users/ylee366/Documents/R/R~
##  4  1004 x     no trailing c~ .723450553~ 'C:/Users/ylee366/Documents/R/R~
##  5  1005 x     no trailing c~ .614524137~ 'C:/Users/ylee366/Documents/R/R~
##  6  1006 x     no trailing c~ .473980569~ 'C:/Users/ylee366/Documents/R/R~
##  7  1007 x     no trailing c~ .578461039~ 'C:/Users/ylee366/Documents/R/R~
##  8  1008 x     no trailing c~ .241593722~ 'C:/Users/ylee366/Documents/R/R~
##  9  1009 x     no trailing c~ .114378662~ 'C:/Users/ylee366/Documents/R/R~
## 10  1010 x     no trailing c~ .298344632~ 'C:/Users/ylee366/Documents/R/R~
## # ... with 990 more rows
# Modifying code to solve the problems
challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
  #x = col_integer(), 
  #y = col_character()
  x = col_double(), 
  y = col_date()
  )
) 

# Check any remaining problems 
tail(challenge) # print the last six observations <-> head()
## # A tibble: 6 x 2
##       x y         
##   <dbl> <date>    
## 1 0.805 2019-11-21
## 2 0.164 2018-03-29
## 3 0.472 2014-08-04
## 4 0.718 2015-08-16
## 5 0.270 2020-02-04
## 6 0.608 2019-01-06
# Alternatively, check more than the first 1,000 observations 
challenge2 <- read_csv(
  readr_example("challenge.csv"), 
  guess_max = 1001 # the default: 1000 
)
## Parsed with column specification:
## cols(
##   x = col_double(),
##   y = col_date(format = "")
## )

How to write files to disk?

# However, it does not keep all information 
write_csv(challenge, "challenge.csv")
read_csv("challenge.csv")
## Parsed with column specification:
## cols(
##   x = col_integer(),
##   y = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 1000 parsing failures.
## row # A tibble: 5 x 5 col     row col   expected               actual             file            expected   <int> <chr> <chr>                  <chr>              <chr>           actual 1  1001 x     no trailing characters .23837975086644292 'challenge.csv' file 2  1002 x     no trailing characters .41167997173033655 'challenge.csv' row 3  1003 x     no trailing characters .7460716762579978  'challenge.csv' col 4  1004 x     no trailing characters .723450553836301   'challenge.csv' expected 5  1005 x     no trailing characters .614524137461558   'challenge.csv'
## ... ................. ... ....................................................................... ........ ....................................................................... ...... ....................................................................... .... ....................................................................... ... ....................................................................... ... ....................................................................... ........ .......................................................................
## See problems(...) for more details.
## # A tibble: 2,000 x 2
##        x y    
##    <int> <chr>
##  1   404 <NA> 
##  2  4172 <NA> 
##  3  3004 <NA> 
##  4   787 <NA> 
##  5    37 <NA> 
##  6  2332 <NA> 
##  7  2489 <NA> 
##  8  1449 <NA> 
##  9  3665 <NA> 
## 10  3863 <NA> 
## # ... with 1,990 more rows
# to keep variable types, use write_rds() 
write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
## # A tibble: 2,000 x 2
##        x y         
##    <dbl> <date>    
##  1   404 NA        
##  2  4172 NA        
##  3  3004 NA        
##  4   787 NA        
##  5    37 NA        
##  6  2332 NA        
##  7  2489 NA        
##  8  1449 NA        
##  9  3665 NA        
## 10  3863 NA        
## # ... with 1,990 more rows

4. What’s wrong with untidy data?

table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

Two common problems:

  1. One variable might be spread across multiple columns.
  2. One observation might be scattered across multiple rows.

Luckily, only one is found in most data. Sometimes, both are present though.

# Still, each country has two rows (repetition) 
table1 %>%
  mutate(rate = cases / population * 10000) 
## # A tibble: 6 x 5
##   country      year  cases population  rate
##   <chr>       <int>  <int>      <int> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67
# Weighted count (sum of "cases"" variable), instead of counting # of rows 
table1 %>% 
  count(year, wt = cases) 
## # A tibble: 2 x 2
##    year      n
##   <int>  <int>
## 1  1999 250740
## 2  2000 296920
# Visualize changes over time 
# put ", color = "grey50"" outside of aes. 
# Otherwise it does not refer to a specific color 
# Instead R sees it as a value that is common to all cases, and R uses a default (ugly) color for all cases (points).
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), color = 'grey50') +  
  geom_point(aes(color = country))

5a. Gather with gather

# Check the dataframe again 
table4a 
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

We need three sets of inputs:

  1. Columns for gethering
  2. key = A new variable whose values form the gathered columns (a categorical variable)
  3. value = A new variable whose values are spread across multiple columns in the input dataframe
# Note the names of two variables do not follow the R naming rules: Start with alphabets 
# Thus, we  use a "backtick" or ` 
table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
table4a %>%
  gather(`1999`:`2000`, key = year, value = cases) # without quotation marks 
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
table4a %>%
  gather(year, cases, `1999`:`2000`) # putting key and value first without argument names  
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population" )
## # A tibble: 6 x 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583

How to join two dataframes? Use left_join (more details in Chapter 10)

# Note that the order of rows are the same. 
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population" )
left_join(tidy4a, tidy4b)
## Joining, by = c("country", "year")
## # A tibble: 6 x 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583

5b. Spread with spread

When to use? An observation is scattered across multiple rows.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

Two inputs are needed:

  1. key = The column that contains variable names (categorival variable)
  2. value = The column whose values will be scattered across multiple columns in the output dataframe
table2 %>% 
  spread(key = type, value = count) 
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2 %>% 
  spread(type, count) 
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2 %>% 
  spread(type, count) %>%
  mutate(
    rate = cases/population *1000
  ) %>% 
  select(-cases, -population)
## # A tibble: 6 x 3
##   country      year   rate
##   <chr>       <int>  <dbl>
## 1 Afghanistan  1999 0.0373
## 2 Afghanistan  2000 0.129 
## 3 Brazil       1999 0.219 
## 4 Brazil       2000 0.461 
## 5 China        1999 0.167 
## 6 China        2000 0.167

5c. Exercises (p.156 or online)

No 3. Why does spreading this tibble fail? How could you add a new column to fix the problem?

people <- ~tribble(
  ~name, ~key, ~value, 
  "Phillip Woods", "age", 45, 
  "Phillip Woods", "height", 186, 
  "Phillip Woods", "age", 50, 
  "Jessica Cordero", "age", 37, 
  "Jessica Cordero", "height", 156 
)

No 4. Tidy this simple tibble. Do you need to spread or gather it? What are the variables?

preg <- tribble(
  ~pregnant, ~male, ~female, 
  "yes", NA, 10, 
  "no", 20, 12
) # play with na.rm = TRUE  

6a. Separate with separate

When to use? One column contains more than one variables (<-> unite())

How it works? Split values in a sinlge cell by a separator character (a non alphanumerical character by default)

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table3 %>% 
  separate(rate, into= c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
## * <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table3 %>% 
  separate(rate, into= c("cases", "population"), sep = "/", convert = TRUE) # specify a separator & convert to integers
## # A tibble: 6 x 4
##   country      year  cases population
## * <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table3 %>% 
  separate(year, into=c("century", "year"), sep = 2, convert = FALSE) # override the previous year variable & as characters
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

6b. Unite with unite

table5
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
table5 %>%
  unite(new, century, year) # the default separator is _ (underline)
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583
table5 %>%
  unite(new, century, year, sep = "") # no separator 
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

6c. Exercises (p.160 or online)

No 2. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

-> Read RDocumentation about unite

7. Missing Values

Two types of missing

  1. Explicit missing: marked by NA
  2. Implicit missing: not obvious

How to detect missing values?

stocks <- tibble(
  year =   c(2015, 2015, 2015, 2015, 2016, 2016, 2016), 
  qtr =    c(   1,    2,    3,    4,    2,    3,    4), 
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks %>% 
  spread(year, return) # two missing values present, not one 
## # A tibble: 4 x 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88  NA   
## 2     2   0.59   0.92
## 3     3   0.35   0.17
## 4     4  NA      2.66
stocks %>% 
  spread(year, return) %>% 
  gather(year, return, `2015`:`2016`, na.rm = TRUE) # remove missing values, they become implicit 
## # A tibble: 6 x 3
##     qtr year  return
## * <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     2 2015    0.59
## 3     3 2015    0.35
## 4     2 2016    0.92
## 5     3 2016    0.17
## 6     4 2016    2.66

Alternative way of checking missing values is use complete().

  1. Input: a set of columns (categorical variables)
  2. Output: each row shows each of the unique combinations of input columns
stocks %>% 
  complete(year, qtr)
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  NA   
## 5  2016     1  NA   
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66

8. Weekly HW #3

From this chapter, you have ONE question for Weekly Homework #3.

Submit your R script file that includes your answers to individual questions (use #comment for discussion) on Canvas.

Note that online chapters differ from those of the print copy. Check the page numbers below.

Due at 11:59:00 PM (EST) on February 3rd Sunday.

12.2.Tidy Data (in the print copy p.151 )

No 2. Compute the rate for table2, and table4a + table4b. You will need to perform four operations:

  1. Extract the number of TB cases per country per year.
  2. Extract the matching population per country per year.
  3. Divide cases by population, and multiply by 10000.
  4. Store back in the appropriate place.

Which representation is easiest to work with? Which is hardest? Why?