12.2.1 Exercises

1. Using prose, describe how the variables and observations are organised in each of the sample tables.

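  • table1: one row per country and year, with cases and population as columns
  • table2: looks tidy, but each country and year takes two rows (type says whether count is cases or population)
  • table3: one row per country and year; rate is a "cases/population" string
  • table4a: cases, with one column per year
  • table4b: population, with one column per year
  • table5: year split into century and year columns, plus rate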

2. Compute the rate for table2, and table4a + table4b. You will need to perform four operations:

table2 %>% spread(type, count) %>%
  mutate(rate = cases/population * 10000) %>%
  gather(key = type, value = count, cases, population) %>%
  arrange(country, year) %>% select(country, year, type, count, rate) %>% kable
country      year  type        count       rate
Afghanistan  1999  cases       745         0.372741
Afghanistan  1999  population  19987071    0.372741
Afghanistan  2000  cases       2666        1.294466
Afghanistan  2000  population  20595360    1.294466
Brazil       1999  cases       37737       2.193931
Brazil       1999  population  172006362   2.193931
Brazil       2000  cases       80488       4.612363
Brazil       2000  population  174504898   4.612363
China        1999  cases       212258      1.667495
China        1999  population  1272915272  1.667495
China        2000  cases       213766      1.669488
China        2000  population  1280428583  1.669488
table4a_1 <- table4a %>% gather(year, cases, `1999`, `2000`)
table4b_1 <- table4b %>% gather(year, population, `1999`, `2000`)
table4_1 <- left_join(table4a_1, table4b_1) %>% mutate(rate = cases/population * 10000)
## Joining, by = c("country", "year")
table4_1 %>% select(-population) %>% spread(year, cases) %>% kable
country      rate      1999    2000
Afghanistan  0.372741  745     NA
Afghanistan  1.294466  NA      2666
Brazil       2.193931  37737   NA
Brazil       4.612363  NA      80488
China        1.667495  212258  NA
China        1.669488  NA      213766
table4_1 %>% select(-cases) %>% spread(year, population) %>% kable
country      rate      1999        2000
Afghanistan  0.372741  19987071    NA
Afghanistan  1.294466  NA          20595360
Brazil       2.193931  172006362   NA
Brazil       4.612363  NA          174504898
China        1.667495  1272915272  NA
China        1.669488  NA          1280428583
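
The NA cells above appear because rate differs between years, so keeping it while spreading leaves one row per country and rate. A minimal sketch of an alternative that computes the rate directly in the wide layout; it assumes table4a and table4b list the same countries in the same order, and table4_rate is a name made up here:

```r
library(tidyr)
library(tibble)

# Assumption: both tables have identical country columns, row for row.
stopifnot(identical(table4a$country, table4b$country))

# Divide the matching year columns; the result keeps the table4a/table4b layout.
table4_rate <- tibble(
  country = table4a$country,
  `1999`  = table4a[["1999"]] / table4b[["1999"]] * 10000,
  `2000`  = table4a[["2000"]] / table4b[["2000"]] * 10000
)
table4_rate
```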

3. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?

table2 %>% filter(type == "cases") %>% 
  ggplot(aes(year, count)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country))

12.3.3 Exercises

1. Why are gather() and spread() not perfectly symmetrical? Carefully consider the following example:

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>% 
  spread(year, return) %>% 
  gather("year", "return", `2015`:`2016`)
## # A tibble: 4 x 3
##    half year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     2 2015    0.59
## 3     1 2016    0.92
## 4     2 2016    0.17

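year has become character (chr): spread() turns the year values into column names, which are always strings, and gather() reads them back as strings.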

Both functions have a convert argument that can restore the type. From the documentation:

spread(): This is useful if the value column was a mix of variables that was coerced to a string.

gather(): This is useful if the column types are actually numeric, integer, or logical.
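
As a rough check of the convert argument, the sketch below repeats the round trip with convert = TRUE in gather(). The name round_trip is made up here. The expectation is that year comes back as a number rather than a string, although not necessarily with the exact original type, since the guess is left to type conversion.

```r
library(tidyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

# convert = TRUE asks gather() to guess the type of the new key column.
round_trip <- stocks %>%
  spread(year, return) %>%
  gather("year", "return", `2015`:`2016`, convert = TRUE)

class(round_trip$year)
```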

2. Why does this code fail?

table4a %>% gather(1999, 2000, key = "year", value = "cases")

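Because the bare numbers 1999 and 2000 are interpreted as column positions (the 1999th and 2000th columns), not as column names. Names that are not syntactic must be wrapped in backticks.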
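
A minimal sketch of the working call, with the year columns quoted in backticks (table4a_long is a name made up here):

```r
library(tidyr)

# Backticks make 1999 and 2000 refer to column names instead of positions.
table4a_long <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
table4a_long
```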

3. Why does spreading this tibble fail? How could you add a new column to fix the problem?

people <- tribble(
  ~name,             ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

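It fails because "Phillip Woods" has two age values, so name and key do not identify a unique row.

Option 1: drop the duplicated second entry.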

people %>% distinct(name, key, .keep_all = TRUE) %>%
  spread(key, value) %>% kable
name             age  height
Jessica Cordero  37   156
Phillip Woods    45   186

Option 2: add an id column.

people <- tribble(
  ~name,             ~key,    ~value, ~id,
  #-----------------|--------|------|----|
  "Phillip Woods",   "age",       45,   1,
  "Phillip Woods",   "height",   186,   1,
  "Phillip Woods",   "age",       50,   2,
  "Jessica Cordero", "age",       37,   3,
  "Jessica Cordero", "height",   156,   3
)

spread(people, key, value) %>% kable
name             id  age  height
Jessica Cordero  3   37   156
Phillip Woods    1   45   186
Phillip Woods    2   50   NA
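
Instead of typing the id by hand, a counter can be generated per name and key. This is only a sketch: people_raw, people_wide and the obs column are names made up here, and it assumes repeated rows for the same name and key should be treated as separate observations in order of appearance.

```r
library(dplyr)
library(tidyr)

people_raw <- tribble(
  ~name,             ~key,     ~value,
  "Phillip Woods",   "age",        45,
  "Phillip Woods",   "height",    186,
  "Phillip Woods",   "age",        50,
  "Jessica Cordero", "age",        37,
  "Jessica Cordero", "height",    156
)

# obs numbers the repeats within each (name, key) pair, which makes rows unique.
people_wide <- people_raw %>%
  group_by(name, key) %>%
  mutate(obs = row_number()) %>%
  ungroup() %>%
  spread(key, value)
people_wide
```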

4. Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

preg %>% gather(key = sex, value = count, male, female)
## # A tibble: 4 x 3
##   pregnant sex    count
##   <chr>    <chr>  <dbl>
## 1 yes      male      NA
## 2 no       male      20
## 3 yes      female    10
## 4 no       female    12

12.4.3 Exercises

1. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

tb1 <- tibble(x = c("a,b,c", "d,e,f,g", "h,i,j"))
tb1 %>% separate(x, c("one", "two", "three")) %>% kable
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
one two three
a b c
d e f
h i j
tb1 %>% separate(x, c("one", "two", "three"), extra = "warn") %>% kable # the default: same result as "drop", plus a warning
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
one two three
a b c
d e f
h i j
tb1 %>% separate(x, c("one", "two", "three"), extra = "drop") %>% kable
one two three
a b c
d e f
h i j
tb1 %>% separate(x, c("one", "two", "three"), extra = "merge") %>% kable
one two three
a b c
d e f,g
h i j
tb2 <- tibble(x = c("a,b,c", "d,e", "f,g,i"))
tb2 %>% separate(x, c("one", "two", "three")) %>% kable
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
one two three
a b c
d e NA
f g i
tb2 %>% separate(x, c("one", "two", "three"), fill = "warn") %>% kable # the default: same result as "right", plus a warning
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
one two three
a b c
d e NA
f g i
tb2 %>% separate(x, c("one", "two", "three"), fill = "right") %>% kable
one two three
a b c
d e NA
f g i
tb2 %>% separate(x, c("one", "two", "three"), fill = "left") %>% kable
one two three
a b c
NA d e
f g i

2. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

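By default (remove = TRUE) the input column(s) are dropped from the result. Set it to FALSE when you want to keep the original data alongside the new columns.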
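
A small sketch using table3 from the chapter, keeping the original rate string next to the separated pieces (table3_kept is a name made up here):

```r
library(tidyr)

# remove = FALSE keeps the original rate column in the output.
table3_kept <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", remove = FALSE)
names(table3_kept)
```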

3. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?

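  • separate() specifies where to split (a separator or a position), whereas extract() specifies the parts to pull out (regular-expression groups).
  • Putting strings together is simpler than taking them apart: there are many ways to split a string, but only one way to join pieces.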
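
To make the contrast concrete, here is a sketch that splits the same toy column both ways. The codes tibble and its values are made up for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: a letter and a digit joined by a hyphen.
codes <- tibble(x = c("a-1", "b-2", "c-3"))

# separate(): say where to cut.
by_sep <- codes %>%
  separate(x, into = c("letter", "number"), sep = "-")

# extract(): say which pieces to keep, one regex group per new column.
by_group <- codes %>%
  tidyr::extract(x, into = c("letter", "number"), regex = "([a-z])-([0-9])")
```

tidyr::extract() is written with its namespace because other packages (magrittr, for instance) also export a function called extract().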

12.5.1 Exercises

1. Compare and contrast the fill arguments to spread() and complete().

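spread()'s fill is a single value applied to every missing cell, whereas complete()'s fill is a named list, so a different value can be set for each column. (stocks below is the year/qtr/return tibble from section 12.5 of the chapter, not the year/half/return one used above.)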

stocks %>% spread(year, return) %>%
  complete(qtr, fill = list(`2015` = 12, `2016` = 30)) %>% kable
qtr  2015   2016
1    1.88   30.00
2    0.59   0.92
3    0.35   0.17
4    12.00  2.66
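
The two fill arguments can also be compared side by side on the long data. This sketch redefines stocks as in section 12.5 of the chapter; the fill values 0 and -1 are arbitrary, and wide_filled and long_filled are names made up here.

```r
library(tidyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# spread(): one value for every missing cell.
wide_filled <- stocks %>% spread(year, return, fill = 0)

# complete(): a named list, one entry per column to fill.
long_filled <- stocks %>% complete(year, qtr, fill = list(return = -1))
```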

2. What does the direction argument to fill() do?

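It sets whether missing values are filled downwards ("down", the default, using the previous non-missing value) or upwards ("up", using the next one).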

treatment %>%
  fill(person, .direction = "up")
## # A tibble: 4 x 3
##   person           treatment response
##   <chr>                <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2 Katherine Burke          2       10
## 3 Katherine Burke          3        9
## 4 Katherine Burke          1        4
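
For comparison, a sketch that runs both directions. It assumes treatment is the tibble from the chapter, reproduced here so the block runs on its own; filled_down and filled_up are names made up here.

```r
library(tidyr)
library(tibble)

treatment <- tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          9,
  "Katherine Burke",  1,          4
)

# Default direction: carry the last seen value forward.
filled_down <- treatment %>% fill(person)

# "up": take the next non-missing value instead.
filled_up <- treatment %>% fill(person, .direction = "up")
```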

12.6.1 Exercises

1. In this case study I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?

complete(who5, country, year, type, sex, age) %>%
  count(year, isna = is.na(cases)) %>% filter(isna) %>%
  ggplot(aes(year, n)) +
  geom_line()

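The number of missing values changes from year to year, which suggests that the coverage of the survey changed over time.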

2. What happens if you neglect the mutate() step? (mutate(key = stringr::str_replace(key, "newrel", "new_rel")))

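separate() does not work properly: keys that start with newrel contain only one underscore, so they split into two pieces instead of three.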
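
A toy illustration with two made-up keys, one in each style. The column names follow the case study, and fill = "right" is used only to silence the warning.

```r
library(tidyr)
library(tibble)

# Hypothetical sample of key values: one with new_ and one with newrel.
keys <- tibble(key = c("new_sp_m014", "newrel_m014"))

pieces <- keys %>%
  separate(key, into = c("new", "type", "sexage"), sep = "_", fill = "right")
pieces
```

In the second row the pieces shift: new holds "newrel", type holds "m014", and sexage is NA.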

3. I claimed that iso2 and iso3 were redundant with country. Confirm this claim.

who3 %>% count(country)
## # A tibble: 219 x 2
##    country                 n
##    <chr>               <int>
##  1 Afghanistan           244
##  2 Albania               448
##  3 Algeria               224
##  4 American Samoa        172
##  5 Andorra               387
##  6 Angola                270
##  7 Anguilla              155
##  8 Antigua and Barbuda   346
##  9 Argentina             448
## 10 Armenia               461
## # … with 209 more rows
who3 %>% count(country, iso2, iso3)
## # A tibble: 219 x 4
##    country             iso2  iso3      n
##    <chr>               <chr> <chr> <int>
##  1 Afghanistan         AF    AFG     244
##  2 Albania             AL    ALB     448
##  3 Algeria             DZ    DZA     224
##  4 American Samoa      AS    ASM     172
##  5 Andorra             AD    AND     387
##  6 Angola              AO    AGO     270
##  7 Anguilla            AI    AIA     155
##  8 Antigua and Barbuda AG    ATG     346
##  9 Argentina           AR    ARG     448
## 10 Armenia             AM    ARM     461
## # … with 209 more rows
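
Both counts return 219 rows, so adding iso2 and iso3 does not split any country into more than one group. A more direct check is sketched below. It runs on the raw who dataset shipped with tidyr rather than the intermediate who3, and ambiguous is a name made up here.

```r
library(dplyr)
library(tidyr)

# Countries that map to more than one (iso2, iso3) pair; expected to be empty.
ambiguous <- who %>%
  distinct(country, iso2, iso3) %>%
  count(country) %>%
  filter(n > 1)
nrow(ambiguous)
```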

4. For each country, year, and sex compute the total number of cases of TB. Make an informative visualisation of the data.

who_count1 <- who5 %>% count(iso3, year, wt = cases) %>%
  rename(cases = n) %>% complete(iso3, year)

library(rnaturalearth)
library(rnaturalearthdata)

world <- ne_countries(scale = "medium", returnclass = "sf")

Countries that could not be plotted (their ISO codes did not match):

iso3
ANT
BES
SCG
TKL
TUV
library(gganimate)
library(wbstats)

pop_data <- wb(indicator = "SP.POP.TOTL", startdate = 1980, enddate = 2013) %>% as_tibble() %>%
  mutate(year = as.integer(date)) %>% 
  select(iso3 = iso3c, year, population = value) %>%
  complete(iso3, year)

who_world <- world %>% right_join(who_count1, by = c("iso_a3" = "iso3"))
who_world_pop <- who_world %>% left_join(pop_data, by = c("iso_a3" = "iso3", "year" = "year"))


who_world_pop %>% filter(year == 1980) %>%
  ggplot() + geom_sf(aes(fill = cases / population))

who_world_pop %>% filter(year == 1995) %>%
  ggplot() + geom_sf(aes(fill = cases / population))

who_world_pop %>% filter(year == 2013) %>%
  ggplot() + geom_sf(aes(fill = cases / population))

p <- ggplot(data = who_world_pop) +
  geom_sf(aes(fill = cases / population))

## ani <- p + transition_time(who_world_pop$year,
##   range = range(pull(who_world_pop, year)) ) +
##   labs(title = "Year: {frame_time}")