12.2.1 Exercises

1.Using prose, describe how the variables and observations are organised in each of the sample tables. 1) in table1, each variables(country, year, population) is placed in column; each row reflects the value of case and population for each country of each year.

2)in table2, the variables case and population are placed in the same column but separate rows. Their value can be found in count column.

  1. Table3 reflects the ratio of case and population under the column “rate”

  2. Table4 contains 2 tables. Table 4a is the value of “case” with “country” in the row and “year” in the column.

Table 4b is the value of population, and it is strucutured same as table 4a.

2.Compute the rate for table2, and table4a + table4b. You will need to perform four operations: 1.Extract the number of TB cases per country per year. 2.Extract the matching population per country per year. 3.Divide cases by population, and multiply by 10000. 4.Store back in the appropriate place.

library(tidyverse)
## ── Attaching packages ─────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Create Two tables for “cases” and “population”, and arrange them in same order

#Table2
tb_case<-filter(table2, type=="cases")%>%
  rename(cases=count)%>%
  arrange(country,year)
tb_population<-filter(table2,type=="population")%>%
  rename(population=count)%>%
 arrange(country, year)

create a new data frame with the population and cases.

tb_case_per_cap <-tibble(
  year=tb_case$year,
  country=tb_case$country,
  cases=tb_case$cases,
  population=tb_population$population
)%>%
  mutate(case_per_capital=(cases/population)*10000)%>%
  select(country,year,case_per_capital)
tb_case_per_cap<-tb_case_per_cap %>%
  mutate(type="case_per_cap")%>%
  rename(count=case_per_capital)
bind_rows(table2,tb_case_per_cap)%>%
  arrange(country,year,type,count)
## # A tibble: 18 x 4
##    country      year type           count
##    <chr>       <int> <chr>          <dbl>
##  1 Afghanistan  1999 case_per_cap 3.73e-1
##  2 Afghanistan  1999 cases        7.45e+2
##  3 Afghanistan  1999 population   2.00e+7
##  4 Afghanistan  2000 case_per_cap 1.29e+0
##  5 Afghanistan  2000 cases        2.67e+3
##  6 Afghanistan  2000 population   2.06e+7
##  7 Brazil       1999 case_per_cap 2.19e+0
##  8 Brazil       1999 cases        3.77e+4
##  9 Brazil       1999 population   1.72e+8
## 10 Brazil       2000 case_per_cap 4.61e+0
## 11 Brazil       2000 cases        8.05e+4
## 12 Brazil       2000 population   1.75e+8
## 13 China        1999 case_per_cap 1.67e+0
## 14 China        1999 cases        2.12e+5
## 15 China        1999 population   1.27e+9
## 16 China        2000 case_per_cap 1.67e+0
## 17 China        2000 cases        2.14e+5
## 18 China        2000 population   1.28e+9

#table4

table4c<-tibble(
  country=table4a$country,
  '1999'=table4a[["1999"]]/table4b[["1999"]]*10000,
  '2000'=table4a[["2000"]]/table4b[["2000"]]*10000
)
table4c
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan  0.373   1.29
## 2 Brazil       2.19    4.61
## 3 China        1.67    1.67

Which representation is easiest to work with? Which is hardest? Why? Table 1 is the easier table to work with because it place each variable in each column. Table2 separate the row for population and case. If we want to calculate the ratio of case and population, we need to create a new column for each variable. For table4, we need to join these two tables.

3.Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?

#plot “cases”

table2%>%
  filter(type=="cases")%>%
  ggplot(aes(year,count))+
  geom_line(aes(group=country,color=country))+
  ylab("cases")

12.3.3

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
## # A tibble: 4 x 3
##    half year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     1 2016    0.92
## 3     2 2015    0.59
## 4     2 2016    0.17

2.Why does this code fail? table4a %>% pivot_longer(c(1999, 2000), names_to = “year”, values_to = “cases”) #> Error: Can’t subset columns that don’t exist. #> ✖ Locations 1999 and 2000 don’t exist. #> ℹ There are only 3 columns.

#correct code

table4a %>%
  pivot_longer(c('1999','2000'),names_to ="year", values_to = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

3.What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?

people <- tribble(
  ~name, ~key, ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)
people
## # A tibble: 5 x 3
##   name            key    values
##   <chr>           <chr>   <dbl>
## 1 Phillip Woods   age        45
## 2 Phillip Woods   height    186
## 3 Phillip Woods   age        50
## 4 Jessica Cordero age        37
## 5 Jessica Cordero height    156

#pivot_wider(people,names_from = “name”, values_from = “values”)

results: Values are not uniquely identified; output will contain list-cols. * Use values_fn = list to suppress this warning. * Use values_fn = length to identify where the duplicates arise * Use values_fn = {summary_fun} to summarise duplicates

Here is a value uniqueness problem occurred.

#Add a row to indentify each observation count for each combination of name and key.

people2<-people %>%
  group_by(name,key) %>%
  mutate(obs=row_number())
people2
## # A tibble: 5 x 4
## # Groups:   name, key [4]
##   name            key    values   obs
##   <chr>           <chr>   <dbl> <int>
## 1 Phillip Woods   age        45     1
## 2 Phillip Woods   height    186     1
## 3 Phillip Woods   age        50     2
## 4 Jessica Cordero age        37     1
## 5 Jessica Cordero height    156     1
people2%>%
  pivot_wider(names_from = "name", values_from="values")
## # A tibble: 3 x 4
## # Groups:   key [2]
##   key      obs `Phillip Woods` `Jessica Cordero`
##   <chr>  <int>           <dbl>             <dbl>
## 1 age        1              45                37
## 2 height     1             186               156
## 3 age        2              50                NA

12.5.1 Missing value

1.Compare and contrast the fill arguments to pivot_wider() and complete().

ans: The values_fill argument in pivot_wider() and the fill() argument to complete() both set values to replace NA. Both arguments accept named lists to sets values for each column. In addition, the values_fill argument of pivot_wider() accepts a single value. In complete(), the fill argument also sets a value to replace NAs.

2.What does the direction argument to fill() do? # complete()

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks %>%
  pivot_wider(names_from = year, values_from = return, values_fill =0)
## # A tibble: 4 x 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88   0   
## 2     2   0.59   0.92
## 3     3   0.35   0.17
## 4     4  NA      2.66

with complete()

stocks %>%
  complete(year,qtr, fill=list(return=0))
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4   0   
## 5  2016     1   0   
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66