12.2.1 Exercises
1.Using prose, describe how the variables and observations are organised in each of the sample tables. 1) in table1, each variables(country, year, population) is placed in column; each row reflects the value of case and population for each country of each year.
2)in table2, the variables case and population are placed in the same column but separate rows. Their value can be found in count column.
Table3 reflects the ratio of case and population under the column “rate”
Table4 contains 2 tables. Table 4a is the value of “case” with “country” in the row and “year” in the column.
Table 4b is the value of population, and it is strucutured same as table 4a.
2.Compute the rate for table2, and table4a + table4b. You will need to perform four operations: 1.Extract the number of TB cases per country per year. 2.Extract the matching population per country per year. 3.Divide cases by population, and multiply by 10000. 4.Store back in the appropriate place.
library(tidyverse)
## ── Attaching packages ─────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Create Two tables for “cases” and “population”, and arrange them in same order
#Table2
tb_case<-filter(table2, type=="cases")%>%
rename(cases=count)%>%
arrange(country,year)
tb_population<-filter(table2,type=="population")%>%
rename(population=count)%>%
arrange(country, year)
create a new data frame with the population and cases.
tb_case_per_cap <-tibble(
year=tb_case$year,
country=tb_case$country,
cases=tb_case$cases,
population=tb_population$population
)%>%
mutate(case_per_capital=(cases/population)*10000)%>%
select(country,year,case_per_capital)
tb_case_per_cap<-tb_case_per_cap %>%
mutate(type="case_per_cap")%>%
rename(count=case_per_capital)
bind_rows(table2,tb_case_per_cap)%>%
arrange(country,year,type,count)
## # A tibble: 18 x 4
## country year type count
## <chr> <int> <chr> <dbl>
## 1 Afghanistan 1999 case_per_cap 3.73e-1
## 2 Afghanistan 1999 cases 7.45e+2
## 3 Afghanistan 1999 population 2.00e+7
## 4 Afghanistan 2000 case_per_cap 1.29e+0
## 5 Afghanistan 2000 cases 2.67e+3
## 6 Afghanistan 2000 population 2.06e+7
## 7 Brazil 1999 case_per_cap 2.19e+0
## 8 Brazil 1999 cases 3.77e+4
## 9 Brazil 1999 population 1.72e+8
## 10 Brazil 2000 case_per_cap 4.61e+0
## 11 Brazil 2000 cases 8.05e+4
## 12 Brazil 2000 population 1.75e+8
## 13 China 1999 case_per_cap 1.67e+0
## 14 China 1999 cases 2.12e+5
## 15 China 1999 population 1.27e+9
## 16 China 2000 case_per_cap 1.67e+0
## 17 China 2000 cases 2.14e+5
## 18 China 2000 population 1.28e+9
#table4
table4c<-tibble(
country=table4a$country,
'1999'=table4a[["1999"]]/table4b[["1999"]]*10000,
'2000'=table4a[["2000"]]/table4b[["2000"]]*10000
)
table4c
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 0.373 1.29
## 2 Brazil 2.19 4.61
## 3 China 1.67 1.67
Which representation is easiest to work with? Which is hardest? Why? Table 1 is the easier table to work with because it place each variable in each column. Table2 separate the row for population and case. If we want to calculate the ratio of case and population, we need to create a new column for each variable. For table4, we need to join these two tables.
3.Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
#plot “cases”
table2%>%
filter(type=="cases")%>%
ggplot(aes(year,count))+
geom_line(aes(group=country,color=country))+
ylab("cases")
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
## # A tibble: 4 x 3
## half year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 1 2016 0.92
## 3 2 2015 0.59
## 4 2 2016 0.17
2.Why does this code fail? table4a %>% pivot_longer(c(1999, 2000), names_to = “year”, values_to = “cases”) #> Error: Can’t subset columns that don’t exist. #> ✖ Locations 1999 and 2000 don’t exist. #> ℹ There are only 3 columns.
#correct code
table4a %>%
pivot_longer(c('1999','2000'),names_to ="year", values_to = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
3.What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?
people <- tribble(
~name, ~key, ~values,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
people
## # A tibble: 5 x 3
## name key values
## <chr> <chr> <dbl>
## 1 Phillip Woods age 45
## 2 Phillip Woods height 186
## 3 Phillip Woods age 50
## 4 Jessica Cordero age 37
## 5 Jessica Cordero height 156
#pivot_wider(people,names_from = “name”, values_from = “values”)
results: Values are not uniquely identified; output will contain list-cols. * Use values_fn = list to suppress this warning. * Use values_fn = length to identify where the duplicates arise * Use values_fn = {summary_fun} to summarise duplicates
Here is a value uniqueness problem occurred.
#Add a row to indentify each observation count for each combination of name and key.
people2<-people %>%
group_by(name,key) %>%
mutate(obs=row_number())
people2
## # A tibble: 5 x 4
## # Groups: name, key [4]
## name key values obs
## <chr> <chr> <dbl> <int>
## 1 Phillip Woods age 45 1
## 2 Phillip Woods height 186 1
## 3 Phillip Woods age 50 2
## 4 Jessica Cordero age 37 1
## 5 Jessica Cordero height 156 1
people2%>%
pivot_wider(names_from = "name", values_from="values")
## # A tibble: 3 x 4
## # Groups: key [2]
## key obs `Phillip Woods` `Jessica Cordero`
## <chr> <int> <dbl> <dbl>
## 1 age 1 45 37
## 2 height 1 186 156
## 3 age 2 50 NA
1.Compare and contrast the fill arguments to pivot_wider() and complete().
ans: The values_fill argument in pivot_wider() and the fill() argument to complete() both set values to replace NA. Both arguments accept named lists to sets values for each column. In addition, the values_fill argument of pivot_wider() accepts a single value. In complete(), the fill argument also sets a value to replace NAs.
2.What does the direction argument to fill() do? # complete()
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks %>%
pivot_wider(names_from = year, values_from = return, values_fill =0)
## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 0
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66
stocks %>%
complete(year,qtr, fill=list(return=0))
## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 0
## 5 2016 1 0
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66