Loading Libraries:

knitr::opts_chunk$set(warning = FALSE, fig.align = "center", out.width = "85%", 
                      message = FALSE, cache = TRUE)
library(openintro)
library(tidyverse)


Global Theme Set Up:

my_theme <- theme(
  plot.title = element_text(size = rel(1.8), hjust = 0.5, 
                            margin = margin(10,15,10,10)),
  axis.title = element_text(size = rel(1.5), hjust = 0.5),
  axis.title.x = element_text(margin = margin(10,5,5,5)),
  axis.title.y = element_text(margin = margin(5,10,5,5)),
  axis.text = element_text(size = rel(1.1)))

Part I: Lab Exercise

Lab 1:
Run the following code to create a random number list. Then use logical subsetting to find all numbers that are greater than one and less than two.

Code:

rand_num <- runif(100, 0, 3)
rand_num[rand_num > 1 & rand_num < 2]
##  [1] 1.082171 1.285525 1.829140 1.714602 1.477539 1.477167 1.281946 1.315248
##  [9] 1.943766 1.477770 1.001603 1.792081 1.263464 1.544191 1.624726 1.915867
## [17] 1.408941 1.073963 1.798901 1.843595 1.231767 1.901173 1.036204 1.763537
## [25] 1.084374 1.142768 1.171027 1.826148 1.693367 1.275600 1.032825 1.175974
## [33] 1.897936

Lab 2: Create a tibble in R for the following data set

Student Name Test 1 Score Test 2 Score Test 3 Score
Alice 80 90 85
Bob 70 75 80
Charlie 85 88 91
Denise 60 65 70

Code:

score <- tribble(
  ~`Student Name`, ~`Test 1 Score`, ~`Test 2 Score`, ~`Test 3 Score`,
  #---------------|---------------|----------------|----------------
  "Alice", 80, 90, 85,
  "Bob", 70, 75, 80,
  "Charlie", 85, 88, 91,
  "Denise", 60, 65, 70
  ) 

Lab 3: Tell for each table of Table 2-4, what symptoms of untidiness are there?


Answer: Table 2 combines multiple variables into a single column; the cases and population values should be separated into distinct columns. Similarly, Table 3 faces the similar issue and requires separate the values of rate column into two different varaibles – cases and population. In Table 4a, the variables 1999 and 2000 should be merged into a single column with clearly specified labels (e.g year) for the values. Table 4b has the same issue as Table 4a and requires the same adjustment.

Lab 4: Compute the rates per 10,000 in population for each year and country using table2 instead of table1. What is the difficulty?


Answer: A direct mutation of the data is challenging because multiple variables are stored in a single column. To enable easier further data transformation, the cases and population values need to be separated into two distinct columns.

Lab 5: Use pivot_longer() to tidy the following data set

Student Name Test 1 Score Test 2 Score Test 3 Score
Alice 80 90 85
Bob 70 75 80
Charlie 85 88 91
Denise 60 65 70

Code:

pivot_longer(score, c(`Test 1 Score`,`Test 2 Score`, `Test 3 Score`), 
             names_to = "test_label", values_to = "score") -> new_score
new_score
## # A tibble: 12 × 3
##    `Student Name` test_label   score
##    <chr>          <chr>        <dbl>
##  1 Alice          Test 1 Score    80
##  2 Alice          Test 2 Score    90
##  3 Alice          Test 3 Score    85
##  4 Bob            Test 1 Score    70
##  5 Bob            Test 2 Score    75
##  6 Bob            Test 3 Score    80
##  7 Charlie        Test 1 Score    85
##  8 Charlie        Test 2 Score    88
##  9 Charlie        Test 3 Score    91
## 10 Denise         Test 1 Score    60
## 11 Denise         Test 2 Score    65
## 12 Denise         Test 3 Score    70

Lab 6: Tidy the simple tibble below. Do you need to make it wider or longer?

Code:

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

pivot_longer(preg, c(`male`,`female`), names_to = "gender", 
             values_to = "preg_count") -> preg_new
preg_new
## # A tibble: 4 × 3
##   pregnant gender preg_count
##   <chr>    <chr>       <dbl>
## 1 yes      male           NA
## 2 yes      female         10
## 3 no       male           20
## 4 no       female         12

Part II: More Exercises

0. Do the data tidying of who data set step by step by yourself, and then make the following plot with your summary:

Code:

glimpse(who)
## Rows: 7,240
## Columns: 60
## $ country      <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan…
## $ iso2         <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF…
## $ iso3         <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "…
## $ year         <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 198…
## $ new_sp_m014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f014  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f65   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
who1 <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE) %>%
  print()
## # A tibble: 76,046 × 6
##    country     iso2  iso3   year key          cases
##    <chr>       <chr> <chr> <dbl> <chr>        <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36
## # ℹ 76,036 more rows
who2 <- who1 %>%
  mutate(key = str_replace(key, "mewrel", "new_rel")) %>%
  print()
## # A tibble: 76,046 × 6
##    country     iso2  iso3   year key          cases
##    <chr>       <chr> <chr> <dbl> <chr>        <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36
## # ℹ 76,036 more rows
who3 <- who2 %>%
  separate(key, c("new", "type", "genderage"), sep = "_") %>%
  print()
## # A tibble: 76,046 × 8
##    country     iso2  iso3   year new   type  genderage cases
##    <chr>       <chr> <chr> <dbl> <chr> <chr> <chr>     <dbl>
##  1 Afghanistan AF    AFG    1997 new   sp    m014          0
##  2 Afghanistan AF    AFG    1997 new   sp    m1524        10
##  3 Afghanistan AF    AFG    1997 new   sp    m2534         6
##  4 Afghanistan AF    AFG    1997 new   sp    m3544         3
##  5 Afghanistan AF    AFG    1997 new   sp    m4554         5
##  6 Afghanistan AF    AFG    1997 new   sp    m5564         2
##  7 Afghanistan AF    AFG    1997 new   sp    m65           0
##  8 Afghanistan AF    AFG    1997 new   sp    f014          5
##  9 Afghanistan AF    AFG    1997 new   sp    f1524        38
## 10 Afghanistan AF    AFG    1997 new   sp    f2534        36
## # ℹ 76,036 more rows
who4 <- who3 %>%
  select(-new, -iso2, -iso3) %>%
  print()
## # A tibble: 76,046 × 5
##    country      year type  genderage cases
##    <chr>       <dbl> <chr> <chr>     <dbl>
##  1 Afghanistan  1997 sp    m014          0
##  2 Afghanistan  1997 sp    m1524        10
##  3 Afghanistan  1997 sp    m2534         6
##  4 Afghanistan  1997 sp    m3544         3
##  5 Afghanistan  1997 sp    m4554         5
##  6 Afghanistan  1997 sp    m5564         2
##  7 Afghanistan  1997 sp    m65           0
##  8 Afghanistan  1997 sp    f014          5
##  9 Afghanistan  1997 sp    f1524        38
## 10 Afghanistan  1997 sp    f2534        36
## # ℹ 76,036 more rows
who5 <- who4 %>%
  separate(genderage, c("gender", "age"), sep = 1) %>%
  print()
## # A tibble: 76,046 × 6
##    country      year type  gender age   cases
##    <chr>       <dbl> <chr> <chr>  <chr> <dbl>
##  1 Afghanistan  1997 sp    m      014       0
##  2 Afghanistan  1997 sp    m      1524     10
##  3 Afghanistan  1997 sp    m      2534      6
##  4 Afghanistan  1997 sp    m      3544      3
##  5 Afghanistan  1997 sp    m      4554      5
##  6 Afghanistan  1997 sp    m      5564      2
##  7 Afghanistan  1997 sp    m      65        0
##  8 Afghanistan  1997 sp    f      014       5
##  9 Afghanistan  1997 sp    f      1524     38
## 10 Afghanistan  1997 sp    f      2534     36
## # ℹ 76,036 more rows

1. Plot the total number of TB cases in the world across years.

Code:

ggplot(who5) +
  stat_summary(aes(x = year, y = cases), fun = "sum", boundary = 1,
               bins = 17, geom = "bar", fill = "darkorange3") +
  labs(title = "World Tuberculosis Cases From 1997-2013",
       x = "Years",
       y = "Mean of cases per year") +
  scale_x_continuous(limits = c(1995, 2015), breaks = seq(1995, 2015, 5)) +
  scale_y_continuous(limits = c(0, 4.5e+06), breaks = seq(0, 5000000, 500000)) +
  theme(plot.title = element_text(color = "magenta4"),
        axis.title = element_text(color = "magenta3")) +
  my_theme

2. Find out which country has the highest male-to-female ratio of TB cases in 2010.

who5 %>%
  filter(year == "2010", cases != 0, !is.na(cases)) %>%
  group_by(country) %>%
  summarise(tb_ratio = cases[gender == "m"]/cases[gender == "f"]) %>%
  arrange(desc(tb_ratio)) %>%
  print()
## # A tibble: 2,934 × 2
## # Groups:   country [196]
##    country                tb_ratio
##    <chr>                     <dbl>
##  1 Bosnia and Herzegovina    154  
##  2 Azerbaijan                109. 
##  3 Hungary                    90  
##  4 Belarus                    65  
##  5 Armenia                    36  
##  6 Qatar                      32  
##  7 Azerbaijan                 29.8
##  8 Bahrain                    27  
##  9 Croatia                    24  
## 10 Netherlands                23  
## # ℹ 2,924 more rows