COTTRILL_SD322_Week3

MIDN Rylie Cottrill SD322 Week 3 Assignment

Level C.

12.2.1 Problems

In table1, each row represents a single observation (a country-year combination) and each column represents a single variable. In table2, each row represents a country-year-variable combination: the values of “cases” and “population” are stored in a single column called “count”, and another column, “type”, indicates which variable each value belongs to. Because of this, each observation is spread across two rows. In table3, each row still represents a country-year combination, but cases and population are combined into a single “rate” column, stored as text of the form cases/population. In table4a, each row represents a country and each year column holds the number of cases for that year. table4b has the same format but contains population values instead of case counts.
library(tidyverse)

table2 %>%
  pivot_wider(names_from = "type", values_from = "count") %>%
  mutate(rate = cases / population * 10000)

table4a %>%
  pivot_longer(-country, names_to = "year", values_to = "cases") %>%
  left_join(
    table4b %>%
      pivot_longer(-country, names_to = "year", values_to = "population"),
    by = c("country", "year")
  ) %>%
  mutate(rate = cases / population * 10000)

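table2 is easier to work with because a single pivot_wider() call puts cases and population in their own columns. table4a and table4b are harder because the variables are split across two tables and the years are stored as column names, so each table has to be pivoted longer and the two results then joined.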

First, I need to separate cases and population into different columns so table2 has the same structure as table1.

table2_tidy <- table2 %>%
  pivot_wider(names_from = "type", values_from = "count")

Then I used the code given in 12.2 to recreate the plot.

library(ggplot2)

ggplot(table2_tidy, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country))

12.4.3 Problems

In separate(), “extra” determines how excess values are handled when there are more pieces than expected (“warn”, the default, “drop”, or “merge”), while “fill” determines how missing values are filled when there are fewer pieces than expected (“warn”, the default, “right”, or “left”).
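
A minimal sketch of the two arguments, using a made-up tibble (the data and column names are my own illustration, not from the book):

```r
library(tidyr)
library(tibble)

# Hypothetical data: row 2 has one piece too many, row 3 has one too few.
df <- tibble(x = c("a,b,c", "d,e,f,g", "h,i"))

# extra = "merge" keeps the surplus piece inside the last column;
# fill = "left" pads the short row with NA on the left instead of the right.
res <- separate(df, x, c("one", "two", "three"),
                extra = "merge", fill = "left")
res
```

With the defaults, the same call would warn about both rows, drop the extra piece, and pad on the right.
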
The “remove” argument controls whether the original columns are dropped after unite() or separate(). Setting it to FALSE keeps the original variables, which would be helpful if you wanted to preserve the raw data for comparison or further analysis later.
separate() is used when a column can be split in a predictable way, either at a separator or at fixed character positions. extract() is used when the pieces have to be identified with regular-expression capture groups. There are three ways to split (by separator, by position, and by regex groups) but only one unite(), because a column can be split in many ways, while combining several columns into one only requires choosing a separator.
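
To compare the splitting variants side by side, here is a small sketch on a made-up column of sex-age codes (the tibble is an assumption for illustration only):

```r
library(tidyr)
library(tibble)

# Hypothetical codes in the style of the who column names.
df <- tibble(code = c("m014", "f1524", "f65"))

# separate() by position: split after the first character.
by_pos <- separate(df, code, c("sex", "age"), sep = 1)

# extract() by regex: each capture group becomes one new column.
by_regex <- tidyr::extract(df, code, c("sex", "age"),
                           regex = "^([mf])(\\d+)$")

# remove = FALSE keeps the original column next to the new ones.
kept <- separate(df, code, c("sex", "age"), sep = 1, remove = FALSE)

# unite() reverses the split; the only real choice is the separator.
united <- unite(by_pos, "code", sex, age, sep = "")
```

Both splits give the same sex and age columns here, and unite() recovers the original codes.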

12.5.1 Problems

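In pivot_wider(), the argument is called values_fill; it provides a value for the missing cells created when the data is widened. In complete(), which generates every combination of the selected columns and adds rows for the combinations absent from the data, “fill” takes a named list that gives, for each column, the value to use in place of NA. Neither argument carries values over from neighboring rows; that is the job of the fill() function.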
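
A short sketch of both arguments on a made-up tibble where one year-quarter combination is missing (the data are hypothetical):

```r
library(tidyr)
library(tibble)

# Hypothetical returns: there is no row for 2016, quarter 2.
df <- tibble(
  year   = c(2015, 2015, 2016),
  qtr    = c(1, 2, 1),
  return = c(1.88, 0.59, 0.92)
)

# pivot_wider(): values_fill supplies the value for cells with no source row.
wide <- pivot_wider(df, names_from = qtr, values_from = return,
                    values_fill = 0)

# complete(): fill is a named list with one replacement value per column.
full <- complete(df, year, qtr, fill = list(return = 0))
```

In both results the missing 2016 quarter 2 entry shows up as 0 rather than NA.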
The “.direction” argument in fill() controls whether missing values are filled downward, upward, or in both directions. With “down” (the default), a missing value is replaced by the last observed value above it; with “up”, by the next observed value below it. “downup” fills down first and then up, and “updown” fills up first and then down.
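
The four options can be compared on a small made-up column (the argument is spelled .direction in tidyr; the tibble is my own example):

```r
library(tidyr)
library(tibble)

# Hypothetical column with gaps at the start, middle, and end.
df <- tibble(id = 1:5, x = c(NA, "a", NA, "b", NA))

down   <- fill(df, x, .direction = "down")$x    # NA  "a" "a" "b" "b"
up     <- fill(df, x, .direction = "up")$x      # "a" "a" "b" "b" NA
downup <- fill(df, x, .direction = "downup")$x  # "a" "a" "a" "b" "b"
updown <- fill(df, x, .direction = "updown")$x  # "a" "a" "b" "b" "b"
```

The one-directional options leave a gap at whichever end has no neighbor to copy from; the two-directional options close it.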

12.6.1 Problems

Using values_drop_na = TRUE is reasonable when the goal is to check that the data transformation happened correctly, but missing values in this dataset represent unknown counts, not zero cases, so NAs should be preserved for a more accurate analysis. NA means the value is unknown or unreported, while 0 means the count is known to be zero.
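
A tiny base R illustration of why the distinction matters, using made-up counts rather than the who data:

```r
# Hypothetical counts for three country-years; the middle one is unreported.
cases_na   <- c(10, NA, 20)
cases_zero <- c(10, 0, 20)

mean_known <- mean(cases_na, na.rm = TRUE)  # 15: averages the two known values
mean_zero  <- mean(cases_zero)              # 10: treats the gap as a true zero
```

Recoding NA as 0 would pull the average down even though nothing was actually observed for that year.
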
If we neglect the mutate() step, the keys will be inconsistent and separate() will split them incorrectly. For example, newrel_f65 should split into “new”, TB type, and sex-age, but it has only one underscore instead of the expected two, so separate() produces only two pieces: “newrel” lands in the first column, “f65” lands in the type column, and the sex-age column is filled with NA (with a warning).
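
The effect can be seen on just two keys. This is a sketch on a made-up two-row tibble, not the full who pipeline:

```r
library(tidyr)
library(tibble)

# Two keys in the style of the who column names.
keys <- tibble(key = c("new_sp_m014", "newrel_f65"))

# Without the fix: the second key has only two pieces, so sexage becomes NA
# (the warning is suppressed here only to keep the output quiet).
raw <- suppressWarnings(separate(keys, key, c("new", "var", "sexage")))

# With the fix: insert the missing underscore first, then split.
fixed_keys <- tibble(key = stringr::str_replace(keys$key, "newrel", "new_rel"))
fixed <- separate(fixed_keys, key, c("new", "var", "sexage"))
```

In raw, the second row has var equal to "f65" and a missing sexage; in fixed, it has var equal to "rel" and sexage equal to "f65".
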
iso2 and iso3 are redundant with country because each country name corresponds to exactly one iso code and each iso code corresponds to exactly one country.
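
One way to check this claim, assuming the who dataset that ships with tidyr, is to look for any country paired with more than one combination of codes:

```r
library(dplyr)
library(tidyr)  # provides the who dataset

# Countries that appear with more than one distinct (iso2, iso3) pair.
# Zero rows means the codes carry no information beyond country.
dupes <- who %>%
  distinct(country, iso2, iso3) %>%
  group_by(country) %>%
  filter(n() > 1)

nrow(dupes)
```
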
I modified the code given in the case study to total the cases for each country, year, and sex.

tb_totals <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  ) %>%
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "var", "sexage")) %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1) %>%
  group_by(country, year, sex) %>%
  summarize(total_cases = sum(cases, na.rm = TRUE))

I modified the plotting code from the 12.2.1 problem set.

ggplot(tb_totals, aes(year, total_cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = sex)) +
  facet_wrap(~ sex)

I referenced https://ggplot2.tidyverse.org/reference/facet_wrap.html and used facet_wrap() to split the plot into one panel per sex, which makes it more readable.