library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The output above illustrates that I have the necessary package and tools within the R Studio to continue.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
In order to have a tidy data set there are certain rules to follow:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
The following figure provides more clarity on these rules Accordingly, we now know by the presentation of these rules that table1 it tidy.
Now that we know what tidy data looks like, we can use different tools in the tidyverse that help organize tidy data. dplr and ggplot2 are two packages that are designed to do such tasks. For example, we can use table1 to work with the tools in the tidyverse:
#Compute rate per 10,000
table1 %>%
mutate(rate=cases/population*10000)
## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
This above example shows that you can add another variable into the data set. Mutate means that you can manipulate a set a variables, add or subtract, and the %>% sign means that it is a pipeline operation. Essentially, whatever comes after the %>% sign relates to the values before the %>%. For example, you can have sqrt(2) and the answer will be the same if you type 2 %>% sqrt.
#Compute cases per year
table1 %>%
count(year, wt=cases)
## # A tibble: 2 x 2
## year n
## <int> <int>
## 1 1999 250740
## 2 2000 296920
In this case, we are adding up all the values under the cases variable and listing them under the total years, which is only 1999 and 2000. and the count or hashtag will list 1 and 2. 1 for 1999 and 2 for 2000.
#Visualize changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), color = "grey50") +
geom_point(aes(color = country))
a.) In the first table, the tidy set, each variable, or column, supports numerical values in each row except for the nominal country descriptor variable. Also, the table is organized by 6 rows and 4 columns. This table is also the cleanest in retrospect to “tidiness”.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
b.) Table 2 represents a more categorical set compared to table 1. For instance, two columns represent nominal values, country and type, and the other two columns, year being ordinal, and count. The different categories within the type variable are matched with specific values directly adjacent to the right in the count column. The table is a 12 by 4 set meaning 12 rows and 4 columns.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
c.) Table 3 includes three variable columns that represent country, year, and rate. The interesting thing about this table is that the variable “rate” was included as an equation dividing case by population. Numbers in each rate row are shown to be a division argument by the forward slash. In other words, the rate column organizes both cases and population to fit a division equation within one variable.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
i.) Extract the number of TB cases per country per year. ii.) Extract the matching population per country per year. iii.) Divide cases by population, and multiply by 10000. iv.) Store back in the appropriate place Which representation is easiest to work with? Which is hardest? Why?
In order to divide cases by population, we need to separate the cases and population so that we can work with clarity. Also, when we separate the cases and population into two sections, then it will be easy to move forward with the example problem.
t2_cases <- filter(table2, type == "cases") %>%
rename(cases = count) %>%
arrange(country, year)
t2_population <- filter(table2, type == "population") %>%
rename(population = count) %>%
arrange(country, year)
t2_cases
t2_cases
## # A tibble: 6 x 4
## country year type cases
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 2000 cases 2666
## 3 Brazil 1999 cases 37737
## 4 Brazil 2000 cases 80488
## 5 China 1999 cases 212258
## 6 China 2000 cases 213766
t2_population
t2_population
## # A tibble: 6 x 4
## country year type population
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 population 19987071
## 2 Afghanistan 2000 population 20595360
## 3 Brazil 1999 population 172006362
## 4 Brazil 2000 population 174504898
## 5 China 1999 population 1272915272
## 6 China 2000 population 1280428583
The next R chunk below will show a recreation of table 2 so that the variables cases and population will be in the same table. We do this by using the tibble tool, which creates a new data frame. This shows that cases_per_pop has included the division equation into a new variable.
t2_cases_per_pop <- tibble(
year = t2_cases$year,
country = t2_cases$country,
cases = t2_cases$cases,
population = t2_population$population
) %>%
mutate(cases_per_pop = (cases/population) * 10000) %>%
select(country, year, cases_per_pop)
This next chunk essentially creates the new variable with its associated numbers within each row. Therefore, this step will then be combined with table2 so that we now have the new variable and new units implemented into a new data frame.
t2_cases_per_pop <- t2_cases_per_pop %>%
mutate(type = "cases_per_pop") %>%
rename(count = cases_per_pop)
Below shows the new variable and its values ready to be combined with table2.
t2_cases_per_pop
## # A tibble: 6 x 4
## country year count type
## <chr> <int> <dbl> <chr>
## 1 Afghanistan 1999 0.373 cases_per_pop
## 2 Afghanistan 2000 1.29 cases_per_pop
## 3 Brazil 1999 2.19 cases_per_pop
## 4 Brazil 2000 4.61 cases_per_pop
## 5 China 1999 1.67 cases_per_pop
## 6 China 2000 1.67 cases_per_pop
This next section combines table2 and t2_cases_per_pop so that previous work is now integrated into the entirety of a new dataset.
bind_rows(table2, t2_cases_per_pop) %>%
arrange(country, year, type, count)
## # A tibble: 18 x 4
## country year type count
## <chr> <int> <chr> <dbl>
## 1 Afghanistan 1999 cases 7.45e+2
## 2 Afghanistan 1999 cases_per_pop 3.73e-1
## 3 Afghanistan 1999 population 2.00e+7
## 4 Afghanistan 2000 cases 2.67e+3
## 5 Afghanistan 2000 cases_per_pop 1.29e+0
## 6 Afghanistan 2000 population 2.06e+7
## 7 Brazil 1999 cases 3.77e+4
## 8 Brazil 1999 cases_per_pop 2.19e+0
## 9 Brazil 1999 population 1.72e+8
## 10 Brazil 2000 cases 8.05e+4
## 11 Brazil 2000 cases_per_pop 4.61e+0
## 12 Brazil 2000 population 1.75e+8
## 13 China 1999 cases 2.12e+5
## 14 China 1999 cases_per_pop 1.67e+0
## 15 China 1999 population 1.27e+9
## 16 China 2000 cases 2.14e+5
## 17 China 2000 cases_per_pop 1.67e+0
## 18 China 2000 population 1.28e+9
I am renaming the final data frame so that it is easy to view.
t2_cases_and_pop_rate <- bind_rows(table2, t2_cases_per_pop) %>%
arrange(country, year, type, count)
Here is the final product.
t2_cases_and_pop_rate
## # A tibble: 18 x 4
## country year type count
## <chr> <int> <chr> <dbl>
## 1 Afghanistan 1999 cases 7.45e+2
## 2 Afghanistan 1999 cases_per_pop 3.73e-1
## 3 Afghanistan 1999 population 2.00e+7
## 4 Afghanistan 2000 cases 2.67e+3
## 5 Afghanistan 2000 cases_per_pop 1.29e+0
## 6 Afghanistan 2000 population 2.06e+7
## 7 Brazil 1999 cases 3.77e+4
## 8 Brazil 1999 cases_per_pop 2.19e+0
## 9 Brazil 1999 population 1.72e+8
## 10 Brazil 2000 cases 8.05e+4
## 11 Brazil 2000 cases_per_pop 4.61e+0
## 12 Brazil 2000 population 1.75e+8
## 13 China 1999 cases 2.12e+5
## 14 China 1999 cases_per_pop 1.67e+0
## 15 China 1999 population 1.27e+9
## 16 China 2000 cases 2.14e+5
## 17 China 2000 cases_per_pop 1.67e+0
## 18 China 2000 population 1.28e+9
The next R chunk will be the integration of both table4a and table4b shown to represent cases per population like the section before.
#We can label the integration of both tables as table4ab
table4ab <-
tibble(
country = table4a$country,
'1999' = table4a[["1999"]] / table4b[["1999"]] * 10000,
'2000' = table4a[["2000"]] / table4b[["2000"]] * 10000
)
table4ab
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 0.373 1.29
## 2 Brazil 2.19 4.61
## 3 China 1.67 1.67
In sum, I would say that the first set of examples devoid of table4a and table4b are particularly tedious to work with. My table4ab example was not as tedious but not necessarily less difficult. This was the first time I had to manipulate data like this in R so it was a bit of a learning curve.
table2 %>%
filter(type == "cases") %>%
ggplot(aes(year, count)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
scale_x_continuous(breaks = unique(table2$year)) +
ylab("cases")
There are reasons for untidiness. That is, most people are not familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data. Secondly, data is often organised to facilitate some use other than analysis. For example, data is often organized to make entry as easy as possible.
Most of the time, those who work with data will have to do some tidying. The first step is to figure out what the variables and observations are. The second step is to resolve one of two common problems: One variable might be spread across multiple columns, and one observation might be scattered across multiple rows. Typically a dataset will only suffer from one of these problems.
To fix these problems, you’ll need the two most important functions in tidyr: pivot_longer() and pivot_wider().
A common problem is a dataset where some of the column names are not names of variables, but values of a variable. Like table4a: the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
To tidy a dataset like this, we need to pivot the offending columns into a new pair of variable. To describe the operation we need three parameters: • The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000. • The name of the variable to move the column names to. Here it is year. • The name of the variable to move the column vales to. Here it’s cases.
Together those parameters generate the call to pivot_longer():
table4a %>%
pivot_longer(c('1999', '2000'), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.
We can use pivot_longer() to tidy table4b in a similar fashion:
table4b %>%
pivot_longer(c('1999','2000'), names_to = "year", values_to = "population")
## # A tibble: 6 x 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583
To combine the tidied versions of table4a and table4b into a single tibble, we need to use dplyr::left_join().
tidy4A <- table4a %>%
pivot_longer(c('1999', '2000'), names_to = "year", values_to = "cases")
tidy4B <- table4b %>%
pivot_longer(c('1999','2000'), names_to = "year", values_to = "population")
left_join(tidy4A, tidy4B)
## Joining, by = c("country", "year")
## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
This is definitely easier to accomplish than the first exercise I did.
pivot_wider() is the opposite of pivot_longer(). You use it when an observation is scattered across multiple rows. For example, take table2: an observation is a country in a year, but each observation is spread across two rows.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
We only need two parameters this time: • The column to take variable names from. Here, it’s type. • The column to take values form. Here it’s count. Then we can use pivot_wider(), as shown below:
table2 %>%
pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
pivot_longer() and pivot_wider() not perfectly symmetrical?Carefully consider the following example:
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks
## # A tibble: 4 x 3
## year half return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2016 1 0.92
## 4 2016 2 0.17
Below is pivot_wider()
stocks %>%
pivot_wider(names_from = year, values_from = return)
## # A tibble: 2 x 3
## half `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 0.92
## 2 2 0.59 0.17
We then use pivot_wider() followed by pivot_longer()
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
## # A tibble: 4 x 3
## half year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 1 2016 0.92
## 3 2 2015 0.59
## 4 2 2016 0.17
The tibble is in tidy form. However, the tibble is not quite the same as the original tibble. The difference is the change in double to character vector form in the year column/variable. In order to maintain the tibble and have the column in double vector form, we need to use the names_ptype = list(year = double()).
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return",
names_ptypes = list(year = double()))
## # A tibble: 4 x 3
## half year return
## <dbl> <dbl> <dbl>
## 1 1 2015 1.88
## 2 1 2016 0.92
## 3 2 2015 0.59
## 4 2 2016 0.17
The tibble now have the appropriate vectors.
table4a %>% pivot_longer(c(1999, 2000), names_to = “year”, values_to = “cases”) > Error in inds_combine(.vars, ind_list): Position must be between 0 and n
The reason why the code above fails is because we forgot that 1999 and 2000 are characters in the table and not integers. In other words, even though we might be fooled as to think they are integers (which they are in nature) when we want to refer to columns within a table we need to be careful whether the columns we wish to represent are integers or characters.
In order to change this we add ' to both sides of 1999 and 2000.
table4a %>%
pivot_longer(c('1999', '2000'), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
Now we can see that the code works.
people <- tribble(
~name, ~names, ~values,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
people
## # A tibble: 5 x 3
## name names values
## <chr> <chr> <dbl>
## 1 Phillip Woods age 45
## 2 Phillip Woods height 186
## 3 Phillip Woods age 50
## 4 Jessica Cordero age 37
## 5 Jessica Cordero height 156
The next intuitive move is to use the pivot_wider() tool to widen the table.
people %>%
pivot_wider(names_from = names, values_from = values)
## Warning: Values in `values` are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list(values = list)` to suppress this warning.
## * Use `values_fn = list(values = length)` to identify where the duplicates arise
## * Use `values_fn = list(values = summary_fun)` to summarise duplicates
## # A tibble: 2 x 3
## name age height
## <chr> <list<dbl>> <list<dbl>>
## 1 Phillip Woods [2] [1]
## 2 Jessica Cordero [1] [1]
This warning is important as it gives us the opportunity to see how to work with duplicated data. We can do that or re-create the table in a better fashion so no duplicates occur.
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
In this case we would want to make the table longer:
preg %>%
pivot_longer(c('male', 'female'), names_to = "sex", values_to = "count")
## # A tibble: 4 x 3
## pregnant sex count
## <chr> <chr> <dbl>
## 1 yes male NA
## 2 yes female 10
## 3 no male 20
## 4 no female 12
So far we’ve learned how to tidy table2 and both table4a and table4b, but not table3. table3 has a different problem: we have one column (rate) that contains two variables (cases and population). To fix this problem, we’ll need the separate() function. You’ll also learn about the complement of separate(): unite(), which you use if a single variable is spread across multiple columns.
separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3:
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The rate column contains both cases and population variable, and we need to split it into two variables. separate() the name of the column to separate, and the names of the columns tot separate into, as shown below:
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
By default, this will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter). For example, in the code above, separate() split the values of rate at the forward slash characters. If you wish to sue a specific character to separate a column, you can pass the character to the sep argument of separate(). For example, we could rewrite the code above as:
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You’ll notice that cases and population are character columns. This is the default behavior in separate(): it leaves the type of the column as is. Here, however, it’s not very useful as those really are numbers. We can ask separate() to try and convert to better types using convert = TRUE:
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
I can also separate some of the integers within a column if I use sep within the separate() function:
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 x 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite() is the inverse of separate(): it combines multiple columns into a single column. We can use unite() to rejoin the century and year columns that we created in the last example. That data is saved as tidyr::table5. unite() takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in dplyr::select() style:
table5 %>%
unite(new, century, year)
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
We do not really want the underscore (_) in the data so what we can do is unite the data but separate the empty space.
table5 %>%
unite(new, century, year, sep = "")
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
If you run both sets of code, then you receive an error saying that there are missing pieces. Interestingly, separate(), by default, drops extra values with a warning. Adding the argument, extra = "drop", produces the same result as above but without the error or warning.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "drop")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
Then, if you do extra = “merge”, the g is added with the f.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "merge")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f,g
## 3 h i j
In the next example, “d,e” has too few elements.
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
By default the empty space is filled with NA.
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "right")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
The same will happen if you type in “left” but the missing value will appear on the left side of the table.
Essentially, the remove argument does away with the imputed columns in the result data frame, and the FALSE means that you want to create a new variable but keep the old one.
Separate() separates types of values within a column and are very important in terms of tidying up data. Some examples can be seen above in earlier sections. Below are examples:
#Separators
tibble(x = c("A_1", "A_2", "A_3")) %>%
separate(x, c("Category", "type"))
## # A tibble: 3 x 2
## Category type
## <chr> <chr>
## 1 A 1
## 2 A 2
## 3 A 3
#Position
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
separate(x, c("variable", "into"), sep = c(1))
## # A tibble: 4 x 2
## variable into
## <chr> <chr>
## 1 X 1
## 2 X 2
## 3 Y 1
## 4 Y 2
The function extract() uses a regular expression to specify groups in a character vector and split that single character vector into multiple columns. Generally, this is more flexible than separate() as you can create parameters that extract the varying values shown within a vector.
Examples:
#Separators
tibble(x = c("A_1","C_3","F_5","T_4", "X_8", "Y_9")) %>%
extract(x,c("Category", "Value"), regex = "([A-Z])_([0-9])")
## # A tibble: 6 x 2
## Category Value
## <chr> <chr>
## 1 A 1
## 2 C 3
## 3 F 5
## 4 T 4
## 5 X 8
## 6 Y 9
#Position
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
extract(x, c("variable", "id"), regex = "([A-Z])([0-9])")
## # A tibble: 4 x 2
## variable id
## <chr> <chr>
## 1 X 1
## 2 X 2
## 3 Y 1
## 4 Y 2
#Extract specific
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
extract(x, c("variable", "id"), regex = "([A-Z]+)([0-9]+)")
## # A tibble: 4 x 2
## variable id
## <chr> <chr>
## 1 X 1
## 2 X 20
## 3 AA 11
## 4 AA 2
Unite() as opposed to separate() and extract() converts many columns to one. Thanks to the book, “Both separate() and extract() convert a single column to many columns. However, unite() converts many columns to one, with a choice of a separator to include between column values.”
Values can be missing in two ways: • Explicitly, i.e. flagged with NA. • Implicitly, i.e. simply not present in the data. For example:
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
The NA is explicit and there should be a 1st qtr for 2016 which is explicitly missing. One way to make the implicit value explicit is to order or tidy the data by variable “year” so that NA appears.
stocks %>%
pivot_wider(names_from = year, values_from = return)
## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 NA
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66
Now we have two NA. If those data points are not important then we can go ahead and make the values implicit. We do this by adding values_drop_na = TRUE in pivot_longer():
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
## # A tibble: 6 x 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 2 2016 0.92
## 4 3 2015 0.35
## 5 3 2016 0.17
## 6 4 2016 2.66
Another important tool for making missing values explicit in tidy data is complete():
stocks %>%
complete(year, qtr)
## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
We can also use fill() to carry forward the previous in this case person.
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
treatment
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 Katherine Burke 1 4
treatment %>%
fill(person)
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4
Both spread() and complete() replace implicit and explicit missing values. In spread(), the fill argument *explicitly* sets the value to replace NAs. The fill argument in complete() allows for different values of different variables.
The direction determines whether the missing values should be replaced by values from above or below or "up" and "down".
The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.
who
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 Afghan… AF AFG 1980 NA NA NA NA
## 2 Afghan… AF AFG 1981 NA NA NA NA
## 3 Afghan… AF AFG 1982 NA NA NA NA
## 4 Afghan… AF AFG 1983 NA NA NA NA
## 5 Afghan… AF AFG 1984 NA NA NA NA
## 6 Afghan… AF AFG 1985 NA NA NA NA
## 7 Afghan… AF AFG 1986 NA NA NA NA
## 8 Afghan… AF AFG 1987 NA NA NA NA
## 9 Afghan… AF AFG 1988 NA NA NA NA
## 10 Afghan… AF AFG 1989 NA NA NA NA
## # … with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
## # new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>,
## # new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
## # new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## # new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
## # new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
## # new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
## # new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
## # new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
## # new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>,
## # new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>,
## # new_ep_f014 <int>, new_ep_f1524 <int>, new_ep_f2534 <int>,
## # new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
## # new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>,
## # newrel_m2534 <int>, newrel_m3544 <int>, newrel_m4554 <int>,
## # newrel_m5564 <int>, newrel_m65 <int>, newrel_f014 <int>,
## # newrel_f1524 <int>, newrel_f2534 <int>, newrel_f3544 <int>,
## # newrel_f4554 <int>, newrel_f5564 <int>, newrel_f65 <int>
This dataset is really messy so it is important that I use my knowledge of tidyr and later dplyr to help make this dataset clear.
Country, iso2, and iso3 are three variables that repeatedly label the country. Year is also a variable.
new_sp_m014, new_ep_m014, new_ep_f014 are likely to be values, not variables.
“We need to gather together all the columns from new_sp_m014 to newrel_f65. We don’t know what those values represent yet, so we’ll give them the generic name”key“. We know the cells represent the count of cases, so we’ll use the variable cases. There are a lot of missing values in the current representation, so for now we’ll use na.rm just so we can focus on the values that are present.”
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
## # A tibble: 76,046 x 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # … with 76,036 more rows
We can get some hint of the structure of the values in the new key column by counting them:
who1 %>%
count(key)
## # A tibble: 56 x 2
## key n
## <chr> <int>
## 1 new_ep_f014 1032
## 2 new_ep_f1524 1021
## 3 new_ep_f2534 1021
## 4 new_ep_f3544 1021
## 5 new_ep_f4554 1017
## 6 new_ep_f5564 1017
## 7 new_ep_f65 1014
## 8 new_ep_m014 1038
## 9 new_ep_m1524 1026
## 10 new_ep_m2534 1020
## # … with 46 more rows
The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two letters describe the type of TB:
rel stands for cases of relapse ep stands for cases of extrapulmonary TB sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative) sp stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive) The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
The remaining numbers gives the age group. The dataset groups cases into seven age groups:
014 = 0 – 14 years old 1524 = 15 – 24 years old 2534 = 25 – 34 years old 3544 = 35 – 44 years old 4554 = 45 – 54 years old 5564 = 55 – 64 years old 65 = 65 or older The names are slightly inconsistent because instead of new_rel we have newrel. The basic idea is pretty simple: replace the characters “newrel” with “new_rel”. This makes all variable names consistent.
who2 <- who1 %>%
mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
who2
## # A tibble: 76,046 x 7
## country iso2 iso3 year key cases names_from
## <chr> <chr> <chr> <int> <chr> <int> <chr>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0 new_sp_m014
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10 new_sp_m1524
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6 new_sp_m2534
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3 new_sp_m3544
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5 new_sp_m4554
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2 new_sp_m5564
## 7 Afghanistan AF AFG 1997 new_sp_m65 0 new_sp_m65
## 8 Afghanistan AF AFG 1997 new_sp_f014 5 new_sp_f014
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38 new_sp_f1524
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36 new_sp_f2534
## # … with 76,036 more rows
We can separate the values in each code with two passes of separate(). The first pass will split the codes at each underscore.
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].
who3
## # A tibble: 76,046 x 9
## country iso2 iso3 year new type sexage cases names_from
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int> <chr>
## 1 Afghanistan AF AFG 1997 new sp m014 0 new_sp_m014
## 2 Afghanistan AF AFG 1997 new sp m1524 10 new_sp_m1524
## 3 Afghanistan AF AFG 1997 new sp m2534 6 new_sp_m2534
## 4 Afghanistan AF AFG 1997 new sp m3544 3 new_sp_m3544
## 5 Afghanistan AF AFG 1997 new sp m4554 5 new_sp_m4554
## 6 Afghanistan AF AFG 1997 new sp m5564 2 new_sp_m5564
## 7 Afghanistan AF AFG 1997 new sp m65 0 new_sp_m65
## 8 Afghanistan AF AFG 1997 new sp f014 5 new_sp_f014
## 9 Afghanistan AF AFG 1997 new sp f1524 38 new_sp_f1524
## 10 Afghanistan AF AFG 1997 new sp f2534 36 new_sp_f2534
## # … with 76,036 more rows
Then we drop the “new” column because it’s constant in this dataset. We also drop iso2 and iso3 since they’re redundant.
who3 %>%
count(new)
## # A tibble: 2 x 2
## new n
## <chr> <int>
## 1 new 73466
## 2 newrel 2580
who4 <- who3 %>%
select(-new, -iso2, -iso3)
who4
## # A tibble: 76,046 x 6
## country year type sexage cases names_from
## <chr> <int> <chr> <chr> <int> <chr>
## 1 Afghanistan 1997 sp m014 0 new_sp_m014
## 2 Afghanistan 1997 sp m1524 10 new_sp_m1524
## 3 Afghanistan 1997 sp m2534 6 new_sp_m2534
## 4 Afghanistan 1997 sp m3544 3 new_sp_m3544
## 5 Afghanistan 1997 sp m4554 5 new_sp_m4554
## 6 Afghanistan 1997 sp m5564 2 new_sp_m5564
## 7 Afghanistan 1997 sp m65 0 new_sp_m65
## 8 Afghanistan 1997 sp f014 5 new_sp_f014
## 9 Afghanistan 1997 sp f1524 38 new_sp_f1524
## 10 Afghanistan 1997 sp f2534 36 new_sp_f2534
## # … with 76,036 more rows
Next we’ll separate sexage into sex and age by splitting after the first character:
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
who5
## # A tibble: 76,046 x 7
## country year type sex age cases names_from
## <chr> <int> <chr> <chr> <chr> <int> <chr>
## 1 Afghanistan 1997 sp m 014 0 new_sp_m014
## 2 Afghanistan 1997 sp m 1524 10 new_sp_m1524
## 3 Afghanistan 1997 sp m 2534 6 new_sp_m2534
## 4 Afghanistan 1997 sp m 3544 3 new_sp_m3544
## 5 Afghanistan 1997 sp m 4554 5 new_sp_m4554
## 6 Afghanistan 1997 sp m 5564 2 new_sp_m5564
## 7 Afghanistan 1997 sp m 65 0 new_sp_m65
## 8 Afghanistan 1997 sp f 014 5 new_sp_f014
## 9 Afghanistan 1997 sp f 1524 38 new_sp_f1524
## 10 Afghanistan 1997 sp f 2534 36 new_sp_f2534
## # … with 76,036 more rows
The who dataset is now tidy! Normally you would build up a really complex pipe:
who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
) %>%
mutate(
key = stringr::str_replace(key, "newrel", "new_rel")
) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 6
## country year var sex age cases
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1997 sp m 1524 10
## 3 Afghanistan 1997 sp m 2534 6
## 4 Afghanistan 1997 sp m 3544 3
## 5 Afghanistan 1997 sp m 4554 5
## 6 Afghanistan 1997 sp m 5564 2
## 7 Afghanistan 1997 sp m 65 0
## 8 Afghanistan 1997 sp f 014 5
## 9 Afghanistan 1997 sp f 1524 38
## 10 Afghanistan 1997 sp f 2534 36
## # … with 76,036 more rows
One thing that is really important is that we want to know whether who did not have any cases are that they did not know or have any data on them for another reason. IF there are no 0 values, then that means there is no data or no cases in the who database. If there are both explicit and implicit missing values, then it suggests that missing values are being used in different ways. NA would mean no cases and implicit would mean no data on the number of cases, presumably.
We can also check to see if there are any zeros by doing this:
who1 %>%
filter(cases == 0) %>%
nrow()
## [1] 11080
Cases of zero TB are explicitly indicated with NA. Next, it is important to check and see whether values for a country or year are missing or whether only some columns are missing.
gather(who, new_sp_m014:newrel_f65, key = "key", value = "cases") %>%
group_by(country,year) %>%
mutate(prop_missing = sum(is.na(cases)) / n()) %>%
filter(prop_missing > 0, prop_missing < 1)
## # A tibble: 195,104 x 7
## # Groups: country, year [3,484]
## country iso2 iso3 year key cases prop_missing
## <chr> <chr> <chr> <int> <chr> <int> <dbl>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0 0.75
## 2 Afghanistan AF AFG 1998 new_sp_m014 30 0.75
## 3 Afghanistan AF AFG 1999 new_sp_m014 8 0.75
## 4 Afghanistan AF AFG 2000 new_sp_m014 52 0.75
## 5 Afghanistan AF AFG 2001 new_sp_m014 129 0.75
## 6 Afghanistan AF AFG 2002 new_sp_m014 90 0.75
## 7 Afghanistan AF AFG 2003 new_sp_m014 127 0.75
## 8 Afghanistan AF AFG 2004 new_sp_m014 139 0.75
## 9 Afghanistan AF AFG 2005 new_sp_m014 151 0.75
## 10 Afghanistan AF AFG 2006 new_sp_m014 193 0.75
## # … with 195,094 more rows
If you neglect this then the separate() function emits the warning "too few values". If we check the rows for keys beginning with "newrel_", we see that sexage is missing, and type = m014.
who3a <- who1 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].
filter(who3a, new == "newrel") %>% head()
## # A tibble: 6 x 8
## country iso2 iso3 year new type sexage cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 2013 newrel m014 <NA> 1705
## 2 Afghanistan AF AFG 2013 newrel f014 <NA> 1749
## 3 Albania AL ALB 2013 newrel m014 <NA> 14
## 4 Albania AL ALB 2013 newrel m1524 <NA> 60
## 5 Albania AL ALB 2013 newrel m2534 <NA> 61
## 6 Albania AL ALB 2013 newrel m3544 <NA> 32
iso2 and iso3 are redundant and should be one combination of the two values. The distinct tool helps with this process:
select(who3, country, iso2, iso3) %>%
distinct() %>%
group_by(country) %>%
filter(n() > 1)
## # A tibble: 0 x 3
## # Groups: country [0]
## # … with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>
In other words, both iso2 and iso3 contain the 2 and 3 letter abbreviations for the country.
who5 %>%
group_by(country, year, sex) %>%
filter(year > 1995) %>%
summarise(cases = sum(cases)) %>%
unite(country_sex, country, sex, remove = FALSE) %>%
ggplot(aes(x = year, y = cases, group = country_sex, color = sex)) + geom_line()