When data in is wide format, a subjects responses will be in a single
row, and each response is in separate columns. However R prefers long
format. We could melt and cast with
reshape2 to reshape from wide to long format, but is there
a way to reshape using even less code? Luckily for us, Hadley
Wickham has created the easy to use tidyr!
tidyr allow us to quickly and easily tidy and reorganize
our data for all sorts of analyses. This is particularly helpful with a
disorganized dataset. tidyr is built for this function.
Specifically, tidyr can only be used with exisiting
dataframes, and cannot aggregate.
In this chapter, we will go over the hallmark functions of
tidyr: gather(), separate(),
unite(), and spread().
First let’s install and call up the tidyr package. We
will also need to use the dplyr package.
#install.packages("tidyr") # I have used "#" to "comment out" this line for this tutorial. Just take away the first "#" and you are good to go!
#install.packages("dplyr")
library(tidyr)
library(dplyr)
Why do we need dplyr? dplyr is a grammar of
data manipulation. We need dplyr to use the pipe operator,
%>%, in our code. %>% is not required to
use tidyr, but it does make things easier!
%>% allows you to pipe a value forward into an
expression or to function call; such that x %>% f,
instead of f(x). This short hand was created by Stefan
Milton Bache with the magrittr package. To read more about
this function, click here
Here I have created a messy wide dataset. Feel free to use it to follow along!
In this example study, participants were asked to categorize three faces by clicking various buttons that represent three different categories. The time it took to click a button is in milliseconds.
n=10
wide <- data.frame(
ID = c(1:n),
Face.1 = c(411,723,325,456,579,612,709,513,527,379),
Face.2 = c(123,300,400,500,600,654,789,906,413,567),
Face.3 = c(1457,1000,569,896,956,2345,780,599,1023,678)
)
This dataset I created is messy; As you can see below, only ID is in a column, Response time split between three columns, such that responses are in both rows and columns (by ID and Face.1, Face.2, and Face.3).
What we want instead is one column for the condition (Face.1, Face.2, or Face.3) responses and a column for response time, with each row being a singular observation for each participant. Participant IDs should repeat as this is a within subject design (each participant saw each face).
## ID Face.1 Face.2 Face.3
## 1 1 411 123 1457
## 2 2 723 300 1000
## 3 3 325 400 569
## 4 4 456 500 896
## 5 5 579 600 956
## 6 6 612 654 2345
## 7 7 709 789 780
## 8 8 513 906 599
## 9 9 527 413 1023
## 10 10 379 567 678
By using the gather() function, we can transform the
data from wide to long Here is the generic code for
gather():
#gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
Whoa! What does this all mean? Let’s find out more about the
arguments of gather():
data: Your data frame.
key, value: The unquoted new names of key and value
columns to create in the output. The key will become the name of the
condition/IV column, and value will become the name of the response/DV
column.
...: The columns to gather. Use the exisiting
variable names. Select a range of variables with : (e.g. if
you have variables a, b, c, and d, and want to select all of these
varibeles you will indicate this with a:d). If you want to exclude a
variable, use - (e.g. exclude y with -y).
na.rm: If you indicate that na.rm=TRUE, this will
remove rows from the output where the value is missing.
convert: If TRUE this will automatically convert the
key column to a logical, integer, numeric, complex, or factor as
appropriate. This is useful if the column names are actually numeric,
integer, or logical.
factor_key: If FALSE, the default, the key values
will be stored as a character vector. If TRUE, will be stored as a
factor, which preserves the original ordering of the columns.
Now that we have a better understanding of the arguements, lets make
our data set long using gather()!
long <- wide %>% gather(Face, ResponseTime, Face.1:Face.3)
## ID Face ResponseTime
## 1 1 Face.1 411
## 2 2 Face.1 723
## 3 3 Face.1 325
## 4 4 Face.1 456
## 5 5 Face.1 579
## 6 6 Face.1 612
## 7 7 Face.1 709
## 8 8 Face.1 513
## 9 9 Face.1 527
## 10 10 Face.1 379
## 11 1 Face.2 123
## 12 2 Face.2 300
## 13 3 Face.2 400
## 14 4 Face.2 500
## 15 5 Face.2 600
## 16 6 Face.2 654
## 17 7 Face.2 789
## 18 8 Face.2 906
## 19 9 Face.2 413
## 20 10 Face.2 567
## 21 1 Face.3 1457
## 22 2 Face.3 1000
## 23 3 Face.3 569
## 24 4 Face.3 896
## 25 5 Face.3 956
## 26 6 Face.3 2345
## 27 7 Face.3 780
## 28 8 Face.3 599
## 29 9 Face.3 1023
## 30 10 Face.3 678
As you can see, now we have two columns: One for the the Faces, and one for response time. Each participant saw each face, so ID repeats three times.
Although the long dataset we created using gather() is
acceptable for use, we can break down the face variable even further
using separate().
Here is the generic code for separate():
#separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn")
What are the arugments unique to separate()?
col: Unquoted name of the column to be
separated.
into: Names for the new variables that you are
separating out from the column.
sep: Separator between columns. If the seprator is a
character, it is interpreted as a regular expression. The default value
is a regular expression that matches any sequence of non-alphanumeric
values. In the example, each face is indicated by a number that follows
a period (.). I do not need speifcy this because this
exists in each level of Face. If numeric, it is interpreted as the
position to split at. Positive values start at 1 at the far-left of the
string; negative value start at -1 at the far-right of the string. The
length of sep should be one less than
into.
remove: If this is TRUE, it removes the input column
from output data frame.
extra: If sep is a character vector
(like .), this controls what happens when there are too
many pieces (e.g. Face.1.A, rather than Face.1). There are three valid
options: “warn” (the default): emit a warning and drop extra values.
“drop”: drop any extra values without a warning. “merge”: only splits at
most length(into) times
fill: If sep is a character vector
(like .), this controls what happens when there are not
enough pieces (e.g. Face1, rather than Face.1). There are three valid
options: “warn” (the default): emit a warning and fill from the right
“right”: fill with missing values on the right “left”: fill with missing
values on the left
Each face is indicated by number after a period. This variable
annotation allows us to separate the face variable into two. By using
the separate function of tidyr we can tease
apart single variables which sometimes capture multiple variables (or
sometimes redundant information).
In this case, I want to split the Face from the number attached to it, which in this example represents the race of the face.
long_separate <- long %>% separate(Face, c("Target", "Number"))
## ID Target Number ResponseTime
## 1 1 Face 1 411
## 2 2 Face 1 723
## 3 3 Face 1 325
## 4 4 Face 1 456
## 5 5 Face 1 579
## 6 6 Face 1 612
## 7 7 Face 1 709
## 8 8 Face 1 513
## 9 9 Face 1 527
## 10 10 Face 1 379
## 11 1 Face 2 123
## 12 2 Face 2 300
## 13 3 Face 2 400
## 14 4 Face 2 500
## 15 5 Face 2 600
## 16 6 Face 2 654
## 17 7 Face 2 789
## 18 8 Face 2 906
## 19 9 Face 2 413
## 20 10 Face 2 567
## 21 1 Face 3 1457
## 22 2 Face 3 1000
## 23 3 Face 3 569
## 24 4 Face 3 896
## 25 5 Face 3 956
## 26 6 Face 3 2345
## 27 7 Face 3 780
## 28 8 Face 3 599
## 29 9 Face 3 1023
## 30 10 Face 3 678
Now, We have two columns, one for Target, the values of which are all “Face”, and one for Number, which indicates which of the three faces it is.
To undo separate(), we can use unite(),
which merges two variables into one.
Here is the generic code for unite():
#unite(data, col, ..., sep = ".", remove = TRUE)
Here are the arugments unique to unite():
sep: In the code for unite() the
sep indicated the separator we choose to to use to bind
values. In this case, we are using .long_unite <- long_separate %>% unite(Face, Target, Number, sep = ".")
## ID Face ResponseTime
## 1 1 Face.1 411
## 2 2 Face.1 723
## 3 3 Face.1 325
## 4 4 Face.1 456
## 5 5 Face.1 579
## 6 6 Face.1 612
## 7 7 Face.1 709
## 8 8 Face.1 513
## 9 9 Face.1 527
## 10 10 Face.1 379
## 11 1 Face.2 123
## 12 2 Face.2 300
## 13 3 Face.2 400
## 14 4 Face.2 500
## 15 5 Face.2 600
## 16 6 Face.2 654
## 17 7 Face.2 789
## 18 8 Face.2 906
## 19 9 Face.2 413
## 20 10 Face.2 567
## 21 1 Face.3 1457
## 22 2 Face.3 1000
## 23 3 Face.3 569
## 24 4 Face.3 896
## 25 5 Face.3 956
## 26 6 Face.3 2345
## 27 7 Face.3 780
## 28 8 Face.3 599
## 29 9 Face.3 1023
## 30 10 Face.3 678
As you can see the data now looks like it did when we first
transfromed from wide to long using gather()!
Finally, we will transform the data from long back to wide with the
spread() function.
Here is the generic code for spread()
#spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
The arugments unique to spread():
key: The unquoted name of the column whose values
will be used as column headings.
value: The unquoted name of the column whose values
will populate the cells.
fill: If used, missing values will be replaced with
this value. There are two types of missing in the input: explicit
missing values (i.e. NA), and implicit missings, rows that
simply aren’t present. Both types of missing value will be replaced by
fill.
convert: If TRUE, this will
automatically convert the new columns to a logical, integer, numeric,
complex, or factor as appropriate.
drop: If FALSE, will keep factor levels
that don’t appear in the data, filling in missing combinations with
fill.
sep: If NULL, the column names will be
taken from the values of key variable. If non-NULL, the
column names will be created by stringing together the name, separator,
and value.
back_to_wide <- long_unite %>% spread(Face, ResponseTime)
## ID Face.1 Face.2 Face.3
## 1 1 411 123 1457
## 2 2 723 300 1000
## 3 3 325 400 569
## 4 4 456 500 896
## 5 5 579 600 956
## 6 6 612 654 2345
## 7 7 709 789 780
## 8 8 513 906 599
## 9 9 527 413 1023
## 10 10 379 567 678
And there we have it! We have come full circle back into wide.
Now it’s your turn.
## wide format
gap_wide <- readr::read_csv('https://raw.githubusercontent.com/kflisikowski/ds/master/gapminder_wide.csv')
## long format
gapminder <- readr::read_csv('https://raw.githubusercontent.com/kflisikowski/ds/master/gapminder.csv')
Let’s have a look:
head(gap_wide)
## # A tibble: 6 × 38
## continent country gdpPe…¹ gdpPe…² gdpPe…³ gdpPe…⁴ gdpPe…⁵ gdpPe…⁶ gdpPe…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Algeria 2449. 3014. 2551. 3247. 4183. 4910. 5745.
## 2 Africa Angola 3521. 3828. 4269. 5523. 5473. 3009. 2757.
## 3 Africa Benin 1063. 960. 949. 1036. 1086. 1029. 1278.
## 4 Africa Botswana 851. 918. 984. 1215. 2264. 3215. 4551.
## 5 Africa Burkina Faso 543. 617. 723. 795. 855. 743. 807.
## 6 Africa Burundi 339. 380. 355. 413. 464. 556. 560.
## # … with 29 more variables: gdpPercap_1987 <dbl>, gdpPercap_1992 <dbl>,
## # gdpPercap_1997 <dbl>, gdpPercap_2002 <dbl>, gdpPercap_2007 <dbl>,
## # lifeExp_1952 <dbl>, lifeExp_1957 <dbl>, lifeExp_1962 <dbl>,
## # lifeExp_1967 <dbl>, lifeExp_1972 <dbl>, lifeExp_1977 <dbl>,
## # lifeExp_1982 <dbl>, lifeExp_1987 <dbl>, lifeExp_1992 <dbl>,
## # lifeExp_1997 <dbl>, lifeExp_2002 <dbl>, lifeExp_2007 <dbl>, pop_1952 <dbl>,
## # pop_1957 <dbl>, pop_1962 <dbl>, pop_1967 <dbl>, pop_1972 <dbl>, …
str(gap_wide)
## spc_tbl_ [142 × 38] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ continent : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
## $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
## $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
## $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
## $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
## $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
## $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
## $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
## $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
## $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
## $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
## $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
## $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
## $ lifeExp_1952 : num [1:142] 43.1 30 38.2 47.6 32 ...
## $ lifeExp_1957 : num [1:142] 45.7 32 40.4 49.6 34.9 ...
## $ lifeExp_1962 : num [1:142] 48.3 34 42.6 51.5 37.8 ...
## $ lifeExp_1967 : num [1:142] 51.4 36 44.9 53.3 40.7 ...
## $ lifeExp_1972 : num [1:142] 54.5 37.9 47 56 43.6 ...
## $ lifeExp_1977 : num [1:142] 58 39.5 49.2 59.3 46.1 ...
## $ lifeExp_1982 : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
## $ lifeExp_1987 : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
## $ lifeExp_1992 : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
## $ lifeExp_1997 : num [1:142] 69.2 41 54.8 52.6 50.3 ...
## $ lifeExp_2002 : num [1:142] 71 41 54.4 46.6 50.6 ...
## $ lifeExp_2007 : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
## $ pop_1952 : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
## $ pop_1957 : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
## $ pop_1962 : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
## $ pop_1967 : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
## $ pop_1972 : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
## $ pop_1977 : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
## $ pop_1982 : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
## $ pop_1987 : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
## $ pop_1992 : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
## $ pop_1997 : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
## $ pop_2002 : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
## $ pop_2007 : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
## - attr(*, "spec")=
## .. cols(
## .. continent = col_character(),
## .. country = col_character(),
## .. gdpPercap_1952 = col_double(),
## .. gdpPercap_1957 = col_double(),
## .. gdpPercap_1962 = col_double(),
## .. gdpPercap_1967 = col_double(),
## .. gdpPercap_1972 = col_double(),
## .. gdpPercap_1977 = col_double(),
## .. gdpPercap_1982 = col_double(),
## .. gdpPercap_1987 = col_double(),
## .. gdpPercap_1992 = col_double(),
## .. gdpPercap_1997 = col_double(),
## .. gdpPercap_2002 = col_double(),
## .. gdpPercap_2007 = col_double(),
## .. lifeExp_1952 = col_double(),
## .. lifeExp_1957 = col_double(),
## .. lifeExp_1962 = col_double(),
## .. lifeExp_1967 = col_double(),
## .. lifeExp_1972 = col_double(),
## .. lifeExp_1977 = col_double(),
## .. lifeExp_1982 = col_double(),
## .. lifeExp_1987 = col_double(),
## .. lifeExp_1992 = col_double(),
## .. lifeExp_1997 = col_double(),
## .. lifeExp_2002 = col_double(),
## .. lifeExp_2007 = col_double(),
## .. pop_1952 = col_double(),
## .. pop_1957 = col_double(),
## .. pop_1962 = col_double(),
## .. pop_1967 = col_double(),
## .. pop_1972 = col_double(),
## .. pop_1977 = col_double(),
## .. pop_1982 = col_double(),
## .. pop_1987 = col_double(),
## .. pop_1992 = col_double(),
## .. pop_1997 = col_double(),
## .. pop_2002 = col_double(),
## .. pop_2007 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
While wide format is nice for data entry, it’s not nice for calculations. Some of the columns are a mix of variable (e.g. “gdpPercap”) and data (“1952”). What if you were asked for the mean population after 1990 in Algeria? Possible, but ugly. But we know it doesn’t need to be so ugly. Let’s tidy it back to the format we’ve been using.
Your job is to get it to long format. You will have to do this in 2 steps. The first step is to take all of those column names (e.g. lifeExp_1970) and make them a variable in a new column, and transfer the values into another column.
We need to name two new variables in the key-value pair, one for the key, one for the value. It can be hard to wrap your mind around this, so let’s give it a try. Let’s name them obstype_year and obs_values.
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values)
Although we were already planning to inspect our work, let’s definitely do it now:
str(gap_long)
## tibble [5,396 × 2] (S3: tbl_df/tbl/data.frame)
## $ obstype_year: chr [1:5396] "continent" "continent" "continent" "continent" ...
## $ obs_values : chr [1:5396] "Africa" "Africa" "Africa" "Africa" ...
Although we were already planning to inspect our work, let’s definitely do it now:
str(gap_long)
## tibble [5,396 × 2] (S3: tbl_df/tbl/data.frame)
## $ obstype_year: chr [1:5396] "continent" "continent" "continent" "continent" ...
## $ obs_values : chr [1:5396] "Africa" "Africa" "Africa" "Africa" ...
We have reshaped our dataframe but this new format isn’t really what we wanted.
What went wrong? Notice that it didn’t know that we wanted to keep continent and country untouched; we need to give it more information about which columns we want reshaped. We can do this in several ways.
One way is to identify the columns is by name. Listing them explicitly can be a good approach if there are just a few. But in our case we have 30 columns. I’m not going to list them out here since there is way too much potential for error if I tried to list gdpPercap_1952, gdpPercap_1957, gdpPercap_1962 and so on. But we could use some of dplyr’s awesome helper functions — because we expect that there is a better way to do this!
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
dplyr::starts_with('pop'),
dplyr::starts_with('lifeExp'),
dplyr::starts_with('gdpPercap')) #here i'm listing all the columns to use in gather
str(gap_long)
## tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
## $ continent : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr [1:5112] "Algeria" "Angola" "Benin" "Botswana" ...
## $ obstype_year: chr [1:5112] "pop_1952" "pop_1952" "pop_1952" "pop_1952" ...
## $ obs_values : num [1:5112] 9279525 4232095 1738315 442308 4469979 ...
head(gap_long)
## # A tibble: 6 × 4
## continent country obstype_year obs_values
## <chr> <chr> <chr> <dbl>
## 1 Africa Algeria pop_1952 9279525
## 2 Africa Angola pop_1952 4232095
## 3 Africa Benin pop_1952 1738315
## 4 Africa Botswana pop_1952 442308
## 5 Africa Burkina Faso pop_1952 4469979
## 6 Africa Burundi pop_1952 2445618
tail(gap_long)
## # A tibble: 6 × 4
## continent country obstype_year obs_values
## <chr> <chr> <chr> <dbl>
## 1 Europe Sweden gdpPercap_2007 33860.
## 2 Europe Switzerland gdpPercap_2007 37506.
## 3 Europe Turkey gdpPercap_2007 8458.
## 4 Europe United Kingdom gdpPercap_2007 33203.
## 5 Oceania Australia gdpPercap_2007 34435.
## 6 Oceania New Zealand gdpPercap_2007 25185.
Success! And there is another way that is nice to use if your columns don’t follow such a structured pattern: you can exclude the columns you don’t want.
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
-continent, -country)
str(gap_long)
## tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
## $ continent : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr [1:5112] "Algeria" "Angola" "Benin" "Botswana" ...
## $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
## $ obs_values : num [1:5112] 2449 3521 1063 851 543 ...
head(gap_long)
## # A tibble: 6 × 4
## continent country obstype_year obs_values
## <chr> <chr> <chr> <dbl>
## 1 Africa Algeria gdpPercap_1952 2449.
## 2 Africa Angola gdpPercap_1952 3521.
## 3 Africa Benin gdpPercap_1952 1063.
## 4 Africa Botswana gdpPercap_1952 851.
## 5 Africa Burkina Faso gdpPercap_1952 543.
## 6 Africa Burundi gdpPercap_1952 339.
tail(gap_long)
## # A tibble: 6 × 4
## continent country obstype_year obs_values
## <chr> <chr> <chr> <dbl>
## 1 Europe Sweden pop_2007 9031088
## 2 Europe Switzerland pop_2007 7554661
## 3 Europe Turkey pop_2007 71158647
## 4 Europe United Kingdom pop_2007 60776238
## 5 Oceania Australia pop_2007 20434176
## 6 Oceania New Zealand pop_2007 4115771
Using gap_long, calculate the mean life expectancy for each continent over time from 1982 to 2007.
Convert gap_long all the way back to gap_wide. Hint: Do this in 2 steps. First, create appropriate labels for all our new variables (variable_year combinations) with the opposite of separate: tidyr::unite(). Second, spread() that variable_year column into wider format.
head(gap_long) # remember the columns
## # A tibble: 6 × 4
## continent country obstype_year obs_values
## <chr> <chr> <chr> <dbl>
## 1 Africa Algeria gdpPercap_1952 2449.
## 2 Africa Angola gdpPercap_1952 3521.
## 3 Africa Benin gdpPercap_1952 1063.
## 4 Africa Botswana gdpPercap_1952 851.
## 5 Africa Burkina Faso gdpPercap_1952 543.
## 6 Africa Burundi gdpPercap_1952 339.
One of the coolest functions in tidyr is the function complete(). We’ll start with an example dataframe:
df <- data.frame(
Year = c(1999, 2000, 2004, 1999, 2004),
Var1 = c("La", "La", "La", "Mi", "Mi"),
Var2 = c(4,5,2,1,8)
)
df
## Year Var1 Var2
## 1 1999 La 4
## 2 2000 La 5
## 3 2004 La 2
## 4 1999 Mi 1
## 5 2004 Mi 8
“Mi” is not listed for the year 2000.
Does this mean it wasn’t observed (Var2 = 0) or that it wasn’t recorded (Var2 = NA)? Only the person who recorded the data knows, but let’s assume that this means the Var2 was 0 for that year.
We can use the complete() function to make our dataset more complete.
df %>%
complete(Year, Var1)
## # A tibble: 6 × 3
## Year Var1 Var2
## <dbl> <chr> <dbl>
## 1 1999 La 4
## 2 1999 Mi 1
## 3 2000 La 5
## 4 2000 Mi NA
## 5 2004 La 2
## 6 2004 Mi 8
This gives us an NA for “Mi” in 2000, but we want it to be a 0 instead. We can use the fill argument to assign the fill value.
df %>% complete(Year, Var1, fill = list(Var2 = 0))
## # A tibble: 6 × 3
## Year Var1 Var2
## <dbl> <chr> <dbl>
## 1 1999 La 4
## 2 1999 Mi 1
## 3 2000 La 5
## 4 2000 Mi 0
## 5 2004 La 2
## 6 2004 Mi 8
Now we have what we want. Let’s assume that all years between 1999 and 2004 that aren’t listed should actually be assigned a value of 0. We can use the full_seq() function from tidyr to fill out our dataset with all years 1999-2004 and assign Var2 values of 0 to those years & Var1 for which there was no observation.
df %>% complete(Year = full_seq(Year, period = 1),
Var1,
fill = list(Var2 = 0))
## # A tibble: 12 × 3
## Year Var1 Var2
## <dbl> <chr> <dbl>
## 1 1999 La 4
## 2 1999 Mi 1
## 3 2000 La 5
## 4 2000 Mi 0
## 5 2001 La 0
## 6 2001 Mi 0
## 7 2002 La 0
## 8 2002 Mi 0
## 9 2003 La 0
## 10 2003 Mi 0
## 11 2004 La 2
## 12 2004 Mi 8
https://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/ # Who is this Hadley Wickham guy?
https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
# Further reading about %>%
https://blog.rstudio.org/2014/07/22/introducing-tidyr/ # Helpful overview of tidyr
http://ademos.people.uic.edu/Chapter8.html # Tim Carsel’s chapter on reshape2
https://cran.r-project.org/web/packages/tidyr/tidyr.pdf # A full guide of tidy r and all the arguments for each function of the package
http://garrettgman.github.io/tidying/ # Different types of messy data and how to fix with tidyr