Ben Bond-Lamberty
2024-02-29
A short workshop covering reproducibility and code/data management; data filtering, reshaping, and joining; and summarizing.
gapminder
dataset)babynames
dataset)Slides: https://rpubs.com/bpbond/874167
This workshop assumes a basic to intermediate knowledge of R.
If you want to do the hands-on exercises (encouraged!), make sure up-to-date versions of the following packages are installed:
dplyr
- fast, flexible tool for working with data
framestidyr
- reshaping and cleaning dataggplot2
- popular package for visualizing datagapminder
- life expectancy, GDP per capita, and
population for 142 countriesbabynames
- names provided to the SSA 1880-2017What we’re working with today–the “tidyverse”–constitute a particular and popular dialect of R, but the principles we’ll go over are broadly applicable.
Like anything else, dplyr
has advantages and
disadvantages. In particular, I do not recommend using it if you
don’t already have experience with base R.
I will point out base R equivalents as we go.
More information: https://github.com/matloff/TidyverseSkeptic
The data.table
package is also worth
checking out for its speed, stability, and zero dependencies.
We are in the era of collaborative ‘big data’, but even if you work by yourself with ‘little data’ you have to have some skills to deal with those data.
Most fundamentally, your results have to be reproducible.
Your most important collaborator is your future self. It’s important to make a workflow that you can use time and time again, and even pass on to others in such a way that you don’t have to be there to walk them through it. Source
Prepare yourself for the future. Funders, journals, governments, colleagues are all pushing for more reproducibility and openness: open-access journals, open code, and data deposition at the peer review stage and beyond.
It’s a slow but steady ratchet.
Reproducibility generally means scripts tied to open source software with effective data management and archiving.
By ZabMilenko at English Wikipedia, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=49717967
…what you’ve lost. What if you need access to a file as it existed 1, 10, or 100, or 1000 days ago?
Git (and website GitHub) are the most popular version control tools for use with R, and many other languages. They offer:
Version control and scripts address two of the biggest problems with managing code and data:
Ideally, every step in your analysis is programmatic. This means it is performed by a script that can be read and understood in the future.
Manual steps are hard to reproduce, laborious, and error-prone.
Don’t let the perfect be the enemy of the good.
Upgrade and improve your workflow and skills over time:
Organizing analyses so that they are reproducible is not easy. It requires diligence and a considerable investment of time: to learn new computational tools, and to organize and document analyses as you go.
But partially reproducible is better than not at all reproducible. Just try to make your next paper or project better organized than the last.
A great and practical guide: http://kbroman.org/steps2rr/
A typical project/paper directory for me, slightly idealized:
1-download.R
2-prep_data.R
3-analyze_data.R
4-manuscript_report.Rmd
logs/
output/
rawdata/
This directory contains scripts that are backed up both locally and remotely. It is under version control, so it’s easy to track changes over time.
There’s also targets and make, i.e. fully automated analysis workflows…but that’s a topic for another day.
In honor of the late Hans Rosling,
we’ll use the gapminder
dataset.
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
Read the help page about gapminder
.
The magrittr
package (used by both dplyr
and tidyr
) provides the %>%
operator, which
allows us to pipe an object forward into a function or call
expression.
Note that x %>% f
is usually equivalent to
f(x)
.
print(gapminder)
library(dplyr)
gapminder %>% print()
gagminder %>% head()
gapminder %>% head(n=20)
gapminder %>%
print() %>%
summary() # what is non-piped equivalent?
summary(print(gapminder))
RStudio has a keyboard shortcut for this.
R now (as of v4.1) has its own built-in pipe operator:
|>
.
The dplyr
package uses verbs (functions) to
operate on tibbles (data frames). When using pipes, this looks
like:
As noted above, pipelines have tradeoffs like any other tool. But they do provide easy-to-read algorithmic steps in many cases.
Let’s go over some of those possible do_something
steps.
Very commonly used, as it lets you filter a dataset by one or more conditions.
gapminder %>% filter(country == "Egypt")
gapminder %>% filter(country == "Egypt", year > 2000) # AND
gapminder %>% filter(country == "Egypt" | year > 2000) # OR
gapminder %>%
filter(country %in% c("Egypt", "New Zealand", "Chad"))
This is equivalent to base R’s subset
function.
library(ggplot2)
gapminder %>%
filter(year == 1997) %>%
ggplot(aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10()
Also extremely useful. Note the different notations for selecting columns:
select(gapminder, pop, year)
gapminder %>% select(pop, year)
# Base R equivalent: gapminder[c("pop", "year")]
gapminder %>% select(-lifeExp, -gdpPercap)
gapminder %>% select(-1)
There are lots of other cool ways to select columns–see
?select
.
In base R:
Let’s focus on a single country’s data for a bit. Write a pipeline
that picks out Egypt data only, removes the continent and country
columns, and assigns the result to a variable Egypt
. How
many rows does the resulting dataset have?
Let’s focus on a single country’s data for a bit. Write a pipeline
that picks out Egypt data only, removes the continent and country
columns, and assigns the result to a variable Egypt
. How
many rows does the resulting dataset have?
## # A tibble: 12 × 4
## year lifeExp pop gdpPercap
## <int> <dbl> <int> <dbl>
## 1 1952 41.9 22223309 1419.
## 2 1957 44.4 25009741 1459.
## 3 1962 47.0 28173309 1693.
## 4 1967 49.3 31681188 1815.
## 5 1972 51.1 34807417 2024.
## 6 1977 53.3 38783863 2785.
## 7 1982 56.0 45681811 3504.
## 8 1987 59.8 52799062 3885.
## 9 1992 63.7 59402198 3795.
## 10 1997 67.2 66134291 4173.
## 11 2002 69.8 73312559 4755.
## 12 2007 71.3 80264543 5581.
These functions can be very useful.
## # A tibble: 1,704 × 5
## coco year lifeExp pop gdpPercap
## <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan_Asia 1952 28.8 8425333 779.
## 2 Afghanistan_Asia 1957 30.3 9240934 821.
## 3 Afghanistan_Asia 1962 32.0 10267083 853.
## 4 Afghanistan_Asia 1967 34.0 11537966 836.
## 5 Afghanistan_Asia 1972 36.1 13079460 740.
## 6 Afghanistan_Asia 1977 38.4 14880372 786.
## 7 Afghanistan_Asia 1982 39.9 12881816 978.
## 8 Afghanistan_Asia 1987 40.8 13867957 852.
## 9 Afghanistan_Asia 1992 41.7 16317921 649.
## 10 Afghanistan_Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
gapminder %>%
unite(coco, country, continent) %>%
separate(coco,
into = c("country", "continent"),
sep = "_")
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
The mutate
function in particular is used a lot
in dplyr
pipelines.
## # A tibble: 12 × 5
## year lifeExp population gdpPercap logpop
## <int> <dbl> <int> <dbl> <dbl>
## 1 1952 41.9 22223309 1419. 16.9
## 2 1957 44.4 25009741 1459. 17.0
## 3 1962 47.0 28173309 1693. 17.2
## 4 1967 49.3 31681188 1815. 17.3
## 5 1972 51.1 34807417 2024. 17.4
## 6 1977 53.3 38783863 2785. 17.5
## 7 1982 56.0 45681811 3504. 17.6
## 8 1987 59.8 52799062 3885. 17.8
## 9 1992 63.7 59402198 3795. 17.9
## 10 1997 67.2 66134291 4173. 18.0
## 11 2002 69.8 73312559 4755. 18.1
## 12 2007 71.3 80264543 5581. 18.2
Several important notes:
mutate
; this is more efficient than separate
mutate
callstransmute
, which adds new variables and
drops existing onesSeveral important notes:
## # A tibble: 2 × 4
## x y z m
## <int> <int> <int> <dbl>
## 1 1 3 5 3.5
## 2 2 4 6 3.5
## # A tibble: 2 × 4
## # Rowwise:
## x y z m
## <int> <int> <int> <dbl>
## 1 1 3 5 3
## 2 2 4 6 4
Note that for a large dataset, base R’s
rowMeans
function will be much faster than this,
however. Use the best tool for the job!
Two handy tidyr
functions allow you to fill data based
on adjacent values, and fill in NA
values.
people <- tibble(who = c("Alice", "Bob", "Carol"), age = c(25, NA, 45))
people %>% fill(age, .direction = "down")
## # A tibble: 3 × 2
## who age
## <chr> <dbl>
## 1 Alice 25
## 2 Bob 25
## 3 Carol 45
## # A tibble: 3 × 2
## who age
## <chr> <dbl>
## 1 Alice 25
## 2 Bob 99
## 3 Carol 45
Let’s start with the simple people
data frame, but
adding another column:
## # A tibble: 3 × 3
## who age height
## <chr> <dbl> <dbl>
## 1 Alice 25 160
## 2 Bob NA 170
## 3 Carol 45 180
Note that there is one column of metadata (name) and two of data (age and height).
Let’s say we want to put people
into long (or
tidy) format–where every row is a different observation. In
other words, we need to collapse multiple columns of data into a single
one.
For this we use tidyr::pivot_longer
, which at a minimum
needs to know: what’s the data source, and what column(s) do we want to
pivot?
## # A tibble: 6 × 3
## who name value
## <chr> <chr> <dbl>
## 1 Alice age 25
## 2 Alice height 160
## 3 Bob age NA
## 4 Bob height 170
## 5 Carol age 45
## 6 Carol height 180
(Note base R has a reshape
function, but I find it
difficult to use.)
We could also say “pivot everything except for the ‘who’ column”, or provide custom column names in the resulting data frame:
## # A tibble: 6 × 3
## who name value
## <chr> <chr> <dbl>
## 1 Alice age 25
## 2 Alice height 160
## 3 Bob age NA
## 4 Bob height 170
## 5 Carol age 45
## 6 Carol height 180
## # A tibble: 6 × 3
## who variable datum
## <chr> <chr> <dbl>
## 1 Alice age 25
## 2 Alice height 160
## 3 Bob age NA
## 4 Bob height 170
## 5 Carol age 45
## 6 Carol height 180
To transform data into a wider form, pivot_wider
needs
to know at least two things: which column holds the new column names,
and which the data.
## # A tibble: 3 × 3
## who age height
## <chr> <dbl> <dbl>
## 1 Alice 25 160
## 2 Bob NA 170
## 3 Carol 45 180
Here tidyr
has guessed that if the names of the new
columns are coming from “name”, then the values are probably coming from
the numeric column “value”.
And so we get our original people
data frame back.
What happens if we take column names from the “who” column?
Pivoting data wider may result in NA
values if not every
row-column combination is present in the dataset:
## # A tibble: 3 × 4
## height Alice Bob Carol
## <dbl> <dbl> <dbl> <dbl>
## 1 160 25 NA NA
## 2 170 NA NA NA
## 3 180 NA NA 45
## # A tibble: 3 × 4
## height Alice Bob Carol
## <dbl> <dbl> <dbl> <dbl>
## 1 160 25 -1 -1
## 2 170 -1 NA -1
## 3 180 -1 -1 45
Time for something a bit more complex. Back to our gapminder-derived
Egypt
dataset!
## # A tibble: 36 × 3
## year name value
## <int> <chr> <dbl>
## 1 1952 lifeExp 41.9
## 2 1952 pop 22223309
## 3 1952 gdpPercap 1419.
## 4 1957 lifeExp 44.4
## 5 1957 pop 25009741
## 6 1957 gdpPercap 1459.
## 7 1962 lifeExp 47.0
## 8 1962 pop 28173309
## 9 1962 gdpPercap 1693.
## 10 1967 lifeExp 49.3
## # ℹ 26 more rows
Egypt %>%
pivot_longer(-year) %>%
ggplot(aes(year, value)) + geom_line() +
facet_wrap(~name, scales = "free")
Experiment. Why do these do what they do?
Our Egypt
dataset has one row per year, and three data
columns. Write code to transform it to have one row per datum, and each
year in a different column. Do this for post-2000 data only.
Hint: start by filtering and then pivoting it into long form.
Our Egypt
dataset has one row per year, and three data
columns. Write code to transform it to have one row per datum, and each
year in a different column. Do this for post-2000 data only.
Hint: start by filtering and then pivoting it into long form.
## # A tibble: 3 × 3
## name `2002` `2007`
## <chr> <dbl> <dbl>
## 1 lifeExp 69.8 71.3
## 2 pop 73312559 80264543
## 3 gdpPercap 4755. 5581.
Frequently we have more than one data source:
## # A tibble: 3 × 3
## who age height
## <chr> <dbl> <dbl>
## 1 Alice 25 160
## 2 Bob NA 170
## 3 Carol 45 180
hometowns <- tibble(who = c("Alice", "Bob", "Dave"),
where = c("Abilene", "Billsville", "Darien"))
hometowns
## # A tibble: 3 × 2
## who where
## <chr> <chr>
## 1 Alice Abilene
## 2 Bob Billsville
## 3 Dave Darien
Here, we’d like to merge
(the base R function) or
join these together.
Probably the most commonly used join is left_join
. This
works by including all rows in the first data frame, and any
matches it finds from the second:
## Joining with `by = join_by(who)`
## # A tibble: 3 × 4
## who age height where
## <chr> <dbl> <dbl> <chr>
## 1 Alice 25 160 Abilene
## 2 Bob NA 170 Billsville
## 3 Carol 45 180 <NA>
In other words, it’s a lookup.
Note the message. It’s good practice to specify the common columns that joins should use:
The order matters. Why do these produce different results?
## # A tibble: 3 × 4
## who age height where
## <chr> <dbl> <dbl> <chr>
## 1 Alice 25 160 Abilene
## 2 Bob NA 170 Billsville
## 3 Carol 45 180 <NA>
## # A tibble: 3 × 4
## who where age height
## <chr> <chr> <dbl> <dbl>
## 1 Alice Abilene 25 160
## 2 Bob Billsville NA 170
## 3 Dave Darien NA NA
There are many different join operations available.
## # A tibble: 3 × 4
## who age height where
## <chr> <dbl> <dbl> <chr>
## 1 Alice 25 160 Abilene
## 2 Bob NA 170 Billsville
## 3 Dave NA NA Darien
## # A tibble: 2 × 4
## who age height where
## <chr> <dbl> <dbl> <chr>
## 1 Alice 25 160 Abilene
## 2 Bob NA 170 Billsville
## # A tibble: 4 × 4
## who age height where
## <chr> <dbl> <dbl> <chr>
## 1 Alice 25 160 Abilene
## 2 Bob NA 170 Billsville
## 3 Carol 45 180 <NA>
## 4 Dave NA NA Darien
# anti_join returns all rows from x without a match in y;
# semi_join returns all rows from x with a match
people %>% anti_join(hometowns, by = "who")
## # A tibble: 1 × 3
## who age height
## <chr> <dbl> <dbl>
## 1 Carol 45 180
Thinking back to the typical data pipeline, we often want to summarize data by groups as an intermediate or final step. For example, for each subgroup we might want to:
n
→ 1)n
→
n
)Specific examples:
gapminder
: what’s the year of maximum GDP for each
country?babynames
: what’s the most common name over time?These are generally known as split-apply-combine problems.
There are many ways to tackle split-apply-combine in R.
The dplyr
package specializes in data frames, but also
allows you to work with remote, out-of-memory databases, using exactly
the same tools, because it abstracts away how your data is
stored.
dplyr
is very fast for most, though not all, operations
on data frames (tabular data). But again, like any tool, it has
some disadvantages too.
dplyr
verbs (functions) become particularly powerful
when used in conjunction with groups we define in the dataset.
Grouping doesn’t change the data, but instead groups it in
preparation for the next operation we perform.
## # A tibble: 1,704 × 6
## # Groups: country [142]
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
In base R:
Most frequently, that “next operation” is a
summarise
(n
→ 1), ormutate
(n
→ n
):## # A tibble: 142 × 2
## country maxpop
## <fct> <int>
## 1 Afghanistan 31889923
## 2 Albania 3600523
## 3 Algeria 33333216
## 4 Angola 12420476
## 5 Argentina 40301927
## 6 Australia 20434176
## 7 Austria 8199783
## 8 Bahrain 708573
## 9 Bangladesh 150448339
## 10 Belgium 10392226
## # ℹ 132 more rows
In base R:
## # A tibble: 5 × 3
## continent meanLifeExp maxpop
## <fct> <dbl> <int>
## 1 Africa 48.9 135031164
## 2 Americas 64.7 301139947
## 3 Asia 60.1 1318683096
## 4 Europe 71.9 82400996
## 5 Oceania 74.3 20434176
Notice that the only columns left, after summarising, are:
By default, every summary operation removes the last grouping variable, while leaving the others intact:
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
## # A tibble: 142 × 3
## # Groups: continent [5]
## continent country max_LifeExp
## <fct> <fct> <dbl>
## 1 Africa Algeria 72.3
## 2 Africa Angola 42.7
## 3 Africa Benin 56.7
## 4 Africa Botswana 63.6
## 5 Africa Burkina Faso 52.3
## 6 Africa Burundi 49.6
## 7 Africa Cameroon 55.0
## 8 Africa Central African Republic 50.5
## 9 Africa Chad 51.7
## 10 Africa Comoros 65.2
## # ℹ 132 more rows
This makes it easy to do certain kinds of multi-step operations; for example, if we want to know what the mean maximum country life expectancy is by continent:
gapminder %>%
group_by(continent, country) %>%
summarise(max_LifeExp = max(lifeExp), .groups = "drop_last") %>%
summarise(mean(max_LifeExp))
## # A tibble: 5 × 2
## continent `mean(max_LifeExp)`
## <fct> <dbl>
## 1 Africa 57.4
## 2 Americas 73.6
## 3 Asia 71.0
## 4 Europe 77.7
## 5 Oceania 80.7
You can also re-group as needed–See the .groups
argument–or ungroup
.
We can apply a function to multiple columns…
## # A tibble: 142 × 4
## country lifeExp pop gdpPercap
## <fct> <dbl> <int> <dbl>
## 1 Afghanistan 43.8 31889923 978.
## 2 Albania 76.4 3600523 5937.
## 3 Algeria 72.3 33333216 6223.
## 4 Angola 42.7 12420476 5523.
## 5 Argentina 75.3 40301927 12779.
## 6 Australia 81.2 20434176 34435.
## 7 Austria 79.8 8199783 36126.
## 8 Bahrain 75.6 708573 29796.
## 9 Bangladesh 64.1 150448339 1391.
## 10 Belgium 79.4 10392226 33693.
## # ℹ 132 more rows
…or multiple functions to one or more columns:
## # A tibble: 142 × 3
## country pop_1 pop_2
## <fct> <int> <int>
## 1 Afghanistan 8425333 31889923
## 2 Albania 1282697 3600523
## 3 Algeria 9279525 33333216
## 4 Angola 4232095 12420476
## 5 Argentina 17876956 40301927
## 6 Australia 8691212 20434176
## 7 Austria 6927772 8199783
## 8 Bahrain 120447 708573
## 9 Bangladesh 46886859 150448339
## 10 Belgium 8730405 10392226
## # ℹ 132 more rows
We now have the tools to build up a long pipeline to, e.g., compute the min and max for all numeric variables and produce a table with continent as columns headers, and variable (gdpPercap, lifeExp, pop) and statistic as rows.
# We define this function to end up with nice column names
min_max <- list(
min = ~min(.x, na.rm = TRUE),
max = ~max(.x, na.rm = TRUE)
)
gapminder %>%
filter(continent %in% c("Africa", "Americas", "Asia")) %>%
group_by(continent) %>%
# compute min and max for the numeric columns
summarise(across(where(is.numeric), min_max)) %>%
pivot_longer(-continent) %>%
separate(name, into = c("variable", "stat")) %>%
pivot_wider(names_from = "continent", values_from = "value")
## # A tibble: 8 × 5
## variable stat Africa Americas Asia
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 year min 1952 1952 1952
## 2 year max 2007 2007 2007
## 3 lifeExp min 23.6 37.6 28.8
## 4 lifeExp max 76.4 80.7 82.6
## 5 pop min 60011 662850 120447
## 6 pop max 135031164 301139947 1318683096
## 7 gdpPercap min 241. 1202. 331
## 8 gdpPercap max 21951. 42952. 113523.
babynames
Explore babynames
a bit. How many rows, columns does it
have? How many unique names?
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # ℹ 1,924,655 more rows
What does this calculate?
## # A tibble: 276 × 5
## # Groups: year, sex [276]
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 M John 9655 0.0815
## 3 1881 F Mary 6919 0.0700
## 4 1881 M John 8769 0.0810
## 5 1882 F Mary 8148 0.0704
## 6 1882 M John 9557 0.0783
## 7 1883 F Mary 8012 0.0667
## 8 1883 M John 8894 0.0791
## 9 1884 F Mary 9217 0.0670
## 10 1884 M John 9388 0.0765
## # ℹ 266 more rows
The various slice_
functions are quite handy.
babynames
datasetLoad the dataset using library(babynames)
.
Read its help page. Look at its structure (rows, columns, summary).
Use dplyr
to calculate the total number of names in the
SSA database for each year and sex, with each sex’s counts in a separate
column. Hint: n()
.
Make a graph or table showing how popular YOUR name has been over time (either its proportion, or rank).
babynames
datasetLoad the dataset using library(babynames)
.
Read its help page. Look at its structure (rows, columns, summary).
Use dplyr
to calculate the total number of names in the
SSA database for each year and sex, with each sex’s counts in a separate
column. Hint: n()
.
Make a graph or table showing how popular YOUR name has been over time (either its proportion, or rank).
babynames %>%
group_by(year, sex) %>%
mutate(rank = row_number(desc(n))) %>%
filter(name %in% c("Benjamin", "Rachel")) %>%
ggplot(aes(year, rank, color = name, shape = sex)) +
geom_point() + scale_y_log10()
## $title
## [1] "Benjamin and Rachel"
##
## attr(,"class")
## [1] "labels"
The diamonds
dataset is included with
ggplot2
.
ggplot(diamonds, aes(carat, price, color = cut)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10() + scale_y_log10()
This old code now generates a warning. Leaving it here, in the
slides, to make sure to mention dplyr
’s penchant for
changing (and breaking) behavior.
library(broom) # to sweep up -- tidy -- model outputs
diamonds %>%
group_by(cut) %>%
do(mod = lm(log(price) ~ log(carat), data = .)) %>%
# need to change this to `reframe` (no other change needed, thankfully)
summarise(cut = cut,
tidy(mod))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 10 × 6
## cut term estimate std.error statistic p.value
## <ord> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Fair (Intercept) 8.19 0.00805 1017. 0
## 2 Fair log(carat) 1.49 0.0171 87.2 0
## 3 Good (Intercept) 8.38 0.00398 2102. 0
## 4 Good log(carat) 1.74 0.00635 274. 0
## 5 Very Good (Intercept) 8.45 0.00266 3179. 0
## 6 Very Good log(carat) 1.73 0.00382 452. 0
## 7 Premium (Intercept) 8.43 0.00237 3553. 0
## 8 Premium log(carat) 1.66 0.00355 467. 0
## 9 Ideal (Intercept) 8.52 0.00237 3596. 0
## 10 Ideal log(carat) 1.71 0.00310 550. 0
Check out slide decks at https://rpubs.com/bpbond
But the tools we’ve covered here can do a lot!
Feedback welcome.
The slides for this presentation are available here: https://rpubs.com/bpbond/874167
The repository with the R code that generated the slides is here: https://github.com/bpbond/R-workshops/
Source?