Functions and Functionals (R Ladies - Twin Cities)

1 Takeaways from today

  • How to use and write custom functions

  • What is a functional?

  • Using purrr map family functions

Libraries for today:

#install.packages('fivethirtyeight','dplyr','broom','ggplot2','purrr', dependencies = T)
library(fivethirtyeight)
library(dplyr)
library(broom)
library(ggplot2)
library(purrr)

2 Beginning at the end: my use case

data(bechdel)

This data set contains the raw data behind the 538 story “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”. The Bechdel test provides a pass/fail outcome for movies based on three criteria:

  1. The movie has to have at least two women in it,

  2. who talk to each other,

  3. about something besides a man

My question: Are budget and gross related to the probability of failing the Bechdel test? Does time play a role?

The variables I’ll use are the following:

  • year - year of movie release
  • binary - Bechdel test PASS vs FAIL binary
  • budget_2013 - Budget in 2013 inflation adjusted dollars
  • domgross_2013 - Domestic gross (US) in 2013 inflation adjusted dollars

I’m going to fit a separate logistic regression to each decade. (This may not be a good model or even good statistics, but is a good way to show the concepts I’m after). First, I create a time factor variable and a Bechdel test variable.

bechdel_mod <- bechdel %>%
  dplyr::mutate(
    #Using cases since there are several decades
    time_factor = dplyr::case_when(year %in% 1970:1979 ~ '1970s',
                                    year %in% 1980:1989 ~ '1980s',
                                    year %in% 1990:1999 ~ '1990s',
                                    year %in% 2000:2009 ~ '2000s',
                                    year %in% 2010:2019 ~ '2010s'),
    #Using if_then since I only have 2 outcomes
    bechdel_test = dplyr::if_else(binary == 'PASS',  0, 1))
Then, I split the data into list of data.frames by the time factor.
by_time <- split(bechdel_mod, bechdel_mod$time_factor)
by_time
## $`1970s`
## # A tibble: 54 x 17
##     year imdb  title test  clean_test binary budget domgross intgross code 
##    <int> <chr> <chr> <chr> <ord>      <chr>   <int>    <dbl>    <dbl> <chr>
##  1  1979 tt00… Alien ok    ok         PASS   9.00e6   8.09e7   2.04e8 1979…
##  2  1979 tt00… Apoc… nowo… nowomen    FAIL   3.15e7   7.88e7   7.88e7 1979…
##  3  1979 tt00… Moon… nota… notalk     FAIL   3.10e7   7.03e7   2.10e8 1979…
##  4  1979 tt00… Star… nota… notalk     FAIL   3.50e7   8.23e7   1.39e8 1979…
##  5  1979 tt00… The … ok    ok         PASS   1.85e7   6.52e7   1.09e8 1979…
##  6  1978 tt00… Anim… men   men        FAIL   3.00e6   1.42e8   1.42e8 1978…
##  7  1978 tt00… Dawn… nowo… nowomen    FAIL   2.80e7   5.90e7   1.03e8 1978…
##  8  1978 tt00… Days… ok    ok         PASS   3.00e6   3.45e6   3.66e6 1978…
##  9  1978 tt00… Grea… ok    ok         PASS   6.00e6   1.82e8   3.88e8 1978…
## 10  1978 tt00… Hall… dubi… dubious    FAIL   1.50e7   5.83e7   7.75e7 1978…
## # … with 44 more rows, and 7 more variables: budget_2013 <int>,
## #   domgross_2013 <dbl>, intgross_2013 <dbl>, period_code <int>,
## #   decade_code <int>, time_factor <chr>, bechdel_test <dbl>
## 
## $`1980s`
## # A tibble: 125 x 17
##     year imdb  title test  clean_test binary budget domgross intgross code 
##    <int> <chr> <chr> <chr> <ord>      <chr>   <int>    <dbl>    <dbl> <chr>
##  1  1989 tt00… Back… ok    ok         PASS   4.00e7   1.18e8   3.32e8 1989…
##  2  1989 tt00… Batm… nota… notalk     FAIL   3.50e7   2.51e8   4.11e8 1989…
##  3  1989 tt00… Bill… men   men        FAIL   1.00e7   4.05e7   4.05e7 1989…
##  4  1989 tt00… Dead… nota… notalk     FAIL   1.64e7   9.59e7   2.40e8 1989…
##  5  1989 tt00… Henr… ok    ok         PASS   9.00e6   1.02e7   1.02e7 1989…
##  6  1989 tt00… Indi… nowo… nowomen    FAIL   4.80e7   1.97e8   4.74e8 1989…
##  7  1989 tt00… Lice… men   men        FAIL   4.20e7   3.47e7   1.56e8 1989…
##  8  1989 tt00… Majo… nota… notalk     FAIL   1.10e7   4.98e7   4.98e7 1989…
##  9  1989 tt00… Road… nota… notalk     FAIL   1.00e7   3.01e7   3.01e7 1989…
## 10  1989 tt00… Sex,… men-… men        FAIL   1.20e6   2.47e7   3.67e7 1989…
## # … with 115 more rows, and 7 more variables: budget_2013 <int>,
## #   domgross_2013 <dbl>, intgross_2013 <dbl>, period_code <int>,
## #   decade_code <int>, time_factor <chr>, bechdel_test <dbl>
## 
## $`1990s`
## # A tibble: 337 x 17
##     year imdb  title test  clean_test binary budget domgross intgross code 
##    <int> <chr> <chr> <chr> <ord>      <chr>   <int>    <dbl>    <dbl> <chr>
##  1  1999 tt01… 10 T… ok    ok         PASS   1.30e7   3.82e7   6.04e7 1999…
##  2  1999 tt01… 8MM   nota… notalk     FAIL   4.00e7   3.64e7   9.64e7 1999…
##  3  1999 tt01… Amer… ok    ok         PASS   1.50e7   1.30e8   3.56e8 1999…
##  4  1999 tt01… Amer… men   men        FAIL   1.20e7   1.02e8   2.35e8 1999…
##  5  1999 tt01… Anal… nota… notalk     FAIL   3.00e7   1.07e8   1.77e8 1999…
##  6  1999 tt01… Anna… men   men        FAIL   7.50e7   3.93e7   3.93e7 1999…
##  7  1999 tt01… Anyw… dubi… dubious    FAIL   2.30e7   1.87e7   1.87e7 1999…
##  8  1999 tt01… Aust… nota… notalk     FAIL   3.50e7   2.06e8   3.10e8 1999…
##  9  1999 tt01… Bein… ok    ok         PASS   1.30e7   2.29e7   3.24e7 1999…
## 10  1999 tt01… Blac… ok    ok         PASS   1.00e7   5.24e6   5.24e6 1999…
## # … with 327 more rows, and 7 more variables: budget_2013 <int>,
## #   domgross_2013 <dbl>, intgross_2013 <dbl>, period_code <int>,
## #   decade_code <int>, time_factor <chr>, bechdel_test <dbl>
## 
## $`2000s`
## # A tibble: 840 x 17
##     year imdb  title test  clean_test binary budget domgross intgross code 
##    <int> <chr> <chr> <chr> <ord>      <chr>   <int>    <dbl>    <dbl> <chr>
##  1  2009 tt10… (500… nota… notalk     FAIL   7.50e6   3.24e7   6.08e7 2009…
##  2  2009 tt09… 17 A… ok-d… ok         PASS   4.00e7   6.42e7   1.39e8 2009…
##  3  2009 tt11… 2012  nota… notalk     FAIL   2.00e8   1.66e8   7.88e8 2009…
##  4  2009 tt04… 9     nowo… nowomen    FAIL   3.00e7   3.17e7   4.86e7 2009…
##  5  2009 tt09… A Pe… dubi… dubious    FAIL   1.40e7   1.55e7   2.28e7 2009…
##  6  2009 tt10… A Se… men   men        FAIL   7.00e6   9.23e6   3.04e7 2009…
##  7  2009 tt13… A Si… nota… notalk     FAIL   7.00e6   9.18e6   2.81e7 2009…
##  8  2009 tt11… Adam  men   men        FAIL   3.20e6   2.28e6   2.83e6 2009…
##  9  2009 tt10… Adve… men-… men        FAIL   9.80e6   1.60e7   1.71e7 2009…
## 10  2009 tt11… Agora nota… notalk     FAIL   7.00e7   6.19e5   3.90e7 2009…
## # … with 830 more rows, and 7 more variables: budget_2013 <int>,
## #   domgross_2013 <dbl>, intgross_2013 <dbl>, period_code <int>,
## #   decade_code <int>, time_factor <chr>, bechdel_test <dbl>
## 
## $`2010s`
## # A tibble: 438 x 17
##     year imdb  title test  clean_test binary budget domgross intgross code 
##    <int> <chr> <chr> <chr> <ord>      <chr>   <int>    <dbl>    <dbl> <chr>
##  1  2013 tt17… 21 &… nota… notalk     FAIL   1.30e7 25682380   4.22e7 2013…
##  2  2012 tt13… Dred… ok-d… ok         PASS   4.50e7 13414714   4.09e7 2012…
##  3  2013 tt20… 12 Y… nota… notalk     FAIL   2.00e7 53107035   1.59e8 2013…
##  4  2013 tt12… 2 Gu… nota… notalk     FAIL   6.10e7 75612460   1.32e8 2013…
##  5  2013 tt04… 42    men   men        FAIL   4.00e7 95020213   9.50e7 2013…
##  6  2013 tt13… 47 R… men   men        FAIL   2.25e8 38362475   1.46e8 2013…
##  7  2013 tt16… A Go… nota… notalk     FAIL   9.20e7 67349198   3.04e8 2013…
##  8  2013 tt21… Abou… ok-d… ok         PASS   1.20e7 15323921   8.73e7 2013…
##  9  2013 tt18… Admi… ok    ok         PASS   1.30e7 18007317   1.80e7 2013…
## 10  2013 tt18… Afte… nota… notalk     FAIL   1.30e8 60522097   2.44e8 2013…
## # … with 428 more rows, and 7 more variables: budget_2013 <int>,
## #   domgross_2013 <dbl>, intgross_2013 <dbl>, period_code <int>,
## #   decade_code <int>, time_factor <chr>, bechdel_test <dbl>
Now I fit a logistic regression using the 2013 inflated budget and domestic gross of the movie to predict whether or not it will pass the Bechdel test.
modeled_bech <- by_time %>% 
  #Fit logistic regression
  purrr::map(~ glm(bechdel_test ~ log(budget_2013) + log(domgross_2013), 
            family = 'binomial', data = .x)) %>% 
  #Using broom::tidy to make the output nicer for processsing
  purrr::map(~ broom::tidy(.x)) %>%
  #Grabbing variable names, point estimates, and standard errors
  purrr::map(~ dplyr::select(.x, term, estimate, std.error)) %>%
  #Getting the list indices into the data set
  purrr::imap(~ dplyr::mutate(.x, decade = .y)) %>%
  #Recombining everything into a single data.frame
  dplyr::bind_rows() 

modeled_bech
## # A tibble: 15 x 4
##    term               estimate std.error decade
##    <chr>                 <dbl>     <dbl> <chr> 
##  1 (Intercept)        -3.86       3.84   1970s 
##  2 log(budget_2013)   -0.0423     0.207  1970s 
##  3 log(domgross_2013)  0.302      0.195  1970s 
##  4 (Intercept)         0.471      4.15   1980s 
##  5 log(budget_2013)   -0.299      0.238  1980s 
##  6 log(domgross_2013)  0.307      0.161  1980s 
##  7 (Intercept)        -1.85       1.44   1990s 
##  8 log(budget_2013)    0.230      0.102  1990s 
##  9 log(domgross_2013) -0.105      0.0920 1990s 
## 10 (Intercept)        -3.25       0.947  2000s 
## 11 log(budget_2013)    0.185      0.0729 2000s 
## 12 log(domgross_2013)  0.00687    0.0528 2000s 
## 13 (Intercept)        -3.97       1.26   2010s 
## 14 log(budget_2013)    0.441      0.104  2010s 
## 15 log(domgross_2013) -0.198      0.0810 2010s
Next, I plot the coefficient results using the ggplot2 package.
  modeled_bech %>%
  ggplot(aes(x = decade, y = estimate, group = term)) + 
  #adding error bars for 95% conf. intervals
  geom_errorbar(aes(x = decade, 
                    ymin = estimate - 2 * std.error, 
                    ymax = estimate + 2 * std.error), 
                color = 'grey50', width = 0.2) + 
  geom_point(color = 'blue', size = 2) + 
  #If you want lines, you can uncomment the following line
  #geom_line(lwd = 1.2) + 
  #highlighting the x-axis
  geom_hline(yintercept = 0, alpha = 0.2) + 
  #allowing different scales on each coefficient
  facet_wrap(~term, scales = 'free_y') + 
  #not a fan of the default plotting
  theme_bw() +
  #removing grid, changes facet title colors
  theme(panel.grid = element_blank(), 
        strip.background = element_rect(fill = 'black'), 
        strip.text = element_text(color = 'white')) + 
  xlab('Decade') + ylab('Estimated Coefficient')

Conclusions:

  • Higher probability of failing Bechdel test in high budget movies in recent decades (action movies have higher budget?)

  • Lower probability of failing Bechdel test in high gross movies in 2010s (is the audience paying attention to the presence of women?)

By the end of tonight, I hope that you will be able to do a variant of this analysis. Don’t worry about all of the tidyverse elements if they are new to you.

3 Function Basics

If you use R, you are probably already using many built-in R functions. A function is a block of organized, reusable code that is used to perform a single, related action.

3.1 Why write your own functions?

DRY Principle: don’t repeat yourself

For example, what is the median budget in each decade split by whether the movie passed the Bechdel test.

bech_1970 <- bechdel_mod %>% 
  filter(time_factor == '1970s') %>% 
  group_by(binary) %>% 
  summarise(med_budget = median(budget_2013))

bech_1980 <- bechdel_mod %>% 
  filter(time_factor == '1980s') %>% 
  group_by(binary) %>% 
  summarise(med_budget = median(budget_2013))

bech_1990 <- bechdel_mod %>% 
  filter(time_factor == '1990s') %>% 
  group_by(binary) %>% 
  summarise(med_budget = median(budget_2013))

bech_1970
## # A tibble: 2 x 2
##   binary med_budget
##   <chr>       <dbl>
## 1 FAIL     20666099
## 2 PASS     22241557
bech_1980
## # A tibble: 2 x 2
##   binary med_budget
##   <chr>       <dbl>
## 1 FAIL    30755961 
## 2 PASS    33906418.
bech_1990
## # A tibble: 2 x 2
##   binary med_budget
##   <chr>       <dbl>
## 1 FAIL    51268214.
## 2 PASS    37157440
Aside - there is a smarter way of getting this all in one data set if that is my goal. Anyone have thoughts on how I would do that?
bechdel_mod %>% 
  group_by(time_factor, binary) %>% 
  summarise(med_budget = median(budget_2013))
## # A tibble: 10 x 3
## # Groups:   time_factor [5]
##    time_factor binary med_budget
##    <chr>       <chr>       <dbl>
##  1 1970s       FAIL    20666099 
##  2 1970s       PASS    22241557 
##  3 1980s       FAIL    30755961 
##  4 1980s       PASS    33906418.
##  5 1990s       FAIL    51268214.
##  6 1990s       PASS    37157440 
##  7 2000s       FAIL    46231854 
##  8 2000s       PASS    31018861 
##  9 2010s       FAIL    46606085 
## 10 2010s       PASS    26927960

Back to the example at hand. A loop could do the trick, but is not as customizable for the user as we’d like. Let’s write a function. We can see it will give us identical results as before.

bech_yr <- function(decade){
  
  modified_data <- bechdel_mod %>% 
    filter(time_factor == decade) %>% 
    group_by(binary) %>% 
    summarise(med_budget = median(budget_2013))
  
  return(modified_data)
}
#Check that this yields identical results
all.equal(bech_1990, 
          bech_yr('1990s'))
## [1] TRUE

3.2 Scoping

How did my previous function find the data? I didn’t tell it where to look. Also, I used a bunch of other dplyr functions (verbs) in my custom functions.

R uses lexical scoping. This means that names defined inside a function mask names defined outside a function. If a name isn’t defined inside a function, R looks one level up. The same rules apply if a function is defined inside another function.

  • First, R looks inside the current function.

  • Then, it looks where that function was defined (and so on, all the way up to the global environment).

  • Finally, it looks in other loaded packages.

Sometimes, it can be smart to add explicit arguments to your function rather than relying on users to have everything loaded correctly around it. It can also be helpful to add default argument values to help your users.

In this case I defaulted bech_data to have the data set I anticipate in the environment.

bech_yr2 <- function(decade, bech_data = bechdel_mod){
  return(bech_data %>% 
    filter(time_factor == decade) %>% 
    group_by(binary) %>% 
    summarise(med_budget = median(budget_2013)))
}

3.3 Syntactical flexibility

There are several ways to define and use the same function; be consistent to enhance readability!

#This is super dangerous... no {}
bech_yr3 <- function(decade, bech_data = bechdel_mod) 
  bech_data %>% 
    filter(time_factor == decade) %>% 
    group_by(binary) %>% 
    summarise(med_budget = median(budget_2013))

#Using argument names
bech_yr3(decade = '1990s', bech_data = bechdel_mod)
## # A tibble: 2 x 2
##   binary med_budget
##   <chr>       <dbl>
## 1 FAIL    51268214.
## 2 PASS    37157440
#No argument names
bech_yr3('1990s', bechdel_mod)
## # A tibble: 2 x 2
##   binary med_budget
##   <chr>       <dbl>
## 1 FAIL    51268214.
## 2 PASS    37157440
#Using the default - can get dicey
bech_yr3('1990s')
## # A tibble: 2 x 2
##   binary med_budget
##   <chr>       <dbl>
## 1 FAIL    51268214.
## 2 PASS    37157440

Some other general readability advice from my experience:

  • Use {} - you will see people skipping this all over the internet! This makes your short functions consistent with your longer functions where you need braces.

  • Use return - by default, the item in last line of a function is returned. However, it is helpful make explicit what is returned and mirrors other programming languages. You can only return one thing. Use a list() object if you need to return multiple items in one.

3.4 Function components

Functions can be broken down into three components: arguments, body, and environment.
my_first_function <- function(x, y) {
  return(x + y)
}

formals(my_first_function) #a.k.a arguments
## $x
## 
## 
## $y
body(my_first_function)
## {
##     return(x + y)
## }
environment(my_first_function)
## <environment: R_GlobalEnv>
environment(mutate)
## <environment: namespace:dplyr>

You can also view a function by typing the name after it has been defined.

my_first_function
## function(x, y) {
##   return(x + y)
## }

Functions can be flexible in ways you might expect (or not expect). This function can take a scalar or vector of numeric inputs and apply them in (hopefully) intuitive ways. It is important to check a range of input types to understand behavior.

my_first_function(1, 2)
## [1] 3
my_first_function(c(1,1), 2)
## [1] 3 3
my_first_function(c(1,1), c(2,3))
## [1] 3 4

4 Functionals

The functions I’ve covered so far are part of the upper left corner of this 2x2 table. However, things can get a lot more exotic. For today, I’m going to cover functionals. As seen in the table, a functional is a function that takes a function as an input and returns a vector as output.

Example functional: note the ellipsis (…) lets you leave room for other arguments without specifying what they might be.
set.seed(1000) #for reproducibility

#fn is a function, not a vector!
random_fn <- function(fn, ...) {
  #Taking a function fn of 1000 random uniform(0,1) values
  return(fn(runif(1e3), ...))
}

random_fn(mean)
## [1] 0.4998541
random_fn(median)
## [1] 0.4909904
#Example including extra argument
random_fn(quantile, probs = c(0.25, 0.5, 0.75))
##       25%       50%       75% 
## 0.2523493 0.4903264 0.7404705

4.1 purrr

purrr is part of R’s functional programming toolkit. It can help reduce the use of for loops and enhance code transparency. There is not always a functional to replace a for loop! Don’t go crazy trying to find one if it doesn’t make sense :) Additionally, the purrr functions are specialised to work with one-dimensional vectors. base::apply() is specialised to work with two-dimensional and higher vectors, i.e. matrices and arrays.

Take a look at the cheatsheet.

4.2 purrr map family functions

The map functions transform their input by applying a function to each element of a vector or list and returning a list or vector the same length as the input.

  • map(), map_if(), and map_at() always return a list. The _if and _at variants take a predicate function .p that determines which elements of .x are transformed with .f.

  • map_lgl() - logical, map_int() - integer, map_dbl() - double, and map_chr() - character return vectors of the corresponding type.

  • map_dfr() and map_dfc() return data frames created by row-binding and column-binding respectively. They require dplyr.

4.2.1 Translating a for loop to a map

#Creating a named vector to store my calculations
number_of_columns <- dim(bechdel)[2]
len <- vector(length = number_of_columns)
names(len) <- names(bechdel)

#Calculating the number of unique values in each column of the data.frame
for(col in 1:number_of_columns){
  len[col] <- length(unique((bechdel[[col]])))
}

len
##          year          imdb         title          test    clean_test 
##            44          1794          1768            10             5 
##        binary        budget      domgross      intgross          code 
##             2           272          1751          1757            85 
##   budget_2013 domgross_2013 intgross_2013   period_code   decade_code 
##          1188          1776          1783             6             4

However, we can use purrr::map() for a neater solution.

f <- function(x){
  return(length(unique(x)))
}
#I want output as a numeric vector
map_dbl(bechdel, f)
##          year          imdb         title          test    clean_test 
##            44          1794          1768            10             5 
##        binary        budget      domgross      intgross          code 
##             2           272          1751          1757            85 
##   budget_2013 domgross_2013 intgross_2013   period_code   decade_code 
##          1188          1776          1783             6             4

Note that the function maps column by column just like the for loop. Columns in data.frames look like rows in this graphic.

4.2.2 Anonymous functions

Instead of using map() with an existing function (we created a function f above), you can create an inline anonymous (not saved elsewhere) functions.

purrr supports a special shortcut for this kind of function:

map_dbl(bechdel, ~ length(unique(.x)))
##          year          imdb         title          test    clean_test 
##            44          1794          1768            10             5 
##        binary        budget      domgross      intgross          code 
##             2           272          1751          1757            85 
##   budget_2013 domgross_2013 intgross_2013   period_code   decade_code 
##          1188          1776          1783             6             4

You can use the pipe (%>%) in dplyr style syntax.

bechdel %>%
  map_dbl( ~ length(unique(.x))) 
##          year          imdb         title          test    clean_test 
##            44          1794          1768            10             5 
##        binary        budget      domgross      intgross          code 
##             2           272          1751          1757            85 
##   budget_2013 domgross_2013 intgross_2013   period_code   decade_code 
##          1188          1776          1783             6             4

Let’s use some of the other map functions to see how the output format changes.

bechdel %>%
  map_chr( ~ length(unique(.x)))
##          year          imdb         title          test    clean_test 
##          "44"        "1794"        "1768"          "10"           "5" 
##        binary        budget      domgross      intgross          code 
##           "2"         "272"        "1751"        "1757"          "85" 
##   budget_2013 domgross_2013 intgross_2013   period_code   decade_code 
##        "1188"        "1776"        "1783"           "6"           "4"

We could also bind the results into a data.frame by rows:

bechdel %>%
  map_dfr( ~ length(unique(.x)))
## # A tibble: 1 x 15
##    year  imdb title  test clean_test binary budget domgross intgross  code
##   <int> <int> <int> <int>      <int>  <int>  <int>    <int>    <int> <int>
## 1    44  1794  1768    10          5      2    272     1751     1757    85
## # … with 5 more variables: budget_2013 <int>, domgross_2013 <int>,
## #   intgross_2013 <int>, period_code <int>, decade_code <int>

4.2.3 Other map functions

  • Output same type as input with modify().
  • Iterate over two inputs with map2().
  • Iterate with an index using imap().
  • Return nothing with walk().
  • Iterate over any number of inputs with pmap().

I’ll cover map2() and imap() briefly.

Regular map can only apply a function to one input. Other inputs are applied to all iterations.

map2() let’s your function have two inputs - x and y

For example, let’s say we want to take a weighted mean:
#List of 8 vectors filled with 10 random uniform(0,1) values
#Cool trick for generating random vectors!!!
xs <- map(1:8, ~runif(10))
#Giving weights to the values in xs based on their relative size
ws <- map(xs, ~ .x/sum(.x))
#Calculating a weighted mean
map2_dbl(xs, ws, weighted.mean)
## [1] 0.3568452 0.6299610 0.4993194 0.6774614 0.5278370 0.6718677 0.7677067
## [8] 0.5714697

imap() If y is an index or names of x, you can use imap instead of map2. imap(x, f) = map2(x, y = names(x), f)

Note you still get all of the output variants (dbl, chr, etc.)

names(xs) <- paste0('rladies', 1:8)
#Printing out the first value and labeling the index
imap_chr(xs, ~ paste0("The first value of ", .y, " is ", .x[[1]]))
##                                             rladies1 
##   "The first value of rladies1 is 0.556265470804647" 
##                                             rladies2 
##   "The first value of rladies2 is 0.477754818974063" 
##                                             rladies3 
##   "The first value of rladies3 is 0.124986263224855" 
##                                             rladies4 
##   "The first value of rladies4 is 0.926147045800462" 
##                                             rladies5 
## "The first value of rladies5 is 0.00617365562357008" 
##                                             rladies6 
##   "The first value of rladies6 is 0.677013292210177" 
##                                             rladies7 
##  "The first value of rladies7 is 0.0583309563808143" 
##                                             rladies8 
##   "The first value of rladies8 is 0.494969137944281"

4.3 Challenge problem

Going back to the very first example, say I want to change around my logic. My question is now how the pass/fail of the Bechdel test and the budget of the movie will affect the domestic gross.

Hints: use lm instead of glm; use log(domgross_2013) as dependent variable
modeled_bech_lm <- by_time %>% 
  #Fit logistic regression
  purrr::map(~ lm(log(domgross_2013) ~ binary + log(budget_2013), data = .x)) %>% 
  #Using broom::tidy to make the output nicer for processsing
  purrr::map(~ broom::tidy(.x)) %>%
  #Grabbing variable names and point estimates
  purrr::map(~ dplyr::select(.x, term, estimate, std.error)) %>%
  #Getting the list indices into the data set
  purrr::imap(~ dplyr::mutate(.x, decade = .y)) %>%
  #Recombining everything into a data.frame
  dplyr::bind_rows() 

modeled_bech_lm
## # A tibble: 15 x 4
##    term             estimate std.error decade
##    <chr>               <dbl>     <dbl> <chr> 
##  1 (Intercept)       12.3       2.32   1970s 
##  2 binaryPASS        -0.848     0.505  1970s 
##  3 log(budget_2013)   0.393     0.137  1970s 
##  4 (Intercept)        9.64      2.14   1980s 
##  5 binaryPASS        -0.513     0.253  1980s 
##  6 log(budget_2013)   0.507     0.125  1980s 
##  7 (Intercept)        5.67      0.802  1990s 
##  8 binaryPASS         0.155     0.136  1990s 
##  9 log(budget_2013)   0.693     0.0457 1990s 
## 10 (Intercept)        1.73      0.616  2000s 
## 11 binaryPASS        -0.0138    0.0936 2000s 
## 12 log(budget_2013)   0.904     0.0352 2000s 
## 13 (Intercept)        1.60      0.754  2010s 
## 14 binaryPASS         0.310     0.124  2010s 
## 15 log(budget_2013)   0.902     0.0429 2010s
Now plot the results using the ggplot package.
  modeled_bech_lm %>%
  ggplot(aes(x = decade, y = estimate, group = term)) + 
  #adding error bars for 95% conf. intervals
  geom_errorbar(aes(x = decade, 
                    ymin = estimate - 2 * std.error, 
                    ymax = estimate + 2 * std.error), 
                color = 'grey50', width = 0.2) + 
  geom_point(color = 'blue', size = 2) + 
  #Uncomment the following line if you want a connecting line
  #geom_line(lwd = 1.2) + 
  #highlighting the x-axis
  geom_hline(yintercept = 0, alpha = 0.2) + 
  #allowing different scales on each coefficient
  facet_wrap(~term, scales = 'free_y') + 
  #not a fan of the default plotting
  theme_bw() +
  #removing grid, changes facet title colors
  theme(panel.grid = element_blank(), 
        strip.background = element_rect(fill = 'black'), 
        strip.text = element_text(color = 'white')) + 
  xlab('Decade') + ylab('Estimated Coefficient')

4.4 More stuff in purrr

  • Reduce family functions
    • reduce() takes a vector of length n and produces a vector of length 1 by calling a function with a pair of values at a time: reduce(1:4, f) is equivalent to f(f(f(1, 2), 3), 4).
    • accumulate() is useful for understanding how reduce works, because instead of returning just the final result, it returns all the intermediate results as well.
  • Predicate functions
    • Returns a single TRUE or FALSE.
    • some(.x, .p) returns TRUE if any element matches; every(.x, .p) returns TRUE if all elements match.
    • detect(.x, .p) returns the value of the first match; detect_index(.x, .p) returns the location of the first match.
    • keep(.x, .p) keeps all matching elements; discard(.x, .p) drops all matching elements.

Lindsey Dietz

Dec. 10, 2019