Tidyverse Create Assignment

dplyr

For this exercise, we are using the dplyr library in the tidyverse package to show a number of different capabailities that are possible via this library.

Selection

Sometimes when you get data, you are only interested in a subset of the features that are available in the dataset. For this example, we are going to start by selecting a subset of the columns that we are interested in working with and then assigning them to a subset vector:

title
type
release_year
age_certification
runtime
imdb_score
imdb_votes
tmdb_popularity

keep_cols <- c('title', 'release_year', 'age_certification', 'runtime', 'imdb_score', 
              'imdb_votes', 'tmdb_popularity')

Using our subset vector, we will use the select function in dplyr to select only these columns from the data frame. We wrap this vector of column names into the all_of() function that we will pass to the select function.

movies_df %>% 
  select(all_of(keep_cols))

## # A tibble: 170 × 7
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 A Charlie Brown…         1965 G                      25        8.3      40328
##  2 It's the Great …         1966 G                      25        8.2      20299
##  3 A Charlie Brown…         1973 G                      25        7.7      11406
##  4 Be My Valentine…         1975 G                      25        7.4       2777
##  5 It's the Easter…         1974 G                      25        7.5       3001
##  6 Fraggle Rock             1983 TV-G                   26        7.9       7939
##  7 Happy New Year,…         1986 G                      28        7.2       1849
##  8 Out of Sight             1998 R                     123        7        95024
##  9 Entrapment               1999 PG-13                 112        6.3     118492
## 10 It's Christmast…         1992 G                      22        6.9       2815
## # ℹ 160 more rows
## # ℹ 1 more variable: tmdb_popularity <dbl>

Now that we’ve selected our data, we can use the arrange function to sort the data by the release year feature, in ascending order.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  arrange(release_year)

## # A tibble: 170 × 7
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 A Charlie Brown…         1965 G                      25        8.3      40328
##  2 It's the Great …         1966 G                      25        8.2      20299
##  3 A Charlie Brown…         1973 G                      25        7.7      11406
##  4 It's the Easter…         1974 G                      25        7.5       3001
##  5 Be My Valentine…         1975 G                      25        7.4       2777
##  6 Fraggle Rock             1983 TV-G                   26        7.9       7939
##  7 Happy New Year,…         1986 G                      28        7.2       1849
##  8 It's Christmast…         1992 G                      22        6.9       2815
##  9 Out of Sight             1998 R                     123        7        95024
## 10 Entrapment               1999 PG-13                 112        6.3     118492
## # ℹ 160 more rows
## # ℹ 1 more variable: tmdb_popularity <dbl>

If instead we wanted to sort the data in descending order, we can encase the variable that we are sorting on (in this case “release_year”) inside of the desc() function.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  arrange(desc(release_year))

## # A tibble: 170 × 7
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 Shrinking                2023 TV-MA                  55        7.9      13799
##  2 Hello Tomorrow!          2023 TV-MA                   0        6.3       1562
##  3 Sharper                  2023 R                     116        6.6      13245
##  4 Dear Edward              2023 TV-MA                   0        7.4       1324
##  5 The Reluctant T…         2023 TV-PG                  35        7.8        473
##  6 Liaison                  2023 TV-MA                  51        6.3        549
##  7 Super League: T…         2023 TV-MA                   0        7.4        414
##  8 Shape Island             2023 TV-G                   24        7.6        121
##  9 Severance                2022 TV-MA                  48        8.7     141086
## 10 Slow Horses              2022 TV-MA                  48        7.9      38504
## # ℹ 160 more rows
## # ℹ 1 more variable: tmdb_popularity <dbl>

If we only wanted to preview the first n rows of the resulting dataset, we can use the slice_head() function to limit the output to just the first few rows in the data frame. We pass the number of rows we want to display into the function by assigning it to the function’s n variable. For example, in the code below, we use n=5 to only show the first five rows of the output data.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  arrange(desc(release_year)) %>%
  slice_head(n=5)

## # A tibble: 5 × 7
##   title             release_year age_certification runtime imdb_score imdb_votes
##   <chr>                    <dbl> <chr>               <dbl>      <dbl>      <dbl>
## 1 Shrinking                 2023 TV-MA                  55        7.9      13799
## 2 Hello Tomorrow!           2023 TV-MA                   0        6.3       1562
## 3 Sharper                   2023 R                     116        6.6      13245
## 4 Dear Edward               2023 TV-MA                   0        7.4       1324
## 5 The Reluctant Tr…         2023 TV-PG                  35        7.8        473
## # ℹ 1 more variable: tmdb_popularity <dbl>

Filtering and Mutating

Filtering is an important feature of dplyr that allows us to select specific rows of data that meet certain criteria.

If we wanted to only show the rows in the dataframe that correspond to movies that were released earlier than the year 2000, we can do so using the library’s filter function as so.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  filter(release_year < 2000)

## # A tibble: 10 × 7
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 A Charlie Brown…         1965 G                      25        8.3      40328
##  2 It's the Great …         1966 G                      25        8.2      20299
##  3 A Charlie Brown…         1973 G                      25        7.7      11406
##  4 Be My Valentine…         1975 G                      25        7.4       2777
##  5 It's the Easter…         1974 G                      25        7.5       3001
##  6 Fraggle Rock             1983 TV-G                   26        7.9       7939
##  7 Happy New Year,…         1986 G                      28        7.2       1849
##  8 Out of Sight             1998 R                     123        7        95024
##  9 Entrapment               1999 PG-13                 112        6.3     118492
## 10 It's Christmast…         1992 G                      22        6.9       2815
## # ℹ 1 more variable: tmdb_popularity <dbl>

In this example, this results in a dataframe showing the 10 films in the library that fit this criteria.

If instead, of filtering and only selecting those movies that fit our criteria, we wanted to label movies as being pre-2000 and post-2000. We can do this by using the library’s mutate function and combining it with a conditional statement. For this example, we will use the ifelse() function to set the criteria that we will use for setting the value of a new variable “release_window” to “pre_2000” if the movie was released prior to the year 2000, and “post_2000” otherwise.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  mutate(release_window = ifelse(release_year < 2000, "pre_2000","post_2000" ))

## # A tibble: 170 × 8
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 A Charlie Brown…         1965 G                      25        8.3      40328
##  2 It's the Great …         1966 G                      25        8.2      20299
##  3 A Charlie Brown…         1973 G                      25        7.7      11406
##  4 Be My Valentine…         1975 G                      25        7.4       2777
##  5 It's the Easter…         1974 G                      25        7.5       3001
##  6 Fraggle Rock             1983 TV-G                   26        7.9       7939
##  7 Happy New Year,…         1986 G                      28        7.2       1849
##  8 Out of Sight             1998 R                     123        7        95024
##  9 Entrapment               1999 PG-13                 112        6.3     118492
## 10 It's Christmast…         1992 G                      22        6.9       2815
## # ℹ 160 more rows
## # ℹ 2 more variables: tmdb_popularity <dbl>, release_window <chr>

Grouping & Summarizing

Grouping a data set by a common category is a very useful feature that allows for the data to be viewed by common features and details. For this data we will begin by grouping the data by age_certification and collecting some general information on the data based on these groupings. We group the data using the group_by() function and passing the name of a single column or multiple columns that we want to use for our grouping.

With our data in groups, we can then combine this with the summarize() function to calculate summary metrics that represent the calculated summary measures that pertain to the set of rows pertaining to newly created groups.

In this example, we group our films by the age_certification category, and calculate a number of metrics for the data.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  group_by(age_certification) %>%
  summarize(num_titles = n_distinct(title),
            total_runtime = sum(runtime),
            avg_runtime = mean(runtime),
            min_runtime = min(runtime),
            max_runtime = max(runtime),
            avg_imdb_score = mean(imdb_score),
            min_imdb_score = min(imdb_score),
            max_imdb_score = max(imdb_score))

## # A tibble: 11 × 9
##    age_certification num_titles total_runtime avg_runtime min_runtime
##    <chr>                  <int>         <dbl>       <dbl>       <dbl>
##  1 G                         16           593        37.1          18
##  2 PG                         4           334        83.5          48
##  3 PG-13                     13          1385       107.           87
##  4 R                         20          2258       113.           91
##  5 TV-14                     14           544        38.9          21
##  6 TV-G                      16           371        23.2           0
##  7 TV-MA                     51          2187        42.9           0
##  8 TV-PG                     11           342        31.1           0
##  9 TV-Y                      12           155        12.9           0
## 10 TV-Y7                      2            12         6             0
## 11 <NA>                      11           507        46.1           9
## # ℹ 4 more variables: max_runtime <dbl>, avg_imdb_score <dbl>,
## #   min_imdb_score <dbl>, max_imdb_score <dbl>

Finally, we can take our new grouped data and sort it by one of the fields in the resulting dataframe. In this case, we will conclude by sorting the data in reverse order based on the total number of titles in the group (num_titles)

movies_df %>% 
  select(all_of(keep_cols)) %>%
  group_by(age_certification) %>%
  summarize(num_titles = n_distinct(title),
            total_runtime = sum(runtime),
            avg_runtime = mean(runtime),
            min_runtime = min(runtime),
            max_runtime = max(runtime),
            avg_imdb_score = mean(imdb_score),
            min_imdb_score = min(imdb_score),
            max_imdb_score = max(imdb_score)) %>%
  arrange(desc(num_titles))

## # A tibble: 11 × 9
##    age_certification num_titles total_runtime avg_runtime min_runtime
##    <chr>                  <int>         <dbl>       <dbl>       <dbl>
##  1 TV-MA                     51          2187        42.9           0
##  2 R                         20          2258       113.           91
##  3 G                         16           593        37.1          18
##  4 TV-G                      16           371        23.2           0
##  5 TV-14                     14           544        38.9          21
##  6 PG-13                     13          1385       107.           87
##  7 TV-Y                      12           155        12.9           0
##  8 TV-PG                     11           342        31.1           0
##  9 <NA>                      11           507        46.1           9
## 10 PG                         4           334        83.5          48
## 11 TV-Y7                      2            12         6             0
## # ℹ 4 more variables: max_runtime <dbl>, avg_imdb_score <dbl>,
## #   min_imdb_score <dbl>, max_imdb_score <dbl>

Tidyverse Create Assignment

Kory Martin

2023-04-08

Introduction

dplyr

Selection

Filtering and Mutating

Grouping & Summarizing

Conclusion