Introduction

For this project, we are using the Netflix TV Shows and Movies dataset from Kaggle. Per the description provided on the website, “[t]his data set was created to list all shows available on Netflix streaming, and analyze the data to find interesting facts. This data was acquired in July 2022 containing data available in the United States.” This dataset is broken up into a credits and titles datasets. For the purposes of this exercise, we focused on using the titles subfile.

dplyr

For this exercise, we are using the dplyr library in the tidyverse package to show a number of different capabailities that are possible via this library.

Selection

Sometimes when you get data, you are only interested in a subset of the features that are available in the dataset. For this example, we are going to start by selecting a subset of the columns that we are interested in working with and then assigning them to a subset vector:

  • title
  • type
  • release_year
  • age_certification
  • runtime
  • imdb_score
  • imdb_votes
  • tmdb_popularity
keep_cols <- c('title', 'release_year', 'age_certification', 'runtime', 'imdb_score', 
              'imdb_votes', 'tmdb_popularity')

Using our subset vector, we will use the select function in dplyr to select only these columns from the data frame. We wrap this vector of column names into the all_of() function that we will pass to the select function.

movies_df %>% 
  select(all_of(keep_cols))
## # A tibble: 170 × 7
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 A Charlie Brown…         1965 G                      25        8.3      40328
##  2 It's the Great …         1966 G                      25        8.2      20299
##  3 A Charlie Brown…         1973 G                      25        7.7      11406
##  4 Be My Valentine…         1975 G                      25        7.4       2777
##  5 It's the Easter…         1974 G                      25        7.5       3001
##  6 Fraggle Rock             1983 TV-G                   26        7.9       7939
##  7 Happy New Year,…         1986 G                      28        7.2       1849
##  8 Out of Sight             1998 R                     123        7        95024
##  9 Entrapment               1999 PG-13                 112        6.3     118492
## 10 It's Christmast…         1992 G                      22        6.9       2815
## # ℹ 160 more rows
## # ℹ 1 more variable: tmdb_popularity <dbl>

Now that we’ve selected our data, we can use the arrange function to sort the data by the release year feature, in ascending order.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  arrange(release_year)
## # A tibble: 170 × 7
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 A Charlie Brown…         1965 G                      25        8.3      40328
##  2 It's the Great …         1966 G                      25        8.2      20299
##  3 A Charlie Brown…         1973 G                      25        7.7      11406
##  4 It's the Easter…         1974 G                      25        7.5       3001
##  5 Be My Valentine…         1975 G                      25        7.4       2777
##  6 Fraggle Rock             1983 TV-G                   26        7.9       7939
##  7 Happy New Year,…         1986 G                      28        7.2       1849
##  8 It's Christmast…         1992 G                      22        6.9       2815
##  9 Out of Sight             1998 R                     123        7        95024
## 10 Entrapment               1999 PG-13                 112        6.3     118492
## # ℹ 160 more rows
## # ℹ 1 more variable: tmdb_popularity <dbl>

If instead we wanted to sort the data in descending order, we can encase the variable that we are sorting on (in this case “release_year”) inside of the desc() function.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  arrange(desc(release_year))
## # A tibble: 170 × 7
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 Shrinking                2023 TV-MA                  55        7.9      13799
##  2 Hello Tomorrow!          2023 TV-MA                   0        6.3       1562
##  3 Sharper                  2023 R                     116        6.6      13245
##  4 Dear Edward              2023 TV-MA                   0        7.4       1324
##  5 The Reluctant T…         2023 TV-PG                  35        7.8        473
##  6 Liaison                  2023 TV-MA                  51        6.3        549
##  7 Super League: T…         2023 TV-MA                   0        7.4        414
##  8 Shape Island             2023 TV-G                   24        7.6        121
##  9 Severance                2022 TV-MA                  48        8.7     141086
## 10 Slow Horses              2022 TV-MA                  48        7.9      38504
## # ℹ 160 more rows
## # ℹ 1 more variable: tmdb_popularity <dbl>

If we only wanted to preview the first n rows of the resulting dataset, we can use the slice_head() function to limit the output to just the first few rows in the data frame. We pass the number of rows we want to display into the function by assigning it to the function’s n variable. For example, in the code below, we use n=5 to only show the first five rows of the output data.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  arrange(desc(release_year)) %>%
  slice_head(n=5)
## # A tibble: 5 × 7
##   title             release_year age_certification runtime imdb_score imdb_votes
##   <chr>                    <dbl> <chr>               <dbl>      <dbl>      <dbl>
## 1 Shrinking                 2023 TV-MA                  55        7.9      13799
## 2 Hello Tomorrow!           2023 TV-MA                   0        6.3       1562
## 3 Sharper                   2023 R                     116        6.6      13245
## 4 Dear Edward               2023 TV-MA                   0        7.4       1324
## 5 The Reluctant Tr…         2023 TV-PG                  35        7.8        473
## # ℹ 1 more variable: tmdb_popularity <dbl>

Filtering and Mutating

Filtering is an important feature of dplyr that allows us to select specific rows of data that meet certain criteria.

If we wanted to only show the rows in the dataframe that correspond to movies that were released earlier than the year 2000, we can do so using the library’s filter function as so.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  filter(release_year < 2000)
## # A tibble: 10 × 7
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 A Charlie Brown…         1965 G                      25        8.3      40328
##  2 It's the Great …         1966 G                      25        8.2      20299
##  3 A Charlie Brown…         1973 G                      25        7.7      11406
##  4 Be My Valentine…         1975 G                      25        7.4       2777
##  5 It's the Easter…         1974 G                      25        7.5       3001
##  6 Fraggle Rock             1983 TV-G                   26        7.9       7939
##  7 Happy New Year,…         1986 G                      28        7.2       1849
##  8 Out of Sight             1998 R                     123        7        95024
##  9 Entrapment               1999 PG-13                 112        6.3     118492
## 10 It's Christmast…         1992 G                      22        6.9       2815
## # ℹ 1 more variable: tmdb_popularity <dbl>

In this example, this results in a dataframe showing the 10 films in the library that fit this criteria.

If instead, of filtering and only selecting those movies that fit our criteria, we wanted to label movies as being pre-2000 and post-2000. We can do this by using the library’s mutate function and combining it with a conditional statement. For this example, we will use the ifelse() function to set the criteria that we will use for setting the value of a new variable “release_window” to “pre_2000” if the movie was released prior to the year 2000, and “post_2000” otherwise.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  mutate(release_window = ifelse(release_year < 2000, "pre_2000","post_2000" ))
## # A tibble: 170 × 8
##    title            release_year age_certification runtime imdb_score imdb_votes
##    <chr>                   <dbl> <chr>               <dbl>      <dbl>      <dbl>
##  1 A Charlie Brown…         1965 G                      25        8.3      40328
##  2 It's the Great …         1966 G                      25        8.2      20299
##  3 A Charlie Brown…         1973 G                      25        7.7      11406
##  4 Be My Valentine…         1975 G                      25        7.4       2777
##  5 It's the Easter…         1974 G                      25        7.5       3001
##  6 Fraggle Rock             1983 TV-G                   26        7.9       7939
##  7 Happy New Year,…         1986 G                      28        7.2       1849
##  8 Out of Sight             1998 R                     123        7        95024
##  9 Entrapment               1999 PG-13                 112        6.3     118492
## 10 It's Christmast…         1992 G                      22        6.9       2815
## # ℹ 160 more rows
## # ℹ 2 more variables: tmdb_popularity <dbl>, release_window <chr>

Grouping & Summarizing

Grouping a data set by a common category is a very useful feature that allows for the data to be viewed by common features and details. For this data we will begin by grouping the data by age_certification and collecting some general information on the data based on these groupings. We group the data using the group_by() function and passing the name of a single column or multiple columns that we want to use for our grouping.

With our data in groups, we can then combine this with the summarize() function to calculate summary metrics that represent the calculated summary measures that pertain to the set of rows pertaining to newly created groups.

In this example, we group our films by the age_certification category, and calculate a number of metrics for the data.

movies_df %>% 
  select(all_of(keep_cols)) %>%
  group_by(age_certification) %>%
  summarize(num_titles = n_distinct(title),
            total_runtime = sum(runtime),
            avg_runtime = mean(runtime),
            min_runtime = min(runtime),
            max_runtime = max(runtime),
            avg_imdb_score = mean(imdb_score),
            min_imdb_score = min(imdb_score),
            max_imdb_score = max(imdb_score))
## # A tibble: 11 × 9
##    age_certification num_titles total_runtime avg_runtime min_runtime
##    <chr>                  <int>         <dbl>       <dbl>       <dbl>
##  1 G                         16           593        37.1          18
##  2 PG                         4           334        83.5          48
##  3 PG-13                     13          1385       107.           87
##  4 R                         20          2258       113.           91
##  5 TV-14                     14           544        38.9          21
##  6 TV-G                      16           371        23.2           0
##  7 TV-MA                     51          2187        42.9           0
##  8 TV-PG                     11           342        31.1           0
##  9 TV-Y                      12           155        12.9           0
## 10 TV-Y7                      2            12         6             0
## 11 <NA>                      11           507        46.1           9
## # ℹ 4 more variables: max_runtime <dbl>, avg_imdb_score <dbl>,
## #   min_imdb_score <dbl>, max_imdb_score <dbl>

Finally, we can take our new grouped data and sort it by one of the fields in the resulting dataframe. In this case, we will conclude by sorting the data in reverse order based on the total number of titles in the group (num_titles)

movies_df %>% 
  select(all_of(keep_cols)) %>%
  group_by(age_certification) %>%
  summarize(num_titles = n_distinct(title),
            total_runtime = sum(runtime),
            avg_runtime = mean(runtime),
            min_runtime = min(runtime),
            max_runtime = max(runtime),
            avg_imdb_score = mean(imdb_score),
            min_imdb_score = min(imdb_score),
            max_imdb_score = max(imdb_score)) %>%
  arrange(desc(num_titles))
## # A tibble: 11 × 9
##    age_certification num_titles total_runtime avg_runtime min_runtime
##    <chr>                  <int>         <dbl>       <dbl>       <dbl>
##  1 TV-MA                     51          2187        42.9           0
##  2 R                         20          2258       113.           91
##  3 G                         16           593        37.1          18
##  4 TV-G                      16           371        23.2           0
##  5 TV-14                     14           544        38.9          21
##  6 PG-13                     13          1385       107.           87
##  7 TV-Y                      12           155        12.9           0
##  8 TV-PG                     11           342        31.1           0
##  9 <NA>                      11           507        46.1           9
## 10 PG                         4           334        83.5          48
## 11 TV-Y7                      2            12         6             0
## # ℹ 4 more variables: max_runtime <dbl>, avg_imdb_score <dbl>,
## #   min_imdb_score <dbl>, max_imdb_score <dbl>

Conclusion

dplyr is a very useful library in the tidyverse package that provides highly useful functions that make working with dataframes easier and very effective. You can access the documentation for the library here and download a “cheat sheet” describing the key funcions for the library, here