For this project, we are using the Netflix TV Shows and Movies dataset from Kaggle. Per the description provided on the website, “[t]his data set was created to list all shows available on Netflix streaming, and analyze the data to find interesting facts. This data was acquired in July 2022 containing data available in the United States.” This dataset is broken up into a credits and titles datasets. For the purposes of this exercise, we focused on using the titles subfile.
For this exercise, we are using the dplyr library in the tidyverse package to show a number of different capabailities that are possible via this library.
Sometimes when you get data, you are only interested in a subset of the features that are available in the dataset. For this example, we are going to start by selecting a subset of the columns that we are interested in working with and then assigning them to a subset vector:
keep_cols <- c('title', 'release_year', 'age_certification', 'runtime', 'imdb_score',
'imdb_votes', 'tmdb_popularity')
Using our subset vector, we will use the select function in dplyr to select only these columns from the data frame. We wrap this vector of column names into the all_of() function that we will pass to the select function.
movies_df %>%
select(all_of(keep_cols))
## # A tibble: 170 × 7
## title release_year age_certification runtime imdb_score imdb_votes
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A Charlie Brown… 1965 G 25 8.3 40328
## 2 It's the Great … 1966 G 25 8.2 20299
## 3 A Charlie Brown… 1973 G 25 7.7 11406
## 4 Be My Valentine… 1975 G 25 7.4 2777
## 5 It's the Easter… 1974 G 25 7.5 3001
## 6 Fraggle Rock 1983 TV-G 26 7.9 7939
## 7 Happy New Year,… 1986 G 28 7.2 1849
## 8 Out of Sight 1998 R 123 7 95024
## 9 Entrapment 1999 PG-13 112 6.3 118492
## 10 It's Christmast… 1992 G 22 6.9 2815
## # ℹ 160 more rows
## # ℹ 1 more variable: tmdb_popularity <dbl>
Now that we’ve selected our data, we can use the arrange function to sort the data by the release year feature, in ascending order.
movies_df %>%
select(all_of(keep_cols)) %>%
arrange(release_year)
## # A tibble: 170 × 7
## title release_year age_certification runtime imdb_score imdb_votes
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A Charlie Brown… 1965 G 25 8.3 40328
## 2 It's the Great … 1966 G 25 8.2 20299
## 3 A Charlie Brown… 1973 G 25 7.7 11406
## 4 It's the Easter… 1974 G 25 7.5 3001
## 5 Be My Valentine… 1975 G 25 7.4 2777
## 6 Fraggle Rock 1983 TV-G 26 7.9 7939
## 7 Happy New Year,… 1986 G 28 7.2 1849
## 8 It's Christmast… 1992 G 22 6.9 2815
## 9 Out of Sight 1998 R 123 7 95024
## 10 Entrapment 1999 PG-13 112 6.3 118492
## # ℹ 160 more rows
## # ℹ 1 more variable: tmdb_popularity <dbl>
If instead we wanted to sort the data in descending order, we can encase the variable that we are sorting on (in this case “release_year”) inside of the desc() function.
movies_df %>%
select(all_of(keep_cols)) %>%
arrange(desc(release_year))
## # A tibble: 170 × 7
## title release_year age_certification runtime imdb_score imdb_votes
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Shrinking 2023 TV-MA 55 7.9 13799
## 2 Hello Tomorrow! 2023 TV-MA 0 6.3 1562
## 3 Sharper 2023 R 116 6.6 13245
## 4 Dear Edward 2023 TV-MA 0 7.4 1324
## 5 The Reluctant T… 2023 TV-PG 35 7.8 473
## 6 Liaison 2023 TV-MA 51 6.3 549
## 7 Super League: T… 2023 TV-MA 0 7.4 414
## 8 Shape Island 2023 TV-G 24 7.6 121
## 9 Severance 2022 TV-MA 48 8.7 141086
## 10 Slow Horses 2022 TV-MA 48 7.9 38504
## # ℹ 160 more rows
## # ℹ 1 more variable: tmdb_popularity <dbl>
If we only wanted to preview the first n rows of the resulting dataset, we can use the slice_head() function to limit the output to just the first few rows in the data frame. We pass the number of rows we want to display into the function by assigning it to the function’s n variable. For example, in the code below, we use n=5 to only show the first five rows of the output data.
movies_df %>%
select(all_of(keep_cols)) %>%
arrange(desc(release_year)) %>%
slice_head(n=5)
## # A tibble: 5 × 7
## title release_year age_certification runtime imdb_score imdb_votes
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Shrinking 2023 TV-MA 55 7.9 13799
## 2 Hello Tomorrow! 2023 TV-MA 0 6.3 1562
## 3 Sharper 2023 R 116 6.6 13245
## 4 Dear Edward 2023 TV-MA 0 7.4 1324
## 5 The Reluctant Tr… 2023 TV-PG 35 7.8 473
## # ℹ 1 more variable: tmdb_popularity <dbl>
Filtering is an important feature of dplyr that allows us to select specific rows of data that meet certain criteria.
If we wanted to only show the rows in the dataframe that correspond to movies that were released earlier than the year 2000, we can do so using the library’s filter function as so.
movies_df %>%
select(all_of(keep_cols)) %>%
filter(release_year < 2000)
## # A tibble: 10 × 7
## title release_year age_certification runtime imdb_score imdb_votes
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A Charlie Brown… 1965 G 25 8.3 40328
## 2 It's the Great … 1966 G 25 8.2 20299
## 3 A Charlie Brown… 1973 G 25 7.7 11406
## 4 Be My Valentine… 1975 G 25 7.4 2777
## 5 It's the Easter… 1974 G 25 7.5 3001
## 6 Fraggle Rock 1983 TV-G 26 7.9 7939
## 7 Happy New Year,… 1986 G 28 7.2 1849
## 8 Out of Sight 1998 R 123 7 95024
## 9 Entrapment 1999 PG-13 112 6.3 118492
## 10 It's Christmast… 1992 G 22 6.9 2815
## # ℹ 1 more variable: tmdb_popularity <dbl>
In this example, this results in a dataframe showing the 10 films in the library that fit this criteria.
If instead, of filtering and only selecting those movies that fit our criteria, we wanted to label movies as being pre-2000 and post-2000. We can do this by using the library’s mutate function and combining it with a conditional statement. For this example, we will use the ifelse() function to set the criteria that we will use for setting the value of a new variable “release_window” to “pre_2000” if the movie was released prior to the year 2000, and “post_2000” otherwise.
movies_df %>%
select(all_of(keep_cols)) %>%
mutate(release_window = ifelse(release_year < 2000, "pre_2000","post_2000" ))
## # A tibble: 170 × 8
## title release_year age_certification runtime imdb_score imdb_votes
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A Charlie Brown… 1965 G 25 8.3 40328
## 2 It's the Great … 1966 G 25 8.2 20299
## 3 A Charlie Brown… 1973 G 25 7.7 11406
## 4 Be My Valentine… 1975 G 25 7.4 2777
## 5 It's the Easter… 1974 G 25 7.5 3001
## 6 Fraggle Rock 1983 TV-G 26 7.9 7939
## 7 Happy New Year,… 1986 G 28 7.2 1849
## 8 Out of Sight 1998 R 123 7 95024
## 9 Entrapment 1999 PG-13 112 6.3 118492
## 10 It's Christmast… 1992 G 22 6.9 2815
## # ℹ 160 more rows
## # ℹ 2 more variables: tmdb_popularity <dbl>, release_window <chr>
Grouping a data set by a common category is a very useful feature that allows for the data to be viewed by common features and details. For this data we will begin by grouping the data by age_certification and collecting some general information on the data based on these groupings. We group the data using the group_by() function and passing the name of a single column or multiple columns that we want to use for our grouping.
With our data in groups, we can then combine this with the summarize() function to calculate summary metrics that represent the calculated summary measures that pertain to the set of rows pertaining to newly created groups.
In this example, we group our films by the age_certification category, and calculate a number of metrics for the data.
movies_df %>%
select(all_of(keep_cols)) %>%
group_by(age_certification) %>%
summarize(num_titles = n_distinct(title),
total_runtime = sum(runtime),
avg_runtime = mean(runtime),
min_runtime = min(runtime),
max_runtime = max(runtime),
avg_imdb_score = mean(imdb_score),
min_imdb_score = min(imdb_score),
max_imdb_score = max(imdb_score))
## # A tibble: 11 × 9
## age_certification num_titles total_runtime avg_runtime min_runtime
## <chr> <int> <dbl> <dbl> <dbl>
## 1 G 16 593 37.1 18
## 2 PG 4 334 83.5 48
## 3 PG-13 13 1385 107. 87
## 4 R 20 2258 113. 91
## 5 TV-14 14 544 38.9 21
## 6 TV-G 16 371 23.2 0
## 7 TV-MA 51 2187 42.9 0
## 8 TV-PG 11 342 31.1 0
## 9 TV-Y 12 155 12.9 0
## 10 TV-Y7 2 12 6 0
## 11 <NA> 11 507 46.1 9
## # ℹ 4 more variables: max_runtime <dbl>, avg_imdb_score <dbl>,
## # min_imdb_score <dbl>, max_imdb_score <dbl>
Finally, we can take our new grouped data and sort it by one of the fields in the resulting dataframe. In this case, we will conclude by sorting the data in reverse order based on the total number of titles in the group (num_titles)
movies_df %>%
select(all_of(keep_cols)) %>%
group_by(age_certification) %>%
summarize(num_titles = n_distinct(title),
total_runtime = sum(runtime),
avg_runtime = mean(runtime),
min_runtime = min(runtime),
max_runtime = max(runtime),
avg_imdb_score = mean(imdb_score),
min_imdb_score = min(imdb_score),
max_imdb_score = max(imdb_score)) %>%
arrange(desc(num_titles))
## # A tibble: 11 × 9
## age_certification num_titles total_runtime avg_runtime min_runtime
## <chr> <int> <dbl> <dbl> <dbl>
## 1 TV-MA 51 2187 42.9 0
## 2 R 20 2258 113. 91
## 3 G 16 593 37.1 18
## 4 TV-G 16 371 23.2 0
## 5 TV-14 14 544 38.9 21
## 6 PG-13 13 1385 107. 87
## 7 TV-Y 12 155 12.9 0
## 8 TV-PG 11 342 31.1 0
## 9 <NA> 11 507 46.1 9
## 10 PG 4 334 83.5 48
## 11 TV-Y7 2 12 6 0
## # ℹ 4 more variables: max_runtime <dbl>, avg_imdb_score <dbl>,
## # min_imdb_score <dbl>, max_imdb_score <dbl>