Introduction

This vignette demonstrates how to use core tidyverse packages (readr, dplyr, tidyr, and ggplot2) to explore and visualize a Netflix titles dataset. The goal is to clean the data, answer three questions about it, and plot the results.

Load libraries

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
netflix <- read_csv("netflix_titles.csv")
## Rows: 8807 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): show_id, type, title, director, cast, country, date_added, rating,...
## dbl  (1): release_year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(netflix)
## Rows: 8,807
## Columns: 12
## $ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
## $ type         <chr> "Movie", "TV Show", "TV Show", "TV Show", "TV Show", "TV …
## $ title        <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", "Ja…
## $ director     <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mike F…
## $ cast         <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Mola…
## $ country      <chr> "United States", "South Africa", NA, NA, "India", NA, NA,…
## $ date_added   <chr> "September 25, 2021", "September 24, 2021", "September 24…
## $ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, 202…
## $ rating       <chr> "PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "PG…
## $ duration     <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seasons…
## $ listed_in    <chr> "Documentaries", "International TV Shows, TV Dramas, TV M…
## $ description  <chr> "As her father nears the end of his life, filmmaker Kirst…
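
The import message above suggests specifying column types explicitly. The following is a minimal sketch of that idea on a toy, hypothetical CSV rather than the real file. It assumes every date_added value follows the "Month DD, YYYY" pattern seen in the glimpse() output; values that do not match would come back as NA with a parsing warning.

```r
library(readr)

# Hypothetical stand-in for netflix_titles.csv with three of its columns.
# I() tells readr to treat the string as literal data, not a file path.
toy_csv <- I('show_id,date_added,release_year
s1,"September 25, 2021",2020
s2,"September 24, 2021",2021
')

toy <- read_csv(
  toy_csv,
  col_types = cols(
    show_id      = col_character(),
    date_added   = col_date(format = "%B %d, %Y"),  # e.g. "September 25, 2021"
    release_year = col_integer()
  )
)

toy
```

Supplying col_types also silences the column specification message, because readr no longer has to guess the types.
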
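# Drop rows with missing key fields, then split the comma-separated
# listed_in column so that each row holds one title-genre pair.
cleaned_netflix <- netflix %>%
  filter(!is.na(release_year), !is.na(type), !is.na(country)) %>%
  separate_rows(listed_in, sep = ", ") %>%
  rename(genre = listed_in)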

cleaned_netflix %>%
  slice_head(n = 10)
## # A tibble: 10 × 12
##    show_id type    title   director cast  country date_added release_year rating
##    <chr>   <chr>   <chr>   <chr>    <chr> <chr>   <chr>             <dbl> <chr> 
##  1 s1      Movie   Dick J… Kirsten… <NA>  United… September…         2020 PG-13 
##  2 s2      TV Show Blood … <NA>     Ama … South … September…         2021 TV-MA 
##  3 s2      TV Show Blood … <NA>     Ama … South … September…         2021 TV-MA 
##  4 s2      TV Show Blood … <NA>     Ama … South … September…         2021 TV-MA 
##  5 s5      TV Show Kota F… <NA>     Mayu… India   September…         2021 TV-MA 
##  6 s5      TV Show Kota F… <NA>     Mayu… India   September…         2021 TV-MA 
##  7 s5      TV Show Kota F… <NA>     Mayu… India   September…         2021 TV-MA 
##  8 s8      Movie   Sankofa Haile G… Kofi… United… September…         1993 TV-MA 
##  9 s8      Movie   Sankofa Haile G… Kofi… United… September…         1993 TV-MA 
## 10 s8      Movie   Sankofa Haile G… Kofi… United… September…         1993 TV-MA 
## # ℹ 3 more variables: duration <chr>, genre <chr>, description <chr>
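
The repeated show_id values in the output above come from separate_rows(), which turns one row with several comma-separated genres into one row per genre. A small sketch on a toy tibble (the two rows are copied from the glimpse() output; the object names are hypothetical) makes the effect visible:

```r
library(dplyr)
library(tidyr)

# Two titles: one with a single genre, one with three genres.
toy <- tibble(
  show_id   = c("s1", "s2"),
  listed_in = c("Documentaries",
                "International TV Shows, TV Dramas, TV Mysteries")
)

toy_long <- toy %>%
  separate_rows(listed_in, sep = ", ") %>%  # one row per title-genre pair
  rename(genre = listed_in)

toy_long
nrow(toy_long)  # 4 rows: 1 for s1 plus 3 for s2
```

In tidyr 1.3.0 and later, separate_longer_delim(listed_in, delim = ", ") should give the same result here; it treats the delimiter as a fixed string rather than a regular expression.
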

Question 1: How many movies vs TV shows are there?

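# Note: cleaned_netflix has one row per title-genre pair, so n counts
# those pairs rather than unique titles.
q1_counts <- cleaned_netflix %>%
  count(type) %>%
  mutate(percentage = n / sum(n) * 100)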

q1_counts
## # A tibble: 2 × 3
##   type        n percentage
##   <chr>   <int>      <dbl>
## 1 Movie   12332       70.1
## 2 TV Show  5269       29.9
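
The counts above sum to 17,601, which is more than the 8,807 rows in the raw file, because each title appears once per genre after the cleaning step. One way to count unique titles instead, sketched here on a toy tibble with hypothetical object names, is to collapse to one row per show_id before counting:

```r
library(dplyr)

# Toy data shaped like cleaned_netflix: one row per title-genre pair.
toy_long <- tibble(
  show_id = c("s1", "s2", "s2", "s2", "s8", "s8"),
  type    = c("Movie", "TV Show", "TV Show", "TV Show", "Movie", "Movie")
)

title_counts <- toy_long %>%
  distinct(show_id, type) %>%            # keep one row per title
  count(type) %>%
  mutate(percentage = n / sum(n) * 100)

title_counts
```

On the toy data this yields 2 movies and 1 TV show, whereas count(type) on the undeduplicated rows would report 3 and 3.
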

Question 2: What are the top 10 genres?

q2_top_genres <- cleaned_netflix %>%
  count(genre, sort = TRUE) %>%
  slice_head(n = 10)

q2_top_genres
## # A tibble: 10 × 2
##    genre                      n
##    <chr>                  <int>
##  1 International Movies    2543
##  2 Dramas                  2317
##  3 Comedies                1580
##  4 International TV Shows  1128
##  5 Action & Adventure       817
##  6 Documentaries            794
##  7 Independent Movies       745
##  8 TV Dramas                663
##  9 Romantic Movies          588
## 10 Thrillers                549
q2_top_genres %>%
  ggplot(aes(x = reorder(genre, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Netflix Genres",
    x = "Genre",
    y = "Number of Titles"
  )

Question 3: How has the number of titles changed over time?

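# n() counts rows, and cleaned_netflix has one row per title-genre pair,
# so total_titles counts those pairs for each release year.
q3_yearly <- cleaned_netflix %>%
  group_by(release_year) %>%
  summarise(total_titles = n(), .groups = "drop")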

q3_yearly %>%
  slice_tail(n = 10)
## # A tibble: 10 × 2
##    release_year total_titles
##           <dbl>        <int>
##  1         2012          495
##  2         2013          612
##  3         2014          743
##  4         2015         1186
##  5         2016         1814
##  6         2017         2028
##  7         2018         2298
##  8         2019         2016
##  9         2020         1873
## 10         2021          842
q3_yearly %>%
  ggplot(aes(x = release_year, y = total_titles)) +
  geom_line() +
  labs(
    title = "Netflix Releases Over Time",
    x = "Year",
    y = "Number of Titles"
  )
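
The yearly totals above are computed with n() on cleaned_netflix, so a title listed under three genres contributes three to its release year. A possible alternative, shown as a sketch on toy data with hypothetical column names rows and titles, is to count distinct show_id values with n_distinct():

```r
library(dplyr)

# Toy data shaped like cleaned_netflix: one row per title-genre pair.
toy_long <- tibble(
  show_id      = c("s1", "s2", "s2", "s2", "s3"),
  release_year = c(2020, 2021, 2021, 2021, 2021)
)

yearly <- toy_long %>%
  group_by(release_year) %>%
  summarise(
    rows   = n(),                  # title-genre pairs
    titles = n_distinct(show_id),  # unique titles
    .groups = "drop"
  )

yearly
```

For 2021 the toy data gives rows = 4 but titles = 2, which shows how far the two measures can diverge.
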

Conclusion

This vignette used functions from several tidyverse packages:

• readr → read_csv
• dplyr → filter, rename, count, mutate, group_by, summarise, slice_head, slice_tail
• tidyr → separate_rows
• ggplot2 → ggplot, geom_col, geom_line, coord_flip, labs