This vignette demonstrates how to use core tidyverse packages such as
dplyr, tidyr, ggplot2, and readr to explore and visualize a Netflix
titles dataset. The goal is to clean the data, answer three questions
(how the catalog splits between movies and TV shows, which genres are
most common, and how release counts change by year), and visualize the
results.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
netflix <- read_csv("netflix_titles.csv")
## Rows: 8807 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): show_id, type, title, director, cast, country, date_added, rating,...
## dbl (1): release_year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(netflix)
## Rows: 8,807
## Columns: 12
## $ show_id <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
## $ type <chr> "Movie", "TV Show", "TV Show", "TV Show", "TV Show", "TV …
## $ title <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", "Ja…
## $ director <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mike F…
## $ cast <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Mola…
## $ country <chr> "United States", "South Africa", NA, NA, "India", NA, NA,…
## $ date_added <chr> "September 25, 2021", "September 24, 2021", "September 24…
## $ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, 202…
## $ rating <chr> "PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "PG…
## $ duration <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seasons…
## $ listed_in <chr> "Documentaries", "International TV Shows, TV Dramas, TV M…
## $ description <chr> "As her father nears the end of his life, filmmaker Kirst…
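The `duration` column shown above mixes minutes (movies) and season counts (TV shows) in one character field. A minimal sketch of how stringr could split it into a number and a unit follows; the toy tibble and the column names `duration_value` and `duration_unit` are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical sample using the duration strings visible in glimpse()
toy <- tibble(
  type     = c("Movie", "TV Show", "TV Show"),
  duration = c("90 min", "2 Seasons", "1 Season")
)

toy_parsed <- toy %>%
  mutate(
    # Pull out the leading run of digits and convert it to an integer
    duration_value = as.integer(str_extract(duration, "\\d+")),
    # Label the unit based on whether the text mentions a season
    duration_unit  = if_else(str_detect(duration, "Season"), "season", "minute")
  )

toy_parsed
```

Keeping the unit in its own column avoids comparing minutes with seasons by accident in later summaries.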
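# Drop rows missing key fields, then split the comma-separated genre list
# so that each row holds one title-genre pair
cleaned_netflix <- netflix %>%
  filter(!is.na(release_year), !is.na(type), !is.na(country)) %>%
  separate_rows(listed_in, sep = ", ") %>%
  rename(genre = listed_in)
# Preview the first rows; titles with several genres now repeat
cleaned_netflix %>%
  slice_head(n = 10)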
## # A tibble: 10 × 12
## show_id type title director cast country date_added release_year rating
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 s1 Movie Dick J… Kirsten… <NA> United… September… 2020 PG-13
## 2 s2 TV Show Blood … <NA> Ama … South … September… 2021 TV-MA
## 3 s2 TV Show Blood … <NA> Ama … South … September… 2021 TV-MA
## 4 s2 TV Show Blood … <NA> Ama … South … September… 2021 TV-MA
## 5 s5 TV Show Kota F… <NA> Mayu… India September… 2021 TV-MA
## 6 s5 TV Show Kota F… <NA> Mayu… India September… 2021 TV-MA
## 7 s5 TV Show Kota F… <NA> Mayu… India September… 2021 TV-MA
## 8 s8 Movie Sankofa Haile G… Kofi… United… September… 1993 TV-MA
## 9 s8 Movie Sankofa Haile G… Kofi… United… September… 1993 TV-MA
## 10 s8 Movie Sankofa Haile G… Kofi… United… September… 1993 TV-MA
## # ℹ 3 more variables: duration <chr>, genre <chr>, description <chr>
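The repeated `show_id` values in the preview come from `separate_rows()`, which emits one row per genre listed for a title. A small sketch on a hypothetical two-title tibble (the `toy` data is invented for illustration) shows the effect on row counts:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical two-title sample mirroring the shape of the Netflix data
toy <- tribble(
  ~show_id, ~type,     ~listed_in,
  "s1",     "Movie",   "Documentaries",
  "s2",     "TV Show", "International TV Shows, TV Dramas, TV Mysteries"
)

toy_long <- toy %>%
  separate_rows(listed_in, sep = ", ") %>%
  rename(genre = listed_in)

nrow(toy_long)                # 4 rows: one for s1, three for s2
n_distinct(toy_long$show_id)  # still 2 distinct titles
```

Any summary computed on the long table therefore counts title-genre pairs unless the data is first reduced back to one row per title.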
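# Share of movies vs. TV shows.
# Note: cleaned_netflix has one row per title-genre pair, so n counts
# rows, not unique titles
q1_counts <- cleaned_netflix %>%
  count(type) %>%
  mutate(percentage = n / sum(n) * 100)
q1_counts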
## # A tibble: 2 × 3
## type n percentage
## <chr> <int> <dbl>
## 1 Movie 12332 70.1
## 2 TV Show 5269 29.9
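These totals sum to 17,601, roughly twice the 8,807 rows read from the file, because each title appears once per genre. If the intent is to count titles, one possible approach is to collapse to one row per `show_id` before counting. The sketch below uses a hypothetical long-format tibble rather than the real data, so no corrected figures are claimed here:

```r
library(dplyr)
library(tibble)

# Hypothetical long-format sample: one row per title-genre pair
toy_long <- tribble(
  ~show_id, ~type,     ~genre,
  "s1",     "Movie",   "Documentaries",
  "s2",     "TV Show", "International TV Shows",
  "s2",     "TV Show", "TV Dramas",
  "s2",     "TV Show", "TV Mysteries"
)

# Row-based count: the single TV show is counted three times
by_row <- toy_long %>%
  count(type)

# Title-based count: keep one row per title before counting
by_title <- toy_long %>%
  distinct(show_id, type) %>%
  count(type) %>%
  mutate(percentage = n / sum(n) * 100)

by_row
by_title
```

On the toy data the row-based split is 1 to 3, while the title-based split is 1 to 1, which illustrates how multi-genre titles can skew the percentages.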
q2_top_genres <- cleaned_netflix %>%
  count(genre, sort = TRUE) %>%
  slice_head(n = 10)
q2_top_genres
## # A tibble: 10 × 2
## genre n
## <chr> <int>
## 1 International Movies 2543
## 2 Dramas 2317
## 3 Comedies 1580
## 4 International TV Shows 1128
## 5 Action & Adventure 817
## 6 Documentaries 794
## 7 Independent Movies 745
## 8 TV Dramas 663
## 9 Romantic Movies 588
## 10 Thrillers 549
q2_top_genres %>%
  ggplot(aes(x = reorder(genre, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Netflix Genres",
    x = "Genre",
    y = "Number of Titles"
  )
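# Rows per release year (each row is a title-genre pair, so a title
# with several genres contributes more than once)
q3_yearly <- cleaned_netflix %>%
  group_by(release_year) %>%
  summarise(total_titles = n(), .groups = "drop")
# Show the ten most recent years
q3_yearly %>%
  slice_tail(n = 10)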
## # A tibble: 10 × 2
## release_year total_titles
## <dbl> <int>
## 1 2012 495
## 2 2013 612
## 3 2014 743
## 4 2015 1186
## 5 2016 1814
## 6 2017 2028
## 7 2018 2298
## 8 2019 2016
## 9 2020 1873
## 10 2021 842
q3_yearly %>%
  ggplot(aes(x = release_year, y = total_titles)) +
  geom_line() +
  labs(
    title = "Netflix Releases Over Time",
    x = "Year",
    y = "Number of Titles"
  )
This vignette used the following tidyverse functions:
• readr → read_csv
• dplyr → filter, rename, count, mutate, group_by, summarise, slice_head, slice_tail
• tidyr → separate_rows
• ggplot2 → ggplot, geom_col, geom_line, coord_flip, labs