Netflix

Surya Teja Madhunala

2025-12-04

Issue Description

In recent years, streaming platforms have become a dominant way for people to watch films and TV shows. Netflix periodically publishes engagement reports summarizing what people watched and how many hours they spent watching different titles.

In this project, I analyze Netflix’s viewing data from late 2023 through the first half of 2025 using the TidyTuesday “Netflix viewing” dataset. I focus on understanding which kinds of titles attract the most viewing time, how movies and shows differ in engagement, and whether runtime is related to total viewing.

Questions

  1. Do TV shows or movies account for more total hours viewed on Netflix?

  2. Which titles and genres dominate viewing time during the period covered by the dataset?

  3. Is there a relationship between runtime and engagement (hours viewed and “views” = hours viewed/runtime)?

  4. Are newer titles watched more than older ones, or do older catalog titles still generate substantial viewing?

Data Source

I use two CSV files: 1. movies.csv – viewing statistics for movies Url: https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/movies.csv 2. shows.csv – viewing statistics for series “https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/shows.csv

# Load packages
library(tidytuesdayR)
## Warning: package 'tidytuesdayR' was built under R version 4.5.2
library(readr)
## Warning: package 'readr' was built under R version 4.5.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.5.2
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(stringr)
## Warning: package 'stringr' was built under R version 4.5.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2

#Load Datasets

# Option 1: tidytuesdayR
tuesdata <- tidytuesdayR::tt_load('2025-07-29')
## ---- Compiling #TidyTuesday Information for 2025-07-29 ----
## --- There are 2 files available ---
## 
## 
## ── Downloading files ───────────────────────────────────────────────────────────
## 
##   1 of 2: "movies.csv"
##   2 of 2: "shows.csv"
movies <- tuesdata$movies
shows  <- tuesdata$shows

Documentation

The dataset is documented on the TidyTuesday GitHub repository and in accompanying blog posts that describe how the Netflix engagement reports were converted into tidy data.

Each file includes a simple data dictionary: movies.csv / shows.csv key variables:

source – original file/report name
report – reporting period label (e.g. "2025Jan-Jun")
title – movie or show title
available_globally – "Yes" / "No"
release_date – date when the title was released on Netflix
hours_viewed – total hours viewed during the report period
runtime – runtime stored as a Period (e.g. 1H 54M 0S)
views – hours_viewed / runtime (approximate number of complete views)

Description of the Data

  1. I first inspect the structure and basic summaries of the two datasets.
str(movies)
## spc_tbl_ [36,121 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ source            : chr [1:36121] "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" ...
##  $ report            : chr [1:36121] "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" ...
##  $ title             : chr [1:36121] "Back in Action" "STRAW" "The Life List" "Exterritorial" ...
##  $ available_globally: chr [1:36121] "Yes" "Yes" "Yes" "Yes" ...
##  $ release_date      : Date[1:36121], format: "2025-01-17" "2025-06-06" ...
##  $ hours_viewed      : num [1:36121] 3.13e+08 1.85e+08 1.99e+08 1.59e+08 1.55e+08 ...
##  $ runtime           : chr [1:36121] "1H 54M 0S" "1H 48M 0S" "2H 5M 0S" "1H 49M 0S" ...
##  $ views             : num [1:36121] 1.65e+08 1.03e+08 9.55e+07 8.75e+07 8.69e+07 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   source = col_character(),
##   ..   report = col_character(),
##   ..   title = col_character(),
##   ..   available_globally = col_character(),
##   ..   release_date = col_date(format = ""),
##   ..   hours_viewed = col_double(),
##   ..   runtime = col_character(),
##   ..   views = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(movies)
##     source             report             title           available_globally
##  Length:36121       Length:36121       Length:36121       Length:36121      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   release_date         hours_viewed         runtime         
##  Min.   :2013-12-12   Min.   :   100000   Length:36121      
##  1st Qu.:2019-11-26   1st Qu.:   200000   Class :character  
##  Median :2021-08-13   Median :   400000   Mode  :character  
##  Mean   :2021-05-19   Mean   :  2790858                     
##  3rd Qu.:2022-12-15   3rd Qu.:  1900000                     
##  Max.   :2025-06-30   Max.   :313000000                     
##  NA's   :29396        NA's   :12                            
##      views          
##  Min.   :   100000  
##  1st Qu.:   100000  
##  Median :   300000  
##  Mean   :  1573256  
##  3rd Qu.:  1100000  
##  Max.   :164700000  
##  NA's   :12
str(shows)
## spc_tbl_ [27,803 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ source            : chr [1:27803] "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" ...
##  $ report            : chr [1:27803] "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" ...
##  $ title             : chr [1:27803] "Adolescence: Limited Series" "Squid Game: Season 2 // 오징어 게임: 시즌 2" "Squid Game: Season 3 // 오징어 게임: 시즌 3" "Zero Day: Limited Series" ...
##  $ available_globally: chr [1:27803] "Yes" "Yes" "Yes" "Yes" ...
##  $ release_date      : Date[1:27803], format: "2025-03-13" "2024-12-26" ...
##  $ hours_viewed      : num [1:27803] 5.55e+08 8.40e+08 4.39e+08 3.16e+08 2.19e+08 ...
##  $ runtime           : chr [1:27803] "3H 50M 0S" "7H 10M 0S" "6H 8M 0S" "5H 9M 0S" ...
##  $ views             : num [1:27803] 1.45e+08 1.17e+08 7.15e+07 6.13e+07 5.80e+07 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   source = col_character(),
##   ..   report = col_character(),
##   ..   title = col_character(),
##   ..   available_globally = col_character(),
##   ..   release_date = col_date(format = ""),
##   ..   hours_viewed = col_double(),
##   ..   runtime = col_character(),
##   ..   views = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(shows)
##     source             report             title           available_globally
##  Length:27803       Length:27803       Length:27803       Length:27803      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   release_date         hours_viewed         runtime         
##  Min.   :2010-04-01   Min.   :   100000   Length:27803      
##  1st Qu.:2019-04-17   1st Qu.:   800000   Class :character  
##  Median :2021-04-14   Median :  2500000   Mode  :character  
##  Mean   :2021-01-08   Mean   :  9807243                     
##  3rd Qu.:2022-12-16   3rd Qu.:  8200000                     
##  Max.   :2025-06-30   Max.   :840300000                     
##  NA's   :14033        NA's   :12                            
##      views          
##  Min.   :   100000  
##  1st Qu.:   200000  
##  Median :   400000  
##  Mean   :  1414994  
##  3rd Qu.:  1300000  
##  Max.   :144800000  
##  NA's   :12
  1. To compare movies and shows together, I create a combined dataset with a type column:
viewing <- bind_rows(
  movies |> mutate(type = "Movie"),
  shows  |> mutate(type = "Show")
)

str(viewing)
## tibble [63,924 × 9] (S3: tbl_df/tbl/data.frame)
##  $ source            : chr [1:63924] "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" ...
##  $ report            : chr [1:63924] "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" ...
##  $ title             : chr [1:63924] "Back in Action" "STRAW" "The Life List" "Exterritorial" ...
##  $ available_globally: chr [1:63924] "Yes" "Yes" "Yes" "Yes" ...
##  $ release_date      : Date[1:63924], format: "2025-01-17" "2025-06-06" ...
##  $ hours_viewed      : num [1:63924] 3.13e+08 1.85e+08 1.99e+08 1.59e+08 1.55e+08 ...
##  $ runtime           : chr [1:63924] "1H 54M 0S" "1H 48M 0S" "2H 5M 0S" "1H 49M 0S" ...
##  $ views             : num [1:63924] 1.65e+08 1.03e+08 9.55e+07 8.75e+07 8.69e+07 ...
##  $ type              : chr [1:63924] "Movie" "Movie" "Movie" "Movie" ...
summary(viewing$hours_viewed)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    100000    300000   1000000   5842382   4200000 840300000        24
summary(viewing$views)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    100000    100000    300000   1504426   1200000 164700000        24

Cleaning and Preparation

Main cleaning steps:

  1. Combine movies and shows into one data frame with a type indicator.

  2. Convert runtime to minutes so that it is numeric.

  3. Create derived variables such as release year.

  4. Filter out rows with missing or obviously problematic values.

parse_runtime <- function(x) {
  hours <- str_extract(x, "(\\d+)H")
  mins  <- str_extract(x, "(\\d+)M")
  secs  <- str_extract(x, "(\\d+)S")
  
  hours <- ifelse(is.na(hours), 0, as.numeric(str_remove(hours, "H")))
  mins  <- ifelse(is.na(mins), 0, as.numeric(str_remove(mins, "M")))
  secs  <- ifelse(is.na(secs), 0, as.numeric(str_remove(secs, "S")))
  
  total_minutes <- hours*60 + mins + secs/60
  return(total_minutes)
}

viewing_clean <- viewing |>
  mutate(
    runtime_minutes = parse_runtime(runtime),
    runtime_hours = runtime_minutes / 60,
    release_year = year(release_date)
  ) |>
  filter(
    !is.na(hours_viewed),
    !is.na(views),
    !is.na(runtime_minutes),
    runtime_minutes > 0
  )

summary(viewing_clean$runtime_minutes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    99.0   119.0   260.4   289.0 23853.0

Final Results

Q1 – Do TV shows or movies account for more total hours viewed?

hours_by_type <- viewing_clean |>
  group_by(type) |>
  summarise(
    total_hours = sum(hours_viewed, na.rm = TRUE),
    median_hours = median(hours_viewed, na.rm = TRUE),
    n_titles = n()
  )

hours_by_type
## # A tibble: 2 × 4
##   type   total_hours median_hours n_titles
##   <chr>        <dbl>        <dbl>    <int>
## 1 Movie 100775100000       400000    36109
## 2 Show  262387100000      2400000    27089
ggplot(hours_by_type, aes(x = type, y = total_hours)) +
  geom_col() +
  labs(
    title = "Total Netflix Hours Viewed by Type",
    x = "Type",
    y = "Total hours viewed (reporting period)"
  )

Q2 – Which titles dominate viewing time?

# Top 10 most viewed titles (movies + shows)
top_titles <- viewing_clean |>
  arrange(desc(hours_viewed)) |>
  slice_head(n = 10)

top_titles
## # A tibble: 10 × 12
##    source      report title available_globally release_date hours_viewed runtime
##    <chr>       <chr>  <chr> <chr>              <date>              <dbl> <chr>  
##  1 1_What_We_… 2025J… Squi… Yes                2024-12-26      840300000 7H 10M…
##  2 3_What_We_… 2024J… Brid… Yes                2024-05-16      733800000 7H 59M…
##  3 3_What_We_… 2024J… Fool… Yes                2024-01-01      689500000 6H 25M…
##  4 3_What_We_… 2024J… Quee… Yes                2024-03-09      682600000 23H 22…
##  5 4_What_We_… 2023J… King… Yes                2023-06-17      630200000 19H 0M…
##  6 2_What_We_… 2024J… Squi… Yes                2024-12-26      619900000 7H 10M…
##  7 1_What_We_… 2025J… When… Yes                2025-03-07      577000000 16H 35…
##  8 1_What_We_… 2025J… Adol… Yes                2025-03-13      555100000 3H 50M…
##  9 2_What_We_… 2024J… Mons… Yes                2024-09-19      550400000 7H 54M…
## 10 4_What_We_… 2023J… ONE … Yes                2023-08-31      541900000 7H 34M…
## # ℹ 5 more variables: views <dbl>, type <chr>, runtime_minutes <dbl>,
## #   runtime_hours <dbl>, release_year <dbl>
ggplot(top_titles,
       aes(x = reorder(title, hours_viewed),
           y = hours_viewed,
           fill = type)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Netflix Titles by Hours Viewed",
    x = "Title",
    y = "Hours Viewed"
  ) +
  theme(legend.position = "bottom")

ggplot(top_titles,
       aes(x = reorder(title, hours_viewed),
           y = hours_viewed,
           fill = type)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Netflix Titles by Hours Viewed",
    x = "Title",
    y = "Hours viewed"
  ) +
  theme(legend.position = "bottom")

Q4 – Are newer titles watched more than older ones?

ggplot(viewing_clean,
       aes(x = release_year,
           y = hours_viewed,
           color = type)) +
  geom_point(alpha = 0.4, position = position_jitter(width = 0.2)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Hours Viewed vs. Release Year",
    x = "Release year",
    y = "Hours viewed"
  ) +
  theme(legend.position = "bottom")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 42804 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42804 rows containing missing values or values outside the scale range
## (`geom_point()`).