Surya Teja Madhunala
2025-12-04
In recent years, streaming platforms have become a dominant way for people to watch films and TV shows. Netflix periodically publishes engagement reports summarizing what people watched and how many hours they spent watching different titles.
In this project, I analyze Netflix’s viewing data from late 2023 through the first half of 2025 using the TidyTuesday “Netflix viewing” dataset. I focus on understanding which kinds of titles attract the most viewing time, how movies and shows differ in engagement, and whether runtime is related to total viewing.
Do TV shows or movies account for more total hours viewed on Netflix?
Which titles and genres dominate viewing time during the period covered by the dataset?
Is there a relationship between runtime and engagement (hours viewed and “views” = hours viewed/runtime)?
Are newer titles watched more than older ones, or do older catalog titles still generate substantial viewing?
I use two CSV files: 1. movies.csv – viewing statistics for movies Url: https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/movies.csv 2. shows.csv – viewing statistics for series “https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/shows.csv”
## Warning: package 'tidytuesdayR' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'lubridate' was built under R version 4.5.2
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
#Load Datasets
## ---- Compiling #TidyTuesday Information for 2025-07-29 ----
## --- There are 2 files available ---
##
##
## ── Downloading files ───────────────────────────────────────────────────────────
##
## 1 of 2: "movies.csv"
## 2 of 2: "shows.csv"
The dataset is documented on the TidyTuesday GitHub repository and in accompanying blog posts that describe how the Netflix engagement reports were converted into tidy data.
Each file includes a simple data dictionary: movies.csv / shows.csv key variables:
source – original file/report name
report – reporting period label (e.g. "2025Jan-Jun")
title – movie or show title
available_globally – "Yes" / "No"
release_date – date when the title was released on Netflix
hours_viewed – total hours viewed during the report period
runtime – runtime stored as a Period (e.g. 1H 54M 0S)
views – hours_viewed / runtime (approximate number of complete views)
## spc_tbl_ [36,121 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ source : chr [1:36121] "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" ...
## $ report : chr [1:36121] "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" ...
## $ title : chr [1:36121] "Back in Action" "STRAW" "The Life List" "Exterritorial" ...
## $ available_globally: chr [1:36121] "Yes" "Yes" "Yes" "Yes" ...
## $ release_date : Date[1:36121], format: "2025-01-17" "2025-06-06" ...
## $ hours_viewed : num [1:36121] 3.13e+08 1.85e+08 1.99e+08 1.59e+08 1.55e+08 ...
## $ runtime : chr [1:36121] "1H 54M 0S" "1H 48M 0S" "2H 5M 0S" "1H 49M 0S" ...
## $ views : num [1:36121] 1.65e+08 1.03e+08 9.55e+07 8.75e+07 8.69e+07 ...
## - attr(*, "spec")=
## .. cols(
## .. source = col_character(),
## .. report = col_character(),
## .. title = col_character(),
## .. available_globally = col_character(),
## .. release_date = col_date(format = ""),
## .. hours_viewed = col_double(),
## .. runtime = col_character(),
## .. views = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## source report title available_globally
## Length:36121 Length:36121 Length:36121 Length:36121
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## release_date hours_viewed runtime
## Min. :2013-12-12 Min. : 100000 Length:36121
## 1st Qu.:2019-11-26 1st Qu.: 200000 Class :character
## Median :2021-08-13 Median : 400000 Mode :character
## Mean :2021-05-19 Mean : 2790858
## 3rd Qu.:2022-12-15 3rd Qu.: 1900000
## Max. :2025-06-30 Max. :313000000
## NA's :29396 NA's :12
## views
## Min. : 100000
## 1st Qu.: 100000
## Median : 300000
## Mean : 1573256
## 3rd Qu.: 1100000
## Max. :164700000
## NA's :12
## spc_tbl_ [27,803 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ source : chr [1:27803] "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" ...
## $ report : chr [1:27803] "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" ...
## $ title : chr [1:27803] "Adolescence: Limited Series" "Squid Game: Season 2 // 오징어 게임: 시즌 2" "Squid Game: Season 3 // 오징어 게임: 시즌 3" "Zero Day: Limited Series" ...
## $ available_globally: chr [1:27803] "Yes" "Yes" "Yes" "Yes" ...
## $ release_date : Date[1:27803], format: "2025-03-13" "2024-12-26" ...
## $ hours_viewed : num [1:27803] 5.55e+08 8.40e+08 4.39e+08 3.16e+08 2.19e+08 ...
## $ runtime : chr [1:27803] "3H 50M 0S" "7H 10M 0S" "6H 8M 0S" "5H 9M 0S" ...
## $ views : num [1:27803] 1.45e+08 1.17e+08 7.15e+07 6.13e+07 5.80e+07 ...
## - attr(*, "spec")=
## .. cols(
## .. source = col_character(),
## .. report = col_character(),
## .. title = col_character(),
## .. available_globally = col_character(),
## .. release_date = col_date(format = ""),
## .. hours_viewed = col_double(),
## .. runtime = col_character(),
## .. views = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## source report title available_globally
## Length:27803 Length:27803 Length:27803 Length:27803
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## release_date hours_viewed runtime
## Min. :2010-04-01 Min. : 100000 Length:27803
## 1st Qu.:2019-04-17 1st Qu.: 800000 Class :character
## Median :2021-04-14 Median : 2500000 Mode :character
## Mean :2021-01-08 Mean : 9807243
## 3rd Qu.:2022-12-16 3rd Qu.: 8200000
## Max. :2025-06-30 Max. :840300000
## NA's :14033 NA's :12
## views
## Min. : 100000
## 1st Qu.: 200000
## Median : 400000
## Mean : 1414994
## 3rd Qu.: 1300000
## Max. :144800000
## NA's :12
viewing <- bind_rows(
movies |> mutate(type = "Movie"),
shows |> mutate(type = "Show")
)
str(viewing)## tibble [63,924 × 9] (S3: tbl_df/tbl/data.frame)
## $ source : chr [1:63924] "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" "1_What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun" ...
## $ report : chr [1:63924] "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" "2025Jan-Jun" ...
## $ title : chr [1:63924] "Back in Action" "STRAW" "The Life List" "Exterritorial" ...
## $ available_globally: chr [1:63924] "Yes" "Yes" "Yes" "Yes" ...
## $ release_date : Date[1:63924], format: "2025-01-17" "2025-06-06" ...
## $ hours_viewed : num [1:63924] 3.13e+08 1.85e+08 1.99e+08 1.59e+08 1.55e+08 ...
## $ runtime : chr [1:63924] "1H 54M 0S" "1H 48M 0S" "2H 5M 0S" "1H 49M 0S" ...
## $ views : num [1:63924] 1.65e+08 1.03e+08 9.55e+07 8.75e+07 8.69e+07 ...
## $ type : chr [1:63924] "Movie" "Movie" "Movie" "Movie" ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 100000 300000 1000000 5842382 4200000 840300000 24
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 100000 100000 300000 1504426 1200000 164700000 24
Main cleaning steps:
Combine movies and shows into one data frame with a type indicator.
Convert runtime to minutes so that it is numeric.
Create derived variables such as release year.
Filter out rows with missing or obviously problematic values.
parse_runtime <- function(x) {
hours <- str_extract(x, "(\\d+)H")
mins <- str_extract(x, "(\\d+)M")
secs <- str_extract(x, "(\\d+)S")
hours <- ifelse(is.na(hours), 0, as.numeric(str_remove(hours, "H")))
mins <- ifelse(is.na(mins), 0, as.numeric(str_remove(mins, "M")))
secs <- ifelse(is.na(secs), 0, as.numeric(str_remove(secs, "S")))
total_minutes <- hours*60 + mins + secs/60
return(total_minutes)
}
viewing_clean <- viewing |>
mutate(
runtime_minutes = parse_runtime(runtime),
runtime_hours = runtime_minutes / 60,
release_year = year(release_date)
) |>
filter(
!is.na(hours_viewed),
!is.na(views),
!is.na(runtime_minutes),
runtime_minutes > 0
)
summary(viewing_clean$runtime_minutes)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 99.0 119.0 260.4 289.0 23853.0
Q1 – Do TV shows or movies account for more total hours viewed?
hours_by_type <- viewing_clean |>
group_by(type) |>
summarise(
total_hours = sum(hours_viewed, na.rm = TRUE),
median_hours = median(hours_viewed, na.rm = TRUE),
n_titles = n()
)
hours_by_type## # A tibble: 2 × 4
## type total_hours median_hours n_titles
## <chr> <dbl> <dbl> <int>
## 1 Movie 100775100000 400000 36109
## 2 Show 262387100000 2400000 27089
ggplot(hours_by_type, aes(x = type, y = total_hours)) +
geom_col() +
labs(
title = "Total Netflix Hours Viewed by Type",
x = "Type",
y = "Total hours viewed (reporting period)"
)# Top 10 most viewed titles (movies + shows)
top_titles <- viewing_clean |>
arrange(desc(hours_viewed)) |>
slice_head(n = 10)
top_titles## # A tibble: 10 × 12
## source report title available_globally release_date hours_viewed runtime
## <chr> <chr> <chr> <chr> <date> <dbl> <chr>
## 1 1_What_We_… 2025J… Squi… Yes 2024-12-26 840300000 7H 10M…
## 2 3_What_We_… 2024J… Brid… Yes 2024-05-16 733800000 7H 59M…
## 3 3_What_We_… 2024J… Fool… Yes 2024-01-01 689500000 6H 25M…
## 4 3_What_We_… 2024J… Quee… Yes 2024-03-09 682600000 23H 22…
## 5 4_What_We_… 2023J… King… Yes 2023-06-17 630200000 19H 0M…
## 6 2_What_We_… 2024J… Squi… Yes 2024-12-26 619900000 7H 10M…
## 7 1_What_We_… 2025J… When… Yes 2025-03-07 577000000 16H 35…
## 8 1_What_We_… 2025J… Adol… Yes 2025-03-13 555100000 3H 50M…
## 9 2_What_We_… 2024J… Mons… Yes 2024-09-19 550400000 7H 54M…
## 10 4_What_We_… 2023J… ONE … Yes 2023-08-31 541900000 7H 34M…
## # ℹ 5 more variables: views <dbl>, type <chr>, runtime_minutes <dbl>,
## # runtime_hours <dbl>, release_year <dbl>
ggplot(top_titles,
aes(x = reorder(title, hours_viewed),
y = hours_viewed,
fill = type)) +
geom_col() +
coord_flip() +
labs(
title = "Top 10 Netflix Titles by Hours Viewed",
x = "Title",
y = "Hours Viewed"
) +
theme(legend.position = "bottom")ggplot(top_titles,
aes(x = reorder(title, hours_viewed),
y = hours_viewed,
fill = type)) +
geom_col() +
coord_flip() +
labs(
title = "Top 10 Netflix Titles by Hours Viewed",
x = "Title",
y = "Hours viewed"
) +
theme(legend.position = "bottom")ggplot(viewing_clean,
aes(x = runtime_hours,
y = hours_viewed,
color = type)) +
geom_point(alpha = 0.4) +
scale_x_continuous("Runtime (hours)") +
scale_y_continuous("Hours viewed") +
labs(title = "Runtime vs. Hours Viewed on Netflix") +
theme(legend.position = "bottom")runtime_model <- lm(hours_viewed ~ runtime_hours + type, data = viewing_clean)
summary(runtime_model)##
## Call:
## lm(formula = hours_viewed ~ runtime_hours + type, data = viewing_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -124469749 -5697872 -2546808 -1089693 830841997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2116187 98309 21.53 <2e-16 ***
## runtime_hours 381194 9796 38.91 <2e-16 ***
## typeShow 4609927 159045 28.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18390000 on 63195 degrees of freedom
## Multiple R-squared: 0.05518, Adjusted R-squared: 0.05515
## F-statistic: 1845 on 2 and 63195 DF, p-value: < 2.2e-16
ggplot(viewing_clean,
aes(x = release_year,
y = hours_viewed,
color = type)) +
geom_point(alpha = 0.4, position = position_jitter(width = 0.2)) +
geom_smooth(se = FALSE) +
labs(
title = "Hours Viewed vs. Release Year",
x = "Release year",
y = "Hours viewed"
) +
theme(legend.position = "bottom")## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 42804 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42804 rows containing missing values or values outside the scale range
## (`geom_point()`).