References:
- IMDb.com, Inc. (2024, March 18). IMDb
non-commercial datasets [Dataset]. Retrieved October 29, 2025, from https://developer.imdb.com/non-commercial-datasets
---
title: "Dating Reality TV - A Guilty Pleasure"
author: "Storytelling with Open Data by Nikita Walunjkar (4146512)"
output:
flexdashboard::flex_dashboard:
orientation: rows
vertical_layout: fill
theme: cosmo
source_code: embed
css: styles.css
---
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Load libraries
library(readr)
library(tidyverse)
library(vroom)
library(glue)
library(ggplot2)
library(data.table)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(plotly)
library(leaflet)
library(scales)
library(ggrepel)
library(tidytext)
library(forcats)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
setwd("~/RMIT/Semesters/2/Data Visualization and Communication/A3")
# titles <- fread("data/title.basics.tsv")
# ratings <- fread("data/title.ratings.tsv")
# episodes <- fread("data/title.episode.tsv")
# akas <- fread("data/title.akas.tsv")
```
```{r}
# # Detect dating reality TV series from the IMDb data
# dating_keywords <- "(?i)\\b(love|date|dating|bachelor|bachelorette|married|match|island|temptation|first\\s+dates|too\\s+hot)\\b"
#
# dating_reality_series <- titles %>%
# filter(
# titleType == "tvSeries",
# str_detect(genres %||% "", "(^|,)Reality-TV(,|$)"),
# str_detect(primaryTitle %||% "", dating_keywords) |
# str_detect(originalTitle %||% "", dating_keywords)
# ) %>%
# mutate(
# startYear = as.integer(startYear),
# endYear = as.integer(endYear),
# runtimeMinutes = as.integer(runtimeMinutes)
# )
#
# # Join the ratings to the series data
# dating_reality_rated <- dating_reality_series %>%
# inner_join(ratings, by = "tconst") %>%
# mutate(
# numVotes = as.integer(numVotes),
# decade = (startYear %/% 10) * 10
# )
#
# # Join episode data to series
# dating_reality_eps <- episodes %>%
# semi_join(dating_reality_series, by = c("parentTconst" = "tconst")) %>%
# mutate(
# seasonNumber = as.integer(seasonNumber),
# episodeNumber = as.integer(episodeNumber)
# )
#
# # Filter regional data to only dating reality
# merged <- akas %>%
# filter(titleId %in% dating_reality_series$tconst)
#
# # Count the number of unique regions in the regional data
# aka_counts <- merged %>%
# group_by(titleId) %>%
# summarise(aka_count = n(),
# unique_regions = n_distinct(region))
#
# # Prep average rating and popularity data for all TV series genres
# genres <- titles %>%
# filter(titleType == "tvSeries") %>%
# filter(genres != "\\N") %>%
# left_join(ratings, by = "tconst") %>%
# separate_rows(genres, sep = ",") %>%
# group_by(tconst) %>%
# mutate(
# genre_rank = row_number(),
# averageRating = if_else(genre_rank %in% c(2, 3), 0, averageRating)
# ) %>%
# ungroup()
```
```{r}
# Saving the files and reading them since the code takes a long time to execute
# write.csv(dating_reality_series, "data/dating_reality_series.csv", row.names = FALSE)
# write.csv(dating_reality_rated, "data/dating_reality_rated.csv", row.names = FALSE)
# write.csv(dating_reality_eps, "data/dating_reality_eps.csv", row.names = FALSE)
# write.csv(merged, "data/akas.csv", row.names = FALSE)
# write.csv(aka_counts, "data/aka_counts.csv", row.names = FALSE)
# write.csv(genres, "data/genres.csv", row.names = FALSE)
```
```{r}
# Reading the saved and filtered data files
dating_reality_series <- read.csv("data/dating_reality_series.csv")
dating_reality_rated <- read.csv("data/dating_reality_rated.csv")
dating_reality_eps <- read.csv("data/dating_reality_eps.csv")
akas <- read.csv("data/akas.csv")
aka_counts <- read.csv("data/aka_counts.csv")
genres <- read.csv("data/genres.csv")
```
```{r}
# Filtering some shows manually from all files
filter_list <- c("tt2373856", "tt3455408", "tt27664503", "tt14849626", "tt15336554", "tt7838238")
dating_reality_series <- dating_reality_series %>%
filter(!tconst %in% filter_list)
dating_reality_rated <- dating_reality_rated %>%
filter(!tconst %in% filter_list)
dating_reality_eps <- dating_reality_eps %>%
filter(!parentTconst %in% filter_list)
akas <- akas %>%
filter(!titleId %in% filter_list)
aka_counts <- aka_counts %>%
filter(!titleId %in% filter_list)
```
# Dating Reality TV - A Guilty Pleasure!
- Binge watching dating reality TV is one of my favorite guilty pleasures
- I've devoured one too many seasons and at a very high consumption rate!
- I'm pretty sure everyone loves occasionally engaging in dramatic content like Love Island, The Bachelor and Too Hot To Handle!
- In this flexdashboard, I analyse the IMDb non-commercial datasets, zooming into the Dating Reality TV genre and try to uncover global trends and reasons why this genre is SO DAMN ADDICTIVE!
- The dataset used here is: IMDb.com, Inc. (2024, March 18). IMDb non-commercial datasets [Dataset]. Retrieved October 29, 2025, from https://developer.imdb.com/non-commercial-datasets
*********************************
How Reality-TV compares to other genres
=====================================
Row
-------------------------------------
### Dating Reality TV versus other IMDb genres
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Calculating the average rating and sum of votes for all genres
genres_viz <- genres %>%
group_by(genres) %>%
summarise(
avg_rating = mean(averageRating, na.rm = TRUE),
count_titles = n(),
sum_votes = sum(numVotes, na.rm = TRUE)
) %>%
arrange(desc(avg_rating))
# Top 10 genres by average rating
top10_genres <- genres_viz %>%
arrange(desc(avg_rating)) %>%
slice_head(n = 10)
# Color the Reality-TV column differently
top10_genres <- top10_genres %>%
mutate(color_fill = ifelse(genres == "Reality-TV", "#ff69b4", "#3b82f6"))
ggplot(top10_genres, aes(x = reorder(genres, avg_rating), y = avg_rating, fill = color_fill)) +
geom_col(alpha = 0.9) +
coord_flip() +
geom_text(aes(label = round(avg_rating, 2)), hjust = -0.2, size = 3.8) +
scale_fill_identity() +
labs(
title = "Top 10 IMDb Genres by Average Rating",
x = "Genre",
y = "Average Rating"
) +
theme_minimal(base_size = 12) +
theme(panel.grid.minor = element_blank()) +
ylim(0, max(top10_genres$avg_rating) + 1)
```
### Inferences
- The graph shows the average ratings of all TV series grouped by genre rated on IMDb
- Action ranks #1 with an average rating of 6.95
- Reality TV is among the top 10 genres and ranks 7th with an average rating of 3.83
*********************************
The Rise of Dating Reality TV
=====================================
Row
-------------------------------------
### Dating Reality TV over the years
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Data Prep
reality_by_year <- dating_reality_rated %>%
filter(!is.na(startYear), startYear >= 1990) %>%
count(startYear) %>%
arrange(startYear)
# Create cumulative frames
cum_df <- reality_by_year %>%
tidyr::expand(frame = startYear, startYear) %>%
filter(startYear <= frame) %>%
left_join(reality_by_year, by = "startYear")
x_order <- sort(unique(reality_by_year$startYear))
# Build chart
p <- plot_ly(
data = cum_df,
x = ~startYear,
y = ~n,
frame = ~frame,
ids = ~startYear,
type = "bar",
name = "Number of new shows",
color = I("#ff69b4"),
hovertemplate = "Year %{x}<br>New series %{y}<extra></extra>",
marker = list(line = list(color = "#ffffff", width = 1))
) %>%
layout(
title = "The Rise of Dating Reality TV",
xaxis = list(title = "Start Year", categoryorder = "array", categoryarray = x_order),
yaxis = list(title = "New Dating Reality Series")
) %>%
animation_opts(
frame = 200,
transition = 150,
easing = "linear"
)
p
```
### Inferences
- Dating Reality TV took off in the 2000s with more and more new shows of this genre taking over the TV series market
- The market saw a peak in 2022 and 2023 with shows like Too Hot To Handle gaining traction
- The COVID-19 pandemic and increased time spent on social media might have influenced people's content choices, reality TV potentially providing a much needed escape from real life
*********************************
Top Shows (Rating and Popularity)
=====================================
Row
-------------------------------------
### Top 10 shows by rating and popularity
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Prep data for ratings
top10_rated <- dating_reality_rated %>%
filter(!is.na(averageRating)) %>%
arrange(desc(averageRating), desc(numVotes), primaryTitle) %>%
slice_head(n = 10) %>%
transmute(title = primaryTitle, x = averageRating, aux = numVotes) %>%
arrange(x, title)
# Prep data for popularity
top10_popular <- dating_reality_rated %>%
filter(!is.na(numVotes)) %>%
arrange(desc(numVotes), desc(averageRating), primaryTitle) %>%
slice_head(n = 10) %>%
transmute(title = primaryTitle, x = numVotes, aux = round(averageRating, 1)) %>%
arrange(x, title)
# Plot
fig <- plot_ly(
x = top10_rated$x,
y = top10_rated$title,
type = "bar",
orientation = "h",
hovertemplate = "<b>%{y}</b><br>Rating: %{x:.1f}<br>Votes: %{customdata}<extra></extra>",
customdata = top10_rated$aux,
marker = list(color = "#ff69b4")
)
fig <- fig %>% layout(
title = list(text = "Top 10 Shows", x = 0.02),
xaxis = list(title = "IMDb Rating", tickformat = ".1f"),
yaxis = list(
title = "",
categoryorder = "array",
categoryarray = top10_rated$title,
automargin = TRUE,
tickfont = list(size = 12)
),
margin = list(l = 240, r = 120, t = 60, b = 40),
bargap = 0.2,
showlegend = FALSE,
updatemenus = list(list(
type = "dropdown",
direction = "down",
x = 1.08, y = 0.5,
xanchor = "left", yanchor = "middle",
bgcolor = "white",
bordercolor = "lightgray",
buttons = list(
list(
label = "Top 10 by Rating",
method = "update",
args = list(
list(
x = list(top10_rated$x),
y = list(top10_rated$title),
customdata = list(top10_rated$aux),
hovertemplate = list("<b>%{y}</b><br>Rating: %{x:.1f}<br>Votes: %{customdata}<extra></extra>"),
marker = list(list(color = "#ff69b4"))
),
list(
xaxis = list(title = "IMDb Rating", tickformat = ".1f"),
yaxis = list(
categoryorder = "array",
categoryarray = top10_rated$title,
automargin = TRUE,
tickfont = list(size = 12)
)
)
)
),
list(
label = "Top 10 by Popularity",
method = "update",
args = list(
list(
x = list(top10_popular$x),
y = list(top10_popular$title),
customdata = list(top10_popular$aux),
hovertemplate = list("<b>%{y}</b><br>Votes: %{x:,}<br>Rating: %{customdata:.1f}<extra></extra>"),
marker = list(list(color = "#ff69b4"))
),
list(
xaxis = list(title = "Number of Votes", tickformat = ",d"),
yaxis = list(
categoryorder = "array",
categoryarray = top10_popular$title,
automargin = TRUE,
tickfont = list(size = 12)
)
)
)
)
)
))
)
fig
```
### Inferences
- The graph shows the top 10 shows by rating, i.e. the dating reality shows/ TV movies that are rated highly on IMDb
- From the drop down, we can also toggle to the the shows that are most popular i.e. received the most number of votes or ratings on IMDb
- Recipe for Love and Love on the Run have >9 ratings
- Too Hot To Handle and Love is Blind are the most popular dating reality shows as of date with the most number of votes on IMDb
*********************************
Shows with the highest global reach
=====================================
Row
-------------------------------------
### Shows with the highest number of regions aired
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Prep data
top_shows <- aka_counts %>%
top_n(10, unique_regions) %>%
left_join(dating_reality_series, by = c("titleId" = "tconst"))
# Plot
plot_ly(
data = top_shows %>%
arrange(desc(unique_regions)),
x = ~unique_regions,
y = ~reorder(primaryTitle, unique_regions),
type = "bar",
orientation = "h",
hoverinfo = "text",
hovertext = ~paste0(
"<b>", primaryTitle, "</b>",
"<br>Countries Aired: ", unique_regions
),
marker = list(color = '#ff69b4')
) %>%
layout(
title = "Top Dating Reality Shows by Countires Aired",
xaxis = list(title = "Number of Countries Aired"),
yaxis = list(title = "", categoryorder = "total ascending"),
hoverlabel = list(bgcolor = "white")
)
```
### Inferences
- This graph shows the number of countries in which a particular show/ title was aired
- Love is Blind tops the list again, probably due to popularity and was aired in a total of 39 countries
- It is followed closely by Too Hot To Handle: Brazil which was aired in 34 countries
- This shows the far and wide global reach of trending shows like Love is Blind
*********************************
What decade did global reach rise?
=====================================
Row
-------------------------------------
### Global reach every decade
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Merge regional data with start year data
merged_trend <- left_join(aka_counts, dating_reality_series, by = c("titleId" = "tconst"))
df <- merged_trend %>%
filter(!is.na(startYear), startYear >= 1990, !is.na(unique_regions)) %>%
mutate(unique_regions = as.numeric(unique_regions))
# Define decades
df_dec <- df %>%
mutate(era = cut(startYear,
breaks = c(1990, 2000, 2010, 2020, 2030),
right = FALSE,
labels = c("1990s","2000s","2010s","2020s")))
# Plot boxplot
p3 <- ggplot(df_dec, aes(era, unique_regions)) +
geom_boxplot(outlier.alpha = 0.3, fill = "lightpink", color = "deeppink4") +
stat_summary(fun = median, geom = "point", size = 2, color = "deeppink4") +
labs(title = "Countries aired by decade",
x = NULL, y = "Countries aired") +
theme_minimal(base_size = 12) +
theme(panel.grid.minor = element_blank())
p3
```
### Inferences
- Plotting the global reach by decade, we can see that the globalization of dating reality series started recently i.e. in the 2020s
- The pandemic, strict stay-at-home mandates and economic uncertainty probably encouraged people to indulge in some dramatic reality television.
*********************************
Countries with the highest number of dating reality shows
=====================================
Row
-------------------------------------
### The United States consumes 500+ dating reality shows
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Counts the regions in the regional aka file
region_counts <- akas %>%
filter(region != "\\N") %>%
count(region, name = "num_shows")
# Get world data including ISO 2 and ISO 3 country codes
world <- ne_countries(scale = "medium", returnclass = "sf") %>%
select(name_long, iso_a2, iso_a3, geometry)
# Join this data to region counts
world_counts <- world %>%
left_join(region_counts, by = c("iso_a2" = "region")) %>%
mutate(num_shows = tidyr::replace_na(num_shows, 0L))
# Drop the geometry column
world_counts_df <- world_counts %>% st_drop_geometry()
# Define pink gradient for colorscale
pink_scale <- list(
list(0.00, "#fff0f6"),
list(0.25, "#ffd1e6"),
list(0.50, "#ff9ec4"),
list(0.75, "#ff4fa3"),
list(1.00, "#7a003c")
)
plot_ly(
data = world_counts_df,
type = "choropleth",
locations = ~iso_a3,
z = ~num_shows,
text = ~paste0(name_long, "<br><b>Shows:</b> ", scales::comma(num_shows)),
hoverinfo = "text",
colorscale = pink_scale,
marker = list(line = list(color = "rgba(255,255,255,0.7)", width = 0.3)),
zmin = 0,
zmax = max(world_counts_df$num_shows, na.rm = TRUE)
) %>%
colorbar(title = "Number of Shows") %>%
layout(
title = list(text = "Global Distribution of Dating Reality Shows"),
geo = list(
showframe = FALSE,
showcoastlines = FALSE,
projection = list(type = "natural earth"),
bgcolor = "rgba(0,0,0,0)"
)
)
```
### Inferences
- The choropleth shows the countries with the highest number of dating reality series on the market
- US tops the charts with a massive 560 reality shows
- English speaking countries like Canada, UK and Australia also have a plethora of dating reality TV options
- Some emerging markets include India, Brazil, Italy and Switzerland
*********************************
High Snackability
=====================================
Column
-------------------------------------
### Runtimes of dating reality shows
```{r echo=FALSE, message=FALSE, warning=FALSE}
p3 <- dating_reality_rated %>%
filter(!is.na(runtimeMinutes), runtimeMinutes > 0, runtimeMinutes < 200) %>%
ggplot(aes(x = runtimeMinutes)) +
geom_density(fill = "lightpink", alpha = 0.4, color = "deeppink", linewidth = 1) +
labs(
title = "Snackability of dating reality shows",
x = "Runtime (minutes)",
y = "Density"
) +
theme_minimal(base_size = 12) +
theme(panel.grid.minor = element_blank())
p3
```
### Inferences
- I was also curious about the run times of reality television, especially since these shows are addictive and binge-worthy!
- The density plots show that the run times are mostly between 40-60 mins
- These run times make each episode highly snackable and keep you coming back for more!
*********************************
Legacy franchises have an incredibly high number of seasons
=====================================
Row
-------------------------------------
### Shows with the highest number of seasons
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Season and episode count from the episodes dataset
seasons_count <- dating_reality_eps %>%
filter(!is.na(seasonNumber), seasonNumber > 0) %>%
distinct(parentTconst, seasonNumber) %>%
count(parentTconst, name = "seasons") %>%
left_join(
dating_reality_eps %>%
filter(!is.na(episodeNumber)) %>%
count(parentTconst, name = "episodes"),
by = "parentTconst"
) %>%
left_join(
dating_reality_series %>% select(tconst, primaryTitle, startYear),
by = c("parentTconst" = "tconst")
) %>%
mutate(
primaryTitle = ifelse(is.na(primaryTitle), parentTconst, primaryTitle),
episodes = ifelse(is.na(episodes), 0L, episodes)
)
# Top 10 by number of seasons
top10_seasons <- seasons_count %>%
arrange(desc(seasons), desc(episodes), primaryTitle) %>%
slice_head(n = 10)
# Plot
p <- ggplot(
top10_seasons,
aes(
x = reorder(primaryTitle, seasons),
y = seasons,
text = paste0(
"Show: ", primaryTitle,
"<br>Seasons: ", seasons,
"<br>Episodes: ", episodes,
ifelse(is.na(startYear), "", paste0("<br>Start year: ", startYear))
)
)
) +
geom_col(fill = "#ff69b4") +
coord_flip() +
labs(
title = "Top 10 by Number of Seasons",
x = "Show",
y = "Seasons"
) +
theme_minimal()
ggplotly(p, tooltip = "text")
```
### Inferences
- When I investigated the number of seasons in some legacy franchises like The Bachelor and The Bachelorette, the results were staggering
- The Bachelor has a total of 30 seasons and 300+ episodes
- This highlights the popularity of legacy franchises and their cultural relevance
*********************************
# Final Thoughts and References
- Reality TV is in the top 10 highest rated genres on IMDb
- Dating Reality TV took off post 2021 and is still dominating screens worldwide
- Western audiences can't get enough: Dating reality has a high global reach in western markets and is expanding rapidly across other markets
- Legacy franchises with repeat seasons and short run times of 40-60 minutes keep people coming back for more!
- Looks like Reality TV isn’t just my guilty pleasure — it’s a global one!
References:<br>
- IMDb.com, Inc. (2024, March 18). IMDb non-commercial datasets [Dataset]. Retrieved October 29, 2025, from https://developer.imdb.com/non-commercial-datasets