Introduction

This week’s analysis explores time-related aspects of the TMDB dataset, focusing on long-term trends in TV show ratings (vote_average) and the number of shows produced each year. By applying techniques like time series visualization, regression modeling, and autocorrelation analysis, we aim to uncover meaningful insights about changes in TV content and audience reception over the years.

# Load necessary libraries
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
# Load the TMDB dataset
tmdb_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Ensure `first_air_date` is a Date (read_csv already parsed it as a date, so this is a safeguard)
tmdb_data <- tmdb_data |>
  mutate(first_air_date = as.Date(first_air_date, format = "%Y-%m-%d"))

# Check for missing or unusual dates
summary(tmdb_data$first_air_date)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "1917-01-09" "2006-04-28" "2015-10-02" "2010-11-18" "2020-10-01" "2046-02-24" 
##         NA's 
##      "31736"

Insights

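  • The first_air_date column runs from 1917-01-09 to 2046-02-24, so the dataset covers more than a century of premiere dates. Two issues need handling before analysis: 31,736 rows have no date, and the maximum lies in the future, which points to scheduled or mis-entered premieres. The aggregation step drops the rows with missing dates; future-dated rows stay in the data unless they are filtered out separately.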
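One way to handle both issues is sketched here on a few made-up rows. The toy_shows table, the cutoff date, and the rule "present and not in the future" are assumptions for illustration, not the cleaning applied in this report.

```r
library(dplyr)

# Hypothetical toy rows standing in for tmdb_data; only the date column matters here.
toy_shows <- tibble(
  name = c("A", "B", "C", "D"),
  first_air_date = as.Date(c("1917-01-09", "2015-10-02", NA, "2046-02-24"))
)

# Assumed rule: a usable premiere date is present and not later than the cutoff.
cutoff <- as.Date("2024-12-31")  # illustrative value; Sys.Date() is another option

clean_shows <- toy_shows |>
  filter(!is.na(first_air_date), first_air_date <= cutoff)

# Count the dropped rows so the cleaning step can be audited.
dropped <- nrow(toy_shows) - nrow(clean_shows)

print(clean_shows)
print(dropped)
```

Reporting the number of dropped rows alongside the cleaned table makes it easy to see how much of the dataset the rule removes.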

Aggregating Data for Yearly Trends

We extract the year from first_air_date and compute three aggregated metrics per year:

  • avg_vote: mean vote_average of the shows first aired that year.
  • total_votes: sum of vote_count across those shows.
  • show_count: number of shows first aired that year.

# Extract the year and aggregate data
yearly_data <- tmdb_data |>
  filter(!is.na(first_air_date)) |>
  mutate(year = year(first_air_date)) |>
  group_by(year) |>
  summarise(
    avg_vote = mean(vote_average, na.rm = TRUE),
    total_votes = sum(vote_count, na.rm = TRUE),
    show_count = n()
  )

# Display the aggregated data
head(yearly_data)
## # A tibble: 6 × 4
##    year avg_vote total_votes show_count
##   <dbl>    <dbl>       <dbl>      <int>
## 1  1917      8             2          1
## 2  1921      0             0          1
## 3  1936      2             1          2
## 4  1938      4.5           2          2
## 5  1939      0             0          1
## 6  1940      0             0          2
  • The aggregated data provides a concise view of yearly metrics, helping us analyze long-term trends in average ratings and production activity.
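  • Year-to-year variation in show_count may track broader industry developments. The earliest years contain only one or two shows each, and several have no votes at all (avg_vote of 0 with total_votes of 0), so their averages are noisy and should be read with caution.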
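The zero averages in the aggregated table suggest that shows without votes carry a vote_average of 0. The following sketch, on made-up rows, compares the plain yearly mean with a mean restricted to rated shows. Treating vote_count == 0 as "never rated" is an assumption, and toy_shows and yearly_compare are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data with the same column names used in the aggregation.
toy_shows <- tibble(
  year         = c(2000, 2000, 2000, 2001, 2001),
  vote_average = c(8.0, 6.0, 0.0, 7.0, 0.0),
  vote_count   = c(10, 5, 0, 4, 0)
)

yearly_compare <- toy_shows |>
  group_by(year) |>
  summarise(
    # Plain mean over every show, including unrated ones.
    avg_vote_all   = mean(vote_average, na.rm = TRUE),
    # Assumption: vote_count == 0 means "never rated", not "rated zero".
    avg_vote_rated = mean(vote_average[vote_count > 0], na.rm = TRUE),
    show_count     = n(),
    rated_count    = sum(vote_count > 0)
  )

print(yearly_compare)
```

If the two averages diverge more in recent years, part of any apparent trend comes from the share of unrated shows rather than from the ratings themselves.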

Detecting Seasonality and Patterns

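Beyond long-term trends, time series can contain recurring patterns. Because the data here are aggregated by year, within-year seasonality cannot appear; what we can look for is multi-year cycles. We use LOESS smoothing and the autocorrelation function (ACF) to check for them.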

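# Assumption: yearly_tsibble is yearly_data indexed by year (it is not defined earlier in this section)
yearly_tsibble <- yearly_data |>
  as_tsibble(index = year)

# Apply smoothing
yearly_tsibble |>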
  ggplot(aes(x = year, y = avg_vote)) +
  geom_line(color = "blue", alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "green") +
  labs(
    title = "Smoothed Average Ratings Over Time",
    x = "Year",
    y = "Average Rating"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Use ACF to check for seasonality
acf(yearly_data$avg_vote, main = "ACF of Average Ratings")

Observations

  • The smoothed line (green) indicates an overall decline in average ratings over time. After a period of relative stability during the mid-20th century, ratings drop markedly in more recent years.
  • Early years (pre-1950) show greater fluctuations in average ratings, likely due to limited data points or variability in content quality during the formative years of television production.
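  • The ACF plot shows significant autocorrelation at lag 1: the average rating in one year is closely correlated with the average in the preceding observed year. This is a statement about correlation, not about one year causing the next, and because some early years are missing from the data, a lag of 1 does not always correspond to exactly one calendar year.
  • Beyond the short lags, the ACF shows only weak peaks and no clear repeating cycle, so there is little evidence of periodic behavior in this dataset.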
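To make the lag-1 figure concrete, this sketch computes the sample autocorrelation by hand and checks it against stats::acf(). The ratings vector is made up for illustration and is not taken from the TMDB data.

```r
# Hypothetical yearly series; the values are invented for illustration.
ratings <- c(6.8, 6.9, 6.7, 6.5, 6.4, 6.1, 5.9, 5.6, 5.2, 4.9)

n <- length(ratings)
centered <- ratings - mean(ratings)

# Lag-1 sample autocorrelation: summed products of neighbouring deviations
# divided by the total sum of squared deviations.
r1_manual <- sum(centered[-n] * centered[-1]) / sum(centered^2)

# The same quantity from stats::acf(); element 1 is lag 0, element 2 is lag 1.
acf_out <- acf(ratings, lag.max = 5, plot = FALSE)
r1_acf <- acf_out$acf[2]

# Approximate 95% band that ACF plots draw under a white-noise null.
band <- qnorm(0.975) / sqrt(n)

print(c(manual = r1_manual, acf = r1_acf, band = band))
```

Note that acf() works on positions in the vector, so with gaps in the year index a lag counts observations rather than calendar years.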

Significance

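  • The declining trend in ratings in recent years may reflect evolving audience preferences, changing production standards, or increased competition from alternative media platforms. It may also be partly an artifact of the data: years with no votes show an avg_vote of 0 in the aggregated table, so a growing share of unrated shows would pull yearly averages down.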
  • Identifying when the decline began can help producers and studios adapt strategies to maintain viewer satisfaction, such as experimenting with new genres or improving production quality.
  • The strong lag-1 autocorrelation suggests that one year's average carries information about the next, which makes autoregressive models a reasonable starting point for forecasting. Part of this correlation comes from the trend itself, since any steadily trending series shows high autocorrelation at short lags.
  • The absence of clear cycles suggests that recurring external factors (e.g., release schedules, cultural events) do not strongly shape yearly average ratings. This simplifies trend analysis, but it also means other explanatory variables are needed to account for the fluctuations.
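As a rough illustration of how lag-1 autocorrelation feeds a forecast, the following sketch fits a first-order autoregressive model with stats::ar() and predicts three steps ahead. The series is made up, and the choice of AR(1) with Yule-Walker estimation is an assumption for the example, not a model fitted in this report.

```r
# Hypothetical yearly average ratings; not taken from the TMDB data.
ratings <- ts(
  c(6.8, 6.9, 6.7, 6.5, 6.4, 6.1, 5.9, 5.6, 5.2, 4.9, 5.0, 4.8),
  start = 2010, frequency = 1
)

# Fit AR(1) by Yule-Walker; aic = FALSE forces the order to be exactly 1.
fit <- ar(ratings, order.max = 1, aic = FALSE, method = "yule-walker")
phi <- as.numeric(fit$ar)

# For AR(1), the Yule-Walker coefficient equals the lag-1 sample autocorrelation.
r1 <- acf(ratings, lag.max = 1, plot = FALSE)$acf[2]

# Forecast three years ahead; each step shrinks toward the series mean.
fc <- predict(fit, n.ahead = 3)

print(c(phi = phi, r1 = r1))
print(fc$pred)
```

Because an AR(1) forecast reverts toward the series mean, it will not extend a downward trend; a model with an explicit trend term or differencing would be needed for that.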

Key Findings

Further Questions