This week's analysis explores time-related aspects of the TMDB
dataset, focusing on long-term trends in TV show ratings
(vote_average) and the number of shows produced each year.
Using time series visualization, linear regression, and
autocorrelation analysis, we examine how TV content and audience
reception have changed over the years.
# Load necessary libraries
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
# Load the TMDB dataset
tmdb_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl (2): adult, in_production
## date (2): first_air_date, last_air_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Ensure `first_air_date` is a Date (read_csv() already parsed it as a date, so this is a safeguard)
tmdb_data <- tmdb_data |>
  mutate(first_air_date = as.Date(first_air_date, format = "%Y-%m-%d"))
# Check for missing or unusual dates
summary(tmdb_data$first_air_date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1917-01-09" "2006-04-28" "2015-10-02" "2010-11-18" "2020-10-01" "2046-02-24"
## NA's
## "31736"
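The summary above shows 31,736 missing dates and a maximum of 2046-02-24, which lies in the future. The following is a minimal sketch of one way to count and drop such rows, assuming future-dated entries are announced or mis-entered shows that should not enter a historical trend. The toy tibble `shows` and the cutoff `as_of` are hypothetical stand-ins, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data standing in for tmdb_data
shows <- tibble(
  name = c("A", "B", "C", "D"),
  first_air_date = as.Date(c("1995-03-01", NA, "2046-02-24", "2020-10-01"))
)

# Assumed cutoff: the date the analysis is run
as_of <- Sys.Date()

# Count the problem rows before dropping them
date_check <- shows |>
  summarise(
    n_missing = sum(is.na(first_air_date)),
    n_future  = sum(first_air_date > as_of, na.rm = TRUE)
  )

# Keep only rows with a known, non-future first air date
shows_clean <- shows |>
  filter(!is.na(first_air_date), first_air_date <= as_of)
```

Applied to the real data, the same filter would replace the `!is.na(first_air_date)` step used in the aggregation; whether future dates should be excluded is a judgment call that depends on the question being asked.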
Aggregating Data for Yearly Trends:
We extract the year from first_air_date and compute aggregated metrics:
- avg_vote: Average rating of shows for each year.
- total_votes: Total number of votes cast for shows first aired in each year.
- show_count: Total number of shows released each year.
# Extract the year and aggregate data
yearly_data <- tmdb_data |>
  filter(!is.na(first_air_date)) |>
  mutate(year = year(first_air_date)) |>
  group_by(year) |>
  summarise(
    avg_vote = mean(vote_average, na.rm = TRUE),
    total_votes = sum(vote_count, na.rm = TRUE),
    show_count = n()
  )
# Display the aggregated data
head(yearly_data)
## # A tibble: 6 × 4
## year avg_vote total_votes show_count
## <dbl> <dbl> <dbl> <int>
## 1 1917 8 2 1
## 2 1921 0 0 1
## 3 1936 2 1 2
## 4 1938 4.5 2 2
## 5 1939 0 0 1
## 6 1940 0 0 2
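Several of the early years above pair an avg_vote of 0 with a total_votes of 0. This suggests, though the source does not confirm it, that shows without any votes are stored with vote_average = 0 and therefore pull the yearly mean down. The sketch below shows one possible adjustment under that assumption: drop unvoted shows and, as an alternative summary, weight each show's rating by its vote count. The toy tibble `shows` and the column names `weighted_vote` and `rated_shows` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for tmdb_data
shows <- tibble(
  year         = c(2000, 2000, 2000, 2001, 2001),
  vote_average = c(8, 0, 6, 7, 0),
  vote_count   = c(10, 0, 30, 5, 0)
)

yearly_rated <- shows |>
  filter(vote_count > 0) |>   # assumption: zero votes means "unrated", not "rated 0"
  group_by(year) |>
  summarise(
    avg_vote      = mean(vote_average),                          # simple mean of rated shows
    weighted_vote = weighted.mean(vote_average, w = vote_count), # vote-weighted mean
    rated_shows   = n()
  )
```

The simple mean treats every rated show equally, while the weighted mean lets heavily voted shows dominate; which is more appropriate depends on whether the unit of interest is the show or the individual vote.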
To identify trends, we create a time series object and plot the average ratings (avg_vote) over time.
# Create a tsibble object
yearly_tsibble <- yearly_data |> as_tsibble(index = year)
# Plot average ratings over time
yearly_tsibble |>
  ggplot(aes(x = year, y = avg_vote)) +
  geom_line(color = "blue") +
  geom_point(color = "darkblue") +
  labs(
    title = "Average Ratings Over Time",
    x = "Year",
    y = "Average Rating"
  ) +
  theme_minimal()
Linear regression allows us to detect overall trends in the data, such as whether average ratings are increasing or decreasing over time.
# Fit a linear regression model
linear_model <- lm(avg_vote ~ year, data = yearly_data)
# Summarize the model
summary(linear_model)
##
## Call:
## lm(formula = avg_vote ~ year, data = yearly_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8361 -0.1617 0.2636 0.4708 5.1471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.907884 8.103534 1.346 0.182
## year -0.004202 0.004089 -1.028 0.307
##
## Residual standard error: 1.109 on 91 degrees of freedom
## Multiple R-squared: 0.01147, Adjusted R-squared: 0.0006099
## F-statistic: 1.056 on 1 and 91 DF, p-value: 0.3068
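In the output above, the estimated slope for year is about -0.004 rating points per year, with a p-value of 0.307 and an R-squared of about 0.011, so this model provides no evidence of a linear trend in average ratings. The sketch below shows how the slope, its p-value, and a confidence interval could be pulled out of a fitted model programmatically. It also fits an optional variant that weights each year by the number of shows behind its average, since early years rest on only one or two shows. The data frame `toy` is simulated and hypothetical; the weighted fit is a suggestion, not part of the original analysis.

```r
# Hypothetical simulated data standing in for yearly_data
set.seed(1)
toy <- data.frame(year = 1950:2020)
toy$avg_vote   <- 6 + rnorm(nrow(toy), sd = 1)  # no built-in trend
toy$show_count <- seq_len(nrow(toy))            # more shows in later years

fit <- lm(avg_vote ~ year, data = toy)

# Slope estimate, its p-value, and a 95% confidence interval
coefs   <- coef(summary(fit))
slope   <- coefs["year", "Estimate"]
p_value <- coefs["year", "Pr(>|t|)"]
ci      <- confint(fit, "year", level = 0.95)

# Optional: weight each year by the number of shows behind its average
fit_w <- lm(avg_vote ~ year, data = toy, weights = show_count)
```

A confidence interval that spans zero carries the same message as a large p-value: the data are compatible with no change in ratings over time.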
# Add the regression line to the plot
yearly_tsibble |>
  ggplot(aes(x = year, y = avg_vote)) +
  geom_line(color = "blue") +
  geom_point(color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Trend in Average Ratings with Linear Regression",
    x = "Year",
    y = "Average Rating"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
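Beyond long-term trends, time series often exhibit recurring patterns. Because the data are aggregated by year, within-year seasonality cannot be observed here; instead, we use LOESS smoothing and the autocorrelation function (ACF) to look for multi-year cycles and year-to-year persistence in average ratings.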
# Apply smoothing
yearly_tsibble |>
  ggplot(aes(x = year, y = avg_vote)) +
  geom_line(color = "blue", alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "green") +
  labs(
    title = "Smoothed Average Ratings Over Time",
    x = "Year",
    y = "Average Rating"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Use the ACF to check for autocorrelation and recurring patterns
acf(yearly_data$avg_vote, main = "ACF of Average Ratings")
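One caveat: the aggregated table jumps from 1917 to 1921 to 1936, so the yearly series has gaps, and acf() on the raw vector treats non-adjacent years as consecutive observations. The sketch below shows one way to keep each lag equal to one calendar year, by inserting the missing years as NA and letting acf() pass over them. The data frame `toy` is hypothetical and only illustrates the mechanics.

```r
# Hypothetical toy data standing in for yearly_data (note the missing years)
toy <- data.frame(
  year     = c(2000, 2001, 2003, 2004, 2005, 2008, 2009, 2010),
  avg_vote = c(6.1, 6.4, 6.0, 6.3, 6.2, 5.9, 6.5, 6.1)
)

# Build a complete yearly grid; years without data get avg_vote = NA
full_years <- data.frame(year = seq(min(toy$year), max(toy$year)))
toy_full   <- merge(full_years, toy, by = "year", all.x = TRUE)

# na.pass lets acf() compute each lag from the available pairs,
# so lag k always means "k calendar years apart"
acf_result <- acf(toy_full$avg_vote, na.action = na.pass, plot = FALSE)
```

For a tsibble object, tsibble's fill_gaps() should offer equivalent gap-filling. Setting plot = FALSE is only for the sketch; dropping it would draw the usual correlogram.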