Week 12 - Time Series Analysis

Introduction

This week’s analysis explores time-related aspects of the TMDB dataset, focusing on long-term trends in TV show ratings (vote_average) and the number of shows produced each year. By applying techniques like time series visualization, regression modeling, and autocorrelation analysis, we aim to uncover meaningful insights about changes in TV content and audience reception over the years.

# Load necessary libraries
library(readr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(tsibble)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

# Load the TMDB dataset
tmdb_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")

## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Ensure `first_air_date` is a Date (read_csv already parsed it as one, so this is a safeguard)
tmdb_data <- tmdb_data |>
  mutate(first_air_date = as.Date(first_air_date, format = "%Y-%m-%d"))

# Check for missing or unusual dates
summary(tmdb_data$first_air_date)

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "1917-01-09" "2006-04-28" "2015-10-02" "2010-11-18" "2020-10-01" "2046-02-24" 
##         NA's 
##      "31736"

Insights

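The first_air_date column runs from 1917-01-09 to 2046-02-24, so the dataset covers more than a century of TV history. Two problems stand out: 31,736 rows have no first_air_date, and the maximum lies in the future, which points to placeholder or mis-entered dates. The aggregation below drops the missing dates; the future-dated rows remain in the data and show up later as an artificial drop at the end of the series.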
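One way to handle the missing and future-dated entries is to drop both before aggregating. The following is a minimal sketch, not the method used in this report: the helper `drop_invalid_dates()`, the toy `shows` tibble, and the choice of cutoff are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical helper: keep rows whose first_air_date is present and
# not later than `cutoff` (assumed by default to be today's date).
drop_invalid_dates <- function(data, cutoff = Sys.Date()) {
  data |>
    filter(!is.na(first_air_date), first_air_date <= cutoff)
}

# Toy data standing in for tmdb_data (show names are made up)
shows <- tibble(
  name = c("A", "B", "C", "D"),
  first_air_date = as.Date(c("1917-01-09", "2015-10-02", NA, "2046-02-24"))
)

# Use a fixed cutoff so the example is reproducible
clean_shows <- drop_invalid_dates(shows, cutoff = as.Date("2024-11-21"))
clean_shows$name
```

Applied to the real data, the same call would be `drop_invalid_dates(tmdb_data)`; whether today's date is the right cutoff depends on whether announced but unaired shows should count.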

Aggregating Data for Yearly Trends:

We extract the year from first_air_date and compute three aggregated metrics for each year:
- avg_vote: average rating of the shows first aired that year.
- total_votes: total number of votes cast for those shows.
- show_count: number of shows first aired that year.

# Extract the year and aggregate data
yearly_data <- tmdb_data |>
  filter(!is.na(first_air_date)) |>
  mutate(year = year(first_air_date)) |>
  group_by(year) |>
  summarise(
    avg_vote = mean(vote_average, na.rm = TRUE),
    total_votes = sum(vote_count, na.rm = TRUE),
    show_count = n()
  )

# Display the aggregated data
head(yearly_data)

## # A tibble: 6 × 4
##    year avg_vote total_votes show_count
##   <dbl>    <dbl>       <dbl>      <int>
## 1  1917      8             2          1
## 2  1921      0             0          1
## 3  1936      2             1          2
## 4  1938      4.5           2          2
## 5  1939      0             0          1
## 6  1940      0             0          2

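The aggregated data gives a concise view of yearly metrics for analyzing long-term trends in average ratings and production activity. The first rows also expose two limitations: the early years contain only one or two shows each, and several of them have an avg_vote of 0 together with total_votes of 0, which most likely means the shows were never rated rather than rated poorly. Yearly averages built on so few shows, or on unrated ones, should be read with caution.
The number of shows per year also varies widely, which may track broader industry developments.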
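Several early years in the table above have an avg_vote of 0 alongside total_votes of 0. If a vote_count of 0 means "unrated" (an assumption about how TMDB records such shows), those rows pull the yearly mean toward zero. A minimal sketch of an alternative aggregation on made-up toy data, excluding unrated shows and adding a vote-weighted average:

```r
library(dplyr)

# Toy data standing in for tmdb_data (values are made up for illustration)
shows <- tibble(
  year         = c(2000, 2000, 2000, 2001, 2001),
  vote_average = c(8, 6, 0, 7, 0),
  vote_count   = c(100, 50, 0, 10, 0)
)

yearly_rated <- shows |>
  filter(vote_count > 0) |>   # assumption: 0 votes means the show is unrated
  group_by(year) |>
  summarise(
    avg_vote      = mean(vote_average),                       # simple mean over rated shows
    weighted_vote = weighted.mean(vote_average, vote_count),  # each show weighted by its votes
    rated_shows   = n()
  )

yearly_rated
```

The weighted version gives heavily voted shows more influence, which may or may not be what the analysis wants; it is shown here only as an option.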

Visualizing Trends in Ratings

To identify trends, we convert the yearly summary to a tsibble indexed by year and plot the average ratings (avg_vote) over time.

# Create a tsibble object
yearly_tsibble <- yearly_data |> as_tsibble(index = year)

# Plot average ratings over time
yearly_tsibble |>
  ggplot(aes(x = year, y = avg_vote)) +
  geom_line(color = "blue") +
  geom_point(color = "darkblue") +
  labs(
    title = "Average Ratings Over Time",
    x = "Year",
    y = "Average Rating"
  ) +
  theme_minimal()

Observations

Average ratings before 1950 fluctuate sharply, with both high peaks and drops to zero. With only one or two shows per year in this period, a single show can determine the yearly average, so these values are unreliable.
From the 1950s onward, the average ratings become more stable. The most direct explanation is that more shows enter each yearly average as TV grew into a popular medium. Shifts in audience preferences, production quality, or technology may also play a role, but this plot cannot separate those effects.
Average ratings decline noticeably in the 21st century. Possible explanations include changing viewer preferences, stricter scoring, or a growing number of shows (including unrated ones recorded with a rating of 0) diluting the yearly mean.
The series drops abruptly to zero for future years. This is most likely caused by incorrect or placeholder dates (the maximum first_air_date is 2046-02-24) and shows that invalid entries must be removed before the analysis can be trusted.
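Because yearly averages based on one or two shows are unreliable, one option is to plot only years with a minimum number of shows. This is a sketch on made-up toy data; the threshold `min_shows` is a hypothetical parameter, not a value taken from this analysis.

```r
library(dplyr)
library(ggplot2)

# Toy yearly summary standing in for yearly_data (values are made up)
yearly_toy <- tibble(
  year       = c(1917, 1921, 1990, 2000, 2010),
  avg_vote   = c(8, 0, 5.1, 4.6, 3.9),
  show_count = c(1, 1, 400, 900, 2500)
)

min_shows <- 30  # hypothetical threshold; choose it by inspecting show_count

# Keep only years whose average rests on enough shows
reliable_years <- yearly_toy |>
  filter(show_count >= min_shows)

ratings_plot <- ggplot(reliable_years, aes(x = year, y = avg_vote)) +
  geom_line(color = "blue") +
  geom_point(color = "darkblue") +
  labs(
    title = "Average Ratings Over Time (years with enough shows)",
    x = "Year",
    y = "Average Rating"
  ) +
  theme_minimal()
```

Printing `ratings_plot` draws the chart. An alternative to filtering is to keep every year but map point size to show_count, so thin years remain visible without dominating the picture.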

Significance

These patterns describe how recorded ratings have evolved over the decades, although part of what they show is data coverage rather than audience reception. The recent decline warrants further investigation once invalid dates and unrated shows have been handled.

Identifying Trends Using Linear Regression

Linear regression allows us to detect overall trends in the data, such as whether average ratings are increasing or decreasing over time.

# Fit a linear regression model
linear_model <- lm(avg_vote ~ year, data = yearly_data)

# Summarize the model
summary(linear_model)

## 
## Call:
## lm(formula = avg_vote ~ year, data = yearly_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8361 -0.1617  0.2636  0.4708  5.1471 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.907884   8.103534   1.346    0.182
## year        -0.004202   0.004089  -1.028    0.307
## 
## Residual standard error: 1.109 on 91 degrees of freedom
## Multiple R-squared:  0.01147,    Adjusted R-squared:  0.0006099 
## F-statistic: 1.056 on 1 and 91 DF,  p-value: 0.3068

# Add the regression line to the plot
yearly_tsibble |>
  ggplot(aes(x = year, y = avg_vote)) +
  geom_line(color = "blue") +
  geom_point(color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Trend in Average Ratings with Linear Regression",
    x = "Year",
    y = "Average Rating"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Insights

The regression line slopes slightly downward (about -0.004 rating points per year), but the slope is not statistically significant (p = 0.307) and the model explains only about 1% of the variance (R-squared = 0.011). The data therefore give no reliable evidence of a linear trend in average ratings.
The actual ratings, represented by the blue line, fluctuate widely around the regression line. Years with high averages may reflect standout TV shows, but in sparse years they may equally reflect a single rated title.
The most extreme departures from the line come from the sparse early years. The largest residual (5.15) matches 1917, where a single show rated 8 sits far above the fitted value of about 2.85, and the smallest (-2.84) matches 1921, where a single show has a rating of 0. Whether the spread also widens in recent years should be checked on a residual plot.
The drop to zero in future years indicates placeholder or incorrect dates in the dataset rather than a real trend, and these points also pull on the fitted line.
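To judge whether a fitted slope is distinguishable from zero, it helps to look at its confidence interval, and to check how the fit changes when each year is weighted by the number of shows behind its average. The sketch below uses simulated toy data, so the numbers it produces say nothing about the TMDB series; weighting by show_count is an assumption about what a sensible weighting would be.

```r
# Toy yearly summary standing in for yearly_data (simulated values)
set.seed(1)
yearly_toy <- data.frame(
  year       = 1950:2020,
  show_count = round(seq(5, 500, length.out = 71))
)
yearly_toy$avg_vote <- 6 - 0.01 * (yearly_toy$year - 1950) + rnorm(71, sd = 0.5)

# Ordinary least squares: every year counts equally
unweighted <- lm(avg_vote ~ year, data = yearly_toy)

# Weighted least squares: years with more shows count more
weighted <- lm(avg_vote ~ year, data = yearly_toy, weights = show_count)

# 95% confidence intervals for the slope; an interval containing 0
# means the trend is not distinguishable from flat at that level
ci_unweighted <- confint(unweighted)["year", ]
ci_weighted   <- confint(weighted)["year", ]
ci_unweighted
ci_weighted
```

On the real data the equivalent calls would be `confint(linear_model)` and `lm(avg_vote ~ year, data = yearly_data, weights = show_count)`.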

Significance

Because the linear trend is not statistically significant, it should not by itself be read as evidence that audience reception is declining. Production studios and content creators would need a cleaner series before drawing conclusions about the quality of their offerings.
The fluctuations suggest that some years have markedly higher averages. Identifying what drives those years, whether exceptional shows or simply small samples, is a prerequisite for learning anything from them.
The year-to-year variability makes a single straight line a poor summary of the series. This argues for more flexible methods and for weighting years by how many shows they contain.
The anomalies highlight the importance of data cleaning and validation before analysis. Accurate and reliable data is critical to deriving insights that can support decisions.

Detecting Seasonality and Patterns

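Beyond long-term trends, time series often exhibit recurring patterns. Because the data here are aggregated by year, within-year seasonality cannot appear; what smoothing and autocorrelation analysis can reveal are multi-year cycles and year-to-year persistence.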

# Apply smoothing
yearly_tsibble |>
  ggplot(aes(x = year, y = avg_vote)) +
  geom_line(color = "blue", alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "green") +
  labs(
    title = "Smoothed Average Ratings Over Time",
    x = "Year",
    y = "Average Rating"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

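# Use the ACF to check for year-to-year persistence and multi-year cycles
# Note: acf() treats the rows as equally spaced, so gaps between years are ignored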
acf(yearly_data$avg_vote, main = "ACF of Average Ratings")

Observations

The smoothed line (green) indicates an overall decline in average ratings over time: after a period of stability during the mid-20th century, ratings decrease in more recent years. Part of the drop at the far right comes from the future-dated entries, which the smoother also fits.
Early years (pre-1950) show greater fluctuations in average ratings, likely due to the small number of shows per year during the formative years of television production.
The ACF plot shows significant autocorrelation at lag 1: a year's average rating tends to be close to that of the preceding year. This describes persistence in the series and does not by itself show that one year's ratings cause the next.
Weak peaks at higher lags in the ACF give no evidence of a clear repeating multi-year cycle in this dataset.
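The call to acf() above works on the rows of yearly_data, which skip years with no shows (the head of the table jumps from 1917 to 1921), so "lag 1" there means "previous row" and not always "previous year". A minimal sketch of one way to respect the calendar: fill the missing years with NA using tsibble's fill_gaps() and let acf() skip incomplete pairs via `na.action = na.pass`. The data are simulated and the removed years are arbitrary.

```r
library(dplyr)
library(tsibble)

# Simulated yearly series with a few years removed to mimic gaps
set.seed(42)
full_series <- tibble(
  year     = as.numeric(1950:2020),
  avg_vote = 6 + as.numeric(arima.sim(model = list(ar = 0.6), n = 71))
)
yearly_toy <- full_series |>
  filter(!year %in% c(1953, 1954, 1960))

# Restore the missing years as explicit rows with NA ratings
yearly_full <- yearly_toy |>
  as_tsibble(index = year) |>
  fill_gaps()

# Autocorrelation by calendar lag; pairs involving an NA are skipped
acf_result <- acf(yearly_full$avg_vote, na.action = na.pass, plot = FALSE)
acf_result$acf[2]  # lag-1 autocorrelation
```

With only a handful of missing years the two approaches give similar answers; with long gaps, as in the pre-1950 part of this dataset, they can differ noticeably.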

Significance

The declining trend in ratings in recent years may reflect evolving audience preferences, changing production standards, or increased competition from alternative media platforms.
Identifying when the decline began can help producers and studios adapt strategies to maintain viewer satisfaction, such as experimenting with new genres or improving production quality.
The strong lag-1 autocorrelation implies that a year's average rating carries information about the next year's, which autoregressive forecasting models can exploit.
The lack of a clear multi-year cycle simplifies trend analysis. It says nothing about within-year effects such as release schedules, which yearly data cannot capture, and it leaves the fluctuations to be explained by other variables.
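As an illustration of how lag-1 autocorrelation can feed a forecast, the sketch below fits a first-order autoregressive model, AR(1), with base R's arima() and predicts three steps ahead. The series is simulated, and AR(1) is an assumed model choice for illustration, not one selected or validated for the TMDB ratings.

```r
# Simulated stand-in for a yearly average-rating series
set.seed(123)
toy_ratings <- 6 + arima.sim(model = list(ar = 0.7), n = 80)

# AR(1): each value depends on the previous one plus noise
ar1_fit <- arima(toy_ratings, order = c(1, 0, 0))
coef(ar1_fit)  # named "ar1" and "intercept" (the estimated mean)

# Point forecasts and standard errors for the next three periods
next_three <- predict(ar1_fit, n.ahead = 3)
next_three$pred
next_three$se
```

The forecast standard errors grow with the horizon, which is a reminder that lag-1 persistence supports short-range prediction far better than long-range prediction.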

Key Findings

Average ratings drift downward in the plots, but the linear trend is not statistically significant (p = 0.307), so the decline is suggestive rather than established.
Extreme variability in early ratings highlights the challenges of analyzing sparse data.
The strong lag-1 autocorrelation supports the use of time series models for predicting future ratings.
No clear multi-year cycle appears in the yearly series; within-year seasonality cannot be assessed at this granularity.
Future-dated entries and shows with no votes distort the series and should be handled before further modeling.

Further Questions

Are external factors, such as the rise of streaming platforms, influencing these trends?
How do trends vary across different genres or regions?
Would additional granularity (e.g., monthly data) provide more insights?
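The question about monthly granularity can be explored with tsibble's yearmonth() index. The following is a minimal sketch on made-up toy data; the column names match the dataset, while the values and the decision to fill empty months with a show_count of 0 are assumptions.

```r
library(dplyr)
library(tsibble)

# Toy data standing in for tmdb_data (values are made up)
shows <- tibble(
  first_air_date = as.Date(c("2020-01-05", "2020-01-20", "2020-03-11", "2020-03-30")),
  vote_average   = c(7, 5, 8, 6)
)

monthly_tsibble <- shows |>
  mutate(month = yearmonth(first_air_date)) |>   # e.g. 2020 Jan
  group_by(month) |>
  summarise(
    avg_vote   = mean(vote_average),
    show_count = n()
  ) |>
  as_tsibble(index = month) |>
  fill_gaps(show_count = 0L)   # empty months: 0 shows, avg_vote stays NA

monthly_tsibble
```

With a monthly series, a peak in the ACF at lag 12 would be the signature of yearly seasonality, something the annual aggregation cannot reveal.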

2024-11-21
