Loading and Preparing the Dataset

I’m working with a Netflix dataset whose variables include each title’s release year and IMDb score. Because only release_year is available, not a full release date, I’ll treat it as a numeric variable and analyze trends at yearly resolution.

# Load the packages used below (tidyverse attaches dplyr, tidyr and ggplot2)
library(tidyverse)

# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")

netflix_data <- netflix_data %>%
  drop_na(imdb_score, release_year)

# Treat release_year as numeric since it's just a year without a full date
netflix_data$release_year <- as.numeric(netflix_data$release_year)

# Check the structure of the dataset
str(netflix_data)
## 'data.frame':    5283 obs. of  15 variables:
##  $ id                  : chr  "tm84618" "tm127384" "tm70993" "tm190788" ...
##  $ title               : chr  "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" "The Exorcist" ...
##  $ type                : chr  "MOVIE" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ "12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area o"| __truncated__ ...
##  $ release_year        : num  1976 1975 1979 1973 1969 ...
##  $ age_certification   : chr  "R" "PG" "R" "R" ...
##  $ runtime             : int  113 91 94 133 30 102 170 104 110 117 ...
##  $ genres              : chr  "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" "['horror']" ...
##  $ production_countries: chr  "['US']" "['GB']" "['GB']" "['US']" ...
##  $ seasons             : num  NA NA NA NA 4 NA NA NA NA NA ...
##  $ imdb_id             : chr  "tt0075314" "tt0071853" "tt0079470" "tt0070047" ...
##  $ imdb_score          : num  8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 7.3 ...
##  $ imdb_votes          : num  795222 530877 392419 391942 72895 ...
##  $ tmdb_popularity     : num  27.6 18.2 17.5 95.3 12.9 ...
##  $ tmdb_score          : num  8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 7.1 ...

Explanation

I have data spanning multiple years but only at the year level. By using just the release year, I can still detect trends and patterns over time without needing a full date. Although there are limitations to not having specific months or days, treating release_year as a numeric variable allows me to analyze long-term trends effectively.


Choosing a Response Variable

For this analysis, I’ve chosen IMDb scores (imdb_score) as my response variable. I’m interested in seeing how IMDb scores for Netflix titles have evolved over time, and whether specific periods show different trends.


Creating a tsibble Object

To prepare the data for time-based analysis, I’ll build an object from release_year and imdb_score, sorted by year. This is the basis for exploring patterns and trends over time.

# release_year was already converted to numeric above, so no Date conversion
# is needed; keep only the two columns of interest and sort by year
netflix_tsibble <- netflix_data %>%
  filter(!is.na(imdb_score)) %>%  # Remove missing IMDb scores
  select(release_year, imdb_score) %>%
  arrange(release_year)

# View the structure of the data
glimpse(netflix_tsibble)
## Rows: 5,283
## Columns: 2
## $ release_year <dbl> 1953, 1954, 1954, 1956, 1958, 1959, 1960, 1961, 1962, 196…
## $ imdb_score   <dbl> 6.8, 7.5, 7.4, 6.7, 7.5, 6.7, 6.4, 7.5, 6.8, 7.6, 7.8, 7.…

Explanation

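The result holds just the release year and IMDb score, since I don’t have specific dates. Strictly speaking it is a sorted data frame rather than a tsibble: the code never calls as_tsibble(), and because many titles share a release year, release_year alone could not serve as a unique index without first aggregating to one row per year (or adding a key). The sorted table is still enough for the yearly comparisons that follow.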
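For reference, here is a minimal sketch of how a true tsibble could be built from this kind of data. It assumes the tsibble package is installed and uses a small made-up data frame (toy) in place of netflix_data; the names yearly_scores, mean_score, and n_titles are illustrative and not part of the original analysis.

```r
library(dplyr)
library(tsibble)

# Toy stand-in for netflix_data (values are made up for illustration)
toy <- data.frame(
  release_year = c(1999, 1999, 2000, 2001, 2001, 2001),
  imdb_score   = c(7.0, 8.0, 6.5, 6.0, 7.0, 8.0)
)

# Collapse to one row per year so release_year uniquely identifies each row
yearly_scores <- toy %>%
  group_by(release_year) %>%
  summarise(
    mean_score = mean(imdb_score),
    n_titles   = n(),
    .groups    = "drop"
  )

# Declare release_year as the time index of the tsibble
yearly_tsibble <- as_tsibble(yearly_scores, index = release_year)
```

Aggregating to a yearly mean is only one option. Keeping every title would instead require a key column (for example the title id) so that each key and year pair is unique.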


Plotting the Data Over Time

I’ll plot IMDb scores over time to get a visual understanding of any trends or patterns. This helps me identify any immediate changes in scores over different years.

# Plot IMDb scores over time

ggplot(netflix_tsibble, aes(x = release_year, y = imdb_score)) +
  geom_line(color = "blue") +
  labs(title = "IMDb Scores Over Time", x = "Release Year", y = "IMDb Score") +
  theme_minimal()

Explanation of Initial Plot

The plot shows that IMDb scores have fluctuated over time, with some periods displaying higher concentrations of certain scores. Early in the dataset, there are fewer titles, which may explain the more consistent higher scores. In recent years, with a much larger volume of content, there’s a broader range of scores, leading to more variation, including both high and low scores.
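Because many titles share a release year, a line drawn through every title zig-zags vertically within each year. One possible alternative, sketched here on made-up data, is to summarize each year first and plot the yearly mean with a band from the lowest to the highest score. The toy data frame and the names yearly_summary and yearly_plot are assumptions for illustration only.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for netflix_tsibble (values are made up for illustration)
toy <- data.frame(
  release_year = c(1999, 1999, 2000, 2001, 2001, 2001),
  imdb_score   = c(7.0, 8.0, 6.5, 6.0, 7.0, 8.0)
)

# One row per year: average score plus the observed range
yearly_summary <- toy %>%
  group_by(release_year) %>%
  summarise(
    mean_score = mean(imdb_score),
    min_score  = min(imdb_score),
    max_score  = max(imdb_score),
    .groups    = "drop"
  )

# Shaded band = range of scores in a year, line = yearly mean
yearly_plot <- ggplot(yearly_summary, aes(x = release_year)) +
  geom_ribbon(aes(ymin = min_score, ymax = max_score),
              fill = "blue", alpha = 0.2) +
  geom_line(aes(y = mean_score), color = "blue") +
  labs(title = "Yearly Mean IMDb Score with Range",
       x = "Release Year", y = "IMDb Score") +
  theme_minimal()
```

Calling print(yearly_plot) draws the figure. The band makes the widening spread of scores visible without the visual noise of connecting individual titles.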


Plot Subsets (Pre-2000 and Post-2000)

We can create two separate plots, one for titles released before 2000 and one for titles released from 2000 onward. These plots will help visualize the trends and patterns in IMDb scores for each time period.

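# Split the data at 2000 (assumed definitions; the original split is not shown)
pre_2000  <- netflix_tsibble %>% filter(release_year < 2000)
post_2000 <- netflix_tsibble %>% filter(release_year >= 2000)

# Plot IMDb scores for Pre-2000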
ggplot(pre_2000, aes(x = release_year, y = imdb_score)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "IMDb Scores (Pre-2000)", x = "Release Year", y = "IMDb Score") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Plot IMDb scores for Post-2000
ggplot(post_2000, aes(x = release_year, y = imdb_score)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "IMDb Scores (Post-2000)", x = "Release Year", y = "IMDb Score") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Explanation

These plots show how the trends differ before and after 2000. The pre-2000 plot has far fewer data points and a downward-sloping trend line, while the post-2000 plot (titles from 2000 onward) shows a more pronounced downward trend alongside the much larger volume of content released in recent years.
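The visual comparison of the two trend lines can be backed by numbers. The following is a rough sketch, on made-up data, that fits the same linear model geom_smooth(method = "lm") uses and extracts the slope for each period. The helper slope_per_year and the toy values are hypothetical and not part of the original analysis.

```r
# Toy stand-in for netflix_tsibble (values are made up for illustration)
toy <- data.frame(
  release_year = c(1990, 1995, 1999, 2000, 2010, 2020),
  imdb_score   = c(8.0, 7.5, 7.1, 7.0, 6.5, 6.0)
)

# Same split as in the text: before 2000 vs. 2000 onward
pre_2000  <- subset(toy, release_year < 2000)
post_2000 <- subset(toy, release_year >= 2000)

# Hypothetical helper: change in IMDb score per release year from a linear fit
slope_per_year <- function(df) {
  fit <- lm(imdb_score ~ release_year, data = df)
  unname(coef(fit)["release_year"])
}

pre_slope  <- slope_per_year(pre_2000)
post_slope <- slope_per_year(post_2000)
```

A negative slope means scores fall as release year increases. Comparing pre_slope and post_slope on the real data would show which period declines faster, which a plot alone only suggests.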


Plot a Bar Graph for the Count of Titles Each Year

To better understand the distribution of Netflix titles over time, I’ll plot a bar graph showing the number of titles released in each year.

# Count the number of titles per year
title_counts <- netflix_data %>%
  group_by(release_year) %>%
  summarise(count = n())

# Bar plot showing the number of titles per year
ggplot(title_counts, aes(x = release_year, y = count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Number of Netflix Titles Released Each Year", x = "Release Year", y = "Count of Titles") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Explanation

The bar plot shows how the titles in Netflix’s library are distributed across release years. I expect a sharp rise in the number of titles per year, particularly after 2000, which would help explain the increasing variation in IMDb scores during this period: with far more titles, recent years naturally span a wider range of ratings.
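The link between volume and variation can also be checked directly rather than inferred from two separate plots. Here is a minimal sketch, using made-up data, that computes the number of titles and the standard deviation of scores for each year side by side. The toy data frame and the names spread_by_year, n_titles, and sd_score are illustrative assumptions.

```r
library(dplyr)

# Toy stand-in for netflix_data (values are made up for illustration)
toy <- data.frame(
  release_year = c(1999, 2010, 2010, 2020, 2020, 2020),
  imdb_score   = c(7.5, 6.0, 8.0, 5.0, 7.0, 9.0)
)

# Per-year title count and spread of scores
spread_by_year <- toy %>%
  group_by(release_year) %>%
  summarise(
    n_titles = n(),
    sd_score = sd(imdb_score),  # NA when a year has a single title
    .groups  = "drop"
  )
```

Years with only one title have no defined standard deviation, so early years with sparse data should be read with care or filtered out before comparing spread against volume.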


Smoothing and Detecting Seasonality

I’ll now apply LOESS smoothing to detect more nuanced patterns in the data and explore whether seasonality might be present.

# Smoothing the data
ggplot(netflix_tsibble, aes(x = release_year, y = imdb_score)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "loess", color = "green") +
  labs(title = "Smoothed IMDb Scores Over Time", x = "Release Year", y = "IMDb Score") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# ACF and PACF to check for seasonality
acf(netflix_tsibble$imdb_score, na.action = na.pass, main = "ACF of IMDb Scores")

pacf(netflix_tsibble$imdb_score, na.action = na.pass, main = "PACF of IMDb Scores")

Explanation

The LOESS smoothing curve highlights fluctuations in IMDb scores that the linear model couldn’t capture. For instance, there may be certain periods where scores rise temporarily before falling again.

The Autocorrelation (ACF) and Partial Autocorrelation (PACF) plots don’t show strong seasonality, which is expected: seasonality is a within-year pattern and cannot appear in data recorded only by year. Note also that both functions were applied to individual titles sorted by release year, so each lag is one title rather than one year. The absence of clear autocorrelation patterns suggests that IMDb scores don’t follow a repeating cycle and instead reflect more general trends.
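For the lags to correspond to years, the ACF needs one value per year. Below is a small sketch of that idea using base R only, with a made-up vector of yearly mean scores standing in for the aggregated data. The vector yearly_mean and the derived names are assumptions for illustration.

```r
# Made-up yearly mean IMDb scores, one value per consecutive year
yearly_mean <- c(7.8, 7.6, 7.7, 7.4, 7.2, 7.3, 7.0, 6.9, 6.8, 6.6)

# Autocorrelation at lags 0 to 5 years, without drawing the plot
acf_out <- acf(yearly_mean, lag.max = 5, plot = FALSE)

# Approximate 95% bound for a white-noise series of this length
bound <- 1.96 / sqrt(length(yearly_mean))

# Tabulate each lag and flag values outside the bound
acf_table <- data.frame(
  lag_years     = as.vector(acf_out$lag),
  acf           = as.vector(acf_out$acf),
  outside_bound = abs(as.vector(acf_out$acf)) > bound
)
```

acf() assumes equally spaced observations, so any missing years in the real data would need to be filled or otherwise handled first. A steadily trending series like this one typically shows slowly decaying positive autocorrelation, which reflects the trend and not a cycle.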


Explaining Time Series and Seasonality: A Wikipedia Page Views Example

In the plot analyzing IMDb scores over time, there is no clear seasonality. Unlike datasets such as sales or retail data that typically experience cyclic patterns (e.g., spikes during holiday seasons), the nature of Netflix’s content production and viewer ratings does not follow a similar cyclical trend.

For comparison, I would not expect page views for Netflix on Wikipedia to show strong seasonal patterns either, as the streaming industry is continuous and not driven by specific seasons. Similarly, in my dataset, IMDb scores reflect long-term trends rather than seasonal fluctuations. Even if the data had specific day/month fields, it likely wouldn’t change the outcome, because IMDb scores are not tied to particular months of the year the way sales or holiday-related activities are.

By using just year values, I focus on analyzing long-term patterns in the IMDb ratings, which provides insights into Netflix’s content evolution over time without being influenced by short-term or seasonal variations.



Insights

  1. Trend: IMDb scores have trended downward over time, especially in the post-2000 period. This is likely due to Netflix’s large-scale content production, leading to a broader range of IMDb ratings.

  2. Periods of Change: Pre-2000, IMDb scores were more stable, while post-2000, the explosion of content has contributed to a wider distribution of ratings, including more low-rated content, which may pull down the overall average.

  3. Seasonality: No strong seasonality was detected, which is expected for yearly-level data, where repeating within-year patterns cannot show up. IMDb scores seem to reflect long-term trends rather than any repeating seasonal patterns.


Conclusion

By using release_year as a proxy for time, I was able to explore trends in IMDb scores over time. Despite the lack of specific dates, the analysis revealed notable patterns, including a general decline in scores as the volume of content increased. This suggests that the quality of Netflix content has become more variable, with both highly rated and low-rated content being produced in greater numbers in recent years. Future investigations might focus on specific genres or types of content to see if these trends hold across different categories.