Netflix Data Dive - Time based Analysis

Loading and Preparing the Dataset

I’m working with a Netflix dataset whose variables include each title’s release year and IMDb score. Since I only have release_year and not a full date, I’ll treat release_year as a numeric variable to analyze trends over time.

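# Load the packages used below (pipe, drop_na, glimpse, ggplot2)
library(tidyverse)

# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")

# Drop rows with a missing IMDb score or release year
netflix_data <- netflix_data %>%
  drop_na(imdb_score, release_year)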

# Treat release_year as numeric since it's just a year without a full date
netflix_data$release_year <- as.numeric(netflix_data$release_year)

# Check the structure of the dataset
str(netflix_data)

## 'data.frame':    5283 obs. of  15 variables:
##  $ id                  : chr  "tm84618" "tm127384" "tm70993" "tm190788" ...
##  $ title               : chr  "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" "The Exorcist" ...
##  $ type                : chr  "MOVIE" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ "12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area o"| __truncated__ ...
##  $ release_year        : num  1976 1975 1979 1973 1969 ...
##  $ age_certification   : chr  "R" "PG" "R" "R" ...
##  $ runtime             : int  113 91 94 133 30 102 170 104 110 117 ...
##  $ genres              : chr  "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" "['horror']" ...
##  $ production_countries: chr  "['US']" "['GB']" "['GB']" "['US']" ...
##  $ seasons             : num  NA NA NA NA 4 NA NA NA NA NA ...
##  $ imdb_id             : chr  "tt0075314" "tt0071853" "tt0079470" "tt0070047" ...
##  $ imdb_score          : num  8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 7.3 ...
##  $ imdb_votes          : num  795222 530877 392419 391942 72895 ...
##  $ tmdb_popularity     : num  27.6 18.2 17.5 95.3 12.9 ...
##  $ tmdb_score          : num  8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 7.1 ...

Explanation

I have data spanning multiple years but only at the year level. By using just the release year, I can still detect trends and patterns over time without needing a full date. Although there are limitations to not having specific months or days, treating release_year as a numeric variable allows me to analyze long-term trends effectively.

Choosing a Response Variable

For this analysis, I’ve chosen IMDb scores (imdb_score) as my response variable. I’m interested in seeing how IMDb scores for Netflix titles have evolved over time, and whether specific periods show different trends.

Creating a `tsibble` Object

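To prepare for time-based analysis, I’ll reduce the data to release_year and imdb_score and sort it by year. Note that the code below produces an ordinary data frame rather than a true tsibble: many titles share the same release year, so release_year alone cannot act as a unique tsibble index unless the scores are first aggregated by year.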

# No need to convert to Date, using release_year as numeric
netflix_data$release_year <- as.numeric(netflix_data$release_year)


netflix_tsibble <- netflix_data %>%
  filter(!is.na(imdb_score)) %>%  # Remove missing IMDb scores
  select(release_year, imdb_score) %>%
  arrange(release_year)

# View the structure of the data
glimpse(netflix_tsibble)

## Rows: 5,283
## Columns: 2
## $ release_year <dbl> 1953, 1954, 1954, 1956, 1958, 1959, 1960, 1961, 1962, 196…
## $ imdb_score   <dbl> 6.8, 7.5, 7.4, 6.7, 7.5, 6.7, 6.4, 7.5, 6.8, 7.6, 7.8, 7.…

Explanation

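The result is a two-column data frame of release years and IMDb scores, sorted by year, since I don’t have specific dates. It keeps one row per title (5,283 rows), so it is not yet a valid tsibble, but it is enough for the year-level plots and regressions that follow, which focus on yearly changes in IMDb ratings.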
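A tsibble needs one row per time index (or per key and index), so a true tsibble can only be built after collapsing the titles to one value per year. The following is a minimal sketch under that assumption. The toy data frame `scores` and the column names `mean_score` and `n_titles` are hypothetical stand-ins and are not part of the original analysis.

```r
library(dplyr)
library(tsibble)

# Hypothetical toy data standing in for netflix_data (several titles per year)
scores <- data.frame(
  release_year = c(2019, 2019, 2020, 2020, 2020, 2021),
  imdb_score   = c(6.0, 7.0, 5.5, 6.5, 7.5, 8.0)
)

# Collapse to one row per year so release_year becomes a unique index
yearly <- scores %>%
  group_by(release_year) %>%
  summarise(mean_score = mean(imdb_score), n_titles = n()) %>%
  mutate(release_year = as.integer(release_year))

# Build the tsibble; the interval is inferred from the index values
yearly_ts <- as_tsibble(yearly, index = release_year)
yearly_ts
```

On the real data, some early years have no titles, so the yearly series would contain gaps. Functions such as `has_gaps()` and `fill_gaps()` from tsibble can help inspect or fill them before any time-series modeling.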

Plotting the Data Over Time

I’ll plot IMDb scores over time to get a visual understanding of any trends or patterns. This helps me identify any immediate changes in scores over different years.

# Plot IMDb scores over time

ggplot(netflix_tsibble, aes(x = release_year, y = imdb_score)) +
  geom_line(color = "blue") +
  labs(title = "IMDb Scores Over Time", x = "Release Year", y = "IMDb Score") +
  theme_minimal()

Explanation of Initial Plot

The plot shows that IMDb scores have fluctuated over time. Because many titles share the same release year, the line connects several scores within each year, so the vertical extent of the line in a given year roughly reflects the range of scores for that year. Early in the dataset there are fewer titles per year, which may explain why the early scores look more consistent and mostly high. In recent years, with a much larger volume of content, the range of scores is broader and includes both very high and very low ratings.
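The impression that recent years have more titles and a wider spread of scores can be checked numerically with a per-year summary. The sketch below uses base R on a hypothetical toy data frame (`scores`) standing in for netflix_data. The summary columns are illustrative choices, not part of the original analysis.

```r
# Hypothetical toy data standing in for netflix_data
scores <- data.frame(
  release_year = c(1975, 1975, 2020, 2020, 2020, 2020),
  imdb_score   = c(8.2, 8.0, 4.0, 6.5, 7.0, 9.0)
)

# Per-year count, mean, and range of scores
yearly_summary <- do.call(rbind, lapply(
  split(scores$imdb_score, scores$release_year),
  function(s) data.frame(n = length(s), mean = mean(s), min = min(s), max = max(s))
))

# The row names carry the year; store it as a regular column too
yearly_summary$release_year <- as.numeric(rownames(yearly_summary))
yearly_summary
```

Applied to the real data, a table like this would show directly whether the per-year range (max minus min) widens as the number of titles per year grows.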

Detecting Trends with Linear Regression

Next, I’ll apply linear regression to quantify any overall trends in IMDb scores over time.

# Perform linear regression to detect trends
trend_model <- lm(imdb_score ~ release_year, data = netflix_data)

# View the summary of the model
summary(trend_model)

## 
## Call:
## lm(formula = imdb_score ~ release_year, data = netflix_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0315 -0.6990  0.1173  0.8173  3.1010 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.298788   4.360640   9.012  < 2e-16 ***
## release_year -0.016254   0.002163  -7.514  6.7e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.155 on 5281 degrees of freedom
## Multiple R-squared:  0.01058,    Adjusted R-squared:  0.01039 
## F-statistic: 56.46 on 1 and 5281 DF,  p-value: 6.7e-14

Regression Results:

Intercept: 39.30
Release Year Coefficient: -0.0163 (p-value = 6.7e-14)

Explanation:

The linear regression shows a statistically significant negative trend in IMDb scores across release years. The coefficient for release_year (-0.0163) means that each additional release year is associated with an average IMDb score about 0.016 points lower, or roughly 0.16 points per decade. The effect is small, however: with an R-squared of about 0.011, release year explains only about 1% of the variation in scores. The trend suggests that more recently released Netflix titles are rated slightly lower on average. One potential explanation is Netflix’s rapid expansion, which introduced a much larger variety of content. While some titles remain critically acclaimed, the broader production slate includes more lower-rated content, which pulls the average IMDb score down.
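To make the size of the slope concrete, the reported coefficients can be plugged into the fitted line by hand. This is a small worked sketch using only the numbers printed in the regression summary above. The helper name `fitted_score` and the comparison years 1980 and 2020 are arbitrary choices made for illustration.

```r
# Coefficients copied from the summary(trend_model) output above
intercept <- 39.298788
slope     <- -0.016254
slope_se  <- 0.002163

# Fitted mean score for a given release year (hypothetical helper)
fitted_score <- function(year) intercept + slope * year

fitted_score(1980)  # fitted mean for titles released in 1980
fitted_score(2020)  # fitted mean for titles released in 2020

# Implied change per decade and over the 40-year span
change_per_decade <- slope * 10
change_1980_2020  <- fitted_score(2020) - fitted_score(1980)

# Approximate 95% interval for the slope (normal approximation)
slope_ci <- slope + c(-1, 1) * qnorm(0.975) * slope_se
slope_ci
```

The fitted line drops by about 0.16 points per decade, which is modest next to the residual standard error of 1.155 reported by the model.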

# Plot the trend line on the original plot
ggplot(netflix_tsibble, aes(x = release_year, y = imdb_score)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "IMDb Scores Over Time with Trend Line", x = "Release Year", y = "IMDb Score") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The red trend line on the plot further visualizes this decline in IMDb scores.

Subsetting Data for Multiple Trends

Given the significant increase in the volume of content in recent years, I’ll split the data into two periods: pre-2000 and post-2000. This allows me to check if trends differ between earlier and more recent content.

# Subset data for two time periods
pre_2000 <- netflix_data %>% filter(release_year < 2000)
post_2000 <- netflix_data %>% filter(release_year >= 2000)

# Perform regression for both periods
pre_2000_model <- lm(imdb_score ~ release_year, data = pre_2000)
post_2000_model <- lm(imdb_score ~ release_year, data = post_2000)

# View summaries of both models
summary(pre_2000_model)

## 
## Call:
## lm(formula = imdb_score ~ release_year, data = pre_2000)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7244 -0.6445  0.0762  0.8455  2.2885 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  30.238221  14.913387   2.028   0.0438 *
## release_year -0.011825   0.007501  -1.577   0.1164  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.158 on 214 degrees of freedom
## Multiple R-squared:  0.01148,    Adjusted R-squared:  0.006862 
## F-statistic: 2.486 on 1 and 214 DF,  p-value: 0.1164

summary(post_2000_model)

## 
## Call:
## lm(formula = imdb_score ~ release_year, data = post_2000)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0625 -0.6920  0.1138  0.8138  3.1080 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.673464   7.771406   9.995   <2e-16 ***
## release_year -0.035273   0.003853  -9.155   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.151 on 5065 degrees of freedom
## Multiple R-squared:  0.01628,    Adjusted R-squared:  0.01608 
## F-statistic: 83.82 on 1 and 5065 DF,  p-value: < 2.2e-16

Explanation of Subsetting Results

Subset Analysis Results:

Pre-2000:
- Coefficient for release_year: -0.0118 (p-value = 0.1164)
- No statistically significant trend.
Post-2000:
- Coefficient for release_year: -0.0353 (p-value < 2e-16)
- A much steeper, statistically significant downward trend.

The two periods give different results, although the two slopes are not formally compared here:

Pre-2000: The trend is not statistically significant, likely in part because only 216 titles fall in this period. IMDb scores appear relatively stable before Netflix’s content explosion.
Post-2000: There is a statistically significant downward trend across 5,067 titles, with a larger negative coefficient (-0.0353, or about 0.35 points per decade). The broader variety of content in recent years, ranging from high-quality originals to lower-budget or less critically successful titles, may have contributed to this steeper decline.
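Fitting two separate regressions shows that the slopes look different, but it does not test whether the difference itself is statistically significant. One common way to check this is a single model with an interaction between release year and period. The sketch below illustrates the idea on simulated data. The data frame `sim`, the factor `period`, and the slopes used to generate the scores are all made up for illustration and are not results from the Netflix dataset.

```r
set.seed(42)

# Simulated stand-in for netflix_data; the generating slopes are invented
sim <- data.frame(release_year = sample(1960:2022, 2000, replace = TRUE))
sim$period <- factor(
  ifelse(sim$release_year < 2000, "pre_2000", "post_2000"),
  levels = c("pre_2000", "post_2000")
)
sim$imdb_score <- 7 - 0.01 * (sim$release_year - 1960) -
  0.03 * pmax(sim$release_year - 2000, 0) +
  rnorm(nrow(sim), sd = 1)

# One model with a period-specific intercept and slope
interaction_model <- lm(imdb_score ~ release_year * period, data = sim)

# The release_year:periodpost_2000 row estimates how much the post-2000 slope
# differs from the pre-2000 slope, with its own standard error and p-value
coef(summary(interaction_model))
```

On the real data, the same formula with a `period` column derived from release_year would give a direct test of whether the post-2000 decline is steeper than the pre-2000 one.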

Plot Subsets (Pre-2000 and Post-2000)

We can create two separate plots, one for titles released before 2000 and one for titles released from 2000 onward. These plots will help visualize the trends and patterns in IMDb scores for each time period.

# Plot IMDb scores for Pre-2000
ggplot(pre_2000, aes(x = release_year, y = imdb_score)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "IMDb Scores (Pre-2000)", x = "Release Year", y = "IMDb Score") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# Plot IMDb scores for Post-2000
ggplot(post_2000, aes(x = release_year, y = imdb_score)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "IMDb Scores (Post-2000)", x = "Release Year", y = "IMDb Score") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Explanation

These plots show the trends before and after 2000. The pre-2000 plot has far fewer data points and a gently downward trend line that is not statistically significant, while the post-2000 plot shows a more pronounced downward trend across the much larger volume of content released in recent years.

Plot a Bar Graph for the Count of Titles Each Year

To better understand the distribution of Netflix titles over time, I’ll plot a bar graph showing the number of titles released in each year.

# Count the number of titles per year
title_counts <- netflix_data %>%
  group_by(release_year) %>%
  summarise(count = n())

# Bar plot showing the number of titles per year
ggplot(title_counts, aes(x = release_year, y = count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Number of Netflix Titles Released Each Year", x = "Release Year", y = "Count of Titles") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Explanation

The bar plot shows how the number of titles in the dataset grows with release year. The rise is sharp after 2000: only 216 of the 5,283 titles (about 4%) were released before 2000. This concentration of titles in recent years helps explain the increasing variation in IMDb scores during this period, and it supports the insight that more recent years have a wider range of IMDb ratings, driven by the sheer volume of content.

Smoothing and Detecting Seasonality

I’ll now apply LOESS smoothing to detect more nuanced patterns in the data and explore whether seasonality might be present.

# Smoothing the data
ggplot(netflix_tsibble, aes(x = release_year, y = imdb_score)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "loess", color = "green") +
  labs(title = "Smoothed IMDb Scores Over Time", x = "Release Year", y = "IMDb Score") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# ACF and PACF to check for seasonality
acf(netflix_tsibble$imdb_score, na.action = na.pass, main = "ACF of IMDb Scores")

pacf(netflix_tsibble$imdb_score, na.action = na.pass, main = "PACF of IMDb Scores")

Explanation:

The LOESS smoothing curve highlights fluctuations in IMDb scores that the linear model couldn’t capture. For instance, there may be certain periods where scores rise temporarily before falling again.

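The Autocorrelation (ACF) and Partial Autocorrelation (PACF) plots don’t show strong seasonality. This is expected: with only yearly data, any within-year seasonal pattern cannot be observed. Note also that the ACF here is computed over individual titles sorted by release year, so a lag corresponds to the next title in the list rather than the next year, and the plots should be read as a rough check rather than a formal test. Overall, IMDb scores do not appear to follow cyclical or repeating patterns year over year and instead reflect more general trends.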
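For the lags of an autocorrelation plot to mean "years", the series needs one value per year. A minimal sketch of that variant is shown below, using base R on simulated data. The data frame `scores` and the choice of the yearly mean as the summary are assumptions made for illustration, not the method used above.

```r
set.seed(1)

# Hypothetical toy data standing in for netflix_data
scores <- data.frame(release_year = sample(1990:2021, 500, replace = TRUE))
scores$imdb_score <- round(rnorm(nrow(scores), mean = 6.5, sd = 1.1), 1)

# One mean score per release year, ordered by year
yearly_mean <- aggregate(imdb_score ~ release_year, data = scores, FUN = mean)
yearly_mean <- yearly_mean[order(yearly_mean$release_year), ]

# Autocorrelation of the yearly series; lag k now corresponds to k rows,
# which equals k years only when no year is missing
yearly_acf <- acf(yearly_mean$imdb_score, lag.max = 10, plot = FALSE)
yearly_acf
```

This treats consecutive rows as consecutive years, so on the real data any missing years in the early part of the series would need to be handled first, for example by restricting the series to a range of years with no gaps.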

Explaining Time Series and Seasonality: A Wikipedia Page Views Example:

In the plot analyzing IMDb scores over time, there is no clear seasonality. Unlike datasets such as sales or retail data that typically experience cyclic patterns (e.g., spikes during holiday seasons), the nature of Netflix’s content production and viewer ratings does not follow a similar cyclical trend.

For comparison, page views for the Netflix article on Wikipedia would not necessarily show strong seasonal patterns either, since streaming is a continuous, year-round activity rather than one driven by specific seasons. Similarly, in my dataset, IMDb scores reflect long-term trends rather than seasonal fluctuations. Even if the data had specific day/month fields, it likely wouldn’t change the outcome, because IMDb scores are not tied to specific months of the year in the way sales or holiday-related activities are.

By using just year values, I focus on analyzing long-term patterns in the IMDb ratings, which provides insights into Netflix’s content evolution over time without being influenced by short-term or seasonal variations.

Insights

Trend: IMDb scores have trended downward over time, especially in the post-2000 period. This is likely due to Netflix’s large-scale content production, leading to a broader range of IMDb ratings.
Periods of Change: Pre-2000, IMDb scores were more stable, while post-2000, the explosion of content has contributed to a wider distribution of ratings, including more low-rated content, which may pull down the overall average.
Seasonality: No strong seasonality was detected in the data, which is expected for yearly-level data where seasonal patterns cannot be observed. IMDb scores seem to reflect long-term trends rather than any repeating cyclic patterns.

Conclusion

By using release_year as a proxy for time, I was able to explore trends in IMDb scores over time. Despite the lack of specific dates, the analysis revealed a statistically significant, though small, decline in scores as the volume of content increased. This suggests that the quality of Netflix content has become more variable, with both highly rated and low-rated content being produced in greater numbers in recent years. Future investigations might focus on specific genres or types of content to see if these trends hold across different categories.

Junaid Ahmed Mohammed

2024-11-17
