I’m working with a Netflix dataset that contains several variables, among them the release year and IMDb score. Since I only have release_year and not a full date, I’ll treat release_year as a numeric variable to analyze trends over time.
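# Load the packages used throughout (dplyr, tidyr, ggplot2)
library(tidyverse)
# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")
# Drop rows with a missing IMDb score or release year
netflix_data <- netflix_data %>%
  drop_na(imdb_score, release_year)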
# Treat release_year as numeric since it's just a year without a full date
netflix_data$release_year <- as.numeric(netflix_data$release_year)
# Check the structure of the dataset
str(netflix_data)
## 'data.frame': 5283 obs. of 15 variables:
## $ id : chr "tm84618" "tm127384" "tm70993" "tm190788" ...
## $ title : chr "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" "The Exorcist" ...
## $ type : chr "MOVIE" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ "12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area o"| __truncated__ ...
## $ release_year : num 1976 1975 1979 1973 1969 ...
## $ age_certification : chr "R" "PG" "R" "R" ...
## $ runtime : int 113 91 94 133 30 102 170 104 110 117 ...
## $ genres : chr "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" "['horror']" ...
## $ production_countries: chr "['US']" "['GB']" "['GB']" "['US']" ...
## $ seasons : num NA NA NA NA 4 NA NA NA NA NA ...
## $ imdb_id : chr "tt0075314" "tt0071853" "tt0079470" "tt0070047" ...
## $ imdb_score : num 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 7.3 ...
## $ imdb_votes : num 795222 530877 392419 391942 72895 ...
## $ tmdb_popularity : num 27.6 18.2 17.5 95.3 12.9 ...
## $ tmdb_score : num 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 7.1 ...
I have data spanning multiple years but only at the year level. By using just the release year, I can still detect trends and patterns over time without needing a full date. Although there are limitations to not having specific months or days, treating release_year as a numeric variable allows me to analyze long-term trends effectively.
For this analysis, I’ve chosen IMDb scores (imdb_score) as my response variable. I’m interested in seeing how IMDb scores for Netflix titles have evolved over time, and whether specific periods show different trends.
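tsibble Object
To manage time-based data effectively, I’ll build a time-ordered table from release_year and imdb_score. Strictly speaking, the code below produces a plain data frame sorted by year rather than a true tsibble, because a tsibble requires each index value to be unique per key and many titles share a release year. The sorted table is still enough to explore time-series patterns and trends.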
# No need to convert to Date, using release_year as numeric
netflix_data$release_year <- as.numeric(netflix_data$release_year)
netflix_tsibble <- netflix_data %>%
filter(!is.na(imdb_score)) %>% # Remove missing IMDb scores
select(release_year, imdb_score) %>%
arrange(release_year)
# View the structure of the data
glimpse(netflix_tsibble)
## Rows: 5,283
## Columns: 2
## $ release_year <dbl> 1953, 1954, 1954, 1956, 1958, 1959, 1960, 1961, 1962, 196…
## $ imdb_score <dbl> 6.8, 7.5, 7.4, 6.7, 7.5, 6.7, 6.4, 7.5, 6.8, 7.6, 7.8, 7.…
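The resulting table holds just the release year and IMDb score of each title, sorted by year, since I don’t have specific dates. It is not yet a true tsibble (no as_tsibble() call is made, and years repeat across titles), but this year-ordered structure lets me proceed with the analysis, focusing on yearly changes in IMDb ratings.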
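A minimal sketch of how an actual tsibble could be built, assuming the scores are first aggregated to one row per year so that the index is unique. The toy data frame and the names toy, yearly_scores, mean_score and n_titles are made up for illustration and are not part of my analysis.

```r
library(dplyr)
library(tsibble)

# Toy stand-in for netflix_data (made-up values, for illustration only)
toy <- data.frame(
  release_year = c(2018, 2018, 2019, 2019, 2019, 2020),
  imdb_score   = c(7.0, 6.0, 6.5, 7.5, 5.5, 6.8)
)

# Collapse to one row per year, then declare release_year as the time index
yearly_scores <- toy %>%
  group_by(release_year) %>%
  summarise(
    mean_score = mean(imdb_score),  # average IMDb score of that year's titles
    n_titles   = n(),               # number of titles behind the average
    .groups    = "drop"
  ) %>%
  mutate(release_year = as.integer(release_year)) %>%
  as_tsibble(index = release_year)

yearly_scores
```

On the real data, years with no titles would leave gaps in the index; tsibble's fill_gaps() is one option for making those gaps explicit before any time-series modelling.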
I’ll plot IMDb scores over time to get a visual understanding of any trends or patterns. This helps me identify any immediate changes in scores over different years.
# Plot IMDb scores over time
ggplot(netflix_tsibble, aes(x = release_year, y = imdb_score)) +
geom_line(color = "blue") +
labs(title = "IMDb Scores Over Time", x = "Release Year", y = "IMDb Score") +
theme_minimal()
Explanation of Initial Plot
The plot shows that IMDb scores fluctuate over time. Early in the dataset there are few titles per year, which may explain the more consistently high scores. In recent years, the much larger volume of content produces a broader range of scores, with both very high and very low ratings. Note that geom_line() connects every title in order of release year, so years with many titles appear as dense vertical bands rather than as a single yearly value.
Next, I’ll apply linear regression to quantify any overall trends in IMDb scores over time.
# Perform linear regression to detect trends
trend_model <- lm(imdb_score ~ release_year, data = netflix_data)
# View the summary of the model
summary(trend_model)
##
## Call:
## lm(formula = imdb_score ~ release_year, data = netflix_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0315 -0.6990 0.1173 0.8173 3.1010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.298788 4.360640 9.012 < 2e-16 ***
## release_year -0.016254 0.002163 -7.514 6.7e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.155 on 5281 degrees of freedom
## Multiple R-squared: 0.01058, Adjusted R-squared: 0.01039
## F-statistic: 56.46 on 1 and 5281 DF, p-value: 6.7e-14
Regression Results:
Intercept: 39.30
Release Year Coefficient: -0.0163 (p-value = 6.7e-14)
The linear regression reveals a statistically significant negative trend in IMDb scores over time. The negative coefficient for release_year (-0.0163) indicates that for each additional year, the IMDb score decreases by about 0.016 points on average. The effect is small, though: with an R-squared of 0.0106, release year explains only about 1% of the variance in scores. This trend suggests that the average quality of Netflix titles, as reflected by IMDb scores, is lower for more recent releases. One potential explanation is Netflix’s rapid expansion, which introduced a much larger variety of content. While some titles remain critically acclaimed, the broader production slate includes more lower-rated content, pulling the average IMDb score down.
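To make the size of this slope more concrete, here is a small worked calculation using only the coefficients printed in the summary(trend_model) output above. The helper predict_score and the variable names are my own, introduced just for this illustration.

```r
# Coefficients copied from the summary(trend_model) output above
intercept <- 39.298788
slope     <- -0.016254

# Fitted IMDb score for a given release year under the linear model
predict_score <- function(year) intercept + slope * year

change_per_decade <- slope * 10         # about -0.16 points per 10 years
fitted_2000 <- predict_score(2000)      # about 6.79
fitted_2020 <- predict_score(2020)      # about 6.47

change_per_decade
fitted_2000
fitted_2020
```

In other words, the fitted line drops by roughly a third of a point between 2000 and 2020, which is modest next to the residual standard error of 1.155.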
# Plot the trend line on the original plot
ggplot(netflix_tsibble, aes(x = release_year, y = imdb_score)) +
geom_point(color = "blue", alpha = 0.5) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "IMDb Scores Over Time with Trend Line", x = "Release Year", y = "IMDb Score") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The red trend line on the plot further visualizes this decline in IMDb scores.
Given the significant increase in the volume of content in recent years, I’ll split the data into two periods: pre-2000 and post-2000. This allows me to check if trends differ between earlier and more recent content.
# Subset data for two time periods
pre_2000 <- netflix_data %>% filter(release_year < 2000)
post_2000 <- netflix_data %>% filter(release_year >= 2000)
# Perform regression for both periods
pre_2000_model <- lm(imdb_score ~ release_year, data = pre_2000)
post_2000_model <- lm(imdb_score ~ release_year, data = post_2000)
# View summaries of both models
summary(pre_2000_model)
##
## Call:
## lm(formula = imdb_score ~ release_year, data = pre_2000)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7244 -0.6445 0.0762 0.8455 2.2885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.238221 14.913387 2.028 0.0438 *
## release_year -0.011825 0.007501 -1.577 0.1164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.158 on 214 degrees of freedom
## Multiple R-squared: 0.01148, Adjusted R-squared: 0.006862
## F-statistic: 2.486 on 1 and 214 DF, p-value: 0.1164
summary(post_2000_model)
##
## Call:
## lm(formula = imdb_score ~ release_year, data = post_2000)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0625 -0.6920 0.1138 0.8138 3.1080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.673464 7.771406 9.995 <2e-16 ***
## release_year -0.035273 0.003853 -9.155 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.151 on 5065 degrees of freedom
## Multiple R-squared: 0.01628, Adjusted R-squared: 0.01608
## F-statistic: 83.82 on 1 and 5065 DF, p-value: < 2.2e-16
Explanation of Subsetting Results
Pre-2000:
Coefficient for release_year: -0.0118 (p-value = 0.1164)
No statistically significant trend.
Post-2000:
Coefficient for release_year: -0.0353 (p-value < 2e-16)
A much steeper, statistically significant downward trend.
The two periods behave differently:
Pre-2000: The trend is not statistically significant, likely in part because of the small number of titles in this period (216, versus 5,067 from 2000 onward). IMDb scores remained relatively stable before Netflix’s content explosion.
Post-2000: There is a significant downward trend in scores, with a larger negative coefficient (-0.0353). The broader variety of content in recent years, ranging from high-quality originals to lower-budget or less critically successful titles, is a plausible driver of this steeper decline.
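One slope being significant and the other not does not by itself show that the two slopes differ. As a rough check, the two estimates can be compared directly using only the coefficients and standard errors printed above. This is a sketch under two assumptions of mine: the two subsets are independent samples, and a normal approximation is adequate. The variable names are hypothetical.

```r
# Slopes and standard errors copied from the two model summaries above
slope_pre  <- -0.011825; se_pre  <- 0.007501
slope_post <- -0.035273; se_post <- 0.003853

# Difference in slopes and its standard error (independent samples assumed)
slope_diff <- slope_post - slope_pre
se_diff    <- sqrt(se_pre^2 + se_post^2)

# z statistic and two-sided p-value under a normal approximation
z_stat  <- slope_diff / se_diff      # about -2.78
p_value <- 2 * pnorm(-abs(z_stat))

z_stat
p_value
```

An alternative that stays closer to the regression framework would be a single model with an interaction between release_year and a pre/post-2000 indicator, where the interaction term tests the same difference.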
We can create two separate plots, one for titles released before 2000 and one for titles released from 2000 onward. These plots will help visualize the trends and patterns in IMDb scores for each time period.
# Plot IMDb scores for Pre-2000
ggplot(pre_2000, aes(x = release_year, y = imdb_score)) +
geom_point(color = "blue", alpha = 0.5) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "IMDb Scores (Pre-2000)", x = "Release Year", y = "IMDb Score") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Plot IMDb scores for Post-2000
ggplot(post_2000, aes(x = release_year, y = imdb_score)) +
geom_point(color = "blue", alpha = 0.5) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "IMDb Scores (Post-2000)", x = "Release Year", y = "IMDb Score") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Explanation
These plots show the differences in trends before and after 2000. The pre-2000 plot has far fewer data points and a gently downward trend line whose slope is not statistically significant, while the post-2000 plot reveals a more pronounced downward trend, consistent with the larger volume of content produced in recent years.
Plot a Bar Graph for the Count of Titles Each Year
To better understand the distribution of Netflix titles over time, I’ll plot a bar graph showing the number of titles released in each year.
# Count the number of titles per year
title_counts <- netflix_data %>%
group_by(release_year) %>%
summarise(count = n())
# Bar plot showing the number of titles per year
ggplot(title_counts, aes(x = release_year, y = count)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Number of Netflix Titles Released Each Year", x = "Release Year", y = "Count of Titles") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Explanation
The bar plot will show the growth in Netflix’s content library over time. We expect to see a sharp rise in the number of titles released per year, particularly after 2000, which explains the increasing variation in IMDb scores during this period. This also supports the insight that more recent years have a wider range of IMDb ratings, driven by the sheer volume of content.
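The imbalance between the two periods can also be read off the regression output without plotting anything. The sketch below recovers the number of titles in each period from the residual degrees of freedom reported in the pre-2000 and post-2000 summaries; the variable names are my own.

```r
# Residual degrees of freedom from the pre-2000 and post-2000 summaries
df_pre  <- 214
df_post <- 5065

# A simple regression estimates 2 parameters (intercept and slope),
# so the number of observations is the residual df plus 2
n_pre  <- df_pre + 2     # 216 titles before 2000
n_post <- df_post + 2    # 5067 titles from 2000 onward

# Share of all titles released in 2000 or later, in percent
share_post <- 100 * n_post / (n_pre + n_post)
round(share_post, 1)     # 95.9
```

The two counts add up to the 5283 rows reported by str(), so about 96% of the titles in the dataset were released in 2000 or later.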
I’ll now apply LOESS smoothing to detect more nuanced patterns in the data and explore whether seasonality might be present.
# Smoothing the data
ggplot(netflix_tsibble, aes(x = release_year, y = imdb_score)) +
geom_point(color = "blue", alpha = 0.5) +
geom_smooth(method = "loess", color = "green") +
labs(title = "Smoothed IMDb Scores Over Time", x = "Release Year", y = "IMDb Score") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# ACF and PACF to check for seasonality
acf(netflix_tsibble$imdb_score, na.action = na.pass, main = "ACF of IMDb Scores")
pacf(netflix_tsibble$imdb_score, na.action = na.pass, main = "PACF of IMDb Scores")
Explanation:
The LOESS smoothing curve highlights fluctuations in IMDb scores that the linear model couldn’t capture. For instance, there may be certain periods where scores rise temporarily before falling again.
The Autocorrelation (ACF) and Partial Autocorrelation (PACF) plots don’t show strong seasonality, which is expected: with yearly data, within-year seasonal cycles cannot appear at all. The absence of clear autocorrelation patterns suggests that IMDb scores don’t exhibit cyclical or repeating patterns year-over-year, but rather reflect more general trends.
Insights
In the plot analyzing IMDb scores over time, there is no clear seasonality. Unlike datasets such as sales or retail data that typically experience cyclic patterns (e.g., spikes during holiday seasons), the nature of Netflix’s content production and viewer ratings does not follow a similar cyclical trend.
For instance, page views for Netflix on Wikipedia exhibit no seasonal patterns either, as the streaming industry is continuous and not driven by specific seasons. Similarly, in my dataset, IMDb scores reflect long-term trends rather than seasonal fluctuations. Even if the data had specific day/month fields, it likely wouldn’t change the outcome because IMDb scores are not tied to specific months of the year like sales or holiday-related activities.
By using just year values, I focus on analyzing long-term patterns in the IMDb ratings, which provides insights into Netflix’s content evolution over time without being influenced by short-term or seasonal variations.
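One caveat about the ACF and PACF calls used earlier: they run on one row per title, so the lags count titles (in whatever order they happen to appear within a year), not years. A sketch of an alternative, assuming the scores are first averaged per year, is shown here. The toy data and the names toy, yearly_mean, yearly_ts and acf_yearly are made up for illustration.

```r
# Toy stand-in for netflix_data (random made-up values, for illustration only)
set.seed(1)
toy <- data.frame(
  release_year = rep(2000:2019, each = 5),
  imdb_score   = round(runif(100, min = 4, max = 9), 1)
)

# One value per year: the mean IMDb score of that year's titles
yearly_mean <- aggregate(imdb_score ~ release_year, data = toy, FUN = mean)

# Regular annual series (assumes no missing years between first and last)
yearly_ts <- ts(yearly_mean$imdb_score,
                start = min(yearly_mean$release_year),
                frequency = 1)

# Autocorrelation at yearly lags; plot(acf_yearly) would draw the correlogram
acf_yearly <- acf(yearly_ts, plot = FALSE)
acf_yearly
```

With this setup, lag 1 compares each year's average with the previous year's, which matches the year-over-year question I am asking.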
Trend: IMDb scores have trended downward over time, especially in the post-2000 period. This is likely due to Netflix’s large-scale content production, leading to a broader range of IMDb ratings.
Periods of Change: Pre-2000, IMDb scores were more stable, while post-2000, the explosion of content has contributed to a wider distribution of ratings, including more low-rated content, which may pull down the overall average.
Seasonality: No strong seasonality was detected in the data, which is expected for yearly-level data where within-year cycles cannot appear. IMDb scores seem to reflect long-term trends rather than any repeating seasonal patterns.
By using release_year as a proxy for time, I was able to effectively explore trends in IMDb scores over time. Despite the lack of specific dates, the analysis revealed significant patterns, including a general decline in scores as the volume of content increased. This suggests that the quality of Netflix content has become more variable, with both highly-rated and low-rated content being produced in greater numbers in recent years. Future investigations might focus on specific genres or types of content to see if these trends hold across different categories.