library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(lubridate)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(zoo)
## Warning: package 'zoo' was built under R version 4.3.3
##
## Attaching package: 'zoo'
##
## The following object is masked from 'package:tsibble':
##
## index
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(xts)
## Warning: package 'xts' was built under R version 4.3.3
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(feasts)
## Warning: package 'feasts' was built under R version 4.3.3
## Loading required package: fabletools
## Warning: package 'fabletools' was built under R version 4.3.3
data <- read_csv("movies_metadata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 45466 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
## dbl (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
## lgl (2): adult, video
## date (1): release_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
For this Data Dive, I’ve decided to analyze how vote counts for movies have changed over time using a time series model. The idea is that user engagement (measured by vote count) can reflect trends in movie popularity and online activity over the years.
Using a time series approach, I will:
Create a tsibble for tidy time series modeling.
Visualize the time series to spot patterns
Fit a linear regression model to check for trends
Apply smoothing techniques
Check for seasonality using ACF/PACF
data <- data |>
mutate(release_date = ymd(release_date),
vote_count = as.numeric(vote_count)) |>
filter(!is.na(release_date), !is.na(vote_count))
data_ts <- data |>
mutate(month = floor_date(release_date, "month")) |>
group_by(month) |>
summarise(total_votes = sum(vote_count, na.rm = TRUE)) |>
as_tsibble(index = month)
data_ts |>
ggplot(aes(x = month, y = total_votes)) +
geom_line() +
labs(title = "Monthly Vote Count Over Time", x = "Month", y = "Vote Count") +
theme_hc()
We can see that somewhere around the 1990s to early 2010s, our vote counts increase significantly.
We hit our peak in votes around 2015 and then hit a drop off, this could be due to new films taking more time to accumulate votes or fewer older movies being included in out data.
data_ts |>
filter_index("1990" ~ "2017") |>
ggplot(aes(x = month, y = total_votes)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Vote Count Linear Trend (1990–2017)", x = "Month", y = "Votes") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
Similar to our previous plot we can see that the time series stays moving in a positive direction.
The linear trend is very strongly positive up until around 2015, this suggests that more users were voting in the 2000s times.
This could be due to more user activity on voting platforms or more movies being watched online.
data_ts |>
filter_index("2000" ~ "2017") |>
ggplot(aes(x = month, y = total_votes)) +
geom_point(size = 0.3, alpha = 0.5) +
geom_smooth(span = 0.3, se = FALSE, color = "blue") +
labs(title = "LOESS Smoothed Vote Count (2000–2017)",
x = "Month", y = "Total Votes") +
theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The LOESS curve confirms our speculations of a strong rise in vote count during the 2000s with a peak in 2015.
It also shows that we have a decline after 2015. This non linear pattern is missed by our linear model and also hints at ups and downs potentially tied to film release seasons.
votes_ts <- ts(data_ts$total_votes, frequency = 12)
decomp <- decompose(votes_ts)
plot(decomp)
acf(votes_ts, main = "ACF for Monthly Vote Count")
pacf(votes_ts, main = "PACF for Monthly Vote Count")
This pattern suggests that the time series is not stationary, there’s a strong overall trend or long-term dependency. It also shows that past values influence future values.
The PACF has a strong spike at lag 1, followed by smaller but noticeable spikes. This implies that the most recent past month explains a large portion of the variation in the current month’s vote count.