Week 12 Data Dive - Time Series

Initialization Step 1 - Load Libraries

#load the data libraries - remove or add as needed
library(tidyverse)   #tools for data science; includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats, and lubridate as core libraries
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)      #loaded to address viz issues, including currency issues
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
options(scipen=999)  #disable scientific notation since high values are used
# Load the tsibble package
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
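
If the startup banners and masking messages above add noise to the knitted report, they can be silenced. A minimal sketch, wrapping the same library calls:

#optional: suppress the attach messages shown above
suppressPackageStartupMessages({
  library(tidyverse)
  library(scales)
  library(tsibble)
})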

Initialization Step 2 - Load Data Set

#clean up the work space
rm(list = ls())
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")

#clean up column names to avoid constant issues with special chars and inaccurate names
# Rename specific columns
colnames(movies)[which(colnames(movies) == "Release Group")] <- "MovieName"
colnames(movies)[which(colnames(movies) == "$Worldwide")] <- "WorldwideRevenue"
colnames(movies)[which(colnames(movies) == "$Domestic")] <- "DomesticRevenue"
colnames(movies)[which(colnames(movies) == "$Foreign")] <- "ForeignRevenue"
colnames(movies)[which(colnames(movies) == "Domestic %")] <- "DomesticPercentage"
colnames(movies)[which(colnames(movies) == "Foreign %")] <- "ForeignPercentage"
colnames(movies)[which(colnames(movies) == "Rank")] <- "RankForYear"

cat(colnames(movies), sep = ", ") 
## RankForYear, MovieName, WorldwideRevenue, DomesticRevenue, DomesticPercentage, ForeignRevenue, ForeignPercentage, Year, Genres, Rating, Vote_Count, Original_Language, Production_Countries, Prime_Genre, Prime_Production_Country, Rating_scale, Rating_of_10
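
The base R renaming above works, but the same result can be had in a single call. A sketch using dplyr::rename (new name = old name; backticks handle the spaces and special characters in the original names):

# equivalent one-call rename (sketch; assumes the original column names)
movies <- movies |>
  rename(
    MovieName          = `Release Group`,
    WorldwideRevenue   = `$Worldwide`,
    DomesticRevenue    = `$Domestic`,
    ForeignRevenue     = `$Foreign`,
    DomesticPercentage = `Domestic %`,
    ForeignPercentage  = `Foreign %`,
    RankForYear        = Rank
  )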

Insights


Task Demonstrations

Weekly Assignment

The purpose of this week’s data dive is for you to explore the time aspect of your data, or the time aspect of some other related dataset.

Your RMarkdown notebook for this data dive should contain the following:

  • Select a column of your data that encodes time (e.g., “date”, “timestamp”, “year”, etc.). Convert this into a Date in R.

    • Note, you may need to use some combination of as.Date, or to_datetime. And, you may even need to paste year, month, day, hour, etc. together using paste (even if you need to make up a month, like "__/01/01").
  • If you do not have a time-based column of data: find a Wikipedia page that is related to your dataset. Then, extract a time series of page views for that page using the Wikipedia page views website or the R package used in this week’s lab.

    • If you choose this option, find ways to tie your results from the below analysis into what you’re seeing with your own data!
  • Choose a column of data to analyze over time. This should be a “response-like” variable that is of particular interest.

  • Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.

    • What stands out immediately?
  • Use linear regression to detect any upwards or downwards trends.

    • Do you need to subset the data for multiple trends?

    • How strong are these trends?

  • Use smoothing to detect at least one season in your data, and interpret your results.

    • Can you illustrate the seasonality using ACF or PACF?

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.

Week 12 Data Dive - Checklist

Purpose

The purpose of this week’s data dive is to explore the time aspect of your data or a related dataset.

Steps

1. Identify Time Column

The movies data set includes only a single time variable, Year. Because the data set contains the top 200 movies for each year, Year behaves more like a categorical value. The solution is to build a time series from the existing data that shows filtered trends over time; the primary limitation is that Year is the lowest level of granularity available. For this analysis we will create a time series examining the annual change in the foreign revenue ratio by genre, containing the year, the prime genre, and the average foreign revenue ratio. Year will be converted to the format YYYY-MM-DD, with defaults of 01 for both month and day. Dates will be left in this format because it is the ISO 8601 format R uses by default.

2. Handle Missing Time Columns

Here is the code for the adjusted time series:

# Calculate Foreign Revenue Percentage and Aggregate by Year and Genre
broad_time_series <- movies |>
  group_by(Year, Prime_Genre) |>
  summarise(
    Avg_Foreign_Percent = mean((ForeignRevenue / WorldwideRevenue) * 100, na.rm = TRUE)   #addressing the challenge of missing values
  ) |>
  mutate(
    Year_Date = as.Date(paste0(Year, "-01-01")) # Convert Year to Date using R default format of YYYY-MM-DD
  )
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
# Convert Prime_Genre to a factor
broad_time_series <- broad_time_series |>
  mutate(Prime_Genre = as.factor(Prime_Genre))


# View the resulting time series
print(broad_time_series)
## # A tibble: 436 × 4
## # Groups:   Year [25]
##     Year Prime_Genre Avg_Foreign_Percent Year_Date 
##    <dbl> <fct>                     <dbl> <date>    
##  1  2000 Action                    34.6  2000-01-01
##  2  2000 Adventure                 50.5  2000-01-01
##  3  2000 Animation                 48.6  2000-01-01
##  4  2000 Comedy                    32.9  2000-01-01
##  5  2000 Crime                     44.1  2000-01-01
##  6  2000 Documentary                4.10 2000-01-01
##  7  2000 Drama                     36.5  2000-01-01
##  8  2000 Family                    36.9  2000-01-01
##  9  2000 Fantasy                   42.0  2000-01-01
## 10  2000 History                   48.0  2000-01-01
## # ℹ 426 more rows
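
Before trusting na.rm = TRUE to quietly drop missing values, it is worth knowing how many there are. A quick sketch counting NAs in the two revenue columns used in the ratio:

# count missing values in the inputs to the foreign-revenue ratio (sketch)
movies |>
  summarise(across(c(ForeignRevenue, WorldwideRevenue), ~ sum(is.na(.x))))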

3. Select a Response Variable

For this analysis we will use Avg_Foreign_Percent as the response variable. We are interested in changes to the foreign percentage value.

4. Create a tsibble Object

Build a tsibble object containing the date column and your selected response variable.

# Create the tsibble object
broad_tsibble <- broad_time_series |>
  as_tsibble(index = Year_Date, key = Prime_Genre)

# View the tsibble object
print(broad_tsibble)
## # A tsibble: 436 x 4 [1D]
## # Key:       Prime_Genre [19]
## # Groups:    Year [25]
##     Year Prime_Genre Avg_Foreign_Percent Year_Date 
##    <dbl> <fct>                     <dbl> <date>    
##  1  2000 Action                     34.6 2000-01-01
##  2  2001 Action                     38.2 2001-01-01
##  3  2002 Action                     43.8 2002-01-01
##  4  2003 Action                     55.1 2003-01-01
##  5  2004 Action                     62.0 2004-01-01
##  6  2005 Action                     60.9 2005-01-01
##  7  2006 Action                     62.3 2006-01-01
##  8  2007 Action                     69.7 2007-01-01
##  9  2008 Action                     65.7 2008-01-01
## 10  2009 Action                     71.0 2009-01-01
## # ℹ 426 more rows
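
Note that the [1D] interval above means tsibble treats the Date index as daily data with implicit gaps between the January firsts. An alternative sketch indexes on the integer year instead, so consecutive years are unit steps:

# alternative: integer year index, one step per year (sketch)
broad_yearly <- broad_time_series |>
  ungroup() |>
  mutate(Year = as.integer(Year)) |>
  as_tsibble(index = Year, key = Prime_Genre)
interval(broad_yearly)  # one step per year, no implicit daily gaps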

5. Plot the Data Over Time

Plot the data over time and experiment with different time windows.

# Visualize time series data using ggplot2
ggplot(broad_tsibble, 
       aes(x = Year_Date, 
           y = Avg_Foreign_Percent, 
           color = Prime_Genre)) +
  geom_line(linewidth = 1.2) +
  scale_x_date(
    date_breaks = "1 year",
    date_labels = "%Y"
  ) +
  scale_y_continuous(
    breaks = seq(0, 100, by = 10)
  ) +
  labs(
    title = "Foreign Revenue Percentage Trends by Genre",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, 
                               hjust = 1),
    panel.grid.major = element_line(color = "grey", 
                                    linetype = "dashed")
  )

  • There are too many genres to make this visualization impactful. The solution is to reduce the number of variables. To focus the discussion we will isolate the graph to one genre, Comedy, which will be easier to analyze. Another option is to drop genre entirely, but for simplicity we will stick to one genre. Future analysis comparing the total foreign revenue trend to specific genre trends might expose other areas of interest; a starting sketch follows below.

    This is an excellent example of why we should minimize the parameters. The amount of data shown in this graph is so distracting that it is nearly impossible to glean any true insight. There is still value in the attempt, though: had clear patterns or trends emerged, they might have taken the analysis in another direction. Since none are apparent, isolating a smaller number of variables seems logical.
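
As a starting point for the genre-comparison idea above, a sketch of the all-genre yearly series (hypothetical name overall_trend) that could later be plotted against individual genres:

# all-genre average foreign revenue percentage per year (sketch)
overall_trend <- movies |>
  group_by(Year) |>
  summarise(Avg_Foreign_Percent = mean(ForeignRevenue / WorldwideRevenue * 100,
                                       na.rm = TRUE),
            .groups = "drop")
head(overall_trend)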
# Filter the tsibble for Comedy movies only
comedy_tsibble <- broad_tsibble |>
  filter(Prime_Genre == "Comedy")

# View the filtered tsibble
print(comedy_tsibble)
## # A tsibble: 25 x 4 [1D]
## # Key:       Prime_Genre [18]
## # Groups:    Year [25]
##     Year Prime_Genre Avg_Foreign_Percent Year_Date 
##    <dbl> <fct>                     <dbl> <date>    
##  1  2000 Comedy                     32.9 2000-01-01
##  2  2001 Comedy                     44.0 2001-01-01
##  3  2002 Comedy                     45.4 2002-01-01
##  4  2003 Comedy                     49.1 2003-01-01
##  5  2004 Comedy                     45.2 2004-01-01
##  6  2005 Comedy                     55.3 2005-01-01
##  7  2006 Comedy                     52.6 2006-01-01
##  8  2007 Comedy                     54.4 2007-01-01
##  9  2008 Comedy                     61.1 2008-01-01
## 10  2009 Comedy                     62.5 2009-01-01
## # ℹ 15 more rows
  • This looks much cleaner.
# Plot the filtered data for Comedy genre
ggplot(comedy_tsibble, aes(x = Year_Date, y = Avg_Foreign_Percent)) +
  geom_line(color = "blue", linewidth = 1.2) +
  scale_x_date(
    date_breaks = "1 year",
    date_labels = "%Y"
  ) +
  scale_y_continuous(
    breaks = seq(0, 100, by = 10)
  ) +
  labs(
    title = "Foreign Revenue Percentage Trends for Comedy Movies",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major = element_line(color = "grey", linetype = "dashed")
  )

There are a few interesting items of note in this graph:

  • There appears to be a notable upward trend in the amount of revenue generated in foreign markets.

  • It’s not clear, but there seems to be a regular downward dip in foreign revenue about every six years, though this trend is broken during the COVID years.

  • There is a pronounced spike in revenue in 2020, 2021, and 2022 with a sharp drop in 2023.

# Filter for a specific time window (2015 to 2024)
filtered_tsibble <- comedy_tsibble |>
  filter(Year_Date >= as.Date("2015-01-01") & Year_Date <= as.Date("2024-12-31"))

# Visualize filtered time series data
ggplot(filtered_tsibble, aes(x = Year_Date, y = Avg_Foreign_Percent, color = Prime_Genre)) +
  geom_line(linewidth = 1.2) +
  labs(
    title = "Foreign Revenue Trends (2015-2024)",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal()

Note key observations and insights from the visualization.

  • Here is a closer look at the COVID years. The drastic spike and drop are very clear here.

    What are possible explanations? It seems reasonable to speculate that as movie theaters were shut down during COVID, other delivery methods such as straight-to-video and streaming services were used to distribute movies. Of similar interest are the sharp drop in 2023 and the evident rebound in 2024. This may reflect a market rebound, possibly driven by distributors attempting to bring audiences back into theaters locally.

6. Identify Seasonality

  • Use smoothing techniques (e.g., moving averages or LOESS) to detect seasonality in your data.
  • Interpret the seasonal patterns you observe.
# Apply a 3-year trailing moving average (rollapply comes from the zoo package;
# fill = NA leaves the first two years as NA by construction)
comedy_tsibble <- comedy_tsibble |>
  mutate(MA_3yr = zoo::rollapply(Avg_Foreign_Percent, 
                                 width = 3, 
                                 FUN = mean, 
                                 align = "right", 
                                 fill = NA))
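
The moving average computed above is never actually plotted; a minimal sketch overlaying it on the raw series:

# overlay the 3-year moving average on the raw series (sketch)
ggplot(comedy_tsibble, aes(x = Year_Date)) +
  geom_line(aes(y = Avg_Foreign_Percent), color = "grey60") +
  geom_line(aes(y = MA_3yr), color = "darkred", linewidth = 1.2, na.rm = TRUE) +
  labs(title = "3-Year Moving Average: Comedy Foreign Revenue Percentage",
       x = "Year", y = "Foreign Revenue Percentage") +
  theme_minimal()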
# Apply LOESS smoothing
ggplot(comedy_tsibble, 
       aes(x = Year_Date, 
           y = Avg_Foreign_Percent)) +
  geom_point(color = "blue") +                         # Raw data points
  geom_smooth(method = "loess", 
              span = 0.2, 
              color = "red", 
              size = 1.2) +                           # LOESS smoothed trend
  labs(
    title = "LOESS Smoothing: Foreign Revenue Percentage for Comedy Movies",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning in sqrt(sum.squares/one.delta): NaNs produced
## Warning in stats::qt(level/2 + 0.5, pred$df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

Analysis:

  • The smoothing line shows a general increase in foreign revenue percentage, suggesting a growing international market for comedy movies.

  • There are occasional spikes indicating periodic surges in foreign revenue, most notably in 2014 and 2015, then in 2020 and 2021.

There seem to be two notable trends: one before 2020 and another after 2020.

#create df for sets
# Filter for Pre-2020 (2000 - 2019)
comedy_pre2020 <- comedy_tsibble |>
  filter(Year_Date < as.Date("2020-01-01"))

# Filter for Post-2020 (2020 - 2024)
comedy_post2020 <- comedy_tsibble |>
  filter(Year_Date >= as.Date("2020-01-01"))

Look at linear regression for each set.

# Linear regression for pre-2020
pre2020_lm <- lm(Avg_Foreign_Percent ~ Year_Date, data = comedy_pre2020)
summary(pre2020_lm)
## 
## Call:
## lm(formula = Avg_Foreign_Percent ~ Year_Date, data = comedy_pre2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1225  -3.1899   0.6766   2.4639   6.4795 
## 
## Coefficients:
##               Estimate Std. Error t value    Pr(>|t|)    
## (Intercept) -2.0192355  7.2492215  -0.279       0.784    
## Year_Date    0.0041089  0.0004972   8.264 0.000000154 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.683 on 18 degrees of freedom
## Multiple R-squared:  0.7914, Adjusted R-squared:  0.7798 
## F-statistic: 68.29 on 1 and 18 DF,  p-value: 0.0000001539
ggplot(comedy_pre2020, aes(x = Year_Date, y = Avg_Foreign_Percent)) +
  geom_point(color = "blue") +  # Raw data points
  geom_smooth(method = "loess", span = 0.2, color = "blue", size = 1.2) +  # LOESS smoothed trend
  labs(
    title = "LOESS Smoothing: Foreign Revenue Trends (Comedy Pre-2020)",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

  • This shows the same gradual increase in foreign revenue from 2000 to 2019.

  • There are regular peaks and valleys that suggest possible periodic trends.

  • Generally, the rises are steeper than the declines.
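
One caution when reading the pre-2020 regression: because the predictor is a Date, the Year_Date coefficient is in percentage points per day. A sketch converting it to an annual rate:

# convert the daily slope to an approximate annual rate (sketch)
coef(pre2020_lm)["Year_Date"] * 365.25  # 0.0041089 * 365.25, approx 1.5 points/year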

# Linear regression for post-2020
post2020_lm <- lm(Avg_Foreign_Percent ~ Year_Date, data = comedy_post2020)
summary(post2020_lm)
## 
## Call:
## lm(formula = Avg_Foreign_Percent ~ Year_Date, data = comedy_post2020)
## 
## Residuals:
##          1          2          3          4          5 
##  -2.444399   9.117638  -0.002072 -17.564475  10.893309 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 406.20094  215.82476   1.882    0.156
## Year_Date    -0.01727    0.01136  -1.521    0.226
## 
## Residual standard error: 13.12 on 3 degrees of freedom
## Multiple R-squared:  0.4353, Adjusted R-squared:  0.247 
## F-statistic: 2.312 on 1 and 3 DF,  p-value: 0.2257
ggplot(comedy_post2020, aes(x = Year_Date, y = Avg_Foreign_Percent)) +
  geom_point(color = "red") +  # Raw data points
  geom_smooth(method = "loess", span = 0.5, color = "red", size = 1.2) +  # LOESS smoothed trend
  labs(
    title = "LOESS Smoothing: Foreign Revenue Trends (Comedy Post-2020)",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : span too small.  fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 18255
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 373.31
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 1.3861e+05
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.  fewer
## data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 18255
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 373.31
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 1.3861e+05
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
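
The pile of warnings above is LOESS reporting fewer data values than degrees of freedom: with only five observations, a local smoother is over-parameterized. A more defensible sketch for this window uses a straight-line fit instead:

# straight-line fit avoids the LOESS warnings on 5 points (sketch)
ggplot(comedy_post2020, aes(x = Year_Date, y = Avg_Foreign_Percent)) +
  geom_point(color = "red") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) +
  labs(
    title = "Linear Fit: Foreign Revenue Trends (Comedy Post-2020)",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal()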

  • This graph is actually very interesting in relation to the COVID pandemic. There was significant growth in foreign revenue at the beginning of the pandemic and then a significant drop near the end. This is consistent with the idea that many film productions were shut down during the pandemic, so there would be fewer movies for release.

7. Validate Seasonality with ACF/PACF

# Compute the ACF
acf(comedy_tsibble$Avg_Foreign_Percent, 
    lag.max = 20, 
    main = "ACF Plot: Comedy Foreign Revenue Percentage")

Analysis - ACF

  • Lags 1, 2, and 3 extend past the confidence bands, indicating statistically significant correlation in short-term revenue trends. This suggests that foreign revenue for comedy movies depends strongly on recent years.

  • The autocorrelations decay gradually, which is characteristic of a long-term trend rather than seasonality.

  • There is a slight increase at lag 6, but since it does not exceed the confidence bands it is not statistically significant. It may be driven by other factors.

# Compute the PACF
pacf(comedy_tsibble$Avg_Foreign_Percent, 
     lag.max = 20, 
     main = "PACF Plot: Comedy Foreign Revenue Percentage")

Analysis - PACF

  • Lag 1 is significant, indicating that last year’s revenue has an impact on this year’s revenue.

  • There do not appear to be any strong intervals that would indicate periodicity in revenue trends.

  • There is a positive spike at lag 6, but it is not strong enough to be significant.

Analysis - ACF/PACF overall

These results indicate that short-term trends are more important in describing the foreign revenue of comedy movies than long-term cyclical behavior. Additionally, while there is a hint of a six-year pattern, it is not strong enough to use for forecasting.
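
As a rough numeric check on these reads, a Ljung-Box test asks whether the autocorrelations up to a given lag are jointly nonzero. A sketch run on the detrended pre-2020 residuals, so the strong upward trend does not dominate the answer:

# Ljung-Box test up to lag 6 on detrended residuals (sketch; stats::Box.test)
Box.test(residuals(pre2020_lm), lag = 6, type = "Ljung-Box")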

Insights

This assignment was a bit of a surprise. I entered it assuming that my data set was not robust enough to support this analysis, and I understand how a larger scale of continuous data would be stronger. What surprised me is that, even restricted to yearly values, we were able to apply the time series tools well enough to produce interesting results. When I filtered the data down to just Comedy, it was definitely easier to interpret, but I was surprised to find some evidence of a six-year cycle. Though the analysis shows it is not statistically significant, it is still interesting and suggests further analysis opportunities. What, for example, would the data show if we did not categorize by genre? Are there cycles in other genres? If we found a statistically significant cycle, what would it suggest about foreign revenue trends?