#load the data libraries - remove or add as needed
library(tidyverse) #tools for data science; includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats as core libraries.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) #loaded to address viz issues, including currency issues
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
options(scipen=999) #disable scientific notation since high values are used
# Load the tsibble package
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
#clean up the work space
rm(list = ls())
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")
#cleanup of column names to avoid constant issues with special chars and inaccurate names
# Rename specific columns (backticks handle the special characters)
movies <- movies |>
  rename(
    MovieName = `Release Group`,
    WorldwideRevenue = `$Worldwide`,
    DomesticRevenue = `$Domestic`,
    ForeignRevenue = `$Foreign`,
    DomesticPercentage = `Domestic %`,
    ForeignPercentage = `Foreign %`,
    RankForYear = Rank
  )
cat(colnames(movies), sep = ", ")
## RankForYear, MovieName, WorldwideRevenue, DomesticRevenue, DomesticPercentage, ForeignRevenue, ForeignPercentage, Year, Genres, Rating, Vote_Count, Original_Language, Production_Countries, Prime_Genre, Prime_Production_Country, Rating_scale, Rating_of_10
Weekly Assignment

The purpose of this week's data dive is for you to explore the time aspect of your data, or the time aspect of some other related dataset.

Your RMarkdown notebook for this data dive should contain the following:

- Select a column of your data that encodes time (e.g., "date", "timestamp", "year", etc.). Convert this into a Date in R.
  - Note, you may need to use some combination of as.Date or to_datetime. And, you may even need to paste year, month, day, hour, etc. together using paste (even if you need to make up a month, like "__/01/01").
- If you do not have a time-based column of data: find a Wikipedia page that is related to your dataset. Then, extract a time series of page views for that page using the Wikipedia page views website or the R package used in this week's lab.
  - If you choose this option, find ways to tie your results from the below analysis into what you're seeing with your own data!
- Choose a column of data to analyze over time. This should be a "response-like" variable that is of particular interest.
- Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.
  - What stands out immediately?
- Use linear regression to detect any upwards or downwards trends.
  - Do you need to subset the data for multiple trends?
  - How strong are these trends?
- Use smoothing to detect at least one season in your data, and interpret your results.
  - Can you illustrate the seasonality using ACF or PACF?

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
The movies data set includes only a single time variable, Year. Because the data set contains the top 200 movies for each year, Year functions more as a categorical value. The solution is to build a time series from the existing data that shows filtered trends over time; the primary limitation is that Year is the lowest level of granularity available for analysis. For this analysis we will create a time series examining the annual change in the foreign revenue ratio by genre. The time series will include the year, the prime genre, and the average foreign revenue ratio. Year will be stored in YYYY-MM-DD format, with 01 as the default for both month and day. The dates will be left in this format, as it is the ISO 8601 format R uses by default.

Here is the code for the adjusted time series:
# Calculate Foreign Revenue Percentage and Aggregate by Year and Genre
broad_time_series <- movies |>
group_by(Year, Prime_Genre) |>
summarise(
Avg_Foreign_Percent = mean((ForeignRevenue / WorldwideRevenue) * 100, na.rm = TRUE) # na.rm = TRUE guards against missing revenue values
) |>
mutate(
Year_Date = as.Date(paste0(Year, "-01-01")) # Convert Year to Date using R default format of YYYY-MM-DD
)
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
# Convert Prime_Genre to a factor
broad_time_series <- broad_time_series |>
mutate(Prime_Genre = as.factor(Prime_Genre))
# View the resulting time series
print(broad_time_series)
## # A tibble: 436 Ă— 4
## # Groups: Year [25]
## Year Prime_Genre Avg_Foreign_Percent Year_Date
## <dbl> <fct> <dbl> <date>
## 1 2000 Action 34.6 2000-01-01
## 2 2000 Adventure 50.5 2000-01-01
## 3 2000 Animation 48.6 2000-01-01
## 4 2000 Comedy 32.9 2000-01-01
## 5 2000 Crime 44.1 2000-01-01
## 6 2000 Documentary 4.10 2000-01-01
## 7 2000 Drama 36.5 2000-01-01
## 8 2000 Family 36.9 2000-01-01
## 9 2000 Fantasy 42.0 2000-01-01
## 10 2000 History 48.0 2000-01-01
## # ℹ 426 more rows
For this analysis we will use Avg_Foreign_Percent as the response variable. We are interested in changes to the foreign percentage value.
# Create the tsibble object
broad_tsibble <- broad_time_series |>
as_tsibble(index = Year_Date, key = Prime_Genre)
# View the tsibble object
print(broad_tsibble)
## # A tsibble: 436 x 4 [1D]
## # Key: Prime_Genre [19]
## # Groups: Year [25]
## Year Prime_Genre Avg_Foreign_Percent Year_Date
## <dbl> <fct> <dbl> <date>
## 1 2000 Action 34.6 2000-01-01
## 2 2001 Action 38.2 2001-01-01
## 3 2002 Action 43.8 2002-01-01
## 4 2003 Action 55.1 2003-01-01
## 5 2004 Action 62.0 2004-01-01
## 6 2005 Action 60.9 2005-01-01
## 7 2006 Action 62.3 2006-01-01
## 8 2007 Action 69.7 2007-01-01
## 9 2008 Action 65.7 2008-01-01
## 10 2009 Action 71.0 2009-01-01
## # ℹ 426 more rows
# Visualize time series data using ggplot2
ggplot(broad_tsibble,
aes(x = Year_Date,
y = Avg_Foreign_Percent,
color = Prime_Genre)) +
geom_line(linewidth = 1.2) +
scale_x_date(
date_breaks = "1 year",
date_labels = "%Y"
) +
scale_y_continuous(
breaks = seq(0, 100, by = 10)
) +
labs(
title = "Foreign Revenue Percentage Trends by Genre",
x = "Year",
y = "Foreign Revenue Percentage"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45,
hjust = 1),
panel.grid.major = element_line(color = "grey",
linetype = "dashed")
)
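With 19 overlapping genre lines, the combined plot is hard to read. One optional alternative view (a sketch, not part of the original analysis) is to facet by genre:

# Optional alternative: one small panel per genre
ggplot(broad_tsibble, aes(x = Year_Date, y = Avg_Foreign_Percent)) +
  geom_line(color = "steelblue") +
  facet_wrap(~ Prime_Genre) +
  labs(
    title = "Foreign Revenue Percentage Trends by Genre",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal()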
# Filter the tsibble for Comedy movies only
comedy_tsibble <- broad_tsibble |>
filter(Prime_Genre == "Comedy")
# View the filtered tsibble
print(comedy_tsibble)
## # A tsibble: 25 x 4 [1D]
## # Key: Prime_Genre [18]
## # Groups: Year [25]
## Year Prime_Genre Avg_Foreign_Percent Year_Date
## <dbl> <fct> <dbl> <date>
## 1 2000 Comedy 32.9 2000-01-01
## 2 2001 Comedy 44.0 2001-01-01
## 3 2002 Comedy 45.4 2002-01-01
## 4 2003 Comedy 49.1 2003-01-01
## 5 2004 Comedy 45.2 2004-01-01
## 6 2005 Comedy 55.3 2005-01-01
## 7 2006 Comedy 52.6 2006-01-01
## 8 2007 Comedy 54.4 2007-01-01
## 9 2008 Comedy 61.1 2008-01-01
## 10 2009 Comedy 62.5 2009-01-01
## # ℹ 15 more rows
# Plot the filtered data for Comedy genre
ggplot(comedy_tsibble, aes(x = Year_Date, y = Avg_Foreign_Percent)) +
geom_line(color = "blue", linewidth = 1.2) +
scale_x_date(
date_breaks = "1 year",
date_labels = "%Y"
) +
scale_y_continuous(
breaks = seq(0, 100, by = 10)
) +
labs(
title = "Foreign Revenue Percentage Trends for Comedy Movies",
x = "Year",
y = "Foreign Revenue Percentage"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.major = element_line(color = "grey", linetype = "dashed")
)
There are a few interesting items of note in this graph:

- There appears to be a notable upward trend in the amount of revenue generated in foreign markets.
- It's not conclusive, but there seems to be a regular downward dip in foreign revenue about every six years, though this pattern is broken during the COVID years (a quick numeric check follows below).
- There is a pronounced spike in revenue in 2020, 2021, and 2022, followed by a sharp drop in 2023.
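As a quick numeric check on the dip claim (an addition, not part of the original analysis), we can list the years in which the comedy foreign share fell:

# Hypothetical check: which years saw a year-over-year decline?
comedy_tsibble |>
  as_tibble() |> # drop the tsibble/grouping structure for plain dplyr work
  ungroup() |>
  arrange(Year) |>
  mutate(YoY_Change = Avg_Foreign_Percent - lag(Avg_Foreign_Percent)) |>
  filter(YoY_Change < 0) |>
  select(Year, Avg_Foreign_Percent, YoY_Change)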
# Filter for a specific time window (2015 through 2024)
filtered_tsibble <- comedy_tsibble |>
filter(Year_Date >= as.Date("2015-01-01") & Year_Date <= as.Date("2024-12-31"))
# Visualize filtered time series data
ggplot(filtered_tsibble, aes(x = Year_Date, y = Avg_Foreign_Percent, color = Prime_Genre)) +
geom_line(linewidth = 1.2) +
labs(
title = "Foreign Revenue Trends (2015-2024)",
x = "Year",
y = "Foreign Revenue Percentage"
) +
theme_minimal()
# Apply linear regression to Comedy time series
comedy_lm <- lm(Avg_Foreign_Percent ~ Year_Date, data = comedy_tsibble)
# Summarize the linear model results
summary(comedy_lm)
##
## Call:
## lm(formula = Avg_Foreign_Percent ~ Year_Date, data = comedy_tsibble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.9166 -3.5843 0.3079 4.5026 18.4168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.485637 9.665172 -0.257 0.799
## Year_Date 0.004167 0.000621 6.710 0.000000761 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.178 on 23 degrees of freedom
## Multiple R-squared: 0.6619, Adjusted R-squared: 0.6472
## F-statistic: 45.03 on 1 and 23 DF, p-value: 0.0000007612
- The Year_Date coefficient is 0.004167. Since Year_Date is a Date, this slope is in percentage points per day, roughly 1.5 percentage points per year, which indicates a steady upward trend in the foreign revenue share of Comedy movies since 2000.
- The Year_Date p-value is 0.000000761, which is highly significant and indicates the trend is very unlikely to be random noise.
- The R-squared value is 0.6619, meaning 66.19% of the variance is explained by the model. It's not perfect, but it suggests a moderately strong trend.
- The adjusted R-squared is slightly lower at 64.72%, which still supports the trend.
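Because the slope is per day, a quick conversion (not part of the original output) makes it easier to read:

# The Year_Date slope is "per day"; scale it to a yearly rate
coef(comedy_lm)[["Year_Date"]] * 365.25
# ~0.004167 * 365.25 = ~1.52 percentage points per year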
There look to be a few opportunities for subsetting:

- Look at the data from 2000 through 2019 separately from 2020 through 2024, as the pattern changes significantly.
- Possibly create 6-year subsets to see if there are any trends around the dip that appears to recur roughly every six years (a sketch of this idea follows below).
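Here is a minimal sketch of the 6-year-subset idea, fitting a separate linear trend inside each block. The window boundaries are illustrative assumptions, not part of the analysis above.

# Hypothetical sketch: slope of the foreign revenue share within each 6-year block
comedy_tsibble |>
  as_tibble() |> # drop the tsibble/grouping structure for plain dplyr grouping
  ungroup() |>
  mutate(Window = cut(Year,
                      breaks = c(2000, 2006, 2012, 2018, 2025),
                      labels = c("2000-2005", "2006-2011", "2012-2017", "2018-2024"),
                      include.lowest = TRUE, right = FALSE)) |>
  group_by(Window) |>
  summarise(Slope_per_year = coef(lm(Avg_Foreign_Percent ~ Year))[["Year"]],
            .groups = "drop")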
# Apply a 3-year moving average (rollapply() comes from the zoo package)
library(zoo)
comedy_tsibble <- comedy_tsibble |>
  ungroup() |> # drop the Year grouping so the window runs over the whole series
  mutate(MA_3yr = if (n() >= 3) { # `if`, not the vectorized ifelse(), which would return a single value here
    rollapply(Avg_Foreign_Percent,
              width = 3,
              FUN = mean,
              align = "right",
              fill = NA)
  } else {
    NA_real_ # guard for series shorter than the window
  })
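The moving average is computed but never plotted above; a quick overlay sketch (using the objects already defined) could look like this:

# Sketch: overlay the 3-year moving average on the raw series
ggplot(comedy_tsibble, aes(x = Year_Date)) +
  geom_line(aes(y = Avg_Foreign_Percent), color = "grey60") +
  geom_line(aes(y = MA_3yr), color = "red", linewidth = 1.2, na.rm = TRUE) +
  labs(
    title = "3-Year Moving Average: Comedy Foreign Revenue Percentage",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal()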
# Apply LOESS smoothing
ggplot(comedy_tsibble,
aes(x = Year_Date,
y = Avg_Foreign_Percent)) +
geom_point(color = "blue") + # Raw data points
geom_smooth(method = "loess",
            span = 0.2,
            se = FALSE, # the SE band is unreliable at this small span (it was producing NaN warnings)
            color = "red",
            linewidth = 1.2) + # LOESS smoothed trend
labs(
title = "LOESS Smoothing: Foreign Revenue Percentage for Comedy Movies",
x = "Year",
y = "Foreign Revenue Percentage"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Analysis:

- The smoothing line shows a general increase in foreign revenue percentage, suggesting growing international appeal for comedy movies.
- There are occasional spikes indicating periodic surges in foreign revenue, most notably in 2014 and 2015, then in 2020 and 2021.
- There appear to be two notable trends, one before 2020 and another after 2020.
# Create data frames for the pre- and post-2020 subsets
# Filter for Pre-2020 (2000 - 2019)
comedy_pre2020 <- comedy_tsibble |>
filter(Year_Date < as.Date("2020-01-01"))
# Filter for Post-2020 (2020 - 2024)
comedy_post2020 <- comedy_tsibble |>
filter(Year_Date >= as.Date("2020-01-01"))
Next, look at the linear regression for each subset.
# Linear regression for pre-2020
pre2020_lm <- lm(Avg_Foreign_Percent ~ Year_Date, data = comedy_pre2020)
summary(pre2020_lm)
##
## Call:
## lm(formula = Avg_Foreign_Percent ~ Year_Date, data = comedy_pre2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1225 -3.1899 0.6766 2.4639 6.4795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.0192355 7.2492215 -0.279 0.784
## Year_Date 0.0041089 0.0004972 8.264 0.000000154 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.683 on 18 degrees of freedom
## Multiple R-squared: 0.7914, Adjusted R-squared: 0.7798
## F-statistic: 68.29 on 1 and 18 DF, p-value: 0.0000001539
ggplot(comedy_pre2020, aes(x = Year_Date, y = Avg_Foreign_Percent)) +
geom_point(color = "blue") + # Raw data points
geom_smooth(method = "loess", span = 0.2, se = FALSE, color = "blue", linewidth = 1.2) + # LOESS smoothed trend (SE band off: too few points per window)
labs(
title = "LOESS Smoothing: Foreign Revenue Trends (Comedy Pre-2020)",
x = "Year",
y = "Foreign Revenue Percentage"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
- This shows the same gradual increase in foreign revenue from 2000 to 2019.
- There are regular peaks and valleys that suggest possible periodic trends.
- Generally, the rises are steeper than the declines.
# Linear regression for post-2020
post2020_lm <- lm(Avg_Foreign_Percent ~ Year_Date, data = comedy_post2020)
summary(post2020_lm)
##
## Call:
## lm(formula = Avg_Foreign_Percent ~ Year_Date, data = comedy_post2020)
##
## Residuals:
## 1 2 3 4 5
## -2.444399 9.117638 -0.002072 -17.564475 10.893309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 406.20094 215.82476 1.882 0.156
## Year_Date -0.01727 0.01136 -1.521 0.226
##
## Residual standard error: 13.12 on 3 degrees of freedom
## Multiple R-squared: 0.4353, Adjusted R-squared: 0.247
## F-statistic: 2.312 on 1 and 3 DF, p-value: 0.2257
# Plot the post-2020 subset; with only 5 observations, LOESS is unstable
# (it produced singularity warnings), so a linear fit is used instead
ggplot(comedy_post2020, aes(x = Year_Date, y = Avg_Foreign_Percent)) +
  geom_point(color = "red") + # Raw data points
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) + # linear trend
  labs(
    title = "Linear Trend: Foreign Revenue Trends (Comedy Post-2020)",
    x = "Year",
    y = "Foreign Revenue Percentage"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Compute the ACF
acf(comedy_tsibble$Avg_Foreign_Percent,
lag.max = 20,
main = "ACF Plot: Comedy Foreign Revenue Percentage")
- Lags 1, 2, and 3 extend past the confidence bands, indicating statistically significant autocorrelation in short-term revenue trends. This suggests that foreign revenue for comedy movies depends strongly on recent years.
- The ACF decays gradually, which is consistent with a persistent long-term trend.
- There is a slight increase at lag 6, but since it does not exceed the confidence bands it is not statistically significant; it may be driven by other factors.
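An optional follow-up (not run in the original): differencing removes the long-term trend, so any genuine 6-year cycle should stand out more clearly in the ACF of the year-over-year changes.

# Hypothetical check: ACF after removing the trend by differencing
acf(diff(comedy_tsibble$Avg_Foreign_Percent),
    lag.max = 20,
    main = "ACF Plot: Year-over-Year Changes, Comedy Foreign Revenue %")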
# Compute the PACF
pacf(comedy_tsibble$Avg_Foreign_Percent,
lag.max = 20,
main = "PACF Plot: Comedy Foreign Revenue Percentage")
- Lag 1 is significant, indicating that last year's revenue share influences this year's.
- There do not appear to be any strong intervals that would indicate periodicity in revenue trends.
- There is a positive spike at lag 6, but it is not strong enough to be significant.

These results indicate that short-term trends explain the foreign revenue of Comedy movies better than long-term cyclical behavior does. Additionally, while there is a slight 6-year pattern, it is not strong enough to use for forecasting.
This assignment was a bit of a surprise. I entered it assuming that my data set was not robust enough to support this analysis. I now see how a larger span of continuous, fine-grained timestamps would be stronger, but I was surprised that even restricted to Year values we were able to apply the time series tools well enough to produce interesting results. Filtering the data down to just Comedy made it easier to interpret, and finding some evidence of a 6-year cycle was unexpected. Though the analysis shows it is not statistically significant, it is still interesting and suggests further opportunities for analysis: What would the data show if we didn't categorize by genre? Are there cycles in other genres? And if we found a statistically significant cycle, what would it suggest about foreign revenue trends?