#load the data libraries - remove or add as needed
library(tidyverse) #tools for data science; includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and forcats as core libraries.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) #loaded for visualization formatting, including currency labels
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
options(scipen=999) #disable scientific notation since high values are used
#clean up the work space
rm(list = ls())
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")
#cleanup of column names to avoid constant issues with special chars and inaccurate names
# Rename specific columns
colnames(movies)[which(colnames(movies) == "Release Group")] <- "MovieName"
colnames(movies)[which(colnames(movies) == "$Worldwide")] <- "WorldwideRevenue"
colnames(movies)[which(colnames(movies) == "$Domestic")] <- "DomesticRevenue"
colnames(movies)[which(colnames(movies) == "$Foreign")] <- "ForeignRevenue"
colnames(movies)[which(colnames(movies) == "Domestic %")] <- "DomesticPercentage"
colnames(movies)[which(colnames(movies) == "Foreign %")] <- "ForeignPercentage"
colnames(movies)[which(colnames(movies) == "Rank")] <- "RankForYear"
cat(colnames(movies), sep = ", ") #list column names for reference
## RankForYear, MovieName, WorldwideRevenue, DomesticRevenue, DomesticPercentage, ForeignRevenue, ForeignPercentage, Year, Genres, Rating, Vote_Count, Original_Language, Production_Countries, Prime_Genre, Prime_Production_Country, Rating_scale, Rating_of_10
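As an aside, the same cleanup could be written with dplyr::rename (new_name = old_name). This is just an alternative sketch, applied to the untouched t_box_office copy so it does not clash with the base-R renames above; the backticks handle the spaces and special characters in the raw names.
# Alternative: the same renames via dplyr::rename, run against the original copy
movies_alt <- rename(t_box_office,
                     MovieName          = `Release Group`,
                     WorldwideRevenue   = `$Worldwide`,
                     DomesticRevenue    = `$Domestic`,
                     ForeignRevenue     = `$Foreign`,
                     DomesticPercentage = `Domestic %`,
                     ForeignPercentage  = `Foreign %`,
                     RankForYear        = Rank)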
The purpose of this week's data dive is for you to expand your knowledge of linear modeling and generalized linear models.
Your RMarkdown notebook for this data dive should contain the following:
- Build a linear (or generalized linear) model as you like
  - Use whatever response variable and explanatory variables you prefer
- Use the tools from previous weeks to diagnose the model
  - Highlight any issues with the model
- Interpret at least one of the coefficients
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions which might need further investigation.
Looking at the data set, I am very interested in what drives Worldwide Revenue. Prior analysis shows that Worldwide Revenue is positively skewed, so we'll address this by creating a LogWorldwideRevenue column that scales those values down to improve the analysis and support linear regression.
We also know from previous analysis that the Worldwide Revenue column is stored as non-numeric text.
# correct issues with Worldwide data being non-numeric
# Remove any non-numeric characters (like $, commas) and convert to numeric
movies$WorldwideRevenue <- as.numeric(gsub("[^0-9.]", "", movies$WorldwideRevenue))
# Check if conversion was successful
str(movies$WorldwideRevenue) # Should now show as numeric
## num [1:5000] 546388108 460583960 429632142 374111707 349822765 ...
summary(movies$WorldwideRevenue) # Summarize the data to check values
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1666028 24662197 48446575 119213693 119758766 2799439100
I finally see the true value of the log function when looking at data with very large ranges. This is very obvious in the summary above: the gap between the median of roughly $48M and the mean of roughly $119M is so large that it is bound to skew the analysis results. A range from about $1.7M to $2.8B across 5,000 records is also so wide that it is almost a perfect example of when to use a log transformation to scale values down.
#add a logWorldwideRevenue column to help scale the worldwide data
# Apply log transformation
movies$LogWorldwideRevenue <- log(movies$WorldwideRevenue)
# Visualize the transformed data
hist(movies$LogWorldwideRevenue,
main = "Histogram of Log-Transformed Worldwide Revenue",
xlab = "Log of Worldwide Revenue",
breaks = 30,
col = "lightgreen")
# Check summary statistics
#summary(movies$LogWorldwideRevenue)
This looks much more normal, which will support our GLM.
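As a quick sanity check, we can compare skewness before and after the transformation. This is a sketch using a simple moment-based skewness so no extra packages are needed; simple_skew is just an illustrative helper name.
# Simple moment-based skewness; large positive values indicate a long right tail
simple_skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
simple_skew(movies$WorldwideRevenue)     # raw revenue: strongly right-skewed
simple_skew(movies$LogWorldwideRevenue)  # log revenue: should be much closer to 0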
We'll do the same with the Vote_Count field, since it is also right-skewed.
# Apply log transformation to Vote_Count
movies$LogVoteCount <- log(movies$Vote_Count + 1) # Add 1 to avoid issues with log(0)
# Visualize the transformed data
hist(movies$LogVoteCount,
main = "Histogram of Log-Transformed Vote Count",
xlab = "Log of Vote Count",
breaks = 30,
col = "lightblue")
# Check summary statistics
#summary(movies$LogVoteCount)
These adjustments should help support the GLM.
Next we will remove rows with known NA values in Vote_Count.
# Remove rows where Vote_Count is NA
movies_cleaned <- movies[!is.na(movies$Vote_Count), ]
# Confirm the updated dataset size
nrow(movies_cleaned)
## [1] 4830
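Before modeling, it is worth checking whether the intended model variables still contain NAs. A quick sketch (the model summaries below still report 8 rows dropped for missingness, so a handful of gaps evidently remain in the other model variables):
# Count remaining NAs in the variables the model will use
colSums(is.na(movies_cleaned[, c("LogWorldwideRevenue", "LogVoteCount",
                                 "Year", "Rating_of_10", "Prime_Genre")]))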
Response variable: Worldwide Revenue, the primary measure of a movie's success. Because this data shows a positive skew, we'll apply a log transformation to improve the model.
We'll start the model with four explanatory variables:
- Year: should help identify long-term trends over time.
- Vote_Count: serves as a general measure of a movie's popularity; since this data also shows a positive skew, we apply a log transformation here as well.
- Rating_of_10: serves as a general measure of a movie's quality.
- Prime_Genre: provides a categorical predictor for each row of data.
Applying the log transformations lets us look at the data on a tighter scale and avoids being misled by the extreme values that significantly skew the raw data.
# Build the GLM
initial_model <- glm(LogWorldwideRevenue ~ Prime_Genre + Year + Rating_of_10 + LogVoteCount,
data = movies_cleaned,
family = gaussian()) # Assuming a normal distribution for log-transformed revenue
# Summary of the model
summary(initial_model)
##
## Call:
## glm(formula = LogWorldwideRevenue ~ Prime_Genre + Year + Rating_of_10 +
## LogVoteCount, family = gaussian(), data = movies_cleaned)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.186491 3.737564 -3.528 0.000422
## Prime_GenreAdventure 0.213542 0.062094 3.439 0.000589
## Prime_GenreAnimation 0.133735 0.058259 2.296 0.021745
## Prime_GenreComedy -0.369590 0.042026 -8.794 < 0.0000000000000002
## Prime_GenreCrime -0.426953 0.072114 -5.921 0.00000000343294591
## Prime_GenreDocumentary 0.192228 0.122502 1.569 0.116673
## Prime_GenreDrama -0.370209 0.043443 -8.522 < 0.0000000000000002
## Prime_GenreFamily -0.061381 0.080593 -0.762 0.446324
## Prime_GenreFantasy 0.010671 0.089010 0.120 0.904581
## Prime_GenreHistory -0.362971 0.139155 -2.608 0.009125
## Prime_GenreHorror -0.584635 0.061355 -9.529 < 0.0000000000000002
## Prime_GenreMusic -0.141974 0.148982 -0.953 0.340660
## Prime_GenreMystery -0.241593 0.114992 -2.101 0.035698
## Prime_GenreRomance -0.317421 0.078631 -4.037 0.00005502210344555
## Prime_GenreScience Fiction 0.137540 0.092373 1.489 0.136560
## Prime_GenreThriller -0.338398 0.077448 -4.369 0.00001272532711610
## Prime_GenreWar 0.151020 0.141757 1.065 0.286773
## Prime_GenreWestern -0.231351 0.314334 -0.736 0.461764
## Year 0.014577 0.001861 7.834 0.00000000000000578
## Rating_of_10 -0.096751 0.015730 -6.151 0.00000000083475059
## LogVoteCount 0.389815 0.007460 52.251 < 0.0000000000000002
##
## (Intercept) ***
## Prime_GenreAdventure ***
## Prime_GenreAnimation *
## Prime_GenreComedy ***
## Prime_GenreCrime ***
## Prime_GenreDocumentary
## Prime_GenreDrama ***
## Prime_GenreFamily
## Prime_GenreFantasy
## Prime_GenreHistory **
## Prime_GenreHorror ***
## Prime_GenreMusic
## Prime_GenreMystery *
## Prime_GenreRomance ***
## Prime_GenreScience Fiction
## Prime_GenreThriller ***
## Prime_GenreWar
## Prime_GenreWestern
## Year ***
## Rating_of_10 ***
## LogVoteCount ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7815642)
##
## Null deviance: 6742.8 on 4821 degrees of freedom
## Residual deviance: 3752.3 on 4801 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 12519
##
## Number of Fisher Scoring iterations: 2
# Model diagnostics
#plot(initial_model)
Adjust the model for a potentially better fit.
# Add interaction term between Rating_of_10 and LogVoteCount
interaction_model_2 <- glm(LogWorldwideRevenue ~ Prime_Genre + Year + Rating_of_10 * LogVoteCount,
data = movies_cleaned,
family = gaussian())
# Summary of the model
summary(interaction_model_2)
##
## Call:
## glm(formula = LogWorldwideRevenue ~ Prime_Genre + Year + Rating_of_10 *
## LogVoteCount, family = gaussian(), data = movies_cleaned)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.411732 3.773727 -1.964 0.04958
## Prime_GenreAdventure 0.194755 0.061688 3.157 0.00160
## Prime_GenreAnimation 0.102715 0.057958 1.772 0.07642
## Prime_GenreComedy -0.356402 0.041754 -8.536 < 0.0000000000000002
## Prime_GenreCrime -0.463144 0.071726 -6.457 0.0000000001172
## Prime_GenreDocumentary 0.127175 0.121868 1.044 0.29675
## Prime_GenreDrama -0.417310 0.043494 -9.595 < 0.0000000000000002
## Prime_GenreFamily -0.053352 0.080020 -0.667 0.50497
## Prime_GenreFantasy 0.020416 0.088378 0.231 0.81732
## Prime_GenreHistory -0.364202 0.138156 -2.636 0.00841
## Prime_GenreHorror -0.545543 0.061091 -8.930 < 0.0000000000000002
## Prime_GenreMusic -0.239079 0.148362 -1.611 0.10715
## Prime_GenreMystery -0.237933 0.114167 -2.084 0.03721
## Prime_GenreRomance -0.338415 0.078106 -4.333 0.0000150272986
## Prime_GenreScience Fiction 0.143548 0.091712 1.565 0.11760
## Prime_GenreThriller -0.334802 0.076893 -4.354 0.0000136388362
## Prime_GenreWar 0.126824 0.140768 0.901 0.36766
## Prime_GenreWestern -0.289635 0.312153 -0.928 0.35353
## Year 0.012255 0.001868 6.561 0.0000000000591
## Rating_of_10 -0.263817 0.025271 -10.440 < 0.0000000000000002
## LogVoteCount 0.143927 0.030164 4.772 0.0000018826304
## Rating_of_10:LogVoteCount 0.037097 0.004411 8.409 < 0.0000000000000002
##
## (Intercept) *
## Prime_GenreAdventure **
## Prime_GenreAnimation .
## Prime_GenreComedy ***
## Prime_GenreCrime ***
## Prime_GenreDocumentary
## Prime_GenreDrama ***
## Prime_GenreFamily
## Prime_GenreFantasy
## Prime_GenreHistory **
## Prime_GenreHorror ***
## Prime_GenreMusic
## Prime_GenreMystery *
## Prime_GenreRomance ***
## Prime_GenreScience Fiction
## Prime_GenreThriller ***
## Prime_GenreWar
## Prime_GenreWestern
## Year ***
## Rating_of_10 ***
## LogVoteCount ***
## Rating_of_10:LogVoteCount ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7703776)
##
## Null deviance: 6742.8 on 4821 degrees of freedom
## Residual deviance: 3697.8 on 4800 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 12450
##
## Number of Fisher Scoring iterations: 2
Insights: This model is marginally better than the previous one. I selected these parameters because all of these variables are likely to have some impact on worldwide revenue. I chose an interaction term for Rating_of_10 and LogVoteCount because those two variables are likely tightly related; logically, we might guess that the quality of a movie (Rating_of_10) is linked to its popularity.
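To back up the "marginally better" judgment, the two fits can be compared head to head. A minimal sketch, assuming both models above are still in memory; since both are Gaussian GLMs, an F test on the change in deviance is appropriate.
# Compare information criteria and test whether the interaction term improves fit
AIC(initial_model, interaction_model_2)
anova(initial_model, interaction_model_2, test = "F")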
#Residual vs Fitted
plot(interaction_model_2, which = 1) # Residuals vs. fitted values
Ideally the residuals are randomly scattered around 0 with no pattern. In this case there does seem to be a mild pattern, shown by the curve of the red line: at the ends the line deviates from 0, suggesting regions where the model does not fully explain the data. This may be related to the three labeled points, which might be significant outliers. A sensible next step is to investigate those three observations and determine their impact on the model.
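The labeled points can be followed up directly. A minimal sketch, assuming the plot labels correspond to row numbers of movies_cleaned (the default labeling used by plot on a fitted model):
# Identify the three observations with the largest absolute residuals
resids  <- residuals(interaction_model_2)
flagged <- as.integer(names(sort(abs(resids), decreasing = TRUE)))[1:3]
movies_cleaned[flagged, c("MovieName", "WorldwideRevenue", "Vote_Count", "Rating_of_10")]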
#QQ Plot
plot(interaction_model_2, which = 2) # Normal Q-Q plot
Ideally the Q-Q plot should show the points closely following the dotted line. In this case the alignment is fairly strong until the upper quantiles. Again, there are some labeled entries worth investigating to see whether they are significant outliers. We know, however, that this data set includes some extreme values, so this alone does not necessarily indicate a problem with the model.
#Cook's distance
plot(interaction_model_2, which = 4) # Cook's Distance plot
Cook's distance helps us identify influential data points; generally we are looking for values above 1. This plot shows that all data points are well below 1, which is within the expected range. It would still be a good idea to look into the labeled data points and examine their impact. In real-world terms this might reflect something exceptional about a movie that sets it apart from its counterparts; only looking into the listed items would expose that information.
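As a follow-up, a small sketch that lists the largest Cook's distances and counts points above the stricter 4/n rule of thumb (the 4/n threshold is my own choice here, not something read off the plot):
# Largest Cook's distances and a count of points above the 4/n rule of thumb
cd <- cooks.distance(interaction_model_2)
head(sort(cd, decreasing = TRUE), 5)
sum(cd > 4 / length(cd))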
#leverage Plot
hat_vals <- hatvalues(interaction_model_2) # named to avoid shadowing the hatvalues() function
plot(hat_vals, main = "Leverage of Observations")
In a leverage plot we expect most of the data points to have leverage values near zero, indicating they have minimal influence on the model. In this plot that is mostly true: most values are near zero. There are some clear instances of higher leverage values which bear investigation. This might not indicate an issue, but it would be responsible to check whether there is a legitimate reason, such as a genuine outlier, or whether a data error lies behind the observation.
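To quantify those higher-leverage cases, a short sketch using the common 2p/n cutoff, where p is the number of estimated coefficients (the cutoff is a rule of thumb, not a hard rule):
# Count observations whose leverage exceeds the common 2p/n cutoff
hv <- hatvalues(interaction_model_2)
p  <- length(coef(interaction_model_2))
sum(hv > 2 * p / length(hv))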
#multicollinearity
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(interaction_model_2)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## Prime_Genre 1.366122 17 1.009218
## Year 1.134417 1 1.065090
## Rating_of_10 3.461815 1 1.860595
## LogVoteCount 22.230605 1 4.714934
## Rating_of_10:LogVoteCount 28.359214 1 5.325337
In this test we are looking for multicollinearity, which occurs when explanatory variables are highly correlated with each other, making it difficult to interpret the impact of each variable on the model. The usual indicator of multicollinearity is a scaled GVIF value higher than 5.
In this case we see that Prime_Genre, Year, and Rating_of_10 all score low individually, suggesting limited multicollinearity. LogVoteCount shows moderate multicollinearity at 4.71, but we expect this given its strong relationship to rating and its inclusion in the interaction term. Of greatest concern is that the interaction term scores slightly higher than 5. This does not necessarily invalidate the model.
The overall model results show that the interaction term does improve the model's explanatory power. To address this further we might examine the influential observations from the previous steps and adjust the model to see whether handling those observations reduces the multicollinearity.
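Two follow-ups worth sketching: the predictor-level VIFs that the vif() warning suggests, and re-fitting with the two interacting variables mean-centred, which typically shrinks the interaction's GVIF. The names Rating_c and LogVote_c are just illustrative.
# Predictor-level VIFs, as suggested by the warning above
vif(interaction_model_2, type = "predictor")
# Mean-centre the interacting variables and re-fit; centring typically reduces
# the collinearity between the main effects and their interaction term
movies_cleaned$Rating_c  <- movies_cleaned$Rating_of_10 - mean(movies_cleaned$Rating_of_10, na.rm = TRUE)
movies_cleaned$LogVote_c <- movies_cleaned$LogVoteCount - mean(movies_cleaned$LogVoteCount, na.rm = TRUE)
centered_model <- glm(LogWorldwideRevenue ~ Prime_Genre + Year + Rating_c * LogVote_c,
                      data = movies_cleaned, family = gaussian())
vif(centered_model)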
#Residual Deviance
summary(interaction_model_2)$deviance # Residual deviance
## [1] 3697.813
AIC(interaction_model_2)
## [1] 12450.26
The residual deviance quantifies the unexplained variance remaining in the model. Much of its value is as a relative benchmark to watch during iterative model changes: reducing this number suggests a better model fit. The current value of 3697.8 is acceptable, but it indicates there is still variance not accounted for by the model.
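One way to put that residual deviance on an interpretable scale is the proportion of null deviance explained, a rough R-squared analogue for a Gaussian GLM; a one-line sketch:
# Proportion of null deviance explained by the model (about 0.45 here, from the summary above)
1 - interaction_model_2$deviance / interaction_model_2$null.deviance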
The AIC serves a similar purpose to the residual deviance for monitoring iterative improvements to the model, as a smaller value suggests a better fit. It is noteworthy that this AIC value is smaller than the AIC of the original model listed above, indicating a fit that better captures the relationship between the predictors and the response variable. Going forward it would be advantageous to monitor these values after each iteration of the model to help find the best representative model.
The interaction term Rating_of_10 * LogVoteCount is the most interesting part of this model. (Remember we applied a log transformation to Vote_Count to create a more normalized, scaled representation of the vote counts.) We'll draw attention to it by running a model with only the interaction term applied. Stated in plain English, the question might be: "how do quality (ratings) and popularity (vote counts) work together to influence revenue?"
simple_model <- glm(LogWorldwideRevenue ~ Rating_of_10 * LogVoteCount,
data = movies_cleaned,
family = gaussian())
summary(simple_model)
##
## Call:
## glm(formula = LogWorldwideRevenue ~ Rating_of_10 * LogVoteCount,
## family = gaussian(), data = movies_cleaned)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.970823 0.154756 109.661 < 0.0000000000000002 ***
## Rating_of_10 -0.253565 0.024573 -10.319 < 0.0000000000000002 ***
## LogVoteCount 0.115318 0.028297 4.075 0.0000467 ***
## Rating_of_10:LogVoteCount 0.041162 0.004227 9.737 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.8327139)
##
## Null deviance: 6756.6 on 4829 degrees of freedom
## Residual deviance: 4018.7 on 4826 degrees of freedom
## AIC: 12829
##
## Number of Fisher Scoring iterations: 2
Looking at the estimate of the interaction term we find a value of 0.041162, which means the effect of Rating_of_10 on worldwide revenue (LogWorldwideRevenue) increases as vote count increases. In general terms this indicates that the impact of a movie's quality is greater as its popularity increases. This makes sense: you would expect a movie that is both popular and considered high quality to have higher earning potential. Following that logic, a high-quality movie with limited distribution and a low vote count would likely have lower earnings, while critically acclaimed movies with a large audience would generate more revenue.
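A small worked example makes this concrete. From the estimates above, the slope on Rating_of_10 is roughly -0.254 + 0.041 * LogVoteCount, so the rating effect flips from negative to positive once the log vote count passes about 6.2; a sketch using the fitted coefficients:
# Log vote count at which the rating slope crosses zero, and the rough vote count it implies
coefs <- coef(simple_model)
turning_point <- -coefs["Rating_of_10"] / coefs["Rating_of_10:LogVoteCount"]
turning_point            # about 6.2 on the log scale
exp(turning_point) - 1   # roughly 470 votes (LogVoteCount was log(Vote_Count + 1))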
Looking further at the output, a t-value of 9.737 and a p-value near 0 indicate the interaction term is highly significant, which supports the idea that quality and popularity together are an influential factor in explaining revenue.
It is interesting that on its own Rating_of_10 has a slightly negative coefficient when LogVoteCount is held constant (strictly, this is the rating slope when LogVoteCount equals zero). This suggests that quality is not a strong enough factor to drive revenue without the added impact of popularity. Conversely, popularity has a positive effect, which suggests it is a more consistent driver of revenue than quality, but the two together make a greater impact.
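Finally, a quick visual of the interaction can make this story easier to read. This is a sketch only; the vote-count levels of 50, 500, and 5,000 are arbitrary illustrative choices, not values taken from the data.
# Predicted log revenue across ratings at three illustrative popularity levels
pred_grid <- expand.grid(
  Rating_of_10 = seq(3, 9, by = 0.5),
  Vote_Count   = c(50, 500, 5000)
)
pred_grid$LogVoteCount <- log(pred_grid$Vote_Count + 1)
pred_grid$Predicted    <- predict(simple_model, newdata = pred_grid)
ggplot(pred_grid, aes(x = Rating_of_10, y = Predicted, colour = factor(Vote_Count))) +
  geom_line() +
  labs(title = "Predicted Log Worldwide Revenue by Rating and Popularity",
       x = "Rating (out of 10)", y = "Predicted LogWorldwideRevenue",
       colour = "Vote count")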