#load the data libraries - remove or add as needed
library(tidyverse) #tools for data science; includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and forcats as core libraries.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) #loaded for visualization formatting, including currency labels
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
options(scipen=999) #disable scientific notation since high values are used
#clean up the work space
rm(list = ls())
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")
#cleanup of column names to avoid constant issues with special chars and inaccurate names
# Rename specific columns
colnames(movies)[which(colnames(movies) == "Release Group")] <- "MovieName"
colnames(movies)[which(colnames(movies) == "$Worldwide")] <- "WorldwideRevenue"
colnames(movies)[which(colnames(movies) == "$Domestic")] <- "DomesticRevenue"
colnames(movies)[which(colnames(movies) == "$Foreign")] <- "ForeignRevenue"
colnames(movies)[which(colnames(movies) == "Domestic %")] <- "DomesticPercentage"
colnames(movies)[which(colnames(movies) == "Foreign %")] <- "ForeignPercentage"
colnames(movies)[which(colnames(movies) == "Rank")] <- "RankForYear"
cat(colnames(movies), sep = ", ") #list column names for reference
## RankForYear, MovieName, WorldwideRevenue, DomesticRevenue, DomesticPercentage, ForeignRevenue, ForeignPercentage, Year, Genres, Rating, Vote_Count, Original_Language, Production_Countries, Prime_Genre, Prime_Production_Country, Rating_scale, Rating_of_10
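As an aside, the same cleanup could be written with dplyr::rename (new_name = old_name). This is just an alternative sketch, applied to the untouched t_box_office copy so it does not clash with the base-R renames above; the backticks handle the spaces and special characters in the raw names.
# Alternative: the same renames via dplyr::rename, run against the original copy
movies_alt <- rename(t_box_office,
                     MovieName          = `Release Group`,
                     WorldwideRevenue   = `$Worldwide`,
                     DomesticRevenue    = `$Domestic`,
                     ForeignRevenue     = `$Foreign`,
                     DomesticPercentage = `Domestic %`,
                     ForeignPercentage  = `Foreign %`,
                     RankForYear        = Rank)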
The purpose of this week's data dive is for you to expand your knowledge of linear modeling and generalized linear models.
Your RMarkdown notebook for this data dive should contain the following:
- Build a linear (or generalized linear) model as you like
  - Use whatever response variable and explanatory variables you prefer
- Use the tools from previous weeks to diagnose the model
  - Highlight any issues with the model
- Interpret at least one of the coefficients
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions which might need further investigation.
Looking at the data set, I am very interested in what drives Worldwide Revenue. Prior analysis shows that Worldwide Revenue is positively skewed, so we'll address this by creating a LogWorldwideRevenue column that scales those values down to improve the analysis and support linear regression.
We also know from previous analysis that the Worldwide Revenue column is stored as non-numeric text.
# correct issues with Worldwide data being non-numeric
# Remove any non-numeric characters (like $, commas) and convert to numeric
movies$WorldwideRevenue <- as.numeric(gsub("[^0-9.]", "", movies$WorldwideRevenue))
# Check if conversion was successful
str(movies$WorldwideRevenue) # Should now show as numeric
## num [1:5000] 546388108 460583960 429632142 374111707 349822765 ...
summary(movies$WorldwideRevenue) # Summarize the data to check values
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1666028 24662197 48446575 119213693 119758766 2799439100
I finally see the true value of the log function when looking at data with very large ranges. This is very obvious in the summary above: the gap between the median of roughly $48M and the mean of roughly $119M is so large that it is bound to skew the analysis results. A range from about $1.7M to $2.8B across 5,000 records is also so wide that it is almost a perfect example of when to use a log transformation to scale values down.
#add a logWorldwideRevenue column to help scale the worldwide data
# Apply log transformation
movies$LogWorldwideRevenue <- log(movies$WorldwideRevenue)
# Visualize the transformed data
hist(movies$LogWorldwideRevenue,
main = "Histogram of Log-Transformed Worldwide Revenue",
xlab = "Log of Worldwide Revenue",
breaks = 30,
col = "lightgreen")
# Check summary statistics
#summary(movies$LogWorldwideRevenue)
This looks much more normal, which will support our GLM.
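As a quick sanity check, we can compare skewness before and after the transformation. This is a sketch using a simple moment-based skewness so no extra packages are needed; simple_skew is just an illustrative helper name.
# Simple moment-based skewness; large positive values indicate a long right tail
simple_skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
simple_skew(movies$WorldwideRevenue)     # raw revenue: strongly right-skewed
simple_skew(movies$LogWorldwideRevenue)  # log revenue: should be much closer to 0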
We'll do the same with the Vote_Count field, since it is also right-skewed.
# Apply log transformation to Vote_Count
movies$LogVoteCount <- log(movies$Vote_Count + 1) # Add 1 to avoid issues with log(0)
# Visualize the transformed data
hist(movies$LogVoteCount,
main = "Histogram of Log-Transformed Vote Count",
xlab = "Log of Vote Count",
breaks = 30,
col = "lightblue")
# Check summary statistics
#summary(movies$LogVoteCount)
These adjustments should help support the GLM.
Next we will remove rows with known NA values in Vote_Count.
# Remove rows where Vote_Count is NA
movies_cleaned <- movies[!is.na(movies$Vote_Count), ]
# Confirm the updated dataset size
nrow(movies_cleaned)
## [1] 4830
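Before modeling, it is worth checking whether the intended model variables still contain NAs. A quick sketch (the model summaries below still report 8 rows dropped for missingness, so a handful of gaps evidently remain in the other model variables):
# Count remaining NAs in the variables the model will use
colSums(is.na(movies_cleaned[, c("LogWorldwideRevenue", "LogVoteCount",
                                 "Year", "Rating_of_10", "Prime_Genre")]))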
Response variable: Worldwide Revenue, the primary measure of a movie's success. Because this data shows a positive skew, we'll apply a log transformation to improve the model.
We'll start the model with four explanatory variables:
- Year: should help identify long-term trends over time.
- Vote_Count: serves as a general measure of a movie's popularity; since this data also shows a positive skew, we apply a log transformation here as well.
- Rating_of_10: serves as a general measure of a movie's quality.
- Prime_Genre: provides a categorical predictor for each row of data.
Applying the log transformations lets us look at the data on a tighter scale and avoids being misled by the extreme values that significantly skew the raw data.
# Build the GLM
initial_model <- glm(LogWorldwideRevenue ~ Prime_Genre + Year + Rating_of_10 + LogVoteCount,
data = movies_cleaned,
family = gaussian()) # Assuming a normal distribution for log-transformed revenue
# Summary of the model
summary(initial_model)
##
## Call:
## glm(formula = LogWorldwideRevenue ~ Prime_Genre + Year + Rating_of_10 +
## LogVoteCount, family = gaussian(), data = movies_cleaned)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.186491 3.737564 -3.528 0.000422
## Prime_GenreAdventure 0.213542 0.062094 3.439 0.000589
## Prime_GenreAnimation 0.133735 0.058259 2.296 0.021745
## Prime_GenreComedy -0.369590 0.042026 -8.794 < 0.0000000000000002
## Prime_GenreCrime -0.426953 0.072114 -5.921 0.00000000343294591
## Prime_GenreDocumentary 0.192228 0.122502 1.569 0.116673
## Prime_GenreDrama -0.370209 0.043443 -8.522 < 0.0000000000000002
## Prime_GenreFamily -0.061381 0.080593 -0.762 0.446324
## Prime_GenreFantasy 0.010671 0.089010 0.120 0.904581
## Prime_GenreHistory -0.362971 0.139155 -2.608 0.009125
## Prime_GenreHorror -0.584635 0.061355 -9.529 < 0.0000000000000002
## Prime_GenreMusic -0.141974 0.148982 -0.953 0.340660
## Prime_GenreMystery -0.241593 0.114992 -2.101 0.035698
## Prime_GenreRomance -0.317421 0.078631 -4.037 0.00005502210344555
## Prime_GenreScience Fiction 0.137540 0.092373 1.489 0.136560
## Prime_GenreThriller -0.338398 0.077448 -4.369 0.00001272532711610
## Prime_GenreWar 0.151020 0.141757 1.065 0.286773
## Prime_GenreWestern -0.231351 0.314334 -0.736 0.461764
## Year 0.014577 0.001861 7.834 0.00000000000000578
## Rating_of_10 -0.096751 0.015730 -6.151 0.00000000083475059
## LogVoteCount 0.389815 0.007460 52.251 < 0.0000000000000002
##
## (Intercept) ***
## Prime_GenreAdventure ***
## Prime_GenreAnimation *
## Prime_GenreComedy ***
## Prime_GenreCrime ***
## Prime_GenreDocumentary
## Prime_GenreDrama ***
## Prime_GenreFamily
## Prime_GenreFantasy
## Prime_GenreHistory **
## Prime_GenreHorror ***
## Prime_GenreMusic
## Prime_GenreMystery *
## Prime_GenreRomance ***
## Prime_GenreScience Fiction
## Prime_GenreThriller ***
## Prime_GenreWar
## Prime_GenreWestern
## Year ***
## Rating_of_10 ***
## LogVoteCount ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7815642)
##
## Null deviance: 6742.8 on 4821 degrees of freedom
## Residual deviance: 3752.3 on 4801 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 12519
##
## Number of Fisher Scoring iterations: 2
# Model diagnostics
#plot(initial_model)
Adjust the model for a potentially better fit.
# Add interaction term between Rating_of_10 and LogVoteCount
interaction_model_2 <- glm(LogWorldwideRevenue ~ Prime_Genre + Year + Rating_of_10 * LogVoteCount,
data = movies_cleaned,
family = gaussian())
# Summary of the model
summary(interaction_model_2)
##
## Call:
## glm(formula = LogWorldwideRevenue ~ Prime_Genre + Year + Rating_of_10 *
## LogVoteCount, family = gaussian(), data = movies_cleaned)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.411732 3.773727 -1.964 0.04958
## Prime_GenreAdventure 0.194755 0.061688 3.157 0.00160
## Prime_GenreAnimation 0.102715 0.057958 1.772 0.07642
## Prime_GenreComedy -0.356402 0.041754 -8.536 < 0.0000000000000002
## Prime_GenreCrime -0.463144 0.071726 -6.457 0.0000000001172
## Prime_GenreDocumentary 0.127175 0.121868 1.044 0.29675
## Prime_GenreDrama -0.417310 0.043494 -9.595 < 0.0000000000000002
## Prime_GenreFamily -0.053352 0.080020 -0.667 0.50497
## Prime_GenreFantasy 0.020416 0.088378 0.231 0.81732
## Prime_GenreHistory -0.364202 0.138156 -2.636 0.00841
## Prime_GenreHorror -0.545543 0.061091 -8.930 < 0.0000000000000002
## Prime_GenreMusic -0.239079 0.148362 -1.611 0.10715
## Prime_GenreMystery -0.237933 0.114167 -2.084 0.03721
## Prime_GenreRomance -0.338415 0.078106 -4.333 0.0000150272986
## Prime_GenreScience Fiction 0.143548 0.091712 1.565 0.11760
## Prime_GenreThriller -0.334802 0.076893 -4.354 0.0000136388362
## Prime_GenreWar 0.126824 0.140768 0.901 0.36766
## Prime_GenreWestern -0.289635 0.312153 -0.928 0.35353
## Year 0.012255 0.001868 6.561 0.0000000000591
## Rating_of_10 -0.263817 0.025271 -10.440 < 0.0000000000000002
## LogVoteCount 0.143927 0.030164 4.772 0.0000018826304
## Rating_of_10:LogVoteCount 0.037097 0.004411 8.409 < 0.0000000000000002
##
## (Intercept) *
## Prime_GenreAdventure **
## Prime_GenreAnimation .
## Prime_GenreComedy ***
## Prime_GenreCrime ***
## Prime_GenreDocumentary
## Prime_GenreDrama ***
## Prime_GenreFamily
## Prime_GenreFantasy
## Prime_GenreHistory **
## Prime_GenreHorror ***
## Prime_GenreMusic
## Prime_GenreMystery *
## Prime_GenreRomance ***
## Prime_GenreScience Fiction
## Prime_GenreThriller ***
## Prime_GenreWar
## Prime_GenreWestern
## Year ***
## Rating_of_10 ***
## LogVoteCount ***
## Rating_of_10:LogVoteCount ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7703776)
##
## Null deviance: 6742.8 on 4821 degrees of freedom
## Residual deviance: 3697.8 on 4800 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 12450
##
## Number of Fisher Scoring iterations: 2
Insights: This model is marginally better than the previous one. I selected these parameters because all of these variables are likely to have some impact on worldwide revenue. I chose an interaction term for Rating_of_10 and LogVoteCount because those two variables are likely tightly related; logically, we might guess that the quality of a movie (Rating_of_10) is linked to its popularity.
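To back up the "marginally better" judgment, the two fits can be compared head to head. A minimal sketch, assuming both models above are still in memory; since both are Gaussian GLMs, an F test on the change in deviance is appropriate.
# Compare information criteria and test whether the interaction term improves fit
AIC(initial_model, interaction_model_2)
anova(initial_model, interaction_model_2, test = "F")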
#Residual vs Fitted
plot(interaction_model_2, which = 1) # Residuals vs. fitted values
Ideally the residuals are randomly scattered around 0 with no pattern. In this case there does seem to be a mild pattern, shown by the curve of the red line: at the ends the line deviates from 0, suggesting regions where the model does not fully explain the data. This may be related to the three labeled points, which might be significant outliers. A sensible next step is to investigate those three observations and determine their impact on the model.
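The labeled points can be followed up directly. A minimal sketch, assuming the plot labels correspond to row numbers of movies_cleaned (the default labeling used by plot on a fitted model):
# Identify the three observations with the largest absolute residuals
resids  <- residuals(interaction_model_2)
flagged <- as.integer(names(sort(abs(resids), decreasing = TRUE)))[1:3]
movies_cleaned[flagged, c("MovieName", "WorldwideRevenue", "Vote_Count", "Rating_of_10")]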
#QQ Plot
plot(interaction_model_2, which = 2) # Normal Q-Q plot
Ideally the Q-Q plot should show the points closely following the dotted line. In this case the alignment is fairly strong until the upper quantiles. Again, there are some labeled entries worth investigating to see whether they are significant outliers. We know, however, that this data set includes some extreme values, so this alone does not necessarily indicate a problem with the model.
#Cook's distance
plot(interaction_model_2, which = 4) # Cook's Distance plot
Cook's distance helps us identify influential data points; generally we are looking for values above 1. This plot shows that all data points are well below 1, which is within the expected range. It would still be a good idea to look into the labeled data points and examine their impact. In real-world terms this might reflect something exceptional about a movie that sets it apart from its counterparts; only looking into the listed items would expose that information.
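As a follow-up, a small sketch that lists the largest Cook's distances and counts points above the stricter 4/n rule of thumb (the 4/n threshold is my own choice here, not something read off the plot):
# Largest Cook's distances and a count of points above the 4/n rule of thumb
cd <- cooks.distance(interaction_model_2)
head(sort(cd, decreasing = TRUE), 5)
sum(cd > 4 / length(cd))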
#leverage Plot
hat_vals <- hatvalues(interaction_model_2) # named to avoid shadowing the hatvalues() function
plot(hat_vals, main = "Leverage of Observations")
In a leverage plot we expect most of the data points to have leverage values near zero, indicating they have minimal influence on the model. In this plot that is mostly true: most values are near zero. There are some clear instances of higher leverage values which bear investigation. This might not indicate an issue, but it would be responsible to check whether there is a legitimate reason, such as a genuine outlier, or whether a data error lies behind the observation.
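To quantify those higher-leverage cases, a short sketch using the common 2p/n cutoff, where p is the number of estimated coefficients (the cutoff is a rule of thumb, not a hard rule):
# Count observations whose leverage exceeds the common 2p/n cutoff
hv <- hatvalues(interaction_model_2)
p  <- length(coef(interaction_model_2))
sum(hv > 2 * p / length(hv))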
#multicollinearity
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(interaction_model_2)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## Prime_Genre 1.366122 17 1.009218
## Year 1.134417 1 1.065090
## Rating_of_10 3.461815 1 1.860595
## LogVoteCount 22.230605 1 4.714934
## Rating_of_10:LogVoteCount 28.359214 1 5.325337
In this test we are looking for multicollinearity, which occurs when explanatory variables are highly correlated with each other, making it difficult to interpret the impact of each variable on the model. The usual indicator of multicollinearity is a scaled GVIF value higher than 5.
In this case we see that Prime_Genre, Year, and Rating_of_10 all score low individually, suggesting limited multicollinearity. LogVoteCount shows moderate multicollinearity at 4.71, but we expect this given its strong relationship to rating and its inclusion in the interaction term. Of greatest concern is that the interaction term scores slightly higher than 5. This does not necessarily invalidate the model.
The overall model results show that the interaction term does improve the model's explanatory power. To address this further we might examine the influential observations from the previous steps and adjust the model to see whether handling those observations reduces the multicollinearity.
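Two follow-ups worth sketching: the predictor-level VIFs that the vif() warning suggests, and re-fitting with the two interacting variables mean-centred, which typically shrinks the interaction's GVIF. The names Rating_c and LogVote_c are just illustrative.
# Predictor-level VIFs, as suggested by the warning above
vif(interaction_model_2, type = "predictor")
# Mean-centre the interacting variables and re-fit; centring typically reduces
# the collinearity between the main effects and their interaction term
movies_cleaned$Rating_c  <- movies_cleaned$Rating_of_10 - mean(movies_cleaned$Rating_of_10, na.rm = TRUE)
movies_cleaned$LogVote_c <- movies_cleaned$LogVoteCount - mean(movies_cleaned$LogVoteCount, na.rm = TRUE)
centered_model <- glm(LogWorldwideRevenue ~ Prime_Genre + Year + Rating_c * LogVote_c,
                      data = movies_cleaned, family = gaussian())
vif(centered_model)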
#Residual Deviance
summary(interaction_model_2)$deviance # Residual deviance
## [1] 3697.813
AIC(interaction_model_2)
## [1] 12450.26
The residual deviance quantifies the unexplained variance remaining in the model. Much of its value is as a relative benchmark to watch during iterative model changes: reducing this number suggests a better model fit. The current value of 3697.8 is acceptable, but it indicates there is still variance not accounted for by the model.
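One way to put that residual deviance on an interpretable scale is the proportion of null deviance explained, a rough R-squared analogue for a Gaussian GLM; a one-line sketch:
# Proportion of null deviance explained by the model (about 0.45 here, from the summary above)
1 - interaction_model_2$deviance / interaction_model_2$null.deviance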
The AIC serves a similar purpose to the residual deviance for monitoring iterative improvements to the model, as a smaller value suggests a better fit. It is noteworthy that this AIC value is smaller than the AIC of the original model listed above, indicating a fit that better captures the relationship between the predictors and the response variable. Going forward it would be advantageous to monitor these values after each iteration of the model to help find the best representative model.
The interaction term Rating_of_10 * LogVoteCount is the most interesting part of this model. (Remember we applied a log transformation to Vote_Count to create a more normalized, scaled representation of the vote counts.) We'll draw attention to it by running a model with only the interaction term applied. Stated in plain English, the question might be: "how do quality (ratings) and popularity (vote counts) work together to influence revenue?"
simple_model <- glm(LogWorldwideRevenue ~ Rating_of_10 * LogVoteCount,
data = movies_cleaned,
family = gaussian())
summary(simple_model)
##
## Call:
## glm(formula = LogWorldwideRevenue ~ Rating_of_10 * LogVoteCount,
## family = gaussian(), data = movies_cleaned)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.970823 0.154756 109.661 < 0.0000000000000002 ***
## Rating_of_10 -0.253565 0.024573 -10.319 < 0.0000000000000002 ***
## LogVoteCount 0.115318 0.028297 4.075 0.0000467 ***
## Rating_of_10:LogVoteCount 0.041162 0.004227 9.737 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.8327139)
##
## Null deviance: 6756.6 on 4829 degrees of freedom
## Residual deviance: 4018.7 on 4826 degrees of freedom
## AIC: 12829
##
## Number of Fisher Scoring iterations: 2
Looking at the estimate of the interaction term we find a value of 0.041162, which means the effect of Rating_of_10 on worldwide revenue (LogWorldwideRevenue) increases as vote count increases. In general terms this indicates that the impact of a movie's quality is greater as its popularity increases. This makes sense: you would expect a movie that is both popular and considered high quality to have higher earning potential. Following that logic, a high-quality movie with limited distribution and a low vote count would likely have lower earnings, while critically acclaimed movies with a large audience would generate more revenue.
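A small worked example makes this concrete. From the estimates above, the slope on Rating_of_10 is roughly -0.254 + 0.041 * LogVoteCount, so the rating effect flips from negative to positive once the log vote count passes about 6.2; a sketch using the fitted coefficients:
# Log vote count at which the rating slope crosses zero, and the rough vote count it implies
coefs <- coef(simple_model)
turning_point <- -coefs["Rating_of_10"] / coefs["Rating_of_10:LogVoteCount"]
turning_point            # about 6.2 on the log scale
exp(turning_point) - 1   # roughly 470 votes (LogVoteCount was log(Vote_Count + 1))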
Looking further at the output, a t-value of 9.737 and a p-value near 0 indicate the interaction term is highly significant, which supports the idea that quality and popularity together are an influential factor in explaining revenue.
It is interesting that on its own Rating_of_10 has a slightly negative coefficient when LogVoteCount is held constant (strictly, this is the rating slope when LogVoteCount equals zero). This suggests that quality is not a strong enough factor to drive revenue without the added impact of popularity. Conversely, popularity has a positive effect, which suggests it is a more consistent driver of revenue than quality, but the two together make a greater impact.
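Finally, a quick visual of the interaction can make this story easier to read. This is a sketch only; the vote-count levels of 50, 500, and 5,000 are arbitrary illustrative choices, not values taken from the data.
# Predicted log revenue across ratings at three illustrative popularity levels
pred_grid <- expand.grid(
  Rating_of_10 = seq(3, 9, by = 0.5),
  Vote_Count   = c(50, 500, 5000)
)
pred_grid$LogVoteCount <- log(pred_grid$Vote_Count + 1)
pred_grid$Predicted    <- predict(simple_model, newdata = pred_grid)
ggplot(pred_grid, aes(x = Rating_of_10, y = Predicted, colour = factor(Vote_Count))) +
  geom_line() +
  labs(title = "Predicted Log Worldwide Revenue by Rating and Popularity",
       x = "Rating (out of 10)", y = "Predicted LogWorldwideRevenue",
       colour = "Vote count")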