Week 10 Data Dive - GLM

Initialization Step 1 - Load Libraries

#load the data libraries - remove or add as needed
library(tidyverse)   #tools for data science; includes ggplot2, dplyr, tidyr, readr, tibble, stringr, and forcats as core libraries.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)      #loaded to address viz issues, including currency issues
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
options(scipen=999)  #disable scientific notation since high values are used
library(lindia)
library(broom)
## Warning: package 'broom' was built under R version 4.4.3

Initialization Step 2 - Load Data Set

#clean up the work space
rm(list = ls())
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")

#clean up column names to avoid constant issues with special chars and inaccurate names
# Rename specific columns
colnames(movies)[which(colnames(movies) == "Release Group")] <- "MovieName"
colnames(movies)[which(colnames(movies) == "$Worldwide")] <- "WorldwideRevenue"
colnames(movies)[which(colnames(movies) == "$Domestic")] <- "DomesticRevenue"
colnames(movies)[which(colnames(movies) == "$Foreign")] <- "ForeignRevenue"
colnames(movies)[which(colnames(movies) == "Domestic %")] <- "DomesticPercentage"
colnames(movies)[which(colnames(movies) == "Foreign %")] <- "ForeignPercentage"
colnames(movies)[which(colnames(movies) == "Rank")] <- "RankForYear"

cat(colnames(movies), sep = ", ")                  #list column names for reference
## RankForYear, MovieName, WorldwideRevenue, DomesticRevenue, DomesticPercentage, ForeignRevenue, ForeignPercentage, Year, Genres, Rating, Vote_Count, Original_Language, Production_Countries, Prime_Genre, Prime_Production_Country, Rating_scale, Rating_of_10

Insights


Task Demonstrations

The purpose of this week’s data dive is for you to attempt building a generalized linear model, or to make adjustments to variables represented in a model.

Your RMarkdown notebook for this data dive should contain the following:

  • Select an interesting binary column of data, or one which can be reasonably converted into a binary variable 

    • This should be something worth modeling
  • Build a logistic regression model for this variable, using between 1-4 explanatory variables

    • Interpret the coefficients, and explain what they mean in your notebook

    • Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and translate its meaning

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.

Step 1 - Create the binary target variable

Select an interesting binary column of data, or one which can be reasonably converted into a binary variable 

  • This should be something worth modeling

Convert the RankForYear column into a binary variable. Assign a value of 1 for movies with RankForYear 10 or less (top 10 movies) and 0 otherwise. This will tell me if a movie was in the top 10 for its release year.

#add a column to identify movies in the top ten for each year.
movies$isTopTen <- ifelse(movies$RankForYear <= 10, 1, 0)
#head(movies[c("RankForYear", "isTopTen", "Rating_of_10", "WorldwideRevenue")])

Remove rows with a missing Rating_of_10 value so they do not cause errors later.

#remove na values
movies_clean <- movies[!is.na(movies$Rating_of_10), ]
#head(movies_clean)
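
The filter above only checks Rating_of_10, and glm() drops any row that is missing a value in any model variable. A minimal sketch of checking every model variable at once, using a made-up toy data frame (the values are assumptions for illustration, not rows from the movie data):

```r
# Toy data standing in for the movie table (values are made up for illustration)
toy <- data.frame(
  isTopTen     = c(1, 0, 0, 1, 0),
  Prime_Genre  = c("Action", NA, "Drama", "Comedy", "Drama"),
  Rating_of_10 = c(7.9, 6.1, NA, 8.2, 5.5),
  stringsAsFactors = FALSE
)

# Columns the model will use
model_vars <- c("isTopTen", "Prime_Genre", "Rating_of_10")

# Count missing values per column
na_counts <- colSums(is.na(toy[model_vars]))
print(na_counts)

# Keep only rows that are complete across every model variable
toy_complete <- toy[complete.cases(toy[model_vars]), ]
print(nrow(toy_complete))
```

Applied to movies_clean, the same pattern would show which columns still contain missing values before the model is fit.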

Step 2 - Choose Explanatory Variables

I am selecting variables that might influence a movie’s likelihood of being in the top 10 for its year.

I’m selecting the following variables for the given reasons:
Prime_Genre - Certain genres might be more popular and have a higher likelihood of producing top 10 movies.

Rating_of_10 - Ratings (scores) should play a major role in determining a movie’s financial success.

IsUSProduced - As the US has one of the largest movie industries, it should be interesting to see what impact US production has on a movie being in the top 10. I suspect that movies produced in the US are more likely to be in the top 10.

# Create the new binary variable
movies_clean$IsUSProduced <- ifelse(movies_clean$Prime_Production_Country == "United States of America", 1, 0)

# Verify the new column
#table(movies_clean$IsUSProduced)  # Check distribution of U.S.-produced vs. non-U.S.-produced movies
#head(movies_clean[c("Prime_Production_Country", "IsUSProduced")])  # Inspect new column

Step 3 - Build the Logistic Regression Model

Build a logistic regression model for this variable, using between 1-4 explanatory variables

I use the glm() function in R to build the model, with isTopTen as the dependent variable and Prime_Genre, IsUSProduced, and Rating_of_10 as predictors.

# Build the logistic regression model
model <- glm(isTopTen ~ Prime_Genre + IsUSProduced + Rating_of_10, 
             data = movies_clean, 
             family = binomial)
summary(model)    #look at the results
## 
## Call:
## glm(formula = isTopTen ~ Prime_Genre + IsUSProduced + Rating_of_10, 
##     family = binomial, data = movies_clean)
## 
## Coefficients:
##                              Estimate Std. Error z value             Pr(>|z|)
## (Intercept)                 -11.25380    0.80030 -14.062 < 0.0000000000000002
## Prime_GenreAdventure          0.80422    0.20770   3.872             0.000108
## Prime_GenreAnimation         -0.26650    0.22986  -1.159             0.246288
## Prime_GenreComedy            -2.17259    0.32121  -6.764      0.0000000000135
## Prime_GenreCrime             -2.26429    0.60489  -3.743             0.000182
## Prime_GenreDocumentary      -17.96197 1384.52375  -0.013             0.989649
## Prime_GenreDrama             -2.54021    0.30670  -8.282 < 0.0000000000000002
## Prime_GenreFamily            -0.57235    0.36695  -1.560             0.118817
## Prime_GenreFantasy            0.21279    0.36229   0.587             0.556966
## Prime_GenreHistory          -17.49692 1534.16759  -0.011             0.990900
## Prime_GenreHorror           -17.30666  597.16439  -0.029             0.976879
## Prime_GenreMusic             -2.40440    1.04873  -2.293             0.021867
## Prime_GenreMystery           -2.07037    1.03712  -1.996             0.045904
## Prime_GenreRomance          -17.33722  792.50039  -0.022             0.982546
## Prime_GenreScience Fiction    0.20715    0.32376   0.640             0.522283
## Prime_GenreThriller          -1.03857    0.44495  -2.334             0.019590
## Prime_GenreWar               -0.06055    0.57161  -0.106             0.915643
## Prime_GenreWestern          -18.02878 3680.66635  -0.005             0.996092
## IsUSProduced                  1.65274    0.15694  10.531 < 0.0000000000000002
## Rating_of_10                  1.21129    0.11077  10.935 < 0.0000000000000002
##                               
## (Intercept)                ***
## Prime_GenreAdventure       ***
## Prime_GenreAnimation          
## Prime_GenreComedy          ***
## Prime_GenreCrime           ***
## Prime_GenreDocumentary        
## Prime_GenreDrama           ***
## Prime_GenreFamily             
## Prime_GenreFantasy            
## Prime_GenreHistory            
## Prime_GenreHorror             
## Prime_GenreMusic           *  
## Prime_GenreMystery         *  
## Prime_GenreRomance            
## Prime_GenreScience Fiction    
## Prime_GenreThriller        *  
## Prime_GenreWar                
## Prime_GenreWestern            
## IsUSProduced               ***
## Rating_of_10               ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1963.9  on 4796  degrees of freedom
## Residual deviance: 1415.3  on 4777  degrees of freedom
##   (33 observations deleted due to missingness)
## AIC: 1455.3
## 
## Number of Fisher Scoring iterations: 18
# Diagnose the model using lindia
#gg_diagnose(model)
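
The null and residual deviances in the summary can be compared to judge whether the predictors jointly improve on an intercept-only model. A rough sketch using the four numbers copied from the output above; the chi-square approximation and McFadden’s pseudo R-squared are standard choices I am assuming here, not something the model output reports directly:

```r
# Values copied from the summary(model) output above
null_dev  <- 1963.9
null_df   <- 4796
resid_dev <- 1415.3
resid_df  <- 4777

# Likelihood-ratio (deviance) test against the intercept-only model
lr_stat <- null_dev - resid_dev      # drop in deviance
lr_df   <- null_df - resid_df        # number of estimated slopes
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared (valid here because the outcome is 0/1 per row)
mcfadden_r2 <- 1 - resid_dev / null_dev

print(c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value, mcfadden_r2 = mcfadden_r2))
```

Because several genre coefficients have very large standard errors, these figures are best read as a rough indication rather than a precise test.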

Step 4 - Interpret the coefficients

Interpret the coefficients, and explain what they mean in your notebook

Extract and interpret the coefficients from the model to understand the relationship between the predictors and the likelihood of a movie being in the top 10.

coefficients <- coef(model)
coefficients
##                (Intercept)       Prime_GenreAdventure 
##               -11.25379963                 0.80422087 
##       Prime_GenreAnimation          Prime_GenreComedy 
##                -0.26650049                -2.17258794 
##           Prime_GenreCrime     Prime_GenreDocumentary 
##                -2.26429139               -17.96197222 
##           Prime_GenreDrama          Prime_GenreFamily 
##                -2.54020957                -0.57234927 
##         Prime_GenreFantasy         Prime_GenreHistory 
##                 0.21279293               -17.49691872 
##          Prime_GenreHorror           Prime_GenreMusic 
##               -17.30666431                -2.40440024 
##         Prime_GenreMystery         Prime_GenreRomance 
##                -2.07037340               -17.33721603 
## Prime_GenreScience Fiction        Prime_GenreThriller 
##                 0.20714836                -1.03856784 
##             Prime_GenreWar         Prime_GenreWestern 
##                -0.06054635               -18.02878197 
##               IsUSProduced               Rating_of_10 
##                 1.65273669                 1.21129127

This is hard to read. I forgot that I can use the broom library to get a tidier table.

tidy_coefficients <- tidy(model)

# View the results
print(tidy_coefficients)
## # A tibble: 20 × 5
##    term                       estimate std.error statistic  p.value
##    <chr>                         <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                -11.3        0.800 -14.1     6.50e-45
##  2 Prime_GenreAdventure         0.804      0.208   3.87    1.08e- 4
##  3 Prime_GenreAnimation        -0.267      0.230  -1.16    2.46e- 1
##  4 Prime_GenreComedy           -2.17       0.321  -6.76    1.35e-11
##  5 Prime_GenreCrime            -2.26       0.605  -3.74    1.82e- 4
##  6 Prime_GenreDocumentary     -18.0     1385.     -0.0130  9.90e- 1
##  7 Prime_GenreDrama            -2.54       0.307  -8.28    1.21e-16
##  8 Prime_GenreFamily           -0.572      0.367  -1.56    1.19e- 1
##  9 Prime_GenreFantasy           0.213      0.362   0.587   5.57e- 1
## 10 Prime_GenreHistory         -17.5     1534.     -0.0114  9.91e- 1
## 11 Prime_GenreHorror          -17.3      597.     -0.0290  9.77e- 1
## 12 Prime_GenreMusic            -2.40       1.05   -2.29    2.19e- 2
## 13 Prime_GenreMystery          -2.07       1.04   -2.00    4.59e- 2
## 14 Prime_GenreRomance         -17.3      793.     -0.0219  9.83e- 1
## 15 Prime_GenreScience Fiction   0.207      0.324   0.640   5.22e- 1
## 16 Prime_GenreThriller         -1.04       0.445  -2.33    1.96e- 2
## 17 Prime_GenreWar              -0.0605     0.572  -0.106   9.16e- 1
## 18 Prime_GenreWestern         -18.0     3681.     -0.00490 9.96e- 1
## 19 IsUSProduced                 1.65       0.157  10.5     6.21e-26
## 20 Rating_of_10                 1.21       0.111  10.9     7.84e-28

Much easier to read now.

Interpretation:

To interpret the coefficients we’ll use the formula for converting log odds into odds: $e^{\text{coefficient}}$, where $e$ is Euler’s number. This puts the results in terms that are easier to interpret.
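
The conversion can be sketched with a few estimates copied by hand from the coefficient table; the vector names are shortened labels I chose for readability:

```r
# Log-odds estimates copied from the coefficient table above
log_odds <- c(
  Intercept    = -11.25380,
  Adventure    =   0.80422,
  Comedy       =  -2.17259,
  IsUSProduced =   1.65274,
  Rating_of_10 =   1.21129
)

# Exponentiate to move from the log-odds scale to the odds scale
odds_ratios <- exp(log_odds)

# The intercept is a baseline odds, the rest are odds ratios
print(signif(odds_ratios, 3))
```

With the fitted model in memory, exp(coef(model)) or tidy(model, exponentiate = TRUE) should give the same conversion for every term at once.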

Observation 1: The intercept estimate of -11.25379963 translates into odds of approximately 0.000013, meaning that when all predictors are at their baseline (an Action movie, not produced in the US, with a rating of 0) there is only a remote possibility that a movie will be in the top 10.

Observation 2: The estimate for Adventure movies translates into an odds ratio of 2.23, meaning the odds of a movie being in the top 10 are multiplied by 2.23 if the movie is in the Adventure genre rather than the baseline genre (Action). This is statistically significant: the p-value of .000108 is well below .05.

Observation 3: Based on their extremely low p-values (well under .05), the Crime, Comedy, and Drama genres are statistically significant. Their estimates range from about -2.17 to -2.54, which translates to odds ratios of roughly 0.08 to 0.11: the odds that a movie of these genres is in the top 10 are about 89% to 92% lower than for the baseline genre (Action).

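Observation 4: The Mystery, Music, and Thriller genres are mildly significant, with p-values between roughly .02 and .05. The remaining genres are not statistically significant at the .05 level. Documentary, History, Horror, Romance, and Western have estimates near -17 to -18 with standard errors in the hundreds or thousands, so those estimates are not reliable.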

Observation 5: US production and a movie’s rating out of 10 both appear to be statistically significant, as they have extremely low p-values. The odds of a US-produced movie being in the top 10 are higher by a factor of 5.22, and each additional rating point multiplies the odds by 3.36, holding the other predictors fixed.
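
Odds ratios can be hard to picture, so it may help to turn the estimates into predicted probabilities. A minimal sketch for a hypothetical US-produced movie rated 7 out of 10 (the inputs are my own illustration, and the estimates are copied by hand from the coefficient table):

```r
# Estimates copied from the coefficient table above
b_intercept <- -11.25380
b_adventure <-   0.80422
b_us        <-   1.65274
b_rating    <-   1.21129

# Hypothetical inputs (assumptions for illustration): US-produced, rated 7 out of 10
is_us  <- 1
rating <- 7

# Linear predictor (log odds) for an Action movie (baseline genre) and an Adventure movie
log_odds_action    <- b_intercept + b_us * is_us + b_rating * rating
log_odds_adventure <- log_odds_action + b_adventure

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_action    <- plogis(log_odds_action)
p_adventure <- plogis(log_odds_adventure)

# The ratio of the two odds recovers exp(b_adventure)
odds_ratio <- (p_adventure / (1 - p_adventure)) / (p_action / (1 - p_action))

print(c(action = p_action, adventure = p_adventure, odds_ratio = odds_ratio))
```

The odds ratio stays the same for any rating, but the gap between the two probabilities changes with the starting point, which is why odds ratios and probabilities should not be read interchangeably.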

Step 5 - Construct a Confidence Interval

Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and translate its meaning

Based on the earlier findings, a movie being in the Adventure genre increases its odds of being a top 10 movie more than any other genre, so we’ll focus on it.
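
The task asks for an interval built from the standard error, which can be done by hand as a Wald interval: estimate plus or minus a normal quantile times the standard error. A minimal sketch with the Adventure estimate and standard error copied from the model summary; confint(), used next, computes a profile-likelihood interval instead, so its limits can differ slightly:

```r
# Values copied from the Prime_GenreAdventure row of the model summary
estimate  <- 0.80422
std_error <- 0.20770

# 95% Wald interval on the log-odds scale
z <- qnorm(0.975)
wald_ci <- estimate + c(-1, 1) * z * std_error
names(wald_ci) <- c("2.5 %", "97.5 %")

# Exponentiate the limits to express the interval as odds ratios
odds_ci <- exp(wald_ci)

print(round(wald_ci, 4))
print(round(odds_ci, 2))
```

If the whole odds-ratio interval lies above 1, the data are consistent with Adventure movies having higher odds of a top 10 finish than the baseline genre.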

# Calculate confidence intervals for all coefficients
conf_intervals <- confint(model)
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
# View the CI specifically for Adventure
adventure_ci <- conf_intervals["Prime_GenreAdventure", ]
print(adventure_ci)
##     2.5 %    97.5 % 
## 0.3949812 1.2103872

OK. I know this gives me what I need, but those warnings bother me. After some research, I found that they occur when the fitted probabilities are pushed to exactly 0 or 1 (known as separation), which can happen if every movie in a genre has the same outcome, for example a genre with no top 10 movies at all. If I’m right, the solution is to find the genres with this feature and rerun the model without them.
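
The check described above can be sketched on a toy table before running it on the real data. The data frame below is made up for illustration; only the column names match the notebook:

```r
# Toy data (made up for illustration): "Western" has no top-ten movies
toy <- data.frame(
  Prime_Genre = c("Action", "Action", "Drama", "Drama", "Western", "Western"),
  isTopTen    = c(1, 0, 1, 0, 0, 0)
)

# Number of top-ten movies and total movies per genre
top_ten_by_genre <- tapply(toy$isTopTen, toy$Prime_Genre, sum)
n_by_genre       <- tapply(toy$isTopTen, toy$Prime_Genre, length)

# Genres whose outcome never varies: no top-ten movies, or only top-ten movies
never_varies <- top_ten_by_genre == 0 | top_ten_by_genre == n_by_genre
no_variation <- names(top_ten_by_genre)[never_varies]

print(no_variation)
```

Genres flagged this way are the ones whose coefficients the model cannot estimate reliably.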

Some cleanup before we look at the CI

#evaluating why there are so many warning messages.
# Cross-tabulate isTopTen with Prime_Genre and IsUSProduced
table(movies_clean$isTopTen, movies_clean$Prime_Genre)      #binary/categorical
##    
##     Action Adventure Animation Comedy Crime Documentary Drama Family Fantasy
##   0    721       218       299   1046   184          58   948    132     102
##   1     75        58        39     12     3           0    14     10      11
##    
##     History Horror Music Mystery Romance Science Fiction Thriller  War Western
##   0      43    286    36      63     153              88      150   37       8
##   1       0      0     1       1       0              16        6    4       0
table(movies_clean$isTopTen, movies_clean$IsUSProduced)     #binary/binary
##    
##        0    1
##   0 2665 1885
##   1   72  178
#table(movies_clean$isTopTen, movies_clean$Rating_of_10)    #binary/continuous
# Create a filtered dataset excluding the genres that have no top 10 movies
movies_filtered <- movies_clean %>%
  filter(!Prime_Genre %in% c("Horror", "Documentary", "History", "Romance", "Western"))

# View the filtered data
#print(movies_filtered)
# Rerun the logistic regression model on the filtered data set
model2 <- glm(isTopTen ~ Prime_Genre + IsUSProduced + Rating_of_10, 
              data = movies_filtered, 
              family = binomial)
summary(model2)    #look at the results
## 
## Call:
## glm(formula = isTopTen ~ Prime_Genre + IsUSProduced + Rating_of_10, 
##     family = binomial, data = movies_filtered)
## 
## Coefficients:
##                             Estimate Std. Error z value             Pr(>|z|)
## (Intercept)                -11.25380    0.80030 -14.062 < 0.0000000000000002
## Prime_GenreAdventure         0.80422    0.20770   3.872             0.000108
## Prime_GenreAnimation        -0.26650    0.22986  -1.159             0.246288
## Prime_GenreComedy           -2.17259    0.32121  -6.764      0.0000000000134
## Prime_GenreCrime            -2.26429    0.60489  -3.743             0.000182
## Prime_GenreDrama            -2.54021    0.30670  -8.282 < 0.0000000000000002
## Prime_GenreFamily           -0.57235    0.36695  -1.560             0.118817
## Prime_GenreFantasy           0.21279    0.36229   0.587             0.556965
## Prime_GenreMusic            -2.40440    1.04873  -2.293             0.021866
## Prime_GenreMystery          -2.07037    1.03711  -1.996             0.045903
## Prime_GenreScience Fiction   0.20715    0.32376   0.640             0.522283
## Prime_GenreThriller         -1.03857    0.44495  -2.334             0.019590
## Prime_GenreWar              -0.06055    0.57160  -0.106             0.915643
## IsUSProduced                 1.65274    0.15694  10.531 < 0.0000000000000002
## Rating_of_10                 1.21129    0.11077  10.935 < 0.0000000000000002
##                               
## (Intercept)                ***
## Prime_GenreAdventure       ***
## Prime_GenreAnimation          
## Prime_GenreComedy          ***
## Prime_GenreCrime           ***
## Prime_GenreDrama           ***
## Prime_GenreFamily             
## Prime_GenreFantasy            
## Prime_GenreMusic           *  
## Prime_GenreMystery         *  
## Prime_GenreScience Fiction    
## Prime_GenreThriller        *  
## Prime_GenreWar                
## IsUSProduced               ***
## Rating_of_10               ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1902.5  on 4256  degrees of freedom
## Residual deviance: 1415.3  on 4242  degrees of freedom
##   (25 observations deleted due to missingness)
## AIC: 1445.3
## 
## Number of Fisher Scoring iterations: 7

Rerun the CI

# Calculate confidence intervals for all coefficients
conf_intervals2 <- confint(model2)
## Waiting for profiling to be done...
# View the CI specifically for Adventure
adventure_ci <- conf_intervals2["Prime_GenreAdventure", ]
print(adventure_ci)
##     2.5 %    97.5 % 
## 0.3949814 1.2103870
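As a quick cross-check, a Wald interval can be computed by hand from the estimate and standard error printed in the model summary. This is only a sketch: confint() on a glm uses profile likelihood (hence the "Waiting for profiling" message), so the two intervals should be close but not identical. The variable names are made up for illustration.

```r
# Estimate and standard error for Prime_GenreAdventure, copied from summary(model2)
estimate  <- 0.80422
std_error <- 0.20770

# Wald 95% interval on the log-odds scale: estimate +/- z * SE
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)
print(round(wald_ci, 4))
```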

Yay, much better! Now to analyze.

First let’s translate the log-odds bounds into odds ratios. e^0.3949814 = approx 1.48 and e^1.2103870 = approx 3.35.

Practically, this means we can be 95% confident that the true odds ratio for the Adventure genre falls within this range. The comparison is against Action, the reference genre in the model, with IsUSProduced and Rating_of_10 held constant. At the low end of the confidence interval an Adventure movie still has about 48% higher odds of being a top 10 movie than an Action movie; at the top end its odds are roughly 3.35 times as high. Since the entire CI is greater than 1, this confirms that being in the Adventure genre has a positive effect on the odds that a movie lands in the top 10.
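To make the odds-versus-probability distinction concrete, here is a small sketch that exponentiates the interval and applies it to a baseline probability. The baseline is the raw share of Action movies in the top 10 from the cross-tab (75 of 796); it ignores IsUSProduced and Rating_of_10, so treat the result as an illustration rather than a model prediction. `apply_odds_ratio` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: apply an odds ratio to a baseline probability
apply_odds_ratio <- function(p_baseline, odds_ratio) {
  odds <- p_baseline / (1 - p_baseline)  # probability -> odds
  new_odds <- odds * odds_ratio          # scale the odds
  new_odds / (1 + new_odds)              # odds -> probability
}

# CI bounds for Adventure on the log-odds scale, converted to odds ratios
or_bounds <- exp(c(lower = 0.3949814, upper = 1.2103870))

# Raw share of Action movies in the top 10, from the cross-tab
p_action <- 75 / (721 + 75)

# Implied top 10 probability for Adventure at each end of the interval
p_adventure <- apply_odds_ratio(p_action, or_bounds)
print(round(c(baseline = p_action, p_adventure), 3))
```

Under these assumptions a baseline of roughly 9% moves to somewhere between about 13% and 26%, which shows that an odds ratio is not the same thing as a multiplier on the probability.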

# Create a data frame for plotting, with every value converted from log odds to an odds ratio
ci_data <- data.frame(
  Genre = "Adventure",
  Estimate = exp(coef(model2)[["Prime_GenreAdventure"]]), # Point estimate as an odds ratio
  Lower = exp(adventure_ci[["2.5 %"]]),  # Lower bound of the CI
  Upper = exp(adventure_ci[["97.5 %"]])  # Upper bound of the CI
)

# Plot the CI
ggplot(ci_data, aes(x = Genre, y = Estimate)) +
  geom_point(size = 4) +  # Point for the estimate
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2) +  # CI error bars
  scale_y_log10() +  # Log scale for odds ratios
  labs(
    title = "Confidence Interval for Adventure Genre",
    x = "Genre",
    y = "Odds Ratio"
  ) +
  theme_minimal()

Other insights

While these factors help explain what drives movies into the top 10, there may be other factors in the data worth considering:

  • Do movies that perform better in foreign markets have a better chance of reaching the top 10? We might ask the same question about domestic markets.

  • Additionally, it would be interesting to identify the most and least profitable years and run this same model on those smaller data sets.

  • While this model does appear to show significant influencers, what data outside the data set might help answer this question?

    • As discussed in previous exercises, the exclusion of budget from this data set is a significant gap.

    • For this analysis, knowing the baseline budget for each genre might be a very telling addition to the model, since budget is suspected to be a strong factor in a movie’s success.