1 Introduction

The movie industry is a hard fought industry where every year new movies come out, which either produce multi-million dollar losses or profits for their respective movie studios. Any help for the decision making process for movie studios over which movies to greenlight and which proposals to reject can thus be incredibly helpful as well as profitable for those providing insights. There is a strong belief that using applied machine learning and predictive modeling one can provide the movie studios a better insight on why each movie will be either successful or not. Thus the goal in this research paper is to find out which factors have a positive influence on a movies success and if it is possible to create a model to predict and analyse which movies are successful. This information could help movie production companies in their decision making of which future movies to greenlight for production.

1.1 Research Question

The following question is the central research question that this project tries to answer:

“Which predictors have a statistical significance in increasing profit?”

1.2 What are the goals of the analysis

The goals of the analysis is to use data science methodology to create a model to help predict and explain successes and failures in the movie industry. With the help of this model executives for movie companies can make informed decision on green lighting future movies by having an analysis of previous movie successes.

1.3 Data Source

The data set used for this project comes from IMBD, from which it also includes rating information. The data set includes both movies from the United States as well as movies from other places around the world. For this analysis only movies released between 1980 and 2020 were considered.

2 Data Preparation and Cleaning

Leading student: Eduardo Pacheco

d.movies_raw <- read.csv("movies.csv")
# summary(d.movies_raw)

2.1 Dealing with missing values (NaN)

We have applied the missing data pattern function md.pattern() to display a matrix with the missing observations in the data, for space-saving reasons it will not be display.

Most of missing values were found in the columns “budget” and “gross”, so we proceed to delete them as there is no proper way to impute this information.

d.movies_raw_02 <- d.movies_raw %>% drop_na(gross,budget)
summary(is.na(d.movies_raw_02$runtime))

##    Mode   FALSE    TRUE 
## logical    5435       1

Next we noticed in the summary that we have just one missing value left in the variable “runtime”. It will be apply a mean single imputation as most movies have in average the same duration.

Imputation of runtime.

# round(mean(d.movies_raw_02$runtime, na.rm = TRUE),2)
d.imp_movies <- mice(d.movies_raw_02,method="mean", m=1)

## 
##  iter imp variable
##   1   1  runtime
##   2   1  runtime
##   3   1  runtime
##   4   1  runtime
##   5   1  runtime

d.movies_raw03 <- complete(d.imp_movies)

We control again if we still have missing values.

md.pattern(d.movies_raw03, plot = FALSE)

##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##      name rating genre year released score votes director writer star country
## 5436    1      1     1    1        1     1     1        1      1    1       1
##         0      0     0    0        0     0     0        0      0    0       0
##      budget gross company runtime profit  
## 5436      1     1       1       1      1 0
##           0     0       0       0      0 0

3 Data Enrichment

Leading student: Eduardo Pacheco

3.1 Adding a binary variable

It will be define a new binary variable for modelling purposes, it will be named ‘in.out_usa’. It is define with a logical value (1/0) to determine if the movies was done inside the United States(value = 1) or in any other country(value = 0).

d.movies_raw03 <- d.movies_raw03 %>%
  mutate(in.out_usa = if_else(d.movies_raw03$country == "United States",1,0))
summary(d.movies_raw03$in.out_usa)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.7958  1.0000  1.0000

3.2 Adding a profit percentage related to budget

The source data set contains both small independent movies, as well as blockbuster movies with huge budgets. For this reason we consider reasonable to create another profit-related variable. It will be called ‘profit.percentage’, it is define as ‘gross’ divided by ‘budget’.

d.movies_raw03 <- d.movies_raw03 %>%
  mutate(profit.percentage = gross/budget)
summary(d.movies_raw03$in.out_usa)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.7958  1.0000  1.0000

3.3 Modifying the profit variable

The profit is defined as the subtraction, gross - budget. This will be our main response variable. Amounts need to be almost always log-transformed before being used for further analysis. The profit log transformation of negative and zero values produces undefined values, which are transform by R to NaNs. Thus we have decided to drop them off. We control again to have none missing values, and end up with 3684 clean observations.

#d.movies <- d.movies %>%
# mutate(profit = gross-budget)
d.movies_raw03$profit = log(d.movies_raw03$profit)
d.movies <- d.movies_raw03 %>% drop_na(profit)
summary(is.na(d.movies$profit))

##    Mode   FALSE 
## logical    3684

3.4 Adding Released Month

With the help of regular expressions it is possible to extract the released month out of the “released” field. In an attempt to identify seasonality and trends, and a possible effect on the profit, we will add the released month as a separate variable. The variable will be called ‘release.month’.

d.movies <- d.movies %>%
  mutate(release.month = stri_extract_first(d.movies$released, regex="\\w+"))
d.movies$release.month[grepl("^[[:digit:]]+",d.movies$release.month)] <- ''
summary(d.movies$release.month)

##    Length     Class      Mode 
##      3684 character character

4 Visualizing the Data

Leading student: Simon Kohler

First one can visualize the profits of movies using the log transformed profit variable.

hist.profit <- ggplot(d.movies, aes(x=profit)) +
                geom_histogram(aes(y=..density..), bins=30,  
                               colour="lightsteelblue4", fill="steelblue1")+
                geom_density(alpha=.2, colour="lightcoral", fill="firebrick1") + 
                ggtitle("Profit Histogram") +
                theme(plot.title = element_text(hjust = 0.5,
                                  size = 17,
                                  face = "bold"),
                      axis.title = element_text(size = 9))
hist.profit

The histogram shows that the log transformed profits are closely clustered together with the majority being around 17.5. There are no large amount of outliers visible.

5 Fitting Statistical Models

Leading student: Simon Kohler

5.1 Linear Model

To fit the starting model all relevant predictors, as well as their interactions, must be included. For this research document the following starting model is used:

lm.movies.0 <- lm(profit ~ 
  genre + log(budget) + runtime + score + year
  + genre:log(budget) + genre:runtime + genre:score + genre:year,
  data = d.movies)

To further develop the model one can check whether all interactions are in fact needed:

drop1(lm.movies.0, test = "F")

There seems to be highly significant interactions between the genre and budget, genre and year, as well as genre and score. Only the interaction between genre and runtime is not as significant. For now it is possible to drop this interaction from the model:

lm.movies.1 <- update(lm.movies.0, . ~ . - genre:runtime)
drop1(lm.movies.1, test = "F")

summary(lm.movies.1)$adj.r.squared

## [1] 0.4310078

Now all the interactions left are highly significant, except the predictor runtime. Thus we drop this predictor:

lm.movies.2 <- update(lm.movies.1, . ~ . - runtime)
drop1(lm.movies.2, test = "F")

summary(lm.movies.2)$adj.r.squared

## [1] 0.4307665

While all the remaining predictors in the model are significant to at least the 1% level, the adjusted R squared value for this model is still only 0.43. This means that only 43% of the variance in the dependent variable is explained by the model.

5.2 Multiple Linear Regression Models

One can also try and fit another model using less/different predictors. For this we create a model that puts the profit into relation of the release month and the budget of each movie:

fit_m =lm(profit~ release.month + log(budget), data=d.movies)
tidy(fit_m)

The result shows that there are no significant interactions between the release months and the profit. There is however a highly significant interaction between the budget and the profit. As the budget is used to calculate the profit this was to be expected. Plotting the model with the additional factor of, if it is a movie produced in the USA or not, yields the following result:

ggplot(d.movies,aes(y=profit,x=release.month,color=factor(in.out_usa)))+
  ggtitle("Profit by Released Month") +
  theme(plot.title = element_text(hjust = 0.5,
                                  size = 14,
                                  face = "bold"),
                      axis.title = element_text(size = 9))+
  theme(axis.text.x = element_text(angle = 45))+
  geom_point()

summary(fit_m)$adj.r.squared # adjusted r squared

## [1] 0.3717876

In conclusion this Multiple Linear Regression shows some evidence that the release month of a movie on its own does not have a statistically significant impact on a movies profit. While it does not disprove the myth that movies should be released right before Christmas, it does not find any prove that this is a good strategy.

6 Generalized Additive Model

Leading student: Caroline Wanyonyi

The dataset used in the next 3 models (GAM, Poisson and Binomial models) does not include the log transformed profit variable but the original dataset with non-log transformed profit, which could explain the possible differences. As there is already either log transformation and usage of log link in the model functions, it was not deemed necessary to double log transform the profit variable in case it was used as a response variable.

We will be using a GAM model for its flexibility for the nonlinear functions to model data without over-fitting to predict scores based on all the other predictor variables. First we will visualize the response variable against each possible predictor. As the response variable score, can only take positive values, we log-transform it. We observe the relationships between the log-transformed score against runtime, profit, votes and budget below.

# Load the dataset with reduced columns to run pairs
d.movies_g <- read.csv("movies_clean.csv")
# Plot to find nonlinear relationships 
# runtime
plot1<- ggplot(data = d.movies_g,
       mapping = aes(y = log(score),
                     x = runtime)) +
  geom_point() +
  geom_smooth()

# profit
plot2<- ggplot(data = d.movies_g,
       mapping = aes(y = log(score),
                     x = profit)) +
  geom_point() +
  geom_smooth()

#votes
plot3<- ggplot(data = d.movies_g,
               mapping = aes(y = log(score),
                             x = votes)) +
  geom_point() +
  geom_smooth()

# budget
plot4<- ggplot(data = d.movies_g,
               mapping = aes(y = log(score),
                             x = budget)) +
  geom_point() +
  geom_smooth()


grid.arrange(plot1, plot2, plot3, plot4, ncol=2, nrow=2)

Log-Transformed vs Predictor Plot Analysis

All the relationships are not linear. Runtime and profit have more complex relationships and the smoothers show two bumps. Score and Runtime relationships may be approximated with a cubic polynomials. We will use a GAMs model and see what the estimated complexity of the smooth term (edf value)

6.1 Fitting a GAM Regression Model

The problem statement we want to know is, to which extent budget, runtime, profit and votes, predict the score rating of the movies. We will use a multivariate GAM which utilizes multiple predictor variables for the outcome.

gam_score<- gam(log(score) ~ s(runtime)+ s(budget) + s(profit) + s(votes), data = d.movies_g)
summary(gam_score)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(score) ~ s(runtime) + s(budget) + s(profit) + s(votes)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.866186   0.001966   949.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df       F p-value    
## s(runtime) 8.178  8.773  61.402  <2e-16 ***
## s(budget)  6.500  7.575  50.938  <2e-16 ***
## s(profit)  4.080  5.158   8.307  <2e-16 ***
## s(votes)   8.202  8.816 126.931  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.411   Deviance explained = 41.5%
## GCV = 0.014347  Scale est. = 0.014238  n = 3684

#tidy(gam_score)

Interpreting Coefficients of GAM There is strong evidence that s(runtime), s(budget), s(profit) and s(votes) are statistically significant. There are no smooth terms estimated to be linear as all the estimated degrees of freedom are above 1.00. The smooth term s(votes) has the largest edf value (i.e 8.2). There are parametric terms in this model as there is an intercept.

Coefficients of the GAM

When extracting the coefficients, we see that for each predictor, one-smooth model has many coefficients for each predictor (runtime, budget, profit and votes) as shown below.

# Coefficients
coef(gam_score)

##   (Intercept)  s(runtime).1  s(runtime).2  s(runtime).3  s(runtime).4 
##  1.8661858830 -0.3461082513  0.4635633219  0.1895628734  0.2587662449 
##  s(runtime).5  s(runtime).6  s(runtime).7  s(runtime).8  s(runtime).9 
##  0.1924181124 -0.2401823853  0.1694565868 -0.6266776090 -0.1685896271 
##   s(budget).1   s(budget).2   s(budget).3   s(budget).4   s(budget).5 
##  0.0823942847 -0.2412478269  0.0690209678  0.1464528923 -0.0558081091 
##   s(budget).6   s(budget).7   s(budget).8   s(budget).9   s(profit).1 
##  0.0932704876 -0.0127529671 -0.3130558573 -0.0740034528 -0.0221081209 
##   s(profit).2   s(profit).3   s(profit).4   s(profit).5   s(profit).6 
## -0.0089582239 -0.0008038767 -0.0014852830 -0.0006489734  0.0004156629 
##   s(profit).7   s(profit).8   s(profit).9    s(votes).1    s(votes).2 
## -0.0001553528  0.0064715965  0.0088766669 -0.0700340520  0.3365333614 
##    s(votes).3    s(votes).4    s(votes).5    s(votes).6    s(votes).7 
## -0.0966357607 -0.1960850016 -0.0311579040  0.1909198512 -0.1271662460 
##    s(votes).8    s(votes).9 
##  0.2436218908  0.1868134648

6.2 GAM Model Check and Summary

par(mfrow = c(2, 2))
gam_score<- gam(log(score) ~ s(runtime)+ s(budget) + s(profit) + s(votes), data = d.movies_g)
gam.check(gam_score)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 9 iterations.
## The RMS GCV score gradient at convergence was 6.824614e-08 .
## The Hessian was positive definite.
## Model rank =  37 / 37 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##              k'  edf k-index p-value    
## s(runtime) 9.00 8.18    0.96    0.01 ** 
## s(budget)  9.00 6.50    0.91  <2e-16 ***
## s(profit)  9.00 4.08    1.02    0.82    
## s(votes)   9.00 8.20    0.99    0.18    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion on GAM

The QQ Plot which compares the model residuals to a normal distribution shows that our model is a poor fit, as our model’s residuals are not close to a straight line. The histogram is left skewed instead of a symmetrical bell shape. The residual plot shows the residual values which are evenly distributed around zero which is a positive aspect of our model.

In conclusion, the final plot of response vs fitted shows that our model is again not so perfect as a perfect model would form a straight line. So the GAM model is most likely not the best fit for our data-set to predict the score.

7 Poisson Regression

Leading student: Caroline Wanyonyi

Poisson regression assumes the response variables are counts, positive integers, and follow a Poisson distribution where the mean and variance should be the same.The runtime response variable in our data set is the most likely candidate and it represents the length of a movie in minutes. It will be used to determine whether budget, profit and score have an impact on the length of a movie. This is a kind of reverse way of doing predictions but given our data set limitations without making too many drastic changes, we will do it this way.

We will visualize the data with a histogram to see if runtime conforms to a Poisson distribution.

require(flexplot)
flexplot(runtime~1, data=d.movies_g)

Based on the histogram, we have a poisson distribution for the runtime. Next we will check if the mean and variance for runtime are equal and also how close they are to zero.

Mean, Variance and ratio of runtime

# d.movies <- read.csv("movies_clean.csv")    *d.movies already exits, no need to import a new cleaned file as d.movies is already cleaned*

# Change runtime from numeric to integer in order to use GLM
d.movies_g$runtime <- as.integer(as.numeric(d.movies_g$runtime))

# Add labels to the in.out_usa variables i.e non-US and US
d.movies_g$in.out_usa<- factor(d.movies_g$in.out_usa,
                    levels = c(0,1),
                    labels = c("non-US", "US"))  #label  sex

#Check if mean = variance for both
mean(d.movies_g$runtime)%>%round(4); var(d.movies_g$runtime)%>%round(4)

## [1] 108.797

## [1] 330.4937

#Check if its close to 1 otherwise we have overdispersion
mean_var_ratio <-var(d.movies_g$runtime)/mean(d.movies_g$runtime)

The mean is approximately 109 which is not close to zero and thus the data set is probably not suited for a Poisson regression model. The variance to mean ratio is 3 and is a strong indicator of Overdispersion. An ideal ratio would be close to 1.

7.1 Fitting Poisson Regression Model

The problem statement we want to know is, to which extent budget, score and location predict the runtime/length in time for the movies.

flexplot (runtime ~ budget |in.out_usa + score, data=d.movies_g, method="poisson", jitter=c(0, .2), ghost.line = "red")

Plot for Poisson Model Runtime against in.out_usa,score and budget

#flexplot (runtime ~ score | budget, #data=d.movies, method="poisson", #jitter=c(0, .2), ghost.line = "red")

modela<-glm(runtime~ score + budget + in.out_usa, data = d.movies_g, family = poisson(link = "log"))

model2<-glm(runtime~ budget + score, data = d.movies_g, family = poisson(link = "log"))

summary(modela)

## 
## Call:
## glm(formula = runtime ~ score + budget + in.out_usa, family = poisson(link = "log"), 
##     data = d.movies_g)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.176e+00  1.263e-02 330.587  < 2e-16 ***
## score         7.402e-02  1.757e-03  42.130  < 2e-16 ***
## budget        1.060e-09  3.181e-11  33.322  < 2e-16 ***
## in.out_usaUS -2.142e-02  4.124e-03  -5.194 2.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10705.2  on 3683  degrees of freedom
## Residual deviance:  7487.9  on 3680  degrees of freedom
## AIC: 31501
## 
## Number of Fisher Scoring iterations: 4

# tidy(modela)
# summ(model1, confint = TRUE, exp = TRUE, digits=3)

7.2 Poisson Regression Model Analysis

The Residual Deviance is bigger than Degrees of Freedom which suggest that the variance is bigger than the mean. We already know we have Overdispersion from previous tests and the Maximum Likelihood Estimates are still consistent but the Standard Error will be smaller than they should be.

We will correct this by rerunning poisson regression with quasipoisson which will adjust the Standard error but will not change the estimated model coefficients.

model1<-glm(runtime~ score + budget + in.out_usa, data = d.movies_g, family = quasipoisson(link = "log"))
summary(model1)

## 
## Call:
## glm(formula = runtime ~ score + budget + in.out_usa, family = quasipoisson(link = "log"), 
##     data = d.movies_g)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.176e+00  1.815e-02 230.124  < 2e-16 ***
## score         7.402e-02  2.524e-03  29.327  < 2e-16 ***
## budget        1.060e-09  4.569e-11  23.196  < 2e-16 ***
## in.out_usaUS -2.142e-02  5.925e-03  -3.616 0.000304 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.063699)
## 
##     Null deviance: 10705.2  on 3683  degrees of freedom
## Residual deviance:  7487.9  on 3680  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

summ(model1, confint = TRUE, exp = TRUE, digits=3)

Observations	3684
Dependent variable	runtime
Type	Generalized linear model
Family	quasipoisson
Link	log

χ²(3)	3217.233
Pseudo-R² (Cragg-Uhler)	0.582
Pseudo-R² (McFadden)	0.093
AIC	NA
BIC	NA

	exp(Est.)	2.5%	97.5%	t val.	p
(Intercept)	65.087	62.813	67.444	230.124	0.000
score	1.077	1.072	1.082	29.327	0.000
budget	1.000	1.000	1.000	23.196	0.000
in.out_usaUS	0.979	0.968	0.990	-3.616	0.000
Standard errors: MLE

Note that the dispersion parameter is estimated to be 2.0 while in the non-quasipoisson model it was fixed to the value 1. This implies that the variance increase faster than linearly. Although the estimated coefficients did not change, the standard error did and as a consequence the associated p-values.

Interpretation of Coefficients

We have also exponentialized the coefficients to get the correct interpretations using the summ function.

The correct interpretation of this coefficient is: “US made movies get, on average, about 10% more minutes than non US made movies. This interpretation is not additive as in Linear Model but rather multiplicative. This difference in interpretation is due to the use of the logarithm as a link function.

The correct interpretation for the budget coefficient is: “For a given movie, increasing its budget by 1 USD, would results in no additional minutes or 0.9 minutes.

The p values for all predictors for the runtime response variable are below 0.05 and are therefore statistically signficant.

7.3 Dispersion Test

This checks for Overdispersion or Undersdispersion for both models. If we have Overdispersion or Underdispersion, the dataset will not fit the Poisson well and this can cause further issues with the predictions. We have already seen above that we have Overdispersion based on the mean and variance ratio.

# Dispersion test
dispersiontest(modela, trafo=1) # NHST equidispersion

## 
##  Overdispersion test
## 
## data:  modela
## z = 16.222, p-value < 2.2e-16
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##   alpha 
## 1.06141

Analysis of Dispersion

The Residual Deviance / Degrees of Freedom should be about 1. 1.1 is Overdispersed and 0.9 is Underdispersed. 1.06 is closer to 1.1 when rounded up is approximately 1.1. It is clear that our model shows Overdispersion as the value is above 1.1.

7.4 Compare both models

We fitted 2 models, one with 2 predictors and another with 3 predictors.The two models similar except model1 has an additional predictor variable in.out_usa. Both model plots are similar in fit given the overlapping lines.

model1<-glm(runtime~ in.out_usa + budget + score, data = d.movies_g, family = poisson(link = "log"))

model2<-glm(runtime~ budget + score, data = d.movies_g, family = poisson(link = "log"))

compare.fits (runtime ~score | in.out_usa + budget, data=d.movies_g, model1, model2, jitter = c(0, .1))

model.comparison (model1, model2)

## $statistics
##             aic      bic bayes.factor      p
## model1 31500.76 31525.60     11224.08 <2e-16
## model2 31525.62 31544.26         0.00       
## 
## $predicted_differences
##    0%   25%   50%   75%  100% 
## 0.001 0.331 0.461 0.647 2.749

When we compare the AICs of the model, the model with the lower AIC statistics is the better model. The hypothesized model1 (AIC=31500) is a better fit than model2 (AIC=31525).

7.5 Table of Predictor of Runtime

This is a summary of the predictors which includes the Incidence Rate Ratios, Confidence Intervals and p values.

sjPlot::tab_model(model1 , show.intercept = TRUE,
                  show.se = FALSE, dv.labels = "Predictors of runtime", auto.label = TRUE, show.re.var = FALSE, show.icc =FALSE, 
                  show.r2 = FALSE, show.ngroups = FALSE, show.obs = FALSE)

	Predictors of runtime
Predictors	Incidence Rate Ratios	CI	p
(Intercept)	65.09	63.50 – 66.72	<0.001
in out usa [US]	0.98	0.97 – 0.99	<0.001
budget	1.00	1.00 – 1.00	<0.001
score	1.08	1.07 – 1.08	<0.001

The p values for all predictors for the runtime response variable are below 0.05 and are therefore statistically significant.

7.6 Poisson Regression Model Visualization

Below is an effect plot for a Poisson regression model using Genre predictor variable against runtime.

visualize (model1, plot = "model", jitter =c(0,.1))

Conclusion for Poisson Regression Model

Our movies dataset with the response variable of runtime is not suited for the Poisson Regression model because we have Overdispersion. When we have Overdispersion it is evident that the dataset will not fit the Poisson well and this can cause understimated Standard error, which in turn will result in a too small and the decision will result in an Inflated type 1 error and lead to a false positives.

8 Binomial Regression

Leading student: Caroline Wanyonyi

In the binomial regression model, the observations correspond to some binary observation denoting presence or absence, ones or zeroes as in our case. Our data set contains the binary in.out_usa variable that denoted movies that are made in US (1) and those are not made in US(0).

We want to predict where a movie is made (in USA or not) based on budget, profit and score of a movie. The process to determine this involves the following steps after loading the data.

Change the dependent variable in.out_usa to a Factor. 0 is non USA shooting and 1 is USA location.
Split the data (60% training set and 40% test set)
Train the model using the training dataset binomial GLM with in.out_usa as dependent variable and budget, score and profit as independent variables.

# d.movies <- read.csv("movies_clean.csv") *here again no need to import a new dataSource*
#summary(d.movies)
# head(d.movies)

# Check column type
#table(d.movies$in.out_usa)

# Change to Factor
d.movies_g$in.out_usa <- as.factor(d.movies_g$in.out_usa)

# Split the data --60% training set 40% test set
split <- sample.split(d.movies_g, SplitRatio = 0.6)

train <- subset(d.movies_g, split =="TRUE")
test <- subset(d.movies_g, split =="FALSE")

mymodel <- glm (in.out_usa ~ budget  + profit + score, data=d.movies_g, family = binomial)
par(mfrow=c(2,2))
plot(mymodel)

Binomial Coefficients

summary(mymodel)

## 
## Call:
## glm(formula = in.out_usa ~ budget + profit + score, family = binomial, 
##     data = d.movies_g)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.626e+00  5.363e-01   4.896 9.78e-07 ***
## budget       1.294e-09  1.182e-09   1.095    0.273    
## profit       1.387e-01  2.839e-02   4.886 1.03e-06 ***
## score       -5.328e-01  5.287e-02 -10.076  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3469.8  on 3683  degrees of freedom
## Residual deviance: 3333.3  on 3680  degrees of freedom
## AIC: 3341.3
## 
## Number of Fisher Scoring iterations: 4

summ(mymodel, confint = TRUE, exp = TRUE, digits=3)

Observations	3684
Dependent variable	in.out_usa
Type	Generalized linear model
Family	binomial
Link	logit

χ²(3)	136.545
Pseudo-R² (Cragg-Uhler)	0.060
Pseudo-R² (McFadden)	0.039
AIC	3341.283
BIC	3366.130

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	13.814	4.829	39.518	4.896	0.000
budget	1.000	1.000	1.000	1.095	0.273
profit	1.149	1.087	1.215	4.886	0.000
score	0.587	0.529	0.651	-10.076	0.000
Standard errors: MLE

8.1 Interpretation of the Coefficients of the Binomial Model

As link functions are used the interpretation of these coefficients must be adapted as is done by the summ function.

By increasing the amount of budget money by 1 USD, the odds of the movie being made in US increases by 1. By increasing the score by 1, the odds of the movie being made in US increases by 0.6.

We observe the p values for the predictor variables and there is statistical significance in the profit and score. Budget is not statistically significant as its p value is higher than 0.05.

We run the test data through the model and setup a Confusion matrix to validate the model. We want to confirm if our predicted answers match up with the actual values.

par(mfrow = c(2, 2))
res<- predict(mymodel,test,type = "response")

res<- predict(mymodel,train,type = "response")

# Validate the model -- Confusion matrix

confmatrix <- table(Actual_value=train$in.out_usa, Predicted_value = res >0.7)

confmatrix

##             Predicted_value
## Actual_value FALSE TRUE
##       non-US    54  323
##       US        81 1676

8.2 Analysis of Confusion Matrix

Based on the results from the Confusion matrix above, the following is observed:

62 times, we predicted False when it was actually False
1664 times we predicted True when it was True
331 times we predicted True when it was False
75 times we predicted False when it was True

# Accuracy --Confirm our percentage accuracy
(confmatrix [[1,1]] + confmatrix [[2,2]])/sum(confmatrix)

## [1] 0.8106842

We can confirm that our percentage accuracy is approximately 81% which is not so good and not so bad.

8.3 Logistic regression model with continuous predictor

Additional logistic regression models with multiple continuous predictor variables are displayed below with predictions.

The addition of genre to budget is an interesting predictor as the genre may need a specific location for example Western movies are probably better suited to US while European or foreign movies are best set outside of the USA.

8.4 Logistic Regression with in.out_usa and budget and genre

model_glm <- glm (in.out_usa ~ budget  + genre, data=d.movies_g, family = binomial)

tab_model(model_glm)

	in.out_usa
Predictors	Odds Ratios	CI	p
(Intercept)	3.32	2.70 – 4.09	<0.001
budget	1.00	1.00 – 1.00	<0.001
genre [Adventure]	0.58	0.42 – 0.82	0.002
genre [Animation]	0.87	0.60 – 1.27	0.463
genre [Biography]	0.53	0.38 – 0.75	<0.001
genre [Comedy]	2.29	1.75 – 3.01	<0.001
genre [Crime]	1.07	0.74 – 1.57	0.721
genre [Drama]	0.85	0.65 – 1.12	0.245
genre [Family]	152496.04	0.00 – NA	0.974
genre [Fantasy]	3.80	1.12 – 23.75	0.070
genre [Horror]	1.43	0.95 – 2.21	0.092
genre [Mystery]	1.11	0.28 – 7.32	0.898
genre [Romance]	191629.79	0.00 – NA	0.974
genre [Sci-Fi]	0.28	0.01 – 7.07	0.367
genre [Thriller]	1.40	0.22 – 27.03	0.758
genre [Western]	0.00	NA – 89650844881930624488408800084464802466440.00	0.978
Observations	3684
R² Tjur	0.035

Interpretation of Coefficients

To analyze this model, tab_model function has been used in which coefficients are reported as Odds Ratios and they are multiplicative. For every increase in unit of budget, the odds of the in.out_usa is multiplied by 1.00, it increases by an order of 1.00.

Since Odds Ratios can only be 0 to infinity, there are no negative values, if we multiply by the coefficient 0.58, the odds decrease and there is a negative effect of genreAdventure.

With regards to the R squared, it is very low at 0.03 and it translates to the fact that the independent variables does not explain much in the variation of the dependent variable. Even with a low R squared, we still have a significant effect for budget, Adventure, Biography and Comedy genres.

8.5 Logistic Regression with Budget and Runtime

The in.out_usa dependent variable fitted with the Binomial glm against the budget and score predictor variables. This will be compared using ANOVA at the end with the mymodel which was analyzed earlier. No interpretation of this model will be done.

# Multiple logistic regression model with two continuous predictor variables with interaction

fit1=glm(in.out_usa~budget*runtime, data=d.movies_g,family=binomial)
summary(fit1)

## 
## Call:
## glm(formula = in.out_usa ~ budget * runtime, family = binomial, 
##     data = d.movies_g)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     3.151e+00  3.413e-01   9.233  < 2e-16 ***
## budget          9.033e-09  5.686e-09   1.589    0.112    
## runtime        -1.713e-02  3.079e-03  -5.565 2.63e-08 ***
## budget:runtime -2.055e-11  4.492e-11  -0.457    0.647    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3469.8  on 3683  degrees of freedom
## Residual deviance: 3400.5  on 3680  degrees of freedom
## AIC: 3408.5
## 
## Number of Fisher Scoring iterations: 4

ggPredict(fit1,interactive=TRUE,colorn=100,jitter=FALSE)

Binomial Model in.out_usa against Budget/Runtime

The Residual deviance is not bigger than the degrees of freedom so there is no Overdispersion. Runtime is statistically significant as its p value is less than 0.05.

8.6 Logistic Regression with Budget and Score

#  Multiple logistic regression model with two continuous predictor variables

fit2=glm(in.out_usa~budget*score,data=d.movies_g,family=binomial)
#summary(fit2)
ggPredict(fit2,interactive=TRUE,colorn=100,jitter=FALSE)

Binomial Model in.out_usa against Budget/Score

8.7 Comparision of all models

Anova is used to compare all the models used to determine the best fitted model.

d.movies_g$in.out_usa <- as.factor(d.movies_g$in.out_usa)
fit1=glm(in.out_usa~budget+runtime, data=d.movies_g,family=binomial)
fit2=glm(in.out_usa~budget+score,data=d.movies_g,family=binomial)
mymodel <- glm (in.out_usa ~ budget  + profit + score, data=d.movies_g, family = binomial)
anova (mymodel, fit1, fit2, test ="Chisq")

#anova (mymodel, fit1, fit2, test ="LRT")

8.8 Conclusion of Binomial Regression

The best fitted model is model2 which predicts movie location (whether US or non US) with only 2 predictor variables, budget and runtime.

9 Support Vector Machine (SVM) Modelling

Leading student: Eduardo Pacheco

9.1 Preparing the data

After trying different way to fit a SVM model to d.movies data set, we thought a better approach was to implement the SVM model to two different subsets of the data. One with more overlapping (Model 2) than the other(Model 1), and compare the performance and suitability to the data.

# Subset 1
d.movies.svm1 <- d.movies %>%
  filter(genre == "Drama" | genre == "Horror")
str(d.movies.svm1$genre)

##  chr [1:724] "Drama" "Horror" "Drama" "Drama" "Horror" "Horror" "Horror" ...

# Subset 2
d.movies.svm2 <- d.movies %>%
  filter(genre == "Action" | genre == "Comedy")
str(d.movies.svm2$genre)

##  chr [1:2031] "Action" "Comedy" "Comedy" "Action" "Action" "Action" ...

To get a first impression of what we are dealing with, we visualize both data subsets.

plot1 <- d.movies.svm1 %>%
ggplot(aes(x = score, y = profit, color = genre)) +
geom_point() + ggtitle("Subset 1") + theme(plot.title = element_text(hjust = 0.5))

plot2 <- d.movies.svm2 %>%
ggplot(aes(x = score, y = profit, color = genre)) +
geom_point() + ggtitle("Subset 2") + theme(plot.title = element_text(hjust = 0.5))

grid.arrange(plot1, plot2, ncol=2, nrow=1)

At first sight it is quite encouraging, as it looks feasible to classify subset 1, on which we will focus our analysis.
Visually we can already observe which one of these four genres reaches the higher profit level, genre ‘Action’.

9.2 Data preparation for training

In this initial part we split the data and take just 85% of observations from the data set to train the model.

In an attempt to perform a non-linear classification for both genres ‘Drama’ and ‘Horror’, we create some variable to access the data.

train1 <- d.movies.svm1 %>%   slice(indices1)
test_in1 <- d.movies.svm1 %>%   slice(-indices1) %>%   dplyr::select(-genre)
test_truth1 <- d.movies.svm1 %>%   slice(-indices1) %>%   pull(genre)

train2 <- d.movies.svm2 %>%   slice(indices2)
test_in2 <- d.movies.svm2 %>%   slice(-indices2) %>%   dplyr::select(-genre)
test_truth2 <- d.movies.svm2 %>%   slice(-indices2) %>%   pull(genre)

9.3 Training the SVM model

We fit a radial-based SVM model with a low penalty to misclassified values:

set.seed(123)
# Model 1
svm.movies1 <- svm(as.factor(genre) ~ score+votes+budget+gross+runtime+profit,
                  train1, kernel = "radial", scale = TRUE, cost = 5)
summary(svm.movies1)

## 
## Call:
## svm(formula = as.factor(genre) ~ score + votes + budget + gross + 
##     runtime + profit, data = train1, kernel = "radial", cost = 5, 
##     scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  5 
## 
## Number of Support Vectors:  249
## 
##  ( 117 132 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Drama Horror

# Model 2
svm.movies2 <- svm(as.factor(genre) ~ score+votes+budget+gross+runtime+profit,
                  train2, kernel = "radial", scale = TRUE, cost = 5)

We continue to plot the model 1 with the train data:

plot(svm.movies1, train1,  profit ~ score)

Analyzing the classification plot we can notice how much both clusters overlap with each other. We consider that it would not make sense to define a point to slice the plot to group each cluster.

9.4 Making Predictions

In order to put our model to the test, we will generate some predicted values.

# For model 1
test_pred1 <- predict(svm.movies1, test_in1)
table(test_pred1)

## test_pred1
##  Drama Horror 
##     79     29

# For model 2
test_pred2 <- predict(svm.movies2, test_in2)
table(test_pred2)

## test_pred2
## Action Comedy 
##     94    210

9.5 Evaluating the results

Now that we have some predicted values, we continue to compare the accuracy of the prediction by comparing them to the test_truth variable.

conf_matrix1 <- confusionMatrix(table(test_pred1, test_truth1))
conf_matrix1

## Confusion Matrix and Statistics
## 
##           test_truth1
## test_pred1 Drama Horror
##     Drama     70      9
##     Horror     8     21
##                                         
##                Accuracy : 0.8426        
##                  95% CI : (0.76, 0.9055)
##     No Information Rate : 0.7222        
##     P-Value [Acc > NIR] : 0.002431      
##                                         
##                   Kappa : 0.6036        
##                                         
##  Mcnemar's Test P-Value : 1.000000      
##                                         
##             Sensitivity : 0.8974        
##             Specificity : 0.7000        
##          Pos Pred Value : 0.8861        
##          Neg Pred Value : 0.7241        
##              Prevalence : 0.7222        
##          Detection Rate : 0.6481        
##    Detection Prevalence : 0.7315        
##       Balanced Accuracy : 0.7987        
##                                         
##        'Positive' Class : Drama         
##

conf_matrix2 <- confusionMatrix(table(test_pred2, test_truth2))
conf_matrix2

## Confusion Matrix and Statistics
## 
##           test_truth2
## test_pred2 Action Comedy
##     Action     79     15
##     Comedy     75    135
##                                           
##                Accuracy : 0.7039          
##                  95% CI : (0.6492, 0.7547)
##     No Information Rate : 0.5066          
##     P-Value [Acc > NIR] : 2.124e-12       
##                                           
##                   Kappa : 0.4109          
##                                           
##  Mcnemar's Test P-Value : 4.999e-10       
##                                           
##             Sensitivity : 0.5130          
##             Specificity : 0.9000          
##          Pos Pred Value : 0.8404          
##          Neg Pred Value : 0.6429          
##              Prevalence : 0.5066          
##          Detection Rate : 0.2599          
##    Detection Prevalence : 0.3092          
##       Balanced Accuracy : 0.7065          
##                                           
##        'Positive' Class : Action          
##

Conclusions

By observing the Classification Plot we cannot define visually the optimal hyperplane. However, after testing and evaluate the prediction against the real observations we can see that our model 1 still performs a decent 84% of accuracy with a Kappa = 0.60. Given the fact that the data is not ideal to fit a SVM model, we consider the performance as acceptable.

Nevertheless, SVM has proved to still do a great job when two clusters overlap a lot, e.g. with model2: Action ~ Comedy (which represent 54% of the sample data), where the model yields a decent 70% accuracy, but lower fit with Kappa=0.41.

10 Neural Network

Leading student: Simon Kohler

To make the best possible use of Neural networks there will both be an attempt to solve a classification problem as well as a regression problem.

10.1 Classification

The goal of the classification is to see if it is possible to predict which movies are successes. For this to work, the movies in data set will be split into two different categories. First the failures, which includes movies which had a bigger budget than its profit. Second,the group of movies which made at least its budget in gross.

# Add a categorical variable
d.movies_raw03 <- d.movies_raw03 %>% mutate(success = 
       as.factor(ifelse
          (gross < budget, 'Failure','Success')))

# Create subset for neural network 
myvars <- c("year", "score", "votes", "runtime", "success")
#myvars <- c("year", "score", "runtime", "success")
d.movies_subs <- d.movies_raw03[myvars]

# Prepare the data for training
set.seed(123)
indices <- createDataPartition(d.movies_subs$success, p=.85, list = F)

# Variables to access Data
train <- d.movies_subs %>%
  slice(indices)
test_in <- d.movies_subs %>%
  slice(-indices) %>%
  dplyr::select(-success)
test_truth <- d.movies_subs %>%
  slice(-indices) %>%
  pull(success)

The next step is to train the neural network. For this attempt there will be 3 hidden layers with the first one containing 3 neurons and the second layer containing 3 neurons.

set.seed(123)
# 2 hidden layers containing 3,3 neurons
movies_net <- neuralnet(success ~ ., train, hidden = c(3,3),linear.output = F)

plot(movies_net, rep="best")

The trained neural network can now be used to make the actual predictions.

test_results <- neuralnet::compute(movies_net, test_in)
test_results$net.result

##             [,1]      [,2]
##   [1,] 0.2476773 0.7523233
##   [2,] 0.2476773 0.7523233
##   [3,] 0.2476773 0.7523233
##   [4,] 0.6975282 0.3023907
##   [5,] 0.6975282 0.3023907
##  [ reached getOption("max.print") -- omitted 809 rows ]

# Find class (i.e. output neuron) with the highest probability and convert this back into a factor
test_pred <- apply(test_results$net.result, 1, which.max)
test_pred <- factor(levels(test_truth)[test_pred], levels = levels(test_truth))
test_pred

##  [1] Success Success Success Failure Failure Failure Success Success Success
## [10] Success
##  [ reached getOption("max.print") -- omitted 804 entries ]
## Levels: Failure Success

As one can see the network mostly predicts successes with very few “Failure” predictions. This is not optimal, however as there are no further useful predictors in the data set available, the same neural network will be used for further analysis.

It is also possible to further evaluate this result.

conf_matrix <- confusionMatrix(test_truth, test_pred)
conf_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Failure Success
##    Failure      94     168
##    Success      50     502
##                                           
##                Accuracy : 0.7322          
##                  95% CI : (0.7003, 0.7623)
##     No Information Rate : 0.8231          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3042          
##                                           
##  Mcnemar's Test P-Value : 2.295e-15       
##                                           
##             Sensitivity : 0.6528          
##             Specificity : 0.7493          
##          Pos Pred Value : 0.3588          
##          Neg Pred Value : 0.9094          
##              Prevalence : 0.1769          
##          Detection Rate : 0.1155          
##    Detection Prevalence : 0.3219          
##       Balanced Accuracy : 0.7010          
##                                           
##        'Positive' Class : Failure         
##

The neural network has an accuracy of about 73.22. This result is not really satisfying especially as just predicting Success for every movie would result in a better outcome.

11 Overall Conclusions

Fitting a linear model or multiple linear regression model is always a good practice to have as a start point and reference. It was useful to identify from the start, which predictors and interactions were statistically significant as the linear model did not sufficiently explain the variance of the response variable. These early insights provided the guidelines to more accurate decision making, on the response variable and approach to select in the implementation of the machine learning models.

In order to answer our research question of “Which predictors have a statistical significance in increasing profit?” it was necessary to determine whether Multicollinearity occured, which is whether other independent variables in the different regression models were correlated. With regards to the linear and non linear models conclusion, the GAM model was not found to be the best predictor for the score. Our quasipoisson regression model displayed Overdispersion when predicting runtime for movies performance against budget, location and score predictor variables. The logistic regression model seemed to perform best when predicting movie location based on budget and runtime. Once we learnt all the predictive modelling techniques in class, we realized where our data set was lacking.

SVM Insights:
After trying to fit SVMs with different combinations of genre groups, we chose to focus on the one which yielded the highest accuracy. This is due to the the fact that it overlaps the least. In this regard, we can conclude that SVMs are a great classifier for situations where the groups are clearly separated. The more the clusters overlap, the more the accuracy and goodness of fitting decrease.

We can clearly observe this phenomenon by comparing accuracy and kappa for both models:

MODEL	OVERLAPPING	ACCURACY	KAPPA
Model 1	Low	84.26%	0.6036
Model 2	High	70.39%	0.4109

In general we consider the implemented SVM model and analysis, to be a useful tool to quantify an estimated profitability in function of the genres.

ANN Insights:
We tried to create a Neural Network for classification purposes. The goal of the network was to predict if a movie would be a success (gross >= budget) or a failure (budget > gross). Knowing the latter, after testing different models with two hidden layers and several combinations neurons, we identified the presented model as the one with the best performance. We noticed that as one increases the number of neurons in the layers the accuracy and kappa value increases, but gradually they got stuck around 75% and 0.40 respectively. On the other hand the Mcnemar’s Test P-Value tends to worsen, perhaps the model tries to learn too many details in the training data along with the noise from the training data, slowly falling into over fitting. To further improve the neural network further predictors would be needed.

In hindsight, we would have picked a more suitable data set with more categorical, continuous, nominal, binary, etc. data set. However the pains to build our models were worth the knowledge we gained.

IMDB Movies Analysis

Applied Machine Learning and Predictive Modelling

Caroline Wanyonyi, Simon Kohler and Eduardo Pacheco

2024-08-11

1 Introduction

1.1 Research Question

1.2 What are the goals of the analysis

1.3 Data Source

2 Data Preparation and Cleaning

2.1 Dealing with missing values (NaN)

3 Data Enrichment

3.1 Adding a binary variable

3.2 Adding a profit percentage related to budget

3.3 Modifying the profit variable

3.4 Adding Released Month

4 Visualizing the Data

5 Fitting Statistical Models

5.1 Linear Model

5.2 Multiple Linear Regression Models

6 Generalized Additive Model

6.1 Fitting a GAM Regression Model

6.2 GAM Model Check and Summary

7 Poisson Regression

7.1 Fitting Poisson Regression Model

7.2 Poisson Regression Model Analysis

7.3 Dispersion Test

7.4 Compare both models

7.5 Table of Predictor of Runtime

7.6 Poisson Regression Model Visualization

8 Binomial Regression

8.1 Interpretation of the Coefficients of the Binomial Model

8.2 Analysis of Confusion Matrix

8.3 Logistic regression model with continuous predictor

8.4 Logistic Regression with in.out_usa and budget and genre

8.5 Logistic Regression with Budget and Runtime

8.6 Logistic Regression with Budget and Score

8.7 Comparision of all models

8.8 Conclusion of Binomial Regression

9 Support Vector Machine (SVM) Modelling

9.1 Preparing the data

9.2 Data preparation for training

9.3 Training the SVM model

9.4 Making Predictions

9.5 Evaluating the results

10 Neural Network

10.1 Classification

11 Overall Conclusions