Does Increasing a Movie’s Budget Ever Pay Off?

\(\underline{Data \space Set}\)

We are going to work with a movies data set. The Top 5000 movies on IMDB is from the following link: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset. You can download the data set from Canvas as well.

movies <- read.csv("movie_metadata.csv")

\(\underline{Step \space 1: \space Cleaning \space and \space Data \space Preparation}\)

The code below will filter out films with missing budget and gross and unreasonably large budgets. Also, use the code to create new variables that are simplified versions of the budget variables. Also split your data into testing and training sets. We will use set.seed(310) to ensure your results are comparable to mine.

#library(tidyverse)
# removing missing values of budget and gross
movies <- movies[!is.na(movies$budget),] 
movies <- movies[!is.na(movies$gross),]
# removing empty content rating or not rated
movies <- movies[(movies$content_rating != "" & movies$content_rating != "Not Rated"), ]
# removing movies with budget > 400M
movies <- movies[movies$budget<4e+8,]
# simplifying variables
movies$grossM <- movies$gross/1e+6 
movies$budgetM <- movies$budget/1e+6 
movies$profitM <- movies$grossM-movies$budgetM
# creating new column `rating_simple` using `fct_lump` (from `tidyverse` package)
# to pick 4 major levels and lump all other levels into "Other".
movies$rating_simple <- fct_lump(movies$content_rating, n = 4)
# creating train and test sets
set.seed(310)
train_indx <- sample(1:nrow(movies), 0.8 * nrow(movies), replace=FALSE) 
movies_train <- movies[train_indx, ]
movies_test <- movies[-train_indx, ]

\(\underline{Step \space 2:Estimating \space the \space Linear \space Model}\)

We can estimate the linear model using the lm command with grossM (movie revenue) on the left hand side, and imdb_score (movie rating) and budgetM (movie budget) on the right-hand side. Make sure to estimate on the training set. Will will use the summary command to show the summary of your model.

model1 <- lm(grossM ~ imdb_score + budgetM, movies_train)
summary(model1)
## 
## Call:
## lm(formula = grossM ~ imdb_score + budgetM, data = movies_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -403.15  -26.68   -9.59   16.19  481.60 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -75.50080    6.12502  -12.33   <2e-16 ***
## imdb_score   13.70041    0.93185   14.70   <2e-16 ***
## budgetM       1.03872    0.02235   46.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.86 on 3026 degrees of freedom
## Multiple R-squared:  0.4464, Adjusted R-squared:  0.446 
## F-statistic:  1220 on 2 and 3026 DF,  p-value: < 2.2e-16

\(\underline{Step \space 3:\space Interpreting \space Results}\)

The following is an interpretation the coefficient on budgetM (movie budget), holding fixed the coefficient imdb_score (rating score of the movie). Does spending more money on movies seem to have a net positive return on movie gross?

\(Interpertation:\)

Spending 1 unit more on movies causes the return on movie gross to increase by 1.03872. Spending more money on movies DOES have a positive return on movies gross. However, the magnitude of the 1 matters. For example if 1 is 1 million, spending an additional 1 million will increase the return by only 3.8% of 1 million or 38,000. This might imply it might not be worth it to spend more on movies.

\(\underline{Step \space 4: \space Prepare \space a \space Second \space Model}\)

Reanalizing with the movie budget adding the square of budget as an indempendent vaiarable:

We will now estimate a linear model using the lm command with grossM (movie revenue) on the left hand side, and imdb_score (movie rating), budgetM (movie budget) and the square of budgetM (movie budget) as independent variables. Make sure to estimate on the training set. Use the summary command to show the summary of your model.

model2 <- lm(grossM ~ budgetM + I(budgetM^2) + imdb_score, movies_train)
summary(model2)
## 
## Call:
## lm(formula = grossM ~ budgetM + I(budgetM^2) + imdb_score, data = movies_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -350.10  -26.41   -9.43   16.03  492.38 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.874e+01  6.301e+00 -12.497   <2e-16 ***
## budgetM       1.144e+00  5.349e-02  21.394   <2e-16 ***
## I(budgetM^2) -6.060e-04  2.791e-04  -2.171     0.03 *  
## imdb_score    1.389e+01  9.353e-01  14.849   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.83 on 3025 degrees of freedom
## Multiple R-squared:  0.4472, Adjusted R-squared:  0.4467 
## F-statistic: 815.8 on 3 and 3025 DF,  p-value: < 2.2e-16

Now we will investigate the marginal impact of budget for different levels of budget. Use the margins command to calculate the marginal impact of an additional dollar of budget at budget levels of 25, 50, 75, 90, 100, 200, and 300 million. For which levels does it make sense to increase your movie’s budget?

#library(margins)
(m1 <- margins(model1, at = list(budgetM=c(25, 50, 75, 90, 100, 200, 300))))
## Average marginal effects at specified values
## lm(formula = grossM ~ imdb_score + budgetM, data = movies_train)
##  at(budgetM) imdb_score budgetM
##           25       13.7   1.039
##           50       13.7   1.039
##           75       13.7   1.039
##           90       13.7   1.039
##          100       13.7   1.039
##          200       13.7   1.039
##          300       13.7   1.039
(m2 <- margins(model2, at = list (budgetM = c(25, 50, 75, 90, 100, 200, 300))))
## Average marginal effects at specified values
## lm(formula = grossM ~ budgetM + I(budgetM^2) + imdb_score, data = movies_train)
##  at(budgetM) budgetM imdb_score
##           25  1.1139      13.89
##           50  1.0836      13.89
##           75  1.0533      13.89
##           90  1.0352      13.89
##          100  1.0230      13.89
##          200  0.9019      13.89
##          300  0.7807      13.89
\(Interpertation:\)

This tells marginal effect of budget on gross return. For model2, at levels 25,50,75,90 and 100 you would be making a profit and not losing money. However, for model1 you will always be making a profit.

Movie Residuals and Predicted Values

\(\underline{Step \space 1: Building \space a \space third \space model}\)

Using the movies data and estimate a third model that will predict gross movie revenue using imdb_score (movie rating), budgetM (movie budget), the square of budgetM and ratings_simple (type of movie R,G,PG-13,etc.) as independent variables. Using the relevel command to change the reference category of ratings to “R”. Print the summary of this regression table.

model3 <- lm(grossM ~ imdb_score + budgetM + I(budgetM^2) + relevel(rating_simple, ref = "R"), movies_train)
summary(model3)
## 
## Call:
## lm(formula = grossM ~ imdb_score + budgetM + I(budgetM^2) + relevel(rating_simple, 
##     ref = "R"), data = movies_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -344.81  -26.27   -8.08   16.63  491.67 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -9.517e+01  6.471e+00 -14.706  < 2e-16
## imdb_score                              1.543e+01  9.396e-01  16.424  < 2e-16
## budgetM                                 1.012e+00  5.486e-02  18.444  < 2e-16
## I(budgetM^2)                           -2.585e-04  2.783e-04  -0.929    0.353
## relevel(rating_simple, ref = "R")G      2.976e+01  6.410e+00   4.643 3.58e-06
## relevel(rating_simple, ref = "R")PG     2.437e+01  2.966e+00   8.217 3.05e-16
## relevel(rating_simple, ref = "R")PG-13  1.678e+01  2.307e+00   7.275 4.38e-13
## relevel(rating_simple, ref = "R")Other  3.897e+00  7.757e+00   0.502    0.615
##                                           
## (Intercept)                            ***
## imdb_score                             ***
## budgetM                                ***
## I(budgetM^2)                              
## relevel(rating_simple, ref = "R")G     ***
## relevel(rating_simple, ref = "R")PG    ***
## relevel(rating_simple, ref = "R")PG-13 ***
## relevel(rating_simple, ref = "R")Other    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.03 on 3021 degrees of freedom
## Multiple R-squared:  0.4642, Adjusted R-squared:  0.463 
## F-statistic:   374 on 7 and 3021 DF,  p-value: < 2.2e-16
\(Interpertation of the coefficent on a movie ratee G.\)

The coefficient for g-rated movies is 29.7625452. Which means that every g rated movie increases profit by 29.7625452.

\(\underline{Step \space 2: Predicting \space the \space data}\)

Using the predict function we will generate the predictions in the test and training set.

preds1 <- predict(model3)
preds2 <- predict( model3 , newdata  = movies_test)

\(\underline{Step \space 3: Generating \space the \space residuals \space for \space the \space test \space and \space training \space set}\)

Generate residuals for test and training. Note that residuals is the difference between true and predicted outcome (grossM).

preds_train1 <- data.frame(true = movies_train$grossM,
                           pred = preds1,
                           resid = model3$residuals)
preds_test1<- data.frame( true  = movies_test$grossM,
                          pred = preds2)

$

Plot the residuals against the predicted values in the test and training sets. Do our errors appear homoskedastic or heteroskedastic?

#library(ggplot2)
ggplot(data=preds_train1, aes(x=pred, y=resid)) +
  geom_point() + geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

preds_test1$test_resid <- preds_test1$true - preds_test1$pred
Residplot2 <- ggplot(aes(x = pred, y = test_resid), data = preds_test1) + geom_point()+ geom_smooth()
Residplot2
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

\(Interpertation:\)

We seem to have a heteroskedasticity error because we don’t have constant variance of error. When heteroscedasticity is present in a regression analysis, the results of the analysis become hard to trust. We need to add some independent variables to the model by either transforming the dependent variable, redefining the dependent variable, or using weighted regression.

\(\underline{Step \space 5: Plot \space the \space predicted \space versus \space true \space in \space the \space test \space and \space training \space set}\)

#training
Residplot3 <- ggplot(aes(x = true, y = pred), data = preds_train1) + geom_point() + geom_hline(yintercept = 0) + geom_smooth()
Residplot3
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

#test 
Residplot4 <- ggplot(aes(x = true, y = pred), data = preds_test1) + geom_point() + geom_hline(yintercept = 0) + geom_smooth()
Residplot4
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

\(\underline{Step \space 6: Is \space the \space model \space overfit?}\)

Use the function below and the RMSE function in the package caret to calculate in-sample and out-of-sample RMSE. Is the model over fit? And if so, how do we know?

#library(caret) 
# training error 
RMSE(preds_train1$pred, preds_train1$true)
## [1] 52.96042
# test error
RMSE(preds_test1$pred, preds_test1$true)
## [1] 44.68023
\(Interpertation:\)

The RMSE was 8 points lower for the test data. Since the error was not higher then the training data RMSE then I would say that the model is not overfitted.