We are going to work with a movies data set. The Top 5000 movies on IMDB is from the following link: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset. You can download the data set from Canvas as well.
movies <- read.csv("movie_metadata.csv")
The code below will filter out films with missing budget and gross and unreasonably large budgets. Also, use the code to create new variables that are simplified versions of the budget variables. Also split your data into testing and training sets. We will use set.seed(310) to ensure your results are comparable to mine.
#library(tidyverse)
# removing missing values of budget and gross
movies <- movies[!is.na(movies$budget),]
movies <- movies[!is.na(movies$gross),]
# removing empty content rating or not rated
movies <- movies[(movies$content_rating != "" & movies$content_rating != "Not Rated"), ]
# removing movies with budget > 400M
movies <- movies[movies$budget<4e+8,]
# simplifying variables
movies$grossM <- movies$gross/1e+6
movies$budgetM <- movies$budget/1e+6
movies$profitM <- movies$grossM-movies$budgetM
# creating new column `rating_simple` using `fct_lump` (from `tidyverse` package)
# to pick 4 major levels and lump all other levels into "Other".
movies$rating_simple <- fct_lump(movies$content_rating, n = 4)
# creating train and test sets
set.seed(310)
train_indx <- sample(1:nrow(movies), 0.8 * nrow(movies), replace=FALSE)
movies_train <- movies[train_indx, ]
movies_test <- movies[-train_indx, ]
We can estimate the linear model using the lm command with grossM (movie revenue) on the left hand side, and imdb_score (movie rating) and budgetM (movie budget) on the right-hand side. Make sure to estimate on the training set. Will will use the summary command to show the summary of your model.
model1 <- lm(grossM ~ imdb_score + budgetM, movies_train)
summary(model1)
##
## Call:
## lm(formula = grossM ~ imdb_score + budgetM, data = movies_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -403.15 -26.68 -9.59 16.19 481.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -75.50080 6.12502 -12.33 <2e-16 ***
## imdb_score 13.70041 0.93185 14.70 <2e-16 ***
## budgetM 1.03872 0.02235 46.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.86 on 3026 degrees of freedom
## Multiple R-squared: 0.4464, Adjusted R-squared: 0.446
## F-statistic: 1220 on 2 and 3026 DF, p-value: < 2.2e-16
The following is an interpretation the coefficient on budgetM (movie budget), holding fixed the coefficient imdb_score (rating score of the movie). Does spending more money on movies seem to have a net positive return on movie gross?
Spending 1 unit more on movies causes the return on movie gross to increase by 1.03872. Spending more money on movies DOES have a positive return on movies gross. However, the magnitude of the 1 matters. For example if 1 is 1 million, spending an additional 1 million will increase the return by only 3.8% of 1 million or 38,000. This might imply it might not be worth it to spend more on movies.
We will now estimate a linear model using the lm command with grossM (movie revenue) on the left hand side, and imdb_score (movie rating), budgetM (movie budget) and the square of budgetM (movie budget) as independent variables. Make sure to estimate on the training set. Use the summary command to show the summary of your model.
model2 <- lm(grossM ~ budgetM + I(budgetM^2) + imdb_score, movies_train)
summary(model2)
##
## Call:
## lm(formula = grossM ~ budgetM + I(budgetM^2) + imdb_score, data = movies_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -350.10 -26.41 -9.43 16.03 492.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.874e+01 6.301e+00 -12.497 <2e-16 ***
## budgetM 1.144e+00 5.349e-02 21.394 <2e-16 ***
## I(budgetM^2) -6.060e-04 2.791e-04 -2.171 0.03 *
## imdb_score 1.389e+01 9.353e-01 14.849 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.83 on 3025 degrees of freedom
## Multiple R-squared: 0.4472, Adjusted R-squared: 0.4467
## F-statistic: 815.8 on 3 and 3025 DF, p-value: < 2.2e-16
Now we will investigate the marginal impact of budget for different levels of budget. Use the margins command to calculate the marginal impact of an additional dollar of budget at budget levels of 25, 50, 75, 90, 100, 200, and 300 million. For which levels does it make sense to increase your movie’s budget?
#library(margins)
(m1 <- margins(model1, at = list(budgetM=c(25, 50, 75, 90, 100, 200, 300))))
## Average marginal effects at specified values
## lm(formula = grossM ~ imdb_score + budgetM, data = movies_train)
## at(budgetM) imdb_score budgetM
## 25 13.7 1.039
## 50 13.7 1.039
## 75 13.7 1.039
## 90 13.7 1.039
## 100 13.7 1.039
## 200 13.7 1.039
## 300 13.7 1.039
(m2 <- margins(model2, at = list (budgetM = c(25, 50, 75, 90, 100, 200, 300))))
## Average marginal effects at specified values
## lm(formula = grossM ~ budgetM + I(budgetM^2) + imdb_score, data = movies_train)
## at(budgetM) budgetM imdb_score
## 25 1.1139 13.89
## 50 1.0836 13.89
## 75 1.0533 13.89
## 90 1.0352 13.89
## 100 1.0230 13.89
## 200 0.9019 13.89
## 300 0.7807 13.89
This tells marginal effect of budget on gross return. For model2, at levels 25,50,75,90 and 100 you would be making a profit and not losing money. However, for model1 you will always be making a profit.
Using the movies data and estimate a third model that will predict gross movie revenue using imdb_score (movie rating), budgetM (movie budget), the square of budgetM and ratings_simple (type of movie R,G,PG-13,etc.) as independent variables. Using the relevel command to change the reference category of ratings to “R”. Print the summary of this regression table.
model3 <- lm(grossM ~ imdb_score + budgetM + I(budgetM^2) + relevel(rating_simple, ref = "R"), movies_train)
summary(model3)
##
## Call:
## lm(formula = grossM ~ imdb_score + budgetM + I(budgetM^2) + relevel(rating_simple,
## ref = "R"), data = movies_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -344.81 -26.27 -8.08 16.63 491.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.517e+01 6.471e+00 -14.706 < 2e-16
## imdb_score 1.543e+01 9.396e-01 16.424 < 2e-16
## budgetM 1.012e+00 5.486e-02 18.444 < 2e-16
## I(budgetM^2) -2.585e-04 2.783e-04 -0.929 0.353
## relevel(rating_simple, ref = "R")G 2.976e+01 6.410e+00 4.643 3.58e-06
## relevel(rating_simple, ref = "R")PG 2.437e+01 2.966e+00 8.217 3.05e-16
## relevel(rating_simple, ref = "R")PG-13 1.678e+01 2.307e+00 7.275 4.38e-13
## relevel(rating_simple, ref = "R")Other 3.897e+00 7.757e+00 0.502 0.615
##
## (Intercept) ***
## imdb_score ***
## budgetM ***
## I(budgetM^2)
## relevel(rating_simple, ref = "R")G ***
## relevel(rating_simple, ref = "R")PG ***
## relevel(rating_simple, ref = "R")PG-13 ***
## relevel(rating_simple, ref = "R")Other
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.03 on 3021 degrees of freedom
## Multiple R-squared: 0.4642, Adjusted R-squared: 0.463
## F-statistic: 374 on 7 and 3021 DF, p-value: < 2.2e-16
The coefficient for g-rated movies is 29.7625452. Which means that every g rated movie increases profit by 29.7625452.
Using the predict function we will generate the predictions in the test and training set.
preds1 <- predict(model3)
preds2 <- predict( model3 , newdata = movies_test)
Generate residuals for test and training. Note that residuals is the difference between true and predicted outcome (grossM).
preds_train1 <- data.frame(true = movies_train$grossM,
pred = preds1,
resid = model3$residuals)
preds_test1<- data.frame( true = movies_test$grossM,
pred = preds2)
Plot the residuals against the predicted values in the test and training sets. Do our errors appear homoskedastic or heteroskedastic?
#library(ggplot2)
ggplot(data=preds_train1, aes(x=pred, y=resid)) +
geom_point() + geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
preds_test1$test_resid <- preds_test1$true - preds_test1$pred
Residplot2 <- ggplot(aes(x = pred, y = test_resid), data = preds_test1) + geom_point()+ geom_smooth()
Residplot2
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We seem to have a heteroskedasticity error because we don’t have constant variance of error. When heteroscedasticity is present in a regression analysis, the results of the analysis become hard to trust. We need to add some independent variables to the model by either transforming the dependent variable, redefining the dependent variable, or using weighted regression.
#training
Residplot3 <- ggplot(aes(x = true, y = pred), data = preds_train1) + geom_point() + geom_hline(yintercept = 0) + geom_smooth()
Residplot3
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#test
Residplot4 <- ggplot(aes(x = true, y = pred), data = preds_test1) + geom_point() + geom_hline(yintercept = 0) + geom_smooth()
Residplot4
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Use the function below and the RMSE function in the package caret to calculate in-sample and out-of-sample RMSE. Is the model over fit? And if so, how do we know?
#library(caret)
# training error
RMSE(preds_train1$pred, preds_train1$true)
## [1] 52.96042
# test error
RMSE(preds_test1$pred, preds_test1$true)
## [1] 44.68023
The RMSE was 8 points lower for the test data. Since the error was not higher then the training data RMSE then I would say that the model is not overfitted.