Each year, box offices at movie theaters collect billions of dollars in revenue in the United States alone. In this problem, we seek to determine whether or not we can predict box office revenue based on different variables related to a movie.
We’ll use a dataset of 334 movies that were produced from 1953 to 2015. The dataset Movies.csv includes the following 24 variables:
Name = the name of the movie
Year = the year the movie was produced
Rated = the rating given to the movie by the MPAA
Runtime = the duration of the movie in minutes
Action = binary variable that takes value 1 if the movie is an action movie, 0 otherwise
Adventure, Crime, Drama, Thriller, Fantasy, Horror, Sci.Fi, Comedy, Family, Mystery, Romance, Animation, Music, History, Documentary are all defined like Action
Wins = number of awards won by the movie
Nominations = number of awards the movie was nominated for
Production.Budget = the natural logarithm of the production budget in dollars
Worldwide = the natural logarithm of the worldwide revenue in dollars
Load the dataset Movies.csv into R and call it “Movies”. In this problem, we will build a model to predict worldwide box office revenue for movies made in 2010-2015. Create a training set that consists of movies released before 2010 and a testing set that consists of movies released in 2010 and after.
Movies = read.csv('Movies.csv')
str(Movies)
## 'data.frame': 334 obs. of 24 variables:
## $ Name : Factor w/ 334 levels "2 Fast 2 Furious",..: 28 32 31 29 30 247 248 4 5 6 ...
## $ Year : int 1989 1992 1995 1997 2005 2008 2012 2002 2006 2014 ...
## $ Rated : Factor w/ 6 levels "Approved","G",..: 5 5 5 5 5 3 5 6 6 6 ...
## $ Runtime : int 126 126 121 125 140 152 165 113 117 102 ...
## $ Action : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Adventure : int 1 0 1 0 1 0 0 0 0 0 ...
## $ Crime : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Drama : int 0 0 0 0 0 1 0 0 0 1 ...
## $ Thriller : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Fantasy : int 0 0 1 0 0 0 0 0 1 1 ...
## $ Horror : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Sci.Fi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Comedy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Family : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Animation : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Music : int 0 0 0 0 0 0 0 0 0 0 ...
## $ History : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Documentary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Wins : int 10 2 9 5 15 134 43 10 12 0 ...
## $ Nominations : int 21 16 20 20 49 106 83 25 30 6 ...
## $ Production.Budget: num 18.1 18.9 19.1 19.2 19.1 ...
## $ Worldwide : num 20.6 20.1 20.3 19.9 20 ...
# training set that consists of movies released before 2010
train = subset(Movies, Year < 2010)
summary(train$Year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1953 1999 2003 2001 2006 2009
dim(train)
## [1] 248 24
# testing set that consists of movies released in 2010 and after
test = subset(Movies, Year >= 2010)
summary(test$Year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2010 2011 2012 2012 2013 2015
dim(test)
## [1] 86 24
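A chronological split can be illustrated on a toy data frame. The sketch below uses made-up rows (the `toy` data frame and its values are hypothetical, not taken from Movies.csv) and compares Year against the number 2010, so that every training row precedes every testing row.

```r
# Hypothetical toy data; not rows from Movies.csv
toy = data.frame(Name = c("A", "B", "C", "D", "E"),
                 Year = c(1995, 2009, 2010, 2012, 2003))
# chronological split: compare Year with a number, not a string
toy.train = subset(toy, Year < 2010)
toy.test = subset(toy, Year >= 2010)
# every row lands in exactly one set
nrow(toy.train) + nrow(toy.test) == nrow(toy)
# the latest training year precedes the earliest testing year
max(toy.train$Year) < min(toy.test$Year)
```

Comparing against a number avoids relying on R coercing Year to a character string, which gives the intended ordering only while every year has the same number of digits.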
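In this class, we have frequently used the sample.split function to randomly split our data. Why do we use a different approach here?
Because the goal is to predict revenue for movies released in 2010-2015 from earlier movies, the data is split by release year rather than at random.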
library(caTools)
table(sample.split(Movies$Worldwide, SplitRatio = 1/3))
##
## FALSE TRUE
## 223 111
table(sample.split(Movies$Worldwide, SplitRatio = 1/2))
##
## FALSE TRUE
## 167 167
Build a linear regression model to predict “Worldwide” based on all of the remaining variables, except for Name and Year. Use the training set to build the model. If your training set is called MoviesTrain, an easy way to do this is to pass the argument data = MoviesTrain[ , 3:ncol(MoviesTrain)] to the lm function.
# build the linear regression model
model.lm = lm(Worldwide ~ . , data=train[,!(names(train) %in% c('Year', 'Name')) ])
# R-squared
summary(model.lm)$r.squared
## [1] 0.5412771
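Dropping Name and Year by position or by name should select the same columns whenever those two are the first columns, as they are in this dataset. A minimal sketch on a hypothetical toy data frame (all values are made up):

```r
# Hypothetical toy data with Name and Year as the first two columns
toy = data.frame(Name = c("A", "B", "C"),
                 Year = c(2001, 2005, 2008),
                 Runtime = c(100, 120, 95),
                 Worldwide = c(18.2, 19.5, 17.9))
# selection by position, as suggested in the problem statement
by.position = toy[ , 3:ncol(toy)]
# selection by name, as used in the lm call above
by.name = toy[ , !(names(toy) %in% c('Year', 'Name'))]
# TRUE when both selections return the same data frame
identical(by.position, by.name)
```

Selecting by name keeps working if the column order changes, which is one reason to prefer it over fixed positions.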
In your linear regression model, which of the independent variables are significant at the p=0.05 level (at least one star)? For factor variables, consider the variable significant if at least one level is significant.
summary(model.lm)
##
## Call:
## lm(formula = Worldwide ~ ., data = train[, !(names(train) %in%
## c("Year", "Name"))])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8311 -0.3893 0.0809 0.3906 1.6326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.9976766 1.1580192 13.815 < 2e-16 ***
## RatedG 0.4102319 0.7041759 0.583 0.560774
## RatedN/A 0.5562304 0.7322003 0.760 0.448258
## RatedPG 0.7090507 0.6892058 1.029 0.304696
## RatedPG-13 0.6027509 0.7107462 0.848 0.397322
## RatedR 0.4127760 0.7210913 0.572 0.567608
## Runtime 0.0097079 0.0032359 3.000 0.003007 **
## Action 0.0172144 0.1333332 0.129 0.897389
## Adventure -0.2232011 0.1283551 -1.739 0.083432 .
## Crime -0.3315595 0.1473878 -2.250 0.025457 *
## Drama -0.2097674 0.1881216 -1.115 0.266029
## Thriller -0.1067814 0.1364400 -0.783 0.434682
## Fantasy 0.1634927 0.1406340 1.163 0.246264
## Horror -0.6173923 0.2056864 -3.002 0.002993 **
## Sci.Fi -0.0147251 0.1353512 -0.109 0.913466
## Comedy -0.1431461 0.1611241 -0.888 0.375276
## Family -0.3206547 0.1780467 -1.801 0.073067 .
## Mystery 0.1454495 0.1837735 0.791 0.429520
## Romance -0.0449227 0.2063239 -0.218 0.827840
## Animation 0.6116779 0.2085349 2.933 0.003706 **
## Music -0.1531571 0.6860353 -0.223 0.823546
## History -1.3823299 0.6876694 -2.010 0.045624 *
## Documentary -0.4480795 0.4830499 -0.928 0.354620
## Wins 0.0003704 0.0039940 0.093 0.926186
## Nominations 0.0159878 0.0043745 3.655 0.000321 ***
## Production.Budget 0.1104362 0.0550254 2.007 0.045962 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6431 on 222 degrees of freedom
## Multiple R-squared: 0.5413, Adjusted R-squared: 0.4896
## F-statistic: 10.48 on 25 and 222 DF, p-value: < 2.2e-16
# extract p-values
model.summary = summary(model.lm)$coefficients
#model.summary[order(model.summary[,4]),]
#independent variables significant at the p=0.05 level
sort(model.summary[,4]<0.05)
## RatedG RatedN/A RatedPG RatedPG-13
## FALSE FALSE FALSE FALSE
## RatedR Action Adventure Drama
## FALSE FALSE FALSE FALSE
## Thriller Fantasy Sci.Fi Comedy
## FALSE FALSE FALSE FALSE
## Family Mystery Romance Music
## FALSE FALSE FALSE FALSE
## Documentary Wins (Intercept) Runtime
## FALSE FALSE TRUE TRUE
## Crime Horror Animation History
## TRUE TRUE TRUE TRUE
## Nominations Production.Budget
## TRUE TRUE
barplot(sort(model.summary[model.summary[,4]<0.05,4]), main="independent variables Pr(>|t|) <0.05")
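The names of the significant coefficients can also be pulled out directly instead of being read off the sorted TRUE/FALSE vector. The sketch below is an illustration only: it uses mtcars, a dataset that ships with R, as a stand-in for the Movies data, and `fit`, `pvals` and `significant` are hypothetical names.

```r
# mtcars ships with R and stands in for the Movies data here
fit = lm(mpg ~ wt + hp, data = mtcars)
# column 4 of the coefficient matrix holds Pr(>|t|)
pvals = summary(fit)$coefficients[ , 4]
# names of coefficients with p < 0.05, dropping the intercept
significant = setdiff(names(pvals)[pvals < 0.05], "(Intercept)")
significant
```

For a factor variable such as Rated, the names returned are level-specific (for example RatedPG), so mapping them back to the variable is still a manual step.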
What is the correlation between Worldwide and Production.Budget in the training set?
cor(train$Worldwide, train$Production.Budget)
## [1] 0.4947683
Create a new linear regression model on the training set with only the significant variables you found in Problem 4 as the independent variables.
# build the reduced linear regression model
model.lm = lm(Worldwide ~ Runtime + Crime + Horror + Animation + History + Nominations + Production.Budget, data=train)
# plain summary
summary(model.lm)
##
## Call:
## lm(formula = Worldwide ~ Runtime + Crime + Horror + Animation +
## History + Nominations + Production.Budget, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87336 -0.36913 0.07701 0.37701 1.93620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.531741 0.803224 19.337 < 2e-16 ***
## Runtime 0.010095 0.002634 3.833 0.000162 ***
## Crime -0.293315 0.120701 -2.430 0.015829 *
## Horror -0.427419 0.141596 -3.019 0.002814 **
## Animation 0.465257 0.166530 2.794 0.005629 **
## History -1.556980 0.650418 -2.394 0.017443 *
## Nominations 0.017313 0.003010 5.751 2.68e-08 ***
## Production.Budget 0.152527 0.046708 3.266 0.001252 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6426 on 240 degrees of freedom
## Multiple R-squared: 0.5048, Adjusted R-squared: 0.4904
## F-statistic: 34.95 on 7 and 240 DF, p-value: < 2.2e-16
In the model from Problem 6, what is the coefficient for Animation in the linear regression?
0.465257 - correct
The coefficient for Runtime is 0.010095. What is the interpretation of this coefficient?
For an additional minute of Runtime, the prediction for the variable Worldwide increases by approximately 0.01 units. - correct
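Since Worldwide is defined as the natural logarithm of revenue, a change of 0.010095 in the prediction corresponds to a multiplicative change in predicted revenue. A short worked calculation, holding the other variables fixed (`coef.runtime`, `multiplier` and `pct.change` are illustrative names):

```r
# Runtime coefficient reported in the summary above
coef.runtime = 0.010095
# Worldwide is log revenue, so one extra minute multiplies predicted revenue by exp(coef)
multiplier = exp(coef.runtime)
# implied percentage change in predicted revenue per additional minute
pct.change = 100 * (multiplier - 1)
round(pct.change, 2)
```

In other words, each additional minute of Runtime is associated with roughly a 1% increase in predicted worldwide revenue.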
#Make predictions on the test set using the linear regression model.
pred.lm = predict(model.lm, newdata=test)
# the Sum of Squared Errors (SSE) on the test set
SSE = sum((pred.lm - test$Worldwide)^2)
#the Total Sum of Squares (SST) on the test set
SST = sum((mean(test$Worldwide) - test$Worldwide)^2)
#R-squared on the test set
Rsquared = 1-SSE/SST
print(paste("SSE=", SSE, "SST=", SST, "Rsquared=", Rsquared))
## [1] "SSE= 25.0005997807204 SST= 60.3587341024956 Rsquared= 0.585799799275666"
#plot the predicted Worldwide value vs the actual value
plot(pred.lm, test$Worldwide, main="Predicted vs actual Worldwide value")
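The R-squared calculation above can be wrapped in a small helper so it is easy to reuse on other predictions. This is a sketch under stated assumptions: `r.squared` is a hypothetical helper, the toy vectors are made up, and the baseline is the mean of the actual values, as in the SST above.

```r
# Hypothetical helper: R-squared from actual and predicted values,
# using the mean of the actual values as the baseline (as in the SST above)
r.squared = function(actual, predicted) {
  sse = sum((predicted - actual)^2)
  sst = sum((mean(actual) - actual)^2)
  1 - sse/sst
}
# made-up values for illustration only
actual = c(18.0, 19.0, 20.0, 21.0)
predicted = c(18.5, 18.5, 20.5, 20.5)
r.squared(actual, predicted)
# the reported test-set figure follows from the printed SSE and SST
1 - 25.0005997807204/60.3587341024956
```

A stricter out-of-sample variant would use the training-set mean as the baseline instead of the test-set mean; that is a different design choice from the one made in the code above.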
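True or False: Our linear regression model suffers from overfitting.
False: the R-squared on the test set (0.586) is higher than on the training set (0.505), so there is no sign of overfitting.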
Let’s turn this problem into a multi-class classification problem by creating a new dependent variable. Our new dependent variable will take three different values: “Excellent”, “Average”, and “Poor” for films with Worldwide revenue in the top quartile, middle 50%, and bottom quartile, respectively.
Movies$Performance = factor(ifelse(Movies$Worldwide > quantile(Movies$Worldwide, .75), "Excellent", ifelse(Movies$Worldwide > quantile(Movies$Worldwide, .25), "Average", "Poor")))
table(Movies$Performance)
##
## Average Excellent Poor
## 166 84 84
Movies$Worldwide = NULL
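The nested ifelse above can also be written with cut(), which may be easier to read once there are more than two thresholds. A minimal sketch on made-up values (`x`, `by.ifelse` and `by.cut` are hypothetical names, not part of the original analysis):

```r
# made-up log-revenue values for illustration only
x = c(1, 2, 3, 4, 5, 6, 7, 8)
q = quantile(x, c(.25, .75))
# nested ifelse, as in the code above
by.ifelse = factor(ifelse(x > q[2], "Excellent", ifelse(x > q[1], "Average", "Poor")))
# same labels with cut(): intervals are right-closed by default
by.cut = cut(x, breaks = c(-Inf, q[1], q[2], Inf), labels = c("Poor", "Average", "Excellent"))
table(by.ifelse)
# TRUE when both approaches assign the same label to every value
all(as.character(by.ifelse) == as.character(by.cut))
```

One difference is the level order: cut() keeps the order given in labels (Poor, Average, Excellent), while factor() on the ifelse result sorts the levels alphabetically.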
library(caTools)
#randomly split Movies into a training set, containing 70% of the observations, and a testing set, containing 30% of the observations
set.seed(15071)
split = sample.split(Movies$Performance, SplitRatio=0.70)
table(split)
## split
## FALSE TRUE
## 100 234
# now perform splitting:
train = Movies[split==TRUE,]
test = Movies[split==FALSE,]
Build a CART model to predict “Performance” using all of the other variables except for “Name” and “Year”.
library(rpart)
library(rpart.plot)
# to predict a multi-class dependent variable, use the rpart function in the same way as for a binary classification problem
model.CART = rpart(Performance ~ ., data=train[,!(names(train) %in% c('Year', 'Name')) ])
prp(model.CART)
#The CART model you just built predicts only two possible outcomes for a movie with Production.Budget less than 18. Which outcome does it never predict for these low-budget films?
Make predictions on the training set, and then create a confusion matrix. What is the overall accuracy of the model?
table(predict(model.CART, newdata=train, type = "class"), train$Performance)
##
## Average Excellent Poor
## Average 96 17 11
## Excellent 9 41 2
## Poor 11 1 46
print(paste('Accuracy on the train set', (96+41+46)/nrow(train)))
## [1] "Accuracy on the train set 0.782051282051282"
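The accuracy can also be computed from the confusion matrix itself rather than by typing the diagonal entries by hand. The sketch below rebuilds the training-set matrix from the output above; `cm` and `accuracy` are illustrative names.

```r
# training-set confusion matrix copied from the output above
# (rows = predicted class, columns = actual class)
cm = matrix(c(96, 17, 11,
              9, 41, 2,
              11, 1, 46),
            nrow = 3, byrow = TRUE,
            dimnames = list(predicted = c("Average", "Excellent", "Poor"),
                            actual = c("Average", "Excellent", "Poor")))
# correct predictions sit on the diagonal
accuracy = sum(diag(cm)) / sum(cm)
accuracy
```

The same expression works on the object returned by table(), as long as the predicted and actual classes appear in the same order.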
What is the accuracy on the training set of a baseline model that predicts the most frequent outcome (Average) for all observations?
table(train$Performance)
##
## Average Excellent Poor
## 116 59 59
print(paste('Baseline model accuracy on the train set', (116)/nrow(train)))
## [1] "Baseline model accuracy on the train set 0.495726495726496"
Make predictions on the testing set, and then create a confusion matrix. What is the overall accuracy of the model on the testing set?
predict.CART = predict(model.CART, newdata=test, type = "class")
table(predict.CART, test$Performance)
##
## predict.CART Average Excellent Poor
## Average 36 9 8
## Excellent 8 16 1
## Poor 6 0 16
print(paste('Accuracy on the test set', (36+16+16)/nrow(test) ))
## [1] "Accuracy on the test set 0.68"
What is the accuracy on the testing set of a baseline model that predicts the most frequent outcome (Average) for all observations?
table(test$Performance)
##
## Average Excellent Poor
## 50 25 25
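The baseline accuracy follows from the class counts in the table above. A short worked calculation, with the counts copied by hand and illustrative variable names:

```r
# class counts in the testing set, copied from the table above
test.counts = c(Average = 50, Excellent = 25, Poor = 25)
# baseline predicts "Average" (the most frequent class in the training set) for every movie
baseline.accuracy = test.counts[["Average"]] / sum(test.counts)
baseline.accuracy
# CART accuracy on the testing set reported above, for comparison
cart.accuracy = (36 + 16 + 16) / sum(test.counts)
cart.accuracy - baseline.accuracy
```

On these figures the baseline is right for half of the test movies, and the CART model improves on it by 18 percentage points.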
What can you conclude from the CART model?