Each year, box offices at movie theaters collect billions of dollars in revenue in the United States alone. In this problem, we seek to determine whether or not we can predict box office revenue based on different variables related to a movie.

We use a dataset of 334 movies that were produced from 1953 to 2015. The dataset Movies.csv includes the following 24 variables:

Name = the name of the movie
Year = the year the movie was produced
Rated = the rating given to the movie by the MPAA
Runtime = the duration of the movie in minutes
Action = binary variable that takes value 1 if the movie is an action movie, 0 otherwise
Adventure, Crime, Drama, Thriller, Fantasy, Horror, Sci.Fi, Comedy, Family, Mystery, Romance, Animation, Music, History, Documentary are all defined like Action
Wins = number of awards won by the movie
Nominations = number of awards the movie was nominated for
Production.Budget = the natural logarithm of the production budget in dollars
Worldwide = the natural logarithm of the worldwide revenue in dollars

PROBLEM 1 - LOADING THE DATA

Load the dataset Movies.csv into R and call it “Movies”. In this problem, we will build a model to predict worldwide box office revenue for movies made in 2010-2015. Create a training set that consists of movies released before 2010 and a testing set that consists of movies released in 2010 and after.

Movies = read.csv('Movies.csv')
str(Movies)

## 'data.frame':    334 obs. of  24 variables:
##  $ Name             : Factor w/ 334 levels "2 Fast 2 Furious",..: 28 32 31 29 30 247 248 4 5 6 ...
##  $ Year             : int  1989 1992 1995 1997 2005 2008 2012 2002 2006 2014 ...
##  $ Rated            : Factor w/ 6 levels "Approved","G",..: 5 5 5 5 5 3 5 6 6 6 ...
##  $ Runtime          : int  126 126 121 125 140 152 165 113 117 102 ...
##  $ Action           : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Adventure        : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ Crime            : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Drama            : int  0 0 0 0 0 1 0 0 0 1 ...
##  $ Thriller         : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Fantasy          : int  0 0 1 0 0 0 0 0 1 1 ...
##  $ Horror           : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Sci.Fi           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Comedy           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Family           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Animation        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Music            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ History          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Documentary      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Wins             : int  10 2 9 5 15 134 43 10 12 0 ...
##  $ Nominations      : int  21 16 20 20 49 106 83 25 30 6 ...
##  $ Production.Budget: num  18.1 18.9 19.1 19.2 19.1 ...
##  $ Worldwide        : num  20.6 20.1 20.3 19.9 20 ...

# training set that consists of movies released before 2010
train = subset(Movies, Year < 2010)
summary(train$Year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1953    1999    2003    2001    2006    2009

dim(train)

## [1] 248  24

# testing set that consists of movies released in 2010 and after
test = subset(Movies, Year >= 2010)
summary(test$Year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2010    2011    2012    2012    2013    2015

dim(test)

## [1] 86 24

PROBLEM 2 - METHOD OF SPLITTING THE DATA

In this class, we have frequently used the sample.split function to randomly split our data. Why do we use a different approach here?

library(caTools)
table(sample.split(Movies$Worldwide, SplitRatio = 1/3))

## 
## FALSE  TRUE 
##   223   111

table(sample.split(Movies$Worldwide, SplitRatio = 1/2))

## 
## FALSE  TRUE 
##   167   167
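
One plausible reason, given that Problem 1 sets out to predict revenue for movies made in 2010-2015, is that a random split would let movies from the prediction period into the training set. The sketch below illustrates this on a made-up data frame (`toy` and its simulated years are assumptions, not the Movies data), using base R `sample` as a stand-in for `sample.split`.

```r
# Toy stand-in for the Movies data: only the Year column matters here.
set.seed(1)
toy <- data.frame(Year = sample(1953:2015, 334, replace = TRUE))

# Chronological split, as in Problem 1.
chron_train <- subset(toy, Year < 2010)
chron_test  <- subset(toy, Year >= 2010)

# Random 70/30 split (base R stand-in for caTools::sample.split).
in_train   <- sample(c(TRUE, FALSE), nrow(toy), replace = TRUE, prob = c(0.7, 0.3))
rand_train <- toy[in_train, , drop = FALSE]

# Every training year precedes every test year in the chronological split.
max(chron_train$Year) < min(chron_test$Year)

# Check whether any 2010+ movie ended up in the random training set.
any(rand_train$Year >= 2010)
```

With a chronological split the model is evaluated only on movies released after everything it was trained on, which mirrors how it would be used in practice.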

PROBLEM 3 - A LINEAR REGRESSION MODEL

Build a linear regression model to predict “Worldwide” based on all of the remaining variables, except for Name and Year. Use the training set to build the model. If your training set is called MoviesTrain, an easy way to do this is to pass the argument data = MoviesTrain[ , 3:ncol(MoviesTrain)] to the lm function.

# build the linear regression model
model.lm = lm(Worldwide ~ . , data=train[,!(names(train) %in% c('Year', 'Name')) ])

# R-squared
summary(model.lm)$r.squared

## [1] 0.5412771
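
The problem statement suggests selecting columns by position, while the code above drops Name and Year by name. A small sketch on a made-up data frame (`toy` is hypothetical) suggests the two approaches pick the same columns as long as Name and Year are the first two columns, which the str output indicates is the case here.

```r
# Hypothetical miniature of the Movies layout: Name and Year come first.
toy <- data.frame(Name      = c("A", "B"),
                  Year      = c(2001, 2002),
                  Runtime   = c(100, 120),
                  Worldwide = c(19.5, 20.1))

# Selection by position, as suggested in the problem statement.
by_position <- toy[, 3:ncol(toy)]

# Selection by name, as used in the solution code.
by_name <- toy[, !(names(toy) %in% c("Year", "Name"))]

# Both keep exactly the same columns in the same order.
identical(names(by_position), names(by_name))
```

Selecting by name is a little more robust, since it keeps working if the column order changes.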

PROBLEM 4 - CHECKING FOR SIGNIFICANCE

In your linear regression model, which of the independent variables are significant at the p=0.05 level (at least one star)? For factor variables, consider the variable significant if at least one level is significant.

summary(model.lm)

## 
## Call:
## lm(formula = Worldwide ~ ., data = train[, !(names(train) %in% 
##     c("Year", "Name"))])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8311 -0.3893  0.0809  0.3906  1.6326 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       15.9976766  1.1580192  13.815  < 2e-16 ***
## RatedG             0.4102319  0.7041759   0.583 0.560774    
## RatedN/A           0.5562304  0.7322003   0.760 0.448258    
## RatedPG            0.7090507  0.6892058   1.029 0.304696    
## RatedPG-13         0.6027509  0.7107462   0.848 0.397322    
## RatedR             0.4127760  0.7210913   0.572 0.567608    
## Runtime            0.0097079  0.0032359   3.000 0.003007 ** 
## Action             0.0172144  0.1333332   0.129 0.897389    
## Adventure         -0.2232011  0.1283551  -1.739 0.083432 .  
## Crime             -0.3315595  0.1473878  -2.250 0.025457 *  
## Drama             -0.2097674  0.1881216  -1.115 0.266029    
## Thriller          -0.1067814  0.1364400  -0.783 0.434682    
## Fantasy            0.1634927  0.1406340   1.163 0.246264    
## Horror            -0.6173923  0.2056864  -3.002 0.002993 ** 
## Sci.Fi            -0.0147251  0.1353512  -0.109 0.913466    
## Comedy            -0.1431461  0.1611241  -0.888 0.375276    
## Family            -0.3206547  0.1780467  -1.801 0.073067 .  
## Mystery            0.1454495  0.1837735   0.791 0.429520    
## Romance           -0.0449227  0.2063239  -0.218 0.827840    
## Animation          0.6116779  0.2085349   2.933 0.003706 ** 
## Music             -0.1531571  0.6860353  -0.223 0.823546    
## History           -1.3823299  0.6876694  -2.010 0.045624 *  
## Documentary       -0.4480795  0.4830499  -0.928 0.354620    
## Wins               0.0003704  0.0039940   0.093 0.926186    
## Nominations        0.0159878  0.0043745   3.655 0.000321 ***
## Production.Budget  0.1104362  0.0550254   2.007 0.045962 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6431 on 222 degrees of freedom
## Multiple R-squared:  0.5413, Adjusted R-squared:  0.4896 
## F-statistic: 10.48 on 25 and 222 DF,  p-value: < 2.2e-16

# extract the coefficient table (column 4 holds the p-values)
model.summary = summary(model.lm)$coefficients
#model.summary[order(model.summary[,4]),]

# independent variables significant at the p=0.05 level
sort(model.summary[,4]<0.05)

##            RatedG          RatedN/A           RatedPG        RatedPG-13 
##             FALSE             FALSE             FALSE             FALSE 
##            RatedR            Action         Adventure             Drama 
##             FALSE             FALSE             FALSE             FALSE 
##          Thriller           Fantasy            Sci.Fi            Comedy 
##             FALSE             FALSE             FALSE             FALSE 
##            Family           Mystery           Romance             Music 
##             FALSE             FALSE             FALSE             FALSE 
##       Documentary              Wins       (Intercept)           Runtime 
##             FALSE             FALSE              TRUE              TRUE 
##             Crime            Horror         Animation           History 
##              TRUE              TRUE              TRUE              TRUE 
##       Nominations Production.Budget 
##              TRUE              TRUE

barplot(sort(model.summary[model.summary[,4]<0.05,4]), main="independent variables Pr(>|t|) <0.05")
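
As a cross-check that does not need the fitted model, the p-values can be copied from the summary output above into a named vector and filtered directly. This is only a sketch: the vector `p_values` is typed in by hand from the printed table, and the intercept is left out because it is not an independent variable.

```r
# p-values copied from the Pr(>|t|) column of the Problem 4 summary.
p_values <- c("RatedG" = 0.560774, "RatedN/A" = 0.448258, "RatedPG" = 0.304696,
              "RatedPG-13" = 0.397322, "RatedR" = 0.567608, "Runtime" = 0.003007,
              "Action" = 0.897389, "Adventure" = 0.083432, "Crime" = 0.025457,
              "Drama" = 0.266029, "Thriller" = 0.434682, "Fantasy" = 0.246264,
              "Horror" = 0.002993, "Sci.Fi" = 0.913466, "Comedy" = 0.375276,
              "Family" = 0.073067, "Mystery" = 0.429520, "Romance" = 0.827840,
              "Animation" = 0.003706, "Music" = 0.823546, "History" = 0.045624,
              "Documentary" = 0.354620, "Wins" = 0.926186,
              "Nominations" = 0.000321, "Production.Budget" = 0.045962)

# Names of the coefficients significant at the 0.05 level.
significant <- names(p_values)[p_values < 0.05]
significant
# "Runtime" "Crime" "Horror" "Animation" "History" "Nominations" "Production.Budget"
```

No level of Rated falls below 0.05, so under the rule stated in the problem the factor Rated is not counted as significant.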

PROBLEM 5 - CORRELATIONS

What is the correlation between Worldwide and Production.Budget in the training set?

cor(train$Worldwide, train$Production.Budget)

## [1] 0.4947683

PROBLEM 6 - AN UPDATED MODEL

Create a new linear regression model on the training set with only the significant variables you found in Problem 4 as the independent variables.

# build the reduced linear regression model
model.lm = lm(Worldwide ~ Runtime + Crime + Horror + Animation + History + Nominations + Production.Budget, data=train)

# plain summary
summary(model.lm)

## 
## Call:
## lm(formula = Worldwide ~ Runtime + Crime + Horror + Animation + 
##     History + Nominations + Production.Budget, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.87336 -0.36913  0.07701  0.37701  1.93620 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       15.531741   0.803224  19.337  < 2e-16 ***
## Runtime            0.010095   0.002634   3.833 0.000162 ***
## Crime             -0.293315   0.120701  -2.430 0.015829 *  
## Horror            -0.427419   0.141596  -3.019 0.002814 ** 
## Animation          0.465257   0.166530   2.794 0.005629 ** 
## History           -1.556980   0.650418  -2.394 0.017443 *  
## Nominations        0.017313   0.003010   5.751 2.68e-08 ***
## Production.Budget  0.152527   0.046708   3.266 0.001252 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 240 degrees of freedom
## Multiple R-squared:  0.5048, Adjusted R-squared:  0.4904 
## F-statistic: 34.95 on 7 and 240 DF,  p-value: < 2.2e-16

PROBLEM 7 - UNDERSTANDING COEFFICIENTS

In the model from Problem 6, what is the coefficient for Animation in the linear regression?

0.465257 - correct

The coefficient for Runtime is 0.010095. What is the interpretation of this coefficient?

For an additional minute of Runtime, the prediction for the variable Worldwide increases by approximately 0.01 units. - correct
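
Because Worldwide is the natural logarithm of revenue (see the variable list), a coefficient can also be read as an approximate percentage change in revenue in dollars. The sketch below works this out for the two coefficients quoted above; the variable names are illustrative only.

```r
# Coefficients copied from the Problem 6 summary output.
coef_runtime   <- 0.010095
coef_animation <- 0.465257

# A change of b in log revenue multiplies revenue by exp(b),
# so the percentage change is (exp(b) - 1) * 100.
runtime_pct   <- (exp(coef_runtime) - 1) * 100    # roughly 1% per extra minute
animation_pct <- (exp(coef_animation) - 1) * 100  # Animation = 1 versus 0

round(c(Runtime = runtime_pct, Animation = animation_pct), 2)
```

For small coefficients such as Runtime, the percentage change is close to 100 times the coefficient; for larger ones such as Animation, the exponential form matters.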

PROBLEM 8 - PREDICTIONS ON THE TEST SET

#Make predictions on the test set using the linear regression model.
pred.lm = predict(model.lm, newdata=test)

# the Sum of Squared Errors (SSE) on the test set
SSE = sum((pred.lm - test$Worldwide)^2)

#the Total Sum of Squares (SST) on the test set
SST = sum((mean(test$Worldwide) - test$Worldwide)^2)

#R-squared on the test set
Rsquared = 1-SSE/SST

print(paste("SSE=", SSE, "SST=", SST, "Rsquared=", Rsquared))

## [1] "SSE= 25.0005997807204 SST= 60.3587341024956 Rsquared= 0.585799799275666"

#plot the predicted Worldwide value vs the actual value
plot(pred.lm, test$Worldwide, main="Predicted vs actual Worldwide value")
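
The R-squared calculation above can be wrapped in a small helper. This is a sketch, not part of the original solution: `r_squared` and its `baseline` argument are hypothetical names. The solution uses the test-set mean as the baseline for SST; a common alternative is the training-set mean, which the `baseline` argument allows.

```r
# R-squared of predictions relative to a constant baseline prediction.
r_squared <- function(actual, predicted, baseline = mean(actual)) {
  sse <- sum((actual - predicted)^2)   # sum of squared errors of the model
  sst <- sum((actual - baseline)^2)    # sum of squared errors of the baseline
  1 - sse / sst
}

# Reproduce the reported value from the printed SSE and SST.
1 - 25.0005997807204 / 60.3587341024956   # 0.5857998

# Toy example (made-up numbers) showing the effect of the baseline choice.
actual    <- c(20, 21, 19)
predicted <- c(20.5, 20.5, 19.5)
r_squared(actual, predicted)                   # baseline = mean(actual) = 20
r_squared(actual, predicted, baseline = 19.5)  # e.g. a training-set mean
```

With the Movies data, the training-mean variant would be `r_squared(test$Worldwide, pred.lm, baseline = mean(train$Worldwide))`, evaluated while `train` still holds the Worldwide column, that is, before Problem 10 removes it.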

PROBLEM 9 - UNDERSTANDING THE MODEL

True or False: Our linear regression model suffers from overfitting.
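
A common rough check for overfitting is to compare performance on the training set with performance on the test set. The sketch below does this with the two R-squared values already printed in Problems 6 and 8; the variable names are illustrative.

```r
# Values copied from the output above.
train_r2 <- 0.5048              # Multiple R-squared, Problem 6 summary
test_r2  <- 0.585799799275666   # Rsquared printed in Problem 8

# An overfit model typically scores clearly worse out of sample.
test_r2 - train_r2    # positive: the test R-squared is higher
test_r2 < train_r2    # FALSE
```

Since the test-set R-squared is not lower than the training-set value, this comparison gives no sign of overfitting, which suggests the answer is False.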

PROBLEM 10 - A CLASSIFICATION PROBLEM

Let’s turn this problem into a multi-class classification problem by creating a new dependent variable. Our new dependent variable will take three different values: “Excellent”, “Average”, and “Poor” for films with Worldwide revenue in the top quartile, middle 50%, and bottom quartile, respectively.

Movies$Performance = factor(ifelse(Movies$Worldwide > quantile(Movies$Worldwide, .75), "Excellent", ifelse(Movies$Worldwide > quantile(Movies$Worldwide, .25), "Average", "Poor")))
table(Movies$Performance)

## 
##   Average Excellent      Poor 
##       166        84        84
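
The nested ifelse above can also be written with `cut`, which bins a numeric vector at given break points. The sketch below compares the two on simulated values (`log_revenue` is made up, not the Movies data). Note that `cut` orders the factor levels as given in `labels`, whereas `factor` on the ifelse result orders them alphabetically.

```r
# Simulated log revenues standing in for Movies$Worldwide.
set.seed(42)
log_revenue <- rnorm(200, mean = 19, sd = 1)

# First and third quartiles.
q <- unname(quantile(log_revenue, c(0.25, 0.75)))

# Nested ifelse, mirroring the solution code.
nested <- ifelse(log_revenue > q[2], "Excellent",
                 ifelse(log_revenue > q[1], "Average", "Poor"))

# Equivalent binning: intervals are right-closed by default,
# so values equal to a quartile fall into the lower class, as above.
binned <- cut(log_revenue, breaks = c(-Inf, q, Inf),
              labels = c("Poor", "Average", "Excellent"))

all(nested == as.character(binned))   # TRUE
table(binned)
```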

Movies$Worldwide = NULL
library(caTools)
#randomly split Movies into a training set, containing 70% of the observations, and a testing set, containing 30% of the observations
set.seed(15071)
split = sample.split(Movies$Performance, SplitRatio=0.70)
table(split)

## split
## FALSE  TRUE 
##   100   234

# now perform splitting:
train = Movies[split==TRUE,]
test  = Movies[split==FALSE,]

PROBLEM 11 - A CART MODEL

Build a CART model to predict “Performance” using all of the other variables except for “Name” and “Year”.

library(rpart)
library(rpart.plot)

# to predict a multi-class dependent variable, use the rpart function in the same way as for a binary classification problem
model.CART = rpart(Performance ~ ., data=train[,!(names(train) %in% c('Year', 'Name')) ])
prp(model.CART)

The CART model you just built predicts only two possible outcomes for a movie with Production.Budget less than 18. Which outcome does it never predict for these low-budget films?

PROBLEM 12 - TRAINING SET ACCURACY

Make predictions on the training set, and then create a confusion matrix. What is the overall accuracy of the model?

table(predict(model.CART, newdata=train, type = "class"), train$Performance)

##            
##             Average Excellent Poor
##   Average        96        17   11
##   Excellent       9        41    2
##   Poor           11         1   46

print(paste('Accuracy on the train set', (96+41+46)/nrow(train)))

## [1] "Accuracy on the train set 0.782051282051282"
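
Rather than typing the diagonal counts by hand, accuracy can be computed from the confusion matrix itself. The sketch below rebuilds the matrix from the printed output (rows are predictions, columns are actual classes); `cm` is an illustrative name.

```r
# Training-set confusion matrix copied from the output above.
classes <- c("Average", "Excellent", "Poor")
cm <- matrix(c(96, 17, 11,
                9, 41,  2,
               11,  1, 46),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = classes, actual = classes))

# Overall accuracy: correctly classified (diagonal) over all observations.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy   # 0.7820513

# Share of each actual class that was predicted correctly.
diag(cm) / colSums(cm)
```

With the table object from the solution, the same expression would be `sum(diag(tab)) / sum(tab)`, where `tab` is the result of the `table(...)` call.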

PROBLEM 13 - A BASELINE MODEL

What is the accuracy on the training set of a baseline model that predicts the most frequent outcome (Average) for all observations?

table(train$Performance)

## 
##   Average Excellent      Poor 
##       116        59        59

print(paste('Baseline model accuracy on the train set', (116)/nrow(train)))

## [1] "Baseline model accuracy on the train set 0.495726495726496"

PROBLEM 14 - TESTING SET ACCURACY

Make predictions on the testing set, and then create a confusion matrix. What is the overall accuracy of the model on the testing set?

predict.CART = predict(model.CART, newdata=test, type = "class")
table(predict.CART, test$Performance)

##             
## predict.CART Average Excellent Poor
##    Average        36         9    8
##    Excellent       8        16    1
##    Poor            6         0   16

print(paste('Accuracy on the test set', (36+16+16)/nrow(test) ))

## [1] "Accuracy on the test set 0.68"

PROBLEM 15 - BASELINE ACCURACY ON TESTING SET

What is the accuracy on the testing set of a baseline model that predicts the most frequent outcome (Average) for all observations?

table(test$Performance)

## 
##   Average Excellent      Poor 
##        50        25        25
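
The baseline accuracy follows directly from these counts. A minimal sketch, with the counts and the CART confusion-matrix diagonal copied from the output above:

```r
# Class counts in the testing set.
test_counts <- c(Average = 50, Excellent = 25, Poor = 25)

# Baseline: always predict the most frequent class.
baseline_acc <- max(test_counts) / sum(test_counts)   # 0.5

# CART accuracy on the testing set, from the Problem 14 confusion matrix.
cart_acc <- (36 + 16 + 16) / sum(test_counts)          # 0.68

# Improvement of CART over the baseline, in accuracy points.
cart_acc - baseline_acc
```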

PROBLEM 16 - UNDERSTANDING THE MODEL

What can you conclude from the CART model?

Turning this problem into a classification problem significantly improved the ability to predict movie revenue.
Both the linear regression and CART models are well-suited for this prediction problem.
The linear regression model is significantly better suited for the continuous prediction problem than the CART model is for the classification problem.

PREDICTING BOX OFFICE REVENUE