Introduction

The Hollywood movie production business has a very instinct and contact driven low-tech decision-making process that generates a portfolio of movies that a production house decides to fund in any given year. The same type of decision-making process is employed by movie stars and their agents to decide which projects to pursue and which ones to pass. This leads to high degree of variation in the success rate of projects (as measured by gross box office receipts and return on investment). Most production houses employ a portfolio driven approach and diversify their risk across a number of low, medium and high budget movies. Various attributes such as genre, source, use of computer graphics, the star power of the cast etc are easily available for a movie prior to its official release. These attributes have been used in advanced machine learning algorithms to predict the success of a movie before its release.

Packages Required in R

The following packages will be used for the analysis:

  • readxl: Package to read excel files in R
  • ggplot2: Data visualisation in R
  • dplyr: Easy functions to perform data manipulation in R
  • rpart: Package used to build decision tree models in R
  • randomForest: Package used to build random forest models in R
  • tidytext: Package for text mining for word processing and sentiment analysis
  • wordcloud: Package to generate word clouds
  • caret: Misc functions for training and plotting classification and regression models
  • gbm: Package to build Gradient Bossting Method models in R
  • ModelMetrics: Collection of metrics for evaluating models in R
  • scales: Package to graphically scale map data to aesthetics
  • rpart.plot: Package to draw neat decision trees
## function to install and load multiple packages at once in R

install_load <- function(pack){
    ## Statement to check if the package has been previously installed
    new_pack_load <- pack[!(pack %in% installed.packages()[,"Package"])]
  if (length(new_pack_load))
      install.packages(new_pack_load, dependencies = TRUE)
      sapply(pack, require, character.only = TRUE)
}

package_load <- c("readxl", "dplyr", "rpart", "randomForest", "tidytext", "wordcloud", "caret", "ModelMetrics", "gbm", "ggplot2", "scales", "rpart.plot")

install_load(package_load)

Building the Predictive Model

The following section here deals with reading of the training data provided and then building a predictive model. The tabs below present the step-wise approach followed.

Reading the data provided

Before proceeding with the analysis, it is necessary to specify the directory where the files have been saved. After mentioning the filepath, the training excel file is read using the function available in the readxl package. This training excel sheet contains the data for various movies and their success rate in the market.

#### Setting up the working directory. Please change basis the location of your directory. ####
setwd("C:/Users/rohit/Documents/RPubs_uploads/proj_1/")

#### Reading the files provided
train.dat <- read_excel("Training sheet.xlsx")

The training data contains 1196 observations and 15 features. A glimpse of the data provided is shown below:

glimpse(train.dat)
## Observations: 1,196
## Variables: 15
## $ id                                 <dbl> 7950115, 50950115, 11987011...
## $ name                               <chr> "Avatar", "Harry Potter and...
## $ display_name                       <chr> "Avatar", "Harry Potter and...
## $ production_year                    <dbl> 2009, 2011, 2011, 2010, 201...
## $ movie_sequel                       <dbl> 0, 1, 1, 1, 1, 0, 1, 1, 1, ...
## $ creative_type                      <chr> "Science Fiction", "Fantasy...
## $ source                             <chr> "Original Screenplay", "Bas...
## $ production_method                  <chr> "Animation/Live Action", "A...
## $ genre                              <chr> "Action", "Adventure", "Act...
## $ language                           <chr> "English", "English", "Engl...
## $ board_rating_reason                <chr> "For intense epic battle se...
## $ movie_board_rating_display_name    <chr> "PG-13", "PG-13", "PG-13", ...
## $ movie_release_pattern_display_name <chr> "Wide", "Wide", "Wide", "Wi...
## $ total                              <dbl> 2784, 1328, 1124, 1064, 104...
## $ Category                           <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, ...

Exploring the data

After reading the data, the training data is explored to understand the variables present in the data.

The plot below shows the distribution of movies across the different genre in the data. Drama and Comedy categories have the highest number of movies made.

#### Plot to determine genre distribution of the movie database
genre.dat <- train.dat %>% group_by(genre) %>% dplyr::summarize(count = n(), percn = round(n()*100/nrow(train.dat), 2))

ggplot(genre.dat, aes(x = factor(genre), y = percn)) + 
  geom_point(size = 3) + 
  geom_segment(aes(x = genre, 
                   xend = genre, 
                   y = 0, 
                   yend = percn)) + 
  labs(title = "Distribution of movies across genre") +
  xlab("Genre of the movie") +
  ylab("Percentage of data") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

The bar plot below provides the distribution of the movies across movie_release_pattern_display_name. Wide formatted movies is the most popular format of the movie release pattern.

#### Plot to see the distribution of movie release pattern
ggplot(data = train.dat, aes(x = movie_release_pattern_display_name)) + 
  geom_bar(aes(y = (..count..)/sum(..count..)),  fill = "dark cyan", color = "black") + 
  scale_y_continuous(labels = percent) +
  ylab("Percentage of data points") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Bar plot for movie release pattern")

The below table shows the frequency distribution of categories in the variable language. As it is observed over 95% of the movies in the data are made in the language English.

#### Frequency distribution table
print("Frequency distribution of categories in the variable language")
## [1] "Frequency distribution of categories in the variable language"
train.dat %>% 
  group_by(language) %>% 
  dplyr::summarize(count = n(), percentage = round(n()*100/nrow(train.dat), 2)) %>%
  arrange(desc(percentage))
## # A tibble: 16 x 3
##      language count percentage
##         <chr> <int>      <dbl>
##  1    English  1144      95.65
##  2      Hindi    16       1.34
##  3     French    14       1.17
##  4    Spanish     6       0.50
##  5     German     3       0.25
##  6   Japanese     2       0.17
##  7    Swedish     2       0.17
##  8     Arabic     1       0.08
##  9     Danish     1       0.08
## 10      Farsi     1       0.08
## 11     Hebrew     1       0.08
## 12    Italian     1       0.08
## 13  Norwegian     1       0.08
## 14     Polish     1       0.08
## 15 Portuguese     1       0.08
## 16     Silent     1       0.08

The below table shows the frequency distribution of categories in the variable source. It is observed that more than 50% of the movies in the data are have Original Screenplay.

print("Frequency distribution of categories in the variable source")
## [1] "Frequency distribution of categories in the variable source"
train.dat %>% 
  group_by(source) %>% 
  dplyr::summarize(count = n(), percentage = round(n()*100/nrow(train.dat), 2)) %>%
  arrange(desc(percentage))
## # A tibble: 16 x 3
##                                 source count percentage
##                                  <chr> <int>      <dbl>
##  1                 Original Screenplay   629      52.59
##  2   Based on Fiction Book/Short Story   218      18.23
##  3           Based on Real Life Events   128      10.70
##  4                              Remake    65       5.43
##  5                         Based on TV    38       3.18
##  6        Based on Comic/Graphic Novel    36       3.01
##  7       Based on Factual Book/Article    23       1.92
##  8                       Based on Play    21       1.76
##  9 Based on Folk Tale/Legend/Fairytale    10       0.84
## 10                       Based on Game     8       0.67
## 11                            Spin-Off     5       0.42
## 12           Based on Musical or Opera     4       0.33
## 13                 Based on Short Film     4       0.33
## 14                         Compilation     3       0.25
## 15            Based on Theme Park Ride     2       0.17
## 16                        Based on Toy     2       0.17

The below table shows the frequency distribution of categories in the variable creative_type. It is observed that more than 50% of the movies in the data are Fiction based.

print("Frequency distribution of categories in the variable creative_type")
## [1] "Frequency distribution of categories in the variable creative_type"
train.dat %>% 
  group_by(creative_type) %>% 
  dplyr::summarize(count = n(), percentage = round(n()*100/nrow(train.dat), 2)) %>%
  arrange(desc(percentage))
## # A tibble: 9 x 3
##             creative_type count percentage
##                     <chr> <int>      <dbl>
## 1    Contemporary Fiction   638      53.34
## 2                 Fantasy   131      10.95
## 3      Historical Fiction   100       8.36
## 4           Dramatization    95       7.94
## 5         Science Fiction    89       7.44
## 6                 Factual    60       5.02
## 7            Kids Fiction    59       4.93
## 8              Super Hero    20       1.67
## 9 Multiple Creative Types     4       0.33

The below table shows the frequency distribution of categories in the variable production_method. It is observed that more than 90% of the movies in the data are produced as LIve Action.

print("Frequency distribution of categories in the variable production_method")
## [1] "Frequency distribution of categories in the variable production_method"
train.dat %>% 
  group_by(production_method) %>% 
  dplyr::summarize(count = n(), percentage = round(n()*100/nrow(train.dat), 2)) %>%
  arrange(desc(percentage))
## # A tibble: 6 x 3
##             production_method count percentage
##                         <chr> <int>      <dbl>
## 1                 Live Action  1093      91.39
## 2           Digital Animation    53       4.43
## 3       Animation/Live Action    36       3.01
## 4              Hand Animation     6       0.50
## 5       Stop-Motion Animation     5       0.42
## 6 Multiple Production Methods     3       0.25

The variable Category takes the value from 1 to 9, with 1 being the least successful movie and 9 being the most successful movie.

The below plot shows the histogram distribution of the Category variable across the different production_year. Except for the year 2011, the trend of the variable Category is consistent.

## Plotting the histogram of category across different production year
ggplot(data = train.dat, aes(x = factor(Category), fill = factor(Category))) + 
  geom_bar(aes(y = ..count../sum(..count..))) +
  facet_wrap(~ production_year) +
  scale_y_continuous(labels = percent) +
  ggtitle("Histogram distribution of Category across the production years") +
  ylab("Percentage of data points") +
  xlab("Category") +
  theme_bw() +
  theme(legend.position = "none") 

The below plot shows the histogram distribution of the Category variable across the different board_ratings. As observed Not Rated has a different trend with respect to the other categories. Further, for NC-17, there are not many observations to determine the trend of Category.

## Plotting the histogram of category across different board ratings
ggplot(data = train.dat, aes(x = factor(Category), fill = factor(Category))) + 
  geom_bar(aes(y = ..count../sum(..count..))) +
  facet_wrap(~ movie_board_rating_display_name) +
  scale_y_continuous(labels = percent) +
  ggtitle("Histogram distribution of Category across different board ratings") +
  ylab("Percentage of data points") +
  xlab("Category") +
  theme_bw() +
  theme(legend.position = "none") 

The below plot shows the histogram distribution of the Category variable across movie_sequel. Although, the number of observations is less for movies that have sequels, yet the success rate of a movie sequel is higher as compared to movies that don’t have a sequel.

train.dat.movie.seq <- train.dat %>% 
                          select(Category, movie_sequel) %>% 
                          mutate(movie_sequel_grp = ifelse(movie_sequel == 1, "Yes", "No"))

## Plotting the histogram of category on whether a movie sequel was made or not 
ggplot(data = train.dat.movie.seq, aes(x = factor(Category), fill = factor(Category))) + 
  geom_bar(aes(y = ..count../sum(..count..))) +
  facet_wrap(~ movie_sequel_grp) +
  scale_y_continuous(labels = percent) +
  ggtitle("Histogram distribution of Category on movie sequel criteria") +
  ylab("Percentage of data points") +
  xlab("Category") +
  theme_bw() +
  theme(legend.position = "none") 

Preparing the data

After observing patterns in the data, the following steps have been performed to prepare the final data for modeling:

  1. Lowest frequency categories in each of the character variables are combined. By doing this, if any new category appears in any independent variable in the new scoring data, the model will be able to provide predcition for that particular observation. Further, whenever the model is used to score any new data, a population stability index (PSI) value for each of the independent variables should be calculated. The PSI calculation compares a particular variable distribution between the model-development data and any new data, and provides a value that signifies the degree of change in the variable characteristics. If the PSI value of the variable is high, then the predictive model built should be re-visited to incoporate the variable changes.

  2. The character variables are converted to factors. The levels corresponding to these variables can then be easily transferred to the variable in the scoring data. This ensures consitency in the model outputs and easy replicability of the model output.

The code here performs the above steps and the final data set is then ready to be used for modeling.

#### Converting the character variables to factor ####

var_list <- c("movie_sequel", "creative_type", "source", "production_method", "genre","language", "movie_board_rating_display_name", "movie_release_pattern_display_name")

#### Combining low frequency variable into one category
#### so that in future if any new category comes, that particular observation can be modeled
train.dat$movie_sequel <- ifelse(!(train.dat$movie_sequel %in% c(0)), "other", train.dat$movie_sequel)
train.dat$creative_type <- ifelse(!(train.dat$creative_type %in% c("Contemporary Fiction",
                                                                   "Fantasy",
                                                                   "Historical Fiction",
                                                                   "Dramatization",
                                                                   "Science Fiction",
                                                                   "Factual",
                                                                   "Kids Fiction",
                                                                   "Super Hero")), "other", train.dat$creative_type)

train.dat$source <- ifelse(!(train.dat$source %in% c("Original Screenplay",
                                                     "Based on Fiction Book/Short Story",
                                                     "Based on Real Life Events",
                                                     "Remake",
                                                     "Based on TV",
                                                     "Based on Comic/Graphic Novel",
                                                     "Based on Factual Book/Article",
                                                     "Based on Play",
                                                     "Based on Folk Tale/Legend/Fairytale",
                                                     "Based on Game",
                                                     "Spin-Off",
                                                     "Based on Musical or Opera",
                                                     "Based on Short Film",
                                                     "Compilation")), "other", train.dat$source)

train.dat$production_method <- ifelse(!(train.dat$production_method %in% c("Live Action",
                                                                           "Digital Animation",
                                                                           "Animation/Live Action",
                                                                           "Hand Animation",
                                                                           "Stop-Motion Animation")), "other", train.dat$production_method)

train.dat$genre <- ifelse(!(train.dat$genre %in% c("Drama",
                                                   "Comedy",
                                                   "Thriller/Suspense",
                                                   "Action",
                                                   "Adventure",
                                                   "Romantic Comedy",
                                                   "Horror",
                                                   "Documentary",
                                                   "Black Comedy",
                                                   "Musical",
                                                   "Multiple Genres")), "other", train.dat$genre)

train.dat$language <- ifelse(!(train.dat$language %in% c("English",
                                                         "Hindi",
                                                         "French",
                                                         "Spanish",
                                                         "German",
                                                         "Japanese",
                                                         "Swedish")), "other", train.dat$language)

train.dat$movie_board_rating_display_name <- ifelse(!(train.dat$movie_board_rating_display_name %in% c("R",
                                                                                                       "PG-13",
                                                                                                       "PG",
                                                                                                       "Not Rated",
                                                                                                       "G")), "other", train.dat$movie_board_rating_display_name)

train.dat$movie_release_pattern_display_name <- ifelse(!(train.dat$movie_release_pattern_display_name %in% c("Wide",
                                                                                                             "Limited",
                                                                                                             "Exclusive",
                                                                                                             "Expands Wide",
                                                                                                             "IMAX",
                                                                                                             "Oscar Qualifying Run")), "other", train.dat$movie_release_pattern_display_name)

label <- list()

for (i in 1:length(var_list)) {
  label[i] <- unique(as.data.frame(train.dat[, var_list[i]]))
  train.dat[, var_list[i]] <- factor(as.data.frame(train.dat)[, var_list[i]], labels = label[[i]])
}

Further, a feature extraction has been done on the variable board_rating_reason. Using the sentiment library in the tidytext package, a sentiment score is calculated for the comment provided in the variable board_rating_reason. Although, the variable movie_board_rating_display_name has its category based on comments from board_rating_reason, it would be a good idea to test whether the sentiment associated with the comment in board_rating_reason can provide any extra input to the final prediction.

The figure below provides an illustration of commonly used words in the variable board_rating_reason.

##### Extracting information from the variable "board_rating_reason" ####
##### calculating the sentiment score associated with each of the observation ####

## Nesting each and every word in the text to form a tidy data ##
train.dat.board_rating <- train.dat %>%
                            select(id, board_rating_reason) %>%
                            unnest_tokens(word, board_rating_reason) %>%
                            anti_join(stop_words, by = "word")

train.dat.board_rating_count <- train.dat.board_rating %>% count(word)

## Forming word cloud of comments in the variable ##
word_cloud <- train.dat.board_rating_count %>% 
                arrange(desc(train.dat.board_rating_count$n))
                
word_cloud %>%
  with(wordcloud(word, n, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2")))

## Associating sentiment score with each word ##
id_senti_score <- train.dat.board_rating %>%
                  inner_join(get_sentiments("afinn"), by = "word") %>%
                  group_by(id) %>%
                  summarize(sentiment_score = sum(score))

## Merging with the original data ##
train.dat <- train.dat %>%
                left_join(id_senti_score, by = "id")

A glimpse of the final data is shown below:

glimpse(train.dat)
## Observations: 1,196
## Variables: 16
## $ id                                 <dbl> 7950115, 50950115, 11987011...
## $ name                               <chr> "Avatar", "Harry Potter and...
## $ display_name                       <chr> "Avatar", "Harry Potter and...
## $ production_year                    <dbl> 2009, 2011, 2011, 2010, 201...
## $ movie_sequel                       <fctr> 0, other, other, other, ot...
## $ creative_type                      <fctr> Factual, Super Hero, Factu...
## $ source                             <fctr> Based on Real Life Events,...
## $ production_method                  <fctr> Animation/Live Action, Ani...
## $ genre                              <fctr> Action, Adventure, Action,...
## $ language                           <fctr> English, English, English,...
## $ board_rating_reason                <chr> "For intense epic battle se...
## $ movie_board_rating_display_name    <fctr> Not Rated, Not Rated, Not ...
## $ movie_release_pattern_display_name <fctr> Oscar Qualifying Run, Osca...
## $ total                              <dbl> 2784, 1328, 1124, 1064, 104...
## $ Category                           <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ sentiment_score                    <int> -2, -5, -5, NA, -3, -5, -4,...

Modeling Approach

To build the predictive model of success for a movie, a variety of tree based models have been used. The reason for using a tree based model is that it is easy to interpret the model, has a good accuracy, can handle variety of predictor variables and can handle different types of response variables.

The following modeling approaches have been used:

  1. Approach 1: In this approach, the variable Category is used as the dependent variable to predict. Various machine learning approaches are tried to build the model: Pruned decision tree, Random Forest and Gradient Boosting Method.

  2. Approach 2: In this approach, the variable total is first predicted using a variety of models. The output from the best model is then used to finally predict the Category variable. This is done because it was observed that the Category and the total variable are highly correlated and thus, it would be interesting to see the performance of a model in such an approach.

In all the cases, the model is developed on 80% of the data and then tested on the remaining 20% of the data. Prediction accuracy across both the train data and the test data is compared. The final model is selected basis the best performace across both the train data and the test data. A comparison of all the models built is shown in the tab Comparison of all models formed.

Modeling Approach 1

In this approach, the variable Category is directly modeled. Various machine learning models are built and their performance compared.

First, the training model is split into development data and validation data. The splitting is done in such a way that the proportion of each of the categories of the dependent variable is maintained in both the development data and the validation data. Thus, a stratified sampling approach has been used to split the training data. This ensures that distribution of the variable is consistent across the development data and the validation data.

Development data comprises of 80% of the training data and the rest of the training data forms the validation data.

#### Modeling the data ####

## Separating into training and test data - 80% training & 20% test ##

set.seed(524)
subset <- createDataPartition(train.dat$Category, times = 1, p = 0.80, list = FALSE)

## splitting the data into 80:20
dat.model_train <- train.dat[subset, ]
dat.model_test <- train.dat[-subset, ]

The below tables provide the comparision of proportion of each of the categories of the dependent variable in training data, development data and validation data.

## Checking %age distribution across the different levels of category
print("Proportion of each of the categories in the training data")
## [1] "Proportion of each of the categories in the training data"
table(train.dat$Category)/nrow(train.dat)
## 
##          1          2          3          4          5          6 
## 0.14046823 0.20652174 0.20484950 0.17224080 0.11036789 0.06688963 
##          7          8          9 
## 0.05100334 0.03177258 0.01588629
print("Proportion of each of the categories in the development data")
## [1] "Proportion of each of the categories in the development data"
table(dat.model_train$Category)/nrow(dat.model_train)
## 
##          1          2          3          4          5          6 
## 0.14196242 0.20459290 0.20459290 0.17327766 0.10960334 0.06889353 
##          7          8          9 
## 0.04906054 0.03131524 0.01670146
print("Proportion of each of the categories in the validation data")
## [1] "Proportion of each of the categories in the validation data"
table(dat.model_test$Category)/nrow(dat.model_test)
## 
##          1          2          3          4          5          6 
## 0.13445378 0.21428571 0.20588235 0.16806723 0.11344538 0.05882353 
##          7          8          9 
## 0.05882353 0.03361345 0.01260504

It is observed that the proportion remains consistent across each of the variables.

Technique 1: Decision Tree Model: Pruned Tree

The rpart package is used to build the decision tree model. Initially the decision tree builds a high depth tree that works well in the development data but the performance is worse in the validation data. This is due to over-fitting of the model on the development data. Thus, the more complex a model is, the higher the variance in the model and this results in a higher total error. The total error is defined as the sum of the variance term and the square of the bias term. To have a balance between the bias and the variance of the model, this high depth tree is pruned such that the resulting model has the lowest total error.

Therefore, first a high depth tree is formed. Then, a plot of error and tree complexity is made. This helps to determine the complexity of the tree model that is required to have a low error in the model. The complexity plot for the tree model developed is shown below.

#### Model approach 1: predict the category variable ####
#### Trying different techniques ####

dat.model_train.sub <- dat.model_train %>% 
  select(-c(id, name, display_name, board_rating_reason, total, production_year))

#### Technique 1: Pruned Decision Tree ####

## Building a high depth tree ##
set.seed(12345)
train.rpart <- rpart(formula = Category ~ . , data = dat.model_train.sub, method = "class", cp = 0.00001)
plotcp(train.rpart)

printcp(train.rpart)
## 
## Classification tree:
## rpart(formula = Category ~ ., data = dat.model_train.sub, method = "class", 
##     cp = 1e-05)
## 
## Variables actually used in tree construction:
## [1] creative_type                      genre                             
## [3] language                           movie_board_rating_display_name   
## [5] movie_release_pattern_display_name movie_sequel                      
## [7] production_method                  sentiment_score                   
## [9] source                            
## 
## Root node error: 762/958 = 0.79541
## 
## n= 958 
## 
##            CP nsplit rel error  xerror     xstd
## 1  0.07611549      0   1.00000 1.04593 0.015188
## 2  0.01531059      1   0.92388 0.96063 0.017245
## 3  0.00918635      4   0.87795 0.92782 0.017861
## 4  0.00787402      5   0.86877 0.92388 0.017929
## 5  0.00557743      8   0.84514 0.92782 0.017861
## 6  0.00524934     12   0.82283 0.93176 0.017792
## 7  0.00393701     16   0.80184 0.93307 0.017768
## 8  0.00306212     21   0.78215 0.93701 0.017697
## 9  0.00262467     36   0.73097 0.93963 0.017649
## 10 0.00218723     40   0.72047 0.94357 0.017576
## 11 0.00196850     43   0.71391 0.94488 0.017552
## 12 0.00131234     45   0.70997 0.95276 0.017401
## 13 0.00065617     56   0.69554 0.94357 0.017576
## 14 0.00001000     58   0.69423 0.93701 0.017697

Basis the plot and the complexity table output, a tree with complexity parameter 0.00393701 can be used. This tree has 17 terminal nodes and is able to predict each of the categories in the dependent variable. A plot of the tree is shown below:

## Pruned decision tree
train.pruned_tree <- prune(train.rpart, cp = 0.00393701)
prp(train.pruned_tree, varlen = 12)

In case of development data, the truth table of the model can be seen as:

## Development truth table
train.pred.pruned.tree <- predict(train.pruned_tree, dat.model_train.sub, type = "class")
table(dat.model_train.sub$Category, train.pred.pruned.tree, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth  1  2  3  4  5  6  7  8  9
##     1 47 72  7  6  3  1  0  0  0
##     2 22 99 47 21  4  1  0  2  0
##     3 11 47 73 52  3  1  3  5  1
##     4 16 13 44 73 13  1  2  2  2
##     5  3  8 18 45 21  3  2  5  0
##     6  2  3 16 29  7  4  1  4  0
##     7  0  2  3 15  7  4  8  8  0
##     8  1  0  0  6  0  0  7 16  0
##     9  0  0  0  1  1  1  4  3  6
dev.tree.error <- sum((dat.model_train.sub$Category == train.pred.pruned.tree)/nrow(dat.model_train.sub))  ## 36.22%

Further, for the validation data, the truth table is:

## Validation truth table
test.pred.pruned.tree <- predict(train.pruned_tree, dat.model_test, type = "class")
table(dat.model_test$Category, test.pred.pruned.tree, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth  1  2  3  4  5  6  7  8  9
##     1 10 19  2  1  0  0  0  0  0
##     2 12 19 11  8  0  0  1  0  0
##     3  6  9 13 17  3  0  0  1  0
##     4  3  7 15 10  4  0  0  1  0
##     5  1  2  7 11  5  1  0  0  0
##     6  1  2  3  3  4  0  0  0  1
##     7  0  2  2  2  2  2  1  3  0
##     8  0  0  2  0  2  1  1  1  1
##     9  0  0  0  0  0  1  0  2  0
val.tree.error <- sum((dat.model_test$Category == test.pred.pruned.tree)/nrow(dat.model_test))  ## 24.79%

For this pruned decision tree model, the classification accuracy rate in the development data is 36.22 % and in the validation data is 24.79 %.

Technique 2: Random Forest

The random forest technique is an improved version of the original decision tree model. In this technique, multiple high depth decision tree models are formed and then the output is the average of all the models. By doing this, the precision of the predcited value is maintained and at the same time the variance is reduced as the average is taken across all the models. In random forest, for each of the tree model non-parametric bootstrap samples are taken from the original data and the decision tree model made. Further, at every binary tree split only a subset of the variables is chosen. This ensures that each of the trees formed are independent from each other. This technique is also know as an ensemble machine learning technique where the output of various models are combined to predict the response variable.

#### Technique 2: Random Forest ####

## removing variable with missing value, as tbis RF function cannot handle variables with missing values
dat.model_train.sub.rf <- dat.model_train.sub %>% select(-c(sentiment_score))

## For now, tunning paramter selected basis best approach given in literature
set.seed(12346)
rf.model <- randomForest(as.factor(Category) ~ ., data = dat.model_train.sub.rf, mtry = 3, importance = TRUE, ntree = 100)

To use the random forest technique, the tunning parameters selected are basis the best approach given in literature. Better options of tunning parameters can be chosen using cross-validation technique.

In case of development data, the truth table of the model can be seen as:

## Development truth table
train.pred.rf <- predict(rf.model, dat.model_train.sub.rf, type = "class")
table(dat.model_train.sub.rf$Category, train.pred.rf, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth   1   2   3   4   5   6   7   8   9
##     1 106   9  13   6   1   0   1   0   0
##     2  29 114  31  16   4   2   0   0   0
##     3  16  14 129  28   6   0   3   0   0
##     4   5   8  29 119   4   0   1   0   0
##     5   1   8  26  14  52   2   1   1   0
##     6   2   4  13  14   6  26   1   0   0
##     7   1   4   4   6   3   1  26   2   0
##     8   0   0   0   1   3   1   1  24   0
##     9   0   0   0   0   2   1   1   1  11
dev.rf.error <- sum((dat.model_train.sub.rf$Category == train.pred.rf)/nrow(dat.model_train.sub.rf))  ## 63.15%

Further, for the validation data, the truth table is:

## Validation truth table
test.pred.rf <- predict(rf.model, dat.model_test, type = "class")
table(dat.model_test$Category, test.pred.rf, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth  1  2  3  4  5  6  7  8  9
##     1 19  9  2  2  0  0  0  0  0
##     2 16  9 11 12  2  0  1  0  0
##     3 11  7 11 16  4  0  0  0  0
##     4  3  8 13 10  4  1  0  1  0
##     5  1  2  6 10  6  1  1  0  0
##     6  0  2  4  4  2  1  0  0  1
##     7  0  2  2  3  3  0  2  0  2
##     8  0  0  0  2  1  2  2  0  1
##     9  0  0  0  0  0  0  0  1  2
val.rf.error <- sum((dat.model_test$Category == test.pred.rf)/nrow(dat.model_test))  ## 24.37%

For this random forest model, the classification accuracy rate in the development data is 63.36 % and in the validation data is 25.21 %. The results are better as compared to the decision tree model formed above.

Technique 3: Gradient Boosting Method

The Gradient Boosting Method (GBM) is another modification of the original decision tree model. In this particular technique, initially a low precision tree (less depth) model is formed. Then, the resulting error is modeled again using a different tree model. Its output is added to the original tree model. Again, the error is calculated and the whole process is repeated. Thus, in GBM for each iteration the error term is modeled and finally, all the output are combined to predict the response variable. Here, the parameters can limit the learning rate in each step, the maximum depth of tree in each step, the minimum nunber of observation in each node and the number of iterations to be done. This technique also belongs to the ensemble class of machine learning models and helps to maintian the precision of the response variable and at the same time the variance of the model is not compromised.

#### Technique 3: Gradient Boosting Method ####

set.seed(12347)
gbm.model <- gbm(as.factor(Category) ~ ., 
                  data = dat.model_train.sub.rf,
                  distribution = 'multinomial',
                  n.trees = 500,
                  interaction.depth = 4,
                  cv.folds = 5,
                  shrinkage = 0.005)

To use the GBM technique, the tunning parameters selected are basis many iterations done. A comprehensive approach can be to prepare a search grid of various values corresponding to the different parameters and then choose the best model basis the lowest error achieved.

In case of development data, the truth table of the model can be seen as:

## Development truth table
train.pred.gbm <- predict(gbm.model, n.trees = 500, newdata = dat.model_train.sub.rf, type = 'response')
p.train.pred.gbm <- apply(train.pred.gbm, 1, which.max)
table(dat.model_train.sub.rf$Category, p.train.pred.gbm, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth  1  2  3  4  5  6  7  8  9
##     1 95 24  9  6  1  0  1  0  0
##     2 47 85 35 21  4  1  1  2  0
##     3 31 29 86 36  6  2  3  3  0
##     4 14 25 40 73  9  1  1  3  0
##     5  4 11 26 19 38  2  1  3  1
##     6  1 11 17 10  7 16  2  1  1
##     7  1  3  5  7  7  4 18  2  0
##     8  0  1  2  2  3  2  1 18  1
##     9  0  0  0  0  1  0  1  2 12
dev.gbm.error <- sum((dat.model_train.sub.rf$Category == p.train.pred.gbm)/nrow(dat.model_train.sub.rf))  ## 46.55%

Further, for the validation data, the truth table is:

## Validation truth table
test.pred.gbm <- predict(gbm.model, n.trees = 500, newdata = dat.model_test, type = 'response')
p.test.pred.gbm <- apply(test.pred.gbm, 1, which.max)
table(dat.model_test$Category, p.test.pred.gbm, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth  1  2  3  4  5  6  7  8  9
##     1 18 11  1  2  0  0  0  0  0
##     2 20 11  9  9  1  0  1  0  0
##     3 10  9 12 14  3  1  0  0  0
##     4  3 10 13 10  3  1  0  0  0
##     5  2  4  8  7  2  1  3  0  0
##     6  0  3  4  5  2  0  0  0  0
##     7  0  2  1  3  1  1  1  3  2
##     8  0  2  1  0  1  1  2  0  1
##     9  0  0  0  0  0  0  0  2  1
val.gbm.error <- sum((dat.model_test$Category == p.test.pred.gbm)/nrow(dat.model_test))  ## 22.68%

For this GBM model, the classification accuracy rate in the development data is 46.03 % and in the validation data is 23.11 %. The results are comparable to the decision tree model formed above.

Modeling Approach 2

In this approach, the variable total is first predicted using a variety of machine learning models. The output from the best model is then used to finally predict the Category variable.

Modeling total variable

First, the training model is split into development data and validation data. Development data comprises of 80% of the training data and the rest of the training data forms the validation data.

#### Model approach 2: First, predict the total variable and ten, using the predicted total variable predict the category ####

#### Step1: Predicting the total variable ####
set.seed(525)
subset2 <- createDataPartition(train.dat$total, times = 1, p = 0.80, list = FALSE)

## splitting the data into 80:20
dat.model_train2 <- train.dat[subset2, ]
dat.model_test2 <- train.dat[-subset2, ]

dat.model_train.sub2 <- dat.model_train2 %>% 
                          select(-c(id, name, display_name, board_rating_reason, Category, production_year))

Technique 1: Decision Tree Model: Pruned Tree

The rpart package is used to build the decision tree model. Initially the decision tree builds a high depth tree that works well in the development data but the performance is worse in the validation data. This is due to over-fitting of the model on the development data. Thus, the more complex a model is, the higher the variance in the model and this results in a higher total error. The total error is defined as the sum of the variance term and the square of the bias term. To have a balance between the bias and the variance of the model, this high depth tree is pruned such that the resulting model has the lowest total error.

Therefore, first a high depth tree is formed. Then, a plot of error and tree complexity is made. This helps to determine the complexity of the tree model that is required to have a low error in the model. The complexity plot for the tree model developed is shown below.

#### Technique 1: Pruned Decision Tree ####

set.seed(12348)
train.rpart2 <- rpart(formula = total ~ . , data = dat.model_train.sub2, cp = 0.0001)
plotcp(train.rpart2)

printcp(train.rpart2)
## 
## Regression tree:
## rpart(formula = total ~ ., data = dat.model_train.sub2, cp = 1e-04)
## 
## Variables actually used in tree construction:
## [1] creative_type                      genre                             
## [3] language                           movie_board_rating_display_name   
## [5] movie_release_pattern_display_name movie_sequel                      
## [7] production_method                  sentiment_score                   
## [9] source                            
## 
## Root node error: 33329256/959 = 34754
## 
## n= 959 
## 
##            CP nsplit rel error  xerror    xstd
## 1  0.19486698      0   1.00000 1.00183 0.23384
## 2  0.08814736      1   0.80513 0.84202 0.22972
## 3  0.05099837      2   0.71699 0.78377 0.22671
## 4  0.04986019      3   0.66599 0.78202 0.19823
## 5  0.02870162      4   0.61613 0.75571 0.19685
## 6  0.02720284      5   0.58743 0.74279 0.19602
## 7  0.02527418      6   0.56022 0.73234 0.19591
## 8  0.01574280      7   0.53495 0.68465 0.19506
## 9  0.00840680      8   0.51921 0.66815 0.19467
## 10 0.00619198      9   0.51080 0.66614 0.19506
## 11 0.00597338     10   0.50461 0.66453 0.19508
## 12 0.00446735     11   0.49863 0.64872 0.18268
## 13 0.00410730     12   0.49417 0.64838 0.18250
## 14 0.00343149     13   0.49006 0.63199 0.16815
## 15 0.00297039     14   0.48663 0.63158 0.16824
## 16 0.00276443     15   0.48366 0.63133 0.16819
## 17 0.00240661     16   0.48089 0.63135 0.16819
## 18 0.00174394     18   0.47608 0.64021 0.16825
## 19 0.00158784     20   0.47259 0.64295 0.16826
## 20 0.00145852     21   0.47100 0.64296 0.16826
## 21 0.00137425     22   0.46955 0.64511 0.16834
## 22 0.00120285     23   0.46817 0.64306 0.16829
## 23 0.00118678     24   0.46697 0.64298 0.16829
## 24 0.00116805     26   0.46459 0.64269 0.16829
## 25 0.00085313     27   0.46343 0.63992 0.16820
## 26 0.00058453     28   0.46257 0.63545 0.16808
## 27 0.00056236     29   0.46199 0.63656 0.16808
## 28 0.00053878     30   0.46143 0.63694 0.16808
## 29 0.00043853     31   0.46089 0.63614 0.16808
## 30 0.00039306     32   0.46045 0.63735 0.16809
## 31 0.00037953     33   0.46006 0.63838 0.16810
## 32 0.00031301     35   0.45930 0.63876 0.16809
## 33 0.00030698     36   0.45898 0.63954 0.16810
## 34 0.00029554     37   0.45868 0.63959 0.16810
## 35 0.00028939     38   0.45838 0.63977 0.16810
## 36 0.00028836     41   0.45751 0.63977 0.16810
## 37 0.00021025     46   0.45607 0.64048 0.16809
## 38 0.00018735     47   0.45586 0.64106 0.16809
## 39 0.00018539     48   0.45567 0.64085 0.16809
## 40 0.00016826     49   0.45549 0.64106 0.16809
## 41 0.00012019     50   0.45532 0.64112 0.16809
## 42 0.00010000     52   0.45508 0.63982 0.16803

Basis the plot and the complexity table output, a tree with complexity parameter 0.00276443 can be used. This tree has 16 terminal nodes. A plot of the tree is shown below:

## Pruned decision tree
train.pruned_tree2 <- prune(train.rpart2, cp = 0.00276443)
prp(train.pruned_tree2, varlen = 12)

## Development data
train.pred.pruned.tree2 <- predict(train.pruned_tree2, dat.model_train.sub2)
dev.tree.error.rmse <- rmse(train.pred.pruned.tree2, dat.model_train.sub2$total)   ## 129.6499

## Validation data
test.pred.pruned.tree2 <- predict(train.pruned_tree2, dat.model_test2)
val.tree.error.rmse <- rmse(test.pred.pruned.tree2, dat.model_test2$total)   ## 131.6686

For this pruned decision tree model, the rmse in the development data is 129.65 and in the validation data is 131.67.

Technique 2: Random Forest

The random forest technique is an improved version of the original decision tree model. In this technique, multiple high depth decision tree models are formed and then the output is the average of all the models. By doing this, the precision of the predcited value is maintained and at the same time the variance is reduced as the average is taken across all the models. In random forest, for each of the tree model non-parametric bootstrap samples are taken from the original data and the decision tree model made. Further, at every binary tree split only a subset of the variables is chosen. This ensures that each of the trees formed are independent from each other. This technique is also know as an ensemble machine learning technique where the output of various models are combined to predict the response variable.

#### Technique 2: Random Forest ####
## removing variable with missing value, as RF cannot handle variables with missing values
dat.model_train.sub2.rf <- dat.model_train.sub2 %>% select(-c(sentiment_score))

## For now, tunning paramter selected basis best approach given in literature
set.seed(12349)
rf.model2 <- randomForest(total ~ ., data = dat.model_train.sub2.rf, mtry = 3, importance = TRUE, ntree = 100)

## Development data
train.pred.rf2 <- predict(rf.model2, dat.model_train.sub2.rf)
dev.rf.error.rmse <- rmse(train.pred.rf2, dat.model_train.sub2.rf$total)   ## 100.1452

## Validation data
test.pred.rf2 <- predict(rf.model2, dat.model_test2)
val.rf.error.rmse <- rmse(test.pred.rf2, dat.model_test2$total)   ## 124.3666

To use the random forest technique, the tunning parameters selected are basis the best approach given in literature. Better options of tunning parameters can be chosen using cross-validation technique.

For this random forest model, the rmse in the development data is 100.15 and in the validation data is 124.37. The rmse obtained is less as compared to the decision tree model above.

Technique 3: Gradient Boosting Method

The Gradient Boosting Method (GBM) is another modification of the original decision tree model. In this particular technique, initially a low precision tree (less depth) model is formed. Then, the resulting error is modeled again using a different tree model. Its output is added to the original tree model. Again, the error is calculated and the whole process is repeated. Thus, in GBM for each iteration the error term is modeled and finally, all the output are combined to predict the response variable. Here, the parameters can limit the learning rate in each step, the maximum depth of tree in each step, the minimum nunber of observation in each node and the number of iterations to be done. This technique also belongs to the ensemble class of machine learning models and helps to maintian the precision of the response variable and at the same time the variance of the model is not compromised.

#### Technique 3: Gradient Boosting Method ####
set.seed(12350)
gbm.model2 <- gbm(total ~ ., 
                 data = dat.model_train.sub2.rf,
                 distribution = 'gaussian',
                 n.trees = 500,
                 interaction.depth = 3,
                 cv.folds = 5,
                 shrinkage = 0.1)

## Development data
train.pred.gbm2 <- predict(gbm.model2, n.trees = 500, newdata = dat.model_train.sub2.rf)
dev.gbm.error.rmse <- rmse(train.pred.gbm2, dat.model_train.sub2.rf$total)   ## 103.1652

## Validation data
test.pred.gbm2 <- predict(gbm.model2, n.trees = 500, newdata = dat.model_test2)
val.gbm.error.rmse <- rmse(test.pred.gbm2, dat.model_test2$total)   ## 156.1931

To use the GBM technique, the tunning parameters selected are basis many iterations done. A comprehensive approach can be to prepare a search grid of various values corresponding to the different parameters and then choose the best model basis the lowest error achieved.

For this GBM model, the rmse in the development data is 103.17 and in the validation data is 156.19. The rmse obtained is less as compared to the decision tree model above and comparable to the random forest model. However, the validation rmse is lower for the random forest model.

Final model selected to predict total

Given the three above models, the random forest model performs the best amongst all the three models. Thus, this model output will be used to predict the Category variable in the next step.

Modeling Category variable

The output of the predicted total variable from the random forest model built in the last step is used to estimate the Category variable.

First, a stratified sampling is used to separate the training data into development data and validation data. Stratified sampling ensures that the proportion of each category of the response variable is maintained in both the data sets. 80% of the training data is used as development data and the rest 20% as the validation data.

#### Step2: Using the total predicted variable, calcualte the category variable ####

pred.total.train.dat <- predict(rf.model2, train.dat)
train.dat.pred.total <- bind_cols(train.dat, "predicted_total" = pred.total.train.dat)

set.seed(526)
subset3 <- createDataPartition(train.dat.pred.total$Category, times = 1, p = 0.80, list = FALSE)

## splitting the data into 80:20
dat.model_train3 <- train.dat.pred.total[subset3, ]
dat.model_test3 <- train.dat.pred.total[-subset3, ]

Now, the random forest technique is used to predict the Category variable.

#### Technique: Random Forest ####

var_list <- c("movie_sequel", "creative_type", "source", "production_method", "genre","language", "movie_board_rating_display_name",   
              "movie_release_pattern_display_name")

## For now, tunning paramter selected basis best approach given in literature
set.seed(12351)
rf.model3 <- randomForest(as.factor(Category) ~ predicted_total, 
                          data = dat.model_train3, mtry = 1, importance = TRUE, ntree = 500)

Again, to use the random forest technique, the tunning parameters selected are basis the best approach given in literature. Better options of tunning parameters can be chosen using cross-validation technique.

In case of development data, the truth table of the model can be seen as:

## Development truth table
train.pred.rf3 <- predict(rf.model3, dat.model_train3, type = "class")
table(dat.model_train3$Category, train.pred.rf3, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth   1   2   3   4   5   6   7   8   9
##     1 105  18  12   5   1   1   1   0   0
##     2  22 119  22  17   5   3   1   0   0
##     3   8  22 125  30   5   2   3   1   0
##     4   5  11  28 115   3   0   1   0   1
##     5   1   6  22  15  54   2   2   3   2
##     6   1   3  14  10   6  23   0   2   1
##     7   2   2   4   3   2   0  36   2   0
##     8   0   0   1   2   1   1   4  26   0
##     9   0   0   0   0   0   0   1   1  11
dev.rf.error2 <- sum((dat.model_train3$Category == train.pred.rf3)/nrow(dat.model_train3))  ## 64.09%

Further, for the validation data, the truth table is:

## Validation truth table
test.pred.rf3 <- predict(rf.model3, dat.model_test3, type = "class")
table(dat.model_test3$Category, test.pred.rf3, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth  1  2  3  4  5  6  7  8  9
##     1  9 12  3  1  0  0  0  0  0
##     2 12 16 17 11  1  0  1  0  0
##     3 10  9 13 11  5  0  1  0  0
##     4  2  4 15 15  4  1  1  0  0
##     5  0  2  6  6  6  2  2  1  0
##     6  0  3  2  5  4  1  4  1  0
##     7  0  1  0  2  1  2  2  2  0
##     8  0  0  1  0  0  0  0  1  1
##     9  0  0  0  0  0  0  1  2  3
val.rf.error2 <- sum((dat.model_test3$Category == test.pred.rf3)/nrow(dat.model_test3))  ## 27.73%

For this random forest model, the classification accuracy rate in the development data is 64.09 % and in the validation data is 27.73 %.

Comparison of all models formed

Modeling Approach 1 -

The following table summarizes the performance of all the models formed in this approach. The error rate repoted here is the miss-classification rate.

Modeling Technique Development classification accuracy rate Validation classification accuracy rate
Pruned Decision Tree 36.22 % 24.79 %
Random Forest 63.36 % 25.21 %
Gradient Boosting Method 46.03 % 23.11 %

Modeling Approach 2 -

The following table summarizes the performance of all the models formed in this approach.

First, the comparison is made for the prediction of the variable total. In this the error repoted is rmse.

Modeling Technique Development RMSE Validation RMSE
Pruned Decision Tree 129.65 131.67
Random Forest 100.15 124.37
Gradient Boosting Method 103.17 156.19

Basis this table, it appears that the random forest model is the best model amongst the three models built. The output of this model is then used to predict the variable Category. Following table provides the miss-classification rate for the model obtained.

Modeling Technique Development classification accuracy rate Validation classification accuracy rate
Random Forest 64.09 % 27.73 %

From both the modeling approaches, it appears that the model obtained in approach 2 is the best model amongst all the models built. Thus, this model is finalized and used for prediction in the scoring data.