The Hollywood movie production business relies on a low-tech, instinct- and contact-driven decision-making process to assemble the portfolio of movies a production house funds in any given year. Movie stars and their agents use the same kind of process to decide which projects to pursue and which to pass on. This leads to a high degree of variation in the success rate of projects (as measured by gross box office receipts and return on investment). Most production houses therefore take a portfolio approach and diversify their risk across a number of low, medium and high budget movies. Various attributes, such as genre, source, use of computer graphics and the star power of the cast, are readily available for a movie prior to its official release, and these attributes have been used in machine learning algorithms to predict the success of a movie before its release.
The following packages will be used for the analysis:
## function to install and load multiple packages at once in R
install_load <- function(pack){
## Statement to check if the package has been previously installed
new_pack_load <- pack[!(pack %in% installed.packages()[,"Package"])]
if (length(new_pack_load))
install.packages(new_pack_load, dependencies = TRUE)
sapply(pack, require, character.only = TRUE)
}
package_load <- c("readxl", "dplyr", "rpart", "randomForest", "tidytext", "wordcloud", "caret", "ModelMetrics", "gbm", "ggplot2", "scales", "rpart.plot")
install_load(package_load)
This section deals with reading the training data provided and then building a predictive model. The tabs below present the step-wise approach followed.
Before proceeding with the analysis, it is necessary to specify the directory where the files have been saved. After setting the file path, the training Excel file is read using the read_excel function from the readxl package. This training sheet contains the data for various movies and their success in the market.
#### Setting up the working directory. Please change basis the location of your directory. ####
setwd("C:/Users/rohit/Documents/RPubs_uploads/proj_1/")
#### Reading the files provided
train.dat <- read_excel("Training sheet.xlsx")
The training data contains 1196 observations and 15 features. A glimpse of the data provided is shown below:
glimpse(train.dat)
## Observations: 1,196
## Variables: 15
## $ id <dbl> 7950115, 50950115, 11987011...
## $ name <chr> "Avatar", "Harry Potter and...
## $ display_name <chr> "Avatar", "Harry Potter and...
## $ production_year <dbl> 2009, 2011, 2011, 2010, 201...
## $ movie_sequel <dbl> 0, 1, 1, 1, 1, 0, 1, 1, 1, ...
## $ creative_type <chr> "Science Fiction", "Fantasy...
## $ source <chr> "Original Screenplay", "Bas...
## $ production_method <chr> "Animation/Live Action", "A...
## $ genre <chr> "Action", "Adventure", "Act...
## $ language <chr> "English", "English", "Engl...
## $ board_rating_reason <chr> "For intense epic battle se...
## $ movie_board_rating_display_name <chr> "PG-13", "PG-13", "PG-13", ...
## $ movie_release_pattern_display_name <chr> "Wide", "Wide", "Wide", "Wi...
## $ total <dbl> 2784, 1328, 1124, 1064, 104...
## $ Category <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, ...
After reading the data, the training data is explored to understand the variables it contains.
The plot below shows the distribution of movies across the different genres in the data. The Drama and Comedy genres have the highest number of movies.
#### Plot to determine genre distribution of the movie database
genre.dat <- train.dat %>% group_by(genre) %>% dplyr::summarize(count = n(), percn = round(n()*100/nrow(train.dat), 2))
ggplot(genre.dat, aes(x = factor(genre), y = percn)) +
geom_point(size = 3) +
geom_segment(aes(x = genre,
xend = genre,
y = 0,
yend = percn)) +
labs(title = "Distribution of movies across genre") +
xlab("Genre of the movie") +
ylab("Percentage of data") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar plot below shows the distribution of movies across movie_release_pattern_display_name. Wide release is the most common release pattern.
#### Plot to see the distribution of movie release pattern
ggplot(data = train.dat, aes(x = movie_release_pattern_display_name)) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dark cyan", color = "black") +
scale_y_continuous(labels = percent) +
ylab("Percentage of data points") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Bar plot for movie release pattern")
The table below shows the frequency distribution of categories in the variable language. Over 95% of the movies in the data are in English.
#### Frequency distribution table
print("Frequency distribution of categories in the variable language")
## [1] "Frequency distribution of categories in the variable language"
train.dat %>%
group_by(language) %>%
dplyr::summarize(count = n(), percentage = round(n()*100/nrow(train.dat), 2)) %>%
arrange(desc(percentage))
## # A tibble: 16 x 3
## language count percentage
## <chr> <int> <dbl>
## 1 English 1144 95.65
## 2 Hindi 16 1.34
## 3 French 14 1.17
## 4 Spanish 6 0.50
## 5 German 3 0.25
## 6 Japanese 2 0.17
## 7 Swedish 2 0.17
## 8 Arabic 1 0.08
## 9 Danish 1 0.08
## 10 Farsi 1 0.08
## 11 Hebrew 1 0.08
## 12 Italian 1 0.08
## 13 Norwegian 1 0.08
## 14 Polish 1 0.08
## 15 Portuguese 1 0.08
## 16 Silent 1 0.08
The table below shows the frequency distribution of categories in the variable source. More than 50% of the movies in the data have an Original Screenplay.
print("Frequency distribution of categories in the variable source")
## [1] "Frequency distribution of categories in the variable source"
train.dat %>%
group_by(source) %>%
dplyr::summarize(count = n(), percentage = round(n()*100/nrow(train.dat), 2)) %>%
arrange(desc(percentage))
## # A tibble: 16 x 3
## source count percentage
## <chr> <int> <dbl>
## 1 Original Screenplay 629 52.59
## 2 Based on Fiction Book/Short Story 218 18.23
## 3 Based on Real Life Events 128 10.70
## 4 Remake 65 5.43
## 5 Based on TV 38 3.18
## 6 Based on Comic/Graphic Novel 36 3.01
## 7 Based on Factual Book/Article 23 1.92
## 8 Based on Play 21 1.76
## 9 Based on Folk Tale/Legend/Fairytale 10 0.84
## 10 Based on Game 8 0.67
## 11 Spin-Off 5 0.42
## 12 Based on Musical or Opera 4 0.33
## 13 Based on Short Film 4 0.33
## 14 Compilation 3 0.25
## 15 Based on Theme Park Ride 2 0.17
## 16 Based on Toy 2 0.17
The table below shows the frequency distribution of categories in the variable creative_type. More than 50% of the movies in the data are Contemporary Fiction.
print("Frequency distribution of categories in the variable creative_type")
## [1] "Frequency distribution of categories in the variable creative_type"
train.dat %>%
group_by(creative_type) %>%
dplyr::summarize(count = n(), percentage = round(n()*100/nrow(train.dat), 2)) %>%
arrange(desc(percentage))
## # A tibble: 9 x 3
## creative_type count percentage
## <chr> <int> <dbl>
## 1 Contemporary Fiction 638 53.34
## 2 Fantasy 131 10.95
## 3 Historical Fiction 100 8.36
## 4 Dramatization 95 7.94
## 5 Science Fiction 89 7.44
## 6 Factual 60 5.02
## 7 Kids Fiction 59 4.93
## 8 Super Hero 20 1.67
## 9 Multiple Creative Types 4 0.33
The table below shows the frequency distribution of categories in the variable production_method. More than 90% of the movies in the data are produced as Live Action.
print("Frequency distribution of categories in the variable production_method")
## [1] "Frequency distribution of categories in the variable production_method"
train.dat %>%
group_by(production_method) %>%
dplyr::summarize(count = n(), percentage = round(n()*100/nrow(train.dat), 2)) %>%
arrange(desc(percentage))
## # A tibble: 6 x 3
## production_method count percentage
## <chr> <int> <dbl>
## 1 Live Action 1093 91.39
## 2 Digital Animation 53 4.43
## 3 Animation/Live Action 36 3.01
## 4 Hand Animation 6 0.50
## 5 Stop-Motion Animation 5 0.42
## 6 Multiple Production Methods 3 0.25
The variable Category takes values from 1 to 9, with 1 being the least successful movie and 9 the most successful. The plot below shows the histogram of the Category variable across the different values of production_year. Except for the year 2011, the distribution of Category is consistent across years.
## Plotting the histogram of category across different production year
ggplot(data = train.dat, aes(x = factor(Category), fill = factor(Category))) +
geom_bar(aes(y = ..count../sum(..count..))) +
facet_wrap(~ production_year) +
scale_y_continuous(labels = percent) +
ggtitle("Histogram distribution of Category across the production years") +
ylab("Percentage of data points") +
xlab("Category") +
theme_bw() +
theme(legend.position = "none")
The plot below shows the histogram of the Category variable across the different board ratings. Not Rated shows a different pattern from the other ratings. Further, for NC-17 there are too few observations to determine a trend in Category.
## Plotting the histogram of category across different board ratings
ggplot(data = train.dat, aes(x = factor(Category), fill = factor(Category))) +
geom_bar(aes(y = ..count../sum(..count..))) +
facet_wrap(~ movie_board_rating_display_name) +
scale_y_continuous(labels = percent) +
ggtitle("Histogram distribution of Category across different board ratings") +
ylab("Percentage of data points") +
xlab("Category") +
theme_bw() +
theme(legend.position = "none")
The plot below shows the histogram of the Category variable across movie_sequel. Although there are fewer observations for sequels, the success rate of sequels is higher than that of non-sequel movies.
train.dat.movie.seq <- train.dat %>%
select(Category, movie_sequel) %>%
mutate(movie_sequel_grp = ifelse(movie_sequel == 1, "Yes", "No"))
## Plotting the histogram of category on whether a movie sequel was made or not
ggplot(data = train.dat.movie.seq, aes(x = factor(Category), fill = factor(Category))) +
geom_bar(aes(y = ..count../sum(..count..))) +
facet_wrap(~ movie_sequel_grp) +
scale_y_continuous(labels = percent) +
ggtitle("Histogram distribution of Category on movie sequel criteria") +
ylab("Percentage of data points") +
xlab("Category") +
theme_bw() +
theme(legend.position = "none")
After observing patterns in the data, the following steps have been performed to prepare the final data for modeling:
Lowest frequency categories in each of the character variables are combined into an "other" category. By doing this, if any new category appears in an independent variable in new scoring data, the model will still be able to provide a prediction for that observation. Further, whenever the model is used to score new data, a population stability index (PSI) value should be calculated for each of the independent variables. The PSI compares a variable's distribution between the model-development data and the new data, and yields a value that quantifies the degree of change in the variable's characteristics. If the PSI value of a variable is high, the predictive model should be revisited to incorporate the change; a sketch of the PSI calculation is given after this list.
The character variables are converted to factors. The levels corresponding to these variables can then be transferred directly to the corresponding variables in the scoring data. This ensures consistency in the model outputs and easy replicability of the model output.
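As an aside, the PSI calculation described above can be sketched as follows; the helper calc_psi and the scoring data set score.dat are illustrative assumptions, not objects created elsewhere in this analysis.
## Sketch of a PSI calculation between development data and new scoring data
calc_psi <- function(dev_var, new_var) {
  dev_var <- as.character(dev_var)
  new_var <- as.character(new_var)
  ## category shares in each data set, over the union of observed categories
  lvls <- union(unique(dev_var), unique(new_var))
  dev_pct <- as.numeric(table(factor(dev_var, levels = lvls))) / length(dev_var)
  new_pct <- as.numeric(table(factor(new_var, levels = lvls))) / length(new_var)
  ## guard against empty categories before taking logs
  eps <- 1e-6
  dev_pct <- pmax(dev_pct, eps)
  new_pct <- pmax(new_pct, eps)
  ## PSI = sum over categories of (new% - dev%) * ln(new% / dev%)
  sum((new_pct - dev_pct) * log(new_pct / dev_pct))
}
## example usage: calc_psi(train.dat$genre, score.dat$genre)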
The code below performs the above steps; the resulting data set is then ready for modeling.
#### Converting the character variables to factor ####
var_list <- c("movie_sequel", "creative_type", "source", "production_method", "genre","language", "movie_board_rating_display_name", "movie_release_pattern_display_name")
#### Combining low frequency variable into one category
#### so that in future if any new category comes, that particular observation can be modeled
train.dat$movie_sequel <- ifelse(!(train.dat$movie_sequel %in% c(0)), "other", train.dat$movie_sequel)
train.dat$creative_type <- ifelse(!(train.dat$creative_type %in% c("Contemporary Fiction",
"Fantasy",
"Historical Fiction",
"Dramatization",
"Science Fiction",
"Factual",
"Kids Fiction",
"Super Hero")), "other", train.dat$creative_type)
train.dat$source <- ifelse(!(train.dat$source %in% c("Original Screenplay",
"Based on Fiction Book/Short Story",
"Based on Real Life Events",
"Remake",
"Based on TV",
"Based on Comic/Graphic Novel",
"Based on Factual Book/Article",
"Based on Play",
"Based on Folk Tale/Legend/Fairytale",
"Based on Game",
"Spin-Off",
"Based on Musical or Opera",
"Based on Short Film",
"Compilation")), "other", train.dat$source)
train.dat$production_method <- ifelse(!(train.dat$production_method %in% c("Live Action",
"Digital Animation",
"Animation/Live Action",
"Hand Animation",
"Stop-Motion Animation")), "other", train.dat$production_method)
train.dat$genre <- ifelse(!(train.dat$genre %in% c("Drama",
"Comedy",
"Thriller/Suspense",
"Action",
"Adventure",
"Romantic Comedy",
"Horror",
"Documentary",
"Black Comedy",
"Musical",
"Multiple Genres")), "other", train.dat$genre)
train.dat$language <- ifelse(!(train.dat$language %in% c("English",
"Hindi",
"French",
"Spanish",
"German",
"Japanese",
"Swedish")), "other", train.dat$language)
train.dat$movie_board_rating_display_name <- ifelse(!(train.dat$movie_board_rating_display_name %in% c("R",
"PG-13",
"PG",
"Not Rated",
"G")), "other", train.dat$movie_board_rating_display_name)
train.dat$movie_release_pattern_display_name <- ifelse(!(train.dat$movie_release_pattern_display_name %in% c("Wide",
"Limited",
"Exclusive",
"Expands Wide",
"IMAX",
"Oscar Qualifying Run")), "other", train.dat$movie_release_pattern_display_name)
label <- list()
for (i in seq_along(var_list)) {
  ## store the observed categories so the same levels can be applied to scoring data
  label[[i]] <- unique(as.data.frame(train.dat)[, var_list[i]])
  ## use levels = so that each value keeps its own name after conversion
  train.dat[[var_list[i]]] <- factor(as.data.frame(train.dat)[, var_list[i]], levels = label[[i]])
}
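As a minimal sketch of the level transfer described above, the stored training levels can be applied to new data; here a one-row toy data frame stands in for a real scoring data set.
## Sketch: re-using the training levels keeps factor codings consistent on new data
score.toy <- data.frame(genre = "Drama", language = "English", stringsAsFactors = FALSE)
for (v in c("genre", "language")) {
  score.toy[[v]] <- factor(score.toy[[v]], levels = levels(train.dat[[v]]))
}
str(score.toy) ## both columns now carry the full training level sets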
Further, feature extraction has been done on the variable board_rating_reason. Using the AFINN sentiment lexicon via the tidytext package, a sentiment score is calculated for the comment provided in board_rating_reason. Although movie_board_rating_display_name derives its category from the comments in board_rating_reason, it is worth testing whether the sentiment associated with the comment can provide any extra input to the final prediction.
The figure below illustrates the words most commonly used in board_rating_reason.
##### Extracting information from the variable "board_rating_reason" ####
##### calculating the sentiment score associated with each of the observation ####
## Nesting each and every word in the text to form a tidy data ##
train.dat.board_rating <- train.dat %>%
select(id, board_rating_reason) %>%
unnest_tokens(word, board_rating_reason) %>%
anti_join(stop_words, by = "word")
train.dat.board_rating_count <- train.dat.board_rating %>% count(word)
## Forming word cloud of comments in the variable ##
word_cloud <- train.dat.board_rating_count %>%
arrange(desc(train.dat.board_rating_count$n))
word_cloud %>%
with(wordcloud(word, n, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2")))
## Associating sentiment score with each word ##
id_senti_score <- train.dat.board_rating %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(id) %>%
summarize(sentiment_score = sum(score))
## Merging with the original data ##
train.dat <- train.dat %>%
left_join(id_senti_score, by = "id")
A glimpse of the final data is shown below:
glimpse(train.dat)
## Observations: 1,196
## Variables: 16
## $ id <dbl> 7950115, 50950115, 11987011...
## $ name <chr> "Avatar", "Harry Potter and...
## $ display_name <chr> "Avatar", "Harry Potter and...
## $ production_year <dbl> 2009, 2011, 2011, 2010, 201...
## $ movie_sequel <fctr> 0, other, other, other, ot...
## $ creative_type                      <fctr> Science Fiction, Fantasy, ...
## $ source                             <fctr> Original Screenplay, Based...
## $ production_method                  <fctr> Animation/Live Action, Ani...
## $ genre                              <fctr> Action, Adventure, Action,...
## $ language                           <fctr> English, English, English,...
## $ board_rating_reason                <chr> "For intense epic battle se...
## $ movie_board_rating_display_name    <fctr> PG-13, PG-13, PG-13, ...
## $ movie_release_pattern_display_name <fctr> Wide, Wide, Wide, Wide, ...
## $ total <dbl> 2784, 1328, 1124, 1064, 104...
## $ Category <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ sentiment_score <int> -2, -5, -5, NA, -3, -5, -4,...
To build the predictive model of movie success, a variety of tree based models have been used. Tree based models are easy to interpret, offer good accuracy, can handle a variety of predictor variables and can accommodate different types of response variables.
The following modeling approaches have been used:
Approach 1: The variable Category is used as the dependent variable and predicted directly. Several machine learning techniques are tried: a pruned decision tree, random forest and the gradient boosting method.
Approach 2: The variable total is first predicted using a variety of models. The output from the best model is then used to predict the Category variable. This is done because the Category and total variables were observed to be highly correlated, so it is interesting to see how a model performs under such an approach.
In all cases, the model is developed on 80% of the data and tested on the remaining 20%. Prediction accuracy on the training data and the test data is compared, and the final model is selected based on the best performance across both. A comparison of all the models built is shown in the tab Comparison of all models formed.
In this approach, the variable Category is modeled directly. Various machine learning models are built and their performance compared.
First, the training data is split into development data and validation data. The split is done so that the proportion of each category of the dependent variable is maintained in both data sets; that is, a stratified sampling approach is used. This ensures that the distribution of the variable is consistent across the development data and the validation data.
The development data comprises 80% of the training data and the remainder forms the validation data.
#### Modeling the data ####
## Separating into training and test data - 80% training & 20% test ##
set.seed(524)
subset <- createDataPartition(train.dat$Category, times = 1, p = 0.80, list = FALSE)
## splitting the data into 80:20
dat.model_train <- train.dat[subset, ]
dat.model_test <- train.dat[-subset, ]
The tables below compare the proportion of each category of the dependent variable in the training data, the development data and the validation data.
## Checking %age distribution across the different levels of category
print("Proportion of each of the categories in the training data")
## [1] "Proportion of each of the categories in the training data"
table(train.dat$Category)/nrow(train.dat)
##
## 1 2 3 4 5 6
## 0.14046823 0.20652174 0.20484950 0.17224080 0.11036789 0.06688963
## 7 8 9
## 0.05100334 0.03177258 0.01588629
print("Proportion of each of the categories in the development data")
## [1] "Proportion of each of the categories in the development data"
table(dat.model_train$Category)/nrow(dat.model_train)
##
## 1 2 3 4 5 6
## 0.14196242 0.20459290 0.20459290 0.17327766 0.10960334 0.06889353
## 7 8 9
## 0.04906054 0.03131524 0.01670146
print("Proportion of each of the categories in the validation data")
## [1] "Proportion of each of the categories in the validation data"
table(dat.model_test$Category)/nrow(dat.model_test)
##
## 1 2 3 4 5 6
## 0.13445378 0.21428571 0.20588235 0.16806723 0.11344538 0.05882353
## 7 8 9
## 0.05882353 0.03361345 0.01260504
It is observed that the proportions remain consistent across the three data sets.
Technique 1: Decision Tree Model: Pruned Tree
The rpart package is used to build the decision tree model. Initially, a high depth tree is grown; it fits the development data well but performs worse on the validation data because of over-fitting. The more complex a model is, the higher its variance, and this results in a higher total error, defined as the sum of the variance term and the square of the bias term. To balance the bias and the variance of the model, the high depth tree is pruned so that the resulting model has the lowest total error.
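For reference, under squared-error loss this is the standard bias-variance decomposition of the expected prediction error:

$$E\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$

where $\sigma^2$ is the irreducible error in the response.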
Therefore, first a high depth tree is formed. Then, a plot of error and tree complexity is made. This helps to determine the complexity of the tree model that is required to have a low error in the model. The complexity plot for the tree model developed is shown below.
#### Model approach 1: predict the category variable ####
#### Trying different techniques ####
dat.model_train.sub <- dat.model_train %>%
select(-c(id, name, display_name, board_rating_reason, total, production_year))
#### Technique 1: Pruned Decision Tree ####
## Building a high depth tree ##
set.seed(12345)
train.rpart <- rpart(formula = Category ~ . , data = dat.model_train.sub, method = "class", cp = 0.00001)
plotcp(train.rpart)
printcp(train.rpart)
##
## Classification tree:
## rpart(formula = Category ~ ., data = dat.model_train.sub, method = "class",
## cp = 1e-05)
##
## Variables actually used in tree construction:
## [1] creative_type genre
## [3] language movie_board_rating_display_name
## [5] movie_release_pattern_display_name movie_sequel
## [7] production_method sentiment_score
## [9] source
##
## Root node error: 762/958 = 0.79541
##
## n= 958
##
## CP nsplit rel error xerror xstd
## 1 0.07611549 0 1.00000 1.04593 0.015188
## 2 0.01531059 1 0.92388 0.96063 0.017245
## 3 0.00918635 4 0.87795 0.92782 0.017861
## 4 0.00787402 5 0.86877 0.92388 0.017929
## 5 0.00557743 8 0.84514 0.92782 0.017861
## 6 0.00524934 12 0.82283 0.93176 0.017792
## 7 0.00393701 16 0.80184 0.93307 0.017768
## 8 0.00306212 21 0.78215 0.93701 0.017697
## 9 0.00262467 36 0.73097 0.93963 0.017649
## 10 0.00218723 40 0.72047 0.94357 0.017576
## 11 0.00196850 43 0.71391 0.94488 0.017552
## 12 0.00131234 45 0.70997 0.95276 0.017401
## 13 0.00065617 56 0.69554 0.94357 0.017576
## 14 0.00001000 58 0.69423 0.93701 0.017697
Based on the plot and the complexity table output, a tree with complexity parameter 0.00393701 can be used. This tree has 17 terminal nodes and is able to predict each of the categories of the dependent variable. A plot of the tree is shown below:
## Pruned decision tree
train.pruned_tree <- prune(train.rpart, cp = 0.00393701)
prp(train.pruned_tree, varlen = 12)
For the development data, the truth table (confusion matrix) of the model is:
## Development truth table
train.pred.pruned.tree <- predict(train.pruned_tree, dat.model_train.sub, type = "class")
table(dat.model_train.sub$Category, train.pred.pruned.tree, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 1 2 3 4 5 6 7 8 9
## 1 47 72 7 6 3 1 0 0 0
## 2 22 99 47 21 4 1 0 2 0
## 3 11 47 73 52 3 1 3 5 1
## 4 16 13 44 73 13 1 2 2 2
## 5 3 8 18 45 21 3 2 5 0
## 6 2 3 16 29 7 4 1 4 0
## 7 0 2 3 15 7 4 8 8 0
## 8 1 0 0 6 0 0 7 16 0
## 9 0 0 0 1 1 1 4 3 6
dev.tree.error <- mean(dat.model_train.sub$Category == train.pred.pruned.tree) ## classification accuracy: 36.22%
Further, for the validation data, the truth table is:
## Validation truth table
test.pred.pruned.tree <- predict(train.pruned_tree, dat.model_test, type = "class")
table(dat.model_test$Category, test.pred.pruned.tree, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 1 2 3 4 5 6 7 8 9
## 1 10 19 2 1 0 0 0 0 0
## 2 12 19 11 8 0 0 1 0 0
## 3 6 9 13 17 3 0 0 1 0
## 4 3 7 15 10 4 0 0 1 0
## 5 1 2 7 11 5 1 0 0 0
## 6 1 2 3 3 4 0 0 0 1
## 7 0 2 2 2 2 2 1 3 0
## 8 0 0 2 0 2 1 1 1 1
## 9 0 0 0 0 0 1 0 2 0
val.tree.error <- mean(dat.model_test$Category == test.pred.pruned.tree) ## classification accuracy: 24.79%
For this pruned decision tree model, the classification accuracy rate in the development data is 36.22 % and in the validation data is 24.79 %.
Technique 2: Random Forest
The random forest technique is an improvement over a single decision tree. Multiple high depth decision trees are grown and their predictions aggregated: majority vote for classification, averaging for regression. This preserves the precision of the predicted value while reducing variance, since the result is aggregated across all the trees. For each tree, a non-parametric bootstrap sample is drawn from the original data and a decision tree fitted to it. Further, at every binary split only a random subset of the variables is considered, which helps de-correlate the trees. This is known as an ensemble machine learning technique, where the outputs of many models are combined to predict the response variable.
#### Technique 2: Random Forest ####
## removing the variable with missing values, as this randomForest function cannot handle predictors with missing values
dat.model_train.sub.rf <- dat.model_train.sub %>% select(-c(sentiment_score))
## For now, tuning parameters are set to values commonly recommended in the literature
set.seed(12346)
rf.model <- randomForest(as.factor(Category) ~ ., data = dat.model_train.sub.rf, mtry = 3, importance = TRUE, ntree = 100)
The random forest tuning parameters used here follow common recommendations in the literature. Better values could be chosen using cross-validation, as sketched below.
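A minimal sketch of such a cross-validated search, using the caret package loaded earlier, is shown below; the candidate mtry values are arbitrary choices for illustration.
## Sketch: cross-validated tuning of mtry with caret (candidate values are illustrative)
set.seed(2020)
rf.tune <- train(as.factor(Category) ~ .,
                 data = dat.model_train.sub.rf,
                 method = "rf",
                 ntree = 100,
                 tuneGrid = expand.grid(mtry = c(2, 3, 4, 6)),
                 trControl = trainControl(method = "cv", number = 5))
rf.tune$bestTune ## cross-validated choice of mtry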
For the development data, the truth table of the model is:
## Development truth table
train.pred.rf <- predict(rf.model, dat.model_train.sub.rf, type = "class")
table(dat.model_train.sub.rf$Category, train.pred.rf, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 1 2 3 4 5 6 7 8 9
## 1 106 9 13 6 1 0 1 0 0
## 2 29 114 31 16 4 2 0 0 0
## 3 16 14 129 28 6 0 3 0 0
## 4 5 8 29 119 4 0 1 0 0
## 5 1 8 26 14 52 2 1 1 0
## 6 2 4 13 14 6 26 1 0 0
## 7 1 4 4 6 3 1 26 2 0
## 8 0 0 0 1 3 1 1 24 0
## 9 0 0 0 0 2 1 1 1 11
dev.rf.error <- mean(dat.model_train.sub.rf$Category == train.pred.rf) ## classification accuracy: 63.15%
Further, for the validation data, the truth table is:
## Validation truth table
test.pred.rf <- predict(rf.model, dat.model_test, type = "class")
table(dat.model_test$Category, test.pred.rf, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 1 2 3 4 5 6 7 8 9
## 1 19 9 2 2 0 0 0 0 0
## 2 16 9 11 12 2 0 1 0 0
## 3 11 7 11 16 4 0 0 0 0
## 4 3 8 13 10 4 1 0 1 0
## 5 1 2 6 10 6 1 1 0 0
## 6 0 2 4 4 2 1 0 0 1
## 7 0 2 2 3 3 0 2 0 2
## 8 0 0 0 2 1 2 2 0 1
## 9 0 0 0 0 0 0 0 1 2
val.rf.error <- mean(dat.model_test$Category == test.pred.rf) ## classification accuracy: 24.37%
For this random forest model, the classification accuracy rate in the development data is 63.15 % and in the validation data is 24.37 %. The fit on the development data is far better than that of the pruned decision tree, while the validation accuracy is similar.
Technique 3: Gradient Boosting Method
The Gradient Boosting Method (GBM) is another modification of the decision tree model. Initially a shallow, low precision tree is fitted. The resulting residual error is then modeled by another tree, whose output is added to the first. The error is recalculated and the whole process is repeated, so that in GBM each iteration models the current error term and the outputs are finally combined to predict the response variable. The tuning parameters control the learning rate of each step, the maximum depth of each tree, the minimum number of observations in each node and the number of iterations. This technique also belongs to the ensemble class of machine learning models: it maintains the precision of the response variable without compromising the variance of the model.
#### Technique 3: Gradient Boosting Method ####
set.seed(12347)
gbm.model <- gbm(as.factor(Category) ~ .,
data = dat.model_train.sub.rf,
distribution = 'multinomial',
n.trees = 500,
interaction.depth = 4,
cv.folds = 5,
shrinkage = 0.005)
The GBM tuning parameters used here were settled on after several manual iterations. A more comprehensive approach is to prepare a search grid over the different parameters and choose the model with the lowest cross-validated error, as sketched below.
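A minimal sketch of such a grid search with caret is shown below; the grid values are illustrative only.
## Sketch: grid search over GBM tuning parameters with caret (values are illustrative)
gbm.grid <- expand.grid(n.trees = c(300, 500, 1000),
                        interaction.depth = c(2, 4, 6),
                        shrinkage = c(0.005, 0.01, 0.05),
                        n.minobsinnode = c(10, 20))
set.seed(2021)
gbm.tune <- train(as.factor(Category) ~ .,
                  data = dat.model_train.sub.rf,
                  method = "gbm",
                  tuneGrid = gbm.grid,
                  trControl = trainControl(method = "cv", number = 5),
                  verbose = FALSE)
gbm.tune$bestTune ## parameter combination with the lowest cross-validated error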
For the development data, the truth table of the model is:
## Development truth table
train.pred.gbm <- predict(gbm.model, n.trees = 500, newdata = dat.model_train.sub.rf, type = 'response')
p.train.pred.gbm <- apply(train.pred.gbm, 1, which.max)
table(dat.model_train.sub.rf$Category, p.train.pred.gbm, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 1 2 3 4 5 6 7 8 9
## 1 95 24 9 6 1 0 1 0 0
## 2 47 85 35 21 4 1 1 2 0
## 3 31 29 86 36 6 2 3 3 0
## 4 14 25 40 73 9 1 1 3 0
## 5 4 11 26 19 38 2 1 3 1
## 6 1 11 17 10 7 16 2 1 1
## 7 1 3 5 7 7 4 18 2 0
## 8 0 1 2 2 3 2 1 18 1
## 9 0 0 0 0 1 0 1 2 12
dev.gbm.error <- mean(dat.model_train.sub.rf$Category == p.train.pred.gbm) ## classification accuracy: 46.55%
Further, for the validation data, the truth table is:
## Validation truth table
test.pred.gbm <- predict(gbm.model, n.trees = 500, newdata = dat.model_test, type = 'response')
p.test.pred.gbm <- apply(test.pred.gbm, 1, which.max)
table(dat.model_test$Category, p.test.pred.gbm, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 1 2 3 4 5 6 7 8 9
## 1 18 11 1 2 0 0 0 0 0
## 2 20 11 9 9 1 0 1 0 0
## 3 10 9 12 14 3 1 0 0 0
## 4 3 10 13 10 3 1 0 0 0
## 5 2 4 8 7 2 1 3 0 0
## 6 0 3 4 5 2 0 0 0 0
## 7 0 2 1 3 1 1 1 3 2
## 8 0 2 1 0 1 1 2 0 1
## 9 0 0 0 0 0 0 0 2 1
val.gbm.error <- mean(dat.model_test$Category == p.test.pred.gbm) ## classification accuracy: 22.68%
For this GBM model, the classification accuracy rate in the development data is 46.55 % and in the validation data is 22.68 %. The results are comparable to the pruned decision tree model above.
In this approach, the variable total is first predicted using a variety of machine learning models. The output from the best model is then used to predict the Category variable.
Step 1: Predicting the total variable
First, the training data is split into development data and validation data. The development data comprises 80% of the training data and the remainder forms the validation data.
#### Model approach 2: First, predict the total variable and then, using the predicted total variable, predict the category ####
#### Step1: Predicting the total variable ####
set.seed(525)
subset2 <- createDataPartition(train.dat$total, times = 1, p = 0.80, list = FALSE)
## splitting the data into 80:20
dat.model_train2 <- train.dat[subset2, ]
dat.model_test2 <- train.dat[-subset2, ]
dat.model_train.sub2 <- dat.model_train2 %>%
select(-c(id, name, display_name, board_rating_reason, Category, production_year))
Technique 1: Decision Tree Model: Pruned Tree
As in Approach 1, the rpart package is used to grow a high depth tree, which is then pruned to balance bias and variance (see the description of this technique under Approach 1). A plot of error against tree complexity is used to choose the complexity parameter; the complexity plot for the regression tree developed here is shown below.
#### Technique 1: Pruned Decision Tree ####
set.seed(12348)
train.rpart2 <- rpart(formula = total ~ . , data = dat.model_train.sub2, cp = 0.0001)
plotcp(train.rpart2)
printcp(train.rpart2)
##
## Regression tree:
## rpart(formula = total ~ ., data = dat.model_train.sub2, cp = 1e-04)
##
## Variables actually used in tree construction:
## [1] creative_type genre
## [3] language movie_board_rating_display_name
## [5] movie_release_pattern_display_name movie_sequel
## [7] production_method sentiment_score
## [9] source
##
## Root node error: 33329256/959 = 34754
##
## n= 959
##
## CP nsplit rel error xerror xstd
## 1 0.19486698 0 1.00000 1.00183 0.23384
## 2 0.08814736 1 0.80513 0.84202 0.22972
## 3 0.05099837 2 0.71699 0.78377 0.22671
## 4 0.04986019 3 0.66599 0.78202 0.19823
## 5 0.02870162 4 0.61613 0.75571 0.19685
## 6 0.02720284 5 0.58743 0.74279 0.19602
## 7 0.02527418 6 0.56022 0.73234 0.19591
## 8 0.01574280 7 0.53495 0.68465 0.19506
## 9 0.00840680 8 0.51921 0.66815 0.19467
## 10 0.00619198 9 0.51080 0.66614 0.19506
## 11 0.00597338 10 0.50461 0.66453 0.19508
## 12 0.00446735 11 0.49863 0.64872 0.18268
## 13 0.00410730 12 0.49417 0.64838 0.18250
## 14 0.00343149 13 0.49006 0.63199 0.16815
## 15 0.00297039 14 0.48663 0.63158 0.16824
## 16 0.00276443 15 0.48366 0.63133 0.16819
## 17 0.00240661 16 0.48089 0.63135 0.16819
## 18 0.00174394 18 0.47608 0.64021 0.16825
## 19 0.00158784 20 0.47259 0.64295 0.16826
## 20 0.00145852 21 0.47100 0.64296 0.16826
## 21 0.00137425 22 0.46955 0.64511 0.16834
## 22 0.00120285 23 0.46817 0.64306 0.16829
## 23 0.00118678 24 0.46697 0.64298 0.16829
## 24 0.00116805 26 0.46459 0.64269 0.16829
## 25 0.00085313 27 0.46343 0.63992 0.16820
## 26 0.00058453 28 0.46257 0.63545 0.16808
## 27 0.00056236 29 0.46199 0.63656 0.16808
## 28 0.00053878 30 0.46143 0.63694 0.16808
## 29 0.00043853 31 0.46089 0.63614 0.16808
## 30 0.00039306 32 0.46045 0.63735 0.16809
## 31 0.00037953 33 0.46006 0.63838 0.16810
## 32 0.00031301 35 0.45930 0.63876 0.16809
## 33 0.00030698 36 0.45898 0.63954 0.16810
## 34 0.00029554 37 0.45868 0.63959 0.16810
## 35 0.00028939 38 0.45838 0.63977 0.16810
## 36 0.00028836 41 0.45751 0.63977 0.16810
## 37 0.00021025 46 0.45607 0.64048 0.16809
## 38 0.00018735 47 0.45586 0.64106 0.16809
## 39 0.00018539 48 0.45567 0.64085 0.16809
## 40 0.00016826 49 0.45549 0.64106 0.16809
## 41 0.00012019 50 0.45532 0.64112 0.16809
## 42 0.00010000 52 0.45508 0.63982 0.16803
Based on the plot and the complexity table output, a tree with complexity parameter 0.00276443 can be used. This tree has 16 terminal nodes. A plot of the tree is shown below:
## Pruned decision tree
train.pruned_tree2 <- prune(train.rpart2, cp = 0.00276443)
prp(train.pruned_tree2, varlen = 12)
## Development data
train.pred.pruned.tree2 <- predict(train.pruned_tree2, dat.model_train.sub2)
dev.tree.error.rmse <- rmse(train.pred.pruned.tree2, dat.model_train.sub2$total) ## 129.6499
## Validation data
test.pred.pruned.tree2 <- predict(train.pruned_tree2, dat.model_test2)
val.tree.error.rmse <- rmse(test.pred.pruned.tree2, dat.model_test2$total) ## 131.6686
For this pruned decision tree model, the RMSE in the development data is 129.65 and in the validation data is 131.67.
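Here RMSE is the root mean squared error, $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, i.e. the typical magnitude of the prediction error in the units of the total variable.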
Technique 2: Random Forest
The random forest technique, described under Approach 1, is applied here as a regression forest to predict total.
#### Technique 2: Random Forest ####
## removing the variable with missing values, as randomForest cannot handle predictors with missing values
dat.model_train.sub2.rf <- dat.model_train.sub2 %>% select(-c(sentiment_score))
## For now, tuning parameters are set to values commonly recommended in the literature
set.seed(12349)
rf.model2 <- randomForest(total ~ ., data = dat.model_train.sub2.rf, mtry = 3, importance = TRUE, ntree = 100)
## Development data
train.pred.rf2 <- predict(rf.model2, dat.model_train.sub2.rf)
dev.rf.error.rmse <- rmse(train.pred.rf2, dat.model_train.sub2.rf$total) ## 100.1452
## Validation data
test.pred.rf2 <- predict(rf.model2, dat.model_test2)
val.rf.error.rmse <- rmse(test.pred.rf2, dat.model_test2$total) ## 124.3666
As before, the random forest tuning parameters follow common recommendations in the literature; better values could be chosen using cross-validation (see the sketch under Approach 1).
For this random forest model, the RMSE in the development data is 100.15 and in the validation data is 124.37. Both values are lower than those of the decision tree model above.
Technique 3: Gradient Boosting Method
The GBM technique, described under Approach 1, is applied here with a gaussian loss to predict total.
#### Technique 3: Gradient Boosting Method ####
set.seed(12350)
gbm.model2 <- gbm(total ~ .,
data = dat.model_train.sub2.rf,
distribution = 'gaussian',
n.trees = 500,
interaction.depth = 3,
cv.folds = 5,
shrinkage = 0.1)
## Development data
train.pred.gbm2 <- predict(gbm.model2, n.trees = 500, newdata = dat.model_train.sub2.rf)
dev.gbm.error.rmse <- rmse(train.pred.gbm2, dat.model_train.sub2.rf$total) ## 103.1652
## Validation data
test.pred.gbm2 <- predict(gbm.model2, n.trees = 500, newdata = dat.model_test2)
val.gbm.error.rmse <- rmse(test.pred.gbm2, dat.model_test2$total) ## 156.1931
As before, the GBM tuning parameters were chosen after several manual iterations; a grid search over the parameters, as sketched under Approach 1, would be more comprehensive.
For this GBM model, the RMSE in the development data is 103.17 and in the validation data is 156.19. The development RMSE is lower than the decision tree's and comparable to the random forest's, but the validation RMSE is the highest of the three; the random forest therefore remains the better model.
Final model selected to predict total
Of the three models above, the random forest performs best. Its output is therefore used to predict the Category variable in the next step.
Step 2: Predicting the Category variable
The predicted total values from the random forest model built in the last step are used to estimate the Category variable.
First, stratified sampling is used to split the training data into development data and validation data. Stratified sampling ensures that the proportion of each category of the response variable is maintained in both data sets. 80% of the training data is used as development data and the remaining 20% as validation data.
#### Step 2: Using the predicted total variable, calculate the category variable ####
pred.total.train.dat <- predict(rf.model2, train.dat)
train.dat.pred.total <- bind_cols(train.dat, "predicted_total" = pred.total.train.dat)
set.seed(526)
subset3 <- createDataPartition(train.dat.pred.total$Category, times = 1, p = 0.80, list = FALSE)
## splitting the data into 80:20
dat.model_train3 <- train.dat.pred.total[subset3, ]
dat.model_test3 <- train.dat.pred.total[-subset3, ]
Now, the random forest technique is used to predict the Category variable.
#### Technique: Random Forest ####
var_list <- c("movie_sequel", "creative_type", "source", "production_method", "genre","language", "movie_board_rating_display_name",
"movie_release_pattern_display_name")
## For now, tuning parameters are set to values commonly recommended in the literature
set.seed(12351)
rf.model3 <- randomForest(as.factor(Category) ~ predicted_total,
data = dat.model_train3, mtry = 1, importance = TRUE, ntree = 500)
Again, the random forest tuning parameters follow common recommendations in the literature; better values could be chosen using cross-validation.
For the development data, the truth table of the model is:
## Development truth table
train.pred.rf3 <- predict(rf.model3, dat.model_train3, type = "class")
table(dat.model_train3$Category, train.pred.rf3, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 1 2 3 4 5 6 7 8 9
## 1 105 18 12 5 1 1 1 0 0
## 2 22 119 22 17 5 3 1 0 0
## 3 8 22 125 30 5 2 3 1 0
## 4 5 11 28 115 3 0 1 0 1
## 5 1 6 22 15 54 2 2 3 2
## 6 1 3 14 10 6 23 0 2 1
## 7 2 2 4 3 2 0 36 2 0
## 8 0 0 1 2 1 1 4 26 0
## 9 0 0 0 0 0 0 1 1 11
dev.rf.error2 <- mean(dat.model_train3$Category == train.pred.rf3) ## classification accuracy: 64.09%
Further, for the validation data, the truth table is:
## Validation truth table
test.pred.rf3 <- predict(rf.model3, dat.model_test3, type = "class")
table(dat.model_test3$Category, test.pred.rf3, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 1 2 3 4 5 6 7 8 9
## 1 9 12 3 1 0 0 0 0 0
## 2 12 16 17 11 1 0 1 0 0
## 3 10 9 13 11 5 0 1 0 0
## 4 2 4 15 15 4 1 1 0 0
## 5 0 2 6 6 6 2 2 1 0
## 6 0 3 2 5 4 1 4 1 0
## 7 0 1 0 2 1 2 2 2 0
## 8 0 0 1 0 0 0 0 1 1
## 9 0 0 0 0 0 0 1 2 3
val.rf.error2 <- mean(dat.model_test3$Category == test.pred.rf3) ## classification accuracy: 27.73%
For this random forest model, the classification accuracy rate in the development data is 64.09 % and in the validation data is 27.73 %.
Modeling Approach 1 -
The following table summarizes the performance of all the models formed in this approach. The rates reported here are classification accuracy rates.
Modeling Technique | Development classification accuracy rate | Validation classification accuracy rate |
---|---|---|
Pruned Decision Tree | 36.22 % | 24.79 % |
Random Forest | 63.15 % | 24.37 % |
Gradient Boosting Method | 46.55 % | 22.68 % |
Modeling Approach 2 -
The following table summarizes the performance of all the models formed in this approach.
First, the comparison is made for the prediction of the variable total. Here the error reported is the RMSE.
Modeling Technique | Development RMSE | Validation RMSE |
---|---|---|
Pruned Decision Tree | 129.65 | 131.67 |
Random Forest | 100.15 | 124.37 |
Gradient Boosting Method | 103.17 | 156.19 |
Based on this table, the random forest model is the best of the three models built. Its output is then used to predict the variable Category. The following table reports the classification accuracy rates for the resulting model.
Modeling Technique | Development classification accuracy rate | Validation classification accuracy rate |
---|---|---|
Random Forest | 64.09 % | 27.73 % |
From both modeling approaches, the model obtained in Approach 2 is the best of all the models built. It is therefore finalized and used for prediction on the scoring data; a minimal sketch of this scoring step is given below.
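The sketch below assumes a scoring file named "Scoring sheet.xlsx" with the same raw columns as the training sheet, to which the same low-frequency grouping shown earlier has been applied; the file name and the object score.dat are illustrative assumptions.
## Sketch of scoring new data with the finalized two-stage model
score.dat <- read_excel("Scoring sheet.xlsx") ## assumed file name
## transfer the training factor levels (after applying the same low-frequency grouping)
for (v in var_list) {
  score.dat[[v]] <- factor(score.dat[[v]], levels = levels(train.dat[[v]]))
}
## stage 1: predict the box-office total with the Approach 2 random forest
score.dat$predicted_total <- predict(rf.model2, score.dat)
## stage 2: predict the final Category with the second-stage random forest
score.dat$predicted_Category <- predict(rf.model3, score.dat, type = "class")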