# load the required libraries
library('reshape')
library('ggplot2')
library('rpart')
library('randomForest')
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library('caret')
## Loading required package: lattice

Introduction

In the following sections we apply some preprocessing to the awards data provided and explore three candidate models for predicting ‘Best Picture’ based on the other Academy Award categories a movie wins. We aim to test the validity of the statement that ‘Best Film Editing’ is the best predictor of ‘Best Picture’.

Data Description

We start our analysis by loading the data:

# define the input file name
fileName<-"C:/Users/dgn2/Documents/R/IS607/Project_3/Awards_File.csv"
# read in the CSV file
awards_df <- read.csv(fileName,stringsAsFactors=F) 

The first 10 lines of the data give us a sense of the variables we have available for our modeling:

# number of lines to display
nLines<-10
# get the first n lines of the data
displayAwardsData<-head(awards_df,nLines)
# create the table
knitr::kable(displayAwardsData,
             caption = paste('First',nLines,'Lines of Awards Data Set'))
First 10 Lines of Awards Data Set

| movie_id | movie_name         | year | category_id | category_name              | won |
|---------:|:-------------------|-----:|------------:|:----------------------------|----:|
|        1 | Biutiful           | 2010 |           1 | Actor – Leading Role         |   0 |
|        2 | True Grit          | 2010 |           1 | Actor – Leading Role         |   0 |
|        2 | True Grit          | 2010 |           4 | Actress – Supporting Role    |   0 |
|        3 | The Social Network | 2010 |           1 | Actor – Leading Role         |   0 |
|        4 | The King’s Speech  | 2010 |           1 | Actor – Leading Role         |   1 |
|        4 | The King’s Speech  | 2010 |           2 | Actor – Supporting Role      |   0 |
|        4 | The King’s Speech  | 2010 |           4 | Actress – Supporting Role    |   0 |
|        5 | 127 Hours          | 2010 |           1 | Actor – Leading Role         |   0 |
|        6 | The Fighter        | 2010 |           2 | Actor – Supporting Role      |   1 |
|        6 | The Fighter        | 2010 |           4 | Actress – Supporting Role    |   1 |

Our data is provided in long format.

# examine the structure of the data
str(awards_df)
## 'data.frame':    8598 obs. of  6 variables:
##  $ movie_id     : int  1 2 2 3 4 4 4 5 6 6 ...
##  $ movie_name   : chr  "Biutiful " "True Grit " "True Grit " "The Social Network " ...
##  $ year         : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ category_id  : int  1 1 4 1 1 2 4 1 2 4 ...
##  $ category_name: chr  "Actor -- Leading Role" "Actor -- Leading Role" "Actress -- Supporting Role" "Actor -- Leading Role" ...
##  $ won          : int  0 0 0 0 1 0 0 0 1 1 ...
summary(awards_df)
##     movie_id     movie_name             year       category_id   
##  Min.   :   1   Length:8598        Min.   :1928   Min.   : 1.00  
##  1st Qu.:1207   Class :character   1st Qu.:1949   1st Qu.: 6.00  
##  Median :2400   Mode  :character   Median :1968   Median :12.00  
##  Mean   :2394                      Mean   :1970   Mean   :11.79  
##  3rd Qu.:3557                      3rd Qu.:1990   3rd Qu.:17.00  
##  Max.   :4894                      Max.   :2010   Max.   :23.00  
##  category_name           won        
##  Length:8598        Min.   :0.0000  
##  Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000  
##                     Mean   :0.2077  
##                     3rd Qu.:0.0000  
##                     Max.   :1.0000

The unique award categories and their corresponding IDs are as follows:

# create the unique award_categories
award_categories<-unique(data.frame(category_id=awards_df$category_id,
                                    category_name=awards_df$category_name))
print(award_categories)
##      category_id               category_name
## 1              1       Actor -- Leading Role
## 3              4  Actress -- Supporting Role
## 6              2    Actor -- Supporting Role
## 13             3     Actress -- Leading Role
## 21             5       Animated Feature Film
## 22            14             Music (Scoring)
## 25            15                Music (Song)
## 26            16                Best Picture
## 27            20               Sound Editing
## 28            22                     Writing
## 29             6               Art Direction
## 30             8              Costume Design
## 31            21              Visual Effects
## 35             7              Cinematography
## 38            19                       Sound
## 45             9                   Directing
## 46            12                Film Editing
## 76            10       Documentary (Feature)
## 81            11 Documentary (Short Subject)
## 91            13                      Makeup
## 100           17       Short Film (Animated)
## 105           18    Short Film (Live Action)
## 7383          23         Documentary (other)

Preprocessing

Preprocessing is often one of the most important contributors to model performance. We take several preprocessing steps prior to any modeling and explore two different preprocessing approaches.

Approach 1

First, we convert our long data into wide form.

# extract the required fields
awards_modified <- awards_df[,c(1,3,4,6)] 
# extract the subset of winners
winners <- subset(awards_df, won == 1) 
# create the data frame
molten_winners <- melt.data.frame(winners,id.vars=c("year","category_name"),
                                  measure.vars="movie_id")
# remove the 3rd column (the melt 'variable' column)
molten_winners <- molten_winners[,-3]
# reshape the data for modeling
wide_winners <- reshape(molten_winners,timevar="category_name",idvar="year",
                        direction="wide")
# rename the columns
colnames(wide_winners) <- as.vector(c("year",paste0('c',
  as.character(award_categories$category_id))))
# remove the row names
rownames(wide_winners) <- NULL 

We extract the relevant fields from the long form data, then pull out only the winners for each year and award category. The data is then reshaped so that our transformed data set contains the movie ID for the winner of each category (columns) by year (rows).

# get the first n lines of the data
displayWideWinners<-head(wide_winners,nLines)
# create the table
knitr::kable(displayWideWinners[,1:10],
             caption = paste('First',nLines,' Lines and 10 Columns of Wide Form Data Set'))
First 10 Lines and 10 Columns of Wide Form Data Set

| year |  c1 |  c4 |  c2 |  c3 |  c5 | c14 | c15 | c16 | c20 |
|-----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
| 2010 |   4 |   6 |   6 |  11 |  16 |  16 |  17 |  17 |  19 |
| 2009 |  61 |  69 |  72 |  70 |  79 | 107 |  80 |  84 |  80 |
| 2008 | 126 | 131 | 137 | 136 | 140 | 147 | 142 | 144 | 147 |
| 2007 | 179 | 184 | 178 | 189 | 197 | 227 | 202 | 208 | 203 |
| 2006 | 248 | 249 | 251 | 255 | 259 | 278 | 263 | 271 | 263 |
| 2005 | 307 | 312 | 321 | 310 | 324 | 352 | 328 | 328 | 328 |
| 2004 | 373 | 372 | 371 | 372 | 382 | 411 | 385 | 385 | 385 |
| 2003 | 433 | 433 | 431 | 440 | 444 | 448 | 448 | 448 | 449 |
| 2002 | 481 | 482 | 488 | 486 | 496 | 519 | 498 | 498 | 502 |
| 2001 | 540 | 542 | 537 | 545 | 551 | 550 | 556 | 556 | 555 |

Second, we convert the winning movie IDs for each year and award category into a new binary variable based on our research question. We aim to test the validity of the statement that ‘Best Film Editing’ is the best predictor of ‘Best Picture’. Therefore, for each award category, if the movie that wins ‘Best Film Editing’ is the same as the movie that wins that category, the new variable takes the value TRUE; otherwise it takes the value FALSE.

The data transformation is completed as follows:

# extract the year
year<-wide_winners$year
# set dependent variable as best picture
y<-wide_winners[,'c16']
# remove year and dependent variable, best picture
x<-subset(wide_winners, select = -c(year,c16))
# assemble the column names
columnNames<-c('c16',colnames(x))
# create the data frame
df<-data.frame(c16=y,x,row.names=year)
# add the column names
colnames(df)<-columnNames
# order the data frame by year (row name)
df <- df[ order(row.names(df)), ]
# drop all the rows where best picture is NA
notNaIndex<-!is.na(df[,'c16'])
df<-df[notNaIndex,]
# find the columns with NAs
notNaIndex<-colSums(is.na(df))==0
# drop the columns with NAs
noNasData<-df[,notNaIndex]
# is the movie that won film editing the winner of the category
filmEditingWon<-noNasData[,'c12']==noNasData
# extract the row names
rowNames<-rownames(filmEditingWon)
# extract the column names
columnNames<-colnames(filmEditingWon)
# find the dimensions of the data frame
dimension<-dim(filmEditingWon)
# create the new variable data frame
data<-data.frame(matrix(factor(filmEditingWon),dimension[1],dimension[2]))
# label the rows
rownames(data)<-rowNames
# label the columns
colnames(data)<-columnNames
data_1<-data

A sample of the fully transformed wide form data appears as follows:

# get the first n lines of the data
displayData<-head(data_1,nLines)
# create the table
knitr::kable(displayData[,1:10],
             caption = paste('First',nLines,' Lines and 10 Columns of Wide Form Factor Data Set'))
First 10 Lines and 10 Columns of Wide Form Factor Data Set

|      | c16   | c1    | c4    | c2    | c3    | c14   | c15   | c20   | c22   | c21   |
|:-----|:------|:------|:------|:------|:------|:------|:------|:------|:------|:------|
| 1948 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 1949 | TRUE  | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE  | FALSE | FALSE | FALSE |
| 1950 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE  | FALSE | FALSE | FALSE |
| 1951 | TRUE  | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE  | FALSE | FALSE |
| 1952 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 1953 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 1954 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 1955 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE  | FALSE | FALSE | FALSE | FALSE |
| 1956 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| 1957 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE  | FALSE | TRUE  |

Approach 2

In our second preprocessing approach we re-scale the movie IDs to reduce the number of classes and create a more meaningful feature set for our modeling. The approach reduces the number of classes from the number of winning movies across all years and categories to at most the number of categories with a winner in a given year.

First we re-scale the movie ID:

# re-scale the movie ID between 1 and the number of unique movie 
# IDs for the year

# make a copy of the data
data<-noNasData
# get the years
years<-rownames(data)
# iterate over each year
for (yearIndex in seq_along(years)){
  # find the unique movie IDs
  a<-sort(unique(as.numeric(data[yearIndex,])))
  # create a sequence of movie IDs between 1 and the number 
  # of unique movie IDs for the year
  sequence<-seq_along(a)
  # iterate over each movie ID
  for (i in sequence){
    # re-label the movie ID with the corresponding number between
    # 1 and the number of unique movie IDs for the year
    data[yearIndex,data[yearIndex,]==a[i]]<-i
  }
}

We then convert each column to factors as follows:

# convert the columns to factors
columns<-colnames(data)
# iterate over each column
for (column_i in seq_along(columns)){
  # get the column name
  column<-columns[column_i]
  # convert the data in column to a factor
  data[,column]<-as.factor(data[,column])
}
data_2<-data

Modeling

We build three models and, using the best candidate modeling approach (random forests), examine the impact of our two preprocessing approaches on the importance of the award category variables in predicting ‘Best Picture’.

Data Splitting

In this section we demonstrate how the data can be partitioned for in-sample and out-of-sample analysis.

In- and Out-of-Sample

Our data set can be partitioned into an in-sample training set and an out-of-sample testing set as follows:

inSampleProportion<-0.75
# create partition index (75% in-sample and 25% out-of-sample)
inSampleIndex<-createDataPartition(data_2[,'c16'],
                                   p=inSampleProportion)[[1]]

# approach 1
# extract the in-sample data (approach 1)
inSampleData_1<-data_1[inSampleIndex,]
# extract the out-of-sample data (approach 1)
outOfSampleData_1<-data_1[-inSampleIndex,]

# approach 2
# extract the in-sample data (approach 2)
inSampleData_2<-data_2[inSampleIndex,]
# extract the out-of-sample data (approach 2)
outOfSampleData_2<-data_2[-inSampleIndex,]

Model Candidates

We explore the following 3 model candidates in our attempt to answer the research question:

  1. classification and regression trees (CART)

  2. bagged trees

  3. random forests

Non-Ensemble Techniques: Motivation for Ensembles

In this section we provide a very high-level description of decision trees as motivation for the following section about ensemble techniques.

Decision Trees

A decision tree is used in many classes of machine learning models as a predictive model mapping observations about an item to conclusions about the target value of the item. Tree models where the target variable can take a finite set of values are called classification trees, while decision trees where the target variable can take continuous values are referred to as regression trees.

The term Classification And Regression Tree (CART) is used as a generic term to describe the class of models that use trees to either predict the class to which data belongs (classification) or predict a real number (regression tree). (See the references for an in-depth treatment of this topic).

Decision trees can be thought of as machine-generated business rules.
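
As a minimal sketch (using the rpart package loaded above and the Approach 2 in-sample data created earlier), a single classification tree for ‘Best Picture’ can be fit and its splits printed as rules:

# a minimal sketch: fit a single classification tree on the Approach 2 data
singleTree<-rpart(c16 ~.,data=inSampleData_2,method='class')
# print the splits, which read like machine-generated business rules
print(singleTree)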

Ensemble Techniques

Ensemble techniques combine the predictions of many models to reduce the variability of predictions and improve model robustness. Ensembles generally perform better in real-life modeling situations than non-ensemble techniques.

Bagged Trees

Bagged trees combine input data resampling with decision tree building. Bagging trees improves predictive performance over a single tree by reducing variance of the prediction. By generating bootstrap samples we introduce a random component into the tree building process, creating a distribution of trees, and thus, a corresponding distribution of predicted values for each sample.

Bagging models provide several advantages over models that are not bagged:

  1. Bagging effectively reduces the variance of a prediction through its aggregation process. For models that produce an unstable prediction (like regression trees), aggregating over many versions of the training data actually reduces the variance in the prediction, making the prediction more stable.

  2. Another advantage of bagging models is that they can provide their own internal estimate of predictive performance that correlates well with either cross-validation estimates or test set estimates. When constructing a bootstrap sample for each model in the ensemble, certain samples are left out. These samples are often referred to as ‘out-of-bag’, and they can be used to assess the predictive performance of that specific model because they were not used to build the model. Every model in the ensemble generates a measure of predictive performance using the out-of-bag samples. The average of the out-of-bag performance metrics can then be used to gauge the predictive performance of the entire ensemble, and this value usually correlates well with the assessment of predictive performance we can get with cross-validation. This error estimate is usually referred to as the ‘out-of-bag’ estimate. Measures of variable importance can be constructed by combining measures of importance from the individual models across the ensemble. More about variable importance will be discussed in the next section when we examine random forests.

Although bagging usually improves predictive performance for unstable models, there are a few disadvantages:

  1. Computational costs and memory requirements increase as the number of bootstrap samples increases. This disadvantage can be mostly mitigated if the modeler has access to parallel computing because the bagging process can be easily parallelized. As each bootstrap sample and corresponding model is independent of any other sample and model, each model can be built separately and all models can be brought together to generate the prediction.

  2. A bagged model is much less interpretable than a model that is not bagged. Convenient rules that we can get from a single classification or regression tree cannot be attained.
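
As a minimal sketch (assuming the ipred package, which also underlies the ‘treebag’ method used later, and the Approach 2 in-sample data), a bagged tree can be fit directly and its out-of-bag error estimate reported:

# fit a bagged tree and compute its internal out-of-bag error estimate
library('ipred')
set.seed(1) # arbitrary seed so the bootstrap samples are reproducible
baggedFit<-bagging(c16 ~.,data=inSampleData_2,
                   nbagg=50,  # number of bootstrap samples (and trees)
                   coob=TRUE) # compute the out-of-bag error estimate
# printing the fit reports the out-of-bag estimate of misclassification error
print(baggedFit)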

Random Forests

Motivation

The trees used in bagging are not completely independent of each other, since all of the original predictors are considered at every split of every tree. If the relationship between predictors and the response can be adequately modeled by a tree - and our original data sample is large enough - trees from different bootstrap samples may have similar structures to each other (especially at the top of the trees) due to the underlying relationship. This characteristic - known as tree correlation - prevents bagging from optimally reducing the variance of the predicted values. Reducing correlation among trees (referred to as de-correlating trees) is one important way to improve on bagging, and it can be done by adding randomness to the tree construction process. Random forests is a general algorithm for this de-correlation process: predictors are randomly selected at each split, which lessens tree correlation, and each of the \(m\) trees in the ensemble generates a prediction for a new sample, with these \(m\) predictions averaged to give the prediction of the forest.

Tuning Parameters

The number of randomly selected predictors, \(k\), to choose from at each split - commonly referred to as \(m_{try}\) - must be chosen. We must also specify the number of trees for the forest. The \(m_{try}\) tuning parameter does not typically have a drastic effect on performance.
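
As a minimal sketch (assuming the caret package loaded above and the Approach 2 in-sample data; the grid values and fold count are illustrative), \(m_{try}\) can be tuned over an explicit grid rather than via tuneLength:

# candidate values for the number of predictors sampled at each split
rfGrid<-expand.grid(mtry=c(2,5,10,15))
rfTuned<-train(c16 ~.,data=inSampleData_2,method='rf',
               tuneGrid=rfGrid, # evaluate each candidate m_try value
               ntree=1000,      # number of trees in each forest
               trControl=trainControl(method='cv',number=5))
# resampled accuracy for each candidate m_try
rfTuned$results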

Advantages & Disadvantages

Random forests protect against over-fitting: many researchers claim that a random forest model will not be adversely affected if a large number of trees is built for the forest, although in practice the larger the forest, the greater the computational burden of training the model. Linear combinations of many independent learners reduce the variance of the overall ensemble relative to any individual learner in the ensemble. The random forest model achieves this variance reduction by selecting strong, complex learners that exhibit low bias, and the ensemble of many such independent learners yields an improvement in error rates. Since each learner is selected independently of all other learners, random forests is also robust to a noisy response.

Compared to bagging, random forests is more computationally efficient on a tree-by-tree basis because the tree building process only needs to evaluate a fraction of the original predictors at each split (although random forests usually require more trees).

As in bagging, CART can be used as the base learner in random forests. The ensemble nature of random forests makes it impossible to gain a direct understanding of the relationship between the predictors and the response. However, because trees are the typical base learner for this method, it is possible to quantify the impact of predictors in the ensemble. One such measure is obtained by randomly permuting the values of one predictor at a time in the out-of-bag samples for each tree, taking the difference in predictive performance between the unpermuted and permuted samples, and aggregating those differences across the entire forest. We can also measure the improvement in node purity attributable to each predictor at each occurrence of that predictor across the forest to determine its overall importance.
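
The following is a simplified sketch of that permutation measure; it uses the full supplied data set rather than the per-tree out-of-bag samples that the randomForest package uses, and it assumes a fitted randomForest model (such as rfModel2 below) together with its training data:

# simplified permutation-importance sketch (whole data set, not per-tree OOB samples)
permutationImportance<-function(model,data,response,predictor){
  # baseline accuracy on the supplied data
  baseline<-mean(predict(model,data)==data[[response]])
  # randomly permute the values of a single predictor
  permuted<-data
  permuted[[predictor]]<-sample(permuted[[predictor]])
  # accuracy after the permutation
  shuffled<-mean(predict(model,permuted)==permuted[[response]])
  # the drop in accuracy is the importance of the predictor
  baseline-shuffled
}
# hypothetical call: importance of Film Editing (c12) for predicting Best Picture
# permutationImportance(rfModel2,inSampleData_2,'c16','c12')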

Model Tuning & Model Comparisons

In this section we do a brief comparison of 3 different models using data preprocessed according to our second approach.

Cross-Validation

Cross-validation is a technique for assessing how well a model generalizes, whereby we repeatedly partition our data into training and test sets, train our models on the training data, test our model on data that has not been used in training, then average the results.

The purpose of this process is to reduce overfitting (i.e., improve the ability of our models to generalize from data) and thereby improve the ability of our models to predict out-of-sample (i.e., predict data that has not been used in training).

# set parameters for the cross-validation
numberOfRepeats<-50
kFolds<-5

We configure a repeated 5-fold cross-validation as follows:

# define the random seed (the value here is an arbitrary choice)
randomSeed<-1
# set the seed
set.seed(randomSeed)
# set the cross-validation specification 
controlObject<-trainControl(method='repeatedcv',
                            repeats=numberOfRepeats,
                            number=kFolds)

Model Training

In this section we train three models, then compare the results using our configured cross-validation data.

Approach 2

The models are trained as follows:

set.seed(randomSeed)
# train the CART model
cartModel_2<-train(c16 ~.,data=inSampleData_2,method='rpart',tuneLength=30,
                 trControl=controlObject)

# train the bagged tree model
baggedTreeModel_2<-train(c16 ~.,data=inSampleData_2,method='treebag',
                       trControl=controlObject)
## Loading required package: ipred
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:reshape':
## 
##     rename, round_any
# train the random forest model
randomForestModel_2<-train(c16 ~.,data=inSampleData_2,method='rf',tuneLength=10,
                         ntree=10000,importance=TRUE,trControl=controlObject)

Resampled Model Performance

Each model uses the same cross-validation folds, so the resampled performance estimates are directly comparable. Parallel-coordinate plots for the resampled results are provided below.

Approach 2

set.seed(randomSeed)
modelResamples_2<-resamples(list("CART" = cartModel_2,
                               'Bagged Tree' = baggedTreeModel_2,
                               'Random Forest' = randomForestModel_2))
# create the summary
summary(modelResamples_2)
## 
## Call:
## summary.resamples(object = modelResamples_2)
## 
## Models: CART, Bagged Tree, Random Forest 
## Number of resamples: 250 
## 
## Accuracy 
##                 Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## CART          0.0000  0.3000 0.3636 0.3509  0.4444 0.6667    0
## Bagged Tree   0.0000  0.2727 0.3333 0.3503  0.4444 0.7778    0
## Random Forest 0.1111  0.3333 0.4000 0.4152  0.5000 0.6667   50
## 
## Kappa 
##                  Min.  1st Qu.  Median    Mean 3rd Qu.   Max. NA's
## CART          -0.4706 -0.02784 0.08475 0.06759  0.1549 0.4706    0
## Bagged Tree   -0.3091 -0.03194 0.08475 0.09291  0.1940 0.6842    0
## Random Forest -0.3585  0.00000 0.09091 0.11520  0.2340 0.4545   50

We can see that the mean accuracy of the random forest is highest, followed by CART and then the bagged tree. The random forest model also shows less variation across resamples and is thus likely to be more stable (i.e., robust) out-of-sample.

The accuracy rate (one minus the error rate) provides an indication of the agreement between the observed and predicted classes. This statistic is simple to understand, but provides no information about the type of errors made; a confusion matrix, sketched below, does.
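
As a minimal sketch (using in-sample predictions from the random forest trained above, so the agreement will be optimistic), a confusion matrix can be produced with caret:

# predicted classes on the in-sample data (in-sample, so optimistic)
predictedClasses<-predict(randomForestModel_2,newdata=inSampleData_2)
# cross-tabulate predicted versus observed classes; accuracy and Kappa are also reported
confusionMatrix(predictedClasses,inSampleData_2$c16)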

# plot accuracy values
parallelplot(modelResamples_2,metric='Accuracy')

Each line in the above parallel-coordinate plot corresponds to a common cross-validation holdout.

The Kappa statistic is defined as follows

\[Kappa = \frac{O-E}{1-E}\]

where

\(O\) is the observed accuracy

\(E\) is the expected accuracy based on the marginal totals of the confusion matrix

# plot kappa values
parallelplot(modelResamples_2,metric='Kappa')

The Kappa statistic takes a value between -1 and 1, where 0 indicates no agreement between the observed and predicted classes, 1 indicates perfect agreement, and -1 indicates that the predictions are exactly the opposite of the observed classes.
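
A small worked example with a hypothetical 2x2 confusion matrix (not taken from the models above) illustrates the calculation:

# a hypothetical confusion matrix (rows = predicted class, columns = observed class)
confusion<-matrix(c(30,10,
                    5,55),nrow=2,byrow=TRUE)
n<-sum(confusion)
# observed accuracy O: proportion of agreement on the diagonal
O<-sum(diag(confusion))/n
# expected accuracy E from the marginal totals
E<-sum(rowSums(confusion)*colSums(confusion))/n^2
# Kappa = (O - E) / (1 - E)
(O-E)/(1-E)

Here \(O = 0.85\) and \(E = 0.53\), giving a Kappa of roughly 0.68.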

Using Random Forests to Answer our Research Question

To estimate the “importance” of a category variable, the values of each variable are randomly permuted in the out-of-bag samples, and the corresponding decrease in the accuracy of each tree is estimated. If the average decrease over all the trees is large, then the variable is considered important (i.e., its value makes a big difference in predicting the outcome). If the average decrease is small, then the variable does not make much difference to the outcome.

In this section we use our random forest model to explore the importance of the various different award categories on our ability to predict ‘best picture’.

Approach 1

Using the data produced by our first preprocessing approach we fit a random forest model:

library('randomForest')
set.seed(randomSeed)
# fit random forest model
rfModel1<-randomForest(c16 ~., data=inSampleData_1,ntree=10000,importance=T)

We plot two importance measures: the mean decrease in accuracy when each award category is randomly permuted in the out-of-bag samples (type 1), and the total decrease in node impurity (Gini) attributable to each category across the forest (type 2):

# plot the permutation-based (mean decrease in accuracy) importance measure
varImpPlot(rfModel1,type=1)

# plot the node impurity (mean decrease in Gini) importance measure
varImpPlot(rfModel1,type=2)

Using our first preprocessing approach we can see from the accuracy-based importance plot that the award category ‘Music (Song)’ (c15) provides - by far - the most predictive power, followed by Cinematography (c7), Sound (c19), and Directing (c9).

Using node impurity as our importance measure we see the same categories providing predictive power, but ‘Music (Song)’ (c15) is a far more important predictor than any of the other categories. The next three predictors are the same as in the accuracy case, but have significantly less impact, with their order of importance shifting slightly.

Approach 2

Using the data produced by our second preprocessing approach we fit a random forest model:

library('randomForest')
set.seed(randomSeed)
# fit random forest model
rfModel2<-randomForest(c16 ~., data=inSampleData_2,ntree=10000,importance=T)

As before, we plot the permutation-based (mean decrease in accuracy) and node impurity (mean decrease in Gini) importance measures for each award category:

# plot the permutation-based (mean decrease in accuracy) importance measure
varImpPlot(rfModel2,type=1)

# plot the node impurity (mean decrease in Gini) importance measure
varImpPlot(rfModel2,type=2)

Using our second preprocessing approach we can see from the accuracy-based importance plot that the award categories Documentary (Feature) (c10) and Film Editing (c12) provide - by far - the most predictive power, followed by Music (Scoring) (c14), Actor – Leading Role (c1), Music (Song) (c15), and Sound Editing (c20).

Using node impurity as our importance measure we see Film Editing (c12), Music (Scoring) (c14), Writing (c22), Documentary (Feature) (c10), and Sound (c9) as our best predictors.

The importance of Documentary Feature seems highly suspicious and points to a likely problem with the model.

Conclusion

Our preprocessing choice had a significant impact on our conclusions about the importance of award categories in predicting ‘Best Picture’. Given the large disparities between the models, it is difficult to draw a definitive conclusion, and more work would be required to determine the best predictor of ‘Best Picture’.

References:

[1] M. Kuhn & K. Johnson (2013), Applied Predictive Modeling, Springer

[2] N. Zumel & J. Mount (2014), Practical Data Science With R, Manning

[3] T. Hastie, R. Tibshirani, & J. Friedman (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer

[4] R. O. Duda, P. E. Hart, & D. G. Stork (2001), Pattern Classification, John Wiley & Sons