Project statement

In this group project, we use Academy Awards (Oscar) data to determine whether Best Film Editing is the best predictor of Best Picture.

Data set

The input data is from link. In each award column, a 1 indicates that the film won the award and a 0 indicates that it did not.

Techniques Used

  1. Random Forests
  2. Training and Testing
  3. Confusion Matrix

Loading the data

library(XML)
oscar_data <- readHTMLTable('tidedoscarwinners.csv.html')
oscar_data <- oscar_data[[1]]
head(oscar_data)
##    structure(c("1", "2", "3", "4", "5", "6"), class = "AsIs") Year
## 1                                                           1 1934
## 2                                                           2 1934
## 3                                                           3 1934
## 4                                                           4 1934
## 5                                                           5 1935
## 6                                                           6 1935
##                    pictures Best_Picture Best_Editing Best_Directing
## 1                    Eskimo            0            1              0
## 2     It Happened One Night            1            0              1
## 3         One Night of Love            0            0              0
## 4          The Gay Divorcee            0            0              0
## 5 A Midsummer Night's Dream            0            1              0
## 6                 Dangerous            0            0              0
##   Best_Actor Best_Supporting_Actor Best_Actress Best_Supporting_Actress
## 1          0                     0            0                       0
## 2          1                     0            1                       0
## 3          0                     0            0                       0
## 4          0                     0            0                       0
## 5          0                     0            0                       0
## 6          0                     0            1                       0
##   Best_Sound Best_Song
## 1          0         0
## 2          0         0
## 3          1         0
## 4          0         1
## 5          0         0
## 6          0         0

Data cleaning

# Drop the row index, Year, and pictures columns; Best_Picture (column 4) is kept as the response
oscar_data_ml <- oscar_data[, -c(1:3)]
head(oscar_data_ml)
##   Best_Picture Best_Editing Best_Directing Best_Actor
## 1            0            1              0          0
## 2            1            0              1          1
## 3            0            0              0          0
## 4            0            0              0          0
## 5            0            1              0          0
## 6            0            0              0          0
##   Best_Supporting_Actor Best_Actress Best_Supporting_Actress Best_Sound
## 1                     0            0                       0          0
## 2                     0            1                       0          0
## 3                     0            0                       0          1
## 4                     0            0                       0          0
## 5                     0            0                       0          0
## 6                     0            1                       0          0
##   Best_Song
## 1         0
## 2         0
## 3         0
## 4         1
## 5         0
## 6         0
dim(oscar_data_ml)
## [1] 457   9
str(oscar_data_ml)
## 'data.frame':    457 obs. of  9 variables:
##  $ Best_Picture           : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 2 1 1 ...
##  $ Best_Editing           : Factor w/ 2 levels "0","1": 2 1 1 1 2 1 1 1 1 1 ...
##  $ Best_Directing         : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 2 ...
##  $ Best_Actor             : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 2 ...
##  $ Best_Supporting_Actor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Best_Actress           : Factor w/ 2 levels "0","1": 1 2 1 1 1 2 1 1 1 1 ...
##  $ Best_Supporting_Actress: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Best_Sound             : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 2 1 ...
##  $ Best_Song              : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
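
Dropping columns by position is easy to get wrong by one, which here would discard the response variable. The following is a minimal sketch of selecting by name instead, using a toy stand-in for the data. The column name `id` is hypothetical, since the index column in the loaded table has no clean name.

```r
# Toy stand-in for oscar_data; `id` is a hypothetical name for the index column.
toy <- data.frame(
  id = 1:3,
  Year = c(1934, 1934, 1935),
  pictures = c("Eskimo", "It Happened One Night", "Dangerous"),
  Best_Picture = factor(c("0", "1", "0")),
  Best_Editing = factor(c("1", "0", "0"))
)

# Drop the non-predictor columns by name; the response Best_Picture stays.
drop_cols <- c("id", "Year", "pictures")
toy_ml <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_ml)
```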

Data Analysis

library(lattice)
library(ggplot2)
library(caret)
set.seed(45)
inTrain <- createDataPartition(y=oscar_data_ml$Best_Picture, p=0.7,list=FALSE)
training <- oscar_data_ml[inTrain,]
testing <- oscar_data_ml[-inTrain,]
dim(training)
## [1] 321   9
dim(testing)
## [1] 136   9
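
The split above is stratified on Best_Picture, so winners and non-winners appear in similar proportions in both sets. The base R sketch below illustrates the idea of sampling within each class. It is not caret's implementation: the class counts are inferred from the confusion matrices reported in this document, and rounding up is an assumption chosen because it reproduces the 321/136 split.

```r
set.seed(45)  # base R's draws will differ from caret's even with the same seed

# Class counts implied by the outputs in this document: 376 non-winners, 81 winners.
y <- factor(c(rep("0", 376), rep("1", 81)))

# Sample 70% of the row indices within each class (rounded up, by assumption).
p <- 0.7
by_class <- split(seq_along(y), y)
train_idx <- unlist(
  lapply(by_class, function(i) sample(i, ceiling(p * length(i)))),
  use.names = FALSE
)

table(y[train_idx])   # 264 non-winners, 57 winners (321 rows)
table(y[-train_idx])  # 112 non-winners, 24 winners (136 rows)
```
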
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
model <- randomForest(Best_Picture ~., data=training)
model
## 
## Call:
##  randomForest(formula = Best_Picture ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 8.41%
## Confusion matrix:
##     0  1 class.error
## 0 251 13  0.04924242
## 1  14 43  0.24561404
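
The out-of-bag (OOB) figures in this printout can be checked by hand. The sketch below re-enters the confusion matrix shown above (rows are actual classes, columns are predicted classes) and recomputes the per-class and overall error rates.

```r
# OOB confusion matrix copied from the model printout above.
oob <- matrix(
  c(251, 14, 13, 43), nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all misclassified rows over all rows.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 4)
round(100 * oob_error, 2)  # 8.41 (percent)
```

The error for actual winners (about 25%) is much higher than for non-winners (about 5%), which the single 8.41% figure hides.
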
importance(model)
##                         MeanDecreaseGini
## Best_Editing                   4.2596672
## Best_Directing                33.5250633
## Best_Actor                     2.7116693
## Best_Supporting_Actor          1.0599584
## Best_Actress                   1.0723682
## Best_Supporting_Actress        0.9064921
## Best_Sound                     1.9436359
## Best_Song                      1.8514441
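
To make the ranking easier to read, the importance scores can be sorted and expressed as shares of the total. The sketch below copies the MeanDecreaseGini values from the output above into a named vector; the variable names `imp`, `ranked`, and `share` are illustrative.

```r
# MeanDecreaseGini values copied from the importance() output above.
imp <- c(
  Best_Editing            = 4.2596672,
  Best_Directing          = 33.5250633,
  Best_Actor              = 2.7116693,
  Best_Supporting_Actor   = 1.0599584,
  Best_Actress            = 1.0723682,
  Best_Supporting_Actress = 0.9064921,
  Best_Sound              = 1.9436359,
  Best_Song               = 1.8514441
)

# Rank predictors from most to least important.
ranked <- sort(imp, decreasing = TRUE)

# Each predictor's share of the total decrease in Gini impurity, in percent.
share <- round(100 * ranked / sum(ranked), 1)
share
```

With the fitted model in scope, `varImpPlot(model)` from the randomForest package plots the same scores.
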
library(caret)
predicted <- predict(model, testing)
table(predicted)
## predicted
##   0   1 
## 112  24
library(e1071)
confusionMatrix(predicted, testing$Best_Picture)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 106   6
##          1   6  18
##                                           
##                Accuracy : 0.9118          
##                  95% CI : (0.8509, 0.9536)
##     No Information Rate : 0.8235          
##     P-Value [Acc > NIR] : 0.002799        
##                                           
##                   Kappa : 0.6964          
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.9464          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.9464          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.8235          
##          Detection Rate : 0.7794          
##    Detection Prevalence : 0.8235          
##       Balanced Accuracy : 0.8482          
##                                           
##        'Positive' Class : 0               
## 
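
Because the 'Positive' class in this report is 0 (not winning), the Sensitivity value describes non-winners. The sketch below recomputes the key statistics from the 2x2 table above and adds recall and precision for the winner class, which are the figures most relevant to the project question.

```r
# Test-set confusion matrix copied from the output above.
cm <- matrix(
  c(106, 6, 6, 18), nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)
n <- sum(cm)

accuracy <- sum(diag(cm)) / n   # share of all films classified correctly
nir      <- max(colSums(cm)) / n  # accuracy of always predicting the majority class

# Winner-class metrics (treating "1" as the class of interest).
recall_winner    <- cm["1", "1"] / sum(cm[, "1"])  # winners that were found
precision_winner <- cm["1", "1"] / sum(cm["1", ])  # predicted winners that were right

# Cohen's kappa: agreement beyond what the marginal totals would give by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, recall_winner = recall_winner,
        precision_winner = precision_winner, kappa = kappa), 4)
```

To have caret report these directly, `confusionMatrix(predicted, testing$Best_Picture, positive = "1")` sets winners as the positive class.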

Conclusion

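With this dataset, a random forest predicts whether a film wins Best Picture with 91.2% accuracy on the held-out test set, compared with a no-information rate of 82.4%. It correctly identifies 18 of the 24 Best Picture winners in the test set (75%). The variable importance scores show that Best_Directing is by far the strongest predictor (MeanDecreaseGini of 33.5), with Best_Editing a distant second (4.3). Best Film Editing is therefore not the best predictor of Best Picture in this data.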