Project 3

Derek G. Nokes

Tuesday, March 31, 2015

Ensemble Techniques

combine the predictions of many models to reduce the variability of predictions and improve model robustness
ensembles often perform better in real-life modeling situations than single-model (non-ensemble) techniques
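
The variance-reduction idea can be illustrated with a small simulation. This is only a sketch under a strong assumption: each hypothetical "model" is the true value plus independent Gaussian noise. Real ensemble members are usually correlated, so the reduction in practice is smaller. The function name `simulate` and all parameter values are illustrative, not taken from the project.

```python
import random
import statistics

def simulate(n_models, n_trials=2000, truth=1.0, noise_sd=1.0, seed=0):
    """Return the variance of an ensemble prediction across repeated trials.

    Assumption: each model's prediction is truth + independent Gaussian noise.
    """
    rng = random.Random(seed)
    ensemble_preds = []
    for _ in range(n_trials):
        # Each model makes its own noisy prediction of the same quantity.
        preds = [truth + rng.gauss(0, noise_sd) for _ in range(n_models)]
        # The ensemble prediction is the plain average of the members.
        ensemble_preds.append(statistics.fmean(preds))
    return statistics.pvariance(ensemble_preds)

single_variance = simulate(n_models=1)
ensemble_variance = simulate(n_models=25)
print(f"single model variance:   {single_variance:.3f}")
print(f"25-model ensemble variance: {ensemble_variance:.3f}")
```

With independent errors, the variance of the average falls roughly in proportion to the number of models.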

Bagged Trees

combine resampling of the input data with decision tree building
improve predictive performance over a single tree by reducing the variance of predictions
introduce a random component into tree building through resampling (bootstrapping), creating a distribution of trees and a corresponding distribution of predicted values for each sample
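
The points above can be sketched in a few lines. This is a minimal illustration, not the code used in the project: it assumes a single numeric predictor, uses one-split regression trees ("stumps") in place of full trees, and runs on made-up data. The names `fit_stump` and `bagged_stumps` are hypothetical.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimising squared error."""
    best = None
    # Candidate thresholds skip the smallest value so both sides are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical: fall back to predicting the overall mean.
        overall = statistics.fmean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Fit one stump per bootstrap resample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

# Made-up data: a noisy step function.
data_rng = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [(1.0 if x >= 2.0 else 0.0) + data_rng.gauss(0, 0.3) for x in xs]

stumps = bagged_stumps(xs, ys)
preds = [stump(3.0) for stump in stumps]   # distribution of predictions at x = 3.0
bagged_prediction = statistics.fmean(preds)  # the bagged estimate is their average
spread = statistics.pstdev(preds)            # variability across resampled trees
print(bagged_prediction, spread)
```

Each bootstrap resample yields a slightly different tree, so `preds` is the distribution of predicted values for one sample, and its mean is the bagged prediction.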

Random Forests

Cross-Validation

technique for assessing how well a model generalizes
repeatedly partition our data into training and test sets, train our models on the training data, test them on data that has not been used in training, then average the results
purpose: detect overfitting and guide model selection, and thereby improve the ability of our models to predict out-of-sample (i.e., predict data that has not been used in training)
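
The procedure can be sketched as k-fold cross-validation. This is an illustrative sketch only: the data are made up, and the "model" is a placeholder that always predicts the training mean (`fit_mean`, a hypothetical name), standing in for whatever model is being assessed.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, k=5, seed=0):
    """Return the average held-out mean squared error and the per-fold scores."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Train only on rows outside the current test fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Score only on rows the model has not seen.
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
        scores.append(statistics.fmean(errors))
    return statistics.fmean(scores), scores

def fit_mean(xs, ys):
    """Placeholder model: always predict the mean of the training targets."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

# Made-up data: a noisy linear trend.
data_rng = random.Random(2)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + data_rng.gauss(0, 1.0) for x in xs]

folds = k_fold_indices(len(xs), 5)
mean_score, scores = cross_validate(xs, ys, fit_mean, k=5)
print(mean_score, scores)
```

Every row is used for testing exactly once, and the averaged score estimates out-of-sample error rather than fit to the training data.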

Using Random Forests to Answer our Research Question

can estimate the “importance” of each category variable
values of each variable are randomly permuted in the out-of-bag samples, and the corresponding decrease in accuracy of each tree is estimated
if the average decrease over all trees is large, the variable is considered important (i.e., its value makes a big difference in predicting the outcome)
if the average decrease is small, the variable does not make much difference to the outcome
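
The permutation idea can be sketched as follows. This is a simplified illustration, not the random forest procedure itself: it uses a single fixed, hypothetical classifier (`predict`) and one made-up evaluation set standing in for the out-of-bag samples, whereas a random forest repeats this per tree on that tree's out-of-bag rows and averages over trees.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Average drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link to the outcome.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Made-up data: the label equals variable 0; variable 1 is pure noise.
data_rng = random.Random(1)
rows = [[i % 2, data_rng.randint(0, 1)] for i in range(40)]
labels = [r[0] for r in rows]

# Hypothetical fitted classifier that relies only on variable 0.
predict = lambda r: r[0]

importances = permutation_importance(predict, rows, labels)
print(importances)
```

Shuffling variable 0 destroys the classifier's accuracy, so its importance is large; shuffling variable 1 changes nothing, so its importance is zero.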

using the first preprocessing approach, the importance plot indicates that the awards category ‘Music’ (c15) provides by far the most predictive power, followed by Cinematography (c7), Sound (c19), and Directing (c9)
preprocessing might have been problematic

Conclusion

model did not work well; a naive approach might have done better

do people in the class have suggestions?