Derek G. Nokes
Tuesday, March 31, 2015
combine predictions of many models to reduce the variability of predictions and improve model robustness
ensembles often perform better than single-model techniques in real-life modeling situations
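The variance-reduction claim above can be illustrated with a minimal sketch. It assumes each model's error is independent Gaussian noise around the true value, which is an idealization: real models trained on the same data have correlated errors, so the reduction is smaller in practice. The names `noisy_model` and `ensemble_prediction` are hypothetical and exist only for this illustration.

```python
import random
import statistics

def noisy_model(true_value, rng, noise_sd=1.0):
    """Hypothetical model: the true value plus independent Gaussian noise."""
    return true_value + rng.gauss(0.0, noise_sd)

def ensemble_prediction(true_value, rng, n_models):
    """Average the predictions of n_models independent noisy models."""
    return statistics.fmean(noisy_model(true_value, rng) for _ in range(n_models))

rng = random.Random(0)  # fixed seed so the run is repeatable
true_value = 5.0
n_trials = 2000

# Repeat the experiment many times to see how much each predictor varies.
single = [ensemble_prediction(true_value, rng, 1) for _ in range(n_trials)]
ensemble = [ensemble_prediction(true_value, rng, 25) for _ in range(n_trials)]

var_single = statistics.pvariance(single)
var_ensemble = statistics.pvariance(ensemble)
print(f"variance, single model:      {var_single:.3f}")
print(f"variance, 25-model ensemble: {var_ensemble:.3f}")
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the number of models combined.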
combine input data resampling with decision tree building
improves predictive performance over a single tree by reducing variance of prediction
introduce a random component into the tree building process through resampling, creating a distribution of trees, and a corresponding distribution of predicted values for each sample.
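A rough sketch of the resampling idea: each tree is trained on a bootstrap resample of the rows, so each sample gets a distribution of predicted values, and the ensemble prediction is their average. This is plain bagging with one-split regression trees (stumps) on synthetic data, not the model used in the talk; a full random forest would also sample candidate variables at each split. `fit_stump` and `bagged_predictions` are hypothetical helper names.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # only one distinct x value: fall back to the mean
        overall = statistics.fmean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def bagged_predictions(xs, ys, x_new, n_trees, rng):
    """Train n_trees stumps on bootstrap resamples; return each tree's prediction."""
    n = len(xs)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        tree = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(tree(x_new))
    return preds

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(60)]
# Synthetic data (an assumption for illustration): a noisy step at x = 5.
ys = [(1.0 if x > 5 else 0.0) + rng.gauss(0, 0.3) for x in xs]

preds = bagged_predictions(xs, ys, x_new=7.0, n_trees=50, rng=rng)
bagged = statistics.fmean(preds)
print(f"bagged prediction at x=7:  {bagged:.2f}")
print(f"spread across trees (sd):  {statistics.stdev(preds):.3f}")
```

The list `preds` is the distribution of predicted values for one sample; its mean is the bagged prediction and its spread indicates how much individual trees disagree.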
cross-validation: technique for assessing how well a model generalizes
repeatedly partition the data into training and test sets, train the model on the training set, test it on the held-out data that was not used in training, then average the results
purpose: detect and limit overfitting (i.e., improve the ability of our models to generalize from data) and thereby improve out-of-sample prediction (i.e., prediction of data that has not been used in training)
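The partition, train, test, and average loop described above might look like the following k-fold sketch. The least-squares line, the synthetic data, and the choice of mean squared error as the score are assumptions made for illustration; `k_fold_indices`, `cross_validate`, and `fit_line` are hypothetical names.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle row indices and split them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, rng):
    """Return the average and the per-fold out-of-sample mean squared error."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Train only on rows outside the current fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Score only on the held-out fold.
        mse = statistics.fmean((model(xs[i]) - ys[i]) ** 2 for i in test_idx)
        scores.append(mse)
    return statistics.fmean(scores), scores

def fit_line(xs, ys):
    """Hypothetical example model: ordinary least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

rng = random.Random(2)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]  # synthetic linear data

mean_mse, scores = cross_validate(xs, ys, k=5, fit=fit_line, rng=rng)
print("per-fold MSE:", [round(s, 3) for s in scores])
print(f"cross-validated MSE: {mean_mse:.3f}")
```

Every row is used for testing exactly once and is never scored by a model that saw it during training, which is what makes the averaged score an out-of-sample estimate.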
can estimate “importance” of a category variable
values of each variable are randomly permuted in the out-of-bag samples; corresponding decrease in accuracy of each tree is estimated
If average decrease over all trees is large, variable is considered important (i.e., its value makes a big difference in predicting the outcome).
If average decrease is small, variable does not make much difference to outcome.
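The permutation idea can be sketched as follows. This is a simplified version: it uses a single held-out set and a nearest-centroid classifier as a stand-in model, rather than the per-tree out-of-bag samples of a random forest, and the data are synthetic. `fit_centroids`, `accuracy`, and `permutation_importance` are hypothetical names for this illustration.

```python
import random
import statistics

def fit_centroids(rows, labels):
    """Hypothetical stand-in model: nearest-centroid classifier."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    def predict(row):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(row, centroids[c])))
    return predict

def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    return statistics.fmean(
        1.0 if model(r) == l else 0.0 for r, l in zip(rows, labels))

def permutation_importance(model, rows, labels, rng, n_repeats=20):
    """Mean drop in held-out accuracy when each variable's values are shuffled."""
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        repeat_drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between variable j and the label
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            repeat_drops.append(baseline - accuracy(model, permuted, labels))
        drops.append(statistics.fmean(repeat_drops))
    return drops

rng = random.Random(3)
labels = [i % 2 for i in range(400)]
# Synthetic data: variable 0 depends on the label; variable 1 is pure noise.
rows = [[3.0 * y + rng.gauss(0, 1), rng.gauss(0, 1)] for y in labels]
train_rows, test_rows = rows[:300], rows[300:]
train_labels, test_labels = labels[:300], labels[300:]

model = fit_centroids(train_rows, train_labels)
importance = permutation_importance(model, test_rows, test_labels, rng)
print("mean decrease in accuracy per variable:",
      [round(d, 3) for d in importance])
```

Shuffling the informative variable should cost a large share of the accuracy, while shuffling the noise variable should change it very little, matching the large versus small decrease described above.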
Using the first preprocessing approach, the importance plot indicates that the awards category ‘Music’ (c15) provides, by far, the most predictive power, followed by Cinematography (c7), Sound (c19), and Directing (c9).
- preprocessing might have been problematic
- model did not work well; we expect a naive approach might have done better