Monthly DO swing data were analyzed with rpart (single decision trees), caret (bootstrap aggregating, i.e., bagging), and randomForest (an ensemble of many decision trees). I will only include results and plots from runs that used a training data set and a testing data set for each month. The same monthly training set was run through rpart, caret, and randomForest. The training set for each month was a randomly selected 80 percent of that month's DO data; the remaining 20 percent was held out for testing.
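
For reference, the 80/20 split can be made in R roughly like this (do_month is a placeholder name for one month's data, not the actual object from my scripts):

```r
# Minimal sketch of the 80/20 split; do_month is a placeholder data frame.
set.seed(123)                                   # make the random split reproducible
train_idx <- sample(nrow(do_month), size = floor(0.8 * nrow(do_month)))
train_set <- do_month[train_idx, ]              # 80% used to fit the models
test_set  <- do_month[-train_idx, ]             # 20% held out for testing
```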

DECISION TREES USING RPART

Single decision trees using rpart. I took the training data set, created a decision tree, and calculated variable importance. I then pruned the decision tree, which snips off the least important internal nodes so that only the most useful terminal nodes (leaves) remain. I include a decision tree for each month plus a bar diagram of variable importance. I also ran the model's predictions against the test data set (the remaining 20%), but I won't include any of that here.
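
Roughly, the per-month rpart workflow looks like the sketch below. The object names (train_set, test_set, do_swing) are placeholders, and pruning at the lowest cross-validated error is one common rule rather than necessarily the exact rule I applied:

```r
library(rpart)

# Fit a regression tree on the training data (placeholder names throughout):
fit <- rpart(do_swing ~ ., data = train_set, method = "anova")

fit$variable.importance        # variable importance as stored by rpart

# Prune at the complexity parameter with the lowest cross-validated error;
# this snips off the least useful internal splits:
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

plot(pruned); text(pruned)                    # the pruned tree
pred <- predict(pruned, newdata = test_set)   # predictions on the 20% test set
```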

#APRIL TRAINING DATA

#APRIL TRAINING PRUNED DATA

#MAY TRAINING DATA

#MAY TRAINING PRUNED DATA

#JUNE TRAINING DATA

#JUNE TRAINING PRUNED DATA

#JULY TRAINING DATA

#JULY TRAINING PRUNED DATA

#AUGUST TRAINING DATA

#AUGUST TRAINING PRUNED DATA

#SEPTEMBER TRAINING DATA

#SEPTEMBER TRAINING PRUNED DATA

#OCTOBER TRAINING DATA

#OCTOBER TRAINING PRUNED DATA

BOOTSTRAPPED REGRESSION TREES USING CARET

I think this approach is more powerful than the single decision trees, but not as powerful as the random forest. For the bootstrapped (bagged) decision trees using caret, I took the training data set and created a bagged model, using 10-fold cross-validation. I calculated variable importance and include a plot for each month. I also ran the model's predictions against the test data set (the remaining 20%), but I won't include any of that here.

Note: this method does not plot a dendrogram tree, but I was able to plot the variable importance. This package calculates variable importance on a scale of 0-100.
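
In caret this corresponds roughly to the sketch below; "treebag" is caret's bagged CART method (it needs the ipred package installed), and the object names are placeholders:

```r
library(caret)

# Bagged-tree fit with 10-fold cross-validation (placeholder names):
ctrl    <- trainControl(method = "cv", number = 10)
bag_fit <- train(do_swing ~ ., data = train_set,
                 method = "treebag", trControl = ctrl)

vi <- varImp(bag_fit)          # scaled 0-100 by default, matching the note above
plot(vi)

pred <- predict(bag_fit, newdata = test_set)  # predictions on the 20% test set
```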

#APRIL TRAINING DATA

#MAY TRAINING DATA

#JUNE TRAINING DATA

#JULY TRAINING DATA

#AUGUST TRAINING DATA

#SEPTEMBER TRAINING DATA

#OCTOBER TRAINING DATA

RANDOM FOREST

This is the most powerful of the three methods I ran. In the random forest approach, a large number of decision trees are created. Every observation is fed into every decision tree, and the most common outcome for each observation is used as the final output. A new observation is fed into all the trees and a majority vote is taken across them; for a continuous target like DO swing, the tree predictions are averaged instead.
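
Roughly, the per-month fit looks like this sketch (object names are placeholders):

```r
library(randomForest)

# 500 trees is the randomForest default; importance = TRUE stores the
# permutation importance shown in the monthly plots.
rf_fit <- randomForest(do_swing ~ ., data = train_set,
                       ntree = 500, importance = TRUE)

varImpPlot(rf_fit)             # variable importance plot, one per month

# For a regression forest, predict() averages the trees rather than voting:
pred <- predict(rf_fit, newdata = test_set)
```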

This also does not produce a dendrogram. I included a plot of the important variables by month. I also ran another algorithm called Boruta, a different way to calculate variable importance that tells you which variables are confirmed important, which are tentatively important, and which are rejected. I can show you those results without including them here; the variables from the random forest run are probably good enough for what we are concerned with.
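
A Boruta run looks roughly like the sketch below (object names are placeholders):

```r
library(Boruta)

# Boruta compares each predictor against randomized "shadow" copies of the
# predictors and labels each variable Confirmed, Tentative, or Rejected.
set.seed(123)
bor <- Boruta(do_swing ~ ., data = train_set)

print(bor)          # counts of confirmed / tentative / rejected variables
plot(bor, las = 2)  # importance boxplots for each variable
```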

#APRIL TRAINING DATA

#MAY TRAINING DATA

#JUNE TRAINING DATA

#JULY TRAINING DATA

#AUGUST TRAINING DATA

#SEPTEMBER TRAINING DATA

#OCTOBER TRAINING DATA

#MODEL PREDICTIONS

For each of the three types of models, I used the training data to calibrate the decision trees. I then compared the predicted DO swing values against the actual DO swing values from the "testing" data set. Some look very good, some not so much. I'll show them to you next time we are online; I have to show them on my screen, or I suppose I can put them in an Excel file if you want to look them over.
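
Roughly, each predicted-vs-actual comparison looks like this sketch (rf_fit, test_set, and do_swing are placeholder names); I made one such comparison per model per month:

```r
# Compare model predictions against observed values on the 20% test set:
pred   <- predict(rf_fit, newdata = test_set)
actual <- test_set$do_swing

sqrt(mean((pred - actual)^2))   # RMSE, one overall error number

plot(actual, pred,
     xlab = "Observed DO swing", ylab = "Predicted DO swing")
abline(0, 1)                    # points on this line would be perfect predictions
```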