Homework #10

Question #1

The first model we considered successful came from cleaning up all of the data and then running forward variable selection, which gave us a model with about 29 variables. The residual plots showed potential high-leverage points and large residuals, and the Q-Q plot was not linear. So we plotted every single predictor against the response variable to see which ones needed to be transformed (either log or square root), and we took out the high-leverage points as well. This resulted in a model in the top 15, which we are pretty content with at the moment.
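
A minimal sketch of the leverage check described above, written in Python purely for illustration (our actual work was done in R). It assumes a regression with an intercept and a single predictor, where the leverage of observation i is 1/n + (x_i - mean)^2 / Sxx, and it uses the common rule-of-thumb cutoff 2(p+1)/n. The data and the function names are made up for the example.

```python
import math

def leverages(x):
    """Leverage h_i of each observation in a one-predictor regression with intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

def high_leverage_indices(x, p=1):
    """Indices whose leverage exceeds the rule-of-thumb cutoff 2(p+1)/n."""
    cutoff = 2 * (p + 1) / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]

# Hypothetical right-skewed predictor with one extreme value (not our real data)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 200.0]
flagged = high_leverage_indices(x)

# A log transform pulls the extreme value toward the rest of the data,
# which lowers its leverage even if it does not remove the flag.
log_x = [math.log(v) for v in x]
leverage_before = leverages(x)[-1]
leverage_after = leverages(log_x)[-1]
```

With more than one predictor the leverages come from the diagonal of the hat matrix instead, but the idea of comparing each value to a cutoff is the same.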

For our other model, I ran a random forest on our data as well as a pruned tree. Although the computation took a while, I eventually obtained a model with around 20 variables. After examining the pruned tree, I took out a couple of variables that did not make a significant change to the overall model. From there I ran a regression and submitted the model, but found that the testing error was pretty large. My group and I tried transforming the data, but after doing all we could we still had a model that was hovering around the “B” range and was not nearly as good as the model mentioned above. We scrapped the pruned tree and went with a random forest, but ran into a similar predicament. Because our model above was already pretty solid, we scrapped this idea entirely.

Question #2

We tried many other approaches that either just did not work or produced a very large testing error. For our very first model, we just ran ordinary linear regression on all the quantitative variables and submitted it. Surprisingly, this model beat a lot of the class! We also tried LASSO, but for some reason our code would not run; my theory is that the number of predictors and the sample size were too large, but I'm not too sure. Additionally, I ran ridge regression, but with the number of predictors the testing error was large. Lastly, when trying to find the best subset for our regression, I attempted to run k-fold cross-validation, but with the number of variables and observations, the double loop could never finish without R aborting. This could have been human error on my part, but nonetheless the model we ultimately obtained was very good.
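
For reference, here is a minimal sketch of k-fold cross-validation used to compare candidate models, in Python for illustration only. It is not the R code that failed for us. It assumes each candidate is a one-predictor least-squares line so the fit has a closed form, and it assigns the folds once and reuses them for every candidate. The data, the seeds, and all function names are made up.

```python
import random

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def make_folds(n, k, seed=0):
    """Shuffle the row indices once and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_mse(x, y, folds):
    """Average held-out mean squared error across the folds."""
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        a, b = fit_simple_ols([x[i] for i in train], [y[i] for i in train])
        sq = [(y[i] - (a + b * x[i])) ** 2 for i in held_out]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / len(fold_errors)

# Made-up data: y depends on x1, while x2 is pure noise
rng = random.Random(1)
x1 = [rng.uniform(0, 10) for _ in range(60)]
x2 = [rng.uniform(0, 10) for _ in range(60)]
y = [3.0 + 2.0 * a + rng.gauss(0, 1) for a in x1]

folds = make_folds(len(y), k=5)
scores = {"x1": cv_mse(x1, y, folds), "x2": cv_mse(x2, y, folds)}
best = min(scores, key=scores.get)  # candidate with the lowest CV error
```

The cost is one fit per candidate per fold, so searching over every subset of a large set of predictors grows very quickly. That may be part of why our double loop never finished, though we did not confirm it.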

Question #3

The most important advice I would give a future Stats 101C student starting this project is to sort out first exactly what you want each of your group members to do. A non-unified group can make the project a lot harder than it should be; meeting up for even 30 minutes every other day can make sure your group is staying on top of what needs to be done. From a statistical standpoint, I would recommend first taking some time to really look at your training set to see how many variables are not quantitative, and to eliminate variables that are obviously not important. Cleaning up the data should also be one of the first things your group does before running any best predictor function or transforming variables; without cleaning up the data, running any sort of best predictor function will most likely result in errors. You must code the qualitative variables, and you should compute the correlation of each predictor with the response variable to obtain a starting point. Lastly, I would tell the student to explore every possible regression technique that we learned in class, especially the ones later in the course. Although it might be time-consuming and troublesome to use every strategy, you will have a plethora of models to choose from and transform. The more models you have, the better your chance of coming up with a very low testing error.
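
As a rough sketch of the coding and correlation step recommended above (in Python for illustration; the column names and values are invented, not from the project data), a qualitative variable can be turned into 0/1 indicator columns, and every predictor can then be ranked by the absolute value of its correlation with the response:

```python
import math

def dummy_code(values):
    """Turn a qualitative column into 0/1 indicator columns.

    The first level (in sorted order) is dropped and acts as the baseline.
    """
    levels = sorted(set(values))
    return {lvl: [1.0 if v == lvl else 0.0 for v in values] for lvl in levels[1:]}

def pearson(x, y):
    """Pearson correlation; assumes neither column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical training data
response = [10.0, 12.0, 15.0, 11.0, 18.0, 20.0]
predictors = {
    "size": [1.0, 2.0, 3.0, 1.5, 4.0, 5.0],
    "age": [5.0, 3.0, 4.0, 5.0, 3.0, 4.0],
}
region = ["north", "south", "south", "north", "west", "west"]

# Add one indicator column per non-baseline level of the qualitative variable
for level, column in dummy_code(region).items():
    predictors["region_" + level] = column

# Rank predictors by the strength of their linear association with the response
ranked = sorted(
    predictors,
    key=lambda name: abs(pearson(predictors[name], response)),
    reverse=True,
)
```

A ranking like this is only a starting point, since a predictor with a weak marginal correlation can still matter once other variables or transformations are in the model.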

Dakota McKenzie

Thursday, June 04, 2015