STA 363 Project Part 2

Getting Started

Open the same Markdown file you used for Project Part 1. You will be continuing to work on the same Markdown file for this part of the project. Keep everything!

Again, you are ADDING to your Part 1 file. You need all the previous sections in your submission for this project.

PRO TIP 1

Note that this project is a paper, not a lab. This means you need complete sentences, proper grammar and spelling, and you need to be clear in your steps and explanations. Let Dr. Dalzell know if you have any questions!

PRO TIP 2

Every section in your paper must have a transition sentence, something like “In this section, we will…”. This helps your reader follow your work, and it will also help you structure your paper.

Section R:

Section R: Revisions

This section is going to feel a little disjointed from the rest of the paper, and that is okay. If you don’t like it in the paper, you can submit it in the comments on the submission on Canvas instead! Either way, you need to:

Look at your comments from Project Part 1.
If I made any suggestions for changes or improvements, you need to make these changes in the Part 1 portion of the document you submit with Part 2. In other words, since you are putting your Part 2 in the same document as Part 1, scroll up to the Part 1 component and make your changes there.
In this section, you are going to clearly indicate how you addressed each of those suggestions. For example: “Comment: Add transition sentences. Addressed: Transition sentences were added.”
You WILL lose points if you do not address each comment, so make sure you let me know if you have any questions!!

Section 4:

Section 4.1: Regression without Selection

In this section, you are going to apply an appropriate regression model to predict your response \(Y\).

Your task is to:

Determine whether ridge regression is needed for your application. Clearly explain why or why not.
Build a regression model (or ridge regression model) using either linear regression, logistic regression, or multinomial regression (depending on your \(Y\)). NOTE: If you are using ridge, you will need to use the model.matrix() code from the slides if you have any categorical variables!! You likely also did this in Project Part 1.
Show a formatted table of your coefficients and state the value of any needed tuning parameters.
Compute and interpret an appropriate test/validation metric to assess how well your first regression model is doing at prediction. Most of you will not have test data, so think about how you can get around this using methods we have learned in class!
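As a rough illustration of the steps above, here is a minimal sketch of ridge regression with a numeric \(Y\). It is not the required solution: the built-in mtcars data stands in for your project data, mpg stands in for your \(Y\), and the object names (dat, ridge_cv, coef_table) are made up for this example. It assumes the glmnet and knitr packages are installed.

```r
# Sketch only: mtcars and mpg are stand-ins for your own data and Y.
library(glmnet)   # assumed installed; provides cv.glmnet()

set.seed(363)     # makes the random cross-validation folds reproducible

dat <- mtcars
dat$cyl <- factor(dat$cyl)   # treat cylinders as a categorical predictor

# model.matrix() turns categorical variables into indicator columns.
# [, -1] drops the intercept column because glmnet adds its own intercept.
X <- model.matrix(mpg ~ ., data = dat)[, -1]
y <- dat$mpg

# alpha = 0 requests the ridge penalty; nfolds = 5 uses 5-fold cross-validation.
ridge_cv <- cv.glmnet(X, y, alpha = 0, nfolds = 5)

# Tuning parameter: the lambda with the smallest cross-validated error.
best_lambda <- ridge_cv$lambda.min

# By default glmnet reports cvm as mean squared error for numeric Y,
# so the square root gives a cross-validated RMSE.
cv_rmse <- sqrt(min(ridge_cv$cvm))

# Coefficients at the chosen lambda, formatted as a table instead of raw output.
coefs <- as.matrix(coef(ridge_cv, s = "lambda.min"))
coef_table <- data.frame(Term = rownames(coefs),
                         Estimate = round(coefs[, 1], 3),
                         row.names = NULL)
knitr::kable(coef_table, caption = "Ridge coefficients at the chosen lambda")
```

Because the metric comes from cross-validation, this sketch does not need a separate test set. You would still need to explain in words why ridge is (or is not) appropriate for your data.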

Coding Tips: Categorical Y

If you need a logistic regression model, your code is glm(y ~ x, data = dataset, family = "binomial").
If you need to do ridge regression with a binary \(Y\), the code is the same as what we use with numeric \(Y\) except that you add family= "binomial" to the end of the code.
If you need a multinomial regression model, you need to load the nnet package using library(nnet). Once that is loaded, you use multinom(y ~ x, data = dataset).
If you need to do ridge regression with a categorical \(Y\) with more than two levels, the code is the same as what we use with numeric \(Y\) except that you add family= "multinomial" to the end of the code.
For either ridge scenario, to do prediction you need to use predict(model, newx = ) and after newx = you need to supply the data you want to predict on. This data set needs to be a matrix. Example: testx <- model.matrix(Y ~ ., data = test)[,-1], and then predict(model, newx = testx).
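To make the ridge tips above concrete, here is a minimal sketch with a binary \(Y\). The data are an assumption for illustration only: two species from the built-in iris data stand in for your \(Y\), and the train/test split is hypothetical. It assumes the glmnet package is installed.

```r
# Sketch only: two iris species stand in for a binary Y.
library(glmnet)   # assumed installed

set.seed(363)

# Keep two classes so the response is binary.
dat <- droplevels(subset(iris, Species != "setosa"))

# Hypothetical split, just to have rows to predict on.
test_rows <- sample(nrow(dat), 20)
train <- dat[-test_rows, ]
test  <- dat[test_rows, ]

# glmnet needs matrices, not data frames.
trainx <- model.matrix(Species ~ ., data = train)[, -1]
testx  <- model.matrix(Species ~ ., data = test)[, -1]

# Same code as numeric Y, plus family = "binomial"; alpha = 0 is ridge.
ridge_cv <- cv.glmnet(trainx, train$Species, alpha = 0,
                      family = "binomial", nfolds = 5)

# Predict with newx = (a matrix); type = "class" returns predicted labels.
pred_class <- as.vector(predict(ridge_cv, newx = testx,
                                s = "lambda.min", type = "class"))

# Compare predictions with the truth.
conf_mat <- table(Predicted = pred_class, Actual = test$Species)
accuracy <- mean(pred_class == as.character(test$Species))
```

For more than two classes, the same pattern should work with family = "multinomial".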

For my folks using categorical Y

In your output from ridge regression, if you need it, you know that for numeric \(Y\), cvm gives us the cross-validated mean squared error, whose square root is the RMSE. However, if your \(Y\) is categorical, this is not the case!!
Instead, cvm is the deviance, and it is not something we usually interpret. You still choose \(\lambda\) by finding the smallest value of cvm though, so you can still plot it!
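Here is a minimal sketch of plotting cvm against \(\lambda\) when \(Y\) is categorical. The built-in iris data stands in for your data, and the object names are made up for this example. It assumes the glmnet package is installed.

```r
# Sketch only: iris Species stands in for a categorical Y with 3 levels.
library(glmnet)   # assumed installed

set.seed(363)

X <- model.matrix(Species ~ ., data = iris)[, -1]

# Ridge (alpha = 0) with a multi-level categorical response.
multi_cv <- cv.glmnet(X, iris$Species, alpha = 0, family = "multinomial")

# Plot the cross-validated deviance (cvm) against log(lambda).
plot(log(multi_cv$lambda), multi_cv$cvm, type = "b",
     xlab = "log(lambda)",
     ylab = "Cross-validated deviance (cvm)",
     main = "Choosing lambda for ridge regression")

# The chosen lambda is the one with the smallest cvm.
best_lambda <- multi_cv$lambda[which.min(multi_cv$cvm)]
abline(v = log(best_lambda), lty = 2)   # mark the chosen value
```

The dashed vertical line marks the chosen \(\lambda\), which gives you a plot to support the tuning parameter you report.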

Section 4.2: Regression with Feature Selection

Your task is to:

Use a penalized regression model that allows feature selection to predict \(Y\).
Clearly explain which technique you chose and why.
State any tuning parameters needed for your approach. Show a plot or table to support your choice of tuning parameter.
Compute and interpret an appropriate test/validation metric to assess how well your penalized regression model is doing at prediction.
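One possible approach to the tasks above is lasso, sketched below. This is only an illustration, not a statement of which technique you should choose: you still need to pick a technique and justify it for your own data. The mtcars data and mpg stand in for your data and \(Y\), and the code assumes the glmnet package is installed.

```r
# Sketch only: mtcars and mpg are stand-ins; lasso is one option among several.
library(glmnet)   # assumed installed

set.seed(363)

dat <- mtcars
dat$cyl <- factor(dat$cyl)
X <- model.matrix(mpg ~ ., data = dat)[, -1]
y <- dat$mpg

# alpha = 1 requests the lasso penalty, which can shrink coefficients to exactly 0.
lasso_cv <- cv.glmnet(X, y, alpha = 1, nfolds = 5)

# Plot of cross-validated error against log(lambda) to support the tuning choice.
plot(lasso_cv)

best_lambda <- lasso_cv$lambda.min

# Features kept by lasso: those with a nonzero coefficient at the chosen lambda.
coefs <- as.matrix(coef(lasso_cv, s = "lambda.min"))
selected <- rownames(coefs)[coefs[, 1] != 0 & rownames(coefs) != "(Intercept)"]

# For numeric Y, glmnet reports cvm as mean squared error by default.
cv_rmse <- sqrt(min(lasso_cv$cvm))
```

The vector selected lists the features the model kept, which you can report in a formatted table alongside the chosen \(\lambda\).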

If your Y is categorical, see the coding notes in the previous section!

Section 4.3: Comparison

Your task is to:

State whether you would recommend using your model from Section 4.1 or your model from Section 4.2 for prediction and explain your choice.
Using the model you recommend, show a plot or table of your predictions versus the true values of \(Y\) and comment on how well regression is doing at predicting \(Y\).
State whether you would recommend using your chosen regression model or KNN to predict \(Y\), and explain your choice.
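For the plot of predictions versus true values, here is a minimal sketch that uses 5-fold cross-validation so each prediction comes from a model that did not see that row. It uses a plain linear model on the built-in mtcars data purely as a stand-in; the predictors wt and hp and all object names are assumptions for this example.

```r
# Sketch only: mtcars, mpg, wt and hp are stand-ins for your own data.
set.seed(363)

dat <- mtcars
k <- 5

# Randomly assign each row to one of k folds.
fold <- sample(rep(1:k, length.out = nrow(dat)))

# Store one out-of-fold prediction per row.
pred <- numeric(nrow(dat))
for (f in 1:k) {
  fit <- lm(mpg ~ wt + hp, data = dat[fold != f, ])        # train on the other folds
  pred[fold == f] <- predict(fit, newdata = dat[fold == f, ])  # predict the held-out fold
}

# Predictions versus true values; points near the dashed line are predicted well.
plot(dat$mpg, pred,
     xlab = "True Y", ylab = "Predicted Y",
     main = "Cross-validated predictions versus true values")
abline(0, 1, lty = 2)

# Cross-validated RMSE, which can be compared with the same metric from KNN.
cv_rmse <- sqrt(mean((dat$mpg - pred)^2))
```

If your \(Y\) is categorical, a table of predicted versus true classes plays the same role as this plot.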

Turning in your assignment

You have completed Part 2! We will build on this to complete Part 3.

Submission Component 1:

Submit a .Rmd file showing all of your code in such a way that Dr. Dalzell can re-run it and get the same answers as you show in your paper.
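Anything that uses random numbers, such as assigning cross-validation folds, gives different answers each time it is run unless you set a seed first. Here is a small sketch showing the idea; the seed value 363 is arbitrary.

```r
# Setting the same seed before a random step reproduces the same result.
set.seed(363)
folds_a <- sample(rep(1:5, length.out = 20))

set.seed(363)
folds_b <- sample(rep(1:5, length.out = 20))

# TRUE when both runs produced the same fold assignment.
same_folds <- identical(folds_a, folds_b)
```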

Submission Component 2:

Submit a PDF or html version of your work. Make sure:

You have run spell check.
There is NO raw R output (meaning no R output that is not formatted).
There is NO formatted output or plot of any kind that does not have words right near it to describe the output.
All plots have labelled axes and titles or captions.
You do NOT have any super long output (like printing 50 numbers on the screen).

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.