STA 363 Project Part 1

Getting Started

Change the first chunk in your Markdown file to match the code below. This hides all of your code so that you have a professional-looking file at the end!

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message= FALSE, fig.asp = .6)

Load your data set into R. To do this, you need to:

  • Step 1: Download your data onto your computer. Usually, this means it ends up in Downloads.
  • Step 2: In your Environment tab (usually upper right), choose the Import Dataset button.

  • Step 3: Choose From Text (base) if your data is a CSV file and From Excel if your data set is an Excel spreadsheet.

  • Step 4: Follow the steps prompted to load your data into R.

  • Step 5: Create a chunk in your Project RMarkdown file. Look at the bottom of your RStudio screen and find a line of code that relates to your data set (it will NOT match what you see in the picture).

  • Step 6: Copy that line of code and paste it into the chunk you created in your Project Markdown file.

Doing these steps will mean that your data will be loaded and ready to work with.
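As a rough illustration of what Steps 5 and 6 leave in your chunk, the line you copy is usually a single read.csv call. The sketch below writes a tiny CSV to a temporary file so that it runs on any computer. The file, the column names, and the object name mydata are all made up for this example; your own line will point at your own file instead.

```r
# Hypothetical stand-in for a downloaded file: write a tiny CSV to a temp location
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(age = c(2, 5, 7), mass = c(4100, 5200, 4800)),
  csv_path,
  row.names = FALSE
)

# The kind of line you would paste into your chunk (your path will differ)
mydata <- read.csv(csv_path)

# Quick check that the data loaded: number of rows and columns
dim(mydata)
```

Running dim() right after loading is a quick way to confirm that the row and column counts match what you expect from the original file.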

PRO TIP 1

Note that this project is a paper, not a lab. This means you need complete sentences, proper grammar and spelling, and you need to be clear in your steps and explanations. Let Dr. Dalzell know if you have any questions!

PRO TIP 2

Every section in your paper must have a transition sentence, something like “In this section, we will…”. This helps your reader follow your work, and it will also help you structure your paper.

Section 1:

Section 1: Data and Goal Overview

In this section, you are going to provide a short overview of your data and your goals for modeling. This serves as an introduction to your paper.

This section should include the following. Note that I am including bullet points just to make the requirements clearer, but you may NOT have bullet points in your final paper. Write in complete sentences.

  • A brief description of the data set. Example: “For my analysis, I have chosen to work with a data set on red pandas I obtained from the National Science Foundation. It has 4513 rows with 12 columns.”
  • Your prediction goal. Example: “The goal of this project is to predict how much a red panda eats in a day.”
  • A list of features. Example “We will be using panda age in years, body mass in grams, sex, status (captivity or wild), length in mm, and location (climate region) to predict food consumption.” Note: For this, you may feel free to use bullets, and if you have more than 8 features, you can just list 8.
  • Why you are interested (this part is just fun for me to read). Example: “I love red pandas and always have, so I thought it would be interesting to model something that would be helpful for zoo keepers and conservationists to know.”

Section 2:

Section 2: EDA

In this section, you are going to conduct exploratory data analysis and perform any needed cleaning steps.

Your task is to:

  • Perform any needed data cleaning steps. Clearly describe what steps you took, and make sure to be clear about how many rows and features you have at the end of your cleaning. Remember, you are required to have at least 6 features and 250 rows at the end of cleaning!
  • We have discussed some key EDA steps you need to take before modeling. Implement those on your data, and show any needed plots or tables. Clearly explain why you made each plot/table and what information it provides. Formatting is part of your grade!
  • Show any other needed EDA. This will differ quite a bit depending on your data set, but explore. What relationships do you see in your data? Is there anything unusual that might impact your modeling? If you show me a plot or a table, you need to write something about what the plot/table shows and why you created it. Example: “To check to see if there is a relationship between body mass and panda length, we created Figure 1. Though we might think there was a relationship between these two features, it appears in Figure 1 to not be a strong relationship.”

Note: I am looking for careful consideration and quality here, not quantity!!! If you show me 10 plots for no reason, that doesn’t help your analysis.
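To make the "how many rows and features at the end of cleaning" requirement concrete, here is a minimal sketch on a made-up data set. It assumes the only cleaning issue is missing values and that dropping incomplete rows is acceptable, which may not be the right choice for your data. The object names mydata and cleaned are hypothetical.

```r
# Toy data (made up for illustration) with some missing values
mydata <- data.frame(
  food = c(310, NA, 402, 350, 290),
  age  = c(2, 5, NA, 3, 4),
  sex  = c("F", "M", "F", "M", "F")
)

# How many missing values are in each column?
colSums(is.na(mydata))

# One possible cleaning step: keep only the rows with no missing values
cleaned <- na.omit(mydata)

# Rows and columns before and after cleaning, to report in the paper
dim(mydata)
dim(cleaned)
```

Whatever cleaning steps you use, reporting the dimensions before and after makes it easy for the reader to see how much data was removed.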

Section 3:

In this section, you are going to use KNN to predict your response \(Y\). There are a few things to think about before you dive into this section.

Categorical Features: If you have categorical features and you want to use them for KNN, you need to use what is called one-hot-encoding to convert those features to 0s and 1s so that they are usable for KNN. To do this, you need to use a function called model.matrix that we will be learning soon for penalized regression.

Suppose I have a column y that has my \(Y\) variable in a data set called mydata. The code I need is:

# One-hot-encoding
dataKNN <- model.matrix( y ~ . , data = mydata)[,-1]
dataKNN <- data.frame(dataKNN)

# Add the y variable back on
dataKNN$y <- mydata$y

You then use dataKNN for running KNN instead of mydata.
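To see what the encoding does, here is a small sketch that runs the same steps on a made-up data set with one numeric feature and one categorical feature. The values and the column names age and status are assumptions for illustration only.

```r
# Toy data (made up): a numeric response, one numeric and one categorical feature
mydata <- data.frame(
  y      = c(310, 295, 402, 350),
  age    = c(2, 5, 7, 3),
  status = factor(c("wild", "captivity", "wild", "captivity"))
)

# One-hot-encoding; [, -1] drops the intercept column that model.matrix adds
dataKNN <- model.matrix(y ~ ., data = mydata)[, -1]
dataKNN <- data.frame(dataKNN)

# Add the y variable back on
dataKNN$y <- mydata$y

# Inspect the new column names and the encoded data
names(dataKNN)
dataKNN
```

In this toy example the status column is replaced by a single 0/1 column, statuswild, because model.matrix treats the first level (captivity) as the baseline. A feature with more levels would produce one 0/1 column for every level except the baseline.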

Standardizing: One issue that can arise with KNN with Euclidean distance is that not all features are on the same scale. This means larger features can dominate the distance measure in KNN. To avoid this, we often scale our numeric features so that they all have a mean of 0 and a standard deviation of 1. This can be done easily in R, but there are 2 things you need to check.

    1. You need to make sure any categorical variables are actually treated as categorical by R. If R treats a column (called column in the code below) as a number when it should be a category, you can use this code to convert the variable to categorical:
dataKNN$column <- as.factor( dataKNN$column )
    2. We do NOT want to scale \(Y\). \(Y\) is not used in computing the distance, so leave it alone!

Once you are aware of 1) and 2), suppose my \(Y\) variable is in column 12 of my data set, dataKNN. I then use the following code to scale the numeric features:

# mutate(), across(), and where() come from the dplyr package
library(dplyr)
dataKNN[,-12] <- dataKNN[,-12] |>
  mutate(across(where(is.numeric), ~as.vector(scale(.))))

I encourage you to try KNN with and without the scaling to see which gives you better predictions! It can depend on your data set, and it’s easy to try it.
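If you want to check that the scaling worked, the sketch below does the same job in base R on a made-up data set and then verifies the result. The data, the column names, and the position of \(Y\) (column 3 here) are assumptions for illustration; it is meant as an alternative way to see the idea, not a replacement for the code above.

```r
# Toy data (made up): two numeric features and the response y in column 3
dataKNN <- data.frame(
  age  = c(2, 5, 7, 3, 9),
  mass = c(4100, 5200, 4800, 3900, 6100),
  y    = c(310, 295, 402, 350, 420)
)
y_col <- 3

# Scale every numeric feature column, leaving y out entirely
features <- dataKNN[, -y_col]
is_num <- sapply(features, is.numeric)
features[is_num] <- lapply(features[is_num], function(x) as.vector(scale(x)))
dataKNN[, -y_col] <- features

# Each scaled feature should have a mean of (essentially) 0 and a standard deviation of 1
round(colMeans(dataKNN[, -y_col]), 10)
sapply(dataKNN[, -y_col], sd)

# y should be unchanged
dataKNN$y
```

Checking the means and standard deviations after scaling is a quick way to catch mistakes such as accidentally scaling \(Y\) or skipping a feature.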

Section 3: KNN

Your task is to:

  • Briefly explain how KNN can be used to predict \(Y\) in your application.
  • Briefly explain how you choose \(K\) and what \(K\) you recommend. Show an appropriate plot or table to back up your choice.
  • Compute an appropriate metric to assess how well KNN would do at prediction on test data. Most of you will not have test data, so think how you can get around this using methods we have learned in class!
  • Show a plot or table of your predictions versus the true values of \(Y\) and comment on how well KNN is doing at predicting \(Y\).

Throughout this section, if you choose any approaches to create validation data, you need to be clear about which methods you choose and why.
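As one possible illustration of how these pieces can fit together, the sketch below uses simulated data, a hand-written KNN function, and 5-fold cross-validation to compare values of \(K\) by RMSE. Everything here is an assumption made for the example: the data are made up, \(Y\) is assumed to be numeric, and the helper names knn_predict and cv_rmse are hypothetical. The functions and validation approach used in class may differ, and a categorical \(Y\) would need a different prediction rule and metric.

```r
set.seed(363)  # makes the simulated data and the fold assignment reproducible

# Simulated data (made up): two features already on the same scale, numeric y
n <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(x1 = x1, x2 = x2, y = 2 * x1 - x2 + rnorm(n, sd = 0.5))

# Hand-written KNN for a numeric response:
# predict each test row with the average y of its K nearest training rows
knn_predict <- function(train_x, train_y, test_x, K) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  sapply(seq_len(nrow(test_x)), function(i) {
    # Euclidean distance from test row i to every training row
    d <- sqrt(rowSums(sweep(train_x, 2, test_x[i, ])^2))
    mean(train_y[order(d)[1:K]])
  })
}

# 5-fold cross-validation: each row is predicted using only the other folds
folds <- sample(rep(1:5, length.out = n))
cv_rmse <- function(K) {
  preds <- numeric(n)
  for (f in 1:5) {
    test <- folds == f
    preds[test] <- knn_predict(toy[!test, c("x1", "x2")], toy$y[!test],
                               toy[test, c("x1", "x2")], K)
  }
  sqrt(mean((toy$y - preds)^2))
}

# Compare a grid of K values and keep the one with the smallest RMSE
K_grid <- 1:20
rmse <- sapply(K_grid, cv_rmse)
best_K <- K_grid[which.min(rmse)]

# Table of RMSE by K, which could back up the choice of K
data.frame(K = K_grid, RMSE = round(rmse, 3))
```

Because every prediction in this sketch comes from a model that never saw the row being predicted, the RMSE gives an estimate of how KNN might perform on new data. The same predictions could also be plotted against the true values of \(Y\).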

Turning in your assignment

You have completed Part 1! We will build on this to complete Part 2.

The Knit Document

Go ahead and knit your assignment. Make sure:

  • You change your first chunk from knitr::opts_chunk$set(echo=TRUE) to knitr::opts_chunk$set(echo = FALSE, message= FALSE, warning = FALSE)
  • You have run spell check.
  • There is NO R output of any kind that does not have words right near it to describe the output.
  • All plots have labelled axes and titles or captions.
  • You do NOT have any long output (like printing 50 numbers on the screen.)

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or HTML document to Canvas. No other formats will be accepted. Look through the final document to make sure everything has knit correctly.

The Code

You also need to submit a .Rmd file showing all of your code in such a way that Dr. Dalzell can re-run it and get the same answers as you show in your paper.
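If any step in your code involves randomness (for example, randomly splitting rows into groups), re-running the file can give different answers unless you set a seed first. The sketch below shows the idea; the seed value 363 is an arbitrary choice for illustration.

```r
# Setting the same seed before a random step gives the same result each time
set.seed(363)
first_run <- sample(1:10)

set.seed(363)
second_run <- sample(1:10)

# TRUE when the two runs produced exactly the same ordering
identical(first_run, second_run)
```

Placing a single set.seed call near the top of your .Rmd file, before any random step, is usually enough to make the whole document reproducible when it is knit from start to finish.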