STA 363 Project Part 1
Getting Started
Change the first chunk in your Markdown file to match what is below. This allows you to hide all your code so you have a professional-looking file at the end!
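The chunk referred to above is not reproduced in this copy of the handout. As a rough sketch only, a setup chunk that hides code commonly contains a line like the one below. The specific options here are an assumption, so defer to the version shown in class if it differs.

```r
# Assumed setup chunk contents; the chunk in the original handout may differ.
# echo = FALSE hides the code in the knitted file.
# message = FALSE and warning = FALSE hide package messages and warnings.
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

With these options set in the first chunk, every later chunk shows only its output (plots and tables), not the code that produced it.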
Load your data set into R. To do this, you need to:
- Step 1: Download your data onto your computer. Usually, this means it ends up in Downloads.
- Step 2: In your Environment tab (usually upper right), choose the symbol to import data.
- Step 3: Choose From Text (base) if your data is a CSV file and From Excel if your data set is an Excel spreadsheet.
- Step 4: Follow the steps prompted to load your data into R.
- Step 5: Create a chunk in your Project RMarkdown file. Look at the bottom of your RStudio screen and find a line of code that relates to your data set (it will NOT match what you see in the picture).
- Step 6: Copy that line of code and paste it into the chunk you created in your Project Markdown file.
Doing these steps will mean that your data will be loaded and ready to work with.
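For a CSV file, the line you copy in Step 6 will typically be a `read.csv()` call. The sketch below shows the general shape. The file name, the object name `red_pandas`, and the values are all hypothetical, and the example writes a tiny temporary file only so that it can run on any computer.

```r
# Hypothetical example: write a tiny CSV to a temporary folder so this sketch runs anywhere.
demo_path <- file.path(tempdir(), "red_pandas_demo.csv")
write.csv(
  data.frame(age = c(2, 5, 7), mass_g = c(4100, 5200, 4800)),
  demo_path,
  row.names = FALSE
)

# This is the kind of line you would copy from the bottom of your RStudio screen.
# In your project, the path points to your own file (often in Downloads).
red_pandas <- read.csv(demo_path)

# Quick check that the data loaded: number of rows, then number of columns.
dim(red_pandas)
```

Pasting the import line into a chunk matters because the knitted document runs from scratch and does not see data loaded only through the Environment tab.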
PRO TIP 1
Note that this project is a paper, not a lab. This means you need complete sentences, proper grammar and spelling, and you need to be clear in your steps and explanations. Let Dr. Dalzell know if you have any questions!
PRO TIP 2
Every section in your paper must have a transition sentence, something like “In this section, we will…”. This helps your reader follow your work, and it will also help you structure your paper.
Section 1: Data and Goal Overview
In this section, you are going to provide a short overview of your data and your goals for modeling. This serves as an introduction to your paper.
This section should include the following. Note that I am including bullet points just to make the requirements clearer, but you may NOT have bullet points in your final paper. Write in complete sentences.
- A brief description of the data set. Example: “For my analysis, I have chosen to work with a data set on red pandas I obtained from the National Science Foundation. It has 4513 rows with 12 columns.”
- Your prediction goal. Example: “The goal of this project is to predict how much a red panda eats in a day.”
- A list of features. Example: “We will be using panda age in years, body mass in grams, sex, status (captivity or wild), length in mm, and location (climate region) to predict food consumption.” Note: For this, you may feel free to use bullets, and if you have more than 8 features, you can just list 8.
- Why you are interested (this part is just fun for me to read). Example: “I love red pandas and always have, so I thought it would be interesting to model something that would be helpful for zoo keepers and conservationists to know.”
Section 2: EDA
In this section, you are going to conduct exploratory data analysis and perform any needed cleaning steps.
Your task is to:
- Perform any needed data cleaning steps. Clearly describe what steps you took, and make sure to be clear about how many rows and features you have at the end of your cleaning. Remember, you are required to have at least 6 at the end of cleaning!
- We have discussed some key EDA steps you need to take before modeling. Implement those on your data, and show any needed plots or tables. Formatting is part of your grade!
- Show any other needed EDA. This will differ quite a bit depending on your data set, but explore. What relationships do you see in your data? Is there anything unusual that might impact your modeling? If you show me a plot or a table, you need to write something about what the plot/table shows and why you created it. Example: “To check to see if there is a relationship between body mass and panda length, we created the following plot. Though we might think there was a relationship between these two features, it appears in Figure 1 to not be a strong relationship.”
Note: I am looking for careful consideration and quality here, not quantity! If you show me 10 plots for no reason, that doesn’t help your analysis.
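As an illustration of the kind of cleaning and plotting described in this section, here is a minimal sketch. The data are simulated, the column names are hypothetical, and dropping rows with missing values is only one possible cleaning choice; the steps that fit your data set may be different.

```r
# Hypothetical data standing in for your own data set (all values are simulated).
set.seed(363)
pandas <- data.frame(
  age       = round(runif(60, 1, 12)),
  mass_g    = round(rnorm(60, mean = 5000, sd = 600)),
  length_mm = round(rnorm(60, mean = 580, sd = 40))
)
pandas$mass_g[c(4, 17)] <- NA   # plant two missing values for illustration

# Count the missing values in each column before cleaning.
colSums(is.na(pandas))

# One possible cleaning step: keep only the rows with no missing values.
pandas_clean <- na.omit(pandas)

# Report the size of the data after cleaning (rows, then columns).
dim(pandas_clean)

# Scatterplot with labelled axes and a title.
plot(
  pandas_clean$mass_g, pandas_clean$length_mm,
  xlab = "Body mass (g)", ylab = "Length (mm)",
  main = "Figure 1: Length versus body mass"
)
```

In the paper itself, each of these outputs would be accompanied by a sentence or two saying what it shows and why it was created.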
Section 3: KNN
In this section, you are going to use KNN to predict your response \(Y\).
Your task is to:
- Briefly explain how KNN can be used to predict \(Y\) in your application.
- Briefly explain how you choose \(K\) and what \(K\) you recommend. Show an appropriate plot or table to back up your choice.
- Compute an appropriate metric to assess how well your model is doing at prediction. Most of you will not have test data, so think about how you can get around this using methods we have learned in class!
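To make the tasks above concrete, here is a minimal sketch in base R. It assumes a numeric response, uses simulated data with hypothetical names, and uses leave-one-out cross-validation with RMSE as one common way to work without separate test data. The approach taught in class may differ, so treat this as an illustration rather than the required method.

```r
# Hypothetical simulated data: predict a numeric y from two numeric features.
set.seed(363)
n  <- 80
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - x2 + rnorm(n, sd = 0.5)
X  <- scale(cbind(x1, x2))   # standardize so both features count equally in distances

# Pairwise Euclidean distances between all rows.
D <- as.matrix(dist(X))

# Leave-one-out prediction for row i: average y over the k nearest OTHER rows.
loo_predict <- function(i, k) {
  d <- D[i, ]
  d[i] <- Inf                        # exclude the held-out row itself
  neighbors <- order(d)[seq_len(k)]  # indices of the k closest rows
  mean(y[neighbors])
}

# Leave-one-out RMSE for each candidate K.
k_values <- 1:20
rmse <- sapply(k_values, function(k) {
  preds <- sapply(seq_len(n), loo_predict, k = k)
  sqrt(mean((y - preds)^2))
})

# The K with the smallest leave-one-out RMSE.
best_k <- k_values[which.min(rmse)]

# Plot to back up the choice of K.
plot(k_values, rmse, type = "b",
     xlab = "K (number of neighbors)", ylab = "Leave-one-out RMSE",
     main = "Choosing K for KNN")
```

Each row is predicted from its nearest neighbors among the remaining rows, so no row is used to predict itself, which is what lets the RMSE stand in for test error.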
Turning in your assignment
You have completed Part 1! We will build on this to complete Part 2. Go ahead and knit your assignment. Make sure:
- You have run spell check.
- There is NO R output of any kind that does not have words right near it to describe the output.
- All plots have labelled axes and titles or captions.
- You do NOT have any super long output (like printing 50 numbers on the screen).
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or HTML document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.