Necessary packages: tidyverse, caret, skimr, rpart, randomForest, rattle, neuralnet, nnet

Background

In this assignment, you will run your first machine learning prediction model and submit your work to a Kaggle (https://www.kaggle.com/) competition!

The level of difficulty is flexible. You can do the bare minimum quite quickly. Alternatively, you can take a long time honing your model until you take the top spot on the leaderboard! But that is not required.

As always, do this in markdown and publish on RPubs. Just put in your code and finish with a screenshot of your Kaggle rank or completed submission.

(Note: the easiest way to add a picture in markdown is “![](your_picture.png)”, on its own line and not inside any code chunk.)

Task 1 (really easy)

In this task, we’ll walk through a minimal machine learning exercise and submit our results to Kaggle. Just follow the instructions.

Understand the problem

Go to the Housing Prices Competition for Kaggle Learn Users and skim the information. Understand that:

• You will be trying to predict the prices of homes based on their characteristics.

• You will have data to work with, which you download in the data tab.

• At the end you will submit your results on the competition page, in the form of a 2-column .csv file as described there.

• The website will then compare your predictions to the real data and give you a score depending on how well you did.
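
Before uploading, it can help to check that a submission really has the 2-column shape described above. The sketch below is only an illustration: `check_submission()` is a hypothetical helper (not part of the assignment or of any package), and the toy IDs and prices are made up. It assumes the two columns are named Id and SalePrice, as in the submission built later in this document.

```r
# Hypothetical helper: stops with an error if the submission is malformed.
check_submission <- function(submission, test_ids) {
  stopifnot(
    identical(names(submission), c("Id", "SalePrice")),  # exactly two columns, in this order
    nrow(submission) == length(test_ids),                # one prediction per test row
    all(submission$Id == test_ids),                      # IDs line up with the test set
    is.numeric(submission$SalePrice),
    !anyNA(submission$SalePrice)                         # no missing predictions
  )
  invisible(TRUE)
}

# Toy example with made-up IDs and prices
toy_ids <- c(1461L, 1462L, 1463L)
toy_submission <- data.frame(Id = toy_ids,
                             SalePrice = c(120000, 150000, 185000))
check_submission(toy_submission, toy_ids)
```

With the real data, the call would look like `check_submission(submission_rpart, test_raw$Id)` once those objects exist.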

[Note: you may have to register for a Kaggle account. It’s free and easy.]

Download the data

Go to the data tab and download the data to your laptop. Unzip it. See that there is a train.csv and a test.csv file. The former is for training your models, the latter is what you use to generate the predictions you will submit.

Load and clean the data

library(tidyverse)
library(caret)
library(skimr)
library(rpart)
library(randomForest)
library(rattle)
library(neuralnet)
library(nnet)
library(ggplot2)
library(haven)
library(dplyr)
library(reshape2)
library(glue)
library(tidytable)
library(tibble)
library(readr)

setwd("/Users/hunteryuan/Downloads/AAEC 8610/R Working Directory/HW11")

# Import training and testing data:
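# (Obviously, your file paths might be different here.)
# read.csv() is used because these files are comma-separated with "." decimals;
# read.csv2() defaults to dec = ",", which could misread decimal values.
train_raw <- read.csv("home-data-for-ml-course/train.csv",
                      stringsAsFactors = TRUE)

test_raw <- read.csv("home-data-for-ml-course/test.csv",
                     stringsAsFactors = TRUE)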
# dim(train_raw)
# dim(test_raw)
# skim() from the skimr package gives a useful overview of the data. It doesn't render in LaTeX, though.
# skim(train_raw)


# Functions to replace NAs with most frequent level or median
replace_na_most <- function(x){
  fct_explicit_na(x, na_level = names(which.max(table(x))))
}
replace_na_med <- function(x){
  x[is.na(x)] <- median(x,na.rm = TRUE)
  x
}
cleanup_minimal <- function(data){
  nomis <- data %>%
    mutate_if(is.factor, replace_na_most) %>%
    mutate_if(is.numeric, replace_na_med)
  nomis
}


train_minclean <- cleanup_minimal(train_raw)
test_minclean <- cleanup_minimal(test_raw)

Run the simplest tree algorithm there is

mod_rpart <- rpart(SalePrice~., data=train_minclean)
# Try this command to make a nice tree plot!
fancyRpartPlot(mod_rpart, caption = NULL)

pred_rpart <- predict(mod_rpart, newdata = test_minclean)
submission_rpart <- tibble(Id=test_raw$Id, SalePrice=pred_rpart)
head(submission_rpart)

## # A tibble: 6 × 2
##      Id SalePrice
##   <int>     <dbl>
## 1  1461   118199.
## 2  1462   151246.
## 3  1463   185210.
## 4  1464   185210.
## 5  1465   249392.
## 6  1466   185210.

# Obviously, your file path might be different here:
write_csv(submission_rpart, file="home-data-for-ml-course/submission_rpart.csv")

Submit your predictions to Kaggle

Go back to the Kaggle website and upload your submission_rpart.csv file in the submissions tab.

Note: You don’t need to use the Python code they give you. Just dragging and dropping the file also works.

Show your work!

Take a screenshot of your leaderboard position, and show it in your markdown file. Don’t worry: at this point you will be very far down (you may be ranked 50,000th or something).

knitr::include_graphics("home-data-for-ml-course/submission_task1.png")

Task 2: Push further, as much as you like.

Keep playing around with machine learning: try running other models, try using the caret package to tune your models, try to move up the leaderboard. There is no hard requirement for how far you need to take this in order to get an A: just put in a good effort.
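
As a starting point for tuning with caret, here is a minimal sketch using `train()` with 5-fold cross-validation. To stay self-contained it uses the built-in mtcars data as a stand-in for the cleaned housing data; the seed, the number of folds, and `tuneLength = 5` are arbitrary illustrative choices, not recommendations.

```r
library(caret)  # method = "rpart" also needs the rpart package installed

set.seed(123)  # arbitrary seed so the cross-validation folds are reproducible

# 5-fold cross-validation (the number of folds is an illustrative choice)
ctrl <- trainControl(method = "cv", number = 5)

# mtcars stands in for the cleaned training data here; with the housing data
# the call would use SalePrice ~ . and data = train_minclean instead.
tuned_rpart <- train(
  mpg ~ .,
  data = mtcars,
  method = "rpart",
  trControl = ctrl,
  tuneLength = 5  # try up to 5 candidate values of the complexity parameter cp
)

tuned_rpart$bestTune                          # cp value chosen by cross-validation
head(predict(tuned_rpart, newdata = mtcars))  # predictions from the final model
```

On a dataset as small as mtcars, caret may warn about missing values in the resampled performance measures; that is harmless for this illustration. Swapping `method = "rpart"` for another supported method (for example `"rf"`) follows the same pattern.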

A few helpful hints:

• Many people share their code on the code tab. There is much to learn there, even though most of it is in Python. You can click filter to see only the R submissions.

• Virtually any model will do better than the tree we ran in task 1.

# Training a random forest
mod_rf <- randomForest(SalePrice ~ ., data = train_minclean)
# The rbind lines below are a workaround: without them, predict() stops with an
# error, typically because the column types or factor levels in the test data do
# not match the training data. Binding one training row onto the test set and
# then dropping it again lines them up.
# This thread gave me the solution:
# https://stackoverflow.com/questions/24829674/r-random-forest-error-type-of-predictors-in-new-data-do-
trainX <- select(train_minclean, -SalePrice)
test_minclean <- rbind(trainX[1, ], test_minclean)
test_minclean <- test_minclean[-1, ]
# Get my predictions:
pred_rf <- predict(mod_rf, newdata = test_minclean)
submission_rf <- tibble(Id=test_raw$Id, SalePrice=pred_rf)
write_csv(submission_rf, file="home-data-for-ml-course/submission_rf.csv")

knitr::include_graphics("home-data-for-ml-course/submission_task2.png")

• Trees and forests don’t require you to normalize the variables or to one-hot-encode the factors, but many other models do.
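
For models that do need it, a minimal base R sketch of both steps is shown below. The toy data frame is made up (only the column names are borrowed from the housing data), and `scale()` plus `model.matrix()` is just one of several ways to do this.

```r
# Toy data with made-up values
toy <- data.frame(
  LotArea   = c(8450, 9600, 11250, 9550),
  Street    = factor(c("Pave", "Pave", "Grvl", "Pave")),
  ExterQual = factor(c("Gd", "TA", "Gd", "Ex"))
)

# Normalize: center each numeric column at 0 and scale it to standard deviation 1
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], function(x) as.numeric(scale(x)))

# One-hot encode: one 0/1 column per factor level.
# contrasts = FALSE keeps every level instead of dropping a reference level.
fct_cols <- sapply(toy, is.factor)
toy_encoded <- model.matrix(
  ~ . - 1,
  data = toy,
  contrasts.arg = lapply(toy[fct_cols], contrasts, contrasts = FALSE)
)

colnames(toy_encoded)
```

The result is a numeric matrix, which is the input format many non-tree models expect.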

• R is quite finicky, and you will get a lot of errors when trying to run different models. One very common issue is that the training and testing data end up incompatible after cleaning (say, if your cleaning drops a variable that is an input node in your neural network, the test prediction will crash). A common solution is to merge your training and testing data first, just for the cleaning part, and then split them back apart before training. That is OK in this case: you don’t know the Y variable for the test data, so you’re not using it to train your model, just to ensure compatibility.
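
The merge-clean-split idea could look roughly like the sketch below. The toy data frames and their values are made up, the `is_train` flag column is a hypothetical name introduced for this illustration, and the cleaning (most frequent level, median) mirrors the minimal cleanup used in Task 1.

```r
# Toy stand-ins for the training and testing data (values are made up)
toy_train <- data.frame(
  Id        = 1:4,
  Street    = c("Pave", NA, "Pave", "Grvl"),
  LotArea   = c(8450, NA, 11250, 9550),
  SalePrice = c(208500, 181500, 223500, 140000)
)
toy_test <- data.frame(
  Id      = 5:6,
  Street  = c("Grvl", "Pave"),
  LotArea = c(NA, 14260)
)

# 1. Stack the two sets. The test set has no SalePrice, so fill it with NA,
#    and flag each row so the sets can be separated again later.
toy_test$SalePrice <- NA_real_
combined <- rbind(
  cbind(toy_train, is_train = TRUE),
  cbind(toy_test,  is_train = FALSE)
)

# 2. Clean the predictors once, so both sets get identical types and levels
combined$Street <- factor(combined$Street)
most_common <- names(which.max(table(combined$Street)))
combined$Street[is.na(combined$Street)] <- most_common
combined$LotArea[is.na(combined$LotArea)] <- median(combined$LotArea, na.rm = TRUE)

# 3. Split back apart; SalePrice is never touched by the cleaning step
train_clean <- combined[combined$is_train, setdiff(names(combined), "is_train")]
test_clean  <- combined[!combined$is_train,
                        setdiff(names(combined), c("is_train", "SalePrice"))]
```

Because the factor is created on the combined data, `train_clean$Street` and `test_clean$Street` share the same levels, which avoids the mismatch errors described above.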

# Training a random forest
mod_rf <- randomForest(SalePrice ~ ., data = train_minclean)
# The rbind lines below are a workaround: without them, predict() stops with an error.
# This thread gave the solution:
# https://stackoverflow.com/questions/24829674/r-random-forest-error-type-of-predictors-in-new-data-do-
trainX <- select(train_minclean, -SalePrice)
test_minclean <- rbind(trainX[1, ], test_minclean)
test_minclean <- test_minclean[-1, ]
# Get my predictions:
pred_rf <- predict(mod_rf, newdata = test_minclean)
submission_rf <- tibble(Id=test_raw$Id, SalePrice=pred_rf)
write_csv(submission_rf, file="home-data-for-ml-course/submission_rf.csv")
knitr::include_graphics("home-data-for-ml-course/submission_task3.png")

Homework 11

leonskennedy

2023-04-26
