HW11ML

Task 1 (compulsory but really easy)

In this task, we’ll walk through a minimal machine learning exercise and submit our results to Kaggle.

Go to the data tab and download the data to your laptop. Unzip it and check that it contains a train.csv and a test.csv file. The former is for training your models; the latter is what you use to generate the predictions you will submit.

# libraries needed
library(tidyverse)
library(caret)
library(skimr)
library(rpart)
library(randomForest)
library(rattle)
library(neuralnet)
library(nnet)

Load and clean the data

# Import the training and testing data:
train_raw <- read.csv("train.csv", sep = ",", stringsAsFactors = TRUE)
test_raw <- read.csv("test.csv", sep = ",", stringsAsFactors = TRUE)

dim(train_raw)

## [1] 1460   81

dim(test_raw)

## [1] 1459   80

# skim() from the skimr package gives a useful overview of the data, but its output doesn't render in LaTeX.
# skim(train_raw)

Note: the training dataset has one more column than the test dataset: SalePrice, the variable you want to predict.
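
One way to confirm which column is missing from the test set is to compare the column names directly. The sketch below uses two tiny made-up data frames (toy_train and toy_test are hypothetical stand-ins, not the Kaggle data); with the real data you would pass train_raw and test_raw instead.

```r
# Toy stand-ins for train_raw and test_raw (hypothetical values, not the Kaggle data)
toy_train <- data.frame(Id = 1:3,
                        LotArea = c(8000, 9000, 10000),
                        SalePrice = c(100000, 150000, 200000))
toy_test  <- data.frame(Id = 4:5,
                        LotArea = c(8500, 9500))

# Columns present in the training data but absent from the test data
extra_cols <- setdiff(names(toy_train), names(toy_test))
print(extra_cols)
```

Any name returned here is a column the model can learn from but will not see at prediction time, which is what you expect for the target variable.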

Very minimal cleaning

# Functions to replace NAs with the most frequent level (factors) or the median (numerics)
replace_na_most <- function(x){
  # table() ignores NAs, so this picks the most frequent observed level
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
replace_na_med <- function(x){
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
cleanup_minimal <- function(data){
  nomis <- data %>%
    mutate_if(is.factor, replace_na_most) %>%
    mutate_if(is.numeric, replace_na_med)
  nomis
}

train_minclean <- cleanup_minimal(train_raw)
test_minclean <- cleanup_minimal(test_raw)
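
To see what this kind of imputation does, it can help to run it by hand on a very small example. The sketch below is written in base R and uses a made-up data frame (toy, with hypothetical values); it illustrates the same idea and is not a replacement for cleanup_minimal.

```r
# Toy data frame with one factor and one numeric column, each containing an NA
# (hypothetical values, chosen only to make the imputation visible)
toy <- data.frame(
  Street  = factor(c("Pave", "Pave", NA, "Grvl")),
  LotArea = c(8000, NA, 10000, 12000)
)

# Most frequent level of the factor; table() ignores NA by default
most_frequent <- names(which.max(table(toy$Street)))

# Fill the gaps: most frequent level for the factor, median for the numeric column
toy$Street[is.na(toy$Street)]   <- most_frequent
toy$LotArea[is.na(toy$LotArea)] <- median(toy$LotArea, na.rm = TRUE)

print(toy)

# Count the NAs left in each column; every count should be zero after cleaning
print(colSums(is.na(toy)))
```

The same check, colSums(is.na(train_minclean)), should report only zeros for the cleaned training data.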

Run the simplest tree algorithm

# Fit an rpart regression tree on all other columns, then plot it
# (note that the formula SalePrice ~ . also treats Id as a predictor)
mod_rpart <- rpart(SalePrice~., data=train_minclean)

# tree plot
fancyRpartPlot(mod_rpart, caption = NULL)
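
Besides the plot, a fitted rpart object can be inspected as text. The sketch below fits a tree on mtcars, a dataset that ships with R, so that it runs without the Kaggle files; mpg stands in for SalePrice and toy_tree is a hypothetical name. The same calls should work on mod_rpart.

```r
library(rpart)

# mtcars ships with R; mpg plays the role of SalePrice here (an assumption
# made only so the example runs without the Kaggle files)
toy_tree <- rpart(mpg ~ ., data = mtcars)

# Text version of the tree: one line per node with the split rule and fitted value
print(toy_tree)

# Complexity table, including the cross-validated error for each tree size
printcp(toy_tree)

# Relative importance of each predictor used by the tree
print(toy_tree$variable.importance)
```

The complexity table is a reasonable first place to look when deciding whether the default tree is too small or too large.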

Export the predictions in the appropriate format

pred_rpart <- predict(mod_rpart, newdata = test_minclean)
submission_rpart <- tibble(Id=test_raw$Id, SalePrice=pred_rpart)
head(submission_rpart)

## # A tibble: 6 x 2
##      Id SalePrice
##   <int>     <dbl>
## 1  1461   118199.
## 2  1462   151246.
## 3  1463   185210.
## 4  1464   185210.
## 5  1465   249392.
## 6  1466   185210.

write_csv(submission_rpart, file = "submission_rpart.csv")
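
Before uploading, it may be worth checking that the file has the shape shown above: an Id column, a SalePrice column, and no missing values. The sketch below is one possible set of checks on a made-up submission (toy_submission holds hypothetical values, not real model output). It uses base R write.csv and read.csv instead of write_csv so that it has no package dependencies.

```r
# Toy submission (hypothetical ids and predictions, not real model output)
toy_submission <- data.frame(Id = 1461:1463,
                             SalePrice = c(120000, 150000, 185000))

# Checks worth running before uploading
stopifnot(
  identical(names(toy_submission), c("Id", "SalePrice")),  # header as shown above
  !anyNA(toy_submission),                                   # no missing predictions
  !anyDuplicated(toy_submission$Id),                        # one row per Id
  all(toy_submission$SalePrice > 0)                         # prices should be positive
)

# Write without row names, then read the file back to confirm it round-trips
out_file <- tempfile(fileext = ".csv")
write.csv(toy_submission, out_file, row.names = FALSE)
check <- read.csv(out_file)
print(check)
```

With the real data, the same stopifnot() call can be applied to submission_rpart, together with a check that nrow(submission_rpart) equals nrow(test_raw).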

Screenshot from leaderboard

I enjoyed this class. Thank you, Dr. Filipski!