Table of Contents: 1. Introduction 2. Import & Check Dataset 3. Exploratory Data Analysis (EDA) 4. Data Modeling 5. Cross Validation 6. Submission 7. Conclusion

Introduction: The objective of the House Prices machine learning challenge is to develop regression models that accurately predict the sale price of residential homes in Ames, Iowa, based on features such as square footage and location. The goal of this project is to build an additive model that uses just 5 of the 79 explanatory variables and achieves an R2 above .75. The requirements and steps of the project follow.

Requirements: (1) RMSE and R2 (>.75) on the train set (2) estimated RMSE and R2 on the test set (3) Kaggle score (returned log RMSE) and rank

Steps: 1. EDA: understand how the variables relate to one another, and the structure and meaning of the missing observations. 2. Develop a linear model of house prices using just 5 predictors additively. 3. Submit predictions to Kaggle. 4. Use a simple cross-validation method to check that the results will generalize well to new data.
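
As a rough sketch of the three metrics named in the requirements, the helpers below compute RMSE, R2, and RMSE on the log scale for an additive five-predictor linear model. The built-in mtcars data and the helper names rmse and r2 are assumptions made only for illustration; they are not part of the project code or the Ames data.

```r
# Toy illustration on the built-in mtcars data, not the Ames data.
# Five predictors enter additively (no interaction terms).
fit <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# R-squared: 1 - residual sum of squares / total sum of squares
r2 <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

pred <- predict(fit, newdata = mtcars)

train_rmse <- rmse(mtcars$mpg, pred)
train_r2   <- r2(mtcars$mpg, pred)

# RMSE on the log scale, the form of score named in the requirements
train_log_rmse <- rmse(log(mtcars$mpg), log(pred))
```

For an in-sample fit with an intercept, train_r2 should agree with summary(fit)$r.squared, which makes a quick sanity check on the helper.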

Import Data & Check Dataset

# setup & import data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.1
## ✔ tidyr   1.2.1     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dplyr)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(corrplot)
## corrplot 0.92 loaded
# overview
test <- read_csv("./house-prices-advanced-regression-techniques/test.csv")
## Rows: 1459 Columns: 80
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train <- read_csv("./house-prices-advanced-regression-techniques/train.csv")
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
submit_example <- read_csv("./house-prices-advanced-regression-techniques/sample_submission.csv")
## Rows: 1459 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Id, SalePrice
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(train) # To do: impute NAs in the numeric columns MasVnrArea and GarageYrBlt; for BsmtCond, BsmtFinType2, BsmtFinSF2, GarageQual, GarageCond, MiscFeature, NA means "none"
## # A tibble: 6 × 81
##      Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
##   <dbl>      <dbl> <chr>      <dbl>   <dbl> <chr>  <chr> <chr>   <chr>   <chr>  
## 1     1         60 RL            65    8450 Pave   <NA>  Reg     Lvl     AllPub 
## 2     2         20 RL            80    9600 Pave   <NA>  Reg     Lvl     AllPub 
## 3     3         60 RL            68   11250 Pave   <NA>  IR1     Lvl     AllPub 
## 4     4         70 RL            60    9550 Pave   <NA>  IR1     Lvl     AllPub 
## 5     5         60 RL            84   14260 Pave   <NA>  IR1     Lvl     AllPub 
## 6     6         50 RL            85   14115 Pave   <NA>  IR1     Lvl     AllPub 
## # … with 71 more variables: LotConfig <chr>, LandSlope <chr>,
## #   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## #   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## #   Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## #   BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
head(test)
## # A tibble: 6 × 80
##      Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
##   <dbl>      <dbl> <chr>      <dbl>   <dbl> <chr>  <chr> <chr>   <chr>   <chr>  
## 1  1461         20 RH            80   11622 Pave   <NA>  Reg     Lvl     AllPub 
## 2  1462         20 RL            81   14267 Pave   <NA>  IR1     Lvl     AllPub 
## 3  1463         60 RL            74   13830 Pave   <NA>  IR1     Lvl     AllPub 
## 4  1464         60 RL            78    9978 Pave   <NA>  IR1     Lvl     AllPub 
## 5  1465        120 RL            43    5005 Pave   <NA>  IR1     HLS     AllPub 
## 6  1466         60 RL            75   10000 Pave   <NA>  IR1     Lvl     AllPub 
## # … with 70 more variables: LotConfig <chr>, LandSlope <chr>,
## #   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## #   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## #   Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## #   BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
head(submit_example)
## # A tibble: 6 × 2
##      Id SalePrice
##   <dbl>     <dbl>
## 1  1461   169277.
## 2  1462   187758.
## 3  1463   183584.
## 4  1464   179317.
## 5  1465   150730.
## 6  1466   177151.

Exploratory Data Analysis (EDA)

Missing data

In total, there are 34 predictors that contain missing values. My approach is to review what the NAs are likely to mean and convert them into meaningful values (for example, an explicit "none" category, with character variables later converted into factors); otherwise, they will be removed. Let’s look at them first in the big picture and then in smaller groups.
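
One possible way to get the big picture is to reshape the per-column NA counts into a long table and keep only the columns that actually have missing values. The sketch below runs on a small made-up tibble (toy, a hypothetical stand-in for train), so the counts it produces say nothing about the real data.

```r
library(tidyverse)

# Hypothetical toy data standing in for `train`
toy <- tibble(
  Id          = 1:5,
  Alley       = c(NA, "Grvl", NA, NA, "Pave"),
  LotFrontage = c(65, NA, 68, 60, NA),
  LotArea     = c(8450, 9600, 11250, 9550, 14260)
)

missing_by_column <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x)))) %>%  # one NA count per column
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "n_missing") %>%              # wide to long
  filter(n_missing > 0) %>%                              # drop complete columns
  arrange(desc(n_missing))                               # worst offenders first

missing_by_column
```

Replacing toy with train would list every predictor with missing values in one short table instead of an 81-column row.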

count_missings <- function(x) sum(is.na(x))

train %>% 
  summarize_all(count_missings) # Handy summarize_all function
## # A tibble: 1 × 81
##      Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
##   <int>      <int>    <int>   <int>   <int>  <int> <int>   <int>   <int>   <int>
## 1     0          0        0     259       0      0  1369       0       0       0
## # … with 71 more variables: LotConfig <int>, LandSlope <int>,
## #   Neighborhood <int>, Condition1 <int>, Condition2 <int>, BldgType <int>,
## #   HouseStyle <int>, OverallQual <int>, OverallCond <int>, YearBuilt <int>,
## #   YearRemodAdd <int>, RoofStyle <int>, RoofMatl <int>, Exterior1st <int>,
## #   Exterior2nd <int>, MasVnrType <int>, MasVnrArea <int>, ExterQual <int>,
## #   ExterCond <int>, Foundation <int>, BsmtQual <int>, BsmtCond <int>,
## #   BsmtExposure <int>, BsmtFinType1 <int>, BsmtFinSF1 <int>, …
train <- train %>% # Save the result back into the original data
  mutate(Alley = replace_na(data = Alley, replace = "none")) # Overwrite the existing column with new values

# Check that it worked
train %>% 
  count(Alley)
## # A tibble: 3 × 2
##   Alley     n
##   <chr> <int>
## 1 Grvl     50
## 2 Pave     41
## 3 none   1369

The approach works for the variable Alley. What about the variables regarding masonry veneer?

count_missings(train$MasVnrType) 
## [1] 8
train %>%
  filter(is.na(MasVnrType), is.na(MasVnrArea))
## # A tibble: 8 × 81
##      Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
##   <dbl>      <dbl> <chr>      <dbl>   <dbl> <chr>  <chr> <chr>   <chr>   <chr>  
## 1   235         60 RL            NA    7851 Pave   none  Reg     Lvl     AllPub 
## 2   530         20 RL            NA   32668 Pave   none  IR1     Lvl     AllPub 
## 3   651         60 FV            65    8125 Pave   none  Reg     Lvl     AllPub 
## 4   937         20 RL            67   10083 Pave   none  Reg     Lvl     AllPub 
## 5   974         20 FV            95   11639 Pave   none  Reg     Lvl     AllPub 
## 6   978        120 FV            35    4274 Pave   Pave  IR1     Lvl     AllPub 
## 7  1244         20 RL           107   13891 Pave   none  Reg     Lvl     AllPub 
## 8  1279         60 RL            75    9473 Pave   none  Reg     Lvl     AllPub 
## # … with 71 more variables: LotConfig <chr>, LandSlope <chr>,
## #   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## #   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## #   Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## #   BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
train <- train %>%
  mutate(MasVnrType = replace_na(MasVnrType, "none"),
         MasVnrArea = replace_na(MasVnrArea, 0)) 

# Check that it worked
train %>%
  filter(is.na(MasVnrType) | is.na(MasVnrArea)) # use | so an NA left in either column is caught
## # A tibble: 0 × 81
## # … with 81 variables: Id <dbl>, MSSubClass <dbl>, MSZoning <chr>,
## #   LotFrontage <dbl>, LotArea <dbl>, Street <chr>, Alley <chr>,
## #   LotShape <chr>, LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## #   LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## #   BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## #   YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## #   Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, …

Let’s try it on one more variable before imputing all the remaining ones.

count_missings(train$LotFrontage) 
## [1] 259
train <- train %>% # Again, saving this change into the original data
  mutate(LotFrontage = replace_na(LotFrontage, median(LotFrontage, na.rm = TRUE))) # na.rm = TRUE is needed, otherwise the median itself is NA

# Check that it worked
train$LotFrontage %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   60.00   69.00   69.86   79.00  313.00
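
If the same imputation is later applied to the test set, one option is to compute the median once from the training data and reuse it, so both sets are filled with the same value. This is only a sketch under that assumption; toy_train, toy_test, and lot_frontage_median are hypothetical names, and the numbers are made up.

```r
library(tidyverse)

# Hypothetical toy frames standing in for `train` and `test`
toy_train <- tibble(LotFrontage = c(60, NA, 80, 70))
toy_test  <- tibble(LotFrontage = c(NA, 90))

# Compute the median once, from the training data only
lot_frontage_median <- median(toy_train$LotFrontage, na.rm = TRUE)

# Fill both sets with the training median
toy_train <- toy_train %>%
  mutate(LotFrontage = replace_na(LotFrontage, lot_frontage_median))
toy_test <- toy_test %>%
  mutate(LotFrontage = replace_na(LotFrontage, lot_frontage_median))
```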

I will now impute the remaining variables in related groups to avoid mistakes.
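
A group of related columns can also be imputed in a single step with across(), and checked with if_any() so that a row with an NA in any of the columns is reported. The sketch below assumes a made-up tibble (toy) and a hypothetical vector bsmt_cols; it is an alternative way to write the group imputation, not the code this notebook runs.

```r
library(tidyverse)

# Hypothetical toy data standing in for `train`
toy <- tibble(
  BsmtQual     = c("Gd", NA, "TA"),
  BsmtCond     = c("TA", NA, "TA"),
  BsmtExposure = c("No", NA, NA)
)

# Columns that belong to the same group
bsmt_cols <- c("BsmtQual", "BsmtCond", "BsmtExposure")

# Replace NA with "none" in every column of the group
toy <- toy %>%
  mutate(across(all_of(bsmt_cols), ~ replace_na(.x, "none")))

# Rows where ANY column of the group is still missing (should be none)
remaining <- toy %>%
  filter(if_any(all_of(bsmt_cols), is.na))

nrow(remaining)
```

Keeping the column names in one vector means the imputation and the check cannot drift apart when a column is added to the group.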

count_missings(train$BsmtQual) 
## [1] 37
count_missings(train$BsmtCond) 
## [1] 37
count_missings(train$BsmtExposure) 
## [1] 38
count_missings(train$BsmtFinType1) 
## [1] 37
count_missings(train$BsmtFinType2) 
## [1] 38
train %>%
  filter(is.na(BsmtQual), is.na(BsmtCond), is.na(BsmtExposure), is.na(BsmtFinType1), is.na(BsmtFinType2))
## # A tibble: 37 × 81
##       Id MSSubClass MSZon…¹ LotFr…² LotArea Street Alley LotSh…³ LandC…⁴ Utili…⁵
##    <dbl>      <dbl> <chr>     <dbl>   <dbl> <chr>  <chr> <chr>   <chr>   <chr>  
##  1    18         90 RL           72   10791 Pave   none  Reg     Lvl     AllPub 
##  2    40         90 RL           65    6040 Pave   none  Reg     Lvl     AllPub 
##  3    91         20 RL           60    7200 Pave   none  Reg     Lvl     AllPub 
##  4   103         90 RL           64    7018 Pave   none  Reg     Bnk     AllPub 
##  5   157         20 RL           60    7200 Pave   none  Reg     Lvl     AllPub 
##  6   183         20 RL           60    9060 Pave   none  Reg     Lvl     AllPub 
##  7   260         20 RM           70   12702 Pave   none  Reg     Lvl     AllPub 
##  8   343         90 RL           69    8544 Pave   none  Reg     Lvl     AllPub 
##  9   363         85 RL           64    7301 Pave   none  Reg     Lvl     AllPub 
## 10   372         50 RL           80   17120 Pave   none  Reg     Lvl     AllPub 
## # … with 27 more rows, 71 more variables: LotConfig <chr>, LandSlope <chr>,
## #   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## #   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## #   Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## #   BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
train <- train %>%
  mutate(BsmtQual = replace_na(BsmtQual, "none"),
         BsmtCond = replace_na(BsmtCond, "none"),
         BsmtExposure = replace_na(BsmtExposure, "none"),
         BsmtFinType1 = replace_na(BsmtFinType1, "none"),
         BsmtFinType2 = replace_na(BsmtFinType2, "none")) 

# Check that it worked
train %>%
  filter(is.na(BsmtQual) | is.na(BsmtCond) | is.na(BsmtExposure) | is.na(BsmtFinType1) | is.na(BsmtFinType2)) # | catches an NA left in any column
## # A tibble: 0 × 81
## # … with 81 variables: Id <dbl>, MSSubClass <dbl>, MSZoning <chr>,
## #   LotFrontage <dbl>, LotArea <dbl>, Street <chr>, Alley <chr>,
## #   LotShape <chr>, LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## #   LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## #   BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## #   YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## #   Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, …
count_missings(train$GarageType)
## [1] 81
count_missings(train$GarageYrBlt)
## [1] 81
count_missings(train$GarageFinish)
## [1] 81
count_missings(train$GarageQual)
## [1] 81
count_missings(train$GarageCond)
## [1] 81
train %>%
  filter(is.na(GarageType), is.na(GarageYrBlt), is.na(GarageFinish), is.na(GarageQual), is.na(GarageCond))
## # A tibble: 81 × 81
##       Id MSSubClass MSZon…¹ LotFr…² LotArea Street Alley LotSh…³ LandC…⁴ Utili…⁵
##    <dbl>      <dbl> <chr>     <dbl>   <dbl> <chr>  <chr> <chr>   <chr>   <chr>  
##  1    40         90 RL           65    6040 Pave   none  Reg     Lvl     AllPub 
##  2    49        190 RM           33    4456 Pave   none  Reg     Lvl     AllPub 
##  3    79         90 RL           72   10778 Pave   none  Reg     Lvl     AllPub 
##  4    89         50 C (all)     105    8470 Pave   none  IR1     Lvl     AllPub 
##  5    90         20 RL           60    8070 Pave   none  Reg     Lvl     AllPub 
##  6   100         20 RL           77    9320 Pave   none  IR1     Lvl     AllPub 
##  7   109         50 RM           85    8500 Pave   none  Reg     Lvl     AllPub 
##  8   126        190 RM           60    6780 Pave   none  Reg     Lvl     AllPub 
##  9   128         45 RM           55    4388 Pave   none  IR1     Bnk     AllPub 
## 10   141         20 RL           70   10500 Pave   none  Reg     Lvl     AllPub 
## # … with 71 more rows, 71 more variables: LotConfig <chr>, LandSlope <chr>,
## #   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## #   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## #   Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## #   BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
train <- train %>%
  mutate(GarageType = replace_na(GarageType, "none"),
         GarageYrBlt = replace_na(GarageYrBlt, 0), # 0 is a placeholder, not a real year; it pulls the mean down to 1869 in the summary below
         GarageFinish = replace_na(GarageFinish, "none"),
         GarageQual = replace_na(GarageQual, "none"),
         GarageCond = replace_na(GarageCond, "none")) 

# Check that it worked
train %>%
  filter(is.na(GarageType) | is.na(GarageYrBlt) | is.na(GarageFinish) | is.na(GarageQual) | is.na(GarageCond)) # | catches an NA left in any column
## # A tibble: 0 × 81
## # … with 81 variables: Id <dbl>, MSSubClass <dbl>, MSZoning <chr>,
## #   LotFrontage <dbl>, LotArea <dbl>, Street <chr>, Alley <chr>,
## #   LotShape <chr>, LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## #   LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## #   BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## #   YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## #   Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, …
count_missings(train$FireplaceQu) 
## [1] 690
count_missings(train$PoolQC) 
## [1] 1453
count_missings(train$Fence) 
## [1] 1179
count_missings(train$MiscFeature) 
## [1] 1406
train %>%
  filter(is.na(FireplaceQu), is.na(PoolQC), is.na(Fence), is.na(MiscFeature))
## # A tibble: 521 × 81
##       Id MSSubClass MSZon…¹ LotFr…² LotArea Street Alley LotSh…³ LandC…⁴ Utili…⁵
##    <dbl>      <dbl> <chr>     <dbl>   <dbl> <chr>  <chr> <chr>   <chr>   <chr>  
##  1     1         60 RL           65    8450 Pave   none  Reg     Lvl     AllPub 
##  2    11         20 RL           70   11200 Pave   none  Reg     Lvl     AllPub 
##  3    13         20 RL           69   12968 Pave   none  IR2     Lvl     AllPub 
##  4    19         20 RL           66   13695 Pave   none  Reg     Lvl     AllPub 
##  5    27         20 RL           60    7200 Pave   none  Reg     Lvl     AllPub 
##  6    30         30 RM           60    6324 Pave   none  IR1     Lvl     AllPub 
##  7    33         20 RL           85   11049 Pave   none  Reg     Lvl     AllPub 
##  8    37         20 RL          112   10859 Pave   none  Reg     Lvl     AllPub 
##  9    39         20 RL           68    7922 Pave   none  Reg     Lvl     AllPub 
## 10    40         90 RL           65    6040 Pave   none  Reg     Lvl     AllPub 
## # … with 511 more rows, 71 more variables: LotConfig <chr>, LandSlope <chr>,
## #   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## #   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## #   Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## #   BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
train <- train %>%
  mutate(FireplaceQu = replace_na(FireplaceQu, "none"),
         PoolQC = replace_na(PoolQC, "none"),
         Fence = replace_na(Fence, "none"),
         MiscFeature = replace_na(MiscFeature, "none")) 

# Check that it worked
train %>%
  filter(is.na(FireplaceQu) | is.na(PoolQC) | is.na(Fence) | is.na(MiscFeature)) # | catches an NA left in any column
## # A tibble: 0 × 81
## # … with 81 variables: Id <dbl>, MSSubClass <dbl>, MSZoning <chr>,
## #   LotFrontage <dbl>, LotArea <dbl>, Street <chr>, Alley <chr>,
## #   LotShape <chr>, LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## #   LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## #   BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## #   YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## #   Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, …
# Review data structure
summary(train)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 60.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 69.86  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 79.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.1                     
##                                        3rd Qu.: 164.2                     
##                                        Max.   :1600.0                     
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##   HeatingQC          CentralAir         Electrical           1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##     2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :   0  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1958  
##  Median :1.000   Mode  :character   Mode  :character   Median :1977  
##  Mean   :0.613                                         Mean   :1869  
##  3rd Qu.:1.000                                         3rd Qu.:2001  
##  Max.   :3.000                                         Max.   :2010  
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##  EnclosedPorch      3SsnPorch       ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000
# Convert Character to Factors
train <- train %>%
  mutate(MSZoning = factor(MSZoning),
         Street = factor(Street),
         Alley = factor(Alley),
         LotShape = factor(LotShape),
         LandContour = factor(LandContour),
         Utilities = factor(Utilities),
         LotConfig = factor(LotConfig),
         LandSlope = factor(LandSlope),
         Condition1 = factor(Condition1),
         Condition2 = factor(Condition2),
         BldgType = factor(BldgType),
         HouseStyle = factor(HouseStyle),
         RoofStyle = factor(RoofStyle),
         RoofMatl = factor(RoofMatl),
         Exterior1st = factor(Exterior1st),
         Exterior2nd = factor(Exterior2nd),
         MasVnrType = factor(MasVnrType),
         ExterQual = factor(ExterQual),
         ExterCond = factor(ExterCond),
         Foundation = factor(Foundation),
         BsmtQual = factor(BsmtQual),
         BsmtCond = factor(BsmtCond),
         BsmtExposure = factor(BsmtExposure),
         BsmtFinType1 = factor(BsmtFinType1),
         BsmtFinType2 = factor(BsmtFinType2),
         Heating = factor(Heating),
         HeatingQC = factor(HeatingQC),
         CentralAir = factor(CentralAir),
         Electrical = factor(Electrical),
         KitchenQual = factor(KitchenQual),
         Functional = factor(Functional),
         FireplaceQu = factor(FireplaceQu),
         GarageType = factor(GarageType),
         GarageFinish = factor(GarageFinish),
         GarageQual = factor(GarageQual),
         GarageCond = factor(GarageCond),
         PavedDrive = factor(PavedDrive),
         PoolQC = factor(PoolQC),
         Fence = factor(Fence),
         MiscFeature = factor(MiscFeature),
         SaleType = factor(SaleType),
         SaleCondition = factor(SaleCondition))
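
The explicit list above could probably be written more compactly with across(where(is.character), factor), which converts every character column at once. The sketch below uses a made-up tibble (toy) as a stand-in for train. Note that this shortcut would also convert Neighborhood, which the explicit list leaves as character, so the two versions are not exactly equivalent.

```r
library(tidyverse)

# Hypothetical toy data standing in for `train`
toy <- tibble(
  Id           = 1:3,
  Street       = c("Pave", "Grvl", "Pave"),
  Neighborhood = c("CollgCr", "Veenker", "CollgCr")
)

# Convert every character column to a factor; numeric columns are untouched
toy <- toy %>%
  mutate(across(where(is.character), factor))

# Inspect the resulting column classes
sapply(toy, class)
```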

# Check that it worked
summary(train)
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 60.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 69.86  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 79.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                                  
##     LotArea        Street      Alley      LotShape  LandContour  Utilities   
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63    AllPub:1459  
##  1st Qu.:  7554   Pave:1454   none:1369   IR2: 41   HLS:  50    NoSeWa:   1  
##  Median :  9478               Pave:  41   IR3: 10   Low:  36                 
##  Mean   : 10517                           Reg:925   Lvl:1311                 
##  3rd Qu.: 11602                                                              
##  Max.   :215245                                                              
##                                                                              
##    LotConfig    LandSlope  Neighborhood         Condition1     Condition2  
##  Corner : 263   Gtl:1382   Length:1460        Norm   :1260   Norm   :1445  
##  CulDSac:  94   Mod:  65   Class :character   Feedr  :  81   Feedr  :   6  
##  FR2    :  47   Sev:  13   Mode  :character   Artery :  48   Artery :   2  
##  FR3    :   4                                 RRAn   :  26   PosN   :   2  
##  Inside :1052                                 PosN   :  19   RRNn   :   2  
##                                               RRAe   :  11   PosA   :   1  
##                                               (Other):  15   (Other):   2  
##    BldgType      HouseStyle   OverallQual      OverallCond      YearBuilt   
##  1Fam  :1220   1Story :726   Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  2fmCon:  31   2Story :445   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Duplex:  52   1.5Fin :154   Median : 6.000   Median :5.000   Median :1973  
##  Twnhs :  43   SLvl   : 65   Mean   : 6.099   Mean   :5.575   Mean   :1971  
##  TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                1.5Unf : 14   Max.   :10.000   Max.   :9.000   Max.   :2010  
##                (Other): 19                                                  
##   YearRemodAdd    RoofStyle       RoofMatl     Exterior1st   Exterior2nd 
##  Min.   :1950   Flat   :  13   CompShg:1434   VinylSd:515   VinylSd:504  
##  1st Qu.:1967   Gable  :1141   Tar&Grv:  11   HdBoard:222   MetalSd:214  
##  Median :1994   Gambrel:  11   WdShngl:   6   MetalSd:220   HdBoard:207  
##  Mean   :1985   Hip    : 286   WdShake:   5   Wd Sdng:206   Wd Sdng:197  
##  3rd Qu.:2004   Mansard:   7   ClyTile:   1   Plywood:108   Plywood:142  
##  Max.   :2010   Shed   :   2   Membran:   1   CemntBd: 61   CmentBd: 60  
##                                (Other):   2   (Other):128   (Other):136  
##    MasVnrType    MasVnrArea     ExterQual ExterCond  Foundation  BsmtQual  
##  BrkCmn : 15   Min.   :   0.0   Ex: 52    Ex:   3   BrkTil:146   Ex  :121  
##  BrkFace:445   1st Qu.:   0.0   Fa: 14    Fa:  28   CBlock:634   Fa  : 35  
##  none   :  8   Median :   0.0   Gd:488    Gd: 146   PConc :647   Gd  :618  
##  None   :864   Mean   : 103.1   TA:906    Po:   1   Slab  : 24   none: 37  
##  Stone  :128   3rd Qu.: 164.2             TA:1282   Stone :  6   TA  :649  
##                Max.   :1600.0                       Wood  :  3             
##                                                                            
##  BsmtCond    BsmtExposure BsmtFinType1   BsmtFinSF1     BsmtFinType2
##  Fa  :  45   Av  :221     ALQ :220     Min.   :   0.0   ALQ :  19   
##  Gd  :  65   Gd  :134     BLQ :148     1st Qu.:   0.0   BLQ :  33   
##  none:  37   Mn  :114     GLQ :418     Median : 383.5   GLQ :  14   
##  Po  :   2   No  :953     LwQ : 74     Mean   : 443.6   LwQ :  46   
##  TA  :1311   none: 38     none: 37     3rd Qu.: 712.2   none:  38   
##                           Rec :133     Max.   :5644.0   Rec :  54   
##                           Unf :430                      Unf :1256   
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741   
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49   
##  Median :   0.00   Median : 477.5   Median : 991.5   GasW :  18   Gd:241   
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1   
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428   
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0   Wall :   4            
##                                                                            
##  CentralAir Electrical      1stFlrSF       2ndFlrSF     LowQualFinSF    
##  N:  95     FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  Y:1365     FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##             FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##             Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##             SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##             NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                         
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual  TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100      Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39      1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586      Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735      Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000               3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000               Max.   :14.000  
##                                                                               
##  Functional    Fireplaces    FireplaceQu   GarageType   GarageYrBlt  
##  Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6   Min.   :   0  
##  Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870   1st Qu.:1958  
##  Min1:  31   Median :1.000   Gd  :380    Basment: 19   Median :1977  
##  Min2:  34   Mean   :0.613   none:690    BuiltIn: 88   Mean   :1869  
##  Mod :  15   3rd Qu.:1.000   Po  : 20    CarPort:  9   3rd Qu.:2001  
##  Sev :   1   Max.   :3.000   TA  :313    Detchd :387   Max.   :2010  
##  Typ :1360                               none   : 81                 
##  GarageFinish   GarageCars      GarageArea     GarageQual  GarageCond 
##  Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3   Ex  :   2  
##  none: 81     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48   Fa  :  35  
##  RFn :422     Median :2.000   Median : 480.0   Gd  :  14   Gd  :   9  
##  Unf :605     Mean   :1.767   Mean   : 473.0   none:  81   none:  81  
##               3rd Qu.:2.000   3rd Qu.: 576.0   Po  :   3   Po  :   7  
##               Max.   :4.000   Max.   :1418.0   TA  :1311   TA  :1326  
##                                                                       
##  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch      3SsnPorch     
##  N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Y:1340     Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##             Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##             3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##             Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                                
##   ScreenPorch        PoolArea        PoolQC       Fence      MiscFeature
##  Min.   :  0.00   Min.   :  0.000   Ex  :   2   GdPrv:  59   Gar2:   2  
##  1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2   GdWo :  54   none:1406  
##  Median :  0.00   Median :  0.000   Gd  :   3   MnPrv: 157   Othr:   2  
##  Mean   : 15.06   Mean   :  2.759   none:1453   MnWw :  11   Shed:  49  
##  3rd Qu.:  0.00   3rd Qu.:  0.000               none :1179   TenC:   1  
##  Max.   :480.00   Max.   :738.000                                       
##                                                                         
##     MiscVal             MoSold           YrSold        SaleType   
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   WD     :1267  
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   New    : 122  
##  Median :    0.00   Median : 6.000   Median :2008   COD    :  43  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008   ConLD  :   9  
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   ConLI  :   5  
##  Max.   :15500.00   Max.   :12.000   Max.   :2010   ConLw  :   5  
##                                                     (Other):   9  
##  SaleCondition    SalePrice     
##  Abnorml: 101   Min.   : 34900  
##  AdjLand:   4   1st Qu.:129975  
##  Alloca :  12   Median :163000  
##  Family :  20   Mean   :180921  
##  Normal :1198   3rd Qu.:214000  
##  Partial: 125   Max.   :755000  
## 
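
The column-by-column conversion above can also be written generically. The sketch below is only an illustration on a small made-up data frame (`toy`, not the Ames data) and uses base R; with dplyr loaded, `mutate(across(where(is.character), factor))` should do the same job. Note that this converts every character column, so it would also convert Neighborhood, which the explicit list above leaves as character.

```r
# Toy data standing in for train (hypothetical values)
toy <- data.frame(
  Street = c("Pave", "Grvl", "Pave"),
  Neighborhood = c("NAmes", "OldTown", "NAmes"),
  LotArea = c(8450, 9600, 11250),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leave the rest untouched
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```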

The summary shows that the missing values have been handled, except for a single NA that remains in Electrical. Let’s look at the target variable, SalePrice.
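
A quick way to check for remaining missing values, rather than scanning the summary by eye, is to count the NAs per column. The sketch below uses a made-up data frame (`toy`) in place of train.

```r
# Toy data with one missing value (hypothetical, not the Ames data)
toy <- data.frame(
  Electrical = c("SBrkr", NA, "FuseA"),
  LotArea = c(8450, 9600, 11250)
)

# Count missing values per column and show only the columns that have any
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]
```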

summary(train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
# Histogram of SalePrice
ggplot(train, aes(SalePrice)) +
  geom_histogram(col = 'white') + 
  scale_x_continuous(labels = comma) # Right-skewed
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram above shows that SalePrice is right-skewed. A long right tail tends to produce skewed residuals in a linear model, so SalePrice is log-transformed as follows:

# Histogram of log SalePrice
ggplot(train, aes(log(SalePrice))) + 
  geom_histogram(col = 'white') + 
  scale_x_continuous(labels = comma) # Approximately normal after the log transform
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Log Term of SalePrice
train$SalePrice <- log(train$SalePrice)
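
To put a number on the effect of the log transform, the sample skewness can be compared before and after. This is a minimal sketch on simulated log-normal "prices" (an assumption for illustration, not the Ames data); `skewness()` is a small helper defined here, not a function from the packages loaded above.

```r
set.seed(1)
# Simulated right-skewed "prices" (log-normal), not the Ames data
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Sample skewness: third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(price)       # clearly positive for the raw values
skewness(log(price))  # close to zero after the log transform

# Values on the log scale are returned to the original scale with exp()
all.equal(exp(log(price)), price)
```

Because the model is fit on the log scale, its predictions also have to be passed through `exp()` before they are reported as prices.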

I also noticed that there might be multicollinearity among several variables, especially among the garage variables and the basement variables. I will visualize the correlations before choosing one variable from each correlated group. A correlation heatmap is a useful visual tool for summarizing the correlations between numeric variables.

options(repr.plot.width = 10, repr.plot.height = 10)

numeric_var <- names(train)[which(sapply(train, is.numeric))]
train_cont <- train[numeric_var]

correlations <- cor(na.omit(train_cont[,-1]))
corrplot(correlations, method="square", type='lower', diag=FALSE)

According to the correlation heatmap, the predictors most correlated with SalePrice, the target variable, include GrLivArea, TotalBsmtSF, 1stFlrSF, GarageArea, FullBath, YearBuilt and YearRemodAdd. However, YearBuilt is also strongly correlated with GarageYrBlt (correlation of ~0.6), because most garages were built at the same time as the house. To avoid multicollinearity, some variables need to be either dropped or restructured.
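
Reading correlated pairs off a heatmap can be backed up by listing them from the correlation matrix. The sketch below uses simulated columns (`year_built`, `garage_year` and `lot_area` are made-up stand-ins) and an arbitrary cutoff of 0.8, which is an assumption and not a value from this analysis.

```r
set.seed(1)
n <- 200
# Simulated numeric predictors (hypothetical, not the Ames data)
year_built <- round(runif(n, 1900, 2010))
garage_year <- year_built + rpois(n, 1)   # garage built with or soon after the house
lot_area <- rnorm(n, 10000, 2000)
toy <- data.frame(year_built, garage_year, lot_area)

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the chosen cutoff
cutoff <- 0.8
hits <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[hits[, "row"]],
  var2 = colnames(cm)[hits[, "col"]],
  r = cm[hits]
)
high_pairs
```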

# Dropping Variables
# TotRmsAbvGrd is kept because the models below use it
drop <- c('YearRemodAdd', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'BsmtFinSF1')

train <- train[,!(names(train) %in% drop)]
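
Subsetting with `%in%` ignores names that match no column, so a misspelled entry in a drop list goes unnoticed. A small check with `setdiff()` catches this. The vectors below are made up for illustration, with `TotalRmsAbvGrd` as an example of a name that does not match the real column `TotRmsAbvGrd`.

```r
# Toy column names standing in for names(train) (hypothetical subset)
cols <- c("Id", "GarageArea", "TotRmsAbvGrd", "SalePrice")
drop <- c("GarageArea", "TotalRmsAbvGrd")

# Names in the drop list that match no column are silently ignored by %in%
unmatched <- setdiff(drop, cols)
unmatched

kept <- cols[!(cols %in% drop)]
kept
```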

OverallQual has the highest correlation with SalePrice, so it will very likely be one of the predictors. Let’s plot it against SalePrice.

ggplot(train, aes(OverallQual, SalePrice)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, color = "blue") +
  labs(title = "SalePrice ~ OverallQual, with linear regression") # Linear
## `geom_smooth()` using formula = 'y ~ x'

OverallQual is an ordinal feature (quality rated on a 10-point scale) that is encoded as numeric. The scatter plot shows a roughly linear relationship with log SalePrice, so converting it to a factor is optional: in the output below, the factor version raises R-squared only from .668 to .673.

lm(SalePrice ~ OverallQual, data = train) %>% summary() # R-squared .66
## 
## Call:
## lm(formula = SalePrice ~ OverallQual, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06831 -0.12974  0.01309  0.13332  0.92438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.58444    0.02727  388.18   <2e-16 ***
## OverallQual  0.23603    0.00436   54.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2303 on 1458 degrees of freedom
## Multiple R-squared:  0.6678, Adjusted R-squared:  0.6676 
## F-statistic:  2931 on 1 and 1458 DF,  p-value: < 2.2e-16
lm(SalePrice ~ factor(OverallQual), data = train) %>% summary() # .67
## 
## Call:
## lm(formula = SalePrice ~ factor(OverallQual), data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0969 -0.1238  0.0099  0.1362  0.8958 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           10.79880    0.16192  66.691  < 2e-16 ***
## factor(OverallQual)2   0.02658    0.20904   0.127  0.89884    
## factor(OverallQual)3   0.53867    0.16983   3.172  0.00155 ** 
## factor(OverallQual)4   0.75834    0.16331   4.643 3.74e-06 ***
## factor(OverallQual)5   0.98185    0.16233   6.048 1.86e-09 ***
## factor(OverallQual)6   1.16850    0.16236   7.197 9.83e-13 ***
## factor(OverallQual)7   1.42297    0.16243   8.760  < 2e-16 ***
## factor(OverallQual)8   1.69839    0.16288  10.427  < 2e-16 ***
## factor(OverallQual)9   1.99446    0.16565  12.040  < 2e-16 ***
## factor(OverallQual)10  2.12250    0.17068  12.435  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.229 on 1450 degrees of freedom
## Multiple R-squared:  0.6734, Adjusted R-squared:  0.6714 
## F-statistic: 332.2 on 9 and 1450 DF,  p-value: < 2.2e-16
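
Besides comparing R-squared, the two codings can be compared with a nested-model F-test, since the numeric coding is a special case of the factor coding. This is a sketch on simulated data (`qual` and `log_price` are made-up stand-ins for OverallQual and log SalePrice), not a result from the Ames data.

```r
set.seed(1)
n <- 500
# Simulated 1-10 rating and a log-price that is linear in it (hypothetical data)
qual <- sample(1:10, n, replace = TRUE)
log_price <- 10.5 + 0.24 * qual + rnorm(n, sd = 0.2)
toy <- data.frame(qual, log_price)

m_num <- lm(log_price ~ qual, data = toy)          # one slope
m_fac <- lm(log_price ~ factor(qual), data = toy)  # one coefficient per level

# F-test of the extra parameters used by the factor coding
anova(m_num, m_fac)

# R-squared of both codings side by side
c(numeric = summary(m_num)$r.squared, factor = summary(m_fac)$r.squared)
```

The factor coding can never have a lower R-squared than the numeric one, so the question is whether the gain justifies the extra coefficients.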

OverallQual is not the only categorical feature that is encoded as numeric. YrSold, MoSold and Fireplaces also take only a handful of distinct values, so they are converted to factors.

str(train$YrSold) # Distribution of YrSold is narrow so it makes sense to convert into categorical
##  num [1:1460] 2008 2007 2008 2006 2008 ...
str(train$YearBuilt) # YearBuilt spans a wide range (1872 to 2010), so it stays numeric
##  num [1:1460] 2003 1976 2001 1915 2000 ...
# Numeric Variables into Factors
train <- train %>% 
  mutate(YrSold = factor(YrSold),
         MoSold = factor(MoSold),
         Fireplaces = factor(Fireplaces))
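
One thing to keep in mind after such a conversion: calling `as.numeric()` on a factor returns the internal level codes, not the original numbers. The short example below uses a made-up vector of years.

```r
yr <- factor(c(2008, 2007, 2008, 2006))

as.numeric(yr)                # 3 2 3 1 (level codes, not the years)
as.numeric(as.character(yr))  # 2008 2007 2008 2006 (the original years)
```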

Data modeling

To start building the predictive model, I will add the variables that showed a high correlation with SalePrice in the correlation heatmap one at a time, alongside OverallQual.

lm(SalePrice ~ OverallQual, data = train) %>% summary() # .66
## 
## Call:
## lm(formula = SalePrice ~ OverallQual, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06831 -0.12974  0.01309  0.13332  0.92438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.58444    0.02727  388.18   <2e-16 ***
## OverallQual  0.23603    0.00436   54.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2303 on 1458 degrees of freedom
## Multiple R-squared:  0.6678, Adjusted R-squared:  0.6676 
## F-statistic:  2931 on 1 and 1458 DF,  p-value: < 2.2e-16
# GrLivArea has shown a strong correlation of ~0.6
lm(SalePrice ~ OverallQual + 
     GrLivArea, data = train) %>% summary() # .74
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78553 -0.09638  0.02084  0.12482  0.76698 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.055e+01  2.420e-02  435.95   <2e-16 ***
## OverallQual 1.789e-01  4.792e-03   37.33   <2e-16 ***
## GrLivArea   2.536e-04  1.261e-05   20.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2038 on 1457 degrees of freedom
## Multiple R-squared:   0.74,  Adjusted R-squared:  0.7396 
## F-statistic:  2073 on 2 and 1457 DF,  p-value: < 2.2e-16
# Reduced RMSE and increased R2 - keeping this predictor

# GarageCars has shown a correlation of ~0.6
lm(SalePrice ~ OverallQual + 
     GarageCars, data = train) %>% summary() # .72
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GarageCars, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92518 -0.11859  0.00723  0.11863  0.77929 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.618176   0.024933  425.88   <2e-16 ***
## OverallQual  0.184521   0.004971   37.12   <2e-16 ***
## GarageCars   0.158689   0.009200   17.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2099 on 1457 degrees of freedom
## Multiple R-squared:  0.7241, Adjusted R-squared:  0.7238 
## F-statistic:  1912 on 2 and 1457 DF,  p-value: < 2.2e-16
# TotRmsAbvGrd has shown a correlation of ~0.5
lm(SalePrice ~ OverallQual + 
     TotRmsAbvGrd, data = train) %>% summary() # .70
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + TotRmsAbvGrd, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1579 -0.1074  0.0113  0.1346  0.8945 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.392196   0.028735  361.66   <2e-16 ***
## OverallQual   0.208064   0.004510   46.14   <2e-16 ***
## TotRmsAbvGrd  0.055664   0.003837   14.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2154 on 1457 degrees of freedom
## Multiple R-squared:  0.7097, Adjusted R-squared:  0.7093 
## F-statistic:  1781 on 2 and 1457 DF,  p-value: < 2.2e-16
# FullBath has shown a correlation of ~0.4
lm(SalePrice ~ OverallQual + 
     FullBath, data = train) %>% summary() # .69
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + FullBath, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05255 -0.12542  0.01322  0.13376  0.94015 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.550189   0.026166  403.20   <2e-16 ***
## OverallQual  0.202976   0.004982   40.74   <2e-16 ***
## FullBath     0.150696   0.012507   12.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2197 on 1457 degrees of freedom
## Multiple R-squared:  0.6979, Adjusted R-squared:  0.6975 
## F-statistic:  1683 on 2 and 1457 DF,  p-value: < 2.2e-16
# YearBuilt has shown a correlation of ~0.4
lm(SalePrice ~ OverallQual + 
     YearBuilt, data = train) %>% summary() # .68
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00979 -0.12665  0.00344  0.12601  0.90340 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.1537390  0.4474525  13.753   <2e-16 ***
## OverallQual 0.2068050  0.0051475  40.175   <2e-16 ***
## YearBuilt   0.0023381  0.0002357   9.919   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.223 on 1457 degrees of freedom
## Multiple R-squared:  0.6888, Adjusted R-squared:  0.6884 
## F-statistic:  1612 on 2 and 1457 DF,  p-value: < 2.2e-16

The following variables might not show a strong correlation with the target variable in the correlation heatmap (Neighborhood and BldgType are not numeric, so they do not appear in it at all), but I would like to confirm by checking the R-squared:

lm(SalePrice ~ OverallQual + 
     MSSubClass, data = train) %>% summary() # .68
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + MSSubClass, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09190 -0.12543  0.01134  0.13285  0.89127 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.6327731  0.0277889 382.626  < 2e-16 ***
## OverallQual  0.2369772  0.0042966  55.155  < 2e-16 ***
## MSSubClass  -0.0009512  0.0001405  -6.771 1.84e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2269 on 1457 degrees of freedom
## Multiple R-squared:  0.6779, Adjusted R-squared:  0.6775 
## F-statistic:  1533 on 2 and 1457 DF,  p-value: < 2.2e-16
# MSSubClass adds only about .01 to R2, but its coefficient is highly significant

lm(SalePrice ~ OverallQual + 
     Neighborhood, data = train) %>% summary() # .75
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + Neighborhood, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93020 -0.11256  0.00119  0.11100  0.64743 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         10.884858   0.061926 175.773  < 2e-16 ***
## OverallQual          0.178996   0.005413  33.066  < 2e-16 ***
## NeighborhoodBlueste -0.132296   0.148772  -0.889 0.374016    
## NeighborhoodBrDale  -0.355032   0.069724  -5.092 4.01e-07 ***
## NeighborhoodBrkSide -0.109368   0.056031  -1.952 0.051142 .  
## NeighborhoodClearCr  0.300246   0.061529   4.880 1.18e-06 ***
## NeighborhoodCollgCr  0.090252   0.050966   1.771 0.076803 .  
## NeighborhoodCrawfor  0.198690   0.055898   3.555 0.000391 ***
## NeighborhoodEdwards -0.081844   0.053382  -1.533 0.125455    
## NeighborhoodGilbert  0.097277   0.053266   1.826 0.068018 .  
## NeighborhoodIDOTRR  -0.289408   0.059713  -4.847 1.39e-06 ***
## NeighborhoodMeadowV -0.210552   0.069754  -3.018 0.002585 ** 
## NeighborhoodMitchel  0.048175   0.056621   0.851 0.395008    
## NeighborhoodNAmes    0.023770   0.050970   0.466 0.641040    
## NeighborhoodNoRidge  0.372273   0.057500   6.474 1.31e-10 ***
## NeighborhoodNPkVill -0.092355   0.082212  -1.123 0.261465    
## NeighborhoodNridgHt  0.256095   0.053604   4.778 1.96e-06 ***
## NeighborhoodNWAmes   0.112929   0.053742   2.101 0.035787 *  
## NeighborhoodOldTown -0.145669   0.052621  -2.768 0.005708 ** 
## NeighborhoodSawyer   0.026793   0.054728   0.490 0.624514    
## NeighborhoodSawyerW  0.074214   0.054927   1.351 0.176865    
## NeighborhoodSomerst  0.098308   0.052783   1.863 0.062735 .  
## NeighborhoodStoneBr  0.240023   0.062732   3.826 0.000136 ***
## NeighborhoodSWISU   -0.020160   0.063208  -0.319 0.749817    
## NeighborhoodTimber   0.197365   0.058017   3.402 0.000688 ***
## NeighborhoodVeenker  0.255165   0.076977   3.315 0.000940 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1988 on 1434 degrees of freedom
## Multiple R-squared:  0.7565, Adjusted R-squared:  0.7522 
## F-statistic: 178.2 on 25 and 1434 DF,  p-value: < 2.2e-16
# Neighborhood gives the largest gain so far (R2 .76)

lm(SalePrice ~ OverallQual + 
     BldgType, data = train) %>% summary() # .68
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + BldgType, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08061 -0.12299  0.01488  0.12946  0.91208 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    10.585486   0.027457 385.524  < 2e-16 ***
## OverallQual     0.238842   0.004361  54.762  < 2e-16 ***
## BldgType2fmCon -0.038934   0.041069  -0.948    0.343    
## BldgTypeDuplex  0.010410   0.032121   0.324    0.746    
## BldgTypeTwnhs  -0.261296   0.034759  -7.517 9.74e-14 ***
## BldgTypeTwnhsE -0.128791   0.022089  -5.830 6.80e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.224 on 1454 degrees of freedom
## Multiple R-squared:  0.6866, Adjusted R-squared:  0.6855 
## F-statistic:   637 on 5 and 1454 DF,  p-value: < 2.2e-16
lm(SalePrice ~ OverallQual + 
     OverallCond, data = train) %>% summary() # .66
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + OverallCond, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05819 -0.12773  0.01492  0.13265  0.92065 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.500974   0.042568  246.69   <2e-16 ***
## OverallQual  0.237052   0.004370   54.24   <2e-16 ***
## OverallCond  0.013850   0.005431    2.55   0.0109 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2299 on 1457 degrees of freedom
## Multiple R-squared:  0.6693, Adjusted R-squared:  0.6688 
## F-statistic:  1474 on 2 and 1457 DF,  p-value: < 2.2e-16
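
The repeated `lm()` calls above can be condensed into a loop that fits OverallQual plus one candidate at a time and collects the R-squared values. The sketch below runs on simulated data (the columns reuse the real names, but the values are made up), so the numbers it prints are not results for the Ames data.

```r
set.seed(1)
n <- 300
# Simulated stand-ins for a few predictors (hypothetical data)
toy <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GrLivArea = rnorm(n, 1500, 400),
  OverallCond = sample(1:9, n, replace = TRUE)
)
toy$SalePrice <- 10.5 + 0.18 * toy$OverallQual + 0.00025 * toy$GrLivArea +
  rnorm(n, sd = 0.2)

candidates <- c("GrLivArea", "OverallCond")

# Fit SalePrice ~ OverallQual + <candidate> for each candidate and collect R-squared
r2 <- sapply(candidates, function(v) {
  f <- reformulate(c("OverallQual", v), response = "SalePrice")
  summary(lm(f, data = toy))$r.squared
})

sort(round(r2, 4), decreasing = TRUE)
```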

With the help of the correlation heatmap and the regression results above, I can now build an additive model with 5 predictors: OverallQual, Neighborhood, GrLivArea, TotRmsAbvGrd and MSSubClass.

lm(SalePrice ~ OverallQual + 
     GrLivArea +
     TotRmsAbvGrd, data = train) %>% summary() # .74
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + TotRmsAbvGrd, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80965 -0.09708  0.02200  0.12456  0.76176 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.056e+01  3.017e-02 350.071   <2e-16 ***
## OverallQual   1.784e-01  4.838e-03  36.868   <2e-16 ***
## GrLivArea     2.660e-04  2.039e-05  13.041   <2e-16 ***
## TotRmsAbvGrd -4.516e-03  5.873e-03  -0.769    0.442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2039 on 1456 degrees of freedom
## Multiple R-squared:  0.7401, Adjusted R-squared:  0.7395 
## F-statistic:  1382 on 3 and 1456 DF,  p-value: < 2.2e-16
lm(SalePrice ~ OverallQual + 
     GrLivArea +
     TotRmsAbvGrd +
     Neighborhood, data = train) %>% summary() # .81
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + TotRmsAbvGrd + 
##     Neighborhood, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43630 -0.08454  0.00761  0.09766  0.54773 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.095e+01  5.576e-02 196.340  < 2e-16 ***
## OverallQual          1.186e-01  5.400e-03  21.966  < 2e-16 ***
## GrLivArea            2.622e-04  1.832e-05  14.312  < 2e-16 ***
## TotRmsAbvGrd        -7.590e-04  5.071e-03  -0.150 0.881034    
## NeighborhoodBlueste -1.947e-01  1.280e-01  -1.521 0.128551    
## NeighborhoodBrDale  -3.708e-01  5.996e-02  -6.184 8.14e-10 ***
## NeighborhoodBrkSide -1.793e-01  4.833e-02  -3.710 0.000215 ***
## NeighborhoodClearCr  1.287e-01  5.356e-02   2.403 0.016393 *  
## NeighborhoodCollgCr  4.404e-02  4.389e-02   1.003 0.315866    
## NeighborhoodCrawfor  4.950e-02  4.855e-02   1.020 0.308134    
## NeighborhoodEdwards -1.856e-01  4.616e-02  -4.021 6.10e-05 ***
## NeighborhoodGilbert  4.472e-03  4.599e-02   0.097 0.922547    
## NeighborhoodIDOTRR  -3.611e-01  5.145e-02  -7.019 3.45e-12 ***
## NeighborhoodMeadowV -2.783e-01  6.019e-02  -4.624 4.10e-06 ***
## NeighborhoodMitchel -1.281e-02  4.879e-02  -0.263 0.792914    
## NeighborhoodNAmes   -5.529e-02  4.398e-02  -1.257 0.208869    
## NeighborhoodNoRidge  1.356e-01  5.083e-02   2.669 0.007703 ** 
## NeighborhoodNPkVill -1.180e-01  7.072e-02  -1.669 0.095401 .  
## NeighborhoodNridgHt  1.945e-01  4.617e-02   4.213 2.68e-05 ***
## NeighborhoodNWAmes  -1.622e-02  4.657e-02  -0.348 0.727677    
## NeighborhoodOldTown -2.670e-01  4.558e-02  -5.859 5.78e-09 ***
## NeighborhoodSawyer  -4.754e-02  4.718e-02  -1.008 0.313772    
## NeighborhoodSawyerW -2.098e-02  4.745e-02  -0.442 0.658416    
## NeighborhoodSomerst  6.379e-02  4.547e-02   1.403 0.160856    
## NeighborhoodStoneBr  1.815e-01  5.410e-02   3.355 0.000814 ***
## NeighborhoodSWISU   -2.221e-01  5.510e-02  -4.031 5.85e-05 ***
## NeighborhoodTimber   1.125e-01  5.004e-02   2.249 0.024675 *  
## NeighborhoodVeenker  1.985e-01  6.636e-02   2.991 0.002831 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.171 on 1432 degrees of freedom
## Multiple R-squared:  0.8202, Adjusted R-squared:  0.8168 
## F-statistic: 241.9 on 27 and 1432 DF,  p-value: < 2.2e-16
lm(SalePrice ~ OverallQual + 
     GrLivArea +
     TotRmsAbvGrd +
     Neighborhood +
     MSSubClass, data = train) %>% summary() # .82
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + TotRmsAbvGrd + 
##     Neighborhood + MSSubClass, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47258 -0.07998  0.00921  0.09338  0.56531 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.106e+01  5.677e-02 194.819  < 2e-16 ***
## OverallQual          1.156e-01  5.317e-03  21.753  < 2e-16 ***
## GrLivArea            2.759e-04  1.808e-05  15.261  < 2e-16 ***
## TotRmsAbvGrd        -1.642e-03  4.979e-03  -0.330  0.74164    
## NeighborhoodBlueste -1.567e-01  1.258e-01  -1.246  0.21313    
## NeighborhoodBrDale  -3.300e-01  5.912e-02  -5.583 2.83e-08 ***
## NeighborhoodBrkSide -2.420e-01  4.819e-02  -5.021 5.78e-07 ***
## NeighborhoodClearCr  6.418e-02  5.329e-02   1.204  0.22871    
## NeighborhoodCollgCr -2.290e-02  4.403e-02  -0.520  0.60298    
## NeighborhoodCrawfor -8.439e-03  4.830e-02  -0.175  0.86131    
## NeighborhoodEdwards -2.431e-01  4.597e-02  -5.289 1.42e-07 ***
## NeighborhoodGilbert -5.064e-02  4.575e-02  -1.107  0.26856    
## NeighborhoodIDOTRR  -4.201e-01  5.113e-02  -8.216 4.67e-16 ***
## NeighborhoodMeadowV -2.376e-01  5.935e-02  -4.004 6.55e-05 ***
## NeighborhoodMitchel -6.854e-02  4.848e-02  -1.414  0.15763    
## NeighborhoodNAmes   -1.280e-01  4.428e-02  -2.892  0.00388 ** 
## NeighborhoodNoRidge  6.957e-02  5.069e-02   1.373  0.17009    
## NeighborhoodNPkVill -9.403e-02  6.950e-02  -1.353  0.17626    
## NeighborhoodNridgHt  1.447e-01  4.582e-02   3.158  0.00162 ** 
## NeighborhoodNWAmes  -8.552e-02  4.666e-02  -1.833  0.06706 .  
## NeighborhoodOldTown -3.166e-01  4.524e-02  -6.998 3.98e-12 ***
## NeighborhoodSawyer  -1.185e-01  4.729e-02  -2.505  0.01237 *  
## NeighborhoodSawyerW -7.656e-02  4.718e-02  -1.623  0.10483    
## NeighborhoodSomerst  2.555e-02  4.493e-02   0.569  0.56975    
## NeighborhoodStoneBr  1.505e-01  5.328e-02   2.825  0.00479 ** 
## NeighborhoodSWISU   -2.697e-01  5.447e-02  -4.951 8.25e-07 ***
## NeighborhoodTimber   4.063e-02  5.007e-02   0.811  0.41722    
## NeighborhoodVeenker  1.475e-01  6.550e-02   2.252  0.02445 *  
## MSSubClass          -9.119e-04  1.230e-04  -7.411 2.13e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1678 on 1431 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8234 
## F-statistic:   244 on 28 and 1431 DF,  p-value: < 2.2e-16
# Looks good! Note that TotRmsAbvGrd is not significant once GrLivArea is in the model (p = 0.74)
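
The coefficient of TotRmsAbvGrd is not significant in the models that also contain GrLivArea, which is what one would expect if the two carry overlapping information. One way to check this is a variance inflation factor (VIF), computed here by hand on simulated data (made-up values in which room count follows living area; this is an assumption for illustration, not a measurement on the Ames data).

```r
set.seed(1)
n <- 300
# Simulated living area and room count that move together (hypothetical data)
GrLivArea <- rnorm(n, 1500, 400)
TotRmsAbvGrd <- round(GrLivArea / 230 + rnorm(n, sd = 0.8))
OverallQual <- sample(1:10, n, replace = TRUE)
toy <- data.frame(GrLivArea, TotRmsAbvGrd, OverallQual)

# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the others
vif <- sapply(names(toy), function(v) {
  f <- reformulate(setdiff(names(toy), v), response = v)
  1 / (1 - summary(lm(f, data = toy))$r.squared)
})
round(vif, 2)
```

A VIF close to 1 means a predictor is nearly unrelated to the others. Larger values indicate that its coefficient is estimated less precisely because of the overlap.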

Cross validation

# Randomly sample 70% of the rows
set.seed(124)
index <- sample(x = 1:nrow(train), size = nrow(train)*.7, replace = F)

head(index) # These are row numbers
## [1] 1345  167 1002 1435  261  728
# Subset train using the index to create train_fold
train_fold <- train[index, ]

# Subset the remaining row to create validation fold.
validation_fold <- train[-index, ]

# Fit model
model <- lm(SalePrice ~ OverallQual + 
     Neighborhood +
     GrLivArea +
     TotRmsAbvGrd +
     MSSubClass, data = train) 

# Get predictions for the validation fold
predictions <- predict(model, newdata = validation_fold)

# Create functions for calculating RMSE and R-squared
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

R2 <- function(observed, predicted){
  TSS <- sum((observed - mean(observed))^2)
  RSS <- sum((observed - predicted)^2)
  1- RSS/TSS
}

rmse(validation_fold$SalePrice, predictions)
## [1] 0.1651863
R2(validation_fold$SalePrice, predictions)
## [1] 0.826656
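
A single 70/30 split gives one estimate that depends on the seed. Below is a minimal sketch of 5-fold cross-validation in base R, in which each model is fit only on the rows outside the held-out fold. It assumes simulated data in place of `train`; the `toy` data frame, its columns, and the fold count are illustrative choices, not part of the original analysis.

```r
# Simulated stand-in for train (hypothetical columns, not the Ames data).
set.seed(124)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 12 + 0.3 * toy$x1 - 0.1 * toy$x2 + rnorm(n, sd = 0.15)

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

R2 <- function(observed, predicted) {
  TSS <- sum((observed - mean(observed))^2)
  RSS <- sum((observed - predicted)^2)
  1 - RSS / TSS
}

k <- 5
# Assign every row to one of k folds at random.
fold_id <- sample(rep(seq_len(k), length.out = n))

cv_results <- t(sapply(seq_len(k), function(i) {
  train_fold      <- toy[fold_id != i, ]
  validation_fold <- toy[fold_id == i, ]
  # Fit on the training fold only, so the validation rows stay unseen.
  fit  <- lm(y ~ x1 + x2, data = train_fold)
  pred <- predict(fit, newdata = validation_fold)
  c(rmse = rmse(validation_fold$y, pred), R2 = R2(validation_fold$y, pred))
}))

cv_results            # one row per fold
colMeans(cv_results)  # averaged out-of-fold RMSE and R-squared
```

With a factor predictor such as Neighborhood, a level that appears only in the held-out fold makes predict() fail, so rare levels may need to be checked or pooled before splitting.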

Submission

# 1. Fit model to the entire train set.
submission_model <- lm(SalePrice ~ OverallQual + 
     Neighborhood +
     GrLivArea +
     TotRmsAbvGrd +
     MSSubClass, data = train)
# 2. Check there are no missing observations for your selected predictors in the test set.
test %>% 
  select(Neighborhood, GrLivArea, TotRmsAbvGrd, MSSubClass) %>% 
  summarize_all(count_missings) 
## # A tibble: 1 × 4
##   Neighborhood GrLivArea TotRmsAbvGrd MSSubClass
##          <int>     <int>        <int>      <int>
## 1            0         0            0          0
# 3. Make predictions for the test set.
submission_predictions <- predict(submission_model, newdata = test) # Use the newdata argument!

head(submission_predictions)
##        1        2        3        4        5        6 
## 11.73106 11.96451 11.97256 12.07967 12.37131 12.09374
# 4. Format your submission file.
submission <- test %>% 
  select(Id) %>% 
  mutate(SalePrice = exp(submission_predictions))

head(submission)
## # A tibble: 6 × 2
##      Id SalePrice
##   <dbl>     <dbl>
## 1  1461   124375.
## 2  1462   157081.
## 3  1463   158349.
## 4  1464   176252.
## 5  1465   235935.
## 6  1466   178749.
# row.names = FALSE keeps the file to the Id and SalePrice columns only
write.csv(submission, "submission.csv", row.names = FALSE)
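
Before uploading, it can help to confirm that the file has exactly the expected shape. The sketch below shows one way to do that, using made-up ids and log-scale predictions in place of `test` and `submission_predictions`, and a temporary file path in place of "submission.csv".

```r
# Hypothetical stand-ins for the test ids and log-scale predictions.
test_ids  <- 1461:1466
log_preds <- c(11.7, 12.0, 11.9, 12.1, 12.4, 12.2)

# Back-transform from the log scale to dollars.
submission <- data.frame(Id = test_ids, SalePrice = exp(log_preds))

# Basic checks before writing: no missing or non-positive prices, unique ids.
stopifnot(!anyNA(submission$SalePrice),
          all(submission$SalePrice > 0),
          !anyDuplicated(submission$Id))

# row.names = FALSE avoids an extra unnamed column of row numbers.
path <- tempfile(fileext = ".csv")
write.csv(submission, path, row.names = FALSE)

# Read the file back to confirm its shape.
check <- read.csv(path)
names(check)
nrow(check) == length(test_ids)
```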

Conclusion

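These are my results for the project. (1) RMSE and R2 (>.75) on the train set: 0.1678 (the residual standard error from the model summary) and 0.8234 (adjusted R2; multiple R2 is 0.8268) (2) Estimated RMSE and R2 on the test set, computed on the validation fold: 0.1651863 and 0.826656 (3) Kaggle score (returned log RMSE) and rank: 0.17077 and 3098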
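
The Kaggle score for this competition is the RMSE between the logarithm of the predicted price and the logarithm of the observed price. Assuming SalePrice in train is already on the log scale, as the exp() back-transform in the submission step suggests, the validation RMSE is measured in the same units as the Kaggle score. Below is a minimal sketch of the metric; the function name `log_rmse` and the prices are made up for illustration.

```r
# RMSE between log predicted and log observed prices (dollar-scale inputs).
log_rmse <- function(observed, predicted) {
  sqrt(mean((log(observed) - log(predicted))^2))
}

observed  <- c(120000, 160000, 210000)   # illustrative prices only
predicted <- c(125000, 150000, 230000)

log_rmse(observed, predicted)

# Equivalent to the plain RMSE applied to log-scale values.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
all.equal(log_rmse(observed, predicted), rmse(log(observed), log(predicted)))
```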