Table of Contents: 1. Introduction 2. Import & Check Dataset 3. Exploratory Data Analysis (EDA) 4. Data Modeling 5. Cross Validation 6. Submission 7. Conclusion
Introduction: The objective of the House Prices machine learning challenge is to develop regression models that accurately predict the sale price of residential homes in Ames, Iowa from features such as square footage and location. The goal of this project is to build an additive linear model that uses just 5 of the 79 explanatory variables and achieves an R2 above .75. The requirements and steps of the project are listed below.
Requirements: (1) RMSE and R2 (>.75) on the train set (2) estimated RMSE and R2 on the test set (3) Kaggle score (returned log RMSE) and rank
Steps: 1. EDA: understand how the variables relate to one another, the structure and meaning of the missing observations 2. Develop a linear model of house prices using just 5 predictors additively. 3. Submit predictions to Kaggle. 4. Use a simple cross-validation method to ensure that your results will generalize well to new data
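The metrics named in the requirements can be computed directly from observed and predicted values. The helpers below are a minimal sketch in base R; the function names `rmse`, `r2`, and `log_rmse` are illustrative and not part of the original analysis, and the prices are made up.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# R-squared: 1 - (residual sum of squares / total sum of squares)
r2 <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

# RMSE computed on the log scale (the "log RMSE" named in the requirements)
log_rmse <- function(observed, predicted) {
  rmse(log(observed), log(predicted))
}

# Toy example with made-up prices (not from the Ames data)
observed  <- c(100000, 150000, 200000)
predicted <- c(110000, 140000, 205000)
rmse(observed, predicted)
r2(observed, predicted)
log_rmse(observed, predicted)
```

Working on the log scale means that a 10% error on a cheap home counts about as much as a 10% error on an expensive one.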
# setup & import data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.1
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(dplyr)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(corrplot)
## corrplot 0.92 loaded
# overview
test <- read_csv("./house-prices-advanced-regression-techniques/test.csv")
## Rows: 1459 Columns: 80
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train <- read_csv("./house-prices-advanced-regression-techniques/train.csv")
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
submit_example <- read_csv("./house-prices-advanced-regression-techniques/sample_submission.csv")
## Rows: 1459 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Id, SalePrice
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(train) # Numeric NAs to fill: MasVnrArea, GarageYrBlt. For BsmtCond, BsmtFinType2, BsmtFinSF2, GarageQual, GarageCond, MiscFeature: NA = none
## # A tibble: 6 × 81
## Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl AllPub
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl AllPub
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl AllPub
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl AllPub
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl AllPub
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl AllPub
## # … with 71 more variables: LotConfig <chr>, LandSlope <chr>,
## # Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## # HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## # YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## # Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## # BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
head(test)
## # A tibble: 6 × 80
## Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1461 20 RH 80 11622 Pave <NA> Reg Lvl AllPub
## 2 1462 20 RL 81 14267 Pave <NA> IR1 Lvl AllPub
## 3 1463 60 RL 74 13830 Pave <NA> IR1 Lvl AllPub
## 4 1464 60 RL 78 9978 Pave <NA> IR1 Lvl AllPub
## 5 1465 120 RL 43 5005 Pave <NA> IR1 HLS AllPub
## 6 1466 60 RL 75 10000 Pave <NA> IR1 Lvl AllPub
## # … with 70 more variables: LotConfig <chr>, LandSlope <chr>,
## # Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## # HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## # YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## # Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## # BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
head(submit_example)
## # A tibble: 6 × 2
## Id SalePrice
## <dbl> <dbl>
## 1 1461 169277.
## 2 1462 187758.
## 3 1463 183584.
## 4 1464 179317.
## 5 1465 150730.
## 6 1466 177151.
In total, 34 predictors contain missing values. My approach is to review what the NAs are likely to mean and convert them into meaningful values (for example, a "none" category for character variables, which are later converted into factors); where no sensible meaning can be assigned, they will be removed. Let’s look at them as a whole as well as in smaller groups.
count_missings <- function(x) sum(is.na(x))
train %>%
summarize_all(count_missings) # Handy summarize_all function
## # A tibble: 1 × 81
## Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 259 0 0 1369 0 0 0
## # … with 71 more variables: LotConfig <int>, LandSlope <int>,
## # Neighborhood <int>, Condition1 <int>, Condition2 <int>, BldgType <int>,
## # HouseStyle <int>, OverallQual <int>, OverallCond <int>, YearBuilt <int>,
## # YearRemodAdd <int>, RoofStyle <int>, RoofMatl <int>, Exterior1st <int>,
## # Exterior2nd <int>, MasVnrType <int>, MasVnrArea <int>, ExterQual <int>,
## # ExterCond <int>, Foundation <int>, BsmtQual <int>, BsmtCond <int>,
## # BsmtExposure <int>, BsmtFinType1 <int>, BsmtFinSF1 <int>, …
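The one-row summary above is wide and gets truncated when printed. As a complement, the sketch below lists only the columns that actually contain NAs, sorted by count. It uses base R on a made-up data frame (`toy` stands in for `train`; the values are not from the Ames data).

```r
# Toy data frame standing in for `train` (values are made up)
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, "Grvl", NA),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

# Count NAs per column, then keep only the columns that have any
na_counts <- colSums(is.na(toy))
na_counts <- sort(na_counts[na_counts > 0], decreasing = TRUE)
na_counts
```

Applied to the real data, the same two lines would give a compact overview of which variables need attention and how many rows are affected.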
train <- train %>% # Save the result back into the original data
mutate(Alley = replace_na(data = Alley, replace = "none")) # Overwrite the existing column with new values
# Check that it worked
train %>%
count(Alley)
## # A tibble: 3 × 2
## Alley n
## <chr> <int>
## 1 Grvl 50
## 2 Pave 41
## 3 none 1369
The replacement works for Alley. What about the masonry veneer variables?
count_missings(train$MasVnrType)
## [1] 8
train %>%
filter(is.na(MasVnrType), is.na(MasVnrArea))
## # A tibble: 8 × 81
## Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 235 60 RL NA 7851 Pave none Reg Lvl AllPub
## 2 530 20 RL NA 32668 Pave none IR1 Lvl AllPub
## 3 651 60 FV 65 8125 Pave none Reg Lvl AllPub
## 4 937 20 RL 67 10083 Pave none Reg Lvl AllPub
## 5 974 20 FV 95 11639 Pave none Reg Lvl AllPub
## 6 978 120 FV 35 4274 Pave Pave IR1 Lvl AllPub
## 7 1244 20 RL 107 13891 Pave none Reg Lvl AllPub
## 8 1279 60 RL 75 9473 Pave none Reg Lvl AllPub
## # … with 71 more variables: LotConfig <chr>, LandSlope <chr>,
## # Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## # HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## # YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## # Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## # BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
train <- train %>%
mutate(MasVnrType = replace_na(MasVnrType, "none"),
MasVnrArea = replace_na(MasVnrArea, 0))
# Check that it worked
train %>%
filter(is.na(MasVnrType) | is.na(MasVnrArea)) # OR, so a row with an NA in either column would show up
## # A tibble: 0 × 81
## # … with 81 variables: Id <dbl>, MSSubClass <dbl>, MSZoning <chr>,
## # LotFrontage <dbl>, LotArea <dbl>, Street <chr>, Alley <chr>,
## # LotShape <chr>, LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## # LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## # BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## # YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, …
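One pitfall in the step above: the factor summary later in this notebook lists both `none` (8 rows) and `None` (864 rows) for MasVnrType, which suggests the data already uses a "None" label and the replacement created a second, differently capitalised one. A minimal sketch, on made-up values, of checking the existing labels and merging the two spellings (`mas_vnr_type` is a hypothetical stand-in for the column):

```r
# Made-up stand-in for train$MasVnrType after the NA replacement
mas_vnr_type <- c("BrkFace", "None", "none", "Stone", "None", "none")

# Inspect the labels that are present before choosing a replacement value
table(mas_vnr_type)

# Merge the lower-case spelling into the label the data already uses
mas_vnr_type[mas_vnr_type == "none"] <- "None"
table(mas_vnr_type)
```

Whether the two labels should be merged is a judgment call, since it treats "not recorded" the same as "no veneer"; the sketch only shows the mechanics.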
Let’s try one more variable, this time a numeric one, before imputing the rest.
count_missings(train$LotFrontage)
## [1] 259
train <- train %>% # Again, saving this change into the original data
mutate(LotFrontage = replace_na(LotFrontage, median(LotFrontage, na.rm = TRUE))) # na.rm = TRUE so the median ignores the NAs
# Check that it worked
train$LotFrontage %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 60.00 69.00 69.86 79.00 313.00
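The same median can be stored and reused when the test set is prepared, so that both sets are filled with one value. This is a sketch under that assumption, not a step the notebook takes here; `toy_train`, `toy_test`, and `frontage_median` are hypothetical names and the values are made up.

```r
toy_train <- data.frame(LotFrontage = c(60, NA, 80, 70, NA))
toy_test  <- data.frame(LotFrontage = c(NA, 90, 50))

# Median of the observed training values only
frontage_median <- median(toy_train$LotFrontage, na.rm = TRUE)

# Fill the gaps in both sets with the training median
toy_train$LotFrontage[is.na(toy_train$LotFrontage)] <- frontage_median
toy_test$LotFrontage[is.na(toy_test$LotFrontage)]   <- frontage_median
```

Using the training median for the test set keeps information from the test data out of the preprocessing.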
I will now impute in groups to prevent making mistakes.
count_missings(train$BsmtQual)
## [1] 37
count_missings(train$BsmtCond)
## [1] 37
count_missings(train$BsmtExposure)
## [1] 38
count_missings(train$BsmtFinType1)
## [1] 37
count_missings(train$BsmtFinType2)
## [1] 38
train %>%
filter(is.na(BsmtQual), is.na(BsmtCond), is.na(BsmtExposure), is.na(BsmtFinType1), is.na(BsmtFinType2))
## # A tibble: 37 × 81
## Id MSSubClass MSZon…¹ LotFr…² LotArea Street Alley LotSh…³ LandC…⁴ Utili…⁵
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 18 90 RL 72 10791 Pave none Reg Lvl AllPub
## 2 40 90 RL 65 6040 Pave none Reg Lvl AllPub
## 3 91 20 RL 60 7200 Pave none Reg Lvl AllPub
## 4 103 90 RL 64 7018 Pave none Reg Bnk AllPub
## 5 157 20 RL 60 7200 Pave none Reg Lvl AllPub
## 6 183 20 RL 60 9060 Pave none Reg Lvl AllPub
## 7 260 20 RM 70 12702 Pave none Reg Lvl AllPub
## 8 343 90 RL 69 8544 Pave none Reg Lvl AllPub
## 9 363 85 RL 64 7301 Pave none Reg Lvl AllPub
## 10 372 50 RL 80 17120 Pave none Reg Lvl AllPub
## # … with 27 more rows, 71 more variables: LotConfig <chr>, LandSlope <chr>,
## # Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## # HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## # YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## # Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## # BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
train <- train %>%
mutate(BsmtQual = replace_na(BsmtQual, "none"),
BsmtCond = replace_na(BsmtCond, "none"),
BsmtExposure = replace_na(BsmtExposure, "none"),
BsmtFinType1 = replace_na(BsmtFinType1, "none"),
BsmtFinType2 = replace_na(BsmtFinType2, "none"))
# Check that it worked
train %>%
filter(is.na(BsmtQual) | is.na(BsmtCond) | is.na(BsmtExposure) | is.na(BsmtFinType1) | is.na(BsmtFinType2)) # OR, so an NA in any of the columns would show up
## # A tibble: 0 × 81
## # … with 81 variables: Id <dbl>, MSSubClass <dbl>, MSZoning <chr>,
## # LotFrontage <dbl>, LotArea <dbl>, Street <chr>, Alley <chr>,
## # LotShape <chr>, LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## # LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## # BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## # YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, …
count_missings(train$GarageType)
## [1] 81
count_missings(train$GarageYrBlt)
## [1] 81
count_missings(train$GarageFinish)
## [1] 81
count_missings(train$GarageQual)
## [1] 81
count_missings(train$GarageCond)
## [1] 81
train %>%
filter(is.na(GarageType), is.na(GarageYrBlt), is.na(GarageFinish), is.na(GarageQual), is.na(GarageCond))
## # A tibble: 81 × 81
## Id MSSubClass MSZon…¹ LotFr…² LotArea Street Alley LotSh…³ LandC…⁴ Utili…⁵
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 40 90 RL 65 6040 Pave none Reg Lvl AllPub
## 2 49 190 RM 33 4456 Pave none Reg Lvl AllPub
## 3 79 90 RL 72 10778 Pave none Reg Lvl AllPub
## 4 89 50 C (all) 105 8470 Pave none IR1 Lvl AllPub
## 5 90 20 RL 60 8070 Pave none Reg Lvl AllPub
## 6 100 20 RL 77 9320 Pave none IR1 Lvl AllPub
## 7 109 50 RM 85 8500 Pave none Reg Lvl AllPub
## 8 126 190 RM 60 6780 Pave none Reg Lvl AllPub
## 9 128 45 RM 55 4388 Pave none IR1 Bnk AllPub
## 10 141 20 RL 70 10500 Pave none Reg Lvl AllPub
## # … with 71 more rows, 71 more variables: LotConfig <chr>, LandSlope <chr>,
## # Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## # HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## # YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## # Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## # BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
train <- train %>%
mutate(GarageType = replace_na(GarageType, "none"),
GarageYrBlt = replace_na(GarageYrBlt, 0),
GarageFinish = replace_na(GarageFinish, "none"),
GarageQual = replace_na(GarageQual, "none"),
GarageCond = replace_na(GarageCond, "none"))
# Check that it worked
train %>%
filter(is.na(GarageType) | is.na(GarageYrBlt) | is.na(GarageFinish) | is.na(GarageQual) | is.na(GarageCond)) # OR, so an NA in any of the columns would show up
## # A tibble: 0 × 81
## # … with 81 variables: Id <dbl>, MSSubClass <dbl>, MSZoning <chr>,
## # LotFrontage <dbl>, LotArea <dbl>, Street <chr>, Alley <chr>,
## # LotShape <chr>, LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## # LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## # BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## # YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, …
count_missings(train$FireplaceQu)
## [1] 690
count_missings(train$PoolQC)
## [1] 1453
count_missings(train$Fence)
## [1] 1179
count_missings(train$MiscFeature)
## [1] 1406
train %>%
filter(is.na(FireplaceQu), is.na(PoolQC), is.na(Fence), is.na(MiscFeature))
## # A tibble: 521 × 81
## Id MSSubClass MSZon…¹ LotFr…² LotArea Street Alley LotSh…³ LandC…⁴ Utili…⁵
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave none Reg Lvl AllPub
## 2 11 20 RL 70 11200 Pave none Reg Lvl AllPub
## 3 13 20 RL 69 12968 Pave none IR2 Lvl AllPub
## 4 19 20 RL 66 13695 Pave none Reg Lvl AllPub
## 5 27 20 RL 60 7200 Pave none Reg Lvl AllPub
## 6 30 30 RM 60 6324 Pave none IR1 Lvl AllPub
## 7 33 20 RL 85 11049 Pave none Reg Lvl AllPub
## 8 37 20 RL 112 10859 Pave none Reg Lvl AllPub
## 9 39 20 RL 68 7922 Pave none Reg Lvl AllPub
## 10 40 90 RL 65 6040 Pave none Reg Lvl AllPub
## # … with 511 more rows, 71 more variables: LotConfig <chr>, LandSlope <chr>,
## # Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## # HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## # YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## # Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## # BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
train <- train %>%
mutate(FireplaceQu = replace_na(FireplaceQu, "none"),
PoolQC = replace_na(PoolQC, "none"),
Fence = replace_na(Fence, "none"),
MiscFeature = replace_na(MiscFeature, "none"))
# Check that it worked
train %>%
filter(is.na(FireplaceQu) | is.na(PoolQC) | is.na(Fence) | is.na(MiscFeature)) # OR, so an NA in any of the columns would show up
## # A tibble: 0 × 81
## # … with 81 variables: Id <dbl>, MSSubClass <dbl>, MSZoning <chr>,
## # LotFrontage <dbl>, LotArea <dbl>, Street <chr>, Alley <chr>,
## # LotShape <chr>, LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## # LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## # BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## # YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, …
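The grouped replacements above all follow one pattern: for a set of columns, NA is taken to mean that the feature is absent. That pattern can be written once as a loop over column names. A minimal base R sketch on made-up data (`toy` and `absent_cols` are hypothetical names):

```r
toy <- data.frame(
  FireplaceQu = c("Gd", NA, "TA"),
  PoolQC      = c(NA, NA, "Ex"),
  Fence       = c(NA, "MnPrv", NA),
  stringsAsFactors = FALSE
)

# Columns where NA is taken to mean that the feature is absent
absent_cols <- c("FireplaceQu", "PoolQC", "Fence")

# Replace NA with "none" in each listed column
for (col in absent_cols) {
  toy[[col]][is.na(toy[[col]])] <- "none"
}

# Count any NA left in any of the listed columns
sum(is.na(toy[absent_cols]))
```

Keeping the column names in one vector also makes it straightforward to apply the identical replacement to the test set later.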
# Review data structure
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 60.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 69.86
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 79.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.1
## 3rd Qu.: 164.2
## Max. :1600.0
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
## HeatingQC CentralAir Electrical 1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
## 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. : 0
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1958
## Median :1.000 Mode :character Mode :character Median :1977
## Mean :0.613 Mean :1869
## 3rd Qu.:1.000 3rd Qu.:2001
## Max. :3.000 Max. :2010
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
# Convert Character to Factors
train <- train %>%
mutate(MSZoning = factor(MSZoning),
Street = factor(Street),
Alley = factor(Alley),
LotShape = factor(LotShape),
LandContour = factor(LandContour),
Utilities = factor(Utilities),
LotConfig = factor(LotConfig),
LandSlope = factor(LandSlope),
Condition1 = factor(Condition1),
Condition2 = factor(Condition2),
BldgType = factor(BldgType),
HouseStyle = factor(HouseStyle),
RoofStyle = factor(RoofStyle),
RoofMatl = factor(RoofMatl),
Exterior1st = factor(Exterior1st),
Exterior2nd = factor(Exterior2nd),
MasVnrType = factor(MasVnrType),
ExterQual = factor(ExterQual),
ExterCond = factor(ExterCond),
Foundation = factor(Foundation),
BsmtQual = factor(BsmtQual),
BsmtCond = factor(BsmtCond),
BsmtExposure = factor(BsmtExposure),
BsmtFinType1 = factor(BsmtFinType1),
BsmtFinType2 = factor(BsmtFinType2),
Heating = factor(Heating),
HeatingQC = factor(HeatingQC),
CentralAir = factor(CentralAir),
Electrical = factor(Electrical),
KitchenQual = factor(KitchenQual),
Functional = factor(Functional),
FireplaceQu = factor(FireplaceQu),
GarageType = factor(GarageType),
GarageFinish = factor(GarageFinish),
GarageQual = factor(GarageQual),
GarageCond = factor(GarageCond),
PavedDrive = factor(PavedDrive),
PoolQC = factor(PoolQC),
Fence = factor(Fence),
MiscFeature = factor(MiscFeature),
SaleType = factor(SaleType),
SaleCondition = factor(SaleCondition))
# Check that it worked
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 60.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 69.86
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 79.00
## Max. :1460.0 Max. :190.0 Max. :313.00
##
## LotArea Street Alley LotShape LandContour Utilities
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63 AllPub:1459
## 1st Qu.: 7554 Pave:1454 none:1369 IR2: 41 HLS: 50 NoSeWa: 1
## Median : 9478 Pave: 41 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## LotConfig LandSlope Neighborhood Condition1 Condition2
## Corner : 263 Gtl:1382 Length:1460 Norm :1260 Norm :1445
## CulDSac: 94 Mod: 65 Class :character Feedr : 81 Feedr : 6
## FR2 : 47 Sev: 13 Mode :character Artery : 48 Artery : 2
## FR3 : 4 RRAn : 26 PosN : 2
## Inside :1052 PosN : 19 RRNn : 2
## RRAe : 11 PosA : 1
## (Other): 15 (Other): 2
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1Fam :1220 1Story :726 Min. : 1.000 Min. :1.000 Min. :1872
## 2fmCon: 31 2Story :445 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Duplex: 52 1.5Fin :154 Median : 6.000 Median :5.000 Median :1973
## Twnhs : 43 SLvl : 65 Mean : 6.099 Mean :5.575 Mean :1971
## TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## 1.5Unf : 14 Max. :10.000 Max. :9.000 Max. :2010
## (Other): 19
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## Min. :1950 Flat : 13 CompShg:1434 VinylSd:515 VinylSd:504
## 1st Qu.:1967 Gable :1141 Tar&Grv: 11 HdBoard:222 MetalSd:214
## Median :1994 Gambrel: 11 WdShngl: 6 MetalSd:220 HdBoard:207
## Mean :1985 Hip : 286 WdShake: 5 Wd Sdng:206 Wd Sdng:197
## 3rd Qu.:2004 Mansard: 7 ClyTile: 1 Plywood:108 Plywood:142
## Max. :2010 Shed : 2 Membran: 1 CemntBd: 61 CmentBd: 60
## (Other): 2 (Other):128 (Other):136
## MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual
## BrkCmn : 15 Min. : 0.0 Ex: 52 Ex: 3 BrkTil:146 Ex :121
## BrkFace:445 1st Qu.: 0.0 Fa: 14 Fa: 28 CBlock:634 Fa : 35
## none : 8 Median : 0.0 Gd:488 Gd: 146 PConc :647 Gd :618
## None :864 Mean : 103.1 TA:906 Po: 1 Slab : 24 none: 37
## Stone :128 3rd Qu.: 164.2 TA:1282 Stone : 6 TA :649
## Max. :1600.0 Wood : 3
##
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Fa : 45 Av :221 ALQ :220 Min. : 0.0 ALQ : 19
## Gd : 65 Gd :134 BLQ :148 1st Qu.: 0.0 BLQ : 33
## none: 37 Mn :114 GLQ :418 Median : 383.5 GLQ : 14
## Po : 2 No :953 LwQ : 74 Mean : 443.6 LwQ : 46
## TA :1311 none: 38 none: 37 3rd Qu.: 712.2 none: 38
## Rec :133 Max. :5644.0 Rec : 54
## Unf :430 Unf :1256
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49
## Median : 0.00 Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 46.55 Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :1474.00 Max. :2336.0 Max. :6110.0 Wall : 4
##
## CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## N: 95 FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## Y:1365 FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Gd:586 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## Maj1: 14 Min. :0.000 Ex : 24 2Types : 6 Min. : 0
## Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870 1st Qu.:1958
## Min1: 31 Median :1.000 Gd :380 Basment: 19 Median :1977
## Min2: 34 Mean :0.613 none:690 BuiltIn: 88 Mean :1869
## Mod : 15 3rd Qu.:1.000 Po : 20 CarPort: 9 3rd Qu.:2001
## Sev : 1 Max. :3.000 TA :313 Detchd :387 Max. :2010
## Typ :1360 none : 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## Fin :352 Min. :0.000 Min. : 0.0 Ex : 3 Ex : 2
## none: 81 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48 Fa : 35
## RFn :422 Median :2.000 Median : 480.0 Gd : 14 Gd : 9
## Unf :605 Mean :1.767 Mean : 473.0 none: 81 none: 81
## 3rd Qu.:2.000 3rd Qu.: 576.0 Po : 3 Po : 7
## Max. :4.000 Max. :1418.0 TA :1311 TA :1326
##
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Y:1340 Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## Min. : 0.00 Min. : 0.000 Ex : 2 GdPrv: 59 Gar2: 2
## 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2 GdWo : 54 none:1406
## Median : 0.00 Median : 0.000 Gd : 3 MnPrv: 157 Othr: 2
## Mean : 15.06 Mean : 2.759 none:1453 MnWw : 11 Shed: 49
## 3rd Qu.: 0.00 3rd Qu.: 0.000 none :1179 TenC: 1
## Max. :480.00 Max. :738.000
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 WD :1267
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 New : 122
## Median : 0.00 Median : 6.000 Median :2008 COD : 43
## Mean : 43.49 Mean : 6.322 Mean :2008 ConLD : 9
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 ConLI : 5
## Max. :15500.00 Max. :12.000 Max. :2010 ConLw : 5
## (Other): 9
## SaleCondition SalePrice
## Abnorml: 101 Min. : 34900
## AdjLand: 4 1st Qu.:129975
## Alloca : 12 Median :163000
## Family : 20 Mean :180921
## Normal :1198 3rd Qu.:214000
## Partial: 125 Max. :755000
##
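The factor summary shows Neighborhood still listed as character, since it is not in the conversion list above. A more compact way to convert every character column at once, sketched in base R on made-up data (this is an alternative, not the code used in this notebook):

```r
toy <- data.frame(
  Neighborhood = c("CollgCr", "Veenker", "CollgCr"),
  Street       = c("Pave", "Pave", "Grvl"),
  LotArea      = c(8450, 9600, 11250),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

Converting by type rather than by name avoids leaving a column out by accident, at the cost of also converting any character column that was meant to stay as text.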
Most of the missing values have now been handled; the summary above shows one remaining NA, in Electrical. Let’s look at the target variable, SalePrice.
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
# Histogram of SalePrice
ggplot(train, aes(SalePrice)) +
geom_histogram(col = 'white') +
scale_x_continuous(labels = comma) # Right-skewed
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram above shows that SalePrice is right-skewed. A skewed response tends to produce skewed residuals and lets a few expensive homes dominate a least-squares fit, so SalePrice is log-transformed as follows:
# Histogram of log SalePrice
ggplot(train, aes(log(SalePrice))) +
geom_histogram(col = 'white') +
scale_x_continuous(labels = comma) # Approximately normal after the log transform
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Log Term of SalePrice
train$SalePrice <- log(train$SalePrice)
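Because the model will be fitted to log(SalePrice), its predictions come out on the log scale and need exp() before they are reported as prices. A minimal sketch with simulated data; the data frame `toy`, the coefficients used to simulate it, and `new_homes` are all made up for illustration.

```r
set.seed(1)

# Simulated stand-in data: price grows multiplicatively with living area
toy <- data.frame(GrLivArea = runif(200, 500, 3000))
toy$SalePrice <- exp(11 + 0.0005 * toy$GrLivArea + rnorm(200, sd = 0.1))

# Fit on the log scale
fit <- lm(log(SalePrice) ~ GrLivArea, data = toy)

# Predictions are log prices; exp() maps them back to the price scale
new_homes  <- data.frame(GrLivArea = c(1000, 2000))
log_pred   <- predict(fit, newdata = new_homes)
price_pred <- exp(log_pred)
price_pred
```

The same back-transformation applies to the submission file, whose SalePrice column is on the original price scale.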
I also notice that there might be multicollinearity among several variables, especially among the garage variables and the basement variables. I will visualize the correlations before keeping one variable from each correlated group. A correlation heatmap is a useful visual tool for summarizing the correlations between numeric variables.
options(repr.plot.width = 10, repr.plot.height = 10)
numeric_var <- names(train)[which(sapply(train, is.numeric))]
train_cont <- train[numeric_var]
correlations <- cor(na.omit(train_cont[,-1]))
corrplot(correlations, method="square", type='lower', diag=FALSE)
According to the correlation heatmap, the predictors that are correlated with SalePrice, the target variable, include GrLivArea, TotalBsmtSF, 1stFlrSF, GarageArea, FullBath, YearBuilt and YearRemodAdd. However, YearBuilt is highly correlated with GarageYrBlt (correlation of ~0.6) because most garages were built at the same time as the houses. To avoid multicollinearity, some variables need to be either dropped or restructured.
# Dropping Variables
# (TotRmsAbvGrd is not in the drop list because the models below use it)
drop <- c('YearRemodAdd', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'BsmtFinSF1')
train <- train[,!(names(train) %in% drop)]
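Reading pairs off a heatmap works for a handful of variables, but highly correlated pairs can also be listed programmatically. The following is a minimal sketch in base R on simulated data; the column names `x1`, `x2`, `x3` and the cutoff of 0.6 are assumptions chosen for illustration.

```r
# Simulated numeric data (illustrative only): x2 is built to track x1 closely
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
sim <- data.frame(x1, x2, x3)

cor_mat <- cor(sim)

# Keep each pair once (upper triangle) and flag |r| above an assumed cutoff
cutoff <- 0.6
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

On the real data, the same idea could be applied to the `correlations` matrix computed above, keeping one variable from each flagged pair.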
OverallQual shows the highest correlation with SalePrice, so it will very likely be included as one of the predictors. Let’s check it by plotting it against SalePrice.
ggplot(train, aes(OverallQual, SalePrice)) +
geom_point() +
geom_smooth(method = "lm", se = F, color = "blue") +
labs(title = "log(SalePrice) ~ OverallQual, with linear fit") # Linear
## `geom_smooth()` using formula = 'y ~ x'
OverallQual is a categorical feature (quality split into 10 categories) that is encoded as numeric. The scatter plot shows a roughly linear relationship with log SalePrice, so converting it to a factor is optional: as the two fits below show, treating it as a factor raises R-squared only from .66 to .67.
lm(SalePrice ~ OverallQual, data = train) %>% summary() # R-squared .66
##
## Call:
## lm(formula = SalePrice ~ OverallQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06831 -0.12974 0.01309 0.13332 0.92438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.58444 0.02727 388.18 <2e-16 ***
## OverallQual 0.23603 0.00436 54.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2303 on 1458 degrees of freedom
## Multiple R-squared: 0.6678, Adjusted R-squared: 0.6676
## F-statistic: 2931 on 1 and 1458 DF, p-value: < 2.2e-16
lm(SalePrice ~ factor(OverallQual), data = train) %>% summary() # .67
##
## Call:
## lm(formula = SalePrice ~ factor(OverallQual), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0969 -0.1238 0.0099 0.1362 0.8958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.79880 0.16192 66.691 < 2e-16 ***
## factor(OverallQual)2 0.02658 0.20904 0.127 0.89884
## factor(OverallQual)3 0.53867 0.16983 3.172 0.00155 **
## factor(OverallQual)4 0.75834 0.16331 4.643 3.74e-06 ***
## factor(OverallQual)5 0.98185 0.16233 6.048 1.86e-09 ***
## factor(OverallQual)6 1.16850 0.16236 7.197 9.83e-13 ***
## factor(OverallQual)7 1.42297 0.16243 8.760 < 2e-16 ***
## factor(OverallQual)8 1.69839 0.16288 10.427 < 2e-16 ***
## factor(OverallQual)9 1.99446 0.16565 12.040 < 2e-16 ***
## factor(OverallQual)10 2.12250 0.17068 12.435 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.229 on 1450 degrees of freedom
## Multiple R-squared: 0.6734, Adjusted R-squared: 0.6714
## F-statistic: 332.2 on 9 and 1450 DF, p-value: < 2.2e-16
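Because the numeric coding is a special case of the factor coding, the two fits are nested and can also be compared with an F-test instead of by R-squared alone. The following is a minimal sketch on simulated data; `rating` and `y` are made-up stand-ins, not the actual OverallQual and SalePrice columns.

```r
# Simulated ordinal rating with a roughly linear effect (illustrative only)
set.seed(3)
n      <- 500
rating <- sample(1:10, n, replace = TRUE)
y      <- 10.5 + 0.2 * rating + rnorm(n, sd = 0.2)
sim    <- data.frame(y, rating)

fit_num <- lm(y ~ rating, data = sim)          # one slope
fit_fac <- lm(y ~ factor(rating), data = sim)  # nine dummy coefficients

# F-test of the 8 extra parameters; a large p-value favors the simpler fit
cmp <- anova(fit_num, fit_fac)
cmp

r2_gain <- summary(fit_fac)$r.squared - summary(fit_num)$r.squared
```

When the gain in R-squared is small, keeping the variable numeric uses one coefficient instead of nine for little loss of fit.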
OverallQual is not the only categorical feature that is encoded as numeric. YrSold, MoSold and Fireplaces take only a few distinct values, so they are converted to factors below.
str(train$YrSold) # YrSold covers only a few years (2006-2010), so it makes sense to convert it to a factor
## num [1:1460] 2008 2007 2008 2006 2008 ...
str(train$YearBuilt) # YearBuilt spans many distinct years, so it is kept numeric
## num [1:1460] 2003 1976 2001 1915 2000 ...
# Numeric Variables into Factors
train <- train %>%
mutate(YrSold = factor(YrSold),
MoSold = factor(MoSold),
Fireplaces = factor(Fireplaces))
I will start building the predictive model with the variables that showed a high correlation with SalePrice in the correlation heatmap.
lm(SalePrice ~ OverallQual, data = train) %>% summary() # .66
##
## Call:
## lm(formula = SalePrice ~ OverallQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06831 -0.12974 0.01309 0.13332 0.92438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.58444 0.02727 388.18 <2e-16 ***
## OverallQual 0.23603 0.00436 54.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2303 on 1458 degrees of freedom
## Multiple R-squared: 0.6678, Adjusted R-squared: 0.6676
## F-statistic: 2931 on 1 and 1458 DF, p-value: < 2.2e-16
# GrLivArea has shown a strong correlation of ~0.6
lm(SalePrice ~ OverallQual +
GrLivArea, data = train) %>% summary() # .74
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.78553 -0.09638 0.02084 0.12482 0.76698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.055e+01 2.420e-02 435.95 <2e-16 ***
## OverallQual 1.789e-01 4.792e-03 37.33 <2e-16 ***
## GrLivArea 2.536e-04 1.261e-05 20.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2038 on 1457 degrees of freedom
## Multiple R-squared: 0.74, Adjusted R-squared: 0.7396
## F-statistic: 2073 on 2 and 1457 DF, p-value: < 2.2e-16
# Reduced RMSE and increased R2 - keeping this predictor
# GarageCars has shown a correlation of ~0.6
lm(SalePrice ~ OverallQual +
GarageCars, data = train) %>% summary() # .72
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GarageCars, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92518 -0.11859 0.00723 0.11863 0.77929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.618176 0.024933 425.88 <2e-16 ***
## OverallQual 0.184521 0.004971 37.12 <2e-16 ***
## GarageCars 0.158689 0.009200 17.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2099 on 1457 degrees of freedom
## Multiple R-squared: 0.7241, Adjusted R-squared: 0.7238
## F-statistic: 1912 on 2 and 1457 DF, p-value: < 2.2e-16
# TotRmsAbvGrd has shown a correlation of ~0.5
lm(SalePrice ~ OverallQual +
TotRmsAbvGrd, data = train) %>% summary() # .70
##
## Call:
## lm(formula = SalePrice ~ OverallQual + TotRmsAbvGrd, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1579 -0.1074 0.0113 0.1346 0.8945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.392196 0.028735 361.66 <2e-16 ***
## OverallQual 0.208064 0.004510 46.14 <2e-16 ***
## TotRmsAbvGrd 0.055664 0.003837 14.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2154 on 1457 degrees of freedom
## Multiple R-squared: 0.7097, Adjusted R-squared: 0.7093
## F-statistic: 1781 on 2 and 1457 DF, p-value: < 2.2e-16
# FullBath has shown a correlation of ~0.4
lm(SalePrice ~ OverallQual +
FullBath, data = train) %>% summary() # .69
##
## Call:
## lm(formula = SalePrice ~ OverallQual + FullBath, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.05255 -0.12542 0.01322 0.13376 0.94015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.550189 0.026166 403.20 <2e-16 ***
## OverallQual 0.202976 0.004982 40.74 <2e-16 ***
## FullBath 0.150696 0.012507 12.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2197 on 1457 degrees of freedom
## Multiple R-squared: 0.6979, Adjusted R-squared: 0.6975
## F-statistic: 1683 on 2 and 1457 DF, p-value: < 2.2e-16
# YearBuilt has shown a correlation of ~0.4
lm(SalePrice ~ OverallQual +
YearBuilt, data = train) %>% summary() # .68
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00979 -0.12665 0.00344 0.12601 0.90340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1537390 0.4474525 13.753 <2e-16 ***
## OverallQual 0.2068050 0.0051475 40.175 <2e-16 ***
## YearBuilt 0.0023381 0.0002357 9.919 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.223 on 1457 degrees of freedom
## Multiple R-squared: 0.6888, Adjusted R-squared: 0.6884
## F-statistic: 1612 on 2 and 1457 DF, p-value: < 2.2e-16
The following variables might not show a strong correlation with the target variable in the correlation heatmap, but I would like to double-check by looking at the R-squared:
lm(SalePrice ~ OverallQual +
MSSubClass, data = train) %>% summary() # .67
##
## Call:
## lm(formula = SalePrice ~ OverallQual + MSSubClass, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09190 -0.12543 0.01134 0.13285 0.89127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6327731 0.0277889 382.626 < 2e-16 ***
## OverallQual 0.2369772 0.0042966 55.155 < 2e-16 ***
## MSSubClass -0.0009512 0.0001405 -6.771 1.84e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2269 on 1457 degrees of freedom
## Multiple R-squared: 0.6779, Adjusted R-squared: 0.6775
## F-statistic: 1533 on 2 and 1457 DF, p-value: < 2.2e-16
# MSSubClass raises R-squared only slightly (0.6678 to 0.6779), but its coefficient is highly significant
lm(SalePrice ~ OverallQual +
Neighborhood, data = train) %>% summary() # .75
##
## Call:
## lm(formula = SalePrice ~ OverallQual + Neighborhood, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93020 -0.11256 0.00119 0.11100 0.64743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.884858 0.061926 175.773 < 2e-16 ***
## OverallQual 0.178996 0.005413 33.066 < 2e-16 ***
## NeighborhoodBlueste -0.132296 0.148772 -0.889 0.374016
## NeighborhoodBrDale -0.355032 0.069724 -5.092 4.01e-07 ***
## NeighborhoodBrkSide -0.109368 0.056031 -1.952 0.051142 .
## NeighborhoodClearCr 0.300246 0.061529 4.880 1.18e-06 ***
## NeighborhoodCollgCr 0.090252 0.050966 1.771 0.076803 .
## NeighborhoodCrawfor 0.198690 0.055898 3.555 0.000391 ***
## NeighborhoodEdwards -0.081844 0.053382 -1.533 0.125455
## NeighborhoodGilbert 0.097277 0.053266 1.826 0.068018 .
## NeighborhoodIDOTRR -0.289408 0.059713 -4.847 1.39e-06 ***
## NeighborhoodMeadowV -0.210552 0.069754 -3.018 0.002585 **
## NeighborhoodMitchel 0.048175 0.056621 0.851 0.395008
## NeighborhoodNAmes 0.023770 0.050970 0.466 0.641040
## NeighborhoodNoRidge 0.372273 0.057500 6.474 1.31e-10 ***
## NeighborhoodNPkVill -0.092355 0.082212 -1.123 0.261465
## NeighborhoodNridgHt 0.256095 0.053604 4.778 1.96e-06 ***
## NeighborhoodNWAmes 0.112929 0.053742 2.101 0.035787 *
## NeighborhoodOldTown -0.145669 0.052621 -2.768 0.005708 **
## NeighborhoodSawyer 0.026793 0.054728 0.490 0.624514
## NeighborhoodSawyerW 0.074214 0.054927 1.351 0.176865
## NeighborhoodSomerst 0.098308 0.052783 1.863 0.062735 .
## NeighborhoodStoneBr 0.240023 0.062732 3.826 0.000136 ***
## NeighborhoodSWISU -0.020160 0.063208 -0.319 0.749817
## NeighborhoodTimber 0.197365 0.058017 3.402 0.000688 ***
## NeighborhoodVeenker 0.255165 0.076977 3.315 0.000940 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1988 on 1434 degrees of freedom
## Multiple R-squared: 0.7565, Adjusted R-squared: 0.7522
## F-statistic: 178.2 on 25 and 1434 DF, p-value: < 2.2e-16
# Neighborhood looks promising too: R-squared rises to .75
lm(SalePrice ~ OverallQual +
BldgType, data = train) %>% summary() # .68
##
## Call:
## lm(formula = SalePrice ~ OverallQual + BldgType, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08061 -0.12299 0.01488 0.12946 0.91208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.585486 0.027457 385.524 < 2e-16 ***
## OverallQual 0.238842 0.004361 54.762 < 2e-16 ***
## BldgType2fmCon -0.038934 0.041069 -0.948 0.343
## BldgTypeDuplex 0.010410 0.032121 0.324 0.746
## BldgTypeTwnhs -0.261296 0.034759 -7.517 9.74e-14 ***
## BldgTypeTwnhsE -0.128791 0.022089 -5.830 6.80e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.224 on 1454 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6855
## F-statistic: 637 on 5 and 1454 DF, p-value: < 2.2e-16
lm(SalePrice ~ OverallQual +
OverallCond, data = train) %>% summary() # .66
##
## Call:
## lm(formula = SalePrice ~ OverallQual + OverallCond, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.05819 -0.12773 0.01492 0.13265 0.92065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.500974 0.042568 246.69 <2e-16 ***
## OverallQual 0.237052 0.004370 54.24 <2e-16 ***
## OverallCond 0.013850 0.005431 2.55 0.0109 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2299 on 1457 degrees of freedom
## Multiple R-squared: 0.6693, Adjusted R-squared: 0.6688
## F-statistic: 1474 on 2 and 1457 DF, p-value: < 2.2e-16
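The one-at-a-time comparisons above can be condensed into a loop that fits the base predictor plus each candidate and collects the results in one vector. The following is a minimal sketch on simulated data; the column names (`qual`, `area`, `noise`, `zone`) are made up for illustration, and adjusted R-squared is used here as an assumed choice so that factors with many levels are not favored automatically.

```r
# Simulated data (illustrative only); column names are made up
set.seed(4)
n   <- 300
sim <- data.frame(
  qual  = rnorm(n),
  area  = rnorm(n),
  noise = rnorm(n),
  zone  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
sim$y <- 1 + 0.8 * sim$qual + 0.5 * sim$area + rnorm(n, sd = 0.5)

base       <- "qual"
candidates <- c("area", "noise", "zone")

# Fit y ~ base + candidate for each candidate and record adjusted R-squared
screen <- sapply(candidates, function(v) {
  f <- reformulate(c(base, v), response = "y")
  summary(lm(f, data = sim))$adj.r.squared
})

sort(screen, decreasing = TRUE)
```

The sorted vector gives the same ranking that the individual `summary()` calls provide, in a form that is easier to scan.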
With the help of the correlation heatmap and the linear regression results, I can now build a model with 5 predictors additively. The 5 predictors are OverallQual, Neighborhood, GrLivArea, TotRmsAbvGrd and MSSubClass.
lm(SalePrice ~ OverallQual +
GrLivArea +
TotRmsAbvGrd, data = train) %>% summary() # .74
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + TotRmsAbvGrd,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.80965 -0.09708 0.02200 0.12456 0.76176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.056e+01 3.017e-02 350.071 <2e-16 ***
## OverallQual 1.784e-01 4.838e-03 36.868 <2e-16 ***
## GrLivArea 2.660e-04 2.039e-05 13.041 <2e-16 ***
## TotRmsAbvGrd -4.516e-03 5.873e-03 -0.769 0.442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2039 on 1456 degrees of freedom
## Multiple R-squared: 0.7401, Adjusted R-squared: 0.7395
## F-statistic: 1382 on 3 and 1456 DF, p-value: < 2.2e-16
lm(SalePrice ~ OverallQual +
GrLivArea +
TotRmsAbvGrd +
Neighborhood, data = train) %>% summary() # .82
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + TotRmsAbvGrd +
## Neighborhood, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43630 -0.08454 0.00761 0.09766 0.54773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.095e+01 5.576e-02 196.340 < 2e-16 ***
## OverallQual 1.186e-01 5.400e-03 21.966 < 2e-16 ***
## GrLivArea 2.622e-04 1.832e-05 14.312 < 2e-16 ***
## TotRmsAbvGrd -7.590e-04 5.071e-03 -0.150 0.881034
## NeighborhoodBlueste -1.947e-01 1.280e-01 -1.521 0.128551
## NeighborhoodBrDale -3.708e-01 5.996e-02 -6.184 8.14e-10 ***
## NeighborhoodBrkSide -1.793e-01 4.833e-02 -3.710 0.000215 ***
## NeighborhoodClearCr 1.287e-01 5.356e-02 2.403 0.016393 *
## NeighborhoodCollgCr 4.404e-02 4.389e-02 1.003 0.315866
## NeighborhoodCrawfor 4.950e-02 4.855e-02 1.020 0.308134
## NeighborhoodEdwards -1.856e-01 4.616e-02 -4.021 6.10e-05 ***
## NeighborhoodGilbert 4.472e-03 4.599e-02 0.097 0.922547
## NeighborhoodIDOTRR -3.611e-01 5.145e-02 -7.019 3.45e-12 ***
## NeighborhoodMeadowV -2.783e-01 6.019e-02 -4.624 4.10e-06 ***
## NeighborhoodMitchel -1.281e-02 4.879e-02 -0.263 0.792914
## NeighborhoodNAmes -5.529e-02 4.398e-02 -1.257 0.208869
## NeighborhoodNoRidge 1.356e-01 5.083e-02 2.669 0.007703 **
## NeighborhoodNPkVill -1.180e-01 7.072e-02 -1.669 0.095401 .
## NeighborhoodNridgHt 1.945e-01 4.617e-02 4.213 2.68e-05 ***
## NeighborhoodNWAmes -1.622e-02 4.657e-02 -0.348 0.727677
## NeighborhoodOldTown -2.670e-01 4.558e-02 -5.859 5.78e-09 ***
## NeighborhoodSawyer -4.754e-02 4.718e-02 -1.008 0.313772
## NeighborhoodSawyerW -2.098e-02 4.745e-02 -0.442 0.658416
## NeighborhoodSomerst 6.379e-02 4.547e-02 1.403 0.160856
## NeighborhoodStoneBr 1.815e-01 5.410e-02 3.355 0.000814 ***
## NeighborhoodSWISU -2.221e-01 5.510e-02 -4.031 5.85e-05 ***
## NeighborhoodTimber 1.125e-01 5.004e-02 2.249 0.024675 *
## NeighborhoodVeenker 1.985e-01 6.636e-02 2.991 0.002831 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.171 on 1432 degrees of freedom
## Multiple R-squared: 0.8202, Adjusted R-squared: 0.8168
## F-statistic: 241.9 on 27 and 1432 DF, p-value: < 2.2e-16
lm(SalePrice ~ OverallQual +
GrLivArea +
TotRmsAbvGrd +
Neighborhood +
MSSubClass, data = train) %>% summary() # .82
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + TotRmsAbvGrd +
## Neighborhood + MSSubClass, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47258 -0.07998 0.00921 0.09338 0.56531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.106e+01 5.677e-02 194.819 < 2e-16 ***
## OverallQual 1.156e-01 5.317e-03 21.753 < 2e-16 ***
## GrLivArea 2.759e-04 1.808e-05 15.261 < 2e-16 ***
## TotRmsAbvGrd -1.642e-03 4.979e-03 -0.330 0.74164
## NeighborhoodBlueste -1.567e-01 1.258e-01 -1.246 0.21313
## NeighborhoodBrDale -3.300e-01 5.912e-02 -5.583 2.83e-08 ***
## NeighborhoodBrkSide -2.420e-01 4.819e-02 -5.021 5.78e-07 ***
## NeighborhoodClearCr 6.418e-02 5.329e-02 1.204 0.22871
## NeighborhoodCollgCr -2.290e-02 4.403e-02 -0.520 0.60298
## NeighborhoodCrawfor -8.439e-03 4.830e-02 -0.175 0.86131
## NeighborhoodEdwards -2.431e-01 4.597e-02 -5.289 1.42e-07 ***
## NeighborhoodGilbert -5.064e-02 4.575e-02 -1.107 0.26856
## NeighborhoodIDOTRR -4.201e-01 5.113e-02 -8.216 4.67e-16 ***
## NeighborhoodMeadowV -2.376e-01 5.935e-02 -4.004 6.55e-05 ***
## NeighborhoodMitchel -6.854e-02 4.848e-02 -1.414 0.15763
## NeighborhoodNAmes -1.280e-01 4.428e-02 -2.892 0.00388 **
## NeighborhoodNoRidge 6.957e-02 5.069e-02 1.373 0.17009
## NeighborhoodNPkVill -9.403e-02 6.950e-02 -1.353 0.17626
## NeighborhoodNridgHt 1.447e-01 4.582e-02 3.158 0.00162 **
## NeighborhoodNWAmes -8.552e-02 4.666e-02 -1.833 0.06706 .
## NeighborhoodOldTown -3.166e-01 4.524e-02 -6.998 3.98e-12 ***
## NeighborhoodSawyer -1.185e-01 4.729e-02 -2.505 0.01237 *
## NeighborhoodSawyerW -7.656e-02 4.718e-02 -1.623 0.10483
## NeighborhoodSomerst 2.555e-02 4.493e-02 0.569 0.56975
## NeighborhoodStoneBr 1.505e-01 5.328e-02 2.825 0.00479 **
## NeighborhoodSWISU -2.697e-01 5.447e-02 -4.951 8.25e-07 ***
## NeighborhoodTimber 4.063e-02 5.007e-02 0.811 0.41722
## NeighborhoodVeenker 1.475e-01 6.550e-02 2.252 0.02445 *
## MSSubClass -9.119e-04 1.230e-04 -7.411 2.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1678 on 1431 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8234
## F-statistic: 244 on 28 and 1431 DF, p-value: < 2.2e-16
# Looks good!
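In the output above, the TotRmsAbvGrd coefficient is not significant once GrLivArea is in the model (p = 0.74), which is what overlap between two size-related predictors tends to produce. One way to check such overlap is the variance inflation factor. The following is a minimal sketch computed by hand in base R on simulated data; `area`, `rooms` and `qual` are made-up stand-ins, and the sketch covers numeric predictors only (a factor such as Neighborhood would need a generalized version).

```r
# Simulated predictors (illustrative only): rooms is largely driven by area
set.seed(5)
n     <- 400
area  <- rnorm(n, mean = 1500, sd = 500)
rooms <- 2 + area / 300 + rnorm(n, sd = 0.8)
qual  <- rnorm(n, mean = 6, sd = 1.4)
sim   <- data.frame(area, rooms, qual)

# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j
# on all the other predictors
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

Values near 1 indicate little overlap, while clearly larger values indicate that a predictor is mostly explained by the others.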
# Randomly sample 70% of the rows
set.seed(124)
index <- sample(x = 1:nrow(train), size = nrow(train)*.7, replace = F)
head(index) # These are row numbers
## [1] 1345 167 1002 1435 261 728
# Subset train using the index to create train_fold
train_fold <- train[index, ]
# Subset the remaining rows to create the validation fold.
validation_fold <- train[-index, ]
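# Fit model on the training fold only, so the validation fold stays unseen
# (the metrics printed below came from a fit on the full train set, which includes the validation rows)
model <- lm(SalePrice ~ OverallQual +
Neighborhood +
GrLivArea +
TotRmsAbvGrd +
MSSubClass, data = train_fold)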
# Get predictions for the validation fold
predictions <- predict(model, newdata = validation_fold)
# Create functions for calculating RMSE and R-squared
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
R2 <- function(observed, predicted){
TSS <- sum((observed - mean(observed))^2)
RSS <- sum((observed - predicted)^2)
1- RSS/TSS
}
rmse(validation_fold$SalePrice, predictions)
## [1] 0.1651863
R2(validation_fold$SalePrice, predictions)
## [1] 0.826656
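A single 70/30 split gives one estimate that depends on the random seed. k-fold cross-validation averages the error over several splits, with every row held out exactly once. The following is a minimal sketch in base R on simulated data; the variables `x1`, `x2`, `y` and the choice of 5 folds are assumptions made for illustration.

```r
# Simulated data (illustrative only)
set.seed(6)
n   <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 0.7 * sim$x2 + rnorm(n, sd = 0.5)

k       <- 5
fold_id <- sample(rep(1:k, length.out = n))  # random fold label per row

cv_rmse <- numeric(k)
for (i in 1:k) {
  fit_rows <- sim[fold_id != i, ]  # k - 1 folds used for fitting
  held_out <- sim[fold_id == i, ]  # 1 fold kept unseen
  fit      <- lm(y ~ x1 + x2, data = fit_rows)
  pred     <- predict(fit, newdata = held_out)
  cv_rmse[i] <- sqrt(mean((held_out$y - pred)^2))
}

mean(cv_rmse)  # average out-of-sample RMSE across the folds
```

The spread of `cv_rmse` across folds also gives a rough sense of how much a single split can vary.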
# 1. Fit model to the entire train set.
submission_model <- lm(SalePrice ~ OverallQual +
Neighborhood +
GrLivArea +
TotRmsAbvGrd +
MSSubClass, data = train)
# 2. Check there are no missing observations for your selected predictors in the test set.
test %>%
select(Neighborhood, GrLivArea, TotRmsAbvGrd, MSSubClass) %>%
summarize_all(count_missings)
## # A tibble: 1 × 4
## Neighborhood GrLivArea TotRmsAbvGrd MSSubClass
## <int> <int> <int> <int>
## 1 0 0 0 0
# 3. Make predictions for the test set.
submission_predictions <- predict(submission_model, newdata = test) # Use the newdata argument!
head(submission_predictions)
## 1 2 3 4 5 6
## 11.73106 11.96451 11.97256 12.07967 12.37131 12.09374
# 4. Format your submission file.
submission <- test %>%
select(Id) %>%
mutate(SalePrice = exp(submission_predictions))
head(submission)
## # A tibble: 6 × 2
## Id SalePrice
## <dbl> <dbl>
## 1 1461 124375.
## 2 1462 157081.
## 3 1463 158349.
## 4 1464 176252.
## 5 1465 235935.
## 6 1466 178749.
write.csv(submission, "submission.csv", row.names = FALSE) # keep only the Id and SalePrice columns
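Before uploading, a few quick checks on the submission table can catch common problems such as missing predictions or an extra row-name column. The following is a minimal sketch on a toy table; `toy_submission` and the log-scale predictions are made-up stand-ins, and the file is written to a temporary path rather than to `submission.csv`.

```r
# Toy stand-ins (illustrative only): log-scale predictions for four test rows
test_ids  <- c(1461, 1462, 1463, 1464)
log_preds <- c(11.73, 11.96, 11.97, 12.08)

toy_submission <- data.frame(Id = test_ids, SalePrice = exp(log_preds))

# Checks before writing the file
stopifnot(
  nrow(toy_submission) == length(test_ids),  # one row per test Id
  !anyNA(toy_submission$SalePrice),          # no missing predictions
  all(toy_submission$SalePrice > 0)          # prices are back on the original scale
)

# row.names = FALSE keeps the file to exactly the Id and SalePrice columns
out_file <- tempfile(fileext = ".csv")
write.csv(toy_submission, out_file, row.names = FALSE)
written <- read.csv(out_file)
names(written)
```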
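These are my results for the project (SalePrice is on the log scale). (1) RMSE and R2 (>.75) on the train set: 0.1678 (residual standard error), 0.8234 (adjusted R2; multiple R2 is 0.8268) (2) Estimated RMSE and R2 on the test set, from the 30% validation fold: 0.1651863, 0.826656 (3) Kaggle score (returned log RMSE) and rank: 0.17077, 3098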
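The values 0.1678 and 0.8234 correspond to the residual standard error and adjusted R-squared lines of the final `summary()` output. These are close to, but not the same as, the plain in-sample RMSE and R-squared. The following is a minimal sketch of the difference on simulated data; `x` and `y` are made-up variables.

```r
# Simulated data (illustrative only)
set.seed(7)
n   <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x, data = sim)
s   <- summary(fit)

rse     <- s$sigma                       # sqrt(RSS / residual degrees of freedom)
rmse_in <- sqrt(mean(residuals(fit)^2))  # sqrt(RSS / n)
r2      <- s$r.squared
adj_r2  <- s$adj.r.squared

# The two error measures differ only by the degrees-of-freedom factor
all.equal(rmse_in, rse * sqrt(df.residual(fit) / n))
```

With 1460 rows and 29 coefficients, as in the final model, the two error measures differ by roughly one percent, so the conclusions are unchanged either way.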