Housing data prediction

For this problem, we are going to use the housing dataset from this Kaggle competition. The dataset has 80 variables for each house, describing the lot shape, size, neighborhood, number of rooms, number of bathrooms, etc. The goal is to predict the house sale price from these variables.

Download and unzip the attached data. It consists of two files: (1) housing.csv, which is the dataset, and (2) data_description.txt, which describes the variables in the data. Read this description to understand what each variable means.

Machine Learning Assignment 5: Hands-on with Regularization and Tree-based Ensemble

Section 1. Data Cleaning

# Import all the relevant libraries

library(tm)
library(gmodels)
library(Matrix)
library(qdap)
library(keras)
library(tensorflow)
library(readr)
library(tfruns)
library(ggplot2)
library(tidyr)
library(dplyr)
library(corrplot)
library(caret)
library(neuralnet)
library(GGally)
library(glmnet)
library(RANN)
library(mltools)
library(data.table)

Remove the first column, as it is a unique identifier and is not used in predicting the house sale price.

# Read the dataset (adjust the path to wherever housing.csv was unzipped)
housing <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/housing.csv", header = TRUE, sep = ",")

# Remove the first column
housing <- housing[-1]

dim(housing)
## [1] 1460   80

1. Take a summary of the data and explore the result. How many categorical and numerical variables are there in the dataset?

# Take a summary of the data
summary(housing)
##    MSSubClass      MSZoning          LotFrontage        LotArea      
##  Min.   : 20.0   Length:1460        Min.   : 21.00   Min.   :  1300  
##  1st Qu.: 20.0   Class :character   1st Qu.: 59.00   1st Qu.:  7554  
##  Median : 50.0   Mode  :character   Median : 69.00   Median :  9478  
##  Mean   : 56.9                      Mean   : 70.05   Mean   : 10517  
##  3rd Qu.: 70.0                      3rd Qu.: 80.00   3rd Qu.: 11602  
##  Max.   :190.0                      Max.   :313.00   Max.   :215245  
##                                     NA's   :259                      
##     Street             Alley             LotShape         LandContour       
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   Utilities          LotConfig          LandSlope         Neighborhood      
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   Condition1         Condition2          BldgType          HouseStyle       
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   OverallQual      OverallCond      YearBuilt     YearRemodAdd 
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967  
##  Median : 6.000   Median :5.000   Median :1973   Median :1994  
##  Mean   : 6.099   Mean   :5.575   Mean   :1971   Mean   :1985  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##   RoofStyle           RoofMatl         Exterior1st        Exterior2nd       
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   MasVnrType          MasVnrArea      ExterQual          ExterCond        
##  Length:1460        Min.   :   0.0   Length:1460        Length:1460       
##  Class :character   1st Qu.:   0.0   Class :character   Class :character  
##  Mode  :character   Median :   0.0   Mode  :character   Mode  :character  
##                     Mean   : 103.7                                        
##                     3rd Qu.: 166.0                                        
##                     Max.   :1600.0                                        
##                     NA's   :8                                             
##   Foundation          BsmtQual           BsmtCond         BsmtExposure      
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtFinType1         BsmtFinSF1     BsmtFinType2         BsmtFinSF2     
##  Length:1460        Min.   :   0.0   Length:1460        Min.   :   0.00  
##  Class :character   1st Qu.:   0.0   Class :character   1st Qu.:   0.00  
##  Mode  :character   Median : 383.5   Mode  :character   Median :   0.00  
##                     Mean   : 443.6                      Mean   :  46.55  
##                     3rd Qu.: 712.2                      3rd Qu.:   0.00  
##                     Max.   :5644.0                      Max.   :1474.00  
##                                                                          
##    BsmtUnfSF       TotalBsmtSF       Heating           HeatingQC        
##  Min.   :   0.0   Min.   :   0.0   Length:1460        Length:1460       
##  1st Qu.: 223.0   1st Qu.: 795.8   Class :character   Class :character  
##  Median : 477.5   Median : 991.5   Mode  :character   Mode  :character  
##  Mean   : 567.2   Mean   :1057.4                                        
##  3rd Qu.: 808.0   3rd Qu.:1298.2                                        
##  Max.   :2336.0   Max.   :6110.0                                        
##                                                                         
##   CentralAir         Electrical          X1stFlrSF      X2ndFlrSF   
##  Length:1460        Length:1460        Min.   : 334   Min.   :   0  
##  Class :character   Class :character   1st Qu.: 882   1st Qu.:   0  
##  Mode  :character   Mode  :character   Median :1087   Median :   0  
##                                        Mean   :1163   Mean   : 347  
##                                        3rd Qu.:1391   3rd Qu.: 728  
##                                        Max.   :4692   Max.   :2065  
##                                                                     
##   LowQualFinSF       GrLivArea     BsmtFullBath     BsmtHalfBath    
##  Min.   :  0.000   Min.   : 334   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :  0.000   Median :1464   Median :0.0000   Median :0.00000  
##  Mean   :  5.845   Mean   :1515   Mean   :0.4253   Mean   :0.05753  
##  3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :572.000   Max.   :5642   Max.   :3.0000   Max.   :2.00000  
##                                                                     
##     FullBath        HalfBath       BedroomAbvGr    KitchenAbvGr  
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.000   Median :0.0000   Median :3.000   Median :1.000  
##  Mean   :1.565   Mean   :0.3829   Mean   :2.866   Mean   :1.047  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :2.0000   Max.   :8.000   Max.   :3.000  
##                                                                  
##  KitchenQual         TotRmsAbvGrd     Functional          Fireplaces   
##  Length:1460        Min.   : 2.000   Length:1460        Min.   :0.000  
##  Class :character   1st Qu.: 5.000   Class :character   1st Qu.:0.000  
##  Mode  :character   Median : 6.000   Mode  :character   Median :1.000  
##                     Mean   : 6.518                      Mean   :0.613  
##                     3rd Qu.: 7.000                      3rd Qu.:1.000  
##                     Max.   :14.000                      Max.   :3.000  
##                                                                        
##  FireplaceQu         GarageType         GarageYrBlt   GarageFinish      
##  Length:1460        Length:1460        Min.   :1900   Length:1460       
##  Class :character   Class :character   1st Qu.:1961   Class :character  
##  Mode  :character   Mode  :character   Median :1980   Mode  :character  
##                                        Mean   :1979                     
##                                        3rd Qu.:2002                     
##                                        Max.   :2010                     
##                                        NA's   :81                       
##    GarageCars      GarageArea      GarageQual         GarageCond       
##  Min.   :0.000   Min.   :   0.0   Length:1460        Length:1460       
##  1st Qu.:1.000   1st Qu.: 334.5   Class :character   Class :character  
##  Median :2.000   Median : 480.0   Mode  :character   Mode  :character  
##  Mean   :1.767   Mean   : 473.0                                        
##  3rd Qu.:2.000   3rd Qu.: 576.0                                        
##  Max.   :4.000   Max.   :1418.0                                        
##                                                                        
##   PavedDrive          WoodDeckSF      OpenPorchSF     EnclosedPorch   
##  Length:1460        Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  Class :character   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Median :  0.00   Median : 25.00   Median :  0.00  
##                     Mean   : 94.24   Mean   : 46.66   Mean   : 21.95  
##                     3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00  
##                     Max.   :857.00   Max.   :547.00   Max.   :552.00  
##                                                                       
##    X3SsnPorch      ScreenPorch        PoolArea          PoolQC         
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Length:1460       
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Class :character  
##  Median :  0.00   Median :  0.00   Median :  0.000   Mode  :character  
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759                     
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000                     
##  Max.   :508.00   Max.   :480.00   Max.   :738.000                     
##                                                                        
##     Fence           MiscFeature           MiscVal             MoSold      
##  Length:1460        Length:1460        Min.   :    0.00   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.:    0.00   1st Qu.: 5.000  
##  Mode  :character   Mode  :character   Median :    0.00   Median : 6.000  
##                                        Mean   :   43.49   Mean   : 6.322  
##                                        3rd Qu.:    0.00   3rd Qu.: 8.000  
##                                        Max.   :15500.00   Max.   :12.000  
##                                                                           
##      YrSold       SaleType         SaleCondition        SalePrice     
##  Min.   :2006   Length:1460        Length:1460        Min.   : 34900  
##  1st Qu.:2007   Class :character   Class :character   1st Qu.:129975  
##  Median :2008   Mode  :character   Mode  :character   Median :163000  
##  Mean   :2008                                         Mean   :180921  
##  3rd Qu.:2009                                         3rd Qu.:214000  
##  Max.   :2010                                         Max.   :755000  
## 
# Count the number of categorical and numeric variables
categorical_variable <- sapply(housing, is.character)
numerical_variables <- sapply(housing, is.numeric)
count_categorical <- sum(categorical_variable)
count_numerical <- sum(numerical_variables)
cat("categorical variables: ",count_categorical,"\n")
## categorical variables:  43
cat("numerical variables: ", count_numerical)
## numerical variables:  37
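
A column's storage type does not always match its meaning: data_description.txt describes MSSubClass as a code for the type of dwelling, yet it is stored as a number and is therefore counted as numeric above. As a small sketch on a made-up data frame (the `toy` object and its values are illustrative and not taken from housing.csv), the storage class of every column can be tabulated in one call:

```r
# Toy data frame (hypothetical values) mixing character and numeric columns
toy <- data.frame(
  MSSubClass = c(20, 60, 20),           # numeric code that stands for a category
  MSZoning   = c("RL", "RM", "RL"),
  LotArea    = c(8450, 9600, 11250),
  stringsAsFactors = FALSE
)

# Storage class of each column, then a count per class
col_classes <- sapply(toy, class)
type_counts <- table(col_classes)
print(type_counts)
```

Whether such coded columns should be converted to factors before modeling is a judgment call that depends on the variable descriptions.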

2. (1pt) Which columns have missing values and what percentage of those columns have NAs?

# Number of NAs in each column
data.frame(Missing_Values = colSums(is.na(housing)))
##               Missing_Values
## MSSubClass                 0
## MSZoning                   0
## LotFrontage              259
## LotArea                    0
## Street                     0
## Alley                   1369
## LotShape                   0
## LandContour                0
## Utilities                  0
## LotConfig                  0
## LandSlope                  0
## Neighborhood               0
## Condition1                 0
## Condition2                 0
## BldgType                   0
## HouseStyle                 0
## OverallQual                0
## OverallCond                0
## YearBuilt                  0
## YearRemodAdd               0
## RoofStyle                  0
## RoofMatl                   0
## Exterior1st                0
## Exterior2nd                0
## MasVnrType                 8
## MasVnrArea                 8
## ExterQual                  0
## ExterCond                  0
## Foundation                 0
## BsmtQual                  37
## BsmtCond                  37
## BsmtExposure              38
## BsmtFinType1              37
## BsmtFinSF1                 0
## BsmtFinType2              38
## BsmtFinSF2                 0
## BsmtUnfSF                  0
## TotalBsmtSF                0
## Heating                    0
## HeatingQC                  0
## CentralAir                 0
## Electrical                 1
## X1stFlrSF                  0
## X2ndFlrSF                  0
## LowQualFinSF               0
## GrLivArea                  0
## BsmtFullBath               0
## BsmtHalfBath               0
## FullBath                   0
## HalfBath                   0
## BedroomAbvGr               0
## KitchenAbvGr               0
## KitchenQual                0
## TotRmsAbvGrd               0
## Functional                 0
## Fireplaces                 0
## FireplaceQu              690
## GarageType                81
## GarageYrBlt               81
## GarageFinish              81
## GarageCars                 0
## GarageArea                 0
## GarageQual                81
## GarageCond                81
## PavedDrive                 0
## WoodDeckSF                 0
## OpenPorchSF                0
## EnclosedPorch              0
## X3SsnPorch                 0
## ScreenPorch                0
## PoolArea                   0
## PoolQC                  1453
## Fence                   1179
## MiscFeature             1406
## MiscVal                    0
## MoSold                     0
## YrSold                     0
## SaleType                   0
## SaleCondition              0
## SalePrice                  0
# Percentage of NAs in each column, rounded to one decimal place
data.frame("Missing_Values(%)" = round(colMeans(is.na(housing)) * 100, 1), check.names = FALSE)
##               Missing_Values(%)
## MSSubClass                  0.0
## MSZoning                    0.0
## LotFrontage                17.7
## LotArea                     0.0
## Street                      0.0
## Alley                      93.8
## LotShape                    0.0
## LandContour                 0.0
## Utilities                   0.0
## LotConfig                   0.0
## LandSlope                   0.0
## Neighborhood                0.0
## Condition1                  0.0
## Condition2                  0.0
## BldgType                    0.0
## HouseStyle                  0.0
## OverallQual                 0.0
## OverallCond                 0.0
## YearBuilt                   0.0
## YearRemodAdd                0.0
## RoofStyle                   0.0
## RoofMatl                    0.0
## Exterior1st                 0.0
## Exterior2nd                 0.0
## MasVnrType                  0.5
## MasVnrArea                  0.5
## ExterQual                   0.0
## ExterCond                   0.0
## Foundation                  0.0
## BsmtQual                    2.5
## BsmtCond                    2.5
## BsmtExposure                2.6
## BsmtFinType1                2.5
## BsmtFinSF1                  0.0
## BsmtFinType2                2.6
## BsmtFinSF2                  0.0
## BsmtUnfSF                   0.0
## TotalBsmtSF                 0.0
## Heating                     0.0
## HeatingQC                   0.0
## CentralAir                  0.0
## Electrical                  0.1
## X1stFlrSF                   0.0
## X2ndFlrSF                   0.0
## LowQualFinSF                0.0
## GrLivArea                   0.0
## BsmtFullBath                0.0
## BsmtHalfBath                0.0
## FullBath                    0.0
## HalfBath                    0.0
## BedroomAbvGr                0.0
## KitchenAbvGr                0.0
## KitchenQual                 0.0
## TotRmsAbvGrd                0.0
## Functional                  0.0
## Fireplaces                  0.0
## FireplaceQu                47.3
## GarageType                  5.5
## GarageYrBlt                 5.5
## GarageFinish                5.5
## GarageCars                  0.0
## GarageArea                  0.0
## GarageQual                  5.5
## GarageCond                  5.5
## PavedDrive                  0.0
## WoodDeckSF                  0.0
## OpenPorchSF                 0.0
## EnclosedPorch               0.0
## X3SsnPorch                  0.0
## ScreenPorch                 0.0
## PoolArea                    0.0
## PoolQC                     99.5
## Fence                      80.8
## MiscFeature                96.3
## MiscVal                     0.0
## MoSold                      0.0
## YrSold                      0.0
## SaleType                    0.0
## SaleCondition               0.0
## SalePrice                   0.0
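
The full tables list all 80 columns, most of which have no missing values. A more compact answer keeps only the columns that contain at least one NA and sorts them by missing share. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical); the same two lines should apply to `housing` unchanged:

```r
# Toy data frame (hypothetical values) with NAs in two of its three columns
toy <- data.frame(
  LotFrontage = c(65, NA, 80, NA),
  LotArea     = c(8450, 9600, 11250, 9550),
  Alley       = c(NA, NA, NA, "Pave"),
  stringsAsFactors = FALSE
)

# Percentage of NAs per column
pct_missing <- round(colMeans(is.na(toy)) * 100, 1)

# Keep only columns with at least one NA, largest share first
pct_missing <- sort(pct_missing[pct_missing > 0], decreasing = TRUE)
missing_summary <- data.frame(Missing_pct = pct_missing)
print(missing_summary)
```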

3. (1pt) Read the data description carefully. For some of the variables, such as PoolQC, FireplaceQu, Fence, etc., NA means not applicable rather than missing at random. For instance, a house that does not have a pool gets NA for PoolQC. For those variables for which NA means not applicable, you can replace NA with zero (if that variable is numeric) or with a new category/level, for instance, “notApplicable” or “None” (if that variable is categorical).

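# Explicit examples named in the question
housing$PoolQC[is.na(housing$PoolQC)] <- "notApplicable"
housing$FireplaceQu[is.na(housing$FireplaceQu)] <- "notApplicable"
housing$Fence[is.na(housing$Fence)] <- "notApplicable"

df_hou <- housing

#df_hou <- df_hou %>% mutate_if(is.integer, ~replace(., is.na(.), 0))
# Replace NA in every remaining character column with "notApplicable".
# Caveat: this also fills MasVnrType (8 NAs) and Electrical (1 NA), for which
# data_description.txt does not list NA as a level, so those values are likely truly missing.
df_hou <- df_hou %>% mutate_if(is.character, ~replace(., is.na(.), "notApplicable"))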
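
The replacement above is applied to every character column. A narrower variant fills only the columns whose NA is documented as "not applicable" and leaves other NAs in place. The sketch below runs on a made-up data frame; the `not_applicable_cols` vector is an assumption based on one reading of data_description.txt and should be checked against that file before use:

```r
# Columns where NA is assumed to mean "feature not present" (verify against
# data_description.txt; this list is an assumption, not part of the assignment)
not_applicable_cols <- c(
  "Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
  "BsmtFinType2", "FireplaceQu", "GarageType", "GarageFinish",
  "GarageQual", "GarageCond", "PoolQC", "Fence", "MiscFeature"
)

# Toy data frame (hypothetical values)
toy <- data.frame(
  PoolQC     = c(NA, "Gd", NA),
  Electrical = c("SBrkr", NA, "SBrkr"),
  stringsAsFactors = FALSE
)

# Replace NA only in the listed columns that exist in the data
cols <- intersect(not_applicable_cols, names(toy))
toy[cols] <- lapply(toy[cols], function(x) replace(x, is.na(x), "notApplicable"))
print(toy)
```

With this approach the NA in Electrical survives and would still show up in a later missing-value count.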

4. (1pt) After replacing not applicable NAs with appropriate values, find out which columns (if any) still have NAs and what percentage of each column is missing.

# Number of NAs remaining in each column
data.frame(Missing_Values = colSums(is.na(df_hou)))
##               Missing_Values
## MSSubClass                 0
## MSZoning                   0
## LotFrontage              259
## LotArea                    0
## Street                     0
## Alley                      0
## LotShape                   0
## LandContour                0
## Utilities                  0
## LotConfig                  0
## LandSlope                  0
## Neighborhood               0
## Condition1                 0
## Condition2                 0
## BldgType                   0
## HouseStyle                 0
## OverallQual                0
## OverallCond                0
## YearBuilt                  0
## YearRemodAdd               0
## RoofStyle                  0
## RoofMatl                   0
## Exterior1st                0
## Exterior2nd                0
## MasVnrType                 0
## MasVnrArea                 8
## ExterQual                  0
## ExterCond                  0
## Foundation                 0
## BsmtQual                   0
## BsmtCond                   0
## BsmtExposure               0
## BsmtFinType1               0
## BsmtFinSF1                 0
## BsmtFinType2               0
## BsmtFinSF2                 0
## BsmtUnfSF                  0
## TotalBsmtSF                0
## Heating                    0
## HeatingQC                  0
## CentralAir                 0
## Electrical                 0
## X1stFlrSF                  0
## X2ndFlrSF                  0
## LowQualFinSF               0
## GrLivArea                  0
## BsmtFullBath               0
## BsmtHalfBath               0
## FullBath                   0
## HalfBath                   0
## BedroomAbvGr               0
## KitchenAbvGr               0
## KitchenQual                0
## TotRmsAbvGrd               0
## Functional                 0
## Fireplaces                 0
## FireplaceQu                0
## GarageType                 0
## GarageYrBlt               81
## GarageFinish               0
## GarageCars                 0
## GarageArea                 0
## GarageQual                 0
## GarageCond                 0
## PavedDrive                 0
## WoodDeckSF                 0
## OpenPorchSF                0
## EnclosedPorch              0
## X3SsnPorch                 0
## ScreenPorch                0
## PoolArea                   0
## PoolQC                     0
## Fence                      0
## MiscFeature                0
## MiscVal                    0
## MoSold                     0
## YrSold                     0
## SaleType                   0
## SaleCondition              0
## SalePrice                  0
# Percentage of NAs remaining in each column, rounded to one decimal place
data.frame("Missing_Values(%)" = round(colMeans(is.na(df_hou)) * 100, 1), check.names = FALSE)
##               Missing_Values(%)
## MSSubClass                  0.0
## MSZoning                    0.0
## LotFrontage                17.7
## LotArea                     0.0
## Street                      0.0
## Alley                       0.0
## LotShape                    0.0
## LandContour                 0.0
## Utilities                   0.0
## LotConfig                   0.0
## LandSlope                   0.0
## Neighborhood                0.0
## Condition1                  0.0
## Condition2                  0.0
## BldgType                    0.0
## HouseStyle                  0.0
## OverallQual                 0.0
## OverallCond                 0.0
## YearBuilt                   0.0
## YearRemodAdd                0.0
## RoofStyle                   0.0
## RoofMatl                    0.0
## Exterior1st                 0.0
## Exterior2nd                 0.0
## MasVnrType                  0.0
## MasVnrArea                  0.5
## ExterQual                   0.0
## ExterCond                   0.0
## Foundation                  0.0
## BsmtQual                    0.0
## BsmtCond                    0.0
## BsmtExposure                0.0
## BsmtFinType1                0.0
## BsmtFinSF1                  0.0
## BsmtFinType2                0.0
## BsmtFinSF2                  0.0
## BsmtUnfSF                   0.0
## TotalBsmtSF                 0.0
## Heating                     0.0
## HeatingQC                   0.0
## CentralAir                  0.0
## Electrical                  0.0
## X1stFlrSF                   0.0
## X2ndFlrSF                   0.0
## LowQualFinSF                0.0
## GrLivArea                   0.0
## BsmtFullBath                0.0
## BsmtHalfBath                0.0
## FullBath                    0.0
## HalfBath                    0.0
## BedroomAbvGr                0.0
## KitchenAbvGr                0.0
## KitchenQual                 0.0
## TotRmsAbvGrd                0.0
## Functional                  0.0
## Fireplaces                  0.0
## FireplaceQu                 0.0
## GarageType                  0.0
## GarageYrBlt                 5.5
## GarageFinish                0.0
## GarageCars                  0.0
## GarageArea                  0.0
## GarageQual                  0.0
## GarageCond                  0.0
## PavedDrive                  0.0
## WoodDeckSF                  0.0
## OpenPorchSF                 0.0
## EnclosedPorch               0.0
## X3SsnPorch                  0.0
## ScreenPorch                 0.0
## PoolArea                    0.0
## PoolQC                      0.0
## Fence                       0.0
## MiscFeature                 0.0
## MiscVal                     0.0
## MoSold                      0.0
## YrSold                      0.0
## SaleType                    0.0
## SaleCondition               0.0
## SalePrice                   0.0

5. (1pt) What percentage of rows in the dataset have one or more missing values?

# Calculate the percentage of rows with missing values
# (complete.cases() is TRUE for rows without any NA)
missing_rows <- sum(!complete.cases(df_hou))
percent_missing_rows <- round(missing_rows / nrow(df_hou) * 100, 2)

# Print the result
cat("Percentage of rows with missing values:", percent_missing_rows, "%")
## Percentage of rows with missing values: 23.22 %

Section 2. Data Exploration

6. (1pt) Plot the histogram of SalePrice. Interpret the histogram. Is the SalePrice variable skewed?

# Plot the histogram of SalePrice
hist(df_hou$SalePrice, main = "Histogram of SalePrice", xlab = "Sale Price", col = "steelblue")

The histogram of SalePrice is not symmetric: it is skewed to the right. The middle half of the houses sold for between about 130,000 and 214,000 (the first and third quartiles in the summary), while the tail extends toward much higher prices, up to the maximum of 755,000. The mean (180,921) exceeding the median (163,000) is consistent with this right skew. The few very expensive houses in the tail are potential outliers.
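
The visual impression can be backed by a number. Base R has no skewness function, so the sketch below defines a small hypothetical helper, `skewness()`, using the moment-based formula (third central moment divided by the cubed standard deviation, without small-sample correction). The `prices` vector is made up for illustration and is not taken from housing.csv:

```r
# Moment-based sample skewness (hypothetical helper; packages may apply
# small-sample corrections and give slightly different values)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Made-up, right-skewed prices
prices <- c(100, 110, 120, 130, 140, 150, 400, 900)

sk <- skewness(prices)
print(sk)                              # positive for a right-skewed sample
print(mean(prices) > median(prices))   # mean pulled above the median by the tail
```

A value near zero suggests a roughly symmetric distribution, while a clearly positive value matches a right tail like the one in the histogram.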

7. (3 pt) Use appropriate plots and test statistics to find out which variables are associated with SalePrice. Remove variables that show no association or a very weak association with SalePrice. Note: the t-test or oneway.test for some variables may throw an error if there are not enough observations in a group. You can ignore this error at this point. Later we will find and eliminate groups with little variance.
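
As a rough sketch of the kinds of tests the question refers to, the example below uses synthetic data (the `toy` data frame, its coefficients, and the seed are all made up). It computes a Pearson correlation for a numeric predictor, skipping incomplete pairs, and runs `oneway.test` for a categorical predictor:

```r
set.seed(1)  # reproducible synthetic data, not taken from housing.csv
n <- 60
toy <- data.frame(
  GrLivArea  = runif(n, 800, 3000),
  CentralAir = sample(c("Y", "N"), n, replace = TRUE)
)
toy$SalePrice <- 50000 + 80 * toy$GrLivArea + rnorm(n, sd = 20000)
toy$GrLivArea[c(3, 10)] <- NA   # simulate a few missing values

# Numeric predictor: Pearson correlation using only complete pairs
r <- cor(toy$GrLivArea, toy$SalePrice, use = "pairwise.complete.obs")
print(r)

# Categorical predictor: Welch one-way test of mean SalePrice across groups
ow <- oneway.test(SalePrice ~ CentralAir, data = toy)
print(ow$p.value)
```

A correlation close to zero, or a large p-value from the one-way test, would mark a variable as a candidate for removal; the cutoffs are a modeling choice rather than a fixed rule.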

Correlation table

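# Keep only the numeric columns
numeric_values <- df_hou %>% 
  dplyr::select_if(is.numeric)
# Correlation matrix for all numeric variables. LotFrontage, MasVnrArea and
# GarageYrBlt still contain NAs, so their correlations are reported as NA;
# cor(numeric_values, use = "pairwise.complete.obs") would compute them from the available pairs.
as.data.frame(round(cor(numeric_values), digits = 2))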
##               MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt
## MSSubClass          1.00          NA   -0.14        0.03       -0.06      0.03
## LotFrontage           NA           1      NA          NA          NA        NA
## LotArea            -0.14          NA    1.00        0.11       -0.01      0.01
## OverallQual         0.03          NA    0.11        1.00       -0.09      0.57
## OverallCond        -0.06          NA   -0.01       -0.09        1.00     -0.38
## YearBuilt           0.03          NA    0.01        0.57       -0.38      1.00
## YearRemodAdd        0.04          NA    0.01        0.55        0.07      0.59
## MasVnrArea            NA          NA      NA          NA          NA        NA
## BsmtFinSF1         -0.07          NA    0.21        0.24       -0.05      0.25
## BsmtFinSF2         -0.07          NA    0.11       -0.06        0.04     -0.05
## BsmtUnfSF          -0.14          NA    0.00        0.31       -0.14      0.15
## TotalBsmtSF        -0.24          NA    0.26        0.54       -0.17      0.39
## X1stFlrSF          -0.25          NA    0.30        0.48       -0.14      0.28
## X2ndFlrSF           0.31          NA    0.05        0.30        0.03      0.01
## LowQualFinSF        0.05          NA    0.00       -0.03        0.03     -0.18
## GrLivArea           0.07          NA    0.26        0.59       -0.08      0.20
## BsmtFullBath        0.00          NA    0.16        0.11       -0.05      0.19
## BsmtHalfBath        0.00          NA    0.05       -0.04        0.12     -0.04
## FullBath            0.13          NA    0.13        0.55       -0.19      0.47
## HalfBath            0.18          NA    0.01        0.27       -0.06      0.24
## BedroomAbvGr       -0.02          NA    0.12        0.10        0.01     -0.07
## KitchenAbvGr        0.28          NA   -0.02       -0.18       -0.09     -0.17
## TotRmsAbvGrd        0.04          NA    0.19        0.43       -0.06      0.10
## Fireplaces         -0.05          NA    0.27        0.40       -0.02      0.15
## GarageYrBlt           NA          NA      NA          NA          NA        NA
## GarageCars         -0.04          NA    0.15        0.60       -0.19      0.54
## GarageArea         -0.10          NA    0.18        0.56       -0.15      0.48
## WoodDeckSF         -0.01          NA    0.17        0.24        0.00      0.22
## OpenPorchSF        -0.01          NA    0.08        0.31       -0.03      0.19
## EnclosedPorch      -0.01          NA   -0.02       -0.11        0.07     -0.39
## X3SsnPorch         -0.04          NA    0.02        0.03        0.03      0.03
## ScreenPorch        -0.03          NA    0.04        0.06        0.05     -0.05
## PoolArea            0.01          NA    0.08        0.07        0.00      0.00
## MiscVal            -0.01          NA    0.04       -0.03        0.07     -0.03
## MoSold             -0.01          NA    0.00        0.07        0.00      0.01
## YrSold             -0.02          NA   -0.01       -0.03        0.04     -0.01
## SalePrice          -0.08          NA    0.26        0.79       -0.08      0.52
##               YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## MSSubClass            0.04         NA      -0.07      -0.07     -0.14
## LotFrontage             NA         NA         NA         NA        NA
## LotArea               0.01         NA       0.21       0.11      0.00
## OverallQual           0.55         NA       0.24      -0.06      0.31
## OverallCond           0.07         NA      -0.05       0.04     -0.14
## YearBuilt             0.59         NA       0.25      -0.05      0.15
## YearRemodAdd          1.00         NA       0.13      -0.07      0.18
## MasVnrArea              NA          1         NA         NA        NA
## BsmtFinSF1            0.13         NA       1.00      -0.05     -0.50
## BsmtFinSF2           -0.07         NA      -0.05       1.00     -0.21
## BsmtUnfSF             0.18         NA      -0.50      -0.21      1.00
## TotalBsmtSF           0.29         NA       0.52       0.10      0.42
## X1stFlrSF             0.24         NA       0.45       0.10      0.32
## X2ndFlrSF             0.14         NA      -0.14      -0.10      0.00
## LowQualFinSF         -0.06         NA      -0.06       0.01      0.03
## GrLivArea             0.29         NA       0.21      -0.01      0.24
## BsmtFullBath          0.12         NA       0.65       0.16     -0.42
## BsmtHalfBath         -0.01         NA       0.07       0.07     -0.10
## FullBath              0.44         NA       0.06      -0.08      0.29
## HalfBath              0.18         NA       0.00      -0.03     -0.04
## BedroomAbvGr         -0.04         NA      -0.11      -0.02      0.17
## KitchenAbvGr         -0.15         NA      -0.08      -0.04      0.03
## TotRmsAbvGrd          0.19         NA       0.04      -0.04      0.25
## Fireplaces            0.11         NA       0.26       0.05      0.05
## GarageYrBlt             NA         NA         NA         NA        NA
## GarageCars            0.42         NA       0.22      -0.04      0.21
## GarageArea            0.37         NA       0.30      -0.02      0.18
## WoodDeckSF            0.21         NA       0.20       0.07     -0.01
## OpenPorchSF           0.23         NA       0.11       0.00      0.13
## EnclosedPorch        -0.19         NA      -0.10       0.04      0.00
## X3SsnPorch            0.05         NA       0.03      -0.03      0.02
## ScreenPorch          -0.04         NA       0.06       0.09     -0.01
## PoolArea              0.01         NA       0.14       0.04     -0.04
## MiscVal              -0.01         NA       0.00       0.00     -0.02
## MoSold                0.02         NA      -0.02      -0.02      0.03
## YrSold                0.04         NA       0.01       0.03     -0.04
## SalePrice             0.51         NA       0.39      -0.01      0.21
##               TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## MSSubClass          -0.24     -0.25      0.31         0.05      0.07
## LotFrontage            NA        NA        NA           NA        NA
## LotArea              0.26      0.30      0.05         0.00      0.26
## OverallQual          0.54      0.48      0.30        -0.03      0.59
## OverallCond         -0.17     -0.14      0.03         0.03     -0.08
## YearBuilt            0.39      0.28      0.01        -0.18      0.20
## YearRemodAdd         0.29      0.24      0.14        -0.06      0.29
## MasVnrArea             NA        NA        NA           NA        NA
## BsmtFinSF1           0.52      0.45     -0.14        -0.06      0.21
## BsmtFinSF2           0.10      0.10     -0.10         0.01     -0.01
## BsmtUnfSF            0.42      0.32      0.00         0.03      0.24
## TotalBsmtSF          1.00      0.82     -0.17        -0.03      0.45
## X1stFlrSF            0.82      1.00     -0.20        -0.01      0.57
## X2ndFlrSF           -0.17     -0.20      1.00         0.06      0.69
## LowQualFinSF        -0.03     -0.01      0.06         1.00      0.13
## GrLivArea            0.45      0.57      0.69         0.13      1.00
## BsmtFullBath         0.31      0.24     -0.17        -0.05      0.03
## BsmtHalfBath         0.00      0.00     -0.02        -0.01     -0.02
## FullBath             0.32      0.38      0.42         0.00      0.63
## HalfBath            -0.05     -0.12      0.61        -0.03      0.42
## BedroomAbvGr         0.05      0.13      0.50         0.11      0.52
## KitchenAbvGr        -0.07      0.07      0.06         0.01      0.10
## TotRmsAbvGrd         0.29      0.41      0.62         0.13      0.83
## Fireplaces           0.34      0.41      0.19        -0.02      0.46
## GarageYrBlt            NA        NA        NA           NA        NA
## GarageCars           0.43      0.44      0.18        -0.09      0.47
## GarageArea           0.49      0.49      0.14        -0.07      0.47
## WoodDeckSF           0.23      0.24      0.09        -0.03      0.25
## OpenPorchSF          0.25      0.21      0.21         0.02      0.33
## EnclosedPorch       -0.10     -0.07      0.06         0.06      0.01
## X3SsnPorch           0.04      0.06     -0.02         0.00      0.02
## ScreenPorch          0.08      0.09      0.04         0.03      0.10
## PoolArea             0.13      0.13      0.08         0.06      0.17
## MiscVal             -0.02     -0.02      0.02         0.00      0.00
## MoSold               0.01      0.03      0.04        -0.02      0.05
## YrSold              -0.01     -0.01     -0.03        -0.03     -0.04
## SalePrice            0.61      0.61      0.32        -0.03      0.71
##               BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## MSSubClass            0.00         0.00     0.13     0.18        -0.02
## LotFrontage             NA           NA       NA       NA           NA
## LotArea               0.16         0.05     0.13     0.01         0.12
## OverallQual           0.11        -0.04     0.55     0.27         0.10
## OverallCond          -0.05         0.12    -0.19    -0.06         0.01
## YearBuilt             0.19        -0.04     0.47     0.24        -0.07
## YearRemodAdd          0.12        -0.01     0.44     0.18        -0.04
## MasVnrArea              NA           NA       NA       NA           NA
## BsmtFinSF1            0.65         0.07     0.06     0.00        -0.11
## BsmtFinSF2            0.16         0.07    -0.08    -0.03        -0.02
## BsmtUnfSF            -0.42        -0.10     0.29    -0.04         0.17
## TotalBsmtSF           0.31         0.00     0.32    -0.05         0.05
## X1stFlrSF             0.24         0.00     0.38    -0.12         0.13
## X2ndFlrSF            -0.17        -0.02     0.42     0.61         0.50
## LowQualFinSF         -0.05        -0.01     0.00    -0.03         0.11
## GrLivArea             0.03        -0.02     0.63     0.42         0.52
## BsmtFullBath          1.00        -0.15    -0.06    -0.03        -0.15
## BsmtHalfBath         -0.15         1.00    -0.05    -0.01         0.05
## FullBath             -0.06        -0.05     1.00     0.14         0.36
## HalfBath             -0.03        -0.01     0.14     1.00         0.23
## BedroomAbvGr         -0.15         0.05     0.36     0.23         1.00
## KitchenAbvGr         -0.04        -0.04     0.13    -0.07         0.20
## TotRmsAbvGrd         -0.05        -0.02     0.55     0.34         0.68
## Fireplaces            0.14         0.03     0.24     0.20         0.11
## GarageYrBlt             NA           NA       NA       NA           NA
## GarageCars            0.13        -0.02     0.47     0.22         0.09
## GarageArea            0.18        -0.02     0.41     0.16         0.07
## WoodDeckSF            0.18         0.04     0.19     0.11         0.05
## OpenPorchSF           0.07        -0.03     0.26     0.20         0.09
## EnclosedPorch        -0.05        -0.01    -0.12    -0.10         0.04
## X3SsnPorch            0.00         0.04     0.04     0.00        -0.02
## ScreenPorch           0.02         0.03    -0.01     0.07         0.04
## PoolArea              0.07         0.02     0.05     0.02         0.07
## MiscVal              -0.02        -0.01    -0.01     0.00         0.01
## MoSold               -0.03         0.03     0.06    -0.01         0.05
## YrSold                0.07        -0.05    -0.02    -0.01        -0.04
## SalePrice             0.23        -0.02     0.56     0.28         0.17
##               KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## MSSubClass            0.28         0.04      -0.05          NA      -0.04
## LotFrontage             NA           NA         NA          NA         NA
## LotArea              -0.02         0.19       0.27          NA       0.15
## OverallQual          -0.18         0.43       0.40          NA       0.60
## OverallCond          -0.09        -0.06      -0.02          NA      -0.19
## YearBuilt            -0.17         0.10       0.15          NA       0.54
## YearRemodAdd         -0.15         0.19       0.11          NA       0.42
## MasVnrArea              NA           NA         NA          NA         NA
## BsmtFinSF1           -0.08         0.04       0.26          NA       0.22
## BsmtFinSF2           -0.04        -0.04       0.05          NA      -0.04
## BsmtUnfSF             0.03         0.25       0.05          NA       0.21
## TotalBsmtSF          -0.07         0.29       0.34          NA       0.43
## X1stFlrSF             0.07         0.41       0.41          NA       0.44
## X2ndFlrSF             0.06         0.62       0.19          NA       0.18
## LowQualFinSF          0.01         0.13      -0.02          NA      -0.09
## GrLivArea             0.10         0.83       0.46          NA       0.47
## BsmtFullBath         -0.04        -0.05       0.14          NA       0.13
## BsmtHalfBath         -0.04        -0.02       0.03          NA      -0.02
## FullBath              0.13         0.55       0.24          NA       0.47
## HalfBath             -0.07         0.34       0.20          NA       0.22
## BedroomAbvGr          0.20         0.68       0.11          NA       0.09
## KitchenAbvGr          1.00         0.26      -0.12          NA      -0.05
## TotRmsAbvGrd          0.26         1.00       0.33          NA       0.36
## Fireplaces           -0.12         0.33       1.00          NA       0.30
## GarageYrBlt             NA           NA         NA           1         NA
## GarageCars           -0.05         0.36       0.30          NA       1.00
## GarageArea           -0.06         0.34       0.27          NA       0.88
## WoodDeckSF           -0.09         0.17       0.20          NA       0.23
## OpenPorchSF          -0.07         0.23       0.17          NA       0.21
## EnclosedPorch         0.04         0.00      -0.02          NA      -0.15
## X3SsnPorch           -0.02        -0.01       0.01          NA       0.04
## ScreenPorch          -0.05         0.06       0.18          NA       0.05
## PoolArea             -0.01         0.08       0.10          NA       0.02
## MiscVal               0.06         0.02       0.00          NA      -0.04
## MoSold                0.03         0.04       0.05          NA       0.04
## YrSold                0.03        -0.03      -0.02          NA      -0.04
## SalePrice            -0.14         0.53       0.47          NA       0.64
##               GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## MSSubClass         -0.10      -0.01       -0.01         -0.01      -0.04
## LotFrontage           NA         NA          NA            NA         NA
## LotArea             0.18       0.17        0.08         -0.02       0.02
## OverallQual         0.56       0.24        0.31         -0.11       0.03
## OverallCond        -0.15       0.00       -0.03          0.07       0.03
## YearBuilt           0.48       0.22        0.19         -0.39       0.03
## YearRemodAdd        0.37       0.21        0.23         -0.19       0.05
## MasVnrArea            NA         NA          NA            NA         NA
## BsmtFinSF1          0.30       0.20        0.11         -0.10       0.03
## BsmtFinSF2         -0.02       0.07        0.00          0.04      -0.03
## BsmtUnfSF           0.18      -0.01        0.13          0.00       0.02
## TotalBsmtSF         0.49       0.23        0.25         -0.10       0.04
## X1stFlrSF           0.49       0.24        0.21         -0.07       0.06
## X2ndFlrSF           0.14       0.09        0.21          0.06      -0.02
## LowQualFinSF       -0.07      -0.03        0.02          0.06       0.00
## GrLivArea           0.47       0.25        0.33          0.01       0.02
## BsmtFullBath        0.18       0.18        0.07         -0.05       0.00
## BsmtHalfBath       -0.02       0.04       -0.03         -0.01       0.04
## FullBath            0.41       0.19        0.26         -0.12       0.04
## HalfBath            0.16       0.11        0.20         -0.10       0.00
## BedroomAbvGr        0.07       0.05        0.09          0.04      -0.02
## KitchenAbvGr       -0.06      -0.09       -0.07          0.04      -0.02
## TotRmsAbvGrd        0.34       0.17        0.23          0.00      -0.01
## Fireplaces          0.27       0.20        0.17         -0.02       0.01
## GarageYrBlt           NA         NA          NA            NA         NA
## GarageCars          0.88       0.23        0.21         -0.15       0.04
## GarageArea          1.00       0.22        0.24         -0.12       0.04
## WoodDeckSF          0.22       1.00        0.06         -0.13      -0.03
## OpenPorchSF         0.24       0.06        1.00         -0.09      -0.01
## EnclosedPorch      -0.12      -0.13       -0.09          1.00      -0.04
## X3SsnPorch          0.04      -0.03       -0.01         -0.04       1.00
## ScreenPorch         0.05      -0.07        0.07         -0.08      -0.03
## PoolArea            0.06       0.07        0.06          0.05      -0.01
## MiscVal            -0.03      -0.01       -0.02          0.02       0.00
## MoSold              0.03       0.02        0.07         -0.03       0.03
## YrSold             -0.03       0.02       -0.06         -0.01       0.02
## SalePrice           0.62       0.32        0.32         -0.13       0.04
##               ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
## MSSubClass          -0.03     0.01   -0.01  -0.01  -0.02     -0.08
## LotFrontage            NA       NA      NA     NA     NA        NA
## LotArea              0.04     0.08    0.04   0.00  -0.01      0.26
## OverallQual          0.06     0.07   -0.03   0.07  -0.03      0.79
## OverallCond          0.05     0.00    0.07   0.00   0.04     -0.08
## YearBuilt           -0.05     0.00   -0.03   0.01  -0.01      0.52
## YearRemodAdd        -0.04     0.01   -0.01   0.02   0.04      0.51
## MasVnrArea             NA       NA      NA     NA     NA        NA
## BsmtFinSF1           0.06     0.14    0.00  -0.02   0.01      0.39
## BsmtFinSF2           0.09     0.04    0.00  -0.02   0.03     -0.01
## BsmtUnfSF           -0.01    -0.04   -0.02   0.03  -0.04      0.21
## TotalBsmtSF          0.08     0.13   -0.02   0.01  -0.01      0.61
## X1stFlrSF            0.09     0.13   -0.02   0.03  -0.01      0.61
## X2ndFlrSF            0.04     0.08    0.02   0.04  -0.03      0.32
## LowQualFinSF         0.03     0.06    0.00  -0.02  -0.03     -0.03
## GrLivArea            0.10     0.17    0.00   0.05  -0.04      0.71
## BsmtFullBath         0.02     0.07   -0.02  -0.03   0.07      0.23
## BsmtHalfBath         0.03     0.02   -0.01   0.03  -0.05     -0.02
## FullBath            -0.01     0.05   -0.01   0.06  -0.02      0.56
## HalfBath             0.07     0.02    0.00  -0.01  -0.01      0.28
## BedroomAbvGr         0.04     0.07    0.01   0.05  -0.04      0.17
## KitchenAbvGr        -0.05    -0.01    0.06   0.03   0.03     -0.14
## TotRmsAbvGrd         0.06     0.08    0.02   0.04  -0.03      0.53
## Fireplaces           0.18     0.10    0.00   0.05  -0.02      0.47
## GarageYrBlt            NA       NA      NA     NA     NA        NA
## GarageCars           0.05     0.02   -0.04   0.04  -0.04      0.64
## GarageArea           0.05     0.06   -0.03   0.03  -0.03      0.62
## WoodDeckSF          -0.07     0.07   -0.01   0.02   0.02      0.32
## OpenPorchSF          0.07     0.06   -0.02   0.07  -0.06      0.32
## EnclosedPorch       -0.08     0.05    0.02  -0.03  -0.01     -0.13
## X3SsnPorch          -0.03    -0.01    0.00   0.03   0.02      0.04
## ScreenPorch          1.00     0.05    0.03   0.02   0.01      0.11
## PoolArea             0.05     1.00    0.03  -0.03  -0.06      0.09
## MiscVal              0.03     0.03    1.00  -0.01   0.00     -0.02
## MoSold               0.02    -0.03   -0.01   1.00  -0.15      0.05
## YrSold               0.01    -0.06    0.00  -0.15   1.00     -0.03
## SalePrice            0.11     0.09   -0.02   0.05  -0.03      1.00
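# Correlation of SalePrice with every other numeric variable, sorted in decreasing order.
# sort() drops NA entries, so LotFrontage, MasVnrArea and GarageYrBlt
# (which contain missing values) do not appear in the result.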
correlations <- cor(numeric_values)
correlations_with_saleprice <- correlations["SalePrice", ]
sorted_correlations_with_saleprice <- sort(correlations_with_saleprice, decreasing = TRUE)
as.data.frame(sorted_correlations_with_saleprice)
##               sorted_correlations_with_saleprice
## SalePrice                             1.00000000
## OverallQual                           0.79098160
## GrLivArea                             0.70862448
## GarageCars                            0.64040920
## GarageArea                            0.62343144
## TotalBsmtSF                           0.61358055
## X1stFlrSF                             0.60585218
## FullBath                              0.56066376
## TotRmsAbvGrd                          0.53372316
## YearBuilt                             0.52289733
## YearRemodAdd                          0.50710097
## Fireplaces                            0.46692884
## BsmtFinSF1                            0.38641981
## WoodDeckSF                            0.32441344
## X2ndFlrSF                             0.31933380
## OpenPorchSF                           0.31585623
## HalfBath                              0.28410768
## LotArea                               0.26384335
## BsmtFullBath                          0.22712223
## BsmtUnfSF                             0.21447911
## BedroomAbvGr                          0.16821315
## ScreenPorch                           0.11144657
## PoolArea                              0.09240355
## MoSold                                0.04643225
## X3SsnPorch                            0.04458367
## BsmtFinSF2                           -0.01137812
## BsmtHalfBath                         -0.01684415
## MiscVal                              -0.02118958
## LowQualFinSF                         -0.02560613
## YrSold                               -0.02892259
## OverallCond                          -0.07785589
## MSSubClass                           -0.08428414
## EnclosedPorch                        -0.12857796
## KitchenAbvGr                         -0.13590737
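
The NA rows and columns in the correlation matrix come from cor()'s default, use = "everything", which returns NA for any pair involving a column with missing values. A minimal sketch of one alternative, pairwise deletion, is shown below. The toy data frame and its values are hypothetical and are not taken from housing.csv. Imputing the missing values before calling cor() would be another option.

```r
# Toy data frame (hypothetical values, not from housing.csv)
toy <- data.frame(
  SalePrice   = c(100, 150, 200, 250, 300),
  LotFrontage = c(50, NA, 70, 80, 90),
  LotArea     = c(5000, 6000, 6500, 8000, 9000)
)

# Default behaviour: a single NA in a column makes its correlations NA
cor_default <- cor(toy)

# Pairwise deletion: each pair uses only the rows where both values are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(cor_default["SalePrice", ], 2))
print(round(cor_pairwise["SalePrice", ], 2))
```

With pairwise deletion each coefficient may be based on a different number of rows, so the coefficients are not strictly comparable with one another.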

Box Plot

# Find the categorical variables
#categorical_vars <- sapply(df_hou, function(x) is.character(x))
#cat("Categorical variables:\n")
#names(df_hou[categorical_vars])

predictors <- c("MSZoning", "Street","Alley", "LotShape", "LandContour", "Utilities",
                "LotConfig", "LandSlope", "Neighborhood", "Condition1", "Condition2", "BldgType",
                "HouseStyle", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType",
                "ExterQual",  "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure",
                "BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC", "CentralAir", "Electrical",
                "KitchenQual", "Functional", "FireplaceQu", "GarageType", "GarageFinish","GarageQual",
                "GarageCond", "PavedDrive", "PoolQC", "Fence", "MiscFeature", "SaleType",
                "SaleCondition"
                )
data <- df_hou[c(predictors, "SalePrice")]


# Create a box plot of SalePrice for each categorical predictor
for (p in predictors) {
  # reformulate() builds the formula SalePrice ~ <p> from the column name
  boxplot(reformulate(p, response = "SalePrice"), data = data,
          main = p, xlab = p, ylab = "SalePrice",
          col = "steelblue"
          )
}
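
A formula for each box plot can be built directly from the column name with base R's reformulate(). The sketch below shows this on a toy data frame and uses plot = FALSE to inspect the per-group statistics without drawing anything. The column values are hypothetical and only illustrate the mechanics.

```r
# Toy data (hypothetical values): one categorical predictor and the response
toy <- data.frame(
  MSZoning  = c("RL", "RL", "RL", "RM", "RM", "RM"),
  SalePrice = c(200000, 220000, 250000, 120000, 140000, 150000)
)

# Build the formula SalePrice ~ MSZoning from a character string
f <- reformulate("MSZoning", response = "SalePrice")
print(f)

# plot = FALSE returns the box plot statistics instead of drawing the plot
bp <- boxplot(f, data = toy, plot = FALSE)

# Row 3 of bp$stats holds the median of each group
group_medians <- setNames(bp$stats[3, ], bp$names)
print(group_medians)
```

Comparing the group medians this way gives a quick numeric check on which categorical variables separate SalePrice most clearly.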

t-test
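
Note that t.test(x, y) with two numeric vectors is a Welch two-sample test of whether the two means differ. Because SalePrice is measured in dollars while most predictors are counts, years, or square feet, the difference in means is large for every variable, which is why the t statistics in this subsection are all close to -87 regardless of how strongly a variable relates to SalePrice. If the goal is to test linear association, cor.test() may be a closer fit. A minimal sketch on hypothetical data (the values below are made up for illustration):

```r
# Toy data (hypothetical values): SalePrice rises roughly linearly with GrLivArea
toy <- data.frame(GrLivArea = c(900, 1200, 1500, 1800, 2100, 2400))
toy$SalePrice <- 100 * toy$GrLivArea + c(-5000, 3000, -2000, 4000, -1000, 2000)

# Welch two-sample t-test: asks only whether the two means differ
res_t <- t.test(toy$GrLivArea, toy$SalePrice)

# Pearson correlation test: asks whether the two variables are linearly associated
res_cor <- cor.test(toy$GrLivArea, toy$SalePrice)

print(res_t$estimate)    # the two sample means being compared
print(res_cor$estimate)  # the sample correlation coefficient
print(res_cor$p.value)   # p-value for H0: true correlation is 0
```

Applied to the housing data, the same call could be made inside the loop as cor.test(df_hou[[var]], df_hou$SalePrice), assuming the column has no remaining missing values or that dropping incomplete pairs is acceptable.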

continuous_variables <- names(select_if(df_hou, is.numeric))

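# Loop over the numeric variables and run a Welch two-sample t-test of each one against SalePrice.
# Note: t.test(x, y) compares the means of x and y; it does not measure association between them.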
for (var in continuous_variables) {
  print(paste0("T-test for association between SalePrice and ", var, ":"))
  print(t.test(df_hou[[var]], df_hou$SalePrice))
  print("_____________________________________________________________________")
}
## [1] "T-test for association between SalePrice and MSSubClass:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.991, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184942.7 -176785.9
## sample estimates:
##    mean of x    mean of y 
##     56.89726 180921.19589 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and LotFrontage:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.985, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184929.5 -176772.8
## sample estimates:
##    mean of x    mean of y 
##     70.04996 180921.19589 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and LotArea:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -81.321, df = 1505.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -174514.7 -166294.1
## sample estimates:
## mean of x mean of y 
##  10516.83 180921.20 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and OverallQual:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184993.5 -176836.7
## sample estimates:
##    mean of x    mean of y 
## 6.099315e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and OverallCond:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184994.0 -176837.3
## sample estimates:
##    mean of x    mean of y 
## 5.575342e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and YearBuilt:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.071, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183028.3 -174871.6
## sample estimates:
##  mean of x  mean of y 
##   1971.268 180921.196 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and YearRemodAdd:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.064, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183014.7 -174858.0
## sample estimates:
##  mean of x  mean of y 
##   1984.866 180921.196 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and MasVnrArea:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.969, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184895.9 -176739.1
## sample estimates:
##   mean of x   mean of y 
##    103.6853 180921.1959 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtFinSF1:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.804, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184556.0 -176399.1
## sample estimates:
##   mean of x   mean of y 
##    443.6397 180921.1959 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtFinSF2:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.996, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184953.0 -176796.3
## sample estimates:
##    mean of x    mean of y 
##     46.54932 180921.19589 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtUnfSF:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.745, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184432.4 -176275.5
## sample estimates:
##   mean of x   mean of y 
##    567.2404 180921.1959 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and TotalBsmtSF:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.509, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183942.2 -175785.3
## sample estimates:
##  mean of x  mean of y 
##   1057.429 180921.196 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and X1stFlrSF:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.459, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183837.0 -175680.2
## sample estimates:
##  mean of x  mean of y 
##   1162.627 180921.196 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and X2ndFlrSF:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.851, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184652.6 -176495.8
## sample estimates:
##   mean of x   mean of y 
##    346.9925 180921.1959 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and LowQualFinSF:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184993.7 -176837.0
## sample estimates:
##    mean of x    mean of y 
## 5.844521e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and GrLivArea:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.288, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183484.2 -175327.3
## sample estimates:
##  mean of x  mean of y 
##   1515.464 180921.196 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtFullBath:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.019, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184999.1 -176842.4
## sample estimates:
##    mean of x    mean of y 
## 4.253425e-01 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtHalfBath:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.019, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184999.5 -176842.8
## sample estimates:
##    mean of x    mean of y 
## 5.753425e-02 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and FullBath:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184998.0 -176841.3
## sample estimates:
##    mean of x    mean of y 
## 1.565068e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and HalfBath:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.019, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184999.2 -176842.5
## sample estimates:
##    mean of x    mean of y 
## 3.828767e-01 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BedroomAbvGr:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.017, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184996.7 -176840.0
## sample estimates:
##    mean of x    mean of y 
## 2.866438e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and KitchenAbvGr:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184998.5 -176841.8
## sample estimates:
##    mean of x    mean of y 
## 1.046575e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and TotRmsAbvGrd:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184993.0 -176836.3
## sample estimates:
##    mean of x    mean of y 
## 6.517808e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and Fireplaces:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184998.9 -176842.2
## sample estimates:
##    mean of x    mean of y 
## 6.130137e-01 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and GarageYrBlt:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.067, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183021.0 -174864.3
## sample estimates:
##  mean of x  mean of y 
##   1978.506 180921.196 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and GarageCars:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184997.8 -176841.1
## sample estimates:
##    mean of x    mean of y 
## 1.767123e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and GarageArea:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.791, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184526.6 -176369.8
## sample estimates:
##   mean of x   mean of y 
##    472.9801 180921.1959 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and WoodDeckSF:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.973, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184905.3 -176748.6
## sample estimates:
##    mean of x    mean of y 
##     94.24452 180921.19589 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and OpenPorchSF:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.996, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184952.9 -176796.2
## sample estimates:
##    mean of x    mean of y 
##     46.66027 180921.19589 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and EnclosedPorch:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.008, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184977.6 -176820.9
## sample estimates:
##    mean of x    mean of y 
##     21.95411 180921.19589 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and X3SsnPorch:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.017, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184996.1 -176839.4
## sample estimates:
##    mean of x    mean of y 
## 3.409589e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and ScreenPorch:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.012, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184984.5 -176827.8
## sample estimates:
##    mean of x    mean of y 
##     15.06096 180921.19589 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and PoolArea:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.017, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184996.8 -176840.1
## sample estimates:
##    mean of x    mean of y 
## 2.758904e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and MiscVal:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.996, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184956.1 -176799.3
## sample estimates:
##    mean of x    mean of y 
##     43.48904 180921.19589 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and MoSold:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184993.2 -176836.5
## sample estimates:
##    mean of x    mean of y 
## 6.321918e+00 1.809212e+05 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and YrSold:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = -86.053, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -182991.7 -174835.0
## sample estimates:
##  mean of x  mean of y 
##   2007.816 180921.196 
## 
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and SalePrice:"
## 
##  Welch Two Sample t-test
## 
## data:  df_hou[[var]] and df_hou$SalePrice
## t = 0, df = 2918, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5765.271  5765.271
## sample estimates:
## mean of x mean of y 
##  180921.2  180921.2 
## 
## [1] "_____________________________________________________________________"

Remove the columns in which more than 80% of the values are NA: Alley, PoolQC, Fence, and MiscFeature.

df_hou <- df_hou %>%
  select(-Alley, -PoolQC, -Fence, -MiscFeature)

EnclosedPorch and KitchenAbvGr have a negative correlation with SalePrice.

Based on the correlation coefficients, scatter plots, and hypothesis tests, we can identify variables that are not strongly associated with SalePrice. We can remove these variables from the dataset.

BedroomAbvGr 0.16821315
ScreenPorch 0.11144657
PoolArea 0.09240355
MoSold 0.04643225
X3SsnPorch 0.04458367
BsmtFinSF2 -0.01137812
BsmtHalfBath -0.01684415
MiscVal -0.02118958
LowQualFinSF -0.02560613
YrSold -0.02892259
OverallCond -0.07785589
MSSubClass -0.08428414

df_hou <- df_hou %>%
  select(-BedroomAbvGr, -ScreenPorch, -PoolArea, -MoSold, -X3SsnPorch, -BsmtFinSF2, -BsmtHalfBath,
         -MiscVal, -LowQualFinSF, -YrSold, -OverallCond, -MSSubClass)

Remove the categorical variables that show little difference in median across their levels:

df_hou <- df_hou %>%
  select(-Street, -Utilities, -LandSlope, -BldgType)

Dealing with Missing Values:

If a large percentage of rows still have missing values, omitting those rows will cause a loss of information, so we are going to impute the missing values instead. Several methods for missing data imputation exist. The simplest is to replace the missing values with the mean (or mode) of their column. Another method is to use the other columns to predict the column with missing values. This is called multiple imputation, and it assumes that a missing value in one attribute can be predicted from the data in the other attributes. The simplest form of multiple imputation is knn imputation. In knn imputation, we find the k complete data points closest to the data point with the missing value (the incomplete row) and take their average to fill in the missing value.
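The knn idea described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not what caret does internally: the helper name `knn_impute_column` and the toy data frame are hypothetical, and the predictor columns are assumed to contain no missing values.

```r
# Impute NAs in one numeric column from the k nearest complete rows.
# Hypothetical helper; assumes the predictor columns have no NAs.
knn_impute_column <- function(data, target, k = 5) {
  predictors <- setdiff(names(data), target)
  # Center and scale the predictors so no single feature dominates the distance
  scaled <- scale(data[, predictors, drop = FALSE])
  complete <- which(!is.na(data[[target]]))
  missing  <- which(is.na(data[[target]]))
  out <- data[[target]]
  for (i in missing) {
    # Euclidean distance from row i to every complete row
    diffs <- sweep(scaled[complete, , drop = FALSE], 2, scaled[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- complete[order(d)[seq_len(min(k, length(complete)))]]
    # Fill the gap with the average of the neighbours' values
    out[i] <- mean(data[[target]][nearest])
  }
  out
}

# Toy data (made up): the last row is missing its frontage
toy <- data.frame(
  area     = c(1000, 1100, 1200, 3000, 3100, 1050),
  rooms    = c(3, 3, 4, 8, 8, 3),
  frontage = c(50, 55, 60, 120, 130, NA)
)
toy$frontage <- knn_impute_column(toy, "frontage", k = 2)
toy$frontage[6]
```

With k = 2 the two closest complete rows are the first two small houses, so the missing frontage is filled with the average of their values.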

To avoid data leakage, the nearest neighbors must be computed based only on the training data instead of the entire dataset. In other words, to impute the missing values in training, validation, and test data we should only use the training data to find the nearest neighbors.

Data leakage is when information outside of the training data is used in creating a machine learning model. This causes the model to overfit and not generalize well to future data.
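A minimal sketch of leakage-free imputation, assuming a toy column and a hypothetical helper `stat_mode`: the fill value is learned from the training rows only and then reused unchanged on the test rows.

```r
# Most frequent non-missing value of a vector (hypothetical helper)
stat_mode <- function(x) {
  counts <- table(x, useNA = "no")
  names(counts)[which.max(counts)]
}

# Toy data (made up for illustration)
train <- data.frame(Electrical = c("SBrkr", "SBrkr", "FuseA", NA),
                    stringsAsFactors = FALSE)
test  <- data.frame(Electrical = c(NA, "FuseA", "FuseA"),
                    stringsAsFactors = FALSE)

# Learn the statistic from the training data only
train_mode <- stat_mode(train$Electrical)

# Apply the same training-based value to both sets
train$Electrical[is.na(train$Electrical)] <- train_mode
test$Electrical[is.na(test$Electrical)]   <- train_mode
```

The test set's own most frequent value is "FuseA", but its missing entry is still filled with the training mode, so nothing about the test data influences the preprocessing.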

When doing cross validation, the imputation must be computed based only on the training data in each fold (excluding the validation data). Luckily, caret’s train method in R streamlines this process. You can use the option preProc="knnImpute" to do knn imputation and the option na.action=na.pass to allow the NA values to be passed to the model to be imputed.

KnnImpute scales and centers the numeric features and uses Euclidean distance to compute the nearest neighbors. Since Euclidean distance is not meaningful for categorical features, it ignores categorical features if they are present in the dataset and does not impute them.

8. (2 pt) Examine the columns with missing values to see if any of them are categorical. Use caret’s createDataPartition method to partition the dataset into 80% training and 20% testing. If a categorical column has missing values in the train or test data, impute it with the mode of that column in the training data. It is important that the mode is computed from the training data only (instead of the entire dataset) to avoid data leakage.

set.seed(101)
# Create data partition
inTrain <- createDataPartition(df_hou$SalePrice, p = 0.8, list = FALSE)

# Create training and test sets
hou_train <- df_hou[inTrain, ]
hos_test <- df_hou[-inTrain, ]

dim(hou_train)
## [1] 1169   60
dim(hos_test)
## [1] 291  60

Imputation on categorical variables

# Identify categorical columns
categorical_cols <- sapply(hou_train, is.character)

# Remove SalePrice from the list of categorical columns
categorical_cols["SalePrice"] <- FALSE

# Get the names of the categorical columns
categorical_col_names <- names(hou_train)[categorical_cols]
# Loop through each categorical column
for (col in categorical_col_names) 
  {
  
  # Get the mode of the column in the training data
  mode_val <- hou_train %>% 
    select({{ col }}) %>% 
    summarise(mode = names(which.max(table(.))))
  
  # Replace values coded 'notApplicable' in the training data with the training mode
  hou_train[[col]] <- ifelse(hou_train[[col]] == 'notApplicable', mode_val$mode, hou_train[[col]])
  
  # Replace values coded 'notApplicable' in the test data with the same training mode
  hos_test[[col]] <- ifelse(hos_test[[col]] == 'notApplicable', mode_val$mode, hos_test[[col]])
}

Imputation on numerical variables

The numerical variables with NAs in the dataset are LotFrontage, MasVnrArea, and GarageYrBlt.

# # Identify numerical columns
# numerical_cols <- sapply(hou_train, is.numeric)
# 
# # Remove SalePrice from the list of numerical columns
# numerical_cols["SalePrice"] <- FALSE
# 
# # Get the names of the numerical columns
# numerical_col_names <- names(hou_train)[numerical_cols]

Eliminating variables with little to no variance:

This dataset has several variables with a handful of unique values that occur with very low frequency. For instance, if you take a summary of the “Street” variable, roughly 99.6% of the samples have a “Pave” street while only about 0.4% have “Grvl” (gravel). The concern here is that when we split the data for cross validation, the training samples may all have a paved street while some of the validation samples have a gravel street, which creates a discrepancy between the training and validation sets. In addition, a handful of unique values may have an undue influence on the model. Therefore, it is better to identify and eliminate these variables with near zero variance.

Luckily, caret’s train method has a preprocessing option preProc=“nzv” that can be used during cross validation to remove variables with zero or little variance in the training set.
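To make the idea of "near zero variance" concrete, here is a rough base R sketch of the two quantities such a filter typically looks at: the ratio of the most common value to the second most common, and the percentage of distinct values. The function name `nzv_metrics` is hypothetical, the cutoffs (95/5 and 10) mirror the defaults documented for caret's nearZeroVar, and the counts are illustrative values chosen to resemble the Street variable.

```r
# Flag a variable as near zero variance when one value dominates
# and there are very few distinct values (hypothetical helper).
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Frequency of the most common value relative to the runner-up
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the sample size
  pct_unique <- 100 * length(counts) / length(x)
  data.frame(freq_ratio = freq_ratio,
             pct_unique = pct_unique,
             nzv = freq_ratio > freq_cut && pct_unique < unique_cut)
}

# Illustrative counts: almost every house is on a paved street
street <- c(rep("Pave", 1454), rep("Grvl", 6))
nzv_metrics(street)

# A balanced variable is not flagged
balanced <- c(rep("a", 50), rep("b", 50))
nzv_metrics(balanced)
```

A variable is flagged only when both conditions hold, which keeps numeric variables with many distinct values from being dropped just because one value is common.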

Use preProc=c(“knnImpute”, “nzv”) inside caret’s train method in the models you create in the next section to eliminate variables with near zero variance and to impute missing values in numeric variables using knnImpute.

Section 3. Creating Predictive Models

After cleaning and exploring the data, we are ready to build our machine learning models to predict the SalePrice of a house based on the other variables. We are going to examine four categories of models: regularized linear regression, tree-based ensemble models, SVM, and neural networks with dropout.

Section 3.1 Creating Regularized Linear Regression Models

8. (2pt) Set.seed(1) and train a Lasso linear regression model using “glmnet” and “caret”, as explained in the lectures, to predict the SalePrice. Use 10-fold cross validation and tune the lambda parameter. Note: You do not need to worry about scaling your test or train data; glmnet will automatically do it for you.

Use preProc=c(”knnImpute”,”nzv”) and na.action=na.pass options inside the train method to let caret impute the missing values using knn based on the training data during cross validation.

# Define the pre-processing steps to be applied to the training data during cross-validation
preProc <- c("knnImpute", "nzv")
# For reproducibility
set.seed(1) 

hos_train2 <- hou_train
hos_train2[ ,c('Condition2','RoofMatl','Exterior1st','Exterior2nd','ExterCond','Heating')] <- list(NULL)

# Define the Lasso Linear Regression model using "glmnet"
lassoModel <- train(SalePrice ~ ., 
                    data = hos_train2, 
                    method = "glmnet", 
                    trControl = trainControl(method = "cv", number = 10), 
                    tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 0.1, by = 0.001)), 
                    preProcess = preProc, 
                    na.action = na.pass)

9. (1pt) Get the coefficients for the best tuned model. Did Lasso shrink some of the coefficients to zero? If so, what does this mean?

best_model <- lassoModel$finalModel
coef(best_model, s = best_model$lambdaOpt)
## 83 x 1 sparse Matrix of class "dgCMatrix"
##                                    s1
## (Intercept)              181213.23213
## MSZoningRL                 6037.80097
## MSZoningRM                  447.49950
## LotFrontage               -1228.45567
## LotArea                    3574.54378
## LotShapeReg               -1293.45747
## LandContourLvl             3487.77176
## LotConfigCulDSac           3411.20583
## LotConfigInside            -131.57185
## NeighborhoodCollgCr       -2484.45046
## NeighborhoodEdwards       -5759.72742
## NeighborhoodGilbert       -2377.47456
## NeighborhoodNAmes         -4253.77835
## NeighborhoodNridgHt        4123.65924
## NeighborhoodNWAmes        -1412.75848
## NeighborhoodOldTown       -3380.05338
## NeighborhoodSawyer        -2280.33770
## NeighborhoodSomerst        3472.04807
## Condition1Feedr            -429.60977
## Condition1Norm             5241.78741
## HouseStyle1Story           6708.47009
## HouseStyle2Story          -5071.92175
## OverallQual               17022.54966
## YearBuilt                   -33.18685
## YearRemodAdd               4371.81654
## RoofStyleGable             -333.78041
## RoofStyleHip               1560.21368
## MasVnrTypeBrkFace          3677.72839
## MasVnrTypeNone             8972.71159
## MasVnrTypeStone            3152.23818
## MasVnrArea                 5024.31085
## ExterQualGd               -5148.22751
## ExterQualTA               -8352.47975
## FoundationCBlock           2928.93992
## FoundationPConc            3977.75693
## BsmtQualGd                -9762.94474
## BsmtQualTA                -7774.04347
## BsmtCondTA                 1702.48507
## BsmtExposureGd             6122.86643
## BsmtExposureMn            -1099.44587
## BsmtExposureNo            -3201.57767
## BsmtFinType1BLQ              65.34630
## BsmtFinType1GLQ            1604.16899
## BsmtFinType1LwQ           -1686.59823
## BsmtFinType1Rec             291.21540
## BsmtFinType1Unf           -5006.57768
## BsmtFinSF1                -1533.27179
## BsmtFinType2Unf             442.59767
## BsmtUnfSF                  -405.87746
## TotalBsmtSF                 407.68124
## HeatingQCGd                -892.65534
## HeatingQCTA               -2819.04358
## CentralAirY                2661.95417
## ElectricalSBrkr            -134.01028
## X1stFlrSF                 19131.51275
## X2ndFlrSF                 27545.87423
## GrLivArea                     .      
## BsmtFullBath               2590.10393
## FullBath                   1197.65582
## HalfBath                   1702.62957
## KitchenQualGd             -9876.88075
## KitchenQualTA            -10097.00644
## TotRmsAbvGrd               1832.82578
## FunctionalTyp              4053.50463
## Fireplaces                 3709.68463
## FireplaceQuGd             -1178.40242
## FireplaceQunotApplicable    267.36778
## FireplaceQuTA              -449.41343
## GarageTypeAttchd           2710.89911
## GarageTypeBuiltIn          1438.39661
## GarageTypeDetchd           2427.77691
## GarageYrBlt               -3675.83613
## GarageFinishRFn           -1780.38079
## GarageFinishUnf           -1647.57985
## GarageCars                 8169.33093
## GarageArea                 1773.37407
## PavedDriveY                 769.25695
## WoodDeckSF                 2155.52155
## OpenPorchSF                -129.93880
## SaleTypeNew                9425.67731
## SaleTypeWD                  345.23932
## SaleConditionNormal        1053.83154
## SaleConditionPartial      -4631.84451

Yes. In the output above, GrLivArea is the only coefficient shrunk exactly to zero (shown as “.”), probably because it is largely redundant with X1stFlrSF and X2ndFlrSF. Lasso regularization shrinks coefficients towards zero, and in some cases it shrinks them all the way to zero. This means that the corresponding predictors are effectively removed from the model. Lasso can therefore be used for feature selection, as it identifies the most important predictors in the dataset by shrinking the coefficients of the less important ones towards zero.

If some coefficients are shrunk to zero, it means that the corresponding predictors are not contributing significantly to the model’s predictive power, and can be removed without significantly affecting the model’s performance. This can help to reduce overfitting and improve the interpretability of the model.
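Why Lasso can produce exact zeros while Ridge cannot is easiest to see in the textbook special case of an orthonormal design, where the Lasso solution is the soft-thresholded least-squares coefficient and the Ridge solution is a proportional shrink. The sketch below assumes that special case; the coefficient values are made up and the function name `soft_threshold` is only illustrative.

```r
# Soft-thresholding: move each coefficient towards zero by lambda,
# and set it to exactly zero if it is smaller than lambda in magnitude.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Made-up least-squares coefficients for illustration
ols_coefs <- c(OverallQual = 4.0, GarageCars = 1.5, MoSold = 0.2, YrSold = -0.3)
lambda <- 0.5

lasso_coefs <- soft_threshold(ols_coefs, lambda)  # small ones become exactly 0
ridge_coefs <- ols_coefs / (1 + lambda)           # all shrink, none reach 0

rbind(ols = ols_coefs, lasso = lasso_coefs, ridge = ridge_coefs)
```

The two weak predictors drop out of the Lasso fit entirely, whereas Ridge keeps every predictor with a smaller coefficient.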

10. (1pt) Get the predictions on the test data using the “predict” function with option na.action=na.pass. This will allow the NA values in the test data to be passed to the model and imputed using knn imputation based on the training data. Go ahead and run install.packages(“RANN”) to install the RANN package if you get an error stating that you need this package.

Compute the RMSE of the predictions.

So that the train and test data have the same features, the categorical columns with levels that occur in the test data but not in the training data (Condition2, RoofMatl, Exterior1st, Exterior2nd, ExterCond, and Heating) are dropped from both sets.

# Get the predictions on the test data using the "predict" function with the "na.action=na.pass" option
hos_test2 <- hos_test
hos_test2[ ,c('SalePrice', 'Condition2','RoofMatl','Exterior1st','Exterior2nd','ExterCond','Heating')] <- list(NULL)
testPredictions_lasso <- predict(lassoModel, newdata = hos_test2, na.action = na.pass)
# Evaluate the performance of the model on the test data using MSE and RMSE
MSE_lasso <- mean((hos_test$SalePrice - testPredictions_lasso)^2)
RMSE_lasso <- sqrt(MSE_lasso)
RMSE_lasso
## [1] 28972.69
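Since the same metric is computed for every model in this section, it can be convenient to wrap it in small helpers. This is a sketch with hypothetical function names and made-up prices, not part of the assignment code.

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Coefficient of determination (1 minus residual over total sum of squares)
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up sale prices for illustration
actual    <- c(200000, 150000, 320000)
predicted <- c(210000, 140000, 300000)

rmse(actual, predicted)
r_squared(actual, predicted)
```

RMSE is in the same units as SalePrice (dollars), which makes it easy to compare across the models below.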

11. (1 pt) set.seed(1) again and train a Ridge linear regression model using 10 fold cross validation and tune lambda as you did for lasso and compute the RMSE of this model on the test data.

Use knn imputation similar to what you did for lasso.

set.seed(1)
# Define the Ridge Linear Regression model using "glmnet"

ridgeModel <- train(SalePrice ~ ., 
                    data = hos_train2, 
                    method = "glmnet", 
                    trControl = trainControl(method = "cv", number = 10), 
                    tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 0.1, by = 0.001)), 
                    preProcess = preProc, 
                    na.action = na.pass)

# Get the predictions on the test data using the "predict" function with the "na.action=na.pass" option
testPredictions_ridge <- predict(ridgeModel, newdata = hos_test2, na.action = na.pass)

# Compute the RMSE of the Ridge Linear Regression model on the test data
RMSE_ridge <- sqrt(mean((hos_test$SalePrice - testPredictions_ridge)^2))

# View the RMSE
RMSE_ridge
## [1] 28858.33

12. (1 pt) set.seed(1) again and train an Elastic net linear regression model using 10 fold cross validation. Tune lambda as you did before and tune alpha over a sequence of 11 values between 0 and 1, that is: 0, 0.1, 0.2, …, 1. Compute the RMSE of the tuned model on the test data. Use knn imputation similar to what you did for the two previous models.

set.seed(1)
# Define the Elastic Net Linear Regression model using "glmnet"
elasticNetModel <- train(SalePrice ~ ., 
                         data = hos_train2, 
                         method = "glmnet", 
                         trControl = trainControl(method = "cv", number = 10), 
                         tuneGrid = expand.grid(alpha = seq(0, 1, by = 0.1), lambda = seq(0.001, 0.1, by = 0.001)), 
                         preProcess = preProc, 
                         na.action = na.pass)

# Get the predictions on the test data using the "predict" function with the "na.action=na.pass" option
testPredictions_elasticNet <- predict(elasticNetModel, newdata = hos_test2, na.action = na.pass)

# Compute the RMSE of the Elastic Net Linear Regression model on the test data
RMSE_elasticNet <- sqrt(mean((hos_test$SalePrice - testPredictions_elasticNet)^2))

# View the RMSE
RMSE_elasticNet
## [1] 28858.33

Section 3.2 Creating Tree-Ensemble and SVM Models

13. (2 pt) Set.seed(1) and Use Caret package with “rf” method to train a random forest model on the training data to predict the SalePrice. You can impute the missing values using knn similar to what you did for the previous models. Use 10-fold cross validation and let caret auto-tune the model. Use the model to predict the SalePrice for test data and compute RMSE. (Note: use importance=T in your train method so it computes the variable importance while building the model). Be patient. This model may take a long time to train.

set.seed(1)

# Train a Random Forest model using the "caret" package
rfModel <- train(SalePrice ~ ., 
                 data = hos_train2, 
                 method = "rf", 
                 trControl = trainControl(method = "cv", number = 10), 
                 preProcess = preProc, 
                 na.action = na.pass, 
                 importance = TRUE)
# Get the predictions on the test data using the "predict" function with the "na.action=na.pass" option
testPredictions_rf <- predict(rfModel, newdata = hos_test2, na.action = na.pass)

# Compute the RMSE of the Random Forest model on the test data
RMSE_rf <- sqrt(mean((hos_test$SalePrice - testPredictions_rf)^2))

# View the RMSE
RMSE_rf
## [1] 27543.66

(1 pt) Use caret’s varImp function to get the variable importance for the random forest model. Which variables were most predictive in the random forest model?

varImp(rfModel)
## rf variable importance
## 
##   only 20 most important variables shown (out of 82)
## 
##                  Overall
## OverallQual       100.00
## GrLivArea          77.83
## TotalBsmtSF        32.58
## GarageArea         26.74
## YearBuilt          26.21
## X1stFlrSF          24.08
## GarageFinishUnf    23.44
## GarageCars         23.17
## GarageTypeDetchd   22.40
## MSZoningRM         21.77
## BsmtFinSF1         21.36
## BsmtQualGd         18.73
## YearRemodAdd       18.14
## LotArea            18.05
## X2ndFlrSF          18.01
## KitchenQualGd      17.36
## CentralAirY        16.67
## BsmtUnfSF          16.62
## GarageYrBlt        16.19
## ExterQualGd        15.63

Based on the variable importance scores, the most predictive variables in the random forest model are OverallQual, GrLivArea, TotalBsmtSF, GarageArea, and YearBuilt. OverallQual and GrLivArea stand out clearly, with scaled importances of 100.00 and 77.83; the remaining variables all score below 33.

14. (1 pt) Set.seed(1) and Use Caret package with “gbm” method to train a Gradient Boosted Tree model on the training data. GBM needs minimum data preprocessing, you don’t need to scale numeric features or encode the categorical variables. In addition, it can be trained directly on data with missing values without having to do imputation.

Use 10 fold cross validation and let caret auto-tune the model.

Use the model to predict the SalePrice for the test data and compute RMSE.

set.seed(1)
hos_train2 <- hos_train2 %>% mutate_if(is.integer, ~replace(., is.na(.), 0))
hos_test2 <- hos_test2 %>% mutate_if(is.integer, ~replace(., is.na(.), 0))

# Define the training control for 10-fold cross validation
train_control <- trainControl(method = "cv", number = 10)

# Train the Gradient Boosted Tree model using the "gbm" method
gbmModel <- train(SalePrice ~ ., data = hos_train2, method = "gbm", trControl = train_control, verbose = FALSE)

# Use the trained model to make predictions on the test data
gbmPred <- predict(gbmModel, newdata = hos_test2)

# Compute the RMSE of the Gradient Boosted Tree model on the test data
RMSE_gbm <- RMSE(gbmPred, hos_test$SalePrice)

RMSE_gbm
## [1] 29045.64

15. (1 pt) Set.seed(1) and Use Caret package with “svmLinear” method to train a support vector machine model on the training data. Use preProc=”knnImpute” to impute the missing values and scale data. Use 10 fold cross validation and let caret auto-tune the model, explain what is hyper-parameter “c”? Use the model to predict the SalePrice for the test data and compute RMSE.

set.seed(1)

# Define the pre-processing pipeline for imputing missing values and scaling data
pre_proc_2 <- c("knnImpute", "center", "scale")

# Define the training control for 10-fold cross validation
train_control <- trainControl(method = "cv", number = 10)

# Train the SVM model using the "svmLinear" method and auto-tune the hyper-parameter "C"
svmModel <- train(SalePrice ~ ., data = hos_train2, method = "svmLinear", trControl = train_control, preProcess = pre_proc_2, tuneLength = 10)

# Use the trained model to make predictions on the test data
svmPred <- predict(svmModel, newdata = hos_test2)

# Compute the RMSE of the SVM model on the test data
RMSE_svm <- RMSE(svmPred, hos_test$SalePrice)

RMSE_svm
## [1] 27030.55

In SVM, “C” is the cost hyper-parameter. It controls the trade-off between keeping the model simple (a wide margin in classification, or a flat function in regression, which lowers variance) and fitting the training data closely (which lowers bias). A large value of “C” penalizes training errors heavily, so the model fits the training data more tightly at the risk of overfitting, while a small value of “C” tolerates more training error in exchange for a simpler, more regularized model. The optimal value of “C” depends on the specific dataset and problem at hand and is usually chosen by cross validation over a grid or random search. Note that caret’s default grid for “svmLinear” holds “C” constant at 1, so tuneLength = 10 does not by itself try 10 values of “C”; to tune it, pass an explicit tuneGrid of “C” values to train.
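One way to search over “C” explicitly is to hand caret a tuneGrid. The sketch below is an assumption-laden illustration rather than the assignment's model: it uses the built-in mtcars data instead of the housing data, a hypothetical grid of six cost values, and 5-fold cross validation to keep it quick. It needs the caret and kernlab packages installed.

```r
library(caret)   # method = "svmLinear" also requires the kernlab package

set.seed(1)

# Hypothetical grid of cost values: 0.25, 0.5, 1, 2, 4, 8
c_grid <- expand.grid(C = 2^(-2:3))

svm_fit <- train(mpg ~ ., data = mtcars,
                 method = "svmLinear",
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = c_grid)

# Cross-validated RMSE for every candidate C, and the value caret selected
svm_fit$results[, c("C", "RMSE")]
svm_fit$bestTune
```

Each row of the results table corresponds to one value of “C”, so the effect of the cost on cross-validated error can be read off directly.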

16. (1 pt) repeat the above steps but set train method to “svmRadial” to use radial basis function as kernel.

set.seed(1)

svmRadialFit <- train(SalePrice ~ ., data = hos_train2, method = "svmRadial", preProcess = c("knnImpute", "center", "scale"), trControl = trainControl(method = "cv", number = 10), tuneLength = 10)

svmRadialPred <- predict(svmRadialFit, newdata = hos_test2)

RMSE_svmRadial <- sqrt(mean((svmRadialPred - hos_test$SalePrice)^2))
RMSE_svmRadial
## [1] 30109.03

17. (2pt) Use “resamples” method to compare the cross validation RMSE of the seven models you created above (LASSO, RIDGE, elastic net, randomforest, gbm, svmlinear, and svmradial). In a sentence or two, interpret the results.

compare=resamples(list(Lasso=lassoModel,
                       Ridge= ridgeModel,
                       Enet=elasticNetModel,
                       Rf = rfModel,
                       GBM = gbmModel,
                       SVM = svmModel,
                       SVMRadial = svmRadialFit
                        ))

summary(compare)
## 
## Call:
## summary.resamples(object = compare)
## 
## Models: Lasso, Ridge, Enet, Rf, GBM, SVM, SVMRadial 
## Number of resamples: 10 
## 
## MAE 
##               Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Lasso     17365.44 19537.12 20399.17 20738.48 21030.85 26414.82    0
## Ridge     16831.32 18420.89 19249.83 20029.57 20712.16 25158.80    0
## Enet      16831.32 18420.89 19249.83 20029.57 20712.16 25158.80    0
## Rf        15349.11 15728.54 17252.79 18036.91 19219.20 25801.48    0
## GBM       14012.30 15996.89 17330.83 17912.57 19626.93 24778.62    0
## SVM       13758.41 16036.98 32216.26 32622.13 49950.47 53213.40    0
## SVMRadial 14475.04 16804.38 33720.13 35629.89 55726.51 58944.42    0
## 
## RMSE 
##               Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Lasso     24800.46 27076.01 28491.98 34543.55 33460.37 64030.17    0
## Ridge     24565.37 26192.31 27971.46 33927.82 34333.66 61676.30    0
## Enet      24565.37 26192.31 27971.46 33927.82 34333.66 61676.30    0
## Rf        22257.67 23238.84 27485.30 29037.03 29585.46 50186.95    0
## GBM       19318.17 23589.96 26656.57 28216.54 28276.35 46837.17    0
## SVM       18400.75 25491.87 61308.81 52759.27 74612.64 89153.88    0
## SVMRadial 19527.40 24416.80 48578.15 53067.67 81212.42 94951.58    0
## 
## Rsquared 
##                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Lasso     0.5353258 0.8441237 0.8542847 0.8187877 0.8841227 0.9087009    0
## Ridge     0.5444124 0.8365572 0.8624062 0.8240850 0.8893835 0.9139098    0
## Enet      0.5444124 0.8365572 0.8624062 0.8240850 0.8893835 0.9139098    0
## Rf        0.7243770 0.8876044 0.8987582 0.8723577 0.9019227 0.9112096    0
## GBM       0.7591947 0.8662518 0.8928676 0.8763809 0.9148695 0.9329393    0
## SVM       0.5681408 0.7080022 0.7499640 0.7767257 0.8999275 0.9502109    0
## SVMRadial 0.4699145 0.5219629 0.6933872 0.7037527 0.9099775 0.9318635    0

tabularview_1 <- data.frame(
  "Models" = c("Lasso","Ridge", "ElasticNet", "RandomForest","GBM","SVM","SVMRadial"),
  "RMSE" = c(RMSE_lasso, RMSE_ridge, RMSE_elasticNet, RMSE_rf, RMSE_gbm, RMSE_svm, RMSE_svmRadial)
                          )

kableExtra::kable(tabularview_1) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") %>% kableExtra::column_spec(1, bold = T)

Models        RMSE
Lasso         28972.69
Ridge         28858.33
ElasticNet    28858.33
RandomForest  27543.66
GBM           29045.64
SVM           27030.55
SVMRadial     30109.03

The “resamples” method compares the models on their cross-validation performance. By mean cross-validation RMSE, GBM (about 28,217) and random forest (about 29,037) performed best, followed by Ridge and elastic net (both about 33,928) and Lasso (about 34,544). The two SVM models were the worst (about 52,759 for SVM linear and 53,068 for SVM radial), and their RMSE varied widely across folds. On the held-out test set the ranking is different: SVM linear had the lowest RMSE (27,030.55), followed by random forest (27,543.66), Ridge and elastic net (both 28,858.33), Lasso (28,972.69), GBM (29,045.64), and SVM radial (30,109.03). Ridge and elastic net give identical results, which suggests that elastic net selected alpha = 0. Overall, the tree-based ensembles were the most consistent across cross validation and the test set, while SVM radial was the worst on both.

Section 3.3 Creating a Neural Network Model

18. Split the training data into a train-validation set (use 90% for training and 10% for validation).

19. Use knn imputation to impute the missing values in the train, validation, and test data based on the training data. To do this, you can use the preProcess function in caret with method="knnImpute" on the training data. This will return a preprocessing model which can be used to transform data using the predict function. Call the predict function as follows to do knn imputation on the train, test, and validation data based on the information in the training data:
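A minimal sketch of this fit-on-train, apply-everywhere pattern, assuming toy random data in place of the housing data and hypothetical object names (`impute_model`, `train_part`, and so on). It needs the caret and RANN packages installed.

```r
library(caret)   # method = "knnImpute" also requires the RANN package

set.seed(1)

# Toy numeric data with a few missing values (made up for illustration)
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
toy$x2[c(5, 45, 58)] <- NA

train_part <- toy[1:40, ]
val_part   <- toy[41:50, ]
test_part  <- toy[51:60, ]

# Fit the imputation model on the training rows only
impute_model <- preProcess(train_part, method = "knnImpute", k = 5)

# Apply the same model to all three sets
train_imp <- predict(impute_model, train_part)
val_imp   <- predict(impute_model, val_part)
test_imp  <- predict(impute_model, test_part)
```

Because knnImpute also centers and scales the numeric columns, the imputed data frames are on the standardized scale learned from the training rows.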

20. (3 pt) Neural networks cannot take factor variables, so you must convert your categorical variables to numbers before training your neural network model. One-hot encode the categorical variables using the one_hot function from the mltools package. Set dropUnusedLevels=FALSE. This makes sure that the train and test data will have the same number of features in case some levels of a categorical variable do not occur in the training data but occur in the test or validation data.
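The reason for keeping unused levels can be illustrated without mltools. The sketch below swaps in base R's model.matrix for one_hot, with a hypothetical helper `encode` and made-up values: as long as both sets are encoded against the same fixed level set, they get the same columns even when a level is absent from one of them.

```r
# Made-up values: "SLvl" never appears in the training data
train_style <- c("1Story", "2Story", "1Story")
test_style  <- c("2Story", "SLvl")

# Fixed level set shared by both data sets (assumed known in advance)
all_levels <- c("1Story", "2Story", "SLvl")

# One-hot encode against the full level set (hypothetical helper)
encode <- function(x, levels) {
  f <- factor(x, levels = levels)
  m <- model.matrix(~ f - 1)          # one indicator column per level
  colnames(m) <- paste0("HouseStyle_", levels)
  m
}

train_x <- encode(train_style, all_levels)
test_x  <- encode(test_style, all_levels)

colnames(train_x)
colnames(test_x)
```

The training matrix carries an all-zero column for the level it never saw, so a network trained on it accepts the test matrix without any reshaping.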

df_hou_2 <- df_hou
# Identify categorical columns
categorical_cols_2 <- sapply(df_hou_2, is.character)

# Remove SalePrice from the list of categorical columns
categorical_cols_2["SalePrice"] <- FALSE

# Get the names of the categorical columns
categorical_col_names_2 <- names(df_hou_2)[categorical_cols_2]
# Loop through each categorical column
for (col in categorical_col_names_2) {
  # Most frequent value (mode) of the column
  mode_val_1 <- names(which.max(table(df_hou_2[[col]])))

  # Replace the 'notApplicable' placeholder with the mode
  df_hou_2[[col]] <- ifelse(df_hou_2[[col]] == 'notApplicable', mode_val_1, df_hou_2[[col]])
}
#df_hou_2[ ,c('Condition2','RoofMatl','Exterior1st','Exterior2nd','ExterCond','Heating')] <- list(NULL)

df_hou_2 <- df_hou_2 %>% mutate_if(is.character, as.factor)

# One-hot encode the categorical variables of the full data set (before splitting)
df_hou_2 <- as.data.frame(one_hot(as.data.table(df_hou_2), dropUnusedLevels = FALSE))
set.seed(101)
# Create data partition
inTrain_2 <- createDataPartition(df_hou_2$SalePrice, p = 0.8, list = FALSE)


# Create training and test sets
hou_train_2 <- df_hou_2[inTrain_2, ]
hos_test_2 <- df_hou_2[-inTrain_2, ]


# dim(hou_train_2)
# dim(hos_test_2)
# Split the data into training and validation sets
trainIndex <- createDataPartition(hou_train_2$SalePrice, p = 0.9, list = FALSE, times = 1)
training <- hou_train_2[trainIndex,]
validation <- hou_train_2[-trainIndex,]

# dim(training)
# dim(validation)
# Fit knn-imputation models: preproc_model_1 on the combined train + validation
# rows (hou_train_2), preproc_model_2 on the test rows (hos_test_2)
preproc_model_1 <- preProcess(hou_train_2, method = "knnImpute")
preproc_model_2 <- preProcess(hos_test_2, method = "knnImpute")

# Impute train and validation with preproc_model_1, and test with preproc_model_2
train_data_imputed <- predict(preproc_model_1, training)
val_data_imputed <- predict(preproc_model_1, validation)
test_data_imputed <- predict(preproc_model_2, hos_test_2)

# dim(train_data_imputed)
# dim(val_data_imputed)
# dim(test_data_imputed)
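
Question 19 asks for the imputation to be based on the training data only. A minimal sketch of that pattern with caret follows, using invented toy data (`toy`, the split sizes, and `pp_train` are hypothetical and not taken from the assignment). One preprocessing model is fit on the training rows and then reused for every split. Note that `knnImpute` also centers and scales the columns it is fit on, and that it needs the RANN package.

```r
library(caret)  # preProcess(); method = "knnImpute" also requires RANN

set.seed(42)
# Hypothetical numeric toy data with a few missing values
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60))
toy$x2[c(3, 17, 55)] <- NA

toy_train <- toy[1:40, ]   # has NAs in rows 3 and 17
toy_val   <- toy[41:50, ]
toy_test  <- toy[51:60, ]  # has an NA in row 55

# Fit the imputation model on the training rows only
pp_train <- preProcess(toy_train, method = "knnImpute")

# Reuse that single model for all three splits
toy_train_imp <- predict(pp_train, toy_train)
toy_val_imp   <- predict(pp_train, toy_val)
toy_test_imp  <- predict(pp_train, toy_test)

anyNA(toy_test_imp)  # check that the test rows no longer contain NAs
```

Because the test rows are transformed with statistics learned from the training rows, all three splits end up on the same scale.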

21. (2pt) Since we are not using caret to train neural networks, we will have to manually remove variables with little or no variance. To identify these variables in your training data, use the method “nearZeroVar” from the caret package. This will return the indices of variables with little to no variance in the training data. Use these indices to remove these variables from the train, validation, and test data. You can refer to this page (https://topepo.github.io/caret/pre-processing.html#nzv , section 3.2 ) for an example of identifying and removing near-zero-variance variables.

# identify near-zero variance variables
nzv <- nearZeroVar(train_data_imputed, saveMetrics= TRUE)
#nzv_2 <- nearZeroVar(test_data_imputed, saveMetrics= TRUE)


# get indices of near-zero variance variables
nzv_indices <- which(nzv$nzv == TRUE)
#nzv_indices_2 <- which(nzv_2$nzv == TRUE)

# remove near-zero variance variables from train, validation, and test data
train <- train_data_imputed[, -nzv_indices]
validation <- val_data_imputed[, -nzv_indices]
test <- test_data_imputed[, -nzv_indices]

dim(train)
## [1] 1053  109
dim(validation)
## [1] 116 109
dim(test)
## [1] 291 109
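
To make the near-zero-variance criterion more concrete, the sketch below approximates in base R the two statistics that nearZeroVar reports: the frequency ratio of the two most common values and the percentage of distinct values. It uses caret's default cutoffs (freqCut = 95/5, uniqueCut = 10). The helper `is_nzv` and the toy columns are hypothetical, and the helper is only an approximation of caret's rule. The sketch also guards the case where no column is flagged, since negative indexing with an empty index vector would drop every column.

```r
# Hypothetical helper approximating caret's near-zero-variance rule
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)             # constant column
  freq_ratio <- as.numeric(counts[1] / counts[2])   # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)    # distinct values as % of rows
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

set.seed(7)
toy <- data.frame(
  area   = rnorm(200),             # continuous, many distinct values
  rare   = c(rep(0, 198), 1, 1),   # 198:2 split, only two distinct values
  binary = rep(c(0, 1), 100)       # balanced indicator
)

nzv_cols <- which(vapply(toy, is_nzv, logical(1)))

# Drop flagged columns only when at least one was found:
# toy[, -integer(0)] would return a data frame with zero columns.
toy_reduced <- if (length(nzv_cols) > 0) toy[, -nzv_cols, drop = FALSE] else toy
names(toy_reduced)
```

Only `rare` is flagged here: its frequency ratio is 99 and just 1% of its values are distinct.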

22. (5 pt) Create a Neural Network model with at least two hidden layers to predict the sale price in 100K units. In other words, your target variables/labels should be SalePrice/100000. We scale down the sale price to keep the error gradients from getting too large during backpropagation. If gradients are too large, they can make the model unstable and you end up with NaN for the training or validation loss.

Use the training and validation sets you created above. Add a dropout layer after each hidden layer to regularize your neural network model. Use the tfruns package to tune your hyper-parameters, including the dropout factors. You should include two flags for the dropout factors, one for each hidden layer. Display the table returned by tfruns.

X_train <- select(train, -c(SalePrice))
X_val <- select(validation, -c(SalePrice))
X_test <- select(test, -c(SalePrice))

X_train <- as.matrix(X_train)
X_val <- as.matrix(X_val)
X_test <- as.matrix(X_test)

# Scale down the SalePrice
y_train <- train$SalePrice/100000
y_val <- validation$SalePrice/100000
y_test <- test$SalePrice/100000
set.seed(1)
# Run the model
runs <- tuning_run("assign5_nn.R",
                   flags = list(
                     nodes = c(32, 64, 128),
                     learning_rate = c(0.01, 0.001, 0.0001),
                     batch_size=c(50,100,200),
                     epochs=c(30,50,100),
                     activation=c("relu","sigmoid","tanh")
                   ),
                   sample = 0.02

)
## 
## > FLAGS <- flags(flag_numeric("nodes", 64), flag_numeric("batch_size", 
## +     32), flag_string("activation", "relu"), flag_numeric("learning_rate", 
## + .... [TRUNCATED] 
## 
## > model = keras_model_sequential()
## 
## > model %>% layer_dense(units = FLAGS$nodes, input_shape = dim(X_train)[2]) %>% 
## +     layer_dropout(rate = 0.3) %>% layer_dense(units = 1)
## 
## > model %>% compile(optimizer = optimizer_adam(learning_rate = FLAGS$learning_rate), 
## +     loss = "mse", metrics = list("mse"))
## 
## > model %>% fit(X_train, y_train, epochs = FLAGS$epochs, 
## +     batch_size = FLAGS$batch_size, validation_data = list(X_val, 
## +         y_val))
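
With five flags of three values each, the full grid contains 3^5 = 243 combinations, so `sample = 0.02` evaluates only about five of them. The base-R sketch below enumerates the same grid to make that arithmetic explicit. It is independent of tfruns, and how tfruns rounds the sampled count is left as an assumption.

```r
# Same flag values as in the tuning_run() call
flag_grid <- expand.grid(
  nodes = c(32, 64, 128),
  learning_rate = c(0.01, 0.001, 0.0001),
  batch_size = c(50, 100, 200),
  epochs = c(30, 50, 100),
  activation = c("relu", "sigmoid", "tanh"),
  stringsAsFactors = FALSE
)

n_combos <- nrow(flag_grid)            # 3^5 = 243 combinations
sample_frac <- 0.02
n_expected <- sample_frac * n_combos   # 4.86, i.e. about five sampled runs
c(n_combos = n_combos, n_expected = n_expected)
```

A sample this small explores roughly 2% of the search space, so the best run found is only the best among those few candidates.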

23. (2 pts) Use view_run to look at your best model. Note that the best model is the model with the lowest validation loss. What hyper-parameter combination is used in your best model? Does your best model still overfit?

view_run(runs$run_dir[1])

Yes, the validation loss decreases continuously.

Best Model:

Flags: nodes = 64, batch_size = 200, activation = tanh, learning_rate = 0.0001, epochs = 50

Kindly look at the BestModel file.
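
Because the question defines the best model as the one with the lowest validation loss, the run can be picked by sorting on that metric rather than by row position. The sketch below uses a mock data frame with the column naming that tfruns uses (`run_dir`, `metric_val_loss`, `flag_*`). All values in it are invented for illustration and are not results from this assignment.

```r
# Mock of the data frame returned by tuning_run(); all values are invented.
runs_mock <- data.frame(
  run_dir = c("runs/a", "runs/b", "runs/c"),
  metric_val_loss = c(0.21, 0.09, 0.15),
  flag_nodes = c(32, 64, 128),
  flag_learning_rate = c(0.01, 0.0001, 0.001),
  stringsAsFactors = FALSE
)

# Order the runs from lowest to highest validation loss
runs_sorted <- runs_mock[order(runs_mock$metric_val_loss), ]
best_run <- runs_sorted[1, ]
best_run$run_dir

# With the real object this would be, for example:
# view_run(runs[order(runs$metric_val_loss), ]$run_dir[1])
```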

24. (2 pt) Now that we have tuned the hyperparameters, we don’t need the validation data anymore and we can use ALL of the training data for training. Use all of your training data (that is, train + validation data) to train a model with the best combination of hyper-parameters you found in the previous step.

X_train_2 <- rbind(X_train, X_val)
y_train_2 <- c(y_train, y_val)
set.seed(1)

# Retrain the best model on the combined train + validation data
best_model <- keras_model_sequential()

best_model %>%
  layer_dense(units = 64, activation = "tanh", input_shape = dim(X_train_2)[2]) %>%
  layer_dense(units = 1)
best_model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.0001),
  loss = 'mse',
  metrics = list('mse'))
history <- best_model %>% fit(
  X_train_2, y_train_2, epochs = 50,
  batch_size = 200)

25. (2pt) Use your model above to predict the saleprice in 100K units for the test data. To get RMSE in the original scale, you should multiply your predictions and test labels by 100000 before computing RMSE.

set.seed(100)
# Make predictions on the test set using the best model
y_pred <-  predict(best_model, X_test)  %>% as.vector()

# Reverse the 100000 transformation
y_pred_orig <- y_pred * 100000
y_test_orig <- y_test * 100000

# Compute the RMSE in the original scale
RMSE_nn <- sqrt(mean((y_test_orig - y_pred_orig)^2))
RMSE_nn
## [1] 32348.2
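
The rescaling step works because RMSE is linear in a positive scale factor: multiplying both labels and predictions by 100000 multiplies the RMSE by 100000. A small check on invented numbers follows (the vectors and the helper `rmse` are hypothetical).

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Invented labels and predictions in 100K units
y_true_100k <- c(1.50, 2.10, 0.95, 3.20)
y_hat_100k  <- c(1.62, 1.95, 1.10, 3.05)

rmse_scaled   <- rmse(y_true_100k, y_hat_100k)                    # in 100K units
rmse_original <- rmse(y_true_100k * 100000, y_hat_100k * 100000)  # in original units

all.equal(rmse_original, rmse_scaled * 100000)
```

Either route gives the same number, so the RMSE can be computed in 100K units and multiplied by 100000 afterwards.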

26. (1pt) Compare the RMSE of your lasso, ridge, elastic net, random forest, gbm, svm, and neural networks models on the test data. Which model did better on this dataset?

tabularview_1 <- data.frame(
  "Models" = c("Lasso","Ridge", "ElasticNet", "RandomForest","GBM","SVM","SVMRadial", "NeuralNets"),
  "RMSE" = c(RMSE_lasso, RMSE_ridge, RMSE_elasticNet, RMSE_rf, RMSE_gbm, RMSE_svm, RMSE_svmRadial,RMSE_nn)
                          )

kableExtra::kable(tabularview_1) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") %>% kableExtra::column_spec(1, bold = T)
Models RMSE
Lasso 28972.69
Ridge 28858.33
ElasticNet 28858.33
RandomForest 27543.66
GBM 29045.64
SVM 27030.55
SVMRadial 30109.03
NeuralNets 32348.20
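
To read the comparison off directly, the RMSE values from the table above can be sorted. This is a small base-R sketch: the numbers are copied from the table, and `results`, `ranked`, and `GapToBest` are names introduced here.

```r
# RMSE values copied from the table above
results <- data.frame(
  Models = c("Lasso", "Ridge", "ElasticNet", "RandomForest",
             "GBM", "SVM", "SVMRadial", "NeuralNets"),
  RMSE = c(28972.69, 28858.33, 28858.33, 27543.66,
           29045.64, 27030.55, 30109.03, 32348.20)
)

# Sort from lowest to highest test RMSE and add the gap to the best model
ranked <- results[order(results$RMSE), ]
rownames(ranked) <- NULL
ranked$GapToBest <- ranked$RMSE - min(ranked$RMSE)
ranked
```

Sorted this way, the linear SVM has the lowest test RMSE, followed by the random forest, and the neural network has the highest.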