For this problem, we are going to use the housing dataset from this Kaggle competition. The dataset has 80 variables describing each house (lot shape, size, neighborhood, number of rooms, number of bathrooms, etc.). The goal is to predict the house sale price from these variables.
Download and unzip the attached data. It consists of two files: housing.csv, which is the dataset, and data_description.txt, which describes the variables in the data. Read this description to understand what each variable means.
# Import all the relevant libraries
library(tm)
library(gmodels)
library(Matrix)
library(qdap)
library(keras)
library(tensorflow)
library(readr)
library(tfruns)
library(ggplot2)
library(tidyr)
library(dplyr)
library(corrplot)
library(caret)
library(neuralnet)
library(GGally)
library(glmnet)
library(RANN)
library(mltools)
library(data.table)

Remove the first column, as it is a unique identifier and is not used in predicting the house sale price.
housing <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/housing.csv", header = T, sep = ",")
# Remove the first column
housing <- housing[-1]
dim(housing)

## [1] 1460 80
1. Take a summary of the data and explore the result. How many categorical and numerical variables are there in the dataset?
# Take a summary of the data
summary(housing)

## MSSubClass MSZoning LotFrontage LotArea
## Min. : 20.0 Length:1460 Min. : 21.00 Min. : 1300
## 1st Qu.: 20.0 Class :character 1st Qu.: 59.00 1st Qu.: 7554
## Median : 50.0 Mode :character Median : 69.00 Median : 9478
## Mean : 56.9 Mean : 70.05 Mean : 10517
## 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## Street Alley LotShape LandContour
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Utilities LotConfig LandSlope Neighborhood
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Condition1 Condition2 BldgType HouseStyle
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## RoofStyle RoofMatl Exterior1st Exterior2nd
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## MasVnrType MasVnrArea ExterQual ExterCond
## Length:1460 Min. : 0.0 Length:1460 Length:1460
## Class :character 1st Qu.: 0.0 Class :character Class :character
## Mode :character Median : 0.0 Mode :character Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## Foundation BsmtQual BsmtCond BsmtExposure
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## Length:1460 Min. : 0.0 Length:1460 Min. : 0.00
## Class :character 1st Qu.: 0.0 Class :character 1st Qu.: 0.00
## Mode :character Median : 383.5 Mode :character Median : 0.00
## Mean : 443.6 Mean : 46.55
## 3rd Qu.: 712.2 3rd Qu.: 0.00
## Max. :5644.0 Max. :1474.00
##
## BsmtUnfSF TotalBsmtSF Heating HeatingQC
## Min. : 0.0 Min. : 0.0 Length:1460 Length:1460
## 1st Qu.: 223.0 1st Qu.: 795.8 Class :character Class :character
## Median : 477.5 Median : 991.5 Mode :character Mode :character
## Mean : 567.2 Mean :1057.4
## 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :2336.0 Max. :6110.0
##
## CentralAir Electrical X1stFlrSF X2ndFlrSF
## Length:1460 Length:1460 Min. : 334 Min. : 0
## Class :character Class :character 1st Qu.: 882 1st Qu.: 0
## Mode :character Mode :character Median :1087 Median : 0
## Mean :1163 Mean : 347
## 3rd Qu.:1391 3rd Qu.: 728
## Max. :4692 Max. :2065
##
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## Min. : 0.000 Min. : 334 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.000 Median :1464 Median :0.0000 Median :0.00000
## Mean : 5.845 Mean :1515 Mean :0.4253 Mean :0.05753
## 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :572.000 Max. :5642 Max. :3.0000 Max. :2.00000
##
## FullBath HalfBath BedroomAbvGr KitchenAbvGr
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :0.0000 Median :3.000 Median :1.000
## Mean :1.565 Mean :0.3829 Mean :2.866 Mean :1.047
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :2.0000 Max. :8.000 Max. :3.000
##
## KitchenQual TotRmsAbvGrd Functional Fireplaces
## Length:1460 Min. : 2.000 Length:1460 Min. :0.000
## Class :character 1st Qu.: 5.000 Class :character 1st Qu.:0.000
## Mode :character Median : 6.000 Mode :character Median :1.000
## Mean : 6.518 Mean :0.613
## 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :14.000 Max. :3.000
##
## FireplaceQu GarageType GarageYrBlt GarageFinish
## Length:1460 Length:1460 Min. :1900 Length:1460
## Class :character Class :character 1st Qu.:1961 Class :character
## Mode :character Mode :character Median :1980 Mode :character
## Mean :1979
## 3rd Qu.:2002
## Max. :2010
## NA's :81
## GarageCars GarageArea GarageQual GarageCond
## Min. :0.000 Min. : 0.0 Length:1460 Length:1460
## 1st Qu.:1.000 1st Qu.: 334.5 Class :character Class :character
## Median :2.000 Median : 480.0 Mode :character Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Length:1460 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Median : 0.00 Median : 25.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Length:1460
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Class :character
## Median : 0.00 Median : 0.00 Median : 0.000 Mode :character
## Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## Length:1460 Length:1460 Min. : 0.00 Min. : 1.000
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 5.000
## Mode :character Mode :character Median : 0.00 Median : 6.000
## Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 Length:1460 Length:1460 Min. : 34900
## 1st Qu.:2007 Class :character Class :character 1st Qu.:129975
## Median :2008 Mode :character Mode :character Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
# Count the number of categorical and numeric variables
categorical_variable <- sapply(housing, is.character)
numerical_variables <- sapply(housing, is.numeric)
count_categorical <- sum(categorical_variable)
count_numerical <- sum(numerical_variables)
cat("categorical variables: ",count_categorical,"\n")## categorical variables: 43
cat("numerical variables: ", count_numerical)## numerical variables: 37
2. (1pt) Which columns have missing values, and what percentage of each of those columns is NA?
as.data.frame(colSums(is.na(housing))) %>% rename_at('colSums(is.na(housing))', ~'Missing_Values')

## Missing_Values
## MSSubClass 0
## MSZoning 0
## LotFrontage 259
## LotArea 0
## Street 0
## Alley 1369
## LotShape 0
## LandContour 0
## Utilities 0
## LotConfig 0
## LandSlope 0
## Neighborhood 0
## Condition1 0
## Condition2 0
## BldgType 0
## HouseStyle 0
## OverallQual 0
## OverallCond 0
## YearBuilt 0
## YearRemodAdd 0
## RoofStyle 0
## RoofMatl 0
## Exterior1st 0
## Exterior2nd 0
## MasVnrType 8
## MasVnrArea 8
## ExterQual 0
## ExterCond 0
## Foundation 0
## BsmtQual 37
## BsmtCond 37
## BsmtExposure 38
## BsmtFinType1 37
## BsmtFinSF1 0
## BsmtFinType2 38
## BsmtFinSF2 0
## BsmtUnfSF 0
## TotalBsmtSF 0
## Heating 0
## HeatingQC 0
## CentralAir 0
## Electrical 1
## X1stFlrSF 0
## X2ndFlrSF 0
## LowQualFinSF 0
## GrLivArea 0
## BsmtFullBath 0
## BsmtHalfBath 0
## FullBath 0
## HalfBath 0
## BedroomAbvGr 0
## KitchenAbvGr 0
## KitchenQual 0
## TotRmsAbvGrd 0
## Functional 0
## Fireplaces 0
## FireplaceQu 690
## GarageType 81
## GarageYrBlt 81
## GarageFinish 81
## GarageCars 0
## GarageArea 0
## GarageQual 81
## GarageCond 81
## PavedDrive 0
## WoodDeckSF 0
## OpenPorchSF 0
## EnclosedPorch 0
## X3SsnPorch 0
## ScreenPorch 0
## PoolArea 0
## PoolQC 1453
## Fence 1179
## MiscFeature 1406
## MiscVal 0
## MoSold 0
## YrSold 0
## SaleType 0
## SaleCondition 0
## SalePrice 0
as.data.frame(sapply(housing, function(y) round((sum(length(which(is.na(y))))/nrow(housing))*100.0,1))) %>%
rename_at('sapply(housing, function(y) round((sum(length(which(is.na(y))))/nrow(housing)) * 100, 1))', ~'Missing_Values(%)')

## Missing_Values(%)
## MSSubClass 0.0
## MSZoning 0.0
## LotFrontage 17.7
## LotArea 0.0
## Street 0.0
## Alley 93.8
## LotShape 0.0
## LandContour 0.0
## Utilities 0.0
## LotConfig 0.0
## LandSlope 0.0
## Neighborhood 0.0
## Condition1 0.0
## Condition2 0.0
## BldgType 0.0
## HouseStyle 0.0
## OverallQual 0.0
## OverallCond 0.0
## YearBuilt 0.0
## YearRemodAdd 0.0
## RoofStyle 0.0
## RoofMatl 0.0
## Exterior1st 0.0
## Exterior2nd 0.0
## MasVnrType 0.5
## MasVnrArea 0.5
## ExterQual 0.0
## ExterCond 0.0
## Foundation 0.0
## BsmtQual 2.5
## BsmtCond 2.5
## BsmtExposure 2.6
## BsmtFinType1 2.5
## BsmtFinSF1 0.0
## BsmtFinType2 2.6
## BsmtFinSF2 0.0
## BsmtUnfSF 0.0
## TotalBsmtSF 0.0
## Heating 0.0
## HeatingQC 0.0
## CentralAir 0.0
## Electrical 0.1
## X1stFlrSF 0.0
## X2ndFlrSF 0.0
## LowQualFinSF 0.0
## GrLivArea 0.0
## BsmtFullBath 0.0
## BsmtHalfBath 0.0
## FullBath 0.0
## HalfBath 0.0
## BedroomAbvGr 0.0
## KitchenAbvGr 0.0
## KitchenQual 0.0
## TotRmsAbvGrd 0.0
## Functional 0.0
## Fireplaces 0.0
## FireplaceQu 47.3
## GarageType 5.5
## GarageYrBlt 5.5
## GarageFinish 5.5
## GarageCars 0.0
## GarageArea 0.0
## GarageQual 5.5
## GarageCond 5.5
## PavedDrive 0.0
## WoodDeckSF 0.0
## OpenPorchSF 0.0
## EnclosedPorch 0.0
## X3SsnPorch 0.0
## ScreenPorch 0.0
## PoolArea 0.0
## PoolQC 99.5
## Fence 80.8
## MiscFeature 96.3
## MiscVal 0.0
## MoSold 0.0
## YrSold 0.0
## SaleType 0.0
## SaleCondition 0.0
## SalePrice 0.0
3. (1pt) Read the data description carefully. For some of the variables, such as PoolQC, FireplaceQu, Fence, etc., NA means not applicable rather than missing at random. For instance, a house that does not have a pool gets NA for PoolQC. For those variables for which NA means not applicable, you can replace NA with zero (if that variable is numeric) or with a new category/level, for instance “notApplicable” or “None” (if that variable is categorical).
housing$PoolQC[is.na(housing$PoolQC)] <- "notApplicable"
housing$FireplaceQu[is.na(housing$FireplaceQu)] <- "notApplicable"
housing$Fence[is.na(housing$Fence)] <- "notApplicable"
df_hou <- housing
#df_hou <- df_hou %>% mutate_if(is.integer, ~replace(., is.na(.), 0))
df_hou <- df_hou %>% mutate_if(is.character, ~replace(., is.na(.), "notApplicable"))4. (1pt) After replacing not applicable NAs with appropriate values, find out which columns (if any) still have NAs and what percentage of each column is missing.
as.data.frame(colSums(is.na(df_hou))) %>% rename_at('colSums(is.na(df_hou))', ~'Missing_Values')

## Missing_Values
## MSSubClass 0
## MSZoning 0
## LotFrontage 259
## LotArea 0
## Street 0
## Alley 0
## LotShape 0
## LandContour 0
## Utilities 0
## LotConfig 0
## LandSlope 0
## Neighborhood 0
## Condition1 0
## Condition2 0
## BldgType 0
## HouseStyle 0
## OverallQual 0
## OverallCond 0
## YearBuilt 0
## YearRemodAdd 0
## RoofStyle 0
## RoofMatl 0
## Exterior1st 0
## Exterior2nd 0
## MasVnrType 0
## MasVnrArea 8
## ExterQual 0
## ExterCond 0
## Foundation 0
## BsmtQual 0
## BsmtCond 0
## BsmtExposure 0
## BsmtFinType1 0
## BsmtFinSF1 0
## BsmtFinType2 0
## BsmtFinSF2 0
## BsmtUnfSF 0
## TotalBsmtSF 0
## Heating 0
## HeatingQC 0
## CentralAir 0
## Electrical 0
## X1stFlrSF 0
## X2ndFlrSF 0
## LowQualFinSF 0
## GrLivArea 0
## BsmtFullBath 0
## BsmtHalfBath 0
## FullBath 0
## HalfBath 0
## BedroomAbvGr 0
## KitchenAbvGr 0
## KitchenQual 0
## TotRmsAbvGrd 0
## Functional 0
## Fireplaces 0
## FireplaceQu 0
## GarageType 0
## GarageYrBlt 81
## GarageFinish 0
## GarageCars 0
## GarageArea 0
## GarageQual 0
## GarageCond 0
## PavedDrive 0
## WoodDeckSF 0
## OpenPorchSF 0
## EnclosedPorch 0
## X3SsnPorch 0
## ScreenPorch 0
## PoolArea 0
## PoolQC 0
## Fence 0
## MiscFeature 0
## MiscVal 0
## MoSold 0
## YrSold 0
## SaleType 0
## SaleCondition 0
## SalePrice 0
as.data.frame(sapply(df_hou, function(y) round((sum(length(which(is.na(y))))/nrow(df_hou))*100.0,1))) %>%
rename_at('sapply(df_hou, function(y) round((sum(length(which(is.na(y))))/nrow(df_hou)) * 100, 1))', ~'Missing_Values(%)')

## Missing_Values(%)
## MSSubClass 0.0
## MSZoning 0.0
## LotFrontage 17.7
## LotArea 0.0
## Street 0.0
## Alley 0.0
## LotShape 0.0
## LandContour 0.0
## Utilities 0.0
## LotConfig 0.0
## LandSlope 0.0
## Neighborhood 0.0
## Condition1 0.0
## Condition2 0.0
## BldgType 0.0
## HouseStyle 0.0
## OverallQual 0.0
## OverallCond 0.0
## YearBuilt 0.0
## YearRemodAdd 0.0
## RoofStyle 0.0
## RoofMatl 0.0
## Exterior1st 0.0
## Exterior2nd 0.0
## MasVnrType 0.0
## MasVnrArea 0.5
## ExterQual 0.0
## ExterCond 0.0
## Foundation 0.0
## BsmtQual 0.0
## BsmtCond 0.0
## BsmtExposure 0.0
## BsmtFinType1 0.0
## BsmtFinSF1 0.0
## BsmtFinType2 0.0
## BsmtFinSF2 0.0
## BsmtUnfSF 0.0
## TotalBsmtSF 0.0
## Heating 0.0
## HeatingQC 0.0
## CentralAir 0.0
## Electrical 0.0
## X1stFlrSF 0.0
## X2ndFlrSF 0.0
## LowQualFinSF 0.0
## GrLivArea 0.0
## BsmtFullBath 0.0
## BsmtHalfBath 0.0
## FullBath 0.0
## HalfBath 0.0
## BedroomAbvGr 0.0
## KitchenAbvGr 0.0
## KitchenQual 0.0
## TotRmsAbvGrd 0.0
## Functional 0.0
## Fireplaces 0.0
## FireplaceQu 0.0
## GarageType 0.0
## GarageYrBlt 5.5
## GarageFinish 0.0
## GarageCars 0.0
## GarageArea 0.0
## GarageQual 0.0
## GarageCond 0.0
## PavedDrive 0.0
## WoodDeckSF 0.0
## OpenPorchSF 0.0
## EnclosedPorch 0.0
## X3SsnPorch 0.0
## ScreenPorch 0.0
## PoolArea 0.0
## PoolQC 0.0
## Fence 0.0
## MiscFeature 0.0
## MiscVal 0.0
## MoSold 0.0
## YrSold 0.0
## SaleType 0.0
## SaleCondition 0.0
## SalePrice 0.0
5. (1pt) What percentage of rows in the dataset have one or more missing values?
# Calculate the percentage of rows with missing values
missing_rows <- sum(apply(df_hou, 1, function(x) any(is.na(x))))
percent_missing_rows <- round(missing_rows / nrow(df_hou) * 100, 2)
# Print the result
cat("Percentage of rows with missing values:", percent_missing_rows, "%")## Percentage of rows with missing values: 23.22 %
6. (1pt) Plot the histogram of SalePrice. Interpret the histogram. Is the SalePrice variable skewed?
# Plot the histogram of SalePrice
hist(df_hou$SalePrice, main = "Histogram of SalePrice", xlab = "Sale Price", col = "steelblue")

The histogram shows that SalePrice is not symmetrically distributed; it is skewed to the right. In a right-skewed histogram the tail extends toward the higher values, meaning a few houses have a much higher SalePrice than the rest of the houses in the dataset. This also suggests that SalePrice may contain outliers.
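As a quick numeric check of this visual impression, the sample skewness can be computed. This is a sketch only; it assumes the e1071 package, which is not loaded above, is installed.

# Sketch: a positive skewness value confirms the right skew seen in the histogram
# (assumes the e1071 package is installed; it is not among the libraries loaded above)
library(e1071)
skewness(df_hou$SalePrice)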
7. (3 pt) Use appropriate plots and test statistics to find out which variables are associated with SalePrice. Remove variables that show no association or a very weak association with SalePrice. Note: the t-test or oneway.test for some variables may throw an error if there are not enough observations in a group. You can ignore this error at this point; later we will find and eliminate groups with little variance.
Correlation table
numeric_values <- df_hou %>%
dplyr::select_if(is.numeric)
# correlation for all variables
as.data.frame(round(cor(numeric_values),digits = 2 ))

## MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt
## MSSubClass 1.00 NA -0.14 0.03 -0.06 0.03
## LotFrontage NA 1 NA NA NA NA
## LotArea -0.14 NA 1.00 0.11 -0.01 0.01
## OverallQual 0.03 NA 0.11 1.00 -0.09 0.57
## OverallCond -0.06 NA -0.01 -0.09 1.00 -0.38
## YearBuilt 0.03 NA 0.01 0.57 -0.38 1.00
## YearRemodAdd 0.04 NA 0.01 0.55 0.07 0.59
## MasVnrArea NA NA NA NA NA NA
## BsmtFinSF1 -0.07 NA 0.21 0.24 -0.05 0.25
## BsmtFinSF2 -0.07 NA 0.11 -0.06 0.04 -0.05
## BsmtUnfSF -0.14 NA 0.00 0.31 -0.14 0.15
## TotalBsmtSF -0.24 NA 0.26 0.54 -0.17 0.39
## X1stFlrSF -0.25 NA 0.30 0.48 -0.14 0.28
## X2ndFlrSF 0.31 NA 0.05 0.30 0.03 0.01
## LowQualFinSF 0.05 NA 0.00 -0.03 0.03 -0.18
## GrLivArea 0.07 NA 0.26 0.59 -0.08 0.20
## BsmtFullBath 0.00 NA 0.16 0.11 -0.05 0.19
## BsmtHalfBath 0.00 NA 0.05 -0.04 0.12 -0.04
## FullBath 0.13 NA 0.13 0.55 -0.19 0.47
## HalfBath 0.18 NA 0.01 0.27 -0.06 0.24
## BedroomAbvGr -0.02 NA 0.12 0.10 0.01 -0.07
## KitchenAbvGr 0.28 NA -0.02 -0.18 -0.09 -0.17
## TotRmsAbvGrd 0.04 NA 0.19 0.43 -0.06 0.10
## Fireplaces -0.05 NA 0.27 0.40 -0.02 0.15
## GarageYrBlt NA NA NA NA NA NA
## GarageCars -0.04 NA 0.15 0.60 -0.19 0.54
## GarageArea -0.10 NA 0.18 0.56 -0.15 0.48
## WoodDeckSF -0.01 NA 0.17 0.24 0.00 0.22
## OpenPorchSF -0.01 NA 0.08 0.31 -0.03 0.19
## EnclosedPorch -0.01 NA -0.02 -0.11 0.07 -0.39
## X3SsnPorch -0.04 NA 0.02 0.03 0.03 0.03
## ScreenPorch -0.03 NA 0.04 0.06 0.05 -0.05
## PoolArea 0.01 NA 0.08 0.07 0.00 0.00
## MiscVal -0.01 NA 0.04 -0.03 0.07 -0.03
## MoSold -0.01 NA 0.00 0.07 0.00 0.01
## YrSold -0.02 NA -0.01 -0.03 0.04 -0.01
## SalePrice -0.08 NA 0.26 0.79 -0.08 0.52
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## MSSubClass 0.04 NA -0.07 -0.07 -0.14
## LotFrontage NA NA NA NA NA
## LotArea 0.01 NA 0.21 0.11 0.00
## OverallQual 0.55 NA 0.24 -0.06 0.31
## OverallCond 0.07 NA -0.05 0.04 -0.14
## YearBuilt 0.59 NA 0.25 -0.05 0.15
## YearRemodAdd 1.00 NA 0.13 -0.07 0.18
## MasVnrArea NA 1 NA NA NA
## BsmtFinSF1 0.13 NA 1.00 -0.05 -0.50
## BsmtFinSF2 -0.07 NA -0.05 1.00 -0.21
## BsmtUnfSF 0.18 NA -0.50 -0.21 1.00
## TotalBsmtSF 0.29 NA 0.52 0.10 0.42
## X1stFlrSF 0.24 NA 0.45 0.10 0.32
## X2ndFlrSF 0.14 NA -0.14 -0.10 0.00
## LowQualFinSF -0.06 NA -0.06 0.01 0.03
## GrLivArea 0.29 NA 0.21 -0.01 0.24
## BsmtFullBath 0.12 NA 0.65 0.16 -0.42
## BsmtHalfBath -0.01 NA 0.07 0.07 -0.10
## FullBath 0.44 NA 0.06 -0.08 0.29
## HalfBath 0.18 NA 0.00 -0.03 -0.04
## BedroomAbvGr -0.04 NA -0.11 -0.02 0.17
## KitchenAbvGr -0.15 NA -0.08 -0.04 0.03
## TotRmsAbvGrd 0.19 NA 0.04 -0.04 0.25
## Fireplaces 0.11 NA 0.26 0.05 0.05
## GarageYrBlt NA NA NA NA NA
## GarageCars 0.42 NA 0.22 -0.04 0.21
## GarageArea 0.37 NA 0.30 -0.02 0.18
## WoodDeckSF 0.21 NA 0.20 0.07 -0.01
## OpenPorchSF 0.23 NA 0.11 0.00 0.13
## EnclosedPorch -0.19 NA -0.10 0.04 0.00
## X3SsnPorch 0.05 NA 0.03 -0.03 0.02
## ScreenPorch -0.04 NA 0.06 0.09 -0.01
## PoolArea 0.01 NA 0.14 0.04 -0.04
## MiscVal -0.01 NA 0.00 0.00 -0.02
## MoSold 0.02 NA -0.02 -0.02 0.03
## YrSold 0.04 NA 0.01 0.03 -0.04
## SalePrice 0.51 NA 0.39 -0.01 0.21
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## MSSubClass -0.24 -0.25 0.31 0.05 0.07
## LotFrontage NA NA NA NA NA
## LotArea 0.26 0.30 0.05 0.00 0.26
## OverallQual 0.54 0.48 0.30 -0.03 0.59
## OverallCond -0.17 -0.14 0.03 0.03 -0.08
## YearBuilt 0.39 0.28 0.01 -0.18 0.20
## YearRemodAdd 0.29 0.24 0.14 -0.06 0.29
## MasVnrArea NA NA NA NA NA
## BsmtFinSF1 0.52 0.45 -0.14 -0.06 0.21
## BsmtFinSF2 0.10 0.10 -0.10 0.01 -0.01
## BsmtUnfSF 0.42 0.32 0.00 0.03 0.24
## TotalBsmtSF 1.00 0.82 -0.17 -0.03 0.45
## X1stFlrSF 0.82 1.00 -0.20 -0.01 0.57
## X2ndFlrSF -0.17 -0.20 1.00 0.06 0.69
## LowQualFinSF -0.03 -0.01 0.06 1.00 0.13
## GrLivArea 0.45 0.57 0.69 0.13 1.00
## BsmtFullBath 0.31 0.24 -0.17 -0.05 0.03
## BsmtHalfBath 0.00 0.00 -0.02 -0.01 -0.02
## FullBath 0.32 0.38 0.42 0.00 0.63
## HalfBath -0.05 -0.12 0.61 -0.03 0.42
## BedroomAbvGr 0.05 0.13 0.50 0.11 0.52
## KitchenAbvGr -0.07 0.07 0.06 0.01 0.10
## TotRmsAbvGrd 0.29 0.41 0.62 0.13 0.83
## Fireplaces 0.34 0.41 0.19 -0.02 0.46
## GarageYrBlt NA NA NA NA NA
## GarageCars 0.43 0.44 0.18 -0.09 0.47
## GarageArea 0.49 0.49 0.14 -0.07 0.47
## WoodDeckSF 0.23 0.24 0.09 -0.03 0.25
## OpenPorchSF 0.25 0.21 0.21 0.02 0.33
## EnclosedPorch -0.10 -0.07 0.06 0.06 0.01
## X3SsnPorch 0.04 0.06 -0.02 0.00 0.02
## ScreenPorch 0.08 0.09 0.04 0.03 0.10
## PoolArea 0.13 0.13 0.08 0.06 0.17
## MiscVal -0.02 -0.02 0.02 0.00 0.00
## MoSold 0.01 0.03 0.04 -0.02 0.05
## YrSold -0.01 -0.01 -0.03 -0.03 -0.04
## SalePrice 0.61 0.61 0.32 -0.03 0.71
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## MSSubClass 0.00 0.00 0.13 0.18 -0.02
## LotFrontage NA NA NA NA NA
## LotArea 0.16 0.05 0.13 0.01 0.12
## OverallQual 0.11 -0.04 0.55 0.27 0.10
## OverallCond -0.05 0.12 -0.19 -0.06 0.01
## YearBuilt 0.19 -0.04 0.47 0.24 -0.07
## YearRemodAdd 0.12 -0.01 0.44 0.18 -0.04
## MasVnrArea NA NA NA NA NA
## BsmtFinSF1 0.65 0.07 0.06 0.00 -0.11
## BsmtFinSF2 0.16 0.07 -0.08 -0.03 -0.02
## BsmtUnfSF -0.42 -0.10 0.29 -0.04 0.17
## TotalBsmtSF 0.31 0.00 0.32 -0.05 0.05
## X1stFlrSF 0.24 0.00 0.38 -0.12 0.13
## X2ndFlrSF -0.17 -0.02 0.42 0.61 0.50
## LowQualFinSF -0.05 -0.01 0.00 -0.03 0.11
## GrLivArea 0.03 -0.02 0.63 0.42 0.52
## BsmtFullBath 1.00 -0.15 -0.06 -0.03 -0.15
## BsmtHalfBath -0.15 1.00 -0.05 -0.01 0.05
## FullBath -0.06 -0.05 1.00 0.14 0.36
## HalfBath -0.03 -0.01 0.14 1.00 0.23
## BedroomAbvGr -0.15 0.05 0.36 0.23 1.00
## KitchenAbvGr -0.04 -0.04 0.13 -0.07 0.20
## TotRmsAbvGrd -0.05 -0.02 0.55 0.34 0.68
## Fireplaces 0.14 0.03 0.24 0.20 0.11
## GarageYrBlt NA NA NA NA NA
## GarageCars 0.13 -0.02 0.47 0.22 0.09
## GarageArea 0.18 -0.02 0.41 0.16 0.07
## WoodDeckSF 0.18 0.04 0.19 0.11 0.05
## OpenPorchSF 0.07 -0.03 0.26 0.20 0.09
## EnclosedPorch -0.05 -0.01 -0.12 -0.10 0.04
## X3SsnPorch 0.00 0.04 0.04 0.00 -0.02
## ScreenPorch 0.02 0.03 -0.01 0.07 0.04
## PoolArea 0.07 0.02 0.05 0.02 0.07
## MiscVal -0.02 -0.01 -0.01 0.00 0.01
## MoSold -0.03 0.03 0.06 -0.01 0.05
## YrSold 0.07 -0.05 -0.02 -0.01 -0.04
## SalePrice 0.23 -0.02 0.56 0.28 0.17
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## MSSubClass 0.28 0.04 -0.05 NA -0.04
## LotFrontage NA NA NA NA NA
## LotArea -0.02 0.19 0.27 NA 0.15
## OverallQual -0.18 0.43 0.40 NA 0.60
## OverallCond -0.09 -0.06 -0.02 NA -0.19
## YearBuilt -0.17 0.10 0.15 NA 0.54
## YearRemodAdd -0.15 0.19 0.11 NA 0.42
## MasVnrArea NA NA NA NA NA
## BsmtFinSF1 -0.08 0.04 0.26 NA 0.22
## BsmtFinSF2 -0.04 -0.04 0.05 NA -0.04
## BsmtUnfSF 0.03 0.25 0.05 NA 0.21
## TotalBsmtSF -0.07 0.29 0.34 NA 0.43
## X1stFlrSF 0.07 0.41 0.41 NA 0.44
## X2ndFlrSF 0.06 0.62 0.19 NA 0.18
## LowQualFinSF 0.01 0.13 -0.02 NA -0.09
## GrLivArea 0.10 0.83 0.46 NA 0.47
## BsmtFullBath -0.04 -0.05 0.14 NA 0.13
## BsmtHalfBath -0.04 -0.02 0.03 NA -0.02
## FullBath 0.13 0.55 0.24 NA 0.47
## HalfBath -0.07 0.34 0.20 NA 0.22
## BedroomAbvGr 0.20 0.68 0.11 NA 0.09
## KitchenAbvGr 1.00 0.26 -0.12 NA -0.05
## TotRmsAbvGrd 0.26 1.00 0.33 NA 0.36
## Fireplaces -0.12 0.33 1.00 NA 0.30
## GarageYrBlt NA NA NA 1 NA
## GarageCars -0.05 0.36 0.30 NA 1.00
## GarageArea -0.06 0.34 0.27 NA 0.88
## WoodDeckSF -0.09 0.17 0.20 NA 0.23
## OpenPorchSF -0.07 0.23 0.17 NA 0.21
## EnclosedPorch 0.04 0.00 -0.02 NA -0.15
## X3SsnPorch -0.02 -0.01 0.01 NA 0.04
## ScreenPorch -0.05 0.06 0.18 NA 0.05
## PoolArea -0.01 0.08 0.10 NA 0.02
## MiscVal 0.06 0.02 0.00 NA -0.04
## MoSold 0.03 0.04 0.05 NA 0.04
## YrSold 0.03 -0.03 -0.02 NA -0.04
## SalePrice -0.14 0.53 0.47 NA 0.64
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## MSSubClass -0.10 -0.01 -0.01 -0.01 -0.04
## LotFrontage NA NA NA NA NA
## LotArea 0.18 0.17 0.08 -0.02 0.02
## OverallQual 0.56 0.24 0.31 -0.11 0.03
## OverallCond -0.15 0.00 -0.03 0.07 0.03
## YearBuilt 0.48 0.22 0.19 -0.39 0.03
## YearRemodAdd 0.37 0.21 0.23 -0.19 0.05
## MasVnrArea NA NA NA NA NA
## BsmtFinSF1 0.30 0.20 0.11 -0.10 0.03
## BsmtFinSF2 -0.02 0.07 0.00 0.04 -0.03
## BsmtUnfSF 0.18 -0.01 0.13 0.00 0.02
## TotalBsmtSF 0.49 0.23 0.25 -0.10 0.04
## X1stFlrSF 0.49 0.24 0.21 -0.07 0.06
## X2ndFlrSF 0.14 0.09 0.21 0.06 -0.02
## LowQualFinSF -0.07 -0.03 0.02 0.06 0.00
## GrLivArea 0.47 0.25 0.33 0.01 0.02
## BsmtFullBath 0.18 0.18 0.07 -0.05 0.00
## BsmtHalfBath -0.02 0.04 -0.03 -0.01 0.04
## FullBath 0.41 0.19 0.26 -0.12 0.04
## HalfBath 0.16 0.11 0.20 -0.10 0.00
## BedroomAbvGr 0.07 0.05 0.09 0.04 -0.02
## KitchenAbvGr -0.06 -0.09 -0.07 0.04 -0.02
## TotRmsAbvGrd 0.34 0.17 0.23 0.00 -0.01
## Fireplaces 0.27 0.20 0.17 -0.02 0.01
## GarageYrBlt NA NA NA NA NA
## GarageCars 0.88 0.23 0.21 -0.15 0.04
## GarageArea 1.00 0.22 0.24 -0.12 0.04
## WoodDeckSF 0.22 1.00 0.06 -0.13 -0.03
## OpenPorchSF 0.24 0.06 1.00 -0.09 -0.01
## EnclosedPorch -0.12 -0.13 -0.09 1.00 -0.04
## X3SsnPorch 0.04 -0.03 -0.01 -0.04 1.00
## ScreenPorch 0.05 -0.07 0.07 -0.08 -0.03
## PoolArea 0.06 0.07 0.06 0.05 -0.01
## MiscVal -0.03 -0.01 -0.02 0.02 0.00
## MoSold 0.03 0.02 0.07 -0.03 0.03
## YrSold -0.03 0.02 -0.06 -0.01 0.02
## SalePrice 0.62 0.32 0.32 -0.13 0.04
## ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
## MSSubClass -0.03 0.01 -0.01 -0.01 -0.02 -0.08
## LotFrontage NA NA NA NA NA NA
## LotArea 0.04 0.08 0.04 0.00 -0.01 0.26
## OverallQual 0.06 0.07 -0.03 0.07 -0.03 0.79
## OverallCond 0.05 0.00 0.07 0.00 0.04 -0.08
## YearBuilt -0.05 0.00 -0.03 0.01 -0.01 0.52
## YearRemodAdd -0.04 0.01 -0.01 0.02 0.04 0.51
## MasVnrArea NA NA NA NA NA NA
## BsmtFinSF1 0.06 0.14 0.00 -0.02 0.01 0.39
## BsmtFinSF2 0.09 0.04 0.00 -0.02 0.03 -0.01
## BsmtUnfSF -0.01 -0.04 -0.02 0.03 -0.04 0.21
## TotalBsmtSF 0.08 0.13 -0.02 0.01 -0.01 0.61
## X1stFlrSF 0.09 0.13 -0.02 0.03 -0.01 0.61
## X2ndFlrSF 0.04 0.08 0.02 0.04 -0.03 0.32
## LowQualFinSF 0.03 0.06 0.00 -0.02 -0.03 -0.03
## GrLivArea 0.10 0.17 0.00 0.05 -0.04 0.71
## BsmtFullBath 0.02 0.07 -0.02 -0.03 0.07 0.23
## BsmtHalfBath 0.03 0.02 -0.01 0.03 -0.05 -0.02
## FullBath -0.01 0.05 -0.01 0.06 -0.02 0.56
## HalfBath 0.07 0.02 0.00 -0.01 -0.01 0.28
## BedroomAbvGr 0.04 0.07 0.01 0.05 -0.04 0.17
## KitchenAbvGr -0.05 -0.01 0.06 0.03 0.03 -0.14
## TotRmsAbvGrd 0.06 0.08 0.02 0.04 -0.03 0.53
## Fireplaces 0.18 0.10 0.00 0.05 -0.02 0.47
## GarageYrBlt NA NA NA NA NA NA
## GarageCars 0.05 0.02 -0.04 0.04 -0.04 0.64
## GarageArea 0.05 0.06 -0.03 0.03 -0.03 0.62
## WoodDeckSF -0.07 0.07 -0.01 0.02 0.02 0.32
## OpenPorchSF 0.07 0.06 -0.02 0.07 -0.06 0.32
## EnclosedPorch -0.08 0.05 0.02 -0.03 -0.01 -0.13
## X3SsnPorch -0.03 -0.01 0.00 0.03 0.02 0.04
## ScreenPorch 1.00 0.05 0.03 0.02 0.01 0.11
## PoolArea 0.05 1.00 0.03 -0.03 -0.06 0.09
## MiscVal 0.03 0.03 1.00 -0.01 0.00 -0.02
## MoSold 0.02 -0.03 -0.01 1.00 -0.15 0.05
## YrSold 0.01 -0.06 0.00 -0.15 1.00 -0.03
## SalePrice 0.11 0.09 -0.02 0.05 -0.03 1.00
# the correlation between SalePrice and the other variables
correlations <- cor(numeric_values)
correlations_with_saleprice <- correlations["SalePrice", ]
sorted_correlations_with_saleprice <- sort(correlations_with_saleprice, decreasing = TRUE)
as.data.frame(sorted_correlations_with_saleprice)

## sorted_correlations_with_saleprice
## SalePrice 1.00000000
## OverallQual 0.79098160
## GrLivArea 0.70862448
## GarageCars 0.64040920
## GarageArea 0.62343144
## TotalBsmtSF 0.61358055
## X1stFlrSF 0.60585218
## FullBath 0.56066376
## TotRmsAbvGrd 0.53372316
## YearBuilt 0.52289733
## YearRemodAdd 0.50710097
## Fireplaces 0.46692884
## BsmtFinSF1 0.38641981
## WoodDeckSF 0.32441344
## X2ndFlrSF 0.31933380
## OpenPorchSF 0.31585623
## HalfBath 0.28410768
## LotArea 0.26384335
## BsmtFullBath 0.22712223
## BsmtUnfSF 0.21447911
## BedroomAbvGr 0.16821315
## ScreenPorch 0.11144657
## PoolArea 0.09240355
## MoSold 0.04643225
## X3SsnPorch 0.04458367
## BsmtFinSF2 -0.01137812
## BsmtHalfBath -0.01684415
## MiscVal -0.02118958
## LowQualFinSF -0.02560613
## YrSold -0.02892259
## OverallCond -0.07785589
## MSSubClass -0.08428414
## EnclosedPorch -0.12857796
## KitchenAbvGr -0.13590737
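The full printed correlation matrix above is hard to scan. As an alternative view, the corrplot package loaded earlier can render it graphically; this is a sketch, not part of the original output.

# Sketch: visualize the numeric correlations; pairwise.complete.obs avoids
# the NA entries caused by columns that still contain missing values
corrplot(cor(numeric_values, use = "pairwise.complete.obs"),
         method = "color", tl.cex = 0.6)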
Box Plot
# Find the categorical variables
#categorical_vars <- sapply(df_hou, function(x) is.character(x))
#cat("Categorical variables:\n")
#names(df_hou[categorical_vars])
predictors <- c("MSZoning", "Street","Alley", "LotShape", "LandContour", "Utilities",
"LotConfig", "LandSlope", "Neighborhood", "Condition1", "Condition2", "BldgType",
"HouseStyle", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType",
"ExterQual", "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure",
"BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC", "CentralAir", "Electrical",
"KitchenQual", "Functional", "FireplaceQu", "GarageType", "GarageFinish","GarageQual",
"GarageCond", "PavedDrive", "PoolQC", "Fence", "MiscFeature", "SaleType",
"SaleCondition"
)
data <- df_hou[c(predictors, "SalePrice")]
# Create box plots for each predictor
for (p in predictors) {
  boxplot(reformulate(p, response = "SalePrice"), data = data,
          main = p, xlab = p, ylab = "SalePrice",
          col = "steelblue")
}

t-test
continuous_variables <- names(select_if(df_hou, is.numeric))
# run a for loop through continuous variables and perform t-tests
for (var in continuous_variables) {
print(paste0("T-test for association between SalePrice and ", var, ":"))
print(t.test(df_hou[[var]], df_hou$SalePrice))
print("_____________________________________________________________________")
}## [1] "T-test for association between SalePrice and MSSubClass:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.991, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184942.7 -176785.9
## sample estimates:
## mean of x mean of y
## 56.89726 180921.19589
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and LotFrontage:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.985, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184929.5 -176772.8
## sample estimates:
## mean of x mean of y
## 70.04996 180921.19589
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and LotArea:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -81.321, df = 1505.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -174514.7 -166294.1
## sample estimates:
## mean of x mean of y
## 10516.83 180921.20
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and OverallQual:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184993.5 -176836.7
## sample estimates:
## mean of x mean of y
## 6.099315e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and OverallCond:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184994.0 -176837.3
## sample estimates:
## mean of x mean of y
## 5.575342e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and YearBuilt:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.071, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183028.3 -174871.6
## sample estimates:
## mean of x mean of y
## 1971.268 180921.196
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and YearRemodAdd:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.064, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183014.7 -174858.0
## sample estimates:
## mean of x mean of y
## 1984.866 180921.196
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and MasVnrArea:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.969, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184895.9 -176739.1
## sample estimates:
## mean of x mean of y
## 103.6853 180921.1959
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtFinSF1:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.804, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184556.0 -176399.1
## sample estimates:
## mean of x mean of y
## 443.6397 180921.1959
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtFinSF2:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.996, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184953.0 -176796.3
## sample estimates:
## mean of x mean of y
## 46.54932 180921.19589
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtUnfSF:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.745, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184432.4 -176275.5
## sample estimates:
## mean of x mean of y
## 567.2404 180921.1959
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and TotalBsmtSF:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.509, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183942.2 -175785.3
## sample estimates:
## mean of x mean of y
## 1057.429 180921.196
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and X1stFlrSF:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.459, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183837.0 -175680.2
## sample estimates:
## mean of x mean of y
## 1162.627 180921.196
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and X2ndFlrSF:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.851, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184652.6 -176495.8
## sample estimates:
## mean of x mean of y
## 346.9925 180921.1959
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and LowQualFinSF:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184993.7 -176837.0
## sample estimates:
## mean of x mean of y
## 5.844521e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and GrLivArea:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.288, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183484.2 -175327.3
## sample estimates:
## mean of x mean of y
## 1515.464 180921.196
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtFullBath:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.019, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184999.1 -176842.4
## sample estimates:
## mean of x mean of y
## 4.253425e-01 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BsmtHalfBath:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.019, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184999.5 -176842.8
## sample estimates:
## mean of x mean of y
## 5.753425e-02 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and FullBath:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184998.0 -176841.3
## sample estimates:
## mean of x mean of y
## 1.565068e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and HalfBath:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.019, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184999.2 -176842.5
## sample estimates:
## mean of x mean of y
## 3.828767e-01 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and BedroomAbvGr:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.017, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184996.7 -176840.0
## sample estimates:
## mean of x mean of y
## 2.866438e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and KitchenAbvGr:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184998.5 -176841.8
## sample estimates:
## mean of x mean of y
## 1.046575e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and TotRmsAbvGrd:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184993.0 -176836.3
## sample estimates:
## mean of x mean of y
## 6.517808e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and Fireplaces:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184998.9 -176842.2
## sample estimates:
## mean of x mean of y
## 6.130137e-01 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and GarageYrBlt:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.067, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183021.0 -174864.3
## sample estimates:
## mean of x mean of y
## 1978.506 180921.196
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and GarageCars:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184997.8 -176841.1
## sample estimates:
## mean of x mean of y
## 1.767123e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and GarageArea:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.791, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184526.6 -176369.8
## sample estimates:
## mean of x mean of y
## 472.9801 180921.1959
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and WoodDeckSF:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.973, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184905.3 -176748.6
## sample estimates:
## mean of x mean of y
## 94.24452 180921.19589
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and OpenPorchSF:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.996, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184952.9 -176796.2
## sample estimates:
## mean of x mean of y
## 46.66027 180921.19589
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and EnclosedPorch:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.008, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184977.6 -176820.9
## sample estimates:
## mean of x mean of y
## 21.95411 180921.19589
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and X3SsnPorch:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.017, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184996.1 -176839.4
## sample estimates:
## mean of x mean of y
## 3.409589e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and ScreenPorch:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.012, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184984.5 -176827.8
## sample estimates:
## mean of x mean of y
## 15.06096 180921.19589
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and PoolArea:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.017, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184996.8 -176840.1
## sample estimates:
## mean of x mean of y
## 2.758904e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and MiscVal:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.996, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184956.1 -176799.3
## sample estimates:
## mean of x mean of y
## 43.48904 180921.19589
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and MoSold:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -87.016, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184993.2 -176836.5
## sample estimates:
## mean of x mean of y
## 6.321918e+00 1.809212e+05
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and YrSold:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = -86.053, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -182991.7 -174835.0
## sample estimates:
## mean of x mean of y
## 2007.816 180921.196
##
## [1] "_____________________________________________________________________"
## [1] "T-test for association between SalePrice and SalePrice:"
##
## Welch Two Sample t-test
##
## data: df_hou[[var]] and df_hou$SalePrice
## t = 0, df = 2918, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5765.271 5765.271
## sample estimates:
## mean of x mean of y
## 180921.2 180921.2
##
## [1] "_____________________________________________________________________"
Remove the columns that originally had more than 80% NA values: Alley, PoolQC, Fence, and MiscFeature.
df_hou <- df_hou %>%
select(-Alley, -PoolQC, -Fence, -MiscFeature)

EnclosedPorch and KitchenAbvGr have a negative correlation with SalePrice.
Based on the correlation coefficients, box plots, and hypothesis tests, we can identify variables that show no association or only a very weak association with SalePrice and remove them from the dataset:
BedroomAbvGr 0.16821315
ScreenPorch 0.11144657
PoolArea 0.09240355
MoSold 0.04643225
X3SsnPorch 0.04458367
BsmtFinSF2 -0.01137812
BsmtHalfBath -0.01684415
MiscVal -0.02118958
LowQualFinSF -0.02560613
YrSold -0.02892259
OverallCond -0.07785589
MSSubClass -0.08428414
df_hou <- df_hou %>%
select(-BedroomAbvGr, -ScreenPorch, -PoolArea, -MoSold, -X3SsnPorch, -BsmtFinSF2, -BsmtHalfBath,
-MiscVal, -LowQualFinSF, -YrSold, -OverallCond, -MSSubClass)

Next, remove the categorical variables whose box plots show little difference in the median SalePrice across groups:
df_hou <- df_hou %>%
select(-Street, -Utilities, -LandSlope, -BldgType)

Dealing with Missing Values:
If a large percentage of rows still have missing values, omitting those rows will cause a loss of information, so we will instead try to impute the missing values. Several methods for missing-data imputation exist. The simplest imputation is to replace the missing values with the mean (or mode) of their columns. Another method is to use the other columns to predict the column with missing values. This is called multiple imputation, and it assumes that a missing value in one attribute can be predicted from the data in the other attributes. The simplest form of multiple imputation is knn imputation: we find the k complete data points closest to the data point with the missing value (the incomplete row) and take their average to fill in the missing value.
To avoid data leakage, the nearest neighbors must be computed based only on the training data instead of the entire dataset. In other words, to impute the missing values in training, validation, and test data we should only use the training data to find the nearest neighbors.
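For illustration, here is the same idea expressed with caret’s preProcess function outside of train. This is a sketch only; it uses the hou_train and hos_test splits created in question 8 below.

# Sketch: fit the imputation on the training data only, then apply it to both sets
pp <- preProcess(hou_train, method = "knnImpute")  # learns centering/scaling and neighbors from training data
train_imputed <- predict(pp, hou_train)
test_imputed  <- predict(pp, hos_test)             # test NAs are filled using training-based neighbors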
Data leakage is when information outside of the training data is used in creating a machine learning model. This causes the model to overfit and not generalize well to future data.
When doing cross validation, the imputation must be computed based only on the training data of each fold (excluding the validation data). Luckily, caret’s train method in R streamlines this process: you can use the option preProc=”knnImpute” to do knn imputation and the option na.action=na.pass to allow the NA values to be passed to the model to be imputed. The snippet below is a minimal sketch of this pattern; the glmnet method and fold count are placeholders matching the models trained later:
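# A minimal sketch of the knnImpute + na.pass pattern
# (hou_train is the training split created in question 8 below; the model settings are placeholders)
fit <- train(SalePrice ~ ., data = hou_train,
             method = "glmnet",
             trControl = trainControl(method = "cv", number = 10),
             preProc = "knnImpute",   # impute numeric NAs within the training folds only
             na.action = na.pass)     # pass NAs through to the pre-processor instead of dropping rows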
knnImpute scales and centers the numeric features and uses Euclidean distance to compute the nearest neighbors. Since Euclidean distance is not meaningful for categorical features, knnImpute ignores categorical features if they are present in the dataset and does not impute them.
8. (2 pt) Examine the columns with missing values to see if any of them are categorical. Use caret’s createDataPartition method to partition the dataset into 80% training and 20% testing. If a categorical column has missing values in the train or test data, impute it with the mode of that column in the training data. It is important that the mode is computed based only on the training data (instead of the entire dataset) to avoid data leakage.
set.seed(101)
# Create data partition
inTrain <- createDataPartition(df_hou$SalePrice, p = 0.8, list = FALSE)
# Create training and test sets
hou_train <- df_hou[inTrain, ]
hos_test <- df_hou[-inTrain, ]
dim(hou_train)

## [1] 1169 60

dim(hos_test)

## [1] 291 60
Imputation on categorical variables
# Identify categorical columns
categorical_cols <- sapply(hou_train, is.character)
# Exclude SalePrice from the selection (it is numeric, so this is just a safeguard)
categorical_cols["SalePrice"] <- FALSE
# Get the names of the categorical columns
categorical_col_names <- names(hou_train)[categorical_cols]

# Loop through each categorical column
for (col in categorical_col_names)
{
  # Compute the mode of the column using the training data only
  mode_val <- names(which.max(table(hou_train[[col]])))
  # Replace the 'notApplicable' placeholders (the former NAs) in the training data with the mode
  hou_train[[col]] <- ifelse(hou_train[[col]] == 'notApplicable', mode_val, hou_train[[col]])
  # Replace them in the test data with the training-data mode as well
  hos_test[[col]] <- ifelse(hos_test[[col]] == 'notApplicable', mode_val, hos_test[[col]])
}

Imputation on numerical variables
The variables that still have NAs in the dataset are LotFrontage, MasVnrArea, and GarageYrBlt.
# # Identify numerical columns
# numerical_cols <- sapply(hou_train, is.numeric)
#
# # Remove SalePrice from the list of numerical columns
# numerical_cols["SalePrice"] <- FALSE
#
# # Get the names of the numerical columns
# numerical_col_names <- names(hou_train)[numerical_cols]

Eliminating variables with little to no variance:
This dataset has several variables with a handful of unique values that occur with very low frequency. For instance, if you take a summary of the “Street” variable, 99.5% of the samples have a “Pave” street while only about 0.5% have “gravel”. The concern here is that when we split the data for cross validation, the training samples may all have “paved” streets while some of the validation samples have “gravel” streets, which creates a discrepancy between the training and validation sets. In addition, a handful of unique values may have an undue influence on the model. Therefore, it is better to identify and eliminate these variables with near-zero variance.
Luckily, caret’s train method has a preprocessing option preProc=“nzv” that can be used during cross validation to remove variables with zero or near-zero variance in the training set.
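To see ahead of time which variables this filter would remove, caret’s nearZeroVar function can be applied to the training data; a quick sketch:

# Sketch: inspect which training columns caret flags as near-zero variance
nzv_info <- nearZeroVar(hou_train, saveMetrics = TRUE)
rownames(nzv_info)[nzv_info$nzv]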
Use preProc=c(“knnImpute”, “nzv”) inside caret’s train method in the models you create in the next section to eliminate variables with near zero variance and to impute missing values in numeric variables using knnImpute.
After cleaning and exploring data, we are ready to build our machine learning models to predict the SalePrice of a house based on other variables. We are going to examine four categories of models: Regularized linear regression, Tree-based Ensemble models, SVM, and neural networks with drop out.
8. (2pt) set.seed(1) and train a Lasso linear regression model using “glmnet” and “caret” as explained in the lectures to predict SalePrice. Use 10 fold cross validation and tune the lambda parameter. Note: You do not need to worry about scaling your test or train data; glmnet will automatically do it for you.
Use preProc=c(”knnImpute”,”nzv”) and na.action=na.pass options inside the train method to let caret impute the missing values using knn based on the training data during cross validation.
# Define the pre-processing steps to be applied to the training data during cross-validation
preProc <- c("knnImpute", "nzv")# For reproducibility
set.seed(1)
hos_train2 <- hou_train
hos_train2[ ,c('Condition2','RoofMatl','Exterior1st','Exterior2nd','ExterCond','Heating')] <- list(NULL)
# Define the Lasso Linear Regression model using "glmnet"
lassoModel <- train(SalePrice ~ .,
data = hos_train2,
method = "glmnet",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 0.1, by = 0.001)),
preProcess = preProc,
na.action = na.pass)

9. (1pt) Get the coefficients for the best tuned model. Did Lasso shrink some of the coefficients to zero? If so, what does this mean?
best_model <- lassoModel$finalModel
coef(best_model, s = best_model$lambdaOpt)

## 83 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 181213.23213
## MSZoningRL 6037.80097
## MSZoningRM 447.49950
## LotFrontage -1228.45567
## LotArea 3574.54378
## LotShapeReg -1293.45747
## LandContourLvl 3487.77176
## LotConfigCulDSac 3411.20583
## LotConfigInside -131.57185
## NeighborhoodCollgCr -2484.45046
## NeighborhoodEdwards -5759.72742
## NeighborhoodGilbert -2377.47456
## NeighborhoodNAmes -4253.77835
## NeighborhoodNridgHt 4123.65924
## NeighborhoodNWAmes -1412.75848
## NeighborhoodOldTown -3380.05338
## NeighborhoodSawyer -2280.33770
## NeighborhoodSomerst 3472.04807
## Condition1Feedr -429.60977
## Condition1Norm 5241.78741
## HouseStyle1Story 6708.47009
## HouseStyle2Story -5071.92175
## OverallQual 17022.54966
## YearBuilt -33.18685
## YearRemodAdd 4371.81654
## RoofStyleGable -333.78041
## RoofStyleHip 1560.21368
## MasVnrTypeBrkFace 3677.72839
## MasVnrTypeNone 8972.71159
## MasVnrTypeStone 3152.23818
## MasVnrArea 5024.31085
## ExterQualGd -5148.22751
## ExterQualTA -8352.47975
## FoundationCBlock 2928.93992
## FoundationPConc 3977.75693
## BsmtQualGd -9762.94474
## BsmtQualTA -7774.04347
## BsmtCondTA 1702.48507
## BsmtExposureGd 6122.86643
## BsmtExposureMn -1099.44587
## BsmtExposureNo -3201.57767
## BsmtFinType1BLQ 65.34630
## BsmtFinType1GLQ 1604.16899
## BsmtFinType1LwQ -1686.59823
## BsmtFinType1Rec 291.21540
## BsmtFinType1Unf -5006.57768
## BsmtFinSF1 -1533.27179
## BsmtFinType2Unf 442.59767
## BsmtUnfSF -405.87746
## TotalBsmtSF 407.68124
## HeatingQCGd -892.65534
## HeatingQCTA -2819.04358
## CentralAirY 2661.95417
## ElectricalSBrkr -134.01028
## X1stFlrSF 19131.51275
## X2ndFlrSF 27545.87423
## GrLivArea .
## BsmtFullBath 2590.10393
## FullBath 1197.65582
## HalfBath 1702.62957
## KitchenQualGd -9876.88075
## KitchenQualTA -10097.00644
## TotRmsAbvGrd 1832.82578
## FunctionalTyp 4053.50463
## Fireplaces 3709.68463
## FireplaceQuGd -1178.40242
## FireplaceQunotApplicable 267.36778
## FireplaceQuTA -449.41343
## GarageTypeAttchd 2710.89911
## GarageTypeBuiltIn 1438.39661
## GarageTypeDetchd 2427.77691
## GarageYrBlt -3675.83613
## GarageFinishRFn -1780.38079
## GarageFinishUnf -1647.57985
## GarageCars 8169.33093
## GarageArea 1773.37407
## PavedDriveY 769.25695
## WoodDeckSF 2155.52155
## OpenPorchSF -129.93880
## SaleTypeNew 9425.67731
## SaleTypeWD 345.23932
## SaleConditionNormal 1053.83154
## SaleConditionPartial -4631.84451
Yes, lasso shrank some of the coefficients exactly to zero; for example, GrLivArea appears as "." (zero) in the output above. A zero coefficient means the corresponding predictor is effectively removed from the model and contributes nothing to its predictions. In this way lasso performs automatic feature selection: it keeps the most informative predictors and discards those that add little predictive power, which helps reduce overfitting and makes the model easier to interpret.
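As a quick check, the number of coefficients that lasso shrank exactly to zero can be counted directly from the fitted model. This is a minimal sketch reusing the lassoModel object from above; the names lasso_coefs and zeroed are introduced here only for illustration.
lasso_coefs <- coef(lassoModel$finalModel, s = lassoModel$finalModel$lambdaOpt)
# Names of the predictors whose coefficients were shrunk exactly to zero
zeroed <- rownames(lasso_coefs)[as.vector(lasso_coefs) == 0]
length(zeroed)  # how many predictors lasso dropped
zeroed          # e.g., GrLivArea shows "." (zero) in the output above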
10. (1pt) Get the predictions on the test data using the "predict" function with the option na.action=na.pass. This allows the NA values in the test data to be passed to the model and imputed using knn imputation based on the training data. Run install.packages("RANN") to install the RANN package if you get an error stating that you need it.
Compute the RMSE of the predictions.
Dropping the same columns from the test set keeps the train and test data with the same number of features, even when some levels of a categorical variable occur in the test or validation data but not in the training data.
# Get the predictions on the test data using the "predict" function with the "na.action=na.pass" option
hos_test2 <- hos_test
hos_test2[ ,c('SalePrice', 'Condition2','RoofMatl','Exterior1st','Exterior2nd','ExterCond','Heating')] <- list(NULL)
testPredictions_lasso <- predict(lassoModel, newdata = hos_test2, na.action = na.pass)
# Evaluate the performance of the model on the test data using MSE and RMSE
MSE_lasso <- mean((hos_test$SalePrice - testPredictions_lasso)^2)
RMSE_lasso <- sqrt(MSE_lasso)
RMSE_lasso
## [1] 28972.69
11. (1 pt) Set.seed(1) again and train a Ridge linear regression model using 10-fold cross validation, tune lambda as you did for lasso, and compute the RMSE of this model on the test data.
Use knn imputation similar to what you did for lasso.
set.seed(1)
# Define the Ridge Linear Regression model using "glmnet"
ridgeModel <- train(SalePrice ~ .,
data = hos_train2,
method = "glmnet",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 0.1, by = 0.001)),
preProcess = preProc,
na.action = na.pass)
# Get the predictions on the test data using the "predict" function with the "na.action=na.pass" option
testPredictions_ridge <- predict(ridgeModel, newdata = hos_test2, na.action = na.pass)
# Compute the RMSE of the Ridge Linear Regression model on the test data
RMSE_ridge <- sqrt(mean((hos_test$SalePrice - testPredictions_ridge)^2))
# View the RMSE
RMSE_ridge
## [1] 28858.33
12. (1 pt) Set.seed(1) again and train an Elastic net linear regression model using 10-fold cross validation; tune lambda as you did before, and tune alpha over a sequence of 11 values between 0 and 1, that is: 0, 0.1, 0.2, …, 1. Compute the RMSE of the tuned model on the test data. Use knn imputation similar to what you did for the two previous models.
set.seed(1)
# Define the Elastic Net Linear Regression model using "glmnet"
elasticNetModel <- train(SalePrice ~ .,
data = hos_train2,
method = "glmnet",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(alpha = seq(0, 1, by = 0.1), lambda = seq(0.001, 0.1, by = 0.001)),
preProcess = preProc,
na.action = na.pass)
# Get the predictions on the test data using the "predict" function with the "na.action=na.pass" option
testPredictions_elasticNet <- predict(elasticNetModel, newdata = hos_test2, na.action = na.pass)
# Compute the RMSE of the Elastic Net Linear Regression model on the test data
RMSE_elasticNet <- sqrt(mean((hos_test$SalePrice - testPredictions_elasticNet)^2))
# View the RMSE
RMSE_elasticNet
## [1] 28858.33
13. (2 pt) Set.seed(1) and use the caret package with the "rf" method to train a random forest model on the training data to predict the SalePrice. Impute the missing values using knn similar to what you did for the previous models. Use 10-fold cross validation and let caret auto-tune the model. Use the model to predict the SalePrice for the test data and compute RMSE. (Note: use importance=T in your train method so it computes the variable importance while building the model.) Be patient; this model may take a long time to train.
set.seed(1)
# Train a Random Forest model using the "caret" package
rfModel <- train(SalePrice ~ .,
data = hos_train2,
method = "rf",
trControl = trainControl(method = "cv", number = 10),
preProcess = preProc,
na.action = na.pass,
importance = TRUE)
# Get the predictions on the test data using the "predict" function with the "na.action=na.pass" option
testPredictions_rf <- predict(rfModel, newdata = hos_test2, na.action = na.pass)
# Compute the RMSE of the Random Forest model on the test data
RMSE_rf <- sqrt(mean((hos_test$SalePrice - testPredictions_rf)^2))
# View the RMSE
RMSE_rf
## [1] 27543.66
(1 pt) Use caret's varImp function to get the variable importance for the random forest model. Which variables were most predictive in the random forest model?
varImp(rfModel)
## rf variable importance
##
## only 20 most important variables shown (out of 82)
##
## Overall
## OverallQual 100.00
## GrLivArea 77.83
## TotalBsmtSF 32.58
## GarageArea 26.74
## YearBuilt 26.21
## X1stFlrSF 24.08
## GarageFinishUnf 23.44
## GarageCars 23.17
## GarageTypeDetchd 22.40
## MSZoningRM 21.77
## BsmtFinSF1 21.36
## BsmtQualGd 18.73
## YearRemodAdd 18.14
## LotArea 18.05
## X2ndFlrSF 18.01
## KitchenQualGd 17.36
## CentralAirY 16.67
## BsmtUnfSF 16.62
## GarageYrBlt 16.19
## ExterQualGd 15.63
Based on the variable importance scores above, the most predictive variables in the random forest model are OverallQual, GrLivArea, TotalBsmtSF, GarageArea, YearBuilt, and X1stFlrSF. These are intuitive drivers of sale price: overall material and finish quality, above-ground living area, basement and garage size, and the age of the house.
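As an optional visual complement, caret's plot method for varImp objects can chart the top predictors from the table above:
# Plot the 10 most important variables in the random forest model
plot(varImp(rfModel), top = 10)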
14. (1 pt) Set.seed(1) and use the caret package with the "gbm" method to train a Gradient Boosted Tree model on the training data. GBM needs minimal data preprocessing: you don't need to scale numeric features or encode the categorical variables. In addition, it can be trained directly on data with missing values without having to do imputation.
Use 10-fold cross validation and let caret auto-tune the model.
Use the model to predict the SalePrice for the test data and compute RMSE.
set.seed(1)
# Replace remaining NAs in integer columns with 0, since rows with missing values would otherwise be dropped during training
hos_train2 <- hos_train2 %>% mutate_if(is.integer, ~replace(., is.na(.), 0))
hos_test2 <- hos_test2 %>% mutate_if(is.integer, ~replace(., is.na(.), 0))
# Define the training control for 10-fold cross validation
train_control <- trainControl(method = "cv", number = 10)
# Train the Gradient Boosted Tree model using the "gbm" method
gbmModel <- train(SalePrice ~ ., data = hos_train2, method = "gbm", trControl = train_control, verbose = FALSE)
# Use the trained model to make predictions on the test data
gbmPred <- predict(gbmModel, newdata = hos_test2)
# Compute the RMSE of the Gradient Boosted Tree model on the test data
RMSE_gbm <- RMSE(gbmPred, hos_test$SalePrice)
RMSE_gbm
## [1] 29045.64
15. (1 pt) Set.seed(1) and use the caret package with the "svmLinear" method to train a support vector machine model on the training data. Use preProc="knnImpute" to impute the missing values and scale the data. Use 10-fold cross validation and let caret auto-tune the model. Explain what the hyper-parameter "C" is. Use the model to predict the SalePrice for the test data and compute RMSE.
set.seed(1)
# Define the pre-processing pipeline for imputing missing values and scaling data
pre_proc_2 <- c("knnImpute", "center", "scale")
# Define the training control for 10-fold cross validation
train_control <- trainControl(method = "cv", number = 10)
# Train the SVM model using the "svmLinear" method and auto-tune the hyper-parameter "C"
svmModel <- train(SalePrice ~ ., data = hos_train2, method = "svmLinear", trControl = train_control, preProcess = pre_proc_2, tuneLength = 10)
# Use the trained model to make predictions on the test data
svmPred <- predict(svmModel, newdata = hos_test2)
# Compute the RMSE of the SVM model on the test data
RMSE_svm <- RMSE(svmPred, hos_test$SalePrice)
RMSE_svm
## [1] 27030.55
In SVM, "C" is a hyper-parameter that controls the trade-off between maximizing the margin of the decision function (a simpler, lower-variance solution) and minimizing the error on the training data (a lower-bias solution). A large value of "C" penalizes training errors heavily, yielding a closer fit to the training data but a higher risk of overfitting; a small value of "C" tolerates more training error in exchange for a larger margin and better generalization. Since we are predicting a numeric SalePrice, caret fits support vector regression here, but "C" plays the same role. The optimal value of "C" depends on the dataset and is typically found by grid or random search. In this case, we let caret auto-tune "C" with tuneLength = 10, which tries 10 values of "C" and selects the best one based on cross-validation performance.
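For reference, an explicit grid of C values could be supplied to train instead of tuneLength; the values below are only an illustration, not the grid caret actually searched:
# Hypothetical explicit search grid for the cost parameter C
svmGrid <- expand.grid(C = c(0.01, 0.1, 1, 10, 100))
# svmModel <- train(SalePrice ~ ., data = hos_train2, method = "svmLinear",
#                   trControl = train_control, preProcess = pre_proc_2,
#                   tuneGrid = svmGrid)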
16. (1 pt) repeat the above steps but set train method to “svmRadial” to use radial basis function as kernel.
set.seed(1)
svmRadialFit <- train(SalePrice ~ ., data = hos_train2, method = "svmRadial", preProcess = c("knnImpute", "center", "scale"), trControl = trainControl(method = "cv", number = 10), tuneLength = 10)
svmRadialPred <- predict(svmRadialFit, newdata = hos_test2)
RMSE_svmRadial <- sqrt(mean((svmRadialPred - hos_test$SalePrice)^2))
RMSE_svmRadial
## [1] 30109.03
17. (2pt) Use “resamples” method to compare the cross validation RMSE of the seven models you created above (LASSO, RIDGE, elastic net, randomforest, gbm, svmlinear, and svmradial). In a sentence or two, interpret the results.
compare=resamples(list(Lasso=lassoModel,
Ridge= ridgeModel,
Enet=elasticNetModel,
Rf = rfModel,
GBM = gbmModel,
SVM = svmModel,
SVMRadial = svmRadialFit
))
summary(compare)
## 
## Call:
## summary.resamples(object = compare)
##
## Models: Lasso, Ridge, Enet, Rf, GBM, SVM, SVMRadial
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Lasso 17365.44 19537.12 20399.17 20738.48 21030.85 26414.82 0
## Ridge 16831.32 18420.89 19249.83 20029.57 20712.16 25158.80 0
## Enet 16831.32 18420.89 19249.83 20029.57 20712.16 25158.80 0
## Rf 15349.11 15728.54 17252.79 18036.91 19219.20 25801.48 0
## GBM 14012.30 15996.89 17330.83 17912.57 19626.93 24778.62 0
## SVM 13758.41 16036.98 32216.26 32622.13 49950.47 53213.40 0
## SVMRadial 14475.04 16804.38 33720.13 35629.89 55726.51 58944.42 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Lasso 24800.46 27076.01 28491.98 34543.55 33460.37 64030.17 0
## Ridge 24565.37 26192.31 27971.46 33927.82 34333.66 61676.30 0
## Enet 24565.37 26192.31 27971.46 33927.82 34333.66 61676.30 0
## Rf 22257.67 23238.84 27485.30 29037.03 29585.46 50186.95 0
## GBM 19318.17 23589.96 26656.57 28216.54 28276.35 46837.17 0
## SVM 18400.75 25491.87 61308.81 52759.27 74612.64 89153.88 0
## SVMRadial 19527.40 24416.80 48578.15 53067.67 81212.42 94951.58 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Lasso 0.5353258 0.8441237 0.8542847 0.8187877 0.8841227 0.9087009 0
## Ridge 0.5444124 0.8365572 0.8624062 0.8240850 0.8893835 0.9139098 0
## Enet 0.5444124 0.8365572 0.8624062 0.8240850 0.8893835 0.9139098 0
## Rf 0.7243770 0.8876044 0.8987582 0.8723577 0.9019227 0.9112096 0
## GBM 0.7591947 0.8662518 0.8928676 0.8763809 0.9148695 0.9329393 0
## SVM 0.5681408 0.7080022 0.7499640 0.7767257 0.8999275 0.9502109 0
## SVMRadial 0.4699145 0.5219629 0.6933872 0.7037527 0.9099775 0.9318635 0
tabularview_1 <- data.frame(
"Models" = c("Lasso","Ridge", "ElasticNet", "RandomForest","GBM","SVM","SVMRadial"),
"RMSE" = c(RMSE_lasso, RMSE_ridge, RMSE_elasticNet, RMSE_rf, RMSE_gbm, RMSE_svm, RMSE_svmRadial)
)
kableExtra::kable(tabularview_1) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options = "scale_down") %>% kableExtra::column_spec(1, bold = T)

| Models | RMSE |
|---|---|
| Lasso | 28972.69 |
| Ridge | 28858.33 |
| ElasticNet | 28858.33 |
| RandomForest | 27543.66 |
| GBM | 29045.64 |
| SVM | 27030.55 |
| SVMRadial | 30109.03 |
The "resamples" method compares the cross-validation performance of the seven models fold by fold. By mean CV RMSE, GBM (28216) and random forest (29037) performed best; ridge and elastic net (33928) and lasso (34544) were close behind; and the two SVM models, although competitive on some folds, were highly unstable across folds and had by far the worst mean CV RMSE (above 52000). On the held-out test set (table above), the linear SVM happened to achieve the lowest RMSE, followed by the random forest, while the radial SVM performed worst. Overall, the tree-based ensembles are the most consistently strong models, and the linear SVM's good test result should be weighed against its instability in cross-validation.
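An optional box-and-whisker plot of the resampling results makes the fold-to-fold instability of the SVM models easy to see:
# Visualize the distribution of cross-validation RMSE across the 10 folds
bwplot(compare, metric = "RMSE")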
18. Split the training data into train and validation sets (use 90% for training and 10% for validation).
19. Use knn imputation to impute the missing values in the train, validation, and test data based on the training data. To do this, use the preProcess function in caret with method="knnImpute" on the training data. This returns a preprocessing model which can be used to transform data using the predict function. Call the predict function as follows to do knn imputation on the train, test, and validation data based on the information in the training data (see the sketch below):
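A minimal sketch of this pattern (the data frame names training, validation, and test are placeholders for the sets created below):
# Fit the knn-imputation model on the training data only
preproc_model <- preProcess(training, method = "knnImpute")
# Apply the same model to all three sets, so imputation is based on the training data
train_imputed <- predict(preproc_model, training)
val_imputed <- predict(preproc_model, validation)
test_imputed <- predict(preproc_model, test)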
20. (3 pt) Neural networks cannot take factor variables, so you must convert your categorical variables to numbers before training your neural network model. One-hot encode the categorical variables using the one_hot function from the mltools package. Set dropUnusedLevels=FALSE to make sure that the train and test data will have the same number of features in case some levels of a categorical variable do not occur in the training data but occur in the test or validation data.
df_hou_2 <- df_hou
# Identify categorical columns
categorical_cols_2 <- sapply(df_hou_2, is.character)
# Remove SalePrice from the list of categorical columns
categorical_cols_2["SalePrice"] <- FALSE
# Get the names of the categorical columns
categorical_col_names_2 <- names(df_hou_2)[categorical_cols_2]
# Loop through each categorical column
for (col in categorical_col_names_2)
{
# Get the mode (most frequent level) of the column
mode_val_1 <- df_hou_2 %>%
select({{ col }}) %>%
summarise(mode = names(which.max(table(.))))
# Replace 'notApplicable' placeholder values with the mode of the column
df_hou_2[[col]] <- ifelse(df_hou_2[[col]] == 'notApplicable', mode_val_1$mode, df_hou_2[[col]])
}
# df_hou_2[ ,c('Condition2','RoofMatl','Exterior1st','Exterior2nd','ExterCond','Heating')] <- list(NULL)
df_hou_2 <- df_hou_2 %>% mutate_if(is.character, as.factor)
# one-hot encode categorical variables in train data
# One-hot encode the categorical variables, keeping unused levels
df_hou_2 <- as.data.frame(one_hot(as.data.table(df_hou_2), dropUnusedLevels = FALSE))
set.seed(101)
# Create data partition
inTrain_2 <- createDataPartition(df_hou_2$SalePrice, p = 0.8, list = FALSE)
# Create training and test sets
hou_train_2 <- df_hou_2[inTrain_2, ]
hos_test_2 <- df_hou_2[-inTrain_2, ]
# dim(hou_train_2)
# dim(hos_test_2)
# Split the data into training and validation sets
trainIndex <- createDataPartition(hou_train_2$SalePrice, p = 0.9, list = FALSE, times = 1)
training <- hou_train_2[trainIndex,]
validation <- hou_train_2[-trainIndex,]
# dim(training)
# dim(validation)
# Create a preprocessing model using knn imputation on the training data
preproc_model_1 <- preProcess(training, method = "knnImpute")
# Transform the train, validation, and test data using the same model, so all
# three sets are imputed based only on information in the training data
train_data_imputed <- predict(preproc_model_1, training)
val_data_imputed <- predict(preproc_model_1, validation)
test_data_imputed <- predict(preproc_model_1, hos_test_2)
# dim(train_data_imputed)
# dim(val_data_imputed)
# dim(test_data_imputed)
21. (2pt) Since we are not using caret to train the neural networks, we will have to manually remove variables with little or no variance. To identify these variables in your training data, use the method "nearZeroVar" from the caret package. This will return the indices of variables with little to no variance in the training data. Use these indices to remove these variables from the train, validation, and test data. You can refer to this page (https://topepo.github.io/caret/pre-processing.html#nzv , section 3.2) for an example of identifying and removing near-zero-variance variables.
# identify near-zero variance variables
nzv <- nearZeroVar(train_data_imputed, saveMetrics= TRUE)
#nzv_2 <- nearZeroVar(test_data_imputed, saveMetrics= TRUE)
# get indices of near-zero variance variables
nzv_indices <- which(nzv$nzv == TRUE)
#nzv_indices_2 <- which(nzv_2$nzv == TRUE)
# remove near-zero variance variables from train, validation, and test data
train <- train_data_imputed[, -nzv_indices]
validation <- val_data_imputed[, -nzv_indices]
test <- test_data_imputed[, -nzv_indices]
dim(train)
## [1] 1053 109
dim(validation)
## [1] 116 109
dim(test)
## [1] 291 109
22. (5 pt) Create a neural network model with at least two hidden layers to predict the SalePrice in 100K units. In other words, your target variable/labels should be SalePrice/100000. We scale down the sale price to keep the error gradients from getting too large during backpropagation. If gradients are too large, they can make the model unstable, and you end up with NaN for the training or validation loss.
Use the training and validation sets you created above. Add a dropout layer after each hidden layer to regularize your neural network model. Use the tfruns package to tune your hyper-parameters, including the dropout factors. You should include two flags for the dropout factors, one for each hidden layer. Display the table returned by tfruns.
X_train <- select(train, -c(SalePrice))
X_val <- select(validation, -c(SalePrice))
X_test <- select(test, -c(SalePrice))
X_train <- as.matrix(X_train)
X_val <- as.matrix(X_val)
X_test <- as.matrix(X_test)
# Scale down the SalePrice
y_train <- train$SalePrice/100000
y_val <- validation$SalePrice/100000
y_test <- test$SalePrice/100000
set.seed(1)
# Run the model
runs <- tuning_run("assign5_nn.R",
flags = list(
nodes = c(32, 64, 128),
learning_rate = c(0.01, 0.001, 0.0001),
batch_size=c(50,100,200),
epochs=c(30,50,100),
activation=c("relu","sigmoid","tanh")
),
sample = 0.02
)
## 
## > FLAGS <- flags(flag_numeric("nodes", 64), flag_numeric("batch_size",
## +     32), flag_string("activation", "relu"), flag_numeric("learning_rate",
## +  .... [TRUNCATED]
## 
## > model = keras_model_sequential()
## 
## > model %>% layer_dense(units = FLAGS$nodes, input_shape = dim(X_train)[2]) %>%
## +     layer_dropout(rate = 0.3) %>% layer_dense(units = 1)
## 
## > model %>% compile(optimizer = optimizer_adam(learning_rate = FLAGS$learning_rate),
## +     loss = "mse", metrics = list("mse"))
## 
## > model %>% fit(X_train, y_train, epochs = FLAGS$epochs,
## +     batch_size = FLAGS$batch_size, validation_data = list(X_val,
## +         y_val))
## (the identical script echo repeats for each of the remaining sampled tuning runs and is omitted here)
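Note that the echoed script above has a single hidden layer and a fixed dropout rate, whereas the question asks for two hidden layers with one dropout flag per layer. A sketch of an assign5_nn.R script meeting that requirement could look like the following (the flag names dropout1 and dropout2 and their default values are assumptions, not the script that produced the output above):
FLAGS <- flags(
flag_numeric("nodes", 64),
flag_numeric("dropout1", 0.3),  # dropout rate after the first hidden layer
flag_numeric("dropout2", 0.3),  # dropout rate after the second hidden layer
flag_numeric("learning_rate", 0.001),
flag_numeric("batch_size", 100),
flag_numeric("epochs", 50),
flag_string("activation", "relu")
)
model <- keras_model_sequential() %>%
layer_dense(units = FLAGS$nodes, activation = FLAGS$activation, input_shape = dim(X_train)[2]) %>%
layer_dropout(rate = FLAGS$dropout1) %>%
layer_dense(units = FLAGS$nodes, activation = FLAGS$activation) %>%
layer_dropout(rate = FLAGS$dropout2) %>%
layer_dense(units = 1)
model %>% compile(optimizer = optimizer_adam(learning_rate = FLAGS$learning_rate), loss = "mse", metrics = list("mse"))
model %>% fit(X_train, y_train, epochs = FLAGS$epochs, batch_size = FLAGS$batch_size, validation_data = list(X_val, y_val))
The flags list passed to tuning_run would then also include, for example, dropout1 = c(0.2, 0.3, 0.4) and dropout2 = c(0.2, 0.3, 0.4).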
23. (2 pts) Use view_run to look at your best model. Note that the best model is the model with the lowest validation loss. What hyper-parameter combination is used in your best model? Does your best model still overfit?
view_run(runs$run_dir[1])
Best model flags: nodes = 64, batch_size = 200, activation = tanh, learning_rate = 0.0001, epochs = 50.
The validation loss decreases continuously and tracks the training loss, which suggests the best model is not badly overfitting. Kindly look at the BestModel file.
24. (2 pt) Now that we tuned the hyperparameters, we don’t need the validation data anymore and we can use ALL of the training data for training. Use all of your training data ( that is, train + validation data) to train a model with the best combination of hyper-parameters you found in the previous step.
X_train_2 <- rbind(X_train, X_val)
y_train_2 <- c(y_train, y_val)
set.seed(1)
# Retrain the best model once again, this time on all of the training data (train + validation)
best_model <- keras_model_sequential()
best_model %>%
layer_dense(units = 64, activation = "tanh", input_shape = dim(X_train_2)[2]) %>%
layer_dropout(rate = 0.3) %>%  # same dropout rate used in the tuned script
layer_dense(units = 1)
best_model %>% compile(
optimizer = optimizer_adam(learning_rate = 0.0001),
loss = 'mse',
metrics = list('mse'))
history <- best_model %>% fit(
X_train_2, y_train_2, epochs = 50,  # fit on the combined train + validation data
batch_size = 200)
25. (2pt) Use your model above to predict the SalePrice in 100K units for the test data. To get RMSE in the original scale, multiply your predictions and test labels by 100000 before computing RMSE.
set.seed(100)
# Make predictions on the test set using the best model
y_pred <- predict(best_model, X_test) %>% as.vector()
# Reverse the 100000 transformation
y_pred_orig <- y_pred * 100000
y_test_orig <- y_test * 100000
# Compute the RMSE in the original scale
RMSE_nn <- sqrt(mean((y_test_orig - y_pred_orig)^2))
RMSE_nn
## [1] 32348.2
26. (1pt) Compare the RMSE of your lasso, ridge, elastic net, random forest, gbm, svm, and neural networks models on the test data. Which model did better on this dataset?
tabularview_1 <- data.frame(
"Models" = c("Lasso","Ridge", "ElasticNet", "RandomForest","GBM","SVM","SVMRadial", "NeuralNets"),
"RMSE" = c(RMSE_lasso, RMSE_ridge, RMSE_elasticNet, RMSE_rf, RMSE_gbm, RMSE_svm, RMSE_svmRadial,RMSE_nn)
)
kableExtra::kable(tabularview_1) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options = "scale_down") %>% kableExtra::column_spec(1, bold = T)

| Models | RMSE |
|---|---|
| Lasso | 28972.69 |
| Ridge | 28858.33 |
| ElasticNet | 28858.33 |
| RandomForest | 27543.66 |
| GBM | 29045.64 |
| SVM | 27030.55 |
| SVMRadial | 30109.03 |
| NeuralNets | 32348.20 |

On the test data, the linear SVM did best with the lowest RMSE (27030.55), followed by the random forest (27543.66); the neural network had the highest test RMSE (32348.20).