Prediction in research satisfies one of humankind’s greatest primal desires. We make a lot of predictions to see into the future and know what awaits us. Most scientists regularly use prediction in research as a fundamental of the scientific method, when they generate a hypothesis and predict what will happen. Machine learning models function similarly to humans. They use instances/logics they’ve seen before to forecast what will happen given a certain set of circumstances.
For this lesson, we will bulid a model to figure out how much a house will sell for?
The steps that we will follow are given below:
Load Package
Read Train and Test data
Combine Train and Test dataset
Data Visualization
Missing Value
Removing Skewed Variables
Build the model
Variable importance
Final Prediction
Calculate RMSE
Loading R packages used besides base R.
library(knitr)
library(ggplot2)
library(plyr)
library(dplyr)
library(corrplot)
library(caret)
library(gridExtra)
library(scales)
library(Rmisc)
library(ggrepel)
library(randomForest)
library(psych)
library(xgboost)
The first thing we want to do is familiarize ourselves with the data. By checking the dataset, we can have a holistic picture about the variables that are present in the dataset.
train.data <- read.csv(("Data Set/train.csv"), stringsAsFactors = F)
test.data <- read.csv(("Data Set/test.csv"), stringsAsFactors = F)
dim(train.data)
str(train.data)
dim(test.data)
str(test.data)
The housing train data set has 1460 rows and 81 features with the target feature “SalePrice”. The housing test data set has 1459 rows and 80 features with the target feature Sale Price.
#Count the number of columns that consists of text data
sum(sapply(train.data[,1:81], typeof) == "character")
## [1] 43
#Count the number of columns that consists of numerical data
sum(sapply(train.data[,1:81], typeof) == "integer")
## [1] 38
We have 43 columns that consist of character variable and 38 columns are numerical. The text data could be challenging to work with. For those that are numerical,let’s take a look at some descriptive statistics.
cat('Train data has', dim(train.data)[1], 'rows and', dim(train.data)[2], 'columns.')
## Train data has 1460 rows and 81 columns.
cat('Test data has', dim(test.data)[1], 'rows and', dim(test.data)[2], ' columns.')
## Test data has 1459 rows and 80 columns.
# The percentage of data missing in train
tr.miss <- (sum(is.na(train.data)) / (nrow(train.data) *ncol(train.data)))*100
cat("Missing value Percentage for Train Data is", tr.miss, '%')
## Missing value Percentage for Train Data is 5.889565 %
# The percentage of data missing in test
tst.miss <- (sum(is.na(test.data)) / (nrow(test.data) * ncol(test.data)))*100
cat("Missing value Percentage for Train Data is", tst.miss, '%')
## Missing value Percentage for Train Data is 5.997258 %
# Check for duplicated rows
cat("The number of duplicated rows are", nrow(train.data) - nrow(unique(train.data)))
## The number of duplicated rows are 0
We have 5.88% missing values in our train data set and 7.15% missing values in our test data set.
Since test data set has no “Saleprice” variable. We will create it and then combine.
test.data$SalePrice<-rep(NA,1459)
combined_data <-bind_rows(train.data,test.data)
all <- rbind(train.data,test.data)
dim(all)
## [1] 2919 81
After combining the data set, we can clearly see that there are 81 columns in our data set.
ggplot(data=all[!is.na(all$SalePrice),], aes(x=SalePrice)) +
geom_histogram(fill="blue", binwidth = 10000) +
scale_x_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
As we can see, the sale prices are right skewed. This was expected as few people can afford very expensive houses. We will keep this in mind, and take measures before modeling.
kable(as.array(summary(all$SalePrice)),
caption = 'Summary Statistics of Sales Price')
| Var1 | Freq |
|---|---|
| Min. | 34900.0 |
| 1st Qu. | 129975.0 |
| Median | 163000.0 |
| Mean | 180921.2 |
| 3rd Qu. | 214000.0 |
| Max. | 755000.0 |
| NA’s | 1459.0 |
The character variables need some work before we can use them. To get a feel for the dataset, I decided to first see which numeric variables have a high correlation with the SalePrice.
numericVars <- which(sapply(all, is.numeric)) #index vector numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on
all_numVar <- all[, numericVars]
cor_numVar <- cor(all_numVar, use="pairwise.complete.obs") #correlations of all numeric variables
#sort on decreasing correlations with SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[,'SalePrice'], decreasing = TRUE))
#select only high corelations
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")
We have 38 numerical variables in our data set.Altogether, there are 10 numeric variables with a correlation of at least 0.5 with SalePrice. All those correlations are positive. As from the correlation matrix plot, we can see that OverallQual, GrLivArea, GarageCars are highly correlated with the SalePrice. We have to keep in mind about multicollinearity. For example: the correlation between GarageCars and GarageArea is very high (0.89), and both have similar (high) correlations with SalePrice.
The other 6 six variables with a correlation higher than 0.5 with SalePrice are:
-TotalBsmtSF: Total square feet of basement area
-1stFlrSF: First Floor square feet
-FullBath: Full bathrooms above grade
-TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
-YearBuilt: Original construction date
-YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
Overall Quality has the highest correlation with SalePrice among the numeric variables (0.79). It rates the overall material and finish of the house on a scale from 1 (very poor) to 10 (very excellent).
ggplot(data=all[!is.na(all$SalePrice),], aes(x=factor(OverallQual), y=SalePrice))+
geom_boxplot(col='red') + labs(x='Overall Quality') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
The positive correlation is certainly there indeed, and seems to be a slightly upward curve. Regarding outliers, we do not see any extreme values. If there is a candidate to take out as an outlier later on, it seems to be the expensive house with grade 4.
The numeric variable with the second highest correlation with SalesPrice is the Above Grade Living Area. This make a lot of sense; big houses are generally more expensive.
ggplot(data=all[!is.na(all$SalePrice),], aes(x=GrLivArea, y=SalePrice))+
geom_point(col='red') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_text_repel(aes(label = ifelse(all$GrLivArea[!is.na(all$SalePrice)]>4500, rownames(all), '')))
## `geom_smooth()` using formula 'y ~ x'
Especially the two houses with really big living areas and low SalePrices seem outliers (houses 524 and 1299, see labels in graph). we will not take them out yet, as taking outliers can be dangerous. For instance, a low score on the Overall Quality could explain a low price. However, as we can see below, these two houses actually also score maximum points on Overall Quality. Therefore, we will keep houses 1299 and 524 in mind as prime candidates to take out as outliers.
#Looking at highly correlated values of outliers
a.outlr <- all[c(524, 1299), c('SalePrice', 'GrLivArea', 'OverallQual')]
kable(a.outlr)
| SalePrice | GrLivArea | OverallQual | |
|---|---|---|---|
| 524 | 184750 | 4676 | 10 |
| 1299 | 160000 | 5642 | 10 |
ggplot(all, aes(x = GarageCars, y = SalePrice)) +
geom_point(alpha = .5) +
geom_smooth()+
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
As we can see, if there in no garage for cars then the price is low. With the increase of GarageCars, SalePrice increases.
First of all, we would like to see which variables contain missing values.
NAcol <- which(colSums(is.na(all)) > 0)
sort(colSums(sapply(all[NAcol], is.na)), decreasing = TRUE)
## PoolQC MiscFeature Alley Fence SalePrice FireplaceQu
## 2909 2814 2721 2348 1459 1420
## LotFrontage GarageYrBlt GarageFinish GarageQual GarageCond GarageType
## 486 159 159 159 159 157
## BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1 MasVnrType
## 82 82 81 80 79 24
## MasVnrArea MSZoning Utilities BsmtFullBath BsmtHalfBath Functional
## 23 4 2 2 2 2
## Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
cat('There are', length(NAcol), 'columns with missing values')
## There are 35 columns with missing values
We can clearly see that there are 1459 NAs in SalePrice which is matched perfectly with the size of the test set. So, we have to fix NAs in 34 predictor variables.
We are going to fix the 34 predictors that have missing values in this area. We will go over them one by one, starting with the most common NAs and working our way down until we’ve repaired them all. If we come across a variable that is part of a group with other variables, we shall treat them as a unit. Pool, Garage, and Basement, for example, each have a number of variations.
We turned character variables into ordinal integers if there is evident ordinality, or into factors if levels are categories without ordinality, in addition to ensuring that NAs are handled. We’ll use one-hot encoding to transform these factors to numbers later (using the model.matrix function).
Pool Quality and the PoolArea variable
The PoolQC is the variable with most NAs.
kable(unique(all$PoolQC),
caption= 'Pool Quality Features')
| x |
|---|
| NA |
| Ex |
| Fa |
| Gd |
The description is as follows:
PoolQC: Pool quality,Ex:Excellent, Gd:Good,TA: Average/Typical,Fa:Fair, NA:No Pool
So, it is obvious that we need to just assign ‘No Pool’ to the NAs. Also, the high number of NAs makes sense as normally only a small proportion of houses have a pool.
all$PoolQC[is.na(all$PoolQC)] <- 'None'
It is also clear that we can label encode this variable as the values are ordinal. As there a multiple variables that use the same quality levels, we are going to create a vector that we can reuse later on.
Qualities <- c('None' = 0, 'Fa' = 1, 'Gd' = 2, 'Ex' = 3)
Now, we can use the function ‘revalue’ to do the work for me.
all$PoolQC<-as.integer(revalue(all$PoolQC, Qualities))
table(all$PoolQC)
##
## 0 1 2 3
## 2909 2 4 4
However, there is a second variable that relates to Pools. This is the PoolArea variable (in square feet). As you can see below, there are 3 houses without PoolQC. First, I checked if there was a clear relation between the PoolArea and the PoolQC. As I did not see a clear relation (bigger of smaller pools with better PoolQC), I am going to impute PoolQC values based on the Overall Quality of the houses (which is not very high for those 3 houses).
ql <- all[all$PoolArea>0 & all$PoolQC==0, c('PoolArea', 'PoolQC', 'OverallQual')]
kable(ql)
| PoolArea | PoolQC | OverallQual | |
|---|---|---|---|
| 2421 | 368 | 0 | 4 |
| 2504 | 444 | 0 | 6 |
| 2600 | 561 | 0 | 3 |
Within Miscellaneous Feature, there are 2814 NAs. As the values are not ordinal, we will convert MiscFeature into a factor. Values:
unique(all$MiscFeature)
## [1] NA "Shed" "Gar2" "Othr" "TenC"
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
all$MiscFeature[is.na(all$MiscFeature)] <- 'None'
all$MiscFeature <- as.factor(all$MiscFeature)
ggplot(all[!is.na(all$SalePrice),], aes(x=MiscFeature, y=SalePrice)) +
geom_bar(stat='summary', fun.y = "median", fill='blue') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
kable(table(all$MiscFeature),
caption= 'Table for Miscelenious Features',
col.names = c('Variable','Frequency'))
| Variable | Frequency |
|---|---|
| Gar2 | 5 |
| None | 2814 |
| Othr | 4 |
| Shed | 95 |
| TenC | 1 |
When looking at the frequencies, the variable seems irrelevant to me. Having a shed probably means ‘no Garage’, which would explain the lower sales price for Shed. Also, while it makes a lot of sense that a house with a Tennis court is expensive, there is only one house with a tennis court in the training set.
unique(all$Alley)
## [1] NA "Grvl" "Pave"
Within Alley, there are 2721 NAs. As the values are not ordinal, we will convert Alley into a factor.
Grvl Gravel
Pave Paved
NA No alley access
all$Alley[is.na(all$Alley)] <- 'None'
all$Alley <- as.factor(all$Alley)
ggplot(all[!is.na(all$SalePrice),], aes(x=Alley, y=SalePrice)) +
geom_bar(stat='summary', fun.y = "median", fill='blue')+
scale_y_continuous(breaks= seq(0, 200000, by=50000), labels = comma)
kable(table(all$Alley),
caption= 'Table for Alley variable',
col.names = c('Variable','Frequency'))
| Variable | Frequency |
|---|---|
| Grvl | 120 |
| None | 2721 |
| Pave | 78 |
We have seen that, there are 2348 NAs within Fence variable.
unique(all$Fence)
## [1] NA "MnPrv" "GdWo" "GdPrv" "MnWw"
Meaning
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
all$Fence[is.na(all$Fence)] <- 'None'
kable(table(all$Fence),
caption= 'Table for Fence Quality',
col.names = c('Type','Frequency'))
| Type | Frequency |
|---|---|
| GdPrv | 118 |
| GdWo | 112 |
| MnPrv | 329 |
| MnWw | 12 |
| None | 2348 |
# Summary Based on median
all[!is.na(all$SalePrice),] %>%
group_by(Fence) %>%
summarise(median = median(SalePrice), counts=n())
## # A tibble: 5 × 3
## Fence median counts
## <chr> <dbl> <int>
## 1 GdPrv 167500 59
## 2 GdWo 138750 54
## 3 MnPrv 137450 157
## 4 MnWw 130000 11
## 5 None 173000 1179
My conclusion is that the values do not seem ordinal (no fence is best). Therefore, I will convert Fence into a factor.
all$Fence <- as.factor(all$Fence)
Fireplace quality
all$FireplaceQu[is.na(all$FireplaceQu)] <- 'None'
all$FireplaceQu<-as.integer(revalue(all$FireplaceQu, Qualities))
table(all$FireplaceQu)
##
## 0 1 2 3
## 1420 74 744 43
Values:
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
Number of fireplaces
Fireplaces is an integer variable, and there are no missing values.
table(all$Fireplaces)
##
## 0 1 2 3 4
## 1420 1268 219 11 1
sum(table(all$Fireplaces))
## [1] 2919
###Lot variables
3 variables. One with 1 NA, and 2 complete variables.
LotFrontage: Linear feet of street connected to property
486 NAs. The most reasonable imputation seems to take the median per neigborhood.
ggplot(all[!is.na(all$LotFrontage),], aes(x=as.factor(Neighborhood), y=LotFrontage)) +
geom_bar(stat='summary', fun.y = "median", fill='blue') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Ignoring unknown parameters: fun.y
## No summary function supplied, defaulting to `mean_se()`
LotShape: General shape of property
No NAs. Values seem ordinal (Regular=best)
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
all$LotShape<-as.integer(revalue(all$LotShape, c('IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3)))
table(all$LotShape)
##
## 0 1 2 3
## 16 76 968 1859
sum(table(all$LotShape))
## [1] 2919
LotConfig: Lot configuration
No NAs. The values seemed possibly ordinal to me, but the visualization does not show this. Therefore, I will convert the variable into a factor.
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
ggplot(all[!is.na(all$SalePrice),], aes(x=as.factor(LotConfig), y=SalePrice)) +
geom_bar(stat='summary', fun.y = "median", fill='blue')+
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Ignoring unknown parameters: fun.y
## No summary function supplied, defaulting to `mean_se()`
all$LotConfig <- as.factor(all$LotConfig)
table(all$LotConfig)
##
## Corner CulDSac FR2 FR3 Inside
## 511 176 85 14 2133
sum(table(all$LotConfig))
## [1] 2919
GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond
GarageType: Garage location
The values do not seem ordinal, so I will convert into a factor.
2Types More than one type of garage Attchd Attached to home Basment Basement Garage BuiltIn Built-In (Garage part of house - typically has room above garage) CarPort Car Port Detchd Detached from home NA No Garage
all$GarageType[is.na(all$GarageType)] <- 'No Garage'
all$GarageType <- as.factor(all$GarageType)
table(all$GarageType)
##
## 2Types Attchd Basment BuiltIn CarPort Detchd No Garage
## 23 1723 36 186 15 779 157
GarageFinish: Interior finish of the garage
The values are ordinal.
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
all$GarageFinish[is.na(all$GarageFinish)] <- 'None'
Finish <- c('None'=0, 'Unf'=1, 'RFn'=2, 'Fin'=3)
all$GarageFinish<-as.integer(revalue(all$GarageFinish, Finish))
table(all$GarageFinish)
##
## 0 1 2 3
## 159 1230 811 719
As from the data we can see that there are 159 houses with no garage, 1230 houses with unfinished garage.
GarageQual: Garage quality
Another variable than can be made ordinal with the Qualities vector.
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
all$GarageQual[is.na(all$GarageQual)] <- 'None'
Ga.Quality <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
all$GarageQual<-as.integer(revalue(all$GarageQual, Ga.Quality))
table(all$GarageQual)
##
## 0 1 2 3 4 5
## 159 5 124 2604 24 3
GarageCond: Garage condition
Garage Condition is another categorical variable which can be converted to integer.
all$GarageCond[is.na(all$GarageCond)] <- 'None'
all$GarageCond<-as.integer(revalue(all$GarageCond, Qualities))
## Warning: NAs introduced by coercion
table(all$GarageCond)
##
## 0 1 2 3
## 159 74 15 3
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
These classes are coded as numbers, but really are categories.
str(all$MSSubClass)
## int [1:2919] 60 20 60 70 60 50 20 60 50 190 ...
all$MSSubClass <- as.factor(all$MSSubClass)
#revalue for better readability
all$MSSubClass<-revalue(all$MSSubClass, c('20'='1 story 1946+', '30'='1 story 1945-', '40'='1 story unf attic', '45'='1,5 story unf', '50'='1,5 story fin', '60'='2 story 1946+', '70'='2 story 1945-', '75'='2,5 story all ages', '80'='split/multi level', '85'='split foyer', '90'='duplex all style/age', '120'='1 story PUD 1946+', '150'='1,5 story PUD all', '160'='2 story PUD 1946+', '180'='PUD multilevel', '190'='2 family conversion'))
kable(table(all$MSSubClass))
| Var1 | Freq |
|---|---|
| 1 story 1946+ | 1079 |
| 1 story 1945- | 139 |
| 1 story unf attic | 6 |
| 1,5 story unf | 18 |
| 1,5 story fin | 287 |
| 2 story 1946+ | 575 |
| 2 story 1945- | 128 |
| 2,5 story all ages | 23 |
| split/multi level | 118 |
| split foyer | 48 |
| duplex all style/age | 109 |
| 1 story PUD 1946+ | 182 |
| 1,5 story PUD all | 1 |
| 2 story PUD 1946+ | 128 |
| PUD multilevel | 17 |
| 2 family conversion | 61 |
I have already visualized the relation between Overall Quality and SalePrice in my initial explorations, but I want to visualize the frequency distribution as well. As there are more quality measurements, I am taking the opportunity to bundle them in this section.
q1 <- ggplot(data=all, aes(x=as.factor(OverallQual))) +
geom_histogram(stat='count')
q2 <- ggplot(data=all, aes(x=as.factor(ExterQual))) +
geom_histogram(stat='count')
q3 <- ggplot(data=all, aes(x=as.factor(BsmtQual))) +
geom_histogram(stat='count')
q4 <- ggplot(data=all, aes(x=as.factor(KitchenQual))) +
geom_histogram(stat='count')
q5 <- ggplot(data=all, aes(x=as.factor(GarageQual))) +
geom_histogram(stat='count')
q6 <- ggplot(data=all, aes(x=as.factor(FireplaceQu))) +
geom_histogram(stat='count')
q7 <- ggplot(data=all, aes(x=as.factor(PoolQC))) +
geom_histogram(stat='count')
layout <- matrix(c(1,2,8,3,4,8,5,6,7),3,3,byrow=TRUE)
multiplot(q1, q2, q3, q4, q5, q6, q7, layout=layout)
Overall Quality is crucial, yet it is also more detailed than the other characteristics. External quality is important as well, but it has a strong link to overall quality (0.73). Kitchen quality appears to be something to bear in mind, given all residences have a kitchen and there is considerable variation in terms of content. The majority of garages have Q3 quality, which does not appear to make much of a difference. Fireplace Quality is on the list of essential variables and has a good association. The PoolQC is simply lacking (the 13 pools cannot even be seen on this scale).
The first visualization shows the median SalePrice by MSSubClass. The frequency (number of houses) of each MSSubClass in the train set is shown in the labels.
The histrogram shows the frequencies across all data. Most houses are relatively new, and have one or two stories.
ms1 <- ggplot(all[!is.na(all$SalePrice),], aes(x=MSSubClass, y=SalePrice)) +
geom_bar(stat='summary', fun.y = "median", fill='blue') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
ms2 <- ggplot(data=all, aes(x=MSSubClass)) +
geom_histogram(stat='count')+
geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(ms1, ms2)
## No summary function supplied, defaulting to `mean_se()`
###Garage variables
Several Garage variables have a high correlation with SalePrice, and are also in the top-20 list of the quick random forest. However, there is multicolinearity among them and I think that 7 garage variables is too many anyway.We feel that something like 3 variables should be sufficient (possibly GarageCars, GarageType, and a Quality measurement), but before I do any selection I am visualizing all of them in this section.
#correct error
all$GarageYrBlt[2593] <- 2007 #this must have been a typo. GarageYrBlt=2207, YearBuilt=2006, YearRemodAdd=2007.
g1 <- ggplot(data=all[all$GarageCars !=0,], aes(x=GarageYrBlt)) +
geom_histogram()
g2 <- ggplot(data=all, aes(x=as.factor(GarageCars))) +
geom_histogram(stat='count')
g3 <- ggplot(data= all, aes(x=GarageArea)) +
geom_density()
g4 <- ggplot(data=all, aes(x=as.factor(GarageCond))) +
geom_histogram(stat='count')
g5 <- ggplot(data=all, aes(x=GarageType)) +
geom_histogram(stat='count')
g6 <- ggplot(data=all, aes(x=as.factor(GarageQual))) +
geom_histogram(stat='count')
g7 <- ggplot(data=all, aes(x=as.factor(GarageFinish))) +
geom_histogram(stat='count')
layout <- matrix(c(1,5,5,2,3,8,6,4,7),3,3,byrow=TRUE)
multiplot(g1, g2, g3, g4, g5, g6, g7, layout=layout)
GarageCars and GarageArea are highly correlated. Here, GarageQual and GarageCond also seem highly correlated, and both are dominated by level =3.
There are 4 bathroom variables. Individually, these variables are not very important. However, I assume that I if I add them up into one predictor, this predictor is likely to become a strong one.
“A half-bath, also known as a powder room or guest bath, has only two of the four main bathroom components-typically a toilet and sink.” Consequently, I will also count the half bathrooms as half.
all$TotBathrooms <- all$FullBath + (all$HalfBath*0.5) + all$BsmtFullBath + (all$BsmtHalfBath*0.5)
As you can see in the first graph, there now seems to be a clear correlation (it’s 0.63). The frequency distribution of Bathrooms in all data is shown in the second graph.
tb1 <- ggplot(data=all[!is.na(all$SalePrice),], aes(x=as.factor(TotBathrooms), y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
tb2 <- ggplot(data=all, aes(x=as.factor(TotBathrooms))) +
geom_histogram(stat='count')
grid.arrange(tb1, tb2)
## `geom_smooth()` using formula 'y ~ x'