Predicting House Prices: Regression Techniques

Christian Thieme

11/28/2020

Introduction

This data comes from the following Kaggle competition, whose objective is to predict the sale prices of houses in Ames, Iowa. The dataset provided has 79 explanatory variables describing (almost) every aspect of the residential homes. A description of each column can be found here.

For this competition, I’ve been asked to build a model using multiple regression. Before building the model, however, I’ll do some exploratory data analysis (EDA) on the dataset, clean it, and do some feature engineering.

Loading Libraries and Data

I’ll load the necessary R libraries:

library(plyr) # load before tidyverse so plyr does not mask dplyr verbs such as mutate()
library(tidyverse)
library(scales)
library(corrplot)
library(moments) #skewness and kurtosis
library(gridExtra)
library(tidymodels)
library(vip)

Below, I’ll load the train and test CSVs into dataframes:

train <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Final/train.csv')
test <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Final/test.csv')

I’ll take a quick look at the first 10 variables and our dependent variable:

glimpse(train[,c(1:10,81)])
## Rows: 1,460
## Columns: 11
## $ Id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ MSSubClass  <dbl> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20...
## $ MSZoning    <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "...
## $ LotFrontage <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91,...
## $ LotArea     <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 61...
## $ Street      <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave",...
## $ Alley       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ LotShape    <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1",...
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl",...
## $ Utilities   <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllP...
## $ SalePrice   <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000,...

In looking at the output above, it looks like we have a good mix of both numeric and character data. Let’s take an initial look to see how many of each we have:

  • Numeric columns:
dim(train %>% select_if(is.numeric))[2]
## [1] 38
  • Character columns:
dim(train %>% select_if(is.character))[2]
## [1] 43

We’ll do some further investigation later, but it’s very probable that some of our numeric columns are actually categorical data and will need to be converted. Additionally, I assume that quite a few of the character columns are ordinal and will also need to be converted.
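As a rough sketch of what such a conversion might look like, the snippet below treats MSSubClass (a numeric code for the dwelling type) as a factor. The toy data frame and its values are made up for illustration; only the column names come from the dataset.

```r
# Toy data: MSSubClass is stored as a number but encodes a dwelling type
toy <- data.frame(
  MSSubClass = c(60, 20, 60, 70, 190),
  LotArea = c(8450, 9600, 11250, 9550, 7420)
)

# Converting to a factor stops a model from treating 190 as "more" than 20
toy$MSSubClass <- factor(toy$MSSubClass)

levels(toy$MSSubClass)
is.numeric(toy$MSSubClass)
```

Once the column is a factor, a regression fits one coefficient per level instead of a single slope across the codes.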

Now, let’s look at the dimensions of our train and test datasets:

Train:

dim(train)
## [1] 1460   81

Test:

dim(test)
## [1] 1459   80

Combined:

dim(train)[1] + dim(test)[1]
## [1] 2919

In looking at the above row counts, it looks like this data is split roughly in half: the test dataset has almost as many rows as the training dataset. Let’s now combine these datasets so we can apply the same transformations to both. To do this, we’ll need to add a ‘SalePrice’ column to our test dataset and a ‘dataset’ column to each that identifies where a row came from. Finally, we’ll append them together:

train <- train %>% 
  mutate(dataset = 'train')

test <- test %>% 
  mutate(SalePrice = NA,
    dataset = 'test')

homes <- rbind(train, test)
dim(homes)
## [1] 2919   82

The dimensions of our combined dataset match what we saw previously (considering the columns we just added).
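The point of the ‘dataset’ flag is that the two partitions can be recovered once the shared cleaning is done. Here is a minimal base R sketch of that round trip, using small made-up data frames (toy_train and toy_test are hypothetical stand-ins, not the real files):

```r
# Toy stand-ins for the real train/test frames (values are made up)
toy_train <- data.frame(Id = 1:3, LotArea = c(8450, 9600, 11250),
                        SalePrice = c(208500, 181500, 223500))
toy_test <- data.frame(Id = 4:5, LotArea = c(11622, 14267))

# Tag each frame, give test a placeholder response, then stack them
toy_train$dataset <- "train"
toy_test$SalePrice <- NA_real_
toy_test$dataset <- "test"
toy_homes <- rbind(toy_train, toy_test)

# ... shared cleaning and feature engineering would happen here ...

# Recover the original partitions from the flag column
train_back <- toy_homes[toy_homes$dataset == "train", ]
test_back <- toy_homes[toy_homes$dataset == "test", ]
```

Using NA_real_ rather than a bare NA keeps the placeholder column numeric, so it has the same type as the training response.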

Exploration and Preliminary Data Cleaning

Response Variable - SalePrice

summary(homes$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   34900  129975  163000  180921  214000  755000    1459

Let’s take a look at the distribution of our response variable, SalePrice:

eda_data <- homes %>% filter(dataset == 'train')

ggplot(eda_data) + 
  aes(x = SalePrice) + 
  geom_histogram(binwidth = 20000, fill = "lightsalmon2", color = "black") +
  scale_x_continuous(labels = comma, breaks = seq(0, 900000, by = 100000)) + 
  labs(title = "Histogram of SalePrice", y = "Count") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

Looking at the distribution above, we see what we would expect - a right skewed distribution, as most people can’t afford very expensive housing. Let’s take a look at the skewness and kurtosis of SalePrice:

moments::skewness(homes$SalePrice, na.rm = TRUE)
## [1] 1.880941

We can see above that the skewness of SalePrice is ~1.88. A normal distribution has a skewness of 0, so this indicates that the distribution is not symmetric. Because the value is positive, we know our distribution is right skewed, which matches what we see in our plot.

moments::kurtosis(homes$SalePrice, na.rm = TRUE)
## [1] 9.509812

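Our kurtosis value of ~9.51 is well above 3, the value a normal distribution has under the Pearson definition that moments::kurtosis reports. This tells us that we have some extreme values in the tails of our distribution, which you can see clearly in the Q-Q plot below.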

qqnorm(homes$SalePrice)
qqline(homes$SalePrice)
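The skewness and kurtosis figures above are ratios of central moments. As a rough illustration, the base R sketch below recomputes them by hand on made-up vectors. The helper names skew_by_hand and kurt_by_hand are hypothetical, and the formulas are the population (biased) moment definitions, which I believe is what the moments package uses.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skew_by_hand <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Pearson kurtosis: fourth central moment over variance^2 (normal = 3)
kurt_by_hand <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

symmetric <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 2, 3, 4, 50)  # one large value stretches the right tail

skew_by_hand(symmetric)
skew_by_hand(right_tail)
kurt_by_hand(right_tail) > kurt_by_hand(symmetric)
```

The symmetric vector has zero skewness, while the single large value pushes both the skewness and the kurtosis up, which is the same pattern a handful of very expensive homes produces in SalePrice.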

To reduce the skew and the heavy right tail, we’ll apply a log transformation to the SalePrice column and rerun the skewness and kurtosis calculations:

homes_fixed <- homes %>% 
  mutate(SalePrice = log(SalePrice))

moments::skewness(homes_fixed$SalePrice, na.rm = TRUE)
## [1] 0.1212104
moments::kurtosis(homes_fixed$SalePrice, na.rm = TRUE)
## [1] 3.802656

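As you can see above, we are doing much better: both values are now far closer to those of a normal distribution (skewness of 0, kurtosis of 3). Now let’s look at our Q-Q plot again: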

qqnorm(homes_fixed$SalePrice)
qqline(homes_fixed$SalePrice)

Much better! We don’t see nearly as many extreme values at the edges of our distribution. We have shown that our response variable can benefit from a log transformation. While we could make this change now, let’s wait and use the tidymodels package to do most of our preprocessing. More on this later. Stay tuned!
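One practical consequence of modelling the log of SalePrice is that predictions come out on the log scale and have to be converted back to dollars. A tiny base R sketch of the round trip, using three prices taken from the summary above:

```r
prices <- c(34900, 163000, 755000)  # min, median, and max from the summary above

log_prices <- log(prices)      # natural log, as in the transformation above
recovered <- exp(log_prices)   # exp() undoes the transformation

all.equal(recovered, prices)
```

The same exp() call would be applied to a model’s predictions before reporting them as prices.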

Now that we’ve briefly looked at our response variable, let’s move on and start looking at some of our independent variables.

Extent of Nulls

Before jumping into the analysis of our dataset, it’s good to get an understanding of how many null values are in each column so we know if we can move forward with visualizing and analyzing our data or if we need to do some processing in order to have the data tell its story:

colSums(is.na(homes))
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             4           486             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          2721             0             0             2 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             1             1 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##            24            23             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            81            82            82            79             1 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            80             1             1             1             0 
##     HeatingQC    CentralAir    Electrical      1stFlrSF      2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             2             2             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             1             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             2             0          1420           157           159 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##           159             1             1           159           159 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch     3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          2909          2348          2814 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             1             0 
##     SalePrice       dataset 
##          1459             0

While there are some columns with a significant amount of nulls, for the most part, it doesn’t look like null values are that pervasive in this dataset. We’ll refer back to these numbers as we walk through our analysis and cleaning.
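Raw counts are easier to judge as a share of rows. A small base R sketch of that view, run on a made-up three-column data frame rather than the full homes data:

```r
toy <- data.frame(
  Alley = c(NA, NA, "Grvl", NA),
  LotFrontage = c(65, NA, 68, 60),
  LotArea = c(8450, 9600, 11250, 9550)
)

# Share of missing values per column, as a percentage
na_pct <- colMeans(is.na(toy)) * 100

# Keep only columns with at least one NA, most affected first
na_pct <- sort(na_pct[na_pct > 0], decreasing = TRUE)
na_pct
```

Applied to homes, the same two lines would put columns such as PoolQC, MiscFeature, and Alley at the top of the list.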

A Starting Place

With roughly 80 features, it’s difficult to know where to start with our exploratory data analysis and cleaning. A quick way to get our bearings is to run a correlation plot over the current numeric features of the dataset and to start working through the features that are highly correlated with SalePrice. We’ll create a correlation plot and only keep the features whose absolute correlation with SalePrice is greater than 0.45 for our initial run.

numeric_cols <- eda_data %>% select_if(is.numeric)

# Columns that contain NAs (e.g. LotFrontage) produce NA correlations here
correlation <- cor(numeric_cols)

# Rank features by absolute correlation with SalePrice; sort() drops the NAs
sale_price_cor <- sort(abs(correlation[, 'SalePrice']), decreasing = TRUE)

# Keep the features above the 0.45 threshold (SalePrice itself comes first)
top_features <- names(sale_price_cor[sale_price_cor > 0.45])

filtered_cor_matrix <- correlation[top_features, top_features]

corrplot.mixed(filtered_cor_matrix, tl.col="black", tl.pos = "lt")

Looking at the above correlation plot, we can see that OverallQual has the highest correlation with SalePrice. Additionally, it looks like the next several features all have to do with square footage (size) or number of rooms or garages (quantity). I’ll take note of this, because we can probably create some new combination features out of these variables to tease out some additional predictive power while feature engineering. Toward the bottom, we see that the year the home was built or remodeled is also moderately correlated. Fireplaces is the last feature that met our threshold. We’ll have to investigate this feature to see if it is truly meaningful to SalePrice. Last, I can see that a few features are collinear with other features, such as GarageArea and GarageCars. We’ll want to address this once we get a little farther along.
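Collinear pairs can also be listed programmatically rather than read off the plot. The sketch below is one way to do it in base R. The data is simulated so that GarageArea tracks GarageCars, and the 0.8 cutoff is an arbitrary assumption, not a value used elsewhere in this analysis.

```r
set.seed(1)
n <- 200
garage_cars <- sample(0:3, n, replace = TRUE)
toy <- data.frame(
  GarageCars = garage_cars,
  GarageArea = garage_cars * 250 + rnorm(n, sd = 40),  # built to track GarageCars
  Fireplaces = sample(0:2, n, replace = TRUE)
)

cor_mat <- cor(toy)

# Look at each pair once by blanking the diagonal and lower triangle
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA
high <- which(abs(cor_mat) > 0.8, arr.ind = TRUE)

collinear_pairs <- data.frame(
  feature_1 = rownames(cor_mat)[high[, "row"]],
  feature_2 = colnames(cor_mat)[high[, "col"]],
  correlation = cor_mat[high]
)
collinear_pairs
```

For each flagged pair, a common choice is to keep whichever feature is more strongly correlated with the response.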

Overall Quality (OverallQual) and Other Condition Features

Overall Quality

OverallQual = Rates the overall material and finish of the house

As OverallQual (Overall Quality) was the most highly correlated variable with SalePrice, we’ll start our analysis here.

eda_data <- homes %>% 
  filter(dataset == "train")

qual_hist <- ggplot(eda_data) + 
  aes(x = as.factor(OverallQual)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Overall Quality', x= 'Quality Category') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey"))
    
qual_box <- ggplot(eda_data) + 
  aes(x = factor(OverallQual), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Overall Quality vs Sale Price', y = 'Sale Price', x= 'Quality Category') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )
  

gridExtra::grid.arrange(qual_hist, qual_box, nrow=1)

It looks like the majority of the homes are rated from average to very good in quality. In looking at the above boxplots, we can confirm what we saw in our correlation matrix: there is a strong relationship between these two variables. Additionally, it looks like there may be an outlier we’ll need to look at in category 4.

qual_four_outlier <- eda_data %>%
  filter(OverallQual == 4 & SalePrice > 250000)

qual_four_outlier %>% select(Id, LotArea, OverallQual, SalePrice)
## # A tibble: 1 x 4
##      Id LotArea OverallQual SalePrice
##   <dbl>   <dbl>       <dbl>     <dbl>
## 1   458   53227           4    256000

In comparing the above row to many of the other rows for houses with an OverallQual of 4, I believe this house has been misclassified as a 4. In looking through all of the other attributes, it actually has quite a few features that are upgraded compared to other homes, and it is rated at least typical/average on all other condition and quality metrics. I will remove this row during the cleaning stage, since OverallQual is the predictor most correlated with sale price.

Overall Condition

OverallCond = Rates the overall condition of the house

cond_hist <- ggplot(eda_data) + 
  aes(x = as.factor(OverallCond)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Overall Condition', x= 'Condition Category') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey"))
    
cond_box <- ggplot(eda_data) + 
  aes(x = factor(OverallCond), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Overall Condition vs Sale Price', y = 'Sale Price', x= 'Condition Category') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )
  
gridExtra::grid.arrange(cond_hist, cond_box, nrow=1)

It looks like there is quite a bit of difference between overall condition and overall quality. Overall condition measures the condition of the house, whereas overall quality measures its material and finish. The observed ranges differ as well: in the training data, overall quality goes up to 10 while overall condition only goes up to 9. It also looks like overall condition and overall quality are not strongly correlated.

cor(eda_data$OverallCond, eda_data$OverallQual)
## [1] -0.09193234

In looking at the relationship between sale price and overall condition, what we see here is somewhat surprising. There does appear to be a trend in categories 1-5. After 5, however, the trend does not continue, and it looks like additional improvements in the condition of the home above and beyond “average” do not affect the price of the home very much. What this tells us is that this factor is not, on its own, highly correlated with sale price. Also, in looking at the above plot, we may have an outlier in group 6.

eda_data %>% filter(OverallCond == 6  & SalePrice > 700000) %>% select(Neighborhood, KitchenQual, GrLivArea, ExterQual)
## # A tibble: 1 x 4
##   Neighborhood KitchenQual GrLivArea ExterQual
##   <chr>        <chr>           <dbl> <chr>    
## 1 NoRidge      Ex               4316 Ex

In looking at this row, it appears that the house is very large, is in a ritzy neighborhood, and has excellent exterior and kitchen quality. Other than that, it doesn’t appear to have anything unexpected that would make us call it an outlier.

Basement Quality

BsmtQual: Evaluates the height of the basement

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(BsmtQual)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Evaluation of Height of Basement', x= 'Evaluation Category') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(BsmtQual), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Evaluation of Height of Basement vs Sale Price', y = 'Sale Price', x= 'Evaluation Category') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

It looks like in every case where there is a basement, the evaluation is fair or higher. Additionally, we see a strong relationship between sale price and the basement height rating. We also note that there are some NAs in this feature. My hunch is that these NAs are actually homes without a basement and just need to be set to ‘No Basement’.

eda_data %>% 
  filter(is.na(BsmtQual)) %>%
  select(contains('BsmtFin')) %>% head()
## # A tibble: 6 x 4
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
##   <chr>             <dbl> <chr>             <dbl>
## 1 <NA>                  0 <NA>                  0
## 2 <NA>                  0 <NA>                  0
## 3 <NA>                  0 <NA>                  0
## 4 <NA>                  0 <NA>                  0
## 5 <NA>                  0 <NA>                  0
## 6 <NA>                  0 <NA>                  0

Our hunch proved to be correct. All of these NA values relate to homes with no basements. We’ll recode these during our data cleaning stage.
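A minimal sketch of that recoding in base R, assuming we only relabel an NA when the basement square footage is also zero. The toy values are made up; BsmtQual and TotalBsmtSF are the real column names.

```r
toy <- data.frame(
  BsmtQual = c("Gd", NA, "TA", NA, "Ex"),
  TotalBsmtSF = c(856, 0, 920, 0, 1686),
  stringsAsFactors = FALSE
)

# Only relabel an NA when the square footage confirms there is no basement
no_basement <- is.na(toy$BsmtQual) & toy$TotalBsmtSF == 0
toy$BsmtQual[no_basement] <- "No Basement"

table(toy$BsmtQual)
```

Guarding on the square footage means a genuinely missing rating for a home that does have a basement would stay NA instead of being silently relabeled.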

Basement Condition

BsmtCond: Evaluates the general condition of the basement

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(BsmtCond)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Condition of Basement', x= 'Condition Category') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(BsmtCond), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Condition of Basement vs Sale Price', y = 'Sale Price', x= 'Condition Category') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

We see something very similar to what we saw above. It looks like the same 37 records (which were homes with no basements) are null in this feature as well. Overall, we see that most homes have basements in fair or better condition. Additionally, we see a strong relationship between the condition of the basement and the sale price of the home.

Garage Quality

GarageQual = Garage quality

garhist <- ggplot(eda_data) + 
  aes(x = GarageQual) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Garage Quality', y = 'Count', x= 'Quality Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
  #  axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

garplot<- ggplot(eda_data) + 
  aes(x = reorder(factor(GarageQual),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Garage Quality vs Sale Price', y = 'Sale Price', x= 'Quality Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(garhist, garplot, nrow=1)

It appears that almost every home in our dataset has a “typical/average” garage. Based on the boxplot, it looks like there is some order to these categories. We’ll take note of this and use it in our data cleaning stage. Interestingly, of the 3 homes with an “excellent” garage, 2 did not sell for noticeably more. It turns out that one of these homes is a very large house on a very large lot with a 10 on overall quality, while the other two are fairly small homes with lower overall quality (so why a nice garage?), which may be why they have lower values. I think what we are seeing here is an issue with a small sample size for this group.
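One way to capture that ordering is an ordered factor. This is a short sketch on a made-up vector, using the quality codes from the data description (Po, Fa, TA, Gd, Ex):

```r
quality_levels <- c("Po", "Fa", "TA", "Gd", "Ex")  # worst to best

garage_qual <- c("TA", "Fa", "Gd", "TA", "Ex", "Po")

# An ordered factor preserves the ranking; as.integer() exposes it as 1..5
garage_qual_ord <- factor(garage_qual, levels = quality_levels, ordered = TRUE)
garage_qual_score <- as.integer(garage_qual_ord)

garage_qual_score
garage_qual_ord[1] < garage_qual_ord[3]
```

The integer scores assume equal spacing between levels, which is a modelling choice rather than something the data guarantees.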

Garage Condition

GarageCond = Garage Condition

garhist <- ggplot(eda_data) + 
  aes(x = GarageCond) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Garage Condition', y = 'Count', x= 'Condition Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

garplot <- ggplot(eda_data) + 
  aes(x = reorder(factor(GarageCond),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Garage Condition vs Sale Price', y = 'Sale Price', x= 'Condition Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(garhist, garplot, nrow=1)

Looking at the two plots above, it’s pretty apparent that this field is almost identical to GarageQual, with the main difference being that the “Excellent” group does not have the large outlier pulling up the IQR. This means at least one of those homes is rated differently on garage quality than on garage condition, which doesn’t seem correct. This plot makes me question whether the homes categorized as “excellent” have been labeled appropriately. We’ll keep this in mind as we continue our analysis and also note that we’ll probably drop garage quality in our data cleaning phase, since it’s largely redundant.
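A cross-tabulation is a quick way to check how often two ratings like these agree. The sketch below uses a made-up eight-row data frame, so the counts are illustrative only:

```r
toy <- data.frame(
  GarageQual = c("TA", "TA", "Fa", "TA", "Gd", "TA", "Ex", "TA"),
  GarageCond = c("TA", "TA", "Fa", "TA", "TA", "TA", "TA", "Fa"),
  stringsAsFactors = FALSE
)

# Rows are quality, columns are condition; agreement sits where labels match
agreement <- table(GarageQual = toy$GarageQual, GarageCond = toy$GarageCond)
agreement

# Share of homes where the two ratings are identical
match_rate <- mean(toy$GarageQual == toy$GarageCond)
match_rate
```

A match rate close to 1 on the real data would support dropping one of the two columns.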

Exterior Condition

ExterCond: Evaluates the present condition of the material on the exterior

garhist <- ggplot(eda_data) + 
  aes(x = ExterCond) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Exterior Condition', y = 'Count', x= 'Condition Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

garplot <- ggplot(eda_data) + 
  aes(x = reorder(factor(ExterCond),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Exterior Condition vs Sale Price', y = 'Sale Price', x= 'Condition Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(garhist, garplot, nrow=1)

As we’ve seen before, most homes have a “typical” exterior condition. In addition, looking at our boxplots, it looks like these are ordinal categorical values, so we’ll update that in our data prep stage. The piece that is a little concerning, however, is that the ‘Good (Gd)’ condition is a better rating than the ‘Typical (TA)’ condition, yet we don’t see that ordering in the boxplots. This may very well be because of the difference in sample size between the two groups, but it is still something we’ll have to keep an eye on. It also may be a case where the difference in exterior condition really only matters if it is less than typical. Here, perhaps a better variable to engineer would be whether the exterior condition is acceptable or not (a binary variable).
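A sketch of such a binary flag in base R, under the assumption that “acceptable” means typical (TA) or better. The input vector is made up and the name exter_cond_ok is hypothetical.

```r
exter_cond <- c("TA", "Gd", "Fa", "TA", "Po", "Ex")

# Assumption: "acceptable" means typical (TA) or better
acceptable_levels <- c("TA", "Gd", "Ex")
exter_cond_ok <- as.integer(exter_cond %in% acceptable_levels)

exter_cond_ok
```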

Exterior Quality

ExterQual: Evaluates the quality of the material on the exterior

garhist <- ggplot(eda_data) + 
  aes(x = ExterQual) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Exterior Quality', y = 'Count', x= 'Quality Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

garplot <- ggplot(eda_data) + 
  aes(x = reorder(factor(ExterQual),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Exterior Quality vs Sale Price', y = 'Sale Price', x= 'Quality Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(garhist, garplot, nrow=1)

The exterior quality does look like an ordinal categorical variable. What we see here is more promising than what we saw with the exterior condition variable, and it is what we would expect: as the exterior quality of the home improves, the sale price increases. I could see this reflecting a move from vinyl siding, to wood, to stucco and so on, which would definitely make a difference in the price of a house.

Sale Condition

SaleCondition: Condition of Sale

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(SaleCondition)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Sale Condition', x= 'Sale Condition') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(SaleCondition), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Sale Condition vs Sale Price', y = 'Sale Price', x= 'Sale Condition') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

As we’d expect, it looks like the majority of our homes have a normal sale condition. In looking at the boxplots, it may appear that there is some sense of order to the categories; however, in looking at the description for each, these do not appear to be ordinal. One is not necessarily better than another, except for perhaps “Abnormal”, which signifies a foreclosure or short sale. So it may make sense to collapse this into 3 categories as opposed to 6.
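The text above doesn’t settle which 3 categories to use, so the grouping below is purely a hypothetical example: keep Normal and Abnorml as they are and pool the remaining codes into “Other”. The input vector is made up.

```r
sale_condition <- c("Normal", "Abnorml", "Partial", "Family",
                    "Normal", "Alloca", "AdjLand")

# Hypothetical three-way grouping: keep Normal and Abnorml, pool everything else
sale_condition_3 <- ifelse(sale_condition %in% c("Normal", "Abnorml"),
                           sale_condition, "Other")

table(sale_condition_3)
```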

Condition 1

Condition1: Proximity to various conditions

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(Condition1)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Proximity to Various Conditions', x= 'Condition') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Condition1), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Proximity to Various Conditions vs Sale Price', y = 'Sale Price', x= 'Condition') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

It looks like the majority of homes have a normal Condition 1 value. Additionally, this variable doesn’t look strongly correlated with sale price. I’ll also note that even though the boxplot seems to show some semblance of order, these are not ordinal values. Some of these are actually positive conditions, such as being near a park or greenbelt, and some are negative. It probably makes sense to create a variable describing whether a condition is positive or negative.
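
A positive/negative variable of this kind could be sketched as below. The split of codes into positive (PosN, PosA) and negative (arterial street, feeder street, and the railroad codes) is my reading of the data description, and the name condition1_type and the toy vector are hypothetical.

```r
# Toy stand-in for eda_data$Condition1
condition1 <- c("Norm", "Feedr", "PosN", "Artery", "RRAe", "PosA", "RRNn")

# Assumed grouping of the codes from the data description
positive_codes <- c("PosN", "PosA")                                     # parks, greenbelts, etc.
negative_codes <- c("Artery", "Feedr", "RRNn", "RRAn", "RRNe", "RRAe")  # busy streets, railroads

# Label each home as Positive, Negative, or Normal
condition1_type <- ifelse(condition1 %in% positive_codes, "Positive",
                   ifelse(condition1 %in% negative_codes, "Negative", "Normal"))

table(condition1_type)
```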

Condition 2

Condition2: Proximity to various conditions (if more than one is present)

This variable will show ‘Normal’ unless a second proximity to another condition is present.

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(Condition2)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Proximity to Various Conditions if More Than One Is Present', x= 'Condition') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Condition2), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Proximity to Various Conditions vs Sale Price', y = 'Sale Price', x= 'Condition') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

Given this variable’s relationship to Condition1, it would make sense during feature engineering to create a variable that flags whether a home is close to more than one condition (positive or negative). It seems like homes near more than one negative condition may have a lower value, and homes near more than one positive condition a higher value (although, based on our first plot, a second condition is very rare), but let’s check:

eda_data %>%
  filter(Condition2 != 'Norm') %>%
  ggplot() + 
  aes(x = reorder(Condition2, SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Proximity to Various Conditions vs Sale Price', y = 'Sale Price', x= 'Condition') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

It does look like there is a relationship between positive/negative conditions and sale price. We’ll make sure to add a binary variable in our feature engineering section.
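
A minimal sketch of such a flag, assuming it simply marks homes whose Condition2 is anything other than 'Norm'. The column name MultiCondition and the toy rows are hypothetical.

```r
# Toy rows standing in for eda_data
homes <- data.frame(
  Condition1 = c("Norm", "Feedr", "PosN", "Artery"),
  Condition2 = c("Norm", "Norm", "PosA", "RRNn"),
  stringsAsFactors = FALSE
)

# 1 when a second (non-normal) condition is recorded, 0 otherwise
homes$MultiCondition <- as.integer(homes$Condition2 != "Norm")

homes
```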

Functional

Functional: Home functionality (Assume typical unless deductions are warranted)

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(Functional)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Home Functionality', x= 'Functionality') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Functional), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Home Functionality vs Sale Price', y = 'Sale Price', x= 'Functionality') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

As we’d expect, most homes have typical functionality when they are sold. What is interesting is the boxplot on the right. You would think that deductions (minor, moderate, major, and severe), and their severity, would play a large part in the sale price of a home; however, that isn’t what we see here. The only thing we can clearly identify is that homes with typical functionality, in general, have higher sale prices.

House Style

HouseStyle: Style of dwelling

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(HouseStyle)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'House Style', x= 'Style') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(HouseStyle), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'House Style vs Sale Price', y = 'Sale Price', x= 'Style') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

One- and two-story homes seem to be the majority, with one-and-a-half-story homes with the second level finished being the third most common. The ‘.5’ in these codes indicates a partial (half) upper story, either finished or unfinished. Some of the relationships here make sense, but others don’t. For example, why does 1.5Fin (a one-and-a-half-story home with the second level finished), in general, have lower sale prices than a one-story home? This variable clearly needs other variables (probably square footage) in order to tell the entire story, and it doesn’t look to be strongly correlated with sale price.

Above Ground Living Area (GrLivArea) and Other Size Features

GrLivArea

GrLivArea = Above ground living area

live_area_outliers <- eda_data %>%
  filter(GrLivArea > 4500)

abv_hist <- ggplot(eda_data) + 
  aes(x = GrLivArea) + 
  geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
  scale_y_continuous( breaks = seq(0, 200, by = 25)) + 
  labs(title = "Histogram of Above Ground Living Area", y = "Count") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

abv_scatter <- ggplot(eda_data) + 
  aes(x = GrLivArea, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'GrLivArea vs Sale Price', y = 'Sale Price', x= 'Square Feet', subtitle = "Above Ground Living Area in Square Feet") +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)

The distribution of above ground living area is right skewed, with the majority of homes having between 1,000 and 2,000 square feet. For the most part, we see what we would expect here: homes with more above ground square feet have higher sale prices. We do see what appear to be outliers out past 4,500 square feet with very low sale prices that we may want to remove later. Let’s take a look at these homes:

eda_data %>%
  filter(GrLivArea > 4500) %>%
  select(Id, LotArea, OverallQual, OverallCond, SalePrice)
## # A tibble: 2 x 5
##      Id LotArea OverallQual OverallCond SalePrice
##   <dbl>   <dbl>       <dbl>       <dbl>     <dbl>
## 1   524   40094          10           5    184750
## 2  1299   63887          10           5    160000

These two rows do indeed look like outliers (possibly even data entry errors?). Both homes have enormous lot sizes. What makes me think they may be data errors is that both have an overall quality rating of 10 (these are the two outliers you can see in the boxplot for category 10), but an overall condition rating of 5. That, in conjunction with the fact that their above ground living areas are significantly larger than those of most homes and that their sale prices are significantly lower than those of homes with smaller square footage, makes me think these are outliers or mistakes.
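
If we do decide to drop these homes later, one possible rule is sketched below. The 300,000 sale price cutoff is an assumption chosen only for illustration, and the data frame is a toy stand-in for eda_data; only the row for Id 1299 uses values reported in this document.

```r
# Toy rows; the first three are made up, the last uses the values reported for Id 1299
homes <- data.frame(
  Id        = c(1, 2, 3, 1299),
  GrLivArea = c(1500, 1200, 2600, 5642),
  SalePrice = c(200000, 150000, 320000, 160000)
)

# Assumed rule: very large homes that sold for unusually little are treated as outliers
is_outlier <- homes$GrLivArea > 4500 & homes$SalePrice < 300000

# Keep everything that is not flagged
homes_clean <- homes[!is_outlier, ]

homes_clean
```

Combining the size condition with a price condition avoids discarding large homes that sold at prices in line with their size.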

First Floor Square Feet

1stFlrSF = First Floor square feet

abv_grnd_outliers <- eda_data %>%
  filter(`1stFlrSF` > 4500)

abv_hist <- ggplot(eda_data) + 
  aes(x = `1stFlrSF`) + 
  geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
  scale_y_continuous( breaks = seq(0, 200, by = 25)) + 
  labs(title = "Histogram of First Floor Square Feet", y = "Count") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

abv_scatter <- ggplot(eda_data) + 
  aes(x = `1stFlrSF`, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
  geom_point(data = abv_grnd_outliers, aes(x = `1stFlrSF`, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'First Floor Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)

Similar to above ground living area, this feature has a right-skewed distribution and appears to have an outlier out past 4,000 square feet. This variable has a fairly strong positive correlation with sale price.

Second Floor Square Feet

2ndFlrSF = Second floor square feet

abv_hist <- ggplot(eda_data %>% filter(`2ndFlrSF` > 0)) + 
  aes(x = `2ndFlrSF`) + 
  geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
 # scale_y_continuous( breaks = seq(0, 200, by = 25)) + 
  labs(title = "Histogram of 2nd Floor Square Feet", y = "Count") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

abv_scatter <- ggplot(eda_data %>% filter(`2ndFlrSF` > 0)) + 
  aes(x = `2ndFlrSF`, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
 # geom_point(data = abv_grnd_outliers, aes(x = `2ndFlrSF`, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = '2nd Floor Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)

The distribution looks to be almost normal and it doesn’t appear that there are any significant outliers. Additionally, the square footage of the second floor looks to be strongly correlated with the sale price.

After looking at all of these variables, it appears that GrLivArea is the sum of 1stFlrSF, 2ndFlrSF, and LowQualFinSF. Because of that, I’d expect all of these variables to show some significant collinearity.
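
The suspected identity is easy to check directly. The sketch below uses a few made-up rows, with the floor columns renamed to FirstFlrSF and SecondFlrSF for convenience; on the real data the same comparison would use `1stFlrSF` and `2ndFlrSF` from eda_data.

```r
# Made-up rows for illustration only
homes <- data.frame(
  FirstFlrSF   = c(856, 1262, 920, 1000),
  SecondFlrSF  = c(854, 0, 866, 0),
  LowQualFinSF = c(0, 0, 0, 120),
  GrLivArea    = c(1710, 1262, 1786, 1120)
)

# Add the three components row by row
component_sum <- homes$FirstFlrSF + homes$SecondFlrSF + homes$LowQualFinSF

# TRUE only if every row satisfies GrLivArea = 1st floor + 2nd floor + low quality
all(component_sum == homes$GrLivArea)
```

If the identity holds on the full dataset, GrLivArea is an exact linear combination of the other three columns, so a linear model cannot estimate separate coefficients for all four at once.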

Low Quality Finished Square Feet (All Floors)

LowQualFinSF = Low quality finished square feet (all floors)

abv_hist <- ggplot(eda_data %>% filter(LowQualFinSF > 0)) + 
  aes(x = LowQualFinSF) + 
  geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
 # scale_y_continuous( breaks = seq(0, 200, by = 25)) + 
  labs(title = "Histogram of Low Quality Square Feet", y = "Count") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

abv_scatter <- ggplot(eda_data %>% filter(LowQualFinSF > 0)) + 
  aes(x = LowQualFinSF, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
 # geom_point(data = abv_grnd_outliers, aes(x = `2ndFlrSF`, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Low Quality Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)

It looks like there are not many homes with low quality finished square feet, and there doesn’t appear to be much of a relationship with sale price either.

Total Rooms Above Ground

TotRmsAbvGrd: Total rooms above ground (does not include bathrooms)

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(TotRmsAbvGrd)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Total Rooms Above Ground', x= 'Rooms') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = factor(TotRmsAbvGrd), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Total Rooms Above Ground vs Sale Price', y = 'Sale Price', x= 'Rooms') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

In looking at the above plots, it’s pretty clear that this feature is not talking just about bedrooms. This is a combination of bedrooms, living rooms, laundry rooms, etc., EXCEPT FOR bathrooms. I don’t see any extreme outliers. It looks like there is a fairly predictable relationship between the number of rooms in a house (probably strongly correlated with square footage) and sale price until we get past 11 rooms, at which point sale price tends to fall. This could be for multiple reasons; perhaps these homes are very old or in poor condition. There are also very few homes with over 10 rooms.

Bedrooms Above Ground

BedroomAbvGr: Bedrooms above ground (does NOT include basement bedrooms)

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(BedroomAbvGr)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Bedrooms Above Ground', x= 'Rooms') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = factor(BedroomAbvGr), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Bedrooms Above Ground vs Sale Price', y = 'Sale Price', x= 'Rooms') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

The first thing I notice is that there are homes with 0 bedrooms above ground… what does that mean? Oddly enough, as you can see in the boxplot these homes also have higher median sales prices than most other homes. Let’s take a look and see if we can identify what these are:

eda_data %>%
  filter(BedroomAbvGr == 0) %>% 
  select(`1stFlrSF`, BsmtFinSF1, BsmtFinType1, SalePrice)
## # A tibble: 6 x 4
##   `1stFlrSF` BsmtFinSF1 BsmtFinType1 SalePrice
##        <dbl>      <dbl> <chr>            <dbl>
## 1       1842       1810 GLQ             385000
## 2       1593       1153 GLQ             286000
## 3       1056       1056 GLQ             144000
## 4       1258       1198 GLQ             108959
## 5        960        648 GLQ             145000
## 6       1332       1258 GLQ             260000

It looks like there are homes with no bedrooms above ground, where the bedrooms are all in the basement! These homes have first floor square footage, but no bedrooms above the basement. This looks like a somewhat sought-after (and rare) feature, because these homes have a higher median sale price than even homes with 5 above ground bedrooms. It is entirely possible that these homes have quite a few rooms in the basement, but there is no variable for basement rooms, so we’ll just have to rely on the square footage. Overall, this variable does not look strongly correlated with sale price.
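
With only six homes in this group, it is worth checking the median and the spread directly before reading too much into it. The prices below are copied from the table above; the variable name zero_bed_prices is just for illustration.

```r
# Sale prices of the six homes with BedroomAbvGr == 0, copied from the table above
zero_bed_prices <- c(385000, 286000, 144000, 108959, 145000, 260000)

# Median of the group: the average of the two middle values, 145000 and 260000
zero_bed_median <- median(zero_bed_prices)
zero_bed_median  # 202500

# Lowest and highest sale price in the group
range(zero_bed_prices)
```

The prices run from about 109K to 385K, so the median of such a small group should be treated with some caution.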

Full Bathrooms

FullBath: Full bathrooms above ground

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(FullBath)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Bathrooms Above Ground', x= '#') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = factor(FullBath), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Bathrooms Above Ground vs Sale Price', y = 'Sale Price', x= '#') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

Aside from the few homes with no full bathrooms above ground (where the bathrooms are presumably all in the basement), the more full bathrooms there are, the higher the sale price. In general, more bathrooms are often correlated with higher square footage.

Half Bathrooms

HalfBath: Half baths above ground

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(HalfBath)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Half Bathrooms Above Ground', x= 'Rooms') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = factor(HalfBath), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Half Bathrooms Above Ground vs Sale Price', y = 'Sale Price', x= 'Rooms') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

We see the same pattern as we did with the other bathroom variable. We don’t have a total bathroom feature, so this is one we will look at creating by adding the full and half bathrooms above and below ground together.
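
A total bathroom feature along these lines could be sketched as follows. Counting a half bath as 0.5 of a full bath is an assumption, as is the name TotalBath; BsmtFullBath and BsmtHalfBath are the basement bathroom columns in the dataset, and the rows are made up.

```r
# Made-up rows standing in for eda_data
homes <- data.frame(
  FullBath     = c(2, 1, 2),
  HalfBath     = c(1, 0, 0),
  BsmtFullBath = c(1, 0, 1),
  BsmtHalfBath = c(0, 1, 0)
)

# Assumption: a half bath counts as half of a full bath
homes$TotalBath <- homes$FullBath + homes$BsmtFullBath +
  0.5 * (homes$HalfBath + homes$BsmtHalfBath)

homes
```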

Basement Finished Square Feet

BsmtFinSF1 = Type 1 finished square feet

live_area_outliers <- eda_data %>%
  filter(BsmtFinSF1 > 4000)

abv_hist <- ggplot(eda_data) + 
  aes(x = BsmtFinSF1) + 
  geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
 # scale_y_continuous( breaks = seq(0, 200, by = 25)) + 
  labs(title = "Histogram of Basement Finished Square Feet", y = "Count") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

abv_scatter <- ggplot(eda_data %>% filter(BsmtFinSF1 > 0)) + 
  aes(x = BsmtFinSF1, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
  geom_point(data = live_area_outliers, aes(x = BsmtFinSF1, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Basement Finished Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)

Our distribution here looks pretty good and we can see that there definitely is a relationship between the square footage of the basement that is finished and sale price. We do see a potential outlier on the scatter plot out past 4,000 square feet. Let’s take a look at that point.

eda_data %>% filter(BsmtFinSF1 > 4000) %>% select(Id, GrLivArea, BsmtFinSF1, OverallQual, OverallCond, ExterQual, ExterCond)
## # A tibble: 1 x 7
##      Id GrLivArea BsmtFinSF1 OverallQual OverallCond ExterQual ExterCond
##   <dbl>     <dbl>      <dbl>       <dbl>       <dbl> <chr>     <chr>    
## 1  1299      5642       5644          10           5 Ex        TA

Something isn’t adding up for this row… between the finished basement and above ground square feet, this house has more than 11,000 square feet of space (5,642 above ground plus 5,644 of finished basement) and many bedrooms. Additionally, based on the condition and quality columns, it looks to be in AT LEAST typical condition for everything and is not near a railroad or anything like that. Its one drawback looks to be that the property has a quick and significant rise from the street to the grade of the building (banked). However, in looking at the LandContour variable… it doesn’t look like this would have that large of an effect on a house that would normally be very expensive. We’ll continue our analysis, but note that this may be an outlier that needs to be removed.
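
To put numbers on how unusual this home is, the figures reported for Id 1299 can be combined directly. The variable names are only for illustration, and the price per square foot here deliberately ignores any unfinished basement area.

```r
# Values reported above for Id 1299
gr_liv_area  <- 5642    # above ground living area
bsmt_fin_sf1 <- 5644    # Type 1 finished basement area
sale_price   <- 160000  # sale price

# Combined finished area, above and below ground
total_sf <- gr_liv_area + bsmt_fin_sf1
total_sf  # 11286

# Sale price per finished square foot
price_per_sf <- sale_price / total_sf
round(price_per_sf, 2)
```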

Basement Type 2 Finished Square Feet

BsmtFinSF2 = Type 2 finished square feet

abv_hist <- ggplot(eda_data) + 
  aes(x = BsmtFinSF2) + 
  geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
 # scale_y_continuous( breaks = seq(0, 200, by = 25)) + 
  labs(title = "Histogram of Type 2 Basement Finished Square Feet", y = "Count") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

abv_scatter <- ggplot(eda_data ) + 
  aes(x = BsmtFinSF2, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
 # geom_point(data = live_area_outliers, aes(x = BsmtFinSF1, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Basement Type 2 Finished Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)

It looks like the majority of homes do not have Type 2 basement finished square feet and it also looks like there isn’t really a relationship here with sale price. This may be a feature we end up removing from the dataset.
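
Before dropping a feature like this, it helps to quantify how sparse it is. The sketch below computes the share of zero values on a made-up vector; on the real data the same expression would be applied to eda_data$BsmtFinSF2.

```r
# Made-up values standing in for eda_data$BsmtFinSF2
bsmt_fin_sf2 <- c(0, 0, 0, 0, 0, 0, 0, 180, 0, 469)

# Share of homes with no Type 2 finished basement area
share_zero <- mean(bsmt_fin_sf2 == 0)
share_zero
```

A share close to 1 would support treating the variable as nearly constant and a candidate for removal.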

Basement Unfinished Square Feet

BsmtUnfSF = Unfinished square feet of basement area

abv_hist <- ggplot(eda_data) + 
  aes(x = BsmtUnfSF) + 
  geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
 # scale_y_continuous( breaks = seq(0, 200, by = 25)) + 
  labs(title = "Histogram of Basement Unfinished Square Feet", y = "Count") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

abv_scatter <- ggplot(eda_data  ) + 
  aes(x = BsmtUnfSF, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
  #geom_point(data = live_area_outliers, aes(x = TotalBsmtSF, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Basement Unfinished Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)

It appears that the majority of homes don’t have a tremendous amount of unfinished square footage in their basement. Interestingly, we also don’t see a very strong relationship between unfinished square footage and sale price. I don’t see anything that I would consider an outlier here.

Total Basement Square Feet

TotalBsmtSF = Total square feet of basement area

live_area_outliers <- eda_data %>%
  filter(TotalBsmtSF > 4500)

abv_hist <- ggplot(eda_data) + 
  aes(x = TotalBsmtSF) + 
  geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
 # scale_y_continuous( breaks = seq(0, 200, by = 25)) + 
  labs(title = "Histogram of Total Basement Square Feet", y = "Count") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

abv_scatter <- ggplot(eda_data  ) + 
  aes(x = TotalBsmtSF, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
  geom_point(data = live_area_outliers, aes(x = TotalBsmtSF, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Total Basement Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)

Here we see trends very similar to those of the other basement square-footage features. Again, we note the presence of the outlier beyond 6,000 square feet. One thing I’d like to check is whether the three other basement variables add up to this variable:

eda_data %>% 
  select(BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF) %>%
  mutate(CheckColumn = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF) %>%
  mutate(Difference = CheckColumn - TotalBsmtSF ) %>% 
  filter(Difference != 0)
## # A tibble: 0 x 6
## # ... with 6 variables: BsmtFinSF1 <dbl>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>,
## #   TotalBsmtSF <dbl>, CheckColumn <dbl>, Difference <dbl>

The filter returns no rows, so BsmtFinSF1, BsmtFinSF2, and BsmtUnfSF sum exactly to TotalBsmtSF for every home. Good info to keep in our toolbox.
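
One practical consequence of an exact sum like this is that a linear model cannot estimate separate coefficients for all four columns at once. The sketch below illustrates this on simulated numbers (the `toy` data frame and every value in it are assumptions for illustration, not rows from the Ames data): `lm()` detects the redundancy and reports `NA` for the column it cannot separate.

```r
# Illustrative only: simulated basement areas, not the Ames data.
set.seed(42)
n <- 50
toy <- data.frame(
  BsmtFinSF1 = round(runif(n, 0, 1500)),
  BsmtFinSF2 = round(runif(n, 0, 400)),
  BsmtUnfSF  = round(runif(n, 0, 1200))
)

# Build the total as an exact sum, mirroring what the check above found.
toy$TotalBsmtSF <- toy$BsmtFinSF1 + toy$BsmtFinSF2 + toy$BsmtUnfSF
toy$SalePrice   <- 50000 + 60 * toy$TotalBsmtSF + rnorm(n, sd = 10000)

# All four basement columns together make the design matrix rank-deficient.
fit <- lm(SalePrice ~ BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF,
          data = toy)

coef(fit)   # the last, redundant term comes back as NA
fit$rank    # 4: intercept plus three independent predictors
```

In a real model this would argue for keeping either the total or the components, not both.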

Basement Half Bathrooms

BsmtHalfBath: Basement half bathrooms

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(BsmtHalfBath)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Basement Half Bathrooms', x= 'Rooms') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(BsmtHalfBath), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Basement Half Bathrooms vs Sale Price', y = 'Sale Price', x= 'Rooms') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

It looks like in the majority of cases, basements do not have half bathrooms. It also looks like the presence of a half bathroom is not a significant factor in the price of a house. Let’s see if the same holds true for a full bathroom in the basement.

Basement Full Bathrooms

BsmtFullBath: Basement full bathrooms

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(BsmtFullBath)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Basement Full Bathrooms', x= '#') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(BsmtFullBath), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Basement Full Bathrooms vs Sale Price', y = 'Sale Price', x= '#') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

Over half of all basements don’t have a full bathroom, which is surprising to me. The other surprise is that the presence of extra bathrooms does not seem to be strongly connected to the sale price of the house. While we can see a slight increase in the median from 1 to 2 bathrooms, the IQR is almost identical.

Other Basement Variables

Basement Exposure

BsmtExposure = Refers to walkout or garden level walls

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(BsmtExposure)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Basement Exposure', x= 'Exposure Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(BsmtExposure), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Basement Exposure vs Sale Price', y = 'Sale Price', x= 'Exposure Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

While it looks like the majority of homes don’t have basement exposure, it does look like exposure is an ordinal feature that is positively correlated with sale price. Additionally, it looks like the NAs in this case relate to houses that do not have basements. Based on the boxplots, homes with basements are valued above those without.
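
As a minimal sketch of how this could be handled later, the NAs can be recoded as their own category and the levels put in order. The level codes (Gd, Av, Mn, No) follow the competition’s data description; the `"None"` label, the choice to rank it below `"No"`, and the sample values are assumptions made for illustration.

```r
# Illustrative values only, not rows from the Ames data.
exposure <- c("No", "Gd", NA, "Mn", "Av", "No", NA)

# Treat NA as its own "None" (no basement) category rather than as missing data.
exposure_clean <- ifelse(is.na(exposure), "None", exposure)

# Order the levels from least to most exposure (ranking is an assumption).
exposure_ord <- factor(
  exposure_clean,
  levels  = c("None", "No", "Mn", "Av", "Gd"),
  ordered = TRUE
)

# Integer scores (None = 0 up to Gd = 4) for use as a numeric predictor.
exposure_score <- as.integer(exposure_ord) - 1L
exposure_score
```

Scoring the levels 0 to 4 assumes the steps between them are equally large, which the boxplots suggest only roughly.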

Rating of Basement Finished Area

BsmtFinType1 = Rating of basement finished area

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(BsmtFinType1)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Basement Finish Type Rating', x= 'Finish Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(BsmtFinType1), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Basement Finish Type Rating vs Sale Price', y = 'Sale Price', x= 'Finish Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

It looks like about 1/4 of homes have a basement with “good living quarters” and another quarter have an unfinished basement. In terms of sale price, it looks like price increases substantially if the home has good living quarters; otherwise, the finish type doesn’t make a huge difference.

Rating of Basement Finished Area

BsmtFinType2 = Rating of basement finished area (if multiple types)

sale_hist <- ggplot(eda_data) + 
  aes(x = as.factor(BsmtFinType2)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Basement Finish Type Rating', x= 'Rating Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

sale_box <- ggplot(eda_data) + 
  aes(x = reorder(factor(BsmtFinType2), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Basement Finish Type Rating vs Sale Price', y = 'Sale Price', x= 'Rating Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)

A basement can have multiple finish types, such as one portion in good living condition and another section that is unfinished. It looks like the overwhelming majority of homes have a portion of their basement that is unfinished (maybe a utility or storage room?). As we saw before, we don’t really see much difference in the sale price except for space that is “livable”.

Garage Cars (GarageCars), Garage Areas (GarageArea) and Other Garage Features

Garage Cars (GarageCars)

Garage Cars = Size of garage in car capacity

hist <- ggplot(eda_data) + 
  aes(x = as.factor(GarageCars)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Garage Cars', x= '# of Garage Cars') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    panel.grid.major.y =  element_blank(),
    axis.title.y= element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = factor(GarageCars), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Garage Cars vs Sale Price', y = 'Sale Price', x= '# of Garage Cars') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(hist, box, nrow=1)

It looks like about 56% of homes have a two-car garage, about 6% have either no garage or a four-car garage, and the remaining 38% have a one- or three-car garage. In looking at the above boxplots, for the first four categories we see what we would expect: sale price increases with garage capacity. However, once we get to a four-car garage, sale price appears to go down, which doesn’t seem to make sense. Let’s take a look at these homes and see if anything looks odd:

eda_data %>% 
  filter(GarageCars == 4) %>%
  select(Id, GrLivArea, LotArea, GarageCars, SalePrice, OverallQual )
## # A tibble: 5 x 6
##      Id GrLivArea LotArea GarageCars SalePrice OverallQual
##   <dbl>     <dbl>   <dbl>      <dbl>     <dbl>       <dbl>
## 1   421      1344    7060          4    206300           7
## 2   748      2640   11700          4    265979           7
## 3  1191      1622   32463          4    168000           4
## 4  1341       872    8294          4    123000           4
## 5  1351      2634   11643          4    200000           5

In looking through the above data (only 5 rows), I don’t see anything that would indicate that these are outliers or that they should be removed. Perhaps as we do more exploration into more of our features we’ll see some of these homes again.
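
Because the four-car group contains only five homes, its boxplot rests on very little data. The sketch below rebuilds those five rows from the output above and summarises them; the `four_car` and `summary_4car` names are introduced here for illustration.

```r
# The five 4-car-garage rows, copied from the output above.
four_car <- data.frame(
  Id          = c(421, 748, 1191, 1341, 1351),
  GrLivArea   = c(1344, 2640, 1622, 872, 2634),
  SalePrice   = c(206300, 265979, 168000, 123000, 200000),
  OverallQual = c(7, 7, 4, 4, 5)
)

# Group size and medians for the key columns.
summary_4car <- c(
  n               = nrow(four_car),
  median_price    = median(four_car$SalePrice),
  median_quality  = median(four_car$OverallQual),
  median_liv_area = median(four_car$GrLivArea)
)
summary_4car
```

With a median overall quality of 5 across just five homes, the dip in price may say more about these particular houses than about four-car garages in general.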

Garage Area (GarageArea)

Garage Area = Size of garage in square feet

hist <- ggplot(eda_data) + 
  aes(x = GarageArea) + 
  geom_histogram(binwidth =50, fill = "steelblue", color = "black") +
  labs(title = 'Garage Area', y = 'Count', x= 'Area of Garage in Square Feet') +
  scale_x_continuous(breaks= seq(0, 2000, by=200)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = GarageArea, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Garage Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

Looking at the above distribution, we see that it is multi-modal, as we’d expect. It appears that there is one peak for each garage capacity, 0-4 cars. This demonstrates what we saw earlier in our correlation plot: garage area and garage cars are highly correlated. I also don’t see any obvious outliers. In looking at garage area in relation to sale price, we can see that these two variables are also highly correlated.

Garage Type

Garage Type = Garage Location

hist <- ggplot(eda_data) + 
  aes(x = GarageType) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Garage Type', y = 'Count', x= 'Garage Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(GarageType),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Garage Type vs Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(hist, box, nrow=1)

In looking at our boxplots above, it does look like there may be a relationship between sale price and garage type. In particular, it looks like there may actually be some order to the categories (ordinal values). We’ll look to adjust this in our data cleaning stage. Additionally, as we saw in the counts above, most of the homes have either a detached garage or an attached garage, and the boxplots show a fairly clear difference in sale price between these two categories.

Garage Finish

Garage Finish = Interior Finish of the Garage

hist <- ggplot(eda_data) + 
  aes(x = GarageFinish) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Garage Finish', y = 'Count', x= 'Finish') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(GarageFinish),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Garage Finish vs Sale Price', y = 'Sale Price', x= 'Finish Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

In looking at the above categorical variables, it appears that these variables also display a sense of order (no garage being the lowest and finished being the highest). We’ll incorporate this in our data cleaning stage below. Additionally, I’ll note that the NAs are due to houses with no garage.

Garage - Year Built

GarageYrBlt = Year garage was built

hist <- ggplot(eda_data) + 
  aes(x = GarageYrBlt) + 
  geom_histogram( bins = 30, fill = 'steelblue') +
  labs(title = 'Garage - Year Built', y = 'Count', x= 'Year') +
  scale_x_continuous(breaks= seq(0, 2020, by=10)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) +
  geom_vline(xintercept = 1906) + 
  geom_vline(xintercept = 1948) +
  geom_vline(xintercept = 1986)


box <- eda_data %>%
  select(GarageYrBlt, SalePrice, GarageCars) %>% 
#  filter(GarageCars != 0 ) %>%
  mutate(garage_age_category = ifelse(GarageYrBlt < 1906, 'Oldest', 
                                      ifelse(GarageYrBlt < 1948, 'Old', 
                                             ifelse(GarageYrBlt < 1986, 'Average', 'New')))) %>%
  ggplot() + 
  aes(x = reorder(garage_age_category, SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Garage Age Category vs Sale Price', y = 'Sale Price', x= 'Age Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

gridExtra::grid.arrange(hist, box, nrow=1)

In looking at the distribution, it looks like there are four subsets of this data: garages built before 1906, garages built from 1906 through 1947, garages built from 1948 through 1985, and garages built from 1986 on. Looking at the above boxplot, we can see that garage build year is relevant to the sale price (as I assume the build year of the home would be). While the garage build year itself may be sufficient for predictive power, we’ll look at creating a feature for garage age category when we work through our feature engineering. Also, it looks like we may need to combine the ‘Oldest’ and ‘Old’ bins.
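
The nested `ifelse()` calls used for the boxplot can also be written with base R’s `cut()`, which keeps the break points in one place. This is a sketch on made-up years (the `garage_year` vector is an assumption for illustration); `right = FALSE` makes each interval closed on the left, which matches the `<` comparisons in the original code.

```r
# Illustrative years only (not taken from the data); NA stands for "no garage".
garage_year <- c(1900, 1905, 1906, 1947, 1948, 1985, 1986, 2009, NA)

garage_age_category <- cut(
  garage_year,
  breaks = c(-Inf, 1906, 1948, 1986, Inf),
  labels = c("Oldest", "Old", "Average", "New"),
  right  = FALSE  # intervals are [lower, upper), matching the `<` comparisons
)

# Same bins expressed with the nested ifelse() used in the plot code.
nested <- ifelse(garage_year < 1906, "Oldest",
          ifelse(garage_year < 1948, "Old",
          ifelse(garage_year < 1986, "Average", "New")))

data.frame(garage_year, garage_age_category, nested)
table(garage_age_category, useNA = "ifany")
```

Both versions leave homes without a garage as `NA`, so those rows still need an explicit category if they are to appear in a plot or model.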

Kitchen Features

Kitchens Above Ground

KitchenAbvGr = Kitchens above ground

hist <- ggplot(eda_data) + 
  aes(x = KitchenAbvGr) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Kitchens Above Ground', y = 'Count', x= '# of Kitchens') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(KitchenAbvGr),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Kitchens Above Ground vs Sale Price', y = 'Sale Price', x= '# of Kitchens') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

The vast majority of homes have at least one kitchen above ground. It is pretty rare not to have a kitchen above ground, in which case perhaps it is in the basement? Additionally, it’s very surprising to see that homes with 2+ kitchens generally have a lower sale price than those with only one kitchen. Why would that be?

eda_data %>% filter(KitchenAbvGr >= 2) %>% select(MSSubClass) %>% head()
## # A tibble: 6 x 1
##   MSSubClass
##        <dbl>
## 1         50
## 2        190
## 3         90
## 4         90
## 5        190
## 6         50

It looks like the majority of the homes with more than one kitchen are either a duplex (someone selling both sides at once) or a two-family conversion, which explains why they would have both more kitchens and lower values. The home with no above-ground kitchen does have a basement with good livable space, so perhaps its kitchen is in the basement…
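
Since `head()` shows only the first six rows, a full tally by dwelling type is a more direct way to check this. The sketch below runs on made-up rows (the `toy` data frame is an assumption for illustration); in the data description, MSSubClass 90 is a duplex and 190 is a two-family conversion.

```r
# Illustrative rows only, not the Ames data.
toy <- data.frame(
  KitchenAbvGr = c(1, 2, 2, 1, 2, 3, 1, 2),
  MSSubClass   = c(20, 90, 190, 60, 90, 90, 50, 50)
)

multi_kitchen <- toy[toy$KitchenAbvGr >= 2, ]

# Tally every dwelling type instead of inspecting only the first rows.
class_counts <- sort(table(multi_kitchen$MSSubClass), decreasing = TRUE)
class_counts

# Share of multi-kitchen homes in each dwelling type.
round(prop.table(class_counts), 2)
```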

Kitchen Quality

KitchenQual = Kitchen quality

hist <- ggplot(eda_data) + 
  aes(x = KitchenQual) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Kitchen Quality', y = 'Count', x= 'Quality Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(KitchenQual),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Kitchen Quality vs Sale Price', y = 'Sale Price', x= 'Quality Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Wow, kitchen quality looks strongly correlated to sale price. We can see significant differences in price at each increase in quality type. This is a categorical ordinal variable. We’ll need to encode this when we get to the data processing section.
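
One simple way to encode an ordinal variable like this is a named lookup vector that maps each quality code to an integer score. This is only a sketch: the codes (Po, Fa, TA, Gd, Ex) come from the competition’s data description, while the 1 to 5 scores and the sample values are assumptions for illustration.

```r
# Quality codes from the data description, worst to best.
quality_scores <- c(Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5)

# Illustrative values only, not rows from the Ames data.
kitchen_qual <- c("TA", "Gd", "Ex", "Fa", "TA")

# Index the lookup by name to translate each code into its score.
kitchen_score <- unname(quality_scores[kitchen_qual])
kitchen_score
```

The same lookup could be reused for any other column that shares this quality scale, provided the equal spacing between levels is an acceptable simplification.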

Lot Features

Neighborhood

Neighborhood = Physical locations within Ames city limits

hist <- ggplot(eda_data) + 
  aes(x = reorder(Neighborhood, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Neighborhood Count', y = 'Count', x= 'Name') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Neighborhood),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Neighborhood vs Sale Price', y = 'Sale Price', x= 'Name') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

In looking at the above boxplot, it looks like there are some fairly distinct breaks between groups of neighborhoods, which may reflect different income levels. For example, there is a fairly large change in the median from Mitchel to SawyerW (Mitchel and below = poor?), and then again from Timber to StoneBr (SawyerW to Timber = middle class?), with the most expensive neighborhoods above Timber. I think we could segment this into 3 or 4 bins based on the median home price in each neighborhood, placing the cut points where we see the largest shifts in the median.

eda_data %>% 
  select(Neighborhood, SalePrice) %>%
  mutate( Neighborhood_class = case_when(
                      Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale') ~ 'Poor',
                      Neighborhood %in% c('BrkSide', 'Edwards', 'OldTown', 'Sawyer', 'Blueste', 'SWISU', 'NPkVill', 'NAmes', 'Mitchel') ~ 'Lower-Middle',
                      Neighborhood %in% c('SawyerW', 'NWAmes', 'Gilbert', 'Blmngtn', 'CollgCr', 'Crawfor', 'ClearCr', 'Somerst', 'Veenker', 'Timber') ~ 'Middle',
                                  TRUE ~ 'Rich')) %>% ggplot() + 
  aes(x = reorder(factor(Neighborhood_class),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Neighborhood vs Sale Price', y = 'Sale Price', x= 'Name') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
 #   axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey"))

Binning the neighborhoods this way makes the variable a bit more digestible, and it looks like it represents the relationship between neighborhood and sale price very well.
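
The bins above were chosen by eye. As a hedged alternative, the classes could be derived from the data by computing each neighborhood’s median price and cutting those medians at quantiles. The sketch below uses six made-up neighborhoods named A to F; the names, prices, three-way split, and class labels are all assumptions for illustration, not the binning used in this analysis.

```r
# Illustrative prices for six made-up neighborhoods "A".."F" (not the Ames data).
toy <- data.frame(
  Neighborhood = rep(c("A", "B", "C", "D", "E", "F"), each = 3),
  SalePrice = c( 90000, 100000, 110000,
                120000, 130000, 140000,
                150000, 160000, 170000,
                190000, 200000, 210000,
                240000, 250000, 260000,
                300000, 320000, 340000)
)

# Median sale price per neighborhood.
nbhd <- aggregate(SalePrice ~ Neighborhood, data = toy, FUN = median)
names(nbhd)[2] <- "MedianPrice"

# Split neighborhoods into three equally sized groups by median price.
nbhd$Neighborhood_class <- cut(
  nbhd$MedianPrice,
  breaks = quantile(nbhd$MedianPrice, probs = c(0, 1/3, 2/3, 1)),
  labels = c("Low", "Middle", "High"),
  include.lowest = TRUE
)

# Attach the class back to each house.
toy <- merge(toy, nbhd[, c("Neighborhood", "Neighborhood_class")],
             by = "Neighborhood")
table(toy$Neighborhood_class)
```

Quantile cuts give equally sized groups, whereas the manual bins follow the visible jumps in the medians, so the two approaches can disagree on where the boundaries fall.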

Type of Home Involved in the Sale

MSSubClass = Identifies the type of dwelling involved in the sale.

hist <- ggplot(eda_data) + 
  aes(x = reorder(factor(MSSubClass), SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Type of Home', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(MSSubClass),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Type of Home vs Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

The type of home looks to be a good indicator of sales price. Additionally, it doesn’t look like there are any extreme outliers that we need to worry about.
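The bar chart and boxplot pair is rebuilt by hand for each categorical feature in this section. One way to cut down on the repetition is a small helper along the lines below. This is only a sketch: `plot_categorical` is a hypothetical name, the theme is simplified compared with the plots in this post, and `toy` is made-up data rather than the Ames dataset.

```r
library(ggplot2)

# Hypothetical helper: count bar chart + boxplot for one categorical column.
# `feature` and `target` are column names given as strings.
plot_categorical <- function(data, feature, target = "SalePrice") {
  # Order the levels by mean target value (reorder's default),
  # so both panels share the same left-to-right ordering
  data[[feature]] <- reorder(factor(data[[feature]]), data[[target]])

  bars <- ggplot(data, aes(x = .data[[feature]])) +
    geom_bar(fill = "steelblue") +
    labs(title = feature, x = NULL, y = "Count") +
    theme_minimal()

  box <- ggplot(data, aes(x = .data[[feature]], y = .data[[target]])) +
    geom_boxplot(color = "steelblue", outlier.color = "firebrick",
                 outlier.alpha = 0.35) +
    labs(title = paste(feature, "vs", target), x = NULL, y = target) +
    theme_minimal()

  list(bars = bars, box = box)
}

# Made-up example data (NOT the Ames data)
toy <- data.frame(
  BldgType  = c("1Fam", "1Fam", "Duplex", "TwnhsE", "Duplex", "1Fam"),
  SalePrice = c(200000, 180000, 120000, 170000, 130000, 220000)
)

plots <- plot_categorical(toy, "BldgType")
# To draw the two panels side by side, as in the rest of this post:
# gridExtra::grid.arrange(plots$bars, plots$box, nrow = 1)
```

With the real data, the call would presumably look like `plot_categorical(eda_data, "MSSubClass")`.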

Zoning Classification of the Sale

MSZoning = Identifies the general zoning classification of the sale.

hist <- ggplot(eda_data) + 
  aes(reorder(x = MSZoning, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Zoning Classification of the Home', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(MSZoning),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Zoning Classification of the Home vs Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

There appears to be a fairly strong relationship between zoning and sale price as well. Most homes fall into the residential low density (RL) category, a type of zoning intended for neighborhoods with bigger lots and only single family homes. FV, or Floating Village Residential, is often a retirement community and looks to have the highest median price.

Building Type

BldgType = Type of dwelling.

hist <- ggplot(eda_data) + 
  aes(reorder(x = BldgType, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Building Type', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(BldgType),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Building Type vs Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

The majority of homes in our dataset are single family detached homes. These homes overall have the highest sale price. Another interesting note is that it looks like homes attached to other homes (duplex, townhouse) have lower values as we would expect, UNLESS it is a townhouse that is an end unit (TwnhsE), which appears to be more valuable.

Lot Shape

LotShape = General shape of property

hist <- ggplot(eda_data) + 
  aes(reorder(x = LotShape, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Lot Shape', y = 'Count', x= 'Shape') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(LotShape),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Lot Shape vs Sale Price', y = 'Sale Price', x= 'Shape') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Regular lots and slightly irregular lots make up most of our dataset. I would have thought that regular lots would be more valuable than irregular lots, but in every case irregular lots have a higher median price than regular lots. These values don’t necessarily follow a logical pattern, so we’ll have to decide whether we want to include this feature or not. I’m leaning toward no.
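Before deciding whether to keep a feature like this, it can help to put numbers next to the boxplot: how many homes are in each group, and what the median price is. The sketch below does that, and also collapses the levels into regular vs irregular as one possible simplification. The values in `toy` are invented for illustration; only the level names (Reg, IR1, IR2, IR3) come from the data description.

```r
# Made-up example data (NOT the Ames data)
toy <- data.frame(
  LotShape  = c("Reg", "Reg", "Reg", "Reg", "IR1", "IR1", "IR2", "IR3"),
  SalePrice = c(150000, 160000, 140000, 155000, 190000, 210000, 230000, 205000)
)

# Group size and median price per lot shape
counts  <- table(toy$LotShape)
medians <- tapply(toy$SalePrice, toy$LotShape, median)
shape_summary <- data.frame(
  n            = as.vector(counts),
  median_price = as.vector(medians),
  row.names    = names(medians)
)
print(shape_summary)

# One possible simplification: regular vs any kind of irregular
toy$IsRegular <- toy$LotShape == "Reg"
by_regular <- tapply(toy$SalePrice, toy$IsRegular, median)
print(by_regular)
```

Groups with only a handful of homes give noisy medians, so a two-level version may be easier to defend than the full set of levels if the feature is kept at all.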

Lot Configuration

LotConfig = Lot configuration

hist <- ggplot(eda_data) + 
  aes(reorder(x = LotConfig, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Lot Configuration', y = 'Count', x= 'Config') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(LotConfig),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Lot Configuration vs Sale Price', y = 'Sale Price', x= 'Config') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

It looks like lot configuration affects sale price when the property has frontage on 3 sides or sits in a cul-de-sac; otherwise, the other configurations appear to be pretty similar.

Foundation

Foundation = Type of foundation

hist <- ggplot(eda_data) + 
  aes(reorder(x = Foundation, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Foundation', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Foundation),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Foundation vs Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Most homes have a cinder block foundation or a poured concrete foundation. It does appear that the type of foundation makes a pretty large difference in the sale price of the house. This might have something to do with the age of the home. Let’s see if we can dig into this and understand it a little better:

ggplot(eda_data) + 
  aes(x = YearBuilt, y = SalePrice, color = Foundation) + 
  geom_point( alpha = 0.40) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
#  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  scale_x_continuous(breaks= seq(0, 2020, by=10)) +
  labs(title = 'Foundation Type Over Time vs Sale Price', y = 'Sale Price', x= 'Year') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

We can see very clearly that this feature is tied to age. Poured concrete seems to be the standard for newer homes, while slightly older homes used cinder blocks. Before 1940 it looks like several other methods were in use, but they appear very infrequently after that time.
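The link between foundation type and age can also be checked with a simple table of foundation type by decade built. The sketch below uses a made-up data frame `toy`; the foundation codes match the data description, but the rows are invented.

```r
# Made-up example data (NOT the Ames data)
toy <- data.frame(
  YearBuilt  = c(1915, 1928, 1936, 1955, 1962, 1978, 1999, 2004, 2007),
  Foundation = c("BrkTil", "BrkTil", "Stone", "CBlock", "CBlock",
                 "CBlock", "PConc", "PConc", "PConc")
)

# Bucket the build year into decades (1915 -> 1910, 2004 -> 2000, ...)
toy$Decade <- floor(toy$YearBuilt / 10) * 10

# Count of homes by decade and foundation type
counts <- table(toy$Decade, toy$Foundation)

# Share of each foundation type within a decade (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

If foundation is mostly a proxy for age, each decade's row should be dominated by a single foundation type.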

Land Contour

LandContour = Flatness of the property

hist <- ggplot(eda_data) + 
  aes(reorder(x = LandContour, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Flatness of the Property', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(LandContour),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Flatness of the Property vs Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Most homes sit on a level plot of land. There does appear to be some type of relationship between the flatness of the property and the sale price, although I’m not sure I understand it. The median value for homes with a depression (Low) in the property is higher than for those on a level plot, which doesn’t quite make sense to me. I can see how hillside could be more expensive, because nicer homes are generally found on hills with views.

Land Slope

LandSlope = Slope of property

hist <- ggplot(eda_data) + 
  aes(reorder(x = LandSlope, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Slope of the Property', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(LandSlope),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Slope of the Property vs Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Ames, Iowa is a fairly hilly region. It appears that homes on hills are more valuable than those that aren’t. The higher and steeper the better.

Lot Area

LotArea = Lot size in square feet

hist <- ggplot(eda_data) + 
  aes(x = LotArea) + 
  geom_histogram(binwidth =5000, fill = "steelblue", color = "black") +
  labs(title = 'Lot Area', y = 'Count', x= 'Square Feet') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = LotArea, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Lot Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

The lot area distribution is heavily right skewed. This looks to be due to several fairly extreme outliers with lot sizes greater than 150,000 square feet. If we remove these homes from our dataset, it looks like there is a loose relationship between lot area and sale price, but the relationship isn’t very strong. This makes sense: lot area is just one factor of the property, and once the lot falls within a generally sufficient range, the remainder of the home’s value comes from the house itself and its features.
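To back up the visual impression, skewness and the correlation with sale price can be computed with and without the very large lots. The sketch below is illustrative only: `toy` is made-up data, the 150,000 square foot cutoff simply mirrors the threshold mentioned above, and `skewness` is a small hand-written moment-based version so the example has no dependencies (the `moments` package loaded earlier provides its own `skewness()`).

```r
# Made-up example data (NOT the Ames data); the last two rows mimic huge lots
toy <- data.frame(
  LotArea   = c(7000, 8500, 9600, 10500, 12000, 14000, 160000, 215000),
  SalePrice = c(130000, 150000, 175000, 190000, 210000, 250000, 240000, 375000)
)

# Moment-based skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Drop the extreme lots (cutoff is an assumption taken from the text above)
trimmed <- subset(toy, LotArea <= 150000)

skew_all     <- skewness(toy$LotArea)
skew_trimmed <- skewness(trimmed$LotArea)

cor_all     <- cor(toy$LotArea, toy$SalePrice)
cor_trimmed <- cor(trimmed$LotArea, trimmed$SalePrice)

print(c(skew_all = skew_all, skew_trimmed = skew_trimmed))
print(c(cor_all = cor_all, cor_trimmed = cor_trimmed))
```

A positive skewness value indicates a right-skewed distribution; comparing the two correlations shows how much a few extreme lots influence the apparent relationship.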

Lot Frontage

LotFrontage = Linear feet of street connected to property

hist <- ggplot(eda_data) + 
  aes(x = LotFrontage) + 
  geom_histogram(fill = "steelblue", color = "black") +
  labs(title = 'Lot Frontage', y = 'Count', x= 'Linear Feet') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = LotFrontage, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
 # geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Lot Frontage vs Sale Price', y = 'Sale Price', x= 'Linear Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

Very similar to the last feature we looked at, we see the distribution is extremely right skewed. This is in large part due to several outliers in the dataset.

Street

Street = Type of road access

hist <- ggplot(eda_data) + 
  aes(reorder(x = Street, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Type of Road Access', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Street),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Type of Road Access v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Almost every home is reached by a paved road. It looks like paved road access is more valuable than gravel road access.

Paved Drive

PavedDrive = Paved driveway

hist <- ggplot(eda_data) + 
  aes(reorder(x = PavedDrive, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Paved Driveway?', y = 'Count', x= 'Paved?') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(PavedDrive),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Paved Driveway v Sale Price', y = 'Sale Price', x= 'Paved?') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

As we saw before, most homes have paved driveways. We can see clearly that those homes with a paved driveway have higher sales prices than those with partially paved driveways and those with no paved driveway.

Alley

Alley = Type of alley access

hist <- ggplot(eda_data) + 
  aes(reorder(x = Alley, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Type of Alley Access', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Alley),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Type of Alley Access v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

The majority of homes don’t have access to an alley (probably pretty normal in Iowa). It does appear that those homes with paved access to an alley have overall higher home prices than those with gravel access.
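In the raw CSV, homes without alley access show up as `NA` in the `Alley` column. If that has not already been handled in an earlier cleaning step, one simple option is to turn the missing values into an explicit level so that these homes get their own bar and box. The sketch below uses a made-up vector, and the label "None" is an arbitrary choice.

```r
# Made-up example values (NOT the Ames data): NA means no alley access
alley <- c(NA, "Grvl", NA, "Pave", NA, "Grvl")

# Replace missing values with an explicit "None" level
alley_clean <- ifelse(is.na(alley), "None", alley)

print(table(alley_clean))
```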

Roof Features

Roof Materials

RoofMatl = Roof material

hist <- ggplot(eda_data) + 
  aes(reorder(x = RoofMatl, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Roof Material', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(RoofMatl),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Roof Material v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Nearly every house has standard composite shingles. You’d expect this because a roof needs to be replaced every 30 years or so; even if a house originally had a different roof type, if it’s over 30 years old it was probably reroofed with composite shingles at some point. Wood shingling is apparently a nice feature.

Roof Style

RoofStyle = Type of roof

hist <- ggplot(eda_data) + 
  aes(reorder(x = RoofStyle, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Roof Style', y = 'Count', x= 'Style') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(RoofStyle),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Roof Style v Sale Price', y = 'Sale Price', x= 'Style') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

As we’ve seen with the other roof feature, it looks like almost all roofs are Gable style. Additionally, it doesn’t look like roof style is too strongly correlated with Sale Price.

Porch Features

Wood Deck Square Footage

WoodDeckSF = Wood deck area in square feet

hist <- ggplot(eda_data) + 
  aes(x = WoodDeckSF) + 
  geom_histogram(fill = "steelblue", color = "black") +
  labs(title = 'Wood Deck Area', y = 'Count', x= 'Square Feet') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = WoodDeckSF, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Wood Deck Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

Interestingly, wood deck area does not seem to be strongly correlated with sale price. It does look loosely correlated, but not as strongly as I would have suspected.
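Because homes without a deck have a deck area of zero, it can be useful to separate two questions: does having a deck at all matter, and among homes with a deck, does size matter? The sketch below shows one way to split those apart. The data frame `toy` and the `HasDeck` column are made up for illustration.

```r
# Made-up example data (NOT the Ames data)
toy <- data.frame(
  WoodDeckSF = c(0, 0, 0, 0, 120, 200, 150, 300),
  SalePrice  = c(140000, 155000, 160000, 175000, 180000, 210000, 170000, 260000)
)

# Question 1: median price for homes with and without a deck
toy$HasDeck <- toy$WoodDeckSF > 0
median_by_deck <- tapply(toy$SalePrice, toy$HasDeck, median)
print(median_by_deck)

# Question 2: among homes with a deck, correlation of deck size and price
with_deck <- toy[toy$HasDeck, ]
cor_size  <- cor(with_deck$WoodDeckSF, with_deck$SalePrice)
print(cor_size)
```

If most of the signal turns out to come from the presence of a deck rather than its size, a simple yes/no indicator might serve the model as well as the raw square footage.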

Open Porch Square Footage

OpenPorchSF = Open porch area in square feet

hist <- ggplot(eda_data) + 
  aes(x = OpenPorchSF) + 
  geom_histogram(fill = "steelblue", color = "black") +
  labs(title = 'Open Porch Area', y = 'Count', x= 'Square Feet') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = OpenPorchSF, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Open Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

Similar to what we saw above, there doesn’t appear to be a very strong relationship between open porch area and sale price.

Enclosed Porch Square Footage

EnclosedPorch = Enclosed porch area in square feet

hist <- ggplot(eda_data) + 
  aes(x = EnclosedPorch) + 
  geom_histogram(fill = "steelblue", color = "black") +
  labs(title = 'Enclosed Porch Area', y = 'Count', x= 'Square Feet') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = EnclosedPorch, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Enclosed Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

Looking at the enclosed porch area, there appears to be a negative relationship with sale price, although it is very weak. I would not have expected this. It is likely due to other characteristics of homes with enclosed porches.
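
One way to probe whether another characteristic is driving this is to fit the relationship with and without a control variable. The sketch below assumes, purely for illustration, that the age of the home (`YearBuilt`) is the other factor; I have not verified that against the data here. `toy` is a made-up data frame, not the Ames data.

```r
# Toy data (made-up): older homes have enclosed porches and lower prices
toy <- data.frame(
  EnclosedPorch = c(0, 0, 0, 0, 100, 150, 200, 250),
  YearBuilt     = c(2005, 1999, 2008, 1995, 1930, 1925, 1940, 1920),
  SalePrice     = c(220000, 200000, 240000, 190000, 120000, 118000, 135000, 115000)
)

# Slope of sale price on enclosed porch area alone
simple <- lm(SalePrice ~ EnclosedPorch, data = toy)

# Same slope after also accounting for the year the home was built
adjusted <- lm(SalePrice ~ EnclosedPorch + YearBuilt, data = toy)

c(
  porch_slope_alone    = coef(simple)[["EnclosedPorch"]],
  porch_slope_adjusted = coef(adjusted)[["EnclosedPorch"]]
)
```

If the porch slope shrinks toward zero or changes sign once the control is added, that would support the idea that the negative relationship comes from the other variable rather than from the porch itself.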

Screen Porch Square Footage

ScreenPorch = Screen porch area in square feet

hist <- ggplot(eda_data) + 
  aes(x = ScreenPorch) + 
  geom_histogram(fill = "steelblue", color = "black") +
  labs(title = 'Screen Porch Area', y = 'Count', x= 'Square Feet') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = ScreenPorch, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Screen Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

Very few homes have screen porch square footage. There is a very weak relationship with sale price.

Three Season Porch Square Footage

3SsnPorch = Three season porch area in square feet

hist <- ggplot(eda_data) + 
  aes(x = `3SsnPorch`) + 
  geom_histogram(fill = "steelblue", color = "black") +
  labs(title = 'Three Season Porch Area', y = 'Count', x= 'Square Feet') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = `3SsnPorch`, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Three Season Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

Again, very few homes have a three season porch, and there is only a weak relationship with sale price. It doesn’t look like there is a total porch square footage variable. Let’s create one (including wood deck area) and see if we can find a better relationship than what we have been seeing:

eda_data %>% 
  mutate(total_porch_area = WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + `3SsnPorch`) %>%
  select(total_porch_area, SalePrice) %>% 
  ggplot() + 
  aes(x = total_porch_area, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Total Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

While the relationship is loose at best, there is a positive linear relationship between these two variables. We’ll create this feature in our feature engineering and be sure to include it in our modeling.
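
Since this feature will be needed again during feature engineering, it may help to wrap it in a small function. The helper name `add_total_porch_area` is hypothetical, and the choice to treat a missing area as 0 is my assumption (the `+` version used in the plot would return `NA` for such a row). The `toy` data frame is made up.

```r
# Hypothetical helper: sums the five outdoor-area columns used in the plot
add_total_porch_area <- function(df) {
  porch_cols <- c("WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
                  "ScreenPorch", "3SsnPorch")
  # na.rm = TRUE treats a missing area as 0 (an assumption, not the original behavior)
  df$total_porch_area <- rowSums(df[, porch_cols, drop = FALSE], na.rm = TRUE)
  df
}

# Toy data (made-up); check.names = FALSE keeps the `3SsnPorch` column name intact
toy <- data.frame(
  WoodDeckSF    = c(0, 120, 200),
  OpenPorchSF   = c(40, 0, 60),
  EnclosedPorch = c(0, 0, 112),
  ScreenPorch   = c(0, 0, NA),
  `3SsnPorch`   = c(0, 30, 0),
  check.names = FALSE
)

toy <- add_total_porch_area(toy)
toy$total_porch_area
```

The same function could then be applied to both the training and test sets so the feature is built identically in each.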

Pool Features

Pool Area

PoolArea = Pool area in square feet

hist <- ggplot(eda_data %>% filter(PoolArea > 0)) + 
  aes(x = PoolArea) + 
  geom_histogram(fill = "steelblue", color = "black") +
  labs(title = 'Pool Area', y = 'Count', x= 'Square Feet') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data %>% filter(PoolArea > 0)) + 
  aes(x = PoolArea, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Pool Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

Apparently pools in Iowa aren’t treasured like they should be. At first glance it appears that the more square footage your pool has, the lower your home price. However, this is a very small sample because so few homes have pools, so I would not read much into the trend. In practice, there does not appear to be any reliable relationship between Pool Area and Sale Price.

Pool Quality

PoolQC = Pool quality

hist <- ggplot(eda_data) + 
  aes(reorder(x = PoolQC, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Pool Quality', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    #axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(PoolQC),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Pool Quality v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

There are only 7 homes with pools, so there isn’t a lot of data to go off of. If you look at the box plots, we don’t see what we’d expect. I may end up removing the pool features as they don’t seem to affect sale price.
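
A middle ground between keeping and dropping the pool columns would be to collapse them into a single yes/no flag. This is only a sketch of that idea; the column name `has_pool` is hypothetical and `toy` is made-up data.

```r
# Toy data (made-up): most homes have no pool, i.e. PoolArea == 0
toy <- data.frame(PoolArea = c(0, 0, 512, 0, 0, 648, 0, 0))

# Hypothetical indicator: 1 if the home has any pool area, 0 otherwise
toy$has_pool <- as.integer(toy$PoolArea > 0)

# How many homes fall in each group, and what share have a pool
pool_counts <- table(toy$has_pool)
pool_share  <- mean(toy$has_pool)

pool_counts
pool_share
```

With so few pools in the real data, even this flag may carry very little signal, but the share makes it easy to see how sparse the feature is before deciding.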

Home Exterior Features

Exterior Covering of the House

Exterior1st = Exterior covering on house

hist <- ggplot(eda_data) + 
  aes(reorder(x = Exterior1st, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Exterior Covering of the House', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Exterior1st),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Exterior Covering of the House v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

While not all the types of exterior covering have significant differences in their medians, it does look like there are some clear differences in the highs and lows.

Exterior Covering of the House if More Than One

Exterior2nd = Exterior covering on house (if more than one material)

hist <- ggplot(eda_data) + 
  aes(reorder(x = Exterior2nd, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Exterior Covering of the House (If More Than One)', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Exterior2nd),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Exterior Covering (If More Than One) v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

We see a very similar pattern with this feature as we did in the previous exterior feature that we looked at.

Masonry Veneer Type

MasVnrType = Masonry veneer type

hist <- ggplot(eda_data) + 
  aes(reorder(x = MasVnrType, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Masonry Veneer Type', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
 #   axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(MasVnrType),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Masonry Veneer Type v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

It looks like Stone is a better veneer type than brick, but having no veneer is better than having common brick. Additionally, it looks like we have some NAs to take care of during our data prep stage.
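
As a sketch of how those NAs might be handled later, the snippet below counts the missing values and then recodes them. Treating a missing veneer type as "None" and a missing area as 0 is an assumption on my part, not something decided in this section; `toy` is made-up data.

```r
# Toy data (made-up) with missing veneer information
toy <- data.frame(
  MasVnrType = c("BrkFace", "None", NA, "Stone", NA, "BrkCmn"),
  MasVnrArea = c(196, 0, NA, 350, NA, 120),
  stringsAsFactors = FALSE
)

# Count missing values per column before changing anything
na_counts <- colSums(is.na(toy))
na_counts

# Assumption: a missing type means no veneer, and a missing area means 0 sq ft
toy$MasVnrType[is.na(toy$MasVnrType)] <- "None"
toy$MasVnrArea[is.na(toy$MasVnrArea)] <- 0

table(toy$MasVnrType)
```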

Masonry Veneer Area in Square Feet

MasVnrArea = Masonry veneer area in square feet

hist <- ggplot(eda_data ) + 
  aes(x = MasVnrArea) + 
  geom_histogram(fill = "steelblue", color = "black") +
  labs(title = 'Masonry Veneer Area', y = 'Count', x= 'Square Feet') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data ) + 
  aes(x = MasVnrArea, y = SalePrice,  color = MasVnrType) + 
  geom_point( alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Masonry Veneer Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

It looks like square footage matters more for stone veneer than for brick face in determining sale price.
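
One way to check this impression is to fit a separate line for each veneer type and compare the slopes. The sketch below does that on a made-up `toy` data frame whose values were chosen so the slopes are easy to read; it is not the Ames data.

```r
# Toy data (made-up): price rises faster with area for Stone than for BrkFace
toy <- data.frame(
  MasVnrType = rep(c("BrkFace", "Stone"), each = 4),
  MasVnrArea = c(100, 200, 300, 400, 100, 200, 300, 400),
  SalePrice  = c(150000, 160000, 170000, 180000,
                 180000, 220000, 260000, 300000)
)

# Fit SalePrice ~ MasVnrArea within each veneer type and keep only the slope
slopes <- sapply(split(toy, toy$MasVnrType), function(d) {
  coef(lm(SalePrice ~ MasVnrArea, data = d))[["MasVnrArea"]]
})

# Dollars of sale price per additional square foot of veneer, by type
slopes
```

A larger slope for one type would mean each extra square foot of that veneer is associated with a bigger change in sale price.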

Utilities

Utilities

Utilities = Type of utilities available

hist <- ggplot(eda_data) + 
  aes(reorder(x = Utilities, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Utilities Available', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Utilities),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Utilities Available v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
  #  axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Only one home doesn’t have public utilities and the sale price was lower than the median of the sale prices from those homes with access to public utilities.
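
When a single level dominates like this, it can be useful to quantify how dominant it is before deciding whether the column is worth keeping. The snippet below is a sketch on made-up data; `AllPub` and `NoSeWa` are level codes from the data description, but the counts in `toy` are invented.

```r
# Toy data (made-up): nearly every home has the same utilities level
toy <- data.frame(Utilities = c(rep("AllPub", 9), "NoSeWa"))

# Share of rows in each level
level_share <- prop.table(table(toy$Utilities))

# The most common level and the share of homes it covers
dominant_level <- names(level_share)[which.max(level_share)]
dominant_share <- max(level_share)

dominant_level
dominant_share
```

A share very close to 1 suggests the variable has almost no variation for a model to learn from.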

Central Air

CentralAir = Central air conditioning

hist <- ggplot(eda_data) + 
  aes(reorder(x = CentralAir, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Air Conditioning?', y = 'Count', x= 'Yes/No') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # axis.text.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(CentralAir),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Air Conditioning v Sale Price', y = 'Sale Price', x= 'Yes/No') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Thankfully almost every home has air conditioning, and as expected, these homes sell for considerably more.

Electrical

Electrical = Electrical system

hist <- ggplot(eda_data) + 
  aes(reorder(x = Electrical, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Electrical System', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # axis.text.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Electrical),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Electrical System v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

There is some pretty clear differentiation in home price between these electrical systems. The standard circuit breaker system is by far the most common, and homes with it have higher prices.

Heating

Heating = Type of heating

hist <- ggplot(eda_data) + 
  aes(reorder(x = Heating, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Type of Heating', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # axis.text.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Heating),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Type of Heating v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Almost every house has a gas forced warm air furnace. Those without tend to have lower home prices.

Heating Quality

HeatingQC = Heating quality and condition

hist <- ggplot(eda_data) + 
  aes(reorder(x = HeatingQC, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Heating Quality and Condition', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # axis.text.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(HeatingQC),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Heating Quality and Condition v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

The condition of your heating system does play a role in the house price. We can see that as the quality increases, so does the sale price.
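
Because the quality codes have a natural order, one option for modeling is to encode them as an ordered factor or as integers. The sketch below assumes the usual Po < Fa < TA < Gd < Ex ordering from the data description; the column names `HeatingQC_ord` and `HeatingQC_num` are hypothetical, and `toy` is made-up data.

```r
# Quality codes from worst to best (Poor, Fair, Typical/Average, Good, Excellent)
quality_levels <- c("Po", "Fa", "TA", "Gd", "Ex")

# Toy data (made-up)
toy <- data.frame(
  HeatingQC = c("Ex", "TA", "Gd", "Fa", "Ex", "Po"),
  stringsAsFactors = FALSE
)

# Ordered factor keeps the labels but records the ranking
toy$HeatingQC_ord <- factor(toy$HeatingQC, levels = quality_levels, ordered = TRUE)

# Integer version: 1 = Po ... 5 = Ex
toy$HeatingQC_num <- as.integer(toy$HeatingQC_ord)

toy
```

The integer version assumes the steps between quality levels are equally spaced, which may or may not hold for sale price.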

Other Features

Fence

Fence = Fence Quality

hist <- ggplot(eda_data) + 
  aes(reorder(x = Fence, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Fence Quality', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # axis.text.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Fence),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Fence Quality v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Apparently fences are not as important in Iowa as they are in other parts of the country. The majority of homes don’t have fences. When a home does have a fence, if it offers good privacy, then it will increase the selling price of the home; otherwise the quality doesn’t really matter. This is a weird variable. It really should almost be two separate variables: fence privacy and then good wood or not.
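
The split suggested above could look something like the sketch below. The new column names (`has_fence`, `fence_privacy`, `fence_good_wood`) are hypothetical, the level codes come from the data description (GdPrv, MnPrv, GdWo, MnWw, and NA for no fence), and `toy` is made-up data.

```r
# Toy data (made-up) covering each Fence level, including NA for "no fence"
toy <- data.frame(
  Fence = c("GdPrv", "MnPrv", "GdWo", "MnWw", NA),
  stringsAsFactors = FALSE
)

# Does the home have a fence at all?
toy$has_fence <- !is.na(toy$Fence)

# Privacy level: only the two privacy codes map to a value, everything else is "None"
privacy_map <- c(GdPrv = "Good", MnPrv = "Minimum")
toy$fence_privacy <- unname(privacy_map[toy$Fence])
toy$fence_privacy[is.na(toy$fence_privacy)] <- "None"

# Good wood or not; %in% returns FALSE for NA instead of propagating it
toy$fence_good_wood <- toy$Fence %in% "GdWo"

toy
```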

Fireplaces

Fireplaces = Number of fireplaces

hist <- ggplot(eda_data) + 
  aes(reorder(x = Fireplaces, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Number of Fireplaces', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # axis.text.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(Fireplaces),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Number of Fireplaces v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

The number of fireplaces looks like a good indicator of sale price. This may be because it is correlated with square footage and other important factors.

Fireplace Quality

FireplaceQu = Fireplace Quality

hist <- ggplot(eda_data) + 
  aes(reorder(x = FireplaceQu, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Fireplace Quality', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # axis.text.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(FireplaceQu),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Fireplace Quality v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

The quality of the fireplace seems to be correlated with sale price as well. The NAs here indicate that there is no fireplace, which looks like it is worse than having a fireplace in poor condition.

Miscellaneous Features Not Covered in Other Categories

MiscFeature = Miscellaneous feature not covered in other categories

hist <- ggplot(eda_data) + 
  aes(reorder(x = MiscFeature, SalePrice)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Miscellaneous Features', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # axis.text.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = reorder(factor(MiscFeature),SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Miscellaneous Features v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
   # axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

Owning a tennis court is definitely a good sign that your house is worth some money. All of these variables are fairly rare. Most homes do not have miscellaneous features and having them doesn’t really mean much for the sale price other than if you have a tennis court (which is one house).

Miscellaneous Features Value

MiscVal = $Value of miscellaneous feature

hist <- ggplot(eda_data ) + 
  aes(x = MiscVal) + 
  geom_histogram(fill = "steelblue", color = "black") +
  labs(title = 'Miscellaneous Feature Value', y = 'Count', x= '$') +
#  scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data ) + 
  aes(x = MiscVal, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Miscellaneous Feature Value vs Sale Price', y = 'Sale Price', x= '$') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
  
gridExtra::grid.arrange(hist, box, nrow=1)

It looks like the value doesn’t make much of a difference in the sale price. I wouldn’t expect these plots to show much since there are so few instances of this.

Date Features

Year Built

YearBuilt = Original construction date

hist <- ggplot(eda_data) + 
  aes(x = as.factor(YearBuilt)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Year Built', y = 'Count', x= 'Year') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = factor(YearBuilt), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Year Built v Sale Price', y = 'Sale Price', x= 'Year') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, ncol=1)

Very interesting plot! We see quite a bit of variation here. Some of it is due to there only being a handful of houses in each year. One thing is pretty clear: newer homes tend to have higher selling prices than older homes. As a note, year built should be a categorical variable as opposed to an integer. We’ll update this in a later stage.

Year Remodeled

YearRemodAdd = Remodel date

hist <- ggplot(eda_data) + 
  aes(x = as.factor(YearRemodAdd)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Year Remodeled', y = 'Count', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = factor(YearRemodAdd), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Year Remodeled v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, ncol=1)

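As we would expect, homes that were remodeled more recently have higher sale prices than those that were remodeled many years ago. It’s also interesting to see the large spike in 1950. Since 1950 is also the earliest remodel year in the data, this spike may reflect how older remodel dates were recorded rather than a real wave of remodels that year.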

Year Sold

YrSold = Year Sold

hist <- ggplot(eda_data) + 
  aes(x = as.factor(YrSold)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Year Sold', y = 'Count', x= 'Year') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = factor(YrSold), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Year Sold v Sale Price', y = 'Sale Price', x= 'Year') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

It doesn’t look like the year sold has any real effect on sale price, which is incredibly interesting, especially since this data spans the financial crisis that began in 2007. You would think that home prices would have dropped substantially as people stopped selling their homes and the market was flooded with people defaulting on their loans.

Month Sold

MoSold = Month Sold

hist <- ggplot(eda_data) + 
  aes(x = as.factor(MoSold)) + 
  geom_bar(stat = "count", fill = 'steelblue') +
  geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) + 
  labs(title = 'Month Sold', y = 'Count', x= 'Month') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    axis.text.y = element_blank(),
    axis.title.y= element_blank(),
    axis.title.x= element_blank(),
    panel.grid.major.y =  element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  ) 

box <- ggplot(eda_data) + 
  aes(x = factor(MoSold), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Month Sold v Sale Price', y = 'Sale Price', x= 'Month') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )


gridExtra::grid.arrange(hist, box, nrow=1)

While there is some variation in the box plot above with respect to the different months, it doesn’t look like there is a clear relationship between sale price and the month a home was sold.

Data Cleaning and Data Encoding

From our extensive exploratory data analysis (EDA) we learned A LOT! Now we need to take our findings and clean our data so that they can best be utilized by the model. During our EDA we saw that we have lots of ordinal categorical variables that will need to be encoded. Additionally, there are some NA values that we’ll need to take care of as well as some outliers.

Outliers

When working with OverallQual we identified an outlier in Category 4 that we’ll need to remove. We’ll do that now:

#remove from our full dataset
homes <- homes %>% 
  filter(Id != 458)
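
One way to surface candidates like this programmatically is a 1.5 × IQR rule, similar in spirit to the one boxplots use for their outlier points, applied within each quality category. The sketch below is only an illustration on made-up data: the `toy` data frame and the `flag_outliers` helper are hypothetical and not part of the analysis above, and it uses base R so it runs without extra packages. Which flagged points to actually drop remains a judgment call.

```r
# Hypothetical toy data: two quality categories, one obviously odd price
toy <- data.frame(
  Id          = 1:9,
  OverallQual = c(4, 4, 4, 4, 4, 5, 5, 5, 5),
  SalePrice   = c(100000, 105000, 110000, 108000, 250000,
                  130000, 135000, 140000, 138000)
)

# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# ave() applies the helper within each OverallQual group;
# it returns 0/1, so convert back to logical
toy$is_outlier <- as.logical(
  ave(toy$SalePrice, toy$OverallQual, FUN = flag_outliers)
)

# Rows flagged as outliers within their own category
toy[toy$is_outlier, ]
```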

Missing Values

Garage Features

Among other NAs, we have 1 NA in GarageCars and 1 in GarageArea. Let’s explore these values and see if they are something we need to remove or adjust. Both nulls are found in the same row of our test set (Id 2577).

homes %>% 
  filter(is.na(GarageCars) & is.na(GarageArea)) %>%
  select('GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')
## # A tibble: 1 x 6
##   GarageCars GarageArea GarageType GarageCond GarageQual GarageFinish
##        <dbl>      <dbl> <chr>      <chr>      <chr>      <chr>       
## 1         NA         NA Detchd     <NA>       <NA>       <NA>

As we can see, all of the garage features are missing except for GarageType. As all other observations in the dataset have a value for GarageCars and GarageArea, I think this may be an input error. We’ll make an assumption here and say that this house does not have a garage, and adjust accordingly:

homes <- homes %>% 
  mutate(GarageCars = case_when( Id == 2577 & is.na(GarageCars) ~ 0, TRUE ~ GarageCars)) %>% 
  mutate(GarageArea = case_when( Id == 2577 & is.na(GarageArea) ~ 0, TRUE ~ GarageArea)) %>%
  mutate(GarageType = case_when( Id == 2577 & GarageType == 'Detchd' ~ NA_character_, TRUE ~ GarageType)) 

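When working with the garage data, we noted many NA values that were actually homes without garages. Before changing these NAs to ‘None’ so that they aren’t imputed when we fill null values later, let’s confirm that every row with all of the categorical garage features missing also has zero garage cars and zero garage area: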

homes %>% filter(
  is.na(GarageType) & is.na(GarageYrBlt) & is.na(GarageFinish) & is.na(GarageQual) & is.na(GarageCond)) %>%
    select('GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish') %>% dplyr::count(GarageCars, GarageArea)
## # A tibble: 1 x 3
##   GarageCars GarageArea     n
##        <dbl>      <dbl> <int>
## 1          0          0   158

In looking at the above, we can clearly see that all the NA values in these columns occur because the house does not have a garage. We can easily clean these up now:

homes <- homes %>% 
  mutate( GarageType = case_when( is.na(GarageType) ~ 'None', TRUE ~ GarageType)) %>% 
  mutate( GarageYrBlt = case_when( is.na(GarageYrBlt) ~ YearBuilt, TRUE ~ GarageYrBlt)) %>% #defaulting to YearBuilt when there is no garage year built date
  mutate(GarageFinish = case_when( is.na(GarageFinish) ~ 'None', TRUE ~ GarageFinish)) %>%
  mutate(GarageQual = case_when( is.na(GarageQual) ~ 'None', TRUE ~ GarageQual)) %>%
  mutate(GarageCond = case_when( is.na(GarageCond) ~ 'None', TRUE ~ GarageCond))

Now, let’s ensure we’ve filled all nulls in the garage features:

colSums(is.na(homes %>% select(contains('Garage'))))
##   GarageType  GarageYrBlt GarageFinish   GarageCars   GarageArea   GarageQual 
##            0            0            0            0            0            0 
##   GarageCond 
##            0

Basement Features

In looking at the NAs within the basement features, we saw before during our analysis that this was due to there being no basement in the house. We’ll leave most of the imputing to be done by R when we get to the modeling section, but we will set any NAs that we don’t want imputed to a value. Most of the NAs in the basement columns occur because there is no basement. There are 9 cases, shown below, where this is not true:

homes %>% 
  filter((!is.na(BsmtFinType1) & (is.na(BsmtCond)|is.na(BsmtQual)|is.na(BsmtExposure)|is.na(BsmtFinType2)))) %>% select('Id', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2')
## # A tibble: 9 x 6
##      Id BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
##   <dbl> <chr>    <chr>    <chr>        <chr>        <chr>       
## 1   333 Gd       TA       No           GLQ          <NA>        
## 2   949 Gd       TA       <NA>         Unf          Unf         
## 3  1488 Gd       TA       <NA>         Unf          Unf         
## 4  2041 Gd       <NA>     Mn           GLQ          Rec         
## 5  2186 TA       <NA>     No           BLQ          Unf         
## 6  2218 <NA>     Fa       No           Unf          Unf         
## 7  2219 <NA>     TA       No           Unf          Unf         
## 8  2349 Gd       TA       <NA>         Unf          Unf         
## 9  2525 TA       <NA>     Av           ALQ          Unf

We’ll let the imputation engine impute those for us. Every other NA in BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, and BsmtFinType2 we’ll set to ‘None’:

homes$BsmtQual <- homes$BsmtQual %>% replace_na('None')
homes$BsmtCond <- homes$BsmtCond %>% replace_na('None')
homes$BsmtExposure <- homes$BsmtExposure %>% replace_na('None')
homes$BsmtFinType1 <- homes$BsmtFinType1 %>% replace_na('None')
homes$BsmtFinType2 <- homes$BsmtFinType2 %>% replace_na('None')
homes <- homes %>% 
  mutate(BsmtQual = case_when( Id %in% c(2218,2219) ~ NA_character_, TRUE ~ BsmtQual)) %>%
  mutate(BsmtCond = case_when( Id %in% c(2041,2186,2525) ~ NA_character_, TRUE ~ BsmtCond)) %>%
  mutate(BsmtExposure = case_when( Id %in% c(949,1488,2349) ~ NA_character_, TRUE ~ BsmtExposure)) %>%
  mutate(BsmtFinType2 = case_when( Id %in% c(333) ~ NA_character_, TRUE ~ BsmtFinType2))
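
As an alternative to hard-coding the Ids, the same intent can be sketched by filling ‘None’ only in rows where every basement column is missing. This is a minimal base R illustration on made-up rows (the `toy` data frame is hypothetical), and it assumes that a house has no basement exactly when all of these columns are NA, which may not match the Id-based approach above in every edge case.

```r
# Hypothetical toy rows: row 1 has no basement at all, row 2 is missing
# only BsmtCond, row 3 is fully populated
toy <- data.frame(
  Id           = c(1, 2, 3),
  BsmtQual     = c(NA, "Gd", "TA"),
  BsmtCond     = c(NA, NA, "TA"),
  BsmtExposure = c(NA, "Mn", "No"),
  stringsAsFactors = FALSE
)

bsmt_cols <- c("BsmtQual", "BsmtCond", "BsmtExposure")

# TRUE only when every basement column in the row is NA
no_basement <- rowSums(!is.na(toy[bsmt_cols])) == 0

# Fill 'None' for those rows; partial NAs are left for imputation
toy[no_basement, bsmt_cols] <- "None"
toy
```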

Lot Features

Alley

When Alley is NA it represents the house not having access to an alley. We’ll encode these NAs as None.

homes$Alley <- homes$Alley %>% replace_na('None')

Lot Frontage

homes %>% filter(is.na(LotFrontage)) %>% select(Id, LotFrontage, LotArea, LotShape, LandContour)
## # A tibble: 485 x 5
##       Id LotFrontage LotArea LotShape LandContour
##    <dbl>       <dbl>   <dbl> <chr>    <chr>      
##  1     8          NA   10382 IR1      Lvl        
##  2    13          NA   12968 IR2      Lvl        
##  3    15          NA   10920 IR1      Lvl        
##  4    17          NA   11241 IR1      Lvl        
##  5    25          NA    8246 IR1      Lvl        
##  6    32          NA    8544 IR1      Lvl        
##  7    43          NA    9180 IR1      Lvl        
##  8    44          NA    9200 IR1      Lvl        
##  9    51          NA   13869 IR2      Lvl        
## 10    65          NA    9375 Reg      Lvl        
## # ... with 475 more rows

In reading the data documentation about lot frontage, it does not say that NAs are a house with NO lot frontage. Additionally, in other numeric variables, when a house does not have a feature we see 0 rather than NA. There are 485 rows missing lot frontage. The best method to fill these would be imputation. If we were doing it by hand, we might take the median lot frontage of the neighborhood each house is in, but we’ll most likely use KNN to fill these values.
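
For reference, the manual neighborhood-median approach could look roughly like the sketch below. It is not what we end up using. The `toy` data frame and its values are made up for illustration, and the code is base R so it runs on its own.

```r
# Hypothetical toy data, not the real Ames values
toy <- data.frame(
  Neighborhood = c("A", "A", "A", "B", "B", "B"),
  LotFrontage  = c(60, NA, 80, 40, 50, NA)
)

# Median LotFrontage within each neighborhood, ignoring NAs
nbhd_median <- ave(
  toy$LotFrontage, toy$Neighborhood,
  FUN = function(x) median(x, na.rm = TRUE)
)

# Replace only the missing entries with their neighborhood median
toy$LotFrontage_filled <- ifelse(
  is.na(toy$LotFrontage), nbhd_median, toy$LotFrontage
)
toy
```

A neighborhood whose values are all missing would still come back as NA here, which is one reason a model-based method such as KNN can be more convenient.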

Fence

Here we read in the documentation that NA is when a house has no fence. We’ll encode all NAs here as ‘None’.

homes$Fence <- homes$Fence %>% replace_na('None')

FireplaceQu

Again we read that NA means a home with no fireplace. We’ll encode all NAs as ‘None’ as we’ve done before.

homes$FireplaceQu <- homes$FireplaceQu %>% replace_na('None')

MiscFeature

homes$MiscFeature <- homes$MiscFeature %>% replace_na('None')

PoolQC

There are three instances where a home has a pool area but not a quality rating. We’ll impute these in our modeling section. For all others, where PoolQC is blank, we’ll fill with ‘None’.

homes %>% 
  filter(PoolArea != 0 & is.na(PoolQC)) %>%
  select(Id, PoolArea, PoolQC)
## # A tibble: 3 x 3
##      Id PoolArea PoolQC
##   <dbl>    <dbl> <chr> 
## 1  2421      368 <NA>  
## 2  2504      444 <NA>  
## 3  2600      561 <NA>
homes$PoolQC <- homes$PoolQC %>% replace_na('None')

homes <- homes %>% 
  mutate(PoolQC = case_when( Id %in% c(2421,2504,2600) ~ NA_character_, TRUE ~ PoolQC)) 

Let’s take a final look at the null values and make sure all the null values we see remaining are those we have intentionally left so that they can be imputed during the modeling stage:

colSums(is.na(homes))
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             4           485             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0             0             0             0             2 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             1             1 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##            24            23             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##             2             3             3             0             1 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##             1             1             1             1             0 
##     HeatingQC    CentralAir    Electrical      1stFlrSF      2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             2             2             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             1             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             2             0             0             0             0 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##             0             0             0             0             0 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch     3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0             3             0             0 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             1             0 
##     SalePrice       dataset 
##          1459             0
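
With this many columns, it can be easier to look only at the columns that still contain NAs. Here is a small sketch of that filter on a made-up data frame (`toy` is hypothetical and stands in for `homes`):

```r
# Hypothetical toy data frame standing in for `homes`
toy <- data.frame(
  Id          = 1:4,
  LotFrontage = c(65, NA, 68, NA),
  PoolQC      = c("None", NA, "None", "None"),
  LotArea     = c(8450, 9600, 11250, 9550)
)

na_counts <- colSums(is.na(toy))

# Keep only columns with at least one NA, largest count first
remaining <- sort(na_counts[na_counts > 0], decreasing = TRUE)
remaining
```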

Looks great! We’ll move on to encoding our ordinal categorical variables.

Encoding Variables

We have many categorical variables that are actually ordinal. We’ll adjust those here.

We’ll make a qualities vector that will be used multiple times to encode our variables:

qualities <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
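
To make the mapping concrete, here is a small sketch of how a named vector like this turns labels into integer codes. It uses plain base R name lookup rather than `plyr::revalue`, and the `ratings` vector is made up for illustration.

```r
qualities <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)

# Hypothetical example ratings, including a missing value
ratings <- c("Gd", "TA", "None", NA, "Ex")

# Indexing a named vector by name looks up each label's numeric code;
# NA stays NA, so values left for imputation are preserved
encoded <- as.integer(unname(qualities[ratings]))
encoded
```

One thing to keep in mind with this lookup style is that a label missing from `qualities` (a typo, for example) would also come back as NA without any message.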

Basement Features

BsmtQual

homes$BsmtQual <- as.integer(plyr::revalue(homes$BsmtQual, qualities))
## The following `from` values were not present in `x`: Po
table(homes$BsmtQual)
## 
##    0    2    3    4    5 
##   79   88 1283 1208  258

BsmtCond

homes$BsmtCond <- as.integer(plyr::revalue(homes$BsmtCond, qualities))
## The following `from` values were not present in `x`: Ex
table(homes$BsmtCond)
## 
##    0    1    2    3    4 
##   79    5  104 2605  122

BsmtExposure

exposure <- c('None' = 0, 'No' = 1, 'Mn' = 2, 'Av' = 3, 'Gd' = 4)

homes$BsmtExposure <- as.integer(plyr::revalue(homes$BsmtExposure, exposure))
table(homes$BsmtExposure)
## 
##    0    1    2    3    4 
##   79 1904  239  418  275

BsmtFinType1

homes$BsmtFinType1 <- as.factor(homes$BsmtFinType1)

BsmtFinType2

homes$BsmtFinType2 <- as.factor(homes$BsmtFinType2)

Garage Features

GarageQual

homes$GarageQual <- as.integer(plyr::revalue(homes$GarageQual, qualities))
table(homes$GarageQual)
## 
##    0    1    2    3    4    5 
##  159    5  124 2603   24    3

GarageCond

homes$GarageCond <- as.integer(plyr::revalue(homes$GarageCond, qualities))
table(homes$GarageCond)
## 
##    0    1    2    3    4    5 
##  159   14   74 2653   15    3

GarageFinish

finish <- c('None' = 0, 'Unf' = 1, 'RFn' = 2, 'Fin' = 3)

homes$GarageFinish <- as.integer(plyr::revalue(homes$GarageFinish, finish))
table(homes$GarageFinish)
## 
##    0    1    2    3 
##  159 1230  811  718

GarageType

homes$GarageType <- as.factor(homes$GarageType)

GarageYrBlt

homes$GarageYrBlt <- as.factor(homes$GarageYrBlt)

Exterior Features

ExterCond

homes$ExterCond <- as.integer(plyr::revalue(homes$ExterCond, qualities))
## The following `from` values were not present in `x`: None
table(homes$ExterCond)
## 
##    1    2    3    4    5 
##    3   67 2537  299   12

ExterQual

homes$ExterQual <- as.integer(plyr::revalue(homes$ExterQual, qualities))
## The following `from` values were not present in `x`: None, Po
table(homes$ExterQual)
## 
##    2    3    4    5 
##   35 1797  979  107

Exterior1st

homes$Exterior1st  <- as.factor(homes$Exterior1st )

Exterior2nd

homes$Exterior2nd  <- as.factor(homes$Exterior2nd )

MasVnrType

homes$MasVnrType   <- as.factor(homes$MasVnrType  )

Condition Features

Functional

functions <- c('Sal' = 0, 'Sev' = 1, 'Maj2' = 2, 'Maj1' = 3, 'Mod' = 4, 'Min2' = 5, 'Min1' = 6, 'Typ' = 7)

homes$Functional <- as.integer(plyr::revalue(homes$Functional, functions))
## The following `from` values were not present in `x`: Sal
table(homes$Functional)
## 
##    1    2    3    4    5    6    7 
##    2    9   19   35   70   64 2717

SaleCondition

homes$SaleCondition <- as.factor(homes$SaleCondition)

Condition1

homes$Condition1 <- as.factor(homes$Condition1)

Condition2

homes$Condition2 <- as.factor(homes$Condition2)

House Style

homes$HouseStyle <- as.factor(homes$HouseStyle)

Kitchen Features

homes$KitchenQual <- as.integer(plyr::revalue(homes$KitchenQual, qualities))
## The following `from` values were not present in `x`: None, Po
table(homes$KitchenQual)
## 
##    2    3    4    5 
##   70 1492 1150  205

Lot Features

MSSubClass

homes$MSSubClass <- as.factor(homes$MSSubClass)

LotShape

shape <- c('IR3' = 0, 'IR2' = 1, 'IR1' = 2, 'Reg' = 3)

homes$LotShape <- as.integer(plyr::revalue(homes$LotShape, shape))
table(homes$LotShape)
## 
##    0    1    2    3 
##   16   76  967 1859

LandSlope

slope <- c('Sev' = 0, 'Mod' = 1, 'Gtl' = 2)

homes$LandSlope <- as.integer(plyr::revalue(homes$LandSlope, slope))
table(homes$LandSlope)
## 
##    0    1    2 
##   16  124 2778

Neighborhood

homes$Neighborhood <- as.factor(homes$Neighborhood)

MSZoning

homes$MSZoning <- as.factor(homes$MSZoning)

BldgType

homes$BldgType  <- as.factor(homes$BldgType )

LotConfig

homes$LotConfig  <- as.factor(homes$LotConfig )

Foundation

homes$Foundation  <- as.factor(homes$Foundation )

LandContour

homes$LandContour  <- as.factor(homes$LandContour )

Street

homes$Street  <- as.factor(homes$Street )

PavedDrive

homes$PavedDrive  <- as.factor(homes$PavedDrive )

Alley

homes$Alley  <- as.factor(homes$Alley )

Roof Features

RoofMatl

homes$RoofMatl  <- as.factor(homes$RoofMatl )

RoofStyle

homes$RoofStyle  <- as.factor(homes$RoofStyle)

Pool Features

homes$PoolQC <- as.integer(plyr::revalue(homes$PoolQC, qualities))
## The following `from` values were not present in `x`: Po, TA
table(homes$PoolQC)
## 
##    0    2    4    5 
## 2905    2    4    4

Utility Features

HeatingQC

homes$HeatingQC <- as.integer(plyr::revalue(homes$HeatingQC, qualities))
## The following `from` values were not present in `x`: None
table(homes$HeatingQC)
## 
##    1    2    3    4    5 
##    3   92  857  474 1492

Utilities

homes$Utilities <- as.factor(homes$Utilities)

CentralAir

homes$CentralAir <- as.factor(homes$CentralAir)

Electrical

homes$Electrical <- as.factor(homes$Electrical)

Heating

homes$Heating <- as.factor(homes$Heating)

Other Features

homes$FireplaceQu <- as.integer(plyr::revalue(homes$FireplaceQu, qualities))
table(homes$FireplaceQu)
## 
##    0    1    2    3    4    5 
## 1420   46   74  592  743   43

Fence

homes$Fence <- as.factor(homes$Fence)

MiscFeature

homes$MiscFeature  <- as.factor(homes$MiscFeature )

Date Features

YearBuilt

homes$YearBuilt <- as.factor(homes$YearBuilt)

YearRemodAdd

homes$YearRemodAdd <- as.factor(homes$YearRemodAdd)

YrSold

homes$YrSold <- as.factor(homes$YrSold)

MoSold

homes$MoSold <- as.factor(homes$MoSold)

Feature Engineering

Age Feature

I want to add a feature that tells us how long it has been since the house was updated, if ever. I will subtract the year remodeled from the year sold. The year remodeled column defaults to the year built if there was no remodel.

homes <- homes %>% 
  mutate(Age = as.integer(as.character(YrSold)) - as.integer(as.character(YearRemodAdd))) 
ggplot(homes %>% filter(dataset == 'train')) + 
  aes(x = Age, y = SalePrice) +
   geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Age Since Last Update vs Sale Price', y = 'Sale Price', x= 'Age') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 
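
The `as.character()` step in the Age calculation matters because both year columns were converted to factors earlier. Calling `as.integer()` directly on a factor returns its internal level codes rather than the years. A short illustration with made-up values (`yr_sold` and `yr_remod` are hypothetical):

```r
# Hypothetical years stored as a factor, as YrSold is after encoding
yr_sold <- factor(c(2008, 2006, 2010))

# as.integer() on a factor returns the internal level codes, not the years
codes <- as.integer(yr_sold)
codes

# Converting to character first recovers the actual years
years <- as.integer(as.character(yr_sold))

# Hypothetical remodel years for the same three homes
yr_remod <- c(2003, 1976, 2010)
age <- years - yr_remod
age
```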

Proximity to Positive, Normal, and Negative Conditions

The Condition1 column tells you if a home is close to certain conditions, but doesn’t say if the conditions are positive or negative. This feature helps to clarify that.

homes <- homes %>% mutate(Pos_Neg_Conditions = case_when( Condition1 == 'Norm' ~ 'Normal',
                                                 Condition1 %in% c('PosN', 'PosA') ~ 'Positive',
                                                 TRUE ~ 'Negative') ) 
condit <- c('Negative' = 0, 'Normal' = 1, 'Positive' = 2)

homes$Pos_Neg_Conditions <- as.integer(plyr::revalue(homes$Pos_Neg_Conditions, condit))

ggplot(homes %>% filter(dataset == 'train')) + 
  aes(x = as.factor(Pos_Neg_Conditions), y = SalePrice) + 
   geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Positive and Negative Conditions v Sale Price', y = 'Sale Price', x= 'Type') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )

Total Square Feet Feature

There is not a total square feet feature - it is broken into above ground and basement. Let’s add the above-ground living area and the total basement area together to get total square footage.

homes <- homes %>% 
  mutate(TotalSF = GrLivArea + TotalBsmtSF) 

ggplot(homes %>% filter(dataset == 'train')) + 
  aes(x = TotalSF, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Total Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

Total Bathrooms

We have separate bathroom counts for the basement and for above ground. This feature adds them together, counting each half bath as 0.5.

homes <- homes %>%
  mutate(TotalBathrooms = FullBath + (HalfBath *0.5) + BsmtFullBath + (BsmtHalfBath * 0.5))
ggplot(homes %>% filter(dataset == 'train')) + 
  aes(x = as.factor(TotalBathrooms), y = SalePrice) + 
   geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Total Bathrooms v Sale Price', y = 'Sale Price', x= '#') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )
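
As a quick sanity check of the arithmetic, here is the same formula applied to three made-up homes. The counts are hypothetical; only the column names come from the dataset.

```r
# Hypothetical bathroom counts for three homes
baths <- data.frame(
  FullBath     = c(2, 1, 3),
  HalfBath     = c(1, 0, 1),
  BsmtFullBath = c(1, 0, 0),
  BsmtHalfBath = c(0, 1, 1)
)

# Full baths count as 1, half baths as 0.5, above and below ground
baths$TotalBathrooms <- with(
  baths,
  FullBath + 0.5 * HalfBath + BsmtFullBath + 0.5 * BsmtHalfBath
)

print(baths)
```

The first home, for example, has 2 full baths, 1 half bath, and 1 basement full bath, which gives 2 + 0.5 + 1 = 3.5.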

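Garage Age Built Category

This feature bins GarageYrBlt into three age bands: garages built before 1948 (‘Old’), from 1948 through 1985 (‘Average’), and in 1986 or later (‘New’).

homes <- homes %>% mutate(garage_age_category = ifelse(as.numeric(as.character(GarageYrBlt)) < 1948, 'Old', 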
                                             ifelse(as.numeric(as.character(GarageYrBlt)) < 1986, 'Average', 'New'))) 

age <- c('Old' = 0, 'Average' = 1, 'New' = 2)

homes$garage_age_category <- as.integer(plyr::revalue(homes$garage_age_category, age))
ggplot(homes %>% filter(dataset == 'train')) + 
  aes(x = reorder(as.factor(garage_age_category), SalePrice), y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Garage Age Category v Sale Price', y = 'Sale Price', x= 'Category') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )
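
The nested ifelse() above can also be written with cut(), which states the break points once. This is a sketch on made-up years, assuming the same cutoffs (1948 and 1986); `garage_year` and `garage_age_score` are hypothetical names.

```r
# Hypothetical garage build years, chosen to sit on either side of each cutoff
garage_year <- c(1920, 1947, 1948, 1985, 1986, 2005)

# right = FALSE makes the intervals [-Inf, 1948), [1948, 1986), [1986, Inf),
# which matches the "< 1948" and "< 1986" tests in the ifelse() version
garage_age_category <- cut(
  garage_year,
  breaks = c(-Inf, 1948, 1986, Inf),
  labels = c("Old", "Average", "New"),
  right = FALSE
)

# Factor codes start at 1, so subtract 1 to get the 0/1/2 encoding
garage_age_score <- as.integer(garage_age_category) - 1L

print(data.frame(garage_year, garage_age_category, garage_age_score))
```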

Neighborhood Consolidation

As described in the analysis section, we’ll bin some of the neighborhoods into subgroups.

homes <- homes %>% mutate( Neighborhood_class = case_when(
                      Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale') ~ 'Poor',
                      Neighborhood %in% c('BrkSide', 'Edwards', 'OldTown', 'Sawyer', 'Blueste', 'SWISU', 'NPkVill', 'NAmes', 'Mitchel') ~ 'Lower-Middle',
                      Neighborhood %in% c('SawyerW', 'NWAmes', 'Gilbert', 'Blmngtn', 'CollgCr', 'Crawfor', 'ClearCr', 'Somerst', 'Veenker', 'Timber') ~ 'Middle',
                                  TRUE ~ 'Rich'))
class <- c('Poor' = 0, 'Lower-Middle' = 1, 'Middle' = 2, 'Rich' = 3)

homes$Neighborhood_class <- as.integer(plyr::revalue(homes$Neighborhood_class, class))

See the plot in the analysis section to see the change.

Total Porch Area

We have quite a few porch area features but don’t have a total porch feature. If you recall from our analysis, the porch area didn’t really affect the sale price, so I’m not expecting big things from this new feature. We’ll build this now:

homes <- homes %>% 
  mutate(TotalPorch = WoodDeckSF + OpenPorchSF + EnclosedPorch + `3SsnPorch` + ScreenPorch)
ggplot(homes %>% filter(dataset == 'train')) + 
  aes(x = TotalPorch, y = SalePrice) + 
  geom_point(color = 'steelblue', alpha = 0.35) +
#  geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
  geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Total Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    plot.subtitle = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

New House Indicator

I want an indicator that flags whether the house was sold in the same year it was built. Generally, new houses sell for more than houses that have been lived in.

homes <- homes %>% 
  mutate(NewHouse = ifelse(as.numeric(as.character(YearBuilt)) == as.numeric(as.character(YrSold)), 'Yes', 'No'))

homes$NewHouse <- as.factor(homes$NewHouse)
ggplot(homes %>% filter(dataset == 'train')) + 
  aes(x = NewHouse, y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'New House v Sale Price', y = 'Sale Price', x= 'Yes/No') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )

Remodeled Flag

The remodel column, YearRemodAdd, is not actually a flag: it holds a year, and it defaults to the year the home was built if the home was never remodeled. We need to create a flag that shows whether the home has been remodeled or not.

homes <- homes %>% 
  mutate(remodeled = ifelse(as.numeric(as.character(YearRemodAdd)) == as.numeric(as.character(YearBuilt)), 'No', 'Yes'))

homes$remodeled <- as.factor(homes$remodeled)
ggplot(homes %>% filter(dataset == 'train')) + 
  aes(x = remodeled, y = SalePrice) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
  scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
  labs(title = 'Remodeled v Sale Price', y = 'Sale Price', x= 'Yes/No') +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
     axis.title.x= element_blank(),
    axis.text.x = element_text(angle = 90),
    axis.ticks.x = element_line(color = "grey")
  )

It actually appears that remodeled homes often have a lower selling price. A likely explanation is that homes that have never been remodeled tend to be newer builds, and newer homes sell for more.
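
One way to probe that explanation would be to compare the build years of the two groups. The sketch below does this on a small made-up data frame (the years are hypothetical), so it shows the check itself rather than a result from the Ames data.

```r
# Hypothetical homes: the first two were remodeled after construction
toy <- data.frame(
  YearBuilt    = c(1950, 1965, 1972, 1998, 2004, 2007),
  YearRemodAdd = c(1995, 2002, 1972, 1998, 2004, 2007)
)

# Same rule as above: equal years means the home was never remodeled
toy$remodeled <- ifelse(toy$YearRemodAdd == toy$YearBuilt, "No", "Yes")

# Median build year within each group
median_built <- tapply(toy$YearBuilt, toy$remodeled, median)
print(median_built)
```

If the real data showed a much later median build year for the ‘No’ group, that would support the idea that the remodeled flag is partly standing in for age.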

Preprocessing with Recipes and Building our Model

Let’s split our data back out into our train and test sets:

train <- homes %>% filter(dataset == 'train')

test <- homes %>% filter(dataset == 'test')

In order to evaluate our model before submitting, we’ll need to further split our training data into a train/test split:

set.seed(123)

train_split <- initial_split(train, prop = 0.80)

reg_train <- training(train_split)

reg_test <- testing(train_split)
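
Conceptually, an 80/20 split just draws a random 80% of the row indices for training and keeps the rest for testing. Here is a base R sketch of that idea on a made-up data frame of 100 rows. It will not reproduce the exact rows that initial_split() selects; `toy`, `toy_train`, and `toy_test` are hypothetical names.

```r
set.seed(123)

n <- 100                                  # hypothetical number of rows
toy <- data.frame(Id = seq_len(n))

# Draw 80% of the row indices without replacement
train_idx <- sample(n, size = floor(0.80 * n))

toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]

cat("train rows:", nrow(toy_train), " test rows:", nrow(toy_test), "\n")
```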

Next, we’ll initialize our model:

lm_model <- linear_reg() %>% 
  set_engine('lm') %>% 
  set_mode('regression')

Now that we have our data split out again, let’s move forward with preprocessing. Using the tidymodels framework, this is straightforward and very readable. We will do the following:

  • remove the ‘dataset’ column and the other columns listed in step_rm()
  • assign missing values in categorical columns to an ‘unknown’ level
  • create dummy variables
  • apply a near-zero variance filter (this drops columns that are constant or nearly constant, such as dummy variables for factor levels that are rare in the training data)
  • normalize the data, which centers and scales it (the separate centering and scaling steps that follow in the recipe are therefore redundant, but harmless)
  • filter out any features with correlations with each other higher than 0.90
  • impute missing values using the KNN algorithm with 3 nearest neighbors
  • remove any columns that are linear combinations of each other
  • log transform (base 10) our response variable

reg_recipe <- 
  recipe(SalePrice ~ ., data = reg_train) %>%
  update_role(Id, new_role = "ID") %>%
  step_rm(dataset, WoodDeckSF , OpenPorchSF , EnclosedPorch , `3SsnPorch` , ScreenPorch , Neighborhood, Functional, LowQualFinSF, BsmtFinSF2, BsmtHalfBath, BsmtFinType2, KitchenAbvGr, LotConfig, PoolArea, MiscVal) %>%
  step_unknown(all_predictors(), -all_numeric()) %>% 
  step_dummy(all_nominal()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric(), -Id, -all_outcomes()) %>%
  step_center(all_numeric(), -Id, -all_outcomes()) %>%
  step_scale(all_numeric(),-Id,  -all_outcomes()) %>%
  step_corr(all_numeric(), -Id, threshold = 0.90) %>%
  step_knnimpute(all_predictors(), neighbors = 3) %>%
  step_lincomb(all_numeric(), -Id, -all_outcomes()) %>%
  step_log(all_outcomes(), base = 10)  # skip is left at its default (FALSE), so the held-out SalePrice is logged too
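
To see why centering and scaling again after normalizing changes nothing, here is a base R sketch on a made-up numeric column. It assumes normalization means subtracting the mean and dividing by the standard deviation; `x`, `normalized`, and `again` are hypothetical names.

```r
# Hypothetical numeric predictor (e.g. square footage)
x <- c(1200, 1500, 1800, 2400, 3100)

# Normalizing = centering (subtract the mean) then scaling (divide by the sd)
centered   <- x - mean(x)
normalized <- centered / sd(x)

# Centering and scaling a second time: the mean is already 0 and the sd is
# already 1, so the values come back unchanged
again <- (normalized - mean(normalized)) / sd(normalized)

print(round(normalized, 3))
print(isTRUE(all.equal(normalized, again)))
```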

Next, we’ll build our workflow and add our model and our recipe:

reg_workflow <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(reg_recipe)

Now we’ll use last_fit(), which fits our model on the training portion of our split and evaluates it on the held-out portion:

reg_fit <- reg_workflow %>% 
  last_fit(split = train_split)

Now that our model is fit, we can look at our model metrics:

reg_fit %>% collect_metrics()
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0543
## 2 rsq     standard      0.908

Our root mean squared error (RMSE) is ~0.0543 and our \({ R }^{ 2 }\) value is ~0.90798. What does this mean? The RMSE is hard to interpret directly because we have log transformed our response variable SalePrice. What we can say is that our predictions are typically off by about 0.054 on the log10 scale. I am far more interested in the \({ R }^{ 2 }\) value, which tells us how much of the variation in (log-transformed) SalePrice our model explains. Our model explains about 91% of that variation, which is very good (at least on this held-out 20% of the training data)!
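
Both metrics are easy to reproduce by hand. The sketch below uses made-up log10 prices (not our actual predictions) and assumes, as I understand it, that the rsq metric reported here is the squared correlation between predictions and actual values. The last two lines translate our RMSE of 0.0543 from the log10 scale into a multiplicative factor on the dollar scale.

```r
# Hypothetical log10 sale prices and model predictions
truth <- c(5.10, 5.25, 5.32, 5.05, 5.48)
pred  <- c(5.14, 5.20, 5.35, 5.10, 5.40)

# RMSE: square root of the mean squared difference
rmse <- sqrt(mean((truth - pred)^2))

# R-squared as the squared correlation between truth and prediction
rsq <- cor(truth, pred)^2

cat("rmse:", round(rmse, 4), " rsq:", round(rsq, 4), "\n")

# A log10 error of 0.0543 is a multiplicative error on the dollar scale
typical_factor <- 10^0.0543
cat("typical factor:", round(typical_factor, 3), "\n")
```

That factor is about 1.13, so a typical prediction is within roughly 13% of the actual sale price.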

preds <- reg_fit %>% collect_predictions()
ggplot(data = preds) +
       aes(x = .pred, y = SalePrice) +
  geom_point(color = 'steelblue', alpha = 0.35) +
  geom_abline(intercept = 0, slope = 1, color = 'firebrick') +
  labs(title = 'R-Squared Plot - Predicting SalePrice',
       x = 'Predicted Sale Price (log10)',
       y = 'Actual Sale Price (log10)') +
   theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

The above \({ R }^{ 2 }\) plot is a good representation of the fit of our model. The straight line is a visual representation of a model that predicted each SalePrice perfectly. The points are the actual model predictions. As you can see, in general, our points fall pretty close to the line, which means our model is fitting pretty well. Now that we’ve seen that our model looks to be working, let’s use this same approach to create predictions for our test dataset that we can submit to the Kaggle contest:

final_lm_model <- linear_reg() %>% 
  set_engine('lm') %>% 
  set_mode('regression')

final_recipe <- 
  recipe(SalePrice ~ ., data = train) %>%
  update_role(Id, new_role = "ID") %>%
  step_rm(dataset, WoodDeckSF , OpenPorchSF , EnclosedPorch , `3SsnPorch` , ScreenPorch , Neighborhood, Functional, LowQualFinSF, BsmtFinSF2, BsmtHalfBath, BsmtFinType2,    KitchenAbvGr, LotConfig, PoolArea, MiscVal) %>%
  step_unknown(all_predictors(), -all_numeric()) %>% 
  step_dummy(all_nominal()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric(), -Id, -all_outcomes()) %>%
  step_center(all_numeric(), -Id, -all_outcomes()) %>%
  step_scale(all_numeric(),-Id,  -all_outcomes()) %>%
  step_corr(all_numeric(), -Id, threshold = 0.90) %>%
  step_knnimpute(all_predictors(), neighbors = 3) %>%
  step_lincomb(all_numeric(), -Id, -all_outcomes()) %>%
  step_log(all_outcomes(), base = 10 , skip = TRUE)

final_workflow <- workflow() %>% 
  add_model(final_lm_model) %>% 
  add_recipe(final_recipe)

final_fit <- final_workflow %>% 
  fit(data = train)

predictions <- predict(final_fit, test) %>%
  bind_cols(test %>% select(Id)) %>%
  mutate(SalePrice =  round(10^.pred,0)) %>% 
  select(Id, SalePrice)
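
Because the model is trained on log10(SalePrice), its predictions come out on the log10 scale and have to be raised to the power of 10 to get dollars, which is what the mutate() above does. Here is the same idea in a small base R sketch with a one-predictor lm() on made-up data; the `toy` data frame and its values are hypothetical.

```r
# Hypothetical training data
toy <- data.frame(
  sqft  = c(900, 1200, 1500, 1800, 2400, 3000),
  price = c(95000, 128000, 160000, 189000, 255000, 310000)
)

# Fit on the log10 scale
fit <- lm(log10(price) ~ sqft, data = toy)

# Predictions for new homes are on the log10 scale...
log_pred <- predict(fit, newdata = data.frame(sqft = c(1000, 2000)))

# ...so undo the transform to get dollars
dollar_pred <- round(10^log_pred, 0)

print(log_pred)
print(dollar_pred)
```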

Here’s what our final output looks like with the SalePrice column being our predictions:

head(predictions)
## # A tibble: 6 x 2
##      Id SalePrice
##   <dbl>     <dbl>
## 1  1461    109376
## 2  1462    153318
## 3  1463    170800
## 4  1464    202566
## 5  1465    194396
## 6  1466    172461

Now, let’s export our predictions and submit to Kaggle:

readr::write_delim(predictions, 'C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/submission2.csv', delim = ',')

Having submitted my CSV file to Kaggle, I got a score of 0.13895.

While this is nowhere close to the top of the leaderboard, I imagine it is a fairly good score for using only multiple regression (on this leaderboard, lower is better). I’m sure I could do a little tuning to improve my score slightly, but to see big improvements, I’d definitely need to switch to another regression method or to XGBoost. Overall, I am very pleased with the performance of my model.

Kaggle Username: christianthieme
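
The validation RMSE from earlier (0.0543) and the Kaggle score are not directly comparable, because I used a base 10 log while the competition metric, as I read the evaluation page, is an RMSE on natural-log prices. Assuming that is right, the conversion is a single multiplication by ln(10):

```r
# Validation RMSE from collect_metrics(), on the log10 scale
rmse_log10 <- 0.0543

# log(10) is the natural log of 10; multiplying converts log10 units to ln units
rmse_ln <- rmse_log10 * log(10)

# Score reported by Kaggle (assumed to be on the natural-log scale)
kaggle_score <- 0.13895

cat("validation RMSE (ln scale):", round(rmse_ln, 4), "\n")
cat("Kaggle score:              ", kaggle_score, "\n")
```

On that footing the validation RMSE is about 0.125, a little better than the leaderboard score, which is the direction I would expect when moving from my own hold-out set to unseen data.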

Learnings and Next Steps

This was an amazing project to work through. It took me a TON of time to work through all the variables, but I got to know the data and variables very well and was able to come up with some good new features as a result that I think helped in my model performance.

After I’d already encoded all of my variables as factors, I learned I could have converted all my ordinal factors to numeric scores using step_ordinalscore(), and converted all my character/string data to factors using step_string2factor() from recipes. That would have saved me a ton of time. Next time! Overall, I’m growing more comfortable with the tidymodels framework and realizing how simple it makes difficult tasks.

My next step for this project will be to run this data through XGBoost and see what improvements we can see!