Introduction
This data comes from the following Kaggle competition, whose objective is to predict the sale prices of houses in Ames, Iowa. The dataset provided has 79 explanatory variables describing (almost) every aspect of the residential homes. A description of each column can be found here.
For this competition, I’ve been asked to build a model using multiple regression. Before building the model, however, I’ll do some exploratory data analysis (EDA) on the dataset, clean it, as well as do some feature engineering.
Loading Libraries and Data
I’ll load the necessary R libraries:
library(plyr) # load plyr before tidyverse so it does not mask dplyr verbs
library(tidyverse)
library(scales)
library(corrplot)
library(moments) # skewness and kurtosis
library(gridExtra)
library(tidymodels)
library(vip)
Below, I’ll load the train and test CSVs into dataframes:
train <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Final/train.csv')
test <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Final/test.csv')
I’ll take a quick look at the first 10 variables and our dependent variable:
glimpse(train[,c(1:10,81)])
## Rows: 1,460
## Columns: 11
## $ Id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ MSSubClass <dbl> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20...
## $ MSZoning <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "...
## $ LotFrontage <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91,...
## $ LotArea <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 61...
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave",...
## $ Alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ LotShape <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1",...
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl",...
## $ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllP...
## $ SalePrice <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000,...
In looking at the output above, it looks like we have a good mix of both numeric and character data. Let’s take an initial look to see how many of each we have:
- Numeric columns:
dim(train %>% select_if(is.numeric))[2]
## [1] 38
- Character columns:
dim(train %>% select_if(is.character))[2]
## [1] 43
We’ll do some further investigation later, but it’s very probable that some of our numeric columns are actually categorical data and will need to be converted. Additionally, I assume that quite a few of the character columns are ordinal and will also need to be converted.
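The two counts above can also be produced in a single pass. This is a minimal base R sketch on a made-up data frame (`toy` and its values are stand-ins, not the Ames data):

```r
# Toy data frame standing in for `train`; values are illustrative only.
toy <- data.frame(
  LotArea  = c(8450, 9600, 11250),
  MSZoning = c("RL", "RL", "RM"),
  Street   = c("Pave", "Pave", "Pave"),
  stringsAsFactors = FALSE
)

# First class of every column, then a frequency table of those classes.
col_types <- vapply(toy, function(col) class(col)[1], character(1))
type_counts <- table(col_types)
print(type_counts)
```

A table like this only reports storage types. A column such as MSSubClass is stored as a number even though its values are category codes, so it still needs a manual check.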
Now, let’s look at the dimensions of our train and test datasets:
Train:
dim(train)
## [1] 1460 81
Test:
dim(test)
## [1] 1459 80
Combined:
dim(train)[1] + dim(test)[1]
## [1] 2919
In looking at the above row counts, it looks like this dataset is split roughly in half: the test dataset has nearly as many rows as the training dataset (1,459 vs. 1,460). Let’s now combine these datasets so we can perform the same transformations on both. To do this, we’ll need to add a ‘SalePrice’ column to our test dataset and add an identifying ‘dataset’ column to each. Finally, we’ll append them together:
train <- train %>%
mutate(dataset = 'train')
test <- test %>%
mutate(SalePrice = NA,
dataset = 'test')
homes <- rbind(train, test)
dim(homes)
## [1] 2919 82
The dimensions of our combined dataset match what we saw previously (considering the columns we just added).
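As a small illustration of the combine step, here is a base R sketch on two made-up data frames (`toy_train` and `toy_test` are hypothetical), with a couple of sanity checks that are worth running after any append:

```r
# Made-up stand-ins for the train and test sets.
toy_train <- data.frame(Id = 1:3,
                        LotArea = c(8450, 9600, 11250),
                        SalePrice = c(208500, 181500, 223500))
toy_test  <- data.frame(Id = 4:5,
                        LotArea = c(11622, 14267))

# Add the missing response column and a flag recording each row's origin.
toy_train$dataset  <- "train"
toy_test$SalePrice <- NA_real_
toy_test$dataset   <- "test"

combined <- rbind(toy_train, toy_test)

# Sanity checks: no rows lost, and every test row has an NA response.
stopifnot(nrow(combined) == nrow(toy_train) + nrow(toy_test))
stopifnot(all(is.na(combined$SalePrice[combined$dataset == "test"])))
table(combined$dataset)
```

Keeping the `dataset` flag makes it easy to split the rows apart again once cleaning is done.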
Exploration and Preliminary Data Cleaning
Response Variable - SalePrice
summary(homes$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 34900 129975 163000 180921 214000 755000 1459
Let’s take a look at the distribution of our response variable, SalePrice:
eda_data <- homes %>% filter(dataset == 'train')
ggplot(eda_data) +
aes(x = SalePrice) +
geom_histogram(binwidth = 20000, fill = "lightsalmon2", color = "black") +
scale_x_continuous(labels = comma, breaks = seq(0, 900000, by = 100000)) +
labs(title = "Histogram of SalePrice", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)
Looking at the distribution above, we see what we would expect - a right skewed distribution, as most people can’t afford very expensive housing. Let’s take a look at the skewness and kurtosis of SalePrice:
moments::skewness(homes$SalePrice, na.rm = TRUE)
## [1] 1.880941
We can see above that the skewness of SalePrice is ~1.88. As noted above, this indicates that the distribution is not normal (a normal distribution would have a skewness of 0). As the skewness value is positive, and as we see in our plot, we know our distribution is right skewed.
moments::kurtosis(homes$SalePrice, na.rm = TRUE)
## [1] 9.509812
Our kurtosis value (~9.51, compared with 3 for a normal distribution) tells us that we have some extreme values in the tail of our distribution, which you can see clearly in the Q-Q plot below.
qqnorm(homes$SalePrice)
qqline(homes$SalePrice)
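To make the skewness and kurtosis figures concrete, here is a base R sketch that computes both from central moments. The data is simulated (a log-normal sample, not SalePrice), and to my understanding the formulas match the moment-based definitions used by the moments package, where a normal distribution has a kurtosis of 3:

```r
set.seed(42)
# Simulated right-skewed values; this is NOT the Ames data.
x <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# k-th central moment of a numeric vector.
central_moment <- function(v, k) mean((v - mean(v))^k)

# Skewness = m3 / m2^(3/2); kurtosis = m4 / m2^2 (not excess kurtosis).
skew <- central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurt <- central_moment(x, 4) / central_moment(x, 2)^2

# The same skewness after a log transform, for comparison.
skew_log <- central_moment(log(x), 3) / central_moment(log(x), 2)^(3 / 2)

c(skewness = skew, kurtosis = kurt, skewness_after_log = skew_log)
```

Because the simulated values are log-normal, the log-transformed skewness lands much closer to 0 than the raw one.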
To fix this, we’ll use a log transformation on the SalePrice column and rerun the calcs on skew and kurtosis:
homes_fixed <- homes %>%
mutate(SalePrice = log(SalePrice))
moments::skewness(homes_fixed$SalePrice, na.rm = TRUE)
## [1] 0.1212104
moments::kurtosis(homes_fixed$SalePrice, na.rm = TRUE)
## [1] 3.802656
As you can see above, we are doing much better: the skewness is now close to 0 and the kurtosis is much closer to 3. Now let’s look at our Q-Q plot again:
qqnorm(homes_fixed$SalePrice)
qqline(homes_fixed$SalePrice)
Much better! We don’t see nearly as many extreme values at the edge of our distribution. We have shown that our response variable can benefit from a log transformation. While we could make this change now, let’s wait and use the tidymodels package to do most of our preprocessing - more on this later. Stay tuned!
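For a rough idea of what that preprocessing could look like, here is a sketch using the recipes package (part of tidymodels). It is an assumption about one possible setup, not the pipeline used later; the toy data frame and the choice of `skip = TRUE` are mine:

```r
library(recipes)

# Tiny illustrative data frame; not the full Ames data.
toy <- data.frame(
  SalePrice = c(208500, 181500, 223500, 140000, 250000),
  GrLivArea = c(1710, 1262, 1786, 1717, 2198)
)

# Log-transform the outcome. skip = TRUE means the step is applied when the
# recipe is prepped on training data but skipped when baking new data,
# which would not have a SalePrice column to transform.
rec <- recipe(SalePrice ~ GrLivArea, data = toy)
rec <- step_log(rec, SalePrice, skip = TRUE)

prepped <- prep(rec, training = toy)
baked   <- bake(prepped, new_data = NULL)  # the processed training set
baked
```

With the outcome on the log scale, predictions need to be passed through `exp()` before they are read as dollar amounts.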
Now that we’ve briefly looked at our response variable, let’s move on and start looking at some of our independent variables.
Extent of Nulls
Before jumping into the analysis of our dataset, it’s good to get an understanding of how many null values are in each column so we know if we can move forward with visualizing and analyzing our data or if we need to do some processing in order to have the data tell its story:
colSums(is.na(homes))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 4 486 0
## Street Alley LotShape LandContour Utilities
## 0 2721 0 0 2
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 1 1
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 24 23 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 81 82 82 79 1
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 80 1 1 1 0
## HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 2 2 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 1 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 2 0 1420 157 159
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 159 1 1 159 159
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 2909 2348 2814
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 1 0
## SalePrice dataset
## 1459 0
While there are some columns with a significant number of nulls, for the most part, it doesn’t look like null values are that pervasive in this dataset. We’ll refer back to these numbers as we walk through our analysis and cleaning.
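With this many columns, the raw counts are easier to scan once the zero rows are dropped and the rest are sorted. A minimal base R sketch on a made-up data frame (`toy` is hypothetical):

```r
# Made-up data with NAs in two of three columns.
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, NA, "Grvl"),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

na_counts <- colSums(is.na(toy))
na_summary <- data.frame(
  column = names(na_counts),
  n_na   = as.integer(na_counts),
  pct_na = round(100 * as.numeric(na_counts) / nrow(toy), 1),
  stringsAsFactors = FALSE
)

# Keep only columns with at least one NA, most affected first.
na_summary <- na_summary[na_summary$n_na > 0, ]
na_summary <- na_summary[order(-na_summary$n_na), ]
print(na_summary, row.names = FALSE)
```

The percentage column helps separate features that are mostly empty from those missing only a handful of values.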
A Starting Place
With over 80 features, it’s difficult to know where to start with our exploratory data analysis and cleaning. A quick way to get our bearings is to run a correlation plot over the current numeric features of the dataset and start working through the features that are highly correlated with SalePrice. We’ll create a correlation plot and only keep the features whose absolute correlation with SalePrice is greater than 0.45 for our initial run.
numeric_cols <- eda_data %>% select_if(is.numeric)
correlation <- cor(numeric_cols)
sale_price_col <- as.matrix(abs(correlation[,'SalePrice']))
ordered_matrix <- sale_price_col[order(sale_price_col, decreasing = TRUE),]
names <- ordered_matrix[ordered_matrix > 0.45, drop = FALSE]
names <- rownames(as.matrix(names[!is.na(names)]))
filtered_cor_matrix <- correlation[names, names]
#sorted_matrix <- filtered_cor_matrix[order(filtered_cor_matrix[,'SalePrice'], decreasing = TRUE),]
corrplot.mixed(filtered_cor_matrix, tl.col="black", tl.pos = "lt")
Looking at the above correlation plot, we can see that OverallQual has the highest correlation with SalePrice. Additionally, it looks like the next several features all have to do with square footage (size) or number of rooms or garages (quantity). I’ll take note of this, because we can probably create some new combination features out of these variables to tease out some additional predictive power while feature engineering. Toward the bottom, we see that the year it was built/remodeled is also moderately correlated. Fireplaces is the last feature that met our threshold. We’ll have to investigate this feature to see if it is truly meaningful to SalePrice. Last, I can see that a few features are collinear with other features, such as GarageArea and GarageCars. We’ll want to address this once we get a little further along.
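The filtering idea and the collinearity check can be sketched compactly in base R. This example uses the built-in `mtcars` dataset with `mpg` playing the role of SalePrice; the 0.8 cutoff for flagging collinear pairs is an arbitrary assumption for illustration:

```r
# mtcars ships with R; `mpg` stands in for SalePrice here.
cors <- cor(mtcars, use = "pairwise.complete.obs")

# Keep features whose absolute correlation with the target exceeds 0.45.
target_cor <- abs(cors[, "mpg"])
keep <- names(sort(target_cor[target_cor > 0.45], decreasing = TRUE))
filtered <- cors[keep, keep]

# Flag pairs that are strongly correlated with each other,
# looking only at the upper triangle so each pair appears once.
pairs <- which(abs(filtered) > 0.8 & upper.tri(filtered), arr.ind = TRUE)
collinear <- data.frame(
  var1 = rownames(filtered)[pairs[, "row"]],
  var2 = colnames(filtered)[pairs[, "col"]],
  r    = round(filtered[pairs], 2),
  stringsAsFactors = FALSE
)

# Drop pairs involving the target itself.
collinear <- collinear[collinear$var1 != "mpg" & collinear$var2 != "mpg", ]
print(collinear, row.names = FALSE)
```

A listing like this gives a starting set of pairs to review when deciding which of two overlapping features to keep.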
Overall Quality (OverallQual) and Other Condition Features
Overall Quality
OverallQual = Rates the overall material and finish of the house
As OverallQual (Overall Quality) was the most highly correlated variable with SalePrice, we’ll start our analysis here.
eda_data <- homes %>%
filter(dataset == "train")
qual_hist <- ggplot(eda_data) +
aes(x = as.factor(OverallQual)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Overall Quality', x= 'Quality Category') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey"))
qual_box <- ggplot(eda_data) +
aes(x = factor(OverallQual), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Overall Quality vs Sale Price', y = 'Sale Price', x= 'Quality Category') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(qual_hist, qual_box, nrow=1)
It looks like the majority of the homes fall in the average to very good quality range. In looking at the above boxplots, we can confirm what we saw in our correlation matrix; there is a strong relationship between these two variables. Additionally, it looks like there may be an outlier we’ll need to look at in category 4.
qual_four_outlier <- eda_data %>%
filter(OverallQual == 4 & SalePrice > 250000)
qual_four_outlier %>% select(Id, LotArea, OverallQual, SalePrice)
## # A tibble: 1 x 4
## Id LotArea OverallQual SalePrice
## <dbl> <dbl> <dbl> <dbl>
## 1 458 53227 4 256000
In comparing the above row to many of the other rows for houses with an OverallQual of 4, I believe this house has been misclassified as a 4. In looking through all of the other attributes, it actually has quite a few features that are upgraded compared to other homes and is rated at least typical or average on all other condition and quality metrics. I will remove this row during the cleaning stage since OverallQual is the most correlated predictor of sale price.
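One thing to watch when removing this row from a combined train/test table is that test rows have an NA SalePrice, so a price-based condition can evaluate to NA for them. A base R sketch on made-up rows (only Id 458 mirrors the output above; the rest are invented) that removes the outlier by Id instead:

```r
# Made-up rows; the last one mimics a test row with no SalePrice.
toy <- data.frame(
  Id          = c(457, 458, 459, 1461),
  OverallQual = c(5, 4, 4, 4),
  SalePrice   = c(98000, 256000, 161000, NA),
  dataset     = c("train", "train", "train", "test"),
  stringsAsFactors = FALSE
)

# which() drops NA comparisons, so only real matches are collected.
outlier_ids <- toy$Id[which(toy$OverallQual == 4 & toy$SalePrice > 250000)]

# Remove by Id so rows with a missing SalePrice are left untouched.
cleaned <- toy[!(toy$Id %in% outlier_ids), ]
cleaned
```

Filtering on the Id keeps every test row in place, which matters because the competition expects a prediction for each of them.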
Overall Condition
OverallCond = Rates the overall condition of the house
cond_hist <- ggplot(eda_data) +
aes(x = as.factor(OverallCond)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Overall Condition', x= 'Condition Category') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey"))
cond_box <- ggplot(eda_data) +
aes(x = factor(OverallCond), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Overall Condition vs Sale Price', y = 'Sale Price', x= 'Condition Category') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(cond_hist, cond_box, nrow=1)
It looks like there is quite a bit of difference between overall condition and overall quality. The condition measures the overall condition of the house whereas the overall quality measures the material and finish of the house. The observed ranges differ as well: in these plots, overall quality goes up to 10 while overall condition only goes up to 9. It also looks like overall condition and overall quality are not strongly correlated.
cor(eda_data$OverallCond, eda_data$OverallQual)
## [1] -0.09193234
In looking at the relationship between sale price and overall condition, what we see here is somewhat surprising. It does look like we see a trend in categories 1-5; however, after 5 the trend does not continue, and it looks like additional improvements in the condition of the home above and beyond “average” do not affect the price of the home very much. What this tells us is that this factor is not, on its own, highly correlated with sale price. Also, in looking at the above plot, we may have an outlier in group 6.
eda_data %>% filter(OverallCond == 6 & SalePrice > 700000) %>% select(Neighborhood, KitchenQual, GrLivArea, ExterQual)
## # A tibble: 1 x 4
## Neighborhood KitchenQual GrLivArea ExterQual
## <chr> <chr> <dbl> <chr>
## 1 NoRidge Ex 4316 Ex
In looking at the value, it appears that the house is very large, is in a ritzy neighborhood and has some excellent upgrades to the exterior and interior but other than that it doesn’t appear to have anything unexpected that would make us call this an outlier.
Basement Quality
BsmtQual: Evaluates the height of the basement
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(BsmtQual)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Evaluation of Height of Basement', x= 'Evaluation Category') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(BsmtQual), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Evaluation of Height of Basement vs Sale Price', y = 'Sale Price', x= 'Evaluation Category') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
It looks like in every case where there is a basement, the evaluation is fair or higher. Additionally, we see a strong relationship here between sale price and the evaluated height of the basement. We also note that there are some NAs in this feature. My hunch is that these NAs are actually homes without a basement and just need to be set to ‘No Basement’.
eda_data %>%
filter(is.na(BsmtQual)) %>%
select(contains('BsmtFin')) %>% head()
## # A tibble: 6 x 4
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## <chr> <dbl> <chr> <dbl>
## 1 <NA> 0 <NA> 0
## 2 <NA> 0 <NA> 0
## 3 <NA> 0 <NA> 0
## 4 <NA> 0 <NA> 0
## 5 <NA> 0 <NA> 0
## 6 <NA> 0 <NA> 0
Our hunch proved to be correct. All of these NA values relate to homes with no basements. We’ll recode these during our data cleaning stage.
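A minimal base R sketch of that recoding, on made-up data. Using TotalBsmtSF as the confirming column is my assumption; the idea is simply to recode an NA only when the square footage agrees that there is no basement:

```r
# Made-up rows: two homes with basements, two without.
toy <- data.frame(
  BsmtQual    = c("Gd", NA, "TA", NA),
  TotalBsmtSF = c(856, 0, 920, 0),
  stringsAsFactors = FALSE
)

# Recode NA only where the basement square footage is confirmed to be zero.
no_bsmt <- is.na(toy$BsmtQual) & !is.na(toy$TotalBsmtSF) & toy$TotalBsmtSF == 0
toy$BsmtQual[no_bsmt] <- "No Basement"
toy
```

Any NA that survives this step would be a genuinely missing value and deserves a separate look.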
Basement Condition
BsmtCond: Evaluates the general condition of the basement
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(BsmtCond)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Condition of Basement', x= 'Condition Category') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(BsmtCond), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Condition of Basement vs Sale Price', y = 'Sale Price', x= 'Condition Category') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
We see something very similar to what we saw above. It looks like the same 37 records (which were homes with no basements) are null in this feature as well. Overall, we see that most homes have basements in fair or higher condition. Additionally, we see a strong relationship between the condition of the basement and the sale price of the home.
GarageQual
GarageQual = Garage quality
garhist <- ggplot(eda_data) +
aes(x = GarageQual) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Garage Quality', y = 'Count', x= 'Quality Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
# axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
garplot<- ggplot(eda_data) +
aes(x = reorder(factor(GarageQual),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Garage Quality vs Sale Price', y = 'Sale Price', x= 'Quality Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(garhist, garplot, nrow=1)
It appears that almost every home in our dataset has a “typical/average” garage. Based on the boxplot, it looks like there is some order to these categories. We’ll take note of this and use it in our data cleaning stage. Interestingly, for the 3 homes with an “excellent” garage, it doesn’t seem to have made a big difference in the sale price for 2 of them. It turns out that one of these homes is on a very large lot with a very large house and a 10 on overall quality, and the other two are fairly small homes with lower overall quality (so why a nice garage?), which may be why they have lower values. I think what we are seeing here is an issue with a small sample size for this group.
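The ordering noted above can be captured with an ordered factor. A minimal base R sketch, assuming the usual Po < Fa < TA < Gd < Ex scale from the data description and a made-up vector of ratings:

```r
# Quality scale from worst to best.
quality_levels <- c("Po", "Fa", "TA", "Gd", "Ex")

# Made-up garage quality ratings.
garage_qual <- c("TA", "Fa", "Gd", "TA", "Ex", "Po")

# An ordered factor preserves the ranking; as.integer() gives a 1-5 score.
garage_qual_ord   <- factor(garage_qual, levels = quality_levels, ordered = TRUE)
garage_qual_score <- as.integer(garage_qual_ord)

data.frame(garage_qual, garage_qual_score)
```

The same level vector can be reused for the other quality and condition columns that share this scale. Homes without a garage would need an extra level at the bottom.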
Garage Condition
GarageCond = Garage Condition
garhist <- ggplot(eda_data) +
aes(x = GarageCond) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Garage Condition', y = 'Count', x= 'Condition Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
garplot <- ggplot(eda_data) +
aes(x = reorder(factor(GarageCond),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Garage Condition vs Sale Price', y = 'Sale Price', x= 'Condition Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(garhist, garplot, nrow=1)
Looking at the two plots above, it’s pretty apparent that this field is almost identical to GarageQual, with the main difference being that the “Excellent” group does not have the large outlier bringing up the IQR. What this means is that one of those homes is rated differently on garage quality than on garage condition, which doesn’t seem correct. This plot makes me question whether those homes categorized as “excellent” have been labeled appropriately. We’ll keep this in mind as we continue our analysis and also note that we’ll probably drop garage quality in our data cleaning phase since it’s just redundant data.
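Before dropping one of the two columns, the overlap can be quantified with a cross-tabulation. A base R sketch on made-up ratings (the two vectors are hypothetical, not taken from the dataset):

```r
# Made-up quality and condition ratings for seven garages.
garage_qual <- c("TA", "TA", "Fa", "Gd", "Ex", "TA", "Ex")
garage_cond <- c("TA", "TA", "Fa", "TA", "Ex", "TA", "TA")

# Counts on the diagonal are homes where both ratings agree.
agreement <- table(GarageQual = garage_qual, GarageCond = garage_cond)
print(agreement)

# Share of homes with identical ratings in both columns.
agreement_rate <- mean(garage_qual == garage_cond)
agreement_rate
```

A rate close to 1 on the real data would support treating one of the two columns as redundant.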
Exterior Condition
ExterCond: Evaluates the present condition of the material on the exterior
garhist <- ggplot(eda_data) +
aes(x = ExterCond) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Exterior Condition', y = 'Count', x= 'Condition Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
garplot <- ggplot(eda_data) +
aes(x = reorder(factor(ExterCond),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Exterior Condition vs Sale Price', y = 'Sale Price', x= 'Condition Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(garhist, garplot, nrow=1)
As we’ve seen before - most homes have a “typical” exterior condition. In addition, looking at our boxplots, it looks like these are ordinal categorical values, so we’ll update that in our data prep stage. The piece that is a little concerning, however, is that the ‘Good (Gd)’ condition is actually more favorable than the ‘Typical (TA)’ condition, yet we don’t see that trend in the boxplots. This may very well be because of the difference in sample size between the two groups - but it is still something we’ll have to keep an eye on. It also may be a case where the difference in exterior condition really only matters if it is less than typical. Here, perhaps a better variable to engineer would be whether the exterior condition is acceptable or not (binary variable).
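That binary variable could be sketched as follows. Treating “typical” (TA) or better as acceptable is my assumption about where to draw the line, and the ratings are made up:

```r
# Made-up exterior condition ratings.
exter_cond <- c("TA", "Gd", "Fa", "TA", "Po", "Ex")

# 1 if the condition is typical or better, 0 otherwise (assumed cutoff).
exter_ok <- as.integer(exter_cond %in% c("TA", "Gd", "Ex"))

data.frame(exter_cond, exter_ok)
```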
Exterior Quality
ExterQual: Evaluates the quality of the material on the exterior
garhist <- ggplot(eda_data) +
aes(x = ExterQual) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Exterior Quality', y = 'Count', x= 'Quality Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
garplot <- ggplot(eda_data) +
aes(x = reorder(factor(ExterQual),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Exterior Quality vs Sale Price', y = 'Sale Price', x= 'Quality Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(garhist, garplot, nrow=1)
The exterior quality does look like it is an ordinal categorical variable. What we see here is more promising than what we saw with the exterior condition variable. Here we see what we would expect. As the exterior quality of the home improves the sale price increases. I could see this being from going from vinyl siding, to wood, to stucco and so on, which would definitely make a difference in the price of a house.
Sale Condition
SaleCondition: Condition of Sale
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(SaleCondition)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Sale Condition', x= 'Sale Condition') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(SaleCondition), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Sale Condition vs Sale Price', y = 'Sale Price', x= 'Sale Condition') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
As we’d expect, it looks like the majority of our homes have a normal sale condition. In looking at the boxplots, it may appear that there is some sense of order to the categories; however, in looking at the description for each, it does not appear that these are ordinal. One is not necessarily better than the other except for perhaps “Abnormal”, which signifies a foreclosure or short sale. So maybe it makes sense to collapse this into 3 categories as opposed to 6.
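One possible way to collapse the six levels into three is sketched below. The grouping (Normal, Abnorml, and everything else as "Other") is purely my assumption, since the text above does not pin down which three categories to use:

```r
# Made-up sale conditions covering all six levels.
sale_condition <- c("Normal", "Abnorml", "Partial", "Family",
                    "Normal", "Alloca", "AdjLand")

# Keep Normal and Abnorml as-is; lump the remaining levels together.
sale_group <- ifelse(sale_condition %in% c("Normal", "Abnorml"),
                     sale_condition, "Other")

table(sale_group)
```

Whether "Partial" belongs in "Other" is worth checking against the boxplot, since it may behave differently from the remaining levels.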
Condition 1
Condition1: Proximity to various conditions
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(Condition1)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Proximity to Various Conditions', x= 'Condition') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(Condition1), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Proximity to Various Conditions vs Sale Price', y = 'Sale Price', x= 'Condition') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
It looks like the majority of homes have a normal Condition1 value. Additionally, it doesn’t look like this variable is strongly related to sale price. I’ll also note that even though the boxplot seems to show some semblance of order, these are not ordinal values. A further note is that some of these are actually positive conditions, such as being near a park or greenbelt, and some are negative. It probably makes sense to create some kind of variable describing whether these are positive or negative conditions.
Condition 2
Condition2: Proximity to various conditions (if more than one is present)
This variable will show ‘Normal’ unless a second proximity to another condition is present.
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(Condition2)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Proximity to Various Conditions if More Than One Is Present', x= 'Condition') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(Condition2), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Proximity to Various Conditions vs Sale Price', y = 'Sale Price', x= 'Condition') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
In looking at this variable’s relationship to Condition1, it would make sense during feature engineering to create a variable that flags whether a home is close to more than one condition (positive or negative). Homes near more than one negative condition may have lower values, and homes near more than one positive condition may have higher values (although, based on our first plot, a second condition is very rare), but let’s check:
eda_data %>%
filter(Condition2 != 'Norm') %>%
ggplot() +
aes(x = reorder(Condition2, SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Proximity to Various Conditions vs Sale Price', y = 'Sale Price', x= 'Condition') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
It does look like there is a relationship between positive/negative conditions and sale price. We’ll make sure to add a binary variable in our feature engineering section.
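A minimal sketch of a binary flag for homes near more than one condition follows. It assumes a home counts as having multiple conditions when neither Condition1 nor Condition2 is 'Norm'; the `toy` data frame and the `multi_condition` name are hypothetical.

```r
library(dplyr)

# Made-up rows; the real columns would come from the training data.
toy <- data.frame(
  Condition1 = c("Norm", "Feedr", "PosN", "Artery"),
  Condition2 = c("Norm", "Norm", "PosA", "RRNn"),
  stringsAsFactors = FALSE
)

# 1 when both condition slots are filled with something other than "Norm".
toy <- toy %>%
  dplyr::mutate(
    multi_condition = as.integer(Condition1 != "Norm" & Condition2 != "Norm")
  )

print(toy)
```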
Functional
Functional: Home functionality (Assume typical unless deductions are warranted)
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(Functional)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Home Functionality', x= 'Functionality') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(Functional), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Home Functionality vs Sale Price', y = 'Sale Price', x= 'Functionality') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
As we’d expect, most homes have typical functionality when they are sold. What is interesting is the boxplot on the right. You would think that deductions (minor, moderate, major, and severe), and their severity, would play a large part in the sale price of a home; however, that isn’t what we see here. The only thing we can clearly identify is that homes with typical functionality, in general, have higher sale prices.
House Style
HouseStyle: Style of dwelling
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(HouseStyle)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'House Style', x= 'Style') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(HouseStyle), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'House Style vs Sale Price', y = 'Sale Price', x= 'Style') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
One and two story homes seem to be the majority, with one and a half story homes with the second level finished being the third most common. The ‘1.5’ and ‘2.5’ styles indicate an additional half story, which is either finished (Fin) or unfinished (Unf). Some of the relationships here make sense, but others don’t. For example, why does 1.5Fin (one and a half stories with the second level finished), in general, have lower sale prices than a one story home? This variable clearly needs other variables to tell the entire story (probably square footage) and doesn’t look to be strongly correlated with sale price.
Above Ground Living Area (GrLivArea) and Other Size Features
GrLivArea
GrLivArea = Above ground living area
live_area_outliers <- eda_data %>%
filter(GrLivArea > 4500)
abv_hist <- ggplot(eda_data) +
aes(x = GrLivArea) +
geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
scale_y_continuous( breaks = seq(0, 200, by = 25)) +
labs(title = "Histogram of Above Ground Living Area", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)
abv_scatter <- ggplot(eda_data) +
aes(x = GrLivArea, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'GrLivArea vs Sale Price', y = 'Sale Price', x= 'Square Feet', subtitle = "Above Ground Living Area in Square Feet") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)
The distribution for above ground living area is right skewed, with the majority of homes having between 1,000 and 2,000 square feet. For the most part, we see what we would expect here: homes with more above ground square feet have higher sale prices. We do see two apparent outliers past 4,500 square feet with very low sale prices that we may want to remove later. Let’s take a look at these homes:
eda_data %>%
filter(GrLivArea > 4500) %>%
select(Id, LotArea, OverallQual, OverallCond, SalePrice)
## # A tibble: 2 x 5
## Id LotArea OverallQual OverallCond SalePrice
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 524 40094 10 5 184750
## 2 1299 63887 10 5 160000
These two rows do indeed look like outliers (possibly even data entry errors?). Both homes have enormous lot sizes. What makes me think they may be data errors is that both have an overall quality rating of 10 (these are the two outliers you can see in the boxplot for category 10), but an overall condition rating of 5. That, in conjunction with the fact that their above ground living areas are significantly larger than most homes while their sale prices are significantly lower than those of homes with less square footage, makes me think these are outliers or mistakes.
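If these rows are removed later, one possible way to express that is sketched below. The rule (living area above 4,500 square feet combined with a sale price under 300,000) is an illustrative assumption, and the rows in `toy` are made up rather than taken from the dataset.

```r
library(dplyr)

# Made-up rows; the real filter would run on the training data.
toy <- data.frame(
  Id = c(1, 2, 3, 4),
  GrLivArea = c(1500, 4300, 4700, 5600),
  SalePrice = c(200000, 700000, 180000, 160000)
)

# Assumed rule: very large living area combined with a low price.
# The 300000 cutoff is illustrative only.
cleaned <- toy %>%
  dplyr::filter(!(GrLivArea > 4500 & SalePrice < 300000))

print(cleaned)
```

Filtering on both conditions, rather than on size alone, keeps large homes that sold at correspondingly high prices.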
First Floor Square Feet
1stFlrSF = First Floor square feet
abv_grnd_outliers <- eda_data %>%
filter(`1stFlrSF` > 4500)
abv_hist <- ggplot(eda_data) +
aes(x = `1stFlrSF`) +
geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
scale_y_continuous( breaks = seq(0, 200, by = 25)) +
labs(title = "Histogram of First Floor Square Feet", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)
abv_scatter <- ggplot(eda_data) +
aes(x = `1stFlrSF`, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
geom_point(data = abv_grnd_outliers, aes(x = `1stFlrSF`, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'First Floor Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)
Similar to above ground living area, this feature has a right skewed distribution and appears to have an outlier out past 4,000 square feet. This variable has a fairly strong positive correlation with sale price.
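Right skew can also be put into a number. The sketch below computes a moment-based skewness on made-up square footage values and shows that a log transform, one common option, reduces it. The `moments` package loaded at the top provides a `skewness()` function; the hand-written `skew()` helper here is a hypothetical stand-in so the snippet runs without extra packages.

```r
# Moment-based skewness: third central moment over variance^(3/2).
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up, right-skewed square footage values (not taken from the dataset).
sq_ft <- c(900, 1000, 1100, 1200, 1300, 1500, 1700, 2000, 2600, 4700)

skew_raw <- skew(sq_ft)       # skewness on the original scale
skew_log <- skew(log(sq_ft))  # skewness after a log transform

cat("raw skewness:", round(skew_raw, 2), "\n")
cat("log skewness:", round(skew_log, 2), "\n")
```

A positive value indicates a longer right tail; whether a transform is worth applying to the real columns is a modeling decision left for the feature engineering step.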
Second Floor Square Feet
2ndFlrSF = Second floor square feet
abv_hist <- ggplot(eda_data %>% filter(`2ndFlrSF` > 0)) +
aes(x = `2ndFlrSF`) +
geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
# scale_y_continuous( breaks = seq(0, 200, by = 25)) +
labs(title = "Histogram of 2nd Floor Square Feet", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)
abv_scatter <- ggplot(eda_data %>% filter(`2ndFlrSF` > 0)) +
aes(x = `2ndFlrSF`, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = abv_grnd_outliers, aes(x = `2ndFlrSF`, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = '2nd Floor Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)
The distribution looks to be almost normal and it doesn’t appear that there are any significant outliers. Additionally, the square footage of the second floor looks to be strongly correlated with the sale price.
After looking at all of these variables, it appears that GrLivArea is the sum of 1stFlrSF, 2ndFlrSF, and LowQualFinSF. Because of that, I’d expect all of these variables to show some significant collinearity.
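A quick way to test this apparent identity is to rebuild the sum and look for rows where it differs from GrLivArea. The sketch below uses made-up rows in a hypothetical `toy` data frame; on the real data the same pipeline would be pointed at `eda_data`.

```r
library(dplyr)

# Made-up rows; check.names = FALSE keeps the non-syntactic column names.
toy <- data.frame(
  `1stFlrSF` = c(900, 1200, 1000, 1100),
  `2ndFlrSF` = c(800, 0, 700, 500),
  LowQualFinSF = c(0, 0, 0, 80),
  GrLivArea = c(1700, 1200, 1700, 1680),
  check.names = FALSE
)

# Rows where the three components do not add up to GrLivArea.
mismatches <- toy %>%
  dplyr::mutate(CheckColumn = `1stFlrSF` + `2ndFlrSF` + LowQualFinSF) %>%
  dplyr::filter(CheckColumn != GrLivArea)

nrow(mismatches)
```

Zero mismatching rows would mean the identity holds for every home in the data being checked.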
Low Quality Finished Square Feet (All Floors)
LowQualFinSF = Low quality finished square feet (all floors)
abv_hist <- ggplot(eda_data %>% filter(LowQualFinSF > 0)) +
aes(x = LowQualFinSF) +
geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
# scale_y_continuous( breaks = seq(0, 200, by = 25)) +
labs(title = "Histogram of Low Quality Square Feet", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)
abv_scatter <- ggplot(eda_data %>% filter(LowQualFinSF > 0)) +
aes(x = LowQualFinSF, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = abv_grnd_outliers, aes(x = `2ndFlrSF`, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Low Quality Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)
It looks like there are not many homes with low quality finished square feet. It doesn’t look like there is much of a relationship here either with sale price.
Total Rooms Above Ground
TotRmsAbvGrd: Total rooms above ground (does not include bathrooms)
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(TotRmsAbvGrd)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Total Rooms Above Ground', x= 'Rooms') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = factor(TotRmsAbvGrd), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Total Rooms Above Ground vs Sale Price', y = 'Sale Price', x= 'Rooms') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
In looking at the above plots, it’s pretty clear that this feature is not just counting bedrooms. It is a combination of bedrooms, living room, laundry room, etc., EXCEPT FOR bathrooms. I don’t see any extreme outliers. There appears to be a fairly predictable relationship between the number of rooms in a house (probably strongly correlated with square footage) and sale price until we get past 11 rooms, at which point sale price tends to fall. This could be for multiple reasons; perhaps the homes are very old or in poor condition. There are also very few homes with over 10 rooms.
Bedrooms Above Ground
BedroomAbvGr: Bedrooms above ground (does NOT include basement bedrooms)
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(BedroomAbvGr)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Bedrooms Above Ground', x= 'Rooms') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = factor(BedroomAbvGr), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Bedrooms Above Ground vs Sale Price', y = 'Sale Price', x= 'Rooms') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
The first thing I notice is that there are homes with 0 bedrooms above ground… what does that mean? Oddly enough, as you can see in the boxplot these homes also have higher median sales prices than most other homes. Let’s take a look and see if we can identify what these are:
eda_data %>%
filter(BedroomAbvGr == 0) %>%
select(`1stFlrSF`, BsmtFinSF1, BsmtFinType1, SalePrice)
## # A tibble: 6 x 4
## `1stFlrSF` BsmtFinSF1 BsmtFinType1 SalePrice
## <dbl> <dbl> <chr> <dbl>
## 1 1842 1810 GLQ 385000
## 2 1593 1153 GLQ 286000
## 3 1056 1056 GLQ 144000
## 4 1258 1198 GLQ 108959
## 5 960 648 GLQ 145000
## 6 1332 1258 GLQ 260000
It looks like these are homes where the bedrooms are all in the basement! These homes have square footage on the first floor, but no bedrooms above the basement. This looks like a somewhat sought after (and rare) feature, because these homes have a higher median sale price than even homes with 5 above ground bedrooms. It is entirely possible that these homes have quite a few rooms in the basement, but there is no variable for basement rooms, so we’ll just have to rely on the square footage. Overall, this variable does not look strongly correlated with sale price.
Full Bathrooms
FullBath: Full bathrooms above ground
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(FullBath)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Bathrooms Above Ground', x= '#') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = factor(FullBath), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Bathrooms Above Ground vs Sale Price', y = 'Sale Price', x= '#') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
Aside from the few homes with no full bathrooms above ground (presumably their bathrooms are all in the basement), the more bathrooms there are, the higher the sale price, for the most part. In general, more bathrooms are often correlated with more square footage.
Half Bathrooms
HalfBath: Half baths above ground
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(HalfBath)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Half Bathrooms Above Ground', x= 'Rooms') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = factor(HalfBath), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Half Bathrooms Above Ground vs Sale Price', y = 'Sale Price', x= 'Rooms') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
We see the same pattern as we did with the other bathroom variable. We don’t have a total bathroom feature, so this is one we will look at creating by adding the full and half bathrooms, above and below ground, together.
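A total bathroom feature could be sketched as follows. Counting each half bathroom as 0.5 is my assumption (a plain sum of all four counts is another option), and `toy` and `TotalBath` are hypothetical names.

```r
library(dplyr)

# Made-up bathroom counts for three homes.
toy <- data.frame(
  FullBath = c(2, 1, 2),
  HalfBath = c(1, 0, 0),
  BsmtFullBath = c(1, 0, 1),
  BsmtHalfBath = c(0, 1, 0)
)

# Full baths count as 1, half baths as 0.5 (an assumed weighting).
toy <- toy %>%
  dplyr::mutate(
    TotalBath = FullBath + BsmtFullBath + 0.5 * (HalfBath + BsmtHalfBath)
  )

print(toy)
```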
Basement Finished Square Feet
BsmtFinSF1 = Type 1 finished square feet
live_area_outliers <- eda_data %>%
filter(BsmtFinSF1 > 4000)
abv_hist <- ggplot(eda_data) +
aes(x = BsmtFinSF1) +
geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
# scale_y_continuous( breaks = seq(0, 200, by = 25)) +
labs(title = "Histogram of Basement Finished Square Feet", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)
abv_scatter <- ggplot(eda_data %>% filter(BsmtFinSF1 > 0)) +
aes(x = BsmtFinSF1, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
geom_point(data = live_area_outliers, aes(x = BsmtFinSF1, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Basement Finished Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)
Our distribution here looks pretty good and we can see that there definitely is a relationship between the square footage of the basement that is finished and sale price. We do see a potential outlier on the scatter plot out past 4,000 square feet. Let’s take a look at that point.
eda_data %>% filter(BsmtFinSF1 > 4000) %>% select(Id, GrLivArea, BsmtFinSF1, OverallQual, OverallCond, ExterQual, ExterCond)
## # A tibble: 1 x 7
## Id GrLivArea BsmtFinSF1 OverallQual OverallCond ExterQual ExterCond
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1299 5642 5644 10 5 Ex TA
Something isn’t adding up for this row… between the basement and above ground square feet, this house has more than 11,000 square feet of space (5,642 above ground plus 5,644 of finished basement) and many bedrooms. Additionally, based on the condition and quality columns, it looks to be in AT LEAST typical condition for everything and is not near a railroad or anything like that. Its one downside looks to be that the property has a quick and significant rise from the street to the grade of the building (banked). However, in looking at the LandContour variable, it doesn’t look like this would have that large of an effect on a house that would normally be very expensive. We’ll continue our analysis, but note that this may be an outlier that needs to be removed.
Basement Type 2 Finished Square Feet
BsmtFinSF2 = Type 2 finished square feet
abv_hist <- ggplot(eda_data) +
aes(x = BsmtFinSF2) +
geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
# scale_y_continuous( breaks = seq(0, 200, by = 25)) +
labs(title = "Histogram of Type 2 Basement Finished Square Feet", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)
abv_scatter <- ggplot(eda_data ) +
aes(x = BsmtFinSF2, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = BsmtFinSF1, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Basement Type 2 Finished Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)
It looks like the majority of homes do not have Type 2 basement finished square feet and it also looks like there isn’t really a relationship here with sale price. This may be a feature we end up removing from the dataset.
Basement Unfinished Square Feet
BsmtUnfSF = Unfinished square feet of basement area
abv_hist <- ggplot(eda_data) +
aes(x = BsmtUnfSF) +
geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
# scale_y_continuous( breaks = seq(0, 200, by = 25)) +
labs(title = "Histogram of Basement Unfinished Square Feet", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)
abv_scatter <- ggplot(eda_data ) +
aes(x = BsmtUnfSF, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
#geom_point(data = live_area_outliers, aes(x = TotalBsmtSF, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Basement Unfinished Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)
It appears that the majority of homes don’t have a tremendous amount of unfinished square footage in their basement. Interestingly, we also don’t see a very strong relationship between unfinished square footage and sale price. I don’t see anything that I would consider an outlier here.
Total Basement Square Feet
TotalBsmtSF = Total square feet of basement area
live_area_outliers <- eda_data %>%
filter(TotalBsmtSF > 4500)
abv_hist <- ggplot(eda_data) +
aes(x = TotalBsmtSF) +
geom_histogram(binwidth =100, fill = "steelblue", color = "black") +
# scale_y_continuous( breaks = seq(0, 200, by = 25)) +
labs(title = "Histogram of Total Basement Square Feet", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)
abv_scatter <- ggplot(eda_data ) +
aes(x = TotalBsmtSF, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
geom_point(data = live_area_outliers, aes(x = TotalBsmtSF, y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Total Basement Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(abv_hist, abv_scatter, nrow=1)
Here we see very similar trends to what we saw with the other basement square foot features. Again, we note the presence of the outlier out beyond 6,000 square feet. One thing I’d like to check is whether the three other basement variables add up to this variable:
eda_data %>%
select(BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF) %>%
mutate(CheckColumn = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF) %>%
mutate(Difference = CheckColumn - TotalBsmtSF ) %>%
filter(Difference != 0)
## # A tibble: 0 x 6
## # ... with 6 variables: BsmtFinSF1 <dbl>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>,
## # TotalBsmtSF <dbl>, CheckColumn <dbl>, Difference <dbl>
It looks like BsmtFinSF1, BsmtFinSF2, and BsmtUnfSF sum to TotalBsmtSF. Good info to keep in our toolbox.
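This matters for a multiple regression: when one column is an exact sum of others, putting all of them into the same linear model makes the predictors perfectly collinear. The sketch below uses made-up numbers in a hypothetical `toy` data frame to show how base R’s `lm()` reacts, which is to report `NA` for the redundant term.

```r
# Made-up basement areas and prices; TotalBsmtSF is built as the exact sum.
toy <- data.frame(
  BsmtFinSF1 = c(500, 800, 0, 300, 1200, 650, 400, 900),
  BsmtFinSF2 = c(0, 100, 0, 0, 50, 0, 200, 0),
  BsmtUnfSF  = c(300, 200, 900, 600, 100, 450, 350, 250),
  SalePrice  = c(150000, 190000, 120000, 140000, 260000, 170000, 160000, 210000)
)
toy$TotalBsmtSF <- toy$BsmtFinSF1 + toy$BsmtFinSF2 + toy$BsmtUnfSF

# All four basement columns in one model: the last one adds no new information.
fit <- lm(SalePrice ~ BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF, data = toy)

# The coefficient for the aliased (redundant) column comes back as NA.
print(coef(fit))
```

In practice this suggests keeping either the total or its components in the model, not both.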
Basement Half Bathrooms
BsmtHalfBath: Basement half bathrooms
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(BsmtHalfBath)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Basement Half Bathrooms', x= 'Rooms') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(BsmtHalfBath), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Basement Half Bathrooms vs Sale Price', y = 'Sale Price', x= 'Rooms') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
In the majority of cases, basements do not have half bathrooms. It also looks like the presence of a basement half bathroom makes little difference to the price of a house. Let’s see if the same holds true for a full bathroom in the basement.
Basement Full Bathrooms
BsmtFullBath: Basement full bathrooms
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(BsmtFullBath)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Basement Full Bathrooms', x= '#') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(BsmtFullBath), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Basement Full Bathrooms vs Sale Price', y = 'Sale Price', x= '#') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
Over half of all basements don’t have a full bathroom, which is surprising to me. The other surprise is that the presence of extra bathrooms does not seem to be strongly connected to the sale price of the house. While we can see a slight increase from 1 to 2 bathrooms at the median value, the IQR is almost identical.
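The median and IQR comparison read off the boxplots can also be computed directly. Here is a rough sketch in base R, using made-up prices rather than the real data:

```r
# Toy data: made-up sale prices grouped by number of basement full baths
toy <- data.frame(
  BsmtFullBath = c(0, 0, 0, 0, 1, 1, 1, 1),
  SalePrice    = c(100000, 140000, 160000, 200000,
                   120000, 150000, 170000, 220000)
)

# Median and interquartile range of sale price within each group
price_median <- tapply(toy$SalePrice, toy$BsmtFullBath, median)
price_iqr    <- tapply(toy$SalePrice, toy$BsmtFullBath, IQR)

data.frame(
  BsmtFullBath = names(price_median),
  median       = as.numeric(price_median),
  iqr          = as.numeric(price_iqr)
)
```

Swapping `toy` for `eda_data` would give the actual numbers behind the boxplot.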
Other Basement Variables
Basement Exposure
BsmtExposure = Refers to walkout or garden level walls
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(BsmtExposure)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Basement Exposure', x= 'Exposure Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(BsmtExposure), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Basement Exposure vs Sale Price', y = 'Sale Price', x= 'Exposure Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
While the majority of homes don’t have basement exposure, exposure does look positively related to sale price, and the categories appear to be ordinal. Additionally, the NAs in this case relate to houses that do not have basements. Based on the boxplots, homes with basements are valued above those without.
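To make the ordinal idea concrete, here is a small sketch of how the exposure codes could be turned into an ordered factor. The `"None"` label for missing values and the choice to rank it below `"No"` are my assumptions, and the input vector is made up.

```r
# Toy vector: made-up values using the codes from the data description
bsmt_exposure <- c("No", "Gd", NA, "Av", "Mn", "No")

# Recode missing values as an explicit "None" level (no basement);
# the label "None" is a hypothetical choice
bsmt_exposure[is.na(bsmt_exposure)] <- "None"

# Ordered factor from no basement up to good exposure
bsmt_exposure <- factor(
  bsmt_exposure,
  levels  = c("None", "No", "Mn", "Av", "Gd"),
  ordered = TRUE
)

# Integer scores (1 = no basement, 5 = good exposure) for use in a model
as.integer(bsmt_exposure)
```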
Rating of Basement Finished Area
BsmtFinType1 = Rating of basement finished area
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(BsmtFinType1)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Basement Finish Type Rating', x= 'Finish Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(BsmtFinType1), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Basement Finish Type Rating vs Sale Price', y = 'Sale Price', x= 'Finish Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
It looks like about 1/4 of homes have a basement with “good living quarters” and another quarter have an unfinished basement. Sale price increases substantially if the home has good living quarters; otherwise, the finish type doesn’t make a huge difference.
Rating of Basement Finished Area (if Multiple Types)
BsmtFinType2 = Rating of basement finished area (if multiple types)
sale_hist <- ggplot(eda_data) +
aes(x = as.factor(BsmtFinType2)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Basement Finish Type Rating', x= 'Rating Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
sale_box <- ggplot(eda_data) +
aes(x = reorder(factor(BsmtFinType2), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Basement Finish Type Rating vs Sale Price', y = 'Sale Price', x= 'Rating Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(sale_hist, sale_box, nrow=1)
A basement can have multiple finish types, such as one portion rated as good living quarters and another that is unfinished. It looks like the overwhelming majority of homes have a portion of their basement that is unfinished (maybe a utility or storage room?). As we saw before, we don’t really see much difference in the sale price except for space that is “livable”.
Garage Cars (GarageCars), Garage Area (GarageArea) and Other Garage Features
Garage Cars (GarageCars)
Garage Cars = Size of garage in car capacity
hist <- ggplot(eda_data) +
aes(x = as.factor(GarageCars)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Garage Cars', x= '# of Garage Cars') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank(),
axis.title.y= element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = factor(GarageCars), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Garage Cars vs Sale Price', y = 'Sale Price', x= '# of Garage Cars') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
It looks like about 56% of homes have a two-car garage, about 6% have either no garage or a four-car garage, and the remaining 38% have a one- or three-car garage. In the boxplots above, the first four categories show what we would expect: sale price increases with garage capacity. However, once we get to a four-car garage, sale price appears to go down, which doesn’t seem to make sense. Let’s take a look at these homes and see if anything looks odd:
eda_data %>%
filter(GarageCars == 4) %>%
select(Id, GrLivArea, LotArea, GarageCars, SalePrice, OverallQual )
## # A tibble: 5 x 6
## Id GrLivArea LotArea GarageCars SalePrice OverallQual
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 421 1344 7060 4 206300 7
## 2 748 2640 11700 4 265979 7
## 3 1191 1622 32463 4 168000 4
## 4 1341 872 8294 4 123000 4
## 5 1351 2634 11643 4 200000 5
In looking through the above data (only 5 rows), I don’t see anything that would indicate that these are outliers or that they should be removed. Perhaps as we do more exploration into more of our features we’ll see some of these homes again.
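The percentage shares quoted for garage capacity can be computed rather than read off the bar chart. A minimal sketch, using a made-up vector rather than the real counts:

```r
# Toy vector: made-up garage capacities, not the real counts
garage_cars <- c(0, 1, 1, 2, 2, 2, 2, 2, 3, 4)

# Share of homes in each category, as a percentage
garage_share <- round(100 * prop.table(table(garage_cars)), 1)
garage_share
```

With the real data, `garage_cars` would simply be `eda_data$GarageCars`.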
Garage Area (GarageArea)
Garage Area = Size of garage in square feet
hist <- ggplot(eda_data) +
aes(x = GarageArea) +
geom_histogram(binwidth =50, fill = "steelblue", color = "black") +
labs(title = 'Garage Area', y = 'Count', x= 'Area of Garage in Square Feet') +
scale_x_continuous(breaks= seq(0, 2000, by=200)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = GarageArea, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Garage Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Looking at the above distribution, we see that it is multi-modal, as we’d expect. There appears to be roughly one peak for each garage capacity, 0-4 cars. This is consistent with what we saw earlier in our correlation plot: garage area and garage cars are highly correlated. I also don’t see any obvious outliers. Plotted against sale price, garage area shows a clear positive relationship.
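To put numbers on these relationships, a correlation matrix of the garage variables and sale price is a quick check. This is only a sketch on made-up values, so the correlations it prints say nothing about the real dataset:

```r
# Toy data: made-up garage sizes and prices
toy <- data.frame(
  GarageCars = c(0, 1, 1, 2, 2, 2, 3, 3),
  GarageArea = c(0, 280, 300, 480, 520, 550, 800, 840),
  SalePrice  = c(90000, 120000, 125000, 180000, 175000, 200000, 300000, 320000)
)

# Pearson correlation matrix; use = "complete.obs" skips rows with NAs
garage_cor <- cor(toy, use = "complete.obs")
round(garage_cor, 2)
```

If two predictors turn out to be this strongly correlated with each other, keeping only one of them in a multiple regression is a common way to limit multicollinearity.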
Garage Type
Garage Type = Garage Location
hist <- ggplot(eda_data) +
aes(x = GarageType) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Garage Type', y = 'Count', x= 'Garage Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(GarageType),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Garage Type vs Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
In the boxplots above, there may be a relationship between sale price and garage type; in particular, there may actually be some order to these categories (ordinal values). We’ll look to adjust this in our data cleaning stage. Additionally, as we saw in the counts above, most homes have either a detached or an attached garage, and the boxplots show a fairly clear difference in sale price between these two categories.
Garage Finish
Garage Finish = Interior Finish of the Garage
hist <- ggplot(eda_data) +
aes(x = GarageFinish) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Garage Finish', y = 'Count', x= 'Finish') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(GarageFinish),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Garage Finish vs Sale Price', y = 'Sale Price', x= 'Finish Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
In looking at the above categorical variables, it appears that these variables also display a sense of order (no garage being the lowest and finished being the highest). We’ll incorporate this in our data cleaning stage below. Additionally, I’ll note that the NAs are due to houses with no garage.
Garage - Year Built
GarageYrBlt = Year garage was built
hist <- ggplot(eda_data) +
aes(x = GarageYrBlt) +
geom_histogram( bins = 30, fill = 'steelblue') +
labs(title = 'Garage - Year Built', y = 'Count', x= 'Year') +
scale_x_continuous(breaks= seq(0, 2020, by=10)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
) +
geom_vline(xintercept = 1906) +
geom_vline(xintercept = 1948) +
geom_vline(xintercept = 1986)
box <- eda_data %>%
select(GarageYrBlt, SalePrice, GarageCars) %>%
# filter(GarageCars != 0 ) %>%
mutate(garage_age_category = ifelse(GarageYrBlt < 1906, 'Oldest',
ifelse(GarageYrBlt < 1948, 'Old',
ifelse(GarageYrBlt < 1986, 'Average', 'New')))) %>%
ggplot() +
aes(x = reorder(garage_age_category, SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Garage Age Category vs Sale Price', y = 'Sale Price', x= 'Age Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
In looking at the distribution, it looks like there are four subsets of this data: garages built before 1906, garages built from 1906 to 1947, garages built from 1948 to 1985, and garages built from 1986 on. Looking at the above boxplot, we can see that the garage build year is relevant to the sale price (as I assume the build year of the home would be). While the garage build year itself may be sufficient for predictive power, we’ll look at creating a feature for garage age category when we work through our feature engineering. Also, it looks like we may need to combine the ‘Oldest’ and ‘Old’ bins.
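As an alternative to the nested `ifelse()` calls, base R’s `cut()` can express the same binning in one step. The sketch below uses the same cut points (1906, 1948, 1986) on a made-up vector of years:

```r
# Toy vector: made-up build years, including a missing value
garage_yr_blt <- c(1900, 1906, 1947, 1948, 1985, 1986, 2005, NA)

# right = FALSE makes each interval closed on the left, [lower, upper),
# which matches the `<` comparisons in the nested ifelse() version
garage_age_category <- cut(
  garage_yr_blt,
  breaks = c(-Inf, 1906, 1948, 1986, Inf),
  labels = c("Oldest", "Old", "Average", "New"),
  right  = FALSE
)

as.character(garage_age_category)
```

Missing years stay NA, just as they do with `ifelse()`, so homes without a garage would still need to be handled separately.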
Kitchen Features
Kitchens Above Ground
KitchenAbvGr = Kitchens above ground
hist <- ggplot(eda_data) +
aes(x = KitchenAbvGr) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Kitchens Above Ground', y = 'Count', x= '# of Kitchens') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(KitchenAbvGr),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Kitchens Above Ground vs Sale Price', y = 'Sale Price', x= '# of Kitchens') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
The vast majority of homes have at least one kitchen above ground. It is pretty rare to not have a kitchen above ground, in which case perhaps it is in the basement? Additionally, it’s very surprising to see that homes with 2+ kitchens generally have a lower sale price than those with only one kitchen. Why would that be?
eda_data %>% filter(KitchenAbvGr >= 2) %>% select(MSSubClass) %>% head()
## # A tibble: 6 x 1
## MSSubClass
## <dbl>
## 1 50
## 2 190
## 3 90
## 4 90
## 5 190
## 6 50
It looks like most of the homes with more than one kitchen are either a duplex (MSSubClass 90, someone selling both sides at once) or a two-family conversion (MSSubClass 190), which explains why they would have lower values as well as more kitchens. The home with no above-ground kitchen does have a basement with good livable space, so perhaps its kitchen is in the basement…
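Because `head()` only shows the first six rows, a full count by dwelling type would give a firmer basis for this conclusion. A minimal sketch in base R, with made-up rows standing in for the dataset:

```r
# Toy data: made-up rows, not taken from the dataset
toy <- data.frame(
  KitchenAbvGr = c(1, 2, 2, 1, 2, 3, 1, 2),
  MSSubClass   = c(20, 90, 190, 60, 90, 190, 50, 50)
)

# Count dwelling types among homes with two or more kitchens,
# most common type first
multi_kitchen   <- toy[toy$KitchenAbvGr >= 2, ]
subclass_counts <- sort(table(multi_kitchen$MSSubClass), decreasing = TRUE)
subclass_counts
```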
Kitchen Quality
KitchenQual = Kitchen quality
hist <- ggplot(eda_data) +
aes(x = KitchenQual) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Kitchen Quality', y = 'Count', x= 'Quality Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(KitchenQual),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Kitchen Quality vs Sale Price', y = 'Sale Price', x= 'Quality Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Wow, kitchen quality looks strongly related to sale price. We can see clear differences in price at each increase in quality level. This is an ordinal categorical variable, so we’ll need to encode it when we get to the data processing section.
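One simple way to encode an ordinal variable like this is a named lookup vector. The sketch below assumes the five quality codes from the data description (Po, Fa, TA, Gd, Ex) and assigns them scores 1 to 5. The equal spacing of the scores is my assumption, and the input vector is made up.

```r
# Toy vector: made-up values using the quality codes from the data description
kitchen_qual <- c("TA", "Gd", "Ex", "Fa", "TA", "Gd")

# Lookup table from quality code to integer score (scores are an assumption)
qual_scores <- c(Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5)

# Index the lookup by name; unname() drops the codes from the result
kitchen_qual_score <- unname(qual_scores[kitchen_qual])
kitchen_qual_score
```

The same lookup could be reused for any other quality column that shares these codes.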
Lot Features
Neighborhood
Neighborhood = Physical locations within Ames city limits
hist <- ggplot(eda_data) +
aes(x = reorder(Neighborhood, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Neighborhood Count', y = 'Count', x= 'Name') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Neighborhood),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Neighborhood vs Sale Price', y = 'Sale Price', x= 'Name') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
In looking at the above boxplot, there appear to be some distinct breaks between groups of neighborhoods. For example, there is a fairly large change in the median from Mitchel to SawyerW (Mitchel and below = Poor?), and again from Timber to StoneBr (SawyerW to Timber = middle class?), with the remaining neighborhoods above Timber. I think we could segment this into 3 or 4 bins based on the median home price in each neighborhood, splitting where we see the largest shifts in the median.
eda_data %>%
select(Neighborhood, SalePrice) %>%
mutate( Neighborhood_class = case_when(
Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale') ~ 'Poor',
Neighborhood %in% c('BrkSide', 'Edwards', 'OldTown', 'Sawyer', 'Blueste', 'SWISU', 'NPkVill', 'NAmes', 'Mitchel') ~ 'Lower-Middle',
Neighborhood %in% c('SawyerW', 'NWAmes', 'Gilbert', 'Blmngtn', 'CollgCr', 'Crawfor', 'ClearCr', 'Somerst', 'Veenker', 'Timber') ~ 'Middle',
TRUE ~ 'Rich')) %>% ggplot() +
aes(x = reorder(factor(Neighborhood_class),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Neighborhood vs Sale Price', y = 'Sale Price', x= 'Name') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey"))
Binning the homes in this way makes the feature a bit more digestible, and it looks like it captures the relationship between neighborhood and sale price very well.
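The bins above were assigned by hand from the boxplot. A data-driven variant would compute each neighborhood’s median price and cut on that. The sketch below shows the idea in base R. The neighborhoods, prices, and cut points are all hypothetical.

```r
# Toy data: made-up neighborhoods and prices
toy <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "C", "C", "D", "D"),
  SalePrice    = c(90000, 110000, 140000, 160000,
                   190000, 210000, 300000, 340000)
)

# Median sale price per neighborhood
nbhd_median <- tapply(toy$SalePrice, toy$Neighborhood, median)

# Bin neighborhoods on their median price; the cut points are hypothetical
nbhd_class <- cut(
  as.numeric(nbhd_median),
  breaks = c(-Inf, 120000, 180000, 250000, Inf),
  labels = c("Poor", "Lower-Middle", "Middle", "Rich")
)

# Map each home to the class of its neighborhood
toy$Neighborhood_class <- nbhd_class[match(toy$Neighborhood, names(nbhd_median))]
toy
```

Since the medians come from sale prices, they should be computed on the training data only and then applied to the test set, to avoid leaking the target.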
Type of Home Involved in the Sale
MSSubClass = Identifies the type of dwelling involved in the sale.
hist <- ggplot(eda_data) +
aes(x = reorder(factor(MSSubClass), SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Type of Home', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(MSSubClass),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Type of Home vs Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
The type of home looks to be a good indicator of sale price. Additionally, it doesn’t look like there are any extreme outliers that we need to worry about.
Zoning Classification of the Sale
MSZoning = Identifies the general zoning classification of the sale.
hist <- ggplot(eda_data) +
aes(x = reorder(MSZoning, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Zoning Classification of the Home', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(MSZoning),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Zoning Classification of the Home vs Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
There appears to be a fairly strong relationship between zoning and sale price as well. Most homes fall into the residential low density category. This type of zoning is intended for neighborhoods with bigger lots and only single-family homes. FV, or Floating Village Residential, is often a retirement community and looks to have the highest median price.
Building Type
BldgType = Type of dwelling.
hist <- ggplot(eda_data) +
aes(x = reorder(BldgType, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Building Type', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(BldgType),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Building Type vs Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
The majority of homes in our dataset are single-family detached homes, which overall have the highest sale prices. Another interesting note is that homes attached to other homes (duplex, townhouse) have lower values, as we would expect, UNLESS the home is a townhouse end unit (TwnhsE), which appears to be more valuable.
Lot Shape
LotShape = General shape of property
hist <- ggplot(eda_data) +
aes(x = reorder(LotShape, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Lot Shape', y = 'Count', x= 'Shape') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(LotShape),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Lot Shape vs Sale Price', y = 'Sale Price', x= 'Shape') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Regular lots and slightly irregular lots make up most of our dataset. Interestingly, I would have thought that regular lots would be more valuable than irregular lots, but in every case irregular lots have a higher median price than regular lots. These values don’t necessarily follow a logical pattern, so we’ll have to decide whether to include them. I’m leaning toward no.
Lot Configuration
LotConfig = Lot configuration
hist <- ggplot(eda_data) +
aes(reorder(x = LotConfig, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Lot Configuration', y = 'Count', x= 'Config') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(LotConfig),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Lot Configuration vs Sale Price', y = 'Sale Price', x= 'Config') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
It looks like lot configuration affects sale price mainly when the lot has frontage on three sides of the property or sits in a cul-de-sac; the other configurations appear pretty similar.
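The bar chart and box plot pair is repeated for every categorical feature in this section. As an optional sketch (not part of the original analysis), the pair could be wrapped in a helper. `plot_categorical()` and its arguments are hypothetical names, the theme is trimmed for brevity, and `toy` is made-up data standing in for `eda_data`.

```r
library(ggplot2)

# Made-up rows standing in for eda_data (illustration only)
toy <- data.frame(
  LotConfig = c("Inside", "Inside", "Inside", "Corner", "Corner", "CulDSac", "FR3"),
  SalePrice = c(150000, 165000, 180000, 160000, 175000, 220000, 205000)
)

# Hypothetical helper: `var` is the categorical column name as a string.
# Categories are ordered by mean SalePrice, which is reorder()'s default.
plot_categorical <- function(data, var, title) {
  bar <- ggplot(data, aes(x = reorder(factor(.data[[var]]), .data[["SalePrice"]]))) +
    geom_bar(fill = "steelblue") +
    geom_text(stat = "count", aes(label = after_stat(count)),
              vjust = -0.25, color = "gray20", size = 3) +
    labs(title = title, x = NULL, y = "Count") +
    theme_minimal()

  box <- ggplot(data, aes(x = reorder(factor(.data[[var]]), .data[["SalePrice"]]),
                          y = .data[["SalePrice"]])) +
    geom_boxplot(color = "steelblue", outlier.color = "firebrick", outlier.alpha = 0.35) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = paste(title, "vs Sale Price"), x = NULL, y = "Sale Price") +
    theme_minimal()

  list(bar = bar, box = box)
}

plots <- plot_categorical(toy, "LotConfig", "Lot Configuration")
```

The two plots can then be placed side by side the same way as elsewhere in this section, with `gridExtra::grid.arrange(plots$bar, plots$box, nrow = 1)`.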
Foundation
Foundation = Type of foundation
hist <- ggplot(eda_data) +
aes(reorder(x = Foundation, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Foundation', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Foundation),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Foundation vs Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Most homes have a cinderblock foundation or a poured concrete foundation. It does appear that the type of foundation makes a pretty large difference in the sale price of the house. This might have something to do with the age of the home. Let’s see if we can dig into this and understand it a little better:
ggplot(eda_data) +
aes(x = YearBuilt, y = SalePrice, color = Foundation) +
geom_point( alpha = 0.40) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
# geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
scale_x_continuous(breaks= seq(0, 2020, by=10)) +
labs(title = 'Foundation Type Over Time vs Sale Price', y = 'Sale Price', x= 'Year') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
We can see very clearly that this feature is tied to age. Poured concrete seems to be the standard for all new homes and slightly older homes used cinder blocks. Before 1940 it looks like several other methods were used but are used very infrequently after that time.
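A cross-tabulation makes the same point as the scatter plot in tabular form. This is a minimal sketch on made-up rows (`toy` stands in for `eda_data`, and the `decade` column is a hypothetical helper), counting foundation types per decade built.

```r
# Made-up rows standing in for eda_data (illustration only)
toy <- data.frame(
  YearBuilt  = c(1915, 1925, 1938, 1955, 1962, 1978, 1999, 2004, 2007),
  Foundation = c("BrkTil", "BrkTil", "Stone", "CBlock", "CBlock",
                 "CBlock", "PConc", "PConc", "PConc")
)

# Bucket each build year into its decade, e.g. 2004 -> 2000
toy$decade <- floor(toy$YearBuilt / 10) * 10

# Rows are decades, columns are foundation types, cells are home counts
foundation_by_decade <- table(toy$decade, toy$Foundation)
print(foundation_by_decade)
```

On the real data, a shift from brick and stone to cinder block to poured concrete should show up as the counts moving across columns from the top of the table to the bottom.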
Land Contour
LandContour = Flatness of the property
hist <- ggplot(eda_data) +
aes(reorder(x = LandContour, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Flatness of the Property', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(LandContour),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Flatness of the Property vs Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Most homes sit on a level plot of land. There does appear to be some relationship between the flatness of the property and the sale price, although I’m not sure I understand it. The median value for homes with a depression (Low) in the property is higher than for those on a level plot, which doesn’t quite make sense to me. I can see how hillside lots could be more expensive, because nicer homes are generally found on hills with views.
Land Slope
LandSlope = Slope of property
hist <- ggplot(eda_data) +
aes(reorder(x = LandSlope, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Slope of the Property', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(LandSlope),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Slope of the Property vs Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Ames, Iowa is a fairly hilly region. It appears that homes on hills are more valuable than those that aren’t. The higher and steeper the better.
Lot Area
LotArea = Lot size in square feet
hist <- ggplot(eda_data) +
aes(x = LotArea) +
geom_histogram(binwidth =5000, fill = "steelblue", color = "black") +
labs(title = 'Lot Area', y = 'Count', x= 'Square Feet') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = LotArea, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Lot Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
The lot area distribution is heavily right skewed. This looks to be due to several fairly extreme outliers with lot sizes greater than 150,000 square feet. If we remove these homes from our dataset, there looks to be a loose relationship between lot area and sale price, but it isn’t very strong. This makes sense: lot area is just one factor of the property. Once the lot falls within a generally sufficient range, the remainder of the home’s value comes from the house itself and its features.
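The effect of the large lots on the relationship can be checked by comparing correlations with and without them. The sketch below uses made-up values (`toy` stands in for `eda_data`), takes the 150,000 square foot cutoff from the text, and adds a log-scale comparison as one common alternative to dropping rows; that last step is an assumption, not something done in this analysis.

```r
# Made-up rows standing in for eda_data (illustration only)
toy <- data.frame(
  LotArea   = c(7000, 8500, 9600, 11000, 12500, 14000, 160000, 215000),
  SalePrice = c(130000, 155000, 170000, 190000, 215000, 240000, 280000, 375000)
)

# Drop the lots above the cutoff mentioned in the text
trimmed <- subset(toy, LotArea <= 150000)

cor_all     <- cor(toy$LotArea, toy$SalePrice)          # all homes
cor_trimmed <- cor(trimmed$LotArea, trimmed$SalePrice)  # outliers removed
cor_log     <- cor(log(toy$LotArea), toy$SalePrice)     # log-scaled lot area

print(round(c(all = cor_all, trimmed = cor_trimmed, log = cor_log), 3))
```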
Lot Frontage
LotFrontage = Linear feet of street connected to property
hist <- ggplot(eda_data) +
aes(x = LotFrontage) +
geom_histogram(fill = "steelblue", color = "black") +
labs(title = 'Lot Frontage', y = 'Count', x= 'Linear Feet') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = LotFrontage, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
# geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Lot Frontage vs Sale Price', y = 'Sale Price', x= 'Linear Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Very similar to the last feature we looked at, we see the distribution is extremely right skewed. This is in large part due to several outliers in the dataset.
Street
Street = Type of road access
hist <- ggplot(eda_data) +
aes(reorder(x = Street, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Type of Road Access', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Street),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Type of Road Access v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Almost every home has paved road access. It looks like paved access is more valuable than a gravel road.
Paved Drive
PavedDrive = Paved driveway
hist <- ggplot(eda_data) +
aes(reorder(x = PavedDrive, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Paved Driveway?', y = 'Count', x= 'Paved?') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(PavedDrive),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Paved Driveway v Sale Price', y = 'Sale Price', x= 'Paved?') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
As we saw before, most homes have paved driveways. We can see clearly that those homes with a paved driveway have higher sales prices than those with partially paved driveways and those with no paved driveway.
Alley
Alley = Type of alley access
hist <- ggplot(eda_data) +
aes(reorder(x = Alley, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Type of Alley Access', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Alley),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Type of Alley Access v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
The majority of homes don’t have access to an alley (probably pretty normal in Iowa). It does appear that those homes with paved access to an alley have overall higher home prices than those with gravel access.
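In the competition's data description, `NA` in `Alley` means "No alley access" rather than a missing value, so those homes can be given their own category. A minimal sketch on made-up rows (`toy` stands in for `eda_data`; `alley_label` is a hypothetical column name):

```r
# Made-up rows standing in for eda_data (illustration only)
toy <- data.frame(
  Alley     = c(NA, NA, NA, "Grvl", "Pave", NA, "Pave", "Grvl"),
  SalePrice = c(180000, 165000, 210000, 120000, 170000, 195000, 185000, 125000),
  stringsAsFactors = FALSE
)

# Recode NA as an explicit "None" level so it is counted and plotted
toy$alley_label <- ifelse(is.na(toy$Alley), "None", toy$Alley)

alley_counts  <- table(toy$alley_label)
alley_medians <- tapply(toy$SalePrice, toy$alley_label, median)

print(alley_counts)
print(alley_medians)
```

With the explicit level in place, homes without alley access can be compared directly against the gravel and paved groups.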
Roof Features
Roof Materials
RoofMatl = Roof material
hist <- ggplot(eda_data) +
aes(reorder(x = RoofMatl, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Roof Material', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(RoofMatl),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Roof Material v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Nearly every house has standard composite shingles. You’d expect this because a roof needs to be replaced every 30 years or so, so even if a house originally had a different roof type, if it’s over 30 years old it was probably reroofed with composite shingles at some point. Wood shingling is apparently a nice feature.
Roof Style
RoofStyle = Type of roof
hist <- ggplot(eda_data) +
aes(reorder(x = RoofStyle, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Roof Style', y = 'Count', x= 'Style') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(RoofStyle),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Roof Style v Sale Price', y = 'Sale Price', x= 'Style') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
As with roof material, one category dominates: it looks like almost all roofs are Gable style. Additionally, roof style doesn’t look too strongly correlated with Sale Price.
Porch Features
Wood Deck Square Footage
WoodDeckSF = Wood deck area in square feet
hist <- ggplot(eda_data) +
aes(x = WoodDeckSF) +
geom_histogram(fill = "steelblue", color = "black") +
labs(title = 'Wood Deck Area', y = 'Count', x= 'Square Feet') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = WoodDeckSF, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Wood Deck Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Interestingly, having a wood deck does not seem to be strongly correlated with sale price. It does look loosely correlated, but not as strong as I would have suspected.
Open Porch Square Footage
OpenPorchSF = Open porch area in square feet
hist <- ggplot(eda_data) +
aes(x = OpenPorchSF) +
geom_histogram(fill = "steelblue", color = "black") +
labs(title = 'Open Porch Area', y = 'Count', x= 'Square Feet') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = OpenPorchSF, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Open Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Similar to what we saw above, it doesn’t appear that there is a very strong relationship between porch area and sale price.
Enclosed Porch Square Footage
EnclosedPorch = Enclosed porch area in square feet
hist <- ggplot(eda_data) +
aes(x = EnclosedPorch) +
geom_histogram(fill = "steelblue", color = "black") +
labs(title = 'Enclosed Porch Area', y = 'Count', x= 'Square Feet') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = EnclosedPorch, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Enclosed Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
In looking at the enclosed porch area, it appears that there is a negative relationship with sale price, although it is a very weak, loose relationship. I would not expect this. It must be due to other factors of homes with enclosed porch areas.
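One candidate for those "other factors" is the age of the home; this is a guess, not something the plots above establish. The sketch below shows how it could be checked: correlate enclosed porch area with year built, then compare the porch coefficient with and without `YearBuilt` in the model. The rows are made up (`toy` stands in for `eda_data`), so the output only demonstrates the mechanics.

```r
# Made-up rows standing in for eda_data (illustration only)
toy <- data.frame(
  EnclosedPorch = c(0, 0, 0, 0, 80, 120, 150, 200),
  YearBuilt     = c(2005, 1998, 1990, 1975, 1950, 1940, 1930, 1920),
  SalePrice     = c(250000, 230000, 200000, 170000, 140000, 135000, 120000, 115000)
)

# Do older homes tend to have more enclosed porch area?
porch_age_cor <- cor(toy$EnclosedPorch, toy$YearBuilt)

# Porch coefficient alone versus after controlling for year built
fit_porch <- lm(SalePrice ~ EnclosedPorch, data = toy)
fit_both  <- lm(SalePrice ~ EnclosedPorch + YearBuilt, data = toy)

print(porch_age_cor)
print(coef(fit_porch))
print(coef(fit_both))
```

If the porch coefficient shrinks toward zero once `YearBuilt` is included, that would be consistent with age driving the apparent negative relationship.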
Screen Porch Square Footage
ScreenPorch = Screen porch area in square feet
hist <- ggplot(eda_data) +
aes(x = ScreenPorch) +
geom_histogram(fill = "steelblue", color = "black") +
labs(title = 'Screen Porch Area', y = 'Count', x= 'Square Feet') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = ScreenPorch, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Screen Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Very few homes have screen porch square footage. There is a very weak relationship with sale price.
Three Season Porch Square Footage
3SsnPorch = Three season porch area in square feet
hist <- ggplot(eda_data) +
aes(x = `3SsnPorch`) +
geom_histogram(fill = "steelblue", color = "black") +
labs(title = 'Three Season Porch Area', y = 'Count', x= 'Square Feet') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = `3SsnPorch`, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Three Season Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Again, very few homes have a three season porch, and there is only a weak relationship with sale price. There doesn’t appear to be a total porch square footage variable, so let’s create one and see if we can find a better relationship than what we have been seeing:
eda_data %>%
mutate(total_porch_area = WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + `3SsnPorch`) %>%
select(total_porch_area, SalePrice) %>%
ggplot() +
aes(x = total_porch_area, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Total Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
While the relationship is loose at best, there is a positive linear relationship between these two variables. We’ll create this feature in our feature engineering and be sure to include it in our modeling.
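To judge whether the combined feature does better than its parts, the correlation of each porch column with sale price can be compared against the total. A minimal sketch on made-up rows (`toy` stands in for `eda_data`; `porch_cols` is a hypothetical helper vector):

```r
# Made-up rows standing in for eda_data (illustration only).
# check.names = FALSE keeps the non-syntactic column name `3SsnPorch`.
toy <- data.frame(
  WoodDeckSF    = c(0, 100, 0, 200, 150, 0, 300, 120),
  OpenPorchSF   = c(30, 0, 60, 40, 0, 80, 100, 50),
  EnclosedPorch = c(0, 0, 120, 0, 0, 90, 0, 0),
  ScreenPorch   = c(0, 0, 0, 0, 150, 0, 0, 0),
  `3SsnPorch`   = c(0, 0, 0, 0, 0, 0, 180, 0),
  SalePrice     = c(120000, 150000, 135000, 210000, 190000, 140000, 320000, 175000),
  check.names   = FALSE
)

porch_cols <- c("WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "ScreenPorch", "3SsnPorch")

# Same sum as the mutate() call above, written with rowSums()
toy$total_porch_area <- rowSums(toy[, porch_cols])

# Pearson correlation of each porch column, and the total, with SalePrice
porch_cors <- sapply(toy[, c(porch_cols, "total_porch_area")], cor, y = toy$SalePrice)
print(round(porch_cors, 2))
```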
Pool Features
Pool Area
PoolArea = Pool area in square feet
hist <- ggplot(eda_data %>% filter(PoolArea > 0)) +
aes(x = PoolArea) +
geom_histogram(fill = "steelblue", color = "black") +
labs(title = 'Pool Area', y = 'Count', x= 'Square Feet') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data %>% filter(PoolArea > 0)) +
aes(x = PoolArea, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Pool Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Apparently pools in Iowa aren’t treasured like they should be. The fitted line actually suggests that the more square footage your pool has, the lower your home price. However, this is a poor sample because so few homes have pools, so I don’t think there is any reliable relationship between Pool Area and Sale Price.
Pool Quality
PoolQC = Pool quality
hist <- ggplot(eda_data) +
aes(reorder(x = PoolQC, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Pool Quality', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
#axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(PoolQC),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Pool Quality v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
There are only 7 homes with pools, so there isn’t a lot of data to go on. If you look at the box plots, we don’t see what we’d expect. I may end up removing the pool features, as they don’t seem to affect sale price.
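To put numbers on the small-sample concern, a count and median price per pool quality level is a quick check. The sketch below is a minimal illustration on a made-up data frame: `toy_homes` and its values are hypothetical stand-ins for `eda_data`, and only the column names come from the dataset.

```r
library(dplyr)

# Hypothetical stand-in for eda_data: most homes have no pool (PoolQC is NA)
toy_homes <- tibble::tibble(
  PoolArea  = c(0, 0, 0, 512, 0, 648, 0, 0),
  PoolQC    = c(NA, NA, NA, "Ex", NA, "Fa", NA, NA),
  SalePrice = c(150000, 180000, 120000, 235000, 210000, 181000, 99000, 143000)
)

# Count of homes and median sale price per pool quality level.
# The dplyr:: prefixes avoid the plyr versions of these verbs,
# because plyr is loaded after tidyverse in this notebook.
pool_summary <- toy_homes %>%
  dplyr::group_by(PoolQC) %>%
  dplyr::summarise(
    n            = dplyr::n(),
    median_price = median(SalePrice),
    .groups      = "drop"
  )

# Share of homes that have a pool at all
pool_share <- mean(toy_homes$PoolArea > 0)

pool_summary
pool_share
```

With only a handful of rows in each non-NA group, any difference in medians is driven by one or two houses, which is the reason I’m hesitant to keep these features.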
Home Exterior Features
Exterior Covering of the House
Exterior1st = Exterior covering on house
hist <- ggplot(eda_data) +
aes(reorder(x = Exterior1st, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Exterior Covering of the House', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Exterior1st),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Exterior Covering of the House v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
While not all the types of exterior covering have significant differences in their medians, it does look like there are some clear differences in the highs and lows.
Exterior Covering of the House if More Than One
Exterior2nd = Exterior covering on house (if more than one material)
hist <- ggplot(eda_data) +
aes(reorder(x = Exterior2nd, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Exterior Covering of the House (If More Than One)', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Exterior2nd),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Exterior Covering of the House (If More Than One) v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
We see a very similar pattern with this feature as we did in the previous exterior feature that we looked at.
Masonry Veneer Type
MasVnrType = Masonry veneer type
hist <- ggplot(eda_data) +
aes(reorder(x = MasVnrType, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Masonry Veneer Type', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(MasVnrType),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Masonry Veneer Type v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
It looks like Stone is a better veneer type than brick, but having no veneer at all is better than having common brick. Additionally, it looks like we have some NAs to take care of during our data prep stage.
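Before the data prep stage, it helps to know how many NAs there are and which rows they sit in. The snippet below is a small sketch of that check; `toy_masonry` is a hypothetical data frame with invented values, not the real data.

```r
# Hypothetical stand-in for the masonry columns of eda_data
toy_masonry <- data.frame(
  MasVnrType = c("None", "BrkFace", NA, "Stone", NA),
  MasVnrArea = c(0, 120, NA, 80, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_masonry))
na_counts

# Rows where the veneer type is missing, to see whether the area is missing too
missing_rows <- toy_masonry[is.na(toy_masonry$MasVnrType), ]
missing_rows
```

If the type and the area are missing in the same rows, that suggests they can be handled together in the cleaning step.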
Masonry Veneer Area in Square Feet
MasVnrArea = Masonry veneer area in square feet
hist <- ggplot(eda_data ) +
aes(x = MasVnrArea) +
geom_histogram(fill = "steelblue", color = "black") +
labs(title = 'Masonry Veneer Area', y = 'Count', x= 'Square Feet') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data ) +
aes(x = MasVnrArea, y = SalePrice, color = MasVnrType) +
geom_point( alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Masonry Veneer Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
It looks like for stone veneer, the square footage is more important than for a brick face in determining sale price.
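One way to check this impression is to fit a separate simple regression for each veneer type and compare the slopes. The sketch below does that on invented numbers: `toy_veneer` is hypothetical, and its values were chosen so that Stone has the steeper slope, so it illustrates the technique rather than the real result.

```r
# Hypothetical data: sale price rises faster with veneer area for Stone
toy_veneer <- data.frame(
  MasVnrType = rep(c("BrkFace", "Stone"), each = 4),
  MasVnrArea = c(100, 200, 300, 400, 100, 200, 300, 400),
  SalePrice  = c(160000, 170000, 180000, 190000,
                 200000, 230000, 260000, 290000)
)

# Fit SalePrice ~ MasVnrArea separately per veneer type and keep the slope,
# i.e. dollars of sale price per additional square foot of veneer
slopes <- sapply(split(toy_veneer, toy_veneer$MasVnrType), function(d) {
  unname(coef(lm(SalePrice ~ MasVnrArea, data = d))["MasVnrArea"])
})

slopes
```

On the real data, the same idea could be expressed as a single model with an interaction term (`SalePrice ~ MasVnrArea * MasVnrType`), which also gives a test of whether the slopes differ.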
Utilities
Utilities
Utilities = Type of utilities available
hist <- ggplot(eda_data) +
aes(reorder(x = Utilities, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Utilities Available', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Utilities),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Utilities Available v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Only one home doesn’t have public utilities and the sale price was lower than the median of the sale prices from those homes with access to public utilities.
Central Air
CentralAir = Central air conditioning
hist <- ggplot(eda_data) +
aes(reorder(x = CentralAir, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Air Conditioning?', y = 'Count', x= 'Yes/No') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(CentralAir),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Air Conditioning v Sale Price', y = 'Sale Price', x= 'Yes/No') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Thankfully almost every home has air conditioning, and as expected, these homes sell for considerably more.
Electrical
Electrical = Electrical system
hist <- ggplot(eda_data) +
aes(reorder(x = Electrical, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Electrical System', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Electrical),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Electrical System v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
There is some pretty clear differentiation in home price between these electrical systems. The standard circuit breaker system is by far the most common, and homes with this system tend to have higher prices.
Heating
Heating = Type of heating
hist <- ggplot(eda_data) +
aes(reorder(x = Heating, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Type of Heating', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Heating),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Type of Heating v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Almost every house has a Gas forced warm air furnace. Those without have lower home prices.
Heating Quality
HeatingQC = Heating quality and condition
hist <- ggplot(eda_data) +
aes(reorder(x = HeatingQC, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Heating Quality and Condition', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(HeatingQC),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Heating Quality and Condition v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
The condition of your heating system does play a role in the house price. We can see that as the quality increases, so does the sale price.
Other Features
Fence
Fence = Fence Quality
hist <- ggplot(eda_data) +
aes(reorder(x = Fence, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Fence Quality', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Fence),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Fence Quality v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Apparently fences are not as important in Iowa as they are in other parts of the country. The majority of homes don’t have fences. When a home does have a fence, if it offers good privacy, then it will increase the selling price of the home; otherwise the quality doesn’t really matter. This is a weird variable. It really should almost be two separate variables: fence privacy and then good wood or not.
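As a rough sketch of what splitting this variable could look like, the code below derives a privacy column and a wood column from the four `Fence` codes in the data description (GdPrv, MnPrv, GdWo, MnWw, with NA meaning no fence). The new column names (`HasFence`, `FencePrivacy`, `FenceWood`) and the `toy_fence` data frame are my own hypothetical choices, not something that exists in the dataset.

```r
library(dplyr)

# Hypothetical stand-in with one row per Fence code, plus NA for "no fence"
toy_fence <- tibble::tibble(Fence = c("GdPrv", "MnPrv", "GdWo", "MnWw", NA))

fence_split <- toy_fence %>%
  dplyr::mutate(
    # TRUE when the home has any fence at all
    HasFence = !is.na(Fence),
    # Privacy rating, "None" when the code is not a privacy code (or is NA)
    FencePrivacy = dplyr::case_when(
      Fence == "GdPrv" ~ "Good",
      Fence == "MnPrv" ~ "Minimum",
      TRUE             ~ "None"
    ),
    # Wood rating, "None" when the code is not a wood code (or is NA)
    FenceWood = dplyr::case_when(
      Fence == "GdWo" ~ "Good",
      Fence == "MnWw" ~ "Minimum",
      TRUE            ~ "None"
    )
  )

fence_split
```

Because a comparison with NA evaluates to NA inside `case_when()`, homes with no fence fall through to the `TRUE ~ "None"` branch in both derived columns.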
Fireplaces
Fireplaces = Number of fireplaces
hist <- ggplot(eda_data) +
aes(reorder(x = Fireplaces, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Number of Fireplaces', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(Fireplaces),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Number of Fireplaces v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Fireplaces looks like a good indicator for sale price. This may be because it is correlated with square footage and other important factors.
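A simple way to probe that suspicion is to look at the correlation between fireplace count and living area, and then see what happens to the fireplace coefficient once living area is in the model. This is only a sketch on invented numbers: `toy_fire` is hypothetical, and I’m using `GrLivArea` as the square footage measure.

```r
# Hypothetical data in which bigger homes also have more fireplaces
toy_fire <- data.frame(
  Fireplaces = c(0, 0, 1, 1, 2, 2),
  GrLivArea  = c(900, 1100, 1500, 1700, 2300, 2500),
  SalePrice  = c(110000, 125000, 170000, 185000, 260000, 280000)
)

# Pairwise correlations among fireplace count, living area and sale price
cor_mat <- cor(toy_fire)
cor_mat

# Fireplace effect on its own, and after controlling for living area
fit_alone    <- lm(SalePrice ~ Fireplaces, data = toy_fire)
fit_adjusted <- lm(SalePrice ~ GrLivArea + Fireplaces, data = toy_fire)

coef(fit_alone)
coef(fit_adjusted)
```

If the fireplace coefficient shrinks a lot in the adjusted model, most of its apparent effect is really home size showing up under another name.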
Fireplace Quality
FireplaceQu = Fireplace Quality
hist <- ggplot(eda_data) +
aes(reorder(x = FireplaceQu, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Fireplace Quality', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(FireplaceQu),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Fireplace Quality v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
The quality of the fireplace seems to be correlated with sale price as well. The NAs here indicate that there is no fireplace, which looks no better for sale price than having a fireplace in poor condition.
Miscellaneous Features Not Covered in Other Categories
MiscFeature = Miscellaneous feature not covered in other categories
hist <- ggplot(eda_data) +
aes(reorder(x = MiscFeature, SalePrice)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Miscellaneous Features', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# axis.text.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = reorder(factor(MiscFeature),SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Miscellaneous Features v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
# axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
Owning a tennis court is definitely a good sign that your house is worth some money. All of these variables are fairly rare. Most homes do not have miscellaneous features and having them doesn’t really mean much for the sale price other than if you have a tennis court (which is one house).
Miscellaneous Features Value
MiscVal = $Value of miscellaneous feature
hist <- ggplot(eda_data ) +
aes(x = MiscVal) +
geom_histogram(fill = "steelblue", color = "black") +
labs(title = 'Miscellaneous Feature Value', y = 'Count', x= '$') +
# scale_x_continuous(breaks= seq(0, 25000, by=5000)) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data ) +
aes(x = MiscVal, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Miscellaneous Feature Value vs Sale Price', y = 'Sale Price', x= '$') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
It looks like the value doesn’t make much of a difference in the sale price. I wouldn’t expect these plots to show much since there are so few instances of this.
Date Features
Year Built
YearBuilt = Original construction date
hist <- ggplot(eda_data) +
aes(x = as.factor(YearBuilt)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Year Built', y = 'Count', x= 'Year') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = factor(YearBuilt), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Year Built v Sale Price', y = 'Sale Price', x= 'Year') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, ncol=1)
Very interesting plot! We see quite a bit of variation here. Some of it is due to there being only a handful of houses in each year. One thing is pretty clear: newer homes tend to have higher selling prices than older homes. As a note, year built should be a categorical variable as opposed to an integer. We’ll update this in a later stage.
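For reference, the conversion itself is a one-liner, and tabulating the result shows how thin some of the year categories are. The sketch below uses a hypothetical `toy_years` data frame. The decade binning at the end is just one possible way to deal with sparse years, an assumption on my part rather than something this analysis commits to.

```r
# Hypothetical build years, several of which appear only once
toy_years <- data.frame(YearBuilt = c(1921, 1925, 1958, 1999, 2003, 2003, 2006))

# Treat each year as its own category
toy_years$YearBuiltFct <- factor(toy_years$YearBuilt)

# Years with a single home become one-observation categories
table(toy_years$YearBuiltFct)

# One possible alternative (an assumption, not the author's choice):
# group years into decades so each category has more homes
toy_years$DecadeBuilt <- factor(floor(toy_years$YearBuilt / 10) * 10)
table(toy_years$DecadeBuilt)
```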
Year Remodeled
YearRemodAdd = Remodel date
hist <- ggplot(eda_data) +
aes(x = as.factor(YearRemodAdd)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Year Remodeled', y = 'Count', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = factor(YearRemodAdd), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Year Remodeled v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, ncol=1)
As we would expect, homes that were remodeled more recently have higher sale prices than those that were remodeled many years ago. It’s also interesting to see that many people remodeled their homes in 1950.
Year Sold
YrSold = Year Sold
hist <- ggplot(eda_data) +
aes(x = as.factor(YrSold)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Year Sold', y = 'Count', x= 'Year') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = factor(YrSold), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Year Sold v Sale Price', y = 'Sale Price', x= 'Year') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
It doesn’t look like the year sold really has any effect on the Sale Price, which is incredibly interesting, especially since this data encompasses the financial crisis that began in 2007. You would think that home prices would have dropped substantially as people stopped selling their homes and the market was flooded with people defaulting on their loans.
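The box plots can be backed up with a small table of median price per sale year and the change relative to the first year. The values in `toy_sales` below are invented for illustration, so the output says nothing about the real Ames numbers.

```r
library(dplyr)

# Hypothetical sales, two per year
toy_sales <- tibble::tibble(
  YrSold    = rep(2006:2010, each = 2),
  SalePrice = c(160000, 170000, 162000, 172000, 158000,
                170000, 159000, 167000, 155000, 171000)
)

# Median price per year, plus the relative change versus the first year.
# dplyr:: prefixes guard against the plyr versions of summarise/mutate.
yearly <- toy_sales %>%
  dplyr::group_by(YrSold) %>%
  dplyr::summarise(
    n            = dplyr::n(),
    median_price = median(SalePrice),
    .groups      = "drop"
  ) %>%
  dplyr::mutate(change_vs_first = median_price / dplyr::first(median_price) - 1)

yearly
```

A `change_vs_first` column that stays within a few percent of zero would support the reading that sale year has little effect on price.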
Month Sold
MoSold = Month Sold
hist <- ggplot(eda_data) +
aes(x = as.factor(MoSold)) +
geom_bar(stat = "count", fill = 'steelblue') +
geom_text(stat='count', aes(label=..count..), vjust=-.25, color = "gray20", size= 3) +
labs(title = 'Month Sold', y = 'Count', x= 'Month') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
axis.text.y = element_blank(),
axis.title.y= element_blank(),
axis.title.x= element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
box <- ggplot(eda_data) +
aes(x = factor(MoSold), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Month Sold v Sale Price', y = 'Sale Price', x= 'Month') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
gridExtra::grid.arrange(hist, box, nrow=1)
While there is some variation in the box plot above with respect to the different months, it doesn’t look like there is a clear relationship between sale price and the month a home was sold.
Data Cleaning and Data Encoding
From our extensive exploratory data analysis (EDA) we learned A LOT! Now we need to take our findings and clean our data so that they can best be utilized by the model. During our EDA we saw that we have lots of ordinal categorical variables that will need to be encoded. Additionally, there are some NA values that we’ll need to take care of as well as some outliers.
Outliers
When working with OverallQual we identified an outlier in Category 4 that we’ll need to remove. We’ll do that now:
#remove from our full dataset
homes <- homes %>%
filter(Id != 458)
Missing Values
Garage Features
Among other NAs, we have 1 NA in GarageCars and 1 in GarageArea. Let’s explore these values and see if they are something we need to remove or adjust. Both nulls are found in the same row (Id 2577), which belongs to our test set.
homes %>%
filter(is.na(GarageCars) & is.na(GarageArea)) %>%
select('GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')
## # A tibble: 1 x 6
## GarageCars GarageArea GarageType GarageCond GarageQual GarageFinish
## <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 NA NA Detchd <NA> <NA> <NA>
As we can see, all of the garage features are missing except for GarageType. As every other observation in the dataset has a value for GarageCars and GarageArea, I think this may be an input error. We’ll make an assumption here and say that this house does not have a garage, and adjust accordingly:
homes <- homes %>%
mutate(GarageCars = case_when( Id == 2577 & is.na(GarageCars) ~ 0, TRUE ~ GarageCars)) %>%
mutate(GarageArea = case_when( Id == 2577 & is.na(GarageArea) ~ 0, TRUE ~ GarageArea)) %>%
mutate(GarageType = case_when( Id == 2577 & GarageType == 'Detchd' ~ NA_character_, TRUE ~ GarageType))
When working with the garage data, we noted many NA values that actually represent homes without garages. Before recoding these NAs to ‘None’ so that they aren’t imputed later, let’s confirm that the rows missing all of the categorical garage features also have zero GarageCars and GarageArea:
homes %>% filter(
is.na(GarageType) & is.na(GarageYrBlt) & is.na(GarageFinish) & is.na(GarageQual) & is.na(GarageCond)) %>%
select('GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish') %>% dplyr::count(GarageCars, GarageArea)
## # A tibble: 1 x 3
## GarageCars GarageArea n
## <dbl> <dbl> <int>
## 1 0 0 158
Looking at the above, we can clearly see that all the NA values in these columns occur because the house does not have a garage. We can easily clean these up now:
homes <- homes %>%
mutate( GarageType = case_when( is.na(GarageType) ~ 'None', TRUE ~ GarageType)) %>%
mutate( GarageYrBlt = case_when( is.na(GarageYrBlt) ~ YearBuilt, TRUE ~ GarageYrBlt)) %>% #defaulting to YearBuilt when there is no garage year built date
mutate(GarageFinish = case_when( is.na(GarageFinish) ~ 'None', TRUE ~ GarageFinish)) %>%
mutate(GarageQual = case_when( is.na(GarageQual) ~ 'None', TRUE ~ GarageQual)) %>%
mutate(GarageCond = case_when( is.na(GarageCond) ~ 'None', TRUE ~ GarageCond))
Now, let’s ensure we’ve filled all nulls in the garage features:
colSums(is.na(homes %>% select(contains('Garage'))))
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 0 0 0 0 0 0
## GarageCond
## 0
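The garage recoding above can also be phrased more compactly. The sketch below is an alternative, not the code used in this analysis: it runs on a small made-up tibble (`toy` is hypothetical) and uses the data-frame form of tidyr’s replace_na() for the character columns plus dplyr’s coalesce() for the GarageYrBlt fallback.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the second house has no garage
toy <- tibble(
  YearBuilt    = c(1995, 1950),
  GarageType   = c("Attchd", NA),
  GarageYrBlt  = c(1995, NA),
  GarageFinish = c("Fin", NA)
)

toy_clean <- toy %>%
  # fill several character columns in a single call
  replace_na(list(GarageType = "None", GarageFinish = "None")) %>%
  # fall back to YearBuilt where the garage year is missing
  mutate(GarageYrBlt = coalesce(GarageYrBlt, YearBuilt))

toy_clean
```

The behaviour is the same as the chained case_when() calls; the only gain is that each rule is stated once.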
Basement Features
In looking at the NAs within the basement features, we saw during our analysis that most of them are due to there being no basement in the house. We’ll leave most of the imputing to be done by R when we get to the modeling section, but we will set any NAs that we don’t want imputed to an explicit value. Most of the NAs in columns such as BsmtQual are missing because there is no basement. There are 9 rows, shown below, where this is not true:
homes %>%
filter((!is.na(BsmtFinType1) & (is.na(BsmtCond)|is.na(BsmtQual)|is.na(BsmtExposure)|is.na(BsmtFinType2)))) %>% select('Id', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2')
## # A tibble: 9 x 6
## Id BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 333 Gd TA No GLQ <NA>
## 2 949 Gd TA <NA> Unf Unf
## 3 1488 Gd TA <NA> Unf Unf
## 4 2041 Gd <NA> Mn GLQ Rec
## 5 2186 TA <NA> No BLQ Unf
## 6 2218 <NA> Fa No Unf Unf
## 7 2219 <NA> TA No Unf Unf
## 8 2349 Gd TA <NA> Unf Unf
## 9 2525 TA <NA> Av ALQ Unf
We’ll let the imputation engine impute those for us. Every other NA in BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, and BsmtFinType2 we’ll set to ‘None’:
homes$BsmtQual <- homes$BsmtQual %>% replace_na('None')
homes$BsmtCond <- homes$BsmtCond %>% replace_na('None')
homes$BsmtExposure <- homes$BsmtExposure %>% replace_na('None')
homes$BsmtFinType1 <- homes$BsmtFinType1 %>% replace_na('None')
homes$BsmtFinType2 <- homes$BsmtFinType2 %>% replace_na('None')
homes <- homes %>%
mutate(BsmtQual = case_when( Id %in% c(2218,2219) ~ NA_character_, TRUE ~ BsmtQual)) %>%
mutate(BsmtCond = case_when( Id %in% c(2041,2186,2525) ~ NA_character_, TRUE ~ BsmtCond)) %>%
mutate(BsmtExposure = case_when( Id %in% c(949,1488,2349) ~ NA_character_, TRUE ~ BsmtExposure)) %>%
mutate(BsmtFinType2 = case_when( Id %in% c(333) ~ NA_character_, TRUE ~ BsmtFinType2))
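The fill-then-restore pattern above relies on hard-coded Ids. A minimal sketch of a rule-based alternative follows, assuming (as the filter above does) that a missing BsmtFinType1 marks a house without a basement. The tibble `toy` and the helper column `no_bsmt` are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical toy data:
#   row 1 has a basement, row 2 has none, row 3 has a basement with a missing rating
toy <- tibble(
  Id           = c(1, 2, 3),
  BsmtQual     = c("Gd", NA, NA),
  BsmtCond     = c("TA", NA, "TA"),
  BsmtFinType1 = c("GLQ", NA, "Unf")
)

toy_clean <- toy %>%
  # record the "no basement" flag first, before any column is overwritten
  mutate(no_bsmt = is.na(BsmtFinType1)) %>%
  # only fill NAs that belong to houses without a basement
  mutate(across(c(BsmtQual, BsmtCond, BsmtFinType1),
                ~ if_else(no_bsmt & is.na(.x), "None", .x))) %>%
  select(-no_bsmt)

toy_clean
```

Row 3 keeps its NA in BsmtQual, so it would still be left for the imputation step.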
Lot Features
Alley
When Alley is NA it represents the house not having access to an alley. We’ll encode these NAs as None.
homes$Alley <- homes$Alley %>% replace_na('None')
Lot Frontage
homes %>% filter(is.na(LotFrontage)) %>% select(Id, LotFrontage, LotArea, LotShape, LandContour)
## # A tibble: 485 x 5
## Id LotFrontage LotArea LotShape LandContour
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 8 NA 10382 IR1 Lvl
## 2 13 NA 12968 IR2 Lvl
## 3 15 NA 10920 IR1 Lvl
## 4 17 NA 11241 IR1 Lvl
## 5 25 NA 8246 IR1 Lvl
## 6 32 NA 8544 IR1 Lvl
## 7 43 NA 9180 IR1 Lvl
## 8 44 NA 9200 IR1 Lvl
## 9 51 NA 13869 IR2 Lvl
## 10 65 NA 9375 Reg Lvl
## # ... with 475 more rows
The data documentation for lot frontage does not say that NA means a house with no lot frontage. Additionally, in other numeric variables, when a house lacks a feature we see 0 rather than NA. There are 485 rows with missing lot frontage. The best method to fill these would be imputation. If we were doing it by hand, we might take the median lot frontage of the neighborhood each house is in, but we’ll most likely use KNN to fill these values.
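For reference, the by-hand neighborhood-median approach mentioned above could look like the sketch below. It runs on a made-up tibble (`toy` is hypothetical) and is not what the final model uses, since the recipe later relies on KNN imputation.

```r
library(dplyr)

# Hypothetical toy data with two neighborhoods and two missing frontages
toy <- tibble(
  Neighborhood = c("NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert"),
  LotFrontage  = c(60, 80, NA, 50, NA)
)

toy_filled <- toy %>%
  group_by(Neighborhood) %>%
  # replace each NA with the median of the non-missing values in its neighborhood
  mutate(LotFrontage = coalesce(LotFrontage, median(LotFrontage, na.rm = TRUE))) %>%
  ungroup()

toy_filled
```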
Fence
Here we read in the documentation that NA is when a house has no fence. We’ll encode all NAs here as ‘None’.
homes$Fence <- homes$Fence %>% replace_na('None')
FireplaceQu
Again we read that NA means a home with no fireplace. We’ll encode all NAs as ‘None’ as we’ve done before.
homes$FireplaceQu <- homes$FireplaceQu %>% replace_na('None')
MiscFeature
homes$MiscFeature <- homes$MiscFeature %>% replace_na('None')
PoolQC
There are three instances where a home has a pool area but not a quality rating. We’ll impute these in our modeling section. For all others, where PoolQC is blank, we’ll fill with ‘None’.
homes %>%
filter(PoolArea != 0 & is.na(PoolQC)) %>%
select(Id, PoolArea, PoolQC)
## # A tibble: 3 x 3
## Id PoolArea PoolQC
## <dbl> <dbl> <chr>
## 1 2421 368 <NA>
## 2 2504 444 <NA>
## 3 2600 561 <NA>
homes$PoolQC <- homes$PoolQC %>% replace_na('None')
homes <- homes %>%
mutate(PoolQC = case_when( Id %in% c(2421,2504,2600) ~ NA_character_, TRUE ~ PoolQC))
Let’s take a final look at the null values and make sure all the null values we see remaining are those we have intentionally left so that they can be imputed during the modeling stage:
colSums(is.na(homes))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 4 485 0
## Street Alley LotShape LandContour Utilities
## 0 0 0 0 2
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 1 1
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 24 23 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 2 3 3 0 1
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 1 1 1 1 0
## HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 2 2 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 1 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 2 0 0 0 0
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 0 0 0 0 0
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 3 0 0
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 1 0
## SalePrice dataset
## 1459 0
Looks great! We’ll move on to encoding our ordinal categorical variables.
Encoding Variables
We have many categorical variables that are actually ordinal. We’ll adjust those here.
We’ll make a qualities vector that will be used multiple times to encode our variables:
qualities <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
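To see what this named vector does, here is a small base R sketch that looks ratings up by name instead of calling plyr::revalue(). The `ratings` vector is made up for illustration; the point is that each label maps to its integer score and that an NA stays NA, so it can still be imputed later.

```r
qualities <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)

# Hypothetical ratings, including one missing value
ratings <- c("Gd", "TA", "None", NA, "Ex")

# Index the named vector by label, then drop the names
scores <- as.integer(unname(qualities[ratings]))
scores
```

Unlike revalue(), the lookup prints no message when a label such as ‘Po’ never appears in the data.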
Basement Features
BsmtQual
homes$BsmtQual <- as.integer(plyr::revalue(homes$BsmtQual, qualities))
## The following `from` values were not present in `x`: Po
table(homes$BsmtQual)
##
## 0 2 3 4 5
## 79 88 1283 1208 258
BsmtCond
homes$BsmtCond <- as.integer(plyr::revalue(homes$BsmtCond, qualities))
## The following `from` values were not present in `x`: Ex
table(homes$BsmtCond)
##
## 0 1 2 3 4
## 79 5 104 2605 122
BsmtExposure
exposure <- c('None' = 0, 'No' = 1, 'Mn' = 2, 'Av' = 3, 'Gd' = 4)
homes$BsmtExposure <- as.integer(plyr::revalue(homes$BsmtExposure, exposure))
table(homes$BsmtExposure)
##
## 0 1 2 3 4
## 79 1904 239 418 275
BsmtFinType1
homes$BsmtFinType1 <- as.factor(homes$BsmtFinType1)
BsmtFinType2
homes$BsmtFinType2 <- as.factor(homes$BsmtFinType2)
Garage Features
GarageQual
homes$GarageQual <- as.integer(plyr::revalue(homes$GarageQual, qualities))
table(homes$GarageQual)
##
## 0 1 2 3 4 5
## 159 5 124 2603 24 3
GarageCond
homes$GarageCond <- as.integer(plyr::revalue(homes$GarageCond, qualities))
table(homes$GarageCond)
##
## 0 1 2 3 4 5
## 159 14 74 2653 15 3
GarageFinish
finish <- c('None' = 0, 'Unf' = 1, 'RFn' = 2, 'Fin' = 3)
homes$GarageFinish <- as.integer(plyr::revalue(homes$GarageFinish, finish))
table(homes$GarageFinish)
##
## 0 1 2 3
## 159 1230 811 718
GarageType
homes$GarageType <- as.factor(homes$GarageType)
GarageYrBlt
homes$GarageYrBlt <- as.factor(homes$GarageYrBlt)
Exterior Features
ExterCond
homes$ExterCond <- as.integer(plyr::revalue(homes$ExterCond, qualities))
## The following `from` values were not present in `x`: None
table(homes$ExterCond)
##
## 1 2 3 4 5
## 3 67 2537 299 12
ExterQual
homes$ExterQual <- as.integer(plyr::revalue(homes$ExterQual, qualities))
## The following `from` values were not present in `x`: None, Po
table(homes$ExterQual)
##
## 2 3 4 5
## 35 1797 979 107
Exterior1st
homes$Exterior1st <- as.factor(homes$Exterior1st )
Exterior2nd
homes$Exterior2nd <- as.factor(homes$Exterior2nd )
MasVnrType
homes$MasVnrType <- as.factor(homes$MasVnrType )
Condition Features
Functional
functions <- c('Sal' = 0, 'Sev' = 1, 'Maj2' = 2, 'Maj1' = 3, 'Mod' = 4, 'Min2' = 5, 'Min1' = 6, 'Typ' = 7)
homes$Functional <- as.integer(plyr::revalue(homes$Functional, functions))
## The following `from` values were not present in `x`: Sal
table(homes$Functional)
##
## 1 2 3 4 5 6 7
## 2 9 19 35 70 64 2717
SaleCondition
homes$SaleCondition <- as.factor(homes$SaleCondition)
Condition1
homes$Condition1 <- as.factor(homes$Condition1)
Condition2
homes$Condition2 <- as.factor(homes$Condition2)
House Style
homes$HouseStyle <- as.factor(homes$HouseStyle)
Kitchen Features
homes$KitchenQual <- as.integer(plyr::revalue(homes$KitchenQual, qualities))
## The following `from` values were not present in `x`: None, Po
table(homes$KitchenQual)
##
## 2 3 4 5
## 70 1492 1150 205
Lot Features
MSSubClass
homes$MSSubClass <- as.factor(homes$MSSubClass)
LotShape
shape <- c('IR3' = 0, 'IR2' = 1, 'IR1' = 2, 'Reg' = 3)
homes$LotShape <- as.integer(plyr::revalue(homes$LotShape, shape))
table(homes$LotShape)
##
## 0 1 2 3
## 16 76 967 1859
LandSlope
slope <- c('Sev' = 0, 'Mod' = 1, 'Gtl' = 2)
homes$LandSlope <- as.integer(plyr::revalue(homes$LandSlope, slope))
table(homes$LandSlope)
##
## 0 1 2
## 16 124 2778
Neighborhood
homes$Neighborhood <- as.factor(homes$Neighborhood)
MSZoning
homes$MSZoning <- as.factor(homes$MSZoning)
BldgType
homes$BldgType <- as.factor(homes$BldgType )
LotConfig
homes$LotConfig <- as.factor(homes$LotConfig )
Foundation
homes$Foundation <- as.factor(homes$Foundation )
LandContour
homes$LandContour <- as.factor(homes$LandContour )
Street
homes$Street <- as.factor(homes$Street )
PavedDrive
homes$PavedDrive <- as.factor(homes$PavedDrive )
Alley
homes$Alley <- as.factor(homes$Alley )
Roof Features
RoofMatl
homes$RoofMatl <- as.factor(homes$RoofMatl )
RoofStyle
homes$RoofStyle <- as.factor(homes$RoofStyle)
Pool Features
homes$PoolQC <- as.integer(plyr::revalue(homes$PoolQC, qualities))
## The following `from` values were not present in `x`: Po, TA
table(homes$PoolQC)
##
## 0 2 4 5
## 2905 2 4 4
Utility Features
HeatingQC
homes$HeatingQC <- as.integer(plyr::revalue(homes$HeatingQC, qualities))
## The following `from` values were not present in `x`: None
table(homes$HeatingQC)
##
## 1 2 3 4 5
## 3 92 857 474 1492
Utilities
homes$Utilities <- as.factor(homes$Utilities)
CentralAir
homes$CentralAir <- as.factor(homes$CentralAir)
Electrical
homes$Electrical <- as.factor(homes$Electrical)
Heating
homes$Heating <- as.factor(homes$Heating)
Other Features
homes$FireplaceQu <- as.integer(plyr::revalue(homes$FireplaceQu, qualities))
table(homes$FireplaceQu)
##
## 0 1 2 3 4 5
## 1420 46 74 592 743 43
Fence
homes$Fence <- as.factor(homes$Fence)
MiscFeature
homes$MiscFeature <- as.factor(homes$MiscFeature )
Date Features
YearBuilt
homes$YearBuilt <- as.factor(homes$YearBuilt)
YearRemodAdd
homes$YearRemodAdd <- as.factor(homes$YearRemodAdd)
YrSold
homes$YrSold <- as.factor(homes$YrSold)
MoSold
homes$MoSold <- as.factor(homes$MoSold)
Feature Engineering
Age Feature
I want to add a feature that tells us how long it has been since the house was updated, if ever. I will subtract the year remodeled from the year sold. The year remodeled column defaults to the year built if there was no remodel.
homes <- homes %>%
mutate(Age = as.integer(as.character(YrSold)) - as.integer(as.character(YearRemodAdd)))
ggplot(homes %>% filter(dataset == 'train')) +
aes(x = Age, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Age Since Last Update vs Sale Price', y = 'Sale Price', x= 'Age') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
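The as.character() wrapper in the Age calculation matters because YrSold and YearRemodAdd were converted to factors earlier. Calling as.integer() directly on a factor returns the internal level codes rather than the years, as this small sketch with made-up values shows:

```r
# Hypothetical values standing in for the factor-encoded year columns
yr_sold  <- factor(c(2008, 2006, 2010))
yr_remod <- factor(c(1995, 2006, 1950))

# Level codes, not years
codes <- as.integer(yr_sold)

# Convert through character to recover the actual years
years <- as.integer(as.character(yr_sold))

age <- years - as.integer(as.character(yr_remod))
age
```

Here `codes` is 2, 1, 3 while `years` is 2008, 2006, 2010, so skipping as.character() would silently produce meaningless ages.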
Proximity to Positive, Normal, and Negative Conditions
The Condition1 column tells you if a home is close to certain conditions, but doesn’t say if the conditions are positive or negative. This feature helps to clarify that.
homes <- homes %>% mutate(Pos_Neg_Conditions = case_when( Condition1 == 'Norm' ~ 'Normal',
Condition1 %in% c('PosN', 'PosA') ~ 'Positive',
TRUE ~ 'Negative') )
condit <- c('Negative' = 0, 'Normal' = 1, 'Positive' = 2)
homes$Pos_Neg_Conditions <- as.integer(plyr::revalue(homes$Pos_Neg_Conditions, condit))
ggplot(homes %>% filter(dataset == 'train')) +
aes(x = as.factor(Pos_Neg_Conditions), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Positive and Negative Conditions v Sale Price', y = 'Sale Price', x= 'Type') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
Total Square Feet Feature
There is not a total square feet feature; it is broken into above ground and basement. Let’s add the above-ground living area and the total basement area together to get total square footage.
homes <- homes %>%
mutate(TotalSF = GrLivArea + TotalBsmtSF)
ggplot(homes %>% filter(dataset == 'train')) +
aes(x = TotalSF, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Total Square Feet vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
Total Bathrooms
We have the count of bathrooms for basement as well as above ground. This feature will add them together.
homes <- homes %>%
mutate(TotalBathrooms = FullBath + (HalfBath *0.5) + BsmtFullBath + (BsmtHalfBath * 0.5))
ggplot(homes %>% filter(dataset == 'train')) +
aes(x = as.factor(TotalBathrooms), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Total Bathrooms v Sale Price', y = 'Sale Price', x= '#') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
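One thing to keep in mind with summed features like this: the null summary earlier shows 2 NAs each in BsmtFullBath and BsmtHalfBath, and an NA in any term makes the whole sum NA. A small sketch on made-up data (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: the second house has missing basement bath counts
toy <- tibble(
  FullBath     = c(2, 1),
  HalfBath     = c(1, 0),
  BsmtFullBath = c(1, NA),
  BsmtHalfBath = c(0, NA)
)

toy <- toy %>%
  mutate(TotalBathrooms = FullBath + (HalfBath * 0.5) + BsmtFullBath + (BsmtHalfBath * 0.5))

toy$TotalBathrooms
```

The first house gets 3.5 bathrooms and the second gets NA. In this analysis those few rows are presumably filled by the KNN imputation step in the recipe, since it is applied to all predictors.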
Garage Age Built Category
homes <- homes %>% mutate(garage_age_category = ifelse(as.numeric(as.character(GarageYrBlt)) < 1948, 'Old',
ifelse(as.numeric(as.character(GarageYrBlt)) < 1986, 'Average', 'New')))
age <- c('Old' = 0, 'Average' = 1, 'New' = 2)
homes$garage_age_category <- as.integer(plyr::revalue(homes$garage_age_category, age))
ggplot(homes %>% filter(dataset == 'train')) +
aes(x = reorder(as.factor(garage_age_category), SalePrice), y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Garage Age Category v Sale Price', y = 'Sale Price', x= 'Category') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
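The nested ifelse() above bins garage years at 1948 and 1986. As a sketch of an equivalent alternative for whole-number years, base R’s cut() expresses the same bins in one call. The `garage_year` values below are made up to sit on both sides of each cutoff.

```r
# Hypothetical garage years around the two cutoffs
garage_year <- c(1930, 1947, 1948, 1985, 1986, 2005)

# Intervals are closed on the right: (-Inf, 1947], (1947, 1985], (1985, Inf)
category <- cut(garage_year,
                breaks = c(-Inf, 1947, 1985, Inf),
                labels = c("Old", "Average", "New"))

# Factor codes are 1-based, so subtract 1 to get the 0/1/2 encoding used above
garage_age_category <- as.integer(category) - 1L
garage_age_category
```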
Neighborhood Consolidation
As described in the analysis section, we’ll bin some of the neighborhoods into subgroups.
homes <- homes %>% mutate( Neighborhood_class = case_when(
Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale') ~ 'Poor',
Neighborhood %in% c('BrkSide', 'Edwards', 'OldTown', 'Sawyer', 'Blueste', 'SWISU', 'NPkVill', 'NAmes', 'Mitchel') ~ 'Lower-Middle',
Neighborhood %in% c('SawyerW', 'NWAmes', 'Gilbert', 'Blmngtn', 'CollgCr', 'Crawfor', 'ClearCr', 'Somerst', 'Veenker', 'Timber') ~ 'Middle',
TRUE ~ 'Rich'))
class <- c('Poor' = 0, 'Lower-Middle' = 1, 'Middle' = 2, 'Rich' = 3)
homes$Neighborhood_class <- as.integer(plyr::revalue(homes$Neighborhood_class, class))
See the plot in the analysis section to see the change.
Total Porch Area
We have quite a few porch area features but don’t have a total porch feature. If you recall from our analysis, the porch area didn’t really affect the sale price, so I’m not expecting big things from this new feature. We’ll build this now:
homes <- homes %>%
mutate(TotalPorch = WoodDeckSF + OpenPorchSF + EnclosedPorch + `3SsnPorch` + ScreenPorch)
ggplot(homes %>% filter(dataset == 'train')) +
aes(x = TotalPorch, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
# geom_point(data = live_area_outliers, aes(x = GrLivArea,y=SalePrice), color = "red3", shape = 'triangle', size = 2.5) +
geom_smooth(method = lm, se = FALSE, alpha = 0.5, size= 0.1, color = "grey21") +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Total Porch Area vs Sale Price', y = 'Sale Price', x= 'Square Feet') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
plot.subtitle = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
New House Indicator
I want an indicator to flag if the house was new in the year it was sold. Generally new houses sell for more than houses that have been lived in.
homes <- homes %>%
mutate(NewHouse = ifelse(as.numeric(as.character(YearBuilt)) == as.numeric(as.character(YrSold)), 'Yes', 'No'))
homes$NewHouse <- as.factor(homes$NewHouse)
ggplot(homes %>% filter(dataset == 'train')) +
aes(x = NewHouse, y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'New House v Sale Price', y = 'Sale Price', x= 'Yes/No') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
Remodeled Flag
The column we have that says if the home was remodeled is not actually a flag because it defaults to the year it was built if it was never remodeled. We need to create a flag that shows if the home has been remodeled or not.
homes <- homes %>%
mutate(remodeled = ifelse(as.numeric(as.character(YearRemodAdd)) == as.numeric(as.character(YearBuilt)), 'No', 'Yes'))
homes$remodeled <- as.factor(homes$remodeled)
ggplot(homes %>% filter(dataset == 'train')) +
aes(x = remodeled, y = SalePrice) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
labs(title = 'Remodeled v Sale Price', y = 'Sale Price', x= 'Yes/No') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.x= element_blank(),
axis.text.x = element_text(angle = 90),
axis.ticks.x = element_line(color = "grey")
)
It actually appears that remodeled homes often have a lower selling price. This is likely because homes that have never been remodeled tend to be newer homes.
Preprocessing with Recipes and Building our Model
Let’s split our data back out into our train and test sets:
train <- homes %>% filter(dataset == 'train')
test <- homes %>% filter(dataset == 'test')
In order to evaluate our model before submitting, we’ll need to further split our training data into a train/test split:
set.seed(123)
train_split <- initial_split(train, prop = 0.80)
reg_train <- training(train_split)
reg_test <- testing(train_split)
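As a quick illustration of what the split objects contain, the sketch below runs the same three calls on a made-up data frame of 100 rows (`toy` is hypothetical) and checks that the two pieces are disjoint and together cover every row.

```r
library(rsample)

# Hypothetical toy data with an Id column so rows can be tracked
toy <- data.frame(Id = 1:100, SalePrice = seq(100000, 298000, by = 2000))

set.seed(123)
toy_split <- initial_split(toy, prop = 0.80)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Roughly 80% / 20% of the rows
c(train = nrow(toy_train), test = nrow(toy_test))

# No row appears in both pieces
length(intersect(toy_train$Id, toy_test$Id))
```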
Next, we’ll initialize our model:
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
Now that we have our data split out again, let’s move forward with our preprocessing of the data. Using the tidymodels framework, this will be extremely easy and will also be very readable. We will do the following:
- remove the ‘dataset’ column
- create dummy variables
- apply a near-zero variance filter (this helps when there are factor levels in the testing data that are not in the training data)
- normalize the data
- center the data
- scale the data
- filter out any features with correlations with each other higher than 0.90
- impute null values using KNN algorithm using 3 nearest neighbors
- remove any columns that are linear combinations of each other
- log transform our response variable
reg_recipe <-
recipe(SalePrice ~ ., data = reg_train) %>%
update_role(Id, new_role = "ID") %>%
step_rm(dataset, WoodDeckSF , OpenPorchSF , EnclosedPorch , `3SsnPorch` , ScreenPorch , Neighborhood, Functional, LowQualFinSF, BsmtFinSF2, BsmtHalfBath, BsmtFinType2, KitchenAbvGr, LotConfig, PoolArea, MiscVal) %>%
step_unknown(all_predictors(), -all_numeric()) %>%
step_dummy(all_nominal()) %>%
step_nzv(all_predictors()) %>%
step_normalize(all_numeric(), -Id, -all_outcomes()) %>%
step_center(all_numeric(), -Id, -all_outcomes()) %>%
step_scale(all_numeric(),-Id, -all_outcomes()) %>%
step_corr(all_numeric(), -Id, threshold = 0.90) %>%
step_knnimpute(all_predictors(), neighbors = 3) %>%
step_lincomb(all_numeric(), -Id, -all_outcomes()) %>%
step_log(all_outcomes(), base = 10) #, skip = TRUE)
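A note on the normalize, center, and scale steps in the recipe above: step_normalize() already centers each numeric column to mean 0 and scales it to standard deviation 1, so the step_center() and step_scale() calls that follow should leave the values essentially unchanged. The base R sketch below illustrates the idea on made-up numbers; it mimics the arithmetic and does not call recipes itself.

```r
set.seed(1)

# Hypothetical numeric predictor
x <- rnorm(50, mean = 180000, sd = 75000)

# First pass: center and scale (what normalizing does)
z1 <- (x - mean(x)) / sd(x)

# Second pass: center and scale the already-normalized values
z2 <- (z1 - mean(z1)) / sd(z1)

# The second pass changes nothing beyond floating-point noise
all.equal(z1, z2)
```

The extra steps are harmless, but dropping them would make the recipe shorter without changing the fit.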
Next, we’ll build our workflow and add our model and our recipe:
reg_workflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(reg_recipe)
Now we’ll fit our model on the training portion of our train/test split and evaluate it on the held-out portion:
reg_fit <- reg_workflow %>%
last_fit(split = train_split)
Now that our model is fit, we can look at our model metrics:
reg_fit %>% collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0543
## 2 rsq standard 0.908
Our root mean squared error (RMSE) is ~0.0543 and our \({ R }^{ 2 }\) value is ~0.90798. What does this mean? The RMSE is hard to interpret directly because we log transformed our response variable SalePrice. What we can say is that our predictions are typically off by about 0.054 in log10(SalePrice) units. I am far more interested in the \({ R }^{ 2 }\) value, which tells us how much of the variation in the log-transformed SalePrice our model explains. It looks like our model explains about 90% of that variation, which is very good (at least on the 20% of the training data we held out)!
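To make the log-scale RMSE a little more tangible, it can be converted into a multiplicative error on the original price scale. The arithmetic below is only a rough reading of the metric, not an exact error bound. The conversion to natural-log units is included because, as far as I understand it, the Kaggle leaderboard for this competition scores RMSE on the natural log of price.

```r
rmse_log10 <- 0.0543

# An error of 0.0543 in log10 units is a factor of 10^0.0543 in price
ratio <- 10^rmse_log10          # about 1.133
pct   <- (ratio - 1) * 100      # about 13% above (or a similar share below) the actual price

# The same error expressed in natural-log units
rmse_ln <- rmse_log10 * log(10) # about 0.125

c(ratio = ratio, pct = pct, rmse_ln = rmse_ln)
```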
preds <- reg_fit %>% collect_predictions()
ggplot(data = preds) +
aes(x = .pred, y = SalePrice) +
geom_point(color = 'steelblue', alpha = 0.35) +
geom_abline(intercept = 0, slope = 1, color = 'firebrick') +
labs(title = 'R-Squared Plot - Predicting SalePrice',
x = 'Predicted Sale Price',
y = 'Actual Sale Price') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
The above \({ R }^{ 2 }\) plot is a good representation of the fit of our model. The straight line is a visual representation of a model that predicted each SalePrice perfectly. The points are the actual model predictions. As you can see, in general, our points fall pretty close to the line, which means our model is fitting pretty well. Now that we’ve seen that our model looks to be working, let’s use this same approach to create predictions for our test dataset that we can submit to the Kaggle contest:
final_lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
final_recipe <-
recipe(SalePrice ~ ., data = train) %>%
update_role(Id, new_role = "ID") %>%
step_rm(dataset, WoodDeckSF , OpenPorchSF , EnclosedPorch , `3SsnPorch` , ScreenPorch , Neighborhood, Functional, LowQualFinSF, BsmtFinSF2, BsmtHalfBath, BsmtFinType2, KitchenAbvGr, LotConfig, PoolArea, MiscVal) %>%
step_unknown(all_predictors(), -all_numeric()) %>%
step_dummy(all_nominal()) %>%
step_nzv(all_predictors()) %>%
step_normalize(all_numeric(), -Id, -all_outcomes()) %>%
step_center(all_numeric(), -Id, -all_outcomes()) %>%
step_scale(all_numeric(),-Id, -all_outcomes()) %>%
step_corr(all_numeric(), -Id, threshold = 0.90) %>%
step_knnimpute(all_predictors(), neighbors = 3) %>%
step_lincomb(all_numeric(), -Id, -all_outcomes()) %>%
step_log(all_outcomes(), base = 10 , skip = TRUE)
final_workflow <- workflow() %>%
add_model(final_lm_model) %>%
add_recipe(final_recipe)
final_fit <- final_workflow %>%
fit(data = train)
predictions <- predict(final_fit, test) %>%
bind_cols(test %>% select(Id)) %>%
mutate(SalePrice = round(10^.pred,0)) %>%
select(Id, SalePrice)
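The `10^.pred` step above is needed because the model was trained on log10(SalePrice), so its predictions come back in log10 units. Here is a minimal sketch of the same back-transformation using base R’s lm() and the built-in mtcars data as a stand-in (nothing here comes from the housing data):

```r
# Fit a model on a log10-transformed outcome
fit <- lm(log10(mpg) ~ wt, data = mtcars)

# Hypothetical new observations to predict
new_cars <- data.frame(wt = c(2.5, 3.5))

# Predictions are on the log10 scale
pred_log10 <- predict(fit, newdata = new_cars)

# Undo the transformation to get back to the original units
pred_mpg <- round(10^pred_log10, 1)
pred_mpg
```

In the recipe, `skip = TRUE` on step_log() keeps the log step from being applied at prediction time, which matters because the test set has no SalePrice values to transform.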
Here’s what our final output looks like with the SalePrice column being our predictions:
head(predictions)
## # A tibble: 6 x 2
## Id SalePrice
## <dbl> <dbl>
## 1 1461 109376
## 2 1462 153318
## 3 1463 170800
## 4 1464 202566
## 5 1465 194396
## 6 1466 172461
Now, let’s export our predictions and submit to Kaggle:
readr::write_delim(predictions, 'C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/submission2.csv', delim = ',')
Having submitted my CSV file to Kaggle, I got a score of 0.13895.
While this is nowhere close to the top of the leaderboard, I imagine it is a fairly good score for only using multiple regression. I’m sure I could do a little tuning to improve my score slightly, but to see big improvements, I’d definitely need to switch to another regression method such as XGBoost. Overall, I am very pleased with the performance of my model.
Kaggle Username: christianthieme
Learnings and Next Steps
This was an amazing project to work through. It took me a TON of time to work through all the variables, but I got to know the data and variables very well and was able to come up with some good new features as a result that I think helped in my model performance.
After I’d already encoded all of my variables by hand, I learned I could have converted all my ordinal factors to numeric scores using step_ordinalscore(), and converted all my character/string data to factors using step_string2factor() from recipes. That would have saved me a ton of time. Next time! Overall, I’m growing more comfortable with the tidymodels framework and realizing how simple it makes difficult tasks.
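For future reference, here is a minimal sketch of those two recipe steps on a made-up data frame (`toy` is hypothetical). It assumes the ordinal column is stored as an ordered factor, which step_ordinalscore() requires. Note that the resulting scores are the 1-based level positions, so they are not identical to the 0-based hand encoding used in this post.

```r
library(recipes)

# Hypothetical toy data: one ordered quality rating and one plain string column
toy <- data.frame(
  KitchenQual = factor(c("TA", "Gd", "Ex", "Fa"),
                       levels = c("Po", "Fa", "TA", "Gd", "Ex"),
                       ordered = TRUE),
  Street      = c("Pave", "Grvl", "Pave", "Pave"),
  SalePrice   = c(150000, 200000, 310000, 120000),
  stringsAsFactors = FALSE
)

rec <- recipe(SalePrice ~ ., data = toy) %>%
  step_string2factor(Street) %>%     # character -> factor
  step_ordinalscore(KitchenQual)     # ordered factor -> numeric score

baked <- bake(prep(rec, training = toy), new_data = toy)
baked
```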
My next step for this project will be to run this data through XGBoost and see what improvements we can see!