Setup
# Load Libraries
library('MASS')
library('ggplot2') # library to create plots
library('dplyr') # data manipulation
library('knitr') # required to apply knitr options
library('statsr') # statistics functions
library('R.utils') # support functions used in the analysis
library('GGally') # ggplot2 extension (e.g. plot matrices)
library('gridExtra') # arrange plots
library('BAS') # Bayesian statistics functions
library('labeling') # dependency package for ggplot2
library('tidyr') # data tidying
# apply general knitr options
knitr::opts_chunk$set(comment=NA, fig.align='center')
# load data file
load('ames_train.Rdata')
data <- ames_train
John Eugene Driscoll | Module 5 - Data Analysis Project | Nov 2017 Submission
Introduction
The purpose of this document is to complete the Peer-graded Assignment: EDA and Basic Model Selection project required during week 4 of the Statistics with R Capstone course by Duke University (Coursera).
The background context regarding the assignment can be found at: https://www.coursera.org/learn/statistics-project/peer/XkGx1/eda-and-basic-model-selection
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
data$age <- 2017 - data$Year.Built # age in years as of 2017
# create plot
ggplot(data = data, aes(x = age, y = ..density..)) +
  geom_histogram(bins = 30) +
  geom_density(size = 1) +
  labs(title = "Ages of houses in years", x = 'Age [years]', y = "Density") +
  geom_vline(aes(xintercept = mean(age)), size = 1.5) +
  geom_vline(aes(xintercept = median(age)), size = 1.5, linetype = 'dashed') +
  geom_text(aes(x = mean(age), y = .015, label = 'mean'), size = 3) +
  geom_text(aes(x = median(age), y = .020, label = 'median'), size = 3) +
  guides(color = FALSE, size = FALSE)
# find summary stats
summary.age <- data %>% summarise(mean_age = mean(age),
median_age = median(age),
sd_age = sd(age),
min_age = min(age),
max_age = max(age),
IQR_age = IQR(age),
total = n())
The distribution of the calculated age of homes in the sample is multimodal and right-skewed (which makes sense, as homes cannot be less than 0 years old).
Summary statistics are presented in the table below.
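As a sketch (equivalent to the table below), the same summary could be rendered directly from summary.age by reshaping the one-row summary into statistic/value pairs with tidyr and printing it with kable:
# sketch: render summary.age as a two-column table
summary.age %>%
  gather(key = 'Statistic', value = 'Value') %>%
  kable(caption = 'Age summary statistics')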
Statistic | Value |
---|---|
mean | 44.797 |
median | 42 |
sd | 29.63741 |
min | 7 |
max | 145 |
IQR | 46 |
# items | 1000 |
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# extract summary parameter
price.neighborhood <- data %>% dplyr::select(price, Neighborhood)
price.neighborhood.summary <- price.neighborhood %>% group_by(Neighborhood) %>% summarise(mean_price = mean(price),
median_price = median(price),
min_price = min(price),
max_price = max(price),
IQR_price = IQR(price),
sd_price = sd(price),
var_price = var(price),
total = n())
# find the required information in the summary data
ExpNeighborhood <- price.neighborhood.summary[which.max(price.neighborhood.summary$median_price), ]
LeastExpNeighborhood <- price.neighborhood.summary[which.min(price.neighborhood.summary$median_price), ]
HetNeighborhood <- price.neighborhood.summary[which.max(price.neighborhood.summary$sd_price), ]
# create box plot
ggplot(data, aes(x = Neighborhood, y = price / 1000)) +
  geom_boxplot() +
  labs(title = "Housing prices per Neighborhood", x = 'Neighborhood', y = "Price in $1000s") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
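Optionally (a sketch, not part of the original plot), ordering the neighborhoods by median price makes the most and least expensive ones easier to read off the boxplot:
# same boxplot, with neighborhoods reordered by median price
ggplot(data, aes(x = reorder(Neighborhood, price, FUN = median), y = price / 1000)) +
  geom_boxplot() +
  labs(title = "Housing prices per Neighborhood (ordered by median)",
       x = 'Neighborhood', y = "Price in $1000s") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))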
The boxplot shows the price distribution per neighborhood. The most expensive neighborhood is StoneBr, determined by finding the maximum median price across all neighborhoods, while the least expensive neighborhood is MeadowV, determined by finding the minimum median price. The most heterogeneous neighborhood, measured by the standard deviation of housing prices, is StoneBr.
kable(ExpNeighborhood[1:9], caption = '<b>Most expensive Neighborhood</b>')
Neighborhood | mean_price | median_price | min_price | max_price | IQR_price | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 15242150036 | 20 |
kable(LeastExpNeighborhood[1:9], caption = '<b>Least expensive Neighborhood</b>')
Neighborhood | mean_price | median_price | min_price | max_price | IQR_price | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
MeadowV | 92946.88 | 85750 | 73000 | 129500 | 20150 | 18939.78 | 358715156 | 16 |
kable(HetNeighborhood[1:9], caption = '<b>Most heterogeneous Neighborhood</b>')
Neighborhood | mean_price | median_price | min_price | max_price | IQR_price | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 15242150036 | 20 |
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# find the variable with the highest NA count
na_count <- sapply(data, function(x) sum(is.na(x)))
df.na_count <- data.frame(na_count)
df.merged <- cbind(c(row.names(df.na_count)), df.na_count[,1])
colnames(df.merged) <- c('feature', 'No_NA')
df.merged[which(df.na_count == max(df.na_count)),]
feature No_NA
"Pool.QC" "997"
Pool Quality (Pool.QC) has the highest number of NAs. Per the codebook, an NA in this feature means a house does NOT have a pool.
As Iowa has harsh winters and four changing seasons, it is reasonable to expect that many homes in a given Iowa city do not have a pool.
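For added context, a minimal sketch listing the features with the most missing values, using the na_count vector computed above:
# top 5 features by NA count
head(sort(na_count, decreasing = TRUE), 5)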
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
Model selection
Adjusted R2 describes the strength of a model and is a useful tool for evaluating which predictors add value to the model.
Backward elimination was applied to select an appropriate model, with adjusted R2 as the criterion: we start with the full model (containing all predictors) and drop one predictor at a time until the adjusted R2 value is maximized.
The full model uses all candidate features:
log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr
The analysis shows that the full model gives the highest adjusted R2 value.
# Call out model features
full.model <- data %>% dplyr::select( Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, price)
# remove NAs
full.model <- full.model[complete.cases(full.model), ]
formula <- as.formula(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
lm.houses <- lm(formula, full.model)
summary(lm.houses)
Call:
lm(formula = formula, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.0878 -0.1651 -0.0211 0.1657 0.9945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.279 on 993 degrees of freedom
Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
The summary of our model indicates that the adjusted R2 value is 0.5598 and the p-value is < 2.2e-16.
Removing the (still significant) feature Land.Slope and running the analysis again would lead to a slightly smaller adjusted R2 value of 0.5526. As the change in adjusted R2 is immaterial, we opt to keep the model with all features.
Below is an excerpt of the candidate models, giving an indication of how the adjusted R_Squared changes with backward elimination.
formula1 <- as.formula(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
formula2 <- as.formula(log(price) ~ Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
formula3 <- as.formula(log(price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
formula4 <- as.formula(log(price) ~ Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr)
lm.houses1 <- lm(formula1, full.model)
lm.houses2 <- lm(formula2, full.model)
lm.houses3 <- lm(formula3, full.model)
lm.houses4 <- lm(formula4, full.model)
m1 <- summary(lm.houses1)$adj.r.squared
m2 <- summary(lm.houses2)$adj.r.squared
m3 <- summary(lm.houses3)$adj.r.squared
m4 <- summary(lm.houses4)$adj.r.squared
R_Squared <- rbind(m1, m2, m3, m4)
model <- c('lm.houses1', 'lm.houses2', 'lm.houses3', 'lm.houses4')
df <- data.frame(cbind(model, R_Squared))
colnames(df) <- c('Model', 'R_Squared')
kable(df[1:2], caption = 'Model - R_Squared')
 | Model | R_Squared |
---|---|---|
m1 | lm.houses1 | 0.559834474177341 |
m2 | lm.houses2 | 0.522009031037836 |
m3 | lm.houses3 | 0.552606180049213 |
m4 | lm.houses4 | 0.447366335111107 |
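For reference, a minimal sketch of how this drop-one-at-a-time comparison could be automated (it assumes full.model and lm.houses as defined above; each entry is the adjusted R2 of the model obtained by removing that predictor):
# one automated backward-elimination pass on adjusted R-squared
predictors <- c('Lot.Area', 'Land.Slope', 'Year.Built', 'Year.Remod.Add', 'Bedroom.AbvGr')
adj.r2 <- sapply(predictors, function(p) {
  f <- as.formula(paste('log(price) ~', paste(setdiff(predictors, p), collapse = ' + ')))
  summary(lm(f, data = full.model))$adj.r.squared
})
# compare against the full model; if no reduced model improves on it, stop
c(full_model = summary(lm.houses)$adj.r.squared, adj.r2)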
In order to get comfortable with our model, let us check the conditions for regression.
-We first examine the numerical variables included in the model by looking at the distribution of the residuals against log(price).
ggplot(data = NULL, aes(x = log(full.model$price), y = lm.houses$residuals)) +
geom_point() + geom_hline(yintercept = 0, linetype = 'dashed') +
ylab('Residuals') + xlab('log(Price)')
The residuals are scattered randomly around 0.
-Check whether the residuals display a nearly normal distribution centered around 0.
par(mfrow = c(1,2))
hist(lm.houses$residuals)
qqnorm(lm.houses$residuals)
qqline(lm.houses$residuals)
The histogram of the residuals shows a nearly normal distribution centered at 0 that is slightly left-skewed. The Q-Q plot also indicates some skewness in the tails, but there are no major deviations. We can conclude that the conditions for this model are reasonable.
ggplot(data = NULL, aes(x = lm.houses$fitted, y = lm.houses$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed', color = 'red') +
  labs(x = 'Fitted values', y = 'Residuals')
The results show that the residuals are equally variable for low and high predicted values, i.e., the residuals have constant variability.
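As an optional supplement (a sketch, not part of the original analysis), a scale-location style plot can make departures from constant variance easier to see; a roughly flat trend supports the conclusion above:
# scale-location style check: sqrt(|residuals|) vs. fitted values
ggplot(data = NULL, aes(x = lm.houses$fitted, y = sqrt(abs(lm.houses$residuals)))) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = 'Fitted values', y = 'sqrt(|Residuals|)')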
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# summary function
summary(lm.houses)
Call:
lm(formula = formula, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.0878 -0.1651 -0.0211 0.1657 0.9945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.279 on 993 degrees of freedom
Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
# find the house with the largest absolute residual
which(abs(resid(lm.houses)) == max(abs(resid(lm.houses))))
428
428
# show data row
as.data.frame(full.model[428, ])
Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr price
1 9656 Gtl 1923 1970 2 12789
The query indicates that house number 428 has the largest squared residual.
full.model.pred <- full.model
full.model.pred <- as.data.frame(full.model.pred)
full.model.pred$predicted <- exp(predict(lm.houses))
full.model.pred$residuals <- residuals(lm.houses)
full.model.pred[428, ]
Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr price
428 9656 Gtl 1923 1970 2 12789
predicted residuals
428 103176.2 -2.087853
House number 428 stands out because its actual price is $12,789 while our predicted price was $103,176. This is a significant delta.
Looking at the complete record for this house in our dataset, we can see that it is generally of fair or poor quality across most of the categorical data points. Specifically, Overall.Qual and Overall.Cond are both poor.
The below-average quality of the home may explain the delta between the predicted and actual price.
Adding some of the aforementioned variables (not included in the full model) to a new model may better predict the sale price of the home.
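A minimal sketch of the lookup behind this observation; it assumes the row indices of full.model still match those of data, which holds here because complete.cases() removed no rows (993 residual degrees of freedom + 7 coefficients = 1000 observations):
# inspect the quality-related fields of house 428 in the original data
data[428, c('Overall.Qual', 'Overall.Cond', 'price')]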
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
full.model <- data %>% dplyr::select( Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, price)
# make sure no NA values are in the dataset
full.model <- full.model[complete.cases(full.model), ]
formula <- as.formula(log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
lm.houses.log <- lm(formula, full.model)
summary(lm.houses.log)
Call:
lm(formula = formula, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.14050 -0.15650 -0.01561 0.15350 0.90854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.553e+01 8.197e-01 -18.947 < 2e-16 ***
log(Lot.Area) 2.442e-01 1.708e-02 14.297 < 2e-16 ***
Land.SlopeMod 1.151e-01 4.734e-02 2.431 0.0152 *
Land.SlopeSev -6.554e-02 1.222e-01 -0.536 0.5917
Year.Built 5.981e-03 3.597e-04 16.628 < 2e-16 ***
Year.Remod.Add 6.734e-03 5.190e-04 12.975 < 2e-16 ***
Bedroom.AbvGr 5.909e-02 1.055e-02 5.599 2.8e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2649 on 993 degrees of freedom
Multiple R-squared: 0.6056, Adjusted R-squared: 0.6032
F-statistic: 254.1 on 6 and 993 DF, p-value: < 2.2e-16
Removing the 'insignificant' feature Land.Slope and running the analysis again would lead to a slightly smaller adjusted R2 value of 0.6015. As the change in adjusted R2 is immaterial, we opt to keep the model with all features.
Besides the introduction of the log transformations, the model in Question 6 contains the same set of predictors as the model in Question 4. For comparison, the reduced model (without Land.Slope) is:
formula: log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr
formula.Q6_2 <- as.formula(log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
lm.houses.red <- lm(formula.Q6_2, full.model)
summary(lm.houses.red)
Call:
lm(formula = formula.Q6_2, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.14609 -0.15825 -0.01477 0.15354 1.01578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
log(Lot.Area) 2.471e-01 1.654e-02 14.935 < 2e-16 ***
Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2655 on 995 degrees of freedom
Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
Adjusted R2
Model type | Adj. R-squared |
---|---|
Full model | 0.6032 |
Reduced model | 0.6015 |
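Optionally (a sketch, not part of the adjusted-R2-based selection used above), a nested-model F-test could be used to check whether Land.Slope adds explanatory power on top of the reduced model:
# compare reduced (without Land.Slope) vs. full (with Land.Slope) log models
anova(lm.houses.red, lm.houses.log)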
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
full.model <- data %>% dplyr::select( Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, price)
# make sure no NA values are in the dataset
full.model <- full.model[complete.cases(full.model), ]
# prepare formulas with log transformation of Lot.Area and without
formula.nolog <- as.formula(log(price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
formula.log <- as.formula(log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
# run regression on log version
lm.houses.log <- lm(formula.log, full.model)
summary(lm.houses.log)
Call:
lm(formula = formula.log, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.14609 -0.15825 -0.01477 0.15354 1.01578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
log(Lot.Area) 2.471e-01 1.654e-02 14.935 < 2e-16 ***
Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2655 on 995 degrees of freedom
Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
# run regression on nolog version
lm.houses.nolog <- lm(formula.nolog, full.model)
summary(lm.houses.nolog)
Call:
lm(formula = formula.nolog, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.09102 -0.16488 -0.02135 0.16605 1.13359
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.385e+01 8.633e-01 -16.049 < 2e-16 ***
Lot.Area 8.697e-06 9.168e-07 9.486 < 2e-16 ***
Year.Built 6.019e-03 3.819e-04 15.761 < 2e-16 ***
Year.Remod.Add 6.889e-03 5.505e-04 12.512 < 2e-16 ***
Bedroom.AbvGr 8.704e-02 1.082e-02 8.047 2.4e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2813 on 995 degrees of freedom
Multiple R-squared: 0.5544, Adjusted R-squared: 0.5526
F-statistic: 309.5 on 4 and 995 DF, p-value: < 2.2e-16
Adjusted R2
Model type | Adj. R-squared |
---|---|
Log model | 0.6015 |
No Log model | 0.5526 |
Observations:
The log model shows a considerably larger adjusted R2 than the no-log model.
Looking at the p-values, both models are statistically significant.
The visualization is below.
# retrieve true values from data set
true.values <- full.model[, 'price']
# prepare dataset to run test
data.test <- subset(full.model, select = -c(price) )
data.prediction <- data.test
data.pred.log <- predict(lm.houses.log, data.prediction, interval = "prediction", level = 0.95)
data.pred.log <- data.frame(data.pred.log)
pred.log <- exp(data.pred.log$fit)
data.pred.nolog <- predict(lm.houses.nolog, data.prediction, interval = "prediction", level = 0.95)
data.pred.nolog <- data.frame(data.pred.nolog)
pred.nolog <- exp(data.pred.nolog$fit)
# combine predictions and true values into data frames for plotting
pred.log <- data.frame(pred = pred.log, true.value = true.values)
pred.nolog <- data.frame(pred = pred.nolog, true.value = true.values)
p1 <- ggplot(pred.log, aes(x = pred, y = true.value)) +
  geom_point() +
  labs(title = "Model with log transformation", x = 'Predicted price', y = "Actual price") +
  geom_smooth(method = "lm", linetype = 'dashed', se = FALSE) +
  geom_abline(intercept = 0, slope = 1)
p2 <- ggplot(pred.nolog, aes(x = pred, y = true.value)) +
  geom_point() +
  labs(title = "Model without log transformation", x = 'Predicted price', y = "Actual price") +
  geom_smooth(method = "lm", linetype = 'dashed', se = FALSE) +
  geom_abline(intercept = 0, slope = 1)
grid.arrange(p1, p2)
Log model (Plot 1)
The residuals are more evenly distributed. As a result, the standard errors of our regression coefficients are more reliable, which in turn means that the reported statistical significance of our independent variables is more trustworthy than in the no-log model.
No Log model (Plot 2)
The residuals are heteroskedastic: the variance of the errors is not constant across levels of the dependent variable. As a result, the standard errors of our regression coefficients are unreliable and may be understated. In turn, this means that the statistical significance of our independent variables may be overstated; in other words, some of them may not be statistically significant. (An optional formal check is sketched below.)
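As an optional formal check of the heteroskedasticity claim (a sketch; it assumes the lmtest package is installed, which is not loaded in the setup above), a Breusch-Pagan test could be run on both fitted models:
library('lmtest')
bptest(lm.houses.log)    # model with log(Lot.Area)
bptest(lm.houses.nolog)  # model with untransformed Lot.Area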
Decision:
The log model is selected, based on the fact that its statistical significance should be more trustworthy than that of the no-log model and that it has a higher adjusted R2.
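One further quantitative comparison could be added (a sketch, not part of the original analysis): in-sample RMSE on the dollar scale for both models, using a simple exponentiation of the fitted log prices (no bias correction):
# root mean squared error of back-transformed fitted prices vs. actual prices
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
c(log_model   = rmse(full.model$price, exp(predict(lm.houses.log))),
  nolog_model = rmse(full.model$price, exp(predict(lm.houses.nolog))))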