First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(ggplot2)
library(dplyr)
library(knitr)
library(statsr)
library(GGally)
library(gridExtra)
library(BAS)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution
# Compute house age (as of 2019) from the construction year
ames_train$age <- 2019 - ames_train$Year.Built
ggplot(ames_train, aes(x = age, y = ..density..)) +
  geom_histogram(bins = 30, fill = 'blue', colour = 'black') +
  geom_density(size = 1, colour = 'red') +
  labs(title = 'Distribution of House Age', x = 'Age', y = 'Density') +
  geom_vline(xintercept = mean(ames_train$age), colour = 'green', size = 1) +   # mean age
  geom_vline(xintercept = median(ames_train$age), colour = 'brown', size = 1)   # median age
summary_age <- ames_train %>%
summarize(Mean_age = mean(age),
Median_age = median(age),
Sd_age = sd(age),
Q1 = quantile(age, 0.25),
Q3 = quantile(age, 0.75),
IQR = IQR(age),
Total = n())
summary_age
## # A tibble: 1 x 7
## Mean_age Median_age Sd_age Q1 Q3 IQR Total
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 46.8 44 29.6 18 64 46 1000
The distribution of `age` is right-skewed (the mean is 46.797, the median is 44) and multimodal, indicating high counts of houses at certain ages (such as around 50 and 90 years). The number of houses generally decreases as house age increases.
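To support the claim about multimodality, a quick check (a minimal sketch; this table is not part of the original output) is to list the most frequent ages in the data:
# Show the five most common house ages, i.e. the peaks visible in the histogram
head(sort(table(ames_train$age), decreasing = TRUE), 5)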
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# Perform summary statistics
q2 <- ames_train %>%
group_by(Neighborhood) %>%
summarise(mean_price = mean(price),
median_price = median(price),
min_price = min(price),
max_price = max(price),
IQR = IQR(price),
sd_price = sd(price),
var_price = var(price),
total = n())
# Determine the most expensive, least expensive and most heterogeneous neighborhoods
most_expensive <- q2[which(q2$median_price == max(q2$median_price)),]
least_expensive <- q2[which(q2$median_price == min(q2$median_price)),]
most_heter <- q2[which(q2$sd_price == max(q2$sd_price)),]
kable(most_expensive[1:9], caption = 'Most expensive neighborhood')
Neighborhood | mean_price | median_price | min_price | max_price | IQR | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 15242150036 | 20 |
kable(least_expensive[1:9], caption = 'Least expensive neighborhood')
Neighborhood | mean_price | median_price | min_price | max_price | IQR | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
MeadowV | 92946.88 | 85750 | 73000 | 129500 | 20150 | 18939.78 | 358715156 | 16 |
kable(most_heter[1:9], caption = 'Most heterogeneous neighborhood')
Neighborhood | mean_price | median_price | min_price | max_price | IQR | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 15242150036 | 20 |
# Plot
ggplot(ames_train, aes(x = Neighborhood, y = price/1000)) +
geom_boxplot(colour = 'black', fill = 'orange') +
labs(title = "Housing prices per Neighborhood", x = 'Neighborhood', y = "Price in [$ 1000]") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
According to the chart, StoneBr is the most expensive and most heterogeneous neighborhood, followed by NridgHt, while the least expensive neighborhood is MeadowV. We can take a closer look at the price distributions of StoneBr and NridgHt:
stone <- ames_train %>%
filter(Neighborhood == 'StoneBr')
p1 <- ggplot(stone, aes(x = price/1000, y = ..density..)) +
geom_histogram(bins = 30, colour = 'black', fill = 'brown') +
geom_density(size = 1, colour = 'blue') +
labs(title = 'Neighborhood - StoneBr: Price distribution', x = 'Price (in $1000)', y = 'Density')
nrid <- ames_train %>%
filter(Neighborhood == 'NridgHt')
p2 <- ggplot(nrid, aes(x = price/1000, y = ..density..)) +
geom_histogram(bins = 30, colour = 'black', fill = 'brown') +
geom_density(size = 1, colour = 'blue') +
labs(title = 'Neighborhood - NridgHt: Price distribution', x = 'Price (in $1000)', y = 'Density')
grid.arrange(p1, p2, ncol = 2)
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
na_count <- colSums(is.na(ames_train))
head(sort(na_count, decreasing = TRUE), n = 1)
## Pool.QC
## 997
Pool Quality (Pool.QC) has the highest number of NAs. This count is high because NA is coded as "No Pool" in the data, and most houses do not have a pool.
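If we wanted to treat these NAs as a meaningful category rather than as missing data, one option (a sketch kept in a separate vector so the original data are untouched) is to recode them explicitly:
# Recode NA in Pool.QC as an explicit "No Pool" category
pool_qc <- as.character(ames_train$Pool.QC)
pool_qc[is.na(pool_qc)] <- "No Pool"
table(pool_qc)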
**We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.**
First, we will examine the correlations among these variables.
ggpairs(ames_train, columns = c('price', 'Lot.Area', 'Land.Slope', 'Year.Built', 'Year.Remod.Add', 'Bedroom.AbvGr'))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the chart, Year.Built and price have the highest correlation. Still, we should not conclude from the correlation alone that Year.Built is a good predictor of price; instead, we must build a model.
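To back up this reading of the ggpairs panels numerically (a minimal sketch; these correlations are not shown in the original output), the pairwise correlations of the numeric candidates with price can be computed directly:
# Correlation of price with each numeric candidate predictor
num_vars <- c('Lot.Area', 'Year.Built', 'Year.Remod.Add', 'Bedroom.AbvGr')
sapply(ames_train[num_vars], cor, y = ames_train$price, use = 'complete.obs')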
To search for the best model, we will start with a full model that predicts price based on lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). In this section, we will use backward elimination to pick significant predictors. Remember that p-values and parameter estimates should only be trusted if the conditions for regression are reasonable, so we will use diagnostic plots to check those conditions. We will focus on adjusted R-squared because it describes the strength of a model and is a useful tool for evaluating which predictors add value. First, we will fit the full model, then drop one predictor at a time until the parsimonious model is reached.
# Select only relevant variables
data <- ames_train %>%
select(Lot.Area, Land.Slope, Year.Built,
Year.Remod.Add, Bedroom.AbvGr, price)
# Remove NA
data_model <- data[complete.cases(data),]
# Build full model
m <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model)
summary(m)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = data_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
# Diagnostic plots for the full model
par(mfrow = c(2,2))
plot(m)
The summary of the full model indicates an adjusted R-squared of 0.5598. Next, we drop one predictor at a time, re-fit the model, and compare the adjusted R-squared values.
# Adjusted R-squared of the full model
s <- summary(m)$adj.r.squared
# m1: remove Lot.Area, then compare adj.r.squared with the full model
m1 <- lm(log(price) ~ Land.Slope + Year.Built +
           Year.Remod.Add + Bedroom.AbvGr, data = data_model)
s1 <- summary(m1)$adj.r.squared
# m2: remove Year.Built
m2 <- lm(log(price) ~ Lot.Area + Land.Slope +
           Year.Remod.Add + Bedroom.AbvGr, data = data_model)
s2 <- summary(m2)$adj.r.squared
# m3: remove Year.Remod.Add
m3 <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built +
           Bedroom.AbvGr, data = data_model)
s3 <- summary(m3)$adj.r.squared
# m4: remove Bedroom.AbvGr
m4 <- lm(log(price) ~ Lot.Area + Land.Slope +
           Year.Built + Year.Remod.Add, data = data_model)
s4 <- summary(m4)$adj.r.squared
# m5: remove Land.Slope
m5 <- lm(log(price) ~ Lot.Area + Year.Built +
           Year.Remod.Add + Bedroom.AbvGr, data = data_model)
s5 <- summary(m5)$adj.r.squared
Adj.R_Squared <- rbind(s, s1, s2, s3, s4, s5)
model <- c('m','m1', 'm2', 'm3', 'm4', 'm5')
df <- data.frame(cbind(model, Adj.R_Squared))
colnames(df) <- c('model', 'Adj.R_Squared')   # give the second column a clear name instead of the default V2
kable(df[1:2], caption = 'Model - Adj.R_Squared')
 | model | Adj.R_Squared |
---|---|---|
s | m | 0.559834474177341 |
s1 | m1 | 0.522009031037836 |
s2 | m2 | 0.447366335111107 |
s3 | m3 | 0.492238381396363 |
s4 | m4 | 0.531485946233459 |
s5 | m5 | 0.552606180049213 |
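For reference, the same drop-one comparison can be written as a short loop instead of five separate fits — a minimal sketch that reproduces the adjusted R-squared values in the table above:
# Drop each predictor in turn, re-fit, and record the adjusted R-squared
predictors <- c('Lot.Area', 'Land.Slope', 'Year.Built', 'Year.Remod.Add', 'Bedroom.AbvGr')
sapply(predictors, function(p) {
  f <- as.formula(paste('log(price) ~', paste(setdiff(predictors, p), collapse = ' + ')))
  summary(lm(f, data = data_model))$adj.r.squared
})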
The full model yields the highest adjusted R-squared. We will use Bayesian model averaging to verify this finding.
bma_m <- bas.lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model, prior = 'BIC', modelprior = uniform())
image(bma_m, rotate = FALSE)
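The model-rank image can be cross-checked against the numeric BAS output; summary() lists the top-ranked models and coef() reports the marginal posterior inclusion probability of each term (a sketch using standard BAS accessors, not part of the original output):
# Top models and marginal posterior inclusion probabilities
summary(bma_m)
coef(bma_m)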
Land.SlopeMod and Land.SlopeSev do not rank as highly as the other predictors in the model-averaging results, but Land.SlopeGtl is the reference level absorbed into the intercept; thus, Land.Slope cannot be removed unless the variable were split into three separate dummy variables. So the final model will be:
m <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model)
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
which(abs(resid(m)) == max(abs(resid(m))))
## 428
## 428
House #428 has the largest squared residual.
as.data.frame(data_model[428,])
## Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr price
## 1 9656 Gtl 1923 1970 2 12789
We will use the model to predict the price of house #428 and compare it with the actual price:
data_model.pred <- data_model
data_model.pred <- as.data.frame(data_model.pred)
data_model.pred$predicted <- exp(predict(m))
data_model.pred$residuals <- residuals(m)
data_model.pred[428,]
## Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr price
## 428 9656 Gtl 1923 1970 2 12789
## predicted residuals
## 428 103176.2 -2.087853
This home stands out from the rest because its actual sale price is only $12,789 while the model predicts about $103,176. Looking at the house's information, it has just 2 bedrooms above grade, a lot area of 9,656 square feet, and was built in 1923 (remodeled in 1970), so a low price is not surprising; still, the model greatly overpredicts for this home, which is what produces the largest squared residual.
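To follow the question's hint about looking at all the variables in the data set, the full record for this home can be pulled from ames_train (a sketch; the row index carries over because complete.cases() dropped no rows from the 1,000 observations):
# Full record for the home with the largest squared residual
idx <- which.max(abs(resid(m)))
as.data.frame(ames_train[idx, ])
Fields describing the sale conditions and overall condition of the home would be the natural places to look for an explanation of the unusually low price.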
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
model6 <- lm(log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model)
summary(model6)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = data_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14050 -0.15650 -0.01561 0.15350 0.90854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.553e+01 8.197e-01 -18.947 < 2e-16 ***
## log(Lot.Area) 2.442e-01 1.708e-02 14.297 < 2e-16 ***
## Land.SlopeMod 1.151e-01 4.734e-02 2.431 0.0152 *
## Land.SlopeSev -6.554e-02 1.222e-01 -0.536 0.5917
## Year.Built 5.981e-03 3.597e-04 16.628 < 2e-16 ***
## Year.Remod.Add 6.734e-03 5.190e-04 12.975 < 2e-16 ***
## Bedroom.AbvGr 5.909e-02 1.055e-02 5.599 2.8e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2649 on 993 degrees of freedom
## Multiple R-squared: 0.6056, Adjusted R-squared: 0.6032
## F-statistic: 254.1 on 6 and 993 DF, p-value: < 2.2e-16
We can see that the p-values for Land.Slope change substantially: Land.SlopeSev is no longer significant (p = 0.59), whereas it was significant in the model with untransformed Lot.Area. We will remove Land.Slope from the full model to check whether it makes a difference.
model6_red <- lm(log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model)
summary(model6_red)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr, data = data_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14609 -0.15825 -0.01477 0.15354 1.01578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
## log(Lot.Area) 2.471e-01 1.654e-02 14.935 < 2e-16 ***
## Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
## Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
## Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
## F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
The adjusted R-squared of the reduced model (0.6015) is lower than that of the full model (0.6032), so Land.Slope is retained. Moreover, Land.SlopeGtl is the reference level absorbed into the intercept, so Land.Slope cannot be partially removed unless the variable were split into three separate dummy variables; the model therefore ends up with the same set of predictors.
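To make the dummy-variable point concrete, here is a sketch (not part of the original analysis; the indicator names Slope.Mod and Slope.Sev are introduced only for illustration) of splitting Land.Slope so that the non-significant Sev contrast could be dropped on its own:
# Split Land.Slope into explicit indicators so individual levels can be dropped
data_dummy <- data_model %>%
  mutate(Slope.Mod = as.integer(Land.Slope == 'Mod'),
         Slope.Sev = as.integer(Land.Slope == 'Sev'))
# Keep only the Mod contrast and check the adjusted R-squared
m_mod_only <- lm(log(price) ~ log(Lot.Area) + Slope.Mod + Year.Built +
                   Year.Remod.Add + Bedroom.AbvGr, data = data_dummy)
summary(m_mod_only)$adj.r.squared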
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7
true <- log(ames_train$price)
pred1 <- predict(m)
pred2 <- predict(model6)
data <- data.frame(true = rep(true, 2),
                   pred = c(pred1, pred2),
                   prediction = c(rep('Lot.Area', 1000), rep('log(Lot.Area)', 1000)))
ggplot(data, aes(x = true, y = pred, color = prediction)) +
geom_point(alpha = 0.6) +
geom_abline(slope=1, intercept=0) +
theme(legend.position=c(0.2, 0.8),
plot.title = element_text(hjust = 0.5)) +
labs(title = "Predicted log(Price) vs Actual log(Price) for Both Models", y = "Predicted log(Price)", x = "Actual log(Price)")
Both models have the same predictors. model6 has a higher adjusted R-squared than model m (0.6032 > 0.5598). According to the plot, the predictions from the Lot.Area model stray further from the identity line (more outlying points) than those from the log(Lot.Area) model, so model6 has lower prediction error and a better linear fit; log-transforming Lot.Area is therefore preferable in terms of the assumptions for linear regression.
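As one additional quantitative check (a sketch; in-sample RMSE on the log scale, which is not reported in the original output), the prediction errors of the two models can be compared directly:
# In-sample root mean squared error on the log(price) scale
rmse <- function(fit) sqrt(mean(resid(fit)^2))
c('Lot.Area model' = rmse(m), 'log(Lot.Area) model' = rmse(model6))
A smaller RMSE for model6 would be consistent with the adjusted R-squared comparison above.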