First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(ggplot2)
library(dplyr)
library(knitr)
library(statsr)
library(GGally)
library(gridExtra)
library(BAS)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution
# Compute house age (as of 2019) from the construction year
ames_train$age <- 2019 - ames_train$Year.Built
ggplot(ames_train, aes(x = age, y = ..density..)) +
  geom_histogram(bins = 30, fill = 'blue', colour = 'black') +
  geom_density(size = 1, colour = 'red') +
  labs(title = 'Distribution of House Age', x = 'Age', y = 'Density') +
  geom_vline(xintercept = mean(ames_train$age), colour = 'green', size = 1) +   # mean age
  geom_vline(xintercept = median(ames_train$age), colour = 'brown', size = 1)   # median age
summary_age <- ames_train %>%
summarize(Mean_age = mean(age),
Median_age = median(age),
Sd_age = sd(age),
Q1 = quantile(age, 0.25),
Q3 = quantile(age, 0.75),
IQR = IQR(age),
Total = n())
summary_age
## # A tibble: 1 x 7
## Mean_age Median_age Sd_age Q1 Q3 IQR Total
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 46.8 44 29.6 18 64 46 1000
The distribution of `age` is right-skewed (the mean is 46.797, the median is 44) and multimodal, indicating high counts of houses at certain ages (such as around 50 and 90 years). The number of houses generally decreases as house age increases.
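To support the claim about multimodality, a quick check (a minimal sketch; this table is not part of the original output) is to list the most frequent ages in the data:
# Show the five most common house ages, i.e. the peaks visible in the histogram
head(sort(table(ames_train$age), decreasing = TRUE), 5)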
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# Perform summary statistics
q2 <- ames_train %>%
group_by(Neighborhood) %>%
summarise(mean_price = mean(price),
median_price = median(price),
min_price = min(price),
max_price = max(price),
IQR = IQR(price),
sd_price = sd(price),
var_price = var(price),
total = n())
# Determine the most expensive, least expensive and most heterogeneous neighborhoods
most_expensive <- q2[which(q2$median_price == max(q2$median_price)),]
least_expensive <- q2[which(q2$median_price == min(q2$median_price)),]
most_heter <- q2[which(q2$sd_price == max(q2$sd_price)),]
kable(most_expensive[1:9], caption = 'Most expensive neighborhood')
Neighborhood | mean_price | median_price | min_price | max_price | IQR | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 15242150036 | 20 |
kable(least_expensive[1:9], caption = 'Least expensive neighborhood')
Neighborhood | mean_price | median_price | min_price | max_price | IQR | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
MeadowV | 92946.88 | 85750 | 73000 | 129500 | 20150 | 18939.78 | 358715156 | 16 |
kable(most_heter[1:9], caption = 'Most heterogeneous neighborhood')
Neighborhood | mean_price | median_price | min_price | max_price | IQR | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 15242150036 | 20 |
# Plot
ggplot(ames_train, aes(x = Neighborhood, y = price/1000)) +
geom_boxplot(colour = 'black', fill = 'orange') +
labs(title = "Housing prices per Neighborhood", x = 'Neighborhood', y = "Price in [$ 1000]") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
According to the chart, StoneBr is the most expensive and most heterogeneous neighborhood, followed by NridgHt, while the least expensive neighborhood is MeadowV. We can take a closer look at the price distributions of StoneBr and NridgHt:
stone <- ames_train %>%
filter(Neighborhood == 'StoneBr')
p1 <- ggplot(stone, aes(x = price/1000, y = ..density..)) +
geom_histogram(bins = 30, colour = 'black', fill = 'brown') +
geom_density(size = 1, colour = 'blue') +
labs(title = 'Neighborhood - StoneBr: Price distribution', x = 'Price (in $1000)', y = 'Density')
nrid <- ames_train %>%
filter(Neighborhood == 'NridgHt')
p2 <- ggplot(nrid, aes(x = price/1000, y = ..density..)) +
geom_histogram(bins = 30, colour = 'black', fill = 'brown') +
geom_density(size = 1, colour = 'blue') +
labs(title = 'Neighborhood - NridgHt: Price distribution', x = 'Price (in $1000)', y = 'Density')
grid.arrange(p1, p2, ncol = 2)
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
na_count <- colSums(is.na(ames_train))
head(sort(na_count, decreasing = TRUE), n = 1)
## Pool.QC
## 997
Pool Quality (Pool.QC) has the highest number of NAs. This count is high because NA is coded as "No Pool" in the data, and most houses do not have a pool.
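If we wanted to treat these NAs as a meaningful category rather than as missing data, one option (a sketch kept in a separate vector so the original data are untouched) is to recode them explicitly:
# Recode NA in Pool.QC as an explicit "No Pool" category
pool_qc <- as.character(ames_train$Pool.QC)
pool_qc[is.na(pool_qc)] <- "No Pool"
table(pool_qc)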
**We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.**
First, we will examine the correlations among these variables.
ggpairs(ames_train, columns = c('price', 'Lot.Area', 'Land.Slope', 'Year.Built', 'Year.Remod.Add', 'Bedroom.AbvGr'))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the chart, Year.Built and price have the highest correlation. Still, we should not conclude from the correlation alone that Year.Built is a good predictor of price; instead, we must build a model.
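To back up this reading of the ggpairs panels numerically (a minimal sketch; these correlations are not shown in the original output), the pairwise correlations of the numeric candidates with price can be computed directly:
# Correlation of price with each numeric candidate predictor
num_vars <- c('Lot.Area', 'Year.Built', 'Year.Remod.Add', 'Bedroom.AbvGr')
sapply(ames_train[num_vars], cor, y = ames_train$price, use = 'complete.obs')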
To search for the best model, we will start with a full model that predicts price based on lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). In this section, we will use backward elimination to pick significant predictors. Remember that p-values and parameter estimates should only be trusted if the conditions for regression are reasonable, so we will use diagnostic plots to check those conditions. We will focus on adjusted R-squared because it describes the strength of a model and is a useful tool for evaluating which predictors add value. First, we will fit the full model, then drop one predictor at a time until the parsimonious model is reached.
# Select only relevant variables
data <- ames_train %>%
select(Lot.Area, Land.Slope, Year.Built,
Year.Remod.Add, Bedroom.AbvGr, price)
# Remove NA
data_model <- data[complete.cases(data),]
# Build full model
m <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model)
summary(m)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = data_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
# Diagnostic plots for the full model
par(mfrow = c(2,2))
plot(m)
The summary of the full model indicates an adjusted R-squared of 0.5598. Next, we drop one predictor at a time, re-fit the model, and compare the adjusted R-squared values.
# Adjusted R-squared of the full model
s <- summary(m)$adj.r.squared
# m1: remove Lot.Area, then compare adj.r.squared with the full model
m1 <- lm(log(price) ~ Land.Slope + Year.Built +
           Year.Remod.Add + Bedroom.AbvGr, data = data_model)
s1 <- summary(m1)$adj.r.squared
# m2: remove Year.Built
m2 <- lm(log(price) ~ Lot.Area + Land.Slope +
           Year.Remod.Add + Bedroom.AbvGr, data = data_model)
s2 <- summary(m2)$adj.r.squared
# m3: remove Year.Remod.Add
m3 <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built +
           Bedroom.AbvGr, data = data_model)
s3 <- summary(m3)$adj.r.squared
# m4: remove Bedroom.AbvGr
m4 <- lm(log(price) ~ Lot.Area + Land.Slope +
           Year.Built + Year.Remod.Add, data = data_model)
s4 <- summary(m4)$adj.r.squared
# m5: remove Land.Slope
m5 <- lm(log(price) ~ Lot.Area + Year.Built +
           Year.Remod.Add + Bedroom.AbvGr, data = data_model)
s5 <- summary(m5)$adj.r.squared
Adj.R_Squared <- rbind(s, s1, s2, s3, s4, s5)
model <- c('m','m1', 'm2', 'm3', 'm4', 'm5')
df <- data.frame(cbind(model, Adj.R_Squared))
colnames(df) <- c('model', 'Adj.R_Squared')   # give the second column a clear name instead of the default V2
kable(df[1:2], caption = 'Model - Adj.R_Squared')
 | model | Adj.R_Squared |
---|---|---|
s | m | 0.559834474177341 |
s1 | m1 | 0.522009031037836 |
s2 | m2 | 0.447366335111107 |
s3 | m3 | 0.492238381396363 |
s4 | m4 | 0.531485946233459 |
s5 | m5 | 0.552606180049213 |
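For reference, the same drop-one comparison can be written as a short loop instead of five separate fits — a minimal sketch that reproduces the adjusted R-squared values in the table above:
# Drop each predictor in turn, re-fit, and record the adjusted R-squared
predictors <- c('Lot.Area', 'Land.Slope', 'Year.Built', 'Year.Remod.Add', 'Bedroom.AbvGr')
sapply(predictors, function(p) {
  f <- as.formula(paste('log(price) ~', paste(setdiff(predictors, p), collapse = ' + ')))
  summary(lm(f, data = data_model))$adj.r.squared
})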
The full model yields the highest adjusted R-squared. We will use Bayesian model averaging to verify this finding.
bma_m <- bas.lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model, prior = 'BIC', modelprior = uniform())
image(bma_m, rotate = FALSE)
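The model-rank image can be cross-checked against the numeric BAS output; summary() lists the top-ranked models and coef() reports the marginal posterior inclusion probability of each term (a sketch using standard BAS accessors, not part of the original output):
# Top models and marginal posterior inclusion probabilities
summary(bma_m)
coef(bma_m)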
Land.SlopeMod and Land.SlopeSev do not rank as highly as the other predictors in the model-averaging results, but Land.SlopeGtl is the reference level absorbed into the intercept; thus, Land.Slope cannot be removed unless the variable were split into three separate dummy variables. So the final model will be:
m <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model)
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
which(abs(resid(m)) == max(abs(resid(m))))
## 428
## 428
House #428 has the largest squared residual.
as.data.frame(data_model[428,])
## Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr price
## 1 9656 Gtl 1923 1970 2 12789
We will use the model to predict the price of house #428 and compare it with the actual price:
data_model.pred <- data_model
data_model.pred <- as.data.frame(data_model.pred)
data_model.pred$predicted <- exp(predict(m))
data_model.pred$residuals <- residuals(m)
data_model.pred[428,]
## Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr price
## 428 9656 Gtl 1923 1970 2 12789
## predicted residuals
## 428 103176.2 -2.087853
This home stands out from the rest because its actual sale price is only $12,789 while the model predicts about $103,176. Looking at the house's information, it has just 2 bedrooms above grade, a lot area of 9,656 square feet, and was built in 1923 (remodeled in 1970), so a low price is not surprising; still, the model greatly overpredicts for this home, which is what produces the largest squared residual.
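To follow the question's hint about looking at all the variables in the data set, the full record for this home can be pulled from ames_train (a sketch; the row index carries over because complete.cases() dropped no rows from the 1,000 observations):
# Full record for the home with the largest squared residual
idx <- which.max(abs(resid(m)))
as.data.frame(ames_train[idx, ])
Fields describing the sale conditions and overall condition of the home would be the natural places to look for an explanation of the unusually low price.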
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
model6 <- lm(log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model)
summary(model6)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = data_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14050 -0.15650 -0.01561 0.15350 0.90854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.553e+01 8.197e-01 -18.947 < 2e-16 ***
## log(Lot.Area) 2.442e-01 1.708e-02 14.297 < 2e-16 ***
## Land.SlopeMod 1.151e-01 4.734e-02 2.431 0.0152 *
## Land.SlopeSev -6.554e-02 1.222e-01 -0.536 0.5917
## Year.Built 5.981e-03 3.597e-04 16.628 < 2e-16 ***
## Year.Remod.Add 6.734e-03 5.190e-04 12.975 < 2e-16 ***
## Bedroom.AbvGr 5.909e-02 1.055e-02 5.599 2.8e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2649 on 993 degrees of freedom
## Multiple R-squared: 0.6056, Adjusted R-squared: 0.6032
## F-statistic: 254.1 on 6 and 993 DF, p-value: < 2.2e-16
We can see that the p-values for Land.Slope change substantially: Land.SlopeSev is no longer significant (p = 0.59), whereas it was significant in the model with untransformed Lot.Area. We will remove Land.Slope from the full model to check whether it makes a difference.
model6_red <- lm(log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = data_model)
summary(model6_red)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr, data = data_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14609 -0.15825 -0.01477 0.15354 1.01578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
## log(Lot.Area) 2.471e-01 1.654e-02 14.935 < 2e-16 ***
## Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
## Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
## Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
## F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
The adjusted R-squared of the reduced model (0.6015) is lower than that of the full model (0.6032), so Land.Slope is retained. Moreover, Land.SlopeGtl is the reference level absorbed into the intercept, so Land.Slope cannot be partially removed unless the variable were split into three separate dummy variables; the model therefore ends up with the same set of predictors.
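To make the dummy-variable point concrete, here is a sketch (not part of the original analysis; the indicator names Slope.Mod and Slope.Sev are introduced only for illustration) of splitting Land.Slope so that the non-significant Sev contrast could be dropped on its own:
# Split Land.Slope into explicit indicators so individual levels can be dropped
data_dummy <- data_model %>%
  mutate(Slope.Mod = as.integer(Land.Slope == 'Mod'),
         Slope.Sev = as.integer(Land.Slope == 'Sev'))
# Keep only the Mod contrast and check the adjusted R-squared
m_mod_only <- lm(log(price) ~ log(Lot.Area) + Slope.Mod + Year.Built +
                   Year.Remod.Add + Bedroom.AbvGr, data = data_dummy)
summary(m_mod_only)$adj.r.squared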
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7
true <- log(ames_train$price)
pred1 <- predict(m)
pred2 <- predict(model6)
data <- data.frame(true = rep(true, 2),
                   pred = c(pred1, pred2),
                   prediction = c(rep('Lot.Area', 1000), rep('log(Lot.Area)', 1000)))
ggplot(data, aes(x = true, y = pred, color = prediction)) +
geom_point(alpha = 0.6) +
geom_abline(slope=1, intercept=0) +
theme(legend.position=c(0.2, 0.8),
plot.title = element_text(hjust = 0.5)) +
labs(title = "Predicted log(Price) vs Actual log(Price) for Both Models", y = "Predicted log(Price)", x = "Actual log(Price)")
Both models have the same predictors. model6 has a higher adjusted R-squared than model m (0.6032 > 0.5598). According to the plot, the predictions from the Lot.Area model stray further from the identity line (more outlying points) than those from the log(Lot.Area) model, so model6 has lower prediction error and a better linear fit; log-transforming Lot.Area is therefore preferable in terms of the assumptions for linear regression.
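As one additional quantitative check (a sketch; in-sample RMSE on the log scale, which is not reported in the original output), the prediction errors of the two models can be compared directly:
# In-sample root mean squared error on the log(price) scale
rmse <- function(fit) sqrt(mean(resid(fit)^2))
c('Lot.Area model' = rmse(m), 'log(Lot.Area) model' = rmse(model6))
A smaller RMSE for model6 would be consistent with the adjusted R-squared comparison above.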