Setup
# Load Libraries
library('MASS')
library('ggplot2') # library to create plots
library('dplyr') # data manipulation
library('knitr') # required to apply knitr options
library('statsr') # statistics functions
library('R.utils') # support functions used in the analysis
library('GGally') # ggplot2 extension (e.g. plot matrices)
library('gridExtra') # arrange plots
library('BAS') # Bayesian statistics functions
library('labeling') # dependency package for ggplot2
library('tidyr') # data tidying
# apply general knitr options
knitr::opts_chunk$set(comment=NA, fig.align='center')
# load data file
load('ames_train.Rdata')
data <- ames_train
John Eugene Driscoll | Module 5 - Data Analysis Project | Nov 2017 Submission
Introduction
The purpose of this document is to complete the Peer-graded Assignment: EDA and Basic Model Selection project required during week 4 of the Statistics with R Capstone course by Duke University (Coursera).
The background context regarding the assignment can be found at: https://www.coursera.org/learn/statistics-project/peer/XkGx1/eda-and-basic-model-selection
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
data$age <- 2017 - data$Year.Built # age in years as of 2017
# create plot
ggplot(data = data, aes(x = age, y = ..density..)) +
  geom_histogram(bins = 30) +
  geom_density(size = 1) +
  labs(title = "Ages of houses in years", x = 'Age [years]', y = "Density") +
  geom_vline(aes(xintercept = mean(age)), size = 1.5) +
  geom_vline(aes(xintercept = median(age)), size = 1.5, linetype = 'dashed') +
  geom_text(aes(x = mean(age), y = .015, label = 'mean'), size = 3) +
  geom_text(aes(x = median(age), y = .020, label = 'median'), size = 3) +
  guides(color = FALSE, size = FALSE)
# find summary stats
summary.age <- data %>% summarise(mean_age = mean(age),
median_age = median(age),
sd_age = sd(age),
min_age = min(age),
max_age = max(age),
IQR_age = IQR(age),
total = n())
The distribution of the calculated age of homes in the sample is multimodal and right-skewed (which makes sense, as homes cannot be less than 0 years old).
Summary statistics are presented in the table below.
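As a sketch (equivalent to the table below), the same summary could be rendered directly from summary.age by reshaping the one-row summary into statistic/value pairs with tidyr and printing it with kable:
# sketch: render summary.age as a two-column table
summary.age %>%
  gather(key = 'Statistic', value = 'Value') %>%
  kable(caption = 'Age summary statistics')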
Statistic | Value |
---|---|
mean | 44.797 |
median | 42 |
sd | 29.63741 |
min | 7 |
max | 145 |
IQR | 46 |
# items | 1000 |
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# extract summary parameter
price.neighborhood <- data %>% dplyr::select(price, Neighborhood)
price.neighborhood.summary <- price.neighborhood %>% group_by(Neighborhood) %>% summarise(mean_price = mean(price),
median_price = median(price),
min_price = min(price),
max_price = max(price),
IQR_price = IQR(price),
sd_price = sd(price),
var_price = var(price),
total = n())
# find the required information in the summary data
ExpNeighborhood <- price.neighborhood.summary[which.max(price.neighborhood.summary$median_price), ]
LeastExpNeighborhood <- price.neighborhood.summary[which.min(price.neighborhood.summary$median_price), ]
HetNeighborhood <- price.neighborhood.summary[which.max(price.neighborhood.summary$sd_price), ]
# create box plot
ggplot(data, aes(x = Neighborhood, y = price / 1000)) +
  geom_boxplot() +
  labs(title = "Housing prices per Neighborhood", x = 'Neighborhood', y = "Price in $1000s") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
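Optionally (a sketch, not part of the original plot), ordering the neighborhoods by median price makes the most and least expensive ones easier to read off the boxplot:
# same boxplot, with neighborhoods reordered by median price
ggplot(data, aes(x = reorder(Neighborhood, price, FUN = median), y = price / 1000)) +
  geom_boxplot() +
  labs(title = "Housing prices per Neighborhood (ordered by median)",
       x = 'Neighborhood', y = "Price in $1000s") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))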
The boxplot shows the price distribution per neighborhood. The most expensive neighborhood is StoneBr, determined by finding the maximum median price across all neighborhoods, while the least expensive neighborhood is MeadowV, determined by finding the minimum median price. The most heterogeneous neighborhood, measured by the standard deviation of housing prices, is StoneBr.
kable(ExpNeighborhood[1:9], caption = '<b>Most expensive Neighborhood</b>')
Neighborhood | mean_price | median_price | min_price | max_price | IQR_price | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 15242150036 | 20 |
kable(LeastExpNeighborhood[1:9], caption = '<b>Least expensive Neighborhood</b>')
Neighborhood | mean_price | median_price | min_price | max_price | IQR_price | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
MeadowV | 92946.88 | 85750 | 73000 | 129500 | 20150 | 18939.78 | 358715156 | 16 |
kable(HetNeighborhood[1:9], caption = '<b>Most heterogeneous Neighborhood</b>')
Neighborhood | mean_price | median_price | min_price | max_price | IQR_price | sd_price | var_price | total |
---|---|---|---|---|---|---|---|---|
StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 15242150036 | 20 |
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# find the variable with the highest NA count
na_count <- sapply(data, function(x) sum(is.na(x)))
df.na_count <- data.frame(na_count)
df.merged <- cbind(c(row.names(df.na_count)), df.na_count[,1])
colnames(df.merged) <- c('feature', 'No_NA')
df.merged[which(df.na_count == max(df.na_count)),]
feature No_NA
"Pool.QC" "997"
Pool Quality (Pool.QC) has the highest number of NAs. Per the codebook, an NA in this feature means a house does NOT have a pool.
As Iowa has harsh winters and four changing seasons, it is reasonable to expect that many homes in a given Iowa city do not have a pool.
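For added context, a minimal sketch listing the features with the most missing values, using the na_count vector computed above:
# top 5 features by NA count
head(sort(na_count, decreasing = TRUE), 5)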
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
Model selection
Adjusted R2 describes the strength of a model and is a useful tool for evaluating which predictors add value to the model.
Backward elimination was applied to select an appropriate model, with adjusted R2 as the criterion: we start with the full model (containing all predictors) and drop one predictor at a time until the adjusted R2 value is maximized.
The full model uses all candidate features:
log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr
The analysis shows that the full model gives the highest adjusted R2 value.
# Call out model features
full.model <- data %>% dplyr::select( Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, price)
# remove NAs
full.model <- full.model[complete.cases(full.model), ]
formula <- as.formula(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
lm.houses <- lm(formula, full.model)
summary(lm.houses)
Call:
lm(formula = formula, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.0878 -0.1651 -0.0211 0.1657 0.9945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.279 on 993 degrees of freedom
Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
The summary of our model indicates that the adjusted R2 value is 0.5598 and the p-value is < 2.2e-16.
Removing the (still significant) feature Land.Slope and running the analysis again would lead to a slightly smaller adjusted R2 value of 0.5526. As the change in adjusted R2 is immaterial, we opt to keep the model with all features.
Below is an excerpt of the candidate models, giving an indication of how the adjusted R_Squared changes with backward elimination.
formula1 <- as.formula(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
formula2 <- as.formula(log(price) ~ Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
formula3 <- as.formula(log(price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
formula4 <- as.formula(log(price) ~ Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr)
lm.houses1 <- lm(formula1, full.model)
lm.houses2 <- lm(formula2, full.model)
lm.houses3 <- lm(formula3, full.model)
lm.houses4 <- lm(formula4, full.model)
m1 <- summary(lm.houses1)$adj.r.squared
m2 <- summary(lm.houses2)$adj.r.squared
m3 <- summary(lm.houses3)$adj.r.squared
m4 <- summary(lm.houses4)$adj.r.squared
R_Squared <- rbind(m1, m2, m3, m4)
model <- c('lm.houses1', 'lm.houses2', 'lm.houses3', 'lm.houses4')
df <- data.frame(cbind(model, R_Squared))
colnames(df) <- c('Model', 'R_Squared')
kable(df[1:2], caption = 'Model - R_Squared')
 | Model | R_Squared |
---|---|---|
m1 | lm.houses1 | 0.559834474177341 |
m2 | lm.houses2 | 0.522009031037836 |
m3 | lm.houses3 | 0.552606180049213 |
m4 | lm.houses4 | 0.447366335111107 |
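For reference, a minimal sketch of how this drop-one-at-a-time comparison could be automated (it assumes full.model and lm.houses as defined above; each entry is the adjusted R2 of the model obtained by removing that predictor):
# one automated backward-elimination pass on adjusted R-squared
predictors <- c('Lot.Area', 'Land.Slope', 'Year.Built', 'Year.Remod.Add', 'Bedroom.AbvGr')
adj.r2 <- sapply(predictors, function(p) {
  f <- as.formula(paste('log(price) ~', paste(setdiff(predictors, p), collapse = ' + ')))
  summary(lm(f, data = full.model))$adj.r.squared
})
# compare against the full model; if no reduced model improves on it, stop
c(full_model = summary(lm.houses)$adj.r.squared, adj.r2)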
In order to get comfortable with our model, let us check the conditions for regression.
-We first examine the numerical variables included in the model by looking at the distribution of the residuals against log(price).
ggplot(data = NULL, aes(x = log(full.model$price), y = lm.houses$residuals)) +
geom_point() + geom_hline(yintercept = 0, linetype = 'dashed') +
ylab('Residuals') + xlab('log(Price)')
The residuals are scattered randomly around 0.
-Check whether the residuals display a nearly normal distribution centered around 0.
par(mfrow = c(1,2))
hist(lm.houses$residuals)
qqnorm(lm.houses$residuals)
qqline(lm.houses$residuals)
The histogram of the residuals shows a nearly normal distribution centered at 0 that is slightly left-skewed. The Q-Q plot also indicates some skewness in the tails, but there are no major deviations. We can conclude that the conditions for this model are reasonable.
ggplot(data = NULL, aes(x = lm.houses$fitted, y = lm.houses$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed', color = 'red') +
  labs(x = 'Fitted values', y = 'Residuals')
The results show that the residuals are equally variable for low and high predicted values, i.e., the residuals have constant variability.
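As an optional supplement (a sketch, not part of the original analysis), a scale-location style plot can make departures from constant variance easier to see; a roughly flat trend supports the conclusion above:
# scale-location style check: sqrt(|residuals|) vs. fitted values
ggplot(data = NULL, aes(x = lm.houses$fitted, y = sqrt(abs(lm.houses$residuals)))) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = 'Fitted values', y = 'sqrt(|Residuals|)')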
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# summary function
summary(lm.houses)
Call:
lm(formula = formula, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.0878 -0.1651 -0.0211 0.1657 0.9945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.279 on 993 degrees of freedom
Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
# find the house with the largest absolute residual
which(abs(resid(lm.houses)) == max(abs(resid(lm.houses))))
428
428
# show data row
as.data.frame(full.model[428, ])
Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr price
1 9656 Gtl 1923 1970 2 12789
The query indicates that house number 428 has the largest squared residual.
full.model.pred <- full.model
full.model.pred <- as.data.frame(full.model.pred)
full.model.pred$predicted <- exp(predict(lm.houses))
full.model.pred$residuals <- residuals(lm.houses)
full.model.pred[428, ]
Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr price
428 9656 Gtl 1923 1970 2 12789
predicted residuals
428 103176.2 -2.087853
House number 428 stands out because its actual price is $12,789 while our predicted price was $103,176. This is a significant delta.
Looking at the complete record for this house in our dataset, we can see that it is generally of fair or poor quality across most of the categorical data points. Specifically, Overall.Qual and Overall.Cond are both poor.
The below-average quality of the home may explain the delta between the predicted and actual price.
Adding some of the aforementioned variables (not included in the full model) to a new model may better predict the sale price of the home.
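A minimal sketch of the lookup behind this observation; it assumes the row indices of full.model still match those of data, which holds here because complete.cases() removed no rows (993 residual degrees of freedom + 7 coefficients = 1000 observations):
# inspect the quality-related fields of house 428 in the original data
data[428, c('Overall.Qual', 'Overall.Cond', 'price')]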
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
full.model <- data %>% dplyr::select( Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, price)
# make sure no NA values are in the dataset
full.model <- full.model[complete.cases(full.model), ]
formula <- as.formula(log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
lm.houses.log <- lm(formula, full.model)
summary(lm.houses.log)
Call:
lm(formula = formula, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.14050 -0.15650 -0.01561 0.15350 0.90854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.553e+01 8.197e-01 -18.947 < 2e-16 ***
log(Lot.Area) 2.442e-01 1.708e-02 14.297 < 2e-16 ***
Land.SlopeMod 1.151e-01 4.734e-02 2.431 0.0152 *
Land.SlopeSev -6.554e-02 1.222e-01 -0.536 0.5917
Year.Built 5.981e-03 3.597e-04 16.628 < 2e-16 ***
Year.Remod.Add 6.734e-03 5.190e-04 12.975 < 2e-16 ***
Bedroom.AbvGr 5.909e-02 1.055e-02 5.599 2.8e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2649 on 993 degrees of freedom
Multiple R-squared: 0.6056, Adjusted R-squared: 0.6032
F-statistic: 254.1 on 6 and 993 DF, p-value: < 2.2e-16
Removing the 'insignificant' feature Land.Slope and running the analysis again would lead to a slightly smaller adjusted R2 value of 0.6015. As the change in adjusted R2 is immaterial, we opt to keep the model with all features.
Besides the introduction of the log transformations, the model in Question 6 contains the same set of predictors as the model in Question 4. For comparison, the reduced model (without Land.Slope) is:
formula: log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr
formula.Q6_2 <- as.formula(log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
lm.houses.red <- lm(formula.Q6_2, full.model)
summary(lm.houses.red)
Call:
lm(formula = formula.Q6_2, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.14609 -0.15825 -0.01477 0.15354 1.01578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
log(Lot.Area) 2.471e-01 1.654e-02 14.935 < 2e-16 ***
Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2655 on 995 degrees of freedom
Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
Adjusted R2
Model type | Adj. R-squared |
---|---|
Full model | 0.6032 |
Reduced model | 0.6015 |
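Optionally (a sketch, not part of the adjusted-R2-based selection used above), a nested-model F-test could be used to check whether Land.Slope adds explanatory power on top of the reduced model:
# compare reduced (without Land.Slope) vs. full (with Land.Slope) log models
anova(lm.houses.red, lm.houses.log)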
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
full.model <- data %>% dplyr::select( Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, price)
# make sure no NA values are in the dataset
full.model <- full.model[complete.cases(full.model), ]
# prepare formulas with log transformation of Lot.Area and without
formula.nolog <- as.formula(log(price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
formula.log <- as.formula(log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr)
# run regression on log version
lm.houses.log <- lm(formula.log, full.model)
summary(lm.houses.log)
Call:
lm(formula = formula.log, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.14609 -0.15825 -0.01477 0.15354 1.01578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
log(Lot.Area) 2.471e-01 1.654e-02 14.935 < 2e-16 ***
Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2655 on 995 degrees of freedom
Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
# run regression on nolog version
lm.houses.nolog <- lm(formula.nolog, full.model)
summary(lm.houses.nolog)
Call:
lm(formula = formula.nolog, data = full.model)
Residuals:
Min 1Q Median 3Q Max
-2.09102 -0.16488 -0.02135 0.16605 1.13359
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.385e+01 8.633e-01 -16.049 < 2e-16 ***
Lot.Area 8.697e-06 9.168e-07 9.486 < 2e-16 ***
Year.Built 6.019e-03 3.819e-04 15.761 < 2e-16 ***
Year.Remod.Add 6.889e-03 5.505e-04 12.512 < 2e-16 ***
Bedroom.AbvGr 8.704e-02 1.082e-02 8.047 2.4e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2813 on 995 degrees of freedom
Multiple R-squared: 0.5544, Adjusted R-squared: 0.5526
F-statistic: 309.5 on 4 and 995 DF, p-value: < 2.2e-16
Adjusted R2
Model type | Adj. R-squared |
---|---|
Log model | 0.6015 |
No Log model | 0.5526 |
Observations:
The log model shows a considerably larger adjusted R2 than the no-log model.
Looking at the p-values, both models are statistically significant.
The visualization is below.
# retrieve true values from data set
true.values <- full.model[, 'price']
# prepare dataset to run test
data.test <- subset(full.model, select = -c(price) )
data.prediction <- data.test
data.pred.log <- predict(lm.houses.log, data.prediction, interval = "prediction", level = 0.95)
data.pred.log <- data.frame(data.pred.log)
pred.log <- exp(data.pred.log$fit)
data.pred.nolog <- predict(lm.houses.nolog, data.prediction, interval = "prediction", level = 0.95)
data.pred.nolog <- data.frame(data.pred.nolog)
pred.nolog <- exp(data.pred.nolog$fit)
# combine predictions and true values into data frames for plotting
pred.log <- data.frame(pred = pred.log, true.value = true.values)
pred.nolog <- data.frame(pred = pred.nolog, true.value = true.values)
p1 <- ggplot(pred.log, aes(x = pred, y = true.value)) +
  geom_point() +
  labs(title = "Model with log transformation", x = 'Predicted price', y = "Actual price") +
  geom_smooth(method = "lm", linetype = 'dashed', se = FALSE) +
  geom_abline(intercept = 0, slope = 1)
p2 <- ggplot(pred.nolog, aes(x = pred, y = true.value)) +
  geom_point() +
  labs(title = "Model without log transformation", x = 'Predicted price', y = "Actual price") +
  geom_smooth(method = "lm", linetype = 'dashed', se = FALSE) +
  geom_abline(intercept = 0, slope = 1)
grid.arrange(p1, p2)
Log model (Plot 1)
The residuals are more evenly distributed. As a result, the standard errors of our regression coefficients are more reliable, which in turn means that the reported statistical significance of our independent variables is more trustworthy than in the no-log model.
No Log model (Plot 2)
The residuals are heteroskedastic: the variance of the errors is not constant across levels of the dependent variable. As a result, the standard errors of our regression coefficients are unreliable and may be understated. In turn, this means that the statistical significance of our independent variables may be overstated; in other words, some of them may not be statistically significant. (An optional formal check is sketched below.)
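As an optional formal check of the heteroskedasticity claim (a sketch; it assumes the lmtest package is installed, which is not loaded in the setup above), a Breusch-Pagan test could be run on both fitted models:
library('lmtest')
bptest(lm.houses.log)    # model with log(Lot.Area)
bptest(lm.houses.nolog)  # model with untransformed Lot.Area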
Decision:
The log model is selected, based on the fact that its statistical significance should be more trustworthy than that of the no-log model and that it has a higher adjusted R2.
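One further quantitative comparison could be added (a sketch, not part of the original analysis): in-sample RMSE on the dollar scale for both models, using a simple exponentiation of the fitted log prices (no bias correction):
# root mean squared error of back-transformed fitted prices vs. actual prices
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
c(log_model   = rmse(full.model$price, exp(predict(lm.houses.log))),
  nolog_model = rmse(full.model$price, exp(predict(lm.houses.nolog))))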