Peer Assessment I

First, let us load the data and necessary packages:

load("ames_train.Rdata")
library(MASS); library(kableExtra)
library(dplyr); library(broom)
library(ggplot2)

Question 1

Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.

# Create a histogram with summary statistics
ames_train %>%
        ggplot(aes(Year.Built)) +
        geom_histogram(bins =30, fill = "lightblue", color = "black") +
        geom_vline(aes(xintercept = mean(Year.Built), 
                   color = "mean"), linetype = "dashed", size = 1) +
        geom_vline(aes(xintercept = median(Year.Built), 
                   color = "median"), linetype = "dashed", size = 1) +
        scale_color_manual(name = "Statistics", values = c(mean = "red", median = "darkgreen")) +
        ggtitle("Histogram of Year Homes were Built in Ames") +
        xlab("Year Built") + ylab("Number of Homes")

Some observations

The distribution is bi-model, with a peak of newly built homes.
The distribution is left skewed.
The mean of the distribution is less than the median.

Question 2

The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.

library(ggridges)

# Create a ridgeline chart
ames_train %>%
        ggplot(aes(x = price / 1000, y = Neighborhood, fill = Neighborhood)) +
        geom_density_ridges() +
        theme_ridges() +
        theme(legend.position = "none") +
        xlab("Home Price (,000's)") +
        theme(axis.title.y = element_blank())

This ridgeline graph will provide a nice visualize display of where the most and least expensive homes are in Ames (by Neighborhood), as well as a distirbution of the home prices.

From this chart we can see that the least expensive homes are likely in ‘MeadowV’, the most expensive in ‘StoneBr’ or ‘NridegHt’ and the most heterogeneous in those same neighborhoods.

Let’s look at summary statistics to confirm

# Create a data set grouping by Neighborhood and creating summary statistics
ames_hoods <- ames_train %>%
        group_by(Neighborhood) %>%
        summarize(mean = mean(price),
                  median = median(price),
                  st_dev = sd(price),
                  IQR = IQR(price),
                  range = range(price)[2] - range(price)[1])

# Sort by median and show top 5
ames_hoods %>%
        arrange(desc(median)) %>%
        head(5) %>%
        kable() %>%
        kable_styling(full_width = FALSE, bootstrap_options = "striped") %>%
        column_spec(3, bold = TRUE, background = "lightblue")

Neighborhood	mean	median	st_dev	IQR	range
StoneBr	339316.0	340691.5	123459.10	151358.0	411587
NridgHt	333646.7	336860.0	105088.90	148800.0	442000
NoRidge	295844.6	290000.0	35888.97	50312.5	170000
GrnHill	280000.0	280000.0	70710.68	50000.0	100000
Timber	265192.2	232500.0	84029.57	151200.0	250000

# Sort by median and show bottom 5
ames_hoods %>%
        arrange(median) %>%
        head(5) %>%
        kable() %>%
        kable_styling(full_width = FALSE, bootstrap_options = "striped") %>%
        column_spec(3, bold = TRUE, background = "lightblue")

Neighborhood	mean	median	st_dev	IQR	range
MeadowV	92946.88	85750	18939.78	20150	56500
BrDale	98930.00	98750	13337.59	16725	42500
IDOTRR	97620.69	99500	31530.44	48900	116009
OldTown	120225.61	120000	36429.69	29550	253190
Blueste	125800.00	123900	10381.23	10250	20500

# Sort by Standard Deviation and show top 5
ames_hoods %>%
        arrange(desc(st_dev)) %>%
        head(5) %>%
        kable() %>%
        kable_styling(full_width = FALSE, bootstrap_options = "striped")  %>%
        column_spec(4, bold = TRUE, background = "lightblue")

Neighborhood	mean	median	st_dev	IQR	range
StoneBr	339316.0	340691.5	123459.10	151358	411587
NridgHt	333646.7	336860.0	105088.90	148800	442000
Timber	265192.2	232500.0	84029.57	151200	250000
Veenker	233650.0	205750.0	72545.41	68125	235000
Crawfor	204196.6	205000.0	71267.56	80100	296000

# Sort by IQR and show top 5
ames_hoods %>%
        arrange(desc(IQR)) %>%
        head(5) %>%
        kable() %>%
        kable_styling(full_width = FALSE, bootstrap_options = "striped")  %>%
        column_spec(5, bold = TRUE, background = "lightblue")

Neighborhood	mean	median	st_dev	IQR	range
StoneBr	339316.0	340691.5	123459.10	151358	411587
Timber	265192.2	232500.0	84029.57	151200	250000
NridgHt	333646.7	336860.0	105088.90	148800	442000
Crawfor	204196.6	205000.0	71267.56	80100	296000
ClearCr	193153.8	185000.0	48068.69	76000	169500

Some observations with regards to price by Neighborhood:

With regards to most and least expensive, it appears that we will arrive at the same conclusion whether we utilize the mean or the median of the distribution. However, when trying to determine heterogeneity, our use of metrics does matter as Standard Deviation, IQR, and Range will arrive at different results. For our purposes, we will use the more robust statistics of Median and IQR.
Most expensive neighborhoods: ‘StoneBr’, ‘NridgHt’ and ‘NoRidge’.
Least expensive neighborhoods: ‘MeadowV’, ‘BrDale’, and ‘IDOTRR’.
Most heterogeneous neighborhoods: ‘StoneBr’, ‘Timber’ and ‘NridgeHt’.

Question 3

Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

#  is.na provides 1/0 for True/False - sum each column of is.na and sort
ames_NAs <- sort(colSums(is.na(ames_train)), decreasing = TRUE)
        
# display the 5 most columns with NAs
ames_NAs[1:5]

     Pool.QC Misc.Feature        Alley        Fence Fireplace.Qu 
         997          971          933          798          491

The column Pool.QC has the largest number of missing values at 997. This makes sense as this variable holds ordinal values for Pool Quality. However, if the home has ‘No Pool’ it has an NA in this field.

Question 4

We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

# subseet the data set based on candidate explanatory variables
ames_Q4 <- ames_train %>%
        dplyr::select(price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)

# build an Linear Regression model for the log(price) vs all the variables selected
ames_lm <- lm(log(price) ~ ., ames_Q4)

# Calculate the adjusted R^2 for the full model
summary(ames_lm)$adj.r.squared

[1] 0.5598345

We have built our base linear model using the candidate explanatory variables to predict the natural log of home prices. Next, we will use backward selection utilizing \(Adjusted R^2\) as our metric. We will remove each variable individually from the model and recalculate \(Adjusted R^2\) each time. If we see an increase in this metric, we will accept the smaller model. If none of the smaller models show an increase in this metric, then we will keep the full model.

# Calculate the adjusted R^2 for all (p - 1) models
summary(lm(log(price) ~ . -Lot.Area, ames_Q4))$adj.r.squared;

[1] 0.522009

summary(lm(log(price) ~ . -Land.Slope, ames_Q4))$adj.r.squared;

[1] 0.5526062

summary(lm(log(price) ~ . -Year.Built, ames_Q4))$adj.r.squared;

[1] 0.4473663

summary(lm(log(price) ~ . -Year.Remod.Add, ames_Q4))$adj.r.squared;

[1] 0.4922384

summary(lm(log(price) ~ . -Bedroom.AbvGr, ames_Q4))$adj.r.squared

[1] 0.5314859

As shown, in the above calculation of the \(Adjusted R^2\) for each of the \(p - 1\) models, none exhibits an increase in the metric chosen. We will accept the full model with all the condidate explanatory variables.

Question 5

Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?

par(mfrow = c(1, 1))

# Display the residuals plot
plot(ames_lm, which = 1)

# Carve out the observation that is an outlier
outlier <- sort(ames_lm$residuals^2, decreasing = TRUE)[1]
outlier

    428 
4.35913

# Display the observation to see if any data looks noteworthy
ames_train[names(outlier), ] %>%
        kable() %>%
        kable_styling() %>%
        scroll_box(width = "100%")

PID	area	price	MS.SubClass	MS.Zoning	Lot.Frontage	Lot.Area	Street	Alley	Lot.Shape	Land.Contour	Utilities	Lot.Config	Land.Slope	Neighborhood	Condition.1	Condition.2	Bldg.Type	House.Style	Overall.Qual	Overall.Cond	Year.Built	Year.Remod.Add	Roof.Style	Roof.Matl	Exterior.1st	Exterior.2nd	Mas.Vnr.Type	Mas.Vnr.Area	Exter.Qual	Exter.Cond	Foundation	Bsmt.Qual	Bsmt.Cond	Bsmt.Exposure	BsmtFin.Type.1	BsmtFin.SF.1	BsmtFin.Type.2	BsmtFin.SF.2	Bsmt.Unf.SF	Total.Bsmt.SF	Heating	Heating.QC	Central.Air	Electrical	X1st.Flr.SF	X2nd.Flr.SF	Low.Qual.Fin.SF	Bsmt.Full.Bath	Bsmt.Half.Bath	Full.Bath	Half.Bath	Bedroom.AbvGr	Kitchen.AbvGr	Kitchen.Qual	TotRms.AbvGrd	Functional	Fireplaces	Fireplace.Qu	Garage.Type	Garage.Yr.Blt	Garage.Finish	Garage.Cars	Garage.Area	Garage.Qual	Garage.Cond	Paved.Drive	Wood.Deck.SF	Open.Porch.SF	Enclosed.Porch	X3Ssn.Porch	Screen.Porch	Pool.Area	Pool.QC	Fence	Misc.Feature	Misc.Val	Mo.Sold	Yr.Sold	Sale.Type	Sale.Condition
902207130	832	12789	30	RM	68	9656	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	OldTown	Norm	Norm	1Fam	1Story	2	2	1923	1970	Gable	CompShg	AsbShng	AsbShng	None	0	TA	Fa	BrkTil	Fa	Fa	No	Unf	0	Unf	0	678	678	GasA	TA	N	SBrkr	832	0	0	0	0	1	0	2	1	TA	5	Typ	1	Gd	Detchd	1928	Unf	2	780	Fa	Fa	N	0	0	0	0	0	0	NA	NA	NA	0	6	2010	WD	Abnorml

We have identified the largest outlier, now let’s run some visualizations to understand this observation better.

par(mfrow = c(1, 2))

# Plot Lot Area vs Log(price) with coloring the outlier red
with(ames_Q4, plot(Lot.Area, log(price)))
points(ames_Q4[names(outlier), 'Lot.Area'], log(ames_Q4[names(outlier), 'price']), 
       col = "red", pch = 19, size = 2)

# Plot Year Built vs Log(price) with coloring the outlier red
with(ames_Q4, plot(Year.Built, log(price))) 
points(ames_Q4[names(outlier),'Year.Built'], log(ames_Q4[names(outlier), 'price']),  
       col = "red", pch = 19, size = 2)

# table of Sale conditions for all observations
table(ames_train$Sale.Condition)


Abnorml AdjLand  Alloca  Family  Normal Partial 
     61       2       4      17     834      82

# Sale Condition for the outlier
ames_train[names(outlier),]$Sale.Condition

[1] Abnorml
Levels: Abnorml AdjLand Alloca Family Normal Partial

# Plot Lot Area vs log(price) for abnormal sales
ames_train %>%
        filter(Sale.Condition == "Abnorml") %>%
        ggplot(aes(Lot.Area, log(price))) +
        geom_point() +
        geom_smooth(method = "lm") +
        ggtitle("Plot of Abnormal Sales - Lot Area vs log(price)")

From running quite a few charts, having included only a few here, it was difficult to see any variable that could explain the low value of this home. One explanation could be that, as seen above, that this was an ‘Abnormal’ sale, which is defined as a trade, foreclosure, short sale.

However, even with this listing being an ‘Abnormal’ sale, it does appear to still be an outlier.

Question 6

Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?

# convert the Lot.Area variable to the log
ames_Q6 <- ames_Q4 %>%
        mutate(Lot.Area = log(Lot.Area))

# build a new full model
ames_lm2 <- lm(log(price) ~ ., ames_Q6)

# calculate Adjusted R^2 again for full model and (p-1) models
summary(ames_lm2)$adj.r.squared

[1] 0.6032018

summary(lm(log(price) ~ . -Lot.Area, ames_Q6))$adj.r.squared;

[1] 0.522009

summary(lm(log(price) ~ . -Land.Slope, ames_Q6))$adj.r.squared;

[1] 0.6014831

summary(lm(log(price) ~ . -Year.Built, ames_Q6))$adj.r.squared;

[1] 0.4932282

summary(lm(log(price) ~ . -Year.Remod.Add, ames_Q6))$adj.r.squared;

[1] 0.5363921

summary(lm(log(price) ~ . -Bedroom.AbvGr, ames_Q6))$adj.r.squared

[1] 0.5910888

Yes, we do arrive the same resulting model. Even though the \(Adjusted R^2\) improved in this model, none of the \(p - 1\) models improved the metric, so we keep the full model with all predictors.

Question 7

Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.

par(mfrow = c(1, 1))

# Create histogram for the Lot.Area predictor
hist(ames_train$Lot.Area, main = "Histogram of Lot Area")

par(mfrow = c(1, 2))

# Plot the fitted values of the model vs the True prices
plot(exp(ames_lm$fitted.values), ames_train$price,
     main = "Fitted vs True Prices Lot.Area")

# Plot the fitted values of the log model vs the true prices
plot(exp(ames_lm2$fitted.values), ames_train$price, 
     main = "Fitted vs True Prices w/ log(Lot.Area)")

As seen in the histogram, when you have a severly right skewed distribution, then taking the log of that predictor is likely to be helpful.

As shown in the next set of charts, the log of Lot.Area looks to do a slightly better overall job of predicting price.