First, let us load the data and necessary packages:

load("ames_train.Rdata")
library(MASS); library(kableExtra)
library(dplyr); library(broom)
library(ggplot2)

Question 1

Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.

# Create a histogram with summary statistics
ames_train %>%
        ggplot(aes(Year.Built)) +
        geom_histogram(bins = 30, fill = "lightblue", color = "black") +
        geom_vline(aes(xintercept = mean(Year.Built), 
                   color = "mean"), linetype = "dashed", size = 1) +
        geom_vline(aes(xintercept = median(Year.Built), 
                   color = "median"), linetype = "dashed", size = 1) +
        scale_color_manual(name = "Statistics", values = c(mean = "red", median = "darkgreen")) +
        ggtitle("Histogram of Year Homes were Built in Ames") +
        xlab("Year Built") + ylab("Number of Homes")


Some observations

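To back up the description, a quick numeric summary of Year.Built (an added check, not part of the original write-up) gives the center and spread marked by the mean and median lines in the histogram:

# Numeric summary of the construction years
summary(ames_train$Year.Built)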

Question 2

The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.

library(ggridges)

# Create a ridgeline chart
ames_train %>%
        ggplot(aes(x = price / 1000, y = Neighborhood, fill = Neighborhood)) +
        geom_density_ridges() +
        theme_ridges() +
        theme(legend.position = "none") +
        xlab("Home Price (,000's)") +
        theme(axis.title.y = element_blank())

This ridgeline graph provides a nice visual display of where the most and least expensive homes are in Ames (by Neighborhood), as well as the distribution of home prices within each neighborhood.

From this chart we can see that the least expensive homes are likely in ‘MeadowV’, the most expensive in ‘StoneBr’ or ‘NridgHt’, and the most heterogeneous prices in those same neighborhoods.

Let’s look at the summary statistics to confirm.

# Group by Neighborhood and compute summary statistics for price
ames_hoods <- ames_train %>%
        group_by(Neighborhood) %>%
        summarize(mean = mean(price),
                  median = median(price),
                  st_dev = sd(price),
                  IQR = IQR(price),
                  range = range(price)[2] - range(price)[1])

# Sort by median and show top 5
ames_hoods %>%
        arrange(desc(median)) %>%
        head(5) %>%
        kable() %>%
        kable_styling(full_width = FALSE, bootstrap_options = "striped") %>%
        column_spec(3, bold = TRUE, background = "lightblue")
Neighborhood mean median st_dev IQR range
StoneBr 339316.0 340691.5 123459.10 151358.0 411587
NridgHt 333646.7 336860.0 105088.90 148800.0 442000
NoRidge 295844.6 290000.0 35888.97 50312.5 170000
GrnHill 280000.0 280000.0 70710.68 50000.0 100000
Timber 265192.2 232500.0 84029.57 151200.0 250000
# Sort by median and show bottom 5
ames_hoods %>%
        arrange(median) %>%
        head(5) %>%
        kable() %>%
        kable_styling(full_width = FALSE, bootstrap_options = "striped") %>%
        column_spec(3, bold = TRUE, background = "lightblue")
Neighborhood mean median st_dev IQR range
MeadowV 92946.88 85750 18939.78 20150 56500
BrDale 98930.00 98750 13337.59 16725 42500
IDOTRR 97620.69 99500 31530.44 48900 116009
OldTown 120225.61 120000 36429.69 29550 253190
Blueste 125800.00 123900 10381.23 10250 20500
# Sort by Standard Deviation and show top 5
ames_hoods %>%
        arrange(desc(st_dev)) %>%
        head(5) %>%
        kable() %>%
        kable_styling(full_width = FALSE, bootstrap_options = "striped")  %>%
        column_spec(4, bold = TRUE, background = "lightblue")
Neighborhood mean median st_dev IQR range
StoneBr 339316.0 340691.5 123459.10 151358 411587
NridgHt 333646.7 336860.0 105088.90 148800 442000
Timber 265192.2 232500.0 84029.57 151200 250000
Veenker 233650.0 205750.0 72545.41 68125 235000
Crawfor 204196.6 205000.0 71267.56 80100 296000
# Sort by IQR and show top 5
ames_hoods %>%
        arrange(desc(IQR)) %>%
        head(5) %>%
        kable() %>%
        kable_styling(full_width = FALSE, bootstrap_options = "striped")  %>%
        column_spec(5, bold = TRUE, background = "lightblue")
Neighborhood mean median st_dev IQR range
StoneBr 339316.0 340691.5 123459.10 151358 411587
Timber 265192.2 232500.0 84029.57 151200 250000
NridgHt 333646.7 336860.0 105088.90 148800 442000
Crawfor 204196.6 205000.0 71267.56 80100 296000
ClearCr 193153.8 185000.0 48068.69 76000 169500

Some observations with regard to price by Neighborhood: based on the summary tables above, the most expensive neighborhood (by median price) is StoneBr at $340,691.50, the least expensive is MeadowV at $85,750, and the most heterogeneous is also StoneBr, which has both the largest standard deviation ($123,459.10) and the largest IQR ($151,358). The median is used to rank price because it is robust to the extreme values present in several neighborhoods.

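These answers can also be pulled programmatically from the ames_hoods summary table; a minimal sketch, assuming dplyr >= 1.0 for slice_max()/slice_min():

# Most expensive, least expensive, and most heterogeneous neighborhoods
ames_hoods %>% slice_max(median, n = 1)   # highest median price
ames_hoods %>% slice_min(median, n = 1)   # lowest median price
ames_hoods %>% slice_max(st_dev, n = 1)   # largest standard deviation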

Question 3

Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

# is.na() returns TRUE/FALSE (counted as 1/0) - sum each column of is.na and sort
ames_NAs <- sort(colSums(is.na(ames_train)), decreasing = TRUE)
        
# display the 5 columns with the most NAs
ames_NAs[1:5]  
     Pool.QC Misc.Feature        Alley        Fence Fireplace.Qu 
         997          971          933          798          491 

The column Pool.QC has the largest number of missing values at 997. This makes sense because the variable holds ordinal values for pool quality, and a home with no pool has nothing to rate, so it is recorded as NA in this field.

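A quick cross-check (an added sanity check, not part of the original write-up) is to compare the missing Pool.QC values against Pool.Area: every home with a missing Pool.QC should report a pool area of zero.

# Cross-tabulate pool area against missing Pool.QC
with(ames_train, table(has_pool = Pool.Area > 0, pool_qc_missing = is.na(Pool.QC)))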

Question 4

We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

# subset the data set to the candidate explanatory variables
ames_Q4 <- ames_train %>%
        dplyr::select(price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)

# build a linear regression model for log(price) vs. all the selected variables
ames_lm <- lm(log(price) ~ ., ames_Q4)

# Calculate the adjusted R^2 for the full model
summary(ames_lm)$adj.r.squared
[1] 0.5598345

We have built our base linear model using all of the candidate explanatory variables to predict the natural log of home prices. Next, we will use backward elimination with adjusted \(R^2\) as our criterion: we remove each variable individually from the model and recalculate adjusted \(R^2\) each time. If we see an increase in this metric, we accept the smaller model; if none of the smaller models shows an increase, we keep the full model.

# Calculate the adjusted R^2 for all (p - 1) models
summary(lm(log(price) ~ . -Lot.Area, ames_Q4))$adj.r.squared;
[1] 0.522009
summary(lm(log(price) ~ . -Land.Slope, ames_Q4))$adj.r.squared;
[1] 0.5526062
summary(lm(log(price) ~ . -Year.Built, ames_Q4))$adj.r.squared;
[1] 0.4473663
summary(lm(log(price) ~ . -Year.Remod.Add, ames_Q4))$adj.r.squared;
[1] 0.4922384
summary(lm(log(price) ~ . -Bedroom.AbvGr, ames_Q4))$adj.r.squared
[1] 0.5314859

As shown in the above calculations of adjusted \(R^2\) for each of the \(p - 1\) models, none exhibits an increase in the chosen metric, so we accept the full model with all the candidate explanatory variables.

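The same drop-one comparison can be automated; here is a minimal sketch (an illustration, not part of the original analysis) that uses update() to refit the model without each predictor in turn:

# Refit the full model dropping one predictor at a time and collect adjusted R^2
full_fit <- lm(log(price) ~ ., data = ames_Q4)
predictors <- attr(terms(full_fit), "term.labels")
drop_one_adj_r2 <- sapply(predictors, function(p) {
        summary(update(full_fit, as.formula(paste(". ~ . -", p))))$adj.r.squared
})
sort(drop_one_adj_r2, decreasing = TRUE)   # none exceeds the full model's 0.560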

Question 5

Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?

par(mfrow = c(1, 1))

# Display the residuals plot
plot(ames_lm, which = 1)

# Identify the observation with the largest squared residual
outlier <- sort(ames_lm$residuals^2, decreasing = TRUE)[1]
outlier
    428 
4.35913 
# Display the observation to see if any data looks noteworthy
ames_train[names(outlier), ] %>%
        kable() %>%
        kable_styling() %>%
        scroll_box(width = "100%")
PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
902207130 832 12789 30 RM 68 9656 Pave NA Reg Lvl AllPub Inside Gtl OldTown Norm Norm 1Fam 1Story 2 2 1923 1970 Gable CompShg AsbShng AsbShng None 0 TA Fa BrkTil Fa Fa No Unf 0 Unf 0 678 678 GasA TA N SBrkr 832 0 0 0 0 1 0 2 1 TA 5 Typ 1 Gd Detchd 1928 Unf 2 780 Fa Fa N 0 0 0 0 0 0 NA NA NA 0 6 2010 WD Abnorml

We have identified the observation with the largest squared residual; now let’s run some visualizations to understand it better.

par(mfrow = c(1, 2))

# Plot Lot Area vs Log(price) with coloring the outlier red
with(ames_Q4, plot(Lot.Area, log(price)))
points(ames_Q4[names(outlier), 'Lot.Area'], log(ames_Q4[names(outlier), 'price']), 
       col = "red", pch = 19, size = 2)

# Plot Year Built vs Log(price) with coloring the outlier red
with(ames_Q4, plot(Year.Built, log(price))) 
points(ames_Q4[names(outlier),'Year.Built'], log(ames_Q4[names(outlier), 'price']),  
       col = "red", pch = 19, size = 2)

# table of Sale conditions for all observations
table(ames_train$Sale.Condition)  

Abnorml AdjLand  Alloca  Family  Normal Partial 
     61       2       4      17     834      82 
# Sale Condition for the outlier
ames_train[names(outlier),]$Sale.Condition  
[1] Abnorml
Levels: Abnorml AdjLand Alloca Family Normal Partial
# Plot Lot Area vs log(price) for abnormal sales
ames_train %>%
        filter(Sale.Condition == "Abnorml") %>%
        ggplot(aes(Lot.Area, log(price))) +
        geom_point() +
        geom_smooth(method = "lm") +
        ggtitle("Plot of Abnormal Sales - Lot Area vs log(price)")


From running quite a few charts (only a few are included here), it was difficult to find any single variable that explains the low value of this home. One explanation could be that, as seen above, this was an ‘Abnormal’ sale, which is defined as a trade, foreclosure, or short sale.

However, even among ‘Abnormal’ sales, this listing still appears to be an outlier.

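As an additional hedged check (not part of the original analysis), the standardized residuals from the Question 4 model show how far this observation sits from the rest of the data:

# Largest standardized residuals in absolute value
std_res <- rstandard(ames_lm)
head(sort(abs(std_res), decreasing = TRUE), 3)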

Question 6

Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?

# convert the Lot.Area variable to the log
ames_Q6 <- ames_Q4 %>%
        mutate(Lot.Area = log(Lot.Area))

# build a new full model
ames_lm2 <- lm(log(price) ~ ., ames_Q6)

# calculate Adjusted R^2 again for full model and (p-1) models
summary(ames_lm2)$adj.r.squared
[1] 0.6032018
summary(lm(log(price) ~ . -Lot.Area, ames_Q6))$adj.r.squared;
[1] 0.522009
summary(lm(log(price) ~ . -Land.Slope, ames_Q6))$adj.r.squared;
[1] 0.6014831
summary(lm(log(price) ~ . -Year.Built, ames_Q6))$adj.r.squared;
[1] 0.4932282
summary(lm(log(price) ~ . -Year.Remod.Add, ames_Q6))$adj.r.squared;
[1] 0.5363921
summary(lm(log(price) ~ . -Bedroom.AbvGr, ames_Q6))$adj.r.squared
[1] 0.5910888

Yes, we arrive at the same resulting model. Even though the adjusted \(R^2\) of the full model improved with log(Lot.Area), none of the \(p - 1\) models improves the metric, so we again keep the full model with all predictors.


Question 7

Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.

par(mfrow = c(1, 1))

# Create histogram for the Lot.Area predictor
hist(ames_train$Lot.Area, main = "Histogram of Lot Area")

par(mfrow = c(1, 2))

# Plot the fitted values of the model vs the True prices
plot(exp(ames_lm$fitted.values), ames_train$price,
     main = "Fitted vs True Prices Lot.Area")

# Plot the fitted values of the log model vs the true prices
plot(exp(ames_lm2$fitted.values), ames_train$price, 
     main = "Fitted vs True Prices w/ log(Lot.Area)")


As seen in the histogram, Lot.Area has a severely right-skewed distribution, and taking the log of such a predictor is likely to help satisfy the linearity and residual assumptions of linear regression.
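
To quantify the skewness claim, a simple moment-based estimate (an illustrative sketch, not part of the original write-up) can be computed before and after the transform:

# Moment-based skewness of Lot.Area before and after the log transform
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw = skew(ames_train$Lot.Area), logged = skew(log(ames_train$Lot.Area)))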

As shown in the second set of charts (fitted vs. true prices), the model with log(Lot.Area) appears to do a slightly better overall job of predicting price. Quantitatively, its adjusted \(R^2\) is also higher (0.603 vs. 0.560 for the model with the untransformed Lot.Area), so log-transforming Lot.Area does seem preferable.
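
As a further quantitative comparison (an added sketch, not part of the original write-up, comparing in-sample fit on the log scale), the residual root-mean-square error of the two models can be computed directly:

# In-sample RMSE on the log(price) scale for each model
rmse_lot     <- sqrt(mean(residuals(ames_lm)^2))    # model with Lot.Area
rmse_log_lot <- sqrt(mean(residuals(ames_lm2)^2))   # model with log(Lot.Area)
c(Lot.Area = rmse_lot, log.Lot.Area = rmse_log_lot)

A lower RMSE for the log(Lot.Area) model would reinforce the adjusted \(R^2\) comparison above.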