First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS); library(kableExtra)
library(dplyr); library(broom)
library(ggplot2)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# Create a histogram with summary statistics
ames_train %>%
ggplot(aes(Year.Built)) +
geom_histogram(bins =30, fill = "lightblue", color = "black") +
geom_vline(aes(xintercept = mean(Year.Built),
color = "mean"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = median(Year.Built),
color = "median"), linetype = "dashed", size = 1) +
scale_color_manual(name = "Statistics", values = c(mean = "red", median = "darkgreen")) +
ggtitle("Histogram of Year Homes were Built in Ames") +
xlab("Year Built") + ylab("Number of Homes")
Some observations
The distribution is bi-model, with a peak of newly built homes.
The distribution is left skewed.
The mean of the distribution is less than the median.
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
library(ggridges)
# Create a ridgeline chart
ames_train %>%
ggplot(aes(x = price / 1000, y = Neighborhood, fill = Neighborhood)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none") +
xlab("Home Price (,000's)") +
theme(axis.title.y = element_blank())
This ridgeline graph will provide a nice visualize display of where the most and least expensive homes are in Ames (by Neighborhood), as well as a distirbution of the home prices.
From this chart we can see that the least expensive homes are likely in ‘MeadowV’, the most expensive in ‘StoneBr’ or ‘NridegHt’ and the most heterogeneous in those same neighborhoods.
Let’s look at summary statistics to confirm
# Create a data set grouping by Neighborhood and creating summary statistics
ames_hoods <- ames_train %>%
group_by(Neighborhood) %>%
summarize(mean = mean(price),
median = median(price),
st_dev = sd(price),
IQR = IQR(price),
range = range(price)[2] - range(price)[1])
# Sort by median and show top 5
ames_hoods %>%
arrange(desc(median)) %>%
head(5) %>%
kable() %>%
kable_styling(full_width = FALSE, bootstrap_options = "striped") %>%
column_spec(3, bold = TRUE, background = "lightblue")
| Neighborhood | mean | median | st_dev | IQR | range |
|---|---|---|---|---|---|
| StoneBr | 339316.0 | 340691.5 | 123459.10 | 151358.0 | 411587 |
| NridgHt | 333646.7 | 336860.0 | 105088.90 | 148800.0 | 442000 |
| NoRidge | 295844.6 | 290000.0 | 35888.97 | 50312.5 | 170000 |
| GrnHill | 280000.0 | 280000.0 | 70710.68 | 50000.0 | 100000 |
| Timber | 265192.2 | 232500.0 | 84029.57 | 151200.0 | 250000 |
# Sort by median and show bottom 5
ames_hoods %>%
arrange(median) %>%
head(5) %>%
kable() %>%
kable_styling(full_width = FALSE, bootstrap_options = "striped") %>%
column_spec(3, bold = TRUE, background = "lightblue")
| Neighborhood | mean | median | st_dev | IQR | range |
|---|---|---|---|---|---|
| MeadowV | 92946.88 | 85750 | 18939.78 | 20150 | 56500 |
| BrDale | 98930.00 | 98750 | 13337.59 | 16725 | 42500 |
| IDOTRR | 97620.69 | 99500 | 31530.44 | 48900 | 116009 |
| OldTown | 120225.61 | 120000 | 36429.69 | 29550 | 253190 |
| Blueste | 125800.00 | 123900 | 10381.23 | 10250 | 20500 |
# Sort by Standard Deviation and show top 5
ames_hoods %>%
arrange(desc(st_dev)) %>%
head(5) %>%
kable() %>%
kable_styling(full_width = FALSE, bootstrap_options = "striped") %>%
column_spec(4, bold = TRUE, background = "lightblue")
| Neighborhood | mean | median | st_dev | IQR | range |
|---|---|---|---|---|---|
| StoneBr | 339316.0 | 340691.5 | 123459.10 | 151358 | 411587 |
| NridgHt | 333646.7 | 336860.0 | 105088.90 | 148800 | 442000 |
| Timber | 265192.2 | 232500.0 | 84029.57 | 151200 | 250000 |
| Veenker | 233650.0 | 205750.0 | 72545.41 | 68125 | 235000 |
| Crawfor | 204196.6 | 205000.0 | 71267.56 | 80100 | 296000 |
# Sort by IQR and show top 5
ames_hoods %>%
arrange(desc(IQR)) %>%
head(5) %>%
kable() %>%
kable_styling(full_width = FALSE, bootstrap_options = "striped") %>%
column_spec(5, bold = TRUE, background = "lightblue")
| Neighborhood | mean | median | st_dev | IQR | range |
|---|---|---|---|---|---|
| StoneBr | 339316.0 | 340691.5 | 123459.10 | 151358 | 411587 |
| Timber | 265192.2 | 232500.0 | 84029.57 | 151200 | 250000 |
| NridgHt | 333646.7 | 336860.0 | 105088.90 | 148800 | 442000 |
| Crawfor | 204196.6 | 205000.0 | 71267.56 | 80100 | 296000 |
| ClearCr | 193153.8 | 185000.0 | 48068.69 | 76000 | 169500 |
Some observations with regards to price by Neighborhood:
With regards to most and least expensive, it appears that we will arrive at the same conclusion whether we utilize the mean or the median of the distribution. However, when trying to determine heterogeneity, our use of metrics does matter as Standard Deviation, IQR, and Range will arrive at different results. For our purposes, we will use the more robust statistics of Median and IQR.
Most expensive neighborhoods: ‘StoneBr’, ‘NridgHt’ and ‘NoRidge’.
Least expensive neighborhoods: ‘MeadowV’, ‘BrDale’, and ‘IDOTRR’.
Most heterogeneous neighborhoods: ‘StoneBr’, ‘Timber’ and ‘NridgeHt’.
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# is.na provides 1/0 for True/False - sum each column of is.na and sort
ames_NAs <- sort(colSums(is.na(ames_train)), decreasing = TRUE)
# display the 5 most columns with NAs
ames_NAs[1:5]
Pool.QC Misc.Feature Alley Fence Fireplace.Qu
997 971 933 798 491
The column Pool.QC has the largest number of missing values at 997. This makes sense as this variable holds ordinal values for Pool Quality. However, if the home has ‘No Pool’ it has an NA in this field.
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
# subseet the data set based on candidate explanatory variables
ames_Q4 <- ames_train %>%
dplyr::select(price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
# build an Linear Regression model for the log(price) vs all the variables selected
ames_lm <- lm(log(price) ~ ., ames_Q4)
# Calculate the adjusted R^2 for the full model
summary(ames_lm)$adj.r.squared
[1] 0.5598345
We have built our base linear model using the candidate explanatory variables to predict the natural log of home prices. Next, we will use backward selection utilizing \(Adjusted R^2\) as our metric. We will remove each variable individually from the model and recalculate \(Adjusted R^2\) each time. If we see an increase in this metric, we will accept the smaller model. If none of the smaller models show an increase in this metric, then we will keep the full model.
# Calculate the adjusted R^2 for all (p - 1) models
summary(lm(log(price) ~ . -Lot.Area, ames_Q4))$adj.r.squared;
[1] 0.522009
summary(lm(log(price) ~ . -Land.Slope, ames_Q4))$adj.r.squared;
[1] 0.5526062
summary(lm(log(price) ~ . -Year.Built, ames_Q4))$adj.r.squared;
[1] 0.4473663
summary(lm(log(price) ~ . -Year.Remod.Add, ames_Q4))$adj.r.squared;
[1] 0.4922384
summary(lm(log(price) ~ . -Bedroom.AbvGr, ames_Q4))$adj.r.squared
[1] 0.5314859
As shown, in the above calculation of the \(Adjusted R^2\) for each of the \(p - 1\) models, none exhibits an increase in the metric chosen. We will accept the full model with all the condidate explanatory variables.
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
par(mfrow = c(1, 1))
# Display the residuals plot
plot(ames_lm, which = 1)
# Carve out the observation that is an outlier
outlier <- sort(ames_lm$residuals^2, decreasing = TRUE)[1]
outlier
428
4.35913
# Display the observation to see if any data looks noteworthy
ames_train[names(outlier), ] %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%")
| PID | area | price | MS.SubClass | MS.Zoning | Lot.Frontage | Lot.Area | Street | Alley | Lot.Shape | Land.Contour | Utilities | Lot.Config | Land.Slope | Neighborhood | Condition.1 | Condition.2 | Bldg.Type | House.Style | Overall.Qual | Overall.Cond | Year.Built | Year.Remod.Add | Roof.Style | Roof.Matl | Exterior.1st | Exterior.2nd | Mas.Vnr.Type | Mas.Vnr.Area | Exter.Qual | Exter.Cond | Foundation | Bsmt.Qual | Bsmt.Cond | Bsmt.Exposure | BsmtFin.Type.1 | BsmtFin.SF.1 | BsmtFin.Type.2 | BsmtFin.SF.2 | Bsmt.Unf.SF | Total.Bsmt.SF | Heating | Heating.QC | Central.Air | Electrical | X1st.Flr.SF | X2nd.Flr.SF | Low.Qual.Fin.SF | Bsmt.Full.Bath | Bsmt.Half.Bath | Full.Bath | Half.Bath | Bedroom.AbvGr | Kitchen.AbvGr | Kitchen.Qual | TotRms.AbvGrd | Functional | Fireplaces | Fireplace.Qu | Garage.Type | Garage.Yr.Blt | Garage.Finish | Garage.Cars | Garage.Area | Garage.Qual | Garage.Cond | Paved.Drive | Wood.Deck.SF | Open.Porch.SF | Enclosed.Porch | X3Ssn.Porch | Screen.Porch | Pool.Area | Pool.QC | Fence | Misc.Feature | Misc.Val | Mo.Sold | Yr.Sold | Sale.Type | Sale.Condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 902207130 | 832 | 12789 | 30 | RM | 68 | 9656 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Norm | Norm | 1Fam | 1Story | 2 | 2 | 1923 | 1970 | Gable | CompShg | AsbShng | AsbShng | None | 0 | TA | Fa | BrkTil | Fa | Fa | No | Unf | 0 | Unf | 0 | 678 | 678 | GasA | TA | N | SBrkr | 832 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 1 | Gd | Detchd | 1928 | Unf | 2 | 780 | Fa | Fa | N | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 6 | 2010 | WD | Abnorml |
We have identified the largest outlier, now let’s run some visualizations to understand this observation better.
par(mfrow = c(1, 2))
# Plot Lot Area vs Log(price) with coloring the outlier red
with(ames_Q4, plot(Lot.Area, log(price)))
points(ames_Q4[names(outlier), 'Lot.Area'], log(ames_Q4[names(outlier), 'price']),
col = "red", pch = 19, size = 2)
# Plot Year Built vs Log(price) with coloring the outlier red
with(ames_Q4, plot(Year.Built, log(price)))
points(ames_Q4[names(outlier),'Year.Built'], log(ames_Q4[names(outlier), 'price']),
col = "red", pch = 19, size = 2)
# table of Sale conditions for all observations
table(ames_train$Sale.Condition)
Abnorml AdjLand Alloca Family Normal Partial
61 2 4 17 834 82
# Sale Condition for the outlier
ames_train[names(outlier),]$Sale.Condition
[1] Abnorml
Levels: Abnorml AdjLand Alloca Family Normal Partial
# Plot Lot Area vs log(price) for abnormal sales
ames_train %>%
filter(Sale.Condition == "Abnorml") %>%
ggplot(aes(Lot.Area, log(price))) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("Plot of Abnormal Sales - Lot Area vs log(price)")
From running quite a few charts, having included only a few here, it was difficult to see any variable that could explain the low value of this home. One explanation could be that, as seen above, that this was an ‘Abnormal’ sale, which is defined as a trade, foreclosure, short sale.
However, even with this listing being an ‘Abnormal’ sale, it does appear to still be an outlier.
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# convert the Lot.Area variable to the log
ames_Q6 <- ames_Q4 %>%
mutate(Lot.Area = log(Lot.Area))
# build a new full model
ames_lm2 <- lm(log(price) ~ ., ames_Q6)
# calculate Adjusted R^2 again for full model and (p-1) models
summary(ames_lm2)$adj.r.squared
[1] 0.6032018
summary(lm(log(price) ~ . -Lot.Area, ames_Q6))$adj.r.squared;
[1] 0.522009
summary(lm(log(price) ~ . -Land.Slope, ames_Q6))$adj.r.squared;
[1] 0.6014831
summary(lm(log(price) ~ . -Year.Built, ames_Q6))$adj.r.squared;
[1] 0.4932282
summary(lm(log(price) ~ . -Year.Remod.Add, ames_Q6))$adj.r.squared;
[1] 0.5363921
summary(lm(log(price) ~ . -Bedroom.AbvGr, ames_Q6))$adj.r.squared
[1] 0.5910888
Yes, we do arrive the same resulting model. Even though the \(Adjusted R^2\) improved in this model, none of the \(p - 1\) models improved the metric, so we keep the full model with all predictors.
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
par(mfrow = c(1, 1))
# Create histogram for the Lot.Area predictor
hist(ames_train$Lot.Area, main = "Histogram of Lot Area")
par(mfrow = c(1, 2))
# Plot the fitted values of the model vs the True prices
plot(exp(ames_lm$fitted.values), ames_train$price,
main = "Fitted vs True Prices Lot.Area")
# Plot the fitted values of the log model vs the true prices
plot(exp(ames_lm2$fitted.values), ames_train$price,
main = "Fitted vs True Prices w/ log(Lot.Area)")
As seen in the histogram, when you have a severly right skewed distribution, then taking the log of that predictor is likely to be helpful.
As shown in the next set of charts, the log of Lot.Area looks to do a slightly better overall job of predicting price.