First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(plotly)
library(devtools)
library(statsr)
library(broom)
library(BAS)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# type your code for Question 1 here, and Knit
# Margin settings reused by the plotly layouts below
m <- list(l = 50, r = 50, b = 100, t = 50, pad = 4)
# House age in years, from the construction year to the current year
ames_train$Age <- as.numeric(format(Sys.Date(), "%Y")) - ames_train$Year.Built
mean(ames_train$Age)
## [1] 48.797
median(ames_train$Age)
## [1] 46
Age_density <- density(ames_train$Age)
fig <- plot_ly(data = ames_train, x = ~Age, type = "histogram",
               nbinsx = 30, name = "Histogram") %>%
  add_trace(x = Age_density$x, y = Age_density$y, type = "scatter",
            mode = "lines", fill = "tozeroy", yaxis = "y2", name = "Density") %>%
  layout(autosize = FALSE, width = 600, height = 400, margin = m,
         title = "Histogram and Density Plot of 'Age'",
         xaxis = list(title = "House Age (years)"),
         yaxis = list(title = "Number of Observations"),
         yaxis2 = list(overlaying = "y", side = "right", title = "Density"),
         legend = list(x = 0.7, y = 0.8))
fig
The distribution of Age is right-skewed and multimodal, with a mean of 48.8 years and a median of 46 years; the mean exceeding the median is consistent with the right skew.
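A single number can back up the visual impression of skew. This is a minimal sketch (not part of the assignment output) using the simple moment estimator of skewness:
# Hedged sketch: quantify the right skew numerically
age <- ames_train$Age
mean((age - mean(age))^3) / sd(age)^3  # a positive value indicates right skew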
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# type your code for Question 2 here, and Knit
fig <- plot_ly(data = ames_train, x = ~Neighborhood, y = ~price, type = "box") %>%
  layout(title = "Boxplot of Home Prices by Neighborhood",
         xaxis = list(title = "Neighborhood"),
         yaxis = list(title = "Home Price ($)"))
fig
# Summarise price by neighborhood: mean, median, and standard deviation
neighborhood_summary <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(mean_price = mean(price),
            median_price = median(price),
            std_dev_price = sd(price))
The most expensive and least expensive neighborhoods are best identified by the mean price: StoneBr has the highest mean price ($339,316) and MeadowV the lowest ($92,947). The most heterogeneous neighborhood (having the most variation in housing price) is the one with the highest standard deviation of price, which is again StoneBr ($123,459). These can also be extracted programmatically, as the sketch below shows.
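This is a small sketch, assuming the neighborhood_summary tibble built above and dplyr 1.0+ for slice_max()/slice_min():
# Hedged sketch: pull the neighborhoods behind the reported statistics
neighborhood_summary %>% slice_max(mean_price, n = 1)     # most expensive (highest mean)
neighborhood_summary %>% slice_min(mean_price, n = 1)     # least expensive (lowest mean)
neighborhood_summary %>% slice_max(std_dev_price, n = 1)  # most heterogeneous (highest SD)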
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# type your code for Question 3 here, and Knit
# Count the missing values in every column
na_col <- sapply(ames_train, function(x) sum(is.na(x)))
na_col
## PID area price MS.SubClass MS.Zoning
## 0 0 0 0 0
## Lot.Frontage Lot.Area Street Alley Lot.Shape
## 167 0 0 933 0
## Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 0 0 0 0 0
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual
## 0 0 0 0 0
## Overall.Cond Year.Built Year.Remod.Add Roof.Style Roof.Matl
## 0 0 0 0 0
## Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area Exter.Qual
## 0 0 0 7 0
## Exter.Cond Foundation Bsmt.Qual Bsmt.Cond Bsmt.Exposure
## 0 0 21 21 21
## BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF
## 21 1 21 1 1
## Total.Bsmt.SF Heating Heating.QC Central.Air Electrical
## 1 0 0 0 0
## X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Bsmt.Full.Bath Bsmt.Half.Bath
## 0 0 0 1 1
## Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
## 0 0 0 0 0
## TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type
## 0 0 0 491 46
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area Garage.Qual
## 48 46 1 1 47
## Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF Enclosed.Porch
## 47 0 0 0 0
## X3Ssn.Porch Screen.Porch Pool.Area Pool.QC Fence
## 0 0 0 997 798
## Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type
## 971 0 0 0 0
## Sale.Condition Age
## 0 0
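Rather than scanning the table by eye, the worst column can be pulled out directly; a one-line sketch using the na_col vector computed above:
# Hedged sketch: the column with the most missing values
sort(na_col, decreasing = TRUE)[1]  # Pool.QC, with 997 NAs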
The variable with the largest number of missing values is Pool.QC (997 NAs). Per the dataset documentation, Pool.QC measures pool quality, and an NA in this column indicates that the property has no pool. The high NA count therefore simply reflects that the vast majority of houses in the dataset do not have pools.
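This interpretation can be cross-checked against Pool.Area. The sketch below assumes, per the codebook, that Pool.Area is recorded as 0 square feet when a house has no pool:
# Hedged sketch: if NA in Pool.QC means "no pool", these counts should agree
sum(is.na(ames_train$Pool.QC))   # houses with missing pool quality
sum(ames_train$Pool.Area == 0)   # houses with zero pool area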
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
# type your code for Question 4 here, and Knit
# Subset the `ames_train` dataset to just the variables of interest
q4_df <- ames_train %>%
  mutate(lprice = log(price)) %>%
  select(lprice, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_lprice <- bas.lm(lprice ~ ., data = na.omit(q4_df),
prior = "BIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_lprice
##
## Call:
## bas.lm(formula = lprice ~ ., data = na.omit(q4_df), prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept Lot.Area Land.SlopeMod Land.SlopeSev Year.Built
## 1.0000 1.0000 0.6324 0.7923 1.0000
## Year.Remod.Add Bedroom.AbvGr
## 1.0000 1.0000
# Top 5 most probable models
summary(bma_lprice)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.000000
## Lot.Area 1.0000000 1.0000 1.0000000 1.000000
## Land.SlopeMod 0.6323994 1.0000 0.0000000 1.000000
## Land.SlopeSev 0.7923033 1.0000 1.0000000 0.000000
## Year.Built 1.0000000 1.0000 1.0000000 1.000000
## Year.Remod.Add 1.0000000 1.0000 1.0000000 1.000000
## Bedroom.AbvGr 1.0000000 1.0000 1.0000000 1.000000
## BF NA 1.0000 0.6674353 0.330911
## PostProbs NA 0.4752 0.3171000 0.157200
## R2 NA 0.5625 0.5591000 0.558500
## dim NA 7.0000 6.0000000 6.000000
## logmarg NA -2198.1673 -2198.5716571 -2199.273250
## model 4 model 5
## Intercept 1.0000000 1.000000e+00
## Lot.Area 1.0000000 1.000000e+00
## Land.SlopeMod 0.0000000 0.000000e+00
## Land.SlopeSev 0.0000000 1.000000e+00
## Year.Built 1.0000000 1.000000e+00
## Year.Remod.Add 1.0000000 1.000000e+00
## Bedroom.AbvGr 1.0000000 0.000000e+00
## BF 0.1061955 1.578929e-12
## PostProbs 0.0505000 0.000000e+00
## R2 0.5544000 5.316000e-01
## dim 5.0000000 5.000000e+00
## logmarg -2200.4098178 -2.225342e+03
# Obtain the coefficients from the model `bma_lprice`
coef_lprice <- coefficients(bma_lprice)
coef_lprice
##
## Marginal Posterior Summaries of Coefficients:
##
## Using BMA
##
## Based on the top 64 models
## post mean post SD post p(B != 0)
## Intercept 1.202e+01 8.843e-03 1.000e+00
## Lot.Area 1.003e-05 1.346e-06 1.000e+00
## Land.SlopeMod 8.955e-02 7.910e-02 6.324e-01
## Land.SlopeSev -3.732e-01 2.343e-01 7.923e-01
## Year.Built 6.040e-03 3.797e-04 1.000e+00
## Year.Remod.Add 6.803e-03 5.489e-04 1.000e+00
## Bedroom.AbvGr 8.658e-02 1.087e-02 1.000e+00
# Plot the probability distributions of the model coefficients
plot(coef_lprice, ask = FALSE)
We use Bayesian model averaging (BMA) from the BAS package. BMA enumerates the candidate models (here, every subset of the five predictors), scores each by its posterior probability (using the BIC prior on coefficients and a uniform prior over models), and averages coefficient estimates across models, weighted by those posterior probabilities. Model 1, which incorporates all of the variables under consideration, yields the highest posterior probability (0.4752).
We have also plotted the posterior distributions of all the model coefficients produced by this model averaging approach.
The fitted linear model is therefore:
\[\begin{aligned} \log(\text{price}) ={} & 12.02 + (1.0\times10^{-5} \times \text{Lot.Area}) + (0.090 \times \text{Land.SlopeMod}) - (0.373 \times \text{Land.SlopeSev}) \\ & + (0.006 \times \text{Year.Built}) + (0.007 \times \text{Year.Remod.Add}) + (0.087 \times \text{Bedroom.AbvGr}) \end{aligned}\]
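To show how the averaged model is actually used for prediction, here is a sketch with a hypothetical house (all predictor values below are made up for illustration); predict() on the bas.lm fit returns BMA predictions that can be back-transformed to dollars:
# Hedged sketch: BMA prediction for a hypothetical house
new_house <- data.frame(Lot.Area = 10000,
                        Land.Slope = factor("Gtl", levels = c("Gtl", "Mod", "Sev")),
                        Year.Built = 1980,
                        Year.Remod.Add = 1995,
                        Bedroom.AbvGr = 3)
pred <- predict(bma_lprice, newdata = new_house, estimator = "BMA")
exp(pred$fit)  # back-transform log(price) to dollars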
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# type your code for Question 5 here, and Knit
# Refit the full model with lm() to make residual extraction straightforward
lm_q5 <- lm(lprice ~ ., data = na.omit(q4_df))
summary(lm_q5)
##
## Call:
## lm(formula = lprice ~ ., data = na.omit(q4_df))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
# Calculate the square of the residuals
q4_df$sq_res <- residuals(lm_q5)^2
result <- q4_df %>% filter(sq_res==max(sq_res))
result
## # A tibble: 1 x 7
## lprice Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr sq_res
## <dbl> <int> <fct> <int> <int> <int> <dbl>
## 1 9.46 9656 Gtl 1923 1970 2 4.36
Looking at the other features of the house with the largest squared residual, we see that it is quite old (built in 1923) and has only two bedrooms. Chances are that this house is in a historic district, where location plays a much larger role in the home price than the structure itself.
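To dig further into this property, the row can be traced back to the full dataset. This sketch relies on the fact that none of the Question 4 columns contain NAs, so na.omit() dropped no rows and the lm() row indices line up with ames_train:
# Hedged sketch: inspect the full record for the worst-fit house
idx <- which.max(residuals(lm_q5)^2)
ames_train[idx, c("PID", "price", "Year.Built", "Bedroom.AbvGr",
                  "Overall.Cond", "Sale.Condition")]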
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# type your code for Question 6 here, and Knit
# Subset the `ames_train` dataset to just the variables of interest
q6_df <- ames_train %>%
  mutate(lprice = log(price), l_lot.area = log(Lot.Area)) %>%
  select(lprice, l_lot.area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_lprice <- bas.lm(lprice ~ ., data = na.omit(q6_df),
prior = "BIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_lprice
##
## Call:
## bas.lm(formula = lprice ~ ., data = na.omit(q6_df), prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept l_lot.area Land.SlopeMod Land.SlopeSev Year.Built
## 1.00000 1.00000 0.39326 0.03645 1.00000
## Year.Remod.Add Bedroom.AbvGr
## 1.00000 0.99999
# Top 5 most probable models
summary(bma_lprice)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.000000e+00
## l_lot.area 1.00000000 1.0000 1.0000000 1.000000e+00
## Land.SlopeMod 0.39325522 0.0000 1.0000000 0.000000e+00
## Land.SlopeSev 0.03645331 0.0000 0.0000000 1.000000e+00
## Year.Built 1.00000000 1.0000 1.0000000 1.000000e+00
## Year.Remod.Add 1.00000000 1.0000 1.0000000 1.000000e+00
## Bedroom.AbvGr 0.99998941 1.0000 1.0000000 1.000000e+00
## BF NA 1.0000 0.6494644 3.866294e-02
## PostProbs NA 0.5842 0.3794000 2.260000e-02
## R2 NA 0.6031 0.6055000 6.032000e-01
## dim NA 5.0000 6.0000000 6.000000e+00
## logmarg NA -2142.5652 -2142.9968106 -2.145818e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## l_lot.area 1.000000e+00 1.000000e+00
## Land.SlopeMod 1.000000e+00 0.000000e+00
## Land.SlopeSev 1.000000e+00 0.000000e+00
## Year.Built 1.000000e+00 1.000000e+00
## Year.Remod.Add 1.000000e+00 1.000000e+00
## Bedroom.AbvGr 1.000000e+00 0.000000e+00
## BF 2.374021e-02 1.412793e-05
## PostProbs 1.390000e-02 0.000000e+00
## R2 6.056000e-01 5.913000e-01
## dim 7.000000e+00 4.000000e+00
## logmarg -2.146306e+03 -2.153733e+03
# Obtain the coefficients from the model `bma_lprice`
coef_lprice <- coefficients(bma_lprice)
coef_lprice
##
## Marginal Posterior Summaries of Coefficients:
##
## Using BMA
##
## Based on the top 64 models
## post mean post SD post p(B != 0)
## Intercept 12.0184706 0.0083882 1.0000000
## l_lot.area 0.2452196 0.0167729 1.0000000
## Land.SlopeMod 0.0456379 0.0639766 0.3932552
## Land.SlopeSev -0.0026562 0.0270720 0.0364533
## Year.Built 0.0059700 0.0003602 1.0000000
## Year.Remod.Add 0.0067566 0.0005193 1.0000000
## Bedroom.AbvGr 0.0580499 0.0105887 0.9999894
# Plot the probability distributions of the model coefficients
plot(coef_lprice, ask = FALSE)
No, we do not arrive at the same set of predictors. With log(Lot.Area) in place of Lot.Area, the most probable model (posterior probability 0.5842) drops both Land.Slope indicator variables, whereas the best model in Question 4 retained them.
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide written support, including a quantitative justification, for your answer to the first part of this question.
# type your code for Question 7
# Linear model for Lot.Area
bma_lprice1 <- bas.lm(lprice ~ ., data = na.omit(q4_df),
prior = "BIC",
modelprior = uniform())
predictions1<- predict(bma_lprice1,newdata=q4_df,
estimator="BMA",
type="response")
pred_values1 <- as.numeric(unlist(predictions1["fit"]))
# Linear model for log(Lot.Area)
bma_lprice2 <- bas.lm(lprice ~ ., data = na.omit(q6_df),
prior = "BIC",
modelprior = uniform())
predictions2 <- predict(bma_lprice2,newdata=q6_df,
estimator="BMA",
type="response")
pred_values2 <- as.numeric(unlist(predictions2["fit"]))
# Plot predicted vs. true values of log(price) using `Lot.Area` model...
fig1 <- plot_ly(x = ~q4_df$lprice, y = ~pred_values1, type = "scatter", mode = "markers") %>%
  layout(title = "Predicted vs. True Values of Log(price) --- Lot.Area Model",
         xaxis = list(title = "True Value - Log(Price)"),
         yaxis = list(title = "Predicted Value - Log(Price)"))
fig1
# Plot predicted vs. true values of log(price) using `Log(Lot.Area)` model...
fig2 <- plot_ly(x = ~q4_df$lprice, y = ~pred_values2, type = "scatter", mode = "markers") %>%
  layout(title = "Predicted vs. True Values of Log(price) --- Log(Lot.Area) Model",
         xaxis = list(title = "True Value - Log(Price)"),
         yaxis = list(title = "Predicted Value - Log(Price)"))
fig2
After inspecting the two plots above, I would argue that the Lot.Area model provides better results than the log(Lot.Area) model: its points are more tightly clustered, indicating a closer linear relationship between the predicted and true values of log(price), whereas the points in the log(Lot.Area) plot are more spread out, showing greater discrepancy between the true and predicted values of log(price).
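As a quantitative check, the in-sample root-mean-squared error (RMSE) of each model's predictions can be compared directly; the model with the smaller RMSE tracks the true log prices more closely. This is a minimal sketch reusing pred_values1 and pred_values2 from above:
# Hedged sketch: in-sample RMSE of each model's BMA predictions
rmse_lot  <- sqrt(mean((q4_df$lprice - pred_values1)^2))  # Lot.Area model
rmse_llot <- sqrt(mean((q6_df$lprice - pred_values2)^2))  # log(Lot.Area) model
c(Lot.Area = rmse_lot, log.Lot.Area = rmse_llot)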