First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(plotly)
library(devtools)
library(statsr)
library(broom)
library(BAS)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# type your code for Question 1 here, and Knit
# Margin settings reused by the plotly layouts below
m <- list(l = 50, r = 50, b = 100, t = 50, pad = 4)
# House age in years, from the construction year to the current year
ames_train$Age <- as.numeric(format(Sys.Date(), "%Y")) - ames_train$Year.Built
mean(ames_train$Age)
## [1] 48.797
median(ames_train$Age)
## [1] 46
Age_density <- density(ames_train$Age)
fig <- plot_ly(data = ames_train, x = ~Age, type = "histogram",
               nbinsx = 30, name = "Histogram") %>%
  add_trace(x = Age_density$x, y = Age_density$y, type = "scatter",
            mode = "lines", fill = "tozeroy", yaxis = "y2", name = "Density") %>%
  layout(autosize = FALSE, width = 600, height = 400, margin = m,
         title = "Histogram and Density Plot of 'Age'",
         xaxis = list(title = "House Age (years)"),
         yaxis = list(title = "Number of Observations"),
         yaxis2 = list(overlaying = "y", side = "right", title = "Density"),
         legend = list(x = 0.7, y = 0.8))
fig
The distribution of Age is right-skewed and multimodal, with a mean of 48.8 years and a median of 46 years; the mean exceeding the median is consistent with the right skew.
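A single number can back up the visual impression of skew. This is a minimal sketch (not part of the assignment output) using the simple moment estimator of skewness:
# Hedged sketch: quantify the right skew numerically
age <- ames_train$Age
mean((age - mean(age))^3) / sd(age)^3  # a positive value indicates right skew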
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# type your code for Question 2 here, and Knit
fig <- plot_ly(data = ames_train, x = ~Neighborhood, y = ~price, type = "box") %>%
  layout(title = "Boxplot of Home Prices by Neighborhood",
         xaxis = list(title = "Neighborhood"),
         yaxis = list(title = "Home Price ($)"))
fig
# Summarise price by neighborhood: mean, median, and standard deviation
neighborhood_summary <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(mean_price = mean(price),
            median_price = median(price),
            std_dev_price = sd(price))
The most expensive and least expensive neighborhoods are best identified by the mean price: StoneBr has the highest mean price ($339,316) and MeadowV the lowest ($92,947). The most heterogeneous neighborhood (having the most variation in housing price) is the one with the highest standard deviation of price, which is again StoneBr ($123,459). These can also be extracted programmatically, as the sketch below shows.
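This is a small sketch, assuming the neighborhood_summary tibble built above and dplyr 1.0+ for slice_max()/slice_min():
# Hedged sketch: pull the neighborhoods behind the reported statistics
neighborhood_summary %>% slice_max(mean_price, n = 1)     # most expensive (highest mean)
neighborhood_summary %>% slice_min(mean_price, n = 1)     # least expensive (lowest mean)
neighborhood_summary %>% slice_max(std_dev_price, n = 1)  # most heterogeneous (highest SD)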
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# type your code for Question 3 here, and Knit
# Count the missing values in every column
na_col <- sapply(ames_train, function(x) sum(is.na(x)))
na_col
## PID area price MS.SubClass MS.Zoning
## 0 0 0 0 0
## Lot.Frontage Lot.Area Street Alley Lot.Shape
## 167 0 0 933 0
## Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 0 0 0 0 0
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual
## 0 0 0 0 0
## Overall.Cond Year.Built Year.Remod.Add Roof.Style Roof.Matl
## 0 0 0 0 0
## Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area Exter.Qual
## 0 0 0 7 0
## Exter.Cond Foundation Bsmt.Qual Bsmt.Cond Bsmt.Exposure
## 0 0 21 21 21
## BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF
## 21 1 21 1 1
## Total.Bsmt.SF Heating Heating.QC Central.Air Electrical
## 1 0 0 0 0
## X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Bsmt.Full.Bath Bsmt.Half.Bath
## 0 0 0 1 1
## Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
## 0 0 0 0 0
## TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type
## 0 0 0 491 46
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area Garage.Qual
## 48 46 1 1 47
## Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF Enclosed.Porch
## 47 0 0 0 0
## X3Ssn.Porch Screen.Porch Pool.Area Pool.QC Fence
## 0 0 0 997 798
## Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type
## 971 0 0 0 0
## Sale.Condition Age
## 0 0
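Rather than scanning the table by eye, the worst column can be pulled out directly; a one-line sketch using the na_col vector computed above:
# Hedged sketch: the column with the most missing values
sort(na_col, decreasing = TRUE)[1]  # Pool.QC, with 997 NAs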
The variable with the largest number of missing values is Pool.QC (997 NAs). Per the dataset documentation, Pool.QC measures pool quality, and an NA in this column indicates that the property has no pool. The high NA count therefore simply reflects that the vast majority of houses in the dataset do not have pools.
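This interpretation can be cross-checked against Pool.Area. The sketch below assumes, per the codebook, that Pool.Area is recorded as 0 square feet when a house has no pool:
# Hedged sketch: if NA in Pool.QC means "no pool", these counts should agree
sum(is.na(ames_train$Pool.QC))   # houses with missing pool quality
sum(ames_train$Pool.Area == 0)   # houses with zero pool area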
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
# type your code for Question 4 here, and Knit
# Subset the `ames_train` dataset to just the variables of interest
q4_df <- ames_train %>%
  mutate(lprice = log(price)) %>%
  select(lprice, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_lprice <- bas.lm(lprice ~ ., data = na.omit(q4_df),
prior = "BIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_lprice
##
## Call:
## bas.lm(formula = lprice ~ ., data = na.omit(q4_df), prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept Lot.Area Land.SlopeMod Land.SlopeSev Year.Built
## 1.0000 1.0000 0.6324 0.7923 1.0000
## Year.Remod.Add Bedroom.AbvGr
## 1.0000 1.0000
# Top 5 most probable models
summary(bma_lprice)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.000000
## Lot.Area 1.0000000 1.0000 1.0000000 1.000000
## Land.SlopeMod 0.6323994 1.0000 0.0000000 1.000000
## Land.SlopeSev 0.7923033 1.0000 1.0000000 0.000000
## Year.Built 1.0000000 1.0000 1.0000000 1.000000
## Year.Remod.Add 1.0000000 1.0000 1.0000000 1.000000
## Bedroom.AbvGr 1.0000000 1.0000 1.0000000 1.000000
## BF NA 1.0000 0.6674353 0.330911
## PostProbs NA 0.4752 0.3171000 0.157200
## R2 NA 0.5625 0.5591000 0.558500
## dim NA 7.0000 6.0000000 6.000000
## logmarg NA -2198.1673 -2198.5716571 -2199.273250
## model 4 model 5
## Intercept 1.0000000 1.000000e+00
## Lot.Area 1.0000000 1.000000e+00
## Land.SlopeMod 0.0000000 0.000000e+00
## Land.SlopeSev 0.0000000 1.000000e+00
## Year.Built 1.0000000 1.000000e+00
## Year.Remod.Add 1.0000000 1.000000e+00
## Bedroom.AbvGr 1.0000000 0.000000e+00
## BF 0.1061955 1.578929e-12
## PostProbs 0.0505000 0.000000e+00
## R2 0.5544000 5.316000e-01
## dim 5.0000000 5.000000e+00
## logmarg -2200.4098178 -2.225342e+03
# Obtain the coefficients from the model `bma_lprice`
coef_lprice <- coefficients(bma_lprice)
coef_lprice
##
## Marginal Posterior Summaries of Coefficients:
##
## Using BMA
##
## Based on the top 64 models
## post mean post SD post p(B != 0)
## Intercept 1.202e+01 8.843e-03 1.000e+00
## Lot.Area 1.003e-05 1.346e-06 1.000e+00
## Land.SlopeMod 8.955e-02 7.910e-02 6.324e-01
## Land.SlopeSev -3.732e-01 2.343e-01 7.923e-01
## Year.Built 6.040e-03 3.797e-04 1.000e+00
## Year.Remod.Add 6.803e-03 5.489e-04 1.000e+00
## Bedroom.AbvGr 8.658e-02 1.087e-02 1.000e+00
# Plot the probability distributions of the model coefficients
plot(coef_lprice, ask = FALSE)
We use Bayesian model averaging (BMA) from the BAS package. BMA enumerates the candidate models (here, every subset of the five predictors), scores each by its posterior probability (using the BIC prior on coefficients and a uniform prior over models), and averages coefficient estimates across models, weighted by those posterior probabilities. Model 1, which incorporates all of the variables under consideration, yields the highest posterior probability (0.4752).
We have also plotted the posterior distributions of all the model coefficients produced by this model averaging approach.
The fitted linear model is therefore:
\[\begin{aligned} \log(\text{price}) ={} & 12.02 + (1.0\times10^{-5} \times \text{Lot.Area}) + (0.090 \times \text{Land.SlopeMod}) - (0.373 \times \text{Land.SlopeSev}) \\ & + (0.006 \times \text{Year.Built}) + (0.007 \times \text{Year.Remod.Add}) + (0.087 \times \text{Bedroom.AbvGr}) \end{aligned}\]
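To show how the averaged model is actually used for prediction, here is a sketch with a hypothetical house (all predictor values below are made up for illustration); predict() on the bas.lm fit returns BMA predictions that can be back-transformed to dollars:
# Hedged sketch: BMA prediction for a hypothetical house
new_house <- data.frame(Lot.Area = 10000,
                        Land.Slope = factor("Gtl", levels = c("Gtl", "Mod", "Sev")),
                        Year.Built = 1980,
                        Year.Remod.Add = 1995,
                        Bedroom.AbvGr = 3)
pred <- predict(bma_lprice, newdata = new_house, estimator = "BMA")
exp(pred$fit)  # back-transform log(price) to dollars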
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# type your code for Question 5 here, and Knit
# Refit the full model with lm() to make residual extraction straightforward
lm_q5 <- lm(lprice ~ ., data = na.omit(q4_df))
summary(lm_q5)
##
## Call:
## lm(formula = lprice ~ ., data = na.omit(q4_df))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
# Calculate the square of the residuals
q4_df$sq_res <- residuals(lm_q5)^2
result <- q4_df %>% filter(sq_res==max(sq_res))
result
## # A tibble: 1 x 7
## lprice Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr sq_res
## <dbl> <int> <fct> <int> <int> <int> <dbl>
## 1 9.46 9656 Gtl 1923 1970 2 4.36
Looking at the other features of the house with the largest squared residual, we see that it is quite old (built in 1923) and has only two bedrooms. Chances are that this house is in a historic district, where location plays a much larger role in the home price than the structure itself.
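To dig further into this property, the row can be traced back to the full dataset. This sketch relies on the fact that none of the Question 4 columns contain NAs, so na.omit() dropped no rows and the lm() row indices line up with ames_train:
# Hedged sketch: inspect the full record for the worst-fit house
idx <- which.max(residuals(lm_q5)^2)
ames_train[idx, c("PID", "price", "Year.Built", "Bedroom.AbvGr",
                  "Overall.Cond", "Sale.Condition")]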
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# type your code for Question 6 here, and Knit
# Subset the `ames_train` dataset to just the variables of interest
q6_df <- ames_train %>%
  mutate(lprice = log(price), l_lot.area = log(Lot.Area)) %>%
  select(lprice, l_lot.area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_lprice <- bas.lm(lprice ~ ., data = na.omit(q6_df),
prior = "BIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_lprice
##
## Call:
## bas.lm(formula = lprice ~ ., data = na.omit(q6_df), prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept l_lot.area Land.SlopeMod Land.SlopeSev Year.Built
## 1.00000 1.00000 0.39326 0.03645 1.00000
## Year.Remod.Add Bedroom.AbvGr
## 1.00000 0.99999
# Top 5 most probable models
summary(bma_lprice)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.000000e+00
## l_lot.area 1.00000000 1.0000 1.0000000 1.000000e+00
## Land.SlopeMod 0.39325522 0.0000 1.0000000 0.000000e+00
## Land.SlopeSev 0.03645331 0.0000 0.0000000 1.000000e+00
## Year.Built 1.00000000 1.0000 1.0000000 1.000000e+00
## Year.Remod.Add 1.00000000 1.0000 1.0000000 1.000000e+00
## Bedroom.AbvGr 0.99998941 1.0000 1.0000000 1.000000e+00
## BF NA 1.0000 0.6494644 3.866294e-02
## PostProbs NA 0.5842 0.3794000 2.260000e-02
## R2 NA 0.6031 0.6055000 6.032000e-01
## dim NA 5.0000 6.0000000 6.000000e+00
## logmarg NA -2142.5652 -2142.9968106 -2.145818e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## l_lot.area 1.000000e+00 1.000000e+00
## Land.SlopeMod 1.000000e+00 0.000000e+00
## Land.SlopeSev 1.000000e+00 0.000000e+00
## Year.Built 1.000000e+00 1.000000e+00
## Year.Remod.Add 1.000000e+00 1.000000e+00
## Bedroom.AbvGr 1.000000e+00 0.000000e+00
## BF 2.374021e-02 1.412793e-05
## PostProbs 1.390000e-02 0.000000e+00
## R2 6.056000e-01 5.913000e-01
## dim 7.000000e+00 4.000000e+00
## logmarg -2.146306e+03 -2.153733e+03
# Obtain the coefficients from the model `bma_lprice`
coef_lprice <- coefficients(bma_lprice)
coef_lprice
##
## Marginal Posterior Summaries of Coefficients:
##
## Using BMA
##
## Based on the top 64 models
## post mean post SD post p(B != 0)
## Intercept 12.0184706 0.0083882 1.0000000
## l_lot.area 0.2452196 0.0167729 1.0000000
## Land.SlopeMod 0.0456379 0.0639766 0.3932552
## Land.SlopeSev -0.0026562 0.0270720 0.0364533
## Year.Built 0.0059700 0.0003602 1.0000000
## Year.Remod.Add 0.0067566 0.0005193 1.0000000
## Bedroom.AbvGr 0.0580499 0.0105887 0.9999894
# Plot the probability distributions of the model coefficients
plot(coef_lprice, ask = FALSE)
No, we do not arrive at the same set of predictors. With log(Lot.Area) in place of Lot.Area, the most probable model (posterior probability 0.5842) drops both Land.Slope indicator variables, whereas the best model in Question 4 retained them.
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide written support, including a quantitative justification, for your answer to the first part of this question.
# type your code for Question 7
# Linear model for Lot.Area
bma_lprice1 <- bas.lm(lprice ~ ., data = na.omit(q4_df),
prior = "BIC",
modelprior = uniform())
predictions1<- predict(bma_lprice1,newdata=q4_df,
estimator="BMA",
type="response")
pred_values1 <- as.numeric(unlist(predictions1["fit"]))
# Linear model for log(Lot.Area)
bma_lprice2 <- bas.lm(lprice ~ ., data = na.omit(q6_df),
prior = "BIC",
modelprior = uniform())
predictions2 <- predict(bma_lprice2,newdata=q6_df,
estimator="BMA",
type="response")
pred_values2 <- as.numeric(unlist(predictions2["fit"]))
# Plot predicted vs. true values of log(price) using `Lot.Area` model...
fig1 <- plot_ly(x = ~q4_df$lprice, y = ~pred_values1, type = "scatter", mode = "markers") %>%
  layout(title = "Predicted vs. True Values of Log(price) --- Lot.Area Model",
         xaxis = list(title = "True Value - Log(Price)"),
         yaxis = list(title = "Predicted Value - Log(Price)"))
fig1
# Plot predicted vs. true values of log(price) using `Log(Lot.Area)` model...
fig2 <- plot_ly(x = ~q4_df$lprice, y = ~pred_values2, type = "scatter", mode = "markers") %>%
  layout(title = "Predicted vs. True Values of Log(price) --- Log(Lot.Area) Model",
         xaxis = list(title = "True Value - Log(Price)"),
         yaxis = list(title = "Predicted Value - Log(Price)"))
fig2
After inspecting the two plots above, I would argue that the Lot.Area model provides better results than the log(Lot.Area) model: its points are more tightly clustered, indicating a closer linear relationship between the predicted and true values of log(price), whereas the points in the log(Lot.Area) plot are more spread out, showing greater discrepancy between the true and predicted values of log(price).
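As a quantitative check, the in-sample root-mean-squared error (RMSE) of each model's predictions can be compared directly; the model with the smaller RMSE tracks the true log prices more closely. This is a minimal sketch reusing pred_values1 and pred_values2 from above:
# Hedged sketch: in-sample RMSE of each model's BMA predictions
rmse_lot  <- sqrt(mean((q4_df$lprice - pred_values1)^2))  # Lot.Area model
rmse_llot <- sqrt(mean((q6_df$lprice - pred_values2)^2))  # log(Lot.Area) model
c(Lot.Area = rmse_lot, log.Lot.Area = rmse_llot)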