First, let us load the data and necessary packages:

load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(devtools)
library(statsr)
library(BAS)

1

Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.

# type your code for Question 1 here, and Knit
ames_train$age <- 2010 - ames_train$Year.Built   # age of each house as of 2010
hist(ames_train$age, breaks = 30, main = "Ages of Houses in Ames", xlab = "Age (years)")


The distribution of house ages is strongly right-skewed, with its highest peak at 0 years: a large number of houses were built very recently. The distribution also appears multimodal, with a secondary mode around 30-50 year-old buildings, and the oldest houses are around 150 years old.
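Since ggplot2 is already loaded, an equivalent labeled histogram with 30 bins can also be sketched with it (the title and axis labels are arbitrary choices):

ggplot(ames_train, aes(x = age)) +
        geom_histogram(bins = 30) +
        labs(title = "Ages of Houses in Ames", x = "Age (years)", y = "Count")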


2

The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.

# type your code for Question 2 here, and Knit


ames_train_loc <- ames_train %>%
        select(price, Neighborhood) %>%
        group_by(Neighborhood) %>%
        summarise(median = median(price), max = max(price), min = min(price), sd = sd(price))
print(arrange(ames_train_loc, desc(max)))
## # A tibble: 27 x 5
##    Neighborhood   median    max    min        sd
##          <fctr>    <dbl>  <dbl>  <dbl>     <dbl>
##  1      NridgHt 336860.0 615000 173000 105088.90
##  2      StoneBr 340691.5 591587 180000 123459.10
##  3      CollgCr 195800.0 475000 110000  52786.08
##  4      Somerst 221650.0 468000 139000  65199.49
##  5       Timber 232500.0 425000 175000  84029.57
##  6      Edwards 127250.0 415000  61500  54851.63
##  7      NoRidge 290000.0 405000 235000  35888.97
##  8      Crawfor 205000.0 392500  96500  71267.56
##  9      Veenker 205750.0 385000 150000  72545.41
## 10      Gilbert 183500.0 377500 133000  41190.38
## # ... with 17 more rows
print(arrange(ames_train_loc, min))
## # A tibble: 27 x 5
##    Neighborhood median    max   min       sd
##          <fctr>  <dbl>  <dbl> <dbl>    <dbl>
##  1      OldTown 120000 265979 12789 36429.69
##  2       IDOTRR  99500 150909 34900 31530.44
##  3      BrkSide 124000 207000 39300 37309.91
##  4      Edwards 127250 415000 61500 54851.63
##  5       Sawyer 136000 219000 63900 21216.22
##  6      SawyerW 182500 320000 67500 48354.36
##  7      MeadowV  85750 129500 73000 18939.78
##  8        NAmes 139900 277500 76500 27267.97
##  9        SWISU 134000 169000 80000 27375.76
## 10       NWAmes 185000 278000 82500 41340.50
## # ... with 17 more rows
print(arrange(ames_train_loc, desc(median)))
## # A tibble: 27 x 5
##    Neighborhood   median    max    min        sd
##          <fctr>    <dbl>  <dbl>  <dbl>     <dbl>
##  1      StoneBr 340691.5 591587 180000 123459.10
##  2      NridgHt 336860.0 615000 173000 105088.90
##  3      NoRidge 290000.0 405000 235000  35888.97
##  4      GrnHill 280000.0 330000 230000  70710.68
##  5       Timber 232500.0 425000 175000  84029.57
##  6      Somerst 221650.0 468000 139000  65199.49
##  7       Greens 212625.0 214000 155000  29063.42
##  8      Veenker 205750.0 385000 150000  72545.41
##  9      Crawfor 205000.0 392500  96500  71267.56
## 10      CollgCr 195800.0 475000 110000  52786.08
## # ... with 17 more rows
ggplot(ames_train, aes(y = price, x = reorder(Neighborhood, price, median), fill = Neighborhood)) +
        geom_boxplot() +
        coord_flip() +
        labs(title = "Price by Neighborhood", x = "Neighborhood", y = "Price")


Plotting price against neighborhood in Ames, Iowa, and ordering the boxplots by median, we can see that StoneBr (median 340,691.5) and NridgHt (median 336,860.0) are the most expensive neighborhoods. They are also the most heterogeneous, with the largest standard deviations: 123,459.10 for StoneBr and 105,088.90 for NridgHt. The least expensive neighborhoods by median are MeadowV (85,750.0) and BrDale (98,750.0), and OldTown contains the single cheapest house (minimum 12,789). The median is the most appropriate measure of the typical price because prices are skewed and contain outliers, and the standard deviation is an appropriate measure of heterogeneity. I chose boxplots because they give a compact visual summary of every neighborhood's price range and median at once; the detailed numbers are in the summary tables above.
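The specific neighborhoods reported above can also be pulled straight from the summary table; a small sketch using the ames_train_loc tibble defined earlier:

ames_train_loc %>% filter(median == max(median))         # most expensive by median
ames_train_loc %>% filter(median == min(median))         # least expensive by median
ames_train_loc %>% filter(sd == max(sd, na.rm = TRUE))   # most heterogeneous by sd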


3

Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

# type your code for Question 3 here, and Knit
na_counts <- colSums(is.na(ames_train))   # missing values per variable
sort(na_counts, decreasing = TRUE)[1]     # variable with the most NAs
summary(ames_train$Pool.QC)
##   Ex   Fa   Gd   TA NA's 
##    1    1    1    0  997

Pool.QC has the largest number of missing values (997 of the 1,000 houses). This is easily explained by the codebook: Pool.QC records pool quality, so it can only be filled in for houses that have a pool, and the missing values simply show that the vast majority of houses do not have one.


4

We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

# type your code for Question 4 here, and Knit
fit_lm = lm(log(price) ~ Lot.Area + Land.Slope +
            Year.Built + Year.Remod.Add + Bedroom.AbvGr,
            data = ames_train)
summary(fit_lm)
## 
## Call:
## lm(formula = log(price) ~ Lot.Area + Land.Slope + Year.Built + 
##     Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0878 -0.1651 -0.0211  0.1657  0.9945 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.371e+01  8.574e-01 -15.996  < 2e-16 ***
## Lot.Area        1.028e-05  1.106e-06   9.296  < 2e-16 ***
## Land.SlopeMod   1.384e-01  4.991e-02   2.773  0.00565 ** 
## Land.SlopeSev  -4.567e-01  1.514e-01  -3.016  0.00263 ** 
## Year.Built      6.049e-03  3.788e-04  15.968  < 2e-16 ***
## Year.Remod.Add  6.778e-03  5.468e-04  12.395  < 2e-16 ***
## Bedroom.AbvGr   8.686e-02  1.077e-02   8.063 2.12e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared:  0.5625, Adjusted R-squared:  0.5598 
## F-statistic: 212.8 on 6 and 993 DF,  p-value: < 2.2e-16
fit_bic = bas.lm(log(price) ~ Lot.Area + Land.Slope +
                 Year.Built + Year.Remod.Add + Bedroom.AbvGr,
                 data = ames_train,
                 prior = "BIC",
                 modelprior = uniform(),
                 method = "MCMC")
summary(fit_bic)
##                P(B != 0 | Y)    model 1       model 2      model 3
## Intercept          1.0000000     1.0000     1.0000000     1.000000
## Lot.Area           0.9937500     1.0000     1.0000000     1.000000
## Land.SlopeMod      0.6812500     1.0000     0.0000000     1.000000
## Land.SlopeSev      0.7539062     1.0000     1.0000000     0.000000
## Year.Built         1.0000000     1.0000     1.0000000     1.000000
## Year.Remod.Add     0.9992188     1.0000     1.0000000     1.000000
## Bedroom.AbvGr      0.9968750     1.0000     1.0000000     1.000000
## BF                        NA     1.0000     0.6674353     0.330911
## PostProbs                 NA     0.4898     0.2617000     0.191400
## R2                        NA     0.5625     0.5591000     0.558500
## dim                       NA     7.0000     6.0000000     6.000000
## logmarg                   NA -2198.1673 -2198.5716571 -2199.273250
##                      model 4       model 5
## Intercept          1.0000000  1.000000e+00
## Lot.Area           1.0000000  0.000000e+00
## Land.SlopeMod      0.0000000  0.000000e+00
## Land.SlopeSev      0.0000000  0.000000e+00
## Year.Built         1.0000000  1.000000e+00
## Year.Remod.Add     1.0000000  1.000000e+00
## Bedroom.AbvGr      1.0000000  0.000000e+00
## BF                 0.1061955  3.125949e-35
## PostProbs          0.0508000  2.300000e-03
## R2                 0.5544000  4.728000e-01
## dim                5.0000000  3.000000e+00
## logmarg        -2200.4098178 -2.277618e+03

We start by building a multiple linear regression model with the candidate predictors. A way around the problem of choosing a single model is Bayesian model averaging (BMA), in which multiple models are weighted by their posterior probabilities to obtain posterior distributions of coefficients and of predictions for new data; the same machinery can also be used for model selection by picking the model with the highest posterior probability. A uniform prior is placed over the model space, and BIC is chosen as the prior because it yielded the highest posterior model probabilities among the methods tried; BIC also tends to select simpler models than AIC. Since computing model likelihoods is often computationally intensive, Bayesian methods frequently rely on Markov chain Monte Carlo (MCMC) sampling, which is the method used here. Running BMA this way, the highest-probability model (posterior probability 0.4898) keeps all of the candidate variables, so all of them are retained as predictors of log home price. The frequentist fit agrees: every coefficient in the linear model is significant, so even a parsimonious model would not drop any of them. Note that Land.SlopeMod and Land.SlopeSev are dummy levels of the single categorical variable Land.Slope, so they could only be removed together, not one at a time.
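For completeness, the averaged coefficients behind this conclusion can be inspected with the standard BAS accessors; a brief sketch:

coef_bma <- coef(fit_bic)        # posterior means and SDs of the coefficients under BMA
coef_bma
confint(coef_bma)                # 95% credible intervals for each coefficient
image(fit_bic, rotate = FALSE)   # model space ranked by posterior probability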


5

Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?

# type your code for Question 5 here, and Knit
plot(fit_bic)   # BAS diagnostic plots; the first panel is residuals vs fitted

ames_train[c(125, 126, 428),]
## # A tibble: 3 x 82
##         PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area
##       <int> <int>  <int>       <int>    <fctr>        <int>    <int>
## 1 906412010  2276 475000          20        RL           91    11778
## 2 535301060  1154 142000          20        RL           60    11556
## 3 902207130   832  12789          30        RM           68     9656
## # ... with 75 more variables: Street <fctr>, Alley <fctr>,
## #   Lot.Shape <fctr>, Land.Contour <fctr>, Utilities <fctr>,
## #   Lot.Config <fctr>, Land.Slope <fctr>, Neighborhood <fctr>,
## #   Condition.1 <fctr>, Condition.2 <fctr>, Bldg.Type <fctr>,
## #   House.Style <fctr>, Overall.Qual <int>, Overall.Cond <int>,
## #   Year.Built <int>, Year.Remod.Add <int>, Roof.Style <fctr>,
## #   Roof.Matl <fctr>, Exterior.1st <fctr>, Exterior.2nd <fctr>,
## #   Mas.Vnr.Type <fctr>, Mas.Vnr.Area <int>, Exter.Qual <fctr>,
## #   Exter.Cond <fctr>, Foundation <fctr>, Bsmt.Qual <fctr>,
## #   Bsmt.Cond <fctr>, Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>,
## #   BsmtFin.SF.1 <int>, BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>,
## #   Bsmt.Unf.SF <int>, Total.Bsmt.SF <int>, Heating <fctr>,
## #   Heating.QC <fctr>, Central.Air <fctr>, Electrical <fctr>,
## #   X1st.Flr.SF <int>, X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>,
## #   Bsmt.Full.Bath <int>, Bsmt.Half.Bath <int>, Full.Bath <int>,
## #   Half.Bath <int>, Bedroom.AbvGr <int>, Kitchen.AbvGr <int>,
## #   Kitchen.Qual <fctr>, TotRms.AbvGrd <int>, Functional <fctr>,
## #   Fireplaces <int>, Fireplace.Qu <fctr>, Garage.Type <fctr>,
## #   Garage.Yr.Blt <int>, Garage.Finish <fctr>, Garage.Cars <int>,
## #   Garage.Area <int>, Garage.Qual <fctr>, Garage.Cond <fctr>,
## #   Paved.Drive <fctr>, Wood.Deck.SF <int>, Open.Porch.SF <int>,
## #   Enclosed.Porch <int>, X3Ssn.Porch <int>, Screen.Porch <int>,
## #   Pool.Area <int>, Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>,
## #   Misc.Val <int>, Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>,
## #   Sale.Condition <fctr>, age <dbl>
ames_train$res_sqr <- fit_lm$residuals^2   # squared residuals from the Question 4 model
which(ames_train$res_sqr == max(ames_train$res_sqr))
## [1] 428
exp(fit_lm$fitted.values[428])
##      428 
## 103176.2

From the residuals-vs-fitted plot, the house with PID 902207130 (row 428) stands out with the largest squared residual: the model fits it very poorly. It sold for just 12,789 USD, the lowest price in the data set, while the model predicts about 103,176 USD, making it a significant outlier. Looking at the full set of variables, the house has an unfinished basement and was sold under abnormal sale conditions, and neither of these variables is included in our model. Together with the house's age, these seem the most likely explanations for the extremely low sale price.
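The factors mentioned above can be verified by pulling the relevant columns for row 428 directly:

ames_train[428, c("price", "Year.Built", "Overall.Cond", "Bsmt.Qual", "BsmtFin.Type.1", "Sale.Condition")]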

6

Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?

# type your code for Question 6 here, and Knit
fit_lm = lm(log(price) ~ log(Lot.Area) + Land.Slope +
            Year.Built + Year.Remod.Add + Bedroom.AbvGr,
            data = ames_train)
summary(fit_lm)
## 
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + 
##     Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.14050 -0.15650 -0.01561  0.15350  0.90854 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.553e+01  8.197e-01 -18.947  < 2e-16 ***
## log(Lot.Area)   2.442e-01  1.708e-02  14.297  < 2e-16 ***
## Land.SlopeMod   1.151e-01  4.734e-02   2.431   0.0152 *  
## Land.SlopeSev  -6.554e-02  1.222e-01  -0.536   0.5917    
## Year.Built      5.981e-03  3.597e-04  16.628  < 2e-16 ***
## Year.Remod.Add  6.734e-03  5.190e-04  12.975  < 2e-16 ***
## Bedroom.AbvGr   5.909e-02  1.055e-02   5.599  2.8e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2649 on 993 degrees of freedom
## Multiple R-squared:  0.6056, Adjusted R-squared:  0.6032 
## F-statistic: 254.1 on 6 and 993 DF,  p-value: < 2.2e-16
fit_bic = bas.lm(log(price) ~ log(Lot.Area) + Land.Slope +
                 Year.Built + Year.Remod.Add + Bedroom.AbvGr,
                 data = ames_train,
                 prior = "BIC",
                 modelprior = uniform(),
                 method = "MCMC")
summary(fit_bic)
##                P(B != 0 | Y)    model 1       model 2       model 3
## Intercept         1.00000000     1.0000     1.0000000  1.000000e+00
## log(Lot.Area)     0.99765625     1.0000     1.0000000  1.000000e+00
## Land.SlopeMod     0.44609375     0.0000     1.0000000  0.000000e+00
## Land.SlopeSev     0.06171875     0.0000     0.0000000  1.000000e+00
## Year.Built        0.99921875     1.0000     1.0000000  1.000000e+00
## Year.Remod.Add    0.99609375     1.0000     1.0000000  1.000000e+00
## Bedroom.AbvGr     0.99531250     1.0000     1.0000000  1.000000e+00
## BF                        NA     1.0000     0.6494644  3.866294e-02
## PostProbs                 NA     0.5094     0.4242000  3.980000e-02
## R2                        NA     0.6031     0.6055000  6.032000e-01
## dim                       NA     5.0000     6.0000000  6.000000e+00
## logmarg                   NA -2142.5652 -2142.9968106 -2.145818e+03
##                      model 4       model 5
## Intercept       1.000000e+00  1.000000e+00
## log(Lot.Area)   1.000000e+00  0.000000e+00
## Land.SlopeMod   1.000000e+00  0.000000e+00
## Land.SlopeSev   1.000000e+00  0.000000e+00
## Year.Built      1.000000e+00  1.000000e+00
## Year.Remod.Add  1.000000e+00  0.000000e+00
## Bedroom.AbvGr   1.000000e+00  0.000000e+00
## BF              2.374021e-02  1.227606e-87
## PostProbs       2.190000e-02  1.600000e-03
## R2              6.056000e-01  3.953000e-01
## dim             7.000000e+00  2.000000e+00
## logmarg        -2.146306e+03 -2.342685e+03
fit_lm_log = lm(log(price) ~ log(Lot.Area) +
                Year.Built + Year.Remod.Add + Bedroom.AbvGr,
                data = ames_train)
summary(fit_lm_log)
## 
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + 
##     Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.14609 -0.15825 -0.01477  0.15354  1.01578 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.557e+01  8.213e-01 -18.964  < 2e-16 ***
## log(Lot.Area)   2.471e-01  1.654e-02  14.935  < 2e-16 ***
## Year.Built      5.964e-03  3.604e-04  16.547  < 2e-16 ***
## Year.Remod.Add  6.765e-03  5.197e-04  13.017  < 2e-16 ***
## Bedroom.AbvGr   5.726e-02  1.054e-02   5.434 6.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared:  0.6031, Adjusted R-squared:  0.6015 
## F-statistic: 377.9 on 4 and 995 DF,  p-value: < 2.2e-16
fit_bic_log = bas.lm(log(price) ~ log(Lot.Area) +
                     Year.Built + Year.Remod.Add + Bedroom.AbvGr,
                     data = ames_train,
                     prior = "BIC",
                     modelprior = uniform(),
                     method = "MCMC")
summary(fit_bic_log)
##                P(B != 0 | Y)    model 1       model 2       model 3
## Intercept           1.000000     1.0000  1.000000e+00  1.000000e+00
## log(Lot.Area)       0.978125     1.0000  1.000000e+00  1.000000e+00
## Year.Built          0.993750     1.0000  1.000000e+00  1.000000e+00
## Year.Remod.Add      0.940625     1.0000  1.000000e+00  0.000000e+00
## Bedroom.AbvGr       0.906250     1.0000  0.000000e+00  1.000000e+00
## BF                        NA     1.0000  1.412793e-05  2.253065e-33
## PostProbs                 NA     0.8688  6.560000e-02  3.750000e-02
## R2                        NA     0.6031  5.913000e-01  5.355000e-01
## dim                       NA     5.0000  4.000000e+00  4.000000e+00
## logmarg                   NA -2142.5652 -2.153733e+03 -2.217738e+03
##                      model 4       model 5
## Intercept       1.000000e+00  1.000000e+00
## log(Lot.Area)   0.000000e+00  0.000000e+00
## Year.Built      1.000000e+00  0.000000e+00
## Year.Remod.Add  0.000000e+00  1.000000e+00
## Bedroom.AbvGr   0.000000e+00  0.000000e+00
## BF              1.227606e-87  3.417138e-97
## PostProbs       1.560000e-02  6.200000e-03
## R2              3.953000e-01  3.681000e-01
## dim             2.000000e+00  2.000000e+00
## logmarg        -2.342685e+03 -2.364687e+03

Replacing Lot.Area with its natural log, we arrive at a slightly different model. Both the frequentist fit and BMA now suggest dropping the Land.Slope dummy variables: Land.SlopeSev is no longer significant in the linear model, and the highest-probability model in the BAS summary excludes both levels. In Question 4, by contrast, every candidate variable was retained.
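The weakened support for Land.Slope can also be read directly off the posterior inclusion probabilities stored on the bas objects; a brief sketch (probne0 holds the P(B != 0 | Y) column shown in the summaries above):

round(fit_bic$probne0, 3)       # log(Lot.Area) model: low inclusion probabilities for the Land.Slope levels
round(fit_bic_log$probne0, 3)   # refit without Land.Slope: all remaining predictors near 1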

7

Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.

# type your code for Question 7 here, and Knit
# fit_lm was overwritten in Question 6, so refit the Question 4 model
# (untransformed Lot.Area) under its own name for a fair comparison
fit_lm_area <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train)

plot(log(ames_train$price) ~ fit_lm_area$fitted.values, xlab = "Predicted log(price)", ylab = "Observed log(price)")
abline(0, 1, col = "red")   # perfect-prediction reference line

plot(log(ames_train$price) ~ fit_lm_log$fitted.values, xlab = "Predicted log(price)", ylab = "Observed log(price)")
abline(0, 1, col = "red")


Plotting predicted versus observed log home prices for the two selected models, it does make sense to log-transform Lot.Area. Quantitatively, the transformation lowers the residual standard error from 0.279 to 0.2655 and raises the adjusted R-squared from 0.5598 to 0.6015 (see the model summaries above). Visually, the points in the second plot sit more tightly and symmetrically around the reference line, so the linear relationship is visible without any extra tools. This is expected: Lot.Area is strongly right-skewed, and taking its log makes its relationship with log(price) closer to linear.
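To back this up numerically, the in-sample root mean squared error on the log(price) scale can be computed for both fits; a small sketch, assuming the refitted fit_lm_area from the chunk above:

rmse <- function(fit) sqrt(mean(residuals(fit)^2))   # RMSE on the log scale
rmse(fit_lm_area)   # model with Lot.Area
rmse(fit_lm_log)    # model with log(Lot.Area)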

