Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.

library(statsr)
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.3

library(BAS)

## Warning: package 'BAS' was built under R version 3.4.4

library(MASS)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.4

load("ames_train.Rdata")

The first step is to cut down on inventory that may pose too much difficulty or cost to deal with. To this end, only houses sold under “normal” sale conditions are included.

ames_train <- ames_train%>%
  filter(Sale.Condition == "Normal")

Use the code block below to load any necessary packages

Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).

hist(ames_train$price, main = "Ames Price Distribution", label = T, col = "blue", las =1, ylim= c(0,400))

summary(ames_train$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   39300  129000  155500  174622  205000  615000

boxplot(ames_train$price, col = "red", main = "Ames House Price Boxplot", ylab = "House Price")

ggplot(data = ames_train, aes(Neighborhood, color = Neighborhood, fill = Neighborhood)) +
  geom_bar()+coord_flip()+
  labs(title = "Neighborhood Counts")

# House age relative to 2018, kept both as a vector and as a column of ames_train
house_age <- 2018 - ames_train$Year.Built
ames_train <- ames_train %>%
  mutate(house_age = house_age)

summary(house_age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       9      21      46      48      64     146

ggplot(data = ames_train, aes(house_age)) +
  geom_histogram(binwidth = 4.7, color = "blue", fill = "yellow") +
  geom_vline(xintercept = 43, size = .1, color = "red") + 
  geom_vline(xintercept = 17, size = .1, color = "green")+
  geom_vline(xintercept = 45.8, size = .1, color = "blue")+
  labs(title = "Table 1: House Age Histogram", x = "House Age from 2018", y = "Number of Houses")+
  theme_linedraw()

ggplot(data = ames_train, aes(price, fill = Neighborhood)) +geom_histogram()+
  labs(title = "Counts by Neighborhood and Price")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

house_price<-ames_train %>%
  group_by(Neighborhood) %>%
  summarise(med_price = median(price), mean_price =  mean(price), sd_price = sd(price), var_price = var(price),iqr_price = IQR(price), n = n())
house_price

## # A tibble: 27 x 7
##    Neighborhood med_price mean_price sd_price   var_price iqr_price     n
##    <fct>            <dbl>      <dbl>    <dbl>       <dbl>     <dbl> <int>
##  1 Blmngtn        192000.    198961.   24912.  620582881.    20130.     7
##  2 Blueste        123900.    125800.   10381.  107770000.    10250.     3
##  3 BrDale         100500.    100557.   13596.  184856190.    11450.     7
##  4 BrkSide        125250.    123733.   38466. 1479636208.    41394.    36
##  5 ClearCr        187500.    198273.   50054. 2505368182.    68000.    11
##  6 CollgCr        195000.    191878.   43178. 1864357269.    58500.    75
##  7 Crawfor        198000.    197296.   64847. 4205186233.    76100.    25
##  8 Edwards        125400.    130975.   53555. 2868098189.    40125.    50
##  9 Gilbert        184000.    191722.   36939. 1364520386.    22900.    36
## 10 Greens         212625.    198562.   29063.  844682292.    16438.     4
## # ... with 17 more rows

a<-ames_train%>%
  group_by(Neighborhood)%>%
  filter(house_age<=10)
a

## # A tibble: 9 x 82
## # Groups:   Neighborhood [5]
##        PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
##      <int> <int>  <int>       <int> <fct>            <int>    <int> <fct> 
## 1   9.16e8  1346 220000          20 RL                  88    11896 Pave  
## 2   9.06e8  1226 198900          20 RL                  94    10402 Pave  
## 3   5.28e8  1547 215200          60 FV                  72     8640 Pave  
## 4   9.21e8  2018 378500          20 RL                  89    13214 Pave  
## 5   5.28e8  1808 324000          20 RL                 102    13514 Pave  
## 6   9.05e8  1212 186000          20 RL                  83    10420 Pave  
## 7   5.28e8  2007 310000          60 FV                  85    11003 Pave  
## 8   5.28e8  1743 335000          20 RL                  87    10367 Pave  
## 9   5.28e8  2020 404000          20 RL                  95    12350 Pave  
## # ... with 74 more variables: Alley <fct>, Lot.Shape <fct>,
## #   Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
## #   Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
## #   Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
## #   Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## #   Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
## #   Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
## #   Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
## #   Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
## #   Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
## #   BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## #   Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
## #   Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
## #   X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## #   Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## #   Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
## #   TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
## #   Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
## #   Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
## #   Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
## #   Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## #   X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
## #   Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
## #   Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>, house_age <dbl>

ggplot(data = a, aes(Neighborhood, fill = Neighborhood))+geom_bar()+
  labs(title = "Neighborhood Counts for Houses Younger than Ten Years")

# Vertical lines: black = median, yellow = mean, red = approximate outlier boundary
ggplot(data = ames_train, aes(price)) +
  geom_histogram(binwidth = 50000, color = "blue", fill = "cyan") +
  geom_vline(xintercept = 155500, size = 1.7, color = "black") + 
  geom_vline(xintercept = 174622, size = 1.4, color = "yellow")+
  geom_vline(xintercept = 300000, size = 1.7, color = "red")+
  labs(title = "Table 1: Ames Price with Median, Mean and Outliers", x = "House Price (dollars)", y = "Number of Houses")+
  theme_linedraw()

ggplot(aes(y = price, x = Neighborhood, fill = Neighborhood), data = ames_train ) +
  geom_boxplot() +
  labs(title = "Table 2: Neighborhood vs Price") +coord_flip() +theme_dark()

ggplot(data = house_price, aes(x = sd_price, y = med_price ))+
  geom_point(col = "blue", size = 2.2)+
  labs(title = "Table 3: Neighborhood Standard Deviation vs Neighborhood Median Price", x = "Standard Deviation of Price by Neighborhood", y = "Median Price by Neighborhood")

cor(house_price$sd_price, house_price$mean_price)

## [1] 0.7793776

which.min(house_price$med_price)

## [1] 13

which.min(house_price$sd_price)

## [1] 2

which.min(house_price$iqr_price)

## [1] 2

which.max(house_price$med_price)

## [1] 18

which.max(house_price$sd_price)

## [1] 18

which.max(house_price$iqr_price)

## [1] 18

The above graphs bring to light a few things worth noting. The price distribution is clearly right skewed, which implies that the mean is greater than the median; the summary confirms this (mean 174,622 versus median 155,500). This indicates the presence of high-priced, and possibly overvalued, homes in the housing pool. The good news is that the mean and median fall in the same 50,000 dollar bin and are only about 20,000 dollars apart. It is also evident that the majority of houses, more than half, are within the 100,000 to 200,000 dollar range. This implies there are a good number of potential prospects for purchase.
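
The link between right skew and a mean above the median can be illustrated with a small simulation. This is only a sketch: `sim_price` is a hypothetical log-normal sample standing in for house prices, not the Ames data, and its parameters are arbitrary.

```r
# Hypothetical right-skewed sample (NOT the Ames data); parameters are arbitrary
set.seed(123)
sim_price <- rlnorm(5000, meanlog = 12, sdlog = 0.4)

# On the raw scale the long right tail pulls the mean above the median
raw_check <- c(mean = mean(sim_price), median = median(sim_price))
raw_check

# On the log scale the distribution is roughly symmetric,
# so the mean and median nearly coincide
log_check <- c(mean = mean(log(sim_price)), median = median(log(sim_price)))
log_check
```

The same reasoning is one motivation for modeling log(price) rather than price in Part 2.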

The boxplot shows the presence of outliers in the set. Somewhere around 300,000 dollars the houses start moving away from the rest of the sample and may be overpriced.
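
As a rough check on where the boxplot starts flagging outliers, the usual 1.5 x IQR fences can be computed from the quartiles in the price summary above. This is a sketch only: `boxplot()` uses hinges, which can differ slightly from the quartiles reported by `summary()`.

```r
# Quartiles copied from the summary(ames_train$price) output above
q1 <- 129000
q3 <- 205000
iqr <- q3 - q1

# Conventional boxplot fences: points beyond them are drawn as outliers
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)
```

The upper fence comes out at 319,000 dollars, in line with the visual boundary near 300,000. The lower fence, 15,000 dollars, is below the minimum price of 39,300, so the overall boxplot should show no low-end outliers.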

The first table is essentially a blend of the first two EDA graphs. It shows the overall price distribution together with the corresponding values of interest: the median, the mean and the outlier boundary. These values point to prices to consider when accepting or rejecting a property for its strategic investment potential.

The second table gives a graphic look at the price spread per neighborhood. This gives a sense of ranges and median prices as well as outliers at both the low and high ends. This is important for deciding whether the price mobility in a neighborhood warrants investment.

The third table directly shows the relationship between the median price and the standard deviation of price per neighborhood. The size of the spread ties in with the presence of “deals”, i.e. undervalued properties in an area that may be primed for price increases. The most striking feature of this table is a clear positive linear relationship between a neighborhood’s standard deviation and its typical price. The correlation computed above, which uses the neighborhood mean price rather than the median, is about 0.78. In short, the more expensive the neighborhood, the greater the spread of house prices around that neighborhood’s average.

The extremes of the median, standard deviation and interquartile range across the various neighborhoods are as follows:

The smallest median price is in Meadow Village (MeadowV, row 13). The smallest standard deviation and interquartile range are both in Bluestem (Blueste, row 2).

The largest of each are all in Northridge Heights (NridgHt, row 18).
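
The `which.min()` and `which.max()` calls above return row positions, not names. A minimal sketch of turning a position into a neighborhood label follows; the data frame `hp` is a hypothetical stand-in built from only the first five rows of the printed `house_price` table, so it does not reproduce the full-data extremes.

```r
# First five rows of the house_price table printed above (sd_price rounded as shown);
# `hp` is an illustrative stand-in, not the full 27-row summary
hp <- data.frame(
  Neighborhood = c("Blmngtn", "Blueste", "BrDale", "BrkSide", "ClearCr"),
  sd_price = c(24912, 10381, 13596, 38466, 50054)
)

# Row position of the smallest standard deviation, then the matching label
idx <- which.min(hp$sd_price)
lowest_sd <- as.character(hp$Neighborhood[idx])
lowest_sd
```

On the full summary the same pattern would read `house_price$Neighborhood[which.min(house_price$med_price)]`, which avoids looking up row numbers by hand.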

NOTE: At this stage, for these variables, the presence of NA’s did not impact the output, so cleaning was not necessary.

Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)

Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).

library(MASS)
model_train<-lm(log(price) ~ Overall.Qual + log(Garage.Area + 1) +   
                  Neighborhood +log(area) + Full.Bath + Bedroom.AbvGr + Year.Built  +
                  log(Lot.Area) +  Central.Air + Overall.Cond,
                 data = ames_train)

summary(model_train)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + log(Garage.Area + 1) + 
##     Neighborhood + log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + 
##     log(Lot.Area) + Central.Air + Overall.Cond, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73166 -0.06308  0.00166  0.07012  0.51978 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.7457973  0.7560257  -2.309 0.021188 *  
## Overall.Qual          0.0780147  0.0055068  14.167  < 2e-16 ***
## log(Garage.Area + 1)  0.0051522  0.0038302   1.345 0.178952    
## NeighborhoodBlueste  -0.0742720  0.0819874  -0.906 0.365265    
## NeighborhoodBrDale   -0.1907787  0.0646406  -2.951 0.003256 ** 
## NeighborhoodBrkSide  -0.0629624  0.0545630  -1.154 0.248870    
## NeighborhoodClearCr  -0.0536285  0.0619106  -0.866 0.386627    
## NeighborhoodCollgCr  -0.1028692  0.0485217  -2.120 0.034308 *  
## NeighborhoodCrawfor   0.0293478  0.0549179   0.534 0.593217    
## NeighborhoodEdwards  -0.1414351  0.0511227  -2.767 0.005796 ** 
## NeighborhoodGilbert  -0.1456983  0.0505066  -2.885 0.004023 ** 
## NeighborhoodGreens    0.0944600  0.0745767   1.267 0.205662    
## NeighborhoodGrnHill   0.2577918  0.0945396   2.727 0.006535 ** 
## NeighborhoodIDOTRR   -0.1711145  0.0560396  -3.053 0.002337 ** 
## NeighborhoodMeadowV  -0.1570883  0.0565996  -2.775 0.005642 ** 
## NeighborhoodMitchel  -0.0694029  0.0508342  -1.365 0.172549    
## NeighborhoodNAmes    -0.0763612  0.0500166  -1.527 0.127227    
## NeighborhoodNoRidge   0.0134845  0.0516248   0.261 0.794005    
## NeighborhoodNPkVill  -0.0436025  0.0740910  -0.588 0.556364    
## NeighborhoodNridgHt   0.0745068  0.0497750   1.497 0.134822    
## NeighborhoodNWAmes   -0.1185988  0.0514253  -2.306 0.021353 *  
## NeighborhoodOldTown  -0.1220336  0.0544072  -2.243 0.025173 *  
## NeighborhoodSawyer   -0.0926175  0.0515403  -1.797 0.072715 .  
## NeighborhoodSawyerW  -0.1663309  0.0500255  -3.325 0.000925 ***
## NeighborhoodSomerst  -0.0138265  0.0479958  -0.288 0.773363    
## NeighborhoodStoneBr   0.0361535  0.0566160   0.639 0.523283    
## NeighborhoodSWISU    -0.0886177  0.0627484  -1.412 0.158260    
## NeighborhoodTimber   -0.0083022  0.0547359  -0.152 0.879479    
## NeighborhoodVeenker  -0.0288405  0.0618963  -0.466 0.641379    
## log(area)             0.4935606  0.0244240  20.208  < 2e-16 ***
## Full.Bath            -0.0011532  0.0123762  -0.093 0.925782    
## Bedroom.AbvGr        -0.0326804  0.0074310  -4.398 1.24e-05 ***
## Year.Built            0.0041093  0.0003562  11.538  < 2e-16 ***
## log(Lot.Area)         0.1498777  0.0117607  12.744  < 2e-16 ***
## Central.AirY          0.1003716  0.0216710   4.632 4.24e-06 ***
## Overall.Cond          0.0531754  0.0044419  11.971  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1167 on 798 degrees of freedom
## Multiple R-squared:  0.9105, Adjusted R-squared:  0.9066 
## F-statistic:   232 on 35 and 798 DF,  p-value: < 2.2e-16

In deciding on features for the model I made the following choices. Given the wide spread in house ages overall, and the fact that a substantial number of new houses are being built, Year.Built seems a reasonable variable to include. Younger houses may need less work and older houses are likely cheaper. In this set Neighborhood also plays a significant role in price, as there is a large difference between neighborhoods in both median price and variance. However, the output reveals that quite a few of the neighborhood levels and a few other predictors are in fact not statistically significant. The remaining variables were chosen because they reflect features a buyer would reasonably ask about, such as Full.Bath, Central.Air and Overall.Qual.

Of the ten variables selected for this model, two were not statistically significant: log(Garage.Area + 1) and Full.Bath (which seems surprising). Neighborhood is a mixed bag: of its 26 level coefficients, each measured against the baseline neighborhood Blmngtn, 16 are not significant at the 0.05 level (15 if Sawyer, at p = 0.073, is counted as marginally significant). In a way this is good news as it narrows the number of areas to be examined, but it could also hide potential for growth in an undervalued area. However, because some of the Neighborhood levels are significant, we cannot drop the variable.
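
Whether a factor such as Neighborhood belongs in the model is better judged by a joint test of all its levels than by the individual level p-values, which only compare each level with the baseline. The sketch below illustrates this with simulated data; `sim`, `grp` and the effect sizes are hypothetical and unrelated to the Ames data.

```r
# Simulated data (hypothetical): a four-level factor where one level (B)
# barely differs from the baseline (A) but two others differ clearly
set.seed(1)
grp <- factor(rep(c("A", "B", "C", "D"), each = 30))
x <- rnorm(120)
shift <- c(A = 0, B = 0.05, C = 0.6, D = -0.5)
y <- 1 + 0.5 * x + unname(shift[as.character(grp)]) + rnorm(120, sd = 0.3)
sim <- data.frame(y = y, x = x, grp = grp)

full <- lm(y ~ x + grp, data = sim)   # model with the factor
reduced <- lm(y ~ x, data = sim)      # model without the factor

# Partial F-test: do the factor levels jointly improve the fit?
joint_test <- anova(reduced, full)
joint_p <- joint_test$`Pr(>F)`[2]
joint_p
```

For the model above, a comparable check would be `drop1(model_train, test = "F")`, which tests each term, including Neighborhood as a whole, in a single table.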

The R-squared value is quite high at 91%, meaning that 91% of the variation in log(price) is explained by the variables in this model. As previously noted, however, there is a certain level of unnecessary complexity in the model, as indicated by the number of statistically non-significant terms. Adding such terms increases the risk of overfitting, and the adjusted R-squared penalizes the fit for them. Even with this penalty it remains near 91% (0.9066), so this model is on good footing as is.
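
The size of the adjusted R-squared penalty can be checked by hand from the figures in the summary output. The sketch below uses the standard formula; the sample size is inferred from the residual degrees of freedom rather than read from the data.

```r
# Figures taken from the summary(model_train) output above
r2 <- 0.9105          # Multiple R-squared (as printed, rounded)
p <- 35               # number of estimated slope coefficients
n <- 798 + p + 1      # residual df + slopes + intercept = 834 observations

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

The result agrees with the reported 0.9066: with 834 observations, 35 coefficients cost less than half a percentage point of fit.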

The p-value of the F-test is extremely low, meaning that the null hypothesis that all slope coefficients are zero is rejected in favor of the alternative that at least one regression coefficient does not equal zero.

The intercept is not a realistic figure in this case. On the log scale it is negative (-1.746), and back-transforming gives exp(-1.746), which is about 0.17 dollars. This is not a viable value for a house and is far below the minimum observed price of 39,300 dollars. The intercept corresponds to a house with every predictor term equal to zero (including Year.Built = 0), a point far outside the data, so it should not be interpreted on its own.
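
Because the response is log(price), the slope estimates are easier to read once they are converted to percentage changes in price. The sketch below does this for two coefficients copied from the summary output, holding the other predictors fixed; the variable names are illustrative.

```r
# Estimates copied from the summary(model_train) output above
b_qual <- 0.0780147       # Overall.Qual (predictor on its original scale)
b_log_area <- 0.4935606   # log(area)    (predictor on the log scale)

# One extra point of Overall.Qual multiplies price by exp(b_qual)
pct_per_qual_point <- (exp(b_qual) - 1) * 100

# A 10% larger living area multiplies price by 1.10^b_log_area
pct_for_10pct_area <- (1.10^b_log_area - 1) * 100

round(c(qual = pct_per_qual_point, area = pct_for_10pct_area), 1)
```

Under this model, one additional quality point is associated with roughly 8% higher price, and a 10% increase in living area with roughly 5% higher price.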

Section 2.2 Model Selection

Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?

#AIC backward selection
AIC_back<-step(model_train, direction = "backward")

## Start:  AIC=-3547.66
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood + 
##     log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + Overall.Cond
## 
##                        Df Sum of Sq    RSS     AIC
## - Full.Bath             1    0.0001 10.871 -3549.6
## - log(Garage.Area + 1)  1    0.0247 10.896 -3547.8
## <none>                              10.871 -3547.7
## - Bedroom.AbvGr         1    0.2635 11.135 -3529.7
## - Central.Air           1    0.2922 11.163 -3527.5
## - Neighborhood         26    2.5528 13.424 -3423.7
## - Year.Built            1    1.8136 12.685 -3421.0
## - Overall.Cond          1    1.9523 12.823 -3411.9
## - log(Lot.Area)         1    2.2125 13.084 -3395.2
## - Overall.Qual          1    2.7342 13.605 -3362.6
## - log(area)             1    5.5632 16.434 -3205.0
## 
## Step:  AIC=-3549.65
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood + 
##     log(area) + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + Overall.Cond
## 
##                        Df Sum of Sq    RSS     AIC
## - log(Garage.Area + 1)  1    0.0250 10.896 -3549.7
## <none>                              10.871 -3549.6
## - Bedroom.AbvGr         1    0.2750 11.146 -3530.8
## - Central.Air           1    0.2958 11.167 -3529.3
## - Neighborhood         26    2.5617 13.433 -3425.2
## - Year.Built            1    1.8936 12.765 -3417.7
## - Overall.Cond          1    1.9527 12.824 -3413.9
## - log(Lot.Area)         1    2.2146 13.086 -3397.0
## - Overall.Qual          1    2.7387 13.610 -3364.3
## - log(area)             1    6.3141 17.185 -3169.7
## 
## Step:  AIC=-3549.73
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
##     Year.Built + log(Lot.Area) + Central.Air + Overall.Cond
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       10.896 -3549.7
## - Bedroom.AbvGr  1    0.2838 11.180 -3530.3
## - Central.Air    1    0.3532 11.250 -3525.1
## - Neighborhood  26    2.6129 13.509 -3422.5
## - Overall.Cond   1    1.9391 12.835 -3415.1
## - Year.Built     1    1.9794 12.876 -3412.5
## - log(Lot.Area)  1    2.3282 13.225 -3390.2
## - Overall.Qual   1    2.7611 13.657 -3363.4
## - log(area)      1    6.4673 17.364 -3163.1

#AIC forward selection
AIC_forw<-step(model_train, direction = "forward")

## Start:  AIC=-3547.66
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood + 
##     log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + Overall.Cond

summary(AIC_back)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     Overall.Cond, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72173 -0.06326  0.00098  0.06980  0.53260 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.8630744  0.7262336  -2.565 0.010487 *  
## Overall.Qual         0.0783002  0.0054995  14.238  < 2e-16 ***
## NeighborhoodBlueste -0.0692021  0.0814672  -0.849 0.395887    
## NeighborhoodBrDale  -0.1872665  0.0640904  -2.922 0.003577 ** 
## NeighborhoodBrkSide -0.0603686  0.0543550  -1.111 0.267059    
## NeighborhoodClearCr -0.0539172  0.0618208  -0.872 0.383387    
## NeighborhoodCollgCr -0.1030812  0.0483872  -2.130 0.033448 *  
## NeighborhoodCrawfor  0.0302845  0.0547911   0.553 0.580605    
## NeighborhoodEdwards -0.1443631  0.0507947  -2.842 0.004596 ** 
## NeighborhoodGilbert -0.1475456  0.0504250  -2.926 0.003530 ** 
## NeighborhoodGreens   0.0967099  0.0742574   1.302 0.193168    
## NeighborhoodGrnHill  0.2542540  0.0943951   2.694 0.007218 ** 
## NeighborhoodIDOTRR  -0.1726389  0.0558741  -3.090 0.002072 ** 
## NeighborhoodMeadowV -0.1642407  0.0558941  -2.938 0.003394 ** 
## NeighborhoodMitchel -0.0690971  0.0505470  -1.367 0.172014    
## NeighborhoodNAmes   -0.0748871  0.0495134  -1.512 0.130811    
## NeighborhoodNoRidge  0.0127871  0.0514149   0.249 0.803653    
## NeighborhoodNPkVill -0.0409370  0.0740100  -0.553 0.580330    
## NeighborhoodNridgHt  0.0736187  0.0496769   1.482 0.138748    
## NeighborhoodNWAmes  -0.1185271  0.0513334  -2.309 0.021199 *  
## NeighborhoodOldTown -0.1200146  0.0543088  -2.210 0.027398 *  
## NeighborhoodSawyer  -0.0918229  0.0510502  -1.799 0.072446 .  
## NeighborhoodSawyerW -0.1671598  0.0499781  -3.345 0.000862 ***
## NeighborhoodSomerst -0.0138799  0.0479576  -0.289 0.772335    
## NeighborhoodStoneBr  0.0347125  0.0565899   0.613 0.539783    
## NeighborhoodSWISU   -0.0895647  0.0627063  -1.428 0.153590    
## NeighborhoodTimber  -0.0100021  0.0547074  -0.183 0.854978    
## NeighborhoodVeenker -0.0297807  0.0615042  -0.484 0.628372    
## log(area)            0.4959963  0.0227621  21.790  < 2e-16 ***
## Bedroom.AbvGr       -0.0332892  0.0072932  -4.564 5.80e-06 ***
## Year.Built           0.0041612  0.0003452  12.055  < 2e-16 ***
## log(Lot.Area)        0.1521561  0.0116379  13.074  < 2e-16 ***
## Central.AirY         0.1071024  0.0210307   5.093 4.41e-07 ***
## Overall.Cond         0.0529576  0.0044383  11.932  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1167 on 800 degrees of freedom
## Multiple R-squared:  0.9103, Adjusted R-squared:  0.9066 
## F-statistic:   246 on 33 and 800 DF,  p-value: < 2.2e-16

summary(AIC_forw)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + log(Garage.Area + 1) + 
##     Neighborhood + log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + 
##     log(Lot.Area) + Central.Air + Overall.Cond, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73166 -0.06308  0.00166  0.07012  0.51978 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.7457973  0.7560257  -2.309 0.021188 *  
## Overall.Qual          0.0780147  0.0055068  14.167  < 2e-16 ***
## log(Garage.Area + 1)  0.0051522  0.0038302   1.345 0.178952    
## NeighborhoodBlueste  -0.0742720  0.0819874  -0.906 0.365265    
## NeighborhoodBrDale   -0.1907787  0.0646406  -2.951 0.003256 ** 
## NeighborhoodBrkSide  -0.0629624  0.0545630  -1.154 0.248870    
## NeighborhoodClearCr  -0.0536285  0.0619106  -0.866 0.386627    
## NeighborhoodCollgCr  -0.1028692  0.0485217  -2.120 0.034308 *  
## NeighborhoodCrawfor   0.0293478  0.0549179   0.534 0.593217    
## NeighborhoodEdwards  -0.1414351  0.0511227  -2.767 0.005796 ** 
## NeighborhoodGilbert  -0.1456983  0.0505066  -2.885 0.004023 ** 
## NeighborhoodGreens    0.0944600  0.0745767   1.267 0.205662    
## NeighborhoodGrnHill   0.2577918  0.0945396   2.727 0.006535 ** 
## NeighborhoodIDOTRR   -0.1711145  0.0560396  -3.053 0.002337 ** 
## NeighborhoodMeadowV  -0.1570883  0.0565996  -2.775 0.005642 ** 
## NeighborhoodMitchel  -0.0694029  0.0508342  -1.365 0.172549    
## NeighborhoodNAmes    -0.0763612  0.0500166  -1.527 0.127227    
## NeighborhoodNoRidge   0.0134845  0.0516248   0.261 0.794005    
## NeighborhoodNPkVill  -0.0436025  0.0740910  -0.588 0.556364    
## NeighborhoodNridgHt   0.0745068  0.0497750   1.497 0.134822    
## NeighborhoodNWAmes   -0.1185988  0.0514253  -2.306 0.021353 *  
## NeighborhoodOldTown  -0.1220336  0.0544072  -2.243 0.025173 *  
## NeighborhoodSawyer   -0.0926175  0.0515403  -1.797 0.072715 .  
## NeighborhoodSawyerW  -0.1663309  0.0500255  -3.325 0.000925 ***
## NeighborhoodSomerst  -0.0138265  0.0479958  -0.288 0.773363    
## NeighborhoodStoneBr   0.0361535  0.0566160   0.639 0.523283    
## NeighborhoodSWISU    -0.0886177  0.0627484  -1.412 0.158260    
## NeighborhoodTimber   -0.0083022  0.0547359  -0.152 0.879479    
## NeighborhoodVeenker  -0.0288405  0.0618963  -0.466 0.641379    
## log(area)             0.4935606  0.0244240  20.208  < 2e-16 ***
## Full.Bath            -0.0011532  0.0123762  -0.093 0.925782    
## Bedroom.AbvGr        -0.0326804  0.0074310  -4.398 1.24e-05 ***
## Year.Built            0.0041093  0.0003562  11.538  < 2e-16 ***
## log(Lot.Area)         0.1498777  0.0117607  12.744  < 2e-16 ***
## Central.AirY          0.1003716  0.0216710   4.632 4.24e-06 ***
## Overall.Cond          0.0531754  0.0044419  11.971  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1167 on 798 degrees of freedom
## Multiple R-squared:  0.9105, Adjusted R-squared:  0.9066 
## F-statistic:   232 on 35 and 798 DF,  p-value: < 2.2e-16

model_train_AIC<-lm(log(price) ~ Overall.Qual + log(Garage.Area + 1) +   
                  Neighborhood +log(area) + Full.Bath + Bedroom.AbvGr + Year.Built  +
                  log(Lot.Area) +  Central.Air + Overall.Cond,
                 data = ames_train)
# Model selection using AIC
model_train_AIC <- stepAIC(model_train_AIC, k = 2)

## Start:  AIC=-3547.66
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood + 
##     log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + Overall.Cond
## 
##                        Df Sum of Sq    RSS     AIC
## - Full.Bath             1    0.0001 10.871 -3549.6
## - log(Garage.Area + 1)  1    0.0247 10.896 -3547.8
## <none>                              10.871 -3547.7
## - Bedroom.AbvGr         1    0.2635 11.135 -3529.7
## - Central.Air           1    0.2922 11.163 -3527.5
## - Neighborhood         26    2.5528 13.424 -3423.7
## - Year.Built            1    1.8136 12.685 -3421.0
## - Overall.Cond          1    1.9523 12.823 -3411.9
## - log(Lot.Area)         1    2.2125 13.084 -3395.2
## - Overall.Qual          1    2.7342 13.605 -3362.6
## - log(area)             1    5.5632 16.434 -3205.0
## 
## Step:  AIC=-3549.65
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood + 
##     log(area) + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + Overall.Cond
## 
##                        Df Sum of Sq    RSS     AIC
## - log(Garage.Area + 1)  1    0.0250 10.896 -3549.7
## <none>                              10.871 -3549.6
## - Bedroom.AbvGr         1    0.2750 11.146 -3530.8
## - Central.Air           1    0.2958 11.167 -3529.3
## - Neighborhood         26    2.5617 13.433 -3425.2
## - Year.Built            1    1.8936 12.765 -3417.7
## - Overall.Cond          1    1.9527 12.824 -3413.9
## - log(Lot.Area)         1    2.2146 13.086 -3397.0
## - Overall.Qual          1    2.7387 13.610 -3364.3
## - log(area)             1    6.3141 17.185 -3169.7
## 
## Step:  AIC=-3549.73
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
##     Year.Built + log(Lot.Area) + Central.Air + Overall.Cond
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       10.896 -3549.7
## - Bedroom.AbvGr  1    0.2838 11.180 -3530.3
## - Central.Air    1    0.3532 11.250 -3525.1
## - Neighborhood  26    2.6129 13.509 -3422.5
## - Overall.Cond   1    1.9391 12.835 -3415.1
## - Year.Built     1    1.9794 12.876 -3412.5
## - log(Lot.Area)  1    2.3282 13.225 -3390.2
## - Overall.Qual   1    2.7611 13.657 -3363.4
## - log(area)      1    6.4673 17.364 -3163.1

model_train_AIC

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     Overall.Cond, data = ames_train)
## 
## Coefficients:
##         (Intercept)         Overall.Qual  NeighborhoodBlueste  
##           -1.863074             0.078300            -0.069202  
##  NeighborhoodBrDale  NeighborhoodBrkSide  NeighborhoodClearCr  
##           -0.187266            -0.060369            -0.053917  
## NeighborhoodCollgCr  NeighborhoodCrawfor  NeighborhoodEdwards  
##           -0.103081             0.030285            -0.144363  
## NeighborhoodGilbert   NeighborhoodGreens  NeighborhoodGrnHill  
##           -0.147546             0.096710             0.254254  
##  NeighborhoodIDOTRR  NeighborhoodMeadowV  NeighborhoodMitchel  
##           -0.172639            -0.164241            -0.069097  
##   NeighborhoodNAmes  NeighborhoodNoRidge  NeighborhoodNPkVill  
##           -0.074887             0.012787            -0.040937  
## NeighborhoodNridgHt   NeighborhoodNWAmes  NeighborhoodOldTown  
##            0.073619            -0.118527            -0.120015  
##  NeighborhoodSawyer  NeighborhoodSawyerW  NeighborhoodSomerst  
##           -0.091823            -0.167160            -0.013880  
## NeighborhoodStoneBr    NeighborhoodSWISU   NeighborhoodTimber  
##            0.034712            -0.089565            -0.010002  
## NeighborhoodVeenker            log(area)        Bedroom.AbvGr  
##           -0.029781             0.495996            -0.033289  
##          Year.Built        log(Lot.Area)         Central.AirY  
##            0.004161             0.152156             0.107102  
##        Overall.Cond  
##            0.052958

BIC(AIC_back)

## [1] -1015.525

BIC(AIC_forw)

## [1] -1003.996

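Forward and backward selection were both run using the Akaike Information Criterion (AIC), and the resulting models are nearly identical. Because the forward search was started from the full model with no larger scope, it had nothing to add and simply returned the initial ten-variable model, while backward elimination dropped Full.Bath and log(Garage.Area + 1). Both models have the same adjusted R-squared (0.9066). To assess this further, stepAIC was also run; with no scope supplied it too performs backward elimination, and it arrives at the same model as the backward selection. The AIC scores are still fairly close (-3549.73 versus -3547.66), so as a further check the Bayesian Information Criterion (BIC) was computed, and the backward selection model obtained the lower score (-1015.5 versus -1004.0).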

Since AIC and BIC agree on the backward model, that will be the one selected. It has 8 variables instead of 10, omitting log(Garage.Area + 1) and Full.Bath.
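
For comparison, forward selection is normally started from an intercept-only model and given a scope of candidate terms, and a BIC-based search can be obtained from `step()` by setting `k = log(n)`. The sketch below shows both on simulated data; `d`, `x1` to `x3` and the coefficients are hypothetical, and this is not the search that was run above.

```r
# Simulated data (hypothetical): y depends on x1 and x2 but not on x3
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n, sd = 0.5)

null_model <- lm(y ~ 1, data = d)             # starting point for forward search
full_model <- lm(y ~ x1 + x2 + x3, data = d)  # largest model allowed

# Forward selection by AIC: terms are added one at a time from the scope
fwd_aic <- step(null_model,
                scope = list(lower = formula(null_model),
                             upper = formula(full_model)),
                direction = "forward", trace = 0)

# Backward elimination with the BIC penalty (k = log(n) instead of k = 2)
bwd_bic <- step(full_model, direction = "backward", k = log(n), trace = 0)

formula(fwd_aic)
formula(bwd_bic)
```

Starting the two searches from opposite ends makes agreement between them more informative than two runs that both begin at the full model.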

Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.

model_price_select<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
    Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_train)
par(mfrow = c(2,2))
plot(model_price_select)

hist(model_price_select$residuals, col = "maroon", las = 1, main = "Selected Model Residual Histogram", label = T, ylim = c(0,325))

The assumptions of the model rest on four properties:

Normality of Residuals
Linearity of the numerical predictor variables with the response variable
Homoscedasticity
Residual independence

Normality of the residuals is assessed with the histogram and the Normal Q-Q plot. Both show a high degree of normality, although the Q-Q plot indicates some skew away from normality caused by the flagged outliers. Overall, the residuals are nearly normal.

Linearity is assessed in the first graph (Residuals vs Fitted). The residuals scatter around zero with no obvious shape or pattern, and the practically flat red line indicates that a linear relationship is adequate for the model. Again there are a few outliers, but the majority of points agree with the premise of linearity.

Homoscedasticity refers to how constant the variance of the residuals is. This is shown by the scatter of the points in the first graph: the majority fall between -0.5 and 0.5, with almost nothing outside this band, indicating a fairly constant variance. The Scale-Location graph also shows the spread of the residuals across the fitted values. Since there is no pattern and the red line is almost flat, with a little dip, we can be reasonably assured of good homoscedasticity in the model.

Independence means that the observations were randomly selected and that the selection of any particular observation has no bearing on another. Since no plot here tests this directly, we have to go back to first principles. The training, testing, and validation sets were obtained by randomly dividing the full data set, and the table has no structure that indicates any meaningful subsetting of the data, so any subsection is effectively a random sample of the entire pool. Any such division would therefore qualify as a simple random sample (SRS), and its observations can be treated as independent.

Residuals vs Leverage. The fourth graph helps clarify the nature of the outliers with respect to their leverage. Several of the graphical and analytical tools used so far have established that outliers are present. The question remains whether they should be left in or omitted to improve the model and its performance. Analytically this is expressed by Cook’s distance: if the distance is less than one then the point, outlier or not, is not influential, and its presence does not have a significant impact on the regression. As the graph shows, all of the points are well within this range and can be included in the final model without undue impact. Therefore, no outliers need to be deleted from the model.

Therefore the model derived from the initial one meets the conditions for being both parsimonious and free of obvious multicollinearity.
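
The Cook’s distance check described above can also be done numerically instead of by reading the plot. The sketch below is only an illustration on the built-in mtcars data; the cutoff of 1 is the rule of thumb used in this report, and the object names are assumptions. The same calls could be applied to model_price_select.

```r
# Illustrative fit on the built-in mtcars data (not the Ames model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Observations that exceed the rule-of-thumb cutoff of 1, if any
influential <- names(cd)[cd > 1]
influential

# The largest few distances, for a closer look at the candidates
head(sort(cd, decreasing = TRUE), 3)
```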

Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).

# Extract Predictions
predict.full <- exp(predict(model_price_select, ames_train))

# Extract Residuals
resid.full <- ames_train$price - predict.full

# Calculate RMSE
rmse.full <- sqrt(mean(resid.full^2))
rmse.full

## [1] 22665.42

The root mean square error (RMSE) is a measure of the standard deviation of the residuals. It is therefore an absolute measure of fit that shows how far the data typically lie from the regression line. The smaller it is, the better the fit. For purposes of prediction this is the better measure.

This model has an RMSE of 22,665.42 dollars. This value is 13% of the mean home price and 15% of the median home price. This is not an inordinate amount to vary by when purchasing a house, so the fit is relatively good.
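
The percentages quoted above follow directly from the RMSE and the price summary reported earlier in the document. A short check of that arithmetic, with the mean and median copied from the summary(ames_train$price) output (the variable names are only for illustration):

```r
# RMSE from the model above, in dollars
rmse <- 22665.42

# Mean and median price from the earlier summary() output
price_center <- c(mean = 174622, median = 155500)

# RMSE as a percentage of each, roughly 13.0 and 14.6
pct <- round(100 * rmse / price_center, 1)
pct
```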

Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?

model_price_select_test<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
    Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_test)
summary(model_price_select_test)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     Overall.Cond, data = ames_test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55288 -0.06329  0.00404  0.07388  0.33664 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.3197284  0.6920407  -0.462 0.644204    
## Overall.Qual         0.0788674  0.0054303  14.524  < 2e-16 ***
## NeighborhoodBlueste -0.0506033  0.0640896  -0.790 0.430017    
## NeighborhoodBrDale  -0.1174224  0.0577298  -2.034 0.042288 *  
## NeighborhoodBrkSide -0.1014693  0.0520253  -1.950 0.051486 .  
## NeighborhoodClearCr -0.0290978  0.0571392  -0.509 0.610724    
## NeighborhoodCollgCr -0.0645388  0.0457340  -1.411 0.158589    
## NeighborhoodCrawfor  0.0299842  0.0503906   0.595 0.551992    
## NeighborhoodEdwards -0.1592750  0.0488420  -3.261 0.001158 ** 
## NeighborhoodGilbert -0.1418487  0.0472388  -3.003 0.002760 ** 
## NeighborhoodIDOTRR  -0.1385458  0.0558523  -2.481 0.013326 *  
## NeighborhoodLandmrk -0.1396342  0.1251099  -1.116 0.264725    
## NeighborhoodMeadowV -0.1028101  0.0580819  -1.770 0.077101 .  
## NeighborhoodMitchel -0.0981988  0.0513926  -1.911 0.056400 .  
## NeighborhoodNAmes   -0.0931867  0.0469399  -1.985 0.047467 *  
## NeighborhoodNoRidge  0.0797727  0.0500766   1.593 0.111560    
## NeighborhoodNPkVill -0.0297085  0.0583926  -0.509 0.611055    
## NeighborhoodNridgHt  0.0743417  0.0485020   1.533 0.125739    
## NeighborhoodNWAmes  -0.1182678  0.0486513  -2.431 0.015283 *  
## NeighborhoodOldTown -0.1464827  0.0507662  -2.885 0.004016 ** 
## NeighborhoodSawyer  -0.1113037  0.0498445  -2.233 0.025829 *  
## NeighborhoodSawyerW -0.1188930  0.0483722  -2.458 0.014191 *  
## NeighborhoodSomerst -0.0151898  0.0457289  -0.332 0.739849    
## NeighborhoodStoneBr  0.0533240  0.0538454   0.990 0.322325    
## NeighborhoodSWISU   -0.0882204  0.0579430  -1.523 0.128278    
## NeighborhoodTimber  -0.0679612  0.0539005  -1.261 0.207733    
## NeighborhoodVeenker -0.0007065  0.0819207  -0.009 0.993121    
## log(area)            0.5112374  0.0228948  22.330  < 2e-16 ***
## Bedroom.AbvGr       -0.0362240  0.0069766  -5.192 2.65e-07 ***
## Year.Built           0.0033515  0.0003334  10.054  < 2e-16 ***
## log(Lot.Area)        0.1533616  0.0118716  12.918  < 2e-16 ***
## Central.AirY         0.0723886  0.0204066   3.547 0.000412 ***
## Overall.Cond         0.0481565  0.0044583  10.801  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1176 on 784 degrees of freedom
## Multiple R-squared:  0.9011, Adjusted R-squared:  0.8971 
## F-statistic: 223.3 on 32 and 784 DF,  p-value: < 2.2e-16

# Extract Predictions
predict_test <- exp(predict(model_price_select_test, ames_test))

# Extract Residuals
resid_test <- ames_test$price - predict_test

# Calculate RMSE
rmse_test<- sqrt(mean(resid_test^2))
rmse_test

## [1] 23197.04

# Predict prices
predict_test_select <- exp(predict(model_price_select_test, ames_test, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage_prob_test_select <- mean(ames_test$price > predict_test_select[,"lwr"] &
                            ames_test$price < predict_test_select[,"upr"])
coverage_prob_test_select

## [1] 0.9522644

model_price_select_test<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
    Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_train)

# Predict prices
predict_train<- exp(predict(model_price_select_test, ames_train, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage_prob_train <- mean(ames_train$price > predict_train[,"lwr"] &
                            ames_train$price < predict_train[,"upr"])
coverage_prob_train

## [1] 0.9580336

We can see from the application of the model to the test data that there is a difference in how well it fits. The p-value assures us that the model is significant; however, the R-squared and adjusted R-squared values have changed. Both are lower, but not drastically: the R-squared value is still approximately 90% and the adjusted R-squared value has dropped to 89.7%. Neither drop is large, so the model still fits rather well.

As expected, the root mean square error has gone up, but only by 2.3%, so the absolute fit of this model is on par with the previous one. The difference in RMSE is less than 1,000 dollars.

The model performs slightly better on the training data than on the test data, but the gap is small, so we do not have to be concerned at this point with the possibility of overfitting. The two results are close enough that in one sense we can say the model fits both data sets “equally well.”
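
A stricter out-of-sample check, which is not the one run above, keeps the coefficients estimated on the training data and only predicts on the test data. The sketch below shows that pattern on simulated data. The data frame homes, its columns, and the helper rmse_dollars are hypothetical stand-ins, not the Ames objects.

```r
# Simulated stand-in data; names and coefficients are hypothetical
set.seed(42)
n <- 400
homes <- data.frame(
  area = runif(n, 600, 3000),
  qual = sample(1:10, n, replace = TRUE)
)
homes$price <- exp(9 + 0.5 * log(homes$area) + 0.08 * homes$qual +
                   rnorm(n, sd = 0.12))

# Random split into training and test rows
train_idx <- sample(seq_len(n), size = 300)
train <- homes[train_idx, ]
test  <- homes[-train_idx, ]

# Fit on the training rows only
fit <- lm(log(price) ~ log(area) + qual, data = train)

# RMSE in dollars for a log-price model on any data set
rmse_dollars <- function(model, data) {
  pred <- exp(predict(model, newdata = data))
  sqrt(mean((data$price - pred)^2))
}

rmse_train <- rmse_dollars(fit, train)
rmse_test  <- rmse_dollars(fit, test)
c(train = rmse_train, test = rmse_test, ratio = rmse_test / rmse_train)
```

A ratio well above 1 would point to overfitting. One practical caveat: predict() stops with an error when the new data contain factor levels the model never saw. The summaries above suggest that the Landmrk neighborhood appears only in the test set, so such rows would need to be handled before applying the training model to ames_test.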

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

Section 3.1 Final Model

Provide the summary table for your model.

model_final <- lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
    Year.Built + log(Lot.Area) + Central.Air + Overall.Cond + Heating + Fireplaces + 
    Kitchen.Qual + log(Wood.Deck.SF + 1) + log(Open.Porch.SF + 1) + Bldg.Type + 
    House.Style + Overall.Qual:Neighborhood, data = ames_train)


summary(model_final)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     Overall.Cond + Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 
##     1) + log(Open.Porch.SF + 1) + Bldg.Type + House.Style + Overall.Qual:Neighborhood, 
##     data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69822 -0.05617  0.00425  0.06373  0.37444 
## 
## Coefficients: (3 not defined because of singularities)
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                      -0.5266563  1.0105996  -0.521 0.602427
## Overall.Qual                      0.0927874  0.0882472   1.051 0.293388
## NeighborhoodBlueste               0.0428123  0.1372979   0.312 0.755263
## NeighborhoodBrDale                0.7402694  0.8234023   0.899 0.368919
## NeighborhoodBrkSide               0.1541402  0.6529460   0.236 0.813443
## NeighborhoodClearCr               0.7954652  0.6979593   1.140 0.254772
## NeighborhoodCollgCr               0.2750019  0.6512966   0.422 0.672972
## NeighborhoodCrawfor               0.3523530  0.6601897   0.534 0.593696
## NeighborhoodEdwards               0.1583458  0.6520805   0.243 0.808202
## NeighborhoodGilbert               0.1477275  0.6604809   0.224 0.823077
## NeighborhoodGreens                0.0951448  0.0927763   1.026 0.305442
## NeighborhoodGrnHill               0.3719440  0.0901194   4.127 4.08e-05
## NeighborhoodIDOTRR                0.1853424  0.6517930   0.284 0.776214
## NeighborhoodMeadowV               0.3132852  0.6767949   0.463 0.643573
## NeighborhoodMitchel               0.2211560  0.6535294   0.338 0.735154
## NeighborhoodNAmes                 0.2772342  0.6500526   0.426 0.669880
## NeighborhoodNoRidge               0.4270951  0.7051283   0.606 0.544897
## NeighborhoodNPkVill               0.9068422  1.0011008   0.906 0.365306
## NeighborhoodNridgHt               0.0253128  0.6627167   0.038 0.969542
## NeighborhoodNWAmes                0.2961887  0.6630522   0.447 0.655216
## NeighborhoodOldTown               0.1704423  0.6489251   0.263 0.792889
## NeighborhoodSawyer                0.4324344  0.6559705   0.659 0.509950
## NeighborhoodSawyerW              -0.1968511  0.6623965  -0.297 0.766411
## NeighborhoodSomerst               0.3046157  0.6588591   0.462 0.643972
## NeighborhoodStoneBr               0.0185944  0.7537742   0.025 0.980326
## NeighborhoodSWISU                -0.0365519  0.7127408  -0.051 0.959113
## NeighborhoodTimber               -0.1839511  0.6779657  -0.271 0.786213
## NeighborhoodVeenker              -0.2877899  0.6784808  -0.424 0.671564
## log(area)                         0.5301257  0.0262830  20.170  < 2e-16
## Bedroom.AbvGr                    -0.0101726  0.0075518  -1.347 0.178373
## Year.Built                        0.0035099  0.0003594   9.765  < 2e-16
## log(Lot.Area)                     0.0919559  0.0134785   6.822 1.84e-11
## Central.AirY                      0.1306494  0.0219619   5.949 4.13e-09
## Overall.Cond                      0.0509427  0.0045747  11.136  < 2e-16
## HeatingGasW                       0.1444190  0.0451145   3.201 0.001426
## HeatingGrav                       0.0477921  0.1208976   0.395 0.692725
## HeatingOthW                       0.0050873  0.1158124   0.044 0.964974
## HeatingWall                       0.1339746  0.1141956   1.173 0.241084
## Fireplaces                        0.0317582  0.0075335   4.216 2.79e-05
## Kitchen.QualFa                   -0.1095302  0.0383594  -2.855 0.004416
## Kitchen.QualGd                   -0.0760636  0.0246296  -3.088 0.002087
## Kitchen.QualPo                    0.0153935  0.1248634   0.123 0.901916
## Kitchen.QualTA                   -0.1139836  0.0266646  -4.275 2.16e-05
## log(Wood.Deck.SF + 1)             0.0045636  0.0017350   2.630 0.008705
## log(Open.Porch.SF + 1)            0.0031072  0.0021995   1.413 0.158168
## Bldg.Type2fmCon                   0.0251457  0.0285940   0.879 0.379461
## Bldg.TypeDuplex                  -0.0921404  0.0245435  -3.754 0.000187
## Bldg.TypeTwnhs                   -0.0537371  0.0324561  -1.656 0.098200
## Bldg.TypeTwnhsE                  -0.0268705  0.0227535  -1.181 0.237998
## House.Style1.5Unf                 0.0603442  0.0449435   1.343 0.179783
## House.Style1Story                 0.0815747  0.0167729   4.863 1.40e-06
## House.Style2.5Unf                 0.0178577  0.0386041   0.463 0.643794
## House.Style2Story                -0.0039440  0.0164673  -0.240 0.810780
## House.StyleSFoyer                 0.1490007  0.0276012   5.398 9.01e-08
## House.StyleSLvl                   0.0660868  0.0238136   2.775 0.005654
## Overall.Qual:NeighborhoodBlueste         NA         NA      NA       NA
## Overall.Qual:NeighborhoodBrDale  -0.1341061  0.1259888  -1.064 0.287475
## Overall.Qual:NeighborhoodBrkSide -0.0174782  0.0901425  -0.194 0.846310
## Overall.Qual:NeighborhoodClearCr -0.1270657  0.0992514  -1.280 0.200853
## Overall.Qual:NeighborhoodCollgCr -0.0441635  0.0893042  -0.495 0.621077
## Overall.Qual:NeighborhoodCrawfor -0.0378912  0.0910554  -0.416 0.677431
## Overall.Qual:NeighborhoodEdwards -0.0359666  0.0900227  -0.400 0.689617
## Overall.Qual:NeighborhoodGilbert -0.0298331  0.0908822  -0.328 0.742805
## Overall.Qual:NeighborhoodGreens          NA         NA      NA       NA
## Overall.Qual:NeighborhoodGrnHill         NA         NA      NA       NA
## Overall.Qual:NeighborhoodIDOTRR  -0.0454130  0.0900849  -0.504 0.614329
## Overall.Qual:NeighborhoodMeadowV -0.0805197  0.0991498  -0.812 0.416989
## Overall.Qual:NeighborhoodMitchel -0.0315168  0.0900743  -0.350 0.726513
## Overall.Qual:NeighborhoodNAmes   -0.0473328  0.0894901  -0.529 0.597019
## Overall.Qual:NeighborhoodNoRidge -0.0467993  0.0955442  -0.490 0.624405
## Overall.Qual:NeighborhoodNPkVill -0.1377268  0.1509558  -0.912 0.361868
## Overall.Qual:NeighborhoodNridgHt  0.0062515  0.0904779   0.069 0.944933
## Overall.Qual:NeighborhoodNWAmes  -0.0515192  0.0914319  -0.563 0.573281
## Overall.Qual:NeighborhoodOldTown -0.0328543  0.0891315  -0.369 0.712526
## Overall.Qual:NeighborhoodSawyer  -0.0793605  0.0910731  -0.871 0.383816
## Overall.Qual:NeighborhoodSawyerW  0.0218426  0.0914242   0.239 0.811236
## Overall.Qual:NeighborhoodSomerst -0.0323187  0.0901753  -0.358 0.720145
## Overall.Qual:NeighborhoodStoneBr  0.0043993  0.1006984   0.044 0.965165
## Overall.Qual:NeighborhoodSWISU    0.0098192  0.1028172   0.096 0.923942
## Overall.Qual:NeighborhoodTimber   0.0262250  0.0924563   0.284 0.776758
## Overall.Qual:NeighborhoodVeenker  0.0438011  0.0931258   0.470 0.638245
##                                     
## (Intercept)                         
## Overall.Qual                        
## NeighborhoodBlueste                 
## NeighborhoodBrDale                  
## NeighborhoodBrkSide                 
## NeighborhoodClearCr                 
## NeighborhoodCollgCr                 
## NeighborhoodCrawfor                 
## NeighborhoodEdwards                 
## NeighborhoodGilbert                 
## NeighborhoodGreens                  
## NeighborhoodGrnHill              ***
## NeighborhoodIDOTRR                  
## NeighborhoodMeadowV                 
## NeighborhoodMitchel                 
## NeighborhoodNAmes                   
## NeighborhoodNoRidge                 
## NeighborhoodNPkVill                 
## NeighborhoodNridgHt                 
## NeighborhoodNWAmes                  
## NeighborhoodOldTown                 
## NeighborhoodSawyer                  
## NeighborhoodSawyerW                 
## NeighborhoodSomerst                 
## NeighborhoodStoneBr                 
## NeighborhoodSWISU                   
## NeighborhoodTimber                  
## NeighborhoodVeenker                 
## log(area)                        ***
## Bedroom.AbvGr                       
## Year.Built                       ***
## log(Lot.Area)                    ***
## Central.AirY                     ***
## Overall.Cond                     ***
## HeatingGasW                      ** 
## HeatingGrav                         
## HeatingOthW                         
## HeatingWall                         
## Fireplaces                       ***
## Kitchen.QualFa                   ** 
## Kitchen.QualGd                   ** 
## Kitchen.QualPo                      
## Kitchen.QualTA                   ***
## log(Wood.Deck.SF + 1)            ** 
## log(Open.Porch.SF + 1)              
## Bldg.Type2fmCon                     
## Bldg.TypeDuplex                  ***
## Bldg.TypeTwnhs                   .  
## Bldg.TypeTwnhsE                     
## House.Style1.5Unf                   
## House.Style1Story                ***
## House.Style2.5Unf                   
## House.Style2Story                   
## House.StyleSFoyer                ***
## House.StyleSLvl                  ** 
## Overall.Qual:NeighborhoodBlueste    
## Overall.Qual:NeighborhoodBrDale     
## Overall.Qual:NeighborhoodBrkSide    
## Overall.Qual:NeighborhoodClearCr    
## Overall.Qual:NeighborhoodCollgCr    
## Overall.Qual:NeighborhoodCrawfor    
## Overall.Qual:NeighborhoodEdwards    
## Overall.Qual:NeighborhoodGilbert    
## Overall.Qual:NeighborhoodGreens     
## Overall.Qual:NeighborhoodGrnHill    
## Overall.Qual:NeighborhoodIDOTRR     
## Overall.Qual:NeighborhoodMeadowV    
## Overall.Qual:NeighborhoodMitchel    
## Overall.Qual:NeighborhoodNAmes      
## Overall.Qual:NeighborhoodNoRidge    
## Overall.Qual:NeighborhoodNPkVill    
## Overall.Qual:NeighborhoodNridgHt    
## Overall.Qual:NeighborhoodNWAmes     
## Overall.Qual:NeighborhoodOldTown    
## Overall.Qual:NeighborhoodSawyer     
## Overall.Qual:NeighborhoodSawyerW    
## Overall.Qual:NeighborhoodSomerst    
## Overall.Qual:NeighborhoodStoneBr    
## Overall.Qual:NeighborhoodSWISU      
## Overall.Qual:NeighborhoodTimber     
## Overall.Qual:NeighborhoodVeenker    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1054 on 756 degrees of freedom
## Multiple R-squared:  0.9308, Adjusted R-squared:  0.9238 
## F-statistic: 132.1 on 77 and 756 DF,  p-value: < 2.2e-16

As we can see from the p-value, the model is significant at the 0.05 level of significance. It accounts for approximately 93% of the variability by R-squared and 92% by adjusted R-squared.

From the model above it appears that Neighborhood (only Green Hill), log(area), Year.Built, log(Lot.Area), Central.Air, Overall.Cond, Heating (GasW), Fireplaces, Kitchen.Qual, log(Wood.Deck.SF + 1), Bldg.Type (Duplex) and House.Style are the significant variables at the 5% level, meaning that in a hypothesis test at least one of their coefficients is likely not zero (the null hypothesis is rejected), whereas for the remaining terms the null hypothesis cannot be rejected. Note that the Overall.Qual main effect is not individually significant in this fit (p = 0.29), which includes its interaction with Neighborhood.

The interaction between Neighborhood and Overall Quality will be discussed below.

Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.

Yes, I transformed the variables price, area, Lot.Area, Wood.Deck.SF, and Open.Porch.SF by taking logs. Wood.Deck.SF and Open.Porch.SF can equal zero, where the log is undefined, so for these covariates the transformation adds 1 to the term and takes the log of the sum.
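
The reason for adding 1 can be seen with a few made-up deck sizes (the vector deck_sf below is hypothetical): a plain log turns a zero into negative infinity, which lm() cannot use, while the shifted log maps zero to zero.

```r
# Hypothetical deck areas in square feet, including a house with no deck
deck_sf <- c(0, 120, 400)

plain   <- log(deck_sf)      # first element is -Inf
shifted <- log(deck_sf + 1)  # first element is 0
built_in <- log1p(deck_sf)   # base R shortcut for log(x + 1)

plain
shifted
all.equal(shifted, built_in)
```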

Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.

inter_reg<- lm(log(price)~Overall.Qual:Neighborhood, ames_test)
summary(inter_reg)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual:Neighborhood, data = ames_test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83152 -0.10348  0.00566  0.10949  0.63202 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      10.962179   0.040904 267.997  < 2e-16 ***
## Overall.Qual:NeighborhoodBlmngtn  0.169859   0.010737  15.820  < 2e-16 ***
## Overall.Qual:NeighborhoodBlueste  0.135850   0.012028  11.295  < 2e-16 ***
## Overall.Qual:NeighborhoodBrDale   0.110066   0.012388   8.885  < 2e-16 ***
## Overall.Qual:NeighborhoodBrkSide  0.148776   0.009920  14.998  < 2e-16 ***
## Overall.Qual:NeighborhoodClearCr  0.217461   0.009850  22.077  < 2e-16 ***
## Overall.Qual:NeighborhoodCollgCr  0.184400   0.006730  27.398  < 2e-16 ***
## Overall.Qual:NeighborhoodCrawfor  0.200134   0.008040  24.891  < 2e-16 ***
## Overall.Qual:NeighborhoodEdwards  0.154885   0.009209  16.819  < 2e-16 ***
## Overall.Qual:NeighborhoodGilbert  0.179308   0.007455  24.053  < 2e-16 ***
## Overall.Qual:NeighborhoodIDOTRR   0.139787   0.012129  11.525  < 2e-16 ***
## Overall.Qual:NeighborhoodLandmrk  0.144259   0.031285   4.611 4.67e-06 ***
## Overall.Qual:NeighborhoodMeadowV  0.134794   0.015531   8.679  < 2e-16 ***
## Overall.Qual:NeighborhoodMitchel  0.185488   0.009926  18.686  < 2e-16 ***
## Overall.Qual:NeighborhoodNAmes    0.171970   0.007998  21.502  < 2e-16 ***
## Overall.Qual:NeighborhoodNoRidge  0.219160   0.006571  33.350  < 2e-16 ***
## Overall.Qual:NeighborhoodNPkVill  0.147395   0.011778  12.514  < 2e-16 ***
## Overall.Qual:NeighborhoodNridgHt  0.198724   0.006426  30.927  < 2e-16 ***
## Overall.Qual:NeighborhoodNWAmes   0.183963   0.007804  23.574  < 2e-16 ***
## Overall.Qual:NeighborhoodOldTown  0.142932   0.008322  17.175  < 2e-16 ***
## Overall.Qual:NeighborhoodSawyer   0.170038   0.010070  16.886  < 2e-16 ***
## Overall.Qual:NeighborhoodSawyerW  0.184639   0.007875  23.447  < 2e-16 ***
## Overall.Qual:NeighborhoodSomerst  0.176890   0.006647  26.611  < 2e-16 ***
## Overall.Qual:NeighborhoodStoneBr  0.188703   0.007911  23.855  < 2e-16 ***
## Overall.Qual:NeighborhoodSWISU    0.162032   0.011133  14.554  < 2e-16 ***
## Overall.Qual:NeighborhoodTimber   0.196954   0.008751  22.506  < 2e-16 ***
## Overall.Qual:NeighborhoodVeenker  0.220654   0.018706  11.796  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1832 on 790 degrees of freedom
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.7501 
## F-statistic: 95.18 on 26 and 790 DF,  p-value: < 2.2e-16

Yes, I did include a variable interaction. The one interaction that I felt was important was between the overall quality of the house and the neighborhood to which it belongs, because the question could arise of whether the two are related. Does neighborhood matter for the quality of the house?

Taken by itself, the above table seems to make ALL neighborhoods significant. However, an examination of the R-squared (76%) and adjusted R-squared (75%) shows a relationship that, while strong within itself, is weak overall for price prediction.

Consequently, there are other variables exerting at least as great an influence on the predictive strength of the model as Neighborhood.

This forms the basis for my variable selection.
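
Another way to judge whether an interaction earns its place, offered here only as a sketch and not as the method used in this report, is to compare the model with and without the interaction using a nested-model F-test. The example uses the built-in mtcars data, with wt and cyl standing in for a numeric predictor and a grouping factor.

```r
# Built-in data with cyl treated as a factor (illustrative stand-in)
mt <- transform(mtcars, cyl = factor(cyl))

# Main effects only, then main effects plus the interaction
reduced <- lm(mpg ~ wt + cyl, data = mt)
full    <- lm(mpg ~ wt * cyl, data = mt)

# F-test for the extra interaction terms
cmp <- anova(reduced, full)
cmp
```

A large p-value in the second row suggests the interaction terms add little beyond the main effects, which is the same kind of conclusion a stepwise AIC search reaches when it drops an interaction.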

Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.

model_price_select_test<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
    Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_test)
summary(model_price_select_test)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     Overall.Cond, data = ames_test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55288 -0.06329  0.00404  0.07388  0.33664 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.3197284  0.6920407  -0.462 0.644204    
## Overall.Qual         0.0788674  0.0054303  14.524  < 2e-16 ***
## NeighborhoodBlueste -0.0506033  0.0640896  -0.790 0.430017    
## NeighborhoodBrDale  -0.1174224  0.0577298  -2.034 0.042288 *  
## NeighborhoodBrkSide -0.1014693  0.0520253  -1.950 0.051486 .  
## NeighborhoodClearCr -0.0290978  0.0571392  -0.509 0.610724    
## NeighborhoodCollgCr -0.0645388  0.0457340  -1.411 0.158589    
## NeighborhoodCrawfor  0.0299842  0.0503906   0.595 0.551992    
## NeighborhoodEdwards -0.1592750  0.0488420  -3.261 0.001158 ** 
## NeighborhoodGilbert -0.1418487  0.0472388  -3.003 0.002760 ** 
## NeighborhoodIDOTRR  -0.1385458  0.0558523  -2.481 0.013326 *  
## NeighborhoodLandmrk -0.1396342  0.1251099  -1.116 0.264725    
## NeighborhoodMeadowV -0.1028101  0.0580819  -1.770 0.077101 .  
## NeighborhoodMitchel -0.0981988  0.0513926  -1.911 0.056400 .  
## NeighborhoodNAmes   -0.0931867  0.0469399  -1.985 0.047467 *  
## NeighborhoodNoRidge  0.0797727  0.0500766   1.593 0.111560    
## NeighborhoodNPkVill -0.0297085  0.0583926  -0.509 0.611055    
## NeighborhoodNridgHt  0.0743417  0.0485020   1.533 0.125739    
## NeighborhoodNWAmes  -0.1182678  0.0486513  -2.431 0.015283 *  
## NeighborhoodOldTown -0.1464827  0.0507662  -2.885 0.004016 ** 
## NeighborhoodSawyer  -0.1113037  0.0498445  -2.233 0.025829 *  
## NeighborhoodSawyerW -0.1188930  0.0483722  -2.458 0.014191 *  
## NeighborhoodSomerst -0.0151898  0.0457289  -0.332 0.739849    
## NeighborhoodStoneBr  0.0533240  0.0538454   0.990 0.322325    
## NeighborhoodSWISU   -0.0882204  0.0579430  -1.523 0.128278    
## NeighborhoodTimber  -0.0679612  0.0539005  -1.261 0.207733    
## NeighborhoodVeenker -0.0007065  0.0819207  -0.009 0.993121    
## log(area)            0.5112374  0.0228948  22.330  < 2e-16 ***
## Bedroom.AbvGr       -0.0362240  0.0069766  -5.192 2.65e-07 ***
## Year.Built           0.0033515  0.0003334  10.054  < 2e-16 ***
## log(Lot.Area)        0.1533616  0.0118716  12.918  < 2e-16 ***
## Central.AirY         0.0723886  0.0204066   3.547 0.000412 ***
## Overall.Cond         0.0481565  0.0044583  10.801  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1176 on 784 degrees of freedom
## Multiple R-squared:  0.9011, Adjusted R-squared:  0.8971 
## F-statistic: 223.3 on 32 and 784 DF,  p-value: < 2.2e-16

I have called back the model to illustrate a point about the interactions and the selection process. In the previous section it seemed that neighborhood was important, but the model had weak R-squared values. When other variables are introduced, the significance of neighborhood drops off in favor of other things.

Why is this?

Simply put, people in general do not make purchases rationally. Nowhere is this seen more clearly than in home buying. The issues that matter to people are the ones that provide emotional satisfaction and a sense of place for the future. These are not always variables that seem important at first.

Something like a fireplace or a deck can do more for the desirability of a house than the neighborhood, especially if you can still get a deal on the overall price and quality.

If you look at the additional variables, many of them refer to the creature-comfort aspect of a house. I therefore tried to build a model that included them, and the result was a better predictive fit. In conjunction with the dethroning of the interaction between overall quality and neighborhood, this points to the subjective nature of the decision process when accepting or rejecting a house price.

This is demonstrated below by using stepwise AIC on the final model (model_final).

model_final_AIC<-stepAIC(model_final, k =2)

## Start:  AIC=-3678.25
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
##     Year.Built + log(Lot.Area) + Central.Air + Overall.Cond + 
##     Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 
##     1) + log(Open.Porch.SF + 1) + Bldg.Type + House.Style + Overall.Qual:Neighborhood
## 
##                             Df Sum of Sq     RSS     AIC
## - Overall.Qual:Neighborhood 23    0.4274  8.8322 -3682.9
## - Bedroom.AbvGr              1    0.0202  8.4250 -3678.3
## <none>                                    8.4049 -3678.3
## - log(Open.Porch.SF + 1)     1    0.0222  8.4270 -3678.1
## - Heating                    4    0.1280  8.5329 -3673.6
## - log(Wood.Deck.SF + 1)      1    0.0769  8.4818 -3672.7
## - Bldg.Type                  4    0.2057  8.6105 -3666.1
## - Kitchen.Qual               4    0.2559  8.6608 -3661.2
## - Fireplaces                 1    0.1976  8.6024 -3660.9
## - Central.Air                1    0.3934  8.7983 -3642.1
## - log(Lot.Area)              1    0.5175  8.9223 -3630.4
## - House.Style                6    0.6566  9.0615 -3627.5
## - Year.Built                 1    1.0600  9.4649 -3581.2
## - Overall.Cond               1    1.3786  9.7835 -3553.6
## - log(area)                  1    4.5229 12.9278 -3321.2
## 
## Step:  AIC=-3682.89
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
##     Year.Built + log(Lot.Area) + Central.Air + Overall.Cond + 
##     Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 
##     1) + log(Open.Porch.SF + 1) + Bldg.Type + House.Style
## 
##                          Df Sum of Sq     RSS     AIC
## - log(Open.Porch.SF + 1)  1    0.0198  8.8520 -3683.0
## <none>                                 8.8322 -3682.9
## - Bedroom.AbvGr           1    0.0392  8.8714 -3681.2
## - log(Wood.Deck.SF + 1)   1    0.0713  8.9035 -3678.2
## - Heating                 4    0.1437  8.9759 -3677.4
## - Bldg.Type               4    0.2290  9.0612 -3669.5
## - Fireplaces              1    0.1815  9.0137 -3667.9
## - Kitchen.Qual            4    0.4806  9.3128 -3646.7
## - Central.Air             1    0.4349  9.2672 -3644.8
## - log(Lot.Area)           1    0.5785  9.4107 -3632.0
## - House.Style             6    0.6929  9.5252 -3631.9
## - Year.Built              1    1.2252 10.0574 -3576.6
## - Neighborhood           26    1.9989 10.8311 -3564.7
## - Overall.Qual            1    1.3736 10.2058 -3564.3
## - Overall.Cond            1    1.3963 10.2285 -3562.5
## - log(area)               1    4.6351 13.4673 -3333.1
## 
## Step:  AIC=-3683.02
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
##     Year.Built + log(Lot.Area) + Central.Air + Overall.Cond + 
##     Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 
##     1) + Bldg.Type + House.Style
## 
##                         Df Sum of Sq     RSS     AIC
## <none>                                8.8520 -3683.0
## - Bedroom.AbvGr          1    0.0447  8.8967 -3680.8
## - log(Wood.Deck.SF + 1)  1    0.0668  8.9188 -3678.8
## - Heating                4    0.1386  8.9906 -3678.1
## - Bldg.Type              4    0.2359  9.0880 -3669.1
## - Fireplaces             1    0.1782  9.0302 -3668.4
## - Kitchen.Qual           4    0.4910  9.3430 -3646.0
## - Central.Air            1    0.4331  9.2851 -3645.2
## - log(Lot.Area)          1    0.5735  9.4255 -3632.7
## - House.Style            6    0.6979  9.5499 -3631.7
## - Year.Built             1    1.2827 10.1347 -3572.2
## - Neighborhood          26    1.9874 10.8395 -3566.1
## - Overall.Cond           1    1.3795 10.2315 -3564.2
## - Overall.Qual           1    1.3999 10.2519 -3562.6
## - log(area)              1    4.9036 13.7556 -3317.4

summary(model_final_AIC)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     Overall.Cond + Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 
##     1) + Bldg.Type + House.Style, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68115 -0.05809 -0.00013  0.06399  0.39079 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.5450732  0.7297305  -0.747 0.455318    
## Overall.Qual           0.0606529  0.0054612  11.106  < 2e-16 ***
## NeighborhoodBlueste    0.0099167  0.0772420   0.128 0.897877    
## NeighborhoodBrDale    -0.0665733  0.0634705  -1.049 0.294555    
## NeighborhoodBrkSide    0.0031492  0.0533136   0.059 0.952912    
## NeighborhoodClearCr    0.0210728  0.0595575   0.354 0.723568    
## NeighborhoodCollgCr   -0.0414695  0.0471163  -0.880 0.379048    
## NeighborhoodCrawfor    0.0890713  0.0529027   1.684 0.092643 .  
## NeighborhoodEdwards   -0.0929931  0.0489076  -1.901 0.057617 .  
## NeighborhoodGilbert   -0.0727539  0.0493584  -1.474 0.140887    
## NeighborhoodGreens     0.1277815  0.0689483   1.853 0.064218 .  
## NeighborhoodGrnHill    0.3479941  0.0870026   4.000 6.94e-05 ***
## NeighborhoodIDOTRR    -0.1082178  0.0545228  -1.985 0.047515 *  
## NeighborhoodMeadowV   -0.1244833  0.0533691  -2.332 0.019928 *  
## NeighborhoodMitchel   -0.0094556  0.0489189  -0.193 0.846782    
## NeighborhoodNAmes     -0.0347972  0.0479963  -0.725 0.468672    
## NeighborhoodNoRidge    0.0767954  0.0497886   1.542 0.123375    
## NeighborhoodNPkVill    0.0276138  0.0707151   0.390 0.696278    
## NeighborhoodNridgHt    0.0865020  0.0476483   1.815 0.069842 .  
## NeighborhoodNWAmes    -0.0587472  0.0493722  -1.190 0.234454    
## NeighborhoodOldTown   -0.0606729  0.0529408  -1.146 0.252125    
## NeighborhoodSawyer    -0.0452126  0.0492553  -0.918 0.358942    
## NeighborhoodSawyerW   -0.0910218  0.0482280  -1.887 0.059488 .  
## NeighborhoodSomerst    0.0658933  0.0462688   1.424 0.154806    
## NeighborhoodStoneBr    0.0659977  0.0526610   1.253 0.210489    
## NeighborhoodSWISU     -0.0234563  0.0603135  -0.389 0.697451    
## NeighborhoodTimber     0.0149122  0.0522014   0.286 0.775210    
## NeighborhoodVeenker    0.0114988  0.0574368   0.200 0.841377    
## log(area)              0.5358828  0.0257803  20.787  < 2e-16 ***
## Bedroom.AbvGr         -0.0144719  0.0072904  -1.985 0.047486 *  
## Year.Built             0.0036432  0.0003427  10.631  < 2e-16 ***
## log(Lot.Area)          0.0936459  0.0131734   7.109 2.65e-12 ***
## Central.AirY           0.1333048  0.0215789   6.178 1.05e-09 ***
## Overall.Cond           0.0477292  0.0043292  11.025  < 2e-16 ***
## HeatingGasW            0.1423503  0.0446255   3.190 0.001480 ** 
## HeatingGrav            0.0501757  0.1190623   0.421 0.673562    
## HeatingOthW            0.0184439  0.1127412   0.164 0.870092    
## HeatingWall            0.1572899  0.1108588   1.419 0.156348    
## Fireplaces             0.0297317  0.0075033   3.963 8.10e-05 ***
## Kitchen.QualFa        -0.1417247  0.0364427  -3.889 0.000109 ***
## Kitchen.QualGd        -0.1054297  0.0215640  -4.889 1.23e-06 ***
## Kitchen.QualPo         0.0439225  0.1188289   0.370 0.711760    
## Kitchen.QualTA        -0.1482163  0.0239487  -6.189 9.78e-10 ***
## log(Wood.Deck.SF + 1)  0.0041417  0.0017073   2.426 0.015498 *  
## Bldg.Type2fmCon        0.0243767  0.0286304   0.851 0.394794    
## Bldg.TypeDuplex       -0.0921453  0.0243434  -3.785 0.000165 ***
## Bldg.TypeTwnhs        -0.0686228  0.0311169  -2.205 0.027722 *  
## Bldg.TypeTwnhsE       -0.0339208  0.0216769  -1.565 0.118026    
## House.Style1.5Unf      0.0589464  0.0444919   1.325 0.185600    
## House.Style1Story      0.0835486  0.0166983   5.003 6.96e-07 ***
## House.Style2.5Unf      0.0207134  0.0380180   0.545 0.586025    
## House.Style2Story     -0.0023239  0.0163394  -0.142 0.886939    
## House.StyleSFoyer      0.1505716  0.0273637   5.503 5.08e-08 ***
## House.StyleSLvl        0.0581141  0.0237033   2.452 0.014435 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1065 on 780 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9222 
## F-statistic: 187.3 on 53 and 780 DF,  p-value: < 2.2e-16

As we can see, the model with the lowest AIC (-3683) includes many “amenities” of the house, and once again, for predicting price, Neighborhood is all but irrelevant except for a few areas. The subjective component in this final model is shown by which variables are significant.
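As a rough check on the AIC values in the stepwise trace, they can be reproduced by hand. For a linear model, the AIC that stepAIC reports (via extractAIC) is n * log(RSS / n) + k * edf, where edf is the number of estimated coefficients and k = 2. The sketch below only plugs in numbers already printed above; the variable names are mine and are not part of the original analysis.

```r
# Numbers taken from the output above
n <- 780 + 53 + 1        # residual DF + model DF + intercept = 834 observations
k <- 2                   # AIC penalty per coefficient (the k = 2 passed to stepAIC)

# Final model chosen by stepAIC: 53 slopes + intercept, RSS = 8.8520
edf_final <- 53 + 1
aic_final <- n * log(8.8520 / n) + k * edf_final

# Starting model: the final model plus log(Open.Porch.SF + 1) (1 df)
# and the Overall.Qual:Neighborhood interaction (23 df), RSS = 8.4049
edf_start <- edf_final + 1 + 23
aic_start <- n * log(8.4049 / n) + k * edf_start

round(c(start = aic_start, final = aic_final), 2)
# start is about -3678.25 and final about -3683.02, matching the trace
```

The interaction lowers the RSS but costs 23 extra coefficients, which is why dropping it improves the AIC even though the fit itself gets slightly worse.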

Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


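# Note: this refits the final formula on ames_test, so the coverage computed
# below is measured on the same data the model was fit to
model_final_AIC<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
    Year.Built + log(Lot.Area) + Central.Air + Overall.Cond + 
    Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 
    1) + Bldg.Type + House.Style, data=ames_test)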

# Predict prices
predict_final <- exp(predict(model_final_AIC, ames_test, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage_prob_test_final <- mean(ames_test$price > predict_final[,"lwr"] &
                            ames_test$price < predict_final[,"upr"])
coverage_prob_test_final

## [1] 0.9583843

With this final model, the coverage of the prediction intervals on the test data has improved to about 96%.
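For comparison, a strictly out-of-sample check fits the model on one set of rows and computes coverage and RMSE on rows the model has never seen. The sketch below uses simulated data (not the Ames data) and a single hypothetical predictor, so the numbers it produces say nothing about the models in this report; it only illustrates the workflow.

```r
set.seed(42)  # simulated data, for illustration only

# Simulate a log-linear price relationship
n <- 1000
area  <- runif(n, 600, 3000)
price <- exp(7 + 0.7 * log(area) + rnorm(n, sd = 0.15))
homes <- data.frame(area = area, price = price)

# Hold out 300 rows that the model never sees during fitting
train_idx <- sample(n, 700)
train <- homes[train_idx, ]
test  <- homes[-train_idx, ]

# Fit on the training rows only
fit <- lm(log(price) ~ log(area), data = train)

# 95% prediction intervals on the log scale, back-transformed to dollars
pred <- exp(predict(fit, newdata = test, interval = "prediction"))

# Proportion of held-out prices that fall inside their prediction interval
coverage <- mean(test$price > pred[, "lwr"] & test$price < pred[, "upr"])

# RMSE on the dollar scale
rmse <- sqrt(mean((test$price - pred[, "fit"])^2))

coverage
rmse
```

If the model assumptions hold, the out-of-sample coverage should land close to the nominal 0.95.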

To further refine the analysis, I'm going to add a Bayesian analysis.

BAYESIAN ASSESSMENT

model_bayes<- bas.lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
    Year.Built + log(Lot.Area) + Central.Air + Overall.Cond + 
    Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 
    1) + Bldg.Type + House.Style, data=ames_test, prior = "ZS-null", modelprior = uniform(),method = "MCMC")

par(mfrow = c(1,2))
diagnostics(model_bayes)

image(model_bayes, rotate = T, cex.lab = .6)

plot(model_bayes, which = 4, cex.lab  = .8)

The Bayesian Model Averaging (BMA) used here relies on Markov Chain Monte Carlo (MCMC) sampling. Rather than enumerating all 2^p possible models, where p is the number of predictor terms, the sampler proposes a candidate model, uses the Bayes factor to evaluate its posterior probability, and compares it with that of the current model.

The prior on the coefficients is based on Zellner's g-prior. Subtracting the sample mean from each predictor gives the intercept the same meaning across all models. The prior variance of the coefficients is the variance from ordinary least squares scaled by a single scalar parameter g, which simplifies the posterior because prior elicitation reduces to choosing g and the prior mean b0. However, a fixed g has some disadvantages: the information paradox, Bartlett's paradox, and a lack of uncertainty about g, so that g cannot be updated by the data. To avoid those drawbacks, the g-prior can be modified by putting a prior on g, as in the hyper-g/n prior and the Zellner-Siow Cauchy prior. The model above uses the Zellner-Siow prior (prior = "ZS-null").

We will assume a model prior that is uniform across all models.
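To make the idea of 2^p models and inclusion probabilities concrete, here is a minimal sketch on the built-in mtcars data with p = 3 predictors, small enough to enumerate every model instead of sampling with MCMC. It swaps in a BIC approximation to the posterior model probabilities under a uniform model prior, which is not the Zellner-Siow prior used by bas.lm above, so treat it as an illustration of the bookkeeping only. The predictor choice and variable names are my own assumptions.

```r
# Illustration only: enumerate all 2^p models and approximate posterior model
# probabilities with BIC weights (NOT the Zellner-Siow prior used by bas.lm)
predictors <- c("wt", "hp", "qsec")   # p = 3, so 2^3 = 8 candidate models
p <- length(predictors)

# Each row is one model; TRUE means the predictor is included
inclusion <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))
colnames(inclusion) <- predictors

# Fit every model and record its BIC
bic <- apply(inclusion, 1, function(keep) {
  rhs <- if (any(keep)) paste(predictors[keep], collapse = " + ") else "1"
  BIC(lm(as.formula(paste("mpg ~", rhs)), data = mtcars))
})

# With a uniform model prior, exp(-BIC / 2) is roughly proportional to the
# posterior model probability; subtracting min(bic) avoids numerical underflow
post_prob <- exp(-0.5 * (bic - min(bic)))
post_prob <- post_prob / sum(post_prob)

# Posterior inclusion probability of a predictor: the total probability
# of all models that contain it
incl_prob <- colSums(inclusion * post_prob)

round(incl_prob, 3)
```

With dozens of terms, as in the housing model, enumerating every model is no longer practical, which is the reason for the MCMC sampling.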

The MCMC and normalized posterior inclusion probabilities are in close agreement as they should be if the MCMC has been run long enough.

The results of the analysis show that the best model by model rank tends to agree with the model selected.

The inclusion probability model, however, has more neighborhoods included in its final model. This is useful as additional detail, but since we are unlikely to subset Neighborhood at this point, all areas will be included in that predictor category.

Part 4 Final Model Assessment

Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.

par(mfrow = c(2,2))
plot(model_final_AIC)

## Warning: not plotting observations with leverage one:
##   205

## Warning: not plotting observations with leverage one:
##   205

hist(model_final_AIC$residuals, col = "orange", las = 1, main = "Final Model Residual Histogram", label = T, ylim = c(0,325))

* * *

The assumptions of the model rest on four properties:

Normality of Residuals
Linearity of the numerical predictor variables with the response variable
Homoscedasticity
Residual independence

Linearity: the linearity of the variables is assessed in the first graph (Residuals vs Fitted). The residuals scatter around zero, and there is no obvious shape or pattern to the scatter.

The Scale-Location graph shows the spread of the residuals across the fitted values. Since there is no pattern and the red line is almost flat, with a small dip, we can be reasonably assured of homoscedasticity in the model.

Independence means that the observations were randomly selected and that the selection of any particular observation has no bearing on another. Since there is no direct test for this here, we have to go back to first principles. The table has no structure that indicates any meaningful subsetting of the data, so any subsection would be a random sample of the entire pool.

Residuals vs Leverage: the fourth graph helps clarify the nature of our outliers with respect to their leverage. As the graph shows, all of the plotted points are well within the Cook's distance bounds and can be included in the final model without undue impact. Therefore, no outliers need to be deleted from the model.

Therefore, the conditions for a final model that is both non-multicollinear and more parsimonious than the initial one are met.
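The leverage and influence shown in the fourth graph can also be inspected numerically. The sketch below uses the built-in mtcars data and a small stand-in model, not the housing model; the cutoffs (twice the average leverage, Cook's distance above 1) are common rules of thumb and are my assumption, not something taken from the analysis above.

```r
# Stand-in model on built-in data, for illustration only
fit <- lm(log(mpg) ~ log(wt) + hp, data = mtcars)

h <- hatvalues(fit)        # leverage of each observation
d <- cooks.distance(fit)   # influence of each observation on the fit

p <- length(coef(fit))     # number of estimated coefficients
n <- nrow(mtcars)

# Rule-of-thumb flags (assumed cutoffs)
high_leverage <- names(h)[h > 2 * p / n]
influential   <- names(d)[d > 1]

high_leverage
influential
```

The leverages always sum to the number of coefficients, so the average leverage is p / n. An observation with leverage one, like observation 205 in the warning above, has a fitted value that is determined entirely by that observation, which often happens when it is the only row in a factor level.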

Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.

# Extract Predictions
predict_test <- exp(predict(model_final_AIC, ames_test))

# Extract Residuals
resid_test <- ames_test$price - predict_test

# Calculate RMSE
rmse_test<- sqrt(mean(resid_test^2))
rmse_test

## [1] 19798.47

The RMSE for the final model is smaller than for the previous models; as an absolute measure of fit, it is 12.6% more accurate than the previous models.

Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?

The model I have is more accurate as far as RMSE, R-squared, and adjusted R-squared are concerned: the RMSE is small and both R-squared measures are high. However, there are many subjective components, and the model gives little weight to Neighborhood, which is counterintuitive for many people. The model is also small compared to the number of available variables, and it's possible that a larger model would be more effective.

Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

# Note: this refits the final formula on ames_validation, so the RMSE below
# is computed on the same data the model was fit to
model_final_AIC_valid<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
    Year.Built + log(Lot.Area) + Central.Air + Overall.Cond + 
    Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 
    1) + Bldg.Type + House.Style, data=ames_validation)

summary(model_final_AIC_valid)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     Overall.Cond + Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 
##     1) + Bldg.Type + House.Style, data = ames_validation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34506 -0.05944  0.00000  0.06180  0.43579 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.089313   0.674330  -3.098 0.002023 ** 
## Overall.Qual           0.072705   0.005119  14.203  < 2e-16 ***
## NeighborhoodBlueste    0.040359   0.121437   0.332 0.739729    
## NeighborhoodBrDale    -0.032725   0.071272  -0.459 0.646260    
## NeighborhoodBrkSide    0.030115   0.060712   0.496 0.620027    
## NeighborhoodClearCr    0.013750   0.062292   0.221 0.825361    
## NeighborhoodCollgCr   -0.033869   0.056241  -0.602 0.547229    
## NeighborhoodCrawfor    0.084826   0.060510   1.402 0.161393    
## NeighborhoodEdwards   -0.081370   0.058632  -1.388 0.165629    
## NeighborhoodGilbert   -0.093541   0.057764  -1.619 0.105813    
## NeighborhoodGreens     0.090050   0.078549   1.146 0.252013    
## NeighborhoodIDOTRR    -0.055108   0.062495  -0.882 0.378180    
## NeighborhoodMeadowV   -0.101815   0.069505  -1.465 0.143406    
## NeighborhoodMitchel   -0.055860   0.058478  -0.955 0.339789    
## NeighborhoodNAmes     -0.042409   0.056947  -0.745 0.456699    
## NeighborhoodNoRidge    0.106421   0.063258   1.682 0.092945 .  
## NeighborhoodNPkVill    0.035143   0.066788   0.526 0.598921    
## NeighborhoodNridgHt    0.011642   0.057416   0.203 0.839372    
## NeighborhoodNWAmes    -0.051076   0.058000  -0.881 0.378822    
## NeighborhoodOldTown   -0.035768   0.059579  -0.600 0.548464    
## NeighborhoodSawyer    -0.004763   0.058926  -0.081 0.935603    
## NeighborhoodSawyerW   -0.072336   0.057810  -1.251 0.211253    
## NeighborhoodSomerst    0.056174   0.057879   0.971 0.332104    
## NeighborhoodStoneBr    0.089628   0.062501   1.434 0.152005    
## NeighborhoodSWISU     -0.039895   0.063061  -0.633 0.527170    
## NeighborhoodTimber     0.006139   0.060261   0.102 0.918885    
## NeighborhoodVeenker    0.001529   0.064116   0.024 0.980984    
## log(area)              0.518117   0.024798  20.893  < 2e-16 ***
## Bedroom.AbvGr         -0.015457   0.006763  -2.285 0.022586 *  
## Year.Built             0.004226   0.000313  13.505  < 2e-16 ***
## log(Lot.Area)          0.128404   0.014210   9.036  < 2e-16 ***
## Central.AirY           0.057300   0.018140   3.159 0.001652 ** 
## Overall.Cond           0.057163   0.004143  13.797  < 2e-16 ***
## HeatingGasA            0.150477   0.106894   1.408 0.159652    
## HeatingGasW            0.239269   0.112810   2.121 0.034270 *  
## HeatingGrav            0.127009   0.125067   1.016 0.310198    
## HeatingOthW            0.004119   0.153851   0.027 0.978646    
## HeatingWall            0.164631   0.150046   1.097 0.272925    
## Fireplaces             0.027667   0.007585   3.648 0.000284 ***
## Kitchen.QualFa        -0.112518   0.033126  -3.397 0.000720 ***
## Kitchen.QualGd        -0.089810   0.020619  -4.356 1.52e-05 ***
## Kitchen.QualTA        -0.112292   0.022858  -4.913 1.12e-06 ***
## log(Wood.Deck.SF + 1)  0.003180   0.001670   1.904 0.057273 .  
## Bldg.Type2fmCon       -0.042451   0.024383  -1.741 0.082125 .  
## Bldg.TypeDuplex       -0.089368   0.023137  -3.863 0.000122 ***
## Bldg.TypeTwnhs        -0.036305   0.038390  -0.946 0.344633    
## Bldg.TypeTwnhsE        0.002357   0.021867   0.108 0.914199    
## House.Style1.5Unf      0.100253   0.055954   1.792 0.073609 .  
## House.Style1Story      0.056914   0.015525   3.666 0.000265 ***
## House.Style2.5Fin      0.027220   0.056175   0.485 0.628144    
## House.Style2.5Unf      0.069118   0.043429   1.592 0.111941    
## House.Style2Story     -0.010823   0.015083  -0.718 0.473258    
## House.StyleSFoyer      0.127250   0.029984   4.244 2.49e-05 ***
## House.StyleSLvl        0.044750   0.023589   1.897 0.058230 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1034 on 709 degrees of freedom
## Multiple R-squared:  0.9241, Adjusted R-squared:  0.9185 
## F-statistic: 162.9 on 53 and 709 DF,  p-value: < 2.2e-16

# Extract Predictions
predict_valid <- exp(predict(model_final_AIC_valid, ames_validation))

# Extract Residuals
resid_valid <- ames_validation$price - predict_valid

# Calculate RMSE
rmse_valid<- sqrt(mean(resid_valid^2))
rmse_valid

## [1] 19173.35

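# Note: model_valid uses a reduced formula (eight predictors), not the full
# final model formula above
model_valid<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr + 
    Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_validation)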

# Predict prices
predict_valid<- exp(predict(model_valid, ames_validation, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage_prob_valid <- mean(ames_validation$price > predict_valid[,"lwr"] &
                            ames_validation$price < predict_valid[,"upr"])
coverage_prob_valid

## [1] 0.9541284

The final model works even better on the validation set, with an RMSE of 19,173 dollars. Compared to the original RMSE of 22,665, this raises the improvement in fit from 12.6% to 15.4%, and it drops the RMSE from 14.6% to 12.3% of the median home price.
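The percentages in this paragraph can be reproduced from numbers already reported: the original RMSE of 22,665 quoted in the text, the test and validation RMSE values printed above, and the median training price of 155,500 from the summary in Part 1. The variable names below are mine.

```r
rmse_original   <- 22665      # earlier RMSE quoted in the text
rmse_test       <- 19798.47   # Section 4.2 output
rmse_validation <- 19173.35   # Section 4.4 output
median_price    <- 155500     # median of ames_train$price from Part 1

# Relative reduction in RMSE compared with the original model, in percent
improve_test  <- 100 * (rmse_original - rmse_test) / rmse_original
improve_valid <- 100 * (rmse_original - rmse_validation) / rmse_original

# RMSE as a percentage of the median home price
share_original <- 100 * rmse_original / median_price
share_valid    <- 100 * rmse_validation / median_price

round(c(improve_test, improve_valid, share_original, share_valid), 1)
# 12.6 15.4 14.6 12.3
```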

The R-squared is high, with 92.4% of the variability explained by the model, and the adjusted R-squared is 91.8%. Both are slightly worse than the final model's values (92.7% and 92.2%, respectively), but the RMSE gives a better absolute fit to the data.

The coverage probability was 95.8% for the training model, 95.2% for the testing model, and 95.8% for the final model, making the final model as good a fit as the training model. Since the validation set is out-of-sample relative to the final model, its coverage should be close to 0.95 if the uncertainty assumptions are met, neither significantly lower nor significantly higher. As seen above, the coverage probability for the validation set is 95.4%, so the conditions regarding the uncertainty assumptions are met at this stage.

Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.

My final model explains a high degree of variability and fits closely to predicted values for out-of-sample data. It's relatively small for such power, so the model is parsimonious. With so many statistically insignificant areas, though, I believe it points to the variance within a given neighborhood throwing off the predictive model, at least within that section of the city.

Model building is a time-consuming process, plagued as it is by combinatorial and subjective elements. Consequently, it is not a linear process, and it's always unfinished. The most pressing consequence of this is change, and what it can do to the elements of a data set and to any previous constraints or conditions. Some will fall, some will rise, and new ones will emerge, either organically or through necessity. It's a learning and growing process; sometimes the data helps you and other times it hinders you. I will wonder whether I still have too many variables or not enough, or, quite frankly, even the correct or best ones. My models worked well even from the beginning, and tightening them a bit is always a good feeling, but models are fragile: in five or ten years, or perhaps even two, how good will this one be? Models are always provisional, so it's a good idea not to get attached to them or their performance.

* * *

Capstone_Final_Peer

Anthony Wiggins

May 4, 2018

Background