Explaining Variation in Prices of Houses in Iowa

First Attempt—Keep

IAHouseModel<-(lm(SalePrice~Year.Built, data = IAHouse))

summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built, data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -147394  -41227  -14502   23093  540805 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.727e+06  7.983e+04  -34.16   <2e-16 ***
## Year.Built   1.475e+03  4.049e+01   36.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66280 on 2928 degrees of freedom
## Multiple R-squared:  0.3118, Adjusted R-squared:  0.3116 
## F-statistic:  1327 on 1 and 2928 DF,  p-value: < 2.2e-16

The intercept and the coefficient of our chosen variable (Year.Built) are both significant at the .05 level, and both have very large t values (-34.16 and 36.43). The R-squared value is 0.3118, so Year.Built alone explains around 31% of the variation in price. This variable is worth keeping.
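The quantities discussed above (t values, p values, R-squared) can be pulled out of the summary object directly instead of being read off the printout. The sketch below is only an illustration: `demo` is a simulated stand-in with made-up numbers, not the IAHouse data, and the names `demo`, `fit`, `coef_table`, `r2`, and `adj_r2` are assumptions introduced here.

```r
# Simulated stand-in for IAHouse (hypothetical values, NOT the real data)
set.seed(1)
n <- 200
demo <- data.frame(Year.Built = sample(1900:2010, n, replace = TRUE))
demo$SalePrice <- -2700000 + 1475 * demo$Year.Built + rnorm(n, sd = 60000)

fit <- lm(SalePrice ~ Year.Built, data = demo)
s <- summary(fit)

coef_table <- coef(s)        # columns: Estimate, Std. Error, t value, Pr(>|t|)
r2 <- s$r.squared            # share of the variation in price explained
adj_r2 <- s$adj.r.squared    # same, penalized for the number of predictors

print(round(coef_table, 4))
cat("R-squared:", round(r2, 4), " Adjusted:", round(adj_r2, 4), "\n")
```

With a single predictor, the F-statistic is the square of the predictor's t value, which matches the output above: 36.43 squared is about 1327.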

Second Attempt—Remove

IAHouseModel<-(lm(SalePrice~Year.Built+Overall.Cond, data = IAHouse))
summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + Overall.Cond, data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -146582  -41155  -14239   23598  534357 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.007e+06  8.773e+04 -34.272  < 2e-16 ***
## Year.Built    1.592e+03  4.317e+01  36.888  < 2e-16 ***
## Overall.Cond  8.671e+03  1.175e+03   7.381 2.03e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65680 on 2927 degrees of freedom
## Multiple R-squared:  0.3244, Adjusted R-squared:  0.324 
## F-statistic: 702.8 on 2 and 2927 DF,  p-value: < 2.2e-16

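Our second model retains extremely low p values, but the t value for the new variable, Overall.Cond, is much smaller in magnitude (7.381) than the t value for Year.Built (36.888). Adding it also raises R-squared only from 0.3118 to 0.3244, about 1 percentage point more of the variation in price. Based on these results, we may want to remove Overall.Cond from our model.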
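The contribution of a newly added variable is the difference in R-squared between the model with it and the model without it. A minimal sketch of that comparison follows, again on simulated data (`demo` and every number in it are hypothetical). The `anova()` call is an extra check added here, not something the original analysis ran.

```r
# Simulated stand-in for IAHouse (hypothetical values, NOT the real data)
set.seed(2)
n <- 300
demo <- data.frame(
  Year.Built   = sample(1900:2010, n, replace = TRUE),
  Overall.Cond = sample(1:9, n, replace = TRUE)
)
demo$SalePrice <- -2700000 + 1500 * demo$Year.Built +
  8000 * demo$Overall.Cond + rnorm(n, sd = 60000)

base_fit <- lm(SalePrice ~ Year.Built, data = demo)
full_fit <- lm(SalePrice ~ Year.Built + Overall.Cond, data = demo)

# Extra share of variation explained by the added variable
r2_gain <- summary(full_fit)$r.squared - summary(base_fit)$r.squared
cat("Gain in R-squared:", round(r2_gain, 4), "\n")

# Partial F-test comparing the two nested models
nested_test <- anova(base_fit, full_fit)
print(nested_test)
```

For one added numeric variable, the partial F-statistic equals the square of that variable's t value in the larger model, so the two views agree.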

Third Attempt—Remove

IAHouseModel<-(lm(SalePrice~Year.Built+House.Style, data = IAHouse))
summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + House.Style, data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -143288  -38566  -12225   23225  524933 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.937e+06  8.651e+04 -33.953  < 2e-16 ***
## Year.Built         1.589e+03  4.466e+01  35.574  < 2e-16 ***
## House.Style1.5Unf -2.576e+04  1.516e+04  -1.699   0.0895 .  
## House.Style1Story -2.194e+04  4.364e+03  -5.027 5.29e-07 ***
## House.Style2.5Fin  1.206e+05  2.300e+04   5.243 1.69e-07 ***
## House.Style2.5Unf  6.704e+04  1.361e+04   4.924 8.94e-07 ***
## House.Style2Story -5.943e+02  4.660e+03  -0.128   0.8985    
## House.StyleSFoyer -6.076e+04  8.140e+03  -7.464 1.10e-13 ***
## House.StyleSLvl   -3.698e+04  6.974e+03  -5.303 1.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64180 on 2921 degrees of freedom
## Multiple R-squared:  0.3563, Adjusted R-squared:  0.3546 
## F-statistic: 202.1 on 8 and 2921 DF,  p-value: < 2.2e-16

While House.Style raises R-squared from 0.3118 to 0.3563 (about 4.5 percentage points more of the variation in price), it has some questionable p values within it. House.Style2Story (p = 0.8985) is well above the .05 significance level, and House.Style1.5Unf (p = 0.0895) also exceeds it. This variable is best left out of our final model.
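Each p value in a factor's rows compares one level against the omitted baseline level, so a single large p value says only that this level looks similar to the baseline. A complementary check, which is an addition here and not part of the original analysis, is a partial F-test of the factor as a whole. The sketch below uses simulated data with only four of the House.Style levels; `demo`, `shift`, and all values are hypothetical.

```r
# Simulated stand-in for IAHouse (hypothetical values, NOT the real data)
set.seed(3)
n <- 400
demo <- data.frame(
  Year.Built  = sample(1900:2010, n, replace = TRUE),
  House.Style = factor(sample(c("1Story", "2Story", "SFoyer", "SLvl"),
                              n, replace = TRUE))
)
# Made-up price shifts per style; 2Story is set equal to the baseline 1Story
shift <- c("1Story" = 0, "2Story" = 0, "SFoyer" = -40000, "SLvl" = -30000)
demo$SalePrice <- -2700000 + 1500 * demo$Year.Built +
  unname(shift[as.character(demo$House.Style)]) + rnorm(n, sd = 50000)

without_style <- lm(SalePrice ~ Year.Built, data = demo)
with_style    <- lm(SalePrice ~ Year.Built + House.Style, data = demo)

# Per-level p values: each row is "this level vs. the baseline level"
print(round(coef(summary(with_style)), 4))

# One test for all House.Style levels together
style_test <- anova(without_style, with_style)
print(style_test)
style_p <- style_test[["Pr(>F)"]][2]
```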

Fourth Attempt—Keep

IAHouseModel<-(lm(SalePrice~Year.Built+Kitchen.Qual, data = IAHouse))
summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + Kitchen.Qual, data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -223853  -31591   -5731   21476  419295 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.314e+06  7.845e+04 -16.746  < 2e-16 ***
## Year.Built      8.272e+02  3.926e+01  21.071  < 2e-16 ***
## Kitchen.QualFa -1.799e+05  7.815e+03 -23.023  < 2e-16 ***
## Kitchen.QualGd -1.193e+05  4.076e+03 -29.277  < 2e-16 ***
## Kitchen.QualPo -1.645e+05  5.384e+04  -3.056  0.00227 ** 
## Kitchen.QualTA -1.659e+05  4.271e+03 -38.833  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53620 on 2924 degrees of freedom
## Multiple R-squared:  0.5503, Adjusted R-squared:  0.5495 
## F-statistic: 715.6 on 5 and 2924 DF,  p-value: < 2.2e-16

The Kitchen.Qual levels all have very small p values. The largest, for Kitchen.QualPo (p = 0.00227), is still below the .05 significance level, although its t value is smaller in magnitude at around -3. Apart from this weaker level, the variable fits well and raises R-squared from 0.3118 to 0.5503, an additional 24% of the variation in price. Kitchen.Qual is worth adding to the final model.

Fifth Attempt—Remove

IAHouseModel<-(lm(SalePrice~Year.Built+Kitchen.Qual+Utilities, data = IAHouse))
summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + Kitchen.Qual + Utilities, 
##     data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -223896  -31554   -5687   21403  419292 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.310e+06  7.850e+04 -16.692  < 2e-16 ***
## Year.Built       8.255e+02  3.928e+01  21.014  < 2e-16 ***
## Kitchen.QualFa  -1.800e+05  7.816e+03 -23.033  < 2e-16 ***
## Kitchen.QualGd  -1.193e+05  4.077e+03 -29.266  < 2e-16 ***
## Kitchen.QualPo  -1.646e+05  5.384e+04  -3.058  0.00225 ** 
## Kitchen.QualTA  -1.659e+05  4.272e+03 -38.834  < 2e-16 ***
## UtilitiesNoSeWa -4.917e+04  5.366e+04  -0.916  0.35954    
## UtilitiesNoSewr -3.150e+04  3.796e+04  -0.830  0.40664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53620 on 2922 degrees of freedom
## Multiple R-squared:  0.5505, Adjusted R-squared:  0.5495 
## F-statistic: 511.3 on 7 and 2922 DF,  p-value: < 2.2e-16

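The Utilities variable was a poor choice. Its two coefficients have extremely low t values, both below one in magnitude, along with very high p values (0.35954 and 0.40664). It also raises R-squared only from 0.5503 to 0.5505, a gain of 0.0002 (0.02 percentage points), and the adjusted R-squared stays at 0.5495. This variable will not be included in the final model.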

Sixth Attempt—Keep

IAHouseModel<-(lm(SalePrice~Year.Built+Kitchen.Qual+TotRms.AbvGrd, data = IAHouse))
summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + Kitchen.Qual + TotRms.AbvGrd, 
##     data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -284282  -26122   -3351   18786  382959 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.493e+06  6.938e+04 -21.515  < 2e-16 ***
## Year.Built      8.508e+02  3.459e+01  24.593  < 2e-16 ***
## Kitchen.QualFa -1.441e+05  6.994e+03 -20.606  < 2e-16 ***
## Kitchen.QualGd -9.859e+04  3.662e+03 -26.924  < 2e-16 ***
## Kitchen.QualPo -1.317e+05  4.744e+04  -2.776  0.00554 ** 
## Kitchen.QualTA -1.364e+05  3.897e+03 -35.017  < 2e-16 ***
## TotRms.AbvGrd   1.684e+04  5.794e+02  29.059  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47240 on 2923 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6504 
## F-statistic: 909.1 on 6 and 2923 DF,  p-value: < 2.2e-16

The variable TotRms.AbvGrd has a very low p value and a large t value (29.059). It also raises R-squared from 0.5503 to 0.6511, explaining around 10% more of the variation in price. This is worth keeping in the final model.

Seventh Attempt—Keep

IAHouseModel<-(lm(SalePrice~Year.Built+Kitchen.Qual+TotRms.AbvGrd+Gr.Liv.Area, data = IAHouse))
summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + Kitchen.Qual + TotRms.AbvGrd + 
##     Gr.Liv.Area, data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -487314  -21783    -803   17328  309424 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.359e+06  5.913e+04 -22.984  < 2e-16 ***
## Year.Built      7.741e+02  2.950e+01  26.238  < 2e-16 ***
## Kitchen.QualFa -1.144e+05  6.012e+03 -19.026  < 2e-16 ***
## Kitchen.QualGd -8.418e+04  3.143e+03 -26.784  < 2e-16 ***
## Kitchen.QualPo -9.891e+04  4.035e+04  -2.451   0.0143 *  
## Kitchen.QualTA -1.067e+05  3.430e+03 -31.091  < 2e-16 ***
## TotRms.AbvGrd  -4.879e+03  8.143e+02  -5.992 2.33e-09 ***
## Gr.Liv.Area     9.048e+01  2.702e+00  33.487  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40160 on 2922 degrees of freedom
## Multiple R-squared:  0.7479, Adjusted R-squared:  0.7473 
## F-statistic:  1238 on 7 and 2922 DF,  p-value: < 2.2e-16

The variable Gr.Liv.Area fits the trend of having a low p value and a very high t value. At 33.487, it has the largest t value in this model. This variable also raises R-squared from 0.6511 to 0.7479, explaining around 10% more of the variation in price. I will keep this in the final model.

Eighth Attempt—Remove

IAHouseModel<-(lm(SalePrice~Year.Built+Kitchen.Qual+TotRms.AbvGrd+Gr.Liv.Area+Kitchen.AbvGr, data = IAHouse))
summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + Kitchen.Qual + TotRms.AbvGrd + 
##     Gr.Liv.Area + Kitchen.AbvGr, data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -486646  -20728   -1023   16267  308185 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.320e+06  5.844e+04 -22.582  < 2e-16 ***
## Year.Built      7.654e+02  2.910e+01  26.304  < 2e-16 ***
## Kitchen.QualFa -1.081e+05  5.965e+03 -18.124  < 2e-16 ***
## Kitchen.QualGd -8.251e+04  3.103e+03 -26.588  < 2e-16 ***
## Kitchen.QualPo -9.774e+04  3.978e+04  -2.457  0.01406 *  
## Kitchen.QualTA -1.025e+05  3.411e+03 -30.039  < 2e-16 ***
## TotRms.AbvGrd  -2.433e+03  8.449e+02  -2.879  0.00401 ** 
## Gr.Liv.Area     8.745e+01  2.683e+00  32.591  < 2e-16 ***
## Kitchen.AbvGr  -3.477e+04  3.748e+03  -9.277  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39590 on 2921 degrees of freedom
## Multiple R-squared:  0.7551, Adjusted R-squared:  0.7544 
## F-statistic:  1126 on 8 and 2921 DF,  p-value: < 2.2e-16

The newest variable, Kitchen.AbvGr, has a low p value along with a decently large t value (-9.277). However, it raises R-squared only from 0.7479 to 0.7551, a gain of less than 1 percentage point. We could find a better predictor.

Ninth Attempt—Remove

IAHouseModel<-(lm(SalePrice~Year.Built+Kitchen.Qual+TotRms.AbvGrd+Gr.Liv.Area+Heating.QC, data = IAHouse))
summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + Kitchen.Qual + TotRms.AbvGrd + 
##     Gr.Liv.Area + Heating.QC, data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -484609  -21487    -809   17166  308612 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.272e+06  6.093e+04 -20.875  < 2e-16 ***
## Year.Built      7.319e+02  3.036e+01  24.112  < 2e-16 ***
## Kitchen.QualFa -1.093e+05  6.086e+03 -17.954  < 2e-16 ***
## Kitchen.QualGd -8.360e+04  3.129e+03 -26.720  < 2e-16 ***
## Kitchen.QualPo -9.625e+04  4.016e+04  -2.396 0.016618 *  
## Kitchen.QualTA -1.021e+05  3.498e+03 -29.196  < 2e-16 ***
## TotRms.AbvGrd  -4.910e+03  8.103e+02  -6.059 1.54e-09 ***
## Gr.Liv.Area     8.965e+01  2.691e+00  33.315  < 2e-16 ***
## Heating.QCFa   -1.258e+04  4.542e+03  -2.770 0.005643 ** 
## Heating.QCGd   -7.637e+03  2.259e+03  -3.381 0.000732 ***
## Heating.QCPo   -3.374e+04  2.334e+04  -1.445 0.148473    
## Heating.QCTA   -1.164e+04  2.036e+03  -5.716 1.20e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39940 on 2918 degrees of freedom
## Multiple R-squared:  0.751,  Adjusted R-squared:   0.75 
## F-statistic: 799.9 on 11 and 2918 DF,  p-value: < 2.2e-16

Not every level of Heating.QC meets our .05 significance test: Heating.QCPo has a p value of 0.148. The variable also raises R-squared only from 0.7479 to 0.751. Therefore, it will not be included in the final model.
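The R-squared gains quoted across the nine attempts can be recomputed from the reported summaries. The sketch below copies the Multiple R-squared values from the outputs above and subtracts the R-squared of the model each attempt was built on. The names `r2`, `base_r2`, and `gain` are introduced here for illustration only.

```r
# Multiple R-squared values copied from the nine summaries above,
# labelled by the variable each attempt added
r2 <- c(
  Year.Built    = 0.3118,
  Overall.Cond  = 0.3244,
  House.Style   = 0.3563,
  Kitchen.Qual  = 0.5503,
  Utilities     = 0.5505,
  TotRms.AbvGrd = 0.6511,
  Gr.Liv.Area   = 0.7479,
  Kitchen.AbvGr = 0.7551,
  Heating.QC    = 0.7510
)

# R-squared of the model each attempt extended (0 for the first attempt)
base_r2 <- c(0, 0.3118, 0.3118, 0.3118, 0.5503, 0.5503, 0.6511, 0.7479, 0.7479)

gain <- data.frame(
  added       = names(r2),
  r2          = unname(r2),
  gain_points = round(100 * (unname(r2) - base_r2), 2)  # percentage points
)
print(gain)
```

Read this way, the four variables kept in the final model are the ones that each add roughly 10 percentage points or more, while the five that are removed each add fewer than 5.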

Final Model

IAHouseModel<-(lm(SalePrice~Year.Built+Kitchen.Qual+TotRms.AbvGrd+Gr.Liv.Area, data = IAHouse))
summary(IAHouseModel)
## 
## Call:
## lm(formula = SalePrice ~ Year.Built + Kitchen.Qual + TotRms.AbvGrd + 
##     Gr.Liv.Area, data = IAHouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -487314  -21783    -803   17328  309424 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.359e+06  5.913e+04 -22.984  < 2e-16 ***
## Year.Built      7.741e+02  2.950e+01  26.238  < 2e-16 ***
## Kitchen.QualFa -1.144e+05  6.012e+03 -19.026  < 2e-16 ***
## Kitchen.QualGd -8.418e+04  3.143e+03 -26.784  < 2e-16 ***
## Kitchen.QualPo -9.891e+04  4.035e+04  -2.451   0.0143 *  
## Kitchen.QualTA -1.067e+05  3.430e+03 -31.091  < 2e-16 ***
## TotRms.AbvGrd  -4.879e+03  8.143e+02  -5.992 2.33e-09 ***
## Gr.Liv.Area     9.048e+01  2.702e+00  33.487  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40160 on 2922 degrees of freedom
## Multiple R-squared:  0.7479, Adjusted R-squared:  0.7473 
## F-statistic:  1238 on 7 and 2922 DF,  p-value: < 2.2e-16

Adjusted R-squared: 0.7473
p-value: < 2.2e-16
F-statistic: 1238 on 7 and 2922 DF

The model accounts for 74.73% of the variation in sale price (adjusted R-squared) using the kitchen quality, the year the home was built, the number of rooms above ground, and the square feet of living area above ground. Based on the p value of the F-statistic, the model is statistically significant.

Model Plots

The T value for Year.Built is 36.43 and the P value is less than 2e-16. The adjusted R-squared value is .3116. The coefficient for this model is 1474.964, which means that for every year later a house is built, the predicted price goes up by 1474.964 dollars.




boxplot(SalePrice ~ Kitchen.Qual, data = IAHouse, main = "Sale Price vs. Quality of Kitchen", xlab = "Kitchen Quality", ylab = "Sale Price")

The T value for Kitchen.QualTA is -46.156 and the P value is less than 2e-16. The T value for Kitchen.QualFa is -29.057 and the P value is less than 2e-16. The T value for Kitchen.QualGd is -29.021 and the P value is less than 2e-16. The T value for Kitchen.QualPo is -3.985 and the P value is 6.91e-05. The adjusted R-squared value is .4813.

Each Kitchen.Qual coefficient is measured against the baseline level that is left out of the coefficient table, an excellent quality kitchen (Ex). The coefficient for Kitchen.QualTA is -197789, which means that a house with an average quality kitchen is predicted to sell for 197789 dollars less than a house with an excellent kitchen. The coefficient for Kitchen.QualFa is -231432, so a fair quality kitchen lowers the predicted price by 231432 dollars. The coefficient for Kitchen.QualGd is -126504, so a good quality kitchen lowers the predicted price by 126504 dollars. The coefficient for Kitchen.QualPo is -229839, so a poor quality kitchen lowers the predicted price by 229839 dollars.

The T value of TotRms.AbvGrd is 30.866 while the P value is less than 2e-16. The adjusted R-squared value is .2452. The coefficient for this model is 25163.83, which means that for every additional room above ground level, the predicted price of the house increases by 25163.83 dollars.

The T value of Gr.Liv.Area is 4.064. The adjusted R-squared value is .4994. The coefficient of this variable is 111.694, which means that for every additional square foot of living area above ground, the predicted price increases by 111.694 dollars.
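When a factor such as Kitchen.Qual is the only predictor, R's default treatment coding makes the intercept the mean price of the first factor level (here Ex, the first level alphabetically), and each remaining coefficient that level's mean minus the Ex mean. The sketch below demonstrates this on simulated prices; `quality`, `means`, `demo`, and all dollar figures are hypothetical and are not taken from IAHouse.

```r
# Simulated stand-in (hypothetical group sizes and prices, NOT the real data)
set.seed(4)
quality <- factor(rep(c("Ex", "Gd", "TA", "Fa", "Po"),
                      times = c(20, 80, 100, 15, 5)))
means <- c(Ex = 340000, Gd = 210000, TA = 140000, Fa = 105000, Po = 105000)
demo <- data.frame(
  Kitchen.Qual = quality,
  SalePrice    = unname(means[as.character(quality)]) +
    rnorm(length(quality), sd = 30000)
)

fit <- lm(SalePrice ~ Kitchen.Qual, data = demo)

# Reference level: the first factor level, omitted from the coefficient table
ref_level <- levels(demo$Kitchen.Qual)[1]
cat("Reference level:", ref_level, "\n")

# Mean sale price per kitchen quality level, as a plain named vector
group_means <- sapply(split(demo$SalePrice, demo$Kitchen.Qual), mean)

# Each non-intercept coefficient is that level's mean minus the reference mean
diff_from_ref <- group_means[-1] - group_means[[ref_level]]
print(round(cbind(coefficient = coef(fit)[-1], mean_difference = diff_from_ref)))
```

The two printed columns agree, which is why a negative Kitchen.Qual coefficient reads as "this much less than an excellent kitchen" and not as an absolute drop in price.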