Introduction

library("ggplot2")
Data <- read.csv("IowaHousing.csv")

The data that I’m using for this experiment is called IowaHousing.csv. Described by Dean De Cock of Truman State University, this data was originally obtained from the Ames Assessor’s Office, where it is used for tax assessment purposes, but it also lends itself directly to the prediction of home selling prices. In this experiment I will use the data to find an appropriate linear model that explains as much as possible of the variability in price for Iowa housing.

Preliminaries

After looking at the different observations available to determine house value, I preferred to start with Gr.Liv.Area (continuous), which contains the above-grade living area in square feet.

TestModel <- lm(SalePrice ~ Gr.Liv.Area, data= Data)
summary(TestModel)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483467  -30219   -1966   22728  334323 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13289.634   3269.703   4.064 4.94e-05 ***
## Gr.Liv.Area   111.694      2.066  54.061  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared:  0.4995, Adjusted R-squared:  0.4994 
## F-statistic:  2923 on 1 and 2928 DF,  p-value: < 2.2e-16

As a start, we can determine that the predictor is statistically significant, and the model gives us an \(R^2\) of 0.4995379, which isn’t a bad start. I proceed to pick another observation.

Land Contour

Is the flatness of the property a major factor in determining its price? This observation has 4 levels: Lvl: Near Flat/Level, Bnk: Banked, HLS: Hillside, and Low: Depression. I add it to the model to test its significance.

TestModel <- lm(SalePrice ~ Gr.Liv.Area + Land.Contour, data= Data)
summary(TestModel)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Land.Contour, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -432316  -29531   -1300   23420  335657 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -32753.495   6001.240  -5.458 5.22e-08 ***
## Gr.Liv.Area        110.789      2.011  55.093  < 2e-16 ***
## Land.ContourHLS  97732.543   7118.288  13.730  < 2e-16 ***
## Land.ContourLow  63364.313   8699.292   7.284 4.15e-13 ***
## Land.ContourLvl  46849.476   5179.521   9.045  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54760 on 2925 degrees of freedom
## Multiple R-squared:  0.5307, Adjusted R-squared:  0.5301 
## F-statistic: 826.9 on 4 and 2925 DF,  p-value: < 2.2e-16

The fit of the model improves, although not by a lot: the \(R^2\) rises to 0.530692. I decide to keep this observation because it might be even more significant later on, and I continue adding observations to try to form a better model of house pricing.

Bedrooms

The Bedrooms observation (Bedroom.AbvGr) gives the number of bedrooms above grade (not including basement bedrooms). I decided to pick this observation because it makes sense theoretically that the more bedrooms a house has, the higher its price.

TestModel <- lm(SalePrice ~ Gr.Liv.Area +  as.factor(Bedroom.AbvGr), data= Data)
summary(TestModel)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + as.factor(Bedroom.AbvGr), 
##     data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -598290  -27410    -324   23925  334340 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.882e+04  1.871e+04   1.540  0.12365    
## Gr.Liv.Area                1.391e+02  2.304e+00  60.363  < 2e-16 ***
## as.factor(Bedroom.AbvGr)1 -6.426e+03  1.910e+04  -0.336  0.73660    
## as.factor(Bedroom.AbvGr)2 -3.692e+04  1.855e+04  -1.990  0.04666 *  
## as.factor(Bedroom.AbvGr)3 -5.509e+04  1.850e+04  -2.978  0.00292 ** 
## as.factor(Bedroom.AbvGr)4 -9.669e+04  1.870e+04  -5.171 2.48e-07 ***
## as.factor(Bedroom.AbvGr)5 -1.495e+05  2.006e+04  -7.454 1.18e-13 ***
## as.factor(Bedroom.AbvGr)6 -1.649e+05  2.175e+04  -7.580 4.62e-14 ***
## as.factor(Bedroom.AbvGr)8 -3.009e+05  5.554e+04  -5.418 6.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52180 on 2921 degrees of freedom
## Multiple R-squared:  0.5745, Adjusted R-squared:  0.5734 
## F-statistic:   493 on 8 and 2921 DF,  p-value: < 2.2e-16

After fitting the model, although the \(R^2\) increased to 0.5745, some of the levels of this observation are not significant in the model, so I decide to remove this observation and continue testing other ones.
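The p-values of individual levels depend on which level is the reference, so they can be a rough guide to whether a factor matters as a whole. One possible complementary check, sketched here on simulated data (the data frame `sim` and its columns `area`, `beds`, and `price` are hypothetical stand-ins, not the housing data), is a partial F-test with `anova()` that compares the model with and without the factor.

```r
# Simulated stand-in data; `sim`, `area`, `beds`, and `price` are hypothetical names.
set.seed(1)
n <- 200
sim <- data.frame(
  area = runif(n, 800, 3000),
  beds = factor(sample(1:4, n, replace = TRUE))
)
sim$price <- 20000 + 100 * sim$area - 5000 * as.integer(sim$beds) +
  rnorm(n, sd = 20000)

small <- lm(price ~ area, data = sim)         # model without the factor
large <- lm(price ~ area + beds, data = sim)  # model with the factor

# Partial F-test: do all bedroom levels together improve the fit?
cmp <- anova(small, large)
print(cmp)
```

The second row of the printed table reports a single F statistic and p-value for the whole factor rather than one per level.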

Year Built

Furthermore, I decided to test the relation between the year the property was built (Year.Built) and the model kept so far.

TestModel <- lm(SalePrice ~ Gr.Liv.Area +  Land.Contour+ as.factor(Year.Built), data= Data)
summary(TestModel)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Land.Contour + as.factor(Year.Built), 
##     data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -471063  -24035   -1245   19594  292581 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.080e+05  4.421e+04  -2.443 0.014629 *  
## Gr.Liv.Area                9.089e+01  1.821e+00  49.910  < 2e-16 ***
## Land.ContourHLS            5.421e+04  6.005e+03   9.028  < 2e-16 ***
## Land.ContourLow            4.265e+04  7.223e+03   5.904 3.97e-09 ***
## Land.ContourLvl            1.570e+04  4.467e+03   3.516 0.000445 ***
## as.factor(Year.Built)1875  9.361e+04  6.190e+04   1.512 0.130624    
## as.factor(Year.Built)1879  5.427e+04  6.186e+04   0.877 0.380329    
## as.factor(Year.Built)1880  6.204e+04  4.792e+04   1.295 0.195578    
## as.factor(Year.Built)1882  1.020e+05  6.187e+04   1.648 0.099367 .  
## as.factor(Year.Built)1885  5.192e+04  5.358e+04   0.969 0.332696    
## as.factor(Year.Built)1890  7.750e+04  4.678e+04   1.657 0.097659 .  
## as.factor(Year.Built)1892  1.118e+05  5.358e+04   2.087 0.037003 *  
## as.factor(Year.Built)1893  1.651e+05  6.186e+04   2.669 0.007654 ** 
## as.factor(Year.Built)1895  6.315e+04  5.052e+04   1.250 0.211412    
## as.factor(Year.Built)1896  5.592e+04  6.191e+04   0.903 0.366440    
## as.factor(Year.Built)1898  3.995e+03  6.186e+04   0.065 0.948512    
## as.factor(Year.Built)1900  6.552e+04  4.451e+04   1.472 0.141165    
## as.factor(Year.Built)1901  9.884e+04  5.361e+04   1.844 0.065350 .  
## as.factor(Year.Built)1902  3.610e+04  6.203e+04   0.582 0.560603    
## as.factor(Year.Built)1904  1.099e+05  6.187e+04   1.777 0.075718 .  
## as.factor(Year.Built)1905  7.581e+04  5.052e+04   1.500 0.133605    
## as.factor(Year.Built)1906  9.380e+04  6.188e+04   1.516 0.129669    
## as.factor(Year.Built)1907  1.445e+04  6.186e+04   0.234 0.815341    
## as.factor(Year.Built)1908  1.196e+05  5.363e+04   2.230 0.025802 *  
## as.factor(Year.Built)1910  7.062e+04  4.428e+04   1.595 0.110835    
## as.factor(Year.Built)1911  5.951e+04  6.203e+04   0.959 0.337394    
## as.factor(Year.Built)1912  6.485e+04  4.797e+04   1.352 0.176460    
## as.factor(Year.Built)1913  1.196e+05  6.193e+04   1.931 0.053592 .  
## as.factor(Year.Built)1914  7.784e+04  4.642e+04   1.677 0.093664 .  
## as.factor(Year.Built)1915  8.195e+04  4.467e+04   1.835 0.066670 .  
## as.factor(Year.Built)1916  8.968e+04  4.592e+04   1.953 0.050918 .  
## as.factor(Year.Built)1917  7.253e+04  5.053e+04   1.435 0.151300    
## as.factor(Year.Built)1918  9.999e+04  4.589e+04   2.179 0.029418 *  
## as.factor(Year.Built)1919  8.632e+04  4.794e+04   1.800 0.071901 .  
## as.factor(Year.Built)1920  8.550e+04  4.417e+04   1.936 0.053003 .  
## as.factor(Year.Built)1921  1.009e+05  4.575e+04   2.205 0.027518 *  
## as.factor(Year.Built)1922  8.429e+04  4.514e+04   1.867 0.061967 .  
## as.factor(Year.Built)1923  9.642e+04  4.505e+04   2.140 0.032424 *  
## as.factor(Year.Built)1924  9.253e+04  4.514e+04   2.050 0.040457 *  
## as.factor(Year.Built)1925  1.055e+05  4.443e+04   2.374 0.017641 *  
## as.factor(Year.Built)1926  1.100e+05  4.494e+04   2.449 0.014389 *  
## as.factor(Year.Built)1927  1.012e+05  4.615e+04   2.193 0.028370 *  
## as.factor(Year.Built)1928  1.116e+05  4.615e+04   2.418 0.015669 *  
## as.factor(Year.Built)1929  1.223e+05  4.644e+04   2.633 0.008504 ** 
## as.factor(Year.Built)1930  1.037e+05  4.463e+04   2.323 0.020232 *  
## as.factor(Year.Built)1931  8.497e+04  4.680e+04   1.815 0.069578 .  
## as.factor(Year.Built)1932  1.245e+05  4.799e+04   2.595 0.009516 ** 
## as.factor(Year.Built)1934  1.033e+05  4.797e+04   2.153 0.031406 *  
## as.factor(Year.Built)1935  1.126e+05  4.541e+04   2.480 0.013182 *  
## as.factor(Year.Built)1936  1.066e+05  4.573e+04   2.330 0.019863 *  
## as.factor(Year.Built)1937  1.079e+05  4.621e+04   2.334 0.019666 *  
## as.factor(Year.Built)1938  9.722e+04  4.544e+04   2.140 0.032479 *  
## as.factor(Year.Built)1939  1.077e+05  4.486e+04   2.400 0.016475 *  
## as.factor(Year.Built)1940  1.084e+05  4.440e+04   2.441 0.014717 *  
## as.factor(Year.Built)1941  1.041e+05  4.472e+04   2.328 0.019997 *  
## as.factor(Year.Built)1942  1.062e+05  4.729e+04   2.245 0.024828 *  
## as.factor(Year.Built)1945  9.984e+04  4.524e+04   2.207 0.027397 *  
## as.factor(Year.Built)1946  8.356e+04  4.524e+04   1.847 0.064837 .  
## as.factor(Year.Built)1947  1.050e+05  4.574e+04   2.295 0.021786 *  
## as.factor(Year.Built)1948  1.023e+05  4.458e+04   2.295 0.021789 *  
## as.factor(Year.Built)1949  8.992e+04  4.498e+04   1.999 0.045707 *  
## as.factor(Year.Built)1950  1.085e+05  4.435e+04   2.445 0.014529 *  
## as.factor(Year.Built)1951  1.049e+05  4.498e+04   2.332 0.019745 *  
## as.factor(Year.Built)1952  1.011e+05  4.499e+04   2.246 0.024766 *  
## as.factor(Year.Built)1953  1.111e+05  4.468e+04   2.487 0.012951 *  
## as.factor(Year.Built)1954  1.180e+05  4.430e+04   2.665 0.007747 ** 
## as.factor(Year.Built)1955  1.074e+05  4.443e+04   2.418 0.015684 *  
## as.factor(Year.Built)1956  1.230e+05  4.435e+04   2.774 0.005573 ** 
## as.factor(Year.Built)1957  1.151e+05  4.440e+04   2.592 0.009595 ** 
## as.factor(Year.Built)1958  1.130e+05  4.424e+04   2.554 0.010707 *  
## as.factor(Year.Built)1959  1.243e+05  4.429e+04   2.807 0.005035 ** 
## as.factor(Year.Built)1960  1.218e+05  4.437e+04   2.746 0.006073 ** 
## as.factor(Year.Built)1961  1.210e+05  4.443e+04   2.722 0.006520 ** 
## as.factor(Year.Built)1962  1.195e+05  4.440e+04   2.692 0.007148 ** 
## as.factor(Year.Built)1963  1.263e+05  4.441e+04   2.843 0.004496 ** 
## as.factor(Year.Built)1964  1.187e+05  4.443e+04   2.672 0.007574 ** 
## as.factor(Year.Built)1965  1.220e+05  4.442e+04   2.747 0.006048 ** 
## as.factor(Year.Built)1966  1.341e+05  4.440e+04   3.021 0.002545 ** 
## as.factor(Year.Built)1967  1.160e+05  4.431e+04   2.617 0.008915 ** 
## as.factor(Year.Built)1968  1.231e+05  4.426e+04   2.781 0.005453 ** 
## as.factor(Year.Built)1969  1.123e+05  4.454e+04   2.521 0.011746 *  
## as.factor(Year.Built)1970  1.107e+05  4.431e+04   2.498 0.012544 *  
## as.factor(Year.Built)1971  1.243e+05  4.435e+04   2.804 0.005088 ** 
## as.factor(Year.Built)1972  1.200e+05  4.434e+04   2.707 0.006824 ** 
## as.factor(Year.Built)1973  1.019e+05  4.481e+04   2.273 0.023091 *  
## as.factor(Year.Built)1974  1.249e+05  4.471e+04   2.794 0.005236 ** 
## as.factor(Year.Built)1975  1.310e+05  4.466e+04   2.934 0.003379 ** 
## as.factor(Year.Built)1976  1.213e+05  4.417e+04   2.745 0.006090 ** 
## as.factor(Year.Built)1977  1.248e+05  4.415e+04   2.827 0.004725 ** 
## as.factor(Year.Built)1978  1.289e+05  4.429e+04   2.910 0.003645 ** 
## as.factor(Year.Built)1979  1.214e+05  4.482e+04   2.709 0.006797 ** 
## as.factor(Year.Built)1980  1.290e+05  4.457e+04   2.893 0.003841 ** 
## as.factor(Year.Built)1981  1.694e+05  4.589e+04   3.691 0.000228 ***
## as.factor(Year.Built)1982  1.490e+05  4.684e+04   3.180 0.001486 ** 
## as.factor(Year.Built)1983  1.491e+05  4.647e+04   3.209 0.001347 ** 
## as.factor(Year.Built)1984  1.399e+05  4.492e+04   3.114 0.001864 ** 
## as.factor(Year.Built)1985  1.328e+05  4.679e+04   2.838 0.004572 ** 
## as.factor(Year.Built)1986  1.458e+05  4.570e+04   3.190 0.001439 ** 
## as.factor(Year.Built)1987  1.687e+05  4.643e+04   3.633 0.000285 ***
## as.factor(Year.Built)1988  1.418e+05  4.520e+04   3.138 0.001718 ** 
## as.factor(Year.Built)1989  1.368e+05  4.645e+04   2.945 0.003251 ** 
## as.factor(Year.Built)1990  1.400e+05  4.489e+04   3.119 0.001835 ** 
## as.factor(Year.Built)1991  1.341e+05  4.555e+04   2.944 0.003269 ** 
## as.factor(Year.Built)1992  1.386e+05  4.456e+04   3.109 0.001894 ** 
## as.factor(Year.Built)1993  1.502e+05  4.429e+04   3.392 0.000704 ***
## as.factor(Year.Built)1994  1.625e+05  4.434e+04   3.664 0.000253 ***
## as.factor(Year.Built)1995  1.660e+05  4.445e+04   3.734 0.000192 ***
## as.factor(Year.Built)1996  1.586e+05  4.439e+04   3.572 0.000361 ***
## as.factor(Year.Built)1997  1.473e+05  4.438e+04   3.320 0.000913 ***
## as.factor(Year.Built)1998  1.605e+05  4.421e+04   3.630 0.000289 ***
## as.factor(Year.Built)1999  1.444e+05  4.417e+04   3.268 0.001096 ** 
## as.factor(Year.Built)2000  1.503e+05  4.420e+04   3.400 0.000682 ***
## as.factor(Year.Built)2001  1.743e+05  4.437e+04   3.929 8.75e-05 ***
## as.factor(Year.Built)2002  1.635e+05  4.422e+04   3.697 0.000223 ***
## as.factor(Year.Built)2003  1.688e+05  4.400e+04   3.837 0.000128 ***
## as.factor(Year.Built)2004  1.616e+05  4.398e+04   3.673 0.000244 ***
## as.factor(Year.Built)2005  1.803e+05  4.391e+04   4.105 4.15e-05 ***
## as.factor(Year.Built)2006  1.929e+05  4.392e+04   4.392 1.17e-05 ***
## as.factor(Year.Built)2007  1.987e+05  4.396e+04   4.521 6.40e-06 ***
## as.factor(Year.Built)2008  2.263e+05  4.419e+04   5.121 3.25e-07 ***
## as.factor(Year.Built)2009  2.187e+05  4.463e+04   4.901 1.01e-06 ***
## as.factor(Year.Built)2010  2.392e+05  5.053e+04   4.734 2.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43740 on 2808 degrees of freedom
## Multiple R-squared:  0.7126, Adjusted R-squared:  0.7002 
## F-statistic: 57.54 on 121 and 2808 DF,  p-value: < 2.2e-16

We can see that the \(R^2\) rises enormously compared to the previous model (from 0.5307 to 0.7126), but not every year is that significant for the price of the house. So I decide to divide the data into 4 intervals (tiers), with cut points 35 years apart (1905, 1940, and 1975), to acquire a better model.

year<- function(x){ if(x>1975) 'Tier1' else if(x>1940) 'Tier2' else if(x>1905) 'Tier3' else 'Tier4'}
Data$Year <- sapply(Data$Year.Built, year)
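As a side note, the same tiers can be produced without calling the function once per row. The sketch below compares the function above with a vectorized version built on `cut()`; the vector `years` is a hypothetical sample chosen to hit each boundary, and the names `by_sapply` and `by_cut` are only for illustration.

```r
# Hypothetical sample of construction years, chosen to hit each boundary.
years <- c(1872, 1905, 1906, 1940, 1941, 1975, 1976, 2010)

# The scalar function from the text above.
year <- function(x) {
  if (x > 1975) "Tier1" else if (x > 1940) "Tier2" else if (x > 1905) "Tier3" else "Tier4"
}
by_sapply <- sapply(years, year)

# Vectorized version: intervals are (-Inf,1905], (1905,1940], (1940,1975], (1975,Inf).
by_cut <- cut(years,
              breaks = c(-Inf, 1905, 1940, 1975, Inf),
              labels = c("Tier4", "Tier3", "Tier2", "Tier1"))
# Reorder the levels so Tier1 remains the reference level in lm().
by_cut <- factor(by_cut, levels = c("Tier1", "Tier2", "Tier3", "Tier4"))

print(data.frame(years, by_sapply, by_cut))
```

Both columns of the printed table should agree row by row; the `cut()` version also returns a factor directly, so the reference level is explicit.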

TestModel <- lm(SalePrice ~ Gr.Liv.Area +  Land.Contour+ Year, data= Data)
summary(TestModel)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Land.Contour + Year, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -418304  -27871   -2312   21809  322074 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      45370.024   5851.253   7.754 1.22e-14 ***
## Gr.Liv.Area         94.458      1.872  50.470  < 2e-16 ***
## Land.ContourHLS  68345.424   6244.047  10.946  < 2e-16 ***
## Land.ContourLow  38627.058   7610.528   5.075 4.11e-07 ***
## Land.ContourLvl  20913.586   4590.488   4.556 5.43e-06 ***
## YearTier2       -47015.842   2098.272 -22.407  < 2e-16 ***
## YearTier3       -67814.941   2673.877 -25.362  < 2e-16 ***
## YearTier4       -94823.383   6167.467 -15.375  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47440 on 2922 degrees of freedom
## Multiple R-squared:  0.6481, Adjusted R-squared:  0.6473 
## F-statistic: 768.9 on 7 and 2922 DF,  p-value: < 2.2e-16

Although the \(R^2\) decreases from 0.7126 to 0.6481, every coefficient is now highly significant and the model uses 7 coefficients besides the intercept instead of 121, so I decide to continue with this model.

Overall Material Quality

The Overall Quality observation (Overall.Qual) rates the overall material and finish of the house. Before testing this observation, I can see from the data that it has too many levels (Very Excellent, Excellent, Very Good, Good, Above Average, Average, Below Average, Fair, Poor, Very Poor), so I decide to make a simpler categorization and create 3 levels: High, for scores higher than 7 points; Mid, for scores from 5 to 7 points; and Low, for scores of 4 points or lower.

quality <- function(x) {if(x> 7) 'High' else if (x>4)  'Mid' else 'Low'}
Data$MQuality <- sapply(Data$Overall.Qual, quality)
Data$MQuality <- factor(Data$MQuality, levels= c("High", "Mid", "Low"))
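To make the boundaries concrete, the following sketch applies the same function to every possible score from 1 to 10 and lists which scores fall into each level. The names `scores` and `levels_assigned` are hypothetical and used only for this check.

```r
# The function from the text above, applied to each possible score.
quality <- function(x) { if (x > 7) "High" else if (x > 4) "Mid" else "Low" }

scores <- 1:10
levels_assigned <- factor(sapply(scores, quality),
                          levels = c("High", "Mid", "Low"))

# Which scores land in each level
grouped <- split(scores, levels_assigned)
print(grouped)
```

Because High is listed first in `levels`, it becomes the reference level, which is why the model output only shows coefficients for Mid and Low.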


TestModel <- lm(SalePrice ~  Gr.Liv.Area + Year + MQuality +  Land.Contour, data= Data)
summary(TestModel)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Year + MQuality + Land.Contour, 
##     data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -392101  -18703    -843   17307  283004 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      140161.98    5637.45  24.863  < 2e-16 ***
## Gr.Liv.Area          73.01       1.70  42.940  < 2e-16 ***
## YearTier2        -25876.22    1876.12 -13.792  < 2e-16 ***
## YearTier3        -44409.29    2373.17 -18.713  < 2e-16 ***
## YearTier4        -68321.49    5286.89 -12.923  < 2e-16 ***
## MQualityMid      -80205.66    2377.03 -33.742  < 2e-16 ***
## MQualityLow     -104705.14    3645.03 -28.725  < 2e-16 ***
## Land.ContourHLS   41842.26    5317.89   7.868 5.02e-15 ***
## Land.ContourLow   39644.15    6411.74   6.183 7.16e-10 ***
## Land.ContourLvl   16710.24    3867.74   4.320 1.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39950 on 2920 degrees of freedom
## Multiple R-squared:  0.7507, Adjusted R-squared:  0.7499 
## F-statistic: 976.7 on 9 and 2920 DF,  p-value: < 2.2e-16

After looking at the significance of the quality levels and the \(R^2\) of 0.7506506, compared to 0.6481 for the previous model, I decide to keep this observation and continue to look for another one to get a better prediction.

External Quality

After looking at the different options available, I decided to choose something more specific than just the overall material quality of the house. People love having good materials on the outside to protect their homes and make them nicer, so better external quality should raise the price.

TestModel <- lm(SalePrice ~  Gr.Liv.Area + MQuality + Year + Exter.Qual +  Land.Contour, data= Data)
Data$Exter.Qual <- factor(Data$Exter.Qual, levels= c("Ex", "Gd", "TA", "Fa", "PO"))
summary(TestModel)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + MQuality + Year + Exter.Qual + 
##     Land.Contour, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -437532  -18396     100   16972  290935 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.168e+05  6.273e+03  34.570  < 2e-16 ***
## Gr.Liv.Area      6.747e+01  1.586e+00  42.547  < 2e-16 ***
## MQualityMid     -5.780e+04  2.424e+03 -23.844  < 2e-16 ***
## MQualityLow     -8.100e+04  3.543e+03 -22.860  < 2e-16 ***
## YearTier2       -1.538e+04  2.000e+03  -7.692 1.97e-14 ***
## YearTier3       -3.285e+04  2.385e+03 -13.774  < 2e-16 ***
## YearTier4       -5.758e+04  4.939e+03 -11.658  < 2e-16 ***
## Exter.QualFa    -1.256e+05  7.829e+03 -16.039  < 2e-16 ***
## Exter.QualGd    -7.984e+04  3.998e+03 -19.968  < 2e-16 ***
## Exter.QualTA    -1.020e+05  4.466e+03 -22.829  < 2e-16 ***
## Land.ContourHLS  3.521e+04  4.900e+03   7.187 8.40e-13 ***
## Land.ContourLow  4.166e+04  5.904e+03   7.056 2.14e-12 ***
## Land.ContourLvl  1.504e+04  3.565e+03   4.220 2.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36740 on 2917 degrees of freedom
## Multiple R-squared:  0.7893, Adjusted R-squared:  0.7885 
## F-statistic: 910.7 on 12 and 2917 DF,  p-value: < 2.2e-16

Indeed, the fit continues to improve: the \(R^2\) rises to 0.7893 and every coefficient is statistically significant. So I decide to stay with these 5 predictors as my Final Model.

Final Model

For my Final Model I decided to keep 5 of the Predictors to maximize accuracy. The Predictors are: Living Area, Material Quality, Year Built, External Quality, and Land Contour.

TestModel <- lm(SalePrice ~  Gr.Liv.Area + MQuality + Year + Exter.Qual +  Land.Contour, data= Data)
summary(TestModel)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + MQuality + Year + Exter.Qual + 
##     Land.Contour, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -437532  -18396     100   16972  290935 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.168e+05  6.273e+03  34.570  < 2e-16 ***
## Gr.Liv.Area      6.747e+01  1.586e+00  42.547  < 2e-16 ***
## MQualityMid     -5.780e+04  2.424e+03 -23.844  < 2e-16 ***
## MQualityLow     -8.100e+04  3.543e+03 -22.860  < 2e-16 ***
## YearTier2       -1.538e+04  2.000e+03  -7.692 1.97e-14 ***
## YearTier3       -3.285e+04  2.385e+03 -13.774  < 2e-16 ***
## YearTier4       -5.758e+04  4.939e+03 -11.658  < 2e-16 ***
## Exter.QualGd    -7.984e+04  3.998e+03 -19.968  < 2e-16 ***
## Exter.QualTA    -1.020e+05  4.466e+03 -22.829  < 2e-16 ***
## Exter.QualFa    -1.256e+05  7.829e+03 -16.039  < 2e-16 ***
## Land.ContourHLS  3.521e+04  4.900e+03   7.187 8.40e-13 ***
## Land.ContourLow  4.166e+04  5.904e+03   7.056 2.14e-12 ***
## Land.ContourLvl  1.504e+04  3.565e+03   4.220 2.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36740 on 2917 degrees of freedom
## Multiple R-squared:  0.7893, Adjusted R-squared:  0.7885 
## F-statistic: 910.7 on 12 and 2917 DF,  p-value: < 2.2e-16

We can see that the model has an \(R^2\) of 0.7893241, which tells how closely the data follow the fitted regression. In other words, the model explains about 79% of the variability in sale price. The test statistic (F) is 910.7427297 and the p-value is < 2.2e-16.
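As a sketch of what the \(R^2\) figure measures, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The example below uses the `mtcars` data that ships with R as a stand-in, since it runs without the housing file; the names `fit`, `rss`, `tss`, and `r2_by_hand` are only for illustration.

```r
# mtcars ships with R and is used here only as a stand-in for the housing data.
fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(fit)^2)                   # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2_by_hand <- 1 - rss / tss

# The two values should agree.
print(c(by_hand = r2_by_hand, from_summary = summary(fit)$r.squared))
```

The same three lines applied to `TestModel` and `Data$SalePrice` should reproduce the value reported by `summary(TestModel)`.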

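Final Equation: SalePrice = \(2.1684553 \times 10^{5} + 67.4736498A - 5.7804381 \times 10^{4}B - 8.1002056 \times 10^{4}C - 1.5384882 \times 10^{4}D - 3.2849891 \times 10^{4}E - 5.7576127 \times 10^{4}F - 7.9836372 \times 10^{4}G - 1.0196628 \times 10^{5}H - 1.255766 \times 10^{5}I + 3.521253 \times 10^{4}J + 4.1658906 \times 10^{4}K + 1.5043877 \times 10^{4}L\)

Here \(A\) is Gr.Liv.Area, and \(B\) through \(L\) are the 0/1 indicators MQualityMid, MQualityLow, YearTier2, YearTier3, YearTier4, Exter.QualGd, Exter.QualTA, Exter.QualFa, Land.ContourHLS, Land.ContourLow, and Land.ContourLvl, in the order of the coefficient table.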
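As a worked example of how the equation is used, the sketch below plugs in a hypothetical house (not a row from the data): 1500 square feet of living area, Mid material quality, built in Tier 2, typical exterior quality (TA), and a level lot (Lvl). Only the coefficients whose indicators equal 1 enter the sum.

```r
# Coefficients copied from the final model summary above.
b <- c(intercept = 216845.53, area = 67.4736498,
       MQualityMid = -57804.381, YearTier2 = -15384.882,
       Exter.QualTA = -101966.28, Land.ContourLvl = 15043.877)

# Hypothetical house: 1500 sq ft, Mid quality, Tier2, TA exterior, level lot.
# Every other indicator is 0, so its coefficient drops out.
area <- 1500
predicted <- b[["intercept"]] + b[["area"]] * area + b[["MQualityMid"]] +
  b[["YearTier2"]] + b[["Exter.QualTA"]] + b[["Land.ContourLvl"]]

print(round(predicted))  # about 157944
```

With the fitted model in memory, `predict(TestModel, newdata = ...)` on a one-row data frame with the same values should give the same figure up to rounding of the coefficients.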

Predictors

1. Living Area

The predictor has a coefficient value of 67.4736498, meaning that each additional square foot of living area adds about 67.47 to the predicted sale price when the other predictors are held fixed.

The graph shows the relation between the Living Area and the Sale Price.

ggplot(Data, aes(Gr.Liv.Area, SalePrice))+ geom_point()

2. Land Contour

The predictor has coefficient values of Land.ContourHLS: \(3.521253 \times 10^{4}\), Land.ContourLow: \(4.1658906 \times 10^{4}\), and Land.ContourLvl: \(1.5043877 \times 10^{4}\). Depending on the type of land, each coefficient is multiplied by a value of 0 or 1; for instance, if the type of land is Low, the coefficient for Land.ContourLow is multiplied by 1 and the rest by 0. Banked land (Bnk) is the reference level, so for it all three are multiplied by 0.
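The 0/1 multiplication described above is what R builds internally for a factor. A minimal sketch, using a hypothetical data frame `lots` with one row per contour level, shows the indicator columns that `lm()` would use:

```r
# Hypothetical lots, one per contour level; Bnk sorts first, so it is the reference.
lots <- data.frame(Land.Contour = factor(c("Bnk", "HLS", "Low", "Lvl")))

# Each row has a 1 in the column for its own level and 0 elsewhere;
# the reference level Bnk has 0 in every indicator column.
X <- model.matrix(~ Land.Contour, data = lots)
print(X)
```

The same coding applies to the Year, MQuality, and Exter.Qual factors, each with its own reference level.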

The graph shows the relation between the Land Contour and the Sale Price.

ggplot(Data, aes(Land.Contour, SalePrice))+ geom_boxplot(aes(color=Land.Contour))

3. Year Built

The predictor has coefficient values of YearTier2: \(-1.5384882 \times 10^{4}\), YearTier3: \(-3.2849891 \times 10^{4}\), and YearTier4: \(-5.7576127 \times 10^{4}\). Depending on the Year Built, each coefficient is multiplied by a value of 0 or 1; for instance, if the Year Built is in Tier 2 (after 1940 and up to 1975), the coefficient for YearTier2 is multiplied by 1 and the rest by 0. Tier 1 (after 1975) is the reference level.

The graph shows the relation between the Year Built and the Sale Price.

ggplot(Data, aes(Year, SalePrice))+ geom_boxplot(aes(color=Year))

4. Overall Material Quality

The predictor has coefficient values of MQualityMid: \(-5.7804381 \times 10^{4}\) and MQualityLow: \(-8.1002056 \times 10^{4}\). Depending on the Overall Material Quality, each coefficient is multiplied by a value of 0 or 1; for instance, if the Material Quality is Low, the coefficient for MQualityLow is multiplied by 1 and the other by 0. High is the reference level.

The graph shows the relation between the Overall Material Quality and the Sale Price.

ggplot(Data, aes(MQuality, SalePrice))+ geom_boxplot(aes(color=MQuality))

5. External Material Quality

The predictor has coefficient values of Exter.QualGd: \(-7.9836372 \times 10^{4}\), Exter.QualTA: \(-1.0196628 \times 10^{5}\), and Exter.QualFa: \(-1.255766 \times 10^{5}\). Depending on the External Material Quality, each coefficient is multiplied by a value of 0 or 1; for example, if the External Quality falls in the category of Fair, the coefficient for Exter.QualFa is multiplied by 1 and the rest by 0. Excellent (Ex) is the reference level.

The graph shows the relation between the External Quality and the Sale Price.

ggplot(Data, aes(Exter.Qual, SalePrice))+ geom_boxplot(aes(color=as.factor(Exter.Qual)))

Noticeable Relations within Variables

Some of the variables are related to each other. Interestingly, the Living Area and the External Quality of the house go together: the larger the living area, the higher the external material quality tends to be.

ggplot(Data, aes(Exter.Qual, Gr.Liv.Area))+ geom_boxplot(aes(color=Exter.Qual))

The same holds for the Overall Material Quality.

ggplot(Data, aes(MQuality, Gr.Liv.Area))+ geom_boxplot(aes(color=MQuality))