library("ggplot2")
Data <- read.csv("IowaHousing.csv")
The data I am using for this experiment is IowaHousing.csv. As described by Dean De Cock of Truman State University, the data were originally obtained from the Ames Assessor's Office, where they are used for tax assessment purposes, but they also lend themselves directly to the prediction of home selling prices. In this experiment I use the data to find an appropriate linear model that explains as much as possible of the variability in Iowa housing prices.
After looking at the different variables that could determine house value, I chose to start with Gr.Liv.Area (continuous), which contains the above-grade living area in square feet.
TestModel <- lm(SalePrice ~ Gr.Liv.Area, data= Data)
summary(TestModel)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -483467 -30219 -1966 22728 334323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13289.634 3269.703 4.064 4.94e-05 ***
## Gr.Liv.Area 111.694 2.066 54.061 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared: 0.4995, Adjusted R-squared: 0.4994
## F-statistic: 2923 on 1 and 2928 DF, p-value: < 2.2e-16
As a start, the predictor is statistically significant, and the model gives an \(R^2\) of 0.4995379, which isn't a bad start. I proceed to pick another variable.
Is the flatness of the property a major factor in its price? This variable (Land.Contour) has 4 levels: Lvl: Near Flat/Level, Bnk: Banked, HLS: Hillside, and Low: Depression. I add it to the model to test its significance.
TestModel <- lm(SalePrice ~ Gr.Liv.Area + Land.Contour, data= Data)
summary(TestModel)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Land.Contour, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -432316 -29531 -1300 23420 335657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32753.495 6001.240 -5.458 5.22e-08 ***
## Gr.Liv.Area 110.789 2.011 55.093 < 2e-16 ***
## Land.ContourHLS 97732.543 7118.288 13.730 < 2e-16 ***
## Land.ContourLow 63364.313 8699.292 7.284 4.15e-13 ***
## Land.ContourLvl 46849.476 5179.521 9.045 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54760 on 2925 degrees of freedom
## Multiple R-squared: 0.5307, Adjusted R-squared: 0.5301
## F-statistic: 826.9 on 4 and 2925 DF, p-value: < 2.2e-16
The fit improves, although not by a lot: the \(R^2\) rises to 0.530692. I decide to keep this variable because it might matter even more later on, and I continue adding variables to try to form a better model of house pricing.
The bedrooms variable (Bedroom.AbvGr) gives the number of bedrooms above grade (not including basement bedrooms). I picked this variable because it makes sense in theory that the more bedrooms a house has, the higher its price.
TestModel <- lm(SalePrice ~ Gr.Liv.Area + as.factor(Bedroom.AbvGr), data= Data)
summary(TestModel)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + as.factor(Bedroom.AbvGr),
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -598290 -27410 -324 23925 334340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.882e+04 1.871e+04 1.540 0.12365
## Gr.Liv.Area 1.391e+02 2.304e+00 60.363 < 2e-16 ***
## as.factor(Bedroom.AbvGr)1 -6.426e+03 1.910e+04 -0.336 0.73660
## as.factor(Bedroom.AbvGr)2 -3.692e+04 1.855e+04 -1.990 0.04666 *
## as.factor(Bedroom.AbvGr)3 -5.509e+04 1.850e+04 -2.978 0.00292 **
## as.factor(Bedroom.AbvGr)4 -9.669e+04 1.870e+04 -5.171 2.48e-07 ***
## as.factor(Bedroom.AbvGr)5 -1.495e+05 2.006e+04 -7.454 1.18e-13 ***
## as.factor(Bedroom.AbvGr)6 -1.649e+05 2.175e+04 -7.580 4.62e-14 ***
## as.factor(Bedroom.AbvGr)8 -3.009e+05 5.554e+04 -5.418 6.53e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52180 on 2921 degrees of freedom
## Multiple R-squared: 0.5745, Adjusted R-squared: 0.5734
## F-statistic: 493 on 8 and 2921 DF, p-value: < 2.2e-16
After fitting the model, although the \(R^2\) increased, some of the levels of this factor are not significant, so I decide to remove this variable and continue testing other ones.
Furthermore, I decided to test the year the house was built (Year.Built) by adding it to the model kept so far.
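Each level of a factor gets its own t-test, so a factor can look mixed even when it matters as a whole. One hedged way to judge a factor such as the bedroom count as a single unit is a nested F-test with anova(). The sketch below uses invented toy data: the names toy, area, beds, and price are assumptions for illustration, not columns of IowaHousing.csv, and the numbers say nothing about the real data.

```r
# Toy data, invented for illustration only (not the Iowa housing data)
set.seed(1)
n <- 200
toy <- data.frame(
  area = runif(n, 500, 3000),
  beds = factor(rep(1:4, length.out = n))  # 4 bedroom levels, 50 houses each
)
toy$price <- 20000 + 100 * toy$area -
  5000 * as.integer(toy$beds) + rnorm(n, sd = 20000)

# Model without and with the factor
small <- lm(price ~ area, data = toy)
big   <- lm(price ~ area + beds, data = toy)

# Nested F-test: a single p-value for all bedroom levels together
cmp <- anova(small, big)
print(cmp)
```

The second row of the table tests all levels of the factor at once, which avoids deciding from the individual level p-values alone.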
TestModel <- lm(SalePrice ~ Gr.Liv.Area + Land.Contour+ as.factor(Year.Built), data= Data)
summary(TestModel)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Land.Contour + as.factor(Year.Built),
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -471063 -24035 -1245 19594 292581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.080e+05 4.421e+04 -2.443 0.014629 *
## Gr.Liv.Area 9.089e+01 1.821e+00 49.910 < 2e-16 ***
## Land.ContourHLS 5.421e+04 6.005e+03 9.028 < 2e-16 ***
## Land.ContourLow 4.265e+04 7.223e+03 5.904 3.97e-09 ***
## Land.ContourLvl 1.570e+04 4.467e+03 3.516 0.000445 ***
## as.factor(Year.Built)1875 9.361e+04 6.190e+04 1.512 0.130624
## as.factor(Year.Built)1879 5.427e+04 6.186e+04 0.877 0.380329
## as.factor(Year.Built)1880 6.204e+04 4.792e+04 1.295 0.195578
## as.factor(Year.Built)1882 1.020e+05 6.187e+04 1.648 0.099367 .
## as.factor(Year.Built)1885 5.192e+04 5.358e+04 0.969 0.332696
## as.factor(Year.Built)1890 7.750e+04 4.678e+04 1.657 0.097659 .
## as.factor(Year.Built)1892 1.118e+05 5.358e+04 2.087 0.037003 *
## as.factor(Year.Built)1893 1.651e+05 6.186e+04 2.669 0.007654 **
## as.factor(Year.Built)1895 6.315e+04 5.052e+04 1.250 0.211412
## as.factor(Year.Built)1896 5.592e+04 6.191e+04 0.903 0.366440
## as.factor(Year.Built)1898 3.995e+03 6.186e+04 0.065 0.948512
## as.factor(Year.Built)1900 6.552e+04 4.451e+04 1.472 0.141165
## as.factor(Year.Built)1901 9.884e+04 5.361e+04 1.844 0.065350 .
## as.factor(Year.Built)1902 3.610e+04 6.203e+04 0.582 0.560603
## as.factor(Year.Built)1904 1.099e+05 6.187e+04 1.777 0.075718 .
## as.factor(Year.Built)1905 7.581e+04 5.052e+04 1.500 0.133605
## as.factor(Year.Built)1906 9.380e+04 6.188e+04 1.516 0.129669
## as.factor(Year.Built)1907 1.445e+04 6.186e+04 0.234 0.815341
## as.factor(Year.Built)1908 1.196e+05 5.363e+04 2.230 0.025802 *
## as.factor(Year.Built)1910 7.062e+04 4.428e+04 1.595 0.110835
## as.factor(Year.Built)1911 5.951e+04 6.203e+04 0.959 0.337394
## as.factor(Year.Built)1912 6.485e+04 4.797e+04 1.352 0.176460
## as.factor(Year.Built)1913 1.196e+05 6.193e+04 1.931 0.053592 .
## as.factor(Year.Built)1914 7.784e+04 4.642e+04 1.677 0.093664 .
## as.factor(Year.Built)1915 8.195e+04 4.467e+04 1.835 0.066670 .
## as.factor(Year.Built)1916 8.968e+04 4.592e+04 1.953 0.050918 .
## as.factor(Year.Built)1917 7.253e+04 5.053e+04 1.435 0.151300
## as.factor(Year.Built)1918 9.999e+04 4.589e+04 2.179 0.029418 *
## as.factor(Year.Built)1919 8.632e+04 4.794e+04 1.800 0.071901 .
## as.factor(Year.Built)1920 8.550e+04 4.417e+04 1.936 0.053003 .
## as.factor(Year.Built)1921 1.009e+05 4.575e+04 2.205 0.027518 *
## as.factor(Year.Built)1922 8.429e+04 4.514e+04 1.867 0.061967 .
## as.factor(Year.Built)1923 9.642e+04 4.505e+04 2.140 0.032424 *
## as.factor(Year.Built)1924 9.253e+04 4.514e+04 2.050 0.040457 *
## as.factor(Year.Built)1925 1.055e+05 4.443e+04 2.374 0.017641 *
## as.factor(Year.Built)1926 1.100e+05 4.494e+04 2.449 0.014389 *
## as.factor(Year.Built)1927 1.012e+05 4.615e+04 2.193 0.028370 *
## as.factor(Year.Built)1928 1.116e+05 4.615e+04 2.418 0.015669 *
## as.factor(Year.Built)1929 1.223e+05 4.644e+04 2.633 0.008504 **
## as.factor(Year.Built)1930 1.037e+05 4.463e+04 2.323 0.020232 *
## as.factor(Year.Built)1931 8.497e+04 4.680e+04 1.815 0.069578 .
## as.factor(Year.Built)1932 1.245e+05 4.799e+04 2.595 0.009516 **
## as.factor(Year.Built)1934 1.033e+05 4.797e+04 2.153 0.031406 *
## as.factor(Year.Built)1935 1.126e+05 4.541e+04 2.480 0.013182 *
## as.factor(Year.Built)1936 1.066e+05 4.573e+04 2.330 0.019863 *
## as.factor(Year.Built)1937 1.079e+05 4.621e+04 2.334 0.019666 *
## as.factor(Year.Built)1938 9.722e+04 4.544e+04 2.140 0.032479 *
## as.factor(Year.Built)1939 1.077e+05 4.486e+04 2.400 0.016475 *
## as.factor(Year.Built)1940 1.084e+05 4.440e+04 2.441 0.014717 *
## as.factor(Year.Built)1941 1.041e+05 4.472e+04 2.328 0.019997 *
## as.factor(Year.Built)1942 1.062e+05 4.729e+04 2.245 0.024828 *
## as.factor(Year.Built)1945 9.984e+04 4.524e+04 2.207 0.027397 *
## as.factor(Year.Built)1946 8.356e+04 4.524e+04 1.847 0.064837 .
## as.factor(Year.Built)1947 1.050e+05 4.574e+04 2.295 0.021786 *
## as.factor(Year.Built)1948 1.023e+05 4.458e+04 2.295 0.021789 *
## as.factor(Year.Built)1949 8.992e+04 4.498e+04 1.999 0.045707 *
## as.factor(Year.Built)1950 1.085e+05 4.435e+04 2.445 0.014529 *
## as.factor(Year.Built)1951 1.049e+05 4.498e+04 2.332 0.019745 *
## as.factor(Year.Built)1952 1.011e+05 4.499e+04 2.246 0.024766 *
## as.factor(Year.Built)1953 1.111e+05 4.468e+04 2.487 0.012951 *
## as.factor(Year.Built)1954 1.180e+05 4.430e+04 2.665 0.007747 **
## as.factor(Year.Built)1955 1.074e+05 4.443e+04 2.418 0.015684 *
## as.factor(Year.Built)1956 1.230e+05 4.435e+04 2.774 0.005573 **
## as.factor(Year.Built)1957 1.151e+05 4.440e+04 2.592 0.009595 **
## as.factor(Year.Built)1958 1.130e+05 4.424e+04 2.554 0.010707 *
## as.factor(Year.Built)1959 1.243e+05 4.429e+04 2.807 0.005035 **
## as.factor(Year.Built)1960 1.218e+05 4.437e+04 2.746 0.006073 **
## as.factor(Year.Built)1961 1.210e+05 4.443e+04 2.722 0.006520 **
## as.factor(Year.Built)1962 1.195e+05 4.440e+04 2.692 0.007148 **
## as.factor(Year.Built)1963 1.263e+05 4.441e+04 2.843 0.004496 **
## as.factor(Year.Built)1964 1.187e+05 4.443e+04 2.672 0.007574 **
## as.factor(Year.Built)1965 1.220e+05 4.442e+04 2.747 0.006048 **
## as.factor(Year.Built)1966 1.341e+05 4.440e+04 3.021 0.002545 **
## as.factor(Year.Built)1967 1.160e+05 4.431e+04 2.617 0.008915 **
## as.factor(Year.Built)1968 1.231e+05 4.426e+04 2.781 0.005453 **
## as.factor(Year.Built)1969 1.123e+05 4.454e+04 2.521 0.011746 *
## as.factor(Year.Built)1970 1.107e+05 4.431e+04 2.498 0.012544 *
## as.factor(Year.Built)1971 1.243e+05 4.435e+04 2.804 0.005088 **
## as.factor(Year.Built)1972 1.200e+05 4.434e+04 2.707 0.006824 **
## as.factor(Year.Built)1973 1.019e+05 4.481e+04 2.273 0.023091 *
## as.factor(Year.Built)1974 1.249e+05 4.471e+04 2.794 0.005236 **
## as.factor(Year.Built)1975 1.310e+05 4.466e+04 2.934 0.003379 **
## as.factor(Year.Built)1976 1.213e+05 4.417e+04 2.745 0.006090 **
## as.factor(Year.Built)1977 1.248e+05 4.415e+04 2.827 0.004725 **
## as.factor(Year.Built)1978 1.289e+05 4.429e+04 2.910 0.003645 **
## as.factor(Year.Built)1979 1.214e+05 4.482e+04 2.709 0.006797 **
## as.factor(Year.Built)1980 1.290e+05 4.457e+04 2.893 0.003841 **
## as.factor(Year.Built)1981 1.694e+05 4.589e+04 3.691 0.000228 ***
## as.factor(Year.Built)1982 1.490e+05 4.684e+04 3.180 0.001486 **
## as.factor(Year.Built)1983 1.491e+05 4.647e+04 3.209 0.001347 **
## as.factor(Year.Built)1984 1.399e+05 4.492e+04 3.114 0.001864 **
## as.factor(Year.Built)1985 1.328e+05 4.679e+04 2.838 0.004572 **
## as.factor(Year.Built)1986 1.458e+05 4.570e+04 3.190 0.001439 **
## as.factor(Year.Built)1987 1.687e+05 4.643e+04 3.633 0.000285 ***
## as.factor(Year.Built)1988 1.418e+05 4.520e+04 3.138 0.001718 **
## as.factor(Year.Built)1989 1.368e+05 4.645e+04 2.945 0.003251 **
## as.factor(Year.Built)1990 1.400e+05 4.489e+04 3.119 0.001835 **
## as.factor(Year.Built)1991 1.341e+05 4.555e+04 2.944 0.003269 **
## as.factor(Year.Built)1992 1.386e+05 4.456e+04 3.109 0.001894 **
## as.factor(Year.Built)1993 1.502e+05 4.429e+04 3.392 0.000704 ***
## as.factor(Year.Built)1994 1.625e+05 4.434e+04 3.664 0.000253 ***
## as.factor(Year.Built)1995 1.660e+05 4.445e+04 3.734 0.000192 ***
## as.factor(Year.Built)1996 1.586e+05 4.439e+04 3.572 0.000361 ***
## as.factor(Year.Built)1997 1.473e+05 4.438e+04 3.320 0.000913 ***
## as.factor(Year.Built)1998 1.605e+05 4.421e+04 3.630 0.000289 ***
## as.factor(Year.Built)1999 1.444e+05 4.417e+04 3.268 0.001096 **
## as.factor(Year.Built)2000 1.503e+05 4.420e+04 3.400 0.000682 ***
## as.factor(Year.Built)2001 1.743e+05 4.437e+04 3.929 8.75e-05 ***
## as.factor(Year.Built)2002 1.635e+05 4.422e+04 3.697 0.000223 ***
## as.factor(Year.Built)2003 1.688e+05 4.400e+04 3.837 0.000128 ***
## as.factor(Year.Built)2004 1.616e+05 4.398e+04 3.673 0.000244 ***
## as.factor(Year.Built)2005 1.803e+05 4.391e+04 4.105 4.15e-05 ***
## as.factor(Year.Built)2006 1.929e+05 4.392e+04 4.392 1.17e-05 ***
## as.factor(Year.Built)2007 1.987e+05 4.396e+04 4.521 6.40e-06 ***
## as.factor(Year.Built)2008 2.263e+05 4.419e+04 5.121 3.25e-07 ***
## as.factor(Year.Built)2009 2.187e+05 4.463e+04 4.901 1.01e-06 ***
## as.factor(Year.Built)2010 2.392e+05 5.053e+04 4.734 2.31e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43740 on 2808 degrees of freedom
## Multiple R-squared: 0.7126, Adjusted R-squared: 0.7002
## F-statistic: 57.54 on 121 and 2808 DF, p-value: < 2.2e-16
We can see that the \(R^2\) rises hugely compared to the previous model, but not all of the years are significant for the price of the house. So I decide to make 4 intervals (tiers) of about 35 years each to divide the data and get a better model: Tier1 for houses built after 1975, Tier2 for 1941 to 1975, Tier3 for 1906 to 1940, and Tier4 for 1905 and earlier.
year<- function(x){ if(x>1975) 'Tier1' else if(x>1940) 'Tier2' else if(x>1905) 'Tier3' else 'Tier4'}
Data$Year <- sapply(Data$Year.Built, year)
TestModel <- lm(SalePrice ~ Gr.Liv.Area + Land.Contour+ Year, data= Data)
summary(TestModel)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Land.Contour + Year, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -418304 -27871 -2312 21809 322074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45370.024 5851.253 7.754 1.22e-14 ***
## Gr.Liv.Area 94.458 1.872 50.470 < 2e-16 ***
## Land.ContourHLS 68345.424 6244.047 10.946 < 2e-16 ***
## Land.ContourLow 38627.058 7610.528 5.075 4.11e-07 ***
## Land.ContourLvl 20913.586 4590.488 4.556 5.43e-06 ***
## YearTier2 -47015.842 2098.272 -22.407 < 2e-16 ***
## YearTier3 -67814.941 2673.877 -25.362 < 2e-16 ***
## YearTier4 -94823.383 6167.467 -15.375 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47440 on 2922 degrees of freedom
## Multiple R-squared: 0.6481, Adjusted R-squared: 0.6473
## F-statistic: 768.9 on 7 and 2922 DF, p-value: < 2.2e-16
Although the \(R^2\) value decreases a bit (from 0.7126 to 0.6481), every tier is now highly significant and the model needs far fewer coefficients, so I decide to continue with this model.
The Overall Quality variable (Overall.Qual) rates the overall material and finish of the house. Before testing it, I can see from the data that it has too many levels (Very Excellent, Excellent, Very Good, Good, Above Average, Average, Below Average, Fair, Poor, Very Poor), so I make a simpler categorization with 3 levels: High, which is higher than 7 points; Mid, which is higher than 4 points and up to 7; and Low, which is 4 points or lower.
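The tier split used above can also be written in vectorized form with cut(), which avoids calling a function once per row. This is only a sketch of an equivalent approach, not the method used in this report; the years vector is a set of example values chosen to sit on the tier boundaries.

```r
# Example build years on and around the tier boundaries (illustrative values)
years <- c(1872, 1905, 1906, 1940, 1941, 1975, 1976, 2010)

# The element-by-element rule used in the report
year <- function(x) {
  if (x > 1975) "Tier1" else if (x > 1940) "Tier2" else if (x > 1905) "Tier3" else "Tier4"
}
tier_loop <- sapply(years, year)

# Vectorized equivalent with right-closed intervals:
# (-Inf,1905], (1905,1940], (1940,1975], (1975,Inf)
tier_cut <- cut(years,
                breaks = c(-Inf, 1905, 1940, 1975, Inf),
                labels = c("Tier4", "Tier3", "Tier2", "Tier1"))

# Put Tier1 first so it stays the reference level in lm()
tier_cut <- factor(tier_cut, levels = c("Tier1", "Tier2", "Tier3", "Tier4"))

print(data.frame(years, tier_loop, tier_cut))
```

Both versions assign the same tier to every year; the cut() version also returns a factor directly, so the reference level is explicit.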
quality <- function(x) {if(x> 7) 'High' else if (x>4) 'Mid' else 'Low'}
Data$MQuality <- sapply(Data$Overall.Qual, quality)
Data$MQuality <- factor(Data$MQuality, levels= c("High", "Mid", "Low"))
TestModel <- lm(SalePrice ~ Gr.Liv.Area + Year + MQuality + Land.Contour, data= Data)
summary(TestModel)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Year + MQuality + Land.Contour,
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -392101 -18703 -843 17307 283004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140161.98 5637.45 24.863 < 2e-16 ***
## Gr.Liv.Area 73.01 1.70 42.940 < 2e-16 ***
## YearTier2 -25876.22 1876.12 -13.792 < 2e-16 ***
## YearTier3 -44409.29 2373.17 -18.713 < 2e-16 ***
## YearTier4 -68321.49 5286.89 -12.923 < 2e-16 ***
## MQualityMid -80205.66 2377.03 -33.742 < 2e-16 ***
## MQualityLow -104705.14 3645.03 -28.725 < 2e-16 ***
## Land.ContourHLS 41842.26 5317.89 7.868 5.02e-15 ***
## Land.ContourLow 39644.15 6411.74 6.183 7.16e-10 ***
## Land.ContourLvl 16710.24 3867.74 4.320 1.61e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39950 on 2920 degrees of freedom
## Multiple R-squared: 0.7507, Adjusted R-squared: 0.7499
## F-statistic: 976.7 on 9 and 2920 DF, p-value: < 2.2e-16
After looking at the significance of the year tiers and quality levels and the \(R^2\) of 0.7506506, compared to 0.6481 for the previous model, I decide to keep this variable and look for another one to get a better prediction.
After looking at the options available, I decided to choose something more specific than the overall material quality of the house: the exterior quality (Exter.Qual). People love to have good materials on the outside to protect their homes and make them nicer, which should raise the price.
TestModel <- lm(SalePrice ~ Gr.Liv.Area + MQuality + Year + Exter.Qual + Land.Contour, data= Data)
Data$Exter.Qual <- factor(Data$Exter.Qual, levels= c("Ex", "Gd", "TA", "Fa", "PO"))
summary(TestModel)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + MQuality + Year + Exter.Qual +
## Land.Contour, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -437532 -18396 100 16972 290935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.168e+05 6.273e+03 34.570 < 2e-16 ***
## Gr.Liv.Area 6.747e+01 1.586e+00 42.547 < 2e-16 ***
## MQualityMid -5.780e+04 2.424e+03 -23.844 < 2e-16 ***
## MQualityLow -8.100e+04 3.543e+03 -22.860 < 2e-16 ***
## YearTier2 -1.538e+04 2.000e+03 -7.692 1.97e-14 ***
## YearTier3 -3.285e+04 2.385e+03 -13.774 < 2e-16 ***
## YearTier4 -5.758e+04 4.939e+03 -11.658 < 2e-16 ***
## Exter.QualFa -1.256e+05 7.829e+03 -16.039 < 2e-16 ***
## Exter.QualGd -7.984e+04 3.998e+03 -19.968 < 2e-16 ***
## Exter.QualTA -1.020e+05 4.466e+03 -22.829 < 2e-16 ***
## Land.ContourHLS 3.521e+04 4.900e+03 7.187 8.40e-13 ***
## Land.ContourLow 4.166e+04 5.904e+03 7.056 2.14e-12 ***
## Land.ContourLvl 1.504e+04 3.565e+03 4.220 2.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36740 on 2917 degrees of freedom
## Multiple R-squared: 0.7893, Adjusted R-squared: 0.7885
## F-statistic: 910.7 on 12 and 2917 DF, p-value: < 2.2e-16
Indeed, the fit continues to improve and every coefficient is statistically significant, so I decide to keep these 5 predictors as my final model.
For my Final Model I decided to keep 5 of the Predictors to maximize accuracy. The Predictors are: Living Area, Material Quality, Year Built, External Quality, and Land Contour.
TestModel <- lm(SalePrice ~ Gr.Liv.Area + MQuality + Year + Exter.Qual + Land.Contour, data= Data)
summary(TestModel)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + MQuality + Year + Exter.Qual +
## Land.Contour, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -437532 -18396 100 16972 290935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.168e+05 6.273e+03 34.570 < 2e-16 ***
## Gr.Liv.Area 6.747e+01 1.586e+00 42.547 < 2e-16 ***
## MQualityMid -5.780e+04 2.424e+03 -23.844 < 2e-16 ***
## MQualityLow -8.100e+04 3.543e+03 -22.860 < 2e-16 ***
## YearTier2 -1.538e+04 2.000e+03 -7.692 1.97e-14 ***
## YearTier3 -3.285e+04 2.385e+03 -13.774 < 2e-16 ***
## YearTier4 -5.758e+04 4.939e+03 -11.658 < 2e-16 ***
## Exter.QualGd -7.984e+04 3.998e+03 -19.968 < 2e-16 ***
## Exter.QualTA -1.020e+05 4.466e+03 -22.829 < 2e-16 ***
## Exter.QualFa -1.256e+05 7.829e+03 -16.039 < 2e-16 ***
## Land.ContourHLS 3.521e+04 4.900e+03 7.187 8.40e-13 ***
## Land.ContourLow 4.166e+04 5.904e+03 7.056 2.14e-12 ***
## Land.ContourLvl 1.504e+04 3.565e+03 4.220 2.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36740 on 2917 degrees of freedom
## Multiple R-squared: 0.7893, Adjusted R-squared: 0.7885
## F-statistic: 910.7 on 12 and 2917 DF, p-value: < 2.2e-16
We can see that the model has an \(R^2\) of 0.7893241, which measures how close the data are to the fitted regression line. In other words, the model explains about 79% of the variability in sale price. The F-statistic is 910.7427297 on 12 and 2917 degrees of freedom, and the p-value is < 2.2e-16.
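To make the meaning of \(R^2\) concrete, it can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this against summary() on invented toy data; toy, area, and price are assumed names for illustration, not the housing data.

```r
# Toy data, invented for illustration only (not the Iowa housing data)
set.seed(42)
toy <- data.frame(area = runif(100, 500, 3000))
toy$price <- 15000 + 110 * toy$area + rnorm(100, sd = 40000)

fit <- lm(price ~ area, data = toy)

# R^2 = 1 - RSS / TSS: the share of variance in price explained by the model
rss <- sum(residuals(fit)^2)
tss <- sum((toy$price - mean(toy$price))^2)
r2_manual  <- 1 - rss / tss
r2_summary <- summary(fit)$r.squared

print(c(manual = r2_manual, summary = r2_summary))
```

The two values agree, which shows that \(R^2\) is a proportion of explained variance and not a probability that a prediction is correct.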
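Final Equation: \(\text{SalePrice} = 2.1684553 \times 10^{5} + 67.4736498A - 5.7804381 \times 10^{4}B - 8.1002056 \times 10^{4}C - 1.5384882 \times 10^{4}D - 3.2849891 \times 10^{4}E - 5.7576127 \times 10^{4}F - 7.9836372 \times 10^{4}G - 1.0196628 \times 10^{5}H - 1.255766 \times 10^{5}I + 3.521253 \times 10^{4}J + 4.1658906 \times 10^{4}K + 1.5043877 \times 10^{4}L\), where \(A\) is Gr.Liv.Area and \(B\) through \(L\) are the 0/1 indicators in the order of the coefficient table above: MQualityMid, MQualityLow, YearTier2, YearTier3, YearTier4, Exter.QualGd, Exter.QualTA, Exter.QualFa, Land.ContourHLS, Land.ContourLow, and Land.ContourLvl.
Gr.Liv.Area has a coefficient of 67.4736498, meaning that each additional square foot of living area adds about $67.47 to the predicted sale price, with the other predictors held fixed.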
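As a worked example of the final equation, the sketch below plugs in one hypothetical house. The house itself is an assumption made up for illustration (1500 square feet of living area, Mid overall quality, built in Tier2, TA exterior quality, Lvl land contour); only the coefficients come from the final model summary above.

```r
# Coefficients copied from the final model summary in this report
b <- c(
  "(Intercept)"     =  2.1684553e5,
  "Gr.Liv.Area"     =  67.4736498,
  "MQualityMid"     = -5.7804381e4,
  "YearTier2"       = -1.5384882e4,
  "Exter.QualTA"    = -1.0196628e5,
  "Land.ContourLvl" =  1.5043877e4
)

# Hypothetical house (assumed values, not a row of the data)
living_area <- 1500

# Indicators for the house's own levels are 1; every other indicator is 0
predicted_price <- b[["(Intercept)"]] +
  b[["Gr.Liv.Area"]] * living_area +
  b[["MQualityMid"]] * 1 +
  b[["YearTier2"]] * 1 +
  b[["Exter.QualTA"]] * 1 +
  b[["Land.ContourLvl"]] * 1

cat(sprintf("%.2f\n", predicted_price))  # about 157944.34
```

Only the coefficients whose indicator equals 1 contribute, so the other seven terms of the equation drop out for this house.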
The graph shows the relation between the Living Area and the Sale Price.
ggplot(Data, aes(Gr.Liv.Area, SalePrice))+ geom_point()
Land.Contour has the coefficients Land.ContourHLS: \(3.521253 \times 10^{4}\), Land.ContourLow: \(4.1658906 \times 10^{4}\), and Land.ContourLvl: \(1.5043877 \times 10^{4}\). Depending on the type of land, each coefficient is multiplied by 0 or 1; for instance, if the land is Low, the coefficient for Land.ContourLow is multiplied by 1 and the rest by 0. Banked land (Bnk) is the baseline, so for it all three are multiplied by 0.
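The 0/1 multiplication described above is the dummy coding that lm() builds for a factor, and it can be inspected with model.matrix(). This is a minimal sketch on a four-value vector named contour (an assumed name), with one entry per land contour level.

```r
# One example value per land contour level
contour <- factor(c("Bnk", "HLS", "Low", "Lvl"))

# Design matrix that lm() would build for this factor
X <- model.matrix(~ contour)
print(X)
```

Each row has a 1 in the intercept column and at most one 1 among the indicator columns; the Bnk row has none, because the first level alphabetically is the baseline.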
The graph shows the relation between the Land Contour and the Sale Price.
ggplot(Data, aes(Land.Contour, SalePrice))+ geom_boxplot(aes(color=Land.Contour))
Year has the coefficients YearTier2: \(-1.5384882 \times 10^{4}\), YearTier3: \(-3.2849891 \times 10^{4}\), and YearTier4: \(-5.7576127 \times 10^{4}\). Depending on the year built, each coefficient is multiplied by 0 or 1; for instance, if the house is in Tier 2 (built after 1940 and no later than 1975), the coefficient for YearTier2 is multiplied by 1 and the rest by 0. Tier 1 is the baseline.
The graph shows the relation between the Year Built and the Sale Price.
ggplot(Data, aes(Year, SalePrice))+ geom_boxplot(aes(color=Year))
MQuality has the coefficients MQualityMid: \(-5.7804381 \times 10^{4}\) and MQualityLow: \(-8.1002056 \times 10^{4}\). Depending on the overall material quality, each coefficient is multiplied by 0 or 1; for instance, if the material quality is Low, the coefficient for MQualityLow is multiplied by 1 and the other by 0. High is the baseline.
The graph shows the relation between the Overall Material Quality and the Sale Price.
ggplot(Data, aes(MQuality, SalePrice))+ geom_boxplot(aes(color=MQuality))
Exter.Qual has the coefficients Exter.QualGd: \(-7.9836372 \times 10^{4}\), Exter.QualTA: \(-1.0196628 \times 10^{5}\), and Exter.QualFa: \(-1.255766 \times 10^{5}\). Depending on the external material quality, each coefficient is multiplied by 0 or 1; for example, if the material quality falls in the category Fair, the coefficient for Exter.QualFa is multiplied by 1 and the rest by 0. Excellent (Ex) is the baseline.
The graph shows the relation between the External Quality and the Sale Price.
ggplot(Data, aes(Exter.Qual, SalePrice))+ geom_boxplot(aes(color=as.factor(Exter.Qual)))
Some of the predictors are related to each other, such as the living area and the external quality of the house, which is really interesting: the larger the living area, the higher the external material quality tends to be.
ggplot(Data, aes(Exter.Qual, Gr.Liv.Area))+ geom_boxplot(aes(color=Exter.Qual))
The same holds for the overall material quality.
ggplot(Data, aes(MQuality, Gr.Liv.Area))+ geom_boxplot(aes(color=MQuality))