Understanding which factors influence the sale price of a home helps in pricing homes, in choosing which selling points to raise with potential buyers, and in judging a home’s overall value. In this report, we explore which features of a home are associated with higher sale prices. Data were collected from a random sample of 580 houses sold in the last year. The sale price was recorded for each house, along with eight potentially important factors: the area of the lot in square feet, the neighborhood, the overall quality of the home, the overall condition of the home, the total square footage of the basement, the square footage of the first floor, the kitchen quality, and the garage finish. We explore each of these variables, fit a rough model, determine which factors are most important, and conclude with a fitted model. We then justify our model choice, discuss our findings, and note the limitations of our study.
To understand the relationship between the features of a home and its sale price, we begin by exploring the data visually. Lot area, overall quality, overall condition, total basement square footage, and first floor square footage are all numeric variables, meaning that each value is a number. Lot area, basement area, and first floor area are measured in square feet, while quality and condition are scores from 0 to 10. Kitchen quality, garage finish, and neighborhood are categorical variables, meaning that each value is one of a fixed set of groups. We first examine each home feature graphically, then examine the sale price, and finally plot the relationship between each home feature and the sale price.
Figure 1.1 visualizes lot area, measured in square feet, via a
frequency polygon. This graph shows that the data are roughly normally
distributed, centered around a lot area of approximately 10,000 square
feet, with several high outliers. Figure 1.2 is a frequency polygon of
the total square footage of basement space in each home. This second
graph shows that the data are positively skewed, with the majority of
basements under 2,000 square feet. Figure 1.2 also suggests that the
average basement in the sample was around 1,500 square feet. Figure 1.3
is a frequency polygon of the total square footage of the first floor of
each home. This third graph shows that the data are slightly positively
skewed, with the majority of the data below 1,700 square feet. Figure
1.3 additionally suggests that the average first floor in the sample was
around 1,400 square feet. Figure 1.4 is a bar graph of kitchen quality,
which is separated into the categories excellent (Ex), fair (Fa), good
(Gd), poor (Po), and typical (TA). This fourth graph shows that the
largest group of homes sold in the past year had “good” kitchens, at
over 250 homes. The second largest group had “typical” kitchens, at
slightly over 200 homes, followed by “excellent” kitchens, at slightly
under 100 homes. The smallest group had “fair” kitchens, at eight homes,
and no homes in the sample had “poor” kitchens. Figure 1.5 is a bar
graph of garage finish, which is separated into the categories finished
with walls and foundation (Fin), partially finished (RFn), and
unfinished (Unf). This fifth graph shows that the largest group of homes
had garages that were finished with walls and foundation, at over 200
homes. The second largest group had partially finished garages, at
around 190 homes, and the smallest group had unfinished garages, at
around 175 homes. The differences between garage finishes were
relatively small, with the largest and smallest groups differing by
fewer than 50 homes out of a sample of 580. Figure 1.6 is a bar graph of
the neighborhood each home is in, across 13 different neighborhoods. By
far the most homes were in the StoneBr neighborhood, with over 110
homes. This was followed by NAmes, with around 74 homes, and then
NridgHt, with around 65 homes. The fewest homes were in the SawyerW
neighborhood, with fewer than 25 homes. The nine other neighborhoods
each contained between 25 and 40 homes from the sample. Figure 1.7 is a
histogram of the overall condition score of each home, on a scale from 0
to 10. This seventh graph shows that the data are roughly normally
distributed, with an average condition score of 5. The lowest score
given was a 2, and the highest score given was a 9.
Figure 1.8 is a bar plot of the overall quality score of each home, on a
scale from 0 to 10. This graph shows that most homes had a quality score
between 6 and 8, with an average score of 7. The distribution of scores
is roughly normal, with no apparent outliers. Figure 1.9 is a frequency
polygon of the sale price of the homes in the sample, measured in
thousands of dollars. This graph shows that the sale prices were
positively skewed, with the majority of homes selling for under
$300,000. There appear to be some high outliers in the data, reaching up
to just under $800,000. It was heavily recommended that we use the
natural log of sale price in our study, so we tested this in Figure
1.10. As Figure 1.10 demonstrates, taking the natural log of the sale
price makes the distribution approximately normal, and thus we will use
log(SalePrice) for the rest of the report.
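The effect of the log transformation on a right-skewed variable can be illustrated with a minimal sketch. The prices below are synthetic draws from a log-normal distribution, not the report's data, and the `skewness` helper and its parameters are assumptions made only for this illustration.

```python
import math
import random


def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


random.seed(0)
# Synthetic right-skewed prices in thousands of dollars (NOT the report's data).
prices = [random.lognormvariate(5.2, 0.4) for _ in range(5000)]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)      # clearly positive for right-skewed data
skew_log = skewness(log_prices)  # close to zero after the transformation
print(f"skewness of price:      {skew_raw:.2f}")
print(f"skewness of log(price): {skew_log:.2f}")
```

A skewness near zero after the transformation is the numeric counterpart of the more symmetric shape seen when the logged prices are plotted.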
Now that we have examined each variable separately, we explore the relationship between each home feature and the response variable of interest, sale price. How we visualize each relationship depends on the type of variable. Log sale price is a numeric variable, so we use scatter plots to explore its relationships with the other numeric variables and box plots to explore its relationships with the categorical variables.
Figure 1.11 is a scatter plot with lot area in square feet on the x-axis and log sale price in thousands on the y-axis. This graph suggests a weak positive relationship between lot area and log sale price, although it is difficult to tell whether the relationship is linear or whether the cloud of points suggests a different kind of relationship. The graph also shows multiple outliers, the most extreme being a home with a lot area of over 70,000 square feet. Despite its large lot, this home did not have the highest log sale price, and the point does not appear to strongly influence the general trend of the data. We therefore note this outlier, along with the homes with lot areas over 30,000 square feet and those above roughly 6,000 log dollars, but keep them in the sample. Figure 1.12 is a scatter plot with total basement square footage on the x-axis and log sale price in thousands on the y-axis. This graph suggests a moderate positive linear relationship between total basement square footage and log sale price. There are a couple of outliers, one with a total basement square footage of over 3,000. However, these points do not influence the slope of the general trend line, and therefore we keep them in the sample. Figure 1.13 is a scatter plot with first floor square footage on the x-axis and log sale price in thousands on the y-axis. This graph shows a moderate positive linear relationship between first floor square footage and log sale price. Several outliers appear, particularly seven points with around 2,250 or more square feet. However, these points do not influence the overall slope of the trend, and therefore we keep them in the sample. Figure 1.14 is a side-by-side box plot with neighborhood on the x-axis and log sale price in thousands on the y-axis.
This graph shows that the NridgHt neighborhood has the highest median log sale price of all the neighborhoods, and that sale price differs between neighborhoods. The lowest median log sale price is in the Sawyer neighborhood, although this may be influenced by a low outlier. The second lowest median log sale price is in the OldTown neighborhood, which is of particular note because that neighborhood has several high outliers that may pull its median higher. There are multiple outliers in this data, particularly in the CollgCr, Gilbert, NAmes, NoRidge, NWAmes, OldTown, Sawyer, SawyerW, Somerst, and StoneBr neighborhoods. However, to keep a comprehensive understanding of this sample, we keep these outliers in the sample. Figure 1.15 is a side-by-side box plot with kitchen quality on the x-axis and log sale price in thousands on the y-axis. This graph suggests that median log sale price differs by kitchen quality. The highest median occurred in homes with excellent kitchens, followed by good, then typical, then fair kitchens. There are several outliers in each category of kitchen quality. The highest outlier is in the excellent category, with a price over 6,500 log dollars, and the typical category also has a low outlier below 4,500 log dollars. To keep a comprehensive understanding of the sample and of which home features influence sale price, we keep these outliers in the data while noting their presence. Figure 1.16 is a side-by-side box plot with garage finish on the x-axis and log sale price in thousands of dollars on the y-axis. This graph shows that median log sale price differs between garage finishes. The highest median was for finished garages, followed by partially finished garages, with unfinished garages the lowest.
Each category had outliers: several homes in the finished category had sale prices over 6,500 log dollars, and one home in the unfinished category was a low outlier at under 4,500 log dollars. However, to keep a comprehensive understanding of the sample, we note these outliers and keep them in the sample. Figure 1.17 is a scatter plot with overall condition on the x-axis and log sale price in thousands on the y-axis. This graph suggests a moderate positive linear relationship between overall condition score and the log sale price of a home. There are several outliers, in particular one home with an overall condition score of 2, as well as two homes with a log sale price over 6,500 log dollars. However, these outliers do not influence the slope of the general trend line, and thus we keep them in the sample. Figure 1.18 is a scatter plot with overall quality on the x-axis and log sale price in thousands on the y-axis. This graph suggests a strong positive linear relationship between overall quality score and the log sale price of a home. There appear to be a couple of outliers, but none that are drastically far from the general trend line or influential to the slope, and thus we keep them in the sample.
In order to make a model, we first must check for multicollinearity in case any interaction terms are necessary for our numeric variables. To do this, we created a correlation plot to test the correlation between each of the potential numeric explanatory home feature variables provided.
This correlation plot shows that total basement square footage and first floor square footage have a correlation of 0.88, which we take as an indication that an interaction term between the two is required in the model. The remaining pairs of variables have correlations low enough that no interaction term is necessary, as multicollinearity is not an issue for them.
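One common way to gauge how much a correlation of this size matters is the variance inflation factor (VIF). This was not part of the analysis in this report; the sketch below is only an illustration, and the `pairwise_vif` helper is a hypothetical name. It uses the two-predictor form, 1 / (1 - r^2), which is a lower bound on the VIF in a model with more predictors.

```python
def pairwise_vif(r):
    """VIF for a predictor when its only source of collinearity is one
    other predictor with which it has Pearson correlation r."""
    return 1.0 / (1.0 - r ** 2)


# Correlation between basement and first floor square footage, taken
# from the correlation plot described in the text.
r_bsmt_first_floor = 0.88
vif = pairwise_vif(r_bsmt_first_floor)
print(f"pairwise VIF at r = {r_bsmt_first_floor}: {vif:.2f}")
```

Under this simplification, the sampling variance of each of the two coefficients is inflated by a factor of a little over four relative to uncorrelated predictors, which is consistent with the larger standard error seen for first floor square footage when basement square footage is also in the model.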
To determine which model best explains which home features may be used to predict a home’s sale price, we explored several candidate models. We were recommended to use the natural log of sale price as the response variable, rather than sale price alone. To support this recommendation, we examined the distributions graphically to confirm that the natural log was a better fit for the data. We then constructed a rough model, which includes all eight potential explanatory home feature variables provided to us, as well as the interaction term suggested by the correlation plot. We determined how well that model fit the data and assessed the significance of each coefficient in the model. We then created a second model using only the significant coefficients, and compared its fit to that of the first model. Finally, we created a third model that included the significant terms as well as the interaction term. We compared the adjusted R-squared of all three models, and chose the model that best fit the data.
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 4.0065974 | 0.1182244 | 33.8897729 | 0.0000000 |
| X1stFlrSF | 0.0001817 | 0.0000570 | 3.1876219 | 0.0015151 |
| TotalBsmtSF | 0.0000849 | 0.0000767 | 1.1075593 | 0.2685311 |
| LotArea | 0.0000113 | 0.0000012 | 9.3291208 | 0.0000000 |
| OverallQual | 0.1230046 | 0.0089007 | 13.8196704 | 0.0000000 |
| OverallCond | 0.0402567 | 0.0083426 | 4.8254373 | 0.0000018 |
| KitchenQualFa | -0.2208623 | 0.0668563 | -3.3035378 | 0.0010161 |
| KitchenQualGd | -0.1092967 | 0.0254416 | -4.2959889 | 0.0000205 |
| KitchenQualTA | -0.1829523 | 0.0319092 | -5.7335351 | 0.0000000 |
| GarageFinishRFn | -0.0222192 | 0.0181420 | -1.2247381 | 0.2211926 |
| GarageFinishUnf | -0.0936952 | 0.0219749 | -4.2637289 | 0.0000236 |
| NeighborhoodCrawfor | 0.0472387 | 0.0415631 | 1.1365526 | 0.2562150 |
| NeighborhoodGilbert | 0.0219421 | 0.0376339 | 0.5830421 | 0.5601014 |
| NeighborhoodNAmes | -0.1034226 | 0.0352766 | -2.9317636 | 0.0035091 |
| NeighborhoodNoRidge | 0.1780434 | 0.0385158 | 4.6226122 | 0.0000047 |
| NeighborhoodNridgHt | 0.0467461 | 0.0343166 | 1.3622035 | 0.1736853 |
| NeighborhoodNWAmes | -0.0707180 | 0.0394627 | -1.7920225 | 0.0736730 |
| NeighborhoodOldTown | -0.1371880 | 0.0455736 | -3.0102528 | 0.0027288 |
| NeighborhoodSawyer | -0.0809795 | 0.0490343 | -1.6514858 | 0.0992040 |
| NeighborhoodSawyerW | 0.0117050 | 0.0404899 | 0.2890838 | 0.7726250 |
| NeighborhoodSomerst | 0.0387328 | 0.0370063 | 1.0466553 | 0.2957134 |
| NeighborhoodStoneBr | -0.0605052 | 0.0312187 | -1.9381067 | 0.0531161 |
| NeighborhoodTimber | -0.0183894 | 0.0410968 | -0.4474644 | 0.6547140 |
| X1stFlrSF:TotalBsmtSF | 0.0000000 | 0.0000000 | -0.3062620 | 0.7595198 |
Model 1 is modeled as: \[log(SalesPrice) = \beta_0 + \beta_1X1stFlrSF + \beta_2TotalBsmtSF + \beta_3LotArea + \beta_4OverallQuality + \beta_5OverallCondition + \beta_6FairKitchen \] \[+ \beta_7GoodKitchen + \beta_8TypicalKitchen + \beta_9PartiallyFinishedGarage + \beta_{10}UnfinishedGarage + \beta_{11}NeighborhoodCrawfor + \beta_{12}NeighborhoodGilbert \] \[+ \beta_{13}NeighborhoodNAmes + \beta_{14}NeighborhoodNoRidge + \beta_{15}NeighborhoodNridgHt + \beta_{16}NeighborhoodNWAmes + \beta_{17}NeighborhoodOldTown + \beta_{18}NeighborhoodSawyer\] \[+ \beta_{19}NeighborhoodSawyerW + \beta_{20}NeighborhoodSomerst + \beta_{21}NeighborhoodStoneBr + \beta_{22}NeighborhoodTimber + \beta_{23}TotalBsmtSF:X1stFlrSF + error \]
Based on the results of Model 1, it fits the data well, with an adjusted R-squared of 0.8153. Therefore, 81.53% of the variance in log sales price is accounted for by Model 1. However, this model contains many betas, and not all were found to be significant, as seen in the table. For Model 2, we removed any variables from the model that did not have a significant p-value at the 0.05 level, and tested how well that fit the data.
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 3.9818988 | 0.0884339 | 45.026820 | 0.0000000 |
| X1stFlrSF | 0.0002071 | 0.0000219 | 9.456880 | 0.0000000 |
| LotArea | 0.0000111 | 0.0000012 | 9.153315 | 0.0000000 |
| OverallQual | 0.1377530 | 0.0084208 | 16.358719 | 0.0000000 |
| OverallCond | 0.0353805 | 0.0077231 | 4.581111 | 0.0000057 |
| KitchenQualFa | -0.2190373 | 0.0667551 | -3.281208 | 0.0010970 |
| KitchenQualGd | -0.1220144 | 0.0239379 | -5.097127 | 0.0000005 |
| KitchenQualTA | -0.2196710 | 0.0302542 | -7.260834 | 0.0000000 |
| GarageFinish == “Unf”TRUE | -0.0841010 | 0.0186975 | -4.497980 | 0.0000083 |
| Neighborhood == “NAmes”TRUE | -0.0741188 | 0.0221363 | -3.348288 | 0.0008670 |
| Neighborhood == “NoRidge”TRUE | 0.1744893 | 0.0312155 | 5.589822 | 0.0000000 |
| Neighborhood == “OldTown”TRUE | -0.1197370 | 0.0355614 | -3.367051 | 0.0008112 |
Model 2 is modeled as: \[log(SalesPrice) = \beta_0 + \beta_1X1stFlrSF + \beta_2LotArea + \beta_3OverallQuality + \beta_4OverallCondition + \beta_5FairKitchen + \beta_6GoodKitchen \] \[ + \beta_7TypicalKitchen + \beta_8UnfinishedGarage + \beta_9NeighborhoodNAmes + \beta_{10}NeighborhoodNoRidge + \beta_{11}NeighborhoodOldTown + error \]
Based on the results of Model 2, it fits the data well, with an adjusted R-squared of 0.8078. Therefore, 80.78% of the variance in log sales price is accounted for by Model 2. This model contains 12 fewer betas than Model 1, and has a smaller adjusted R squared than Model 1. Due to this, we do not believe that this is the best model for the sample. In order to further explore the sample, we constructed a model that uses all of the significant variables from model 1, confirmed in significance in model 2, and included only the interaction term between first floor square footage and basement square footage. This interaction term can be supported by the correlation between the two variables being 0.88, which is approaching the 0.9 correlation that encourages the use of only one of the two variables or the interaction term alone.
Model 3: \[log(SalesPrice) = \beta_0 + \beta_1X1stFlrSF + \beta_2LotArea + \beta_3OverallQuality + \beta_4OverallCondition + \beta_5FairKitchen + \beta_6GoodKitchen \] \[ + \beta_7TypicalKitchen + \beta_8UnfinishedGarage + \beta_9NeighborhoodNAmes + \beta_{10}NeighborhoodNoRidge + \beta_{11}NeighborhoodOldTown + \beta_{12}(X1stFlrSF:TotalBsmtSF) + error \]
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 4.0361997 | 0.0935430 | 43.148062 | 0.0000000 |
| X1stFlrSF | 0.0001235 | 0.0000524 | 2.355599 | 0.0188325 |
| LotArea | 0.0000111 | 0.0000012 | 9.201194 | 0.0000000 |
| OverallQual | 0.1350594 | 0.0085445 | 15.806538 | 0.0000000 |
| OverallCond | 0.0369250 | 0.0077592 | 4.758899 | 0.0000025 |
| KitchenQualFa | -0.2159523 | 0.0666566 | -3.239774 | 0.0012663 |
| KitchenQualGd | -0.1158923 | 0.0241478 | -4.799299 | 0.0000020 |
| KitchenQualTA | -0.2140710 | 0.0303673 | -7.049383 | 0.0000000 |
| GarageFinish == “Unf”TRUE | -0.0848069 | 0.0186678 | -4.542965 | 0.0000068 |
| Neighborhood == “NAmes”TRUE | -0.0718774 | 0.0221329 | -3.247537 | 0.0012329 |
| Neighborhood == “NoRidge”TRUE | 0.1722144 | 0.0311856 | 5.522242 | 0.0000001 |
| Neighborhood == “OldTown”TRUE | -0.1160579 | 0.0355585 | -3.263861 | 0.0011652 |
| X1stFlrSF:TotalBsmtSF | 0.0000000 | 0.0000000 | 1.754171 | 0.0799413 |
Based on its results, Model 3 fits the data well, with an adjusted R-squared of 0.8085. Therefore, 80.85% of the variance in log sales price is accounted for by Model 3. This model contains 11 fewer betas than Model 1 (13 compared with 24, counting the intercept), and has a smaller adjusted R-squared than Model 1. Model 3 also contains the interaction term, meaning it has one more beta than Model 2, and has only a slightly higher adjusted R-squared.
After comparing Model 1, Model 2, and Model 3 through their adjusted R-squared values, it is apparent that Model 1 is the best fit for the data, as it has the largest adjusted R-squared at 0.8153. We use the adjusted R-squared to compare models with different numbers of betas, as its calculation includes a penalty for the number of predictors. Therefore, although it contains many betas, Model 1 still has the highest adjusted R-squared of the three models, and therefore is the best fit for the data.
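The penalty built into the adjusted R-squared can be made concrete with a short sketch. It assumes that all 580 houses were used to fit each model and counts predictors as the number of betas excluding the intercept (23, 11, and 12). The unadjusted R-squared values it prints are back-calculated from the reported adjusted values under that assumption; they are not figures reported by the analysis.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (no intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def r2_from_adjusted(adj_r2, n, p):
    """Invert the adjustment to recover the implied unadjusted R-squared."""
    return 1.0 - (1.0 - adj_r2) * (n - p - 1) / (n - 1)


N_HOUSES = 580  # assumption: no rows were dropped when fitting

# (reported adjusted R-squared, number of predictors excluding the intercept)
models = {
    "Model 1": (0.8153, 23),
    "Model 2": (0.8078, 11),
    "Model 3": (0.8085, 12),
}

implied_r2 = {
    name: r2_from_adjusted(adj, N_HOUSES, p) for name, (adj, p) in models.items()
}
for name, (adj, p) in models.items():
    print(f"{name}: adjusted = {adj:.4f}, implied unadjusted = {implied_r2[name]:.4f}")
```

The gap between the two values is largest for Model 1 because it carries the most predictors, yet its adjusted value remains the highest of the three.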
Our selected model, Model 1, may be expressed through the population model: \[log(SalesPrice) = \beta_0 + \beta_1X1stFlrSF + \beta_2TotalBsmtSF + \beta_3LotArea + \beta_4OverallQuality + \beta_5OverallCondition + \beta_6FairKitchen \] \[+ \beta_7GoodKitchen + \beta_8TypicalKitchen + \beta_9PartiallyFinishedGarage + \beta_{10}UnfinishedGarage + \beta_{11}NeighborhoodCrawfor + \beta_{12}NeighborhoodGilbert \] \[+ \beta_{13}NeighborhoodNAmes + \beta_{14}NeighborhoodNoRidge + \beta_{15}NeighborhoodNridgHt + \beta_{16}NeighborhoodNWAmes + \beta_{17}NeighborhoodOldTown + \beta_{18}NeighborhoodSawyer\] \[+ \beta_{19}NeighborhoodSawyerW + \beta_{20}NeighborhoodSomerst + \beta_{21}NeighborhoodStoneBr + \beta_{22}NeighborhoodTimber + \beta_{23}TotalBsmtSF:X1stFlrSF + error \]
In order to use this model with our sample data, we must check the necessary conditions. A log-linear regression model requires that the conditions of shape, independence, randomness, zero mean, constant variance, and normality be met. Shape is checked for the numeric variables: Figure 1.11, Figure 1.12, and Figure 1.13 show positive relationships between log sale price and lot area, basement square footage, and first floor square footage respectively, with the lot area relationship being the weakest and least clearly linear. Each graph had several outliers, but no influential outliers, so they are noted but not removed. The zero mean and constant variance conditions may both be checked with a residual plot.
Figure 2.1 is a residual plot, with the predicted values from Model 1 on the x-axis and the residuals on the y-axis. Figure 2.1 shows that the zero mean condition is met, as the red line indicates that the average of the residuals is zero. It also shows that the constant variance condition is met, as the residuals stay within roughly 0.5 of zero on either side across the range of predicted values. There are three outliers below y = -0.5, around x = 4.6, x = 5.26, and x = 5.65, and one outlier slightly above y = 0.5 around x = 5.53.
Figure 2.2 is a Q-Q plot, which compares the quantiles of the residuals
with the quantiles expected under a normal distribution. If the
residuals are normally distributed, the points should follow the line on
the graph. As Figure 2.2 indicates, the points follow the provided line,
and thus the normality condition is met for this model.
In order to check the randomness condition, the response Y variable must be a random variable. The response variable in this model is log sales price, which is a random variable. In this context, random means that we know the possible outcomes, but do not know for sure which outcome will occur. For this model, we know the possible outcomes for the log sale price of a home, but we do not know with certainty which outcome will occur for each house.
In order to check the independence condition, the outcomes in the data set must be independent, so the outcomes should not rely upon each other and order should not matter. In this model, the log sale prices of homes are assumed to be independent, as the order in which one home was sold relative to another does not influence the sale price of that home.
After demonstrating that the conditions for multiple linear regression are met, we may fit Model 1 to our sample. Thus, the fitted Model 1 for our sample is: \[\widehat{log(SalesPrice)} = 4.007 + 0.00018X1stFlrSF + 0.000085TotalBsmtSF + 0.000011LotArea + 0.123OverallQuality + 0.0403OverallCondition - 0.221FairKitchen \] \[ - 0.109GoodKitchen - 0.183TypicalKitchen - 0.022PartiallyFinishedGarage - 0.0937UnfinishedGarage + 0.047NeighborhoodCrawfor + 0.022NeighborhoodGilbert \] \[ - 0.103NeighborhoodNAmes + 0.178NeighborhoodNoRidge + 0.047NeighborhoodNridgHt - 0.0707NeighborhoodNWAmes - 0.137NeighborhoodOldTown - 0.0810NeighborhoodSawyer\] \[+ 0.0117NeighborhoodSawyerW + 0.0387NeighborhoodSomerst - 0.0605NeighborhoodStoneBr - 0.0184NeighborhoodTimber - 0.00000001(TotalBsmtSF:X1stFlrSF) \]
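As a minimal sketch of how the fitted model produces a prediction, the code below applies the estimates from the Model 1 coefficient table to one hypothetical house. The house's characteristics are invented for illustration. The interaction estimate is rounded to zero in the table, so the value used here is the rounded one from the fitted equation and should be treated as approximate. The response is assumed to be the natural log of sale price in thousands of dollars.

```python
import math

# Estimates copied from the Model 1 coefficient table. Dummy variables
# that equal 0 for this house are left out because they add nothing.
coef = {
    "Intercept": 4.0065974,
    "X1stFlrSF": 0.0001817,
    "TotalBsmtSF": 0.0000849,
    "LotArea": 0.0000113,
    "OverallQual": 0.1230046,
    "OverallCond": 0.0402567,
    "KitchenQualGd": -0.1092967,
    "NeighborhoodNAmes": -0.1034226,
    # Rounded to 0.0000000 in the table; -1e-8 is an approximate value.
    "X1stFlrSF:TotalBsmtSF": -0.00000001,
}

# Hypothetical house (assumption, not a row from the sample): good
# kitchen, finished garage (baseline), located in NAmes.
house = {
    "X1stFlrSF": 1200,
    "TotalBsmtSF": 1000,
    "LotArea": 10000,
    "OverallQual": 7,
    "OverallCond": 5,
    "KitchenQualGd": 1,
    "NeighborhoodNAmes": 1,
}
house["X1stFlrSF:TotalBsmtSF"] = house["X1stFlrSF"] * house["TotalBsmtSF"]

# Linear predictor on the log scale, then back-transform to thousands.
log_price = coef["Intercept"] + sum(coef[k] * v for k, v in house.items())
price_thousands = math.exp(log_price)

print(f"predicted log sale price: {log_price:.4f}")
print(f"back-transformed price:   ${price_thousands:,.1f} thousand")
```

Exponentiating the predicted log price gives an estimate of the typical (median) price for such a house rather than its mean, which is a standard caveat when back-transforming from a log-scale model.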
The adjusted R-squared value of Model 1 is 0.8153. Therefore, 81.53% of the variance in log sales price is accounted for by Model 1, which is a good fit, as it explains over three quarters of the variance.
The coefficient estimates are subject to sampling variability, which is why we construct 95% confidence intervals. Below, we construct a 95% confidence interval for each of the 24 coefficients in Model 1 (the intercept and 23 slopes). A confidence interval for the i-th coefficient follows the equation: \[ \hat{\beta}_i \pm t^* \cdot SE_{\hat{\beta}_i} \]
Assuming all 580 observations were used to fit the 24 coefficients, the model has 556 residual degrees of freedom, for which the critical t value for a 95% confidence interval is approximately 1.96. The following table contains the 95% confidence interval for each coefficient in our model, giving the lower bound and then the upper bound. For each coefficient, we are 95% confident that the true population coefficient lies between the lower and upper bounds provided.
| Term | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 3.7748819 | 4.2383130 |
| X1stFlrSF | 0.0000700 | 0.0002933 |
| TotalBsmtSF | -0.0000654 | 0.0002352 |
| LotArea | 0.0000089 | 0.0000136 |
| OverallQual | 0.1055596 | 0.1404496 |
| OverallCond | 0.0239055 | 0.0566080 |
| KitchenQualFa | -0.3518982 | -0.0898264 |
| KitchenQualGd | -0.1591613 | -0.0594321 |
| KitchenQualTA | -0.2454931 | -0.1204115 |
| GarageFinishRFn | -0.0577769 | 0.0133385 |
| GarageFinishUnf | -0.1367652 | -0.0506251 |
| NeighborhoodCrawfor | -0.0342236 | 0.1287009 |
| NeighborhoodGilbert | -0.0518189 | 0.0957031 |
| NeighborhoodNAmes | -0.1725634 | -0.0342818 |
| NeighborhoodNoRidge | 0.1025539 | 0.2535329 |
| NeighborhoodNridgHt | -0.0205131 | 0.1140054 |
| NeighborhoodNWAmes | -0.1480634 | 0.0066274 |
| NeighborhoodOldTown | -0.2265106 | -0.0478654 |
| NeighborhoodSawyer | -0.1770851 | 0.0151260 |
| NeighborhoodSawyerW | -0.0676538 | 0.0910637 |
| NeighborhoodSomerst | -0.0337982 | 0.1112638 |
| NeighborhoodStoneBr | -0.1216927 | 0.0006824 |
| NeighborhoodTimber | -0.0989377 | 0.0621589 |
| X1stFlrSF:TotalBsmtSF | -0.0000001 | 0.0000001 |
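The intervals in this table can be reproduced from the estimates and standard errors in the Model 1 coefficient table. The sketch below does this for three rows, using 1.96 as the critical value; the `confidence_interval` helper is a hypothetical name introduced for illustration.

```python
T_CRIT = 1.96  # critical value for a 95% interval with large degrees of freedom

# term: (estimate, standard error, reported lower bound, reported upper bound)
rows = {
    "(Intercept)": (4.0065974, 0.1182244, 3.7748819, 4.2383130),
    "OverallQual": (0.1230046, 0.0089007, 0.1055596, 0.1404496),
    "KitchenQualTA": (-0.1829523, 0.0319092, -0.2454931, -0.1204115),
}


def confidence_interval(estimate, std_error, t_crit=T_CRIT):
    """Return (lower, upper) as estimate -/+ t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


intervals = {
    term: confidence_interval(est, se) for term, (est, se, _, _) in rows.items()
}
for term, (lower, upper) in intervals.items():
    print(f"{term:<14} [{lower:.7f}, {upper:.7f}]")
```

The computed bounds agree with the tabled bounds to about five decimal places. An interval that excludes zero, such as the one for OverallQual, corresponds to a coefficient that is significant at the 0.05 level.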
As you were most interested in how a kitchen’s condition may influence the average sale price of a home, we include this section highlighting kitchens. Of the categorical variables in the model, kitchen quality was the only one for which every level was statistically significant. Four kitchen qualities appeared in the sample: excellent, good, typical, and fair. In our model, excellent is the baseline, so it is absorbed into the intercept along with the baseline levels of the other two categorical variables. Thus, the coefficients for good, typical, and fair kitchens are each in comparison to an excellent kitchen. Because the response is log sale price, exponentiating a coefficient gives the multiplicative change in predicted sale price, holding the other variables constant. We predict that a home with a good kitchen will on average sell for about 10.35% less than a comparable home with an excellent kitchen, a home with a typical kitchen for about 16.72% less, and a home with a fair kitchen for about 19.82% less. Comparing the three, a fair kitchen lowers the predicted sale price about 3.10 percentage points more than a typical kitchen does, and a typical kitchen lowers it about 6.37 percentage points more than a good kitchen does, each relative to an excellent kitchen. There were no homes in the sample with a poor kitchen. Therefore, we may conclude that any kitchen upgrade is associated with a higher average home sale price to some extent, although the size of the increase varies from one quality level to the next.
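The percentage comparisons for kitchens come from back-transforming the log-scale coefficients. A minimal sketch of that arithmetic, using the kitchen estimates from the Model 1 coefficient table, is shown here; the `percent_change` helper is a hypothetical name introduced for illustration.

```python
import math

# Kitchen quality estimates from the Model 1 coefficient table, each
# relative to the baseline level (excellent).
kitchen_coefs = {
    "Gd": -0.1092967,
    "TA": -0.1829523,
    "Fa": -0.2208623,
}


def percent_change(beta):
    """Percent change in predicted sale price for a log-scale coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)


pct_change = {level: percent_change(b) for level, b in kitchen_coefs.items()}
for level, pct in pct_change.items():
    print(f"{level} vs Ex: {pct:+.2f}%")

# Differences between levels, in percentage points relative to excellent.
fair_vs_typical = pct_change["TA"] - pct_change["Fa"]
typical_vs_good = pct_change["Gd"] - pct_change["TA"]
print(f"Fa lowers price {fair_vs_typical:.2f} points more than TA")
print(f"TA lowers price {typical_vs_good:.2f} points more than Gd")
```

Using the exponential rather than reading the coefficient directly as a percentage matters more as coefficients grow: the fair kitchen coefficient of about -0.221 corresponds to a decrease of a little under 20%, not 22%.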
Although this model fits the majority of the data in our sample, there are some limitations. Namely, there were several large outliers in the sale prices of some houses, as well as in the features of some houses. These outliers may have led to estimates that are not entirely accurate for the average home. There are also several variables that we did not investigate that may be relevant to sale price, such as the general housing market. Additionally, this model contains many variables. In the future, it may be beneficial to limit the number of variables in the model, although the fit to the data may be lower. In particular, if one is interested in a specific feature’s influence on a home’s sale price, a more limited model would be encouraged. Overall, all of the conditions were met for this model to work, and the relationships between many of the variables and average log sale price were significant. Of the variables that had multiple levels, at least one level in each variable was found to be significantly linked to the average log sale price. We greatly appreciate your business and hope our model is helpful!