Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
house_data <- read.csv(file="train.csv", header=TRUE, sep=",")
The aim of this project is to create a multiple regression model that predicts the Sale Price of houses in the given data set. This data set has 79 explanatory variables.
This model uses 17 explanatory variables. Some of these variables are used to generate calculated variables. Below is a list of these.
Neighborhood: used to create a new variable NeighborhoodGroup.
YearRemodAdd, YrSold: used to create a new variable SinceRemod, the number of years between the last remodel and the sale. YearRemodAdd equals the construction year if the property was never remodeled.
TotalBsmtSF, X1stFlrSF, X2ndFlrSF, GarageArea, PoolArea: used to generate a new variable called OverallArea.
SaleCondition: used to create a new variable IsAbnormalSale. This is set to 1 when the sale is abnormal (e.g., foreclosures).
Functional: used to create a new variable IsReduced, which flags properties whose functional level is not normal.
KitchenQual: used to create a new variable KitchenQualGroup that groups the kitchen into high, medium, and low quality levels.
FireplaceQu: used to create a new variable ExcellentFireplace that flags properties with excellent fireplaces.
GarageQual: used to create a new variable ExcellentGarage that flags properties with excellent garages.
PoolQC: used to create a new variable ExcellentPool that flags properties with excellent pools.
OverallQual: used to create a new variable OverallQualityHigh that flags properties with high overall quality.
MasVnrArea
CentralAir
We all know that location is one important factor that determines the price of any real estate. So I went ahead and did a linear regression on Sale Price by Neighborhood.
m <- lm(SalePrice ~ Neighborhood, data=house_data)
summary(m)
As you can see in the summary below, there are a lot of neighborhoods: 25 in total. Some of the neighborhood estimators are not significant, but a good number of them are. The adjusted R-squared is 0.538, which means that Neighborhood alone explains about 54% of the variability in sale price in this single-variable regression.
##
## Call:
## lm(formula = SalePrice ~ Neighborhood, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162271 -27552 -5324 19685 419705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194871 13097 14.879 < 2e-16 ***
## NeighborhoodBlueste -57371 40367 -1.421 0.155463
## NeighborhoodBrDale -90377 18809 -4.805 1.71e-06 ***
## NeighborhoodBrkSide -70037 14893 -4.703 2.81e-06 ***
## NeighborhoodClearCr 17695 16603 1.066 0.286721
## NeighborhoodCollgCr 3095 13819 0.224 0.822820
## NeighborhoodCrawfor 15754 15123 1.042 0.297712
## NeighborhoodEdwards -66651 14166 -4.705 2.78e-06 ***
## NeighborhoodGilbert -2016 14437 -0.140 0.888944
## NeighborhoodIDOTRR -94747 15822 -5.988 2.67e-09 ***
## NeighborhoodMeadowV -96294 18522 -5.199 2.29e-07 ***
## NeighborhoodMitchel -38601 15200 -2.540 0.011204 *
## NeighborhoodNAmes -49024 13582 -3.609 0.000318 ***
## NeighborhoodNoRidge 140424 15577 9.015 < 2e-16 ***
## NeighborhoodNPkVill -52176 22260 -2.344 0.019217 *
## NeighborhoodNridgHt 121400 14470 8.390 < 2e-16 ***
## NeighborhoodNWAmes -5821 14542 -0.400 0.689011
## NeighborhoodOldTown -66646 14047 -4.744 2.30e-06 ***
## NeighborhoodSawyer -58078 14523 -3.999 6.69e-05 ***
## NeighborhoodSawyerW -8315 14864 -0.559 0.575974
## NeighborhoodSomerst 30509 14333 2.129 0.033456 *
## NeighborhoodStoneBr 115628 16975 6.812 1.42e-11 ***
## NeighborhoodSWISU -52280 16975 -3.080 0.002111 **
## NeighborhoodTimber 47377 15756 3.007 0.002686 **
## NeighborhoodVeenker 43902 20895 2.101 0.035810 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54000 on 1435 degrees of freedom
## Multiple R-squared: 0.5456, Adjusted R-squared: 0.538
## F-statistic: 71.78 on 24 and 1435 DF, p-value: < 2.2e-16
Based on common knowledge, I know that certain areas are more affordable and others are more expensive. Since I do not have any additional information on which neighborhoods tend to be more affordable and which tend to be more expensive, I used the output of the linear regression above to rank the neighborhoods by the values of their estimators. The base (reference) level in the regression is Blmngtn, so each estimate is the difference in mean sale price relative to Blmngtn.
Neighborhood | Estimate | Group |
---|---|---|
NoRidge | 140424 | Group C |
NridgHt | 121400 | Group C |
StoneBr | 115628 | Group C |
Timber | 47377 | Group C |
Veenker | 43902 | Group C |
Somerst | 30509 | Group C |
ClearCr | 17695 | Group B |
Crawfor | 15754 | Group B |
CollgCr | 3095 | Group B |
Blmngtn | base | Group B |
Gilbert | -2016 | Group B |
NWAmes | -5821 | Group B |
SawyerW | -8315 | Group B |
Mitchel | -38601 | Group A |
NAmes | -49024 | Group A |
NPkVill | -52176 | Group A |
SWISU | -52280 | Group A |
Blueste | -57371 | Group A |
Sawyer | -58078 | Group A |
OldTown | -66646 | Group A |
Edwards | -66651 | Group A |
BrkSide | -70037 | Group A |
BrDale | -90377 | Group A |
IDOTRR | -94747 | Group A |
MeadowV | -96294 | Group A |
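The ranking in the table can be reproduced directly from a fitted model rather than by reading the summary by hand. The following is a minimal sketch on made-up toy data (the data frame `toy` and its values are assumptions for illustration, not the Kaggle data); with the real data, `fit` would be the model `m` fitted above.

```r
# Toy stand-in for house_data: three neighborhoods with made-up prices
toy <- data.frame(
  Neighborhood = rep(c("A", "B", "C"), each = 4),
  SalePrice = c(100, 110, 90, 100, 150, 160, 140, 150, 60, 70, 50, 60) * 1000
)
fit <- lm(SalePrice ~ Neighborhood, data = toy)

# Drop the intercept and strip the "Neighborhood" prefix from the names
est <- coef(fit)[-1]
names(est) <- sub("^Neighborhood", "", names(est))

# The reference level has no coefficient; add it with an estimate of 0
base_level <- levels(factor(toy$Neighborhood))[1]
est <- c(est, setNames(0, base_level))

# Most expensive neighborhoods first
ranked <- sort(est, decreasing = TRUE)
print(ranked)
```

Where to cut the ranked list into groups remains a judgment call; the sketch only produces the ordering.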
I divided the 25 neighborhoods into 3 groups - A, B, and C.
#Neighborhood: NeighborhoodGroup
house_data$NeighborhoodGroup[house_data$Neighborhood %in%
c('NoRidge', 'NridgHt', 'StoneBr', 'Timber', 'Veenker', 'Somerst')] <- "GroupC"
house_data$NeighborhoodGroup[house_data$Neighborhood %in%
c('ClearCr', 'Crawfor', 'CollgCr', 'Blmngtn', 'Gilbert', 'NWAmes', 'SawyerW')] <- "GroupB"
house_data$NeighborhoodGroup[house_data$Neighborhood %in%
c('Mitchel', 'NAmes', 'NPkVill', 'SWISU', 'Blueste', 'Sawyer', 'OldTown', 'Edwards',
'BrkSide', 'BrDale', 'IDOTRR', 'MeadowV')] <- "GroupA"
Running the linear model on NeighborhoodGroup, this is the result:
summary(lm(SalePrice ~ NeighborhoodGroup, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -139755 -29240 -5240 23157 477745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134240 2162 62.08 <2e-16 ***
## NeighborhoodGroupGroupB 62138 3478 17.87 <2e-16 ***
## NeighborhoodGroupGroupC 143016 4107 34.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58220 on 1457 degrees of freedom
## Multiple R-squared: 0.4636, Adjusted R-squared: 0.4629
## F-statistic: 629.7 on 2 and 1457 DF, p-value: < 2.2e-16
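The base group is GroupA, and the estimators for the other two groups are significant. The adjusted R-squared is 0.4629, which is lower than the previous value of 0.5380: collapsing 25 neighborhoods into 3 groups gives up some explanatory power in exchange for a much simpler model with 2 neighborhood coefficients instead of 24.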
YearRemodAdd is the remodel date, which is the same as the construction date if there was no remodeling or addition. YrSold is the year in which the property was sold.
The calculated variable SinceRemod is the number of years between the last remodeling or addition and the sale.
#YearRemodAdd, YrSold: SinceRemod
house_data$SinceRemod <- house_data$YrSold - house_data$YearRemodAdd
As the negative SinceRemod coefficient below shows, there is an inverse relationship between sale price and the number of years since the last remodel (for a house that was never remodeled, this is the number of years between construction and sale).
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137405 -29662 -5083 21787 482313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161617.95 3597.22 44.929 <2e-16 ***
## NeighborhoodGroupGroupB 47342.02 3728.64 12.697 <2e-16 ***
## NeighborhoodGroupGroupC 120745.53 4643.48 26.003 <2e-16 ***
## SinceRemod -806.39 86.01 -9.376 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56560 on 1456 degrees of freedom
## Multiple R-squared: 0.4942, Adjusted R-squared: 0.4931
## F-statistic: 474.1 on 3 and 1456 DF, p-value: < 2.2e-16
Running the model with SinceRemod increases the adjusted R-squared from 0.4629 to 0.4931. All the estimators are significant.
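Since each step of this report adds one term and compares adjusted R-squared values, the comparison can also be collected programmatically. This is a minimal sketch on simulated toy data; the helper name `adj_r2_path` and the toy columns are hypothetical and not part of the original analysis.

```r
# Fit a sequence of nested models, adding one term at a time,
# and return the adjusted R-squared of each fit.
adj_r2_path <- function(response, terms, data) {
  sapply(seq_along(terms), function(k) {
    f <- reformulate(terms[1:k], response = response)
    summary(lm(f, data = data))$adj.r.squared
  })
}

# Simulated data for illustration only
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 2 * toy$x1 + 3 * toy$x2 + rnorm(100)

path <- adj_r2_path("y", c("x1", "x2"), toy)
names(path) <- c("x1", "x1 + x2")
print(round(path, 3))
```

With the housing data, the call would look like `adj_r2_path("SalePrice", c("NeighborhoodGroup", "SinceRemod"), house_data)`. One caveat: if a later term has missing values, `lm` drops those rows, so the values are then computed on slightly different samples.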
The variables TotalBsmtSF, X1stFlrSF, X2ndFlrSF, GarageArea, PoolArea are quantitative variables that measure the area (in square feet) of the basement, first and second floor, garage area, and pool area respectively. A new variable is created called OverallArea, which sums the areas of all these spaces.
#OverallArea: TotalBsmtSF, X1stFlrSF, X2ndFlrSF, GarageArea, PoolArea
house_data$OverallArea <- house_data$TotalBsmtSF + house_data$X1stFlrSF + house_data$X2ndFlrSF + house_data$GarageArea + house_data$PoolArea
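One thing to watch with a sum like this is that `+` returns NA as soon as any component is NA, so a house with one missing area would get no OverallArea at all. The sketch below, on made-up toy rows, contrasts the plain sum with `rowSums(..., na.rm = TRUE)`, which treats a missing component as 0. Treating missing as 0 is an assumption on my part, not something the original analysis states.

```r
area_cols <- c("TotalBsmtSF", "X1stFlrSF", "X2ndFlrSF", "GarageArea", "PoolArea")

# Toy rows (made-up values); the second house has a missing basement area
toy <- data.frame(
  TotalBsmtSF = c(800, NA),
  X1stFlrSF   = c(900, 1000),
  X2ndFlrSF   = c(0, 500),
  GarageArea  = c(400, 300),
  PoolArea    = c(0, 0)
)

# Plain addition propagates the NA
plain_sum <- toy$TotalBsmtSF + toy$X1stFlrSF + toy$X2ndFlrSF +
  toy$GarageArea + toy$PoolArea

# rowSums with na.rm = TRUE counts a missing component as 0
toy$OverallArea <- rowSums(toy[, area_cols], na.rm = TRUE)

print(plain_sum)
print(toy$OverallArea)
```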
Running the model with OverallArea updates the model as follows.
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea,
## data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -539894 -18058 124 15946 297998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21044.527 4344.702 4.844 1.41e-06 ***
## NeighborhoodGroupGroupB 23476.480 2658.836 8.830 < 2e-16 ***
## NeighborhoodGroupGroupC 64145.197 3528.119 18.181 < 2e-16 ***
## SinceRemod -479.797 60.300 -7.957 3.52e-15 ***
## OverallArea 49.733 1.258 39.546 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39280 on 1455 degrees of freedom
## Multiple R-squared: 0.7562, Adjusted R-squared: 0.7555
## F-statistic: 1128 on 4 and 1455 DF, p-value: < 2.2e-16
Including OverallArea in the model increased the adjusted R-squared from 0.4931 to 0.7555. All the estimators are significant.
The variable SaleCondition captures a category level Abnorml, which describes an abnormal sale such as trade, foreclosure, and short sale. A new variable IsAbnormalSale is created that flags this condition.
#IsAbnormalSale: SaleCondition when 'Abnorml'
house_data$IsAbnormalSale[house_data$SaleCondition == "Abnorml"] <- 1
house_data$IsAbnormalSale[house_data$SaleCondition != "Abnorml"] <- 0
Running the model with IsAbnormalSale gives the following result.
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea + IsAbnormalSale, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea +
## IsAbnormalSale, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -540378 -18127 111 15793 297383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21681.537 4336.208 5.000 6.43e-07 ***
## NeighborhoodGroupGroupB 23154.949 2652.713 8.729 < 2e-16 ***
## NeighborhoodGroupGroupC 63991.474 3517.720 18.191 < 2e-16 ***
## SinceRemod -461.732 60.390 -7.646 3.75e-14 ***
## OverallArea 49.721 1.254 39.658 < 2e-16 ***
## IsAbnormalSale -12832.356 4078.789 -3.146 0.00169 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39160 on 1454 degrees of freedom
## Multiple R-squared: 0.7579, Adjusted R-squared: 0.757
## F-statistic: 910.1 on 5 and 1454 DF, p-value: < 2.2e-16
Including IsAbnormalSale in the model increased the adjusted R-squared from 0.7555 to 0.7570. All the estimators are significant.
The variable Functional describes the functional level of the property. The new variable IsReduced flags properties whose functional level is less than normal, which would be expected to reduce the sale price.
#IsReduced: Functional
house_data$IsReduced <- 0
house_data$IsReduced[house_data$Functional %in% c("Min1", "Min2", "Mod", "Maj1", "Maj2", "Sev", "Sal")] <- 1
Running the model with IsReduced gives the following result.
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea + IsAbnormalSale + IsReduced, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea +
## IsAbnormalSale + IsReduced, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -546997 -18132 3 15161 295505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22320.407 4314.599 5.173 2.62e-07 ***
## NeighborhoodGroupGroupB 21661.005 2661.970 8.137 8.60e-16 ***
## NeighborhoodGroupGroupC 61862.044 3534.942 17.500 < 2e-16 ***
## SinceRemod -458.138 60.058 -7.628 4.28e-14 ***
## OverallArea 50.159 1.251 40.091 < 2e-16 ***
## IsAbnormalSale -12922.799 4055.970 -3.186 0.00147 **
## IsReduced -17153.569 4106.567 -4.177 3.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38940 on 1453 degrees of freedom
## Multiple R-squared: 0.7607, Adjusted R-squared: 0.7597
## F-statistic: 769.9 on 6 and 1453 DF, p-value: < 2.2e-16
Including IsReduced increased the adjusted R-squared from 0.7570 to 0.7597. All estimators are significant.
The variable KitchenQual has 5 different category levels. Running the model without modifying this variable yielded some estimators that were not significant. The variable KitchenQualGroup collapses the 5 levels into 3 groups of high, medium, and low.
#KitchenQual - KitchenQualGroup
house_data$KitchenQualGroup[house_data$KitchenQual %in% c('Ex')] <- "High"
house_data$KitchenQualGroup[house_data$KitchenQual %in% c('Gd')] <- "Medium"
house_data$KitchenQualGroup[house_data$KitchenQual %in% c('Fa', 'TA', 'Po')] <- "Low"
house_data$KitchenQualGroup <- factor(house_data$KitchenQualGroup)
house_data$KitchenQualGroup <- relevel(house_data$KitchenQualGroup, ref="Low")
Running the model with KitchenQualGroup gives the following result.
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea + IsAbnormalSale + IsReduced + KitchenQualGroup, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea +
## IsAbnormalSale + IsReduced + KitchenQualGroup, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -543188 -17467 -531 14830 272878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27633.508 4242.041 6.514 1.00e-10 ***
## NeighborhoodGroupGroupB 23157.806 2590.991 8.938 < 2e-16 ***
## NeighborhoodGroupGroupC 54424.616 3473.461 15.669 < 2e-16 ***
## SinceRemod -297.760 64.070 -4.647 3.67e-06 ***
## OverallArea 44.886 1.255 35.763 < 2e-16 ***
## IsAbnormalSale -15207.386 3841.335 -3.959 7.89e-05 ***
## IsReduced -14857.451 3894.247 -3.815 0.000142 ***
## KitchenQualGroupHigh 62865.836 4950.620 12.699 < 2e-16 ***
## KitchenQualGroupMedium 9208.140 2793.350 3.296 0.001003 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36820 on 1451 degrees of freedom
## Multiple R-squared: 0.7863, Adjusted R-squared: 0.7852
## F-statistic: 667.6 on 8 and 1451 DF, p-value: < 2.2e-16
Including KitchenQualGroup in the model increased the adjusted R-squared from 0.7597 to 0.7852. All estimators are significant.
The variable FireplaceQu describes the quality of the fireplace. In earlier trial and error with categorical variables that have several quality levels, I found that some of the levels did not have significant estimators. I therefore decided to focus on properties with excellent fireplaces; in the plots I reviewed, these properties tend to have higher sale prices.
The variable ExcellentFireplace flags those with excellent fireplaces.
#FireplaceQu
house_data$ExcellentFireplace<- 0
house_data$ExcellentFireplace[house_data$FireplaceQu %in% c('Ex')] <- 1
Running the model with ExcellentFireplace gives the following result.
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea + IsAbnormalSale + IsReduced +
KitchenQualGroup + ExcellentFireplace, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea +
## IsAbnormalSale + IsReduced + KitchenQualGroup + ExcellentFireplace,
## data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -534852 -18065 -477 14829 255238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28684.913 4231.666 6.779 1.76e-11 ***
## NeighborhoodGroupGroupB 23657.789 2582.472 9.161 < 2e-16 ***
## NeighborhoodGroupGroupC 54267.363 3457.779 15.694 < 2e-16 ***
## SinceRemod -296.534 63.777 -4.650 3.63e-06 ***
## OverallArea 44.430 1.255 35.399 < 2e-16 ***
## IsAbnormalSale -15729.775 3826.195 -4.111 4.16e-05 ***
## IsReduced -14474.575 3877.700 -3.733 0.000197 ***
## KitchenQualGroupHigh 59699.802 4998.039 11.945 < 2e-16 ***
## KitchenQualGroupMedium 8995.305 2781.104 3.234 0.001246 **
## ExcellentFireplace 29985.922 7901.663 3.795 0.000154 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36650 on 1450 degrees of freedom
## Multiple R-squared: 0.7885, Adjusted R-squared: 0.7871
## F-statistic: 600.5 on 9 and 1450 DF, p-value: < 2.2e-16
Including ExcellentFireplace increased the adjusted R-squared from 0.7852 to 0.7871. All estimators are significant.
The variable GarageQual describes the quality of the garage. The variable ExcellentGarage flags houses with excellent garages.
#GarageQual
house_data$ExcellentGarage <- 0
house_data$ExcellentGarage[house_data$GarageQual %in% c('Ex')] <- 1
Running the model with ExcellentGarage gives the following result.
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea + IsAbnormalSale + IsReduced +
KitchenQualGroup + ExcellentFireplace + ExcellentGarage, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea +
## IsAbnormalSale + IsReduced + KitchenQualGroup + ExcellentFireplace +
## ExcellentGarage, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -533238 -17954 -405 14884 255884
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28739.913 4219.711 6.811 1.42e-11 ***
## NeighborhoodGroupGroupB 23958.073 2577.043 9.297 < 2e-16 ***
## NeighborhoodGroupGroupC 54738.012 3451.448 15.859 < 2e-16 ***
## SinceRemod -301.849 63.620 -4.745 2.30e-06 ***
## OverallArea 44.371 1.252 35.449 < 2e-16 ***
## IsAbnormalSale -15493.043 3816.143 -4.060 5.17e-05 ***
## IsReduced -14228.549 3867.555 -3.679 0.000243 ***
## KitchenQualGroupHigh 58836.975 4991.936 11.786 < 2e-16 ***
## KitchenQualGroupMedium 8869.918 2773.528 3.198 0.001413 **
## ExcellentFireplace 30403.653 7880.463 3.858 0.000119 ***
## ExcellentGarage 64507.044 21204.164 3.042 0.002391 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36550 on 1449 degrees of freedom
## Multiple R-squared: 0.7898, Adjusted R-squared: 0.7883
## F-statistic: 544.4 on 10 and 1449 DF, p-value: < 2.2e-16
Including ExcellentGarage increased the adjusted R-squared from 0.7871 to 0.7883. All estimators are significant.
The variable PoolQC describes the quality of the pool. The new variable ExcellentPool flags properties with excellent pools.
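The code that creates ExcellentPool is not shown in this report. Following the same pattern used for ExcellentFireplace and ExcellentGarage, it would presumably look like the sketch below. This is a reconstruction under that assumption, shown on a toy data frame (`house_toy`, made up for illustration) so that it runs on its own.

```r
# Toy rows standing in for house_data; PoolQC is NA when there is no pool
house_toy <- data.frame(PoolQC = c(NA, "Ex", "Gd", "Fa", "Ex"))

#PoolQC (assumed to mirror the ExcellentFireplace / ExcellentGarage code)
house_toy$ExcellentPool <- 0
house_toy$ExcellentPool[house_toy$PoolQC %in% c('Ex')] <- 1

print(house_toy)
```

Using `%in%` rather than `==` means rows with a missing PoolQC simply stay at 0. For the real data, `house_toy` would be replaced by `house_data`.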
Running the model with ExcellentPool gives the following result.
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea + IsAbnormalSale + IsReduced +
KitchenQualGroup + ExcellentFireplace + ExcellentGarage + ExcellentPool, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea +
## IsAbnormalSale + IsReduced + KitchenQualGroup + ExcellentFireplace +
## ExcellentGarage + ExcellentPool, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -525458 -17696 -265 14985 255218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30634.759 4222.363 7.255 6.51e-13 ***
## NeighborhoodGroupGroupB 24451.458 2565.963 9.529 < 2e-16 ***
## NeighborhoodGroupGroupC 55518.337 3438.102 16.148 < 2e-16 ***
## SinceRemod -300.701 63.277 -4.752 2.21e-06 ***
## OverallArea 43.640 1.258 34.700 < 2e-16 ***
## IsAbnormalSale -17647.202 3831.823 -4.605 4.48e-06 ***
## IsReduced -13919.965 3847.417 -3.618 0.000307 ***
## KitchenQualGroupHigh 59135.120 4965.524 11.909 < 2e-16 ***
## KitchenQualGroupMedium 8909.968 2758.574 3.230 0.001266 **
## ExcellentFireplace 26817.001 7886.709 3.400 0.000691 ***
## ExcellentGarage 64770.585 21089.808 3.071 0.002172 **
## ExcellentPool 108539.628 26504.527 4.095 4.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36350 on 1448 degrees of freedom
## Multiple R-squared: 0.7922, Adjusted R-squared: 0.7906
## F-statistic: 501.8 on 11 and 1448 DF, p-value: < 2.2e-16
Including ExcellentPool in the model increased the adjusted R-squared from 0.7883 to 0.7906. All estimators are significant.
The variable OverallQual describes the overall quality of the house. This variable has 10 different levels. When I included this variable in the model as is, the estimator was not significant. The different levels are coded as integers. Upon reviewing the box plot of this variable, I noticed a clear trend that houses with better overall quality sold for higher prices.
The variable OverallQualityHigh flags properties with quality scores of 8, 9, and 10. I also tried breaking the 10 different levels into high, medium, and low; however, this resulted in some estimators that were not significant.
The variable MasVnrArea is the masonry veneer area in square feet.
The variable CentralAir indicates whether the property has central air conditioning.
#OverallQual
house_data$OverallQualityHigh <- 0
house_data$OverallQualityHigh[house_data$OverallQual %in% c(8,9,10)] <- 1
Running the model with OverallQualityHigh gives the following result.
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea + IsAbnormalSale + IsReduced +
KitchenQualGroup + ExcellentFireplace + ExcellentGarage + ExcellentPool +
OverallQualityHigh, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea +
## IsAbnormalSale + IsReduced + KitchenQualGroup + ExcellentFireplace +
## ExcellentGarage + ExcellentPool + OverallQualityHigh, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -507915 -16263 -176 14493 265628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38919.246 4212.929 9.238 < 2e-16 ***
## NeighborhoodGroupGroupB 24724.972 2498.321 9.897 < 2e-16 ***
## NeighborhoodGroupGroupC 45168.215 3539.968 12.759 < 2e-16 ***
## SinceRemod -284.823 61.630 -4.622 4.15e-06 ***
## OverallArea 40.195 1.283 31.327 < 2e-16 ***
## IsAbnormalSale -17413.701 3730.625 -4.668 3.33e-06 ***
## IsReduced -12302.462 3750.041 -3.281 0.00106 **
## KitchenQualGroupHigh 47462.397 5005.854 9.481 < 2e-16 ***
## KitchenQualGroupMedium 7195.338 2692.430 2.672 0.00761 **
## ExcellentFireplace 23207.879 7688.739 3.018 0.00259 **
## ExcellentGarage 59661.403 20540.205 2.905 0.00373 **
## ExcellentPool 104081.991 25808.689 4.033 5.80e-05 ***
## OverallQualityHigh 32872.630 3659.353 8.983 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35390 on 1447 degrees of freedom
## Multiple R-squared: 0.8032, Adjusted R-squared: 0.8015
## F-statistic: 492.1 on 12 and 1447 DF, p-value: < 2.2e-16
Including OverallQualityHigh in the model increased the adjusted R-squared from 0.7906 to 0.8015. All estimators are significant.
Including MasVnrArea in the model increased the adjusted R-squared from 0.8015 to 0.8048. All estimators are significant. Note that 8 observations with a missing MasVnrArea are dropped from this fit.
summary(lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea + IsAbnormalSale + IsReduced +
KitchenQualGroup + ExcellentFireplace + ExcellentGarage + ExcellentPool +
OverallQualityHigh + MasVnrArea, data=house_data))
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea +
## IsAbnormalSale + IsReduced + KitchenQualGroup + ExcellentFireplace +
## ExcellentGarage + ExcellentPool + OverallQualityHigh + MasVnrArea,
## data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -504314 -16424 -17 13922 247510
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42718.818 4247.930 10.056 < 2e-16 ***
## NeighborhoodGroupGroupB 24847.742 2477.169 10.031 < 2e-16 ***
## NeighborhoodGroupGroupC 42874.282 3560.382 12.042 < 2e-16 ***
## SinceRemod -294.750 61.049 -4.828 1.53e-06 ***
## OverallArea 38.052 1.341 28.368 < 2e-16 ***
## IsAbnormalSale -17252.257 3693.129 -4.671 3.27e-06 ***
## IsReduced -10299.096 3739.744 -2.754 0.00596 **
## KitchenQualGroupHigh 46220.170 4967.850 9.304 < 2e-16 ***
## KitchenQualGroupMedium 7916.816 2672.772 2.962 0.00311 **
## ExcellentFireplace 22824.924 7617.646 2.996 0.00278 **
## ExcellentGarage 63899.310 20345.619 3.141 0.00172 **
## ExcellentPool 117917.603 25682.422 4.591 4.79e-06 ***
## OverallQualityHigh 31406.265 3655.204 8.592 < 2e-16 ***
## MasVnrArea 30.844 5.951 5.183 2.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35030 on 1438 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8065, Adjusted R-squared: 0.8048
## F-statistic: 461.1 on 13 and 1438 DF, p-value: < 2.2e-16
Lastly, adding CentralAir to the model increased the adjusted R-squared from 0.8048 to 0.8065. All estimators are significant.
model <- lm(SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea + IsAbnormalSale + IsReduced +
KitchenQualGroup + ExcellentFireplace + ExcellentGarage + ExcellentPool +
OverallQualityHigh + MasVnrArea + CentralAir, data=house_data)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ NeighborhoodGroup + SinceRemod + OverallArea +
## IsAbnormalSale + IsReduced + KitchenQualGroup + ExcellentFireplace +
## ExcellentGarage + ExcellentPool + OverallQualityHigh + MasVnrArea +
## CentralAir, data = house_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -500221 -16098 161 13705 249142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29597.605 5534.000 5.348 1.03e-07 ***
## NeighborhoodGroupGroupB 24363.267 2469.974 9.864 < 2e-16 ***
## NeighborhoodGroupGroupC 42587.343 3545.845 12.010 < 2e-16 ***
## SinceRemod -250.497 61.966 -4.043 5.57e-05 ***
## OverallArea 37.628 1.341 28.069 < 2e-16 ***
## IsAbnormalSale -17385.937 3677.338 -4.728 2.49e-06 ***
## IsReduced -9549.326 3729.152 -2.561 0.010547 *
## KitchenQualGroupHigh 46712.205 4948.178 9.440 < 2e-16 ***
## KitchenQualGroupMedium 7982.469 2661.273 2.999 0.002751 **
## ExcellentFireplace 22853.123 7584.708 3.013 0.002632 **
## ExcellentGarage 61809.474 20265.609 3.050 0.002331 **
## ExcellentPool 118176.450 25571.458 4.621 4.15e-06 ***
## OverallQualityHigh 32106.491 3644.377 8.810 < 2e-16 ***
## MasVnrArea 29.824 5.932 5.028 5.58e-07 ***
## CentralAirY 14439.168 3927.211 3.677 0.000245 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34880 on 1437 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.8065
## F-statistic: 432.9 on 14 and 1437 DF, p-value: < 2.2e-16
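The variance inflation factor reported for the final model is 5.217287. This single value equals 1/(1 - R-squared) for the fitted model itself (1/(1 - 0.8083) is about 5.22), so it summarizes the overall fit rather than the collinearity of any individual predictor. As a rule of thumb, VIF values under 10 do not suggest multicollinearity.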
VIF(model)
## [1] 5.217287
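Multicollinearity is usually judged one predictor at a time: each predictor is regressed on all the others, and its VIF is 1/(1 - R-squared) of that auxiliary regression. The following is a minimal base-R sketch of that idea on simulated data; the helper `column_vif` is a hypothetical name, not a function from any package used in this report.

```r
# VIF for each column of the design matrix of a fitted lm object
column_vif <- function(fit) {
  X <- model.matrix(fit)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
  vifs <- sapply(seq_len(ncol(X)), function(j) {
    # Regress column j on all remaining columns
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
  names(vifs) <- colnames(X)
  vifs
}

# Simulated data: x2 is nearly a copy of x1, x3 is unrelated to both
set.seed(1)
toy <- data.frame(x1 = rnorm(200))
toy$x2 <- toy$x1 + rnorm(200, sd = 0.1)
toy$x3 <- rnorm(200)
toy$y <- 1 + toy$x1 + toy$x3 + rnorm(200)

toy_vif <- column_vif(lm(y ~ x1 + x2 + x3, data = toy))
print(round(toy_vif, 1))
```

In the toy data, x1 and x2 get very large values while x3 stays close to 1. Applied to the final model, `column_vif(model)` would give one value per coefficient; for a factor with several dummy columns, such as NeighborhoodGroup, these per-column values are only a rough guide.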
The residual vs fitted plot shows an approximately horizontal line, which suggests that the relationship is linear.
The points in the normal Q-Q plot lie roughly along the reference line, which suggests that the residuals are approximately normally distributed; however, the points in the tails deviate from the line.
The scale-location plot should show a horizontal line with equally spread points, which is a good indication of homoscedasticity (constant variance of the residuals). This is not the case here.
In the residual vs leverage plot there are some observations with high leverage.
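The four diagnostic plots discussed above are the standard ones that `plot()` draws for an `lm` object. The sketch below shows how they can be produced, using a small simulated model (`toy_fit`, made up for illustration) in place of the final `model`, and adds a numeric check for high-leverage points. The cutoff of twice the average leverage is a common rule of thumb, not something taken from this report.

```r
# Simulated model standing in for the report's final `model`
set.seed(7)
toy <- data.frame(x = runif(100, 1, 10))
toy$y <- 5 + 3 * toy$x + rnorm(100, sd = toy$x)  # noise grows with x
toy_fit <- lm(y ~ x, data = toy)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
old_par <- par(mfrow = c(2, 2))
plot(toy_fit)
par(old_par)

# Leverage (hat) values; flag points above twice the average leverage
lev <- hatvalues(toy_fit)
cutoff <- 2 * length(coef(toy_fit)) / nrow(toy)
high_leverage <- which(lev > cutoff)
print(high_leverage)
```

For the housing model, `toy_fit` would be replaced by `model`, and `nrow(toy)` by the number of rows actually used in the fit, `nobs(model)`.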
The fitted model gives the following equation for the predicted sale price.
Predicted Sale Price =
29597.605 + 24363.267(NeighborhoodGroupGroupB) + 42587.343(NeighborhoodGroupGroupC) - 250.497(SinceRemod) +
37.628(OverallArea) - 17385.937(IsAbnormalSale) - 9549.326(IsReduced) + 46712.205(KitchenQualGroupHigh) +
7982.469(KitchenQualGroupMedium) + 22853.123(ExcellentFireplace) + 61809.474(ExcellentGarage) +
118176.450(ExcellentPool) + 32106.491(OverallQualityHigh) + 29.824(MasVnrArea) + 14439.168(CentralAirY)
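As a worked example of the prediction equation, the sketch below plugs in one hypothetical house. The coefficients are copied from the final model summary; the house itself (GroupB neighborhood, remodeled 10 years before sale, 2,500 square feet overall, medium kitchen, 100 square feet of veneer, central air) is made up for illustration.

```r
# Coefficients from the final model summary
coefs <- c(
  "(Intercept)" = 29597.605,
  NeighborhoodGroupGroupB = 24363.267,
  NeighborhoodGroupGroupC = 42587.343,
  SinceRemod = -250.497,
  OverallArea = 37.628,
  IsAbnormalSale = -17385.937,
  IsReduced = -9549.326,
  KitchenQualGroupHigh = 46712.205,
  KitchenQualGroupMedium = 7982.469,
  ExcellentFireplace = 22853.123,
  ExcellentGarage = 61809.474,
  ExcellentPool = 118176.450,
  OverallQualityHigh = 32106.491,
  MasVnrArea = 29.824,
  CentralAirY = 14439.168
)

# Hypothetical house; dummy variables are 1 when the condition holds
house <- c(
  "(Intercept)" = 1,
  NeighborhoodGroupGroupB = 1, NeighborhoodGroupGroupC = 0,
  SinceRemod = 10, OverallArea = 2500,
  IsAbnormalSale = 0, IsReduced = 0,
  KitchenQualGroupHigh = 0, KitchenQualGroupMedium = 1,
  ExcellentFireplace = 0, ExcellentGarage = 0, ExcellentPool = 0,
  OverallQualityHigh = 0, MasVnrArea = 100, CentralAirY = 1
)

# Multiply each coefficient by the matching value and add them up
predicted_price <- sum(coefs * house[names(coefs)])
print(predicted_price)
```

For this hypothetical house the equation gives a predicted sale price of about 170,930.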
I ran the predictions on the test data and submitted them to Kaggle.
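The prediction code is not shown in this report, so the following is only a sketch of how the step could look, on made-up toy data. It assumes a submission with an Id column and a SalePrice column, and it assumes that rows whose prediction comes out as NA (because a predictor is missing) are filled with the median training price. Both the toy data and the NA fallback are my assumptions, not the method actually used here.

```r
# Toy training and test sets (made-up values)
train_toy <- data.frame(
  OverallArea = c(1500, 2000, 2500, 3000, 3500),
  SalePrice   = c(120000, 160000, 210000, 240000, 290000)
)
test_toy <- data.frame(Id = 1461:1463, OverallArea = c(1800, NA, 3200))

fit_toy <- lm(SalePrice ~ OverallArea, data = train_toy)

# predict() returns NA for rows with a missing predictor
pred <- predict(fit_toy, newdata = test_toy)

# Assumption: fall back to the median training price for those rows
pred[is.na(pred)] <- median(train_toy$SalePrice)

submission <- data.frame(Id = test_toy$Id, SalePrice = pred)
print(submission)
# write.csv(submission, "submission.csv", row.names = FALSE)
```

With the real data, every calculated variable (NeighborhoodGroup, SinceRemod, OverallArea, and so on) has to be created on the test set in exactly the same way as on the training set before calling `predict(model, newdata = ...)`.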
My score is 0.18037 (root mean squared logarithmic error; lower is better).
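For reference, the score can be approximated locally on held-out data. The sketch below takes the metric to be the root mean squared error between the log of the predicted price and the log of the observed price; the function name `rmsle` and the example numbers are hypothetical.

```r
# Root mean squared error on the log scale
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Made-up example: predictions that are 10% too high and 10% too low
actual    <- c(100000, 200000)
predicted <- c(110000, 180000)
score <- rmsle(predicted, actual)
print(score)
```

Because the error is measured on the log scale, being off by 10% costs about the same for a cheap house as for an expensive one. A linear model on the raw price can also produce zero or negative predictions for unusual houses, for which the log is undefined.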