This is a playground competition from Kaggle. The key research question is how a home's features add up to its price tag. The data set includes 79 explanatory variables describing many aspects of residential homes in Ames, Iowa. My job is to select useful predictors and use them to predict the sale price of each home in the test data set.
First, I look through all the variables and pick the ones that I think are most relevant to house price, based on my own experience. Then I plot each of them against the sale price. The following are some examples.
library(ggplot2)

# Sale price vs. above-grade living area; both axes are capped at the 95th percentile
ggplot(train, aes(x = GrLivArea, y = SalePrice, color = SalePrice)) +
  geom_point() +
  xlim(c(0, quantile(train$GrLivArea, 0.95))) +
  ylim(c(0, quantile(train$SalePrice, 0.95))) +
  xlab('Above Grade (Ground) Living Area Square Feet') +
  ylab('Sale Price') +
  ggtitle('House Sale Price by Above Grade (Ground) Living Area Square Feet') +
  scale_color_gradient(limits = c(0, quantile(train$SalePrice, 0.95)), low = "#99FFFF", high = "#003366")
# Sale price vs. lot size, with the same 95th-percentile caps
ggplot(train, aes(x = LotArea, y = SalePrice, color = SalePrice)) +
  geom_point() +
  xlim(c(0, quantile(train$LotArea, 0.95))) +
  ylim(c(0, quantile(train$SalePrice, 0.95))) +
  xlab('Lot Size in Square Feet') +
  ylab('Sale Price') +
  ggtitle('House Sale Price by Lot Size in Square Feet') +
  scale_color_gradient(limits = c(0, quantile(train$SalePrice, 0.95)), low = "#99FFFF", high = "#003366")
# Order neighborhoods by median sale price so the boxplots are sorted
Neighborhoodmedian <- with(train, reorder(Neighborhood, SalePrice, median))
ggplot(train, aes(x = Neighborhoodmedian, y = SalePrice)) +
  geom_boxplot(aes(fill = Neighborhood)) + coord_flip() + labs(fill = "") +
  ylab('Sale Price') +
  xlab('Neighborhood') +
  ggtitle('House Sale Price by Neighborhood')
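# One box per quality rating; this relies on OverallQual being a factor,
# which matches the per-level coefficients in the model summary
ggplot(train, aes(x = OverallQual, y = SalePrice)) +
  geom_boxplot(aes(fill = OverallQual)) +
  xlab('Rates the Overall Material and Finish of the House') +
  ylab('Sale Price') +
  theme(legend.position = "none") +
  ggtitle('House Sale Price by Overall Quality')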
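The xlim() and ylim() calls in the scatter plots above cap each axis at the 95th percentile. In ggplot2, points outside such limits are dropped from the plot rather than squeezed in. As a rough sketch of what that trimming amounts to, the base R snippet below applies the same caps to a small simulated data frame. The data frame `toy` and all of its values are made up for illustration and are not the Kaggle data.

```r
# Simulated stand-in for the training data (hypothetical values, not the Kaggle set)
set.seed(1)
toy <- data.frame(GrLivArea = round(runif(200, 500, 4000)))
toy$SalePrice <- round(100 * toy$GrLivArea + rnorm(200, 0, 20000))

# Same caps as the xlim()/ylim() calls: the 95th percentile of each variable
area_cap  <- quantile(toy$GrLivArea, 0.95)
price_cap <- quantile(toy$SalePrice, 0.95)

# Rows that would still be drawn once both limits are applied
keep <- toy$GrLivArea >= 0 & toy$GrLivArea <= area_cap &
        toy$SalePrice >= 0 & toy$SalePrice <= price_cap
trimmed <- toy[keep, ]

cat("rows kept:", nrow(trimmed), "of", nrow(toy), "\n")
```

Hiding the largest and most expensive homes makes the bulk of the market easier to read, but it only affects the plots: the regression further down is still fitted on every row of train.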
I fit a linear regression model to predict house price, using the log of SalePrice as the response and 27 independent variables. The model fits the training data well, with an R-squared of 0.8939 (adjusted 0.8883) on the log scale. The coefficients indicate which factors are associated with house price, and the model can help home buyers estimate what a house is worth.
model1 <- lm(log(SalePrice) ~ I(LotArea^(1/3))+LandContour+Neighborhood+
BldgType+HouseStyle+OverallQual+YearBuilt+YearRemodAdd+
MasVnrType+MasVnrArea+TotalBsmtSF+
CentralAir+I(GrLivArea^(1/3))+BsmtFullBath+FullBath+
BedroomAbvGr+KitchenAbvGr+TotRmsAbvGrd+Fireplaces+GarageCars+GarageArea+WoodDeckSF+OpenPorchSF+
EnclosedPorch+ScreenPorch+PoolArea+KitchenQual,data=train)
summary(model1)
##
## Call:
## lm(formula = log(SalePrice) ~ I(LotArea^(1/3)) + LandContour +
## Neighborhood + BldgType + HouseStyle + OverallQual + YearBuilt +
## YearRemodAdd + MasVnrType + MasVnrArea + TotalBsmtSF + CentralAir +
## I(GrLivArea^(1/3)) + BsmtFullBath + FullBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + GarageArea +
## WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## PoolArea + KitchenQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28552 -0.05946 0.00582 0.06993 0.42074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.103e+00 7.677e-01 5.345 1.06e-07 ***
## I(LotArea^(1/3)) 6.579e-03 1.485e-03 4.430 1.02e-05 ***
## LandContourHLS 7.809e-02 2.740e-02 2.850 0.004430 **
## LandContourLow 6.209e-02 3.256e-02 1.907 0.056743 .
## LandContourLvl 5.252e-02 1.885e-02 2.786 0.005404 **
## NeighborhoodBlueste -5.118e-02 1.031e-01 -0.496 0.619682
## NeighborhoodBrDale -1.063e-01 5.501e-02 -1.933 0.053415 .
## NeighborhoodBrkSide -4.442e-02 4.661e-02 -0.953 0.340716
## NeighborhoodClearCr 5.218e-03 4.938e-02 0.106 0.915853
## NeighborhoodCollgCr -3.042e-02 3.991e-02 -0.762 0.446095
## NeighborhoodCrawfor 1.060e-01 4.543e-02 2.334 0.019748 *
## NeighborhoodEdwards -1.397e-01 4.311e-02 -3.242 0.001216 **
## NeighborhoodGilbert -6.410e-02 4.238e-02 -1.513 0.130631
## NeighborhoodIDOTRR -1.860e-01 4.899e-02 -3.796 0.000153 ***
## NeighborhoodMeadowV -1.582e-01 5.292e-02 -2.989 0.002852 **
## NeighborhoodMitchel -7.977e-02 4.425e-02 -1.803 0.071635 .
## NeighborhoodNAmes -5.284e-02 4.176e-02 -1.266 0.205903
## NeighborhoodNoRidge 7.480e-02 4.560e-02 1.640 0.101137
## NeighborhoodNPkVill 8.751e-04 5.928e-02 0.015 0.988224
## NeighborhoodNridgHt 8.071e-02 4.138e-02 1.951 0.051301 .
## NeighborhoodNWAmes -6.345e-02 4.238e-02 -1.497 0.134593
## NeighborhoodOldTown -1.278e-01 4.527e-02 -2.823 0.004820 **
## NeighborhoodSawyer -7.774e-02 4.366e-02 -1.781 0.075209 .
## NeighborhoodSawyerW -6.786e-02 4.222e-02 -1.607 0.108254
## NeighborhoodSomerst 3.255e-02 4.098e-02 0.794 0.427086
## NeighborhoodStoneBr 1.105e-01 4.594e-02 2.404 0.016342 *
## NeighborhoodSWISU -3.412e-02 5.248e-02 -0.650 0.515750
## NeighborhoodTimber -3.359e-02 4.512e-02 -0.745 0.456643
## NeighborhoodVeenker 5.958e-02 5.601e-02 1.064 0.287637
## BldgType2fmCon 4.431e-02 3.015e-02 1.470 0.141897
## BldgTypeDuplex -1.217e-02 3.091e-02 -0.394 0.693835
## BldgTypeTwnhs -1.179e-01 2.950e-02 -3.995 6.82e-05 ***
## BldgTypeTwnhsE -7.661e-02 1.923e-02 -3.984 7.13e-05 ***
## HouseStyle1.5Unf 5.145e-02 3.978e-02 1.293 0.196111
## HouseStyle1Story 3.989e-02 1.632e-02 2.444 0.014643 *
## HouseStyle2.5Fin -8.660e-03 5.345e-02 -0.162 0.871315
## HouseStyle2.5Unf 8.143e-02 4.499e-02 1.810 0.070494 .
## HouseStyle2Story -5.307e-03 1.522e-02 -0.349 0.727342
## HouseStyleSFoyer 7.075e-02 2.967e-02 2.384 0.017241 *
## HouseStyleSLvl 4.396e-02 2.251e-02 1.953 0.051034 .
## OverallQual2 1.185e-02 1.266e-01 0.094 0.925430
## OverallQual3 2.412e-01 1.054e-01 2.290 0.022182 *
## OverallQual4 3.661e-01 1.027e-01 3.566 0.000375 ***
## OverallQual5 4.304e-01 1.030e-01 4.177 3.13e-05 ***
## OverallQual6 4.815e-01 1.034e-01 4.656 3.53e-06 ***
## OverallQual7 5.288e-01 1.040e-01 5.082 4.23e-07 ***
## OverallQual8 6.040e-01 1.052e-01 5.744 1.14e-08 ***
## OverallQual9 7.075e-01 1.081e-01 6.542 8.50e-11 ***
## OverallQual10 6.664e-01 1.124e-01 5.928 3.87e-09 ***
## YearBuilt 1.323e-03 3.067e-04 4.313 1.72e-05 ***
## YearRemodAdd 1.627e-03 2.614e-04 6.225 6.37e-10 ***
## MasVnrTypeBrkFace 9.891e-02 3.629e-02 2.726 0.006497 **
## MasVnrTypeNone 1.030e-01 3.651e-02 2.822 0.004848 **
## MasVnrTypeStone 9.841e-02 3.842e-02 2.562 0.010523 *
## MasVnrArea 2.427e-05 3.117e-05 0.779 0.436296
## TotalBsmtSF 2.124e-05 1.439e-05 1.476 0.140207
## CentralAirY 1.083e-01 1.797e-02 6.029 2.11e-09 ***
## I(GrLivArea^(1/3)) 1.066e-01 8.490e-03 12.553 < 2e-16 ***
## BsmtFullBath 6.210e-02 7.938e-03 7.823 1.02e-14 ***
## FullBath 2.395e-02 1.072e-02 2.236 0.025538 *
## BedroomAbvGr -4.008e-03 7.205e-03 -0.556 0.578095
## KitchenAbvGr -1.111e-01 2.840e-02 -3.911 9.64e-05 ***
## TotRmsAbvGrd 2.237e-03 5.066e-03 0.442 0.658851
## Fireplaces 2.205e-02 7.228e-03 3.050 0.002331 **
## GarageCars 5.568e-02 1.154e-02 4.826 1.54e-06 ***
## GarageArea -1.645e-05 3.959e-05 -0.415 0.677841
## WoodDeckSF 1.182e-04 3.149e-05 3.753 0.000182 ***
## OpenPorchSF 5.812e-05 6.077e-05 0.956 0.339002
## EnclosedPorch 1.819e-05 6.636e-05 0.274 0.784032
## ScreenPorch 3.142e-04 6.649e-05 4.726 2.52e-06 ***
## PoolArea -1.408e-04 9.166e-05 -1.536 0.124857
## KitchenQualFa -1.021e-01 3.235e-02 -3.157 0.001626 **
## KitchenQualGd -6.287e-02 1.918e-02 -3.278 0.001071 **
## KitchenQualTA -9.001e-02 2.122e-02 -4.242 2.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1335 on 1386 degrees of freedom
## Multiple R-squared: 0.8939, Adjusted R-squared: 0.8883
## F-statistic: 160 on 73 and 1386 DF, p-value: < 2.2e-16
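Because the response is log(SalePrice), a coefficient b on a dummy variable, or on a one-unit change in an untransformed predictor, corresponds to a multiplicative change of exp(b) in the predicted price with the other variables held fixed. The sketch below converts a few estimates copied from the summary above into approximate percentage changes. The vector name `est` is only for illustration.

```r
# Estimates copied from the summary output above
est <- c(CentralAirY  =  0.1083,
         BsmtFullBath =  0.06210,
         Fireplaces   =  0.02205,
         KitchenAbvGr = -0.1111)

# exp(b) is the multiplicative effect on price; express it as a percentage change
pct_change <- (exp(est) - 1) * 100
print(round(pct_change, 1))
```

Read this way, central air is associated with a predicted price roughly 11% higher, and each additional above-grade kitchen with a price roughly 10% lower. These are associations within this model, not causal effects, and the cube-root terms for LotArea and GrLivArea do not have this simple reading.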
# Predict on the test set and back-transform from log(SalePrice) to dollars
pred <- exp(predict(model1, newdata = test))
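The line above undoes the log transform so that the predictions are back in dollars. As a minimal, self-contained sketch of that step and of saving the result, the example below fits a tiny log-price model on simulated data and writes the predictions to a CSV file. The data frames `toy_train` and `toy_test`, the single predictor, and the temporary output file are all hypothetical, and the `Id` and `SalePrice` column names are an assumption about the competition's submission format.

```r
set.seed(42)

# Hypothetical stand-ins for the real train and test sets
toy_train <- data.frame(GrLivArea = runif(100, 600, 3000))
toy_train$SalePrice <- exp(10.5 + 0.0005 * toy_train$GrLivArea + rnorm(100, 0, 0.1))
toy_test <- data.frame(Id = 1:5, GrLivArea = c(800, 1200, 1600, 2000, 2400))

# Fit on the log scale, as in the model above
fit <- lm(log(SalePrice) ~ GrLivArea, data = toy_train)

# predict() returns log prices; exp() maps them back to the original scale
toy_pred <- exp(predict(fit, newdata = toy_test))

# Assemble a two-column table and write it out (assumed submission layout)
submission <- data.frame(Id = toy_test$Id, SalePrice = toy_pred)
out_file <- tempfile(fileext = ".csv")
write.csv(submission, out_file, row.names = FALSE)
```

With the real data, two things are worth checking before this step: predict() stops with an error if test contains a factor level that never appears in train, and it returns NA for any test row with a missing predictor.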