The variable we want to predict is SalePrice.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
There are 79 variables that can be used to help predict SalePrice. The 81 column names listed below are these 79 predictors plus Id and dataset.
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "dataset"
The variable SqFt is the total square footage of the first and second floors (X1stFlrSF plus X2ndFlrSF).
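A minimal sketch of how SqFt could be built, assuming it comes from the X1stFlrSF and X2ndFlrSF columns listed above. The `houses` data frame and its values are made up for illustration; in the analysis the same line would presumably be applied to train and test.

```r
# Toy data frame standing in for train/test (values are made up)
houses <- data.frame(
  X1stFlrSF = c(856, 1262, 920),
  X2ndFlrSF = c(854, 0, 866)
)

# Total square footage: first floor plus second floor
houses$SqFt <- houses$X1stFlrSF + houses$X2ndFlrSF
print(houses)
```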

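Model log(SalePrice) instead of SalePrice because the data show a funnel shape: the spread of prices widens as the values get larger, and taking the log helps even out that spread.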
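The effect of the log can be illustrated with simulated data. This is only a sketch, not the housing data: the variable names, coefficients, and noise level are assumptions chosen to produce a funnel shape, and `spread_ratio` is a hypothetical helper that compares the residual spread for large and small houses.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated data (not the housing data): the noise is multiplicative,
# so the spread of price grows with sqft, giving a funnel shape
n <- 2000
sqft <- runif(n, 600, 4000)
price <- exp(11 + 0.0004 * sqft + rnorm(n, sd = 0.2))

raw_fit <- lm(price ~ sqft)       # model on the original scale
log_fit <- lm(log(price) ~ sqft)  # model on the log scale

# Ratio of residual spread for houses above vs. below the median sqft;
# a ratio near 1 means roughly constant variance
spread_ratio <- function(fit, x) {
  big <- x > median(x)
  sd(resid(fit)[big]) / sd(resid(fit)[!big])
}

raw_ratio <- spread_ratio(raw_fit, sqft)
log_ratio <- spread_ratio(log_fit, sqft)
cat("raw scale:", round(raw_ratio, 2), "\n")
cat("log scale:", round(log_ratio, 2), "\n")
```

On the raw scale the ratio should come out well above 1 (the funnel), while on the log scale it should sit close to 1.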

Use Neighborhood and SqFt to predict log(SalePrice).
mod <- lm(log(SalePrice) ~ SqFt + Neighborhood, data = train)
summary(mod)
##
## Call:
## lm(formula = log(SalePrice) ~ SqFt + Neighborhood, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41968 -0.10150 0.01051 0.10863 0.82312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.161e+01 5.068e-02 229.032 < 2e-16 ***
## SqFt 3.929e-04 1.168e-05 33.641 < 2e-16 ***
## NeighborhoodBlueste -3.290e-01 1.475e-01 -2.230 0.025910 *
## NeighborhoodBrDale -5.097e-01 6.882e-02 -7.406 2.21e-13 ***
## NeighborhoodBrkSide -3.978e-01 5.449e-02 -7.301 4.73e-13 ***
## NeighborhoodClearCr -6.431e-02 6.081e-02 -1.058 0.290439
## NeighborhoodCollgCr -2.643e-02 5.051e-02 -0.523 0.600901
## NeighborhoodCrawfor -1.027e-01 5.542e-02 -1.853 0.064114 .
## NeighborhoodEdwards -4.226e-01 5.178e-02 -8.161 7.21e-16 ***
## NeighborhoodGilbert -9.719e-02 5.282e-02 -1.840 0.065978 .
## NeighborhoodIDOTRR -5.985e-01 5.794e-02 -10.331 < 2e-16 ***
## NeighborhoodMeadowV -5.499e-01 6.782e-02 -8.108 1.09e-15 ***
## NeighborhoodMitchel -1.829e-01 5.557e-02 -3.292 0.001020 **
## NeighborhoodNAmes -2.541e-01 4.966e-02 -5.116 3.54e-07 ***
## NeighborhoodNoRidge 8.188e-02 5.831e-02 1.404 0.160464
## NeighborhoodNPkVill -2.342e-01 8.138e-02 -2.878 0.004061 **
## NeighborhoodNridgHt 2.583e-01 5.319e-02 4.855 1.33e-06 ***
## NeighborhoodNWAmes -1.566e-01 5.326e-02 -2.941 0.003327 **
## NeighborhoodOldTown -4.778e-01 5.134e-02 -9.307 < 2e-16 ***
## NeighborhoodSawyer -2.743e-01 5.313e-02 -5.163 2.77e-07 ***
## NeighborhoodSawyerW -1.444e-01 5.436e-02 -2.656 0.007994 **
## NeighborhoodSomerst 6.189e-02 5.242e-02 1.181 0.237906
## NeighborhoodStoneBr 2.388e-01 6.226e-02 3.836 0.000131 ***
## NeighborhoodSWISU -4.336e-01 6.211e-02 -6.982 4.44e-12 ***
## NeighborhoodTimber 6.775e-02 5.770e-02 1.174 0.240568
## NeighborhoodVeenker 1.309e-01 7.637e-02 1.714 0.086813 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1973 on 1434 degrees of freedom
## Multiple R-squared: 0.7601, Adjusted R-squared: 0.7559
## F-statistic: 181.8 on 25 and 1434 DF, p-value: < 2.2e-16
test$logSalePricePredictions <- predict(mod, test)
train$logSalePricePredictions <- predict(mod, train)
The fitted model is used to predict the SalePrice of the houses in the test data.

Take the exponential of the predicted log values to get the y-axis back on the SalePrice scale.
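A small sketch of the back-transformation, using made-up data so that it runs on its own. The `toy` data frame, `toy_mod`, and the column name `SalePricePredictions` are assumptions for illustration; the same `exp()` step would apply to the logSalePricePredictions columns in train and test.

```r
# Toy data (made up) with the same column names used in the model
toy <- data.frame(
  SalePrice = c(120000, 150000, 210000, 260000, 180000, 310000),
  SqFt      = c(900, 1200, 1700, 2100, 1500, 2500)
)
toy_mod <- lm(log(SalePrice) ~ SqFt, data = toy)

# Predictions come out on the log scale
toy$logSalePricePredictions <- predict(toy_mod, toy)

# exp() undoes log(), returning the predictions in dollars
toy$SalePricePredictions <- exp(toy$logSalePricePredictions)
print(toy)
```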
