The data comes from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The variable we want to predict is SalePrice. Its distribution in the training data is summarized below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

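There are 79 variables that can be used to help predict SalePrice. The listing below shows 81 names because it also includes Id, a row identifier, and dataset, which appears to be a helper column marking where each row came from.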

##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
##  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
## [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
## [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
## [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
## [41] "HeatingQC"     "CentralAir"    "Electrical"    "X1stFlrSF"    
## [45] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [69] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
## [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
## [81] "dataset"

Based on housing shows, I would expect square footage and neighborhood to be the biggest predictors, so let's look at those.

The variable SqFt is the sum of the first-floor and second-floor square footage (X1stFlrSF + X2ndFlrSF).
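A minimal sketch of how SqFt could be built, assuming the two floor-area columns are X1stFlrSF and X2ndFlrSF as named in the listing above. The data frame `toy` and its values are made up for illustration and are not rows from the Kaggle file.

```r
# Toy stand-in for the training data (values are made up for illustration).
toy <- data.frame(
  X1stFlrSF = c(856, 1262, 920),
  X2ndFlrSF = c(854, 0, 866)
)

# SqFt = first-floor area + second-floor area.
toy$SqFt <- toy$X1stFlrSF + toy$X2ndFlrSF
toy$SqFt
```

The same column would have to be added to the test data as well, since predict() looks for SqFt there.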

Use the log of SalePrice because of the funnel shape in the data: the spread in price widens as the values get larger.
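The effect of the log can be illustrated with simulated data. This is only a sketch: the prices below are generated with multiplicative noise, and every number in it (sample size, coefficients, noise level) is an assumption chosen for the demonstration, not an estimate from the Kaggle data. The helper `spread_ratio` is hypothetical.

```r
set.seed(1)
n <- 2000
sqft <- runif(n, 500, 4000)

# Simulated prices with multiplicative noise, so the error grows with price.
price <- exp(11 + 0.0004 * sqft + rnorm(n, sd = 0.2))

raw_fit <- lm(price ~ sqft)       # raw scale: funnel-shaped residuals
log_fit <- lm(log(price) ~ sqft)  # log scale: roughly even residuals

# Ratio of residual spread for larger houses versus smaller houses.
big <- sqft > median(sqft)
spread_ratio <- function(fit) sd(resid(fit)[big]) / sd(resid(fit)[!big])

spread_ratio(raw_fit)  # well above 1: residuals fan out for larger houses
spread_ratio(log_fit)  # close to 1: spread is roughly constant
```

A ratio near 1 on the log scale is what makes the constant-variance assumption of lm() more plausible there.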

Use Neighborhood and SqFt to predict log(SalePrice).

mod <- lm(log(SalePrice) ~ SqFt + Neighborhood, data = train)

summary(mod)
## 
## Call:
## lm(formula = log(SalePrice) ~ SqFt + Neighborhood, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.41968 -0.10150  0.01051  0.10863  0.82312 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.161e+01  5.068e-02 229.032  < 2e-16 ***
## SqFt                 3.929e-04  1.168e-05  33.641  < 2e-16 ***
## NeighborhoodBlueste -3.290e-01  1.475e-01  -2.230 0.025910 *  
## NeighborhoodBrDale  -5.097e-01  6.882e-02  -7.406 2.21e-13 ***
## NeighborhoodBrkSide -3.978e-01  5.449e-02  -7.301 4.73e-13 ***
## NeighborhoodClearCr -6.431e-02  6.081e-02  -1.058 0.290439    
## NeighborhoodCollgCr -2.643e-02  5.051e-02  -0.523 0.600901    
## NeighborhoodCrawfor -1.027e-01  5.542e-02  -1.853 0.064114 .  
## NeighborhoodEdwards -4.226e-01  5.178e-02  -8.161 7.21e-16 ***
## NeighborhoodGilbert -9.719e-02  5.282e-02  -1.840 0.065978 .  
## NeighborhoodIDOTRR  -5.985e-01  5.794e-02 -10.331  < 2e-16 ***
## NeighborhoodMeadowV -5.499e-01  6.782e-02  -8.108 1.09e-15 ***
## NeighborhoodMitchel -1.829e-01  5.557e-02  -3.292 0.001020 ** 
## NeighborhoodNAmes   -2.541e-01  4.966e-02  -5.116 3.54e-07 ***
## NeighborhoodNoRidge  8.188e-02  5.831e-02   1.404 0.160464    
## NeighborhoodNPkVill -2.342e-01  8.138e-02  -2.878 0.004061 ** 
## NeighborhoodNridgHt  2.583e-01  5.319e-02   4.855 1.33e-06 ***
## NeighborhoodNWAmes  -1.566e-01  5.326e-02  -2.941 0.003327 ** 
## NeighborhoodOldTown -4.778e-01  5.134e-02  -9.307  < 2e-16 ***
## NeighborhoodSawyer  -2.743e-01  5.313e-02  -5.163 2.77e-07 ***
## NeighborhoodSawyerW -1.444e-01  5.436e-02  -2.656 0.007994 ** 
## NeighborhoodSomerst  6.189e-02  5.242e-02   1.181 0.237906    
## NeighborhoodStoneBr  2.388e-01  6.226e-02   3.836 0.000131 ***
## NeighborhoodSWISU   -4.336e-01  6.211e-02  -6.982 4.44e-12 ***
## NeighborhoodTimber   6.775e-02  5.770e-02   1.174 0.240568    
## NeighborhoodVeenker  1.309e-01  7.637e-02   1.714 0.086813 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1973 on 1434 degrees of freedom
## Multiple R-squared:  0.7601, Adjusted R-squared:  0.7559 
## F-statistic: 181.8 on 25 and 1434 DF,  p-value: < 2.2e-16
test$logSalePricePredictions <- predict(mod, test)
train$logSalePricePredictions <- predict(mod, train)

The fitted model is used to predict the log of SalePrice for the houses in the test data (and, for comparison, in the training data).

Take the exponential of the log predictions so that the y axis is back in SalePrice units.
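A small sketch of the back-transformation, using made-up rows so that it runs on its own. The names `toy_train`, `toy_test`, `toy_mod` and the column SalePricePredictions are hypothetical, and Neighborhood is left out to keep the example short.

```r
# Made-up training and test rows, for illustration only.
toy_train <- data.frame(
  SalePrice = c(120000, 150000, 180000, 210000, 260000, 300000),
  SqFt      = c(900, 1200, 1500, 1700, 2100, 2500)
)
toy_test <- data.frame(SqFt = c(1000, 2000))

toy_mod <- lm(log(SalePrice) ~ SqFt, data = toy_train)

# predict() returns values on the log scale because the response is log(SalePrice).
toy_test$logSalePricePredictions <- predict(toy_mod, toy_test)

# exp() undoes the log and gives predictions in SalePrice units.
toy_test$SalePricePredictions <- exp(toy_test$logSalePricePredictions)
toy_test
```

Note that exponentiating a prediction of the mean log price gives something closer to a typical (median) price than to the mean price, which is worth keeping in mind when comparing against actual sale prices.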