library(readr)
west <- read_csv("C:/Users/cbado/Downloads/WestRoxbury.csv")
summary(west)
## TOTAL VALUE TAX LOT SQFT YR BUILT GROSS AREA
## Min. : 105.0 Min. : 1320 Min. : 997 Min. : 0 Min. : 821
## 1st Qu.: 325.1 1st Qu.: 4090 1st Qu.: 4772 1st Qu.:1920 1st Qu.:2347
## Median : 375.9 Median : 4728 Median : 5683 Median :1935 Median :2700
## Mean : 392.7 Mean : 4939 Mean : 6278 Mean :1937 Mean :2925
## 3rd Qu.: 438.8 3rd Qu.: 5520 3rd Qu.: 7022 3rd Qu.:1955 3rd Qu.:3239
## Max. :1217.8 Max. :15319 Max. :46411 Max. :2011 Max. :8154
## LIVING AREA FLOORS ROOMS BEDROOMS FULL BATH
## Min. : 504 Min. :1.000 Min. : 3.000 Min. :1.00 Min. :1.000
## 1st Qu.:1308 1st Qu.:1.000 1st Qu.: 6.000 1st Qu.:3.00 1st Qu.:1.000
## Median :1548 Median :2.000 Median : 7.000 Median :3.00 Median :1.000
## Mean :1657 Mean :1.684 Mean : 6.995 Mean :3.23 Mean :1.297
## 3rd Qu.:1874 3rd Qu.:2.000 3rd Qu.: 8.000 3rd Qu.:4.00 3rd Qu.:2.000
## Max. :5289 Max. :3.000 Max. :14.000 Max. :9.00 Max. :5.000
## HALF BATH KITCHEN FIREPLACE REMODEL
## Min. :0.0000 Min. :1.000 Min. :0.0000 Length:5802
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.0000 Median :1.000 Median :1.0000 Mode :character
## Mean :0.6139 Mean :1.015 Mean :0.7399
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.000 Max. :4.0000
# Recode REMODEL as a numeric score: "None" = 0, "Old" = 1, any other value = 2
west$REMODEL <- ifelse(west$REMODEL == "None", 0,
                       ifelse(west$REMODEL == "Old", 1, 2))
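# 70/30 train/test split; no seed is set, so the split (and the results below) vary between runs
dt <- sort(sample(nrow(west), nrow(west) * 0.7))
train <- west[dt, ]
test <- west[-dt, ]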
totval = lm(`TOTAL VALUE`~., data=train)
summary(totval)
##
## Call:
## lm(formula = `TOTAL VALUE` ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.041570 -0.019725 0.000097 0.019775 0.040826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.480e-02 1.935e-02 1.282 0.2000
## TAX 7.949e-02 6.540e-07 121551.161 <2e-16 ***
## `LOT SQFT` 2.343e-07 1.671e-07 1.403 0.1608
## `YR BUILT` 4.348e-06 9.616e-06 0.452 0.6512
## `GROSS AREA` 1.830e-06 1.040e-06 1.760 0.0785 .
## `LIVING AREA` -3.054e-06 1.911e-06 -1.598 0.1101
## FLOORS 1.327e-03 1.104e-03 1.202 0.2296
## ROOMS -1.051e-04 4.093e-04 -0.257 0.7973
## BEDROOMS 2.840e-04 6.234e-04 0.456 0.6487
## `FULL BATH` -6.850e-05 8.580e-04 -0.080 0.9364
## `HALF BATH` 4.948e-04 7.880e-04 0.628 0.5301
## KITCHEN 4.439e-03 2.965e-03 1.497 0.1344
## FIREPLACE 2.882e-04 6.912e-04 0.417 0.6767
## REMODEL 3.761e-04 5.117e-04 0.735 0.4624
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02275 on 4047 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.063e+09 on 13 and 4047 DF, p-value: < 2.2e-16
predtotval <- predict(totval, test)
actuals_preds <- data.frame(cbind(actuals=test$`TOTAL VALUE`, predicteds=predtotval))
cor(actuals_preds)
## actuals predicteds
## actuals 1 1
## predicteds 1 1
Although this model predicts the total value of homes in West Roxbury almost perfectly (an R-squared of 1 and a correlation of 1 between actual and predicted values on the test set), it does so only because it includes the tax on each property, which is itself calculated from the property's total value. TAX is the only predictor that is significant at the 0.05 level, with a coefficient of 0.0795. The main practical use for this kind of valuation model is to estimate the fair value of a property from its attributes, for example by a real estate agent or a family looking to buy a home, in cases where the tax is not known. A useful model therefore needs to predict total value without using taxes paid.
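The effect of a predictor that is derived from the response can be illustrated with a small simulation. This is a sketch on made-up data, not the West Roxbury file: the variables `value`, `sqft`, and `tax`, and the multiplier of 12.5, are hypothetical and chosen only to mimic a tax that is a rounded fixed multiple of value.

```r
# Simulated illustration only: all names and numbers here are hypothetical
set.seed(1)
n <- 200
sqft  <- runif(n, 1000, 4000)                    # a property attribute
value <- 100 + 0.1 * sqft + rnorm(n, sd = 40)    # simulated property value
tax   <- round(12.5 * value)                     # tax as a rounded multiple of value

leaky  <- lm(value ~ tax + sqft)   # includes the predictor derived from the response
honest <- lm(value ~ sqft)         # uses the attribute only

r2_leaky  <- summary(leaky)$r.squared
r2_honest <- summary(honest)$r.squared
print(c(leaky = r2_leaky, honest = r2_honest))
```

On the real data, a comparable check would be `cor(west$TAX, west$"TOTAL VALUE")`, assuming the column names shown in the summary output.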
totvallt = lm(`TOTAL VALUE`~`LOT SQFT`+`YR BUILT`+`GROSS AREA`+`LIVING AREA`+FLOORS+ROOMS+BEDROOMS+`FULL BATH`+`HALF BATH`+KITCHEN+FIREPLACE+REMODEL, data=train)
summary(totvallt)
##
## Call:
## lm(formula = `TOTAL VALUE` ~ `LOT SQFT` + `YR BUILT` + `GROSS AREA` +
## `LIVING AREA` + FLOORS + ROOMS + BEDROOMS + `FULL BATH` +
## `HALF BATH` + KITCHEN + FIREPLACE + REMODEL, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -263.336 -26.988 -0.737 25.705 229.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.419e+01 3.696e+01 -0.925 0.3549
## `LOT SQFT` 8.370e-03 2.908e-04 28.786 <2e-16 ***
## `YR BUILT` 4.406e-02 1.836e-02 2.400 0.0164 *
## `GROSS AREA` 3.176e-02 1.923e-03 16.513 <2e-16 ***
## `LIVING AREA` 5.252e-02 3.556e-03 14.771 <2e-16 ***
## FLOORS 3.996e+01 2.014e+00 19.844 <2e-16 ***
## ROOMS -1.620e-01 7.819e-01 -0.207 0.8358
## BEDROOMS -7.862e-01 1.191e+00 -0.660 0.5092
## `FULL BATH` 2.154e+01 1.604e+00 13.432 <2e-16 ***
## `HALF BATH` 1.981e+01 1.473e+00 13.453 <2e-16 ***
## KITCHEN -1.266e+01 5.661e+00 -2.237 0.0253 *
## FIREPLACE 1.886e+01 1.287e+00 14.659 <2e-16 ***
## REMODEL 1.139e+01 9.610e-01 11.855 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.45 on 4048 degrees of freedom
## Multiple R-squared: 0.8126, Adjusted R-squared: 0.812
## F-statistic: 1462 on 12 and 4048 DF, p-value: < 2.2e-16
predtotvallt <- predict(totvallt, test)
actuals_predst <- data.frame(cbind(actuals=test$`TOTAL VALUE`, predicteds=predtotvallt))
cor(actuals_predst)
## actuals predicteds
## actuals 1.0000000 0.9017715
## predicteds 0.9017715 1.0000000
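This linear regression model has an adjusted R^2 of 0.812 on the training data. On the test data, the correlation between actual and predicted values is 0.9018, which corresponds to a squared correlation of about 0.813, so the model performs about as well on unseen homes as it does on the training set. Note that this correlation measures linear association and is not an accuracy percentage. The residual standard error of 43.45, in the units of TOTAL VALUE, gives the typical size of a prediction error on the training data.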
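Correlation alone does not show how large the prediction errors are, so error metrics on the held-out set are a useful complement. The helper below is a minimal sketch; `holdout_metrics` is a hypothetical name, and the toy vectors are for illustration only and do not come from the West Roxbury data.

```r
# Hypothetical helper: error metrics for vectors of actual and predicted values
holdout_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(rmse = sqrt(mean(err^2)),             # root mean squared error
    mae  = mean(abs(err)),                # mean absolute error
    r2   = cor(actual, predicted)^2)      # squared correlation
}

# Toy vectors for illustration only
m <- holdout_metrics(c(300, 350, 400, 450), c(310, 340, 410, 440))
print(m)

# With the objects defined earlier, the call would be:
# holdout_metrics(test$`TOTAL VALUE`, predtotvallt)
```

RMSE and MAE are reported in the same units as TOTAL VALUE, which makes them easier to interpret for a buyer or agent than a correlation coefficient.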