Loading in Data

library(readr)
west <- read_csv("C:/Users/cbado/Downloads/WestRoxbury.csv")

Changing data types and integer labels

summary(west)
##   TOTAL VALUE          TAX           LOT SQFT        YR BUILT      GROSS AREA  
##  Min.   : 105.0   Min.   : 1320   Min.   :  997   Min.   :   0   Min.   : 821  
##  1st Qu.: 325.1   1st Qu.: 4090   1st Qu.: 4772   1st Qu.:1920   1st Qu.:2347  
##  Median : 375.9   Median : 4728   Median : 5683   Median :1935   Median :2700  
##  Mean   : 392.7   Mean   : 4939   Mean   : 6278   Mean   :1937   Mean   :2925  
##  3rd Qu.: 438.8   3rd Qu.: 5520   3rd Qu.: 7022   3rd Qu.:1955   3rd Qu.:3239  
##  Max.   :1217.8   Max.   :15319   Max.   :46411   Max.   :2011   Max.   :8154  
##   LIVING AREA       FLOORS          ROOMS           BEDROOMS      FULL BATH    
##  Min.   : 504   Min.   :1.000   Min.   : 3.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:1308   1st Qu.:1.000   1st Qu.: 6.000   1st Qu.:3.00   1st Qu.:1.000  
##  Median :1548   Median :2.000   Median : 7.000   Median :3.00   Median :1.000  
##  Mean   :1657   Mean   :1.684   Mean   : 6.995   Mean   :3.23   Mean   :1.297  
##  3rd Qu.:1874   3rd Qu.:2.000   3rd Qu.: 8.000   3rd Qu.:4.00   3rd Qu.:2.000  
##  Max.   :5289   Max.   :3.000   Max.   :14.000   Max.   :9.00   Max.   :5.000  
##    HALF BATH         KITCHEN        FIREPLACE        REMODEL         
##  Min.   :0.0000   Min.   :1.000   Min.   :0.0000   Length:5802       
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.0000   Median :1.000   Median :1.0000   Mode  :character  
##  Mean   :0.6139   Mean   :1.015   Mean   :0.7399                     
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000                     
##  Max.   :3.0000   Max.   :2.000   Max.   :4.0000
# Recode REMODEL as a numeric label: "None" -> 0, "Old" -> 1, anything else -> 2
west$REMODEL <- ifelse(west$REMODEL == "None", 0,
                       ifelse(west$REMODEL == "Old", 1, 2))
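
The recoding above treats REMODEL as a number, so the regression assumes the step from 0 to 1 has the same effect as the step from 1 to 2. A minimal sketch of the same mapping with a lookup vector, next to a factor encoding that avoids that assumption, is shown below. The toy vector and the level name "Recent" are assumptions for illustration; the source only names "None" and "Old".

```r
# Toy values standing in for west$REMODEL ("Recent" is an assumed level name)
remodel <- c("None", "Old", "Recent", "None")

# Numeric labels via a named lookup vector (same 0/1/2 coding as above)
codes <- c(None = 0, Old = 1, Recent = 2)
remodel_num <- unname(codes[remodel])

# Alternative: a factor, which lm() expands into one dummy variable per
# non-baseline level instead of a single slope
remodel_fct <- factor(remodel, levels = c("None", "Old", "Recent"))

print(remodel_num)
print(levels(remodel_fct))
```

With the factor version, the model would estimate separate effects for old and recent remodels relative to no remodel, at the cost of one extra coefficient.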

Data partitioning

# Randomly assign 70% of the rows to training and the remaining 30% to testing
dt <- sort(sample(nrow(west), floor(nrow(west) * 0.7)))
train <- west[dt, ]
test <- west[-dt, ]
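
Because the split is random, rerunning the code without a seed gives a different partition and slightly different model results. The sketch below shows a reproducible 70/30 split on a small toy data frame. The data frame `toy` and the seed value are assumptions for illustration and are not part of the original analysis.

```r
# Toy stand-in for `west`; the real data frame has 5802 rows
toy <- data.frame(id = 1:10, value = seq(100, 1000, by = 100))

set.seed(42)  # arbitrary seed, chosen only so the split can be repeated

# Draw 70% of the row indices without replacement
idx <- sample(nrow(toy), size = floor(0.7 * nrow(toy)))

train_toy <- toy[idx, ]   # rows used to fit the model
test_toy  <- toy[-idx, ]  # held-out rows used to evaluate it

print(nrow(train_toy))
print(nrow(test_toy))
```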

Predicting Total Value in West Roxbury

# Fit a linear model of TOTAL VALUE on every other column
totval <- lm(`TOTAL VALUE` ~ ., data = train)
summary(totval)
## 
## Call:
## lm(formula = `TOTAL VALUE` ~ ., data = train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.041570 -0.019725  0.000097  0.019775  0.040826 
## 
## Coefficients:
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)    2.480e-02  1.935e-02      1.282   0.2000    
## TAX            7.949e-02  6.540e-07 121551.161   <2e-16 ***
## `LOT SQFT`     2.343e-07  1.671e-07      1.403   0.1608    
## `YR BUILT`     4.348e-06  9.616e-06      0.452   0.6512    
## `GROSS AREA`   1.830e-06  1.040e-06      1.760   0.0785 .  
## `LIVING AREA` -3.054e-06  1.911e-06     -1.598   0.1101    
## FLOORS         1.327e-03  1.104e-03      1.202   0.2296    
## ROOMS         -1.051e-04  4.093e-04     -0.257   0.7973    
## BEDROOMS       2.840e-04  6.234e-04      0.456   0.6487    
## `FULL BATH`   -6.850e-05  8.580e-04     -0.080   0.9364    
## `HALF BATH`    4.948e-04  7.880e-04      0.628   0.5301    
## KITCHEN        4.439e-03  2.965e-03      1.497   0.1344    
## FIREPLACE      2.882e-04  6.912e-04      0.417   0.6767    
## REMODEL        3.761e-04  5.117e-04      0.735   0.4624    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02275 on 4047 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 6.063e+09 on 13 and 4047 DF,  p-value: < 2.2e-16
predtotval <- predict(totval, test)

actuals_preds <- data.frame(cbind(actuals=test$`TOTAL VALUE`, predicteds=predtotval))
cor(actuals_preds)
##            actuals predicteds
## actuals          1          1
## predicteds       1          1

This model predicts total value almost perfectly (the correlation between actual and predicted values on the test set rounds to 1), but only because it includes the tax paid on each property, which is itself calculated from the total property value. The main practical use for a valuation model like this is to estimate the fair value of a property from its attributes, for example by a real estate agent or a family looking to purchase a home, when tax information is not known. A useful model therefore needs to predict total value without considering taxes paid.
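
The fitted TAX coefficient (7.949e-02) implies that TAX is roughly 12.58 times TOTAL VALUE, which also matches the medians in the summary (375.9 × 12.58 ≈ 4729). The sketch below reproduces this kind of leakage on simulated data: once a predictor is computed from the target, the fit becomes nearly exact regardless of the other variables. All variable names and the simulated relationship between `sqft` and `value` are assumptions for illustration only.

```r
set.seed(1)  # arbitrary seed for a repeatable simulation
n <- 200

# Simulated property attribute and value (made-up relationship)
sqft  <- runif(n, 1000, 10000)
value <- 100 + 0.03 * sqft + rnorm(n, sd = 40)

# Tax derived directly from value, mimicking the ratio seen in the data
tax <- round(value * 12.58)

leaky <- lm(value ~ tax + sqft)  # includes the leaked predictor
clean <- lm(value ~ sqft)        # attributes only

print(summary(leaky)$r.squared)
print(summary(clean)$r.squared)
print(coef(leaky)[["tax"]])      # close to 1 / 12.58
```

The leaky model recovers the value almost exactly through `tax` alone, while the attribute-only model reflects how much the attribute actually explains.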

Predicting Total Value in West Roxbury ignoring taxes paid

# Refit using property attributes only (TAX excluded)
totvallt <- lm(`TOTAL VALUE` ~ `LOT SQFT` + `YR BUILT` + `GROSS AREA` +
                 `LIVING AREA` + FLOORS + ROOMS + BEDROOMS + `FULL BATH` +
                 `HALF BATH` + KITCHEN + FIREPLACE + REMODEL, data = train)
summary(totvallt)
## 
## Call:
## lm(formula = `TOTAL VALUE` ~ `LOT SQFT` + `YR BUILT` + `GROSS AREA` + 
##     `LIVING AREA` + FLOORS + ROOMS + BEDROOMS + `FULL BATH` + 
##     `HALF BATH` + KITCHEN + FIREPLACE + REMODEL, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -263.336  -26.988   -0.737   25.705  229.854 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.419e+01  3.696e+01  -0.925   0.3549    
## `LOT SQFT`     8.370e-03  2.908e-04  28.786   <2e-16 ***
## `YR BUILT`     4.406e-02  1.836e-02   2.400   0.0164 *  
## `GROSS AREA`   3.176e-02  1.923e-03  16.513   <2e-16 ***
## `LIVING AREA`  5.252e-02  3.556e-03  14.771   <2e-16 ***
## FLOORS         3.996e+01  2.014e+00  19.844   <2e-16 ***
## ROOMS         -1.620e-01  7.819e-01  -0.207   0.8358    
## BEDROOMS      -7.862e-01  1.191e+00  -0.660   0.5092    
## `FULL BATH`    2.154e+01  1.604e+00  13.432   <2e-16 ***
## `HALF BATH`    1.981e+01  1.473e+00  13.453   <2e-16 ***
## KITCHEN       -1.266e+01  5.661e+00  -2.237   0.0253 *  
## FIREPLACE      1.886e+01  1.287e+00  14.659   <2e-16 ***
## REMODEL        1.139e+01  9.610e-01  11.855   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.45 on 4048 degrees of freedom
## Multiple R-squared:  0.8126, Adjusted R-squared:  0.812 
## F-statistic:  1462 on 12 and 4048 DF,  p-value: < 2.2e-16
predtotvallt <- predict(totvallt, test)

actuals_predst <- data.frame(cbind(actuals=test$`TOTAL VALUE`, predicteds=predtotvallt))
cor(actuals_predst)
##              actuals predicteds
## actuals    1.0000000  0.9017715
## predicteds 0.9017715  1.0000000

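This linear regression model has an adjusted R^2 of 0.812 on the training data. On the test data, the correlation between actual and predicted total values is 0.9018. This is a correlation rather than an accuracy rate; squared, it corresponds to roughly 81% of the variance in test-set total value, which is in line with the training R^2 and suggests the model generalizes well to unseen properties.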
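
Correlation measures how closely predictions track the actual values, but not how large the errors are in the units of the target. A minimal sketch of two error-based measures, RMSE and MAE, is shown below on made-up numbers. The vectors `actual` and `predicted` are hypothetical; in the analysis above they would correspond to test$`TOTAL VALUE` and predtotvallt.

```r
# Hypothetical actual and predicted values, in the same units as the target
actual    <- c(300, 350, 400, 450, 500)
predicted <- c(310, 340, 420, 430, 500)

err <- actual - predicted

rmse <- sqrt(mean(err^2))     # root mean squared error, penalizes large misses
mae  <- mean(abs(err))        # mean absolute error, the typical miss size
r    <- cor(actual, predicted)

print(c(rmse = rmse, mae = mae, r = r))
```

Reporting an error measure alongside the correlation makes it easier to judge whether the model is precise enough for a given valuation task.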