David Segovia

Part 3

## [1] "Filtered out non-arm's length transactions"
## [1] "Inflation adjusted to 2020"

Part B

Feature engineering. You have now created two base models and evaluation metrics from last week. Investigate creating at least two new predictors and analyze if they improve your model. Some possibilities:

Neighborhood foreclosures/blight tickets *

Census variables (such as income, race) *

## Getting data from the 2015-2019 5-year ACS

Model B: overassessment

Results

  1. The area under the ROC curve is 0.58

  2. The accuracy estimate is 0.56

  3. Specificity estimate is 0.62

  4. Sensitivity estimate is 0.50

  5. Looking at the heatmap (confusion matrix):

242 properties are true not over-assessed (the count behind sensitivity), and 290 are true over-assessed (the count behind specificity).

233 are false over-assessments: the property was classified as over-assessed even though it was not over-assessed.

## # A tibble: 248 × 40
##    parcel_num  sale_date  SALE_PRICE grantor grantee sale_terms ecf   property_c
##    <chr>       <date>          <dbl> <chr>   <chr>   <chr>      <chr>      <dbl>
##  1 22023850.   2016-01-14      22500 GE INV… NBC PR… VALID ARM… 0043C        401
##  2 22058297.   2016-01-28      48395 ASM HO… CHIANG… VALID ARM… 7185A        401
##  3 21025917.   2016-01-26      65000 JOHNSO… MARSH,… VALID ARM… 2079A        401
##  4 17016636.   2016-01-21      25000 SMITH,… LACEY,… VALID ARM… 2073A        401
##  5 22091382.   2016-01-15      43500 PREFER… KOTOMA… VALID ARM… 7180A        401
##  6 22034594.   2016-01-13      29900 RBF TR… JH INV… VALID ARM… 9023A        401
##  7 21077744.   2016-01-22      27000 WARRIA… SMITH,… VALID ARM… 3098A        401
##  8 22119512.0… 2016-02-04      38000 MATHEW… SGV PR… VALID ARM… 7175A        401
##  9 13011763.   2016-02-10      12150 HSD 2 … BROWN,… VALID ARM… 1070A        401
## 10 21026571.   2016-02-09      36500 TESH 1… SARGHI… VALID ARM… 2075A        401
## # … with 238 more rows, and 32 more variables: SALE_YEAR <dbl>,
## #   DESCRIPTION <chr>, TYPE <chr>, CATEGORY <chr>, ward <chr>, zip_code <chr>,
## #   total_square_footage <dbl>, total_acreage <dbl>, frontage <dbl>,
## #   depth <dbl>, homestead_pre <dbl>, is_improved <int>, num_bldgs <int>,
## #   total_floor_area <int>, year_built <int>, prop_addr <chr>,
## #   Foreclosures <fct>, ASSESSED_VALUE <dbl>, TAXABLEVALUE <dbl>,
## #   TAX_YEAR <dbl>, RATIO <dbl>, arms_length_transaction <dbl>, …

192 are false not over-assessments: the property was classified as not over-assessed even though it was over-assessed.

## # A tibble: 210 × 40
##    parcel_num sale_date  SALE_PRICE grantor  grantee sale_terms ecf   property_c
##    <chr>      <date>          <dbl> <chr>    <chr>   <chr>      <chr>      <dbl>
##  1 16010497-8 2016-01-11      17000 BRIONI,… CUEVAS… VALID ARM… 5154A        401
##  2 12009665.  2016-01-28      20000 FLORES,… MORALE… VALID ARM… 5169A        401
##  3 09005656.  2016-02-08       7500 DAVIS, … SYED, … VALID ARM… 1062A        401
##  4 22044570.  2016-01-08      15500 SIMMONS… HARRIS… VALID ARM… 7190A        401
##  5 22059630.  2016-01-20      12000 PARK ST… BROWN,… VALID ARM… 7185A        401
##  6 22036271.  2016-01-12       7500 EDWARDS… YOUNG,… VALID ARM… 9023A        401
##  7 22086706.  2016-01-20       8500 COTE, D… CAMACH… VALID ARM… 7183A        401
##  8 18006043.  2016-01-09       6900 G. P. A… MCNEIL… VALID ARM… 7187A        401
##  9 21073808.  2016-02-19      45000 CAMINOW… YOSEPH… VALID ARM… 3097A        401
## 10 16027918.  2016-02-09      20000 ATOMIC … HOLLAN… VALID ARM… 0040A        401
## # … with 200 more rows, and 32 more variables: SALE_YEAR <dbl>,
## #   DESCRIPTION <chr>, TYPE <chr>, CATEGORY <chr>, ward <chr>, zip_code <chr>,
## #   total_square_footage <dbl>, total_acreage <dbl>, frontage <dbl>,
## #   depth <dbl>, homestead_pre <dbl>, is_improved <int>, num_bldgs <int>,
## #   total_floor_area <int>, year_built <int>, prop_addr <chr>,
## #   Foreclosures <fct>, ASSESSED_VALUE <dbl>, TAXABLEVALUE <dbl>,
## #   TAX_YEAR <dbl>, RATIO <dbl>, arms_length_transaction <dbl>, …
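As a rough check, accuracy, sensitivity, and specificity can be recomputed by hand from the four heatmap counts quoted in the text. This is a minimal sketch in base R, not the code that produced the report. It assumes "no" (not over-assessed) is the positive class, which is how the text pairs 242 with sensitivity. The variable names are illustrative. Values computed this way need not match the reported estimates exactly if those came from a different resample.

```r
# Counts quoted in the text; "no" (not over-assessed) is assumed to be the
# positive class, matching how the text pairs 242 with sensitivity.
true_no   <- 242  # not over-assessed, classified as not over-assessed
false_yes <- 233  # not over-assessed, classified as over-assessed
false_no  <- 192  # over-assessed, classified as not over-assessed
true_yes  <- 290  # over-assessed, classified as over-assessed

total <- true_no + false_yes + false_no + true_yes

# Share of all properties classified correctly
accuracy <- (true_no + true_yes) / total

# Share of truly not over-assessed properties classified as such
sensitivity <- true_no / (true_no + false_yes)

# Share of truly over-assessed properties classified as such
specificity <- true_yes / (true_yes + false_no)

metrics <- round(c(accuracy = accuracy,
                   sensitivity = sensitivity,
                   specificity = specificity), 2)
print(metrics)
```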

Model C: assessment

Results

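  1. The mean absolute percentage error (MAPE) is the average absolute difference between the predicted value and the actual value, expressed as a percentage of the actual value. For this model, the MAPE is 14.41%.

  2. The root mean square error (RMSE) is the square root of the average squared difference between predicted and actual values, so it measures how far predictions typically fall from actual values, in the same units as the outcome. For this model, the RMSE is 3860.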
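The two definitions above can be written out directly. The sketch below uses base R on a small made-up vector of actual and predicted values. The numbers are hypothetical and are not taken from Model C.

```r
# Hypothetical actual and predicted values, for illustration only
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

# MAPE: mean of |error| / |actual|, expressed as a percentage
mape <- mean(abs((actual - predicted) / actual)) * 100

# RMSE: square root of the mean squared error, in the units of the outcome
rmse <- sqrt(mean((actual - predicted)^2))

print(c(mape = mape, rmse = rmse))
```

Because the MAPE divides by the actual value, it is scale-free, while the RMSE weights large errors more heavily and stays in the outcome's units.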

Part C (20%)

Prediction. Create out of sample predictions for both models. By this, predict over-assessment and assessment/valuation for homes which did not sell for each model (2016 for B, 2019 for C).

Out-of-sample prediction: Model B, classification

Classification metric for 2016

## Joining, by = c("TAXABLEVALUE", "ward", "zip_code", "total_square_footage",
## "total_acreage", "frontage", "depth", "homestead_pre", "is_improved",
## "num_bldgs", "total_floor_area", "year_built", "prop_addr", "Foreclosures",
## "medianincomeE", "percent_white", "percent_nonwhite", "TotalnumberOfTickets",
## "SALE_PRICE")
##    no   yes  NA's 
## 24332 28059 27753
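The shape of this step can be sketched in base R: fit a classifier on homes that sold, predict for homes that did not, and tabulate the predicted classes. Everything here is an assumption for illustration: the logistic regression (`glm`), the single predictor `taxable_value`, and the data. None of it is the model or data used above. The sketch also shows one common way for `NA` predictions to arise, namely a missing predictor value in the new data. Whether that explains the NA's in the output above is not established here.

```r
# Hypothetical training data: homes that sold, with a known label
sold <- data.frame(
  over_assessed = factor(c("yes", "yes", "yes", "no", "yes", "no", "no", "no"),
                         levels = c("no", "yes")),
  taxable_value = c(15, 20, 25, 30, 40, 45, 55, 60)  # made-up units
)

# Logistic regression; with a factor response, glm models P(second level),
# which is P(over_assessed == "yes") here
fit <- glm(over_assessed ~ taxable_value, data = sold, family = binomial())

# Hypothetical out-of-sample homes (no sale, so no label); one has a
# missing predictor value
unsold <- data.frame(taxable_value = c(18, 58, NA, 35))

# Predicted probability of over-assessment; rows with NA predictors give NA
prob_yes <- predict(fit, newdata = unsold, type = "response")

# Convert to a class at a 0.5 cutoff and count each class, including NA's
pred_class <- factor(ifelse(prob_yes > 0.5, "yes", "no"),
                     levels = c("no", "yes"))
print(summary(pred_class))
```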

Out-of-sample prediction: Model C, predict assessment value

Filter the assessments to 2019 and find the total number of homes in 2019

Part D (40%) Model Explanation. Each model has different tools for explainability and we will discuss this more in class. Undertake this initial work knowing that we will gain more techniques for this later on.

For Model B/overassessment, aggregate your predictions by census tract. Join in a census variable. Create a simple correlation plot and leaflet map of likelihood of over-assessment by tract using the template for class.

Leaflet

There is a very strong positive correlation between the predicted probability of over-assessment and the percent nonwhite in the parcel’s zip code. There is also a moderate negative correlation between the probability of over-assessment and median income.
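The aggregate-join-correlate step can be sketched in base R as follows. The data frames, column names (`tract`, `prob_yes`, `percent_nonwhite`, `median_income`), and values are hypothetical stand-ins, not the data behind the correlations reported above.

```r
# Hypothetical parcel-level predictions with a tract identifier
preds <- data.frame(
  tract    = c("A", "A", "B", "B", "C", "C"),
  prob_yes = c(0.2, 0.4, 0.5, 0.7, 0.8, 0.9)  # predicted P(over-assessed)
)

# Hypothetical tract-level census variables
tract_vars <- data.frame(
  tract            = c("A", "B", "C"),
  percent_nonwhite = c(20, 55, 90),
  median_income    = c(70000, 45000, 30000)
)

# Average predicted probability per tract
by_tract <- aggregate(prob_yes ~ tract, data = preds, FUN = mean)

# Join the census variables onto the tract-level predictions
joined <- merge(by_tract, tract_vars, by = "tract")

# Pearson correlations between mean predicted probability and each variable
cor_nonwhite <- cor(joined$prob_yes, joined$percent_nonwhite)
cor_income   <- cor(joined$prob_yes, joined$median_income)
print(c(percent_nonwhite = cor_nonwhite, median_income = cor_income))
```

Aggregating first means each tract counts once in the correlation, regardless of how many parcels it contains.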

For Model C/assessment, undertake an initial analysis of which factors your model identified as most important for valuation.

The most important factor for assessing properties is whether the property is located in ward 2. If it is, and the house has been improved, then the assessment is 692. If it has not been improved, another important factor is the total square footage, followed by total floor area.