David Segovia
## [1] "Filtered out non-arm's length transactions"
## [1] "Inflation adjusted to 2020"
Feature engineering. You have now created two base models and evaluation metrics from last week. Investigate creating at least two new predictors and analyze whether they improve your model. Some possibilities:
* Neighborhood foreclosures/blight tickets
* Census variables (such as income and race); see the tidycensus sketch below
## Getting data from the 2015-2019 5-year ACS
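A minimal sketch of how the census predictors listed above might be pulled at the zip-code (ZCTA) level with tidycensus. The ACS table codes, the `output = "wide"` column names, and the `zip_code` join column are assumptions for illustration, not the exact variables used in this report.

```r
library(tidycensus)
library(dplyr)

# Assumed ACS table codes: B19013_001 (median household income),
# B02001_001 (total population), B02001_002 (white alone).
acs_zip <- get_acs(
  geography = "zcta",
  variables = c(medianincome = "B19013_001",
                total_pop    = "B02001_001",
                white_pop    = "B02001_002"),
  year      = 2019,
  survey    = "acs5",
  output    = "wide"
)

census_predictors <- acs_zip %>%
  transmute(
    zip_code         = GEOID,
    medianincomeE,
    percent_white    = white_popE / total_popE,
    percent_nonwhite = 1 - percent_white
  )

# Join onto the sales data by zip code before fitting:
# sales <- sales %>% left_join(census_predictors, by = "zip_code")
```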
Results
The area under the ROC curve is 0.58.
The accuracy estimate is 0.56.
The specificity estimate is 0.62.
The sensitivity estimate is 0.50.
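For reference, a minimal yardstick sketch of how these metrics and the confusion-matrix heatmap discussed below might be computed. The prediction tibble `preds` and its columns (`over_assessed`, `.pred_class`, `.pred_yes`) are assumed names, not objects from this report.

```r
library(tidymodels)

# Assumed: `preds` holds hold-out predictions with the factor truth
# `over_assessed` (levels "no"/"yes"), the hard class `.pred_class`,
# and the predicted probability of "yes" in `.pred_yes`.
class_metrics <- metric_set(accuracy, sens, spec)
class_metrics(preds, truth = over_assessed, estimate = .pred_class)

# ROC AUC uses the class probability; event_level depends on which
# factor level is treated as the "positive" class.
roc_auc(preds, truth = over_assessed, .pred_yes, event_level = "second")

# Confusion-matrix heatmap
conf_mat(preds, truth = over_assessed, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
```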
Looking at the heatmap, 242 properties are true "not over-assessed" classifications (sensitivity) and 290 are true "over-assessed" classifications (specificity). 233 are false over-assessments, meaning these properties were classified as over-assessed even though they were not.
## # A tibble: 248 × 40
## parcel_num sale_date SALE_PRICE grantor grantee sale_terms ecf property_c
## <chr> <date> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 22023850. 2016-01-14 22500 GE INV… NBC PR… VALID ARM… 0043C 401
## 2 22058297. 2016-01-28 48395 ASM HO… CHIANG… VALID ARM… 7185A 401
## 3 21025917. 2016-01-26 65000 JOHNSO… MARSH,… VALID ARM… 2079A 401
## 4 17016636. 2016-01-21 25000 SMITH,… LACEY,… VALID ARM… 2073A 401
## 5 22091382. 2016-01-15 43500 PREFER… KOTOMA… VALID ARM… 7180A 401
## 6 22034594. 2016-01-13 29900 RBF TR… JH INV… VALID ARM… 9023A 401
## 7 21077744. 2016-01-22 27000 WARRIA… SMITH,… VALID ARM… 3098A 401
## 8 22119512.0… 2016-02-04 38000 MATHEW… SGV PR… VALID ARM… 7175A 401
## 9 13011763. 2016-02-10 12150 HSD 2 … BROWN,… VALID ARM… 1070A 401
## 10 21026571. 2016-02-09 36500 TESH 1… SARGHI… VALID ARM… 2075A 401
## # … with 238 more rows, and 32 more variables: SALE_YEAR <dbl>,
## # DESCRIPTION <chr>, TYPE <chr>, CATEGORY <chr>, ward <chr>, zip_code <chr>,
## # total_square_footage <dbl>, total_acreage <dbl>, frontage <dbl>,
## # depth <dbl>, homestead_pre <dbl>, is_improved <int>, num_bldgs <int>,
## # total_floor_area <int>, year_built <int>, prop_addr <chr>,
## # Foreclosures <fct>, ASSESSED_VALUE <dbl>, TAXABLEVALUE <dbl>,
## # TAX_YEAR <dbl>, RATIO <dbl>, arms_length_transaction <dbl>, …
192 are false "not over-assessed" classifications, meaning these properties were classified as not over-assessed even though they were over-assessed.
## # A tibble: 210 × 40
## parcel_num sale_date SALE_PRICE grantor grantee sale_terms ecf property_c
## <chr> <date> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 16010497-8 2016-01-11 17000 BRIONI,… CUEVAS… VALID ARM… 5154A 401
## 2 12009665. 2016-01-28 20000 FLORES,… MORALE… VALID ARM… 5169A 401
## 3 09005656. 2016-02-08 7500 DAVIS, … SYED, … VALID ARM… 1062A 401
## 4 22044570. 2016-01-08 15500 SIMMONS… HARRIS… VALID ARM… 7190A 401
## 5 22059630. 2016-01-20 12000 PARK ST… BROWN,… VALID ARM… 7185A 401
## 6 22036271. 2016-01-12 7500 EDWARDS… YOUNG,… VALID ARM… 9023A 401
## 7 22086706. 2016-01-20 8500 COTE, D… CAMACH… VALID ARM… 7183A 401
## 8 18006043. 2016-01-09 6900 G. P. A… MCNEIL… VALID ARM… 7187A 401
## 9 21073808. 2016-02-19 45000 CAMINOW… YOSEPH… VALID ARM… 3097A 401
## 10 16027918. 2016-02-09 20000 ATOMIC … HOLLAN… VALID ARM… 0040A 401
## # … with 200 more rows, and 32 more variables: SALE_YEAR <dbl>,
## # DESCRIPTION <chr>, TYPE <chr>, CATEGORY <chr>, ward <chr>, zip_code <chr>,
## # total_square_footage <dbl>, total_acreage <dbl>, frontage <dbl>,
## # depth <dbl>, homestead_pre <dbl>, is_improved <int>, num_bldgs <int>,
## # total_floor_area <int>, year_built <int>, prop_addr <chr>,
## # Foreclosures <fct>, ASSESSED_VALUE <dbl>, TAXABLEVALUE <dbl>,
## # TAX_YEAR <dbl>, RATIO <dbl>, arms_length_transaction <dbl>, …
Results
The mean absolute percentage error (MAPE) is the average absolute percentage difference between the predicted and actual values. For this model it is 14.41%.
The root mean squared error (RMSE) measures how far the predicted values fall from the actual values, in the units of the outcome. In our model, the RMSE is 3860.
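A minimal sketch of how MAPE and RMSE might be computed with yardstick. The prediction frame `reg_preds` and its column names are assumptions for illustration.

```r
library(yardstick)

# Assumed: `reg_preds` holds the regression model's hold-out predictions,
# with the observed assessed value in `ASSESSED_VALUE` and the model's
# prediction in `.pred`.
reg_metrics <- metric_set(mape, rmse)
reg_metrics(reg_preds, truth = ASSESSED_VALUE, estimate = .pred)
```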
Prediction. Create out-of-sample predictions for both models; that is, predict over-assessment and assessment/valuation for homes that did not sell, for each model (2016 for Model B, 2019 for Model C).
## Joining, by = c("TAXABLEVALUE", "ward", "zip_code", "total_square_footage",
## "total_acreage", "frontage", "depth", "homestead_pre", "is_improved",
## "num_bldgs", "total_floor_area", "year_built", "prop_addr", "Foreclosures",
## "medianincomeE", "percent_white", "percent_nonwhite", "TotalnumberOfTickets",
## "SALE_PRICE")
##    no   yes  NA's
## 24332 28059 27753
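A sketch of how the out-of-sample scoring above might look, assuming fitted workflows `fit_b` (the over-assessment classifier) and `fit_c` (the valuation model) and data frames of unsold homes `unsold_2016` / `unsold_2019`; all of these object names are assumptions.

```r
library(tidymodels)

# Model B: predicted class and probability of over-assessment for
# 2016 homes that did not sell.
preds_b <- unsold_2016 %>%
  bind_cols(predict(fit_b, new_data = unsold_2016)) %>%
  bind_cols(predict(fit_b, new_data = unsold_2016, type = "prob"))

# Model C: predicted assessed value for 2019 homes that did not sell.
preds_c <- unsold_2019 %>%
  bind_cols(predict(fit_c, new_data = unsold_2019))
```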
Filter out assessments for 2019 and find the total number of homes in 2019.
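A minimal dplyr sketch of this step, assuming an `assessments` table with a `year` column (both names are assumptions):

```r
library(dplyr)

# Keep only the 2019 assessment roll and count homes.
homes_2019 <- assessments %>%
  filter(year == 2019)

n_distinct(homes_2019$parcel_num)  # total number of homes in 2019
```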
Leaflet map (interactive output; not rendered here)
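A minimal sketch of how such a map might be drawn with leaflet; the sf layer `parcels_sf` and its `.pred_yes` column are assumed names.

```r
library(leaflet)
library(sf)

# Assumed: `parcels_sf` is an sf point layer of parcels carrying the
# predicted probability of over-assessment in `.pred_yes`.
pal <- colorNumeric("viridis", domain = parcels_sf$.pred_yes)

leaflet(parcels_sf) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(radius = 3, stroke = FALSE, fillOpacity = 0.6,
                   fillColor = ~pal(.pred_yes)) %>%
  addLegend(pal = pal, values = ~.pred_yes, title = "P(over-assessed)")
```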
There is a very strong positive correlation between the probability of being over-assessed and the percentage of nonwhite residents in the parcel's zip code. There is also a moderate negative correlation between the probability of over-assessment and median income.
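A sketch of how these correlations might be checked with corrplot; the data frame `pred_df` and its columns are assumed names.

```r
library(dplyr)
library(corrplot)

# Assumed: `pred_df` has, for each parcel, the predicted probability of
# over-assessment (.pred_yes), percent_nonwhite, and medianincomeE.
cor_mat <- pred_df %>%
  select(.pred_yes, percent_nonwhite, medianincomeE) %>%
  cor(use = "complete.obs")

corrplot(cor_mat, method = "number")
```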
The most important factor for assessing properties is whether the property is located in ward 2. If it is, and the house has been improved, then the assessment is 692. If it has not been improved, the next most important factor is total square footage, followed by total floor area.
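A sketch of how the splits and variable importance described above might be inspected, assuming (this is an assumption) that the valuation model is a workflow fitted with an rpart decision tree.

```r
library(workflows)
library(rpart.plot)

# Assumed: `fit_c` is a fitted workflow whose underlying engine is rpart.
tree_fit <- extract_fit_engine(fit_c)

# Plot the tree: the first split (e.g. ward) and later splits on
# improvement status, square footage, and floor area appear here.
rpart.plot(tree_fit, roundint = FALSE)

# rpart's built-in variable-importance scores
tree_fit$variable.importance
```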