Introduction

Detroit, Michigan has experienced assessment inaccuracy for years, and over-assessment has hit the lowest-valued homes the hardest. Inequitable assessment has led to regressive taxation, meaning the government collects a larger share of taxes from low-valued households than from high-valued households in Michigan. This is because the lowest-valued households have higher sales ratios than high-valued households: for low-valued homes, the assessed value is significantly higher relative to the actual sale price.

This report therefore examines property assessment in Detroit, Michigan, using a machine learning classification algorithm to classify a property as either over-assessed or not over-assessed in 2016. It will also predict 2019 home prices.

A property will be classified as over-assessed if its sales ratio is above the median sales ratio of Detroit in 2016, which is 0.51.
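As a rough sketch of this labeling rule (the object names and exact wrangling are assumptions, not the original code; the column names follow the fields that appear later in this report):

```r
library(dplyr)

# Hypothetical sketch: label 2016 sales as over-assessed when the
# sales ratio exceeds the 2016 median (about 0.51 in this report).
detroit_2016 <- sales %>%
  filter(SALE_YEAR == 2016) %>%
  mutate(
    RATIO = ASSESSED_VALUE / SALE_PRICE,
    over_assessed = factor(if_else(RATIO > median(RATIO, na.rm = TRUE),
                                   "yes", "no"))
  )
```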

## 
## ================================================
##                         Dependent variable:     
##                     ----------------------------
##                         log(ASSESSED_VALUE)     
## ------------------------------------------------
## log(SALE_PRICE)               0.313***          
##                               (0.003)           
##                                                 
## Constant                      6.295***          
##                               (0.033)           
##                                                 
## ------------------------------------------------
## Observations                   37,553           
## R2                             0.201            
## Adjusted R2                    0.201            
## Residual Std. Error      0.439 (df = 37551)     
## F Statistic         9,449.477*** (df = 1; 37551)
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

This is a simple regression of the log of assessed value (AV) on the log of sale price. Under proportional assessment, the coefficient on sale price would be 1. However, it is less than 1, which indicates regressivity: assessed value rises less than proportionally with sale price. The interpretation is that a 1% increase in sale price is associated with only a 0.31% increase in assessed value. This is statistically significant at the .01 level.
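A minimal sketch of how a table like the one above could be produced (the data frame and model names are assumptions):

```r
library(stargazer)

# Log-log regression of assessed value on sale price; a slope below 1
# indicates regressive assessment.
av_model <- lm(log(ASSESSED_VALUE) ~ log(SALE_PRICE), data = sales)
stargazer(av_model, type = "text")
```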

## 
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                          Foreclosures        
## ---------------------------------------------
## RATIO                      1.009***          
##                             (0.072)          
##                                              
## Constant                   -5.000***         
##                             (0.075)          
##                                              
## ---------------------------------------------
## Observations                37,553           
## Log Likelihood            -2,583.860         
## Akaike Inf. Crit.          5,171.720         
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01

The second regression estimates the effect of the sales ratio on foreclosure (binary: 0 = no, 1 = yes). The results indicate that a one-unit increase in the sales ratio (ASSESSED_VALUE / SALE_PRICE) is associated with a 1.009 increase in the log odds of foreclosure; equivalently, the odds of foreclosure multiply by exp(1.009) ≈ 2.7.
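A sketch of the corresponding logistic specification (variable and object names assumed):

```r
library(stargazer)

# Logistic regression of foreclosure (0/1) on the sales ratio; the
# RATIO coefficient is a log-odds effect.
foreclosure_model <- glm(Foreclosures ~ RATIO, family = binomial(),
                         data = sales)
stargazer(foreclosure_model, type = "text")
```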

This table shows that total arm's-length sales have increased since 2011 (with the exception of 2020, due to the pandemic). However, median assessed value declined steadily from 2012 to 2018 and rose slightly in 2019 and 2020. Median sale price has been rising steadily since 2015.

This graph shows that median sale price has gone up since 2015.

However, median assessed value has decreased, though the trend has ticked up slightly since 2017. The key insight here is that assessment does not mirror sale price.


Foreclosures have also fluctuated, with a drop in 2020 because they were halted due to the pandemic.


The Coefficient of Dispersion (COD) is the average absolute percentage difference from the median sales ratio. For example, a COD of 15 means that properties have sales ratios that deviate by around 15% from the median ratio. An acceptable range is below 15, as seen in the shaded region, but as the graph shows, the COD is well above 15; in 2020, it was 40. While it is decreasing, there is still a long way to go. In 2019, the COD was 43.12, and we will see whether our model beats this number.
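A minimal sketch of the COD calculation just described:

```r
# Coefficient of Dispersion: average absolute deviation from the
# median sales ratio, expressed as a percentage of that median.
cod <- function(ratio) {
  med <- median(ratio, na.rm = TRUE)
  100 * mean(abs(ratio - med), na.rm = TRUE) / med
}

cod(detroit_2016$RATIO)  # applied to a year's sales ratios (assumed data)
```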

There are also outliers in sale prices. The histogram shows that prices are skewed to the right, and the boxplot below also displays the outliers.

Model Classification (2016)

Results

  1. The area under the ROC curve is 0.55.

  2. The accuracy estimate is 0.52

  3. Specificity estimate is 0.64

  4. Sensitivity estimate is 0.42

  5. Looking at the heatmap:

210 properties are true "not over-assessed" (the basis of the sensitivity estimate).

305 properties are true "over-assessed" (the basis of the specificity estimate).

287 are false over-assessments, meaning the property was classified as over-assessed even though it was not.


171 are false "not over-assessed" predictions, meaning the property was classified as not over-assessed even though it was over-assessed.

Applying the model to the full dataset gives the following predicted class counts:

| .pred_class | n      |
|-------------|-------:|
| no          | 72596  |
| yes         | 115366 |
| NA          | 59563  |
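A sketch of how these counts and the confusion-matrix metrics above could be computed with yardstick (object names are assumptions):

```r
library(tidymodels)

# `test_preds` is an assumed tibble holding the truth (`over_assessed`)
# and the predicted class (`.pred_class`) for the held-out test set.
conf_mat(test_preds, truth = over_assessed, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

accuracy(test_preds, truth = over_assessed, estimate = .pred_class)
sens(test_preds, truth = over_assessed, estimate = .pred_class)
spec(test_preds, truth = over_assessed, estimate = .pred_class)

# Class counts over the full set of out-of-sample predictions:
count(full_preds, .pred_class)
```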

Map showing percent over-assessed by census tract


Census tracts with high levels of over-assessment have a large non-white population, while tracts with low levels of over-assessment have a small non-white population.

There is a positive correlation between the predicted probability of over-assessment ("yes") and the percent non-white, and a negative correlation between median income and the predicted probability of over-assessment.

I will now explore hyperparameter tuning for my classification model.
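A sketch of the tuning setup that could produce the grid below (the recipe, resamples, and data names are assumptions):

```r
library(tidymodels)

# SVM with a radial basis function kernel, tuning cost and rbf_sigma.
svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

svm_wf <- workflow() %>%
  add_recipe(class_recipe) %>%
  add_model(svm_spec)

svm_tuned <- tune_grid(
  svm_wf,
  resamples = vfold_cv(train_2016, v = 3),               # n = 3 below
  grid = grid_regular(cost(), rbf_sigma(), levels = 2),  # 4 candidates (exact ranges assumed)
  metrics = metric_set(roc_auc)
)
show_best(svm_tuned, metric = "roc_auc")
```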

| cost     | rbf_sigma | .metric | .estimator | mean      | n | std_err   | .config              | .iter |
|---------:|----------:|---------|------------|----------:|--:|----------:|----------------------|------:|
| 0.015625 | 0.000001  | roc_auc | binary     | 0.5262552 | 3 | 0.0057224 | Preprocessor1_Model1 | 0     |
| 0.015625 | 0.000100  | roc_auc | binary     | 0.5053051 | 3 | 0.0163176 | Preprocessor1_Model3 | 0     |
| 2.000000 | 0.000001  | roc_auc | binary     | 0.5033823 | 3 | 0.0192794 | Preprocessor1_Model2 | 0     |
| 2.000000 | 0.000100  | roc_auc | binary     | 0.4791954 | 3 | 0.0080459 | Preprocessor1_Model4 | 0     |

After tuning the hyperparameters for my classification model, the best model has an area under the Receiver Operating Characteristic (ROC) curve of 0.51.

Model Assessment (2019)

Now I will predict home prices for homes that did not sell in 2019.

The most important factor in our decision tree model for assessing properties is whether the property is located in ward #2. The next most important factors are total floor area, total square footage, median income, and whether the property has had interior improvements.
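A sketch of extracting this variable importance from a fitted tidymodels workflow (the workflow name is an assumption):

```r
library(tidymodels)
library(vip)

# Pull the parsnip fit out of the fitted workflow and plot the ten
# most important predictors.
tree_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
```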

New Assessment Models

| model         | .config              | rmse     | rank |
|---------------|----------------------|---------:|-----:|
| boost_tree    | Preprocessor1_Model1 | 31257.55 |    1 |
| boost_tree    | Preprocessor1_Model2 | 31554.64 |    2 |
| boost_tree    | Preprocessor1_Model3 | 33331.62 |    3 |
| boost_tree    | Preprocessor1_Model5 | 33667.99 |    4 |
| decision_tree | Preprocessor1_Model3 | 34894.63 |    5 |
| decision_tree | Preprocessor1_Model5 | 34972.83 |    6 |
| decision_tree | Preprocessor1_Model1 | 35622.52 |    7 |
| decision_tree | Preprocessor1_Model4 | 35660.75 |    8 |
| boost_tree    | Preprocessor1_Model4 | 35775.75 |    9 |
| decision_tree | Preprocessor1_Model2 | 35784.92 |   10 |
| linear_reg    | Preprocessor1_Model1 | 38026.50 |   11 |
| linear_reg    | Preprocessor1_Model3 | 38029.80 |   12 |
| linear_reg    | Preprocessor1_Model2 | 38029.83 |   13 |
| linear_reg    | Preprocessor1_Model4 | 38030.62 |   14 |
| linear_reg    | Preprocessor1_Model5 | 38031.70 |   15 |

The best model from the grid results is the boosted tree, with a Root Mean Square Error (RMSE) of about 31K for its best configuration.

| model         | .config              | rmse     | rank |
|---------------|----------------------|---------:|-----:|
| boost_tree    | Preprocessor1_Model1 | 31257.55 |    1 |
| boost_tree    | Preprocessor1_Model2 | 31554.64 |    2 |
| boost_tree    | Preprocessor1_Model3 | 33331.62 |    3 |
| boost_tree    | Preprocessor1_Model5 | 33667.99 |    4 |
| decision_tree | Preprocessor1_Model3 | 34894.63 |    5 |
| decision_tree | Preprocessor1_Model5 | 34972.83 |    6 |
| decision_tree | Preprocessor1_Model1 | 35622.52 |    7 |
| decision_tree | Preprocessor1_Model4 | 35660.75 |    8 |
| boost_tree    | Preprocessor1_Model4 | 35775.75 |    9 |
| decision_tree | Preprocessor1_Model2 | 35784.92 |   10 |
| linear_reg    | Preprocessor1_Model1 | 38026.50 |   11 |
| linear_reg    | Preprocessor1_Model3 | 38029.80 |   12 |
| linear_reg    | Preprocessor1_Model2 | 38029.83 |   13 |
| linear_reg    | Preprocessor1_Model4 | 38030.62 |   14 |
| linear_reg    | Preprocessor1_Model5 | 38031.70 |   15 |

We get the same ranking from the race results; the boosted tree is our preferred model.
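A sketch of how the grid and race comparisons could be run over a workflow set (object names are assumptions):

```r
library(tidymodels)
library(finetune)

# `wf_set` is an assumed workflowsets object holding the boosted tree,
# decision tree, and linear regression workflows.
race_results <- wf_set %>%
  workflow_map(
    "tune_race_anova",   # racing drops poor candidates early
    resamples = folds,
    grid = 5,            # five candidates per model, as in the tables
    metrics = metric_set(rmse)
  )

rank_results(race_results, rank_metric = "rmse")
```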

## ! train/test split: preprocessor 1/1: There are new levels in a factor: NA, the standard dev...
## ! train/test split: preprocessor 1/1, model 1/1 (predictions): There are new levels in a fac...

There seems to be a lot of clustering for homes under ~100K. My model does not do a very good job of predicting home prices.


After adjusting the hyperparameters for the boosted tree model based on the best_model results, I conducted a final out-of-sample prediction for all houses unsold in the dataset from 2011-2019. The Coefficient of Dispersion in 2019 is 19.18, which is almost within the targeted range of 5-15.
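A sketch of that finalization and prediction step (workflow ids and object names are assumptions):

```r
library(tidymodels)

# Finalize the boosted tree with its best hyperparameters, refit on
# the homes that sold, and predict prices for homes that did not sell.
best_params <- race_results %>%
  extract_workflow_set_result("boost_tree") %>%
  select_best(metric = "rmse")

final_wf <- race_results %>%
  extract_workflow("boost_tree") %>%
  finalize_workflow(best_params)

final_fit <- fit(final_wf, data = sold_homes)
unsold_preds <- predict(final_fit, new_data = unsold_homes)
```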

Conclusion

In conclusion, my models do a fair job of predicting over-assessment and home prices based on the variables in our recipe. It is important to keep in mind that assessment practices in Detroit, Michigan currently hit low-value homes in communities of color the hardest. The Coefficient of Dispersion for homes in Detroit that sold between 2011-2019 is slightly above 40; in 2019 specifically, the value was 43.11. The target range is 5-15, so there is a long way to go.

For my initial classification model in 2016, I created a recipe that predicts over-assessment for homes that sold in 2016. The recipe included total square footage, whether the house has had interior improvements, whether it was foreclosed, whether it has a garage, whether it has a basement, the total number of tickets, median income, percent non-white in the census tract, and ward. I dealt with missing values within the recipe by imputing the mean. My model was a support vector machine with a radial basis function (RBF) kernel. The initial results gave an area under the ROC curve of 0.55 and an accuracy of 0.52. About 115K homes were classified as over-assessed, 73K were classified as not over-assessed, and 59K had NA values. Looking at the map, areas with a high percentage of over-assessment are located in the northwest region of Detroit, and areas with a low percentage are located in the southeast region. The second map suggests that these highly over-assessed areas also have a high percentage of non-white population, while areas with a small non-white population tend to show lower over-assessment percentages. After the hyperparameter exploration, the area under the ROC curve dropped to 0.51; I suspect this happened because I changed my recipe.

For my model assessment in 2019, I used the same variables and recipe from the classification model, except the dependent variable was sale price. I trained on houses that sold and then performed an out-of-sample prediction for homes that did not sell in 2019. My model for this assessment was a decision tree with a tree depth of 5; the initial Root Mean Square Error (RMSE) was 40,829. After tuning and comparing models using grid_results() and race_results() for the boosted tree, linear regression, and decision tree, the best model was a boosted tree with an RMSE of 30,530. Finally, after switching to the boosted tree, setting its hyperparameters based on the values from best_results, and performing a final prediction for homes that did not sell, I obtained a COD of 19.365 in 2019.

As for whether we should use this model, I would recommend against it. The COD is fairly low in 2019, but the RMSE is roughly 30K, and there were missing values within the model itself. Also, my classification model had an area under the ROC curve of about 0.5 and an accuracy of 0.52, which is below the commonly accepted threshold of 0.7. So while our model does reduce the COD, I would not recommend using it for these reasons.