Data Wrangling and EDA


The sf object called portland.demolitions is based on an older version of this dataset available on the Portland open data portal. If you are curious about its attributes, take a look at the metadata there, but we mainly want to plot the points to get a feel for the pattern of residential demolitions over time in the city.

We also now have in memory a dataset with observations from 2014 to 2018 and a second dataset with observations from 2019. In both cases, the set of demolished properties was combined with a random sample of single-family (detached) residential properties that existed in 2014 but were not demolished, which serves as a control. A description of the columns (variables) contained in each of these datasets is given in the image below:



The BLDGTOTVAL feature is of particular interest due to its connection to the rent-gap theory of gentrification posited by Neil Smith in the late 1970s. The idea is that gentrification is more likely to occur in neighborhoods with a high proportion of real estate where the land is valued much more highly than the structure that sits on it; demolishing such housing units and replacing them with the “highest and best use” is a strategy for extracting profit from the city.

Let’s do some quick exploratory data analysis before proceeding.

# Read the Portland basemap so we do not have to retrieve it from the web
load(file = "./data/pdx_basemap.RData")

portland.demolitions.wgs84 <- st_transform(portland.demolitions, 4326)

demos.map <- ggmap(pdx_basemap) +
  geom_point(data = portland.demolitions.wgs84, aes(color = as_factor(DEMOYEAR), 
    x = st_coordinates(portland.demolitions.wgs84)[,1], 
    y = st_coordinates(portland.demolitions.wgs84)[,2])) +
  scale_color_brewer(palette = "Spectral") +
  theme_classic() +
  theme(axis.line = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(), axis.title = element_blank(), 
        legend.title = element_blank())


demos.hist <- ggplot(portland.demolitions.wgs84, aes(x = DEMOYEAR)) +
  geom_histogram(bins = 6, fill = brewer.pal(n = 6, name = "Spectral"), color = "#e9ecef", alpha = 0.9) + 
  scale_x_continuous(breaks = seq(2014, 2020, by = 1)) +
  labs(x = "Year", y = "", caption = "Source: City of Portland Open Data Portal") + 
  # Apply theme_minimal() first; theme() tweaks placed after it would
  # otherwise be overwritten by the complete theme.
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.x = element_text(face = "bold", size = 10),
        panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())



ggarrange(demos.map, demos.hist)
## Warning: Removed 4 rows containing missing values (geom_point).


Now that we have a feel for where demolitions occurred in Portland and the volume by year, we can turn our attention to fitting models.

Fitting Classification Tree Models

In the code chunk below, we use the set.seed function to assign a starting value to the random number generator built into R so that we can replicate the analysis later.

set.seed(227)

# Use rsample::initial_split to partition the 2014-2018 data. 
portland.2014.2018.split <- initial_split(portland.2014.2018, prop = 0.7, strata = DEMOLISHED)
portland.2014.2018.train <- training(portland.2014.2018.split)
portland.2014.2018.test  <- testing(portland.2014.2018.split)

# Check the distribution in original data and partitioned data
# and note that the proportion of demolished and non-demolished
# housing units is comparable in both.
prop.table(table(portland.2014.2018.test$DEMOLISHED)) * 100
## 
##      0      1 
## 72.661 27.339
prop.table(table(portland.2014.2018.train$DEMOLISHED)) * 100
## 
##      0      1 
## 72.687 27.313
# Next, choose the variables we want to use. The test set keeps the
# X_COORD / Y_COORD columns so we can map model performance later; they
# are excluded from the training set so the tree cannot split on them.
portland.2014.2018.test <- subset(portland.2014.2018.test, select = c("DEMOLISHED", "BLDGTOTVAL", "PARCELSIZE", "AGE_2014", "WScore", "TScore", "Pop.Density", "X_COORD", "Y_COORD"))

portland.2014.2018.train <- subset(portland.2014.2018.train, select = c("DEMOLISHED", "BLDGTOTVAL", "PARCELSIZE", "AGE_2014",
                    "WScore", "TScore", "Pop.Density"))


This component of the Lab Assignment introduces classification tree models and applies this tool to the Portland housing unit demolitions dataset. For classification (i.e., categorical dependent variables), the decision tree algorithm predicts the probability that a given observation belongs to a particular class or category. In the code chunk below, we use the rpart package to fit a classification tree model where the outcome is demolition (i.e., was the dwelling unit demolished or not?), but this is just one of many options for fitting decision tree models in R (e.g., caret, ranger, h2o, etc.).

?rpart
#?caret
#?ranger
#?h2o



The caret::confusionMatrix function produces a variety of metrics that help us to assess the performance of the classification model. In the actual confusion matrix at the top of the output, the numbers on the diagonal represent correctly classified observations, while the numbers off the diagonal represent misclassified observations. The accuracy rate is given along with a confidence interval and p-value. The kappa statistic is also included. Note that for kappa, 1 means perfect agreement and 0 is chance agreement. One rule-of-thumb suggests that kappa values of 0.21-0.40 are fair, 0.41-0.60 are moderate, 0.61-0.80 are substantial, and 0.81-1.00 are almost perfect agreement.
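To make kappa concrete, here is a minimal sketch of how accuracy and Cohen's kappa fall out of a 2x2 confusion matrix (the counts below are made up for illustration, not output from our model):

# Hypothetical confusion matrix with illustrative counts...
cm <- matrix(c(50, 10, 5, 35), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
n <- sum(cm)
p.observed <- sum(diag(cm)) / n                     # observed agreement (accuracy)
p.expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
(p.observed - p.expected) / (1 - p.expected)        # Cohen's kappa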


Take a few moments to review the help documentation for the rpart and the rpart.object functions, then proceed with the code chunk below.

?rpart
?rpart.object
?rpart.control
# Because our dependent variable is categorical, we set 
# the method argument to "class" for classification...
model.class <- rpart(
  formula = as_factor(DEMOLISHED) ~ .,
  data    = portland.2014.2018.train,
  method  = "class", 
  minsplit = 4, 
  minbucket = 2
)

# Examine the output in both text and visual format...
model.class
## n= 3837 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 3837 1048 0 (0.7268700 0.2731300)  
##    2) BLDGTOTVAL>=0.4494291 2765  472 0 (0.8292948 0.1707052) *
##    3) BLDGTOTVAL< 0.4494291 1072  496 1 (0.4626866 0.5373134)  
##      6) AGE_2014< 71.5 495  198 0 (0.6000000 0.4000000)  
##       12) BLDGTOTVAL>=0.356339 343  103 0 (0.6997085 0.3002915) *
##       13) BLDGTOTVAL< 0.356339 152   57 1 (0.3750000 0.6250000)  
##         26) WScore< 58.5 68   30 0 (0.5588235 0.4411765)  
##           52) PARCELSIZE< 0.7652287 46   12 0 (0.7391304 0.2608696) *
##           53) PARCELSIZE>=0.7652287 22    4 1 (0.1818182 0.8181818) *
##         27) WScore>=58.5 84   19 1 (0.2261905 0.7738095) *
##      7) AGE_2014>=71.5 577  199 1 (0.3448873 0.6551127) *
rpart.plot(model.class)

# Retrieve more detailed info...
summary(model.class)
## Call:
## rpart(formula = as_factor(DEMOLISHED) ~ ., data = portland.2014.2018.train, 
##     method = "class", minsplit = 4, minbucket = 2)
##   n= 3837 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.08540076      0 1.0000000 1.0000000 0.02633587
## 2 0.03625954      2 0.8291985 0.8730916 0.02518799
## 3 0.01049618      3 0.7929389 0.8282443 0.02472907
## 4 0.01000000      5 0.7719466 0.8301527 0.02474920
## 
## Variable importance
##  BLDGTOTVAL    AGE_2014  PARCELSIZE      WScore      TScore Pop.Density 
##          67          11           7           6           5           4 
## 
## Node number 1: 3837 observations,    complexity param=0.08540076
##   predicted class=0  expected loss=0.27313  P(node) =1
##     class counts:  2789  1048
##    probabilities: 0.727 0.273 
##   left son=2 (2765 obs) right son=3 (1072 obs)
##   Primary splits:
##       BLDGTOTVAL < 0.4494291 to the right, improve=207.65020, (0 missing)
##       AGE_2014   < 62.5      to the left,  improve= 57.30428, (0 missing)
##       WScore     < 83.5      to the left,  improve= 43.99917, (0 missing)
##       PARCELSIZE < 0.9237271 to the left,  improve= 24.36047, (0 missing)
##       TScore     < 40.5      to the left,  improve= 10.14819, (0 missing)
##   Surrogate splits:
##       PARCELSIZE < 0.4473768 to the left,  agree=0.727, adj=0.022, (0 split)
##       AGE_2014   < 129.5     to the left,  agree=0.722, adj=0.006, (0 split)
##       WScore     < 0.5       to the right, agree=0.721, adj=0.001, (0 split)
## 
## Node number 2: 2765 observations
##   predicted class=0  expected loss=0.1707052  P(node) =0.7206151
##     class counts:  2293   472
##    probabilities: 0.829 0.171 
## 
## Node number 3: 1072 observations,    complexity param=0.08540076
##   predicted class=1  expected loss=0.4626866  P(node) =0.2793849
##     class counts:   496   576
##    probabilities: 0.463 0.537 
##   left son=6 (495 obs) right son=7 (577 obs)
##   Primary splits:
##       AGE_2014    < 71.5      to the left,  improve=34.680090, (0 missing)
##       BLDGTOTVAL  < 0.3563673 to the right, improve=34.351480, (0 missing)
##       WScore      < 75.5      to the left,  improve=30.795290, (0 missing)
##       TScore      < 41.5      to the left,  improve=13.175530, (0 missing)
##       Pop.Density < 8781.018  to the left,  improve= 7.933128, (0 missing)
##   Surrogate splits:
##       WScore      < 69.5      to the left,  agree=0.660, adj=0.263, (0 split)
##       TScore      < 42.5      to the left,  agree=0.625, adj=0.188, (0 split)
##       PARCELSIZE  < 0.124915  to the right, agree=0.616, adj=0.168, (0 split)
##       Pop.Density < 6327.815  to the left,  agree=0.595, adj=0.123, (0 split)
##       BLDGTOTVAL  < 0.4125434 to the right, agree=0.559, adj=0.044, (0 split)
## 
## Node number 6: 495 observations,    complexity param=0.03625954
##   predicted class=0  expected loss=0.4  P(node) =0.129007
##     class counts:   297   198
##    probabilities: 0.600 0.400 
##   left son=12 (343 obs) right son=13 (152 obs)
##   Primary splits:
##       BLDGTOTVAL < 0.356339  to the right, improve=22.210060, (0 missing)
##       WScore     < 75.5      to the left,  improve=14.757670, (0 missing)
##       TScore     < 41.5      to the left,  improve= 6.999092, (0 missing)
##       PARCELSIZE < 0.9838875 to the left,  improve= 5.392340, (0 missing)
##       AGE_2014   < 59.5      to the left,  improve= 5.251608, (0 missing)
##   Surrogate splits:
##       PARCELSIZE  < 0.4214552 to the left,  agree=0.727, adj=0.112, (0 split)
##       Pop.Density < 1446.75   to the right, agree=0.723, adj=0.099, (0 split)
##       TScore      < 22        to the right, agree=0.715, adj=0.072, (0 split)
##       WScore      < 16        to the right, agree=0.713, adj=0.066, (0 split)
##       AGE_2014    < 24        to the right, agree=0.705, adj=0.039, (0 split)
## 
## Node number 7: 577 observations
##   predicted class=1  expected loss=0.3448873  P(node) =0.1503779
##     class counts:   199   378
##    probabilities: 0.345 0.655 
## 
## Node number 12: 343 observations
##   predicted class=0  expected loss=0.3002915  P(node) =0.08939275
##     class counts:   240   103
##    probabilities: 0.700 0.300 
## 
## Node number 13: 152 observations,    complexity param=0.01049618
##   predicted class=1  expected loss=0.375  P(node) =0.03961428
##     class counts:    57    95
##    probabilities: 0.375 0.625 
##   left son=26 (68 obs) right son=27 (84 obs)
##   Primary splits:
##       WScore      < 58.5      to the left,  improve=8.315826, (0 missing)
##       TScore      < 42.5      to the left,  improve=7.527436, (0 missing)
##       BLDGTOTVAL  < 0.2415573 to the right, improve=3.952165, (0 missing)
##       Pop.Density < 7845.726  to the left,  improve=3.952165, (0 missing)
##       PARCELSIZE  < 0.1395381 to the right, improve=3.581510, (0 missing)
##   Surrogate splits:
##       TScore      < 40.5      to the left,  agree=0.842, adj=0.647, (0 split)
##       Pop.Density < 4021.768  to the left,  agree=0.816, adj=0.588, (0 split)
##       PARCELSIZE  < 0.2748356 to the right, agree=0.770, adj=0.485, (0 split)
##       BLDGTOTVAL  < 0.2760824 to the left,  agree=0.586, adj=0.074, (0 split)
##       AGE_2014    < 62.5      to the left,  agree=0.566, adj=0.029, (0 split)
## 
## Node number 26: 68 observations,    complexity param=0.01049618
##   predicted class=0  expected loss=0.4411765  P(node) =0.01772218
##     class counts:    38    30
##    probabilities: 0.559 0.441 
##   left son=52 (46 obs) right son=53 (22 obs)
##   Primary splits:
##       PARCELSIZE  < 0.7652287 to the left,  improve=9.244827, (0 missing)
##       BLDGTOTVAL  < 0.196914  to the right, improve=4.936308, (0 missing)
##       Pop.Density < 3244.374  to the right, improve=3.258855, (0 missing)
##       WScore      < 13.5      to the right, improve=2.489412, (0 missing)
##       TScore      < 32.5      to the right, improve=1.922522, (0 missing)
##   Surrogate splits:
##       WScore      < 16        to the right, agree=0.794, adj=0.364, (0 split)
##       Pop.Density < 2729.434  to the right, agree=0.779, adj=0.318, (0 split)
##       BLDGTOTVAL  < 0.2257027 to the right, agree=0.765, adj=0.273, (0 split)
##       TScore      < 17        to the right, agree=0.765, adj=0.273, (0 split)
##       AGE_2014    < 48.5      to the right, agree=0.721, adj=0.136, (0 split)
## 
## Node number 27: 84 observations
##   predicted class=1  expected loss=0.2261905  P(node) =0.0218921
##     class counts:    19    65
##    probabilities: 0.226 0.774 
## 
## Node number 52: 46 observations
##   predicted class=0  expected loss=0.2608696  P(node) =0.01198853
##     class counts:    34    12
##    probabilities: 0.739 0.261 
## 
## Node number 53: 22 observations
##   predicted class=1  expected loss=0.1818182  P(node) =0.005733646
##     class counts:     4    18
##    probabilities: 0.182 0.818
# Display the frame attribute
model.class$frame



In this simple model, we begin with 3,837 observations in the training set, and the fitted tree has five internal nodes (including the root node) and six terminal nodes or leaves. As we move down the tree, the plot communicates which independent variables were chosen for splits, the values of those independent variables that optimized the split criterion (e.g., Gini impurity), the predicted class and class probability at each node, and the percentage of the observations that “flow” to each node.

As shown above, typing the name of the rpart object displays model information that mirrors what we see in the tree plot, but in text form. Note how the depth of the indent reflects the structure of the (plotted) tree, and how the header line node), split, n, loss, yval, (yprob) names the elements reported for each node.

For example, 72% of the observations had a BLDGTOTVAL >= 0.45 (i.e., the ratio of structure value to the total value of the property) and the predicted probability of demolition at that terminal node was 17%. For the remaining observations where BLDGTOTVAL < 0.45 (i.e., the structure itself accounted for less than 45% of the total value of the property), the predicted probability of demolition at that internal node was 54%. After the initial split (using the BLDGTOTVAL feature), the algorithm splits that node based on the age of the housing unit in 2014 (i.e., the AGE_2014 feature): 577 of the 1,072 observations at that node “flow” into the right-most terminal node because they are at least 71.5 years old, and they are assigned a value of one (i.e., demolished) with a predicted probability of 66%. The remaining 495 observations “flow” into the opposite internal node, where the algorithm splits those data once more using the BLDGTOTVAL feature again, but with a different threshold value of 0.36. The terminal nodes in the middle might capture the distinction between an older housing unit that is (a) falling down in disrepair versus (b) on the historic preservation registry, and the predicted probability of demolition reflects that important nuance.

Applying the summary function to an rpart object displays more detailed information. As shown in the documentation for the rpart.object function, the frame attribute of an rpart object (i.e., model.class$frame) is a “data frame with one row for each node in the tree.” The var column lists the variable used at each split, n is the number of observations reaching that node, wt is the sum of case weights (1 by default) for observations reaching that node, and dev is the deviance of the node, a goodness-of-fit statistic (for linear models it equals the sum of squared residuals; for a classification tree like ours it matches the loss column printed above, i.e., the number of misclassified observations at the node). The yval column is the fitted value of the dependent variable at the node, complexity is the complexity parameter at which this split will collapse, ncompete is the number of competitor splits recorded, and nsurrogate is the number of surrogate splits recorded.
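For instance, one quick use of the frame attribute is to pull out just the terminal nodes (a small convenience sketch; the column names follow the rpart.object documentation):

# List only the leaves, with their size, misclassification count, and fitted class...
model.class$frame[model.class$frame$var == "<leaf>", c("n", "dev", "yval")]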

Surrogate variables are used when a value is missing and so a split cannot be completed with the main variable. A surrogate is a different variable that is chosen to approximate the first-choice variable in a split.
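If you want to see how much a fit leans on surrogates, one option (a sketch, not part of the lab pipeline) is to refit with surrogates disabled via rpart.control:

# usesurrogate = 0 means observations missing the primary split variable
# are not sent further down the tree (the default is 2)...
model.nosurr <- rpart(
  formula = as_factor(DEMOLISHED) ~ .,
  data    = portland.2014.2018.train,
  method  = "class",
  control = rpart.control(usesurrogate = 0)
)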

The complexity parameter (cp) is used to control the size of the decision tree and to select the optimal tree size. If the cost of adding another variable to the decision tree from the current node is above the value of cp, then tree building does not continue. We could also say that tree construction does not continue unless it would decrease the overall lack of fit by a factor of cp. We can retrieve the cp value for the model with model.class$control$cp and, to demonstrate its effect, prune at a value above the threshold for the second BLDGTOTVAL split:

# Prune the model...
model.pruned <- prune(model.class, cp = 0.075)
model.pruned$frame
rpart.plot(model.pruned)


# We could also set this parameter explicitly and rerun the model: 
temp <- rpart.control(cp = 0.075)
model.class.2 <- rpart(
  formula = as_factor(DEMOLISHED) ~ .,
  data    = portland.2014.2018.train,
  method  = "class",
  control=temp
)

model.class.2$frame
rpart.plot(model.class.2)


According to the rpart.object documentation, the cp table is “a matrix of information on the optimal prunings based on a complexity parameter.” Behind the scenes, rpart automatically applies an array of cost-complexity values to prune the tree. The CP column is the value of this parameter that would grow the tree to this size. The nsplit column is the number of splits at that size, rel error is the error on the data that were used to estimate the model, xerror is the cross-validation error generated by rpart's built-in cross-validation routine, and xstd gives us a sense of its variability over the default 10 cross-validations (see plotcp). Also note that (1 - rel error) is roughly analogous to the variance explained by the model.

The CP values control the size of the tree: the greater the CP value, the fewer the splits. The optimal size of the tree is generally given by the row in the CP table that minimizes the cross-validation error (xerror) with the fewest splits.

In the plot below, the dotted line reflects a recommendation from Breiman et al. (1984) to use the smallest tree whose cross-validation error is within one standard error of the minimum. You can think about the complexity parameter as the amount by which splitting that node improved the relative error.
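A minimal sketch of that one-standard-error rule applied to our cp table (assuming we want the simplest tree whose xerror clears the threshold) might look like this:

# Pick the simplest tree within one standard error of the minimum xerror...
cp.tab <- as.data.frame(model.class$cptable)
best   <- which.min(cp.tab$xerror)
thresh <- cp.tab$xerror[best] + cp.tab$xstd[best]
cp.1se <- cp.tab$CP[min(which(cp.tab$xerror <= thresh))]
model.1se <- prune(model.class, cp = cp.1se)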

model.class$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.08540076      0 1.0000000 1.0000000 0.02633587
## 2 0.03625954      2 0.8291985 0.8730916 0.02518799
## 3 0.01049618      3 0.7929389 0.8282443 0.02472907
## 4 0.01000000      5 0.7719466 0.8301527 0.02474920
plotcp(model.class)

model.class$splits
##             count ncat     improve        index          adj
## BLDGTOTVAL   3837    1 207.6502413    0.4494291 0.0000000000
## AGE_2014     3837   -1  57.3042763   62.5000000 0.0000000000
## WScore       3837   -1  43.9991691   83.5000000 0.0000000000
## PARCELSIZE   3837   -1  24.3604742    0.9237271 0.0000000000
## TScore       3837   -1  10.1481898   40.5000000 0.0000000000
## PARCELSIZE      0   -1   0.7268700    0.4473768 0.0223880597
## AGE_2014        0   -1   0.7221788  129.5000000 0.0055970149
## WScore          0    1   0.7208757    0.5000000 0.0009328358
## AGE_2014     1072   -1  34.6800900   71.5000000 0.0000000000
## BLDGTOTVAL   1072    1  34.3514786    0.3563673 0.0000000000
## WScore       1072   -1  30.7952875   75.5000000 0.0000000000
## TScore       1072   -1  13.1755326   41.5000000 0.0000000000
## Pop.Density  1072   -1   7.9331276 8781.0175000 0.0000000000
## WScore          0   -1   0.6595149   69.5000000 0.2626262626
## TScore          0   -1   0.6250000   42.5000000 0.1878787879
## PARCELSIZE      0    1   0.6156716    0.1249150 0.1676767677
## Pop.Density     0   -1   0.5951493 6327.8150000 0.1232323232
## BLDGTOTVAL      0    1   0.5587687    0.4125434 0.0444444444
## BLDGTOTVAL    495    1  22.2100583    0.3563390 0.0000000000
## WScore        495   -1  14.7576745   75.5000000 0.0000000000
## TScore        495   -1   6.9990921   41.5000000 0.0000000000
## PARCELSIZE    495   -1   5.3923404    0.9838875 0.0000000000
## AGE_2014      495   -1   5.2516080   59.5000000 0.0000000000
## PARCELSIZE      0   -1   0.7272727    0.4214552 0.1118421053
## Pop.Density     0    1   0.7232323 1446.7500000 0.0986842105
## TScore          0    1   0.7151515   22.0000000 0.0723684211
## WScore          0    1   0.7131313   16.0000000 0.0657894737
## AGE_2014        0    1   0.7050505   24.0000000 0.0394736842
## WScore        152   -1   8.3158263   58.5000000 0.0000000000
## TScore        152   -1   7.5274363   42.5000000 0.0000000000
## BLDGTOTVAL    152    1   3.9521645    0.2415573 0.0000000000
## Pop.Density   152   -1   3.9521645 7845.7260000 0.0000000000
## PARCELSIZE    152    1   3.5815096    0.1395381 0.0000000000
## TScore          0   -1   0.8421053   40.5000000 0.6470588235
## Pop.Density     0   -1   0.8157895 4021.7680000 0.5882352941
## PARCELSIZE      0    1   0.7697368    0.2748356 0.4852941176
## BLDGTOTVAL      0   -1   0.5855263    0.2760824 0.0735294118
## AGE_2014        0   -1   0.5657895   62.5000000 0.0294117647
## PARCELSIZE     68   -1   9.2448268    0.7652287 0.0000000000
## BLDGTOTVAL     68    1   4.9363083    0.1969140 0.0000000000
## Pop.Density    68    1   3.2588547 3244.3735000 0.0000000000
## WScore         68    1   2.4894118   13.5000000 0.0000000000
## TScore         68    1   1.9225222   32.5000000 0.0000000000
## WScore          0    1   0.7941176   16.0000000 0.3636363636
## Pop.Density     0    1   0.7794118 2729.4345000 0.3181818182
## BLDGTOTVAL      0    1   0.7647059    0.2257027 0.2727272727
## TScore          0    1   0.7647059   17.0000000 0.2727272727
## AGE_2014        0    1   0.7205882   48.5000000 0.1363636364


The splits attribute is a matrix with one row per split, labeled by the name of the split variable. The count column is the number of observations sent left or right by the split (for competitor splits, the number that would have been sent left or right had that split been used; for surrogate splits, the number missing the primary split variable that were decided using this surrogate). The ncat column is the number of categories or levels for the variable (+/-1 for a continuous variable), improve is the improvement in deviance given by the split (or, for surrogates, the concordance of the surrogate with the primary), and index is the numeric split point. The last column, adj, gives the adjusted concordance for surrogate splits. For a continuous variable, the sign of ncat determines whether the subset x < cutpoint or x > cutpoint is sent to the left. The improve and index columns are typically the most interesting here.
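As a quick check on that description, we can read off the first row of the matrix, which is the primary split at the root:

# For BLDGTOTVAL the stored ncat is +1, so observations above the index
# (0.4494) are sent left -- compare with node 2 in the printed tree above...
model.class$splits[1, ]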

A case weight is a non-negative numeric variable that indicates the importance of each case. There are three types of case weights: frequencies, sampling weights, and variance weights.
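For example, a hypothetical sketch of frequency-style case weights in rpart (not used anywhere in this lab) would up-weight the rarer demolished class:

# Double-count demolished cases via the weights argument (illustrative only)...
w <- ifelse(portland.2014.2018.train$DEMOLISHED == 1, 2, 1)
model.weighted <- rpart(
  formula = as_factor(DEMOLISHED) ~ .,
  data    = portland.2014.2018.train,
  method  = "class",
  weights = w
)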

model.class.pred <-  predict(model.class, newdata = portland.2014.2018.test)

# Note: model.class.pred is a two-column matrix of class probabilities,
# so this RMSE against the 0/1 outcome is a rough diagnostic at best.
caret::RMSE(pred = model.class.pred, obs = portland.2014.2018.test$DEMOLISHED)
## [1] 0.5819374
model.class.pred.tb <- as_tibble(model.class.pred)
# Caution: 0.05 is an extremely low probability cut-off -- it labels nearly
# every observation as demolished (a conventional threshold would be 0.5),
# which explains the degenerate confusion matrix below.
model.class.pred.tb$outcome <- ifelse((model.class.pred.tb[, 2] > 0.05), 1, 0)
caret::confusionMatrix(as_factor(model.class.pred.tb$outcome), as_factor(portland.2014.2018.test$DEMOLISHED), positive = "1")
## Warning in confusionMatrix.default(as_factor(model.class.pred.tb$outcome), :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0    0    0
##          1 1196  450
##                                              
##                Accuracy : 0.2734             
##                  95% CI : (0.252, 0.2956)    
##     No Information Rate : 0.7266             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0                  
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.0000             
##          Pos Pred Value : 0.2734             
##          Neg Pred Value :    NaN             
##              Prevalence : 0.2734             
##          Detection Rate : 0.2734             
##    Detection Prevalence : 1.0000             
##       Balanced Accuracy : 0.5000             
##                                              
##        'Positive' Class : 1                  
## 

Based on the results from the above code chunk, this model does not fit the validation set very well (with the 0.05 probability cut-off, every observation is predicted as demolished). We could modify the features (independent variables) we have included (see the “Fit Initial Classification Tree” chunk) or raise the classification threshold to try to improve our predictive capacity. Alternatively, we could pursue one of the strategies that you encountered in the assigned reading: bootstrap aggregating, also known as “bagging”. This technique involves repeatedly sampling the training data with replacement and fitting the model to each sample as a means of improving the stability and predictive power of our tree-based model.
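Before pivoting to a package that does this for us, here is a minimal hand-rolled sketch of the idea (illustrative only; the number of bootstrap samples and the majority-vote aggregation are my own choices):

set.seed(227)
B <- 25
# Fit one classification tree per bootstrap resample of the training data...
boot.preds <- sapply(seq_len(B), function(b) {
  idx  <- sample(nrow(portland.2014.2018.train), replace = TRUE)
  tree <- rpart(as_factor(DEMOLISHED) ~ .,
                data = portland.2014.2018.train[idx, ], method = "class")
  as.character(predict(tree, newdata = portland.2014.2018.test, type = "class"))
})
# ...then take a majority vote across the B trees for each test observation.
bag.vote <- apply(boot.preds, 1, function(p) names(which.max(table(p))))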


Here we pivot into using the caret package because of its ease of modifying training parameters and because it supports bagging. The trainControl function can be accessed directly or from within the train function via the trControl argument. In the example below we set up 10-fold cross-validation, one of the most commonly used resampling schemes. This means the algorithm will use 90 percent of the data to fit a model and the remaining 10 percent to test it, then repeat with a different 10 percent held out, ten times in total, until all the data have been used as both training and test data.
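The same interface supports other resampling schemes; for instance, these two alternatives (illustrative only, not used below) set up repeated cross-validation and bootstrap resampling:

# Repeated 10-fold cross-validation and bootstrap resampling...
ctrl.repeat <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 5)
ctrl.boot   <- caret::trainControl(method = "boot", number = 25)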

The preProcess argument can be used to scale, center, or otherwise transform the input data. The tuneLength parameter tells the algorithm to try a number of sensible default values (essentially automating the hyperparameter tuning), while the tuneGrid parameter lets us decide exactly which values a tuning parameter will take, e.g., tuneGrid = expand.grid(k = c(5, 11, 21, 25)) for a k-nearest neighbors model. This is how you could fit a different model than the “best-fitting” one identified by the algorithm. The expand.grid function was used in the random forest reading that we did for this week as a way to “tune” the hyperparameters and increase the overall performance of the model beyond feature engineering and feature selection.
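As a concrete illustration of tuneGrid with our data (a sketch; the grid values are my own assumptions), caret's "rpart" method exposes cp as its single tuning parameter:

# Tune a single tree's complexity parameter over an explicit grid...
model.rpart.cv <- caret::train(
  as_factor(DEMOLISHED) ~ .,
  data      = portland.2014.2018.train,
  method    = "rpart",
  trControl = caret::trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(cp = seq(0.001, 0.05, by = 0.005))
)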


# Specify the standard 10-fold cross validation
ctrl <- caret::trainControl(method = "cv",  number = 10) 

# Fit the bagged model by setting the method argument to 
# "treebag" as shown below...
model.bagged <- train(
  as_factor(DEMOLISHED) ~ .,
  data    = portland.2014.2018.train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE,
  minsplit = 4, 
  minbucket = 2
)

# Assess the results
model.bagged
## Bagged CART 
## 
## 3837 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3453, 3454, 3454, 3454, 3453, 3453, ... 
## Resampling results:
## 
##   Accuracy   Kappa   
##   0.7763898  0.396478
# Plot the most important variables (top 3, then top 5)...
plot(varImp(model.bagged), 3)  

plot(varImp(model.bagged), 5)  

# Make predictions using the ensemble "bagged" model...
pred <- predict(model.bagged, portland.2014.2018.test)
caret::confusionMatrix(as_factor(pred), as_factor(portland.2014.2018.test$DEMOLISHED), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1067  232
##          1  129  218
##                                           
##                Accuracy : 0.7807          
##                  95% CI : (0.7599, 0.8005)
##     No Information Rate : 0.7266          
##     P-Value [Acc > NIR] : 0.00000028355   
##                                           
##                   Kappa : 0.4055          
##                                           
##  Mcnemar's Test P-Value : 0.00000007943   
##                                           
##             Sensitivity : 0.4844          
##             Specificity : 0.8921          
##          Pos Pred Value : 0.6282          
##          Neg Pred Value : 0.8214          
##              Prevalence : 0.2734          
##          Detection Rate : 0.1324          
##    Detection Prevalence : 0.2108          
##       Balanced Accuracy : 0.6883          
##                                           
##        'Positive' Class : 1               
## 
#previous caret c. matrix
caret::confusionMatrix(as_factor(model.class.pred.tb$outcome), as_factor(portland.2014.2018.test$DEMOLISHED), positive = "1")
## Warning in confusionMatrix.default(as_factor(model.class.pred.tb$outcome), :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0    0    0
##          1 1196  450
##                                              
##                Accuracy : 0.2734             
##                  95% CI : (0.252, 0.2956)    
##     No Information Rate : 0.7266             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0                  
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.0000             
##          Pos Pred Value : 0.2734             
##          Neg Pred Value :    NaN             
##              Prevalence : 0.2734             
##          Detection Rate : 0.2734             
##    Detection Prevalence : 1.0000             
##       Balanced Accuracy : 0.5000             
##                                              
##        'Positive' Class : 1                  
## 

As we can see, the “bagging” technique allowed us to identify a model with greater predictive capacity. But we can also engage in hyperparameter tuning to squeeze even more predictive capability out of the model.



First, consider what the tunable parameters are for rpart objects by referring to the documentation for the massive list of algorithms that can be fit using caret::train, along with the tuning parameters for each. The maxdepth parameter represents the maximum depth of any node of the final tree, with the root node counted as depth 0. The minsplit parameter is the minimum number of observations that must exist in a node in order for a split to be attempted. The minbucket parameter is the minimum number of observations in any terminal node; it is fixed rather than tuned in the chunk below.
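To see what maxdepth does in isolation (a quick sketch, separate from the grid search below), a tree with maxdepth = 1 is just a "stump" containing only the root split:

# A depth-one tree keeps only the root split (BLDGTOTVAL in our case)...
stump <- rpart(
  formula = as_factor(DEMOLISHED) ~ .,
  data    = portland.2014.2018.train,
  method  = "class",
  control = rpart.control(maxdepth = 1)
)
rpart.plot(stump)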


# Create hyperparameter grid... which is a dataframe
tuning.grid <- expand.grid(
  maxdepth = c(1, 3, 5, 8, 15), 
  minsplit = c(2, 5, 10, 15),
  accuracy = NA, 
  kappa = NA
)

# Execute full cartesian grid search...
for(i in seq_len(nrow(tuning.grid))) {
  
# Fit model for each hyperparameter combination...
  fit <- train(
  as_factor(DEMOLISHED) ~ .,
  data = portland.2014.2018.train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE,
  maxdepth = tuning.grid$maxdepth[i],
  minsplit = tuning.grid$minsplit[i],
  minbucket = 4
  )


  # Save fit metrics from each model...
  tuning.grid$accuracy[i] <- fit$results$Accuracy 
  tuning.grid$kappa[i] <- fit$results$Kappa 
}

# assess top 10 models...
tuning.grid %>%
  arrange(-accuracy) %>%
  head(10)
# # Visualize how fit metrics change over 
# # different combinations of selected hyperparameters...
# plot.depth <- ggplot() + 
#   geom_point(data = tuning.grid, mapping = aes(y = kappa, x = maxdepth, fill = "dodgerblue"), 
#              color = "royalblue", size = 3, shape = 21, alpha = 0.5) +
#   geom_point(data = tuning.grid, mapping = aes(y = accuracy, x = maxdepth, fill = "red"), 
#              color = "pink", size = 3, shape = 21, alpha = 0.5) +
#   scale_fill_manual(name = "", values = c("dodgerblue", "red"), labels = c("Kappa", "Accuracy")) +
#   xlab("Maximum Depth") + 
#   ylab("Metric") +
#   theme_minimal() 
#   
# 
# plot.split <- ggplot() + 
#   geom_point(data = tuning.grid, mapping = aes(y = kappa, x = minsplit, fill = "dodgerblue"), 
#              color = "royalblue", size = 3, shape = 21, alpha = 0.5) +
#   geom_point(data = tuning.grid, mapping = aes(y = accuracy, x = minsplit, fill = "red"), 
#              color = "pink", size = 3, shape = 21, alpha = 0.5) +
#   scale_fill_manual(name = "", values = c("dodgerblue", "red"), labels = c("Kappa", "Accuracy")) +
#   xlab("Minimum Split") + 
#   ylab("Metric") +
#   theme_minimal()  
#   
# ggarrange(plot.depth, plot.split, common.legend = TRUE)
# 
# ggsave("hyperparameters_tuning.png", width = 18, height = 8)
# 
# 
# plot.accuracy <- ggplot() + 
#   geom_point(data = tuning.grid, mapping = aes(y = accuracy, x = minsplit, fill = as_factor(maxdepth)), 
#              color = "royalblue", size = 3, shape = 21, alpha = 0.5) +
#   scale_fill_manual(name = "Maximum Depth", values = c(rainbow(5))) +
#   scale_x_continuous(breaks = c(2, 5, 10, 15), labels = c(2, 5, 10, 15), limits = c(0, 15)) +
#   ylab("Accuracy")
# 
# 
# 
# plot.kappa <- ggplot() + 
#   geom_point(data = tuning.grid, mapping = aes(y = kappa, x = minsplit, fill = as_factor(maxdepth)), 
#              color = "royalblue", size = 3, shape = 21, alpha = 0.5) +
#   scale_fill_manual(name = "Maximum Depth", values = c(rainbow(5))) +
#   scale_x_continuous(breaks = c(2, 5, 10, 15), labels = c(2, 5, 10, 15), limits = c(0, 15)) +
#   ylab("Kappa")
# 
# ggarrange(plot.accuracy, plot.kappa, common.legend = TRUE)
# 
# ggsave("hyperparameters_tuning_with_fill.png", width = 18, height = 8)


The graphics above simply show how the fit metrics vary across different combinations of the selected hyperparameters. Now we can take the best model and see how it performs on the test set.



best.accuracy <- tuning.grid[tuning.grid$accuracy == max(tuning.grid$accuracy), ]
best.kappa <-  tuning.grid[tuning.grid$kappa == max(tuning.grid$kappa), ]

# Re-run the model with the best hyperparameters. Note that maxdepth is
# taken from the row with the best accuracy and minsplit from the row with
# the best kappa; these are not necessarily the same row of the grid.

best.fit <- train(
  as_factor(DEMOLISHED) ~ .,
  data = portland.2014.2018.train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE,
  maxdepth = best.accuracy$maxdepth,
  minsplit = best.kappa$minsplit,
  minbucket = 4)

best.fit
## Bagged CART 
## 
## 3837 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3453, 3453, 3453, 3453, 3453, 3454, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7789844  0.4080063
# The features included in the model...
best.fit$coefnames
## [1] "BLDGTOTVAL"  "PARCELSIZE"  "AGE_2014"    "WScore"      "TScore"     
## [6] "Pop.Density"
# Now use the best fitting model to predict demolition status for
# the test set. 

best.fit.pred <-  predict(best.fit, newdata = portland.2014.2018.test)
caret::confusionMatrix(as_factor(best.fit.pred), as_factor(portland.2014.2018.test$DEMOLISHED), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1072  237
##          1  124  213
##                                           
##                Accuracy : 0.7807          
##                  95% CI : (0.7599, 0.8005)
##     No Information Rate : 0.7266          
##     P-Value [Acc > NIR] : 0.000000283554  
##                                           
##                   Kappa : 0.4011          
##                                           
##  Mcnemar's Test P-Value : 0.000000003753  
##                                           
##             Sensitivity : 0.4733          
##             Specificity : 0.8963          
##          Pos Pred Value : 0.6320          
##          Neg Pred Value : 0.8189          
##              Prevalence : 0.2734          
##          Detection Rate : 0.1294          
##    Detection Prevalence : 0.2047          
##       Balanced Accuracy : 0.6848          
##                                           
##        'Positive' Class : 1               
## 
# Make point sf object from coordinates for mapping...
portland.2014.2018.test.sf <- st_as_sf(portland.2014.2018.test, coords = c("X_COORD", "Y_COORD"), crs = st_crs(portland.demolitions))

# Map the model performance...
# plot.2014.2018.test.set <- bind_cols(portland.2014.2018.test, best.fit.pred)
plot.2014.2018.test.set <- bind_cols(portland.2014.2018.test.sf, best.fit.pred)
## New names:
## * `` -> ...9
names(plot.2014.2018.test.set)[dim(plot.2014.2018.test.set)[2]] <- "PREDICTED"

plot.2014.2018.test.set <- plot.2014.2018.test.set %>%
  mutate(PERFORM = case_when(
    ((as.character(DEMOLISHED) == as.character(PREDICTED)) ~ "Correct"), 
    ((as.character(DEMOLISHED) == 1 & as.character(PREDICTED) == 0) ~ "False Negative"), 
    ((as.character(DEMOLISHED) == 0 & as.character(PREDICTED) == 1) ~ "False Positive")))


# Plot model predictions with neighborhood boundaries...
neighborhoods_0 <- st_read("https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson")
## Reading layer `OGRGeoJSON' from data source 
##   `https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 98 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.8677 ymin: 45.43254 xmax: -122.4666 ymax: 45.65302
## Geodetic CRS:  WGS 84
neighborhoods <- st_transform(neighborhoods_0, crs = st_crs(portland.demolitions))

ggplot() + 
  geom_sf(data = plot.2014.2018.test.set, aes(fill = PERFORM), shape = 21, size = 3) +
  scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
  theme_void()

fp <- plot.2014.2018.test.set %>%
  filter(PERFORM == "False Positive") %>% 
ggplot() + 
  geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
  scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
  theme_void()

fn <- plot.2014.2018.test.set %>%
  filter(PERFORM == "False Negative") %>% 
ggplot() + 
  geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
  scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
  theme_void()

ggarrange(fp, fn)

# Plot model predictions with Google basemap...
plot.2014.2018.test.set.wgs84 <- st_transform(plot.2014.2018.test.set, 4326)

ggmap(pdx_basemap) +
  geom_point(data = plot.2014.2018.test.set.wgs84, aes(color = PERFORM), 
    x = st_coordinates(plot.2014.2018.test.set.wgs84)[,1], 
    y = st_coordinates(plot.2014.2018.test.set.wgs84)[,2], size = 2) +
  scale_color_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  theme_void()



There are several resources available that go into greater detail on bootstrap aggregation, including this post from R Bloggers and this tutorial. As mentioned in class, bagging is a technique that combines the predictions of many high-variance models (i.e., decision trees) in order to enhance predictive capacity. If you were doing this kind of work professionally or for a thesis, you might combine bagging with boosting, which iteratively attempts to correct the errors of the previous model by weighting incorrect predictions more heavily as the overall ensemble is refined. The very next chapter in the Boehmke and Greenwell (2020) text covers this in detail.
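For reference, a hedged sketch of what a boosted counterpart could look like with the gbm package (the package choice and all parameter values here are assumptions, not part of this lab):

library(gbm)

# Gradient boosting on the same training data; "bernoulli" expects a
# numeric 0/1 outcome (DEMOLISHED appears to be coded 0/1 here)...
model.boost <- gbm(
  DEMOLISHED ~ .,
  data = portland.2014.2018.train,
  distribution = "bernoulli",
  n.trees = 500,
  interaction.depth = 3,
  shrinkage = 0.05
)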



You have the foundational concepts and skills needed to take a deeper dive here, if you like. Perhaps over the summer you will decide to complete the Machine Learning Crash Course from Google, which has become something of a standard. It assumes you are on team Python, but all of these techniques are available in R as well.

Exercise

For this portion of the Lab Assignment, you are asked to experiment with the classification model presented here.

Question 1 Adapt the “Apply Fitted Model to Test Dataset and Visualize Results” code chunk above to the portland.2019 dataset.

portland.2019_tibb <- as_tibble(portland.2019)

best.accuracy <- tuning.grid[tuning.grid$accuracy == max(tuning.grid$accuracy), ]
best.kappa <-  tuning.grid[tuning.grid$kappa == max(tuning.grid$kappa), ]

# Re-run model with best hyperparameters...

best.fit <- train(
  as_factor(DEMOLISHED) ~ .,
  data = portland.2014.2018.train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE,
  maxdepth = best.accuracy$maxdepth,
  minsplit = best.kappa$minsplit,
  
  minbucket = 4)

best.fit
## Bagged CART 
## 
## 3837 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3453, 3453, 3454, 3453, 3453, 3453, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7745734  0.4001762
# The features included in the model...
best.fit$coefnames
## [1] "BLDGTOTVAL"  "PARCELSIZE"  "AGE_2014"    "WScore"      "TScore"     
## [6] "Pop.Density"
# Now use the best fitting model to predict demolition status for
# the test set. 

# Keep only the columns used by the fitted model (plus coordinates for
# mapping). Strictly speaking, predict() ignores extra columns, so this
# subsetting is optional.

portland.2019_hp <- portland.2019 %>%
  select(DEMOLISHED,
         BLDGTOTVAL,
         PARCELSIZE,
         AGE_2014,
         WScore,
         TScore,
         Pop.Density,
         X_COORD,
         Y_COORD)

best.fit.pred <-  predict(best.fit, newdata = portland.2019_hp)
caret::confusionMatrix(as_factor(best.fit.pred), as_factor(portland.2019$DEMOLISHED), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3556  114
##          1  424  100
##                                              
##                Accuracy : 0.8717             
##                  95% CI : (0.8612, 0.8817)   
##     No Information Rate : 0.949              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.2141             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.46729            
##             Specificity : 0.89347            
##          Pos Pred Value : 0.19084            
##          Neg Pred Value : 0.96894            
##              Prevalence : 0.05103            
##          Detection Rate : 0.02384            
##    Detection Prevalence : 0.12494            
##       Balanced Accuracy : 0.68038            
##                                              
##        'Positive' Class : 1                  
## 
# Make point sf object from coordinates for mapping...
portland.2019.sf <- st_as_sf(portland.2019_hp, coords = c("X_COORD", "Y_COORD"), crs = st_crs(portland.demolitions))

# Map the model performance...
# plot.2019.test.set <- bind_cols(portland.2019_hp, best.fit.pred)
plot.2019.test.set<- bind_cols(portland.2019.sf, best.fit.pred)
## New names:
## * `` -> ...9
names(plot.2019.test.set)[dim(plot.2019.test.set)[2]] <- "PREDICTED"

plot.2019.test.set <- plot.2019.test.set %>%
  mutate(PERFORM = case_when(
    ((as.character(DEMOLISHED) == as.character(PREDICTED)) ~ "Correct"), 
    ((as.character(DEMOLISHED) == 1 & as.character(PREDICTED) == 0) ~ "False Negative"), 
    ((as.character(DEMOLISHED) == 0 & as.character(PREDICTED) == 1) ~ "False Positive")))


# Plot model predictions with neighborhood boundaries...
neighborhoods_0 <- st_read("https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson")
## Reading layer `OGRGeoJSON' from data source 
##   `https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 98 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.8677 ymin: 45.43254 xmax: -122.4666 ymax: 45.65302
## Geodetic CRS:  WGS 84
neighborhoods <- st_transform(neighborhoods_0, crs = st_crs(portland.demolitions))

ggplot() + 
  geom_sf(data = plot.2019.test.set, aes(fill = PERFORM), shape = 21, size = 3) +
  scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
  theme_void()

fp <- plot.2019.test.set %>%
  filter(PERFORM == "False Positive") %>% 
ggplot() + 
  geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
  scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
  theme_void()

fn <- plot.2019.test.set %>%
  filter(PERFORM == "False Negative") %>% 
ggplot() + 
  geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
  scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
  theme_void()

ggarrange(fp, fn)

# Plot model predictions with Google basemap...
plot.2019.test.set.wgs84 <- st_transform(plot.2019.test.set, 4326)

ggmap(pdx_basemap) +
  geom_point(data = plot.2019.test.set.wgs84, aes(color = PERFORM), 
    x = st_coordinates(plot.2019.test.set.wgs84)[,1], 
    y = st_coordinates(plot.2019.test.set.wgs84)[,2], size = 2) +
  scale_color_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  theme_void()



  • Does the model perform as well “in the wild” with these 2019 data as it did on the test set?

# Test set (2014-2018) results
filter(plot.2014.2018.test.set.wgs84, PERFORM == "Correct") %>%
  count() / count(plot.2014.2018.test.set.wgs84)
filter(plot.2014.2018.test.set.wgs84, PERFORM == "False Negative") %>%
  count() / count(plot.2014.2018.test.set.wgs84)
filter(plot.2014.2018.test.set.wgs84, PERFORM == "False Positive") %>%
  count() / count(plot.2014.2018.test.set.wgs84)
# 2019 "wild" data results
filter(plot.2019.test.set.wgs84, PERFORM == "Correct") %>%
  count() / count(plot.2019.test.set.wgs84)
filter(plot.2019.test.set.wgs84, PERFORM == "False Negative") %>%
  count() / count(plot.2019.test.set.wgs84)
filter(plot.2019.test.set.wgs84, PERFORM == "False Positive") %>%
  count() / count(plot.2019.test.set.wgs84)


##Q1 Answer 1/3: If I did this correctly (my best.fit object was still trained on the 2014-2018 data when applying it to the 2019 “wild” data), it appears our model performs better on the wild data by raw accuracy (87% of demolition predictions correct) than it did on the test set (78%). That comparison deserves a caveat, though: only about 5% of the 2019 observations were actually demolished (the No Information Rate is 0.949), and kappa fell from roughly 0.40 to 0.21, so much of the apparent gain reflects the lower prevalence of demolitions in 2019 rather than better discrimination. Concerning false positives and negatives, false positives increased (from 8% to 10%), while false negatives decreased drastically (from 14% to 2%). It would seem our model really nailed the false negatives; in reality, I guess this means a building could sleep comfortably (joke) if our model says it will not be demolished.


  • Are there any obvious patterns in the location of correctly labeled observations, false positives, and false negatives, relative to the maps from the “Apply Fitted Model to Test Dataset and Visualize Results” code chunk based on the test set?
  • As part of your submission, include screen captures, code, etc. that shows OR recreates the information you used to answer the preceding questions.
#since I'm pretty comfortable with merging spatial data - thanks to my term project - I'll go extra on this one

neighborhoods$Correct <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set, PERFORM == "Correct")))
neighborhoods$FALSE_N <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set, PERFORM == "False Negative")))
neighborhoods$FALSE_P <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set, PERFORM == "False Positive")))

neighborhoods$N_Predictions <- (neighborhoods$Correct + neighborhoods$FALSE_N + neighborhoods$FALSE_P)
  
arrange(neighborhoods, desc(Correct)) %>%
  select(NAME,
         MAPLABEL,
         HORZ_VERT,
         Correct,
         FALSE_N,
         FALSE_P) %>%
  head()
arrange(neighborhoods, desc(FALSE_N)) %>%
  select(NAME,
         MAPLABEL,
         HORZ_VERT,
         Correct,
         FALSE_N,
         FALSE_P) %>%
  head()
arrange(neighborhoods, desc(FALSE_P)) %>%
  select(NAME,
         MAPLABEL,
         HORZ_VERT,
         Correct,
         FALSE_N,
         FALSE_P) %>%
  head()


##Q1 Answer 2/3: The CENTENNIAL COMMUNITY ASSOCIATION neighborhood performed best under the model, with 167 correct predictions and only 7 incorrect predictions (all false positives). The Arbor Lodge, Woodstock, and Overlook neighborhoods had the highest number of false negatives (8 each), while Montavilla had the most false positives (20).


library(viridis)
## Loading required package: viridisLite
library(ggthemes)

neighborhoods <- neighborhoods %>%
  mutate(Pct_Correct = Correct / N_Predictions,
         Pct_FALSE_P = FALSE_P / N_Predictions,
         Pct_FALSE_N = FALSE_N / N_Predictions)

neighborhoods.wgs84 <- st_transform(neighborhoods, 4326)
#pdx_basemap.wgs84 <- st_transform(pdx_basemap, 4326) # does not work: pdx_basemap is a ggmap raster, not an sf object
  
  
ggmap(pdx_basemap) +
  geom_sf(data = neighborhoods.wgs84,
          mapping = aes(fill = Pct_Correct), inherit.aes = FALSE,
          color = "gray90", size = 0.25, alpha = 0.75) +
  # scale_fill_viridis(option = "plasma") +
  scale_fill_viridis(option = "E", direction = 1) +
  # scale_fill_distiller(palette = "RdPu", direction = 1) +
  theme_map() +
  theme(legend.position = "bottom",
        strip.background = element_blank()) +
  labs(fill = "Percent Correct",
       title = "Percent Correct - Predicted Demolitions",
       subtitle = "By Neighborhood, Portland 2019 Data")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

  neighborhoods.wgs84 %>%
    arrange(desc(Pct_Correct)) %>%
    select(MAPLABEL,
           N_Predictions,
          Pct_Correct) %>%
    head()
ggmap(pdx_basemap) +
  geom_sf(data = neighborhoods.wgs84,
             mapping = aes(fill = Pct_FALSE_P), inherit.aes = FALSE,
          color = "gray90", size = 0.25, alpha = 0.75) +
 # scale_fill_viridis(option = "plasma") +
  scale_fill_viridis(option = "E", direction = 1) +
 # scale_fill_distiller(palette = "RdPu", direction = 1) +
  theme_map() +
  theme(legend.position = "bottom",
        strip.background = element_blank()) +
  labs(fill = "Percent False Positive",
       title = "Percent False Positive - Predicted Demolitions",
       subtitle = "By Neighborhood, Portland 2019 Data")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

  neighborhoods.wgs84 %>%
    arrange(desc(Pct_FALSE_P)) %>%
    select(MAPLABEL,
           N_Predictions,
          Pct_FALSE_P) %>%
    head()
ggmap(pdx_basemap) +
  geom_sf(data = neighborhoods.wgs84,
             mapping = aes(fill = Pct_FALSE_N), inherit.aes = FALSE,
          color = "gray90", size = 0.25, alpha = 0.75) +
 # scale_fill_viridis(option = "plasma") +
  scale_fill_viridis(option = "E", direction = 1) +
 # scale_fill_distiller(palette = "RdPu", direction = 1) +
  theme_map() +
  theme(legend.position = "bottom",
        strip.background = element_blank()) +
  labs(fill = "Percent False Negative",
       title = "Percent False Negative - Predicted Demolitions",
       subtitle = "By Neighborhood, Portland 2019 Data")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

  neighborhoods.wgs84 %>%
    arrange(desc(Pct_FALSE_N)) %>%
    select(MAPLABEL,
           N_Predictions,
          Pct_FALSE_N) %>%
    head()

##Q1 Answer 3/3: Looking at the percent of predictions that were correct, false positive, and false negative, some clear outliers are visible across the Portland neighborhoods. Much of the eastern side of the city fared really well with our model, with most of those neighborhoods displaying bright yellow colors indicating near or at 100% accuracy (I was trying to make these values pop up on the map, but was unsuccessful). Assessing neighborhoods by their percent of false positive predictions, both Healy Heights and Goose Hollow Foothills League had 100% of their demolition predictions come back as false positives, but each of these neighborhoods had fewer than 3 total predictions. The next neighborhood in descending order of percent false positive, Boise, had 16 total demolition predictions within its boundary, roughly 44% of them false positives. That makes Boise the truest outlier on this measure, and a neighborhood our model is not well set up for. On the opposite side, false negatives, the measure our model handles best (only 2% of predictions overall), rates are higher near the city's center, with neighborhoods such as Overlook, Homestead, and Hosford-Abernethy all above 10% false negative predictions. Lastly, like the Healy Heights neighborhood previously mentioned, the Linnton and Sunderland Association of Neighbors neighborhoods both displayed high percentages of false negative predictions (100% and 50%) but with very low total predictions (fewer than 3 each).


Question 2 Revisit the “Prep Data for Classification Tree Modeling” code chunk and include more features than were considered in the example provided OR add the cp parameter to the custom grid search.

  • Were you able to improve the performance of the model?


Complexity Parameter Tuning - don’t run this

# #finding the ideal **cp**
# 
# #taking in the `Assess Model` Visual
# cp_values_grid <- seq(from = 0.02, to = 0.056, len = 50) #a lengthy amount of time, but let's see if it's worth it
# 
# #without specifying cp
#   #0.7821081 accuracy and 0.4152267 kappa
#   #best minsplit =2
#   #best mapdepth = 1
# #five grid tests?
# # with cp grid  - using the initial `assess model` visual
#   # 0.02 to 0.056, best cp = 0.038 with .7821112 accuracy and 0.4064244 kappa
#   #best min split = 5
#   #best maxdepth = 3
# 
# 
# # Create hyperparameter grid... which is a dataframe
# tuning.grid <- expand.grid(
#   maxdepth = c(1, 3, 5, 8, 15), 
#   minsplit = c(2, 5, 10, 15), 
#   .cp = cp_values_grid,
#   accuracy = NA, 
#   kappa = NA
# )
# 
# # Execute full cartesian grid search...
# for(i in seq_len(nrow(tuning.grid))) {
#   
# # Fit model for each hyperparameter combination...
#   fit <- train(
#   as_factor(DEMOLISHED) ~ .,
#   data = portland.2014.2018.train,
#   method = "treebag",
#   trControl = ctrl,
#   importance = TRUE,
#   maxdepth = tuning.grid$maxdepth[i],
#   minsplit = tuning.grid$minsplit[i],
#   .cp = tuning.grid$.cp[i],
#   minbucket = 4
#   )
# 
# 
#   # Save fit metrics from each model...
#   tuning.grid$accuracy[i] <- fit$results$Accuracy 
#   tuning.grid$kappa[i] <- fit$results$Kappa 
# }
# 
# # Assess the top 10 models...
# tuning.grid %>%
#   arrange(-kappa) %>%
#   head(10)
# 
# tuning.grid$best_kappa = if_else(tuning.grid$kappa == max(tuning.grid$kappa), 1, 0)
# 
# ----------------
# the_pal <- c("cyan3", "deeppink")
# 
# p1 <- ggplot(tuning.grid, aes(x = `.cp`, y = kappa, size = best_kappa, fill = as_factor(best_kappa))) +
#   geom_point(
#         shape=21,
#         stroke = 0,
#         alpha = 0.5
#         ) +
#   facet_grid(maxdepth ~ minsplit) +
#   theme_light() +
#   scale_fill_manual(values = the_pal, labels = c("False", "True")) +
#   scale_size(range = c(1, 2), breaks = c(0,1), labels = c("False", "True")) +
#   labs(x = "cp", y = "kappa", size = "Highest Kappa", fill = "Highest Kappa", subtitle = "Faceted by minsplit (x-axis) & maxdepth (y-axis)")
# 
# ggsave(filename = paste0("./img/CP grid ", min(cp_values_grid), "-", max(cp_values_grid), ".png"), plot = p1, width = 16, height = 8)

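As an aside, a lighter-weight alternative worth noting (a sketch, under the assumption that a single pruned tree is an acceptable stand-in for the bagged ensemble): caret's "rpart" method tunes cp natively through tuneGrid, which avoids the manual loop entirely.

# Let caret's built-in resampling pick cp for a single CART tree;
# "rpart" exposes cp as its one tuning parameter, unlike "treebag".
rpart.cp.grid <- expand.grid(cp = seq(0.001, 0.05, length.out = 25))

rpart.cp.fit <- train(
  as_factor(DEMOLISHED) ~ .,
  data = portland.2014.2018.train,
  method = "rpart",
  trControl = ctrl,
  tuneGrid = rpart.cp.grid
)

rpart.cp.fit$bestTune  # the cp value with the best resampled performance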

Complexity Parameter Tuning Result

(Saved figure: kappa against cp, with panels by minsplit (columns) and maxdepth (rows), highlighting the single best-kappa combination.)


# 2019 data test with cp

portland.2019_tibb <- as_tibble(portland.2019)

best.accuracy <- tuning.grid[tuning.grid$accuracy == max(tuning.grid$accuracy), ]
best.kappa <- tuning.grid[tuning.grid$kappa == max(tuning.grid$kappa), ]

# Re-run model with best hyperparameters...
# Note: maxdepth is taken from the best-accuracy row while minsplit and cp
# come from the best-kappa row; mixing rows like this is easy to do by
# accident, so double-check that it is intentional.

best.fit <- train(
  as_factor(DEMOLISHED) ~ .,
  data = portland.2014.2018.train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE,
  maxdepth = best.accuracy$maxdepth,
  minsplit = best.kappa$minsplit,
  .cp = best.kappa$.cp,
  minbucket = 4)

best.fit
## Bagged CART 
## 
## 3837 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3453, 3453, 3454, 3453, 3454, 3453, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7711849  0.3909168
# The features included in the model...
best.fit$coefnames
## [1] "BLDGTOTVAL"  "PARCELSIZE"  "AGE_2014"    "WScore"      "TScore"     
## [6] "Pop.Density"
# Now use the best-fitting model to predict demolition status for
# the 2019 test set.

# Subset the 2019 data to the model's features (plus coordinates for mapping).
# Likely not strictly necessary, since predict() uses only the columns the
# model was trained on, but it keeps the object tidy.

portland.2019_hp <- portland.2019 %>%
  select(DEMOLISHED,
         BLDGTOTVAL,
         PARCELSIZE,
         AGE_2014,
         WScore,
         TScore,
         Pop.Density,
         X_COORD,
         Y_COORD)

best.fit.pred <-  predict(best.fit, newdata = portland.2019_hp)
caret::confusionMatrix(as_factor(best.fit.pred), as_factor(portland.2019$DEMOLISHED), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3571  109
##          1  409  105
##                                              
##                Accuracy : 0.8765             
##                  95% CI : (0.8662, 0.8863)   
##     No Information Rate : 0.949              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.2332             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.49065            
##             Specificity : 0.89724            
##          Pos Pred Value : 0.20428            
##          Neg Pred Value : 0.97038            
##              Prevalence : 0.05103            
##          Detection Rate : 0.02504            
##    Detection Prevalence : 0.12256            
##       Balanced Accuracy : 0.69395            
##                                              
##        'Positive' Class : 1                  
## 
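To keep those statistics from being a black box, here is a quick hand-check of the headline numbers straight from the 2x2 table above (TP = 105, FN = 109, FP = 409, TN = 3571):

# Recompute the key confusionMatrix() statistics by hand...
TP <- 105; FN <- 109; FP <- 409; TN <- 3571

TP / (TP + FN)                   # Sensitivity: 105/214 ≈ 0.4907
TN / (TN + FP)                   # Specificity: 3571/3980 ≈ 0.8972
TP / (TP + FP)                   # Pos Pred Value: 105/514 ≈ 0.2043
(TN + FP) / (TP + FN + FP + TN)  # No Information Rate: 3980/4194 ≈ 0.949

Note that the accuracy of 0.8765 sits below the no-information rate of 0.949, which is why P-Value [Acc > NIR] is 1: always predicting "not demolished" would score higher on raw accuracy, and that is exactly why kappa and sensitivity are the more honest yardsticks here.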
# Make point sf object from coordinates for mapping...
portland.2019.sf <- st_as_sf(portland.2019_hp, coords = c("X_COORD", "Y_COORD"), crs = st_crs(portland.demolitions))

# Map the model performance...
# plot.2019.test.set <- bind_cols(portland.2019_hp, best.fit.pred)
plot.2019.test.set_cp <- bind_cols(portland.2019.sf, best.fit.pred)
## New names:
## * `` -> ...9
names(plot.2019.test.set_cp)[dim(plot.2019.test.set_cp)[2]] <- "PREDICTED"

plot.2019.test.set_cp <- plot.2019.test.set_cp %>%
  mutate(PERFORM = case_when(
    ((as.character(DEMOLISHED) == as.character(PREDICTED)) ~ "Correct"), 
    ((as.character(DEMOLISHED) == 1 & as.character(PREDICTED) == 0) ~ "False Negative"), 
    ((as.character(DEMOLISHED) == 0 & as.character(PREDICTED) == 1) ~ "False Positive")))
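
A quick sanity check that the recoded PERFORM labels line up with the confusion matrix above:

table(plot.2019.test.set_cp$PERFORM)
# Expected from the confusion matrix: Correct = 3571 + 105 = 3676,
# False Negative = 109, False Positive = 409.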


# Plot model predictions with neighborhood boundaries...
neighborhoods_0 <- st_read("https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson")
## Reading layer `OGRGeoJSON' from data source 
##   `https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 98 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.8677 ymin: 45.43254 xmax: -122.4666 ymax: 45.65302
## Geodetic CRS:  WGS 84
neighborhoods <- st_transform(neighborhoods_0, crs = st_crs(portland.demolitions))

ggplot() + 
  geom_sf(data = plot.2019.test.set_cp, aes(fill = PERFORM), shape = 21, size = 3) +
  scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
  theme_void()

fp <- plot.2019.test.set_cp %>%
  filter(PERFORM == "False Positive") %>% 
ggplot() + 
  geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
  scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
  theme_void()

fn <- plot.2019.test.set_cp %>%
  filter(PERFORM == "False Negative") %>% 
ggplot() + 
  geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
  scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
  theme_void()

ggarrange(fp, fn)

# Plot model predictions with Google basemap...
plot.2019.test.set.wgs84_cp <- st_transform(plot.2019.test.set_cp, 4326)

ggmap(pdx_basemap) +
  geom_point(data = plot.2019.test.set.wgs84_cp, aes(color = PERFORM), 
    x = st_coordinates(plot.2019.test.set.wgs84_cp)[,1], 
    y = st_coordinates(plot.2019.test.set.wgs84_cp)[,2], size = 2) +
  scale_color_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
  theme_void()


# Merging spatial data: tally each prediction outcome within each neighborhood

neighborhoods$Correct <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set_cp, PERFORM == "Correct")))
neighborhoods$FALSE_N <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set_cp, PERFORM == "False Negative")))
neighborhoods$FALSE_P <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set_cp, PERFORM == "False Positive")))

neighborhoods$N_Predictions <- (neighborhoods$Correct + neighborhoods$FALSE_N + neighborhoods$FALSE_P)
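
The Q1 summaries earlier leaned on percentage columns; this small sketch mirrors how Pct_FALSE_N was presumably derived and adds the same rates for this cp run (the column names are assumptions following that earlier convention):

# Outcome counts as a share of each neighborhood's total predictions,
# guarding against division by zero in neighborhoods with no predictions.
neighborhoods <- neighborhoods %>%
  mutate(
    Pct_Correct = if_else(N_Predictions > 0, 100 * Correct / N_Predictions, NA_real_),
    Pct_FALSE_N = if_else(N_Predictions > 0, 100 * FALSE_N / N_Predictions, NA_real_),
    Pct_FALSE_P = if_else(N_Predictions > 0, 100 * FALSE_P / N_Predictions, NA_real_)
  )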
  
arrange(neighborhoods, desc(Correct)) %>%
  select(NAME,
         MAPLABEL,
         HORZ_VERT,
         Correct,
         FALSE_N,
         FALSE_P) %>%
  head()
arrange(neighborhoods, desc(FALSE_N)) %>%
  select(NAME,
         MAPLABEL,
         HORZ_VERT,
         Correct,
         FALSE_N,
         FALSE_P) %>%
  head()
arrange(neighborhoods, desc(FALSE_P)) %>%
  select(NAME,
         MAPLABEL,
         HORZ_VERT,
         Correct,
         FALSE_N,
         FALSE_P) %>%
  head()


plot.2019_nogeom <- st_drop_geometry(plot.2019.test.set.wgs84)
plot.2019_CP_nogeom <- st_drop_geometry(plot.2019.test.set.wgs84_cp)

# Results without cp
filter(plot.2019_nogeom, PERFORM == "Correct") %>%
  count()
filter(plot.2019_nogeom, PERFORM == "False Negative") %>%
  count()
filter(plot.2019_nogeom, PERFORM == "False Positive") %>%
  count()
filter(plot.2019_nogeom, PERFORM == "Correct") %>%
  count() / count(plot.2019_nogeom)
filter(plot.2019_nogeom, PERFORM == "False Negative") %>%
  count() / count(plot.2019_nogeom)
filter(plot.2019_nogeom, PERFORM == "False Positive") %>%
  count() / count(plot.2019_nogeom)
# Results with cp
filter(plot.2019_CP_nogeom, PERFORM == "Correct") %>%
  count()
filter(plot.2019_CP_nogeom, PERFORM == "False Negative") %>%
  count()
filter(plot.2019_CP_nogeom, PERFORM == "False Positive") %>%
  count()
filter(plot.2019_CP_nogeom, PERFORM == "Correct") %>%
  count() / count(plot.2019_CP_nogeom)
filter(plot.2019_CP_nogeom, PERFORM == "False Negative") %>%
  count() / count(plot.2019_CP_nogeom)
filter(plot.2019_CP_nogeom, PERFORM == "False Positive") %>%
  count() / count(plot.2019_CP_nogeom)
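
The twelve filter/count calls above get the job done; for comparison, here is a more compact equivalent using dplyr::count with a derived share column (same objects, same numbers):

# One pass per model run: counts and shares of each PERFORM outcome.
plot.2019_nogeom %>%
  count(PERFORM) %>%
  mutate(pct = 100 * n / sum(n))   # without cp

plot.2019_CP_nogeom %>%
  count(PERFORM) %>%
  mutate(pct = 100 * n / sum(n))   # with cp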


##Q2 Answer: Okay, let’s finally have a look. I used best.kappa to pick my cp for the model, because kappa adjusts accuracy for chance agreement, which seems like the fairer yardstick with classes this imbalanced. That said, I don’t think going with accuracy would have been wrong.

My hypothesis was that this wouldn’t have much of an impact on our number of correct predictions, because iterating over cp values barely moved the kappa or accuracy on the 2014-2018 data. So my hopes were not high.

With the complexity parameter included, the 2019 model’s number of correct predictions increased by a whopping 0.16% (an additional 6 correct predictions).

The number of false negatives decreased by 0.86% (1 fewer false negative).

The number of false positives decreased by 1.18% (5 fewer false positives).

So, in the end, we moved perhaps an inch in the right direction, which is about what you would expect given the marginal effect the cp iterations had on the 2014-2018 accuracy and kappa.

Question 3.

Please do not forget the Reflective Content portion of the Lab Assignment.


To use your language, Bev, the juice was not really worth the squeeze. I think the from-concentrate .cp would have been satisfactory. However, it was valuable to go through this process.

I’m definitely intrigued by it all. Obviously I’m not too concerned with the grade, but I decided to put a lot of extra time into this model and the lab as a whole, because it’s been the most exploratory yet (and I have a bit of extra time on my hands). There’s presumably much left to do to increase this model’s capability (or not?), though I’m not sure what the next steps toward better predictions would be.

But, overall, the lab as a whole was very worthwhile. It took me a while to figure out how I wanted to present my complexity parameter graphic for Q2, so I picked up some visual-design skills there (size, fill, palette, more faceting, themes). In addition, I timed the caret::train run across the 50 iterative cp values, and I think you are exactly right, Bev: this training workflow is best suited to a dummy computer that I won’t need to touch until the end of the day, or next week.

I thank you for putting this lab together, and I’ll have to push myself to keep exploring these tools into the summer. I have an idea of how I want to do that (sports), but it will be a lot of wrangling till then.


End of Work