The sf object called
portland.demolitions is based on an older version of this
dataset available on the Portland open data portal. If you are
curious about its attributes, take a look at the metadata there, but we
mainly want to plot the points to get a feel for the pattern of
residential demolitions over time in the city.
We also now have in memory a dataset with observations from 2014 to 2018 and a second dataset with observations from 2019. In both cases, the set of demolished properties was combined with a random sample of single-family (detached) residential properties that existed in 2014 but were not demolished, which serves as a control. A description of the columns (variables) contained in each of these datasets is given in the image below:
The BLDGTOTVAL feature is of particular interest due to its connection to the rent-gap theory of gentrification posited by Neil Smith in the late 1970s. The idea is that gentrification is more likely to occur in neighborhoods with a high proportion of real estate where the land is valued much more highly than the structure sitting on it; demolishing such housing units and replacing them with the “highest and best use” becomes a strategy for extracting profit from the city.
Let’s do some quick exploratory data analysis before proceeding.
# Read the Portland basemap so we do not have to retrieve it from the web
load(file = "./data/pdx_basemap.RData")
portland.demolitions.wgs84 <- st_transform(portland.demolitions, 4326)
demos.map <- ggmap(pdx_basemap) +
geom_point(data = portland.demolitions.wgs84, aes(color = as_factor(DEMOYEAR),
x = st_coordinates(portland.demolitions.wgs84)[,1],
y = st_coordinates(portland.demolitions.wgs84)[,2])) +
scale_color_brewer(palette = "Spectral") +
theme_classic() +
theme(axis.line = element_blank(), axis.text = element_blank(),
axis.ticks = element_blank(), axis.title = element_blank(),
legend.title = element_blank())
demos.hist <- ggplot(portland.demolitions.wgs84, aes(x = DEMOYEAR)) +
geom_histogram(bins = 6, fill = brewer.pal(n = 6, name = "Spectral"), color="#e9ecef", alpha=0.9) +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(face="bold", size=10)) +
scale_x_continuous(breaks = seq(2014, 2020, by = 1)) +
labs(x = "Year", y = "", caption = "Source: City of Portland Open Data Portal") +
theme_minimal() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())
ggarrange(demos.map, demos.hist)
## Warning: Removed 4 rows containing missing values (geom_point).
Now that we have a feel for where demolitions occurred in Portland and the volume by year, we can turn our attention to fitting models.
In the code chunk below, we use the set.seed function to
assign a starting value to the random number generator built into R so
that we can replicate the analysis later.
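As a quick illustration of why this matters (not part of the lab workflow itself), calling set.seed with the same value immediately before a random draw reproduces the same result every time:
# Minimal reproducibility check: the same seed yields the same random draws.
set.seed(227)
sample(1:10, 3)
set.seed(227)
sample(1:10, 3)  # identical to the draw above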
set.seed(227)
# Use rsample::initial_split to partition the 2014-2018 data.
portland.2014.2018.split <- initial_split(portland.2014.2018, prop = 0.7, strata = DEMOLISHED)
portland.2014.2018.train <- training(portland.2014.2018.split)
portland.2014.2018.test <- testing(portland.2014.2018.split)
# Check the distribution in original data and partitioned data
# and note that the proportion of demolished and non-demolished
# housing units is comparable in both.
prop.table(table(portland.2014.2018.test$DEMOLISHED)) * 100
##
## 0 1
## 72.661 27.339
prop.table(table(portland.2014.2018.train$DEMOLISHED)) * 100
##
## 0 1
## 72.687 27.313
# Next, choose the variables we want to use
portland.2014.2018.test <- subset(portland.2014.2018.test, select = c("DEMOLISHED", "BLDGTOTVAL", "PARCELSIZE", "AGE_2014", "WScore", "TScore", "Pop.Density", "X_COORD", "Y_COORD"))
portland.2014.2018.train <- subset(portland.2014.2018.train, select = c("DEMOLISHED", "BLDGTOTVAL", "PARCELSIZE", "AGE_2014",
"WScore", "TScore", "Pop.Density"))
This component of the Lab Assignment introduces classification tree models and applies this tool to the Portland housing unit demolitions dataset. For classification (i.e., categorical dependent variables), the decision tree algorithm predicts the probability that a given observation belongs to a particular class or category. In the code chunk below, we use the rpart package to fit a classification tree model where the outcome is demolition (i.e., was the dwelling unit demolished or not?), but this is just one of many options for fitting decision tree models in R (e.g., caret, ranger, h2o, etc.).
?rpart
#?caret
#?ranger
#?h2o
The caret::confusionMatrix function produces a variety
of metrics that help us to assess performance of the classification
model. In the actual confusion
matrix at the top of the output, the numbers on the diagonal
represent correctly classified observations, while the numbers off the
diagonal represent mis-classified observations. The accuracy
rate is given along with a confidence interval and p-value. The
kappa
statistic is also included. Note that for kappa, 1 means
perfect agreement and 0 is chance agreement. One rule of thumb suggests
that kappa values of 0.21-0.40 indicate fair agreement, 0.41-0.60 moderate,
0.61-0.80 substantial, and 0.81-1.00 almost perfect agreement.
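To make the kappa statistic concrete, here is a minimal sketch of how it is computed from a two-by-two confusion matrix (predictions in rows, reference in columns); the counts below are made up for illustration and are not output from our model:
# Hypothetical 2x2 confusion matrix (invented counts)
cm <- matrix(c(80, 10, 20, 40), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
n  <- sum(cm)
p.observed <- sum(diag(cm)) / n                     # observed agreement (accuracy)
p.expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
(p.observed - p.expected) / (1 - p.expected)        # Cohen's kappa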
Take a few moments to review the help documentation for the
rpart and the rpart.object functions, then
proceed with the code chunk below.
?rpart
?rpart.object
?rpart.control
# Because our dependent variable is categorical, we set
# the method argument to "class" for classification...
model.class <- rpart(
formula = as_factor(DEMOLISHED) ~ .,
data = portland.2014.2018.train,
method = "class",
minsplit = 4,
minbucket = 2
)
# Examine the output in both text and visual format...
model.class
## n= 3837
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3837 1048 0 (0.7268700 0.2731300)
## 2) BLDGTOTVAL>=0.4494291 2765 472 0 (0.8292948 0.1707052) *
## 3) BLDGTOTVAL< 0.4494291 1072 496 1 (0.4626866 0.5373134)
## 6) AGE_2014< 71.5 495 198 0 (0.6000000 0.4000000)
## 12) BLDGTOTVAL>=0.356339 343 103 0 (0.6997085 0.3002915) *
## 13) BLDGTOTVAL< 0.356339 152 57 1 (0.3750000 0.6250000)
## 26) WScore< 58.5 68 30 0 (0.5588235 0.4411765)
## 52) PARCELSIZE< 0.7652287 46 12 0 (0.7391304 0.2608696) *
## 53) PARCELSIZE>=0.7652287 22 4 1 (0.1818182 0.8181818) *
## 27) WScore>=58.5 84 19 1 (0.2261905 0.7738095) *
## 7) AGE_2014>=71.5 577 199 1 (0.3448873 0.6551127) *
rpart.plot(model.class)
# Retrieve more detailed info...
summary(model.class)
## Call:
## rpart(formula = as_factor(DEMOLISHED) ~ ., data = portland.2014.2018.train,
## method = "class", minsplit = 4, minbucket = 2)
## n= 3837
##
## CP nsplit rel error xerror xstd
## 1 0.08540076 0 1.0000000 1.0000000 0.02633587
## 2 0.03625954 2 0.8291985 0.8730916 0.02518799
## 3 0.01049618 3 0.7929389 0.8282443 0.02472907
## 4 0.01000000 5 0.7719466 0.8301527 0.02474920
##
## Variable importance
## BLDGTOTVAL AGE_2014 PARCELSIZE WScore TScore Pop.Density
## 67 11 7 6 5 4
##
## Node number 1: 3837 observations, complexity param=0.08540076
## predicted class=0 expected loss=0.27313 P(node) =1
## class counts: 2789 1048
## probabilities: 0.727 0.273
## left son=2 (2765 obs) right son=3 (1072 obs)
## Primary splits:
## BLDGTOTVAL < 0.4494291 to the right, improve=207.65020, (0 missing)
## AGE_2014 < 62.5 to the left, improve= 57.30428, (0 missing)
## WScore < 83.5 to the left, improve= 43.99917, (0 missing)
## PARCELSIZE < 0.9237271 to the left, improve= 24.36047, (0 missing)
## TScore < 40.5 to the left, improve= 10.14819, (0 missing)
## Surrogate splits:
## PARCELSIZE < 0.4473768 to the left, agree=0.727, adj=0.022, (0 split)
## AGE_2014 < 129.5 to the left, agree=0.722, adj=0.006, (0 split)
## WScore < 0.5 to the right, agree=0.721, adj=0.001, (0 split)
##
## Node number 2: 2765 observations
## predicted class=0 expected loss=0.1707052 P(node) =0.7206151
## class counts: 2293 472
## probabilities: 0.829 0.171
##
## Node number 3: 1072 observations, complexity param=0.08540076
## predicted class=1 expected loss=0.4626866 P(node) =0.2793849
## class counts: 496 576
## probabilities: 0.463 0.537
## left son=6 (495 obs) right son=7 (577 obs)
## Primary splits:
## AGE_2014 < 71.5 to the left, improve=34.680090, (0 missing)
## BLDGTOTVAL < 0.3563673 to the right, improve=34.351480, (0 missing)
## WScore < 75.5 to the left, improve=30.795290, (0 missing)
## TScore < 41.5 to the left, improve=13.175530, (0 missing)
## Pop.Density < 8781.018 to the left, improve= 7.933128, (0 missing)
## Surrogate splits:
## WScore < 69.5 to the left, agree=0.660, adj=0.263, (0 split)
## TScore < 42.5 to the left, agree=0.625, adj=0.188, (0 split)
## PARCELSIZE < 0.124915 to the right, agree=0.616, adj=0.168, (0 split)
## Pop.Density < 6327.815 to the left, agree=0.595, adj=0.123, (0 split)
## BLDGTOTVAL < 0.4125434 to the right, agree=0.559, adj=0.044, (0 split)
##
## Node number 6: 495 observations, complexity param=0.03625954
## predicted class=0 expected loss=0.4 P(node) =0.129007
## class counts: 297 198
## probabilities: 0.600 0.400
## left son=12 (343 obs) right son=13 (152 obs)
## Primary splits:
## BLDGTOTVAL < 0.356339 to the right, improve=22.210060, (0 missing)
## WScore < 75.5 to the left, improve=14.757670, (0 missing)
## TScore < 41.5 to the left, improve= 6.999092, (0 missing)
## PARCELSIZE < 0.9838875 to the left, improve= 5.392340, (0 missing)
## AGE_2014 < 59.5 to the left, improve= 5.251608, (0 missing)
## Surrogate splits:
## PARCELSIZE < 0.4214552 to the left, agree=0.727, adj=0.112, (0 split)
## Pop.Density < 1446.75 to the right, agree=0.723, adj=0.099, (0 split)
## TScore < 22 to the right, agree=0.715, adj=0.072, (0 split)
## WScore < 16 to the right, agree=0.713, adj=0.066, (0 split)
## AGE_2014 < 24 to the right, agree=0.705, adj=0.039, (0 split)
##
## Node number 7: 577 observations
## predicted class=1 expected loss=0.3448873 P(node) =0.1503779
## class counts: 199 378
## probabilities: 0.345 0.655
##
## Node number 12: 343 observations
## predicted class=0 expected loss=0.3002915 P(node) =0.08939275
## class counts: 240 103
## probabilities: 0.700 0.300
##
## Node number 13: 152 observations, complexity param=0.01049618
## predicted class=1 expected loss=0.375 P(node) =0.03961428
## class counts: 57 95
## probabilities: 0.375 0.625
## left son=26 (68 obs) right son=27 (84 obs)
## Primary splits:
## WScore < 58.5 to the left, improve=8.315826, (0 missing)
## TScore < 42.5 to the left, improve=7.527436, (0 missing)
## BLDGTOTVAL < 0.2415573 to the right, improve=3.952165, (0 missing)
## Pop.Density < 7845.726 to the left, improve=3.952165, (0 missing)
## PARCELSIZE < 0.1395381 to the right, improve=3.581510, (0 missing)
## Surrogate splits:
## TScore < 40.5 to the left, agree=0.842, adj=0.647, (0 split)
## Pop.Density < 4021.768 to the left, agree=0.816, adj=0.588, (0 split)
## PARCELSIZE < 0.2748356 to the right, agree=0.770, adj=0.485, (0 split)
## BLDGTOTVAL < 0.2760824 to the left, agree=0.586, adj=0.074, (0 split)
## AGE_2014 < 62.5 to the left, agree=0.566, adj=0.029, (0 split)
##
## Node number 26: 68 observations, complexity param=0.01049618
## predicted class=0 expected loss=0.4411765 P(node) =0.01772218
## class counts: 38 30
## probabilities: 0.559 0.441
## left son=52 (46 obs) right son=53 (22 obs)
## Primary splits:
## PARCELSIZE < 0.7652287 to the left, improve=9.244827, (0 missing)
## BLDGTOTVAL < 0.196914 to the right, improve=4.936308, (0 missing)
## Pop.Density < 3244.374 to the right, improve=3.258855, (0 missing)
## WScore < 13.5 to the right, improve=2.489412, (0 missing)
## TScore < 32.5 to the right, improve=1.922522, (0 missing)
## Surrogate splits:
## WScore < 16 to the right, agree=0.794, adj=0.364, (0 split)
## Pop.Density < 2729.434 to the right, agree=0.779, adj=0.318, (0 split)
## BLDGTOTVAL < 0.2257027 to the right, agree=0.765, adj=0.273, (0 split)
## TScore < 17 to the right, agree=0.765, adj=0.273, (0 split)
## AGE_2014 < 48.5 to the right, agree=0.721, adj=0.136, (0 split)
##
## Node number 27: 84 observations
## predicted class=1 expected loss=0.2261905 P(node) =0.0218921
## class counts: 19 65
## probabilities: 0.226 0.774
##
## Node number 52: 46 observations
## predicted class=0 expected loss=0.2608696 P(node) =0.01198853
## class counts: 34 12
## probabilities: 0.739 0.261
##
## Node number 53: 22 observations
## predicted class=1 expected loss=0.1818182 P(node) =0.005733646
## class counts: 4 18
## probabilities: 0.182 0.818
# Display the frame attribute
model.class$frame
In this simple model, we begin with 3,837 observations in the training set; the fitted tree has five internal nodes (including the root node) and six terminal nodes or leaves. As we move down the tree, the plot communicates which independent variables were chosen for splits, the values of those independent variables that optimized the split criterion (e.g., Gini impurity), the predicted class and predicted probability of demolition at each node, and the percentage of the observations that “flow” to each node.
As shown above, typing the name of the rpart object displays model information that mirrors what we see in the tree plot, but in text form. Note how the depth of the indent reflects the structure of the (plotted) tree. The elements shown for each node are the node number, the splitting rule, the number of observations (n), the number misclassified (loss), the predicted class (yval), and the class probabilities (yprob); an asterisk (*) denotes a terminal node.
For example, 73% of the observations had a BLDGTOTVAL >= 0.44 (i.e., the ratio of structure value to the total value of the property) and the predicted probability of demolition at that terminal node was 18% (actually 0.1775 in my run). For the remaining observations where BLDGTOTVAL < 0.44 (i.e., the structure itself accounted for less than 44% of the total value of the property), the predicted probability of demolition at that internal node was 54% (actually 0.5383 in my run). After the initial split (using the BLDGTOTVAL feature), the algorithm splits the node at the upper left based on the age of the housing unit in 2014 (i.e., the AGE_2014 feature), and 539 of the 1,018 observations at that node “flow” into the right-most terminal node if they are at least 72.5 years old; these are assigned a value of one (i.e., demolished) with a predicted probability of 66%. The remaining 479 observations “flow” into the opposite internal node, where the algorithm splits those data once more using the BLDGTOTVAL feature again, but with a different threshold value of 0.36. The terminal nodes in the middle might represent the distinction between a housing unit that is more than 73 years old and is (a) falling down in disrepair or (b) on the historic preservation registry, and the predicted probability of demolition reflects that important nuance.
Applying the summary function to an rpart
object displays more detailed information. As shown in the
documentation for the rpart.object function, the frame
attribute of an rpart object (i.e., model.class$frame)
is a “data frame with one row for each node in the tree.” The
var column lists variables used at each split,
n is the number of observations reaching that node,
wt is the sum of case weights (1 by default) for
observations reaching that node, and dev is the deviance,
a goodness-of-fit statistic that equals the sum of squared
residuals for linear models. The yval column is the
fitted value of the dependent variable at the node,
complexity is the complexity parameter at which this
split will collapse, ncompete is the number of
competitor splits recorded, and nsurrogate is the
number of surrogate splits recorded.
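For example, a compact way to see the pieces described above (using the column names documented in rpart.object) is to pull a few columns from the frame directly:
# Split variable, node size, deviance, and fitted class for each node...
model.class$frame[, c("var", "n", "dev", "yval")]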
Surrogate variables are used when a value is missing and so a split cannot be completed with the main variable. A surrogate is a different variable that is chosen to approximate the first-choice variable in a split.
The complexity parameter (cp) is used to control the
size of the decision tree and to select the optimal tree size. If the
cost of adding another variable to the decision tree from the current
node is above the value of cp, then tree building does
not continue. We could also say that tree construction does not continue
unless it would decrease the overall lack of fit by a factor of
cp. We can retrieve the cp value for
the model and set it to a value above the threshold for the second
BLDGTOTVAL split to demonstrate its effects like this:
model.class$control$cp
# Prune the model...
model.pruned <- prune(model.class, cp = 0.075)
model.pruned$frame
rpart.plot(model.pruned)
# We could also set this parameter explicitly and rerun the model:
temp <- rpart.control(cp = 0.075)
model.class.2 <- rpart(
formula = as_factor(DEMOLISHED) ~ .,
data = portland.2014.2018.train,
method = "class",
control=temp
)
model.class.2$frame
rpart.plot(model.class.2)
According to the rpart.object documentation, the
cp table is “a matrix of information on the optimal
prunings based on a complexity parameter.” Behind the scenes
rpart is automatically applying an array of cost complexity
values to prune the tree. The CP column is the value of
this parameter that would grow the tree to this size. The
nsplit column is the number of splits in a tree of that size,
rel error is the error for predictions of the data that
were used to estimate the model, while xerror is the
cross-validation error generated by the rpart built-in
cross validation routine and xstd gives us a sense of
its variability over the default 10 cross-validation folds (see
plotcp). Also note that (1 - relative error) is roughly
equal to the variance explained by the model.
The CP values control the size of the tree: the greater the CP value, the fewer the number of splits in the tree. The optimal size of the tree is generally the row in the CP table that minimizes the cross-validation error (xerror) with the fewest splits.
In the plot below, the dotted line refers to a recommendation from Breiman et al. (1984) to use the smallest tree within 1 standard deviation of the minimum cross validation error. You can think about the complexity parameter as the amount by which splitting that node improved the relative error.
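As a hedged sketch of that rule (using the cptable columns shown below), we could select the cp value for the smallest tree whose cross-validation error is within one standard deviation of the minimum xerror and prune with it:
# Apply the 1-SE rule to the cp table and prune accordingly...
cpt <- model.class$cptable
threshold <- min(cpt[, "xerror"]) + cpt[which.min(cpt[, "xerror"]), "xstd"]
cp.1se <- cpt[which(cpt[, "xerror"] <= threshold)[1], "CP"]
model.1se <- prune(model.class, cp = cp.1se)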
model.class$cptable
## CP nsplit rel error xerror xstd
## 1 0.08540076 0 1.0000000 1.0000000 0.02633587
## 2 0.03625954 2 0.8291985 0.8730916 0.02518799
## 3 0.01049618 3 0.7929389 0.8282443 0.02472907
## 4 0.01000000 5 0.7719466 0.8301527 0.02474920
plotcp(model.class)
model.class$splits
## count ncat improve index adj
## BLDGTOTVAL 3837 1 207.6502413 0.4494291 0.0000000000
## AGE_2014 3837 -1 57.3042763 62.5000000 0.0000000000
## WScore 3837 -1 43.9991691 83.5000000 0.0000000000
## PARCELSIZE 3837 -1 24.3604742 0.9237271 0.0000000000
## TScore 3837 -1 10.1481898 40.5000000 0.0000000000
## PARCELSIZE 0 -1 0.7268700 0.4473768 0.0223880597
## AGE_2014 0 -1 0.7221788 129.5000000 0.0055970149
## WScore 0 1 0.7208757 0.5000000 0.0009328358
## AGE_2014 1072 -1 34.6800900 71.5000000 0.0000000000
## BLDGTOTVAL 1072 1 34.3514786 0.3563673 0.0000000000
## WScore 1072 -1 30.7952875 75.5000000 0.0000000000
## TScore 1072 -1 13.1755326 41.5000000 0.0000000000
## Pop.Density 1072 -1 7.9331276 8781.0175000 0.0000000000
## WScore 0 -1 0.6595149 69.5000000 0.2626262626
## TScore 0 -1 0.6250000 42.5000000 0.1878787879
## PARCELSIZE 0 1 0.6156716 0.1249150 0.1676767677
## Pop.Density 0 -1 0.5951493 6327.8150000 0.1232323232
## BLDGTOTVAL 0 1 0.5587687 0.4125434 0.0444444444
## BLDGTOTVAL 495 1 22.2100583 0.3563390 0.0000000000
## WScore 495 -1 14.7576745 75.5000000 0.0000000000
## TScore 495 -1 6.9990921 41.5000000 0.0000000000
## PARCELSIZE 495 -1 5.3923404 0.9838875 0.0000000000
## AGE_2014 495 -1 5.2516080 59.5000000 0.0000000000
## PARCELSIZE 0 -1 0.7272727 0.4214552 0.1118421053
## Pop.Density 0 1 0.7232323 1446.7500000 0.0986842105
## TScore 0 1 0.7151515 22.0000000 0.0723684211
## WScore 0 1 0.7131313 16.0000000 0.0657894737
## AGE_2014 0 1 0.7050505 24.0000000 0.0394736842
## WScore 152 -1 8.3158263 58.5000000 0.0000000000
## TScore 152 -1 7.5274363 42.5000000 0.0000000000
## BLDGTOTVAL 152 1 3.9521645 0.2415573 0.0000000000
## Pop.Density 152 -1 3.9521645 7845.7260000 0.0000000000
## PARCELSIZE 152 1 3.5815096 0.1395381 0.0000000000
## TScore 0 -1 0.8421053 40.5000000 0.6470588235
## Pop.Density 0 -1 0.8157895 4021.7680000 0.5882352941
## PARCELSIZE 0 1 0.7697368 0.2748356 0.4852941176
## BLDGTOTVAL 0 -1 0.5855263 0.2760824 0.0735294118
## AGE_2014 0 -1 0.5657895 62.5000000 0.0294117647
## PARCELSIZE 68 -1 9.2448268 0.7652287 0.0000000000
## BLDGTOTVAL 68 1 4.9363083 0.1969140 0.0000000000
## Pop.Density 68 1 3.2588547 3244.3735000 0.0000000000
## WScore 68 1 2.4894118 13.5000000 0.0000000000
## TScore 68 1 1.9225222 32.5000000 0.0000000000
## WScore 0 1 0.7941176 16.0000000 0.3636363636
## Pop.Density 0 1 0.7794118 2729.4345000 0.3181818182
## BLDGTOTVAL 0 1 0.7647059 0.2257027 0.2727272727
## TScore 0 1 0.7647059 17.0000000 0.2727272727
## AGE_2014 0 1 0.7205882 48.5000000 0.1363636364
The splits attribute is a matrix with one row per split (primary, competitor, and surrogate), and the row label is the name of the split variable. The count column is the number of observations sent left or right by the split (for competitor splits, the number that would have been sent left or right had that split been used; for surrogate splits, the number missing the primary split variable that were decided using the surrogate). The ncat column is the number of categories or levels for the variable (+/-1 for a continuous variable). The improve column is the improvement in deviance given by the split or, for surrogates, the concordance of the surrogate with the primary split, and index is the numeric split point. The last column, adj, gives the adjusted concordance for surrogate splits. For a continuous variable, the sign of ncat determines whether the subset x < cutpoint or x > cutpoint is sent to the left. The improve and index columns are typically the most interesting here.
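For instance, the improvement and cutpoint for the primary split at the root node can be read off the first row of the matrix shown above:
model.class$splits[1, c("improve", "index")]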
A case weight is a non-negative numeric variable that indicates the importance of each case. There are three types of case weights: frequencies, sampling weights, and variance weights.
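If we ever needed them, case weights can be supplied to rpart through its weights argument. A hypothetical example (not used in this lab) that gives demolished observations twice the weight of the control observations might look like this:
# Illustrative frequency-style case weights: demolished units weighted 2x...
w <- ifelse(portland.2014.2018.train$DEMOLISHED == 1, 2, 1)
model.weighted <- rpart(as_factor(DEMOLISHED) ~ .,
                        data = portland.2014.2018.train,
                        weights = w, method = "class")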
model.class.pred <- predict(model.class, newdata = portland.2014.2018.test)
caret::RMSE(pred = model.class.pred, obs = portland.2014.2018.test$DEMOLISHED)
## [1] 0.5819374
model.class.pred.tb <- as_tibble(model.class.pred)
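# Convert the predicted probability of demolition (column 2) to a 0/1 outcome.
# Note the very low 0.05 probability cutoff used here; 0.5 is the more
# conventional threshold for a binary classifier.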
model.class.pred.tb$outcome <- ifelse((model.class.pred.tb[, 2] > 0.05), 1, 0)
caret::confusionMatrix(as_factor(model.class.pred.tb$outcome), as_factor(portland.2014.2018.test$DEMOLISHED), positive = "1")
## Warning in confusionMatrix.default(as_factor(model.class.pred.tb$outcome), :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 1196 450
##
## Accuracy : 0.2734
## 95% CI : (0.252, 0.2956)
## No Information Rate : 0.7266
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.2734
## Neg Pred Value : NaN
## Prevalence : 0.2734
## Detection Rate : 0.2734
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
Based on the results from the above code chunk, this model does not fit the validation set very well. We could modify the features (independent variables) we have included (see the “Fit Initial Classification Tree” chunk) to try to improve our predictive capacity. Alternatively, we could pursue one of the strategies that you encountered in the assigned reading: bootstrap aggregating, also known as “bagging”. This technique involves repeatedly sampling the training data with replacement and fitting the model to each resample as a means of improving the stability and predictive power of our tree-based model.
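Before handing this off to caret, here is a minimal “by hand” sketch of the bagging idea, for intuition only; the number of resamples B and the majority-vote step are arbitrary choices for illustration, not part of the lab workflow:
# Fit an rpart tree to B bootstrap resamples of the training data and take a
# majority vote across the resulting trees for each test observation...
B <- 25
boot.preds <- sapply(seq_len(B), function(b) {
  idx  <- sample(nrow(portland.2014.2018.train), replace = TRUE)
  tree <- rpart(as_factor(DEMOLISHED) ~ .,
                data = portland.2014.2018.train[idx, ],
                method = "class")
  as.character(predict(tree, newdata = portland.2014.2018.test, type = "class"))
})
bagged.vote <- apply(boot.preds, 1, function(x) names(which.max(table(x))))
table(bagged.vote, portland.2014.2018.test$DEMOLISHED)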
Here we pivot into using the caret package because of its
ease of modifying training parameters and because it supports bagging.
The trainControl function can be accessed directly or from
within the train function via the trControl
argument. In the example below we are setting up 10-fold
cross-validation, which happens to be one of the more commonly used
resampling schemes. This means the algorithm will use 90 percent of the data
to create a model and 10 percent to test it; the model is then refit with a
different 10 percent held out, and this is repeated 10 times until all of the
data has been used as both training and test data.
The preProcess argument can be used to scale, center, or
otherwise transform the input data. The tuneLength
parameter tells the algorithm to try different default values
(essentially automating the hyperparameter tuning) and the
tuneGrid parameter lets us decide which values the main
parameter will take:
tuneGrid = expand.grid(k = c(5, 11, 21, 25)). This is how
you could fit a different model than the “best-fitting” one identified
by the algorithm. The expand.grid function was used in the
random
forest reading that we did for this week as a way to “tune” the hyperparameters
and increase the overall performance of the model beyond feature engineering and
feature selection.
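As a hedged illustration (we do not run this model in the lab), a custom grid like the one above could be passed to caret::train for a k-nearest neighbors model, where k is that method's single tuning parameter:
# Illustrative only: a knn model with a user-supplied tuning grid...
knn.fit <- train(
  as_factor(DEMOLISHED) ~ .,
  data = portland.2014.2018.train,
  method = "knn",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = expand.grid(k = c(5, 11, 21, 25))
)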
# Specify the standard 10-fold cross validation
ctrl <- caret::trainControl(method = "cv", number = 10)
# Fit the bagged model by setting the method argument to
# "treebag" as shown below...
model.bagged <- train(
as_factor(DEMOLISHED) ~ .,
data = portland.2014.2018.train,
method = "treebag",
trControl = ctrl,
importance = TRUE,
minsplit = 4,
minbucket = 2
)
# Assess the results
model.bagged
## Bagged CART
##
## 3837 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3453, 3454, 3454, 3454, 3453, 3453, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7763898 0.396478
# plot most important variables
plot(varImp(model.bagged), 3)
plot(varImp(model.bagged), 5)
# Make predictions using the ensemble "bagged" model...
pred <- predict(model.bagged, portland.2014.2018.test)
caret::confusionMatrix(as_factor(pred), as_factor(portland.2014.2018.test$DEMOLISHED), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1067 232
## 1 129 218
##
## Accuracy : 0.7807
## 95% CI : (0.7599, 0.8005)
## No Information Rate : 0.7266
## P-Value [Acc > NIR] : 0.00000028355
##
## Kappa : 0.4055
##
## Mcnemar's Test P-Value : 0.00000007943
##
## Sensitivity : 0.4844
## Specificity : 0.8921
## Pos Pred Value : 0.6282
## Neg Pred Value : 0.8214
## Prevalence : 0.2734
## Detection Rate : 0.1324
## Detection Prevalence : 0.2108
## Balanced Accuracy : 0.6883
##
## 'Positive' Class : 1
##
#previous caret c. matrix
caret::confusionMatrix(as_factor(model.class.pred.tb$outcome), as_factor(portland.2014.2018.test$DEMOLISHED), positive = "1")
## Warning in confusionMatrix.default(as_factor(model.class.pred.tb$outcome), :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 1196 450
##
## Accuracy : 0.2734
## 95% CI : (0.252, 0.2956)
## No Information Rate : 0.7266
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.2734
## Neg Pred Value : NaN
## Prevalence : 0.2734
## Detection Rate : 0.2734
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
As we can see, the “bagging” technique allowed us to identify a model with greater predictive capacity. But we can also engage in hyperparameter tuning to squeeze even more predictive capability out of the model.
First, consider what the tunable parameters are for
rpart objects by referring
to the documentation of this massive list
of algorithms that can be fit using caret::train as
well as the tuning parameters for each. The maxdepth
parameter represents the maximum depth of any node of the final tree,
with the root node counted as depth 0. The minsplit
parameter is the minimum number of observations that must exist in a
node in order for a split to be attempted. The
minbucket parameter is the minimum number of
observations in any terminal node and is not being tuned in the chunk
below.
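For reference, these same names map onto rpart.control arguments; a hypothetical control object capping the tree at depth 3, with at least 10 observations required to attempt a split and at least 5 per terminal node, could look like this:
rpart.control(maxdepth = 3, minsplit = 10, minbucket = 5)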
# Create hyperparameter grid... which is a dataframe
tuning.grid <- expand.grid(
maxdepth = c(1, 3, 5, 8, 15),
minsplit = c(2, 5, 10, 15),
accuracy = NA,
kappa = NA
)
# Execute full cartesian grid search...
for(i in seq_len(nrow(tuning.grid))) {
# Fit model for each hyperparameter combination...
fit <- train(
as_factor(DEMOLISHED) ~ .,
data = portland.2014.2018.train,
method = "treebag",
trControl = ctrl,
importance = TRUE,
maxdepth = tuning.grid$maxdepth[i],
minsplit = tuning.grid$minsplit[i],
minbucket = 4
)
# Save fit metrics from each model...
tuning.grid$accuracy[i] <- fit$results$Accuracy
tuning.grid$kappa[i] <- fit$results$Kappa
}
# assess top 10 models...
tuning.grid %>%
arrange(-accuracy) %>%
head(10)
# # Visualize how fit metrics change over
# # different combinations of selected hyperparameters...
# plot.depth <- ggplot() +
# geom_point(data = tuning.grid, mapping = aes(y = kappa, x = maxdepth, fill = "dodgerblue"),
# color = "royalblue", size = 3, shape = 21, alpha = 0.5) +
# geom_point(data = tuning.grid, mapping = aes(y = accuracy, x = maxdepth, fill = "red"),
# color = "pink", size = 3, shape = 21, alpha = 0.5) +
# scale_fill_manual(name = "", values = c("dodgerblue", "red"), labels = c("Kappa", "Accuracy")) +
# xlab("Maximum Depth") +
# ylab("Metric") +
# theme_minimal()
#
#
# plot.split <- ggplot() +
# geom_point(data = tuning.grid, mapping = aes(y = kappa, x = minsplit, fill = "dodgerblue"),
# color = "royalblue", size = 3, shape = 21, alpha = 0.5) +
# geom_point(data = tuning.grid, mapping = aes(y = accuracy, x = minsplit, fill = "red"),
# color = "pink", size = 3, shape = 21, alpha = 0.5) +
# scale_fill_manual(name = "", values = c("dodgerblue", "red"), labels = c("Kappa", "Accuracy")) +
# xlab("Minimum Split") +
# ylab("Metric") +
# theme_minimal()
#
# ggarrange(plot.depth, plot.split, common.legend = TRUE)
#
# ggsave("hyperparameters_tuning.png", width = 18, height = 8)
#
#
# plot.accuracy <- ggplot() +
# geom_point(data = tuning.grid, mapping = aes(y = accuracy, x = minsplit, fill = as_factor(maxdepth)),
# color = "royalblue", size = 3, shape = 21, alpha = 0.5) +
# scale_fill_manual(name = "Maximum Depth", values = c(rainbow(5))) +
# scale_x_continuous(breaks = c(2, 5, 10, 15), labels = c(2, 5, 10, 15), limits = c(0, 15)) +
# ylab("Accuracy")
#
#
#
# plot.kappa <- ggplot() +
# geom_point(data = tuning.grid, mapping = aes(y = kappa, x = minsplit, fill = as_factor(maxdepth)),
# color = "royalblue", size = 3, shape = 21, alpha = 0.5) +
# scale_fill_manual(name = "Maximum Depth", values = c(rainbow(5))) +
# scale_x_continuous(breaks = c(2, 5, 10, 15), labels = c(2, 5, 10, 15), limits = c(0, 15)) +
# ylab("Kappa")
#
# ggarrange(plot.accuracy, plot.kappa, common.legend = TRUE)
#
# ggsave("hyperparameters_tuning_with_fill.png", width = 18, height = 8)
The graphics above simply show how the fit metrics vary across different combinations of the selected hyperparameters. Now, we can take the best model and see how it performs on the test set.
best.accuracy <- tuning.grid[tuning.grid$accuracy == max(tuning.grid$accuracy), ]
best.kappa <- tuning.grid[tuning.grid$kappa == max(tuning.grid$kappa), ]
# Re-run model with best hyperparameters...
best.fit <- train(
as_factor(DEMOLISHED) ~ .,
data = portland.2014.2018.train,
method = "treebag",
trControl = ctrl,
importance = TRUE,
maxdepth = best.accuracy$maxdepth,
minsplit = best.kappa$minsplit,
minbucket = 4)
best.fit
## Bagged CART
##
## 3837 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3453, 3453, 3453, 3453, 3453, 3454, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7789844 0.4080063
# The features included in the model...
best.fit$coefnames
## [1] "BLDGTOTVAL" "PARCELSIZE" "AGE_2014" "WScore" "TScore"
## [6] "Pop.Density"
# Now use the best fitting model to predict demolition status for
# the test set.
best.fit.pred <- predict(best.fit, newdata = portland.2014.2018.test)
caret::confusionMatrix(as_factor(best.fit.pred), as_factor(portland.2014.2018.test$DEMOLISHED), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1072 237
## 1 124 213
##
## Accuracy : 0.7807
## 95% CI : (0.7599, 0.8005)
## No Information Rate : 0.7266
## P-Value [Acc > NIR] : 0.000000283554
##
## Kappa : 0.4011
##
## Mcnemar's Test P-Value : 0.000000003753
##
## Sensitivity : 0.4733
## Specificity : 0.8963
## Pos Pred Value : 0.6320
## Neg Pred Value : 0.8189
## Prevalence : 0.2734
## Detection Rate : 0.1294
## Detection Prevalence : 0.2047
## Balanced Accuracy : 0.6848
##
## 'Positive' Class : 1
##
# Make point sf object from coordinates for mapping...
portland.2014.2018.test.sf <- st_as_sf(portland.2014.2018.test, coords = c("X_COORD", "Y_COORD"), crs = st_crs(portland.demolitions))
# Map the model performance...
# plot.2014.2018.test.set <- bind_cols(portland.2014.2018.test, best.fit.pred)
plot.2014.2018.test.set <- bind_cols(portland.2014.2018.test.sf, best.fit.pred)
## New names:
## * `` -> ...9
names(plot.2014.2018.test.set)[dim(plot.2014.2018.test.set)[2]] <- "PREDICTED"
plot.2014.2018.test.set <- plot.2014.2018.test.set %>%
mutate(PERFORM = case_when(
((as.character(DEMOLISHED) == as.character(PREDICTED)) ~ "Correct"),
((as.character(DEMOLISHED) == 1 & as.character(PREDICTED) == 0) ~ "False Negative"),
((as.character(DEMOLISHED) == 0 & as.character(PREDICTED) == 1) ~ "False Positive")))
# Plot model predictions with neighborhood boundaries...
neighborhoods_0 <- st_read("https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson")
## Reading layer `OGRGeoJSON' from data source
## `https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson'
## using driver `GeoJSON'
## Simple feature collection with 98 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.8677 ymin: 45.43254 xmax: -122.4666 ymax: 45.65302
## Geodetic CRS: WGS 84
neighborhoods <- st_transform(neighborhoods_0, crs = st_crs(portland.demolitions))
ggplot() +
geom_sf(data = plot.2014.2018.test.set, aes(fill = PERFORM), shape = 21, size = 3) +
scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
theme_void()
fp <- plot.2014.2018.test.set %>%
filter(PERFORM == "False Positive") %>%
ggplot() +
geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
theme_void()
fn <- plot.2014.2018.test.set %>%
filter(PERFORM == "False Negative") %>%
ggplot() +
geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
theme_void()
ggarrange(fp, fn)
# Plot model predictions with Google basemap...
plot.2014.2018.test.set.wgs84 <- st_transform(plot.2014.2018.test.set, 4326)
ggmap(pdx_basemap) +
geom_point(data = plot.2014.2018.test.set.wgs84, aes(color = PERFORM),
x = st_coordinates(plot.2014.2018.test.set.wgs84)[,1],
y = st_coordinates(plot.2014.2018.test.set.wgs84)[,2], size = 2) +
scale_color_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
theme_void()
There are several resources available that go into greater detail with bootstrap aggregation including this post from R Bloggers and this tutorial. As mentioned in class, bagging is a technique that combines the predictions of many, many high variance models (i.e., decision trees) in order to enhance predictive capacity. If you were doing this kind of work professionally or for a thesis, you might combine bagging with boosting which iteratively attempts to correct the errors of the previous model by weighting incorrect predictions more heavily as the overall model is refined. The very next chapter in the Boehmke and Greenwell (2020) text covers this in detail.
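As a hedged pointer only (and outside the scope of this lab), a boosted tree could be fit through the same caret interface; the method and settings below are one plausible choice, commented out so it is not run here, rather than a prescribed workflow:
# model.boosted <- train(
#   as_factor(DEMOLISHED) ~ .,
#   data = portland.2014.2018.train,
#   method = "gbm",
#   trControl = ctrl,
#   verbose = FALSE
# )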
You have the foundational concepts and skills needed to take a deeper dive here, if you like. Perhaps over the summer you may decide to complete the Machine Learning Crash Course from Google that has become something of a standard. The assumption is that you are on team Python but all of these techniques are available in R as well.
For this portion of the Lab Assignment, you are asked to experiment with the classification model presented here.
portland.2019_tibb <- as_tibble(portland.2019)
best.accuracy <- tuning.grid[tuning.grid$accuracy == max(tuning.grid$accuracy), ]
best.kappa <- tuning.grid[tuning.grid$kappa == max(tuning.grid$kappa), ]
# Re-run model with best hyperparameters...
best.fit <- train(
as_factor(DEMOLISHED) ~ .,
data = portland.2014.2018.train,
method = "treebag",
trControl = ctrl,
importance = TRUE,
maxdepth = best.accuracy$maxdepth,
minsplit = best.kappa$minsplit,
minbucket = 4)
best.fit
## Bagged CART
##
## 3837 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3453, 3453, 3454, 3453, 3453, 3453, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7745734 0.4001762
# The features included in the model...
best.fit$coefnames
## [1] "BLDGTOTVAL" "PARCELSIZE" "AGE_2014" "WScore" "TScore"
## [6] "Pop.Density"
# Now use the best fitting model to predict demolition status for
# the test set.
# Subset the 2019 data to the same features used in training (plus coordinates for mapping) - do I need to do this?
portland.2019_hp <- portland.2019 %>%
select(DEMOLISHED,
BLDGTOTVAL,
PARCELSIZE,
AGE_2014,
WScore,
TScore,
Pop.Density,
X_COORD,
Y_COORD)
best.fit.pred <- predict(best.fit, newdata = portland.2019_hp)
caret::confusionMatrix(as_factor(best.fit.pred), as_factor(portland.2019$DEMOLISHED), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3556 114
## 1 424 100
##
## Accuracy : 0.8717
## 95% CI : (0.8612, 0.8817)
## No Information Rate : 0.949
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2141
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.46729
## Specificity : 0.89347
## Pos Pred Value : 0.19084
## Neg Pred Value : 0.96894
## Prevalence : 0.05103
## Detection Rate : 0.02384
## Detection Prevalence : 0.12494
## Balanced Accuracy : 0.68038
##
## 'Positive' Class : 1
##
# Make point sf object from coordinates for mapping...
portland.2019.sf <- st_as_sf(portland.2019_hp, coords = c("X_COORD", "Y_COORD"), crs = st_crs(portland.demolitions))
# Map the model performance...
# plot.2019.test.set <- bind_cols(portland.2019_hp, best.fit.pred)
plot.2019.test.set<- bind_cols(portland.2019.sf, best.fit.pred)
## New names:
## * `` -> ...9
names(plot.2019.test.set)[dim(plot.2019.test.set)[2]] <- "PREDICTED"
plot.2019.test.set <- plot.2019.test.set %>%
mutate(PERFORM = case_when(
((as.character(DEMOLISHED) == as.character(PREDICTED)) ~ "Correct"),
((as.character(DEMOLISHED) == 1 & as.character(PREDICTED) == 0) ~ "False Negative"),
((as.character(DEMOLISHED) == 0 & as.character(PREDICTED) == 1) ~ "False Positive")))
# Plot model predictions with neighborhood boundaries...
neighborhoods_0 <- st_read("https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson")
## Reading layer `OGRGeoJSON' from data source
## `https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson'
## using driver `GeoJSON'
## Simple feature collection with 98 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.8677 ymin: 45.43254 xmax: -122.4666 ymax: 45.65302
## Geodetic CRS: WGS 84
neighborhoods <- st_transform(neighborhoods_0, crs = st_crs(portland.demolitions))
ggplot() +
geom_sf(data = plot.2019.test.set, aes(fill = PERFORM), shape = 21, size = 3) +
scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
theme_void()
fp <- plot.2019.test.set %>%
filter(PERFORM == "False Positive") %>%
ggplot() +
geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
theme_void()
fn <- plot.2019.test.set %>%
filter(PERFORM == "False Negative") %>%
ggplot() +
geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
theme_void()
ggarrange(fp, fn)
# Plot model predictions with Google basemap...
plot.2019.test.set.wgs84 <- st_transform(plot.2019.test.set, 4326)
ggmap(pdx_basemap) +
geom_point(data = plot.2019.test.set.wgs84, aes(color = PERFORM),
x = st_coordinates(plot.2019.test.set.wgs84)[,1],
y = st_coordinates(plot.2019.test.set.wgs84)[,2], size = 2) +
scale_color_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
theme_void()
+ Does the model perform as well “in the wild” with these 2019
data as it did on the test set?
#practice results
filter(plot.2014.2018.test.set.wgs84, PERFORM == "Correct") %>%
count() / count(plot.2014.2018.test.set.wgs84)
filter(plot.2014.2018.test.set.wgs84, PERFORM == "False Negative") %>%
count() / count(plot.2014.2018.test.set.wgs84)
filter(plot.2014.2018.test.set.wgs84, PERFORM == "False Positive") %>%
count() / count(plot.2014.2018.test.set.wgs84)
#performance results
filter(plot.2019.test.set.wgs84, PERFORM == "Correct") %>%
count() / count(plot.2019.test.set.wgs84)
filter(plot.2019.test.set.wgs84, PERFORM == "False Negative") %>%
count() / count(plot.2019.test.set.wgs84)
filter(plot.2019.test.set.wgs84, PERFORM == "False Positive") %>%
count() / count(plot.2019.test.set.wgs84)
##Q1 Answer 1/3: If I did this correctly (my best.fit object was still attached to the 2014-2018 data when applying it to the 2019 “wild” data), it appears our model performs better on the wild data (88% of demolition predictions correct) than it did on the 2014-2018 test data (78%). Concerning false positives and negatives, given our 2019 model’s increased predictive performance, false positives saw an unforeseen increase (from 8% to 10%), while false negatives saw a drastic decrease (from 14% to 2%). It would seem our 2019 model really nailed it concerning those false negatives - in reality, I guess this would mean a building could sleep comfortably (joke) if our model says it will not be demolished.
#since I'm pretty comfortable with merging spatial data - thanks to my term project - I'll go extra on this one
neighborhoods$Correct <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set, PERFORM == "Correct")))
neighborhoods$FALSE_N <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set, PERFORM == "False Negative")))
neighborhoods$FALSE_P <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set, PERFORM == "False Positive")))
neighborhoods$N_Predictions <- (neighborhoods$Correct + neighborhoods$FALSE_N + neighborhoods$FALSE_P)
arrange(neighborhoods, desc(Correct)) %>%
select(NAME,
MAPLABEL,
HORZ_VERT,
Correct,
FALSE_N,
FALSE_P) %>%
head()
arrange(neighborhoods, desc(FALSE_N)) %>%
select(NAME,
MAPLABEL,
HORZ_VERT,
Correct,
FALSE_N,
FALSE_P) %>%
head()
arrange(neighborhoods, desc(FALSE_P)) %>%
select(NAME,
MAPLABEL,
HORZ_VERT,
Correct,
FALSE_N,
FALSE_P) %>%
head()
##Q1 Answer 2/3: The CENTENNIAL COMMUNITY ASSOCIATION neighborhood was the best ‘performed on’ neighborhood with 167 correct predictions and only 7 incorrect predictions (all being false positive). The Arbor Lodge, Woodstock, and Overlook neighborhoods all had the highest number of false negatives (8), while Montavilla had the most false positives (20).
library(viridis)
## Loading required package: viridisLite
library(ggthemes)
neighborhoods <- neighborhoods %>%
mutate(Pct_Correct = (Correct / N_Predictions)) %>%
mutate(Pct_FALSE_P = (FALSE_P / N_Predictions)) %>%
mutate(Pct_FALSE_N = (FALSE_N / N_Predictions))
neighborhoods.wgs84 <- st_transform(neighborhoods, 4326)
#pdx_basemap.wgs84 <- st_transform(pdx_basemap, 4326) - this does not work (the basemap is a ggmap raster, not an sf object)
ggmap(pdx_basemap) +
geom_sf(data = neighborhoods.wgs84,
mapping = aes(fill = Pct_Correct), inherit.aes = FALSE,
color = "gray90", size = 0.25, alpha = 0.75) +
# scale_fill_viridis(option = "plasma") +
scale_fill_viridis(option = "E", direction = 1) +
# scale_fill_distiller(palette = "RdPu", direction = 1) +
theme_map() +
theme(legend.position = "bottom",
strip.background = element_blank()) +
labs(fill = "Percent Correct",
title = "Percent Correct - Predicted Demolitions",
subtitle = "By Neighborhood, Portland 2019 Data")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
neighborhoods.wgs84 %>%
arrange(desc(Pct_Correct)) %>%
select(MAPLABEL,
N_Predictions,
Pct_Correct) %>%
head()
ggmap(pdx_basemap) +
geom_sf(data = neighborhoods.wgs84,
mapping = aes(fill = Pct_FALSE_P), inherit.aes = FALSE,
color = "gray90", size = 0.25, alpha = 0.75) +
# scale_fill_viridis(option = "plasma") +
scale_fill_viridis(option = "E", direction = 1) +
# scale_fill_distiller(palette = "RdPu", direction = 1) +
theme_map() +
theme(legend.position = "bottom",
strip.background = element_blank()) +
labs(fill = "Percent False Positive",
title = "Percent False Positive - Predicted Demolitions",
subtitle = "By Neighborhood, Portland 2019 Data")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
neighborhoods.wgs84 %>%
arrange(desc(Pct_FALSE_P)) %>%
select(MAPLABEL,
N_Predictions,
Pct_FALSE_P) %>%
head()
ggmap(pdx_basemap) +
geom_sf(data = neighborhoods.wgs84,
mapping = aes(fill = Pct_FALSE_N), inherit.aes = FALSE,
color = "gray90", size = 0.25, alpha = 0.75) +
# scale_fill_viridis(option = "plasma") +
scale_fill_viridis(option = "E", direction = 1) +
# scale_fill_distiller(palette = "RdPu", direction = 1) +
theme_map() +
theme(legend.position = "bottom",
strip.background = element_blank()) +
labs(fill = "Percent False Negative",
title = "Percent False Negative - Predicted Demolitions",
subtitle = "By Neighborhood, Portland 2019 Data")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
neighborhoods.wgs84 %>%
arrange(desc(Pct_FALSE_N)) %>%
select(MAPLABEL,
N_Predictions,
Pct_FALSE_N) %>%
head()
##Q1 Answer 3/3: Looking at the percent of predictions that were correct, false positive, and false negative, there were some clear outliers visible across the Portland neighborhoods. Much of the eastern side of the city fared really well with our model, with most of these neighborhoods displaying bright yellow colors indicating near or at 100% accuracy (I was trying to make these values pop up on the map - but was unsuccessful). Assessing neighborhoods by their percent of false positive predictions, both Healy Heights and Goose Hollow Foothills League had 100% of their demolitions predicted as false positives. However, both of these neighborhoods had fewer than 3 total predictions. The next neighborhood in descending order of percent false positive, Boise, had 16 total demolition predictions within its boundary, and roughly 44% of them were false positives. This is the truest outlier for this variable, and it is a neighborhood that our model is not well set up for. On the opposite side, false negatives (the variable our model handles very well, with only 2% of predictions), higher rates can be seen near the city’s center, with neighborhoods such as Overlook, Homestead, and Hosford-Abernethy all containing more than 10% false negative predictions. Lastly, like the Healy Heights neighborhood previously mentioned, the Linnton and Sunderland Association of Neighbors neighborhoods both displayed high percentages of false negative predictions (100% and 50%) but with very low total predictions (fewer than 3 each).
Complexity Parameter Tuning - don’t run this
# #finding the ideal **cp**
#
# #taking in the `Assess Model` Visual
# cp_values_grid <- seq(from = 0.02, to = 0.056, len = 50) #a lengthy amount of time, but let's see if it's worth it
#
# #without specifying cp
# #0.7821081 accuracy and 0.4152267 kappa
# #best minsplit =2
# #best mapdepth = 1
# #five grid tests?
# # with cp grid - using the initial `assess model` visual
# # 0.02 to 0.056, best cp = 0.038 with .7821112 accuracy and 0.4064244 kappa
# #best min split = 5
# #best maxdepth = 3
#
#
# # Create hyperparameter grid... which is a dataframe
# tuning.grid <- expand.grid(
# maxdepth = c(1, 3, 5, 8, 15),
# minsplit = c(2, 5, 10, 15),
# .cp = cp_values_grid,
# accuracy = NA,
# kappa = NA
# )
#
# # Execute full cartesian grid search...
# for(i in seq_len(nrow(tuning.grid))) {
#
# # Fit model for each hyperparameter combination...
# fit <- train(
# as_factor(DEMOLISHED) ~ .,
# data = portland.2014.2018.train,
# method = "treebag",
# trControl = ctrl,
# importance = TRUE,
# maxdepth = tuning.grid$maxdepth[i],
# minsplit = tuning.grid$minsplit[i],
# .cp = tuning.grid$.cp[i],
# minbucket = 4
# )
#
#
# # Save fit metrics from each model...
# tuning.grid$accuracy[i] <- fit$results$Accuracy
# tuning.grid$kappa[i] <- fit$results$Kappa
# }
#
# # assess top 10 models...
# tuning.grid %>%
# arrange(-kappa) %>%
# head(10)
#
# tuning.grid$best_kappa = if_else(tuning.grid$kappa == max(tuning.grid$kappa), 1, 0)
#
# ----------------
# the_pal <- c("cyan3", "deeppink")
#
# p1 <- ggplot(tuning.grid, aes(x = `.cp`, y = kappa, size = best_kappa, fill = as_factor(best_kappa))) +
# geom_point(
# shape=21,
# stroke = 0,
# alpha = 0.5
# ) +
# facet_grid(maxdepth ~ minsplit) +
# theme_light() +
# scale_fill_manual(values = the_pal, labels = c("False", "True")) +
# scale_size(range = c(1, 2), breaks = c(0,1), labels = c("False", "True")) +
# labs(x = "cp", y = "kappa", size = "Highest Kappa", fill = "Highest Kappa", subtitle = "Faceted by minsplit (x-axis) & maxdepth (y-axis)")
#
# ggsave(filename = paste0("./img/CP grid ", min(cp_values_grid), "-", max(cp_values_grid), ".png"), plot = p1, width = 16, height = 8)
Complexity Parameter Tuning Result
#2019 data test with cp
portland.2019_tibb <- as_tibble(portland.2019)
best.accuracy <- tuning.grid[tuning.grid$accuracy == max(tuning.grid$accuracy), ]
best.kappa <- tuning.grid[tuning.grid$kappa == max(tuning.grid$kappa), ]
# Re-run model with best hyperparameters...
best.fit <- train(
as_factor(DEMOLISHED) ~ .,
data = portland.2014.2018.train,
method = "treebag",
trControl = ctrl,
importance = TRUE,
maxdepth = best.accuracy$maxdepth,
minsplit = best.kappa$minsplit,
.cp = best.kappa$.cp,
minbucket = 4)
best.fit
## Bagged CART
##
## 3837 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3453, 3453, 3454, 3453, 3454, 3453, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7711849 0.3909168
# The features included in the model...
best.fit$coefnames
## [1] "BLDGTOTVAL" "PARCELSIZE" "AGE_2014" "WScore" "TScore"
## [6] "Pop.Density"
# Now use the best fitting model to predict demolition status for
# the test set.
# Subset the 2019 data to the same features used in training (plus coordinates for mapping) - do I need to do this?
portland.2019_hp <- portland.2019 %>%
select(DEMOLISHED,
BLDGTOTVAL,
PARCELSIZE,
AGE_2014,
WScore,
TScore,
Pop.Density,
X_COORD,
Y_COORD)
best.fit.pred <- predict(best.fit, newdata = portland.2019_hp)
caret::confusionMatrix(as_factor(best.fit.pred), as_factor(portland.2019$DEMOLISHED), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3571 109
## 1 409 105
##
## Accuracy : 0.8765
## 95% CI : (0.8662, 0.8863)
## No Information Rate : 0.949
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2332
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.49065
## Specificity : 0.89724
## Pos Pred Value : 0.20428
## Neg Pred Value : 0.97038
## Prevalence : 0.05103
## Detection Rate : 0.02504
## Detection Prevalence : 0.12256
## Balanced Accuracy : 0.69395
##
## 'Positive' Class : 1
##
# Make point sf object from coordinates for mapping...
portland.2019.sf <- st_as_sf(portland.2019_hp, coords = c("X_COORD", "Y_COORD"), crs = st_crs(portland.demolitions))
# Map the model performance...
# plot.2019.test.set <- bind_cols(portland.2019_hp, best.fit.pred)
plot.2019.test.set_cp <- bind_cols(portland.2019.sf, best.fit.pred)
## New names:
## * `` -> ...9
names(plot.2019.test.set_cp)[dim(plot.2019.test.set_cp)[2]] <- "PREDICTED"
plot.2019.test.set_cp <- plot.2019.test.set_cp %>%
mutate(PERFORM = case_when(
((as.character(DEMOLISHED) == as.character(PREDICTED)) ~ "Correct"),
((as.character(DEMOLISHED) == 1 & as.character(PREDICTED) == 0) ~ "False Negative"),
((as.character(DEMOLISHED) == 0 & as.character(PREDICTED) == 1) ~ "False Positive")))
# Plot model predictions with neighborhood boundaries...
neighborhoods_0 <- st_read("https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson")
## Reading layer `OGRGeoJSON' from data source
## `https://www.portlandmaps.com/arcgis/rest/services/Public/COP_OpenData_Boundary/MapServer/125/query?outFields=*&where=1%3D1&f=geojson'
## using driver `GeoJSON'
## Simple feature collection with 98 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.8677 ymin: 45.43254 xmax: -122.4666 ymax: 45.65302
## Geodetic CRS: WGS 84
neighborhoods <- st_transform(neighborhoods_0, crs = st_crs(portland.demolitions))
ggplot() +
geom_sf(data = plot.2019.test.set_cp, aes(fill = PERFORM), shape = 21, size = 3) +
scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
theme_void()
fp <- plot.2019.test.set_cp %>%
filter(PERFORM == "False Positive") %>%
ggplot() +
geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
theme_void()
fn <- plot.2019.test.set_cp %>%
filter(PERFORM == "False Negative") %>%
ggplot() +
geom_sf(aes(fill = PERFORM), shape = 21, size = 3) +
scale_fill_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
geom_sf(data = neighborhoods, color = "grey70", fill = NA) +
theme_void()
ggarrange(fp, fn)
# Plot model predictions with Google basemap...
plot.2019.test.set.wgs84_cp <- st_transform(plot.2019.test.set_cp, 4326)
ggmap(pdx_basemap) +
geom_point(data = plot.2019.test.set.wgs84_cp, aes(color = PERFORM),
x = st_coordinates(plot.2019.test.set.wgs84_cp)[,1],
y = st_coordinates(plot.2019.test.set.wgs84_cp)[,2], size = 2) +
scale_color_manual(name = "Performance", breaks = c("Correct", "False Positive", "False Negative"), values = c("green4", "red", "dodgerblue")) +
theme_void()
#merging spatial data
neighborhoods$Correct <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set_cp, PERFORM == "Correct")))
neighborhoods$FALSE_N <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set_cp, PERFORM == "False Negative")))
neighborhoods$FALSE_P <- lengths(st_intersects(neighborhoods, filter(plot.2019.test.set_cp, PERFORM == "False Positive")))
neighborhoods$N_Predictions <- (neighborhoods$Correct + neighborhoods$FALSE_N + neighborhoods$FALSE_P)
arrange(neighborhoods, desc(Correct)) %>%
select(NAME,
MAPLABEL,
HORZ_VERT,
Correct,
FALSE_N,
FALSE_P) %>%
head()
arrange(neighborhoods, desc(FALSE_N)) %>%
select(NAME,
MAPLABEL,
HORZ_VERT,
Correct,
FALSE_N,
FALSE_P) %>%
head()
arrange(neighborhoods, desc(FALSE_P)) %>%
select(NAME,
MAPLABEL,
HORZ_VERT,
Correct,
FALSE_N,
FALSE_P) %>%
head()
plot.2019_nogeom <- st_drop_geometry(plot.2019.test.set.wgs84)
plot.2019_CP_nogeom <- st_drop_geometry(plot.2019.test.set.wgs84_cp)
#no cp results
filter(plot.2019_nogeom, PERFORM == "Correct") %>%
count()
filter(plot.2019_nogeom, PERFORM == "False Negative") %>%
count()
filter(plot.2019_nogeom, PERFORM == "False Positive") %>%
count()
filter(plot.2019_nogeom, PERFORM == "Correct") %>%
count() / count(plot.2019_nogeom)
filter(plot.2019_nogeom, PERFORM == "False Negative") %>%
count() / count(plot.2019_nogeom)
filter(plot.2019_nogeom, PERFORM == "False Positive") %>%
count() / count(plot.2019_nogeom)
#with cp results
filter(plot.2019_CP_nogeom, PERFORM == "Correct") %>%
count()
filter(plot.2019_CP_nogeom, PERFORM == "False Negative") %>%
count()
filter(plot.2019_CP_nogeom, PERFORM == "False Positive") %>%
count()
filter(plot.2019_CP_nogeom, PERFORM == "Correct") %>%
count() / count(plot.2019_CP_nogeom)
filter(plot.2019_CP_nogeom, PERFORM == "False Negative") %>%
count() / count(plot.2019_CP_nogeom)
filter(plot.2019_CP_nogeom, PERFORM == "False Positive") %>%
count() / count(plot.2019_CP_nogeom)
##Q2 Answer: Okay, let’s finally have a look. I used the best.kappa for my CP for the model, because I would have liked to think we care about making the rules fair for complexity. But I don’t think it would have been necessarily wrong to go with accuracy.
My hypothesis was that this didn’t have much of an impact on our number of correct predictions, because we barely increased the kappa or accuracy with the iteration of cp values over the 2014-2018 data. So, I didn’t have my hopes too high here.
Concerning correct predictions, the 2019 model with the complexity parameter included saw the number of correct predictions increase by a whopping 0.16% (an additional 6 correct predictions).
Concerning false negatives, the 2019 model with the complexity parameter included saw the number of false negatives decrease by 0.86% (a decrease of 1 false negative).
Concerning false positives, the 2019 model with the complexity parameter included saw the number of false positives decrease by 1.18% (a decrease of 5 false positives).
So, in the end, we may have moved an inch in the right direction, which is about right when considering the impact the iterative cp values had on the 2014-2018 accuracy and kappa.
Please do not forget the Reflective Content portion of the Lab Assignment.
To use your language, Bev, the juice was not really worth the squeeze; I think the from-concentrate .cp would have been satisfactory. However, it was valuable to go through this process.
I’m definitely intrigued by it all. Obviously I’m not too concerned with the grade, but I decided to put a lot of extra time into this model and lab as a whole, because it’s been the most exploratory yet (and I have a little bit of extra time on my hands). There’s much left to do in terms of increasing this model’s capability (or not?). I’m not sure what the next steps would be to further this model’s predictive capabilities.
But, overall, the lab as a whole was very worthwhile. It took me a while to figure out how I wanted to present my complexity parameter graphic for Q2, so I learned some visual awareness skills there (size, fill, palette, more facet_wrap, themes). In addition, I tested the caret::train timespan for a length of 50 iterative cp parameters, and I think you are exactly right, Bev. This training workflow is best set up for a dummy computer that I won’t need to touch until the end of the day, or next week.
I thank you for putting this lab on, and I’ll have to really urge myself to keep exploring these tools into the summer. I have an idea of how I want to do that (sports), but it will be a lot of wrangling ’till then.
End of Work