Predicting Housing Median Prices. The file BostonHousing.csv contains information on over 500 census tracts in Boston, where for each tract multiple variables are recorded.
The last column (CAT.MEDV) was derived from MEDV, such that it obtains the value 1 if MEDV > 30 and 0 otherwise. Consider the goal of predicting the median value (MEDV) of a tract, given the information in the first 12 columns.
# Load the Boston housing data from the mlba package
data(BostonHousing, package = "mlba")
# Skim the data
skimr::skim(BostonHousing)
Name | BostonHousing |
Number of rows | 506 |
Number of columns | 14 |
Column type frequency: numeric | 14 |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
CRIM | 0 | 1 | 3.61 | 8.60 | 0.01 | 0.08 | 0.26 | 3.68 | 88.98 | ▇▁▁▁▁ |
ZN | 0 | 1 | 11.36 | 23.32 | 0.00 | 0.00 | 0.00 | 12.50 | 100.00 | ▇▁▁▁▁ |
INDUS | 0 | 1 | 11.14 | 6.86 | 0.46 | 5.19 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
CHAS | 0 | 1 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
NOX | 0 | 1 | 0.55 | 0.12 | 0.38 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▆▅▁ |
RM | 0 | 1 | 6.28 | 0.70 | 3.56 | 5.89 | 6.21 | 6.62 | 8.78 | ▁▂▇▂▁ |
AGE | 0 | 1 | 68.57 | 28.15 | 2.90 | 45.02 | 77.50 | 94.07 | 100.00 | ▂▂▂▃▇ |
DIS | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.21 | 5.19 | 12.13 | ▇▅▂▁▁ |
RAD | 0 | 1 | 9.55 | 8.71 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
TAX | 0 | 1 | 408.24 | 168.54 | 187.00 | 279.00 | 330.00 | 666.00 | 711.00 | ▇▇▃▁▇ |
PTRATIO | 0 | 1 | 18.46 | 2.16 | 12.60 | 17.40 | 19.05 | 20.20 | 22.00 | ▁▃▅▅▇ |
LSTAT | 0 | 1 | 12.65 | 7.14 | 1.73 | 6.95 | 11.36 | 16.96 | 37.97 | ▇▇▅▂▁ |
MEDV | 0 | 1 | 22.53 | 9.20 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
CAT.MEDV | 0 | 1 | 0.17 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
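As a quick sanity check of the stated coding of CAT.MEDV (a sketch, not required by the exercise), the expression below should return TRUE if CAT.MEDV is indeed 1 exactly when MEDV > 30:
# Verify the stated derivation of CAT.MEDV from MEDV
with(BostonHousing, all((MEDV > 30) == (CAT.MEDV == 1)))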
Partition the data into training (60%) and validation (40%) sets.
# Set seed for reproducibility
set.seed(121)
# Partition the data into training (60%) and validation (40%) sets, without column 14
train.index <- sample(1:nrow(BostonHousing), 0.6 * nrow(BostonHousing))
trainData <- BostonHousing[train.index, -14]
validData <- BostonHousing[-train.index, -14]
# Check the partition
dim(trainData)
## [1] 303 13
dim(validData)
## [1] 203 13
Perform a k-NN prediction with all 12 predictors (ignore the CAT.MEDV column), trying values of k from 1 to 5. Make sure to normalize the data, and choose function knn() from package class rather than package FNN. To make sure R is using the class package (when both packages are loaded), use class::knn(). What is the best k? What does it mean?
# preProcess() and RMSE() used below come from the caret package
library(caret)
# Fit a centering/scaling model on the 12 training predictors (column 13 is the response, MEDV)
normValues <- preProcess(trainData[, -13], method = c("center", "scale"))
# Copy the data, then normalize only the predictor columns; MEDV is left unscaled
trainDataNorm <- trainData
validDataNorm <- validData
trainDataNorm[, -13] <- predict(normValues, trainData[, -13])
# The validation predictors are scaled with the training-set means and standard deviations
validDataNorm[, -13] <- predict(normValues, validData[, -13])
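As a quick check (a sketch, not part of the required steps), the normalized training predictors should now have mean approximately 0 and standard deviation 1; the validation predictors will be close to, but not exactly, these values because they were scaled with the training-set parameters:
# Means and standard deviations of the normalized training predictors
round(colMeans(trainDataNorm[, -13]), 3)
round(apply(trainDataNorm[, -13], 2, sd), 3)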
# Try k = 1 to 5 with all 12 predictors. Note that class::knn() is a classifier:
# MEDV is treated as a factor of class labels, so predictions are converted back to numeric before computing RMSE
accuracy <- data.frame(k = seq(1, 5, 1), RMSE = rep(0, 5))
for (i in 1:5) {
  knn.pred <- class::knn(train = trainDataNorm[, -13],
                         test = validDataNorm[, -13],
                         cl = trainData$MEDV, k = i)
  accuracy[i, 2] <- RMSE(as.numeric(as.character(knn.pred)), validData$MEDV) # validation RMSE
}
# Find the best k and print the best k
bestK <- accuracy[which.min(accuracy$RMSE), 1]
print(paste("The best k is", bestK))
## [1] "The best k is 1"
# show accuracy table
accuracy
## k RMSE
## 1 1 4.894865
## 2 2 5.251849
## 3 3 5.263290
## 4 4 5.623013
## 5 5 5.734512
# Plot RMSE against k
plot(accuracy$k, accuracy$RMSE, type = "b", xlab = "k", ylab = "RMSE", main = "RMSE vs. k")

The best k is 1. It means that the predicted MEDV for a new tract is simply the MEDV of its single most similar (nearest) training tract; on this validation set that gives the lowest RMSE, although such a small k makes the predictions sensitive to noise in individual records.
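For comparison only (the exercise asks for class::knn()), FNN::knn.reg() performs k-NN regression proper, predicting the average MEDV of the k nearest neighbors instead of taking a majority vote over them; a sketch, assuming the FNN package is installed:
# k-NN regression: average the neighbors' MEDV rather than vote on it
for (i in 1:5) {
  knnreg.pred <- FNN::knn.reg(train = trainDataNorm[, -13],
                              test = validDataNorm[, -13],
                              y = trainData$MEDV, k = i)$pred
  print(c(k = i, RMSE = RMSE(knnreg.pred, validData$MEDV)))
}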
Predict the MEDV for a tract with the following information, using the best k:
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
---|---|---|---|---|---|---|---|---|---|---|---|
0.2 | 0 | 7 | 0 | 0.538 | 6 | 62 | 4.7 | 4 | 307 | 21 | 10 |
# Predict the MEDV for a tract with the following information, using the best k
newData <- data.frame(CRIM = 0.2, ZN = 0, INDUS = 7, CHAS = 0, NOX = 0.538, RM = 6, AGE = 62, DIS = 4.7, RAD = 4, TAX = 307, PTRATIO = 21, LSTAT = 10) # create a new data frame of the new data
# bestK was computed above (k = 1)
newDataNorm <- predict(normValues, newData) # normalise the new data using the same scaler as the training data
knn.pred <- class::knn(train = trainDataNorm[, -13],
test = newDataNorm,
cl = trainData$MEDV, k = bestK) # predict the MEDV for the new data using the best k
print(paste("The predicted MEDV is", knn.pred))
## [1] "The predicted MEDV is 20.4"
If we used the above k-NN algorithm to score the training data, what would be the error of the training set?
# Predict the MEDV for the training data
knn.pred <- class::knn(train = trainDataNorm[, -13],
test = trainDataNorm[, -13],
cl = trainData$MEDV, k = 1)
print(paste("The RMSE of the training set is", RMSE(as.numeric(as.character(knn.pred)), trainData$MEDV)))
## [1] "The RMSE of the training set is 0"
The RMSE of the training set is 0 because, with k = 1, each training record's nearest neighbor in the training set is the record itself, so every prediction equals the observed MEDV and every error is zero.
Why is the validation data error overly optimistic compared to the error rate when applying this k-NN predictor to new data?

Because the validation set was used to choose the best k, the reported validation error describes a model that was tuned to this particular sample. Genuinely new data played no part in selecting k, so the error on new data will typically be higher than the validation error.
Financial Condition of Banks. The file Banks.csv includes data on a sample of 20 banks. The “Financial Condition” column records the judgment of an expert on the financial condition of each bank. This outcome variable takes one of two possible values, weak or strong, according to the financial condition of the bank. The predictors are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is the ratio of total loans and leases to total assets, and TotExp/Assets is the ratio of total expenses to total assets. The target is to use the two ratios for classifying the financial condition of a new bank.

Run a logistic regression model (on the entire dataset) that models the status of a bank as a function of the two financial measures provided. Specify the success class as weak (this is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise), and use the default cutoff value of 0.5.
Write the estimated equation that associates the financial condition of a bank with its two predictors in three formats:
# Read the Banks data from the mlba package
data(Banks, package = "mlba")
# check data
skimr::skim(Banks)
# Fit a logistic regression with two predictors (TotLns.Lses.Assets and TotExp.Assets) and target Financial.Condition
model <- glm(Financial.Condition ~ TotLns.Lses.Assets + TotExp.Assets,
             data = Banks,
             family = "binomial") # family = "binomial" specifies logistic regression; other options include "gaussian" (linear regression) and "poisson"
# check model
summary(model)
##
## Call:
## glm(formula = Financial.Condition ~ TotLns.Lses.Assets + TotExp.Assets,
## family = "binomial", data = Banks)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.721 6.675 -2.205 0.0274 *
## TotLns.Lses.Assets 8.371 5.779 1.449 0.1474
## TotExp.Assets 89.834 47.781 1.880 0.0601 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27.726 on 19 degrees of freedom
## Residual deviance: 13.148 on 17 degrees of freedom
## AIC: 19.148
##
## Number of Fisher Scoring iterations: 6
# load library
library(kableExtra)
# model coefficients to 4 decimal places
model$coefficients %>%
round(4) %>%
kable() %>%
kable_styling()
 | x |
---|---|
(Intercept) | -14.7210 |
TotLns.Lses.Assets | 8.3713 |
TotExp.Assets | 89.8339 |
Logit, odds, and probability equations (with P the probability of being financially weak):

- Logit: L = -14.7210 + 8.3713 × (TotLns&Lses/Assets) + 89.8339 × (TotExp/Assets)
- Odds: Od = e^L = exp(-14.7210 + 8.3713 × (TotLns&Lses/Assets) + 89.8339 × (TotExp/Assets))
- Probability: P = Od / (1 + Od) = 1 / (1 + e^-L)
Consider a new bank whose total loans and leases/assets ratio = 0.6 and total expenses/assets ratio = 0.11. From your logistic regression model, estimate the following four quantities for this bank (use R to do all the intermediate calculations; show your final answers to four decimal places): the logit, the odds, the probability of being financially weak, and the classification of the bank (use cutoff = 0.5).
# create a new data frame for the new bank
newBank <- data.frame(TotLns.Lses.Assets = 0.6, TotExp.Assets = 0.11)
# calculate the logit, odds, probability, and financial condition classification
newBank$logit <- predict(model, newBank, type = "link") # type = "link" means predict the logit
newBank$odds <- exp(newBank$logit)
newBank$probability <- predict(model, newBank, type = "response") # type = "response" means predict the probability
newBank$Financial.Condition <- ifelse(newBank$probability > 0.5, 1, 0)
# manually calculate the logit, odds, probability and financial condition classification
newBank$manual_logit <- model$coefficients[1] + model$coefficients[2] * newBank$TotLns.Lses.Assets + model$coefficients[3] * newBank$TotExp.Assets
newBank$manual_odds <- exp(newBank$manual_logit)
newBank$manual_probability <- 1 / (1 + exp(-newBank$manual_logit))
newBank$manual_Financial.Condition <- ifelse(newBank$manual_probability > 0.5, 1, 0)
# print the results to 4 decimal places
newBank %>%
round(4) %>%
kable() %>%
kable_styling()
TotLns.Lses.Assets | TotExp.Assets | logit | odds | probability | Financial.Condition | manual_logit | manual_odds | manual_probability | manual_Financial.Condition |
---|---|---|---|---|---|---|---|---|---|
0.6 | 0.11 | 0.1835 | 1.2014 | 0.5458 | 1 | 0.1835 | 1.2014 | 0.5458 | 1 |
The cutoff value of 0.5 is used in conjunction with the probability of being financially weak. Compute the threshold that should be used if we want to make a classification based on the odds of being financially weak, and the threshold for the corresponding logit.
With a cutoff of 0.5 on the probability of being financially weak, the equivalent thresholds are:

- Probability: P = 0.5
- Odds: Od = P/(1 - P) = 0.5/(1 - 0.5) = 1
- Logit: L = Ln(Od) = Ln(1) = 0
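The same conversion can be done in R (a small sketch of the arithmetic above):
p.cutoff <- 0.5
odds.cutoff <- p.cutoff / (1 - p.cutoff)  # odds threshold = 1
logit.cutoff <- log(odds.cutoff)          # logit threshold = 0
c(odds = odds.cutoff, logit = logit.cutoff)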
Interpret the estimated coefficient for the total loans & leases to total assets ratio (TotLns&Lses/Assets) in terms of the odds of being financially weak.

The estimated coefficient is 8.3713, so, holding TotExp/Assets constant, a one-unit increase in TotLns&Lses/Assets multiplies the odds of being financially weak by e^8.3713 ≈ 4321; equivalently, an increase of 0.1 in the ratio multiplies the odds by e^0.83713 ≈ 2.31. The positive coefficient means that banks with a higher loans & leases to assets ratio are more likely to be financially weak.
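The odds multipliers can be computed directly from the fitted model (a sketch using the model object created above):
# Odds multiplier for a one-unit increase in TotLns.Lses.Assets
exp(coef(model)["TotLns.Lses.Assets"])
# Odds multiplier for a 0.1 increase in the ratio
exp(0.1 * coef(model)["TotLns.Lses.Assets"])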
When a bank that is in poor financial condition is misclassified as financially strong, the misclassification cost is much higher than when a financially strong bank is misclassified as weak. To minimize the expected cost of misclassification, should the cutoff value for classification (which is currently at 0.5) be increased or decreased?
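To see how the choice of cutoff affects the classifications, the small sketch below (using the fitted model and the Banks data from above) counts how many of the 20 banks would be classified as financially weak at a few different cutoff values:
# Number of banks classified as weak at cutoffs 0.3, 0.5, and 0.7
probs <- predict(model, Banks, type = "response")
sapply(c(0.3, 0.5, 0.7), function(cutoff) sum(probs > cutoff))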