7.3

Predicting Housing Median Prices. The file BostonHousing.csv contains information on over 500 census tracts in Boston, where for each tract multiple variables are recorded.

The last column (CAT.MEDV) was derived from MEDV, such that it obtains the value 1 if MEDV > 30 and 0 otherwise. Consider the goal of predicting the median value (MEDV) of a tract, given the information in the first 12 columns.

# Load wine data from mlba package
data(BostonHousing, package = "mlba")

# Skim the data
skimr::skim(BostonHousing)
Data summary
Name BostonHousing
Number of rows 506
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CRIM 0 1 3.61 8.60 0.01 0.08 0.26 3.68 88.98 ▇▁▁▁▁
ZN 0 1 11.36 23.32 0.00 0.00 0.00 12.50 100.00 ▇▁▁▁▁
INDUS 0 1 11.14 6.86 0.46 5.19 9.69 18.10 27.74 ▇▆▁▇▁
CHAS 0 1 0.07 0.25 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
NOX 0 1 0.55 0.12 0.38 0.45 0.54 0.62 0.87 ▇▇▆▅▁
RM 0 1 6.28 0.70 3.56 5.89 6.21 6.62 8.78 ▁▂▇▂▁
AGE 0 1 68.57 28.15 2.90 45.02 77.50 94.07 100.00 ▂▂▂▃▇
DIS 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.13 ▇▅▂▁▁
RAD 0 1 9.55 8.71 1.00 4.00 5.00 24.00 24.00 ▇▂▁▁▃
TAX 0 1 408.24 168.54 187.00 279.00 330.00 666.00 711.00 ▇▇▃▁▇
PTRATIO 0 1 18.46 2.16 12.60 17.40 19.05 20.20 22.00 ▁▃▅▅▇
LSTAT 0 1 12.65 7.14 1.73 6.95 11.36 16.96 37.97 ▇▇▅▂▁
MEDV 0 1 22.53 9.20 5.00 17.02 21.20 25.00 50.00 ▂▇▅▁▁
CAT.MEDV 0 1 0.17 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂

Partition the data into training (60%) and validation (40%) sets.

# Set seed for reproducibility
set.seed(121)

# Partition the data into training (60%) and validation (40%) sets, without column 14
train.index <- sample(1:nrow(BostonHousing), 0.6 * nrow(BostonHousing))
trainData <- BostonHousing[train.index, -14]
validData <- BostonHousing[-train.index, -14]

# Check the partition
dim(trainData)
## [1] 303  13
dim(validData)
## [1] 203  13

a.

Perform a k-NN prediction with all 12 predictors (ignore the CAT.MEDV column), trying values of k from 1 to 5. Make sure to normalize the data, and choose function knn() from package class rather than package FNN. To make sure R is using the class package (when both packages are loaded), use class::knn(). What is the best k? What does it mean?

# Use preprocess function to create a model to normalise data using the first 12 cols of train data as the standard scaler
normValues <- preProcess(trainData[, -13], method = c("center", "scale"))

# Normalize the data, keep the col 13 as the response variable, then add it back
trainDataNorm <- trainData # copy all 13 cols from trainData to trainDataNorm
validDataNorm <- validData
trainDataNorm[, -13] <- predict(normValues, trainData[, -13]) # normalise the first 12 cols of trainData copy it to trainDataNorm
validDataNorm[, -13] <- predict(normValues, validData[, -13])
# Perform a k-NN prediction with all 12 predictors (ignore the CAT.MEDV column), trying values of k from 1 to 5
accuracy <- data.frame(k = seq(1, 5, 1), RMSE = rep(0, 5))
for (i in 1:5) {
  knn.pred <- class::knn(train = trainDataNorm[, -13], 
                         test = validDataNorm[, -13], 
                         cl = trainData$MEDV, k = i)
  accuracy[i, 2] <- RMSE(as.numeric(as.character(knn.pred)), validData$MEDV) # Calculate the accuracy
}

# Find the best k and print the best k
bestK <- accuracy[which.min(accuracy$RMSE), 1]
print(paste("The best k is", bestK))
## [1] "The best k is 1"
# show accuracy table
accuracy
##   k     RMSE
## 1 1 4.894865
## 2 2 5.251849
## 3 3 5.263290
## 4 4 5.623013
## 5 5 5.734512
# Plot the accuracy
plot(accuracy$k, accuracy$RMSE, type = "b", xlab = "k", ylab = "RMSE", main = "RMSE vs. k")

b.

Predict the MEDV for a tract with the following information, using the best k:

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT
0.2 0 7 0 0.538 6 62 4.7 4 307 21 10
# Predict the MEDV for a tract with the following information, using the best k
newData <- data.frame(CRIM = 0.2, ZN = 0, INDUS = 7, CHAS = 0, NOX = 0.538, RM = 6, AGE = 62, DIS = 4.7, RAD = 4, TAX = 307, PTRATIO = 21, LSTAT = 10) # create a new data frame of the new data

bestK <- 2

newDataNorm <- predict(normValues, newData) # normalise the new data using the same scaler as the training data
knn.pred <- class::knn(train = trainDataNorm[, -13], 
                       test = newDataNorm, 
                       cl = trainData$MEDV, k = bestK) # predict the MEDV for the new data using the best k

print(paste("The predicted MEDV is", knn.pred))
## [1] "The predicted MEDV is 20.4"

c.

If we used the above k-NN algorithm to score the training data, what would be the error of the training set?

# Predict the MEDV for the training data
knn.pred <- class::knn(train = trainDataNorm[, -13], 
                       test = trainDataNorm[, -13], 
                       cl = trainData$MEDV, k = 1)
print(paste("The RMSE of the training set is", RMSE(as.numeric(as.character(knn.pred)), trainData$MEDV)))
## [1] "The RMSE of the training set is 0"

The RMSE of the training set is 0 because…

d.

Why is the validation data error overly optimistic compared to the error rate when applying this k-NN predictor to new data?

e.

  1. If the purpose is to predict MEDV for several thousands of new tracts, what would be the disadvantage of using k-NN prediction? List the operations that the algorithm goes through in order to produce each prediction.

10.1

Financial Condition of Banks. The file Banks.csv includes data on a sample of 20 banks. The “Financial Condition” column records the judgment of an expert on the financial condition of each bank. This outcome variable takes one of two possible values—weak or strong—according to the financial condition of the bank. The predictors are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is the ratio of total loans and leases to total assets and TotExp/Assets is the ratio of total expenses to total assets. The target is to use the two ratios for classifying the financial condition of a new bank. Run a logistic regression model (on the entire dataset) that models the status of a bank as a function of the two financial measures provided. Specify the success class as weak (this is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise), and use the default cutoff value of 0.5.

a.

Write the estimated equation that associates the financial condition of a bank with its two predictors in three formats:

  1. The logit as a function of the predictors
  2. The odds as a function of the predictors
  3. The probability as a function of the predictors
# read banks data from mlba package
data(banks, package = "mlba")
  
# check data
skimr::skim(Banks)
# train logistic regression model with 2 predictors: TotLns&Lses/Assets and TotExp/Assets, target: Financial Condition
model <- glm(Financial.Condition ~ TotLns.Lses.Assets + TotExp.Assets, data = Banks, family = "binomial") # family = "binomial" means logistic regression, other options are "gaussian" for linear regression, "poisson" for poisson regression, etc.

# check model
summary(model)
## 
## Call:
## glm(formula = Financial.Condition ~ TotLns.Lses.Assets + TotExp.Assets, 
##     family = "binomial", data = Banks)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)         -14.721      6.675  -2.205   0.0274 *
## TotLns.Lses.Assets    8.371      5.779   1.449   0.1474  
## TotExp.Assets        89.834     47.781   1.880   0.0601 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 13.148  on 17  degrees of freedom
## AIC: 19.148
## 
## Number of Fisher Scoring iterations: 6
# load library
library(kableExtra)

# model coefficients to 4 decimal places
model$coefficients %>% 
  round(4) %>%
  kable() %>%
  kable_styling()
x
(Intercept) -14.7210
TotLns.Lses.Assets 8.3713
TotExp.Assets 89.8339

Logit, odds, and probability equations: Logit, odds, and probability equations

b.

Consider a new bank whose total loans and leases/assets ratio = 0.6 and total expenses/assets ratio = 0.11. From your logistic regression model, estimate the following four quantities for this bank (use R to do all the intermediate calculations; show your final answers to four decimal places): the logit, the odds, the probability of being financially weak, and the classification of the bank (use cutoff = 0.5).

# create a new data frame for the new bank
newBank <- data.frame(TotLns.Lses.Assets = 0.6, TotExp.Assets = 0.11)

# calculate the logit, odds, probability, and financial condition classification
newBank$logit <- predict(model, newBank, type = "link") # type = "link" means predict the logit
newBank$odds <- exp(newBank$logit)
newBank$probability <- predict(model, newBank, type = "response") # type = "response" means predict the probability
newBank$Financial.Condition <- ifelse(newBank$probability > 0.5, 1, 0)

# manually calculate the logit, odds, probability and financial condition classification
newBank$manual_logit <- model$coefficients[1] + model$coefficients[2] * newBank$TotLns.Lses.Assets + model$coefficients[3] * newBank$TotExp.Assets
newBank$manual_odds <- exp(newBank$manual_logit)
newBank$manual_probability <- 1 / (1 + exp(-newBank$manual_logit))
newBank$manual_Financial.Condition <- ifelse(newBank$manual_probability > 0.5, 1, 0)

# print the results to 4 decimal places
newBank %>% 
  round(4) %>%
  kable() %>%
  kable_styling()
TotLns.Lses.Assets TotExp.Assets logit odds probability Financial.Condition manual_logit manual_odds manual_probability manual_Financial.Condition
0.6 0.11 0.1835 1.2014 0.5458 1 0.1835 1.2014 0.5458 1

c. 

The cutoff value of 0.5 is used in conjunction with the probability of being financially weak. Compute the threshold that should be used if we want to make a classification based on the odds of being financially weak, and the threshold for the corresponding logit.

Answer

Cutoff = 0.5 - Probability: P = 0.5 - Odds: Od = P/(1-P) = 0.5/(1-0.5) = 1 - Logit: L = Ln(Od) = Ln(1)) = 0

d.

Interpret the estimated coefficient for the total loans & leases to total assets ratio (TotLns&Lses/Assets) in terms of the odds of being financially weak.

Answer

e.

When a bank that is in poor financial condition is misclassified as financially strong, the misclassification cost is much higher than when a financially strong bank is misclassified as weak. To minimize the expected cost of misclassification, should the cutoff value for classification (which is currently at 0.5) be increased or decreased?

Answer