Info

Objectives

By the end of this activity, you should

  1. Summarize data with the package tidyverse

  2. Train linear and polynomial regression in R

  3. Train logistic regression in R

  4. Print linear and logistic regressions as nicely formatted tables.

  5. Explain what Receiver operating characteristic (ROC) is and how it can be used to estimate predictive power of a classifier

Preliminaries

You should be familiar with basics of R - installing packages, variables, atomic types, lists, data frames and matrices. These topics were covered in lab 0. You should be familiar with the pipe operator and splitting data into training and test sets (lab 1).

Mode

Please run the R chunks one by one, look at the output and make sure that you understand how it is produced. There will be questions that either require a short answer - then you type your answer right in this document - or modifying R codes - then you modify the R codes here. You can discuss your work with other students and with instructors.

Please fill the survey in the end - it’ll only take a couple of minutes.

Libraries

library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) # for some extra convenience functions
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(modelsummary) # for printing nice regression tables
library(caret) # for machine learning, including KNN
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(ROCR) # for calculating ROC AUC
library(plotROC) # for plotting ROC 

Data

We will work with the dataset hdb_sales.csv - resale transactions of HDB units between Jan 2017 and Jun 2020 downloaded on 4 Aug 2020 from

hdb <- read_csv("hdb_sales.csv")
## Rows: 73320 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
## dbl (3): floor_area_sqm, lease_commence_date, resale_price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hdb)

The second dataset that we will use is hdb_info.csv - information on HDB blocks downloaded from

hdb_info <- read_csv("hdb_info.csv")
## Rows: 12268 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): blk_no, street, residential, commercial, market_hawker, miscellane...
## dbl (15): max_floor_lvl, year_completed, total_dwelling_units, 1room_sold, 2...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hdb_info)

Data summarization

A pretty common task is to somehow summarize the data. Let’s say, we want to find the average price for all flat types. This is done as follows:

hdb %>%
  group_by(flat_type) %>%
  summarise(mean_price = mean(resale_price))

Note that we applied a composition of two functions from tidyverse here. The first one is group_by. It doesn’t do any visible transformation:

hdb %>%
  group_by(flat_type) %>%
  head

But it adds some extra structure on top of data frame:

hdb %>%
  group_by(flat_type) %>%
  class
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Now applying summarize to such a grouped data frame means that we will call certain functions on groups created via group_by.

We can simply count the number of observations in each group:

hdb %>%
  group_by(flat_type) %>%
  summarise(N_transactions = n())

We can also group by more than one categorical variable and summarize with respect to more than one variable:

hdb %>%
  group_by(flat_type, storey_range) %>%
  summarise(N_transactions = n(), mean_price = mean(resale_price), max_price = max(resale_price))
## `summarise()` has grouped output by 'flat_type'. You can override using the
## `.groups` argument.

Question 1

Create a single pipeline that will calculate the number of transactions, the mean price, the minimum price, the maximum price and standard deviation of resale prices in each HDB block.

What is the highest resale price in any HDB unit? What is the block with the lowest average price? Print all the information on all HDB blocks with at least 50 transactions.

# Write your code here
hdb_summary <- hdb %>%
  mutate(address = paste(block, street_name)) %>%
  group_by(address) %>%
  summarise(N_transactions = n(), min_price = min(resale_price),
            mean_price = mean(resale_price), max_price = max(resale_price),
            SD = sd(resale_price))

cat("Highest resale price ever is", max(hdb_summary$max_price), "\n")
## Highest resale price ever is 1232000
cat("Cheapest units are in block", hdb_summary$address[which.min(hdb_summary$mean_price)], "\n")
## Cheapest units are in block 68 CIRCUIT RD
cat("Below is all info on blocks with at least 50 transactions")
## Below is all info on blocks with at least 50 transactions
hdb_summary %>%
  filter(N_transactions >= 50)

Regression models

Here, we will construct regression models to predict resale prices based on floor area and remaining lease.

Data transformation

The first issue is that the remaining lease is given as a character. We will convert it to numeric by just extracting the first two symbols of the variable “remaining_lease”.

hdb <- hdb %>%
  mutate(rl = substr(remaining_lease, 1, 2)) %>%
  mutate(rl = as.numeric(rl))

head(hdb)

Question 2

Convert “remaining_lease” to a numeric variable that accounts for months, i.e., “73 years 03 months” should become 73.25.

# Write your code here
hdb <- hdb %>%
  mutate(remaining_lease = ifelse(nchar(remaining_lease) > 15,
                                  remaining_lease, paste(hdb$remaining_lease, "00 months"))) %>%
  mutate(years = substr(remaining_lease, 1, 2)) %>%
  mutate(years = as.numeric(years)) %>%
  mutate(months = substr(remaining_lease, 10, 11)) %>%
  mutate(months = as.numeric(months)) %>%
  mutate(rl = years + months / 12)

head(hdb)

Training and test sets

We will split the data into 80% training and 20% test data.

set.seed(42)
ind <- runif(nrow(hdb)) < 0.8

train_hdb <- hdb[ind , ] 
test_hdb <- hdb[!ind , ]

cat("Dimensions of the training set are", dim(train_hdb), "\n")
## Dimensions of the training set are 58430 14
cat("Dimensions of the test set are", dim(test_hdb), "\n")
## Dimensions of the test set are 14890 14

Linear regression

Now we will train a linear model to predict resale prices from floor area and remaining lease. Below we print the summary of the model.

mod_lin <- lm(resale_price ~ floor_area_sqm + rl, data = train_hdb)
summary(mod_lin)
## 
## Call:
## lm(formula = resale_price ~ floor_area_sqm + rl, data = train_hdb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -231820  -73721  -26941   39045  727298 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -118126.28    3154.99  -37.44   <2e-16 ***
## floor_area_sqm    3746.60      19.95  187.77   <2e-16 ***
## rl                2554.68      38.27   66.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114600 on 58427 degrees of freedom
## Multiple R-squared:  0.4436, Adjusted R-squared:  0.4435 
## F-statistic: 2.329e+04 on 2 and 58427 DF,  p-value: < 2.2e-16

It is, however, more convenient to print linear models with the function modelsummary (it can produce nicely formatted tables in text, LaTeX, and HTML format and has tons of settings to play with)

modelsummary(mod_lin, fmt = 1)
(1)
(Intercept) -118126.3
(3155.0)
floor_area_sqm 3746.6
(20.0)
rl 2554.7
(38.3)
Num.Obs. 58430
R2 0.444
R2 Adj. 0.444
AIC 1527136.8
BIC 1527172.7
Log.Lik. -763564.381
F 23287.093
RMSE 114586.01

Here we report the root mean squared error of the linear model (numbers will become too huge without the root):

rmse <- function(x, y) {
  sqrt(mean((x-y)^2))
}

mod_lin %>% predict(test_hdb) %>%
  rmse(test_hdb$resale_price)
## [1] 113584.7

And here are 95%-confidence intervals for regression coefficients:

confint(mod_lin)
##                      2.5 %      97.5 %
## (Intercept)    -124310.081 -111942.481
## floor_area_sqm    3707.492    3785.708
## rl                2479.673    2629.680

Or, as a nicely formatted table:

modelsummary(
  mod_lin,
  statistic = "conf.int",   
  conf_level = 0.95,
  fmt = 1
)
(1)
(Intercept) -118126.3
[-124310.1, -111942.5]
floor_area_sqm 3746.6
[3707.5, 3785.7]
rl 2554.7
[2479.7, 2629.7]
Num.Obs. 58430
R2 0.444
R2 Adj. 0.444
AIC 1527136.8
BIC 1527172.7
Log.Lik. -763564.381
F 23287.093
RMSE 114586.01

Question 3

What is the interpretation of the model? In particular, how much does every square metre of an HDB cost? By how much does an HDB unit become cheaper every year?

Answer: a square metre costs SGD 3746.60 and your unit loses SGD2554.68 in price every year

Question 4

Train a log-linear regression model for HDB prices. Is it more or less accurate than the linear model?

mod_loglin <- lm(log(resale_price) ~ floor_area_sqm + rl, data = train_hdb)
modelsummary(mod_loglin, fmt = 4)
(1)
(Intercept) 11.6046
(0.0062)
floor_area_sqm 0.0087
(0.0000)
rl 0.0064
(0.0001)
Num.Obs. 58430
R2 0.538
R2 Adj. 0.538
AIC 1502377.7
BIC 1502413.6
Log.Lik. 4592.272
F 33984.733
RMSE 0.22
mod_loglin %>%
  predict(test_hdb) %>%
  exp %>%
  rmse(test_hdb$resale_price)
## [1] 113981.5

Loglinear model is about as accurate as a linear model here, even though it has higher R-squared

Polynomial regression

Now we will train a polynomial regression that will include floor area, remaining lease, their product and their squares as predictors.

mod_poly <- lm(resale_price ~ 
                 I(floor_area_sqm^2) + I(rl^2) + 
                 floor_area_sqm * rl, data = train_hdb)

modelsummary(mod_poly, fmt = 1)
(1)
(Intercept) 970460.6
(19920.9)
I(floor_area_sqm^2) 1.3
(0.6)
I(rl^2) 222.3
(3.4)
floor_area_sqm 5257.9
(191.7)
rl -29827.9
(521.4)
floor_area_sqm × rl -14.5
(1.8)
Num.Obs. 58430
R2 0.483
R2 Adj. 0.483
AIC 1522882.1
BIC 1522944.9
Log.Lik. -761434.052
F 10902.807
RMSE 110483.50

The root mean squared error of the polynomial model is

predict(mod_poly, test_hdb) %>%
  rmse(test_hdb$resale_price)
## [1] 109552.9

So, apparently, the polynomial regression does a better job.

Predicting future HDB prices

Now we can make actual predictions, i.e., estimate the price of our HDB unit in future. Let’s say we have an HDB unit built in 1995 with floor area 95 square metres and we want to see how its price is going to change after 2020 assuming that the current trend inferred from our data will persist.

We will create a fake data:

fake_data <- data.frame(
  year = 2020:2050,
  floor_area_sqm = 95
) %>% mutate (rl = 2095 - year)

head(fake_data)

Now we will predict and plot the hypothetical price of our unit in 2020 - 2050

fake_data <- fake_data %>% 
  mutate(pred_price = predict(mod_poly, fake_data))

ggplot(data = fake_data, aes(x = year, y = pred_price)) +
  geom_line() + ylab("Predicted price")

Clearly, these predictions do not make sense and the reason is that our polynomial regression model has a positive coefficient at \[ \mbox{remaining lease}^2 \]

Question 5

Plot prices of our imaginary HDB unit predicted in 2020-2050 by the log-linear model to see whether it reasonably generalizes beyond the usual range or remaining leases.

# Write your code here

fake_data %>% 
  mutate(pred_exp = exp(predict(mod_loglin, fake_data))) %>%
  ggplot(aes(x = year, y = pred_exp)) +
  geom_line() + ylab("Predicted price")

This looks more reasonable than polynomial regression predictions

Logistic regression

Data preparation

We will create a new dataset. Its observations are HDB blocks (not transactions), for each HDB block we will look at all transactions in that block, calculate resale prices predicted by linear regression and find out if, on average, resale prices are higher or lower than prices predicted by linear regression. This will be our response variable called price_status. For blocks where units are sold at a higher price than predictions of linear regression, we will set price_status == "expensive", for blocks where units are sold at a lower price than predictions of linear regression, we will set price_status == "cheap".

hdb_summary <- hdb %>%
  mutate(predicted_price = predict(mod_lin, hdb)) %>%
  mutate(price_gain = resale_price - predicted_price) %>%
  mutate(address = paste(block, street_name)) %>%
  group_by(address) %>%
  summarise(N_transactions = n(), mean_gain = mean(price_gain)) %>%
  mutate(price_status = ifelse(mean_gain > 0, "expensive", "cheap"))

head(hdb_summary)

Question 6

Filter out blocks with fewer than 5 transactions (the number 5 here is arbitrary, it’s just an exercise)

# Type your code here
hdb_summary <- filter(hdb_summary, N_transactions >= 5)

Merging our data with property info

Now we will merge our dataset hdb_summary that contains processed information about an average transaction in each HDB block with property info. We will also remove some variables to simplify the process. We will also create the variable remaining lease that shows remaining lease as of 2020

H <- hdb_info %>%
  select(-ends_with("sold")) %>%
  select(-ends_with("rental")) %>%
  mutate(address = paste(blk_no, street)) %>%
  select(-street, -blk_no) %>%
  mutate(remaining_lease = year_completed + 100 - 2020) %>%
  select(-year_completed) %>%
  merge(hdb_summary) %>%
  select(-mean_gain) %>%
  mutate(price_status = as.factor(price_status))

head(H)

Training and test datasets

We will split our data into 70% training and 30% test sets

set.seed(8128)
ind <- runif(nrow(H)) < 0.7
train_H <- H %>% filter(ind) 
test_H <- H %>% filter(!ind)

cat("Dimensions of the training set are", dim(train_H), "\n")
## Dimensions of the training set are 4586 13
cat("Dimensions of the test set are", dim(test_H), "\n")
## Dimensions of the test set are 1919 13

Baseline prediction

Our response variable is price_status. It is a binary variable with value expensive for HDB blocks whose actual resale price on average exceeds price predicted by linear regression and cheap for the rest of the blocks.

Let us look at frequency of of these labels in the training set:

train_H %>% tabyl(price_status)

The more frequent label is cheap. So if we predict that all blocks in the test set are cheap, the accuracy of our prediction will be

baseline_acc <- test_H %>% 
  tabyl(price_status) %>%
  pull(percent) %>%
  max

baseline_acc
## [1] 0.6347056

Logistic model

We will try to predict the price status of HDB block using the number of floors (height of the block), the number of units (size of the block), and the remaining lease (age of the block).

Important note: response variable for logistic regression should be either numeric (with values between 0 and 1) or factor (special type for categorical variables in R). That’s why we converted character to factor before

mod_logistic_simple <-
  glm(price_status ~ max_floor_lvl + total_dwelling_units + remaining_lease,
      family = "binomial",
      data = train_H)

modelsummary(mod_logistic_simple)
(1)
(Intercept) 3.112
(0.279)
max_floor_lvl 0.206
(0.010)
total_dwelling_units -0.005
(0.001)
remaining_lease -0.088
(0.004)
Num.Obs. 4586
AIC 5176.2
BIC 5201.9
Log.Lik. -2584.101
F 202.785
RMSE 0.43

And now we will find the accuracy and other metrics of the model on the test set. First, let us construct our predictions

raw_preds <- predict(mod_logistic_simple, test_H, type = "response")
head(raw_preds)
##         1         2         3         4         5         6 
## 0.9157088 0.5465544 0.4998965 0.7466088 0.9166088 0.2477066

Note that these predictions are probabilities (without indicating type = "response" we would get log-odds of probabilities). But how do we know if these are probabilities of expensive or cheap? Under the hood, logistic regression converts categorical values of our response variable to numeric. Below are categorical and numeric values respectively:

head(train_H$price_status)
## [1] expensive expensive cheap     cheap     expensive expensive
## Levels: cheap expensive
head(mod_logistic_simple$y)
## 1 2 3 4 5 6 
## 1 1 0 0 1 1

We see that 1 means expensive and 0 means cheap, i.e., are predicting the probability that an HDB block is labelled expensive. Now we will calculate predicted labels and report all the metrics of our model.

pred_labels <- ifelse(raw_preds > 0.5, "expensive", "cheap") %>% as.factor
confusionMatrix(pred_labels, test_H$price_status)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  cheap expensive
##   cheap      1122       396
##   expensive    96       305
##                                          
##                Accuracy : 0.7436         
##                  95% CI : (0.7235, 0.763)
##     No Information Rate : 0.6347         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3919         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9212         
##             Specificity : 0.4351         
##          Pos Pred Value : 0.7391         
##          Neg Pred Value : 0.7606         
##              Prevalence : 0.6347         
##          Detection Rate : 0.5847         
##    Detection Prevalence : 0.7910         
##       Balanced Accuracy : 0.6781         
##                                          
##        'Positive' Class : cheap          
## 

Note that the function confusionMatrix comes from a different R library and it labelled cheap rather than expensive as the positive class.

Also note that the overall accuracy rate is better than the baseline.

Question 7

In the model above, we used the threshold of \(0.5\) to construct our predictions, i.e., we labelled a new observation as expensive whenever we have \[ p(Y|X) > 0.5, \] where \(p(Y|X)\) is the estimate of the conditional probability that \(Y=\mbox{expensive}\) given the observed value of \(X\).

For our HDB data, the threshold of \(0.5\) makes sense, but there there are datasets when the cost of misclassification of “Yes” as “No” is much higher or lower than the cost of misclassification of “No” as “Yes”. Think of 3 hypothetical examples of such data sets.

Fedor’s answer: cancer diagnosis (it is much worse to report that someone does not have cancer when she, in fact, does than to report than she has cancer when she, in fact, does not), credit card default (doing something about a bank customer who is not going to default is less costly than not doing anything about a bank customer who is going to default), jury verdict (condemning an innocent is worse than releasing a criminal).

Question 8

It may be a good idea to change the threshold to a higher or a lower number. Change \(0.5\) to \(0.4\). How will it affect the number of predicted expensive labels? How will it affect the confusion matrix and prediction metrics? First, think about it for a few minutes and then verify your common sense by an R code.

# Type your R code here
pred_labels_2 <- ifelse(raw_preds > 0.4, "expensive", "cheap") %>% as.factor
confusionMatrix(pred_labels_2, test_H$price_status)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  cheap expensive
##   cheap       989       287
##   expensive   229       414
##                                           
##                Accuracy : 0.7311          
##                  95% CI : (0.7107, 0.7508)
##     No Information Rate : 0.6347          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4098          
##                                           
##  Mcnemar's Test P-Value : 0.0121          
##                                           
##             Sensitivity : 0.8120          
##             Specificity : 0.5906          
##          Pos Pred Value : 0.7751          
##          Neg Pred Value : 0.6439          
##              Prevalence : 0.6347          
##          Detection Rate : 0.5154          
##    Detection Prevalence : 0.6649          
##       Balanced Accuracy : 0.7013          
##                                           
##        'Positive' Class : cheap           
## 

Answer: Lowering the threshold for predicting expensive will yield more labels expensive because cases where the estimated probability is between 0.4 and 0.5 were originally labelled cheap but now they will be labelled expensive.

Below is the direction of the change of entries of the confusion matrix (true positive, false positive, false negative, true negative): \[ \begin{bmatrix} TP\downarrow & FP\downarrow \\ FN\uparrow & TN\uparrow \end{bmatrix} \]

It means that after lowering down the threshold, sensitivity \(\frac{TP}{TP+FN}\) will always decrease and specificity \(\frac{TN}{FP+TN}\) will always increase. The direction of change of positive predictive value \(\frac{TP}{TP+FP}\) and negative predictive value \(\frac{TN}{FN+TN}\) may be either one.

ROC curve

The ROC (receiver operating characteristic) curve visualizes how predictive power of a binary classifier depends on the threshold. This is a parametric curve parameterized by the threshold value in the axes (false positive fraction, true positive fraction)

df <- tibble(
  predictions = raw_preds,
  labels = 0 + (test_H$price_status == "expensive")
)

ggplot(df, aes(m = predictions, d = labels)) + 
  geom_roc(n.cuts = 10, labels=TRUE) + 
  style_roc(theme = theme_grey) + 
  coord_fixed()

Area under the curve

The area under the ROC curve (AUC) is a popular measure of the overall discriminating power of a binary classifier for different threshold values. The AUC is smallest possible when our classifier just does no better than a random chance - in such a case, the ROC will be simply the diagonal of the square, i.e., the line FPF = TPF. This corresponds to AUC = 0.5. The theoretically best case is that of a classifier that happens to be 100% accurate when the threshold is 0.5. In such a case, our ROC will be the union of the left and the top sides of the square and AUC = 1.

Below is the value of our ROC AUC:

pred_obj <- prediction(raw_preds, 0 + (test_H$price_status == "expensive"))
auc_ROCR <- performance(pred_obj, measure = "auc")
auc_ROCR@y.values[[1]]
## [1] 0.7480581

Survey

There is a link to a simple survey after lab 2: