By the end of this activity, you should
Summarize data with the package tidyverse
Train linear and polynomial regression in R
Train logistic regression in R
Print linear and logistic regressions as nicely formatted tables.
Explain what Receiver operating characteristic (ROC) is and how it can be used to estimate predictive power of a classifier
You should be familiar with basics of R - installing packages, variables, atomic types, lists, data frames and matrices. These topics were covered in lab 0. You should be familiar with the pipe operator and splitting data into training and test sets (lab 1).
Please run the R chunks one by one, look at the output and make sure that you understand how it is produced. There will be questions that either require a short answer - then you type your answer right in this document - or modifying R codes - then you modify the R codes here. You can discuss your work with other students and with instructors.
Please fill the survey in the end - it’ll only take a couple of minutes.
library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) # for some extra convenience functions
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(modelsummary) # for printing nice regression tables
library(caret) # for machine learning, including KNN
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR) # for calculating ROC AUC
library(plotROC) # for plotting ROC
We will work with the dataset hdb_sales.csv - resale
transactions of HDB units between Jan 2017 and Jun 2020 downloaded on 4
Aug 2020 from
hdb <- read_csv("hdb_sales.csv")
## Rows: 73320 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
## dbl (3): floor_area_sqm, lease_commence_date, resale_price
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hdb)
The second dataset that we will use is hdb_info.csv -
information on HDB blocks downloaded from
hdb_info <- read_csv("hdb_info.csv")
## Rows: 12268 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): blk_no, street, residential, commercial, market_hawker, miscellane...
## dbl (15): max_floor_lvl, year_completed, total_dwelling_units, 1room_sold, 2...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hdb_info)
A pretty common task is to somehow summarize the data. Let’s say, we want to find the average price for all flat types. This is done as follows:
hdb %>%
group_by(flat_type) %>%
summarise(mean_price = mean(resale_price))
Note that we applied a composition of two functions from
tidyverse here. The first one is group_by. It
doesn’t do any visible transformation:
hdb %>%
group_by(flat_type) %>%
head
But it adds some extra structure on top of data frame:
hdb %>%
group_by(flat_type) %>%
class
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Now applying summarize to such a grouped data frame
means that we will call certain functions on groups created via
group_by.
We can simply count the number of observations in each group:
hdb %>%
group_by(flat_type) %>%
summarise(N_transactions = n())
We can also group by more than one categorical variable and summarize with respect to more than one variable:
hdb %>%
group_by(flat_type, storey_range) %>%
summarise(N_transactions = n(), mean_price = mean(resale_price), max_price = max(resale_price))
## `summarise()` has grouped output by 'flat_type'. You can override using the
## `.groups` argument.
Create a single pipeline that will calculate the number of transactions, the mean price, the minimum price, the maximum price and standard deviation of resale prices in each HDB block.
What is the highest resale price in any HDB unit? What is the block with the lowest average price? Print all the information on all HDB blocks with at least 50 transactions.
# Write your code here
hdb_summary <- hdb %>%
mutate(address = paste(block, street_name)) %>%
group_by(address) %>%
summarise(N_transactions = n(), min_price = min(resale_price),
mean_price = mean(resale_price), max_price = max(resale_price),
SD = sd(resale_price))
cat("Highest resale price ever is", max(hdb_summary$max_price), "\n")
## Highest resale price ever is 1232000
cat("Cheapest units are in block", hdb_summary$address[which.min(hdb_summary$mean_price)], "\n")
## Cheapest units are in block 68 CIRCUIT RD
cat("Below is all info on blocks with at least 50 transactions")
## Below is all info on blocks with at least 50 transactions
hdb_summary %>%
filter(N_transactions >= 50)
Here, we will construct regression models to predict resale prices based on floor area and remaining lease.
The first issue is that the remaining lease is given as a character. We will convert it to numeric by just extracting the first two symbols of the variable “remaining_lease”.
hdb <- hdb %>%
mutate(rl = substr(remaining_lease, 1, 2)) %>%
mutate(rl = as.numeric(rl))
head(hdb)
Convert “remaining_lease” to a numeric variable that accounts for months, i.e., “73 years 03 months” should become 73.25.
# Write your code here
hdb <- hdb %>%
mutate(remaining_lease = ifelse(nchar(remaining_lease) > 15,
remaining_lease, paste(hdb$remaining_lease, "00 months"))) %>%
mutate(years = substr(remaining_lease, 1, 2)) %>%
mutate(years = as.numeric(years)) %>%
mutate(months = substr(remaining_lease, 10, 11)) %>%
mutate(months = as.numeric(months)) %>%
mutate(rl = years + months / 12)
head(hdb)
We will split the data into 80% training and 20% test data.
set.seed(42)
ind <- runif(nrow(hdb)) < 0.8
train_hdb <- hdb[ind , ]
test_hdb <- hdb[!ind , ]
cat("Dimensions of the training set are", dim(train_hdb), "\n")
## Dimensions of the training set are 58430 14
cat("Dimensions of the test set are", dim(test_hdb), "\n")
## Dimensions of the test set are 14890 14
Now we will train a linear model to predict resale prices from floor area and remaining lease. Below we print the summary of the model.
mod_lin <- lm(resale_price ~ floor_area_sqm + rl, data = train_hdb)
summary(mod_lin)
##
## Call:
## lm(formula = resale_price ~ floor_area_sqm + rl, data = train_hdb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -231820 -73721 -26941 39045 727298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -118126.28 3154.99 -37.44 <2e-16 ***
## floor_area_sqm 3746.60 19.95 187.77 <2e-16 ***
## rl 2554.68 38.27 66.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114600 on 58427 degrees of freedom
## Multiple R-squared: 0.4436, Adjusted R-squared: 0.4435
## F-statistic: 2.329e+04 on 2 and 58427 DF, p-value: < 2.2e-16
It is, however, more convenient to print linear models with the
function modelsummary (it can produce nicely formatted
tables in text, LaTeX, and HTML format and has tons of settings to play
with)
modelsummary(mod_lin, fmt = 1)
| (1) | |
|---|---|
| (Intercept) | -118126.3 |
| (3155.0) | |
| floor_area_sqm | 3746.6 |
| (20.0) | |
| rl | 2554.7 |
| (38.3) | |
| Num.Obs. | 58430 |
| R2 | 0.444 |
| R2 Adj. | 0.444 |
| AIC | 1527136.8 |
| BIC | 1527172.7 |
| Log.Lik. | -763564.381 |
| F | 23287.093 |
| RMSE | 114586.01 |
Here we report the root mean squared error of the linear model (numbers will become too huge without the root):
rmse <- function(x, y) {
sqrt(mean((x-y)^2))
}
mod_lin %>% predict(test_hdb) %>%
rmse(test_hdb$resale_price)
## [1] 113584.7
And here are 95%-confidence intervals for regression coefficients:
confint(mod_lin)
## 2.5 % 97.5 %
## (Intercept) -124310.081 -111942.481
## floor_area_sqm 3707.492 3785.708
## rl 2479.673 2629.680
Or, as a nicely formatted table:
modelsummary(
mod_lin,
statistic = "conf.int",
conf_level = 0.95,
fmt = 1
)
| (1) | |
|---|---|
| (Intercept) | -118126.3 |
| [-124310.1, -111942.5] | |
| floor_area_sqm | 3746.6 |
| [3707.5, 3785.7] | |
| rl | 2554.7 |
| [2479.7, 2629.7] | |
| Num.Obs. | 58430 |
| R2 | 0.444 |
| R2 Adj. | 0.444 |
| AIC | 1527136.8 |
| BIC | 1527172.7 |
| Log.Lik. | -763564.381 |
| F | 23287.093 |
| RMSE | 114586.01 |
What is the interpretation of the model? In particular, how much does every square metre of an HDB cost? By how much does an HDB unit become cheaper every year?
Answer: a square metre costs SGD 3746.60 and your unit loses SGD2554.68 in price every year
Train a log-linear regression model for HDB prices. Is it more or less accurate than the linear model?
mod_loglin <- lm(log(resale_price) ~ floor_area_sqm + rl, data = train_hdb)
modelsummary(mod_loglin, fmt = 4)
| (1) | |
|---|---|
| (Intercept) | 11.6046 |
| (0.0062) | |
| floor_area_sqm | 0.0087 |
| (0.0000) | |
| rl | 0.0064 |
| (0.0001) | |
| Num.Obs. | 58430 |
| R2 | 0.538 |
| R2 Adj. | 0.538 |
| AIC | 1502377.7 |
| BIC | 1502413.6 |
| Log.Lik. | 4592.272 |
| F | 33984.733 |
| RMSE | 0.22 |
mod_loglin %>%
predict(test_hdb) %>%
exp %>%
rmse(test_hdb$resale_price)
## [1] 113981.5
Loglinear model is about as accurate as a linear model here, even though it has higher R-squared
Now we will train a polynomial regression that will include floor area, remaining lease, their product and their squares as predictors.
mod_poly <- lm(resale_price ~
I(floor_area_sqm^2) + I(rl^2) +
floor_area_sqm * rl, data = train_hdb)
modelsummary(mod_poly, fmt = 1)
| (1) | |
|---|---|
| (Intercept) | 970460.6 |
| (19920.9) | |
| I(floor_area_sqm^2) | 1.3 |
| (0.6) | |
| I(rl^2) | 222.3 |
| (3.4) | |
| floor_area_sqm | 5257.9 |
| (191.7) | |
| rl | -29827.9 |
| (521.4) | |
| floor_area_sqm × rl | -14.5 |
| (1.8) | |
| Num.Obs. | 58430 |
| R2 | 0.483 |
| R2 Adj. | 0.483 |
| AIC | 1522882.1 |
| BIC | 1522944.9 |
| Log.Lik. | -761434.052 |
| F | 10902.807 |
| RMSE | 110483.50 |
The root mean squared error of the polynomial model is
predict(mod_poly, test_hdb) %>%
rmse(test_hdb$resale_price)
## [1] 109552.9
So, apparently, the polynomial regression does a better job.
Now we can make actual predictions, i.e., estimate the price of our HDB unit in future. Let’s say we have an HDB unit built in 1995 with floor area 95 square metres and we want to see how its price is going to change after 2020 assuming that the current trend inferred from our data will persist.
We will create a fake data:
fake_data <- data.frame(
year = 2020:2050,
floor_area_sqm = 95
) %>% mutate (rl = 2095 - year)
head(fake_data)
Now we will predict and plot the hypothetical price of our unit in 2020 - 2050
fake_data <- fake_data %>%
mutate(pred_price = predict(mod_poly, fake_data))
ggplot(data = fake_data, aes(x = year, y = pred_price)) +
geom_line() + ylab("Predicted price")
Clearly, these predictions do not make sense and the reason is that our polynomial regression model has a positive coefficient at \[ \mbox{remaining lease}^2 \]
Plot prices of our imaginary HDB unit predicted in 2020-2050 by the log-linear model to see whether it reasonably generalizes beyond the usual range or remaining leases.
# Write your code here
fake_data %>%
mutate(pred_exp = exp(predict(mod_loglin, fake_data))) %>%
ggplot(aes(x = year, y = pred_exp)) +
geom_line() + ylab("Predicted price")
This looks more reasonable than polynomial regression predictions
We will create a new dataset. Its observations are HDB blocks (not
transactions), for each HDB block we will look at all transactions in
that block, calculate resale prices predicted by linear regression and
find out if, on average, resale prices are higher or lower than prices
predicted by linear regression. This will be our response variable
called price_status. For blocks where units are sold at a
higher price than predictions of linear regression, we will set
price_status == "expensive", for blocks where units are
sold at a lower price than predictions of linear regression, we will set
price_status == "cheap".
hdb_summary <- hdb %>%
mutate(predicted_price = predict(mod_lin, hdb)) %>%
mutate(price_gain = resale_price - predicted_price) %>%
mutate(address = paste(block, street_name)) %>%
group_by(address) %>%
summarise(N_transactions = n(), mean_gain = mean(price_gain)) %>%
mutate(price_status = ifelse(mean_gain > 0, "expensive", "cheap"))
head(hdb_summary)
Filter out blocks with fewer than 5 transactions (the number 5 here is arbitrary, it’s just an exercise)
# Type your code here
hdb_summary <- filter(hdb_summary, N_transactions >= 5)
Now we will merge our dataset hdb_summary that contains
processed information about an average transaction in each HDB block
with property info. We will also remove some variables to simplify the
process. We will also create the variable remaining lease
that shows remaining lease as of 2020
H <- hdb_info %>%
select(-ends_with("sold")) %>%
select(-ends_with("rental")) %>%
mutate(address = paste(blk_no, street)) %>%
select(-street, -blk_no) %>%
mutate(remaining_lease = year_completed + 100 - 2020) %>%
select(-year_completed) %>%
merge(hdb_summary) %>%
select(-mean_gain) %>%
mutate(price_status = as.factor(price_status))
head(H)
We will split our data into 70% training and 30% test sets
set.seed(8128)
ind <- runif(nrow(H)) < 0.7
train_H <- H %>% filter(ind)
test_H <- H %>% filter(!ind)
cat("Dimensions of the training set are", dim(train_H), "\n")
## Dimensions of the training set are 4586 13
cat("Dimensions of the test set are", dim(test_H), "\n")
## Dimensions of the test set are 1919 13
Our response variable is price_status. It is a binary
variable with value expensive for HDB blocks whose actual
resale price on average exceeds price predicted by linear regression and
cheap for the rest of the blocks.
Let us look at frequency of of these labels in the training set:
train_H %>% tabyl(price_status)
The more frequent label is cheap. So if we predict that
all blocks in the test set are cheap, the accuracy of our prediction
will be
baseline_acc <- test_H %>%
tabyl(price_status) %>%
pull(percent) %>%
max
baseline_acc
## [1] 0.6347056
We will try to predict the price status of HDB block using the number of floors (height of the block), the number of units (size of the block), and the remaining lease (age of the block).
Important note: response variable for logistic regression should be either numeric (with values between 0 and 1) or factor (special type for categorical variables in R). That’s why we converted character to factor before
mod_logistic_simple <-
glm(price_status ~ max_floor_lvl + total_dwelling_units + remaining_lease,
family = "binomial",
data = train_H)
modelsummary(mod_logistic_simple)
| (1) | |
|---|---|
| (Intercept) | 3.112 |
| (0.279) | |
| max_floor_lvl | 0.206 |
| (0.010) | |
| total_dwelling_units | -0.005 |
| (0.001) | |
| remaining_lease | -0.088 |
| (0.004) | |
| Num.Obs. | 4586 |
| AIC | 5176.2 |
| BIC | 5201.9 |
| Log.Lik. | -2584.101 |
| F | 202.785 |
| RMSE | 0.43 |
And now we will find the accuracy and other metrics of the model on the test set. First, let us construct our predictions
raw_preds <- predict(mod_logistic_simple, test_H, type = "response")
head(raw_preds)
## 1 2 3 4 5 6
## 0.9157088 0.5465544 0.4998965 0.7466088 0.9166088 0.2477066
Note that these predictions are probabilities (without indicating
type = "response" we would get log-odds of probabilities).
But how do we know if these are probabilities of expensive
or cheap? Under the hood, logistic regression converts
categorical values of our response variable to numeric. Below are
categorical and numeric values respectively:
head(train_H$price_status)
## [1] expensive expensive cheap cheap expensive expensive
## Levels: cheap expensive
head(mod_logistic_simple$y)
## 1 2 3 4 5 6
## 1 1 0 0 1 1
We see that 1 means expensive and
0 means cheap, i.e., are predicting the
probability that an HDB block is labelled expensive. Now we
will calculate predicted labels and report all the metrics of our
model.
pred_labels <- ifelse(raw_preds > 0.5, "expensive", "cheap") %>% as.factor
confusionMatrix(pred_labels, test_H$price_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cheap expensive
## cheap 1122 396
## expensive 96 305
##
## Accuracy : 0.7436
## 95% CI : (0.7235, 0.763)
## No Information Rate : 0.6347
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3919
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9212
## Specificity : 0.4351
## Pos Pred Value : 0.7391
## Neg Pred Value : 0.7606
## Prevalence : 0.6347
## Detection Rate : 0.5847
## Detection Prevalence : 0.7910
## Balanced Accuracy : 0.6781
##
## 'Positive' Class : cheap
##
Note that the function confusionMatrix comes from a
different R library and it labelled cheap rather than
expensive as the positive class.
Also note that the overall accuracy rate is better than the baseline.
In the model above, we used the threshold of \(0.5\) to construct our predictions, i.e.,
we labelled a new observation as expensive whenever we have
\[
p(Y|X) > 0.5,
\] where \(p(Y|X)\) is the
estimate of the conditional probability that \(Y=\mbox{expensive}\) given the observed
value of \(X\).
For our HDB data, the threshold of \(0.5\) makes sense, but there there are datasets when the cost of misclassification of “Yes” as “No” is much higher or lower than the cost of misclassification of “No” as “Yes”. Think of 3 hypothetical examples of such data sets.
Fedor’s answer: cancer diagnosis (it is much worse to report that someone does not have cancer when she, in fact, does than to report than she has cancer when she, in fact, does not), credit card default (doing something about a bank customer who is not going to default is less costly than not doing anything about a bank customer who is going to default), jury verdict (condemning an innocent is worse than releasing a criminal).
It may be a good idea to change the threshold to a higher or a lower
number. Change \(0.5\) to \(0.4\). How will it affect the number of
predicted expensive labels? How will it affect the
confusion matrix and prediction metrics? First, think about it for a few
minutes and then verify your common sense by an R code.
# Type your R code here
pred_labels_2 <- ifelse(raw_preds > 0.4, "expensive", "cheap") %>% as.factor
confusionMatrix(pred_labels_2, test_H$price_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cheap expensive
## cheap 989 287
## expensive 229 414
##
## Accuracy : 0.7311
## 95% CI : (0.7107, 0.7508)
## No Information Rate : 0.6347
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4098
##
## Mcnemar's Test P-Value : 0.0121
##
## Sensitivity : 0.8120
## Specificity : 0.5906
## Pos Pred Value : 0.7751
## Neg Pred Value : 0.6439
## Prevalence : 0.6347
## Detection Rate : 0.5154
## Detection Prevalence : 0.6649
## Balanced Accuracy : 0.7013
##
## 'Positive' Class : cheap
##
Answer: Lowering the threshold for predicting
expensive will yield more labels expensive
because cases where the estimated probability is between 0.4 and 0.5
were originally labelled cheap but now they will be
labelled expensive.
Below is the direction of the change of entries of the confusion matrix (true positive, false positive, false negative, true negative): \[ \begin{bmatrix} TP\downarrow & FP\downarrow \\ FN\uparrow & TN\uparrow \end{bmatrix} \]
It means that after lowering down the threshold, sensitivity \(\frac{TP}{TP+FN}\) will always decrease and specificity \(\frac{TN}{FP+TN}\) will always increase. The direction of change of positive predictive value \(\frac{TP}{TP+FP}\) and negative predictive value \(\frac{TN}{FN+TN}\) may be either one.
The ROC (receiver operating characteristic) curve visualizes how predictive power of a binary classifier depends on the threshold. This is a parametric curve parameterized by the threshold value in the axes (false positive fraction, true positive fraction)
df <- tibble(
predictions = raw_preds,
labels = 0 + (test_H$price_status == "expensive")
)
ggplot(df, aes(m = predictions, d = labels)) +
geom_roc(n.cuts = 10, labels=TRUE) +
style_roc(theme = theme_grey) +
coord_fixed()
The area under the ROC curve (AUC) is a popular measure of the overall discriminating power of a binary classifier for different threshold values. The AUC is smallest possible when our classifier just does no better than a random chance - in such a case, the ROC will be simply the diagonal of the square, i.e., the line FPF = TPF. This corresponds to AUC = 0.5. The theoretically best case is that of a classifier that happens to be 100% accurate when the threshold is 0.5. In such a case, our ROC will be the union of the left and the top sides of the square and AUC = 1.
Below is the value of our ROC AUC:
pred_obj <- prediction(raw_preds, 0 + (test_H$price_status == "expensive"))
auc_ROCR <- performance(pred_obj, measure = "auc")
auc_ROCR@y.values[[1]]
## [1] 0.7480581