By the end of this activity, you should be able to:
Summarize data with the package tidyverse
Train linear and polynomial regression in R
Train logistic regression in R
Print linear and logistic regressions as nicely formatted tables.
Explain what a receiver operating characteristic (ROC) curve is and how it can be used to estimate the predictive power of a classifier
You should be familiar with the basics of R: installing packages, variables, atomic types, lists, data frames, and matrices (covered in Lab 0). You should also be familiar with the pipe operator and with splitting data into training and test sets (Lab 1).
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type right in this document; others ask you to modify R code, which you edit here. You can discuss your work with other students and with the instructors.
Please fill in the survey at the end; it will only take a couple of minutes.
library(tidyverse) # for manipulation with data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stargazer) # for printing nice regression tables
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(caret) # for machine learning, including KNN
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR) # for calculating ROC AUC
library(plotROC) # for plotting ROC
We will work with the dataset hdb_sales.csv: resale transactions of HDB units between Jan 2017 and Jun 2020, downloaded on 4 Aug 2020 from
hdb <- read_csv("hdb_sales.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## month = col_character(),
## town = col_character(),
## flat_type = col_character(),
## block = col_character(),
## street_name = col_character(),
## storey_range = col_character(),
## floor_area_sqm = col_double(),
## flat_model = col_character(),
## lease_commence_date = col_double(),
## remaining_lease = col_character(),
## resale_price = col_double()
## )
head(hdb)
The second dataset that we will use is hdb_info.csv: information on HDB blocks, downloaded from
hdb_info <- read_csv("hdb_info.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## blk_no = col_character(),
## street = col_character(),
## residential = col_character(),
## commercial = col_character(),
## market_hawker = col_character(),
## miscellaneous = col_character(),
## multistorey_carpark = col_character(),
## precinct_pavilion = col_character(),
## bldg_contract_town = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
head(hdb_info)
A pretty common task is to summarize the data in some way. Let's say we want to find the average price for each flat type. This is done as follows:
hdb %>%
group_by(flat_type) %>%
summarise(mean_price = mean(resale_price))
Note that we applied a composition of two tidyverse functions here. The first one is group_by(). It doesn't do any visible transformation:
hdb %>%
group_by(flat_type) %>%
head
But it adds some extra structure on top of the data frame:
hdb %>%
group_by(flat_type) %>%
class
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Now, applying summarise() to such a grouped data frame means that the summary functions are computed separately within each group created by group_by().
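If you are curious what that extra structure records, here is a small optional sketch using the dplyr helpers group_vars() and n_groups():
hdb %>%
  group_by(flat_type) %>%
  group_vars()   # which variable(s) define the groups
hdb %>%
  group_by(flat_type) %>%
  n_groups()     # how many groups there are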
We can simply count the number of observations in each group:
hdb %>%
group_by(flat_type) %>%
summarise(N_transactions = n())
We can also group by more than one categorical variable and summarize with respect to more than one variable:
hdb %>%
group_by(flat_type, storey_range) %>%
summarise(N_transactions = n(), mean_price = mean(resale_price), max_price = max(resale_price))
## `summarise()` has grouped output by 'flat_type'. You can override using the `.groups` argument.
Create a single pipeline that calculates the number of transactions, the mean price, the minimum price, the maximum price, and the standard deviation of resale prices in each HDB block.
What is the highest resale price of any HDB unit? Which block has the lowest average price? Print all the information on HDB blocks with at least 50 transactions.
# Write your code here
hdb_summary <- hdb %>%
mutate(address = paste(block, street_name)) %>%
group_by(address) %>%
summarise(N_transactions = n(), min_price = min(resale_price),
mean_price = mean(resale_price), max_price = max(resale_price),
SD = sd(resale_price))
cat("Highest resale price ever is", max(hdb_summary$max_price), "\n")
## Highest resale price ever is 1232000
cat("Cheapest units are in block", hdb_summary$address[which.min(hdb_summary$mean_price)], "\n")
## Cheapest units are in block 68 CIRCUIT RD
cat("Below is all info on blocks with at least 50 transactions")
## Below is all info on blocks with at least 50 transactions
hdb_summary %>%
filter(N_transactions >= 50)
Here, we will construct regression models to predict resale prices based on floor area and remaining lease.
The first issue is that the remaining lease is stored as a character string. We will convert it to a numeric value by simply extracting the first two characters of the variable "remaining_lease".
hdb <- hdb %>%
mutate(rl = substr(remaining_lease, 1, 2)) %>%
mutate(rl = as.numeric(rl))
head(hdb)
Convert “remaining_lease” to a numeric variable that accounts for months, i.e., “73 years 03 months” should become 73.25.
# Write your code here
hdb <- hdb %>%
  # pad entries without a months part (e.g. "73 years") so that every string
  # has the same "YY years MM months" layout
  mutate(remaining_lease = ifelse(nchar(remaining_lease) > 15,
                                  remaining_lease,
                                  paste(remaining_lease, "00 months"))) %>%
  mutate(years = as.numeric(substr(remaining_lease, 1, 2)),
         months = as.numeric(substr(remaining_lease, 10, 11)),
         rl = years + months / 12)
head(hdb)
We will split the data into 80% training and 20% test data.
set.seed(42)
ind <- runif(nrow(hdb)) < 0.8
train_hdb <- hdb[ind , ]
test_hdb <- hdb[!ind , ]
cat("Dimensions of the training set are", dim(train_hdb), "\n")
## Dimensions of the training set are 58430 14
cat("Dimensions of the test set are", dim(test_hdb), "\n")
## Dimensions of the test set are 14890 14
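Because ind is generated with runif(), the realized split is only approximately 80/20; a quick optional check:
mean(ind)  # fraction of rows that landed in the training set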
Now we will train a linear model to predict resale prices from floor area and remaining lease. Below we print the summary of the model.
mod_lin <- lm(resale_price ~ floor_area_sqm + rl, data = train_hdb)
summary(mod_lin)
##
## Call:
## lm(formula = resale_price ~ floor_area_sqm + rl, data = train_hdb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -231820 -73721 -26941 39045 727298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -118126.28 3154.99 -37.44 <2e-16 ***
## floor_area_sqm 3746.60 19.95 187.77 <2e-16 ***
## rl 2554.68 38.27 66.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114600 on 58427 degrees of freedom
## Multiple R-squared: 0.4436, Adjusted R-squared: 0.4435
## F-statistic: 2.329e+04 on 2 and 58427 DF, p-value: < 2.2e-16
It is, however, more convenient to print linear models with the function stargazer(), which can produce nicely formatted tables in text, LaTeX, and HTML format and has plenty of settings to play with:
stargazer(mod_lin, type = "text")
##
## =================================================
## Dependent variable:
## -----------------------------
## resale_price
## -------------------------------------------------
## floor_area_sqm 3,746.600***
## (19.953)
##
## rl 2,554.677***
## (38.267)
##
## Constant -118,126.300***
## (3,154.993)
##
## -------------------------------------------------
## Observations 58,430
## R2 0.444
## Adjusted R2 0.444
## Residual Std. Error 114,588.900 (df = 58427)
## F Statistic 23,287.090*** (df = 2; 58427)
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
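As an illustration of those settings, here is a hedged sketch with a few cosmetic options (the argument names come from stargazer; the label text is our own choice):
stargazer(mod_lin, type = "text",
          title = "Linear model of HDB resale prices",
          dep.var.labels = "Resale price (SGD)",
          covariate.labels = c("Floor area (sqm)", "Remaining lease (years)"),
          digits = 1)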
Here we report the root mean squared error (RMSE) of the linear model on the test set; we take the root so that the error is on the same scale as the prices themselves:
rmse <- function(x, y) {
  # root mean squared error between two numeric vectors
  sqrt(mean((x - y)^2))
}
mod_lin %>% predict(test_hdb) %>%
rmse(test_hdb$resale_price)
## [1] 113584.7
And here are confidence intervals for regression coefficients:
confint(mod_lin)
## 2.5 % 97.5 %
## (Intercept) -124310.081 -111942.481
## floor_area_sqm 3707.492 3785.708
## rl 2479.673 2629.680
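A small optional variation: confint() accepts a level argument if you want, say, 90% intervals instead of the default 95%.
confint(mod_lin, level = 0.90)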
What is the interpretation of the model? In particular, how much does every square metre of an HDB unit cost? By how much does an HDB unit become cheaper every year?
Answer: holding the remaining lease fixed, each square metre adds about SGD 3,746.60 to the predicted price, and the unit loses about SGD 2,554.68 in predicted price with every year of lease that runs out.
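As an optional check, these numbers can be read straight off the fitted model by pulling out the two slope coefficients by name:
coef(mod_lin)["floor_area_sqm"]  # predicted price change (SGD) per extra square metre
coef(mod_lin)["rl"]              # predicted price change (SGD) per extra year of remaining lease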
Train a log-linear regression model for HDB prices. Is it more or less accurate than the linear model?
mod_loglin <- lm(log(resale_price) ~ floor_area_sqm + rl, data = train_hdb)
stargazer(mod_loglin, type = "text")
##
## =================================================
## Dependent variable:
## -----------------------------
## log(resale_price)
## -------------------------------------------------
## floor_area_sqm 0.009***
## (0.00004)
##
## rl 0.006***
## (0.0001)
##
## Constant 11.605***
## (0.006)
##
## -------------------------------------------------
## Observations 58,430
## R2 0.538
## Adjusted R2 0.538
## Residual Std. Error 0.224 (df = 58427)
## F Statistic 33,984.730*** (df = 2; 58427)
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
mod_loglin %>%
predict(test_hdb) %>%
exp %>%
rmse(test_hdb$resale_price)
## [1] 113981.5
The log-linear model is about as accurate as the linear model here (its test RMSE is essentially the same), even though it reports a higher R-squared; note that the two R-squared values are not directly comparable, since one model explains variance in prices and the other in log-prices.
Now we will train a polynomial regression that will include floor area, remaining lease, their product and their squares as predictors.
mod_poly <- lm(resale_price ~
I(floor_area_sqm^2) + I(rl^2) + floor_area_sqm * rl, data = train_hdb)
stargazer(mod_poly, type = "text")
##
## =================================================
## Dependent variable:
## -----------------------------
## resale_price
## -------------------------------------------------
## I(floor_area_sqm2) 1.313**
## (0.621)
##
## I(rl2) 222.296***
## (3.422)
##
## floor_area_sqm 5,257.908***
## (191.684)
##
## rl -29,827.950***
## (521.374)
##
## floor_area_sqm:rl -14.470***
## (1.836)
##
## Constant 970,460.600***
## (19,920.890)
##
## -------------------------------------------------
## Observations 58,430
## R2 0.483
## Adjusted R2 0.483
## Residual Std. Error 110,489.200 (df = 58424)
## F Statistic 10,902.810*** (df = 5; 58424)
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The root mean squared error of the polynomial model is
predict(mod_poly, test_hdb) %>%
rmse(test_hdb$resale_price)
## [1] 109552.9
So the polynomial regression does a better job on the test set.
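As an optional summary, here is a small sketch collecting the test-set RMSE of the three models fitted so far in one tibble (rmse() was defined above):
tibble(
  model = c("linear", "log-linear", "polynomial"),
  test_rmse = c(
    rmse(predict(mod_lin, test_hdb), test_hdb$resale_price),
    rmse(exp(predict(mod_loglin, test_hdb)), test_hdb$resale_price),
    rmse(predict(mod_poly, test_hdb), test_hdb$resale_price)
  )
)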
Now we can make actual predictions, i.e., estimate the price of our HDB unit in the future. Let's say we have an HDB unit built in 1995 with a floor area of 95 square metres, and we want to see how its price is going to change after 2020, assuming that the current trend inferred from our data persists.
We will create fake data:
fake_data <- data.frame(
year = 2020:2050,
floor_area_sqm = 95
) %>% mutate (rl = 2095 - year)
head(fake_data)
Now we will predict and plot the hypothetical price of our unit in 2020-2050:
fake_data <- fake_data %>%
mutate(pred_price = predict(mod_poly, fake_data))
ggplot(data = fake_data, aes(x = year, y = pred_price)) +
geom_line() + ylab("Predicted price")
Clearly, these predictions do not make sense. The reason is that our polynomial regression model has a positive coefficient on \(\mbox{remaining lease}^2\): for a fixed floor area, the predicted price is a parabola in the remaining lease, so once the lease drops below the parabola's minimum, the predicted price starts climbing again.
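To see where that happens, here is a minimal optional sketch (using the coefficient names exactly as they appear in coef(mod_poly)) that solves for the remaining lease at which the predicted price of a 95-sqm unit is lowest:
# for a fixed floor area, the predicted price is minimised at
#   rl* = -(b_rl + b_interaction * floor_area) / (2 * b_rl2)
b <- coef(mod_poly)
rl_turning_point <- -(b["rl"] + b["floor_area_sqm:rl"] * 95) / (2 * b["I(rl^2)"])
rl_turning_point  # roughly 70 years: below this, the predicted price rises again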
Plot the prices of our imaginary HDB unit predicted for 2020-2050 by the log-linear model to see whether it generalizes more reasonably beyond the usual range of remaining leases.
# Write your code here
fake_data %>%
mutate(pred_exp = exp(predict(mod_loglin, fake_data))) %>%
ggplot(aes(x = year, y = pred_exp)) +
geom_line() + ylab("Predicted price")
This looks more reasonable than the polynomial regression's predictions.
We will create a new dataset whose observations are HDB blocks (not transactions). For each HDB block, we will look at all transactions in that block, calculate the resale prices predicted by the linear regression, and check whether, on average, actual resale prices are higher or lower than the predicted ones. This will be our response variable, called price_status: for blocks where units are sold at a higher price than the linear regression predicts, we set price_status == "expensive"; for blocks where units are sold at a lower price, we set price_status == "cheap".
hdb_summary <- hdb %>%
mutate(predicted_price = predict(mod_lin, hdb)) %>%
mutate(price_gain = resale_price - predicted_price) %>%
mutate(address = paste(block, street_name)) %>%
group_by(address) %>%
summarise(N_transactions = n(), mean_gain = mean(price_gain)) %>%
mutate(price_status = ifelse(mean_gain > 0, "expensive", "cheap"))
head(hdb_summary)
Filter out blocks with fewer than 5 transactions (the number 5 here is arbitrary; it's just an exercise).
# Type your code here
hdb_summary <- filter(hdb_summary, N_transactions >= 5)
Now we will merge our dataset hdb_summary, which contains processed information about the average transaction in each HDB block, with the block-level property info in hdb_info. We will also remove some variables to simplify the process and create the variable remaining_lease, which gives the remaining lease as of 2020.
H <- hdb_info %>%
select(-ends_with("sold")) %>%
select(-ends_with("rental")) %>%
mutate(address = paste(blk_no, street)) %>%
select(-street, -blk_no) %>%
mutate(remaining_lease = year_completed + 100 - 2020) %>%
select(-year_completed) %>%
merge(hdb_summary) %>%
select(-mean_gain) %>%
mutate(price_status = as.factor(price_status))
head(H)
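A quick optional sanity check: merge() performs an inner join (here on address, the only common column), so blocks without a match in hdb_info are dropped. Comparing row counts shows how many blocks survived the merge:
nrow(hdb_summary)  # blocks with at least 5 transactions
nrow(H)            # blocks that also matched the hdb_info table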
We will split our data into 70% training and 30% test sets
set.seed(8128)
ind <- runif(nrow(H)) < 0.7
train_H <- H[ind , ]
test_H <- H[!ind , ]
cat("Dimensions of the training set are", dim(train_H), "\n")
## Dimensions of the training set are 4586 13
cat("Dimensions of the test set are", dim(test_H), "\n")
## Dimensions of the test set are 1919 13
Our response variable is price_status. It is a binary variable whose value is expensive for HDB blocks whose actual resale prices on average exceed the prices predicted by the linear regression, and cheap for the rest of the blocks.
Let us look at the frequency of these labels in the training set:
table(train_H$price_status)
##
## cheap expensive
## 2883 1703
The more frequent label is cheap. So if we predict that all blocks in the test set are cheap, the accuracy of our prediction will be
baseline_acc <- table(test_H$price_status)['cheap'] / nrow(test_H)
baseline_acc
## cheap
## 0.6347056
We will try to predict the price status of an HDB block using the number of floors (height of the block), the number of units (size of the block), and the remaining lease (age of the block).
Important note: the response variable for logistic regression should be either numeric (with values between 0 and 1) or a factor (the special type for categorical variables in R). That's why we converted the character variable to a factor earlier.
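If you want to check in advance which label will be coded as 1, here is a small optional sketch: for a factor response, glm() treats the first level as "failure" (0) and the second as "success" (1).
levels(train_H$price_status)     # the first level, "cheap", will be coded as 0
contrasts(train_H$price_status)  # the dummy coding used for the factor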
mod_logistic_simple <-
glm(price_status ~ max_floor_lvl + total_dwelling_units + remaining_lease,
family = "binomial",
data = train_H)
stargazer(mod_logistic_simple, type = "text")
##
## ================================================
## Dependent variable:
## ---------------------------
## price_status
## ------------------------------------------------
## max_floor_lvl 0.206***
## (0.010)
##
## total_dwelling_units -0.005***
## (0.001)
##
## remaining_lease -0.088***
## (0.004)
##
## Constant 3.112***
## (0.279)
##
## ------------------------------------------------
## Observations 4,586
## Log Likelihood -2,584.101
## Akaike Inf. Crit. 5,176.201
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
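The coefficients above are on the log-odds scale. As an optional reading aid, exponentiating them gives multiplicative effects on the odds of a block being labelled expensive (a minimal sketch):
exp(coef(mod_logistic_simple))  # odds ratios per one-unit change in each predictor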
And now we will find the accuracy and other metrics of the model on the test set. First, let us construct our predictions
raw_preds <- predict(mod_logistic_simple, test_H, type = "response")
head(raw_preds)
## 5 6 15 20 21 23
## 0.9157088 0.5465544 0.4998965 0.7466088 0.9166088 0.2477066
Note that these predictions are probabilities (without type = "response" we would get log-odds instead). But how do we know whether these are probabilities of expensive or of cheap? Under the hood, logistic regression converts the categorical values of our response variable to numeric. Below are the categorical and numeric values, respectively:
head(train_H$price_status)
## [1] expensive expensive cheap cheap expensive expensive
## Levels: cheap expensive
head(mod_logistic_simple$y)
## 1 2 3 4 7 8
## 1 1 0 0 1 1
We see that 1 means expensive and 0 means cheap, i.e., we are predicting the probability that an HDB block is labelled expensive. Now we will calculate the predicted labels and report all the metrics of our model.
pred_labels <- ifelse(raw_preds > 0.5, "expensive", "cheap") %>% as.factor
confusionMatrix(pred_labels, test_H$price_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cheap expensive
## cheap 1122 396
## expensive 96 305
##
## Accuracy : 0.7436
## 95% CI : (0.7235, 0.763)
## No Information Rate : 0.6347
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3919
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9212
## Specificity : 0.4351
## Pos Pred Value : 0.7391
## Neg Pred Value : 0.7606
## Prevalence : 0.6347
## Detection Rate : 0.5847
## Detection Prevalence : 0.7910
## Balanced Accuracy : 0.6781
##
## 'Positive' Class : cheap
##
Note that the function confusionMatrix() comes from the caret package and that it labelled cheap, rather than expensive, as the positive class.
Also note that the overall accuracy (0.7436) is better than the baseline (0.6347).
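If you prefer expensive to be treated as the positive class, confusionMatrix() has a positive argument; a minimal sketch:
confusionMatrix(pred_labels, test_H$price_status, positive = "expensive")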
In the model above, we used the threshold of \(0.5\) to construct our predictions, i.e., we labelled a new observation as expensive whenever \[ p(Y|X) > 0.5, \] where \(p(Y|X)\) is the estimate of the conditional probability that \(Y=\mbox{expensive}\) given the observed value of \(X\).
For our HDB data, the threshold of \(0.5\) makes sense, but there are datasets where the cost of misclassifying "Yes" as "No" is much higher or lower than the cost of misclassifying "No" as "Yes". Think of 3 hypothetical examples of such datasets.
Fedor's answer: cancer diagnosis (it is much worse to report that someone does not have cancer when she, in fact, does than to report that she has cancer when she, in fact, does not), credit card default (doing something about a bank customer who is not going to default is less costly than doing nothing about a customer who is going to default), and jury verdicts (convicting an innocent person is much worse than releasing a criminal).
It may be a good idea to change the threshold to a higher or a lower number. Change \(0.5\) to \(0.4\). How will it affect the number of predicted expensive labels? How will it affect the confusion matrix and the prediction metrics? First, think about it for a few minutes and then verify your common sense with R code.
# Type your R code here
pred_labels_2 <- ifelse(raw_preds > 0.4, "expensive", "cheap") %>% as.factor
confusionMatrix(pred_labels_2, test_H$price_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cheap expensive
## cheap 989 287
## expensive 229 414
##
## Accuracy : 0.7311
## 95% CI : (0.7107, 0.7508)
## No Information Rate : 0.6347
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4098
##
## Mcnemar's Test P-Value : 0.0121
##
## Sensitivity : 0.8120
## Specificity : 0.5906
## Pos Pred Value : 0.7751
## Neg Pred Value : 0.6439
## Prevalence : 0.6347
## Detection Rate : 0.5154
## Detection Prevalence : 0.6649
## Balanced Accuracy : 0.7013
##
## 'Positive' Class : cheap
##
Answer: lowering the threshold for predicting expensive yields more expensive labels, because cases where the estimated probability is between 0.4 and 0.5 were originally labelled cheap but are now labelled expensive.
Below is the direction of change of the entries of the confusion matrix (true positive, false positive, false negative, true negative): \[ \begin{bmatrix} TP\downarrow & FP\downarrow \\ FN\uparrow & TN\uparrow \end{bmatrix} \]
It means that after lowering the threshold, the sensitivity \(\frac{TP}{TP+FN}\) will decrease (or at best stay the same) and the specificity \(\frac{TN}{FP+TN}\) will increase (or stay the same). The direction of change of the positive predictive value \(\frac{TP}{TP+FP}\) and the negative predictive value \(\frac{TN}{FN+TN}\) can go either way.
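A minimal optional sketch that verifies this empirically: sensitivity and specificity at several thresholds, with cheap treated as the positive class to match confusionMatrix() above.
sapply(c(0.3, 0.4, 0.5, 0.6), function(th) {
  pred <- ifelse(raw_preds > th, "expensive", "cheap")
  actual <- test_H$price_status
  tp <- sum(pred == "cheap" & actual == "cheap")
  fn <- sum(pred == "expensive" & actual == "cheap")
  tn <- sum(pred == "expensive" & actual == "expensive")
  fp <- sum(pred == "cheap" & actual == "expensive")
  c(threshold = th, sensitivity = tp / (tp + fn), specificity = tn / (fp + tn))
})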
The ROC (receiver operating characteristic) curve visualizes how the predictive power of a binary classifier depends on the threshold. It is a parametric curve, traced out in the (false positive fraction, true positive fraction) plane as the threshold varies:
df <- data.frame(
predictions = raw_preds,
labels = 0 + (test_H$price_status == "expensive")
)
ggplot(df, aes(m = predictions, d = labels)) +
geom_roc(n.cuts = 10, labels=TRUE) +
style_roc(theme = theme_grey) +
coord_fixed()
The area under the ROC curve (AUC) is a popular measure of the overall discriminating power of a binary classifier across all threshold values. When our classifier does no better than random chance, the ROC curve is simply the diagonal of the square, i.e., the line FPF = TPF, which corresponds to AUC = 0.5 (a classifier that is systematically worse than random can even score below 0.5, but then flipping its predictions would fix it). The theoretically best case is a classifier that assigns every expensive block a higher score than every cheap block, so that some threshold separates the two classes perfectly; its ROC curve is the union of the left and top sides of the square, and AUC = 1.
Below is the value of our ROC AUC:
pred_obj <- prediction(raw_preds, 0 + (test_H$price_status == "expensive"))
auc_ROCR <- performance(pred_obj, measure = "auc")
auc_ROCR@y.values[[1]]
## [1] 0.7480581
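A final optional sketch: AUC can also be read as the probability that a randomly chosen expensive block receives a higher predicted probability than a randomly chosen cheap block (ties counted as one half). We can estimate this directly and compare with the ROCR value above.
pos <- raw_preds[test_H$price_status == "expensive"]  # scores of expensive blocks
neg <- raw_preds[test_H$price_status == "cheap"]      # scores of cheap blocks
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))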