Info

Objective

The purpose of writing graded lab reports is to help students to stay on track and to provide summative feedback. Each lab report is just 1% of the total course mark. Please do not cheat - it is not worth it!

Your task

Solve the practical questions, knit your document into a PDF and submit to NTULearn before the deadline. The deadline is very tight because the task is simple. We are sure that everyone is capable to do it by themselves and we want to discourage taking someone else’s report and writing it with your own words.

Marking scheme

You will get “excellent”, or 100% for this lab report if everything is perfect. You will get “good”, or 75% if there are minor issues. For example, you will get “good” if you do transform the data by manually subsetting and changing variable names instead of a tidyverse pipepine. You will get “average”, or 50% if there are serious issues in your report, such as doing using incorrect predictors for polynomial regression. You will get “poor”, or 25% if you barely attempt this report. You will get “not done”, or 0% if you do not attempt this report.

Deadline

26 Aug 2024, midnight

Libraries

Here, we load libraries and set the random seed. Replace the number “1729” with the numeric part of your matric no

library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # for machine learning, including KNN
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(stargazer) # for printing regression tables
## 
## Please cite as: 
## 
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
set.seed(1729)

Question 1

Load the data on HDB transactions in 2015-2016 from the file “hdb_sales_2015_2016.csv”. Create a new variable block_age that will be equal to the block age at the moment of transaction and delete all the variables except for block, street_name, floor_area_sqm, block_age, resale_price.

Split the data into 80% training and 20% test sets and report dimensions of the training and test sets.

# There are two methods to find the block_age, one of them is commented out:
X <- read.csv("hdb_sales_2015_2016.csv") %>%
#  mutate(block_age = as.numeric(substr(month, 1, 4)) - lease_commence_date)
   mutate(block_age = 99 - remaining_lease) %>%
   select(`block`, `street_name`,`floor_area_sqm`, `block_age`, `resale_price`)
   
head(X)
ind <- runif(nrow(X)) <= 0.8
train_data <- X[ind , ]
test_data <- X[!ind , ]
cat("Dimensions of the training set =", dim(train_data), "\n")
## Dimensions of the training set = 29771 5
cat("Dimensions of the test set =", dim(test_data), "\n")
## Dimensions of the test set = 7382 5

Question 2

Train a polynomial regression of degree 2 to predict resale prices. Your predictors should be floor_are_sqm, block_age, their product, and their squares.

Report the root mean squared error on the training set and the root mean squared error on the test set.

mse <- function(x, y) {
  (x-y)^2 %>% mean %>% sqrt
}

mod_poly <- lm(resale_price ~ floor_area_sqm + block_age +
                 floor_area_sqm * block_age +
                 I(floor_area_sqm^2) + I(block_age^2), data = train_data)

#### This is also accepted:
# summary(mod_poly)
stargazer(mod_poly, type = "text")
## 
## =====================================================
##                              Dependent variable:     
##                          ----------------------------
##                                  resale_price        
## -----------------------------------------------------
## floor_area_sqm                     -128.833          
##                                   (191.712)          
##                                                      
## block_age                       -27,223.830***       
##                                   (377.041)          
##                                                      
## I(floor_area_sqm2)                 8.620***          
##                                    (0.773)           
##                                                      
## I(block_age2)                     356.496***         
##                                    (4.812)           
##                                                      
## floor_area_sqm:block_age          95.016***          
##                                    (2.735)           
##                                                      
## Constant                        557,902.300***       
##                                  (12,483.740)        
##                                                      
## -----------------------------------------------------
## Observations                        29,771           
## R2                                  0.538            
## Adjusted R2                         0.538            
## Residual Std. Error        92,103.770 (df = 29765)   
## F Statistic              6,944.702*** (df = 5; 29765)
## =====================================================
## Note:                     *p<0.1; **p<0.05; ***p<0.01
cat("RMSE for training set is",
    mod_poly %>% predict(train_data) %>%
      mse(train_data$resale_price), "\n")
## RMSE for training set is 92094.49
cat("RMSE for test set is",
    mod_poly %>% predict(test_data) %>%
      mse(test_data$resale_price), "\n")
## RMSE for test set is 92686.77

Question 3

Make and print (you can plot, but it is not necessary) predictions for a hypothetical HDB flat constructed in 1960 with floor area of 120 square metres in 2020 - 2050

df <- data.frame(year = 2020:2050,
                 floor_area_sqm = 120) %>%
  mutate(block_age = year - 1960)

preds <- predict(mod_poly, df)

df <- df %>%
  mutate(predicted_price = preds)

df

Question 4

Explain why the model predicts that resale prices are going to increase, even though, clearly, older HDB flats should be cheaper and, by definition, an HDB flat becomes worthless when it is 100 years old.

Answer: This is because of confounding variables. We haven’t included location into the model. At the same time, most old HDB flats are located in central areas, such as Queenstown and hence they are more expensive than they should have been if we only accounted for their age. As a result, the coefficient at the remaining lease squared in the polynomial regression model becomes positive and we observe this counter-intuitive prediction.

Remark: Accept any reasonable explanation that mentions confounding variables or the positive coefficient at the block age squared in the model. Don’t accept explanations such as overfitting, multi-collinearity or data drift.