12/6/2020

The problem

The goal is to predict house prices using the Beijing, China house price dataset.

The data

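library(dplyr)  # provides select() and %>%

# Load the cleaned dataset and keep only the columns used in the analysis
data = read.csv("../data/cleaned_beijing.csv")
data = select(data, square, constructionTime,
              subway, district, price)
head(data)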
##   square constructionTime     subway  district price
## 1 131.00             2005 Has_Subway  ChaoYang 31680
## 2 132.38             2004  No_Subway  ChaoYang 43436
## 3 198.00             2005  No_Subway  ChaoYang 52021
## 4 134.00             2008  No_Subway ChangPing 22202
## 5  81.00             1960 Has_Subway DongCheng 48396
## 6  53.00             2005  No_Subway  ChaoYang 52000

Exploratory analysis

Analysing the median price of houses in each district

Code

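library(ggplot2)

# Median price per district, in thousands. Grouping by district only gives
# one bar per district; grouping by subway as well would stack two medians.
g = data %>% 
  group_by(district) %>%
  summarise(
    price = median(price)/1e3
  ) %>%
  ggplot(aes(x = district,
             y = price)
         ) + 
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90)) + 
  scale_y_continuous(name="Price (in K)") + 
  scale_x_discrete(name="District") 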

Plot

Creating a model using caret package
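
The model code uses `training` and `testing` data frames whose creation is not shown. The following is a minimal sketch of one way to produce such a split in base R. The helper name `split_data`, the 80/20 ratio, the seed, and the toy data frame are all assumptions made for illustration and are not taken from the original analysis; caret's `createDataPartition()` is another common choice.

```r
# Hypothetical helper: reproducible random train/test split in base R.
split_data <- function(df, p = 0.8, seed = 42) {
  set.seed(seed)                                   # make the split reproducible
  n_train <- floor(p * nrow(df))                   # number of training rows
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  test_idx <- setdiff(seq_len(nrow(df)), train_idx)
  list(training = df[train_idx, , drop = FALSE],
       testing  = df[test_idx, , drop = FALSE])
}

# Toy stand-in for the real data (values are made up for illustration)
toy <- data.frame(
  square = c(50, 80, 100, 120, 60, 90, 110, 70, 130, 95),
  price  = c(30, 45, 52, 60, 33, 48, 55, 40, 65, 50) * 1000
)

parts <- split_data(toy)
nrow(parts$training)  # 8 rows for training
nrow(parts$testing)   # 2 rows held out for testing
```

With the real dataset, `parts <- split_data(data)` followed by `training = parts$training` and `testing = parts$testing` would supply the objects the model code expects.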

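library(caret)
# Gaussian GLM (ordinary linear regression) of price on all other columns
mdl = train(price~., data = training, method = "glm",
            family = "gaussian")
plot(mdl$finalModel, which = 1)  # residuals vs. fitted values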

Gauging the performance using RMSE score

On test data

pred = predict(mdl, newdata = testing)
sqrt(mean((pred - testing$price)^2))
## [1] 17754.01
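
RMSE is the square root of the mean squared difference between predicted and actual prices, so it is expressed in the same units as `price`. The sketch below wraps the calculation in a small helper and compares it against a naive baseline that always predicts the mean. The function name `rmse` and the toy vectors are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: root mean squared error of predictions
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}

# Made-up values for illustration only
actual <- c(100, 200, 300)
pred   <- c(110, 190, 320)

model_rmse <- rmse(pred, actual)
model_rmse      # about 14.14

# Baseline: always predict the mean of the observed values
baseline_rmse <- rmse(rep(mean(actual), length(actual)), actual)
baseline_rmse   # about 81.65
```

Comparing the model's RMSE with such a baseline on the test set gives a rough sense of how much the predictors actually help.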