Practice: Linear Regression

Data lấy từ Kaggle có tên Real estate price prediction. Link data: https://www.kaggle.com/quantbruce/real-estate-price-prediction?select=Real+estate.csv

Để tiến hành thực hiện đầu tiên cần Tải các thư viện cần thiết để tiến hành phân tích

A. Khảo sát và thăm dò dữ liệu

Đọc dữ liệu vào R

df <- read.csv("C:/Users/KH/Google Drive/Data Sample/Real estate.csv")
head(df)

dim(df)

## [1] 414   8

Xem nào, orginal dataset chứa 414 explanatory variables giải thích 7 khía cạnh của bất động sản:

X1 transaction date (numeric)
X2 house age (numeric, in year)
X3 distance to the nearest MRT station (numeric)
X4 number of convenience stores (numeric)
X5 latitude (numeric)
X6 longitude (numeric)
X7 house price of unit area (numeric)

Trong bài thực hành này chúng ta sẽ đi tìm một mô hình giá bán bất động sản với các biến độc lập có sẵn, sẽ được dùng để hiểu giá thay đổi chính xác như thế nào với các biến độc lập, việc này có ý nghĩa gì ?

Quản lý chiến lược kinh doanh về giá
Các nhân tố chính tạo động lực để định giá ở một thị trường mới

Research Question ?

Những biến nào có ý nghĩa trong việc dự đoán giá bất động sản ?
Trong các biến đó biến nào tốt nhất để mô tả về giá bán ?

Visualization các bất động sản trên bản đồ thông qua hai biến X5 và X6 để xác định vị trí và đơn giá bán từng bất động sản

df <- df %>% mutate(popup_info = paste("No:", No, '<br/>', 'Unit Price: ', 
                                       Y.house.price.of.unit.area))
df

leaflet(data = df) %>%
  addProviderTiles(providers$Stamen.Terrain) %>% 
  addCircles(lng = ~X6.longitude, 
             lat = ~X5.latitude, 
             weight = 10, 
             opacity = 0.5, 
             radius = ~5,
             popup = ~popup_info, 
             color = "orange")

colSums(is.na(df))

##                                     No                    X1.transaction.date 
##                                      0                                      0 
##                           X2.house.age X3.distance.to.the.nearest.MRT.station 
##                                      0                                      0 
##        X4.number.of.convenience.stores                            X5.latitude 
##                                      0                                      0 
##                           X6.longitude             Y.house.price.of.unit.area 
##                                      0                                      0 
##                             popup_info 
##                                      0

Oh! Dữ liệu Không có NA values là điều rất hiếm gặp trong thực tế

Bỏ cột No, X1.transaction.date, popup_info vừa khởi tạo vì không cần thiết, xác định biến outcome trong trường hợp này cần tiên lượng chính là đơn giá của BĐS đang chào bán (biến Y.house.price.of.unit.area)

df <- df[-c(1:2,9)]

head(df)

Để làm gọn tên các biến số mình dùng hàm rename trong package ‘dplyr’, quy ước chung:

X1: đã loại bỏ do không cần thiết trong trường hợp này
X2: House Age
X3: Distance to the nearest MRT Station
X4: Number of Convenience Stores
X5: Latitude
X6: Longitude
Y: House Price of Unit Area

df <- rename(df,
             X2 = X2.house.age, 
             X3 = X3.distance.to.the.nearest.MRT.station, 
             X4 = X4.number.of.convenience.stores,
             X5 = X5.latitude,
             X6 = X6.longitude,
             Y = Y.house.price.of.unit.area)

head(df)

dim(df)

## [1] 414   6

Tiến hành thống kê mô tả dữ liệu khảo sát phân bố như thế nào nhé. R giúp chúng ta tính các chỉ số thống kê mô tả, đưa ra các chỉ số thống kê mô tả cơ bản, hàm describe() cung cấp khá đầy đủ, trong đó có 2 chỉ số để đánh giá độ lệch và độ nhọn của phân bố như: skew và kurtosis, chỉ số trung bình, độ lệch chuẩn: mean, sd, median …và dưới đây là kết quả:

describe(df) %>%
  kable() %>% 
  kable_classic(full_width = T, 
                html_font = "Cambria")

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
X2	1	414	17.712560	11.3924845	16.1000	17.260843	12.4538400	0.00000	43.80000	43.80000	0.3801559	-0.8912447	0.5599101
X3	2	414	1083.885689	1262.1095954	492.2313	809.272339	455.6791856	23.38284	6488.02100	6464.63816	1.8750920	3.1251015	62.0293025
X4	3	414	4.094203	2.9455618	4.0000	3.981928	4.4478000	0.00000	10.00000	10.00000	0.1534880	-1.0767064	0.1447665
X5	4	414	24.969030	0.0124102	24.9711	24.969771	0.0116681	24.93207	25.01459	0.08252	-0.4354253	0.2356851	0.0006099
X6	5	414	121.533361	0.0153472	121.5386	121.535281	0.0086065	121.47353	121.56627	0.09274	-1.2107681	1.1527348	0.0007543
Y	6	414	37.980193	13.6064877	38.4500	37.631627	13.8623100	7.60000	117.50000	109.90000	0.5955128	2.1136172	0.6687224

Visualization các biến quan sát

Tiến hành khảo sát hệ số tương quan của các biến X so với Y, lưu ý hệ số tương quan không nói lên mối liên hệ nhân quả giữa các biến số.

Dùng corrplot()

library(corrplot)

corrplot.mixed(cor(df),
               lower = "number", 
               upper = "circle",
               tl.col = "black")

Hoặc có thể sử dụng ggpairs() trong thư viện GGally, thực hiện visualization ma trận tương quan các biến quan sát, trong đó các hệ số tương quan được tính bằng hệ số tương quan Pearson

ggpairs(df)

Ngoài ra có thể sử dụng thư viện correlation để thực hiện, nó cung cấp thêm cho chúng ta chỉ số p-value và khoảng tin cậy 95% (95% CI)

df %>% correlation() -> out_corr #Sử dụng hệ số tương quan Pearson 

out_corr

Điểm hay của thư viện correlation có thể tùy biến method và p_adjust, trường hợp này mình hiệu chỉnh p-value theo phương pháp Bonferroni:

df %>% correlation(ci = 0.975, #Khoảng tin cậy 97.5% 
                   method="spearman", 
                   p_adjust = "bonferroni") -> out_corr #Sử dụng hệ số tương quan phi tham số Spearman với chỉ số p adjust theo PP Bonferroni  
out_corr

df %>% correlation(ci = 0.975, 
                   method="pearson", # Hệ số tương quan Pearson
                   bayesian = TRUE, 
                   iterations = 1000) -> out_corr_bayes # Sử dụng phương pháp Bayes 
out_corr_bayes

Kiểm tra các giá trị ngoại vi (outlier) hoặc giá trị có ảnh hưởng đến tham số của mô hình ?

t_1 <- outlierTest(lm(Y ~ X3, data=df))

outlier_test_plot_1 <- ggplot(df, aes(X3, Y)) + 
  geom_boxplot(outlier.colour = "orange") + 
  geom_jitter(positiion = position_jitter(width = 0.5, height = 0.5), alpha = 1/4) + 
  geom_text_repel(aes(label = ifelse(rownames(df) == names(t_1$rstudent), rownames(df)," ")), color ="orange")

suppressMessages(print(outlier_test_plot_1))

t_2 <- outlierTest(lm(Y ~ X4, data=df))

outlier_test_plot_2 <- ggplot(df, aes(X4, Y)) + 
  geom_boxplot(outlier.colour = "orange") + 
  geom_jitter(positiion = position_jitter(width = 0.5, height = 0.5), alpha = 1/4) + 
  geom_text_repel(aes(label = ifelse(rownames(df) == names(t_2$rstudent), rownames(df)," ")), color ="orange")

suppressMessages(print(outlier_test_plot_2))

Kiểm tra lại bằng phương pháp Rosner’s Test:

library(EnvStats)

rosnerTest(df$Y, k = 3)$all.stats

Mình sẽ tạm thời không loại bỏ giá trị đang được tạm gọi là outlier nói trên

B. Một chút với … “Recipes”

Đầu tiên mình sẽ chia dữ liệu thành 2 tập trainset (312 obs) và testset (102 obs).

set.seed(123)

idx = caret::createDataPartition(y=df$Y, p = .75, list = FALSE) #Tách dữ liệu training data chiếm 75% dữ liệu

trainset = df[idx,]
testset = df[-idx,]

trainset %>% head()

Định vai trò từng biến

rec <- recipe(trainset)
summary(rec)

rec <- rec %>% 
  update_role(Y, new_role = "outcome") %>%
  update_role(X2, X3, X4, X5, X6, new_role = "predictor")
summary(rec)

rec$template %>% 
  head() %>% 
  knitr::kable()

X2	X3	X4	X5	X6	Y
32.0	84.87882	10	24.98298	121.5402	37.9
19.5	306.59470	9	24.98034	121.5395	42.2
13.3	561.98450	5	24.98746	121.5439	54.8
34.5	623.47310	7	24.97933	121.5364	40.3
20.3	287.60250	6	24.98042	121.5423	46.7
6.3	90.45606	9	24.97433	121.5431	58.1

rec$steps

## NULL

Kết hợp quy trình logarithm sẽ chuẩn hóa toàn bộ predictor trong công thức

rec %>%
  step_log(all_numeric_predictors())

## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          5
## 
## Operations:
## 
## Log transformation on all_numeric_predictors()

rec

## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          5

Thực hiện công thức cho testset, ta được bộ dữ liệu đã được chuẩn hóa

std_func <- prep(rec, training = trainset, retain = T)

std_test <- bake(std_func, new_data = testset)
std_train <- bake(std_func, new_data = trainset)

C. Modelling

Ta sẽ Dựng mô hình Multiple Linear Regression cổ điển với outcome là biến Y và biến predictor là các biến X2, X3, X4, X5, X6 trên tập dữ liệu std_train

std_train$Y = log(std_train$Y)

model_1 = lm(Y ~ X2 + X3 + X4 + X5 + X6, data = std_train)
summary(model_1)

## 
## Call:
## lm(formula = Y ~ X2 + X3 + X4 + X5 + X6, data = std_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64808 -0.11923  0.00381  0.10931  1.07192 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.087e+02  1.736e+02  -1.203    0.230    
## X2          -6.764e-03  1.142e-03  -5.921 8.57e-09 ***
## X3          -1.506e-04  2.060e-05  -7.310 2.34e-12 ***
## X4           2.801e-02  5.512e-03   5.082 6.53e-07 ***
## X5           8.545e+00  1.282e+00   6.667 1.22e-10 ***
## X6          -7.419e-03  1.373e+00  -0.005    0.996    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2266 on 306 degrees of freedom
## Multiple R-squared:  0.6793, Adjusted R-squared:  0.6741 
## F-statistic: 129.6 on 5 and 306 DF,  p-value: < 2.2e-16

Tính chỉ số variance inflation factor - VIF để phát hiện hiên tượng đa cộng tuyến

vif(model_1)

##       X2       X3       X4       X5       X6 
## 1.009679 3.977494 1.626433 1.518783 2.680212

autoplot(model_1)

Bootstrap validation approach, sử dụng phương pháp tái chọn mẫu Bootstrap

model_2 = train(form = Y ~ X2 + X3 + X4 + X5, 
                    data = std_train, 
                    method = "lm",
                    trControl = trainControl(method = "boot"), 
                    number = 200)
model_2

## Linear Regression 
## 
## 312 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 312, 312, 312, 312, 312, 312, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.2312092  0.6634836  0.1564545
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

summary(model_2)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat, number = 200)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64805 -0.11924  0.00378  0.10930  1.07201 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.096e+02  3.174e+01  -6.606 1.74e-10 ***
## X2          -6.764e-03  1.139e-03  -5.938 7.80e-09 ***
## X3          -1.505e-04  1.424e-05 -10.570  < 2e-16 ***
## X4           2.801e-02  5.477e-03   5.115 5.53e-07 ***
## X5           8.546e+00  1.271e+00   6.724 8.64e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2263 on 307 degrees of freedom
## Multiple R-squared:  0.6793, Adjusted R-squared:  0.6751 
## F-statistic: 162.6 on 4 and 307 DF,  p-value: < 2.2e-16

Một cách khác ta sử dụng phương pháp Ridge hoặc phương pháp LASSO vì trong mô hình nhiều biến tiên lượng là mô hình phức tạp ngoài bias, phương sai của ước số mô hình có xu hướng tăng.

Sử dụng ridge regression (L2 Penalty) tìm ước số mô hình sao cho tổng bình phương phần dư thấp nhất (least squares) và cộng thêm hằng số $ ||b||^2$ trong đó $\lambda$ là ‘regularization penalty’

$ L(b) = ||y - Xb||^2 + ||b||^2 $

x_vars = model.matrix(Y ~ X2 + X3 + X4 + X5 + X6, data = std_train)
y_vars = std_train$Y

ridge_L2 = cv.glmnet(x_vars, y_vars, alpha = 0)

plot(ridge_L2)

Hệ số của các biến

coef(ridge_L2)

## 7 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -5.322074e+02
## (Intercept)  .           
## X2          -3.232700e-03
## X3          -7.067895e-05
## X4           2.267914e-02
## X5           5.944356e+00
## X6           3.187480e+00

sử dụng LASSO (L1 Penalty) khác với ‘phạt’ cho tổng bình phương ước số $b_j$ như Rigde Regression, LASSO sẽ ‘phạt’ tổng giá trị tuyệt đối của $b_j$

lasso_L1 = cv.glmnet(x_vars, y_vars, alpha = 1)

plot(lasso_L1)

coef(lasso_L1)

## 7 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -1.291796e+02
## (Intercept)  .           
## X2          -1.103494e-03
## X3          -1.375528e-04
## X4           1.594839e-02
## X5           5.320401e+00
## X6           .

Dùng phương pháp BMA

library(BMA)

bma = bicreg(x_vars, y_vars, strict = FALSE, OR=30)

## Reordering variables and trying again:

summary(bma)

## 
## Call:
## bicreg(x = x_vars, y = y_vars, strict = FALSE, OR = 30)
## 
## 
##   2  models were selected
##  Best  2  models (cumulative posterior probability =  1 ): 
## 
##               p!=0    EV         SD         model 1     model 2   
## Intercept     100.0  -2.096e+02  5.067e+01  -2.096e+02  -2.087e+02
## X.Intercept.    0.0   0.000e+00  0.000e+00       .           .    
## X2            100.0  -6.764e-03  1.139e-03  -6.764e-03  -6.764e-03
## X3            100.0  -1.505e-04  1.465e-05  -1.505e-04  -1.506e-04
## X4            100.0   2.801e-02  5.479e-03   2.801e-02   2.801e-02
## X5            100.0   8.546e+00  1.272e+00   8.546e+00   8.545e+00
## X6              5.4  -3.975e-04  3.178e-01       .      -7.419e-03
##                                                                   
## nVar                                           4           5      
## r2                                           0.679       0.679    
## BIC                                         -3.319e+02  -3.261e+02
## post prob                                    0.946       0.054

imageplot.bma(bma)

D. Kiểm định mô hình

Tạo ra biến tiên lượng dựa trên giá trị thật (Y) của mô hình, tính phần dư của mô hình (residuals)

std_test$Y = log(std_test$Y)

std_test$predicted = predict(model_2, std_test)

std_test$truth = std_test$Y

std_test$error = std_test$Y - std_test$predicted

std_test %>% head()

So sánh mật độ phân bố giữa giá trị thực tế và tiên lượng:

std_test %>% gather(truth, 
                    predicted, 
                    key = "Price", 
                    value = 'Y') %>%
  ggplot(aes(x = Y, fill = Price)) + 
  geom_density(alpha = 0.6) + 
  scale_fill_manual(values = c("gray",
                               "blue")) + 
  theme_bw()

std_test %>% gather(truth,
                    predicted,
                    key = "Y",
                    value="X3") %>%
  ggplot(aes(Y, X3, fill = Y, col = Y)) +
  geom_jitter(alpha = 0.5) + 
  geom_boxplot(alpha = 0.5) +
  coord_flip()+
  stat_compare_means(method = "t.test", 
                     paired = TRUE, 
                     label.y = 4) +
  scale_color_manual(values = c("blue",
                              "red"))+
  scale_fill_manual(values = c("blue",
                             "red"))+
  theme_bw()

Tương quan tuyến tính giữa giá trị thực tế và tiên lượng:

std_test %>% ggplot(aes(truth, 
                   predicted)) + 
  geom_jitter(alpha=0.5) + 
  geom_smooth(alpha=0.2, 
              method='lm') +
  stat_cor(method="spearman", 
           label.x = 3, 
           label.y = 3) + 
  scale_color_manual(values=c("blue","red")) + 
  theme_bw()

Khảo sát trực tiếp sai biệt giữa thực tế và tiên lượng

ggplot(std_test, aes(error)) + 
  geom_histogram(bindwith=20, 
                 aes(y=..density..), 
                 col='white', 
                 fill='gray', 
                 lwd = 0.8) + 
  geom_density() + 
  labs(x = 'Error', y = 'Frequency') +
  theme_bw()

Khảo sát khuynh hướng sai biệt của mô hình

std_test %>% mutate(est = if_else(.$error > 0, "Over", "Under")) %>%
  ggplot(aes(x = truth, 
             y = error, 
             col = est)) +
  geom_jitter(alpha=0.8) +
  scale_color_manual(values = c("red","blue")) + 
  geom_hline(yintercept = 0, linetype = 2, col = "black")+
  theme_bw()

Kiểm tra tính hợp lý của nội dung mô hình, quy luật bên trong mô hình có phù hợp với quan sát thực tế không ?

std_test %>% gather(truth, 
                    predicted, 
                    key="Price", 
                    value="Y") %>% 
  ggplot() +
  geom_jitter(aes(X3, 
                  Y), 
              alpha=0.5) +
  geom_smooth(aes(X3, 
                  Y, 
                  col = Price, 
                  fill = Price),
              alpha=0.2) +
  scale_color_manual(values = c("blue", "red")) +
  scale_fill_manual(values = c("blue", "red")) +
  ggtitle("Test Set")+
  theme_bw()

Sử dụng các chỉ số để kiểm tra tính hợp lý của mô hình:

scoring = function(methods,
                   truth,
                   predicted){
  
  SSE = function(truth, 
                 predicted) {
    sum((predicted - truth)^2)
  }
  
  # MSE
  
  MSE = function(truth, 
                 predicted) {
    mean((predicted - truth)^2)
  }
  
  # RMSE
  
  RMSE = function(truth, 
                  predicted) {
    sqrt(MSE(truth, 
             predicted))
  }
  
  # MEDSE
  
  MEDSE = function(truth, 
                   predicted) {
    median((predicted - truth)^2)
  }
  
  # SAE
  
  SAE = function(truth, 
                 predicted) {
    sum(abs(predicted - truth))
  }
  
  # MAE
  
  MAE = function(truth, 
                 predicted) {
    mean(abs(predicted - truth))
  }
  
  # MEDAE
  
  MEDAE = function(truth, 
                   predicted) {
    median(abs(predicted - truth))
  }
  
  # RSQ
  
  RSQ = function(truth,  
                 predicted) {
    rss = SSE(truth,  predicted)
    ess = sum((truth - mean(truth))^2L)
    if (ess == 0){
      warning("Error: all truth values are equal")
      return(NA_real_)
    }
    1 - rss / ess
  }

  # RRSE Root relative squared error
  
  RRSE = function(truth, 
                  predicted){
    tss = sum((truth - mean(truth))^2L)
    if (tss == 0){
      warning("Error: all truth values are equal.")
      return(NA_real_)
    }
    sqrt(SSE(truth, predicted) / tss)
  }
  
  # RAE : Relative absolute error
  
  RAE = function(truth, 
                 predicted){
    meanad = sum(abs(truth - mean(truth)))
    if (meanad == 0){
      warning("Error:  all truth values are equal.")
      return(NA_real_)
    }
    return(SAE(truth, 
               predicted) / meanad)
  }
  
  # MAPE Mean absolute percentage error
  
  MAPE = function(truth,
                  predicted){
    if (any(truth == 0)){
      warning("Error: truth value is equal to 0.")
      return(NA_real_)
    }
    return(mean(abs((truth - predicted) / truth)))
  }
  
  # Mean squared logarithmic error
  
  MSLE = function(truth, 
                  predicted) {
    if (any(truth < -1))
      stop("All truth values must be greater or equal -1")
    if (any(predicted < -1))
      stop("All predicted values must be greater or equal -1")
    
    mean((log(predicted + 1) - log(truth + 1))^2)
  }
  
  # Root mean squared logarithmic error
  
  RMSLE = function(truth, 
                   predicted) {
    sqrt(MSLE(truth, predicted))
  }
  
  # Kendall Tau
  
  KendallTau = function(truth, 
                        predicted) {
    cor(truth, predicted, use = "na.or.complete", method = "kendall")
  }
  
  # rho
  
  SpearmanRho = function(truth, 
                         predicted) {
    cor(truth, predicted, use = "na.or.complete", method = "spearman")
  }
  
  # Pearson r
  
  PearsonR = function(truth,
                      predicted) {
    cor(truth, 
        predicted, 
        use = "na.or.complete", method = "pearson")
  }
  
  scores = data_frame(Score = methods, Value=rep(NA,
                                                 length(methods)))
  
  for (i in 1:length(methods)){
    scoring_func = get(methods[i])
    scores$Value[i]= scoring_func(truth,
                                  predicted)
  }
  
  return(scores)
}

scores = scoring(methods=c('MAE',
                           'MAPE',
                           'MEDAE',
                           'MEDSE',
                           'MSE',
                           'MSLE',
                           'RAE',
                           'RMSE',
                           'RMSLE',
                           'RRSE',
                           'RSQ',
                           'SAE',
                           'SSE',
                           'PearsonR',
                           'KendallTau',
                           'SpearmanRho'),
                 truth = std_test$truth, 
                 predicted = std_test$predicted)

knitr::kable(scores)

Score	Value
MAE	0.1552740
MAPE	0.0445896
MEDAE	0.1091510
MEDSE	0.0119140
MSE	0.0483227
MSLE	0.0025639
RAE	0.5129345
RMSE	0.2198241
RMSLE	0.0506347
RRSE	0.5811224
RSQ	0.6622967
SAE	15.8379475
SSE	4.9289106
PearsonR	0.8203669
KendallTau	0.6781376
SpearmanRho	0.8543421

Practice: Linear Regression_LASSO & RIDGE

namvu

19/03/2022

A. Khảo sát và thăm dò dữ liệu

B. Một chút với … “Recipes”

C. Modelling

D. Kiểm định mô hình