HousePrices: đây là bộ dữ liệu khảo sát 546 người về giá bán nhà ở thành phố Windsor, Canada, trong tháng 7, 8 và 9 năm 1987.
Bộ dữ liệu gồm 546 quan sát và 12 biến. Trong đó có 6 biến định tính và 6 biến định lượng.
6 biến định tính bao gồm:
driveway: nhà có khu vực riêng để đậu xe không?
recreation: nhà có phòng giải trí không?
fullbase: nhà có được trang bị tầng hầm hoàn thiện hay không? ( tầng hầm hoàn thiện bao gồm các phòng như phòng tập thể dục, trò chơi điện tử, …)
gasheat: nhà có sử dụng gas để đun nước nóng không?
aircon: nhà có máy điều hoà trung tâm không?
prefer: nhà có nằm trong khu trung tâm của thành phố không?
6 biến định lượng bao gồm
price: giá bán nhà
lotsize: diện tích căn nhà
bedrooms: số phòng ngủ của ngôi nhà
bathrooms: số phòng tắm của ngôi nhà
stories: số tầng của ngôi nhà không tính tầng hầm
garage: số gara trong nhà
library(xlsx)
library(readxl)
## Warning: package 'readxl' was built under R version 4.2.3
library(data.table)
## Warning: package 'data.table' was built under R version 4.2.3
d <- read_excel("D:/data2607.xlsx",1)
## Warning: Expecting numeric in G3081 / R3081C7: got a date
## New names:
## • `` -> `...1`
data.table(d)
## ...1 property_type price location city
## 1: 0 Flat 10000000 G-10 Islamabad
## 2: 1 Flat 6900000 E-11 Islamabad
## 3: 2 House 16500000 G-15 Islamabad
## 4: 3 House 43500000 Bani Gala Islamabad
## 5: 4 House 7000000 DHA Defence Islamabad
## ---
## 153426: 168441 House 26500000 Gadap Town Karachi
## 153427: 168442 House 12500000 Gadap Town Karachi
## 153428: 168443 House 27000000 Gadap Town Karachi
## 153429: 168444 House 11000000 Gadap Town Karachi
## 153430: 168445 House 9000000 Bahria Town Karachi Karachi
## province_name latitude longitude baths purpose bedrooms
## 1: Islamabad Capital 3.367989e+06 7.301264e+06 2 For Sale 2
## 2: Islamabad Capital 3.370099e+07 7.297149e+07 3 For Sale 3
## 3: Islamabad Capital 3.363149e+16 7.292656e+07 6 For Sale 5
## 4: Islamabad Capital 3.370757e+13 7.315120e+12 4 For Sale 4
## 5: Islamabad Capital 3.349259e+07 7.330134e+07 3 For Sale 3
## ---
## 153426: Sindh 2.502991e+07 6.713719e+07 0 For Sale 6
## 153427: Sindh 2.501795e+07 6.713639e+07 0 For Sale 3
## 153428: Sindh 2.501538e+07 6.711633e+06 0 For Sale 6
## 153429: Sindh 2.501327e+07 6.712082e+07 0 For Sale 3
## 153430: Sindh 2.511357e+07 6.735381e+07 3 For Sale 3
## date_added agency
## 1: 2019-02-04 Self
## 2: 2019-05-04 Self
## 3: 2019-07-17 Self
## 4: 2019-04-05 Self
## 5: 2019-07-10 Easy Property
## ---
## 153426: 2019-07-18 Al Shahab Enterprises
## 153427: 2019-07-18 Al Shahab Enterprises
## 153428: 2019-07-18 Al Shahab Enterprises
## 153429: 2019-07-18 Al Shahab Enterprises
## 153430: 2019-07-18 ZPN Real Estate & Builders
## agent Area_in_Marla
## 1: Self 4.0
## 2: Self 5.6
## 3: Self 8.0
## 4: Self 40.0
## 5: Muhammad Junaid Ceo Muhammad Shahid Director 8.0
## ---
## 153426: Shahmir 9.6
## 153427: Shahmir 8.0
## 153428: Shahmir 9.6
## 153429: Shahmir 7.8
## 153430: Ali Raza 9.4
table(d$purpose)/sum(table(d$purpose))
##
## For Rent For Sale
## 0.2814508 0.7185492
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
d |> ggplot(aes(x = purpose, y = after_stat(count))) +
geom_bar(fill = 'pink') +
geom_text(aes(label = scales::percent(after_stat(count/sum(count)))), stat = 'count', color = 'black', vjust = - .5) +
theme_classic() +
labs(x = 'Mục đích sử dụng tài sản', y = 'Số tài sản')
Nhận xét: Trong tổng số 564 loại tài sản được khảo sát thì có 28% loại được thuê và 72% loại bán.
pp<- d[d$purpose == "For Sale",]
prop.test(length(pp$purpose), length(d$purpose), p = 0.6)
##
## 1-sample proportions test with continuity correction
##
## data: length(pp$purpose) out of length(d$purpose), null probability 0.6
## X-squared = 8984.1, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.6
## 95 percent confidence interval:
## 0.7162903 0.7207971
## sample estimates:
## p
## 0.7185492
Ta có p_value <0.05 nên ta bác bỏ \(H_0\). Vì vậy tỉ lệ loại tài sản dùng để bán không bằng 60%. Khoảng ước lượng tỷ lệ số căn hộ với mục đích để bán với độ tin cậy 95% là (0,7162903 ; 0,7207971).
fit1 <- glm(factor(purpose) ~ property_type + price + baths + bedrooms + Area_in_Marla, family = binomial(link = 'logit'), data = d)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit1)
##
## Call:
## glm(formula = factor(purpose) ~ property_type + price + baths +
## bedrooms + Area_in_Marla, family = binomial(link = "logit"),
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.49 0.00 0.00 0.00 8.49
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.798e+13 2.733e+06 1.756e+07 <2e-16 ***
## property_typeFlat 1.556e+15 2.752e+06 5.652e+08 <2e-16 ***
## property_typeHouse -5.182e+14 2.739e+06 -1.892e+08 <2e-16 ***
## property_typeLower Portion -3.930e+14 2.827e+06 -1.390e+08 <2e-16 ***
## property_typePenthouse -4.679e+14 4.388e+06 -1.066e+08 <2e-16 ***
## property_typeRoom -1.277e+15 3.774e+06 -3.384e+08 <2e-16 ***
## property_typeUpper Portion -2.953e+14 2.795e+06 -1.056e+08 <2e-16 ***
## price 6.576e+07 5.084e-03 1.294e+10 <2e-16 ***
## baths -4.013e+13 9.157e+04 -4.382e+08 <2e-16 ***
## bedrooms 6.753e+13 1.220e+05 5.535e+08 <2e-16 ***
## Area_in_Marla -2.979e+12 1.843e+03 -1.617e+09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 182373 on 153429 degrees of freedom
## Residual deviance: 1385951 on 153419 degrees of freedom
## AIC: 1385973
##
## Number of Fisher Scoring iterations: 25
fit2 <- glm(factor(purpose) ~ property_type + price + baths + bedrooms + Area_in_Marla, family = binomial(link = 'probit'), data = d)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit2)
##
## Call:
## glm(formula = factor(purpose) ~ property_type + price + baths +
## bedrooms + Area_in_Marla, family = binomial(link = "probit"),
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.49 0.00 0.00 0.00 8.49
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.008e+14 2.733e+06 2.565e+08 <2e-16 ***
## property_typeFlat 1.828e+14 2.752e+06 6.640e+07 <2e-16 ***
## property_typeHouse 3.536e+14 2.739e+06 1.291e+08 <2e-16 ***
## property_typeLower Portion -4.680e+14 2.827e+06 -1.656e+08 <2e-16 ***
## property_typePenthouse -9.521e+14 4.388e+06 -2.170e+08 <2e-16 ***
## property_typeRoom -3.868e+15 3.774e+06 -1.025e+09 <2e-16 ***
## property_typeUpper Portion -3.312e+14 2.795e+06 -1.185e+08 <2e-16 ***
## price 5.405e+07 5.084e-03 1.063e+10 <2e-16 ***
## baths -5.215e+13 9.157e+04 -5.695e+08 <2e-16 ***
## bedrooms -1.224e+14 1.220e+05 -1.003e+09 <2e-16 ***
## Area_in_Marla -2.860e+12 1.843e+03 -1.552e+09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 182373 on 153429 degrees of freedom
## Residual deviance: 1958756 on 153419 degrees of freedom
## AIC: 1958778
##
## Number of Fisher Scoring iterations: 25
fit3 <- glm(factor(purpose) ~ property_type + price + baths + bedrooms + Area_in_Marla, family = binomial(link = 'cloglog'), data = d)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit3)
##
## Call:
## glm(formula = factor(purpose) ~ property_type + price + baths +
## bedrooms + Area_in_Marla, family = binomial(link = "cloglog"),
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.49 0.00 0.00 0.00 8.49
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.014e+14 2.733e+06 3.712e+07 <2e-16 ***
## property_typeFlat 3.185e+14 2.752e+06 1.157e+08 <2e-16 ***
## property_typeHouse -9.157e+13 2.739e+06 -3.343e+07 <2e-16 ***
## property_typeLower Portion -6.761e+14 2.827e+06 -2.392e+08 <2e-16 ***
## property_typePenthouse 4.042e+14 4.388e+06 9.213e+07 <2e-16 ***
## property_typeRoom -2.254e+14 3.774e+06 -5.971e+07 <2e-16 ***
## property_typeUpper Portion -1.774e+15 2.795e+06 -6.346e+08 <2e-16 ***
## price 4.226e+07 5.084e-03 8.311e+09 <2e-16 ***
## baths -2.894e+13 9.157e+04 -3.160e+08 <2e-16 ***
## bedrooms -4.268e+13 1.220e+05 -3.498e+08 <2e-16 ***
## Area_in_Marla -1.871e+12 1.843e+03 -1.015e+09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 182373 on 153429 degrees of freedom
## Residual deviance: 1345942 on 153419 degrees of freedom
## AIC: 1345964
##
## Number of Fisher Scoring iterations: 25
AIC1 <- AIC(fit1)
AIC2 <- AIC(fit2)
AIC3 <- AIC(fit3)
data.frame(AIC1,AIC2,AIC3)
## AIC1 AIC2 AIC3
## 1 1385973 1958778 1345964
Deviance1 <- deviance(fit1)
Deviance2 <- deviance(fit2)
Deviance3 <- deviance(fit3)
data.frame(Deviance1, Deviance2, Deviance3)
## Deviance1 Deviance2 Deviance3
## 1 1385951 1958756 1345942
library(DescTools)
## Warning: package 'DescTools' was built under R version 4.2.3
##
## Attaching package: 'DescTools'
## The following object is masked from 'package:data.table':
##
## %like%
BS1 <- BrierScore(fit1)
BS2 <- BrierScore(fit2)
BS3 <- BrierScore(fit3)
data.frame(BS1,BS2,BS3)
## BS1 BS2 BS3
## 1 0.125308 0.177097 0.1216907
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:DescTools':
##
## MAE, RMSE
predictions <- predict(fit1 , type = "response")
predicted_classes <- ifelse(predictions >= 0.5, "1", "0")
predictions1<- factor(predicted_classes, levels = c("0","1"))
actual<- factor(fit1$data$purpose, labels = c("0","1"))
confusionMatrix(table(predictions1, actual))
## Confusion Matrix and Statistics
##
## actual
## predictions1 0 1
## 0 33707 9750
## 1 9476 100497
##
## Accuracy : 0.8747
## 95% CI : (0.873, 0.8763)
## No Information Rate : 0.7185
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6908
##
## Mcnemar's Test P-Value : 0.04897
##
## Sensitivity : 0.7806
## Specificity : 0.9116
## Pos Pred Value : 0.7756
## Neg Pred Value : 0.9138
## Prevalence : 0.2815
## Detection Rate : 0.2197
## Detection Prevalence : 0.2832
## Balanced Accuracy : 0.8461
##
## 'Positive' Class : 0
##
library(caret)
predictions <- predict(fit2 , type = "response")
predicted_classes <- ifelse(predictions >= 0.5, "1", "0")
predictions1<- factor(predicted_classes, levels = c("0","1"))
actual<- factor(fit2$data$purpose, labels = c("0","1"))
confusionMatrix(table(predictions1, actual))
## Confusion Matrix and Statistics
##
## actual
## predictions1 0 1
## 0 16371 360
## 1 26812 109887
##
## Accuracy : 0.8229
## 95% CI : (0.821, 0.8248)
## No Information Rate : 0.7185
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4619
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3791
## Specificity : 0.9967
## Pos Pred Value : 0.9785
## Neg Pred Value : 0.8039
## Prevalence : 0.2815
## Detection Rate : 0.1067
## Detection Prevalence : 0.1090
## Balanced Accuracy : 0.6879
##
## 'Positive' Class : 0
##
library(caret)
predictions <- predict(fit3 , type = "response")
predicted_classes <- ifelse(predictions >= 0.5, "1", "0")
predictions1<- factor(predicted_classes, levels = c("0","1"))
actual<- factor(fit3$data$purpose, labels = c("0","1"))
confusionMatrix(table(predictions1, actual))
## Confusion Matrix and Statistics
##
## actual
## predictions1 0 1
## 0 32529 8017
## 1 10654 102230
##
## Accuracy : 0.8783
## 95% CI : (0.8767, 0.8799)
## No Information Rate : 0.7185
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6934
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7533
## Specificity : 0.9273
## Pos Pred Value : 0.8023
## Neg Pred Value : 0.9056
## Prevalence : 0.2815
## Detection Rate : 0.2120
## Detection Prevalence : 0.2643
## Balanced Accuracy : 0.8403
##
## 'Positive' Class : 0
##