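This project analyzes the relationship between Airbnb listing prices in the Airbnb Price Dataset and the factors that may influence them. These include hardware factors such as the number of rooms, guest capacity, and bed type, and service factors such as the host's message response rate, the cancellation policy, and whether a cleaning fee is charged. Beyond explaining which factors affect Airbnb pricing and by how much, we use machine learning on this real-world data to build a price estimation model, and we deployed an interactive Airbnb price estimator as a Shiny App online (shinyapps.io). Hosts can use it to price their listings against the market, and guests can use it to judge whether a listing is priced above or below the market before booking.
Data source: Airbnb Price Dataset
First, we load the Airbnb Price Dataset and take an initial look at it.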
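# Packages used throughout this report: dplyr (data wrangling), ggplot2 (plots), forcats (factor helpers)
library(dplyr)
library(ggplot2)
library(forcats)
airbnb = read.csv('airbnb.csv')
We can preview the dataset by looking at its first three rows: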
head(airbnb, 3)
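As the str() output below confirms, the dataset has 74,111 rows and 29 variables: the price (the dependent variable, stored as log_price) and 28 other variables that may influence it (independent variables). The first few values of each variable give a sense of what the data look like: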
str(airbnb)
'data.frame': 74111 obs. of 29 variables:
$ id : int 6901257 6304928 7919400 13418779 3808709 12422935 11825529 13971273 180792 5385260 ...
$ log_price : num 5.01 5.13 4.98 6.62 4.74 ...
$ property_type : chr "Apartment" "Apartment" "Apartment" "House" ...
$ room_type : chr "Entire home/apt" "Entire home/apt" "Entire home/apt" "Entire home/apt" ...
$ amenities : chr "{\"Wireless Internet\",\"Air conditioning\",Kitchen,Heating,\"Family/kid friendly\",Essentials,\"Hair dryer\",I"| __truncated__ "{\"Wireless Internet\",\"Air conditioning\",Kitchen,Heating,\"Family/kid friendly\",Washer,Dryer,\"Smoke detect"| __truncated__ "{TV,\"Cable TV\",\"Wireless Internet\",\"Air conditioning\",Kitchen,Breakfast,\"Buzzer/wireless intercom\",Heat"| __truncated__ "{TV,\"Cable TV\",Internet,\"Wireless Internet\",Kitchen,\"Indoor fireplace\",\"Buzzer/wireless intercom\",Heati"| __truncated__ ...
$ accommodates : int 3 7 5 4 2 2 3 2 2 2 ...
$ bathrooms : num 1 1 1 1 1 1 1 1 1 1 ...
$ bed_type : chr "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
$ cancellation_policy : chr "strict" "strict" "moderate" "flexible" ...
$ cleaning_fee : chr "True" "True" "True" "True" ...
$ city : chr "NYC" "NYC" "NYC" "SF" ...
$ description : chr "Beautiful, sunlit brownstone 1-bedroom in the loveliest neighborhood in Brooklyn. Blocks from the promenade and"| __truncated__ "Enjoy travelling during your stay in Manhattan. My place is centrally located near Times Square and Central Par"| __truncated__ "The Oasis comes complete with a full backyard with outdoor furniture to make the most of this summer vacation!!"| __truncated__ "This light-filled home-away-from-home is super clean and comes with all of the modern amenities travelers could"| __truncated__ ...
$ first_review : chr "2016-06-18" "2017-08-05" "2017-04-30" "" ...
$ host_has_profile_pic : chr "t" "t" "t" "t" ...
$ host_identity_verified: chr "t" "f" "t" "t" ...
$ host_response_rate : chr "" "100%" "100%" "" ...
$ host_since : chr "2012-03-26" "2017-06-19" "2016-10-25" "2015-04-19" ...
$ instant_bookable : chr "f" "t" "t" "f" ...
$ last_review : chr "2016-07-18" "2017-09-23" "2017-09-14" "" ...
$ latitude : num 40.7 40.8 40.8 37.8 38.9 ...
$ longitude : num -74 -74 -73.9 -122.4 -77 ...
$ name : chr "Beautiful brownstone 1-bedroom" "Superb 3BR Apt Located Near Times Square" "The Garden Oasis" "Beautiful Flat in the Heart of SF!" ...
$ neighbourhood : chr "Brooklyn Heights" "Hell's Kitchen" "Harlem" "Lower Haight" ...
$ number_of_reviews : int 2 6 10 0 4 3 15 9 159 2 ...
$ review_scores_rating : num 100 93 92 NA 40 100 97 93 99 90 ...
$ thumbnail_url : chr "https://a0.muscache.com/im/pictures/6d7cbbf7-c034-459c-bc82-6522c957627c.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/348a55fe-4b65-452a-b48a-bfecb3b58a66.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/6fae5362-9e3a-4fa9-aa54-bbd5ea26538d.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/72208dad-9c86-41ea-a735-43d933111063.jpg?aki_policy=small" ...
$ zipcode : chr "11201" "10019" "10027" "94117.0" ...
$ bedrooms : num 1 3 1 2 0 1 1 1 1 1 ...
$ beds : num 1 3 3 2 1 1 1 1 1 1 ...
The summary statistics of each variable give a rough picture of its data type and distribution:
summary(airbnb)
id log_price property_type room_type
Min. : 344 Min. :0.000 Length:74111 Length:74111
1st Qu.: 6261964 1st Qu.:4.317 Class :character Class :character
Median :12254147 Median :4.710 Mode :character Mode :character
Mean :11266617 Mean :4.782
3rd Qu.:16402260 3rd Qu.:5.220
Max. :21230903 Max. :7.600
amenities accommodates bathrooms bed_type
Length:74111 Min. : 1.000 Min. :0.000 Length:74111
Class :character 1st Qu.: 2.000 1st Qu.:1.000 Class :character
Mode :character Median : 2.000 Median :1.000 Mode :character
Mean : 3.155 Mean :1.235
3rd Qu.: 4.000 3rd Qu.:1.000
Max. :16.000 Max. :8.000
NA's :200
cancellation_policy cleaning_fee city description
Length:74111 Length:74111 Length:74111 Length:74111
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
first_review host_has_profile_pic host_identity_verified
Length:74111 Length:74111 Length:74111
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
host_response_rate host_since instant_bookable last_review
Length:74111 Length:74111 Length:74111 Length:74111
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
latitude longitude name neighbourhood
Min. :33.34 Min. :-122.51 Length:74111 Length:74111
1st Qu.:34.13 1st Qu.:-118.34 Class :character Class :character
Median :40.66 Median : -77.00 Mode :character Mode :character
Mean :38.45 Mean : -92.40
3rd Qu.:40.75 3rd Qu.: -73.95
Max. :42.39 Max. : -70.99
number_of_reviews review_scores_rating thumbnail_url zipcode
Min. : 0.0 Min. : 20.00 Length:74111 Length:74111
1st Qu.: 1.0 1st Qu.: 92.00 Class :character Class :character
Median : 6.0 Median : 96.00 Mode :character Mode :character
Mean : 20.9 Mean : 94.07
3rd Qu.: 23.0 3rd Qu.:100.00
Max. :605.0 Max. :100.00
NA's :16722
bedrooms beds
Min. : 0.000 Min. : 0.000
1st Qu.: 1.000 1st Qu.: 1.000
Median : 1.000 Median : 1.000
Mean : 1.266 Mean : 1.711
3rd Qu.: 1.000 3rd Qu.: 2.000
Max. :10.000 Max. :18.000
NA's :91 NA's :131
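One caveat when reading this summary: summary() reports NA's only for the numeric columns, while the str() output above shows character columns such as host_response_rate and first_review holding empty strings, which are not counted as missing. A minimal sketch of counting both kinds of gaps per column follows; the data frame `toy` and its values are made up for illustration and only stand in for `airbnb`.

```r
# Toy stand-in for the airbnb data frame (values are made up for illustration).
toy <- data.frame(
  host_response_rate   = c("", "100%", "100%", ""),
  review_scores_rating = c(100, 93, NA, 92),
  stringsAsFactors     = FALSE
)

# Count NA values and empty strings separately for every column.
missing_report <- data.frame(
  column    = names(toy),
  n_na      = vapply(toy, function(x) sum(is.na(x)), integer(1)),
  n_blank   = vapply(toy, function(x) sum(!is.na(x) & x == ""), integer(1)),
  row.names = NULL
)
print(missing_report)
```

An alternative, if it suits the analysis, is to pass na.strings = c("NA", "") to read.csv() so that blanks are treated as NA at load time; the report itself does not do this.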
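With this overall picture in mind, we take a closer look at the multi-category discrete variables that will need one-hot encoding, to prepare for the data processing that follows.
The output below shows that the dataset contains 35 property types: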
unique(airbnb$property_type)
[1] "Apartment" "House" "Condominium"
[4] "Loft" "Townhouse" "Hostel"
[7] "Guest suite" "Bed & Breakfast" "Bungalow"
[10] "Guesthouse" "Dorm" "Other"
[13] "Camper/RV" "Villa" "Boutique hotel"
[16] "Timeshare" "In-law" "Boat"
[19] "Serviced apartment" "Castle" "Cabin"
[22] "Treehouse" "Tipi" "Vacation home"
[25] "Tent" "Hut" "Casa particular"
[28] "Chalet" "Yurt" "Earth House"
[31] "Parking Space" "Train" "Cave"
[34] "Lighthouse" "Island"
The following shows how many listings each property type has:
property_type_sum = airbnb |>
group_by(property_type) |>
summarise(n=n()) |>
arrange(desc(n))
property_type_sum
ggplot(property_type_sum, aes(x=reorder(property_type, n), y=n, fill=property_type)) +
geom_bar(stat="identity") +
coord_flip() +
labs(
title = "Number of Properties by Property Type",
x = "Property Type",
y = "Number of Properties"
) +
theme(
plot.title=element_text(hjust = 0.5),
legend.position = "none"
)
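Of the 74,111 listings, 65,514 are apartments or houses, while most property types have too few listings to support further analysis and modeling. To keep the analysis valid, we retain only the five property types with the most listings; this filtering is carried out in the next stage (3.2 Data Processing).
The output below shows that Airbnb listings come in three room types: renting the entire place (Entire home/apt), sharing a home with others but having a private room (Private room), and sharing a room with others (Shared room):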
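How concentrated the listings are in a few property types can be quantified with a cumulative share. The following is a minimal base-R sketch on a made-up vector `property_type_toy` (hypothetical values, not the real counts); it ranks the types by frequency and reports what fraction of listings the k most frequent types cover.

```r
# Made-up property types standing in for airbnb$property_type.
property_type_toy <- c(rep("Apartment", 6), rep("House", 2), "Loft", "Yurt")

# Counts per type, most frequent first.
counts <- sort(table(property_type_toy), decreasing = TRUE)

# Cumulative share of listings covered by the k most frequent types.
cum_share <- cumsum(as.vector(counts)) / length(property_type_toy)
names(cum_share) <- names(counts)
print(round(cum_share, 2))

# Names of the two most frequent types (the report keeps the top five).
top_types <- names(counts)[1:2]
print(top_types)
```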
unique(airbnb$room_type)
[1] "Entire home/apt" "Private room" "Shared room"
The following shows how many listings each room type has:
room_type_sum = airbnb |>
group_by(room_type) |>
summarise(n=n()) |>
arrange(desc(n))
room_type_sum
ggplot(room_type_sum, aes(x=reorder(room_type, n), y=n, fill=room_type)) +
geom_bar(stat="identity") +
coord_flip() +
labs(
title = "Number of Properties by Room Type",
x = "Room Type",
y = "Number of Properties"
) +
theme(
plot.title=element_text(hjust = 0.5),
legend.position = "none"
)
The output below shows that listings offer five bed types: Real Bed, Futon, Pull-out Sofa, Couch, and Airbed:
unique(airbnb$bed_type)
[1] "Real Bed" "Futon" "Pull-out Sofa" "Couch"
[5] "Airbed"
The following shows how many listings each bed type has:
bed_type_sum = airbnb |>
group_by(bed_type) |>
summarise(n=n()) |>
arrange(desc(n))
bed_type_sum
ggplot(bed_type_sum, aes(x=reorder(bed_type, n), y=n, fill=bed_type)) +
geom_bar(stat="identity") +
coord_flip() +
labs(
title = "Number of Properties by Bed Type",
x = "Bed Type",
y = "Number of Properties"
) +
theme(
plot.title=element_text(hjust = 0.5),
legend.position = "none"
)
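Of the 74,111 listings, 72,078 offer a real bed (Real Bed); the other four bed types each account for a roughly similar, much smaller number of listings.
The output below shows that hosts in this dataset use five cancellation policies (Airbnb hosts actually have more than five to choose from): strict, moderate, flexible, super_strict_30, and super_strict_60.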
unique(airbnb$cancellation_policy)
[1] "strict" "moderate" "flexible" "super_strict_30"
[5] "super_strict_60"
The following shows how many listings use each cancellation policy:
cancel_pol_sum = airbnb |>
group_by(cancellation_policy) |>
summarise(n=n()) |>
arrange(desc(n))
cancel_pol_sum
ggplot(cancel_pol_sum, aes(x=reorder(cancellation_policy, n), y=n, fill=cancellation_policy)) +
geom_bar(stat="identity") +
coord_flip() +
labs(
title = "Number of Properties by Cancellation Policy",
x = "Cancellation Policy",
y = "Number of Properties"
) +
theme(
plot.title=element_text(hjust = 0.5),
legend.position = "none"
)
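Because very few listings use super_strict_30 or super_strict_60, we filter out listings with these two policies in the next stage (3.2 Data Processing) to keep the analysis valid.
The output below shows that every listing in this dataset comes from one of six major US cities: New York City (NYC), San Francisco (SF), Washington, D.C. (DC), Los Angeles (LA), Chicago, and Boston: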
unique(airbnb$city)
[1] "NYC" "SF" "DC" "LA" "Chicago" "Boston"
The following shows how many listings each city has:
city_sum = airbnb |>
group_by(city) |>
summarise(n=n()) |>
arrange(desc(n))
city_sum
ggplot(city_sum, aes(x=reorder(city, n), y=n, fill=city)) +
geom_bar(stat="identity") +
coord_flip() +
labs(
title = "Number of Properties by City",
x = "City",
y = "Number of Properties"
) +
theme(
plot.title=element_text(hjust = 0.5),
legend.position = "none"
)
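To keep the analysis and modeling accurate and valid, we first remove some variables; these are the columns dropped by select() in the code below.
We also transform several variables to prepare them for analysis and modeling. The transformed variables and the reasons are as follows:
property_type: following the conclusion of the initial exploration, we keep only listings in the five most common property types; the variable is also one-hot encoded.
cancellation_policy: following the conclusion of the initial exploration, we remove listings whose policy is super_strict_30 or super_strict_60 and keep the rest; the variable is also one-hot encoded.
Boolean variables (in the code below: cleaning_fee, host_identity_verified, instant_bookable): true values are encoded as 1 and false values as 0 for model training.
# Top five Airbnb property types by number of listings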
top_five_property_type = property_type_sum |> slice_max(n, n=5) |> pull(1)
airbnb = airbnb |>
# Remove the variable columns we are not interested in
select(-id, -description, -first_review, -host_has_profile_pic, -last_review, -name, -neighbourhood, -longitude, -latitude, -thumbnail_url, -zipcode) |>
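# Drop rows containing NA (blank strings, e.g. in host_response_rate, only become NA after the conversion below and are dropped by the final na.omit())
na.omit() |>
filter(
# Keep the five most common property types
property_type %in% top_five_property_type,
# Drop listings whose cancellation policy is super_strict_30/60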
!cancellation_policy %in% c("super_strict_30", "super_strict_60")
) |>
mutate(
# Convert log_price back to the original price scale
price = exp(log_price),
# amenities: flag the presence of the five amenities we focus on
Wireless_Internet = ifelse(grepl("Wireless Internet", amenities), 1, 0),
Air_conditioning = ifelse(grepl("Air conditioning", amenities), 1, 0),
Heating = ifelse(grepl("Heating", amenities), 1, 0),
TV = ifelse(grepl("TV", amenities), 1, 0),
Elevator = ifelse(grepl("Elevator", amenities), 1, 0),
# cleaning_fee: convert "True"/"False" to 1/0
cleaning_fee = ifelse(cleaning_fee == "True", 1, 0),
# host_identity_verified: convert "t"/"f" to 1/0
host_identity_verified = ifelse(host_identity_verified == "t", 1, 0),
# host_response_rate: convert the percentage string to a decimal fraction
host_response_rate = as.numeric(gsub("%", "", host_response_rate)) / 100,
# host_since: convert to the number of days elapsed up to the day the code is run
host_since_days = as.numeric(difftime(Sys.Date(), as.Date(host_since), units = "days")),
# instant_bookable: convert "t"/"f" to 1/0
instant_bookable = ifelse(instant_bookable == "t", 1, 0),
# Convert the five multi-category discrete variables to factors for the one-hot encoding below
property_type_factor = factor(property_type),
room_type_factor = factor(room_type),
bed_type_factor = factor(bed_type),
cancellation_policy_factor = factor(cancellation_policy),
city_factor = factor(city)
) |>
select(-log_price, -amenities)
# One-hot encode the five multi-category discrete variables
property_type_encoded = model.matrix(~ property_type_factor - 1, airbnb)
room_type_encoded = model.matrix(~ room_type_factor - 1, airbnb)
bed_type_encoded = model.matrix(~ bed_type_factor - 1, airbnb)
cancellation_policy_encoded = model.matrix(~ cancellation_policy_factor - 1, airbnb)
city_encoded = model.matrix(~ city_factor - 1, airbnb)
# Rename the one-hot encoded columns (e.g. property_type_factorHouse becomes House)
colnames(property_type_encoded) = gsub("property_type_factor", "", colnames(property_type_encoded))
colnames(room_type_encoded) = gsub("room_type_factor", "", colnames(room_type_encoded))
colnames(bed_type_encoded) = gsub("bed_type_factor", "", colnames(bed_type_encoded))
colnames(cancellation_policy_encoded) = gsub("cancellation_policy_factor", "", colnames(cancellation_policy_encoded))
colnames(city_encoded) = gsub("city_factor", "", colnames(city_encoded))
# Bind the one-hot encoded columns to the main data frame airbnb
airbnb = bind_cols(
airbnb,
as.data.frame(property_type_encoded),
as.data.frame(room_type_encoded),
as.data.frame(bed_type_encoded),
as.data.frame(cancellation_policy_encoded),
as.data.frame(city_encoded)
) |>
# The five factor columns (and host_since) are no longer needed; remove them
select(-property_type_factor, -room_type_factor, -bed_type_factor, -cancellation_policy_factor, -city_factor, -host_since) |>
na.omit()
# Replace spaces and special characters in all column names with "_" for easier machine processing
colnames(airbnb) = gsub(" ", "_", colnames(airbnb))
colnames(airbnb) = gsub("/", "_", colnames(airbnb))
colnames(airbnb) = gsub("-", "_", colnames(airbnb))
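To make the encoding step concrete, here is a minimal sketch of the same model.matrix() approach on a made-up data frame `toy` (hypothetical rows, not taken from the dataset). It shows the column names that model.matrix() produces and how the renaming steps turn them into machine-friendly names; the single character-class pattern is a compact stand-in for the three gsub() calls above.

```r
# Toy rows standing in for the preprocessed data (values are illustrative).
toy <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Shared room", "Private room"),
  stringsAsFactors = FALSE
)
toy$room_type_factor <- factor(toy$room_type)

# "- 1" drops the intercept so every level gets its own 0/1 column.
encoded <- model.matrix(~ room_type_factor - 1, toy)

# Strip the variable-name prefix, then replace spaces, slashes and hyphens.
colnames(encoded) <- gsub("room_type_factor", "", colnames(encoded))
colnames(encoded) <- gsub("[ /-]", "_", colnames(encoded))
print(encoded)
```

Because every row has exactly one 1 within a group of dummy columns, the columns of a group always sum to 1. For a linear model with an intercept this makes the full set collinear, so one column per group is usually dropped at fitting time; whether that matters here depends on the model used later.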
Inspect the dataset after preprocessing:
airbnb
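The string conversions in the pipeline above are easier to follow on a few rows. The sketch below applies the same expressions to a made-up data frame `toy` whose values mimic the raw formats shown by str(airbnb); the rows themselves are hypothetical.

```r
# Made-up rows mimicking the raw column formats shown by str(airbnb).
toy <- data.frame(
  host_response_rate = c("100%", "", "85%"),
  instant_bookable   = c("t", "f", "t"),
  amenities          = c("{TV,Heating}", "{\"Cable TV\",Kitchen}", "{Kitchen}"),
  stringsAsFactors   = FALSE
)

# "100%" -> 1.00; an empty string becomes NA and would be removed by na.omit().
toy$host_response_rate <- as.numeric(gsub("%", "", toy$host_response_rate)) / 100

# "t"/"f" -> 1/0.
toy$instant_bookable <- ifelse(toy$instant_bookable == "t", 1, 0)

# Substring match: "TV" is also found inside "Cable TV".
toy$TV <- ifelse(grepl("TV", toy$amenities, fixed = TRUE), 1, 0)
print(toy)
```

Two consequences are worth keeping in mind: listings without a recorded response rate drop out of the analysis, and the TV flag counts any amenity whose name contains "TV".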
Here we examine the relationship between each explanatory variable and price.
First, we look at how listings are distributed across property types:
airbnb |> ggplot(aes(x = fct_infreq(property_type), fill = property_type)) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count), y = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Property Type", x = "Property Type", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
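After preprocessing, most listings (about 70%) are apartments, followed by houses, and the remaining three types have similar counts. Next we look at the price distribution of each property type, using box plots, violin plots, and faceted histograms: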
airbnb |> ggplot() +
geom_boxplot(aes(x = property_type, y = price, fill = property_type)) +
theme_minimal() +
labs(title = "Box Plot of Property Type vs Price", x = "Property Type", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot() +
geom_violin(aes(x = property_type, y = price, fill = property_type)) +
theme_minimal() +
labs(title = "Violin Plot of Property Type vs Price", x = "Property Type", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot(aes(x = price, fill = property_type)) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ property_type, scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Property Type", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median price across property types:
meanPrice = airbnb |>
group_by(property_type) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(property_type, mean_price), y = mean_price, fill = property_type)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Property Type", x = "Property Type", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb |>
group_by(property_type) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(property_type, median_price), y = median_price, fill = property_type)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Property Type", x = "Property Type", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
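Loft listings have the highest mean and median price of the five property types, so lofts are the most expensive type on both measures. Apartments have the lowest mean price, while houses have the lowest median price. This suggests that houses include many low-priced listings alongside a fair number of high-priced ones, giving a wide within-group spread, whereas apartment prices are more concentrated but lower than those of the other types. The faceted plots above show the same patterns.
First, we look at how listings are distributed across room types: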
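The reasoning above relies on how the mean and median diverge when a distribution has a long right tail. A minimal sketch with made-up prices (two hypothetical groups, not the real property types) illustrates the effect: one expensive listing pulls the mean of group B above that of group A while its median stays lower.

```r
# Made-up nightly prices: group B has a long right tail.
prices <- data.frame(
  group = rep(c("A", "B"), each = 5),
  price = c(90, 95, 100, 105, 110,   # A: tightly clustered
            60, 70, 80, 90, 400)     # B: mostly cheap, one expensive listing
)

# Mean and median per group.
mean_by   <- tapply(prices$price, prices$group, mean)
median_by <- tapply(prices$price, prices$group, median)
print(rbind(mean = mean_by, median = median_by))
```

A group whose median ranks lower than its mean would suggest is therefore likely to contain a cluster of cheap listings plus some expensive outliers.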
airbnb |> ggplot(aes(x = fct_infreq(room_type), fill = room_type)) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count), y = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Room Type", x = "Room Type", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
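After preprocessing, most listings are either entire places (Entire home/apt) or private rooms with shared common areas (Private room), but over a thousand listings are shared rooms (Shared room). Next we look at the price distribution of each room type, using box plots, violin plots, and faceted histograms: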
airbnb |> ggplot() +
geom_boxplot(aes(x = room_type, y = price, fill = room_type)) +
theme_minimal() +
labs(title = "Box Plot of Room Type vs Price", x = "Room Type", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot() +
geom_violin(aes(x = room_type, y = price, fill = room_type)) +
theme_minimal() +
labs(title = "Violin Plot of Room Type vs Price", x = "Room Type", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot(aes(x = price, fill = room_type)) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ room_type, scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Room Type", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median price across room types:
meanPrice = airbnb |>
group_by(room_type) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(room_type, mean_price), y = mean_price, fill = room_type)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Room Type", x = "Room Type", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb |>
group_by(room_type) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(room_type, median_price), y = median_price, fill = room_type)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Room Type", x = "Room Type", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
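As one would expect, entire homes, which offer more space and privacy, are the most expensive of the three room types; shared rooms trade space and privacy for the lowest price; and private rooms, which share common areas but keep a separate bedroom, fall in between.
Because a listing's guest capacity (accommodates) takes integer values, we can analyze it in the same way as a categorical variable to understand its relationship with price.
First, we look at the distribution of guest capacity: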
airbnb |> ggplot(aes(x = as.factor(accommodates), fill = as.factor(accommodates))) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Accommodates", x = "Accommodates", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
Next, we use a scatter plot and a box plot to examine the relationship between guest capacity and price:
airbnb |> ggplot(aes(x = accommodates, y = price)) +
geom_point(alpha = 0.5, color = "blue") +
theme_minimal() +
labs(title = "Scatter Plot of Accommodates vs Price", x = "Accommodates", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
airbnb |> ggplot(aes(x = as.factor(accommodates), y = price, fill = as.factor(accommodates))) +
geom_boxplot() +
theme_minimal() +
labs(title = "Box Plot of Accommodates vs Price", x = "Accommodates", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
We can also compare the mean and median price across guest capacities:
meanPrice = airbnb |>
group_by(accommodates = as.factor(accommodates)) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = as.factor(accommodates), y = mean_price, fill = as.factor(accommodates))) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Accommodates", x = "Accommodates", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5))
medianPrice = airbnb |>
group_by(accommodates = as.factor(accommodates)) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = as.factor(accommodates), y = median_price, fill = accommodates)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Accommodates", x = "Accommodates", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5))
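Both the mean and the median price rise overall as guest capacity increases, which indicates a broadly positive association between guest capacity and price.
Because the number of bathrooms takes only a small set of discrete values, we can analyze it in the same way as a categorical variable to understand its relationship with price.
First, we look at the distribution of the number of bathrooms: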
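The positive association read off the bar plots can also be summarized in a single number. The following is a minimal sketch using Spearman's rank correlation on made-up data (the `toy` data frame is hypothetical); the choice of Spearman over Pearson is an assumption made here because prices are skewed, and it is not a method the report itself uses.

```r
# Made-up listings: guest capacity and nightly price.
toy <- data.frame(
  accommodates = c(1, 2, 2, 3, 4, 4, 6, 8),
  price        = c(40, 65, 80, 90, 150, 120, 260, 300)
)

# Spearman's rho measures monotonic association; it works on ranks,
# so a few extreme prices carry less weight than in Pearson's r.
rho <- cor(toy$accommodates, toy$price, method = "spearman")
print(rho)
```

On the real data the same call would be cor(airbnb$accommodates, airbnb$price, method = "spearman").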
airbnb |> ggplot(aes(x = as.factor(bathrooms), fill = as.factor(bathrooms))) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Bathrooms", x = "Bathrooms", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
Next, we use a scatter plot and a box plot to examine the relationship between the number of bathrooms and price:
airbnb |> ggplot(aes(x = bathrooms, y = price)) +
geom_point(alpha = 0.5, color = "blue") +
theme_minimal() +
labs(title = "Scatter Plot of Bathrooms vs Price", x = "Bathrooms", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
airbnb |> ggplot(aes(x = as.factor(bathrooms), y = price, fill = as.factor(bathrooms))) +
geom_boxplot() +
theme_minimal() +
labs(title = "Box Plot of Bathrooms vs Price", x = "Bathrooms", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
We can also compare the mean and median price across bathroom counts:
meanPrice = airbnb |>
group_by(bathrooms = as.factor(bathrooms)) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = bathrooms, y = mean_price, fill = bathrooms)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Bathrooms", x = "Bathrooms", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5))
medianPrice = airbnb |>
group_by(bathrooms = as.factor(bathrooms)) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = bathrooms, y = median_price, fill = bathrooms)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Bathrooms", x = "Bathrooms", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5))
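Setting aside the eight-bathroom group, which contains a single listing, price rises overall with the number of bathrooms, which indicates a broadly positive association between the number of bathrooms and price.
First, we look at how listings are distributed across bed types: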
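Groups with very few listings, like the single eight-bathroom listing, are easier to spot when the group size is reported next to the mean and median. The following is a minimal base-R sketch on made-up data; the `toy` values and the threshold `min_n` are assumptions for illustration only.

```r
# Made-up listings; the 8-bathroom group has a single observation.
toy <- data.frame(
  bathrooms = c(1, 1, 1, 2, 2, 8),
  price     = c(80, 100, 120, 200, 240, 150)
)

# Per-group size, mean and median in one table
# (tapply orders groups by increasing bathroom count, matching sort(unique())).
by_bath <- data.frame(
  bathrooms    = sort(unique(toy$bathrooms)),
  n            = as.vector(tapply(toy$price, toy$bathrooms, length)),
  mean_price   = as.vector(tapply(toy$price, toy$bathrooms, mean)),
  median_price = as.vector(tapply(toy$price, toy$bathrooms, median))
)

# Keep only groups with at least `min_n` listings (threshold is an assumption).
min_n <- 2
reliable <- by_bath[by_bath$n >= min_n, ]
print(reliable)
```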
airbnb |> ggplot(aes(x = fct_infreq(bed_type), fill = bed_type)) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count), y = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Bed Type", x = "Bed Type", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
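After preprocessing, the vast majority of listings offer a real bed (Real Bed), and the other four bed types each account for a roughly similar number of listings. Next we look at the price distribution of each bed type, using box plots, violin plots, and faceted histograms: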
airbnb |> ggplot() +
geom_boxplot(aes(x = bed_type, y = price, fill = bed_type)) +
theme_minimal() +
labs(title = "Box Plot of Bed Type vs Price", x = "Bed Type", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot() +
geom_violin(aes(x = bed_type, y = price, fill = bed_type)) +
theme_minimal() +
labs(title = "Violin Plot of Bed Type vs Price", x = "Bed Type", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot(aes(x = price, fill = bed_type)) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ bed_type, scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Bed Type", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median price across bed types:
meanPrice = airbnb |>
group_by(bed_type) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(bed_type, mean_price), y = mean_price, fill = bed_type)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Bed Type", x = "Bed Type", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb |>
group_by(bed_type) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(bed_type, median_price), y = median_price, fill = bed_type)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Bed Type", x = "Bed Type", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
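Listings with a real bed charge relatively high prices, plausibly because real beds cost more to provide. Conversely, a couch doubles as common-area furniture, so hosts offering one need not buy separate sleeping furniture, which may explain why these listings have the lowest prices of the five bed types.
First, we look at how listings are distributed across cancellation policies: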
airbnb |> ggplot(aes(x = fct_infreq(cancellation_policy), fill = cancellation_policy)) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count), y = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Cancellation Policy", x = "Cancellation Policy", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
After preprocessing, the strict policy is the most common, followed by moderate and then flexible. Next we look at the price distribution under each cancellation policy, using box plots, violin plots, and faceted histograms:
airbnb |> ggplot() +
geom_boxplot(aes(x = cancellation_policy, y = price, fill = cancellation_policy)) +
theme_minimal() +
labs(title = "Box Plot of Cancellation Policy vs Price", x = "Cancellation Policy", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot() +
geom_violin(aes(x = cancellation_policy, y = price, fill = cancellation_policy)) +
theme_minimal() +
labs(title = "Violin Plot of Cancellation Policy vs Price", x = "Cancellation Policy", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot(aes(x = price, fill = cancellation_policy)) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ cancellation_policy, scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Cancellation Policy", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median price across cancellation policies:
meanPrice = airbnb |>
group_by(cancellation_policy) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(cancellation_policy, mean_price), y = mean_price, fill = cancellation_policy)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Cancellation Policy", x = "Cancellation Policy", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb |>
group_by(cancellation_policy) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(cancellation_policy, median_price), y = median_price, fill = cancellation_policy)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Cancellation Policy", x = "Cancellation Policy", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
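Interestingly, listings with a strict cancellation policy have the highest prices of the three, and listings with a flexible policy the lowest. This may reflect the influence of other variables: for example, hosts of more upscale listings may tend to adopt stricter policies, while hosts of budget listings may prefer more flexible ones.
First, we look at how listings are distributed by whether a cleaning fee is charged: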
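One way to probe whether another variable drives this gap is to compare prices within strata of that variable. The following sketch uses made-up listings deliberately constructed so that the pooled gap between policies disappears once room type is held fixed; it illustrates the technique only and says nothing about what the real data would show.

```r
# Made-up listings: strict policies are more common among entire homes.
toy <- data.frame(
  cancellation_policy = c("strict", "strict", "strict", "flexible",
                          "strict", "flexible", "flexible", "flexible"),
  room_type = c(rep("Entire home/apt", 4), rep("Private room", 4)),
  price     = c(200, 210, 190, 200, 60, 60, 55, 65)
)

# Pooled comparison: mean price by policy, ignoring room type.
pooled <- tapply(toy$price, toy$cancellation_policy, mean)

# Stratified comparison: mean price by policy within each room type.
stratified <- tapply(toy$price,
                     list(toy$room_type, toy$cancellation_policy), mean)
print(pooled)
print(stratified)
```

If the within-stratum gaps on the real data were much smaller than the pooled gap, that would support the confounding explanation offered above.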
airbnb |> ggplot(aes(x = fct_infreq(as.factor(cleaning_fee)), fill = as.factor(cleaning_fee))) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count), y = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Cleaning Fee", x = "Cleaning Fee", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
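After preprocessing, most hosts (over 70%) charge guests an additional cleaning fee (1), and only a little over 20% do not (0). Next we look at the price distribution for each case, using box plots, violin plots, and faceted histograms: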
airbnb |> ggplot() +
geom_boxplot(aes(x = as.factor(cleaning_fee), y = price, fill = as.factor(cleaning_fee))) +
theme_minimal() +
labs(title = "Box Plot of Cleaning Fee vs Price", x = "Cleaning Fee", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot() +
geom_violin(aes(x = as.factor(cleaning_fee), y = price, fill = as.factor(cleaning_fee))) +
theme_minimal() +
labs(title = "Violin Plot of Cleaning Fee vs Price", x = "Cleaning Fee", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot(aes(x = price, fill = as.factor(cleaning_fee))) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ as.factor(cleaning_fee), scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Cleaning Fee", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median price of listings with and without a cleaning fee:
meanPrice = airbnb |>
group_by(cleaning_fee) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(as.factor(cleaning_fee), mean_price), y = mean_price, fill = as.factor(cleaning_fee))) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Cleaning Fee", x = "Cleaning Fee", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb |>
group_by(cleaning_fee) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(as.factor(cleaning_fee), median_price), y = median_price, fill = as.factor(cleaning_fee))) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Cleaning Fee", x = "Cleaning Fee", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
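Interestingly, listings that charge a cleaning fee set higher prices than those that do not. This may again reflect other variables: for example, hosts of more upscale listings may be more inclined to charge a cleaning fee, while hosts of budget listings tend not to (perhaps cheaper places simply take less effort to clean?).
First, we look at how listings are distributed across cities: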
airbnb |> ggplot(aes(x = fct_infreq(city), fill = city)) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count), y = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of City", x = "City", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
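After preprocessing, New York City (NYC) has the most listings, followed by Los Angeles (LA), and the remaining four cities have similar counts. Next we look at the price distribution in each city, using box plots, violin plots, and faceted histograms: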
airbnb |> ggplot() +
geom_boxplot(aes(x = city, y = price, fill = city)) +
theme_minimal() +
labs(title = "Box Plot of City vs Price", x = "City", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot() +
geom_violin(aes(x = city, y = price, fill = city)) +
theme_minimal() +
labs(title = "Violin Plot of City vs Price", x = "City", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot(aes(x = price, fill = city)) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ city, scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by City", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median price across cities:
meanPrice = airbnb |>
group_by(city) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(city, mean_price), y = mean_price, fill = city)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by City", x = "City", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb |>
group_by(city) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(city, median_price), y = median_price, fill = city)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by City", x = "City", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
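San Francisco has the highest Airbnb prices of the six cities and Chicago the lowest. San Francisco's mean and median prices are both well above those of the other five cities, so it seems you need deep pockets to stay in San Francisco!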
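The mean and median summaries above can also be produced in a single pass. The following is a minimal sketch in base R; the data frame `toy` and its prices are made up for illustration and are not taken from the dataset:

```r
# Toy data: hypothetical prices for two cities (not from the Airbnb dataset)
toy <- data.frame(
  city  = c("SF", "SF", "SF", "Chicago", "Chicago", "Chicago"),
  price = c(200, 250, 600, 80, 100, 120)
)

# Compute the mean and the median price per city in one call
price_summary <- aggregate(
  price ~ city,
  data = toy,
  FUN = function(p) c(mean = mean(p), median = median(p))
)

# aggregate() returns a matrix column here; flatten it into ordinary columns
price_summary <- do.call(data.frame, price_summary)
price_summary
```

Keeping both statistics in one table makes it easier to see where the mean sits well above the median, which usually signals a right-skewed price distribution.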
First, we look at how listings are distributed by whether the Airbnb host's identity has been verified:
airbnb |> ggplot(aes(x = fct_infreq(as.factor(host_identity_verified)), fill = as.factor(host_identity_verified))) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count), y = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Host Identity Verified", x = "Host Identity Verified", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
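After preprocessing, more hosts in the dataset are verified (1) than not. Next, we look at how prices are distributed for listings with and without a verified host, using box plots, violin plots, and faceted histograms: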
airbnb |> ggplot() +
geom_boxplot(aes(x = as.factor(host_identity_verified), y = price, fill = as.factor(host_identity_verified))) +
theme_minimal() +
labs(title = "Box Plot of Host Identity Verified vs Price", x = "Host Identity Verified", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot() +
geom_violin(aes(x = as.factor(host_identity_verified), y = price, fill = as.factor(host_identity_verified))) +
theme_minimal() +
labs(title = "Violin Plot of Host Identity Verified vs Price", x = "Host Identity Verified", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot(aes(x = price, fill = as.factor(host_identity_verified))) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ as.factor(host_identity_verified), scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Host Identity Verified", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median prices of listings with and without a verified host:
meanPrice = airbnb |>
group_by(host_identity_verified) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(as.factor(host_identity_verified), mean_price), y = mean_price, fill = as.factor(host_identity_verified))) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Host Identity Verified", x = "Host Identity Verified", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb |>
group_by(host_identity_verified) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(as.factor(host_identity_verified), median_price), y = median_price, fill = as.factor(host_identity_verified))) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Host Identity Verified", x = "Host Identity Verified", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
Listings whose host is verified are priced slightly higher than listings whose host is not.
First, we use a scatter plot of host response rate against price to see how the data are distributed:
# Scatter Plot of Host Response Rate vs Price
airbnb |> ggplot(aes(x = host_response_rate, y = price)) +
geom_point(alpha = 0.5, color = "blue") +
theme_minimal() +
labs(title = "Scatter Plot of Host Response Rate vs Price", x = "Host Response Rate", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
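We group listings by host response rate into four categories, low (<25%), medium (25-50%), high (50-75%), and very high (>=75%), so that the relationship between response rate and price can be analyzed in the same way as for a categorical variable. We again use box plots, violin plots, and faceted histograms: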
airbnb_hrr = airbnb |>
mutate(response_rate_category = case_when(
host_response_rate < 0.25 ~ "low",
host_response_rate >= 0.25 & host_response_rate < 0.5 ~ "medium",
host_response_rate >= 0.5 & host_response_rate < 0.75 ~ "high",
host_response_rate >= 0.75 ~ "very high"
))
airbnb_hrr |> ggplot() +
geom_boxplot(aes(x = response_rate_category, y = price, fill = response_rate_category)) +
theme_minimal() +
labs(title = "Box Plot of Host Response Rate Category vs Price", x = "Host Response Rate Category", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb_hrr |> ggplot() +
geom_violin(aes(x = response_rate_category, y = price, fill = response_rate_category)) +
theme_minimal() +
labs(title = "Violin Plot of Host Response Rate Category vs Price", x = "Host Response Rate Category", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb_hrr |> ggplot(aes(x = price, fill = response_rate_category)) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ response_rate_category, scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Response Rate Category", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
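Because `response_rate_category` is a character column, ggplot2 arranges its values alphabetically (high, low, medium, very high) in the box, violin, and faceted plots above. One possible alternative, sketched below under the assumption that `host_response_rate` lies between 0 and 1, is base R's `cut()`, which returns a factor whose levels follow the order of the breaks. The vector `rate` is made up for illustration.

```r
# Hypothetical response rates on a 0-1 scale (not from the dataset)
rate <- c(0.10, 0.25, 0.60, 0.75, 1.00, NA)

# right = FALSE gives the intervals [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1];
# include.lowest = TRUE closes the last interval so that a rate of 1 is kept
category <- cut(
  rate,
  breaks = c(0, 0.25, 0.5, 0.75, 1),
  labels = c("low", "medium", "high", "very high"),
  right = FALSE,
  include.lowest = TRUE,
  ordered_result = TRUE
)
category
```

The boundaries match the `case_when()` rules (a rate of exactly 0.25 falls into medium, and missing rates stay `NA`), while the level order low < medium < high < very high carries over to plot axes and facets.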
We can also compare the mean and median prices across the host response rate groups:
meanPrice = airbnb_hrr |>
group_by(response_rate_category) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(response_rate_category, mean_price), y = mean_price, fill = response_rate_category)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Host Response Rate Category", x = "Host Response Rate Category", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb_hrr |>
group_by(response_rate_category) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(response_rate_category, median_price), y = median_price, fill = response_rate_category)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Host Response Rate Category", x = "Host Response Rate Category", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
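Listings in the group with the highest host response rate have the highest prices, but the other three groups rank slightly differently by mean price than by median price, and the differences among the four groups are small. This suggests that host response rate may be positively associated with price, but that the association is probably weak.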
First, we use a histogram to look at the distribution of the number of days since the host joined (host_since_days):
# Histogram of host_since_days (number of listings per bin)
airbnb |> ggplot(aes(x = host_since_days)) +
geom_histogram(bins = 30) +
theme_minimal() +
labs(title = "Distribution of Host Since Days", x = "Host Since Days", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
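Using the 25th, 50th, and 75th percentiles as cut points, we group listings by how long the host has been a host into four categories: new, slightly new, slightly old, and old:
# Compute the three quartile cut points used for grouping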
quantiles = quantile(airbnb$host_since_days, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
airbnb_hsd = airbnb |>
mutate(host_since_days_category = case_when(
host_since_days <= quantiles[1] ~ "new",
host_since_days > quantiles[1] & host_since_days <= quantiles[2] ~ "slightly new",
host_since_days > quantiles[2] & host_since_days <= quantiles[3] ~ "slightly old",
host_since_days > quantiles[3] ~ "old"
))
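With these categories, we can analyze the relationship between host tenure and price in the same way as for a categorical variable. We again use box plots, violin plots, and faceted histograms: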
airbnb_hsd |> ggplot() +
geom_boxplot(aes(x = host_since_days_category, y = price, fill = host_since_days_category)) +
theme_minimal() +
labs(title = "Box Plot of Host Since Days Category vs Price", x = "Host Since Days Category", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb_hsd |> ggplot() +
geom_violin(aes(x = host_since_days_category, y = price, fill = host_since_days_category)) +
theme_minimal() +
labs(title = "Violin Plot of Host Since Days Category vs Price", x = "Host Since Days Category", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb_hsd |> ggplot(aes(x = price, fill = host_since_days_category)) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ host_since_days_category, scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Host Since Days Category", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median prices across the host tenure groups:
meanPrice = airbnb_hsd |>
group_by(host_since_days_category) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(host_since_days_category, mean_price), y = mean_price, fill = host_since_days_category)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Host Since Days Category", x = "Host Since Days Category", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb_hsd |>
group_by(host_since_days_category) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(host_since_days_category, median_price), y = median_price, fill = host_since_days_category)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Host Since Days Category", x = "Host Since Days Category", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
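These group summaries indicate a fairly clear positive association between host tenure (the number of days since the host joined) and price.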
First, we look at how listings are distributed by whether they can be booked instantly:
airbnb |> ggplot(aes(x = fct_infreq(as.factor(instant_bookable)), fill = as.factor(instant_bookable))) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count), y = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Instant Bookable", x = "Instant Bookable", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
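After preprocessing, more listings in the dataset are not instantly bookable (0), meaning that a guest has to send a booking request and wait for the host to approve it. Next, we look at how prices are distributed for listings that are and are not instantly bookable, using box plots, violin plots, and faceted histograms: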
airbnb |> ggplot() +
geom_boxplot(aes(x = as.factor(instant_bookable), y = price, fill = as.factor(instant_bookable))) +
theme_minimal() +
labs(title = "Box Plot of Instant Bookable vs Price", x = "Instant Bookable", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot() +
geom_violin(aes(x = as.factor(instant_bookable), y = price, fill = as.factor(instant_bookable))) +
theme_minimal() +
labs(title = "Violin Plot of Instant Bookable vs Price", x = "Instant Bookable", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb |> ggplot(aes(x = price, fill = as.factor(instant_bookable))) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ as.factor(instant_bookable), scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Instant Bookable", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median prices of listings that are and are not instantly bookable:
meanPrice = airbnb |>
group_by(instant_bookable) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(as.factor(instant_bookable), mean_price), y = mean_price, fill = as.factor(instant_bookable))) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Instant Bookable", x = "Instant Bookable", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb |>
group_by(instant_bookable) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(as.factor(instant_bookable), median_price), y = median_price, fill = as.factor(instant_bookable))) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Instant Bookable", x = "Instant Bookable", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
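Interestingly, listings that are not instantly bookable are priced slightly higher than those that are! In other words, hosts who offer less booking flexibility also charge more. This suggests that other variables may be at work: for example, hosts of upscale, luxurious listings may be less willing to accept instant bookings, whereas hosts of budget listings may be more willing to.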
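One way to probe whether another variable drives such a gap is to compare prices within the levels of a suspected confounder. The sketch below uses made-up numbers and assumes room type as the candidate confounder, which the analysis above does not test; it only shows the mechanics of the comparison.

```r
# Hypothetical listings: expensive room types are rarely instantly bookable
toy <- data.frame(
  room_type        = c(rep("Entire home/apt", 4), rep("Private room", 4)),
  instant_bookable = c(0, 0, 0, 1, 0, 1, 1, 1),
  price            = c(200, 220, 180, 200, 80, 70, 80, 90)
)

# Overall comparison: non-instant listings look more expensive
overall <- aggregate(price ~ instant_bookable, data = toy, FUN = mean)

# Stratified comparison: within each room type the gap disappears
by_room <- aggregate(price ~ instant_bookable + room_type, data = toy, FUN = mean)

overall
by_room
```

In this toy example the overall gap comes entirely from the mix of room types in each group. Whether the same holds for the real data would have to be checked on the dataset itself, for instance through the room type dummies in the regression model.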
First, we use a histogram to look at the distribution of the number of reviews each listing has received:
# Histogram of number_of_reviews (number of listings per bin)
airbnb |> ggplot(aes(x = number_of_reviews)) +
geom_histogram(bins = 30) +
theme_minimal() +
labs(title = "Distribution of Number of Reviews", x = "Number of Reviews", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
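Using the 25th, 50th, and 75th percentiles as cut points, we group listings by the number of reviews received into four categories: unpopular, relatively unpopular, relatively popular, and popular:
# Compute the three quartile cut points used for grouping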
quantiles_reviews = quantile(airbnb$number_of_reviews, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
airbnb_reviews = airbnb |>
mutate(number_of_reviews_category = case_when(
number_of_reviews <= quantiles_reviews[1] ~ "unpopular",
number_of_reviews > quantiles_reviews[1] & number_of_reviews <= quantiles_reviews[2] ~ "relatively unpopular",
number_of_reviews > quantiles_reviews[2] & number_of_reviews <= quantiles_reviews[3] ~ "relatively popular",
number_of_reviews > quantiles_reviews[3] ~ "popular"
))
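Quartile cut points do not always produce four groups of equal size when a variable has many tied values, as review counts often do. A quick check is to tabulate the groups. The sketch below uses a made-up vector `reviews` and base R's `cut()` with the same interval rules as the `case_when()` call; it is an illustration, not a result from the dataset.

```r
# Hypothetical review counts with many ties at the low end
reviews <- c(1, 1, 1, 1, 2, 3, 5, 8, 20, 50, 120, 300)

# Quartile cut points, computed as in the grouping above
q <- quantile(reviews, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Same rule as case_when(): (-Inf, q1], (q1, q2], (q2, q3], (q3, Inf)
group <- cut(
  reviews,
  breaks = c(-Inf, q, Inf),
  labels = c("unpopular", "relatively unpopular", "relatively popular", "popular")
)

# Group sizes; with tied values they need not be equal
table(group)
```

If two quartiles coincide, `cut()` stops with an error because the breaks are not unique, whereas the `case_when()` version would silently produce an empty group, so it is worth inspecting the cut points either way.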
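With these categories, we can analyze the relationship between the number of reviews and price in the same way as for a categorical variable. We again use box plots, violin plots, and faceted histograms: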
airbnb_reviews |> ggplot() +
geom_boxplot(aes(x = number_of_reviews_category, y = price, fill = number_of_reviews_category)) +
theme_minimal() +
labs(title = "Box Plot of Number of Reviews Category vs Price", x = "Number of Reviews Category", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb_reviews |> ggplot() +
geom_violin(aes(x = number_of_reviews_category, y = price, fill = number_of_reviews_category)) +
theme_minimal() +
labs(title = "Violin Plot of Number of Reviews Category vs Price", x = "Number of Reviews Category", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb_reviews |> ggplot(aes(x = price, fill = number_of_reviews_category)) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ number_of_reviews_category, scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Number of Reviews Category", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median prices across the review-count groups:
meanPrice = airbnb_reviews |>
group_by(number_of_reviews_category) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(number_of_reviews_category, mean_price), y = mean_price, fill = number_of_reviews_category)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Number of Reviews Category", x = "Number of Reviews Category", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb_reviews |>
group_by(number_of_reviews_category) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(number_of_reviews_category, median_price), y = median_price, fill = number_of_reviews_category)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Number of Reviews Category", x = "Number of Reviews Category", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
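Interestingly, the two groups with fewer reviews have lower median prices than the two groups with more reviews, yet higher mean prices. Prices therefore seem to vary more within the low-review groups, where a few expensive listings pull the mean up without moving the median, and the relationship between review count and price is less direct.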
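The pattern of a lower median combined with a higher mean is what a few very expensive listings would produce. A minimal sketch with made-up prices (the vectors `few_reviews` and `many_reviews` are hypothetical):

```r
# Hypothetical prices: the first group contains one very expensive listing
few_reviews  <- c(60, 70, 80, 90, 1500)
many_reviews <- c(90, 100, 110, 120, 130)

summary_stats <- data.frame(
  group  = c("few reviews", "many reviews"),
  mean   = c(mean(few_reviews), mean(many_reviews)),
  median = c(median(few_reviews), median(many_reviews))
)
summary_stats
```

Here the first group has the higher mean (360 against 110) but the lower median (80 against 110), because the median ignores the size of the single extreme value.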
First, we use a histogram to look at the distribution of review score ratings:
# Histogram of review_scores_rating (number of listings per bin)
airbnb |> ggplot(aes(x = review_scores_rating)) +
geom_histogram(bins = 30) +
theme_minimal() +
labs(title = "Distribution of Review Scores Rating", x = "Review Scores Rating", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
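Using 85, 95, and 100 points as cut points (the 75th percentile of this variable is already 100), we group listings by review score rating into four categories: well-rated, relatively well-rated, not so well-rated, and not well-rated: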
airbnb_ratings = airbnb |>
mutate(review_scores_rating_category = case_when(
review_scores_rating < 85 ~ "not well-rated",
review_scores_rating >= 85 & review_scores_rating < 95 ~ "not so well-rated",
review_scores_rating >= 95 & review_scores_rating < 100 ~ "relatively well-rated",
review_scores_rating == 100 ~ "well-rated"
))
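With these categories, we can analyze the relationship between review score rating and price in the same way as for a categorical variable. We again use box plots, violin plots, and faceted histograms: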
airbnb_ratings |> ggplot() +
geom_boxplot(aes(x = review_scores_rating_category, y = price, fill = review_scores_rating_category)) +
theme_minimal() +
labs(title = "Box Plot of Review Scores Rating Category vs Price", x = "Review Scores Rating Category", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb_ratings |> ggplot() +
geom_violin(aes(x = review_scores_rating_category, y = price, fill = review_scores_rating_category)) +
theme_minimal() +
labs(title = "Violin Plot of Review Scores Rating Category vs Price", x = "Review Scores Rating Category", y = "Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
airbnb_ratings |> ggplot(aes(x = price, fill = review_scores_rating_category)) +
geom_histogram(bins = 30, alpha = 0.7) +
facet_wrap(~ review_scores_rating_category, scales = "free_y") +
theme_minimal() +
labs(title = "Distribution of House Prices by Review Scores Rating Category", x = "Price", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
We can also compare the mean and median prices across the review score rating groups:
meanPrice = airbnb_ratings |>
group_by(review_scores_rating_category) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = reorder(review_scores_rating_category, mean_price), y = mean_price, fill = review_scores_rating_category)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Review Scores Rating Category", x = "Review Scores Rating Category", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
medianPrice = airbnb_ratings |>
group_by(review_scores_rating_category) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = reorder(review_scores_rating_category, median_price), y = median_price, fill = review_scores_rating_category)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Review Scores Rating Category", x = "Review Scores Rating Category", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set3")
Both the mean and the median price tend to rise as the review score rating improves, so rating and price are, on the whole, positively associated.
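Because the number of bedrooms takes integer values, we can analyze its relationship with price in the same way as for a categorical variable.
First, we use a bar chart to look at the distribution of the number of bedrooms: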
airbnb |> ggplot(aes(x = as.factor(bedrooms), fill = as.factor(bedrooms))) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Bedrooms", x = "Bedrooms", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
Next, we use a scatter plot and a box plot to examine the relationship between the number of bedrooms and price:
airbnb |> ggplot(aes(x = bedrooms, y = price)) +
geom_point(alpha = 0.5, color = "blue") +
theme_minimal() +
labs(title = "Scatter Plot of Bedrooms vs Price", x = "Bedrooms", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
airbnb |> ggplot(aes(x = as.factor(bedrooms), y = price, fill = as.factor(bedrooms))) +
geom_boxplot() +
theme_minimal() +
labs(title = "Box Plot of Bedrooms vs Price", x = "Bedrooms", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
We can also compare the mean and median prices of listings with different numbers of bedrooms:
meanPrice = airbnb |>
group_by(bedrooms = as.factor(bedrooms)) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = bedrooms, y = mean_price, fill = bedrooms)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Bedrooms", x = "Bedrooms", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5))
medianPrice = airbnb |>
group_by(bedrooms = as.factor(bedrooms)) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = bedrooms, y = median_price, fill = bedrooms)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Bedrooms", x = "Bedrooms", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5))
Both the mean and the median price tend to rise with the number of bedrooms, so the number of bedrooms and price are, on the whole, positively associated.
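Because the number of beds takes integer values, we can analyze its relationship with price in the same way as for a categorical variable.
First, we use a bar chart to look at the distribution of the number of beds: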
airbnb |> ggplot(aes(x = as.factor(beds), fill = as.factor(beds))) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
theme_minimal() +
labs(title = "Distribution of Beds", x = "Beds", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
Next, we use a scatter plot and a box plot to examine the relationship between the number of beds and price:
airbnb |> ggplot(aes(x = beds, y = price)) +
geom_point(alpha = 0.5, color = "blue") +
theme_minimal() +
labs(title = "Scatter Plot of Beds vs Price", x = "Beds", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
airbnb |> ggplot(aes(x = as.factor(beds), y = price, fill = as.factor(beds))) +
geom_boxplot() +
theme_minimal() +
labs(title = "Box Plot of Beds vs Price", x = "Beds", y = "Price") +
theme(plot.title = element_text(hjust = 0.5))
We can also compare the mean and median prices of listings with different numbers of beds:
meanPrice = airbnb |>
group_by(beds = as.factor(beds)) |>
summarise(mean_price = mean(price))
meanPrice
meanPrice |> ggplot(aes(x = beds, y = mean_price, fill = beds)) +
geom_col() +
geom_text(aes(label = round(mean_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Mean Price by Beds", x = "Beds", y = "Mean Price") +
theme(plot.title = element_text(hjust = 0.5))
medianPrice = airbnb |>
group_by(beds = as.factor(beds)) |>
summarise(median_price = median(price))
medianPrice
medianPrice |> ggplot(aes(x = beds, y = median_price, fill = beds)) +
geom_col() +
geom_text(aes(label = round(median_price, 1)), vjust = -0.5) +
theme_minimal() +
labs(title = "Bar Plot of Median Price by Beds", x = "Beds", y = "Median Price") +
theme(plot.title = element_text(hjust = 0.5))
The number of beds and price are therefore, on the whole, positively associated.
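Having examined the variables one by one, we can also use a correlation matrix and a heatmap to get a more intuitive view of how the variables relate to one another: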
cor_matrix = airbnb |> select(-property_type, -room_type, -bed_type, -cancellation_policy, -city, -price) |> cor()
as.data.frame(cor_matrix)
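# Convert the correlation matrix to long format (melt() as provided by reshape2, which names the index columns Var1 and Var2)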
melted_cor_matrix = melt(cor_matrix)
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 10, hjust = 1)) +
coord_fixed()
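Because there are so many variables, it is hard to fit a heatmap in which the correlation of any given pair can be read off in the available space. From the overall pattern of colors, however, most cells are nearly white (no correlation) or only lightly tinted (weak correlation). This suggests that most explanatory variables are only weakly correlated with one another, so pairwise collinearity is not pronounced. This matters for the linear regression model built later: collinearity can make the regression coefficients unstable and, when it is exact, leaves some coefficients without a unique solution. Note that pairwise correlations cannot reveal an exact linear dependence among three or more columns, such as the dummy variables created from one categorical variable.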
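A common diagnostic that goes beyond pairwise correlations is the variance inflation factor (VIF), which for each predictor measures how well it is explained by all the other predictors together. The following is a minimal sketch on simulated data (the columns `x1`, `x2`, `x3` are made up; nothing here is computed from the Airbnb dataset):

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the rest
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

A VIF close to 1 indicates little collinearity, while values above roughly 5 to 10 are commonly treated as a warning sign. For a full set of dummy columns the auxiliary R squared equals 1 and the VIF is infinite, so one dummy per categorical variable has to be dropped first.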
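Besides the relationships among the explanatory variables, we can also check, before formal modeling, the correlation between each explanatory variable and the response (price). We read these values from the correlation matrix directly, because the heatmap is too hard to read, to get a first impression of how strongly each factor relates to price: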
cor_matrix2 = airbnb |> select(-property_type, -room_type, -bed_type, -cancellation_policy, -city) |> cor()
price_cor = cor_matrix2[,"price"] |> as.data.frame() |> filter(cor_matrix2[, "price"] != 1) |> t()
rownames(price_cor) = 'Correlation with Price'
price_cor
accommodates bathrooms cleaning_fee
Correlation with Price 0.5893545 0.520413 0.1187518
host_identity_verified host_response_rate
Correlation with Price 0.03356264 0.003354285
instant_bookable number_of_reviews review_scores_rating
Correlation with Price -0.03677136 -0.05007981 0.05698492
bedrooms beds Wireless_Internet Air_conditioning
Correlation with Price 0.5699765 0.5149489 0.04337985 0.02164841
Heating TV Elevator host_since_days
Correlation with Price 0.07475622 0.2067397 0.05864435 0.04775486
Apartment Condominium House Loft Townhouse
Correlation with Price -0.09902429 0.03970503 0.06843644 0.0513037 0.01477649
Entire_home_apt Private_room Shared_room Airbed
Correlation with Price 0.451614 -0.4195994 -0.1144436 -0.03489574
Couch Futon Pull_out_Sofa Real_Bed
Correlation with Price -0.02700377 -0.04681329 -0.03514521 0.0736182
flexible moderate strict Boston Chicago
Correlation with Price -0.1208172 -0.06707068 0.1559063 0.01885406 -0.05083778
DC LA NYC SF
Correlation with Price 0.001522238 -0.004652102 -0.04836868 0.1243781
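The output above lists the correlations in column order; ranking them by absolute value makes the strongest associations easier to spot. A minimal sketch on simulated data (the columns `accommodates` and `noise` are placeholders, not the real variables):

```r
set.seed(42)
n <- 100
toy <- data.frame(
  accommodates = rpois(n, 3) + 1,   # hypothetical capacity
  noise        = rnorm(n)           # unrelated to price by construction
)
toy$price <- 50 * toy$accommodates + rnorm(n, sd = 20)

# Correlation of every column with price, dropping price itself
cors <- cor(toy)[, "price"]
cors <- cors[names(cors) != "price"]

# Sort by strength of association, regardless of sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```

Applied to the real correlation matrix, the same ordering would put accommodates, bedrooms, bathrooms, and beds at the top, consistent with the values printed above.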
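Regression analysis is a statistical method that uses a mathematical model to describe how one or more independent (explanatory) variables affect a dependent (response) variable; it is used to explore and model relationships among variables. Using multiple linear regression, we build a model that predicts and explains price from the candidate price factors examined in the analysis above: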
# R dummy-codes factor variables automatically, but we already one-hot encoded them earlier, so we drop the original categorical columns here
airbnb_lm = airbnb |> select(-property_type, -room_type, -bed_type, -city, -cancellation_policy)
# Use every variable other than price as an explanatory variable
explanatory_vars = colnames(airbnb_lm)[!colnames(airbnb_lm) %in% c("price")]
# Build the formula passed to lm()
formula = as.formula(paste("price ~", paste(explanatory_vars, collapse = " + ")))
# Fit the multiple linear regression model
lm_model = lm(formula, data = airbnb)
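Pasting strings together is one way to build the model formula. Base R also offers `reformulate()` and the `.` shorthand, which should specify the same model. The sketch below checks this on a small made-up data frame (`toy` is hypothetical and merely stands in for `airbnb_lm`):

```r
# Hypothetical stand-in for the modelling data frame
toy <- data.frame(
  price        = c(100, 150, 90, 200, 120, 160),
  accommodates = c(2, 4, 2, 6, 3, 4),
  bathrooms    = c(1, 1, 1, 2, 1, 2)
)
explanatory_vars <- setdiff(colnames(toy), "price")

# Three ways to specify the same model
f_paste  <- as.formula(paste("price ~", paste(explanatory_vars, collapse = " + ")))
f_reform <- reformulate(explanatory_vars, response = "price")

fit_paste  <- lm(f_paste, data = toy)
fit_reform <- lm(f_reform, data = toy)
fit_dot    <- lm(price ~ ., data = toy)   # "." means every other column in data

all.equal(coef(fit_paste), coef(fit_reform))
all.equal(coef(fit_paste), coef(fit_dot))
```

The `.` form only works as intended when `data` contains nothing but the response and the intended predictors, which is why it would need `airbnb_lm` rather than `airbnb` in the analysis above.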
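With summary() and anova(), we can examine the fitted linear regression model in more detail, including the estimates, standard errors, t values, F values, and p-values for the intercept and each explanatory variable: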
lm_summary = summary(lm_model)
lm_summary
Call:
lm(formula = formula, data = airbnb)
Residuals:
Min 1Q Median 3Q Max
-571.56 -41.76 -6.12 28.27 1691.99
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.309e+02 8.611e+00 -15.206 < 2e-16 ***
accommodates 1.311e+01 4.167e-01 31.457 < 2e-16 ***
bathrooms 6.642e+01 1.014e+00 65.519 < 2e-16 ***
cleaning_fee -6.944e+00 1.219e+00 -5.697 1.23e-08 ***
host_identity_verified 2.667e-01 1.057e+00 0.252 0.800907
host_response_rate -1.591e+01 3.348e+00 -4.753 2.01e-06 ***
instant_bookable -5.438e+00 1.000e+00 -5.436 5.46e-08 ***
number_of_reviews -1.175e-01 1.083e-02 -10.850 < 2e-16 ***
review_scores_rating 9.814e-01 6.260e-02 15.678 < 2e-16 ***
bedrooms 3.554e+01 8.538e-01 41.627 < 2e-16 ***
beds -7.279e+00 6.582e-01 -11.059 < 2e-16 ***
Wireless_Internet -1.582e+00 3.157e+00 -0.501 0.616294
Air_conditioning -5.183e-01 1.296e+00 -0.400 0.689135
Heating -6.161e-01 1.937e+00 -0.318 0.750387
TV 1.345e+01 1.084e+00 12.402 < 2e-16 ***
Elevator 2.218e+01 1.152e+00 19.256 < 2e-16 ***
host_since_days 5.494e-03 7.246e-04 7.582 3.47e-14 ***
Apartment 4.203e+00 2.809e+00 1.496 0.134601
Condominium 1.661e+01 3.551e+00 4.676 2.93e-06 ***
House 1.055e+01 2.846e+00 3.706 0.000211 ***
Loft 4.305e+01 4.218e+00 10.206 < 2e-16 ***
Townhouse NA NA NA NA
Entire_home_apt 1.014e+02 3.076e+00 32.978 < 2e-16 ***
Private_room 3.114e+01 3.012e+00 10.336 < 2e-16 ***
Shared_room NA NA NA NA
Airbed 5.064e+00 6.039e+00 0.839 0.401732
Couch 7.281e+00 8.473e+00 0.859 0.390121
Futon 1.491e+00 4.371e+00 0.341 0.732949
Pull_out_Sofa 1.287e+00 4.777e+00 0.269 0.787702
Real_Bed NA NA NA NA
flexible -7.450e+00 1.249e+00 -5.963 2.50e-09 ***
moderate -7.644e+00 1.032e+00 -7.407 1.32e-13 ***
strict NA NA NA NA
Boston -4.491e+01 2.634e+00 -17.050 < 2e-16 ***
Chicago -9.401e+01 2.537e+00 -37.050 < 2e-16 ***
DC -6.849e+01 2.498e+00 -27.413 < 2e-16 ***
LA -6.810e+01 1.934e+00 -35.213 < 2e-16 ***
NYC -4.123e+01 2.015e+00 -20.467 < 2e-16 ***
SF NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 93.12 on 45507 degrees of freedom
Multiple R-squared: 0.5258, Adjusted R-squared: 0.5255
F-statistic: 1529 on 33 and 45507 DF, p-value: < 2.2e-16
# Run an ANOVA on the model
anova_model = anova(lm_model)
anova_model
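Once the model is fitted, the coefficient of each variable gives a direct reading of how that variable relates to price, including the direction and the size of the change.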
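Note that after one-hot encoding, one of the dummy variables generated from each categorical variable has no estimated coefficient: it is reported as NA in the summary above and is effectively fixed at zero. The coefficients of the other dummy variables measure the price difference relative to that dummy, which can be thought of as a baseline. For example, Townhouse is the baseline for property type and the coefficient of Loft is about 43, so the model estimates that, other things being equal, a Loft costs about 43 more (in the units of price) than a Townhouse.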
# Inspect the regression coefficient of each variable in the model
lm_coefficients = lm_model$coefficients
as.data.frame(lm_coefficients)
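The baseline behaviour of the dummy variables can be reproduced on a tiny example. In the sketch below the prices are made up, and the three dummy columns only mimic the one-hot encoded property types:

```r
# Toy data: hypothetical prices for three property types
toy <- data.frame(
  price     = c(100, 110, 150, 160, 90, 100),
  Loft      = c(0, 0, 1, 1, 0, 0),
  House     = c(1, 1, 0, 0, 0, 0),
  Townhouse = c(0, 0, 0, 0, 1, 1)
)

# The three dummies always sum to 1, which duplicates the intercept column,
# so lm() cannot estimate the last one and reports it as NA
fit <- lm(price ~ Loft + House + Townhouse, data = toy)
coef(fit)
```

The intercept equals the mean Townhouse price (95), and the Loft and House coefficients (60 and 10) are the differences between their group means and that baseline.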
One of the most common measures of how well a linear regression model fits is R squared, the share of the variance in the data that the model explains:
r_squared = lm_summary$r.squared
r_squared
[1] 0.5258475
R squared is only about 0.53, so the model leaves nearly half of the variance in price unexplained and needs further adjustment.
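To make the definition concrete, R squared can be recomputed by hand as 1 - SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares around the mean of the response. The sketch below uses R's built-in `mtcars` data purely as an illustration, not the Airbnb model:

```r
# Built-in dataset, used only for illustration
fit <- lm(mpg ~ wt + hp, data = mtcars)

ss_res <- sum(residuals(fit)^2)                    # variation left unexplained
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation in the response
r_squared_manual <- 1 - ss_res / ss_tot

# Matches the value reported by summary()
all.equal(r_squared_manual, summary(fit)$r.squared)
```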
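A scatter plot of residuals against fitted values lets us assess whether linear regression is appropriate and diagnose potential problems with the model.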
lm_fitted_values = lm_model$fitted.values
lm_residuals = lm_model$residuals
data = data.frame(Fitted = lm_fitted_values, Residuals = lm_residuals)
data |> ggplot() +
geom_point(aes(x = Fitted, y = Residuals)) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
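Ideally, the residuals are spread evenly across the fitted values with no visible pattern or trend. In our plot, however, the points form a rough funnel: the spread of the residuals grows as the fitted value increases. This means that the variance of the error term is not constant, which violates the homoscedasticity assumption of linear regression. A common remedy for such heteroscedasticity is to log-transform the response variable, so we log-transform price and fit the model again: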
# Build the formula passed to lm()
formula_log = as.formula(paste("log(price) ~", paste(explanatory_vars, collapse = " + ")))
# Fit the multiple linear regression model
lm_model_log = lm(formula_log, data = airbnb)
lm_log_summary = summary(lm_model_log)
r_squared_log = lm_log_summary$r.squared
r_squared_log
[1] 0.6390742
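The value above is an R squared for log(price), whereas the earlier 0.53 is for price itself, so the two are measured on different scales. One way to compare the models on common ground is to back-transform the log-model predictions and compute R squared on the price scale. The following is a sketch on simulated data (the variables `size` and `price` are made up), using a naive `exp()` back-transform without bias correction:

```r
# Simulated data with multiplicative noise (not the Airbnb data)
set.seed(123)
n     <- 500
size  <- runif(n, 1, 10)
price <- exp(3 + 0.2 * size + rnorm(n, sd = 0.4))
toy   <- data.frame(size, price)

fit_level <- lm(price ~ size, data = toy)
fit_log   <- lm(log(price) ~ size, data = toy)

# R squared of a set of predictions, measured on the original price scale
r2_price_scale <- function(pred, actual) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}

r2_level    <- r2_price_scale(fitted(fit_level), toy$price)
r2_log_back <- r2_price_scale(exp(fitted(fit_log)), toy$price)

c(level = r2_level, log_back = r2_log_back, log_scale = summary(fit_log)$r.squared)
```

The price-scale figure for the log model can be higher or lower than its log-scale R squared, which is why the two should not be mixed when judging whether a transformation improved the fit.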
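After log-transforming price, R squared rises from about 0.53 to about 0.64, although the two values describe different response variables (price and log(price)). We again plot the residuals against the fitted values for the new model to check whether the heteroscedasticity has been resolved: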
lm_log_fitted_values = lm_model_log$fitted.values
lm_log_residuals = lm_model_log$residuals
data_log = data.frame(Fitted = lm_log_fitted_values, Residuals = lm_log_residuals)
data_log |> ggplot() +
geom_point(aes(x = Fitted, y = Residuals)) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
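Residual plots are judged by eye, so a numerical check can be a useful complement. The sketch below implements a Breusch-Pagan style test in its studentized (Koenker) form with base R on simulated data; the variables `x` and `y` are made up, and the test is not part of the original analysis.

```r
set.seed(7)
n <- 300
x <- runif(n, 1, 10)
# Noise whose spread grows with x, which produces a funnel-shaped residual plot
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic: n times the auxiliary R squared, compared with a
# chi-squared distribution whose df equals the number of predictors (1 here)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(statistic = lm_stat, p_value = p_value)
```

A small p-value points to non-constant error variance. With tens of thousands of observations, as in this dataset, even mild heteroscedasticity tends to be flagged, so the plot and the test are best read together.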
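After the log transformation, the points in the residual plot show no systematic pattern, which suggests that the heteroscedasticity has been resolved. An R squared of 0.64 is still not ideal, however, so we take further steps to improve the model. Here we use stepwise regression for feature selection, a technique that adds or removes features one step at a time in search of a better regression model (by default, step() compares candidate models by AIC):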
# Use stepwise regression for feature selection
step_model = step(lm_model_log, direction = "both")
Start: AIC=-82337.4
log(price) ~ accommodates + bathrooms + cleaning_fee + host_identity_verified +
host_response_rate + instant_bookable + number_of_reviews +
review_scores_rating + bedrooms + beds + Wireless_Internet +
Air_conditioning + Heating + TV + Elevator + host_since_days +
Apartment + Condominium + House + Loft + Townhouse + Entire_home_apt +
Private_room + Shared_room + Airbed + Couch + Futon + Pull_out_Sofa +
Real_Bed + flexible + moderate + strict + Boston + Chicago +
DC + LA + NYC + SF
Step: AIC=-82337.4
log(price) ~ accommodates + bathrooms + cleaning_fee + host_identity_verified +
host_response_rate + instant_bookable + number_of_reviews +
review_scores_rating + bedrooms + beds + Wireless_Internet +
Air_conditioning + Heating + TV + Elevator + host_since_days +
Apartment + Condominium + House + Loft + Townhouse + Entire_home_apt +
Private_room + Shared_room + Airbed + Couch + Futon + Pull_out_Sofa +
Real_Bed + flexible + moderate + strict + Boston + Chicago +
DC + LA + NYC
Step: AIC=-82337.4
log(price) ~ accommodates + bathrooms + cleaning_fee + host_identity_verified +
host_response_rate + instant_bookable + number_of_reviews +
review_scores_rating + bedrooms + beds + Wireless_Internet +
Air_conditioning + Heating + TV + Elevator + host_since_days +
Apartment + Condominium + House + Loft + Townhouse + Entire_home_apt +
Private_room + Shared_room + Airbed + Couch + Futon + Pull_out_Sofa +
Real_Bed + flexible + moderate + Boston + Chicago + DC +
LA + NYC
Step: AIC=-82337.4
log(price) ~ accommodates + bathrooms + cleaning_fee + host_identity_verified +
host_response_rate + instant_bookable + number_of_reviews +
review_scores_rating + bedrooms + beds + Wireless_Internet +
Air_conditioning + Heating + TV + Elevator + host_since_days +
Apartment + Condominium + House + Loft + Townhouse + Entire_home_apt +
Private_room + Shared_room + Airbed + Couch + Futon + Pull_out_Sofa +
flexible + moderate + Boston + Chicago + DC + LA + NYC
Step: AIC=-82337.4
log(price) ~ accommodates + bathrooms + cleaning_fee + host_identity_verified +
host_response_rate + instant_bookable + number_of_reviews +
review_scores_rating + bedrooms + beds + Wireless_Internet +
Air_conditioning + Heating + TV + Elevator + host_since_days +
Apartment + Condominium + House + Loft + Townhouse + Entire_home_apt +
Private_room + Airbed + Couch + Futon + Pull_out_Sofa + flexible +
moderate + Boston + Chicago + DC + LA + NYC
Step: AIC=-82337.4
log(price) ~ accommodates + bathrooms + cleaning_fee + host_identity_verified +
host_response_rate + instant_bookable + number_of_reviews +
review_scores_rating + bedrooms + beds + Wireless_Internet +
Air_conditioning + Heating + TV + Elevator + host_since_days +
Apartment + Condominium + House + Loft + Entire_home_apt +
Private_room + Airbed + Couch + Futon + Pull_out_Sofa + flexible +
moderate + Boston + Chicago + DC + LA + NYC
Df Sum of Sq RSS AIC
- Wireless_Internet 1 0.00 7456.9 -82339
- Pull_out_Sofa 1 0.00 7456.9 -82339
- host_identity_verified 1 0.08 7456.9 -82339
- Apartment 1 0.21 7457.1 -82338
- Couch 1 0.26 7457.1 -82338
<none> 7456.9 -82337
- cleaning_fee 1 0.35 7457.2 -82337
- House 1 0.66 7457.5 -82335
- Heating 1 1.24 7458.1 -82332
- Airbed 1 1.27 7458.1 -82332
- Futon 1 1.98 7458.9 -82327
- Condominium 1 3.09 7460.0 -82321
- Air_conditioning 1 3.98 7460.9 -82315
- flexible 1 6.36 7463.2 -82301
- moderate 1 6.49 7463.4 -82300
- host_response_rate 1 6.59 7463.5 -82299
- number_of_reviews 1 6.68 7463.6 -82299
- instant_bookable 1 8.41 7465.3 -82288
- Loft 1 9.67 7466.5 -82280
- beds 1 25.23 7482.1 -82186
- host_since_days 1 46.90 7503.8 -82054
- review_scores_rating 1 73.79 7530.7 -81891
- TV 1 100.15 7557.0 -81732
- Boston 1 104.20 7561.1 -81707
- Elevator 1 126.14 7583.0 -81575
- bathrooms 1 189.18 7646.1 -81198
- NYC 1 208.69 7665.6 -81082
- Private_room 1 210.28 7667.2 -81073
- accommodates 1 230.14 7687.0 -80955
- DC 1 256.28 7713.2 -80801
- bedrooms 1 279.78 7736.6 -80662
- LA 1 477.55 7934.4 -79512
- Chicago 1 520.60 7977.5 -79266
- Entire_home_apt 1 1043.88 8500.7 -76373
Step: AIC=-82339.4
log(price) ~ accommodates + bathrooms + cleaning_fee + host_identity_verified +
host_response_rate + instant_bookable + number_of_reviews +
review_scores_rating + bedrooms + beds + Air_conditioning +
Heating + TV + Elevator + host_since_days + Apartment + Condominium +
House + Loft + Entire_home_apt + Private_room + Airbed +
Couch + Futon + Pull_out_Sofa + flexible + moderate + Boston +
Chicago + DC + LA + NYC
Df Sum of Sq RSS AIC
- Pull_out_Sofa 1 0.00 7456.9 -82341
- host_identity_verified 1 0.08 7456.9 -82341
- Apartment 1 0.21 7457.1 -82340
- Couch 1 0.26 7457.1 -82340
<none> 7456.9 -82339
- cleaning_fee 1 0.35 7457.2 -82339
+ Wireless_Internet 1 0.00 7456.9 -82337
- House 1 0.66 7457.5 -82337
- Airbed 1 1.27 7458.1 -82334
- Heating 1 1.28 7458.2 -82334
- Futon 1 1.98 7458.9 -82329
- Condominium 1 3.09 7460.0 -82323
- Air_conditioning 1 4.01 7460.9 -82317
- flexible 1 6.37 7463.2 -82303
- moderate 1 6.49 7463.4 -82302
- host_response_rate 1 6.60 7463.5 -82301
- number_of_reviews 1 6.69 7463.6 -82301
- instant_bookable 1 8.41 7465.3 -82290
- Loft 1 9.67 7466.5 -82282
- beds 1 25.23 7482.1 -82188
- host_since_days 1 46.92 7503.8 -82056
- review_scores_rating 1 73.88 7530.8 -81892
- TV 1 101.05 7557.9 -81728
- Boston 1 104.32 7561.2 -81709
- Elevator 1 126.14 7583.0 -81577
- bathrooms 1 189.19 7646.1 -81200
- NYC 1 208.95 7665.8 -81083
- Private_room 1 210.51 7667.4 -81074
- accommodates 1 230.16 7687.0 -80957
- DC 1 256.76 7713.6 -80800
- bedrooms 1 279.79 7736.7 -80664
- LA 1 478.24 7935.1 -79510
- Chicago 1 522.26 7979.1 -79259
- Entire_home_apt 1 1044.44 8501.3 -76372
Step: AIC=-82341.39
log(price) ~ accommodates + bathrooms + cleaning_fee + host_identity_verified +
host_response_rate + instant_bookable + number_of_reviews +
review_scores_rating + bedrooms + beds + Air_conditioning +
Heating + TV + Elevator + host_since_days + Apartment + Condominium +
House + Loft + Entire_home_apt + Private_room + Airbed +
Couch + Futon + flexible + moderate + Boston + Chicago +
DC + LA + NYC
Df Sum of Sq RSS AIC
- host_identity_verified 1 0.08 7456.9 -82343
- Apartment 1 0.21 7457.1 -82342
- Couch 1 0.26 7457.1 -82342
<none> 7456.9 -82341
- cleaning_fee 1 0.35 7457.2 -82341
+ Real_Bed 1 0.00 7456.9 -82339
+ Pull_out_Sofa 1 0.00 7456.9 -82339
+ Wireless_Internet 1 0.00 7456.9 -82339
- House 1 0.66 7457.5 -82339
- Airbed 1 1.27 7458.1 -82336
- Heating 1 1.28 7458.2 -82336
- Futon 1 1.98 7458.9 -82331
- Condominium 1 3.09 7460.0 -82325
- Air_conditioning 1 4.01 7460.9 -82319
- flexible 1 6.37 7463.2 -82305
- moderate 1 6.49 7463.4 -82304
- host_response_rate 1 6.60 7463.5 -82303
- number_of_reviews 1 6.69 7463.6 -82303
- instant_bookable 1 8.41 7465.3 -82292
- Loft 1 9.67 7466.5 -82284
- beds 1 25.23 7482.1 -82190
- host_since_days 1 46.93 7503.8 -82058
- review_scores_rating 1 73.88 7530.8 -81894
- TV 1 101.05 7557.9 -81730
- Boston 1 104.32 7561.2 -81711
- Elevator 1 126.15 7583.0 -81579
- bathrooms 1 189.19 7646.1 -81202
- NYC 1 208.95 7665.8 -81085
- Private_room 1 212.64 7669.5 -81063
- accommodates 1 230.16 7687.0 -80959
- DC 1 256.77 7713.6 -80802
- bedrooms 1 279.85 7736.7 -80666
- LA 1 478.25 7935.1 -79512
- Chicago 1 522.28 7979.2 -79260
- Entire_home_apt 1 1055.57 8512.4 -76314
Step: AIC=-82342.92
log(price) ~ accommodates + bathrooms + cleaning_fee + host_response_rate +
instant_bookable + number_of_reviews + review_scores_rating +
bedrooms + beds + Air_conditioning + Heating + TV + Elevator +
host_since_days + Apartment + Condominium + House + Loft +
Entire_home_apt + Private_room + Airbed + Couch + Futon +
flexible + moderate + Boston + Chicago + DC + LA + NYC
Df Sum of Sq RSS AIC
- Apartment 1 0.20 7457.2 -82344
- Couch 1 0.26 7457.2 -82343
<none> 7456.9 -82343
- cleaning_fee 1 0.38 7457.3 -82343
+ host_identity_verified 1 0.08 7456.9 -82341
+ Pull_out_Sofa 1 0.00 7456.9 -82341
+ Real_Bed 1 0.00 7456.9 -82341
+ Wireless_Internet 1 0.00 7456.9 -82341
- House 1 0.66 7457.6 -82341
- Heating 1 1.27 7458.2 -82337
- Airbed 1 1.27 7458.2 -82337
- Futon 1 1.98 7458.9 -82333
- Condominium 1 3.10 7460.1 -82326
- Air_conditioning 1 3.99 7460.9 -82321
- flexible 1 6.32 7463.3 -82306
- moderate 1 6.49 7463.4 -82305
- host_response_rate 1 6.64 7463.6 -82304
- number_of_reviews 1 6.79 7463.7 -82303
- instant_bookable 1 8.35 7465.3 -82294
- Loft 1 9.68 7466.6 -82286
- beds 1 25.23 7482.2 -82191
- host_since_days 1 49.59 7506.5 -82043
- review_scores_rating 1 73.81 7530.8 -81896
- TV 1 101.04 7558.0 -81732
- Boston 1 104.25 7561.2 -81713
- Elevator 1 126.10 7583.0 -81581
- bathrooms 1 189.21 7646.2 -81204
- NYC 1 208.87 7665.8 -81087
- Private_room 1 212.65 7669.6 -81064
- accommodates 1 230.09 7687.0 -80961
- DC 1 257.02 7714.0 -80802
- bedrooms 1 279.99 7736.9 -80666
- LA 1 479.01 7936.0 -79510
- Chicago 1 523.52 7980.5 -79255
- Entire_home_apt 1 1055.77 8512.7 -76315
Step: AIC=-82343.68
log(price) ~ accommodates + bathrooms + cleaning_fee + host_response_rate +
instant_bookable + number_of_reviews + review_scores_rating +
bedrooms + beds + Air_conditioning + Heating + TV + Elevator +
host_since_days + Condominium + House + Loft + Entire_home_apt +
Private_room + Airbed + Couch + Futon + flexible + moderate +
Boston + Chicago + DC + LA + NYC
Df Sum of Sq RSS AIC
- Couch 1 0.26 7457.4 -82344
<none> 7457.2 -82344
- cleaning_fee 1 0.37 7457.5 -82343
+ Apartment 1 0.20 7456.9 -82343
+ Townhouse 1 0.20 7456.9 -82343
+ host_identity_verified 1 0.07 7457.1 -82342
+ Real_Bed 1 0.00 7457.2 -82342
+ Pull_out_Sofa 1 0.00 7457.2 -82342
+ Wireless_Internet 1 0.00 7457.2 -82342
- House 1 0.92 7458.1 -82340
- Airbed 1 1.27 7458.4 -82338
- Heating 1 1.28 7458.4 -82338
- Futon 1 1.97 7459.1 -82334
- Air_conditioning 1 4.02 7461.2 -82321
- flexible 1 6.33 7463.5 -82307
- moderate 1 6.50 7463.6 -82306
- host_response_rate 1 6.61 7463.8 -82305
- number_of_reviews 1 6.82 7464.0 -82304
- instant_bookable 1 8.35 7465.5 -82295
- Condominium 1 10.23 7467.4 -82283
- Loft 1 19.80 7477.0 -82225
- beds 1 25.23 7482.4 -82192
- host_since_days 1 49.45 7506.6 -82045
- review_scores_rating 1 74.03 7531.2 -81896
- TV 1 101.12 7558.3 -81732
- Boston 1 104.29 7561.4 -81713
- Elevator 1 126.57 7583.7 -81579
- bathrooms 1 191.65 7648.8 -81190
- NYC 1 208.93 7666.1 -81087
- Private_room 1 213.20 7670.4 -81062
- accommodates 1 230.14 7687.3 -80961
- DC 1 257.05 7714.2 -80802
- bedrooms 1 280.52 7737.7 -80664
- LA 1 478.88 7936.0 -79511
- Chicago 1 523.82 7981.0 -79254
- Entire_home_apt 1 1055.67 8512.8 -76316
Step: AIC=-82344.11
log(price) ~ accommodates + bathrooms + cleaning_fee + host_response_rate +
instant_bookable + number_of_reviews + review_scores_rating +
bedrooms + beds + Air_conditioning + Heating + TV + Elevator +
host_since_days + Condominium + House + Loft + Entire_home_apt +
Private_room + Airbed + Futon + flexible + moderate + Boston +
Chicago + DC + LA + NYC
Df Sum of Sq RSS AIC
<none> 7457.4 -82344
- cleaning_fee 1 0.37 7457.8 -82344
+ Couch 1 0.26 7457.2 -82344
+ Apartment 1 0.20 7457.2 -82343
+ Townhouse 1 0.20 7457.2 -82343
+ host_identity_verified 1 0.07 7457.3 -82343
+ Real_Bed 1 0.07 7457.3 -82343
+ Wireless_Internet 1 0.00 7457.4 -82342
+ Pull_out_Sofa 1 0.00 7457.4 -82342
- House 1 0.90 7458.3 -82341
- Airbed 1 1.24 7458.7 -82339
- Heating 1 1.28 7458.7 -82338
- Futon 1 1.94 7459.4 -82334
- Air_conditioning 1 4.03 7461.4 -82322
- flexible 1 6.35 7463.8 -82307
- moderate 1 6.50 7463.9 -82306
- host_response_rate 1 6.59 7464.0 -82306
- number_of_reviews 1 6.81 7464.2 -82305
- instant_bookable 1 8.33 7465.7 -82295
- Condominium 1 10.23 7467.6 -82284
- Loft 1 19.79 7477.2 -82225
- beds 1 25.17 7482.6 -82193
- host_since_days 1 49.40 7506.8 -82045
- review_scores_rating 1 73.98 7531.4 -81897
- TV 1 101.07 7558.5 -81733
- Boston 1 104.24 7561.7 -81714
- Elevator 1 126.46 7583.9 -81580
- bathrooms 1 191.87 7649.3 -81189
- NYC 1 208.84 7666.3 -81088
- Private_room 1 223.63 7681.0 -81001
- accommodates 1 230.09 7687.5 -80962
- DC 1 257.14 7714.5 -80802
- bedrooms 1 280.34 7737.7 -80666
- LA 1 478.88 7936.3 -79512
- Chicago 1 523.84 7981.2 -79254
- Entire_home_apt 1 1099.43 8556.8 -76083
# Inspect the selected model
summary(step_model)
Call:
lm(formula = log(price) ~ accommodates + bathrooms + cleaning_fee +
host_response_rate + instant_bookable + number_of_reviews +
review_scores_rating + bedrooms + beds + Air_conditioning +
Heating + TV + Elevator + host_since_days + Condominium +
House + Loft + Entire_home_apt + Private_room + Airbed +
Futon + flexible + moderate + Boston + Chicago + DC + LA +
NYC, data = airbnb)
Residuals:
Min 1Q Median 3Q Max
-3.9054 -0.2632 -0.0152 0.2482 2.9841
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.992e+00 3.359e-02 89.074 < 2e-16 ***
accommodates 6.786e-02 1.811e-03 37.473 < 2e-16 ***
bathrooms 1.502e-01 4.390e-03 34.220 < 2e-16 ***
cleaning_fee -7.949e-03 5.284e-03 -1.504 0.132483
host_response_rate -9.224e-02 1.454e-02 -6.344 2.26e-10 ***
instant_bookable -3.095e-02 4.341e-03 -7.130 1.02e-12 ***
number_of_reviews -3.029e-04 4.697e-05 -6.449 1.14e-10 ***
review_scores_rating 5.773e-03 2.717e-04 21.248 < 2e-16 ***
bedrooms 1.534e-01 3.709e-03 41.363 < 2e-16 ***
beds -3.545e-02 2.861e-03 -12.393 < 2e-16 ***
Air_conditioning 2.782e-02 5.613e-03 4.957 7.19e-07 ***
Heating 2.315e-02 8.273e-03 2.798 0.005149 **
TV 1.165e-01 4.693e-03 24.836 < 2e-16 ***
Elevator 1.382e-01 4.974e-03 27.781 < 2e-16 ***
host_since_days 5.258e-05 3.028e-06 17.363 < 2e-16 ***
Condominium 8.014e-02 1.014e-02 7.903 2.79e-15 ***
House -1.210e-02 5.157e-03 -2.347 0.018932 *
Loft 1.541e-01 1.402e-02 10.989 < 2e-16 ***
Entire_home_apt 1.070e+00 1.307e-02 81.913 < 2e-16 ***
Private_room 4.726e-01 1.279e-02 36.943 < 2e-16 ***
Airbed -7.224e-02 2.624e-02 -2.754 0.005897 **
Futon -6.542e-02 1.899e-02 -3.445 0.000572 ***
flexible -3.376e-02 5.423e-03 -6.226 4.81e-10 ***
moderate -2.825e-02 4.485e-03 -6.298 3.04e-10 ***
Boston -2.885e-01 1.144e-02 -25.223 < 2e-16 ***
Chicago -6.221e-01 1.100e-02 -56.542 < 2e-16 ***
DC -4.290e-01 1.083e-02 -39.614 < 2e-16 ***
LA -4.537e-01 8.393e-03 -54.061 < 2e-16 ***
NYC -3.124e-01 8.749e-03 -35.701 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4048 on 45512 degrees of freedom
Multiple R-squared: 0.639, Adjusted R-squared: 0.6388
F-statistic: 2878 on 28 and 45512 DF, p-value: < 2.2e-16
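As a cross-check on the trace above, the AIC that step() prints for a linear model can be reproduced from the residual sum of squares. step() relies on extractAIC(), which for lm computes n·log(RSS/n) + 2·edf. The sketch below plugs in the rounded values printed for the final model (RSS 7457.4, 28 predictors plus an intercept, 45512 residual degrees of freedom). The helper name step_aic is an illustrative assumption and not part of the original analysis.

```r
# Reproduce the AIC printed by step() from values shown in the output above.
# step_aic() is an illustrative helper, not a function from the report.
step_aic <- function(rss, n, edf, k = 2) {
  n * log(rss / n) + k * edf
}

edf <- 28 + 1        # 28 predictors plus the intercept
n   <- 45512 + edf   # residual degrees of freedom + number of estimated coefficients
rss <- 7457.4        # RSS of the final model as printed (rounded to one decimal)

aic_final <- step_aic(rss, n, edf)
print(round(aic_final, 1))   # should land close to the reported -82344.11
```

Because the printed RSS is rounded, the result only agrees with the trace to within a few tenths. It also shows why dropping a term with a near-zero sum of squares lowers the AIC by about 2: the penalty term shrinks while the RSS barely moves.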
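The best model found by stepwise regression has exactly the same R squared as the original model (0.639), so stepwise selection did not improve the fit. We therefore turn to PCA (principal component analysis) to derive a reduced set of features and build a new linear regression model: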
# Run the PCA
# prcomp() performs principal component analysis; center = TRUE and scale. = TRUE center and standardize each column
pca_result = prcomp(airbnb_lm, center = TRUE, scale. = TRUE)
# View a summary of the PCA result
summary(pca_result)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.1165 1.63838 1.42598 1.37301 1.30768 1.28733 1.23119
Proportion of Variance 0.1149 0.06883 0.05214 0.04834 0.04385 0.04249 0.03887
Cumulative Proportion 0.1149 0.18369 0.23583 0.28416 0.32801 0.37050 0.40937
PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 1.16633 1.14799 1.09642 1.0817 1.06022 1.05528 1.03104
Proportion of Variance 0.03488 0.03379 0.03082 0.0300 0.02882 0.02855 0.02726
Cumulative Proportion 0.44425 0.47804 0.50887 0.5389 0.56769 0.59625 0.62351
PC15 PC16 PC17 PC18 PC19 PC20 PC21
Standard deviation 1.02528 1.00639 1.00430 0.99998 0.99115 0.98116 0.95274
Proportion of Variance 0.02695 0.02597 0.02586 0.02564 0.02519 0.02468 0.02327
Cumulative Proportion 0.65046 0.67643 0.70229 0.72793 0.75312 0.77780 0.80108
PC22 PC23 PC24 PC25 PC26 PC27 PC28
Standard deviation 0.93442 0.92393 0.88341 0.87590 0.86319 0.84729 0.81709
Proportion of Variance 0.02239 0.02189 0.02001 0.01967 0.01911 0.01841 0.01712
Cumulative Proportion 0.82347 0.84536 0.86537 0.88504 0.90414 0.92255 0.93967
PC29 PC30 PC31 PC32 PC33 PC34
Standard deviation 0.79126 0.72924 0.66066 0.60248 0.50016 0.38130
Proportion of Variance 0.01605 0.01364 0.01119 0.00931 0.00641 0.00373
Cumulative Proportion 0.95572 0.96936 0.98055 0.98986 0.99627 1.00000
PC35 PC36 PC37 PC38 PC39
Standard deviation 8.914e-14 2.755e-14 2.18e-14 1.186e-14 5.396e-15
Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.000e+00
Cumulative Proportion 1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.000e+00
# View the principal component scores
pca_scores = as.data.frame(pca_result$x)
pca_scores
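# Keep the principal components that together explain the first 90% of the total variance
# Extract the proportion of total variance explained by each component
explained_variance = summary(pca_result)$importance[2,]
# Compute the cumulative proportion of variance
cumulative_variance = cumsum(explained_variance)
# Find the number of components at which the cumulative proportion first reaches 90%
num_components = which(cumulative_variance >= 0.90)[1]
# Extract the scores of the retained components
# Select the first num_components score columns (the components covering the first 90% of total variance)
selected_pca_scores = pca_scores[, 1:num_components]
# Create a new data frame holding the retained component scores and the price we want to predict
pca_data = cbind(selected_pca_scores, price = airbnb$price)
# Fit the model with the retained components as independent variables and price as the dependent variable
pca_lm_model = lm(price ~ ., data = pca_data)
# View the regression coefficient of each principal component
pca_lm_coefficients = pca_lm_model$coefficients
as.data.frame(pca_lm_coefficients)
# View the model summary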
pca_lm_summary = summary(pca_lm_model)
pca_lm_summary
Call:
lm(formula = price ~ ., data = pca_data)
Residuals:
Min 1Q Median 3Q Max
-543.79 -35.66 -2.85 29.34 1326.45
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 148.9985 0.3584 415.701 < 2e-16 ***
PC1 -47.1932 0.1694 -278.670 < 2e-16 ***
PC2 4.8093 0.2188 21.983 < 2e-16 ***
PC3 -6.3280 0.2514 -25.175 < 2e-16 ***
PC4 7.3036 0.2611 27.977 < 2e-16 ***
PC5 -1.7597 0.2741 -6.420 1.38e-10 ***
PC6 -10.5581 0.2784 -37.920 < 2e-16 ***
PC7 -23.8653 0.2911 -81.976 < 2e-16 ***
PC8 2.9815 0.3073 9.702 < 2e-16 ***
PC9 13.3395 0.3122 42.724 < 2e-16 ***
PC10 16.7850 0.3269 51.345 < 2e-16 ***
PC11 -11.7673 0.3313 -35.514 < 2e-16 ***
PC12 2.5252 0.3381 7.469 8.20e-14 ***
PC13 0.2424 0.3397 0.714 0.475401
PC14 -1.4997 0.3476 -4.314 1.61e-05 ***
PC15 -8.0645 0.3496 -23.068 < 2e-16 ***
PC16 4.6437 0.3562 13.038 < 2e-16 ***
PC17 11.1627 0.3569 31.277 < 2e-16 ***
PC18 -0.9601 0.3584 -2.678 0.007399 **
PC19 1.0013 0.3616 2.769 0.005628 **
PC20 9.9119 0.3653 27.133 < 2e-16 ***
PC21 2.9164 0.3762 7.752 9.24e-15 ***
PC22 -4.1771 0.3836 -10.890 < 2e-16 ***
PC23 -7.3696 0.3879 -18.997 < 2e-16 ***
PC24 1.3743 0.4057 3.387 0.000707 ***
PC25 2.2805 0.4092 5.573 2.52e-08 ***
PC26 -3.8561 0.4152 -9.286 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 76.49 on 45514 degrees of freedom
Multiple R-squared: 0.6801, Adjusted R-squared: 0.6799
F-statistic: 3721 on 26 and 45514 DF, p-value: < 2.2e-16
# View the model's R squared
pca_r_squared = pca_lm_summary$r.squared
pca_r_squared
[1] 0.6800685
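After running PCA and modelling on the principal components that cover the first 90% of the total variance (26 components), the R squared of our linear regression model rises to 0.68. The two figures are not strictly comparable, however: this model is fitted to price rather than log(price), and the PCA was run on airbnb_lm, which still contains the price column, so the components carry some information about the target. Even with this apparent improvement, the accuracy remains unsatisfactory and the model explains only a limited share of the variation in price. This reflects inherent limits of linear regression: its assumptions are simple, it cannot capture nonlinear relationships between the independent and dependent variables, and it is affected by whatever collinearity exists among the independent variables. We therefore move on to more complex machine learning algorithms to try to capture the relationship between price and these potential drivers.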
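A minimal sketch of the same 90% rule applied to the predictor columns only, so that the target never enters the PCA. It uses simulated data rather than airbnb_lm; the column names x1 to x6 and y are placeholders, and the cumulative variance is computed from the unrounded standard deviations instead of the rounded summary() table.

```r
set.seed(1)
n <- 500
# Simulated predictors: x2 is strongly correlated with x1, the rest are independent noise
x1 <- rnorm(n)
sim <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.3),
  x3 = rnorm(n), x4 = rnorm(n), x5 = rnorm(n), x6 = rnorm(n)
)
sim$y <- 2 * sim$x1 - sim$x3 + rnorm(n)

# Run PCA on the predictor columns only (the target y is excluded)
predictors <- sim[, setdiff(names(sim), "y")]
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the first 1, 2, ... components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.90)[1]

# Regress the target on the first k component scores
pc_data <- data.frame(pca$x[, 1:k, drop = FALSE], y = sim$y)
pc_fit <- lm(y ~ ., data = pc_data)
summary(pc_fit)$r.squared
```

Keeping the target out of prcomp() means the resulting R squared measures only what the predictors explain, which makes it easier to compare against a model fitted to the raw predictors on the same response scale.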
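Random forest is an ensemble learning algorithm. Ensemble learning combines many weak learners into one strong learner, improving the predictive power and robustness of the overall model. Besides combining the predictions of many decision trees, random forest also introduces randomization: each tree is trained on a bootstrap sample (drawn with replacement), and each split searches only a random subset of the features rather than all of them, which makes the combined result more robust. Given its robustness and its ability to handle nonlinear problems, random forest is well suited to our modelling task.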
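The two sources of randomness described above can be sketched in base R. This is only an illustration of the sampling steps, not of how randomForest() is implemented internally; n_rows is a made-up size, while p = 38 matches the number of predictors in this analysis.

```r
set.seed(1000)
n_rows <- 10000   # hypothetical number of training rows
p <- 38           # number of predictors in this analysis

# 1) Bootstrap sample: draw n_rows row indices with replacement
boot_idx <- sample.int(n_rows, size = n_rows, replace = TRUE)
in_bag_share <- length(unique(boot_idx)) / n_rows   # close to 1 - exp(-1), roughly 63%
oob_idx <- setdiff(seq_len(n_rows), boot_idx)       # "out-of-bag" rows this tree never sees

# 2) Random feature subset tried at a split;
#    randomForest's default for regression is max(floor(p / 3), 1)
mtry <- max(floor(p / 3), 1)
candidate_features <- sample.int(p, size = mtry)

c(in_bag_share = in_bag_share, oob_rows = length(oob_idx), mtry = mtry)
```

Each tree therefore sees roughly two thirds of the rows and, at every split, about a third of the features, which is what decorrelates the trees before their predictions are averaged.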
set.seed(1000)
# Use the same data frame as for the linear regression model, but log-transform the price
airbnb_ml = airbnb_lm |> mutate(price=log(price))
# Split the data into training and test sets so the model is evaluated on unseen data (avoiding data leakage)
train_index = createDataPartition(airbnb_ml$price, p = 0.8, list = FALSE)
train_data = airbnb_ml[train_index, ]
test_data = airbnb_ml[-train_index, ]
# Build the random forest model
rf_model = randomForest(price ~ ., data = train_data, ntree = 100)
# Predict on the test data
rf_predictions = predict(rf_model, test_data)
# Evaluate model performance with MSE (mean squared error)
rf_mse = mean((rf_predictions - test_data$price)^2)
rf_mse
[1] 0.1453842
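Support vector regression (SVR) is a regression algorithm based on the support vector machine (SVM), which is more commonly used for classification. SVR looks for a linear regression function \(f(x) = w ⋅ x + b\) such that most training points \((x_i,y_i)\) fall within an \(ε\) band around it, that is, \(|y_i-f(x_i)| ≤ ε\). Through the kernel trick, a nonlinear problem can be turned into a linear one that is solved in a high-dimensional feature space, so SVR suits nonlinear problems and high-dimensional data and is a good fit for our model.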
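The \(ε\) band corresponds to the ε-insensitive loss: errors smaller than ε cost nothing, and larger errors are penalized linearly by the amount they exceed ε. A small sketch with made-up log prices follows; the function name eps_insensitive_loss and the sample values are illustrative only.

```r
# Epsilon-insensitive loss: zero inside the tube, linear outside it
eps_insensitive_loss <- function(y, f, epsilon = 0.1) {
  pmax(abs(y - f) - epsilon, 0)
}

y_true <- c(5.00, 5.10, 4.70, 6.00)   # made-up log prices
y_pred <- c(5.05, 5.30, 4.70, 5.50)   # made-up predictions

loss <- eps_insensitive_loss(y_true, y_pred, epsilon = 0.1)
loss
```

Only the second and fourth points lie outside the band and contribute to the loss, which is why the fitted function ends up depending on a subset of the training points (the support vectors).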
set.seed(1000)
svm_model = svm(price ~ ., data = train_data)
# Predict on the test data
svm_predictions = predict(svm_model, test_data)
# Evaluate model performance
svm_mse = mean((svm_predictions - test_data$price)^2)
svm_mse
[1] 0.1504324
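Extreme gradient boosting (XGBoost, eXtreme Gradient Boosting) is a gradient boosted decision tree (GBDT) algorithm. Like random forest, XGBoost is an ensemble method; unlike random forest, it does not train its weak learners (such as decision trees) independently and combine them all at once, but improves on the previous stage's model step by step. In each iteration XGBoost keeps the model from the previous iteration, computes the residuals on the training samples (more generally, the gradient of the loss), and trains a new decision tree on them; that tree is then added to the existing model, scaled by the learning rate, to correct its errors. Because each new tree is fitted to what the current model still gets wrong, the samples with the largest errors have the most influence on it, and the overall model improves with every round. Given its strong learning capacity and broad applicability, XGBoost is also well suited to our modelling task.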
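The iterative idea can be sketched with plain gradient boosting on one-split trees (stumps) in base R. This is a simplified illustration under squared error loss, not XGBoost itself, which adds regularization, second-order information, and row and column subsampling. The data are simulated and the helpers fit_stump and predict_stump are hypothetical names.

```r
set.seed(1000)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)   # simulated nonlinear target

# Fit a one-split regression tree (stump) to the residuals r by least squares
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbouring x values
  sse <- vapply(cuts, function(cut) {
    left <- r[x <= cut]
    right <- r[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}
predict_stump <- function(stump, x) ifelse(x <= stump$cut, stump$left, stump$right)

eta <- 0.1                        # learning rate
pred <- rep(mean(y), length(y))   # start from a constant prediction
train_mse <- numeric(100)
for (m in seq_len(100)) {
  resid <- y - pred                              # negative gradient of squared error loss
  stump <- fit_stump(x, resid)                   # new tree targets the current residuals
  pred <- pred + eta * predict_stump(stump, x)   # shrink the tree and add it to the ensemble
  train_mse[m] <- mean((y - pred)^2)
}
round(train_mse[c(1, 10, 50, 100)], 4)
```

The training error falls with every round because each stump is a least-squares fit to the current residuals. A small eta makes each step cautious, which is why a learning rate such as 0.1 is usually paired with a larger number of rounds.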
set.seed(1000)
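# Convert the data sets to matrix form for XGBoost training
train_matrix = xgb.DMatrix(data = as.matrix(train_data %>% select(-price)), label = train_data$price)
test_matrix = xgb.DMatrix(data = as.matrix(test_data %>% select(-price)), label = test_data$price)
# XGBoost model parameters
params = list(
  booster = "gbtree",              # use tree-based gradient boosting
  objective = "reg:squarederror",  # "reg" stands for regression; "squarederror" is the squared error loss
  eta = 0.1,                       # learning rate: controls the step size of each boosting round, and thus the number of rounds needed and the speed of convergence
  max_depth = 6,                   # maximum tree depth: controls tree complexity, and thus how much the model can capture and how prone it is to overfitting
  subsample = 0.8,                 # fraction of rows sampled for each tree, to limit overfitting (each tree is built on part of the samples)
  colsample_bytree = 0.8           # fraction of features sampled for each tree, to limit overfitting (each tree is built on part of the features)
)
# Train the XGBoost model
xgb_model = xgb.train(
  params = params,
  data = train_matrix,
  nrounds = 100,
  watchlist = list(train = train_matrix, test = test_matrix),
  early_stopping_rounds = 10,
  print_every_n = 10,
  verbose = 1
)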
[1] train-rmse:3.881214 test-rmse:3.882684
Multiple eval metrics are present. Will use test_rmse for early stopping.
Will train until test_rmse hasn't improved in 10 rounds.
[11] train-rmse:1.406804 test-rmse:1.410092
[21] train-rmse:0.611223 test-rmse:0.618435
[31] train-rmse:0.413685 test-rmse:0.425190
[41] train-rmse:0.377303 test-rmse:0.391268
[51] train-rmse:0.369288 test-rmse:0.385721
[61] train-rmse:0.365865 test-rmse:0.384239
[71] train-rmse:0.362939 test-rmse:0.383317
[81] train-rmse:0.360288 test-rmse:0.382860
[91] train-rmse:0.357973 test-rmse:0.382622
[100] train-rmse:0.355879 test-rmse:0.382359
# Predict on the test data
predictions = predict(xgb_model, test_matrix)
# Evaluate model performance
xgboost_mse = mean((test_data$price - predictions)^2)
xgboost_mse
[1] 0.1461982
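The early_stopping_rounds = 10 setting used above stops training once the monitored error has failed to improve for 10 consecutive rounds. In the log shown, the test RMSE keeps falling through round 100, so the rule never triggers. The sketch below illustrates the rule on a made-up validation curve; early_stop_round is a hypothetical helper and not the code that xgb.train() runs. A separate validation set would normally play the monitoring role so that the test set stays untouched.

```r
# Return the round at which training would stop, given one validation error per round.
# Illustrative helper only; xgb.train() has its own implementation of this rule.
early_stop_round <- function(val_error, patience = 10) {
  best <- Inf
  best_round <- 0
  for (i in seq_along(val_error)) {
    if (val_error[i] < best) {
      best <- val_error[i]
      best_round <- i
    } else if (i - best_round >= patience) {
      return(list(stopped_at = i, best_round = best_round))
    }
  }
  list(stopped_at = length(val_error), best_round = best_round)
}

# Made-up validation curve: improves for 30 rounds, then drifts upward
val_error <- c(seq(1, 0.4, length.out = 30), seq(0.401, 0.45, length.out = 40))
res <- early_stop_round(val_error, patience = 10)
res
```

With this curve the best round is 30 and training stops at round 40, ten rounds after the last improvement.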
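In total we chose three machine learning algorithms, random forest, support vector regression (SVR), and XGBoost, to build price prediction models; we also include a multiple linear regression model for comparison. To that end, we first train the linear regression model on train_data and compute the mean squared error of its predictions on test_data: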
# Train the linear regression model
lm_model_comp = lm(price ~ ., data = train_data)
# Predict on the test data
lm_predictions = predict(lm_model_comp, test_data)
# Evaluate model performance
lm_mse = mean((lm_predictions - test_data$price)^2)
lm_mse
[1] 0.165323
We compare the performance of the four models:
mse_data = data.frame(
  Model = c("Linear Regression", "Random Forest", "SVM", "XGBoost"),
  MSE = c(lm_mse, rf_mse, svm_mse, xgboost_mse)
)
mse_data |>
  ggplot(aes(x = reorder(Model, MSE), y = MSE, fill = Model)) +
  geom_bar(stat = "identity", width = 0.7) +
  theme_minimal() +
  labs(title = "Comparison of Model MSE",
       x = "Model",
       y = "Mean Squared Error (MSE)") +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10, face = "bold"),
    legend.position = "none"
  ) +
  scale_fill_manual(values = c("Random Forest" = "dodgerblue", "SVM" = "orange", "XGBoost" = "green", "Linear Regression" = "purple"))
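The random forest model achieves the lowest test MSE of the four (0.1454), narrowly ahead of XGBoost (0.1462) and followed by SVR (0.1504), while the linear regression model performs worst (0.1653). We therefore use the random forest model to predict prices in the interactive tool that follows.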
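Because these MSE values are measured on log(price), they can be easier to read as a typical multiplicative error on the price itself. The sketch below recomputes a small comparison table from the four test MSE values reported above; the column names RMSE_log, Typical_factor, and Gain_vs_lm_pct are our own labels.

```r
# Test MSE values as reported above (log-price scale)
mse <- c("Linear Regression" = 0.165323,
         "Random Forest"     = 0.1453842,
         "SVM"               = 0.1504324,
         "XGBoost"           = 0.1461982)

comparison <- data.frame(
  Model = names(mse),
  MSE = unname(mse),
  RMSE_log = sqrt(unname(mse)),              # typical error in log(price)
  Typical_factor = exp(sqrt(unname(mse))),   # the same error as a multiplicative factor on price
  Gain_vs_lm_pct = 100 * (1 - unname(mse) / mse[["Linear Regression"]])
)
comparison[order(comparison$MSE), ]
```

Read this way, the random forest's MSE is about 12% below that of linear regression, while the gap between random forest and XGBoost is small enough that a different seed or split could plausibly change their order.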
Finally, we can use importance() to see how important each feature is according to the random forest model we trained:
# Extract and view the feature importances (sorted by importance)
importance_matrix = importance(rf_model)
importance_matrix_ordered = importance_matrix[order(importance_matrix[,1], decreasing = TRUE),]
importance_matrix_ordered
Entire_home_apt Private_room bedrooms
3239.592542 1803.762919 1772.693131
accommodates host_since_days bathrooms
1360.286665 1119.760569 1057.006532
number_of_reviews review_scores_rating beds
795.774318 618.549517 588.342671
Shared_room SF host_response_rate
363.671877 317.722324 279.723141
TV Elevator LA
206.666415 187.185833 158.524508
NYC Chicago instant_bookable
146.609571 143.205838 133.000882
House host_identity_verified Air_conditioning
128.362731 122.224538 121.787427
cleaning_fee strict Apartment
120.137876 116.875475 113.113110
moderate flexible DC
95.615019 84.727811 65.922082
Heating Boston Condominium
61.219916 59.975688 50.545357
Loft Wireless_Internet Real_Bed
43.898687 33.665349 32.892890
Townhouse Futon Pull_out_Sofa
31.337690 15.580109 11.978556
Airbed Couch
9.571640 6.784511
# Visualize the feature importances
varImpPlot(rf_model)
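IncNodePurity is one of the variable importance measures in random forest. It is based on how much a variable's splits increase node purity; for a regression forest such as ours, this is the total decrease in the residual sum of squares from splitting on that variable, averaged over all trees. The higher the value, the more important the variable is to the model. importance() returns these values and varImpPlot() plots them, so we can see at a glance how much each feature contributes to predicting price.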
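From this we see that room type (Entire_home_apt, Private_room), number of bedrooms (bedrooms), and capacity (accommodates) matter most for predicting price, while the individual indicators for city, property type, and cancellation policy rank lower. This may mean that guests care less about those factors, or that hosts weigh them less when setting prices, although the importance of a one-hot encoded factor is also split across its indicator columns, so each column understates the factor as a whole. These results echo the conclusions of our earlier data analysis: once visualized, room type, number of bedrooms, and capacity do show a clear positive relationship with price.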
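One rough way to view a one-hot encoded factor as a whole is to sum the IncNodePurity of its indicator columns. The sketch below does this with the values copied from the importance() output above. Summing is only a heuristic, since the indicators within a group are not independent, so the totals should be read as indicative.

```r
# IncNodePurity values copied from the importance() output above
city     <- c(SF = 317.722324, LA = 158.524508, NYC = 146.609571,
              Chicago = 143.205838, DC = 65.922082, Boston = 59.975688)
property <- c(House = 128.362731, Apartment = 113.113110, Condominium = 50.545357,
              Loft = 43.898687, Townhouse = 31.337690)
cancel   <- c(strict = 116.875475, moderate = 95.615019, flexible = 84.727811)
room     <- c(Entire_home_apt = 3239.592542, Private_room = 1803.762919,
              Shared_room = 363.671877)

# Total importance per original factor
grouped <- c(room_type = sum(room), city = sum(city),
             property_type = sum(property), cancellation_policy = sum(cancel))
sort(grouped, decreasing = TRUE)
```

Taken together, the six city indicators total about 892, which would place city between bathrooms (1057) and number_of_reviews (796) in the ranking, noticeably higher than any single city column suggests.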
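Based on the random forest model built above (rf_model), I used Shiny to build an interactive price estimation tool (see the code below). Because the memory available to a free shinyapps.io account is not enough to deploy this Shiny App with the random forest model, the version deployed online uses the linear regression model instead; it needs less memory, so readers can try the tool online.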
Link to the interactive price estimation tool: click here
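The .Rmd file for this tool is included in "7. Appendix". Download it together with the data set used in this project, airbnb.csv (linked in "1. Abstract" and "7. Appendix"), and place both in the same directory; you can then run the interactive Airbnb price estimation tool locally and use the random forest model for more accurate estimates.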
(The following is an excerpt of the tool's code.)
# Get the names of the predictor variables
predictor_vars = names(train_data)[names(train_data) != "price"]
# Group the variables by type
city_vars = c("SF", "NYC", "LA", "DC", "Chicago", "Boston")
cancellation_policy_vars = c("strict", "moderate", "flexible")
bed_type_vars = c("Real_Bed", "Pull_out_Sofa", "Futon", "Couch", "Airbed")
property_type_vars = c("Loft", "House", "Townhouse", "Condominium", "Apartment")
room_type_vars = c("Shared_room", "Private_room", "Entire_home_apt")
boolean_vars = c("Elevator", "TV", "Heating", "Air_conditioning", "Wireless_Internet", "instant_bookable", "host_identity_verified", "cleaning_fee")
integer_vars = c("accommodates", "bathrooms", "number_of_reviews", "review_scores_rating", "bedrooms", "beds", "host_since_days", "host_response_rate")
other_vars = setdiff(predictor_vars, c(city_vars, cancellation_policy_vars, bed_type_vars, property_type_vars, room_type_vars, boolean_vars, integer_vars))
# Build the interactive Shiny App
shinyApp(
  ui = fluidPage(
    tags$head(tags$style(HTML("
      .title-center {
        text-align: center;
      }
      .shiny-input-panel {
        max-width: 800px;
        margin: 0 auto;
      }
    "))),
    titlePanel(
      title = div("互動式房價預測工具", class = "title-center")  # "Interactive price prediction tool"
    ),
    fluidRow(
      column(12, align="center",
             h5("請輸入以下資訊以獲得預估 Airbnb 房價(美金)"),  # "Enter the information below to get an estimated Airbnb price (USD)"
             textOutput("prediction"),
             tags$hr()
      )
    ),
    sidebarLayout(
      sidebarPanel(
        width = 12,
        selectInput("city", "City",
                    choices = c("SF", "NYC", "LA", "DC", "Chicago", "Boston"),
                    selected = "SF"),
        selectInput("cancellation_policy", "Cancellation Policy",
                    choices = c("strict", "moderate", "flexible"),
                    selected = "strict"),
        selectInput("bed_type", "Bed Type",
                    choices = c("Real_Bed", "Pull_out_Sofa", "Futon", "Couch", "Airbed"),
                    selected = "Real_Bed"),
        selectInput("property_type", "Property Type",
                    choices = c("Loft", "House", "Townhouse", "Condominium", "Apartment"),
                    selected = "Apartment"),
        selectInput("room_type", "Room Type",
                    choices = c("Shared_room", "Private_room", "Entire_home_apt"),
                    selected = "Entire_home_apt"),
        # One numeric input per numeric predictor
        lapply(integer_vars, function(var) {
          numericInput(inputId = var,
                       label = switch(var,
                                      "accommodates" = "Accommodates",
                                      "bathrooms" = "Bathrooms",
                                      "number_of_reviews" = "Number of Reviews",
                                      "review_scores_rating" = "Review Scores Rating",
                                      "bedrooms" = "Bedrooms",
                                      "beds" = "Beds",
                                      "host_since_days" = "Days of Hosting",
                                      "host_response_rate" = "Host Response Rate"
                       ),
                       value = 1,
                       step = 1)
        }),
        # One True/False selector per boolean predictor
        lapply(boolean_vars, function(var) {
          selectInput(inputId = var,
                      label = switch(var,
                                     "Air_conditioning" = "Air Conditioning",
                                     "Wireless_Internet" = "Wireless Internet",
                                     "instant_bookable" = "Instant Bookable",
                                     "host_identity_verified" = "Host Identity Verified",
                                     "cleaning_fee" = "Cleaning Fee",
                                     "Elevator" = "Elevator",
                                     "TV" = "TV",
                                     "Heating" = "Heating"
                      ),
                      choices = c("True" = 1, "False" = 0),
                      selected = 1)
        })
      ),
      mainPanel(
        width = 9,
        div(
          style = "max-width: 1000px; margin: 0 auto;"
        )
      )
    )
  ),
  server = function(input, output) {
    output$prediction = renderText({
      # Create a new one-row data frame to hold the predictor values entered by the user
      new_data = data.frame(matrix(ncol = length(predictor_vars), nrow = 1))
      colnames(new_data) = predictor_vars
      # Handle city: reset all indicators to 0, then set the chosen one to 1
      new_data[1, city_vars] = 0
      new_data[1, input$city] = 1
      # Handle cancellation_policy
      new_data[1, cancellation_policy_vars] = 0
      new_data[1, input$cancellation_policy] = 1
      # Handle bed_type
      new_data[1, bed_type_vars] = 0
      new_data[1, input$bed_type] = 1
      # Handle property_type
      new_data[1, property_type_vars] = 0
      new_data[1, input$property_type] = 1
      # Handle room_type
      new_data[1, room_type_vars] = 0
      new_data[1, input$room_type] = 1
      # Handle the True/False variables
      for (var in boolean_vars) {
        new_data[[var]] = as.integer(input[[var]])
      }
      # Handle the numeric variables
      for (var in integer_vars) {
        new_data[[var]] = input[[var]]
      }
      # Handle any remaining predictors
      for (var in other_vars) {
        new_data[[var]] = input[[var]]
      }
      # Predict with the random forest model
      predicted_price = predict(rf_model, new_data)
      # The label reads "Predicted price: $"; exp() undoes the log transform applied to price
      paste("預測的房價為: $", round(exp(predicted_price), 2))
    })
  }
)
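The server function above sets one indicator per categorical group by hand. As a sketch of how that step could be factored out and checked outside Shiny, the helper below builds the same kind of one-row, one-hot data frame from a list of groups and the chosen levels. The name one_hot_row and the two-group example are illustrative assumptions and not part of the app.

```r
# Build a one-row data frame in which exactly one indicator per group is set to 1.
# one_hot_row() and the small groups list are illustrative, not part of the app.
one_hot_row <- function(groups, choices) {
  all_cols <- unlist(groups, use.names = FALSE)
  row <- as.data.frame(matrix(0, nrow = 1, ncol = length(all_cols)))
  colnames(row) <- all_cols
  for (g in names(groups)) {
    choice <- choices[[g]]
    stopifnot(choice %in% groups[[g]])   # reject values outside the known levels
    row[1, choice] <- 1
  }
  row
}

groups <- list(
  city = c("SF", "NYC", "LA", "DC", "Chicago", "Boston"),
  room_type = c("Shared_room", "Private_room", "Entire_home_apt")
)
new_row <- one_hot_row(groups, list(city = "LA", room_type = "Private_room"))
new_row
```

Keeping this logic in a plain function makes it possible to verify, without launching the app, that the columns passed to predict() match the training columns and that each group has exactly one active indicator.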
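The interactive price estimation tool in action:
Interface of the interactive price estimation tool
Users can enter the attributes of an Airbnb listing to obtain an estimated price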