This project integrates knowledge in various areas, including:
This dataset contains information about the prices of houses in Bangkok. It includes several features that affect the price of a house, including location, usable area, number of bedrooms, number of bathrooms, number of floors, and other facilities. You can access the source of the data from this link: House Price in Bangkok
| No. | Attribute | Explain | Unit | Type of Data |
|---|---|---|---|---|
| 1 | Title | title of house | - | text. |
| 2 | Location | location of house | - | text. |
| 3 | Beds | No. of Bedroom | room | Integer |
| 4 | Baths | No. of Bathroom | room | Integer |
| 5 | Usable area | Usable area | sq.m | Float |
| 6 | Land area | Land area | sq.m | Float |
| 7 | Floors | No. of Floors | floor | Integer |
| 8 | Others column | Facilities | binary | 0 (no) / 1 (yes) |
library(reticulate)
## Warning: package 'reticulate' was built under R version 4.3.1
use_python("/Library/Frameworks/Python.framework/Versions/3.11/bin/python3")
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import requests
url_main = "https://www.thailand-property.com/houses-for-sale/bangkok"
url_total = 10
title_list = []
location_list = []
price_list = []
beds_list = []
baths_list = []
usable_area_list = []
land_area_list = []
floors_list = []
facilities_list = []
for page in range(1, url_total+1):
if page == 1:
url_page = url_main
else:
url_page = f"{url_main}?page={page}"
req_page = Request(url_page, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req_page).read()
soup = BeautifulSoup(webpage, 'html.parser')
listing_links = soup.select('div.left-block a.hj-listing-snippet')
urls = [link['href'] for link in listing_links]
for url_seller in urls:
req_seller = Request(url_seller, headers={'User-Agent': 'Mozilla/5.0'})
web_seller = urlopen(req_seller).read()
soup_seller = BeautifulSoup(web_seller, 'html.parser')
title = soup_seller.select_one('h1.page-title')
location = soup_seller.select_one('div.location')
price = soup_seller.select_one('div.price-title')
beds = soup_seller.select_one('li:contains("Beds") span')
baths = soup_seller.select_one('li:contains("Baths") span')
usable_area = soup_seller.select_one('li:contains("Usable area") span')
land_area = soup_seller.select_one('li:contains("Land area") span')
floors = soup_seller.select_one('li:contains("Floors") span')
facilities_elements = soup_seller.select('.list-unstyled.facilities li')
facilities = [facility.get_text(strip=True) for facility in facilities_elements]
title_list.append(title.text.strip() if title else None)
location_list.append(location.text.strip() if location else None)
price_list.append(re.search(r'Sale: ฿ ([\d,]+)', price.text).group(1) if price else None)
beds_list.append(beds.text if beds else None)
baths_list.append(baths.text if baths else None)
usable_area_list.append(float(re.search(r'\d+(\.\d+)?', usable_area.text).group()) if usable_area else None)
land_area_list.append(float(re.search(r'\d+(\.\d+)?', land_area.text).group()) if land_area else None)
floors_list.append(floors.text if floors else None)
facilities_list.append(facilities)
## /Users/j.nrup/Library/Python/3.11/lib/python/site-packages/soupsieve/css_parser.py:872: FutureWarning: The pseudo class ':contains' is deprecated, ':-soup-contains' should be used moving forward.
## warnings.warn(
import pandas as pd
house_detail = {
'title': title_list,
'location': location_list,
'beds': beds_list,
'baths': baths_list,
'usable_area': usable_area_list,
'land_area': land_area_list,
'floors': floors_list,
'facilities': facilities_list,
'price': price_list,
}
housePrice_df = pd.DataFrame(house_detail, dtype=object)
facilities_df = pd.get_dummies(housePrice_df['facilities'].explode()).groupby(level=0).sum()
housePrice_df = pd.concat([housePrice_df, facilities_df], axis=1)
housePrice_df = housePrice_df.drop('facilities', axis=1)
housePrice_df
## title ... Water
## 0 5 Bedroom House for sale in Perfect Masterpiec... ... 0
## 1 Single House for Sale Narisa Thonburirom Villa... ... 1
## 2 Brand new Luxury house for sale at Srinakarin ... 1
## 3 3 bed House Suanluang District H015937 ... 0
## 4 ขายบ้านเดี่ยว 2 ชั้น หมู่บ้านปริญญดา เกษตรนวมิ... ... 0
## .. ... ... ...
## 245 3 Bedroom House for sale at Perfect Place Sukh... ... 0
## 246 Single House -2-Storeys - 5 bedrooms - with S... ... 1
## 247 5 Bedroom House for sale at Chaiyaphruek Ram I... ... 0
## 248 3 Bedroom House for sale at Panthip Village ... 0
## 249 บ้านเดี่ยว ลลิล เดอะ ยัง เอ็กซ์เซ็กคลูทีฟ อ่อน... ... 0
##
## [250 rows x 44 columns]
housePrice_df.to_csv("housePriceInBangkok.csv")
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("housePriceInBangkok.csv")
## New names:
## Rows: 250 Columns: 45
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): title, location dbl (42): ...1, beds, baths, usable_area, land_area,
## floors, Access for the ... num (1): price
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
library(RPostgreSQL)
## Loading required package: DBI
library(DBI)
# [1] connect to database.
con <- dbConnect(PostgreSQL(),
host = "floppy.db.elephantsql.com", # Server
port = 5432, # Door to the room // default in postgreSQL: 5432
user = "xjyjkkgf",
password = "psJ6nkYIx0qJVO6IUjVpnjitAhgHYNnr",
dbname = "xjyjkkgf") # default database
# [2] Check table list.
cat("List table in database : ", dbListTables(con))
## List table in database : houseprice
# [3] Remove table if it exist and write table to database.
# Check if the table exists.
if (dbExistsTable(con, "housePrice")) {
# If it exists, remove the table.
dbRemoveTable(con, "houseprice")
print("houseprice removed successfully.")
}
# write new table.
dbWriteTable(con, "houseprice", df)
## Warning in postgresqlWriteTable(conn, name, value, ...): table houseprice
## exists in database: aborting assignTable
## [1] FALSE
print("houseprice write completely.")
## [1] "houseprice write completely."
# [4] Get data
dbGetQuery(con, "SELECT title, location, beds, baths, usable_area, land_area, floors
FROM houseprice
LIMIT 10;")
## title
## 1 5 Bedroom House for sale in Perfect Masterpiece Rama9 – Krungthep Kreetha
## 2 Single House for Sale Narisa Thonburirom Village area of 210 sq.w. Price only 10.9 million
## 3 Brand new Luxury house for sale at Srinakarin
## 4 ขายบ้านเดี่ยว 2 ชั้น หมู่บ้านปริญญดา เกษตรนวมินทร์ ซอยแจ่มจันทร์ ใกล้เลียบด่วนเอกมัยรามอินทรา
## 5 Sell House 55 Sathorn Sathorn Soi 15 Architect's House Unique Design One and only one, price 115 million baht.
## 6 2 Bedroom House for sale in Baan Somjai Phuttha Monthon Sai 2
## 7 Brand new Luxury house for sale at Srinakarin
## 8 Pattanakarn Srinakarin | Elegant Family Home 3+1 Bed walk to MRT
## 9 4 Bedroom House for sale
## 10 For Sale Corner Unit Garden View Narasiri Krungthepkreetha
## location beds baths usable_area land_area floors
## 1 Bangkok, Prawet, Prawet 5 6 470.0 570.4 2
## 2 Bangkok, Thung Khru, Thung Khru 3 3 177.0 440.0 2
## 3 Bangkok, Prawet, Nong Bon 5 7 710.0 272.0 4
## 4 Bangkok, Bueng Kum, Nawamin 4 4 305.0 370.0 2
## 5 Bangkok, Sathon, Thung Maha Mek 3 4 460.0 160.0 5
## 6 Bangkok, Bang Khae, Bang Phai 2 2 114.0 53.0 2
## 7 Bangkok, Prawet, Nong Bon 4 6 601.0 224.0 4
## 8 Bangkok, Suan Luang, Suan Luang 3 4 520.0 504.0 2
## 9 Bangkok, Lat Phrao, Lat Phrao 4 5 766.5 98.0 3
## 10 Bangkok, Bang Kapi, Hua Mak 5 6 687.0 812.0 NA
cat("List table in database : ", dbListTables(con))
## List table in database : houseprice
# [5] Close connection
dbDisconnect(con)
## [1] TRUE
library(ggplot2)
housePrice_df <- tibble(df)
head(housePrice_df)
## # A tibble: 6 × 45
## ...1 title location beds baths usable_area land_area floors price
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 5 Bedroom Hous… Bangkok… 5 6 470 570. 2 4 e7
## 2 1 Single House f… Bangkok… 3 3 177 440 2 1.09e7
## 3 2 Brand new Luxu… Bangkok… 5 7 710 272 4 6.39e7
## 4 3 3 bed House S… Bangkok… NA NA 250 30 3 9.5 e6
## 5 4 ขายบ้านเดี่ยว 2 ชั้… Bangkok… 4 4 305 370 2 9.8 e6
## 6 5 2 Bedroom Hous… Bangkok… 2 2 114 53 2 4.50e6
## # ℹ 36 more variables: `Access for the disabled` <dbl>,
## # `Air conditioning` <dbl>, Alarm <dbl>, Balcony <dbl>,
## # `Built-in kitchen` <dbl>, `Built-in wardrobe` <dbl>, `Car park` <dbl>,
## # Cellar <dbl>, `Children's area` <dbl>, Cistern <dbl>, Concierge <dbl>,
## # Electricity <dbl>, Elevator <dbl>, `Equipped kitchen` <dbl>,
## # Fireplace <dbl>, Garden <dbl>, Grill <dbl>, Guardhouse <dbl>, Gym <dbl>,
## # Heating <dbl>, `Hot Tub` <dbl>, Internet <dbl>, Library <dbl>, …
# Check null row.
mean(complete.cases(housePrice_df))
## [1] 0.404
# Delete null.
clean_df <- drop_na(housePrice_df)
mean(complete.cases(clean_df))
## [1] 1
# Change Column name to snake case.
column_name_mappings <- c(
"...1" = "no",
"title" = "title",
"location" = "location",
"price" = "price",
"beds" = "beds",
"baths" = "baths",
"usable_area" = "usable_area",
"land_area" = "land_area",
"floors" = "floors",
"Access for the disabled" = "access_for_the_disabled",
"Air conditioning" = "air_conditioning",
"Alarm" = "alarm",
"Balcony" = "balcony",
"Built-in kitchen" = "built_in_kitchen",
"Built-in wardrobe" = "built_in_wardrobe",
"Car park" = "car_park",
"Cellar" = "cellar",
"Children's area" = "children_area",
"Cistern" = "cistern",
"Concierge" = "concierge",
"Electricity" = "electricity",
"Elevator" = "elevator",
"Equipped kitchen" = "equipped_kitchen",
"Fireplace" = "fireplace",
"Garden" = "garden",
"Grill" = "grill",
"Guardhouse" = "guardhouse",
"Gym" = "gym",
"Hot Tub" = "hot_tub",
"Internet" = "internet",
"Library" = "library",
"Natural gas" = "natural_gas",
"Office" = "office",
"Panoramic view" = "panoramic_view",
"Patio" = "patio",
"Roof garden" = "roof_garden",
"Sauna" = "sauna",
"Security" = "security",
"Swimming pool" = "swimming_pool",
"Tennis court" = "tennis_court",
"Terrace" = "terrace",
"Utility room" = "utility_room",
"Video cable" = "video_cable",
"Water" = "water"
)
# Rename columns using the mappings
colnames(clean_df) <- sapply(colnames(clean_df), function(col) column_name_mappings[col])
clean_df
## # A tibble: 101 × 45
## no title location beds baths usable_area land_area floors price
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 5 Bedroom Hou… Bangkok… 5 6 470 570. 2 4 e7
## 2 1 Single House … Bangkok… 3 3 177 440 2 1.09e7
## 3 2 Brand new Lux… Bangkok… 5 7 710 272 4 6.39e7
## 4 4 ขายบ้านเดี่ยว 2 … Bangkok… 4 4 305 370 2 9.8 e6
## 5 5 2 Bedroom Hou… Bangkok… 2 2 114 53 2 4.50e6
## 6 6 Sell House 55… Bangkok… 3 4 460 160 5 1.15e8
## 7 7 Brand new Lux… Bangkok… 4 6 601 224 4 5.49e7
## 8 8 Pattanakarn S… Bangkok… 3 4 520 504 2 4.50e7
## 9 9 4 Bedroom Hou… Bangkok… 4 5 766. 98 3 3.90e7
## 10 11 Unique Detach… Bangkok… 5 6 850 3 2 8.5 e7
## # ℹ 91 more rows
## # ℹ 36 more variables: access_for_the_disabled <dbl>, air_conditioning <dbl>,
## # alarm <dbl>, balcony <dbl>, built_in_kitchen <dbl>,
## # built_in_wardrobe <dbl>, car_park <dbl>, cellar <dbl>, children_area <dbl>,
## # cistern <dbl>, concierge <dbl>, electricity <dbl>, elevator <dbl>,
## # equipped_kitchen <dbl>, fireplace <dbl>, garden <dbl>, grill <dbl>,
## # guardhouse <dbl>, gym <dbl>, `` <dbl>, hot_tub <dbl>, internet <dbl>, …
ggplot(data = clean_df, mapping = aes(x = price)) +
geom_histogram(bins=30, fill = "#F5AD9E") +
labs(title = "Distribution of House Price") +
theme_minimal()
ggplot(data = clean_df, mapping = aes(x = price)) +
geom_boxplot() +
labs(title = "Outlier detection of House Price") +
theme_minimal()
The box plot chart shows that the house price data contains outliers.
Therefore,before we proceed to the next step, we will clean the data by
removing the outliers. We will create a new dataframe named
“clean_dfNo_out” to store the data without outliers.
out <- boxplot.stats(clean_df$price)$out
out
## [1] 2.3e+08 6.2e+08
As we can see, there are actually 2 points considered as potential outliers as follows : 2.30e+08, and 6.20e+08.
out_ind <- which(clean_df$price %in% c(out))
out_ind
## [1] 48 84
clean_df[out_ind, ]
## # A tibble: 2 × 45
## no title location beds baths usable_area land_area floors price
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 107 Super-Luxury H… Bangkok… 8 8 2 1 3 2.3 e8
## 2 187 Luxury Buildin… Bangkok… 72 72 5 1 8 6.20e8
## # ℹ 36 more variables: access_for_the_disabled <dbl>, air_conditioning <dbl>,
## # alarm <dbl>, balcony <dbl>, built_in_kitchen <dbl>,
## # built_in_wardrobe <dbl>, car_park <dbl>, cellar <dbl>, children_area <dbl>,
## # cistern <dbl>, concierge <dbl>, electricity <dbl>, elevator <dbl>,
## # equipped_kitchen <dbl>, fireplace <dbl>, garden <dbl>, grill <dbl>,
## # guardhouse <dbl>, gym <dbl>, `` <dbl>, hot_tub <dbl>, internet <dbl>,
## # library <dbl>, natural_gas <dbl>, office <dbl>, panoramic_view <dbl>, …
clean_dfNo_out <- clean_df[-out_ind,]
clean_dfNo_out
## # A tibble: 99 × 45
## no title location beds baths usable_area land_area floors price
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 5 Bedroom Hou… Bangkok… 5 6 470 570. 2 4 e7
## 2 1 Single House … Bangkok… 3 3 177 440 2 1.09e7
## 3 2 Brand new Lux… Bangkok… 5 7 710 272 4 6.39e7
## 4 4 ขายบ้านเดี่ยว 2 … Bangkok… 4 4 305 370 2 9.8 e6
## 5 5 2 Bedroom Hou… Bangkok… 2 2 114 53 2 4.50e6
## 6 6 Sell House 55… Bangkok… 3 4 460 160 5 1.15e8
## 7 7 Brand new Lux… Bangkok… 4 6 601 224 4 5.49e7
## 8 8 Pattanakarn S… Bangkok… 3 4 520 504 2 4.50e7
## 9 9 4 Bedroom Hou… Bangkok… 4 5 766. 98 3 3.90e7
## 10 11 Unique Detach… Bangkok… 5 6 850 3 2 8.5 e7
## # ℹ 89 more rows
## # ℹ 36 more variables: access_for_the_disabled <dbl>, air_conditioning <dbl>,
## # alarm <dbl>, balcony <dbl>, built_in_kitchen <dbl>,
## # built_in_wardrobe <dbl>, car_park <dbl>, cellar <dbl>, children_area <dbl>,
## # cistern <dbl>, concierge <dbl>, electricity <dbl>, elevator <dbl>,
## # equipped_kitchen <dbl>, fireplace <dbl>, garden <dbl>, grill <dbl>,
## # guardhouse <dbl>, gym <dbl>, `` <dbl>, hot_tub <dbl>, internet <dbl>, …
ggplot(data = clean_dfNo_out, mapping = aes(x=usable_area, y=price, col=price)) +
geom_point() +
labs(title = "usable_area vs price") +
scale_color_gradient(low="gold",high = "blue")
library(patchwork)
c1 <- ggplot(data = clean_df, mapping = aes(x = price)) +
geom_histogram(bins=10, fill = "#F5AD9E") +
labs(title = "Distribution of House Price") +
theme_minimal()
c2 <- ggplot(data = clean_df, mapping = aes(x = price)) +
geom_boxplot() +
labs(title = "Outlier Detection of House Price") +
theme_minimal()
c3 <- ggplot(data = clean_dfNo_out, mapping = aes(x=usable_area, y=price, col=price)) +
geom_point() +
labs(title = "usable_area vs price") +
scale_color_gradient(low="gold",high = "blue")
(c1 + c2)/c3
Remark : These are simple examples to create visualization using ggplot in R programming.
Correlation measures is a measure of the strength and direction of the relationship between two variables.
First, we compute correlations for several pairs of variables and display them in the matrix below.
# Select specific variables for the correlation matrix
selected_vars <- c("price", "beds", "baths", "usable_area", "land_area", "floors")
# Create a correlation matrix for the selected variables
cor_matrix_selected <- round(
cor(clean_dfNo_out[, selected_vars]),
digits = 2
)
# Print the correlation matrix for selected variables
print(cor_matrix_selected)
## price beds baths usable_area land_area floors
## price 1.00 0.31 0.59 0.52 0.20 0.43
## beds 0.31 1.00 0.67 0.24 0.28 0.16
## baths 0.59 0.67 1.00 0.54 0.36 0.35
## usable_area 0.52 0.24 0.54 1.00 0.21 0.34
## land_area 0.20 0.28 0.36 0.21 1.00 -0.21
## floors 0.43 0.16 0.35 0.34 -0.21 1.00
Finally, we will modify the correlation matrix format to make it more readable by creating a visualization using the package called “corrplot”.
# improved correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix_selected,
method = "number",
type = "upper" # show only upper side
)
In summary, this project combines knowledge in the areas of data analysis, computer programming, data visualization, and statistics. This can be a valuable skill set for a variety of careers.
Thank you for your interest. I hope this project will be beneficial to those who are interested. If there are any errors, I apologize in advance. - Narupong Jarasbunpaisarn