Project Description

This project integrates knowledge in various areas, including:

About Dataset

This dataset contains information about the prices of houses in Bangkok. It includes several features that affect the price of a house, including location, usable area, number of bedrooms, number of bathrooms, number of floors, and other facilities. You can access the source of the data from this link: House Price in Bangkok

No. Attribute Explain Unit Type of Data
1 Title title of house - text.
2 Location location of house - text.
3 Beds No. of Bedroom room Integer
4 Baths No. of Bathroom room Integer
5 Usable area Usable area sq.m Float
6 Land area Land area sq.m Float
7 Floors No. of Floors floor Integer
8 Others column Facilities binary 0 (no) / 1 (yes)

Scope of Project

  • Web Scraping with Python: Extract house prices in Thailand and other details like area, bedrooms, bathrooms, and facilities using BeautifulSoup.
  • Data Cleaning and Transformation with Python: Drop null value and create a dataframe from the scraped data and export it to a CSV file using pandas.
  • Database Creation with R programming: Create a PostgreSQL database using R, populated with the dataframe from the CSV file.
  • Data Analysis with SQL: Query the house price data from the PostgreSQL database using SQL scripts for analysis or other purposes.
  • Data Visualization with R programming: Visualize the data using ggplot2.

Let do it!

First step : Do web scraping by using Python Programming.

  1. Enabled Python Engine to use package of python follow by path.
library(reticulate)
## Warning: package 'reticulate' was built under R version 4.3.1
use_python("/Library/Frameworks/Python.framework/Versions/3.11/bin/python3")
  1. Import library to do Web Scraping and others that related.
  • urlopen : To work with URLs.
  • BeautifulSoup : To do Web Scraping.
  • requests : To allows you to fetch web pages and extract HTML content.
  • re : To use regular expression.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import requests
  1. Declare a variable named “url_main” to store the url of first page on website that you would like to access and variable named “url_total” to store the total quantity of page on website that you would like to do web scraping.
url_main = "https://www.thailand-property.com/houses-for-sale/bangkok"
url_total = 10
  1. Do web scraping as follows:
  • Create list to collect all detail of data. (Title, Location, Bedroom, Bathroom, Usable area, Land area, and other facilities)
  • Use for loop to extract all detail of data and storage in list that are created.
title_list = []
location_list = []
price_list = []
beds_list = []
baths_list = []
usable_area_list = []
land_area_list = []
floors_list = []
facilities_list = []

for page in range(1, url_total+1):
    if page == 1:
        url_page = url_main
    else:
        url_page = f"{url_main}?page={page}"

    req_page = Request(url_page, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req_page).read()

    soup = BeautifulSoup(webpage, 'html.parser')

    listing_links = soup.select('div.left-block a.hj-listing-snippet')
    urls = [link['href'] for link in listing_links]
    
    for url_seller in urls:
        req_seller = Request(url_seller, headers={'User-Agent': 'Mozilla/5.0'})
        web_seller = urlopen(req_seller).read()
        soup_seller = BeautifulSoup(web_seller, 'html.parser')

        title = soup_seller.select_one('h1.page-title')
        location = soup_seller.select_one('div.location')
        price = soup_seller.select_one('div.price-title')
        beds = soup_seller.select_one('li:contains("Beds") span')
        baths = soup_seller.select_one('li:contains("Baths") span')
        usable_area = soup_seller.select_one('li:contains("Usable area") span')
        land_area = soup_seller.select_one('li:contains("Land area") span')
        floors = soup_seller.select_one('li:contains("Floors") span')
        facilities_elements = soup_seller.select('.list-unstyled.facilities li')
        facilities = [facility.get_text(strip=True) for facility in facilities_elements]

        title_list.append(title.text.strip() if title else None)
        location_list.append(location.text.strip() if location else None)
        price_list.append(re.search(r'Sale: ฿ ([\d,]+)', price.text).group(1) if price else None)
        beds_list.append(beds.text if beds else None)
        baths_list.append(baths.text if baths else None)
        usable_area_list.append(float(re.search(r'\d+(\.\d+)?', usable_area.text).group()) if usable_area else None)
        land_area_list.append(float(re.search(r'\d+(\.\d+)?', land_area.text).group()) if land_area else None)
        floors_list.append(floors.text if floors else None)
        facilities_list.append(facilities)
## /Users/j.nrup/Library/Python/3.11/lib/python/site-packages/soupsieve/css_parser.py:872: FutureWarning: The pseudo class ':contains' is deprecated, ':-soup-contains' should be used moving forward.
##   warnings.warn(

Next step : Create Dataframe of House Price using pandas package.

  1. Import library Pandas to create Dataframe of House Price that we extract data from web scraping process.
import pandas as pd
  1. Create JSON called “house_detail” to use create a Dataframe.
house_detail = {
    'title': title_list,
    'location': location_list,
    'beds': beds_list,
    'baths': baths_list,
    'usable_area': usable_area_list,
    'land_area': land_area_list,
    'floors': floors_list,
    'facilities': facilities_list,
    'price': price_list,
}
  1. Create Dataframe called “housePrice_df” and convert facilities data to binary options. (0: There is not facility, 1: There is facility)
housePrice_df = pd.DataFrame(house_detail, dtype=object)
facilities_df = pd.get_dummies(housePrice_df['facilities'].explode()).groupby(level=0).sum()
housePrice_df = pd.concat([housePrice_df, facilities_df], axis=1)
housePrice_df = housePrice_df.drop('facilities', axis=1)
housePrice_df
##                                                  title  ... Water
## 0    5 Bedroom House for sale in Perfect Masterpiec...  ...     0
## 1    Single House for Sale Narisa Thonburirom Villa...  ...     1
## 2        Brand new Luxury house for sale at Srinakarin  ...     1
## 3              3 bed House  Suanluang District H015937  ...     0
## 4    ขายบ้านเดี่ยว 2 ชั้น หมู่บ้านปริญญดา เกษตรนวมิ...  ...     0
## ..                                                 ...  ...   ...
## 245  3 Bedroom House for sale at Perfect Place Sukh...  ...     0
## 246  Single House -2-Storeys  - 5 bedrooms - with S...  ...     1
## 247  5 Bedroom House for sale at Chaiyaphruek Ram I...  ...     0
## 248        3 Bedroom House for sale at Panthip Village  ...     0
## 249  บ้านเดี่ยว ลลิล เดอะ ยัง เอ็กซ์เซ็กคลูทีฟ อ่อน...  ...     0
## 
## [250 rows x 44 columns]
  1. Export the Dataframe housePrice_df to CSV file.
housePrice_df.to_csv("housePriceInBangkok.csv")

Next step : Create database with PostgreSQL and create table to input house data in database.

  1. Create R Dataframe from housePriceInBangkok.csv.
  • Import library tidyverse to read csv file.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("housePriceInBangkok.csv")
## New names:
## Rows: 250 Columns: 45
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): title, location dbl (42): ...1, beds, baths, usable_area, land_area,
## floors, Access for the ... num (1): price
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
  1. Create Server PostgreSQL database as follows:
  • STEP_1 : Go to Elephant SQL website -> Log in -> Create New Instance
  • STEP_2 : Define Name -> Click select Region
  • STEP_3 : Select Region (Nearest target group) -> Click Review
  • STEP_4 : Confirm the information -> Click Create Instance
STEP_1
STEP_1
STEP_2
STEP_2
STEP_3
STEP_3
STEP_4
STEP_4
  1. Create connection to database and table to insert data.
  • Import library RPostgreSQL and DBI for PostgreSQL connection.
  • Write new table to database and query data that is required.
library(RPostgreSQL)
## Loading required package: DBI
library(DBI)

# [1] connect to database.
con <- dbConnect(PostgreSQL(),
                 host = "floppy.db.elephantsql.com", # Server
                 port = 5432, # Door to the room // default in postgreSQL: 5432
                 user = "xjyjkkgf",
                 password = "psJ6nkYIx0qJVO6IUjVpnjitAhgHYNnr",
                 dbname = "xjyjkkgf") # default database

# [2] Check table list.
cat("List table in database : ", dbListTables(con))
## List table in database :  houseprice
# [3] Remove table if it exist and write table to database.
# Check if the table exists.
if (dbExistsTable(con, "housePrice")) {
  # If it exists, remove the table.
  dbRemoveTable(con, "houseprice")
  print("houseprice removed successfully.")
} 
# write new table.
dbWriteTable(con, "houseprice", df)
## Warning in postgresqlWriteTable(conn, name, value, ...): table houseprice
## exists in database: aborting assignTable
## [1] FALSE
print("houseprice write completely.")
## [1] "houseprice write completely."
# [4] Get data
dbGetQuery(con, "SELECT title, location, beds, baths, usable_area, land_area, floors
FROM houseprice
LIMIT 10;")
##                                                                                                              title
## 1                                        5 Bedroom House for sale in Perfect Masterpiece Rama9 – Krungthep Kreetha
## 2                     Single House for Sale Narisa Thonburirom Village   area of 210 sq.w. Price only 10.9 million
## 3                                                                    Brand new Luxury house for sale at Srinakarin
## 4                                       ขายบ้านเดี่ยว 2 ชั้น หมู่บ้านปริญญดา เกษตรนวมินทร์ ซอยแจ่มจันทร์ ใกล้เลียบด่วนเอกมัยรามอินทรา
## 5  Sell House 55 Sathorn Sathorn Soi 15 Architect's House Unique Design  One and only one, price 115 million baht.
## 6                                                    2 Bedroom House for sale in Baan Somjai Phuttha Monthon Sai 2
## 7                                                                    Brand new Luxury house for sale at Srinakarin
## 8                                                 Pattanakarn Srinakarin | Elegant Family Home 3+1 Bed walk to MRT
## 9                                                                                         4 Bedroom House for sale
## 10                                                      For Sale Corner Unit Garden View Narasiri Krungthepkreetha
##                           location beds baths usable_area land_area floors
## 1          Bangkok, Prawet, Prawet    5     6       470.0     570.4      2
## 2  Bangkok, Thung Khru, Thung Khru    3     3       177.0     440.0      2
## 3        Bangkok, Prawet, Nong Bon    5     7       710.0     272.0      4
## 4      Bangkok, Bueng Kum, Nawamin    4     4       305.0     370.0      2
## 5  Bangkok, Sathon, Thung Maha Mek    3     4       460.0     160.0      5
## 6    Bangkok, Bang Khae, Bang Phai    2     2       114.0      53.0      2
## 7        Bangkok, Prawet, Nong Bon    4     6       601.0     224.0      4
## 8  Bangkok, Suan Luang, Suan Luang    3     4       520.0     504.0      2
## 9    Bangkok, Lat Phrao, Lat Phrao    4     5       766.5      98.0      3
## 10     Bangkok, Bang Kapi, Hua Mak    5     6       687.0     812.0     NA
cat("List table in database : ", dbListTables(con))
## List table in database :  houseprice
# [5] Close connection
dbDisconnect(con)
## [1] TRUE

Final step : Do simple data visualization using ggplot2 in R programming.

  1. Import library ggplot2 to create visualization.
library(ggplot2)
  1. See the dataset.
housePrice_df <- tibble(df)
head(housePrice_df)
## # A tibble: 6 × 45
##    ...1 title           location  beds baths usable_area land_area floors  price
##   <dbl> <chr>           <chr>    <dbl> <dbl>       <dbl>     <dbl>  <dbl>  <dbl>
## 1     0 5 Bedroom Hous… Bangkok…     5     6         470      570.      2 4   e7
## 2     1 Single House f… Bangkok…     3     3         177      440       2 1.09e7
## 3     2 Brand new Luxu… Bangkok…     5     7         710      272       4 6.39e7
## 4     3 3 bed House  S… Bangkok…    NA    NA         250       30       3 9.5 e6
## 5     4 ขายบ้านเดี่ยว 2 ชั้… Bangkok…     4     4         305      370       2 9.8 e6
## 6     5 2 Bedroom Hous… Bangkok…     2     2         114       53       2 4.50e6
## # ℹ 36 more variables: `Access for the disabled` <dbl>,
## #   `Air conditioning` <dbl>, Alarm <dbl>, Balcony <dbl>,
## #   `Built-in kitchen` <dbl>, `Built-in wardrobe` <dbl>, `Car park` <dbl>,
## #   Cellar <dbl>, `Children's area` <dbl>, Cistern <dbl>, Concierge <dbl>,
## #   Electricity <dbl>, Elevator <dbl>, `Equipped kitchen` <dbl>,
## #   Fireplace <dbl>, Garden <dbl>, Grill <dbl>, Guardhouse <dbl>, Gym <dbl>,
## #   Heating <dbl>, `Hot Tub` <dbl>, Internet <dbl>, Library <dbl>, …
  1. Data cleansing
  • Drop null data.
  • Change Column name to snake case.
# Check null row.
mean(complete.cases(housePrice_df))
## [1] 0.404
# Delete null.
clean_df <- drop_na(housePrice_df)
mean(complete.cases(clean_df))
## [1] 1
# Change Column name to snake case.
column_name_mappings <- c(
  "...1" = "no",
  "title" = "title",
  "location" = "location",
  "price" = "price",
  "beds" = "beds",
  "baths" = "baths",
  "usable_area" = "usable_area",
  "land_area" = "land_area",
  "floors" = "floors",
  "Access for the disabled" = "access_for_the_disabled",
  "Air conditioning" = "air_conditioning",
  "Alarm" = "alarm",
  "Balcony" = "balcony",
  "Built-in kitchen" = "built_in_kitchen",
  "Built-in wardrobe" = "built_in_wardrobe",
  "Car park" = "car_park",
  "Cellar" = "cellar",
  "Children's area" = "children_area",
  "Cistern" = "cistern",
  "Concierge" = "concierge",
  "Electricity" = "electricity",
  "Elevator" = "elevator",
  "Equipped kitchen" = "equipped_kitchen",
  "Fireplace" = "fireplace",
  "Garden" = "garden",
  "Grill" = "grill",
  "Guardhouse" = "guardhouse",
  "Gym" = "gym",
  "Hot Tub" = "hot_tub",
  "Internet" = "internet",
  "Library" = "library",
  "Natural gas" = "natural_gas",
  "Office" = "office",
  "Panoramic view" = "panoramic_view",
  "Patio" = "patio",
  "Roof garden" = "roof_garden",
  "Sauna" = "sauna",
  "Security" = "security",
  "Swimming pool" = "swimming_pool",
  "Tennis court" = "tennis_court",
  "Terrace" = "terrace",
  "Utility room" = "utility_room",
  "Video cable" = "video_cable",
  "Water" = "water"
)

# Rename columns using the mappings
colnames(clean_df) <- sapply(colnames(clean_df), function(col) column_name_mappings[col])
clean_df
## # A tibble: 101 × 45
##       no title          location  beds baths usable_area land_area floors  price
##    <dbl> <chr>          <chr>    <dbl> <dbl>       <dbl>     <dbl>  <dbl>  <dbl>
##  1     0 5 Bedroom Hou… Bangkok…     5     6        470       570.      2 4   e7
##  2     1 Single House … Bangkok…     3     3        177       440       2 1.09e7
##  3     2 Brand new Lux… Bangkok…     5     7        710       272       4 6.39e7
##  4     4 ขายบ้านเดี่ยว 2 … Bangkok…     4     4        305       370       2 9.8 e6
##  5     5 2 Bedroom Hou… Bangkok…     2     2        114        53       2 4.50e6
##  6     6 Sell House 55… Bangkok…     3     4        460       160       5 1.15e8
##  7     7 Brand new Lux… Bangkok…     4     6        601       224       4 5.49e7
##  8     8 Pattanakarn S… Bangkok…     3     4        520       504       2 4.50e7
##  9     9 4 Bedroom Hou… Bangkok…     4     5        766.       98       3 3.90e7
## 10    11 Unique Detach… Bangkok…     5     6        850         3       2 8.5 e7
## # ℹ 91 more rows
## # ℹ 36 more variables: access_for_the_disabled <dbl>, air_conditioning <dbl>,
## #   alarm <dbl>, balcony <dbl>, built_in_kitchen <dbl>,
## #   built_in_wardrobe <dbl>, car_park <dbl>, cellar <dbl>, children_area <dbl>,
## #   cistern <dbl>, concierge <dbl>, electricity <dbl>, elevator <dbl>,
## #   equipped_kitchen <dbl>, fireplace <dbl>, garden <dbl>, grill <dbl>,
## #   guardhouse <dbl>, gym <dbl>, `` <dbl>, hot_tub <dbl>, internet <dbl>, …
  1. Create histogram of house price to see the distribution.
ggplot(data = clean_df, mapping = aes(x = price)) + 
  geom_histogram(bins=30, fill = "#F5AD9E") + 
  labs(title = "Distribution of House Price") + 
  theme_minimal()

  1. Create box plot of House Price to find the outlier.
ggplot(data = clean_df, mapping = aes(x = price)) + 
  geom_boxplot() + 
  labs(title = "Outlier detection of House Price") + 
  theme_minimal()

The box plot chart shows that the house price data contains outliers. Therefore,before we proceed to the next step, we will clean the data by removing the outliers. We will create a new dataframe named “clean_dfNo_out” to store the data without outliers.

  • Extract the values of the potential outliers based on the IQR criterion.
out <- boxplot.stats(clean_df$price)$out
out
## [1] 2.3e+08 6.2e+08

As we can see, there are actually 2 points considered as potential outliers as follows : 2.30e+08, and 6.20e+08.

  • Extract the row number corresponding to these outliers.
out_ind <- which(clean_df$price %in% c(out))
out_ind
## [1] 48 84
  • Print all variables for these outliers.
clean_df[out_ind, ]
## # A tibble: 2 × 45
##      no title           location  beds baths usable_area land_area floors  price
##   <dbl> <chr>           <chr>    <dbl> <dbl>       <dbl>     <dbl>  <dbl>  <dbl>
## 1   107 Super-Luxury H… Bangkok…     8     8           2         1      3 2.3 e8
## 2   187 Luxury Buildin… Bangkok…    72    72           5         1      8 6.20e8
## # ℹ 36 more variables: access_for_the_disabled <dbl>, air_conditioning <dbl>,
## #   alarm <dbl>, balcony <dbl>, built_in_kitchen <dbl>,
## #   built_in_wardrobe <dbl>, car_park <dbl>, cellar <dbl>, children_area <dbl>,
## #   cistern <dbl>, concierge <dbl>, electricity <dbl>, elevator <dbl>,
## #   equipped_kitchen <dbl>, fireplace <dbl>, garden <dbl>, grill <dbl>,
## #   guardhouse <dbl>, gym <dbl>, `` <dbl>, hot_tub <dbl>, internet <dbl>,
## #   library <dbl>, natural_gas <dbl>, office <dbl>, panoramic_view <dbl>, …
  • Declare a dataframe named “clean_dfNo_out” to store the data which remove outlier completely.
clean_dfNo_out <- clean_df[-out_ind,]
clean_dfNo_out
## # A tibble: 99 × 45
##       no title          location  beds baths usable_area land_area floors  price
##    <dbl> <chr>          <chr>    <dbl> <dbl>       <dbl>     <dbl>  <dbl>  <dbl>
##  1     0 5 Bedroom Hou… Bangkok…     5     6        470       570.      2 4   e7
##  2     1 Single House … Bangkok…     3     3        177       440       2 1.09e7
##  3     2 Brand new Lux… Bangkok…     5     7        710       272       4 6.39e7
##  4     4 ขายบ้านเดี่ยว 2 … Bangkok…     4     4        305       370       2 9.8 e6
##  5     5 2 Bedroom Hou… Bangkok…     2     2        114        53       2 4.50e6
##  6     6 Sell House 55… Bangkok…     3     4        460       160       5 1.15e8
##  7     7 Brand new Lux… Bangkok…     4     6        601       224       4 5.49e7
##  8     8 Pattanakarn S… Bangkok…     3     4        520       504       2 4.50e7
##  9     9 4 Bedroom Hou… Bangkok…     4     5        766.       98       3 3.90e7
## 10    11 Unique Detach… Bangkok…     5     6        850         3       2 8.5 e7
## # ℹ 89 more rows
## # ℹ 36 more variables: access_for_the_disabled <dbl>, air_conditioning <dbl>,
## #   alarm <dbl>, balcony <dbl>, built_in_kitchen <dbl>,
## #   built_in_wardrobe <dbl>, car_park <dbl>, cellar <dbl>, children_area <dbl>,
## #   cistern <dbl>, concierge <dbl>, electricity <dbl>, elevator <dbl>,
## #   equipped_kitchen <dbl>, fireplace <dbl>, garden <dbl>, grill <dbl>,
## #   guardhouse <dbl>, gym <dbl>, `` <dbl>, hot_tub <dbl>, internet <dbl>, …
  1. Create Scatter Plot to see relation between price and usable area.
ggplot(data = clean_dfNo_out, mapping = aes(x=usable_area, y=price, col=price)) + 
  geom_point() + 
  labs(title = "usable_area vs price") +
  scale_color_gradient(low="gold",high = "blue")

  1. Create multiple chart using package patchwork.
library(patchwork)
c1 <- ggplot(data = clean_df, mapping = aes(x = price)) + 
  geom_histogram(bins=10, fill = "#F5AD9E") + 
  labs(title = "Distribution of House Price") + 
  theme_minimal()

c2 <- ggplot(data = clean_df, mapping = aes(x = price)) + 
  geom_boxplot() + 
  labs(title = "Outlier Detection of House Price") + 
  theme_minimal()

c3 <- ggplot(data = clean_dfNo_out, mapping = aes(x=usable_area, y=price, col=price)) + 
  geom_point() + 
  labs(title = "usable_area vs price") +
  scale_color_gradient(low="gold",high = "blue")

(c1 + c2)/c3

Remark : These are simple examples to create visualization using ggplot in R programming.

Extra : Do descriptive analysis by correlation measures.

Correlation measures is a measure of the strength and direction of the relationship between two variables.

  • Strength
    • High number -> High relation
    • Low number -> Low relation
  • Direction
    • Positive Value (+) -> The relationship is in the same direction.
    • Zero Value (0) -> There is not relationship.
    • Negative Value (-) -> The relationship is in the opposite direction.

First, we compute correlations for several pairs of variables and display them in the matrix below.

# Select specific variables for the correlation matrix
selected_vars <- c("price", "beds", "baths", "usable_area", "land_area", "floors")

# Create a correlation matrix for the selected variables
cor_matrix_selected <- round(
  cor(clean_dfNo_out[, selected_vars]),
  digits = 2
)

# Print the correlation matrix for selected variables
print(cor_matrix_selected)
##             price beds baths usable_area land_area floors
## price        1.00 0.31  0.59        0.52      0.20   0.43
## beds         0.31 1.00  0.67        0.24      0.28   0.16
## baths        0.59 0.67  1.00        0.54      0.36   0.35
## usable_area  0.52 0.24  0.54        1.00      0.21   0.34
## land_area    0.20 0.28  0.36        0.21      1.00  -0.21
## floors       0.43 0.16  0.35        0.34     -0.21   1.00

Finally, we will modify the correlation matrix format to make it more readable by creating a visualization using the package called “corrplot”.

# improved correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix_selected,
  method = "number",
  type = "upper" # show only upper side
)

Summary

In summary, this project combines knowledge in the areas of data analysis, computer programming, data visualization, and statistics. This can be a valuable skill set for a variety of careers.

Thank you for your interest. I hope this project will be beneficial to those who are interested. If there are any errors, I apologize in advance. - Narupong Jarasbunpaisarn