Description of the Data Set

Background Information

This data set provides information on property listings in Kuala Lumpur, Malaysia. As an economic, cultural, and political hub of Southeast Asia, the city and its real estate trends have global significance. Insights from this data set are sure to reveal consumer preferences including but not limited to: location, number of bedrooms, and number of bathrooms. Along with much of Southeast Asia, Kuala Lumpur suffers from extreme floods during annual monsoon seasons, especially where both the elevation is low and the slope gradient is shallow. An Informa PLC study conducted by researchers Samy Ismail Elmajdy and Mohammed Mohamed Mostafa concluded that regions in Kuala Lumpur of elevation lower than 41 meters and of slope lower than 7 degrees are particularly susceptible to flooding, as they represent regions where water ultimately flows to and collects during periods of heavy rainfall.

Data Set Explanation

The organization of this data set is simple. It was uploaded to Kaggle.com as a CSV file with 53,884 rows representing each unique listing and 8 columns representing the specific property features recorded. Of the eight columns, I used four: Location, which describes the neighborhood within Kuala Lumpur in which the property is situated; Price, which lists the sales price in Malaysian Ringgit (RM); Rooms, which describes the number of full bedrooms with an optional “+ x” to identify additional rooms not officially recorded as bedrooms; and Bathrooms, which displays the numbers of bathrooms.

Data Collection

Very few details are offered in terms of data collection and sampling other than that the data was scraped from a Kuala Lumpur property listing website. Only the overview page of each listing was scraped, so specific parcel details and photos are absent. Which property listing portal the data originates from is unknown publicly. Furthermore, there is zero indication that this data set encompasses all current property listings in Kuala Lumpur, as such a feat is nearly impossible. Thus, this data set represents a sample drawn most likely out of convenience sampling in which homeowners voluntarily list their properties. In this sense, the sample is atypical as there is no complete or accessible population to draw from.

A sample view of the data set:
Location Price Rooms Bathrooms Car.Parks Property.Type Size Furnishing
KLCC, Kuala Lumpur RM 1,250,000 2+1 3 2 Serviced Residence Built-up : 1,335 sq. ft. Fully Furnished
Damansara Heights, Kuala Lumpur RM 6,800,000 6 7 NA Bungalow Land area : 6900 sq. ft. Partly Furnished
Dutamas, Kuala Lumpur RM 1,030,000 3 4 2 Condominium (Corner) Built-up : 1,875 sq. ft. Partly Furnished
Cheras, Kuala Lumpur NA NA
Bukit Jalil, Kuala Lumpur RM 900,000 4+1 3 2 Condominium (Corner) Built-up : 1,513 sq. ft. Partly Furnished
Taman Tun Dr Ismail, Kuala Lumpur RM 5,350,000 4+2 5 4 Bungalow Land area : 7200 sq. ft. Partly Furnished

Cleaning Original Data

To produce clean and tidy data, I created new columns containing only the relevant and appropriately formatted data from the original file. First, each row is assigned an index value corresponding to its row number. The “Location.Refined” column omits the “, Kuala Lumpur” following each listing’s neighborhood attribute. The “Price.Refined” column drops the “RM” and commas of the original “Price” column, and converts the string values to numeric values. The “Rooms.Refined” rounds down any listings with “+1” or “+2”, indicating their potential for an additional one or two bedrooms, as this metric is non-standard and vague. Finally, the “Bathrooms” column copies the original bathroom column.

clean <- data.frame(matrix(ncol=0,nrow=53883)) # create new dataframe with appropriate number of rows

clean$Index <- c(1:53883) # number each row

clean$Location.Refined <- gsub(pattern = ', Kuala Lumpur', replace = '' , x = as.vector(raw$Location)) # convert from factor to character vector and delete ", Kuala Lumpur"

clean$Price.Refined <- gsub(pattern = 'RM ' , replace = '' , x = as.vector(raw$Price)) # convert from factor to character vector and delete "RM"
clean$Price.Refined <- as.numeric(gsub(pattern = ',', replace = '' , x = clean$Price.Refined)) # delete "," and convert to numeric vector

chop.room <- function(x){ # function to return the number before the "+" 
  final <- x
  if(grepl("+", x, fixed=T) == T)
    final <- substr(x, 1, strfind(x, '+')-1)
  return(final)
  }
clean$Rooms.Refined <- as.numeric(lapply(as.vector(raw$Rooms), chop.room)) # convert from factor to character vector, apply chop.room function, and convert to numeric vector

clean$Bathrooms <- as.numeric(raw$Bathrooms) # convert from factor to numeric vector
A sample view of the cleaned data:
Index Location.Refined Price.Refined Rooms.Refined Bathrooms
1 KLCC 1250000 2 3
2 Damansara Heights 6800000 6 7
3 Dutamas 1030000 3 4
4 Cheras NA NA NA
5 Bukit Jalil 900000 4 3
6 Taman Tun Dr Ismail 5350000 4 5

GIS Analysis

To incorporate the qualitative neighborhood data into meaningful analysis, I first plotted the neighborhoods as points on Google Earth. Disclaimer: only 107 of the 112 unique neighborhoods reported in the data set yield a valid result on Google Earth, and the spatial accuracy of these 107 neighborhoods cannot be guaranteed. I then imported the points onto ArcGIS Pro to visually represent them on a basemap of Kuala Lumpur. Next, I uploaded a digital elevation model taken from The National Map, from which I calculated a slope raster. Used jointly with the elevation model, I could determine the area within Kuala Lumpur that meets both criteria of low elevation and shallow slope as specified by Elmajdy and Mostafa. By intersecting the points layer with this flood-prone zone, I identified 12 of the 112 total neighborhoods in the data set that can be classified as flood-prone.

A Leaflet rendering of the map created with ArcGIS Pro:

Research Questions

  1. Are properties prone to flooding significantly cheaper than properties not prone to flooding?
  2. Does an additional bedroom or additional bathroom drive a higher price increase?

Explore Question #1

Location is universally deemed a primary determinant of real estate prices, reflective of a host of underlying geographic characteristics such as school district, commute time, and ammenities, to name a few. Are properties prone to flooding significantly cheaper than properties not prone to flooding? If the results of this analysis verify that flood-risk properties are indeed significantly cheaper than flood-safe properties, then the same insight can likely be generalized to all of Kuala Lumpur and surrounding Malaysia, as customers will have proven an ability to incorporate this risk into property valuation.

Find Prices of Properties Prone to Flooding and of Properties not Prone to Flooding

clean$Flood.Risk <- rep(NA, 53883) # create null column for flood risk
flood.risk.neighborhoods <- c('cyberjaya', 'Puchong', 'Jalan Klang Lama (Old Klang Road)', 'Salak Selatan', 'Seputeh', 'KL Eco City', 'Brickfields', 'Kota Damansara', 'Jalan Ipoh', 'Segambut', 'Rawang') # create list of neighborhoods designated as at-risk of flooding

check.risk <- function(row){ # function to fill Flood.Risk column with 'Y' or 'N', depending if existing in flood.risk.neighborhoods
  result = 'N'
  if (is.element(clean[row,2],flood.risk.neighborhoods))
    result = 'Y'
  return(result)
  }
clean$Flood.Risk <- as.factor(as.character(lapply(clean$Index, check.risk))) # apply check.risk function to data set

flood.prone <- as.numeric(clean[which(clean$Flood.Risk == 'Y'),3]) # list of prices corresponding to listings at-risk of flooding
flood.safe <- as.numeric(clean[which(clean$Flood.Risk == 'N'),3]) # list of prices correspond to listings not at-risk of flooding

Listings are given a Y/N flood-risk specification, depending on the neighborhood. At-risk neighborhoods are determined by GIS analysis described above. Prices of each designation are gathered into lists, named “flood.prone” and “flood.safe.”

A sample view of the price vectors:
flood.prone flood.safe
NA 1250000
NA 6800000
1438000 1030000
509000 NA
635000 900000
655000 5350000

On average, the price of a home in a flood-prone area is 675000 RM and the price of a home not in a flood-prone area is 1080000 RM. Is this difference statistically significant?

Note: For visualization purposes, 208 extreme outliers (0.386% of the data) have been removed from this plot.

The “flood.prone” vector, of length 5565 and the “flood.safe” vector, of length 48318, are sufficiently large such that the Central Limit Theorem holds. Since both samples are right-skewed, the appropriate test to conduct is the two-sample Wilcoxon rank-sum test for median. Because I’m testing to see whether at-risk properties are cheaper, I will use a one-sided test, however since by default I am unsure of homeowners’ abilities to gauge and incorporate flood risk, I will use a null hypothesis of zero.

Perform the Wilcoxon rank-sum test

results1 <- wilcox.test(flood.prone, flood.safe, mu=0, alternative = "less") # conduct the test

With a p-value of 1.262647910^{-300}, we reject the null hypothesis that the median price is equal across properties regardless of flood-risk. In conclusion, the median price of flood-risk properties is significantly cheaper than the median price of flood-safe properties, which we can generally extend across all of Kuala Lumpur and neighboring Malaysia where customers likely apply the same insight.

Potential Explanations

  • Flood destruction lowers property value in flood-prone areas
  • Flood risk is reported and affects consumer preferences

Explore Question #2

Bedrooms and bathrooms are among the most critical features of a given property, and are prioritized differently according to family type, building use, (rental property, permanent abode, etc.) and other factors. Does an additional bedroom or additional bathroom drive a higher price increase? The results of this analysis will shed light on consumer preferences in Kuala Lumpur, allowing a better understanding of how building features are valued. The results could perhaps be generalized across other real estate markets as well.

Note: For visualization purposes, 118 extreme price outliers (0.219% of the data) have been removed from this plot.

For Samples Drawn, Find Slopes of Bedrooms and Bathrooms Against Price

format.slope <- function(room.type){ # function to calculate slopes for a given room type and format into a list
  room.type.samples <- replicate(1000, sample(clean$Index, size=300, replace=T)) # create 1000 samples of size 300 of index values from clean
  room.type.index <- c(1:1000) # create index values (0 to 1000) for each sample
  find.slope <- function(x){ # function to calculate slope for given sample
   price <- clean["Price.Refined"][room.type.samples[,x],] # create list of prices from given sample
   num.room.type <- clean[room.type][room.type.samples[,x],] # create list of # rooms from given sample
   result <- lm(price~num.room.type) # perform linear regression on price vs. # rooms of given sample
   return(summary(result)$coefficient["num.room.type", "Estimate"]) # return and extract slope of linear regression calculation
  }
  return(sapply(room.type.index, find.slope)) # use sample index values to apply find.slope function to each sample
}

bedroom.slopes <- format.slope("Rooms.Refined") # 1000 slope values from each sample of 300 listings of price vs. # bedrooms 
bathroom.slopes <- format.slope("Bathrooms") # 1000 slope values from each sample of 300 listings of price vs. # bathrooms
A sample view of the slope vectors:
bathroom.slopes bedroom.slopes
803953.8 733743.6
1093469.9 748617.4
826100.0 586708.1
938571.8 974162.5
939771.7 1141616.1
960121.7 657095.7

On average, the addition of one bathroom drives a 949413 RM price increase, while the addition of one bedroom drives a 889110 RM price increase. Is this difference statistically significant?

Repeated for bedrooms and bathrooms, I took 1000 samples of size 300 and regressed price against the number of rooms of each respective type. With 1000 approximately normally distributed slopes generated for both variables, the CLT holds and I can proceed to conduct a two-sample t-test for mean slope. Because I’m testing to see whether the mean slope of bathrooms is significantly greater than that of bedrooms, I will use a one-sided test, however since by default I am unsure which has a greater effect, I will use a null hypothesis of zero.

Perform the two-sample t-test

results2 <- t.test(bedroom.slopes, bathroom.slopes, mu=0, alternative="less") # conduct the test

With a p value of 3.16983610^{-5}, we reject the null hypothesis that the effect of bedrooms and the effect of bathrooms on price is equal. The effect of bedrooms on price is significantly less than the effect of bathrooms on price. Thus, we can generally say that people of Kuala Lumpur tend to value bathrooms more than bedrooms in properties.

Potential Explanations

  • Sanitation is valued more than privacy
  • Individuals typically only require one bedroom, but might desire multiple bathrooms located across their home