This data set provides information on property listings in Kuala Lumpur, Malaysia. As an economic, cultural, and political hub of Southeast Asia, the city and its real estate trends have global significance. Insights from this data set are sure to reveal consumer preferences including but not limited to: location, number of bedrooms, and number of bathrooms. Along with much of Southeast Asia, Kuala Lumpur suffers from extreme floods during annual monsoon seasons, especially where both the elevation is low and the slope gradient is shallow. An Informa PLC study conducted by researchers Samy Ismail Elmajdy and Mohammed Mohamed Mostafa concluded that regions in Kuala Lumpur of elevation lower than 41 meters and of slope lower than 7 degrees are particularly susceptible to flooding, as they represent regions where water ultimately flows to and collects during periods of heavy rainfall.
The organization of this data set is simple. It was uploaded to Kaggle.com as a CSV file with 53,884 rows representing each unique listing and 8 columns representing the specific property features recorded. Of the eight columns, I used four: Location, which describes the neighborhood within Kuala Lumpur in which the property is situated; Price, which lists the sales price in Malaysian Ringgit (RM); Rooms, which describes the number of full bedrooms with an optional “+ x” to identify additional rooms not officially recorded as bedrooms; and Bathrooms, which displays the numbers of bathrooms.
Very few details are offered in terms of data collection and sampling other than that the data was scraped from a Kuala Lumpur property listing website. Only the overview page of each listing was scraped, so specific parcel details and photos are absent. Which property listing portal the data originates from is unknown publicly. Furthermore, there is zero indication that this data set encompasses all current property listings in Kuala Lumpur, as such a feat is nearly impossible. Thus, this data set represents a sample drawn most likely out of convenience sampling in which homeowners voluntarily list their properties. In this sense, the sample is atypical as there is no complete or accessible population to draw from.
| Location | Price | Rooms | Bathrooms | Car.Parks | Property.Type | Size | Furnishing |
|---|---|---|---|---|---|---|---|
| KLCC, Kuala Lumpur | RM 1,250,000 | 2+1 | 3 | 2 | Serviced Residence | Built-up : 1,335 sq. ft. | Fully Furnished |
| Damansara Heights, Kuala Lumpur | RM 6,800,000 | 6 | 7 | NA | Bungalow | Land area : 6900 sq. ft. | Partly Furnished |
| Dutamas, Kuala Lumpur | RM 1,030,000 | 3 | 4 | 2 | Condominium (Corner) | Built-up : 1,875 sq. ft. | Partly Furnished |
| Cheras, Kuala Lumpur | NA | NA | |||||
| Bukit Jalil, Kuala Lumpur | RM 900,000 | 4+1 | 3 | 2 | Condominium (Corner) | Built-up : 1,513 sq. ft. | Partly Furnished |
| Taman Tun Dr Ismail, Kuala Lumpur | RM 5,350,000 | 4+2 | 5 | 4 | Bungalow | Land area : 7200 sq. ft. | Partly Furnished |
To produce clean and tidy data, I created new columns containing only the relevant and appropriately formatted data from the original file. First, each row is assigned an index value corresponding to its row number. The “Location.Refined” column omits the “, Kuala Lumpur” following each listing’s neighborhood attribute. The “Price.Refined” column drops the “RM” and commas of the original “Price” column, and converts the string values to numeric values. The “Rooms.Refined” rounds down any listings with “+1” or “+2”, indicating their potential for an additional one or two bedrooms, as this metric is non-standard and vague. Finally, the “Bathrooms” column copies the original bathroom column.
clean <- data.frame(matrix(ncol=0,nrow=53883)) # create new dataframe with appropriate number of rows
clean$Index <- c(1:53883) # number each row
clean$Location.Refined <- gsub(pattern = ', Kuala Lumpur', replace = '' , x = as.vector(raw$Location)) # convert from factor to character vector and delete ", Kuala Lumpur"
clean$Price.Refined <- gsub(pattern = 'RM ' , replace = '' , x = as.vector(raw$Price)) # convert from factor to character vector and delete "RM"
clean$Price.Refined <- as.numeric(gsub(pattern = ',', replace = '' , x = clean$Price.Refined)) # delete "," and convert to numeric vector
chop.room <- function(x){ # function to return the number before the "+"
final <- x
if(grepl("+", x, fixed=T) == T)
final <- substr(x, 1, strfind(x, '+')-1)
return(final)
}
clean$Rooms.Refined <- as.numeric(lapply(as.vector(raw$Rooms), chop.room)) # convert from factor to character vector, apply chop.room function, and convert to numeric vector
clean$Bathrooms <- as.numeric(raw$Bathrooms) # convert from factor to numeric vector
A sample view of the cleaned data:
| Index | Location.Refined | Price.Refined | Rooms.Refined | Bathrooms |
|---|---|---|---|---|
| 1 | KLCC | 1250000 | 2 | 3 |
| 2 | Damansara Heights | 6800000 | 6 | 7 |
| 3 | Dutamas | 1030000 | 3 | 4 |
| 4 | Cheras | NA | NA | NA |
| 5 | Bukit Jalil | 900000 | 4 | 3 |
| 6 | Taman Tun Dr Ismail | 5350000 | 4 | 5 |
To incorporate the qualitative neighborhood data into meaningful analysis, I first plotted the neighborhoods as points on Google Earth. Disclaimer: only 107 of the 112 unique neighborhoods reported in the data set yield a valid result on Google Earth, and the spatial accuracy of these 107 neighborhoods cannot be guaranteed. I then imported the points onto ArcGIS Pro to visually represent them on a basemap of Kuala Lumpur. Next, I uploaded a digital elevation model taken from The National Map, from which I calculated a slope raster. Used jointly with the elevation model, I could determine the area within Kuala Lumpur that meets both criteria of low elevation and shallow slope as specified by Elmajdy and Mostafa. By intersecting the points layer with this flood-prone zone, I identified 12 of the 112 total neighborhoods in the data set that can be classified as flood-prone.
A Leaflet rendering of the map created with ArcGIS Pro:
Location is universally deemed a primary determinant of real estate prices, reflective of a host of underlying geographic characteristics such as school district, commute time, and ammenities, to name a few. Are properties prone to flooding significantly cheaper than properties not prone to flooding? If the results of this analysis verify that flood-risk properties are indeed significantly cheaper than flood-safe properties, then the same insight can likely be generalized to all of Kuala Lumpur and surrounding Malaysia, as customers will have proven an ability to incorporate this risk into property valuation.
clean$Flood.Risk <- rep(NA, 53883) # create null column for flood risk
flood.risk.neighborhoods <- c('cyberjaya', 'Puchong', 'Jalan Klang Lama (Old Klang Road)', 'Salak Selatan', 'Seputeh', 'KL Eco City', 'Brickfields', 'Kota Damansara', 'Jalan Ipoh', 'Segambut', 'Rawang') # create list of neighborhoods designated as at-risk of flooding
check.risk <- function(row){ # function to fill Flood.Risk column with 'Y' or 'N', depending if existing in flood.risk.neighborhoods
result = 'N'
if (is.element(clean[row,2],flood.risk.neighborhoods))
result = 'Y'
return(result)
}
clean$Flood.Risk <- as.factor(as.character(lapply(clean$Index, check.risk))) # apply check.risk function to data set
flood.prone <- as.numeric(clean[which(clean$Flood.Risk == 'Y'),3]) # list of prices corresponding to listings at-risk of flooding
flood.safe <- as.numeric(clean[which(clean$Flood.Risk == 'N'),3]) # list of prices correspond to listings not at-risk of flooding
Listings are given a Y/N flood-risk specification, depending on the neighborhood. At-risk neighborhoods are determined by GIS analysis described above. Prices of each designation are gathered into lists, named “flood.prone” and “flood.safe.”
A sample view of the price vectors:| flood.prone | flood.safe |
|---|---|
| NA | 1250000 |
| NA | 6800000 |
| 1438000 | 1030000 |
| 509000 | NA |
| 635000 | 900000 |
| 655000 | 5350000 |
On average, the price of a home in a flood-prone area is 675000 RM and the price of a home not in a flood-prone area is 1080000 RM. Is this difference statistically significant?
Note: For visualization purposes, 208 extreme outliers (0.386% of the data) have been removed from this plot.
The “flood.prone” vector, of length 5565 and the “flood.safe” vector, of length 48318, are sufficiently large such that the Central Limit Theorem holds. Since both samples are right-skewed, the appropriate test to conduct is the two-sample Wilcoxon rank-sum test for median. Because I’m testing to see whether at-risk properties are cheaper, I will use a one-sided test, however since by default I am unsure of homeowners’ abilities to gauge and incorporate flood risk, I will use a null hypothesis of zero.
results1 <- wilcox.test(flood.prone, flood.safe, mu=0, alternative = "less") # conduct the test
With a p-value of 1.262647910^{-300}, we reject the null hypothesis that the median price is equal across properties regardless of flood-risk. In conclusion, the median price of flood-risk properties is significantly cheaper than the median price of flood-safe properties, which we can generally extend across all of Kuala Lumpur and neighboring Malaysia where customers likely apply the same insight.
Bedrooms and bathrooms are among the most critical features of a given property, and are prioritized differently according to family type, building use, (rental property, permanent abode, etc.) and other factors. Does an additional bedroom or additional bathroom drive a higher price increase? The results of this analysis will shed light on consumer preferences in Kuala Lumpur, allowing a better understanding of how building features are valued. The results could perhaps be generalized across other real estate markets as well.
Note: For visualization purposes, 118 extreme price outliers (0.219% of the data) have been removed from this plot.
format.slope <- function(room.type){ # function to calculate slopes for a given room type and format into a list
room.type.samples <- replicate(1000, sample(clean$Index, size=300, replace=T)) # create 1000 samples of size 300 of index values from clean
room.type.index <- c(1:1000) # create index values (0 to 1000) for each sample
find.slope <- function(x){ # function to calculate slope for given sample
price <- clean["Price.Refined"][room.type.samples[,x],] # create list of prices from given sample
num.room.type <- clean[room.type][room.type.samples[,x],] # create list of # rooms from given sample
result <- lm(price~num.room.type) # perform linear regression on price vs. # rooms of given sample
return(summary(result)$coefficient["num.room.type", "Estimate"]) # return and extract slope of linear regression calculation
}
return(sapply(room.type.index, find.slope)) # use sample index values to apply find.slope function to each sample
}
bedroom.slopes <- format.slope("Rooms.Refined") # 1000 slope values from each sample of 300 listings of price vs. # bedrooms
bathroom.slopes <- format.slope("Bathrooms") # 1000 slope values from each sample of 300 listings of price vs. # bathrooms
A sample view of the slope vectors:
| bathroom.slopes | bedroom.slopes |
|---|---|
| 803953.8 | 733743.6 |
| 1093469.9 | 748617.4 |
| 826100.0 | 586708.1 |
| 938571.8 | 974162.5 |
| 939771.7 | 1141616.1 |
| 960121.7 | 657095.7 |
On average, the addition of one bathroom drives a 949413 RM price increase, while the addition of one bedroom drives a 889110 RM price increase. Is this difference statistically significant?
Repeated for bedrooms and bathrooms, I took 1000 samples of size 300 and regressed price against the number of rooms of each respective type. With 1000 approximately normally distributed slopes generated for both variables, the CLT holds and I can proceed to conduct a two-sample t-test for mean slope. Because I’m testing to see whether the mean slope of bathrooms is significantly greater than that of bedrooms, I will use a one-sided test, however since by default I am unsure which has a greater effect, I will use a null hypothesis of zero.
results2 <- t.test(bedroom.slopes, bathroom.slopes, mu=0, alternative="less") # conduct the test
With a p value of 3.16983610^{-5}, we reject the null hypothesis that the effect of bedrooms and the effect of bathrooms on price is equal. The effect of bedrooms on price is significantly less than the effect of bathrooms on price. Thus, we can generally say that people of Kuala Lumpur tend to value bathrooms more than bedrooms in properties.