Madrid is the capital and most populous city of Spain. The city has almost 3.4 million inhabitants and a metropolitan area population of approximately 6.7 million. It is the second-largest city in the European Union (EU), and its monocentric metropolitan area is the second-largest in the EU. The municipality covers 604.3 km2 geographical area.
Madrid is a center of attraction, the largest urban agglomeration on the Iberian Peninsula and a decisive engine of the Spanish economy with a lot of social dynamism. The city is not only the traditional political center of Spain, with parks and boulevards, but also a true metropolis in terms of culture, entertainment, sports, business and gastronomy. For example, it is home to two of the best soccer teams in the world. This makes Madrid attractive not only to institutional investors, but also to people looking for both the profitability of an investment and to enjoy the city’s impressive cultural, gastronomic and leisure offer.
In this LBB project of Programming for Data Science with R, we would like to help potential buyers and sellers of real estate in Madrid to objectively assess the current market situation and find the best recommendation for properties according to their needs. This analyses are based on data collected from kaggle.
Dataset used in this project is sourced from kaggle. Data was collected through crawling popular real estate portals in Madrid.
Originally, this dataset has 58 columns but we will perform data wrangling to clean and prepare our dataset, using only the relevant columns for further analysis.
| variable_names | description |
|---|---|
| id | Unique ID |
| subtitle | Neighborhood & City |
| sq_mt_built | Square meter built |
| n_rooms | Number of Room |
| n_bathrooms | Number of Bathroom |
| floor | House Level |
| buy_price | Buy Price (in Euro) |
| buy_price_by_area | Buy Price per sqm (in Euro) |
| house_type_id | Type of the house |
| built_year | Built year |
| energy_certificate | Energy certificate type |
| has_parking | Access to Parking |
| neighborhood_id | Information of Neighborhood and District ID (1 to 21) |
First we read the .csv file and use only the relevant columns, which
then saved in a new data frame called house1
#install package dplyr to manipulate character data
library(dplyr)#turn all missing values to NA
house1 <- house_1a %>% mutate_all(na_if,"")To perform string manipulation, we have to first install package
called stringr, which allow us to extract information from
column subtitle and create new columns called
neighborhood and city. We assign new
values for each House Type ID with shorter name (Type1, Type2, etc) and
classified Numbered Floor level (1,2,3,etc) to one new category called
‘Upperfloor’, for simplification purpose. Another thing we do is to
extract the district id information from neighborhood_id
using strsplit() which will separates series of strings
based on certain pattern/separator.
library(stringr)
#extract neighborhood and city information from subtitle column
sub <- house1$subtitle %>% str_split(", ", n = 2, simplify = TRUE)
house1$neighborhood <- sub[,1]
house1$city <- sub[,2]
house2a <- subset(house1, select = -c(subtitle))To understand the structure of our data frame, we use the
str() function which result in data type information for
each column, number of rows and columns.
## 'data.frame': 21742 obs. of 14 variables:
## $ id : int 21742 21741 21740 21739 21738 21737 21736 21735 21734 21733 ...
## $ sq_mt_built : num 64 70 94 64 108 126 120 125 84 85 ...
## $ n_rooms : int 2 3 2 2 2 4 5 3 3 2 ...
## $ n_bathrooms : num 1 1 2 1 2 2 2 2 2 1 ...
## $ floor : chr "Upperfloor" "Upperfloor" "Upperfloor" "Bajo" ...
## $ buy_price : int 85000 129900 144247 109900 260000 195000 190000 198500 212000 131400 ...
## $ buy_price_by_area : int 1328 1856 1535 1717 2407 1548 1583 1588 2524 1546 ...
## $ house_type_id : chr "Type1" "Type1" "Type1" "Type1" ...
## $ built_year : num 1960 NA NA 1955 2003 ...
## $ energy_certificate: chr "D" "en trámite" "no indicado" "en trámite" ...
## $ has_parking : chr "False" "False" "False" "False" ...
## $ neighborhood : chr "San Cristóbal" "Los Ángeles" "San Andrés" "San Andrés" ...
## $ city : chr "Madrid" "Madrid" "Madrid" "Madrid" ...
## $ district_id : chr "21: Villaverde" "21: Villaverde" "21: Villaverde" "21: Villaverde" ...
Now that we have the correct initial dataframe, we can proceed to the next step in preparing our data, which is missing values and duplicates treatment.
First, let’s find duplicate values (if any)!
Next, find missing value, which previously marked with NA, and decide which treatment to be employed for each variable.
#find which column(s) contain missing values
sum_na <- house2 %>% is.na() %>% colSums()Now we know that built_year column contains 11742
missing value out of total 21742 rows data in house2, or
equal to 54% data. While floor contain 12%
missing values. This may lead to the next question, whether or not we
should continue using built_year as our variable. As for
missing values in the other 3 variables,
sq_mt_built,n_bathrooms &
house_type_id, we can either totally delete the rows, or
assigning certain values to the missing values (i.e mean, mode, or zero
value/0).
In our case, we decide to not using variable built_year
in our analysis since it contains more than 50% of missing values, and
we can drop the rows with missing values in columns
sq_mt_built,n_bathrooms&
house_type_id since they are insignificant in number (less
than 5%) and might hinder further statistical analysis process. To
remove rows with NA values, we can apply drop.na() method
from tidyr package.
Column floor will be transformed into
factor datatype, hence we will keep the missing values
and assign new value called undefined.
#assign 'undefined' values to any missing values in`floor` & `house_type_id` columns
house3 <- house3 %>%
mutate(floor = coalesce(floor, 'undefined'))Continue by assigning correct data type to each column by using
converting function like as.character(),
as.factor(), as.integer(),
as.numeric(), etc. Conversion into correct data type
contribute to memory saving and enable data manipulation using specific
function designed for each datatype.
#re-assign correct data type to column, if needed
house_4a$id <- as.character(house_4a$id)
house_4a <- house_4a %>% mutate_at(c('sq_mt_built','n_bathrooms'), as.integer)
house_4a <- house_4a %>% mutate_at(c('floor','house_type_id','energy_certificate', 'has_parking','neighborhood', 'city', 'district_id'), as.factor)Check our data structure once again, make sure each column/variable is in the right format.
## 'data.frame': 21211 obs. of 13 variables:
## $ id : chr "21742" "21741" "21740" "21739" ...
## $ sq_mt_built : int 64 70 94 64 108 126 120 125 84 85 ...
## $ n_rooms : int 2 3 2 2 2 4 5 3 3 2 ...
## $ n_bathrooms : int 1 1 2 1 2 2 2 2 2 1 ...
## $ floor : Factor w/ 12 levels "Bajo","Entreplanta",..: 12 12 12 1 12 12 12 12 11 12 ...
## $ buy_price : int 85000 129900 144247 109900 260000 195000 190000 198500 212000 131400 ...
## $ buy_price_by_area : int 1328 1856 1535 1717 2407 1548 1583 1588 2524 1546 ...
## $ house_type_id : Factor w/ 4 levels "Type1","Type2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ energy_certificate: Factor w/ 10 levels "A","B","C","D",..: 4 6 10 6 6 6 7 10 7 6 ...
## $ has_parking : Factor w/ 2 levels "False","True": 1 1 1 1 2 2 2 1 2 1 ...
## $ neighborhood : Factor w/ 146 levels "12 de Octubre-Orcasur",..: 118 78 117 117 80 117 117 142 142 80 ...
## $ city : Factor w/ 1 level "Madrid": 1 1 1 1 1 1 1 1 1 1 ...
## $ district_id : Factor w/ 20 levels "1: Arganzuela",..: 13 13 13 13 13 13 13 13 13 13 ...
Lastly, we save our clean dataframe in an object called
house_cln.
You can find the complete explanation on Spanish terms used in our
house_cln dataframe as below :
| level | level_desc |
|---|---|
| Bajo | ground level |
| Entreplanta/Entreplanta exterior/Entreplanta interior | Mezzanine level |
| Sótano/Sótano exterior/Sótano interior | Basement level |
| Semi-sótano/Semi-sótano exterior/Semi-sótano interior | Semi-Basement level |
| Upperfloor | N-th floor (1, 2, 3, etc) |
| undefined | undefined |
While you would think an apartment bathed in natural sunlight would be relatively easy to come by in sunny Spain, it’s actually a lot harder than you’d think, at least in densely populated cities. Many traditional apartments in Spain are composed of labyrinthine corridors leading to many small windowless rooms covered wall to wall with dark wooden furniture. Heavy shutters block out any remaining sun that dares to infiltrate. In fact, the buildings seem purposely designed to shut out sunlight. This does keep out the overwhelming summer heat, but in return creates a dark and depressing living space.
So if natural light is your priority, pay attention to the following two words: Exterior and Interior. If you want to see the sunshine peak through into your apartment for at least some of the day, you need to look for a piso exterior. That means your flat will look onto the street, giving you natural light, and maybe even have a nice view. If you go for a piso interior, your apartment will face away from the street and probably onto the inside of the apartment block.
Another variable worth to mention is House Type. Of all sizes, with one or more floors, with or without a garden and in different environments. Below you can consult a list with the common types of house in our dataframe.
| house_type | house_type_desc |
|---|---|
| Type 1: Pisos | A flat is a single-storey dwelling in which all the necessary facilities are distributed. They can have several rooms, a kitchen, living room and one or more bathrooms, but with no patio or outdoor garden |
| Type 2: Casa o chalet | A chalet is an independent house with several floors, a garden and, in some cases, a swimming pool |
| Type 4: Duplex | This term refers to a house with two floors, connected to each other by a staircase. |
| Type 5: Aticos | The apartments that are located on the top floor of a building |
The last variable we need to discuss is the Energy Certificate, or the energy performance certificate, a report that describes how efficient a home is in terms of energy consumption. It assigns an energy rating to each home on a scale which ranges from “A” (the most efficient) to “G” (the least efficient). Like other European countries, since June 2013, anyone selling or renting a property in Spain, short or long term, needs an energy performance certificate. The introduction of this innovation is due to the requirements of the EU.
Apart from A to G energy rating, our energy_certificate
column might contains other values like en trámite (in
process), no indicado (not indicated) or inmueble
exento (exempt property).
Start our exploration using summary() to extract the
basic statistical information of each column in our
house_cln dataframe.
summary(house_cln)## id sq_mt_built n_rooms n_bathrooms
## Length:21211 Min. : 16.0 Min. : 0.000 Min. : 1.000
## Class :character 1st Qu.: 71.0 1st Qu.: 2.000 1st Qu.: 1.000
## Mode :character Median :101.0 Median : 3.000 Median : 2.000
## Mean :148.5 Mean : 3.035 Mean : 2.076
## 3rd Qu.:165.0 3rd Qu.: 4.000 3rd Qu.: 2.000
## Max. :999.0 Max. :24.000 Max. :14.000
##
## floor buy_price buy_price_by_area house_type_id
## Upperfloor :16364 Min. : 36000 Min. : 447 Type1:17694
## undefined : 2467 1st Qu.: 199000 1st Qu.: 2550 Type2: 1811
## Bajo : 2001 Median : 378000 Median : 3721 Type4: 675
## Entreplanta exterior: 230 Mean : 643793 Mean : 4022 Type5: 1031
## Semi-sótano exterior: 52 3rd Qu.: 761325 3rd Qu.: 5000
## Entreplanta interior: 32 Max. :8800000 Max. :18889
## (Other) : 65
## energy_certificate has_parking neighborhood city
## en trámite :10687 False:13694 Chamartín : 840 Madrid:21211
## no indicado: 3589 True : 7517 Moncloa : 693
## E : 2649 Chamberí : 577
## D : 1087 Centro : 542
## G : 870 Hortaleza : 482
## F : 652 Barrio de Salamanca: 457
## (Other) : 1677 (Other) :17620
## district_id
## 5: Chamartín : 1806
## 4: Centro : 1731
## 6: Chamberí : 1708
## 11: Moncloa : 1546
## 3: Carabanchel: 1486
## 9: Hortaleza : 1447
## (Other) :11487
Now we are confident that every column has the information needed for further analysis.
sq_mt_built column)As we can see, the distribution is right-skewed and heavily distributed in the range of 0-200 sqm. For this reason, it is better to use median (Q2) value as central measuremenmt instead of mean, for it’s robust characteristic against outliers.
n_rooms &
n_bathroomscolumn)Here, both variables have right-skewed distribution and heavily distributed in the low range region. For this reason, again, the central measurement use median (Q2) value instead of mean, for it’s robust characteristic against outliers.
Both variables, again, are right-skewed and heavily distributed in the low range region. For this reason, again, the central measurement use median (Q2) value instead of mean, for it’s robust characteristic against outliers. Compare to the previous plot for number of rooms and bathrooms, here we can see the number of outliers are more prevalent. But inside the boxplot for buying price per area, data distribution is nearly symmetrical.
1.Define Top 10 districts with highest number of listed properties?
2. Which are the top 10 districts with highest average (median) of buying price per area?
3.Correlation of other variables to buying price per area, which variable have strongest correlation with it?
Ans : as we knew the variable
buy_price_by_area is a numerical data while the other
variables are numerical and categorical. To find the correlation between
2 numerical values, we can use the cor() function, hence we
will separate the house_cln data only for the numerical one
and save it into new object calles house_cl_num then
continue by finding the correlation with our target variable
buy_price_by_area.
## sq_mt_built n_rooms n_bathrooms buy_price
## sq_mt_built 1.0000000 0.72440538 0.8518865 0.8314373
## n_rooms 0.7244054 1.00000000 0.7164749 0.5925880
## n_bathrooms 0.8518865 0.71647490 1.0000000 0.7578218
## buy_price 0.8314373 0.59258802 0.7578218 1.0000000
## buy_price_by_area 0.1810571 0.05204698 0.2762335 0.5664525
## buy_price_by_area
## sq_mt_built 0.18105709
## n_rooms 0.05204698
## n_bathrooms 0.27623346
## buy_price 0.56645254
## buy_price_by_area 1.00000000
A positive correlation between the number of rooms and bathrooms in a house with their price per area, but the correlation was too weak. It means that bigger house with more rooms is not necessarily always come with higher price.
Now we can move to our categorical variables, and check their
correlation with a numerical variable, buy_price_by_area,
using Box plot or Kruskal-Wallis chi-squared.
Boxplot Comparison
We find significant variation between median of each
floor category to buy price per sqm, a lot of overlap
between the boxes especially between numbered level 1 to 9 hence we
merged those levels into one called Upperfloor for
simplification purpose, but once you see the box plots of other floor
levels, you will find various median values with less overlapping boxes.
The floor level and buy price per sqm is definitely correlated but not
in a weak way since some of the groups show us significant statistical
differences.
Here, we can see slighty significant differences between each
category of house_type_id to buy price per sqm, if you
compare the median value of each category, may say that there are
correlation but insignificant one (low correlation).
We can see a weak significant differences between each of the energy group especially if we cluster these energy group into A to D, E & en tramite, F&inmueble exento&no indicato, and energy group G. Same as previous comparison, we see lot of overlap between boxes, hence it is correlated but low ones.
Alternatively, we can use Kruskal-Wallis test with
kruskal.test() function.
#compare neighborhood id with buying price per sqm
kruskal.test(buy_price_by_area ~ house_type_id,
data = house_cln)##
## Kruskal-Wallis rank sum test
##
## data: buy_price_by_area by house_type_id
## Kruskal-Wallis chi-squared = 333.21, df = 3, p-value < 2.2e-16
The null hypothesis (H0): The median is equal across all groups. The alternative hypothesis: (H1): The median is not equal across all groups.
The p-value is extremely low, far less than the significant level of 0.05 hence it falls in rejection region. It can be concluded then, that there are significant difference between the treatment group, or the median is not equal across all group, which means the different neighborhoods leads to statistically significant differences in buying price per sqm.
As for the dichotomous variable (iónly two values, TRUE/FALSE) like
the one in has_parking column, we can use the
point-biserial correlation test with ltm()
package.
## [1] -0.1100973
It results a low correlation number between 2 variables.
From the boxplot above, we can see significant difference between groups and not that much overlapping, hence these 2 variables are strongly correlated.
4. If a customer, with single/unmarried status & in their early employement year as an junior engineer at AECOM Madrid, ask for your recommendation as a real-estate consultant to buy his 1st property, what are your suggestions and why?
Ans : to give the best recommendation, first we have to understand the profile of our customers and their requirements.
Let’s consider the priority feature for a single/unmarried employee in their first year employment,
Then, we consider their financial situation. In average, a junior engineer in AECOM Madrid made 30K Euro annually, source : Glassdoor, which equal to Nett. amount of ~1900 Euro per month. Based on this information, we setup the upper limit for mortgage to 5 times his annual salary, 150K Euro, then apply filter based on the desired house type.
Now that we have the list of properties which satisfy 2 of our main constrains, we can then group those properties based on the nice to have features, first the location,
This customer works at AECOM, Madrid. The office is located in C. de Alfonso XII, 62, Madrid ( Lat 40.40911870940878, Long -3.689179361369044). Based on this information, we can then create an interactive map to see the nearby district and match with our list of Top 10 districts with the lowest median value of buying price per sqm.
From the map above, we can classify the 1st-ring district from AECOM office based on the distance that are : Salamanca, Retiro, and Arganzuela, The 2nd ring district will be : Usera, Puente de Vallecas, Moratalaz, Chamberi and Latina. These districts are the most accessible from his office, and from our list above, we can see that district Usera, Puente de Vallecas, Latina and Moratalaz are among the Top 10 district with lowest median buying price per sqm.
So the recommendation is to look at the listed properties in these 4 districts, with their reasonable price per sqm and distance to the office and city center. Puente de Vallecas has the biggest number of 648 available properties with the lowest median value of buying price per sqm (< 2000 Euro per sqm).
5. If a family of 4 looking for a house for them to settle down and stay for a long term, both parents work for quite some years already, what would be your recommendation as a real estate consultant?
Unlike the previous client who is single and in his early employment year, the family with 4 has certain characteristic and requirements for their long-term family home.
SImilarly to our previous client, first we apply filter to our dataset based on the desired budget (300K Euro), then filter based on the number of rooms available (>= 4), and eliminate house type 5 (Attico).
The following listings are our first recommendation which satisfy both their priority features and the nice-to-have ones which are parking lot and house type number 2 and 4.
The properties which satisfy their priority features + parking
availability assigned to object called park_budg2. Those
properties then aggregated to find which district have
median of buying price per sqm <= to 2000 Euro & have more than
10 number of listed properties.
The recommendation will be the first 3 properties which satisfies both priority and nice-to-have features. Otherwise, we have other list of 80 properties distributed in 11 districts which satisfy the priority feature. Among them, properties in districts Villaverde and Puente de Vallecas has the most options of available properties as well as among the lowest buying price per sqm (< 2000 Euro per sqm).
6. Could you define the characteristic of certain area or neighborhoods based on the property profile listed in your dataset?
Ans : No, our dataset has it’s limitation to accurately describe a community. A community described by combination of various factors such as public transport, school availability, available medical infrastructure, available family & leisure activities, available commercial sites, criminal rate, demographic, etc. But our dataset could classifies community based on their customer profile as demonstrated in exploration no #7 and #8.