1 Real Estate Market in Madrid

1.1 Introduction

Madrid is the capital and most populous city of Spain. The city has almost 3.4 million inhabitants and a metropolitan area population of approximately 6.7 million. It is the second-largest city in the European Union (EU), and its monocentric metropolitan area is the second-largest in the EU. The municipality covers 604.3 km2 geographical area.

Madrid is a center of attraction, the largest urban agglomeration on the Iberian Peninsula and a decisive engine of the Spanish economy with a lot of social dynamism. The city is not only the traditional political center of Spain, with parks and boulevards, but also a true metropolis in terms of culture, entertainment, sports, business and gastronomy. For example, it is home to two of the best soccer teams in the world. This makes Madrid attractive not only to institutional investors, but also to people looking for both the profitability of an investment and to enjoy the city’s impressive cultural, gastronomic and leisure offer.

In this LBB project of Programming for Data Science with R, we would like to help potential buyers and sellers of real estate in Madrid to objectively assess the current market situation and find the best recommendation for properties according to their needs. This analyses are based on data collected from kaggle.

1.2 Data

Dataset used in this project is sourced from kaggle. Data was collected through crawling popular real estate portals in Madrid.

Originally, this dataset has 58 columns but we will perform data wrangling to clean and prepare our dataset, using only the relevant columns for further analysis.

List of Variables
variable_names description
id Unique ID
subtitle Neighborhood & City
sq_mt_built Square meter built
n_rooms Number of Room
n_bathrooms Number of Bathroom
floor House Level
buy_price Buy Price (in Euro)
buy_price_by_area Buy Price per sqm (in Euro)
house_type_id Type of the house
built_year Built year
energy_certificate Energy certificate type
has_parking Access to Parking
neighborhood_id Information of Neighborhood and District ID (1 to 21)

2 Data Preparation

2.1 Creating Initial Dataframe

First we read the .csv file and use only the relevant columns, which then saved in a new data frame called house1

#install package dplyr to manipulate character data
library(dplyr)
#turn all missing values to NA
house1 <- house_1a %>% mutate_all(na_if,"")

To perform string manipulation, we have to first install package called stringr, which allow us to extract information from column subtitle and create new columns called neighborhood and city. We assign new values for each House Type ID with shorter name (Type1, Type2, etc) and classified Numbered Floor level (1,2,3,etc) to one new category called ‘Upperfloor’, for simplification purpose. Another thing we do is to extract the district id information from neighborhood_id using strsplit() which will separates series of strings based on certain pattern/separator.

library(stringr)

#extract neighborhood and city information from subtitle column
sub <- house1$subtitle %>% str_split(", ", n = 2, simplify = TRUE)
house1$neighborhood <- sub[,1]
house1$city <- sub[,2]

house2a <- subset(house1, select = -c(subtitle))

To understand the structure of our data frame, we use the str() function which result in data type information for each column, number of rows and columns.

## 'data.frame':    21742 obs. of  14 variables:
##  $ id                : int  21742 21741 21740 21739 21738 21737 21736 21735 21734 21733 ...
##  $ sq_mt_built       : num  64 70 94 64 108 126 120 125 84 85 ...
##  $ n_rooms           : int  2 3 2 2 2 4 5 3 3 2 ...
##  $ n_bathrooms       : num  1 1 2 1 2 2 2 2 2 1 ...
##  $ floor             : chr  "Upperfloor" "Upperfloor" "Upperfloor" "Bajo" ...
##  $ buy_price         : int  85000 129900 144247 109900 260000 195000 190000 198500 212000 131400 ...
##  $ buy_price_by_area : int  1328 1856 1535 1717 2407 1548 1583 1588 2524 1546 ...
##  $ house_type_id     : chr  "Type1" "Type1" "Type1" "Type1" ...
##  $ built_year        : num  1960 NA NA 1955 2003 ...
##  $ energy_certificate: chr  "D" "en trámite" "no indicado" "en trámite" ...
##  $ has_parking       : chr  "False" "False" "False" "False" ...
##  $ neighborhood      : chr  "San Cristóbal" "Los Ángeles" "San Andrés" "San Andrés" ...
##  $ city              : chr  "Madrid" "Madrid" "Madrid" "Madrid" ...
##  $ district_id       : chr  "21: Villaverde" "21: Villaverde" "21: Villaverde" "21: Villaverde" ...

2.2 Missing Value and Duplicates

Now that we have the correct initial dataframe, we can proceed to the next step in preparing our data, which is missing values and duplicates treatment.

First, let’s find duplicate values (if any)!

Next, find missing value, which previously marked with NA, and decide which treatment to be employed for each variable.

#find which column(s) contain missing values

sum_na <- house2 %>% is.na() %>% colSums()

Now we know that built_year column contains 11742 missing value out of total 21742 rows data in house2, or equal to 54% data. While floor contain 12% missing values. This may lead to the next question, whether or not we should continue using built_year as our variable. As for missing values in the other 3 variables, sq_mt_built,n_bathrooms & house_type_id, we can either totally delete the rows, or assigning certain values to the missing values (i.e mean, mode, or zero value/0).

In our case, we decide to not using variable built_year in our analysis since it contains more than 50% of missing values, and we can drop the rows with missing values in columns sq_mt_built,n_bathrooms& house_type_id since they are insignificant in number (less than 5%) and might hinder further statistical analysis process. To remove rows with NA values, we can apply drop.na() method from tidyr package.

Column floor will be transformed into factor datatype, hence we will keep the missing values and assign new value called undefined.

#assign 'undefined' values to any missing values in`floor` & `house_type_id` columns

house3 <- house3 %>% 
          mutate(floor = coalesce(floor, 'undefined'))

Continue by assigning correct data type to each column by using converting function like as.character(), as.factor(), as.integer(), as.numeric(), etc. Conversion into correct data type contribute to memory saving and enable data manipulation using specific function designed for each datatype.

#re-assign correct data type to column, if needed

house_4a$id <- as.character(house_4a$id)

house_4a <- house_4a %>% mutate_at(c('sq_mt_built','n_bathrooms'), as.integer)
house_4a <- house_4a %>% mutate_at(c('floor','house_type_id','energy_certificate', 'has_parking','neighborhood', 'city', 'district_id'), as.factor)

Check our data structure once again, make sure each column/variable is in the right format.

## 'data.frame':    21211 obs. of  13 variables:
##  $ id                : chr  "21742" "21741" "21740" "21739" ...
##  $ sq_mt_built       : int  64 70 94 64 108 126 120 125 84 85 ...
##  $ n_rooms           : int  2 3 2 2 2 4 5 3 3 2 ...
##  $ n_bathrooms       : int  1 1 2 1 2 2 2 2 2 1 ...
##  $ floor             : Factor w/ 12 levels "Bajo","Entreplanta",..: 12 12 12 1 12 12 12 12 11 12 ...
##  $ buy_price         : int  85000 129900 144247 109900 260000 195000 190000 198500 212000 131400 ...
##  $ buy_price_by_area : int  1328 1856 1535 1717 2407 1548 1583 1588 2524 1546 ...
##  $ house_type_id     : Factor w/ 4 levels "Type1","Type2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ energy_certificate: Factor w/ 10 levels "A","B","C","D",..: 4 6 10 6 6 6 7 10 7 6 ...
##  $ has_parking       : Factor w/ 2 levels "False","True": 1 1 1 1 2 2 2 1 2 1 ...
##  $ neighborhood      : Factor w/ 146 levels "12 de Octubre-Orcasur",..: 118 78 117 117 80 117 117 142 142 80 ...
##  $ city              : Factor w/ 1 level "Madrid": 1 1 1 1 1 1 1 1 1 1 ...
##  $ district_id       : Factor w/ 20 levels "1: Arganzuela",..: 13 13 13 13 13 13 13 13 13 13 ...

Lastly, we save our clean dataframe in an object called house_cln.

3 Data Explanation

You can find the complete explanation on Spanish terms used in our house_cln dataframe as below :

Floor Level Explanation
level level_desc
Bajo ground level
Entreplanta/Entreplanta exterior/Entreplanta interior Mezzanine level
Sótano/Sótano exterior/Sótano interior Basement level
Semi-sótano/Semi-sótano exterior/Semi-sótano interior Semi-Basement level
Upperfloor N-th floor (1, 2, 3, etc)
undefined undefined

While you would think an apartment bathed in natural sunlight would be relatively easy to come by in sunny Spain, it’s actually a lot harder than you’d think, at least in densely populated cities. Many traditional apartments in Spain are composed of labyrinthine corridors leading to many small windowless rooms covered wall to wall with dark wooden furniture. Heavy shutters block out any remaining sun that dares to infiltrate. In fact, the buildings seem purposely designed to shut out sunlight. This does keep out the overwhelming summer heat, but in return creates a dark and depressing living space.

So if natural light is your priority, pay attention to the following two words: Exterior and Interior. If you want to see the sunshine peak through into your apartment for at least some of the day, you need to look for a piso exterior. That means your flat will look onto the street, giving you natural light, and maybe even have a nice view. If you go for a piso interior, your apartment will face away from the street and probably onto the inside of the apartment block.

Another variable worth to mention is House Type. Of all sizes, with one or more floors, with or without a garden and in different environments. Below you can consult a list with the common types of house in our dataframe.

House Type Explanation
house_type house_type_desc
Type 1: Pisos A flat is a single-storey dwelling in which all the necessary facilities are distributed. They can have several rooms, a kitchen, living room and one or more bathrooms, but with no patio or outdoor garden
Type 2: Casa o chalet A chalet is an independent house with several floors, a garden and, in some cases, a swimming pool
Type 4: Duplex This term refers to a house with two floors, connected to each other by a staircase.
Type 5: Aticos The apartments that are located on the top floor of a building

The last variable we need to discuss is the Energy Certificate, or the energy performance certificate, a report that describes how efficient a home is in terms of energy consumption. It assigns an energy rating to each home on a scale which ranges from “A” (the most efficient) to “G” (the least efficient). Like other European countries, since June 2013, anyone selling or renting a property in Spain, short or long term, needs an energy performance certificate. The introduction of this innovation is due to the requirements of the EU.

Apart from A to G energy rating, our energy_certificate column might contains other values like en trámite (in process), no indicado (not indicated) or inmueble exento (exempt property).

4 Data Exploration

Start our exploration using summary() to extract the basic statistical information of each column in our house_cln dataframe.

summary(house_cln)
##       id             sq_mt_built       n_rooms        n_bathrooms    
##  Length:21211       Min.   : 16.0   Min.   : 0.000   Min.   : 1.000  
##  Class :character   1st Qu.: 71.0   1st Qu.: 2.000   1st Qu.: 1.000  
##  Mode  :character   Median :101.0   Median : 3.000   Median : 2.000  
##                     Mean   :148.5   Mean   : 3.035   Mean   : 2.076  
##                     3rd Qu.:165.0   3rd Qu.: 4.000   3rd Qu.: 2.000  
##                     Max.   :999.0   Max.   :24.000   Max.   :14.000  
##                                                                      
##                   floor         buy_price       buy_price_by_area house_type_id
##  Upperfloor          :16364   Min.   :  36000   Min.   :  447     Type1:17694  
##  undefined           : 2467   1st Qu.: 199000   1st Qu.: 2550     Type2: 1811  
##  Bajo                : 2001   Median : 378000   Median : 3721     Type4:  675  
##  Entreplanta exterior:  230   Mean   : 643793   Mean   : 4022     Type5: 1031  
##  Semi-sótano exterior:   52   3rd Qu.: 761325   3rd Qu.: 5000                  
##  Entreplanta interior:   32   Max.   :8800000   Max.   :18889                  
##  (Other)             :   65                                                    
##    energy_certificate has_parking                neighborhood       city      
##  en trámite :10687    False:13694   Chamartín          :  840   Madrid:21211  
##  no indicado: 3589    True : 7517   Moncloa            :  693                 
##  E          : 2649                  Chamberí           :  577                 
##  D          : 1087                  Centro             :  542                 
##  G          :  870                  Hortaleza          :  482                 
##  F          :  652                  Barrio de Salamanca:  457                 
##  (Other)    : 1677                  (Other)            :17620                 
##          district_id   
##  5: Chamartín  : 1806  
##  4: Centro     : 1731  
##  6: Chamberí   : 1708  
##  11: Moncloa   : 1546  
##  3: Carabanchel: 1486  
##  9: Hortaleza  : 1447  
##  (Other)       :11487

Now we are confident that every column has the information needed for further analysis.

4.1 Distribution Analysis using Histogram and BoxPlot

4.1.1 Analysis on building area in square meter (sq_mt_built column)

As we can see, the distribution is right-skewed and heavily distributed in the range of 0-200 sqm. For this reason, it is better to use median (Q2) value as central measuremenmt instead of mean, for it’s robust characteristic against outliers.

4.1.2 Analysis on number of room(s) and bathroom(s) per listing (n_rooms & n_bathroomscolumn)

Here, both variables have right-skewed distribution and heavily distributed in the low range region. For this reason, again, the central measurement use median (Q2) value instead of mean, for it’s robust characteristic against outliers.

4.1.3 Analysis on Buy Price and Buy Price per Area

Both variables, again, are right-skewed and heavily distributed in the low range region. For this reason, again, the central measurement use median (Q2) value instead of mean, for it’s robust characteristic against outliers. Compare to the previous plot for number of rooms and bathrooms, here we can see the number of outliers are more prevalent. But inside the boxplot for buying price per area, data distribution is nearly symmetrical.

4.2 Exploratory Analysis with Business Cases

1.Define Top 10 districts with highest number of listed properties?

2. Which are the top 10 districts with highest average (median) of buying price per area?

3.Correlation of other variables to buying price per area, which variable have strongest correlation with it?

Ans : as we knew the variable buy_price_by_area is a numerical data while the other variables are numerical and categorical. To find the correlation between 2 numerical values, we can use the cor() function, hence we will separate the house_cln data only for the numerical one and save it into new object calles house_cl_num then continue by finding the correlation with our target variable buy_price_by_area.

##                   sq_mt_built    n_rooms n_bathrooms buy_price
## sq_mt_built         1.0000000 0.72440538   0.8518865 0.8314373
## n_rooms             0.7244054 1.00000000   0.7164749 0.5925880
## n_bathrooms         0.8518865 0.71647490   1.0000000 0.7578218
## buy_price           0.8314373 0.59258802   0.7578218 1.0000000
## buy_price_by_area   0.1810571 0.05204698   0.2762335 0.5664525
##                   buy_price_by_area
## sq_mt_built              0.18105709
## n_rooms                  0.05204698
## n_bathrooms              0.27623346
## buy_price                0.56645254
## buy_price_by_area        1.00000000

A positive correlation between the number of rooms and bathrooms in a house with their price per area, but the correlation was too weak. It means that bigger house with more rooms is not necessarily always come with higher price.

Now we can move to our categorical variables, and check their correlation with a numerical variable, buy_price_by_area, using Box plot or Kruskal-Wallis chi-squared.

Boxplot Comparison

We find significant variation between median of each floor category to buy price per sqm, a lot of overlap between the boxes especially between numbered level 1 to 9 hence we merged those levels into one called Upperfloor for simplification purpose, but once you see the box plots of other floor levels, you will find various median values with less overlapping boxes. The floor level and buy price per sqm is definitely correlated but not in a weak way since some of the groups show us significant statistical differences.

Here, we can see slighty significant differences between each category of house_type_id to buy price per sqm, if you compare the median value of each category, may say that there are correlation but insignificant one (low correlation).

We can see a weak significant differences between each of the energy group especially if we cluster these energy group into A to D, E & en tramite, F&inmueble exento&no indicato, and energy group G. Same as previous comparison, we see lot of overlap between boxes, hence it is correlated but low ones.

Alternatively, we can use Kruskal-Wallis test with kruskal.test() function.

#compare neighborhood id with buying price per sqm
kruskal.test(buy_price_by_area ~ house_type_id,
                    data = house_cln)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  buy_price_by_area by house_type_id
## Kruskal-Wallis chi-squared = 333.21, df = 3, p-value < 2.2e-16

The null hypothesis (H0): The median is equal across all groups. The alternative hypothesis: (H1): The median is not equal across all groups.

The p-value is extremely low, far less than the significant level of 0.05 hence it falls in rejection region. It can be concluded then, that there are significant difference between the treatment group, or the median is not equal across all group, which means the different neighborhoods leads to statistically significant differences in buying price per sqm.

As for the dichotomous variable (iónly two values, TRUE/FALSE) like the one in has_parking column, we can use the point-biserial correlation test with ltm() package.

## [1] -0.1100973

It results a low correlation number between 2 variables.

From the boxplot above, we can see significant difference between groups and not that much overlapping, hence these 2 variables are strongly correlated.

4. If a customer, with single/unmarried status & in their early employement year as an junior engineer at AECOM Madrid, ask for your recommendation as a real-estate consultant to buy his 1st property, what are your suggestions and why?

Ans : to give the best recommendation, first we have to understand the profile of our customers and their requirements.

Let’s consider the priority feature for a single/unmarried employee in their first year employment,

  • Priority Features :
    • Buying price, as they’re in the first year of employment moreover this is his first time buying property, we should consider their mortgage limit. In this case, the buying price is more of a factor than a buying price per sqm.
    • Number of rooms, as they’re single, we might want to propose 1 - 2 bedrooms house as their first house knowing we have the higher chance of mobility in younger-age worker compare to one with family hence we want to minimize the maintenance cost. For this reason we should go with house type_1 : Pisos, type_4 : duplex or type 5 : Atticos.
  • Nice to Have :
    • Younger generation prefer to not wasting their time commuting, hence the favorable location is the one with well serve connectivity to their office or down-town city centre.
    • Younger generation use less and less private transportation, especially if they live close to city center / has good access to public transport, hence the parking lot is no longer mandatory.

Then, we consider their financial situation. In average, a junior engineer in AECOM Madrid made 30K Euro annually, source : Glassdoor, which equal to Nett. amount of ~1900 Euro per month. Based on this information, we setup the upper limit for mortgage to 5 times his annual salary, 150K Euro, then apply filter based on the desired house type.

Now that we have the list of properties which satisfy 2 of our main constrains, we can then group those properties based on the nice to have features, first the location,

This customer works at AECOM, Madrid. The office is located in C. de Alfonso XII, 62, Madrid ( Lat 40.40911870940878, Long -3.689179361369044). Based on this information, we can then create an interactive map to see the nearby district and match with our list of Top 10 districts with the lowest median value of buying price per sqm.

From the map above, we can classify the 1st-ring district from AECOM office based on the distance that are : Salamanca, Retiro, and Arganzuela, The 2nd ring district will be : Usera, Puente de Vallecas, Moratalaz, Chamberi and Latina. These districts are the most accessible from his office, and from our list above, we can see that district Usera, Puente de Vallecas, Latina and Moratalaz are among the Top 10 district with lowest median buying price per sqm.

So the recommendation is to look at the listed properties in these 4 districts, with their reasonable price per sqm and distance to the office and city center. Puente de Vallecas has the biggest number of 648 available properties with the lowest median value of buying price per sqm (< 2000 Euro per sqm).

5. If a family of 4 looking for a house for them to settle down and stay for a long term, both parents work for quite some years already, what would be your recommendation as a real estate consultant?

Unlike the previous client who is single and in his early employment year, the family with 4 has certain characteristic and requirements for their long-term family home.

  • Priority Features:
    • Type of home. The family would much prefer the detached home than flat or duplex, let alone studio house. They appreciate more outdoor spaces for the kids to play or maybe for having a pet in the future.
    • The family of needs at least 2 bedrooms, kitchen & a living room in their house.
    • Upper limit mortgage for the family of 4 usually is higher than the previous one, moreover they have time to save for more deposit money, therefore the price constrain is more relax here. With average household income of 60K euro annually, this family could request for a mortgage up to 5x of their annual income, or up to 300K Euro.
  • Nice to have Feature:
    • The family with 4 will require car to travel especially if their home is located in the outer ring 7 sub-urban area, therefore a parking lot is needed.
    • Preference of house type 2 and 4

SImilarly to our previous client, first we apply filter to our dataset based on the desired budget (300K Euro), then filter based on the number of rooms available (>= 4), and eliminate house type 5 (Attico).

The following listings are our first recommendation which satisfy both their priority features and the nice-to-have ones which are parking lot and house type number 2 and 4.

The properties which satisfy their priority features + parking availability assigned to object called park_budg2. Those properties then aggregated to find which district have median of buying price per sqm <= to 2000 Euro & have more than 10 number of listed properties.

The recommendation will be the first 3 properties which satisfies both priority and nice-to-have features. Otherwise, we have other list of 80 properties distributed in 11 districts which satisfy the priority feature. Among them, properties in districts Villaverde and Puente de Vallecas has the most options of available properties as well as among the lowest buying price per sqm (< 2000 Euro per sqm).

6. Could you define the characteristic of certain area or neighborhoods based on the property profile listed in your dataset?

Ans : No, our dataset has it’s limitation to accurately describe a community. A community described by combination of various factors such as public transport, school availability, available medical infrastructure, available family & leisure activities, available commercial sites, criminal rate, demographic, etc. But our dataset could classifies community based on their customer profile as demonstrated in exploration no #7 and #8.