Results

  1. Three models were used: xgboost, lasso regression and ensembling (stacked generalization).

  2. The xgboost model had the best performance with an RMSE score of 0.2223.

  3. It seems that the most important variables to predict the price are: the district in which the house is built and the inside and total area of a property (square meters).

  4. In the case of the inside area, there is a strong linear relationship with the price of the house. For total meters, the relationship with the outcome is quadratic (concave).

  5. The number of rooms, bathrooms and parking spaces also have an effect on the price.

  6. Although not as important as the previous variables, the words in the description of the house also have an impact. Among the most important are places (streets, neighborhoods), adjectives (fine, spectacular) and the characteristics of the house (well, parquet, mediterranean, irrigation).

Intro

I know this has been done many times, but I wanted to try to predict the price of houses in the city where I live. To that end I scraped Portal Inmobiliario, a website where people list properties for sale or rent. The data was collected in May 2021. Only houses in the urban districts of the Metropolitan Region of Chile were considered.

Variables

We are going to start by describing our outcome variable. This is PrecioPesos which is the price of the houses in Chilean pesos.

The distribution of the variable is quite interesting: it has 3 peaks. This is because people who post prices prefer “pretty” numbers to “ugly” ones. For example, it is more common to list a property at 100,000,000 or 50,000,000 than at a number like 127,699,322. This is what causes the peaks. The data is right skewed: there are many more moderately priced houses than really expensive ones.
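One way to check the "pretty number" effect is to measure how many listings sit exactly on a round price. The following is a minimal sketch, not the code used for this post: the function name, the 10,000,000 peso step and the example prices are all assumptions made for illustration.

```python
def share_of_round_prices(prices, step=10_000_000):
    """Return the fraction of prices that are exact multiples of `step`."""
    if not prices:
        return 0.0
    return sum(1 for p in prices if p % step == 0) / len(prices)


# Made-up listing prices in Chilean pesos (illustration only, not real data).
example_prices = [100_000_000, 50_000_000, 127_699_322, 80_000_000]
round_share = share_of_round_prices(example_prices)
```

A share far above what evenly spread prices would give is consistent with sellers rounding their asking price.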

The next variable is metrosUtiles, which is the inside area of the house (square meters). There seems to be a fairly strong linear relationship between it and the outcome variable.

Another important variable is the total area of the house, metrosTotales3 in the database. The total area is the sum of the square meters inside and outside the house. In this case there seems to be a strong quadratic relationship. The marginal effect of one more square meter is positive up to a point, after which it becomes negative. This is possibly because houses with a very large total area tend to be in rural areas, where land is cheaper than in urban areas.
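To make the concave quadratic shape concrete: if the price is approximated by b0 + b1 * m + b2 * m^2 with b2 < 0, the effect of one more square meter is b1 + 2 * b2 * m, which reaches zero at m = -b1 / (2 * b2). The sketch below assumes a raw (non-orthogonal) polynomial, and its coefficients are hypothetical, not estimates from this data.

```python
def marginal_effect(b1, b2, meters):
    """Derivative of b0 + b1*m + b2*m**2 with respect to m."""
    return b1 + 2 * b2 * meters


def turning_point(b1, b2):
    """Total area at which one more square meter stops adding value."""
    if b2 >= 0:
        raise ValueError("the curve is concave only when b2 < 0")
    return -b1 / (2 * b2)


# Hypothetical coefficients, chosen only to illustrate the shape.
b1, b2 = 0.4, -0.0001
peak_meters = turning_point(b1, b2)
```

Below `peak_meters` the marginal effect is positive, and above it the effect is negative, which matches the pattern described above.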

Let’s analyze the rest of the variables, first the bathrooms. As the number of bathrooms increases, the price of houses also increases, as can be seen in the graph.

In the case of bedrooms, it seems that houses with 1 bedroom are more expensive than those with 2. After that the price increases with the number of bedrooms, until it flattens out at around 6 or 7.

Regarding the number of parking spaces, it is clear that houses with only 1 parking space have much lower prices. The price, as with the other variables, then increases with the number of parking spaces and seems to flatten at around 6. It is important to mention that there are few houses with more than 10 parking spaces in the sample.

Another important variable is the district. The graph shows the clear differences between districts. The red dashed line represents the median price in our sample. Most of the houses in our sample come from the 4 most exclusive districts of Santiago (Lo Barnechea, Vitacura, Las Condes, Chicureo). The differences are huge: the eight poorest districts have a median price that is less than one-eighth of the median of the richest district.

In the following 3D map we can see the median price (in millions of Chilean pesos) of the houses by district. The price is represented by the color and height of each district on the map. The wealthy districts are to the northeast of the map, while the poorest districts are to the south and northwest. The latter are the districts where people who lived in illegal settlements were mostly relocated in the 1980s. Finally, we have Chicureo and Colina to the north, separated from the rest of the metropolitan region. The chart is interactive so it can be rotated and zoomed.

There is also a short description, which was tokenized in order to find the words that have the greatest effect on house prices. In the 2 charts below, I only selected words that appeared in at least 200 house descriptions. Among the words with a higher median price, some are characteristics of the house: wine cellar (cava), cinema (cine), marble (mármol), sauna (sauna), basement (subterráneo). Others are neighborhoods (La Dehesa, El Golf) or adjectives: fine (finas), spectacular (espectacular), wonderful (maravilloso), beautiful (precioso). One word that stands out is architect (arquitecto). This word probably has a positive effect because an architect is only mentioned by name when he or she is a recognized, award-winning architect, and it is reasonable to think that such architects build very expensive houses. The word Mediterranean (mediterránea) also stands out. It has a positive effect because building houses in this style is in vogue in upper-middle-class neighborhoods.

On the other hand, there are also words that are associated with low prices. Some are related to places, specifically districts and streets: Vespucio, Tobalaba, Maipú, Puente. Five others caught my attention. The first is villa, which is what a middle-class neighborhood is called in Chile. The second is the group of words associated with real estate agencies, such as brokerage (corretaje) and broker (corredora). The third is pareado, which means semi-detached. In Chile it is very common to build semi-detached houses in middle-class neighborhoods. The fourth is pasaje (alleyway), which is probably associated with the way middle-class neighborhoods are laid out. The fifth is locomocion, which means public transportation. This word is very likely to be linked to houses in neighborhoods where people cannot afford a car.
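The word analysis described above can be sketched as follows. This is an illustrative reconstruction, not the original code: the function names, the tokenization rule and the sample listings are assumptions. Only the idea of keeping words that appear in at least 200 descriptions and comparing their median prices comes from the text.

```python
import re
from collections import defaultdict
from statistics import median


def tokenize(text):
    """Lowercase the text and keep runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def median_price_by_word(listings, min_listings=200):
    """Map each sufficiently common word to the median price of its listings.

    `listings` is an iterable of (description, price) pairs.
    """
    prices_by_word = defaultdict(list)
    for description, price in listings:
        # A word is counted once per listing, however often it is repeated.
        for word in set(tokenize(description)):
            prices_by_word[word].append(price)
    return {
        word: median(prices)
        for word, prices in prices_by_word.items()
        if len(prices) >= min_listings
    }


# Made-up listings (price in millions of pesos); threshold lowered to 2
# so the toy example returns something.
sample_listings = [
    ("Casa mediterránea con cava", 900),
    ("Casa pareada en pasaje", 100),
    ("Casa con piscina", 500),
]
word_medians = median_price_by_word(sample_listings, min_listings=2)
```

Sorting the resulting dictionary by value gives the words associated with the highest and lowest median prices.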

We have a large number of dummy variables. One that I found interesting is the variable pool. The graph has the price of the house on one axis and the inner area of the house on the other axis. The color of the hexagons represents the percentage of houses with a swimming pool. It is possible to see that the proportion of houses with a swimming pool tends to increase as the price of houses increases. Now, it is important to note that this variable also seems to have a relationship with the square meters of the inner area of the house.

Treatment of Variables

Some outliers were removed, mainly houses in rural areas (which were not of interest), along with some typing errors and duplicate observations. The numerical variables were log-transformed and normalized. In addition, the district variable was converted to a numerical variable using the median house price of each district.
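These transformations can be sketched as follows. This is a minimal illustration under stated assumptions, not the original pipeline: "normalized" is read here as z-score standardization, and the function names and sample values are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, median, pstdev


def log_standardize(values):
    """Take natural logs, then rescale to mean 0 and standard deviation 1."""
    logs = [math.log(v) for v in values]
    mu, sigma = mean(logs), pstdev(logs)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(x - mu) / sigma for x in logs]


def encode_district_by_median(districts, prices):
    """Replace each district label with the median price observed in it."""
    prices_by_district = defaultdict(list)
    for district, price in zip(districts, prices):
        prices_by_district[district].append(price)
    medians = {d: median(p) for d, p in prices_by_district.items()}
    return [medians[d] for d in districts]


# Made-up values, for illustration only.
sample_districts = ["Vitacura", "Maipú", "Vitacura", "Maipú"]
sample_prices = [400, 100, 600, 120]
encoded = encode_district_by_median(sample_districts, sample_prices)
scaled = log_standardize([100, 1000, 10000])
```

Because this encoding uses the outcome variable, the medians would normally be computed on the training data only and then applied to the test data, so that test prices do not leak into the features.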

Results

We ran 3 models: one using xgboost, a second using lasso regression, and an ensemble of both. The best model was the xgboost one. These were the results:

If we look at the most important variables, the first is the district (comuna2). This is explained by the fact that the variable bundles many components, mainly access to services and public goods such as hospitals, parks and schools, and other very important features such as security.

The second most important variables are the total meters (metrosTotales3_poly_1 and metrosTotales3_poly_2) and the meters of the inner area of the house (metrosUtiles), which makes sense because one would expect that the bigger the house, the higher the price. The interesting thing is that the total meters have a concave quadratic relationship with the price, which may be because houses with a very large plot of land are located in more rural areas, where land is cheaper.

After these variables, the most important are the number of bathrooms (baños2), the existence of a pool (piscina), the number of bedrooms (dormitorios2) and the number of parking spaces (parking20).

Although the text of the description was not an important variable in the xgboost model, it was important in the lasso model. Here is a list of the most important words:

Among the words with a positive effect we have places such as the districts of Quilicura, Providencia and condes (Las Condes), and neighborhoods such as damián (San Damián) and golf (El Golf). The fact that Quilicura has a positive effect is surprising, since it is not a wealthy district. Verbo is there because of the Verbo Divino school, which is one of the most exclusive educational establishments in Santiago. Other words with a positive effect are the adjectives spectacular (espectacular) and fine (fina), the presence of parquet in the house, the Mediterranean (mediterránea) style of construction and, finally, the word easybroker, which is probably a real estate company.

Among the words with a negative effect we have Chicureo, which is curious because it is one of the richest districts in Santiago. The explanation is that most of the sample comes from the four richest districts (Lo Barnechea, Vitacura, Las Condes and Chicureo), and Chicureo, having the lowest median price in this group, has a negative effect within the total sample. Chamisero is a street in Chicureo, which is why this word also has a negative effect.

Other words like irrigation, well and wood stove are signs of rurality, which may explain their effect. The last words are schools (colegios) and cash (contado). For these it is a bit difficult to find a clear reason for the negative effect. It may be that sellers ask for cash payment when prices are low. The word schools may have a negative effect because, when the schools around a property are not well known, people only mention the word “school” instead of the name of the school, as they do with a well-known private school such as Verbo Divino.
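Before comparing the models, it may help to see how the RMSE is computed and how the predictions of two models can be combined. The sketch below is a simplified stand-in for stacked generalization, not the ensemble actually trained for this post: instead of fitting a meta-learner, it finds the single weight that best mixes two sets of held-out predictions. All names and numbers are hypothetical. Given the variable treatment described earlier, the reported RMSE is presumably on the transformed price scale rather than in pesos, though that is my reading and is not stated outright.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def blend_weight(y_true, pred_a, pred_b):
    """Least-squares weight w for w * pred_a + (1 - w) * pred_b, kept in [0, 1]."""
    numerator = sum((a - b) * (t - b) for t, a, b in zip(y_true, pred_a, pred_b))
    denominator = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if denominator == 0:
        return 0.5  # identical predictions: any weight gives the same blend
    return min(1.0, max(0.0, numerator / denominator))


def blend(pred_a, pred_b, weight):
    """Weighted average of two prediction lists."""
    return [weight * a + (1 - weight) * b for a, b in zip(pred_a, pred_b)]


# Made-up held-out values: model A overshoots and model B undershoots.
y_holdout = [1.0, 2.0, 3.0]
pred_model_a = [1.2, 2.2, 3.2]
pred_model_b = [0.8, 1.8, 2.8]
w = blend_weight(y_holdout, pred_model_a, pred_model_b)
blended = blend(pred_model_a, pred_model_b, w)
```

The weight should be estimated on predictions for data the base models did not see during training; otherwise the blend tends to favor whichever model overfits most.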

Here are the results for all models:

Final Thoughts

First of all, I think it is important to point out that these results must be interpreted with care, since the bulk of the houses in our sample come from well-to-do neighborhoods. If we had a more balanced sample, perhaps the results would be different. Here is a chart with the number of houses by district.

With that caveat in mind, the district is the variable that best helps to predict the price of the house. It is interesting to note that the districts that received the most relocated families in the 1980s are those with the lowest prices. The importance of the inner and total meters is not surprising. What is surprising is the quadratic relationship between the price of the house and its total meters.

I still have to run a model with pairs of words (bigrams) instead of just single words. Despite the good results of the models, I think there is room to improve performance. It would be helpful to have servers to train the models better.
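As a starting point for that pending work, pairs of consecutive words can be extracted from a description like this. It is only a sketch: the function name and the example sentence are made up, and no model has been run on these features.

```python
import re


def word_pairs(text):
    """Return consecutive word pairs from a description, lowercased."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


# Made-up description, for illustration only.
pairs = word_pairs("Casa estilo mediterráneo en El Golf")
```

With pairs, a name such as El Golf stays together as one feature instead of being split into two unrelated words.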

