Three models were used: XGBoost, lasso regression and ensembling (stacked generalization).
The xgboost model had the best performance with an RMSE score of 0.2223.
The most important variables for predicting the price appear to be the district in which the house is located and the inside and total area of the property (in square meters).
In the case of the inside area, there is a strong linear relationship with the price of the house. For total meters, the relationship with the outcome is quadratic (concave).
The numbers of rooms, bathrooms and parking spaces also have an effect on the price.
Although not as important as the previous variables, the words in the description of the house also have an impact. Among the most important are places (streets, neighborhoods), adjectives (fine, spectacular) and the characteristics of the house (well, parquet, mediterranean, irrigation).
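For reference, the RMSE used to compare the models above can be computed as in the sketch below. The post does not say on which scale the score was measured. Scoring on the natural log of the price is an assumption made here for illustration, and the prices in the example are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices in Chilean pesos (not the scraped data).
actual_prices = [100_000_000, 250_000_000, 50_000_000]
predicted_prices = [120_000_000, 230_000_000, 55_000_000]

# Assumption: the error is measured on log prices, so it reflects
# relative rather than absolute differences.
log_actual = [math.log(p) for p in actual_prices]
log_predicted = [math.log(p) for p in predicted_prices]
score = rmse(log_actual, log_predicted)
```

A lower score means the predictions are closer to the listed prices on whatever scale is used.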
I know this has been done many times, but I wanted to try to predict the price of houses in the city where I live. To do that, I scraped Portal Inmobiliario, a website where people list properties for sale or rent. The data was collected in May 2021. Only houses in the urban districts of the Metropolitan Region of Chile were considered.
We are going to start by describing our outcome variable, PrecioPesos, which is the price of the house in Chilean pesos.
The distribution of the variable is quite interesting: it has 3 peaks. This is because people who post prices prefer “pretty” numbers to “ugly” ones. For example, it is more common to list a property at 100,000,000 or 50,000,000 than at a number like 127,699,322. This is what causes the peaks. The data is also right-skewed: there are many more moderately priced houses than really expensive ones.
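Both observations can be checked with a few lines of code. The sketch below uses made-up prices, and both the 10,000,000 rounding step and the mean-versus-median comparison are illustrative assumptions rather than the method used for the chart.

```python
from statistics import mean, median


def share_of_round_prices(prices, step=10_000_000):
    """Fraction of prices that are exact multiples of `step`."""
    if not prices:
        raise ValueError("prices must not be empty")
    round_count = sum(1 for p in prices if p % step == 0)
    return round_count / len(prices)


# Hypothetical listing prices in Chilean pesos (not the scraped data).
sample_prices = [
    50_000_000, 100_000_000, 100_000_000, 127_699_322,
    150_000_000, 89_500_000, 200_000_000, 950_000_000,
]

round_share = share_of_round_prices(sample_prices)

# Rough skew check: a few very expensive houses pull the mean
# above the median when the distribution has a long right tail.
is_right_skewed = mean(sample_prices) > median(sample_prices)
```

A high share of round prices is consistent with the peaks, and a mean well above the median is consistent with a long right tail.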
The next variable is metrosUtiles, which is the inside area of the house (in square meters). There seems to be a fairly strong linear relationship between it and the outcome variable.
Another important variable is the total meters of the house, Total meters3 in the database. The total meters is the sum of the square meters inside and outside the house. In this case there seems to be a strong quadratic relationship. The marginal effect of an extra square meter is positive up to a point, after which one more square meter has a negative effect on the price of the house. This is possibly because houses with many total meters tend to be in rural areas, where land is cheaper than in urban areas.
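The turning point of a concave quadratic relationship can be written down explicitly. The symbols below are illustrative assumptions: m is the total meters, and the coefficients are not values fitted in this post.

```latex
\begin{aligned}
\text{price}(m) &= \beta_0 + \beta_1 m + \beta_2 m^2, \qquad \beta_1 > 0,\ \beta_2 < 0 \\
\frac{d\,\text{price}}{dm} &= \beta_1 + 2\beta_2 m \\
m^{*} &= -\frac{\beta_1}{2\beta_2}
\end{aligned}
```

Below m* the marginal effect of one more square meter is positive, and above m* it is negative, which matches the shape described above.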
Let’s analyze the rest of the variables, first the bathrooms. As the number of bathrooms increases, the price of houses also increases, as can be seen in the graph.
In the case of bedrooms, it seems that houses with 1 bedroom are more expensive than those with 2. After that the price increases with the number of bedrooms, until it flattens out around 6 or 7.
Regarding the number of parking spaces, it is clear that houses with only 1 parking space have much lower prices. Then the price, as with the other variables, increases as parking spaces increase. The price seems to flatten out around 6 parking spaces. It is important to mention that there are few houses with more than 10 parking spaces in the sample.
Another important variable is the district. The graph shows huge differences between the districts. The red dashed line represents the median price in our sample. Most of the houses in our sample come from the 4 most exclusive districts of Santiago (Lo Barnechea, Vitacura, Las Condes, Chicureo). The eight poorest districts have a median price that is not even one-eighth of the median of the richest district.
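The comparison between districts comes down to a grouped median. A minimal sketch with made-up listings follows. The prices are hypothetical, and "District X" is a placeholder rather than a real district from the sample.

```python
from collections import defaultdict
from statistics import median


def median_price_by_district(listings):
    """Map each district to the median price of its listings."""
    prices = defaultdict(list)
    for district, price in listings:
        prices[district].append(price)
    return {district: median(values) for district, values in prices.items()}


# Hypothetical (district, price in Chilean pesos) pairs, not the scraped data.
listings = [
    ("Vitacura", 900_000_000),
    ("Vitacura", 1_100_000_000),
    ("Vitacura", 1_000_000_000),
    ("Las Condes", 600_000_000),
    ("Las Condes", 800_000_000),
    ("District X", 60_000_000),
    ("District X", 80_000_000),
    ("District X", 70_000_000),
]

medians = median_price_by_district(listings)
richest = max(medians, key=medians.get)
poorest = min(medians, key=medians.get)

# Share of the richest district's median that the poorest district reaches.
ratio = medians[poorest] / medians[richest]
```

The median is used instead of the mean because a handful of very expensive listings would otherwise dominate a district's figure.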
In the following 3D map we can see the median price (in millions of Chilean pesos) of the houses by district. The price is represented by the color and height of each district on the map. The wealthy districts are to the northeast of the map, while the poorest districts are to the south and northwest. These are the districts where most of the people who lived in illegal settlements in the 1980s were relocated. Finally, we have Chicureo and Colina to the north, separated from the rest of the metropolitan region. The chart is interactive, so it can be rotated and zoomed.