We have seen in Part I made by @Fateh Bekioua the problematic that we are trying to adress. An intro video regarding our subject of interest is available here(english subtitles are not available yet.)

In addition, we have seen how data from Ouedkniss Website was gathered using webscraping techniques.

In this part, we first are going to scrape the data related to Lkeria Website and then we’ll explore our data and try to answer our questions about it.

Webscraping data from lkeria website:

This website allows only real estate agencies to list there properties.

Loading Packages:

library(rvest)
library(tidyverse)
library(lubridate)

Create functions to capture informations:

get_links <- function(i=1){
  webpage <- read_html(paste0("https://www.lkeria.com/annonces/immobilier/vente-algerie-P", i))
  dates <- html_text(html_nodes(webpage, xpath = "//span[@class=\"property-label hidden-xs\"]"))[-c(1:3)]
  urls <- html_attr(html_nodes(webpage, xpath = "//a[@rel=\"noopener\"]"), "href")[-1] 
  as.data.frame(cbind(urls, dates))
}

If we try to filter listings on “Algiers” only the URL doesn’t change so the idea is to capture all the sales listings within this website no matter the region.

For dates variable within get_links() function, there was a pattern where the first three observations were “premium” listings that they are repeated across all pages so we excluded them using [-c(1:3)].(this number is subject to change as the website evolves)
Same notice for urls variable where the first observation was irrelevant information.

get_all <- function(nbr.pages=1){
  all <- NULL
  for (k in 1:nbr.pages){
    all <- rbind(all, get_links(i = k))
  }
  all
}

The get_all function serves to gather all pages listing in one single dataframe.

Scrape the data:

df <- get_all(185)

algiers_data <- df %>% 
  filter(str_detect(urls, "Alger")==TRUE) %>% 
  mutate(urls = paste0("https:",urls))

We scraped all the existing 185 pages at that time then we filtered on “Alger” to select the listings of Algiers city. We concatenated also “https:” to our urls to make them accessible.

Getting detailed informations within listings:

Next, we move to getting relevant informations within properties listings for our analysis .

pages <- map(.x = algiers_data$urls, possibly(.%>%read_html(.), NA_real_))

type <- map(pages,possibly(. %>% 
                         html_nodes(".breadcrumb li:nth-child(3) a") %>% 
                         html_text()
                 ,NA_real_)) %>%
  unlist()

district <- map(pages,possibly(. %>% 
                         html_nodes(".active") %>% 
                         html_text()
                 ,NA_real_)) %>%
  unlist()

surface <- map(pages,possibly(. %>% 
                         html_nodes(".resume li:nth-child(1)") %>% 
                         html_text()
                 ,NA_real_)) %>%
  unlist()

nbr_rooms <- map(pages,possibly(. %>% 
                         html_nodes(".resume li+ li") %>% 
                         html_text()
                 ,NA_real_)) %>%
  unlist()

description <- map(pages,possibly(. %>% 
                         html_nodes("#bien") %>% 
                         html_text()
                 ,NA_real_)) %>%
  unlist()

price <- map(pages,possibly(. %>% 
                         html_nodes(".amount b") %>% 
                         html_text()
                 ,NA_real_)) %>%
  unlist()

price_type <- map(pages,possibly(. %>% 
                         html_nodes(".has:nth-child(1)") %>% 
                         html_text()
                 ,NA_real_)) %>%
  unlist()

agency_name <- map(pages,possibly(. %>% 
                         html_nodes("#contact span:nth-child(1)") %>% 
                         html_text()
                 ,NA_real_)) %>%
  unlist()

details <- map(pages,possibly(. %>% 
                         html_nodes("section+ section") %>% 
                         html_text() %>% 
                           str_sub(start = 11, end = -3)
                 ,NA_real_)) %>%
  unlist()
  
  
algiers_raw <- as_tibble(cbind(algiers_data, type, district, surface, nbr_rooms, description, price, agency_name, details))

#write_csv(algiers_raw, "algiers_raw.csv")

For this task, we used map() funtion from purrr package to iterate over all our urls alongside with Selector Gadget to select the desired information.

algiers_raw <- read_csv("algiers_raw.csv")

Cleaning the data:

Now that we have the raw data for “Algiers”, we need to do more cleaning on it.

we define first a function to get the number of floor.

get_floor_nbr <- function(string){
  result <- NULL
  for (j in 1:length(string)){ 
    if (is.na(string[j])== TRUE){ 
      result[j] <- NA
      } else { 
        if (str_detect(string[j], "Etage :") == TRUE){ 
          loc <- str_locate(string[j], "Etage :")[,2]
          result[j] <- str_split(str_sub(string[j], loc+1), " ")[[1]][1]
        } else { 
          result[j] <- NA 
        }
      }
  }
  return(result)
}

Then, we make additional adjustments.

algiers <- algiers_raw %>% 
  mutate(dates = case_when(dates %in% c("Hier", "Aujourd'hui") ~ "NA",
                           TRUE ~ dates), #replace Hier and Aujourd'hui values by NA
         surface = as.numeric(str_sub(surface, end = -4)), #remove the last 3 characters
         nbr_rooms = as.numeric(str_sub(nbr_rooms, end = -8)), #remove the last 7 characters
         floor_nbr = get_floor_nbr(details), #use our defined function
         type = factor(type)) %>%
  mutate(dates = dmy(dates = str_sub(dates, start = 4)),
         price = ifelse(str_detect(price, "m2") == TRUE , 
                    as.numeric(str_split_fixed(price, " ", 2)[, 1]) * surface * 1e4, 
                    ifelse(str_detect(price, "Milliards") == TRUE, 
                           as.numeric(str_split_fixed(price, " ", 2)[, 1]) * 1e7,
                           as.numeric(str_split_fixed(price, " ", 2)[, 1]) * 1e4))) %>% 
  mutate(dates = zoo::na.locf(dates, na.rm =FALSE)) #replace NAs with last obs carried forward

#write_csv(algiers, "algiers_lkeria.csv")

The most important reformatting above is to deal with the different scenarios on how the price appears, in this case we used a nested ifelse() function:

price contains “m2”, it’s price per meter squared so the final price would be numeric number x surface x 1e4 considering price per meter cannot exceed 1e5.
price contains “Milliards”, which corresponds to 1e7 DZD.
final case, price doesn’t contain “m2” nor “Milliards”, in other words the price is very likely expressed in “millions”, the price would be numeric number x 1e4

Exploratory Data Analysis:

Now that we have the data from both websites ouedkniss and lkeria let’s explore it.

Loading the data

lkeria <- read_csv("algiers_lkeria.csv")
ouedkniss <- as_tibble(read.csv("Housing_Data26072018.csv"))

Structure

As a first step let’s have an overview of our data.

Lkeria

glimpse(lkeria)

## Observations: 1,166
## Variables: 11
## $ urls        <chr> "https://www.lkeria.com/annonces/immobilier/Vente-...
## $ dates       <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ type        <chr> "Terrain", "Terrain", "Appartement", "Villa", "Ter...
## $ district    <chr> "Khraicia", "Douera", "Draria", "Kouba", "Kouba", ...
## $ surface     <dbl> 600, 220, 76, 234, 0, 75, 84, 88, 65, 255, 100, 15...
## $ nbr_rooms   <int> 0, 0, 2, 0, 0, 3, 4, 3, 3, 0, 4, 4, 0, 0, 0, 7, 11...
## $ description <chr> "Description du bien\nVente Terrain Khraicia Vend ...
## $ price       <dbl> 42000000, 16500000, 16000000, 50000000, 160000000,...
## $ agency_name <chr> "Agence immobilière Hakem Agence Immobilière", "Ag...
## $ details     <chr> "Prix Négociable Surface :600 m2", "Prix Négociabl...
## $ floor_nbr   <chr> NA, NA, NA, NA, NA, NA, NA, "1", NA, NA, NA, NA, N...

Ouedkniss

glimpse(ouedkniss)

## Observations: 2,152
## Variables: 50
## $ Links                       <fct> https://www.ouedkniss.com/vente-ap...
## $ Title                       <fct> Vente Appartement F2 Alger Belouiz...
## $ Description                 <fct> NA, J'ai une grande,propre et top ...
## $ Store.Name                  <fct> NA, NA, NA, NA, MS.Immobilier, NA,...
## $ Pseudo                      <fct> youyayouya, F123kniss, megatrain21...
## $ Store.Addresse              <fct> NA, NA, NA, NA, Alger Bab ezzouar ...
## $ Date                        <fct> 2018-07-26, 2018-07-26, 2018-07-26...
## $ Nb.Floor                    <int> NA, 3, 4, 1, 4, 3, 2, 3, 3, NA, 4,...
## $ Nb.Room                     <int> 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3...
## $ Nb.Views                    <int> 119, NA, 33, 381, 9, 39, 15, 17, 1...
## $ ID.Offer                    <int> 16427845, 16340043, 16600827, 1651...
## $ Price                       <fct> 860 Millions Négociable, 1 Million...
## $ District                    <fct> belcourt, Centre ville, Fatma nsou...
## $ Specifics                   <fct> Electricité ,  Gaz ,  Eau ,  Acte ...
## $ Area                        <int> 48, 70, 65, 42, 97, 70, 100, 136, ...
## $ Price.value                 <dbl> 860.0, 1.0, 7.0, 900.0, 2.4, 750.0...
## $ Price.unit                  <fct> Millions, Millions, Millions, Mill...
## $ Price.desc                  <fct> Négociable, Négociable, Négociable...
## $ Price.value.dzd             <dbl> 8600000, 10000000, 70000000, 90000...
## $ Price.rules                 <fct> Price.value * 10000, Price.value *...
## $ Announcer.Name              <fct> youyayouya, F123kniss, megatrain21...
## $ Announcer.Type              <fct> PARTICULIER, PARTICULIER, PARTICUL...
## $ Hour                        <int> 19, 19, 19, 21, 22, 16, 19, 20, 20...
## $ Month                       <fct> Jul, Jul, Jul, Jul, Jul, Jul, Jul,...
## $ Municipality                <fct> HAMMA ANASSERS, BIRTOUTA, DAR EL B...
## $ Garage                      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ Garden                      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ Furnished                   <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, ...
## $ Promise                     <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, ...
## $ New.Project                 <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, ...
## $ Payment                     <fct> comptant, comptant, comptant, comp...
## $ Top_Municipalities          <fct> Other, Other, Other, DJISR KSENTIN...
## $ Store.Municipality          <fct> NA, NA, NA, NA, BEB EZZOUAR, NA, E...
## $ commune.names               <fct> Algiers, HAMMA ANASSERS, Algiers, ...
## $ lon                         <dbl> 3.058756, 3.047605, 3.228198, 3.07...
## $ lat                         <dbl> 36.75377, 36.64609, 36.70601, 36.6...
## $ type                        <fct> locality, locality, locality, loca...
## $ loctype                     <fct> approximate, approximate, approxim...
## $ address                     <fct> algiers, sidi m'hamed, algeria, bi...
## $ north                       <dbl> 36.75878, 36.66857, 36.73901, 36.7...
## $ south                       <dbl> 36.74887, 36.61635, 36.66587, 36.6...
## $ east                        <dbl> 3.061624, 3.071880, 3.259807, 3.11...
## $ west                        <dbl> 3.054058, 2.947340, 3.163548, 3.04...
## $ locality                    <fct> Algiers, Birtouta, Dar El Beïda, D...
## $ locality.1                  <fct> Sidi M'Hamed, NA, NA, NA, NA, NA, ...
## $ administrative_area_level_1 <fct> Algiers Province, Algiers Province...
## $ country                     <fct> Algeria, Algeria, Algeria, Algeria...
## $ political                   <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Store.lon                   <dbl> NA, NA, NA, NA, 3.185497, NA, 3.06...
## $ Store.lat                   <dbl> NA, NA, NA, NA, 36.72063, NA, 36.7...

Reformatting Ouedkniss dataset

ouedkniss <- ouedkniss %>% 
  select(ID.Offer, Date, Hour, Month, Municipality, lon, lat, 
         Nb.Floor:Nb.Views, Area, Garage:Payment, Price.value.dzd, District, Top_Municipalities,
         Announcer.Name, Announcer.Type, Store.Municipality, Store.lon, Store.lat) %>% 
  filter(Nb.Room %in% c(3:5), Announcer.Type != "PROMOTEUR") %>%
  mutate(website = "ouedkniss",
         price_m2 = Price.value.dzd/Area)

Reformatting Lkeria data

lkeria <- lkeria %>% 
  filter(type == "Appartement") %>% 
  mutate(Nb.floor = as.integer(floor_nbr),
         Top_Municipalities = fct_lump(district, 9),
         Announcer.Type = word(agency_name, 1, 2),
         price_m2 = price/surface, 
         Garage = str_detect(details, "Garage"),
         Garden = str_detect(details, "Jardin"),
         Furnished = str_detect(details, "Meublé"),
         website = "Lkeria") %>% 
  rename(Date = dates,
         Nb.Room = nbr_rooms,
         Municipality = district,
         Area = surface,
         Price.value.dzd = price) %>% 
  filter(Nb.Room %in% c(3:5), Announcer.Type != "Promoteur immobilier") %>%
  filter(!(is.na(price_m2)), !(price_m2 == Inf))

Distribution of price by top municipalities of ouedkniss

ouedkniss %>% 
  group_by(Top_Municipalities) %>% 
  summarise(median_price = median(price_m2),
            quartile_3 = quantile(price_m2, probs = 0.75),
            p_90 = quantile(price_m2, probs = 0.90),
            p_93 = quantile(price_m2, probs = 0.93),
            p_95 = quantile(price_m2, probs = 0.95),
            p_99 = quantile(price_m2, probs = 0.99),
            max_price = max(price_m2),
            n = n()) %>% 
  arrange(desc(median_price))

## # A tibble: 11 x 9
##    Top_Municipalit~ median_price quartile_3   p_90   p_93   p_95   p_99
##    <fct>                   <dbl>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 DELY BRAHIM           259259.    315534. 3.81e5 4.02e5 4.30e5 4.58e5
##  2 EL ACHOUR             247059.    281929. 3.18e5 3.28e5 3.40e5 3.61e5
##  3 CHERAGA               239130.    280000  3.11e5 3.31e5 3.44e5 3.93e5
##  4 BEB EZZOUAR           229885.    259376. 2.77e5 2.86e5 2.90e5 3.05e5
##  5 BIR KHADEM            226237.    263278. 3.01e5 3.10e5 3.48e5 3.98e5
##  6 ALGER                 219298.    263068. 3.00e5 3.12e5 3.15e5 3.41e5
##  7 DJISR KSENTINA        183375.    200000  2.22e5 2.25e5 2.36e5 2.94e5
##  8 Other                 183333.    247312. 3.20e5 3.50e5 3.86e5 5.49e5
##  9 BAB EL OUED           172727.    198361. 2.10e5 2.21e5 2.30e5 4.08e5
## 10 BORDJ EL BAHRI        142308.    175000  1.95e5 2.11e5 2.17e5 2.38e5
## 11 BORDJ EL KIFFAN       119158.    150610. 1.80e5 2.00e5 2.18e5 5.40e5
## # ... with 2 more variables: max_price <dbl>, n <int>

Distribution of price by top municipalities of lkeria

lkeria %>% 
  group_by(Top_Municipalities) %>% 
  summarise(median_price = median(price_m2),
            quartile_3 = quantile(price_m2, probs = 0.75),
            p_90 = quantile(price_m2, probs = 0.90),
            p_93 = quantile(price_m2, probs = 0.93),
            p_95 = quantile(price_m2, probs = 0.95),
            p_99 = quantile(price_m2, probs = 0.99),
            max_price = max(price_m2),
            n = n()) %>% 
  arrange(desc(median_price))

## # A tibble: 10 x 9
##    Top_Municipalit~ median_price quartile_3   p_90   p_93   p_95   p_99
##    <fct>                   <dbl>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Hydra                 412879.    475815. 5.02e5 5.08e5 5.26e5 5.67e5
##  2 Kouba                 326250     468750  4.84e5 4.98e5 5.42e5 6.42e5
##  3 El Achour             282609.    327448. 3.50e5 3.50e5 3.57e5 3.91e5
##  4 Cheraga               264935.    339298. 3.75e5 4.04e5 4.39e5 2.45e6
##  5 Mohammadia            256944.    281401. 3.98e5 6.63e5 8.39e5 1.19e6
##  6 Bir Mourad Rais       256410.    276070. 3.19e5 3.34e5 3.38e5 3.48e5
##  7 Birkhadem             254545.    353611. 3.81e5 3.81e5 3.84e5 3.89e5
##  8 Ouled Fayet           239015.    271504. 2.90e5 2.94e5 3.18e5 3.71e5
##  9 Alger-Centre          213833.    279375  2.99e5 3.46e5 3.90e5 4.78e5
## 10 Other                 200000     250000  3.33e5 3.58e5 3.71e5 1.57e6
## # ... with 2 more variables: max_price <dbl>, n <int>

For ouedkniss data, we selected 25 relevent variables out of 50 for our analysis.
We reformatted some variables in lkeria dataset so it can be combined with ouedkniss dataset.
We notice from the code above that not all Top Municpalities are in common for both datasets and since we have fewer data for Lkeria we’ll also adapt Ouedkniss’s Top Municipalities to match those of Lkeria.
we created a new variable price per meter squared, this will be the subject of our coming comparisons
we showed distributions of price per m2 for the top municipalities to help us identify potential anomalies and outliers.

Adjusting top_municipalities variable

ouedkniss <- ouedkniss %>% 
  mutate(Top_Municipalities_new = case_when(Municipality == "EL ACHOUR" ~ "El Achour",
                                            Municipality == "OULED FAYET" ~ "Ouled Fayet",
                                            Municipality == "HYDRA" ~ "Hydra",
                                            Municipality == "BIR KHADEM" ~ "Birkhadem",
                                            Municipality == "CHERAGA" ~ "Cheraga",
                                            Municipality == "ALGER" ~ "Alger-Centre",
                                            Municipality == "KOUBA" ~ "Kouba",
                                            Municipality == "BIR MOURAD RAIS" ~ "Bir Mourad Rais",
                                            Municipality == "MOHAMMADIA" ~ "Mohammadia",
                                            TRUE ~ "Other"))

Distribution of price for the modified top municipalities

ouedkniss %>% 
  group_by(Top_Municipalities_new) %>% 
  summarise(median_price = median(price_m2),
            quartile_3 = quantile(price_m2, probs = 0.75),
            p_90 = quantile(price_m2, probs = 0.90),
            p_93 = quantile(price_m2, probs = 0.93),
            p_95 = quantile(price_m2, probs = 0.95),
            p_99 = quantile(price_m2, probs = 0.99),
            max_price = max(price_m2),
            n = n()) %>% 
  arrange(desc(median_price))

## # A tibble: 10 x 9
##    Top_Municipalit~ median_price quartile_3   p_90   p_93   p_95   p_99
##    <chr>                   <dbl>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Hydra                 400000     461141. 4.98e5 5.00e5 5.04e5 5.20e5
##  2 Kouba                 269231.    291987. 3.07e5 3.13e5 3.18e5 3.52e5
##  3 Bir Mourad Rais       266667.    306667. 3.38e5 3.49e5 3.51e5 4.72e5
##  4 Mohammadia            253968.    276759. 3.14e5 3.27e5 3.35e5 3.45e5
##  5 El Achour             247059.    281929. 3.18e5 3.28e5 3.40e5 3.61e5
##  6 Cheraga               239130.    280000  3.11e5 3.31e5 3.44e5 3.93e5
##  7 Ouled Fayet           230000     259386. 2.91e5 3.11e5 3.43e5 3.61e5
##  8 Birkhadem             226237.    263278. 3.01e5 3.10e5 3.48e5 3.98e5
##  9 Alger-Centre          219298.    263068. 3.00e5 3.12e5 3.15e5 3.41e5
## 10 Other                 175000     218750  2.74e5 3.00e5 3.21e5 5.06e5
## # ... with 2 more variables: max_price <dbl>, n <int>

Removing outliers

We use our distribution tables to remove outliers and potential errors in our data by discarding very high values of price per meter squared.

for each top municipality that has high differences among the percentiles showed in our tables, we considered the highest rational percentile value as the maximum value.

Ouedkniss

ouedkniss <- ouedkniss %>% 
  filter(!(Top_Municipalities_new == "Other" & price_m2 > 321108.7))

Lkeria

lkeria <- lkeria %>% 
  filter(!(Top_Municipalities == "Cheraga" & price_m2 > 438750.0)) %>%
  filter(!(Top_Municipalities == "Mohammadia" & price_m2 > 398000.0)) %>%
  filter(!(Top_Municipalities == "Other" & price_m2 > 371063.0)) %>% 
  filter(!(Top_Municipalities=="Kouba" & price_m2 > 4e5)) #Haut Standing appartments

Trend by Month

Now that we have our data ready to work with, let’s explore some paths using multiple dimensions.

ouedkniss %>% 
  filter(Announcer.Type == "AGENCE") %>% 
  select(Date, price_m2, website) %>%
  mutate(Date = ymd(Date)) %>% 
  mutate(month = floor_date(Date, unit = "month")) %>% 
  bind_rows(lkeria %>% 
              select(Date, price_m2, website) %>% 
              filter(Date >= as.Date("2017-07-01")) %>% 
              mutate(month = floor_date(Date, unit = "month"))) %>% 
  group_by(month, website) %>% 
  summarise(median_price_m2 = median(price_m2)) %>% 
  ggplot(aes(x = month, y = median_price_m2, color = website)) +
  geom_line(size = 1)+
  labs(title = "The Evolution of Price/M2 by Month")

lkeria prices remain higher for almost all months in our time span.

Overall Price Comparison LKERIA VS. OUEDKNISS

We’ll draw a boxplot to compare the prices across both websites by number of rooms.

First of all, we need to combine our data

combined_data <- ouedkniss %>% 
  select(Nb.Room, price_m2, Top_Municipalities_new, Announcer.Type, website) %>% 
  rename(Top_Municipalities = Top_Municipalities_new) %>% 
  bind_rows(lkeria %>% 
              select(Nb.Room, price_m2, Top_Municipalities, Announcer.Type, website))

AGENCE VS. AGENCE

combined_data %>%
  filter(Announcer.Type %in% c("Agence immobilière", "AGENCE")) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Nb.Room) +
  labs(title = "AGENCE VS. AGENCE")

AGENCE VS. BUREAU D’AFFAIRE

combined_data %>%
  filter(Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE")) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Nb.Room) +
  labs(title = "AGENCE VS. BUREAU D'AFFAIRE")

AGENCE VS. PARTICULIER

combined_data %>%
  filter(Announcer.Type %in% c("Agence immobilière", "PARTICULIER")) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Nb.Room) +
  labs(title = "AGENCE VS. PARTICULIER")

Comments

from this graph, we notice that, on average, the housing prices in Lkeria are higher than those listed in Ouedkniss.

The overall median price values per meter squared in our sample data are:

combined_data %>% 
  group_by(Announcer.Type, website) %>% 
  summarise(med_price_m2 = median(price_m2)) %>% 
  spread(website, med_price_m2)

## # A tibble: 4 x 3
## # Groups:   Announcer.Type [4]
##   Announcer.Type      Lkeria ouedkniss
##   <chr>                <dbl>     <dbl>
## 1 AGENCE                 NA    202703.
## 2 Agence immobilière 242571.       NA 
## 3 BUREAU D'AFFAIRE       NA    212963.
## 4 PARTICULIER            NA    160256.

On average, comparing “agence” to “agence”, lkeria’s prices are higher by approximately +19%.

Statistical Significane

Well, some readers may be skeptical about the results, they may think it is due to a sampling effect. So In order to confirm these results we are going to draw 1000 different samples from both Lkeria and Ouedkniss and analyze the distribution of their median price delta.

ok_agencies <- ouedkniss %>% 
  filter(Announcer.Type == "AGENCE") %>% 
  select(price_m2) %>% 
  unlist()

ok_bur_aff <- ouedkniss %>% 
  filter(Announcer.Type == "BUREAU D'AFFAIRE") %>% 
  select(price_m2) %>% 
  unlist()

lkeria_agencies <- lkeria %>% 
  select(price_m2) %>% 
  unlist()

We create a function to get a 1000 samples with replacement from each dataset.

bootstrap <- function(x){
  #x is a vector
  replicate(1000, median(sample(x = x, size = 100, replace = TRUE)))
}

set.seed(123)
ok_medians <- bootstrap(ok_agencies)
ok_medians_bur_aff <- bootstrap(ok_bur_aff)
lkeria_medians <- bootstrap(lkeria_agencies)

Now, that we have our samples ready, let’s plot the differences

as.tibble(lkeria_medians - ok_medians) %>%
  ggplot(aes(x = value)) +
  geom_histogram() + 
  labs(title = "Distribution of median differences of price per m2",
       subtitle = "Lkeria (Agencies) VS. Ouedkniss (Agencies)", caption = "1000 samples with replacement") +
  xlab(label = "median price difference in DZD") +
  geom_vline(xintercept = median(lkeria_medians - ok_medians), size = 1, color = "red") +
  geom_label(aes(x = median(lkeria_medians - ok_medians), y = 0, label=round(median(lkeria_medians - ok_medians))))

it’s very clear that almost all of our delta distribution is positive with a median delta of 36949 Which means that on average Lkeria is more expensive by 36949 DZD per meter squared (comparing real estate agencies prices from both websites).

as.tibble(lkeria_medians - ok_medians_bur_aff) %>% 
  ggplot(aes(x = value)) +
  geom_histogram() + 
  labs(title = "Distribution of median differences of price per m2",
       subtitle = "Lkeria (Agencies) VS. Ouedkniss (Bureau D'affaire)", caption = "1000 samples with replacement") +
  xlab(label = "median price difference in DZD") +
  geom_vline(xintercept = median(lkeria_medians - ok_medians_bur_aff), size = 1, color = "blue") +
  geom_label(aes(x = median(lkeria_medians - ok_medians_bur_aff), y = 0, label=round(median(lkeria_medians - ok_medians_bur_aff))))

We notice the same thing for this graph, most values of our delta distribution is positive with a median delta of 26834 Which means that on average Lkeria is more expensive by 26834 DZD per meter squared (comparing real estate agencies to “bureau d’affaire” prices).

The t-test is another statistical technique that allow us to generalize our findings. We are going to test whether or not this difference in price is statistically significan setting the null hypothesis as The difference in price medians is equal to zero.

t.test(lkeria_agencies, ok_agencies, 
       mu = 0, 
       alternative = "two.sided",
       conf.level = 0.95,
       var.equal = FALSE,
       paired = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  lkeria_agencies and ok_agencies
## t = 6.1982, df = 432.66, p-value = 1.333e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  24733.30 47702.95
## sample estimates:
## mean of x mean of y 
##  247316.3  211098.2

t.test(lkeria_agencies, ok_bur_aff, 
       mu = 0, 
       alternative = "two.sided",
       conf.level = 0.95,
       var.equal = FALSE,
       paired = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  lkeria_agencies and ok_bur_aff
## t = 4.5407, df = 526.77, p-value = 6.956e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  16167.26 40824.07
## sample estimates:
## mean of x mean of y 
##  247316.3  218820.6

For both cases, we have a p-value that is less than 5% which means that this difference is so unlikely to happen randomly thus we reject the null hypothesis.

Overall Price Comparison by top municipalities LKERIA VS. OUEDKNISS

Now let’s see the variation of prices across different locations

F3

AGENCE vs. AGENCE (F3)

combined_data %>% 
  filter(Top_Municipalities != "Hydra",
         Announcer.Type %in% c("Agence immobilière", "AGENCE"),
         Nb.Room == 3) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Top_Municipalities, nrow = 2)+
  labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. AGENCE (F3)")

AGENCE vs. BUREAU D’AFFAIRE (F3)

combined_data %>% 
  filter(Top_Municipalities != "Hydra",
         Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE"),
         Nb.Room == 3) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Top_Municipalities, nrow = 2) +
  labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. BUREAU D'AFFAIRE (F3)")

AGENCE vs. PARTICULIER (F3)

combined_data %>% 
  filter(Top_Municipalities != "Hydra",
         Announcer.Type %in% c("Agence immobilière", "PARTICULIER"),
         Nb.Room == 3) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Top_Municipalities, nrow = 2) +
  labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. PARTICULIER (F3)")

F4

AGENCE vs. AGENCE (F4)

combined_data %>% 
  filter(Top_Municipalities != "Hydra",
         Announcer.Type %in% c("Agence immobilière", "AGENCE"),
         Nb.Room == 4) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Top_Municipalities, nrow = 2) +
  labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. AGENCE (F4)")

AGENCE vs. BUREAU D’AFFAIRE (F4)

combined_data %>% 
  filter(Top_Municipalities != "Hydra",
         Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE"),
         Nb.Room == 4) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Top_Municipalities, nrow = 2) +
  labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. BUREAU D'AFFAIRE (F4)")

AGENCE vs. PARTICULIER (F4)

combined_data %>% 
  filter(Top_Municipalities != "Hydra",
         Announcer.Type %in% c("Agence immobilière", "PARTICULIER"),
         Nb.Room == 4) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Top_Municipalities, nrow = 2) +
  labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. PARTICULIER (F4)")

F5

AGENCE vs. AGENCE (F5)

combined_data %>% 
  filter(Top_Municipalities != "Hydra",
         Announcer.Type %in% c("Agence immobilière", "AGENCE"),
         Nb.Room == 5) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Top_Municipalities, nrow = 2) +
  labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. AGENCE (F5)")

AGENCE vs. BUREAU D’AFFAIRE (F5)

combined_data %>% 
  filter(Top_Municipalities != "Hydra",
         Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE"),
         Nb.Room == 5) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Top_Municipalities, nrow = 2) +
  labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. BUREAU D'AFFAIRE (F5)")

AGENCE vs. PARTICULIER (F5)

combined_data %>% 
  filter(Top_Municipalities != "Hydra",
         Announcer.Type %in% c("Agence immobilière", "PARTICULIER"),
         Nb.Room == 5) %>% 
  ggplot(aes(x = website, y = price_m2, fill = website)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ Top_Municipalities, nrow = 2) +
  labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. PARTICULIER (F5)")

Comments

We notice from all the graphs above that the same pattern persists, lkeria has higher prices for most of municipalities.

Summary of prices

combined_data %>%
  mutate(Announcer.Type = ifelse(Announcer.Type == "Agence immobilière", "AGENCE", Announcer.Type)) %>% 
  group_by(Top_Municipalities, Nb.Room, Announcer.Type, website) %>% 
  summarise(median_price_m2 = median(price_m2)) %>% 
  spread(website, median_price_m2)

## # A tibble: 82 x 5
## # Groups:   Top_Municipalities, Nb.Room, Announcer.Type [82]
##    Top_Municipalities Nb.Room Announcer.Type    Lkeria ouedkniss
##    <chr>                <int> <chr>              <dbl>     <dbl>
##  1 Alger-Centre             3 AGENCE           202667.   230724.
##  2 Alger-Centre             3 BUREAU D'AFFAIRE     NA    220000 
##  3 Alger-Centre             3 PARTICULIER          NA    195588.
##  4 Alger-Centre             4 AGENCE           225000    250000 
##  5 Alger-Centre             4 BUREAU D'AFFAIRE     NA    231250 
##  6 Alger-Centre             4 PARTICULIER          NA    200000 
##  7 Alger-Centre             5 AGENCE           204069.   204384.
##  8 Alger-Centre             5 BUREAU D'AFFAIRE     NA    212692.
##  9 Alger-Centre             5 PARTICULIER          NA    192308.
## 10 Bir Mourad Rais          3 AGENCE           289057.   266667.
## # ... with 72 more rows

COMPARE LOCATIONS PRICES BY WEBSITE

let’s examine the same relationship as above from another angle.

AGENCE vs. AGENCE

combined_data %>% 
  filter(Announcer.Type %in% c("Agence immobilière", "AGENCE")) %>% 
  ggplot(aes(x = Top_Municipalities, y = price_m2, color = website)) +
  geom_boxplot() +
  labs(title = "The Variation of Price/M2 Across Top Municipalities", subtitle ="AGENCE VS. AGENCE")

AGENCE vs. BUREAU D’AFFAIRE

combined_data %>% 
  filter(Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE")) %>% 
  ggplot(aes(x = Top_Municipalities, y = price_m2, color = website)) +
  geom_boxplot() +
  labs(title = "The Variation of Price/M2 Across Top Municipalities", subtitle ="AGENCE VS. BUREAU D'AFFAIRE")

AGENCE vs. PARTICULIER

combined_data %>% 
  filter(Announcer.Type %in% c("Agence immobilière", "PARTICULIER")) %>% 
  ggplot(aes(x = Top_Municipalities, y = price_m2, color = website)) +
  geom_boxplot() +
  labs(title = "The Variation of Price/M2 Across Top Municipalities", subtitle ="AGENCE VS. PARTICULIER")

In-depth exploration of Ouedkniss data

For the rest of this exploration notebook, we are going to use only ouedkniss dataset as it has richer observations and variables

We start by checking the impact of other variables that we have on the price distribution

Announcer Type

ouedkniss %>% 
  filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>% 
  ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Announcer.Type)) +
  geom_boxplot() +
  labs(title = "The Distribution of Price/M2 By Announcer Type") +
  coord_flip()

Garage

ouedkniss %>% 
  filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>% 
  ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Garage)) +
  geom_boxplot() +
  labs(title = "The Distribution of Price/M2 By Garage Availability") +
  coord_flip()

Garden

ouedkniss %>% 
  filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>% 
  ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Garden)) +
  geom_boxplot() +
  labs(title = "The Distribution of Price/M2 Garden Availability") +
  coord_flip()

Furnished

ouedkniss %>% 
  filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>% 
  ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Furnished)) +
  geom_boxplot() +
  labs(title = "The Distribution of Price/M2 By Furnished Availability") +
  coord_flip()

Promise

ouedkniss %>% 
  filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>% 
  ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Promise)) +
  geom_boxplot() +
  labs(title = "The Distribution of Price/M2 By Promise Category") +
  coord_flip()

Floor number

ouedkniss %>% 
  filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR", !is.na(Nb.Floor)) %>% 
  mutate(Nb.Floor = case_when(between(Nb.Floor, 1, 3) ~ "1_3",
                              between(Nb.Floor, 4, 6) ~ "4_6",
                              between(Nb.Floor, 7, 10) ~ "7_10",
                              TRUE ~ "sup_10")) %>% 
  ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Nb.Floor)) +
  geom_boxplot() +
  labs(title = "The Distribution of Price/M2 By Floor Number Segment") +
  coord_flip()

Comments

From the Announcer Type graph, we can see that on average, in some regions, “AGENCE” has the highest prices. In other regions, “BUREAU D’AFFAIRE” are more expensive than others, we can suspect that brokers are active in those regions
In the second graph, properties that have garage are more expensive for most of the municipalities which does make sense.
For the garden graph, the price distributions varies from one location to another, we can’t define a clear pattern here.
The same remark holds for the next couple of graphs. In addition, we don’t have many properties that are labeled with those features, thus, we can not say much about them.
From the final graph, generally speaking, properties with the first few floors have higher prices than others.

Numeric summary

The table below shows that bureau d’affaire prices are slightly higher than housing agencies (“AGENCE”) prices. The individual offers (“PARTICULIER”) have lower prices.

ouedkniss %>% 
  group_by(Announcer.Type) %>% 
  summarise(median_price_m2 = round(median(price_m2)))

## # A tibble: 3 x 2
##   Announcer.Type   median_price_m2
##   <fct>                      <dbl>
## 1 AGENCE                    202703
## 2 BUREAU D'AFFAIRE          212963
## 3 PARTICULIER               160256

The median price values for those categories are 212963, 212963, 160256 DZD respectively. So on average, “BUREAU D’AFFAIRE” prices are more expensive by +5% compared to “AGENCE” and by +32% compared to “PARTICULIER”.

MODELISATION

In the previous section, we explored some relationships within our data to get a feel about what could influence the price variable.

We are going to use what we found to build a predictive model that tries to give an approximate selling/buying price that can be used by the real estate agencies and individuals willing to trade their properties.

Data Preparation

Before we build our model, we’re going to do some cleansing on ouedkniss data to improve the training phase.

data_model <- ouedkniss %>% 
  select(Month, Municipality, Nb.Floor, Nb.Room, Garage, 
         Garden, Furnished, Promise, New.Project, Payment, Area, Price.value.dzd, Announcer.Type, price_m2) %>% 
  filter(price_m2>= 100000  & price_m2<=400000) %>% 
  filter(!is.na(Area) & Area>=30 & Area <300) %>%
  mutate_at(vars(Garage, Garden, Furnished, Promise, New.Project), ~ ifelse(is.na(.)==TRUE, F, .)) %>% 
  mutate(Payment = factor(ifelse(is.na(Payment), "comptant", as.character(Payment)))) %>%
  mutate(Nb.Room = factor(Nb.Room),
         Nb.Floor = case_when(Nb.Floor >= 15 ~ 15,
                              is.na(Nb.Floor) ~ 1,
                              TRUE ~ as.numeric(Nb.Floor)))

We selected few variables with potential predictive power.
we expect the minimum price per meter squared to be higher than 100000 DZD and we discarded values above 400000 DZD.
The same logic as above holds, we consider only properties with Area between 30m2 and 300m2.
The missing values for Garage, Garden, Furnished, Promise, New.Project variables has been replaced with FALSE. We expect that if a property contains one or more of these features, the owner would mention it/them to increase the selling price.

Some levels within Municipality variable have very low instances.

data_model %>% 
  group_by(Municipality) %>% 
  count() %>% 
  arrange(n)

## # A tibble: 55 x 2
## # Groups:   Municipality [55]
##    Municipality     n
##    <fct>        <int>
##  1 KHRAISSIA        1
##  2 OULED CHEBEL     1
##  3 EL MAGHARIA      2
##  4 EL MARSA         2
##  5 HARAOUA          2
##  6 CASBAH           3
##  7 EL MADANIA       3
##  8 EL MOURADIA      3
##  9 BOUROUBA         4
## 10 OUED SMAR        4
## # ... with 45 more rows

To avoid problmes related to those levels when splitting the data, we are going to regroup those instances to their close municipalities according to their location.

data_model <- data_model %>% 
  mutate(Municipality = as.character(Municipality)) %>% 
  mutate(Municipality = case_when(Municipality %in% c("KHRAISSIA","OULED CHEBEL") ~ "BIRTOUTA",
                                  Municipality == "TASSALA EL MERDJA" ~ "DOUERA",
                                  Municipality == "HARAOUA" ~ "AIN TAYA",
                                  Municipality == "EL MARSA" ~ "BORDJ EL BAHRI",
                                  Municipality %in% c("EL MAGHARIA", "EL MADANIA") ~ "HUSSEIN DEY",
                                  Municipality == "EL MOURADIA" ~ "BIR MOURAD RAIS",
                                  Municipality == "CASBAH" ~ "BAB EL OUED",
                                  Municipality == "RAIS HAMIDOU" ~ "BOLOGHINE IBN ZIRI",
                                  Municipality == "OUED SMAR" ~ "DAR EL BEIDA",
                                  Municipality == "BOUROUBA" ~ "BACH DJERRAH",
                                  Municipality == "BEN AKNOUN" ~ "EL BIAR",
                                  TRUE ~ as.character(Municipality))) %>% 
  mutate(Municipality = as.factor(Municipality))

Linear regression

Now that our data is ready to empower our machine, let’s build the model using caret package!

Let’s start by splitting our data by putting 60% of our data into a training set then we use the remaining observations to test our model.

library(caret)

set.seed(100)
trainIndex <- createDataPartition(data_model$Price.value.dzd, p = 0.6, list = FALSE)
head(trainIndex)

##      Resample1
## [1,]         1
## [2,]         3
## [3,]         4
## [4,]         5
## [5,]         7
## [6,]        14

train_set <- data_model[trainIndex, ]
test_set <- data_model[-trainIndex, ]

let’s train our model

We start first by building a linear regression model for the sake of interpretability.

lm_model <- lm(formula = Price.value.dzd ~ ., data = select(train_set, -price_m2, -Month))
summary(lm_model)

## 
## Call:
## lm(formula = Price.value.dzd ~ ., data = select(train_set, -price_m2, 
##     -Month))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -18738963  -2531505    -32187   2385183  24604954 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1169163    1163782  -1.005 0.315349    
## MunicipalityAIN TAYA           -2668597    1674778  -1.593 0.111422    
## MunicipalityALGER               3581710    1000769   3.579 0.000363 ***
## MunicipalityBAB EL OUED         -763558    1366584  -0.559 0.576482    
## MunicipalityBABA HASSEN         -280587    1483028  -0.189 0.849980    
## MunicipalityBACH DJERRAH         768132    1567007   0.490 0.624119    
## MunicipalityBAINS ROMAINS      -1511836    2516916  -0.601 0.548211    
## MunicipalityBARAKI             -4026785    1750855  -2.300 0.021683 *  
## MunicipalityBEB EZZOUAR         3559486    1046165   3.402 0.000697 ***
## MunicipalityBENI MESSOUS       -1788032    1496247  -1.195 0.232398    
## MunicipalityBIR KHADEM          3552127    1146940   3.097 0.002015 ** 
## MunicipalityBIR MOURAD RAIS     8385380    1143071   7.336 4.92e-13 ***
## MunicipalityBIRTOUTA           -3864393    1319347  -2.929 0.003486 ** 
## MunicipalityBOLOGHINE IBN ZIRI -1429888    1981725  -0.722 0.470767    
## MunicipalityBORDJ EL BAHRI     -3335186    1069311  -3.119 0.001873 ** 
## MunicipalityBORDJ EL KIFFAN    -3021372    1233205  -2.450 0.014475 *  
## MunicipalityBOUZAREAH           1839799    1396623   1.317 0.188068    
## MunicipalityCHERAGA             5284597    1070878   4.935 9.56e-07 ***
## MunicipalityDAR EL BEIDA         518404    1612348   0.322 0.747890    
## MunicipalityDELY BRAHIM         5516829    1214902   4.541 6.36e-06 ***
## MunicipalityDJISR KSENTINA       184661    1073901   0.172 0.863513    
## MunicipalityDOUERA             -3230142    1521836  -2.123 0.034066 *  
## MunicipalityDRARIA              2065332    1195831   1.727 0.084492 .  
## MunicipalityEL ACHOUR           5530200    1097950   5.037 5.72e-07 ***
## MunicipalityEL BIAR             3521785    1607956   2.190 0.028764 *  
## MunicipalityEL HARRACH         -1283237    2289288  -0.561 0.575251    
## MunicipalityHAMMA ANASSERS      1941682    2014527   0.964 0.335386    
## MunicipalityHUSSEIN DEY         2026097    1796475   1.128 0.259698    
## MunicipalityHYDRA              16970207    1670840  10.157  < 2e-16 ***
## MunicipalityKOUBA               6405146    1269832   5.044 5.51e-07 ***
## MunicipalityLES EUCALYPTUS     -3268473    1350010  -2.421 0.015672 *  
## MunicipalityMAHELMA            -3370482    1877165  -1.796 0.072908 .  
## MunicipalityMOHAMMADIA          5276891    1565910   3.370 0.000784 ***
## MunicipalityOULED FAYET         3664602    1428293   2.566 0.010457 *  
## MunicipalityREGHAIA            -2391389    1320321  -1.811 0.070441 .  
## MunicipalityROUIBA              1305363    1522973   0.857 0.391610    
## MunicipalitySAOULA             -1424577    1668664  -0.854 0.393486    
## MunicipalitySIDI MHAMED         1200647    2516813   0.477 0.633442    
## MunicipalitySIDI MOUSSA        -5729552    1663800  -3.444 0.000600 ***
## MunicipalitySOUIDANIA          -3497900    1308715  -2.673 0.007659 ** 
## MunicipalitySTAOUELI            5316223    1674285   3.175 0.001548 ** 
## MunicipalityZERALDA            -1648228    1350925  -1.220 0.222757    
## Nb.Floor                        -166108      75638  -2.196 0.028341 *  
## Nb.Room4                         284023     422023   0.673 0.501118    
## Nb.Room5                        -400337     750105  -0.534 0.593676    
## GarageTRUE                      2901367     525201   5.524 4.33e-08 ***
## GardenTRUE                      -693354     551350  -1.258 0.208879    
## FurnishedTRUE                   1058825     695119   1.523 0.128053    
## PromiseTRUE                      240304     693961   0.346 0.729214    
## New.ProjectTRUE                 -270667     541598  -0.500 0.617369    
## Paymenttranches                -2972192     919206  -3.233 0.001268 ** 
## Area                             207488       9019  23.007  < 2e-16 ***
## Announcer.TypeBUREAU D'AFFAIRE   141844     406711   0.349 0.727352    
## Announcer.TypePARTICULIER      -1098526     391826  -2.804 0.005162 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4721000 on 898 degrees of freedom
## Multiple R-squared:  0.763,  Adjusted R-squared:  0.749 
## F-statistic: 54.55 on 53 and 898 DF,  p-value: < 2.2e-16

we can get some insights from the output above:

Although the technique used is simple, we’re still have a pretty decent Adjusted R-squared that is equal to 74.9% which means that the variations of our input variables explain about 74% the variations of the price (the response variable).

We have also many significant variables in our model, we can interpret a few of them.

This model suggests that:

The Area variable (very significant one (3 stars)) has a positive coefficient 207 488. it means that an increase in one unit of this variable, i.e one meter squared, will raise, on average, the price by 207 488 DZD.
The inclusion of “Garage” will raise the price on average by 2 901 367 DZD.
The price of an appartment that is located in “BARAKI” is cheaper, on average, by 4 026 785 DZD.
The price of an appartment that is located in “HYDRA” is more expensive, on average, by 16 970 207 DZD.
Buying a property from a individual owner “PARTICULIER”, would save you 1 098 526 DZD on average.

Now, let’s make predictions and test our model:

lm_pred <- predict(lm_model, newdata = select(test_set, -price_m2, -Month, -Price.value.dzd))

lmValues <- data.frame(obs = test_set$Price.value.dzd, pred = lm_pred)
defaultSummary(lmValues)

##         RMSE     Rsquared          MAE 
## 4.783036e+06 7.370490e-01 3.497527e+06

We notice from our output a small drop in the Out-of-sample Rsquared (from 0.763 to 0.737) which suggests that our model performs well on unseen data.
Both Root Mean Squared Error (RMSE) and Mean Absolute Error are in the same unit as our Dependent variable (The price of property) and they indicate the average model prediction error. We’d like to consider MAE over RMSE as RMSE is more sensible to extreme values. Thus, we have an average error of 3 497 527 DZD which is relatively high considering the scale of our response variable.

Random Forest

Let’s see if we can do even better using a more advanced supervized technique, the Random Forests.

library(randomForest)

set.seed(555)
rf_model <- randomForest(formula = Price.value.dzd ~ ., 
                         data = select(train_set, -price_m2, -Month), 
                         ntree = 500, #number of trees to train
                         node.size = 25, #the size of final nodes
                         mtry = 8) #number of random independent variables to use
rf_model

## 
## Call:
##  randomForest(formula = Price.value.dzd ~ ., data = select(train_set,      -price_m2, -Month), ntree = 500, node.size = 25, mtry = 8) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##           Mean of squared residuals: 2.457137e+13
##                     % Var explained: 72.3

rf_pred <- predict(rf_model, newdata = select(test_set, -price_m2, -Month, -Price.value.dzd))

rfValues <- data.frame(obs = test_set$Price.value.dzd, pred = rf_pred)
defaultSummary(rfValues)

##         RMSE     Rsquared          MAE 
## 4.933058e+06 7.182500e-01 3.516278e+06

Comparing the evaluation metrics values above to the ones of the previous model, we notice that random forest didn’t outperform the linear regression.

Housing Analysis Part II: Webscraping, EDA & Modelisation

Webscraping data from lkeria website:

Loading Packages:

Create functions to capture informations:

Scrape the data:

Getting detailed informations within listings:

Cleaning the data:

Exploratory Data Analysis:

Loading the data

Structure

Lkeria

Ouedkniss

Reformatting Ouedkniss dataset

Reformatting Lkeria data

Distribution of price by top municipalities of ouedkniss

Distribution of price by top municipalities of lkeria

Adjusting top_municipalities variable

Distribution of price for the modified top municipalities

Removing outliers

Ouedkniss

Lkeria

Trend by Month

Overall Price Comparison LKERIA VS. OUEDKNISS

AGENCE VS. AGENCE

AGENCE VS. BUREAU D’AFFAIRE

AGENCE VS. PARTICULIER

Comments

Statistical Significane

Overall Price Comparison by top municipalities LKERIA VS. OUEDKNISS

F3

AGENCE vs. AGENCE (F3)

AGENCE vs. BUREAU D’AFFAIRE (F3)

AGENCE vs. PARTICULIER (F3)

F4

AGENCE vs. AGENCE (F4)

AGENCE vs. BUREAU D’AFFAIRE (F4)

AGENCE vs. PARTICULIER (F4)

F5

AGENCE vs. AGENCE (F5)

AGENCE vs. BUREAU D’AFFAIRE (F5)

AGENCE vs. PARTICULIER (F5)

Comments

Summary of prices

COMPARE LOCATIONS PRICES BY WEBSITE

AGENCE vs. AGENCE

AGENCE vs. BUREAU D’AFFAIRE

AGENCE vs. PARTICULIER

In-depth exploration of Ouedkniss data

Announcer Type

Garage

Garden

Furnished

Promise

Floor number

Comments

Numeric summary

MODELISATION

Data Preparation

Linear regression

Random Forest