We have seen in Part I made by @Fateh Bekioua the problematic that we are trying to adress. An intro video regarding our subject of interest is available here(english subtitles are not available yet.)
In addition, we have seen how data from Ouedkniss Website was gathered using webscraping techniques.
In this part, we first are going to scrape the data related to Lkeria Website and then we’ll explore our data and try to answer our questions about it.
This website allows only real estate agencies to list there properties.
library(rvest)
library(tidyverse)
library(lubridate)
get_links <- function(i=1){
webpage <- read_html(paste0("https://www.lkeria.com/annonces/immobilier/vente-algerie-P", i))
dates <- html_text(html_nodes(webpage, xpath = "//span[@class=\"property-label hidden-xs\"]"))[-c(1:3)]
urls <- html_attr(html_nodes(webpage, xpath = "//a[@rel=\"noopener\"]"), "href")[-1]
as.data.frame(cbind(urls, dates))
}
If we try to filter listings on “Algiers” only the URL doesn’t change so the idea is to capture all the sales listings within this website no matter the region.
For dates variable within get_links() function, there was a pattern where the first three observations were “premium” listings that they are repeated across all pages so we excluded them using [-c(1:3)].(this number is subject to change as the website evolves)
Same notice for urls variable where the first observation was irrelevant information.
get_all <- function(nbr.pages=1){
all <- NULL
for (k in 1:nbr.pages){
all <- rbind(all, get_links(i = k))
}
all
}
The get_all function serves to gather all pages listing in one single dataframe.
df <- get_all(185)
algiers_data <- df %>%
filter(str_detect(urls, "Alger")==TRUE) %>%
mutate(urls = paste0("https:",urls))
We scraped all the existing 185 pages at that time then we filtered on “Alger” to select the listings of Algiers city. We concatenated also “https:” to our urls to make them accessible.
Next, we move to getting relevant informations within properties listings for our analysis .
pages <- map(.x = algiers_data$urls, possibly(.%>%read_html(.), NA_real_))
type <- map(pages,possibly(. %>%
html_nodes(".breadcrumb li:nth-child(3) a") %>%
html_text()
,NA_real_)) %>%
unlist()
district <- map(pages,possibly(. %>%
html_nodes(".active") %>%
html_text()
,NA_real_)) %>%
unlist()
surface <- map(pages,possibly(. %>%
html_nodes(".resume li:nth-child(1)") %>%
html_text()
,NA_real_)) %>%
unlist()
nbr_rooms <- map(pages,possibly(. %>%
html_nodes(".resume li+ li") %>%
html_text()
,NA_real_)) %>%
unlist()
description <- map(pages,possibly(. %>%
html_nodes("#bien") %>%
html_text()
,NA_real_)) %>%
unlist()
price <- map(pages,possibly(. %>%
html_nodes(".amount b") %>%
html_text()
,NA_real_)) %>%
unlist()
price_type <- map(pages,possibly(. %>%
html_nodes(".has:nth-child(1)") %>%
html_text()
,NA_real_)) %>%
unlist()
agency_name <- map(pages,possibly(. %>%
html_nodes("#contact span:nth-child(1)") %>%
html_text()
,NA_real_)) %>%
unlist()
details <- map(pages,possibly(. %>%
html_nodes("section+ section") %>%
html_text() %>%
str_sub(start = 11, end = -3)
,NA_real_)) %>%
unlist()
algiers_raw <- as_tibble(cbind(algiers_data, type, district, surface, nbr_rooms, description, price, agency_name, details))
#write_csv(algiers_raw, "algiers_raw.csv")
For this task, we used map() funtion from purrr package to iterate over all our urls alongside with Selector Gadget to select the desired information.
algiers_raw <- read_csv("algiers_raw.csv")
Now that we have the raw data for “Algiers”, we need to do more cleaning on it.
we define first a function to get the number of floor.
get_floor_nbr <- function(string){
result <- NULL
for (j in 1:length(string)){
if (is.na(string[j])== TRUE){
result[j] <- NA
} else {
if (str_detect(string[j], "Etage :") == TRUE){
loc <- str_locate(string[j], "Etage :")[,2]
result[j] <- str_split(str_sub(string[j], loc+1), " ")[[1]][1]
} else {
result[j] <- NA
}
}
}
return(result)
}
Then, we make additional adjustments.
algiers <- algiers_raw %>%
mutate(dates = case_when(dates %in% c("Hier", "Aujourd'hui") ~ "NA",
TRUE ~ dates), #replace Hier and Aujourd'hui values by NA
surface = as.numeric(str_sub(surface, end = -4)), #remove the last 3 characters
nbr_rooms = as.numeric(str_sub(nbr_rooms, end = -8)), #remove the last 7 characters
floor_nbr = get_floor_nbr(details), #use our defined function
type = factor(type)) %>%
mutate(dates = dmy(dates = str_sub(dates, start = 4)),
price = ifelse(str_detect(price, "m2") == TRUE ,
as.numeric(str_split_fixed(price, " ", 2)[, 1]) * surface * 1e4,
ifelse(str_detect(price, "Milliards") == TRUE,
as.numeric(str_split_fixed(price, " ", 2)[, 1]) * 1e7,
as.numeric(str_split_fixed(price, " ", 2)[, 1]) * 1e4))) %>%
mutate(dates = zoo::na.locf(dates, na.rm =FALSE)) #replace NAs with last obs carried forward
#write_csv(algiers, "algiers_lkeria.csv")
The most important reformatting above is to deal with the different scenarios on how the price appears, in this case we used a nested ifelse() function:
price contains “m2”, it’s price per meter squared so the final price would be numeric number x surface x 1e4 considering price per meter cannot exceed 1e5.
price contains “Milliards”, which corresponds to 1e7 DZD.
final case, price doesn’t contain “m2” nor “Milliards”, in other words the price is very likely expressed in “millions”, the price would be numeric number x 1e4
Now that we have the data from both websites ouedkniss and lkeria let’s explore it.
lkeria <- read_csv("algiers_lkeria.csv")
ouedkniss <- as_tibble(read.csv("Housing_Data26072018.csv"))
As a first step let’s have an overview of our data.
glimpse(lkeria)
## Observations: 1,166
## Variables: 11
## $ urls <chr> "https://www.lkeria.com/annonces/immobilier/Vente-...
## $ dates <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ type <chr> "Terrain", "Terrain", "Appartement", "Villa", "Ter...
## $ district <chr> "Khraicia", "Douera", "Draria", "Kouba", "Kouba", ...
## $ surface <dbl> 600, 220, 76, 234, 0, 75, 84, 88, 65, 255, 100, 15...
## $ nbr_rooms <int> 0, 0, 2, 0, 0, 3, 4, 3, 3, 0, 4, 4, 0, 0, 0, 7, 11...
## $ description <chr> "Description du bien\nVente Terrain Khraicia Vend ...
## $ price <dbl> 42000000, 16500000, 16000000, 50000000, 160000000,...
## $ agency_name <chr> "Agence immobilière Hakem Agence Immobilière", "Ag...
## $ details <chr> "Prix Négociable Surface :600 m2", "Prix Négociabl...
## $ floor_nbr <chr> NA, NA, NA, NA, NA, NA, NA, "1", NA, NA, NA, NA, N...
glimpse(ouedkniss)
## Observations: 2,152
## Variables: 50
## $ Links <fct> https://www.ouedkniss.com/vente-ap...
## $ Title <fct> Vente Appartement F2 Alger Belouiz...
## $ Description <fct> NA, J'ai une grande,propre et top ...
## $ Store.Name <fct> NA, NA, NA, NA, MS.Immobilier, NA,...
## $ Pseudo <fct> youyayouya, F123kniss, megatrain21...
## $ Store.Addresse <fct> NA, NA, NA, NA, Alger Bab ezzouar ...
## $ Date <fct> 2018-07-26, 2018-07-26, 2018-07-26...
## $ Nb.Floor <int> NA, 3, 4, 1, 4, 3, 2, 3, 3, NA, 4,...
## $ Nb.Room <int> 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3...
## $ Nb.Views <int> 119, NA, 33, 381, 9, 39, 15, 17, 1...
## $ ID.Offer <int> 16427845, 16340043, 16600827, 1651...
## $ Price <fct> 860 Millions Négociable, 1 Million...
## $ District <fct> belcourt, Centre ville, Fatma nsou...
## $ Specifics <fct> Electricité , Gaz , Eau , Acte ...
## $ Area <int> 48, 70, 65, 42, 97, 70, 100, 136, ...
## $ Price.value <dbl> 860.0, 1.0, 7.0, 900.0, 2.4, 750.0...
## $ Price.unit <fct> Millions, Millions, Millions, Mill...
## $ Price.desc <fct> Négociable, Négociable, Négociable...
## $ Price.value.dzd <dbl> 8600000, 10000000, 70000000, 90000...
## $ Price.rules <fct> Price.value * 10000, Price.value *...
## $ Announcer.Name <fct> youyayouya, F123kniss, megatrain21...
## $ Announcer.Type <fct> PARTICULIER, PARTICULIER, PARTICUL...
## $ Hour <int> 19, 19, 19, 21, 22, 16, 19, 20, 20...
## $ Month <fct> Jul, Jul, Jul, Jul, Jul, Jul, Jul,...
## $ Municipality <fct> HAMMA ANASSERS, BIRTOUTA, DAR EL B...
## $ Garage <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ Garden <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ Furnished <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, ...
## $ Promise <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, ...
## $ New.Project <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, ...
## $ Payment <fct> comptant, comptant, comptant, comp...
## $ Top_Municipalities <fct> Other, Other, Other, DJISR KSENTIN...
## $ Store.Municipality <fct> NA, NA, NA, NA, BEB EZZOUAR, NA, E...
## $ commune.names <fct> Algiers, HAMMA ANASSERS, Algiers, ...
## $ lon <dbl> 3.058756, 3.047605, 3.228198, 3.07...
## $ lat <dbl> 36.75377, 36.64609, 36.70601, 36.6...
## $ type <fct> locality, locality, locality, loca...
## $ loctype <fct> approximate, approximate, approxim...
## $ address <fct> algiers, sidi m'hamed, algeria, bi...
## $ north <dbl> 36.75878, 36.66857, 36.73901, 36.7...
## $ south <dbl> 36.74887, 36.61635, 36.66587, 36.6...
## $ east <dbl> 3.061624, 3.071880, 3.259807, 3.11...
## $ west <dbl> 3.054058, 2.947340, 3.163548, 3.04...
## $ locality <fct> Algiers, Birtouta, Dar El Beïda, D...
## $ locality.1 <fct> Sidi M'Hamed, NA, NA, NA, NA, NA, ...
## $ administrative_area_level_1 <fct> Algiers Province, Algiers Province...
## $ country <fct> Algeria, Algeria, Algeria, Algeria...
## $ political <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Store.lon <dbl> NA, NA, NA, NA, 3.185497, NA, 3.06...
## $ Store.lat <dbl> NA, NA, NA, NA, 36.72063, NA, 36.7...
ouedkniss <- ouedkniss %>%
select(ID.Offer, Date, Hour, Month, Municipality, lon, lat,
Nb.Floor:Nb.Views, Area, Garage:Payment, Price.value.dzd, District, Top_Municipalities,
Announcer.Name, Announcer.Type, Store.Municipality, Store.lon, Store.lat) %>%
filter(Nb.Room %in% c(3:5), Announcer.Type != "PROMOTEUR") %>%
mutate(website = "ouedkniss",
price_m2 = Price.value.dzd/Area)
lkeria <- lkeria %>%
filter(type == "Appartement") %>%
mutate(Nb.floor = as.integer(floor_nbr),
Top_Municipalities = fct_lump(district, 9),
Announcer.Type = word(agency_name, 1, 2),
price_m2 = price/surface,
Garage = str_detect(details, "Garage"),
Garden = str_detect(details, "Jardin"),
Furnished = str_detect(details, "Meublé"),
website = "Lkeria") %>%
rename(Date = dates,
Nb.Room = nbr_rooms,
Municipality = district,
Area = surface,
Price.value.dzd = price) %>%
filter(Nb.Room %in% c(3:5), Announcer.Type != "Promoteur immobilier") %>%
filter(!(is.na(price_m2)), !(price_m2 == Inf))
ouedkniss %>%
group_by(Top_Municipalities) %>%
summarise(median_price = median(price_m2),
quartile_3 = quantile(price_m2, probs = 0.75),
p_90 = quantile(price_m2, probs = 0.90),
p_93 = quantile(price_m2, probs = 0.93),
p_95 = quantile(price_m2, probs = 0.95),
p_99 = quantile(price_m2, probs = 0.99),
max_price = max(price_m2),
n = n()) %>%
arrange(desc(median_price))
## # A tibble: 11 x 9
## Top_Municipalit~ median_price quartile_3 p_90 p_93 p_95 p_99
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 DELY BRAHIM 259259. 315534. 3.81e5 4.02e5 4.30e5 4.58e5
## 2 EL ACHOUR 247059. 281929. 3.18e5 3.28e5 3.40e5 3.61e5
## 3 CHERAGA 239130. 280000 3.11e5 3.31e5 3.44e5 3.93e5
## 4 BEB EZZOUAR 229885. 259376. 2.77e5 2.86e5 2.90e5 3.05e5
## 5 BIR KHADEM 226237. 263278. 3.01e5 3.10e5 3.48e5 3.98e5
## 6 ALGER 219298. 263068. 3.00e5 3.12e5 3.15e5 3.41e5
## 7 DJISR KSENTINA 183375. 200000 2.22e5 2.25e5 2.36e5 2.94e5
## 8 Other 183333. 247312. 3.20e5 3.50e5 3.86e5 5.49e5
## 9 BAB EL OUED 172727. 198361. 2.10e5 2.21e5 2.30e5 4.08e5
## 10 BORDJ EL BAHRI 142308. 175000 1.95e5 2.11e5 2.17e5 2.38e5
## 11 BORDJ EL KIFFAN 119158. 150610. 1.80e5 2.00e5 2.18e5 5.40e5
## # ... with 2 more variables: max_price <dbl>, n <int>
lkeria %>%
group_by(Top_Municipalities) %>%
summarise(median_price = median(price_m2),
quartile_3 = quantile(price_m2, probs = 0.75),
p_90 = quantile(price_m2, probs = 0.90),
p_93 = quantile(price_m2, probs = 0.93),
p_95 = quantile(price_m2, probs = 0.95),
p_99 = quantile(price_m2, probs = 0.99),
max_price = max(price_m2),
n = n()) %>%
arrange(desc(median_price))
## # A tibble: 10 x 9
## Top_Municipalit~ median_price quartile_3 p_90 p_93 p_95 p_99
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hydra 412879. 475815. 5.02e5 5.08e5 5.26e5 5.67e5
## 2 Kouba 326250 468750 4.84e5 4.98e5 5.42e5 6.42e5
## 3 El Achour 282609. 327448. 3.50e5 3.50e5 3.57e5 3.91e5
## 4 Cheraga 264935. 339298. 3.75e5 4.04e5 4.39e5 2.45e6
## 5 Mohammadia 256944. 281401. 3.98e5 6.63e5 8.39e5 1.19e6
## 6 Bir Mourad Rais 256410. 276070. 3.19e5 3.34e5 3.38e5 3.48e5
## 7 Birkhadem 254545. 353611. 3.81e5 3.81e5 3.84e5 3.89e5
## 8 Ouled Fayet 239015. 271504. 2.90e5 2.94e5 3.18e5 3.71e5
## 9 Alger-Centre 213833. 279375 2.99e5 3.46e5 3.90e5 4.78e5
## 10 Other 200000 250000 3.33e5 3.58e5 3.71e5 1.57e6
## # ... with 2 more variables: max_price <dbl>, n <int>
For ouedkniss data, we selected 25 relevent variables out of 50 for our analysis.
We reformatted some variables in lkeria dataset so it can be combined with ouedkniss dataset.
We notice from the code above that not all Top Municpalities are in common for both datasets and since we have fewer data for Lkeria we’ll also adapt Ouedkniss’s Top Municipalities to match those of Lkeria.
we created a new variable price per meter squared, this will be the subject of our coming comparisons
we showed distributions of price per m2 for the top municipalities to help us identify potential anomalies and outliers.
ouedkniss <- ouedkniss %>%
mutate(Top_Municipalities_new = case_when(Municipality == "EL ACHOUR" ~ "El Achour",
Municipality == "OULED FAYET" ~ "Ouled Fayet",
Municipality == "HYDRA" ~ "Hydra",
Municipality == "BIR KHADEM" ~ "Birkhadem",
Municipality == "CHERAGA" ~ "Cheraga",
Municipality == "ALGER" ~ "Alger-Centre",
Municipality == "KOUBA" ~ "Kouba",
Municipality == "BIR MOURAD RAIS" ~ "Bir Mourad Rais",
Municipality == "MOHAMMADIA" ~ "Mohammadia",
TRUE ~ "Other"))
ouedkniss %>%
group_by(Top_Municipalities_new) %>%
summarise(median_price = median(price_m2),
quartile_3 = quantile(price_m2, probs = 0.75),
p_90 = quantile(price_m2, probs = 0.90),
p_93 = quantile(price_m2, probs = 0.93),
p_95 = quantile(price_m2, probs = 0.95),
p_99 = quantile(price_m2, probs = 0.99),
max_price = max(price_m2),
n = n()) %>%
arrange(desc(median_price))
## # A tibble: 10 x 9
## Top_Municipalit~ median_price quartile_3 p_90 p_93 p_95 p_99
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hydra 400000 461141. 4.98e5 5.00e5 5.04e5 5.20e5
## 2 Kouba 269231. 291987. 3.07e5 3.13e5 3.18e5 3.52e5
## 3 Bir Mourad Rais 266667. 306667. 3.38e5 3.49e5 3.51e5 4.72e5
## 4 Mohammadia 253968. 276759. 3.14e5 3.27e5 3.35e5 3.45e5
## 5 El Achour 247059. 281929. 3.18e5 3.28e5 3.40e5 3.61e5
## 6 Cheraga 239130. 280000 3.11e5 3.31e5 3.44e5 3.93e5
## 7 Ouled Fayet 230000 259386. 2.91e5 3.11e5 3.43e5 3.61e5
## 8 Birkhadem 226237. 263278. 3.01e5 3.10e5 3.48e5 3.98e5
## 9 Alger-Centre 219298. 263068. 3.00e5 3.12e5 3.15e5 3.41e5
## 10 Other 175000 218750 2.74e5 3.00e5 3.21e5 5.06e5
## # ... with 2 more variables: max_price <dbl>, n <int>
We use our distribution tables to remove outliers and potential errors in our data by discarding very high values of price per meter squared.
for each top municipality that has high differences among the percentiles showed in our tables, we considered the highest rational percentile value as the maximum value.
ouedkniss <- ouedkniss %>%
filter(!(Top_Municipalities_new == "Other" & price_m2 > 321108.7))
lkeria <- lkeria %>%
filter(!(Top_Municipalities == "Cheraga" & price_m2 > 438750.0)) %>%
filter(!(Top_Municipalities == "Mohammadia" & price_m2 > 398000.0)) %>%
filter(!(Top_Municipalities == "Other" & price_m2 > 371063.0)) %>%
filter(!(Top_Municipalities=="Kouba" & price_m2 > 4e5)) #Haut Standing appartments
Now that we have our data ready to work with, let’s explore some paths using multiple dimensions.
ouedkniss %>%
filter(Announcer.Type == "AGENCE") %>%
select(Date, price_m2, website) %>%
mutate(Date = ymd(Date)) %>%
mutate(month = floor_date(Date, unit = "month")) %>%
bind_rows(lkeria %>%
select(Date, price_m2, website) %>%
filter(Date >= as.Date("2017-07-01")) %>%
mutate(month = floor_date(Date, unit = "month"))) %>%
group_by(month, website) %>%
summarise(median_price_m2 = median(price_m2)) %>%
ggplot(aes(x = month, y = median_price_m2, color = website)) +
geom_line(size = 1)+
labs(title = "The Evolution of Price/M2 by Month")
lkeria prices remain higher for almost all months in our time span.
We’ll draw a boxplot to compare the prices across both websites by number of rooms.
First of all, we need to combine our data
combined_data <- ouedkniss %>%
select(Nb.Room, price_m2, Top_Municipalities_new, Announcer.Type, website) %>%
rename(Top_Municipalities = Top_Municipalities_new) %>%
bind_rows(lkeria %>%
select(Nb.Room, price_m2, Top_Municipalities, Announcer.Type, website))
combined_data %>%
filter(Announcer.Type %in% c("Agence immobilière", "AGENCE")) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Nb.Room) +
labs(title = "AGENCE VS. AGENCE")
combined_data %>%
filter(Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE")) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Nb.Room) +
labs(title = "AGENCE VS. BUREAU D'AFFAIRE")
combined_data %>%
filter(Announcer.Type %in% c("Agence immobilière", "PARTICULIER")) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Nb.Room) +
labs(title = "AGENCE VS. PARTICULIER")
Well, some readers may be skeptical about the results, they may think it is due to a sampling effect. So In order to confirm these results we are going to draw 1000 different samples from both Lkeria and Ouedkniss and analyze the distribution of their median price delta.
ok_agencies <- ouedkniss %>%
filter(Announcer.Type == "AGENCE") %>%
select(price_m2) %>%
unlist()
ok_bur_aff <- ouedkniss %>%
filter(Announcer.Type == "BUREAU D'AFFAIRE") %>%
select(price_m2) %>%
unlist()
lkeria_agencies <- lkeria %>%
select(price_m2) %>%
unlist()
We create a function to get a 1000 samples with replacement from each dataset.
bootstrap <- function(x){
#x is a vector
replicate(1000, median(sample(x = x, size = 100, replace = TRUE)))
}
set.seed(123)
ok_medians <- bootstrap(ok_agencies)
ok_medians_bur_aff <- bootstrap(ok_bur_aff)
lkeria_medians <- bootstrap(lkeria_agencies)
Now, that we have our samples ready, let’s plot the differences
as.tibble(lkeria_medians - ok_medians) %>%
ggplot(aes(x = value)) +
geom_histogram() +
labs(title = "Distribution of median differences of price per m2",
subtitle = "Lkeria (Agencies) VS. Ouedkniss (Agencies)", caption = "1000 samples with replacement") +
xlab(label = "median price difference in DZD") +
geom_vline(xintercept = median(lkeria_medians - ok_medians), size = 1, color = "red") +
geom_label(aes(x = median(lkeria_medians - ok_medians), y = 0, label=round(median(lkeria_medians - ok_medians))))
it’s very clear that almost all of our delta distribution is positive with a median delta of 36949 Which means that on average Lkeria is more expensive by 36949 DZD per meter squared (comparing real estate agencies prices from both websites).
as.tibble(lkeria_medians - ok_medians_bur_aff) %>%
ggplot(aes(x = value)) +
geom_histogram() +
labs(title = "Distribution of median differences of price per m2",
subtitle = "Lkeria (Agencies) VS. Ouedkniss (Bureau D'affaire)", caption = "1000 samples with replacement") +
xlab(label = "median price difference in DZD") +
geom_vline(xintercept = median(lkeria_medians - ok_medians_bur_aff), size = 1, color = "blue") +
geom_label(aes(x = median(lkeria_medians - ok_medians_bur_aff), y = 0, label=round(median(lkeria_medians - ok_medians_bur_aff))))
We notice the same thing for this graph, most values of our delta distribution is positive with a median delta of 26834 Which means that on average Lkeria is more expensive by 26834 DZD per meter squared (comparing real estate agencies to “bureau d’affaire” prices).
The t-test is another statistical technique that allow us to generalize our findings. We are going to test whether or not this difference in price is statistically significan setting the null hypothesis as The difference in price medians is equal to zero.
t.test(lkeria_agencies, ok_agencies,
mu = 0,
alternative = "two.sided",
conf.level = 0.95,
var.equal = FALSE,
paired = FALSE)
##
## Welch Two Sample t-test
##
## data: lkeria_agencies and ok_agencies
## t = 6.1982, df = 432.66, p-value = 1.333e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 24733.30 47702.95
## sample estimates:
## mean of x mean of y
## 247316.3 211098.2
t.test(lkeria_agencies, ok_bur_aff,
mu = 0,
alternative = "two.sided",
conf.level = 0.95,
var.equal = FALSE,
paired = FALSE)
##
## Welch Two Sample t-test
##
## data: lkeria_agencies and ok_bur_aff
## t = 4.5407, df = 526.77, p-value = 6.956e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 16167.26 40824.07
## sample estimates:
## mean of x mean of y
## 247316.3 218820.6
For both cases, we have a p-value that is less than 5% which means that this difference is so unlikely to happen randomly thus we reject the null hypothesis.
Now let’s see the variation of prices across different locations
combined_data %>%
filter(Top_Municipalities != "Hydra",
Announcer.Type %in% c("Agence immobilière", "AGENCE"),
Nb.Room == 3) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Top_Municipalities, nrow = 2)+
labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. AGENCE (F3)")
combined_data %>%
filter(Top_Municipalities != "Hydra",
Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE"),
Nb.Room == 3) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Top_Municipalities, nrow = 2) +
labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. BUREAU D'AFFAIRE (F3)")
combined_data %>%
filter(Top_Municipalities != "Hydra",
Announcer.Type %in% c("Agence immobilière", "PARTICULIER"),
Nb.Room == 3) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Top_Municipalities, nrow = 2) +
labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. PARTICULIER (F3)")
combined_data %>%
filter(Top_Municipalities != "Hydra",
Announcer.Type %in% c("Agence immobilière", "AGENCE"),
Nb.Room == 4) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Top_Municipalities, nrow = 2) +
labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. AGENCE (F4)")
combined_data %>%
filter(Top_Municipalities != "Hydra",
Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE"),
Nb.Room == 4) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Top_Municipalities, nrow = 2) +
labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. BUREAU D'AFFAIRE (F4)")
combined_data %>%
filter(Top_Municipalities != "Hydra",
Announcer.Type %in% c("Agence immobilière", "PARTICULIER"),
Nb.Room == 4) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Top_Municipalities, nrow = 2) +
labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. PARTICULIER (F4)")
combined_data %>%
filter(Top_Municipalities != "Hydra",
Announcer.Type %in% c("Agence immobilière", "AGENCE"),
Nb.Room == 5) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Top_Municipalities, nrow = 2) +
labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. AGENCE (F5)")
combined_data %>%
filter(Top_Municipalities != "Hydra",
Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE"),
Nb.Room == 5) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Top_Municipalities, nrow = 2) +
labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. BUREAU D'AFFAIRE (F5)")
combined_data %>%
filter(Top_Municipalities != "Hydra",
Announcer.Type %in% c("Agence immobilière", "PARTICULIER"),
Nb.Room == 5) %>%
ggplot(aes(x = website, y = price_m2, fill = website)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~ Top_Municipalities, nrow = 2) +
labs(title = "Distribution of Price/M2 across Top Municipalities", subtitle ="AGENCE VS. PARTICULIER (F5)")
We notice from all the graphs above that the same pattern persists, lkeria has higher prices for most of municipalities.
combined_data %>%
mutate(Announcer.Type = ifelse(Announcer.Type == "Agence immobilière", "AGENCE", Announcer.Type)) %>%
group_by(Top_Municipalities, Nb.Room, Announcer.Type, website) %>%
summarise(median_price_m2 = median(price_m2)) %>%
spread(website, median_price_m2)
## # A tibble: 82 x 5
## # Groups: Top_Municipalities, Nb.Room, Announcer.Type [82]
## Top_Municipalities Nb.Room Announcer.Type Lkeria ouedkniss
## <chr> <int> <chr> <dbl> <dbl>
## 1 Alger-Centre 3 AGENCE 202667. 230724.
## 2 Alger-Centre 3 BUREAU D'AFFAIRE NA 220000
## 3 Alger-Centre 3 PARTICULIER NA 195588.
## 4 Alger-Centre 4 AGENCE 225000 250000
## 5 Alger-Centre 4 BUREAU D'AFFAIRE NA 231250
## 6 Alger-Centre 4 PARTICULIER NA 200000
## 7 Alger-Centre 5 AGENCE 204069. 204384.
## 8 Alger-Centre 5 BUREAU D'AFFAIRE NA 212692.
## 9 Alger-Centre 5 PARTICULIER NA 192308.
## 10 Bir Mourad Rais 3 AGENCE 289057. 266667.
## # ... with 72 more rows
let’s examine the same relationship as above from another angle.
combined_data %>%
filter(Announcer.Type %in% c("Agence immobilière", "AGENCE")) %>%
ggplot(aes(x = Top_Municipalities, y = price_m2, color = website)) +
geom_boxplot() +
labs(title = "The Variation of Price/M2 Across Top Municipalities", subtitle ="AGENCE VS. AGENCE")
combined_data %>%
filter(Announcer.Type %in% c("Agence immobilière", "BUREAU D'AFFAIRE")) %>%
ggplot(aes(x = Top_Municipalities, y = price_m2, color = website)) +
geom_boxplot() +
labs(title = "The Variation of Price/M2 Across Top Municipalities", subtitle ="AGENCE VS. BUREAU D'AFFAIRE")
combined_data %>%
filter(Announcer.Type %in% c("Agence immobilière", "PARTICULIER")) %>%
ggplot(aes(x = Top_Municipalities, y = price_m2, color = website)) +
geom_boxplot() +
labs(title = "The Variation of Price/M2 Across Top Municipalities", subtitle ="AGENCE VS. PARTICULIER")
For the rest of this exploration notebook, we are going to use only ouedkniss dataset as it has richer observations and variables
We start by checking the impact of other variables that we have on the price distribution
ouedkniss %>%
filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>%
ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Announcer.Type)) +
geom_boxplot() +
labs(title = "The Distribution of Price/M2 By Announcer Type") +
coord_flip()
ouedkniss %>%
filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>%
ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Garage)) +
geom_boxplot() +
labs(title = "The Distribution of Price/M2 By Garage Availability") +
coord_flip()
ouedkniss %>%
filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>%
ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Garden)) +
geom_boxplot() +
labs(title = "The Distribution of Price/M2 Garden Availability") +
coord_flip()
ouedkniss %>%
filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>%
ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Furnished)) +
geom_boxplot() +
labs(title = "The Distribution of Price/M2 By Furnished Availability") +
coord_flip()
ouedkniss %>%
filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR") %>%
ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Promise)) +
geom_boxplot() +
labs(title = "The Distribution of Price/M2 By Promise Category") +
coord_flip()
ouedkniss %>%
filter(!is.na(price_m2), Nb.Room %in% c(3:5), Announcer.Type!= "PROMOTEUR", !is.na(Nb.Floor)) %>%
mutate(Nb.Floor = case_when(between(Nb.Floor, 1, 3) ~ "1_3",
between(Nb.Floor, 4, 6) ~ "4_6",
between(Nb.Floor, 7, 10) ~ "7_10",
TRUE ~ "sup_10")) %>%
ggplot(aes(x = Top_Municipalities_new, y = price_m2, color = Nb.Floor)) +
geom_boxplot() +
labs(title = "The Distribution of Price/M2 By Floor Number Segment") +
coord_flip()
From the Announcer Type graph, we can see that on average, in some regions, “AGENCE” has the highest prices. In other regions, “BUREAU D’AFFAIRE” are more expensive than others, we can suspect that brokers are active in those regions
In the second graph, properties that have garage are more expensive for most of the municipalities which does make sense.
For the garden graph, the price distributions varies from one location to another, we can’t define a clear pattern here.
The same remark holds for the next couple of graphs. In addition, we don’t have many properties that are labeled with those features, thus, we can not say much about them.
From the final graph, generally speaking, properties with the first few floors have higher prices than others.
The table below shows that bureau d’affaire prices are slightly higher than housing agencies (“AGENCE”) prices. The individual offers (“PARTICULIER”) have lower prices.
ouedkniss %>%
group_by(Announcer.Type) %>%
summarise(median_price_m2 = round(median(price_m2)))
## # A tibble: 3 x 2
## Announcer.Type median_price_m2
## <fct> <dbl>
## 1 AGENCE 202703
## 2 BUREAU D'AFFAIRE 212963
## 3 PARTICULIER 160256
The median price values for those categories are 212963, 212963, 160256 DZD respectively. So on average, “BUREAU D’AFFAIRE” prices are more expensive by +5% compared to “AGENCE” and by +32% compared to “PARTICULIER”.
In the previous section, we explored some relationships within our data to get a feel about what could influence the price variable.
We are going to use what we found to build a predictive model that tries to give an approximate selling/buying price that can be used by the real estate agencies and individuals willing to trade their properties.
Before we build our model, we’re going to do some cleansing on ouedkniss data to improve the training phase.
data_model <- ouedkniss %>%
select(Month, Municipality, Nb.Floor, Nb.Room, Garage,
Garden, Furnished, Promise, New.Project, Payment, Area, Price.value.dzd, Announcer.Type, price_m2) %>%
filter(price_m2>= 100000 & price_m2<=400000) %>%
filter(!is.na(Area) & Area>=30 & Area <300) %>%
mutate_at(vars(Garage, Garden, Furnished, Promise, New.Project), ~ ifelse(is.na(.)==TRUE, F, .)) %>%
mutate(Payment = factor(ifelse(is.na(Payment), "comptant", as.character(Payment)))) %>%
mutate(Nb.Room = factor(Nb.Room),
Nb.Floor = case_when(Nb.Floor >= 15 ~ 15,
is.na(Nb.Floor) ~ 1,
TRUE ~ as.numeric(Nb.Floor)))
We selected few variables with potential predictive power.
we expect the minimum price per meter squared to be higher than 100000 DZD and we discarded values above 400000 DZD.
The same logic as above holds, we consider only properties with Area between 30m2 and 300m2.
The missing values for Garage, Garden, Furnished, Promise, New.Project variables has been replaced with FALSE. We expect that if a property contains one or more of these features, the owner would mention it/them to increase the selling price.
Some levels within Municipality variable have very low instances.
data_model %>%
group_by(Municipality) %>%
count() %>%
arrange(n)
## # A tibble: 55 x 2
## # Groups: Municipality [55]
## Municipality n
## <fct> <int>
## 1 KHRAISSIA 1
## 2 OULED CHEBEL 1
## 3 EL MAGHARIA 2
## 4 EL MARSA 2
## 5 HARAOUA 2
## 6 CASBAH 3
## 7 EL MADANIA 3
## 8 EL MOURADIA 3
## 9 BOUROUBA 4
## 10 OUED SMAR 4
## # ... with 45 more rows
To avoid problmes related to those levels when splitting the data, we are going to regroup those instances to their close municipalities according to their location.
data_model <- data_model %>%
mutate(Municipality = as.character(Municipality)) %>%
mutate(Municipality = case_when(Municipality %in% c("KHRAISSIA","OULED CHEBEL") ~ "BIRTOUTA",
Municipality == "TASSALA EL MERDJA" ~ "DOUERA",
Municipality == "HARAOUA" ~ "AIN TAYA",
Municipality == "EL MARSA" ~ "BORDJ EL BAHRI",
Municipality %in% c("EL MAGHARIA", "EL MADANIA") ~ "HUSSEIN DEY",
Municipality == "EL MOURADIA" ~ "BIR MOURAD RAIS",
Municipality == "CASBAH" ~ "BAB EL OUED",
Municipality == "RAIS HAMIDOU" ~ "BOLOGHINE IBN ZIRI",
Municipality == "OUED SMAR" ~ "DAR EL BEIDA",
Municipality == "BOUROUBA" ~ "BACH DJERRAH",
Municipality == "BEN AKNOUN" ~ "EL BIAR",
TRUE ~ as.character(Municipality))) %>%
mutate(Municipality = as.factor(Municipality))
Now that our data is ready to empower our machine, let’s build the model using caret package!
Let’s start by splitting our data by putting 60% of our data into a training set then we use the remaining observations to test our model.
library(caret)
set.seed(100)
trainIndex <- createDataPartition(data_model$Price.value.dzd, p = 0.6, list = FALSE)
head(trainIndex)
## Resample1
## [1,] 1
## [2,] 3
## [3,] 4
## [4,] 5
## [5,] 7
## [6,] 14
train_set <- data_model[trainIndex, ]
test_set <- data_model[-trainIndex, ]
let’s train our model
We start first by building a linear regression model for the sake of interpretability.
lm_model <- lm(formula = Price.value.dzd ~ ., data = select(train_set, -price_m2, -Month))
summary(lm_model)
##
## Call:
## lm(formula = Price.value.dzd ~ ., data = select(train_set, -price_m2,
## -Month))
##
## Residuals:
## Min 1Q Median 3Q Max
## -18738963 -2531505 -32187 2385183 24604954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1169163 1163782 -1.005 0.315349
## MunicipalityAIN TAYA -2668597 1674778 -1.593 0.111422
## MunicipalityALGER 3581710 1000769 3.579 0.000363 ***
## MunicipalityBAB EL OUED -763558 1366584 -0.559 0.576482
## MunicipalityBABA HASSEN -280587 1483028 -0.189 0.849980
## MunicipalityBACH DJERRAH 768132 1567007 0.490 0.624119
## MunicipalityBAINS ROMAINS -1511836 2516916 -0.601 0.548211
## MunicipalityBARAKI -4026785 1750855 -2.300 0.021683 *
## MunicipalityBEB EZZOUAR 3559486 1046165 3.402 0.000697 ***
## MunicipalityBENI MESSOUS -1788032 1496247 -1.195 0.232398
## MunicipalityBIR KHADEM 3552127 1146940 3.097 0.002015 **
## MunicipalityBIR MOURAD RAIS 8385380 1143071 7.336 4.92e-13 ***
## MunicipalityBIRTOUTA -3864393 1319347 -2.929 0.003486 **
## MunicipalityBOLOGHINE IBN ZIRI -1429888 1981725 -0.722 0.470767
## MunicipalityBORDJ EL BAHRI -3335186 1069311 -3.119 0.001873 **
## MunicipalityBORDJ EL KIFFAN -3021372 1233205 -2.450 0.014475 *
## MunicipalityBOUZAREAH 1839799 1396623 1.317 0.188068
## MunicipalityCHERAGA 5284597 1070878 4.935 9.56e-07 ***
## MunicipalityDAR EL BEIDA 518404 1612348 0.322 0.747890
## MunicipalityDELY BRAHIM 5516829 1214902 4.541 6.36e-06 ***
## MunicipalityDJISR KSENTINA 184661 1073901 0.172 0.863513
## MunicipalityDOUERA -3230142 1521836 -2.123 0.034066 *
## MunicipalityDRARIA 2065332 1195831 1.727 0.084492 .
## MunicipalityEL ACHOUR 5530200 1097950 5.037 5.72e-07 ***
## MunicipalityEL BIAR 3521785 1607956 2.190 0.028764 *
## MunicipalityEL HARRACH -1283237 2289288 -0.561 0.575251
## MunicipalityHAMMA ANASSERS 1941682 2014527 0.964 0.335386
## MunicipalityHUSSEIN DEY 2026097 1796475 1.128 0.259698
## MunicipalityHYDRA 16970207 1670840 10.157 < 2e-16 ***
## MunicipalityKOUBA 6405146 1269832 5.044 5.51e-07 ***
## MunicipalityLES EUCALYPTUS -3268473 1350010 -2.421 0.015672 *
## MunicipalityMAHELMA -3370482 1877165 -1.796 0.072908 .
## MunicipalityMOHAMMADIA 5276891 1565910 3.370 0.000784 ***
## MunicipalityOULED FAYET 3664602 1428293 2.566 0.010457 *
## MunicipalityREGHAIA -2391389 1320321 -1.811 0.070441 .
## MunicipalityROUIBA 1305363 1522973 0.857 0.391610
## MunicipalitySAOULA -1424577 1668664 -0.854 0.393486
## MunicipalitySIDI MHAMED 1200647 2516813 0.477 0.633442
## MunicipalitySIDI MOUSSA -5729552 1663800 -3.444 0.000600 ***
## MunicipalitySOUIDANIA -3497900 1308715 -2.673 0.007659 **
## MunicipalitySTAOUELI 5316223 1674285 3.175 0.001548 **
## MunicipalityZERALDA -1648228 1350925 -1.220 0.222757
## Nb.Floor -166108 75638 -2.196 0.028341 *
## Nb.Room4 284023 422023 0.673 0.501118
## Nb.Room5 -400337 750105 -0.534 0.593676
## GarageTRUE 2901367 525201 5.524 4.33e-08 ***
## GardenTRUE -693354 551350 -1.258 0.208879
## FurnishedTRUE 1058825 695119 1.523 0.128053
## PromiseTRUE 240304 693961 0.346 0.729214
## New.ProjectTRUE -270667 541598 -0.500 0.617369
## Paymenttranches -2972192 919206 -3.233 0.001268 **
## Area 207488 9019 23.007 < 2e-16 ***
## Announcer.TypeBUREAU D'AFFAIRE 141844 406711 0.349 0.727352
## Announcer.TypePARTICULIER -1098526 391826 -2.804 0.005162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4721000 on 898 degrees of freedom
## Multiple R-squared: 0.763, Adjusted R-squared: 0.749
## F-statistic: 54.55 on 53 and 898 DF, p-value: < 2.2e-16
we can get some insights from the output above:
Although the technique used is simple, we’re still have a pretty decent Adjusted R-squared that is equal to 74.9% which means that the variations of our input variables explain about 74% the variations of the price (the response variable).
We have also many significant variables in our model, we can interpret a few of them.
This model suggests that:
The Area variable (very significant one (3 stars)) has a positive coefficient 207 488. it means that an increase in one unit of this variable, i.e one meter squared, will raise, on average, the price by 207 488 DZD.
The inclusion of “Garage” will raise the price on average by 2 901 367 DZD.
The price of an appartment that is located in “BARAKI” is cheaper, on average, by 4 026 785 DZD.
The price of an appartment that is located in “HYDRA” is more expensive, on average, by 16 970 207 DZD.
Buying a property from a individual owner “PARTICULIER”, would save you 1 098 526 DZD on average.
Now, let’s make predictions and test our model:
lm_pred <- predict(lm_model, newdata = select(test_set, -price_m2, -Month, -Price.value.dzd))
lmValues <- data.frame(obs = test_set$Price.value.dzd, pred = lm_pred)
defaultSummary(lmValues)
## RMSE Rsquared MAE
## 4.783036e+06 7.370490e-01 3.497527e+06
We notice from our output a small drop in the Out-of-sample Rsquared (from 0.763 to 0.737) which suggests that our model performs well on unseen data.
Both Root Mean Squared Error (RMSE) and Mean Absolute Error are in the same unit as our Dependent variable (The price of property) and they indicate the average model prediction error. We’d like to consider MAE over RMSE as RMSE is more sensible to extreme values. Thus, we have an average error of 3 497 527 DZD which is relatively high considering the scale of our response variable.
Let’s see if we can do even better using a more advanced supervized technique, the Random Forests.
library(randomForest)
set.seed(555)
rf_model <- randomForest(formula = Price.value.dzd ~ .,
data = select(train_set, -price_m2, -Month),
ntree = 500, #number of trees to train
node.size = 25, #the size of final nodes
mtry = 8) #number of random independent variables to use
rf_model
##
## Call:
## randomForest(formula = Price.value.dzd ~ ., data = select(train_set, -price_m2, -Month), ntree = 500, node.size = 25, mtry = 8)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 8
##
## Mean of squared residuals: 2.457137e+13
## % Var explained: 72.3
rf_pred <- predict(rf_model, newdata = select(test_set, -price_m2, -Month, -Price.value.dzd))
rfValues <- data.frame(obs = test_set$Price.value.dzd, pred = rf_pred)
defaultSummary(rfValues)
## RMSE Rsquared MAE
## 4.933058e+06 7.182500e-01 3.516278e+06
Comparing the evaluation metrics values above to the ones of the previous model, we notice that random forest didn’t outperform the linear regression.
Comments
from this graph, we notice that, on average, the housing prices in
Lkeriaare higher than those listed inOuedkniss.The overall median price values per meter squared in our sample data are:
On average, comparing “agence” to “agence”, lkeria’s prices are higher by approximately +19%.