The data has been obtained from this project. In this website we can find a dataset with data related to airbnb apartments in several cities around the world.
Since the dataset included a lot of variables, I have decided to load only those that seem more relevant
This dataset contains 20864 rows (observations) and 60 columns, and as we can already see, some variables have very messy data, and this will require a lot of data wrangling.
str(airbnb_data)
As a complement for the main dataset, I’ll be using the neighbourhoods_geojson dataset, which provides geospatial information from barcelona and its airbnb apartments (such as longitude, latitude, polygons, etc):
library(rgdal)
neighbourhoods_geojson <- rgdal::readOGR("C:/Users/marct/Documents/CURSOS/PROJECTES/AIRBNB PRICE PREDICTION/DATA/neighbourhoods.geojson")
First of all I load the lubridate package, which I use to convert the host_since variable, from a character to a date format. Then I extract the year from the variable I have just converted and reorder the data.
library(lubridate)
airbnb_data$host_since <- ymd(airbnb_data$host_since)
airbnb_data$host_since_year <- year(airbnb_data$host_since)
airbnb_data <- airbnb_data[, c(1, ncol(airbnb_data), 2 : (ncol(airbnb_data) - 1))]
After that, I convert the neighbourhood column into a factor and realise that it has spelling mistakes.
airbnb_data$neighbourhood <- airbnb_data$neighbourhood %>% as.factor()
levels(airbnb_data$neighbourhood)
## [1] "" "Camp d'en Grassot i Grà cia Nova"
## [3] "Can Baro" "Carmel"
## [5] "Ciutat Vella" "Diagonal Mar - La Mar Bella"
## [7] "Dreta de l'Eixample" "Eixample"
## [9] "El Baix Guinardó" "El Besòs i el Maresme"
## [11] "El Bon Pastor" "El Born"
## [13] "El Camp de l'Arpa del Clot" "El Clot"
## [15] "El Coll" "El Congrés i els Indians"
## [17] "el Fort Pienc" "El Gòtic"
## [19] "El Poble-sec" "El Poblenou"
## [21] "El Putget i Farró" "El Raval"
## [23] "Glòries - El Parc" "Grà cia"
## [25] "Guinardó" "Horta"
## [27] "Horta-Guinardó" "L'Antiga Esquerra de l'Eixample"
## [29] "La Barceloneta" "La Font d'en Fargues"
## [31] "La Guineueta - Canyelles" "La Maternitat i Sant Ramon"
## [33] "La Nova Esquerra de l'Eixample" "La Prosperitat"
## [35] "La Sagrada FamÃlia" "La Sagrera"
## [37] "La Salut" "La Teixonera"
## [39] "La Trinitat Vella" "La Vall d'Hebron"
## [41] "La Verneda i La Pau" "La Vila OlÃmpica"
## [43] "Les Corts" "Les Tres Torres"
## [45] "Montbau" "Navas"
## [47] "Nou Barris" "Pedralbes"
## [49] "Porta" "Provençals del Poblenou"
## [51] "Sant Andreu" "Sant Andreu de Palomar"
## [53] "Sant Antoni" "Sant GenÃs dels Agudells"
## [55] "Sant Gervasi - Galvany" "Sant Gervasi - la Bonanova"
## [57] "Sant MartÃ" "Sant Martà de Provençals"
## [59] "Sant Pere/Santa Caterina" "Sants-Montjuïc"
## [61] "Sarrià " "Sarrià -Sant Gervasi"
## [63] "Torre Baró" "Trinitat Nova"
## [65] "Turó de la Peira - Can Peguera" "Vallcarca i els Penitents"
## [67] "Verdum - Los Roquetes" "Vila de Grà cia"
## [69] "Vilapicina i la Torre Llobeta"
Since I am catalan, I am able to correct the spelling mistakes.
levels(airbnb_data$neighbourhood) <- c("" , "Camp d'en Grassot i Gràcia Nova", "Can Baró",
"Carmel" , "Ciutat Vella", "Diagonal Mar - La Mar Bella" ,
"Dreta de l'Eixample" , "Eixample" , "El Baix Guinardó",
"El Besós i el Maresme" , "El Bon Pastor" , "El Born" ,
"El Camp de l'Arpa del Clot", "El Clot" , "El Coll" ,
"El Congrés i els Indians" , "El Fort Pienc" , "El Gòtic" ,
"El Poble-sec" , "El Poblenou" , "El Putget i Farró" ,
"El Raval" , "Glòries - El Parc" , "Gràcia",
"Guinardó" , "Horta" , "Horta-Guinardó" ,
"L'Antiga Esquerra de l'Eixample", "La Barceloneta" , "La Font d'en Fargues" ,
"La Guineueta - Canyelles" , "La Maternitat i Sant Ramon" , "La Nova Esquerra de l'Eixample",
"La Prosperitat" , "La Sagrada Família", "La Sagrera" ,
"La Salut" , "La Teixonera", "La Trinitat Vella" ,
"La Vall d'Hebron" , "La Verneda i La Pau" , "La Vila Olímpica" ,
"Les Corts" , "Les Tres Torres" , "Montbau" ,
"Navas" , "Nou Barris" , "Pedralbes" ,
"Porta" , "Provençals del Poblenou" , "Sant Andreu" ,
"Sant Andreu de Palomar" , "Sant Antoni" , "Sant Genís dels Agudells" ,
"Sant Gervasi - Galvany" , "Sant Gervasi - la Bonanova", "Sant Martí" ,
"Sant Martí de Provençals", "Sant Pere/Santa Caterina", "Sants-Montjuïc" ,
"Sarrià " , "Sarrià -Sant Gervasi" , "Torre Baró" ,
"Trinitat Nova" , "Turó de la Peira - Can Peguera", "Vallcarca i els Penitents" ,
"Verdum - Los Roquetes" , "Vila de Gràcia" , "Vilapicina i la Torre Llobeta")
This variables also need to be treated, since they have special characters such as % or $, or they have to be coded as boolean factors. Here we can see some of the values that I am about to modify:
airbnb_data[1, c(4,5)]
## host_response_rate host_acceptance_rate
## 1 100% 91%
airbnb_data[1, c(30, 31, 32, 34)]
## price security_deposit cleaning_fee extra_people
## 1 $80.00 $100.00 $20.00 $0.00
airbnb_data[1, c(6, 11, 12, 20, 37, 51, 52, 53, 55, 56)]
## host_is_superhost host_has_profile_pic host_identity_verified
## 1 f t t
## is_location_exact has_availability requires_license instant_bookable
## 1 t t t f
## is_business_travel_ready require_guest_profile_picture
## 1 f f
## require_guest_phone_verification
## 1 f
Then I treat the previous variables that had special characters such as “%” or “$” using simple loops, and convert them into numeric variables. After that I substitute the “t” and “f” values from the boolean variables with TRUE and FALSE (respetively) and I convert them into factors through the last loop.
for(i in c(4, 5)){
airbnb_data[, i] <- as.numeric(gsub("%", "", airbnb_data[, i])) / 100
}
for(i in c(30, 31, 32, 34)){
airbnb_data[, i] <- as.numeric(gsub("\\$", "", airbnb_data[, i]))
}
for(i in c(6, 11, 12, 20, 37, 51, 52, 53, 55, 56)){
airbnb_data[, i] <- gsub("t", "TRUE", airbnb_data[, i])
airbnb_data[, i] <- gsub("f", "FALSE", airbnb_data[, i])
}
Now I create a list that I will use to spot strange values and NA’s by converting all the variables in factors and copying the levels of the ith variable into the ith element of a list. I use the “jsonedit” function from the listviewer package in order to create an interactive and easy list.
airbnb_data_factor <- airbnb_data
factor_list <- list()
for(i in 1:ncol(airbnb_data_factor)){
airbnb_data_factor[, i] <- as.factor(airbnb_data_factor[, i])
factor_list <- lapply(airbnb_data_factor, levels)
}
library(listviewer)
jsonedit(factor_list)
After exploring this list I observe that the host_verifications and amenities columns need some treatment in order to be used in any statistical model (I need to create dummy variables with their values). This is an example of the format of this columns’ values.
airbnb_data$host_verifications[1]
## [1] "['email', 'phone', 'reviews', 'manual_offline', 'jumio', 'offline_government_id', 'government_id', 'work_email']"
airbnb_data$amenities[1]
## [1] "{TV,\"Cable TV\",Internet,Wifi,\"Air conditioning\",Kitchen,Elevator,Heating,\"Family/kid friendly\",Washer,Dryer,\"Smoke alarm\",\"Carbon monoxide alarm\",\"Fire extinguisher\",Essentials,\"translation missing: en.hosting_amenity_49\"}"
In the following chunk I create two dataframes, after deleting the strange or unuseful characters that I have spoted using the previous list.
library(mgsub)
library(stringr)
host_verifications <- as.data.frame(mgsub(airbnb_data$host_verifications, c("\\[", "\\]", "\\,", "\\'") , c("", "", "", "")))
amenities <- as.data.frame(mgsub(airbnb_data$amenities, c("\\{", "\\}", "\\,", "\\'", "\\[", "\\]", "\"", "\\/") , c("", "", " ", "", "", "", "", "")))
Now I create two listes for each of the variables which I will use in the following loop. Since both variables (host_verifications and amenities) are dataframes of the same dimension, I create a loop iterating on the columns of host_verifications. This loop starts by splitting the ith row of the host_verification dataframe (and since str_split returns a list with only one element I subset the first element of each list) and saves this vector into the ith element of the output_list_verifications. Then I count the number of elements of the ith vector of the output_list_verifications list and associate this result to the ith vector of the output_list_verifications_count, which I will use later to see which is the vector with more components and create the dummy variables.
output_list_verifications <- list()
output_list_verifications_count <- list()
output_list_amenities <- list()
output_list_amenities_count <- list()
for(i in 1:nrow(host_verifications)){
output_list_verifications[i] <- str_split(host_verifications[i, ], " ")[1]
output_list_verifications_count[i] <- length(output_list_verifications[[i]])
output_list_amenities[i] <- str_split(amenities[i, ], " ")[1]
output_list_amenities_count[i] <- length(output_list_amenities[[i]])
}
Now I extract the largest vector (the vector that contains more values) of the output_list_verifications list, by filtering the vector that has the highest value in the output_list_verifications_count list and save the result in dummy_host_verifications_cols. After that I create an empty dataframe called dummy_host_verifications_df (with the same rows than the original dataset airbnb_data and the number of columns from dummy_host_verifications_cols), and I repeat the same process for the amenities variable. Below you can see the dummy_host_verifications_df
dummy_host_verifications_cols <- output_list_verifications[[which.max(output_list_verifications_count)]]
dummy_host_verifications_df <- matrix(nrow = nrow(airbnb_data), ncol = length(dummy_host_verifications_cols)) %>% as.data.frame()
colnames(dummy_host_verifications_df) <- dummy_host_verifications_cols
dummy_amenities_cols <- output_list_amenities[[which.max(output_list_amenities_count)]]
dummy_amenities_df <- matrix(nrow = nrow(airbnb_data), ncol = length(dummy_amenities_cols)) %>% as.data.frame()
colnames(dummy_amenities_df) <- dummy_amenities_cols
head(dummy_host_verifications_df)
## email phone facebook reviews manual_offline jumio offline_government_id
## 1 NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## sent_id selfie government_id identity_manual work_email
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
Finally I bind the new dataframes (dummy_host_verifications_df and dummy_amenities_df) to the original dataset (airbnb_data).
airbnb_data <- cbind(airbnb_data, dummy_host_verifications_df, dummy_amenities_df)
Now I susbtitute the NA’s with Yes (if the apartment has that verification or amenitie, and no if it doesn’t). In this case I use a loop nested inside another loop (a loop for each list/variable), where the outer loop iterates over the length of the output_list_verifications / output_list_amenities and the inner one over the dummy columns that I have just created (which I specified manually). If the name of the jth variable is in the lth vector of the output list, then the value of the ith row and jth column of the airbnb_data corresponds to “Yes” (meaning that the variable name appears in the list).
for(l in 1:length(output_list_verifications)){
for(j in 62:74){
if(colnames(airbnb_data)[j] %in% output_list_verifications[[l]]){
airbnb_data[l, j] <- "Yes"
} else{
}
}
}
for(l in 1:length(output_list_amenities)){
for(j in 75:ncol(airbnb_data)){
if(colnames(airbnb_data)[j] %in% output_list_amenities[[l]]){
airbnb_data[l, j] <- "Yes"
} else{
}
}
}
Now I convert all the new columns into factors
for(j in 62:ncol(airbnb_data)){
for(i in 1:nrow(airbnb_data)){
if(is.na(airbnb_data[i, j]) == TRUE){
airbnb_data[i, j] <- "No"
}
}
airbnb_data[, j] <- factor(airbnb_data[, j], levels = c("Yes", "No"))
}
I delete some unuseful variables, and create new ones: days_since_host, days_since_first_review, days_since_last_review, which will allow me to introduce the time component into the model, since the statistical methods I am about to use, don’t deal well with dates and times.
airbnb_data$amenities <- NULL
airbnb_data$host_verifications <- NULL
airbnb_data$has_availability <- NULL
airbnb_data$is_business_travel_ready <- NULL
airbnb_data$days_since_host <- (today() - as.Date(airbnb_data$host_since)) %>% as.numeric()
airbnb_data$days_since_first_review <- (today() - as.Date(airbnb_data$first_review)) %>% as.numeric()
airbnb_data$days_since_last_review <- (today() - as.Date(airbnb_data$last_review)) %>% as.numeric()
airbnb_data$host_since <- NULL
airbnb_data$first_review <- NULL
airbnb_data$last_review <- NULL
airbnb_data$host_response_time <- NULL
airbnb_data$host_listings_count <- NULL
Now I delete the NA’s from these columns, which I will use to clean the other variables by substituting NA’s and strange values with the mean (in case of numerical variables), or the mode (in case of categorical ones).
airbnb_data <- airbnb_data[complete.cases(airbnb_data$host_since_year) & complete.cases(airbnb_data$country) & complete.cases(airbnb_data$city), ]
I use the same list as before to spot NA’s, missing values and strange values:
airbnb_data_factor <- airbnb_data
for(i in 1:ncol(airbnb_data_factor)){
airbnb_data_factor[, i] <- as.factor(airbnb_data_factor[, i])
factor_list <- lapply(airbnb_data_factor, levels)
}
library(listviewer)
jsonedit(factor_list)
The strange values are : "“,”N/A“, NA,”-“,”*“,”.“,”[no name]“,”.". Firstly, I create a for nested in another for, which iterates over columns (j), rows(i) and patterns (k). If it detects that a value that is not N/A from the ith row and jth column, contains any of the k patterns, then it converts that value in NA. This conversion will come in handy in the following chunk codes.
patterns <- c("N/A", "-", "*", "[no name]", ".", "")
for(j in 1:ncol(airbnb_data)){
for(i in 1:nrow(airbnb_data)){
for(k in 1:length(patterns)){
if(airbnb_data[i, j] == patterns[k] & !is.na(airbnb_data[i, j])){
airbnb_data[i, j] <- NA
}
}
}
}
I load the modeest package in order to be able to apply the mlv function (which calculates the mode of a vector) and also the stringr library, which I’ll use later. In this loop I iterate over columns (j) and rows(i). If the element from the ith row and jth column is NA, and is also a numeric value, it is substituted by the mean of the jth variable (after being filtered to the same year, country and city values of the ith row). If is NA and character or factor, then it is substituted by the mean of the jth variable (also after being filtered to the same year, country and city of the ith row).
library(modeest)
library(stringr)
for(j in 1:ncol(airbnb_data)){
for(i in 1:nrow(airbnb_data)){
if(is.na(airbnb_data[i, j]) == TRUE){
if(is.numeric(airbnb_data[, j]) == TRUE){
airbnb_data[i, j] = median(airbnb_data[complete.cases(airbnb_data) & airbnb_data$host_since_year == airbnb_data[i, "host_since_year"]
& airbnb_data$country == airbnb_data[i, "country"] & airbnb_data$city == airbnb_data[i, "city"] , j])
} else if(is.character(airbnb_data[, j]) == TRUE | is.factor(airbnb_data[, j]) == TRUE){
airbnb_data[i, j] = mlv(airbnb_data[complete.cases(airbnb_data) & airbnb_data$host_since_year == airbnb_data[i, "host_since_year"]
& airbnb_data$country == airbnb_data[i, "country"] & airbnb_data$city == airbnb_data[i, "city"], j])[1]
}
}
}
}
Then I filter out the possible NA’s left (in case there is any NA left) and substitute the " " spaces from the column names with "_", using an easy loop. Finally I delete unuseful columns (mainly factor columns containing a single level), and convert the neighbourhood variable into a factor.
airbnb_data <- airbnb_data[complete.cases(airbnb_data), ]
for(i in 1:ncol(airbnb_data)){
colnames(airbnb_data)[i] <- gsub(" ", "_", colnames(airbnb_data)[i])
}
airbnb_data <- airbnb_data[, !duplicated(colnames(airbnb_data))] # we delete duplicated columns (because some dummy variables had very similar names and were detected by R as identical)
airbnb_data <- airbnb_data[airbnb_data$city == "Barcelona", ] # I filter out possible other cities (since there was a row containing a different city value)
airbnb_data$country <- NULL
airbnb_data$state <- NULL
airbnb_data$smart_location <- NULL
airbnb_data$city <- NULL # since there's only one city (Barcelona)
airbnb_data$requires_license <- NULL # since it is a factor with one level
airbnb_data$neighbourhood <- as.factor(airbnb_data$neighbourhood)
Finally I save the dataset, because I will use it for other projects.
write.csv(airbnb_data, "C:/Users/marct/Documents/UNI_ESTADISTICA/2o/SEGUNDO CUATRIMESTRE/INFERENCIA ESTADÍSTICA II/TRABAJOS/TRABAJO AIRBNB/datos_airbnb.csv")
First of all I check that there are no NA’s left in the data:
airbnb_data[is.na(airbnb_data), ] %>% nrow()
## [1] 0
Now I use str() and summary() to check the structure of the data again:
str(airbnb_data)
## 'data.frame': 7364 obs. of 156 variables:
## $ host_since_year : num 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ host_response_rate : num 1 1 1 1 1 1 1 1 1 1 ...
## $ host_acceptance_rate : num 0.91 0.91 0.91 1 1 1 1 0.77 1 0.55 ...
## $ host_is_superhost : chr "FALSE" "FALSE" "FALSE" "TRUE" ...
## $ host_neighbourhood : chr "El Gòtic" "El Gòtic" "El Gòtic" "L'Antiga Esquerra de l'Eixample" ...
## $ host_total_listings_count : int 3 3 3 4 4 4 4 1 7 7 ...
## $ host_has_profile_pic : chr "TRUE" "TRUE" "TRUE" "TRUE" ...
## $ host_identity_verified : chr "TRUE" "TRUE" "TRUE" "TRUE" ...
## $ neighbourhood : Factor w/ 69 levels "","Camp d'en Grassot i Gràcia Nova",..: 18 18 5 8 28 28 28 5 8 2 ...
## $ latitude : num 41.4 41.4 41.4 41.4 41.4 ...
## $ longitude : num 2.18 2.18 2.18 2.15 2.15 ...
## $ is_location_exact : chr "TRUE" "TRUE" "TRUE" "TRUE" ...
## $ property_type : chr "Apartment" "Apartment" "Apartment" "Apartment" ...
## $ room_type : chr "Private room" "Private room" "Entire home/apt" "Private room" ...
## $ accommodates : int 2 2 9 2 1 1 2 1 4 6 ...
## $ bathrooms : num 1 1 3 2 2 2 2 1 1 2 ...
## $ bedrooms : int 1 1 4 1 1 1 1 1 1 2 ...
## $ beds : int 1 1 6 1 1 1 1 1 2 4 ...
## $ bed_type : chr "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
## $ square_feet : num 807 807 807 807 807 807 807 807 807 807 ...
## $ price : num 80 100 227 40 30 30 45 33 130 110 ...
## $ security_deposit : num 100 150 200 0 0 0 0 0 150 500 ...
## $ cleaning_fee : num 20 40 67 6 6 6 6 15 59 30 ...
## $ guests_included : int 2 1 4 1 1 1 1 1 2 3 ...
## $ extra_people : num 0 0 25 10 0 0 10 0 10 25 ...
## $ minimum_nights : int 3 5 4 7 2 2 2 2 2 3 ...
## $ maximum_maximum_nights : int 90 120 1125 1125 730 1125 1125 65 364 365 ...
## $ availability_30 : int 16 30 30 26 14 21 17 29 30 30 ...
## $ availability_60 : int 16 60 60 52 31 44 43 59 60 60 ...
## $ availability_90 : int 16 90 90 75 46 70 69 83 80 90 ...
## $ availability_365 : int 88 180 180 348 318 345 344 358 336 365 ...
## $ number_of_reviews : int 2 8 149 303 238 258 222 73 339 39 ...
## $ review_scores_rating : num 100 68 91 94 95 96 95 94 94 88 ...
## $ review_scores_accuracy : num 10 8 10 10 10 10 10 10 10 9 ...
## $ review_scores_cleanliness : num 10 8 9 9 10 10 9 10 10 9 ...
## $ review_scores_checkin : num 10 7 10 10 10 10 10 10 9 9 ...
## $ review_scores_communication : num 10 9 10 10 10 10 10 10 10 9 ...
## $ review_scores_value : num 10 7 9 10 10 9 9 10 10 9 ...
## $ instant_bookable : chr "FALSE" "FALSE" "TRUE" "TRUE" ...
## $ cancellation_policy : chr "moderate" "moderate" "moderate" "moderate" ...
## $ require_guest_profile_picture : chr "FALSE" "FALSE" "FALSE" "FALSE" ...
## $ require_guest_phone_verification : chr "FALSE" "FALSE" "FALSE" "FALSE" ...
## $ calculated_host_listings_count : int 3 3 3 4 4 4 4 1 3 4 ...
## $ calculated_host_listings_count_entire_homes : int 1 1 1 0 0 0 0 0 3 4 ...
## $ calculated_host_listings_count_private_rooms: int 2 2 2 4 4 4 4 1 0 0 ...
## $ calculated_host_listings_count_shared_rooms : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reviews_per_month : num 0.05 0.07 1.26 3.01 2.16 2.62 3.17 0.69 3.09 0.4 ...
## $ email : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ phone : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ facebook : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 1 1 1 ...
## $ reviews : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ manual_offline : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 1 2 2 ...
## $ jumio : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ offline_government_id : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 2 2 2 ...
## $ sent_id : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ selfie : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ government_id : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ identity_manual : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ work_email : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 1 2 ...
## $ TV : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Cable_TV : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 2 2 ...
## $ Internet : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ Wifi : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ Air_conditioning : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 1 1 1 ...
## $ Wheelchair_accessible : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 1 ...
## $ Kitchen : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ Paid_parking_off_premises : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 1 2 ...
## $ Elevator : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 1 ...
## $ Buzzerwireless_intercom : Factor w/ 2 levels "Yes","No": 2 2 1 1 2 2 1 1 1 1 ...
## $ Heating : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ Familykid_friendly : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 1 1 ...
## $ Washer : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 2 1 1 ...
## $ Smoke_alarm : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 2 2 2 ...
## $ Carbon_monoxide_alarm : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 2 2 2 ...
## $ First_aid_kit : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
## $ Safety_card : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
## $ Fire_extinguisher : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 2 2 2 ...
## $ Essentials : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ Shampoo : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 1 2 2 ...
## $ Hangers : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 1 ...
## $ Hair_dryer : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 1 ...
## $ Iron : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 1 ...
## $ Laptop-friendly_workspace : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 1 2 2 ...
## $ Private_entrance : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
## $ Outlet_covers : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ High_chair : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 1 2 ...
## $ Stair_gates : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Children’s_books_and_toys : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Window_guards : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Table_corner_guards : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Babysitter_recommendations : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Crib : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 1 2 ...
## $ Pack_’n_Playtravel_crib : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 1 2 ...
## $ Room-darkening_shades : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 1 2 2 ...
## $ Children’s_dinnerware : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Hot_water : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 2 1 2 ...
## $ Body_soap : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Bath_towel : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ Toilet_paper : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## [list output truncated]
summary(airbnb_data)
## host_since_year host_response_rate host_acceptance_rate host_is_superhost
## Min. :2009 Min. :0.0000 Min. :0.0000 Length:7364
## 1st Qu.:2012 1st Qu.:1.0000 1st Qu.:0.8900 Class :character
## Median :2013 Median :1.0000 Median :0.9800 Mode :character
## Mean :2013 Mean :0.9348 Mean :0.8748
## 3rd Qu.:2013 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2014 Max. :1.0000 Max. :1.0000
##
## host_neighbourhood host_total_listings_count host_has_profile_pic
## Length:7364 Min. : 0.00 Length:7364
## Class :character 1st Qu.: 1.00 Class :character
## Mode :character Median : 3.00 Mode :character
## Mean : 15.68
## 3rd Qu.: 17.00
## Max. :170.00
##
## host_identity_verified neighbourhood latitude
## Length:7364 Eixample :1607 Min. :41.35
## Class :character Ciutat Vella : 796 1st Qu.:41.38
## Mode :character Sants-Montjuïc : 666 Median :41.39
## Dreta de l'Eixample: 400 Mean :41.39
## Gràcia : 393 3rd Qu.:41.40
## Sant Martí : 374 Max. :41.46
## (Other) :3128
## longitude is_location_exact property_type room_type
## Min. :2.089 Length:7364 Length:7364 Length:7364
## 1st Qu.:2.157 Class :character Class :character Class :character
## Median :2.167 Mode :character Mode :character Mode :character
## Mean :2.167
## 3rd Qu.:2.177
## Max. :2.222
##
## accommodates bathrooms bedrooms beds
## Min. : 1.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 2.000
## Mean : 3.789 Mean : 1.393 Mean : 1.769 Mean : 2.574
## 3rd Qu.: 5.000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :20.000 Max. :16.000 Max. :16.000 Max. :40.000
##
## bed_type square_feet price security_deposit
## Length:7364 Min. : 0.0 Min. : 10.0 Min. : 0.0
## Class :character 1st Qu.: 180.0 1st Qu.: 40.0 1st Qu.:100.0
## Mode :character Median : 431.0 Median : 70.0 Median :160.0
## Mean : 358.8 Mean :101.5 Mean :193.2
## 3rd Qu.: 538.0 3rd Qu.:120.0 3rd Qu.:250.0
## Max. :2799.0 Max. :999.0 Max. :999.0
##
## cleaning_fee guests_included extra_people minimum_nights
## Min. : 0.00 Min. : 1.000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 20.00 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 40.00 Median : 1.000 Median : 0.00 Median : 3.00
## Mean : 50.52 Mean : 2.087 Mean : 9.84 Mean : 10.81
## 3rd Qu.: 65.00 3rd Qu.: 2.000 3rd Qu.: 18.00 3rd Qu.: 10.00
## Max. :540.00 Max. :150.000 Max. :268.00 Max. :1124.00
##
## maximum_maximum_nights availability_30 availability_60 availability_90
## Min. :1.000e+00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:3.000e+02 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.:13.00
## Median :1.125e+03 Median :26.00 Median :49.00 Median :74.00
## Mean :5.840e+05 Mean :18.25 Mean :37.25 Mean :57.19
## 3rd Qu.:1.125e+03 3rd Qu.:30.00 3rd Qu.:59.00 3rd Qu.:89.00
## Max. :2.147e+09 Max. :30.00 Max. :60.00 Max. :90.00
##
## availability_365 number_of_reviews review_scores_rating review_scores_accuracy
## Min. : 0.0 Min. : 0.00 Min. : 20.00 Min. : 2.000
## 1st Qu.: 89.0 1st Qu.: 1.00 1st Qu.: 90.00 1st Qu.: 9.000
## Median :209.0 Median : 15.00 Median : 94.00 Median :10.000
## Mean :206.7 Mean : 56.84 Mean : 92.14 Mean : 9.517
## 3rd Qu.:346.0 3rd Qu.: 78.00 3rd Qu.: 96.00 3rd Qu.:10.000
## Max. :365.0 Max. :731.00 Max. :100.00 Max. :10.000
##
## review_scores_cleanliness review_scores_checkin review_scores_communication
## Min. : 2.000 Min. : 2.000 Min. : 2.000
## 1st Qu.: 9.000 1st Qu.:10.000 1st Qu.:10.000
## Median : 9.000 Median :10.000 Median :10.000
## Mean : 9.335 Mean : 9.699 Mean : 9.695
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
##
## review_scores_value instant_bookable cancellation_policy
## Min. : 2.000 Length:7364 Length:7364
## 1st Qu.: 9.000 Class :character Class :character
## Median : 9.000 Mode :character Mode :character
## Mean : 9.073
## 3rd Qu.:10.000
## Max. :10.000
##
## require_guest_profile_picture require_guest_phone_verification
## Length:7364 Length:7364
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## calculated_host_listings_count calculated_host_listings_count_entire_homes
## Min. : 1.00 Min. : 0.00
## 1st Qu.: 1.00 1st Qu.: 0.00
## Median : 3.00 Median : 1.00
## Mean : 15.44 Mean : 12.86
## 3rd Qu.: 18.00 3rd Qu.: 14.00
## Max. :132.00 Max. :132.00
##
## calculated_host_listings_count_private_rooms
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 2.227
## 3rd Qu.: 1.000
## Max. :81.000
##
## calculated_host_listings_count_shared_rooms reviews_per_month email
## Min. :0.00000 Min. : 0.010 Yes:7180
## 1st Qu.:0.00000 1st Qu.: 0.340 No : 184
## Median :0.00000 Median : 0.800
## Mean :0.02879 Mean : 1.214
## 3rd Qu.:0.00000 3rd Qu.: 1.660
## Max. :8.00000 Max. :25.450
##
## phone facebook reviews manual_offline jumio
## Yes:7354 Yes:1673 Yes:6945 Yes: 433 Yes:5709
## No : 10 No :5691 No : 419 No :6931 No :1655
##
##
##
##
##
## offline_government_id sent_id selfie government_id identity_manual
## Yes:3275 Yes: 117 Yes:2249 Yes:6076 Yes:1986
## No :4089 No :7247 No :5115 No :1288 No :5378
##
##
##
##
##
## work_email TV Cable_TV Internet Wifi Air_conditioning
## Yes:1452 Yes: 0 Yes: 908 Yes:2675 Yes:7218 Yes:4949
## No :5912 No :7364 No :6456 No :4689 No : 146 No :2415
##
##
##
##
##
## Wheelchair_accessible Kitchen Paid_parking_off_premises Elevator
## Yes: 521 Yes:6741 Yes:2520 Yes:4354
## No :6843 No : 623 No :4844 No :3010
##
##
##
##
##
## Buzzerwireless_intercom Heating Familykid_friendly Washer Smoke_alarm
## Yes:2393 Yes:6264 Yes:3631 Yes:6194 Yes:1732
## No :4971 No :1100 No :3733 No :1170 No :5632
##
##
##
##
##
## Carbon_monoxide_alarm First_aid_kit Safety_card Fire_extinguisher Essentials
## Yes:1324 Yes:1617 Yes: 525 Yes:1641 Yes:6909
## No :6040 No :5747 No :6839 No :5723 No : 455
##
##
##
##
##
## Shampoo Hangers Hair_dryer Iron Laptop-friendly_workspace
## Yes:4480 Yes:6153 Yes:6009 Yes:5802 Yes:4424
## No :2884 No :1211 No :1355 No :1562 No :2940
##
##
##
##
##
## Private_entrance Outlet_covers High_chair Stair_gates
## Yes:1005 Yes: 80 Yes:1519 Yes: 20
## No :6359 No :7284 No :5845 No :7344
##
##
##
##
##
## Childrenâ\200\231s_books_and_toys Window_guards Table_corner_guards
## Yes: 512 Yes: 113 Yes: 22
## No :6852 No :7251 No :7342
##
##
##
##
##
## Babysitter_recommendations Crib Pack_â\200\231n_Playtravel_crib
## Yes: 319 Yes:1756 Yes: 975
## No :7045 No :5608 No :6389
##
##
##
##
##
## Room-darkening_shades Childrenâ\200\231s_dinnerware Hot_water Body_soap Bath_towel
## Yes: 887 Yes: 179 Yes:5115 Yes: 159 Yes: 159
## No :6477 No :7185 No :2249 No :7205 No :7205
##
##
##
##
##
## Toilet_paper Bed_linens Extra_pillows_and_blankets Ethernet_connection
## Yes: 159 Yes:3544 Yes:2090 Yes: 416
## No :7205 No :3820 No :5274 No :6948
##
##
##
##
##
## Pocket_wifi Microwave Coffee_maker Refrigerator Dishwasher
## Yes: 469 Yes:3995 Yes:3996 Yes:4367 Yes:2066
## No :6895 No :3369 No :3368 No :2997 No :5298
##
##
##
##
##
## Dishes_and_silverware Cooking_basics Oven Stove Single_level_home
## Yes:4370 Yes:3732 Yes:3255 Yes:3268 Yes: 408
## No :2994 No :3632 No :4109 No :4096 No :6956
##
##
##
##
##
## Patio_or_balcony Luggage_dropoff_allowed Wide_hallways
## Yes:2504 Yes:1778 Yes: 119
## No :4860 No :5586 No :7245
##
##
##
##
##
## No_stairs_or_steps_to_enter Wide_entrance_for_guests
## Yes: 220 Yes: 123
## No :7144 No :7241
##
##
##
##
##
## Flat_path_to_guest_entrance Well-lit_path_to_entrance Extra_space_around_bed
## Yes: 104 Yes: 188 Yes: 57
## No :7260 No :7176 No :7307
##
##
##
##
##
## Accessible-height_bed Accessible-height_toilet Wide_clearance_to_shower
## Yes: 59 Yes: 31 Yes: 10
## No :7305 No :7333 No :7354
##
##
##
##
##
## _toilet Wide_entryway Host_greets_you Waterfront Beachfront
## Yes: 10 Yes: 45 Yes:3285 Yes: 169 Yes: 133
## No :7354 No :7319 No :4079 No :7195 No :7231
##
##
##
##
##
## Handheld_shower_head Hot_water_kettle Ceiling_fan Beach_view Rain_shower
## Yes: 38 Yes: 68 Yes: 12 Yes: 2 Yes: 38
## No :7326 No :7296 No :7352 No :7362 No :7326
##
##
##
##
##
## Bidet Heated_towel_rack Balcony Printer Espresso_machine
## Yes: 26 Yes: 30 Yes: 71 Yes: 7 Yes: 85
## No :7338 No :7334 No :7293 No :7357 No :7279
##
##
##
##
##
## Formal_dining_area Day_bed Convection_oven Standing_valet
## Yes: 34 Yes: 19 Yes: 40 Yes: 12
## No :7330 No :7345 No :7324 No :7352
##
##
##
##
##
## Pillow-top_mattress Memory_foam_mattress En_suite_bathroom Outdoor_seating
## Yes: 27 Yes: 36 Yes: 44 Yes: 35
## No :7337 No :7328 No :7320 No :7329
##
##
##
##
##
## Mudroom Full_kitchen Paid_parking_on_premises Bedroom_comforts
## Yes: 9 Yes: 184 Yes:1377 Yes: 160
## No :7355 No :7180 No :5987 No :7204
##
##
##
##
##
## Bathroom_essentials Fixed_grab_bars_for_shower Shower_chair days_since_host
## Yes: 161 Yes: 7 Yes: 3 Min. :2270
## No :7203 No :7357 No :7361 1st Qu.:2647
## Median :2915
## Mean :2961
## 3rd Qu.:3241
## Max. :4382
##
## days_since_first_review days_since_last_review
## Min. : 277 Min. : 277.0
## 1st Qu.:1065 1st Qu.: 375.0
## Median :1697 Median : 392.0
## Mean :1682 Mean : 523.1
## 3rd Qu.:2202 3rd Qu.: 510.0
## Max. :3904 Max. :3556.0
##
Now I check the distribution of price, the variable that I will try to predict. I use the log of price, since it allows us to see the data in a more compact way. The shape seems pretty gaussian, but a little bit skewed to the left. The Average price is 101.5, the minimum 10 and the maximum about 1000 (all the prices are in USD Dollars)
ggplot(data = airbnb_data, aes(x = log(price))) +
geom_histogram(binwidth = 0.5, color = "black", fill = "skyblue") +
ggtitle("AirBnb Price per Night (Log scale)") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(hjust = 0.5, size = 12),
axis.text = element_text(size = 12)) +
xlab("Log of Price per Night") + ylab("Number of apartments")
In this section I will be exploring some categorical variables, as well as their relationship with price, the variable I am intended to predict. First I aggregate the data to take the average night price by neighbourhood. Then I subtract the first row, since the value is clearly an outlier with an average price of $800 per night.
night_neighbourhood <- airbnb_data %>% group_by(neighbourhood) %>% summarize(avg_night_price = mean(price)) %>% arrange(desc(avg_night_price))
night_neighbourhood <- night_neighbourhood[-1, ] # first row is clearly an outlier
Now I plot the average price per night of the 10 most expensive (up) and the 10 cheapest (down) neighbourhoods in Barcelona
ggplot(night_neighbourhood[1:10, ], aes(x = reorder(neighbourhood, avg_night_price), y = avg_night_price,
fill = avg_night_price)) +
geom_bar(stat = "identity") + ggtitle("Top 10 most expensive neighbourhoods") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(hjust = 0.5, size = 12),
axis.text = element_text(size = 12)) + scale_y_continuous(labels = function(x) paste0("$", x)) +
xlab("") + ylab("Airbnb Price per Night") + coord_flip() + theme(legend.position = "none") + scale_fill_gradient(low = "yellow", high = "red")
ggplot(night_neighbourhood[(nrow(night_neighbourhood) - 10) : nrow(night_neighbourhood), ], aes(x = reorder(neighbourhood, avg_night_price), y = avg_night_price,
fill = avg_night_price)) +
geom_bar(stat = "identity") + ggtitle("Top 10 cheapest neighbourhoods") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(hjust = 0.5, size = 12),
axis.text = element_text(size = 12)) + scale_y_continuous(labels = function(x) paste0(x, "$")) +
xlab("Neighbourhood") + ylab("Airbnb Price per Night") + coord_flip() + theme(legend.position = "none") + scale_fill_gradient(low = "green", high = "blue")
In order to see all this insights clearly, I create a map, using the geospatial data from Barcelona, which I will obtain from the neighbourhoods_geojson dataset. This geospatial data is loaded as a “SpatialPolygonsDataFrame”, which is a kind of dataframe that includes polygons, which are used to create maps in R. This dataset includes a dataframe containing the neighbourhoods, as well as the spatial polygons:
summary(neighbourhoods_geojson)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 2.05247 2.22922
## y 41.31696 41.46828
## Is projected: FALSE
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Data attributes:
## neighbourhood neighbourhood_group
## Length:75 Length:75
## Class :character Class :character
## Mode :character Mode :character
Now I correct the misspellings and convert the variable neighbourhood of the dataframe data of neighbourhoods_geojson into a factor. Then I join this data with the night_neighbourhood dataframe I created previously, and use a loop to substitute the missing values of the price variable, with the mean of the available ones. This way, I obtain a dataset with each neighbourhood, its spatial polygons and its average night price.
neighbourhoods_geojson@data$neighbourhood <- as.factor(neighbourhoods_geojson@data$neighbourhood)
levels(neighbourhoods_geojson$neighbourhood) <- c("Baró de Viver" , "Can Baró" ,
"Can Peguera" , "Canyelles",
"Ciutat Meridiana" , "Diagonal Mar i el Front Marítim del Poblenou",
"El Baix Guinardó" , "El Barri Gòtic" ,
"El Besós i el Maresme", "El Bon Pastor" ,
"El Camp d'en Grassot i Gràcia Nova" , "El Camp de l'Arpa del Clot" ,
"El Carmel" , "El Clot" ,
"El Coll" , "El Congrés i els Indians" ,
"El Fort Pienc" , "El Guinardó" ,
"El Parc i la Llacuna del Poblenou" , "El Poble Sec" ,
"El Poblenou" , "El Putxet i el Farró" ,
"El Raval" , "El Turó de la Peira",
"Horta" , "Hostafrancs" ,
"L'Antiga Esquerra de l'Eixample" , "La Barceloneta" ,
"La Bordeta" , "La Clota" ,
"La Dreta de l'Eixample" , "La Font d'en Fargues" ,
"La Font de la Guatlla" , "La Guineueta" ,
"La Marina de Port" , "La Marina del Prat Vermell",
"La Maternitat i Sant Ramon" , "La Nova Esquerra de l'Eixample" ,
"La Prosperitat" , "La Sagrada Família" ,
"La Sagrera" , "La Salut" ,
"La Teixonera" , "La Trinitat Nova" ,
"La Trinitat Vella" , "La Vall d'Hebron" ,
"La Verneda i la Pau" , "La Vila de Gràcia",
"La Vila Olímpica del Poblenou", "Les Corts" ,
"Les Roquetes" , "Les Tres Torres",
"Montbau" , "Navas",
"Pedralbes" , "Porta" ,
"Provençals del Poblenou" , "Sant Andreu" ,
"Sant Antoni" , "Sant Genís dels Agudells" ,
"Sant Gervasi - Galvany" , "Sant Gervasi - la Bonanova" ,
"Sant Martí de Provençals" , "Sant Pere, Santa Caterina i la Ribera",
"Sants" , "Sants - Badal" ,
"Sarrià " , "Torre Baró" ,
"Vallbona" , "Vallcarca i els Penitents" ,
"Vallvidrera, el Tibidabo i les Planes" , "Verdun" ,
"Vilapicina i la Torre Llobeta" )
neighbourhoods_geojson@data <- left_join(neighbourhoods_geojson@data, night_neighbourhood[, 1:2]) %>% as.data.frame()
for(i in 1:nrow(neighbourhoods_geojson@data)){
if(is.na(neighbourhoods_geojson@data[i, "avg_night_price"]) == TRUE){
neighbourhoods_geojson@data[i, 3] <- mean(neighbourhoods_geojson@data[complete.cases(neighbourhoods_geojson@data), 3])
} else if(is.na(neighbourhoods_geojson@data[i, "avg_night_price"]) == FALSE){
}
}
I load the leaflet package to create an interactive map. This map shows the Average Airbnb Night Price by Neighbourhood in Barcelona.
library(leaflet)
pal <- colorNumeric(
palette = "YlGnBu",
domain = neighbourhoods_geojson@data$avg_night_price
)
price_per_neighbourhood <- leaflet(neighbourhoods_geojson) %>%
addTiles() %>% setView(lng = 2.1734, lat = 41.3851, zoom = 11.5) %>%
addPolygons(stroke = TRUE, fillColor = ~ pal(avg_night_price), fillOpacity = 0.8,
highlight = highlightOptions(weight = 2,
color = ~ pal(avg_night_price),
fillOpacity = 1,
bringToFront = TRUE),
label = ~neighbourhood,
smoothFactor = 0.2,
popup = ~ paste(paste(neighbourhood,":"), "<br/>","<b/>", paste("Avg Night Price: ", "$", round(avg_night_price)))) %>%
addLegend("bottomright", pal = pal, values = ~avg_night_price, opacity = 1.0,
title = "Average Airbnb Night Price",
labFormat = labelFormat(prefix = "$"), na.label="")
price_per_neighbourhood
As we can see, the most expensive neighbourhoods of Barcelona seem to be in the northwest and center parts of the city, and the cheapeast ones are most likely located in the northeast and south. This may be due to the presence of turistic attractions, mainly in the center (Plaça Catalunya, Portal de l’Àngel, Passeig de Gràcia) and others such as Sagrada Familia, where the average night price is $125.
Now I create another map, this time based on the type of property:
pal3 <- colorFactor(palette = c(
"dodgerblue2", "#E31A1C",
"green4",
"#6A3D9A",
"#FF7F00",
"black", "gold1",
"skyblue2", "#FB9A99",
"palegreen2",
"#CAB2D6",
"#FDBF6F",
"gray70", "khaki2",
"maroon", "orchid1", "deeppink1", "blue1", "steelblue4",
"darkturquoise", "green1"),
domain = airbnb_data$property_type)
typeofproperty <- list()
for (i in 1:length(levels(as.factor(airbnb_data$property_type)))) {
typeofproperty[[i]] <- airbnb_data %>% dplyr::filter(property_type == levels(as.factor(airbnb_data$property_type))[i])
}
names(typeofproperty) <- levels(as.factor(airbnb_data$property_type))
typeofproperty_map <- leaflet() %>% addTiles() %>% setView(lng = 2.1734, lat = 41.3851, zoom = 13)
for (i in 1:length(levels(as.factor(airbnb_data$property_type)))) {
typeofproperty_map <- typeofproperty_map %>% addCircles(data = typeofproperty[[i]], lat = ~latitude,
lng = ~longitude, color = ~pal3(property_type),
fillOpacity = 1, label = ~property_type,
popup = ~price, group = levels(as.factor(airbnb_data$property_type))[i])
}
typeofproperty_map <- typeofproperty_map %>% addLegend(data = airbnb_data, "topleft",
pal = pal3, values = ~property_type, title = "Property Type",
opacity = 1, group = "Legend")
groups <- c("Legend", levels(as.factor(airbnb_data$property_type)))
typeofproperty_map <- typeofproperty_map %>% addLayersControl(overlayGroups = groups,
options = layersControlOptions(collapsed = TRUE))
typeofproperty_map
Apartments are the most common Airbnb Property Types, being spread all over the city, condominiums lofts and aparthotels are also pretty extended.
Now I check the correlations of price and the other numerical variables.
library(corrr)
numeric_correlations <- select_if(airbnb_data, is.numeric) %>% correlate() %>% focus(price)
numeric_correlations_plot <- ggplot(data = numeric_correlations, aes(x = reorder(rowname, price), y = price)) +
geom_bar(aes(fill = price), stat = "identity") +
scale_fill_gradient(low = "blue", high = "red") +
ggtitle("Correlation of price and the other numeric variables") +
theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
axis.title = element_text(hjust = 0.5, size = 14),
axis.text.x = element_text(size = 14, angle = 90),
legend.position = "none",
legend.title = element_blank()) +
xlab("Numeric variable") + ylab("Correlation") + coord_flip()
numeric_correlations_plot
The variables that seem to have more dependency on price are: accomodates, bedrooms, beds and guests_included. This means that these factors may affect the price more than other variables.