Understanding Data - Airbnb listing popularity analysis based on Barcelona data

Objective

Airbnb is a platform business that connects two groups: hosts and guests. Anybody with a spare room or free space can become a host on Airbnb and offer it to a global community. It is a way to earn extra income with minimal effort and an easy way to advertise the space, because the platform has the traffic and the global user base to support it. Airbnb thus gives hosts an easy way to monetize space that would otherwise go to waste.

On the other side are guests with very specific needs: some might be seeking affordable accommodation close to the city attractions, while others want a luxurious apartment by the sea. They might be groups, families or individuals, both local and foreign. After every visit, guests have an opportunity to rate their stay and leave feedback.

We will try to find out what contributes to a listing's popularity and predict, based on its attributes, whether a listing has the potential to make it into the Top 100 most reviewed accommodations.

Since the data contains both current and historical listings, we will measure popularity by the total number of reviews received. We will rank all the listings and label each one with is_top_100, set to true or false.
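The ranking and labelling step can be sketched as follows. This is a minimal illustration on toy data, not the code used in the analysis: the column name number_of_reviews and the cutoff of 3 (standing in for 100) are assumptions made for the example.

```r
# Toy data; the real analysis would use the listing details file.
# `number_of_reviews` is an assumed column name for the review count.
listings <- data.frame(
  id = 1:6,
  number_of_reviews = c(120, 5, 87, 87, 0, 301)
)

cutoff <- 3  # would be 100 in the actual analysis

# Ranking the negated counts gives rank 1 to the most reviewed listing;
# ties.method = "min" gives tied listings the same (best) rank.
listings$review_rank <- rank(-listings$number_of_reviews, ties.method = "min")

# Label every listing whose rank falls within the cutoff
listings$is_top_100 <- listings$review_rank <= cutoff

# Show the listings from most to least reviewed
listings[order(listings$review_rank), ]
```

With this tie handling, listings that share the review count at the cutoff are all labelled true, so the positive class can be slightly larger than the cutoff. Using ties.method = "first" instead would force exactly that many positives at the price of an arbitrary choice among tied listings.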

Importing Data

For this purpose we will use publicly available Airbnb data, which can be sourced from the Inside Airbnb website. The data covers all Barcelona listing details, customer reviews and associated geolocation information collected on the 9th of June 2018 and is published in the form of csv files:

  • listing_details.csv - Detailed Listings data for Barcelona
  • calendar_details.csv - Detailed Calendar Data for listings in Barcelona
  • review_details.csv - Detailed Review Data for listings in Barcelona
  • review_summary.csv - Summary Review data and Listing ID
  • neighbourhoods.csv - Neighbourhood list for geo filter.
  • neighbourhoods.geojson - GeoJSON file of neighbourhoods of the city

A copy of the data can be found at Inside Airbnb.

A quick inspection of the files revealed that the summary files contain only a limited number of columns, which are also available in the details files, so they will not be used further. The information in the calendar file is also already present in the listing details, so we will not use it either.
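A redundancy check of this kind can be sketched by comparing column names. The snippet below is only an illustration: the two vectors of column names are hypothetical placeholders, and in practice they could be taken from the headers of the summary and details files.

```r
# Hypothetical column names standing in for the real file headers
summary_cols <- c("listing_id", "date")
details_cols <- c("listing_id", "id", "date", "reviewer_id",
                  "reviewer_name", "comments")

# Columns that appear in the summary file but not in the details file
only_in_summary <- setdiff(summary_cols, details_cols)

# TRUE when the summary file adds no column of its own
summary_is_redundant <- length(only_in_summary) == 0
summary_is_redundant
```

If only_in_summary came back non-empty, the summary file would carry information that the details file lacks and would be worth keeping.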

For data import, exploration and visualisation we will use the R language with additional packages. The code below imports the data files for further analysis and exploration. To preserve text in multiple languages, ‘utf-8’ encoding will be used across the files.

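library(readr)       # read_csv()
library(geojsonio)   # geojson_read()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box()
lis_det <- read_csv('C:/Airbnb/listing_details.csv', guess_max = 10000)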
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   id = col_integer(),
##   scrape_id = col_double(),
##   last_scraped = col_date(format = ""),
##   host_id = col_integer(),
##   host_since = col_date(format = ""),
##   host_listings_count = col_integer(),
##   host_total_listings_count = col_integer(),
##   latitude = col_double(),
##   longitude = col_double(),
##   accommodates = col_integer(),
##   bathrooms = col_double(),
##   bedrooms = col_integer(),
##   beds = col_integer(),
##   square_feet = col_integer(),
##   guests_included = col_integer(),
##   minimum_nights = col_integer(),
##   maximum_nights = col_integer(),
##   availability_30 = col_integer(),
##   availability_60 = col_integer(),
##   availability_90 = col_integer()
##   # ... with 14 more columns
## )
## See spec(...) for full column specifications.
nb_geo <- geojson_read('C:/Airbnb/neighbourhoods.geojson', what = 'sp')
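The import above does not set an encoding explicitly; readr's read_csv assumes UTF-8 by default. As a rough, self-contained way to check that accented place names survive a round trip, the sketch below writes a small temporary file and reads it back with base R. The file content and the object names are made up for the example.

```r
# Toy file standing in for a listings file (content is illustrative)
path <- tempfile(fileext = ".csv")
rows <- c("id,neighbourhood",
          "1,la Sagrada Fam\u00edlia",
          "2,la Vila de Gr\u00e0cia")

# Write the raw UTF-8 bytes without any re-encoding
con <- file(path, open = "wb")
writeLines(enc2utf8(rows), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the strings as UTF-8 when they are read
toy <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)

# Every value should still be valid UTF-8 after the round trip
all(validUTF8(toy$neighbourhood))
```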

Data Exploration and Pre-processing

We will start by familiarizing ourselves with the columns in the dataset to understand what each feature represents. This is important because a poor understanding of the features could cause us to make mistakes in the data analysis and the modeling process. We will also try to reduce the number of columns by removing those that are either contained elsewhere or do not carry information that can be used to answer our questions.

LISTING DETAILS

The file contains all historical and active listings captured in Barcelona on the 9th of June 2018. We will inspect the file’s composition first.

dim(lis_det)
## [1] 17788    96
kable(lis_det[1:5,1:20]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, font_size = 9) %>%
  scroll_box(width = "910px", height = "400px")
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview notes transit access interaction house_rules thumbnail_url medium_url picture_url xl_picture_url host_id
18666 https://www.airbnb.com/rooms/18666 2.018071e+13 2018-07-10 Flat with Sunny Terrace Apartment located near the “Plaza de las Glorias” and the second-hand market (Encants). The accommodation is also close to the National Theatre of Catalunya and the Agbar Tower which has become one of the new symbols of Barcelona. Licence number: HUTB-(PHONE NUMBER HIDDEN) Nice apartment situated on the penthouse floor of a building with elevator. Huge Living/dining-room with double sofa-bed 1 bedroom with two single beds 1 bedroom with double bed Nice kitchen opened to the living/dining-room and fully equipped for 6 people Bathroom with shower The accommodation has been recently renovated and tastefully decorated with a comfortable furniture and wood floor. Also it is equipped with air-conditioning and heating. Apartment located near the “Plaza de las Glorias” and the second-hand market (Encants). The accommodation is also close to the National Theatre of Catalunya and the Agbar Tower which has become one of the new symbols of Barcelona. Licence number: HUTB-(PHONE NUMBER HIDDEN) Nice apartment situated on the penthouse floor of a building with elevator. Huge Living/dining-room with double sofa-bed 1 bedroom with two single beds 1 bedroom with double bed Nice kitchen opened to the living/dining-room and fully equipped for 6 people Bathroom with shower The accommodation has been recently renovated and tastefully decorated with a comfortable furniture and wood floor. Also it is equipped with air-conditioning and heating. Free Wifi - air conditioning. We will provide basic amenities like shower gel, shampoo,and hand soap. Also, 1 set of bed linen and towels per person will be included. We can provide you all kind of entrance and tickets for monuments and shows in Barcelona in order you avo none Apartment in Barcelona near to the Plaza de las Glorias, the old market (Encants), the Agbar Tower one of the new symbols of Barcelona and the Teatre Nacional de Catalunya. 
All kinds of services in surroundings (shops, supermarkets, restaurants, bars). NA Good transports connections, 50 m from the metro station “Clot” (line 1 and 2, red and purple), just 10 minutes by metro to Passeig de Gracia (La Pedrera, Casa Batllo), Plaza Catalunya and Las Ramblas. Free Wifi - air conditioning. We will provide basic amenities like shower gel, shampoo,and hand soap. Also, 1 set of bed linen and towels per person will be included. We can provide you all kind of entrance and tickets for monuments and shows in Barcelona in order you avoid queues and plan your trip in advance. Also we can organize shuttle from/to airport. All that you need for get a perfect stay in our nice city you only have to ask us. Cleaning fee: 40 euros (to pay at arrival) Tourist tax at arrival: 2.48 Eur/person/night (to pay at arrival) Arrival and departure times: Check-in: 3 pm to 9 pm. Check-out: Before 11 am. Checking-in and checking-out times will be flexible and can be arranged provided there are not other bookings made prior or after your reservation. The client must always inform of the approximate arrival time. If the client arrives between 9 pm and 12 am, he/she shall pay a 30 euros extra charge. Arrivals after 12 am and until 2 am shall have an additional extra charge of 50 euros NA NA https://a0.muscache.com/im/pictures/47f88bc6-6561-445a-beec-f8ec4ddc1038.jpg?aki_policy=large NA 71615
18674 https://www.airbnb.com/rooms/18674 2.018071e+13 2018-07-10 Huge flat for 8 people close to Sagrada Familia 110m2 apartment to rent in Barcelona. Located in the Eixample district, near the Sagrada Familia. It has a small balcony where you can see the temple of Gaudi. Capacity for 8 people. Licence number: HUTB-002062 Apartment with 110 m2 located in the 6th floor in a building with elevator Huge living/dinig-room 1 double bedrrom 1 bedroom with 2 single beds 1 bedroom with bunk beds Kitchen fully equipped for 8 people 1 bathroom with bathtub 1 small bathroom with shower balcony The accommodation has been recently renovated and tastefully decorated with a comfortable furniture and wood floor. Also it is equipped with heating, air conditioning and wifi. 110m2 apartment to rent in Barcelona. Located in the Eixample district, near the Sagrada Familia. It has a small balcony where you can see the temple of Gaudi. Capacity for 8 people. Licence number: HUTB-002062 Apartment with 110 m2 located in the 6th floor in a building with elevator Huge living/dinig-room 1 double bedrrom 1 bedroom with 2 single beds 1 bedroom with bunk beds Kitchen fully equipped for 8 people 1 bathroom with bathtub 1 small bathroom with shower balcony The accommodation has been recently renovated and tastefully decorated with a comfortable furniture and wood floor. Also it is equipped with heating, air conditioning and wifi. Free Wifi - air conditioning. We will provide basic amenities like shower gel, shampoo,and hand soap. Also, 1 set of bed linen and towels per person will be included. We can provide you all kind of entrance and tickets for monuments and shows in Barcelona in order you avoid queues and plan your trip in advance. Also we can organize sh none Apartment in Barcelona located in the heart of Eixample district, within only 150 m form the great Sagrada Familia and really near of Gaudí Avenue and the famous Sant Pau Hospital . 
All kind of services in surroundings (shops, supermarkets, restaurants, bars). NA Good transport connection, 150 m from metro “Sagarda Familia” (L5 and L2) and within only 15 minutes you can arrive by metro to “Plaza Catalunya”, “Paseo de Gracia” and “Ciutat Vella”. Also there are bus stations in surroundings. Free Wifi - air conditioning. We will provide basic amenities like shower gel, shampoo,and hand soap. Also, 1 set of bed linen and towels per person will be included. We can provide you all kind of entrance and tickets for monuments and shows in Barcelona in order you avoid queues and plan your trip in advance. Also we can organize shuttle from/to airport. All that you need for get a perfect stay in our nice city you only have to ask us. Tourist tax at arrival: 2.48 Eur/person/night (to pay at arrival) Arrival and departure times: Check-in: 3 pm to 9 pm. Check-out: Before 11 am. Checking-in and checking-out times will be flexible and can be arranged provided there are not other bookings made prior or after your reservation. The client must always inform of the approximate arrival time. If the client arrives between 9 pm and 12 am, he/she shall pay a 30 euros extra charge. Arrivals after 12 am and until 2 am shall have an additional extra charge of 50 euros. NA NA https://a0.muscache.com/im/pictures/13031453/413cdbfc_original.jpg?aki_policy=large NA 71615
19157 https://www.airbnb.com/rooms/19157 2.018071e+13 2018-07-10 Great Place in Sagrada Familia, Bcn We offer a Room in a very well located apartment and close to major attractions by metro, bus or walking. If you want an accesible place to discover the city, then this is a great option for you. The room is located in a very basic apartment, there are no luxuries. The apartment is not modern. If you are looking for great location, quiet place then this is a good option for you. Keep in mind that you will be sharing with me and another roommate and that you are not renting the whole flat. Cozy, well located apartment located just two blocks from amazing Sagrada Familia Church. Our neighborhood is safe, lively during the day and quiet at night. Also, there are many nice restaurants, bars and local shops. The building has an elevator and It is two blocks away from two metro stations, plus there are plenty of bus stops nearby so you can get to any part of the city quickly. We are only 10 minutes away by Metro from Plaza Cataluña. The guests’ room is furnished with a double bed, open cupboard to put personal belongings and essentials ( toiletries, towels, maps, etc) and it also has a pine wooden wardrobe to put your clothes, two night tables, and reading lamp. ** Bed linen and towels are provided for your convenience, specially for those who want to travel light. >> Toiletries like shampoo, bath gel and conditioner are provided and left in the bathroom for guests use. >> Please note that the room is big enough for two people, although if you are used to bigger room sp We offer a Room in a very well located apartment and close to major attractions by metro, bus or walking. If you want an accesible place to discover the city, then this is a great option for you. The room is located in a very basic apartment, there are no luxuries. The apartment is not modern. If you are looking for great location, quiet place then this is a good option for you. 
Keep in mind that you will be sharing with me and another roommate and that you are not renting the whole flat. Cozy, well located apartment located just two blocks from amazing Sagrada Familia Church. Our neighborhood is safe, lively during the day and quiet at night. Also, there are many nice restaurants, bars and local shops. The building has an elevator and It is two blocks away from two metro stations, plus there are plenty of bus stops nearby so you can get to any part of the city quickly. We are only 10 minutes away by Metro from Plaza Cataluña. The guests’ room is furnished with a double bed, open none The neighbourhood has a local and touristy balanced combination. The vicinity to Sagrada Familia and Hospital Sant Pau makes the area a major attraction without the massive amount of tourists that you will get in other parts of Barcelona. There are many local stores with very nice cafés and terraces where you can enjoy and get a feel of Barcelona’s Mediterranean lifestyle Please take into account that there is no Air Conditioning in the apartment nor the room. The bathroom is shared, you can leave your amenities there for your convenience. Some guests have complaint about noise: we are quiet in the apartment, but neighbours sometimes are not quiet. So if you are too sensitive to noise, I honestly don´t recommend staying home. Please, once again, take into account that you are only renting a room not the whole apartment and please keep in mind that you will be sharing the apartment with me and my roommate. Just two blocks from the apartment you have 2 metro stations: Hospital Sant Pau Station( Metro Blue Line) and Sagrada Familia Station (Blue and Purple line). By metro you are only 3 stops away from Ramblas and you can get there in 12-15 minutes. 
If you prefer to take the bus, then two blocks away you have several bus stops with lines 19, 45, 47 that will take to Plaza Cataluña, or line 92 that will take you to Glories Shopping Mall if you are in the mood for shopping. Guests will have access to bathroom, kitchen (Please note that heavy cooking is not allowed). The living room is not shared. The bathroom and kitchen are to be shared. This means that you have to clean up after use. ***** The cleaning service is only provided before your stay to clean and arrange the room and the bathroom before your arrival. This means that during your stay you have to also collaborate on keeping common areas clean since there is no but there is no permanent cleaning during your stay. For this purpose cleaning material and gloves are permanently left in the bathroom*** - Shampoo and liquid soap are provided and left in the bathroom for those guests who travel light. - In the kitchen basic equipment is provided ( cups, dishes, bowls, cutlery) in case you want to prepare breakfast. Heavy cooking is not allowed. *** Kitchen premises must be left clean and tidy after every meal (this means doing the washing up, cleaning the stove, counter areas, and sweeping the flo Due to working schedules, sometimes I cannot welcome the guests or be home. Check in will be coordinated though for guests arrival. Late check in is not a problem. I will be at guests disposal for questions and requests at their arrival and during their stay if we coincide. In case of any question, issue, need or recommendation feel free to use Airbnb chat or free instant messaging tools like wassup, telegram or iMessage and I will gladly get back to guest with recommendations or information requested. CLEANING FEE: it is to cover the cleaning lady that cleans and organises your room, bathroom and kitchen before your arrival. It does not mean that there is a permanent cleaning service in the house. 
- Keeping things clean and tidy is also a guests responsibility in common shared areas. – Shall you need extra cleaning services please do not hesitate to mention it, the cleaning lady can be hired on your behalf for an extra fee of 12€ x cleaning service. -KITCHEN: heavy cooking is not allowed. If you need to prepare breakfast or a snack that is fine. All the areas must be left clean and organized after use ( this means doing the washing up of used utensils, leaving stove, counters and floor clean). All utensils used will be left clean and tidy. No food or beverage is included in the use of the kitchen, so you will have to buy all your food/ drink products. *** CHECK-IN is flexible and this means that it will depend on your arrival schedule and my working schedule availability. I NA NA https://a0.muscache.com/im/pictures/10556089/29e5de9f_original.jpg?aki_policy=large NA 73099
20345 https://www.airbnb.com/rooms/20345 2.018071e+13 2018-07-10 2 Double rooms for 4 persons, WI-FI HOME SHARING!!! Hello everybady! My name is Mila, hospitality is the best feature of me, you will feel at home as in yours. I am very respectful, sociable, calm and friendly person. I have hosted guests since 2010, I have a lot of experience and patience. I will be happy to host you and share my home with you! You are WELCOME !! WELCOME!!! WILKOMMEN !!! BIENBENUE !!! Apartment (3 bedrooms, living room, kitchen and bathroom) is ideal for Groups of 4 adults or for families with children. Rent 2 BEDROOMS (large and medium) in the comfortable apartment, furnished, clean, bright and quiet, WI-FI is available. One BEDROOM is large, has air conditioning, has 2 single beds (90x190) with a nice window, you can see the far hills and the gardens of the neighborhood, big closet, night tables, radio. Another BEDROOM is medium with a nice window, it is ideal for couples or two children. It has 1 double bed (135x190), closet, night table. LIVING ROOM has TV, DVD, CD player and a pretty balcony. Breakfast is included and you are welcome to use the kitchen to cook. Attention for families with young children! Children under 6! Apply the 30% off after communicating his age. For stays longer do the 20% discount. The price of 450 € for a month for a person applies from November to April, except the week of Christmas and Easter. ADDITIONAL PAYMENTS: € 20 BY CLEAN HOME SHARING!!! Hello everybady! My name is Mila, hospitality is the best feature of me, you will feel at home as in yours. I am very respectful, sociable, calm and friendly person. I have hosted guests since 2010, I have a lot of experience and patience. I will be happy to host you and share my home with you! You are WELCOME !! WELCOME!!! WILKOMMEN !!! BIENBENUE !!! Apartment (3 bedrooms, living room, kitchen and bathroom) is ideal for Groups of 4 adults or for families with children. 
Rent 2 BEDROOMS (large and medium) in the comfortable apartment, furnished, clean, bright and quiet, WI-FI is available. One BEDROOM is large, has air conditioning, has 2 single beds (90x190) with a nice window, you can see the far hills and the gardens of the neighborhood, big closet, night tables, radio. Another BEDROOM is medium with a nice window, it is ideal for couples or two children. It has 1 double bed (135x190), closet, night table. LIVING ROOM has TV, DVD, CD player and a pretty balcony. none En la misma plaza hay parque y una placita para los juegos de los niños. El passeig Fabra i Puig es una zona comercial donde se encuentran muchos restaurantes, supermercados y otras tiendas. En 7-10 min. a pie hay un centro comercial " Herron City“, Mercadona, El”Cort Ingles“, piscina y gimnacios. En 4-5 min. a pie esta estación de trenes y busos”Sant Andreu Arenal" y entrada al metro. Cleaning-15€. Cleaning during stay for guests desire, the price can be negotiated. Price for used washing machine - € 10 Check- In or Check- Out of 23:00 until 7:00 pm. € -15 Penalty for delay, input at the time had not foreseen - € 15 DIRECTIONS to the apartment from the AIRPORT. Transport is cheap, fast and modern. From “El Prat” airport to the apartment you can get: 1. Now with subway line N9!!! 2. take a green bus free from Terminal 1 to the Train Station Commuter (Terminal 2) and take train on R2-Nord to “CLOT-ARAGO”, where he will transfer to the station “CLOT” Metro (line 1, red) to “Fabra I Puig” station (direction “FONDO”), get off at “Fabra I Puig” and leave the last wagon on the right to the stairs or take the acensor in the rear of the train, cross the Meridiana avenue , riding down the street ESCOCIA and walk about 3-4 min. to find PLAZA GARRIGO. The trip takes about 40 minutos. 3. Or take bus L46 Plaza Spain (or night bus to Plaza Catalunya N17 ), where you have to take the metro (line 1, red) directly to “Fabra I Puig”. The time on the road depends on traffic. 
DIRECTIONS TO MILA´s HOUSE. AIRPORT TRANSPORTATION is inexpensive, fast and modern. From the airport “el PRAT” : 1. Now with subway, line N9!!! 2. Or you can arrive to MILA´s apartment taking a free green bus from Terminal 1 to the Rodalies train station ( Terminal 2) and take the R2-NORD, until station “CLOT- ARAGO”, where you will transfer to the metro station “CLOT” (red line) until station “FABRA I PUIG” ( heading to “FONDO”) get off at “FABRA I PUIG” and exit to the stairs at the rear of the train, cross the Meridian Avenue, look for the street ESCOCIA and walk until you find PLAZA GARRIGO. The trip will take around 40 minits. 3. Or take the L46 bus to Plaza España (or nite bus 17 to Plaza Catalunya) where you will transfer to the metro (red line) direct to “FABRA I PUIG”. The time on the road depends on traffic is. Intercom: call button “cinque-2”, (that is written in catalan) … Plant 5, Gate 2. Llamar por interfono: el botón “cinque-2” (quinta planta, puerta 2) Intercom-Anru In general …I will be present during your stay and help my guests depends on your needs RESPECT! GOOD CHEER AND GOOD SENSE OF HUMOR! Please pray: You can smoke on the balcony! You can not make too much noise! NA NA https://a0.muscache.com/im/pictures/623060/17f03910_original.jpg?aki_policy=large NA 76809
25786 https://www.airbnb.com/rooms/25786 2.018071e+13 2018-07-10 NICE ROOM AVAILABLE IN THE HEART OF GRACIA JUST GO THROUGH THE MANY REVIEWS I GOT THROUGH THE YEARS, NO BETTER FEEDBACK THAN THAT. WELCOME. Room available for rent.- PEDRO PEREZ. Shared with a Catalan male aged 38, Ayurvedic massage therapist and Yoga practitioner. Looking for people non-smoking, enthusiastic willing to share more than just the space in a centric beautiful flat in PLaça Vila de Gracia. i am very flexible you can use anything in the house feel free to ask anything! The neighborhood is really special you could live here and not needing anything from outside, such an experience, just 100 years ago was a village in the outskirts of barcelona, we do have our own cultural program throughout the year, very Catalan place. The area is full of bohemians, artisans and modern artists. Most of the area has been taken over by us over the past 10 years making it a mix between the past and the present-future. Metro stations around are: Diagonal L3-L5, Fontana L3, Joanic L4, 10-15 minutes walking to city center Ramblas. Separate Wardrobe room available Kitchen and bathroom shared Bills included available for renti JUST GO THROUGH THE MANY REVIEWS I GOT THROUGH THE YEARS, NO BETTER FEEDBACK THAN THAT. WELCOME. Room available for rent.- PEDRO PEREZ. Shared with a Catalan male aged 38, Ayurvedic massage therapist and Yoga practitioner. Looking for people non-smoking, enthusiastic willing to share more than just the space in a centric beautiful flat in PLaça Vila de Gracia. i am very flexible you can use anything in the house feel free to ask anything! The neighborhood is really special you could live here and not needing anything from outside, such an experience, just 100 years ago was a village in the outskirts of barcelona, we do have our own cultural program throughout the year, very Catalan place. The area is full of bohemians, artisans and modern artists. 
Most of the area has been taken over by us over the past 10 years making it a mix between the past and the present-future. Metro stations around are: Diagonal L3-L5, Fontana L3, Joanic L4, 10-15 minutes walking to city center Ramblas. S none Solo decir que a menudo ni salgo del barrio. Muy entretenido con sus gentes y lugares. No dudes en perdir una cita para un masaje relajante o terapeutico. Masaje ayurvedico y tailandés disponibles. *Airport car service available for 25 € one way Metro con las principales estaciones. A pie. El autobús. Bicicletas para alquilar. Si miras el mapa de Barcelona veras esta justo en el corazón All access with respect. Kitchen facilities need permission. Feel free to ask. Avoid Noise after midnight and early.morning Available for interaction. Nonproblem feel free to ask. Clean Bathroom after use and quick shower At all times. avoid noise early mornings and after mindnight. Clean and tidy room. Communication is essential. Ask permission to use kitchen facilities, cooking, washing machine and fridge. Weekends are a bit noisy. This is an essential part of this famous area. SUNDAY TO THURSDAY IS FINE. i follow these rules strictly myself. NA NA https://a0.muscache.com/im/pictures/3a27896a-95ce-4d69-9fc4-39116ed3dd9c.jpg?aki_policy=large NA 108310

Out of the first 20 columns we will keep the following:

  • id - listing identifier that can be used to create a join with other files
  • last_scraped - we will use it to calculate reviews_per_month

And remove all the below:

  • listing_url - interesting if we want to analyse the pictures as well but out of scope otherwise
  • scrape_id - same for all the records
  • name - textual description already extracted as continuous variables in other columns
  • summary - as above
  • space - as above
  • description - as above
  • experiences_offered - contains only the value none
  • neighborhood_overview - requires a lot of preprocessing to turn into a useful feature
  • notes - requires a lot of preprocessing to turn into a useful feature
  • transit - requires a lot of preprocessing to turn into a useful feature
  • access - requires a lot of preprocessing to turn into a useful feature
  • interaction - requires a lot of preprocessing to turn into a useful feature
  • house_rules - requires a lot of preprocessing to turn into a useful feature
  • thumbnail_url - contains no values
  • medium_url - contains no values
  • picture_url - interesting if we want to analyse the pictures as well but out of scope otherwise
  • xl_picture_url - contains no values
  • host_id - id that is not used anywhere else
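Several of the removals listed above rest on a column being empty or holding a single value. A possible way to detect such columns and drop them is sketched below on a toy data frame. The object names lis_toy, keep_anyway and lis_reduced are hypothetical, and the check only covers the "empty" and "constant" reasons, not the free-text columns.

```r
# Toy stand-in for a few of the first 20 columns (values are illustrative)
lis_toy <- data.frame(
  id = c(18666, 18674, 19157),
  scrape_id = c(2.018071e13, 2.018071e13, 2.018071e13),
  experiences_offered = c("none", "none", "none"),
  thumbnail_url = c(NA_character_, NA_character_, NA_character_),
  last_scraped = as.Date(c("2018-07-10", "2018-07-10", "2018-07-10")),
  stringsAsFactors = FALSE
)

# Number of distinct non-missing values in each column
n_distinct_values <- sapply(lis_toy, function(col) length(unique(col[!is.na(col)])))

# 0 distinct values means empty, 1 means constant
uninformative <- names(n_distinct_values)[n_distinct_values <= 1]

# last_scraped is constant as well, but it is kept for later calculations
keep_anyway <- "last_scraped"
drop_cols <- setdiff(uninformative, keep_anyway)

# Remove the flagged columns
lis_reduced <- lis_toy[, !(names(lis_toy) %in% drop_cols), drop = FALSE]
names(lis_reduced)
```

A purely automatic rule would also discard last_scraped, which is why the sketch keeps an explicit exception list.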
kable(lis_det[1:5,21:40]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, font_size = 9) %>%
  scroll_box(width = "910px", height = "400px")
host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified street neighbourhood neighbourhood_cleansed
https://www.airbnb.com/users/show/71615 Mireia And Maria 2010-01-19 Barcelona, Cataluña, Spain

We are Mireia (39) & Maria (41), two multilingual entrepreneurs loving Barcelona and having big experience in the touristic market. In our apartments you are going feel youself like at home. The location of our flats perfectly suites for travelling and sightseeing.

We are looking forward to sincerely host you in our apartments.
within an hour 99% N/A f https://a0.muscache.com/im/users/71615/profile_pic/1426612511/original.jpg?aki_policy=profile_small https://a0.muscache.com/im/users/71615/profile_pic/1426612511/original.jpg?aki_policy=profile_x_medium El Camp de l’Arpa del Clot 50 50 [‘email’, ‘phone’, ‘reviews’, ‘jumio’, ‘government_id’] t t Barcelona, CT, Spain El Camp de l’Arpa del Clot el Camp de l’Arpa del Clot
https://www.airbnb.com/users/show/71615 Mireia And Maria 2010-01-19 Barcelona, Cataluña, Spain

We are Mireia (39) & Maria (41), two multilingual entrepreneurs loving Barcelona and having big experience in the touristic market. In our apartments you are going feel youself like at home. The location of our flats perfectly suites for travelling and sightseeing.

We are looking forward to sincerely host you in our apartments.
within an hour 99% N/A f https://a0.muscache.com/im/users/71615/profile_pic/1426612511/original.jpg?aki_policy=profile_small https://a0.muscache.com/im/users/71615/profile_pic/1426612511/original.jpg?aki_policy=profile_x_medium El Camp de l’Arpa del Clot 50 50 [‘email’, ‘phone’, ‘reviews’, ‘jumio’, ‘government_id’] t t Barcelona, CT, Spain La Sagrada Família la Sagrada Família
https://www.airbnb.com/users/show/73099 Urania 2010-01-24 Barcelona, Cataluña, Spain

Hi there,

We love art, music and gastronomy, not to mention travelling and giving tips to our visitors so they can discover Barcelona.

We have decorated the apartment so it reflects the spirit of a city like Barcelona: a mix between the Mediterranean and Europe, cosmopolitan and local at the same time, where people from all cultures are welcomed to leave their imprint with creativity and where you can feel like at home.

We are very independent and leave a lot of space to our guests so they can keep their privacy.
within an hour 90% N/A f https://a0.muscache.com/im/users/73099/profile_pic/1281190747/original.jpg?aki_policy=profile_small https://a0.muscache.com/im/users/73099/profile_pic/1281190747/original.jpg?aki_policy=profile_x_medium la Sagrada Família 2 2 [‘email’, ‘phone’, ‘reviews’] t f Barcelona, CT, Spain La Sagrada Família la Sagrada Família
https://www.airbnb.com/users/show/76809 Mila 2010-02-02 Barcelona, Cataluña, Spain NA within a few hours 100% N/A f https://a0.muscache.com/im/users/76809/profile_pic/1265162222/original.jpg?aki_policy=profile_small https://a0.muscache.com/im/users/76809/profile_pic/1265162222/original.jpg?aki_policy=profile_x_medium Vilapicina i la Torre Llobeta 1 1 [‘phone’, ‘facebook’, ‘reviews’, ‘jumio’] t f Barcelona, Catalonia, Spain Vilapicina i la Torre Llobeta Vilapicina i la Torre Llobeta
https://www.airbnb.com/users/show/108310 Pedro 2010-04-14 Barcelona, Catalonia, Spain

Hola! as i say in my add i look for enthusiastic people willing to share things, experiences not just coming to Barcelona and sightseeing. Of course if your option is so, Go ahead. Many people has come up to my place so far through airbnb and the experience has been great.

Let me introduce myself! I think when i was born i had the force to travel and meet people, this is my goal in life! I studied Photography for 3 years and when the degree was finished i had the urge to travel and so i did until Today!! Half way i was in India and met and ayurvedic Massage Master i was totally impressed by so i became his disciple until today.

I now make my living out of massage therapy and when i can i travel back there to keep going my studies in new techniques and philosophy of the indian traditions on reality, such an amazing country.

I am flexible and not only centered to one thing, that is why when airbnb came i thought it would be amazing being able to share my place, get some extra money “of course” and at the same time being able to host people from all over the world. So far so many people has come over and made new friends. Some people did not want to talk much but appreciated the location and the experience, cause at the end of the day it is quite revolutionary that we can do this.

thanks to the guys of airbnb.!!

You can meet me at my apartment, feel free!

Pedro
within an hour 100% N/A f https://a0.muscache.com/im/pictures/user/43199285-d4a5-412d-8a06-5c91efb78042.jpg?aki_policy=profile_small https://a0.muscache.com/im/pictures/user/43199285-d4a5-412d-8a06-5c91efb78042.jpg?aki_policy=profile_x_medium Vila de Gràcia 1 1 [‘email’, ‘phone’, ‘reviews’, ‘jumio’, ‘offline_government_id’, ‘selfie’, ‘government_id’, ‘identity_manual’] t t Barcelona, Barcelona, Spain Vila de Gràcia la Vila de Gràcia

From the next 20 columns we will keep the following:

  • host_name - can be used to identify words associated with the host in reviews
  • host_since - can be used to calculate host experience based on duration since the first listing
  • host_location - we can use it to establish if host is local or not
  • host_about - since it is free text, we will only count the number of characters
  • host_is_superhost - categorical t or f - describes highly rated and reliable hosts - https://www.airbnb.co.uk/superhost
  • host_has_profile_pic - categorical t or f - profiles with pictures are seen as more credible
  • host_identity_verified - categorical t or f - another credibility metric

And remove all the below:

  • host_url - host profile is out of scope
  • host_response_time - this value could be useful but contains a high percentage of N/A values and is largely reflected in review_scores_communication
  • host_response_rate - same as above
  • host_acceptance_rate - either NA or blank
  • host_thumbnail_url - host picture is out of scope
  • host_picture_url - host picture is out of scope
  • host_neighbourhood - host_location will be used instead
  • host_listings_count - we will use more accurate calculated_host_listings_count
  • host_total_listings_count - as above
  • host_verifications - list of host verification methods - information already contained in host_identity_verified
  • street - neighbourhood_cleansed will be used instead
  • neighbourhood - neighbourhood_cleansed will be used instead
kable(lis_det[1:5,41:60]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, font_size = 9) %>%
  scroll_box(width = "910px", height = "400px")
neighbourhood_group_cleansed city state zipcode market smart_location country_code country latitude longitude is_location_exact property_type room_type accommodates bathrooms bedrooms beds bed_type amenities square_feet
Sant Martí Barcelona CT 08026 Barcelona Barcelona, Spain ES Spain 41.40889 2.185545 t Apartment Entire home/apt 6 1 2 4 Real Bed {TV,Internet,Wifi,“Air conditioning”,“Wheelchair accessible”,Kitchen,Elevator,“Free street parking”,Heating,“Family/kid friendly”,Washer,Dryer,Essentials,Shampoo,“Hair dryer”,“Hot water”,“Host greets you”,“Paid parking on premises”} 75
Eixample Barcelona CT 08025 Barcelona Barcelona, Spain ES Spain 41.40420 2.173058 t Apartment Entire home/apt 8 2 3 6 Real Bed {TV,Internet,Wifi,“Air conditioning”,“Wheelchair accessible”,Kitchen,Elevator,“Free street parking”,“Buzzer/wireless intercom”,Heating,“Family/kid friendly”,Washer,Essentials,Shampoo,Hangers,“Hair dryer”,Iron,“Laptop friendly workspace”,Crib,“Hot water”,“Host greets you”,“Paid parking on premises”} NA
Eixample Barcelona CT 08025 Barcelona Barcelona, Spain ES Spain 41.40793 2.174540 t Apartment Private room 2 1 1 1 Real Bed {Internet,Wifi,Kitchen,“Smoking allowed”,Elevator,“First aid kit”,Essentials,Shampoo,“translation missing: en.hosting_amenity_49”,“translation missing: en.hosting_amenity_50”} NA
Nou Barris Barcelona Catalonia 08016 Barcelona Barcelona, Spain ES Spain 41.42950 2.181558 t Apartment Private room 4 1 2 3 Real Bed {TV,Wifi,“Air conditioning”,Kitchen,“Paid parking off premises”,“Pets allowed”,Breakfast,Elevator,“Free street parking”,“Buzzer/wireless intercom”,Heating,“Family/kid friendly”,Washer,Dryer,Essentials,Shampoo,“24-hour check-in”,Hangers,“Hair dryer”,Iron,Crib,“Room-darkening shades”,“Hot water”,“Host greets you”} 732
Gràcia Barcelona Barcelona 08012 Barcelona Barcelona, Spain ES Spain 41.40145 2.156446 t Apartment Private room 2 1 1 1 Real Bed {TV,Wifi,“Air conditioning”,Kitchen,Elevator,Heating,“Family/kid friendly”,Washer,“Fire extinguisher”,Essentials,Shampoo,“Lock on bedroom door”,Hangers,“Hair dryer”,“Hot water”,“Luggage dropoff allowed”} NA

From the next 20 columns we will keep the following:

  • neighbourhood_cleansed - we will use it only for visualisation because of the number of neighbourhoods, and use the grouped neighbourhoods otherwise
  • neighbourhood_group_cleansed - categorical value which will be used to identify the most popular parts of Barcelona
  • latitude - we will use it later to visualise the data on the map
  • longitude - we will use it later to visualise the data on the map
  • property_type - categorical variable
  • room_type - categorical variable
  • accommodates - discrete value describing property
  • bathrooms - another discrete value describing property
  • bedrooms - another discrete value describing property
  • beds - another discrete value describing property
  • bed_type - categorical value describing property
  • amenities - due to number of unique features (over 100) we will only concentrate on the total number of amenities

And remove all the below:

  • city - we already know the city
  • state - we already know the region is Catalonia
  • zipcode - we will use neighbourhood
  • market - it is mainly Barcelona
  • smart_location - it is mainly Barcelona
  • country_code - we already know the country
  • country - as above
  • is_location_exact - unimportant as the location could be inaccurate by up to 150 meters http://insideairbnb.com/about.html#disclaimers
kable(lis_det[1:5,61:80]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, font_size = 9) %>%
  scroll_box(width = "910px", height = "200px")
price weekly_price monthly_price security_deposit cleaning_fee guests_included extra_people minimum_nights maximum_nights calendar_updated has_availability availability_30 availability_60 availability_90 availability_365 calendar_last_scraped number_of_reviews first_review last_review review_scores_rating
$130.00 NA NA $150.00 $42.00 2 $25.00 3 730 today t 0 0 0 61 2018-07-10 1 2015-10-10 2015-10-10 80
$140.00 NA NA $150.00 $50.00 2 $30.00 1 1125 today t 3 17 47 132 2018-07-10 5 2013-05-27 2018-06-18 85
$30.00 $185.00 $580.00 NA $20.00 2 $15.00 2 180 4 weeks ago t 1 2 12 96 2018-07-10 165 2010-08-18 2018-06-23 89
$25.00 NA $450.00 $100.00 $20.00 1 $20.00 2 365 3 weeks ago t 11 40 70 345 2018-07-10 72 2010-06-16 2018-06-17 84
$42.00 NA NA NA NA 1 $31.00 1 730 3 weeks ago t 16 31 52 94 2018-07-10 191 2010-08-11 2018-07-06 95

From the next 20 columns we will keep the following:

  • price - price per night for number of included guests
  • security_deposit - another continuous value associated with the cost
  • cleaning_fee - additional cost on top of the rent
  • guests_included - discrete value which we will use to evaluate the cost per person
  • extra_people - cost of an additional person per night
  • minimum_nights - another discrete value that is cost related. Listings with a high number of minimum nights are likely sublets
  • first_review - we will use it to calculate reviews_per_month
  • last_review - we will use this field to filter out no longer active listings
  • number_of_reviews - total number of reviews in entire listing history

And remove all the below:

  • square_feet - could be used to evaluate the property size but most of the values are missing
  • weekly_price - mostly blank so we will use price instead
  • monthly_price - mostly blank so we will use price instead
  • maximum_nights - most of the values are above 30 days, suggesting it is used as an open-ended upper limit
  • calendar_updated - we are not interested in future data that is subject to daily updates
  • has_availability - as above
  • availability_30 - as above
  • availability_60 - as above
  • availability_90 - as above
  • availability_365 - as above
  • calendar_last_scraped - as above
kable(lis_det[1:5,81:96]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, font_size = 9) %>%
  scroll_box(width = "910px", height = "200px")
review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value requires_license license jurisdiction_names instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count reviews_per_month
10 10 2 10 10 8 t HUTB-003004 NA f f flexible f f 29 0.03
9 10 10 10 9 9 t HUTB-002062 NA t f strict_14_with_grace_period f f 29 0.08
9 9 10 10 10 9 t NA NA t f strict_14_with_grace_period f f 2 1.72
8 9 9 9 9 8 t NA NA f f strict_14_with_grace_period t t 1 0.73
9 9 10 10 10 9 t NA NA t f moderate t t 1 1.98

From the final set of columns we will keep the following:

  • review_scores_accuracy - discrete value - numbers between 2 and 10
  • review_scores_cleanliness - discrete value - numbers between 2 and 10
  • review_scores_checkin - discrete value - numbers between 2 and 10
  • review_scores_communication - discrete value - numbers between 2 and 10
  • review_scores_location - discrete value - numbers between 2 and 10
  • review_scores_value - discrete value - numbers between 2 and 10
  • instant_bookable - categorical value - t or f
  • cancellation_policy - ordinal value with 5 categories that can be ordered from lowest to highest level of flexibility
  • require_guest_profile_picture - categorical value - t or f
  • require_guest_phone_verification - categorical value - t or f
  • calculated_host_listings_count - continuous value giving the actual number of host listings - another metric to measure host experience or to distinguish a business from an individual

And remove all the below:

  • review_scores_rating - this value is calculated as weighted sum of other scores
  • requires_license - all values are t
  • license - textual value that is mostly null
  • jurisdiction_names - contains only nulls
  • is_business_travel_ready - contains one value of f
  • reviews_per_month - we will re-calculate this field using our formula

We will now extract all the columns of interest and store them in a separate data frame - lis_det_sel. We will also rename the id column to make it consistent with the other files (“listing_id”).

lis_det_sel <- lis_det[c('id', 'last_scraped', 'host_name', 'host_since', 'host_location', 'host_about', 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'price', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'first_review', 'last_review', 'number_of_reviews', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'cancellation_policy', 'require_guest_profile_picture', 'require_guest_phone_verification', 'calculated_host_listings_count')]
names(lis_det_sel)[1] <- "listing_id"

We will now perform data cleansing and pre-processing to prepare the data for further analysis.

First we will remove all the listings with no reviews (number_of_reviews equal to 0), as most of their review fields will be NA.

lis_det_sel <- subset(lis_det_sel, number_of_reviews > 0)
dim(lis_det_sel)
## [1] 14764    41

This reduces the number of records to 14764.

Secondly we will impute 0 where the value is not explicitly specified within the listing for the following columns:

  • security_deposit
  • cleaning_fee

And convert host_about from NA to empty string to allow character count.

lis_det_sel$security_deposit[is.na(lis_det_sel$security_deposit)] <- 0
lis_det_sel$cleaning_fee[is.na(lis_det_sel$cleaning_fee)] <- 0
lis_det_sel$host_about[is.na(lis_det_sel$host_about)] <- ''

And lastly we will drop all the records that have any remaining NAs, as we will not be able to perform further calculations on them. This includes records with missing review scores, host_since or first_review date.

lis_det_sel <- subset(lis_det_sel, rowSums(is.na(lis_det_sel))==0)
dim(lis_det_sel)
## [1] 14416    41

This leaves us with 14416 records out of 17788 original records.
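
The cleansing steps above can be tried out on a small scale. The sketch below uses a made-up toy data frame (not the real listing data) and base R only; the column names mirror the ones used above, while the values are assumptions chosen for illustration.

```r
# Toy data standing in for lis_det_sel (values are made up for illustration).
toy <- data.frame(
  listing_id       = 1:4,
  security_deposit = c("$150.00", NA, "$100.00", NA),
  cleaning_fee     = c("$42.00", "$20.00", NA, NA),
  host_about       = c("We love art", NA, "Hola!", NA),
  first_review     = c("2015-10-10", "2013-05-27", NA, "2010-08-11"),
  stringsAsFactors = FALSE
)

# A missing fee is treated as "no fee", so NA is replaced with 0.
toy$security_deposit[is.na(toy$security_deposit)] <- 0
toy$cleaning_fee[is.na(toy$cleaning_fee)] <- 0

# A missing description becomes an empty string so nchar() returns 0.
toy$host_about[is.na(toy$host_about)] <- ""

# Drop rows that still contain any NA (here: the row without first_review).
toy_complete <- toy[complete.cases(toy), ]

print(nrow(toy_complete))
print(nchar(toy_complete$host_about))
```

Here complete.cases() plays the same role as the rowSums(is.na(...)) == 0 filter used in the analysis.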

We will now apply cleansing / conversion to below fields:

  • host_is_superhost - conversion from t/f to 1/0 values
  • host_has_profile_pic - conversion from t/f to 1/0 values
  • host_identity_verified - conversion from t/f to 1/0 values
  • price - conversion from $ to numeric values
  • security_deposit - conversion from $ to numeric values
  • cleaning_fee - conversion from $ to numeric values
  • extra_people - conversion from $ to numeric values
  • instant_bookable - conversion from t/f to 1/0 values
  • require_guest_profile_picture - conversion from t/f to 1/0 values
  • require_guest_phone_verification - conversion from t/f to 1/0 values
lis_det_sel$host_is_superhost <- as.numeric(ifelse(lis_det_sel$host_is_superhost == 't', 1, 0))
lis_det_sel$host_has_profile_pic <- as.numeric(ifelse(lis_det_sel$host_has_profile_pic == 't', 1, 0))
lis_det_sel$host_identity_verified <- as.numeric(ifelse(lis_det_sel$host_identity_verified == 't', 1, 0))
lis_det_sel$price <- lis_det_sel$price %>% str_extract_all("\\(?[0-9,.]+\\)?") %>% gsub(",", "", .) %>% as.numeric()
lis_det_sel$security_deposit <- lis_det_sel$security_deposit  %>% str_extract_all("\\(?[0-9,.]+\\)?") %>% gsub(",", "", .) %>% as.numeric()
lis_det_sel$cleaning_fee <- lis_det_sel$cleaning_fee   %>% str_extract_all("\\(?[0-9,.]+\\)?") %>% gsub(",", "", .) %>% as.numeric()
lis_det_sel$extra_people <- lis_det_sel$extra_people   %>% str_extract_all("\\(?[0-9,.]+\\)?") %>% gsub(",", "", .) %>% as.numeric()
lis_det_sel$instant_bookable <- as.numeric(ifelse(lis_det_sel$instant_bookable == 't', 1, 0))
lis_det_sel$require_guest_profile_picture <- as.numeric(ifelse(lis_det_sel$require_guest_profile_picture == 't', 1, 0))
lis_det_sel$require_guest_phone_verification <- as.numeric(ifelse(lis_det_sel$require_guest_phone_verification == 't', 1, 0))
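
As a compact illustration of the two conversions above, the following base R sketch wraps them in helper functions. The names parse_price and parse_flag are hypothetical and are not part of the original analysis; the inputs are made-up examples in the same format as the listing data.

```r
# Convert Airbnb-style price strings such as "$1,250.00" to numbers
# by stripping the dollar sign and thousands separators.
parse_price <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Convert "t"/"f" flags to 1/0; any other value becomes NA instead of 0,
# which makes unexpected codes easier to spot.
parse_flag <- function(x) {
  ifelse(x == "t", 1, ifelse(x == "f", 0, NA))
}

prices <- parse_price(c("$130.00", "$1,250.00", "0"))
flags  <- parse_flag(c("t", "f", "t"))

print(prices)
print(flags)
```

Unlike the ifelse(x == 't', 1, 0) form used above, this variant does not silently map unknown values to 0, which may or may not be desirable depending on the data.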

And we can add following calculated fields:

  • listing_duration = (last_scraped - first_review) - value to be evaluated in days
  • hosting_duration = (last_scraped - host_since) - value to be evaluated in days
  • host_local = if(‘Barcelona’ in host_location, 1, 0) - we will treat hosts from Barcelona as local
  • host_about_len = ifelse(is.na(host_about), 0, nchar(host_about)) - in this case NA will be counted as 0
  • total_amenities = count of listed amenities - since the field is comma delimited, we will count the number of separators + 1 where the number of characters is greater than 2 (the enclosing brackets)
  • price_per_person - (price/accommodates)
  • is_top_100 - listing ranked in top 100 reviewed listings
lis_det_sel <- lis_det_sel %>% 
  mutate(
    listing_duration = as.numeric(difftime(last_scraped, first_review, units = 'days')),
    hosting_duration = as.numeric(difftime(last_scraped, host_since, units = 'days')),
    host_local = as.numeric(str_detect(host_location, 'barcelona|Barcelona')),
    host_about_len = ifelse(is.na(host_about), 0, nchar(host_about)),
    total_amenities = ifelse(nchar(amenities) > 2, str_count(amenities, ',') + 1, 0),
    price_per_person = price / accommodates
  )

lis_det_sel$is_top_100 <- ifelse(rank(-lis_det_sel$number_of_reviews) <= 100, 1, 0)
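
To make the derived fields more concrete, here is a minimal base R sketch on a three-row toy data frame. The values are made up, the cut-off of 2 stands in for the 100 used in the analysis, and the case-insensitive match is a small simplification of the 'barcelona|Barcelona' pattern.

```r
toy <- data.frame(
  last_scraped      = c("2018-07-10", "2018-07-10", "2018-07-10"),
  first_review      = c("2015-10-10", "2013-05-27", "2010-08-18"),
  host_location     = c("Barcelona, Catalonia, Spain", "Madrid, Spain", "barcelona"),
  amenities         = c("{TV,Internet,Wifi}", "{}", "{Wifi}"),
  number_of_reviews = c(1, 165, 165),
  stringsAsFactors  = FALSE
)

# Days between the scrape date and the first review.
toy$listing_duration <- as.numeric(
  as.Date(toy$last_scraped) - as.Date(toy$first_review)
)

# 1 when the host location mentions Barcelona (case-insensitive), else 0.
toy$host_local <- as.numeric(grepl("barcelona", toy$host_location, ignore.case = TRUE))

# "{}" has 2 characters and no amenities; otherwise count commas + 1.
n_commas <- lengths(regmatches(toy$amenities, gregexpr(",", toy$amenities, fixed = TRUE)))
toy$total_amenities <- ifelse(nchar(toy$amenities) > 2, n_commas + 1, 0)

# Flag the top-N most reviewed listings (N = 2 here, 100 in the analysis).
# rank() averages tied positions by default, so tied listings share one rank.
top_n <- 2
toy$is_top <- ifelse(rank(-toy$number_of_reviews) <= top_n, 1, 0)

print(toy[, c("listing_duration", "host_local", "total_amenities", "is_top")])
```

Because of the tie handling in rank(), the number of flagged listings is not guaranteed to be exactly N when several listings share the same review count around the cut-off.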

And convert categorical values into dummy variables:

  • neighbourhood_group_cleansed - we will replace non-alphanumeric characters with underscores to prevent code errors
  • property_type - we will replace non-alphanumeric characters with underscores to prevent code errors
  • room_type - we will replace non-alphanumeric characters with underscores to prevent code errors
  • bed_type - we will replace non-alphanumeric characters with underscores to prevent code errors
  • cancellation_policy
lis_det_sel$neighbourhood_group_cleansed <- str_replace_all(lis_det_sel$neighbourhood_group_cleansed, "[^[:alnum:]]", "_")
lis_det_sel$property_type <- str_replace_all(lis_det_sel$property_type, "[^[:alnum:]]", "_")
lis_det_sel$room_type <- str_replace_all(lis_det_sel$room_type, "[^[:alnum:]]", "_")
lis_det_sel$bed_type <- str_replace_all(lis_det_sel$bed_type, "[^[:alnum:]]", "_")

nb_group_dummy <- dummy(lis_det_sel$neighbourhood_group_cleansed, sep = "_")
lis_det_sel <- cbind(lis_det_sel, nb_group_dummy)

property_type_dummy <- dummy(lis_det_sel$property_type, sep = "_")
lis_det_sel <- cbind(lis_det_sel, property_type_dummy)

room_type_dummy <- dummy(lis_det_sel$room_type, sep = "_")
lis_det_sel <- cbind(lis_det_sel, room_type_dummy)

bed_type_dummy <- dummy(lis_det_sel$bed_type, sep = "_")
lis_det_sel <- cbind(lis_det_sel, bed_type_dummy)

cancellation_policy_dummy <- dummy(lis_det_sel$cancellation_policy, sep = "_")
lis_det_sel <- cbind(lis_det_sel, cancellation_policy_dummy)
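
The dummy() calls above rely on an add-on package. As an alternative sketch, assuming only base R, the same kind of one-hot encoding can be produced with model.matrix(). The toy room_type values are made up, and the resulting column names follow a convention chosen here, not necessarily the one produced by dummy().

```r
toy <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Private room", "Shared room"),
  stringsAsFactors = FALSE
)

# Replace non-alphanumeric characters so the levels are safe column names.
toy$room_type <- gsub("[^[:alnum:]]", "_", toy$room_type)

# "- 1" drops the intercept so every level gets its own 0/1 column.
room_type_dummy <- model.matrix(~ room_type - 1, data = toy)
colnames(room_type_dummy) <- sub("^room_type", "room_type_", colnames(room_type_dummy))

toy <- cbind(toy, room_type_dummy)
print(colnames(toy))
```

Each row has exactly one dummy set to 1, so the dummy columns of one variable always sum to 1 across a row.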

We will now create two data frames which are subsets of lis_det_sel:

  • lis_det_clean - containing pre-processed data with categorical values in their original form
  • lis_det_clean_dummy - containing pre-processed data with categorical values converted into dummies

In both cases we will drop the number_of_reviews column as this information is already contained within is_top_100.

lis_det_clean <- lis_det_sel[, c(7:9, 11:11, 14:15, 17:20, 23:27, 31:48)]
lis_det_clean_dummy <- lis_det_sel[, c(7:9, 17:19, 23:27, 31:37, 39:96)]

We will now visualise the distribution of all the features broken down by popularity.

DISCRETE

We will use bar charts that show the relative density grouped by our target value.

discrete <- c("host_is_superhost", "host_has_profile_pic", "host_identity_verified", "instant_bookable", "require_guest_profile_picture", "require_guest_phone_verification", "host_local")


  for (colname in discrete) { 
    
    temp <- subset(lis_det_clean, is_top_100 == 1) 
    temp <- temp %>% 
      group_by(is_top_100, temp[,colname]) %>% 
      summarise(density = n()/nrow(.))
    colnames(temp)[2] <- colname
      
    temp1 <- subset(lis_det_clean, is_top_100 == 0) 
    temp1 <- temp1 %>% 
      group_by(is_top_100, temp1[,colname]) %>% 
      summarise(density = n()/nrow(.))
    colnames(temp1)[2] <- colname
      
    temp2 <- rbind(temp, temp1)
    
    plot <- ggplot(data=temp2, aes(x=as.factor(temp2[[colname]]), y=density, fill=as.factor(is_top_100))) + 
      geom_bar(position = 'dodge', stat='identity') + labs(fill = "is_top_100", x = colname, 
            title = paste(colname, " relative density grouped by is_top_100")) +
            theme(axis.text.x = element_text(angle = 90, hjust = 1))
    
    print(plot)
  }

Based on the above barcharts we can conclude that:

  • host_is_superhost - Top 100 has a greater relative share of superhosts, which is not surprising given that this status is heavily linked to review scores
  • host_has_profile_pic - nearly all records in both groups have a profile picture making this feature redundant
  • host_identity_verified - another feature much better represented as TRUE in Top 100 group
  • instant_bookable - slightly better representation in Top 100 group
  • require_guest_profile_picture - over 80% of both groups do not require guest picture
  • require_guest_phone_verification - same as above
  • host_local - both groups have similar representation of more than 80% listings having local host
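
The relative densities shown in these bar charts are simply per-group proportions. A minimal base R sketch, assuming a made-up toy sample with the same column names, computes them with table() and prop.table():

```r
toy <- data.frame(
  is_top_100        = c(1, 1, 0, 0, 0, 0),
  host_is_superhost = c(1, 0, 0, 0, 1, 0)
)

# Cross-tabulate target vs. feature, then normalise each row (target group)
# so that the proportions within a group sum to 1.
counts  <- table(is_top_100 = toy$is_top_100, host_is_superhost = toy$host_is_superhost)
density <- prop.table(counts, margin = 1)

print(density)
```

Normalising within each group is what makes the two groups comparable despite their very different sizes.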

CATEGORICAL

We will again use bar charts that show the relative density grouped by our target value.

categorical <- c("neighbourhood_group_cleansed", "property_type", "room_type", "bed_type", "cancellation_policy")

  
  for (colname in categorical) {
  
  
    temp <- subset(lis_det_clean, is_top_100 == 1) 
    temp <- temp %>% 
      group_by(is_top_100, temp[,colname]) %>% 
      summarise(density = n()/nrow(.))
    colnames(temp)[2] <- colname
      
    temp1 <- subset(lis_det_clean, is_top_100 == 0) 
    temp1 <- temp1 %>% 
      group_by(is_top_100, temp1[,colname]) %>% 
      summarise(density = n()/nrow(.))
    colnames(temp1)[2] <- colname
      
    temp2 <- rbind(temp, temp1)
    
    plot <- ggplot(data=temp2, aes(x=temp2[[colname]], y=density, fill=as.factor(is_top_100))) + 
      geom_bar(position = 'dodge', stat='identity') + labs(fill = "is_top_100", x = colname, 
            title = paste(colname, " relative density grouped by is_top_100")) +
            theme(axis.text.x = element_text(angle = 90, hjust = 1))
    
    print(plot)
    
  }

Based on the above barcharts we can conclude that:

  • neighbourhood_group_cleansed - Top 100 is represented in 7 out of 10 neighbourhood groups, with much greater representation in Ciutat Vella and slightly greater representation in Eixample and Sant Martí
  • property_type - Top 100 is only represented in Apartment, B&B, Guest Suite, House and Loft
  • room_type - Top 100 is only represented in Entire home/apt or Private room, with Entire home/apt being more popular
  • bed_type - nearly all records in both groups share the same bed type, making this feature redundant
  • cancellation_policy - Top 100 are less likely to have a flexible policy and more likely to have a moderate or strict 14 days policy

CONTINUOUS

For continuous values we will use box plots to better understand the data distribution between groups.

continous <- c("bathrooms", "bedrooms", "beds", "price_per_person", "security_deposit", "cleaning_fee", "guests_included", "extra_people", "minimum_nights", "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication", "review_scores_location", "review_scores_value", "calculated_host_listings_count", "listing_duration", "hosting_duration", "host_about_len", "total_amenities")

  
  for (colname in continous) {
  
    plot <- ggplot(data=lis_det_clean, aes(x=as.factor(is_top_100), y=lis_det_clean[[colname]])) + 
      geom_boxplot(fill="lightblue") + labs(x = "is_top_100", y = colname,
            title = paste(colname, " grouped by is_top_100")) +
            theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
            scale_y_continuous(limits = quantile(lis_det_clean[[colname]], c(0.1, 0.9)))
    
    print(plot)
    
  }

Based on the above boxplots we can conclude that:

  • bathrooms - nearly all Top 100 have 1 bathroom with exception of few outliers
  • bedrooms - nearly identical distribution in both groups with median being at 1
  • beds - same as above
  • price_per_person - overall lower price per person in Top 100 group
  • security_deposit - much higher median value for Top 100 group being at about $120 while other group has median value at 0
  • cleaning_fee - similar distribution for both groups
  • guests_included - nearly identical distribution in both groups with median being at 1
  • extra_people - similar distribution in both groups but higher median in Top 100 - $10 vs $0
  • minimum_nights - the Top 100 group requires one night fewer as the minimum booking length
  • review_scores - same distribution for accuracy, cleanliness, location and value, in the range 8-10, while checkin and communication have only 10s in the Top 100 group
  • calculated_host_listings_count - similar distribution with the exception of outliers
  • listing_duration - Top 100 group has a much higher listing duration with the median at around 1400 days, while the other group has a median duration of around 550 days
  • hosting_duration - Top 100 group has a much longer host presence with the median at around 1800 days, while the other group has a median duration of around 1300 days
  • host_about_len - Top 100 group has a much higher number of characters in the profile description, with no empty profiles
  • total_amenities - Top 100 group has slightly higher number of amenities listed
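
The medians read off the box plots can also be computed directly. The following base R sketch uses aggregate() on a made-up toy sample; the numbers are illustrative and are not taken from the listing data.

```r
toy <- data.frame(
  is_top_100       = c(1, 1, 1, 0, 0, 0, 0),
  security_deposit = c(100, 150, 120, 0, 0, 200, 0),
  minimum_nights   = c(1, 2, 1, 2, 3, 2, 1)
)

# Median of every numeric feature within each target group.
group_medians <- aggregate(. ~ is_top_100, data = toy, FUN = median)

print(group_medians)
```

A table like this is a useful cross-check when the box plot axis has been trimmed to a quantile range.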

We will now identify and remove near-zero variance predictors using the following code. This data will then be used in our predictive model.

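# names = TRUE returns column names rather than positions, so the selection
# stays aligned even though is_top_100 is excluded from the check
nzv <- nearZeroVar(select(lis_det_clean, -is_top_100), names = TRUE)
lis_det_clean_nzv <- lis_det_clean[, !(names(lis_det_clean) %in% nzv)]

nzv_dummy <- nearZeroVar(select(lis_det_clean_dummy, -is_top_100), names = TRUE)
lis_det_clean_dummy_nzv <- lis_det_clean_dummy[, !(names(lis_det_clean_dummy) %in% nzv_dummy)]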

By applying nearZeroVar function we have reduced our datasets to 28 and 36 columns including our target.
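
To give an intuition for what a near-zero variance filter looks for, here is a hand-rolled base R approximation. It is a sketch, not caret's implementation: the function name is_near_zero_var is hypothetical, and the thresholds (a frequency ratio of 95/5 and at most 10% unique values) are assumed to match caret's documented defaults.

```r
# Hand-rolled approximation of a near-zero variance check (not caret's code).
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)            # constant column
  freq_ratio     <- counts[[1]] / counts[[2]]      # most common vs. runner-up
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

toy <- data.frame(
  has_profile_pic = c(rep(1, 99), 0),    # 99:1 split, flagged
  superhost       = rep(c(0, 1), 50),    # 50:50 split, kept
  is_top_100      = c(rep(0, 95), rep(1, 5))
)

# Check predictors only and drop the flagged ones by name, keeping the target.
predictors <- setdiff(names(toy), "is_top_100")
flagged    <- predictors[sapply(toy[predictors], is_near_zero_var)]
toy_nzv    <- toy[, setdiff(names(toy), flagged), drop = FALSE]

print(flagged)
print(names(toy_nzv))
```

Selecting by name rather than by position avoids misaligned columns when the target is left out of the check.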

NEIGHBOURHOODS GEOSPATIAL DATA

The neighbourhoods GeoJSON file contains the full list of Barcelona neighbourhoods with geospatial data that we will use to visualise information on the map. We will use the Leaflet R package and display listings from both groups using the lat/long information coming from the listing details dataset. This will give us an idea of the geographical distribution, with red points being the Top 100 most popular listings.

other <- lis_det_sel %>% 
  filter(is_top_100 == 0)
  
top_100 <- lis_det_sel %>% 
  filter(is_top_100 == 1) 

leaflet() %>% setView(lng = 2.154007, lat = 41.390205, zoom = 12) %>%
  addTiles() %>%
  addPolygons(data = nb_geo, color = "#444444", weight = 2, opacity = 1) %>%
  addCircleMarkers(  lng = other$longitude, 
                     lat = other$latitude,
                     radius = 2, 
                     stroke = FALSE,
                     color = "blue",
                     fillOpacity = 0.5, 
                     group = "Other"
                     ) %>%
  addCircleMarkers(  lng = top_100$longitude, 
                     lat = top_100$latitude,
                     radius = 3, 
                     stroke = FALSE,
                     color = "red",
                     fillOpacity = 0.9, 
                     group = "Top 100"
                     )

Baseline Modeling

NAIVE BAYES

We will now try to build our initial model based on the data we have created in prior steps. Before we start we will try to remove highly correlated features. This is because the highly correlated features are voted for twice in the model, over inflating their importance.

descrCor <- cor(lis_det_clean_dummy_nzv)

highlyCorrelated <- findCorrelation(descrCor, cutoff=0.7)

highlyCorCol <- colnames(lis_det_clean_dummy_nzv)[highlyCorrelated]

highlyCorCol
## [1] "bedrooms"
lis_det_clean_dummy_nzv_uncor <- lis_det_clean_dummy_nzv[, -which(colnames(lis_det_clean_dummy_nzv) %in% highlyCorCol)]

Unsurprisingly, bedrooms is highly correlated with the number of beds available. This variable has been removed from the new dataset.
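
To see which pairs of features drive a correlation filter like this, the correlation matrix can be inspected directly. The sketch below uses base R only and a small made-up data set; it lists pairs above the same 0.7 cut-off but does not reproduce the rule findCorrelation uses to decide which member of a pair to drop.

```r
# Made-up values: bedrooms follows beds closely, price does not.
toy <- data.frame(
  beds     = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6),
  bedrooms = c(1, 1, 2, 3, 4, 5, 1, 2, 3, 3, 4, 6),
  price    = c(50, 80, 30, 90, 40, 70, 60, 20, 100, 45, 85, 35)
)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and list those above the cutoff.
cutoff <- 0.7
high   <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
pairs  <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high],
  stringsAsFactors = FALSE
)

print(pairs)
```

Listing the offending pairs makes it easier to justify which of the two variables is kept.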

We will now split our data into training and test data with a 60/40 split, then use the Naive Bayes method to make predictions.

set.seed(132)

nb_sub <- sample(nrow(lis_det_clean_dummy_nzv_uncor), floor(nrow(lis_det_clean_dummy_nzv_uncor) * 0.6))
nb_train <- lis_det_clean_dummy_nzv_uncor[nb_sub, ]
nb_test <- lis_det_clean_dummy_nzv_uncor[-nb_sub, ]

nb <- naiveBayes(as.factor(is_top_100) ~ ., data = nb_train)

nb_prediction <- predict(nb, nb_test)

nb_conf <- table(nb_test$is_top_100, nb_prediction)

print(nb_conf)
##    nb_prediction
##        0    1
##   0 2943 2789
##   1    1   34
nb_accuracy <- sum(diag(nb_conf))/sum(nb_conf)

print(nb_accuracy)
## [1] 0.5162129
# Rows of nb_conf are actual classes, columns are predicted classes
# Recall: share of actual Top 100 listings that were predicted as Top 100
nb_recall <- nb_conf[2,2] / (nb_conf[2,2] + nb_conf[2,1])

print(nb_recall)
## [1] 0.9714286
# Precision: share of predicted Top 100 listings that really are Top 100
nb_precision <- nb_conf[2,2] / (nb_conf[2,2] + nb_conf[1,2])

print(nb_precision)
## [1] 0.01204392
nb_roc <- performance(prediction(as.numeric(nb_prediction), as.numeric(nb_test$is_top_100)), "tpr", "fpr")
plot(nb_roc, colorize=TRUE)
abline(0, 1, lty = 2)

As we can see, our model misclassifies about 49% of the non Top 100 listings in the test sample (2789 of 5732) but finds nearly all Top 100 listings, with a recall of 97% (34 of 35). Precision, however, is only about 1.2%, because almost every listing flagged as Top 100 is a false positive. This is also reflected in the ROC plot.
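
For reference, the sketch below recomputes the metrics from the confusion matrix printed above using the standard definitions of precision and recall. The matrix values are copied from the output; the variable names tp, fn, fp and tn are introduced here for illustration, and the F1 score is an extra summary that is not part of the original analysis.

```r
# Rows are actual classes, columns are predicted classes,
# matching table(actual, predicted).
nb_conf <- matrix(
  c(2943, 2789,
       1,   34),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

tp <- nb_conf["1", "1"]  # Top 100 predicted as Top 100
fn <- nb_conf["1", "0"]  # Top 100 missed
fp <- nb_conf["0", "1"]  # other listings wrongly flagged
tn <- nb_conf["0", "0"]  # other listings correctly rejected

accuracy  <- (tp + tn) / sum(nb_conf)
recall    <- tp / (tp + fn)   # share of actual Top 100 that were found
precision <- tp / (tp + fp)   # share of predicted Top 100 that were right
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, recall = recall, precision = precision, f1 = f1), 4))
```

Naming the four cells explicitly makes it harder to mix up the row and column that each metric divides by.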

Secondary Modeling

DECISION TREE

We will now try to build our secondary model based on the data we have created in prior steps. We will first split our data into training and test data with a 60/40 split, then use the Decision Tree method to make predictions.

set.seed(132)
lis_det_clean_nzv1 <- lis_det_clean_nzv
dt_sub <- sample(nrow(lis_det_clean_nzv1), floor(nrow(lis_det_clean_nzv1) * 0.6))
dt_train <- lis_det_clean_nzv1[dt_sub, ]
dt_test <- lis_det_clean_nzv1[-dt_sub, ]

dt_model <- rpart(is_top_100 ~ ., data = dt_train, method = "class", control = rpart.control(cp = 0.01, minbucket = 5))

fancyRpartPlot(dt_model, caption = "")

printcp(dt_model)
## 
## Classification tree:
## rpart(formula = is_top_100 ~ ., data = dt_train, method = "class", 
##     control = rpart.control(cp = 0.01, minbucket = 5))
## 
## Variables actually used in tree construction:
## [1] cancellation_policy          cleaning_fee                
## [3] extra_people                 host_is_superhost           
## [5] hosting_duration             listing_duration            
## [7] minimum_nights               neighbourhood_group_cleansed
## [9] review_scores_location      
## 
## Root node error: 64/8649 = 0.0073997
## 
## n= 8649 
## 
##         CP nsplit rel error xerror    xstd
## 1 0.018229      0   1.00000 1.0000 0.12454
## 2 0.010417     10   0.79688 1.1875 0.13562
## 3 0.010000     13   0.76562 1.1719 0.13473
plotcp(dt_model)

dt_prediction <- predict(dt_model, dt_test, type = "class")
# Score the held-out test set so the ROC curve reflects out-of-sample performance
dt_pred <- prediction(predict(dt_model, dt_test, type = "prob")[, 2], dt_test$is_top_100)

dt_conf <- table(dt_test$is_top_100, dt_prediction)

print(dt_conf)
##    dt_prediction
##        0    1
##   0 5722   10
##   1   31    4
dt_accuracy <- sum(diag(dt_conf))/sum(dt_conf)

print(dt_accuracy)
## [1] 0.9928906
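# Rows of dt_conf are actual classes, columns are predicted classes
# Recall: share of actual Top 100 listings that were predicted as Top 100
dt_recall <- dt_conf[2,2] / (dt_conf[2,2] + dt_conf[2,1])

print(dt_recall)
## [1] 0.1142857
# Precision: share of predicted Top 100 listings that really are Top 100
dt_precision <- dt_conf[2,2] / (dt_conf[2,2] + dt_conf[1,2])

print(dt_precision)
## [1] 0.2857143
dt_roc <- performance(dt_pred, measure="tpr", x.measure="fpr")
plot(dt_roc, colorize=TRUE)
abline(0, 1, lty = 2)

In the case of the decision tree, the model correctly predicts almost all true negatives (5722 of 5732 non Top 100 listings) but fails to detect most of the true positives (only 4 of 35 Top 100 listings). Overall accuracy is over 99%, but with so few Top 100 listings in the test set this figure is driven almost entirely by the majority class. The ROC plot shows the trade-off between the true positive and false positive rates.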
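
Because the classes are so imbalanced, it helps to compare the tree's accuracy with a trivial rule that always predicts the majority class. The base R sketch below does this from the confusion matrix printed above; balanced accuracy is an additional metric introduced here for illustration and is not part of the original analysis.

```r
# Rows are actual classes, columns are predicted classes.
dt_conf <- matrix(
  c(5722, 10,
      31,  4),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

# Accuracy of the tree vs. a rule that always predicts the majority class (0).
model_accuracy    <- sum(diag(dt_conf)) / sum(dt_conf)
baseline_accuracy <- max(rowSums(dt_conf)) / sum(dt_conf)

# Balanced accuracy averages the per-class recall,
# so the rare class counts as much as the common one.
per_class_recall  <- diag(dt_conf) / rowSums(dt_conf)
balanced_accuracy <- mean(per_class_recall)

print(round(c(model = model_accuracy,
              baseline = baseline_accuracy,
              balanced = balanced_accuracy), 4))
```

When the baseline is this close to the model's accuracy, metrics that focus on the rare class say more about how useful the model is.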

Communicating Results

In general the second model works better, even though it does not predict most of the true positive cases well. This is due to the fact that Top 100 cases are underrepresented within the data set. However, it correctly predicts the vast majority of true negatives, bringing overall accuracy above 99%.

The first model detects true positives better but is poor at predicting true negatives.

Both models will require further work and adjustment to make predictions accurate.

Conclusion

Data exploration shed some light on the geographical distribution of the data and on the correlation between the number of reviews and the following features:

  • Hosting and Listing Duration - which indicate host experience advantage
  • Overall high scores, mainly in the Communication and Check-in categories where Top 100 had 10/10, but a surprisingly lower score for cleanliness
  • Being a superhost, with more than double the relative representation among Top 100, which is strongly linked to the scores
  • Top 100 hosts prefer stricter cancellation policies
  • Top 100 listings usually had 1 bathroom while others had more
  • Top 100 had a slightly lower price per person and cleaning fee but a higher security deposit and fee for extra people
  • Top 100 had a much lower minimum nights threshold, starting at just 1 night with 3 being the maximum value
  • And lastly Top 100 had higher number of amenities

As for geographical distribution, the most popular neighbourhoods for Top 100 are Eixample and Ciutat Vella, which does not come as a surprise as they are where the main city attractions are located, namely Sagrada Família, Les Rambles and the Gothic Quarter.

However, the data set itself made it difficult to extract information valuable enough for accurate predictions.

This will require further work and application of more than one method.