Assignment 6: Scraping From Booking.com

Author

Thomas Bonnici

Introduction

Travel is a topic I am interested in learning more about, because I have a bucket list of places I want to visit in my lifetime. I think it would be interesting to see which areas of the United States have good hotels and which could use some improvement. Scraping a website like booking.com will allow me to learn more about the travel world!

Booking.com is a website where you can book flights, hotels, cruises, car rentals, and more! It’s similar to sites like Airbnb and TripAdvisor. I looked at four cities: Raleigh, Pittsburgh, Cincinnati, and Columbus. I chose these cities because if I don’t end up in Cincinnati after college, the other three are places I would consider living.

Additionally, here’s the link to the dataset I created:
all_city_reviews.csv

The Big Question

The question I am trying to answer is: Which of the cities I chose has the best set of hotels?

How I will answer the question:

I will answer this question by scraping booking.com to collect the rating of each listed property in each city. I will then calculate an average rating for each city and compare the cities against each other. All of the numbers are taken directly from booking.com’s website.

Creating the Code

First, I needed to set my user agent to Googlebot so that I could scrape the booking.com website.

# set_config() and user_agent() come from the httr package
library(httr)
set_config(user_agent("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36"))

I was able to make a function to make this scraping possible.
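The original function is not shown here, so the following is only a minimal sketch of what such a function could look like with rvest. The helper names `parse_properties()` and `scrape_city()` are hypothetical, and the `data-testid` selectors are assumptions about booking.com's markup that may not match the live site. The demo at the bottom runs against a small inline HTML snippet with made-up hotels rather than a real page.

```r
library(rvest)

# Hypothetical helper: pull the name and rating out of one page of results.
# The data-testid selectors are ASSUMPTIONS about booking.com's markup.
parse_properties <- function(page) {
  cards <- html_elements(page, "[data-testid='property-card']")
  # html_element() returns one match per card, so a card with no title or
  # score yields NA instead of shifting the columns out of alignment
  name <- html_text2(html_element(cards, "[data-testid='title']"))
  score_text <- html_text2(html_element(cards, "[data-testid='review-score']"))
  # Keep the first number found in the score text; no number becomes NA
  rating <- suppressWarnings(
    as.numeric(sub("^[^0-9]*([0-9]+\\.?[0-9]*).*$", "\\1", score_text))
  )
  data.frame(name = name, rating = rating)
}

# Hypothetical wrapper: download one search-results URL and label it by city.
# httr::GET() uses the user agent configured through httr::set_config().
scrape_city <- function(url, city) {
  resp <- httr::GET(url)
  page <- read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
  out <- parse_properties(page)
  out$city <- rep(city, nrow(out))
  out
}

# Demo on inline HTML (placeholder hotels, not real listings)
demo_page <- minimal_html('
  <div data-testid="property-card">
    <div data-testid="title">Example Hotel A</div>
    <div data-testid="review-score">Scored 8.5</div>
  </div>
  <div data-testid="property-card">
    <div data-testid="title">Example Hotel B</div>
  </div>')
demo_result <- parse_properties(demo_page)
print(demo_result)
```

Splitting the parsing from the download keeps the parsing step testable without hitting the website.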

I tested the function here, as well as combined the data frames I created into one larger one:
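As a rough illustration of the combining step, the sketch below stacks per-city data frames into one and writes the result to a CSV. The data frames and their values are placeholders, not my scraped results, and the column names are assumptions. The file is written to a temporary directory here so that the example does not overwrite the real all_city_reviews.csv.

```r
# Placeholder per-city data frames (illustrative values only)
raleigh  <- data.frame(name = c("Hotel A", "Hotel B"), rating = c(8.0, 9.0))
columbus <- data.frame(name = "Hotel C", rating = 7.0)

city_frames <- list(Raleigh = raleigh, Columbus = columbus)

# Add a city column to each frame, taken from the list names
tagged <- Map(function(df, city) {
  df$city <- rep(city, nrow(df))
  df
}, city_frames, names(city_frames))

# Stack all cities into one data frame and reset the row names
all_city_reviews <- do.call(rbind, tagged)
rownames(all_city_reviews) <- NULL

# Save the combined data
out_path <- file.path(tempdir(), "all_city_reviews.csv")
write.csv(all_city_reviews, out_path, row.names = FALSE)
print(all_city_reviews)
```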

Analysis/Results

Visualization 1

First, I wanted to determine which properties had the highest ratings in Raleigh. This is the city that would be the most ideal for me outside of Cincinnati. So, I created a top 5 bar plot showing this here:

I was able to gather that the 5 best places to stay are Fairfield Inn (8.9), Embassy Suites by Hilton (9.0), Home2 Suites (9.0), Hampton Inn (9.2), and Time to Relax (9.9). Most of these are chains, so this doesn’t come as a surprise. People are comfortable with chains because you can stay at one anywhere in the country and have a similar experience. Time to Relax is an upscale place that has it all rather than a typical hotel, so its very high rating makes sense.
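A top 5 ranking like this one can be sketched in a few lines. The example below uses base R's `barplot()` as a stand-in for the plotting code, which is not shown here, and the data frame holds placeholder values rather than the Raleigh results.

```r
# Placeholder data for illustration only; not the scraped Raleigh results
raleigh <- data.frame(
  name   = paste("Hotel", LETTERS[1:7]),
  rating = c(8.1, 9.0, NA, 7.4, 8.8, 9.3, 8.5)
)

# Sort from highest to lowest rating; rows with a missing rating sort last
ranked <- raleigh[order(raleigh$rating, decreasing = TRUE), ]
top5 <- head(ranked, 5)

# Horizontal bars, reversed so the best-rated property is drawn at the top
barplot(rev(top5$rating), names.arg = rev(top5$name),
        horiz = TRUE, las = 1, xlab = "Rating")
```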

Visualization 2

The next visualization I created shows the average property rating for each city:

According to the graph, Cincinnati has the highest average rating of the four cities at 8.6, and Columbus has the lowest at 7.5. Overall, the cities are similar, since the averages span only about 1 point.
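The per-city averages can be computed along these lines. This is a sketch with placeholder data and assumed column names (`city`, `rating`), not the code behind the graph.

```r
# Placeholder data for illustration only
reviews <- data.frame(
  city   = c("Raleigh", "Raleigh", "Columbus", "Columbus", "Columbus"),
  rating = c(8.0, 9.0, 7.0, NA, 8.0)
)

# Mean rating per city; the formula interface drops rows with a missing rating
avg_by_city <- aggregate(rating ~ city, data = reviews, FUN = mean)

# Order from the highest to the lowest average
avg_by_city <- avg_by_city[order(avg_by_city$rating, decreasing = TRUE), ]
print(avg_by_city)
```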

Visualization 3

Next, I looked into what affects a property’s rating. I did this by comparing each property’s distance from the center of town to its rating. The graph is below:

Note: 3 rows with missing or non-finite values were removed from this plot.

Overall, there is a slight positive correlation between property rating and distance from the center of town. This could be because people are fine with not being right downtown as long as they are still in Raleigh. The area surrounding a property could also have a bigger effect on its rating than distance does. Still, the comparison is useful because it shows ratings generally increasing with distance.

Visualization 4

In this next visualization, I decided to do a correlation test between two variables. I was looking to see what the correlation was between each property rating and number of ratings each listing had:

[1] -0.3391944

Note: 5 rows with missing or non-finite values were removed from this plot.

When we do the correlation test, we get a coefficient of -0.3391944, which shows a weak negative correlation. This means that, in general, as the number of ratings a property has increases, its rating decreases. This result is statistically significant, but it is not a strong correlation.
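For reference, here is a small sketch of how the coefficient and its significance can both be obtained in R. The data are placeholders and the column names (`rating`, `n_reviews`) are assumptions. `cor()` returns only the coefficient, while `cor.test()` also reports a p-value.

```r
# Placeholder data for illustration only
reviews <- data.frame(
  rating    = c(9.1, 8.7, 8.2, 7.9, NA, 7.0),
  n_reviews = c(120, 340, 560, 900, 50, 1500)
)

# Pearson correlation, using only rows where both values are present
r <- cor(reviews$rating, reviews$n_reviews, use = "complete.obs")

# cor.test() drops incomplete pairs itself and adds a significance test
test <- cor.test(reviews$rating, reviews$n_reviews)

print(r)
print(test$p.value)
```

A p-value below the chosen threshold (commonly 0.05) is what would support calling the correlation statistically significant.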

Visualization 5

With this last visualization, I looked at the worst-rated hotel in each city (not just Raleigh):

The worst-rated property in Cincinnati was Hilton Cincinnati Netherland Plaza (7.2), in Columbus it was Baymont Inn and Suites (2.9), in Pittsburgh it was Wyndham Grand (7.4), and in Raleigh it was Days Inn by Wyndham. So, three of the four hotels belong to Wyndham, which is concerning for the company, especially given the really low rating in Columbus.
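Finding the lowest-rated property per city can be sketched as follows, again with placeholder data and assumed column names rather than the scraped results.

```r
# Placeholder data for illustration only
reviews <- data.frame(
  city   = c("Raleigh", "Raleigh", "Columbus", "Columbus"),
  name   = c("Hotel A", "Hotel B", "Hotel C", "Hotel D"),
  rating = c(8.0, 6.5, NA, 7.0)
)

# Split by city, then keep the row with the smallest rating in each group;
# which.min() skips missing ratings
by_city <- split(reviews, reviews$city)
worst <- do.call(rbind, lapply(by_city, function(df) df[which.min(df$rating), ]))
print(worst)
```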