Exploration Sections

Challenge Summary & Information

Challenge

The goal of this challenge is to use a dataset based on millions of real anonymized accommodation reservations to come up with a strategy for recommending a traveler's next destination in real time.


Dataset

  • The training dataset consists of over a million anonymized hotel reservations, based on real data, with the following features:
  • user_id - User ID
  • checkin - Reservation check-in date
  • checkout - Reservation check-out date
  • created_date - Date when the reservation was made
  • affiliate_id - An anonymized ID of affiliate channels where the booker came from (e.g. direct, some third party referrals, paid search engine, etc.)
  • device_class - desktop/mobile
  • booker_country - Country from which the reservation was made (anonymized)
  • hotel_country - Country of the hotel (anonymized)
  • city_id - city_id of the hotel’s city (anonymized)
  • utrip_id - Unique identification of user’s trip (a group of multi-destination bookings within the same trip).
  • Note - Each reservation is part of a customer’s trip (identified by utrip_id), which includes at least 4 consecutive reservations. The check-out date of a reservation is the check-in date of the following reservation in the trip. The evaluation dataset is constructed similarly; however, the city_id of the final reservation of each trip is concealed and requires a prediction.
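
The two rules in the note above (at least 4 reservations per trip, and each check-out matching the next check-in) can be checked directly. The following is a minimal Python sketch using only the standard library, not the code used for this write-up. The function name `check_trips` and the month/day/year date format are assumptions; the rows are copied from the 10-row sample shown in the Data Exploration section.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumption: month/day/year, as in "4/16/2016"


def parse_date(text):
    return datetime.strptime(text, DATE_FORMAT).date()


def check_trips(rows, min_stops=4):
    """Return {utrip_id: [issues]} for trips that break the stated rules."""
    trips = defaultdict(list)
    for utrip_id, checkin, checkout in rows:
        trips[utrip_id].append((parse_date(checkin), parse_date(checkout)))

    problems = {}
    for utrip_id, stops in trips.items():
        stops.sort()  # chronological order by check-in date
        issues = []
        if len(stops) < min_stops:
            issues.append(f"only {len(stops)} reservations")
        # Compare each check-out with the check-in of the following stop
        for (_, prev_out), (next_in, _) in zip(stops, stops[1:]):
            if prev_out != next_in:
                issues.append(f"gap: checkout {prev_out} vs next checkin {next_in}")
        if issues:
            problems[utrip_id] = issues
    return problems


# (utrip_id, checkin, checkout) taken from the sample rows in this write-up
sample = [
    ("1006220_1", "4/9/2016", "4/11/2016"),
    ("1006220_1", "4/11/2016", "4/12/2016"),
    ("1006220_1", "4/12/2016", "4/16/2016"),
    ("1006220_1", "4/16/2016", "4/17/2016"),
    ("1010293_1", "7/9/2016", "7/10/2016"),
    ("1010293_1", "7/10/2016", "7/11/2016"),
    ("1010293_1", "7/12/2016", "7/13/2016"),
    ("1010293_1", "7/13/2016", "7/15/2016"),
    ("1010293_1", "7/15/2016", "7/16/2016"),
    ("1010293_1", "7/16/2016", "7/17/2016"),
]

problems = check_trips(sample)
print(problems)
```

On these sample rows the check flags trip 1010293_1, where a 7/11/2016 check-out is followed by a 7/12/2016 check-in, so the consecutive-reservation rule may not hold for every trip in the file.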


Evaluation & Goal

The goal of the challenge is to predict (and recommend) the final city (city_id) of each trip (utrip_id). We will evaluate the quality of the predictions based on the top four recommended cities for each trip by using the Precision@4 metric (4 representing the four suggestion slots on the Booking.com website). When the true city is one of the top 4 suggestions (regardless of the order), the prediction is considered correct.
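
The scoring rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official scoring code: the function name `top4_hit_rate`, the trip labels, and the city IDs are all made up for the example.

```python
def top4_hit_rate(recommendations, true_cities):
    """Share of trips whose true final city is among the first four suggestions.

    recommendations: {utrip_id: ranked list of city_ids}
    true_cities:     {utrip_id: true final city_id}
    """
    hits = 0
    for utrip_id, true_city in true_cities.items():
        top4 = recommendations.get(utrip_id, [])[:4]  # only four slots count
        if true_city in top4:  # order within the four slots is ignored
            hits += 1
    return hits / len(true_cities)


# Hypothetical trips and city IDs, for illustration only
true_cities = {"trip_a": 101, "trip_b": 202, "trip_c": 303}
recommendations = {
    "trip_a": [101, 5, 6, 7],       # true city in slot 1: correct
    "trip_b": [9, 8, 7, 202, 1],    # true city in slot 4: correct
    "trip_c": [1, 2, 3, 4, 303],    # true city in slot 5: not counted
}

score = top4_hit_rate(recommendations, true_cities)
print(score)
```

Because position within the four slots does not matter, a model only needs to place the true city anywhere in its top four to score a hit.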

Structure of Project & Exploration

Quotes on Data & Analytics

“Data are just summaries of thousands of stories – tell a few of those stories to help make the data meaningful.” — Chip & Dan Heath

“A data scientist is someone who can obtain, scrub, explore, model, and interpret data, blending hacking, statistics, and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product.” – Hilary Mason, founder, Fast Forward Labs.

“Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, senior vice president, Gartner Research.


What are my aims with this exploration?

  • Distributions
  • Basic Descriptive Metrics
  • Correlations/Relationships
  • Identifying Outliers or Odd Data
  • Seasonality/Time Series
  • Creation of New Columns for Feature Engineering
  • Outlier Detection
  • Volume


Structure of the Exploration

  • Questions: Questions are posed before each visualization to make sure the visualizations shown in this project are insightful.

  • Summary: After each section I will provide a summary of what we learned from the visualizations.

  • Observations/Comments: Any observations I made about a given metric or visual, or comments I have about a given summary of data.


Objective of the Exploration

  • Observations - What can we learn from this deep dive that we didn’t know from our initial observations of the dataset?

  • Understanding Current Dataset - The intention of this exploration is also to understand the Booking.com dataset and how a customer travels using Booking.com.

  • Modeling - Creating a model to predict (and recommend) the final city (city_id) of each trip (utrip_id).


Other Thoughts

  • Analytics teams often make a trade-off between speed and accuracy, and that trade-off often results in solutions that are challenging to interpret and deploy for the wider organization. A natural drawback of scrappy or agile analytics approaches to systematic solutions is wide gaps or ‘blind spots’ in analysis and unstable or brittle tools, often leading to sub-optimal outcomes. This write-up serves to document and compare past business cases, analytics methodologies, and learnings so that the reader gains a good understanding of historical efforts to date.

Data Exploration

Viewing the Data Types of Each Column

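  • Observations - The data is made up of 9 variables (summarized in the first tab) and 1,048,575 rows. That row count is exactly one less than Excel's 1,048,576-row worksheet limit, so the file may have been truncated when it passed through a spreadsheet; this is worth verifying against the source file.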
## 'data.frame':    1048575 obs. of  9 variables:
##  $ user_id       : int  1006220 1006220 1006220 1006220 1010293 1010293 1010293 1010293 1010293 1010293 ...
##  $ checkin       : Factor w/ 425 levels "1/1/2016","1/1/2017",..: 272 245 246 250 364 335 337 338 340 341 ...
##  $ checkout      : Factor w/ 425 levels "1/1/2016","1/1/2017",..: 245 246 250 251 335 336 338 340 341 342 ...
##  $ city_id       : int  31114 39641 20232 24144 5325 55 23921 65322 23921 20545 ...
##  $ device_class  : Factor w/ 3 levels "desktop","mobile",..: 1 1 1 1 2 2 2 1 1 1 ...
##  $ affiliate_id  : int  384 384 384 384 359 359 359 9924 9924 10573 ...
##  $ booker_country: Factor w/ 5 levels "Bartovia","Elbonia",..: 3 3 3 3 5 5 5 5 5 5 ...
##  $ hotel_country : Factor w/ 193 levels "Absurdistan",..: 62 62 61 62 37 37 37 37 37 37 ...
##  $ utrip_id      : Factor w/ 195685 levels "1000027_1","1000045_1",..: 230 230 230 230 382 382 382 382 382 382 ...
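
The structure output above shows checkin and checkout stored as factors, i.e. as text labels such as "1/1/2016", not as dates. Text sorts alphabetically, not chronologically, so any ordering or date arithmetic needs the values parsed first. The following Python sketch (standard library only) illustrates the difference; the month/day/year format is an assumption based on sample values such as 4/16/2016.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumption: month/day/year


def parse_date(text):
    """Convert a label such as '4/9/2016' into a date object."""
    return datetime.strptime(text, DATE_FORMAT).date()


# Check-in labels from the first trip in the sample rows
checkins = ["4/9/2016", "4/11/2016", "4/12/2016", "4/16/2016"]

text_order = sorted(checkins)                  # alphabetical: '4/9' sorts last
date_order = sorted(checkins, key=parse_date)  # chronological

print(text_order)
print(date_order)
```

Sorted as text, 4/9/2016 lands after 4/16/2016; sorted as dates, it is correctly the first stop.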

Viewing a Small Subset of the Data

  • Observations/Comments - Below are 10 rows of the dataset. My first thoughts are to understand the following:
  • Distinct number of booker countries, to understand if we are looking at a wide array of countries or just a few
  • Distinct number of hotel countries, to understand if we are looking at a wide array of hotel countries or just a few
  • Distinct number of utrip IDs, to understand how many trips in total are in the dataset
  • Distinct number of city IDs, to understand if we are looking at a wide array of cities or just a few

Additional Datapoints

  • Observations/Comments - After observing the data, there is potential to create additional columns from the existing dataset, which include:
  • Stop duration for a given city ID, or in other words how long the customer was in a given city during their itinerary.
  • Entire trip duration: how long the entire trip was from the first check-in date to the last check-out date.
  • How many cities in total the customer visited.
  • How many countries in total the customer visited.
user_id  checkin    checkout   city_id  device_class  affiliate_id  booker_country        hotel_country  utrip_id
1006220  4/9/2016   4/11/2016  31114    desktop       384           Gondal                Gondal         1006220_1
1006220  4/11/2016  4/12/2016  39641    desktop       384           Gondal                Gondal         1006220_1
1006220  4/12/2016  4/16/2016  20232    desktop       384           Gondal                Glubbdubdrib   1006220_1
1006220  4/16/2016  4/17/2016  24144    desktop       384           Gondal                Gondal         1006220_1
1010293  7/9/2016   7/10/2016  5325     mobile        359           The Devilfire Empire  Cobra Island   1010293_1
1010293  7/10/2016  7/11/2016  55       mobile        359           The Devilfire Empire  Cobra Island   1010293_1
1010293  7/12/2016  7/13/2016  23921    mobile        359           The Devilfire Empire  Cobra Island   1010293_1
1010293  7/13/2016  7/15/2016  65322    desktop       9924          The Devilfire Empire  Cobra Island   1010293_1
1010293  7/15/2016  7/16/2016  23921    desktop       9924          The Devilfire Empire  Cobra Island   1010293_1
1010293  7/16/2016  7/17/2016  20545    desktop       10573         The Devilfire Empire  Cobra Island   1010293_1

Viewing Distinct Counts in the Data

  • Observations/Comments - Below are the distinct counts for several columns of the dataset:
  • Note that there are only 5 distinct booker countries within the dataset. Potentially these are the countries from which Booking.com receives the most volume.
  • There are 193 hotel countries, which suggests that nearly every country in the world is represented.
  • About 38.6k distinct cities were visited.
  • About 3.1k distinct affiliate IDs.
  • About 181k unique users across about 196k trips.
  • These counts do not yet yield deeper observations, but they provide useful context as we continue to explore and model.


    user_id_cnt  city_id_cnt  affiliate_id_cnt  booker_country_cnt  hotel_country_cnt  utrip_id_cnt
    181231       38638        3126              5                   193                195685
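
Distinct counts like the ones above are simple to reproduce. The following Python sketch (standard library only) shows the idea on four rows copied from the sample; the function name `distinct_counts` is hypothetical, and the `_cnt` suffix simply mirrors the column names in the table.

```python
def distinct_counts(rows, columns):
    """Count the distinct values in each requested column."""
    return {f"{col}_cnt": len({row[col] for row in rows}) for col in columns}


# Four rows copied from the sample shown earlier in this write-up
rows = [
    {"user_id": 1006220, "city_id": 31114, "affiliate_id": 384,
     "booker_country": "Gondal", "hotel_country": "Gondal",
     "utrip_id": "1006220_1"},
    {"user_id": 1006220, "city_id": 20232, "affiliate_id": 384,
     "booker_country": "Gondal", "hotel_country": "Glubbdubdrib",
     "utrip_id": "1006220_1"},
    {"user_id": 1010293, "city_id": 23921, "affiliate_id": 359,
     "booker_country": "The Devilfire Empire", "hotel_country": "Cobra Island",
     "utrip_id": "1010293_1"},
    {"user_id": 1010293, "city_id": 23921, "affiliate_id": 9924,
     "booker_country": "The Devilfire Empire", "hotel_country": "Cobra Island",
     "utrip_id": "1010293_1"},
]

columns = ["user_id", "city_id", "affiliate_id",
           "booker_country", "hotel_country", "utrip_id"]
counts = distinct_counts(rows, columns)
print(counts)
```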



Creating New Columns in the Data for Additional Data Points

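  • Observations/Comments - Below are the new columns and their definitions:
  • Stop Duration - How many days the traveler stayed in a given city
  • Trip Duration - How many days the entire trip lasted, per trip (utrip_id). In the sample below the value matches the sum of the stop durations, so a gap between two stops is not counted.
  • Total City Dest - How many distinct cities the traveler visited during the entire trip
  • Month Name - Month extracted from the check-in date
  • Year - Year extracted from the check-in date
  • Leg of Trip - For a given trip-city combination, which leg of the trip the city was: 1, 2, 3…
  • Note - In the sample below, the first stop of each trip carries the highest leg number (4 and 6). This is consistent with check-in dates having been ranked as text instead of as dates (for example, "4/9/2016" sorts after "4/16/2016"), so leg_of_trip should be recomputed on parsed dates before it is used.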


    user_id  checkin    checkout   city_id  device_class  affiliate_id  booker_country        hotel_country  utrip_id   stop_duration  trip_duration  total_city_dest  month  year  month_name  leg_of_trip
    1006220  4/9/2016   4/11/2016  31114    desktop       384           Gondal                Gondal         1006220_1  2              8              4                4      2016  April       4
    1006220  4/11/2016  4/12/2016  39641    desktop       384           Gondal                Gondal         1006220_1  1              8              4                4      2016  April       1
    1006220  4/12/2016  4/16/2016  20232    desktop       384           Gondal                Glubbdubdrib   1006220_1  4              8              4                4      2016  April       2
    1006220  4/16/2016  4/17/2016  24144    desktop       384           Gondal                Gondal         1006220_1  1              8              4                4      2016  April       3
    1010293  7/9/2016   7/10/2016  5325     mobile        359           The Devilfire Empire  Cobra Island   1010293_1  1              7              5                7      2016  July        6
    1010293  7/10/2016  7/11/2016  55       mobile        359           The Devilfire Empire  Cobra Island   1010293_1  1              7              5                7      2016  July        1
    1010293  7/12/2016  7/13/2016  23921    mobile        359           The Devilfire Empire  Cobra Island   1010293_1  1              7              5                7      2016  July        2
    1010293  7/13/2016  7/15/2016  65322    desktop       9924          The Devilfire Empire  Cobra Island   1010293_1  2              7              5                7      2016  July        3
    1010293  7/15/2016  7/16/2016  23921    desktop       9924          The Devilfire Empire  Cobra Island   1010293_1  1              7              5                7      2016  July        4
    1010293  7/16/2016  7/17/2016  20545    desktop       10573         The Devilfire Empire  Cobra Island   1010293_1  1              7              5                7      2016  July        5
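
The derived columns can be sketched as follows in Python (standard library only). This is not the code that produced the table above. It assumes month/day/year dates, trip_duration as the sum of stop durations, and total_city_dest as the number of distinct cities; both definitions are inferred from the sample values (7 and 5 for trip 1010293_1). The function name `add_trip_features` is hypothetical, and month_name is omitted for brevity. Legs are numbered after sorting on parsed dates, which avoids the text ordering in which "7/9/2016" sorts after "7/16/2016".

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumption: month/day/year


def parse_date(text):
    return datetime.strptime(text, DATE_FORMAT).date()


def add_trip_features(rows):
    """Return copies of the rows with per-stop and per-trip columns added."""
    by_trip = defaultdict(list)
    for row in rows:
        checkin = parse_date(row["checkin"])
        checkout = parse_date(row["checkout"])
        enriched = dict(row)
        enriched["stop_duration"] = (checkout - checkin).days
        enriched["month"] = checkin.month
        enriched["year"] = checkin.year
        by_trip[row["utrip_id"]].append((checkin, enriched))

    result = []
    for stops in by_trip.values():
        stops.sort(key=lambda pair: pair[0])  # chronological, not text, order
        # Assumed definitions, inferred from the sample table
        trip_duration = sum(r["stop_duration"] for _, r in stops)
        total_city_dest = len({r["city_id"] for _, r in stops})
        for leg, (_, r) in enumerate(stops, start=1):
            r["trip_duration"] = trip_duration
            r["total_city_dest"] = total_city_dest
            r["leg_of_trip"] = leg
            result.append(r)
    return result


# Trip 1010293_1 from the sample rows
sample = [
    {"utrip_id": "1010293_1", "checkin": "7/9/2016", "checkout": "7/10/2016", "city_id": 5325},
    {"utrip_id": "1010293_1", "checkin": "7/10/2016", "checkout": "7/11/2016", "city_id": 55},
    {"utrip_id": "1010293_1", "checkin": "7/12/2016", "checkout": "7/13/2016", "city_id": 23921},
    {"utrip_id": "1010293_1", "checkin": "7/13/2016", "checkout": "7/15/2016", "city_id": 65322},
    {"utrip_id": "1010293_1", "checkin": "7/15/2016", "checkout": "7/16/2016", "city_id": 23921},
    {"utrip_id": "1010293_1", "checkin": "7/16/2016", "checkout": "7/17/2016", "city_id": 20545},
]

features = add_trip_features(sample)
for f in features:
    print(f["checkin"], f["stop_duration"], f["trip_duration"],
          f["total_city_dest"], f["leg_of_trip"])
```

With these definitions the 7/9/2016 stop is leg 1 and the 7/16/2016 stop is leg 6, while the trip duration and city count match the values in the table.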



Exploring the Data Through Visuals

  • Observations/Comments - Below I will explore the data using ggplot2 visuals:
  • Trips by Booker Country (Chart 1) - What is the volume of trips by booker country? Note that there are only 5 booker countries in the dataset, with Gondal having the most volume.





Stop Duration (Days)

  • Question: What is the distribution of the number of days a traveler stayed at a given stop?
  • Observations: Do travelers stay for a short or a long period during a stop? Is there a relationship between stop length and the number of cities they visit?


Trip Duration (Days) Count

  • Question: How long was a traveler's entire trip?
  • Observations: Does the length of a trip in days correlate positively with the number of cities a traveler visits?


City Visits During Trip

  • Question: How many cities did a given traveler visit?
  • Observations: I am curious about travelers who visited more than 15 cities during their trip. Are these outliers or bad data?
  • Observations: Does the length of a trip in days correlate positively with the number of cities a traveler visits?
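
Both questions above can be approached with a per-trip summary. The following Python sketch (standard library only) flags trips above a city-count threshold and computes a Pearson correlation between trip duration and city count. The first two trips use values from the sample table. The two `hypothetical_*` trips are invented purely to make the example run, so the printed correlation says nothing about the real dataset. The function names and the threshold of 15 are assumptions taken from the question as phrased.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def flag_many_city_trips(trips, max_cities=15):
    """Return the IDs of trips that visit more than max_cities cities."""
    return [t["utrip_id"] for t in trips if t["total_city_dest"] > max_cities]


trips = [
    {"utrip_id": "1006220_1", "trip_duration": 8, "total_city_dest": 4},   # from the sample table
    {"utrip_id": "1010293_1", "trip_duration": 7, "total_city_dest": 5},   # from the sample table
    {"utrip_id": "hypothetical_1", "trip_duration": 30, "total_city_dest": 18},  # invented
    {"utrip_id": "hypothetical_2", "trip_duration": 12, "total_city_dest": 6},   # invented
]

r = pearson([t["trip_duration"] for t in trips],
            [t["total_city_dest"] for t in trips])
flagged = flag_many_city_trips(trips)

print(round(r, 3), flagged)
```

Flagged trips are candidates for manual inspection, not automatic removal; a long multi-city itinerary can be genuine.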
