We are using dataset from MakeMyTrip which offers flight tickets for travel, rail and bus tickets, cab service and hotel booking. We are specifically targeting the hotel booking data scrappped from the website for analysis. This was made available to us from Kaggle
We will be doing an analysis which reflects the best rated hotels in cities across various states in India. Also, we will be narrowing down our analysis to look at which hotels are best in terms of customer experience (Food/Hospitality/Facilities/Value_for_Money). Along with this we will be doing a comparison between the Hotel Rating and the Customer Rating. We will be exploring the data further by making use of Geospatial Analysis.
This analysis will help us know which is the best rated hotel and we can further drill down on basis of best customer experience on a preferred location.
As for now, we are using the following libraries:
dplyr - Data Manipulation
tidyr - For Manipulation, Cleaning & Processing data
library(dplyr)
library(tidyr)
Original Datasource: Kaggle
This is a pre-crawled dataset, taken as subset of a bigger dataset (more than 615,000 hotels) that was created by extracting data from MakeMyTrip.com, a travel portal in India. The complete dataset is available on DataStock.
We have 52 variables in the original dataset which we reduced to 15 variables.
Data Cleaning:
There were numerous missing values in our dataset which was the major challenge. Initially, we replaced all missing values with “Unknown” except the ones where we need to perform quantitative analysis (Eg: Location Rating, Food, etc.)
We have worked on “traveller_rating” which was in the following format “Location:5.0/5 | Hospitality:5.0/5 | Facilities:5.0/5 | Cleanliness:5.0/5 | Value for Money:5.0/5 | Food:5.0/5” and was not useful. Hence, we segregated this column into the respective attributes on which they were rated.
First, we read our csv file into R.
D1 = read.csv("makemytrip_com-travel_sample.csv")
Initial review of the data resulted in the removal of few insignificant columns (Ex: country “India” as it is uniform throughout the dataset)
D2 <- D1[, -c(3,4, 5, 6, 8,9, 13, 20, 21,22, 25, 26,28, 29,30, 33:52)]
head(D2)
We have cleaned and standardized all our columns by filling in “Unknown” or “0” at places where there were no values.
sum(grepl("^$",D2$area))
x<-1
D2$area<-sub("^$", "Unknown", D2$area, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$city<-sub("^$", "Unknown", D2$city, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$mmt_review_rating<-sub("\\{\\{ratingCriteria.name\\}\\}\\{\\{ratingCriteria.value\\}\\}", "Unknown", D2$mmt_review_rating, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$mmt_review_rating<-sub("^$", "Unknown", D2$mmt_review_rating, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$hotel_star_rating<-sub("^$", "Unknown", D2$hotel_star_rating, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$hotel_star_rating<-sub("^$", "Unknown", D2$hotel_star_rating, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$hotel_star_rating<-sub("^$", "Unknown", D2$hotel_star_rating, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$mmt_tripadvisor_count<-sub("Unknown", "0", D2$mmt_tripadvisor_count, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$state<-sub("^$", "Unknown", D2$state, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$traveller_rating<-sub("^$", "0", D2$traveller_rating, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
We have cleaned hotel_star_rating from “1 star” format to “1” throughout the column for the respective values.
D2$hotel_star_rating<-sub("star$", "", D2$hotel_star_rating, ignore.case = T, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D2$hotel_star_rating<-as.factor(D2$hotel_star_rating)
We have segregated our “traveller_rating” column into “Location, Hospitality, Facilities, Cleanliness, Value_for_Money, Food”. Thus we have spread out our data into different columns for all the above mentioned attributes for the analysis based on specific customer satisfaction factors.
D2$traveller_rating<-sub("%Value for Money%", "VFM",fixed = FALSE, useBytes = FALSE)
D3<-D2
D5<-separate(D3, traveller_rating, c("Location","d1", "Hospitality","d2","Facilities","d3","Cleanliness","d4","d5","d6","Value_for_Money","d7","Food"), sep = " | ")
D5$Location<-sub("^Location:", "",D5$Location,fixed = FALSE, useBytes = FALSE)
D5$Hospitality<-sub("^Hospitality:", "",D4$Hospitality,fixed = FALSE, useBytes = FALSE)
D5$Facilities<-sub("^Facilities:", "",D4$Facilities,fixed = FALSE, useBytes = FALSE)
D5$Cleanliness<-sub("^Cleanliness:", "",D4$Cleanliness,fixed = FALSE, useBytes = FALSE)
D5$Value_for_Money<-sub("^Money:", "",D4$Value_for_Money,fixed = FALSE, useBytes = FALSE)
D5$Food<-sub("^Food:", "",D4$Food,fixed = FALSE, useBytes = FALSE)
D6 <- D5[, -c(18,20,22,24,25,26)]
On separating “traveller_rating” column into “Location, Hospitality, Facilities, Cleanliness, Value_for_Money, Food” we were getting values as “4/5” format which we converted to “4”
D6$Location<-sub("/5$", "", D6$Location, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D6$Hospitality<-sub("/5$", "", D6$Hospitality, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D6$Facilities<-sub("/5$", "", D6$Facilities, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D6$Cleanliness<-sub("/5$", "", D6$Cleanliness, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D6$Value_for_Money<-sub("/5$", "", D6$Value_for_Money, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D6$Food<-sub("/5$", "", D6$Food, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
D7 <- D6[, -c(9,11,13:18)]
For final analysis we carried out data conversion for required columns into numeric
D7$mmt_location_rating <- sub("..","",D7$mmt_location_rating)
D7$hotel_star_rating = as.numeric(D7$hotel_star_rating)
D7$latitude = as.numeric(D7$latitude)
D7$mmt_location_rating = as.numeric(D7$mmt_location_rating)
D7$mmt_review_count = as.numeric(D7$mmt_review_count)
D7$mmt_review_score = as.numeric(D7$mmt_review_score)
D7$mmt_tripadvisor_count = as.numeric(D7$mmt_tripadvisor_count)
D7$Location = as.numeric(D7$Location)
D7$Facilities = as.numeric(D7$Facilities)
D7$Cleanliness = as.numeric(D7$Cleanliness)
D7$Value_for_Money = as.numeric(D7$Value_for_Money)
D7$Food = as.numeric(D7$Food)
Here we have the summary information for our cleaned dataset
str(D7)
'data.frame': 20063 obs. of 15 variables:
$ area : chr "Hardasji Ki Magri" "Near Nai Gaon" "Near Bagore Ki Haveli" "Dabok" ...
$ city : chr "Udaipur" "Udaipur" "Udaipur" "Udaipur" ...
$ hotel_star_rating : num 0 0 0 0 0 0 0 0 0 0 ...
$ is_value_plus : Factor w/ 9 levels "","0","103","43",..: 7 7 7 7 7 7 7 7 7 7 ...
$ latitude : num 5161 5065 5140 5229 5143 ...
$ longitude : num 73.7 73.6 73.7 73.9 73.7 ...
$ mmt_location_rating : num NA NA NA NA NA NA NA NA NA NA ...
$ mmt_review_count : num 1 1 1 1 1 1 1 1 1 1 ...
$ mmt_review_score : num 46 46 1 1 38 42 1 40 49 48 ...
$ mmt_tripadvisor_count: num 504 323 113 285 80 581 630 138 328 50 ...
$ Location : num NA NA NA NA NA NA NA NA NA NA ...
$ Facilities : num NA NA NA NA NA NA NA NA NA NA ...
$ Cleanliness : num NA NA NA NA NA NA NA NA NA NA ...
$ Value_for_Money : num NA NA NA NA NA NA NA NA NA NA ...
$ Food : num NA NA NA NA NA NA NA NA NA NA ...
summary(D7)
area city hotel_star_rating is_value_plus
Length:20063 Length:20063 Min. :0 no :19297
Class :character Class :character 1st Qu.:0 yes : 740
Mode :character Mode :character Median :0 : 20
Mean :0 0 : 1
3rd Qu.:0 103 : 1
Max. :0 43 : 1
(Other): 3
latitude longitude mmt_location_rating mmt_review_count
Min. : 1.0 Min. : 0.00 Min. :0.00 Min. : 1.00
1st Qu.: 213.5 1st Qu.: 72.56 1st Qu.:4.00 1st Qu.: 1.00
Median :3375.0 Median : 75.85 Median :4.50 Median : 1.00
Mean :3796.4 Mean : 59.34 Mean :4.55 Mean : 46.79
3rd Qu.:6706.5 3rd Qu.: 77.62 3rd Qu.:5.00 3rd Qu.: 37.00
Max. :9962.0 Max. :130.00 Max. :9.00 Max. :301.00
NA's :42 NA's :13360
mmt_review_score mmt_tripadvisor_count Location Facilities Cleanliness
Min. : 1.00 Min. : 1.0 Min. :0.000 Min. :1.000 Min. :0.000
1st Qu.: 1.00 1st Qu.: 1.0 1st Qu.:3.900 1st Qu.:3.300 1st Qu.:3.300
Median : 1.00 Median : 20.0 Median :4.300 Median :3.900 Median :4.000
Mean :18.15 Mean :159.2 Mean :4.138 Mean :3.728 Mean :3.805
3rd Qu.:40.00 3rd Qu.:296.0 3rd Qu.:4.700 3rd Qu.:4.300 3rd Qu.:4.500
Max. :52.00 Max. :651.0 Max. :5.000 Max. :5.000 Max. :5.000
NA's :13287 NA's :13452 NA's :13452
Value_for_Money Food
Min. :0.000 Min. :0.000
1st Qu.:3.500 1st Qu.:2.300
Median :4.000 Median :3.300
Mean :3.903 Mean :3.079
3rd Qu.:4.500 3rd Qu.:4.000
Max. :5.000 Max. :5.000
NA's :13452 NA's :13452
head(D7)
tail(D7)
This analysis will help us understand the trend of the specific customer types and their preferences while selecting hotels in India. Also, we will be incorporating geospatial analysis which will help us visualize the geopoints which attract more customers.