This project relates to the hotel pricing analysis. We have 19 variables depending on which we base our analysis. The variables include where the hotel is located and some other details regarding the featres of the hotel. We try to make a meaningful analysis of all these variables and find out which factors do affect the pricing and which do not. Do cities have higher pricing? Does the availability of wifi and breakfast for free change pricing? Lets find out
The study collects data related to various hotels in 42 different cities. The sampling is done randomly. The hotels are at various places and data belongs to all types of hotels. The study considers information about room rent, whether the booking is done on a new year eve or on a weekend. It also collects the dates in which the bookings were done so that we can establish a pattern. The sole motive of this project is to do an analysis of which variables affect room booking and in particular the room rent and find the variables impacting the room rent the most so that we can increase or reduce those factors to optimise the bookings. From the customer perspective, greater failities for better or lesser rent is always preferable.
The data is collected from an open source https://in.hotels.com/ in October 2016. It is a combined data set for 42 cities. It has 19 variables out of which some are of text form, others are dummy, some boolean and some which are in numeric form. Some of the dummy variables are hotel address,hotel description, pincode which do not affect the pricing from an initial perspective. Also some boolean variables are Is the hotel located in a metro city or not, Is the booking done on a weekend or not, Is the eve close to New Year or not. The major variables on which we base our analysis are the rating of the hotel - given out of 5 stars, Rent of the particular room, Distance from airport, hote capacity, does the hotel have a swimming pool or not. We also have the data about the date on which the booking is made(for 8 particular dates), the population of the city in which the boking has been made and several other objective factors. We would finally obtain the key variables which decide the rent.
RoomRent = a.CityRank+ b.RoomRent+ c.StarRating+ d.Airport+ e.HotelCapacity+ f.Population+ g.IsNewYearEve+ h.IsTouristDestination+ j.IsMetroCity+ k.HasSwimmingPool
For this we get Adjusted R-squared: 0.1894 which is the least among all the 3 models. The p-value is close to 2.2e-16 which is much much lesser than 0.05. Hence we deem this to be the best fit linear regression model.
The project was aimed to understand room rent pricing and we can conclude that the following factors affect the pricing the most in respective order: City Rank, star Rating, Hotel Capacity, Is New Year Eve, Is Tourism Destination, Has Swimming Pool, followed by distance To Airport, Population and the city being metro or not.
https://in.hotels.com/ https://drive.google.com/open?id=14knKqDBPBWm5ag6YJuMWD3-BZ7JiHjumIdK5KIbSF5Q
citi <- read.csv("file:///C:/Users/ABID/Downloads/Cities42.csv")
attach(citi)
library(psych)
describe(citi)
## vars n mean sd median trimmed
## CityName* 1 13232 18.07 11.72 16 17.29
## Population 2 13232 4416836.87 4258386.00 3046163 4040816.22
## CityRank 3 13232 14.83 13.51 9 13.30
## IsMetroCity 4 13232 0.28 0.45 0 0.23
## IsTouristDestination 5 13232 0.70 0.46 1 0.75
## IsWeekend 6 13232 0.62 0.48 1 0.65
## IsNewYearEve 7 13232 0.12 0.33 0 0.03
## Date* 8 13232 14.30 2.69 14 14.39
## HotelName* 9 13232 841.19 488.16 827 841.18
## RoomRent 10 13232 5473.99 7333.12 4000 4383.33
## StarRating 11 13232 3.46 0.76 3 3.40
## Airport 12 13232 21.16 22.76 15 16.39
## HotelAddress* 13 13232 1202.53 582.17 1261 1233.25
## HotelPincode 14 13232 397430.26 259837.50 395003 388540.47
## HotelDescription* 15 13224 581.34 363.26 567 575.37
## FreeWifi 16 13232 0.93 0.26 1 1.00
## FreeBreakfast 17 13232 0.65 0.48 1 0.69
## HotelCapacity 18 13232 62.51 76.66 34 46.03
## HasSwimmingPool 19 13232 0.36 0.48 0 0.32
## mad min max range skew
## CityName* 11.86 1.0 42 41.0 0.48
## Population 3846498.95 8096.0 12442373 12434277.0 0.68
## CityRank 11.86 0.0 44 44.0 0.69
## IsMetroCity 0.00 0.0 1 1.0 0.96
## IsTouristDestination 0.00 0.0 1 1.0 -0.86
## IsWeekend 0.00 0.0 1 1.0 -0.51
## IsNewYearEve 0.00 0.0 1 1.0 2.28
## Date* 2.97 1.0 20 19.0 -0.77
## HotelName* 641.97 1.0 1670 1669.0 0.01
## RoomRent 2653.85 299.0 322500 322201.0 16.75
## StarRating 0.74 0.0 5 5.0 0.48
## Airport 11.12 0.2 124 123.8 2.73
## HotelAddress* 668.65 1.0 2108 2107.0 -0.37
## HotelPincode 257975.37 100025.0 7000157 6900132.0 9.99
## HotelDescription* 472.95 1.0 1226 1225.0 0.11
## FreeWifi 0.00 0.0 1 1.0 -3.25
## FreeBreakfast 0.00 0.0 1 1.0 -0.62
## HotelCapacity 28.17 0.0 600 600.0 2.95
## HasSwimmingPool 0.00 0.0 1 1.0 0.60
## kurtosis se
## CityName* -0.88 0.10
## Population -1.08 37019.65
## CityRank -0.76 0.12
## IsMetroCity -1.08 0.00
## IsTouristDestination -1.26 0.00
## IsWeekend -1.74 0.00
## IsNewYearEve 3.18 0.00
## Date* 1.92 0.02
## HotelName* -1.25 4.24
## RoomRent 582.06 63.75
## StarRating 0.25 0.01
## Airport 7.89 0.20
## HotelAddress* -0.88 5.06
## HotelPincode 249.76 2258.86
## HotelDescription* -1.25 3.16
## FreeWifi 8.57 0.00
## FreeBreakfast -1.61 0.00
## HotelCapacity 11.39 0.67
## HasSwimmingPool -1.64 0.00
plot(~RoomRent + CityRank, main="RoomRent vs City Rank")
abline(0,1)
plot(~RoomRent + Airport, main="RoomRent vs Distance from airport")
abline(0,1)
plot(~RoomRent + HotelCapacity, main="RoomRent vs HotelCapacity")
abline(0,1)
plot(~RoomRent + IsNewYearEve, main="RoomRent vs New year eve or not")
abline(0,1)
library(lattice)
histogram(~HotelCapacity, data = citi,
main = "Distribution of Room Rent", col='gray' )
library(lattice)
histogram(~StarRating, data = citi,
main = "Distribution of Room Rent", col='gray' )
library(lattice)
histogram(~HasSwimmingPool, data = citi,
main = "Distribution of Room Rent", col='gray' )
boxplot(RoomRent~StarRating,data=citi, main="variation")
boxplot(RoomRent~CityName,data=citi, main="variation")
boxplot(RoomRent~CityRank,data=citi, main="variation")
boxplot(RoomRent~FreeBreakfast,data=citi, main="variation")
boxplot(RoomRent~HotelPincode,data=citi, main="variation")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(~ RoomRent+StarRating+Airport+CityRank,
data=citi
)
scatterplotMatrix(~RoomRent+HotelCapacity+Population, data=citi
)
library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
col1 <- c("RoomRent","StarRating","Airport","CityRank")
corMatrix <- rcorr(as.matrix(citi[,col1]))
corMatrix
## RoomRent StarRating Airport CityRank
## RoomRent 1.00 0.37 0.05 0.09
## StarRating 0.37 1.00 -0.06 -0.13
## Airport 0.05 -0.06 1.00 0.51
## CityRank 0.09 -0.13 0.51 1.00
##
## n= 13232
##
##
## P
## RoomRent StarRating Airport CityRank
## RoomRent 0 0 0
## StarRating 0 0 0
## Airport 0 0 0
## CityRank 0 0 0
col2 <- c("RoomRent","HotelCapacity","Population")
corMatrix2 <- rcorr(as.matrix(citi[,col2]))
corMatrix2
## RoomRent HotelCapacity Population
## RoomRent 1.00 0.16 -0.09
## HotelCapacity 0.16 1.00 0.26
## Population -0.09 0.26 1.00
##
## n= 13232
##
##
## P
## RoomRent HotelCapacity Population
## RoomRent 0 0
## HotelCapacity 0 0
## Population 0 0
library(Hmisc)
library(car)
library(corrgram)
colairlines <- c("RoomRent","HotelCapacity","Population","StarRating","Airport","CityRank")
corrgram(citi[,colairlines], order=TRUE,
main="Hotel Room Booking",
lower.panel=panel.pts, upper.panel=panel.pie,
diag.panel=panel.minmax, text.panel=panel.txt)
t.test(RoomRent,HotelCapacity)
##
## Welch Two Sample t-test
##
## data: RoomRent and HotelCapacity
## t = 84.882, df = 13234, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5286.515 5536.445
## sample estimates:
## mean of x mean of y
## 5473.99184 62.51164
t.test(RoomRent, Population)
##
## Welch Two Sample t-test
##
## data: RoomRent and Population
## t = -119.16, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4483927 -4338799
## sample estimates:
## mean of x mean of y
## 5473.992 4416836.872
t.test(RoomRent, StarRating)
##
## Welch Two Sample t-test
##
## data: RoomRent and StarRating
## t = 85.813, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5345.575 5595.491
## sample estimates:
## mean of x mean of y
## 5473.991838 3.458933
Model1 <- RoomRent ~ RoomRent+StarRating+Airport+HotelCapacity+Population+IsWeekend+IsNewYearEve+IsTouristDestination+IsMetroCity+HasSwimmingPool+FreeWifi+FreeBreakfast
fit1 <- lm(Model1, data = citi)
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared
## on the right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 1 in
## model.matrix: no columns are assigned
summary(fit1)
##
## Call:
## lm(formula = Model1, data = citi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11838 -2356 -685 1025 309676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.573e+03 4.143e+02 -20.694 < 2e-16 ***
## StarRating 3.594e+03 1.105e+02 32.522 < 2e-16 ***
## Airport 9.797e+00 2.720e+00 3.601 0.000318 ***
## HotelCapacity -1.027e+01 1.033e+00 -9.943 < 2e-16 ***
## Population -1.237e-04 2.263e-05 -5.465 4.72e-08 ***
## IsWeekend -9.039e+01 1.238e+02 -0.730 0.465460
## IsNewYearEve 8.825e+02 1.818e+02 4.854 1.22e-06 ***
## IsTouristDestination 1.935e+03 1.381e+02 14.012 < 2e-16 ***
## IsMetroCity -6.580e+02 2.137e+02 -3.079 0.002078 **
## HasSwimmingPool 2.149e+03 1.598e+02 13.448 < 2e-16 ***
## FreeWifi 5.493e+02 2.242e+02 2.450 0.014288 *
## FreeBreakfast 1.699e+02 1.232e+02 1.379 0.167941
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6600 on 13220 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1899
## F-statistic: 282.9 on 11 and 13220 DF, p-value: < 2.2e-16
Model2 <- RoomRent ~ RoomRent+StarRating+Airport+HotelCapacity+Population+IsNewYearEve+IsTouristDestination+IsMetroCity+HasSwimmingPool
fit2 <- lm(Model2, data = citi)
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared
## on the right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 1 in
## model.matrix: no columns are assigned
summary(fit2)
##
## Call:
## lm(formula = Model2, data = citi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11798 -2358 -704 1030 309571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.039e+03 3.561e+02 -22.576 < 2e-16 ***
## StarRating 3.613e+03 1.103e+02 32.742 < 2e-16 ***
## Airport 9.544e+00 2.711e+00 3.520 0.000432 ***
## HotelCapacity -1.055e+01 1.027e+00 -10.268 < 2e-16 ***
## Population -1.214e-04 2.261e-05 -5.372 7.92e-08 ***
## IsNewYearEve 8.429e+02 1.739e+02 4.847 1.27e-06 ***
## IsTouristDestination 1.900e+03 1.373e+02 13.843 < 2e-16 ***
## IsMetroCity -6.274e+02 2.132e+02 -2.943 0.003258 **
## HasSwimmingPool 2.133e+03 1.598e+02 13.354 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6602 on 13223 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.1895
## F-statistic: 387.6 on 8 and 13223 DF, p-value: < 2.2e-16
Model3 <- RoomRent ~CityRank+ RoomRent+StarRating+Airport+HotelCapacity+Population+IsNewYearEve+IsTouristDestination+IsMetroCity+HasSwimmingPool
fit3<- lm(Model3, data = citi)
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared
## on the right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 2 in
## model.matrix: no columns are assigned
summary(fit3)
##
## Call:
## lm(formula = Model3, data = citi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11785 -2357 -703 1033 309589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.077e+03 3.974e+02 -20.326 < 2e-16 ***
## CityRank 2.252e+00 1.033e+01 0.218 0.82750
## StarRating 3.611e+03 1.106e+02 32.648 < 2e-16 ***
## Airport 9.187e+00 3.166e+00 2.902 0.00372 **
## HotelCapacity -1.055e+01 1.027e+00 -10.270 < 2e-16 ***
## Population -1.154e-04 3.589e-05 -3.214 0.00131 **
## IsNewYearEve 8.429e+02 1.739e+02 4.847 1.27e-06 ***
## IsTouristDestination 1.889e+03 1.471e+02 12.842 < 2e-16 ***
## IsMetroCity -6.350e+02 2.161e+02 -2.939 0.00330 **
## HasSwimmingPool 2.139e+03 1.615e+02 13.240 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6602 on 13222 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.1894
## F-statistic: 344.5 on 9 and 13222 DF, p-value: < 2.2e-16
library(leaps)
leap1 <- regsubsets(Model3, data = citi, nbest=1)
## Warning in model.matrix.default(terms(formula, data = data), mm): the
## response appeared on the right-hand side and was dropped
## Warning in model.matrix.default(terms(formula, data = data), mm): problem
## with term 2 in model.matrix: no columns are assigned
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(Model3, data = citi, nbest = 1)
## 9 Variables (and intercept)
## Forced in Forced out
## CityRank FALSE FALSE
## StarRating FALSE FALSE
## Airport FALSE FALSE
## HotelCapacity FALSE FALSE
## Population FALSE FALSE
## IsNewYearEve FALSE FALSE
## IsTouristDestination FALSE FALSE
## IsMetroCity FALSE FALSE
## HasSwimmingPool FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## CityRank StarRating Airport HotelCapacity Population IsNewYearEve
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " "
## 3 ( 1 ) " " "*" " " " " "*" " "
## 4 ( 1 ) " " "*" " " " " "*" " "
## 5 ( 1 ) " " "*" " " "*" "*" " "
## 6 ( 1 ) " " "*" " " "*" "*" "*"
## 7 ( 1 ) " " "*" "*" "*" "*" "*"
## 8 ( 1 ) " " "*" "*" "*" "*" "*"
## IsTouristDestination IsMetroCity HasSwimmingPool
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) "*" " " " "
## 4 ( 1 ) "*" " " "*"
## 5 ( 1 ) "*" " " "*"
## 6 ( 1 ) "*" " " "*"
## 7 ( 1 ) "*" " " "*"
## 8 ( 1 ) "*" "*" "*"
plot(leap1, scale="adjr2")
plot(leap1, scale="bic")
summary(fit1)$adj.r.squared
## [1] 0.189885
summary(fit2)$adj.r.squared
## [1] 0.1894769
summary(fit3)$adj.r.squared
## [1] 0.1894185
As mentioned earlier, fit 3 is the best fit with lowest deviation and we stick to it as assumed in the beginning of the project and hence our goal is achieved.