Introduction

This project relates to the hotel pricing analysis. We have 19 variables depending on which we base our analysis. The variables include where the hotel is located and some other details regarding the featres of the hotel. We try to make a meaningful analysis of all these variables and find out which factors do affect the pricing and which do not. Do cities have higher pricing? Does the availability of wifi and breakfast for free change pricing? Lets find out

Overview

The study collects data related to various hotels in 42 different cities. The sampling is done randomly. The hotels are at various places and data belongs to all types of hotels. The study considers information about room rent, whether the booking is done on a new year eve or on a weekend. It also collects the dates in which the bookings were done so that we can establish a pattern. The sole motive of this project is to do an analysis of which variables affect room booking and in particular the room rent and find the variables impacting the room rent the most so that we can increase or reduce those factors to optimise the bookings. From the customer perspective, greater failities for better or lesser rent is always preferable.

Data

The data is collected from an open source https://in.hotels.com/ in October 2016. It is a combined data set for 42 cities. It has 19 variables out of which some are of text form, others are dummy, some boolean and some which are in numeric form. Some of the dummy variables are hotel address,hotel description, pincode which do not affect the pricing from an initial perspective. Also some boolean variables are Is the hotel located in a metro city or not, Is the booking done on a weekend or not, Is the eve close to New Year or not. The major variables on which we base our analysis are the rating of the hotel - given out of 5 stars, Rent of the particular room, Distance from airport, hote capacity, does the hotel have a swimming pool or not. We also have the data about the date on which the booking is made(for 8 particular dates), the population of the city in which the boking has been made and several other objective factors. We would finally obtain the key variables which decide the rent.

Model

RoomRent = a.CityRank+ b.RoomRent+ c.StarRating+ d.Airport+ e.HotelCapacity+ f.Population+ g.IsNewYearEve+ h.IsTouristDestination+ j.IsMetroCity+ k.HasSwimmingPool

Results

For this we get Adjusted R-squared: 0.1894 which is the least among all the 3 models. The p-value is close to 2.2e-16 which is much much lesser than 0.05. Hence we deem this to be the best fit linear regression model.

Conclusion

The project was aimed to understand room rent pricing and we can conclude that the following factors affect the pricing the most in respective order: City Rank, star Rating, Hotel Capacity, Is New Year Eve, Is Tourism Destination, Has Swimming Pool, followed by distance To Airport, Population and the city being metro or not.

References

https://in.hotels.com/ https://drive.google.com/open?id=14knKqDBPBWm5ag6YJuMWD3-BZ7JiHjumIdK5KIbSF5Q

1. Reading the raw data into a dataframe

citi <- read.csv("file:///C:/Users/ABID/Downloads/Cities42.csv")
attach(citi)
library(psych)
describe(citi)
##                      vars     n       mean         sd  median    trimmed
## CityName*               1 13232      18.07      11.72      16      17.29
## Population              2 13232 4416836.87 4258386.00 3046163 4040816.22
## CityRank                3 13232      14.83      13.51       9      13.30
## IsMetroCity             4 13232       0.28       0.45       0       0.23
## IsTouristDestination    5 13232       0.70       0.46       1       0.75
## IsWeekend               6 13232       0.62       0.48       1       0.65
## IsNewYearEve            7 13232       0.12       0.33       0       0.03
## Date*                   8 13232      14.30       2.69      14      14.39
## HotelName*              9 13232     841.19     488.16     827     841.18
## RoomRent               10 13232    5473.99    7333.12    4000    4383.33
## StarRating             11 13232       3.46       0.76       3       3.40
## Airport                12 13232      21.16      22.76      15      16.39
## HotelAddress*          13 13232    1202.53     582.17    1261    1233.25
## HotelPincode           14 13232  397430.26  259837.50  395003  388540.47
## HotelDescription*      15 13224     581.34     363.26     567     575.37
## FreeWifi               16 13232       0.93       0.26       1       1.00
## FreeBreakfast          17 13232       0.65       0.48       1       0.69
## HotelCapacity          18 13232      62.51      76.66      34      46.03
## HasSwimmingPool        19 13232       0.36       0.48       0       0.32
##                             mad      min      max      range  skew
## CityName*                 11.86      1.0       42       41.0  0.48
## Population           3846498.95   8096.0 12442373 12434277.0  0.68
## CityRank                  11.86      0.0       44       44.0  0.69
## IsMetroCity                0.00      0.0        1        1.0  0.96
## IsTouristDestination       0.00      0.0        1        1.0 -0.86
## IsWeekend                  0.00      0.0        1        1.0 -0.51
## IsNewYearEve               0.00      0.0        1        1.0  2.28
## Date*                      2.97      1.0       20       19.0 -0.77
## HotelName*               641.97      1.0     1670     1669.0  0.01
## RoomRent                2653.85    299.0   322500   322201.0 16.75
## StarRating                 0.74      0.0        5        5.0  0.48
## Airport                   11.12      0.2      124      123.8  2.73
## HotelAddress*            668.65      1.0     2108     2107.0 -0.37
## HotelPincode          257975.37 100025.0  7000157  6900132.0  9.99
## HotelDescription*        472.95      1.0     1226     1225.0  0.11
## FreeWifi                   0.00      0.0        1        1.0 -3.25
## FreeBreakfast              0.00      0.0        1        1.0 -0.62
## HotelCapacity             28.17      0.0      600      600.0  2.95
## HasSwimmingPool            0.00      0.0        1        1.0  0.60
##                      kurtosis       se
## CityName*               -0.88     0.10
## Population              -1.08 37019.65
## CityRank                -0.76     0.12
## IsMetroCity             -1.08     0.00
## IsTouristDestination    -1.26     0.00
## IsWeekend               -1.74     0.00
## IsNewYearEve             3.18     0.00
## Date*                    1.92     0.02
## HotelName*              -1.25     4.24
## RoomRent               582.06    63.75
## StarRating               0.25     0.01
## Airport                  7.89     0.20
## HotelAddress*           -0.88     5.06
## HotelPincode           249.76  2258.86
## HotelDescription*       -1.25     3.16
## FreeWifi                 8.57     0.00
## FreeBreakfast           -1.61     0.00
## HotelCapacity           11.39     0.67
## HasSwimmingPool         -1.64     0.00
plot(~RoomRent + CityRank, main="RoomRent vs City Rank")
abline(0,1)

plot(~RoomRent + Airport, main="RoomRent vs Distance from airport")
abline(0,1)

plot(~RoomRent + HotelCapacity, main="RoomRent vs HotelCapacity")
abline(0,1)

plot(~RoomRent + IsNewYearEve, main="RoomRent vs New year eve or not")
abline(0,1)

library(lattice)
histogram(~HotelCapacity, data = citi,
          main = "Distribution of Room Rent", col='gray' ) 

library(lattice)
histogram(~StarRating, data = citi,
          main = "Distribution of Room Rent", col='gray' ) 

library(lattice)
histogram(~HasSwimmingPool, data = citi,
          main = "Distribution of Room Rent", col='gray' )

boxplot(RoomRent~StarRating,data=citi, main="variation")

boxplot(RoomRent~CityName,data=citi, main="variation") 

boxplot(RoomRent~CityRank,data=citi, main="variation") 

boxplot(RoomRent~FreeBreakfast,data=citi, main="variation") 

boxplot(RoomRent~HotelPincode,data=citi, main="variation") 

Scatterplot Matrices from the car Package

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(~ RoomRent+StarRating+Airport+CityRank,
                  data=citi
                  )

scatterplotMatrix(~RoomRent+HotelCapacity+Population, data=citi
                  )

CORRELATION MATRIX

library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
col1 <- c("RoomRent","StarRating","Airport","CityRank")
corMatrix <- rcorr(as.matrix(citi[,col1]))
corMatrix
##            RoomRent StarRating Airport CityRank
## RoomRent       1.00       0.37    0.05     0.09
## StarRating     0.37       1.00   -0.06    -0.13
## Airport        0.05      -0.06    1.00     0.51
## CityRank       0.09      -0.13    0.51     1.00
## 
## n= 13232 
## 
## 
## P
##            RoomRent StarRating Airport CityRank
## RoomRent             0          0       0      
## StarRating  0                   0       0      
## Airport     0        0                  0      
## CityRank    0        0          0
col2 <- c("RoomRent","HotelCapacity","Population")
corMatrix2 <- rcorr(as.matrix(citi[,col2]))
corMatrix2
##               RoomRent HotelCapacity Population
## RoomRent          1.00          0.16      -0.09
## HotelCapacity     0.16          1.00       0.26
## Population       -0.09          0.26       1.00
## 
## n= 13232 
## 
## 
## P
##               RoomRent HotelCapacity Population
## RoomRent                0             0        
## HotelCapacity  0                      0        
## Population     0        0

CORRGRAM

library(Hmisc)
library(car)
library(corrgram)
colairlines <- c("RoomRent","HotelCapacity","Population","StarRating","Airport","CityRank")
corrgram(citi[,colairlines], order=TRUE,
         main="Hotel Room Booking",
         lower.panel=panel.pts, upper.panel=panel.pie,
         diag.panel=panel.minmax, text.panel=panel.txt)

T-TESTs

9a. t-test of RoomRent, HotelCapacity

t.test(RoomRent,HotelCapacity)
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent and HotelCapacity
## t = 84.882, df = 13234, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5286.515 5536.445
## sample estimates:
##  mean of x  mean of y 
## 5473.99184   62.51164
  • Since the p-value is less than 0.05, we fail to reject the null hypothesis that they are equal

9b. t-test of RoomRent, Population

t.test(RoomRent, Population)
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent and Population
## t = -119.16, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4483927 -4338799
## sample estimates:
##   mean of x   mean of y 
##    5473.992 4416836.872

9c. Perfroming t-test on RoomRent, StarRating

t.test(RoomRent, StarRating)
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent and StarRating
## t = 85.813, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5345.575 5595.491
## sample estimates:
##   mean of x   mean of y 
## 5473.991838    3.458933
Model1 <- RoomRent ~ RoomRent+StarRating+Airport+HotelCapacity+Population+IsWeekend+IsNewYearEve+IsTouristDestination+IsMetroCity+HasSwimmingPool+FreeWifi+FreeBreakfast
fit1 <- lm(Model1, data = citi)
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared
## on the right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 1 in
## model.matrix: no columns are assigned
summary(fit1)
## 
## Call:
## lm(formula = Model1, data = citi)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11838  -2356   -685   1025 309676 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.573e+03  4.143e+02 -20.694  < 2e-16 ***
## StarRating            3.594e+03  1.105e+02  32.522  < 2e-16 ***
## Airport               9.797e+00  2.720e+00   3.601 0.000318 ***
## HotelCapacity        -1.027e+01  1.033e+00  -9.943  < 2e-16 ***
## Population           -1.237e-04  2.263e-05  -5.465 4.72e-08 ***
## IsWeekend            -9.039e+01  1.238e+02  -0.730 0.465460    
## IsNewYearEve          8.825e+02  1.818e+02   4.854 1.22e-06 ***
## IsTouristDestination  1.935e+03  1.381e+02  14.012  < 2e-16 ***
## IsMetroCity          -6.580e+02  2.137e+02  -3.079 0.002078 ** 
## HasSwimmingPool       2.149e+03  1.598e+02  13.448  < 2e-16 ***
## FreeWifi              5.493e+02  2.242e+02   2.450 0.014288 *  
## FreeBreakfast         1.699e+02  1.232e+02   1.379 0.167941    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6600 on 13220 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1899 
## F-statistic: 282.9 on 11 and 13220 DF,  p-value: < 2.2e-16
Model2 <- RoomRent ~ RoomRent+StarRating+Airport+HotelCapacity+Population+IsNewYearEve+IsTouristDestination+IsMetroCity+HasSwimmingPool
fit2 <- lm(Model2, data = citi)
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared
## on the right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 1 in
## model.matrix: no columns are assigned
summary(fit2)
## 
## Call:
## lm(formula = Model2, data = citi)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11798  -2358   -704   1030 309571 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.039e+03  3.561e+02 -22.576  < 2e-16 ***
## StarRating            3.613e+03  1.103e+02  32.742  < 2e-16 ***
## Airport               9.544e+00  2.711e+00   3.520 0.000432 ***
## HotelCapacity        -1.055e+01  1.027e+00 -10.268  < 2e-16 ***
## Population           -1.214e-04  2.261e-05  -5.372 7.92e-08 ***
## IsNewYearEve          8.429e+02  1.739e+02   4.847 1.27e-06 ***
## IsTouristDestination  1.900e+03  1.373e+02  13.843  < 2e-16 ***
## IsMetroCity          -6.274e+02  2.132e+02  -2.943 0.003258 ** 
## HasSwimmingPool       2.133e+03  1.598e+02  13.354  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6602 on 13223 degrees of freedom
## Multiple R-squared:   0.19,  Adjusted R-squared:  0.1895 
## F-statistic: 387.6 on 8 and 13223 DF,  p-value: < 2.2e-16
Model3 <- RoomRent ~CityRank+ RoomRent+StarRating+Airport+HotelCapacity+Population+IsNewYearEve+IsTouristDestination+IsMetroCity+HasSwimmingPool
fit3<- lm(Model3, data = citi)
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared
## on the right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 2 in
## model.matrix: no columns are assigned
summary(fit3)
## 
## Call:
## lm(formula = Model3, data = citi)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11785  -2357   -703   1033 309589 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.077e+03  3.974e+02 -20.326  < 2e-16 ***
## CityRank              2.252e+00  1.033e+01   0.218  0.82750    
## StarRating            3.611e+03  1.106e+02  32.648  < 2e-16 ***
## Airport               9.187e+00  3.166e+00   2.902  0.00372 ** 
## HotelCapacity        -1.055e+01  1.027e+00 -10.270  < 2e-16 ***
## Population           -1.154e-04  3.589e-05  -3.214  0.00131 ** 
## IsNewYearEve          8.429e+02  1.739e+02   4.847 1.27e-06 ***
## IsTouristDestination  1.889e+03  1.471e+02  12.842  < 2e-16 ***
## IsMetroCity          -6.350e+02  2.161e+02  -2.939  0.00330 ** 
## HasSwimmingPool       2.139e+03  1.615e+02  13.240  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6602 on 13222 degrees of freedom
## Multiple R-squared:   0.19,  Adjusted R-squared:  0.1894 
## F-statistic: 344.5 on 9 and 13222 DF,  p-value: < 2.2e-16
library(leaps)
leap1 <- regsubsets(Model3, data = citi, nbest=1)
## Warning in model.matrix.default(terms(formula, data = data), mm): the
## response appeared on the right-hand side and was dropped
## Warning in model.matrix.default(terms(formula, data = data), mm): problem
## with term 2 in model.matrix: no columns are assigned
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(Model3, data = citi, nbest = 1)
## 9 Variables  (and intercept)
##                      Forced in Forced out
## CityRank                 FALSE      FALSE
## StarRating               FALSE      FALSE
## Airport                  FALSE      FALSE
## HotelCapacity            FALSE      FALSE
## Population               FALSE      FALSE
## IsNewYearEve             FALSE      FALSE
## IsTouristDestination     FALSE      FALSE
## IsMetroCity              FALSE      FALSE
## HasSwimmingPool          FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          CityRank StarRating Airport HotelCapacity Population IsNewYearEve
## 1  ( 1 ) " "      "*"        " "     " "           " "        " "         
## 2  ( 1 ) "*"      "*"        " "     " "           " "        " "         
## 3  ( 1 ) " "      "*"        " "     " "           "*"        " "         
## 4  ( 1 ) " "      "*"        " "     " "           "*"        " "         
## 5  ( 1 ) " "      "*"        " "     "*"           "*"        " "         
## 6  ( 1 ) " "      "*"        " "     "*"           "*"        "*"         
## 7  ( 1 ) " "      "*"        "*"     "*"           "*"        "*"         
## 8  ( 1 ) " "      "*"        "*"     "*"           "*"        "*"         
##          IsTouristDestination IsMetroCity HasSwimmingPool
## 1  ( 1 ) " "                  " "         " "            
## 2  ( 1 ) " "                  " "         " "            
## 3  ( 1 ) "*"                  " "         " "            
## 4  ( 1 ) "*"                  " "         "*"            
## 5  ( 1 ) "*"                  " "         "*"            
## 6  ( 1 ) "*"                  " "         "*"            
## 7  ( 1 ) "*"                  " "         "*"            
## 8  ( 1 ) "*"                  "*"         "*"
plot(leap1, scale="adjr2")

plot(leap1, scale="bic")

summary(fit1)$adj.r.squared
## [1] 0.189885
summary(fit2)$adj.r.squared
## [1] 0.1894769
summary(fit3)$adj.r.squared
## [1] 0.1894185

Ending Statement

As mentioned earlier, fit 3 is the best fit with lowest deviation and we stick to it as assumed in the beginning of the project and hence our goal is achieved.