Read Data using read.csv and view Hotel room pricing

Cities <- read.csv(paste("Cities42.csv", sep=""))
View(Cities)

Summary Statistics - mean, median, standard deviation

attach(Cities)
library(psych)
## Warning: package 'psych' was built under R version 3.4.3
describe(Cities)
##                      vars     n       mean         sd  median    trimmed
## CityName*               1 13232      18.07      11.72      16      17.29
## Population              2 13232 4416836.87 4258386.00 3046163 4040816.22
## CityRank                3 13232      14.83      13.51       9      13.30
## IsMetroCity             4 13232       0.28       0.45       0       0.23
## IsTouristDestination    5 13232       0.70       0.46       1       0.75
## IsWeekend               6 13232       0.62       0.48       1       0.65
## IsNewYearEve            7 13232       0.12       0.33       0       0.03
## Date*                   8 13232      14.30       2.69      14      14.39
## HotelName*              9 13232     841.19     488.16     827     841.18
## RoomRent               10 13232    5473.99    7333.12    4000    4383.33
## StarRating             11 13232       3.46       0.76       3       3.40
## Airport                12 13232      21.16      22.76      15      16.39
## HotelAddress*          13 13232    1202.53     582.17    1261    1233.25
## HotelPincode           14 13232  397430.26  259837.50  395003  388540.47
## HotelDescription*      15 13224     581.34     363.26     567     575.37
## FreeWifi               16 13232       0.93       0.26       1       1.00
## FreeBreakfast          17 13232       0.65       0.48       1       0.69
## HotelCapacity          18 13232      62.51      76.66      34      46.03
## HasSwimmingPool        19 13232       0.36       0.48       0       0.32
##                             mad      min      max      range  skew
## CityName*                 11.86      1.0       42       41.0  0.48
## Population           3846498.95   8096.0 12442373 12434277.0  0.68
## CityRank                  11.86      0.0       44       44.0  0.69
## IsMetroCity                0.00      0.0        1        1.0  0.96
## IsTouristDestination       0.00      0.0        1        1.0 -0.86
## IsWeekend                  0.00      0.0        1        1.0 -0.51
## IsNewYearEve               0.00      0.0        1        1.0  2.28
## Date*                      2.97      1.0       20       19.0 -0.77
## HotelName*               641.97      1.0     1670     1669.0  0.01
## RoomRent                2653.85    299.0   322500   322201.0 16.75
## StarRating                 0.74      0.0        5        5.0  0.48
## Airport                   11.12      0.2      124      123.8  2.73
## HotelAddress*            668.65      1.0     2108     2107.0 -0.37
## HotelPincode          257975.37 100025.0  7000157  6900132.0  9.99
## HotelDescription*        472.95      1.0     1226     1225.0  0.11
## FreeWifi                   0.00      0.0        1        1.0 -3.25
## FreeBreakfast              0.00      0.0        1        1.0 -0.62
## HotelCapacity             28.17      0.0      600      600.0  2.95
## HasSwimmingPool            0.00      0.0        1        1.0  0.60
##                      kurtosis       se
## CityName*               -0.88     0.10
## Population              -1.08 37019.65
## CityRank                -0.76     0.12
## IsMetroCity             -1.08     0.00
## IsTouristDestination    -1.26     0.00
## IsWeekend               -1.74     0.00
## IsNewYearEve             3.18     0.00
## Date*                    1.92     0.02
## HotelName*              -1.25     4.24
## RoomRent               582.06    63.75
## StarRating               0.25     0.01
## Airport                  7.89     0.20
## HotelAddress*           -0.88     5.06
## HotelPincode           249.76  2258.86
## HotelDescription*       -1.25     3.16
## FreeWifi                 8.57     0.00
## FreeBreakfast           -1.61     0.00
## HotelCapacity           11.39     0.67
## HasSwimmingPool         -1.64     0.00
summary(Cities)
##       CityName      Population          CityRank      IsMetroCity    
##  Delhi    :2048   Min.   :    8096   Min.   : 0.00   Min.   :0.0000  
##  Jaipur   : 768   1st Qu.:  744983   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mumbai   : 712   Median : 3046163   Median : 9.00   Median :0.0000  
##  Bangalore: 656   Mean   : 4416837   Mean   :14.83   Mean   :0.2842  
##  Goa      : 624   3rd Qu.: 8443675   3rd Qu.:24.00   3rd Qu.:1.0000  
##  Kochi    : 608   Max.   :12442373   Max.   :44.00   Max.   :1.0000  
##  (Other)  :7816                                                      
##  IsTouristDestination   IsWeekend       IsNewYearEve             Date     
##  Min.   :0.0000       Min.   :0.0000   Min.   :0.0000   Dec 21 2016:1611  
##  1st Qu.:0.0000       1st Qu.:0.0000   1st Qu.:0.0000   Dec 24 2016:1611  
##  Median :1.0000       Median :1.0000   Median :0.0000   Dec 25 2016:1611  
##  Mean   :0.6972       Mean   :0.6228   Mean   :0.1244   Dec 28 2016:1611  
##  3rd Qu.:1.0000       3rd Qu.:1.0000   3rd Qu.:0.0000   Dec 31 2016:1611  
##  Max.   :1.0000       Max.   :1.0000   Max.   :1.0000   Dec 18 2016:1608  
##                                                         (Other)    :3569  
##                   HotelName        RoomRent        StarRating   
##  Vivanta by Taj        :   32   Min.   :   299   Min.   :0.000  
##  Goldfinch Hotel       :   24   1st Qu.:  2436   1st Qu.:3.000  
##  OYO Rooms             :   24   Median :  4000   Median :3.000  
##  The Gordon House Hotel:   24   Mean   :  5474   Mean   :3.459  
##  Apnayt Villa          :   16   3rd Qu.:  6299   3rd Qu.:4.000  
##  Bentleys Hotel Colaba :   16   Max.   :322500   Max.   :5.000  
##  (Other)               :13096                                   
##     Airport      
##  Min.   :  0.20  
##  1st Qu.:  8.40  
##  Median : 15.00  
##  Mean   : 21.16  
##  3rd Qu.: 24.00  
##  Max.   :124.00  
##                  
##                                                                    HotelAddress  
##  The Mall, Shimla                                                        :   32  
##  #2-91/14/8, White Fields, Kondapur, Hitech City, Hyderabad, 500084 India:   16  
##  121, City Terrace, Walchand Hirachand Marg, Mumbai, Maharashtra         :   16  
##  14-4507/9, Balmatta Road, Near Jyothi Circle, Hampankatta               :   16  
##  144/7, Rajiv Gandi Salai (OMR), Kottivakkam, Chennai, Tamil Nadu        :   16  
##  17, Oliver Road, Colaba, Mumbai, Maharashtra                            :   16  
##  (Other)                                                                 :13120  
##   HotelPincode         HotelDescription    FreeWifi      FreeBreakfast   
##  Min.   : 100025   3           :  120   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 221001   Abc         :  112   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median : 395003   3-star hotel:  104   Median :1.0000   Median :1.0000  
##  Mean   : 397430   3.5         :   88   Mean   :0.9259   Mean   :0.6491  
##  3rd Qu.: 570001   4           :   72   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :7000157   (Other)     :12728   Max.   :1.0000   Max.   :1.0000  
##                    NA's        :    8                                    
##  HotelCapacity    HasSwimmingPool 
##  Min.   :  0.00   Min.   :0.0000  
##  1st Qu.: 16.00   1st Qu.:0.0000  
##  Median : 34.00   Median :0.0000  
##  Mean   : 62.51   Mean   :0.3558  
##  3rd Qu.: 75.00   3rd Qu.:1.0000  
##  Max.   :600.00   Max.   :1.0000  
## 

Create one-way contingency tables for the categorical variables in your dataset

Table <- with(Cities,table(CityName))
    Table
## CityName
##             Agra        Ahmedabad         Amritsar        Bangalore 
##              432              424              136              656 
##      Bhubaneswar       Chandigarh          Chennai       Darjeeling 
##              120              336              416              136 
##            Delhi          Gangtok              Goa         Guwahati 
##             2048              128              624               48 
##         Haridwar        Hyderabad           Indore           Jaipur 
##               48              536              160              768 
##        Jaisalmer          Jodhpur           Kanpur            Kochi 
##              264              224               16              608 
##          Kolkata          Lucknow          Madurai           Manali 
##              512              128              112              288 
##        Mangalore           Mumbai           Munnar           Mysore 
##              104              712              328              160 
##         Nainital             Ooty        Panchkula             Pune 
##              144              136               64              600 
##             Puri           Rajkot        Rishikesh           Shimla 
##               56              128               88              280 
##         Srinagar            Surat Thiruvanthipuram         Thrissur 
##               40               80              392               32 
##          Udaipur         Varanasi 
##              456              264
Table1 <- with(Cities,table(IsMetroCity))
    Table1
## IsMetroCity
##    0    1 
## 9472 3760
Table2 <- with(Cities,table(IsTouristDestination))
    Table2
## IsTouristDestination
##    0    1 
## 4007 9225
Table3 <- with(Cities,table(IsWeekend))
    Table3
## IsWeekend
##    0    1 
## 4991 8241
Table4 <- with(Cities,table(IsNewYearEve))
    Table4
## IsNewYearEve
##     0     1 
## 11586  1646
Table5 <- with(Cities,table(HasSwimmingPool))
    Table5
## HasSwimmingPool
##    0    1 
## 8524 4708

Create two-way contingency tables for the categorical variables in your dataset.

t1 <- xtabs(~ StarRating + IsMetroCity, data=Cities)
t1
##           IsMetroCity
## StarRating    0    1
##        0     16    0
##        1      8    0
##        2    344   96
##        2.5  456  176
##        3   4336 1617
##        3.2    8    0
##        3.3   16    0
##        3.4    8    0
##        3.5 1312  440
##        3.6    0    8
##        3.7   24    0
##        3.8   16    0
##        3.9   32    0
##        4   1696  767
##        4.1   24    0
##        4.3   16    0
##        4.4    8    0
##        4.5  288   88
##        4.7    8    0
##        4.8   16    0
##        5    840  568
t2 <- xtabs(~ FreeWifi+ FreeBreakfast, data=Cities)
t2
##         FreeBreakfast
## FreeWifi    0    1
##        0  606  375
##        1 4037 8214
t3 <- xtabs(~ IsTouristDestination + IsNewYearEve, data=Cities)
t3
##                     IsNewYearEve
## IsTouristDestination    0    1
##                    0 3504  503
##                    1 8082 1143

Draw a boxplot of the variables that belong to your study.

hist(Cities$Population, main= "Population" ,xlab="Population" ,col = "blue")

hist(Cities$StarRating, main= "Star rating" ,xlab="Star rating",col = "blue")

hist(Cities$Airport, main = "Distance to nearest major airport distribution", xlab = "Distance to nearest major Airport in km",col = "blue")

hist(Cities$HotelCapacity, main = "Capacity of hotels", xlab = "Hotel Capacity", col = "blue")

hist(Cities$CityRank, main = "Distribution of rank of cities", xlab = "City rank", col = "blue")

boxplot(Cities$CityRank , horizontal =TRUE, main="city rank",col = "Red" )

boxplot(Cities$IsMetroCity, horizontal =TRUE, main="Metro city and Tourist destination",col = "light green" )

boxplot(Cities$IsTouristDestination, horizontal =TRUE, main="Metro city and Tourist destination",col = "light green" )

#Corrgram of Cities

library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.3
corrgram(Cities, order = TRUE, upper.panel = panel.pie)

Create a correlation matrix.

 x<-Cities[,c("RoomRent","StarRating","Airport","HotelCapacity")]
    
y<-Cities[,c("RoomRent","StarRating","Airport","HotelCapacity")]
    cor(x,y)
##                 RoomRent  StarRating     Airport HotelCapacity
## RoomRent      1.00000000  0.36937343  0.04965324     0.1578733
## StarRating    0.36937343  1.00000000 -0.06091918     0.6374303
## Airport       0.04965324 -0.06091918  1.00000000    -0.1176721
## HotelCapacity 0.15787331  0.63743034 -0.11767207     1.0000000

Visualize your correlation matrix using corrgram.

corrgram(Cities[c("RoomRent","StarRating","Airport","HotelCapacity")], upper.panel = panel.pie)

Create a scatter plot matrix for your data set.

library(car)    
## Warning: package 'car' was built under R version 3.4.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(RoomRent~StarRating,     data=cities,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of Star Rating vs Room rent",
            ylab="Room Rent",
            xlab="Star Rating")

library(car)    
scatterplot(RoomRent~CityRank,     data=cities,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of City rank vs Room rent",
            ylab="Room Rent",
            xlab="Rank of City")

library(car)    
scatterplot(RoomRent~ HasSwimmingPool, data=Cities,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of hotel with Swimming pool vs Room rent",
            ylab="Room Rent",
            xlab="Hotel with swimming pool")

library(car)
scatterplotMatrix(formula = ~ RoomRent + IsWeekend + IsNewYearEve +Airport , data = Cities, pch = 16)
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

T-Test

test1: Null Hypothesis - Their is no Difference between the Room Rent on weekdays and on weekends

t.test(Cities$RoomRent ~ Cities$IsWeekend)
## 
##  Welch Two Sample t-test
## 
## data:  Cities$RoomRent by Cities$IsWeekend
## t = -0.51853, df = 9999.4, p-value = 0.6041
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -331.2427  192.6559
## sample estimates:
## mean in group 0 mean in group 1 
##        5430.835        5500.129

As we can see the P-Value = 0.6 (>0.05) , We Fail To reject the Null Hypothesis.

It Shows that Their is No Significant Difference Between the Room rents on Weekdays and Weekends.

test2: Null Hypothesis - Their is no Difference between the Room Rent on new year’s eve and on other days

t.test(Cities$RoomRent ~ Cities$IsNewYearEve)
## 
##  Welch Two Sample t-test
## 
## data:  Cities$RoomRent by Cities$IsNewYearEve
## t = -4.1793, df = 2065, p-value = 3.046e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1256.5297  -453.9099
## sample estimates:
## mean in group 0 mean in group 1 
##        5367.606        6222.826

P-Value = 3.046e-05 (<0.05) Which is small enough for Rejecting the Null Hupothesis.

Hence there is significant difference between the Room Rent on new year’s eve and on other days

test3: Null Hypothesis - Their is no Difference between the Room Rent of Metro Cities and other cities

t.test(Cities$RoomRent ~ Cities$IsMetroCity)
## 
##  Welch Two Sample t-test
## 
## data:  Cities$RoomRent by Cities$IsMetroCity
## t = 10.721, df = 13224, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   888.0308 1285.4102
## sample estimates:
## mean in group 0 mean in group 1 
##        5782.794        4696.073

P-Value = 2.2e-16 (<0.05) Which is small enough for Rejecting the Null Hupothesis.

Hence there is significant difference between the Room Rent of Metro Cities and other cities

test4: Null Hypothesis - Their is no Difference between the Room Rent where wifi is free and other rooms.

t.test(Cities$RoomRent ~ Cities$FreeWifi)
## 
##  Welch Two Sample t-test
## 
## data:  Cities$RoomRent by Cities$FreeWifi
## t = -0.76847, df = 1804.7, p-value = 0.4423
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -360.5977  157.5701
## sample estimates:
## mean in group 0 mean in group 1 
##        5380.004        5481.518

As we can see the P-Value = 0.44 (>0.05) , We Fail To reject the Null Hypothesis.

It Shows that Their is No Significant Difference Between the Room Rent where wifi is free and other rooms.

test5: Null Hypothesis: Their is no difference in the means of room Rent where free Breakfast is available or not

t.test(Cities$RoomRent ~ Cities$FreeBreakfast)
## 
##  Welch Two Sample t-test
## 
## data:  Cities$RoomRent by Cities$FreeBreakfast
## t = 0.98095, df = 6212.3, p-value = 0.3267
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -153.5017  460.9935
## sample estimates:
## mean in group 0 mean in group 1 
##        5573.790        5420.044

The difference between The two means is not different as p-value = 0.32 (>0.05) so we fail to reject the Null hypothesis.

It Means The Room rents Are same for all room whether free Breakfast is available or not.

test6: null Hypothesis: The correlation between RoomRent and Star Rating is Zero

cor.test(Cities$RoomRent , Cities$StarRating)
## 
##  Pearson's product-moment correlation
## 
## data:  Cities$RoomRent and Cities$StarRating
## t = 45.719, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3545660 0.3839956
## sample estimates:
##       cor 
## 0.3693734

We Can clearly see that the corrrelation between RoomRent and StarRating = 36.9% or approx 37%.and this relation is significant because the p-value = 2.2e-16 (<0.05) which suggests us to reject the Null hypothesis.

test7: Null hypothesis: The correlation between RoomRent and Distance to the airport is Zero

cor.test(Cities$RoomRent , Cities$Airport)
## 
##  Pearson's product-moment correlation
## 
## data:  Cities$RoomRent and Cities$Airport
## t = 5.7183, df = 13230, p-value = 1.099e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03264192 0.06663581
## sample estimates:
##        cor 
## 0.04965324

We can see that the correlation is small 5% But it is significant as the p-value < 0.05.

test8: null Hypothesis: The correlation between RoomRent and Hotel Capacity is Zero

cor.test(Cities$RoomRent , Cities$HotelCapacity)
## 
##  Pearson's product-moment correlation
## 
## data:  Cities$RoomRent and Cities$HotelCapacity
## t = 18.389, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1412142 0.1744430
## sample estimates:
##       cor 
## 0.1578733

As the p-value = 2.2e-16 (<0.05) We can reject the Null hypothesis that Means Their exists a significant relation between Room Rent and Hotel Capacity and as we see it is approx 16%

test9: Null Hypothesis: Their is no difference in the means of room Rent where Swimming Pool is available or not

t.test(Cities$RoomRent~ Cities$HasSwimmingPool)
## 
##  Welch Two Sample t-test
## 
## data:  Cities$RoomRent by Cities$HasSwimmingPool
## t = -29.013, df = 5011.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5096.030 -4450.942
## sample estimates:
## mean in group 0 mean in group 1 
##        3775.566        8549.052

The difference between The two means is different as p-value = 2.2e-16 (<0.05) so we reject the Null hypothesis. It Means The Room rents have effect whether the hotel has the swimming pool or not.

linear regression

let us assume that We create a model that shows how the variables effect the Room Rent

Our model will be like y = B0 + B1x1 + B2x2 + B3x3.. + E

y - Room Rent (dependent variable). B0 - intercept. B1, B2, B3 .- Beta coefficients for different variables.And x1, x2, x3. x1, x2, x3 - StarRating, Dist to airport, FreeWifi, etc (independent variables).

E - error term.

Model1  <- lm(RoomRent ~ StarRating + Airport + FreeWifi + FreeBreakfast + HotelCapacity + HasSwimmingPool, data = Cities)
summary(Model1)
## 
## Call:
## lm(formula = RoomRent ~ StarRating + Airport + FreeWifi + FreeBreakfast + 
##     HotelCapacity + HasSwimmingPool, data = Cities)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10783  -2286   -875    967 310387 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -7455.073    396.215 -18.816   <2e-16 ***
## StarRating       3519.285    111.754  31.491   <2e-16 ***
## Airport            25.627      2.604   9.840   <2e-16 ***
## FreeWifi          227.843    226.177   1.007    0.314    
## FreeBreakfast     -59.313    123.964  -0.478    0.632    
## HotelCapacity     -14.786      1.009 -14.660   <2e-16 ***
## HasSwimmingPool  2714.146    158.681  17.104   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6687 on 13225 degrees of freedom
## Multiple R-squared:  0.1689, Adjusted R-squared:  0.1685 
## F-statistic: 447.9 on 6 and 13225 DF,  p-value: < 2.2e-16

Final model

let us assume that We create a model that shows how the variables effect the Room Rent

Our model will be like y = B0 + B1x1 + B2x2 + B3x3 + E

y - Room Rent (dependent variable). B0 - intercept. B1, B2, B3 . - Beta coefficients for different variables. x1, x2, x3. x1, x2, x3 - dates, external factors, internal factors (independent variables). E - error term.

Model2 <- lm(RoomRent ~ IsNewYearEve + IsMetroCity + IsTouristDestination + StarRating + Airport + HotelCapacity + HasSwimmingPool, data = Cities)
summary(Model2)
## 
## Call:
## lm(formula = RoomRent ~ IsNewYearEve + IsMetroCity + IsTouristDestination + 
##     StarRating + Airport + HotelCapacity + HasSwimmingPool, data = Cities)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11621  -2342   -706   1039 309463 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8362.318    351.306 -23.804  < 2e-16 ***
## IsNewYearEve           843.295    174.085   4.844 1.29e-06 ***
## IsMetroCity          -1502.844    137.569 -10.924  < 2e-16 ***
## IsTouristDestination  2074.969    133.499  15.543  < 2e-16 ***
## StarRating            3583.270    110.317  32.482  < 2e-16 ***
## Airport                 11.057      2.699   4.096 4.22e-05 ***
## HotelCapacity          -11.252      1.020 -11.030  < 2e-16 ***
## HasSwimmingPool       2211.800    159.259  13.888  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6609 on 13224 degrees of freedom
## Multiple R-squared:  0.1882, Adjusted R-squared:  0.1878 
## F-statistic:   438 on 7 and 13224 DF,  p-value: < 2.2e-16

Conclusion

Hotel pricing is mainly dependent on factors like

new year events,

whether the hotel is in metro city or not,

whether the hotel is in tourist place or not,

Star rating of hotel,

Hotel capacity,

whether it has swimming pool or not,

And distance between Hotel and Airport.

we can summarize it as:

Room Rent = f(NewYearseve, IsMetroCity, IsTouristDestination, StarRating, Distance from the airport, HotelCapacity and HasSwimmingPool).

END OF THE PROJECT