OBJECTIVE- To identify the factors that affect hotel prices in the Indian hotel industry, using data on hotels in 42 cities collected on 8 different dates and covering both internal and external factors.
MOTIVATION OF THE STUDY
It is a common phenomenon that hotel prices vary, not only in India but across the globe. Why is this so? What prompts hotel owners to charge differently, and what motivates a tourist to pay more for some hotels, or for a hotel at a particular place? In this report we analyze which independent factors contribute to this price difference. Using the data, graphs and regression analysis, we try to identify the potential factors affecting price behaviour in the hotel industry.
DATA DESCRIPTION and DATA SOURCE
The data is available at https://in.hotels.com/.
Size: 2523KB, 13232 observations of 19 variables.
Attributes:
Notice that the dataset tracks hotel prices on 8 different dates at different hotels across different cities.
Dependent Variable
RoomRent <- Rent for the cheapest room, double occupancy, in Indian Rupees.
Independent Variables
External Factors
Date <- We have hotel room rent data for the following 8 dates for each hotel: {Dec 31, Dec 25, Dec 24, Dec 18, Dec 21, Dec 28, Jan 4, Jan 8}
IsWeekend <- We use ‘0’ to indicate week days, ‘1’ to indicate weekend dates (Sat / Sun)
IsNewYearEve <- ‘1’ for Dec 31, ‘0’ otherwise
CityName <- Name of the City where the Hotel is located e.g. Mumbai
Population <- Population of the City in 2011
CityRank <- Rank order of City by Population (e.g. Mumbai = 0, Delhi = 1, so on)
IsMetroCity <- ‘1’ if CityName is {Mumbai, Delhi, Kolkata, Chennai}, ‘0’ otherwise
IsTouristDestination <- We use ‘1’ if the city is primarily a tourist destination, ‘0’ otherwise.
Internal Factors
Many hotel features can influence the RoomRent. The dataset captures some of these internal factors, as explained below.
HotelName <- e.g. Park Hyatt Goa Resort and Spa
StarRating <- e.g. 5
Airport <- Distance between Hotel and closest major Airport
HotelAddress <- e.g. Arrossim Beach, Cansaulim, Goa
HotelPincode <- 403712
HotelDescription <- e.g. 5-star beachfront resort with spa, near Arossim Beach
FreeWifi <- ‘1’ if the hotel offers Free Wifi, ‘0’ otherwise
FreeBreakfast <- ‘1’ if the hotel offers Free Breakfast, ‘0’ otherwise
HotelCapacity <- e.g. 242. (enter ‘0’ if not available)
HasSwimmingPool <- ‘1’ if they have a swimming pool, ‘0’ otherwise
ABSTRACT
To investigate the factors affecting the pricing strategy of the hotel industry, we analyzed the available dataset using correlation tests and regression analysis with a best-fit model. We also visualized the data using boxplots, scatterplots and a correlogram. Some findings from the visualization are:
The hotels in Jodhpur, Udaipur, Goa and Srinagar are the most expensive of all. Room rent is higher for hotels with a higher star rating. Prices are on the higher side from 28 December to 3 January.
On the basis of the correlation test, the only variable that comes out as insignificant is FreeBreakfast.
Running the regression with the best-fit model, we found that the insignificant variables are CityRank, IsWeekend and FreeBreakfast. StarRating positively affects RoomRent. Tourist destinations attract more tourists, and hotels in these areas are more expensive. A swimming pool also drives prices up.
We used adjusted R-squared and AIC to determine the best-fit model: a higher adjusted R-squared and a lower AIC indicate a better model.
RESEARCH OBJECTIVE AND METHODOLOGY
Empirical evidence based on data is generally considered stronger than other kinds of evidence. Let us investigate the factors contributing to the pricing strategy of the hotel industry.
OBJECTIVE OF RESEARCH
1- To test the hypothesis that hotel prices differ according to whether the city is a tourist destination.
2- If yes, what factors affect the pricing strategy of the hotels?
We read the dataset into a data frame called hotel and summarize it. We use boxplots and scatterplots to visualize the data and look for relationships between the variables. We use a correlation matrix to measure the correlation between the variables concerned; a correlogram and corrplot depict these relationships graphically.
To find the significant factors affecting the price difference, we use correlation tests.
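The idea of a correlation test can be sketched as follows. This is only an illustration: the data frame toy and its values are made up and are not taken from Cities42.csv.

```r
# Hypothetical toy data (not from Cities42.csv): rent rises with star rating.
toy <- data.frame(
  StarRating = c(2, 3, 3, 3.5, 4, 4, 4.5, 5, 5, 5),
  RoomRent   = c(1500, 2500, 2800, 3900, 5200, 6100, 7000, 9500, 11000, 12500)
)

# Pearson correlation test: the null hypothesis is that the true correlation is zero.
ct <- cor.test(toy$RoomRent, toy$StarRating)

ct$estimate  # sample correlation coefficient
ct$p.value   # a small p-value suggests the correlation differs from zero
```

A variable whose test gives a large p-value would be treated as insignificant with respect to RoomRent.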
Lastly, we use regression models. We list some methods for choosing the best-fit model and use the stepwise regression method to strengthen our analysis.
Our variable of concern is RoomRent. On the basis of p-values we conclude which factors affect the dependent variable RoomRent.
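As a rough sketch of how stepwise selection with AIC and adjusted R-squared might look, the example below uses synthetic data. The column names mirror the report, but the values, the coefficients and the choice of backward selection are assumptions made for illustration only.

```r
# Hypothetical synthetic data; column names mirror the report, values do not.
set.seed(42)
n <- 200
toy <- data.frame(
  StarRating      = sample(c(2, 3, 4, 5), n, replace = TRUE),
  HasSwimmingPool = rbinom(n, 1, 0.4),
  FreeBreakfast   = rbinom(n, 1, 0.6)
)
# Rent depends on rating and pool only; FreeBreakfast is pure noise here.
toy$RoomRent <- 1000 + 1500 * toy$StarRating + 2000 * toy$HasSwimmingPool +
  rnorm(n, sd = 800)

full <- lm(RoomRent ~ StarRating + HasSwimmingPool + FreeBreakfast, data = toy)

# Backward stepwise selection: drop a term whenever doing so lowers AIC.
best <- step(full, direction = "backward", trace = 0)

summary(best)$adj.r.squared  # adjusted R-squared of the selected model
AIC(full)                    # AIC of the full model
AIC(best)                    # AIC of the selected model (never larger than the full model's)
```

The p-values reported by summary(best) can then be read to decide which of the retained factors are significant.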
Reading the dataset in R by creating a dataframe called hotel.
hotel <- read.csv("Cities42.csv")
View(hotel)
Dimension of the dataset
dim(hotel)
## [1] 13232 19
There are 13232 rows and 19 columns
Summarizing the entire dataset
library(psych)
summary(hotel)
## CityName Population CityRank IsMetroCity
## Delhi :2048 Min. : 8096 Min. : 0.00 Min. :0.0000
## Jaipur : 768 1st Qu.: 744983 1st Qu.: 2.00 1st Qu.:0.0000
## Mumbai : 712 Median : 3046163 Median : 9.00 Median :0.0000
## Bangalore: 656 Mean : 4416837 Mean :14.83 Mean :0.2842
## Goa : 624 3rd Qu.: 8443675 3rd Qu.:24.00 3rd Qu.:1.0000
## Kochi : 608 Max. :12442373 Max. :44.00 Max. :1.0000
## (Other) :7816
## IsTouristDestination IsWeekend IsNewYearEve Date
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Dec 21 2016:1611
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 Dec 24 2016:1611
## Median :1.0000 Median :1.0000 Median :0.0000 Dec 25 2016:1611
## Mean :0.6972 Mean :0.6228 Mean :0.1244 Dec 28 2016:1611
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 Dec 31 2016:1611
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Dec 18 2016:1608
## (Other) :3569
## HotelName RoomRent StarRating
## Vivanta by Taj : 32 Min. : 299 Min. :0.000
## Goldfinch Hotel : 24 1st Qu.: 2436 1st Qu.:3.000
## OYO Rooms : 24 Median : 4000 Median :3.000
## The Gordon House Hotel: 24 Mean : 5474 Mean :3.459
## Apnayt Villa : 16 3rd Qu.: 6299 3rd Qu.:4.000
## Bentleys Hotel Colaba : 16 Max. :322500 Max. :5.000
## (Other) :13096
## Airport
## Min. : 0.20
## 1st Qu.: 8.40
## Median : 15.00
## Mean : 21.16
## 3rd Qu.: 24.00
## Max. :124.00
##
## HotelAddress
## The Mall, Shimla : 32
## #2-91/14/8, White Fields, Kondapur, Hitech City, Hyderabad, 500084 India: 16
## 121, City Terrace, Walchand Hirachand Marg, Mumbai, Maharashtra : 16
## 14-4507/9, Balmatta Road, Near Jyothi Circle, Hampankatta : 16
## 144/7, Rajiv Gandi Salai (OMR), Kottivakkam, Chennai, Tamil Nadu : 16
## 17, Oliver Road, Colaba, Mumbai, Maharashtra : 16
## (Other) :13120
## HotelPincode HotelDescription FreeWifi FreeBreakfast
## Min. : 100025 3 : 120 Min. :0.0000 Min. :0.0000
## 1st Qu.: 221001 Abc : 112 1st Qu.:1.0000 1st Qu.:0.0000
## Median : 395003 3-star hotel: 104 Median :1.0000 Median :1.0000
## Mean : 397430 3.5 : 88 Mean :0.9259 Mean :0.6491
## 3rd Qu.: 570001 4 : 72 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :7000157 (Other) :12728 Max. :1.0000 Max. :1.0000
## NA's : 8
## HotelCapacity HasSwimmingPool
## Min. : 0.00 Min. :0.0000
## 1st Qu.: 16.00 1st Qu.:0.0000
## Median : 34.00 Median :0.0000
## Mean : 62.51 Mean :0.3558
## 3rd Qu.: 75.00 3rd Qu.:1.0000
## Max. :600.00 Max. :1.0000
##
The minimum population of any city is 8096 and the maximum is 12442373, with a median of 3046163. The minimum room rent is Rs 299 and the maximum is Rs 322500, with a median of Rs 4000. The minimum HotelCapacity is 0 (the value entered when capacity is not available) and the maximum is 600, with a median of 34. The minimum distance from a hotel to the airport is 0.20 km and the maximum is 124 km, with a median of 15 km.
library(psych)
describe(hotel)[,c(3:6)]
## mean sd median trimmed
## CityName* 18.07 11.72 16 17.29
## Population 4416836.87 4258386.00 3046163 4040816.22
## CityRank 14.83 13.51 9 13.30
## IsMetroCity 0.28 0.45 0 0.23
## IsTouristDestination 0.70 0.46 1 0.75
## IsWeekend 0.62 0.48 1 0.65
## IsNewYearEve 0.12 0.33 0 0.03
## Date* 14.30 2.69 14 14.39
## HotelName* 841.19 488.16 827 841.18
## RoomRent 5473.99 7333.12 4000 4383.33
## StarRating 3.46 0.76 3 3.40
## Airport 21.16 22.76 15 16.39
## HotelAddress* 1202.53 582.17 1261 1233.25
## HotelPincode 397430.26 259837.50 395003 388540.47
## HotelDescription* 581.34 363.26 567 575.37
## FreeWifi 0.93 0.26 1 1.00
## FreeBreakfast 0.65 0.48 1 0.69
## HotelCapacity 62.51 76.66 34 46.03
## HasSwimmingPool 0.36 0.48 0 0.32
The mean population is 4416836.87 with a standard deviation of 4258386. The mean room rent is Rs 5473.99 with a standard deviation of 7333.12; the standard deviation exceeds the mean, so the variability in prices is high. The mean hotel capacity is 62.51 with a standard deviation of 76.66. On average, a hotel is 21.16 km from the airport.
Creating one-way contingency tables for the categorical variables in the dataset
For external factors
Tourist Destination
istourist<-table(hotel$IsTouristDestination)
istourist
##
## 0 1
## 4007 9225
mytable <- with(hotel, table(IsTouristDestination)) #second method
mytable
## IsTouristDestination
## 0 1
## 4007 9225
There are 4007 observations for non-tourist destinations and 9225 for tourist destinations, which is more than double.
IsWeekend
mytable<-with(hotel,table(IsWeekend))
mytable
## IsWeekend
## 0 1
## 4991 8241
There are 8241 observations on weekend dates against 4991 on weekdays.
IsMetroCity
mytable <- with(hotel, table(IsMetroCity))
mytable
## IsMetroCity
## 0 1
## 9472 3760
Observations from metro cities are fewer: 3760 against 9472 from non-metro cities, so non-metro observations are about 2.5 times as numerous.
IsNewYearEve
mytable <- with(hotel, table(IsNewYearEve))
mytable
## IsNewYearEve
## 0 1
## 11586 1646
There are 1646 observations for New Year's Eve.
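Comparisons such as "about 2.5 times" are easier to verify when counts are converted into proportions. A minimal sketch, using the IsMetroCity counts printed above (the variable names metro, shares and ratio are ours):

```r
# Counts copied from the IsMetroCity table in this report.
metro <- c("0" = 9472, "1" = 3760)

# prop.table() converts counts into proportions that sum to 1.
shares <- prop.table(metro)
round(shares, 3)

# Ratio of non-metro to metro observations.
ratio <- unname(metro["0"] / metro["1"])
round(ratio, 2)
```

The same call works directly on the output of table(), for example prop.table(table(hotel$IsMetroCity)).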
For internal Factors
Star rating
mytable <- with(hotel, table(StarRating))
mytable
## StarRating
## 0 1 2 2.5 3 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1
## 16 8 440 632 5953 8 16 8 1752 8 24 16 32 2463 24
## 4.3 4.4 4.5 4.7 4.8 5
## 16 8 376 8 16 1408
The 3-star rating is the most common (5953 observations), followed by 4-star (2463), 3.5-star (1752) and 5-star (1408).
Airport
mytable <- with(hotel, table(Airport))
mytable
## Airport
## 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.4
## 16 32 40 24 32 24 8 39 32 16 40 8
## 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6
## 16 32 22 72 40 56 40 24 16 32 56 48
## 2.7 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8
## 56 24 56 56 16 24 16 48 56 64 16 40
## 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5
## 32 72 32 40 32 32 24 40 24 40 32 73
## 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1 6.2
## 72 72 32 40 48 40 32 56 40 33 32 64
## 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4
## 16 48 48 40 24 56 40 49 72 48 24 40
## 7.5 7.6 7.7 7.8 7.9 8 8.1 8.2 8.3 8.4 8.5 8.6
## 48 71 48 32 72 73 72 56 40 48 64 16
## 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8
## 56 16 16 49 24 62 48 80 22 40 24 40
## 9.9 10 10.2 10.3 10.4 10.6 10.7 10.8 10.9 11 11.1 11.3
## 56 298 8 8 8 8 8 8 16 610 16 16
## 11.7 11.9 12 12.2 12.3 12.6 12.7 13 13.1 13.3 13.5 13.6
## 8 16 354 24 8 24 16 319 16 8 8 24
## 13.7 13.8 14 14.2 14.4 14.5 14.6 14.7 14.8 14.9 15 15.3
## 16 8 399 16 24 8 16 24 16 8 441 16
## 15.4 15.6 15.7 15.8 15.9 16 16.1 16.2 16.4 16.5 16.7 17
## 16 8 8 8 8 409 16 8 8 32 32 313
## 17.1 17.2 17.4 17.5 17.6 17.8 18 18.3 18.5 18.6 18.7 19
## 8 16 8 16 8 16 424 8 16 8 8 200
## 19.5 19.9 20 20.2 20.3 20.5 20.9 21 21.4 21.5 22 22.1
## 8 8 384 8 16 8 8 248 24 8 305 8
## 22.2 22.4 22.5 23 23.2 23.3 23.4 24 24.2 24.3 24.5 24.6
## 8 8 8 304 8 16 8 167 16 16 16 8
## 24.7 24.9 25 25.6 25.7 25.9 26 26.1 26.3 26.4 26.5 26.7
## 8 32 208 8 8 8 300 8 8 24 8 8
## 27 27.1 27.2 28 28.1 28.6 28.7 29 30 30.5 31 31.2
## 272 8 8 112 8 8 8 88 56 8 224 8
## 31.3 31.9 32 32.9 33 33.4 34 35 36 36.2 37 38
## 16 8 72 8 40 16 16 49 17 8 49 49
## 38.3 39 39.9 40 41 42 42.7 43 43.9 44 44.5 44.6
## 8 100 8 56 102 41 16 33 8 8 8 8
## 44.8 46 47 47.5 48 48.4 49 50 50.1 50.5 51 52
## 8 40 8 8 16 8 8 8 8 8 16 16
## 52.7 53 55 57.2 60 61 62 63 63.5 63.6 65 67.6
## 8 8 8 8 8 16 32 8 8 8 152 8
## 69 73.1 80 80.3 81 82 83 84 85 86 87 91.3
## 8 8 1 8 1 9 1 1 1 1 1 8
## 96.5 100 102.4 105 110 117.4 124
## 8 136 8 240 64 8 128
Most hotels are within about 24 km of an airport (the third quartile in the summary above), but some are very far away: there are 128 observations at 124 km and 240 at 105 km.
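A one-way table of a continuous variable such as Airport has one column per distinct value and is hard to read. One option is to group the distances into intervals first. The sketch below uses made-up distances and arbitrary break points; both are assumptions for illustration.

```r
# Hypothetical distances in km (made up for illustration).
airport_km <- c(0.2, 5, 8.4, 11, 15, 15, 24, 40, 105, 124)

# Group distances into intervals; right = FALSE makes bins like [0,10).
bins <- cut(airport_km, breaks = c(0, 10, 20, 50, Inf), right = FALSE)
table(bins)
```

Applied to hotel$Airport, this would give a handful of counts instead of several hundred.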
FreeWifi
mytable<-with(hotel,table(FreeWifi))
mytable
## FreeWifi
## 0 1
## 981 12251
12251 observations are for hotels with free Wi-Fi; most hotels now offer it.
FreeBreakfast
mytable<-with(hotel,table(FreeBreakfast))
mytable
## FreeBreakfast
## 0 1
## 4643 8589
Observations for hotels with free breakfast (8589) are almost double those for hotels without it (4643).
HasSwimmingPool
mytable<-with(hotel,table(HasSwimmingPool))
mytable
## HasSwimmingPool
## 0 1
## 8524 4708
Observations for hotels without a swimming pool (8524) are about 1.8 times those for hotels with one (4708).
Date
mytable<-with(hotel,table(Date))
mytable
## Date
## 04-Jan-16 04-Jan-17 08-Jan-16 08-Jan-17 18-Dec-16 21-Dec-16
## 31 13 31 13 44 44
## 24-Dec-16 25-Dec-16 28-Dec-16 31-Dec-16 Dec 18 2016 Dec 21 2016
## 44 44 44 44 1608 1611
## Dec 24 2016 Dec 25 2016 Dec 28 2016 Dec 31 2016 Jan 04 2017 Jan 08 2017
## 1611 1611 1611 1611 1548 1542
## Jan 4 2017 Jan 8 2017
## 60 67
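The Date column mixes several formats (for example 04-Jan-17, Jan 04 2017 and Jan 4 2017), so the same date appears under more than one label. It is difficult to draw out any pattern until the dates are standardized.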
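One way to bring such labels into a single format is to try each pattern in turn. The sketch below is an assumption about how this could be done, not part of the original analysis; parse_mixed is a hypothetical helper, and the month abbreviations assume an English locale.

```r
# Labels in the styles that appear in the Date table (one or two per format).
raw <- c("04-Jan-17", "Jan 04 2017", "Jan 4 2017", "31-Dec-16", "Dec 31 2016")

# Hypothetical helper: try each format in turn.
# as.Date() returns NA where a format does not match.
parse_mixed <- function(x) {
  out <- as.Date(x, format = "%d-%b-%y")
  miss <- is.na(out)
  out[miss] <- as.Date(x[miss], format = "%b %d %Y")
  out
}

clean <- parse_mixed(raw)
table(format(clean, "%Y-%m-%d"))
```

After this step, labels that refer to the same day collapse into a single date.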
Now let us create two-way contingency tables to see the patterns for two variables taken together.
IsWeekend and IsNewYearEve
mytable<-with(hotel,table(IsWeekend,IsNewYearEve))
mytable
## IsNewYearEve
## IsWeekend 0 1
## 0 4989 2
## 1 6597 1644
mytable[2,2]
## [1] 1644
1644 observations fall on both a weekend and New Year's Eve.
IsNewYearEve and IsTouristDestination
mytable<-with(hotel,table(IsNewYearEve,IsTouristDestination))
mytable
## IsTouristDestination
## IsNewYearEve 0 1
## 0 3504 8082
## 1 503 1143
mytable[2,2]
## [1] 1143
1143 observations are for New Year's Eve in a tourist destination.
IsTouristDestination and StarRating
mytable<-with(hotel,table(IsTouristDestination,StarRating))
mytable
## StarRating
## IsTouristDestination 0 1 2 2.5 3 3.2 3.3 3.4 3.5 3.6
## 0 0 0 64 152 1888 0 0 0 448 0
## 1 16 8 376 480 4065 8 16 8 1304 8
## StarRating
## IsTouristDestination 3.7 3.8 3.9 4 4.1 4.3 4.4 4.5 4.7 4.8
## 0 8 16 16 839 0 16 8 128 8 0
## 1 16 0 16 1624 24 0 0 248 0 16
## StarRating
## IsTouristDestination 5
## 0 416
## 1 992
FreeWifi and FreeBreakfast
mytable<-with(hotel,table(FreeWifi,FreeBreakfast))
mytable
## FreeBreakfast
## FreeWifi 0 1
## 0 606 375
## 1 4037 8214
mytable[2,2]
## [1] 8214
8214 observations are for hotels offering both free breakfast and free Wi-Fi.
FreeWifi and HasSwimmingPool
mytable<-with(hotel,table(FreeWifi,HasSwimmingPool))
mytable
## HasSwimmingPool
## FreeWifi 0 1
## 0 592 389
## 1 7932 4319
mytable[2,2]
## [1] 4319
4319 observations are for hotels with both free Wi-Fi and a swimming pool.
FreeBreakfast and HasSwimmingPool
mytable<-with(hotel,table(FreeBreakfast,HasSwimmingPool))
mytable
## HasSwimmingPool
## FreeBreakfast 0 1
## 0 2805 1838
## 1 5719 2870
mytable[2,2]
## [1] 2870
2870 observations are for hotels with both free breakfast and a swimming pool.
Average RoomRent on the basis of IsWeekend and IsNewYearEve
aggregate(hotel$RoomRent, by=list(Weekend = hotel$IsWeekend, NewYearEve = hotel$IsNewYearEve), mean)
## Weekend NewYearEve x
## 1 0 0 5429.473
## 2 1 0 5320.820
## 3 0 1 8829.500
## 4 1 1 6219.655
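The average room rent on a weekday that is not New Year's Eve is 5429.473, while on a weekend New Year's Eve it is 6219.655. The weekday New Year's Eve average (8829.500) is based on only 2 observations, as the two-way table of IsWeekend and IsNewYearEve shows, so it should not be over-interpreted.
Here is the city-wise break-up of average RoomRent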
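Group means are easier to judge when the number of rows behind each mean is shown next to it. A minimal sketch with made-up data (the data frame toy and the column names MeanRent and N are assumptions for illustration):

```r
# Hypothetical toy data; the single weekday / New Year's Eve row mimics a tiny cell.
toy <- data.frame(
  IsWeekend    = c(0, 0, 0, 1, 1, 1, 1, 0),
  IsNewYearEve = c(0, 0, 0, 0, 0, 1, 1, 1),
  RoomRent     = c(4000, 5000, 6000, 4500, 5500, 6000, 7000, 9000)
)

# Mean rent and number of rows per group, computed separately and then joined.
means  <- aggregate(RoomRent ~ IsWeekend + IsNewYearEve, data = toy, FUN = mean)
counts <- aggregate(RoomRent ~ IsWeekend + IsNewYearEve, data = toy, FUN = length)
names(means)[3]  <- "MeanRent"
names(counts)[3] <- "N"

# merge() joins on the shared grouping columns.
by_group <- merge(means, counts)
by_group
```

A group with a very small N, like the weekday New Year's Eve cell here, stands out immediately.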
aggregate(hotel$RoomRent, by=list(city = hotel$CityName), mean)
## city x
## 1 Agra 4124.287
## 2 Ahmedabad 4175.045
## 3 Amritsar 3444.029
## 4 Bangalore 4112.803
## 5 Bhubaneswar 3587.442
## 6 Chandigarh 4030.940
## 7 Chennai 4323.647
## 8 Darjeeling 5458.088
## 9 Delhi 4318.606
## 10 Gangtok 4629.648
## 11 Goa 8170.801
## 12 Guwahati 5325.812
## 13 Haridwar 3919.938
## 14 Hyderabad 3852.175
## 15 Indore 3414.594
## 16 Jaipur 7292.022
## 17 Jaisalmer 5986.072
## 18 Jodhpur 10661.371
## 19 Kanpur 3008.562
## 20 Kochi 6039.609
## 21 Kolkata 4528.986
## 22 Lucknow 5879.070
## 23 Madurai 4768.223
## 24 Manali 4858.285
## 25 Mangalore 4110.337
## 26 Mumbai 6343.730
## 27 Munnar 7543.500
## 28 Mysore 3320.869
## 29 Nainital 6409.833
## 30 Ooty 6144.257
## 31 Panchkula 2813.500
## 32 Pune 3897.652
## 33 Puri 5708.429
## 34 Rajkot 4107.078
## 35 Rishikesh 4943.670
## 36 Shimla 5780.604
## 37 Srinagar 10572.025
## 38 Surat 3660.850
## 39 Thiruvanthipuram 6726.796
## 40 Thrissur 3387.844
## 41 Udaipur 10145.252
## 42 Varanasi 8675.042
Jodhpur (10661), Srinagar (10572) and Udaipur (10145) have the highest average room rent, followed by Varanasi (8675) and Goa (8171). Jodhpur, Udaipur, Srinagar and Goa are all non-metro cities, and all four are tourist destinations.
Average RoomRent on the basis of TouristDestination and MetroCity
aggregate(hotel$RoomRent,by=list(touristplace= hotel$IsTouristDestination, MetroCity= hotel$IsMetroCity),mean)
## touristplace MetroCity x
## 1 0 0 4006.435
## 2 1 0 6755.728
## 3 0 1 4646.136
## 4 1 1 4706.608
Hotels in cities that are both tourist destinations and metro cities are cheaper on average (4706.608) than hotels in non-metro tourist destinations (6755.728).
Average RoomRent on the basis of FreeWifi facility ,FreeBreakfast, and SwimmingPool facility
attach(hotel)
aggregate(hotel$RoomRent, by=list(Free_Wifi = hotel$FreeWifi, FreeBreakfast = hotel$FreeBreakfast, SwimmingPool = HasSwimmingPool), mean)
## Free_Wifi FreeBreakfast SwimmingPool x
## 1 0 0 0 3538.085
## 2 1 0 0 3148.628
## 3 0 1 0 5636.617
## 4 1 1 0 3984.457
## 5 0 0 1 7378.590
## 6 1 0 1 9530.906
## 7 0 1 1 5207.000
## 8 1 1 1 8246.284
Hotels with free Wi-Fi, free breakfast and a swimming pool have a mean room rent of 8246.284.
Average RoomRent on the basis of Weekday and Weekend
aggregate(RoomRent, by=list(IsWeekend), mean)
## Group.1 x
## 1 0 5430.835
## 2 1 5500.129
The average room rent on weekends is 5500.129, only slightly above the weekday average of 5430.835.
Average RoomRent on the basis of MetroCity
aggregate(RoomRent, by=list(IsMetroCity),mean)
## Group.1 x
## 1 0 5782.794
## 2 1 4696.073
Average RoomRent on the NewYearEve
aggregate(RoomRent, by=list(IsNewYearEve),mean)
## Group.1 x
## 1 0 5367.606
## 2 1 6222.826
Average RoomRent on the basis of TouristDestination
aggregate(RoomRent, by=list(IsTouristDestination),mean)
## Group.1 x
## 1 0 4111.003
## 2 1 6066.024
Average RoomRent on the basis of Ratings
aggregate(RoomRent, by=list(Ratings = StarRating), mean)
## Ratings x
## 1 0.0 7237.125
## 2 1.0 686.625
## 3 2.0 2783.166
## 4 2.5 2520.816
## 5 3.0 3694.811
## 6 3.2 15937.500
## 7 3.3 2841.062
## 8 3.4 23437.500
## 9 3.5 4843.346
## 10 3.6 7769.500
## 11 3.7 6701.958
## 12 3.8 5400.062
## 13 3.9 13062.750
## 14 4.0 6393.105
## 15 4.1 19075.000
## 16 4.3 7423.125
## 17 4.4 5563.500
## 18 4.5 8699.920
## 19 4.7 10125.000
## 20 4.8 46752.812
## 21 5.0 12398.221
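The average room rent for the 4.8-star rating is by far the highest, and 1-star hotels are the least expensive. Surprisingly, the 3.4-star average exceeds the 5-star average. However, the StarRating table shows only 8 observations at 3.4 and 16 at 4.8, so these averages rest on very few hotels.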
Average RoomRent on the basis of FreeWifi
aggregate(RoomRent, by=list(freewifi = FreeWifi), mean)
## freewifi x
## 1 0 5380.004
## 2 1 5481.518
Average RoomRent on the basis of FreeBreakfast facility
aggregate(RoomRent, by=list(freebreakfast = FreeBreakfast), mean)
## freebreakfast x
## 1 0 5573.790
## 2 1 5420.044
Average RoomRent on the basis of Date
aggregate(RoomRent,by=list(Date= Date),mean)
## Date x
## 1 04-Jan-16 4738.548
## 2 04-Jan-17 3829.615
## 3 08-Jan-16 4907.419
## 4 08-Jan-17 3843.077
## 5 18-Dec-16 3366.795
## 6 21-Dec-16 3437.545
## 7 24-Dec-16 3510.795
## 8 25-Dec-16 3349.591
## 9 28-Dec-16 3450.045
## 10 31-Dec-16 3570.318
## 11 Dec 18 2016 4938.257
## 12 Dec 21 2016 5130.320
## 13 Dec 24 2016 5598.746
## 14 Dec 25 2016 5521.896
## 15 Dec 28 2016 5652.478
## 16 Dec 31 2016 6263.374
## 17 Jan 04 2017 5754.513
## 18 Jan 08 2017 5406.821
## 19 Jan 4 2017 4481.400
## 20 Jan 8 2017 4347.821
Average RoomRent on the basis of SwimmingPool facility
aggregate(RoomRent, by=list(HasSwimmingPool = HasSwimmingPool), mean)
## HasSwimmingPool x
## 1 0 3775.566
## 2 1 8549.052
Let us visualize the data using boxplot
Price variation
boxplot(hotel$RoomRent,main="Hotel Rent",xlab="Rent",horizontal=TRUE,ylim=c(0,150000),col=c("peachpuff"))
There are many outliers in the data. Some hotels charge exorbitantly high rents, perhaps because they combine several features such as being in a tourist destination or a metro city, a 5-star rating, proximity to the airport, free Wi-Fi and free breakfast. Otherwise the room rent is below roughly Rs 25000 for most hotels. The data values are clustered towards the left.
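The points drawn beyond the whiskers follow, approximately, the 1.5 x IQR rule (boxplot() works with hinges, which can differ slightly from the quartiles reported by summary()). The sketch below applies the rule to the RoomRent quartiles printed earlier; the vector rents is made up for illustration.

```r
# Quartiles of RoomRent taken from the summary() output in this report.
q1 <- 2436
q3 <- 6299

iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr   # values above this are flagged as outliers
lower_fence <- q1 - 1.5 * iqr   # negative here, so no rent can fall below it
upper_fence

# The same rule applied to a hypothetical vector of rents.
rents <- c(299, 2436, 4000, 6299, 15000, 322500)
rents[rents > upper_fence]
```

Under this rule, any rent above the upper fence is treated as an outlier, which is why so many points appear to the right of the box.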
population
boxplot(hotel$Population, main="population data",xlab="population",horizontal=TRUE,col=c("orchid3"))
Star Rating
boxplot(hotel$StarRating , main="Hotel Rating",xlab="StarRating",horizontal=TRUE,col=c("beige"))
The first quartile and the median coincide at a 3-star rating, so at least half of the observations have a rating of 3 or lower. Very few observations have a rating of 0 or 1. The middle 50% of the observations have ratings between 3 and 4.
boxplot(hotel$Airport , main="Distance of Hotels from Airport",xlab="Airport",horizontal=TRUE,col=c("chartreuse"))
Many outliers can be seen. Some of the hotels are very far from the airport.
boxplot(hotel$HotelCapacity , main="No of rooms",xlab="HotelCapacity",horizontal=TRUE,col=c("blue"))
Some hotels have a very high capacity, as the outliers show. Most hotels have a capacity below 100.
Let us see how room rents are distributed with respect to other relevant variables.
Rent on Weekends
boxplot(hotel$RoomRent~hotel$IsWeekend ,main="price on weekends",xlab="rent",ylab="weekends" ,horizontal=TRUE,ylim=c(0,100000),col=c("orchid3","peachpuff"))
The price variation looks the same across weekdays and weekends, which suggests that this factor may not affect rent much.
Rent in metro cities
boxplot(hotel$RoomRent ~ hotel$IsMetroCity ,main="price in metros",xlab="rent",ylab="Metro" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","blue"))
There are some extremely high room rents in non-metro cities, with many outliers. This is against expectations.
Rent in tourist places
boxplot(hotel$RoomRent ~ hotel$IsTouristDestination,main="price in tourist places",xlab="rent",ylab="tourist place" ,horizontal=TRUE,ylim=c(0,100000),col=c("orchid3","green"))
Rents are higher in tourist places, and the median rent is also higher. There are some extremely costly hotels in tourist places. This is as expected.
Rent during new yearEve
boxplot(hotel$RoomRent ~ hotel$IsNewYearEve ,main="price on new year eve",xlab="rent",ylab="new yeareve" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","peachpuff"))
It is difficult to say whether rents are higher on New Year's Eve. Outliers are present in both groups.
Rent based on star-ratings
boxplot(hotel$RoomRent ~ hotel$StarRating ,main="price of star hotels",xlab="rent",ylab="star-rating" ,horizontal=TRUE,ylim=c(0,100000),col=c("blue","peachpuff"))
A lot of variation is clearly visible. Some 5-star hotels are exorbitantly expensive, while hotels at most other star ratings are cheaper. The 4.8-star and 3.4-star ratings are much more expensive than the others.
Rent on different dates
boxplot(hotel$RoomRent ~ hotel$Date ,main="price on different dates",xlab="rent",ylab="date" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","orchid3"))
Many outliers can be seen. Barring the exceptions, rent is more or less the same across all the dates.
Rent based on distance of hotels from airports
boxplot(hotel$RoomRent ~ hotel$Airport ,main="price based on distance",xlab="rent",ylab="distance" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","chartreuse4"))
This plot is difficult to interpret because Airport takes many distinct values, but prices appear to be on the lower side across the distance range.
Rent where Wifi is free
boxplot(hotel$RoomRent ~ hotel$FreeWifi ,main="price where wifi is free",xlab="rent",ylab=" wifi" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","chartreuse4"))
If we leave out the outliers seen for hotels with free Wi-Fi, rents look similar with and without it, so Wi-Fi appears to be complimentary.
Rent Where breakfast is free
boxplot(hotel$RoomRent ~ hotel$FreeBreakfast ,main="price where breakfast is free",xlab="rent",ylab="breakfast" ,horizontal=TRUE,ylim=c(0,100000),col=c("blue","orchid3"))
Again, there is no major variation in room rents.
Rent for swimmingpool
boxplot(hotel$RoomRent ~ hotel$HasSwimmingPool ,main="price for swimmingpool",xlab="rent",ylab="swimmingpool" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","peachpuff"))
Here we can see a noticeable rent difference: hotels with swimming pools are considerably more expensive, which is reasonable given the facility provided.
RoomRent with cityName
boxplot(hotel$RoomRent ~ hotel$CityName ,main="price bifurcation for cities",xlab="rent",ylab="Cities" ,horizontal=TRUE, ylim=c(0,50000),col=c("red","yellow","brown", "blue", "peachpuff","beige","orchid3", "chartreuse"))
Let us see what histograms can offer
par(mfrow=c(2,2))
hist(hotel$IsMetroCity, xlab = "metros",main = "MetroCity",col = "red")
hist(hotel$IsTouristDestination, xlab = "Tourist Destination",main = "Tourist Destination",col = "blue")
hist(hotel$IsWeekend,xlab = "weekend",main = "Weekends",col = "peachpuff")
hist(hotel$IsNewYearEve,xlab = "New Year",main = "New YearEve",col = "beige")
par(mfrow=c(2,2))
hist(hotel$StarRating,xlab = "ratings",main = "star-Rating",col = "orchid3")
hist(hotel$FreeWifi,xlab = "Wifi",main = "freeWifi",col = "green")
hist(hotel$FreeBreakfast,xlab = "breakfast",main = "FreeBreakfast",col = "chartreuse4")
hist(hotel$HasSwimmingPool,xlab = "Swimmingpool",main = "Swimmingpool",col = "brown")
Hotels in metro cities are fewer than those in non-metro cities. Hotels in tourist places outnumber those in non-tourist places. There are more observations on weekends than on weekdays. The 3-star rating is the most common. Hotels with free Wi-Fi and free breakfast outnumber those without, whereas hotels with a swimming pool are fewer than those without one.
Histogram for the RoomRent
hist(hotel$RoomRent,xlab = "rent",main = "Roomrent",col = "blue",breaks = 100,xlim = c(200,90000))
The data is highly skewed with a long right tail, which means that more than 50% of the hotels have a rent below the average rent.
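With such a long right tail, it can help to look at rents on a logarithmic scale. This step is not part of the original analysis; the sketch below only illustrates the idea, using made-up values.

```r
# Hypothetical right-skewed rents (values made up for illustration).
rents <- c(1200, 1800, 2400, 2900, 3500, 4000, 4800, 6300, 9000, 45000)

mean(rents) > median(rents)   # the mean exceeds the median for right-skewed data

# On the log10 scale the long right tail is compressed.
log_rents <- log10(rents)
hist(log_rents, breaks = 10, col = "blue",
     main = "log10(RoomRent)", xlab = "log10 of rent")
```

The gap between mean and median shrinks on the log scale, and the histogram is less dominated by a few extreme values.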
Let us use scatterplot for our dataset
plot(hotel$StarRating,hotel$RoomRent , main="RoomRent & Rating",xlab="StarRating",ylab="RoomRent")
RoomRent vs StarRating
plot(hotel$StarRating, hotel$RoomRent, main="RoomRent vs Rating", ylab="rent", xlab="rating", las=1, col=c("red","blue","green","brown"))
abline(lm(RoomRent ~ StarRating, data=hotel), col="red")
The line of best fit obtained here indicates that room rent increases as the rating increases.
RoomRent vs Airport
plot(hotel$Airport, hotel$RoomRent, main="RoomRent vs distance", ylab="rent", xlab="airport distance", las=1, col=c("red","blue","green","brown"))
abline(lm(RoomRent ~ Airport, data=hotel), col="red")
The line of best fit is relatively flat, in line with the weak correlation between airport distance and room rent.
RoomRent vs HotelCapacity
plot(hotel$HotelCapacity, hotel$RoomRent, main="RoomRent vs capacity", ylab="rent", xlab="capacity", las=1, col=c("red","blue","green","brown"))
abline(lm(RoomRent ~ HotelCapacity, data=hotel), col="red")
The scatterplots do not give much insight into the pattern; on this scale the line of best fit is close to horizontal.
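When a fitted line is overlaid on a scatterplot, the variable on the y-axis belongs on the left-hand side of the model formula. The sketch below, on made-up data, shows that swapping the two variables fits a different line with a slope in different units.

```r
# Hypothetical toy data (not from the report).
toy <- data.frame(
  HotelCapacity = c(10, 20, 40, 80, 160, 320),
  RoomRent      = c(2000, 2500, 2400, 3500, 5000, 9000)
)

plot(toy$HotelCapacity, toy$RoomRent, xlab = "capacity", ylab = "rent")

# Response on the left of ~, predictor on the right: this matches the axes.
fit_yx <- lm(RoomRent ~ HotelCapacity, data = toy)
abline(fit_yx, col = "red")

# Swapping the roles fits a different line, measured in rooms per rupee.
fit_xy <- lm(HotelCapacity ~ RoomRent, data = toy)
coef(fit_yx)["HotelCapacity"]   # rupees per unit of capacity
coef(fit_xy)["RoomRent"]        # units of capacity per rupee
```

Passing the swapped model to abline() on the same axes would draw a line with a tiny slope near the bottom of the plot, which can look horizontal even when the variables are related.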
Effects of date on RoomRent
date1<- aggregate(RoomRent ~ Date, data =hotel,mean)
plot(date1$Date,date1$RoomRent, main="Scatterplot between Date and RoomRent", xlab="Date",ylab = "Room Rent")
Room rents are somewhat higher between 31 December and 4 January.
Scatterplot of different variables together
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(formula=~RoomRent+IsWeekend+IsNewYearEve+StarRating+FreeBreakfast+HasSwimmingPool,data=hotel,diagonal="histogram",pch=16)
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
Scatterplot matrix of internal factors only
scatterplotMatrix(hotel[,c( "RoomRent", "StarRating" , "Airport" , "FreeWifi" , "FreeBreakfast","HotelCapacity","HasSwimmingPool")], spread = FALSE, smoother.args = list(lty=2), main= "Scatterplot Matrix",diagonal = "histogram")
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
Correlogram
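First we remove all the columns with non-numeric data so that the cor function can be applied
# Drop the non-numeric columns by name; the remaining 14 columns are numeric
non_numeric <- c("CityName", "Date", "HotelName", "HotelAddress", "HotelDescription")
hotel3 <- hotel[, !(names(hotel) %in% non_numeric)]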
coor<-cor(hotel3)
coor
## Population CityRank IsMetroCity
## Population 1.0000000000 -0.8353204432 0.7712260105
## CityRank -0.8353204432 1.0000000000 -0.5643937903
## IsMetroCity 0.7712260105 -0.5643937903 1.0000000000
## IsTouristDestination -0.0482029722 0.2807134520 0.1763717063
## IsWeekend 0.0115926802 -0.0072564766 0.0018118005
## IsNewYearEve 0.0007332482 -0.0006326444 0.0006464753
## RoomRent -0.0887280632 0.0939855292 -0.0668397705
## StarRating 0.1341365933 -0.1333810133 0.0776028661
## Airport -0.2597010198 0.5059119892 -0.2073586125
## HotelPincode -0.2586550765 0.1448743454 -0.1756624007
## FreeWifi 0.1129334410 -0.1214309404 0.0868288677
## FreeBreakfast 0.0364278235 -0.0086837497 0.0513856623
## HotelCapacity 0.2599830516 -0.2561197059 0.1871502153
## HasSwimmingPool 0.0262590820 -0.1029737518 0.0214119243
## IsTouristDestination IsWeekend IsNewYearEve
## Population -0.048202972 0.011592680 7.332482e-04
## CityRank 0.280713452 -0.007256477 -6.326444e-04
## IsMetroCity 0.176371706 0.001811801 6.464753e-04
## IsTouristDestination 1.000000000 -0.019481101 -2.266388e-03
## IsWeekend -0.019481101 1.000000000 2.923821e-01
## IsNewYearEve -0.002266388 0.292382051 1.000000e+00
## RoomRent 0.122502963 0.004580134 3.849123e-02
## StarRating -0.040554998 0.006378436 2.360897e-03
## Airport 0.194422049 -0.002724756 4.598872e-04
## HotelPincode -0.170413906 -0.006444704 -2.111441e-03
## FreeWifi -0.061568821 0.002960828 2.787472e-05
## FreeBreakfast -0.071692559 -0.007612777 -2.606416e-03
## HotelCapacity -0.094356091 0.006306507 1.352679e-03
## HasSwimmingPool 0.042156280 0.004500461 1.122308e-03
## RoomRent StarRating Airport HotelPincode
## Population -0.088728063 0.134136593 -0.2597010198 -0.258655077
## CityRank 0.093985529 -0.133381013 0.5059119892 0.144874345
## IsMetroCity -0.066839771 0.077602866 -0.2073586125 -0.175662401
## IsTouristDestination 0.122502963 -0.040554998 0.1944220492 -0.170413906
## IsWeekend 0.004580134 0.006378436 -0.0027247555 -0.006444704
## IsNewYearEve 0.038491227 0.002360897 0.0004598872 -0.002111441
## RoomRent 1.000000000 0.369373425 0.0496532442 0.009262712
## StarRating 0.369373425 1.000000000 -0.0609191837 -0.009618454
## Airport 0.049653244 -0.060919184 1.0000000000 0.223641588
## HotelPincode 0.009262712 -0.009618454 0.2236415883 1.000000000
## FreeWifi 0.003627002 0.018009594 -0.0945236768 -0.012503744
## FreeBreakfast -0.010006370 -0.032892463 0.0242839409 0.024880420
## HotelCapacity 0.157873308 0.637430337 -0.1176720722 -0.035088175
## HasSwimmingPool 0.311657734 0.618214699 -0.1416665606 0.020280765
## FreeWifi FreeBreakfast HotelCapacity
## Population 1.129334e-01 0.036427824 0.259983052
## CityRank -1.214309e-01 -0.008683750 -0.256119706
## IsMetroCity 8.682887e-02 0.051385662 0.187150215
## IsTouristDestination -6.156882e-02 -0.071692559 -0.094356091
## IsWeekend 2.960828e-03 -0.007612777 0.006306507
## IsNewYearEve 2.787472e-05 -0.002606416 0.001352679
## RoomRent 3.627002e-03 -0.010006370 0.157873308
## StarRating 1.800959e-02 -0.032892463 0.637430337
## Airport -9.452368e-02 0.024283941 -0.117672072
## HotelPincode -1.250374e-02 0.024880420 -0.035088175
## FreeWifi 1.000000e+00 0.158220597 -0.008703612
## FreeBreakfast 1.582206e-01 1.000000000 -0.087165446
## HotelCapacity -8.703612e-03 -0.087165446 1.000000000
## HasSwimmingPool -2.407405e-02 -0.061522132 0.509045809
## HasSwimmingPool
## Population 0.026259082
## CityRank -0.102973752
## IsMetroCity 0.021411924
## IsTouristDestination 0.042156280
## IsWeekend 0.004500461
## IsNewYearEve 0.001122308
## RoomRent 0.311657734
## StarRating 0.618214699
## Airport -0.141666561
## HotelPincode 0.020280765
## FreeWifi -0.024074046
## FreeBreakfast -0.061522132
## HotelCapacity 0.509045809
## HasSwimmingPool 1.000000000
Here we can see that RoomRent is negatively correlated with Population, though the correlation is weak (-0.09). RoomRent is positively correlated with CityRank (0.09): since a larger rank number means a smaller city, smaller cities tend to have slightly higher rents. Correlation alone cannot establish causation. RoomRent is also negatively correlated with IsMetroCity (-0.07). One possible explanation is that metro cities are crowded and less relaxing, so one might expect room rents to be lower there. Similarly, RoomRent is negatively correlated with FreeBreakfast (-0.01). This correlation is negligible but surprising, as we would normally expect a free breakfast to make a hotel more expensive. HotelPincode is positively but negligibly correlated with RoomRent (0.009).
The rest of the variables are positively correlated with RoomRent.
One interesting finding is that IsTouristDestination, StarRating, HotelCapacity and HasSwimmingPool are more strongly correlated with RoomRent than the other variables, and all four correlations are positive. We can use these variables for regression analysis. None of the individual correlations is strong in absolute terms (the largest is 0.37, for StarRating), but relative to the other factors they stand out.
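The relative ranking described above can be checked with a short sketch. The values are the RoomRent correlations copied (rounded to four decimals) from the cor() output; the names r, ranked and top4 are illustrative and not part of the original analysis.

```r
# Correlations of each variable with RoomRent, copied from the cor() output above
r <- c(Population = -0.0887, CityRank = 0.0940, IsMetroCity = -0.0668,
       IsTouristDestination = 0.1225, IsWeekend = 0.0046, IsNewYearEve = 0.0385,
       StarRating = 0.3694, Airport = 0.0497, HotelPincode = 0.0093,
       FreeWifi = 0.0036, FreeBreakfast = -0.0100, HotelCapacity = 0.1579,
       HasSwimmingPool = 0.3117)

# Sort by absolute value so that negative and positive correlations are comparable
ranked <- r[order(abs(r), decreasing = TRUE)]
top4 <- names(ranked)[1:4]
print(top4)
```

Sorting on the absolute value matters here: a weak negative correlation such as Population would otherwise be ranked below every positive one regardless of its size.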
Let us visualise this matrix graphically
library(corrplot)
## corrplot 0.84 loaded
corrplot(corr=cor(hotel3,use = "complete.obs"),method = "circle")
Here blue shades show positive correlation and red shades show negative correlation. Bigger and darker circles indicate a stronger relationship between the variables.
CityRank is negatively correlated with IsMetroCity (-0.56, a moderately strong relationship), while it is positively correlated with IsTouristDestination and Airport (the hotel's distance from the airport). IsWeekend is positively correlated with IsNewYearEve. HotelCapacity is positively correlated with StarRating, and HasSwimmingPool is positively correlated with HotelCapacity.
Another way of representing the correlations is a corrgram
library(corrgram)
corrgram(hotel, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of Hotel data")
Based on the correlogram, we single out the four variables that are most strongly related to RoomRent compared with the others: IsTouristDestination, StarRating, HasSwimmingPool and HotelCapacity.
Draw a separate correlogram of these variables along with RoomRent
library(corrgram)
# Select the columns by name so that the code does not depend on attach(hotel)
rent <- hotel[, c("RoomRent", "HasSwimmingPool", "HotelCapacity", "StarRating", "IsTouristDestination")]
corrgram(rent, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of Hotel Prices In India")
Now compute the correlation and covariance of each of these four variables with RoomRent. Note that var(x, y) and cov(x, y) return the same covariances.
x<-hotel[,c("HasSwimmingPool","StarRating", "HotelCapacity","IsTouristDestination")]
y<-hotel[,c("RoomRent")]
cor(x,y)
## [,1]
## HasSwimmingPool 0.3116577
## StarRating 0.3693734
## HotelCapacity 0.1578733
## IsTouristDestination 0.1225030
var(x,y)
## [,1]
## HasSwimmingPool 1094.2017
## StarRating 2048.3755
## HotelCapacity 88753.4128
## IsTouristDestination 412.7803
cov(x,y)
## [,1]
## HasSwimmingPool 1094.2017
## StarRating 2048.3755
## HotelCapacity 88753.4128
## IsTouristDestination 412.7803
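The covariance for HotelCapacity is far larger than the others even though its correlation is modest. Covariance depends on the units of the variables, while correlation is the covariance divided by both standard deviations. A minimal sketch on simulated data (the variables capacity and rent are made up for illustration and are not taken from the hotel dataset):

```r
set.seed(1)
capacity <- rnorm(200, mean = 60, sd = 40)                # hypothetical predictor on a wide scale
rent     <- 5000 + 10 * capacity + rnorm(200, sd = 3000)  # hypothetical response

cov_xy <- cov(capacity, rent)
cor_xy <- cor(capacity, rent)

# Correlation is the covariance rescaled by the two standard deviations
cor_from_cov <- cov_xy / (sd(capacity) * sd(rent))
print(c(covariance = cov_xy, correlation = cor_xy, rescaled = cor_from_cov))

# var(x, y) with two arguments returns the same value as cov(x, y)
same_as_var <- isTRUE(all.equal(var(capacity, rent), cov_xy))
print(same_as_var)
```

For comparing the strength of association across predictors measured in different units, the correlations are therefore the quantities to read, not the covariances.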
Now we test some assumptions about the data set.
ASSUMPTION 1: H1: IsNewYearEve is related to RoomRent. H0: IsNewYearEve is not related to RoomRent.
tab1<-with(hotel,table(IsNewYearEve,RoomRent))
chisq.test(tab1)
## Warning in chisq.test(tab1): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tab1
## X-squared = 2047.8, df = 2155, p-value = 0.9505
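Here the p-value is > 0.05, so we fail to reject the null hypothesis that these two variables are unrelated, and the chi-square test does not support our claim. This result should be read with caution: RoomRent is a continuous variable, so the contingency table has a very large number of sparsely filled columns (df = 2155 implies 2,156 distinct RoomRent values), which is why R warns that the chi-squared approximation may be incorrect.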
ASSUMPTION2- H2: New Year Eve is related with Tourist Destination
tab2<-with(hotel,table(IsNewYearEve,IsTouristDestination))
chisq.test(tab2)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab2
## X-squared = 0.053842, df = 1, p-value = 0.8165
Here again our claim is not supported, since the p-value is > 0.05. There is no evidence of a relationship between IsNewYearEve and IsTouristDestination.
Some hypothesis testing on the basis of t-test.
H1:- Average RoomRent on Weekend is greater than the Average Roomrent on non-weekend. H0:- There is no significant difference between the average RoomRent on weekend and weekdays
We will use right tail t-test
t.test(RoomRent~IsWeekend,data = hotel,alternative="greater")
##
## Welch Two Sample t-test
##
## data: RoomRent by IsWeekend
## t = -0.51853, df = 9999.4, p-value = 0.698
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -289.122 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 5430.835 5500.129
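Here the p-value is > 0.05, so we fail to reject the null hypothesis. Note that with the formula interface R takes the difference as mean of group 0 minus mean of group 1, so alternative = "greater" actually tests whether weekday rent exceeds weekend rent. The stated H1 corresponds to alternative = "less", whose p-value is 1 - 0.698 = 0.302. In either direction there is no significant evidence that the average RoomRent differs between weekends and weekdays.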
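The direction of a one-sided t-test with a formula is easy to get wrong, so here is a small sketch on simulated data. The data frame toy and its values are invented for illustration; only the t.test() call mirrors the one used in this report.

```r
set.seed(42)
toy <- data.frame(
  IsWeekend = rep(c(0, 1), each = 200),
  RoomRent  = c(rnorm(200, mean = 5000, sd = 1000),   # group 0: weekdays
                rnorm(200, mean = 5500, sd = 1000))   # group 1: weekends, higher on average
)

# The tested difference is mean(group 0) - mean(group 1).
# "less" therefore asks: is weekday rent lower than weekend rent?
p_less    <- t.test(RoomRent ~ IsWeekend, data = toy, alternative = "less")$p.value
# "greater" asks the opposite: is weekday rent higher than weekend rent?
p_greater <- t.test(RoomRent ~ IsWeekend, data = toy, alternative = "greater")$p.value

print(c(less = p_less, greater = p_greater))
```

The two one-sided p-values always sum to 1, which is why a p-value reported for one direction can be converted to the other without rerunning the test.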
H2:- Average Roomrent on the normal days is less than the average Roomrent on NewYearEve days
We will use left tail t-test
t.test(RoomRent~IsNewYearEve,data = hotel,alternative="less")
##
## Welch Two Sample t-test
##
## data: RoomRent by IsNewYearEve
## t = -4.1793, df = 2065, p-value = 1.523e-05
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -518.4763
## sample estimates:
## mean in group 0 mean in group 1
## 5367.606 6222.826
The p-value is < 0.05, so we reject the null hypothesis and conclude that the average RoomRent on normal days is lower than on New Year's Eve.
H3:- Average RoomRent of Metro Cities is greater than that of non-metro cities.
t.test(RoomRent~IsMetroCity,data = hotel,alternative="greater")
##
## Welch Two Sample t-test
##
## data: RoomRent by IsMetroCity
## t = 10.721, df = 13224, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 919.9785 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 5782.794 4696.073
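The p-value is < 0.05, but the test does not support H3. With the formula interface, alternative = "greater" tests whether the mean of group 0 (non-metro cities) exceeds the mean of group 1 (metro cities). The significant result therefore shows that the average RoomRent in non-metro cities (5782.79) is greater than in metro cities (4696.07), which is the opposite of what H3 claims.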
H4:- Average RoomRent in non-tourist-destination cities is less than that in tourist-destination cities.
t.test(RoomRent ~ IsTouristDestination, data = hotel,alternative="less")
##
## Welch Two Sample t-test
##
## data: RoomRent by IsTouristDestination
## t = -19.449, df = 12888, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -1789.665
## sample estimates:
## mean in group 0 mean in group 1
## 4111.003 6066.024
Here the p-value is < 0.05, so we reject the null hypothesis and conclude that the average RoomRent in non-tourist-destination cities is less than that in tourist-destination cities.
H5:- Average RoomRent where wifi is free is greater than where free wifi is not available.
t.test(RoomRent ~ FreeWifi, data = hotel,alternative="greater")
##
## Welch Two Sample t-test
##
## data: RoomRent by FreeWifi
## t = -0.76847, df = 1804.7, p-value = 0.7788
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -318.9097 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 5380.004 5481.518
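The p-value is > 0.05, so we fail to reject the null hypothesis. Note that alternative = "greater" as written tests whether hotels without free wifi (group 0) charge more; the stated H5 corresponds to alternative = "less", whose p-value is 1 - 0.7788 = 0.2212. In either direction there is no significant evidence that free wifi changes the average room rent.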
H6:- Average Room Rent is greater where free Breakfast is available.
t.test(RoomRent ~ FreeBreakfast ,data = hotel,alternative="greater")
##
## Welch Two Sample t-test
##
## data: RoomRent by FreeBreakfast
## t = 0.98095, df = 6212.3, p-value = 0.1633
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -104.0926 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 5573.790 5420.044
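The p-value is > 0.05, so we fail to reject the null hypothesis. Note that alternative = "greater" as written tests whether hotels without free breakfast (group 0) charge more; the stated H6 corresponds to alternative = "less", whose p-value is 1 - 0.1633 = 0.8367. In either direction there is no significant evidence that free breakfast changes the average room rent.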
H7:- Average Room Rent is significantly different where Swimming Pool is available.
t.test(RoomRent~HasSwimmingPool ,data = hotel)
##
## Welch Two Sample t-test
##
## data: RoomRent by HasSwimmingPool
## t = -29.013, df = 5011.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5096.030 -4450.942
## sample estimates:
## mean in group 0 mean in group 1
## 3775.566 8549.052
The p-value is < 0.05, so we reject the null hypothesis. The sample means show that average RoomRent is higher where a swimming pool is available (8549.05 versus 3775.57).
Having visualized the data, tested for correlation among the variables and checked some relevant assumptions, it is now time for regression analysis to see which of the variables actually contribute to the pricing strategy of hotels in the Indian hotel industry.
Let us first test the significance of the correlation of different variables with RoomRent
Roomrent vs population
cor.test(hotel$RoomRent,hotel$Population)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$Population
## t = -10.246, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.10560734 -0.07179767
## sample estimates:
## cor
## -0.08872806
Here the correlation is significant, as the p-value is < 0.05, so the null hypothesis of no correlation is rejected. Population and RoomRent are negatively correlated.
Roomrent vs City Ranking
cor.test(hotel$RoomRent,hotel$CityRank)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$CityRank
## t = 10.858, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07707001 0.11084696
## sample estimates:
## cor
## 0.09398553
p-value is less than 0.05 . so there is significant positive correlation between these two.
Roomrent vs Metrocity
cor.test(hotel$RoomRent,hotel$IsMetroCity)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$IsMetroCity
## t = -7.7053, df = 13230, p-value = 1.399e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08378329 -0.04985761
## sample estimates:
## cor
## -0.06683977
The p-value is < 0.05, so there is a significant negative correlation between RoomRent and IsMetroCity.
Roomrent vs Tourist destination
cor.test(hotel$RoomRent,hotel$IsTouristDestination)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$IsTouristDestination
## t = 14.197, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1056846 0.1392512
## sample estimates:
## cor
## 0.122503
p-value < 0.05. So there is significant and positive relationship.
RoomRent vs NewYear Eve
cor.test(hotel$RoomRent,hotel$IsNewYearEve)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$IsNewYearEve
## t = 4.4306, df = 13230, p-value = 9.472e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02146637 0.05549377
## sample estimates:
## cor
## 0.03849123
Here p-value<0.05 . There is a significant positive relationship between these two.
RoomRent vs Starrating
cor.test(hotel$RoomRent,hotel$StarRating)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$StarRating
## t = 45.719, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3545660 0.3839956
## sample estimates:
## cor
## 0.3693734
p-value<0.05 So there is significant positive correlation between Star rating and RoomRent.
RoomRent vs Airport
cor.test(hotel$RoomRent, hotel$Airport)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$Airport
## t = 5.7183, df = 13230, p-value = 1.099e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03264192 0.06663581
## sample estimates:
## cor
## 0.04965324
p-value<0.05. significant relationship
RoomRent vs FreeWifi
cor.test(hotel$RoomRent, hotel$FreeWifi)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$FreeWifi
## t = 0.41719, df = 13230, p-value = 0.6765
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01341277 0.02066466
## sample estimates:
## cor
## 0.003627002
Here the p-value is 0.6765, which is > 0.05, so the correlation between FreeWifi and RoomRent is not significant.
Roomrent vs Breakfast
cor.test(hotel$RoomRent, hotel$FreeBreakfast)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$FreeBreakfast
## t = -1.151, df = 13230, p-value = 0.2497
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.027040698 0.007033769
## sample estimates:
## cor
## -0.01000637
Here the p-value is > 0.05, so we fail to reject the null hypothesis. There is no significant relationship between FreeBreakfast and RoomRent.
Roomrent vs Hotel Capacity
cor.test(hotel$RoomRent, hotel$HotelCapacity)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$HotelCapacity
## t = 18.389, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1412142 0.1744430
## sample estimates:
## cor
## 0.1578733
Here pvalue<0.05. So the relationship is Significant
Roomrent vs SwimmingPool
cor.test(hotel$RoomRent,hotel$HasSwimmingPool)
##
## Pearson's product-moment correlation
##
## data: hotel$RoomRent and hotel$HasSwimmingPool
## t = 37.726, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2961917 0.3269604
## sample estimates:
## cor
## 0.3116577
Here the p-value is < 0.05, so we can reject the null hypothesis and conclude that HasSwimmingPool and RoomRent are significantly positively correlated.
On the basis of these correlation tests, the variables that are not significantly correlated with RoomRent are FreeWifi and FreeBreakfast.
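The individual cor.test() calls above can be collected into one table, which makes it harder to misread a single p-value. The following is a sketch only: cor_table is a hypothetical helper, and the toy data frame stands in for the hotel data.

```r
# Hypothetical helper: run cor.test() of every numeric column against the response
cor_table <- function(df, response) {
  predictors <- setdiff(names(df)[sapply(df, is.numeric)], response)
  rows <- lapply(predictors, function(v) {
    ct <- cor.test(df[[response]], df[[v]])
    data.frame(variable = v, cor = unname(ct$estimate), p.value = ct$p.value)
  })
  out <- do.call(rbind, rows)
  out$significant <- out$p.value < 0.05   # same 5% threshold as in the text
  out
}

# Simulated stand-in data: rent depends on StarRating but not on FreeWifi
set.seed(7)
toy <- data.frame(StarRating = sample(1:5, 300, replace = TRUE),
                  FreeWifi   = rbinom(300, 1, 0.9))
toy$RoomRent <- 1000 + 1500 * toy$StarRating + rnorm(300, sd = 2000)

res <- cor_table(toy, "RoomRent")
print(res)
```

Applied to the real data, the call would be cor_table(hotel, "RoomRent"), assuming hotel is loaded as in the rest of this report.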
Fitting a regression model requires choosing a correct and robust specification, but how do we select it? There are several methods for selecting a model:
The all-subsets function tests every possible subset of the set of candidate variables. If there are N candidate independent variables (besides the constant), there are 2^N (2 raised to the power N) distinct subsets to test.
attach(hotel)
model<-lm(RoomRent~., data= hotel)
ols_all_subset(model)
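To give a sense of scale for the 2^N count, here is the arithmetic for the 12 candidate predictors used in Model1 later in this report. The variable names are illustrative.

```r
n_predictors <- 12                  # number of candidate predictors in Model1
n_subsets <- 2^n_predictors         # each predictor is either in or out of a subset

# The same total obtained by counting the subsets of each size 0..12
n_by_size <- choose(n_predictors, 0:n_predictors)

print(n_subsets)
print(sum(n_by_size))
```

The count doubles with every added predictor, which is why the stepwise procedures described next are often used instead of an exhaustive search.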
Then use the best-subset function, which selects the best subset of predictors by criteria such as maximum adjusted R-squared and minimum AIC (explained below).
model<-lm(RoomRent~.,data= hotel)
ols_best_subset(model)
Then use the plot method to get a panel of fit criteria for the subset regressions.
model<-lm(RoomRent~.,data= hotel)
k<-ols_all_subset(model)
plot(k)
Stepwise Forward Regression
This function builds a regression model from a set of candidate predictor variables by entering predictors based on p-values, in a stepwise manner, until no variable is left to enter. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.
model<-lm(RoomRent~.,data= hotel)
ols_step_forward(model)
For detailed output
model<-lm(RoomRent~.,data= hotel)
ols_step_forward(model, details= TRUE)
Stepwise Backward Regression
This function builds a regression model from a set of candidate predictor variables by removing predictors based on p-values, in a stepwise manner, until no variable is left to remove. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed. This is the reverse of the forward procedure.
model<-lm(RoomRent~.,data= hotel)
ols_step_backward(model)
For detailed output
model<-lm(RoomRent~.,data= hotel)
ols_step_backward(model, details= TRUE)
Stepwise Regression
This function builds a regression model from a set of candidate predictor variables by entering and removing predictors based on p-values, in a stepwise manner, until no variable is left to enter or remove. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.
model<-lm(RoomRent~.,data= hotel)
ols_stepwise(model)
For detailed output
model<-lm(RoomRent~.,data= hotel)
ols_stepwise(model,details= TRUE)
Stepwise AIC Regression
This function builds a regression model from a set of candidate predictor variables by entering and removing predictors based on the Akaike Information Criterion, in a stepwise manner, until no variable is left to enter or remove. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.
model<-lm(RoomRent~.,data= hotel)
ols_stepaic_both(model)
For Detailed output
model<-lm(RoomRent~.,data= hotel)
ols_stepaic_both(model,details=TRUE)
Here we use the stepwise regression method via the step() function. By default it searches in the backward direction if scope is not given. The full model is passed to step(), which searches over the full set of variables. It performs multiple iterations, dropping one X variable at a time. The AIC of each candidate model is computed, and the model with the lowest AIC is retained for the next iteration.
lmMod<-lm(RoomRent~.,data=hotel)
selectedMod<-step(lmMod)
summary(selectedMod)
model<- RoomRent ~ Population + IsMetroCity + IsTouristDestination +
IsNewYearEve + StarRating + Airport + FreeWifi + HotelCapacity +
HasSwimmingPool
fit1<-lm(formula = model,data = hotel)
summary(fit1)
##
## Call:
## lm(formula = model, data = hotel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11839 -2385 -691 1045 309532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.560e+03 4.055e+02 -21.109 < 2e-16 ***
## Population -1.244e-04 2.263e-05 -5.499 3.88e-08 ***
## IsMetroCity -6.369e+02 2.132e+02 -2.988 0.00282 **
## IsTouristDestination 1.918e+03 1.374e+02 13.958 < 2e-16 ***
## IsNewYearEve 8.430e+02 1.739e+02 4.849 1.26e-06 ***
## StarRating 3.598e+03 1.104e+02 32.582 < 2e-16 ***
## Airport 1.001e+01 2.716e+00 3.684 0.00023 ***
## FreeWifi 5.952e+02 2.217e+02 2.685 0.00726 **
## HotelCapacity -1.040e+01 1.029e+00 -10.115 < 2e-16 ***
## HasSwimmingPool 2.147e+03 1.598e+02 13.434 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6600 on 13222 degrees of freedom
## Multiple R-squared: 0.1904, Adjusted R-squared: 0.1899
## F-statistic: 345.5 on 9 and 13222 DF, p-value: < 2.2e-16
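The backward AIC search performed by step() can be tried on a small built-in dataset. This is only a sketch: mtcars ships with base R and stands in for the hotel data, and the object names full and selected are illustrative.

```r
# mtcars is a built-in dataset used here purely as a stand-in
full <- lm(mpg ~ ., data = mtcars)

# With no scope argument, step() searches backward from the full model;
# trace = 0 suppresses the per-step printout
selected <- step(full, trace = 0)

print(formula(selected))
print(c(full = AIC(full), selected = AIC(selected)))
```

Because a term is only dropped when doing so lowers the AIC, the selected model never has a higher AIC than the full model it started from.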
MODEL1
Model1 <- RoomRent ~ Population+CityRank+IsMetroCity+IsTouristDestination+IsWeekend+IsNewYearEve+StarRating+Airport+FreeWifi+FreeBreakfast+HotelCapacity+HasSwimmingPool
fit1 <- lm(Model1, data = hotel)
summary(fit1)
##
## Call:
## lm(formula = Model1, data = hotel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11845 -2356 -690 1030 309689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.604e+03 4.494e+02 -19.147 < 2e-16 ***
## Population -1.188e-04 3.592e-05 -3.307 0.000945 ***
## CityRank 1.821e+00 1.035e+01 0.176 0.860302
## IsMetroCity -6.640e+02 2.164e+02 -3.068 0.002158 **
## IsTouristDestination 1.925e+03 1.481e+02 13.001 < 2e-16 ***
## IsWeekend -9.076e+01 1.239e+02 -0.733 0.463709
## IsNewYearEve 8.826e+02 1.818e+02 4.855 1.22e-06 ***
## StarRating 3.592e+03 1.108e+02 32.434 < 2e-16 ***
## Airport 9.510e+00 3.171e+00 2.999 0.002709 **
## FreeWifi 5.498e+02 2.242e+02 2.452 0.014214 *
## FreeBreakfast 1.688e+02 1.233e+02 1.369 0.171163
## HotelCapacity -1.028e+01 1.033e+00 -9.945 < 2e-16 ***
## HasSwimmingPool 2.153e+03 1.616e+02 13.327 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6601 on 13219 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1898
## F-statistic: 259.3 on 12 and 13219 DF, p-value: < 2.2e-16
Here CityRank, IsWeekend and FreeBreakfast are not statistically significant predictors of RoomRent, since their p-values are > 0.05. All the other variables are significant, with p-values < 0.05.
MODEL FIT
library(leaps)
leap1 <- regsubsets(Model1, data = hotel, nbest=1)
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(Model1, data = hotel, nbest = 1)
## 12 Variables (and intercept)
## Forced in Forced out
## Population FALSE FALSE
## CityRank FALSE FALSE
## IsMetroCity FALSE FALSE
## IsTouristDestination FALSE FALSE
## IsWeekend FALSE FALSE
## IsNewYearEve FALSE FALSE
## StarRating FALSE FALSE
## Airport FALSE FALSE
## FreeWifi FALSE FALSE
## FreeBreakfast FALSE FALSE
## HotelCapacity FALSE FALSE
## HasSwimmingPool FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## Population CityRank IsMetroCity IsTouristDestination IsWeekend
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " "
## 3 ( 1 ) "*" " " " " "*" " "
## 4 ( 1 ) "*" " " " " "*" " "
## 5 ( 1 ) "*" " " " " "*" " "
## 6 ( 1 ) "*" " " " " "*" " "
## 7 ( 1 ) "*" " " " " "*" " "
## 8 ( 1 ) "*" " " "*" "*" " "
## IsNewYearEve StarRating Airport FreeWifi FreeBreakfast
## 1 ( 1 ) " " "*" " " " " " "
## 2 ( 1 ) " " "*" " " " " " "
## 3 ( 1 ) " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " " " " "
## 6 ( 1 ) "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" "*" " " " "
## HotelCapacity HasSwimmingPool
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
plot(leap1, scale="adjr2")
On the basis of the p-values calculated earlier, we drop three variables, namely CityRank, IsWeekend and FreeBreakfast.
Now Model2
Model2 <- RoomRent ~ StarRating+Population+IsMetroCity+IsTouristDestination+IsNewYearEve+Airport+FreeWifi+HotelCapacity+HasSwimmingPool
fit2 <- lm(Model2, data = hotel)
summary(fit2)
##
## Call:
## lm(formula = Model2, data = hotel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11839 -2385 -691 1045 309532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.560e+03 4.055e+02 -21.109 < 2e-16 ***
## StarRating 3.598e+03 1.104e+02 32.582 < 2e-16 ***
## Population -1.244e-04 2.263e-05 -5.499 3.88e-08 ***
## IsMetroCity -6.369e+02 2.132e+02 -2.988 0.00282 **
## IsTouristDestination 1.918e+03 1.374e+02 13.958 < 2e-16 ***
## IsNewYearEve 8.430e+02 1.739e+02 4.849 1.26e-06 ***
## Airport 1.001e+01 2.716e+00 3.684 0.00023 ***
## FreeWifi 5.952e+02 2.217e+02 2.685 0.00726 **
## HotelCapacity -1.040e+01 1.029e+00 -10.115 < 2e-16 ***
## HasSwimmingPool 2.147e+03 1.598e+02 13.434 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6600 on 13222 degrees of freedom
## Multiple R-squared: 0.1904, Adjusted R-squared: 0.1899
## F-statistic: 345.5 on 9 and 13222 DF, p-value: < 2.2e-16
Here every variable is significant: all the p-values are less than 0.05.
library(leaps)
leap2 <- regsubsets(Model2, data = hotel, nbest=1)
summary(leap2)
## Subset selection object
## Call: regsubsets.formula(Model2, data = hotel, nbest = 1)
## 9 Variables (and intercept)
## Forced in Forced out
## StarRating FALSE FALSE
## Population FALSE FALSE
## IsMetroCity FALSE FALSE
## IsTouristDestination FALSE FALSE
## IsNewYearEve FALSE FALSE
## Airport FALSE FALSE
## FreeWifi FALSE FALSE
## HotelCapacity FALSE FALSE
## HasSwimmingPool FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## StarRating Population IsMetroCity IsTouristDestination
## 1 ( 1 ) "*" " " " " " "
## 2 ( 1 ) "*" "*" " " " "
## 3 ( 1 ) "*" "*" " " "*"
## 4 ( 1 ) "*" "*" " " "*"
## 5 ( 1 ) "*" "*" " " "*"
## 6 ( 1 ) "*" "*" " " "*"
## 7 ( 1 ) "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" "*"
## IsNewYearEve Airport FreeWifi HotelCapacity HasSwimmingPool
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " "*"
## 5 ( 1 ) " " " " " " "*" "*"
## 6 ( 1 ) "*" " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " "*" "*"
## 8 ( 1 ) "*" "*" " " "*" "*"
plot(leap2, scale="adjr2")
Now Checking for multicollinearity
Model2 <- RoomRent ~ StarRating+Population+IsMetroCity+IsTouristDestination+IsNewYearEve+Airport+FreeWifi+HotelCapacity+HasSwimmingPool
fit2 <- lm(Model2, data = hotel)
all_vifs <- car::vif(fit2)
print(all_vifs)
## StarRating Population IsMetroCity
## 2.118451 2.820418 2.807261
## IsTouristDestination IsNewYearEve Airport
## 1.210458 1.000013 1.160325
## FreeWifi HotelCapacity HasSwimmingPool
## 1.024652 1.888342 1.777718
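For context on what these numbers mean: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A base-R sketch of that definition follows; vif_manual is a hypothetical helper, and mtcars is used as a stand-in dataset. For a model with only numeric predictors the values should agree with car::vif().

```r
# Hypothetical helper: VIF computed directly from its definition
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    # Regress this predictor on all the other predictors
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in example on the built-in mtcars data
vifs <- vif_manual(mtcars, c("wt", "disp", "hp"))
print(round(vifs, 2))
```

A VIF of 1 means the predictor is uncorrelated with the others; the larger the value, the more of its variation is already explained by the remaining predictors.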
Remove the variable with the highest VIF and rebuild the model, repeating until no VIF exceeds 2.5
signif_all <- names(all_vifs)
# Drop the variable with the highest VIF and refit, until no VIF exceeds 2.5.
while(any(all_vifs > 2.5)){
  var_with_max_vif <- names(which(all_vifs == max(all_vifs))) # get the var with max vif
  signif_all <- signif_all[!(signif_all %in% var_with_max_vif)] # remove it from the predictor list
  myForm <- as.formula(paste("RoomRent ~ ", paste(signif_all, collapse=" + "), sep="")) # new formula
  selectedMod <- lm(myForm, data=hotel) # re-build model with new formula
  all_vifs <- car::vif(selectedMod) # recompute VIFs for the reduced model
}
summary(selectedMod)
##
## Call:
## lm(formula = myForm, data = hotel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11654 -2365 -710 1067 309426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8838.099 402.792 -21.942 < 2e-16 ***
## StarRating 3569.749 110.439 32.323 < 2e-16 ***
## IsMetroCity -1530.867 138.033 -11.091 < 2e-16 ***
## IsTouristDestination 2094.588 133.722 15.664 < 2e-16 ***
## IsNewYearEve 843.370 174.054 4.845 1.28e-06 ***
## Airport 11.506 2.705 4.253 2.12e-05 ***
## FreeWifi 534.928 221.665 2.413 0.0158 *
## HotelCapacity -11.137 1.021 -10.907 < 2e-16 ***
## HasSwimmingPool 2225.460 159.331 13.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6608 on 13223 degrees of freedom
## Multiple R-squared: 0.1886, Adjusted R-squared: 0.1881
## F-statistic: 384.1 on 8 and 13223 DF, p-value: < 2.2e-16
car::vif(selectedMod)
## StarRating IsMetroCity IsTouristDestination
## 2.113742 1.174547 1.144109
## IsNewYearEve Airport FreeWifi
## 1.000013 1.148621 1.022146
## HotelCapacity HasSwimmingPool
## 1.856658 1.763424
Here the VIF of every variable is below 2.5 and every variable is statistically significant, as the p-values show.
So we can say that our model is free from serious multicollinearity and that its coefficients are significant.
OBSERVATIONS OF THIS REGRESSION ANALYSIS-
NULL HYPOTHESIS (H0):- None of the factors affects RoomRent, i.e. there is no dependency between the room rent of a hotel and the other variables, or equivalently all the beta coefficients are zero. ALTERNATIVE HYPOTHESIS (H1):- There is a dependency between RoomRent and the other variables, i.e. at least one of the beta coefficients is not equal to zero.
From the regression analysis we found that the p-value of the F-test is < 0.05.
Therefore we reject the null hypothesis and conclude that at least one of the variables affects the dependent variable RoomRent. The F-statistic is also very high (384.1), which means the model with these predictors fits significantly better than an intercept-only model.
The R-squared value is 0.1886, which means this model explains only 18.86% of the variation in RoomRent, which is quite low. Still, among the models considered here it is free from multicollinearity and all of its terms are significant.
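The F-statistic and R-squared are linked by F = (R² / k) / ((1 - R²) / residual df), where k is the number of predictors. As a check, the sketch below plugs in the rounded values printed in the summary of the VIF-filtered model; the variable names are illustrative.

```r
r2       <- 0.1886   # Multiple R-squared from the summary above (rounded)
k        <- 8        # number of predictors in the model
df_resid <- 13223    # residual degrees of freedom

# Explained variation per predictor divided by unexplained variation per residual df
f_stat <- (r2 / k) / ((1 - r2) / df_resid)
print(round(f_stat, 1))
```

The result matches the reported F-statistic of 384.1 up to the rounding of R-squared. With more than 13,000 observations, even a model that explains under a fifth of the variation produces a very large F.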
Calculating the adjusted R squared and AIC values for both model1 and model2.
summary(fit1)$adj.r.squared
## [1] 0.1898256
summary(fit2)$adj.r.squared
## [1] 0.1898573
AIC(fit1)
## [1] 270314.1
AIC(fit2)
## [1] 270310.6
RESULT- Here we can clearly see that the adjusted R-squared of Model2 (0.18986) is slightly greater than that of Model1 (0.18983), and the AIC of Model2 (270310.6) is lower than that of Model1 (270314.1). So Model2 is preferred: it has a marginally higher adjusted R-squared and a lower AIC while using three fewer predictors.
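For reference, AIC is defined as 2k - 2 log(L), where k is the number of estimated parameters and L is the maximized likelihood, so it rewards fit but penalizes every extra parameter. The sketch below verifies this against R's AIC() on the built-in mtcars data, which is a stand-in and not the hotel data.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# For a linear model the parameters are the coefficients plus the error variance
k <- length(coef(fit)) + 1
aic_manual <- 2 * k - 2 * as.numeric(logLik(fit))

print(c(manual = aic_manual, builtin = AIC(fit)))
```

This is why dropping three non-significant predictors can lower the AIC even though the raw R-squared falls slightly.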
Listing the beta coefficients and confidence intervals
fit2$coefficients
## (Intercept) StarRating Population
## -8.559981e+03 3.598383e+03 -1.244506e-04
## IsMetroCity IsTouristDestination IsNewYearEve
## -6.368734e+02 1.917691e+03 8.429888e+02
## Airport FreeWifi HotelCapacity
## 1.000580e+01 5.952249e+02 -1.040392e+01
## HasSwimmingPool
## 2.146661e+03
confint(fit2)
## 2.5 % 97.5 %
## (Intercept) -9.354845e+03 -7.765117e+03
## StarRating 3.381905e+03 3.814862e+03
## Population -1.688086e-04 -8.009249e-05
## IsMetroCity -1.054700e+03 -2.190462e+02
## IsTouristDestination 1.648382e+03 2.187001e+03
## IsNewYearEve 5.021954e+02 1.183782e+03
## Airport 4.682508e+00 1.532909e+01
## FreeWifi 1.606768e+02 1.029773e+03
## HotelCapacity -1.242003e+01 -8.387814e+00
## HasSwimmingPool 1.833432e+03 2.459890e+03
INTERPRETATION of a few coefficients- The StarRating coefficient is 3598.38, and its 95% confidence interval runs from 3381.91 to 3814.86. This means that if the study were repeated many times, about 95% of the intervals constructed this way would contain the true coefficient. The other coefficients can be interpreted similarly.
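The interval reported by confint() is the estimate plus or minus a t critical value times the standard error. The sketch below reproduces the StarRating interval from the numbers printed above; the standard error is the rounded value from summary(fit2), so the result agrees only up to rounding.

```r
estimate  <- 3598.383   # StarRating coefficient from fit2$coefficients
std_error <- 110.4      # StarRating standard error from summary(fit2), rounded
df_resid  <- 13222      # residual degrees of freedom from summary(fit2)

# Two-sided 95% interval: 2.5% in each tail of the t distribution
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
print(round(ci, 1))
```

With this many degrees of freedom the critical value is essentially the normal value of 1.96, so the interval is roughly the estimate plus or minus two standard errors.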
Estimating a few values for RoomRent
f<-fitted(fit2)
f[1:50]
## 1 2 3 4 5 6 7 8
## 9130.375 9130.375 9130.375 9130.375 9130.375 9973.364 9130.375 9130.375
## 9 10 11 12 13 14 15 16
## 6069.941 6069.941 6069.941 6069.941 6069.941 6912.930 6069.941 6069.941
## 17 18 19 20 21 22 23 24
## 7602.771 7602.771 7602.771 7602.771 7602.771 8445.759 7602.771 7602.771
## 25 26 27 28 29 30 31 32
## 6280.099 6280.099 6280.099 6280.099 6280.099 7123.088 6280.099 6280.099
## 33 34 35 36 37 38 39 40
## 4041.863 4041.863 4041.863 4041.863 4041.863 4884.852 4041.863 4041.863
## 41 42 43 44 45 46 47 48
## 2502.769 2502.769 2502.769 2502.769 2502.769 3345.758 2502.769 2502.769
## 49 50
## 2409.134 2409.134
Some of these values underestimate the actual RoomRent and some overestimate it: although this is the best-fitting model among those tried, its R-squared is only 0.1886.
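One pattern in the fitted values can be checked by hand. Assuming that each block of eight rows is one hotel on the eight dates and that the sixth row of each block is Dec 31 (an assumption about the row order, not something stated in the output), the jump in the sixth value should equal the IsNewYearEve coefficient. A minimal sketch with illustrative variable names:

```r
# Fitted values copied from the output above, one pair per block of eight rows:
# the value on ordinary dates and the value in the sixth row of the block
ordinary_date <- c(9130.375, 6069.941, 7602.771, 6280.099, 4041.863, 2502.769)
sixth_row     <- c(9973.364, 6912.930, 8445.759, 7123.088, 4884.852, 3345.758)

# IsNewYearEve coefficient from fit2$coefficients
new_year_coef <- 8.429888e+02

# Jump between the sixth row and the rest of each block
jumps <- sixth_row - ordinary_date
print(jumps)

# TRUE if every jump matches the coefficient up to rounding in the printout
print(all(abs(jumps - new_year_coef) < 0.01))
```

This shows that, within one hotel, the only thing that moves the fitted value across dates is the New Year's Eve dummy; every other predictor in model2 is constant for a given hotel.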
So Final conclusion-
Below are the 9 variables (both internal and external) that significantly affect the room price, listed in order of significance.
StarRating- positively affects RoomRent. Holding the other variables constant, each additional star is associated with a rent about Rs 3,598 (3.598383e+03) higher. Tourists care about the rating of a hotel: a higher rating signals higher standards and luxury, and tourists do not mind paying more for this.
Population- negatively affects RoomRent. Each additional resident is associated with a rent 1.244506e-04 rupees lower, which is about Rs 124 per additional million residents.
IsTouristDestination- positively affects RoomRent. Tourists are drawn to tourist destinations and hotel owners take advantage of this: rooms stay in demand throughout the year in such cities.
HotelCapacity- negatively affects RoomRent. An increase in guest capacity is associated with a decrease in RoomRent: the more guests a hotel can accommodate, the better placed it is to lower its prices.
HasSwimmingPool- positively affects RoomRent. After a day of touring, guests like to relax at the hotel, and a swimming pool is a popular way to do so, so hotels with a pool charge more.
IsNewYearEve- positively affects RoomRent. Many tourists go out on New Year's Eve, which raises the demand for hotels, and many tourists chasing a fixed number of rooms drive up the prices.
Airport- positively affects RoomRent. The greater the distance from the hotel to the airport, the more expensive the hotel. This seems somewhat counterintuitive, but tourists visit the city rather than the airport area: the more easily accessible a hotel is within the city, the higher its price, because guests do not want to return to the airport area every time they want to relax at their hotel.
IsMetroCity- negatively affects RoomRent. A possible explanation is that metro cities are big, crowded and noisy, which makes it hard to relax and attracts fewer tourists, so hotels in these cities are less expensive.
FreeWifi- positively affects RoomRent but is the least significant of the nine factors. Most tourists now carry their own data connection, so Wifi does not seem to be a major factor; it is still significant, so hotels with free Wifi can be expected to be more expensive.
The factors that do not significantly affect RoomRent are IsWeekend, CityRank and FreeBreakfast.
Tourists do not seem to take into account whether it is a weekday or a weekend; they travel whenever they want to, irrespective of the day.
Similarly, city ranking does not matter; what matters is an exotic location. Breakfast does not seem to be a motivating factor either, as it is more like a complement to the room. Most tourists return to the hotel late at night and wake up late, by which time breakfast is normally over.
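To make the sizes of the nine significant coefficients concrete, they can be combined into a predicted rent by hand. The sketch below uses the coefficients printed by fit2$coefficients; the hotel profile and the helper predict_rent() are hypothetical and only for illustration, and the Airport and HotelCapacity values are in whatever units the dataset uses.

```r
# Coefficients copied from the fit2$coefficients output above
coefs <- c(
  "(Intercept)"        = -8.559981e+03,
  StarRating           =  3.598383e+03,
  Population           = -1.244506e-04,
  IsMetroCity          = -6.368734e+02,
  IsTouristDestination =  1.917691e+03,
  IsNewYearEve         =  8.429888e+02,
  Airport              =  1.000580e+01,
  FreeWifi             =  5.952249e+02,
  HotelCapacity        = -1.040392e+01,
  HasSwimmingPool      =  2.146661e+03
)

# Hypothetical helper: intercept plus the sum of coefficient * value
predict_rent <- function(hotel) {
  unname(coefs["(Intercept)"] + sum(coefs[names(hotel)] * hotel))
}

# Hypothetical hotel profile (made-up values, not a row from the dataset)
hotel <- c(StarRating = 4, Population = 5e6, IsMetroCity = 0,
           IsTouristDestination = 1, IsNewYearEve = 0, Airport = 20,
           FreeWifi = 1, HotelCapacity = 100, HasSwimmingPool = 1)

base_rent <- predict_rent(hotel)

# Same hotel on New Year's Eve
hotel_nye <- hotel
hotel_nye["IsNewYearEve"] <- 1
nye_rent <- predict_rent(hotel_nye)

print(c(base_rent = base_rent, nye_rent = nye_rent,
        difference = nye_rent - base_rent))
```

Because the model is linear in levels, changing one predictor by one unit shifts the predicted rent by exactly that predictor's coefficient in rupees, which is how the coefficients above should be read.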
That's it from my side - Thank You