OBJECTIVE: To identify the factors that affect hotel room prices in the Indian hotel industry, using data on internal and external factors collected for hotels in 42 cities on 8 different dates.

MOTIVATION OF THE STUDY

Hotel prices vary widely, not only in India but across the globe. Why is this so? What leads hotel owners to charge differently, and what motivates a tourist to pay more for a particular hotel or for a hotel in a particular place? In this report we analyse which independent factors contribute to this price difference. Using summary tables, graphs and regression analysis, we try to identify the factors that drive pricing behaviour in the hotel industry.

DATA DESCRIPTION and DATA SOURCE

The data was collected from https://in.hotels.com/. The file is 2523 KB in size and contains 13232 observations of 19 variables.

Attributes:

Notice that the dataset tracks hotel prices on 8 different dates at different hotels across different cities.

Dependent Variable

RoomRent <- Rent for the cheapest room, double occupancy, in Indian Rupees.

Independent Variables

External Factors

Date <- We have hotel room rent data for the following 8 dates for each hotel: {Dec 18, Dec 21, Dec 24, Dec 25, Dec 28, Dec 31, Jan 4, Jan 8}

IsWeekend <- ‘0’ for weekdays, ‘1’ for weekend dates (Sat / Sun)

IsNewYearEve <- ‘1’ for Dec 31, ‘0’ otherwise

CityName <- Name of the city where the hotel is located, e.g. Mumbai

Population <- Population of the City in 2011

CityRank <- Rank order of City by Population (e.g. Mumbai = 0, Delhi = 1, so on)

IsMetroCity <- ‘1’ if CityName is {Mumbai, Delhi, Kolkata, Chennai}, ‘0’ otherwise

IsTouristDestination <- We use ‘1’ if the city is primarily a tourist destination, ‘0’ otherwise.

Internal Factors

Many hotel features can influence RoomRent. The dataset captures some of these internal factors, as explained below.

HotelName <- e.g. Park Hyatt Goa Resort and Spa

StarRating <- e.g. 5

Airport <- Distance between Hotel and closest major Airport

HotelAddress <- e.g. Arrossim Beach, Cansaulim, Goa

HotelPincode <- 403712

HotelDescription <- e.g. 5-star beachfront resort with spa, near Arossim Beach

FreeWifi <- ‘1’ if the hotel offers Free Wifi, ‘0’ otherwise

FreeBreakfast <- ‘1’ if the hotel offers Free Breakfast, ‘0’ otherwise

HotelCapacity <- e.g. 242. (enter ‘0’ if not available)

HasSwimmingPool <- ‘1’ if they have a swimming pool, ‘0’ otherwise

ABSTRACT

To investigate the factors affecting pricing in the hotel industry, we analysed the available dataset using correlation tests and regression analysis with a best-fit model. We also visualized the data using boxplots, scatterplots and a correlogram. The main findings from the visualization are:

Hotels in Jodhpur, Udaipur, Goa and Srinagar are among the most expensive. Room rent is higher for hotels with higher star ratings. From 28 December to 3 January, prices are on the higher side.

On the basis of the correlation tests, the only variable that turns out to be insignificant is FreeBreakfast.

Running the regression with the best-fit model, we found that CityRank, IsWeekend and FreeBreakfast are insignificant. StarRating has a positive effect on RoomRent. Tourist destinations attract more tourists, and hotels in these areas are more expensive. A swimming pool also drives prices up.

We used adjusted R-squared and AIC to choose the best-fit model: a higher adjusted R-squared and a lower AIC indicate a better model.

RESEARCH OBJECTIVE AND METHODOLOGY

Empirical evidence based on data is generally preferable to conjecture. We therefore investigate the factors contributing to pricing in the hotel industry empirically.

OBJECTIVE OF RESEARCH

1- To test the hypothesis that hotel prices differ according to whether the city is a tourist destination.

2- If so, to identify which factors affect the pricing strategy of hotels.

We read the dataset into a data frame called hotel and summarize it. We use boxplots and scatterplots to visualize the data and look for relationships between the variables. We use a correlation matrix to measure the correlation between the variables of interest, and a correlogram (corrplot) to depict these relationships graphically.

To identify the significant factors affecting the price difference, we use correlation tests.
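A correlation test of this kind can be sketched as follows. This is only an illustration on simulated data: the vectors `star_rating` and `room_rent` and the coefficients used to generate them are invented for the example and are not taken from the hotel dataset. Only base R's `cor.test()` is used.

```r
# Simulated stand-in data; the numbers are invented for illustration only.
set.seed(1)
n <- 300
star_rating <- sample(c(2, 3, 4, 5), n, replace = TRUE)
room_rent   <- 1000 + 1500 * star_rating + rnorm(n, sd = 2000)

# Pearson correlation test: the null hypothesis is a true correlation of zero.
res <- cor.test(star_rating, room_rent)

res$estimate   # sample correlation coefficient
res$p.value    # a small p-value (e.g. below 0.05) suggests a significant association
```

A variable whose test against RoomRent gives a large p-value would be treated as insignificant at that level.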

Lastly, we fit regression models. We list some methods for choosing the best-fit model and use stepwise regression to strengthen the analysis.
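The model-selection step can be sketched as below. This is a minimal illustration under stated assumptions, not the fit used in this report: the data frame `sim` and the coefficients that generate it are invented, and backward selection is just one of the directions `step()` supports. Only base R functions (`lm()`, `step()`, `AIC()`, `summary()`) are used.

```r
# Simulated stand-in for the hotel data (values are invented for illustration).
set.seed(42)
n <- 500
sim <- data.frame(
  StarRating      = sample(c(2, 3, 4, 5), n, replace = TRUE),
  HasSwimmingPool = rbinom(n, 1, 0.35),
  FreeBreakfast   = rbinom(n, 1, 0.65)   # deliberately unrelated to rent below
)
sim$RoomRent <- 1000 + 1500 * sim$StarRating +
  2000 * sim$HasSwimmingPool + rnorm(n, sd = 1500)

# Full model with all candidate predictors.
full <- lm(RoomRent ~ StarRating + HasSwimmingPool + FreeBreakfast, data = sim)

# Backward stepwise selection: a term is dropped only if that lowers the AIC.
best <- step(full, direction = "backward", trace = 0)

# Compare the two fits on the criteria used in this report.
comparison <- data.frame(
  model  = c("full", "stepwise"),
  adj_r2 = c(summary(full)$adj.r.squared, summary(best)$adj.r.squared),
  aic    = c(AIC(full), AIC(best))
)
print(comparison)
```

The preferred model is the one with the lower AIC and, ideally, the higher adjusted R-squared.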

Our variable of interest is RoomRent. On the basis of p-values, we conclude which factors affect the dependent variable RoomRent.

Reading the dataset in R by creating a dataframe called hotel.

hotel <- read.csv("Cities42.csv")
View(hotel)

Dimension of the dataset

dim(hotel)
## [1] 13232    19

There are 13232 rows and 19 columns

Summarizing the entire dataset

library(psych)
summary(hotel)
##       CityName      Population          CityRank      IsMetroCity    
##  Delhi    :2048   Min.   :    8096   Min.   : 0.00   Min.   :0.0000  
##  Jaipur   : 768   1st Qu.:  744983   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mumbai   : 712   Median : 3046163   Median : 9.00   Median :0.0000  
##  Bangalore: 656   Mean   : 4416837   Mean   :14.83   Mean   :0.2842  
##  Goa      : 624   3rd Qu.: 8443675   3rd Qu.:24.00   3rd Qu.:1.0000  
##  Kochi    : 608   Max.   :12442373   Max.   :44.00   Max.   :1.0000  
##  (Other)  :7816                                                      
##  IsTouristDestination   IsWeekend       IsNewYearEve             Date     
##  Min.   :0.0000       Min.   :0.0000   Min.   :0.0000   Dec 21 2016:1611  
##  1st Qu.:0.0000       1st Qu.:0.0000   1st Qu.:0.0000   Dec 24 2016:1611  
##  Median :1.0000       Median :1.0000   Median :0.0000   Dec 25 2016:1611  
##  Mean   :0.6972       Mean   :0.6228   Mean   :0.1244   Dec 28 2016:1611  
##  3rd Qu.:1.0000       3rd Qu.:1.0000   3rd Qu.:0.0000   Dec 31 2016:1611  
##  Max.   :1.0000       Max.   :1.0000   Max.   :1.0000   Dec 18 2016:1608  
##                                                         (Other)    :3569  
##                   HotelName        RoomRent        StarRating   
##  Vivanta by Taj        :   32   Min.   :   299   Min.   :0.000  
##  Goldfinch Hotel       :   24   1st Qu.:  2436   1st Qu.:3.000  
##  OYO Rooms             :   24   Median :  4000   Median :3.000  
##  The Gordon House Hotel:   24   Mean   :  5474   Mean   :3.459  
##  Apnayt Villa          :   16   3rd Qu.:  6299   3rd Qu.:4.000  
##  Bentleys Hotel Colaba :   16   Max.   :322500   Max.   :5.000  
##  (Other)               :13096                                   
##     Airport      
##  Min.   :  0.20  
##  1st Qu.:  8.40  
##  Median : 15.00  
##  Mean   : 21.16  
##  3rd Qu.: 24.00  
##  Max.   :124.00  
##                  
##                                                                    HotelAddress  
##  The Mall, Shimla                                                        :   32  
##  #2-91/14/8, White Fields, Kondapur, Hitech City, Hyderabad, 500084 India:   16  
##  121, City Terrace, Walchand Hirachand Marg, Mumbai, Maharashtra         :   16  
##  14-4507/9, Balmatta Road, Near Jyothi Circle, Hampankatta               :   16  
##  144/7, Rajiv Gandi Salai (OMR), Kottivakkam, Chennai, Tamil Nadu        :   16  
##  17, Oliver Road, Colaba, Mumbai, Maharashtra                            :   16  
##  (Other)                                                                 :13120  
##   HotelPincode         HotelDescription    FreeWifi      FreeBreakfast   
##  Min.   : 100025   3           :  120   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 221001   Abc         :  112   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median : 395003   3-star hotel:  104   Median :1.0000   Median :1.0000  
##  Mean   : 397430   3.5         :   88   Mean   :0.9259   Mean   :0.6491  
##  3rd Qu.: 570001   4           :   72   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :7000157   (Other)     :12728   Max.   :1.0000   Max.   :1.0000  
##                    NA's        :    8                                    
##  HotelCapacity    HasSwimmingPool 
##  Min.   :  0.00   Min.   :0.0000  
##  1st Qu.: 16.00   1st Qu.:0.0000  
##  Median : 34.00   Median :0.0000  
##  Mean   : 62.51   Mean   :0.3558  
##  3rd Qu.: 75.00   3rd Qu.:1.0000  
##  Max.   :600.00   Max.   :1.0000  
## 

Population ranges from 8096 to 12442373, with a median of 3046163. The minimum room rent is Rs 299 and the maximum is Rs 322500, with a median of Rs 4000. HotelCapacity ranges from 0 to 600 with a median of 34. The distance from the hotel to the airport ranges from 0.20 km to 124 km, with a median of 15 km.

library(psych)
describe(hotel)[,c(3:6)]
##                            mean         sd  median    trimmed
## CityName*                 18.07      11.72      16      17.29
## Population           4416836.87 4258386.00 3046163 4040816.22
## CityRank                  14.83      13.51       9      13.30
## IsMetroCity                0.28       0.45       0       0.23
## IsTouristDestination       0.70       0.46       1       0.75
## IsWeekend                  0.62       0.48       1       0.65
## IsNewYearEve               0.12       0.33       0       0.03
## Date*                     14.30       2.69      14      14.39
## HotelName*               841.19     488.16     827     841.18
## RoomRent                5473.99    7333.12    4000    4383.33
## StarRating                 3.46       0.76       3       3.40
## Airport                   21.16      22.76      15      16.39
## HotelAddress*           1202.53     582.17    1261    1233.25
## HotelPincode          397430.26  259837.50  395003  388540.47
## HotelDescription*        581.34     363.26     567     575.37
## FreeWifi                   0.93       0.26       1       1.00
## FreeBreakfast              0.65       0.48       1       0.69
## HotelCapacity             62.51      76.66      34      46.03
## HasSwimmingPool            0.36       0.48       0       0.32

The mean population is 4416836.87 with a standard deviation of 4258386. The mean room rent is Rs 5473.99 with a standard deviation of 7333.12, so dispersion is high and prices are highly variable. The mean hotel capacity is about 62 with a standard deviation of 76.66. On average, a hotel is 21.16 km from the airport.

Creating one-way contingency tables for the categorical variables in the dataset

For external factors

Tourist Destination

istourist<-table(hotel$IsTouristDestination)
istourist 
## 
##    0    1 
## 4007 9225
mytable <- with(hotel, table(IsTouristDestination)) #second method
mytable
## IsTouristDestination
##    0    1 
## 4007 9225

There are 4007 observations from non-tourist destinations and 9225 from tourist destinations (more than double).
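The same comparison can be expressed as shares rather than raw counts. The sketch below re-enters the two counts from the table above by hand, so it runs without the data file; the vector name `counts` is introduced only for this example.

```r
# Counts copied from the one-way table above.
counts <- c("0" = 4007, "1" = 9225)

# Share of observations in each group, rounded to three decimals.
shares <- round(prop.table(counts), 3)
shares

# Ratio of tourist to non-tourist observations.
ratio <- counts[["1"]] / counts[["0"]]
round(ratio, 2)
```

On the real data frame, `prop.table(table(hotel$IsTouristDestination))` would give the same shares directly.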

IsWeekend

mytable<-with(hotel,table(IsWeekend))
mytable
## IsWeekend
##    0    1 
## 4991 8241

8241 observations fall on weekend dates against 4991 on weekdays (about 60% of the weekend count).

IsMetroCity

mytable <- with(hotel, table(IsMetroCity))
mytable
## IsMetroCity
##    0    1 
## 9472 3760

Observations from metro cities are fewer: 3760 against 9472 from non-metro cities, which is about 40% of the non-metro count.

IsNewYearEve

mytable <- with(hotel, table(IsNewYearEve))
mytable
## IsNewYearEve
##     0     1 
## 11586  1646

1646 observations were collected for New Year's Eve.

For internal factors

Star rating

mytable <- with(hotel, table(StarRating))
mytable
## StarRating
##    0    1    2  2.5    3  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9    4  4.1 
##   16    8  440  632 5953    8   16    8 1752    8   24   16   32 2463   24 
##  4.3  4.4  4.5  4.7  4.8    5 
##   16    8  376    8   16 1408

3-star hotels are the most common (5953 observations), followed by 4-star (2463), 3.5-star (1752) and 5-star (1408).

Airport

mytable <- with(hotel, table(Airport))
mytable
## Airport
##   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9     1   1.1   1.2   1.4 
##    16    32    40    24    32    24     8    39    32    16    40     8 
##   1.5   1.6   1.7   1.8   1.9     2   2.1   2.2   2.3   2.4   2.5   2.6 
##    16    32    22    72    40    56    40    24    16    32    56    48 
##   2.7   2.8   2.9     3   3.1   3.2   3.3   3.4   3.5   3.6   3.7   3.8 
##    56    24    56    56    16    24    16    48    56    64    16    40 
##   3.9     4   4.1   4.2   4.3   4.4   4.5   4.6   4.7   4.8   4.9     5 
##    32    72    32    40    32    32    24    40    24    40    32    73 
##   5.1   5.2   5.3   5.4   5.5   5.6   5.7   5.8   5.9     6   6.1   6.2 
##    72    72    32    40    48    40    32    56    40    33    32    64 
##   6.3   6.4   6.5   6.6   6.7   6.8   6.9     7   7.1   7.2   7.3   7.4 
##    16    48    48    40    24    56    40    49    72    48    24    40 
##   7.5   7.6   7.7   7.8   7.9     8   8.1   8.2   8.3   8.4   8.5   8.6 
##    48    71    48    32    72    73    72    56    40    48    64    16 
##   8.7   8.8   8.9     9   9.1   9.2   9.3   9.4   9.5   9.6   9.7   9.8 
##    56    16    16    49    24    62    48    80    22    40    24    40 
##   9.9    10  10.2  10.3  10.4  10.6  10.7  10.8  10.9    11  11.1  11.3 
##    56   298     8     8     8     8     8     8    16   610    16    16 
##  11.7  11.9    12  12.2  12.3  12.6  12.7    13  13.1  13.3  13.5  13.6 
##     8    16   354    24     8    24    16   319    16     8     8    24 
##  13.7  13.8    14  14.2  14.4  14.5  14.6  14.7  14.8  14.9    15  15.3 
##    16     8   399    16    24     8    16    24    16     8   441    16 
##  15.4  15.6  15.7  15.8  15.9    16  16.1  16.2  16.4  16.5  16.7    17 
##    16     8     8     8     8   409    16     8     8    32    32   313 
##  17.1  17.2  17.4  17.5  17.6  17.8    18  18.3  18.5  18.6  18.7    19 
##     8    16     8    16     8    16   424     8    16     8     8   200 
##  19.5  19.9    20  20.2  20.3  20.5  20.9    21  21.4  21.5    22  22.1 
##     8     8   384     8    16     8     8   248    24     8   305     8 
##  22.2  22.4  22.5    23  23.2  23.3  23.4    24  24.2  24.3  24.5  24.6 
##     8     8     8   304     8    16     8   167    16    16    16     8 
##  24.7  24.9    25  25.6  25.7  25.9    26  26.1  26.3  26.4  26.5  26.7 
##     8    32   208     8     8     8   300     8     8    24     8     8 
##    27  27.1  27.2    28  28.1  28.6  28.7    29    30  30.5    31  31.2 
##   272     8     8   112     8     8     8    88    56     8   224     8 
##  31.3  31.9    32  32.9    33  33.4    34    35    36  36.2    37    38 
##    16     8    72     8    40    16    16    49    17     8    49    49 
##  38.3    39  39.9    40    41    42  42.7    43  43.9    44  44.5  44.6 
##     8   100     8    56   102    41    16    33     8     8     8     8 
##  44.8    46    47  47.5    48  48.4    49    50  50.1  50.5    51    52 
##     8    40     8     8    16     8     8     8     8     8    16    16 
##  52.7    53    55  57.2    60    61    62    63  63.5  63.6    65  67.6 
##     8     8     8     8     8    16    32     8     8     8   152     8 
##    69  73.1    80  80.3    81    82    83    84    85    86    87  91.3 
##     8     8     1     8     1     9     1     1     1     1     1     8 
##  96.5   100 102.4   105   110 117.4   124 
##     8   136     8   240    64     8   128

Most hotels are within about 25 km of an airport (the third quartile is 24 km), but there are sizeable clusters far away: 240 observations at 105 km and 128 at 124 km.

FreeWifi

mytable<-with(hotel,table(FreeWifi))
mytable
## FreeWifi
##     0     1 
##   981 12251

12251 observations are from hotels with free Wifi. Most hotels now offer it.

FreeBreakfast

mytable<-with(hotel,table(FreeBreakfast))
mytable
## FreeBreakfast
##    0    1 
## 4643 8589

Observations from hotels with free breakfast are almost double those from hotels without it.

HasSwimmingPool

mytable<-with(hotel,table(HasSwimmingPool))
mytable
## HasSwimmingPool
##    0    1 
## 8524 4708

Observations from hotels without a swimming pool are about 1.8 times as many as those from hotels with one.

Date

mytable<-with(hotel,table(Date))
mytable
## Date
##   04-Jan-16   04-Jan-17   08-Jan-16   08-Jan-17   18-Dec-16   21-Dec-16 
##          31          13          31          13          44          44 
##   24-Dec-16   25-Dec-16   28-Dec-16   31-Dec-16 Dec 18 2016 Dec 21 2016 
##          44          44          44          44        1608        1611 
## Dec 24 2016 Dec 25 2016 Dec 28 2016 Dec 31 2016 Jan 04 2017 Jan 08 2017 
##        1611        1611        1611        1611        1548        1542 
##  Jan 4 2017  Jan 8 2017 
##          60          67

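The Date column is recorded in inconsistent formats (for example 04-Jan-16, Jan 04 2017 and Jan 4 2017), so the table shows 20 distinct labels instead of 8 dates. It is difficult to draw out any pattern from the raw labels.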
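One way to collapse these labels into the 8 underlying dates is sketched below. It is an illustration on a handful of labels copied from the table, not a step performed in this analysis, and it drops the year on the assumption that the January entries ending in -16 are typos for 2017. The names `raw`, `month`, `day` and `clean` are introduced only for this example.

```r
# A few raw labels taken from the Date table above.
raw <- c("04-Jan-16", "04-Jan-17", "Jan 04 2017", "Jan 4 2017",
         "18-Dec-16", "Dec 18 2016")

# In both formats the first run of letters is the month and the
# first run of digits is the day of the month.
month <- regmatches(raw, regexpr("[A-Za-z]{3}", raw))
day   <- as.integer(regmatches(raw, regexpr("[0-9]+", raw)))

# Canonical label such as "Jan 04"; the year is dropped (assumption: the
# "-16" January entries are typos for 2017).
clean <- sprintf("%s %02d", month, day)
table(clean)
```

Applied to the full column, the same idea would reduce the 20 labels to 8, after which per-date summaries become easier to read.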

Now let us create two-way contingency tables to see the patterns for pairs of variables taken together.

IsWeekend and IsNewYearEve

mytable<-with(hotel,table(IsWeekend,IsNewYearEve))
mytable
##          IsNewYearEve
## IsWeekend    0    1
##         0 4989    2
##         1 6597 1644
mytable[2,2]
## [1] 1644

1644 observations fall on both a weekend and New Year's Eve.

IsNewYearEve and IsTouristDestination

mytable<-with(hotel,table(IsNewYearEve,IsTouristDestination))
mytable
##             IsTouristDestination
## IsNewYearEve    0    1
##            0 3504 8082
##            1  503 1143
mytable[2,2]
## [1] 1143

1143 observations are for New Year's Eve in a tourist destination.

IsTouristDestination and StarRating

mytable<-with(hotel,table(IsTouristDestination,StarRating))
mytable
##                     StarRating
## IsTouristDestination    0    1    2  2.5    3  3.2  3.3  3.4  3.5  3.6
##                    0    0    0   64  152 1888    0    0    0  448    0
##                    1   16    8  376  480 4065    8   16    8 1304    8
##                     StarRating
## IsTouristDestination  3.7  3.8  3.9    4  4.1  4.3  4.4  4.5  4.7  4.8
##                    0    8   16   16  839    0   16    8  128    8    0
##                    1   16    0   16 1624   24    0    0  248    0   16
##                     StarRating
## IsTouristDestination    5
##                    0  416
##                    1  992

FreeWifi and FreeBreakfast

mytable<-with(hotel,table(FreeWifi,FreeBreakfast))
mytable
##         FreeBreakfast
## FreeWifi    0    1
##        0  606  375
##        1 4037 8214
mytable[2,2]
## [1] 8214

8214 observations are from hotels offering both free breakfast and free Wifi.

FreeWifi and HasSwimmingPool

mytable<-with(hotel,table(FreeWifi,HasSwimmingPool))
mytable
##         HasSwimmingPool
## FreeWifi    0    1
##        0  592  389
##        1 7932 4319
mytable[2,2]
## [1] 4319

4319 observations are from hotels with both free Wifi and a swimming pool.

FreeBreakfast and HasSwimmingPool

mytable<-with(hotel,table(FreeBreakfast,HasSwimmingPool))
mytable
##              HasSwimmingPool
## FreeBreakfast    0    1
##             0 2805 1838
##             1 5719 2870
mytable[2,2]
## [1] 2870

2870 observations are from hotels with both free breakfast and a swimming pool.

Average RoomRent on the basis of IsWeekend and IsNewYearEve

aggregate(hotel$RoomRent, by=list(Weekend = hotel$IsWeekend, NewyearServe = hotel$IsNewYearEve), mean)
##   Weekend NewyearServe        x
## 1       0            0 5429.473
## 2       1            0 5320.820
## 3       0            1 8829.500
## 4       1            1 6219.655

The average room rent on an ordinary weekday is 5429.473, while on a weekend that is also New Year's Eve it is 6219.655. The 8829.500 average for a weekday New Year's Eve rests on only 2 observations.

Here is the average break-up of RoomRent city-wise
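Because the four cells differ greatly in size, the cell means should be combined using the cell counts as weights. The sketch below re-enters the counts from the IsWeekend and IsNewYearEve table and the means from the output above by hand (the data frame `cells` is introduced only for this example) and recovers the overall New Year's Eve average.

```r
# Cell counts from the two-way table and cell means from aggregate() above.
cells <- data.frame(
  IsWeekend    = c(0, 1, 0, 1),
  IsNewYearEve = c(0, 0, 1, 1),
  n            = c(4989, 6597, 2, 1644),
  mean_rent    = c(5429.473, 5320.820, 8829.500, 6219.655)
)

# Overall New Year's Eve mean = count-weighted mean of its two cells.
nye <- subset(cells, IsNewYearEve == 1)
nye_mean <- weighted.mean(nye$mean_rent, nye$n)
round(nye_mean, 1)
```

The result is about 6222.8: the 2-observation cell barely moves the overall figure, which is why the 8829.500 value should not be read as a typical weekday New Year's Eve price.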

aggregate(hotel$RoomRent, by=list(city = hotel$CityName), mean)
##                city         x
## 1              Agra  4124.287
## 2         Ahmedabad  4175.045
## 3          Amritsar  3444.029
## 4         Bangalore  4112.803
## 5       Bhubaneswar  3587.442
## 6        Chandigarh  4030.940
## 7           Chennai  4323.647
## 8        Darjeeling  5458.088
## 9             Delhi  4318.606
## 10          Gangtok  4629.648
## 11              Goa  8170.801
## 12         Guwahati  5325.812
## 13         Haridwar  3919.938
## 14        Hyderabad  3852.175
## 15           Indore  3414.594
## 16           Jaipur  7292.022
## 17        Jaisalmer  5986.072
## 18          Jodhpur 10661.371
## 19           Kanpur  3008.562
## 20            Kochi  6039.609
## 21          Kolkata  4528.986
## 22          Lucknow  5879.070
## 23          Madurai  4768.223
## 24           Manali  4858.285
## 25        Mangalore  4110.337
## 26           Mumbai  6343.730
## 27           Munnar  7543.500
## 28           Mysore  3320.869
## 29         Nainital  6409.833
## 30             Ooty  6144.257
## 31        Panchkula  2813.500
## 32             Pune  3897.652
## 33             Puri  5708.429
## 34           Rajkot  4107.078
## 35        Rishikesh  4943.670
## 36           Shimla  5780.604
## 37         Srinagar 10572.025
## 38            Surat  3660.850
## 39 Thiruvanthipuram  6726.796
## 40         Thrissur  3387.844
## 41          Udaipur 10145.252
## 42         Varanasi  8675.042

Jodhpur, Srinagar and Udaipur are the most expensive, each averaging above Rs 10000, followed by Varanasi and Goa. Jodhpur, Udaipur, Srinagar and Goa are all non-metro cities, and all four are tourist destinations.

Average RoomRent on the basis of TouristDestination and MetroCity

aggregate(hotel$RoomRent,by=list(touristplace= hotel$IsTouristDestination, MetroCity= hotel$IsMetroCity),mean)
##   touristplace MetroCity        x
## 1            0         0 4006.435
## 2            1         0 6755.728
## 3            0         1 4646.136
## 4            1         1 4706.608

Hotels in cities that are both tourist destinations and metro cities are cheaper on average (4706.608) than hotels in non-metro tourist destinations (6755.728).

Average RoomRent on the basis of FreeWifi, FreeBreakfast and SwimmingPool facility

attach(hotel)
aggregate(hotel$RoomRent, by=list(Free_Wifi = hotel$FreeWifi, FreeBreakfast = hotel$FreeBreakfast, SwimmingPool = HasSwimmingPool), mean)
##   Free_Wifi FreeBreakfast SwimmingPool        x
## 1         0             0            0 3538.085
## 2         1             0            0 3148.628
## 3         0             1            0 5636.617
## 4         1             1            0 3984.457
## 5         0             0            1 7378.590
## 6         1             0            1 9530.906
## 7         0             1            1 5207.000
## 8         1             1            1 8246.284

Hotels with free Wifi, free breakfast and a swimming pool have a mean room rent of 8246.284.

Average RoomRent on the basis of Weekday and Weekend

aggregate(RoomRent, by=list(IsWeekend), mean)
##   Group.1        x
## 1       0 5430.835
## 2       1 5500.129

The average RoomRent on weekends is 5500.129, only slightly above the weekday average of 5430.835.

Average RoomRent on the basis of MetroCity

aggregate(RoomRent, by=list(IsMetroCity),mean)
##   Group.1        x
## 1       0 5782.794
## 2       1 4696.073

Average RoomRent on the NewYearEve

aggregate(RoomRent, by=list(IsNewYearEve),mean)
##   Group.1        x
## 1       0 5367.606
## 2       1 6222.826

Average RoomRent on the basis of TouristDestination

aggregate(RoomRent, by=list(IsTouristDestination),mean)
##   Group.1        x
## 1       0 4111.003
## 2       1 6066.024

Average RoomRent on the basis of Ratings

aggregate(RoomRent, by=list(Ratings = StarRating), mean)
##    Ratings         x
## 1      0.0  7237.125
## 2      1.0   686.625
## 3      2.0  2783.166
## 4      2.5  2520.816
## 5      3.0  3694.811
## 6      3.2 15937.500
## 7      3.3  2841.062
## 8      3.4 23437.500
## 9      3.5  4843.346
## 10     3.6  7769.500
## 11     3.7  6701.958
## 12     3.8  5400.062
## 13     3.9 13062.750
## 14     4.0  6393.105
## 15     4.1 19075.000
## 16     4.3  7423.125
## 17     4.4  5563.500
## 18     4.5  8699.920
## 19     4.7 10125.000
## 20     4.8 46752.812
## 21     5.0 12398.221

The average room rent for 4.8-star hotels is by far the highest, and 1-star hotels are the least expensive. Surprisingly, hotels rated 3.4 stars are on average more expensive than 5-star hotels, although that rating has only 8 observations.

Average RoomRent on the basis of FreeWifi

aggregate(RoomRent, by=list(freewifi = FreeWifi), mean)
##   freewifi        x
## 1        0 5380.004
## 2        1 5481.518

Average RoomRent on the basis of FreeBreakfast facility

aggregate(RoomRent, by=list(freebreakfast = FreeBreakfast), mean)
##   freebreakfast        x
## 1             0 5573.790
## 2             1 5420.044

Average RoomRent on the basis of Date

aggregate(RoomRent,by=list(Date= Date),mean)
##           Date        x
## 1    04-Jan-16 4738.548
## 2    04-Jan-17 3829.615
## 3    08-Jan-16 4907.419
## 4    08-Jan-17 3843.077
## 5    18-Dec-16 3366.795
## 6    21-Dec-16 3437.545
## 7    24-Dec-16 3510.795
## 8    25-Dec-16 3349.591
## 9    28-Dec-16 3450.045
## 10   31-Dec-16 3570.318
## 11 Dec 18 2016 4938.257
## 12 Dec 21 2016 5130.320
## 13 Dec 24 2016 5598.746
## 14 Dec 25 2016 5521.896
## 15 Dec 28 2016 5652.478
## 16 Dec 31 2016 6263.374
## 17 Jan 04 2017 5754.513
## 18 Jan 08 2017 5406.821
## 19  Jan 4 2017 4481.400
## 20  Jan 8 2017 4347.821

Average RoomRent on the basis of SwimmingPool facility

aggregate(RoomRent, by=list(HasSwimmingPool = HasSwimmingPool), mean)
##   HasSwimmingPool        x
## 1               0 3775.566
## 2               1 8549.052

Let us visualize the data using boxplot

Price variation

boxplot(hotel$RoomRent,main="Hotel Rent",xlab="Rent",horizontal=TRUE,ylim=c(0,150000),col=c("peachpuff"))

There are a lot of outliers in the data. Some hotels are exorbitantly priced, perhaps because they combine many advantages: a tourist destination, a metro city location, a 5-star rating, proximity to the airport, free Wifi and free breakfast. Otherwise, room rent is below roughly Rs 25000 for most hotels, and the values are clustered towards the left.

Population
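The points drawn as outliers follow from the usual boxplot rule of 1.5 times the interquartile range beyond the box. The arithmetic below uses the RoomRent quartiles from the summary() output and is only approximate, since boxplot() works from hinges, which can differ slightly from these quartiles. The variable names are introduced only for this example.

```r
# Quartiles of RoomRent taken from the summary() output above.
q1 <- 2436
q3 <- 6299

# boxplot() flags points more than 1.5 * IQR beyond the box as outliers.
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

c(iqr = iqr, upper_fence = upper_fence, lower_fence = lower_fence)
```

Under this rule any rent above roughly Rs 12094 is plotted as an outlier, while the lower fence is negative, so no low rent can be flagged.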

boxplot(hotel$Population, main="population data",xlab="population",horizontal=TRUE,col=c("orchid3"))

Star Rating

boxplot(hotel$StarRating , main="Hotel Rating",xlab="StarRating",horizontal=TRUE,col=c("beige"))

Here the first quartile and the median coincide at a 3-star rating, so at least half of the observations are rated 3 stars or lower. The middle 50% of observations have ratings between 3 and 4. Ratings of 0 and 1 are rare (24 observations in total) and show up as outliers.

boxplot(hotel$Airport , main="Distance of Hotels from Airport",xlab="Airport",horizontal=TRUE,col=c("chartreuse"))

A lot of outliers can be seen. Some hotels are very far from the airport.

boxplot(hotel$HotelCapacity , main="No of rooms",xlab="HotelCapacity",horizontal=TRUE,col=c("blue"))

Some hotels have a very high capacity, as the outliers show. Most hotels have a capacity below 100.

Let us see how room prices are distributed with respect to other relevant variables.

Rent on Weekends

boxplot(hotel$RoomRent~hotel$IsWeekend ,main="price on weekends",xlab="rent",ylab="weekends" ,horizontal=TRUE,ylim=c(0,100000),col=c("orchid3","peachpuff"))

The price variation appears to be the same across weekdays and weekends, which suggests this factor may not affect rent at all.

Rent in metro cities

boxplot(hotel$RoomRent ~ hotel$IsMetroCity ,main="price in metros",xlab="rent",ylab="Metro" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","blue"))

There are some extremely high room rents in non-metro cities, with many outliers. This is contrary to expectations.

Rent in tourist places

boxplot(hotel$RoomRent ~ hotel$IsTouristDestination,main="price in tourist places",xlab="rent",ylab="tourist place" ,horizontal=TRUE,ylim=c(0,100000),col=c("orchid3","green"))

Rents are higher in tourist places, and so is the median rent. Some extremely costly hotels are found in tourist places. This is as expected.

Rent on New Year's Eve

boxplot(hotel$RoomRent ~ hotel$IsNewYearEve ,main="price on new year eve",xlab="rent",ylab="new yeareve" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","peachpuff"))

It is difficult to say whether rents are higher on New Year's Eve. Outliers are present in both groups.

Rent based on star-ratings

boxplot(hotel$RoomRent ~ hotel$StarRating ,main="price of star hotels",xlab="rent",ylab="star-rating" ,horizontal=TRUE,ylim=c(0,100000),col=c("blue","peachpuff"))

A lot of variation is clearly visible. Some 5-star hotels are exorbitantly priced, while hotels in most rating categories are cheaper. Hotels rated 4.8 and 3.4 stars are much more expensive than the others.

Rent on different dates

boxplot(hotel$RoomRent ~ hotel$Date ,main="price on different dates",xlab="rent",ylab="date" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","orchid3"))

A lot of outliers can be seen. Barring these exceptions, rent is more or less the same across all the dates.

Rent based on distance of hotels from airports

boxplot(hotel$RoomRent ~ hotel$Airport ,main="price based on distance",xlab="rent",ylab="distance" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","chartreuse4"))

This plot is difficult to interpret, but prices appear to be on the lower side across the whole range of distances.

Rent where Wifi is free

boxplot(hotel$RoomRent ~ hotel$FreeWifi ,main="price where wifi is free",xlab="rent",ylab=" wifi" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","chartreuse4"))

Leaving aside the outliers seen for hotels where Wifi is free, Wifi seems to be complimentary: it does not appear to raise the rent.

Rent Where breakfast is free

boxplot(hotel$RoomRent ~ hotel$FreeBreakfast ,main="price where breakfast is free",xlab="rent",ylab="breakfast" ,horizontal=TRUE,ylim=c(0,100000),col=c("blue","orchid3"))

Again, there is no major variation in room rents.

Rent for swimmingpool

boxplot(hotel$RoomRent ~ hotel$HasSwimmingPool ,main="price for swimmingpool",xlab="rent",ylab="swimmingpool" ,horizontal=TRUE,ylim=c(0,100000),col=c("red","peachpuff"))

Here we can see a noticeable rent difference. Hotels with swimming pools charge considerably more, which is reasonable given the facility being provided.

RoomRent with CityName

boxplot(hotel$RoomRent ~ hotel$CityName ,main="price bifurcation for cities",xlab="rent",ylab="Cities" ,horizontal=TRUE, ylim=c(0,50000),col=c("red","yellow","brown", "blue", "peachpuff","beige","orchid3", "chartreuse"))

Let us see what histograms can offer

par(mfrow=c(2,2))
hist(hotel$IsMetroCity, xlab = "metros",main = "MetroCity",col = "red")
hist(hotel$IsTouristDestination, xlab = "Tourist Destination",main = "Tourist Destination",col = "blue")
hist(hotel$IsWeekend,xlab = "weekend",main = "Weekends",col = "peachpuff")
hist(hotel$IsNewYearEve,xlab = "New Year",main = "New YearEve",col = "beige")

par(mfrow=c(2,2))
hist(hotel$StarRating,xlab = "ratings",main = "star-Rating",col = "orchid3")
hist(hotel$FreeWifi,xlab = "Wifi",main = "freeWifi",col = "green")
hist(hotel$FreeBreakfast,xlab = "breakfast",main = "FreeBreakfast",col = "chartreuse4")
hist(hotel$HasSwimmingPool,xlab = "Swimmingpool",main = "Swimmingpool",col = "brown")

Hotels in metro cities are fewer than in non-metro cities. Observations from tourist places outnumber those from non-tourist places, and weekend observations outnumber weekday ones. 3-star hotels are the most common. Hotels with free Wifi and free breakfast outnumber those without, whereas hotels with a swimming pool are fewer than those without one.

Histogram for RoomRent

hist(hotel$RoomRent,xlab = "rent",main = "Roomrent",col = "blue",breaks = 100,xlim = c(200,90000))

The data are highly skewed with a long right tail. This means that more than 50% of the hotels have a rent below the average rent, because the few very expensive hotels pull the mean above the median.
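The relation between a right-skewed distribution and the share of values below the mean can be checked directly. This is a minimal sketch on simulated data: `skew_summary` and `demo_rent` are hypothetical names, and the log-normal draw only imitates the shape of RoomRent.

```r
# Mean, median and the share of observations lying below the mean.
skew_summary <- function(x) {
  c(mean = mean(x),
    median = median(x),
    share_below_mean = mean(x < mean(x)))
}

set.seed(1)
# Hypothetical right-skewed stand-in for hotel$RoomRent.
demo_rent <- rlnorm(1000, meanlog = 8, sdlog = 0.8)
skew_summary(demo_rent)
```

On the real data the call would be skew_summary(hotel$RoomRent); a share above 0.5 together with median < mean confirms the right skew.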

Let us use scatterplot for our dataset

plot(hotel$StarRating,hotel$RoomRent , main="RoomRent & Rating",xlab="StarRating",ylab="RoomRent")

RoomRent vs StarRating

plot(hotel$StarRating, hotel$RoomRent, main="RoomRent vs Rating", ylab="rent", xlab="rating", las=1, col=c("red","blue","green","brown"))
abline(lm(hotel$RoomRent ~ hotel$StarRating), col="red")

The best-fit line slopes upward: higher star ratings are associated with higher room rents.

RoomRent vs Airport

plot(hotel$Airport, hotel$RoomRent, main="RoomRent vs distance", ylab="rent", xlab="airport distance", las=1, col=c("red","blue","green","brown"))
abline(lm(hotel$RoomRent ~ hotel$Airport), col="red")

The best-fit line is relatively flat, which points to a weak relationship between airport distance and room rent.

RoomRent vs HotelCapacity

plot(hotel$HotelCapacity, hotel$RoomRent, main="RoomRent vs capacity", ylab="rent", xlab="capacity", las=1, col=c("red","blue","green","brown"))
abline(lm(hotel$RoomRent ~ hotel$HotelCapacity), col="red")

The scatter plots do not give a clear picture of the pattern; the best-fit line is nearly horizontal.
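When a fitted line is drawn over plot(x, y), the model passed to abline has to regress y on x; the reversed model has coefficients in the units of x and gives a different line. A small sketch on simulated data (x and y are hypothetical, with y playing the role of RoomRent):

```r
set.seed(42)
# Hypothetical data: x is a predictor, y plays the role of RoomRent.
x <- runif(200, 0, 100)
y <- 5000 + 20 * x + rnorm(200, sd = 1000)

fit_yx <- lm(y ~ x)  # matches plot(x, y): intercept and slope are in units of y
fit_xy <- lm(x ~ y)  # reversed model: coefficients are in units of x

plot(x, y, main = "y vs x with fitted line")
abline(fit_yx, col = "red")

coef(fit_yx)
coef(fit_xy)
```

The two slopes are linked by the identity slope(y ~ x) * slope(x ~ y) = cor(x, y)^2, so they coincide only in the degenerate case of a perfect fit with unit slope.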

Effects of date on RoomRent

date1<- aggregate(RoomRent ~ Date, data =hotel,mean)
 plot(date1$Date,date1$RoomRent, main="Scatterplot between Date and RoomRent", xlab="Date",ylab = "Room Rent")

Room rents are somewhat higher between 31 December and 4 January.

Scatterplot of different variables together

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(formula=~RoomRent+IsWeekend+IsNewYearEve+StarRating+FreeBreakfast+HasSwimmingPool,data=hotel,diagonal="histogram",pch=16)
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

Scatterplot matrix of internal factors only

scatterplotMatrix(hotel[,c( "RoomRent",  "StarRating" , "Airport" ,  "FreeWifi" , "FreeBreakfast","HotelCapacity","HasSwimmingPool")], spread = FALSE, smoother.args = list(lty=2), main= "Scatterplot Matrix",diagonal = "histogram")
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

Correlogram

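First we remove all columns holding non-numeric data so that the cor function can be applied.

# Note: in hotel[, j, k] a third index is read as the `drop` argument,
# so only the column index j actually removes columns.
hotel1<-hotel[,-1]          # drop column 1
hotel2<-hotel1[,-7:-8]      # drop columns 7 and 8 of hotel1
hotel3<-hotel2[c(-10,-12)]  # drop columns 10 and 12 of hotel2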
coor<-cor(hotel3)
coor
##                         Population      CityRank   IsMetroCity
## Population            1.0000000000 -0.8353204432  0.7712260105
## CityRank             -0.8353204432  1.0000000000 -0.5643937903
## IsMetroCity           0.7712260105 -0.5643937903  1.0000000000
## IsTouristDestination -0.0482029722  0.2807134520  0.1763717063
## IsWeekend             0.0115926802 -0.0072564766  0.0018118005
## IsNewYearEve          0.0007332482 -0.0006326444  0.0006464753
## RoomRent             -0.0887280632  0.0939855292 -0.0668397705
## StarRating            0.1341365933 -0.1333810133  0.0776028661
## Airport              -0.2597010198  0.5059119892 -0.2073586125
## HotelPincode         -0.2586550765  0.1448743454 -0.1756624007
## FreeWifi              0.1129334410 -0.1214309404  0.0868288677
## FreeBreakfast         0.0364278235 -0.0086837497  0.0513856623
## HotelCapacity         0.2599830516 -0.2561197059  0.1871502153
## HasSwimmingPool       0.0262590820 -0.1029737518  0.0214119243
##                      IsTouristDestination    IsWeekend  IsNewYearEve
## Population                   -0.048202972  0.011592680  7.332482e-04
## CityRank                      0.280713452 -0.007256477 -6.326444e-04
## IsMetroCity                   0.176371706  0.001811801  6.464753e-04
## IsTouristDestination          1.000000000 -0.019481101 -2.266388e-03
## IsWeekend                    -0.019481101  1.000000000  2.923821e-01
## IsNewYearEve                 -0.002266388  0.292382051  1.000000e+00
## RoomRent                      0.122502963  0.004580134  3.849123e-02
## StarRating                   -0.040554998  0.006378436  2.360897e-03
## Airport                       0.194422049 -0.002724756  4.598872e-04
## HotelPincode                 -0.170413906 -0.006444704 -2.111441e-03
## FreeWifi                     -0.061568821  0.002960828  2.787472e-05
## FreeBreakfast                -0.071692559 -0.007612777 -2.606416e-03
## HotelCapacity                -0.094356091  0.006306507  1.352679e-03
## HasSwimmingPool               0.042156280  0.004500461  1.122308e-03
##                          RoomRent   StarRating       Airport HotelPincode
## Population           -0.088728063  0.134136593 -0.2597010198 -0.258655077
## CityRank              0.093985529 -0.133381013  0.5059119892  0.144874345
## IsMetroCity          -0.066839771  0.077602866 -0.2073586125 -0.175662401
## IsTouristDestination  0.122502963 -0.040554998  0.1944220492 -0.170413906
## IsWeekend             0.004580134  0.006378436 -0.0027247555 -0.006444704
## IsNewYearEve          0.038491227  0.002360897  0.0004598872 -0.002111441
## RoomRent              1.000000000  0.369373425  0.0496532442  0.009262712
## StarRating            0.369373425  1.000000000 -0.0609191837 -0.009618454
## Airport               0.049653244 -0.060919184  1.0000000000  0.223641588
## HotelPincode          0.009262712 -0.009618454  0.2236415883  1.000000000
## FreeWifi              0.003627002  0.018009594 -0.0945236768 -0.012503744
## FreeBreakfast        -0.010006370 -0.032892463  0.0242839409  0.024880420
## HotelCapacity         0.157873308  0.637430337 -0.1176720722 -0.035088175
## HasSwimmingPool       0.311657734  0.618214699 -0.1416665606  0.020280765
##                           FreeWifi FreeBreakfast HotelCapacity
## Population            1.129334e-01   0.036427824   0.259983052
## CityRank             -1.214309e-01  -0.008683750  -0.256119706
## IsMetroCity           8.682887e-02   0.051385662   0.187150215
## IsTouristDestination -6.156882e-02  -0.071692559  -0.094356091
## IsWeekend             2.960828e-03  -0.007612777   0.006306507
## IsNewYearEve          2.787472e-05  -0.002606416   0.001352679
## RoomRent              3.627002e-03  -0.010006370   0.157873308
## StarRating            1.800959e-02  -0.032892463   0.637430337
## Airport              -9.452368e-02   0.024283941  -0.117672072
## HotelPincode         -1.250374e-02   0.024880420  -0.035088175
## FreeWifi              1.000000e+00   0.158220597  -0.008703612
## FreeBreakfast         1.582206e-01   1.000000000  -0.087165446
## HotelCapacity        -8.703612e-03  -0.087165446   1.000000000
## HasSwimmingPool      -2.407405e-02  -0.061522132   0.509045809
##                      HasSwimmingPool
## Population               0.026259082
## CityRank                -0.102973752
## IsMetroCity              0.021411924
## IsTouristDestination     0.042156280
## IsWeekend                0.004500461
## IsNewYearEve             0.001122308
## RoomRent                 0.311657734
## StarRating               0.618214699
## Airport                 -0.141666561
## HotelPincode             0.020280765
## FreeWifi                -0.024074046
## FreeBreakfast           -0.061522132
## HotelCapacity            0.509045809
## HasSwimmingPool          1.000000000
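Dropping columns by position is easy to get wrong when the column order changes. One alternative, sketched here under the assumption that every numeric column is wanted, is to select columns by type. The helper `numeric_only` and the data frame `demo` are hypothetical.

```r
# Keep only numeric columns, whatever their position in the data frame.
numeric_only <- function(df) {
  df[, vapply(df, is.numeric, logical(1)), drop = FALSE]
}

# Hypothetical miniature of the hotel data with one text column.
demo <- data.frame(
  CityName   = c("Mumbai", "Delhi", "Jaipur", "Goa"),
  RoomRent   = c(6000, 5500, 3000, 7000),
  StarRating = c(4, 4, 3, 5),
  stringsAsFactors = FALSE
)
round(cor(numeric_only(demo)), 3)
```

On the real data this would read cor(numeric_only(hotel)); whether it yields exactly the 14 columns shown above depends on how the columns were read in.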

Here we can see that room rent is negatively correlated with population, though the correlation is weak (about -0.09). Room rent is positively correlated with city rank (about 0.09): since rank 0 is the most populous city, rents tend to be slightly higher in smaller cities. This is an association only; we cannot establish causation from it. Room rent is also negatively correlated with IsMetroCity (about -0.07). One possible explanation is that metro cities are crowded and one cannot fully relax there, so rents might be expected to be lower. Room rent is negatively correlated with FreeBreakfast as well (about -0.01). The correlation is negligible but surprising, since normally we would expect free breakfast to make a hotel more expensive.

The rest of the variables, including HotelPincode (about 0.01), are positively correlated with room rent.

An interesting point here is that IsTouristDestination, StarRating, HotelCapacity and HasSwimmingPool are more strongly correlated with RoomRent than the other variables, and all four correlations are positive. We can use these variables for regression analysis. None of the individual correlations is strong in absolute terms (the largest, StarRating, is about 0.37), but relative to the other factors they stand out.

Let us visualise this matrix graphically

library(corrplot)
## corrplot 0.84 loaded
corrplot(corr=cor(hotel3,use = "complete.obs"),method = "circle")

Here blue shades show positive correlation and red shades show negative correlation. Bigger and darker circles indicate a stronger relationship between the variables.

CityRank is negatively correlated with IsMetroCity, and the relationship is fairly strong (about -0.56), while it is positively correlated with IsTouristDestination and Airport (hotel distance from the airport). IsWeekend is positively correlated with IsNewYearEve. HotelCapacity is positively correlated with StarRating, and HasSwimmingPool is positively correlated with HotelCapacity.

Another way of representing the correlations is a corrgram

library(corrgram)

corrgram(hotel, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Corrgram of Hotel  data")

Based on the correlogram, we single out the four variables that affect RoomRent relatively strongly compared with the others: IsTouristDestination, StarRating, HasSwimmingPool and HotelCapacity.

Draw a separate correlogram of these variables along with RoomRent

library(corrgram)

# Select the columns by name from hotel so the code does not depend on attach(hotel)
rent<-hotel[,c("RoomRent","HasSwimmingPool","HotelCapacity","StarRating","IsTouristDestination")]
corrgram(rent, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Corrgram of Hotel Prices In India")

Now compute the correlations and covariances of these four variables with RoomRent. Note that var(x, y) and cov(x, y) return the same covariances.

x<-hotel[,c("HasSwimmingPool","StarRating", "HotelCapacity","IsTouristDestination")]
y<-hotel[,c("RoomRent")]
cor(x,y)
##                           [,1]
## HasSwimmingPool      0.3116577
## StarRating           0.3693734
## HotelCapacity        0.1578733
## IsTouristDestination 0.1225030
var(x,y)
##                            [,1]
## HasSwimmingPool       1094.2017
## StarRating            2048.3755
## HotelCapacity        88753.4128
## IsTouristDestination   412.7803
cov(x,y)
##                            [,1]
## HasSwimmingPool       1094.2017
## StarRating            2048.3755
## HotelCapacity        88753.4128
## IsTouristDestination   412.7803

Now let us state some assumptions about the data set and test them.

ASSUMPTION 1: H1: IsNewYearEve is related to RoomRent. H0: IsNewYearEve is not related to RoomRent.

tab1<-with(hotel,table(IsNewYearEve,RoomRent))
chisq.test(tab1)
## Warning in chisq.test(tab1): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  tab1
## X-squared = 2047.8, df = 2155, p-value = 0.9505

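Here the p-value is > 0.05, so we fail to reject the null hypothesis that the two variables are unrelated, and our claim is not supported by the chi-square test. The result should be read with caution, though: the table treats every distinct RoomRent value as a separate category (df = 2155), so most cells have very small expected counts, which is why R warns that the chi-squared approximation may be incorrect.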
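A chi-square test needs categorical variables with reasonable cell counts. One way to apply it to a continuous variable such as rent is to bin it first. The sketch below is not the analysis run in this report: it uses simulated data, and the names is_nye, rent and rent_band as well as the choice of quartile bands are assumptions made for illustration.

```r
set.seed(7)
# Hypothetical stand-ins for hotel$IsNewYearEve and hotel$RoomRent.
is_nye <- rbinom(2000, 1, 0.125)
rent   <- rlnorm(2000, meanlog = 8.3, sdlog = 0.7) * ifelse(is_nye == 1, 1.15, 1)

# Bin rent into quartile bands so that every cell has a workable count.
rent_band <- cut(rent,
                 breaks = quantile(rent, probs = seq(0, 1, 0.25)),
                 include.lowest = TRUE,
                 labels = c("Q1", "Q2", "Q3", "Q4"))

tab <- table(is_nye, rent_band)
tab
chisq.test(tab)
```

With a 2 x 4 table the test has 3 degrees of freedom, and the expected counts are large enough for the approximation to hold.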

ASSUMPTION2- H2: New Year Eve is related with Tourist Destination

tab2<-with(hotel,table(IsNewYearEve,IsTouristDestination))
chisq.test(tab2)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab2
## X-squared = 0.053842, df = 1, p-value = 0.8165

Here again our claim is not supported, since the p-value is > 0.05. There is no evidence of a relation between IsNewYearEve and IsTouristDestination.

Some hypothesis testing on the basis of t-test.

H1:- Average RoomRent on Weekend is greater than the Average Roomrent on non-weekend. H0:- There is no significant difference between the average RoomRent on weekend and weekdays

          We will use right tail t-test
t.test(RoomRent~IsWeekend,data = hotel,alternative="greater")
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent by IsWeekend
## t = -0.51853, df = 9999.4, p-value = 0.698
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -289.122      Inf
## sample estimates:
## mean in group 0 mean in group 1 
##        5430.835        5500.129

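Here the p-value is > 0.05, so we fail to reject the null hypothesis. Note that with the formula interface the difference tested is group 0 (weekday) minus group 1 (weekend), so alternative="greater" actually tests whether weekday rents are higher; the alternative matching H1 is "less", whose p-value is 1 - 0.698 = 0.302. Either way there is no significant evidence that average room rent differs between weekends and weekdays.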
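With the formula interface, t.test() computes the difference as the mean of the first group minus the mean of the second, so the alternative argument has to be read in that order. A minimal sketch on simulated data (the vectors g and y are hypothetical, not taken from the hotel data) shows how the two one-sided p-values relate:

```r
set.seed(1)
# Hypothetical data: group 1 has the higher mean by construction.
g <- rep(c(0, 1), each = 100)
y <- c(rnorm(100, mean = 5000, sd = 1000),
       rnorm(100, mean = 5500, sd = 1000))

# With a formula, the difference tested is mean(group 0) - mean(group 1).
p_less    <- t.test(y ~ g, alternative = "less")$p.value     # H1: group 0 < group 1
p_greater <- t.test(y ~ g, alternative = "greater")$p.value  # H1: group 0 > group 1

c(less = p_less, greater = p_greater, sum = p_less + p_greater)
```

The two p-values add up to 1, so a hypothesis of the form "group 1 is higher than group 0" calls for alternative = "less".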

H2:- Average Roomrent on the normal days is less than the average Roomrent on NewYearEve days

                    We will use left tail t-test
t.test(RoomRent~IsNewYearEve,data = hotel,alternative="less")
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent by IsNewYearEve
## t = -4.1793, df = 2065, p-value = 1.523e-05
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -518.4763
## sample estimates:
## mean in group 0 mean in group 1 
##        5367.606        6222.826

The p-value is < 0.05, so we reject the null hypothesis. We can say that average room rent on normal days is lower than on New Year's Eve.

H3:- Average RoomRent of Metro Cities is greater than that of non-metro cities.

t.test(RoomRent~IsMetroCity,data = hotel,alternative="greater")
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent by IsMetroCity
## t = 10.721, df = 13224, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  919.9785      Inf
## sample estimates:
## mean in group 0 mean in group 1 
##        5782.794        4696.073

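The p-value is < 0.05, so we reject the null hypothesis. Note the direction, however: with the formula interface the test compares group 0 (non-metro) minus group 1 (metro), so alternative="greater" tests whether non-metro rents are higher. The sample means (5782.8 for non-metro against 4696.1 for metro) show that average room rent is significantly higher in non-metro cities, so H3 as stated is not supported. This agrees with the negative correlation between RoomRent and IsMetroCity.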

H4:- Average Room Rent of non-Tourist Destination Cities is less than that of Tourist Destination Cities.

t.test(RoomRent ~ IsTouristDestination, data = hotel,alternative="less")
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent by IsTouristDestination
## t = -19.449, df = 12888, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -1789.665
## sample estimates:
## mean in group 0 mean in group 1 
##        4111.003        6066.024

Here the p-value is < 0.05, so we reject the null hypothesis and conclude that the average room rent of non-tourist destination cities is less than that of tourist destination cities.

H5:- Average Room Rent where wifi is free is greater than where free wifi is not available

t.test(RoomRent ~ FreeWifi, data = hotel,alternative="greater")
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent by FreeWifi
## t = -0.76847, df = 1804.7, p-value = 0.7788
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -318.9097       Inf
## sample estimates:
## mean in group 0 mean in group 1 
##        5380.004        5481.518

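The p-value is > 0.05, so we fail to reject the null hypothesis. Note that alternative="greater" tests group 0 (no free wifi) minus group 1 (free wifi); the alternative matching H5 is "less", whose p-value is 1 - 0.7788 = 0.2212. In both cases there is no significant difference in average room rent.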

H6:- Average Room Rent is greater where free Breakfast is available.

t.test(RoomRent ~ FreeBreakfast ,data = hotel,alternative="greater")
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent by FreeBreakfast
## t = 0.98095, df = 6212.3, p-value = 0.1633
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -104.0926       Inf
## sample estimates:
## mean in group 0 mean in group 1 
##        5573.790        5420.044

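The p-value is > 0.05, so we fail to reject the null hypothesis. Note that alternative="greater" tests group 0 (no free breakfast) minus group 1 (free breakfast); the alternative matching H6 is "less", whose p-value is 1 - 0.1633 = 0.8367. In both cases there is no significant difference in average room rent.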

H7:- Average Room Rent is significantly different where Swimming Pool is available.

t.test(RoomRent~HasSwimmingPool ,data = hotel)
## 
##  Welch Two Sample t-test
## 
## data:  RoomRent by HasSwimmingPool
## t = -29.013, df = 5011.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5096.030 -4450.942
## sample estimates:
## mean in group 0 mean in group 1 
##        3775.566        8549.052

The p-value is < 0.05, so we reject the null hypothesis. Average room rent differs significantly, and the group means show that it is higher where a swimming pool is available.

Having visualized the data, tested for correlation among the variables and checked some relevant assumptions, it is now time for regression analysis, to see which variables actually contribute in their own right to the pricing strategy of hotels in the Indian hotel industry.

Let us test the significance of the correlation of different variables with Roomrent

Roomrent vs population

cor.test(hotel$RoomRent,hotel$Population)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$Population
## t = -10.246, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.10560734 -0.07179767
## sample estimates:
##         cor 
## -0.08872806

Here the correlation is significant, as the p-value is < 0.05, so the null hypothesis of no correlation is rejected. Population and RoomRent are negatively correlated.

Roomrent vs City Ranking

cor.test(hotel$RoomRent,hotel$CityRank)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$CityRank
## t = 10.858, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07707001 0.11084696
## sample estimates:
##        cor 
## 0.09398553

p-value is less than 0.05 . so there is significant positive correlation between these two.

Roomrent vs Metrocity

cor.test(hotel$RoomRent,hotel$IsMetroCity)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$IsMetroCity
## t = -7.7053, df = 13230, p-value = 1.399e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08378329 -0.04985761
## sample estimates:
##         cor 
## -0.06683977

The p-value is < 0.05, so there is a significant negative correlation between room rent and the IsMetroCity variable.

Roomrent vs Tourist destination

cor.test(hotel$RoomRent,hotel$IsTouristDestination)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$IsTouristDestination
## t = 14.197, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1056846 0.1392512
## sample estimates:
##      cor 
## 0.122503

p-value < 0.05. So there is significant and positive relationship.

RoomRent vs NewYear Eve

cor.test(hotel$RoomRent,hotel$IsNewYearEve)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$IsNewYearEve
## t = 4.4306, df = 13230, p-value = 9.472e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02146637 0.05549377
## sample estimates:
##        cor 
## 0.03849123

Here p-value<0.05 . There is a significant positive relationship between these two.

RoomRent vs Starrating

cor.test(hotel$RoomRent,hotel$StarRating)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$StarRating
## t = 45.719, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3545660 0.3839956
## sample estimates:
##       cor 
## 0.3693734

p-value<0.05 So there is significant positive correlation between Star rating and RoomRent.

RoomRent vs Airport

cor.test(hotel$RoomRent, hotel$Airport)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$Airport
## t = 5.7183, df = 13230, p-value = 1.099e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03264192 0.06663581
## sample estimates:
##        cor 
## 0.04965324

The p-value is < 0.05, so there is a significant, though weak, positive relationship.

RoomRent vs FreeWifi

cor.test(hotel$RoomRent, hotel$FreeWifi)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$FreeWifi
## t = 0.41719, df = 13230, p-value = 0.6765
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01341277  0.02066466
## sample estimates:
##         cor 
## 0.003627002

Here the p-value (0.6765) is > 0.05, so the correlation between FreeWifi and RoomRent is not significant.

Roomrent vs Breakfast

cor.test(hotel$RoomRent, hotel$FreeBreakfast)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$FreeBreakfast
## t = -1.151, df = 13230, p-value = 0.2497
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.027040698  0.007033769
## sample estimates:
##         cor 
## -0.01000637

Here the p-value is > 0.05, so we fail to reject the null hypothesis. There is no significant relationship between FreeBreakfast and RoomRent.

Roomrent vs Hotel Capacity

cor.test(hotel$RoomRent, hotel$HotelCapacity)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$HotelCapacity
## t = 18.389, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1412142 0.1744430
## sample estimates:
##       cor 
## 0.1578733

Here the p-value is < 0.05, so the relationship is significant.

Roomrent vs SwimmingPool

cor.test(hotel$RoomRent,hotel$HasSwimmingPool)
## 
##  Pearson's product-moment correlation
## 
## data:  hotel$RoomRent and hotel$HasSwimmingPool
## t = 37.726, df = 13230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2961917 0.3269604
## sample estimates:
##       cor 
## 0.3116577

Here p-value<0.05. So we can reject our null hypothesis and conclude that Swimming facility and Room rent is significantly positively correlated.

On the basis of these correlation tests, the variables that are not significantly related to RoomRent are FreeWifi and FreeBreakfast.
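The individual cor.test calls can also be collected into one table, which makes it harder to misread a single p-value. This is a sketch only: the helper `cor_table` and the simulated data frame `demo` are hypothetical, and the column names merely mirror those of the hotel data.

```r
# Run cor.test() of one response against several predictors and tabulate the results.
cor_table <- function(df, response, predictors, alpha = 0.05) {
  rows <- lapply(predictors, function(p) {
    ct <- cor.test(df[[response]], df[[p]])
    data.frame(variable    = p,
               cor         = unname(ct$estimate),
               p_value     = ct$p.value,
               significant = ct$p.value < alpha)
  })
  do.call(rbind, rows)
}

set.seed(3)
# Hypothetical stand-in for the hotel data.
demo <- data.frame(StarRating = sample(1:5, 300, replace = TRUE),
                   FreeWifi   = rbinom(300, 1, 0.9))
demo$RoomRent <- 1000 + 1500 * demo$StarRating + rnorm(300, sd = 2000)

cor_table(demo, "RoomRent", c("StarRating", "FreeWifi"))
```

On the real data the call would list the predictors tested above, for example cor_table(hotel, "RoomRent", c("Population", "CityRank", "FreeWifi")).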

Setting up a regression requires a correct and robust model, but how do we select the correct model? There are several methods for selecting a robust model:

  1. All Possible Regression:-

The all-subsets function tests every possible subset of the set of potential variables. If there are N possible independent variables (besides the constant), there are 2^N (2 raised to the power N) distinct subsets to test.

Here is the code:

library(olsrr)  # the ols_* functions used below come from the olsrr package
attach(hotel)

model<-lm(RoomRent~., data= hotel)

ols_all_subset(model)

Then use the best subset function, which selects the best subset of predictors by criteria such as maximum adjusted R-squared and minimum AIC (explained below).

model<-lm(RoomRent~.,data= hotel)

ols_best_subset(model)

Then use the plot method to get the panel of fit criteria for best subset regression methods.

model<-lm(RoomRent~.,data= hotel)

k<-ols_all_subset(model)

plot(k)
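The number of candidate models in an all-subsets search grows quickly with the number of predictors. The short sketch below works through the arithmetic; the helper names `n_subsets` and `by_size` are hypothetical.

```r
# Number of candidate models when each of n predictors is either in or out.
n_subsets <- function(n) 2^n

# The same count built up by model size: choose(n, 0) + choose(n, 1) + ... + choose(n, n).
by_size <- function(n) choose(n, 0:n)

n_subsets(10)     # 1024
sum(by_size(10))  # 1024
n_subsets(12)     # 4096
```

Each additional predictor doubles the number of models to fit, which is why the exhaustive search becomes slow on a large data set.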

  2. Stepwise Forward Regression

This function builds a regression model from a set of candidate predictor variables by entering predictors based on p-values, in a stepwise manner, until no variable is left to enter. The model passed in should include all the candidate predictor variables. If details is set to TRUE, each step is displayed.

model<-lm(RoomRent~.,data= hotel)

ols_step_forward(model)

For detailed output

model<-lm(RoomRent~.,data= hotel)

ols_step_forward(model, details= TRUE)

  3. Stepwise Backward Regression

This function builds a regression model from a set of candidate predictor variables by removing predictors based on p-values, in a stepwise manner, until no variable is left to remove. The model should include all the candidate predictor variables. If details is set to TRUE, each step is displayed. This is the reverse of the forward procedure.

model<-lm(RoomRent~.,data= hotel)

ols_step_backward(model)

For detailed output

model<-lm(RoomRent~.,data= hotel)

ols_step_backward(model, details= TRUE)

  4. Stepwise Regression

This function builds a regression model from a set of possible predictor variables by entering and removing predictors based on p-values, in a stepwise manner, until no variable is left to enter or remove. The model should include all the possible predictor variables. If details is set to TRUE, each step is displayed.

model<-lm(RoomRent~.,data= hotel)

ols_stepwise(model)

For detailed output

model<-lm(RoomRent~.,data= hotel)

ols_stepwise(model,details= TRUE)

  5. Stepwise AIC Regression

This function builds a regression model from a set of possible predictor variables by entering and removing predictors based on the Akaike Information Criterion (AIC), in a stepwise manner, until no variable is left to enter or remove. The model should include all the possible predictor variables. If details is set to TRUE, each step is displayed.

model<-lm(RoomRent~.,data= hotel)

ols_stepaic_both(model)

For Detailed output

model<-lm(RoomRent~.,data= hotel)

ols_stepaic_both(model,details=TRUE)

So here we are using the stepwise regression method through the step function. By default it uses the backward direction if scope is not given. The full model is passed to step, which searches over the full scope of the variables. It performs multiple iterations, dropping one X variable at a time. The AIC of each model is computed, and the model with the lowest AIC is retained for the next iteration.

lmMod<-lm(RoomRent~.,data=hotel)

selectedMod<-step(lmMod)

summary(selectedMod)
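The behaviour of step can be tried out on a small data set before it is applied to the hotel data. The sketch below uses mtcars, which ships with R, purely as a stand-in; the object names full_mod and selected_mod are hypothetical.

```r
# mtcars ships with R and stands in for the hotel data here.
full_mod     <- lm(mpg ~ ., data = mtcars)
selected_mod <- step(full_mod, trace = 0)  # no scope given, so the search is backward

formula(selected_mod)
c(full = AIC(full_mod), selected = AIC(selected_mod))
```

Because each backward step is accepted only if it lowers the AIC, the selected model never has a higher AIC than the full model it started from.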

NOTE: We are not running the code listed under these 5 methods. The data set is large and so is the number of possible predictor variables, which makes these runs too heavy for our R setup. If we ran them, R would check 2 raised to the power 10 subsets of the model (here the number of possible predictors is 10), and the software would get stuck.

So a simpler call to the lm function is used here.

 model<- RoomRent ~ Population + IsMetroCity + IsTouristDestination + 
     IsNewYearEve + StarRating + Airport + FreeWifi + HotelCapacity + 
    HasSwimmingPool
fit1<-lm(formula = model,data = hotel)
summary(fit1)
## 
## Call:
## lm(formula = model, data = hotel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11839  -2385   -691   1045 309532 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.560e+03  4.055e+02 -21.109  < 2e-16 ***
## Population           -1.244e-04  2.263e-05  -5.499 3.88e-08 ***
## IsMetroCity          -6.369e+02  2.132e+02  -2.988  0.00282 ** 
## IsTouristDestination  1.918e+03  1.374e+02  13.958  < 2e-16 ***
## IsNewYearEve          8.430e+02  1.739e+02   4.849 1.26e-06 ***
## StarRating            3.598e+03  1.104e+02  32.582  < 2e-16 ***
## Airport               1.001e+01  2.716e+00   3.684  0.00023 ***
## FreeWifi              5.952e+02  2.217e+02   2.685  0.00726 ** 
## HotelCapacity        -1.040e+01  1.029e+00 -10.115  < 2e-16 ***
## HasSwimmingPool       2.147e+03  1.598e+02  13.434  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6600 on 13222 degrees of freedom
## Multiple R-squared:  0.1904, Adjusted R-squared:  0.1899 
## F-statistic: 345.5 on 9 and 13222 DF,  p-value: < 2.2e-16
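To read the coefficients, it helps to work through one prediction by hand. In the sketch below the coefficients are copied from the summary output above, rounded as printed, while the hotel described by `x` is hypothetical and invented for illustration only.

```r
# Coefficients copied from the summary output above (rounded as printed).
b <- c("(Intercept)" = -8.560e+03, Population = -1.244e-04, IsMetroCity = -6.369e+02,
       IsTouristDestination = 1.918e+03, IsNewYearEve = 8.430e+02,
       StarRating = 3.598e+03, Airport = 1.001e+01, FreeWifi = 5.952e+02,
       HotelCapacity = -1.040e+01, HasSwimmingPool = 2.147e+03)

# Hypothetical hotel: 3-star, tourist city of 1 million people, 10 km from the
# airport, free wifi, 50 rooms, no pool, on a normal day.
x <- c("(Intercept)" = 1, Population = 1e6, IsMetroCity = 0,
       IsTouristDestination = 1, IsNewYearEve = 0, StarRating = 3,
       Airport = 10, FreeWifi = 1, HotelCapacity = 50, HasSwimmingPool = 0)

# Predicted rent is the sum of coefficient times value over all terms.
predicted_rent <- sum(b * x[names(b)])
predicted_rent  # 4202.9
```

Setting IsNewYearEve to 1 for the same hypothetical hotel adds the IsNewYearEve coefficient, 843, to the prediction.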

MODEL1

Model1 <- RoomRent ~ Population+CityRank+IsMetroCity+IsTouristDestination+IsWeekend+IsNewYearEve+StarRating+Airport+FreeWifi+FreeBreakfast+HotelCapacity+HasSwimmingPool
fit1 <- lm(Model1, data = hotel)
summary(fit1)
## 
## Call:
## lm(formula = Model1, data = hotel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11845  -2356   -690   1030 309689 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.604e+03  4.494e+02 -19.147  < 2e-16 ***
## Population           -1.188e-04  3.592e-05  -3.307 0.000945 ***
## CityRank              1.821e+00  1.035e+01   0.176 0.860302    
## IsMetroCity          -6.640e+02  2.164e+02  -3.068 0.002158 ** 
## IsTouristDestination  1.925e+03  1.481e+02  13.001  < 2e-16 ***
## IsWeekend            -9.076e+01  1.239e+02  -0.733 0.463709    
## IsNewYearEve          8.826e+02  1.818e+02   4.855 1.22e-06 ***
## StarRating            3.592e+03  1.108e+02  32.434  < 2e-16 ***
## Airport               9.510e+00  3.171e+00   2.999 0.002709 ** 
## FreeWifi              5.498e+02  2.242e+02   2.452 0.014214 *  
## FreeBreakfast         1.688e+02  1.233e+02   1.369 0.171163    
## HotelCapacity        -1.028e+01  1.033e+00  -9.945  < 2e-16 ***
## HasSwimmingPool       2.153e+03  1.616e+02  13.327  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6601 on 13219 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1898 
## F-statistic: 259.3 on 12 and 13219 DF,  p-value: < 2.2e-16

Here CityRank, IsWeekend and FreeBreakfast do not have a statistically significant effect on RoomRent (p-value > 0.05). All the other variables are significant (p-value < 0.05).

MODEL FIT

library(leaps)
leap1 <- regsubsets(Model1, data = hotel, nbest=1)
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(Model1, data = hotel, nbest = 1)
## 12 Variables  (and intercept)
##                      Forced in Forced out
## Population               FALSE      FALSE
## CityRank                 FALSE      FALSE
## IsMetroCity              FALSE      FALSE
## IsTouristDestination     FALSE      FALSE
## IsWeekend                FALSE      FALSE
## IsNewYearEve             FALSE      FALSE
## StarRating               FALSE      FALSE
## Airport                  FALSE      FALSE
## FreeWifi                 FALSE      FALSE
## FreeBreakfast            FALSE      FALSE
## HotelCapacity            FALSE      FALSE
## HasSwimmingPool          FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          Population CityRank IsMetroCity IsTouristDestination IsWeekend
## 1  ( 1 ) " "        " "      " "         " "                  " "      
## 2  ( 1 ) " "        "*"      " "         " "                  " "      
## 3  ( 1 ) "*"        " "      " "         "*"                  " "      
## 4  ( 1 ) "*"        " "      " "         "*"                  " "      
## 5  ( 1 ) "*"        " "      " "         "*"                  " "      
## 6  ( 1 ) "*"        " "      " "         "*"                  " "      
## 7  ( 1 ) "*"        " "      " "         "*"                  " "      
## 8  ( 1 ) "*"        " "      "*"         "*"                  " "      
##          IsNewYearEve StarRating Airport FreeWifi FreeBreakfast
## 1  ( 1 ) " "          "*"        " "     " "      " "          
## 2  ( 1 ) " "          "*"        " "     " "      " "          
## 3  ( 1 ) " "          "*"        " "     " "      " "          
## 4  ( 1 ) " "          "*"        " "     " "      " "          
## 5  ( 1 ) " "          "*"        " "     " "      " "          
## 6  ( 1 ) "*"          "*"        " "     " "      " "          
## 7  ( 1 ) "*"          "*"        "*"     " "      " "          
## 8  ( 1 ) "*"          "*"        "*"     " "      " "          
##          HotelCapacity HasSwimmingPool
## 1  ( 1 ) " "           " "            
## 2  ( 1 ) " "           " "            
## 3  ( 1 ) " "           " "            
## 4  ( 1 ) " "           "*"            
## 5  ( 1 ) "*"           "*"            
## 6  ( 1 ) "*"           "*"            
## 7  ( 1 ) "*"           "*"            
## 8  ( 1 ) "*"           "*"
plot(leap1, scale="adjr2")
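The output above says "1 subsets of each size up to 8" because regsubsets() defaults to nvmax = 8, so subsets with 9 to 12 predictors are never examined. The following is a minimal sketch of how the search could be widened and the adjusted R-squared read off numerically instead of from the plot. The helper name best_by_adjr2 is hypothetical and not part of the original analysis.

```r
library(leaps)

# Hypothetical helper (not in the original analysis): exhaustive search over
# all subset sizes up to nvmax, returning the size with the highest adjusted R^2.
best_by_adjr2 <- function(formula, data, nvmax) {
  search <- regsubsets(formula, data = data, nbest = 1, nvmax = nvmax)
  adjr2  <- summary(search)$adjr2   # one adjusted R^2 per subset size
  best   <- which.max(adjr2)        # subset size with the highest value
  list(size = best,
       adjr2 = adjr2[best],
       coefficients = coef(search, id = best))
}

# Usage with the objects defined above (Model1 has 12 candidate predictors):
# best_by_adjr2(Model1, hotel, nvmax = 12)
```

Adjusted R-squared is only one possible criterion; summary() of a regsubsets object also returns cp and bic, which penalise model size more heavily.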

On the basis of the p-values calculated earlier, we drop the three insignificant variables: CityRank, IsWeekend and FreeBreakfast.

Now Model2

Model2 <- RoomRent ~ StarRating+Population+IsMetroCity+IsTouristDestination+IsNewYearEve+Airport+FreeWifi+HotelCapacity+HasSwimmingPool
fit2 <- lm(Model2, data = hotel)
summary(fit2)
## 
## Call:
## lm(formula = Model2, data = hotel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11839  -2385   -691   1045 309532 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.560e+03  4.055e+02 -21.109  < 2e-16 ***
## StarRating            3.598e+03  1.104e+02  32.582  < 2e-16 ***
## Population           -1.244e-04  2.263e-05  -5.499 3.88e-08 ***
## IsMetroCity          -6.369e+02  2.132e+02  -2.988  0.00282 ** 
## IsTouristDestination  1.918e+03  1.374e+02  13.958  < 2e-16 ***
## IsNewYearEve          8.430e+02  1.739e+02   4.849 1.26e-06 ***
## Airport               1.001e+01  2.716e+00   3.684  0.00023 ***
## FreeWifi              5.952e+02  2.217e+02   2.685  0.00726 ** 
## HotelCapacity        -1.040e+01  1.029e+00 -10.115  < 2e-16 ***
## HasSwimmingPool       2.147e+03  1.598e+02  13.434  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6600 on 13222 degrees of freedom
## Multiple R-squared:  0.1904, Adjusted R-squared:  0.1899 
## F-statistic: 345.5 on 9 and 13222 DF,  p-value: < 2.2e-16

Here every variable is significant: all the p-values are below 0.05.

library(leaps)
leap2 <- regsubsets(Model2, data = hotel, nbest=1)
summary(leap2)
## Subset selection object
## Call: regsubsets.formula(Model2, data = hotel, nbest = 1)
## 9 Variables  (and intercept)
##                      Forced in Forced out
## StarRating               FALSE      FALSE
## Population               FALSE      FALSE
## IsMetroCity              FALSE      FALSE
## IsTouristDestination     FALSE      FALSE
## IsNewYearEve             FALSE      FALSE
## Airport                  FALSE      FALSE
## FreeWifi                 FALSE      FALSE
## HotelCapacity            FALSE      FALSE
## HasSwimmingPool          FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          StarRating Population IsMetroCity IsTouristDestination
## 1  ( 1 ) "*"        " "        " "         " "                 
## 2  ( 1 ) "*"        "*"        " "         " "                 
## 3  ( 1 ) "*"        "*"        " "         "*"                 
## 4  ( 1 ) "*"        "*"        " "         "*"                 
## 5  ( 1 ) "*"        "*"        " "         "*"                 
## 6  ( 1 ) "*"        "*"        " "         "*"                 
## 7  ( 1 ) "*"        "*"        " "         "*"                 
## 8  ( 1 ) "*"        "*"        "*"         "*"                 
##          IsNewYearEve Airport FreeWifi HotelCapacity HasSwimmingPool
## 1  ( 1 ) " "          " "     " "      " "           " "            
## 2  ( 1 ) " "          " "     " "      " "           " "            
## 3  ( 1 ) " "          " "     " "      " "           " "            
## 4  ( 1 ) " "          " "     " "      " "           "*"            
## 5  ( 1 ) " "          " "     " "      "*"           "*"            
## 6  ( 1 ) "*"          " "     " "      "*"           "*"            
## 7  ( 1 ) "*"          "*"     " "      "*"           "*"            
## 8  ( 1 ) "*"          "*"     " "      "*"           "*"
plot(leap2, scale="adjr2")

Now Checking for multicollinearity

Model2 <- RoomRent ~ StarRating+Population+IsMetroCity+IsTouristDestination+IsNewYearEve+Airport+FreeWifi+HotelCapacity+HasSwimmingPool
fit2 <- lm(Model2, data = hotel)

all_vifs <- car::vif(fit2)
print(all_vifs)
##           StarRating           Population          IsMetroCity 
##             2.118451             2.820418             2.807261 
## IsTouristDestination         IsNewYearEve              Airport 
##             1.210458             1.000013             1.160325 
##             FreeWifi        HotelCapacity      HasSwimmingPool 
##             1.024652             1.888342             1.777718

Remove the variable with the highest VIF and rebuild the model, repeating until no VIF exceeds 2.5.

signif_all <- names(all_vifs)

# Drop the variable with the highest VIF and refit until no VIF exceeds 2.5.
while(any(all_vifs > 2.5)){
  var_with_max_vif <- names(which(all_vifs == max(all_vifs)))  # get the var with max vif
  signif_all <- signif_all[!(signif_all %in% var_with_max_vif)]  # remove it from the predictor list
  myForm <- as.formula(paste("RoomRent ~ ", paste (signif_all, collapse=" + "), sep=""))  # new formula
  selectedMod <- lm(myForm, data=hotel)  # re-build model with new formula
  all_vifs <- car::vif(selectedMod)
}
summary(selectedMod)
## 
## Call:
## lm(formula = myForm, data = hotel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11654  -2365   -710   1067 309426 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8838.099    402.792 -21.942  < 2e-16 ***
## StarRating            3569.749    110.439  32.323  < 2e-16 ***
## IsMetroCity          -1530.867    138.033 -11.091  < 2e-16 ***
## IsTouristDestination  2094.588    133.722  15.664  < 2e-16 ***
## IsNewYearEve           843.370    174.054   4.845 1.28e-06 ***
## Airport                 11.506      2.705   4.253 2.12e-05 ***
## FreeWifi               534.928    221.665   2.413   0.0158 *  
## HotelCapacity          -11.137      1.021 -10.907  < 2e-16 ***
## HasSwimmingPool       2225.460    159.331  13.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6608 on 13223 degrees of freedom
## Multiple R-squared:  0.1886, Adjusted R-squared:  0.1881 
## F-statistic: 384.1 on 8 and 13223 DF,  p-value: < 2.2e-16
car::vif(selectedMod)
##           StarRating          IsMetroCity IsTouristDestination 
##             2.113742             1.174547             1.144109 
##         IsNewYearEve              Airport             FreeWifi 
##             1.000013             1.148621             1.022146 
##        HotelCapacity      HasSwimmingPool 
##             1.856658             1.763424

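The loop removed Population, which had the highest VIF (2.82). In the resulting model the VIF of every variable is below 2.5, and every variable is statistically significant (all p-values are below 0.05).

So we can say that this model shows no serious multicollinearity by the VIF < 2.5 rule, and that all of its terms are significant.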

OBSERVATIONS OF THIS REGRESSION ANALYSIS-

NULL HYPOTHESIS (H0): None of the factors affects RoomRent, i.e. there is no dependency between the room rent of a hotel and the other variables, or equivalently all the beta coefficients (other than the intercept) are zero.

ALTERNATIVE HYPOTHESIS (H1): There is a dependency between RoomRent and the other variables, i.e. at least one of the beta coefficients is not equal to zero.

From the regression analysis, the p-value of the overall F-test is < 2.2e-16, far below 0.05.

We therefore reject the null hypothesis and conclude that at least one of the variables affects the dependent variable RoomRent. The F-statistic is also very large (384.1 on 8 and 13223 degrees of freedom), which means the model with these predictors fits significantly better than an intercept-only model.
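As a quick check of the overall F-test, the p-value and the 5% critical value can be recomputed from the F-statistic and degrees of freedom printed in the summary above. This is only a sketch using the reported (rounded) numbers, not a re-fit of the model.

```r
# Values copied from the summary(selectedMod) output above
f_stat   <- 384.1
df_model <- 8       # number of predictors
df_resid <- 13223   # residual degrees of freedom

# Upper-tail probability of the F distribution = p-value of the overall F-test
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

# Critical value for a test at the 5% level (about 1.94 for these df)
f_crit <- qf(0.95, df_model, df_resid)

c(p_value = p_value, f_crit = f_crit)
```

The observed F-statistic is far above the critical value, which is why R prints the p-value only as "< 2.2e-16".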

The R-squared value is 0.1886, which means this model explains only 18.86% of the variation in RoomRent, which is quite low. However, every term in it is significant and it passes the VIF check for multicollinearity.

Calculating the adjusted R squared and AIC values for both model1 and model2.

summary(fit1)$adj.r.squared
## [1] 0.1898256
summary(fit2)$adj.r.squared
## [1] 0.1898573
AIC(fit1)
## [1] 270314.1
AIC(fit2)
## [1] 270310.6

RESULT - The adjusted R-squared of Model2 (0.1898573) is slightly higher than that of Model1 (0.1898256), and the AIC of Model2 (270310.6) is lower than that of Model1 (270314.1). Both differences are small, but both favour Model2, which is also the simpler model, so we prefer Model2.
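Because Model2 uses a subset of the predictors in Model1, the two fits are nested, and a partial F-test is another way to check whether dropping the three variables loses any explanatory power. The sketch below assumes fit1 and fit2 are the lm objects created earlier on the same hotel data; the helper name nested_f_pvalue is hypothetical.

```r
# Hypothetical helper: p-value of the partial F-test comparing two nested
# lm fits (the reduced model must use a subset of the full model's terms).
nested_f_pvalue <- function(reduced, full) {
  tab <- anova(reduced, full)   # row 2 holds the comparison of full vs reduced
  tab[["Pr(>F)"]][2]
}

# Usage with the models fitted above:
# nested_f_pvalue(fit2, fit1)
```

A p-value above 0.05 here would indicate that CityRank, IsWeekend and FreeBreakfast do not jointly improve the fit, which would agree with the adjusted R-squared and AIC comparison.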

Listing out BETA coefficients and confidence intervals

fit2$coefficients
##          (Intercept)           StarRating           Population 
##        -8.559981e+03         3.598383e+03        -1.244506e-04 
##          IsMetroCity IsTouristDestination         IsNewYearEve 
##        -6.368734e+02         1.917691e+03         8.429888e+02 
##              Airport             FreeWifi        HotelCapacity 
##         1.000580e+01         5.952249e+02        -1.040392e+01 
##      HasSwimmingPool 
##         2.146661e+03
confint(fit2)
##                              2.5 %        97.5 %
## (Intercept)          -9.354845e+03 -7.765117e+03
## StarRating            3.381905e+03  3.814862e+03
## Population           -1.688086e-04 -8.009249e-05
## IsMetroCity          -1.054700e+03 -2.190462e+02
## IsTouristDestination  1.648382e+03  2.187001e+03
## IsNewYearEve          5.021954e+02  1.183782e+03
## Airport               4.682508e+00  1.532909e+01
## FreeWifi              1.606768e+02  1.029773e+03
## HotelCapacity        -1.242003e+01 -8.387814e+00
## HasSwimmingPool       1.833432e+03  2.459890e+03

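INTERPRETATION of a few coefficients - The StarRating coefficient is 3.598383e+03, with a 95% confidence interval from 3.381905e+03 to 3.814862e+03. That is, if the sampling were repeated many times, about 95% of the intervals built this way would contain the true coefficient. Since the interval excludes zero, the effect is significant at the 5% level. The other coefficients can be interpreted similarly.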
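To show where the confint() numbers come from, the StarRating interval can be rebuilt by hand as estimate plus or minus the t critical value times the standard error. The sketch uses the rounded standard error from summary(fit2), so it matches the printed interval only approximately.

```r
# Values copied from the fit2 output above
estimate  <- 3598.383   # StarRating coefficient
std_error <- 110.4      # StarRating Std. Error (rounded in the summary table)
df_resid  <- 13222      # residual degrees of freedom of fit2

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df_resid)

# Lower and upper limits of the 95% confidence interval
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

With this many degrees of freedom the t critical value is very close to 1.96, so the interval is roughly the estimate plus or minus two standard errors.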

Estimating a few values of RoomRent

f<-fitted(fit2)
f[1:50]
##        1        2        3        4        5        6        7        8 
## 9130.375 9130.375 9130.375 9130.375 9130.375 9973.364 9130.375 9130.375 
##        9       10       11       12       13       14       15       16 
## 6069.941 6069.941 6069.941 6069.941 6069.941 6912.930 6069.941 6069.941 
##       17       18       19       20       21       22       23       24 
## 7602.771 7602.771 7602.771 7602.771 7602.771 8445.759 7602.771 7602.771 
##       25       26       27       28       29       30       31       32 
## 6280.099 6280.099 6280.099 6280.099 6280.099 7123.088 6280.099 6280.099 
##       33       34       35       36       37       38       39       40 
## 4041.863 4041.863 4041.863 4041.863 4041.863 4884.852 4041.863 4041.863 
##       41       42       43       44       45       46       47       48 
## 2502.769 2502.769 2502.769 2502.769 2502.769 3345.758 2502.769 2502.769 
##       49       50 
## 2409.134 2409.134

Some values are underestimated and some are overestimated: although Model2 is the better of the two models compared, its R-squared is low (0.1904), so the fitted values track the actual rents only loosely.
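The fitted values above repeat in blocks of eight, with one higher value per block. IsNewYearEve is the only date-dependent predictor in Model2, so the higher value is presumably the Dec 31 observation of each hotel. A small sketch, using only numbers copied from the outputs above, checks that the gap equals the IsNewYearEve coefficient:

```r
# Fitted values for the first six hotels, copied from the output above
regular_day   <- c(9130.375, 6069.941, 7602.771, 6280.099, 4041.863, 2502.769)
new_years_eve <- c(9973.364, 6912.930, 8445.759, 7123.088, 4884.852, 3345.758)

# IsNewYearEve coefficient from fit2$coefficients
coef_nye <- 8.429888e+02

# Gap between the two fitted values of each hotel
gap <- new_years_eve - regular_day

# TRUE if every gap matches the coefficient up to print rounding
all(abs(gap - coef_nye) < 0.01)
```

This illustrates that, within a hotel, the model predicts the same rent on every date except New Year's Eve, which is one reason the fitted values cannot follow day-to-day price changes.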

So Final conclusion-

Below are the 9 variables (both internal and external) that affect the room prices of hotels, listed roughly in order of significance.

StarRating - positively affects RoomRent. Because the model is linear in levels, a one-star increase in rating is associated with about Rs 3,598 (3.598383e+03) higher room rent, holding the other variables fixed. Tourists care about hotel ratings: a higher rating signals higher standards and luxury, and tourists do not mind paying more for it.

Population - negatively affects RoomRent. Each additional resident is associated with a Rs 1.244506e-04 lower RoomRent, i.e. about Rs 124 lower per additional million residents, holding the other variables fixed.

IsTouristDestination - positively affects RoomRent. Tourists are drawn to tourist destinations, and hotel owners take advantage of this. Rooms at a tourist destination are likely to be in demand throughout the year.

HotelCapacity - negatively affects the prices. An increase in guest accommodation capacity is associated with a decrease in RoomRent. The more tourists a hotel can accommodate, the better placed it is to lower its prices.

HasSwimmingPool - positively affects RoomRent. After a day of touring, tourists like to relax at their hotel, and a swimming pool is a popular way to do so. Hotel owners with a swimming pool therefore charge more.

IsNewYearEve - positively affects RoomRent. Many tourists like to go out on New Year's Eve. This increases the demand for hotels, and too many tourists chasing a fixed number of hotel rooms drives up the prices.

Airport - positively affects the prices. The greater the distance from the hotel to the airport, the more expensive the hotel is. This seems somewhat counter-intuitive, but one could argue that tourists spend their time in the city rather than around the airport. The more easily accessible a hotel is within the city, the higher its price. Tourists do not want to travel back to the airport area every time they want to relax at their hotel.

IsMetroCity - negatively affects RoomRent. A possible explanation is that the more crowded a city is, the fewer tourists it attracts. Metro cities are big, crowded and noisy, which makes it hard to relax, so hotels in these cities are less expensive once the other factors are accounted for.

FreeWifi - this is the least significant of the nine factors, and its effect on prices is positive. Most tourists now carry their own mobile data connection, so a Wifi facility does not seem to be a major factor. It is still statistically significant, however, so one can expect hotels with free Wifi to be more expensive.
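Since Model2 is linear in levels (RoomRent is not log-transformed), each coefficient can be read as a change in rupees per one-unit change in the predictor, holding the others fixed. The sketch below rescales three of the fit2 coefficients to more readable units. The step sizes in delta are illustrative assumptions, not values from the original analysis.

```r
# Coefficients copied from fit2$coefficients above
coefs <- c(StarRating    =  3.598383e+03,
           Population    = -1.244506e-04,
           HotelCapacity = -1.040392e+01)

# Illustrative changes in each predictor (assumed, chosen for readability)
delta <- c(StarRating    = 1,     # one extra star
           Population    = 1e6,   # one million more residents
           HotelCapacity = 100)   # 100 more units of capacity

# Predicted change in RoomRent (in rupees) for each illustrative change
effect_rupees <- coefs * delta[names(coefs)]
round(effect_rupees, 1)
```

Read this way, the tiny Population coefficient corresponds to roughly Rs 124 less per additional million residents, which is small next to the effect of one extra star.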

The factors that did not significantly affect RoomRent are IsWeekend, CityRank and FreeBreakfast.

Tourists do not seem to take into account whether it is a weekday or a weekend. Whenever they want to tour, they go to the place irrespective of the day.

Similarly, city ranking does not matter; what matters is an exotic location. Breakfast does not seem to be a motivating factor either, as it is more like a complimentary extra that comes with the hotel. Many tourists return to the hotel late at night and wake up late in the morning, by which time breakfast is normally over.

                     That's it from my side - Thank You