Introduction

This project analyzes the pricing strategy of hotels in Indian cities. The objective is to identify the factors that contribute most to the price of a hotel room.

Data

The data for this project was collected from the website www.hotel.in. The dataset contains 13232 rows and 19 columns. It records details such as distance from the airport, hotel description, star rating, booking date, and the price at which rooms were booked, for hotels in 42 Indian cities. The types of data are examined in more detail in the analysis that follows. The parameters in the dataset are as follows:

1. CityName
2. Population
3. CityRank
4. IsMetroCity
5. IsTouristDestination
6. IsWeekend
7. IsNewYearEve
8. Date
9. HotelName
10. RoomRent
11. StarRating
12. Airport
13. HotelAddress
14. HotelPincode
15. HotelDescription
16. FreeWifi
17. FreeBreakfast
18. HotelCapacity
19. HasSwimmingPool

Description of Data

CityName: Indian city names such as Delhi, Mumbai, Goa, Agra, etc.

Population: Population in that city

CityRank: Used in the dataset to uniquely identify each city

IsMetroCity: Used in the dataset to indicate whether or not that particular city is metro city 1= Yes 0= No

IsTouristDestination: Used in the dataset to indicate whether or not that particular city is tourist destination 1= Yes 0= No

IsWeekend: Used in the dataset to indicate whether or not that particular booking is made on a weekend 1= Yes 0= No

IsNewYearEve: Used in the dataset to indicate whether or not that particular booking is made on new year eve 1= Yes 0= No

Date: The date on which that particular booking is made

HotelName: Name of the hotel in which the booking is made

RoomRent: The rent of room of the hotel in which the booking is made

StarRating: The star rating of the hotel in which the booking is made

Airport: The distance of the airport (in km) from the hotel in which the booking is made

HotelAddress: The address of the hotel in which the booking is made

HotelPincode: The pincode of the address of the hotel in which the booking is made

HotelDescription: Description of the hotel in which booking is made

FreeWifi: Used in the dataset to indicate whether or not that particular hotel provides free wifi 1= Yes 0= No

FreeBreakfast: Used in the dataset to indicate whether or not that particular hotel provides free breakfast 1= Yes 0= No

HotelCapacity: The capacity of the hotel in which the booking is made

HasSwimmingPool: Used in the dataset to indicate whether or not that particular hotel has a swimming pool 1= Yes 0= No

Strategy for attaining the objective:

The parameters in the dataset fall into three categories: Time (time of booking), External factors (CityRank, IsMetroCity, IsTouristDestination), and Internal factors (StarRating, Airport, HotelAddress, HotelPincode, HotelDescription, FreeWifi, FreeBreakfast, HotelCapacity, HasSwimmingPool). We test the effect of each category on the dependent variable (RoomRent) in turn, using regression analysis and various statistical tests. We then eliminate the parameters that contribute little to RoomRent and, by trial and error, arrive at an equation that uses a mix of independent variables from all three categories (time, external factors, internal factors) to determine RoomRent.
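
The category-by-category strategy can be sketched on a small simulated data frame. This is only an illustration: `toy.df`, its simulated coefficients, and the `formulas` list are assumptions made for the example and are not taken from Cities42.csv. Only the column names follow the dataset.

```r
# Hypothetical toy data standing in for Cities42.csv (values are simulated)
set.seed(42)
n <- 200
toy.df <- data.frame(
  IsNewYearEve         = rbinom(n, 1, 0.12),
  IsTouristDestination = rbinom(n, 1, 0.70),
  StarRating           = sample(c(2, 3, 4, 5), n, replace = TRUE)
)
# Simulated rent: these coefficients are made up for illustration only
toy.df$RoomRent <- 1000 + 500 * toy.df$IsNewYearEve +
  1500 * toy.df$IsTouristDestination + 2000 * toy.df$StarRating +
  rnorm(n, sd = 1500)

# One formula per category of predictors, plus a combined model
formulas <- list(
  time     = RoomRent ~ IsNewYearEve,
  external = RoomRent ~ IsTouristDestination,
  internal = RoomRent ~ StarRating,
  combined = RoomRent ~ IsNewYearEve + IsTouristDestination + StarRating
)

# Fit each model and collect its adjusted R-squared for comparison
adj.r2 <- sapply(formulas, function(f) summary(lm(f, data = toy.df))$adj.r.squared)
print(round(adj.r2, 3))
```

Collecting adjusted R-squared in one named vector makes it easy to see which category explains the most variation before the categories are combined.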

Reading the Data

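# stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0, matching the str() output shown later
data.df <- read.csv("Cities42.csv", stringsAsFactors = TRUE)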
library(psych)
View(data.df)
dim(data.df)
## [1] 13232    19

Comments: 13232 observations of 19 variables obtained

Analyzing the types of data in the dataset

str(data.df)
## 'data.frame':    13232 obs. of  19 variables:
##  $ CityName            : Factor w/ 42 levels "Agra","Ahmedabad",..: 26 26 26 26 26 26 26 26 26 26 ...
##  $ Population          : int  12442373 12442373 12442373 12442373 12442373 12442373 12442373 12442373 12442373 12442373 ...
##  $ CityRank            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsMetroCity         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ IsTouristDestination: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ IsWeekend           : int  1 0 1 1 0 1 0 1 1 0 ...
##  $ IsNewYearEve        : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Date                : Factor w/ 20 levels "04-Jan-16","04-Jan-17",..: 11 12 13 14 15 16 17 18 11 12 ...
##  $ HotelName           : Factor w/ 1670 levels "14 Square Amanora",..: 1635 1635 1635 1635 1635 1635 1635 1635 1409 1409 ...
##  $ RoomRent            : int  12375 10250 9900 10350 12000 11475 11220 9225 6800 9350 ...
##  $ StarRating          : num  5 5 5 5 5 5 5 5 4 4 ...
##  $ Airport             : num  21 21 21 21 21 21 21 21 20 20 ...
##  $ HotelAddress        : Factor w/ 2108 levels " H.P. High Court Mall Road, Shimla",..: 925 928 930 933 935 937 940 941 699 746 ...
##  $ HotelPincode        : int  400005 400006 400007 400008 400009 400010 400011 400012 400039 400040 ...
##  $ HotelDescription    : Factor w/ 1226 levels "#NAME?","10 star hotel near Queensroad, Amritsar",..: 1030 1030 1030 1030 1030 1030 1030 1030 1006 1006 ...
##  $ FreeWifi            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FreeBreakfast       : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ HotelCapacity       : int  287 287 287 287 287 287 287 287 28 28 ...
##  $ HasSwimmingPool     : int  1 1 1 1 1 1 1 1 0 0 ...

Describing the data in the dataset

describe(data.df) [, 1:5]
##                      vars     n       mean         sd  median
## CityName*               1 13232      18.07      11.72      16
## Population              2 13232 4416836.87 4258386.00 3046163
## CityRank                3 13232      14.83      13.51       9
## IsMetroCity             4 13232       0.28       0.45       0
## IsTouristDestination    5 13232       0.70       0.46       1
## IsWeekend               6 13232       0.62       0.48       1
## IsNewYearEve            7 13232       0.12       0.33       0
## Date*                   8 13232      14.30       2.69      14
## HotelName*              9 13232     841.19     488.16     827
## RoomRent               10 13232    5473.99    7333.12    4000
## StarRating             11 13232       3.46       0.76       3
## Airport                12 13232      21.16      22.76      15
## HotelAddress*          13 13232    1202.53     582.17    1261
## HotelPincode           14 13232  397430.26  259837.50  395003
## HotelDescription*      15 13224     581.34     363.26     567
## FreeWifi               16 13232       0.93       0.26       1
## FreeBreakfast          17 13232       0.65       0.48       1
## HotelCapacity          18 13232      62.51      76.66      34
## HasSwimmingPool        19 13232       0.36       0.48       0

Distribution of city names in the dataset

cities <- table(data.df$CityName)
cities
## 
##             Agra        Ahmedabad         Amritsar        Bangalore 
##              432              424              136              656 
##      Bhubaneswar       Chandigarh          Chennai       Darjeeling 
##              120              336              416              136 
##            Delhi          Gangtok              Goa         Guwahati 
##             2048              128              624               48 
##         Haridwar        Hyderabad           Indore           Jaipur 
##               48              536              160              768 
##        Jaisalmer          Jodhpur           Kanpur            Kochi 
##              264              224               16              608 
##          Kolkata          Lucknow          Madurai           Manali 
##              512              128              112              288 
##        Mangalore           Mumbai           Munnar           Mysore 
##              104              712              328              160 
##         Nainital             Ooty        Panchkula             Pune 
##              144              136               64              600 
##             Puri           Rajkot        Rishikesh           Shimla 
##               56              128               88              280 
##         Srinagar            Surat Thiruvanthipuram         Thrissur 
##               40               80              392               32 
##          Udaipur         Varanasi 
##              456              264

Distribution of CityRank in the dataset

CityRank <- table(data.df$CityRank)
CityRank
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##  712 2048  656  416  536  424  512   80  600  768   32  128   16  136  160 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##  432  448  624  128  264   40  224  336  392   48  160  120  272  104  456 
##   32   33   34   35   36   37   38   39   40   42   43   44 
##   48   56  280   64  136   88  128  136  264  144  328  288

Distribution of MetroCity in the dataset

barplot(table(data.df$IsMetroCity), main="Metro city or not in dataset", col= c("red","pink"))

Distribution of TouristCity in the dataset

pie(table(data.df$IsTouristDestination), main="Tourist destination or not in dataset", col=c("burlywood","peachpuff"))

Distribution of CityName in the dataset (as percentage of the total dataset)

mytable <- with(data.df, table(CityName))
prop.table(mytable)*100
## CityName
##             Agra        Ahmedabad         Amritsar        Bangalore 
##        3.2648126        3.2043531        1.0278114        4.9576784 
##      Bhubaneswar       Chandigarh          Chennai       Darjeeling 
##        0.9068924        2.5392987        3.1438936        1.0278114 
##            Delhi          Gangtok              Goa         Guwahati 
##       15.4776300        0.9673519        4.7158404        0.3627570 
##         Haridwar        Hyderabad           Indore           Jaipur 
##        0.3627570        4.0507860        1.2091898        5.8041112 
##        Jaisalmer          Jodhpur           Kanpur            Kochi 
##        1.9951632        1.6928658        0.1209190        4.5949214 
##          Kolkata          Lucknow          Madurai           Manali 
##        3.8694075        0.9673519        0.8464329        2.1765417 
##        Mangalore           Mumbai           Munnar           Mysore 
##        0.7859734        5.3808948        2.4788392        1.2091898 
##         Nainital             Ooty        Panchkula             Pune 
##        1.0882709        1.0278114        0.4836759        4.5344619 
##             Puri           Rajkot        Rishikesh           Shimla 
##        0.4232164        0.9673519        0.6650544        2.1160822 
##         Srinagar            Surat Thiruvanthipuram         Thrissur 
##        0.3022975        0.6045949        2.9625151        0.2418380 
##          Udaipur         Varanasi 
##        3.4461911        1.9951632
View(mytable)

Boxplot of the population of cities in the dataset

boxplot(data.df$Population, main="Population of cities in dataset", col="grey", horizontal = TRUE)

Percentage of Metrocity in the dataset

mytable1 <- with(data.df, table(IsMetroCity))
prop.table(mytable1)*100
## IsMetroCity
##        0        1 
## 71.58404 28.41596
View(mytable1)

Comments: 28.42% of the data falls in the metro city category and the remaining 71.58% falls in the non-metro city category

mytable2 <- with(data.df, table(IsWeekend))
prop.table(mytable2)*100
## IsWeekend
##        0        1 
## 37.71917 62.28083
View(mytable2)

Comments: 62.28% of the bookings in the dataset were made on a weekend

mytable3 <- with(data.df, table(IsNewYearEve))
prop.table(mytable3)*100
## IsNewYearEve
##        0        1 
## 87.56046 12.43954
View(mytable3)

Comments: 12.44% of the bookings in the dataset were made on New Year's Eve

mytable4 <- with(data.df, table(StarRating))
prop.table(mytable4)*100
## StarRating
##           0           1           2         2.5           3         3.2 
##  0.12091898  0.06045949  3.32527207  4.77629988 44.98941959  0.06045949 
##         3.3         3.4         3.5         3.6         3.7         3.8 
##  0.12091898  0.06045949 13.24062878  0.06045949  0.18137848  0.12091898 
##         3.9           4         4.1         4.3         4.4         4.5 
##  0.24183797 18.61396614  0.18137848  0.12091898  0.06045949  2.84159613 
##         4.7         4.8           5 
##  0.06045949  0.12091898 10.64087062
View(mytable4)

Barchart of Star Rating of the hotels in the dataset

barplot(table(data.df$StarRating), main = "Star rating of hotels in dataset", col= c("red","lightgreen","yellow", "blue", "orange"))

Comments: The frequency of hotels with star rating 3 is the highest

Percentage of hotels with free wifi

mytable5<-with(data.df,table(FreeWifi))
View(mytable5)
round(prop.table(mytable5)*100,2)
## FreeWifi
##     0     1 
##  7.41 92.59

Comments: 92.59% of the hotels booked had free wifi

Percentage of hotels with free breakfast

mytable6<-with(data.df,table(FreeBreakfast))
View(mytable6)
round(prop.table(mytable6)*100,2)
## FreeBreakfast
##     0     1 
## 35.09 64.91

Comments: 64.91% of the hotels booked had free breakfast

Percentage of hotels with swimming pool

mytable7<-with(data.df,table(HasSwimmingPool))
round(prop.table(mytable7)*100,2)
## HasSwimmingPool
##     0     1 
## 64.42 35.58

Comments: 35.58% of the hotels booked had a swimming pool

Scatterplot of City Rank and Room Rent

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(data.df$CityRank, data.df$RoomRent, xlab="City Rank", ylab="Room Rent", main="Scatterplot of City Rank and Room Rent", spread = FALSE)

Correlation between Metro City and Room Rent

cor(data.df$IsMetroCity, data.df$RoomRent)
## [1] -0.06683977

Correlation between Tourist Destination and Room Rent

cor(data.df$IsTouristDestination, data.df$RoomRent)
## [1] 0.122503

Correlation between Weekend and Room Rent

cor(data.df$IsWeekend, data.df$RoomRent)
## [1] 0.004580134

Scatterplot of Star Rating and Room Rent

scatterplot(RoomRent~StarRating,data=data.df,spread=FALSE, smoother.args=list(lty=2),main="Scatter plot of Star Rating and Room rent",ylab="Room Rent", xlab="Star Rating")

Scatterplot of Airport distance and Room Rent

scatterplot(RoomRent~Airport,data=data.df,spread=FALSE, smoother.args=list(lty=2),main="Scatter plot of Airport(distance) and Room rent",ylab="Room Rent", xlab="Airport(distance)")

Scatterplot of Hotel Capacity and Room Rent

scatterplot(RoomRent~HotelCapacity,data=data.df,spread=FALSE, smoother.args=list(lty=2),main="Scatter plot of Hotel Capacity and Room rent",ylab="Room Rent", xlab="Hotel Capacity")

Corrgram of all variables

library(corrgram)
corrgram(data.df, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main = "Corrgram of all  variables in the dataset")

Comments: The corrgram suggests a positive correlation between RoomRent and the following:

1. IsTouristDestination
2. CityRank
3. StarRating
4. Airport
5. HotelCapacity
6. HasSwimmingPool

The strength of these correlations varies: the coefficient computed earlier for IsTouristDestination, for example, is only 0.12.

Negative and weak correlation between RoomRent and the following:

1. Population
2. IsMetroCity
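
The visual reading of a corrgram can be cross-checked numerically. The following is a minimal sketch on simulated data: `toy.df` and its values are hypothetical, and only the column names are borrowed from the dataset. It keeps the numeric columns and ranks them by their Pearson correlation with RoomRent.

```r
# Hypothetical toy data; only the column names mirror the real dataset
set.seed(1)
n <- 100
toy.df <- data.frame(
  CityName   = factor(sample(c("Agra", "Delhi", "Goa"), n, replace = TRUE)),
  StarRating = sample(c(3, 4, 5), n, replace = TRUE),
  Airport    = runif(n, 1, 50)
)
toy.df$RoomRent <- 2000 * toy.df$StarRating + rnorm(n, sd = 1000)

# Keep numeric columns only: factors such as CityName have no meaningful Pearson correlation
numeric.cols <- sapply(toy.df, is.numeric)

# Correlation of every numeric column with RoomRent, strongest first
rent.cor <- cor(toy.df[, numeric.cols], use = "pairwise.complete.obs")[, "RoomRent"]
rent.cor <- sort(rent.cor[names(rent.cor) != "RoomRent"], decreasing = TRUE)
print(round(rent.cor, 3))
```

Printing the coefficients alongside the corrgram makes it easier to separate genuinely strong relationships from weak ones that merely share a colour.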

Regression Models between Dependent Variable and Time category

Model1: Roomrent and Time

model1 <- lm(RoomRent ~ IsWeekend + IsNewYearEve, data=data.df)
summary(model1)
## 
## Call:
## lm(formula = RoomRent ~ IsWeekend + IsNewYearEve, data = data.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5874  -3031  -1436    808 317180 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5430.5      103.7  52.353  < 2e-16 ***
## IsWeekend      -110.4      137.4  -0.803    0.422    
## IsNewYearEve    902.6      201.8   4.472 7.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7328 on 13229 degrees of freedom
## Multiple R-squared:  0.00153,    Adjusted R-squared:  0.001379 
## F-statistic: 10.14 on 2 and 13229 DF,  p-value: 3.987e-05

Correlation Test between Room Rent and IsWeekend

cor.test(data.df$RoomRent, data.df$IsWeekend)
## 
##  Pearson's product-moment correlation
## 
## data:  data.df$RoomRent and data.df$IsWeekend
## t = 0.52682, df = 13230, p-value = 0.5983
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01245978  0.02161739
## sample estimates:
##         cor 
## 0.004580134

Comments: The p-value is > 0.05, so there is no statistically significant evidence of a linear association between IsWeekend and RoomRent

Correlation Test between Room Rent and IsNewYearEve

cor.test(data.df$RoomRent, data.df$IsNewYearEve)
## 
##  Pearson's product-moment correlation
## 
## data:  data.df$RoomRent and data.df$IsNewYearEve
## t = 4.4306, df = 13230, p-value = 9.472e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02146637 0.05549377
## sample estimates:
##        cor 
## 0.03849123

Comments: The p-value is < 0.05, so the association between IsNewYearEve and RoomRent is statistically significant, although the correlation itself is small (about 0.04)
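
When one of the two variables is a 0/1 indicator, the Pearson correlation test is equivalent to a pooled-variance two-sample t-test on the group means. The sketch below illustrates this on simulated data. `toy.df` and its values are hypothetical and are not drawn from the dataset.

```r
# Hypothetical toy data: a 0/1 indicator and a simulated rent
set.seed(7)
n <- 300
toy.df <- data.frame(IsNewYearEve = rbinom(n, 1, 0.12))
toy.df$RoomRent <- 5000 + 900 * toy.df$IsNewYearEve + rnorm(n, sd = 3000)

# Pearson correlation test between rent and the indicator
ct <- cor.test(toy.df$RoomRent, toy.df$IsNewYearEve)

# Pooled-variance (Student) two-sample t-test on the same data
tt <- t.test(RoomRent ~ IsNewYearEve, data = toy.df, var.equal = TRUE)

# The t statistics agree in magnitude (the sign depends on group order) and the p-values match
print(c(cor.t = unname(ct$statistic), pooled.t = unname(tt$statistic)))
print(c(cor.p = ct$p.value, pooled.p = tt$p.value))
```

Seen this way, the correlation tests in this section ask whether mean RoomRent differs between the two groups of the indicator.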

Regression Models between Dependent Variable and External Factors

Model2: Room Rent and External Factors

model2 <- lm( RoomRent ~ CityRank + IsMetroCity + IsTouristDestination + Airport, data=data.df)
summary(model2)
## 
## Call:
## lm(formula = RoomRent ~ CityRank + IsMetroCity + IsTouristDestination + 
##     Airport, data = data.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6229  -2872  -1289   1052 315993 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4305.5167   141.3452  30.461  < 2e-16 ***
## CityRank                 2.8886     7.0826   0.408    0.683    
## IsMetroCity          -1419.4961   187.5451  -7.569 4.02e-14 ***
## IsTouristDestination  2169.3886   157.6919  13.757  < 2e-16 ***
## Airport                  0.7822     3.2301   0.242    0.809    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7249 on 13227 degrees of freedom
## Multiple R-squared:  0.02311,    Adjusted R-squared:  0.02281 
## F-statistic: 78.22 on 4 and 13227 DF,  p-value: < 2.2e-16

Comments: R-squared is low (0.023), so the model explains very little of the variation in RoomRent.

library(coefplot)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
coefplot(model2, predictors=c("IsMetroCity", "Airport", "IsTouristDestination"))

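Comments: 1. IsTouristDestination has the largest effect on RoomRent (coefficient about 2169) 2. IsMetroCity has a significant negative effect (coefficient about -1419) 3. Airport has the smallest effect of the three and is not statistically significant (p = 0.809)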

T-test between Room Rent and City Rank

t.test(data.df$RoomRent, data.df$CityRank)
## 
##  Welch Two Sample t-test
## 
## data:  data.df$RoomRent and data.df$CityRank
## t = 85.635, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5334.200 5584.116
## sample estimates:
##  mean of x  mean of y 
## 5473.99184   14.83374

T-test between Room Rent and Metrocity

t.test(data.df$RoomRent, data.df$IsMetroCity)
## 
##  Welch Two Sample t-test
## 
## data:  data.df$RoomRent and data.df$IsMetroCity
## t = 85.863, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5348.750 5598.666
## sample estimates:
##    mean of x    mean of y 
## 5473.9918380    0.2841596

T-test between Room Rent and Tourist Destination

t.test(data.df$RoomRent, data.df$IsTouristDestination)
## 
##  Welch Two Sample t-test
## 
## data:  data.df$RoomRent and data.df$IsTouristDestination
## t = 85.856, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5348.337 5598.253
## sample estimates:
##    mean of x    mean of y 
## 5473.9918380    0.6971735

T-test between Room rent and Airport distance

t.test(data.df$RoomRent, data.df$Airport)
## 
##  Welch Two Sample t-test
## 
## data:  data.df$RoomRent and data.df$Airport
## t = 85.535, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5327.875 5577.792
## sample estimates:
##  mean of x  mean of y 
## 5473.99184   21.15874
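
Each call above passes two different columns to `t.test`, so it compares the mean of RoomRent with the mean of the second column (for example 5473.99 against 0.28 for IsMetroCity). To compare RoomRent between the two groups defined by an indicator, the formula interface can be used instead. The following is a minimal sketch on simulated data. `toy.df` and its values are hypothetical.

```r
# Hypothetical toy data: 70 non-metro rows and 30 metro rows with simulated rents
set.seed(3)
toy.df <- data.frame(IsMetroCity = rep(c(0, 1), times = c(70, 30)))
toy.df$RoomRent <- 5500 - 1000 * toy.df$IsMetroCity + rnorm(100, sd = 2000)

# Welch two-sample t-test of RoomRent between the IsMetroCity = 0 and = 1 groups
grouped <- t.test(RoomRent ~ IsMetroCity, data = toy.df)

# The estimates are the mean rent in each group, not the mean of the indicator
print(grouped$estimate)
print(grouped$p.value)
```

With this form the null hypothesis is that mean RoomRent is the same in both groups, which is the question the external-factor analysis is asking.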

Regression Model between Dependent Variable and Internal Factors

Model3: Room Rent and Internal Factors

model3 <- lm( RoomRent ~ StarRating + FreeWifi + FreeBreakfast + HotelCapacity + HasSwimmingPool, data=data.df)
summary(model3)
## 
## Call:
## lm(formula = RoomRent ~ StarRating + FreeWifi + FreeBreakfast + 
##     HotelCapacity + HasSwimmingPool, data = data.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10795  -2291   -961   1009 310092 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6875.101    393.223 -17.484   <2e-16 ***
## StarRating       3598.541    111.867  32.168   <2e-16 ***
## FreeWifi           -6.021    225.738  -0.027    0.979    
## FreeBreakfast     -27.775    124.371  -0.223    0.823    
## HotelCapacity     -15.576      1.009 -15.436   <2e-16 ***
## HasSwimmingPool  2527.393    158.112  15.985   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6711 on 13226 degrees of freedom
## Multiple R-squared:  0.1628, Adjusted R-squared:  0.1625 
## F-statistic: 514.4 on 5 and 13226 DF,  p-value: < 2.2e-16

Comments: R-squared (0.163) is higher than for the time and external-factor models, but the model still explains only about 16% of the variation in RoomRent, so it should be improved

Chisquared test between Room Rent and Star Rating

chisq.test(table(data.df$RoomRent, data.df$StarRating))
## Warning in chisq.test(table(data.df$RoomRent, data.df$StarRating)): Chi-
## squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(data.df$RoomRent, data.df$StarRating)
## X-squared = 132390, df = 43100, p-value < 2.2e-16

Chisquared test between Room Rent and Free Wifi

chisq.test(table(data.df$RoomRent, data.df$FreeWifi))
## Warning in chisq.test(table(data.df$RoomRent, data.df$FreeWifi)): Chi-
## squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(data.df$RoomRent, data.df$FreeWifi)
## X-squared = 5914.3, df = 2155, p-value < 2.2e-16

Chisquared test between Room Rent and Free Breakfast

chisq.test(table(data.df$RoomRent, data.df$FreeBreakfast))
## Warning in chisq.test(table(data.df$RoomRent, data.df$FreeBreakfast)): Chi-
## squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(data.df$RoomRent, data.df$FreeBreakfast)
## X-squared = 6492.4, df = 2155, p-value < 2.2e-16

Chisquared test between Room Rent and Hotel Capacity

chisq.test(table(data.df$RoomRent, data.df$HotelCapacity))
## Warning in chisq.test(table(data.df$RoomRent, data.df$HotelCapacity)): Chi-
## squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(data.df$RoomRent, data.df$HotelCapacity)
## X-squared = 1479800, df = 545220, p-value < 2.2e-16

Chisquared test between Room Rent and Swimming Pool availability

chisq.test(table(data.df$RoomRent, data.df$HasSwimmingPool))
## Warning in chisq.test(table(data.df$RoomRent, data.df$HasSwimmingPool)):
## Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(data.df$RoomRent, data.df$HasSwimmingPool)
## X-squared = 8394.6, df = 2155, p-value < 2.2e-16
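
The warnings above arise because RoomRent takes thousands of distinct values, so most cells of each contingency table have very small expected counts. One possible remedy, sketched here on simulated data, is to group RoomRent into a few bands before tabulating. `toy.df`, the quartile bands, and the `RentBand` column are assumptions made for this illustration and are not part of the original analysis.

```r
# Hypothetical toy data: a 0/1 amenity flag and a simulated rent
set.seed(11)
n <- 400
toy.df <- data.frame(HasSwimmingPool = rbinom(n, 1, 0.36))
toy.df$RoomRent <- round(4000 + 2500 * toy.df$HasSwimmingPool + rnorm(n, sd = 2000))

# Group rent into four quartile bands so every cell has a reasonable expected count
breaks <- quantile(toy.df$RoomRent, probs = seq(0, 1, 0.25))
toy.df$RentBand <- cut(toy.df$RoomRent, breaks = breaks, include.lowest = TRUE,
                       labels = c("Q1", "Q2", "Q3", "Q4"))

# 4 x 2 contingency table of rent band against the amenity flag
band.table <- table(toy.df$RentBand, toy.df$HasSwimmingPool)
res <- chisq.test(band.table)

print(band.table)
print(res$expected)  # expected counts; the approximation is usually considered safe when all are >= 5
print(res$p.value)
```

The number of bands is a judgment call: fewer bands give larger expected counts but discard more of the price information.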

Regression Model:

model4 <- lm( RoomRent ~ IsNewYearEve + IsTouristDestination + StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool, data=data.df)
summary(model4)
## 
## Call:
## lm(formula = RoomRent ~ IsNewYearEve + IsTouristDestination + 
##     StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool, 
##     data = data.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9884  -2371   -805    931 310960 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7134.89     393.91 -18.113  < 2e-16 ***
## IsNewYearEve           843.98     176.44   4.783 1.74e-06 ***
## IsTouristDestination  2092.87     127.75  16.383  < 2e-16 ***
## StarRating            2904.81      98.39  29.524  < 2e-16 ***
## FreeWifi               188.78     225.55   0.837   0.4026    
## FreeBreakfast          242.58     124.01   1.956   0.0505 .  
## HasSwimmingPool       1869.03     155.56  12.015  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6698 on 13225 degrees of freedom
## Multiple R-squared:  0.1661, Adjusted R-squared:  0.1657 
## F-statistic:   439 on 6 and 13225 DF,  p-value: < 2.2e-16

Comments: R-squared is still low (0.166), so let us add more independent variables to the regression model to improve it

model5 <- lm( RoomRent ~ IsNewYearEve + IsTouristDestination + StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool + HotelCapacity, data=data.df)
summary(model5)
## 
## Call:
## lm(formula = RoomRent ~ IsNewYearEve + IsTouristDestination + 
##     StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool + 
##     HotelCapacity, data = data.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11503  -2380   -737   1083 309773 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8654.938    406.373 -21.298  < 2e-16 ***
## IsNewYearEve           842.571    175.188   4.810 1.53e-06 ***
## IsTouristDestination  1892.334    127.674  14.822  < 2e-16 ***
## StarRating            3627.805    110.882  32.718  < 2e-16 ***
## FreeWifi               153.553    223.974   0.686    0.493    
## FreeBreakfast          101.377    123.556   0.820    0.412    
## HasSwimmingPool       2292.940    157.493  14.559  < 2e-16 ***
## HotelCapacity          -13.874      1.007 -13.784  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6651 on 13224 degrees of freedom
## Multiple R-squared:  0.1779, Adjusted R-squared:  0.1775 
## F-statistic: 408.8 on 7 and 13224 DF,  p-value: < 2.2e-16

Comments: R-squared improves to 0.178, so this model fits better. Let us add more independent variables to the regression model to improve it further

model6 <- lm(RoomRent ~ IsNewYearEve + Airport + IsTouristDestination + StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool + HotelCapacity, data=data.df)
summary(model6)
## 
## Call:
## lm(formula = RoomRent ~ IsNewYearEve + Airport + IsTouristDestination + 
##     StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool + 
##     HotelCapacity, data = data.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11265  -2324   -740   1075 310031 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8917.869    407.290 -21.896  < 2e-16 ***
## IsNewYearEve           841.203    174.860   4.811 1.52e-06 ***
## Airport                 18.778      2.637   7.120 1.14e-12 ***
## IsTouristDestination  1709.789    129.989  13.153  < 2e-16 ***
## StarRating            3566.808    111.006  32.132  < 2e-16 ***
## FreeWifi               309.478    224.624   1.378    0.168    
## FreeBreakfast           65.956    123.425   0.534    0.593    
## HasSwimmingPool       2452.427    158.785  15.445  < 2e-16 ***
## HotelCapacity          -13.460      1.006 -13.375  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6638 on 13223 degrees of freedom
## Multiple R-squared:  0.181,  Adjusted R-squared:  0.1805 
## F-statistic: 365.4 on 8 and 13223 DF,  p-value: < 2.2e-16

Comments: R-squared improves further to 0.181, so this is the best-fitting model of those tested

coefplot(model6, predictors=c("StarRating", "Airport", "FreeWifi", "FreeBreakfast", "HotelCapacity", "HasSwimmingPool", "IsNewYearEve", "IsTouristDestination"))
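
Plain R-squared never decreases when a predictor is added, so a rise in R-squared alone does not show that the added variable matters. Adjusted R-squared and an F-test between nested models are common checks. The sketch below uses simulated data. `toy.df`, its coefficients, and the deliberately irrelevant `Noise` column are hypothetical.

```r
# Hypothetical toy data with one deliberately irrelevant predictor (Noise)
set.seed(5)
n <- 250
toy.df <- data.frame(
  StarRating    = sample(c(2, 3, 4, 5), n, replace = TRUE),
  HotelCapacity = round(runif(n, 10, 300)),
  Noise         = rnorm(n)
)
toy.df$RoomRent <- -6000 + 3500 * toy.df$StarRating -
  14 * toy.df$HotelCapacity + rnorm(n, sd = 3000)

# Three nested models, each adding one predictor to the previous one
small  <- lm(RoomRent ~ StarRating, data = toy.df)
medium <- lm(RoomRent ~ StarRating + HotelCapacity, data = toy.df)
large  <- lm(RoomRent ~ StarRating + HotelCapacity + Noise, data = toy.df)

models <- list(small = small, medium = medium, large = large)
r2     <- sapply(models, function(m) summary(m)$r.squared)
adj.r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(rbind(r2, adj.r2), 4))

# F-tests for whether each added predictor reduces the residual sum of squares significantly
cmp <- anova(small, medium, large)
print(cmp)
```

In the `anova` table each row after the first tests the model against the one before it, so a large p-value on a row suggests that the newly added predictor contributes little.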

CONCLUSION FOR MANAGERIAL DECISION MAKING

We analysed a dataset of 13232 rows on Indian hotel pricing to test which factors affect the room rent of hotels. The parameters were visualized as bar charts, pie charts, scatterplots, and a corrgram. Correlation tests, t-tests, and chi-squared tests were run on the parameters. Finally, several regression models were compared, and the best-fitting one (model6, R-squared 0.181) was used to read off the coefficients and p-values.

1. StarRating, HasSwimmingPool, IsTouristDestination, and IsNewYearEve have large, statistically significant positive coefficients on RoomRent
2. Distance from Airport and HotelCapacity are statistically significant but have small per-unit coefficients (about 18.8 and -13.5 respectively)
3. FreeWifi and FreeBreakfast are not statistically significant in the final model (p = 0.168 and p = 0.593)
4. IsMetroCity and Population show only a weak negative correlation with RoomRent and are not part of the final model