Introduction
In this project, the pricing strategy of hotels in Indian cities is analyzed. The objective is to identify the factors that contribute most to the pricing of hotel rooms.
Data
The data collected for this project is taken from the website www.hotel.in. The dataset contains 13232 rows and 19 columns. It includes details such as distance from the airport, hotel description, star rating, booking date, and the price at which rooms were booked, for hotels in 42 Indian cities. We shall examine the types of data in more detail in our analysis. The parameters in the dataset are as follows:
1. CityName
2. Population
3. CityRank
4. IsMetroCity
5. IsTouristDestination
6. IsWeekend
7. IsNewYearEve
8. Date
9. HotelName
10. RoomRent
11. StarRating
12. Airport
13. HotelAddress
14. HotelPincode
15. HotelDescription
16. FreeWifi
17. FreeBreakfast
18. HotelCapacity
19. HasSwimmingPool
Description of Data
CityName: Indian city names like Delhi, Mumbai, Goa, Agra, etc.
Population: Population in that city
CityRank: Used in the dataset to uniquely identify each city
IsMetroCity: Indicates whether or not the city is a metro city (1 = Yes, 0 = No)
IsTouristDestination: Indicates whether or not the city is a tourist destination (1 = Yes, 0 = No)
IsWeekend: Indicates whether or not the booking is made on a weekend (1 = Yes, 0 = No)
IsNewYearEve: Indicates whether or not the booking is made on New Year's Eve (1 = Yes, 0 = No)
Date: The date on which that particular booking is made
HotelName: Name of the hotel in which the booking is made
RoomRent: The rent of room of the hotel in which the booking is made
StarRating: The star rating of the hotel in which the booking is made
Airport: The distance of the airport (in km) from the hotel in which the booking is made
HotelAddress: The address of the hotel in which the booking is made
HotelPincode: The pincode of the address of the hotel in which the booking is made
HotelDescription: Description of the hotel in which booking is made
FreeWifi: Indicates whether or not the hotel provides free wifi (1 = Yes, 0 = No)
FreeBreakfast: Indicates whether or not the hotel provides free breakfast (1 = Yes, 0 = No)
HotelCapacity: The capacity of the hotel in which the booking is made
HasSwimmingPool: Indicates whether or not the hotel has a swimming pool (1 = Yes, 0 = No)
Strategy for attaining the objective
From the given dataset, the parameters can be grouped into three categories: Time (time of booking), External factors (CityRank, IsMetroCity, IsTouristDestination), and Internal factors (StarRating, Airport, HotelAddress, HotelPincode, HotelDescription, FreeWifi, FreeBreakfast, HotelCapacity, HasSwimmingPool). We shall test the effect of each category (time, external factors, internal factors) on the dependent variable (RoomRent), one category at a time, by running regression analyses and various statistical tests. We shall then eliminate the parameters that contribute little to the dependent variable and, by trial and error, obtain an equation that uses a healthy mix of all three categories as independent variables to determine RoomRent.
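The grouping described above can be sketched in code. This is only an illustration, not the code used for the models that follow: the object names `groups` and `formulas` are assumptions, the time category is represented by IsWeekend and IsNewYearEve, and the free-text fields (HotelAddress, HotelPincode, HotelDescription) are left out because they are not directly usable as numeric regressors.

```r
# Candidate predictors per category (names taken from the dataset description)
groups <- list(
  time     = c("IsWeekend", "IsNewYearEve"),
  external = c("CityRank", "IsMetroCity", "IsTouristDestination"),
  internal = c("StarRating", "Airport", "FreeWifi", "FreeBreakfast",
               "HotelCapacity", "HasSwimmingPool")
)

# Build one regression formula per category with RoomRent as the response
formulas <- lapply(groups, reformulate, response = "RoomRent")

print(formulas$time)
print(formulas$external)
print(formulas$internal)
```

Each element of `formulas` could then be passed to `lm()` together with `data = data.df` once the data has been read in.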
Reading the Data
data.df <- read.csv("Cities42.csv", stringsAsFactors = TRUE)
library(psych)
View(data.df)
dim(data.df)
## [1] 13232 19
Comments: 13232 observations of 19 variables obtained
Analyzing the types of data in the dataset
str(data.df)
## 'data.frame': 13232 obs. of 19 variables:
## $ CityName : Factor w/ 42 levels "Agra","Ahmedabad",..: 26 26 26 26 26 26 26 26 26 26 ...
## $ Population : int 12442373 12442373 12442373 12442373 12442373 12442373 12442373 12442373 12442373 12442373 ...
## $ CityRank : int 0 0 0 0 0 0 0 0 0 0 ...
## $ IsMetroCity : int 1 1 1 1 1 1 1 1 1 1 ...
## $ IsTouristDestination: int 1 1 1 1 1 1 1 1 1 1 ...
## $ IsWeekend : int 1 0 1 1 0 1 0 1 1 0 ...
## $ IsNewYearEve : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Date : Factor w/ 20 levels "04-Jan-16","04-Jan-17",..: 11 12 13 14 15 16 17 18 11 12 ...
## $ HotelName : Factor w/ 1670 levels "14 Square Amanora",..: 1635 1635 1635 1635 1635 1635 1635 1635 1409 1409 ...
## $ RoomRent : int 12375 10250 9900 10350 12000 11475 11220 9225 6800 9350 ...
## $ StarRating : num 5 5 5 5 5 5 5 5 4 4 ...
## $ Airport : num 21 21 21 21 21 21 21 21 20 20 ...
## $ HotelAddress : Factor w/ 2108 levels " H.P. High Court Mall Road, Shimla",..: 925 928 930 933 935 937 940 941 699 746 ...
## $ HotelPincode : int 400005 400006 400007 400008 400009 400010 400011 400012 400039 400040 ...
## $ HotelDescription : Factor w/ 1226 levels "#NAME?","10 star hotel near Queensroad, Amritsar",..: 1030 1030 1030 1030 1030 1030 1030 1030 1006 1006 ...
## $ FreeWifi : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FreeBreakfast : int 0 0 0 0 0 0 0 0 1 1 ...
## $ HotelCapacity : int 287 287 287 287 287 287 287 287 28 28 ...
## $ HasSwimmingPool : int 1 1 1 1 1 1 1 1 0 0 ...
Describing the data in the dataset
describe(data.df) [, 1:5]
## vars n mean sd median
## CityName* 1 13232 18.07 11.72 16
## Population 2 13232 4416836.87 4258386.00 3046163
## CityRank 3 13232 14.83 13.51 9
## IsMetroCity 4 13232 0.28 0.45 0
## IsTouristDestination 5 13232 0.70 0.46 1
## IsWeekend 6 13232 0.62 0.48 1
## IsNewYearEve 7 13232 0.12 0.33 0
## Date* 8 13232 14.30 2.69 14
## HotelName* 9 13232 841.19 488.16 827
## RoomRent 10 13232 5473.99 7333.12 4000
## StarRating 11 13232 3.46 0.76 3
## Airport 12 13232 21.16 22.76 15
## HotelAddress* 13 13232 1202.53 582.17 1261
## HotelPincode 14 13232 397430.26 259837.50 395003
## HotelDescription* 15 13224 581.34 363.26 567
## FreeWifi 16 13232 0.93 0.26 1
## FreeBreakfast 17 13232 0.65 0.48 1
## HotelCapacity 18 13232 62.51 76.66 34
## HasSwimmingPool 19 13232 0.36 0.48 0
Distribution of city names in the dataset
cities <- table(data.df$CityName)
cities
##
## Agra Ahmedabad Amritsar Bangalore
## 432 424 136 656
## Bhubaneswar Chandigarh Chennai Darjeeling
## 120 336 416 136
## Delhi Gangtok Goa Guwahati
## 2048 128 624 48
## Haridwar Hyderabad Indore Jaipur
## 48 536 160 768
## Jaisalmer Jodhpur Kanpur Kochi
## 264 224 16 608
## Kolkata Lucknow Madurai Manali
## 512 128 112 288
## Mangalore Mumbai Munnar Mysore
## 104 712 328 160
## Nainital Ooty Panchkula Pune
## 144 136 64 600
## Puri Rajkot Rishikesh Shimla
## 56 128 88 280
## Srinagar Surat Thiruvanthipuram Thrissur
## 40 80 392 32
## Udaipur Varanasi
## 456 264
Distribution of CityRank in the dataset
CityRank <- table(data.df$CityRank)
CityRank
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 712 2048 656 416 536 424 512 80 600 768 32 128 16 136 160
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 432 448 624 128 264 40 224 336 392 48 160 120 272 104 456
## 32 33 34 35 36 37 38 39 40 42 43 44
## 48 56 280 64 136 88 128 136 264 144 328 288
Distribution of MetroCity in the dataset
barplot(table(data.df$IsMetroCity), main="Metro city or not in dataset", col= c("red","pink"))
Distribution of TouristCity in the dataset
pie(table(data.df$IsTouristDestination), main="Tourist destination or not in dataset", col=c("burlywood","peachpuff"))
Distribution of CityName in the dataset (as percentage of the total dataset)
mytable <- with(data.df, table(CityName))
prop.table(mytable)*100
## CityName
## Agra Ahmedabad Amritsar Bangalore
## 3.2648126 3.2043531 1.0278114 4.9576784
## Bhubaneswar Chandigarh Chennai Darjeeling
## 0.9068924 2.5392987 3.1438936 1.0278114
## Delhi Gangtok Goa Guwahati
## 15.4776300 0.9673519 4.7158404 0.3627570
## Haridwar Hyderabad Indore Jaipur
## 0.3627570 4.0507860 1.2091898 5.8041112
## Jaisalmer Jodhpur Kanpur Kochi
## 1.9951632 1.6928658 0.1209190 4.5949214
## Kolkata Lucknow Madurai Manali
## 3.8694075 0.9673519 0.8464329 2.1765417
## Mangalore Mumbai Munnar Mysore
## 0.7859734 5.3808948 2.4788392 1.2091898
## Nainital Ooty Panchkula Pune
## 1.0882709 1.0278114 0.4836759 4.5344619
## Puri Rajkot Rishikesh Shimla
## 0.4232164 0.9673519 0.6650544 2.1160822
## Srinagar Surat Thiruvanthipuram Thrissur
## 0.3022975 0.6045949 2.9625151 0.2418380
## Udaipur Varanasi
## 3.4461911 1.9951632
View(mytable)
Boxplot of the population of cities in the dataset
boxplot(data.df$Population, main="Population of cities in dataset", col="grey", horizontal = TRUE)
Percentage of Metrocity in the dataset
mytable1 <- with(data.df, table(IsMetroCity))
prop.table(mytable1)*100
## IsMetroCity
## 0 1
## 71.58404 28.41596
View(mytable1)
Comments: 28.42% of the data falls in the metro city category and the remaining 71.58% falls in the non-metro city category
mytable2 <- with(data.df, table(IsWeekend))
prop.table(mytable2)*100
## IsWeekend
## 0 1
## 37.71917 62.28083
View(mytable2)
Comments: 62.28% of the bookings in the data were made on a weekend
mytable3 <- with(data.df, table(IsNewYearEve))
prop.table(mytable3)*100
## IsNewYearEve
## 0 1
## 87.56046 12.43954
View(mytable3)
Comments: 12.44% of the bookings in the data were made on New Year's Eve
mytable4 <- with(data.df, table(StarRating))
prop.table(mytable4)*100
## StarRating
## 0 1 2 2.5 3 3.2
## 0.12091898 0.06045949 3.32527207 4.77629988 44.98941959 0.06045949
## 3.3 3.4 3.5 3.6 3.7 3.8
## 0.12091898 0.06045949 13.24062878 0.06045949 0.18137848 0.12091898
## 3.9 4 4.1 4.3 4.4 4.5
## 0.24183797 18.61396614 0.18137848 0.12091898 0.06045949 2.84159613
## 4.7 4.8 5
## 0.06045949 0.12091898 10.64087062
View(mytable4)
Barchart of Star Rating of the hotels in the dataset
barplot(table(data.df$StarRating), main = "Star rating of hotels in dataset", col= c("red","lightgreen","yellow", "blue", "orange"))
Comments: The frequency of hotels with star rating 3 is the highest
Percentage of hotels with free wifi
mytable5<-with(data.df,table(FreeWifi))
View(mytable5)
round(prop.table(mytable5)*100,2)
## FreeWifi
## 0 1
## 7.41 92.59
Comments: 92.59% of the hotels booked had free wifi
Percentage of hotels with free breakfast
mytable6<-with(data.df,table(FreeBreakfast))
View(mytable6)
round(prop.table(mytable6)*100,2)
## FreeBreakfast
## 0 1
## 35.09 64.91
Comments: 64.91% of the hotels booked had free breakfast
Percentage of hotels with swimming pool
mytable7<-with(data.df,table(HasSwimmingPool))
round(prop.table(mytable7)*100,2)
## HasSwimmingPool
## 0 1
## 64.42 35.58
Comments: 35.58% of the hotels booked had a swimming pool
Scatterplot of City Rank and Room Rent
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(data.df$CityRank, data.df$RoomRent, xlab="City Rank", ylab="Room Rent", main="Scatterplot of City Rank and Room Rent", spread = FALSE)
Correlation between Metro City and Room Rent
cor(data.df$IsMetroCity, data.df$RoomRent)
## [1] -0.06683977
Correlation between Tourist Destination and Room Rent
cor(data.df$IsTouristDestination, data.df$RoomRent)
## [1] 0.122503
Correlation between Weekend and Room Rent
cor(data.df$IsWeekend, data.df$RoomRent)
## [1] 0.004580134
Scatterplot of Star Rating and Room Rent
scatterplot(RoomRent~StarRating,data=data.df,spread=FALSE, smoother.args=list(lty=2),main="Scatter plot of Star Rating and Room rent",ylab="Room Rent", xlab="Star Rating")
Scatterplot of Airport distance and Room Rent
scatterplot(RoomRent~Airport,data=data.df,spread=FALSE, smoother.args=list(lty=2),main="Scatter plot of Airport(distance) and Room rent",ylab="Room Rent", xlab="Airport(distance)")
Scatterplot of Hotel Capacity and Room Rent
scatterplot(RoomRent~HotelCapacity,data=data.df,spread=FALSE, smoother.args=list(lty=2),main="Scatter plot of Hotel Capacity and Room rent",ylab="Room Rent", xlab="Hotel Capacity")
Corrgram of all variables
library(corrgram)
corrgram(data.df, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main = "Corrgram of all variables in the dataset")
Comments: The corrgram suggests a positive correlation between RoomRent and the following (not uniformly strong; for example, the correlation with IsTouristDestination computed above is only 0.12): 1. IsTouristDestination 2. CityRank 3. StarRating 4. Airport 5. HotelCapacity 6. HasSwimmingPool
It suggests a weak negative correlation between RoomRent and the following: 1. Population 2. IsMetroCity
Regression Models between Dependent Variable and Time category
Model1: Roomrent and Time
model1 <- lm(RoomRent ~ IsWeekend + IsNewYearEve, data=data.df)
summary(model1)
##
## Call:
## lm(formula = RoomRent ~ IsWeekend + IsNewYearEve, data = data.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5874 -3031 -1436 808 317180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5430.5 103.7 52.353 < 2e-16 ***
## IsWeekend -110.4 137.4 -0.803 0.422
## IsNewYearEve 902.6 201.8 4.472 7.82e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7328 on 13229 degrees of freedom
## Multiple R-squared: 0.00153, Adjusted R-squared: 0.001379
## F-statistic: 10.14 on 2 and 13229 DF, p-value: 3.987e-05
Correlation Test between Room Rent and IsWeekend
cor.test(data.df$RoomRent, data.df$IsWeekend)
##
## Pearson's product-moment correlation
##
## data: data.df$RoomRent and data.df$IsWeekend
## t = 0.52682, df = 13230, p-value = 0.5983
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01245978 0.02161739
## sample estimates:
## cor
## 0.004580134
Comments: The p-value (0.598) is > 0.05, so we cannot reject the null hypothesis of zero correlation; there is no evidence that IsWeekend contributes to RoomRent
Correlation Test between Room Rent and IsNewYearEve
cor.test(data.df$RoomRent, data.df$IsNewYearEve)
##
## Pearson's product-moment correlation
##
## data: data.df$RoomRent and data.df$IsNewYearEve
## t = 4.4306, df = 13230, p-value = 9.472e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02146637 0.05549377
## sample estimates:
## cor
## 0.03849123
Comments: The p-value (9.47e-06) is < 0.05, so the correlation between IsNewYearEve and RoomRent is statistically significant, although it is small (0.038)
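As a worked check on the two correlation tests above, the t statistic reported by `cor.test` can be recovered from the correlation r and the sample size n through t = r * sqrt((n - 2) / (1 - r^2)). The helper `t_from_r` below is a name introduced here for illustration; the r and n values are copied from the outputs above.

```r
# Recompute the t statistic and two-sided p-value of a Pearson correlation test
t_from_r <- function(r, n) {
  df <- n - 2
  t  <- r * sqrt(df / (1 - r^2))
  p  <- 2 * pt(-abs(t), df)
  c(t = t, df = df, p.value = p)
}

n <- 13232                              # rows in the dataset
weekend  <- t_from_r(0.004580134, n)    # cor(RoomRent, IsWeekend)
new_year <- t_from_r(0.03849123, n)     # cor(RoomRent, IsNewYearEve)

print(rbind(IsWeekend = weekend, IsNewYearEve = new_year))
```

With more than thirteen thousand observations, even a correlation of 0.038 is statistically significant, which is why significance and strength of association are worth reading separately.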
Regression Models between Dependent Variable and External Factors
Model2: Room Rent and External Factors
model2 <- lm( RoomRent ~ CityRank + IsMetroCity + IsTouristDestination + Airport, data=data.df)
summary(model2)
##
## Call:
## lm(formula = RoomRent ~ CityRank + IsMetroCity + IsTouristDestination +
## Airport, data = data.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6229 -2872 -1289 1052 315993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4305.5167 141.3452 30.461 < 2e-16 ***
## CityRank 2.8886 7.0826 0.408 0.683
## IsMetroCity -1419.4961 187.5451 -7.569 4.02e-14 ***
## IsTouristDestination 2169.3886 157.6919 13.757 < 2e-16 ***
## Airport 0.7822 3.2301 0.242 0.809
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7249 on 13227 degrees of freedom
## Multiple R-squared: 0.02311, Adjusted R-squared: 0.02281
## F-statistic: 78.22 on 4 and 13227 DF, p-value: < 2.2e-16
Comments: The R-squared value is low (0.023), so the model fits the data poorly.
library(coefplot)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
coefplot(model2, predictors=c("IsMetroCity", "Airport", "IsTouristDestination"))
Comments: 1. IsTouristDestination affects RoomRent the most 2. IsMetroCity has a significant negative effect on RoomRent 3. Airport affects RoomRent the least (its coefficient is not significant in Model2)
T-test between Room Rent and City Rank
t.test(data.df$RoomRent, data.df$CityRank)
##
## Welch Two Sample t-test
##
## data: data.df$RoomRent and data.df$CityRank
## t = 85.635, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5334.200 5584.116
## sample estimates:
## mean of x mean of y
## 5473.99184 14.83374
T-test between Room Rent and Metrocity
t.test(data.df$RoomRent, data.df$IsMetroCity)
##
## Welch Two Sample t-test
##
## data: data.df$RoomRent and data.df$IsMetroCity
## t = 85.863, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5348.750 5598.666
## sample estimates:
## mean of x mean of y
## 5473.9918380 0.2841596
T-test between Room Rent and Tourist Destination
t.test(data.df$RoomRent, data.df$IsTouristDestination)
##
## Welch Two Sample t-test
##
## data: data.df$RoomRent and data.df$IsTouristDestination
## t = 85.856, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5348.337 5598.253
## sample estimates:
## mean of x mean of y
## 5473.9918380 0.6971735
T-test between Room rent and Airport distance
t.test(data.df$RoomRent, data.df$Airport)
##
## Welch Two Sample t-test
##
## data: data.df$RoomRent and data.df$Airport
## t = 85.535, df = 13231, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5327.875 5577.792
## sample estimates:
## mean of x mean of y
## 5473.99184 21.15874
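The t-tests above pass two different variables as the two samples, so each one compares the mean of RoomRent (about 5474) with the mean of the other variable (for example 0.28 for IsMetroCity). If the question is instead whether room rent differs between two groups of bookings, one possible formulation is the grouped form `t.test(RoomRent ~ IsMetroCity)`. The sketch below uses synthetic data: the data frame `toy` and all of its values are invented for illustration and are not taken from Cities42.csv.

```r
set.seed(1)

# Synthetic example: 200 non-metro and 200 metro bookings (invented values)
is_metro <- rep(c(0, 1), each = 200)
rent <- rnorm(400, mean = ifelse(is_metro == 1, 5000, 6000), sd = 1500)
toy <- data.frame(RoomRent = rent, IsMetroCity = factor(is_metro))

# Welch two-sample t-test of RoomRent between the two IsMetroCity groups
res <- t.test(RoomRent ~ IsMetroCity, data = toy)

print(res$estimate)   # mean rent in each group
print(res$p.value)    # p-value for the difference in group means
```

On the real data the same call would use `data = data.df`, and the estimates would be the mean room rent for non-metro and metro cities.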
Regression Model between Dependent Variable and Internal Factors
Model3: Room Rent and Internal Factors
model3 <- lm( RoomRent ~ StarRating + FreeWifi + FreeBreakfast + HotelCapacity + HasSwimmingPool, data=data.df)
summary(model3)
##
## Call:
## lm(formula = RoomRent ~ StarRating + FreeWifi + FreeBreakfast +
## HotelCapacity + HasSwimmingPool, data = data.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10795 -2291 -961 1009 310092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6875.101 393.223 -17.484 <2e-16 ***
## StarRating 3598.541 111.867 32.168 <2e-16 ***
## FreeWifi -6.021 225.738 -0.027 0.979
## FreeBreakfast -27.775 124.371 -0.223 0.823
## HotelCapacity -15.576 1.009 -15.436 <2e-16 ***
## HasSwimmingPool 2527.393 158.112 15.985 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6711 on 13226 degrees of freedom
## Multiple R-squared: 0.1628, Adjusted R-squared: 0.1625
## F-statistic: 514.4 on 5 and 13226 DF, p-value: < 2.2e-16
Comments: The R-squared value (0.163) is much higher than for Model1 and Model2, so this model fits better, but it is still low and should be improved
Chisquared test between Room Rent and Star Rating
chisq.test(table(data.df$RoomRent, data.df$StarRating))
## Warning in chisq.test(table(data.df$RoomRent, data.df$StarRating)): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(data.df$RoomRent, data.df$StarRating)
## X-squared = 132390, df = 43100, p-value < 2.2e-16
Chisquared test between Room Rent and Free Wifi
chisq.test(table(data.df$RoomRent, data.df$FreeWifi))
## Warning in chisq.test(table(data.df$RoomRent, data.df$FreeWifi)): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(data.df$RoomRent, data.df$FreeWifi)
## X-squared = 5914.3, df = 2155, p-value < 2.2e-16
Chisquared test between Room Rent and Free Breakfast
chisq.test(table(data.df$RoomRent, data.df$FreeBreakfast))
## Warning in chisq.test(table(data.df$RoomRent, data.df$FreeBreakfast)): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(data.df$RoomRent, data.df$FreeBreakfast)
## X-squared = 6492.4, df = 2155, p-value < 2.2e-16
Chisquared test between Room Rent and Hotel Capacity
chisq.test(table(data.df$RoomRent, data.df$HotelCapacity))
## Warning in chisq.test(table(data.df$RoomRent, data.df$HotelCapacity)): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(data.df$RoomRent, data.df$HotelCapacity)
## X-squared = 1479800, df = 545220, p-value < 2.2e-16
Chisquared test between Room Rent and Swimming Pool availability
chisq.test(table(data.df$RoomRent, data.df$HasSwimmingPool))
## Warning in chisq.test(table(data.df$RoomRent, data.df$HasSwimmingPool)):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(data.df$RoomRent, data.df$HasSwimmingPool)
## X-squared = 8394.6, df = 2155, p-value < 2.2e-16
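The warnings above are expected: the 2155 degrees of freedom in the two-column tables imply 2156 distinct RoomRent values, so most cells of each contingency table have very small expected counts. One common workaround, sketched here on synthetic data, is to bin the rent into a few bands before tabulating. The variables `has_pool`, `rent` and `rent_band` and all their values are invented for illustration; the choice of quartile bands is an assumption, not something the analysis above uses.

```r
set.seed(42)

# Synthetic data: a binary amenity flag and a right-skewed rent (invented values)
n <- 500
has_pool <- rbinom(n, 1, 0.35)
rent <- round(rlnorm(n, meanlog = 8 + 0.5 * has_pool, sdlog = 0.6))

# Bin rent into quartile bands so every cell has a reasonable expected count
breaks <- quantile(rent, probs = seq(0, 1, 0.25))
rent_band <- cut(rent, breaks = breaks, include.lowest = TRUE,
                 labels = c("Q1", "Q2", "Q3", "Q4"))

tab <- table(rent_band, has_pool)
res <- chisq.test(tab)

print(tab)
print(res)
```

With four bands and a binary factor the test has (4 - 1) * (2 - 1) = 3 degrees of freedom, which is far easier to interpret than a table with thousands of rows.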
Regression Models combining Time, External and Internal Factors
model4 <- lm( RoomRent ~ IsNewYearEve + IsTouristDestination + StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool, data=data.df)
summary(model4)
##
## Call:
## lm(formula = RoomRent ~ IsNewYearEve + IsTouristDestination +
## StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool,
## data = data.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9884 -2371 -805 931 310960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7134.89 393.91 -18.113 < 2e-16 ***
## IsNewYearEve 843.98 176.44 4.783 1.74e-06 ***
## IsTouristDestination 2092.87 127.75 16.383 < 2e-16 ***
## StarRating 2904.81 98.39 29.524 < 2e-16 ***
## FreeWifi 188.78 225.55 0.837 0.4026
## FreeBreakfast 242.58 124.01 1.956 0.0505 .
## HasSwimmingPool 1869.03 155.56 12.015 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6698 on 13225 degrees of freedom
## Multiple R-squared: 0.1661, Adjusted R-squared: 0.1657
## F-statistic: 439 on 6 and 13225 DF, p-value: < 2.2e-16
Comments: The R-squared value (0.166) is still low, so let us add more independent variables to the regression model to improve it
model5 <- lm( RoomRent ~ IsNewYearEve + IsTouristDestination + StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool + HotelCapacity, data=data.df)
summary(model5)
##
## Call:
## lm(formula = RoomRent ~ IsNewYearEve + IsTouristDestination +
## StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool +
## HotelCapacity, data = data.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11503 -2380 -737 1083 309773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8654.938 406.373 -21.298 < 2e-16 ***
## IsNewYearEve 842.571 175.188 4.810 1.53e-06 ***
## IsTouristDestination 1892.334 127.674 14.822 < 2e-16 ***
## StarRating 3627.805 110.882 32.718 < 2e-16 ***
## FreeWifi 153.553 223.974 0.686 0.493
## FreeBreakfast 101.377 123.556 0.820 0.412
## HasSwimmingPool 2292.940 157.493 14.559 < 2e-16 ***
## HotelCapacity -13.874 1.007 -13.784 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6651 on 13224 degrees of freedom
## Multiple R-squared: 0.1779, Adjusted R-squared: 0.1775
## F-statistic: 408.8 on 7 and 13224 DF, p-value: < 2.2e-16
Comments: The R-squared value has improved (0.178), so this model fits better. Let us add more independent variables to the regression model to improve it further
model6 <- lm(RoomRent ~ IsNewYearEve + Airport + IsTouristDestination + StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool + HotelCapacity, data=data.df)
summary(model6)
##
## Call:
## lm(formula = RoomRent ~ IsNewYearEve + Airport + IsTouristDestination +
## StarRating + FreeWifi + FreeBreakfast + HasSwimmingPool +
## HotelCapacity, data = data.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11265 -2324 -740 1075 310031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8917.869 407.290 -21.896 < 2e-16 ***
## IsNewYearEve 841.203 174.860 4.811 1.52e-06 ***
## Airport 18.778 2.637 7.120 1.14e-12 ***
## IsTouristDestination 1709.789 129.989 13.153 < 2e-16 ***
## StarRating 3566.808 111.006 32.132 < 2e-16 ***
## FreeWifi 309.478 224.624 1.378 0.168
## FreeBreakfast 65.956 123.425 0.534 0.593
## HasSwimmingPool 2452.427 158.785 15.445 < 2e-16 ***
## HotelCapacity -13.460 1.006 -13.375 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6638 on 13223 degrees of freedom
## Multiple R-squared: 0.181, Adjusted R-squared: 0.1805
## F-statistic: 365.4 on 8 and 13223 DF, p-value: < 2.2e-16
Comments: The R-squared value has improved further (0.181), so this model fits best among the models tested
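Because each of these models adds predictors, it helps to compare the adjusted R-squared, which penalizes extra terms: adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The sketch below recomputes it from the rounded R-squared values printed above; the function name `adj_r2` and the data frame `models` are introduced here for illustration.

```r
# Adjusted R-squared from R-squared, sample size n and predictor count p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 13232   # observations in the dataset

# R-squared values and predictor counts copied from the summaries of Models 4 to 6
models <- data.frame(
  model = c("model4", "model5", "model6"),
  r2    = c(0.1661, 0.1779, 0.1810),
  p     = c(6, 7, 8)
)
models$adj_r2 <- adj_r2(models$r2, n, models$p)

print(models)
```

The recomputed values agree with the adjusted R-squared figures in the summaries to about four decimal places, and they rise with each added predictor, so the extra variables are not merely inflating R-squared.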
coefplot(model6, predictors=c("StarRating", "Airport", "FreeWifi", "FreeBreakfast", "HotelCapacity", "HasSwimmingPool", "IsNewYearEve", "IsTouristDestination"))
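To see how the coefficients of Model 6 combine, the sketch below plugs them into the fitted equation for one hypothetical hotel. The coefficient values are copied from the Model 6 summary; the hotel's attributes (5 stars, 20 km from the airport, capacity 100, and so on) are invented for illustration.

```r
# Coefficients copied from summary(model6)
coefs <- c(
  "(Intercept)"        = -8917.869,
  IsNewYearEve         = 841.203,
  Airport              = 18.778,
  IsTouristDestination = 1709.789,
  StarRating           = 3566.808,
  FreeWifi             = 309.478,
  FreeBreakfast        = 65.956,
  HasSwimmingPool      = 2452.427,
  HotelCapacity        = -13.460
)

# Hypothetical hotel (invented values); the intercept term is always 1
hotel <- c(
  "(Intercept)" = 1, IsNewYearEve = 0, Airport = 20,
  IsTouristDestination = 1, StarRating = 5, FreeWifi = 1,
  FreeBreakfast = 1, HasSwimmingPool = 1, HotelCapacity = 100
)

# Predicted rent is the sum of coefficient * value over all terms
predicted_rent <- sum(coefs * hotel[names(coefs)])

# The same hotel on New Year's Eve: add the IsNewYearEve coefficient
nye_rent <- predicted_rent + coefs[["IsNewYearEve"]]

print(predicted_rent)   # approximately 12483
print(nye_rent)         # approximately 13325
```

With a fitted model object available, `predict(model6, newdata = ...)` would give the same number without copying coefficients by hand.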
CONCLUSION FOR MANAGERIAL DECISION MAKING
We analysed a dataset of 13232 rows on Indian hotel pricing to test which factors affect the room rent of hotels. The parameters were visualized as bar charts, pie charts, scatterplots, and a corrgram. Correlation tests, t-tests, and chi-squared tests were run on the parameters. Finally, several regression models were compared; the best-fitting of these, Model 6, has an R-squared of 0.181. Based on its coefficients and p-values:
1. StarRating, HasSwimmingPool, IsTouristDestination, and IsNewYearEve have a significant positive effect on RoomRent
2. Distance from the airport and HotelCapacity are also statistically significant, but their per-unit effects are small (about +18.8 per km and -13.5 per unit of capacity)
3. FreeWifi and FreeBreakfast are not significant in Model 6 (p = 0.168 and p = 0.593)
4. IsMetroCity and Population are not part of Model 6; both show only a weak negative correlation with RoomRent, although the IsMetroCity coefficient is negative and significant in Model 2