Analysis of Hotel Prices
NAME: Monesh Kumar Sharma
EMAIL: monesh.sharma23992@gmail.com
COLLEGE / COMPANY: Welingkar Institute of Management, Mumbai

Problem Definition

Use the hotel prices data set (Cities42.csv).
1. Analyse hotel price data for hotels in 42 cities (summaries, plots, etc.).
2. Visualize the data.
3. Run some t-tests and chi-square tests on the data.
4. Examine the correlation between the dependent and independent variables.
5. Find out which columns / features impact the price of a hotel room.
6. Predict hotel prices using some dummy values.

Data Location

The data was collected from www.hotels.in in October 2016.

Data Description

Size: 2523 KB, 13232 observations of 19 variables.

Attributes:
Notice that the dataset tracks hotel prices on 8 different dates at different hotels across different cities. Please browse the dataset.
Dependent Variable
RoomRent <- Rent for the cheapest room, double occupancy, in Indian Rupees.

Independent Variables
External Factors
Date <- We have hotel room rent data for the following 8 dates for each hotel: {Dec 31, Dec 25, Dec 24, Dec 18, Dec 21, Dec 28, Jan 4, Jan 8}
IsWeekend <- We use ‘0’ to indicate week days, ‘1’ to indicate weekend dates (Sat / Sun)
IsNewYearEve <- ‘1’ for Dec 31, ‘0’ otherwise
CityName <- Name of the City where the Hotel is located e.g. Mumbai
Population <- Population of the City in 2011
CityRank <- Rank order of City by Population (e.g. Mumbai = 0, Delhi = 1, so on)
IsMetroCity <- ‘1’ if CityName is {Mumbai, Delhi, Kolkatta, Chennai}, ‘0’ otherwise
IsTouristDestination <- We use ‘1’ if the city is primarily a tourist destination, ‘0’ otherwise.
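
The two date flags described above can be derived from the date itself. The following is a minimal sketch, not the code used to build the dataset: the data frame `flags` is a hypothetical name, and the year 2017 for the two January dates is an assumption (the sample rows only show December 2016).

```r
# The 8 dates from the data description (January year is assumed to be 2017)
dates <- as.Date(c("2016-12-18", "2016-12-21", "2016-12-24", "2016-12-25",
                   "2016-12-28", "2016-12-31", "2017-01-04", "2017-01-08"))

# as.POSIXlt()$wday does not depend on the locale: 0 = Sunday, 6 = Saturday
wday <- as.POSIXlt(dates)$wday

flags <- data.frame(
  Date         = dates,
  IsWeekend    = as.integer(wday %in% c(0, 6)),
  IsNewYearEve = as.integer(format(dates, "%m-%d") == "12-31")
)
flags
```

Because Dec 31 falls on a weekend here, IsNewYearEve and IsWeekend are not independent of each other, which is worth keeping in mind when both enter a model.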

Internal Factors
Many hotel features can influence the RoomRent. The dataset captures some of these internal factors, as explained below.

HotelName <- e.g. Park Hyatt Goa Resort and Spa
StarRating <- e.g. 5
Airport <- Distance between Hotel and closest major Airport
HotelAddress <- e.g. Arrossim Beach, Cansaulim, Goa
HotelPincode <- e.g. 403712
HotelDescription <- e.g. 5-star beachfront resort with spa, near Arossim Beach
FreeWifi <- ‘1’ if the hotel offers Free Wifi, ‘0’ otherwise
FreeBreakfast <- ‘1’ if the hotel offers Free Breakfast, ‘0’ otherwise
HotelCapacity <- e.g. 242. (enter ‘0’ if not available)
HasSwimmingPool <- ‘1’ if they have a swimming pool, ‘0’ otherwise

Setup

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(vcd)
## Loading required package: grid
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
library(corrplot)
library(coefplot)

Functions

# Count the values lying outside the 1.5 * IQR fences
detect_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 1.5 * IQR(inp, na.rm=na.rm)
  # compare directly so that pre-existing NAs are not counted as outliers
  sum(inp < (i.qnt[1] - i.max) | inp > (i.qnt[2] + i.max), na.rm=TRUE)
}

# Return x with values outside the 1.5 * IQR fences replaced by NA
Non_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Wrapper kept for compatibility: gives the same result as Non_outliers()
Remove_Outliers <- function(z, na.rm = TRUE) {
  Non_outliers(z, na.rm = na.rm)
}

# Box plot of one numeric vector; the vector is wrapped in its own data frame
# so the function does not depend on the global dfrModel
Graph_Boxplot <- function(input, na.rm = TRUE) {
  ggplot(data.frame(value = input), aes(x = "", y = value)) +
    geom_boxplot(fill = "lightgreen", color = "green", na.rm = na.rm) +
    labs(title = "Outliers", x = NULL)
}
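
The outlier helpers above all rely on the same 1.5 * IQR rule. As a small illustration on a made-up vector (the values in `x` are hypothetical, not taken from the hotel data), the rule can be sketched as:

```r
# Toy vector with one obviously extreme value
x <- c(10, 12, 11, 13, 12, 11, 100)

# First and third quartiles and the 1.5 * IQR whisker length
q <- quantile(x, probs = c(.25, .75), names = FALSE)
h <- 1.5 * IQR(x)

# A value is flagged when it lies below Q1 - h or above Q3 + h
is_outlier <- x < (q[1] - h) | x > (q[2] + h)

n_outliers <- sum(is_outlier)            # how many values are flagged
x_clean    <- replace(x, is_outlier, NA) # same vector with outliers set to NA
```

Here only the value 100 falls outside the fences (8.75 to 14.75), so one value is flagged and replaced by NA.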

Dataset

setwd("D:/Welingkar/My/IL/Project/Hotel Industry/Data")
dfrModel <- read.csv("./Cities42.csv", header=T, stringsAsFactors=F)
intRowCount <- nrow(dfrModel)
head(dfrModel)
##   CityName Population CityRank IsMetroCity IsTouristDestination IsWeekend
## 1   Mumbai   12442373        0           1                    1         1
## 2   Mumbai   12442373        0           1                    1         0
## 3   Mumbai   12442373        0           1                    1         1
## 4   Mumbai   12442373        0           1                    1         1
## 5   Mumbai   12442373        0           1                    1         0
## 6   Mumbai   12442373        0           1                    1         1
##   IsNewYearEve        Date      HotelName RoomRent StarRating Airport
## 1            0 Dec 18 2016 Vivanta by Taj    12375          5      21
## 2            0 Dec 21 2016 Vivanta by Taj    10250          5      21
## 3            0 Dec 24 2016 Vivanta by Taj     9900          5      21
## 4            0 Dec 25 2016 Vivanta by Taj    10350          5      21
## 5            0 Dec 28 2016 Vivanta by Taj    12000          5      21
## 6            1 Dec 31 2016 Vivanta by Taj    11475          5      21
##                                   HotelAddress HotelPincode
## 1 90 Cuffe Parade, Colaba, Mumbai, Maharashtra       400005
## 2 91 Cuffe Parade, Colaba, Mumbai, Maharashtra       400006
## 3 92 Cuffe Parade, Colaba, Mumbai, Maharashtra       400007
## 4 93 Cuffe Parade, Colaba, Mumbai, Maharashtra       400008
## 5 94 Cuffe Parade, Colaba, Mumbai, Maharashtra       400009
## 6 95 Cuffe Parade, Colaba, Mumbai, Maharashtra       400010
##                               HotelDescription FreeWifi FreeBreakfast
## 1 Luxury hotel with spa, near Gateway of India        1             0
## 2 Luxury hotel with spa, near Gateway of India        1             0
## 3 Luxury hotel with spa, near Gateway of India        1             0
## 4 Luxury hotel with spa, near Gateway of India        1             0
## 5 Luxury hotel with spa, near Gateway of India        1             0
## 6 Luxury hotel with spa, near Gateway of India        1             0
##   HotelCapacity HasSwimmingPool
## 1           287               1
## 2           287               1
## 3           287               1
## 4           287               1
## 5           287               1
## 6           287               1

Observation 1. There are 13232 data records in the file.
The dataset also contains non-numeric columns (names, dates, addresses and descriptions) and the HotelPincode identifier, so we remove these before the numeric analysis.

Data_Cleaning

dfrModel <- select(dfrModel, -c(CityName, Date, HotelName, HotelAddress, HotelDescription, HotelPincode ))
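
As an alternative to listing every column by name, the numeric columns can be picked programmatically. This is only a sketch on a hypothetical two-row data frame (`toy` and its Delhi values are made up for illustration). Note that a pincode is stored as a number, so it still has to be dropped by name.

```r
# Hypothetical miniature of the hotel data
toy <- data.frame(
  CityName     = c("Mumbai", "Delhi"),
  HotelPincode = c(400005, 110001),
  RoomRent     = c(12375, 8000),
  StarRating   = c(5, 4),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns ...
num_only <- toy[sapply(toy, is.numeric)]

# ... then drop identifier-like columns that are numeric but not meaningful
num_only <- num_only[setdiff(names(num_only), "HotelPincode")]
names(num_only)
```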

Summary

#describe(dfrModel$CityName)
describe(dfrModel$Population)[,c(2,3,4,5,8,9)]
##        n    mean      sd  median  min      max
## X1 13232 4416837 4258386 3046163 8096 12442373
#describe(dfrModel$CityRank)[,c(2,3,4,5,8,9)]
#describe(dfrModel$IsMetroCity)[,c(2,3,4,5,8,9)]
#describe(dfrModel$IsTouristDestination)[,c(2,3,4,5,8,9)]
#describe(dfrModel$IsWeekend)[,c(2,3,4,5,8,9)]
#describe(dfrModel$IsNewYearEve)[,c(2,3,4,5,8,9)]
#describe(dfrModel$Date)[,c(2,3,4,5,8,9)]
#describe(dfrModel$HotelName)[,c(2,3,4,5,8,9)]
describe(dfrModel$RoomRent)[,c(2,3,4,5,8,9)]
##        n    mean      sd median min    max
## X1 13232 5473.99 7333.12   4000 299 322500
describe(dfrModel$StarRating)[,c(2,3,4,5,8,9)]
##        n mean   sd median min max
## X1 13232 3.46 0.76      3   0   5
describe(dfrModel$Airport)[,c(2,3,4,5,8,9)]
##        n  mean    sd median min max
## X1 13232 21.16 22.76     15 0.2 124
#describe(dfrModel$HotelAddress)[,c(2,3,4,5,8,9)]
#describe(dfrModel$HotelPincode)[,c(2,3,4,5,8,9)]
#describe(dfrModel$HotelDescription)[,c(2,3,4,5,8,9)]
#describe(dfrModel$FreeWifi)[,c(2,3,4,5,8,9)]
#describe(dfrModel$FreeBreakfast)[,c(2,3,4,5,8,9)]
describe(dfrModel$HotelCapacity)[,c(2,3,4,5,8,9)]
##        n  mean    sd median min max
## X1 13232 62.51 76.66     34   0 600
#describe(dfrModel$HasSwimmingPool)[,c(2,3,4,5,8,9)]

Observations
The dependent variable is
Y = Hotel Rent (RoomRent)

The independent variables explored in the plots and tables below are
X1 = Star Rating
X2 = IsTouristDestination
X3 = Airport Distance
X4 = Hotel Capacity

Box Plot

#Graph_Boxplot(dfrModel$CityName)
#Graph_Boxplot(dfrModel$Population)
#Graph_Boxplot(dfrModel$CityRank)
#Graph_Boxplot(dfrModel$IsMetroCity)
#Graph_Boxplot(dfrModel$IsTouristDestination)
#Graph_Boxplot(dfrModel$IsWeekend)
#Graph_Boxplot(dfrModel$IsNewYearEve)
#Graph_Boxplot(dfrModel$Date)
#Graph_Boxplot(dfrModel$HotelName)
#Graph_Boxplot(dfrModel$RoomRent)
Graph_Boxplot(dfrModel$StarRating)

Graph_Boxplot(dfrModel$Airport)

#Graph_Boxplot(dfrModel$HotelAddress)
#Graph_Boxplot(dfrModel$HotelPincode)
#Graph_Boxplot(dfrModel$HotelDescription)
#Graph_Boxplot(dfrModel$FreeWifi)
#Graph_Boxplot(dfrModel$FreeBreakfast)
Graph_Boxplot(dfrModel$HotelCapacity)

#Graph_Boxplot(dfrModel$HasSwimmingPool)

Observation
There are a few outliers in the dataset.

Tables

TouristDestination <- table(dfrModel$IsTouristDestination)
TouristDestination
## 
##    0    1 
## 4007 9225
prop.table(TouristDestination)
## 
##         0         1 
## 0.3028265 0.6971735

Observations
Here
1 implies a tourist destination
0 implies not a tourist destination
About 70% of the records (9225 of 13232) are for hotels in tourist destinations.

Scatter Plot

plot(y=dfrModel$RoomRent, x=dfrModel$Airport,
     col="green",
     ylim=c(0, 350000), xlim=c(0, 150), 
     main="Relationship Btw Room Rent and Airport Distance",
     ylab="Hotel Rent", xlab="Airport Distance")

scatterplot(dfrModel$Airport, dfrModel$RoomRent , main="Relationship Btw Room Rent and Airport Distance", xlab="Airport Distance", ylab="Hotel Rent")

# Jitter the binary x variable (not the rent) so that overlapping points spread out
plot(jitter(dfrModel$IsTouristDestination), dfrModel$RoomRent,
     col="green",
     ylim=c(0, 350000), xlim=c(-0.5, 1.5), 
     main="Relationship Btw Room Rent and Tourist Destination",
     ylab="Hotel Rent", xlab="Tourist Destination")

plot(y=dfrModel$RoomRent, x=dfrModel$StarRating,
     col="blue",
     ylim=c(0, 350000), xlim=c(0, 10), 
     main="Relationship Btw Room Rent and Star Rating of Hotel",
     ylab="Hotel Rent", xlab="Star Rating")

# HotelCapacity ranges up to 600, so the x axis must cover the full range
plot(y=dfrModel$RoomRent, x=dfrModel$HotelCapacity,
     col="green",
     ylim=c(0, 350000), xlim=c(0, 600), 
     main="Relationship Btw Room Rent and Hotel Capacity",
     ylab="Hotel Rent", xlab="Hotel Capacity")

scatterplot(dfrModel$HotelCapacity, dfrModel$RoomRent , main="Relationship Btw Room Rent and Hotel Capacity", xlab="Hotel Capacity", ylab="Hotel Rent")

Observations
1. The scatter plots above suggest some relationship between Hotel Rent and the independent variables, although a handful of very high rents dominate the vertical scale.

Correlation Plot

#pairs(dfrModel)
corrplot(corr=cor(dfrModel[ , c(4,7,8,9,12)], use="complete.obs"), 
         method ="ellipse")

Correlation Matrix

cor(dfrModel[, c(1:13)]) 
##                         Population      CityRank   IsMetroCity
## Population            1.0000000000 -0.8353204432  0.7712260105
## CityRank             -0.8353204432  1.0000000000 -0.5643937903
## IsMetroCity           0.7712260105 -0.5643937903  1.0000000000
## IsTouristDestination -0.0482029722  0.2807134520  0.1763717063
## IsWeekend             0.0115926802 -0.0072564766  0.0018118005
## IsNewYearEve          0.0007332482 -0.0006326444  0.0006464753
## RoomRent             -0.0887280632  0.0939855292 -0.0668397705
## StarRating            0.1341365933 -0.1333810133  0.0776028661
## Airport              -0.2597010198  0.5059119892 -0.2073586125
## FreeWifi              0.1129334410 -0.1214309404  0.0868288677
## FreeBreakfast         0.0364278235 -0.0086837497  0.0513856623
## HotelCapacity         0.2599830516 -0.2561197059  0.1871502153
## HasSwimmingPool       0.0262590820 -0.1029737518  0.0214119243
##                      IsTouristDestination    IsWeekend  IsNewYearEve
## Population                   -0.048202972  0.011592680  7.332482e-04
## CityRank                      0.280713452 -0.007256477 -6.326444e-04
## IsMetroCity                   0.176371706  0.001811801  6.464753e-04
## IsTouristDestination          1.000000000 -0.019481101 -2.266388e-03
## IsWeekend                    -0.019481101  1.000000000  2.923821e-01
## IsNewYearEve                 -0.002266388  0.292382051  1.000000e+00
## RoomRent                      0.122502963  0.004580134  3.849123e-02
## StarRating                   -0.040554998  0.006378436  2.360897e-03
## Airport                       0.194422049 -0.002724756  4.598872e-04
## FreeWifi                     -0.061568821  0.002960828  2.787472e-05
## FreeBreakfast                -0.071692559 -0.007612777 -2.606416e-03
## HotelCapacity                -0.094356091  0.006306507  1.352679e-03
## HasSwimmingPool               0.042156280  0.004500461  1.122308e-03
##                          RoomRent   StarRating       Airport      FreeWifi
## Population           -0.088728063  0.134136593 -0.2597010198  1.129334e-01
## CityRank              0.093985529 -0.133381013  0.5059119892 -1.214309e-01
## IsMetroCity          -0.066839771  0.077602866 -0.2073586125  8.682887e-02
## IsTouristDestination  0.122502963 -0.040554998  0.1944220492 -6.156882e-02
## IsWeekend             0.004580134  0.006378436 -0.0027247555  2.960828e-03
## IsNewYearEve          0.038491227  0.002360897  0.0004598872  2.787472e-05
## RoomRent              1.000000000  0.369373425  0.0496532442  3.627002e-03
## StarRating            0.369373425  1.000000000 -0.0609191837  1.800959e-02
## Airport               0.049653244 -0.060919184  1.0000000000 -9.452368e-02
## FreeWifi              0.003627002  0.018009594 -0.0945236768  1.000000e+00
## FreeBreakfast        -0.010006370 -0.032892463  0.0242839409  1.582206e-01
## HotelCapacity         0.157873308  0.637430337 -0.1176720722 -8.703612e-03
## HasSwimmingPool       0.311657734  0.618214699 -0.1416665606 -2.407405e-02
##                      FreeBreakfast HotelCapacity HasSwimmingPool
## Population             0.036427824   0.259983052     0.026259082
## CityRank              -0.008683750  -0.256119706    -0.102973752
## IsMetroCity            0.051385662   0.187150215     0.021411924
## IsTouristDestination  -0.071692559  -0.094356091     0.042156280
## IsWeekend             -0.007612777   0.006306507     0.004500461
## IsNewYearEve          -0.002606416   0.001352679     0.001122308
## RoomRent              -0.010006370   0.157873308     0.311657734
## StarRating            -0.032892463   0.637430337     0.618214699
## Airport                0.024283941  -0.117672072    -0.141666561
## FreeWifi               0.158220597  -0.008703612    -0.024074046
## FreeBreakfast          1.000000000  -0.087165446    -0.061522132
## HotelCapacity         -0.087165446   1.000000000     0.509045809
## HasSwimmingPool       -0.061522132   0.509045809     1.000000000

Correlation with Room Rent
Correlation

vctCorr = numeric(0)
for (i in names(dfrModel)){
cor.result <- cor(dfrModel$RoomRent, as.numeric(dfrModel[,i]))
vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
##           Population             CityRank          IsMetroCity 
##         -0.088728063          0.093985529         -0.066839771 
## IsTouristDestination            IsWeekend         IsNewYearEve 
##          0.122502963          0.004580134          0.038491227 
##             RoomRent           StarRating              Airport 
##          1.000000000          0.369373425          0.049653244 
##             FreeWifi        FreeBreakfast        HotelCapacity 
##          0.003627002         -0.010006370          0.157873308 
##      HasSwimmingPool 
##          0.311657734
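
The loop above can be written more compactly with `sapply()`. The sketch below uses R's built-in `mtcars` data as a stand-in for the hotel data (`mpg` plays the role of RoomRent); the variable names `df`, `cors` and `cors_sorted` are illustrative only.

```r
# Stand-in data: mpg is the "target", the rest are candidate predictors
df <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Correlation of every column with the target, returned as a named vector
cors <- sapply(df, function(col) cor(df$mpg, col))

# Sort by absolute strength so the most related variables come first
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
cors_sorted
```

Sorting by absolute value makes it easier to spot the strongest relationships regardless of sign.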

Visualize

dfrGraph <- gather(dfrModel, variable, value, -RoomRent)
head(dfrGraph)
##   RoomRent   variable    value
## 1    12375 Population 12442373
## 2    10250 Population 12442373
## 3     9900 Population 12442373
## 4    10350 Population 12442373
## 5    12000 Population 12442373
## 6    11475 Population 12442373
ggplot(dfrGraph) +
geom_jitter(aes(value,RoomRent, colour=variable)) + 
geom_smooth(aes(value,RoomRent, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Price With Other Features")

Regression Analysis
Find the Best Multiple Linear Model for Room Rent
We choose the best linear model with step(), which selects a model by AIC using a stepwise algorithm.
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion.
The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.
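
For a model with k estimated parameters and maximized log-likelihood ln(L), AIC = 2k - 2 ln(L); lower values are better. The following sketch checks this by hand and runs a backward stepwise search. It uses the built-in `mtcars` data purely as a stand-in, and the object names (`fit`, `full`, `stp`) are illustrative assumptions.

```r
# A small linear model on stand-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# For lm(), k counts the regression coefficients plus the residual variance
k <- length(coef(fit)) + 1
aic_manual <- 2 * k - 2 * as.numeric(logLik(fit))

# Backward stepwise selection by AIC, starting from the full model
full <- lm(mpg ~ ., data = mtcars)
stp  <- step(full, trace = 0)

c(manual = aic_manual, builtin = AIC(fit))
c(full = AIC(full), stepwise = AIC(stp))
```

Because a backward step is only taken when it lowers the AIC, the selected model's AIC is never higher than that of the starting model.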

#?step()
stpModel=step(lm(data=dfrModel, RoomRent~.), trace=0, steps=1000)
stpSummary <- summary(stpModel)
stpSummary 
## 
## Call:
## lm(formula = RoomRent ~ Population + IsMetroCity + IsTouristDestination + 
##     IsNewYearEve + StarRating + Airport + FreeWifi + HotelCapacity + 
##     HasSwimmingPool, data = dfrModel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11839  -2385   -691   1045 309532 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.560e+03  4.055e+02 -21.109  < 2e-16 ***
## Population           -1.244e-04  2.263e-05  -5.499 3.88e-08 ***
## IsMetroCity          -6.369e+02  2.132e+02  -2.988  0.00282 ** 
## IsTouristDestination  1.918e+03  1.374e+02  13.958  < 2e-16 ***
## IsNewYearEve          8.430e+02  1.739e+02   4.849 1.26e-06 ***
## StarRating            3.598e+03  1.104e+02  32.582  < 2e-16 ***
## Airport               1.001e+01  2.716e+00   3.684  0.00023 ***
## FreeWifi              5.952e+02  2.217e+02   2.685  0.00726 ** 
## HotelCapacity        -1.040e+01  1.029e+00 -10.115  < 2e-16 ***
## HasSwimmingPool       2.147e+03  1.598e+02  13.434  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6600 on 13222 degrees of freedom
## Multiple R-squared:  0.1904, Adjusted R-squared:  0.1899 
## F-statistic: 345.5 on 9 and 13222 DF,  p-value: < 2.2e-16

Model1

## ------------------------------------------------------------------------
Model1 <- RoomRent ~ Population+CityRank+IsMetroCity+IsTouristDestination+IsWeekend+IsNewYearEve+StarRating+Airport+FreeWifi+FreeBreakfast+HotelCapacity+HasSwimmingPool
fit1 <- lm(Model1, data = dfrModel)
summary(fit1)
## 
## Call:
## lm(formula = Model1, data = dfrModel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11845  -2356   -690   1030 309689 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.604e+03  4.494e+02 -19.147  < 2e-16 ***
## Population           -1.188e-04  3.592e-05  -3.307 0.000945 ***
## CityRank              1.821e+00  1.035e+01   0.176 0.860302    
## IsMetroCity          -6.640e+02  2.164e+02  -3.068 0.002158 ** 
## IsTouristDestination  1.925e+03  1.481e+02  13.001  < 2e-16 ***
## IsWeekend            -9.076e+01  1.239e+02  -0.733 0.463709    
## IsNewYearEve          8.826e+02  1.818e+02   4.855 1.22e-06 ***
## StarRating            3.592e+03  1.108e+02  32.434  < 2e-16 ***
## Airport               9.510e+00  3.171e+00   2.999 0.002709 ** 
## FreeWifi              5.498e+02  2.242e+02   2.452 0.014214 *  
## FreeBreakfast         1.688e+02  1.233e+02   1.369 0.171163    
## HotelCapacity        -1.028e+01  1.033e+00  -9.945  < 2e-16 ***
## HasSwimmingPool       2.153e+03  1.616e+02  13.327  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6601 on 13219 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1898 
## F-statistic: 259.3 on 12 and 13219 DF,  p-value: < 2.2e-16

Model Fit

## ------------------------------------------------------------------------
library(leaps)
leap1 <- regsubsets(Model1, data = dfrModel, nbest=1)
# summary(leap1)
plot(leap1, scale="adjr2")

Observations
The best-fit model excludes FreeBreakfast, CityRank and IsWeekend. Therefore, in our next model, we rerun the regression without these variables.

Model2

## ------------------------------------------------------------------------
Model2 <- RoomRent ~ StarRating+Population+IsMetroCity+IsTouristDestination+IsNewYearEve+Airport+FreeWifi+HotelCapacity+HasSwimmingPool
fit2 <- lm(Model2, data = dfrModel)
summary(fit2)
## 
## Call:
## lm(formula = Model2, data = dfrModel)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11839  -2385   -691   1045 309532 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8.560e+03  4.055e+02 -21.109  < 2e-16 ***
## StarRating            3.598e+03  1.104e+02  32.582  < 2e-16 ***
## Population           -1.244e-04  2.263e-05  -5.499 3.88e-08 ***
## IsMetroCity          -6.369e+02  2.132e+02  -2.988  0.00282 ** 
## IsTouristDestination  1.918e+03  1.374e+02  13.958  < 2e-16 ***
## IsNewYearEve          8.430e+02  1.739e+02   4.849 1.26e-06 ***
## Airport               1.001e+01  2.716e+00   3.684  0.00023 ***
## FreeWifi              5.952e+02  2.217e+02   2.685  0.00726 ** 
## HotelCapacity        -1.040e+01  1.029e+00 -10.115  < 2e-16 ***
## HasSwimmingPool       2.147e+03  1.598e+02  13.434  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6600 on 13222 degrees of freedom
## Multiple R-squared:  0.1904, Adjusted R-squared:  0.1899 
## F-statistic: 345.5 on 9 and 13222 DF,  p-value: < 2.2e-16
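
To illustrate how the fitted model turns hotel features into a predicted rent, the sketch below multiplies the rounded Model 2 coefficients printed above by the feature values of the first hotel in the data (Vivanta by Taj, Mumbai, on a date other than New Year's Eve). The vector names `coefs` and `hotel` are illustrative. With the fitted object available, `predict(fit2, newdata = ...)` would give the same kind of result without rounding.

```r
# Rounded coefficients copied from the Model 2 summary
coefs <- c("(Intercept)" = -8.560e+03, StarRating = 3.598e+03,
           Population = -1.244e-04, IsMetroCity = -6.369e+02,
           IsTouristDestination = 1.918e+03, IsNewYearEve = 8.430e+02,
           Airport = 1.001e+01, FreeWifi = 5.952e+02,
           HotelCapacity = -1.040e+01, HasSwimmingPool = 2.147e+03)

# Feature values of the first hotel shown in head(dfrModel); intercept term = 1
hotel <- c("(Intercept)" = 1, StarRating = 5, Population = 12442373,
           IsMetroCity = 1, IsTouristDestination = 1, IsNewYearEve = 0,
           Airport = 21, FreeWifi = 1, HotelCapacity = 287,
           HasSwimmingPool = 1)

# Linear prediction: sum of coefficient * feature value
predicted_rent <- sum(coefs * hotel[names(coefs)])
round(predicted_rent)
```

The result is roughly 9,131 rupees, somewhat below the observed rents of 9,900 to 12,375 for this hotel, which is consistent with the modest R-squared of the model.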

Observations of Regression Analysis
Null Hypothesis
There is no dependency between Room rent of hotel and other variables

Alternative Hypothesis
There is dependency between Room Rent and other variables

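The overall F-test of the regression gives a p-value well below 0.05 (p < 2.2e-16), so we reject the null hypothesis at the 5% significance level (95% confidence level).
The large F-statistic (345.5 on 9 and 13222 degrees of freedom) indicates that at least one of the coefficients is non-zero. Note that the adjusted R-squared is only about 0.19, so the model explains roughly 19% of the variation in room rent.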
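
The overall F-statistic and its p-value can be read directly from a fitted model. The sketch below shows one way to do this, using the built-in `mtcars` data as a stand-in for the hotel data; `fit`, `null_fit` and `p_value` are illustrative names.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# fstatistic holds the F value, numerator df and denominator df
fs <- summary(fit)$fstatistic
p_value <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)

# The same test expressed as a comparison with the intercept-only model
null_fit <- lm(mpg ~ 1, data = mtcars)
cmp <- anova(null_fit, fit)

unname(p_value)
```

A p-value below 0.05 here means the predictors, taken together, explain more variation than an intercept-only model would.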

Below are the 9 variables that affect the room rent, listed in decreasing order of significance (absolute t value in Model 2):
StarRating
IsTouristDestination
HasSwimmingPool
HotelCapacity
Population
IsNewYearEve
Airport
IsMetroCity
FreeWifi
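
One way to obtain such an ordering is to rank the predictors by the absolute t value in the coefficient table. This is a sketch on the built-in `mtcars` data, not the hotel model itself; `tvals` and `ranked` are illustrative names.

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# Absolute t values of the predictors (row 1 is the intercept, so drop it)
tvals <- abs(coefs[-1, "t value"])

# Largest |t| first
ranked <- sort(tvals, decreasing = TRUE)
names(ranked)
```

Ranking by |t| orders variables by how strongly the data support a non-zero coefficient, which is not the same as ranking them by the size of their effect on the response.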

Visualize the Beta Coefficients and Their Confidence Intervals from Model 2

library(coefplot)
coefplot(fit2, intercept= FALSE, outerCI=1.96,coefficients=c("StarRating","Population", "IsMetroCity", "IsTouristDestination", "IsNewYearEve", "Airport", "HotelCapacity", "HasSwimmingPool"))
## Warning: Ignoring unknown aesthetics: xmin, xmax

## ------------------------------------------------------------------------
# the Adjusted R Squared for Model 2 is (slightly) greater than Model 1
summary(fit1)$adj.r.squared
## [1] 0.1898256
summary(fit2)$adj.r.squared
## [1] 0.1898573
# the AIC for Model 2 is less than Model 1
AIC(fit1)
## [1] 270314.1
AIC(fit2)
## [1] 270310.6

Observations
1. The adjusted R-squared is slightly higher for Model 2 than for Model 1 (0.18986 vs 0.18983), so Model 2 is marginally better.
2. The AIC of Model 2 is also lower than that of Model 1 (270310.6 vs 270314.1), so Model 2 is preferred.
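
When one model is nested inside the other, as Model 2 is inside Model 1, a partial F-test is another common way to compare them alongside adjusted R-squared and AIC. This is a supplementary sketch on the built-in `mtcars` data, not part of the original analysis; `small`, `full` and `cmp` are illustrative names.

```r
# Nested pair: 'small' uses a subset of the predictors in 'full'
small <- lm(mpg ~ wt + hp, data = mtcars)
full  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Partial F-test: do the extra predictors improve the fit significantly?
cmp <- anova(small, full)
p_extra <- cmp[["Pr(>F)"]][2]

c(adj_r2_small = summary(small)$adj.r.squared,
  adj_r2_full  = summary(full)$adj.r.squared,
  aic_small    = AIC(small),
  aic_full     = AIC(full),
  p_extra      = p_extra)
```

A large p-value in this test suggests the dropped variables add little, which supports keeping the smaller model.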

Summary

  1. The data has been loaded successfully.
  2. The data has been summarized to obtain the key descriptive statistics.
  3. Outliers were identified by plotting variables on box plots.
  4. Scatter plots and a correlation plot show the relationship between Room Rent and the other variables.
  5. Continuous variables are shown on box plots, while tables are used for discrete variables.
  6. For Regression Analysis,
    Dependent Variable: RoomRent

Note: The date is excluded from the regression analysis. Date can be an important factor for time series forecasting, but it is not used in the current regression model.

Below are the 9 variables that affect the room rent, listed in decreasing order of significance (absolute t value in Model 2):
StarRating
IsTouristDestination
HasSwimmingPool
HotelCapacity
Population
IsNewYearEve
Airport
IsMetroCity
FreeWifi
