The recommender system that I would like to implement recommends hotels in Manhattan. People travel here all year round, so this project might help them choose where to stay. Hotel ratings in Manhattan have probably been studied before. I got the dataset from Kaggle and kept only the hotels located in Manhattan.
#importing required libraries
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caTools)
For this project, I am using a dataset that gives the star_rating of hotels in Manhattan.
Let’s import the dataset, which I have uploaded to GitHub.
The original dataset covers hotels across the whole of New York State; I filtered it down to Manhattan before uploading it to GitHub.
#importing dataset from github
data <- read.csv("https://raw.githubusercontent.com/maharjansudhan/DATA612/master/manhattan.csv")
head(data)
## hotel_id name
## 1 219370 Dumont NYC-an Affinia hotel
## 2 109277 Park Central New York
## 3 116103 Blakely New York
## 4 415836 Pod 39
## 5 351449 Sheraton Tribeca New York Hotel
## 6 343224 Chelsea Savoy Hotel
## address city state_province
## 1 150 E. 34th Street at Lexington Avenue New York NY
## 2 870 7th Ave New York NY
## 3 136 W 55th St New York NY
## 4 145 East 39th Street New York NY
## 5 370 Canal Street New York NY
## 6 204 W 23rd Street New York NY
## postal_code latitude longitude star_rating high_rate low_rate
## 1 10016 40.74624 -73.97929 4.0 349.7400 299.2400
## 2 10019 40.76452 -73.98078 4.0 191.0000 190.0000
## 3 10019 40.76366 -73.97965 4.5 206.4600 205.4600
## 4 10016 40.74929 -73.97670 3.0 220.0373 130.0206
## 5 10013 40.72102 -74.00418 4.0 350.1800 219.1800
## 6 10011 40.74422 -73.99596 2.0 176.4100 175.4100
To analyze the data further, we need to split it into a training set and a test set.
Let’s split the data in an 80-20 ratio.
#splitting train / test (approximately 80-20; each row is assigned at random)
n <- nrow(data)
split <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.80, 0.20))
train <- data[split, ]
test <- data[!split, ]
#train dataset
head(train)
## hotel_id name
## 1 219370 Dumont NYC-an Affinia hotel
## 2 109277 Park Central New York
## 3 116103 Blakely New York
## 4 415836 Pod 39
## 6 343224 Chelsea Savoy Hotel
## 8 358061 Element New York Times Square West
## address city state_province
## 1 150 E. 34th Street at Lexington Avenue New York NY
## 2 870 7th Ave New York NY
## 3 136 W 55th St New York NY
## 4 145 East 39th Street New York NY
## 6 204 W 23rd Street New York NY
## 8 311 West 39th Street New York NY
## postal_code latitude longitude star_rating high_rate low_rate
## 1 10016 40.74624 -73.97929 4.0 349.7400 299.2400
## 2 10019 40.76452 -73.98078 4.0 191.0000 190.0000
## 3 10019 40.76366 -73.97965 4.5 206.4600 205.4600
## 4 10016 40.74929 -73.97670 3.0 220.0373 130.0206
## 6 10011 40.74422 -73.99596 2.0 176.4100 175.4100
## 8 10018 40.75565 -73.99193 3.5 474.2100 299.1600
#test dataset
head(test)
## hotel_id name
## 5 351449 Sheraton Tribeca New York Hotel
## 7 326046 Distrikt Hotel New York City, an Ascend Collection Hotel
## 13 438630 WestHouse New York
## 14 124573 DoubleTree Suites by Hilton New York City - Times Square
## 24 163823 The Redbury New York
## 25 416820 The Westin New York Grand Central
## address city state_province postal_code latitude
## 5 370 Canal Street New York NY 10013 40.72102
## 7 342 W 40th St New York NY 10018 40.75679
## 13 201 West 55th Street New York NY 10019 40.76442
## 14 1568 Broadway New York NY 10036 40.75904
## 24 29 East 29th Street New York NY 10016 40.74501
## 25 212 East 42nd Street New York NY 10017 40.75041
## longitude star_rating high_rate low_rate
## 5 -74.00418 4.0 350.1800 219.1800
## 7 -73.99273 3.5 219.3400 129.3400
## 13 -73.98144 5.0 510.4900 370.2300
## 14 -73.98475 4.0 439.1900 259.1100
## 24 -73.98426 4.0 479.3400 229.3400
## 25 -73.97378 4.5 679.1567 499.1459
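Because the split above assigns each row independently with probability 0.8, the sizes of the two sets vary from run to run, and the rows change on every execution. A minimal sketch of a repeatable split with exact sizes is shown below. The toy data frame, the seed value, and the names `hotels`, `train_toy` and `test_toy` are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical toy data standing in for the hotel table
set.seed(42)  # assumed seed; any fixed value makes the split repeatable
hotels <- data.frame(
  hotel_id = 1:50,
  star_rating = sample(seq(2, 5, by = 0.5), 50, replace = TRUE)
)

# Draw exactly 80% of the row indices, without replacement, for training
n <- nrow(hotels)
train_idx <- sample(n, size = round(0.8 * n))
train_toy <- hotels[train_idx, ]
test_toy <- hotels[-train_idx, ]

nrow(train_toy)  # 40 rows
nrow(test_toy)   # 10 rows
```

Sampling row indices rather than TRUE/FALSE flags guarantees that every row lands in exactly one of the two sets and that the 80-20 ratio holds exactly.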
We now have two randomly drawn datasets. We can now calculate the RMSE of predicting the raw average rating of the training set for every hotel.
# Get raw average for further RMSE calculations
average_raw <- sum(train$star_rating, na.rm = TRUE) / length(which(!is.na(train$star_rating)))
# Calculate RMSE for train and test dataset
#calculate rmse for train dataset
rmse_train_raw <- sqrt(sum((train$star_rating[!is.na(train$star_rating)] - average_raw)^2) /
length(which(!is.na(train$star_rating))))
rmse_train_raw
## [1] 0.79818
#calculate RMSE for test dataset
rmse_test_raw <- sqrt(sum((test$star_rating[!is.na(test$star_rating)] - average_raw)^2) /
length(which(!is.na(test$star_rating))))
rmse_test_raw
## [1] 0.770159
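Both RMSE values are expressed in star-rating units: predicting the training average for every hotel is off by roughly 0.8 stars. This is a baseline rather than evidence of a good fit, because an RMSE between 0 and 1 only means something relative to the rating scale (the ratings shown above run from 2.0 to 5.0). The train and test RMSE values are close, which suggests the two sets have similar rating distributions.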
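The RMSE calculation above can be wrapped in a small helper so that the same formula is applied to both sets. The sketch below is one way to do it; the function name `rmse_baseline` and the rating values are made up for illustration and do not come from the dataset.

```r
# Hypothetical helper: RMSE of a constant prediction, ignoring missing ratings
rmse_baseline <- function(actual, prediction) {
  actual <- actual[!is.na(actual)]
  sqrt(mean((actual - prediction)^2))
}

# Made-up ratings, for illustration only
train_ratings <- c(4.0, 4.0, 4.5, 3.0, 2.0, NA)
test_ratings <- c(4.0, 3.5, 5.0)

# Raw average of the training ratings, used as the prediction for every hotel
baseline <- mean(train_ratings, na.rm = TRUE)

rmse_baseline(train_ratings, baseline)
rmse_baseline(test_ratings, baseline)
```

The baseline is computed from the training ratings only and then reused for the test ratings, which mirrors how average_raw is used above.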
#list hotels with their ratings, sorted by name (star_rating only breaks ties)
top_rated_hotels <- arrange(data, name, desc(star_rating))
x <- head(top_rated_hotels,10)
x
## hotel_id name
## 1 481441 1 Hotel Central Park
## 2 563115 11 Howard
## 3 578930 3 West Club
## 4 251284 36 Hudson Hotel
## 5 278067 Ace Hotel New York
## 6 223460 AKA Central Park
## 7 598637 AKA Wall Street
## 8 323984 Allies' Inn Bed and Breakfast
## 9 358950 Aloft Harlem
## 10 480123 Aloft Manhattan Downtown - Financial District
## address city state_province postal_code
## 1 1414 Avenue of the Americas New York NY 10019
## 2 11 Howard Street New York NY 10013
## 3 3 W 51st Street New York NY 10019
## 4 449 West 36th Street, Midtown West New York NY 10018
## 5 20 West 29th Street New York NY 10001
## 6 42 West 58th Street New York NY 10019
## 7 84 William Street New York NY 10038
## 8 313 West 136th Street New York NY 10030
## 9 2296 Frederick Douglass Boulevard New York NY 10027
## 10 49-53 Ann Street New York NY 10038
## latitude longitude star_rating high_rate low_rate
## 1 40.76483 -73.97685 5.0 880.4400 613.0400
## 2 40.71925 -73.99999 5.0 519.0772 319.0472
## 3 40.75967 -73.97729 3.5 250.3300 249.3300
## 4 40.75564 -73.99765 3.0 179.0000 109.6097
## 5 40.74587 -73.98823 4.0 1048.9830 177.6384
## 6 40.76451 -73.97555 4.0 744.3800 325.3800
## 7 40.70801 -74.00778 4.0 869.4400 395.4400
## 8 40.81751 -73.94678 3.0 341.0000 215.0000
## 9 40.80921 -73.95184 3.5 249.0983 234.0491
## 10 40.71032 -74.00678 3.0 292.6100 262.5900
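The table above lists the first ten hotels in alphabetical order, along with their star_rating, high_rate and low_rate. Because name is the first sort key in arrange(), star_rating only breaks ties between hotels with the same name, so this is not yet a ranking of the highest rated hotels. To get one, sort by desc(star_rating) first.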
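For comparison, a ranking that puts the highest star_rating first could look like the sketch below. It uses base R's order() on a few made-up rows; breaking ties by the lower low_rate is an assumption made for illustration, not something the analysis above prescribes.

```r
# Made-up rows for illustration; column names follow the dataset above
hotels <- data.frame(
  name = c("Hotel A", "Hotel B", "Hotel C", "Hotel D"),
  star_rating = c(3.5, 5.0, 4.0, 5.0),
  low_rate = c(249, 613, 325, 319),
  stringsAsFactors = FALSE
)

# Highest star_rating first; ties broken by the cheaper low_rate (assumed rule)
# dplyr equivalent: arrange(hotels, desc(star_rating), low_rate)
ranked <- hotels[order(-hotels$star_rating, hotels$low_rate), ]

head(ranked[, c("name", "star_rating", "low_rate")], 10)
```

With these rows, the two 5.0-star hotels come first, with the cheaper one ahead, followed by the 4.0-star and 3.5-star hotels.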