Project

The recommender system I would like to implement recommends hotels in Manhattan. Many people travel here all year round, so this project might help them. There has probably already been a lot of research on which hotels in Manhattan are rated best. I got the dataset from Kaggle and kept only the hotels located in Manhattan.

#importing required libraries

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(caTools)

Dataset

For this project, I am using a dataset that gives the star_rating of different hotels in Manhattan.

Let's import the dataset, which I have uploaded to GitHub.

The original dataset covered hotels across the whole of New York State; I kept only the Manhattan hotels and uploaded that subset to GitHub.

#importing dataset from github

data <- read.csv("https://raw.githubusercontent.com/maharjansudhan/DATA612/master/manhattan.csv")  
head(data)

##   hotel_id                            name
## 1   219370     Dumont NYC-an Affinia hotel
## 2   109277           Park Central New York
## 3   116103                Blakely New York
## 4   415836                          Pod 39
## 5   351449 Sheraton Tribeca New York Hotel
## 6   343224             Chelsea Savoy Hotel
##                                  address     city state_province
## 1 150 E. 34th Street at Lexington Avenue New York             NY
## 2                            870 7th Ave New York             NY
## 3                          136 W 55th St New York             NY
## 4                   145 East 39th Street New York             NY
## 5                       370 Canal Street New York             NY
## 6                      204 W 23rd Street New York             NY
##   postal_code latitude longitude star_rating high_rate low_rate
## 1       10016 40.74624 -73.97929         4.0  349.7400 299.2400
## 2       10019 40.76452 -73.98078         4.0  191.0000 190.0000
## 3       10019 40.76366 -73.97965         4.5  206.4600 205.4600
## 4       10016 40.74929 -73.97670         3.0  220.0373 130.0206
## 5       10013 40.72102 -74.00418         4.0  350.1800 219.1800
## 6       10011 40.74422 -73.99596         2.0  176.4100 175.4100

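To analyze the data further, we need to split the dataset into train and test sets.

Train / Test dataset

Let's split the data in a roughly 80-20 ratio. Each row is assigned to the train set with probability 0.80, so the exact proportions vary from run to run.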

#splitting train / test (roughly 80-20)
#no seed is set, so the rows selected differ between runs
n <- nrow(data)
split <- sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.80, 0.20))

train <- data[split, ]
test <- data[!split, ]

#train dataset
head(train)

##   hotel_id                               name
## 1   219370        Dumont NYC-an Affinia hotel
## 2   109277              Park Central New York
## 3   116103                   Blakely New York
## 4   415836                             Pod 39
## 6   343224                Chelsea Savoy Hotel
## 8   358061 Element New York Times Square West
##                                  address     city state_province
## 1 150 E. 34th Street at Lexington Avenue New York             NY
## 2                            870 7th Ave New York             NY
## 3                          136 W 55th St New York             NY
## 4                   145 East 39th Street New York             NY
## 6                      204 W 23rd Street New York             NY
## 8                   311 West 39th Street New York             NY
##   postal_code latitude longitude star_rating high_rate low_rate
## 1       10016 40.74624 -73.97929         4.0  349.7400 299.2400
## 2       10019 40.76452 -73.98078         4.0  191.0000 190.0000
## 3       10019 40.76366 -73.97965         4.5  206.4600 205.4600
## 4       10016 40.74929 -73.97670         3.0  220.0373 130.0206
## 6       10011 40.74422 -73.99596         2.0  176.4100 175.4100
## 8       10018 40.75565 -73.99193         3.5  474.2100 299.1600

#test dataset
head(test)

##    hotel_id                                                     name
## 5    351449                          Sheraton Tribeca New York Hotel
## 7    326046 Distrikt Hotel New York City, an Ascend Collection Hotel
## 13   438630                                       WestHouse New York
## 14   124573 DoubleTree Suites by Hilton New York City - Times Square
## 24   163823                                     The Redbury New York
## 25   416820                        The Westin New York Grand Central
##                 address     city state_province postal_code latitude
## 5      370 Canal Street New York             NY       10013 40.72102
## 7         342 W 40th St New York             NY       10018 40.75679
## 13 201 West 55th Street New York             NY       10019 40.76442
## 14        1568 Broadway New York             NY       10036 40.75904
## 24  29 East 29th Street New York             NY       10016 40.74501
## 25 212 East 42nd Street New York             NY       10017 40.75041
##    longitude star_rating high_rate low_rate
## 5  -74.00418         4.0  350.1800 219.1800
## 7  -73.99273         3.5  219.3400 129.3400
## 13 -73.98144         5.0  510.4900 370.2300
## 14 -73.98475         4.0  439.1900 259.1100
## 24 -73.98426         4.0  479.3400 229.3400
## 25 -73.97378         4.5  679.1567 499.1459
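
The split above draws an independent TRUE/FALSE for every row, so it is only approximately 80-20 and changes each time the code runs. As a minimal sketch of one alternative, the block below samples an exact 80% of the row indices after fixing a seed. The data frame `hotels`, its values, and the seed are assumptions made up for illustration; they are not taken from the Kaggle data.

```r
# Toy stand-in for the hotel data; the values are invented for illustration.
hotels <- data.frame(
  hotel_id    = 1:10,
  star_rating = c(4, 4, 4.5, 3, 4, 2, 3.5, 3.5, 5, 4)
)

set.seed(612)  # arbitrary seed, chosen only so the split can be repeated

# Draw exactly 80% of the row indices (without replacement) for training.
n <- nrow(hotels)
train_idx <- sample(seq_len(n), size = round(0.8 * n))

train_toy <- hotels[train_idx, ]   # rows whose index was drawn
test_toy  <- hotels[-train_idx, ]  # all remaining rows

nrow(train_toy)
nrow(test_toy)
```

With this approach every row lands in exactly one of the two sets, and rerunning the code with the same seed gives the same split.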

We now have two randomly drawn, non-overlapping datasets, so we can calculate the RMSE.
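
Written out, the calculation that follows uses the quantities below. The notation is introduced here for illustration and does not come from the source: $r_i$ is the star_rating of hotel $i$, $\bar{r}$ is the raw average over the train set, and $N_S$ is the number of hotels in set $S$ with a non-missing rating.

```latex
\bar{r} = \frac{1}{N_{\text{train}}} \sum_{i \in \text{train}} r_i,
\qquad
\mathrm{RMSE}_{S} = \sqrt{\frac{1}{N_{S}} \sum_{i \in S} \left(r_i - \bar{r}\right)^2},
\quad S \in \{\text{train}, \text{test}\}
```

Both RMSE values use the train average $\bar{r}$ as the prediction, so the test RMSE measures how well that average carries over to hotels not used to compute it.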

# Get raw average for further RMSE calculations
average_raw <- sum(train$star_rating, na.rm = TRUE) / length(which(!is.na(train$star_rating)))

# Calculate RMSE for train and test dataset

#calculate rmse for train dataset
rmse_train_raw <- sqrt(sum((train$star_rating[!is.na(train$star_rating)] - average_raw)^2) /
                         length(which(!is.na(train$star_rating))))
rmse_train_raw

## [1] 0.79818

#calculate RMSE for test dataset
rmse_test_raw <- sqrt(sum((test$star_rating[!is.na(test$star_rating)] - average_raw)^2) /
                        length(which(!is.na(test$star_rating))))
rmse_test_raw

## [1] 0.770159

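Both RMSE values are in star units: predicting the training average for every hotel is off by about 0.80 stars on the train set and about 0.77 stars on the test set. An RMSE below 1 does not by itself indicate a good fit, because star_rating only spans a few points, so these values are best read as a baseline for later models. The train and test values are close, which suggests that the two sets have similar rating distributions.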
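
The baseline calculation above can be wrapped in a small helper. The sketch below is illustrative only: `rmse` is a hypothetical helper name, and the ratings are invented toy values rather than rows from the hotel dataset.

```r
# Root mean squared error of a prediction, ignoring missing ratings.
rmse <- function(actual, predicted) {
  ok <- !is.na(actual)
  sqrt(mean((actual[ok] - predicted)^2))
}

# Toy ratings, invented for illustration only.
train_ratings <- c(4, 4, 4.5, 3, 2, 3.5, NA)
test_ratings  <- c(4, 3.5, 5)

# Raw average of the train ratings (same formula as average_raw in the text).
baseline <- mean(train_ratings, na.rm = TRUE)

rmse(train_ratings, baseline)  # train RMSE of the mean-only prediction
rmse(test_ratings, baseline)   # test RMSE, still using the train average
```

For a mean-only prediction, the train RMSE equals the population standard deviation of the train ratings, so it measures the spread of the ratings rather than the quality of a recommender.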

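#attempt to list the highest rated hotels in Manhattan
#note: the select() result is overwritten by the next line, and arrange()
#sorts by name first, so the output below is in alphabetical order
top_rated_hotels <- select(data, name, star_rating)
top_rated_hotels <- arrange(data, name, desc(star_rating))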
x <- head(top_rated_hotels,10)
x

##    hotel_id                                          name
## 1    481441                          1 Hotel Central Park
## 2    563115                                     11 Howard
## 3    578930                                   3 West Club
## 4    251284                               36 Hudson Hotel
## 5    278067                            Ace Hotel New York
## 6    223460                              AKA Central Park
## 7    598637                               AKA Wall Street
## 8    323984                 Allies' Inn Bed and Breakfast
## 9    358950                                  Aloft Harlem
## 10   480123 Aloft Manhattan Downtown - Financial District
##                               address     city state_province postal_code
## 1         1414 Avenue of the Americas New York             NY       10019
## 2                    11 Howard Street New York             NY       10013
## 3                     3 W 51st Street New York             NY       10019
## 4  449 West 36th Street, Midtown West New York             NY       10018
## 5                 20 West 29th Street New York             NY       10001
## 6                 42 West 58th Street New York             NY       10019
## 7                   84 William Street New York             NY       10038
## 8               313 West 136th Street New York             NY       10030
## 9   2296 Frederick Douglass Boulevard New York             NY       10027
## 10                   49-53 Ann Street New York             NY       10038
##    latitude longitude star_rating high_rate low_rate
## 1  40.76483 -73.97685         5.0  880.4400 613.0400
## 2  40.71925 -73.99999         5.0  519.0772 319.0472
## 3  40.75967 -73.97729         3.5  250.3300 249.3300
## 4  40.75564 -73.99765         3.0  179.0000 109.6097
## 5  40.74587 -73.98823         4.0 1048.9830 177.6384
## 6  40.76451 -73.97555         4.0  744.3800 325.3800
## 7  40.70801 -74.00778         4.0  869.4400 395.4400
## 8  40.81751 -73.94678         3.0  341.0000 215.0000
## 9  40.80921 -73.95184         3.5  249.0983 234.0491
## 10 40.71032 -74.00678         3.0  292.6100 262.5900

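The table above shows the first ten hotels in alphabetical order, together with their star_rating, high_rate and low_rate. It is not a ranking by rating: to list the highest rated hotels, star_rating has to be the first sort key.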
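
As a minimal sketch of ranking by rating, the block below sorts a toy data frame by star_rating in descending order. The hotel names and values are invented, and breaking ties by the lower low_rate is an assumption made for illustration, not a choice stated in the source. Base R is used so that the example needs no packages.

```r
# Toy stand-in for the hotel data; names and values are invented.
hotels <- data.frame(
  name        = c("Hotel A", "Hotel B", "Hotel C", "Hotel D"),
  star_rating = c(3.5, 5.0, 4.0, 5.0),
  low_rate    = c(120, 400, 210, 350),
  stringsAsFactors = FALSE
)

# Sort by rating (highest first); cheaper hotels break ties.
# dplyr equivalent: arrange(hotels, desc(star_rating), low_rate)
ranked <- hotels[order(-hotels$star_rating, hotels$low_rate), ]

# Keep the two highest rated hotels.
top_rated <- head(ranked, 2)
top_rated
```

On the real data, the same ordering applied to `data` with `head(..., 10)` would give the ten highest rated hotels instead of the first ten names.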

Project_1

Sudhan Maharjan

2/15/2020
