WQD7004 PROGRAMMING FOR DATA SCIENCE: GROUP ASSIGNMENT

Topic: Analysis of most popular restaurants in Malaysia

Introduction and objectives

Malaysia is well-known for its large variety of delicious food. In this project, we conduct analysis on the most popular restaurants in Malaysia, based on two datasets. The TripAdvisor and Google Maps data sets are similar and contain valuable customer reviews for restaurants, specifically geared towards analyzing the popularity of eateries in Malaysia. The main objective of this project is to build machine learning models to predict restaurant ratings in both datasets based on their reviews.

Data cleaning and preprocessing

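# Packages used throughout this report
pkgs <- c("dplyr", "ggplot2", "lubridate", "stringr", "tm", "wordcloud",
          "RColorBrewer", "word2vec", "caret", "rpart", "randomForest")
invisible(lapply(pkgs, library, character.only = TRUE))

df <- read.csv("C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/GoogleReview_data_cleaned.csv")

# Convert the "Rating" column to numeric; values that cannot be parsed become NA
df$Rating <- suppressWarnings(as.numeric(as.character(df$Rating)))

# Keep rows where Rating is present and between 1 and 5
df_filtered <- df[!is.na(df$Rating) & df$Rating >= 1 & df$Rating <= 5, ]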

The “Rating” column is converted to numeric with the as.numeric function; any value that cannot be parsed as a number becomes NA. Base R logical indexing is then used to keep only the rows of the data frame whose “Rating” lies between 1 and 5.
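As a small illustration of the conversion and filtering step, the sketch below uses a made-up vector of raw rating strings (the values are hypothetical, not taken from the dataset). Unparseable entries become NA during conversion, and an explicit !is.na() check keeps them out of the filtered result.

```r
# Hypothetical raw ratings, including non-numeric and out-of-range entries
raw_rating <- c("5", "3", "N/A", "0", "4", "seven", "6")

# as.numeric() turns anything unparseable into NA (with a warning, silenced here)
rating <- suppressWarnings(as.numeric(raw_rating))

# Keep only non-missing ratings in the valid 1-5 range
valid <- !is.na(rating) & rating >= 1 & rating <= 5
clean_rating <- rating[valid]

print(clean_rating)
```

Without the !is.na() check, logical indexing with NA would return rows filled with NA instead of dropping them.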

df_tripadvisor <- read.csv("C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/TripAdvisor_data_cleaned.csv")

#clean and convert the 'Dates' column
df_tripadvisor$Dates <- gsub("Reviewed ", "", df_tripadvisor$Dates) #remove the 'Reviewed ' prefix
df_tripadvisor$Dates <- as.Date(df_tripadvisor$Dates, format = '%d %B %Y') #parse day, full month name and year as a Date

The ‘Dates’ column is cleaned by removing the ‘Reviewed ’ prefix with gsub and then converting the remaining strings to Date format with as.Date; both are base R functions. The proportion of missing values in the ‘Country’ column can be checked with the sum and is.na functions once that column has been populated in the next step.
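The date-cleaning step can be tried on a few sample strings. The sketch below uses hypothetical inputs in the same ‘Reviewed 5 March 2019’ shape; it sets the time locale to "C" because the %B specifier matches month names in the current locale, which is an assumption about how the original strings were written (English month names).

```r
# Use English month names regardless of the machine's locale
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical date strings in the format used by the 'Dates' column
raw_dates <- c("Reviewed 5 March 2019", "Reviewed 21 December 2020", "Reviewed yesterday")

# Strip the leading "Reviewed " prefix (anchored at the start of the string)
stripped <- sub("^Reviewed ", "", raw_dates)

# %d = day of month, %B = full month name, %Y = four-digit year
parsed <- as.Date(stripped, format = "%d %B %Y")

print(parsed)
print(sum(is.na(parsed)))  # number of strings that could not be parsed
```

Strings that do not match the format come back as NA rather than raising an error, which is consistent with the NA's count reported for ‘Dates’ in the summary output.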

# extracting country names from author column
countries <- read.csv("C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/countries.csv")
df_tripadvisor$Country <- NA

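# check each author string against every country in countries.csv;
# if the country name occurs in the string, write it to the 'Country' column
for (i in seq_len(nrow(df_tripadvisor))) {
  for (j in seq_len(nrow(countries))) {
    # fixed = TRUE matches the country name as literal text, not as a regular expression
    if (grepl(countries$Country[j], df_tripadvisor$Author[i], fixed = TRUE)) {
      df_tripadvisor$Country[i] <- countries$Country[j]
    }
  }
}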

We found that the author strings also include the authors’ home countries. To extract them, we obtained a CSV file containing a list of countries and initialised a new ‘Country’ column in the df_tripadvisor data frame. A nested loop then checks each author string for a match against every country in the list; when a match is found, that country’s name is written to the ‘Country’ column. Once the ‘Country’ column is populated, the ‘Rating’ column is converted to a factor and a summary table is created to examine the distribution of ratings within Kuala Lumpur.
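The same matching logic can be written without an explicit nested loop. The following is only a sketch on hypothetical author strings and a hypothetical three-country list; the helper name find_country is ours and is not part of the project code. Like the loop, it keeps the last country in the list that matches.

```r
# Hypothetical author strings and country list (stand-ins for the real columns)
authors <- c("Alice - London, United Kingdom", "Bob - Kuala Lumpur, Malaysia", "Carol")
country_list <- c("Malaysia", "United Kingdom", "Australia")

# Return the last country in 'candidates' that occurs in 'author', or NA if none does.
# fixed = TRUE treats each country name as literal text rather than a regex.
find_country <- function(author, candidates) {
  is_hit <- vapply(candidates, grepl, logical(1), x = author, fixed = TRUE)
  hits <- candidates[is_hit]
  if (length(hits) == 0) NA_character_ else hits[length(hits)]
}

country <- unname(vapply(authors, find_country, character(1), candidates = country_list))
print(country)
```

Authors without a recognisable country keep NA, which is what makes a later missing-value check on the ‘Country’ column meaningful.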

Exploratory Data Analysis

The Exploratory Data Analysis (EDA) phase is for understanding the patterns, relationships, and anomalies in the data. This involves visualizing data through graphs, charts, and summary statistics to uncover underlying trends and characteristics. For predicting restaurant ratings, EDA involves examining factors such as location, customer reviews, and other relevant variables. By exploring these aspects, we can identify key features that influence restaurant ratings. This not only helps in building a more accurate predictive model but also provides valuable insights into what makes a restaurant appealing to customers.

Exploring Ratings Columns

#convert rating to a factor
df_tripadvisor$Rating <- as.factor(df_tripadvisor$Rating)

#filter to the KL area only
df_tripadvisor_kl <- df_tripadvisor[df_tripadvisor$Location == "KL",]

#create a frequency table of the KL ratings (used for the pie chart)
df_tripadvisor_kl_rating_count <- as.data.frame(table(df_tripadvisor_kl$Rating))

#rename the columns to Rating and Count
names(df_tripadvisor_kl_rating_count) <- c('Rating', 'Count')

summary(df_tripadvisor_kl)
##     Author             Title              Review          Rating   
##  Length:71368       Length:71368       Length:71368       1: 2836  
##  Class :character   Class :character   Class :character   2: 2944  
##  Mode  :character   Mode  :character   Mode  :character   3: 7547  
##                                                           4:20227  
##                                                           5:37814  
##                                                                    
##                                                                    
##      Dates             Restaurant          Location           Country         
##  Min.   :2008-01-02   Length:71368       Length:71368       Length:71368      
##  1st Qu.:2015-10-28   Class :character   Class :character   Class :character  
##  Median :2017-08-14   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2017-06-18                                                           
##  3rd Qu.:2019-06-14                                                           
##  Max.   :2022-02-19                                                           
##  NA's   :418

Plotting the ratings of restaurants in KL on a scale of 1 to 5

ggplot(df_tripadvisor_kl_rating_count, aes(x="", y=Count, fill=Rating)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  theme_void() + 
  labs(title = 'Ratings of restaurants in KL on a scale of 1 to 5')

From this pie chart, we can see that the majority of the ratings are 4 or above.

Examining the number of KL restaurant reviews given on TripAdvisor over time

#Count reviews per date, round each date down to its month, then sum the 'Count' values per month
df_tripadvisor_kl_review_dates <- as.data.frame(table(df_tripadvisor_kl$Dates))
names(df_tripadvisor_kl_review_dates) <- c('Date', "Count")
df_tripadvisor_kl_review_dates$Date <- floor_date(as.Date(df_tripadvisor_kl_review_dates$Date), unit='month')
df_tripadvisor_kl_review_dates <- df_tripadvisor_kl_review_dates %>% group_by(Date) %>% summarise(Count = sum(Count), .groups = 'drop')

#To view the number of KL restaurant reviews given over time:
ggplot(df_tripadvisor_kl_review_dates, aes(x = Date, y = Count, group=1)) + 
  geom_line(size=1) +
  labs(title = 'Number of KL restaurant reviews given on TripAdvisor over time')

The graph shows a significant decrease in the number of reviews given during the Movement Control Order (MCO), and the number remains low through the end of the data in February 2022.

Viewing the number of KL restaurant reviews given by tourists

#create a frequency table 
df_tripadvisor_kl_review_countries <- as.data.frame(table(df_tripadvisor_kl$Country)) 
names(df_tripadvisor_kl_review_countries) <- c('Country', 'Count')

#all except Malaysia
df_tripadvisor_kl_review_countries <- df_tripadvisor_kl_review_countries[df_tripadvisor_kl_review_countries$Country != "Malaysia", ]

df_tripadvisor_kl_review_countries <- df_tripadvisor_kl_review_countries %>% arrange(desc(Count)) %>% head(25)

ggplot(df_tripadvisor_kl_review_countries, aes(x = reorder(Country, -Count), y = Count)) + 
  geom_bar(stat = "identity") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) + 
  labs(title = 'Number of KL restaurant reviews given by tourists', x = 'Country')

This bar chart shows that UK tourists gave the most reviews.

Identifying the top 20 restaurants in KL based on Bayesian Average ratings

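Applying a Bayesian average ensures that both highly rated restaurants and those with a large number of reviews are fairly represented. Each restaurant’s average rating is pulled toward the overall mean rating, and the pull is stronger when the restaurant has few reviews. This avoids bias towards restaurants with few but “perfect” ratings.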
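To see how this weighting behaves, here is a minimal sketch on two hypothetical restaurants. The function name bayesian_average and the values of prior_mean and prior_weight are illustrative assumptions; in the report itself the prior mean is the mean of the restaurants’ average ratings and the prior weight is the mean number of reviews.

```r
# Bayesian average: pulls a restaurant's mean rating toward a prior mean,
# with less pull the more reviews the restaurant has.
bayesian_average <- function(avg_rating, n_reviews, prior_mean, prior_weight) {
  (avg_rating * n_reviews + prior_weight * prior_mean) / (n_reviews + prior_weight)
}

# Hypothetical restaurants: A has two perfect reviews, B has many good ones
toy <- data.frame(
  restaurant = c("A", "B"),
  avg_rating = c(5.0, 4.6),
  n_reviews  = c(2, 500)
)

prior_mean   <- 4.0  # assumed overall average rating
prior_weight <- 50   # assumed prior strength, in "pseudo-reviews"

toy$bayes <- bayesian_average(toy$avg_rating, toy$n_reviews, prior_mean, prior_weight)
print(toy)
```

Restaurant A has the higher raw average, but after adjustment it ranks below restaurant B, because two reviews carry little weight against the prior.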

df_tripadvisor <- read.csv('C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/TripAdvisor_data_cleaned.csv')

df_googlemaps <- read.csv('C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/GoogleReview_data_cleaned.csv')

#converts ratings to factor
df_googlemaps$Rating <- as.factor(df_googlemaps$Rating)
df_tripadvisor$Rating <- as.factor(df_tripadvisor$Rating)

#Combining Google Maps and TripAdvisor datasets
df <- bind_rows(df_tripadvisor, df_googlemaps) 

#selecting specific columns
df <- df[, c('Review', 'Rating', 'Restaurant', 'Location')] 

#only KL
df <- df[df$Location == "KL", ]

#summarise each restaurant: number of reviews and average rating
#(as.character() first, so the factor labels are averaged rather than the level indices)
df_restaurant <- df %>% group_by(Restaurant) %>% 
  summarise(`Number of reviews` = n(), `Average rating` = round(mean(as.numeric(as.character(Rating)), na.rm = TRUE), 2)) %>% 
  arrange(desc(`Number of reviews`), desc(`Average rating`))

#Restaurants sorted by number of reviews, then by average rating
df_restaurant
## # A tibble: 862 × 3
##    Restaurant                    `Number of reviews` `Average rating`
##    <chr>                                       <int>            <dbl>
##  1 Dining In The Dark KL                        1783             4.67
##  2 The Whisky Bar                               1284             4.74
##  3 Canopy Rooftop Bar and Lounge                1268             4.88
##  4 Madam Kwan's KLCC                            1268             3.84
##  5 Iketeru Restaurant                           1257             4.75
##  6 BBQ NIGHTS                                   1012             4.45
##  7 Ishin Japanese Dining                        1005             4.44
##  8 Canopy Lounge Rooftop Bar KL                  948             4.55
##  9 Khan’s Indian Cuisine                         939             4.42
## 10 El Cerdo                                      915             4.57
## # ℹ 852 more rows
global_avg_rating <- mean(df_restaurant$`Average rating`, na.rm = TRUE) #mean of ratings
C <- mean(df_restaurant$`Number of reviews`, na.rm = TRUE) #mean of reviews

df_restaurant <- df_restaurant %>%
  mutate(Bayesian_Average = ((`Average rating` * `Number of reviews`) + (C * global_avg_rating)) / (`Number of reviews` + C)) %>%
  arrange(desc(Bayesian_Average)) #sort by Bayesian average, highest first

top_20_restaurants <- df_restaurant %>%
  slice_max(Bayesian_Average, n = 20) #select the top 20 restaurants
top_20_restaurants
## # A tibble: 20 × 4
##    Restaurant              `Number of reviews` `Average rating` Bayesian_Average
##    <chr>                                 <int>            <dbl>            <dbl>
##  1 Canopy Rooftop Bar and…                1268             4.88             4.82
##  2 Positano Risto                          848             4.87             4.78
##  3 Chambers Grill                          703             4.86             4.76
##  4 Cielo Sky Dining & Lou…                 490             4.85             4.72
##  5 Iketeru Restaurant                     1257             4.75             4.70
##  6 Sausage KL Cafe & Deli                  393             4.85             4.69
##  7 The Whisky Bar                         1284             4.74             4.69
##  8 Healy Mac's                             460             4.78             4.65
##  9 Dining In The Dark KL                  1783             4.67             4.64
## 10 Opium KL                                890             4.67             4.61
## 11 Zenzero Restaurant & W…                 500             4.68             4.58
## 12 Nizza                                   172             4.85             4.57
## 13 Manja                                   484             4.67             4.57
## 14 Knowhere Bangsar                        710             4.63             4.56
## 15 Quivo Pavilion                          650             4.62             4.55
## 16 Favola Italian Restaur…                 379             4.67             4.55
## 17 Vin's Restaurant and B…                 348             4.66             4.53
## 18 Sushi Hibiki                            137             4.84             4.53
## 19 The Steakhouse KL                       660             4.59             4.53
## 20 El Cerdo                                915             4.57             4.52
#Plotting the graph:
ggplot(top_20_restaurants, aes(x = reorder(Restaurant, Bayesian_Average), y = Bayesian_Average)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # Flips the axes for better readability
  labs(title = "Top 20 Restaurants in KL based on Bayesian Average Rating",
       x = "Restaurant",
       y = "Bayesian Average Rating")

Based on the Bayesian average, which considers both average ratings and review volume, Canopy Rooftop Bar and Lounge stands out as the top-rated restaurant in Kuala Lumpur, reflecting its high customer ratings and consistent performance across numerous reviews.

Identifying the top 20 restaurants in Malaysia

df_tripadvisor2 <- read.csv("C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/TripAdvisor_data_cleaned.csv")
df_googlemaps2 <- read.csv("C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/GoogleReview_data_cleaned.csv")

df_googlemaps2$Rating <- as.factor(df_googlemaps2$Rating)
df_tripadvisor2$Rating <- as.factor(df_tripadvisor2$Rating)
df2 <- bind_rows(df_tripadvisor2, df_googlemaps2)
df2 <- df2[, c('Review', 'Rating', 'Restaurant', 'Location')]

unique_locations <- df2 %>% 
  distinct(Location) %>% 
  arrange(Location)
unique_locations #to show how many locations 
##         Location
## 1           Ipoh
## 2             JB
## 3             KL
## 4        Kuching
## 5       Langkawi
## 6         Melaka
## 7           Miri
## 8         Penang
## 9  Petaling Jaya
## 10     Shah Alam
#summarise each restaurant within each location: number of reviews and average rating
df_restaurant2 <- df2 %>% 
  group_by(Restaurant, Location) %>% 
  summarise(`Number of reviews` = n(), 
            `Average rating` = round(mean(as.numeric(as.character(Rating)), na.rm = TRUE), 2)) %>% 
  arrange(desc(`Average rating`))
## `summarise()` has grouped output by 'Restaurant'. You can override using the
## `.groups` argument.
df_restaurant2
## # A tibble: 3,635 × 4
## # Groups:   Restaurant [3,515]
##    Restaurant                   Location  `Number of reviews` `Average rating`
##    <chr>                        <chr>                   <int>            <dbl>
##  1 362 Gunung Rapat Heong Peah  Ipoh                        2                5
##  2 Agape Foodcourt              JB                          1                5
##  3 Ah Lan Hainanese Satay       Melaka                      1                5
##  4 Ah Ni Bak Kut Teh            Shah Alam                   1                5
##  5 Aji Dataran                  Ipoh                        1                5
##  6 Arabic Food Ttdi Jaya        Shah Alam                   1                5
##  7 Aryan Restaurant             KL                         18                5
##  8 Asam pedas house             JB                          1                5
##  9 Asma Cake House              Kuching                     3                5
## 10 Atap Food Court 新香儐美食閣 JB                          1                5
## # ℹ 3,625 more rows
#Applying Bayesian_Average

global_avg_rating2 <- mean(df_restaurant2$`Average rating`, na.rm = TRUE) #mean of ratings
C2 <- mean(df_restaurant2$`Number of reviews`, na.rm = TRUE) #mean of reviews

df_restaurant2 <- df_restaurant2 %>%
  mutate(Bayesian_Average = ((`Average rating` * `Number of reviews`) + 
                               (C2 * global_avg_rating2)) / (`Number of reviews` + C2)) %>%
  arrange(desc(Bayesian_Average))

#Selecting the top 20
top_20_restaurant2 <- df_restaurant2 %>%
  arrange(desc(Bayesian_Average)) %>%
  head(20)

#Shorten one of the restaurant names so the plot is easier to read
top_20_restaurant2 <- top_20_restaurant2 %>%
  mutate(Modified_Restaurant = 
           ifelse(Restaurant == "The Argan Trees Restaurant-Moroccan and Mediterran- Restaurant Langkawi", 
                  "The Argan Restaurant", Restaurant))
top_20_restaurant2
## # A tibble: 20 × 6
## # Groups:   Restaurant [20]
##    Restaurant     Location `Number of reviews` `Average rating` Bayesian_Average
##    <chr>          <chr>                  <int>            <dbl>            <dbl>
##  1 Canopy Roofto… KL                      1268             4.88             4.82
##  2 Positano Risto KL                       848             4.87             4.79
##  3 Chambers Grill KL                       703             4.86             4.76
##  4 Alhamdulillah… Langkawi                 436             4.9              4.75
##  5 Cielo Sky Din… KL                       490             4.85             4.72
##  6 Haroo Haroo K… Langkawi                 856             4.79             4.72
##  7 Iketeru Resta… KL                      1257             4.75             4.70
##  8 Sausage KL Ca… KL                       393             4.85             4.69
##  9 The Whisky Bar KL                      1284             4.74             4.69
## 10 Healy Mac's    KL                       460             4.78             4.65
## 11 Dining In The… KL                      1783             4.67             4.64
## 12 Haroo Korean … Langkawi                 426             4.75             4.62
## 13 AIN- ARABIA R… Langkawi                 258             4.82             4.61
## 14 The Argan Tre… Langkawi                 296             4.79             4.61
## 15 Opium KL       KL                       890             4.67             4.61
## 16 Kayuputi       Langkawi                 375             4.74             4.60
## 17 Antipodean Ca… Petalin…                 988             4.64             4.59
## 18 Zenzero Resta… KL                       500             4.68             4.58
## 19 MY French Fac… Langkawi                 770             4.64             4.58
## 20 Rubin Mardini… Penang                   200             4.82             4.57
## # ℹ 1 more variable: Modified_Restaurant <chr>
#Plotting the bar graph
ggplot(top_20_restaurant2, aes(x = reorder(Modified_Restaurant, Bayesian_Average), y = Bayesian_Average)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = Location), position = position_dodge(width = 0.9), hjust = -0.1, vjust = -0.5, color = "black", size = 3) +
  coord_flip() +
  labs(title = "Top 20 Restaurants in Malaysia based on Bayesian Average Rating",
       x = "Restaurant",
       y = "Bayesian Average Rating")

Based on the graph above, the top 3 restaurants in KL happen to be the top 3 restaurants in the entirety of Malaysia as well.

Examining the most common words in reviews

To achieve this, we create a word cloud based on a sample of reviews.

#Sample a subset of the data
set.seed(123)  #for reproducibility
df_tripadvisor <- df_tripadvisor %>% sample_n(1000)

#count the words, characters and sentences in each review
df_tripadvisor <- df_tripadvisor %>% 
  mutate(
    word_count= str_count(Review, boundary("word")),
    character_count= nchar(Review),
    sentence_count= str_count(Review, boundary('sentence'))
  )

#Text preprocessing: lowercasing, removing punctuation and numbers, and eliminating stop words.
#creating a corpus
corpus <- Corpus(VectorSource(df_tripadvisor$Review))

#Preprocessing steps:
corpus_clean <- corpus %>% 
  tm_map(content_transformer(tolower)) %>%  # Convert text to lower case
  tm_map(removePunctuation) %>% # Remove punctuation
  tm_map(removeNumbers) %>%   # Remove all numbers
  tm_map(removeWords, stopwords('english')) # Remove common stop words

#Word Frequency Analysis
dtm <- TermDocumentMatrix(corpus_clean)

m <- as.matrix(dtm)
word_freqs <- sort(rowSums(m), decreasing= TRUE)
word_freqs_df <- data.frame(word= names(word_freqs), freq=word_freqs)
head(word_freqs_df)
##                  word freq
## food             food  802
## good             good  592
## service       service  356
## place           place  347
## restaurant restaurant  336
## great           great  307
#using word cloud to visualize the frequency of words
wordcloud(names(word_freqs), freq = word_freqs, min.freq = 1, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

The word cloud shows which words are mentioned most often in the reviews; ‘food’, ‘good’ and ‘service’ top the frequency table.

Identifying which location has the highest average ratings

#Based on location
df_tripadvisor <- read.csv('C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/TripAdvisor_data_cleaned.csv')
df_googlemaps <- read.csv('C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/GoogleReview_data_cleaned.csv')
df <- bind_rows(df_tripadvisor, df_googlemaps)

#using a weighted rating to identify which locations have the highest average ratings
m <- mean(df$Rating, na.rm = TRUE)
v <- 50 # Minimum review threshold

weighted_ratings <- df %>%
  group_by(Location) %>%
  summarise(Average_Rating = mean(Rating, na.rm = TRUE),
            Review_Count = n()) %>%
  mutate(Weighted_Rating = (Review_Count * Average_Rating + v * m) / (Review_Count + v)) %>%
  arrange(desc(Weighted_Rating))

#plotting the graph
ggplot(weighted_ratings[1:10, ], aes(x = reorder(Location, Weighted_Rating), y = Weighted_Rating)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Locations by Weighted Average Rating",
       x = "Location",
       y = "Weighted Average Rating") 

Based on this, we can identify that Langkawi has the highest weighted average rating. The weighting pulls locations with few reviews toward the overall mean rating, which addresses the potential bias of small sample sizes.

Modelling

We read in one of our two restaurant review datasets, either Google Maps or TripAdvisor. Due to hardware limitations, we select only 10,000 random rows from the dataset for model training and testing.

#Load ONE dataset: keep the line for the dataset to be modelled and comment out the other
#data <- read.csv("C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/GoogleReview_data_cleaned.csv", stringsAsFactors = FALSE)
data <- read.csv("C:/Users/User/OneDrive - Universiti Malaya/PROGRAMMING FOR DATA SCIENCE/assignment/Malaysia Restaurant Review Datasets/data_cleaned/TripAdvisor_countries.csv", stringsAsFactors = FALSE)
data <- sample_n(data, 10000) #random sample of 10,000 rows

For model training, we only take into account the restaurant name, location, review and rating columns. We feed the models numeric inputs only, so the restaurant name and location variables are converted to integer codes. Note that these codes follow the sorted order of the names and therefore impose an arbitrary ordering on the categories.
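The integer encoding can be illustrated on a handful of hypothetical location values. The sketch shows that as.numeric(factor(x)) returns each value’s position among the sorted unique values, and that the original labels can be recovered from the factor levels.

```r
# Hypothetical location values
location <- c("Penang", "KL", "Ipoh", "KL")

# factor() sorts the unique values into levels;
# as.numeric() then returns each value's level index
loc_factor <- factor(location)
loc_code <- as.numeric(loc_factor)

print(levels(loc_factor))
print(loc_code)

# The codes can be mapped back to the original labels through the levels
recovered <- levels(loc_factor)[loc_code]
print(recovered)
```

Because the codes are only positions in a sorted list, a code of 3 is not "more" than a code of 1 in any meaningful sense; tree-based models cope with this better than linear regression does.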

data <- subset(data, select=c('Restaurant', 'Location', 'Review', 'Rating'))
data$Restaurant <- as.numeric(factor(data$Restaurant))
data$Location <- as.numeric(factor(data$Location))

As for the review column, several preprocessing steps are applied: conversion to lowercase, removal of punctuation and numbers, and collapsing of repeated whitespace. Each review is then split into a list of word tokens.

data$Review <- tolower(data$Review)
data$Review <- removePunctuation(data$Review)
data$Review <- removeNumbers(data$Review)
data$Review <- stripWhitespace(data$Review)
data$Review <- strsplit(data$Review, " ")

We create word embeddings for the text in the review column using the word2vec package. Once every word in the corpus has its associated embedding, the doc2vec function from the same package combines the word vectors of each review, so that each row in the review column has a single embedding vector.
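The idea of turning word vectors into one vector per review can be sketched in base R. This is a simplification under stated assumptions: the three-dimensional word vectors are made up, the helper embed_review is ours, and a plain average is used, whereas the package’s doc2vec applies its own scaling when combining word vectors.

```r
# Hypothetical 3-dimensional word vectors (real ones are learned by word2vec)
word_vectors <- rbind(
  food    = c(0.9, 0.1, 0.0),
  good    = c(0.2, 0.8, 0.1),
  service = c(0.1, 0.3, 0.7)
)

# Tokenised reviews, in the same list-of-tokens shape as data$Review
reviews <- list(c("good", "food"), c("good", "service", "unknownword"))

# Average the vectors of the known words in a review; all NA if none are known
embed_review <- function(tokens, vectors) {
  known <- tokens[tokens %in% rownames(vectors)]
  if (length(known) == 0) return(rep(NA_real_, ncol(vectors)))
  colMeans(vectors[known, , drop = FALSE])
}

# One row per review, one column per embedding dimension
review_embeddings <- t(vapply(reviews, embed_review,
                              numeric(ncol(word_vectors)),
                              vectors = word_vectors))
print(review_embeddings)
```

Words missing from the vocabulary are skipped, and a review with no known words yields a row of NA, which is one reason missing values need handling at prediction time.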

w2v <- word2vec(data$Review, iter=10)
d2v <- doc2vec(w2v, data$Review, type='embedding')

Now that we have one embedding vector for each review, we append the restaurant name, location and rating columns from the original dataset.

embedding_matrix <- data.frame(d2v, 
                               restaurant = data$Restaurant, 
                               location = data$Location, 
                               rating = data$Rating)

We split the embedding matrix into a training set and test set with a ratio of 80:20.

trainIndex <- createDataPartition(embedding_matrix$rating, p = .8, list = FALSE)
train_set <- embedding_matrix[trainIndex, ]
test_set <- embedding_matrix[-trainIndex, ]

The following function is used to evaluate each fitted model. It predicts restaurant ratings from the test-set features (the review embeddings plus the restaurant and location codes) and reports our main evaluation metrics: accuracy (after rounding each prediction to the nearest integer), mean absolute error, mean squared error, and R-squared.

predict_function <- function(model){
  predictions <- predict(model, newdata = test_set)
  actual <- test_set$rating
  
  #drop rows with a missing prediction or rating, keeping both vectors aligned
  keep <- !is.na(predictions) & !is.na(actual)
  predictions <- predictions[keep]
  actual <- actual[keep]
  
  #accuracy: share of predictions that match the rating after rounding
  rounded_predictions <- round(predictions)
  accuracy <- mean(rounded_predictions == actual)
  print(paste("Accuracy:", round(accuracy, 2)))
  
  MAE <- mean(abs(actual - predictions))
  MSE <- mean((actual - predictions)^2)
  R2 <- 1 - sum((actual - predictions)^2) / sum((actual - mean(actual))^2)
  
  print(paste("Mean Absolute Error:", round(MAE, 2)))
  print(paste("Mean Squared Error:", round(MSE, 2)))
  print(paste("R-squared:", round(R2, 2)))
}
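The four metrics can be checked by hand on a small example. The ratings and predictions below are hypothetical and serve only to show how each number is computed.

```r
# Hypothetical true ratings and raw regression predictions
actual    <- c(5, 4, 3, 5, 1)
predicted <- c(4.6, 4.2, 3.9, 4.4, 2.0)

# Share of predictions that equal the true rating after rounding
accuracy <- mean(round(predicted) == actual)

# Mean absolute error and mean squared error on the raw predictions
mae <- mean(abs(actual - predicted))
mse <- mean((actual - predicted)^2)

# R-squared: 1 minus residual sum of squares over total sum of squares
r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

print(c(accuracy = accuracy, MAE = mae, MSE = mse, R2 = r2))
```

Accuracy only rewards exact hits after rounding, while MAE and MSE also credit near misses, so a regression model can have a modest accuracy and still a reasonably small error.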

We experimented with three different machine learning models, namely linear regression, decision tree, and random forest.

Linear regression

model <- lm(rating ~ ., data = train_set)
predict_function(model)
## [1] "Accuracy: 0.53"
## [1] "Mean Absolute Error: 0.6"
## [1] "Mean Squared Error: 0.63"
## [1] "R-squared: 0.37"

Decision tree

model <- rpart(rating ~ ., data = train_set)
predict_function(model)
## [1] "Accuracy: 0.5"
## [1] "Mean Absolute Error: 0.66"
## [1] "Mean Squared Error: 0.8"
## [1] "R-squared: 0.18"

Random forest

model <- randomForest(rating ~ ., data = train_set, na.action = na.omit, ntree=100)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values.  Are you sure you want to do regression?
predict_function(model)
## [1] "Accuracy: 0.54"
## [1] "Mean Absolute Error: 0.59"
## [1] "Mean Squared Error: 0.61"
## [1] "R-squared: 0.38"

Model evaluation and interpretation

For different models

  • Linear regression: Lower accuracy and poor overall performance. Negative R-squared values at larger sample sizes of the Google Maps data indicate that the model may not capture the complexity of the data.

  • Decision tree: Moderate accuracy. As with linear regression, R-squared values remain low or negative, suggesting that predictive power is limited and the model cannot capture the full complexity of the data.

  • Random forest: Accuracy and R-squared both rose as the sample size increased, indicating better generalization than the other models, especially on the TripAdvisor data.

For different datasets

  • Google Maps: The decision tree model consistently outperforms the other two models in terms of accuracy. In terms of R-squared, the random forest model improves as the sample size increases.

  • TripAdvisor: The accuracy of all models improved as the sample size increased, with the random forest model showing the best performance at the largest sample size. Similarly, R-squared increased with the sample size for all models, with the random forest achieving the highest value at the largest sample size.

Model selection

  • Based on multiple evaluation metrics such as accuracy and R-squared, the random forest model is the best choice, especially for large datasets such as the TripAdvisor dataset. This model can help businesses understand customer data, leading to more accurate predictions and better decisions.

Business strategy suggestions

  • With the random forest model, businesses can uncover patterns in customer behaviour and preferences. Because the model can handle complex interactions between features, it can capture nuances in those preferences. Businesses can use the results to improve service quality or adjust menus in a more targeted way, raising customer satisfaction and operational efficiency and, in turn, profits and market competitiveness.

Conclusion

In conclusion, this analysis of Malaysia’s restaurant scene, using data from TripAdvisor and Google Maps, has yielded several insights. Key trends in customer preferences were identified, particularly high satisfaction rates in the Kuala Lumpur region. The random forest model gave the best predictive performance of the three models tested, although its accuracy and R-squared leave room for improvement. These insights are valuable for businesses, suggesting ways to enhance service quality and diversify menus. Looking ahead, there is potential for exploring regional tastes and the impact of seasonal trends, which could further enrich our understanding of the dynamic culinary landscape in Malaysia. This project highlights the power of data-driven analysis in transforming the restaurant industry.