MATH2349 Data Preprocessing

Required Packages

library(readr)
library(readxl)
library(foreign)
library(gdata)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(validate)
library(Hmisc)
library(stringr)
library(lubridate)
library(outliers)
library(MVN)
library(infotheo)
library(MASS)
library(caret)
library(mlr)
library(knitr)
library(forecast)
library(ggplot2)

Executive Summary

The aim of this assignment is to use the knowledge acquired throughout this course to preprocess some open data. The data sets used in this assignment are about restaurants around the world. It provides information which includes the location, cuisine and ratings.

The 2 data sets used, zomato.csv and Country-Code.xlsx, were read in and merged (right_join) by a common variable. In order for it to be easy to work with, non-essential variables were omitted and the remaining variables were re-organised and re-labelled.

For a better understanding of the data set, an analysis of the structure and variable class type was conducted. It showed that there were numerical (integer), numerical (double) and character class types. Some variables were then made into factors and had levels named.

The data was already in a tidy format, so no alterations were needed. A new column was created through the mutate function which categorised the ratings of the restaurant into a 5-point star system. This is a more appropriate way to assess the quality of a restaurant.

A number of missing values were identified in the data set. The missing values that were of numerical class were replaced by the mean of the corresponding variable. Whereas for the categorical variables (factor/character class), the missing values were replaced by the mode of the corresponding variables. This is a suitable way to treat missing values.

In order to handle the outlier data points, a capping method was implemented. This would cap the outlier values at certain upper and lower percentile limits. This removed a large portion of the outlier results, which would have had skewed the results of the data if not dealt with.

Finally, transformations were made on certain variables. These transformations reduce the skewness of the data and make the distribution more normal. A normal distribution is a more appropriate way to conduct statistical analyses. Logarithm and power transformations were used, which proved effective.

Data

The data sets used in this assignments provide a wide range of information on different restaurants around the world. These data sets are obtained from https://www.kaggle.com/shrutimehta/zomato-restaurants-data, which is originally sourced from the Zomato API https://developers.zomato.com/api#headline1.

There are 2 data sets used in this assignment. The main data set zomato.csv, has extensive information on the restaurants location, address, cuisine, services, price, affordability and rating. The other data set Country-Code.xlsx, shows the restaurants location by country.

Utilising the readr and readxl packages, the 2 data sets are imported and read in. They are in csv and excel format, so the read_csv and read_excel functions are used.

These data sets are checked and then merged (right_join) by a common variable, “Country Code”. The right_join is appropriate as it matches rows from restaurant to country_code.

The heads of the original data sets and the merged data set.

setwd("C:/Users/Omer Ahmed/Desktop/data preprossing/assignment.3")    
restaurant<-read_csv("zomato.csv")
country_code<-read_excel("Country-Code.xlsx")     
Restaurant_1<-right_join(restaurant,country_code,by="Country Code")

head(restaurant, n=10)

## # A tibble: 10 x 21
##    `Restaurant ID` `Restaurant Name` `Country Code` City  Address Locality
##              <int> <chr>                      <int> <chr> <chr>   <chr>   
##  1         6317637 Le Petit Souffle             162 Maka~ Third ~ Century~
##  2         6304287 Izakaya Kikufuji             162 Maka~ Little~ Little ~
##  3         6300002 Heat - Edsa Shan~            162 Mand~ Edsa S~ Edsa Sh~
##  4         6318506 Ooma                         162 Mand~ Third ~ SM Mega~
##  5         6314302 Sambo Kojin                  162 Mand~ Third ~ SM Mega~
##  6        18189371 Din Tai Fung                 162 Mand~ Ground~ SM Mega~
##  7         6300781 Buffet 101                   162 Pasa~ Buildi~ SM by t~
##  8         6301290 Vikings                      162 Pasa~ Buildi~ SM by t~
##  9         6300010 Spiral - Sofitel~            162 Pasa~ Plaza ~ Sofitel~
## 10         6314987 Locavore                     162 Pasi~ Brixto~ Kapitol~
## # ... with 15 more variables: `Locality Verbose` <chr>, Longitude <dbl>,
## #   Latitude <dbl>, Cuisines <chr>, `Average Cost for two` <int>,
## #   Currency <chr>, `Has Table booking` <chr>, `Has Online
## #   delivery` <chr>, `Is delivering now` <chr>, `Switch to order
## #   menu` <chr>, `Price range` <int>, `Aggregate rating` <dbl>, `Rating
## #   color` <chr>, `Rating text` <chr>, Votes <int>

head(country_code, n=10)

## # A tibble: 10 x 2
##    `Country Code` Country     
##             <dbl> <chr>       
##  1             1. India       
##  2            14. Australia   
##  3            30. Brazil      
##  4            37. Canada      
##  5            94. Indonesia   
##  6           148. New Zealand 
##  7           162. Phillipines 
##  8           166. Qatar       
##  9           184. Singapore   
## 10           189. South Africa

head(Restaurant_1, n=10)

## # A tibble: 10 x 22
##    `Restaurant ID` `Restaurant Name` `Country Code` City  Address Locality
##              <int> <chr>                      <dbl> <chr> <chr>   <chr>   
##  1         3400025 Jahanpanah                    1. Agra  E 23, ~ Agra Ca~
##  2         3400341 Rangrezz Restaur~             1. Agra  E-20, ~ Agra Ca~
##  3         3400005 Time2Eat - Mama ~             1. Agra  Main M~ Agra Ca~
##  4         3400021 Chokho Jeeman Ma~             1. Agra  1/48, ~ Civil L~
##  5         3400017 Pinch Of Spice                1. Agra  23/453~ Civil L~
##  6         3400325 MoMo Cafe                     1. Agra  Courty~ Courtya~
##  7         3400059 Peshawri - ITC M~             1. Agra  ITC Mu~ ITC Mug~
##  8         3400060 Taj Bano - ITC M~             1. Agra  ITC Mu~ ITC Mug~
##  9         3400348 G Thal                        1. Agra  3/20, ~ Khandari
## 10         3400072 Dawat-e-Nawab - ~             1. Agra  Radiss~ Radisso~
## # ... with 16 more variables: `Locality Verbose` <chr>, Longitude <dbl>,
## #   Latitude <dbl>, Cuisines <chr>, `Average Cost for two` <int>,
## #   Currency <chr>, `Has Table booking` <chr>, `Has Online
## #   delivery` <chr>, `Is delivering now` <chr>, `Switch to order
## #   menu` <chr>, `Price range` <int>, `Aggregate rating` <dbl>, `Rating
## #   color` <chr>, `Rating text` <chr>, Votes <int>, Country <chr>

Understand

Data Set Formatting

The Restaurant_1 data set has a lot of variables. The data set has been refined and organised to make it easier to work with.

A number of variables have been cut from Restaurant_1 and the remaining variables have then been ordered and re-named in a more suitable way.

#Variable Formatting
Restaurant_1$`Restaurant ID`<-NULL
Restaurant_1$`Country Code`<-NULL
Restaurant_1$Address<-NULL
Restaurant_1$Locality<-NULL
Restaurant_1$`Locality Verbose`<-NULL
Restaurant_1$Longitude<-NULL
Restaurant_1$Latitude<-NULL
Restaurant_1$Currency<-NULL
Restaurant_1$`Is delivering now`<-NULL
Restaurant_1$`Switch to order menu`<-NULL
Restaurant_1$`Price range`<-NULL
Restaurant_1$`Rating color`<-NULL

#Re-order Columns
Restaurant_2<-Restaurant_1[c(1,10,2,3,4,5,6,7,8,9)]
colnames(Restaurant_2)<-c("Re_Name","Re_country","Re_city","Cuisines","Average cost for two by country currency","Booking","Delivery","Aggregate Rating","Rating Text","Votes")

Data Structure

As shown below, the new data set (Restaurant_2) has been reduced from 21 variables to 10 variables, which is more manageable.

str(Restaurant_2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    9551 obs. of  10 variables:
##  $ Re_Name                                 : chr  "Jahanpanah" "Rangrezz Restaurant" "Time2Eat - Mama Chicken" "Chokho Jeeman Marwari Jain Bhojanalya" ...
##  $ Re_country                              : chr  "India" "India" "India" "India" ...
##  $ Re_city                                 : chr  "Agra" "Agra" "Agra" "Agra" ...
##  $ Cuisines                                : chr  "North Indian, Mughlai" "North Indian, Mughlai" "North Indian" "Rajasthani" ...
##  $ Average cost for two by country currency: int  850 700 500 400 1000 2000 2500 2500 800 3600 ...
##  $ Booking                                 : chr  "No" "No" "No" "No" ...
##  $ Delivery                                : chr  "No" "No" "No" "No" ...
##  $ Aggregate Rating                        : num  3.9 3.5 3.6 4 4.2 4 4.3 4 3.6 3.8 ...
##  $ Rating Text                             : chr  "Good" "Good" "Good" "Very Good" ...
##  $ Votes                                   : int  140 71 94 87 177 45 133 41 59 46 ...

Class of Variables

The code below provides a summary of the various classes of Restaurant_2. It has character, numeric (integer) and numeric (double) variable types.

for(i in Restaurant_2) {print(typeof(i))}

## [1] "character"
## [1] "character"
## [1] "character"
## [1] "character"
## [1] "integer"
## [1] "character"
## [1] "character"
## [1] "double"
## [1] "character"
## [1] "integer"

To make the data set easier to understand and work with, the Booking and Delivery variables been changed to factors. The values of these variables have been adjusted to Available and Unavailable as levels.

Restaurant_2$Booking<-factor(Restaurant_2$Booking,levels = c("No","Yes"),labels = c("Available","Unavailable"))
Restaurant_2$Delivery<-factor(Restaurant_2$Delivery,levels = c("No","Yes"),labels = c("Available","Unavailable"))

str(Restaurant_2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    9551 obs. of  10 variables:
##  $ Re_Name                                 : chr  "Jahanpanah" "Rangrezz Restaurant" "Time2Eat - Mama Chicken" "Chokho Jeeman Marwari Jain Bhojanalya" ...
##  $ Re_country                              : chr  "India" "India" "India" "India" ...
##  $ Re_city                                 : chr  "Agra" "Agra" "Agra" "Agra" ...
##  $ Cuisines                                : chr  "North Indian, Mughlai" "North Indian, Mughlai" "North Indian" "Rajasthani" ...
##  $ Average cost for two by country currency: int  850 700 500 400 1000 2000 2500 2500 800 3600 ...
##  $ Booking                                 : Factor w/ 2 levels "Available","Unavailable": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Delivery                                : Factor w/ 2 levels "Available","Unavailable": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Aggregate Rating                        : num  3.9 3.5 3.6 4 4.2 4 4.3 4 3.6 3.8 ...
##  $ Rating Text                             : chr  "Good" "Good" "Good" "Very Good" ...
##  $ Votes                                   : int  140 71 94 87 177 45 133 41 59 46 ...

for(i in Restaurant_2) {print(typeof(i))}

## [1] "character"
## [1] "character"
## [1] "character"
## [1] "character"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "double"
## [1] "character"
## [1] "integer"

Tidy & Manipulate Data 1

Fortunately, our data was in a tidy format as it fulfils the required rules of tidy data.

Tidy & Manipulate Data 2

In order to better categorise the rating of a restaurant, a 5-star rating system is adopted. A 5 star restaurant corresponds to an “Excellent” review from the Rating Text variable and a 1 star corresponds to a “Poor” rating. Furthermore, some restaurants do not have a Rating Text value, so they are given a “Not Rated” value.

Restaurant_3<-Restaurant_2 %>% mutate(Rating=ifelse(`Rating Text`=="Excellent","Five Star",ifelse(`Rating Text`=="Very Good","Four Star",ifelse(`Rating Text`=="Good","Three Star",ifelse(`Rating Text`=="Average","Two Star",ifelse(`Rating Text`=="Poor","One Star","Not Rated"))))))

## Warning: package 'bindrcpp' was built under R version 3.4.3

Scan 1

The number of missing values was found for the different columns using the colSums function. The location of these missing values were then identified. For the variables with numerical values, the missing values were replaced by the mean of that column. This option was chosen as it doesn’t statistically alter the data.

The Rating and Rating Text variables has character values. The missing values can’t be replaced with the mean of categorical data. However, the highest occurring string can be calculated and replace the missing values.

The user developed mode function is used to determine the highest occurring character values. All the missing values have been effectively dealt with and fulfil the requirements.

# Check NA
colSums(is.na(Restaurant_3))

##                                  Re_Name 
##                                        0 
##                               Re_country 
##                                        0 
##                                  Re_city 
##                                        0 
##                                 Cuisines 
##                                        0 
## Average cost for two by country currency 
##                                        6 
##                                  Booking 
##                                        0 
##                                 Delivery 
##                                        0 
##                         Aggregate Rating 
##                                       12 
##                              Rating Text 
##                                        4 
##                                    Votes 
##                                       14 
##                                   Rating 
##                                        4

NA_Aggregate_rating<-which(is.na(Restaurant_3$`Aggregate Rating`))
NA_Votes<-which(is.na(Restaurant_3$Votes))
NA_Average_cost_for2<-which(is.na(Restaurant_3$`Average cost for two by country currency`))
NA_Rating<-which(is.na(Restaurant_3$Rating))
NA_Rating_Text<-which(is.na(Restaurant_3$`Rating Text`))
NA_Cuisines<-which(is.na(Restaurant_3))

# Replacing NA
Restaurant_3$`Aggregate Rating`[NA_Aggregate_rating]<-mean(Restaurant_3$`Aggregate Rating`,na.rm = T)
Restaurant_3$Votes[NA_Votes]<-mean(Restaurant_3$Votes,na.rm = T)
Restaurant_3$`Average cost for two by country currency`[NA_Average_cost_for2]<-mean(Restaurant_3$`Average cost for two by country currency`,na.rm = T)


mode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

Restaurant_3$Rating[NA_Rating]<-mode(Restaurant_3$Rating)
Restaurant_3$`Rating Text`[NA_Rating_Text]<-mode(Restaurant_3$`Rating Text`)


# check NA
colSums(is.na(Restaurant_3))

##                                  Re_Name 
##                                        0 
##                               Re_country 
##                                        0 
##                                  Re_city 
##                                        0 
##                                 Cuisines 
##                                        0 
## Average cost for two by country currency 
##                                        0 
##                                  Booking 
##                                        0 
##                                 Delivery 
##                                        0 
##                         Aggregate Rating 
##                                        0 
##                              Rating Text 
##                                        0 
##                                    Votes 
##                                        0 
##                                   Rating 
##                                        0

Scan 2

A capping method has been used in order to adequately deal with the outliers. The capping method works by substituting outlier data points with upper or lower limit percentile values. In the first case Replacing_by_median, the upper and lower percentile limits are set at 25% and 75%, respectively.

The Replacing_by_median function works well at removing a large portion of the outlier results in the Average cost for two by country currency data.

The Replacing_by_1st_Qu function operates in a similar way to theReplacing_by_median function, but it replaces the outlier results by the 1st quantile. The outlier result for Aggregate Rating is effectively dealt with.

The Replacing_by_3rd function has extended upper and lower percentile limits of 5% and 95%, to better handle the larger amount of outlier results in the Votes data set. This is proves very successful as all outlier results have been removed.

Replacing_by_median <- function(x) {
 quantiles <- quantile( x, c( 0.25, 0.50, .75 ) )
 x[ x < quantiles[1] - 1.5*IQR(x) ] <- quantiles[2]
 x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[2]
 x}

par(mfrow=c(1,2))

boxplot(Restaurant_3$`Average cost for two by country currency`,main="Before Outlier Removal ", col= "red")
After_Capped<-Restaurant_3$`Average cost for two by country currency` %>% Replacing_by_median()
boxplot(After_Capped,main="After Outlier Removal by Median", col= "red")

#####################
Replacing_by_1st_Qu<- function(x) {
 quantiles <- quantile( x, c( 0.25, 0.50, .75 ) )
 x[ x < quantiles[1] - 1.5*IQR(x) ] <- quantiles[1]
 x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[1]
 x}
par(mfrow=c(1,2))
boxplot(Restaurant_3$`Aggregate Rating`,main="Before Outlier Removal ",col= "blue")
After_Capped_1<-Restaurant_3$`Aggregate Rating` %>% Replacing_by_1st_Qu()
boxplot(After_Capped_1,main="After Outlier Removal by 1st Qu",col= "blue")

###################
Replacing_by_3rd<- function(x) {
 quantiles <- quantile( x, c( 0.05,0.25, 0.75, .95 ) )
 x[ x < quantiles[1] - 1.5*IQR(x) ] <- quantiles[3]
 x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[3]
 x}
par(mfrow=c(1,2))
boxplot(Restaurant_3$Votes,main="Before Outlier Removal ",col="burlywood")
After_Capped_2<-Restaurant_3$Votes %>% Replacing_by_3rd()
boxplot(After_Capped_2,main="After Outlier Removal by 3rd Qu",col= "burlywood")

Transform

Transformations have been applied on the Average cost for two by country currency and Aggregate Rating variables. The transformation aim is to reduce the skewness of the data and turn it into a normal distribution.

For the Average cost for two by country currency data, a log transformation has been used. In this instance it spreads out the congested data. Typically, log transformations are implemented on data that has a large portion of high values, (right-skewed). As you can see, the result shows data that resembles a normal distribution. However, there does seem to be a fairly high peak in the centre of the distribution.

For the Aggregate Rating data, a power of (-1/6) is applied. This reduces the skewness (left-skewed) of the data, by shifting it across. A normal distribution for this data is then achieved.

par(mfrow=c(1,2))
hist(Restaurant_3$`Average cost for two by country currency`,main="Before Transformation",col = "chartreuse")
T_average<-log10(Restaurant_3$`Average cost for two by country currency`)
hist(T_average,main="After Transformation",col = "chartreuse")

###############
par(mfrow=c(1,2))
hist(Restaurant_3$`Aggregate Rating`,main="Before Transformation",col = "coral")
T_Aggregate_Rating<-(Restaurant_3$`Aggregate Rating`)^(-1/6)
hist(T_Aggregate_Rating,main="After Transformation",col = "coral")

#################

fivestar<-Restaurant_3 %>% filter(Rating=="Five Star")
countfive<-fivestar %>% group_by(Re_country) %>% count()
countfive[countfive$n==sort(countfive$n,T)[1],]

## # A tibble: 1 x 2
## # Groups:   Re_country [1]
##   Re_country     n
##   <chr>      <int>
## 1 India        116

countfive[countfive$n==sort(countfive$n,T)[2],]

## # A tibble: 1 x 2
## # Groups:   Re_country [1]
##   Re_country        n
##   <chr>         <int>
## 1 United States    68

Conclusion

The restaurants that are rated “Five Star” are filtered into a new dataframe, fivestar. The “Five Star” restaurants are then grouped by Re_country. The restaurants country of origin are listed and the highest ranked restaurants are shown. The country with the highest number of “Five Star” restaurants is India with 116. This is followed by the United States which has 68 restaurants with “Five Star” ratings.

MATH2349 Data Preprocessing - Assignment 3

Omar Alhelali-student NO.3609226 Piyush Bhatt- student NO. 3652293 Selwyn Tahoune - student NO. 3725435

22 May 2018