This report documents the steps I took while working on the competition held on MachineHack and sponsored by IMS Proschool: predicting the delivery time of a restaurant order based on a few variables. I loosely followed the workflow provided by Hadley Wickham and Garrett Grolemund in R for Data Science, as shown in the following image:
Figure 1: Workflow Diagram provided in R for Data Science
Here is the description of this Food Delivery Time competition provided by MachineHack:
The entire world is transforming digitally and our relationship with technology has grown exponentially over the last few years. We have grown closer to technology, and it has made our life a lot easier by saving time and effort. Today everything is accessible with smartphones — from groceries to cooked food and from medicines to doctors. In this hackathon, we provide you with data that is a by-product as well as a thriving proof of this growing relationship.
When was the last time you ordered food online? And how long did it take to reach you?
In this hackathon, we are providing you with data from thousands of restaurants in India regarding the time they take to deliver food for online order. As data scientists, your goal is to predict the online order delivery time based on the given factors.
Analytics India Magazine and IMS Proschool bring to you ‘Predicting Food Delivery Time Hackathon’.
require(readxl) # Read in provided data from xlsx files
require(writexl) # Write out xlsx files
require(tidyverse) # Dataframe package
require(caret) # Machine learning package
require(naniar) # for visualizing missing data
require(ggmap) # For geocoding city
require(mltools) # For gini impurity measure
require(plotly) # Interactive plots
# ggmap citation - D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2.
# The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
MachineHack provides two files for the competition: Data_Train.xlsx and Data_Test.xlsx.
Both files are already prepared in convenient xlsx format that we can import easily. I manually downloaded these files and saved them under ./data/, as ./data/Data_Train.xlsx and ./data/Data_Test.xlsx.
Now we can easily import the data with the following commands:
data_train_raw <- read_excel("./data/Data_Train.xlsx")
data_test_raw <- read_excel("./data/Data_Test.xlsx")
Now that we have imported our data, let’s look it over to get an idea of what we are working with. One way to get a glimpse of our data is the str command, which compactly displays each feature of the dataframe on a single line.
str(data_train_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame': 11094 obs. of 9 variables:
## $ Restaurant : chr "ID_6321" "ID_2882" "ID_1595" "ID_5929" ...
## $ Location : chr "FTI College, Law College Road, Pune" "Sector 3, Marathalli" "Mumbai Central" "Sector 1, Noida" ...
## $ Cuisines : chr "Fast Food, Rolls, Burger, Salad, Wraps" "Ice Cream, Desserts" "Italian, Street Food, Fast Food" "Mughlai, North Indian, Chinese" ...
## $ Average_Cost : chr "<U+20B9>200" "<U+20B9>100" "<U+20B9>150" "<U+20B9>250" ...
## $ Minimum_Order: chr "<U+20B9>50" "<U+20B9>50" "<U+20B9>50" "<U+20B9>99" ...
## $ Rating : chr "3.5" "3.5" "3.6" "3.7" ...
## $ Votes : chr "12" "11" "99" "176" ...
## $ Reviews : chr "4" "4" "30" "95" ...
## $ Delivery_Time: chr "30 minutes" "30 minutes" "65 minutes" "30 minutes" ...
Let us use the head command to look at the first few rows of our data:
head(data_train_raw)
## # A tibble: 6 x 9
## Restaurant Location Cuisines Average_Cost Minimum_Order Rating Votes
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ID_6321 FTI Col~ Fast Fo~ <U+20B9>200 <U+20B9>50 3.5 12
## 2 ID_2882 Sector ~ Ice Cre~ <U+20B9>100 <U+20B9>50 3.5 11
## 3 ID_1595 Mumbai ~ Italian~ <U+20B9>150 <U+20B9>50 3.6 99
## 4 ID_5929 Sector ~ Mughlai~ <U+20B9>250 <U+20B9>99 3.7 176
## 5 ID_6123 Rmz Cen~ Cafe, B~ <U+20B9>200 <U+20B9>99 3.2 521
## 6 ID_5221 Rmz Cen~ South I~ <U+20B9>150 <U+20B9>50 3.8 46
## # ... with 2 more variables: Reviews <chr>, Delivery_Time <chr>
The head function shows much of the same information as str in this case; however, the different layout can be helpful. For instance, it is easier to see that Cuisines and Location are comma separated with varying amounts of information in each row. We can also see that the values in Average_Cost and Minimum_Order are preceded by the Unicode character U+20B9, which represents the Indian rupee.
Let’s do some basic cleaning of our data so that we can inspect it more easily. First, we are going to remove the Indian rupee symbol from the two columns Average_Cost and Minimum_Order. We can write a short function for this, since we will need to apply it to both the training set and the test set.
# Remove all occurrences of bad_char (matched literally) from the column given_col
remove_char <- function(dataset, bad_char, given_col){
  dataset[[given_col]] <- str_replace_all(dataset[[given_col]], fixed(bad_char), "")
  return(dataset)
}
data_train <- data_train_raw %>%
remove_char("\u20b9", "Average_Cost") %>%
remove_char("\u20b9", "Minimum_Order")
We have now created a variable data_train with our modifications to the raw downloaded data. We can again use the head function to see that we correctly modified those two columns:
head(data_train)
## # A tibble: 6 x 9
## Restaurant Location Cuisines Average_Cost Minimum_Order Rating Votes
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ID_6321 FTI Col~ Fast Fo~ 200 50 3.5 12
## 2 ID_2882 Sector ~ Ice Cre~ 100 50 3.5 11
## 3 ID_1595 Mumbai ~ Italian~ 150 50 3.6 99
## 4 ID_5929 Sector ~ Mughlai~ 250 99 3.7 176
## 5 ID_6123 Rmz Cen~ Cafe, B~ 200 99 3.2 521
## 6 ID_5221 Rmz Cen~ South I~ 150 50 3.8 46
## # ... with 2 more variables: Reviews <chr>, Delivery_Time <chr>
Before figuring out how to handle and clean each feature, let us get a quick overview of the features themselves. Below is the number of unique entries in each column, out of the 11,094 total observations:
data_train %>% summarise_all(n_distinct)
## # A tibble: 1 x 9
## Restaurant Location Cuisines Average_Cost Minimum_Order Rating Votes
## <int> <int> <int> <int> <int> <int> <int>
## 1 7480 35 2179 26 18 33 1103
## # ... with 2 more variables: Reviews <int>, Delivery_Time <int>
Restaurant(7480 unique entries): “A unique ID that represents a restaurant.”
By visual inspection the unique IDs appear to be random, with numbers ranging from the 1xxx range to the 8xxx range. It is possible the ID is related to the age of the restaurant.
Location(35 unique entries): “The location of the restaurant.”
This feature gives a brief description of the location of the restaurant. The locations span across India but are surprisingly limited to just 35 unique locations.
Cuisines(2179 unique entries): “The cuisines offered by the restaurant.”
There are over 2,000 unique entries in this column. Each entry is a comma-separated list of the types of dishes offered, ranging from a single description such as South Indian to several descriptions such as Fast Food, North Indian, Rolls, Chinese, Momos, Mughlai. This feature will definitely need to be handled differently than it is now, so that we can capture similarities between entries that share details without having to consider 2,179 unique categories.
Average_Cost(26 unique entries): “The average cost for one person/order.”
The average cost in Indian rupees at the restaurant. Values range from 50 to over 2000.
Minimum_Order(18 unique entries): “The minimum order amount.”
The minimum order in Indian rupees at the restaurant to qualify for delivery. Values range from 0 to 500.
Rating(33 unique entries): “Customer rating for the restaurant [from 0 to 5].”
The rating for the restaurant, likely from an online system like Yelp or Google. The values come in increments of 0.1 and can range from 0 to 5.0. There are some exceptions, such as “Opening Soon” and “NEW”, that will need to be handled.
Votes(1103 unique entries): “The total number of customer votes for the restaurant.”
The total number of votes for a restaurant, likely from an online system like Yelp or Google, similar to the previous feature Rating. This is probably the number of votes that went into creating the Rating value; the count itself does not indicate good or bad, but it can be considered a metric of popularity.
Reviews(761 unique entries): “The number of customer reviews for the restaurant.”
The total number of reviews for a restaurant, likely from the same online system as Votes. This is probably the number of written reviews submitted alongside the ratings and votes. It can also be considered a metric of popularity.
Delivery_Time(7 unique entries):
This is our target feature for each restaurant. The value is the number of minutes that the restaurant takes for delivery, grouped into 7 bins: 10 minutes, 20 minutes, 30 minutes, 45 minutes, 65 minutes, 80 minutes, and 120 minutes.
We will now go through each feature individually to inspect and prepare them for further analysis.
There are 7 categories for Delivery_Time, so it seems reasonable to treat this feature as a factor with explicitly ordered levels:
output_levels <- c("10 minutes","20 minutes", "30 minutes", "45 minutes", "65 minutes", "80 minutes", "120 minutes")
data_train$Delivery_Time <- factor(data_train$Delivery_Time, levels = output_levels)
Now, let’s look at the distribution of our data across the different factors:
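A minimal sketch of such a bar chart using ggplot2 (loaded with the tidyverse):
# Bar chart of how often each delivery time occurs
ggplot(data_train, aes(x = Delivery_Time)) +
  geom_bar() +
  xlab("Delivery Time") +
  ylab("Frequency") +
  ggtitle("Distribution of Delivery_Time")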
The x axis represents the 7 unique delivery times and the y axis represents the frequency of each occurrence. We can see that 30 minutes is by far the most frequent delivery time, while 10 minutes, 20 minutes, 80 minutes, and 120 minutes are incredibly infrequent.
There are no missing values to deal with in this column; however, we may still consider removing some of the extremely rare classes such as 10 minutes. For now we will leave Delivery_Time as the 7 unique factor levels.
Average_Cost has 26 unique entries, so let us first convert this column to numeric to get a better understanding.
EDA_avgcost <- data_train
EDA_avgcost$Average_Cost <- as.numeric(data_train$Average_Cost)
## Warning: NAs introduced by coercion
We get a warning that “NAs introduced by coercion.” Let us see what caused this to happen and how many observations were affected.
data_train_raw[is.na(EDA_avgcost$Average_Cost),]
## # A tibble: 25 x 9
## Restaurant Location Cuisines Average_Cost Minimum_Order Rating Votes
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ID_6278 Nathan ~ Chinese~ <U+20B9>1,000 <U+20B9>50 4.7 441
## 2 ID_8117 Mico La~ Italian~ <U+20B9>1,000 <U+20B9>50 4.3 1276
## 3 ID_2545 Delhi H~ North I~ <U+20B9>1,200 <U+20B9>50 3.8 175
## 4 ID_6278 Yerawad~ Chinese~ <U+20B9>1,000 <U+20B9>50 4.7 441
## 5 ID_3549 Laxman ~ Japanese <U+20B9>1,000 <U+20B9>50 3.8 32
## 6 ID_1665 Mumbai ~ America~ <U+20B9>1,000 <U+20B9>50 4.1 2866
## 7 ID_7585 Moulali~ Italian~ <U+20B9>1,200 <U+20B9>50 4.0 655
## 8 ID_8164 Mumbai ~ Chinese <U+20B9>1,000 <U+20B9>50 4.2 687
## 9 ID_8038 Mumbai ~ Europea~ <U+20B9>1,200 <U+20B9>50 4.1 1446
## 10 ID_5870 Rmz Cen~ North I~ <U+20B9>1,150 <U+20B9>50 4.6 1038
## # ... with 15 more rows, and 2 more variables: Reviews <chr>,
## # Delivery_Time <chr>
Looking over the 25 observations that were coerced to NAs, we can see that all but one failed because of the comma used to represent values of 1,000 and greater. We can fix this by using our remove_char function again to remove the commas in this feature.
data_train <- data_train %>% remove_char(",", "Average_Cost")
The only remaining problem entry had the word “for” rather than a number. This seems to be a data entry error with no clear intended value. Its delivery time was the very common value of 30 minutes, so it is simplest to remove the observation:
data_train <- data_train %>% filter(!Average_Cost == "for")
Now we can convert Average_Cost to a numeric column without any errors:
data_train$Average_Cost <- as.numeric(data_train$Average_Cost)
Next we can plot the Average_Cost feature to determine whether we have any significant outliers in our data. One method is to use a box and whiskers plot, where any value outside of the 1.5*IQR range is considered to be an outlier:
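One way to sketch this box and whiskers plot:
# Box and whiskers plot of Average_Cost; points beyond the whiskers
# (1.5*IQR past the quartiles) are drawn as outliers
ggplot(data_train, aes(y = Average_Cost)) +
  geom_boxplot() +
  ylab("Average Cost (rupees)") +
  ggtitle("BoxPlot of Average_Cost") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())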
Unfortunately this approach does not work too well with this feature as its distribution is almost categorical. We end up having many values tied at extremely low prices causing the IQR to be fairly compressed. We can calculate how many outliers the box and whiskers plot identifies using the following function:
# Flag values more than 1.5*IQR below the first quartile or above the third (Tukey's rule)
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
outliers_avgcost<- is_outlier(data_train$Average_Cost)
sum(outliers_avgcost)
## [1] 867
Therefore about 8% of our values are outliers under the box and whiskers approach, which seems a bit extreme. Let us try visualising our data in a different way:
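Here we can use a stretched 1-D plot, spreading observations along the x axis by row index and coloring each point by its output class; a minimal sketch:
# Stretched 1-D plot: row index on x, cost on y, colored by outcome
ggplot(data_train, aes(x = seq_along(Average_Cost), y = Average_Cost)) +
  geom_point(mapping = aes(color = Delivery_Time)) +
  xlab("Stretched 1-D plot") +
  ylab("Average Cost") +
  ggtitle("Stretched 1-D Plot of Average_Cost")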
This method of visualization helps a bit more than the box and whiskers plot, but it is still not perfect. We can see a very clear outlier past 2000 rupees and a very large collection of values at the lower end. Each point is colored by its output class (from Delivery_Time), and we can clearly see the abundance of 30 minute results. Visually, I would interpret the top two points as outliers, based on their distance from the nearest points in terms of cost.
One additional way of plotting this data is to plot the density further broken into the different delivery outcomes:
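A sketch of this density plot, with one curve per Delivery_Time level:
# Density of Average_Cost, broken out by delivery outcome
ggplot(data_train, aes(x = Average_Cost, color = Delivery_Time)) +
  geom_density() +
  xlab("Average Cost (rupees)") +
  ggtitle("Density of Average_Cost by Delivery_Time")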
We can see clear spikes for the 10 minute and 80 minute deliveries, which implies that this feature could be very helpful in determining those outputs. The one caveat is that these two outcomes, 10 minutes and 80 minutes, occur so rarely that the spikes could be caused by just a handful of coincidental groupings.
Minimum_Order has 18 unique entries, so let us first convert this column to numeric to get a better understanding.
data_train$Minimum_Order <- as.numeric(data_train$Minimum_Order)
Next we can plot the Minimum_Order feature to determine if we have any significant outliers in our data. Again we can use a box and whiskers plot where any value outside of the 1.5*IQR is considered to be an outlier:
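The analogous box and whiskers sketch for Minimum_Order:
# Box and whiskers plot of Minimum_Order
ggplot(data_train, aes(y = Minimum_Order)) +
  geom_boxplot() +
  ylab("Minimum Order (rupees)") +
  ggtitle("BoxPlot of Minimum_Order") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())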
Unfortunately this approach again does not work too well with this feature, as it behaves almost like a categorical variable. We have an extremely large number of minimum orders equal to 50, causing the IQR to be almost a pencil-thin line. We can calculate how many outliers the box and whiskers plot identifies using our function from before:
outliers_minorder<- is_outlier(data_train$Minimum_Order)
sum(outliers_minorder)
## [1] 976
Therefore about 9% of our values are outliers under the box and whiskers approach, which is a bit extreme. Let us try visualising our data in a different way:
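Again, a stretched 1-D sketch, this time for Minimum_Order:
# Stretched 1-D plot of Minimum_Order colored by outcome
ggplot(data_train, aes(x = seq_along(Minimum_Order), y = Minimum_Order)) +
  geom_point(mapping = aes(color = Delivery_Time)) +
  xlab("Stretched 1-D plot") +
  ylab("Minimum Order") +
  ggtitle("Stretched 1-D Plot of Minimum_Order")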
This method of visualization helps a bit more than the box and whiskers plot, but it is still not perfect. We can see very clearly that most points are 100 or less and that the highest value is 500. Visually, I would interpret any points above 100 as outliers, which looks to be roughly 30 points.
An additional way of plotting this data is to plot the density further broken into the different delivery outcomes:
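A density sketch for Minimum_Order by outcome:
# Density of Minimum_Order, broken out by delivery outcome
ggplot(data_train, aes(x = Minimum_Order, color = Delivery_Time)) +
  geom_density() +
  xlab("Minimum Order (rupees)") +
  ggtitle("Density of Minimum_Order by Delivery_Time")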
We can see clear spikes at the 50 and 100 rupee marks with almost all groups distributed between these two values. The 120 minute class is the most evenly distributed between the two, while the 20 and 30 minute groups have clear preference for the 50 rupee price.
The Rating data is generally a numerical value from 0 to 5.0, in 0.1 increments, grading the restaurant. Of the observations, 1,962 have some form of non-numeric value. The Rating column has the following value counts:
## [1] "Numerical Values: 9131"
## [1] "-: 1191"
## [1] "NEW: 757"
## [1] "Opening Soon: 12"
## [1] "Temporarily Closed: 2"
We can subset our data based on Rating and see the different distributions. Here we will subset the data based on whether the value of Rating is numerical, missing, NEW, or Opening Soon/Temporarily Closed (OSTC):
To subset our data from data_train, let’s write a short function that adds an R_subset column and places each row in the appropriate subset based on Rating:
Rating_subsets <- function(df){
  df$R_subset <- NA
  for(row in 1:nrow(df)){
    if(df$Rating[row] == "Opening Soon" | df$Rating[row] == "Temporarily Closed" | is.na(df$Average_Cost[row])){
      df$R_subset[row] <- "D"  # D: not yet operating (or missing Average_Cost)
    } else if(df$Rating[row] == "NEW"){
      df$R_subset[row] <- "C"  # C: newly opened, no rating yet
    } else if(df$Rating[row] == "-"){
      df$R_subset[row] <- "B"  # B: rating missing, recorded as "-"
    } else {
      df$R_subset[row] <- "A"  # A: a numerical rating
    }
  }
  return(df)
}
data_train <- Rating_subsets(data_train)
Next we will investigate the distribution of the numerical values based on the output group:
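A sketch of this plot, filtering to the numeric subset (R_subset == "A") before converting Rating to numeric:
# Density of numeric ratings, broken out by delivery outcome
data_train %>%
  filter(R_subset == "A") %>%
  mutate(Rating = as.numeric(Rating)) %>%
  ggplot(aes(x = Rating, color = Delivery_Time)) +
  geom_density() +
  ggtitle("Density of Rating by Delivery_Time")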
We can see that while each output group has a bell-shaped distribution, they are slightly different in where they are centered. This implies that the numerical Rating value can help us in our machine learning algorithm.
Let us now look for any outliers in our numerical Rating data:
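A boxplot sketch for the numeric ratings, wrapped in ggplotly (plotly is already loaded) so that the outlier values are easy to read off:
# Interactive box and whiskers plot of numeric ratings
t <- data_train %>%
  filter(R_subset == "A") %>%
  mutate(Rating = as.numeric(Rating)) %>%
  ggplot(aes(y = Rating)) +
  geom_boxplot() +
  ylab("Rating") +
  ggtitle("BoxPlot of Rating") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
ggplotly(t)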
From our plot we can see that Ratings with a value of 2.1, 2.2, 2.3, or 4.9 are outliers. Let us see how many observations meet this criterion by converting our Rating values to numeric and then using our function from before:
temp_numeric <- data_train %>%
filter(R_subset=="A") %>%
mutate(Rating = as.numeric(Rating))
outliers_rating<- is_outlier(temp_numeric$Rating)
sum(outliers_rating)
## [1] 16
We can see that we have 16 observations that can be counted as outliers. Like before, let us visualize our Rating values in a stretched 1-D plot:
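A sketch of this stretched 1-D plot, reusing the temp_numeric dataframe created above:
# Stretched 1-D plot of numeric ratings colored by outcome
ggplot(temp_numeric, aes(x = seq_along(Rating), y = Rating)) +
  geom_point(mapping = aes(color = Delivery_Time)) +
  xlab("Stretched 1-D plot") +
  ylab("Rating") +
  ggtitle("Stretched 1-D Plot of Rating")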
This helps us to see that the values of 2.1 and 2.2 have very few occurrences, while 2.3 and 4.9 contain most of the outliers. Other than that, it is difficult to gain much insight from this plot.
The Votes feature is intended to be an integer count. Unfortunately, for some restaurants the Votes value is missing and replaced with a “-”. As with Rating, we can look at how the distributions differ between observations with and without a Votes value:
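One way to sketch this comparison is to split on whether Votes is numeric; the Votes_status helper column below is introduced purely for this plot:
# Compare Delivery_Time distributions with and without a numeric Votes value
# (Votes_status is an illustrative helper column, not part of the raw data)
data_train %>%
  mutate(Votes_status = ifelse(Votes == "-", "Missing", "Numeric")) %>%
  ggplot(aes(x = Delivery_Time)) +
  geom_bar() +
  facet_wrap(~ Votes_status, scales = "free_y") +
  ggtitle("Delivery_Time by availability of Votes")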
We can see that the distributions closely follow those of the Rating column when it has numeric vs. non-numeric data. Specifically, when Votes is numeric it has about 60%, 30%, and 10% of its results in 30 minutes, 45 minutes, and 65 minutes respectively, just like Rating did.
Just like with the Rating feature, we can subset our Votes feature based on whether or not it is numerical. To do this, let’s write a short function that adds a V_subset column and places each row in the appropriate subset based on Votes:
Votes_subsets <- function(df){
  df$V_subset <- NA
  for(row in 1:nrow(df)){
    if(df$Votes[row] == "-"){
      df$V_subset[row] <- "B"  # B: vote count missing, recorded as "-"
    } else {
      df$V_subset[row] <- "A"  # A: a numerical vote count
    }
  }
  return(df)
}
data_train <- Votes_subsets(data_train)
Next we will investigate the distribution of the numerical values based on the output group:
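A density sketch for the numeric vote counts by outcome:
# Density of numeric vote counts, broken out by delivery outcome
data_train %>%
  filter(V_subset == "A") %>%
  mutate(Votes = as.numeric(Votes)) %>%
  ggplot(aes(x = Votes, color = Delivery_Time)) +
  geom_density() +
  ggtitle("Density of Votes by Delivery_Time")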
We can see that the distribution here is exponential rather than the normal distribution we saw with the Rating variable. This makes sense, as we would expect many restaurants to have few votes but fewer and fewer restaurants to have many votes. We can try transforming our data to see if it helps us visualize it:
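A minimal sketch assuming a log10 transform of the x axis, a natural choice for right-skewed count data:
# Same density plot with Votes on a log10 scale
data_train %>%
  filter(V_subset == "A") %>%
  mutate(Votes = as.numeric(Votes)) %>%
  ggplot(aes(x = Votes, color = Delivery_Time)) +
  geom_density() +
  scale_x_log10() +
  ggtitle("Density of Votes (log10 scale) by Delivery_Time")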
We can see that the center of the 30 minute distribution is shifted left compared to the 45 minute and 65 minute distributions. Therefore this feature is likely going to help in our machine learning model.
We can run the same outlier analysis on Votes. First, a box and whiskers plot of the numeric vote counts (filtering to V_subset == "A" before converting to numeric):
# Interactive box and whiskers plot of numeric vote counts
t <- data_train %>%
  filter(V_subset == "A") %>%
  mutate(Votes = as.numeric(Votes)) %>%
  ggplot(aes(y = Votes)) +
  geom_boxplot() +
  ylab("Votes") +
  ggtitle("BoxPlot of Votes") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
ggplotly(t)
Given the exponential shape of the distribution, the box and whiskers approach will flag much of the long right tail as outliers, so the count should be interpreted with care. We can compute it by converting Votes to numeric and reusing our is_outlier function:
temp_numeric <- data_train %>%
  filter(V_subset == "A") %>%
  mutate(Votes = as.numeric(Votes))
outliers_votes <- is_outlier(temp_numeric$Votes)
sum(outliers_votes)
Like before, we can also visualize the Votes values in a stretched 1-D plot:
ggplot(temp_numeric, aes(x = seq_along(Votes), y = Votes)) +
  geom_point(mapping = aes(color = Delivery_Time)) +
  xlab("Stretched 1-D plot") +
  ylab("Votes") +
  ggtitle("Stretched 1-D Plot of Votes")