output:R Notebook html_document: default pdf_document: default — title: “MATH2349 Data Wrangling” subtitle: “Food Delivery Data” author: “Ying Shen 3536461” subtitle: Assignment 2 date: “October 20, 2020” output: pdf_document: default html_notebook: default —
Loading 5 packages for this assignment, so that another user running this code can use either package.
library(readr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyr)
library(outliers)
library(mlr)
## Loading required package: ParamHelpers
## 'mlr' is in maintenance mode since July 2019. Future development
## efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).
library(lattice)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(MVN)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## sROC 0.1-2 loaded
For MATH2349 Assignment 1, firstly, all the data are imported to RProject. And then two datasets with the same variable are binding together to add volume and credibility of the datasets. Then two data sets are merged together to add more information to the main dataset.
And then, formats of certain variables are corrected. They have been corrected to dates, time, numeric, factor, labelled and levelled depending on the variable.
After that, some variables that are not used in this report are being subsetted. And untidy data are being tidied, customer latitude and customer longitude are being all labelled under the variable location, and their values are degree. And the date is separated to year, month and day for more detailed investigation.
Missing data is then scanned and located at specific variables. For customer distance, I used the mean to replace the NA. For branch_code that is missing, I deleted the whole row with missing data in branch_code.
I have created variables such as ‘spending and distance ratio’ and ‘total bill’.
Outliers are then scanned and detected. Then I decided to use the capping method to treat the outliers. For the latitude and longitude, I used Chi-Square Q-Q plot to detect the outliers and they are removed in data BK_clean.
Total_bill are found to be right-skewed, I used a few methods and found square root the data will transform it into a normal distribution.
The data set contains Food Delivery data from a restaurant in Melbourne, Australia. The restaurant has three branches around the CBD area. It is from Kaggle, and the link to it are: https://www.kaggle.com/jianhanma/melbourne-restaurant-food-delivery-data.
VARIABLE DESCRIPTION: order_id: A unique id for each order.
date: The date the order was made, given in YYYY-MM-DD format.
time: The time the order was made, given in hh:mm:ss format.
order_type: A categorical attribute representing the different types of orders namely: Breakfast, Lunch or Dinner.
branch_code: A categorical attribute representing the branch code in which the order was made.
order_items: A list of tuples representing the order items: first element of the tuple is the item ordered, and the second element is the quantity ordered for such item.
order_price: A float value representing the order total price.
customer_lat: Latitude of the customer. This variable will be dropped because this report does not discuss the customer’s latitude.
customer_lon: Longitude of the customer. This variable will be dropped because this report does not discuss the customer’s longitude.
customerHasloyalty?: A logical variable denoting whether the customer has a loyalty card with the restaurant (1 if the customer has loyalty and 0 otherwise).
distancetocustomer_KM: A float representing the shortest distance, in kilometers, between the branch and the customer nodes.
delivery_fee: A float representing the delivery fee of the order.
# This is the R chunk for the Data Section
## Setting the working directory noting that each user running this code will need to amend to their WD
setwd("/Users/yingshen/Desktop/DW")
## Importing the data to the RProject and naming the data frame 'missing_data'
missing_data <- read_csv("missing_data.csv")
## Parsed with column specification:
## cols(
## order_id = col_character(),
## date = col_date(format = ""),
## time = col_time(format = ""),
## order_type = col_character(),
## branch_code = col_character(),
## order_items = col_character(),
## order_price = col_double(),
## customer_lat = col_double(),
## customer_lon = col_double(),
## `customerHasloyalty?` = col_double(),
## distance_to_customer_KM = col_double(),
## delivery_fee = col_double()
## )
head(missing_data)
## # A tibble: 6 x 12
## order_id date time order_type branch_code order_items order_price
## <chr> <date> <time> <chr> <chr> <chr> <dbl>
## 1 ORDK010… 2018-04-13 11:22:49 Breakfast BK [('Pancake… 288.
## 2 ORDX107… 2018-12-01 18:59:09 Dinner BK [('Shrimp'… 325
## 3 ORDX110… 2018-02-07 09:10:59 Breakfast <NA> [('Pancake… 136.
## 4 ORDZ083… 2018-10-31 17:58:18 Dinner NS [('Shrimp'… 663
## 5 ORDC058… 2018-02-19 09:31:16 Breakfast NS [('Pancake… 206.
## 6 ORDC010… 2018-03-04 10:01:41 Breakfast NS [('Pancake… 278.
## # … with 5 more variables: customer_lat <dbl>, customer_lon <dbl>,
## # `customerHasloyalty?` <dbl>, distance_to_customer_KM <dbl>,
## # delivery_fee <dbl>
## Importing the data to the RProject and naming the data frame 'outlier_data'
outlier_data <- read_csv("outlier_data.csv")
## Parsed with column specification:
## cols(
## order_id = col_character(),
## date = col_date(format = ""),
## time = col_time(format = ""),
## order_type = col_character(),
## branch_code = col_character(),
## order_items = col_character(),
## order_price = col_double(),
## customer_lat = col_double(),
## customer_lon = col_double(),
## `customerHasloyalty?` = col_double(),
## distance_to_customer_KM = col_double(),
## delivery_fee = col_double()
## )
head(outlier_data)
## # A tibble: 6 x 12
## order_id date time order_type branch_code order_items order_price
## <chr> <date> <time> <chr> <chr> <chr> <dbl>
## 1 ORDA071… 2018-08-09 15:16:03 Lunch BK [('Fries',… 328
## 2 ORDI089… 2018-02-08 09:10:59 Breakfast NS [('Coffee'… 219
## 3 ORDI018… 2018-01-22 08:40:33 Breakfast NS [('Coffee'… 110.
## 4 ORDY020… 2018-01-02 08:00:00 Breakfast TP [('Eggs', … 598.
## 5 ORDY030… 2018-04-17 11:22:49 Breakfast TP [('Coffee'… 169
## 6 ORDX054… 2018-03-26 10:42:15 Breakfast BK [('Cereal'… 172
## # … with 5 more variables: customer_lat <dbl>, customer_lon <dbl>,
## # `customerHasloyalty?` <dbl>, distance_to_customer_KM <dbl>,
## # delivery_fee <dbl>
## Binding two data sets together to add more values, it can give the report more accuracy
data1 <- bind_rows(missing_data,outlier_data)
head(data1)
## # A tibble: 6 x 12
## order_id date time order_type branch_code order_items order_price
## <chr> <date> <time> <chr> <chr> <chr> <dbl>
## 1 ORDK010… 2018-04-13 11:22:49 Breakfast BK [('Pancake… 288.
## 2 ORDX107… 2018-12-01 18:59:09 Dinner BK [('Shrimp'… 325
## 3 ORDX110… 2018-02-07 09:10:59 Breakfast <NA> [('Pancake… 136.
## 4 ORDZ083… 2018-10-31 17:58:18 Dinner NS [('Shrimp'… 663
## 5 ORDC058… 2018-02-19 09:31:16 Breakfast NS [('Pancake… 206.
## 6 ORDC010… 2018-03-04 10:01:41 Breakfast NS [('Pancake… 278.
## # … with 5 more variables: customer_lat <dbl>, customer_lon <dbl>,
## # `customerHasloyalty?` <dbl>, distance_to_customer_KM <dbl>,
## # delivery_fee <dbl>
## Merging two data sets, adding full branch name to branch using key=branch_code. Firstly we only select 2 variable from branches_name and then we join them.
branches_name <- read_csv("branches (1).csv")
## Parsed with column specification:
## cols(
## branch_code = col_character(),
## branch_name = col_character(),
## branch_lat = col_double(),
## branch_lon = col_double()
## )
names(branches_name)
## [1] "branch_code" "branch_name" "branch_lat" "branch_lon"
branches_name <- branches_name[, c(1,2)]
data3 <- data1 %>% inner_join(branches_name, by="branch_code")
head(data3)
## # A tibble: 6 x 13
## order_id date time order_type branch_code order_items order_price
## <chr> <date> <time> <chr> <chr> <chr> <dbl>
## 1 ORDK010… 2018-04-13 11:22:49 Breakfast BK [('Pancake… 288.
## 2 ORDX107… 2018-12-01 18:59:09 Dinner BK [('Shrimp'… 325
## 3 ORDZ083… 2018-10-31 17:58:18 Dinner NS [('Shrimp'… 663
## 4 ORDC058… 2018-02-19 09:31:16 Breakfast NS [('Pancake… 206.
## 5 ORDC010… 2018-03-04 10:01:41 Breakfast NS [('Pancake… 278.
## 6 ORDY065… 2018-09-23 16:47:19 Dinner TP [('Salmon'… 786
## # … with 6 more variables: customer_lat <dbl>, customer_lon <dbl>,
## # `customerHasloyalty?` <dbl>, distance_to_customer_KM <dbl>,
## # delivery_fee <dbl>, branch_name <chr>
Summaris the types of variables and data structures, check the attributes in the data and apply proper data type conversions.
# This is the R chunk for the Understand Section
##Changing the date from character to dates
head(as.Date(data1$date, format = "%Y-%m-%d"))
## [1] "2018-04-13" "2018-12-01" "2018-02-07" "2018-10-31" "2018-02-19"
## [6] "2018-03-04"
is.Date(data1$date)
## [1] TRUE
## Attaching value labels 1=Has, 2=Do not have to indicate which customer has loyalty card
data1$`customerHasloyalty?` <- factor(data1$`customerHasloyalty?`, levels = c(0,1), labels = c("Do not have", "Has"))
head(data1$`customerHasloyalty?`,3)
## [1] Do not have Do not have Do not have
## Levels: Do not have Has
## Defining the order type as a factor with 3 levels (Breakfast, Lunch and Dinner)
data1$order_type <- as.factor(data1$order_type)
head(data1$order_type)
## [1] Breakfast Dinner Breakfast Dinner Breakfast Breakfast
## Levels: Breakfast Dinner Lunch
## Setting customer_lat variable as numbers
head(as.numeric(data1$customer_lat))
## [1] -37.81498 -37.79913 -37.80504 -37.81198 -37.80723 -37.80193
## Setting customer_lon variable as numbers
head(as.numeric(data1$customer_lon))
## [1] 144.9506 145.0039 144.9267 144.9614 144.9559 144.9682
##Setting format to hours, minutes and seconds
data1$time <- hms(data1$time)
head(data1$time)
## [1] "11H 22M 49S" "18H 59M 9S" "9H 10M 59S" "17H 58M 18S" "9H 31M 16S"
## [6] "10H 1M 41S"
##Factorise branch code with levels
data1$branch_code <- factor(data1$branch_code)
str(data1$branch_code)
## Factor w/ 3 levels "BK","NS","TP": 1 1 NA 2 2 2 NA 3 3 3 ...
Not all variable are used in this report, therefore I am going to subset the not used one. Not all variable has it’s own column, such as customer latitude and customer longitude needs to be put under one big variable Location and degree. Then the dates needs to be separated for easier access if later need to extract data from certain month to predict future stock level.
# This is the R chunk for the Tidy & Manipulate Data I
## Subsetting variable that are not valuable to the report
names(data1)
## [1] "order_id" "date"
## [3] "time" "order_type"
## [5] "branch_code" "order_items"
## [7] "order_price" "customer_lat"
## [9] "customer_lon" "customerHasloyalty?"
## [11] "distance_to_customer_KM" "delivery_fee"
head(data1 <- data1[, c(2,4,5,7,8,9,10,11,12)])
## # A tibble: 6 x 9
## date order_type branch_code order_price customer_lat customer_lon
## <date> <fct> <fct> <dbl> <dbl> <dbl>
## 1 2018-04-13 Breakfast BK 288. -37.8 145.
## 2 2018-12-01 Dinner BK 325 -37.8 145.
## 3 2018-02-07 Breakfast <NA> 136. -37.8 145.
## 4 2018-10-31 Dinner NS 663 -37.8 145.
## 5 2018-02-19 Breakfast NS 206. -37.8 145.
## 6 2018-03-04 Breakfast NS 278. -37.8 145.
## # … with 3 more variables: `customerHasloyalty?` <fct>,
## # distance_to_customer_KM <dbl>, delivery_fee <dbl>
## Tidying data to make all variable have it's own column
names(data1)
## [1] "date" "order_type"
## [3] "branch_code" "order_price"
## [5] "customer_lat" "customer_lon"
## [7] "customerHasloyalty?" "distance_to_customer_KM"
## [9] "delivery_fee"
head(data2 <- data1 %>% gather(location, degree, 5:6))
## # A tibble: 6 x 9
## date order_type branch_code order_price `customerHasloy…
## <date> <fct> <fct> <dbl> <fct>
## 1 2018-04-13 Breakfast BK 288. Do not have
## 2 2018-12-01 Dinner BK 325 Do not have
## 3 2018-02-07 Breakfast <NA> 136. Do not have
## 4 2018-10-31 Dinner NS 663 Do not have
## 5 2018-02-19 Breakfast NS 206. Do not have
## 6 2018-03-04 Breakfast NS 278. Do not have
## # … with 4 more variables: distance_to_customer_KM <dbl>, delivery_fee <dbl>,
## # location <chr>, degree <dbl>
## Multiple variables are stored in one column, therefore separating the year, month and date of the orders
data1 <- data1 %>% separate(date, into = c("year", "month", "day"), sep = "-")
head(data1,3)
## # A tibble: 3 x 11
## year month day order_type branch_code order_price customer_lat customer_lon
## <chr> <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 2018 04 13 Breakfast BK 288. -37.8 145.
## 2 2018 12 01 Dinner BK 325 -37.8 145.
## 3 2018 02 07 Breakfast <NA> 136. -37.8 145.
## # … with 3 more variables: `customerHasloyalty?` <fct>,
## # distance_to_customer_KM <dbl>, delivery_fee <dbl>
Scan the data for missing values, special values and obvious errors and recode them. (i.e. inconsistencies).
# This is the R chunk for the Scan I for missing value
## Computing the total missing values in each column
colSums(is.na(data1))
## year month day
## 0 0 0
## order_type branch_code order_price
## 0 100 0
## customer_lat customer_lon customerHasloyalty?
## 0 0 0
## distance_to_customer_KM delivery_fee
## 50 50
## Recoding the missing values in 'distance_to_customer_KM' with the mean distance to customer
data1$distance_to_customer_KM[is.na(data1$distance_to_customer_KM)] <- mean(data1$distance_to_customer_KM, na.rm = TRUE)
sum(is.na(data1$distance_to_customer_KM))
## [1] 0
## Omiting all rows containing missing values in column 'branch_code'
data1 <- na.omit(data1)
sum(is.na(data1$branch_code))
## [1] 0
## Double checking if there is any missing data
sum(colSums(is.na(data1)))
## [1] 0
Creating and mutating two variable from the existing variables.
# This is the R chunk for the Tidy & Manipulate Data II
## Mutating a new column explaining the ratio for spending and distance into the data set
data1 <- data1 %>% mutate(spending_distance_ratio = order_price/distance_to_customer_KM)
head(data1,3)
## # A tibble: 3 x 12
## year month day order_type branch_code order_price customer_lat customer_lon
## <chr> <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 2018 04 13 Breakfast BK 288. -37.8 145.
## 2 2018 12 01 Dinner BK 325 -37.8 145.
## 3 2018 10 31 Dinner NS 663 -37.8 145.
## # … with 4 more variables: `customerHasloyalty?` <fct>,
## # distance_to_customer_KM <dbl>, delivery_fee <dbl>,
## # spending_distance_ratio <dbl>
## Calculating how much customers actually spent on food
data1<- data1 %>% mutate(total_bill = order_price + delivery_fee)
head(data1$total_bill,3)
## [1] 302.6815 339.3125 676.6626
Scan the numeric data for outliers and recode them.
# This is the R chunk for the Scan II for outliers in variable 'distance_to_customer_KM'
## Checking the distribution of variable 'distance_to_customer_KM'
hist(data1$distance_to_customer_KM)
## After confirming the 'distance_to_customer_KM' variable has a approximately normal distribution, we can calculate z scores using outliers package
z.scores <- data1$distance_to_customer_KM %>% scores(type = "z")
z.scores %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.04761 -0.59458 0.03651 0.00000 0.63863 4.80218
## Finding the locations of outliers in the depth variable
which( abs(z.scores) >3 )
## [1] 82 794 805
## Detecting outliers in 'distance_to_customer_KM' for a given 'month'
boxplot(data1$distance_to_customer_KM ~ data1$month, ylab = "Distance", xlab = "Month",main = "Customer distance based on month",col = "pink")
# Define a function to cap the values outside the limits, and replacing those observations outside the lower limit with the value of 5th percentile and those that lie above the upper limit, with the value of 95th percentile
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x}
## apply the capping
data1$distance_to_customer_KM <- cap(data1$distance_to_customer_KM)
# This is the R chunk for the Scan II for outliers in variable 'branch_code,filltering out BK'
BK <- data1 %>% filter( branch_code == "BK" ) %>% dplyr::select(customer_lat, customer_lon)
head(BK)
## # A tibble: 6 x 2
## customer_lat customer_lon
## <dbl> <dbl>
## 1 -37.8 145.
## 2 -37.8 145.
## 3 -37.8 145.
## 4 -37.8 145.
## 5 -37.8 145.
## 6 -37.8 145.
## Discovering possible multivariate outliers using function mvn()
results <- mvn( data = BK, multivariateOutlierMethod = "quan", showOutliers = TRUE)
results$multivariateOutliers
## Observation Mahalanobis Distance Outlier
## 1 1 8.264 TRUE
## 2 2 7.455 TRUE
# Exclude 1th and 2nd observations
BK_clean <- BK[ -c(1,2), ]
# Check the dimension and see outliers are excluded
dim(BK)
## [1] 267 2
dim(BK_clean)
## [1] 265 2
Applying an appropriate transformation for variable ‘total_bill’ for it to reduce the skewness and become a normal distribution.
# This is the R chunk for the decreasing the skewness and convert the distribution into a normal distribution
## Drawing histogram on how much people paid for each order
hist(data1$total_bill)
## Transforming the variable with natural logarithm, but did not work
ln_total_bill <- log(data1$total_bill)
hist(ln_total_bill)
## Transforming the variable with square root transformation, which is often used for reducing right skewness. Turns out it worked
sqrt_total_bill <- sqrt(data1$total_bill)
hist(sqrt_total_bill )
Investigating if having a loyalty card affects the amount of food they order.
# This is the R chunk for the how to utilize data for analysis
## Using lattice and dplyr to analysis if customer who has a loyalty card spend more in the restarant. In detail, I used customerHasloyalty as
data1 %>% group_by(data1$`customerHasloyalty?`) %>% summarise(Min = min(data1$order_price,na.rm = TRUE),
Q1 = quantile(data1$order_price,probs = .25,na.rm = TRUE),
Median = median(data1$order_price, na.rm =TRUE),
Q3 = quantile(data1$order_price,probs = .75,na.rm = TRUE),
Max = max(data1$order_price,na.rm = TRUE),
Mean = mean(data1$order_price, na.rm = TRUE),
SD = sd(data1$order_price, na.rm = TRUE),n = n(),
Missing = sum(is.na(data1$order_price)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 10
## `data1$\`customerHas… Min Q1 Median Q3 Max Mean SD n Missing
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 Do not have 44.5 287. 428. 652. 1283 482. 256. 818 0
## 2 Has 44.5 287. 428. 652. 1283 482. 256. 32 0
data1 %>% histogram(~data1$order_price |data1$`customerHasloyalty?`, data = .,layout=c(1,2))
That could be explained because customer who has a loyalty card order more frequently from the store. Hence why their order bill is usually less than the people who does not have loyalty card.