MATH2349_Assignment_2_Rmarkdown

output:R Notebook html_document: default pdf_document: default — title: “MATH2349 Data Wrangling” subtitle: “Food Delivery Data” author: “Ying Shen 3536461” subtitle: Assignment 2 date: “October 20, 2020” output: pdf_document: default html_notebook: default —

Required packages

Loading 5 packages for this assignment, so that another user running this code can use either package.

library(readr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tidyr)
library(outliers)
library(mlr)

## Loading required package: ParamHelpers

## 'mlr' is in maintenance mode since July 2019. Future development
## efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).

library(lattice)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(MVN)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## sROC 0.1-2 loaded

Executive Summary

For MATH2349 Assignment 1, firstly, all the data are imported to RProject. And then two datasets with the same variable are binding together to add volume and credibility of the datasets. Then two data sets are merged together to add more information to the main dataset.

And then, formats of certain variables are corrected. They have been corrected to dates, time, numeric, factor, labelled and levelled depending on the variable.

After that, some variables that are not used in this report are being subsetted. And untidy data are being tidied, customer latitude and customer longitude are being all labelled under the variable location, and their values are degree. And the date is separated to year, month and day for more detailed investigation.

Missing data is then scanned and located at specific variables. For customer distance, I used the mean to replace the NA. For branch_code that is missing, I deleted the whole row with missing data in branch_code.

I have created variables such as ‘spending and distance ratio’ and ‘total bill’.

Outliers are then scanned and detected. Then I decided to use the capping method to treat the outliers. For the latitude and longitude, I used Chi-Square Q-Q plot to detect the outliers and they are removed in data BK_clean.

Total_bill are found to be right-skewed, I used a few methods and found square root the data will transform it into a normal distribution.

Data

The data set contains Food Delivery data from a restaurant in Melbourne, Australia. The restaurant has three branches around the CBD area. It is from Kaggle, and the link to it are: https://www.kaggle.com/jianhanma/melbourne-restaurant-food-delivery-data.

VARIABLE DESCRIPTION: order_id: A unique id for each order.
date: The date the order was made, given in YYYY-MM-DD format.
time: The time the order was made, given in hh:mm:ss format.
order_type: A categorical attribute representing the different types of orders namely: Breakfast, Lunch or Dinner.
branch_code: A categorical attribute representing the branch code in which the order was made.
order_items: A list of tuples representing the order items: first element of the tuple is the item ordered, and the second element is the quantity ordered for such item.
order_price: A float value representing the order total price.
customer_lat: Latitude of the customer. This variable will be dropped because this report does not discuss the customer’s latitude.
customer_lon: Longitude of the customer. This variable will be dropped because this report does not discuss the customer’s longitude.
customerHasloyalty?: A logical variable denoting whether the customer has a loyalty card with the restaurant (1 if the customer has loyalty and 0 otherwise).
distancetocustomer_KM: A float representing the shortest distance, in kilometers, between the branch and the customer nodes.
delivery_fee: A float representing the delivery fee of the order.

# This is the R chunk for the Data Section
## Setting the working directory noting that each user running this code will need to amend to their WD
setwd("/Users/yingshen/Desktop/DW") 

## Importing the data to the RProject and naming the data frame 'missing_data'
missing_data <- read_csv("missing_data.csv")

## Parsed with column specification:
## cols(
##   order_id = col_character(),
##   date = col_date(format = ""),
##   time = col_time(format = ""),
##   order_type = col_character(),
##   branch_code = col_character(),
##   order_items = col_character(),
##   order_price = col_double(),
##   customer_lat = col_double(),
##   customer_lon = col_double(),
##   `customerHasloyalty?` = col_double(),
##   distance_to_customer_KM = col_double(),
##   delivery_fee = col_double()
## )

head(missing_data)

## # A tibble: 6 x 12
##   order_id date       time     order_type branch_code order_items order_price
##   <chr>    <date>     <time>   <chr>      <chr>       <chr>             <dbl>
## 1 ORDK010… 2018-04-13 11:22:49 Breakfast  BK          [('Pancake…        288.
## 2 ORDX107… 2018-12-01 18:59:09 Dinner     BK          [('Shrimp'…        325 
## 3 ORDX110… 2018-02-07 09:10:59 Breakfast  <NA>        [('Pancake…        136.
## 4 ORDZ083… 2018-10-31 17:58:18 Dinner     NS          [('Shrimp'…        663 
## 5 ORDC058… 2018-02-19 09:31:16 Breakfast  NS          [('Pancake…        206.
## 6 ORDC010… 2018-03-04 10:01:41 Breakfast  NS          [('Pancake…        278.
## # … with 5 more variables: customer_lat <dbl>, customer_lon <dbl>,
## #   `customerHasloyalty?` <dbl>, distance_to_customer_KM <dbl>,
## #   delivery_fee <dbl>

## Importing the data to the RProject and naming the data frame 'outlier_data'
outlier_data <- read_csv("outlier_data.csv")

## Parsed with column specification:
## cols(
##   order_id = col_character(),
##   date = col_date(format = ""),
##   time = col_time(format = ""),
##   order_type = col_character(),
##   branch_code = col_character(),
##   order_items = col_character(),
##   order_price = col_double(),
##   customer_lat = col_double(),
##   customer_lon = col_double(),
##   `customerHasloyalty?` = col_double(),
##   distance_to_customer_KM = col_double(),
##   delivery_fee = col_double()
## )

head(outlier_data)

## # A tibble: 6 x 12
##   order_id date       time     order_type branch_code order_items order_price
##   <chr>    <date>     <time>   <chr>      <chr>       <chr>             <dbl>
## 1 ORDA071… 2018-08-09 15:16:03 Lunch      BK          [('Fries',…        328 
## 2 ORDI089… 2018-02-08 09:10:59 Breakfast  NS          [('Coffee'…        219 
## 3 ORDI018… 2018-01-22 08:40:33 Breakfast  NS          [('Coffee'…        110.
## 4 ORDY020… 2018-01-02 08:00:00 Breakfast  TP          [('Eggs', …        598.
## 5 ORDY030… 2018-04-17 11:22:49 Breakfast  TP          [('Coffee'…        169 
## 6 ORDX054… 2018-03-26 10:42:15 Breakfast  BK          [('Cereal'…        172 
## # … with 5 more variables: customer_lat <dbl>, customer_lon <dbl>,
## #   `customerHasloyalty?` <dbl>, distance_to_customer_KM <dbl>,
## #   delivery_fee <dbl>

## Binding two data sets together to add more values, it can give the report more accuracy
data1 <- bind_rows(missing_data,outlier_data)
head(data1)

## # A tibble: 6 x 12
##   order_id date       time     order_type branch_code order_items order_price
##   <chr>    <date>     <time>   <chr>      <chr>       <chr>             <dbl>
## 1 ORDK010… 2018-04-13 11:22:49 Breakfast  BK          [('Pancake…        288.
## 2 ORDX107… 2018-12-01 18:59:09 Dinner     BK          [('Shrimp'…        325 
## 3 ORDX110… 2018-02-07 09:10:59 Breakfast  <NA>        [('Pancake…        136.
## 4 ORDZ083… 2018-10-31 17:58:18 Dinner     NS          [('Shrimp'…        663 
## 5 ORDC058… 2018-02-19 09:31:16 Breakfast  NS          [('Pancake…        206.
## 6 ORDC010… 2018-03-04 10:01:41 Breakfast  NS          [('Pancake…        278.
## # … with 5 more variables: customer_lat <dbl>, customer_lon <dbl>,
## #   `customerHasloyalty?` <dbl>, distance_to_customer_KM <dbl>,
## #   delivery_fee <dbl>

## Merging two data sets, adding full branch name to branch using key=branch_code. Firstly we only select 2 variable from branches_name and then we join them. 
branches_name <- read_csv("branches (1).csv")

## Parsed with column specification:
## cols(
##   branch_code = col_character(),
##   branch_name = col_character(),
##   branch_lat = col_double(),
##   branch_lon = col_double()
## )

names(branches_name)

## [1] "branch_code" "branch_name" "branch_lat"  "branch_lon"

branches_name <- branches_name[, c(1,2)]
data3 <- data1 %>% inner_join(branches_name, by="branch_code")
head(data3)

## # A tibble: 6 x 13
##   order_id date       time     order_type branch_code order_items order_price
##   <chr>    <date>     <time>   <chr>      <chr>       <chr>             <dbl>
## 1 ORDK010… 2018-04-13 11:22:49 Breakfast  BK          [('Pancake…        288.
## 2 ORDX107… 2018-12-01 18:59:09 Dinner     BK          [('Shrimp'…        325 
## 3 ORDZ083… 2018-10-31 17:58:18 Dinner     NS          [('Shrimp'…        663 
## 4 ORDC058… 2018-02-19 09:31:16 Breakfast  NS          [('Pancake…        206.
## 5 ORDC010… 2018-03-04 10:01:41 Breakfast  NS          [('Pancake…        278.
## 6 ORDY065… 2018-09-23 16:47:19 Dinner     TP          [('Salmon'…        786 
## # … with 6 more variables: customer_lat <dbl>, customer_lon <dbl>,
## #   `customerHasloyalty?` <dbl>, distance_to_customer_KM <dbl>,
## #   delivery_fee <dbl>, branch_name <chr>

Understand

Summaris the types of variables and data structures, check the attributes in the data and apply proper data type conversions.

# This is the R chunk for the Understand Section
##Changing the date from character to dates 
head(as.Date(data1$date, format = "%Y-%m-%d"))

## [1] "2018-04-13" "2018-12-01" "2018-02-07" "2018-10-31" "2018-02-19"
## [6] "2018-03-04"

is.Date(data1$date)

## [1] TRUE

## Attaching value labels 1=Has, 2=Do not have to indicate which customer has loyalty card
data1$`customerHasloyalty?` <- factor(data1$`customerHasloyalty?`, levels = c(0,1), labels = c("Do not have", "Has"))
head(data1$`customerHasloyalty?`,3)

## [1] Do not have Do not have Do not have
## Levels: Do not have Has

## Defining the order type as a factor with 3 levels (Breakfast, Lunch and Dinner)
data1$order_type <- as.factor(data1$order_type)
head(data1$order_type)

## [1] Breakfast Dinner    Breakfast Dinner    Breakfast Breakfast
## Levels: Breakfast Dinner Lunch

## Setting customer_lat variable as numbers 
head(as.numeric(data1$customer_lat))

## [1] -37.81498 -37.79913 -37.80504 -37.81198 -37.80723 -37.80193

## Setting customer_lon variable as numbers 
head(as.numeric(data1$customer_lon))

## [1] 144.9506 145.0039 144.9267 144.9614 144.9559 144.9682

##Setting format to hours, minutes and seconds
data1$time <- hms(data1$time)
head(data1$time)

## [1] "11H 22M 49S" "18H 59M 9S"  "9H 10M 59S"  "17H 58M 18S" "9H 31M 16S" 
## [6] "10H 1M 41S"

##Factorise branch code with levels
data1$branch_code <- factor(data1$branch_code)
str(data1$branch_code)

##  Factor w/ 3 levels "BK","NS","TP": 1 1 NA 2 2 2 NA 3 3 3 ...

Tidy & Manipulate Data I

Not all variable are used in this report, therefore I am going to subset the not used one. Not all variable has it’s own column, such as customer latitude and customer longitude needs to be put under one big variable Location and degree. Then the dates needs to be separated for easier access if later need to extract data from certain month to predict future stock level.

# This is the R chunk for the Tidy & Manipulate Data I
## Subsetting variable that are not valuable to the report
names(data1)

##  [1] "order_id"                "date"                   
##  [3] "time"                    "order_type"             
##  [5] "branch_code"             "order_items"            
##  [7] "order_price"             "customer_lat"           
##  [9] "customer_lon"            "customerHasloyalty?"    
## [11] "distance_to_customer_KM" "delivery_fee"

head(data1 <- data1[, c(2,4,5,7,8,9,10,11,12)])

## # A tibble: 6 x 9
##   date       order_type branch_code order_price customer_lat customer_lon
##   <date>     <fct>      <fct>             <dbl>        <dbl>        <dbl>
## 1 2018-04-13 Breakfast  BK                 288.        -37.8         145.
## 2 2018-12-01 Dinner     BK                 325         -37.8         145.
## 3 2018-02-07 Breakfast  <NA>               136.        -37.8         145.
## 4 2018-10-31 Dinner     NS                 663         -37.8         145.
## 5 2018-02-19 Breakfast  NS                 206.        -37.8         145.
## 6 2018-03-04 Breakfast  NS                 278.        -37.8         145.
## # … with 3 more variables: `customerHasloyalty?` <fct>,
## #   distance_to_customer_KM <dbl>, delivery_fee <dbl>

## Tidying data to make all variable have it's own column
names(data1)

## [1] "date"                    "order_type"             
## [3] "branch_code"             "order_price"            
## [5] "customer_lat"            "customer_lon"           
## [7] "customerHasloyalty?"     "distance_to_customer_KM"
## [9] "delivery_fee"

head(data2 <- data1 %>% gather(location, degree, 5:6))

## # A tibble: 6 x 9
##   date       order_type branch_code order_price `customerHasloy…
##   <date>     <fct>      <fct>             <dbl> <fct>           
## 1 2018-04-13 Breakfast  BK                 288. Do not have     
## 2 2018-12-01 Dinner     BK                 325  Do not have     
## 3 2018-02-07 Breakfast  <NA>               136. Do not have     
## 4 2018-10-31 Dinner     NS                 663  Do not have     
## 5 2018-02-19 Breakfast  NS                 206. Do not have     
## 6 2018-03-04 Breakfast  NS                 278. Do not have     
## # … with 4 more variables: distance_to_customer_KM <dbl>, delivery_fee <dbl>,
## #   location <chr>, degree <dbl>

## Multiple variables are stored in one column, therefore separating the year, month and date of the orders 
data1 <- data1 %>%   separate(date, into = c("year", "month", "day"), sep = "-")
head(data1,3)

## # A tibble: 3 x 11
##   year  month day   order_type branch_code order_price customer_lat customer_lon
##   <chr> <chr> <chr> <fct>      <fct>             <dbl>        <dbl>        <dbl>
## 1 2018  04    13    Breakfast  BK                 288.        -37.8         145.
## 2 2018  12    01    Dinner     BK                 325         -37.8         145.
## 3 2018  02    07    Breakfast  <NA>               136.        -37.8         145.
## # … with 3 more variables: `customerHasloyalty?` <fct>,
## #   distance_to_customer_KM <dbl>, delivery_fee <dbl>

Scan for missing value I

Scan the data for missing values, special values and obvious errors and recode them. (i.e. inconsistencies).

# This is the R chunk for the Scan I for missing value 
## Computing the total missing values in each column
colSums(is.na(data1))

##                    year                   month                     day 
##                       0                       0                       0 
##              order_type             branch_code             order_price 
##                       0                     100                       0 
##            customer_lat            customer_lon     customerHasloyalty? 
##                       0                       0                       0 
## distance_to_customer_KM            delivery_fee 
##                      50                      50

## Recoding the missing values in 'distance_to_customer_KM' with the mean distance to customer
data1$distance_to_customer_KM[is.na(data1$distance_to_customer_KM)] <- mean(data1$distance_to_customer_KM, na.rm = TRUE)
sum(is.na(data1$distance_to_customer_KM))

## [1] 0

## Omiting all rows containing missing values in column 'branch_code'
data1 <- na.omit(data1)
sum(is.na(data1$branch_code))

## [1] 0

## Double checking if there is any missing data
sum(colSums(is.na(data1)))

## [1] 0

Tidy & Manipulate Data II

Creating and mutating two variable from the existing variables.

# This is the R chunk for the Tidy & Manipulate Data II 
## Mutating a new column explaining the ratio for spending and distance into the data set 
data1 <- data1 %>% mutate(spending_distance_ratio = order_price/distance_to_customer_KM) 
head(data1,3)

## # A tibble: 3 x 12
##   year  month day   order_type branch_code order_price customer_lat customer_lon
##   <chr> <chr> <chr> <fct>      <fct>             <dbl>        <dbl>        <dbl>
## 1 2018  04    13    Breakfast  BK                 288.        -37.8         145.
## 2 2018  12    01    Dinner     BK                 325         -37.8         145.
## 3 2018  10    31    Dinner     NS                 663         -37.8         145.
## # … with 4 more variables: `customerHasloyalty?` <fct>,
## #   distance_to_customer_KM <dbl>, delivery_fee <dbl>,
## #   spending_distance_ratio <dbl>

## Calculating how much customers actually spent on food
data1<- data1 %>% mutate(total_bill = order_price + delivery_fee)
head(data1$total_bill,3)

## [1] 302.6815 339.3125 676.6626

Scan for outlier II

Scan the numeric data for outliers and recode them.

# This is the R chunk for the Scan II for outliers in variable 'distance_to_customer_KM'
## Checking the distribution of variable 'distance_to_customer_KM'
hist(data1$distance_to_customer_KM)

## After confirming the 'distance_to_customer_KM' variable has a approximately normal distribution, we can calculate z scores using outliers package
z.scores <- data1$distance_to_customer_KM %>% scores(type = "z")
z.scores %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.04761 -0.59458  0.03651  0.00000  0.63863  4.80218

## Finding the locations of outliers in the depth variable
which( abs(z.scores) >3 )

## [1]  82 794 805

## Detecting outliers in 'distance_to_customer_KM' for a given 'month'
boxplot(data1$distance_to_customer_KM ~ data1$month, ylab = "Distance", xlab = "Month",main = "Customer distance based on month",col = "pink")

# Define a function to cap the values outside the limits, and replacing those observations outside the lower limit with the value of 5th percentile and those that lie above the upper limit, with the value of 95th percentile
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
                       x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1] 
                       x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
                       x}

## apply the capping
data1$distance_to_customer_KM <- cap(data1$distance_to_customer_KM)



# This is the R chunk for the Scan II for outliers in variable 'branch_code,filltering out BK'
BK <- data1 %>% filter( branch_code == "BK" ) %>% dplyr::select(customer_lat, customer_lon)
head(BK)

## # A tibble: 6 x 2
##   customer_lat customer_lon
##          <dbl>        <dbl>
## 1        -37.8         145.
## 2        -37.8         145.
## 3        -37.8         145.
## 4        -37.8         145.
## 5        -37.8         145.
## 6        -37.8         145.

## Discovering possible multivariate outliers using function mvn()
results <- mvn( data = BK, multivariateOutlierMethod = "quan", showOutliers = TRUE)

results$multivariateOutliers

##   Observation Mahalanobis Distance Outlier
## 1           1                8.264    TRUE
## 2           2                7.455    TRUE

# Exclude 1th and 2nd observations
BK_clean <- BK[ -c(1,2), ]

# Check the dimension and see outliers are excluded 
dim(BK)

## [1] 267   2

dim(BK_clean)

## [1] 265   2

Transform

Applying an appropriate transformation for variable ‘total_bill’ for it to reduce the skewness and become a normal distribution.

# This is the R chunk for the decreasing the skewness and convert the distribution into a normal distribution 
## Drawing histogram on how much people paid for each order
hist(data1$total_bill)

## Transforming the variable with natural logarithm, but did not work
ln_total_bill <- log(data1$total_bill)
hist(ln_total_bill)

## Transforming the variable with square root transformation, which is often used for reducing right skewness. Turns out it worked
sqrt_total_bill <- sqrt(data1$total_bill)
hist(sqrt_total_bill )

Utilising data for analysis

Investigating if having a loyalty card affects the amount of food they order.

# This is the R chunk for the how to utilize data for analysis
## Using lattice and dplyr to analysis if customer who has a loyalty card spend more in the restarant. In detail, I used customerHasloyalty as 
data1 %>% group_by(data1$`customerHasloyalty?`) %>% summarise(Min = min(data1$order_price,na.rm = TRUE),
                        Q1 = quantile(data1$order_price,probs = .25,na.rm = TRUE),                                            
                        Median = median(data1$order_price, na.rm =TRUE),                                                     
                        Q3 = quantile(data1$order_price,probs = .75,na.rm = TRUE), 
                        Max = max(data1$order_price,na.rm = TRUE), 
                        Mean = mean(data1$order_price, na.rm = TRUE),                                                         
                        SD = sd(data1$order_price, na.rm = TRUE),n = n(),
                        Missing = sum(is.na(data1$order_price)))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 10
##   `data1$\`customerHas…   Min    Q1 Median    Q3   Max  Mean    SD     n Missing
##   <fct>                 <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <int>
## 1 Do not have            44.5  287.   428.  652.  1283  482.  256.   818       0
## 2 Has                    44.5  287.   428.  652.  1283  482.  256.    32       0

data1 %>% histogram(~data1$order_price |data1$`customerHasloyalty?`, data = .,layout=c(1,2))

Results analysis

That could be explained because customer who has a loyalty card order more frequently from the store. Hence why their order bill is usually less than the people who does not have loyalty card.