This report analyses the performance of Jumpman23’s newest market - New York City. The data covers 30 days of operation in October 2014 and has been analysed to draw insights about the operations and to identify potential data integrity issues in the dataset.
Language/Tools used: R and RStudio were used for this analysis. The report was then put together with R Markdown and published on rpubs.com for easy access.
I chose R as the medium of analysis for the following reasons:
Apart from R, Python or Tableau could also have been used for this purpose and would offer similar functionality.
Similar Reports by Yash Sharma
I have additionally worked on R/Shiny/Rmarkdown projects which are published online. These projects can be found at:
The following are some of the key takeaways from the analysis:
From my analysis and deep dives, I believe the following are the data integrity issues, along with how they would impact the analysis:
Missing Data: A lot of data is missing. Of the 5,983 rows, 2,945 have no value for how_long_it_took_to_order, 1,230 have no item details (item_name, item_quantity, item_category_name), 883 have no place_category and 550 have no pickup arrival or departure timestamps. This could bias the analysis, since the records we could not analyse may have behaved differently from the ones we did.
Timestamp: The calculated average speed for some bicycle deliveries in NYC was about 80 mph, which is implausible. This suggests errors in the timestamp variables, which would bias any analysis that relies on them.
Repeated values for single orders: When a customer ordered multiple items in a single order, the data contains one row per item. This can distort delivery-level analysis, so I removed the duplicates as explained in the detailed analysis tab.
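The first and third issues above can be checked with a few lines of base R. The sketch below is illustrative only: the `toy` data frame, its values and the helper `is_missing` are made up for the example and are not part of the original dataset or analysis. It treats both `NA` and empty strings as missing, since the raw file stores missing text and timestamps as `""`.

```r
# Toy data frame standing in for the raw file; all values are made up.
toy <- data.frame(
  delivery_id    = c(1, 1, 2, 3),
  item_name      = c("Lemonade", "Fries", "", "Bagel"),
  item_quantity  = c(1, 2, NA, 1),
  arrived_pickup = c("2014-10-01 10:00:00", "2014-10-01 10:00:00",
                     "", "2014-10-02 12:30:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a value is missing when it is NA or an empty string.
is_missing <- function(x) is.na(x) | (is.character(x) & x == "")

# Number of missing values in each column.
missing_per_column <- vapply(toy, function(x) sum(is_missing(x)), integer(1))
print(missing_per_column)

# Number of extra rows that repeat an earlier delivery_id.
extra_rows <- sum(duplicated(toy$delivery_id))
print(extra_rows)
```

Running the same two checks on the full data frame would give a per-column view of the gaps and the number of rows that the de-duplication step drops.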
Here are some of the graphs/tables which I think gave some significant insights, though extensive analysis was performed on all variables in the dataset.
Data Explored
The data has 5983 rows and 18 columns.
df <- read.csv("analyze_me.csv",stringsAsFactors = F)
dim(df)
## [1] 5983 18
#Load dplyr, used later to remove duplicate orders (keeping the first row)
library(dplyr)
The following is the structure of the data. I have used the “str” function, which ships with base R. It shows the name, type and some sample values for each column.
str(df)
## 'data.frame': 5983 obs. of 18 variables:
## $ delivery_id : int 1457973 1377056 1476547 1485494 1327707 1423142 1334106 1311619 1487674 1417206 ...
## $ customer_id : int 327168 64452 83095 271149 122609 75169 101347 59161 55375 153816 ...
## $ jumpman_id : int 162381 104533 132725 157175 118095 91932 124897 79847 181543 157415 ...
## $ vehicle_type : chr "van" "bicycle" "bicycle" "bicycle" ...
## $ pickup_place : chr "Melt Shop" "Prince Street Pizza" "Bareburger" "Juice Press" ...
## $ place_category : chr "American" "Pizza" "Burger" "Juice Bar" ...
## $ item_name : chr "Lemonade" "Neapolitan Rice Balls" "Bare Sodas" "OMG! My Favorite Juice!" ...
## $ item_quantity : int 1 3 1 1 2 1 1 2 NA 1 ...
## $ item_category_name : chr "Beverages" "Munchables" "Drinks" "Cold Pressed Juices" ...
## $ how_long_it_took_to_order : chr "00:19:58.582052" "00:25:09.107093" "00:06:44.541717" "" ...
## $ pickup_lat : num 40.7 40.7 40.7 40.7 40.7 ...
## $ pickup_lon : num -74 -74 -74 -74 -74 ...
## $ dropoff_lat : num 40.8 40.7 40.7 40.8 40.7 ...
## $ dropoff_lon : num -74 -74 -74 -74 -74 ...
## $ when_the_delivery_started : chr "2014-10-26 13:51:59.898924" "2014-10-16 21:58:58.65491" "2014-10-28 21:39:52.654394" "2014-10-30 10:54:11.531894" ...
## $ when_the_Jumpman_arrived_at_pickup : chr "" "2014-10-16 22:26:02.120931" "2014-10-28 21:37:18.793405" "2014-10-30 11:04:17.759577" ...
## $ when_the_Jumpman_left_pickup : chr "" "2014-10-16 22:48:23.091253" "2014-10-28 21:59:09.98481" "2014-10-30 11:16:37.895816" ...
## $ when_the_Jumpman_arrived_at_dropoff: chr "2014-10-26 14:52:06.313088" "2014-10-16 22:59:22.948873" "2014-10-28 22:04:40.634962" "2014-10-30 11:32:38.090061" ...
The “describe” function from the “Hmisc” package enables inspection of each variable. It shows the count, missing values (if any), distinct values, mean and other summary values. It also shows the 5 highest and 5 lowest values for each variable. This is very useful in finding missing values, outliers etc in the data.
library(Hmisc)
describe(df)
## df
##
## 18 Variables 5983 Observations
## ---------------------------------------------------------------------------
## delivery_id
## n missing distinct Info Mean Gmd .05 .10
## 5983 0 5214 1 1379495 74533 1281674 1293571
## .25 .50 .75 .90 .95
## 1322793 1375689 1436371 1472712 1480980
##
## lowest : 1271706 1271751 1271867 1272279 1272303
## highest: 1491110 1491144 1491147 1491341 1491424
## ---------------------------------------------------------------------------
## customer_id
## n missing distinct Info Mean Gmd .05 .10
## 5983 0 3192 1 176473 129473 44005 54921
## .25 .50 .75 .90 .95
## 77817 131093 293381 363561 377245
##
## lowest : 242 641 1311 1517 2533, highest: 404787 405147 405233 405334 405547
## ---------------------------------------------------------------------------
## jumpman_id
## n missing distinct Info Mean Gmd .05 .10
## 5983 0 578 1 102662 55402 19432 30905
## .25 .50 .75 .90 .95
## 60761 113364 143807 158791 165848
##
## lowest : 3296 3592 3941 5935 6458, highest: 177485 177847 178325 179183 181543
## ---------------------------------------------------------------------------
## vehicle_type
## n missing distinct
## 5983 0 7
##
## Value bicycle car motorcycle scooter truck
## Frequency 4274 1215 21 75 48
## Proportion 0.714 0.203 0.004 0.013 0.008
##
## Value van walker
## Frequency 76 274
## Proportion 0.013 0.046
## ---------------------------------------------------------------------------
## pickup_place
## n missing distinct
## 5983 0 898
##
## lowest : 'Essen 'wichcraft Il Mulino New York $10 Blue Ribbon Fried Chicken Sandwich 11th Street Cafe
## highest: Yura On Madison Zabar's Zen Palate Zero Otto Nove Zucker's Bagels & Smoked Fish
## ---------------------------------------------------------------------------
## place_category
## n missing distinct
## 5100 883 57
##
## lowest : African American Art Store Asian Bakery
## highest: Sushi Thai Vegan Vegetarian Vietnamese
## ---------------------------------------------------------------------------
## item_name
## n missing distinct
## 4753 1230 2277
##
## lowest : 'Shroom Burger "Ala Vodka" Sauce with Mushrooms "Lure Style" Burger "The Cadillac" $10 Dirty Bird Rotisserie Chicken Wrap
## highest: Yellowtail with Jalapeno Sushi Yogurt Raisins Zesty Corn Zinc 50mg Target Mins Tb Zucchini Chips
## ---------------------------------------------------------------------------
## item_quantity
## n missing distinct Info Mean Gmd .05 .10
## 4753 1230 11 0.411 1.248 0.4391 1 1
## .25 .50 .75 .90 .95
## 1 1 1 2 2
##
## Value 1 2 3 4 5 6 7 8 12 15
## Frequency 3980 570 112 54 13 14 1 4 1 3
## Proportion 0.837 0.120 0.024 0.011 0.003 0.003 0.000 0.001 0.000 0.001
##
## Value 16
## Frequency 1
## Proportion 0.000
## ---------------------------------------------------------------------------
## item_category_name
## n missing distinct
## 4753 1230 767
##
## lowest : 10" Pies 18" Pizzas 6" Cakes A la Cart A La Carte
## highest: Yasai (Vegetable Rolls) Year Round Flavors Yeast Doughnuts Your Creation Yummy Food
## ---------------------------------------------------------------------------
## how_long_it_took_to_order
## n missing distinct
## 3038 2945 2579
##
## lowest : 00:01:22.997519 00:01:32.308446 00:01:33.864756 00:01:37.552443 00:01:38.872929
## highest: 00:47:48.181357 00:58:02.117535 01:03:42.753775 01:12:59.55104 01:13:13.266118
## ---------------------------------------------------------------------------
## pickup_lat
## n missing distinct Info Mean Gmd .05 .10
## 5983 0 1210 1 40.74 0.02508 40.72 40.72
## .25 .50 .75 .90 .95
## 40.72 40.74 40.76 40.78 40.78
##
## lowest : 40.66561 40.67176 40.67348 40.67425 40.67469
## highest: 40.80626 40.80707 40.81133 40.81535 40.81808
## ---------------------------------------------------------------------------
## pickup_lon
## n missing distinct Info Mean Gmd .05 .10
## 5983 0 1179 1 -73.99 0.01634 -74.01 -74.00
## .25 .50 .75 .90 .95
## -74.00 -73.99 -73.98 -73.96 -73.96
##
## lowest : -74.01584 -74.01545 -74.01486 -74.01472 -74.01463
## highest: -73.93543 -73.93420 -73.93324 -73.92829 -73.92098
## ---------------------------------------------------------------------------
## dropoff_lat
## n missing distinct Info Mean Gmd .05 .10
## 5983 0 2841 1 40.74 0.0283 40.71 40.72
## .25 .50 .75 .90 .95
## 40.73 40.74 40.76 40.78 40.79
##
## lowest : 40.64936 40.64940 40.64957 40.65215 40.66689
## highest: 40.82885 40.83513 40.83603 40.83740 40.84832
## ---------------------------------------------------------------------------
## dropoff_lon
## n missing distinct Info Mean Gmd .05 .10
## 5983 0 2839 1 -73.99 0.02033 -74.01 -74.01
## .25 .50 .75 .90 .95
## -74.00 -73.99 -73.97 -73.96 -73.95
##
## lowest : -74.01768 -74.01729 -74.01715 -74.01712 -74.01706
## highest: -73.92836 -73.92772 -73.92582 -73.92481 -73.92412
## ---------------------------------------------------------------------------
## when_the_delivery_started
## n missing distinct
## 5983 0 5214
##
## lowest : 2014-10-01 00:07:58.632482 2014-10-01 00:26:31.924774 2014-10-01 01:00:06.75635 2014-10-01 08:46:15.935061 2014-10-01 09:20:21.573801
## highest: 2014-10-30 22:24:54.42562 2014-10-30 22:31:58.003417 2014-10-30 22:32:24.293206 2014-10-30 22:56:00.07339 2014-10-30 23:08:43.4819
## ---------------------------------------------------------------------------
## when_the_Jumpman_arrived_at_pickup
## n missing distinct
## 5433 550 4719
##
## lowest : 2014-10-01 00:39:31.086322 2014-10-01 01:19:29.205722 2014-10-01 09:02:40.003541 2014-10-01 09:26:01.194532 2014-10-01 10:10:27.589662
## highest: 2014-10-30 22:30:00.72672 2014-10-30 22:34:18.51496 2014-10-30 22:34:33.893881 2014-10-30 23:01:38.619634 2014-10-30 23:10:31.062088
## ---------------------------------------------------------------------------
## when_the_Jumpman_left_pickup
## n missing distinct
## 5433 550 4717
##
## lowest : 2014-10-01 00:59:57.522402 2014-10-01 01:36:49.131316 2014-10-01 09:15:59.607582 2014-10-01 09:37:56.158669 2014-10-01 10:32:19.033949
## highest: 2014-10-30 22:54:17.896179 2014-10-30 22:57:59.036928 2014-10-30 23:06:54.47219 2014-10-30 23:14:58.679208 2014-10-30 23:23:51.143279
## ---------------------------------------------------------------------------
## when_the_Jumpman_arrived_at_dropoff
## n missing distinct
## 5983 0 5214
##
## lowest : 2014-10-01 00:30:21.109149 2014-10-01 01:04:14.355157 2014-10-01 01:49:29.034932 2014-10-01 09:28:40.095456 2014-10-01 09:39:41.631246
## highest: 2014-10-30 23:04:40.777794 2014-10-30 23:05:57.857982 2014-10-30 23:19:29.96027 2014-10-30 23:22:48.252946 2014-10-30 23:29:44.866438
## ---------------------------------------------------------------------------
For the deep-dive analysis and data exploration, I created additional columns. Below is the explanation/formula of each new variable and the code used to create it.
After exploring the data and creating the new variables, I removed the duplicate orders, keeping the first entry for each order. This can give misleading results when analysing the items that were delivered, so care is needed there.
#Change formats of timestamp
library(lubridate)
df$when_the_delivery_started <- ymd_hms(substr(df$when_the_delivery_started,1,19))
df$when_the_Jumpman_arrived_at_pickup <- ymd_hms(substr(df$when_the_Jumpman_arrived_at_pickup,1,19))
df$when_the_Jumpman_left_pickup <- ymd_hms(substr(df$when_the_Jumpman_left_pickup,1,19))
df$when_the_Jumpman_arrived_at_dropoff <- ymd_hms(substr(df$when_the_Jumpman_arrived_at_dropoff,1,19))
#day of week and weekend flag
df$wday_delivery_started <- wday(df$when_the_delivery_started)
df$weekend_delivery_started <- ifelse(df$wday_delivery_started %in% c(1,7),1,0)
df$day_del_started <- (day(df$when_the_delivery_started))
#Create time duration columns (in hours)
df$delivery_time <- difftime(df$when_the_Jumpman_arrived_at_dropoff,
df$when_the_Jumpman_left_pickup,
units="hours")
df$loading_time <- difftime(df$when_the_Jumpman_left_pickup,
df$when_the_Jumpman_arrived_at_pickup,
units="hours")
df$jumpman_arrival_time <- difftime(df$when_the_Jumpman_arrived_at_pickup,
df$when_the_delivery_started,
units="hours")
#delivery distance in miles (Haversine, converted from meters)
#Note: geosphere expects each point as (longitude, latitude), in that order
library(geosphere)
df$delivery_distance <- distHaversine(cbind(df$pickup_lon,df$pickup_lat),
cbind(df$dropoff_lon,df$dropoff_lat))/1609.34
#Calculate average Jumpman speed
df$jumpman_avg_speed <- df$delivery_distance/as.numeric(df$delivery_time)
library(dplyr)
df_unique <- df %>% distinct(delivery_id, .keep_all = TRUE)
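The distance step is sensitive to the order of the coordinates, because geosphere takes points as (longitude, latitude). As a rough cross-check that does not depend on any package, the Haversine formula can be written in base R. This is a sketch under stated assumptions: `haversine_miles` is a hypothetical helper, the two points are made up, and the radius of 6378137 m is the default that `distHaversine` uses.

```r
# Hypothetical helper: great-circle distance in miles between two points
# given in decimal degrees. Arguments are vectorised.
haversine_miles <- function(lon1, lat1, lon2, lat2, radius_m = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_m * asin(pmin(1, sqrt(a))) / 1609.34
}

# One degree of latitude along a meridian, as a sanity check.
one_degree <- haversine_miles(-74, 40, -74, 41)
print(one_degree)

# Made-up pickup and dropoff on the same meridian in Manhattan.
correct <- haversine_miles(-73.99, 40.72, -73.99, 40.78)
# The same points with latitude and longitude swapped by mistake.
swapped <- haversine_miles(40.72, -73.99, 40.78, -73.99)
print(c(correct = correct, swapped = swapped))
```

With the arguments swapped, the north-south trip is treated as an east-west trip at a latitude of about 74 degrees south, so the result is noticeably shorter. Any speed derived from it would be wrong by the same factor.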
Below is the geospatial representation of the pickup and dropoff locations across New York.
library(leaflet)
#weekend vs weekday by dropoff
leaflet() %>% setView(-73.972887,40.732828,zoom=12) %>% addTiles() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(data=subset(df_unique,weekend_delivery_started==0),
lat=~dropoff_lat,lng=~dropoff_lon,weight=1,radius=3,opacity=1,color="Orange") %>%
addCircleMarkers(data=subset(df_unique,weekend_delivery_started==1),
lat=~dropoff_lat,lng=~dropoff_lon,weight=1,radius=2,opacity=1,color="Blue") %>%
addLegend("bottomright",colors =c("Blue", "Orange"),labels= c("Weekend","Weekday"),opacity = 1)
library(leaflet)
#weekend vs weekday by pickup
leaflet() %>% setView(-73.972887,40.732828,zoom=12) %>% addTiles() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(data=subset(df_unique,weekend_delivery_started==0),
lat=~pickup_lat,lng=~pickup_lon,weight=1,radius=3,opacity=1,color="Orange") %>%
addCircleMarkers(data=subset(df_unique,weekend_delivery_started==1),
lat=~pickup_lat,lng=~pickup_lon,weight=1,radius=2,opacity=1,color="Blue")%>%
addLegend("bottomright",colors =c("Blue", "Orange"),labels= c("Weekend","Weekday"),opacity = 1)
First, I counted the unique customers: 3,192 over the one-month period (Oct 1, 2014 to Oct 30, 2014). I then analysed the number of orders per customer, which tells us how many customers are repeat users of our service.
The histogram and summary below show that most customers (2,216 of 3,192, about 69%) ordered only once. The number of customers falls off steeply as the number of orders increases. We need to retain more customers, as retaining a customer is typically cheaper than acquiring a new one.
#Unique number of customers
paste(length(unique(df_unique$customer_id))," Unique Customers")
## [1] "3192 Unique Customers"
library(ggplot2)
orders_per_customer <- data.frame(orders=as.vector(table(df_unique$customer_id)))
ggplot(orders_per_customer)+
geom_histogram(bins=30,aes(x=orders))+
ggtitle("Customer Order Frequency - Histogram")+
xlab("Orders per customer")+
ylab("Number of Customers")
describe(as.vector(table(df_unique$customer_id)))
## as.vector(table(df_unique$customer_id))
## n missing distinct Info Mean Gmd .05 .10
## 3192 0 16 0.661 1.633 1.023 1 1
## .25 .50 .75 .90 .95
## 1 1 2 3 4
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 2216 512 232 106 54 29 14 7 6 5
## Proportion 0.694 0.160 0.073 0.033 0.017 0.009 0.004 0.002 0.002 0.002
##
## Value 11 12 14 15 17 23
## Frequency 4 2 1 2 1 1
## Proportion 0.001 0.001 0.000 0.001 0.000 0.000
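The frequency table above is enough to work out how much of the business comes from repeat customers. The sketch below simply re-enters the values and frequencies printed by `describe`; the variable names are illustrative and not part of the original analysis.

```r
# Orders per customer and the number of customers at each level,
# copied from the describe() output above.
orders    <- c(1:12, 14, 15, 17, 23)
customers <- c(2216, 512, 232, 106, 54, 29, 14, 7, 6, 5, 4, 2, 1, 2, 1, 1)

total_customers  <- sum(customers)
total_deliveries <- sum(orders * customers)

# Share of customers who ordered more than once.
repeat_customer_share <- 1 - customers[orders == 1] / total_customers

# Share of deliveries placed by those repeat customers.
repeat_delivery_share <- 1 - customers[orders == 1] / total_deliveries

print(c(customers = total_customers, deliveries = total_deliveries))
print(round(c(repeat_customers = repeat_customer_share,
              repeat_deliveries = repeat_delivery_share), 3))
```

The totals reproduce the 3,192 unique customers and the 5,214 distinct delivery IDs reported earlier, which is a useful consistency check on the de-duplication. Roughly 31% of customers ordered more than once, and they account for a little over half of all deliveries.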
I also wanted to see when new customers were acquired during the month of operation. The graph below shows the number of new customers acquired on each day of business, based on each customer’s first delivery with Jumpman23.
We can see that the customer acquisition rate decreased over time. To dig deeper into this, we would have to look at marketing campaigns, ads, etc.
#Customer acquisition timeframe
library(dplyr)
cust_acq <- df_unique %>%
group_by(customer_id) %>%
summarise(first_day=min(day(when_the_delivery_started)))
ggplot(cust_acq,aes(x=first_day,y=1))+
stat_summary(fun.y=sum,geom="line")+
ggtitle("Number of New Customers Acquired per Day of Operation")+
ylab("Number of Customers")+
xlab("Days in October 2014")
I have taken the column “when_the_delivery_started” as the column that indicates when the order was created, and I have analyzed it in the graphs below. The intent here is to find out when most orders come in.
I wanted to see how it varies across hours of the day.
#Range of Dates (When delivery started)
paste("Dates range from ",
min(df_unique$when_the_delivery_started),
" to ",
max(df_unique$when_the_delivery_started)
)
## [1] "Dates range from 2014-10-01 00:07:58 to 2014-10-30 23:08:43"
#Delivery trends by hour of the day
ggplot(df_unique,aes(x=hour(when_the_delivery_started), 1,group=1)) +
stat_summary(fun.y = sum,geom = "bar")+
ggtitle("Deliveries by hour of the day")+
ylab("Number of Deliveries")+xlab("Hour of the day")
#Delivery trends by weekday/weekend (STANDARDIZED by number of weekdays/weekends)
library(dplyr)
df_wday_hour <- df_unique %>%
group_by(weekend_delivery_started,hour(when_the_delivery_started)) %>%
summarise(count=n())
df_wday_hour$count <- ifelse(df_wday_hour$weekend_delivery_started == 1,df_wday_hour$count/8,df_wday_hour$count/22)
colnames(df_wday_hour) <- c("weekend","hour","count")
df_wday_hour$weekend <- ifelse(df_wday_hour$weekend == 1, "Yes","No")
ggplot(df_wday_hour,aes(x=hour,y=count,group=weekend,color=weekend))+
geom_line(size=2)+
ggtitle("Delivery Trends by hour of the day by weekend/weekday")+
xlab("Hour of the day")+ylab("Number of Deliveries")
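The divisors 8 and 22 used above are the numbers of weekend days and weekdays between October 1 and October 30, 2014. As a sketch using only base R, they can be derived from the calendar instead of being typed in. The variable names and the `hourly` totals below are made up for illustration.

```r
# Every calendar day covered by the data.
days <- seq(as.Date("2014-10-01"), as.Date("2014-10-30"), by = "day")

# POSIXlt weekday index: 0 = Sunday, 6 = Saturday.
day_index <- as.POSIXlt(days)$wday
n_weekend <- sum(day_index %in% c(0, 6))
n_weekday <- length(days) - n_weekend
print(c(weekend_days = n_weekend, weekdays = n_weekday))

# Hypothetical hourly totals, standardised to deliveries per calendar day.
hourly <- data.frame(weekend = c(1, 0), hour = c(12, 12), count = c(80, 220))
hourly$per_day <- ifelse(hourly$weekend == 1,
                         hourly$count / n_weekend,
                         hourly$count / n_weekday)
print(hourly)
```

Deriving the divisors this way keeps the standardisation correct if the date range of the data changes.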
In this section, I have analyzed how the number of deliveries (by order start timestamp) varies over time.
library(ggplot2)
#Delivery trends by day of the week (NOT STANDARDIZED by the number of times each day occurs in the calendar month)
ggplot(df_unique,aes(x=wday(when_the_delivery_started,label=T), 1,group=1)) +
stat_summary(fun.y = sum,geom = "line")+
ggtitle("Deliveries by Days of the Week")+
ylab("Number of Deliveries")+xlab("Days of the Week")
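Because the chart above uses raw totals, days that occur more often in the window look busier than they are. Between October 1 and October 30, 2014, Wednesday and Thursday each occur five times and every other day four times. The sketch below shows one way to standardise; the `raw_counts` values are made up and the variable names are illustrative.

```r
# Every calendar day covered by the data.
days <- seq(as.Date("2014-10-01"), as.Date("2014-10-30"), by = "day")

# Label each date with its day of the week (POSIXlt index 0 = Sunday).
day_labels  <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day_of_week <- factor(day_labels[as.POSIXlt(days)$wday + 1], levels = day_labels)
occurrences <- table(day_of_week)
print(occurrences)

# Hypothetical raw delivery totals per day of the week (made-up numbers).
raw_counts <- c(Sun = 900, Mon = 600, Tue = 620, Wed = 800,
                Thu = 790, Fri = 650, Sat = 700)

# Average deliveries per occurrence of each day.
deliveries_per_day <- raw_counts / as.vector(occurrences[names(raw_counts)])
print(round(deliveries_per_day, 1))
```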
In this part, I wanted to check which vehicle types are used the most and how delivery times, distances and average delivery speeds vary by vehicle type.
The graph below clearly shows that bicycles are used the most.
#overall vehicle usage
ggplot(df_unique,aes(x=vehicle_type, 1,group=1)) +
stat_summary(fun.y = sum,geom = "bar")+
ggtitle("Number of Deliveries by Vehicle Type")+
xlab("Vehicle Type")+ylab("Number of Deliveries")
Diving deeper, I looked at how vehicle usage varies by day of the week. Most deliveries are made by bicycle. In terms of proportions, a higher percentage of deliveries is made by bicycle on weekdays, while cars take a comparatively higher share on weekends.
#vehicle usage by days of the week
ggplot(df_unique,aes(x=wday(when_the_delivery_started,label=T), 1,group=vehicle_type,color=vehicle_type)) +
stat_summary(fun.y = sum,geom = "line",size=1)+
ggtitle("Vehicle usage by Days of the Week")+
xlab("Days of the Week")+ylab("Number of Deliveries")
ggplot(df_unique,aes(x=wday(when_the_delivery_started,label=T),1,fill=vehicle_type))+
geom_bar(position="fill",stat="identity")+
ggtitle("Proportion of deliveries by Vehicle Type by Days of Week")+
xlab("Days of the Week")+ylab("Proportion of Deliveries")
The same analysis was done across days of the month. Here we can see the weekly trends, with bicycles being used the most, followed by cars. In terms of proportions, the pattern of higher bicycle usage on weekdays does not always hold: some weekdays also have a lower share of bicycle deliveries. We need more data to understand why this is happening.
#Vehicle usage by days of the month
ggplot(df_unique,aes(x=as.factor(day_del_started), 1,group=vehicle_type,color=vehicle_type)) +
stat_summary(fun.y = sum,geom = "line",size=1)+
ggtitle("Vehicle usage by Days of the Month")+
xlab("Days of October")+ylab("Number of Deliveries")
ggplot(df_unique,aes(x=as.factor(day_del_started),1,fill=vehicle_type))+
geom_bar(position="fill",stat="identity")+
ggtitle("Proportion of Vehicle deliveries by Vehicle Type by Days of the Month")+
xlab("Days of October")+ylab("Proportion of Deliveries")
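The proportions read off the stacked charts can also be tabulated directly, which makes the weekday versus weekend comparison easier to quote. This is a minimal sketch on a made-up `toy` data frame, using the same column names as the analysis; it is not output from the real data.

```r
# Toy deliveries; the values are made up for illustration.
toy <- data.frame(
  weekend_delivery_started = c(0, 0, 0, 0, 1, 1, 1, 0),
  vehicle_type = c("bicycle", "bicycle", "car", "bicycle",
                   "car", "bicycle", "car", "walker"),
  stringsAsFactors = FALSE
)

# Deliveries by day type (rows) and vehicle type (columns).
counts <- table(toy$weekend_delivery_started, toy$vehicle_type)

# Row-wise proportions: each row sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

In the toy data the bicycle share is higher on weekdays (row `0`) than on weekends (row `1`), which is the kind of difference the charts above suggest for the real data.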
First, I looked at how the delivery time taken by the Jumpman varies across vehicle types and days of the week.
#delivery time by vehicle type
ggplot(df_unique,aes(x=vehicle_type,y=delivery_time))+
geom_boxplot()+
ggtitle("Delivery Time variation across Vehicle Types")+
xlab("Vehicle Types")+ylab("Delivery Time (Hours)")
#delivery time by day of the week
ggplot(df_unique,aes(x=wday(when_the_delivery_started,label=T),y=delivery_time))+
geom_boxplot()+
ggtitle("Delivery Time variation across Days of the Week")+
xlab("Days of the Week")+ylab("Delivery Time (Hours)")
Next, I looked at how loading times vary across vehicle types. Loading times were distributed similarly across all vehicle types, although motorcycles had slightly higher loading times than the others.
#loading time by vehicle type
ggplot(df_unique,aes(x=vehicle_type,y=loading_time))+
geom_boxplot()+
ggtitle("Loading Time variation across Vehicle Types")+
xlab("Vehicle Types")+ylab("Loading Time (Hours)")
I wanted to see whether bigger vehicles are used for longer routes and bicycles and walkers for shorter ones. That is what the data showed.
ggplot(df_unique,aes(x=vehicle_type,y=delivery_distance))+
geom_boxplot()+
ggtitle("Delivery Distances by Vehicle Type")+
xlab("Vehicle Type")+
ylab("Distance (Miles)")
I wanted to see if the average speed varied by different vehicle types. So below are the box plots:
ggplot(df_unique,aes(x=vehicle_type,y=jumpman_avg_speed))+
geom_boxplot()+
ggtitle("Avg. Delivery Speed by Vehicle Type")+
xlab("Vehicle Type")+
ylab("Speed (MPH)")
Some of the outliers are clearly wrong: the average speed cannot be that high for bicycles in NYC. So I plotted another chart after removing the outliers, and saw that motorcycles and scooters had somewhat higher variation and slightly higher average delivery speeds.
ggplot(df_unique,aes(x=vehicle_type,y=jumpman_avg_speed))+
geom_boxplot()+
ggtitle("Avg. Delivery Speed by Vehicle Type")+
xlab("Vehicle Type")+
ylab("Speed (MPH)")+ylim(0,20)
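Rather than only clipping the axis, the implausible speeds can be flagged so that the affected deliveries can be inspected. The sketch below assumes a hypothetical speed ceiling per vehicle type; the thresholds in `max_plausible_mph` and the `toy` values are made up and would need to be agreed with the operations team.

```r
# Toy deliveries with average speeds in mph; the values are made up.
toy <- data.frame(
  vehicle_type      = c("bicycle", "bicycle", "car", "walker", "bicycle"),
  jumpman_avg_speed = c(7.5, 80, 22, 3, NA),
  stringsAsFactors  = FALSE
)

# Hypothetical ceilings in mph; these are assumptions, not measured limits.
max_plausible_mph <- c(bicycle = 25, car = 60, walker = 6)

# Look up the ceiling for each row by vehicle type.
limit <- max_plausible_mph[toy$vehicle_type]

# Flag rows whose speed is known and above the ceiling.
toy$speed_suspect <- !is.na(toy$jumpman_avg_speed) &
  toy$jumpman_avg_speed > limit
print(toy)
```

One design point worth knowing: in ggplot2, `ylim(0,20)` drops the out-of-range observations before the boxplot statistics are computed, whereas `coord_cartesian(ylim=c(0,20))` only zooms the view and leaves the statistics unchanged. The chart above therefore summarises only deliveries at or below 20 mph.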
I also wanted to see whether certain pickup locations bring in a lot of business for Jumpman23. Ideally, we should identify these pickup locations and target them differently.
paste(length(unique(df_unique$pickup_place))," unique pickup locations")
## [1] "898 unique pickup locations"
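#Note: this counts rows of df (one row per item), not unique deliveries; use df_unique for delivery-level counts
df_pickup_place <- df %>%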
group_by(pickup_place) %>%
summarise(count=n())
library(ggplot2)
qplot(df_pickup_place$count, geom="histogram",bins=200)+
ggtitle("Frequency of Deliveries from pickup locations")+
xlab("Delivery Frequency")+ylab("Number of pickup locations")
library(Hmisc)
describe(as.factor(df_pickup_place$count))
## as.factor(df_pickup_place$count)
## n missing distinct
## 898 0 57
##
## lowest : 1 2 3 4 5 , highest: 149 151 184 186 311
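To turn the distribution above into a target list, the pickup locations can be ranked and their cumulative share of deliveries computed. This is a base R sketch on made-up data; the `toy` data frame and the 75% cut-off are assumptions for illustration only. It counts one row per delivery, so on the real data it would be run on the de-duplicated table.

```r
# Toy de-duplicated deliveries; place names and counts are made up.
toy <- data.frame(
  delivery_id  = 1:10,
  pickup_place = c("A", "A", "A", "A", "B", "B", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Deliveries per pickup place, largest first.
tab <- table(toy$pickup_place)
place_counts <- sort(setNames(as.vector(tab), names(tab)), decreasing = TRUE)

# Cumulative share of all deliveries, walking down the ranking.
cum_share <- cumsum(place_counts) / sum(place_counts)
print(round(cum_share, 2))

# Places that together account for up to 75% of deliveries (assumed cut-off).
top_places <- names(cum_share)[cum_share <= 0.75]
print(top_places)
```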