Overview

This report analyzes the performance of Jumpman23’s newest market, New York City. The data covers 30 days of operation in October 2014 and has been analysed to draw insights about the operations and to identify potential data integrity issues in the dataset.


Language/Tools used: R and RStudio were used for this analysis. The report was then built with Rmarkdown and published on rpubs.com for ease of access.

I chose R as the medium of analysis for the following reasons:

  1. R is open source software, with strong support from its online community and plenty of available tutorials. So when new developments or doubts come up, it is easy to reach out and overcome any hurdles.
  2. It has great visualization packages like plotly, ggplot2 etc. These packages have built-in functions which can be used to visualize data in many ways. We can even leverage third party APIs to get external data.
  3. Packages like Rmarkdown and Shiny make it easy to share a report/dashboard. The end user does not need an R environment to view these kinds of reports.

Apart from R, Python or Tableau could also have been used for this purpose and would have offered similar functionality.


Similar Reports by Yash Sharma

I have additionally worked on R/Shiny/Rmarkdown projects which are published online. These projects can be found at:

  1. Cincinnati Crime Analysis: https://yash-sharma.shinyapps.io/cincinnati_crime_dashboard/
  2. R class homeworks: https://rpubs.com/yash91sharma/
  3. Github profile: https://github.com/yash91sharma

Executive Summary

Following are some of the key takeaways from the analysis:

  1. Missing Values: There are a lot of missing values in the data.
  2. Unique Orders: There are 5,983 rows in the dataset, but only 5,214 unique delivery IDs. If a customer ordered multiple items, there is an entry for each item in the database. Hence the database is not unique on delivery_id. Rather, the combination of delivery_id and item_name is unique.
  3. Variable Creation: Some additional variables were created using the existing columns and external sources/API. These columns are explained in the detailed analysis.
  4. Customers: Jumpman23 had 3,192 unique customers in the 30 days of operations. About 69% used the service only once, ~16% used it twice, and the share drops off sharply after that. Only a handful used it more than 10 times.
  5. Customer Acquisition: Jumpman23 acquired a lot of customers initially, and the customer acquisition rate decreased over time.
  6. Trends/Patterns across Time: Orders peak at noon and at around 7 PM. This trend is similar across weekdays and weekends, with average orders per hour slightly higher on weekends.
  7. Vehicle: Bicycles deliver most of the products across all days and hours of the day. However, on weekends, a larger proportion of orders is delivered by bicycles compared to other days.
  8. Delivery Time: Cars, trucks and vans take the most time to deliver the items from pickup point to drop point.
  9. Loading Times: Motorcycles have slightly higher item loading times (time waited at pickup point) as compared to other vehicles.
  10. Delivery Distance: Trucks and cars are used to deliver products over long distances, while bicycles, walkers and scooters are used for short distances.
  11. Delivery Speed: Average delivery speed is fairly equal across all delivery modes, with scooters having more variance as compared to other vehicles.
  12. Pickup Locations: There are 898 unique pickup locations. Most of them received a single order, while a handful had a higher number of orders.

Data Integrity Issues

From my analysis and the deep dives, I believe the following are the data integrity issues, along with how they would impact the analysis:

  1. Missing Data: There is a lot of missing data. This could bias the analysis, as the missing records might have held interesting results.

  2. Timestamp: After calculating the average speed during the delivery, I found certain deliveries with an average speed of about 80 mph on bicycles in NYC. Hence, there seems to be an error in the timestamp variables. This would bias any analysis that uses them.

  3. Repeated values for single orders: When a customer ordered multiple items in a single order, there were multiple rows, one for each item. This could create problems with delivery-level analysis. I removed the duplicates as explained in the detailed analysis.
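
The repeated-rows issue can be checked directly. Below is a minimal sketch in base R on a made-up three-row data frame (the values are illustrative, not taken from the dataset). It compares the row count with the number of distinct delivery IDs and tests whether the pair delivery_id + item_name is unique.

```r
# Toy data (hypothetical values): delivery 101 has two item rows, 102 has one
orders <- data.frame(
  delivery_id = c(101, 101, 102),
  item_name   = c("Lemonade", "Fries", "Pizza"),
  stringsAsFactors = FALSE
)

n_rows       <- nrow(orders)                        # item-level rows
n_deliveries <- length(unique(orders$delivery_id))  # distinct deliveries

# TRUE when no (delivery_id, item_name) pair is repeated
key_is_unique <- !any(duplicated(orders[, c("delivery_id", "item_name")]))

# Keep the first row per delivery for delivery-level analysis
deliveries <- orders[!duplicated(orders$delivery_id), ]
```

If n_rows is larger than n_deliveries while key_is_unique is TRUE, the table is item-level rather than delivery-level, which is the situation described above.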

Detailed Analysis

Here are some of the graphs/tables which I think gave some significant insights, though extensive analysis was performed on all variables in the dataset.

Data Exploration

Data Explored

The data has 5983 rows and 18 columns.

df <- read.csv("analyze_me.csv",stringsAsFactors = F)
dim(df)
## [1] 5983   18
#dplyr is used later to remove duplicate orders (keeping the first row)
library(dplyr)

The following is the structure of the data. I have used the “str” function, which comes with base R. It shows the name, format and some sample values for each column.

str(df)
## 'data.frame':    5983 obs. of  18 variables:
##  $ delivery_id                        : int  1457973 1377056 1476547 1485494 1327707 1423142 1334106 1311619 1487674 1417206 ...
##  $ customer_id                        : int  327168 64452 83095 271149 122609 75169 101347 59161 55375 153816 ...
##  $ jumpman_id                         : int  162381 104533 132725 157175 118095 91932 124897 79847 181543 157415 ...
##  $ vehicle_type                       : chr  "van" "bicycle" "bicycle" "bicycle" ...
##  $ pickup_place                       : chr  "Melt Shop" "Prince Street Pizza" "Bareburger" "Juice Press" ...
##  $ place_category                     : chr  "American" "Pizza" "Burger" "Juice Bar" ...
##  $ item_name                          : chr  "Lemonade" "Neapolitan Rice Balls" "Bare Sodas" "OMG! My Favorite Juice!" ...
##  $ item_quantity                      : int  1 3 1 1 2 1 1 2 NA 1 ...
##  $ item_category_name                 : chr  "Beverages" "Munchables" "Drinks" "Cold Pressed Juices" ...
##  $ how_long_it_took_to_order          : chr  "00:19:58.582052" "00:25:09.107093" "00:06:44.541717" "" ...
##  $ pickup_lat                         : num  40.7 40.7 40.7 40.7 40.7 ...
##  $ pickup_lon                         : num  -74 -74 -74 -74 -74 ...
##  $ dropoff_lat                        : num  40.8 40.7 40.7 40.8 40.7 ...
##  $ dropoff_lon                        : num  -74 -74 -74 -74 -74 ...
##  $ when_the_delivery_started          : chr  "2014-10-26 13:51:59.898924" "2014-10-16 21:58:58.65491" "2014-10-28 21:39:52.654394" "2014-10-30 10:54:11.531894" ...
##  $ when_the_Jumpman_arrived_at_pickup : chr  "" "2014-10-16 22:26:02.120931" "2014-10-28 21:37:18.793405" "2014-10-30 11:04:17.759577" ...
##  $ when_the_Jumpman_left_pickup       : chr  "" "2014-10-16 22:48:23.091253" "2014-10-28 21:59:09.98481" "2014-10-30 11:16:37.895816" ...
##  $ when_the_Jumpman_arrived_at_dropoff: chr  "2014-10-26 14:52:06.313088" "2014-10-16 22:59:22.948873" "2014-10-28 22:04:40.634962" "2014-10-30 11:32:38.090061" ...

The “describe” function from the “Hmisc” package enables inspection of each variable. It shows the count, missing values (if any), distinct values, mean and other summary values. It also shows the 5 highest and 5 lowest values for each variable. This is very useful in finding missing values, outliers etc in the data.

library(Hmisc)
describe(df)
## df 
## 
##  18  Variables      5983  Observations
## ---------------------------------------------------------------------------
## delivery_id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5983        0     5214        1  1379495    74533  1281674  1293571 
##      .25      .50      .75      .90      .95 
##  1322793  1375689  1436371  1472712  1480980 
## 
## lowest : 1271706 1271751 1271867 1272279 1272303
## highest: 1491110 1491144 1491147 1491341 1491424
## ---------------------------------------------------------------------------
## customer_id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5983        0     3192        1   176473   129473    44005    54921 
##      .25      .50      .75      .90      .95 
##    77817   131093   293381   363561   377245 
## 
## lowest :    242    641   1311   1517   2533, highest: 404787 405147 405233 405334 405547
## ---------------------------------------------------------------------------
## jumpman_id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5983        0      578        1   102662    55402    19432    30905 
##      .25      .50      .75      .90      .95 
##    60761   113364   143807   158791   165848 
## 
## lowest :   3296   3592   3941   5935   6458, highest: 177485 177847 178325 179183 181543
## ---------------------------------------------------------------------------
## vehicle_type 
##        n  missing distinct 
##     5983        0        7 
##                                                                  
## Value         bicycle        car motorcycle    scooter      truck
## Frequency        4274       1215         21         75         48
## Proportion      0.714      0.203      0.004      0.013      0.008
##                                 
## Value             van     walker
## Frequency          76        274
## Proportion      0.013      0.046
## ---------------------------------------------------------------------------
## pickup_place 
##        n  missing distinct 
##     5983        0      898 
## 
## lowest : 'Essen                                 'wichcraft                              Il Mulino New York                    $10 Blue Ribbon Fried Chicken Sandwich 11th Street Cafe                      
## highest: Yura On Madison                        Zabar's                                Zen Palate                             Zero Otto Nove                         Zucker's Bagels & Smoked Fish         
## ---------------------------------------------------------------------------
## place_category 
##        n  missing distinct 
##     5100      883       57 
## 
## lowest : African    American   Art Store  Asian      Bakery    
## highest: Sushi      Thai       Vegan      Vegetarian Vietnamese
## ---------------------------------------------------------------------------
## item_name 
##        n  missing distinct 
##     4753     1230     2277 
## 
## lowest : 'Shroom Burger                         "Ala Vodka" Sauce with Mushrooms       "Lure Style" Burger                    "The Cadillac"                         $10 Dirty Bird Rotisserie Chicken Wrap
## highest: Yellowtail with Jalapeno Sushi         Yogurt Raisins                         Zesty Corn                             Zinc 50mg Target Mins Tb               Zucchini Chips                        
## ---------------------------------------------------------------------------
## item_quantity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4753     1230       11    0.411    1.248   0.4391        1        1 
##      .25      .50      .75      .90      .95 
##        1        1        1        2        2 
##                                                                       
## Value          1     2     3     4     5     6     7     8    12    15
## Frequency   3980   570   112    54    13    14     1     4     1     3
## Proportion 0.837 0.120 0.024 0.011 0.003 0.003 0.000 0.001 0.000 0.001
##                 
## Value         16
## Frequency      1
## Proportion 0.000
## ---------------------------------------------------------------------------
## item_category_name 
##        n  missing distinct 
##     4753     1230      767 
## 
## lowest : 10" Pies                18" Pizzas              6" Cakes                A la Cart               A La Carte             
## highest: Yasai (Vegetable Rolls) Year Round Flavors      Yeast Doughnuts         Your Creation           Yummy Food             
## ---------------------------------------------------------------------------
## how_long_it_took_to_order 
##        n  missing distinct 
##     3038     2945     2579 
## 
## lowest : 00:01:22.997519 00:01:32.308446 00:01:33.864756 00:01:37.552443 00:01:38.872929
## highest: 00:47:48.181357 00:58:02.117535 01:03:42.753775 01:12:59.55104  01:13:13.266118
## ---------------------------------------------------------------------------
## pickup_lat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5983        0     1210        1    40.74  0.02508    40.72    40.72 
##      .25      .50      .75      .90      .95 
##    40.72    40.74    40.76    40.78    40.78 
## 
## lowest : 40.66561 40.67176 40.67348 40.67425 40.67469
## highest: 40.80626 40.80707 40.81133 40.81535 40.81808
## ---------------------------------------------------------------------------
## pickup_lon 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5983        0     1179        1   -73.99  0.01634   -74.01   -74.00 
##      .25      .50      .75      .90      .95 
##   -74.00   -73.99   -73.98   -73.96   -73.96 
## 
## lowest : -74.01584 -74.01545 -74.01486 -74.01472 -74.01463
## highest: -73.93543 -73.93420 -73.93324 -73.92829 -73.92098
## ---------------------------------------------------------------------------
## dropoff_lat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5983        0     2841        1    40.74   0.0283    40.71    40.72 
##      .25      .50      .75      .90      .95 
##    40.73    40.74    40.76    40.78    40.79 
## 
## lowest : 40.64936 40.64940 40.64957 40.65215 40.66689
## highest: 40.82885 40.83513 40.83603 40.83740 40.84832
## ---------------------------------------------------------------------------
## dropoff_lon 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5983        0     2839        1   -73.99  0.02033   -74.01   -74.01 
##      .25      .50      .75      .90      .95 
##   -74.00   -73.99   -73.97   -73.96   -73.95 
## 
## lowest : -74.01768 -74.01729 -74.01715 -74.01712 -74.01706
## highest: -73.92836 -73.92772 -73.92582 -73.92481 -73.92412
## ---------------------------------------------------------------------------
## when_the_delivery_started 
##        n  missing distinct 
##     5983        0     5214 
## 
## lowest : 2014-10-01 00:07:58.632482 2014-10-01 00:26:31.924774 2014-10-01 01:00:06.75635  2014-10-01 08:46:15.935061 2014-10-01 09:20:21.573801
## highest: 2014-10-30 22:24:54.42562  2014-10-30 22:31:58.003417 2014-10-30 22:32:24.293206 2014-10-30 22:56:00.07339  2014-10-30 23:08:43.4819  
## ---------------------------------------------------------------------------
## when_the_Jumpman_arrived_at_pickup 
##        n  missing distinct 
##     5433      550     4719 
## 
## lowest : 2014-10-01 00:39:31.086322 2014-10-01 01:19:29.205722 2014-10-01 09:02:40.003541 2014-10-01 09:26:01.194532 2014-10-01 10:10:27.589662
## highest: 2014-10-30 22:30:00.72672  2014-10-30 22:34:18.51496  2014-10-30 22:34:33.893881 2014-10-30 23:01:38.619634 2014-10-30 23:10:31.062088
## ---------------------------------------------------------------------------
## when_the_Jumpman_left_pickup 
##        n  missing distinct 
##     5433      550     4717 
## 
## lowest : 2014-10-01 00:59:57.522402 2014-10-01 01:36:49.131316 2014-10-01 09:15:59.607582 2014-10-01 09:37:56.158669 2014-10-01 10:32:19.033949
## highest: 2014-10-30 22:54:17.896179 2014-10-30 22:57:59.036928 2014-10-30 23:06:54.47219  2014-10-30 23:14:58.679208 2014-10-30 23:23:51.143279
## ---------------------------------------------------------------------------
## when_the_Jumpman_arrived_at_dropoff 
##        n  missing distinct 
##     5983        0     5214 
## 
## lowest : 2014-10-01 00:30:21.109149 2014-10-01 01:04:14.355157 2014-10-01 01:49:29.034932 2014-10-01 09:28:40.095456 2014-10-01 09:39:41.631246
## highest: 2014-10-30 23:04:40.777794 2014-10-30 23:05:57.857982 2014-10-30 23:19:29.96027  2014-10-30 23:22:48.252946 2014-10-30 23:29:44.866438
## ---------------------------------------------------------------------------
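
The str output above shows that missing text fields arrive as empty strings rather than NA, so is.na() alone would undercount them. A minimal sketch of a per-column missing count that treats both as missing is shown below; the toy data frame and the helper name count_missing are assumptions made for illustration.

```r
# Toy data (hypothetical values) with one NA and one empty string
toy <- data.frame(
  item_quantity             = c(1, NA, 2),
  how_long_it_took_to_order = c("00:19:58", "", "00:06:44"),
  stringsAsFactors = FALSE
)

# Count entries that are NA or an empty string
count_missing <- function(x) {
  sum(is.na(x) | x == "")
}

missing_counts <- sapply(toy, count_missing)
```

Applied to the full data frame, the same call gives one count per column that can be compared with the "missing" figures reported by describe.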

Variable Creation & Addition

To do some deep dive analysis and data exploration, I wanted to create additional columns. Below is the explanation/formula of each new variable created and the code used for the same.

  1. Timestamp Format: The timestamp columns were imported as characters in R. I used the “lubridate” package to convert these into proper timestamps, from which the date, day and hour can be extracted at any time using available functions.
  2. Day of the week/Weekend flag: I used the above timestamps to derive the day of the week on which each order started. I also created a flag which is 1 if the day was a weekend and 0 otherwise, and a day of the month (1-30) for each delivery.
  3. Duration Variables: Using the timestamps of the Jumpman’s arrivals and departures, I created three duration variables, all in hours: the duration of the delivery, the duration of loading items once the Jumpman reaches the pickup point, and the time the Jumpman took to reach the pickup point.
    • Delivery Time: Time Jumpman arrived at Dropoff - Time Jumpman left Pickup
    • Loading Time: Time Jumpman left Pickup - Time Jumpman arrived at Pickup
    • Jumpman Arrival Time: Time Jumpman arrived at Pickup - Time the delivery started
  4. Delivery Distance: Here I used the “geosphere” package to get the straight line (Haversine) distance between the pickup and dropoff points. I wanted to use the Google Maps API to get distance by road, but that is restricted to 2000 queries per day, which would have delayed my analysis. So I took the straight line distance as a proxy.
  5. Average Jumpman speed: Dividing the distance travelled by the time taken, I get the average speed with which the Jumpman would have delivered the product.

After exploring the data and creating the new variables, I removed the duplicate orders, keeping the first entry for each order. We need to be careful here: this could give misleading results when analysing the items which were delivered.

#Change formats of timestamp
library(lubridate)
df$when_the_delivery_started <- ymd_hms(substr(df$when_the_delivery_started,1,19))
df$when_the_Jumpman_arrived_at_pickup <- ymd_hms(substr(df$when_the_Jumpman_arrived_at_pickup,1,19))
df$when_the_Jumpman_left_pickup <- ymd_hms(substr(df$when_the_Jumpman_left_pickup,1,19))
df$when_the_Jumpman_arrived_at_dropoff <- ymd_hms(substr(df$when_the_Jumpman_arrived_at_dropoff,1,19))

#day of week and weekend flag
df$wday_delivery_started <- wday(df$when_the_delivery_started)
df$weekend_delivery_started <- ifelse(df$wday_delivery_started %in% c(1,7),1,0)
df$day_del_started <- (day(df$when_the_delivery_started))

#Create time duration columns (in hours)
df$delivery_time <- difftime(df$when_the_Jumpman_arrived_at_dropoff,
                             df$when_the_Jumpman_left_pickup,
                             units="hours")
df$loading_time <- difftime(df$when_the_Jumpman_left_pickup,
                            df$when_the_Jumpman_arrived_at_pickup,
                            units="hours")
df$jumpman_arrival_time <- difftime(df$when_the_Jumpman_arrived_at_pickup,
                            df$when_the_delivery_started,
                            units="hours")

#delivery distance (straight line, in miles)
library(geosphere)
#geosphere expects each point as (longitude, latitude), in that order.
#distHaversine is vectorized over rows and returns metres; 1609.34 metres = 1 mile
df$delivery_distance <- distHaversine(cbind(df$pickup_lon, df$pickup_lat),
                                      cbind(df$dropoff_lon, df$dropoff_lat))/1609.34

#Calculate average Jumpman speed
df$jumpman_avg_speed <- df$delivery_distance/as.numeric(df$delivery_time)

library(dplyr)
df_unique <- df %>% distinct(delivery_id, .keep_all = TRUE)
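
For reference, the straight line distance used above can be written out in base R. This is a minimal sketch of the Haversine formula, not the geosphere implementation. The function name haversine_miles is hypothetical, and the radius of 6378137 metres is an assumption that I believe matches the geosphere default.

```r
# Haversine (great-circle) distance in miles between two points given in degrees.
# radius_m is the assumed earth radius in metres.
haversine_miles <- function(lon1, lat1, lon2, lat2, radius_m = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_m * asin(pmin(1, sqrt(a))) / 1609.34
}

# One degree of latitude along the same meridian
one_degree <- haversine_miles(-74, 40, -74, 41)

# Average speed in mph = miles / hours, e.g. that distance covered in 2 hours
example_speed <- one_degree / 2
```

Note the argument order: longitude first, then latitude. Swapping the two does not raise an error for New York coordinates, but it does give a wrong distance.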

Maps

Below is the geospatial representation of the pickup and dropoff locations across New York.


Dropoff Locations by Weekday/Weekend

library(leaflet)
#weekend vs weekday by dropoff
leaflet() %>% setView(-73.972887,40.732828,zoom=12) %>% addTiles() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(data=subset(df_unique,weekend_delivery_started==0),
                   lat=~dropoff_lat,lng=~dropoff_lon,weight=1,radius=3,opacity=1,color="Orange") %>%
  addCircleMarkers(data=subset(df_unique,weekend_delivery_started==1),
                   lat=~dropoff_lat,lng=~dropoff_lon,weight=1,radius=2,opacity=1,color="Blue") %>%
  addLegend("bottomright",colors =c("Blue", "Orange"),labels= c("Weekend","Weekday"),opacity = 1)

Pickup Locations by Weekday/Weekend

library(leaflet)
#weekend vs weekday by pickup
leaflet() %>% setView(-73.972887,40.732828,zoom=12) %>% addTiles() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(data=subset(df_unique,weekend_delivery_started==0),
                   lat=~pickup_lat,lng=~pickup_lon,weight=1,radius=3,opacity=1,color="Orange") %>%
  addCircleMarkers(data=subset(df_unique,weekend_delivery_started==1),
                   lat=~pickup_lat,lng=~pickup_lon,weight=1,radius=2,opacity=1,color="Blue")%>%
  addLegend("bottomright",colors =c("Blue", "Orange"),labels= c("Weekend","Weekday"),opacity = 1)

Customer Analysis

Unique customer and frequency of transaction

First, I looked at the number of unique customers: 3,192 over the one month period (Oct 1, 2014 to Oct 30, 2014). Along with that, I analysed the number of orders per customer. This tells us how many customers are repeat users of our service.

We can see from the histogram below that most customers (2,216) ordered only once. The number of repeat customers falls off steeply as the number of orders increases. We need to retain more customers, as retaining customers is usually cheaper than acquiring new ones.

#Unique number of customers
paste(length(unique(df_unique$customer_id))," Unique Customers")
## [1] "3192  Unique Customers"
library(ggplot2)
ggplot(data.frame(as.vector(table(df_unique$customer_id))))+
  geom_histogram(bins=30,aes(x=as.vector.table.df_unique.customer_id..))+
  ggtitle("Customer Order Frequency - Histogram")+
  xlab("Orders per customer")+
  ylab("Number of Customers")

describe(as.vector(table(df_unique$customer_id)))
## as.vector(table(df_unique$customer_id)) 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3192        0       16    0.661    1.633    1.023        1        1 
##      .25      .50      .75      .90      .95 
##        1        1        2        3        4 
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency   2216   512   232   106    54    29    14     7     6     5
## Proportion 0.694 0.160 0.073 0.033 0.017 0.009 0.004 0.002 0.002 0.002
##                                               
## Value         11    12    14    15    17    23
## Frequency      4     2     1     2     1     1
## Proportion 0.001 0.001 0.000 0.001 0.000 0.000
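
The frequency table above is enough to work out how much of the business comes from repeat customers. The sketch below simply re-enters the Value and Frequency rows printed by describe. The variable names are my own.

```r
# Orders per customer and number of customers, copied from the describe output above
orders_per_customer <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 17, 23)
n_customers <- c(2216, 512, 232, 106, 54, 29, 14, 7, 6, 5, 4, 2, 1, 2, 1, 1)

total_customers <- sum(n_customers)                        # 3192
total_orders    <- sum(orders_per_customer * n_customers)  # 5214

# Share of customers who ordered more than once
repeat_customer_share <- sum(n_customers[orders_per_customer > 1]) / total_customers

# Share of all deliveries placed by those repeat customers
repeat_order_share <-
  sum((orders_per_customer * n_customers)[orders_per_customer > 1]) / total_orders
```

With these figures, about 31% of customers are repeat customers, yet they account for roughly 57% of the 5,214 deliveries.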

Customer Acquisition over time

I also wanted to see when new customers were acquired during the one month of operation. The graph below shows the number of new customers acquired on each day of business. This analysis is based on the first delivery each customer had with Jumpman23 within the data.

We can see that the customer acquisition rate has been decreasing over time. To dig further into this, we will have to look at marketing campaigns, ads etc.

#Customer acquisition timeframe
library(dplyr)
cust_acq <- df_unique %>%
  group_by(customer_id) %>%
  summarise(first_day=min(day(when_the_delivery_started)))

ggplot(cust_acq,aes(x=first_day,y=1))+
  stat_summary(fun=sum,geom="line")+  #"fun" replaces the deprecated "fun.y" (ggplot2 >= 3.3.0)
  ggtitle("Number of New Customers Acquired per Day of Operation")+
  ylab("Number of Customers")+
  xlab("Days in October 2014")
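
The same acquisition logic can be sketched in base R on toy data: take each customer's first order day, count new customers per day, and accumulate. The data frame below is made up for illustration. One caveat is that a customer whose first order in the data falls on day 1 may have ordered before the data starts, so early days can overstate acquisition.

```r
# Toy data (hypothetical values): one row per delivery
toy <- data.frame(
  customer_id = c(1, 1, 2, 3, 3, 4),
  order_day   = c(1, 5, 1, 2, 9, 2)
)

# First day on which each customer appears
first_day <- tapply(toy$order_day, toy$customer_id, min)

# New customers per day of the month; days with no new customers count as 0
new_per_day <- table(factor(first_day, levels = 1:30))

# Running total of customers acquired
cumulative_customers <- cumsum(new_per_day)
```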

Trends/Patterns across Time

I have taken the column “when_the_delivery_started” as the column which signifies when the order was created, and I have analyzed it in the graphs below. The intent here is to see when most orders come in.


Trend across days of the Week

In this section, I have analyzed how the number of deliveries (by order start timestamp) varies across the days of the week.

library(ggplot2)

#Delivery trends by day of the week (NOT STANDARDIZED across the number of times each day occurs in the calendar month)
ggplot(df_unique,aes(x=wday(when_the_delivery_started,label=T), 1,group=1)) +
  stat_summary(fun = sum,geom = "line")+  #"fun" replaces the deprecated "fun.y" (ggplot2 >= 3.3.0)
  ggtitle("Deliveries by Days of the Week")+
  ylab("Number of Deliveries")+xlab("Days of the Week")
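
The totals above are not standardized, so days that occur more often in the period get a higher count. A minimal sketch of the standardization is below. The calendar part is exact for Oct 1 to Oct 30, 2014, while the delivery totals are hypothetical numbers used only to show the division.

```r
# The 30 calendar days covered by the data
october_days <- seq(as.Date("2014-10-01"), as.Date("2014-10-30"), by = "day")

# POSIXlt weekday index: 0 = Sunday, ..., 6 = Saturday (independent of locale)
wday_index <- as.POSIXlt(october_days)$wday
occurrences <- table(factor(wday_index, levels = 0:6,
                            labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")))

# Hypothetical total deliveries per weekday, for illustration only
total_deliveries <- c(Sun = 800, Mon = 600, Tue = 620, Wed = 900,
                      Thu = 850, Fri = 640, Sat = 760)

# Average deliveries per occurrence of each weekday
avg_per_weekday <- total_deliveries / as.vector(occurrences)
```

In this period Wednesday and Thursday each occur five times and every other day occurs four times, so raw totals overstate those two days by about a quarter.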

Vehicle Type

In this part, I wanted to check which kinds of vehicles are used the most and how delivery times, distances and average delivery speeds vary by vehicle type.

We can clearly see from the graph below that bicycles are used the most.

#overall vehicle usage
ggplot(df_unique,aes(x=vehicle_type, 1,group=1)) +
  stat_summary(fun = sum,geom = "bar")+  #"fun" replaces the deprecated "fun.y" (ggplot2 >= 3.3.0)
  ggtitle("Number of Deliveries by Vehicle Type")+
  xlab("Vehicle Type")+ylab("Number of Deliveries")


Diving deeper, I looked at how vehicle usage varies by day of the week. We can see that most deliveries are made by bicycle. In terms of proportion of deliveries, a higher percentage is made by bicycle on weekdays, while cars take a comparatively higher share on weekends.

#vehicle usage by days of the week
ggplot(df_unique,aes(x=wday(when_the_delivery_started,label=T), 1,group=vehicle_type,color=vehicle_type)) +
  stat_summary(fun = sum,geom = "line",size=1)+  #"fun" replaces the deprecated "fun.y" (ggplot2 >= 3.3.0)
  ggtitle("Vehicle usage by Days of the Week")+
  xlab("Days of the Week")+ylab("Number of Deliveries")

ggplot(df_unique,aes(x=wday(when_the_delivery_started,label=T),1,fill=vehicle_type))+
  geom_bar(position="fill",stat="identity")+
  ggtitle("Proportion of deliveries by Vehicle Type by Days of Week")+
  xlab("Days of the Week")+ylab("Proportion of Deliveries")
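
The proportions shown in the chart can also be read off as numbers. The sketch below uses prop.table on a made-up data frame with a weekend flag and a vehicle type. With the real data, the same call would use weekend_delivery_started and vehicle_type from df_unique.

```r
# Toy data (hypothetical values): 4 weekday and 4 weekend deliveries
toy <- data.frame(
  weekend = c(0, 0, 0, 0, 1, 1, 1, 1),
  vehicle = c("bicycle", "bicycle", "bicycle", "car",
              "bicycle", "bicycle", "car", "car"),
  stringsAsFactors = FALSE
)

# Delivery counts by weekend flag (rows) and vehicle type (columns)
counts <- table(toy$weekend, toy$vehicle)

# Row-wise shares: each row sums to 1, so rows are directly comparable
shares <- prop.table(counts, margin = 1)
```

Comparing shares["0", "bicycle"] with shares["1", "bicycle"] then states the weekday versus weekend bicycle share as two numbers instead of reading them off stacked bars.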


The same analysis was done across days of the month. Here we can see the weekly trends, with bicycles being used the most, followed by cars. In terms of proportions, the pattern of higher bicycle usage on weekdays does not always hold: certain weekdays have lower bicycle usage as well. We need more data to dig into why this is happening.

#Vehicle usage by days of the month
ggplot(df_unique,aes(x=as.factor(day_del_started), 1,group=vehicle_type,color=vehicle_type)) +
  stat_summary(fun = sum,geom = "line",size=1)+  #"fun" replaces the deprecated "fun.y" (ggplot2 >= 3.3.0)
  ggtitle("Vehicle usage by Days of the Month")+
  xlab("Days of October")+ylab("Number of Deliveries")

ggplot(df_unique,aes(x=as.factor(day_del_started),1,fill=vehicle_type))+
  geom_bar(position="fill",stat="identity")+
  ggtitle("Proportion of Vehicle deliveries by Vehicle Type by Days of the Month")+
  xlab("Days of October")+ylab("Proportion of Deliveries")

Time/Distance/Speed

Delivery Time

First, I looked at how the delivery time taken by the Jumpman varies across vehicle types and days of the week.

  1. We can see that bicycles take significantly less delivery time, while trucks take the most. We also see a lot of outliers for bicycles and cars.
  2. Delivery times are fairly equally distributed across days of the week.

#delivery time by vehicle type
ggplot(df_unique,aes(x=vehicle_type,y=delivery_time))+
  geom_boxplot()+
  ggtitle("Delivery Time variation across Vehicle Types")+
  xlab("Vehicle Types")+ylab("Delivery Time (Hours)")

#delivery time by day of the week
ggplot(df_unique,aes(x=wday(when_the_delivery_started,label=T),y=delivery_time))+
  geom_boxplot()+
  ggtitle("Delivery Time variation across Days of the Week")+
  xlab("Days of the Week")+ylab("Delivery Time (Hours)")


Loading Time

Next I wanted to see how loading times vary across vehicle types. Loading times were almost equally distributed across all vehicles. However, motorcycles had slightly higher loading times as compared to other vehicles.

#loading time by vehicle type
ggplot(df_unique,aes(x=vehicle_type,y=loading_time))+
  geom_boxplot()+
  ggtitle("Loading Time variation across Vehicle Types")+
  xlab("Vehicle Types")+ylab("Loading Time (Hours)")
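
A numeric summary can back up what the box plots show. The sketch below computes the median loading time per vehicle type in minutes on made-up values. With the real data, the input would be as.numeric(df_unique$loading_time), which is in hours, with na.rm = TRUE passed to median because some timestamps are missing.

```r
# Toy data (hypothetical values): loading time in hours
toy <- data.frame(
  vehicle_type = c("bicycle", "bicycle", "car", "car"),
  loading_time = c(0.2, 0.4, 0.3, 0.5),
  stringsAsFactors = FALSE
)

# Median loading time per vehicle type, converted from hours to minutes
median_minutes <- tapply(toy$loading_time * 60, toy$vehicle_type, median)
```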


Delivery Distances

I wanted to see if bigger vehicles are used for long route deliveries and bicycles and walkers are used for smaller routes. And that is what the data showed.

ggplot(df_unique,aes(x=vehicle_type,y=delivery_distance))+
  geom_boxplot()+
  ggtitle("Delivery Distances by Vehicle Type")+
  xlab("Vehicle Type")+
  ylab("Distance (Miles)")


Average Delivery Speed

I wanted to see if the average speed varied by different vehicle types. So below are the box plots:

ggplot(df_unique,aes(x=vehicle_type,y=jumpman_avg_speed))+
  geom_boxplot()+
  ggtitle("Avg. Delivery Speed by Vehicle Type")+
  xlab("Vehicle Type")+
  ylab("Speed (MPH)")

I can see outliers which definitely seem to be wrong: average speed cannot be so high for bicycles in NYC. So I plotted another chart restricted to speeds between 0 and 20 mph, and saw that motorcycles and scooters had somewhat higher variation and slightly higher average delivery speeds.

ggplot(df_unique,aes(x=vehicle_type,y=jumpman_avg_speed))+
  geom_boxplot()+
  ggtitle("Avg. Delivery Speed by Vehicle Type")+
  xlab("Vehicle Type")+
  ylab("Speed (MPH)")+ylim(0,20)
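
Rather than only trimming the axis, the suspicious deliveries can be flagged explicitly so they can be inspected or excluded. This is a minimal sketch on made-up rows. The per-vehicle speed limits in max_plausible_mph are assumptions chosen for illustration, not values from the data.

```r
# Toy data (hypothetical values): distance in miles, time in hours
toy <- data.frame(
  vehicle_type      = c("bicycle", "bicycle", "car", "walker"),
  delivery_distance = c(1.2, 4.0, 3.0, 0.5),
  delivery_time     = c(0.20, 0.05, 0.25, 0.02),
  stringsAsFactors = FALSE
)

toy$speed_mph <- toy$delivery_distance / toy$delivery_time

# Assumed upper bounds on a plausible average speed, per vehicle type
max_plausible_mph <- c(bicycle = 25, car = 50, walker = 6)

# Flag rows that exceed the bound or have a non-positive duration
toy$implausible <- unname(
  toy$speed_mph > max_plausible_mph[toy$vehicle_type] | toy$delivery_time <= 0
)

flagged <- toy[toy$implausible, ]
```

Here the second bicycle row works out to 80 mph and is flagged, along with the walker row at 25 mph.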

Pickup Place

I also wanted to see if there are certain pickup locations which bring in a lot of business for Jumpman23. We should ideally filter these pickup locations and target them differently.

paste(length(unique(df_unique$pickup_place))," unique pickup locations")
## [1] "898  unique pickup locations"
#Note: counts are taken on df (one row per item), not on the de-duplicated df_unique
df_pickup_place <- df %>%
  group_by(pickup_place) %>%
  summarise(count=n())

library(ggplot2)
qplot(df_pickup_place$count, geom="histogram",bins=200)+
  ggtitle("Frequency of Deliveries from pickup locations")+
  xlab("Delivery Frequency")+ylab("Number of pickup locations")

library(Hmisc)
describe(as.factor(df_pickup_place$count))
## as.factor(df_pickup_place$count) 
##        n  missing distinct 
##      898        0       57 
## 
## lowest : 1   2   3   4   5  , highest: 149 151 184 186 311
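
To identify the pickup locations that bring in most of the business, the counts can be sorted and accumulated. The sketch below uses made-up place names and counts, and the 75% cut-off is an arbitrary assumption.

```r
# Toy data (hypothetical values): number of orders per pickup place
place_counts <- c("Place A" = 50, "Place B" = 30, "Place C" = 10,
                  "Place D" = 5, "Place E" = 5)

# Largest places first, then the running share of all orders
sorted_counts    <- sort(place_counts, decreasing = TRUE)
cumulative_share <- cumsum(sorted_counts) / sum(sorted_counts)

# Smallest number of places needed to cover at least 75% of orders
n_places_needed <- unname(which(cumulative_share >= 0.75)[1])
top_places      <- names(sorted_counts)[seq_len(n_places_needed)]
```

With the real data, place_counts could be built from df_pickup_place as setNames(df_pickup_place$count, df_pickup_place$pickup_place).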