Load/Import Dataset

library(readxl)
Cancellation_Chart <- read_excel("C:/Users/Havilah/Desktop/BOZA (Work Files)/BOZA (Work)/Raw Data/Cancellation Chart.xlsx")

The excel-format dataframe has been loaded into the R environment.

First glance at data frame

head(Cancellation_Chart)
## # A tibble: 6 × 5
##   S_Number Month Cancellation_By_Driver Cancellation_By_User
##      <dbl> <chr>                  <dbl>                <dbl>
## 1        1 MAR                      599                   29
## 2        2 APR                      877                   59
## 3        3 MAY                      320                   28
## 4        4 JUN                      252                   15
## 5        5 JUL                      159                    6
## 6        6 AUG                       93                    4
## # ℹ 1 more variable: Cancellation_No_Drivers <dbl>
str(Cancellation_Chart)
## tibble [7 × 5] (S3: tbl_df/tbl/data.frame)
##  $ S_Number               : num [1:7] 1 2 3 4 5 6 7
##  $ Month                  : chr [1:7] "MAR" "APR" "MAY" "JUN" ...
##  $ Cancellation_By_Driver : num [1:7] 599 877 320 252 159 93 21
##  $ Cancellation_By_User   : num [1:7] 29 59 28 15 6 4 2
##  $ Cancellation_No_Drivers: num [1:7] 0 0 0 0 0 0 0

The given output describes a tibble, which is a type of data frame in R, containing 7 rows and 5 columns. Here’s what each part represents:

This array provides a concise view of cancellation data by month, highlighting differences between driver and user cancellations.

colnames(Cancellation_Chart)
## [1] "S_Number"                "Month"                  
## [3] "Cancellation_By_Driver"  "Cancellation_By_User"   
## [5] "Cancellation_No_Drivers"

This is just a quick re-iteration of the NAMES of the variables which are contained in our dataset.

complete.cases(Cancellation_Chart)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
na.omit(Cancellation_Chart)
## # A tibble: 7 × 5
##   S_Number Month Cancellation_By_Driver Cancellation_By_User
##      <dbl> <chr>                  <dbl>                <dbl>
## 1        1 MAR                      599                   29
## 2        2 APR                      877                   59
## 3        3 MAY                      320                   28
## 4        4 JUN                      252                   15
## 5        5 JUL                      159                    6
## 6        6 AUG                       93                    4
## 7        7 SEP                       21                    2
## # ℹ 1 more variable: Cancellation_No_Drivers <dbl>
duplicated(Cancellation_Chart)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Here, we check to confirm that no omission nor duplicate values are present in this dataset, and thus ascertain that all cases are complete.

DATA VISUALIZATION

library(ggplot2)


# Reshape data for easier plotting
library(tidyr)
data_long <- Cancellation_Chart %>%
  pivot_longer(cols = starts_with("Cancellation"), names_to = "Cancellation_Type", values_to = "Count")

ggplot(data_long, aes(x = Cancellation_Type, y = Count, fill = Cancellation_Type)) +
  geom_boxplot() +
  labs(title = "Box Plot of Cancellations by Type") +
  theme_minimal()

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Define the order of the months
month_order <- c("MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP")

# Convert Month column to a factor with specified order
Cancellation_Chart$Month <- factor(Cancellation_Chart$Month, levels = month_order)

# Calculate total cancellations for percentages
Cancellation_Chart <- Cancellation_Chart %>%
  mutate(Total_Drivers_Cancellations = sum(Cancellation_By_Driver),
         Driver_Percentage = (Cancellation_By_Driver / Total_Drivers_Cancellations) * 100,
         Total_Users_Cancellations = sum(Cancellation_By_User),
         User_Percentage = (Cancellation_By_User / Total_Users_Cancellations) * 100)

# Bar plot with trend line, percentages, and legend
ggplot(Cancellation_Chart, aes(x = Month)) +
  geom_bar(aes(y = Cancellation_By_Driver, fill = "Driver Cancellations"), stat = "identity", color = "black") +
  geom_bar(aes(y = Cancellation_By_User, fill = "User Cancellations"), stat = "identity", color = "black", alpha = 0.7) +
  geom_smooth(aes(y = Total_Drivers_Cancellations), method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  geom_text(aes(y = Cancellation_By_Driver, label = paste0(round(Driver_Percentage, 1), "%")), 
            vjust = -0.5, color = "white", size = 3) +
  geom_text(aes(y = Cancellation_By_User + Cancellation_By_Driver, label = paste0(round(User_Percentage, 1), "%")), 
            vjust = -0.5, color = "black", size = 3) +
  labs(title = "Cancellations by Month", y = "Number of Cancellations") +
  scale_fill_manual(values = c("blue", "red"), name = "Cancellations", labels = c("Driver Cancellations", "User Cancellations")) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The bar chart illustrates the number of cancellations by month, separated into driver and user cancellations.

Key Observations:

  1. High Cancellations in March and April:
    • April has the highest number of cancellations overall, with drivers accounting for the majority. The percentage of user cancellations is 41.3%.
    • March also shows significant cancellations, with driver cancellations being predominant. User cancellations are 20.3%.
  2. Decrease in May and June:
    • Cancellations decrease after April, with May having an 19.6% user cancellation rate, still substantial but lower than the previous months.
    • June shows a further decline in cancellations, with user cancellations at 10.5%.
  3. Lowest Cancellations in August and September:
    • August and September have the lowest total cancellations, with September having a slight increase in user cancellations at 2.8%.
  4. Consistent Pattern:
    • Across all months, driver cancellations are the dominant type, significantly higher than user cancellations.

On Drivers Cancellation

In order, the drivers cancellation reads: “MAR” = 25.8%, “APR” = 37.8%, “MAY” = 13.8%, “JUN” = 10.9%, “JUL” = 6.85%, “AUG” = 4.0%, “SEP” = 0.905%

Based on the cancellation data, here are some observed trends:

  1. High Cancellations in March and April:
    • Both driver and user cancellations peak in March and April, with March having the highest driver cancellations.
  2. Gradual Decrease:
    • There is a noticeable decrease in cancellations from May to July. This possibly suggests (1) improved conditions or (2) reduced demand.
  3. Consistently No ‘No Drivers’ Cancellations:
    • The “Cancellation_No_Drivers” remains zero throughout, indicating that cancellations due to a lack of drivers did not occur.
summary(Cancellation_Chart)
##     S_Number   Month   Cancellation_By_Driver Cancellation_By_User
##  Min.   :1.0   MAR:1   Min.   : 21.0          Min.   : 2.00       
##  1st Qu.:2.5   APR:1   1st Qu.:126.0          1st Qu.: 5.00       
##  Median :4.0   MAY:1   Median :252.0          Median :15.00       
##  Mean   :4.0   JUN:1   Mean   :331.6          Mean   :20.43       
##  3rd Qu.:5.5   JUL:1   3rd Qu.:459.5          3rd Qu.:28.50       
##  Max.   :7.0   AUG:1   Max.   :877.0          Max.   :59.00       
##                SEP:1                                              
##  Cancellation_No_Drivers Total_Drivers_Cancellations Driver_Percentage
##  Min.   :0               Min.   :2321                Min.   : 0.9048  
##  1st Qu.:0               1st Qu.:2321                1st Qu.: 5.4287  
##  Median :0               Median :2321                Median :10.8574  
##  Mean   :0               Mean   :2321                Mean   :14.2857  
##  3rd Qu.:0               3rd Qu.:2321                3rd Qu.:19.7975  
##  Max.   :0               Max.   :2321                Max.   :37.7854  
##                                                                       
##  Total_Users_Cancellations User_Percentage 
##  Min.   :143               Min.   : 1.399  
##  1st Qu.:143               1st Qu.: 3.497  
##  Median :143               Median :10.490  
##  Mean   :143               Mean   :14.286  
##  3rd Qu.:143               3rd Qu.:19.930  
##  Max.   :143               Max.   :41.259  
## 

The provided output appears is a summary related to cancellations by drivers and users. Here is an interpretation of the information presented:

  • Month: Indicates the month associated with the data point.

  • Cancellation_By_Driver: Shows the number of cancellations made by drivers. The values range from a minimum of 21 to a maximum of 877. The mean number of cancellations by drivers is approximately 331.6.

  • Cancellation_By_User: Represents the number of cancellations made by users. The values range from a minimum of 2 to a maximum of 59. The mean number of cancellations by users is around 20.43.

  • Cancellation_No_Drivers: This column seems to indicate the number of drivers involved in cancellations. However, all values are zero in the provided summary.

  • Total_Cancellations: Represents the total number of cancellations, calculated as the sum of cancellations by drivers and users. The total cancellations range from a minimum of 23 to a maximum of 936.

  • Driver_Percentage: Shows the percentage of cancellations attributed to drivers. The driver percentage ranges from approximately 0.9048% to 37.7854%, with a mean percentage of about 14.2857%.

  • User_Percentage: Indicates the percentage of cancellations attributed to users. The user percentage ranges from around 1.399% to 41.259%, with a mean percentage of approximately 14.286%.

  • Total_Drivers_Cancellations: This column seems to represent the total number of cancellations made by all drivers, with a constant value of 2321 in the provided summary.

  • Total_Users_Cancellations: Represents the total number of cancellations made by all users, with a constant value of 143 in the provided summary.

This summary provides insights into cancellation patterns by both drivers and users, illustrating their individual contributions to the total cancellations and their respective percentages in the data set.

correlation_matrix <- cor(Cancellation_Chart[, c("Cancellation_By_Driver", "Cancellation_By_User")])
print(correlation_matrix)
##                        Cancellation_By_Driver Cancellation_By_User
## Cancellation_By_Driver              1.0000000            0.9648033
## Cancellation_By_User                0.9648033            1.0000000

Interpretation:

The correlation matrix shows the following:

  • Correlation between Cancellation_By_Driver & Cancellation_By_User: The correlation coefficient is approximately 0.965. Using a scale of 0 to 1, this indicates a very strong positive correlation between the number of cancellations by drivers and the number of cancellations by users. As cancellations by drivers increase, cancellations by users tend to increase as well.

  • Perfect Correlation Within Variables: Both variables have a correlation of 1 with themselves, which is expected. (Ignore)

Interpretation:

The strong positive correlation suggests that factors influencing driver cancellations may also affect user cancellations, or that both are responding to similar external conditions or trends. (such as DISTANCE BETWEEN DRIVER & USER, leading to longer WAITING TIME). This relationship could be explored further to understand the underlying causes and to develop strategies to address them.

Data-driven recommendation.

The distance between the driver and user(or rider must be minimized) to avoid repeated cancellations. Wherever campaigns are conducted, driver positioning must also be facilitated in same neighborhood to ensure proximity and and successful transactions. ***Drivers registered in a certain area must be assigned to campaigns, implementations and activities in the same region.

Overall, it is encouraging to see a decline in the number of cancellations as time goes on. However, this calls for further probing, as this may not necessarily mean more success in number of successfully completed rides.