library(readxl)
Cancellation_Chart <- read_excel("C:/Users/Havilah/Desktop/BOZA (Work Files)/BOZA (Work)/Raw Data/Cancellation Chart.xlsx")
The excel-format dataframe has been loaded into the R environment.
head(Cancellation_Chart)
## # A tibble: 6 × 5
## S_Number Month Cancellation_By_Driver Cancellation_By_User
## <dbl> <chr> <dbl> <dbl>
## 1 1 MAR 599 29
## 2 2 APR 877 59
## 3 3 MAY 320 28
## 4 4 JUN 252 15
## 5 5 JUL 159 6
## 6 6 AUG 93 4
## # ℹ 1 more variable: Cancellation_No_Drivers <dbl>
str(Cancellation_Chart)
## tibble [7 × 5] (S3: tbl_df/tbl/data.frame)
## $ S_Number : num [1:7] 1 2 3 4 5 6 7
## $ Month : chr [1:7] "MAR" "APR" "MAY" "JUN" ...
## $ Cancellation_By_Driver : num [1:7] 599 877 320 252 159 93 21
## $ Cancellation_By_User : num [1:7] 29 59 28 15 6 4 2
## $ Cancellation_No_Drivers: num [1:7] 0 0 0 0 0 0 0
The given output describes a tibble, which is a type of data frame in R, containing 7 rows and 5 columns. Here’s what each part represents:
This array provides a concise view of cancellation data by month, highlighting differences between driver and user cancellations.
colnames(Cancellation_Chart)
## [1] "S_Number" "Month"
## [3] "Cancellation_By_Driver" "Cancellation_By_User"
## [5] "Cancellation_No_Drivers"
This is just a quick re-iteration of the NAMES of the variables which are contained in our dataset.
complete.cases(Cancellation_Chart)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
na.omit(Cancellation_Chart)
## # A tibble: 7 × 5
## S_Number Month Cancellation_By_Driver Cancellation_By_User
## <dbl> <chr> <dbl> <dbl>
## 1 1 MAR 599 29
## 2 2 APR 877 59
## 3 3 MAY 320 28
## 4 4 JUN 252 15
## 5 5 JUL 159 6
## 6 6 AUG 93 4
## 7 7 SEP 21 2
## # ℹ 1 more variable: Cancellation_No_Drivers <dbl>
duplicated(Cancellation_Chart)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Here, we check to confirm that no omission nor duplicate values are present in this dataset, and thus ascertain that all cases are complete.
library(ggplot2)
# Reshape data for easier plotting
library(tidyr)
data_long <- Cancellation_Chart %>%
pivot_longer(cols = starts_with("Cancellation"), names_to = "Cancellation_Type", values_to = "Count")
ggplot(data_long, aes(x = Cancellation_Type, y = Count, fill = Cancellation_Type)) +
geom_boxplot() +
labs(title = "Box Plot of Cancellations by Type") +
theme_minimal()
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Define the order of the months
month_order <- c("MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP")
# Convert Month column to a factor with specified order
Cancellation_Chart$Month <- factor(Cancellation_Chart$Month, levels = month_order)
# Calculate total cancellations for percentages
Cancellation_Chart <- Cancellation_Chart %>%
mutate(Total_Drivers_Cancellations = sum(Cancellation_By_Driver),
Driver_Percentage = (Cancellation_By_Driver / Total_Drivers_Cancellations) * 100,
Total_Users_Cancellations = sum(Cancellation_By_User),
User_Percentage = (Cancellation_By_User / Total_Users_Cancellations) * 100)
# Bar plot with trend line, percentages, and legend
ggplot(Cancellation_Chart, aes(x = Month)) +
geom_bar(aes(y = Cancellation_By_Driver, fill = "Driver Cancellations"), stat = "identity", color = "black") +
geom_bar(aes(y = Cancellation_By_User, fill = "User Cancellations"), stat = "identity", color = "black", alpha = 0.7) +
geom_smooth(aes(y = Total_Drivers_Cancellations), method = "lm", se = FALSE, color = "black", linetype = "dashed") +
geom_text(aes(y = Cancellation_By_Driver, label = paste0(round(Driver_Percentage, 1), "%")),
vjust = -0.5, color = "white", size = 3) +
geom_text(aes(y = Cancellation_By_User + Cancellation_By_Driver, label = paste0(round(User_Percentage, 1), "%")),
vjust = -0.5, color = "black", size = 3) +
labs(title = "Cancellations by Month", y = "Number of Cancellations") +
scale_fill_manual(values = c("blue", "red"), name = "Cancellations", labels = c("Driver Cancellations", "User Cancellations")) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The bar chart illustrates the number of cancellations by month,
separated into driver and user cancellations.
In order, the drivers cancellation reads: “MAR” = 25.8%, “APR” = 37.8%, “MAY” = 13.8%, “JUN” = 10.9%, “JUL” = 6.85%, “AUG” = 4.0%, “SEP” = 0.905%
Based on the cancellation data, here are some observed trends:
summary(Cancellation_Chart)
## S_Number Month Cancellation_By_Driver Cancellation_By_User
## Min. :1.0 MAR:1 Min. : 21.0 Min. : 2.00
## 1st Qu.:2.5 APR:1 1st Qu.:126.0 1st Qu.: 5.00
## Median :4.0 MAY:1 Median :252.0 Median :15.00
## Mean :4.0 JUN:1 Mean :331.6 Mean :20.43
## 3rd Qu.:5.5 JUL:1 3rd Qu.:459.5 3rd Qu.:28.50
## Max. :7.0 AUG:1 Max. :877.0 Max. :59.00
## SEP:1
## Cancellation_No_Drivers Total_Drivers_Cancellations Driver_Percentage
## Min. :0 Min. :2321 Min. : 0.9048
## 1st Qu.:0 1st Qu.:2321 1st Qu.: 5.4287
## Median :0 Median :2321 Median :10.8574
## Mean :0 Mean :2321 Mean :14.2857
## 3rd Qu.:0 3rd Qu.:2321 3rd Qu.:19.7975
## Max. :0 Max. :2321 Max. :37.7854
##
## Total_Users_Cancellations User_Percentage
## Min. :143 Min. : 1.399
## 1st Qu.:143 1st Qu.: 3.497
## Median :143 Median :10.490
## Mean :143 Mean :14.286
## 3rd Qu.:143 3rd Qu.:19.930
## Max. :143 Max. :41.259
##
The provided output appears is a summary related to cancellations by drivers and users. Here is an interpretation of the information presented:
Month: Indicates the month associated with the data point.
Cancellation_By_Driver: Shows the number of cancellations made by drivers. The values range from a minimum of 21 to a maximum of 877. The mean number of cancellations by drivers is approximately 331.6.
Cancellation_By_User: Represents the number of cancellations made by users. The values range from a minimum of 2 to a maximum of 59. The mean number of cancellations by users is around 20.43.
Cancellation_No_Drivers: This column seems to indicate the number of drivers involved in cancellations. However, all values are zero in the provided summary.
Total_Cancellations: Represents the total number of cancellations, calculated as the sum of cancellations by drivers and users. The total cancellations range from a minimum of 23 to a maximum of 936.
Driver_Percentage: Shows the percentage of cancellations attributed to drivers. The driver percentage ranges from approximately 0.9048% to 37.7854%, with a mean percentage of about 14.2857%.
User_Percentage: Indicates the percentage of cancellations attributed to users. The user percentage ranges from around 1.399% to 41.259%, with a mean percentage of approximately 14.286%.
Total_Drivers_Cancellations: This column seems to represent the total number of cancellations made by all drivers, with a constant value of 2321 in the provided summary.
Total_Users_Cancellations: Represents the total number of cancellations made by all users, with a constant value of 143 in the provided summary.
This summary provides insights into cancellation patterns by both drivers and users, illustrating their individual contributions to the total cancellations and their respective percentages in the data set.
correlation_matrix <- cor(Cancellation_Chart[, c("Cancellation_By_Driver", "Cancellation_By_User")])
print(correlation_matrix)
## Cancellation_By_Driver Cancellation_By_User
## Cancellation_By_Driver 1.0000000 0.9648033
## Cancellation_By_User 0.9648033 1.0000000
The correlation matrix shows the following:
Correlation between Cancellation_By_Driver
& Cancellation_By_User
: The correlation
coefficient is approximately 0.965. Using a scale of 0 to 1, this
indicates a very strong positive correlation between the number of
cancellations by drivers and the number of cancellations by users. As
cancellations by drivers increase, cancellations by users tend to
increase as well.
Perfect Correlation Within Variables: Both variables have a correlation of 1 with themselves, which is expected. (Ignore)
The strong positive correlation suggests that factors influencing driver cancellations may also affect user cancellations, or that both are responding to similar external conditions or trends. (such as DISTANCE BETWEEN DRIVER & USER, leading to longer WAITING TIME). This relationship could be explored further to understand the underlying causes and to develop strategies to address them.
The distance between the driver and user(or rider must be minimized) to avoid repeated cancellations. Wherever campaigns are conducted, driver positioning must also be facilitated in same neighborhood to ensure proximity and and successful transactions. ***Drivers registered in a certain area must be assigned to campaigns, implementations and activities in the same region.
Overall, it is encouraging to see a decline in the number of cancellations as time goes on. However, this calls for further probing, as this may not necessarily mean more success in number of successfully completed rides.