library(readr)
Food_Delivery_Times <- read_csv("C:/Users/anama/Desktop/Masters/Multivariate Analysis/Homeworks/Homework 1/Food_Delivery_Times.csv")
## Rows: 800 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Weather, Traffic_Level, Time_of_Day, Vehicle_Type
## dbl (5): Order_ID, Distance_km, Preparation_Time_min, Courier_Experience_yrs...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Food_Delivery_Times)
## # A tibble: 6 × 9
## Order_ID Distance_km Weather Traffic_Level Time_of_Day Vehicle_Type
## <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 522 7.93 Windy Low Afternoon Scooter
## 2 738 16.4 Clear Medium Evening Bike
## 3 741 9.52 Foggy Low Night Scooter
## 4 661 7.44 Rainy Medium Afternoon Scooter
## 5 412 19.0 Clear Low Morning Bike
## 6 679 19.4 Clear Low Evening Scooter
## # ℹ 3 more variables: Preparation_Time_min <dbl>, Courier_Experience_yrs <dbl>,
## # Delivery_Time_min <dbl>
Source: Kaggle
Unit of observation: one delivery order
Sample size = n = 800 observations
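Food delivery times data includes the following 9 variables: Order_ID, Distance_km, Weather, Traffic_Level, Time_of_Day, Vehicle_Type, Preparation_Time_min, Courier_Experience_yrs, and Delivery_Time_min.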
Using the subset() function (I consulted ChatGPT about this), I split the sample into two groups: experienced and less experienced delivery workers. This makes it easier to run some of the tests and draw some of the graphs.
experienced <- subset(Food_Delivery_Times, Courier_Experience_yrs > 5)
less_experienced <- subset(Food_Delivery_Times, Courier_Experience_yrs <= 5)
EXPLANATION: Using the subset() function on the data set “Food_Delivery_Times”, I take the variable “Courier_Experience_yrs” and apply the rule “> 5” to keep only the rows where experience is above 5 years. These rows are stored in the new subset “experienced”. The same logic, with the rule “<= 5”, applies to “less_experienced”.
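A minimal sketch of how this split behaves, on a small made-up data frame (the `toy` object and its values are hypothetical and not taken from the data set). It also checks that the two subsets together cover every row, which holds only when the experience column has no missing values.

```r
# Hypothetical toy data; only the column names mirror the real data set
toy <- data.frame(
  Courier_Experience_yrs = c(1, 5, 6, 9, 3),
  Delivery_Time_min      = c(40, 55, 62, 48, 70)
)

# Rows where the condition is TRUE are kept; all columns are retained
toy_experienced      <- subset(toy, Courier_Experience_yrs > 5)
toy_less_experienced <- subset(toy, Courier_Experience_yrs <= 5)

# The two conditions are complementary, so every row lands in exactly one group
nrow(toy_experienced) + nrow(toy_less_experienced) == nrow(toy)
```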
I will also create a new variable, “Experience”. This is a categorical variable that records whether the courier has at most 5 years of experience (“less_experienced”) or more than 5 years of experience (“experienced”) working in delivery.
Food_Delivery_Times$Experience <- ifelse(Food_Delivery_Times$Courier_Experience_yrs > 5, "experienced", "less_experienced")
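# Keep only the three numeric variables used for the descriptive statistics
Food_Delivery_Times_Stat <- Food_Delivery_Times[ , c("Distance_km", "Courier_Experience_yrs", "Delivery_Time_min")]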
library(pastecs)
round(stat.desc(Food_Delivery_Times_Stat), 2)
## Distance_km Courier_Experience_yrs Delivery_Time_min
## nbr.val 800.00 800.00 800.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 0.60 1.00 8.00
## max 19.99 9.00 141.00
## range 19.39 8.00 133.00
## sum 8009.33 4097.00 44816.00
## median 10.11 5.00 54.00
## mean 10.01 5.12 56.02
## SE.mean 0.20 0.09 0.77
## CI.mean.0.95 0.40 0.18 1.51
## var 32.62 6.96 476.17
## std.dev 5.71 2.64 21.82
## coef.var 0.57 0.52 0.39
The longest delivery distance is 19.99 km, while the shortest is 0.60 km (600 meters). The average delivery distance is 10.01 km.
The average courier experience is 5.12 years. The median (middle value) is 5 years, meaning that at least half of the delivery workers have up to (and including) 5 years of experience, and at most half have more than 5 years. These two statistics are the reason why I took 5 as the threshold for differentiating between experienced and less experienced delivery workers.
The standard deviation of the delivery time is 21.82 minutes, which means that delivery times typically deviate from the mean (56.02 minutes) by approximately 21.82 minutes.
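The remaining rows of the table for delivery time (SE.mean, CI.mean.0.95 and coef.var) follow from the mean, the standard deviation and the sample size. A short sketch that recomputes them from the rounded values printed above (the variable names are my own):

```r
# Values copied from the stat.desc() output for Delivery_Time_min
n       <- 800
mean_dt <- 56.02
sd_dt   <- 21.82

se_mean  <- sd_dt / sqrt(n)                  # standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se_mean  # half-width of the 95% confidence interval
coef_var <- sd_dt / mean_dt                  # coefficient of variation

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var), 2)

# 95% confidence interval for the mean delivery time
c(lower = mean_dt - ci_half, upper = mean_dt + ci_half)
```

So the 95% confidence interval for the average delivery time runs from roughly 54.5 to 57.5 minutes.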
Research question 1: Does the working experience of the couriers influence the delivery time?
H0: µ (experienced) = µ (less experienced) -> The average delivery time of experienced couriers is the same as the average delivery time of less experienced couriers.
H1: µ (experienced) ≠ µ (less experienced) -> The average delivery time of experienced couriers is different from the average delivery time of less experienced couriers.
t.test(experienced$Delivery_Time_min, less_experienced$Delivery_Time_min,
paired = FALSE,
alternative ="two.sided")
##
## Welch Two Sample t-test
##
## data: experienced$Delivery_Time_min and less_experienced$Delivery_Time_min
## t = -1.1973, df = 794.33, p-value = 0.2316
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.874996 1.181116
## sample estimates:
## mean of x mean of y
## 55.05497 56.90191
We cannot reject H0 because p > 0.05 (p = 0.232).
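The same Welch test can also be written with the formula interface, which uses a grouping variable such as “Experience” instead of two separate subsets. A small sketch on simulated data (the `sim` data frame is made up for illustration; only the column names mirror the real data set):

```r
set.seed(1)  # simulated data, for illustration only
sim <- data.frame(
  Delivery_Time_min = c(rnorm(30, mean = 55, sd = 20), rnorm(40, mean = 57, sd = 22)),
  Experience        = rep(c("experienced", "less_experienced"), times = c(30, 40))
)

# Formula interface: outcome ~ grouping variable (the Welch test is the default)
by_formula <- t.test(Delivery_Time_min ~ Experience, data = sim)

# Two-vector interface, as used in the report
by_vectors <- t.test(sim$Delivery_Time_min[sim$Experience == "experienced"],
                     sim$Delivery_Time_min[sim$Experience == "less_experienced"])

# Both calls give the same test statistic
all.equal(unname(by_formula$statistic), unname(by_vectors$statistic))
```

The groups are ordered alphabetically in the formula version, so “experienced” is again the first group and the sign of t stays the same.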
H0: The distribution location of delivery time is the same for experienced and less experienced couriers.
H1: The distribution location of delivery time is not the same for experienced and less experienced couriers.
wilcox.test(experienced$Delivery_Time_min, less_experienced$Delivery_Time_min,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: experienced$Delivery_Time_min and less_experienced$Delivery_Time_min
## W = 76458, p-value = 0.3004
## alternative hypothesis: true location shift is not equal to 0
Because p > 0.05 (p = 0.300), we cannot reject H0, which states that the distribution location of delivery time is the same for experienced and less experienced couriers.
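To decide which test is reliable, I need to check whether all the assumptions of the parametric test are met. In this case, the independent samples t-test has four assumptions: (1) the variable is numerical, (2) the variable is normally distributed in both populations, (3) the data come from two independent populations, and (4) the variable has equal variances in both populations.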
The first and third assumptions are met: delivery time is numerical and the data come from two different populations. The fourth assumption (equal variances) is not needed here, because the independent samples t-test was run with the Welch correction, which does not assume equal variances. I only have to check assumption 2.
To check assumption 2, I will plot a histogram of delivery time for both groups and then perform the Shapiro-Wilk test for normality.
library(ggplot2)
ggplot(Food_Delivery_Times, aes(x = Delivery_Time_min)) +
geom_histogram(binwidth = 1, color = "navy", fill = "coral") +
facet_wrap(~Experience, ncol = 1) +
ylab("Frequency")
In the histograms, delivery time does not look normally distributed in either group; both distributions look slightly right-skewed. I will check this further using the Shapiro-Wilk test. (I tried using the %>% pipe operator to group the data by experience and then perform the Shapiro-Wilk test, but R kept returning errors.)
shapiro.test(experienced$Delivery_Time_min)
##
## Shapiro-Wilk normality test
##
## data: experienced$Delivery_Time_min
## W = 0.9831, p-value = 0.0001907
We reject H0 that the delivery time of experienced couriers is normally distributed at p < 0.001.
shapiro.test(less_experienced$Delivery_Time_min)
##
## Shapiro-Wilk normality test
##
## data: less_experienced$Delivery_Time_min
## W = 0.9767, p-value = 3.012e-06
We reject H0 that the delivery time of less experienced couriers is normally distributed at p < 0.001.
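The two separate calls above can also be run in one step without the pipe operator. A minimal base R sketch on simulated data (the `sim` data frame is hypothetical; with the real data, the same pattern would use Food_Delivery_Times and its “Experience” column):

```r
set.seed(2)  # simulated, right-skewed data for illustration only
sim <- data.frame(
  Delivery_Time_min = c(rexp(60, rate = 1 / 50), rexp(80, rate = 1 / 55)),
  Experience        = rep(c("experienced", "less_experienced"), times = c(60, 80))
)

# Split the delivery times by group, then run shapiro.test() on each part
by_group <- lapply(split(sim$Delivery_Time_min, sim$Experience), shapiro.test)

# Collect W and the p-value of each group into one small table
summary_tab <- sapply(by_group, function(res) {
  c(W = unname(res$statistic), p.value = res$p.value)
})
summary_tab
```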
library(effectsize)
rank_biserial(experienced$Delivery_Time_min, less_experienced$Delivery_Time_min,
mu = 0,
paired = FALSE,
ci = 0.95,
alternative = "two.sided")
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.04 | [-0.12, 0.04]
interpret_rank_biserial(-0.04, rules = "funder2019")
## [1] "tiny"
## (Rules: funder2019)
In conclusion, I decided to use the Wilcoxon rank sum test because the normality assumption was not met. Since p > 0.05, we cannot reject H0, which states that the distribution location of delivery time is the same for experienced and less experienced couriers. So, in answer to my research question, we have no evidence that the working experience of the couriers influences the delivery time. This is supported by the effect size: the difference in distribution locations is tiny (r = -0.04).
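The rank-biserial correlation is directly linked to the W statistic of the Wilcoxon rank sum test: r = 2W / (n1 * n2) - 1. The sketch below illustrates this on simulated data, because the group sizes are not printed in this report (the vectors `x` and `y` are made up and only stand in for the two groups):

```r
set.seed(3)  # simulated data, for illustration only
x <- round(rnorm(25, mean = 55, sd = 20))  # stands in for the experienced group
y <- round(rnorm(35, mean = 57, sd = 22))  # stands in for the less experienced group

# W counts the (x, y) pairs with x > y, with ties counted as one half
W <- unname(wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic)

# Rank-biserial correlation computed from W
r_from_W <- 2 * W / (length(x) * length(y)) - 1

# The same quantity from its definition:
# share of pairs with x > y minus share of pairs with x < y
diffs    <- outer(x, y, "-")
r_direct <- (sum(diffs > 0) - sum(diffs < 0)) / length(diffs)

c(r_from_W = r_from_W, r_direct = r_direct)
```

A value close to 0, as in the report, means that a randomly chosen delivery time from one group is about equally likely to be above or below a randomly chosen delivery time from the other group.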
Research question 2: Is there a correlation between the distance and the delivery time?
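H0: ρ = 0 -> There is no linear correlation between distance and delivery time.
H1: ρ ≠ 0 -> There is a linear correlation between distance and delivery time.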
library(car)
## Loading required package: carData
scatterplotMatrix(Food_Delivery_Times_Stat[ , c(1, 3)], smooth = FALSE)
As expected, the graphs show a positive correlation between delivery time and distance: the longer the distance, the more time the delivery takes.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(Food_Delivery_Times[ , c("Distance_km", "Delivery_Time_min")])
The correlation is strong and positive (0.783), and the three asterisks indicate that we can reject H0 at p < 0.001.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(Food_Delivery_Times_Stat[ , c(1, 3)]),
type = "pearson")
## Distance_km Delivery_Time_min
## Distance_km 1.00 0.78
## Delivery_Time_min 0.78 1.00
##
## n= 800
##
##
## P
## Distance_km Delivery_Time_min
## Distance_km 0
## Delivery_Time_min 0
We reject the null hypothesis at p < 0.001. So, the population correlation coefficient is not equal to zero, meaning that the two variables are correlated.
From the graphs observed and the Pearson Correlation Coefficient (0.783), there is a positive and strong correlation between the courier’s traveled distance and the time needed for delivering the food.
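The p-value of 0 in the rcorr() output is a rounded value. A short sketch of the test behind it, using the reported r = 0.783 and n = 800 (the variable names are my own):

```r
# Values reported above
r <- 0.783
n <- 800

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(t = t_stat, p = p_value)
```

The t statistic is about 35.6, so the p-value is far below 0.001 and is printed as 0.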
Assumptions:
Research question 3: Is there an association between weather conditions and traffic level?
chi_square <- chisq.test(Food_Delivery_Times$Weather, Food_Delivery_Times$Traffic_Level,
correct = FALSE)
chi_square
##
## Pearson's Chi-squared test
##
## data: Food_Delivery_Times$Weather and Food_Delivery_Times$Traffic_Level
## X-squared = 10.999, df = 8, p-value = 0.2018
We cannot reject H0 (p = 0.202 > 0.05), which states that there is no association between weather and traffic level.
round(addmargins(chi_square$expected), 2)
##
## Food_Delivery_Times$Weather High Low Medium Sum
## Clear 80.58 150.06 155.36 386.00
## Foggy 18.58 34.60 35.82 89.00
## Rainy 35.49 66.09 68.42 170.00
## Snowy 15.87 29.55 30.59 76.00
## Windy 16.49 30.71 31.80 79.00
## Sum 167.00 311.00 322.00 800.00
Assumption 1 is met. All expected frequencies are above 5.
The theoretical (expected) frequencies are the counts we would expect for each combination if there were no association between weather and traffic level. For example, if there were no association, we would expect 15.87 deliveries with snowy weather and a high traffic level.
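Each expected frequency is the row total times the column total, divided by the grand total. A small sketch that rebuilds them from the observed counts reported in this section (the `observed` matrix is typed in by hand from that table):

```r
# Observed counts, copied from the contingency table in this section
observed <- matrix(
  c(75, 143, 168,
    19,  40,  30,
    38,  68,  64,
    16,  37,  23,
    19,  23,  37),
  nrow = 5, byrow = TRUE,
  dimnames = list(Weather = c("Clear", "Foggy", "Rainy", "Snowy", "Windy"),
                  Traffic = c("High", "Low", "Medium"))
)

# Expected count = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

expected["Snowy", "High"]  # 76 * 167 / 800
all(expected > 5)          # rule of thumb checked for the chi-squared test
```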
addmargins(round(chi_square$observed))
##
## Food_Delivery_Times$Weather High Low Medium Sum
## Clear 75 143 168 386
## Foggy 19 40 30 89
## Rainy 38 68 64 170
## Snowy 16 37 23 76
## Windy 19 23 37 79
## Sum 167 311 322 800
The observed frequencies, also called empirical frequencies, are the counts that actually occurred in the data.
As mentioned above, we would expect 15.87 occurrences of snowy weather with a high traffic level, and in reality we observed 16. These two numbers are very close, so I will calculate the residuals to see whether the difference is statistically significant. Note that chi_square$residuals returns Pearson residuals, (observed - expected) / sqrt(expected); the standardized residuals are stored in chi_square$stdres.
round(chi_square$residuals, 2)
##
## Food_Delivery_Times$Weather High Low Medium
## Clear -0.62 -0.58 1.01
## Foggy 0.10 0.92 -0.97
## Rainy 0.42 0.24 -0.53
## Snowy 0.03 1.37 -1.37
## Windy 0.62 -1.39 0.92
For the combination of snowy weather and high traffic level, the Pearson residual is 0.03, indicating no significant difference between the observed and expected frequencies.
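The values printed above come from the `residuals` element of the test object, which holds Pearson residuals. The standardized residuals, which are the ones usually compared with +/- 1.96, are in the `stdres` element. A sketch that shows both, rerunning the test on the observed counts typed in by hand from the table above:

```r
# Observed counts, copied from the contingency table above
observed <- matrix(
  c(75, 143, 168,
    19,  40,  30,
    38,  68,  64,
    16,  37,  23,
    19,  23,  37),
  nrow = 5, byrow = TRUE,
  dimnames = list(Weather = c("Clear", "Foggy", "Rainy", "Snowy", "Windy"),
                  Traffic = c("High", "Low", "Medium"))
)

test <- chisq.test(observed, correct = FALSE)

pearson  <- test$residuals  # (observed - expected) / sqrt(expected)
adjusted <- test$stdres     # additionally scaled by the row and column proportions

# The squared Pearson residuals add up to the chi-squared statistic
c(sum_sq = sum(pearson^2), X_squared = unname(test$statistic))

round(adjusted, 2)

# Cells whose standardized residual is beyond +/- 1.96 (none for these counts)
which(abs(adjusted) > 1.96, arr.ind = TRUE)
```

The standardized residuals are somewhat larger in absolute value than the Pearson residuals, but for these counts they all stay within +/- 1.96 as well.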
library(effectsize)
effectsize::cramers_v(Food_Delivery_Times$Weather, Food_Delivery_Times$Traffic_Level)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.04 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.04)
## [1] "tiny"
## (Rules: funder2019)
Consistent with the tests above, the effect size shows a tiny association (adjusted Cramer's V = 0.04).
From the chi-squared test and the residuals, we can conclude that no association was found between weather conditions and traffic level. We cannot reject H0 (there is no association between weather and traffic level). Additionally, all of the residuals are within the threshold of +/- 1.96, and the effect size indicates a tiny association, in line with the statistical test. Although one would naturally expect worsening weather conditions to go together with worsening traffic levels, this data set provides no evidence of such an association.
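The adjusted Cramer's V of 0.04 reported above can be reproduced from the chi-squared statistic. The sketch below assumes that the adjustment applied by cramers_v() is the bias correction of Bergsma (2013); the variable names are my own.

```r
# Values reported above
chi_sq <- 10.999
n      <- 800
n_row  <- 5  # weather categories
n_col  <- 3  # traffic levels

# Unadjusted Cramer's V
v_raw <- sqrt(chi_sq / (n * min(n_row - 1, n_col - 1)))

# Bias-corrected version (assumed to match the "adj." value in the output)
phi2_adj  <- max(0, chi_sq / n - (n_row - 1) * (n_col - 1) / (n - 1))
n_row_adj <- n_row - (n_row - 1)^2 / (n - 1)
n_col_adj <- n_col - (n_col - 1)^2 / (n - 1)
v_adj     <- sqrt(phi2_adj / min(n_row_adj - 1, n_col_adj - 1))

round(c(v_raw = v_raw, v_adj = v_adj), 2)
```

Both versions are small (about 0.08 unadjusted and 0.04 adjusted), so this is the value that should be passed to interpret_cramers_v().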