Homework 1

Data importing and data description

library(readr)
Food_Delivery_Times <- read_csv("C:/Users/anama/Desktop/Masters/Multivariate Analysis/Homeworks/Homework 1/Food_Delivery_Times.csv")

## Rows: 800 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Weather, Traffic_Level, Time_of_Day, Vehicle_Type
## dbl (5): Order_ID, Distance_km, Preparation_Time_min, Courier_Experience_yrs...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Food_Delivery_Times)

## # A tibble: 6 × 9
##   Order_ID Distance_km Weather Traffic_Level Time_of_Day Vehicle_Type
##      <dbl>       <dbl> <chr>   <chr>         <chr>       <chr>       
## 1      522        7.93 Windy   Low           Afternoon   Scooter     
## 2      738       16.4  Clear   Medium        Evening     Bike        
## 3      741        9.52 Foggy   Low           Night       Scooter     
## 4      661        7.44 Rainy   Medium        Afternoon   Scooter     
## 5      412       19.0  Clear   Low           Morning     Bike        
## 6      679       19.4  Clear   Low           Evening     Scooter     
## # ℹ 3 more variables: Preparation_Time_min <dbl>, Courier_Experience_yrs <dbl>,
## #   Delivery_Time_min <dbl>

Source: Kaggle

Unit of observation: one delivery order

Sample size = n = 800 observations

Food delivery times data includes:

Order ID
Distance in km
Weather (Windy, Clear, Foggy, Rainy, Snowy)
Traffic level (Low, Medium, High)
Time of day (Morning, Afternoon, Evening, Night)
Vehicle type (Bike, Car, Scooter)
Preparation time in minutes
Courier experience in years
Delivery time in minutes

Using the “subset” function (I consulted ChatGPT about this), I divided the sample into two groups: deliveries made by experienced couriers and deliveries made by less experienced couriers. This makes it easier to run some of the tests and draw some of the graphs.

experienced <- subset(Food_Delivery_Times, Courier_Experience_yrs > 5)
less_experienced <- subset(Food_Delivery_Times, Courier_Experience_yrs <= 5)

EXPLANATION: Using the subset function on the data set “Food_Delivery_Times”, I take the variable “Courier_Experience_yrs” and apply the rule “> 5” to keep only the rows with more than 5 years of experience. The result is stored in the new subset “experienced”. The same logic applies to “less_experienced”, with the rule “<= 5”.

I will also create a new variable, “Experience”. It is a categorical variable that records whether the courier has 5 years of delivery experience or less (“less_experienced”) or more than 5 years (“experienced”).

Food_Delivery_Times$Experience <- ifelse(Food_Delivery_Times$Courier_Experience_yrs > 5, "experienced", "less_experienced")
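The grouping rule can be checked on a small example. This is only a sketch on a made-up data frame (`toy` and its values are hypothetical, not the homework data); it shows that the “Experience” variable and the two subsets describe the same split.

```r
# Hypothetical toy data standing in for Food_Delivery_Times
toy <- data.frame(
  Order_ID = 1:6,
  Courier_Experience_yrs = c(1, 5, 6, 9, 3, 7)
)

# Same rule as in the report: more than 5 years counts as experienced
toy$Experience <- ifelse(toy$Courier_Experience_yrs > 5,
                         "experienced", "less_experienced")

experienced_toy <- subset(toy, Courier_Experience_yrs > 5)
less_experienced_toy <- subset(toy, Courier_Experience_yrs <= 5)

# The counts per category should match the number of rows in each subset
group_sizes <- table(toy$Experience)
print(group_sizes)
```

A courier with exactly 5 years falls into the less experienced group, because the rule for the experienced group is strictly greater than 5.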

Descriptive statistics

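# Keep only the three numeric variables of interest (columns 2, 8 and 9)
Food_Delivery_Times_Stat <- Food_Delivery_Times[ , c("Distance_km", "Courier_Experience_yrs", "Delivery_Time_min")]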

library(pastecs)
round(stat.desc(Food_Delivery_Times_Stat), 2)

##              Distance_km Courier_Experience_yrs Delivery_Time_min
## nbr.val           800.00                 800.00            800.00
## nbr.null            0.00                   0.00              0.00
## nbr.na              0.00                   0.00              0.00
## min                 0.60                   1.00              8.00
## max                19.99                   9.00            141.00
## range              19.39                   8.00            133.00
## sum              8009.33                4097.00          44816.00
## median             10.11                   5.00             54.00
## mean               10.01                   5.12             56.02
## SE.mean             0.20                   0.09              0.77
## CI.mean.0.95        0.40                   0.18              1.51
## var                32.62                   6.96            476.17
## std.dev             5.71                   2.64             21.82
## coef.var            0.57                   0.52              0.39

The longest delivery distance in the sample is 19.99 km and the shortest is 0.60 km (600 meters). The average delivery distance is 10.01 km.

The average courier experience is 5.12 years. The median (middle value) is 5 years, meaning that at least half of the deliveries were made by couriers with 5 years of experience or less, and the rest by couriers with more than 5 years. These two statistics are the reason why I took 5 as the threshold for differentiating between experienced and less experienced couriers.

The standard deviation of the delivery time is 21.82 minutes, which means that delivery times typically deviate from the mean (56.02 minutes) by about 21.82 minutes.
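Some rows of the stat.desc() table follow directly from the mean, the standard deviation and the sample size. The sketch below recomputes three of them for delivery time from the rounded values printed above, so small rounding differences are possible. The variable names are my own.

```r
# Values copied from the stat.desc() output for Delivery_Time_min
n <- 800
mean_time <- 56.02
sd_time <- 21.82

se_mean <- sd_time / sqrt(n)                # row SE.mean
ci_half <- qt(0.975, df = n - 1) * se_mean  # row CI.mean.0.95 (half-width)
coef_var <- sd_time / mean_time             # row coef.var

print(round(c(se_mean = se_mean, ci_half = ci_half, coef_var = coef_var), 2))

# 95% confidence interval for the mean delivery time
ci <- mean_time + c(-1, 1) * ci_half
print(round(ci, 2))
```

The row CI.mean.0.95 is therefore not the interval itself but its half-width: the interval is the mean plus and minus that value.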

Task no. 1 - Hypothesis testing

Analysis

Research question 1: Does the working experience of the couriers influence the delivery time?

Parametric test

H0: µ (experienced) = µ (less experienced) -> The average delivery time of experienced couriers is the same as the average delivery time of less experienced couriers.

H1: µ (experienced) ≠ µ (less experienced) -> The average delivery time of experienced couriers is different from the average delivery time of less experienced couriers.

t.test(experienced$Delivery_Time_min, less_experienced$Delivery_Time_min,
       paired = FALSE,
       alternative ="two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  experienced$Delivery_Time_min and less_experienced$Delivery_Time_min
## t = -1.1973, df = 794.33, p-value = 0.2316
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.874996  1.181116
## sample estimates:
## mean of x mean of y 
##  55.05497  56.90191

We cannot reject H0 because p = 0.232 > 0.05.

Nonparametric test

H0: The distribution location of delivery time is the same for experienced and less experienced couriers.

H1: The distribution location of delivery time is not the same for experienced and less experienced couriers.

wilcox.test(experienced$Delivery_Time_min, less_experienced$Delivery_Time_min,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  experienced$Delivery_Time_min and less_experienced$Delivery_Time_min
## W = 76458, p-value = 0.3004
## alternative hypothesis: true location shift is not equal to 0

We cannot reject H0, which states that the distribution location of delivery time is the same for experienced and less experienced couriers, because p = 0.300 > 0.05.

To decide which test is more reliable, I need to check whether the assumptions of the parametric test are met. The independent samples t-test has four assumptions:

Variable is numeric
The distribution of the variable is normal in both populations
The data must come from two different populations
Variable has the same variance in both populations.

The first and third assumptions are met: delivery time is numeric, and the data come from two different populations. The fourth assumption (equal variances) is not needed here, because the t-test above was run with the Welch correction, which does not assume equal variances. Only assumption 2 remains to be checked.

To check assumption 2, I will plot a histogram of delivery time for each group and then perform the Shapiro-Wilk test for normality.
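To show what the Welch correction does, the sketch below recomputes the t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with t.test(). The data are simulated (hypothetical values, not the homework data), so only the method carries over.

```r
set.seed(1)
# Simulated delivery times for two groups with different spreads (hypothetical)
x <- rnorm(40, mean = 55, sd = 20)
y <- rnorm(50, mean = 57, sd = 25)

vx <- var(x) / length(x)  # squared standard error of mean(x)
vy <- var(y) / length(y)  # squared standard error of mean(y)

# Welch t statistic: the variances are NOT pooled
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom (usually not a whole number)
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

welch <- t.test(x, y, var.equal = FALSE)
print(c(t_manual = t_manual, df_manual = df_manual))
print(c(welch$statistic, welch$parameter))
```

This is also why the degrees of freedom in the t-test output are fractional (794.33) instead of n - 2.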

Checking for normality

library(ggplot2)
ggplot(Food_Delivery_Times, aes(x = Delivery_Time_min)) +
  geom_histogram(binwidth = 1, color = "navy", fill = "coral") +
  facet_wrap(~Experience, ncol = 1) +
  ylab("Frequency")

From the histograms, delivery time does not look normally distributed in either group; both distributions look slightly right-skewed. I will check this further using the Shapiro-Wilk test. (I tried using the %>% pipe operator to group the data by experience and then run the Shapiro-Wilk test, but R kept returning errors.)

shapiro.test(experienced$Delivery_Time_min)

## 
##  Shapiro-Wilk normality test
## 
## data:  experienced$Delivery_Time_min
## W = 0.9831, p-value = 0.0001907

H0: The variable delivery time (for experienced couriers) is normally distributed.
H1: The variable delivery time (for experienced couriers) is not normally distributed.

We reject H0 that the delivery time of experienced couriers is normally distributed at p < 0.001.

shapiro.test(less_experienced$Delivery_Time_min)

## 
##  Shapiro-Wilk normality test
## 
## data:  less_experienced$Delivery_Time_min
## W = 0.9767, p-value = 3.012e-06

H0: The variable delivery time (for less experienced couriers) is normally distributed.
H1: The variable delivery time (for less experienced couriers) is not normally distributed.

We reject H0 that the delivery time of less experienced couriers is normally distributed at p < 0.001.
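The two Shapiro-Wilk tests can also be run in one step without the pipe operator. The sketch below uses base R only (split() and lapply()) on simulated data; the data frame `sim` and its values are hypothetical, with the same column names as in the homework data.

```r
set.seed(42)
# Simulated stand-in for Food_Delivery_Times (hypothetical, right-skewed values)
sim <- data.frame(
  Delivery_Time_min = c(rgamma(100, shape = 6, scale = 9),
                        rgamma(120, shape = 6, scale = 9.5)),
  Experience = rep(c("experienced", "less_experienced"), times = c(100, 120))
)

# Split delivery times by group and run shapiro.test() on each part;
# the result is a named list with one test result per group
shapiro_by_group <- lapply(
  split(sim$Delivery_Time_min, sim$Experience),
  shapiro.test
)

# Collect W and the p-value of each group in one small table
shapiro_summary <- data.frame(
  group = names(shapiro_by_group),
  W = sapply(shapiro_by_group, function(res) unname(res$statistic)),
  p_value = sapply(shapiro_by_group, function(res) res$p.value)
)
print(shapiro_summary)
```

With the real data, `sim` would be replaced by Food_Delivery_Times, since the “Experience” variable already exists there.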

Effect size

library(effectsize)
rank_biserial(experienced$Delivery_Time_min, less_experienced$Delivery_Time_min,
              mu = 0,
              paired = FALSE,
              ci = 0.95,
              alternative = "two.sided")

## r (rank biserial) |        95% CI
## ---------------------------------
## -0.04             | [-0.12, 0.04]

interpret_rank_biserial(-0.04, rules = "funder2019")

## [1] "tiny"
## (Rules: funder2019)
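The rank-biserial correlation is closely tied to the W statistic of the Wilcoxon rank sum test. As a sketch on simulated data (hypothetical values), it can be computed as 2W / (n1 * n2) - 1, which equals the share of pairs where x is larger minus the share of pairs where y is larger. I assume here that rank_biserial() uses the same sign convention.

```r
set.seed(7)
# Simulated delivery times for two groups (hypothetical, not the homework data)
x <- rnorm(30, mean = 55, sd = 20)
y <- rnorm(35, mean = 57, sd = 20)

# W counts the pairs (x_i, y_j) in which x_i is larger than y_j
w <- unname(wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic)
r_from_w <- 2 * w / (length(x) * length(y)) - 1

# Same quantity from all pairwise differences
diffs <- outer(x, y, "-")
r_pairs <- (sum(diffs > 0) - sum(diffs < 0)) / length(diffs)

print(c(r_from_w = r_from_w, r_pairs = r_pairs))
```

A value near 0, like the -0.04 above, means that an experienced courier's delivery is about as likely to be faster as it is to be slower than a less experienced courier's delivery.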

Conclusion

In conclusion, I rely on the Wilcoxon rank sum test because the normality assumption was not met. Since p > 0.05 (p = 0.300), we cannot reject H0, which states that the distribution location of delivery time is the same for experienced and less experienced couriers. So, in answer to my research question, there is no evidence that the working experience of the couriers influences the delivery time. This is further supported by the effect size: the difference in distribution locations is tiny (r = -0.04).

Task no. 2 - correlation

Analysis

Research question 2: Is there a correlation between the distance and the delivery time?

H0: ρ(distance&delivery time) = 0
H1: ρ(distance&delivery time) ≠ 0

H0: There is no correlation between distance and delivery time.
H1: There is a correlation between distance and delivery time.

library(car)

## Loading required package: carData

scatterplotMatrix(Food_Delivery_Times_Stat[ , c(1, 3)], smooth = FALSE)

As expected, the scatterplot shows a positive relationship between distance and delivery time: the longer the distance, the longer the delivery takes.

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(Food_Delivery_Times[ , c("Distance_km", "Delivery_Time_min")])

The correlation is strong and positive (0.783), and the three asterisks indicate that we can reject H0 at p < 0.001.

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(Food_Delivery_Times_Stat[ , c(1, 3)]),
      type = "pearson")

##                   Distance_km Delivery_Time_min
## Distance_km              1.00              0.78
## Delivery_Time_min        0.78              1.00
## 
## n= 800 
## 
## 
## P
##                   Distance_km Delivery_Time_min
## Distance_km                    0               
## Delivery_Time_min  0

We reject the null hypothesis at p < 0.001 (the p-value is printed as 0 because it is rounded). So, the population correlation coefficient is not equal to zero, meaning that the two variables are correlated.
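The size of the p-value can be seen from the test statistic behind the Pearson correlation test, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below plugs in the rounded values reported in this section (r = 0.783, n = 800); the variable names are my own.

```r
r <- 0.783  # Pearson correlation between distance and delivery time
n <- 800    # number of observations

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t_stat = t_stat, p_value = p_value))
```

A t statistic this far from zero gives a p-value that is far smaller than 0.001.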

Conclusion

From the graphs and the Pearson correlation coefficient (0.783), there is a strong positive correlation between the distance traveled and the time needed to deliver the food.

Task no. 3 - Pearson Chi squared test

Assumptions:

The observations are independent of each other
All expected frequencies are above 5

Analysis

Research question 3: Is there an association between weather conditions and traffic?

H0: There is no association between weather and traffic.
H1: There is an association between weather and traffic.

chi_square <- chisq.test(Food_Delivery_Times$Weather, Food_Delivery_Times$Traffic_Level, 
                        correct = FALSE)

chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  Food_Delivery_Times$Weather and Food_Delivery_Times$Traffic_Level
## X-squared = 10.999, df = 8, p-value = 0.2018

We cannot reject H0 (p > 0.05) which states that there is no association between weather and traffic.

Expected frequencies

addmargins(round(chi_square$expected, 2))

##                            
## Food_Delivery_Times$Weather   High    Low Medium    Sum
##                       Clear  80.58 150.06 155.36 386.00
##                       Foggy  18.58  34.60  35.82  89.00
##                       Rainy  35.49  66.09  68.42 170.00
##                       Snowy  15.87  29.55  30.59  76.01
##                       Windy  16.49  30.71  31.80  79.00
##                       Sum   167.01 311.01 321.99 800.01

Assumption 1 is considered met, and assumption 2 is met as well: all expected frequencies are above 5 (the smallest is 15.87).

The theoretical / expected frequencies show how often each combination would occur if there were no association between weather and traffic level. For example, if there were no association between weather conditions and traffic level, we would expect 15.87 deliveries with snowy weather and high traffic level.
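Each expected frequency is the row total times the column total, divided by the sample size. The sketch below rebuilds the expected table from the margins only; the totals are taken from the Sum row and Sum column of the tables in this section, as whole counts.

```r
# Row totals (weather) and column totals (traffic level) as whole counts
row_totals <- c(Clear = 386, Foggy = 89, Rainy = 170, Snowy = 76, Windy = 79)
col_totals <- c(High = 167, Low = 311, Medium = 322)
n <- sum(row_totals)

# Expected count under independence: row total * column total / n
expected <- outer(row_totals, col_totals) / n
print(round(expected, 2))

# Smallest expected count, to check the "above 5" assumption
print(min(expected))
```

For snowy weather and high traffic this gives 76 * 167 / 800, which is the 15.87 used in the example above.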

Observed frequencies

addmargins(round(chi_square$observed))

##                            
## Food_Delivery_Times$Weather High Low Medium Sum
##                       Clear   75 143    168 386
##                       Foggy   19  40     30  89
##                       Rainy   38  68     64 170
##                       Snowy   16  37     23  76
##                       Windy   19  23     37  79
##                       Sum    167 311    322 800

The observed frequencies, also called the empirical frequencies, are the actual counts found in the data.

As mentioned above, we would expect 15.87 deliveries with snowy weather and high traffic level, but in reality we observed 16. These two numbers are very close, so I will look at the residuals to see whether the difference is statistically significant.

Standardised Residuals

round(chi_square$residuals, 2)

##                            
## Food_Delivery_Times$Weather  High   Low Medium
##                       Clear -0.62 -0.58   1.01
##                       Foggy  0.10  0.92  -0.97
##                       Rainy  0.42  0.24  -0.53
##                       Snowy  0.03  1.37  -1.37
##                       Windy  0.62 -1.39   0.92

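For the combination of snowy weather and high traffic level, the residual is 0.03, far inside the threshold of +/- 1.96, so the difference between the observed and the expected count is not significant. Strictly speaking, chi_square$residuals returns the Pearson residuals, (observed - expected) / sqrt(expected); the standardized residuals are stored in chi_square$stdres.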
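The difference between the two kinds of residuals can be seen by rebuilding the test from the observed table. The sketch below types in the observed counts from this section as a matrix (the object names `observed` and `chi` are my own) and prints both the Pearson residuals and the standardized residuals.

```r
# Observed counts copied from the observed frequencies table
observed <- matrix(
  c(75, 143, 168,
    19,  40,  30,
    38,  68,  64,
    16,  37,  23,
    19,  23,  37),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Weather = c("Clear", "Foggy", "Rainy", "Snowy", "Windy"),
    Traffic_Level = c("High", "Low", "Medium")
  )
)

chi <- chisq.test(observed, correct = FALSE)

# Pearson residuals by hand: (observed - expected) / sqrt(expected)
pearson_manual <- (observed - chi$expected) / sqrt(chi$expected)

print(round(chi$residuals, 2))  # Pearson residuals, as printed in the report
print(round(chi$stdres, 2))     # standardized residuals
```

The squared Pearson residuals add up to the chi-squared statistic, while the standardized residuals are the ones that are compared with +/- 1.96.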

Effect size

library(effectsize)
effectsize::cramers_v(Food_Delivery_Times$Weather, Food_Delivery_Times$Traffic_Level)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.04              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
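The reported value can be reproduced by hand from the chi-squared statistic. The sketch below computes the unadjusted Cramer's V and a bias-corrected version, assuming that the “(adj.)” in the output refers to the Bergsma (2013) small-sample correction; the inputs are the rounded values from the test output above.

```r
chi_sq <- 10.999  # X-squared from the chi-squared test
n <- 800          # sample size
n_rows <- 5       # weather categories
n_cols <- 3       # traffic levels

# Unadjusted Cramer's V
v_raw <- sqrt(chi_sq / (n * min(n_rows - 1, n_cols - 1)))

# Bias-corrected version (assumed to be what "adj." refers to)
phi2_adj <- max(0, chi_sq / n - (n_rows - 1) * (n_cols - 1) / (n - 1))
rows_adj <- n_rows - (n_rows - 1)^2 / (n - 1)
cols_adj <- n_cols - (n_cols - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(rows_adj - 1, cols_adj - 1))

print(round(c(v_raw = v_raw, v_adj = v_adj), 2))
```

Under this assumption the corrected value rounds to 0.04, the same as in the output, and the unadjusted value is also small.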

interpret_cramers_v(0.04)

## [1] "tiny"
## (Rules: funder2019)

The effect size agrees with the tests above: the adjusted Cramer's V of 0.04 indicates a tiny association between weather and traffic level.

Conclusion

From the chi squared test and the residuals, we can conclude that no association was found between weather conditions and traffic level. We cannot reject H0 (there is no association between weather and traffic), and all of the residuals are within the threshold of +/- 1.96. The effect size points the same way: the association is tiny (Cramer's V = 0.04). Although worsening weather might naturally be expected to go together with worsening traffic, this data set does not show such a pattern.

Ana Marija Dimitrova

2025-01-17
