All required libraries are loaded first. The tidyverse
package handles data manipulation and visualisation,
moments provides skewness and kurtosis functions,
scales assists with axis formatting, and
kableExtra produces styled tables in the report.
library(tidyverse)
library(moments)
library(scales)
library(knitr)
library(kableExtra)
library(reshape2)The dataset is read from disk and inspected using str(),
glimpse(), and summary() to understand its
structure, variable types, and initial data quality.
spc_tbl_ [21,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ order_id : chr [1:21000] "FDD000001" "FDD000002" "FDD000003" "FDD000004" ...
$ customer_age : num [1:21000] 60 17 43 49 53 32 34 57 NA 37 ...
$ customer_gender : chr [1:21000] "Mle" "Mle" NA "Male" ...
$ restaurant_type : chr [1:21000] "cafe" "cafe" "Fast Food" "cafe" ...
$ order_value : num [1:21000] 3116 3763 816 884 3854 ...
$ delivery_distance: num [1:21000] -3.51 21.58 20.57 11.97 6.82 ...
$ delivery_time : num [1:21000] 52 134 61 36 91 132 62 61 145 54 ...
$ traffic_level : chr [1:21000] "low" "low" "High" "High" ...
$ weather : chr [1:21000] "Rainy" "Cloudy" "Sunny" "Cloudy" ...
$ payment_method : chr [1:21000] "Online" "Card" "cash" NA ...
$ rating : num [1:21000] 4 10 3 10 10 -1 10 3 4 -1 ...
$ delivery_status : chr [1:21000] "delivered" "Delivered" "Delivered" NA ...
$ rider_experience : num [1:21000] 30 0 -3 0 15 30 0 13 8 1 ...
- attr(*, "spec")=
.. cols(
.. order_id = col_character(),
.. customer_age = col_double(),
.. customer_gender = col_character(),
.. restaurant_type = col_character(),
.. order_value = col_double(),
.. delivery_distance = col_double(),
.. delivery_time = col_double(),
.. traffic_level = col_character(),
.. weather = col_character(),
.. payment_method = col_character(),
.. rating = col_double(),
.. delivery_status = col_character(),
.. rider_experience = col_double()
.. )
- attr(*, "problems")=<externalptr>
Rows: 21,000
Columns: 13
$ order_id <chr> "FDD000001", "FDD000002", "FDD000003", "FDD000004", …
$ customer_age <dbl> 60, 17, 43, 49, 53, 32, 34, 57, NA, 37, 48, 47, 64, …
$ customer_gender <chr> "Mle", "Mle", NA, "Male", NA, "Mle", "Male", "Male",…
$ restaurant_type <chr> "cafe", "cafe", "Fast Food", "cafe", "Cafe", "cafe",…
$ order_value <dbl> 3116, 3763, 816, 884, 3854, 2082, 1480, 2085, 2600, …
$ delivery_distance <dbl> -3.51, 21.58, 20.57, 11.97, 6.82, 20.51, 1.17, 11.25…
$ delivery_time <dbl> 52, 134, 61, 36, 91, 132, 62, 61, 145, 54, 139, 25, …
$ traffic_level <chr> "low", "low", "High", "High", "low", "Hgh", "Hgh", "…
$ weather <chr> "Rainy", "Cloudy", "Sunny", "Cloudy", "Rainy", "Clou…
$ payment_method <chr> "Online", "Card", "cash", NA, "Card", "Card", NA, "C…
$ rating <dbl> 4, 10, 3, 10, 10, -1, 10, 3, 4, -1, 3, 3, NA, 4, 5, …
$ delivery_status <chr> "delivered", "Delivered", "Delivered", NA, "delivere…
$ rider_experience <dbl> 30, 0, -3, 0, 15, 30, 0, 13, 8, 1, 1, 9, 0, 8, 4, -3…
The dataset contains 21,000 observations across 13 variables,
covering a mix of categorical identifiers – such as
restaurant_type, traffic_level, and
payment_method – and numeric measures including
order_value, delivery_distance,
delivery_time, and rating. The presence of
character-type columns alongside numeric ones signals that careful type
checking and text standardisation will be required before analysis.
order_id customer_age customer_gender restaurant_type
Length:21000 Min. : -5.00 Length:21000 Length:21000
Class :character 1st Qu.: 28.00 Class :character Class :character
Mode :character Median : 42.00 Mode :character Mode :character
Mean : 42.35
3rd Qu.: 56.00
Max. :120.00
NA's :368
order_value delivery_distance delivery_time traffic_level
Min. : -100 Min. :-5.00 Min. :-20.00 Length:21000
1st Qu.: 1618 1st Qu.: 2.47 1st Qu.: 44.00 Class :character
Median : 2750 Median :10.05 Median : 79.00 Mode :character
Mean : 2777 Mean :10.03 Mean : 81.87
3rd Qu.: 3868 3rd Qu.:17.56 3rd Qu.:116.00
Max. :99999 Max. :25.00 Max. :500.00
NA's :6 NA's :139
weather payment_method rating delivery_status
Length:21000 Length:21000 Min. :-1.000 Length:21000
Class :character Class :character 1st Qu.: 1.000 Class :character
Mode :character Mode :character Median : 3.000 Mode :character
Mean : 3.438
3rd Qu.: 5.000
Max. :10.000
NA's :2607
rider_experience
Min. :-3.000
1st Qu.: 3.000
Median : 7.000
Mean : 8.152
3rd Qu.:12.000
Max. :30.000
NA's :1110
The summary confirms several data quality concerns. Customer age has a mean of roughly 42 years, which is plausible, but the range extends into negative territory, indicating invalid entries. Delivery time averages around 81.87 minutes with a maximum of 500 minutes – clearly unrealistic for a food delivery context. Rating shows 2,607 missing values, which is notable since it directly reflects customer satisfaction. Rider experience averages 8.15 years, though some values are negative. These issues are addressed systematically in the cleaning steps below.
Before correcting any values, blank strings in character columns are
first converted to NA so that they are treated consistently
alongside other missing entries.
food_raw2 <- food_raw %>%
mutate(across(where(is.character), ~na_if(str_trim(.), "")))
missing_tbl <- food_raw2 %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Missing_Count") %>%
mutate(Missing_Pct = round(100 * Missing_Count / nrow(food_raw2), 2)) %>%
arrange(desc(Missing_Count))
kable(missing_tbl,
caption = "Table 1: Missing values per variable (blank strings treated as NA)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Variable | Missing_Count | Missing_Pct |
|---|---|---|
| customer_gender | 6873 | 32.73 |
| delivery_status | 5272 | 25.10 |
| weather | 5267 | 25.08 |
| payment_method | 4251 | 20.24 |
| traffic_level | 3469 | 16.52 |
| rating | 2607 | 12.41 |
| rider_experience | 1110 | 5.29 |
| customer_age | 368 | 1.75 |
| delivery_time | 139 | 0.66 |
| order_value | 6 | 0.03 |
| order_id | 0 | 0.00 |
| restaurant_type | 0 | 0.00 |
| delivery_distance | 0 | 0.00 |
Missing data is spread across several variables. The most affected
are rating and rider_experience, which
together represent the two clearest quality indicators in the dataset.
Variables like customer_gender,
delivery_status, and traffic_level also show
non-trivial rates of missingness, suggesting inconsistent data
collection at the point of entry. These patterns are important to keep
in mind during analysis, as they can introduce non-response bias if not
handled carefully.
All categorical variables contain spelling variants, inconsistent
capitalisation, and abbreviated entries. Each variable is standardised
to a fixed set of valid labels, with unrecognisable entries mapped to
NA.
food_clean <- food_raw2 %>%
mutate(
customer_gender = case_when(
str_to_lower(customer_gender) %in% c("male", "mle") ~ "Male",
str_to_lower(customer_gender) %in% c("female", "fmale") ~ "Female",
TRUE ~ NA_character_
)
) %>%
mutate(
restaurant_type = case_when(
str_to_lower(restaurant_type) %in% c("cafe") ~ "Cafe",
str_to_lower(restaurant_type) %in% c("fast food", "fastfood") ~ "Fast Food",
str_to_lower(restaurant_type) %in% c("fine dining", "finedining") ~ "Fine Dining",
TRUE ~ NA_character_
)
) %>%
mutate(
traffic_level = case_when(
str_to_lower(traffic_level) %in% c("low") ~ "Low",
str_to_lower(traffic_level) %in% c("medium", "med") ~ "Medium",
str_to_lower(traffic_level) %in% c("high", "hgh") ~ "High",
TRUE ~ NA_character_
)
) %>%
mutate(weather = str_to_title(weather)) %>%
mutate(
payment_method = case_when(
str_to_lower(payment_method) %in% c("cash") ~ "Cash",
str_to_lower(payment_method) %in% c("card") ~ "Card",
str_to_lower(payment_method) %in% c("online") ~ "Online",
TRUE ~ NA_character_
)
) %>%
mutate(
delivery_status = case_when(
str_to_lower(delivery_status) %in% c("delivered") ~ "Delivered",
str_to_lower(delivery_status) %in% c("cancelled") ~ "Cancelled",
TRUE ~ NA_character_
)
)
cat("customer_gender:\n"); print(table(food_clean$customer_gender, useNA = "ifany"))customer_gender:
Female Male <NA>
7217 6910 6873
restaurant_type:
Cafe Fast Food Fine Dining
8381 8390 4229
traffic_level:
High Low Medium <NA>
6942 7032 3557 3469
weather:
Cloudy Rainy Sunny <NA>
5160 5299 5274 5267
payment_method:
Card Cash Online <NA>
4241 8207 4301 4251
delivery_status:
Cancelled Delivered <NA>
5208 10520 5272
After standardisation, each categorical variable is reduced to its
expected set of valid labels. Fast Food and Cafe dominate the restaurant
distribution, while Fine Dining makes up a smaller share. Cash is by far
the most common payment method. The remaining missingness in
customer_gender, weather, and
delivery_status reflects entries that could not be matched
to any recognised category and were therefore set to
NA.
Each numeric variable is validated against its expected real-world
range. Observations falling outside these bounds are replaced with
NA rather than removed, preserving the full row for
variables that are still valid.
food_clean <- food_clean %>%
mutate(
customer_age = if_else(
customer_age >= 15 & customer_age <= 70,
customer_age, NA_real_),
order_value = if_else(
order_value >= 500 & order_value <= 5000,
order_value, NA_real_),
delivery_distance = if_else(
delivery_distance >= 0 & delivery_distance <= 20,
delivery_distance, NA_real_),
delivery_time = if_else(
delivery_time >= 10 & delivery_time <= 120,
delivery_time, NA_real_),
rating = if_else(
rating >= 1 & rating <= 5,
rating, NA_real_),
rider_experience = if_else(
rider_experience >= 0 & rider_experience <= 15,
rider_experience, NA_real_)
)
cat("Remaining NA counts after range validation:\n")Remaining NA counts after range validation:
order_id customer_age customer_gender restaurant_type
0 1063 6873 0
order_value delivery_distance delivery_time traffic_level
15 7083 4743 3469
weather payment_method rating delivery_status
5267 4251 7809 5272
rider_experience
3310
This step increases the missing value counts considerably for
rating and delivery_distance, which confirms
that a large portion of the raw values were out-of-range rather than
simply blank. For example, rating now has 7,809 missing
entries, reflecting both the original blanks and the invalid numeric
entries combined. While this improves data integrity, it reduces the
usable sample for any analysis involving these variables. This trade-off
– stricter validity at the cost of completeness – is a routine
consideration in data preprocessing.
A new variable, delivery_speed, is calculated as the
ratio of distance covered to time taken. This gives a continuous measure
of how efficiently each delivery was completed, expressed in kilometres
per minute.
food_clean <- food_clean %>%
mutate(delivery_speed = delivery_distance / delivery_time)
food_clean %>%
select(order_id, delivery_distance, delivery_time, delivery_speed) %>%
head(10)| order_id | delivery_distance | delivery_time | delivery_speed |
|---|---|---|---|
| FDD000001 | NA | 52 | NA |
| FDD000002 | NA | NA | NA |
| FDD000003 | NA | 61 | NA |
| FDD000004 | 11.97 | 36 | 0.3325000 |
| FDD000005 | 6.82 | 91 | 0.0749451 |
| FDD000006 | NA | NA | NA |
| FDD000007 | 1.17 | 62 | 0.0188710 |
| FDD000008 | 11.25 | 61 | 0.1844262 |
| FDD000009 | NA | NA | NA |
| FDD000010 | 19.69 | 54 | 0.3646296 |
The variable takes meaningful values only where both
delivery_distance and delivery_time are
non-missing. Records where either is NA will naturally
produce an NA speed, which is the correct behaviour.
Delivery performance is classified into three tiers based on delivery time thresholds defined in the study brief.
food_clean <- food_clean %>%
mutate(
delivery_performance = case_when(
delivery_time < 30 ~ "Fast",
delivery_time >= 30 & delivery_time < 60 ~ "Moderate",
delivery_time >= 60 ~ "Slow",
TRUE ~ NA_character_
),
delivery_performance = factor(
delivery_performance, levels = c("Fast", "Moderate", "Slow")
)
)
cat("delivery_performance distribution (including NAs):\n")delivery_performance distribution (including NAs):
Fast Moderate Slow <NA>
2953 4437 8867 4743
The Slow category is the largest group with 8,867 orders, followed by
Moderate at 4,437 and Fast at 2,953. The 4,743 missing values correspond
to rows where delivery_time was set to NA
during cleaning. The skew towards longer delivery times may point to
genuine operational inefficiencies, though it is worth noting that this
classification does not account for distance – a 60-minute delivery over
20 km is quite different from a 60-minute delivery over 2 km.
Mean delivery time and mean order value are computed for each combination of restaurant type and traffic level.
summary_rt_tl <- food_clean %>%
filter(!is.na(restaurant_type), !is.na(traffic_level)) %>%
group_by(restaurant_type, traffic_level) %>%
summarise(
Mean_Delivery_Time = round(mean(delivery_time, na.rm = TRUE), 2),
Mean_Order_Value = round(mean(order_value, na.rm = TRUE), 2),
N = n(),
.groups = "drop"
) %>%
arrange(restaurant_type, traffic_level)
kable(summary_rt_tl,
caption = "Table 2: Mean delivery time and order value by restaurant type and traffic level") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| restaurant_type | traffic_level | Mean_Delivery_Time | Mean_Order_Value | N |
|---|---|---|---|---|
| Cafe | High | 64.58 | 2734.09 | 2814 |
| Cafe | Low | 63.73 | 2733.84 | 2795 |
| Cafe | Medium | 65.27 | 2745.46 | 1378 |
| Fast Food | High | 65.05 | 2722.40 | 2767 |
| Fast Food | Low | 64.44 | 2696.76 | 2831 |
| Fast Food | Medium | 65.44 | 2716.04 | 1445 |
| Fine Dining | High | 65.53 | 2754.12 | 1361 |
| Fine Dining | Low | 66.34 | 2771.95 | 1406 |
| Fine Dining | Medium | 65.61 | 2849.61 | 734 |
The grouped means show very little variation across the nine combinations of restaurant type and traffic level. Delivery time sits consistently between 63.7 and 66.3 minutes regardless of the combination, and mean order value clusters tightly between LKR 2,700 and 2,850. This uniformity is striking – in a real operational setting, one would expect higher traffic levels and more complex restaurant types to push delivery times upward. The flat pattern is consistent with the wider data quality issues identified earlier.
summary_gs <- food_clean %>%
filter(!is.na(customer_gender), !is.na(delivery_status)) %>%
count(customer_gender, delivery_status, name = "Total_Orders")
kable(summary_gs,
caption = "Table 3: Total orders by customer gender and delivery status") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| customer_gender | delivery_status | Total_Orders |
|---|---|---|
| Female | Cancelled | 1857 |
| Female | Delivered | 3595 |
| Male | Cancelled | 1691 |
| Male | Delivered | 3482 |
Males placed 3,482 delivered orders and 1,691 cancellations, while females placed 3,595 delivered orders and 1,857 cancellations. Delivery success rates are broadly comparable across genders, with no meaningful difference in cancellation rates. This suggests that customer gender is not a driver of delivery outcomes in this dataset.
Summary statistics for delivery_time and
order_value are computed to characterise central tendency
and spread.
summary_stats <- food_clean %>%
summarise(
across(
c(delivery_time, order_value),
list(
Mean = ~round(mean(.x, na.rm = TRUE), 2),
Median = ~round(median(.x, na.rm = TRUE), 2),
SD = ~round(sd(.x, na.rm = TRUE), 2),
IQR = ~round(IQR(.x, na.rm = TRUE), 2)
),
.names = "{.col}__{.fn}"
)
) %>%
pivot_longer(everything(),
names_to = c("Variable", "Statistic"),
names_sep = "__") %>%
pivot_wider(names_from = "Statistic", values_from = "value")
kable(summary_stats,
caption = "Table 4: Summary statistics for delivery time and order value") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Variable | Mean | Median | SD | IQR |
|---|---|---|---|---|
| delivery_time | 64.84 | 65 | 32.14 | 56 |
| order_value | 2744.98 | 2748 | 1297.13 | 2248 |
For delivery_time, the mean (64.84 min) and median (65
min) are almost identical, indicating a symmetric distribution with no
meaningful skew. The standard deviation of 32.14 and IQR of 56 reflect
moderate spread. For order_value, the mean (2,744.98 LKR)
and median (2,748 LKR) are again very close, though the much larger
standard deviation (1,297.13) and IQR (2,248) suggest that spending
varies considerably across orders, possibly driven by a wide range of
order types and sizes.
dt_vals <- na.omit(food_clean$delivery_time)
sk <- round(skewness(dt_vals), 4)
ku_raw <- round(kurtosis(dt_vals), 4)
ku_ex <- round(ku_raw - 3, 4)
cat("Skewness (delivery_time) :", sk, "\n")Skewness (delivery_time) : 0.0175
Raw kurtosis : 1.7994
Excess kurtosis (kurt - 3) : -1.2006
A skewness of 0.0175 confirms that the distribution of delivery time is very nearly symmetric. The raw kurtosis of 1.7994 and corresponding excess kurtosis of -1.2006 indicate a platykurtic distribution – one with lighter tails and a flatter peak than a normal distribution. This is consistent with a broadly uniform spread of delivery times across the 10-120 minute range.
perf_freq <- food_clean %>%
filter(!is.na(delivery_performance)) %>%
count(delivery_performance, name = "Frequency") %>%
mutate(Percentage = round(100 * Frequency / sum(Frequency), 2))
kable(perf_freq,
caption = "Table 5: Frequency table for delivery performance") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| delivery_performance | Frequency | Percentage |
|---|---|---|
| Fast | 2953 | 18.16 |
| Moderate | 4437 | 27.29 |
| Slow | 8867 | 54.54 |
ggplot(perf_freq,
aes(x = delivery_performance, y = Frequency, fill = delivery_performance)) +
geom_col(width = 0.55, colour = "white") +
geom_text(
aes(label = paste0(Frequency, "\n(", Percentage, "%)")),
vjust = -0.3, size = 3.8, fontface = "bold"
) +
scale_fill_manual(
values = c("Fast" = "#27ae60", "Moderate" = "#f39c12", "Slow" = "#e74c3c")
) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Distribution of delivery performance",
subtitle = "Fast: < 30 min | Moderate: 30-59 min | Slow: >= 60 min",
x = "Performance category",
y = "Number of orders"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold"),
panel.grid.major.x = element_blank()
)Figure 1: Distribution of delivery performance categories
The bar chart shows that over half of all deliveries (54.54%, n = 8,867) fell into the Slow category. Moderate deliveries account for about 27% of orders, and Fast deliveries are the least common at 18.16%. This distribution raises questions about operational efficiency, though the absence of a distance control limits how much can be concluded from timing alone.
food_clean %>%
filter(!is.na(delivery_distance), !is.na(delivery_time), !is.na(traffic_level)) %>%
ggplot(aes(x = delivery_distance, y = delivery_time, colour = traffic_level)) +
geom_point(alpha = 0.25, size = 0.9) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
scale_colour_manual(
values = c("Low" = "#27ae60", "Medium" = "#e67e22", "High" = "#c0392b")
) +
labs(
title = "Delivery distance vs. delivery time",
subtitle = "Coloured by traffic level, with linear trend lines per group",
x = "Delivery distance (km)",
y = "Delivery time (minutes)",
colour = "Traffic level"
) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))Figure 2: Delivery distance vs. delivery time, coloured by traffic level
The scatter plot reveals that distance has almost no relationship with delivery time in this dataset. All three trend lines sit nearly flat at around 65-70 minutes regardless of distance, and the three traffic level lines overlap almost perfectly. In a real delivery operation, longer distances and heavier traffic would typically push delivery times upward; neither pattern is visible here. The very wide vertical scatter at each distance value also suggests that unobserved factors – such as order preparation time or rider routing decisions – are far more influential than the variables captured here.
food_clean %>%
filter(!is.na(order_value), !is.na(restaurant_type)) %>%
ggplot(aes(x = restaurant_type, y = order_value, fill = restaurant_type)) +
geom_boxplot(outlier.colour = "red", outlier.alpha = 0.35,
outlier.size = 1.2, width = 0.5) +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = comma) +
labs(
title = "Order value distribution by restaurant type",
x = "Restaurant type",
y = "Order value (LKR)"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold")
)Figure 3: Order value distribution by restaurant type
All three restaurant types show strikingly similar distributions. Medians sit close together in the LKR 2,750-2,900 range, and the interquartile ranges are comparable across categories. One would intuitively expect Fine Dining to produce meaningfully higher order values than Cafe or Fast Food, but that distinction does not emerge here. This is most likely a consequence of the order values being generated within a fixed LKR 500-5,000 range, which compresses the inter-group differences regardless of restaurant type.
mean_dt <- mean(food_clean$delivery_time, na.rm = TRUE)
median_dt <- median(food_clean$delivery_time, na.rm = TRUE)
food_clean %>%
filter(!is.na(delivery_time)) %>%
ggplot(aes(x = delivery_time)) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
fill = "#3498db", colour = "white", alpha = 0.80) +
geom_density(colour = "#e74c3c", linewidth = 1.1) +
geom_vline(xintercept = mean_dt, linetype = "dashed",
colour = "#2c3e50", linewidth = 0.9) +
geom_vline(xintercept = median_dt, linetype = "dotted",
colour = "#8e44ad", linewidth = 0.9) +
annotate("text", x = mean_dt + 5, y = 0.022,
label = paste0("Mean = ", round(mean_dt, 1)),
colour = "#2c3e50", size = 3.5) +
annotate("text", x = median_dt - 5, y = 0.020,
label = paste0("Median = ", round(median_dt, 1)),
colour = "#8e44ad", size = 3.5) +
labs(
title = "Distribution of delivery time",
subtitle = "Histogram with kernel density overlay",
x = "Delivery time (minutes)",
y = "Density"
) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))Figure 4: Histogram of delivery time distribution
The histogram displays a remarkably flat, near-uniform distribution of delivery times across the 10-120 minute range. Bar heights are fairly consistent throughout, with no obvious cluster around a typical delivery duration. The mean and median sit almost on top of each other at around 65 minutes, reinforcing the symmetric, flat shape identified by the skewness and kurtosis measures. Real-world delivery data would normally peak around a common duration, making this pattern more characteristic of synthetically generated data.
food_clean %>%
filter(!is.na(delivery_performance), !is.na(weather)) %>%
count(weather, delivery_performance) %>%
group_by(weather) %>%
mutate(pct = n / sum(n)) %>%
ungroup() %>%
ggplot(aes(x = weather, y = n, fill = delivery_performance)) +
geom_col(position = "fill", colour = "white", width = 0.60) +
geom_text(
aes(label = paste0(round(pct * 100, 1), "%")),
position = position_fill(vjust = 0.5),
size = 3.5, colour = "white", fontface = "bold"
) +
scale_y_continuous(labels = percent_format()) +
scale_fill_manual(
values = c("Fast" = "#27ae60", "Moderate" = "#f39c12", "Slow" = "#e74c3c")
) +
labs(
title = "Delivery performance by weather condition",
subtitle = "Proportion of orders in each performance category",
x = "Weather condition",
y = "Proportion of orders",
fill = "Delivery performance"
) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))Figure 5: Delivery performance by weather condition
Weather conditions have very little visible effect on delivery performance. The proportions of Fast, Moderate, and Slow deliveries are nearly identical across Cloudy, Rainy, and Sunny conditions. Rainy conditions show a slightly higher Slow rate (approximately 56%) compared to Sunny (approximately 54%), which is intuitively plausible, but the margin is too small to draw any firm conclusion. If weather were a genuine driver of performance, a more pronounced gap – particularly between Rainy and Sunny – would be expected.
File saved: food_delivery_cleaned.csv
Rows : 21000
Columns: 15
The cleaned and transformed dataset has been saved as
food_delivery_cleaned.csv. The file retains all 21,000
original rows and contains 15 columns – the original 13 variables plus
the two newly derived fields, delivery_speed and
delivery_performance.
The mean delivery time across all valid orders is 64.84 minutes.
The total number of orders in each delivery performance category is as follows:
| Category | Total orders |
|---|---|
| Fast | 2953 |
| Moderate | 4437 |
| Slow | 8867 |
| household_id | region | household_size | monthly_income | monthly_expenditure | employment_status | education_level | housing_type | access_to_internet | health_expenditure | savings | debt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| UHID000001 | urbn | 1 | 385000 | 120000 | Self-employed | Tertiary | Owned | yes | 46000 | 155000 | 470000 |
| UHID000002 | rural | 20 | 60000 | 60000 | Employed | primary | Shared | yes | 4000 | 135000 | 420000 |
| UHID000003 | Suburban | NA | 85000 | 355000 | Unemployed | primary | Owned | yes | 1000 | 65000 | 660000 |
| UHID000004 | rural | 0 | 150000 | 170000 | emplyd | Tertiary | Shared | TRUE | 20000 | 20000 | 700000 |
| UHID000005 | rural | 0 | 110000 | 15000 | emplyd | primary | Shared | no | 14000 | 15000 | 800000 |
| UHID000006 | Rural | 2 | 295000 | 30000 | emplyd | Primary | Owned | yes | 2000 | 100000 | 290000 |
The first six rows give an immediate sense of the dataset’s
structure. Variables cover household demographics (region,
household_size), economic indicators
(monthly_income, monthly_expenditure,
savings, debt), and socio-economic
characteristics (employment_status,
education_level, housing_type,
access_to_internet). Some missing values are already
visible in the preview.
household_id region household_size monthly_income
Length:33000 Length:33000 Min. : 0.00 Min. : -5000
Class :character Class :character 1st Qu.: 3.00 1st Qu.: 130000
Mode :character Mode :character Median : 6.00 Median : 255000
Mean : 6.32 Mean : 358695
3rd Qu.: 9.00 3rd Qu.: 385000
Max. :20.00 Max. :10000000
NA's :2585 NA's :355
monthly_expenditure employment_status education_level housing_type
Min. :-10000 Length:33000 Length:33000 Length:33000
1st Qu.:100000 Class :character Class :character Class :character
Median :200000 Mode :character Mode :character Mode :character
Mean :198716
3rd Qu.:300000
Max. :400000
NA's :391
access_to_internet health_expenditure savings debt
Length:33000 Min. : 1000 Min. :-50000 Min. : 0
Class :character 1st Qu.: 13000 1st Qu.: 45000 1st Qu.: 250000
Mode :character Median : 26000 Median : 95000 Median : 510000
Mean : 44116 Mean : 95939 Mean : 601793
3rd Qu.: 39000 3rd Qu.:150000 3rd Qu.: 760000
Max. :1000000 Max. :200000 Max. :10000000
NA's :622 NA's :725 NA's :303
The summary reveals a number of data quality problems. Negative
values appear in monthly_income,
monthly_expenditure, and savings, all of which
should be non-negative by definition. Extreme maximum values – for
example, income and debt reaching 10,000,000 – are well beyond plausible
household figures and suggest either data entry errors or outliers
introduced during dataset generation. Missing values are widespread,
particularly in employment_status,
education_level, and access_to_internet.
Missing value counts (raw data):
household_id region household_size monthly_income
0 5452 2585 355
monthly_expenditure employment_status education_level housing_type
391 6676 6628 0
access_to_internet health_expenditure savings debt
6586 622 725 303
hh_raw2 <- hh_raw %>%
mutate(across(where(is.character), ~na_if(str_trim(.), "")))
cat("Missing value counts after converting blank strings to NA:\n")Missing value counts after converting blank strings to NA:
household_id region household_size monthly_income
0 5452 2585 355
monthly_expenditure employment_status education_level housing_type
391 6676 6628 0
access_to_internet health_expenditure savings debt
6586 622 725 303
After converting blank strings to NA, the full extent of
missingness becomes clearer. employment_status (6,676),
education_level (6,628), and
access_to_internet (6,586) are the most affected, followed
by region (5,452) and household_size (2,585).
Financial variables carry lower rates of missingness but are not immune.
The pattern suggests that demographic and categorical fields were most
susceptible to incomplete data collection in this survey.
Duplicate records are removed first, then numeric variables are validated against their expected ranges.
hh_clean <- hh_raw2 %>%
distinct(household_id, .keep_all = TRUE)
cat("Rows after removing duplicates:", nrow(hh_clean),
"| Duplicates removed:", nrow(hh_raw2) - nrow(hh_clean), "\n\n")Rows after removing duplicates: 30000 | Duplicates removed: 3000
hh_clean <- hh_clean %>%
mutate(
household_size = if_else(
household_size >= 1 & household_size <= 10,
household_size, NA_real_),
monthly_income = if_else(
monthly_income >= 10000 & monthly_income <= 500000,
monthly_income, NA_real_),
monthly_expenditure = if_else(
monthly_expenditure >= 5000 & monthly_expenditure <= 400000,
monthly_expenditure, NA_real_),
health_expenditure = if_else(
health_expenditure >= 0 & health_expenditure <= 500000,
health_expenditure, NA_real_),
savings = if_else(savings <= 200000, savings, NA_real_),
debt = if_else(debt >= 0 & debt <= 5000000, debt, NA_real_)
)
cat("NA counts after range validation:\n")NA counts after range validation:
household_id region household_size monthly_income
0 4970 6956 934
monthly_expenditure employment_status education_level housing_type
718 6080 6016 0
access_to_internet health_expenditure savings debt
5988 1127 656 596
Removing duplicates reduced the dataset from 33,000 to 30,000 rows,
with 3,000 duplicate household_id entries eliminated. Range
validation then converted out-of-bound values to NA for all
six numeric variables. The increases in missing counts for
monthly_income and monthly_expenditure confirm
that a portion of the raw entries were implausible rather than simply
blank.
hh_clean <- hh_clean %>%
mutate(
region = case_when(
str_to_lower(region) %in% c("urban", "urbn") ~ "Urban",
str_to_lower(region) %in% c("rural") ~ "Rural",
str_to_lower(region) %in% c("suburban") ~ "Suburban",
TRUE ~ NA_character_
)
) %>%
mutate(
employment_status = case_when(
str_to_lower(employment_status) %in%
c("employed", "emplyd", "employ") ~ "Employed",
str_to_lower(employment_status) %in%
c("unemployed", "unemp") ~ "Unemployed",
str_to_lower(employment_status) %in%
c("self-employed", "selfemployed", "self employed",
"self_employed") ~ "Self-Employed",
TRUE ~ NA_character_
)
) %>%
mutate(
education_level = case_when(
str_to_lower(education_level) %in% c("primary") ~ "Primary",
str_to_lower(education_level) %in% c("secondary") ~ "Secondary",
str_to_lower(education_level) %in% c("tertiary") ~ "Tertiary",
TRUE ~ NA_character_
),
education_level = factor(
education_level, levels = c("Primary", "Secondary", "Tertiary")
)
) %>%
mutate(
access_to_internet = case_when(
str_to_lower(as.character(access_to_internet)) %in%
c("true", "yes", "1", "y") ~ TRUE,
str_to_lower(as.character(access_to_internet)) %in%
c("false", "no", "0", "n") ~ FALSE,
TRUE ~ NA
)
)
cat("--- region ---\n"); print(table(hh_clean$region, useNA = "ifany"))--- region ---
Rural Suburban Urban <NA>
10120 4873 10037 4970
--- employment_status ---
Employed Self-Employed Unemployed <NA>
11932 5965 6023 6080
--- education_level ---
Primary Secondary Tertiary <NA>
11962 6041 5981 6016
--- access_to_internet ---
FALSE TRUE <NA>
12103 11909 5988
After standardisation, the region variable shows Urban (10,037) and
Rural (10,120) households as nearly equal in size, with Suburban (4,873)
accounting for a smaller share. Employment status breaks into Employed
(11,932), Unemployed (6,023), and Self-Employed (5,965). Education
levels are distributed across Primary (11,962), Secondary (6,041), and
Tertiary (5,981). Internet access is split fairly evenly, with roughly
12,103 households without access and 11,909 with access. Residual
missing values across all four variables reflect entries that did not
match any recognised label and were therefore set to
NA.
hh_clean <- hh_clean %>%
mutate(savings_rate = savings / monthly_income)
summary(hh_clean$savings_rate) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-5.0000 0.1739 0.3778 0.8138 0.7432 20.0000 1569
The savings_rate variable captures the proportion of
monthly income that a household saves. The summary shows a range that
extends into negative territory, which can occur when a household’s
savings figure is negative – reflecting net dissaving rather than
accumulation. These values are retained for now and handled through
filtering in downstream analyses.
hh_clean <- hh_clean %>%
mutate(
income_group = case_when(
monthly_income >= 200000 ~ "High Income",
monthly_income >= 75000 & monthly_income < 200000 ~ "Middle Income",
monthly_income < 75000 ~ "Low Income",
TRUE ~ NA_character_
),
income_group = factor(
income_group, levels = c("Low Income", "Middle Income", "High Income")
)
)
cat("Income group distribution:\n")Income group distribution:
Low Income Middle Income High Income <NA>
3836 7359 17871 934
Households are classified into three income tiers based on monthly income thresholds specified in the study brief. The distribution across groups will be examined further in the visualisation section.
inc_exp_tbl <- hh_clean %>%
filter(!is.na(region), !is.na(employment_status)) %>%
group_by(region, employment_status) %>%
summarise(
Mean_Income = round(mean(monthly_income, na.rm = TRUE), 0),
Mean_Expenditure = round(mean(monthly_expenditure, na.rm = TRUE), 0),
N = n(),
.groups = "drop"
) %>%
arrange(region, employment_status)
kable(inc_exp_tbl,
caption = "Table 6: Mean monthly income and expenditure by region and employment status",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| region | employment_status | Mean_Income | Mean_Expenditure | N |
|---|---|---|---|---|
| Rural | Employed | 255,176 | 201,519 | 4,010 |
| Rural | Self-Employed | 260,212 | 202,537 | 2,000 |
| Rural | Unemployed | 254,376 | 205,269 | 2,046 |
| Suburban | Employed | 250,959 | 200,999 | 1,922 |
| Suburban | Self-Employed | 260,216 | 205,063 | 980 |
| Suburban | Unemployed | 264,645 | 205,788 | 967 |
| Urban | Employed | 256,436 | 197,032 | 3,963 |
| Urban | Self-Employed | 251,358 | 202,598 | 1,989 |
| Urban | Unemployed | 253,623 | 200,803 | 2,049 |
sav_debt_tbl <- hh_clean %>%
filter(!is.na(education_level)) %>%
group_by(education_level) %>%
summarise(
Median_Savings = round(median(savings, na.rm = TRUE), 0),
Median_Debt = round(median(debt, na.rm = TRUE), 0),
N = n(),
.groups = "drop"
)
kable(sav_debt_tbl,
caption = "Table 7: Median savings and debt by education level",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| education_level | Median_Savings | Median_Debt | N |
|---|---|---|---|
| Primary | 95,000 | 500,000 | 11,962 |
| Secondary | 100,000 | 510,000 | 6,041 |
| Tertiary | 100,000 | 510,000 | 5,981 |
Mean income and expenditure show very little variation across regions or employment categories, which mirrors the pattern seen in the food delivery data. Median savings and debt are similarly uniform across education levels, with all three groups sitting at comparable figures. In real-world data, tertiary-educated households typically show higher savings and lower debt burdens – the absence of this gradient here reinforces the synthetic nature of the dataset.
Two variables are simulated to illustrate the use of probability distributions in R.
set.seed(42)
# Normal distribution: simulated household income
household_income_sim <- rnorm(n = 1000, mean = 150000, sd = 50000)
# Poisson distribution: simulated number of dependents
number_of_dependents <- rpois(n = 1000, lambda = 2)
cat("First 10 simulated income values:\n", round(head(household_income_sim, 10)), "\n\n")First 10 simulated income values:
218548 121765 168156 181643 170213 144694 225576 145267 250921 146864
First 10 simulated dependent counts:
6 2 3 4 3 3 2 3 1 1
Household income is simulated from a Normal distribution with mean LKR 150,000 and standard deviation LKR 50,000. This reflects the assumption that most household incomes cluster around a central value, with fewer households at the extremes. Number of dependents is simulated from a Poisson distribution with lambda = 2, which is appropriate for modelling count data where most households have a small number of dependents and larger counts become progressively rarer.
norm_cmp <- tibble(
Statistic = c("Mean", "Variance", "Std deviation"),
Theoretical = c(150000, 50000^2, 50000),
Sample = c(
round(mean(household_income_sim), 2),
round(var(household_income_sim), 2),
round(sd(household_income_sim), 2)
)
)
kable(norm_cmp,
caption = "Table 8: Normal distribution -- theoretical vs. sample moments",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Statistic | Theoretical | Sample |
|---|---|---|
| Mean | 1.5e+05 | 1.487088e+05 |
| Variance | 2.5e+09 | 2.512622e+09 |
| Std deviation | 5.0e+04 | 5.012606e+04 |
pois_cmp <- tibble(
Statistic = c("Mean (lambda)", "Variance (lambda)"),
Theoretical = c(2, 2),
Sample = c(
round(mean(number_of_dependents), 4),
round(var(number_of_dependents), 4)
)
)
kable(pois_cmp,
caption = "Table 9: Poisson distribution -- theoretical vs. sample moments") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Statistic | Theoretical | Sample |
|---|---|---|
| Mean (lambda) | 2 | 2.0100 |
| Variance (lambda) | 2 | 1.9438 |
For the Normal distribution, the sample mean and standard deviation are very close to the theoretical parameters of 150,000 and 50,000 respectively, with small deviations attributable to random sampling variation at n = 1,000. For the Poisson distribution, both the sample mean and variance are close to the theoretical value of lambda = 2, which is expected given that the mean and variance are equal in a Poisson process. Minor discrepancies are normal and would narrow with a larger sample size.
tibble(income = household_income_sim) %>%
ggplot(aes(x = income)) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
fill = "#2980b9", colour = "white", alpha = 0.75) +
geom_density(colour = "#e74c3c", linewidth = 1.2) +
stat_function(
fun = dnorm,
args = list(mean = 150000, sd = 50000),
colour = "#27ae60", linewidth = 1.1, linetype = "dashed"
) +
scale_x_continuous(labels = comma) +
labs(
title = "Simulated household income -- Normal(150,000; 50,000)",
subtitle = "Blue: histogram | Red: empirical density | Green dashed: theoretical curve",
x = "Simulated income (LKR)",
y = "Density"
) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))Figure 6: Simulated household income – Normal distribution
The histogram closely follows the expected bell-shaped curve. The empirical density (red) and the theoretical Normal curve (green dashed) align well throughout the distribution, with the slight deviations at the tails being entirely typical of a finite sample. The simulation is behaving as intended.
pois_pmf <- tibble(
x = 0:max(number_of_dependents),
prob = dpois(0:max(number_of_dependents), lambda = 2)
)
tibble(dependents = number_of_dependents) %>%
ggplot(aes(x = dependents)) +
geom_bar(aes(y = after_stat(prop)), fill = "#8e44ad",
colour = "white", alpha = 0.80, width = 0.70) +
geom_point(data = pois_pmf, aes(x = x, y = prob),
colour = "#27ae60", size = 3.5) +
geom_line(data = pois_pmf, aes(x = x, y = prob),
colour = "#27ae60", linewidth = 1.0, linetype = "dashed") +
labs(
title = "Simulated number of dependents -- Poisson(lambda = 2)",
subtitle = "Purple bars: sample proportions | Green: theoretical PMF",
x = "Number of dependents",
y = "Probability"
) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))Figure 7: Simulated number of dependents – Poisson distribution
The bar chart peaks at 2 dependents and tails off steadily at higher counts, matching the shape of a Poisson distribution with lambda = 2. The sample proportions (purple bars) align closely with the theoretical probability mass function (green), confirming that the simulation is a good empirical approximation of the specified distribution at n = 1,000.
set.seed(1)
hh_clean %>%
filter(!is.na(monthly_income), !is.na(monthly_expenditure), !is.na(region)) %>%
sample_n(5000) %>%
ggplot(aes(x = monthly_income, y = monthly_expenditure, colour = region)) +
geom_point(alpha = 0.30, size = 1.2) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
scale_colour_brewer(palette = "Set1") +
labs(
title = "Monthly income vs. monthly expenditure",
subtitle = "Coloured by region -- random sample of 5,000 households",
x = "Monthly income (LKR)",
y = "Monthly expenditure (LKR)",
colour = "Region"
) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))Figure 8: Monthly income vs. monthly expenditure, coloured by region
The scatter plot shows virtually no relationship between income and expenditure. The three regional trend lines are nearly flat and lie almost on top of each other around LKR 200,000, meaning expenditure does not change with income level regardless of where a household is located. In real survey data, a positive relationship between income and expenditure is almost universal; its absence here is a direct consequence of the variables being generated independently within fixed ranges.
hh_clean %>%
filter(!is.na(savings), !is.na(education_level)) %>%
ggplot(aes(x = education_level, y = savings, fill = education_level)) +
geom_boxplot(outlier.colour = "red", outlier.alpha = 0.30,
outlier.size = 1.2, width = 0.50) +
scale_y_continuous(labels = comma) +
scale_fill_brewer(palette = "Pastel1") +
labs(
title = "Monthly savings by education level",
x = "Education level",
y = "Monthly savings (LKR)"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold")
)Figure 9: Monthly savings distribution by education level
The three education groups display near-identical distributions, with medians sitting around LKR 95,000-100,000 and comparable spreads. Negative values appear across all groups, indicating households that are spending beyond their income. One would typically expect Tertiary-educated households to show noticeably higher savings given better employment prospects and earning capacity, but this pattern does not appear – again consistent with the synthetic data generation process.
hh_clean %>%
filter(!is.na(monthly_income), !is.na(employment_status), !is.na(region)) %>%
group_by(region, employment_status) %>%
summarise(avg_income = mean(monthly_income, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = employment_status, y = avg_income, fill = employment_status)) +
geom_col(width = 0.60, colour = "white") +
geom_text(aes(label = comma(round(avg_income, 0))),
vjust = -0.40, size = 2.7, fontface = "bold") +
facet_wrap(~region, ncol = 3) +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.18))) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Average monthly income by employment status",
subtitle = "Faceted by region",
x = "Employment status",
y = "Average monthly income (LKR)"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 35, hjust = 1),
strip.text = element_text(face = "bold", size = 11),
plot.title = element_text(face = "bold")
)Figure 10: Average monthly income by employment status, faceted by region
Average monthly income is almost identical across every combination of region and employment status, ranging between roughly LKR 251,000 and LKR 265,000. Employed and Self-Employed households would ordinarily earn more than Unemployed households, and Urban areas would typically outpace Rural ones. Neither pattern is visible here, which reflects the income variable being generated within a fixed range without regard to employment or location.
num_hh <- hh_clean %>%
select(household_size, monthly_income, monthly_expenditure,
health_expenditure, savings, debt, savings_rate) %>%
drop_na()
cor_mat <- cor(num_hh)
cor_melt <- melt(cor_mat)
ggplot(cor_melt, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(colour = "white", linewidth = 0.5) +
geom_text(aes(label = round(value, 2)), size = 3.5, fontface = "bold") +
scale_fill_gradient2(
low = "#c0392b",
mid = "white",
high = "#2980b9",
midpoint = 0,
limits = c(-1, 1),
name = "Pearson r"
) +
labs(
title = "Correlation heatmap -- socio-economic household numeric variables",
x = NULL, y = NULL
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"),
legend.position = "right"
)Figure 11: Correlation heatmap of numeric household variables
Almost every pair of numeric variables shows a correlation close to
zero, confirming that the variables were generated independently. The
one notable exception is the moderate negative correlation between
monthly_income and savings_rate (-0.48), which
is somewhat counterintuitive – higher-income households would generally
be expected to save a larger share of income. This negative relationship
likely emerges from the way savings and income were independently
bounded during data generation rather than reflecting a genuine economic
dynamic.
All code in this report is fully reproducible. A global chunk options
block at the top of the document (knitr::opts_chunk$set)
ensures consistent behaviour across all chunks – warnings and messages
are suppressed to keep the output clean, figures are centred, and echo
is enabled so the code is always visible alongside its output. The
simulation in Section C uses set.seed(42) to guarantee that
identical results are obtained on every knit. The tidyverse
and supporting packages are loaded once at the start of Problem 1 and
remain available throughout. All inline R expressions update
automatically when the underlying data changes, so key reported figures
– such as the mean delivery time and household count – are always
consistent with the actual computed values.
write_csv(hh_clean, "socio_economic_households_cleaned.csv")
cat("File saved: socio_economic_households_cleaned.csv\n")File saved: socio_economic_households_cleaned.csv
Rows : 30000
Columns: 14
The cleaned household dataset has been saved as
socio_economic_households_cleaned.csv. The file contains
30000 rows and 14 columns – the
original 12 variables plus the two newly derived fields,
savings_rate and income_group.
The mean monthly income across valid household records is 255,075 LKR. The distribution of households across income groups is shown below:
| Income group | Households |
|---|---|
| Low Income | 3836 |
| Middle Income | 7359 |
| High Income | 17871 |