1 Problem 1: Food Delivery Performance Study

1.1 Section A: Data Preparation

1.1.1 Importing and Inspecting the Dataset

All required libraries are loaded first. The tidyverse package handles data manipulation and visualisation, moments provides skewness and kurtosis functions, scales assists with axis formatting, and kableExtra produces styled tables in the report.

library(tidyverse)
library(moments)
library(scales)
library(knitr)
library(kableExtra)
library(reshape2)

The dataset is read from disk and inspected using str(), glimpse(), and summary() to understand its structure, variable types, and initial data quality.

food_raw <- read_csv("food_delivery_data.csv", show_col_types = FALSE)
str(food_raw)

spc_tbl_ [21,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ order_id         : chr [1:21000] "FDD000001" "FDD000002" "FDD000003" "FDD000004" ...
 $ customer_age     : num [1:21000] 60 17 43 49 53 32 34 57 NA 37 ...
 $ customer_gender  : chr [1:21000] "Mle" "Mle" NA "Male" ...
 $ restaurant_type  : chr [1:21000] "cafe" "cafe" "Fast Food" "cafe" ...
 $ order_value      : num [1:21000] 3116 3763 816 884 3854 ...
 $ delivery_distance: num [1:21000] -3.51 21.58 20.57 11.97 6.82 ...
 $ delivery_time    : num [1:21000] 52 134 61 36 91 132 62 61 145 54 ...
 $ traffic_level    : chr [1:21000] "low" "low" "High" "High" ...
 $ weather          : chr [1:21000] "Rainy" "Cloudy" "Sunny" "Cloudy" ...
 $ payment_method   : chr [1:21000] "Online" "Card" "cash" NA ...
 $ rating           : num [1:21000] 4 10 3 10 10 -1 10 3 4 -1 ...
 $ delivery_status  : chr [1:21000] "delivered" "Delivered" "Delivered" NA ...
 $ rider_experience : num [1:21000] 30 0 -3 0 15 30 0 13 8 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   order_id = col_character(),
  ..   customer_age = col_double(),
  ..   customer_gender = col_character(),
  ..   restaurant_type = col_character(),
  ..   order_value = col_double(),
  ..   delivery_distance = col_double(),
  ..   delivery_time = col_double(),
  ..   traffic_level = col_character(),
  ..   weather = col_character(),
  ..   payment_method = col_character(),
  ..   rating = col_double(),
  ..   delivery_status = col_character(),
  ..   rider_experience = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

glimpse(food_raw)

Rows: 21,000
Columns: 13
$ order_id          <chr> "FDD000001", "FDD000002", "FDD000003", "FDD000004", …
$ customer_age      <dbl> 60, 17, 43, 49, 53, 32, 34, 57, NA, 37, 48, 47, 64, …
$ customer_gender   <chr> "Mle", "Mle", NA, "Male", NA, "Mle", "Male", "Male",…
$ restaurant_type   <chr> "cafe", "cafe", "Fast Food", "cafe", "Cafe", "cafe",…
$ order_value       <dbl> 3116, 3763, 816, 884, 3854, 2082, 1480, 2085, 2600, …
$ delivery_distance <dbl> -3.51, 21.58, 20.57, 11.97, 6.82, 20.51, 1.17, 11.25…
$ delivery_time     <dbl> 52, 134, 61, 36, 91, 132, 62, 61, 145, 54, 139, 25, …
$ traffic_level     <chr> "low", "low", "High", "High", "low", "Hgh", "Hgh", "…
$ weather           <chr> "Rainy", "Cloudy", "Sunny", "Cloudy", "Rainy", "Clou…
$ payment_method    <chr> "Online", "Card", "cash", NA, "Card", "Card", NA, "C…
$ rating            <dbl> 4, 10, 3, 10, 10, -1, 10, 3, 4, -1, 3, 3, NA, 4, 5, …
$ delivery_status   <chr> "delivered", "Delivered", "Delivered", NA, "delivere…
$ rider_experience  <dbl> 30, 0, -3, 0, 15, 30, 0, 13, 8, 1, 1, 9, 0, 8, 4, -3…

The dataset contains 21,000 observations across 13 variables, covering a mix of categorical identifiers – such as restaurant_type, traffic_level, and payment_method – and numeric measures including order_value, delivery_distance, delivery_time, and rating. The presence of character-type columns alongside numeric ones signals that careful type checking and text standardisation will be required before analysis.

summary(food_raw)

   order_id          customer_age    customer_gender    restaurant_type   
 Length:21000       Min.   : -5.00   Length:21000       Length:21000      
 Class :character   1st Qu.: 28.00   Class :character   Class :character  
 Mode  :character   Median : 42.00   Mode  :character   Mode  :character  
                    Mean   : 42.35                                        
                    3rd Qu.: 56.00                                        
                    Max.   :120.00                                        
                    NA's   :368                                           
  order_value    delivery_distance delivery_time    traffic_level     
 Min.   : -100   Min.   :-5.00     Min.   :-20.00   Length:21000      
 1st Qu.: 1618   1st Qu.: 2.47     1st Qu.: 44.00   Class :character  
 Median : 2750   Median :10.05     Median : 79.00   Mode  :character  
 Mean   : 2777   Mean   :10.03     Mean   : 81.87                     
 3rd Qu.: 3868   3rd Qu.:17.56     3rd Qu.:116.00                     
 Max.   :99999   Max.   :25.00     Max.   :500.00                     
 NA's   :6                         NA's   :139                        
   weather          payment_method         rating       delivery_status   
 Length:21000       Length:21000       Min.   :-1.000   Length:21000      
 Class :character   Class :character   1st Qu.: 1.000   Class :character  
 Mode  :character   Mode  :character   Median : 3.000   Mode  :character  
                                       Mean   : 3.438                     
                                       3rd Qu.: 5.000                     
                                       Max.   :10.000                     
                                       NA's   :2607                       
 rider_experience
 Min.   :-3.000  
 1st Qu.: 3.000  
 Median : 7.000  
 Mean   : 8.152  
 3rd Qu.:12.000  
 Max.   :30.000  
 NA's   :1110

The summary confirms several data quality concerns. Customer age has a mean of roughly 42 years, which is plausible, but the range extends into negative territory, indicating invalid entries. Delivery time averages around 81.87 minutes with a maximum of 500 minutes – clearly unrealistic for a food delivery context. Rating shows 2,607 missing values, which is notable since it directly reflects customer satisfaction. Rider experience averages 8.15 years, though some values are negative. These issues are addressed systematically in the cleaning steps below.

1.1.2 Missing Value Analysis

Before correcting any values, blank strings in character columns are first converted to NA so that they are treated consistently alongside other missing entries.

food_raw2 <- food_raw %>%
  mutate(across(where(is.character), ~na_if(str_trim(.), "")))

missing_tbl <- food_raw2 %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Missing_Count") %>%
  mutate(Missing_Pct = round(100 * Missing_Count / nrow(food_raw2), 2)) %>%
  arrange(desc(Missing_Count))

kable(missing_tbl,
      caption = "Table 1: Missing values per variable (blank strings treated as NA)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 1: Missing values per variable (blank strings treated as NA)
Variable	Missing_Count	Missing_Pct
customer_gender	6873	32.73
delivery_status	5272	25.10
weather	5267	25.08
payment_method	4251	20.24
traffic_level	3469	16.52
rating	2607	12.41
rider_experience	1110	5.29
customer_age	368	1.75
delivery_time	139	0.66
order_value	6	0.03
order_id	0	0.00
restaurant_type	0	0.00
delivery_distance	0	0.00

Missing data is spread across several variables. The most affected are rating and rider_experience, which together represent the two clearest quality indicators in the dataset. Variables like customer_gender, delivery_status, and traffic_level also show non-trivial rates of missingness, suggesting inconsistent data collection at the point of entry. These patterns are important to keep in mind during analysis, as they can introduce non-response bias if not handled carefully.

1.1.3 Correcting Data Issues

1.1.3.1 Inconsistent Text Entries

All categorical variables contain spelling variants, inconsistent capitalisation, and abbreviated entries. Each variable is standardised to a fixed set of valid labels, with unrecognisable entries mapped to NA.

food_clean <- food_raw2 %>%

  mutate(
    customer_gender = case_when(
      str_to_lower(customer_gender) %in% c("male",   "mle")   ~ "Male",
      str_to_lower(customer_gender) %in% c("female", "fmale") ~ "Female",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    restaurant_type = case_when(
      str_to_lower(restaurant_type) %in% c("cafe")                     ~ "Cafe",
      str_to_lower(restaurant_type) %in% c("fast food", "fastfood")    ~ "Fast Food",
      str_to_lower(restaurant_type) %in% c("fine dining", "finedining") ~ "Fine Dining",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    traffic_level = case_when(
      str_to_lower(traffic_level) %in% c("low")           ~ "Low",
      str_to_lower(traffic_level) %in% c("medium", "med") ~ "Medium",
      str_to_lower(traffic_level) %in% c("high", "hgh")   ~ "High",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(weather = str_to_title(weather)) %>%

  mutate(
    payment_method = case_when(
      str_to_lower(payment_method) %in% c("cash")   ~ "Cash",
      str_to_lower(payment_method) %in% c("card")   ~ "Card",
      str_to_lower(payment_method) %in% c("online") ~ "Online",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    delivery_status = case_when(
      str_to_lower(delivery_status) %in% c("delivered") ~ "Delivered",
      str_to_lower(delivery_status) %in% c("cancelled") ~ "Cancelled",
      TRUE ~ NA_character_
    )
  )

cat("customer_gender:\n");  print(table(food_clean$customer_gender,  useNA = "ifany"))

customer_gender:


Female   Male   <NA> 
  7217   6910   6873

cat("restaurant_type:\n");  print(table(food_clean$restaurant_type,  useNA = "ifany"))

restaurant_type:


       Cafe   Fast Food Fine Dining 
       8381        8390        4229

cat("traffic_level:\n");    print(table(food_clean$traffic_level,    useNA = "ifany"))

traffic_level:


  High    Low Medium   <NA> 
  6942   7032   3557   3469

cat("weather:\n");          print(table(food_clean$weather,          useNA = "ifany"))

weather:


Cloudy  Rainy  Sunny   <NA> 
  5160   5299   5274   5267

cat("payment_method:\n");   print(table(food_clean$payment_method,   useNA = "ifany"))

payment_method:


  Card   Cash Online   <NA> 
  4241   8207   4301   4251

cat("delivery_status:\n");  print(table(food_clean$delivery_status,  useNA = "ifany"))

delivery_status:


Cancelled Delivered      <NA> 
     5208     10520      5272

After standardisation, each categorical variable is reduced to its expected set of valid labels. Fast Food and Cafe dominate the restaurant distribution, while Fine Dining makes up a smaller share. Cash is by far the most common payment method. The remaining missingness in customer_gender, weather, and delivery_status reflects entries that could not be matched to any recognised category and were therefore set to NA.

1.1.3.2 Invalid Numeric Values

Each numeric variable is validated against its expected real-world range. Observations falling outside these bounds are replaced with NA rather than removed, preserving the full row for variables that are still valid.

food_clean <- food_clean %>%
  mutate(
    customer_age = if_else(
      customer_age >= 15 & customer_age <= 70,
      customer_age, NA_real_),

    order_value = if_else(
      order_value >= 500 & order_value <= 5000,
      order_value, NA_real_),

    delivery_distance = if_else(
      delivery_distance >= 0 & delivery_distance <= 20,
      delivery_distance, NA_real_),

    delivery_time = if_else(
      delivery_time >= 10 & delivery_time <= 120,
      delivery_time, NA_real_),

    rating = if_else(
      rating >= 1 & rating <= 5,
      rating, NA_real_),

    rider_experience = if_else(
      rider_experience >= 0 & rider_experience <= 15,
      rider_experience, NA_real_)
  )

cat("Remaining NA counts after range validation:\n")

Remaining NA counts after range validation:

print(colSums(is.na(food_clean)))

         order_id      customer_age   customer_gender   restaurant_type 
                0              1063              6873                 0 
      order_value delivery_distance     delivery_time     traffic_level 
               15              7083              4743              3469 
          weather    payment_method            rating   delivery_status 
             5267              4251              7809              5272 
 rider_experience 
             3310

This step increases the missing value counts considerably for rating and delivery_distance, which confirms that a large portion of the raw values were out-of-range rather than simply blank. For example, rating now has 7,809 missing entries, reflecting both the original blanks and the invalid numeric entries combined. While this improves data integrity, it reduces the usable sample for any analysis involving these variables. This trade-off – stricter validity at the cost of completeness – is a routine consideration in data preprocessing.

1.2 Section B: Data Transformation

1.2.1 Derived Variable: Delivery Speed

A new variable, delivery_speed, is calculated as the ratio of distance covered to time taken. This gives a continuous measure of how efficiently each delivery was completed, expressed in kilometres per minute.

food_clean <- food_clean %>%
  mutate(delivery_speed = delivery_distance / delivery_time)

food_clean %>%
  select(order_id, delivery_distance, delivery_time, delivery_speed) %>%
  head(10)

order_id	delivery_distance	delivery_time	delivery_speed
FDD000001	NA	52	NA
FDD000002	NA	NA	NA
FDD000003	NA	61	NA
FDD000004	11.97	36	0.3325000
FDD000005	6.82	91	0.0749451
FDD000006	NA	NA	NA
FDD000007	1.17	62	0.0188710
FDD000008	11.25	61	0.1844262
FDD000009	NA	NA	NA
FDD000010	19.69	54	0.3646296

The variable takes meaningful values only where both delivery_distance and delivery_time are non-missing. Records where either is NA will naturally produce an NA speed, which is the correct behaviour.

1.2.2 Categorical Variable: Delivery Performance

Delivery performance is classified into three tiers based on delivery time thresholds defined in the study brief.

food_clean <- food_clean %>%
  mutate(
    delivery_performance = case_when(
      delivery_time < 30                       ~ "Fast",
      delivery_time >= 30 & delivery_time < 60 ~ "Moderate",
      delivery_time >= 60                      ~ "Slow",
      TRUE                                     ~ NA_character_
    ),
    delivery_performance = factor(
      delivery_performance, levels = c("Fast", "Moderate", "Slow")
    )
  )

cat("delivery_performance distribution (including NAs):\n")

delivery_performance distribution (including NAs):

print(table(food_clean$delivery_performance, useNA = "ifany"))


    Fast Moderate     Slow     <NA> 
    2953     4437     8867     4743

The Slow category is the largest group with 8,867 orders, followed by Moderate at 4,437 and Fast at 2,953. The 4,743 missing values correspond to rows where delivery_time was set to NA during cleaning. The skew towards longer delivery times may point to genuine operational inefficiencies, though it is worth noting that this classification does not account for distance – a 60-minute delivery over 20 km is quite different from a 60-minute delivery over 2 km.

1.2.3 Group Summaries

Mean delivery time and mean order value are computed for each combination of restaurant type and traffic level.

summary_rt_tl <- food_clean %>%
  filter(!is.na(restaurant_type), !is.na(traffic_level)) %>%
  group_by(restaurant_type, traffic_level) %>%
  summarise(
    Mean_Delivery_Time = round(mean(delivery_time,  na.rm = TRUE), 2),
    Mean_Order_Value   = round(mean(order_value,    na.rm = TRUE), 2),
    N                  = n(),
    .groups = "drop"
  ) %>%
  arrange(restaurant_type, traffic_level)

kable(summary_rt_tl,
      caption = "Table 2: Mean delivery time and order value by restaurant type and traffic level") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Table 2: Mean delivery time and order value by restaurant type and traffic level
restaurant_type	traffic_level	Mean_Delivery_Time	Mean_Order_Value	N
Cafe	High	64.58	2734.09	2814
Cafe	Low	63.73	2733.84	2795
Cafe	Medium	65.27	2745.46	1378
Fast Food	High	65.05	2722.40	2767
Fast Food	Low	64.44	2696.76	2831
Fast Food	Medium	65.44	2716.04	1445
Fine Dining	High	65.53	2754.12	1361
Fine Dining	Low	66.34	2771.95	1406
Fine Dining	Medium	65.61	2849.61	734

The grouped means show very little variation across the nine combinations of restaurant type and traffic level. Delivery time sits consistently between 63.7 and 66.3 minutes regardless of the combination, and mean order value clusters tightly between LKR 2,700 and 2,850. This uniformity is striking – in a real operational setting, one would expect higher traffic levels and more complex restaurant types to push delivery times upward. The flat pattern is consistent with the wider data quality issues identified earlier.

summary_gs <- food_clean %>%
  filter(!is.na(customer_gender), !is.na(delivery_status)) %>%
  count(customer_gender, delivery_status, name = "Total_Orders")

kable(summary_gs,
      caption = "Table 3: Total orders by customer gender and delivery status") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 3: Total orders by customer gender and delivery status
customer_gender	delivery_status	Total_Orders
Female	Cancelled	1857
Female	Delivered	3595
Male	Cancelled	1691
Male	Delivered	3482

Males placed 3,482 delivered orders and 1,691 cancellations, while females placed 3,595 delivered orders and 1,857 cancellations. Delivery success rates are broadly comparable across genders, with no meaningful difference in cancellation rates. This suggests that customer gender is not a driver of delivery outcomes in this dataset.

1.3 Section C: Summarisation

1.3.1 Descriptive Statistics

Summary statistics for delivery_time and order_value are computed to characterise central tendency and spread.

summary_stats <- food_clean %>%
  summarise(
    across(
      c(delivery_time, order_value),
      list(
        Mean   = ~round(mean(.x,   na.rm = TRUE), 2),
        Median = ~round(median(.x, na.rm = TRUE), 2),
        SD     = ~round(sd(.x,     na.rm = TRUE), 2),
        IQR    = ~round(IQR(.x,    na.rm = TRUE), 2)
      ),
      .names = "{.col}__{.fn}"
    )
  ) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", "Statistic"),
               names_sep = "__") %>%
  pivot_wider(names_from = "Statistic", values_from = "value")

kable(summary_stats,
      caption = "Table 4: Summary statistics for delivery time and order value") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 4: Summary statistics for delivery time and order value
Variable	Mean	Median	SD	IQR
delivery_time	64.84	65	32.14	56
order_value	2744.98	2748	1297.13	2248

For delivery_time, the mean (64.84 min) and median (65 min) are almost identical, indicating a symmetric distribution with no meaningful skew. The standard deviation of 32.14 and IQR of 56 reflect moderate spread. For order_value, the mean (2,744.98 LKR) and median (2,748 LKR) are again very close, though the much larger standard deviation (1,297.13) and IQR (2,248) suggest that spending varies considerably across orders, possibly driven by a wide range of order types and sizes.

1.3.2 Skewness and Kurtosis

dt_vals <- na.omit(food_clean$delivery_time)

sk     <- round(skewness(dt_vals), 4)
ku_raw <- round(kurtosis(dt_vals), 4)
ku_ex  <- round(ku_raw - 3, 4)

cat("Skewness (delivery_time)   :", sk,     "\n")

Skewness (delivery_time)   : 0.0175

cat("Raw kurtosis               :", ku_raw, "\n")

Raw kurtosis               : 1.7994

cat("Excess kurtosis (kurt - 3) :", ku_ex,  "\n")

Excess kurtosis (kurt - 3) : -1.2006

A skewness of 0.0175 confirms that the distribution of delivery time is very nearly symmetric. The raw kurtosis of 1.7994 and corresponding excess kurtosis of -1.2006 indicate a platykurtic distribution – one with lighter tails and a flatter peak than a normal distribution. This is consistent with a broadly uniform spread of delivery times across the 10-120 minute range.

1.3.3 Frequency Table and Bar Chart

perf_freq <- food_clean %>%
  filter(!is.na(delivery_performance)) %>%
  count(delivery_performance, name = "Frequency") %>%
  mutate(Percentage = round(100 * Frequency / sum(Frequency), 2))

kable(perf_freq,
      caption = "Table 5: Frequency table for delivery performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 5: Frequency table for delivery performance
delivery_performance	Frequency	Percentage
Fast	2953	18.16
Moderate	4437	27.29
Slow	8867	54.54

ggplot(perf_freq,
       aes(x = delivery_performance, y = Frequency, fill = delivery_performance)) +
  geom_col(width = 0.55, colour = "white") +
  geom_text(
    aes(label = paste0(Frequency, "\n(", Percentage, "%)")),
    vjust = -0.3, size = 3.8, fontface = "bold"
  ) +
  scale_fill_manual(
    values = c("Fast" = "#27ae60", "Moderate" = "#f39c12", "Slow" = "#e74c3c")
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Distribution of delivery performance",
    subtitle = "Fast: < 30 min  |  Moderate: 30-59 min  |  Slow: >= 60 min",
    x        = "Performance category",
    y        = "Number of orders"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position    = "none",
    plot.title         = element_text(face = "bold"),
    panel.grid.major.x = element_blank()
  )

Figure 1: Distribution of delivery performance categories

The bar chart shows that over half of all deliveries (54.54%, n = 8,867) fell into the Slow category. Moderate deliveries account for about 27% of orders, and Fast deliveries are the least common at 18.16%. This distribution raises questions about operational efficiency, though the absence of a distance control limits how much can be concluded from timing alone.

1.4 Section D: Visualisation

1.4.1 Delivery Distance vs. Delivery Time by Traffic Level

food_clean %>%
  filter(!is.na(delivery_distance), !is.na(delivery_time), !is.na(traffic_level)) %>%
  ggplot(aes(x = delivery_distance, y = delivery_time, colour = traffic_level)) +
  geom_point(alpha = 0.25, size = 0.9) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  scale_colour_manual(
    values = c("Low" = "#27ae60", "Medium" = "#e67e22", "High" = "#c0392b")
  ) +
  labs(
    title    = "Delivery distance vs. delivery time",
    subtitle = "Coloured by traffic level, with linear trend lines per group",
    x        = "Delivery distance (km)",
    y        = "Delivery time (minutes)",
    colour   = "Traffic level"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Figure 2: Delivery distance vs. delivery time, coloured by traffic level

The scatter plot reveals that distance has almost no relationship with delivery time in this dataset. All three trend lines sit nearly flat at around 65-70 minutes regardless of distance, and the three traffic level lines overlap almost perfectly. In a real delivery operation, longer distances and heavier traffic would typically push delivery times upward; neither pattern is visible here. The very wide vertical scatter at each distance value also suggests that unobserved factors – such as order preparation time or rider routing decisions – are far more influential than the variables captured here.

1.4.2 Order Value by Restaurant Type

food_clean %>%
  filter(!is.na(order_value), !is.na(restaurant_type)) %>%
  ggplot(aes(x = restaurant_type, y = order_value, fill = restaurant_type)) +
  geom_boxplot(outlier.colour = "red", outlier.alpha = 0.35,
               outlier.size = 1.2, width = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Order value distribution by restaurant type",
    x     = "Restaurant type",
    y     = "Order value (LKR)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    plot.title      = element_text(face = "bold")
  )

Figure 3: Order value distribution by restaurant type

All three restaurant types show strikingly similar distributions. Medians sit close together in the LKR 2,750-2,900 range, and the interquartile ranges are comparable across categories. One would intuitively expect Fine Dining to produce meaningfully higher order values than Cafe or Fast Food, but that distinction does not emerge here. This is most likely a consequence of the order values being generated within a fixed LKR 500-5,000 range, which compresses the inter-group differences regardless of restaurant type.

1.4.3 Distribution of Delivery Time

mean_dt   <- mean(food_clean$delivery_time,   na.rm = TRUE)
median_dt <- median(food_clean$delivery_time, na.rm = TRUE)

food_clean %>%
  filter(!is.na(delivery_time)) %>%
  ggplot(aes(x = delivery_time)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#3498db", colour = "white", alpha = 0.80) +
  geom_density(colour = "#e74c3c", linewidth = 1.1) +
  geom_vline(xintercept = mean_dt,   linetype = "dashed",
             colour = "#2c3e50", linewidth = 0.9) +
  geom_vline(xintercept = median_dt, linetype = "dotted",
             colour = "#8e44ad", linewidth = 0.9) +
  annotate("text", x = mean_dt   + 5, y = 0.022,
           label = paste0("Mean = ", round(mean_dt, 1)),
           colour = "#2c3e50", size = 3.5) +
  annotate("text", x = median_dt - 5, y = 0.020,
           label = paste0("Median = ", round(median_dt, 1)),
           colour = "#8e44ad", size = 3.5) +
  labs(
    title    = "Distribution of delivery time",
    subtitle = "Histogram with kernel density overlay",
    x        = "Delivery time (minutes)",
    y        = "Density"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Figure 4: Histogram of delivery time distribution

The histogram displays a remarkably flat, near-uniform distribution of delivery times across the 10-120 minute range. Bar heights are fairly consistent throughout, with no obvious cluster around a typical delivery duration. The mean and median sit almost on top of each other at around 65 minutes, reinforcing the symmetric, flat shape identified by the skewness and kurtosis measures. Real-world delivery data would normally peak around a common duration, making this pattern more characteristic of synthetically generated data.

1.4.4 Delivery Performance by Weather Condition

food_clean %>%
  filter(!is.na(delivery_performance), !is.na(weather)) %>%
  count(weather, delivery_performance) %>%
  group_by(weather) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = weather, y = n, fill = delivery_performance)) +
  geom_col(position = "fill", colour = "white", width = 0.60) +
  geom_text(
    aes(label = paste0(round(pct * 100, 1), "%")),
    position = position_fill(vjust = 0.5),
    size = 3.5, colour = "white", fontface = "bold"
  ) +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(
    values = c("Fast" = "#27ae60", "Moderate" = "#f39c12", "Slow" = "#e74c3c")
  ) +
  labs(
    title    = "Delivery performance by weather condition",
    subtitle = "Proportion of orders in each performance category",
    x        = "Weather condition",
    y        = "Proportion of orders",
    fill     = "Delivery performance"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Figure 5: Delivery performance by weather condition

Weather conditions have very little visible effect on delivery performance. The proportions of Fast, Moderate, and Slow deliveries are nearly identical across Cloudy, Rainy, and Sunny conditions. Rainy conditions show a slightly higher Slow rate (approximately 56%) compared to Sunny (approximately 54%), which is intuitively plausible, but the margin is too small to draw any firm conclusion. If weather were a genuine driver of performance, a more pronounced gap – particularly between Rainy and Sunny – would be expected.

1.5 Section E: Reporting and Export

1.5.1 Exporting the Cleaned Dataset

write_csv(food_clean, "food_delivery_cleaned.csv")

cat("File saved: food_delivery_cleaned.csv\n")

File saved: food_delivery_cleaned.csv

cat("  Rows   :", nrow(food_clean), "\n")

  Rows   : 21000

cat("  Columns:", ncol(food_clean), "\n")

  Columns: 15

The cleaned and transformed dataset has been saved as food_delivery_cleaned.csv. The file retains all 21,000 original rows and contains 15 columns – the original 13 variables plus the two newly derived fields, delivery_speed and delivery_performance.

1.5.2 Key Results (Inline R)

The mean delivery time across all valid orders is 64.84 minutes.

The total number of orders in each delivery performance category is as follows:

Category	Total orders
Fast	2953
Moderate	4437
Slow	8867

2 Problem 2: Socio-Economic Household Study

2.1 Section A: Data Quality and Exploration

2.1.1 Importing and Summarising the Dataset

hh_raw <- read_csv("households_data.csv", show_col_types = FALSE)
head(hh_raw)

household_id	region	household_size	monthly_income	monthly_expenditure	employment_status	education_level	housing_type	access_to_internet	health_expenditure	savings	debt
UHID000001	urbn	1	385000	120000	Self-employed	Tertiary	Owned	yes	46000	155000	470000
UHID000002	rural	20	60000	60000	Employed	primary	Shared	yes	4000	135000	420000
UHID000003	Suburban	NA	85000	355000	Unemployed	primary	Owned	yes	1000	65000	660000
UHID000004	rural	0	150000	170000	emplyd	Tertiary	Shared	TRUE	20000	20000	700000
UHID000005	rural	0	110000	15000	emplyd	primary	Shared	no	14000	15000	800000
UHID000006	Rural	2	295000	30000	emplyd	Primary	Owned	yes	2000	100000	290000

The first six rows give an immediate sense of the dataset’s structure. Variables cover household demographics (region, household_size), economic indicators (monthly_income, monthly_expenditure, savings, debt), and socio-economic characteristics (employment_status, education_level, housing_type, access_to_internet). Some missing values are already visible in the preview.

summary(hh_raw)

 household_id          region          household_size  monthly_income    
 Length:33000       Length:33000       Min.   : 0.00   Min.   :   -5000  
 Class :character   Class :character   1st Qu.: 3.00   1st Qu.:  130000  
 Mode  :character   Mode  :character   Median : 6.00   Median :  255000  
                                       Mean   : 6.32   Mean   :  358695  
                                       3rd Qu.: 9.00   3rd Qu.:  385000  
                                       Max.   :20.00   Max.   :10000000  
                                       NA's   :2585    NA's   :355       
 monthly_expenditure employment_status  education_level    housing_type      
 Min.   :-10000      Length:33000       Length:33000       Length:33000      
 1st Qu.:100000      Class :character   Class :character   Class :character  
 Median :200000      Mode  :character   Mode  :character   Mode  :character  
 Mean   :198716                                                              
 3rd Qu.:300000                                                              
 Max.   :400000                                                              
 NA's   :391                                                                 
 access_to_internet health_expenditure    savings            debt         
 Length:33000       Min.   :   1000    Min.   :-50000   Min.   :       0  
 Class :character   1st Qu.:  13000    1st Qu.: 45000   1st Qu.:  250000  
 Mode  :character   Median :  26000    Median : 95000   Median :  510000  
                    Mean   :  44116    Mean   : 95939   Mean   :  601793  
                    3rd Qu.:  39000    3rd Qu.:150000   3rd Qu.:  760000  
                    Max.   :1000000    Max.   :200000   Max.   :10000000  
                    NA's   :622        NA's   :725      NA's   :303

The summary reveals a number of data quality problems. Negative values appear in monthly_income, monthly_expenditure, and savings, all of which should be non-negative by definition. Extreme maximum values – for example, income and debt reaching 10,000,000 – are well beyond plausible household figures and suggest either data entry errors or outliers introduced during dataset generation. Missing values are widespread, particularly in employment_status, education_level, and access_to_internet.

cat("Missing value counts (raw data):\n")

Missing value counts (raw data):

print(colSums(is.na(hh_raw)))

       household_id              region      household_size      monthly_income 
                  0                5452                2585                 355 
monthly_expenditure   employment_status     education_level        housing_type 
                391                6676                6628                   0 
 access_to_internet  health_expenditure             savings                debt 
               6586                 622                 725                 303

hh_raw2 <- hh_raw %>%
  mutate(across(where(is.character), ~na_if(str_trim(.), "")))

cat("Missing value counts after converting blank strings to NA:\n")

Missing value counts after converting blank strings to NA:

print(colSums(is.na(hh_raw2)))

       household_id              region      household_size      monthly_income 
                  0                5452                2585                 355 
monthly_expenditure   employment_status     education_level        housing_type 
                391                6676                6628                   0 
 access_to_internet  health_expenditure             savings                debt 
               6586                 622                 725                 303

After converting blank strings to NA, the full extent of missingness becomes clearer. employment_status (6,676), education_level (6,628), and access_to_internet (6,586) are the most affected, followed by region (5,452) and household_size (2,585). Financial variables carry lower rates of missingness but are not immune. The pattern suggests that demographic and categorical fields were most susceptible to incomplete data collection in this survey.

2.1.2 Correcting Impossible Numeric Values

Duplicate records are removed first, then numeric variables are validated against their expected ranges.

hh_clean <- hh_raw2 %>%
  distinct(household_id, .keep_all = TRUE)

cat("Rows after removing duplicates:", nrow(hh_clean),
    "| Duplicates removed:", nrow(hh_raw2) - nrow(hh_clean), "\n\n")

Rows after removing duplicates: 30000 | Duplicates removed: 3000

hh_clean <- hh_clean %>%
  mutate(
    household_size = if_else(
      household_size >= 1 & household_size <= 10,
      household_size, NA_real_),

    monthly_income = if_else(
      monthly_income >= 10000 & monthly_income <= 500000,
      monthly_income, NA_real_),

    monthly_expenditure = if_else(
      monthly_expenditure >= 5000 & monthly_expenditure <= 400000,
      monthly_expenditure, NA_real_),

    health_expenditure = if_else(
      health_expenditure >= 0 & health_expenditure <= 500000,
      health_expenditure, NA_real_),

    savings = if_else(savings <= 200000, savings, NA_real_),

    debt = if_else(debt >= 0 & debt <= 5000000, debt, NA_real_)
  )

cat("NA counts after range validation:\n")

NA counts after range validation:

print(colSums(is.na(hh_clean)))

       household_id              region      household_size      monthly_income 
                  0                4970                6956                 934 
monthly_expenditure   employment_status     education_level        housing_type 
                718                6080                6016                   0 
 access_to_internet  health_expenditure             savings                debt 
               5988                1127                 656                 596

Removing duplicates reduced the dataset from 33,000 to 30,000 rows, with 3,000 duplicate household_id entries eliminated. Range validation then converted out-of-bound values to NA for all six numeric variables. The increases in missing counts for monthly_income and monthly_expenditure confirm that a portion of the raw entries were implausible rather than simply blank.

2.1.3 Correcting Inconsistent Text Entries

hh_clean <- hh_clean %>%

  mutate(
    region = case_when(
      str_to_lower(region) %in% c("urban",    "urbn") ~ "Urban",
      str_to_lower(region) %in% c("rural")            ~ "Rural",
      str_to_lower(region) %in% c("suburban")         ~ "Suburban",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    employment_status = case_when(
      str_to_lower(employment_status) %in%
        c("employed", "emplyd", "employ")                            ~ "Employed",
      str_to_lower(employment_status) %in%
        c("unemployed", "unemp")                                     ~ "Unemployed",
      str_to_lower(employment_status) %in%
        c("self-employed", "selfemployed", "self employed",
          "self_employed")                                            ~ "Self-Employed",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    education_level = case_when(
      str_to_lower(education_level) %in% c("primary")   ~ "Primary",
      str_to_lower(education_level) %in% c("secondary") ~ "Secondary",
      str_to_lower(education_level) %in% c("tertiary")  ~ "Tertiary",
      TRUE ~ NA_character_
    ),
    education_level = factor(
      education_level, levels = c("Primary", "Secondary", "Tertiary")
    )
  ) %>%

  mutate(
    access_to_internet = case_when(
      str_to_lower(as.character(access_to_internet)) %in%
        c("true", "yes", "1", "y")  ~ TRUE,
      str_to_lower(as.character(access_to_internet)) %in%
        c("false", "no", "0", "n")  ~ FALSE,
      TRUE ~ NA
    )
  )

cat("--- region ---\n");             print(table(hh_clean$region,             useNA = "ifany"))

--- region ---


   Rural Suburban    Urban     <NA> 
   10120     4873    10037     4970

cat("--- employment_status ---\n");  print(table(hh_clean$employment_status,  useNA = "ifany"))

--- employment_status ---


     Employed Self-Employed    Unemployed          <NA> 
        11932          5965          6023          6080

cat("--- education_level ---\n");    print(table(hh_clean$education_level,    useNA = "ifany"))

--- education_level ---


  Primary Secondary  Tertiary      <NA> 
    11962      6041      5981      6016

cat("--- access_to_internet ---\n"); print(table(hh_clean$access_to_internet, useNA = "ifany"))

--- access_to_internet ---


FALSE  TRUE  <NA> 
12103 11909  5988

After standardisation, the region variable shows Urban (10,037) and Rural (10,120) households as nearly equal in size, with Suburban (4,873) accounting for a smaller share. Employment status breaks into Employed (11,932), Unemployed (6,023), and Self-Employed (5,965). Education levels are distributed across Primary (11,962), Secondary (6,041), and Tertiary (5,981). Internet access is split fairly evenly, with roughly 12,103 households without access and 11,909 with access. Residual missing values across all four variables reflect entries that did not match any recognised label and were therefore set to NA.

2.2 Section B: Data Manipulation and Wrangling

2.2.1 Savings Rate

hh_clean <- hh_clean %>%
  mutate(savings_rate = savings / monthly_income)

summary(hh_clean$savings_rate)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-5.0000  0.1739  0.3778  0.8138  0.7432 20.0000    1569

The savings_rate variable captures the proportion of monthly income that a household saves. The summary shows a range that extends into negative territory, which can occur when a household’s savings figure is negative – reflecting net dissaving rather than accumulation. These values are retained for now and handled through filtering in downstream analyses.

2.2.2 Income Group Classification

hh_clean <- hh_clean %>%
  mutate(
    income_group = case_when(
      monthly_income >= 200000                           ~ "High Income",
      monthly_income >= 75000 & monthly_income < 200000 ~ "Middle Income",
      monthly_income <  75000                            ~ "Low Income",
      TRUE                                               ~ NA_character_
    ),
    income_group = factor(
      income_group, levels = c("Low Income", "Middle Income", "High Income")
    )
  )

cat("Income group distribution:\n")

Income group distribution:

print(table(hh_clean$income_group, useNA = "ifany"))


   Low Income Middle Income   High Income          <NA> 
         3836          7359         17871           934

Households are classified into three income tiers based on monthly income thresholds specified in the study brief. The distribution across groups will be examined further in the visualisation section.

2.2.3 Group Summary Statistics

inc_exp_tbl <- hh_clean %>%
  filter(!is.na(region), !is.na(employment_status)) %>%
  group_by(region, employment_status) %>%
  summarise(
    Mean_Income      = round(mean(monthly_income,      na.rm = TRUE), 0),
    Mean_Expenditure = round(mean(monthly_expenditure, na.rm = TRUE), 0),
    N                = n(),
    .groups = "drop"
  ) %>%
  arrange(region, employment_status)

kable(inc_exp_tbl,
      caption = "Table 6: Mean monthly income and expenditure by region and employment status",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Table 6: Mean monthly income and expenditure by region and employment status
region	employment_status	Mean_Income	Mean_Expenditure	N
Rural	Employed	255,176	201,519	4,010
Rural	Self-Employed	260,212	202,537	2,000
Rural	Unemployed	254,376	205,269	2,046
Suburban	Employed	250,959	200,999	1,922
Suburban	Self-Employed	260,216	205,063	980
Suburban	Unemployed	264,645	205,788	967
Urban	Employed	256,436	197,032	3,963
Urban	Self-Employed	251,358	202,598	1,989
Urban	Unemployed	253,623	200,803	2,049

sav_debt_tbl <- hh_clean %>%
  filter(!is.na(education_level)) %>%
  group_by(education_level) %>%
  summarise(
    Median_Savings = round(median(savings, na.rm = TRUE), 0),
    Median_Debt    = round(median(debt,    na.rm = TRUE), 0),
    N              = n(),
    .groups = "drop"
  )

kable(sav_debt_tbl,
      caption = "Table 7: Median savings and debt by education level",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 7: Median savings and debt by education level
education_level	Median_Savings	Median_Debt	N
Primary	95,000	500,000	11,962
Secondary	100,000	510,000	6,041
Tertiary	100,000	510,000	5,981

Mean income and expenditure show very little variation across regions or employment categories, which mirrors the pattern seen in the food delivery data. Median savings and debt are similarly uniform across education levels, with all three groups sitting at comparable figures. In real-world data, tertiary-educated households typically show higher savings and lower debt burdens – the absence of this gradient here reinforces the synthetic nature of the dataset.

2.3 Section C: Probability Distribution and Simulation

2.3.1 Simulating Random Data

Two variables are simulated to illustrate the use of probability distributions in R.

set.seed(42)

# Normal distribution: simulated household income
household_income_sim <- rnorm(n = 1000, mean = 150000, sd = 50000)

# Poisson distribution: simulated number of dependents
number_of_dependents <- rpois(n = 1000, lambda = 2)

cat("First 10 simulated income values:\n",  round(head(household_income_sim, 10)), "\n\n")

First 10 simulated income values:
 218548 121765 168156 181643 170213 144694 225576 145267 250921 146864

cat("First 10 simulated dependent counts:\n", head(number_of_dependents, 10), "\n")

First 10 simulated dependent counts:
 6 2 3 4 3 3 2 3 1 1

Household income is simulated from a Normal distribution with mean LKR 150,000 and standard deviation LKR 50,000. This reflects the assumption that most household incomes cluster around a central value, with fewer households at the extremes. Number of dependents is simulated from a Poisson distribution with lambda = 2, which is appropriate for modelling count data where most households have a small number of dependents and larger counts become progressively rarer.

2.3.2 Comparing Sample and Theoretical Moments

norm_cmp <- tibble(
  Statistic   = c("Mean", "Variance", "Std deviation"),
  Theoretical = c(150000, 50000^2, 50000),
  Sample      = c(
    round(mean(household_income_sim), 2),
    round(var(household_income_sim),  2),
    round(sd(household_income_sim),   2)
  )
)

kable(norm_cmp,
      caption = "Table 8: Normal distribution -- theoretical vs. sample moments",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 8: Normal distribution – theoretical vs. sample moments
Statistic	Theoretical	Sample
Mean	1.5e+05	1.487088e+05
Variance	2.5e+09	2.512622e+09
Std deviation	5.0e+04	5.012606e+04

pois_cmp <- tibble(
  Statistic   = c("Mean (lambda)", "Variance (lambda)"),
  Theoretical = c(2, 2),
  Sample      = c(
    round(mean(number_of_dependents), 4),
    round(var(number_of_dependents),  4)
  )
)

kable(pois_cmp,
      caption = "Table 9: Poisson distribution -- theoretical vs. sample moments") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 9: Poisson distribution – theoretical vs. sample moments
Statistic	Theoretical	Sample
Mean (lambda)	2	2.0100
Variance (lambda)	2	1.9438

For the Normal distribution, the sample mean and standard deviation are very close to the theoretical parameters of 150,000 and 50,000 respectively, with small deviations attributable to random sampling variation at n = 1,000. For the Poisson distribution, both the sample mean and variance are close to the theoretical value of lambda = 2, which is expected given that the mean and variance are equal in a Poisson process. Minor discrepancies are normal and would narrow with a larger sample size.

2.3.3 Histograms and Density Plots

tibble(income = household_income_sim) %>%
  ggplot(aes(x = income)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#2980b9", colour = "white", alpha = 0.75) +
  geom_density(colour = "#e74c3c", linewidth = 1.2) +
  stat_function(
    fun  = dnorm,
    args = list(mean = 150000, sd = 50000),
    colour = "#27ae60", linewidth = 1.1, linetype = "dashed"
  ) +
  scale_x_continuous(labels = comma) +
  labs(
    title    = "Simulated household income -- Normal(150,000; 50,000)",
    subtitle = "Blue: histogram  |  Red: empirical density  |  Green dashed: theoretical curve",
    x        = "Simulated income (LKR)",
    y        = "Density"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Figure 6: Simulated household income – Normal distribution

The histogram closely follows the expected bell-shaped curve. The empirical density (red) and the theoretical Normal curve (green dashed) align well throughout the distribution, with the slight deviations at the tails being entirely typical of a finite sample. The simulation is behaving as intended.

pois_pmf <- tibble(
  x    = 0:max(number_of_dependents),
  prob = dpois(0:max(number_of_dependents), lambda = 2)
)

tibble(dependents = number_of_dependents) %>%
  ggplot(aes(x = dependents)) +
  geom_bar(aes(y = after_stat(prop)), fill = "#8e44ad",
           colour = "white", alpha = 0.80, width = 0.70) +
  geom_point(data = pois_pmf, aes(x = x, y = prob),
             colour = "#27ae60", size = 3.5) +
  geom_line(data  = pois_pmf, aes(x = x, y = prob),
            colour = "#27ae60", linewidth = 1.0, linetype = "dashed") +
  labs(
    title    = "Simulated number of dependents -- Poisson(lambda = 2)",
    subtitle = "Purple bars: sample proportions  |  Green: theoretical PMF",
    x        = "Number of dependents",
    y        = "Probability"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Figure 7: Simulated number of dependents – Poisson distribution

The bar chart peaks at 2 dependents and tails off steadily at higher counts, matching the shape of a Poisson distribution with lambda = 2. The sample proportions (purple bars) align closely with the theoretical probability mass function (green), confirming that the simulation is a good empirical approximation of the specified distribution at n = 1,000.

2.4 Section D: Visualisation with ggplot2

2.4.1 Monthly Income vs. Monthly Expenditure by Region

set.seed(1)

hh_clean %>%
  filter(!is.na(monthly_income), !is.na(monthly_expenditure), !is.na(region)) %>%
  sample_n(5000) %>%
  ggplot(aes(x = monthly_income, y = monthly_expenditure, colour = region)) +
  geom_point(alpha = 0.30, size = 1.2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title    = "Monthly income vs. monthly expenditure",
    subtitle = "Coloured by region -- random sample of 5,000 households",
    x        = "Monthly income (LKR)",
    y        = "Monthly expenditure (LKR)",
    colour   = "Region"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Figure 8: Monthly income vs. monthly expenditure, coloured by region

The scatter plot shows virtually no relationship between income and expenditure. The three regional trend lines are nearly flat and lie almost on top of each other around LKR 200,000, meaning expenditure does not change with income level regardless of where a household is located. In real survey data, a positive relationship between income and expenditure is almost universal; its absence here is a direct consequence of the variables being generated independently within fixed ranges.

2.4.2 Monthly Savings by Education Level

hh_clean %>%
  filter(!is.na(savings), !is.na(education_level)) %>%
  ggplot(aes(x = education_level, y = savings, fill = education_level)) +
  geom_boxplot(outlier.colour = "red", outlier.alpha = 0.30,
               outlier.size = 1.2, width = 0.50) +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(
    title = "Monthly savings by education level",
    x     = "Education level",
    y     = "Monthly savings (LKR)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    plot.title      = element_text(face = "bold")
  )

Figure 9: Monthly savings distribution by education level

The three education groups display near-identical distributions, with medians sitting around LKR 95,000-100,000 and comparable spreads. Negative values appear across all groups, indicating households that are spending beyond their income. One would typically expect Tertiary-educated households to show noticeably higher savings given better employment prospects and earning capacity, but this pattern does not appear – again consistent with the synthetic data generation process.

2.4.3 Average Monthly Income by Employment Status, Faceted by Region

hh_clean %>%
  filter(!is.na(monthly_income), !is.na(employment_status), !is.na(region)) %>%
  group_by(region, employment_status) %>%
  summarise(avg_income = mean(monthly_income, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = employment_status, y = avg_income, fill = employment_status)) +
  geom_col(width = 0.60, colour = "white") +
  geom_text(aes(label = comma(round(avg_income, 0))),
            vjust = -0.40, size = 2.7, fontface = "bold") +
  facet_wrap(~region, ncol = 3) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.18))) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Average monthly income by employment status",
    subtitle = "Faceted by region",
    x        = "Employment status",
    y        = "Average monthly income (LKR)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    axis.text.x     = element_text(angle = 35, hjust = 1),
    strip.text      = element_text(face = "bold", size = 11),
    plot.title      = element_text(face = "bold")
  )

Figure 10: Average monthly income by employment status, faceted by region

Average monthly income is almost identical across every combination of region and employment status, ranging between roughly LKR 251,000 and LKR 265,000. Employed and Self-Employed households would ordinarily earn more than Unemployed households, and Urban areas would typically outpace Rural ones. Neither pattern is visible here, which reflects the income variable being generated within a fixed range without regard to employment or location.

2.4.4 Correlation Heatmap

num_hh <- hh_clean %>%
  select(household_size, monthly_income, monthly_expenditure,
         health_expenditure, savings, debt, savings_rate) %>%
  drop_na()

cor_mat  <- cor(num_hh)
cor_melt <- melt(cor_mat)

ggplot(cor_melt, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = round(value, 2)), size = 3.5, fontface = "bold") +
  scale_fill_gradient2(
    low      = "#c0392b",
    mid      = "white",
    high     = "#2980b9",
    midpoint = 0,
    limits   = c(-1, 1),
    name     = "Pearson r"
  ) +
  labs(
    title = "Correlation heatmap -- socio-economic household numeric variables",
    x = NULL, y = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x  = element_text(angle = 45, hjust = 1),
    plot.title   = element_text(face = "bold"),
    legend.position = "right"
  )

$Figure 11: Correlation heatmap of numeric household variables$

Figure 11: Correlation heatmap of numeric household variables

Almost every pair of numeric variables shows a correlation close to zero, confirming that the variables were generated independently. The one notable exception is the moderate negative correlation between monthly_income and savings_rate (-0.48), which is somewhat counterintuitive – higher-income households would generally be expected to save a larger share of income. This negative relationship likely emerges from the way savings and income were independently bounded during data generation rather than reflecting a genuine economic dynamic.

2.5 Section E: Reproducible Reporting and Export

2.5.1 Reproducibility Notes

All code in this report is fully reproducible. A global chunk options block at the top of the document (knitr::opts_chunk$set) ensures consistent behaviour across all chunks – warnings and messages are suppressed to keep the output clean, figures are centred, and echo is enabled so the code is always visible alongside its output. The simulation in Section C uses set.seed(42) to guarantee that identical results are obtained on every knit. The tidyverse and supporting packages are loaded once at the start of Problem 1 and remain available throughout. All inline R expressions update automatically when the underlying data changes, so key reported figures – such as the mean delivery time and household count – are always consistent with the actual computed values.

2.5.2 Exporting the Cleaned Dataset

write_csv(hh_clean, "socio_economic_households_cleaned.csv")

cat("File saved: socio_economic_households_cleaned.csv\n")

File saved: socio_economic_households_cleaned.csv

cat("  Rows   :", nrow(hh_clean), "\n")

  Rows   : 30000

cat("  Columns:", ncol(hh_clean), "\n")

  Columns: 14

The cleaned household dataset has been saved as socio_economic_households_cleaned.csv. The file contains 30000 rows and 14 columns – the original 12 variables plus the two newly derived fields, savings_rate and income_group.

The mean monthly income across valid household records is 255,075 LKR. The distribution of households across income groups is shown below:

Income group	Households
Low Income	3836
Middle Income	7359
High Income	17871

Data Cleaning, Analysis & Visualisation Using R

May 2026