1 Problem 1: Food Delivery Performance Study

1.1 Section A: Data Preparation

1.1.1 Importing and Inspecting the Dataset

All required libraries are loaded first. The tidyverse package handles data manipulation and visualisation, moments provides skewness and kurtosis functions, scales assists with axis formatting, and kableExtra produces styled tables in the report.

library(tidyverse)
library(moments)
library(scales)
library(knitr)
library(kableExtra)
library(reshape2)

The dataset is read from disk and inspected using str(), glimpse(), and summary() to understand its structure, variable types, and initial data quality.

food_raw <- read_csv("food_delivery_data.csv", show_col_types = FALSE)
str(food_raw)
spc_tbl_ [21,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ order_id         : chr [1:21000] "FDD000001" "FDD000002" "FDD000003" "FDD000004" ...
 $ customer_age     : num [1:21000] 60 17 43 49 53 32 34 57 NA 37 ...
 $ customer_gender  : chr [1:21000] "Mle" "Mle" NA "Male" ...
 $ restaurant_type  : chr [1:21000] "cafe" "cafe" "Fast Food" "cafe" ...
 $ order_value      : num [1:21000] 3116 3763 816 884 3854 ...
 $ delivery_distance: num [1:21000] -3.51 21.58 20.57 11.97 6.82 ...
 $ delivery_time    : num [1:21000] 52 134 61 36 91 132 62 61 145 54 ...
 $ traffic_level    : chr [1:21000] "low" "low" "High" "High" ...
 $ weather          : chr [1:21000] "Rainy" "Cloudy" "Sunny" "Cloudy" ...
 $ payment_method   : chr [1:21000] "Online" "Card" "cash" NA ...
 $ rating           : num [1:21000] 4 10 3 10 10 -1 10 3 4 -1 ...
 $ delivery_status  : chr [1:21000] "delivered" "Delivered" "Delivered" NA ...
 $ rider_experience : num [1:21000] 30 0 -3 0 15 30 0 13 8 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   order_id = col_character(),
  ..   customer_age = col_double(),
  ..   customer_gender = col_character(),
  ..   restaurant_type = col_character(),
  ..   order_value = col_double(),
  ..   delivery_distance = col_double(),
  ..   delivery_time = col_double(),
  ..   traffic_level = col_character(),
  ..   weather = col_character(),
  ..   payment_method = col_character(),
  ..   rating = col_double(),
  ..   delivery_status = col_character(),
  ..   rider_experience = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
glimpse(food_raw)
Rows: 21,000
Columns: 13
$ order_id          <chr> "FDD000001", "FDD000002", "FDD000003", "FDD000004", …
$ customer_age      <dbl> 60, 17, 43, 49, 53, 32, 34, 57, NA, 37, 48, 47, 64, …
$ customer_gender   <chr> "Mle", "Mle", NA, "Male", NA, "Mle", "Male", "Male",…
$ restaurant_type   <chr> "cafe", "cafe", "Fast Food", "cafe", "Cafe", "cafe",…
$ order_value       <dbl> 3116, 3763, 816, 884, 3854, 2082, 1480, 2085, 2600, …
$ delivery_distance <dbl> -3.51, 21.58, 20.57, 11.97, 6.82, 20.51, 1.17, 11.25…
$ delivery_time     <dbl> 52, 134, 61, 36, 91, 132, 62, 61, 145, 54, 139, 25, …
$ traffic_level     <chr> "low", "low", "High", "High", "low", "Hgh", "Hgh", "…
$ weather           <chr> "Rainy", "Cloudy", "Sunny", "Cloudy", "Rainy", "Clou…
$ payment_method    <chr> "Online", "Card", "cash", NA, "Card", "Card", NA, "C…
$ rating            <dbl> 4, 10, 3, 10, 10, -1, 10, 3, 4, -1, 3, 3, NA, 4, 5, …
$ delivery_status   <chr> "delivered", "Delivered", "Delivered", NA, "delivere…
$ rider_experience  <dbl> 30, 0, -3, 0, 15, 30, 0, 13, 8, 1, 1, 9, 0, 8, 4, -3…

The dataset contains 21,000 observations across 13 variables, covering a mix of categorical identifiers – such as restaurant_type, traffic_level, and payment_method – and numeric measures including order_value, delivery_distance, delivery_time, and rating. The presence of character-type columns alongside numeric ones signals that careful type checking and text standardisation will be required before analysis.

summary(food_raw)
   order_id          customer_age    customer_gender    restaurant_type   
 Length:21000       Min.   : -5.00   Length:21000       Length:21000      
 Class :character   1st Qu.: 28.00   Class :character   Class :character  
 Mode  :character   Median : 42.00   Mode  :character   Mode  :character  
                    Mean   : 42.35                                        
                    3rd Qu.: 56.00                                        
                    Max.   :120.00                                        
                    NA's   :368                                           
  order_value    delivery_distance delivery_time    traffic_level     
 Min.   : -100   Min.   :-5.00     Min.   :-20.00   Length:21000      
 1st Qu.: 1618   1st Qu.: 2.47     1st Qu.: 44.00   Class :character  
 Median : 2750   Median :10.05     Median : 79.00   Mode  :character  
 Mean   : 2777   Mean   :10.03     Mean   : 81.87                     
 3rd Qu.: 3868   3rd Qu.:17.56     3rd Qu.:116.00                     
 Max.   :99999   Max.   :25.00     Max.   :500.00                     
 NA's   :6                         NA's   :139                        
   weather          payment_method         rating       delivery_status   
 Length:21000       Length:21000       Min.   :-1.000   Length:21000      
 Class :character   Class :character   1st Qu.: 1.000   Class :character  
 Mode  :character   Mode  :character   Median : 3.000   Mode  :character  
                                       Mean   : 3.438                     
                                       3rd Qu.: 5.000                     
                                       Max.   :10.000                     
                                       NA's   :2607                       
 rider_experience
 Min.   :-3.000  
 1st Qu.: 3.000  
 Median : 7.000  
 Mean   : 8.152  
 3rd Qu.:12.000  
 Max.   :30.000  
 NA's   :1110    

The summary confirms several data quality concerns. Customer age has a mean of roughly 42 years, which is plausible, but the range extends into negative territory, indicating invalid entries. Delivery time averages around 81.87 minutes with a maximum of 500 minutes – clearly unrealistic for a food delivery context. Rating shows 2,607 missing values, which is notable since it directly reflects customer satisfaction. Rider experience averages 8.15 years, though some values are negative. These issues are addressed systematically in the cleaning steps below.


1.1.2 Missing Value Analysis

Before correcting any values, blank strings in character columns are first converted to NA so that they are treated consistently alongside other missing entries.

food_raw2 <- food_raw %>%
  mutate(across(where(is.character), ~na_if(str_trim(.), "")))

missing_tbl <- food_raw2 %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Missing_Count") %>%
  mutate(Missing_Pct = round(100 * Missing_Count / nrow(food_raw2), 2)) %>%
  arrange(desc(Missing_Count))

kable(missing_tbl,
      caption = "Table 1: Missing values per variable (blank strings treated as NA)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 1: Missing values per variable (blank strings treated as NA)
Variable Missing_Count Missing_Pct
customer_gender 6873 32.73
delivery_status 5272 25.10
weather 5267 25.08
payment_method 4251 20.24
traffic_level 3469 16.52
rating 2607 12.41
rider_experience 1110 5.29
customer_age 368 1.75
delivery_time 139 0.66
order_value 6 0.03
order_id 0 0.00
restaurant_type 0 0.00
delivery_distance 0 0.00

Missing data is spread across several variables. The most affected are rating and rider_experience, which together represent the two clearest quality indicators in the dataset. Variables like customer_gender, delivery_status, and traffic_level also show non-trivial rates of missingness, suggesting inconsistent data collection at the point of entry. These patterns are important to keep in mind during analysis, as they can introduce non-response bias if not handled carefully.


1.1.3 Correcting Data Issues

1.1.3.1 Inconsistent Text Entries

All categorical variables contain spelling variants, inconsistent capitalisation, and abbreviated entries. Each variable is standardised to a fixed set of valid labels, with unrecognisable entries mapped to NA.

food_clean <- food_raw2 %>%

  mutate(
    customer_gender = case_when(
      str_to_lower(customer_gender) %in% c("male",   "mle")   ~ "Male",
      str_to_lower(customer_gender) %in% c("female", "fmale") ~ "Female",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    restaurant_type = case_when(
      str_to_lower(restaurant_type) %in% c("cafe")                     ~ "Cafe",
      str_to_lower(restaurant_type) %in% c("fast food", "fastfood")    ~ "Fast Food",
      str_to_lower(restaurant_type) %in% c("fine dining", "finedining") ~ "Fine Dining",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    traffic_level = case_when(
      str_to_lower(traffic_level) %in% c("low")           ~ "Low",
      str_to_lower(traffic_level) %in% c("medium", "med") ~ "Medium",
      str_to_lower(traffic_level) %in% c("high", "hgh")   ~ "High",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(weather = str_to_title(weather)) %>%

  mutate(
    payment_method = case_when(
      str_to_lower(payment_method) %in% c("cash")   ~ "Cash",
      str_to_lower(payment_method) %in% c("card")   ~ "Card",
      str_to_lower(payment_method) %in% c("online") ~ "Online",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    delivery_status = case_when(
      str_to_lower(delivery_status) %in% c("delivered") ~ "Delivered",
      str_to_lower(delivery_status) %in% c("cancelled") ~ "Cancelled",
      TRUE ~ NA_character_
    )
  )

cat("customer_gender:\n");  print(table(food_clean$customer_gender,  useNA = "ifany"))
customer_gender:

Female   Male   <NA> 
  7217   6910   6873 
cat("restaurant_type:\n");  print(table(food_clean$restaurant_type,  useNA = "ifany"))
restaurant_type:

       Cafe   Fast Food Fine Dining 
       8381        8390        4229 
cat("traffic_level:\n");    print(table(food_clean$traffic_level,    useNA = "ifany"))
traffic_level:

  High    Low Medium   <NA> 
  6942   7032   3557   3469 
cat("weather:\n");          print(table(food_clean$weather,          useNA = "ifany"))
weather:

Cloudy  Rainy  Sunny   <NA> 
  5160   5299   5274   5267 
cat("payment_method:\n");   print(table(food_clean$payment_method,   useNA = "ifany"))
payment_method:

  Card   Cash Online   <NA> 
  4241   8207   4301   4251 
cat("delivery_status:\n");  print(table(food_clean$delivery_status,  useNA = "ifany"))
delivery_status:

Cancelled Delivered      <NA> 
     5208     10520      5272 

After standardisation, each categorical variable is reduced to its expected set of valid labels. Fast Food and Cafe dominate the restaurant distribution, while Fine Dining makes up a smaller share. Cash is by far the most common payment method. The remaining missingness in customer_gender, weather, and delivery_status reflects entries that could not be matched to any recognised category and were therefore set to NA.

1.1.3.2 Invalid Numeric Values

Each numeric variable is validated against its expected real-world range. Observations falling outside these bounds are replaced with NA rather than removed, preserving the full row for variables that are still valid.

food_clean <- food_clean %>%
  mutate(
    customer_age = if_else(
      customer_age >= 15 & customer_age <= 70,
      customer_age, NA_real_),

    order_value = if_else(
      order_value >= 500 & order_value <= 5000,
      order_value, NA_real_),

    delivery_distance = if_else(
      delivery_distance >= 0 & delivery_distance <= 20,
      delivery_distance, NA_real_),

    delivery_time = if_else(
      delivery_time >= 10 & delivery_time <= 120,
      delivery_time, NA_real_),

    rating = if_else(
      rating >= 1 & rating <= 5,
      rating, NA_real_),

    rider_experience = if_else(
      rider_experience >= 0 & rider_experience <= 15,
      rider_experience, NA_real_)
  )

cat("Remaining NA counts after range validation:\n")
Remaining NA counts after range validation:
print(colSums(is.na(food_clean)))
         order_id      customer_age   customer_gender   restaurant_type 
                0              1063              6873                 0 
      order_value delivery_distance     delivery_time     traffic_level 
               15              7083              4743              3469 
          weather    payment_method            rating   delivery_status 
             5267              4251              7809              5272 
 rider_experience 
             3310 

This step increases the missing value counts considerably for rating and delivery_distance, which confirms that a large portion of the raw values were out-of-range rather than simply blank. For example, rating now has 7,809 missing entries, reflecting both the original blanks and the invalid numeric entries combined. While this improves data integrity, it reduces the usable sample for any analysis involving these variables. This trade-off – stricter validity at the cost of completeness – is a routine consideration in data preprocessing.


1.2 Section B: Data Transformation

1.2.1 Derived Variable: Delivery Speed

A new variable, delivery_speed, is calculated as the ratio of distance covered to time taken. This gives a continuous measure of how efficiently each delivery was completed, expressed in kilometres per minute.

food_clean <- food_clean %>%
  mutate(delivery_speed = delivery_distance / delivery_time)

food_clean %>%
  select(order_id, delivery_distance, delivery_time, delivery_speed) %>%
  head(10)
order_id delivery_distance delivery_time delivery_speed
FDD000001 NA 52 NA
FDD000002 NA NA NA
FDD000003 NA 61 NA
FDD000004 11.97 36 0.3325000
FDD000005 6.82 91 0.0749451
FDD000006 NA NA NA
FDD000007 1.17 62 0.0188710
FDD000008 11.25 61 0.1844262
FDD000009 NA NA NA
FDD000010 19.69 54 0.3646296

The variable takes meaningful values only where both delivery_distance and delivery_time are non-missing. Records where either is NA will naturally produce an NA speed, which is the correct behaviour.


1.2.2 Categorical Variable: Delivery Performance

Delivery performance is classified into three tiers based on delivery time thresholds defined in the study brief.

food_clean <- food_clean %>%
  mutate(
    delivery_performance = case_when(
      delivery_time < 30                       ~ "Fast",
      delivery_time >= 30 & delivery_time < 60 ~ "Moderate",
      delivery_time >= 60                      ~ "Slow",
      TRUE                                     ~ NA_character_
    ),
    delivery_performance = factor(
      delivery_performance, levels = c("Fast", "Moderate", "Slow")
    )
  )

cat("delivery_performance distribution (including NAs):\n")
delivery_performance distribution (including NAs):
print(table(food_clean$delivery_performance, useNA = "ifany"))

    Fast Moderate     Slow     <NA> 
    2953     4437     8867     4743 

The Slow category is the largest group with 8,867 orders, followed by Moderate at 4,437 and Fast at 2,953. The 4,743 missing values correspond to rows where delivery_time was set to NA during cleaning. The skew towards longer delivery times may point to genuine operational inefficiencies, though it is worth noting that this classification does not account for distance – a 60-minute delivery over 20 km is quite different from a 60-minute delivery over 2 km.


1.2.3 Group Summaries

Mean delivery time and mean order value are computed for each combination of restaurant type and traffic level.

summary_rt_tl <- food_clean %>%
  filter(!is.na(restaurant_type), !is.na(traffic_level)) %>%
  group_by(restaurant_type, traffic_level) %>%
  summarise(
    Mean_Delivery_Time = round(mean(delivery_time,  na.rm = TRUE), 2),
    Mean_Order_Value   = round(mean(order_value,    na.rm = TRUE), 2),
    N                  = n(),
    .groups = "drop"
  ) %>%
  arrange(restaurant_type, traffic_level)

kable(summary_rt_tl,
      caption = "Table 2: Mean delivery time and order value by restaurant type and traffic level") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 2: Mean delivery time and order value by restaurant type and traffic level
restaurant_type traffic_level Mean_Delivery_Time Mean_Order_Value N
Cafe High 64.58 2734.09 2814
Cafe Low 63.73 2733.84 2795
Cafe Medium 65.27 2745.46 1378
Fast Food High 65.05 2722.40 2767
Fast Food Low 64.44 2696.76 2831
Fast Food Medium 65.44 2716.04 1445
Fine Dining High 65.53 2754.12 1361
Fine Dining Low 66.34 2771.95 1406
Fine Dining Medium 65.61 2849.61 734

The grouped means show very little variation across the nine combinations of restaurant type and traffic level. Delivery time sits consistently between 63.7 and 66.3 minutes regardless of the combination, and mean order value clusters tightly between LKR 2,700 and 2,850. This uniformity is striking – in a real operational setting, one would expect higher traffic levels and more complex restaurant types to push delivery times upward. The flat pattern is consistent with the wider data quality issues identified earlier.

summary_gs <- food_clean %>%
  filter(!is.na(customer_gender), !is.na(delivery_status)) %>%
  count(customer_gender, delivery_status, name = "Total_Orders")

kable(summary_gs,
      caption = "Table 3: Total orders by customer gender and delivery status") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 3: Total orders by customer gender and delivery status
customer_gender delivery_status Total_Orders
Female Cancelled 1857
Female Delivered 3595
Male Cancelled 1691
Male Delivered 3482

Males placed 3,482 delivered orders and 1,691 cancellations, while females placed 3,595 delivered orders and 1,857 cancellations. Delivery success rates are broadly comparable across genders, with no meaningful difference in cancellation rates. This suggests that customer gender is not a driver of delivery outcomes in this dataset.


1.3 Section C: Summarisation

1.3.1 Descriptive Statistics

Summary statistics for delivery_time and order_value are computed to characterise central tendency and spread.

summary_stats <- food_clean %>%
  summarise(
    across(
      c(delivery_time, order_value),
      list(
        Mean   = ~round(mean(.x,   na.rm = TRUE), 2),
        Median = ~round(median(.x, na.rm = TRUE), 2),
        SD     = ~round(sd(.x,     na.rm = TRUE), 2),
        IQR    = ~round(IQR(.x,    na.rm = TRUE), 2)
      ),
      .names = "{.col}__{.fn}"
    )
  ) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", "Statistic"),
               names_sep = "__") %>%
  pivot_wider(names_from = "Statistic", values_from = "value")

kable(summary_stats,
      caption = "Table 4: Summary statistics for delivery time and order value") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 4: Summary statistics for delivery time and order value
Variable Mean Median SD IQR
delivery_time 64.84 65 32.14 56
order_value 2744.98 2748 1297.13 2248

For delivery_time, the mean (64.84 min) and median (65 min) are almost identical, indicating a symmetric distribution with no meaningful skew. The standard deviation of 32.14 and IQR of 56 reflect moderate spread. For order_value, the mean (2,744.98 LKR) and median (2,748 LKR) are again very close, though the much larger standard deviation (1,297.13) and IQR (2,248) suggest that spending varies considerably across orders, possibly driven by a wide range of order types and sizes.


1.3.2 Skewness and Kurtosis

dt_vals <- na.omit(food_clean$delivery_time)

sk     <- round(skewness(dt_vals), 4)
ku_raw <- round(kurtosis(dt_vals), 4)
ku_ex  <- round(ku_raw - 3, 4)

cat("Skewness (delivery_time)   :", sk,     "\n")
Skewness (delivery_time)   : 0.0175 
cat("Raw kurtosis               :", ku_raw, "\n")
Raw kurtosis               : 1.7994 
cat("Excess kurtosis (kurt - 3) :", ku_ex,  "\n")
Excess kurtosis (kurt - 3) : -1.2006 

A skewness of 0.0175 confirms that the distribution of delivery time is very nearly symmetric. The raw kurtosis of 1.7994 and corresponding excess kurtosis of -1.2006 indicate a platykurtic distribution – one with lighter tails and a flatter peak than a normal distribution. This is consistent with a broadly uniform spread of delivery times across the 10-120 minute range.


1.3.3 Frequency Table and Bar Chart

perf_freq <- food_clean %>%
  filter(!is.na(delivery_performance)) %>%
  count(delivery_performance, name = "Frequency") %>%
  mutate(Percentage = round(100 * Frequency / sum(Frequency), 2))

kable(perf_freq,
      caption = "Table 5: Frequency table for delivery performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 5: Frequency table for delivery performance
delivery_performance Frequency Percentage
Fast 2953 18.16
Moderate 4437 27.29
Slow 8867 54.54
ggplot(perf_freq,
       aes(x = delivery_performance, y = Frequency, fill = delivery_performance)) +
  geom_col(width = 0.55, colour = "white") +
  geom_text(
    aes(label = paste0(Frequency, "\n(", Percentage, "%)")),
    vjust = -0.3, size = 3.8, fontface = "bold"
  ) +
  scale_fill_manual(
    values = c("Fast" = "#27ae60", "Moderate" = "#f39c12", "Slow" = "#e74c3c")
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Distribution of delivery performance",
    subtitle = "Fast: < 30 min  |  Moderate: 30-59 min  |  Slow: >= 60 min",
    x        = "Performance category",
    y        = "Number of orders"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position    = "none",
    plot.title         = element_text(face = "bold"),
    panel.grid.major.x = element_blank()
  )
Figure 1: Distribution of delivery performance categories

Figure 1: Distribution of delivery performance categories

The bar chart shows that over half of all deliveries (54.54%, n = 8,867) fell into the Slow category. Moderate deliveries account for about 27% of orders, and Fast deliveries are the least common at 18.16%. This distribution raises questions about operational efficiency, though the absence of a distance control limits how much can be concluded from timing alone.


1.4 Section D: Visualisation

1.4.1 Delivery Distance vs. Delivery Time by Traffic Level

food_clean %>%
  filter(!is.na(delivery_distance), !is.na(delivery_time), !is.na(traffic_level)) %>%
  ggplot(aes(x = delivery_distance, y = delivery_time, colour = traffic_level)) +
  geom_point(alpha = 0.25, size = 0.9) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  scale_colour_manual(
    values = c("Low" = "#27ae60", "Medium" = "#e67e22", "High" = "#c0392b")
  ) +
  labs(
    title    = "Delivery distance vs. delivery time",
    subtitle = "Coloured by traffic level, with linear trend lines per group",
    x        = "Delivery distance (km)",
    y        = "Delivery time (minutes)",
    colour   = "Traffic level"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 2: Delivery distance vs. delivery time, coloured by traffic level

Figure 2: Delivery distance vs. delivery time, coloured by traffic level

The scatter plot reveals that distance has almost no relationship with delivery time in this dataset. All three trend lines sit nearly flat at around 65-70 minutes regardless of distance, and the three traffic level lines overlap almost perfectly. In a real delivery operation, longer distances and heavier traffic would typically push delivery times upward; neither pattern is visible here. The very wide vertical scatter at each distance value also suggests that unobserved factors – such as order preparation time or rider routing decisions – are far more influential than the variables captured here.


1.4.2 Order Value by Restaurant Type

food_clean %>%
  filter(!is.na(order_value), !is.na(restaurant_type)) %>%
  ggplot(aes(x = restaurant_type, y = order_value, fill = restaurant_type)) +
  geom_boxplot(outlier.colour = "red", outlier.alpha = 0.35,
               outlier.size = 1.2, width = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Order value distribution by restaurant type",
    x     = "Restaurant type",
    y     = "Order value (LKR)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    plot.title      = element_text(face = "bold")
  )
Figure 3: Order value distribution by restaurant type

Figure 3: Order value distribution by restaurant type

All three restaurant types show strikingly similar distributions. Medians sit close together in the LKR 2,750-2,900 range, and the interquartile ranges are comparable across categories. One would intuitively expect Fine Dining to produce meaningfully higher order values than Cafe or Fast Food, but that distinction does not emerge here. This is most likely a consequence of the order values being generated within a fixed LKR 500-5,000 range, which compresses the inter-group differences regardless of restaurant type.


1.4.3 Distribution of Delivery Time

mean_dt   <- mean(food_clean$delivery_time,   na.rm = TRUE)
median_dt <- median(food_clean$delivery_time, na.rm = TRUE)

food_clean %>%
  filter(!is.na(delivery_time)) %>%
  ggplot(aes(x = delivery_time)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#3498db", colour = "white", alpha = 0.80) +
  geom_density(colour = "#e74c3c", linewidth = 1.1) +
  geom_vline(xintercept = mean_dt,   linetype = "dashed",
             colour = "#2c3e50", linewidth = 0.9) +
  geom_vline(xintercept = median_dt, linetype = "dotted",
             colour = "#8e44ad", linewidth = 0.9) +
  annotate("text", x = mean_dt   + 5, y = 0.022,
           label = paste0("Mean = ", round(mean_dt, 1)),
           colour = "#2c3e50", size = 3.5) +
  annotate("text", x = median_dt - 5, y = 0.020,
           label = paste0("Median = ", round(median_dt, 1)),
           colour = "#8e44ad", size = 3.5) +
  labs(
    title    = "Distribution of delivery time",
    subtitle = "Histogram with kernel density overlay",
    x        = "Delivery time (minutes)",
    y        = "Density"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 4: Histogram of delivery time distribution

Figure 4: Histogram of delivery time distribution

The histogram displays a remarkably flat, near-uniform distribution of delivery times across the 10-120 minute range. Bar heights are fairly consistent throughout, with no obvious cluster around a typical delivery duration. The mean and median sit almost on top of each other at around 65 minutes, reinforcing the symmetric, flat shape identified by the skewness and kurtosis measures. Real-world delivery data would normally peak around a common duration, making this pattern more characteristic of synthetically generated data.


1.4.4 Delivery Performance by Weather Condition

food_clean %>%
  filter(!is.na(delivery_performance), !is.na(weather)) %>%
  count(weather, delivery_performance) %>%
  group_by(weather) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = weather, y = n, fill = delivery_performance)) +
  geom_col(position = "fill", colour = "white", width = 0.60) +
  geom_text(
    aes(label = paste0(round(pct * 100, 1), "%")),
    position = position_fill(vjust = 0.5),
    size = 3.5, colour = "white", fontface = "bold"
  ) +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(
    values = c("Fast" = "#27ae60", "Moderate" = "#f39c12", "Slow" = "#e74c3c")
  ) +
  labs(
    title    = "Delivery performance by weather condition",
    subtitle = "Proportion of orders in each performance category",
    x        = "Weather condition",
    y        = "Proportion of orders",
    fill     = "Delivery performance"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 5: Delivery performance by weather condition

Figure 5: Delivery performance by weather condition

Weather conditions have very little visible effect on delivery performance. The proportions of Fast, Moderate, and Slow deliveries are nearly identical across Cloudy, Rainy, and Sunny conditions. Rainy conditions show a slightly higher Slow rate (approximately 56%) compared to Sunny (approximately 54%), which is intuitively plausible, but the margin is too small to draw any firm conclusion. If weather were a genuine driver of performance, a more pronounced gap – particularly between Rainy and Sunny – would be expected.


1.5 Section E: Reporting and Export

1.5.1 Exporting the Cleaned Dataset

write_csv(food_clean, "food_delivery_cleaned.csv")

cat("File saved: food_delivery_cleaned.csv\n")
File saved: food_delivery_cleaned.csv
cat("  Rows   :", nrow(food_clean), "\n")
  Rows   : 21000 
cat("  Columns:", ncol(food_clean), "\n")
  Columns: 15 

The cleaned and transformed dataset has been saved as food_delivery_cleaned.csv. The file retains all 21,000 original rows and contains 15 columns – the original 13 variables plus the two newly derived fields, delivery_speed and delivery_performance.


1.5.2 Key Results (Inline R)

The mean delivery time across all valid orders is 64.84 minutes.

The total number of orders in each delivery performance category is as follows:

Category Total orders
Fast 2953
Moderate 4437
Slow 8867

2 Problem 2: Socio-Economic Household Study

2.1 Section A: Data Quality and Exploration

2.1.1 Importing and Summarising the Dataset

hh_raw <- read_csv("households_data.csv", show_col_types = FALSE)
head(hh_raw)
household_id region household_size monthly_income monthly_expenditure employment_status education_level housing_type access_to_internet health_expenditure savings debt
UHID000001 urbn 1 385000 120000 Self-employed Tertiary Owned yes 46000 155000 470000
UHID000002 rural 20 60000 60000 Employed primary Shared yes 4000 135000 420000
UHID000003 Suburban NA 85000 355000 Unemployed primary Owned yes 1000 65000 660000
UHID000004 rural 0 150000 170000 emplyd Tertiary Shared TRUE 20000 20000 700000
UHID000005 rural 0 110000 15000 emplyd primary Shared no 14000 15000 800000
UHID000006 Rural 2 295000 30000 emplyd Primary Owned yes 2000 100000 290000

The first six rows give an immediate sense of the dataset’s structure. Variables cover household demographics (region, household_size), economic indicators (monthly_income, monthly_expenditure, savings, debt), and socio-economic characteristics (employment_status, education_level, housing_type, access_to_internet). Some missing values are already visible in the preview.

summary(hh_raw)
 household_id          region          household_size  monthly_income    
 Length:33000       Length:33000       Min.   : 0.00   Min.   :   -5000  
 Class :character   Class :character   1st Qu.: 3.00   1st Qu.:  130000  
 Mode  :character   Mode  :character   Median : 6.00   Median :  255000  
                                       Mean   : 6.32   Mean   :  358695  
                                       3rd Qu.: 9.00   3rd Qu.:  385000  
                                       Max.   :20.00   Max.   :10000000  
                                       NA's   :2585    NA's   :355       
 monthly_expenditure employment_status  education_level    housing_type      
 Min.   :-10000      Length:33000       Length:33000       Length:33000      
 1st Qu.:100000      Class :character   Class :character   Class :character  
 Median :200000      Mode  :character   Mode  :character   Mode  :character  
 Mean   :198716                                                              
 3rd Qu.:300000                                                              
 Max.   :400000                                                              
 NA's   :391                                                                 
 access_to_internet health_expenditure    savings            debt         
 Length:33000       Min.   :   1000    Min.   :-50000   Min.   :       0  
 Class :character   1st Qu.:  13000    1st Qu.: 45000   1st Qu.:  250000  
 Mode  :character   Median :  26000    Median : 95000   Median :  510000  
                    Mean   :  44116    Mean   : 95939   Mean   :  601793  
                    3rd Qu.:  39000    3rd Qu.:150000   3rd Qu.:  760000  
                    Max.   :1000000    Max.   :200000   Max.   :10000000  
                    NA's   :622        NA's   :725      NA's   :303       

The summary reveals a number of data quality problems. Negative values appear in monthly_income, monthly_expenditure, and savings, all of which should be non-negative by definition. Extreme maximum values – for example, income and debt reaching 10,000,000 – are well beyond plausible household figures and suggest either data entry errors or outliers introduced during dataset generation. Missing values are widespread, particularly in employment_status, education_level, and access_to_internet.

cat("Missing value counts (raw data):\n")
Missing value counts (raw data):
print(colSums(is.na(hh_raw)))
       household_id              region      household_size      monthly_income 
                  0                5452                2585                 355 
monthly_expenditure   employment_status     education_level        housing_type 
                391                6676                6628                   0 
 access_to_internet  health_expenditure             savings                debt 
               6586                 622                 725                 303 
hh_raw2 <- hh_raw %>%
  mutate(across(where(is.character), ~na_if(str_trim(.), "")))

cat("Missing value counts after converting blank strings to NA:\n")
Missing value counts after converting blank strings to NA:
print(colSums(is.na(hh_raw2)))
       household_id              region      household_size      monthly_income 
                  0                5452                2585                 355 
monthly_expenditure   employment_status     education_level        housing_type 
                391                6676                6628                   0 
 access_to_internet  health_expenditure             savings                debt 
               6586                 622                 725                 303 

After converting blank strings to NA, the full extent of missingness becomes clearer. employment_status (6,676), education_level (6,628), and access_to_internet (6,586) are the most affected, followed by region (5,452) and household_size (2,585). Financial variables carry lower rates of missingness but are not immune. The pattern suggests that demographic and categorical fields were most susceptible to incomplete data collection in this survey.


2.1.2 Correcting Impossible Numeric Values

Duplicate records are removed first, then numeric variables are validated against their expected ranges.

hh_clean <- hh_raw2 %>%
  distinct(household_id, .keep_all = TRUE)

cat("Rows after removing duplicates:", nrow(hh_clean),
    "| Duplicates removed:", nrow(hh_raw2) - nrow(hh_clean), "\n\n")
Rows after removing duplicates: 30000 | Duplicates removed: 3000 
hh_clean <- hh_clean %>%
  mutate(
    household_size = if_else(
      household_size >= 1 & household_size <= 10,
      household_size, NA_real_),

    monthly_income = if_else(
      monthly_income >= 10000 & monthly_income <= 500000,
      monthly_income, NA_real_),

    monthly_expenditure = if_else(
      monthly_expenditure >= 5000 & monthly_expenditure <= 400000,
      monthly_expenditure, NA_real_),

    health_expenditure = if_else(
      health_expenditure >= 0 & health_expenditure <= 500000,
      health_expenditure, NA_real_),

    savings = if_else(savings <= 200000, savings, NA_real_),

    debt = if_else(debt >= 0 & debt <= 5000000, debt, NA_real_)
  )

cat("NA counts after range validation:\n")
NA counts after range validation:
print(colSums(is.na(hh_clean)))
       household_id              region      household_size      monthly_income 
                  0                4970                6956                 934 
monthly_expenditure   employment_status     education_level        housing_type 
                718                6080                6016                   0 
 access_to_internet  health_expenditure             savings                debt 
               5988                1127                 656                 596 

Removing duplicates reduced the dataset from 33,000 to 30,000 rows, with 3,000 duplicate household_id entries eliminated. Range validation then converted out-of-bound values to NA for all six numeric variables. The increases in missing counts for monthly_income and monthly_expenditure confirm that a portion of the raw entries were implausible rather than simply blank.


2.1.3 Correcting Inconsistent Text Entries

hh_clean <- hh_clean %>%

  mutate(
    region = case_when(
      str_to_lower(region) %in% c("urban",    "urbn") ~ "Urban",
      str_to_lower(region) %in% c("rural")            ~ "Rural",
      str_to_lower(region) %in% c("suburban")         ~ "Suburban",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    employment_status = case_when(
      str_to_lower(employment_status) %in%
        c("employed", "emplyd", "employ")                            ~ "Employed",
      str_to_lower(employment_status) %in%
        c("unemployed", "unemp")                                     ~ "Unemployed",
      str_to_lower(employment_status) %in%
        c("self-employed", "selfemployed", "self employed",
          "self_employed")                                            ~ "Self-Employed",
      TRUE ~ NA_character_
    )
  ) %>%

  mutate(
    education_level = case_when(
      str_to_lower(education_level) %in% c("primary")   ~ "Primary",
      str_to_lower(education_level) %in% c("secondary") ~ "Secondary",
      str_to_lower(education_level) %in% c("tertiary")  ~ "Tertiary",
      TRUE ~ NA_character_
    ),
    education_level = factor(
      education_level, levels = c("Primary", "Secondary", "Tertiary")
    )
  ) %>%

  mutate(
    access_to_internet = case_when(
      str_to_lower(as.character(access_to_internet)) %in%
        c("true", "yes", "1", "y")  ~ TRUE,
      str_to_lower(as.character(access_to_internet)) %in%
        c("false", "no", "0", "n")  ~ FALSE,
      TRUE ~ NA
    )
  )

cat("--- region ---\n");             print(table(hh_clean$region,             useNA = "ifany"))
--- region ---

   Rural Suburban    Urban     <NA> 
   10120     4873    10037     4970 
cat("--- employment_status ---\n");  print(table(hh_clean$employment_status,  useNA = "ifany"))
--- employment_status ---

     Employed Self-Employed    Unemployed          <NA> 
        11932          5965          6023          6080 
cat("--- education_level ---\n");    print(table(hh_clean$education_level,    useNA = "ifany"))
--- education_level ---

  Primary Secondary  Tertiary      <NA> 
    11962      6041      5981      6016 
cat("--- access_to_internet ---\n"); print(table(hh_clean$access_to_internet, useNA = "ifany"))
--- access_to_internet ---

FALSE  TRUE  <NA> 
12103 11909  5988 

After standardisation, the region variable shows Urban (10,037) and Rural (10,120) households as nearly equal in size, with Suburban (4,873) accounting for a smaller share. Employment status breaks into Employed (11,932), Unemployed (6,023), and Self-Employed (5,965). Education levels are distributed across Primary (11,962), Secondary (6,041), and Tertiary (5,981). Internet access is split fairly evenly, with roughly 12,103 households without access and 11,909 with access. Residual missing values across all four variables reflect entries that did not match any recognised label and were therefore set to NA.


2.2 Section B: Data Manipulation and Wrangling

2.2.1 Savings Rate

hh_clean <- hh_clean %>%
  mutate(savings_rate = savings / monthly_income)

summary(hh_clean$savings_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-5.0000  0.1739  0.3778  0.8138  0.7432 20.0000    1569 

The savings_rate variable captures the proportion of monthly income that a household saves. The summary shows a range that extends into negative territory, which can occur when a household’s savings figure is negative – reflecting net dissaving rather than accumulation. These values are retained for now and handled through filtering in downstream analyses.


2.2.2 Income Group Classification

hh_clean <- hh_clean %>%
  mutate(
    income_group = case_when(
      monthly_income >= 200000                           ~ "High Income",
      monthly_income >= 75000 & monthly_income < 200000 ~ "Middle Income",
      monthly_income <  75000                            ~ "Low Income",
      TRUE                                               ~ NA_character_
    ),
    income_group = factor(
      income_group, levels = c("Low Income", "Middle Income", "High Income")
    )
  )

cat("Income group distribution:\n")
Income group distribution:
print(table(hh_clean$income_group, useNA = "ifany"))

   Low Income Middle Income   High Income          <NA> 
         3836          7359         17871           934 

Households are classified into three income tiers based on monthly income thresholds specified in the study brief. The distribution across groups will be examined further in the visualisation section.


2.2.3 Group Summary Statistics

inc_exp_tbl <- hh_clean %>%
  filter(!is.na(region), !is.na(employment_status)) %>%
  group_by(region, employment_status) %>%
  summarise(
    Mean_Income      = round(mean(monthly_income,      na.rm = TRUE), 0),
    Mean_Expenditure = round(mean(monthly_expenditure, na.rm = TRUE), 0),
    N                = n(),
    .groups = "drop"
  ) %>%
  arrange(region, employment_status)

kable(inc_exp_tbl,
      caption = "Table 6: Mean monthly income and expenditure by region and employment status",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6: Mean monthly income and expenditure by region and employment status
region employment_status Mean_Income Mean_Expenditure N
Rural Employed 255,176 201,519 4,010
Rural Self-Employed 260,212 202,537 2,000
Rural Unemployed 254,376 205,269 2,046
Suburban Employed 250,959 200,999 1,922
Suburban Self-Employed 260,216 205,063 980
Suburban Unemployed 264,645 205,788 967
Urban Employed 256,436 197,032 3,963
Urban Self-Employed 251,358 202,598 1,989
Urban Unemployed 253,623 200,803 2,049
sav_debt_tbl <- hh_clean %>%
  filter(!is.na(education_level)) %>%
  group_by(education_level) %>%
  summarise(
    Median_Savings = round(median(savings, na.rm = TRUE), 0),
    Median_Debt    = round(median(debt,    na.rm = TRUE), 0),
    N              = n(),
    .groups = "drop"
  )

kable(sav_debt_tbl,
      caption = "Table 7: Median savings and debt by education level",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 7: Median savings and debt by education level
education_level Median_Savings Median_Debt N
Primary 95,000 500,000 11,962
Secondary 100,000 510,000 6,041
Tertiary 100,000 510,000 5,981

Mean income and expenditure show very little variation across regions or employment categories, which mirrors the pattern seen in the food delivery data. Median savings and debt are similarly uniform across education levels, with all three groups sitting at comparable figures. In real-world data, tertiary-educated households typically show higher savings and lower debt burdens – the absence of this gradient here reinforces the synthetic nature of the dataset.


2.3 Section C: Probability Distribution and Simulation

2.3.1 Simulating Random Data

Two variables are simulated to illustrate the use of probability distributions in R.

set.seed(42)

# Normal distribution: simulated household income
household_income_sim <- rnorm(n = 1000, mean = 150000, sd = 50000)

# Poisson distribution: simulated number of dependents
number_of_dependents <- rpois(n = 1000, lambda = 2)

cat("First 10 simulated income values:\n",  round(head(household_income_sim, 10)), "\n\n")
First 10 simulated income values:
 218548 121765 168156 181643 170213 144694 225576 145267 250921 146864 
cat("First 10 simulated dependent counts:\n", head(number_of_dependents, 10), "\n")
First 10 simulated dependent counts:
 6 2 3 4 3 3 2 3 1 1 

Household income is simulated from a Normal distribution with mean LKR 150,000 and standard deviation LKR 50,000. This reflects the assumption that most household incomes cluster around a central value, with fewer households at the extremes. Number of dependents is simulated from a Poisson distribution with lambda = 2, which is appropriate for modelling count data where most households have a small number of dependents and larger counts become progressively rarer.


2.3.2 Comparing Sample and Theoretical Moments

norm_cmp <- tibble(
  Statistic   = c("Mean", "Variance", "Std deviation"),
  Theoretical = c(150000, 50000^2, 50000),
  Sample      = c(
    round(mean(household_income_sim), 2),
    round(var(household_income_sim),  2),
    round(sd(household_income_sim),   2)
  )
)

kable(norm_cmp,
      caption = "Table 8: Normal distribution -- theoretical vs. sample moments",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 8: Normal distribution – theoretical vs. sample moments
Statistic Theoretical Sample
Mean 1.5e+05 1.487088e+05
Variance 2.5e+09 2.512622e+09
Std deviation 5.0e+04 5.012606e+04
pois_cmp <- tibble(
  Statistic   = c("Mean (lambda)", "Variance (lambda)"),
  Theoretical = c(2, 2),
  Sample      = c(
    round(mean(number_of_dependents), 4),
    round(var(number_of_dependents),  4)
  )
)

kable(pois_cmp,
      caption = "Table 9: Poisson distribution -- theoretical vs. sample moments") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 9: Poisson distribution – theoretical vs. sample moments
Statistic Theoretical Sample
Mean (lambda) 2 2.0100
Variance (lambda) 2 1.9438

For the Normal distribution, the sample mean and standard deviation are very close to the theoretical parameters of 150,000 and 50,000 respectively, with small deviations attributable to random sampling variation at n = 1,000. For the Poisson distribution, both the sample mean and variance are close to the theoretical value of lambda = 2, which is expected given that the mean and variance are equal in a Poisson process. Minor discrepancies are normal and would narrow with a larger sample size.


2.3.3 Histograms and Density Plots

tibble(income = household_income_sim) %>%
  ggplot(aes(x = income)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#2980b9", colour = "white", alpha = 0.75) +
  geom_density(colour = "#e74c3c", linewidth = 1.2) +
  stat_function(
    fun  = dnorm,
    args = list(mean = 150000, sd = 50000),
    colour = "#27ae60", linewidth = 1.1, linetype = "dashed"
  ) +
  scale_x_continuous(labels = comma) +
  labs(
    title    = "Simulated household income -- Normal(150,000; 50,000)",
    subtitle = "Blue: histogram  |  Red: empirical density  |  Green dashed: theoretical curve",
    x        = "Simulated income (LKR)",
    y        = "Density"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 6: Simulated household income -- Normal distribution

Figure 6: Simulated household income – Normal distribution

The histogram closely follows the expected bell-shaped curve. The empirical density (red) and the theoretical Normal curve (green dashed) align well throughout the distribution, with the slight deviations at the tails being entirely typical of a finite sample. The simulation is behaving as intended.

pois_pmf <- tibble(
  x    = 0:max(number_of_dependents),
  prob = dpois(0:max(number_of_dependents), lambda = 2)
)

tibble(dependents = number_of_dependents) %>%
  ggplot(aes(x = dependents)) +
  geom_bar(aes(y = after_stat(prop)), fill = "#8e44ad",
           colour = "white", alpha = 0.80, width = 0.70) +
  geom_point(data = pois_pmf, aes(x = x, y = prob),
             colour = "#27ae60", size = 3.5) +
  geom_line(data  = pois_pmf, aes(x = x, y = prob),
            colour = "#27ae60", linewidth = 1.0, linetype = "dashed") +
  labs(
    title    = "Simulated number of dependents -- Poisson(lambda = 2)",
    subtitle = "Purple bars: sample proportions  |  Green: theoretical PMF",
    x        = "Number of dependents",
    y        = "Probability"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 7: Simulated number of dependents -- Poisson distribution

Figure 7: Simulated number of dependents – Poisson distribution

The bar chart peaks at 2 dependents and tails off steadily at higher counts, matching the shape of a Poisson distribution with lambda = 2. The sample proportions (purple bars) align closely with the theoretical probability mass function (green), confirming that the simulation is a good empirical approximation of the specified distribution at n = 1,000.


2.4 Section D: Visualisation with ggplot2

2.4.1 Monthly Income vs. Monthly Expenditure by Region

set.seed(1)

hh_clean %>%
  filter(!is.na(monthly_income), !is.na(monthly_expenditure), !is.na(region)) %>%
  sample_n(5000) %>%
  ggplot(aes(x = monthly_income, y = monthly_expenditure, colour = region)) +
  geom_point(alpha = 0.30, size = 1.2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title    = "Monthly income vs. monthly expenditure",
    subtitle = "Coloured by region -- random sample of 5,000 households",
    x        = "Monthly income (LKR)",
    y        = "Monthly expenditure (LKR)",
    colour   = "Region"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 8: Monthly income vs. monthly expenditure, coloured by region

Figure 8: Monthly income vs. monthly expenditure, coloured by region

The scatter plot shows virtually no relationship between income and expenditure. The three regional trend lines are nearly flat and lie almost on top of each other around LKR 200,000, meaning expenditure does not change with income level regardless of where a household is located. In real survey data, a positive relationship between income and expenditure is almost universal; its absence here is a direct consequence of the variables being generated independently within fixed ranges.


2.4.2 Monthly Savings by Education Level

hh_clean %>%
  filter(!is.na(savings), !is.na(education_level)) %>%
  ggplot(aes(x = education_level, y = savings, fill = education_level)) +
  geom_boxplot(outlier.colour = "red", outlier.alpha = 0.30,
               outlier.size = 1.2, width = 0.50) +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(
    title = "Monthly savings by education level",
    x     = "Education level",
    y     = "Monthly savings (LKR)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "none",
    plot.title      = element_text(face = "bold")
  )
Figure 9: Monthly savings distribution by education level

Figure 9: Monthly savings distribution by education level

The three education groups display near-identical distributions, with medians sitting around LKR 95,000-100,000 and comparable spreads. Negative values appear across all groups, indicating households that are spending beyond their income. One would typically expect Tertiary-educated households to show noticeably higher savings given better employment prospects and earning capacity, but this pattern does not appear – again consistent with the synthetic data generation process.


2.4.3 Average Monthly Income by Employment Status, Faceted by Region

hh_clean %>%
  filter(!is.na(monthly_income), !is.na(employment_status), !is.na(region)) %>%
  group_by(region, employment_status) %>%
  summarise(avg_income = mean(monthly_income, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = employment_status, y = avg_income, fill = employment_status)) +
  geom_col(width = 0.60, colour = "white") +
  geom_text(aes(label = comma(round(avg_income, 0))),
            vjust = -0.40, size = 2.7, fontface = "bold") +
  facet_wrap(~region, ncol = 3) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.18))) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Average monthly income by employment status",
    subtitle = "Faceted by region",
    x        = "Employment status",
    y        = "Average monthly income (LKR)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    axis.text.x     = element_text(angle = 35, hjust = 1),
    strip.text      = element_text(face = "bold", size = 11),
    plot.title      = element_text(face = "bold")
  )
Figure 10: Average monthly income by employment status, faceted by region

Figure 10: Average monthly income by employment status, faceted by region

Average monthly income is almost identical across every combination of region and employment status, ranging between roughly LKR 251,000 and LKR 265,000. Employed and Self-Employed households would ordinarily earn more than Unemployed households, and Urban areas would typically outpace Rural ones. Neither pattern is visible here, which reflects the income variable being generated within a fixed range without regard to employment or location.


2.4.4 Correlation Heatmap

num_hh <- hh_clean %>%
  select(household_size, monthly_income, monthly_expenditure,
         health_expenditure, savings, debt, savings_rate) %>%
  drop_na()

cor_mat  <- cor(num_hh)
cor_melt <- melt(cor_mat)

ggplot(cor_melt, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = round(value, 2)), size = 3.5, fontface = "bold") +
  scale_fill_gradient2(
    low      = "#c0392b",
    mid      = "white",
    high     = "#2980b9",
    midpoint = 0,
    limits   = c(-1, 1),
    name     = "Pearson r"
  ) +
  labs(
    title = "Correlation heatmap -- socio-economic household numeric variables",
    x = NULL, y = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x  = element_text(angle = 45, hjust = 1),
    plot.title   = element_text(face = "bold"),
    legend.position = "right"
  )
Figure 11: Correlation heatmap of numeric household variables

Figure 11: Correlation heatmap of numeric household variables

Almost every pair of numeric variables shows a correlation close to zero, confirming that the variables were generated independently. The one notable exception is the moderate negative correlation between monthly_income and savings_rate (-0.48), which is somewhat counterintuitive – higher-income households would generally be expected to save a larger share of income. This negative relationship likely emerges from the way savings and income were independently bounded during data generation rather than reflecting a genuine economic dynamic.


2.5 Section E: Reproducible Reporting and Export

2.5.1 Reproducibility Notes

All code in this report is fully reproducible. A global chunk options block at the top of the document (knitr::opts_chunk$set) ensures consistent behaviour across all chunks – warnings and messages are suppressed to keep the output clean, figures are centred, and echo is enabled so the code is always visible alongside its output. The simulation in Section C uses set.seed(42) to guarantee that identical results are obtained on every knit. The tidyverse and supporting packages are loaded once at the start of Problem 1 and remain available throughout. All inline R expressions update automatically when the underlying data changes, so key reported figures – such as the mean delivery time and household count – are always consistent with the actual computed values.

2.5.2 Exporting the Cleaned Dataset

write_csv(hh_clean, "socio_economic_households_cleaned.csv")

cat("File saved: socio_economic_households_cleaned.csv\n")
File saved: socio_economic_households_cleaned.csv
cat("  Rows   :", nrow(hh_clean), "\n")
  Rows   : 30000 
cat("  Columns:", ncol(hh_clean), "\n")
  Columns: 14 

The cleaned household dataset has been saved as socio_economic_households_cleaned.csv. The file contains 30000 rows and 14 columns – the original 12 variables plus the two newly derived fields, savings_rate and income_group.

The mean monthly income across valid household records is 255,075 LKR. The distribution of households across income groups is shown below:

Income group Households
Low Income 3836
Middle Income 7359
High Income 17871