1. Project Overview

This analysis aims to answer key business questions regarding sales performance, channel distribution, and pricing strategies for Inyange Industries. By applying data cleaning, exploratory data analysis (EDA), and predictive modeling techniques in R, we derive actionable insights to support decision-making.

Important Disclaimer: > Please note that the dataset utilized in this project is synthetic and created exclusively for educational and demonstration purposes. It does not represent actual confidential sales records, financial metrics, or operational data of Inyange Industries. All figures, customer details, and transaction volumes are dummy data generated to showcase data science and R programming capabilities.

Key Business Objectives:

  • Product Performance: Identify high-revenue drivers versus low-impact products.
  • Channel Strategy: Evaluate the importance of Retail vs. Hotel/Restaurant vs. Wholesalers.
  • Geographic Analysis: Determine strongest districts to inform depot planning.
  • Pricing Dynamics: Analyze relationships between unit price and demand (quantity sold).
  • Promotion Efficacy: Determine if promotions drive volume or simply erode margins.

2. Data Loading and Cleaning

We begin by loading the raw sales dataset, cleaning column names to snake_case, and converting variables to their appropriate data types for analysis.

# Load and clean data
# Logic to handle file path validation
if(file.exists("rwanda_inyange_sales.csv")){
  inyange <- read.csv("rwanda_inyange_sales.csv", stringsAsFactors = FALSE) %>%
    clean_names()
} else {
  warning("Data file not found. Please ensure 'rwanda_inyange_sales.csv' is in your working directory.")
  stop("File 'rwanda_inyange_sales.csv' not found.") 
}

# Convert variable types
inyange <- inyange %>%
  mutate(
    date = as.Date(date),
    district = as.factor(district),
    province = as.factor(province),
    channel = as.factor(channel),
    product_category = as.factor(product_category),
    product_name = as.factor(product_name),
    package_size_ml = as.numeric(package_size_ml),
    unit_price_rwf = as.numeric(unit_price_rwf),
    quantity_sold = as.numeric(quantity_sold),
    total_sales_rwf = as.numeric(total_sales_rwf),
    # Standardize Promotion Flag
    promotion_flag = ifelse(tolower(promotion_flag) %in% c("yes", "y", "true"), "Yes", "No"),
    promotion_flag = factor(promotion_flag, levels = c("No", "Yes")),
    customer_type = as.factor(customer_type)
  )

glimpse(inyange)
## Rows: 54,720
## Columns: 12
## $ date             <date> 2025-01-01, 2025-01-01, 2025-01-01, 2025-01-01, 2025…
## $ district         <fct> Nyaruguru, Kicukiro, Bugesera, Nyabihu, Huye, Ngorore…
## $ province         <fct> Southern, Kigali, Eastern, Western, Southern, Western…
## $ channel          <fct> Wholesaler, Retail, Retail, Supermarket, Hotel/Restau…
## $ product_category <fct> Butter, Milk, Milk, Juice, Water, Butter, Milk, Yogur…
## $ product_name     <fct> Inyange Butter 250g, Inyange Fresh Milk 1L, Inyange L…
## $ package_size_ml  <dbl> 250, 1000, 1000, 500, 1500, 250, 500, 500, 250, 500, …
## $ unit_price_rwf   <dbl> 1800, 1100, 1200, 900, 720, 1800, 600, 900, 1800, 400…
## $ quantity_sold    <dbl> 12, 108, 138, 37, 133, 20, 46, 24, 11, 92, 37, 80, 72…
## $ total_sales_rwf  <dbl> 21600, 118800, 165600, 33300, 95760, 36000, 27600, 21…
## $ promotion_flag   <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, …
## $ customer_type    <fct> Household, Shop, Hotel, Hotel, School, Hotel, Shop, H…

3. Exploratory Data Analysis (EDA)

3.1 Distribution of Key Metrics

Understanding the spread of our data helps identify outliers and the general shape of our business.

# Histogram of unit price
p1 <- ggplot(inyange, aes(x = unit_price_rwf)) + 
  geom_histogram(binwidth = 100, fill = "skyblue", color = "black") + 
  labs(title = "Distribution of Unit Price", x = "Unit Price (RWF)", y = "Count") +
  theme_minimal()

# Distribution of quantity sold
p2 <- ggplot(inyange, aes(x = quantity_sold)) + 
  geom_histogram(binwidth = 10, fill = "lightgreen", color = "black") + 
  labs(title = "Distribution of Quantity Sold", x = "Quantity Sold", y = "Count") +
  theme_minimal()

# Distribution of total sales (revenue)
p3 <- ggplot(inyange, aes(x = total_sales_rwf)) +
  geom_histogram(binwidth = 10000, fill = "orange", color = "black") +
  scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "k")) +
  labs(title = "Distribution of Total Sales per Record", x = "Total Sales (RWF)", y = "Count") +
  theme_minimal()

# Combine plots using patchwork library
(p1 + p2) / p3

Interpretation: The histograms reveal three distinct structural characteristics of the data:

  • Unit Price: The distribution is multi-modal with distinct spikes (around 500, 800, 1500 RWF). This confirms a tiered product portfolio, likely distinguishing between low-cost items (e.g., small water/milk) and premium items (e.g., butter/ghee).
  • Quantity Sold: The distribution is heavily right-skewed (Poisson-like). The vast majority of transactions are for smaller quantities (<50 units), typical of retail behavior.
  • Total Sales: Similar to quantity, revenue per transaction is right-skewed. While most sales are small (<50k RWF), the long tail extending to 200k RWF indicates the presence of high-value bulk buyers or wholesale transactions.

4. Sales Performance Analysis

4.1 Top Products by Revenue

Which specific SKUs are the “cash cows”?

prod_summary <- inyange %>%
  group_by(product_name, package_size_ml, product_category) %>%
  summarise(
    total_revenue   = sum(total_sales_rwf, na.rm = TRUE),
    total_units     = sum(quantity_sold, na.rm = TRUE),
    avg_price       = mean(unit_price_rwf, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

# Display top 10 table
kable(head(prod_summary, 10), caption = "Top 10 Products by Revenue")
Top 10 Products by Revenue
product_name package_size_ml product_category total_revenue total_units avg_price
Inyange Fresh Milk 1L 1000 Milk 487421330 452396 1077.5658
Inyange Long Life Milk 1L 1000 Milk 468150600 397881 1176.5263
Inyange Mineral Water 1.5L 1500 Water 414726880 528955 783.8487
Inyange Mango Juice 1L 1000 Juice 408465450 277803 1469.9013
Inyange Pineapple Juice 1L 1000 Juice 407215650 277124 1469.8661
Inyange Orange Juice 500ml 500 Juice 315838170 357826 882.4342
Inyange Fresh Milk 500ml 500 Milk 219603840 373583 587.8336
Inyange Strawberry Yogurt 500ml 500 Yogurt 199531800 226489 881.1049
Inyange Butter 250g 250 Butter 186397740 105683 1764.1036
Inyange Mineral Water 500ml 500 Water 182487280 465422 392.1147
# Plot Top 10
ggplot(head(prod_summary, 10),
       aes(x = reorder(product_name, total_revenue), y = total_revenue)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Top 10 Products by Revenue",
    x = "Product",
    y = "Total Revenue (RWF)"
  ) +
  theme_minimal()

Interpretation: We observe a clear hierarchy in product performance:

  1. The “Powerhouse” Products: Inyange Fresh Milk 1L is the undisputed market leader, followed closely by Long Life Milk 1L. These two SKUs are the backbone of the company’s revenue.
  2. Beverages: Mineral Water 1.5L and Mango/Pineapple Juices form the second tier of revenue drivers.
  3. Niche Items: Butter 250g and Mineral Water 500ml contribute significantly less revenue.

Strategy: Supply chain stability for the 1L Milk varieties is critical, as stockouts here would have the largest negative impact on the bottom line.

4.2 Performance by Channel

Which sales channels are driving the most value?

channel_summary <- inyange %>%
  group_by(channel) %>%
  summarise(
    total_revenue = sum(total_sales_rwf, na.rm = TRUE),
    total_units   = sum(quantity_sold, na.rm = TRUE),
    avg_order_qty = mean(quantity_sold, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

# Revenue by Channel Plot
ggplot(channel_summary,
       aes(x = reorder(channel, total_revenue), y = total_revenue, group = 1)) +
  geom_line(color = "purple", size = 1) +
  geom_point(color = "red", size = 3) +
  coord_flip() +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Total Revenue by Channel",
    x = "Channel",
    y = "Total Revenue (RWF)"
  ) +
  theme_light()

Interpretation:

  • Leading Channel: Supermarkets generate the highest total revenue (~912M RWF).
  • Channel Parity: Interestingly, the revenue spread between channels is relatively tight (ranging from ~902M for Wholesalers to ~912M for Supermarkets). This suggests a balanced market penetration where no single channel is overwhelmingly dominant or failing. All channels are performing within a ~1% margin of each other.

4.3 Geographic Distribution (Districts)

Where is the demand concentrated geographically?

prov_summary <- inyange %>%
  group_by(district) %>%
  summarise(
    total_revenue = sum(total_sales_rwf, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

# Plotting Revenue by District
ggplot(head(prov_summary, 15), aes(x = reorder(district, total_revenue), y = total_revenue)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Top 15 Districts by Revenue",
    x = "District",
    y = "Total Revenue (RWF)"
  ) +
  theme_minimal()

Interpretation:

  • Top Districts: Karongi is the highest-performing district, followed closely by Bugesera and Nyanza.
  • Strategic Implication: The top 5 districts (Karongi, Bugesera, Nyanza, Rulindo, Nyagatare) account for a massive portion of sales. Any regional depots or logistics hubs should be situated near these high-volume zones to minimize transportation costs.

4.4 Sales Trend Over Time

Tracking revenue performance over the recorded period.

# Aggregating sales by month
monthly_sales <- inyange %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(total_revenue = sum(total_sales_rwf, na.rm = TRUE))

ggplot(monthly_sales, aes(x = month, y = total_revenue)) +
  geom_line(color = "darkgreen", size = 1.2) +
  geom_smooth(method = "loess", se = FALSE, color = "gray", linetype = "dashed") +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Monthly Revenue Trend",
    subtitle = "Tracking performance over time",
    x = "Date",
    y = "Total Revenue (RWF)"
  ) +
  theme_minimal()

Interpretation: The time series reveals significant seasonality and volatility:

  • The Dip: There is a sharp decline in revenue around April, which coincides with the rainy season in Rwanda. This could suggest logistical challenges in distribution or a shift in consumer consumption patterns during cooler/wetter months.
  • The Peak: Revenue surges to a peak in July/August, likely driven by the dry season (increasing demand for water/juice) and summer holidays.
  • Action: Inventory planning needs to account for the July surge to prevent stockouts during peak demand.

5. Promotion Impact Analysis

Do promotions actually work? We compare metrics between promoted and non-promoted transactions.

# Overall impact of promotions
promo_summary <- inyange %>%
  group_by(promotion_flag) %>%
  summarise(
    total_revenue = sum(total_sales_rwf, na.rm = TRUE),
    total_units   = sum(quantity_sold, na.rm = TRUE),
    avg_qty       = mean(quantity_sold, na.rm = TRUE),
    avg_rev       = mean(total_sales_rwf, na.rm = TRUE),
    .groups = "drop"
  )

kable(promo_summary, caption = "Summary of Metrics: Promotion vs No Promotion")
Summary of Metrics: Promotion vs No Promotion
promotion_flag total_revenue total_units avg_qty avg_rev
No 2964274900 2967155 67.84705 67781.19
Yes 668890170 743533 67.67389 60880.15
# Visualizing Distributions
v1 <- ggplot(inyange, aes(x = promotion_flag, y = quantity_sold, fill = promotion_flag)) +
  geom_violin(trim = FALSE, alpha = 0.6, show.legend = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.1) +
  labs(title = "Quantity per Transaction", x = "Promotion", y = "Qty") +
  theme_light()

v2 <- ggplot(inyange, aes(x = promotion_flag, y = total_sales_rwf, fill = promotion_flag)) +
  geom_violin(trim = FALSE, alpha = 0.6, show.legend = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.1) +
  labs(title = "Revenue per Transaction", x = "Promotion", y = "Sales (RWF)") +
  theme_light()

v1 + v2

Interpretation:

  • Visual Evidence: The violin plots show a very high degree of overlap between the “No” (Red) and “Yes” (Blue) distributions for both Quantity and Sales.
  • Volume Lift: While the “Yes” promotion shape is slightly thicker at the higher quantity range, the difference is not drastic.
  • Revenue Risk: Since the revenue distributions look nearly identical, there is a risk that promotions are effectively lowering margins without driving enough extra volume to compensate. We need to be careful that we aren’t just “giving away margin.”

6. Predictive Modeling

6.1 Linear Regression: Predicting Demand

Business Question: How does price elasticity and channel affect the volume sold?

Note: We predict quantity_sold rather than total_sales to avoid mathematical redundancy, as Sales = Price × Quantity.

lm_demand <- lm(
  quantity_sold ~ unit_price_rwf + product_category + channel + promotion_flag,
  data = inyange
)

summary(lm_demand)
## 
## Call:
## lm(formula = quantity_sold ~ unit_price_rwf + product_category + 
##     channel + promotion_flag, data = inyange)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.458  -12.391   -0.356   11.923  134.123 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            22.8354473  0.9116123  25.050  < 2e-16 ***
## unit_price_rwf         -0.0006674  0.0004650  -1.435    0.151    
## product_categoryGhee   -4.4638386  0.9377579  -4.760 1.94e-06 ***
## product_categoryJuice  43.1919163  0.4517892  95.602  < 2e-16 ***
## product_categoryMilk   65.2552221  0.5330866 122.410  < 2e-16 ***
## product_categoryWater  86.5253334  0.6761044 127.976  < 2e-16 ***
## product_categoryYogurt 21.3655290  0.5737667  37.237  < 2e-16 ***
## channelRetail           0.0503190  0.2763691   0.182    0.856    
## channelSupermarket      0.3646419  0.2759951   1.321    0.186    
## channelWholesaler      -0.0282452  0.2764531  -0.102    0.919    
## promotion_flagYes      -0.1343055  0.2499557  -0.537    0.591    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.85 on 54709 degrees of freedom
## Multiple R-squared:  0.6124, Adjusted R-squared:  0.6123 
## F-statistic:  8643 on 10 and 54709 DF,  p-value: < 2.2e-16

Interpretation: The model explains 61.2% (R-squared) of the variance in quantity sold, which is a strong result.

  • Category is King: The most significant drivers of quantity are the product categories. Compared to the baseline (Butter), products like Water (t-value = 127) and Milk (t-value = 122) have massively higher baseline quantities.
  • Price Insignificance: In this specific model, unit_price_rwf has a p-value of 0.151, meaning it is not statistically significant when product category is already accounted for. This suggests that the category determines the volume more than small price fluctuations within that category.
  • Promotions Ineffective? The coefficient for promotion_flagYes is -0.13 with a p-value of 0.591. This is a critical finding: Promotions are not having a statistically significant impact on increasing quantity sold in this model. This warrants a review of the current promotion strategy.

6.2 Logistic Regression: Predicting Promotions

Business Question: Which factors make a transaction more likely to be under promotion?

model_logit <- glm(
  promotion_flag ~ unit_price_rwf + quantity_sold +
    product_category + channel + district,
  data = inyange,
  family = binomial
)

# Extract Odds Ratios
odds_ratios <- exp(coef(model_logit))
ci_odds <- exp(confint(model_logit))

# Create results table
results_logit <- data.frame(
  Predictor = names(odds_ratios),
  Odds_Ratio = round(odds_ratios, 4),
  CI_Lower = round(ci_odds[,1], 4),
  CI_Upper = round(ci_odds[,2], 4)
)

kable(head(results_logit, 10), caption = "Odds Ratios (First 10 predictors)")
Odds Ratios (First 10 predictors)
Predictor Odds_Ratio CI_Lower CI_Upper
(Intercept) (Intercept) 34.1170 27.1711 42.8422
unit_price_rwf unit_price_rwf 0.9973 0.9972 0.9974
quantity_sold quantity_sold 0.9998 0.9988 1.0007
product_categoryGhee product_categoryGhee 91.8065 74.4095 113.3097
product_categoryJuice product_categoryJuice 0.2063 0.1836 0.2318
product_categoryMilk product_categoryMilk 0.1044 0.0906 0.1201
product_categoryWater product_categoryWater 0.0427 0.0359 0.0508
product_categoryYogurt product_categoryYogurt 0.0994 0.0874 0.1130
channelRetail channelRetail 0.9436 0.8878 1.0028
channelSupermarket channelSupermarket 0.9671 0.9104 1.0274

Interpretation of Odds Ratios:

  • Category Bias: The strongest predictor of a promotion is the product category. Ghee has an Odds Ratio of 91.8, meaning Ghee transactions are overwhelmingly likely to be on promotion compared to the baseline.
  • Low Promo Items: Conversely, Water (OR 0.04) and Yogurt (OR 0.09) are rarely on promotion.
  • Price: The Odds Ratio for price is ~0.99, suggesting price itself is not a strong predictor of whether an item is on promo or not.
# Predicting probabilities
inyange$promo_prob <- predict(model_logit, type = "response")

# Plotting probability distribution
ggplot(inyange, aes(x = promo_prob, fill = promotion_flag)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  labs(
    title = "Predicted Probability of Promotion",
    subtitle = "Separation between Promoted and Non-Promoted transactions",
    x = "Predicted Probability",
    y = "Count"
  ) +
  theme_minimal()

Interpretation: The histogram shows the model’s confidence in predicting promotions.

  • Separation: The model successfully identifies “Non-Promoted” items (Red bars clustered near 0.1).
  • Ambiguity: However, the “Promoted” items (Blue) are spread out, often overlapping with non-promoted scores. This suggests that while we can identify some patterns (like Ghee being promoted), the overall promotion strategy is inconsistent, making it difficult for the model to perfectly classify when a promotion will happen.

7. Conclusion & Next Steps

Based on the data evidence, we recommend:

  1. Re-evaluate Promotions: The linear regression showed no significant lift in quantity from promotions. We should run A/B tests to see if current discounts are actually generating value or just eroding revenue.
  2. Focus on Summer Logistics: The July revenue peak is substantial. Ensure supply chains to Karongi and Bugesera are fortified before June to capture this demand.
  3. Protect the “Cash Cows”: Fresh Milk 1L and Long Life Milk 1L are the revenue engines. These products should never be out of stock.