1. Project Overview

This analysis aims to answer key business questions regarding sales performance, channel distribution, and pricing strategies for Inyange Industries. By applying data cleaning, exploratory data analysis (EDA), and predictive modeling techniques in R, we derive actionable insights to support decision-making.

Important Disclaimer: > Please note that the dataset utilized in this project is synthetic and created exclusively for educational and demonstration purposes. It does not represent actual confidential sales records, financial metrics, or operational data of Inyange Industries. All figures, customer details, and transaction volumes are dummy data generated to showcase data science and R programming capabilities.

Key Business Objectives:

Product Performance: Identify high-revenue drivers versus low-impact products.
Channel Strategy: Evaluate the importance of Retail vs. Hotel/Restaurant vs. Wholesalers.
Geographic Analysis: Determine strongest districts to inform depot planning.
Pricing Dynamics: Analyze relationships between unit price and demand (quantity sold).
Promotion Efficacy: Determine if promotions drive volume or simply erode margins.

2. Data Loading and Cleaning

We begin by loading the raw sales dataset, cleaning column names to snake_case, and converting variables to their appropriate data types for analysis.

# Load and clean data
# Logic to handle file path validation
if(file.exists("rwanda_inyange_sales.csv")){
  inyange <- read.csv("rwanda_inyange_sales.csv", stringsAsFactors = FALSE) %>%
    clean_names()
} else {
  warning("Data file not found. Please ensure 'rwanda_inyange_sales.csv' is in your working directory.")
  stop("File 'rwanda_inyange_sales.csv' not found.") 
}

# Convert variable types
inyange <- inyange %>%
  mutate(
    date = as.Date(date),
    district = as.factor(district),
    province = as.factor(province),
    channel = as.factor(channel),
    product_category = as.factor(product_category),
    product_name = as.factor(product_name),
    package_size_ml = as.numeric(package_size_ml),
    unit_price_rwf = as.numeric(unit_price_rwf),
    quantity_sold = as.numeric(quantity_sold),
    total_sales_rwf = as.numeric(total_sales_rwf),
    # Standardize Promotion Flag
    promotion_flag = ifelse(tolower(promotion_flag) %in% c("yes", "y", "true"), "Yes", "No"),
    promotion_flag = factor(promotion_flag, levels = c("No", "Yes")),
    customer_type = as.factor(customer_type)
  )

glimpse(inyange)

## Rows: 54,720
## Columns: 12
## $ date             <date> 2025-01-01, 2025-01-01, 2025-01-01, 2025-01-01, 2025…
## $ district         <fct> Nyaruguru, Kicukiro, Bugesera, Nyabihu, Huye, Ngorore…
## $ province         <fct> Southern, Kigali, Eastern, Western, Southern, Western…
## $ channel          <fct> Wholesaler, Retail, Retail, Supermarket, Hotel/Restau…
## $ product_category <fct> Butter, Milk, Milk, Juice, Water, Butter, Milk, Yogur…
## $ product_name     <fct> Inyange Butter 250g, Inyange Fresh Milk 1L, Inyange L…
## $ package_size_ml  <dbl> 250, 1000, 1000, 500, 1500, 250, 500, 500, 250, 500, …
## $ unit_price_rwf   <dbl> 1800, 1100, 1200, 900, 720, 1800, 600, 900, 1800, 400…
## $ quantity_sold    <dbl> 12, 108, 138, 37, 133, 20, 46, 24, 11, 92, 37, 80, 72…
## $ total_sales_rwf  <dbl> 21600, 118800, 165600, 33300, 95760, 36000, 27600, 21…
## $ promotion_flag   <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, …
## $ customer_type    <fct> Household, Shop, Hotel, Hotel, School, Hotel, Shop, H…

3. Exploratory Data Analysis (EDA)

3.1 Distribution of Key Metrics

Understanding the spread of our data helps identify outliers and the general shape of our business.

# Histogram of unit price
p1 <- ggplot(inyange, aes(x = unit_price_rwf)) + 
  geom_histogram(binwidth = 100, fill = "skyblue", color = "black") + 
  labs(title = "Distribution of Unit Price", x = "Unit Price (RWF)", y = "Count") +
  theme_minimal()

# Distribution of quantity sold
p2 <- ggplot(inyange, aes(x = quantity_sold)) + 
  geom_histogram(binwidth = 10, fill = "lightgreen", color = "black") + 
  labs(title = "Distribution of Quantity Sold", x = "Quantity Sold", y = "Count") +
  theme_minimal()

# Distribution of total sales (revenue)
p3 <- ggplot(inyange, aes(x = total_sales_rwf)) +
  geom_histogram(binwidth = 10000, fill = "orange", color = "black") +
  scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "k")) +
  labs(title = "Distribution of Total Sales per Record", x = "Total Sales (RWF)", y = "Count") +
  theme_minimal()

# Combine plots using patchwork library
(p1 + p2) / p3

Interpretation: The histograms reveal three distinct structural characteristics of the data:

Unit Price: The distribution is multi-modal with distinct spikes (around 500, 800, 1500 RWF). This confirms a tiered product portfolio, likely distinguishing between low-cost items (e.g., small water/milk) and premium items (e.g., butter/ghee).
Quantity Sold: The distribution is heavily right-skewed (Poisson-like). The vast majority of transactions are for smaller quantities (<50 units), typical of retail behavior.
Total Sales: Similar to quantity, revenue per transaction is right-skewed. While most sales are small (<50k RWF), the long tail extending to 200k RWF indicates the presence of high-value bulk buyers or wholesale transactions.

3.2 Price Elasticity and Category Trends

We analyze the classic economic relationship: does higher price reduce demand?

ggplot(inyange, aes(x = unit_price_rwf, y = quantity_sold, color = product_category)) + 
  geom_point(alpha = 0.6) + 
  geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "dashed") + 
  scale_y_log10() + 
  labs(
    title = "Unit Price vs Quantity Sold", 
    subtitle = "Trend lines indicate price sensitivity",
    x = "Unit Price (RWF)",
    y = "Quantity Sold (Log Scale)"
  ) +
  theme_light()

Interpretation: The scatterplot (log scale) demonstrates a clear negative correlation between price and quantity, consistent with the Law of Demand.

High Elasticity: Low-priced items like Milk and Water (blue/light blue dots) cluster at the top-left, showing high sales volumes.
Low Elasticity: Premium categories like Butter and Ghee (red/yellow dots) cluster at the bottom-right.
Conclusion: As expected, higher unit prices correlate with lower transaction volumes. The trend line confirms that for every step up in price tier, the expected volume drops significantly.

4. Sales Performance Analysis

4.1 Top Products by Revenue

Which specific SKUs are the “cash cows”?

prod_summary <- inyange %>%
  group_by(product_name, package_size_ml, product_category) %>%
  summarise(
    total_revenue   = sum(total_sales_rwf, na.rm = TRUE),
    total_units     = sum(quantity_sold, na.rm = TRUE),
    avg_price       = mean(unit_price_rwf, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

# Display top 10 table
kable(head(prod_summary, 10), caption = "Top 10 Products by Revenue")

Top 10 Products by Revenue
product_name	package_size_ml	product_category	total_revenue	total_units	avg_price
Inyange Fresh Milk 1L	1000	Milk	487421330	452396	1077.5658
Inyange Long Life Milk 1L	1000	Milk	468150600	397881	1176.5263
Inyange Mineral Water 1.5L	1500	Water	414726880	528955	783.8487
Inyange Mango Juice 1L	1000	Juice	408465450	277803	1469.9013
Inyange Pineapple Juice 1L	1000	Juice	407215650	277124	1469.8661
Inyange Orange Juice 500ml	500	Juice	315838170	357826	882.4342
Inyange Fresh Milk 500ml	500	Milk	219603840	373583	587.8336
Inyange Strawberry Yogurt 500ml	500	Yogurt	199531800	226489	881.1049
Inyange Butter 250g	250	Butter	186397740	105683	1764.1036
Inyange Mineral Water 500ml	500	Water	182487280	465422	392.1147

# Plot Top 10
ggplot(head(prod_summary, 10),
       aes(x = reorder(product_name, total_revenue), y = total_revenue)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Top 10 Products by Revenue",
    x = "Product",
    y = "Total Revenue (RWF)"
  ) +
  theme_minimal()

Interpretation: We observe a clear hierarchy in product performance:

The “Powerhouse” Products: Inyange Fresh Milk 1L is the undisputed market leader, followed closely by Long Life Milk 1L. These two SKUs are the backbone of the company’s revenue.
Beverages: Mineral Water 1.5L and Mango/Pineapple Juices form the second tier of revenue drivers.
Niche Items: Butter 250g and Mineral Water 500ml contribute significantly less revenue.

Strategy: Supply chain stability for the 1L Milk varieties is critical, as stockouts here would have the largest negative impact on the bottom line.

4.2 Performance by Channel

Which sales channels are driving the most value?

channel_summary <- inyange %>%
  group_by(channel) %>%
  summarise(
    total_revenue = sum(total_sales_rwf, na.rm = TRUE),
    total_units   = sum(quantity_sold, na.rm = TRUE),
    avg_order_qty = mean(quantity_sold, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

# Revenue by Channel Plot
ggplot(channel_summary,
       aes(x = reorder(channel, total_revenue), y = total_revenue, group = 1)) +
  geom_line(color = "purple", size = 1) +
  geom_point(color = "red", size = 3) +
  coord_flip() +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Total Revenue by Channel",
    x = "Channel",
    y = "Total Revenue (RWF)"
  ) +
  theme_light()

Interpretation:

Leading Channel: Supermarkets generate the highest total revenue (~912M RWF).
Channel Parity: Interestingly, the revenue spread between channels is relatively tight (ranging from ~902M for Wholesalers to ~912M for Supermarkets). This suggests a balanced market penetration where no single channel is overwhelmingly dominant or failing. All channels are performing within a ~1% margin of each other.

4.3 Geographic Distribution (Districts)

Where is the demand concentrated geographically?

prov_summary <- inyange %>%
  group_by(district) %>%
  summarise(
    total_revenue = sum(total_sales_rwf, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

# Plotting Revenue by District
ggplot(head(prov_summary, 15), aes(x = reorder(district, total_revenue), y = total_revenue)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Top 15 Districts by Revenue",
    x = "District",
    y = "Total Revenue (RWF)"
  ) +
  theme_minimal()

Interpretation:

Top Districts: Karongi is the highest-performing district, followed closely by Bugesera and Nyanza.
Strategic Implication: The top 5 districts (Karongi, Bugesera, Nyanza, Rulindo, Nyagatare) account for a massive portion of sales. Any regional depots or logistics hubs should be situated near these high-volume zones to minimize transportation costs.

4.4 Sales Trend Over Time

Tracking revenue performance over the recorded period.

# Aggregating sales by month
monthly_sales <- inyange %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(total_revenue = sum(total_sales_rwf, na.rm = TRUE))

ggplot(monthly_sales, aes(x = month, y = total_revenue)) +
  geom_line(color = "darkgreen", size = 1.2) +
  geom_smooth(method = "loess", se = FALSE, color = "gray", linetype = "dashed") +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Monthly Revenue Trend",
    subtitle = "Tracking performance over time",
    x = "Date",
    y = "Total Revenue (RWF)"
  ) +
  theme_minimal()

Interpretation: The time series reveals significant seasonality and volatility:

The Dip: There is a sharp decline in revenue around April, which coincides with the rainy season in Rwanda. This could suggest logistical challenges in distribution or a shift in consumer consumption patterns during cooler/wetter months.
The Peak: Revenue surges to a peak in July/August, likely driven by the dry season (increasing demand for water/juice) and summer holidays.
Action: Inventory planning needs to account for the July surge to prevent stockouts during peak demand.

5. Promotion Impact Analysis

Do promotions actually work? We compare metrics between promoted and non-promoted transactions.

# Overall impact of promotions
promo_summary <- inyange %>%
  group_by(promotion_flag) %>%
  summarise(
    total_revenue = sum(total_sales_rwf, na.rm = TRUE),
    total_units   = sum(quantity_sold, na.rm = TRUE),
    avg_qty       = mean(quantity_sold, na.rm = TRUE),
    avg_rev       = mean(total_sales_rwf, na.rm = TRUE),
    .groups = "drop"
  )

kable(promo_summary, caption = "Summary of Metrics: Promotion vs No Promotion")

Summary of Metrics: Promotion vs No Promotion
promotion_flag	total_revenue	total_units	avg_qty	avg_rev
No	2964274900	2967155	67.84705	67781.19
Yes	668890170	743533	67.67389	60880.15

# Visualizing Distributions
v1 <- ggplot(inyange, aes(x = promotion_flag, y = quantity_sold, fill = promotion_flag)) +
  geom_violin(trim = FALSE, alpha = 0.6, show.legend = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.1) +
  labs(title = "Quantity per Transaction", x = "Promotion", y = "Qty") +
  theme_light()

v2 <- ggplot(inyange, aes(x = promotion_flag, y = total_sales_rwf, fill = promotion_flag)) +
  geom_violin(trim = FALSE, alpha = 0.6, show.legend = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.1) +
  labs(title = "Revenue per Transaction", x = "Promotion", y = "Sales (RWF)") +
  theme_light()

v1 + v2

Interpretation:

Visual Evidence: The violin plots show a very high degree of overlap between the “No” (Red) and “Yes” (Blue) distributions for both Quantity and Sales.
Volume Lift: While the “Yes” promotion shape is slightly thicker at the higher quantity range, the difference is not drastic.
Revenue Risk: Since the revenue distributions look nearly identical, there is a risk that promotions are effectively lowering margins without driving enough extra volume to compensate. We need to be careful that we aren’t just “giving away margin.”

6. Predictive Modeling

6.1 Linear Regression: Predicting Demand

Business Question: How does price elasticity and channel affect the volume sold?

Note: We predict quantity_sold rather than total_sales to avoid mathematical redundancy, as Sales = Price × Quantity.

lm_demand <- lm(
  quantity_sold ~ unit_price_rwf + product_category + channel + promotion_flag,
  data = inyange
)

summary(lm_demand)

## 
## Call:
## lm(formula = quantity_sold ~ unit_price_rwf + product_category + 
##     channel + promotion_flag, data = inyange)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.458  -12.391   -0.356   11.923  134.123 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            22.8354473  0.9116123  25.050  < 2e-16 ***
## unit_price_rwf         -0.0006674  0.0004650  -1.435    0.151    
## product_categoryGhee   -4.4638386  0.9377579  -4.760 1.94e-06 ***
## product_categoryJuice  43.1919163  0.4517892  95.602  < 2e-16 ***
## product_categoryMilk   65.2552221  0.5330866 122.410  < 2e-16 ***
## product_categoryWater  86.5253334  0.6761044 127.976  < 2e-16 ***
## product_categoryYogurt 21.3655290  0.5737667  37.237  < 2e-16 ***
## channelRetail           0.0503190  0.2763691   0.182    0.856    
## channelSupermarket      0.3646419  0.2759951   1.321    0.186    
## channelWholesaler      -0.0282452  0.2764531  -0.102    0.919    
## promotion_flagYes      -0.1343055  0.2499557  -0.537    0.591    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.85 on 54709 degrees of freedom
## Multiple R-squared:  0.6124, Adjusted R-squared:  0.6123 
## F-statistic:  8643 on 10 and 54709 DF,  p-value: < 2.2e-16

Interpretation: The model explains 61.2% (R-squared) of the variance in quantity sold, which is a strong result.

Category is King: The most significant drivers of quantity are the product categories. Compared to the baseline (Butter), products like Water (t-value = 127) and Milk (t-value = 122) have massively higher baseline quantities.
Price Insignificance: In this specific model, unit_price_rwf has a p-value of 0.151, meaning it is not statistically significant when product category is already accounted for. This suggests that the category determines the volume more than small price fluctuations within that category.
Promotions Ineffective? The coefficient for promotion_flagYes is -0.13 with a p-value of 0.591. This is a critical finding: Promotions are not having a statistically significant impact on increasing quantity sold in this model. This warrants a review of the current promotion strategy.

6.2 Logistic Regression: Predicting Promotions

Business Question: Which factors make a transaction more likely to be under promotion?

model_logit <- glm(
  promotion_flag ~ unit_price_rwf + quantity_sold +
    product_category + channel + district,
  data = inyange,
  family = binomial
)

# Extract Odds Ratios
odds_ratios <- exp(coef(model_logit))
ci_odds <- exp(confint(model_logit))

# Create results table
results_logit <- data.frame(
  Predictor = names(odds_ratios),
  Odds_Ratio = round(odds_ratios, 4),
  CI_Lower = round(ci_odds[,1], 4),
  CI_Upper = round(ci_odds[,2], 4)
)

kable(head(results_logit, 10), caption = "Odds Ratios (First 10 predictors)")

Odds Ratios (First 10 predictors)
	Predictor	Odds_Ratio	CI_Lower	CI_Upper
(Intercept)	(Intercept)	34.1170	27.1711	42.8422
unit_price_rwf	unit_price_rwf	0.9973	0.9972	0.9974
quantity_sold	quantity_sold	0.9998	0.9988	1.0007
product_categoryGhee	product_categoryGhee	91.8065	74.4095	113.3097
product_categoryJuice	product_categoryJuice	0.2063	0.1836	0.2318
product_categoryMilk	product_categoryMilk	0.1044	0.0906	0.1201
product_categoryWater	product_categoryWater	0.0427	0.0359	0.0508
product_categoryYogurt	product_categoryYogurt	0.0994	0.0874	0.1130
channelRetail	channelRetail	0.9436	0.8878	1.0028
channelSupermarket	channelSupermarket	0.9671	0.9104	1.0274

Interpretation of Odds Ratios:

Category Bias: The strongest predictor of a promotion is the product category. Ghee has an Odds Ratio of 91.8, meaning Ghee transactions are overwhelmingly likely to be on promotion compared to the baseline.
Low Promo Items: Conversely, Water (OR 0.04) and Yogurt (OR 0.09) are rarely on promotion.
Price: The Odds Ratio for price is ~0.99, suggesting price itself is not a strong predictor of whether an item is on promo or not.

# Predicting probabilities
inyange$promo_prob <- predict(model_logit, type = "response")

# Plotting probability distribution
ggplot(inyange, aes(x = promo_prob, fill = promotion_flag)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  labs(
    title = "Predicted Probability of Promotion",
    subtitle = "Separation between Promoted and Non-Promoted transactions",
    x = "Predicted Probability",
    y = "Count"
  ) +
  theme_minimal()

Interpretation: The histogram shows the model’s confidence in predicting promotions.

Separation: The model successfully identifies “Non-Promoted” items (Red bars clustered near 0.1).
Ambiguity: However, the “Promoted” items (Blue) are spread out, often overlapping with non-promoted scores. This suggests that while we can identify some patterns (like Ghee being promoted), the overall promotion strategy is inconsistent, making it difficult for the model to perfectly classify when a promotion will happen.

7. Conclusion & Next Steps

Based on the data evidence, we recommend:

Re-evaluate Promotions: The linear regression showed no significant lift in quantity from promotions. We should run A/B tests to see if current discounts are actually generating value or just eroding revenue.
Focus on Summer Logistics: The July revenue peak is substantial. Ensure supply chains to Karongi and Bugesera are fortified before June to capture this demand.
Protect the “Cash Cows”: Fresh Milk 1L and Long Life Milk 1L are the revenue engines. These products should never be out of stock.

Inyange Sales Analysis & Predictive Modeling

Nshuti Elise

2025-11-23

1. Project Overview

2. Data Loading and Cleaning

3. Exploratory Data Analysis (EDA)

3.1 Distribution of Key Metrics

3.2 Price Elasticity and Category Trends

4. Sales Performance Analysis

4.1 Top Products by Revenue

4.2 Performance by Channel

4.3 Geographic Distribution (Districts)

4.4 Sales Trend Over Time

5. Promotion Impact Analysis

6. Predictive Modeling

6.1 Linear Regression: Predicting Demand

6.2 Logistic Regression: Predicting Promotions

7. Conclusion & Next Steps