This analysis aims to answer key business questions regarding sales performance, channel distribution, and pricing strategies for Inyange Industries. By applying data cleaning, exploratory data analysis (EDA), and predictive modeling techniques in R, we derive actionable insights to support decision-making.
Important Disclaimer: > Please note that the dataset utilized in this project is synthetic and created exclusively for educational and demonstration purposes. It does not represent actual confidential sales records, financial metrics, or operational data of Inyange Industries. All figures, customer details, and transaction volumes are dummy data generated to showcase data science and R programming capabilities.
Key Business Objectives:
We begin by loading the raw sales dataset, cleaning column names to
snake_case, and converting variables to their appropriate
data types for analysis.
# Load and clean data
# Logic to handle file path validation
if(file.exists("rwanda_inyange_sales.csv")){
inyange <- read.csv("rwanda_inyange_sales.csv", stringsAsFactors = FALSE) %>%
clean_names()
} else {
warning("Data file not found. Please ensure 'rwanda_inyange_sales.csv' is in your working directory.")
stop("File 'rwanda_inyange_sales.csv' not found.")
}
# Convert variable types
inyange <- inyange %>%
mutate(
date = as.Date(date),
district = as.factor(district),
province = as.factor(province),
channel = as.factor(channel),
product_category = as.factor(product_category),
product_name = as.factor(product_name),
package_size_ml = as.numeric(package_size_ml),
unit_price_rwf = as.numeric(unit_price_rwf),
quantity_sold = as.numeric(quantity_sold),
total_sales_rwf = as.numeric(total_sales_rwf),
# Standardize Promotion Flag
promotion_flag = ifelse(tolower(promotion_flag) %in% c("yes", "y", "true"), "Yes", "No"),
promotion_flag = factor(promotion_flag, levels = c("No", "Yes")),
customer_type = as.factor(customer_type)
)
glimpse(inyange)## Rows: 54,720
## Columns: 12
## $ date <date> 2025-01-01, 2025-01-01, 2025-01-01, 2025-01-01, 2025…
## $ district <fct> Nyaruguru, Kicukiro, Bugesera, Nyabihu, Huye, Ngorore…
## $ province <fct> Southern, Kigali, Eastern, Western, Southern, Western…
## $ channel <fct> Wholesaler, Retail, Retail, Supermarket, Hotel/Restau…
## $ product_category <fct> Butter, Milk, Milk, Juice, Water, Butter, Milk, Yogur…
## $ product_name <fct> Inyange Butter 250g, Inyange Fresh Milk 1L, Inyange L…
## $ package_size_ml <dbl> 250, 1000, 1000, 500, 1500, 250, 500, 500, 250, 500, …
## $ unit_price_rwf <dbl> 1800, 1100, 1200, 900, 720, 1800, 600, 900, 1800, 400…
## $ quantity_sold <dbl> 12, 108, 138, 37, 133, 20, 46, 24, 11, 92, 37, 80, 72…
## $ total_sales_rwf <dbl> 21600, 118800, 165600, 33300, 95760, 36000, 27600, 21…
## $ promotion_flag <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, …
## $ customer_type <fct> Household, Shop, Hotel, Hotel, School, Hotel, Shop, H…
Understanding the spread of our data helps identify outliers and the general shape of our business.
# Histogram of unit price
p1 <- ggplot(inyange, aes(x = unit_price_rwf)) +
geom_histogram(binwidth = 100, fill = "skyblue", color = "black") +
labs(title = "Distribution of Unit Price", x = "Unit Price (RWF)", y = "Count") +
theme_minimal()
# Distribution of quantity sold
p2 <- ggplot(inyange, aes(x = quantity_sold)) +
geom_histogram(binwidth = 10, fill = "lightgreen", color = "black") +
labs(title = "Distribution of Quantity Sold", x = "Quantity Sold", y = "Count") +
theme_minimal()
# Distribution of total sales (revenue)
p3 <- ggplot(inyange, aes(x = total_sales_rwf)) +
geom_histogram(binwidth = 10000, fill = "orange", color = "black") +
scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "k")) +
labs(title = "Distribution of Total Sales per Record", x = "Total Sales (RWF)", y = "Count") +
theme_minimal()
# Combine plots using patchwork library
(p1 + p2) / p3Interpretation: The histograms reveal three distinct structural characteristics of the data:
We analyze the classic economic relationship: does higher price reduce demand?
ggplot(inyange, aes(x = unit_price_rwf, y = quantity_sold, color = product_category)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "dashed") +
scale_y_log10() +
labs(
title = "Unit Price vs Quantity Sold",
subtitle = "Trend lines indicate price sensitivity",
x = "Unit Price (RWF)",
y = "Quantity Sold (Log Scale)"
) +
theme_light()Interpretation: The scatterplot (log scale) demonstrates a clear negative correlation between price and quantity, consistent with the Law of Demand.
Which specific SKUs are the “cash cows”?
prod_summary <- inyange %>%
group_by(product_name, package_size_ml, product_category) %>%
summarise(
total_revenue = sum(total_sales_rwf, na.rm = TRUE),
total_units = sum(quantity_sold, na.rm = TRUE),
avg_price = mean(unit_price_rwf, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(total_revenue))
# Display top 10 table
kable(head(prod_summary, 10), caption = "Top 10 Products by Revenue")| product_name | package_size_ml | product_category | total_revenue | total_units | avg_price |
|---|---|---|---|---|---|
| Inyange Fresh Milk 1L | 1000 | Milk | 487421330 | 452396 | 1077.5658 |
| Inyange Long Life Milk 1L | 1000 | Milk | 468150600 | 397881 | 1176.5263 |
| Inyange Mineral Water 1.5L | 1500 | Water | 414726880 | 528955 | 783.8487 |
| Inyange Mango Juice 1L | 1000 | Juice | 408465450 | 277803 | 1469.9013 |
| Inyange Pineapple Juice 1L | 1000 | Juice | 407215650 | 277124 | 1469.8661 |
| Inyange Orange Juice 500ml | 500 | Juice | 315838170 | 357826 | 882.4342 |
| Inyange Fresh Milk 500ml | 500 | Milk | 219603840 | 373583 | 587.8336 |
| Inyange Strawberry Yogurt 500ml | 500 | Yogurt | 199531800 | 226489 | 881.1049 |
| Inyange Butter 250g | 250 | Butter | 186397740 | 105683 | 1764.1036 |
| Inyange Mineral Water 500ml | 500 | Water | 182487280 | 465422 | 392.1147 |
# Plot Top 10
ggplot(head(prod_summary, 10),
aes(x = reorder(product_name, total_revenue), y = total_revenue)) +
geom_col(fill = "steelblue") +
coord_flip() +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
labs(
title = "Top 10 Products by Revenue",
x = "Product",
y = "Total Revenue (RWF)"
) +
theme_minimal()Interpretation: We observe a clear hierarchy in product performance:
Inyange Fresh Milk 1L is the undisputed market leader,
followed closely by Long Life Milk 1L. These two SKUs are
the backbone of the company’s revenue.Mineral Water 1.5L and
Mango/Pineapple Juices form the second tier of revenue
drivers.Butter 250g and
Mineral Water 500ml contribute significantly less
revenue.Strategy: Supply chain stability for the 1L Milk varieties is critical, as stockouts here would have the largest negative impact on the bottom line.
Which sales channels are driving the most value?
channel_summary <- inyange %>%
group_by(channel) %>%
summarise(
total_revenue = sum(total_sales_rwf, na.rm = TRUE),
total_units = sum(quantity_sold, na.rm = TRUE),
avg_order_qty = mean(quantity_sold, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(total_revenue))
# Revenue by Channel Plot
ggplot(channel_summary,
aes(x = reorder(channel, total_revenue), y = total_revenue, group = 1)) +
geom_line(color = "purple", size = 1) +
geom_point(color = "red", size = 3) +
coord_flip() +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
labs(
title = "Total Revenue by Channel",
x = "Channel",
y = "Total Revenue (RWF)"
) +
theme_light()Interpretation:
Where is the demand concentrated geographically?
prov_summary <- inyange %>%
group_by(district) %>%
summarise(
total_revenue = sum(total_sales_rwf, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(total_revenue))
# Plotting Revenue by District
ggplot(head(prov_summary, 15), aes(x = reorder(district, total_revenue), y = total_revenue)) +
geom_col(fill = "darkorange") +
coord_flip() +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
labs(
title = "Top 15 Districts by Revenue",
x = "District",
y = "Total Revenue (RWF)"
) +
theme_minimal()Interpretation:
Tracking revenue performance over the recorded period.
# Aggregating sales by month
monthly_sales <- inyange %>%
mutate(month = floor_date(date, "month")) %>%
group_by(month) %>%
summarise(total_revenue = sum(total_sales_rwf, na.rm = TRUE))
ggplot(monthly_sales, aes(x = month, y = total_revenue)) +
geom_line(color = "darkgreen", size = 1.2) +
geom_smooth(method = "loess", se = FALSE, color = "gray", linetype = "dashed") +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
labs(
title = "Monthly Revenue Trend",
subtitle = "Tracking performance over time",
x = "Date",
y = "Total Revenue (RWF)"
) +
theme_minimal()Interpretation: The time series reveals significant seasonality and volatility:
Do promotions actually work? We compare metrics between promoted and non-promoted transactions.
# Overall impact of promotions
promo_summary <- inyange %>%
group_by(promotion_flag) %>%
summarise(
total_revenue = sum(total_sales_rwf, na.rm = TRUE),
total_units = sum(quantity_sold, na.rm = TRUE),
avg_qty = mean(quantity_sold, na.rm = TRUE),
avg_rev = mean(total_sales_rwf, na.rm = TRUE),
.groups = "drop"
)
kable(promo_summary, caption = "Summary of Metrics: Promotion vs No Promotion")| promotion_flag | total_revenue | total_units | avg_qty | avg_rev |
|---|---|---|---|---|
| No | 2964274900 | 2967155 | 67.84705 | 67781.19 |
| Yes | 668890170 | 743533 | 67.67389 | 60880.15 |
# Visualizing Distributions
v1 <- ggplot(inyange, aes(x = promotion_flag, y = quantity_sold, fill = promotion_flag)) +
geom_violin(trim = FALSE, alpha = 0.6, show.legend = FALSE) +
geom_jitter(width = 0.15, alpha = 0.1) +
labs(title = "Quantity per Transaction", x = "Promotion", y = "Qty") +
theme_light()
v2 <- ggplot(inyange, aes(x = promotion_flag, y = total_sales_rwf, fill = promotion_flag)) +
geom_violin(trim = FALSE, alpha = 0.6, show.legend = FALSE) +
geom_jitter(width = 0.15, alpha = 0.1) +
labs(title = "Revenue per Transaction", x = "Promotion", y = "Sales (RWF)") +
theme_light()
v1 + v2Interpretation:
Business Question: How does price elasticity and channel affect the volume sold?
Note: We predict quantity_sold rather than
total_sales to avoid mathematical redundancy, as Sales =
Price × Quantity.
lm_demand <- lm(
quantity_sold ~ unit_price_rwf + product_category + channel + promotion_flag,
data = inyange
)
summary(lm_demand)##
## Call:
## lm(formula = quantity_sold ~ unit_price_rwf + product_category +
## channel + promotion_flag, data = inyange)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.458 -12.391 -0.356 11.923 134.123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.8354473 0.9116123 25.050 < 2e-16 ***
## unit_price_rwf -0.0006674 0.0004650 -1.435 0.151
## product_categoryGhee -4.4638386 0.9377579 -4.760 1.94e-06 ***
## product_categoryJuice 43.1919163 0.4517892 95.602 < 2e-16 ***
## product_categoryMilk 65.2552221 0.5330866 122.410 < 2e-16 ***
## product_categoryWater 86.5253334 0.6761044 127.976 < 2e-16 ***
## product_categoryYogurt 21.3655290 0.5737667 37.237 < 2e-16 ***
## channelRetail 0.0503190 0.2763691 0.182 0.856
## channelSupermarket 0.3646419 0.2759951 1.321 0.186
## channelWholesaler -0.0282452 0.2764531 -0.102 0.919
## promotion_flagYes -0.1343055 0.2499557 -0.537 0.591
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.85 on 54709 degrees of freedom
## Multiple R-squared: 0.6124, Adjusted R-squared: 0.6123
## F-statistic: 8643 on 10 and 54709 DF, p-value: < 2.2e-16
Interpretation: The model explains 61.2% (R-squared) of the variance in quantity sold, which is a strong result.
t-value = 127) and
Milk (t-value = 122) have massively higher
baseline quantities.unit_price_rwf has a p-value of 0.151,
meaning it is not statistically significant when
product category is already accounted for. This suggests that the
category determines the volume more than small price
fluctuations within that category.promotion_flagYes is -0.13 with a p-value of
0.591. This is a critical finding: Promotions
are not having a statistically significant impact on increasing quantity
sold in this model. This warrants a review of the current
promotion strategy.Business Question: Which factors make a transaction more likely to be under promotion?
model_logit <- glm(
promotion_flag ~ unit_price_rwf + quantity_sold +
product_category + channel + district,
data = inyange,
family = binomial
)
# Extract Odds Ratios
odds_ratios <- exp(coef(model_logit))
ci_odds <- exp(confint(model_logit))
# Create results table
results_logit <- data.frame(
Predictor = names(odds_ratios),
Odds_Ratio = round(odds_ratios, 4),
CI_Lower = round(ci_odds[,1], 4),
CI_Upper = round(ci_odds[,2], 4)
)
kable(head(results_logit, 10), caption = "Odds Ratios (First 10 predictors)")| Predictor | Odds_Ratio | CI_Lower | CI_Upper | |
|---|---|---|---|---|
| (Intercept) | (Intercept) | 34.1170 | 27.1711 | 42.8422 |
| unit_price_rwf | unit_price_rwf | 0.9973 | 0.9972 | 0.9974 |
| quantity_sold | quantity_sold | 0.9998 | 0.9988 | 1.0007 |
| product_categoryGhee | product_categoryGhee | 91.8065 | 74.4095 | 113.3097 |
| product_categoryJuice | product_categoryJuice | 0.2063 | 0.1836 | 0.2318 |
| product_categoryMilk | product_categoryMilk | 0.1044 | 0.0906 | 0.1201 |
| product_categoryWater | product_categoryWater | 0.0427 | 0.0359 | 0.0508 |
| product_categoryYogurt | product_categoryYogurt | 0.0994 | 0.0874 | 0.1130 |
| channelRetail | channelRetail | 0.9436 | 0.8878 | 1.0028 |
| channelSupermarket | channelSupermarket | 0.9671 | 0.9104 | 1.0274 |
Interpretation of Odds Ratios:
Ghee has an Odds Ratio
of 91.8, meaning Ghee transactions are overwhelmingly
likely to be on promotion compared to the baseline.Water (OR
0.04) and Yogurt (OR 0.09) are rarely on promotion.# Predicting probabilities
inyange$promo_prob <- predict(model_logit, type = "response")
# Plotting probability distribution
ggplot(inyange, aes(x = promo_prob, fill = promotion_flag)) +
geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
labs(
title = "Predicted Probability of Promotion",
subtitle = "Separation between Promoted and Non-Promoted transactions",
x = "Predicted Probability",
y = "Count"
) +
theme_minimal()
Interpretation: The histogram shows the model’s
confidence in predicting promotions.
Based on the data evidence, we recommend: