TOPICS

-> Summary Statistics

-> Analytical Focus Areas for Supermart Grocery Sales Dataset:

1. Sales performance analysis

2. Customer Behavior and Preferences

3. Profitability and Cost Analysis

4. Market Segmentation and Targeting

-> Conclusions

Loading the Supermart csv file

data <- read.csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

Loading the necessary libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)

Summarizing the data of supermart grocery sales

These descriptive statistics help in providing an overview of our data’s characteristics and distribution.

summary_data<-summary(data)
summary_data
##    Order.ID         CustomerName         Category         SubCategory       
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      City            OrderDate            Region              Sales     
##  Length:9994        Length:9994        Length:9994        Min.   : 500  
##  Class :character   Class :character   Class :character   1st Qu.:1000  
##  Mode  :character   Mode  :character   Mode  :character   Median :1498  
##                                                           Mean   :1497  
##                                                           3rd Qu.:1995  
##                                                           Max.   :2500  
##     Discount          Profit           State            ProfitRange    
##  Min.   :0.1000   Min.   :  25.25   Length:9994        Min.   :0.0000  
##  1st Qu.:0.1600   1st Qu.: 180.02   Class :character   1st Qu.:0.0000  
##  Median :0.2300   Median : 320.78   Mode  :character   Median :0.0000  
##  Mean   :0.2268   Mean   : 374.94                      Mean   :0.2764  
##  3rd Qu.:0.2900   3rd Qu.: 525.63                      3rd Qu.:1.0000  
##  Max.   :0.3500   Max.   :1120.95                      Max.   :1.0000
  • Order ID: It contains character data and has 9994 rows.

  • Customer Name: It contains character data and has 9994 rows.

  • Category: It contains character data and has 9994 rows.

  • Subcategory: It contains character data and has 9994 rows.

  • City: It contains character data and has 9994 rows.

  • Order Date: It contains character data and has 9994 rows.

  • Region: It contains character data and has 9994 rows.

  • Sales: The minimum sales value is 500, the maximum is 2500, with a median of 1498 and a mean of 1497.

  • Discount: The minimum discount is 0.1, the maximum is 0.35, with a median of 0.23 and a mean of 0.2268.

  • Profit: The minimum profit is 25.25, the maximum is 1120.95, with a median of 320.78 and a mean of 374.94.

  • State: It contains character data and has 9994 rows.

  • Profit Range: The minimum value is 0, the maximum is 1, with a median and mean of 0.

Sales performance analysis:

(a) What are the top-selling categories and subcategories?

Identifying the top-selling categories and subcategories is crucial in retail analysis as it helps businesses understand consumer preferences and allocate resources effectively to capitalize on high-demand products, ultimately driving sales and profitability.

Let’s use descriptive statistics and visualization techniques to identify the top-selling categories and subcategories in our dataset.

top_categories <- data %>%
  group_by(Category) %>%
  summarise(total_sales = sum(Sales), .groups = "drop_last") %>%
  arrange(desc(total_sales)) %>%
  head()

top_subcategories <- data %>%
  group_by(Category, SubCategory) %>%
  summarise(total_sales = sum(Sales), .groups = "drop_last") %>%
  arrange(desc(total_sales)) %>%
  group_by(Category) %>%
  slice(1)

# Plot for top categories
ggplot(top_categories, aes(x = reorder(Category, -total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Top-selling Categories", x = "Category", y = "Total Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plot for top subcategories
ggplot(top_subcategories, aes(x = reorder(SubCategory, -total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "pink") +
  labs(title = "Top-selling Subcategories", x = "Subcategory", y = "Total Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Observations:

- From the above plots, it can be observed that the top-selling category is “Eggs, Meat, and Fish” with the subcategory being “Health Drinks”.

Inferences:

- The growing interest in “Eggs, Meat, and Fish” and “Health Drinks” among customers shows that people are becoming more health-conscious about what they eat.

- This means that stores and companies need to change the products they sell and how they advertise them to better suit the new focus on health in what people buy.

(b) Which products have the highest sales revenue and profit margins?

Knowing which products generate the highest sales revenue and profit margins is essential in retail analysis as it enables businesses to focus on promoting and optimizing the most profitable items, leading to increased revenue and improved overall profitability.

Firstly, let’s understand the relationship between sales revenue and profit margins using linear regression modeling.

model1 <- lm(Profit ~ Sales, data = data)
summary(model1)
## 
## Call:
## lm(formula = Profit ~ Sales, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -500.66 -128.83   -0.93  128.84  495.94 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.422974   5.306490  -0.268    0.789    
## Sales        0.251477   0.003308  76.022   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 191 on 9992 degrees of freedom
## Multiple R-squared:  0.3664, Adjusted R-squared:  0.3664 
## F-statistic:  5779 on 1 and 9992 DF,  p-value: < 2.2e-16

Observations:

  1. Impact of sales on profit: The coefficient estimate for the sales is 0.251477. This means that for every unit increase in sales revenue, the profit margin increases by approximately 0.251477 units. This positive coefficient suggests that higher sales revenue is associated with higher profit margins.

  2. Statistical significance: The p-value associated with the coefficient estimate for “Sales” is less than 0.001 (<2e-16). This indicates that the relationship between sales revenue and profit margins is statistically significant. So, the observed relationship is unlikely to have occurred by random chance alone.

Understanding the relationship between sales and profit is quite important if we want to improve business performance and decision-making. By identifying products with the highest sales revenue and profit margins, retailers can prioritize resources, allocate marketing budgets effectively, and make informed decisions about pricing, inventory management, and product offerings.

Next, let’s look at the products that have the highest sales revenue and profit margins in our dataset.

high_sales <- data %>%
  group_by(Category) %>%
  summarise(total_sales = sum(Sales), .groups = "drop_last") %>%
  arrange(desc(total_sales)) %>%
  head()

high_profit <- data %>%
  group_by(Category) %>%
  summarise(total_profit = sum(Profit), .groups = "drop_last") %>%
  arrange(desc(total_profit)) %>%
  head()

# Plot top sales products
ggplot(high_sales, aes(x = reorder(Category, -total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Top Sales Products", x = "Category", y = "Total Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plot top profit products
ggplot(high_profit, aes(x = reorder(Category, -total_profit), y = total_profit)) +
  geom_bar(stat = "identity", fill = "pink") +
  labs(title = "Top Profit Products", x = "Category", y = "Total Profit") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Observations:

We can observe that “Eggs, Meat, and Fish” are the top-selling products(as we observed before) while “Snacks”(surprisingly) emerge as the top profit product.

The shift from health-conscious products like health drinks (previously seen as top-selling) to snacks (top profit product) can suggest a diverse range of consumer preferences and purchasing behaviors.

Inferences:

Regarding why snacks may have the highest profit margins and the sudden shift from health-conscious to snacks, one possible explanation is that snack products often have higher markups and lower production costs compared to health-conscious alternatives. Moreover, shifts in lifestyle changes, such as increased snacking occasions may contribute to the surge in profitability for snacks.

(d) Can we identify any outliers or underperforming products that may require attention?

Identifying outliers or underperforming products is crucial in retail analysis as it helps businesses pinpoint areas needing attention or improvement.

Let’s conduct an analysis to identify any outliers or underperforming products that may require attention. We will compare the sales performance of a ‘Beverages’(chose ‘beverages’ since it had the least sales as well as profit margin) against the average sales performance of all products in the dataset. By defining hypotheses, setting a significance level, and performing a one-sample t-test, we can determine if there is a significant difference in the sales performance.

Hypothesis Testing:

-Defining the null hypothesis (H0) and alternative hypothesis (H1):

  • H0: There is no significant difference between the sales performance of beverages and the average sales performance of all products.

  • H1: There is a significant difference between the sales performance of beverages and the average sales performance of all products.

-Setting the significance level (α), typically 0.05.

-Since we are comparing the sales performance of one product against the average sales performance of all products, we will be using a one-sample t-test.

# Getting sales data of 'Beverages'
bev_sales <- subset(data, Category == "Beverages")$Sales

# average sales performance of all products
average_sales <- mean(data$Sales)

# Performing one-sample t-test
result <- t.test(bev_sales, mu = average_sales)
print(result)
## 
##  One Sample t-test
## 
## data:  bev_sales
## t = -0.45312, df = 1399, p-value = 0.6505
## alternative hypothesis: true mean is not equal to 1496.596
## 95 percent confidence interval:
##  1458.829 1520.190
## sample estimates:
## mean of x 
##  1489.509

Observations:

We can see that the p-value is greater than the typical significance level of 0.05, which means that there is not enough evidence to reject our null hypothesis. Hence, we fail to reject our hypothesis that the true mean sales performance of beverages is equal to the overall average sales performance of all products i.e., there is no significant difference between the sales performance of beverages and the average sales performance of all products.

Therefore, there is no evidence to indicate that beverages are outliers or underperforming products that require immediate attention.

Inferences:

While beverages may not be outliers or underperforming products based on their sales performance relative to the overall average, there could be other factors affecting their low sales and profit margins. These factors can be pricing strategies, competition, consumer preferences, or marketing effectiveness.

Customer Behavior and Preferences:

(b) Are there any geographic variations in purchasing behavior?

Understanding geographic variations in purchasing behavior is important as it enables businesses to tailor marketing strategies, and promotional activities in specific regions, maximizing sales and customer satisfaction.

Let’s observe the purchasing behavior variations within each city in our dataset by firstly performing a geographic analysis(using linear regression model) and then plotting the sales for each city(to kind of verify our model’s results).

# Geographic analysis
geo_analysis <- lm(Sales ~ City, data = data)
summary(geo_analysis)
## 
## Call:
## lm(formula = Sales ~ City, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1047.29  -495.38     3.32   498.02  1047.65 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1509.450     27.473  54.943   <2e-16 ***
## CityChennai         -39.628     39.077  -1.014    0.311    
## CityCoimbatore      -26.394     39.169  -0.674    0.500    
## CityCumbum           -8.138     39.431  -0.206    0.836    
## CityDharmapuri       10.638     40.522   0.263    0.793    
## CityDindigul        -55.837     39.965  -1.397    0.162    
## CityKanyakumari      30.341     38.491   0.788    0.431    
## CityKarur           -15.792     39.123  -0.404    0.686    
## CityKrishnagiri     -61.102     38.897  -1.571    0.116    
## CityMadurai           4.854     39.654   0.122    0.903    
## CityNagercoil       -31.072     40.610  -0.765    0.444    
## CityNamakkal        -24.264     39.782  -0.610    0.542    
## CityOoty            -26.054     39.756  -0.655    0.512    
## CityPerambalur       10.683     39.032   0.274    0.784    
## CityPudukottai        9.571     39.123   0.245    0.807    
## CityRamanadhapuram   -2.595     39.334  -0.066    0.947    
## CitySalem            15.127     39.100   0.387    0.699    
## CityTenkasi         -19.515     39.077  -0.499    0.618    
## CityTheni           -11.897     40.210  -0.296    0.767    
## CityTirunelveli     -30.051     38.766  -0.775    0.438    
## CityTrichy            7.085     41.101   0.172    0.863    
## CityVellore          45.837     39.009   1.175    0.240    
## CityViluppuram      -45.284     39.939  -1.134    0.257    
## CityVirudhunagar    -50.748     39.455  -1.286    0.198    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 577.6 on 9970 degrees of freedom
## Multiple R-squared:  0.002194,   Adjusted R-squared:  -0.0001078 
## F-statistic: 0.9532 on 23 and 9970 DF,  p-value: 0.525

Observations:

The coefficients associated with each city indicate the estimated change in sales for each unit change in the respective city, holding other factors constant. However, most of the coefficients are not statistically significant, as indicated by their p-values being greater than 0.05. This suggests that the city alone may not be a strong predictor of sales behavior, and other factors may have a more significant influence.

Inferences:

The geographic analysis shows that there aren’t huge differences in how people purchase things across different cities. The coefficients we got for each city tell us how much sales might change if the city changes, but most of them weren’t very reliable(not statistically significant, as their p-values > 0.05). This means that just the city might not tell us much about sales behavior, and there are probably other factors that affect it more.

Let’s plot the sales by each city(using a line graph) and compare it with the coefficients of our model

city_sales <- data %>%
  group_by(City) %>%
  summarise(total_sales = sum(Sales))
ggplot(city_sales, aes(x = reorder(City, -total_sales), y = total_sales)) +
  geom_line(color = "skyblue", size = 1.5, aes(group = 1)) +  
  geom_point(color = "blue", size = 1) + 
  labs(title = "Total Sales by City", x = "City", y = "Total Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The results from our regression model suggest that the impact of geographic location on sales behavior is not statistically significant. However, when we look at the above plot, we see that there are observable differences in sales across different cities.

This indicates that geographic location alone may not strongly predict sales, other factors may be influencing sales in each city. The estimates from the regression model show the expected change in sales for each unit change in city, assuming all other factors remain constant. However, in reality, sales may be influenced by multiple factors that interact in complex ways, which the regression model may not fully capture.

(c) How do discounts affect sales volume and profitability?

Understanding the effect of discounts on sales volume and profitability is crucial as it can help businesses optimize their pricing strategies to attract customers, boost sales, and therefore maximize profits.

Fitting GLMs with ‘Discount’ as the predictor variable for analysing the relationship between discounts and sales volume/profitability

# for sales
sales_glm <- glm(Sales ~ Discount, data = data, family = gaussian)
summary(sales_glm)
## 
## Call:
## glm(formula = Sales ~ Discount, family = gaussian, data = data)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1506.27      18.48  81.487   <2e-16 ***
## Discount      -42.66      77.41  -0.551    0.582    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 333597.7)
## 
##     Null deviance: 3333409382  on 9993  degrees of freedom
## Residual deviance: 3333308088  on 9992  degrees of freedom
## AIC: 155466
## 
## Number of Fisher Scoring iterations: 2
# for profit
profit_glm <- glm(Profit ~ Discount, data = data, family = gaussian)
summary(profit_glm)
## 
## Call:
## glm(formula = Profit ~ Discount, family = gaussian, data = data)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 374.92434    7.67916  48.824   <2e-16 ***
## Discount      0.05618   32.15996   0.002    0.999    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 57573.55)
## 
##     Null deviance: 575274901  on 9993  degrees of freedom
## Residual deviance: 575274901  on 9992  degrees of freedom
## AIC: 137908
## 
## Number of Fisher Scoring iterations: 2

Observations:

  1. For Sales Volume : The p-value for the Discount variable is greater than the significance level of 0.05, indicating that there is no significant effect of discounts on sales volume.

  2. For Profitability : Similar to sales volume, the p-value for the Discount variable is much greater than 0.05, indicating that discounts do not have a significant impact on profitability.

Now, let’s check if discounts affect sales volume and profitability when including additional variables such as Category, City, and OrderDate along with Discount.

By including Category, City, and OrderDate in the model, we can assess how discounts affect sales volume and profitability after accounting for differences in product categories, geographic locations, and time periods. This helps us understand the effects of discount more accurately by adjusting for other influences.

#for sales
sales_glm <- glm(Sales ~ Discount + Category + City + OrderDate, data = data, family = gaussian)
summary(sales_glm)
## 
## Call:
## glm(formula = Sales ~ Discount + Category + City + OrderDate, 
##     family = gaussian, data = data)
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.383e+03  2.406e+02   5.750  9.2e-09 ***
## Discount                  -4.898e+01  7.760e+01  -0.631    0.528    
## CategoryBeverages         -5.683e+00  2.181e+01  -0.261    0.794    
## CategoryEggs, Meat & Fish  2.602e+01  2.147e+01   1.212    0.226    
## CategoryFood Grains        1.827e+01  2.181e+01   0.838    0.402    
## CategoryFruits & Veggies  -1.244e+01  2.173e+01  -0.572    0.567    
## CategoryOil & Masala       1.704e+00  2.195e+01   0.078    0.938    
## CategorySnacks            -1.699e+01  2.138e+01  -0.795    0.427    
## CityChennai               -3.938e+01  3.909e+01  -1.007    0.314    
## CityCoimbatore            -2.624e+01  3.918e+01  -0.670    0.503    
## CityCumbum                -8.516e+00  3.945e+01  -0.216    0.829    
## CityDharmapuri             1.035e+01  4.054e+01   0.255    0.798    
## CityDindigul              -5.455e+01  3.998e+01  -1.365    0.172    
## CityKanyakumari            2.996e+01  3.852e+01   0.778    0.437    
## CityKarur                 -1.678e+01  3.913e+01  -0.429    0.668    
## CityKrishnagiri           -6.192e+01  3.892e+01  -1.591    0.112    
## CityMadurai                3.850e+00  3.968e+01   0.097    0.923    
## CityNagercoil             -2.977e+01  4.062e+01  -0.733    0.464    
## CityNamakkal              -2.416e+01  3.979e+01  -0.607    0.544    
## CityOoty                  -2.686e+01  3.977e+01  -0.675    0.499    
## CityPerambalur             1.254e+01  3.906e+01   0.321    0.748    
## CityPudukottai             1.031e+01  3.913e+01   0.264    0.792    
## CityRamanadhapuram        -3.145e+00  3.935e+01  -0.080    0.936    
## CitySalem                  1.507e+01  3.912e+01   0.385    0.700    
## CityTenkasi               -1.940e+01  3.910e+01  -0.496    0.620    
## CityTheni                 -1.202e+01  4.023e+01  -0.299    0.765    
## CityTirunelveli           -2.905e+01  3.878e+01  -0.749    0.454    
## CityTrichy                 7.117e+00  4.112e+01   0.173    0.863    
## CityVellore                4.611e+01  3.903e+01   1.182    0.237    
## CityViluppuram            -4.460e+01  3.996e+01  -1.116    0.264    
## CityVirudhunagar          -5.039e+01  3.947e+01  -1.277    0.202    
## OrderDate                  7.837e-03  1.377e-02   0.569    0.569    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 333638)
## 
##     Null deviance: 3333409382  on 9993  degrees of freedom
## Residual deviance: 3323702146  on 9962  degrees of freedom
## AIC: 155498
## 
## Number of Fisher Scoring iterations: 2
#for profitability
profit_glm <- glm(Profit ~ Discount + Category + City + OrderDate, data = data, family = gaussian)
summary(profit_glm)
## 
## Call:
## glm(formula = Profit ~ Discount + Category + City + OrderDate, 
##     family = gaussian, data = data)
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               437.884736  99.979011   4.380  1.2e-05 ***
## Discount                   -2.470620  32.243947  -0.077   0.9389    
## CategoryBeverages           1.169283   9.061351   0.129   0.8973    
## CategoryEggs, Meat & Fish   6.488616   8.921075   0.727   0.4670    
## CategoryFood Grains         4.545026   9.064596   0.501   0.6161    
## CategoryFruits & Veggies   -0.119190   9.031560  -0.013   0.9895    
## CategoryOil & Masala       -8.259590   9.122630  -0.905   0.3653    
## CategorySnacks              1.436173   8.884445   0.162   0.8716    
## CityChennai               -20.244884  16.244189  -1.246   0.2127    
## CityCoimbatore            -24.892096  16.280998  -1.529   0.1263    
## CityCumbum                -17.889207  16.393747  -1.091   0.2752    
## CityDharmapuri            -16.006666  16.844878  -0.950   0.3420    
## CityDindigul              -26.861096  16.611723  -1.617   0.1059    
## CityKanyakumari           -17.220128  16.005270  -1.076   0.2820    
## CityKarur                   0.667379  16.261518   0.041   0.9673    
## CityKrishnagiri           -28.007739  16.171195  -1.732   0.0833 .  
## CityMadurai               -18.957915  16.487181  -1.150   0.2502    
## CityNagercoil             -22.953294  16.880824  -1.360   0.1739    
## CityNamakkal              -31.820854  16.535853  -1.924   0.0543 .  
## CityOoty                  -21.468171  16.525236  -1.299   0.1939    
## CityPerambalur              1.635779  16.229466   0.101   0.9197    
## CityPudukottai            -10.860990  16.260375  -0.668   0.5042    
## CityRamanadhapuram        -15.347943  16.351913  -0.939   0.3480    
## CitySalem                 -19.401374  16.257279  -1.193   0.2327    
## CityTenkasi               -31.170803  16.246108  -1.919   0.0551 .  
## CityTheni                 -24.199522  16.716901  -1.448   0.1478    
## CityTirunelveli           -22.473331  16.114540  -1.395   0.1632    
## CityTrichy                -11.947746  17.086984  -0.699   0.4844    
## CityVellore                 7.290696  16.216541   0.450   0.6530    
## CityViluppuram            -29.332599  16.605945  -1.766   0.0774 .  
## CityVirudhunagar          -30.182780  16.401134  -1.840   0.0658 .  
## OrderDate                  -0.002626   0.005722  -0.459   0.6462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 57608.36)
## 
##     Null deviance: 575274901  on 9993  degrees of freedom
## Residual deviance: 573894458  on 9962  degrees of freedom
## AIC: 137944
## 
## Number of Fisher Scoring iterations: 2

Observations:

For Sales Volume : Similar to our previous model(which only included Discount variable), the p-value for the Discount variable here is also greater than 0.05, suggesting that discounts alone may not significantly impact sales volume when other variables are considered.

For Profitability : Again, the p-value for the Discount variable is much greater than 0.05, indicating that discounts alone may not significantly impact profitability when other variables are considered.

Inference:

Overall, based on the results, we can conclude that discounts do not have a significant effect on both sales volume and profitability, whether considered alone or along with other variables such as Category, City, and OrderDate.

Box plots for sales and profitability by discount levels

# sales by discount level
ggplot(data, aes(x = as.factor(Discount), y = Sales)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  labs(title = "Distribution of Sales by Discount Level", x = "Discount", y = "Sales") +
  theme_minimal()

# profitability by discount level
ggplot(data, aes(x = as.factor(Discount), y = Profit)) +
  geom_boxplot(fill = "pink", color = "purple") +
  labs(title = "Distribution of Profitability by Discount Level", x = "Discount", y = "Profit") +
  theme_minimal()

The box plots for sales and profitability by discount levels do not show major differences and are almost the same for respective sales and profit values, it aligns with the findings from the GLM summaries.

Both GLM summaries and box plots suggest the same thing- discounts don’t seem to make much of a difference in how much we sell or how much profit we make.

Profitability and Cost Analysis:

(a) Which products generate the highest profits relative to their sales?

Understanding which products make the most profit compared to their sales is important because it helps the retailers focus on the items that bring in the most money relative to their cost. This information helps in deciding what products to stock, how to price them, and where to allocate resources to make the most profit.

data <- mutate(data, profit_margin = Profit / Sales)

# Finding products with highest profit margins
top_profit_margin <- data %>%
  group_by(SubCategory) %>%
  summarise(avg_profit_margin = mean(profit_margin, na.rm = TRUE)) %>%
  arrange(desc(avg_profit_margin)) %>%
  head()

top_profit_margin
## # A tibble: 6 × 2
##   SubCategory        avg_profit_margin
##   <chr>                          <dbl>
## 1 Fish                           0.26 
## 2 Noodles                        0.260
## 3 Organic Staples                0.259
## 4 Organic Vegetables             0.257
## 5 Breads & Buns                  0.257
## 6 Organic Fruits                 0.256
ggplot(top_profit_margin, aes(x = reorder(SubCategory, avg_profit_margin), y = avg_profit_margin)) +
  geom_point(color = "skyblue", size = 3) +
  labs(title = "Top Profit Margins by Subcategory", x = "Subcategory", y = "Average Profit Margin") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Observations:

Fish and noodles are the top products generating the highest profits relative to their sales.

Inference:

Despite possibly lower sales volumes(as noodles never even made it to the top selling categories in our previous analyses), these products exhibit high profitability ratios, which can indicate efficient cost management or premium pricing.

b. Are there any cost-saving opportunities in procuring or operations?

Identifying cost-saving opportunities in purchasing or operations helps businesses optimize their expenses and improve overall profitability. It allows them to allocate resources more efficiently and potentially lower product prices, which can attract more customers and increase competitiveness in the market.

Let’s create a simulated retail data and analyze how different procurement methods affect total costs through regression analysis and ANOVA. We will try to determine if there’s a significant difference in total costs based on the procurement method (online or offline).

#generating random synthetic data
set.seed(123)
data1 <- data.frame(
  ProductID = 1:100,
  CostPerUnit = rnorm(100, mean = 10, sd = 2),
  ProcurementMethod = sample(c("Online", "Offline"), 100, replace = TRUE),
  OperationalCost = rnorm(100, mean = 2000, sd = 500)
)

# total procurement cost
data1 <- mutate(data1, TotalProcurementCost = CostPerUnit * ProductID)

# total cost + operational cost
data1 <- mutate(data1, TotalCost = TotalProcurementCost + OperationalCost)

cost_regression <- lm(TotalCost ~ ProcurementMethod, data = data1)
summary(cost_regression)
## 
## Call:
## lm(formula = TotalCost ~ ProcurementMethod, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -986.7 -319.6 -125.9  305.8 1240.0 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2555.92      75.67  33.776   <2e-16 ***
## ProcurementMethodOnline   -58.89     105.96  -0.556     0.58    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 529.7 on 98 degrees of freedom
## Multiple R-squared:  0.003142,   Adjusted R-squared:  -0.00703 
## F-statistic: 0.3089 on 1 and 98 DF,  p-value: 0.5796
anova_result <- anova(cost_regression)
print(anova_result)
## Analysis of Variance Table
## 
## Response: TotalCost
##                   Df   Sum Sq Mean Sq F value Pr(>F)
## ProcurementMethod  1    86679   86679  0.3089 0.5796
## Residuals         98 27498023  280592

Observations:

- The regression model doesn’t show a significant relationship between procurement method (online or offline) and total cost.

- The coefficients for the procurement method show that the difference in total cost between online and offline procurement is not statistically significant.

- Also, The p-value from the ANOVA test is greater than 0.05, indicating that there’s no significant difference in total cost based on the procurement method.

Inferences:

- Based on the results, we don’t have enough evidence to conclude that one procurement method leads to significantly lower or higher total costs compared to the other. Other factors beyond just the procurement method may influence total costs in this scenario.

Market Segmentation and Targeting

(a) Can we segment customers based on their purchasing habits or preferences?

- Segmenting customers based on their purchasing habits or preferences helps tailor marketing strategies and product offerings to specific groups, improving customer satisfaction and increasing sales.

We can probably segment customers based on various factors such as product category preferences, geographic location, order date, and discount usage.

Segmenting customers based on product category preferences:

customer_segments <- data %>%
  group_by(Category, City) %>%
  summarise(total_purchases = n(), .groups = "drop") %>%
  pivot_wider(names_from = Category, values_from = total_purchases, values_fill = 0)

# adding a column to identify the dominant product category for each customer
customer_segments <- mutate(customer_segments, Dominant_Category = apply(customer_segments[, -c(1,2)], 1, function(x) names(x)[which.max(x)]))
customer_segment_analysis <- customer_segments %>%
  group_by(Dominant_Category) %>%
  summarise(
    Total_Customers = n()
  )

ggplot(customer_segment_analysis, aes(x = Dominant_Category, y = Total_Customers)) +
  geom_bar(stat = "identity", fill = "pink") +
  labs(title = "Customer Segments by Dominant Product Category", x = "Dominant Product Category", y = "Total Customers") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


Observations:

- Despite beverages being among the least in terms of top-selling categories(in previous analysis), they have more sales within the customer segments.

- Also, fruits and veggies are more dominant in customer preferences compared to food grains and oils/masala(which again wasn’t the same case in top-selling categories analysis).

Inferences:

- The discrepancy between top-selling categories and dominant product categories within customer segments suggests that customer preferences may differ from overall sales trends. This highlights the importance of understanding customer segmentation and preferences for targeted marketing.

(b) Can we tailor promotions or product offerings to specific customer segments for better results?

Through market segmentation, we can identify specific customer segments with unique preferences and behaviors. By understanding these segments, businesses can tailor their promotions, product offerings, and marketing messages to better resonate with each group. For example, offering personalized discounts or product recommendations based on past purchase history or segment-specific preferences can lead to higher engagement.

Conclusions:

  1. Product Preferences and Profitability:

    - The alignment between popular products and high profitability suggests a strong correlation between demand and profitability. Stores should leverage this to focus on marketing and stocking these high-demand, high-profit items to maximize revenue.

  2. Seasonal Sales Patterns:

    - Understanding seasonal sales trends enables stores to anticipate fluctuations in demand and adjust inventory levels accordingly. By aligning product offerings with seasonal preferences, stores can optimize sales and maintain customer satisfaction throughout the year.

  3. Geographic Variations:

    - While geographic location alone may not strongly predict sales behavior, subtle differences in purchasing patterns across cities indicate the need for localized marketing strategies. Tailoring promotions and product assortments to the preferences of each city’s demographic can enhance sales performance.

  4. Impact of Discounts:

    - Our analysis indicates that discounts do not significantly impact sales volume or profitability. Instead of relying solely on discounts, stores should focus on providing value through product quality, customer service, and targeted marketing strategies to drive sales growth.

  5. Outlier Identification:

    - While certain products like beverages may not stand out as outliers in terms of sales performance, other factors such as pricing strategies, competition, or consumer preferences could be influencing their performance. Stores should conduct further analysis to identify and address underlying issues affecting product performance.

  6. Product Profitability:

    - Fish and noodles emerge as top performers in generating high profits relative to their sales. Despite potentially lower sales volumes, these products exhibit efficient cost management or premium pricing, contributing to their high profitability ratios.

  7. Cost-saving Opportunities:

    - Our analysis does not reveal significant cost differences between online and offline procurement methods. This suggests that other factors beyond procurement methods may influence total costs. However, further exploration of operational efficiencies could uncover potential cost-saving opportunities.

  8. Regional Profit Margins:

    - The North region stands out with the highest profit margin, indicating its potential as a lucrative market for the business. While other regions show consistent profit margins, targeted strategies could help improve profitability across all regions.

  9. Customer Segmentation Insights:

    - Discrepancies between top-selling categories and dominant product categories within customer segments underscore the importance of understanding customer preferences. Tailoring marketing efforts to align with customer preferences can optimize sales and enhance customer satisfaction.

By integrating all these insights into retail strategies, stores can optimize product offerings, marketing initiatives, and operational processes to drive sustainable growth and enhance customer satisfaction.