data <- read.csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
These descriptive statistics help in providing an overview of our data’s characteristics and distribution.
summary_data<-summary(data)
summary_data
## Order.ID CustomerName Category SubCategory
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## City OrderDate Region Sales
## Length:9994 Length:9994 Length:9994 Min. : 500
## Class :character Class :character Class :character 1st Qu.:1000
## Mode :character Mode :character Mode :character Median :1498
## Mean :1497
## 3rd Qu.:1995
## Max. :2500
## Discount Profit State ProfitRange
## Min. :0.1000 Min. : 25.25 Length:9994 Min. :0.0000
## 1st Qu.:0.1600 1st Qu.: 180.02 Class :character 1st Qu.:0.0000
## Median :0.2300 Median : 320.78 Mode :character Median :0.0000
## Mean :0.2268 Mean : 374.94 Mean :0.2764
## 3rd Qu.:0.2900 3rd Qu.: 525.63 3rd Qu.:1.0000
## Max. :0.3500 Max. :1120.95 Max. :1.0000
Order ID: It contains character data and has 9994 rows.
Customer Name: It contains character data and has 9994 rows.
Category: It contains character data and has 9994 rows.
Subcategory: It contains character data and has 9994 rows.
City: It contains character data and has 9994 rows.
Order Date: It contains character data and has 9994 rows.
Region: It contains character data and has 9994 rows.
Sales: The minimum sales value is 500, the maximum is 2500, with a median of 1498 and a mean of 1497.
Discount: The minimum discount is 0.1, the maximum is 0.35, with a median of 0.23 and a mean of 0.2268.
Profit: The minimum profit is 25.25, the maximum is 1120.95, with a median of 320.78 and a mean of 374.94.
State: It contains character data and has 9994 rows.
Profit Range: The minimum value is 0, the maximum is 1, with a median and mean of 0.
Identifying the top-selling categories and subcategories is crucial in retail analysis as it helps businesses understand consumer preferences and allocate resources effectively to capitalize on high-demand products, ultimately driving sales and profitability.
Let’s use descriptive statistics and visualization techniques to identify the top-selling categories and subcategories in our dataset.
top_categories <- data %>%
group_by(Category) %>%
summarise(total_sales = sum(Sales), .groups = "drop_last") %>%
arrange(desc(total_sales)) %>%
head()
top_subcategories <- data %>%
group_by(Category, SubCategory) %>%
summarise(total_sales = sum(Sales), .groups = "drop_last") %>%
arrange(desc(total_sales)) %>%
group_by(Category) %>%
slice(1)
# Plot for top categories
ggplot(top_categories, aes(x = reorder(Category, -total_sales), y = total_sales)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Top-selling Categories", x = "Category", y = "Total Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot for top subcategories
ggplot(top_subcategories, aes(x = reorder(SubCategory, -total_sales), y = total_sales)) +
geom_bar(stat = "identity", fill = "pink") +
labs(title = "Top-selling Subcategories", x = "Subcategory", y = "Total Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Observations:
- From the above plots, it can be observed that the top-selling category is “Eggs, Meat, and Fish” with the subcategory being “Health Drinks”.
Inferences:
- The growing interest in “Eggs, Meat, and Fish” and “Health Drinks” among customers shows that people are becoming more health-conscious about what they eat.
- This means that stores and companies need to change the products they sell and how they advertise them to better suit the new focus on health in what people buy.
Knowing which products generate the highest sales revenue and profit margins is essential in retail analysis as it enables businesses to focus on promoting and optimizing the most profitable items, leading to increased revenue and improved overall profitability.
Firstly, let’s understand the relationship between sales revenue and profit margins using linear regression modeling.
model1 <- lm(Profit ~ Sales, data = data)
summary(model1)
##
## Call:
## lm(formula = Profit ~ Sales, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -500.66 -128.83 -0.93 128.84 495.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.422974 5.306490 -0.268 0.789
## Sales 0.251477 0.003308 76.022 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 191 on 9992 degrees of freedom
## Multiple R-squared: 0.3664, Adjusted R-squared: 0.3664
## F-statistic: 5779 on 1 and 9992 DF, p-value: < 2.2e-16
Observations:
Impact of sales on profit: The coefficient estimate for the sales is 0.251477. This means that for every unit increase in sales revenue, the profit margin increases by approximately 0.251477 units. This positive coefficient suggests that higher sales revenue is associated with higher profit margins.
Statistical significance: The p-value associated with the coefficient estimate for “Sales” is less than 0.001 (<2e-16). This indicates that the relationship between sales revenue and profit margins is statistically significant. So, the observed relationship is unlikely to have occurred by random chance alone.
Understanding the relationship between sales and profit is quite important if we want to improve business performance and decision-making. By identifying products with the highest sales revenue and profit margins, retailers can prioritize resources, allocate marketing budgets effectively, and make informed decisions about pricing, inventory management, and product offerings.
Next, let’s look at the products that have the highest sales revenue and profit margins in our dataset.
high_sales <- data %>%
group_by(Category) %>%
summarise(total_sales = sum(Sales), .groups = "drop_last") %>%
arrange(desc(total_sales)) %>%
head()
high_profit <- data %>%
group_by(Category) %>%
summarise(total_profit = sum(Profit), .groups = "drop_last") %>%
arrange(desc(total_profit)) %>%
head()
# Plot top sales products
ggplot(high_sales, aes(x = reorder(Category, -total_sales), y = total_sales)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Top Sales Products", x = "Category", y = "Total Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot top profit products
ggplot(high_profit, aes(x = reorder(Category, -total_profit), y = total_profit)) +
geom_bar(stat = "identity", fill = "pink") +
labs(title = "Top Profit Products", x = "Category", y = "Total Profit") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Observations:
We can observe that “Eggs, Meat, and Fish” are the top-selling products(as we observed before) while “Snacks”(surprisingly) emerge as the top profit product.
The shift from health-conscious products like health drinks (previously seen as top-selling) to snacks (top profit product) can suggest a diverse range of consumer preferences and purchasing behaviors.
Inferences:
Regarding why snacks may have the highest profit margins and the sudden shift from health-conscious to snacks, one possible explanation is that snack products often have higher markups and lower production costs compared to health-conscious alternatives. Moreover, shifts in lifestyle changes, such as increased snacking occasions may contribute to the surge in profitability for snacks.
Analyzing seasonal trends and patterns in sales can provide some great insights into consumer behavior and purchasing dynamics. By aggregating sales data by month and year, we can uncover any recurring patterns or fluctuations in sales volume over time. This allows retailers to identify peak seasons, understand seasonal demand variations, and marketing strategies accordingly.
data$OrderDate <- as.Date(data$OrderDate, format = "%m/%d/%y")
data$Month <- format(data$OrderDate, "%m")
data$Year <- format(data$OrderDate, "%Y")
monthly_sales <- data %>%
group_by(Year, Month, .groups = "drop_last") %>%
summarise(total_sales = sum(Sales))
## `summarise()` has grouped output by 'Year', 'Month'. You can override using the
## `.groups` argument.
# Plotting monthly sales trends
ggplot(monthly_sales, aes(x = Month, y = total_sales, group = Year, color = Year)) +
geom_line() +
scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
labs(title = "Monthly Sales Trend", x = "Month", y = "Total Sales")
Observations:
We can observe that the sales went up every year, from 2015 to 2018, which can mean that the business kept growing. Also, each year started with lower sales in the first few months, especially in February(with the least sales), but picked up in the later months, with September usually having the highest sales. This tells us that sales go through ups and downs during the year, with certain times being busier for shopping than others.
Inferences:
This pattern may be due to various factors such as the timing of holidays, changing consumer preferences throughout the year, or external factors like weather conditions affecting product demand.
Knowing about these seasonal trends is super important for stores because it helps them prepare for changes in sales. They can plan ahead by adjusting how much stock they have, making sure they have the right products when people want them the most. During busy times, they can focus on selling popular stuff and run special deals to make even more sales. And when things slow down, they can try out new products or offer discounts to keep things moving and make sure they’re still making money.
Basically, understanding these sales patterns helps stores make smart choices about what to do to keep customers happy and keep their business going strong all year round.
Identifying outliers or underperforming products is crucial in retail analysis as it helps businesses pinpoint areas needing attention or improvement.
Let’s conduct an analysis to identify any outliers or underperforming products that may require attention. We will compare the sales performance of a ‘Beverages’(chose ‘beverages’ since it had the least sales as well as profit margin) against the average sales performance of all products in the dataset. By defining hypotheses, setting a significance level, and performing a one-sample t-test, we can determine if there is a significant difference in the sales performance.
Hypothesis Testing:
-Defining the null hypothesis (H0) and alternative hypothesis (H1):
H0: There is no significant difference between the sales performance of beverages and the average sales performance of all products.
H1: There is a significant difference between the sales performance of beverages and the average sales performance of all products.
-Setting the significance level (α), typically 0.05.
-Since we are comparing the sales performance of one product against the average sales performance of all products, we will be using a one-sample t-test.
# Getting sales data of 'Beverages'
bev_sales <- subset(data, Category == "Beverages")$Sales
# average sales performance of all products
average_sales <- mean(data$Sales)
# Performing one-sample t-test
result <- t.test(bev_sales, mu = average_sales)
print(result)
##
## One Sample t-test
##
## data: bev_sales
## t = -0.45312, df = 1399, p-value = 0.6505
## alternative hypothesis: true mean is not equal to 1496.596
## 95 percent confidence interval:
## 1458.829 1520.190
## sample estimates:
## mean of x
## 1489.509
Observations:
We can see that the p-value is greater than the typical significance level of 0.05, which means that there is not enough evidence to reject our null hypothesis. Hence, we fail to reject our hypothesis that the true mean sales performance of beverages is equal to the overall average sales performance of all products i.e., there is no significant difference between the sales performance of beverages and the average sales performance of all products.
Therefore, there is no evidence to indicate that beverages are outliers or underperforming products that require immediate attention.
Inferences:
While beverages may not be outliers or underperforming products based on their sales performance relative to the overall average, there could be other factors affecting their low sales and profit margins. These factors can be pricing strategies, competition, consumer preferences, or marketing effectiveness.
Knowing the most popular products among customers helps retailers understand consumer preferences and demand trends, which can help them in optimizing inventory management, stock levels, and marketing strategies to meet customer needs more effectively.
popular_products <- data %>%
group_by(Category) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
head()
popular_products
## # A tibble: 6 × 2
## Category count
## <chr> <int>
## 1 Snacks 1514
## 2 Eggs, Meat & Fish 1490
## 3 Fruits & Veggies 1418
## 4 Bakery 1413
## 5 Beverages 1400
## 6 Food Grains 1398
Observations:
We notice that ‘Snacks’ emerge as the top-selling product, which coincidentally also holds the highest profitability among all products analyzed.
Inferences:
The observation that the most popular products among customers align with the top profitable products suggests a strong correlation between popularity and profitability in retail. This could be because products with high demand tend to sell more, leading to increased sales volume, which in turn contributes to higher profits for the business.
Understanding geographic variations in purchasing behavior is important as it enables businesses to tailor marketing strategies, and promotional activities in specific regions, maximizing sales and customer satisfaction.
Let’s observe the purchasing behavior variations within each city in our dataset by firstly performing a geographic analysis(using linear regression model) and then plotting the sales for each city(to kind of verify our model’s results).
# Geographic analysis
geo_analysis <- lm(Sales ~ City, data = data)
summary(geo_analysis)
##
## Call:
## lm(formula = Sales ~ City, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1047.29 -495.38 3.32 498.02 1047.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1509.450 27.473 54.943 <2e-16 ***
## CityChennai -39.628 39.077 -1.014 0.311
## CityCoimbatore -26.394 39.169 -0.674 0.500
## CityCumbum -8.138 39.431 -0.206 0.836
## CityDharmapuri 10.638 40.522 0.263 0.793
## CityDindigul -55.837 39.965 -1.397 0.162
## CityKanyakumari 30.341 38.491 0.788 0.431
## CityKarur -15.792 39.123 -0.404 0.686
## CityKrishnagiri -61.102 38.897 -1.571 0.116
## CityMadurai 4.854 39.654 0.122 0.903
## CityNagercoil -31.072 40.610 -0.765 0.444
## CityNamakkal -24.264 39.782 -0.610 0.542
## CityOoty -26.054 39.756 -0.655 0.512
## CityPerambalur 10.683 39.032 0.274 0.784
## CityPudukottai 9.571 39.123 0.245 0.807
## CityRamanadhapuram -2.595 39.334 -0.066 0.947
## CitySalem 15.127 39.100 0.387 0.699
## CityTenkasi -19.515 39.077 -0.499 0.618
## CityTheni -11.897 40.210 -0.296 0.767
## CityTirunelveli -30.051 38.766 -0.775 0.438
## CityTrichy 7.085 41.101 0.172 0.863
## CityVellore 45.837 39.009 1.175 0.240
## CityViluppuram -45.284 39.939 -1.134 0.257
## CityVirudhunagar -50.748 39.455 -1.286 0.198
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 577.6 on 9970 degrees of freedom
## Multiple R-squared: 0.002194, Adjusted R-squared: -0.0001078
## F-statistic: 0.9532 on 23 and 9970 DF, p-value: 0.525
Observations:
The coefficients associated with each city indicate the estimated change in sales for each unit change in the respective city, holding other factors constant. However, most of the coefficients are not statistically significant, as indicated by their p-values being greater than 0.05. This suggests that the city alone may not be a strong predictor of sales behavior, and other factors may have a more significant influence.
Inferences:
The geographic analysis shows that there aren’t huge differences in how people purchase things across different cities. The coefficients we got for each city tell us how much sales might change if the city changes, but most of them weren’t very reliable(not statistically significant, as their p-values > 0.05). This means that just the city might not tell us much about sales behavior, and there are probably other factors that affect it more.
Let’s plot the sales by each city(using a line graph) and compare it with the coefficients of our model
city_sales <- data %>%
group_by(City) %>%
summarise(total_sales = sum(Sales))
ggplot(city_sales, aes(x = reorder(City, -total_sales), y = total_sales)) +
geom_line(color = "skyblue", size = 1.5, aes(group = 1)) +
geom_point(color = "blue", size = 1) +
labs(title = "Total Sales by City", x = "City", y = "Total Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The results from our regression model suggest that the impact of geographic location on sales behavior is not statistically significant. However, when we look at the above plot, we see that there are observable differences in sales across different cities.
This indicates that geographic location alone may not strongly predict sales, other factors may be influencing sales in each city. The estimates from the regression model show the expected change in sales for each unit change in city, assuming all other factors remain constant. However, in reality, sales may be influenced by multiple factors that interact in complex ways, which the regression model may not fully capture.
Understanding the effect of discounts on sales volume and profitability is crucial as it can help businesses optimize their pricing strategies to attract customers, boost sales, and therefore maximize profits.
Fitting GLMs with ‘Discount’ as the predictor variable for analysing the relationship between discounts and sales volume/profitability
# for sales
sales_glm <- glm(Sales ~ Discount, data = data, family = gaussian)
summary(sales_glm)
##
## Call:
## glm(formula = Sales ~ Discount, family = gaussian, data = data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1506.27 18.48 81.487 <2e-16 ***
## Discount -42.66 77.41 -0.551 0.582
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 333597.7)
##
## Null deviance: 3333409382 on 9993 degrees of freedom
## Residual deviance: 3333308088 on 9992 degrees of freedom
## AIC: 155466
##
## Number of Fisher Scoring iterations: 2
# for profit
profit_glm <- glm(Profit ~ Discount, data = data, family = gaussian)
summary(profit_glm)
##
## Call:
## glm(formula = Profit ~ Discount, family = gaussian, data = data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 374.92434 7.67916 48.824 <2e-16 ***
## Discount 0.05618 32.15996 0.002 0.999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 57573.55)
##
## Null deviance: 575274901 on 9993 degrees of freedom
## Residual deviance: 575274901 on 9992 degrees of freedom
## AIC: 137908
##
## Number of Fisher Scoring iterations: 2
Observations:
For Sales Volume : The p-value for the Discount variable is greater than the significance level of 0.05, indicating that there is no significant effect of discounts on sales volume.
For Profitability : Similar to sales volume, the p-value for the Discount variable is much greater than 0.05, indicating that discounts do not have a significant impact on profitability.
Now, let’s check if discounts affect sales volume and profitability when including additional variables such as Category, City, and OrderDate along with Discount.
By including Category, City, and OrderDate in the model, we can assess how discounts affect sales volume and profitability after accounting for differences in product categories, geographic locations, and time periods. This helps us understand the effects of discount more accurately by adjusting for other influences.
#for sales
sales_glm <- glm(Sales ~ Discount + Category + City + OrderDate, data = data, family = gaussian)
summary(sales_glm)
##
## Call:
## glm(formula = Sales ~ Discount + Category + City + OrderDate,
## family = gaussian, data = data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.383e+03 2.406e+02 5.750 9.2e-09 ***
## Discount -4.898e+01 7.760e+01 -0.631 0.528
## CategoryBeverages -5.683e+00 2.181e+01 -0.261 0.794
## CategoryEggs, Meat & Fish 2.602e+01 2.147e+01 1.212 0.226
## CategoryFood Grains 1.827e+01 2.181e+01 0.838 0.402
## CategoryFruits & Veggies -1.244e+01 2.173e+01 -0.572 0.567
## CategoryOil & Masala 1.704e+00 2.195e+01 0.078 0.938
## CategorySnacks -1.699e+01 2.138e+01 -0.795 0.427
## CityChennai -3.938e+01 3.909e+01 -1.007 0.314
## CityCoimbatore -2.624e+01 3.918e+01 -0.670 0.503
## CityCumbum -8.516e+00 3.945e+01 -0.216 0.829
## CityDharmapuri 1.035e+01 4.054e+01 0.255 0.798
## CityDindigul -5.455e+01 3.998e+01 -1.365 0.172
## CityKanyakumari 2.996e+01 3.852e+01 0.778 0.437
## CityKarur -1.678e+01 3.913e+01 -0.429 0.668
## CityKrishnagiri -6.192e+01 3.892e+01 -1.591 0.112
## CityMadurai 3.850e+00 3.968e+01 0.097 0.923
## CityNagercoil -2.977e+01 4.062e+01 -0.733 0.464
## CityNamakkal -2.416e+01 3.979e+01 -0.607 0.544
## CityOoty -2.686e+01 3.977e+01 -0.675 0.499
## CityPerambalur 1.254e+01 3.906e+01 0.321 0.748
## CityPudukottai 1.031e+01 3.913e+01 0.264 0.792
## CityRamanadhapuram -3.145e+00 3.935e+01 -0.080 0.936
## CitySalem 1.507e+01 3.912e+01 0.385 0.700
## CityTenkasi -1.940e+01 3.910e+01 -0.496 0.620
## CityTheni -1.202e+01 4.023e+01 -0.299 0.765
## CityTirunelveli -2.905e+01 3.878e+01 -0.749 0.454
## CityTrichy 7.117e+00 4.112e+01 0.173 0.863
## CityVellore 4.611e+01 3.903e+01 1.182 0.237
## CityViluppuram -4.460e+01 3.996e+01 -1.116 0.264
## CityVirudhunagar -5.039e+01 3.947e+01 -1.277 0.202
## OrderDate 7.837e-03 1.377e-02 0.569 0.569
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 333638)
##
## Null deviance: 3333409382 on 9993 degrees of freedom
## Residual deviance: 3323702146 on 9962 degrees of freedom
## AIC: 155498
##
## Number of Fisher Scoring iterations: 2
#for profitability
profit_glm <- glm(Profit ~ Discount + Category + City + OrderDate, data = data, family = gaussian)
summary(profit_glm)
##
## Call:
## glm(formula = Profit ~ Discount + Category + City + OrderDate,
## family = gaussian, data = data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 437.884736 99.979011 4.380 1.2e-05 ***
## Discount -2.470620 32.243947 -0.077 0.9389
## CategoryBeverages 1.169283 9.061351 0.129 0.8973
## CategoryEggs, Meat & Fish 6.488616 8.921075 0.727 0.4670
## CategoryFood Grains 4.545026 9.064596 0.501 0.6161
## CategoryFruits & Veggies -0.119190 9.031560 -0.013 0.9895
## CategoryOil & Masala -8.259590 9.122630 -0.905 0.3653
## CategorySnacks 1.436173 8.884445 0.162 0.8716
## CityChennai -20.244884 16.244189 -1.246 0.2127
## CityCoimbatore -24.892096 16.280998 -1.529 0.1263
## CityCumbum -17.889207 16.393747 -1.091 0.2752
## CityDharmapuri -16.006666 16.844878 -0.950 0.3420
## CityDindigul -26.861096 16.611723 -1.617 0.1059
## CityKanyakumari -17.220128 16.005270 -1.076 0.2820
## CityKarur 0.667379 16.261518 0.041 0.9673
## CityKrishnagiri -28.007739 16.171195 -1.732 0.0833 .
## CityMadurai -18.957915 16.487181 -1.150 0.2502
## CityNagercoil -22.953294 16.880824 -1.360 0.1739
## CityNamakkal -31.820854 16.535853 -1.924 0.0543 .
## CityOoty -21.468171 16.525236 -1.299 0.1939
## CityPerambalur 1.635779 16.229466 0.101 0.9197
## CityPudukottai -10.860990 16.260375 -0.668 0.5042
## CityRamanadhapuram -15.347943 16.351913 -0.939 0.3480
## CitySalem -19.401374 16.257279 -1.193 0.2327
## CityTenkasi -31.170803 16.246108 -1.919 0.0551 .
## CityTheni -24.199522 16.716901 -1.448 0.1478
## CityTirunelveli -22.473331 16.114540 -1.395 0.1632
## CityTrichy -11.947746 17.086984 -0.699 0.4844
## CityVellore 7.290696 16.216541 0.450 0.6530
## CityViluppuram -29.332599 16.605945 -1.766 0.0774 .
## CityVirudhunagar -30.182780 16.401134 -1.840 0.0658 .
## OrderDate -0.002626 0.005722 -0.459 0.6462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 57608.36)
##
## Null deviance: 575274901 on 9993 degrees of freedom
## Residual deviance: 573894458 on 9962 degrees of freedom
## AIC: 137944
##
## Number of Fisher Scoring iterations: 2
Observations:
For Sales Volume : Similar to our previous model(which only included Discount variable), the p-value for the Discount variable here is also greater than 0.05, suggesting that discounts alone may not significantly impact sales volume when other variables are considered.
For Profitability : Again, the p-value for the
Discount variable is much greater than 0.05, indicating that discounts
alone may not significantly impact profitability when other variables
are considered.
Inference:
Overall, based on the results, we can conclude that discounts do not have a significant effect on both sales volume and profitability, whether considered alone or along with other variables such as Category, City, and OrderDate.
Box plots for sales and profitability by discount levels
# sales by discount level
ggplot(data, aes(x = as.factor(Discount), y = Sales)) +
geom_boxplot(fill = "skyblue", color = "darkblue") +
labs(title = "Distribution of Sales by Discount Level", x = "Discount", y = "Sales") +
theme_minimal()
# profitability by discount level
ggplot(data, aes(x = as.factor(Discount), y = Profit)) +
geom_boxplot(fill = "pink", color = "purple") +
labs(title = "Distribution of Profitability by Discount Level", x = "Discount", y = "Profit") +
theme_minimal()
The box plots for sales and profitability by discount levels do not
show major differences and are almost the same for respective sales and
profit values, it aligns with the findings from the GLM summaries.
Both GLM summaries and box plots suggest the same thing- discounts don’t seem to make much of a difference in how much we sell or how much profit we make.
Understanding which products make the most profit compared to their sales is important because it helps the retailers focus on the items that bring in the most money relative to their cost. This information helps in deciding what products to stock, how to price them, and where to allocate resources to make the most profit.
data <- mutate(data, profit_margin = Profit / Sales)
# Finding products with highest profit margins
top_profit_margin <- data %>%
group_by(SubCategory) %>%
summarise(avg_profit_margin = mean(profit_margin, na.rm = TRUE)) %>%
arrange(desc(avg_profit_margin)) %>%
head()
top_profit_margin
## # A tibble: 6 × 2
## SubCategory avg_profit_margin
## <chr> <dbl>
## 1 Fish 0.26
## 2 Noodles 0.260
## 3 Organic Staples 0.259
## 4 Organic Vegetables 0.257
## 5 Breads & Buns 0.257
## 6 Organic Fruits 0.256
ggplot(top_profit_margin, aes(x = reorder(SubCategory, avg_profit_margin), y = avg_profit_margin)) +
geom_point(color = "skyblue", size = 3) +
labs(title = "Top Profit Margins by Subcategory", x = "Subcategory", y = "Average Profit Margin") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Observations:
Fish and noodles are the top products generating the highest profits relative to their sales.
Inference:
Despite possibly lower sales volumes(as noodles never even made it to the top selling categories in our previous analyses), these products exhibit high profitability ratios, which can indicate efficient cost management or premium pricing.
Identifying cost-saving opportunities in purchasing or operations helps businesses optimize their expenses and improve overall profitability. It allows them to allocate resources more efficiently and potentially lower product prices, which can attract more customers and increase competitiveness in the market.
Let’s create a simulated retail data and analyze how different procurement methods affect total costs through regression analysis and ANOVA. We will try to determine if there’s a significant difference in total costs based on the procurement method (online or offline).
#generating random synthetic data
set.seed(123)
data1 <- data.frame(
ProductID = 1:100,
CostPerUnit = rnorm(100, mean = 10, sd = 2),
ProcurementMethod = sample(c("Online", "Offline"), 100, replace = TRUE),
OperationalCost = rnorm(100, mean = 2000, sd = 500)
)
# total procurement cost
data1 <- mutate(data1, TotalProcurementCost = CostPerUnit * ProductID)
# total cost + operational cost
data1 <- mutate(data1, TotalCost = TotalProcurementCost + OperationalCost)
cost_regression <- lm(TotalCost ~ ProcurementMethod, data = data1)
summary(cost_regression)
##
## Call:
## lm(formula = TotalCost ~ ProcurementMethod, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -986.7 -319.6 -125.9 305.8 1240.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2555.92 75.67 33.776 <2e-16 ***
## ProcurementMethodOnline -58.89 105.96 -0.556 0.58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 529.7 on 98 degrees of freedom
## Multiple R-squared: 0.003142, Adjusted R-squared: -0.00703
## F-statistic: 0.3089 on 1 and 98 DF, p-value: 0.5796
anova_result <- anova(cost_regression)
print(anova_result)
## Analysis of Variance Table
##
## Response: TotalCost
## Df Sum Sq Mean Sq F value Pr(>F)
## ProcurementMethod 1 86679 86679 0.3089 0.5796
## Residuals 98 27498023 280592
Observations:
- The regression model doesn’t show a significant relationship between procurement method (online or offline) and total cost.
- The coefficients for the procurement method show that the difference in total cost between online and offline procurement is not statistically significant.
- Also, The p-value from the ANOVA test is greater than 0.05, indicating that there’s no significant difference in total cost based on the procurement method.
Inferences:
- Based on the results, we don’t have enough evidence to conclude that one procurement method leads to significantly lower or higher total costs compared to the other. Other factors beyond just the procurement method may influence total costs in this scenario.
By examining profit margins across different regions, retailers can pinpoint which areas are more profitable than others. This insight allows them to adjust their strategies to capitalize on profitable regions and address challenges in less profitable ones, ensuring overall business success.
region_profit_margin <- data %>%
group_by(Region) %>%
summarise(avg_profit_margin = mean(profit_margin, na.rm = TRUE))
ggplot(region_profit_margin, aes(x = Region, y = avg_profit_margin)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Profit Margin Trends Across Regions", x = "Region", y = "Average Profit Margin")
Observation:
The North region shows the highest profit margin compared to other regions, while the remaining regions exhibit similar profit margin values in the bar plot.
Inference:
The North region may be more profitable for the business, while other regions show relatively consistent profit margins. Further analysis and targeted strategies could capitalize on the strengths of the North region and potentially improve profitability in other regions.
(a) Can we segment customers based on their purchasing habits or preferences?
- Segmenting customers based on their purchasing habits or preferences helps tailor marketing strategies and product offerings to specific groups, improving customer satisfaction and increasing sales.
We can probably segment customers based on various factors such as product category preferences, geographic location, order date, and discount usage.
Segmenting customers based on product category preferences:
customer_segments <- data %>%
group_by(Category, City) %>%
summarise(total_purchases = n(), .groups = "drop") %>%
pivot_wider(names_from = Category, values_from = total_purchases, values_fill = 0)
# adding a column to identify the dominant product category for each customer
customer_segments <- mutate(customer_segments, Dominant_Category = apply(customer_segments[, -c(1,2)], 1, function(x) names(x)[which.max(x)]))
customer_segment_analysis <- customer_segments %>%
group_by(Dominant_Category) %>%
summarise(
Total_Customers = n()
)
ggplot(customer_segment_analysis, aes(x = Dominant_Category, y = Total_Customers)) +
geom_bar(stat = "identity", fill = "pink") +
labs(title = "Customer Segments by Dominant Product Category", x = "Dominant Product Category", y = "Total Customers") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Observations:
- Despite beverages being among the least in terms of top-selling categories(in previous analysis), they have more sales within the customer segments.
- Also, fruits and veggies are more dominant in customer preferences compared to food grains and oils/masala(which again wasn’t the same case in top-selling categories analysis).
Inferences:
- The discrepancy between top-selling categories and dominant product categories within customer segments suggests that customer preferences may differ from overall sales trends. This highlights the importance of understanding customer segmentation and preferences for targeted marketing.
(b) Can we tailor promotions or product offerings to specific customer segments for better results?
Through market segmentation, we can identify specific customer segments with unique preferences and behaviors. By understanding these segments, businesses can tailor their promotions, product offerings, and marketing messages to better resonate with each group. For example, offering personalized discounts or product recommendations based on past purchase history or segment-specific preferences can lead to higher engagement.
Product Preferences and Profitability:
- The alignment between popular products and high profitability suggests a strong correlation between demand and profitability. Stores should leverage this to focus on marketing and stocking these high-demand, high-profit items to maximize revenue.
Seasonal Sales Patterns:
- Understanding seasonal sales trends enables stores to anticipate fluctuations in demand and adjust inventory levels accordingly. By aligning product offerings with seasonal preferences, stores can optimize sales and maintain customer satisfaction throughout the year.
Geographic Variations:
- While geographic location alone may not strongly predict sales behavior, subtle differences in purchasing patterns across cities indicate the need for localized marketing strategies. Tailoring promotions and product assortments to the preferences of each city’s demographic can enhance sales performance.
Impact of Discounts:
- Our analysis indicates that discounts do not significantly impact sales volume or profitability. Instead of relying solely on discounts, stores should focus on providing value through product quality, customer service, and targeted marketing strategies to drive sales growth.
Outlier Identification:
- While certain products like beverages may not stand out as outliers in terms of sales performance, other factors such as pricing strategies, competition, or consumer preferences could be influencing their performance. Stores should conduct further analysis to identify and address underlying issues affecting product performance.
Product Profitability:
- Fish and noodles emerge as top performers in generating high profits relative to their sales. Despite potentially lower sales volumes, these products exhibit efficient cost management or premium pricing, contributing to their high profitability ratios.
Cost-saving Opportunities:
- Our analysis does not reveal significant cost differences between online and offline procurement methods. This suggests that other factors beyond procurement methods may influence total costs. However, further exploration of operational efficiencies could uncover potential cost-saving opportunities.
Regional Profit Margins:
- The North region stands out with the highest profit margin, indicating its potential as a lucrative market for the business. While other regions show consistent profit margins, targeted strategies could help improve profitability across all regions.
Customer Segmentation Insights:
- Discrepancies between top-selling categories and dominant product categories within customer segments underscore the importance of understanding customer preferences. Tailoring marketing efforts to align with customer preferences can optimize sales and enhance customer satisfaction.
By integrating all these insights into retail strategies, stores can optimize product offerings, marketing initiatives, and operational processes to drive sustainable growth and enhance customer satisfaction.