Introduction

This report presents an analysis of data from Lidl chain stores across the United States. We have examined various metrics, such as profitability, sales, payment methods, and return rates, to gather insights essential for strategic planning and decision-making. The report includes a series of plots and graphs that provide a clear view of financial performance across different regions and customer segments.

Data visualization, wrangling and cleansing

Visualisation of Outliers in the Lidl dataset

Boxplots based on variables “Sales”, “Quantity” and “Profit” show that the data set contains high amount of outliers. In that case standard deviation can be to high in order to provide accurate results from statistical tests. In order to improve the research imputation of outliers need to be conducted

Imputation of Outliers in the variable Profit

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -38.216   1.796   8.502  19.518  28.615  87.890

Imputation of Outliers in the variable Sales

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.836  71.976 128.648 192.475 265.170 577.916

After imputation of outliers data set no longer contains high standard deviation and standard errors

Checking wether there are any missing values in the numerical variables

any_na(lidlcleaner$Profit)
## [1] FALSE
any_na(lidlcleaner$Sales)
## [1] FALSE
any_na(lidlcleaner$Quantity)
## [1] FALSE

Data does not contain any missing values that could impact the results of tests.

Visualisation of imputated variables

The distribution of mainly examined variable profit

##        # classes  Goodness of fit Tabular accuracy 
##       10.0000000        0.9912471        0.9155987
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
x label Freq Percent Valid Percent Cumulative Percent
Valid (-40,-29] 72 1.2 1.4 1.4
(-29,-18] 132 2.2 2.6 4.0
(-18,-7] 600 10.2 11.7 15.7
(-7,4] 1249 21.2 24.3 40.0
(4,15] 1619 27.4 31.5 71.5
(15,26] 658 11.2 12.8 84.4
(26,37] 339 5.7 6.6 91.0
(37,48] 205 3.5 4.0 95.0
(48,59] 159 2.7 3.1 98.1
(59,70] 100 1.7 1.9 100.0
Total 5133 87.0 100.0
Missing <blank> 0 0.0
<NA> 768 13.0
Total 5901 100.0

Graphical visualization of Profit among different segments in the company

The histogram shows the frequency of profit levels across the dataset. The bars represent actual profit data points, with the height indicating the number of occurrences at each profit level.

The peak of the Consumer segment’s density curve suggests that the most common profit level is slightly above zero, indicating that most consumer transactions are marginally profitable. The right skew of the curve suggests there are fewer instances of higher profits.

The Corporate segment’s curve is broader and flatter than the Consumer’s, indicating a wider range of profit levels. It shows that Corporate transactions have a more variable profit outcome with a similar right skew, suggesting a possibility of higher profits, albeit less frequently.

The Home Office segment’s curve is the broadest, indicating the highest variability in profit levels among the three segments. It also shows a skew towards the right, implying the potential for high-profit transactions.

This plot suggests that while most transactions across all segments do not contribute significantly to profit, there is a potential for substantial profits in less frequent, high-value transactions, especially within the Corporate and Home Office segments.

Distibution of profit in each of segments of company

Distribution of profit grouped by the category of products sold

Density Plots by Customer Segment

This plot presents density plots for three different customer segments: Consumer, Corporate, and Home Office.

For Consumer - the plot shows a sharp peak around the zero profit mark, indicating that most consumer transactions have a small profit margin. The distribution has thin tails, showing that there are fewer transactions with large profits or losses.

For Corporate - similar to the Consumer segment, there’s a sharp peak around zero. However, the tails are thicker, suggesting that there are more transactions with significant profits or losses compared to the Consumer segment.

For Home Office - this segment also shows a central peak, with a distribution that is slightly broader than the Consumer segment, indicating a wider range of profit outcomes.

Raincloud Plots by Product Category

Raincloud plots combine box plots, violin plots, and dot plots, for three product categories: Technology, Office Supplies, and Furniture.

For Technology - the dot plot shows a wide spread of profit outcomes, with several outliers suggesting a few transactions with very high profits. The box plot indicates that the median profit is above zero, while the violin plot shows a wide distribution, suggesting a high variability in profit.

For Office Supplies - The distribution of profits is narrower than Technology, with fewer outliers, suggesting more consistency in profit margins. The median profit appears to be around zero, with a small interquartile range.

For Furniture - The plot shows a similar pattern to Office Supplies, with a narrow spread of profits and a median close to zero. However, there appears to be a slightly wider range of outcomes than Office Supplies.

Conclusion

Technology is the most variable category, with a significant potential for high profits, but also a risk of losses. Office Supplies and Furniture have more consistent profit margins but less potential for high profits. Corporate and Home Office segments have a greater variability in profit outcomes than the Consumer segment.

Most profits made by Lidl company grouped by states

## # A tibble: 10 × 2
##    State         sum
##    <chr>       <dbl>
##  1 California 29592.
##  2 New York   18377.
##  3 Washington  8491.
##  4 Michigan    5676.
##  5 Georgia     4427.
##  6 Virginia    3792.
##  7 Indiana     3661.
##  8 Kentucky    2839.
##  9 New Jersey  2801.
## 10 Wisconsin   2582.

Graphical visualization of Profits made on different types of product grouped by company’s segments

A summary table, which lists the top 10 states by a sum value, representing total profits. California leads, followed by New York and Washington, indicating that these states are significant contributors to the total.

The scatter plots depict individual profit data points for transactions within each customer segment and product category.

There is a notable spread of profits across all segments and categories, with some transactions yielding high profits and others resulting in losses, as indicated by points below the zero line.

The distribution of data points suggests variability in profitability, with no single cluster indicating a consistent profit level across categories.

Univariate analysis

Table of profits in different regions

Region Min Max Mean Std_Deviation Median Q1 Q3 Skewness Kurtosis
Central -38.2158 87.89 14.75045 31.47063 5.3970 -4.1136 23.1192 1.210632 0.6317692
East -37.9152 87.89 19.26487 32.05689 8.2121 1.6665 28.0469 1.136197 0.2796200
South -36.1116 87.89 21.48571 33.20401 8.3524 1.6864 32.4513 1.006280 -0.1279510
West -38.2116 87.89 22.24199 30.16417 10.7325 3.8940 30.9918 1.132196 0.3616991

The table presents statistical measures for profit data across four regions: Central, East, South, and West.

Central Region

Has the lowest mean profit but a moderate range of profit values, as indicated by the standard deviation. The positive skewness indicates that there are more high-profit outliers, and the kurtosis value suggests a relatively peaked distribution.

East Region

Similar to the Central region in terms of mean profit, but with a slightly higher standard deviation and less positive skewness. The kurtosis is lower, suggesting a less peaked distribution.

South Region

Shows the lowest mean profit and the highest standard deviation, indicating the most variability in profit. It has a positive skewness, though less pronounced than in Central, and a negative kurtosis, which suggests a flatter distribution with fewer outliers.

West Region

Has the highest mean profit and a lower standard deviation than the South, suggesting more consistent profit values. The skewness is also positive, but less so than in Central or East, and the kurtosis is similar to East, indicating a moderately peaked distribution.

The median values across all regions are lower than the means, which, together with the positive skewness, suggest that the profit distributions are right-skewed, with more transactions yielding lower profits and fewer transactions yielding higher profits.

Descriptive Statistics grouped by the regions of USA

variable Region min q1 med q3 max
Profit Central -38.22 -4.11 5.40 23.12 87.89
Profit East -37.92 1.66 8.21 28.07 87.89
Profit South -36.11 1.67 8.35 32.47 87.89
Profit West -38.21 3.89 10.73 30.99 87.89
Quantity Central 1.00 2.00 3.00 5.00 14.00
Quantity East 1.00 2.00 3.00 5.00 14.00
Quantity South 1.00 2.00 3.00 5.00 14.00
Quantity West 1.00 2.00 3.00 5.00 14.00
Sales Central 0.84 68.11 123.88 252.94 577.92
Sales East 1.50 71.98 129.81 264.83 577.92
Sales South 2.21 73.20 127.61 276.96 577.92
Sales West 1.41 76.18 131.28 265.53 577.92

Profit

All regions have similar minimum and maximum profit values, suggesting that they all have the potential for both significant losses and gains.

The median profit is the highest in the West, indicating that over half of the transactions in that region are more profitable than the other regions.

The first quartile (q1) for profit is negative in all regions, which means at least 25% of transactions are operating at a loss.

The third quartile (q3) values, representing the profit level below which 75% of the data fall, suggest that the top 25% of transactions are considerably more profitable, especially in the West.

Quantity

Quantity sold has the same range (min 1 to max 14) across all regions, with medians all at 3, indicating a uniform distribution of transaction volume.

Sales

The West region has the highest median sales, as well as the highest minimum and maximum values, suggesting that the West has the highest sales values overall.

The East and South have similar sales distribution, with the South having a slightly higher median.

Conclusion

The data indicates that while there is potential for both losses and gains across all regions, the frequency and magnitude of profitable transactions vary, with the West exhibiting a more favorable profit distribution.

The consistent quantity distribution suggests that differences in profit are not due to the volume of products sold but could be influenced by other factors such as product mix, pricing strategies, or operational efficiencies.

Bivariate Analysis

Correlation between each numerical variable

Profit and Quantity

The correlation coefficient is 0.51, suggesting a moderate positive relationship between profit and quantity. As quantity increases, there is a tendency for profit to increase as well.

Profit and Sales

The correlation coefficient is 0.22, indicating a weak positive relationship. This suggests that higher sales are somewhat associated with higher profits, but the relationship is not strong.

Quantity and Sales

The correlation coefficient is 0.21, which is another weak positive correlation. This indicates that there is a slight tendency for transactions with a higher quantity of items to have higher sales.

Conclusion

The correlation values suggest that while there is some positive association between these variables, they are not strongly linked. This indicates that other factors not represented in this correlogram influence profit and sales.

The moderate correlation between profit and quantity could be influenced by economies of scale, where selling more units may lead to lower costs per unit and hence more profit.

The weak correlation between sales and profit implies that simply increasing sales does not guarantee a proportionate increase in profit. This could be due to a variety of factors such as the mix of products sold, varying profit margins, or differing costs associated with the sales.

Statistical Inference

Test for independent samples

Kruskal-Wallis Test

This non-parametric test has been used to determine if there are statistically significant differences between the groups. The test statistic is given as 227.31, and the p-value is near zero, indicating that differences in profit distributions across these categories are statistically significant.

Median Profit

The line within the box plot represents the median profit for each category. It is apparent that the median profit for Technology is higher than that of the other two categories, which have medians around the zero line.

Interquartile Range

The length of the box represents the interquartile range (from the first quartile, Q1, to the third quartile, Q3), showing the middle 50% of profits for each category. The Technology category has the widest box, indicating a greater variability in profits compared to Furniture and Office Supplies.

Whiskers

These lines extending from the boxes indicate the range of the data, excluding outliers. The whiskers for the Technology category are much longer, further indicating the high variability in profits.

Outliers - there are noticeable outliers in the Technology category, indicating some very profitable transactions.

Swarm Plot Points: The individual points plotted in a ‘swarm’ fashion show the distribution and density of profit data for each category.

Conclusion

The Technology category not only has a higher median profit but also a higher range of profits, including several outliers indicating extremely profitable transactions.

Furniture and Office Supplies have similar median profits near zero, but the spread of profits in Office Supplies is larger than in Furniture, indicating more variability.

The significant p-value from the Kruskal-Wallis test suggests that these differences are not due to chance, and there is a true variation in how these categories contribute to overall profit.

Kruskal-Wallis Test - The test statistic is 0.99 with a p-value of 0.61, which is well above the alpha level of 0.05. This indicates there is no statistically significant difference in the distribution of profit between the different payment modes.

The distribution of profit for transactions using cards shows a median close to zero, with a relatively symmetric distribution of data points around the median.

The median profit for COD transactions is slightly higher than for cards, but the overall distribution is similar, with a wide range of profits and losses indicated by the spread of the data points.

The distribution for online payments is also centered around a median profit close to zero. The data points indicate a range of profits similar to cards and COD.

Consumer Segment

The density curve and histogram indicate a concentration of profit around the zero mark, with some spread into both the loss and gain sides. This suggests that consumer transactions frequently yield small profits or losses. The distribution appears to be slightly right-skewed, with a small tail extending towards higher profits.

Corporate Segment

This segment also shows a concentration of profit around zero but with a wider spread compared to the Consumer segment. This indicates more variability in the profit of corporate transactions. The skewness seems more pronounced, suggesting that while most corporate transactions have modest profits, there’s a possibility for higher profits as well.

Home Office Segment

The distribution in this segment is similar to the Corporate segment, with a wide spread and a noticeable right skew. However, the frequency of high-profit transactions seems less compared to the Corporate segment.

The t-test statistics and corresponding p-values for each segment suggest that the means of these distributions are significantly different from zero. This is also supported by the Bayes Factor values, which provide the likelihood ratio for the alternative hypothesis over the null hypothesis. A Bayes Factor greater than one suggests that the alternative hypothesis is more likely.

The left plot represents transactions without returns, and the right plot transactions with returns.

The median profit for transactions without returns is slightly above zero, suggesting that these transactions are generally profitable, albeit with a small margin. The median profit for transactions with returns is lower and closer to zero, indicating that returns may have a negative impact on profitability. Both distributions exhibit a wide range of values, including some transactions with high profits and others with significant losses.

The broader range of the violin plot for transactions with returns suggests greater variability in profit when returns are involved. The p-value is not significant, which suggests that the difference in the median profits between transactions with and without returns is not statistically significant, which is also supported by the Kruskal-Wallis test.

No Returns Pie Chart shows the proportion of transactions without returns in each region. The Central region accounts for the largest share (31%), followed by the East (29%), West (24%), and South (16%).

Returns Pie Chart presents the proportion of transactions with returns. The West region accounts for a significant majority (68%), while the other regions account for much smaller proportions (East 17%, South 11%, and Central 4%).

Chi-Square Test for No Returns is 153.97 with a p-value of approximately 0, indicating a significant difference in the distribution of transactions without returns across regions.

Chi-Square Test for Returns is 286.68 with a p-value of approximately 0, suggesting a significant difference in the distribution of transactions with returns across regions.

Both tests show a Cramer’s V value of 0.16, which, despite the statistical significance, indicates a relatively weak association in terms of the effect size.

Conclusion:

The significant chi-square test results indicate that the distribution of transactions, both with and without returns, is not uniform across regions.

In Central Region, the largest proportion of payments is made with COD at 40%, followed closely by Online payments at 39%, and Cash payments account for 22%. In East Region, COD is also the most common payment mode at 41%, with Online at 35% and Cash at 24%. In South Region, COD constitutes the majority at 48%, Online payments drop to 32%, and Cash remains the least common at 22%. In West Region, Online payments and COD are equally prevalent at 39% each, while Cash payments are slightly less common at 20%.

The p-values for each region are well below the 0.05 threshold, suggesting that the differences in payment mode usage between the regions are statistically significant.

The chi-square test statistic (χ²) is not explicitly stated for each region but given the low p-values, it is clear the differences are statistically significant.

#Conclusion

The data indicates a preference for COD across all regions, with it being the most common or joint most common payment mode in each area.

Online payments are also popular, especially in the Central and West regions where they are nearly as common as COD.

Cash is consistently the least preferred method across all regions, never exceeding a quarter of the payments.