About Dataset

The dataset contains data of different stores of a supermarket company.

Column Description

Store.ID: (Index) ID of the particular store.

Store_Area: Physical Area of the store in yard square.

Items_Available: Number of different items available in the corresponding store.

Daily_Customer_Count: Number of customers who visited to stores on an average over month.

Store_Sales: Sales in (US $) that stores made.

Source

https://www.kaggle.com/datasets/surajjha101/stores-area-and-sales-data

Input Data

Read the file “Stores.csv” and save it in variable Stores.

Stores <- read.csv("Stores.csv")

Data Inspection

# First 6 rows of data 
head(Stores)

##   Store.ID Store_Area Items_Available Daily_Customer_Count Store_Sales
## 1        1       1659            1961                  530       66490
## 2        2       1461            1752                  210       39820
## 3        3       1340            1609                  720       54010
## 4        4       1451            1748                  620       53730
## 5        5       1770            2111                  450       46620
## 6        6       1442            1733                  760       45260

# Last 6 rows of data
tail(Stores)

##     Store.ID Store_Area Items_Available Daily_Customer_Count Store_Sales
## 891      891       1549            1851                 1220       70620
## 892      892       1582            1910                 1080       66390
## 893      893       1387            1663                  850       82080
## 894      894       1200            1436                 1060       76440
## 895      895       1299            1560                  770       96610
## 896      896       1174            1429                 1110       54340

# Dimension of Stores data
dim(Stores)

## [1] 896   5

Data has 896 rows and 5 columns.

# Column names of dataset
names(Stores)

## [1] "Store.ID"             "Store_Area"           "Items_Available"     
## [4] "Daily_Customer_Count" "Store_Sales"

The Stores dataset has 5 columns with the names: “Store.ID”, “Store_Area”, “Items_Available”, “Daily_Customer_Count”, “Store_Sales”.

Data Cleansing & Coertions

# Check the data type for each column
str(Stores)

## 'data.frame':    896 obs. of  5 variables:
##  $ Store.ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Store_Area          : int  1659 1461 1340 1451 1770 1442 1542 1261 1090 1030 ...
##  $ Items_Available     : int  1961 1752 1609 1748 2111 1733 1858 1507 1321 1235 ...
##  $ Daily_Customer_Count: int  530 210 720 620 450 760 1030 1020 680 1130 ...
##  $ Store_Sales         : int  66490 39820 54010 53730 46620 45260 72240 37720 46310 44150 ...

The Store.ID data type is integer. However, while the values of Store.ID are numbers, they are IDs that represents different object. We will not do mathematical operations with IDs, therefore the proper data type should be character.

# Convert the incorrect data types
Stores$Store.ID <- as.character(Stores$Store.ID)

str(Stores)

## 'data.frame':    896 obs. of  5 variables:
##  $ Store.ID            : chr  "1" "2" "3" "4" ...
##  $ Store_Area          : int  1659 1461 1340 1451 1770 1442 1542 1261 1090 1030 ...
##  $ Items_Available     : int  1961 1752 1609 1748 2111 1733 1858 1507 1321 1235 ...
##  $ Daily_Customer_Count: int  530 210 720 620 450 760 1030 1020 680 1130 ...
##  $ Store_Sales         : int  66490 39820 54010 53730 46620 45260 72240 37720 46310 44150 ...

Now, all the columns are in their desired data types.

# Check for missing values in each rows
colSums(is.na(Stores))

##             Store.ID           Store_Area      Items_Available 
##                    0                    0                    0 
## Daily_Customer_Count          Store_Sales 
##                    0                    0

# Check for missing values in the entire data
anyNA(Stores)

## [1] FALSE

There are no missing values in our data.

Our data is ready to be further explored and analysed.

Data Explanation

# Brief statistical explanation of data
summary(Stores)

##    Store.ID           Store_Area   Items_Available Daily_Customer_Count
##  Length:896         Min.   : 775   Min.   : 932    Min.   :  10.0      
##  Class :character   1st Qu.:1317   1st Qu.:1576    1st Qu.: 600.0      
##  Mode  :character   Median :1477   Median :1774    Median : 780.0      
##                     Mean   :1485   Mean   :1782    Mean   : 786.4      
##                     3rd Qu.:1654   3rd Qu.:1983    3rd Qu.: 970.0      
##                     Max.   :2229   Max.   :2667    Max.   :1560.0      
##   Store_Sales    
##  Min.   : 14920  
##  1st Qu.: 46530  
##  Median : 58605  
##  Mean   : 59351  
##  3rd Qu.: 71872  
##  Max.   :116320

From the summary above, we can understand that:

Store Area

The smallest store area is 775 yard square, whereas the biggest is 2229 yard square.
The store areas are 1485 yard square on average.

Items Available

The store with the lowest number of item variation has 932 different items.
The store with the highest number of item variation has 2667 different items.
The stores have 1782 different items on average.

Daily Customer Count

The stores serve minimum 10 customers per day.
The stores serve maximum 1560 customers per day.
On average, the stores serve 786 customers per day.

Store Sales

The stores made minimum USD 14,920 and maximum USD 116,320 sales.
On average, the stores made USD 59,351 sales.

Checking for outliers with Boxplot

Store Area

boxplot(Stores$Store_Area, horizontal=T)

2 stores are relatively smaller compared to other store branches with store area below 1000 yard square, while 3 store branches have relatively large store area with more than 2000 yard square.

Items Available

boxplot(Stores$Items_Available, horizontal=T)

2 stores have relatively little item variety (< 1000 different items), whereas 3 stores have relatively large item variety (> 2500 different items).

Daily Customer Count

boxplot(Stores$Daily_Customer_Count, horizontal=T)

One of the stores are not doing very well and barely get any customers. Meanwhile, two store branches are getting more than 1500 customers daily.

Store Sales

boxplot(Stores$Store_Sales, horizontal=T)

From the boxplot above, it seems that one of the stores are doing exceptionally well in their sales, making almost USD 120,000.

Data Exploration

1. How are the correlation between the Store Sales and other variables in this data?

1.1 Store Area vs Store Sales

cor(Stores$Store_Area, Stores$Store_Sales)

## [1] 0.0974738

plot(Stores$Store_Area, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Store_Area), col="red")

1.2 Item Available Area vs Store Sales

cor(Stores$Items_Available, Stores$Store_Sales)

## [1] 0.09884943

plot(Stores$Items_Available, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Items_Available), col="red")

1.3 Daily Customer Count vs Store Sales

cor(Stores$Daily_Customer_Count, Stores$Store_Sales)

## [1] 0.008628708

plot(Stores$Daily_Customer_Count, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Daily_Customer_Count), col="red")

Answer: All 3 variables (Store Area, Items Available, Daily Customer Count) individually has negligible correlation with Store Sales.

2. Does the store size affect other variables?

2.1 Items Available vs Store Size

cor(Stores$Items_Available, Stores$Store_Area)

## [1] 0.9988908

plot(Stores$Store_Area, Stores$Items_Available)
abline(lm(Stores$Items_Available ~ Stores$Store_Area), col="red")

Answer: Yes, the number of items variety and the store are has a strong positive correlation. Stores with bigger store area will have more variety of items.

2.2 Daily Customer Count vs Store Size

cor(Stores$Daily_Customer_Count, Stores$Store_Area)

## [1] -0.0414231

plot(Stores$Store_Area, Stores$Daily_Customer_Count)
abline(lm(Stores$Daily_Customer_Count ~ Stores$Store_Area), col="red")

Answer: No, the store size does not affect the number of customer

3. What is the total sales of this supermarket company?

sum(Stores$Store_Sales)

## [1] 53178770

Answer: The supermarket company made a total sales of USD 53,178,770

4. Which store makes the highest sales?

Stores[Stores$Store_Sales==max(Stores$Store_Sales),]

##     Store.ID Store_Area Items_Available Daily_Customer_Count Store_Sales
## 650      650       1989            2414                  860      116320

Answer: The store with store ID 650 has the highest sales. Comparing it with the metrics from Stores data summary, we can see that:

Its store area is quite large (1989 yard square), larger than 75% of the stores (1654 yard square).
It sells 2414 different items, holding more variety of items than 75% of the stores (1983 items).
It serves averagely 860 customers per day, which is more than the average daily customer count (786 customers).

5. Does customer efficiency affect store sales?

Customer Efficiency is the number of sales the store made per customer.

# Calculate customer efficiency
Stores$Customer_Efficiency <- Stores$Store_Sales / Stores$Daily_Customer_Count

# Calculate the correlation
cor(Stores$Customer_Efficiency, Stores$Store_Sales)

## [1] 0.1395461

plot(Stores$Customer_Efficiency, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Customer_Efficiency), col="red")

Let’s try removing the outlier.

# Remove outlier
Stores_ce_sales <- Stores[Stores$Customer_Efficiency < 4000, ]

# Calculate correlation
cor(Stores_ce_sales$Customer_Efficiency, Stores_ce_sales$Store_Sales)

## [1] 0.4069858

plot(Stores_ce_sales$Customer_Efficiency, Stores_ce_sales$Store_Sales)
abline(lm(Stores_ce_sales$Store_Sales ~ Stores_ce_sales$Customer_Efficiency), col="red")

Answer: Customer Efficiency has moderate positive correlation with Store Sales

6. Does area efficiency affect store sales?

Area Efficiency is the number of sales the store made per area.

# Calculate area efficiency
Stores$Area_Efficiency <- Stores$Store_Sales / Stores$Store_Area

# Calculate the correlation
cor(Stores$Area_Efficiency, Stores$Store_Sales)

## [1] 0.8269392

plot(Stores$Area_Efficiency, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Area_Efficiency), col="red")

Answer: Area Efficiency has strong positive correlation with the Store Sales

7. Does item efficiency affect store sales?

Item Efficiency is the number of sales the store made per item variety.

# Calculate area efficiency
Stores$Item_Efficiency <- Stores$Store_Sales / Stores$Items_Available

# Calculate the correlation
cor(Stores$Item_Efficiency, Stores$Store_Sales)

## [1] 0.8272109

plot(Stores$Item_Efficiency, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Item_Efficiency), col="red")

Answer: Item Efficiency has strong positive correlation with Store Sales

8.1 Top 5 most area-efficient stores and their sales

head(Stores [order(Stores$Area_Efficiency, decreasing = T), c("Store.ID", "Area_Efficiency", "Store_Sales")], 5)

##     Store.ID Area_Efficiency Store_Sales
## 557      557        96.30901       89760
## 469      469        91.84906       97360
## 745      745        91.03044       77740
## 558      558        89.51627      101780
## 340      340        88.47826       85470

8.2 Top 5 least area-efficient stores and their sales

tail(Stores [order(Stores$Area_Efficiency, decreasing = T), c("Store.ID", "Area_Efficiency", "Store_Sales")], 5)

##     Store.ID Area_Efficiency Store_Sales
## 594      594        12.48153       20270
## 32        32        11.93600       14920
## 373      373        11.54051       21650
## 776      776        11.49642       17670
## 853      853        11.08328       16370

9.1 Top 5 most item-efficient stores and their sales

head(Stores [order(Stores$Item_Efficiency, decreasing = T), c("Store.ID", "Item_Efficiency", "Store_Sales")], 5)

##     Store.ID Item_Efficiency Store_Sales
## 557      557        78.87522       89760
## 745      745        76.36542       77740
## 469      469        76.12197       97360
## 558      558        74.07569      101780
## 690      690        73.95768       94370

9.2 Top 5 least item-efficient stores and their sales

tail(Stores [order(Stores$Item_Efficiency, decreasing = T), c("Store.ID", "Item_Efficiency", "Store_Sales")], 5)

##     Store.ID Item_Efficiency Store_Sales
## 594      594       10.416238       20270
## 32        32        9.893899       14920
## 373      373        9.605146       21650
## 776      776        9.413958       17670
## 853      853        9.145251       16370

Explanatory Text, Key Insights, Business Recommendation

A supermarket company has several store branches and they make a total sales of USD 53,178,770. The highest sales they made from one store was USD 116,320 from store number 650.

Key Insight

Store Area, Items Available, and Daily Customer Count have no correlation with Store Sales.
Customer Efficiency has moderate positive correlation with Store Sales.
Area Efficiency and Item Efficiency has strong positive correlation with Store Sales.
Stores with bigger areas tend to have have larger number of item variety.

Business Recommendation

Focus on area efficiency (sales per floor area) and item efficiency (sales per item variety) to further improve store sales.
Conduct research on the area-inefficient stores to see how to improve the area efficiency. For example, the number and placement of the item shelves or the cashiers.
Gather more data to learn more about the item efficiency, such as what items or category of items are doing well. The item varieties that generates low sales can be may be considered for discontinuation.

Exploratory Data Analysis - Supermarket Store Branches Sales

LBB P4DS

Grace Angela Sutantya

2023-10-22