The dataset contains data of different stores of a supermarket company.
Store.ID: (Index) ID of the particular store.
Store_Area: Physical Area of the store in yard
square.
Items_Available: Number of different items available in
the corresponding store.
Daily_Customer_Count: Number of customers who visited to
stores on an average over month.
Store_Sales: Sales in (US $) that stores made.
Read the file “Stores.csv” and save it in variable Stores.
Stores <- read.csv("Stores.csv")
# First 6 rows of data
head(Stores)
## Store.ID Store_Area Items_Available Daily_Customer_Count Store_Sales
## 1 1 1659 1961 530 66490
## 2 2 1461 1752 210 39820
## 3 3 1340 1609 720 54010
## 4 4 1451 1748 620 53730
## 5 5 1770 2111 450 46620
## 6 6 1442 1733 760 45260
# Last 6 rows of data
tail(Stores)
## Store.ID Store_Area Items_Available Daily_Customer_Count Store_Sales
## 891 891 1549 1851 1220 70620
## 892 892 1582 1910 1080 66390
## 893 893 1387 1663 850 82080
## 894 894 1200 1436 1060 76440
## 895 895 1299 1560 770 96610
## 896 896 1174 1429 1110 54340
# Dimension of Stores data
dim(Stores)
## [1] 896 5
Data has 896 rows and 5 columns.
# Column names of dataset
names(Stores)
## [1] "Store.ID" "Store_Area" "Items_Available"
## [4] "Daily_Customer_Count" "Store_Sales"
The Stores dataset has 5 columns with the names: “Store.ID”, “Store_Area”, “Items_Available”, “Daily_Customer_Count”, “Store_Sales”.
# Check the data type for each column
str(Stores)
## 'data.frame': 896 obs. of 5 variables:
## $ Store.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Store_Area : int 1659 1461 1340 1451 1770 1442 1542 1261 1090 1030 ...
## $ Items_Available : int 1961 1752 1609 1748 2111 1733 1858 1507 1321 1235 ...
## $ Daily_Customer_Count: int 530 210 720 620 450 760 1030 1020 680 1130 ...
## $ Store_Sales : int 66490 39820 54010 53730 46620 45260 72240 37720 46310 44150 ...
The Store.ID data type is integer. However, while the values of Store.ID are numbers, they are IDs that represents different object. We will not do mathematical operations with IDs, therefore the proper data type should be character.
# Convert the incorrect data types
Stores$Store.ID <- as.character(Stores$Store.ID)
str(Stores)
## 'data.frame': 896 obs. of 5 variables:
## $ Store.ID : chr "1" "2" "3" "4" ...
## $ Store_Area : int 1659 1461 1340 1451 1770 1442 1542 1261 1090 1030 ...
## $ Items_Available : int 1961 1752 1609 1748 2111 1733 1858 1507 1321 1235 ...
## $ Daily_Customer_Count: int 530 210 720 620 450 760 1030 1020 680 1130 ...
## $ Store_Sales : int 66490 39820 54010 53730 46620 45260 72240 37720 46310 44150 ...
Now, all the columns are in their desired data types.
# Check for missing values in each rows
colSums(is.na(Stores))
## Store.ID Store_Area Items_Available
## 0 0 0
## Daily_Customer_Count Store_Sales
## 0 0
# Check for missing values in the entire data
anyNA(Stores)
## [1] FALSE
There are no missing values in our data.
Our data is ready to be further explored and analysed.
# Brief statistical explanation of data
summary(Stores)
## Store.ID Store_Area Items_Available Daily_Customer_Count
## Length:896 Min. : 775 Min. : 932 Min. : 10.0
## Class :character 1st Qu.:1317 1st Qu.:1576 1st Qu.: 600.0
## Mode :character Median :1477 Median :1774 Median : 780.0
## Mean :1485 Mean :1782 Mean : 786.4
## 3rd Qu.:1654 3rd Qu.:1983 3rd Qu.: 970.0
## Max. :2229 Max. :2667 Max. :1560.0
## Store_Sales
## Min. : 14920
## 1st Qu.: 46530
## Median : 58605
## Mean : 59351
## 3rd Qu.: 71872
## Max. :116320
From the summary above, we can understand that:
boxplot(Stores$Store_Area, horizontal=T)
2 stores are relatively smaller compared to other store branches with store area below 1000 yard square, while 3 store branches have relatively large store area with more than 2000 yard square.
boxplot(Stores$Items_Available, horizontal=T)
2 stores have relatively little item variety (< 1000 different items), whereas 3 stores have relatively large item variety (> 2500 different items).
boxplot(Stores$Daily_Customer_Count, horizontal=T)
One of the stores are not doing very well and barely get any customers. Meanwhile, two store branches are getting more than 1500 customers daily.
boxplot(Stores$Store_Sales, horizontal=T)
From the boxplot above, it seems that one of the stores are doing exceptionally well in their sales, making almost USD 120,000.
cor(Stores$Store_Area, Stores$Store_Sales)
## [1] 0.0974738
plot(Stores$Store_Area, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Store_Area), col="red")
cor(Stores$Items_Available, Stores$Store_Sales)
## [1] 0.09884943
plot(Stores$Items_Available, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Items_Available), col="red")
cor(Stores$Daily_Customer_Count, Stores$Store_Sales)
## [1] 0.008628708
plot(Stores$Daily_Customer_Count, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Daily_Customer_Count), col="red")
Answer: All 3 variables (Store Area, Items Available, Daily Customer Count) individually has negligible correlation with Store Sales.
cor(Stores$Items_Available, Stores$Store_Area)
## [1] 0.9988908
plot(Stores$Store_Area, Stores$Items_Available)
abline(lm(Stores$Items_Available ~ Stores$Store_Area), col="red")
Answer: Yes, the number of items variety and the store are has a strong positive correlation. Stores with bigger store area will have more variety of items.
cor(Stores$Daily_Customer_Count, Stores$Store_Area)
## [1] -0.0414231
plot(Stores$Store_Area, Stores$Daily_Customer_Count)
abline(lm(Stores$Daily_Customer_Count ~ Stores$Store_Area), col="red")
Answer: No, the store size does not affect the number of customer
sum(Stores$Store_Sales)
## [1] 53178770
Answer: The supermarket company made a total sales of USD 53,178,770
Stores[Stores$Store_Sales==max(Stores$Store_Sales),]
## Store.ID Store_Area Items_Available Daily_Customer_Count Store_Sales
## 650 650 1989 2414 860 116320
Answer: The store with store ID 650 has the highest sales. Comparing it with the metrics from Stores data summary, we can see that:
Its store area is quite large (1989 yard square), larger than 75% of the stores (1654 yard square).
It sells 2414 different items, holding more variety of items than 75% of the stores (1983 items).
It serves averagely 860 customers per day, which is more than the average daily customer count (786 customers).
Customer Efficiency is the number of sales the store made per customer.
# Calculate customer efficiency
Stores$Customer_Efficiency <- Stores$Store_Sales / Stores$Daily_Customer_Count
# Calculate the correlation
cor(Stores$Customer_Efficiency, Stores$Store_Sales)
## [1] 0.1395461
plot(Stores$Customer_Efficiency, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Customer_Efficiency), col="red")
Let’s try removing the outlier.
# Remove outlier
Stores_ce_sales <- Stores[Stores$Customer_Efficiency < 4000, ]
# Calculate correlation
cor(Stores_ce_sales$Customer_Efficiency, Stores_ce_sales$Store_Sales)
## [1] 0.4069858
plot(Stores_ce_sales$Customer_Efficiency, Stores_ce_sales$Store_Sales)
abline(lm(Stores_ce_sales$Store_Sales ~ Stores_ce_sales$Customer_Efficiency), col="red")
Answer: Customer Efficiency has moderate positive correlation with Store Sales
Area Efficiency is the number of sales the store made per area.
# Calculate area efficiency
Stores$Area_Efficiency <- Stores$Store_Sales / Stores$Store_Area
# Calculate the correlation
cor(Stores$Area_Efficiency, Stores$Store_Sales)
## [1] 0.8269392
plot(Stores$Area_Efficiency, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Area_Efficiency), col="red")
Answer: Area Efficiency has strong positive correlation with the Store Sales
Item Efficiency is the number of sales the store made per item variety.
# Calculate area efficiency
Stores$Item_Efficiency <- Stores$Store_Sales / Stores$Items_Available
# Calculate the correlation
cor(Stores$Item_Efficiency, Stores$Store_Sales)
## [1] 0.8272109
plot(Stores$Item_Efficiency, Stores$Store_Sales)
abline(lm(Stores$Store_Sales ~ Stores$Item_Efficiency), col="red")
Answer: Item Efficiency has strong positive correlation with Store Sales
head(Stores [order(Stores$Area_Efficiency, decreasing = T), c("Store.ID", "Area_Efficiency", "Store_Sales")], 5)
## Store.ID Area_Efficiency Store_Sales
## 557 557 96.30901 89760
## 469 469 91.84906 97360
## 745 745 91.03044 77740
## 558 558 89.51627 101780
## 340 340 88.47826 85470
tail(Stores [order(Stores$Area_Efficiency, decreasing = T), c("Store.ID", "Area_Efficiency", "Store_Sales")], 5)
## Store.ID Area_Efficiency Store_Sales
## 594 594 12.48153 20270
## 32 32 11.93600 14920
## 373 373 11.54051 21650
## 776 776 11.49642 17670
## 853 853 11.08328 16370
head(Stores [order(Stores$Item_Efficiency, decreasing = T), c("Store.ID", "Item_Efficiency", "Store_Sales")], 5)
## Store.ID Item_Efficiency Store_Sales
## 557 557 78.87522 89760
## 745 745 76.36542 77740
## 469 469 76.12197 97360
## 558 558 74.07569 101780
## 690 690 73.95768 94370
tail(Stores [order(Stores$Item_Efficiency, decreasing = T), c("Store.ID", "Item_Efficiency", "Store_Sales")], 5)
## Store.ID Item_Efficiency Store_Sales
## 594 594 10.416238 20270
## 32 32 9.893899 14920
## 373 373 9.605146 21650
## 776 776 9.413958 17670
## 853 853 9.145251 16370
A supermarket company has several store branches and they make a total sales of USD 53,178,770. The highest sales they made from one store was USD 116,320 from store number 650.