Retail Shopping Sales Analysis

Case Study

The dataset represents the sales transactions of a UK-based online retail company specializing in unique all-occasion gift-ware. The data spans from December 1, 2009, to December 9, 2011. The company caters to both individual customers and wholesalers.

Here is a description of the attributes in the dataset:

InvoiceNo: A 6-digit integral number uniquely assigned to each transaction. If the code starts with the letter ‘c’, it indicates a cancellation.
StockCode: A 5-digit integral number uniquely assigned to each distinct product,representing the product (item) code.
Description: The product (item) name, provided in nominal format.
Quantity: The quantity of each product (item) per transaction, represented as a numeric value.
InvoiceDate: The date and time when a transaction was generated, presented in numeric format.
UnitPrice: The unit price of each product, representing the product price per unit in numeric format.
CustomerID: A 5-digit integral number uniquely assigned to each customer, serving as a nominal identifier.
Country: The name of the country where a customer resides, presented in nominal format.

The dataset provides information on sales transactions, including details on the products sold, quantities, prices, customer identification, and the countries where customers are located. Additionally, the dataset indicates canceled transactions with the presence of the letter ‘c’ at the beginning of the InvoiceNo. Analyzing this dataset could reveal insights into customer behavior, popular products, sales trends, and geographical distribution of customers.

# Install the necessary packages
install.packages(c("tidyverse", "plotly", "knitr", "corrplot"),
                 repos = "https://cran.rstudio.com/")

## package 'tidyverse' successfully unpacked and MD5 sums checked
## package 'plotly' successfully unpacked and MD5 sums checked
## package 'knitr' successfully unpacked and MD5 sums checked
## package 'corrplot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\HP\AppData\Local\Temp\RtmpE5TuLJ\downloaded_packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(lubridate)
library(knitr)
library(corrplot)

## corrplot 0.92 loaded

Data Preparation

#########################
### DATA PREPARATION ##
#########################

# Loading the data file's
file1 <- read.csv("Main Data 09-10.csv")
file2 <- read.csv("Cancelled Products 09-10.csv")
file3 <- read.csv("Damaged Products 09-10.csv")
file4 <- read.csv("Main Data 10-11.csv")
file5 <- read.csv("Cancelled Products 10-11.csv")
file6 <- read.csv("Damaged Products 10-11.csv")

Sales_Data <- rbind(file1, file2, file3, file4, file5, file6)

# Checking the structure of file
str(Sales_Data)

## 'data.frame':    1054751 obs. of  15 variables:
##  $ Invoice     : chr  "489434" "489434" "489434" "489434" ...
##  $ StockCode   : chr  "85048" "79323P" "79323W" "22041" ...
##  $ Description : chr  "15Cm Christmas Glass Ball 20 Lights" "Pink Cherry Lights" "White Cherry Lights" "Record Frame 7\" Single Size" ...
##  $ Customer.ID : int  13085 13085 13085 13085 13085 13085 13085 13085 13085 13085 ...
##  $ Quantity    : int  12 12 12 48 24 24 24 10 12 12 ...
##  $ InvoiceDate : chr  "12/1/2009" "12/1/2009" "12/1/2009" "12/1/2009" ...
##  $ InvoiceYear : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ InvoiceMonth: int  12 12 12 12 12 12 12 12 12 12 ...
##  $ DayOfWeek   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ InvoiceTime : chr  "7:45:00 AM" "7:45:00 AM" "7:45:00 AM" "7:45:00 AM" ...
##  $ Hour        : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ Price       : num  6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
##  $ Revenue     : num  83.4 81 81 100.8 30 ...
##  $ Country     : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
##  $ Status      : chr  "Normal" "Normal" "Normal" "Normal" ...

# Changing 'Status', 'InvoiceMonth', 'DayOfWeek', 'Hour' and 'InvoiceYear' to factor
Sales_Data$Status <- factor(Sales_Data$Status)

Sales_Data$InvoiceMonth <- factor(Sales_Data$InvoiceMonth)

Sales_Data$DayOfWeek <- factor(Sales_Data$DayOfWeek)

Sales_Data$Hour <- factor(Sales_Data$Hour)

Sales_Data$InvoiceYear <- factor(Sales_Data$InvoiceYear)

# Changing Data Type of Date from 'Character' to 'Date'
Sales_Data$InvoiceDate <- mdy(Sales_Data$InvoiceDate)

# Changing Data Type of Time from 'Character' to 'HMS' format
Sales_Data$InvoiceTime <- hms(Sales_Data$InvoiceTime)

# Checking head and summary statistics
head(Sales_Data)

##   Invoice StockCode                         Description Customer.ID Quantity
## 1  489434     85048 15Cm Christmas Glass Ball 20 Lights       13085       12
## 2  489434    79323P                  Pink Cherry Lights       13085       12
## 3  489434    79323W                 White Cherry Lights       13085       12
## 4  489434     22041         Record Frame 7" Single Size       13085       48
## 5  489434     21232      Strawberry Ceramic Trinket Box       13085       24
## 6  489434     22064           Pink Doughnut Trinket Pot       13085       24
##   InvoiceDate InvoiceYear InvoiceMonth DayOfWeek InvoiceTime Hour Price Revenue
## 1  2009-12-01        2009           12         2   7H 45M 0S    7  6.95    83.4
## 2  2009-12-01        2009           12         2   7H 45M 0S    7  6.75    81.0
## 3  2009-12-01        2009           12         2   7H 45M 0S    7  6.75    81.0
## 4  2009-12-01        2009           12         2   7H 45M 0S    7  2.10   100.8
## 5  2009-12-01        2009           12         2   7H 45M 0S    7  1.25    30.0
## 6  2009-12-01        2009           12         2   7H 45M 0S    7  1.65    39.6
##          Country Status
## 1 United Kingdom Normal
## 2 United Kingdom Normal
## 3 United Kingdom Normal
## 4 United Kingdom Normal
## 5 United Kingdom Normal
## 6 United Kingdom Normal

summary(Sales_Data)

##    Invoice           StockCode         Description         Customer.ID    
##  Length:1054751     Length:1054751     Length:1054751     Min.   :12346   
##  Class :character   Class :character   Class :character   1st Qu.:13969   
##  Mode  :character   Mode  :character   Mode  :character   Median :15245   
##                                                           Mean   :15318   
##                                                           3rd Qu.:16794   
##                                                           Max.   :18287   
##                                                           NA's   :242813  
##     Quantity          InvoiceDate         InvoiceYear    InvoiceMonth   
##  Min.   :-80995.00   Min.   :2009-12-01   2009: 44722   11     :159857  
##  1st Qu.:     1.00   1st Qu.:2010-07-09   2010:515807   12     :134094  
##  Median :     3.00   Median :2010-12-07   2011:494222   10     :118155  
##  Mean   :    10.02   Mean   :2011-01-02                 9      : 91424  
##  3rd Qu.:    10.00   3rd Qu.:2011-07-22                 3      : 77379  
##  Max.   : 80995.00   Max.   :2011-12-09                 6      : 76049  
##                                                         (Other):397793  
##  DayOfWeek   InvoiceTime                            Hour       
##  0:135593   Min.   :1H 0M 0S                   12     :157226  
##  1:187416   1st Qu.:2H 49M 0S                  1      :147848  
##  2:194826   Median :5H 8M 0S                   2      :139576  
##  3:183045   Mean   :6H 42M 40.0099928798372S   3      :133878  
##  4:201042   3rd Qu.:11H 10M 0S                 11     :120057  
##  5:152427   Max.   :12H 59M 0S                 4      : 93038  
##  6:   402                                      (Other):263128  
##      Price             Revenue             Country                Status       
##  Min.   :    0.00   Min.   :-168469.60   Length:1054751     Cancelled:  22177  
##  1st Qu.:    1.25   1st Qu.:      3.75   Class :character   Damaged  :   3457  
##  Median :    2.10   Median :      9.90   Mode  :character   Normal   :1029117  
##  Mean   :    4.73   Mean   :     18.28                                         
##  3rd Qu.:    4.15   3rd Qu.:     17.70                                         
##  Max.   :38970.00   Max.   : 168469.60                                         
##

colnames(Sales_Data)

##  [1] "Invoice"      "StockCode"    "Description"  "Customer.ID"  "Quantity"    
##  [6] "InvoiceDate"  "InvoiceYear"  "InvoiceMonth" "DayOfWeek"    "InvoiceTime" 
## [11] "Hour"         "Price"        "Revenue"      "Country"      "Status"

# Checking for null values if any
any(is.na(Sales_Data))

## [1] TRUE

colSums(is.na(Sales_Data))

##      Invoice    StockCode  Description  Customer.ID     Quantity  InvoiceDate 
##            0            0            0       242813            0            0 
##  InvoiceYear InvoiceMonth    DayOfWeek  InvoiceTime         Hour        Price 
##            0            0            0            0            0            0 
##      Revenue      Country       Status 
##            0            0            0

The “CustomerID” column in the dataset contains 242,813 null values, and removing them is not feasible as it could result in substantial data loss. Additionally, replacing these null values is challenging because the actual CustomerID information is unknown. Therefore, retaining the null values in the “CustomerID” column is necessary for the integrity of the dataset, even though it limits the ability to analyze and identify specific customers for those transactions.

Exploratory Data Analysis

##################
 ### EDA ##
##################

Sales_New <- Sales_Data %>% 
  filter(Quantity > 0 & Status == 'Normal')
sum(Sales_New$Price)

## [1] 4109900

hist(log(Sales_New$Quantity), 
     main = "Distribution of Quantity",
     col = "#30fc66")

hist(log(Sales_New$Revenue), 
     main = "Distribution of Revenue",
     col = "red")

# Looking at the total number of:
  # Normal Products
  # Cancelled Products
  # Damaged Products

Product_Status <- Sales_Data %>% 
  group_by(Status) %>% 
  summarise(Count = n()) %>% 
  arrange(-Count) %>% 
  mutate(Perc = (Count/sum(Count))*100)

kable(Product_Status)

Status	Count	Perc
Normal	1029117	97.5696634
Cancelled	22177	2.1025816
Damaged	3457	0.3277551

# 97.6% of the products were in good condition

# Let's look at the revenue column
sum(Sales_Data$Revenue)

## [1] 19280622

Overall_Revenue <- Sales_Data %>% 
  select(InvoiceYear, Revenue) %>% 
  group_by(InvoiceYear) %>% 
  summarise(Total_Revenue = sum(Revenue)) %>% 
  arrange(-Total_Revenue)

ggplot(Overall_Revenue, aes(x = InvoiceYear, y = Total_Revenue)) +
  geom_col(aes(fill = factor(InvoiceYear))
           , color = "black", alpha = 0.7) +
  theme_bw() +
  labs(title = "Total Revenue VS Year",
       x = "Year") +
  guides(fill = F) +
  scale_fill_brewer(palette = "Set2")+
  theme(plot.title = element_text(color = "#7A4726"))

# Overall Revenue was maximum in 2010

# Looking for trends in Quantity and Month
Sales_Data %>% 
  group_by(InvoiceMonth) %>% 
  summarise(total_quantity = sum(Quantity)) %>% 
  ggplot(aes(x = factor(InvoiceMonth), y = total_quantity)) +
  geom_point(color = "red", size = 3)+
  theme_bw() +
  labs(title = "Overall Monthly Trend: Quantity Purchased",
       subtitle = "An overall upward trend in purchases, peaking in Oct-Dec",
       x = "Month", y = "Total_Quantity") +
  theme(plot.title = element_text(color = "#7A4726")) +
  theme(plot.subtitle = element_text(color = "#072662",
                                     face = "italic", size = 10))

# Let's dive deeper to find Year wise Trends
Year_2010 <- Sales_Data %>% 
  filter(InvoiceYear == 2010 & Status == 'Normal')

Year_2010 %>% 
  group_by(InvoiceMonth) %>% 
  summarise(total_quantity = sum(Quantity)) %>% 
  ggplot(aes(x = factor(InvoiceMonth), y = total_quantity)) +
  geom_point(color = "orange", size = 3)+
  theme_bw() +
  labs(title = "Monthly Trend of Year 2010: Quantity Purchased",
       x = "Month", y = "Total_Quantity") +
  theme(plot.title = element_text(color = "#7A4726"))

Year_2011 <- Sales_Data %>% 
  filter(InvoiceYear == 2011 & Status == 'Normal')

Year_2011 %>% 
  group_by(InvoiceMonth) %>% 
  summarise(total_quantity = sum(Quantity)) %>% 
  ggplot(aes(x = factor(InvoiceMonth), y = total_quantity)) +
  geom_point(color = "blue", size = 3)+
  theme_bw() +
  labs(title = "Monthly Trend of Year 2011: Quantity Purchased",
       x = "Month", y = "Total_Quantity") +
  theme(plot.title = element_text(color = "#7A4726"))

Upon analyzing the plots, it is evident that there is a general upward trend in purchases, with an increase in the number of transactions observed from November to December. Noteworthy declines in sales occurred in July 2010, as well as in February and December 2011.

# At what time does most purchases happen?             
DOW <- Sales_New %>% 
  group_by(DayOfWeek,Hour) %>% 
  summarise(Count = n(), .groups = 'drop') %>% 
  ggplot(aes(x = DayOfWeek, y = Count)) +
  geom_col(aes(fill = Hour)) +
  theme_bw() +
  labs(title = "Sales Count by Day of Week and Hour",
       x = "Day of Week",
       y = "Count") +
  theme(plot.title = element_text(color = "#7A4726")) 
ggplotly(DOW)

The data indicates that the peak sales occur on Thursday and Tuesday. Additionally, the majority of purchases are made between 11 AM and 12 PM, with another peak observed between 12 AM and 2 AM.

# Product Analysis
product_analysis <- Sales_Data %>%
  group_by(StockCode, Description) %>%
  summarise(total_quantity = sum(Quantity), total_revenue = sum(Revenue)) %>%
  arrange(desc(total_quantity))

## `summarise()` has grouped output by 'StockCode'. You can override using the
## `.groups` argument.

pr_an <- head(product_analysis, 5)

kable(pr_an)

StockCode	Description	total_quantity	total_revenue
84077	World War 2 Gliders Asstd Designs	108305	24836.30
85123A	White Hanging Heart T-Light Holder	92333	253200.31
84879	Assorted Colour Bird Ornament	81087	131043.74
85099B	Jumbo Bag Red Retrospot	77509	146393.71
17003	Brocade Ring Purse	70664	14732.97

The item ‘World War 2 Gliders Asstd Designs’ stands out as the most purchased product, with a total purchase count reaching 108,305. On the other hand, ‘Regency Cakestand 3 Tier’ achieved the highest sales profit, generating a total revenue of 327,345.

Top_Countries <- Sales_Data %>% 
  filter(Status == 'Normal' & Quantity > 0) %>% 
  select(Revenue, Country) %>% 
  group_by(Country) %>% 
  summarise(Total_Revenue = sum(Revenue)) %>% 
  arrange(- Total_Revenue)

Top_10_Countries <- head(Top_Countries, 10)

kable(Top_10_Countries)

Country	Total_Revenue
United Kingdom	17752597.01
Ireland	651760.43
Netherlands	554230.69
Germany	428407.54
France	347068.14
Australia	169900.61
Spain	109127.21
Switzerland	100988.99
Sweden	91839.82
Denmark	69862.19

# United Kingdom is the country which purchases the maximum number of products

# Let's look at the countries with the lowest number of purchases
Bottom_Countries <- Sales_Data %>% 
  filter(Status == 'Normal' & Quantity > 0) %>% 
  select(Revenue, Country) %>% 
  group_by(Country) %>% 
  summarise(Total_Revenue = sum(Revenue)) %>% 
  arrange(Total_Revenue)

Bottom_5_Countries <- head(Bottom_Countries, 5)

kable(Bottom_5_Countries)

Country	Total_Revenue
Nigeria	140.39
Saudi Arabia	145.92
West Indies	536.41
Czech Republic	826.74
Korea	1118.51

# The least purchases were made by Nigeria followed by Saudi Arabia

# Let's find out the top Customers 
Top_Customers <- Sales_New %>% 
  drop_na(Customer.ID) %>% 
  select(Customer.ID, StockCode) %>% 
  group_by(Customer.ID) %>% 
  summarise(Total_Purchases = n())%>% 
  arrange(-Total_Purchases)

Top_Loyal_Customers <- head(Top_Customers, 5)

kable(Top_Loyal_Customers)

Customer.ID	Total_Purchases
17841	12611
14911	11152
12748	6929
14606	6477
14096	5111

Customer with ID 17841 and 14911 made the most purchases.

### Finding Correlations

cor(Sales_New$Quantity, Sales_New$Revenue)

## [1] 0.8089968

# Quantity and Revenue is highly correlated with a correlation value of 0.827

# Let's look at Damaged Products
Damaged_Products <- Sales_Data %>% 
  filter(Status == "Damaged")

Damaged_Products %>% 
  group_by(InvoiceYear, Country) %>% 
  summarise(Count = n()) %>% 
  ggplot(aes(x = InvoiceYear, y = Count)) +
  geom_col(fill = "green", alpha = 0.5, color = "black") +
  theme_bw() +
  labs(title = "Damaged Products Count",
       subtitle = "UK received 3457 damaged products from 2009 to 2011",
       x = "Year",
       y = "Count") +    
  theme(plot.title = element_text(color = "#7A4726")) +
  theme(plot.subtitle = element_text(color = "#072662",
                                     face = "italic", size = 10))

## `summarise()` has grouped output by 'InvoiceYear'. You can override using the
## `.groups` argument.

Over the three-year period from 2009 to 2011, a total of 3,457 products delivered to the UK were reported as damaged. Given that the UK represents a significant portion of top purchases, it emphasizes the importance of exercising caution and implementing careful delivery practices, particularly for shipments to the UK, to ensure product integrity and customer satisfaction.

# Let's look at the Cancelled Products
Cancelled_Products <- Sales_Data %>% 
  filter(Status == "Cancelled")

top_3_cancels <- Cancelled_Products %>% 
  group_by(Country) %>% 
  summarise(Count = n()) %>% 
  arrange(-Count) 

top_cancels <- head(top_3_cancels, 3)

kable(top_cancels)

Country	Count
United Kingdom	19310
Germany	930
Ireland	515

Most cancellations were made by the countries UK, Germany and Ireland.

Model Building

######################
### MODEL BUILDING ##
######################

# To build a model it is always best to avoid null values
Sales_Model <- Sales_New %>% 
  drop_na(Customer.ID)

model <- lm(Quantity~Price + Revenue, data = Sales_Model)

summary(model)

## 
## Call:
## lm(formula = Quantity ~ Price + Revenue, data = Sales_Model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20612     -4     -2      0  32510 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.3143903  0.0902903   36.71   <2e-16 ***
## Price       -0.5958043  0.0033288 -178.99   <2e-16 ***
## Revenue      0.5403015  0.0003996 1352.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.59 on 793165 degrees of freedom
## Multiple R-squared:  0.6975, Adjusted R-squared:  0.6975 
## F-statistic: 9.143e+05 on 2 and 793165 DF,  p-value: < 2.2e-16

Interpretation:

The intercept (3.3143903) is the predicted quantity when both Price and Revenue are zero.
Price has a negative effect on Quantity. For each unit increase in Price, Quantity is estimated to decrease by approximately 0.596.
Revenue has a positive effect on Quantity. For each unit increase in Revenue, Quantity is estimated to increase by approximately 0.540.

Model Fit:

The model appears to fit the data well, as indicated by the high R-squared value.
The p-values for the coefficients are very small, suggesting that both Price and Revenue are significant predictors of Quantity.

Overall, the model suggests that both Price and Revenue are important factors in predicting Quantity, and the model explains a significant amount of the variability in Quantity.

Conclusion

The analysis of the retail shopping sales dataset revealed key insights:

Sales Trends: Upward trend with peak purchases in November to December.
Purchase Time Analysis: Peak sales on Thursday and Tuesday, with peak hours at 11 AM - 12 PM and 12 AM - 2 AM.
Product Analysis: ‘World War 2 Gliders’ most purchased; ‘Regency Cakestand 3 Tier’ highest revenue.
Geographic Analysis: UK leads in purchases; Nigeria and Saudi Arabia have the least.
Customer Analysis: IDs 17841 and 14911 are top purchasers, indicating loyalty.
Correlation Analysis: Quantity and Revenue highly correlated (0.827).
Damaged and Cancelled Products: 3,457 products reported damaged; most cancellations from UK, Germany, and Ireland.
Model Building: Linear regression model shows both Price and Revenue significantly predict Quantity.

Recommendations:

Improve delivery practices to reduce damaged products.
Optimize inventory and marketing based on sales trends.
Incentivize loyal customers, especially those with IDs 17841 and 14911.