Case Study

The dataset represents the sales transactions of a UK-based online retail company specializing in unique all-occasion gift-ware. The data spans from December 1, 2009, to December 9, 2011. The company caters to both individual customers and wholesalers.

A brief description of the dataset:

The dataset records sales transactions, including the products sold, quantities, unit prices, customer IDs, and the countries where customers are located. Cancelled transactions are indicated by the letter 'C' at the beginning of the invoice number. Analyzing this data can reveal insights into customer behavior, popular products, sales trends, and the geographical distribution of customers.
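
As a quick illustration (a hypothetical sketch; 'sales' stands in for any data frame of these transactions with an Invoice column), the cancellation flag can be derived directly from the invoice number:

# Hypothetical sketch: mark cancelled transactions by the leading 'C'/'c'
sales$Cancelled <- grepl("^[Cc]", sales$Invoice)
table(sales$Cancelled)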

# Install the necessary packages
install.packages(c("tidyverse", "plotly", "knitr", "corrplot"),
                 repos = "https://cran.rstudio.com/")
## package 'tidyverse' successfully unpacked and MD5 sums checked
## package 'plotly' successfully unpacked and MD5 sums checked
## package 'knitr' successfully unpacked and MD5 sums checked
## package 'corrplot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\HP\AppData\Local\Temp\RtmpE5TuLJ\downloaded_packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(lubridate)
library(knitr)
library(corrplot)
## corrplot 0.92 loaded

Data Preparation

#########################
### DATA PREPARATION ##
#########################

# Loading the data files
file1 <- read.csv("Main Data 09-10.csv")
file2 <- read.csv("Cancelled Products 09-10.csv")
file3 <- read.csv("Damaged Products 09-10.csv")
file4 <- read.csv("Main Data 10-11.csv")
file5 <- read.csv("Cancelled Products 10-11.csv")
file6 <- read.csv("Damaged Products 10-11.csv")

Sales_Data <- rbind(file1, file2, file3, file4, file5, file6)

# Checking the structure of the combined data frame
str(Sales_Data)
## 'data.frame':    1054751 obs. of  15 variables:
##  $ Invoice     : chr  "489434" "489434" "489434" "489434" ...
##  $ StockCode   : chr  "85048" "79323P" "79323W" "22041" ...
##  $ Description : chr  "15Cm Christmas Glass Ball 20 Lights" "Pink Cherry Lights" "White Cherry Lights" "Record Frame 7\" Single Size" ...
##  $ Customer.ID : int  13085 13085 13085 13085 13085 13085 13085 13085 13085 13085 ...
##  $ Quantity    : int  12 12 12 48 24 24 24 10 12 12 ...
##  $ InvoiceDate : chr  "12/1/2009" "12/1/2009" "12/1/2009" "12/1/2009" ...
##  $ InvoiceYear : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ InvoiceMonth: int  12 12 12 12 12 12 12 12 12 12 ...
##  $ DayOfWeek   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ InvoiceTime : chr  "7:45:00 AM" "7:45:00 AM" "7:45:00 AM" "7:45:00 AM" ...
##  $ Hour        : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ Price       : num  6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
##  $ Revenue     : num  83.4 81 81 100.8 30 ...
##  $ Country     : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
##  $ Status      : chr  "Normal" "Normal" "Normal" "Normal" ...
# Changing 'Status', 'InvoiceMonth', 'DayOfWeek', 'Hour' and 'InvoiceYear' to factor
Sales_Data$Status <- factor(Sales_Data$Status)

Sales_Data$InvoiceMonth <- factor(Sales_Data$InvoiceMonth)

Sales_Data$DayOfWeek <- factor(Sales_Data$DayOfWeek)

Sales_Data$Hour <- factor(Sales_Data$Hour)

Sales_Data$InvoiceYear <- factor(Sales_Data$InvoiceYear)

# Changing Data Type of Date from 'Character' to 'Date'
Sales_Data$InvoiceDate <- mdy(Sales_Data$InvoiceDate)

# Changing Data Type of Time from 'Character' to 'HMS' format
# (note: hms() drops the AM/PM marker, so InvoiceTime stays on a 12-hour clock)
Sales_Data$InvoiceTime <- hms(Sales_Data$InvoiceTime)

# Checking head and summary statistics
head(Sales_Data)
##   Invoice StockCode                         Description Customer.ID Quantity
## 1  489434     85048 15Cm Christmas Glass Ball 20 Lights       13085       12
## 2  489434    79323P                  Pink Cherry Lights       13085       12
## 3  489434    79323W                 White Cherry Lights       13085       12
## 4  489434     22041         Record Frame 7" Single Size       13085       48
## 5  489434     21232      Strawberry Ceramic Trinket Box       13085       24
## 6  489434     22064           Pink Doughnut Trinket Pot       13085       24
##   InvoiceDate InvoiceYear InvoiceMonth DayOfWeek InvoiceTime Hour Price Revenue
## 1  2009-12-01        2009           12         2   7H 45M 0S    7  6.95    83.4
## 2  2009-12-01        2009           12         2   7H 45M 0S    7  6.75    81.0
## 3  2009-12-01        2009           12         2   7H 45M 0S    7  6.75    81.0
## 4  2009-12-01        2009           12         2   7H 45M 0S    7  2.10   100.8
## 5  2009-12-01        2009           12         2   7H 45M 0S    7  1.25    30.0
## 6  2009-12-01        2009           12         2   7H 45M 0S    7  1.65    39.6
##          Country Status
## 1 United Kingdom Normal
## 2 United Kingdom Normal
## 3 United Kingdom Normal
## 4 United Kingdom Normal
## 5 United Kingdom Normal
## 6 United Kingdom Normal
summary(Sales_Data)
##    Invoice           StockCode         Description         Customer.ID    
##  Length:1054751     Length:1054751     Length:1054751     Min.   :12346   
##  Class :character   Class :character   Class :character   1st Qu.:13969   
##  Mode  :character   Mode  :character   Mode  :character   Median :15245   
##                                                           Mean   :15318   
##                                                           3rd Qu.:16794   
##                                                           Max.   :18287   
##                                                           NA's   :242813  
##     Quantity          InvoiceDate         InvoiceYear    InvoiceMonth   
##  Min.   :-80995.00   Min.   :2009-12-01   2009: 44722   11     :159857  
##  1st Qu.:     1.00   1st Qu.:2010-07-09   2010:515807   12     :134094  
##  Median :     3.00   Median :2010-12-07   2011:494222   10     :118155  
##  Mean   :    10.02   Mean   :2011-01-02                 9      : 91424  
##  3rd Qu.:    10.00   3rd Qu.:2011-07-22                 3      : 77379  
##  Max.   : 80995.00   Max.   :2011-12-09                 6      : 76049  
##                                                         (Other):397793  
##  DayOfWeek   InvoiceTime                            Hour       
##  0:135593   Min.   :1H 0M 0S                   12     :157226  
##  1:187416   1st Qu.:2H 49M 0S                  1      :147848  
##  2:194826   Median :5H 8M 0S                   2      :139576  
##  3:183045   Mean   :6H 42M 40.0099928798372S   3      :133878  
##  4:201042   3rd Qu.:11H 10M 0S                 11     :120057  
##  5:152427   Max.   :12H 59M 0S                 4      : 93038  
##  6:   402                                      (Other):263128  
##      Price             Revenue             Country                Status       
##  Min.   :    0.00   Min.   :-168469.60   Length:1054751     Cancelled:  22177  
##  1st Qu.:    1.25   1st Qu.:      3.75   Class :character   Damaged  :   3457  
##  Median :    2.10   Median :      9.90   Mode  :character   Normal   :1029117  
##  Mean   :    4.73   Mean   :     18.28                                         
##  3rd Qu.:    4.15   3rd Qu.:     17.70                                         
##  Max.   :38970.00   Max.   : 168469.60                                         
## 
colnames(Sales_Data)
##  [1] "Invoice"      "StockCode"    "Description"  "Customer.ID"  "Quantity"    
##  [6] "InvoiceDate"  "InvoiceYear"  "InvoiceMonth" "DayOfWeek"    "InvoiceTime" 
## [11] "Hour"         "Price"        "Revenue"      "Country"      "Status"
# Checking for null values if any
any(is.na(Sales_Data)) 
## [1] TRUE
colSums(is.na(Sales_Data))
##      Invoice    StockCode  Description  Customer.ID     Quantity  InvoiceDate 
##            0            0            0       242813            0            0 
##  InvoiceYear InvoiceMonth    DayOfWeek  InvoiceTime         Hour        Price 
##            0            0            0            0            0            0 
##      Revenue      Country       Status 
##            0            0            0

The Customer.ID column contains 242,813 missing values. Dropping these rows is not feasible, since it would discard a substantial share of the data, and imputing them is not possible because the true customer IDs are unknown. The missing values are therefore retained, which limits customer-level analysis for those transactions but preserves the rest of the information they carry.
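
One lightweight alternative (a sketch only, not applied in the rest of this analysis) is to flag whether the customer is known, so that customer-level summaries can filter on the flag while product-level summaries keep every row:

# Sketch: flag missing customer IDs instead of dropping the rows
Sales_Data %>%
  mutate(KnownCustomer = !is.na(Customer.ID)) %>%
  count(KnownCustomer)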

Exploratory Data Analysis

##################
### EDA ##
##################

Sales_New <- Sales_Data %>% 
  filter(Quantity > 0 & Status == 'Normal')
# Sum of the unit prices across the retained line items
sum(Sales_New$Price)
## [1] 4109900
# Log scale is used because both distributions are heavily right-skewed
hist(log(Sales_New$Quantity), 
     main = "Distribution of log(Quantity)",
     col = "#30fc66")

hist(log(Sales_New$Revenue), 
     main = "Distribution of log(Revenue)",
     col = "red")

# Looking at the total number of:
  # Normal Products
  # Cancelled Products
  # Damaged Products

Product_Status <- Sales_Data %>% 
  group_by(Status) %>% 
  summarise(Count = n()) %>% 
  arrange(-Count) %>% 
  mutate(Perc = (Count/sum(Count))*100)

kable(Product_Status)
|Status    |   Count|       Perc|
|:---------|-------:|----------:|
|Normal    | 1029117| 97.5696634|
|Cancelled |   22177|  2.1025816|
|Damaged   |    3457|  0.3277551|
# 97.6% of the products were in good condition

# Let's look at the revenue column
sum(Sales_Data$Revenue)
## [1] 19280622
Overall_Revenue <- Sales_Data %>% 
  select(InvoiceYear, Revenue) %>% 
  group_by(InvoiceYear) %>% 
  summarise(Total_Revenue = sum(Revenue)) %>% 
  arrange(-Total_Revenue)

ggplot(Overall_Revenue, aes(x = InvoiceYear, y = Total_Revenue)) +
  geom_col(aes(fill = InvoiceYear),
           color = "black", alpha = 0.7) +
  theme_bw() +
  labs(title = "Total Revenue vs Year",
       x = "Year") +
  guides(fill = "none") +
  scale_fill_brewer(palette = "Set2") +
  theme(plot.title = element_text(color = "#7A4726"))

# Overall Revenue was maximum in 2010

# Looking for trends in Quantity and Month
Sales_Data %>% 
  group_by(InvoiceMonth) %>% 
  summarise(total_quantity = sum(Quantity)) %>% 
  ggplot(aes(x = factor(InvoiceMonth), y = total_quantity)) +
  geom_point(color = "red", size = 3)+
  theme_bw() +
  labs(title = "Overall Monthly Trend: Quantity Purchased",
       subtitle = "An overall upward trend in purchases, peaking in Oct-Dec",
       x = "Month", y = "Total_Quantity") +
  theme(plot.title = element_text(color = "#7A4726")) +
  theme(plot.subtitle = element_text(color = "#072662",
                                     face = "italic", size = 10))

# Let's dive deeper to find Year wise Trends
Year_2010 <- Sales_Data %>% 
  filter(InvoiceYear == 2010 & Status == 'Normal')

Year_2010 %>% 
  group_by(InvoiceMonth) %>% 
  summarise(total_quantity = sum(Quantity)) %>% 
  ggplot(aes(x = factor(InvoiceMonth), y = total_quantity)) +
  geom_point(color = "orange", size = 3)+
  theme_bw() +
  labs(title = "Monthly Trend of Year 2010: Quantity Purchased",
       x = "Month", y = "Total_Quantity") +
  theme(plot.title = element_text(color = "#7A4726")) 

Year_2011 <- Sales_Data %>% 
  filter(InvoiceYear == 2011 & Status == 'Normal')

Year_2011 %>% 
  group_by(InvoiceMonth) %>% 
  summarise(total_quantity = sum(Quantity)) %>% 
  ggplot(aes(x = factor(InvoiceMonth), y = total_quantity)) +
  geom_point(color = "blue", size = 3)+
  theme_bw() +
  labs(title = "Monthly Trend of Year 2011: Quantity Purchased",
       x = "Month", y = "Total_Quantity") +
  theme(plot.title = element_text(color = "#7A4726")) 

The plots show a general upward trend in purchases, with volumes rising sharply through the October-December holiday run-up. Noticeable dips occur in July 2010 and in February and December 2011; the December 2011 drop is at least partly an artifact of the data ending on December 9, 2011, so that month is incomplete.
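
The year-by-month pattern can also be cross-checked directly from the data rather than read off the plots (a sketch using the same columns; output not shown):

# Sketch: total quantity per year and month in a single table
Sales_Data %>%
  filter(Status == "Normal") %>%
  group_by(InvoiceYear, InvoiceMonth) %>%
  summarise(Total_Quantity = sum(Quantity), .groups = "drop") %>%
  arrange(InvoiceYear, InvoiceMonth)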

# At what time do most purchases happen?
DOW <- Sales_New %>% 
  group_by(DayOfWeek,Hour) %>% 
  summarise(Count = n(), .groups = 'drop') %>% 
  ggplot(aes(x = DayOfWeek, y = Count)) +
  geom_col(aes(fill = Hour)) +
  theme_bw() +
  labs(title = "Sales Count by Day of Week and Hour",
       x = "Day of Week",
       y = "Count") +
  theme(plot.title = element_text(color = "#7A4726")) 
ggplotly(DOW)

The data indicates that peak sales occur on Thursday and Tuesday. The busiest hours are 11, 12, 1 and 2; because the Hour column is recorded on a 12-hour clock without an AM/PM marker, these most plausibly correspond to late morning through early afternoon (roughly 11 AM to 2 PM) rather than overnight hours.
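
A simple cross-check (a sketch; output not shown) is to list the busiest day/hour combinations directly:

# Sketch: the ten busiest DayOfWeek/Hour combinations by line-item count
Sales_New %>%
  count(DayOfWeek, Hour, sort = TRUE) %>%
  head(10)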

# Product Analysis
product_analysis <- Sales_Data %>%
  group_by(StockCode, Description) %>%
  summarise(total_quantity = sum(Quantity), total_revenue = sum(Revenue)) %>%
  arrange(desc(total_quantity))
## `summarise()` has grouped output by 'StockCode'. You can override using the
## `.groups` argument.
pr_an <- head(product_analysis, 5)

kable(pr_an)
|StockCode |Description                        | total_quantity| total_revenue|
|:---------|:----------------------------------|--------------:|-------------:|
|84077     |World War 2 Gliders Asstd Designs  |         108305|      24836.30|
|85123A    |White Hanging Heart T-Light Holder |          92333|     253200.31|
|84879     |Assorted Colour Bird Ornament      |          81087|     131043.74|
|85099B    |Jumbo Bag Red Retrospot            |          77509|     146393.71|
|17003     |Brocade Ring Purse                 |          70664|      14732.97|

The item ‘World War 2 Gliders Asstd Designs’ stands out as the most purchased product, with 108,305 units sold in total. By revenue, ‘Regency Cakestand 3 Tier’ was the top earner, generating roughly 327,345 in total.
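
Because product_analysis is sorted by quantity, the revenue ranking can be checked by re-sorting the same summary (a sketch; output not shown here):

# Sketch: re-rank the same product summary by revenue
product_analysis %>%
  arrange(desc(total_revenue)) %>%
  head(5)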

Top_Countries <- Sales_Data %>% 
  filter(Status == 'Normal' & Quantity > 0) %>% 
  select(Revenue, Country) %>% 
  group_by(Country) %>% 
  summarise(Total_Revenue = sum(Revenue)) %>% 
  arrange(- Total_Revenue)

Top_10_Countries <- head(Top_Countries, 10)

kable(Top_10_Countries)
|Country        | Total_Revenue|
|:--------------|-------------:|
|United Kingdom |   17752597.01|
|Ireland        |     651760.43|
|Netherlands    |     554230.69|
|Germany        |     428407.54|
|France         |     347068.14|
|Australia      |     169900.61|
|Spain          |     109127.21|
|Switzerland    |     100988.99|
|Sweden         |      91839.82|
|Denmark        |      69862.19|
# The United Kingdom generates by far the highest total revenue

# Let's look at the countries with the lowest number of purchases
Bottom_Countries <- Sales_Data %>% 
  filter(Status == 'Normal' & Quantity > 0) %>% 
  select(Revenue, Country) %>% 
  group_by(Country) %>% 
  summarise(Total_Revenue = sum(Revenue)) %>% 
  arrange(Total_Revenue)

Bottom_5_Countries <- head(Bottom_Countries, 5)

kable(Bottom_5_Countries)
|Country        | Total_Revenue|
|:--------------|-------------:|
|Nigeria        |        140.39|
|Saudi Arabia   |        145.92|
|West Indies    |        536.41|
|Czech Republic |        826.74|
|Korea          |       1118.51|
# Nigeria and Saudi Arabia generated the lowest total revenue

# Let's find out the top Customers 
Top_Customers <- Sales_New %>% 
  drop_na(Customer.ID) %>% 
  select(Customer.ID, StockCode) %>% 
  group_by(Customer.ID) %>% 
  summarise(Total_Purchases = n())%>% 
  arrange(-Total_Purchases)

Top_Loyal_Customers <- head(Top_Customers, 5)

kable(Top_Loyal_Customers)
| Customer.ID| Total_Purchases|
|-----------:|---------------:|
|       17841|           12611|
|       14911|           11152|
|       12748|            6929|
|       14606|            6477|
|       14096|            5111|

Customers 17841 and 14911 made the most purchases, measured by number of line items.
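
The same customers can also be ranked by how much revenue they generate rather than by how many line items they buy (a sketch; output not shown):

# Sketch: top customers by total revenue
Sales_New %>%
  drop_na(Customer.ID) %>%
  group_by(Customer.ID) %>%
  summarise(Total_Revenue = sum(Revenue), .groups = "drop") %>%
  arrange(desc(Total_Revenue)) %>%
  head(5)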

### Finding Correlations

cor(Sales_New$Quantity, Sales_New$Revenue)
## [1] 0.8089968
# Quantity and Revenue are highly correlated, with a correlation of about 0.81
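
# A compact alternative (sketch, output not shown): visualise all pairwise
# correlations of the numeric columns with the already-loaded corrplot package
corrplot(cor(Sales_New[, c("Quantity", "Price", "Revenue")]),
         method = "number", type = "upper")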

# Let's look at Damaged Products
Damaged_Products <- Sales_Data %>% 
  filter(Status == "Damaged")

Damaged_Products %>% 
  group_by(InvoiceYear, Country) %>% 
  summarise(Count = n()) %>% 
  ggplot(aes(x = InvoiceYear, y = Count)) +
  geom_col(fill = "green", alpha = 0.5, color = "black") +
  theme_bw() +
  labs(title = "Damaged Products Count",
       subtitle = "UK received 3457 damaged products from 2009 to 2011",
       x = "Year",
       y = "Count") +    
  theme(plot.title = element_text(color = "#7A4726")) +
  theme(plot.subtitle = element_text(color = "#072662",
                                     face = "italic", size = 10))
## `summarise()` has grouped output by 'InvoiceYear'. You can override using the
## `.groups` argument.

Over the three years from 2009 to 2011, 3,457 products delivered to the UK were reported as damaged. Since the UK also accounts for the largest share of purchases, careful handling and delivery practices for UK shipments are especially important for product integrity and customer satisfaction.
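
A quick breakdown of where the damaged items went (a sketch; output not shown) is:

# Sketch: damaged products by destination country
Damaged_Products %>%
  count(Country, sort = TRUE) %>%
  head(5)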

# Let's look at the Cancelled Products
Cancelled_Products <- Sales_Data %>% 
  filter(Status == "Cancelled")

top_3_cancels <- Cancelled_Products %>% 
  group_by(Country) %>% 
  summarise(Count = n()) %>% 
  arrange(-Count) 

top_cancels <- head(top_3_cancels, 3)

kable(top_cancels)
|Country        | Count|
|:--------------|-----:|
|United Kingdom | 19310|
|Germany        |   930|
|Ireland        |   515|

Most cancellations came from the UK, followed by Germany and Ireland.
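
Since the UK also places far more orders than any other country, a fairer comparison is the cancellation rate per country (a sketch; output not shown):

# Sketch: cancellations as a share of each country's line items
Sales_Data %>%
  group_by(Country) %>%
  summarise(Total = n(),
            Cancelled = sum(Status == "Cancelled"),
            Cancel_Rate = round(100 * Cancelled / Total, 2),
            .groups = "drop") %>%
  arrange(desc(Cancelled)) %>%
  head(5)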

Model Building

######################
### MODEL BUILDING ##
######################

# Drop rows with missing Customer.ID before fitting the model
Sales_Model <- Sales_New %>% 
  drop_na(Customer.ID)

model <- lm(Quantity ~ Price + Revenue, data = Sales_Model)

summary(model)
## 
## Call:
## lm(formula = Quantity ~ Price + Revenue, data = Sales_Model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20612     -4     -2      0  32510 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.3143903  0.0902903   36.71   <2e-16 ***
## Price       -0.5958043  0.0033288 -178.99   <2e-16 ***
## Revenue      0.5403015  0.0003996 1352.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.59 on 793165 degrees of freedom
## Multiple R-squared:  0.6975, Adjusted R-squared:  0.6975 
## F-statistic: 9.143e+05 on 2 and 793165 DF,  p-value: < 2.2e-16

Interpretation:

  • The intercept (3.3143903) is the predicted quantity when both Price and Revenue are zero, which is not a realistic scenario but anchors the fitted line.

  • Price has a negative coefficient: holding Revenue constant, each unit increase in Price is associated with a decrease in Quantity of approximately 0.596.

  • Revenue has a positive coefficient: holding Price constant, each unit increase in Revenue is associated with an increase in Quantity of approximately 0.540.

Model Fit:

  • The model fits the data reasonably well, with an R-squared of roughly 0.70.

  • The p-values for the coefficients are very small, suggesting that both Price and Revenue are significant predictors of Quantity.

Overall, the model suggests that both Price and Revenue are significant predictors of Quantity, and it explains about 70% of the variability in Quantity. Keep in mind, however, that Revenue in this dataset is computed as Price × Quantity, so part of the apparent explanatory power reflects that mechanical relationship rather than an independent signal.
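
To make the coefficients concrete, here is a small worked example (the Price and Revenue values are made up purely for illustration):

# Sketch: predicted quantity for one hypothetical line item
new_item <- data.frame(Price = 5, Revenue = 100)
predict(model, newdata = new_item)
# By hand: 3.3144 - 0.5958 * 5 + 0.5403 * 100 is approximately 54.4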

Conclusion

The analysis of the retail shopping sales dataset revealed key insights:

Recommendations: