The dataset represents the sales transactions of a UK-based online retail company specializing in unique all-occasion gift-ware. The data spans from December 1, 2009, to December 9, 2011. The company caters to both individual customers and wholesalers.
Here is a description of the attributes in the dataset:
InvoiceNo: A 6-digit integral number uniquely assigned to each transaction. If the code starts with the letter ‘c’, it indicates a cancellation.
StockCode: A 5-digit integral number uniquely assigned to each distinct product,representing the product (item) code.
Description: The product (item) name, provided in nominal format.
Quantity: The quantity of each product (item) per transaction, represented as a numeric value.
InvoiceDate: The date and time when a transaction was generated, presented in numeric format.
UnitPrice: The unit price of each product, representing the product price per unit in numeric format.
CustomerID: A 5-digit integral number uniquely assigned to each customer, serving as a nominal identifier.
Country: The name of the country where a customer resides, presented in nominal format.
The dataset provides information on sales transactions, including details on the products sold, quantities, prices, customer identification, and the countries where customers are located. Additionally, the dataset indicates canceled transactions with the presence of the letter ‘c’ at the beginning of the InvoiceNo. Analyzing this dataset could reveal insights into customer behavior, popular products, sales trends, and geographical distribution of customers.
# Install the necessary packages
install.packages(c("tidyverse", "plotly", "knitr", "corrplot"),
repos = "https://cran.rstudio.com/")
## package 'tidyverse' successfully unpacked and MD5 sums checked
## package 'plotly' successfully unpacked and MD5 sums checked
## package 'knitr' successfully unpacked and MD5 sums checked
## package 'corrplot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\HP\AppData\Local\Temp\RtmpE5TuLJ\downloaded_packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(lubridate)
library(knitr)
library(corrplot)
## corrplot 0.92 loaded
#########################
### DATA PREPARATION ##
#########################
# Loading the data file's
file1 <- read.csv("Main Data 09-10.csv")
file2 <- read.csv("Cancelled Products 09-10.csv")
file3 <- read.csv("Damaged Products 09-10.csv")
file4 <- read.csv("Main Data 10-11.csv")
file5 <- read.csv("Cancelled Products 10-11.csv")
file6 <- read.csv("Damaged Products 10-11.csv")
Sales_Data <- rbind(file1, file2, file3, file4, file5, file6)
# Checking the structure of file
str(Sales_Data)
## 'data.frame': 1054751 obs. of 15 variables:
## $ Invoice : chr "489434" "489434" "489434" "489434" ...
## $ StockCode : chr "85048" "79323P" "79323W" "22041" ...
## $ Description : chr "15Cm Christmas Glass Ball 20 Lights" "Pink Cherry Lights" "White Cherry Lights" "Record Frame 7\" Single Size" ...
## $ Customer.ID : int 13085 13085 13085 13085 13085 13085 13085 13085 13085 13085 ...
## $ Quantity : int 12 12 12 48 24 24 24 10 12 12 ...
## $ InvoiceDate : chr "12/1/2009" "12/1/2009" "12/1/2009" "12/1/2009" ...
## $ InvoiceYear : int 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
## $ InvoiceMonth: int 12 12 12 12 12 12 12 12 12 12 ...
## $ DayOfWeek : int 2 2 2 2 2 2 2 2 2 2 ...
## $ InvoiceTime : chr "7:45:00 AM" "7:45:00 AM" "7:45:00 AM" "7:45:00 AM" ...
## $ Hour : int 7 7 7 7 7 7 7 7 7 7 ...
## $ Price : num 6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
## $ Revenue : num 83.4 81 81 100.8 30 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
## $ Status : chr "Normal" "Normal" "Normal" "Normal" ...
# Changing 'Status', 'InvoiceMonth', 'DayOfWeek', 'Hour' and 'InvoiceYear' to factor
Sales_Data$Status <- factor(Sales_Data$Status)
Sales_Data$InvoiceMonth <- factor(Sales_Data$InvoiceMonth)
Sales_Data$DayOfWeek <- factor(Sales_Data$DayOfWeek)
Sales_Data$Hour <- factor(Sales_Data$Hour)
Sales_Data$InvoiceYear <- factor(Sales_Data$InvoiceYear)
# Changing Data Type of Date from 'Character' to 'Date'
Sales_Data$InvoiceDate <- mdy(Sales_Data$InvoiceDate)
# Changing Data Type of Time from 'Character' to 'HMS' format
Sales_Data$InvoiceTime <- hms(Sales_Data$InvoiceTime)
# Checking head and summary statistics
head(Sales_Data)
## Invoice StockCode Description Customer.ID Quantity
## 1 489434 85048 15Cm Christmas Glass Ball 20 Lights 13085 12
## 2 489434 79323P Pink Cherry Lights 13085 12
## 3 489434 79323W White Cherry Lights 13085 12
## 4 489434 22041 Record Frame 7" Single Size 13085 48
## 5 489434 21232 Strawberry Ceramic Trinket Box 13085 24
## 6 489434 22064 Pink Doughnut Trinket Pot 13085 24
## InvoiceDate InvoiceYear InvoiceMonth DayOfWeek InvoiceTime Hour Price Revenue
## 1 2009-12-01 2009 12 2 7H 45M 0S 7 6.95 83.4
## 2 2009-12-01 2009 12 2 7H 45M 0S 7 6.75 81.0
## 3 2009-12-01 2009 12 2 7H 45M 0S 7 6.75 81.0
## 4 2009-12-01 2009 12 2 7H 45M 0S 7 2.10 100.8
## 5 2009-12-01 2009 12 2 7H 45M 0S 7 1.25 30.0
## 6 2009-12-01 2009 12 2 7H 45M 0S 7 1.65 39.6
## Country Status
## 1 United Kingdom Normal
## 2 United Kingdom Normal
## 3 United Kingdom Normal
## 4 United Kingdom Normal
## 5 United Kingdom Normal
## 6 United Kingdom Normal
summary(Sales_Data)
## Invoice StockCode Description Customer.ID
## Length:1054751 Length:1054751 Length:1054751 Min. :12346
## Class :character Class :character Class :character 1st Qu.:13969
## Mode :character Mode :character Mode :character Median :15245
## Mean :15318
## 3rd Qu.:16794
## Max. :18287
## NA's :242813
## Quantity InvoiceDate InvoiceYear InvoiceMonth
## Min. :-80995.00 Min. :2009-12-01 2009: 44722 11 :159857
## 1st Qu.: 1.00 1st Qu.:2010-07-09 2010:515807 12 :134094
## Median : 3.00 Median :2010-12-07 2011:494222 10 :118155
## Mean : 10.02 Mean :2011-01-02 9 : 91424
## 3rd Qu.: 10.00 3rd Qu.:2011-07-22 3 : 77379
## Max. : 80995.00 Max. :2011-12-09 6 : 76049
## (Other):397793
## DayOfWeek InvoiceTime Hour
## 0:135593 Min. :1H 0M 0S 12 :157226
## 1:187416 1st Qu.:2H 49M 0S 1 :147848
## 2:194826 Median :5H 8M 0S 2 :139576
## 3:183045 Mean :6H 42M 40.0099928798372S 3 :133878
## 4:201042 3rd Qu.:11H 10M 0S 11 :120057
## 5:152427 Max. :12H 59M 0S 4 : 93038
## 6: 402 (Other):263128
## Price Revenue Country Status
## Min. : 0.00 Min. :-168469.60 Length:1054751 Cancelled: 22177
## 1st Qu.: 1.25 1st Qu.: 3.75 Class :character Damaged : 3457
## Median : 2.10 Median : 9.90 Mode :character Normal :1029117
## Mean : 4.73 Mean : 18.28
## 3rd Qu.: 4.15 3rd Qu.: 17.70
## Max. :38970.00 Max. : 168469.60
##
colnames(Sales_Data)
## [1] "Invoice" "StockCode" "Description" "Customer.ID" "Quantity"
## [6] "InvoiceDate" "InvoiceYear" "InvoiceMonth" "DayOfWeek" "InvoiceTime"
## [11] "Hour" "Price" "Revenue" "Country" "Status"
# Checking for null values if any
any(is.na(Sales_Data))
## [1] TRUE
colSums(is.na(Sales_Data))
## Invoice StockCode Description Customer.ID Quantity InvoiceDate
## 0 0 0 242813 0 0
## InvoiceYear InvoiceMonth DayOfWeek InvoiceTime Hour Price
## 0 0 0 0 0 0
## Revenue Country Status
## 0 0 0
The “CustomerID” column in the dataset contains 242,813 null values, and removing them is not feasible as it could result in substantial data loss. Additionally, replacing these null values is challenging because the actual CustomerID information is unknown. Therefore, retaining the null values in the “CustomerID” column is necessary for the integrity of the dataset, even though it limits the ability to analyze and identify specific customers for those transactions.
##################
### EDA ##
##################
Sales_New <- Sales_Data %>%
filter(Quantity > 0 & Status == 'Normal')
sum(Sales_New$Price)
## [1] 4109900
hist(log(Sales_New$Quantity),
main = "Distribution of Quantity",
col = "#30fc66")
hist(log(Sales_New$Revenue),
main = "Distribution of Revenue",
col = "red")
# Looking at the total number of:
# Normal Products
# Cancelled Products
# Damaged Products
Product_Status <- Sales_Data %>%
group_by(Status) %>%
summarise(Count = n()) %>%
arrange(-Count) %>%
mutate(Perc = (Count/sum(Count))*100)
kable(Product_Status)
| Status | Count | Perc |
|---|---|---|
| Normal | 1029117 | 97.5696634 |
| Cancelled | 22177 | 2.1025816 |
| Damaged | 3457 | 0.3277551 |
# 97.6% of the products were in good condition
# Let's look at the revenue column
sum(Sales_Data$Revenue)
## [1] 19280622
Overall_Revenue <- Sales_Data %>%
select(InvoiceYear, Revenue) %>%
group_by(InvoiceYear) %>%
summarise(Total_Revenue = sum(Revenue)) %>%
arrange(-Total_Revenue)
ggplot(Overall_Revenue, aes(x = InvoiceYear, y = Total_Revenue)) +
geom_col(aes(fill = factor(InvoiceYear))
, color = "black", alpha = 0.7) +
theme_bw() +
labs(title = "Total Revenue VS Year",
x = "Year") +
guides(fill = F) +
scale_fill_brewer(palette = "Set2")+
theme(plot.title = element_text(color = "#7A4726"))
# Overall Revenue was maximum in 2010
# Looking for trends in Quantity and Month
Sales_Data %>%
group_by(InvoiceMonth) %>%
summarise(total_quantity = sum(Quantity)) %>%
ggplot(aes(x = factor(InvoiceMonth), y = total_quantity)) +
geom_point(color = "red", size = 3)+
theme_bw() +
labs(title = "Overall Monthly Trend: Quantity Purchased",
subtitle = "An overall upward trend in purchases, peaking in Oct-Dec",
x = "Month", y = "Total_Quantity") +
theme(plot.title = element_text(color = "#7A4726")) +
theme(plot.subtitle = element_text(color = "#072662",
face = "italic", size = 10))
# Let's dive deeper to find Year wise Trends
Year_2010 <- Sales_Data %>%
filter(InvoiceYear == 2010 & Status == 'Normal')
Year_2010 %>%
group_by(InvoiceMonth) %>%
summarise(total_quantity = sum(Quantity)) %>%
ggplot(aes(x = factor(InvoiceMonth), y = total_quantity)) +
geom_point(color = "orange", size = 3)+
theme_bw() +
labs(title = "Monthly Trend of Year 2010: Quantity Purchased",
x = "Month", y = "Total_Quantity") +
theme(plot.title = element_text(color = "#7A4726"))
Year_2011 <- Sales_Data %>%
filter(InvoiceYear == 2011 & Status == 'Normal')
Year_2011 %>%
group_by(InvoiceMonth) %>%
summarise(total_quantity = sum(Quantity)) %>%
ggplot(aes(x = factor(InvoiceMonth), y = total_quantity)) +
geom_point(color = "blue", size = 3)+
theme_bw() +
labs(title = "Monthly Trend of Year 2011: Quantity Purchased",
x = "Month", y = "Total_Quantity") +
theme(plot.title = element_text(color = "#7A4726"))
Upon analyzing the plots, it is evident that there is a general upward trend in purchases, with an increase in the number of transactions observed from November to December. Noteworthy declines in sales occurred in July 2010, as well as in February and December 2011.
# At what time does most purchases happen?
DOW <- Sales_New %>%
group_by(DayOfWeek,Hour) %>%
summarise(Count = n(), .groups = 'drop') %>%
ggplot(aes(x = DayOfWeek, y = Count)) +
geom_col(aes(fill = Hour)) +
theme_bw() +
labs(title = "Sales Count by Day of Week and Hour",
x = "Day of Week",
y = "Count") +
theme(plot.title = element_text(color = "#7A4726"))
ggplotly(DOW)
The data indicates that the peak sales occur on Thursday and Tuesday. Additionally, the majority of purchases are made between 11 AM and 12 PM, with another peak observed between 12 AM and 2 AM.
# Product Analysis
product_analysis <- Sales_Data %>%
group_by(StockCode, Description) %>%
summarise(total_quantity = sum(Quantity), total_revenue = sum(Revenue)) %>%
arrange(desc(total_quantity))
## `summarise()` has grouped output by 'StockCode'. You can override using the
## `.groups` argument.
pr_an <- head(product_analysis, 5)
kable(pr_an)
| StockCode | Description | total_quantity | total_revenue |
|---|---|---|---|
| 84077 | World War 2 Gliders Asstd Designs | 108305 | 24836.30 |
| 85123A | White Hanging Heart T-Light Holder | 92333 | 253200.31 |
| 84879 | Assorted Colour Bird Ornament | 81087 | 131043.74 |
| 85099B | Jumbo Bag Red Retrospot | 77509 | 146393.71 |
| 17003 | Brocade Ring Purse | 70664 | 14732.97 |
The item ‘World War 2 Gliders Asstd Designs’ stands out as the most purchased product, with a total purchase count reaching 108,305. On the other hand, ‘Regency Cakestand 3 Tier’ achieved the highest sales profit, generating a total revenue of 327,345.
Top_Countries <- Sales_Data %>%
filter(Status == 'Normal' & Quantity > 0) %>%
select(Revenue, Country) %>%
group_by(Country) %>%
summarise(Total_Revenue = sum(Revenue)) %>%
arrange(- Total_Revenue)
Top_10_Countries <- head(Top_Countries, 10)
kable(Top_10_Countries)
| Country | Total_Revenue |
|---|---|
| United Kingdom | 17752597.01 |
| Ireland | 651760.43 |
| Netherlands | 554230.69 |
| Germany | 428407.54 |
| France | 347068.14 |
| Australia | 169900.61 |
| Spain | 109127.21 |
| Switzerland | 100988.99 |
| Sweden | 91839.82 |
| Denmark | 69862.19 |
# United Kingdom is the country which purchases the maximum number of products
# Let's look at the countries with the lowest number of purchases
Bottom_Countries <- Sales_Data %>%
filter(Status == 'Normal' & Quantity > 0) %>%
select(Revenue, Country) %>%
group_by(Country) %>%
summarise(Total_Revenue = sum(Revenue)) %>%
arrange(Total_Revenue)
Bottom_5_Countries <- head(Bottom_Countries, 5)
kable(Bottom_5_Countries)
| Country | Total_Revenue |
|---|---|
| Nigeria | 140.39 |
| Saudi Arabia | 145.92 |
| West Indies | 536.41 |
| Czech Republic | 826.74 |
| Korea | 1118.51 |
# The least purchases were made by Nigeria followed by Saudi Arabia
# Let's find out the top Customers
Top_Customers <- Sales_New %>%
drop_na(Customer.ID) %>%
select(Customer.ID, StockCode) %>%
group_by(Customer.ID) %>%
summarise(Total_Purchases = n())%>%
arrange(-Total_Purchases)
Top_Loyal_Customers <- head(Top_Customers, 5)
kable(Top_Loyal_Customers)
| Customer.ID | Total_Purchases |
|---|---|
| 17841 | 12611 |
| 14911 | 11152 |
| 12748 | 6929 |
| 14606 | 6477 |
| 14096 | 5111 |
Customer with ID 17841 and 14911 made the most purchases.
### Finding Correlations
cor(Sales_New$Quantity, Sales_New$Revenue)
## [1] 0.8089968
# Quantity and Revenue is highly correlated with a correlation value of 0.827
# Let's look at Damaged Products
Damaged_Products <- Sales_Data %>%
filter(Status == "Damaged")
Damaged_Products %>%
group_by(InvoiceYear, Country) %>%
summarise(Count = n()) %>%
ggplot(aes(x = InvoiceYear, y = Count)) +
geom_col(fill = "green", alpha = 0.5, color = "black") +
theme_bw() +
labs(title = "Damaged Products Count",
subtitle = "UK received 3457 damaged products from 2009 to 2011",
x = "Year",
y = "Count") +
theme(plot.title = element_text(color = "#7A4726")) +
theme(plot.subtitle = element_text(color = "#072662",
face = "italic", size = 10))
## `summarise()` has grouped output by 'InvoiceYear'. You can override using the
## `.groups` argument.
Over the three-year period from 2009 to 2011, a total of 3,457 products delivered to the UK were reported as damaged. Given that the UK represents a significant portion of top purchases, it emphasizes the importance of exercising caution and implementing careful delivery practices, particularly for shipments to the UK, to ensure product integrity and customer satisfaction.
# Let's look at the Cancelled Products
Cancelled_Products <- Sales_Data %>%
filter(Status == "Cancelled")
top_3_cancels <- Cancelled_Products %>%
group_by(Country) %>%
summarise(Count = n()) %>%
arrange(-Count)
top_cancels <- head(top_3_cancels, 3)
kable(top_cancels)
| Country | Count |
|---|---|
| United Kingdom | 19310 |
| Germany | 930 |
| Ireland | 515 |
Most cancellations were made by the countries UK, Germany and Ireland.
######################
### MODEL BUILDING ##
######################
# To build a model it is always best to avoid null values
Sales_Model <- Sales_New %>%
drop_na(Customer.ID)
model <- lm(Quantity~Price + Revenue, data = Sales_Model)
summary(model)
##
## Call:
## lm(formula = Quantity ~ Price + Revenue, data = Sales_Model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20612 -4 -2 0 32510
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3143903 0.0902903 36.71 <2e-16 ***
## Price -0.5958043 0.0033288 -178.99 <2e-16 ***
## Revenue 0.5403015 0.0003996 1352.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.59 on 793165 degrees of freedom
## Multiple R-squared: 0.6975, Adjusted R-squared: 0.6975
## F-statistic: 9.143e+05 on 2 and 793165 DF, p-value: < 2.2e-16
The intercept (3.3143903) is the predicted quantity when both Price and Revenue are zero.
Price has a negative effect on Quantity. For each unit increase in Price, Quantity is estimated to decrease by approximately 0.596.
Revenue has a positive effect on Quantity. For each unit increase in Revenue, Quantity is estimated to increase by approximately 0.540.
The model appears to fit the data well, as indicated by the high R-squared value.
The p-values for the coefficients are very small, suggesting that both Price and Revenue are significant predictors of Quantity.
Overall, the model suggests that both Price and Revenue are important factors in predicting Quantity, and the model explains a significant amount of the variability in Quantity.
The analysis of the retail shopping sales dataset revealed key insights:
Sales Trends: Upward trend with peak purchases in November to December.
Purchase Time Analysis: Peak sales on Thursday and Tuesday, with peak hours at 11 AM - 12 PM and 12 AM - 2 AM.
Product Analysis: ‘World War 2 Gliders’ most purchased; ‘Regency Cakestand 3 Tier’ highest revenue.
Geographic Analysis: UK leads in purchases; Nigeria and Saudi Arabia have the least.
Customer Analysis: IDs 17841 and 14911 are top purchasers, indicating loyalty.
Correlation Analysis: Quantity and Revenue highly correlated (0.827).
Damaged and Cancelled Products: 3,457 products reported damaged; most cancellations from UK, Germany, and Ireland.
Model Building: Linear regression model shows both Price and Revenue significantly predict Quantity.
Improve delivery practices to reduce damaged products.
Optimize inventory and marketing based on sales trends.
Incentivize loyal customers, especially those with IDs 17841 and 14911.