Introduction

This project analyzes an online retail dataset to understand customer buying patterns, product sales, and revenue trends. The dataset contains information about invoices, products, quantities, prices, and customer IDs. The goal is to find useful insights such as top customers, best-selling products, revenue by country, and sales trends over time. We use different data analysis methods like charts, clustering, association rules, and predictive modeling.

Install Library

options(warn = -1)
library(readxl)
library(writexl)
library(data.table)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)
library(scales)
library(cluster)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(caret)

## Loading required package: lattice

library(class)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(DT)
library(corrplot)

## corrplot 0.95 loaded

All the library() lines bring in tools that help perform specific tasks — like data handling (dplyr), visualization (ggplot2), machine learning (caret), clustering (factoextra), etc. arules is used for Market Basket Analysis (Apriori algorithm)

1:Read the Data Set

retail <- read.csv("C:/Users/kisha/Downloads/OnlineRetailData.csv", stringsAsFactors = FALSE)

Interpretation: This reads the Online Retail dataset (CSV file) from your computer and stores it into a data frame named retail

2:Explore Data

# 2Explore Data
str(retail)

## 'data.frame':    541909 obs. of  8 variables:
##  $ InvoiceNo  : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: chr  "12-01-2010 08:26" "12-01-2010 08:26" "12-01-2010 08:26" "12-01-2010 08:26" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Country    : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...

summary(retail)

##   InvoiceNo          StockCode         Description           Quantity        
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.00  
##                                                           Mean   :     9.55  
##                                                           3rd Qu.:    10.00  
##                                                           Max.   : 80995.00  
##                                                                              
##  InvoiceDate          UnitPrice           CustomerID       Country         
##  Length:541909      Min.   :-11062.06   Min.   :12346    Length:541909     
##  Class :character   1st Qu.:     1.25   1st Qu.:13953    Class :character  
##  Mode  :character   Median :     2.08   Median :15152    Mode  :character  
##                     Mean   :     4.61   Mean   :15288                      
##                     3rd Qu.:     4.13   3rd Qu.:16791                      
##                     Max.   : 38970.00   Max.   :18287                      
##                                         NA's   :135080

head(retail)

##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
##        InvoiceDate UnitPrice CustomerID        Country
## 1 12-01-2010 08:26      2.55      17850 United Kingdom
## 2 12-01-2010 08:26      3.39      17850 United Kingdom
## 3 12-01-2010 08:26      2.75      17850 United Kingdom
## 4 12-01-2010 08:26      3.39      17850 United Kingdom
## 5 12-01-2010 08:26      3.39      17850 United Kingdom
## 6 12-01-2010 08:26      7.65      17850 United Kingdom

Interpretation: This helps you understand the dataset — its structure (column types), basic summary (min, max, mean), and first few records.

3:Handle Missing Value

# Handle Missing Value
sapply(retail, function(x) sum(is.na(x)))  # count NA

##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0           0           0           0           0 
##  CustomerID     Country 
##      135080           0

retail$CustomerID[is.na(retail$CustomerID)] <- "Unknown"
retail$UnitPrice[is.na(retail$UnitPrice)] <- median(retail$UnitPrice, na.rm = TRUE)

Interpretation: 1.Finds how many missing values each column has.

2.Replaces missing CustomerID with “Unk

3.Replaces missing UnitPrice with the median (middle value).

4:Remove Cancelled / Negative Records

# Remove Cancelled / Negative Records 
retail <- retail[!grepl("^C", retail$InvoiceNo), ]
retail <- retail[retail$Quantity > 0 & retail$UnitPrice > 0, ]


#  Feature Engineering 
retail$InvoiceDate <- as.POSIXct(retail$InvoiceDate, format = "%m/%d/%Y %H:%M")
retail$TotalPrice <- retail$Quantity * retail$UnitPrice
retail$Month <- month(retail$InvoiceDate)
retail$Day <- day(retail$InvoiceDate)
retail$Weekday <- wday(retail$InvoiceDate, label = TRUE)
head(retail)

##   InvoiceNo StockCode                         Description Quantity InvoiceDate
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6        <NA>
## 2    536365     71053                 WHITE METAL LANTERN        6        <NA>
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8        <NA>
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6        <NA>
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6        <NA>
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2        <NA>
##   UnitPrice CustomerID        Country TotalPrice Month Day Weekday
## 1      2.55      17850 United Kingdom      15.30    NA  NA    <NA>
## 2      3.39      17850 United Kingdom      20.34    NA  NA    <NA>
## 3      2.75      17850 United Kingdom      22.00    NA  NA    <NA>
## 4      3.39      17850 United Kingdom      20.34    NA  NA    <NA>
## 5      3.39      17850 United Kingdom      20.34    NA  NA    <NA>
## 6      7.65      17850 United Kingdom      15.30    NA  NA    <NA>

Interpretation: 1. Removes all cancelled invoices (which start with “C”).

2.Removes rows where Quantity or UnitPrice is negative or zero.

3.Converts InvoiceDate into date-time format.

4.Creates a new column TotalPrice = Quantity × UnitPrice.

5.Extracts month, day, and weekday from the date for further analysis.

5:Basic Statistical Operations

#  Basic Statistical Operations 
mode_func <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

mean(retail$Quantity)

## [1] 10.54204

median(retail$Quantity)

## [1] 3

mode_func(retail$Quantity)

## [1] 1

sd(retail$Quantity)

## [1] 155.5241

Interpretation:Finds Mean, Median, Mode, and Standard Deviation of Quantity column to understand sales variation.

6:Min-Max Normalization

#  Min-Max Normalization 
minmax <- function(x){ (x - min(x)) / (max(x) - min(x)) }
retail$Quantity_norm <- minmax(retail$Quantity)
retail$UnitPrice_norm <- minmax(retail$UnitPrice)
retail$TotalPrice_norm <- minmax(retail$TotalPrice)

Interpretation: Converts numeric values into a 0–1 scale so that large numbers don’t dominate small ones during analysis.

7:Correlation Analysis

#  Correlation Analysis 
numeric_data <- retail[, c("Quantity", "UnitPrice", "TotalPrice")]
cor_matrix <- cor(numeric_data)
print(cor_matrix)

##                Quantity    UnitPrice TotalPrice
## Quantity    1.000000000 -0.003772543  0.9073378
## UnitPrice  -0.003772543  1.000000000  0.1374035
## TotalPrice  0.907337778  0.137403527  1.0000000

corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.8)

Interpretation: 1.Finds how strongly numeric variables (Quantity, UnitPrice, TotalPrice) are related.

2.Displays this relationship visually with a correlation plot.

8:Linear Regression

#  Linear Regression 
model <- lm(TotalPrice ~ Quantity + UnitPrice, data = retail)
summary(model)

## 
## Call:
## lm(formula = TotalPrice ~ Quantity + UnitPrice, data = retail)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39943     -2     -1      3  40642 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -0.6580384  0.1482955   -4.437 9.11e-06 ***
## Quantity     1.5782023  0.0009458 1668.707  < 2e-16 ***
## UnitPrice    1.0600922  0.0040954  258.849  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 107.1 on 530101 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8431 
## F-statistic: 1.424e+06 on 2 and 530101 DF,  p-value: < 2.2e-16

Interpretation: 1.Builds a regression model to predict TotalPrice using Quantity and UnitPrice.

2.Shows how much each factor influences total sales.

Visualization: Shows a scatter plot comparing actual vs predicted prices.1.

9:Plot: Actual vs Predicted

# Plot: Actual vs Predicted
retail$pred <- predict(model, newdata = retail)
ggplot(retail, aes(x = TotalPrice, y = pred)) +
  geom_point(alpha = 0.3, color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  ggtitle("Actual vs Predicted TotalPrice")

Interpretation: 1.It uses your linear regression model (model) to predict the TotalPrice for each record in your dataset.

2.Actual TotalPrice (from dataset) on the X-axis, and

Predicted TotalPrice (from model) on the Y-axis.

10:K-Means Clustering

# Normalize numeric columns
retail_norm <- scale(numeric_data)

Interpretation:1.The numeric columns (Quantity, UnitPrice, TotalPrice) are converted into the same scale.

# Reduce dataset size to avoid memory issues
set.seed(123)
retail_sample <- retail_norm[sample(1:nrow(retail_norm), 5000), ]

Interpretation:1.To avoid memory errors, only 5000 random records are selected from the dataset.

2.This smaller sample is enough to perform K-Means clustering efficiently.

# Find optimal number of clusters (Elbow Method)
fviz_nbclust(retail_sample, kmeans, method = "wss") + ggtitle("Elbow Method for Optimal Clusters")

Interpretation:1.The Elbow Method graph shows how the clustering quality changes as we increase the number of clusters (k).

2.We look for a point where the curve bends like an “elbow.”

3.That k-value is considered the best number of clusters for the dataset.

11:Apply K-Means (k = 3)

# Apply K-Means (k = 3)
set.seed(123)
km <- kmeans(retail_sample, centers = 3, nstart = 25)
fviz_cluster(km, data = retail_sample, geom = "point", ellipse.type = "convex") +
  ggtitle("K-Means Clustering (k = 3)")

Interpretation:1.We are dividing the dataset into 3 clusters (groups).

2.K-Means will try 25 different starting points and choose the best grouping.

3.This helps in getting a more accurate and stable clustering result.

4.This creates a colorful plot showing the 3 clusters.

5.Each point represents a customer transaction.

6.Points with similar buying behavior appear close together.

7.The ellipse around each group shows the cluster boundary

12:KNN Classification

# Create sample data
set.seed(123)
sample_data <- retail[sample(1:nrow(retail), 1000), c("Quantity", "UnitPrice", "TotalPrice")]
sample_data$Category <- ifelse(sample_data$TotalPrice > median(sample_data$TotalPrice), "High", "Low")

Interpretation:1.From the large retail dataset, 1,000 random rows are selected for KNN.

2.Only three columns are used: Quantity, UnitPrice, and TotalPrice.

3.A new column Category is created:

4.If TotalPrice is above the median, label = “High” Otherwise label = “Low”

# Split data into train/test
set.seed(123)
train_index <- createDataPartition(sample_data$Category, p = 0.7, list = FALSE)
train <- sample_data[train_index, ]
test <- sample_data[-train_index, ]

Interpretation:1.The data is split into 70% training and 30% testing.

2.createDataPartition ensures both classes (High/Low) stay balanced.

3.The training set is used to build the KNN model.

test set is used to check model accuracy.

# Normalize numeric features
train_norm <- as.data.frame(scale(train[, 1:3]))
test_norm <- as.data.frame(scale(test[, 1:3]))

Interpretation:1.KNN works based on distance, so all values must be scaled.

2,scale() standardizes the data so that all variables are on the same range.

# KNN classification
pred <- knn(train = train_norm, test = test_norm, cl = train$Category, k = 5)
confusionMatrix(table(pred, test$Category))

## Confusion Matrix and Statistics
## 
##       
## pred   High Low
##   High  116  29
##   Low    34 121
##                                           
##                Accuracy : 0.79            
##                  95% CI : (0.7395, 0.8347)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.58            
##                                           
##  Mcnemar's Test P-Value : 0.6143          
##                                           
##             Sensitivity : 0.7733          
##             Specificity : 0.8067          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.7806          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3867          
##    Detection Prevalence : 0.4833          
##       Balanced Accuracy : 0.7900          
##                                           
##        'Positive' Class : High            
##

Interpretation:1.KNN predicts whether each customer is a High or Low spender.

2.k = 5 means the prediction is based on the 5 nearest neighbors.

3.Predictions (pred) are compared with actual labels (test$Category).

4.confusionMatrix shows:

Accuracy

Precision & Recall

How many predictions were correct or wrong

13:Top 10 Loyal Customers (Highest Spending)

#  Top 10 Loyal Customers (Highest Spending) 
top_customers <- retail %>%
  group_by(CustomerID) %>%
  summarise(TotalSpent = sum(TotalPrice, na.rm = TRUE)) %>%
  arrange(desc(TotalSpent)) %>%
  slice(1:10)

ggplot(top_customers, aes(x = reorder(as.factor(CustomerID), TotalSpent), y = TotalSpent)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  ggtitle("Top 10 Loyal Customers by Total Spending") +
  xlab("Customer ID") + ylab("Total Revenue")

Interpretation: 1.Groups data by each CustomerID.

2.Calculates total spending for each customer.

3.Sorts customers from highest to lowest spending.

4.Picks the top 10 customers with the most spending.

Interpretation: 1.The chart shows which customers spent the most.

2.Taller bars represent higher spending.

3.This helps identify the most loyal or valuable customers for marketing or reward programs.

14:Revenue Contribution by Country (Pie Chart)

#  Revenue Contribution by Country (Pie Chart) 
country_revenue <- retail %>%
  group_by(Country) %>%
  summarise(TotalRevenue = sum(TotalPrice, na.rm = TRUE)) %>%
  arrange(desc(TotalRevenue))

top5_countries <- head(country_revenue, 5)
others_sum <- sum(country_revenue$TotalRevenue[6:nrow(country_revenue)])
top5_countries <- rbind(top5_countries, data.frame(Country = "Others", TotalRevenue = others_sum))

ggplot(top5_countries, aes(x = "", y = TotalRevenue, fill = Country)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  ggtitle("Revenue Contribution by Country (Top 5 + Others)") +
  theme_void()

Interpretation:1.Shows how revenue is distributed across countries.

2.The largest slices indicate countries contributing the most revenue.

3.The “Others” slice summarizes the combined contribution of smaller revenue countries.

4.This helps identify key markets and countries where most sales come from.

15:Visualization: Top 10 Best-Selling Items

# Visualization: Top 10 Best-Selling Items 
top_items <- retail %>%
  group_by(Description) %>%
  summarise(TotalQty = sum(Quantity)) %>%
  arrange(desc(TotalQty)) %>%
  slice(1:10)

ggplot(top_items, aes(x = reorder(Description, TotalQty), y = TotalQty)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  ggtitle("Top 10 Best-Selling Items")

Interpretation:1.Shows which items sold the most units in the dataset.

2.Taller bars indicate higher sales quantity.

3.Helps the business identify popular products for marketing, stock planning, and promotions.

16:Visualization: Revenue by Country

#  Visualization: Revenue by Country 
top_countries <- retail %>%
  group_by(Country) %>%
  summarise(TotalRevenue = sum(TotalPrice)) %>%
  arrange(desc(TotalRevenue)) %>%
  slice(1:10)

ggplot(top_countries, aes(x = reorder(Country, TotalRevenue), y = TotalRevenue)) +
  geom_bar(stat = "identity", fill = "green") +
  coord_flip() +
  ggtitle("Top 10 Countries by Revenue")

Interpretation:1. Shows which countries contribute the most revenue.

2.Taller bars represent higher revenue contribution.

3.Helps identify key markets and prioritize strategies for high-revenue countries.

17:Time Series Plot (Revenue over Time)

#  Time Series Plot (Revenue over Time) 
daily_sales <- retail %>%
  group_by(Date = as.Date(InvoiceDate)) %>%
  summarise(DailyRevenue = sum(TotalPrice))

ggplot(daily_sales, aes(x = Date, y = DailyRevenue)) +
  geom_line(color = "blue") +
  ggtitle("Daily Revenue Trend")

Interpretation:1.Shows how revenue changes daily.

2.Peaks indicate days with high sales, while dips show low sales periods.

3.Helps identify trends, seasonality, or unusual spikes in revenue over time.

18:ANOVA: Compare TotalPrice across Countries

# check Country is a factor
retail$Country <- as.factor(retail$Country)

# ANOVA model
anova_model <- aov(TotalPrice ~ Country, data = retail)

# Summary of ANOVA
summary(anova_model)

##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## Country         37 4.627e+07 1250456   17.13 <2e-16 ***
## Residuals   530066 3.870e+10   73011                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:1.If the p-value < 0.05, it means:

2.There is a significant difference in average TotalPrice between countries.

3.Some countries spend more or less compared to others.

4.If p-value > 0.05, it means:

5.There is no significant difference in spending across countries.

19:ANOVA Visualization

boxplot(TotalPrice ~ Country, data = retail,
        main = "TotalPrice vs Country",
        xlab = "Country", ylab = "Total Price",
        col = rainbow(10), las = 2)

Interpretation:1.The boxplot compares TotalPrice across different Countries to see how spending differs.

2.Each box shows the distribution, median, and variability of TotalPrice for each country.

3.Outliers (dots) indicate unusually high or low purchase amounts.

4.The x-axis labels are rotated (las = 2) for better readability, and different colors distinguish countries.

Online Retail Data Analysis: Customer and Sales Insights

Student 1: Kishan Kumar Singh

Student 2: Nishant Kumar

2025-11-14