This project analyzes an online retail dataset to understand customer buying patterns, product sales, and revenue trends. The dataset contains information about invoices, products, quantities, prices, and customer IDs. The goal is to find useful insights such as top customers, best-selling products, revenue by country, and sales trends over time. We use different data analysis methods like charts, clustering, association rules, and predictive modeling.
options(warn = -1)
library(readxl)
library(writexl)
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(scales)
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(caret)
## Loading required package: lattice
library(class)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(DT)
library(corrplot)
## corrplot 0.95 loaded
All the library() lines bring in tools that help perform specific tasks — like data handling (dplyr), visualization (ggplot2), machine learning (caret), clustering (factoextra), etc. arules is used for Market Basket Analysis (Apriori algorithm)
retail <- read.csv("C:/Users/kisha/Downloads/OnlineRetailData.csv", stringsAsFactors = FALSE)
Interpretation: This reads the Online Retail dataset (CSV file) from your computer and stores it into a data frame named retail
# 2Explore Data
str(retail)
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: chr "12-01-2010 08:26" "12-01-2010 08:26" "12-01-2010 08:26" "12-01-2010 08:26" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
summary(retail)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.00
## Class :character Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Mode :character Median : 3.00
## Mean : 9.55
## 3rd Qu.: 10.00
## Max. : 80995.00
##
## InvoiceDate UnitPrice CustomerID Country
## Length:541909 Min. :-11062.06 Min. :12346 Length:541909
## Class :character 1st Qu.: 1.25 1st Qu.:13953 Class :character
## Mode :character Median : 2.08 Median :15152 Mode :character
## Mean : 4.61 Mean :15288
## 3rd Qu.: 4.13 3rd Qu.:16791
## Max. : 38970.00 Max. :18287
## NA's :135080
head(retail)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 12-01-2010 08:26 2.55 17850 United Kingdom
## 2 12-01-2010 08:26 3.39 17850 United Kingdom
## 3 12-01-2010 08:26 2.75 17850 United Kingdom
## 4 12-01-2010 08:26 3.39 17850 United Kingdom
## 5 12-01-2010 08:26 3.39 17850 United Kingdom
## 6 12-01-2010 08:26 7.65 17850 United Kingdom
Interpretation: This helps you understand the dataset — its structure (column types), basic summary (min, max, mean), and first few records.
# Handle Missing Value
sapply(retail, function(x) sum(is.na(x))) # count NA
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country
## 135080 0
retail$CustomerID[is.na(retail$CustomerID)] <- "Unknown"
retail$UnitPrice[is.na(retail$UnitPrice)] <- median(retail$UnitPrice, na.rm = TRUE)
Interpretation: 1.Finds how many missing values each column has.
2.Replaces missing CustomerID with “Unk
3.Replaces missing UnitPrice with the median (middle value).
# Remove Cancelled / Negative Records
retail <- retail[!grepl("^C", retail$InvoiceNo), ]
retail <- retail[retail$Quantity > 0 & retail$UnitPrice > 0, ]
# Feature Engineering
retail$InvoiceDate <- as.POSIXct(retail$InvoiceDate, format = "%m/%d/%Y %H:%M")
retail$TotalPrice <- retail$Quantity * retail$UnitPrice
retail$Month <- month(retail$InvoiceDate)
retail$Day <- day(retail$InvoiceDate)
retail$Weekday <- wday(retail$InvoiceDate, label = TRUE)
head(retail)
## InvoiceNo StockCode Description Quantity InvoiceDate
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 <NA>
## 2 536365 71053 WHITE METAL LANTERN 6 <NA>
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 <NA>
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 <NA>
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 <NA>
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 <NA>
## UnitPrice CustomerID Country TotalPrice Month Day Weekday
## 1 2.55 17850 United Kingdom 15.30 NA NA <NA>
## 2 3.39 17850 United Kingdom 20.34 NA NA <NA>
## 3 2.75 17850 United Kingdom 22.00 NA NA <NA>
## 4 3.39 17850 United Kingdom 20.34 NA NA <NA>
## 5 3.39 17850 United Kingdom 20.34 NA NA <NA>
## 6 7.65 17850 United Kingdom 15.30 NA NA <NA>
Interpretation: 1. Removes all cancelled invoices (which start with “C”).
2.Removes rows where Quantity or UnitPrice is negative or zero.
3.Converts InvoiceDate into date-time format.
4.Creates a new column TotalPrice = Quantity × UnitPrice.
5.Extracts month, day, and weekday from the date for further analysis.
# Basic Statistical Operations
mode_func <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
mean(retail$Quantity)
## [1] 10.54204
median(retail$Quantity)
## [1] 3
mode_func(retail$Quantity)
## [1] 1
sd(retail$Quantity)
## [1] 155.5241
Interpretation:Finds Mean, Median, Mode, and Standard Deviation of Quantity column to understand sales variation.
# Min-Max Normalization
minmax <- function(x){ (x - min(x)) / (max(x) - min(x)) }
retail$Quantity_norm <- minmax(retail$Quantity)
retail$UnitPrice_norm <- minmax(retail$UnitPrice)
retail$TotalPrice_norm <- minmax(retail$TotalPrice)
Interpretation: Converts numeric values into a 0–1 scale so that large numbers don’t dominate small ones during analysis.
# Correlation Analysis
numeric_data <- retail[, c("Quantity", "UnitPrice", "TotalPrice")]
cor_matrix <- cor(numeric_data)
print(cor_matrix)
## Quantity UnitPrice TotalPrice
## Quantity 1.000000000 -0.003772543 0.9073378
## UnitPrice -0.003772543 1.000000000 0.1374035
## TotalPrice 0.907337778 0.137403527 1.0000000
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.8)
Interpretation: 1.Finds how strongly numeric variables (Quantity,
UnitPrice, TotalPrice) are related.
2.Displays this relationship visually with a correlation plot.
# Linear Regression
model <- lm(TotalPrice ~ Quantity + UnitPrice, data = retail)
summary(model)
##
## Call:
## lm(formula = TotalPrice ~ Quantity + UnitPrice, data = retail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39943 -2 -1 3 40642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6580384 0.1482955 -4.437 9.11e-06 ***
## Quantity 1.5782023 0.0009458 1668.707 < 2e-16 ***
## UnitPrice 1.0600922 0.0040954 258.849 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107.1 on 530101 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8431
## F-statistic: 1.424e+06 on 2 and 530101 DF, p-value: < 2.2e-16
Interpretation: 1.Builds a regression model to predict TotalPrice using Quantity and UnitPrice.
2.Shows how much each factor influences total sales.
Visualization: Shows a scatter plot comparing actual vs predicted prices.1.
# Plot: Actual vs Predicted
retail$pred <- predict(model, newdata = retail)
ggplot(retail, aes(x = TotalPrice, y = pred)) +
geom_point(alpha = 0.3, color = "blue") +
geom_abline(slope = 1, intercept = 0, color = "red") +
ggtitle("Actual vs Predicted TotalPrice")
Interpretation: 1.It uses your linear regression model (model) to
predict the TotalPrice for each record in your dataset.
2.Actual TotalPrice (from dataset) on the X-axis, and
# Normalize numeric columns
retail_norm <- scale(numeric_data)
Interpretation:1.The numeric columns (Quantity, UnitPrice, TotalPrice) are converted into the same scale.
# Reduce dataset size to avoid memory issues
set.seed(123)
retail_sample <- retail_norm[sample(1:nrow(retail_norm), 5000), ]
Interpretation:1.To avoid memory errors, only 5000 random records are selected from the dataset.
2.This smaller sample is enough to perform K-Means clustering efficiently.
# Find optimal number of clusters (Elbow Method)
fviz_nbclust(retail_sample, kmeans, method = "wss") + ggtitle("Elbow Method for Optimal Clusters")
Interpretation:1.The Elbow Method graph shows how the clustering quality changes as we increase the number of clusters (k).
2.We look for a point where the curve bends like an “elbow.”
3.That k-value is considered the best number of clusters for the dataset.
# Apply K-Means (k = 3)
set.seed(123)
km <- kmeans(retail_sample, centers = 3, nstart = 25)
fviz_cluster(km, data = retail_sample, geom = "point", ellipse.type = "convex") +
ggtitle("K-Means Clustering (k = 3)")
Interpretation:1.We are dividing the dataset into 3 clusters (groups).
2.K-Means will try 25 different starting points and choose the best grouping.
3.This helps in getting a more accurate and stable clustering result.
4.This creates a colorful plot showing the 3 clusters.
5.Each point represents a customer transaction.
6.Points with similar buying behavior appear close together.
7.The ellipse around each group shows the cluster boundary
# Create sample data
set.seed(123)
sample_data <- retail[sample(1:nrow(retail), 1000), c("Quantity", "UnitPrice", "TotalPrice")]
sample_data$Category <- ifelse(sample_data$TotalPrice > median(sample_data$TotalPrice), "High", "Low")
Interpretation:1.From the large retail dataset, 1,000 random rows are selected for KNN.
2.Only three columns are used: Quantity, UnitPrice, and TotalPrice.
3.A new column Category is created:
4.If TotalPrice is above the median, label = “High” Otherwise label = “Low”
# Split data into train/test
set.seed(123)
train_index <- createDataPartition(sample_data$Category, p = 0.7, list = FALSE)
train <- sample_data[train_index, ]
test <- sample_data[-train_index, ]
Interpretation:1.The data is split into 70% training and 30% testing.
2.createDataPartition ensures both classes (High/Low) stay balanced.
3.The training set is used to build the KNN model.
# Normalize numeric features
train_norm <- as.data.frame(scale(train[, 1:3]))
test_norm <- as.data.frame(scale(test[, 1:3]))
Interpretation:1.KNN works based on distance, so all values must be scaled.
2,scale() standardizes the data so that all variables are on the same range.
# KNN classification
pred <- knn(train = train_norm, test = test_norm, cl = train$Category, k = 5)
confusionMatrix(table(pred, test$Category))
## Confusion Matrix and Statistics
##
##
## pred High Low
## High 116 29
## Low 34 121
##
## Accuracy : 0.79
## 95% CI : (0.7395, 0.8347)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.58
##
## Mcnemar's Test P-Value : 0.6143
##
## Sensitivity : 0.7733
## Specificity : 0.8067
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.7806
## Prevalence : 0.5000
## Detection Rate : 0.3867
## Detection Prevalence : 0.4833
## Balanced Accuracy : 0.7900
##
## 'Positive' Class : High
##
Interpretation:1.KNN predicts whether each customer is a High or Low spender.
2.k = 5 means the prediction is based on the 5 nearest neighbors.
3.Predictions (pred) are compared with actual labels (test$Category).
4.confusionMatrix shows:
Accuracy
Precision & Recall
How many predictions were correct or wrong
# Top 10 Loyal Customers (Highest Spending)
top_customers <- retail %>%
group_by(CustomerID) %>%
summarise(TotalSpent = sum(TotalPrice, na.rm = TRUE)) %>%
arrange(desc(TotalSpent)) %>%
slice(1:10)
ggplot(top_customers, aes(x = reorder(as.factor(CustomerID), TotalSpent), y = TotalSpent)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
ggtitle("Top 10 Loyal Customers by Total Spending") +
xlab("Customer ID") + ylab("Total Revenue")
Interpretation: 1.Groups data by each CustomerID.
2.Calculates total spending for each customer.
3.Sorts customers from highest to lowest spending.
4.Picks the top 10 customers with the most spending.
Interpretation: 1.The chart shows which customers spent the most.
2.Taller bars represent higher spending.
3.This helps identify the most loyal or valuable customers for marketing or reward programs.
# Revenue Contribution by Country (Pie Chart)
country_revenue <- retail %>%
group_by(Country) %>%
summarise(TotalRevenue = sum(TotalPrice, na.rm = TRUE)) %>%
arrange(desc(TotalRevenue))
top5_countries <- head(country_revenue, 5)
others_sum <- sum(country_revenue$TotalRevenue[6:nrow(country_revenue)])
top5_countries <- rbind(top5_countries, data.frame(Country = "Others", TotalRevenue = others_sum))
ggplot(top5_countries, aes(x = "", y = TotalRevenue, fill = Country)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
ggtitle("Revenue Contribution by Country (Top 5 + Others)") +
theme_void()
Interpretation:1.Shows how revenue is distributed across countries.
2.The largest slices indicate countries contributing the most revenue.
3.The “Others” slice summarizes the combined contribution of smaller revenue countries.
4.This helps identify key markets and countries where most sales come from.
# Visualization: Top 10 Best-Selling Items
top_items <- retail %>%
group_by(Description) %>%
summarise(TotalQty = sum(Quantity)) %>%
arrange(desc(TotalQty)) %>%
slice(1:10)
ggplot(top_items, aes(x = reorder(Description, TotalQty), y = TotalQty)) +
geom_bar(stat = "identity", fill = "purple") +
coord_flip() +
ggtitle("Top 10 Best-Selling Items")
Interpretation:1.Shows which items sold the most units in the dataset.
2.Taller bars indicate higher sales quantity.
3.Helps the business identify popular products for marketing, stock planning, and promotions.
# Visualization: Revenue by Country
top_countries <- retail %>%
group_by(Country) %>%
summarise(TotalRevenue = sum(TotalPrice)) %>%
arrange(desc(TotalRevenue)) %>%
slice(1:10)
ggplot(top_countries, aes(x = reorder(Country, TotalRevenue), y = TotalRevenue)) +
geom_bar(stat = "identity", fill = "green") +
coord_flip() +
ggtitle("Top 10 Countries by Revenue")
Interpretation:1. Shows which countries contribute the most revenue.
2.Taller bars represent higher revenue contribution.
3.Helps identify key markets and prioritize strategies for high-revenue countries.
# Time Series Plot (Revenue over Time)
daily_sales <- retail %>%
group_by(Date = as.Date(InvoiceDate)) %>%
summarise(DailyRevenue = sum(TotalPrice))
ggplot(daily_sales, aes(x = Date, y = DailyRevenue)) +
geom_line(color = "blue") +
ggtitle("Daily Revenue Trend")
Interpretation:1.Shows how revenue changes daily.
2.Peaks indicate days with high sales, while dips show low sales periods.
3.Helps identify trends, seasonality, or unusual spikes in revenue over time.
# check Country is a factor
retail$Country <- as.factor(retail$Country)
# ANOVA model
anova_model <- aov(TotalPrice ~ Country, data = retail)
# Summary of ANOVA
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 37 4.627e+07 1250456 17.13 <2e-16 ***
## Residuals 530066 3.870e+10 73011
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation:1.If the p-value < 0.05, it means:
2.There is a significant difference in average TotalPrice between countries.
3.Some countries spend more or less compared to others.
4.If p-value > 0.05, it means:
5.There is no significant difference in spending across countries.
boxplot(TotalPrice ~ Country, data = retail,
main = "TotalPrice vs Country",
xlab = "Country", ylab = "Total Price",
col = rainbow(10), las = 2)
Interpretation:1.The boxplot compares TotalPrice across different Countries to see how spending differs.
2.Each box shows the distribution, median, and variability of TotalPrice for each country.
3.Outliers (dots) indicate unusually high or low purchase amounts.
4.The x-axis labels are rotated (las = 2) for better readability, and different colors distinguish countries.