In the competitive world of online selling, a thorough understanding of customer purchase behavior is crucial for optimizing sales, inventory, and marketing strategies. The Online Retail dataset contains transactional data from a UK-based e-commerce store, including product descriptions, quantities, prices, invoice dates, and customer locations. This project applies descriptive analytics techniques in R to explore sales trends, identify top-performing products and regions, segment customers, and uncover patterns through statistical modeling and visualization. The goal is to support data-driven decision making for business expansion and operational efficiency.
First, we examine the data types present in the Online Retail dataset and how the variables are classified.
# Load necessary libraries
library(readxl)
## Warning: package 'readxl' was built under R version 4.5.2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(class)
## Warning: package 'class' was built under R version 4.5.2
library(caret)
## Loading required package: lattice
# Load dataset
retail_data <- read_excel("C:/Users/chiru/Downloads/Online Retail.xlsx")
# View structure
str(retail_data)
## tibble [541,909 × 8] (S3: tbl_df/tbl/data.frame)
## $ InvoiceNo : chr [1:541909] "536365" "536365" "536365" "536365" ...
## $ StockCode : chr [1:541909] "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr [1:541909] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : num [1:541909] 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: POSIXct[1:541909], format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num [1:541909] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : num [1:541909] 17850 17850 17850 17850 17850 ...
## $ Country : chr [1:541909] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
summary(retail_data)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.000
## Class :character Class :character Class :character 1st Qu.: 1.000
## Mode :character Mode :character Mode :character Median : 3.000
## Mean : 9.552
## 3rd Qu.: 10.000
## Max. : 80995.000
##
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 08:26:00 Min. :-11062.060 Min. :12346
## 1st Qu.:2011-03-28 11:34:00 1st Qu.: 1.250 1st Qu.:13953
## Median :2011-07-19 17:17:00 Median : 2.080 Median :15152
## Mean :2011-07-04 13:34:57 Mean : 4.611 Mean :15288
## 3rd Qu.:2011-10-19 11:27:00 3rd Qu.: 4.130 3rd Qu.:16791
## Max. :2011-12-09 12:50:00 Max. : 38970.000 Max. :18287
## NA's :135080
## Country
## Length:541909
## Class :character
## Mode :character
##
##
##
##
For data cleaning, we removed rows with missing values (chiefly missing CustomerID) and transactions with non-positive quantities or prices, then verified that no missing values remain. This gives more reliable analysis and visualizations.
nrow_before <- nrow(retail_data)
nrow_before
## [1] 541909
retail_data <- na.omit(retail_data)
# Keep rows with a CustomerID and positive Quantity and UnitPrice
retail_data <- retail_data %>%
filter(!is.na(CustomerID), Quantity > 0, UnitPrice > 0)
sum(is.na(retail_data)) # Count missing values
## [1] 0
nrow_after <- nrow(retail_data)
nrow_after
## [1] 397884
dropped <- nrow_before - nrow_after
dropped
## [1] 144025
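In total, 144,025 of 541,909 rows (about 26.6%) were dropped; the summary above shows 135,080 missing CustomerID values, with the remaining drops coming from non-positive quantities or prices.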
We explored the dataset using descriptive statistics and visualizations.
retail_data$Sales <- retail_data$Quantity * retail_data$UnitPrice
mean(retail_data$Sales)
## [1] 22.397
median(retail_data$Sales)
## [1] 11.8
sd(retail_data$Sales)
## [1] 309.071
The mean (22.4) is nearly double the median (11.8), showing that most transaction values are relatively low while a few large orders pull the mean upward. The high standard deviation (309.1) indicates large variance in sales, suggesting that customer spending is far from uniform: most purchases are small, with occasional very large ones.
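The right skew can be checked directly from the upper quantiles of Sales; a minimal sketch:
# Inspect the upper tail of Sales to confirm the right skew
quantile(retail_data$Sales, probs = c(0.50, 0.90, 0.99, 0.999))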
## Data Visualization
### Filter Out Extreme Values
Large outliers can distort plots, so we remove the top 1% of sales values.
retail_data_filtered <- retail_data %>%
filter(Sales > 0 & Sales < quantile(Sales, 0.99))
ggplot(retail_data_filtered, aes(x = Sales)) +
geom_histogram(binwidth = 10,fill = "skyblue", color = "black") +
labs(title = "Distribution of Sales", x = "Sales", y = "Frequency")
The histogram shows the distribution of sales values, which helps us understand customer spending. Conclusion: most sales are low-value, between 0 and 50.
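That reading can be quantified in one line, using the filtered data from above:
# Share of transactions with a Sales value of 50 or less
mean(retail_data_filtered$Sales <= 50)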
ggplot(retail_data_filtered, aes(x = Country, y = Sales)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
labs(title = "Sales by Country", x = "Country", y = "Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The boxplot compares the distribution of sales across countries. Conclusion: the Netherlands has the highest spending, and the red outliers mark rare but large orders. (This supports our ANOVA finding that sales differ significantly between countries.)
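To back the boxplot reading with numbers, countries can be ranked by their median sale; a sketch using the already-loaded dplyr:
# Median and total sales per country, highest median first
retail_data_filtered %>%
  group_by(Country) %>%
  summarise(MedianSale = median(Sales), TotalSales = sum(Sales)) %>%
  arrange(desc(MedianSale)) %>%
  head(5)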
ggplot(retail_data_filtered, aes(x = Quantity, y = Sales)) +
geom_point(alpha = 0.5, color = "darkblue") +
labs(title = "Quantity vs Sales", x = "Quantity", y = "Sales")
As quantity increases, sales increase; however, most transactions involve small quantities. (Scatter plots are used to show relationships between two numeric variables.)
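The strength of that relationship can be summarized with a correlation coefficient (high by construction here, since Sales = Quantity × UnitPrice):
# Pearson correlation between Quantity and Sales in the filtered data
cor(retail_data_filtered$Quantity, retail_data_filtered$Sales)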
top_products <- retail_data %>%
group_by(Description) %>%
summarise(TotalSold = sum(Quantity, na.rm = TRUE)) %>%
arrange(desc(TotalSold)) %>%
head(10)
ggplot(top_products, aes(x = reorder(Description, TotalSold), y = TotalSold)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Selling Products", x = "Product", y = "Quantity Sold")
We grouped the data by product description with group_by(), summed the quantity sold per product, and visualized the top 10 products with a horizontal bar chart.
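Ranking by revenue instead of units sold is a small variation on the same pipeline; a sketch (Sales was computed earlier as Quantity * UnitPrice):
# Top 10 products by revenue rather than quantity sold
top_revenue <- retail_data %>%
  group_by(Description) %>%
  summarise(Revenue = sum(Sales, na.rm = TRUE)) %>%
  arrange(desc(Revenue)) %>%
  head(10)
top_revenue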
retail_data$InvoiceDate <- as.Date(retail_data$InvoiceDate)
daily_sales <- retail_data %>%
mutate(Sales = Quantity * UnitPrice) %>%
group_by(InvoiceDate) %>%
summarise(DailySales = sum(Sales, na.rm = TRUE))
ggplot(daily_sales, aes(x = InvoiceDate, y = DailySales)) +
geom_line(color = "darkgreen") +
labs(title = "Daily Sales Over Time", x = "Date", y = "Sales")
Converting the invoice timestamp to a Date lets us compute daily sales totals. The line chart shows how sales fluctuate over time, with a clear rise from September through November.
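The September-to-November rise is easier to see when sales are aggregated by month; a sketch using base R's format() on the converted dates:
# Aggregate sales by calendar month to highlight the seasonal peak
monthly_sales <- retail_data %>%
  mutate(Month = format(InvoiceDate, "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(MonthlySales = sum(Quantity * UnitPrice))
ggplot(monthly_sales, aes(x = Month, y = MonthlySales, group = 1)) +
  geom_line(color = "darkgreen") +
  labs(title = "Monthly Sales", x = "Month", y = "Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))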
We used linear regression to model sales and ANOVA to compare performance across countries.
retail_data$Sales <- retail_data$Quantity * retail_data$UnitPrice
model <- lm(Sales ~ Quantity + UnitPrice, data = retail_data)
summary(model)
##
## Call:
## lm(formula = Sales ~ Quantity + UnitPrice, data = retail_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39822 -2 0 5 40773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.820780 0.196287 -9.276 <2e-16 ***
## Quantity 1.576584 0.001081 1458.439 <2e-16 ***
## UnitPrice 1.200302 0.008773 136.822 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 122.3 on 397881 degrees of freedom
## Multiple R-squared: 0.8435, Adjusted R-squared: 0.8435
## F-statistic: 1.072e+06 on 2 and 397881 DF, p-value: < 2.2e-16
A linear regression model predicts an outcome from explanatory variables; here we predict Sales from Quantity and UnitPrice and check how each variable affects sales. Conclusion: both coefficients are positive, so sales increase as quantity increases and, for a given quantity, as unit price increases. The model explains about 84% of the variance in sales (adjusted R² = 0.8435).
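The fitted model can also be used to score new transactions; a minimal sketch with illustrative (made-up) order values:
# Predicted sales for two hypothetical orders at the same unit price
new_orders <- data.frame(Quantity = c(5, 50), UnitPrice = c(2.50, 2.50))
predict(model, newdata = new_orders)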
## Analysis of Variance (ANOVA)
anova_model <- aov(Sales ~ Country, data = retail_data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 36 4.288e+07 1191216 12.48 <2e-16 ***
## Residuals 397847 3.796e+10 95426
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
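The very small p-value (< 2e-16) indicates that mean sales differ significantly across countries. To see which country pairs drive the difference, a post-hoc Tukey HSD sketch (the full output has hundreds of pairwise comparisons, so only the first rows are shown):
# Pairwise comparisons of country means from the fitted aov model
tukey <- TukeyHSD(anova_model)
head(tukey$Country)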
## Supervised Machine Learning: K-Nearest Neighbors (KNN)
We used KNN to classify sales levels and evaluate accuracy.
# Create binary classification target
retail_data$SalesLevel <- ifelse(retail_data$Sales > median(retail_data$Sales), "High", "Low")
retail_data$SalesLevel <- as.factor(retail_data$SalesLevel)
# Select numeric predictors and remove duplicates
knn_data <- retail_data[, c("Quantity", "UnitPrice", "SalesLevel")]
knn_data <- distinct(knn_data) # Remove duplicate rows
# Normalize predictors
preproc <- preProcess(knn_data[, 1:2], method = c("center", "scale"))
knn_scaled <- predict(preproc, knn_data[, 1:2])
# Add target variable back
knn_scaled$SalesLevel <- knn_data$SalesLevel
#Split data
set.seed(123)
trainIndex <- createDataPartition(knn_scaled$SalesLevel, p = 0.8, list = FALSE)
train_knn <- knn_scaled[trainIndex, 1:2]
test_knn <- knn_scaled[-trainIndex, 1:2]
train_labels <- knn_scaled$SalesLevel[trainIndex]
test_labels <- knn_scaled$SalesLevel[-trainIndex]
#Apply KNN with odd k to avoid ties
knn_pred <- knn(train = train_knn, test = test_knn, cl = train_labels, k = 3)
# Evaluate model
confusionMatrix(knn_pred, test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 669 6
## Low 7 138
##
## Accuracy : 0.9841
## 95% CI : (0.973, 0.9915)
## No Information Rate : 0.8244
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9454
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9896
## Specificity : 0.9583
## Pos Pred Value : 0.9911
## Neg Pred Value : 0.9517
## Prevalence : 0.8244
## Detection Rate : 0.8159
## Detection Prevalence : 0.8232
## Balanced Accuracy : 0.9740
##
## 'Positive' Class : High
##
Using K-Nearest Neighbors (KNN) classification, we predicted whether a transaction was 'High' or 'Low' sales from Quantity and UnitPrice, splitting the data into training (80%) and test (20%) sets with createDataPartition(). The model achieved an accuracy of 98.41%, with a sensitivity of 98.96% and specificity of 95.83%. These results indicate that KNN is highly effective for segmenting transactions and flagging high-value orders, which can support targeted marketing and inventory decisions. (Because SalesLevel is derived from Sales = Quantity × UnitPrice, the predictors largely determine the label, which partly explains the very high accuracy.)
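The choice of k = 3 was fixed rather than tuned; a hedged sketch of cross-validated tuning with the already-loaded caret package:
# 5-fold cross-validation over a small grid of odd k values (may take a moment)
set.seed(123)
knn_fit <- train(SalesLevel ~ Quantity + UnitPrice,
                 data = knn_scaled[trainIndex, ],
                 method = "knn",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = data.frame(k = c(3, 5, 7, 9)))
knn_fit$bestTune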