In the competitive world of online selling, a thorough understanding of customer purchase behavior is crucial for optimizing sales, inventory, and marketing strategies. The Online Retail dataset contains transactional data from a UK-based e-commerce store, including product descriptions, quantities, prices, invoice dates, and customer locations. This project applies descriptive analytics techniques in R to explore sales trends, identify top-performing products and regions, segment customers, and uncover patterns through statistical modeling and visualization. The goal is to support data-driven decision making for business expansion and operational efficiency.
First, we examine the data types present in the Online Retail dataset and how the variables are classified.
# Load necessary libraries
library(readxl)
## Warning: package 'readxl' was built under R version 4.5.2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(class)
## Warning: package 'class' was built under R version 4.5.2
library(caret)
## Loading required package: lattice
# Load dataset
retail_data <- read_excel("C:/Users/chiru/Downloads/Online Retail.xlsx")
# View structure
str(retail_data)
## tibble [541,909 × 8] (S3: tbl_df/tbl/data.frame)
## $ InvoiceNo : chr [1:541909] "536365" "536365" "536365" "536365" ...
## $ StockCode : chr [1:541909] "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr [1:541909] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : num [1:541909] 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: POSIXct[1:541909], format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num [1:541909] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : num [1:541909] 17850 17850 17850 17850 17850 ...
## $ Country : chr [1:541909] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
summary(retail_data)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.000
## Class :character Class :character Class :character 1st Qu.: 1.000
## Mode :character Mode :character Mode :character Median : 3.000
## Mean : 9.552
## 3rd Qu.: 10.000
## Max. : 80995.000
##
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 08:26:00 Min. :-11062.060 Min. :12346
## 1st Qu.:2011-03-28 11:34:00 1st Qu.: 1.250 1st Qu.:13953
## Median :2011-07-19 17:17:00 Median : 2.080 Median :15152
## Mean :2011-07-04 13:34:57 Mean : 4.611 Mean :15288
## 3rd Qu.:2011-10-19 11:27:00 3rd Qu.: 4.130 3rd Qu.:16791
## Max. :2011-12-09 12:50:00 Max. : 38970.000 Max. :18287
## NA's :135080
## Country
## Length:541909
## Class :character
## Mode :character
##
##
##
##
For data cleaning, we removed rows with missing values (chiefly missing CustomerID) and transactions with non-positive quantities or prices, then verified that no missing values remain. This gives more reliable analysis and visualizations.
nrow_before <- nrow(retail_data)
nrow_before
## [1] 541909
retail_data <- na.omit(retail_data)
# Keep rows with a CustomerID and positive Quantity and UnitPrice
retail_data <- retail_data %>%
filter(!is.na(CustomerID), Quantity > 0, UnitPrice > 0)
sum(is.na(retail_data)) # Count missing values
## [1] 0
nrow_after <- nrow(retail_data)
nrow_after
## [1] 397884
dropped <- nrow_before - nrow_after
dropped
## [1] 144025
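In total, 144,025 of 541,909 rows (about 26.6%) were dropped; the summary above shows 135,080 missing CustomerID values, with the remaining drops coming from non-positive quantities or prices.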
We explored the dataset using descriptive statistics and visualizations.
retail_data$Sales <- retail_data$Quantity * retail_data$UnitPrice
mean(retail_data$Sales)
## [1] 22.397
median(retail_data$Sales)
## [1] 11.8
sd(retail_data$Sales)
## [1] 309.071
The mean (22.4) is nearly double the median (11.8), showing that most transaction values are relatively low while a few large orders pull the mean upward. The high standard deviation (309.1) indicates large variance in sales, suggesting that customer spending is far from uniform: most purchases are small, with occasional very large ones.
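The right skew can be checked directly from the upper quantiles of Sales; a minimal sketch:
# Inspect the upper tail of Sales to confirm the right skew
quantile(retail_data$Sales, probs = c(0.50, 0.90, 0.99, 0.999))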
## Data Visualization
### Filter Out Extreme Values
Large outliers can distort plots, so we remove the top 1% of sales values.
retail_data_filtered <- retail_data %>%
filter(Sales > 0 & Sales < quantile(Sales, 0.99))
ggplot(retail_data_filtered, aes(x = Sales)) +
geom_histogram(binwidth = 10,fill = "skyblue", color = "black") +
labs(title = "Distribution of Sales", x = "Sales", y = "Frequency")
The histogram shows the distribution of sales values, which helps us understand customer spending. Conclusion: most sales are low-value, between 0 and 50.
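That reading can be quantified in one line, using the filtered data from above:
# Share of transactions with a Sales value of 50 or less
mean(retail_data_filtered$Sales <= 50)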
ggplot(retail_data_filtered, aes(x = Country, y = Sales)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
labs(title = "Sales by Country", x = "Country", y = "Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The boxplot compares the distribution of sales across countries. Conclusion: the Netherlands has the highest spending, and the red outliers mark rare but large orders. (This supports our ANOVA finding that sales differ significantly between countries.)
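To back the boxplot reading with numbers, countries can be ranked by their median sale; a sketch using the already-loaded dplyr:
# Median and total sales per country, highest median first
retail_data_filtered %>%
  group_by(Country) %>%
  summarise(MedianSale = median(Sales), TotalSales = sum(Sales)) %>%
  arrange(desc(MedianSale)) %>%
  head(5)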
ggplot(retail_data_filtered, aes(x = Quantity, y = Sales)) +
geom_point(alpha = 0.5, color = "darkblue") +
labs(title = "Quantity vs Sales", x = "Quantity", y = "Sales")
As quantity increases, sales increase; however, most transactions involve small quantities. (Scatter plots are used to show relationships between two numeric variables.)
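The strength of that relationship can be summarized with a correlation coefficient (high by construction here, since Sales = Quantity × UnitPrice):
# Pearson correlation between Quantity and Sales in the filtered data
cor(retail_data_filtered$Quantity, retail_data_filtered$Sales)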
top_products <- retail_data %>%
group_by(Description) %>%
summarise(TotalSold = sum(Quantity, na.rm = TRUE)) %>%
arrange(desc(TotalSold)) %>%
head(10)
ggplot(top_products, aes(x = reorder(Description, TotalSold), y = TotalSold)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Selling Products", x = "Product", y = "Quantity Sold")
We grouped the data by product description with group_by(), summed the quantity sold per product, and visualized the top 10 products with a horizontal bar chart.
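Ranking by revenue instead of units sold is a small variation on the same pipeline; a sketch (Sales was computed earlier as Quantity * UnitPrice):
# Top 10 products by revenue rather than quantity sold
top_revenue <- retail_data %>%
  group_by(Description) %>%
  summarise(Revenue = sum(Sales, na.rm = TRUE)) %>%
  arrange(desc(Revenue)) %>%
  head(10)
top_revenue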
retail_data$InvoiceDate <- as.Date(retail_data$InvoiceDate)
daily_sales <- retail_data %>%
mutate(Sales = Quantity * UnitPrice) %>%
group_by(InvoiceDate) %>%
summarise(DailySales = sum(Sales, na.rm = TRUE))
ggplot(daily_sales, aes(x = InvoiceDate, y = DailySales)) +
geom_line(color = "darkgreen") +
labs(title = "Daily Sales Over Time", x = "Date", y = "Sales")
Converting the invoice timestamp to a Date lets us compute daily sales totals. The line chart shows how sales fluctuate over time, with a clear rise from September through November.
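The September-to-November rise is easier to see when sales are aggregated by month; a sketch using base R's format() on the converted dates:
# Aggregate sales by calendar month to highlight the seasonal peak
monthly_sales <- retail_data %>%
  mutate(Month = format(InvoiceDate, "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(MonthlySales = sum(Quantity * UnitPrice))
ggplot(monthly_sales, aes(x = Month, y = MonthlySales, group = 1)) +
  geom_line(color = "darkgreen") +
  labs(title = "Monthly Sales", x = "Month", y = "Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))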
We used linear regression to model sales and ANOVA to compare performance across countries.
retail_data$Sales <- retail_data$Quantity * retail_data$UnitPrice
model <- lm(Sales ~ Quantity + UnitPrice, data = retail_data)
summary(model)
##
## Call:
## lm(formula = Sales ~ Quantity + UnitPrice, data = retail_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39822 -2 0 5 40773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.820780 0.196287 -9.276 <2e-16 ***
## Quantity 1.576584 0.001081 1458.439 <2e-16 ***
## UnitPrice 1.200302 0.008773 136.822 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 122.3 on 397881 degrees of freedom
## Multiple R-squared: 0.8435, Adjusted R-squared: 0.8435
## F-statistic: 1.072e+06 on 2 and 397881 DF, p-value: < 2.2e-16
A linear regression model predicts an outcome from explanatory variables; here we predict Sales from Quantity and UnitPrice and check how each variable affects sales. Conclusion: both coefficients are positive, so sales increase as quantity increases and, for a given quantity, as unit price increases. The model explains about 84% of the variance in sales (adjusted R² = 0.8435).
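The fitted model can also be used to score new transactions; a minimal sketch with illustrative (made-up) order values:
# Predicted sales for two hypothetical orders at the same unit price
new_orders <- data.frame(Quantity = c(5, 50), UnitPrice = c(2.50, 2.50))
predict(model, newdata = new_orders)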
## Analysis of Variance (ANOVA)
anova_model <- aov(Sales ~ Country, data = retail_data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 36 4.288e+07 1191216 12.48 <2e-16 ***
## Residuals 397847 3.796e+10 95426
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
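The very small p-value (< 2e-16) indicates that mean sales differ significantly across countries. To see which country pairs drive the difference, a post-hoc Tukey HSD sketch (the full output has hundreds of pairwise comparisons, so only the first rows are shown):
# Pairwise comparisons of country means from the fitted aov model
tukey <- TukeyHSD(anova_model)
head(tukey$Country)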
## Supervised Machine Learning: K-Nearest Neighbors (KNN)
We used KNN to classify sales levels and evaluate accuracy.
# Create binary classification target
retail_data$SalesLevel <- ifelse(retail_data$Sales > median(retail_data$Sales), "High", "Low")
retail_data$SalesLevel <- as.factor(retail_data$SalesLevel)
# Select numeric predictors and remove duplicates
knn_data <- retail_data[, c("Quantity", "UnitPrice", "SalesLevel")]
knn_data <- distinct(knn_data) # Remove duplicate rows
# Normalize predictors
preproc <- preProcess(knn_data[, 1:2], method = c("center", "scale"))
knn_scaled <- predict(preproc, knn_data[, 1:2])
# Add target variable back
knn_scaled$SalesLevel <- knn_data$SalesLevel
#Split data
set.seed(123)
trainIndex <- createDataPartition(knn_scaled$SalesLevel, p = 0.8, list = FALSE)
train_knn <- knn_scaled[trainIndex, 1:2]
test_knn <- knn_scaled[-trainIndex, 1:2]
train_labels <- knn_scaled$SalesLevel[trainIndex]
test_labels <- knn_scaled$SalesLevel[-trainIndex]
#Apply KNN with odd k to avoid ties
knn_pred <- knn(train = train_knn, test = test_knn, cl = train_labels, k = 3)
# Evaluate model
confusionMatrix(knn_pred, test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 669 6
## Low 7 138
##
## Accuracy : 0.9841
## 95% CI : (0.973, 0.9915)
## No Information Rate : 0.8244
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9454
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9896
## Specificity : 0.9583
## Pos Pred Value : 0.9911
## Neg Pred Value : 0.9517
## Prevalence : 0.8244
## Detection Rate : 0.8159
## Detection Prevalence : 0.8232
## Balanced Accuracy : 0.9740
##
## 'Positive' Class : High
##
Using K-Nearest Neighbors (KNN) classification, we predicted whether a transaction was 'High' or 'Low' sales from Quantity and UnitPrice, splitting the data into training (80%) and test (20%) sets with createDataPartition(). The model achieved an accuracy of 98.41%, with a sensitivity of 98.96% and specificity of 95.83%. These results indicate that KNN is highly effective for segmenting transactions and flagging high-value orders, which can support targeted marketing and inventory decisions. (Because SalesLevel is derived from Sales = Quantity × UnitPrice, the predictors largely determine the label, which partly explains the very high accuracy.)
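The choice of k = 3 was fixed rather than tuned; a hedged sketch of cross-validated tuning with the already-loaded caret package:
# 5-fold cross-validation over a small grid of odd k values (may take a moment)
set.seed(123)
knn_fit <- train(SalesLevel ~ Quantity + UnitPrice,
                 data = knn_scaled[trainIndex, ],
                 method = "knn",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = data.frame(k = c(3, 5, 7, 9)))
knn_fit$bestTune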