load libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(class)
## Warning: package 'class' was built under R version 4.4.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

load dataset

data <- read.csv("C:/Users/ANAND KUMAR/Downloads/SuperMarket Analysis.csv", header = TRUE)
head(data)
##    Invoice.ID Branch      City Customer.type Gender           Product.line
## 1 750-67-8428   Alex    Yangon        Member Female      Health and beauty
## 2 226-31-3081   Giza Naypyitaw        Normal Female Electronic accessories
## 3 631-41-3108   Alex    Yangon        Normal Female     Home and lifestyle
## 4 123-19-1176   Alex    Yangon        Member Female      Health and beauty
## 5 373-73-7910   Alex    Yangon        Member Female      Sports and travel
## 6 699-14-3026   Giza Naypyitaw        Member Female Electronic accessories
##   Unit.price Quantity  Tax.5.    Sales      Date        Time     Payment   cogs
## 1      74.69        7 26.1415 548.9715  1/5/2019  1:08:00 PM     Ewallet 522.83
## 2      15.28        5  3.8200  80.2200  3/8/2019 10:29:00 AM        Cash  76.40
## 3      46.33        7 16.2155 340.5255  3/3/2019  1:23:00 PM Credit card 324.31
## 4      58.22        8 23.2880 489.0480 1/27/2019  8:33:00 PM     Ewallet 465.76
## 5      86.31        7 30.2085 634.3785  2/8/2019 10:37:00 AM     Ewallet 604.17
## 6      85.39        7 29.8865 627.6165 3/25/2019  6:30:00 PM     Ewallet 597.73
##   gross.margin.percentage gross.income Rating
## 1                4.761905      26.1415    9.1
## 2                4.761905       3.8200    9.6
## 3                4.761905      16.2155    7.4
## 4                4.761905      23.2880    8.4
## 5                4.761905      30.2085    5.3
## 6                4.761905      29.8865    4.1
str(data)
## 'data.frame':    1000 obs. of  17 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "Alex" "Giza" "Alex" "Alex" ...
##  $ City                   : chr  "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
##  $ Customer.type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Female" "Female" ...
##  $ Product.line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax.5.                 : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Sales                  : num  549 80.2 340.5 489 634.4 ...
##  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
##  $ Time                   : chr  "1:08:00 PM" "10:29:00 AM" "1:23:00 PM" "8:33:00 PM" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
summary(data)
##   Invoice.ID           Branch              City           Customer.type     
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Gender          Product.line         Unit.price       Quantity    
##  Length:1000        Length:1000        Min.   :10.08   Min.   : 1.00  
##  Class :character   Class :character   1st Qu.:32.88   1st Qu.: 3.00  
##  Mode  :character   Mode  :character   Median :55.23   Median : 5.00  
##                                        Mean   :55.67   Mean   : 5.51  
##                                        3rd Qu.:77.94   3rd Qu.: 8.00  
##                                        Max.   :99.96   Max.   :10.00  
##      Tax.5.            Sales             Date               Time          
##  Min.   : 0.5085   Min.   :  10.68   Length:1000        Length:1000       
##  1st Qu.: 5.9249   1st Qu.: 124.42   Class :character   Class :character  
##  Median :12.0880   Median : 253.85   Mode  :character   Mode  :character  
##  Mean   :15.3794   Mean   : 322.97                                        
##  3rd Qu.:22.4453   3rd Qu.: 471.35                                        
##  Max.   :49.6500   Max.   :1042.65                                        
##    Payment               cogs        gross.margin.percentage  gross.income    
##  Length:1000        Min.   : 10.17   Min.   :4.762           Min.   : 0.5085  
##  Class :character   1st Qu.:118.50   1st Qu.:4.762           1st Qu.: 5.9249  
##  Mode  :character   Median :241.76   Median :4.762           Median :12.0880  
##                     Mean   :307.59   Mean   :4.762           Mean   :15.3794  
##                     3rd Qu.:448.90   3rd Qu.:4.762           3rd Qu.:22.4453  
##                     Max.   :993.00   Max.   :4.762           Max.   :49.6500  
##      Rating      
##  Min.   : 4.000  
##  1st Qu.: 5.500  
##  Median : 7.000  
##  Mean   : 6.973  
##  3rd Qu.: 8.500  
##  Max.   :10.000

Interpretation: 1.The first few rows show information about supermarket transactions such as Invoice ID, Branch, City, Customer type, Gender, Product line, Prices, Payment, and Ratings.

2.he dataset has 1000 observations and 17 variables, including text fields (like Gender, Payment) and numeric fields (like Unit Price, Quantity, Sales, Rating).

3.For numeric columns (like Unit price, Sales, Tax, Rating), summary gives minimum, maximum, median, and average values, helping understand distribution.

Basic Statistical Operations

 mean(data$Unit.price)
## [1] 55.67213
median(data$Unit.price)
## [1] 55.23
sd(data$Unit.price)
## [1] 26.49463

Interpretation::Mean: The average unit price customers pay.

Median: The middle price; half the items cost less, half cost more.

SD: Shows how much unit prices vary from the average.

Custom function for Mode

getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(data$Payment)
## [1] "Ewallet"
summary(select(data, Unit.price, Quantity, Tax.5., Sales, Rating))
##    Unit.price       Quantity         Tax.5.            Sales        
##  Min.   :10.08   Min.   : 1.00   Min.   : 0.5085   Min.   :  10.68  
##  1st Qu.:32.88   1st Qu.: 3.00   1st Qu.: 5.9249   1st Qu.: 124.42  
##  Median :55.23   Median : 5.00   Median :12.0880   Median : 253.85  
##  Mean   :55.67   Mean   : 5.51   Mean   :15.3794   Mean   : 322.97  
##  3rd Qu.:77.94   3rd Qu.: 8.00   3rd Qu.:22.4453   3rd Qu.: 471.35  
##  Max.   :99.96   Max.   :10.00   Max.   :49.6500   Max.   :1042.65  
##      Rating      
##  Min.   : 4.000  
##  1st Qu.: 5.500  
##  Median : 7.000  
##  Mean   : 6.973  
##  3rd Qu.: 8.500  
##  Max.   :10.000

Interpretation:Mode: The most common payment method used by customers. Unit.price: Shows average product price and its range.

Quantity: Shows how many items customers usually buy per transaction.

Tax.5.: Shows how much tax customers typically pay.

Sales: Shows average total bill amount and the highest/lowest sales.

Rating: Shows customers’ satisfaction scores (from 4 to 10).

Summary statistics for main numeric columns

colSums(is.na(data))
##              Invoice.ID                  Branch                    City 
##                       0                       0                       0 
##           Customer.type                  Gender            Product.line 
##                       0                       0                       0 
##              Unit.price                Quantity                  Tax.5. 
##                       0                       0                       0 
##                   Sales                    Date                    Time 
##                       0                       0                       0 
##                 Payment                    cogs gross.margin.percentage 
##                       0                       0                       0 
##            gross.income                  Rating 
##                       0                       0
data <- na.omit(data)

Interpretatio: 1.colSums(is.na(data)) checks how many missing values each column has.

2.na.omit(data) removes any rows that contain missing values.

Min–Max Normalization for numeric columns

normalize <- function(x) (x - min(x)) / (max(x) - min(x))
data$Unit.price_norm <- normalize(data$Unit.price)
data$Quantity_norm <- normalize(data$Quantity)
data$Sales_norm <- normalize(data$Sales)

Interpretation :1.The code normalizes Unit Price, Quantity, and Sales.

2.Normalization converts values to a scale between **0 and

Predict Sales using Quantity

model1 <- lm(Sales ~ Quantity, data = data)
summary(model1)
## 
## Call:
## lm(formula = Sales ~ Quantity, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -474.32  -99.41   -0.41  100.74  453.25 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.993     11.768  -0.339    0.734    
## Quantity      59.339      1.887  31.449   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 174.3 on 998 degrees of freedom
## Multiple R-squared:  0.4977, Adjusted R-squared:  0.4972 
## F-statistic:   989 on 1 and 998 DF,  p-value: < 2.2e-16

Interpretation :1.The model shows that Quantity strongly predicts Sales.

2.For every 1 extra item purchased, Sales increase by about 59 units.

3.The relationship is highly significant (p-value < 0.001).

4.The model explains about 50% of the variation in Sales (R² ≈ 0.50).

Visualization

ggplot(data, aes(x = Quantity, y = Sales)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("Linear Regression: Sales vs Quantity")
## `geom_smooth()` using formula = 'y ~ x'

Interpretation :1.The scatter plot shows that Sales increase as Quantity increases.

2.The red line (linear regression) confirms a strong positive relationship between Quantity and Sales.

3.More items purchased → higher total sales

ANOVA Mode

anova_model <- aov(Sales ~ Product.line, data = data)
summary(anova_model)
##               Df   Sum Sq Mean Sq F value Pr(>F)
## Product.line   5   102506   20501   0.338   0.89
## Residuals    994 60296633   60661

Interpretation : 1.The ANOVA test shows no significant difference in Sales across different Product Lines.

2.This means Product Line does NOT have a strong effect on Sales (p-value is high)

Boxplot visualization

ggplot(data, aes(x = Product.line, y = Sales, fill = Product.line)) +
geom_boxplot() +
ggtitle("ANOVA: Sales across Product Lines") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation :1.The boxplot compares Sales across all Product Lines.

2.The sales ranges look similar, which supports the ANOVA result that Product Line does not significantly affect Sales

K-Means Clustering

cluster_data <- data %>% select(Unit.price, Quantity)
set.seed(123)
kmeans_result <- kmeans(cluster_data, centers = 3)
data$Cluster <- as.factor(kmeans_result$cluster)

fviz_cluster(kmeans_result, data = cluster_data)

Interpretation : 1.The algorithm grouped customers into 3 clusters based on Unit Price and Quantity.

2.Each cluster represents a different buying pattern (e.g., low quantity–low price, high quantity–medium price, etc.).

3.The cluster plot visually shows how customers are separated into clear groups

KNN Classification

median_sales <- median(data$Sales)
data$SalesClass <- ifelse(data$Sales > median_sales, "High", "Low")

set.seed(100)
index <- createDataPartition(data$SalesClass, p = 0.7, list = FALSE)
train <- data[index, ]
test <- data[-index, ]

train_x <- scale(train[, c("Unit.price", "Quantity", "Rating")])
test_x  <- scale(test[, c("Unit.price", "Quantity", "Rating")])

train_y <- train$SalesClass
test_y  <- test$SalesClass

knn_pred <- knn(train_x, test_x, cl = train_y, k = 5)
confusionMatrix(table(Predicted = knn_pred, Actual = test_y))
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted High Low
##      High  142  10
##      Low     8 140
##                                           
##                Accuracy : 0.94            
##                  95% CI : (0.9068, 0.9641)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.88            
##                                           
##  Mcnemar's Test P-Value : 0.8137          
##                                           
##             Sensitivity : 0.9467          
##             Specificity : 0.9333          
##          Pos Pred Value : 0.9342          
##          Neg Pred Value : 0.9459          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4733          
##    Detection Prevalence : 0.5067          
##       Balanced Accuracy : 0.9400          
##                                           
##        'Positive' Class : High            
## 

Interpretation : 1.The model predicts whether Sales are High or Low using Unit Price, Quantity, and Rating.

2.The confusion matrix shows the model is highly accurate (~94%).

3.This means KNN is very good at correctly classifying High vs Low sales based on customer purchase behavior

Histogram of Ratings

ggplot(data, aes(x = Rating)) +
geom_histogram(fill = "lightblue", color = "black", bins = 15) +
ggtitle("Distribution of Customer Ratings")

Interpretation : 1.Customer Ratings mostly fall between 5 and 9, showing that customers generally give moderate to high satisfaction.

2.Very low or very high ratings are less common.

Average Sales by Product Line

ggplot(data, aes(x = Product.line, y = Sales, fill = Product.line)) +
geom_bar(stat = "summary", fun = "mean") +
ggtitle("Average Sales by Product Line") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation: 1.The chart compares the average sales of each product line.

2.Some product lines show higher average sales, meaning they perform better and generate more revenue.

3.Other product lines have lower average sales, indicating less revenue contribution

Scatter Plot

ggplot(data, aes(x = Unit.price, y = Sales, color = Gender)) +
geom_point() +
ggtitle("Unit Price vs Sales by Gender")

Interpretation : 1.The scatter plot shows how Unit Price and Sales are related.

2.Each point represents a customer purchase, and the color shows Gender.

3.Higher unit prices generally lead to higher total sales, since expensive items increase bill value.

4.Both genders follow a similar pattern, meaning gender does not strongly change the relationship.

Pie Chart of Payment Methods

payment_count <- data %>% count(Payment)
ggplot(payment_count, aes(x = "", y = n, fill = Payment)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) +
ggtitle("Payment Method Distribution")

Interpretation : 1This pie chart shows how often each payment method is used.

2.Some payment types (like Ewallet / Cash) are used more frequently, while others (like Credit Card) are used less often.

3.It helps identify the most popular payment method among customers.