library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(class)
## Warning: package 'class' was built under R version 4.4.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
data <- read.csv("C:/Users/ANAND KUMAR/Downloads/SuperMarket Analysis.csv", header = TRUE)
head(data)
## Invoice.ID Branch City Customer.type Gender Product.line
## 1 750-67-8428 Alex Yangon Member Female Health and beauty
## 2 226-31-3081 Giza Naypyitaw Normal Female Electronic accessories
## 3 631-41-3108 Alex Yangon Normal Female Home and lifestyle
## 4 123-19-1176 Alex Yangon Member Female Health and beauty
## 5 373-73-7910 Alex Yangon Member Female Sports and travel
## 6 699-14-3026 Giza Naypyitaw Member Female Electronic accessories
## Unit.price Quantity Tax.5. Sales Date Time Payment cogs
## 1 74.69 7 26.1415 548.9715 1/5/2019 1:08:00 PM Ewallet 522.83
## 2 15.28 5 3.8200 80.2200 3/8/2019 10:29:00 AM Cash 76.40
## 3 46.33 7 16.2155 340.5255 3/3/2019 1:23:00 PM Credit card 324.31
## 4 58.22 8 23.2880 489.0480 1/27/2019 8:33:00 PM Ewallet 465.76
## 5 86.31 7 30.2085 634.3785 2/8/2019 10:37:00 AM Ewallet 604.17
## 6 85.39 7 29.8865 627.6165 3/25/2019 6:30:00 PM Ewallet 597.73
## gross.margin.percentage gross.income Rating
## 1 4.761905 26.1415 9.1
## 2 4.761905 3.8200 9.6
## 3 4.761905 16.2155 7.4
## 4 4.761905 23.2880 8.4
## 5 4.761905 30.2085 5.3
## 6 4.761905 29.8865 4.1
str(data)
## 'data.frame': 1000 obs. of 17 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "Alex" "Giza" "Alex" "Alex" ...
## $ City : chr "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
## $ Customer.type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ Product.line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax.5. : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Sales : num 549 80.2 340.5 489 634.4 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "1:08:00 PM" "10:29:00 AM" "1:23:00 PM" "8:33:00 PM" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
summary(data)
## Invoice.ID Branch City Customer.type
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Gender Product.line Unit.price Quantity
## Length:1000 Length:1000 Min. :10.08 Min. : 1.00
## Class :character Class :character 1st Qu.:32.88 1st Qu.: 3.00
## Mode :character Mode :character Median :55.23 Median : 5.00
## Mean :55.67 Mean : 5.51
## 3rd Qu.:77.94 3rd Qu.: 8.00
## Max. :99.96 Max. :10.00
## Tax.5. Sales Date Time
## Min. : 0.5085 Min. : 10.68 Length:1000 Length:1000
## 1st Qu.: 5.9249 1st Qu.: 124.42 Class :character Class :character
## Median :12.0880 Median : 253.85 Mode :character Mode :character
## Mean :15.3794 Mean : 322.97
## 3rd Qu.:22.4453 3rd Qu.: 471.35
## Max. :49.6500 Max. :1042.65
## Payment cogs gross.margin.percentage gross.income
## Length:1000 Min. : 10.17 Min. :4.762 Min. : 0.5085
## Class :character 1st Qu.:118.50 1st Qu.:4.762 1st Qu.: 5.9249
## Mode :character Median :241.76 Median :4.762 Median :12.0880
## Mean :307.59 Mean :4.762 Mean :15.3794
## 3rd Qu.:448.90 3rd Qu.:4.762 3rd Qu.:22.4453
## Max. :993.00 Max. :4.762 Max. :49.6500
## Rating
## Min. : 4.000
## 1st Qu.: 5.500
## Median : 7.000
## Mean : 6.973
## 3rd Qu.: 8.500
## Max. :10.000
Interpretation: 1.The first few rows show information about supermarket transactions such as Invoice ID, Branch, City, Customer type, Gender, Product line, Prices, Payment, and Ratings.
2.he dataset has 1000 observations and 17 variables, including text fields (like Gender, Payment) and numeric fields (like Unit Price, Quantity, Sales, Rating).
3.For numeric columns (like Unit price, Sales, Tax, Rating), summary gives minimum, maximum, median, and average values, helping understand distribution.
mean(data$Unit.price)
## [1] 55.67213
median(data$Unit.price)
## [1] 55.23
sd(data$Unit.price)
## [1] 26.49463
Interpretation::Mean: The average unit price customers pay.
Median: The middle price; half the items cost less, half cost more.
SD: Shows how much unit prices vary from the average.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(data$Payment)
## [1] "Ewallet"
summary(select(data, Unit.price, Quantity, Tax.5., Sales, Rating))
## Unit.price Quantity Tax.5. Sales
## Min. :10.08 Min. : 1.00 Min. : 0.5085 Min. : 10.68
## 1st Qu.:32.88 1st Qu.: 3.00 1st Qu.: 5.9249 1st Qu.: 124.42
## Median :55.23 Median : 5.00 Median :12.0880 Median : 253.85
## Mean :55.67 Mean : 5.51 Mean :15.3794 Mean : 322.97
## 3rd Qu.:77.94 3rd Qu.: 8.00 3rd Qu.:22.4453 3rd Qu.: 471.35
## Max. :99.96 Max. :10.00 Max. :49.6500 Max. :1042.65
## Rating
## Min. : 4.000
## 1st Qu.: 5.500
## Median : 7.000
## Mean : 6.973
## 3rd Qu.: 8.500
## Max. :10.000
Interpretation:Mode: The most common payment method used by customers. Unit.price: Shows average product price and its range.
Quantity: Shows how many items customers usually buy per transaction.
Tax.5.: Shows how much tax customers typically pay.
Sales: Shows average total bill amount and the highest/lowest sales.
Rating: Shows customers’ satisfaction scores (from 4 to 10).
colSums(is.na(data))
## Invoice.ID Branch City
## 0 0 0
## Customer.type Gender Product.line
## 0 0 0
## Unit.price Quantity Tax.5.
## 0 0 0
## Sales Date Time
## 0 0 0
## Payment cogs gross.margin.percentage
## 0 0 0
## gross.income Rating
## 0 0
data <- na.omit(data)
Interpretatio: 1.colSums(is.na(data)) checks how many missing values each column has.
2.na.omit(data) removes any rows that contain missing values.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
data$Unit.price_norm <- normalize(data$Unit.price)
data$Quantity_norm <- normalize(data$Quantity)
data$Sales_norm <- normalize(data$Sales)
Interpretation :1.The code normalizes Unit Price, Quantity, and Sales.
2.Normalization converts values to a scale between **0 and
model1 <- lm(Sales ~ Quantity, data = data)
summary(model1)
##
## Call:
## lm(formula = Sales ~ Quantity, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -474.32 -99.41 -0.41 100.74 453.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.993 11.768 -0.339 0.734
## Quantity 59.339 1.887 31.449 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 174.3 on 998 degrees of freedom
## Multiple R-squared: 0.4977, Adjusted R-squared: 0.4972
## F-statistic: 989 on 1 and 998 DF, p-value: < 2.2e-16
Interpretation :1.The model shows that Quantity strongly predicts Sales.
2.For every 1 extra item purchased, Sales increase by about 59 units.
3.The relationship is highly significant (p-value < 0.001).
4.The model explains about 50% of the variation in Sales (R² ≈ 0.50).
ggplot(data, aes(x = Quantity, y = Sales)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("Linear Regression: Sales vs Quantity")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation :1.The scatter plot shows that Sales increase as Quantity
increases.
2.The red line (linear regression) confirms a strong positive relationship between Quantity and Sales.
3.More items purchased → higher total sales
anova_model <- aov(Sales ~ Product.line, data = data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Product.line 5 102506 20501 0.338 0.89
## Residuals 994 60296633 60661
Interpretation : 1.The ANOVA test shows no significant difference in Sales across different Product Lines.
2.This means Product Line does NOT have a strong effect on Sales (p-value is high)
ggplot(data, aes(x = Product.line, y = Sales, fill = Product.line)) +
geom_boxplot() +
ggtitle("ANOVA: Sales across Product Lines") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation :1.The boxplot compares Sales across all Product
Lines.
2.The sales ranges look similar, which supports the ANOVA result that Product Line does not significantly affect Sales
cluster_data <- data %>% select(Unit.price, Quantity)
set.seed(123)
kmeans_result <- kmeans(cluster_data, centers = 3)
data$Cluster <- as.factor(kmeans_result$cluster)
fviz_cluster(kmeans_result, data = cluster_data)
Interpretation : 1.The algorithm grouped customers into 3 clusters based
on Unit Price and Quantity.
2.Each cluster represents a different buying pattern (e.g., low quantity–low price, high quantity–medium price, etc.).
3.The cluster plot visually shows how customers are separated into clear groups
median_sales <- median(data$Sales)
data$SalesClass <- ifelse(data$Sales > median_sales, "High", "Low")
set.seed(100)
index <- createDataPartition(data$SalesClass, p = 0.7, list = FALSE)
train <- data[index, ]
test <- data[-index, ]
train_x <- scale(train[, c("Unit.price", "Quantity", "Rating")])
test_x <- scale(test[, c("Unit.price", "Quantity", "Rating")])
train_y <- train$SalesClass
test_y <- test$SalesClass
knn_pred <- knn(train_x, test_x, cl = train_y, k = 5)
confusionMatrix(table(Predicted = knn_pred, Actual = test_y))
## Confusion Matrix and Statistics
##
## Actual
## Predicted High Low
## High 142 10
## Low 8 140
##
## Accuracy : 0.94
## 95% CI : (0.9068, 0.9641)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.88
##
## Mcnemar's Test P-Value : 0.8137
##
## Sensitivity : 0.9467
## Specificity : 0.9333
## Pos Pred Value : 0.9342
## Neg Pred Value : 0.9459
## Prevalence : 0.5000
## Detection Rate : 0.4733
## Detection Prevalence : 0.5067
## Balanced Accuracy : 0.9400
##
## 'Positive' Class : High
##
Interpretation : 1.The model predicts whether Sales are High or Low using Unit Price, Quantity, and Rating.
2.The confusion matrix shows the model is highly accurate (~94%).
3.This means KNN is very good at correctly classifying High vs Low sales based on customer purchase behavior
ggplot(data, aes(x = Rating)) +
geom_histogram(fill = "lightblue", color = "black", bins = 15) +
ggtitle("Distribution of Customer Ratings")
Interpretation : 1.Customer Ratings mostly fall between 5 and 9, showing
that customers generally give moderate to high satisfaction.
2.Very low or very high ratings are less common.
ggplot(data, aes(x = Product.line, y = Sales, fill = Product.line)) +
geom_bar(stat = "summary", fun = "mean") +
ggtitle("Average Sales by Product Line") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation: 1.The chart compares the average sales of each product
line.
2.Some product lines show higher average sales, meaning they perform better and generate more revenue.
3.Other product lines have lower average sales, indicating less revenue contribution
ggplot(data, aes(x = Unit.price, y = Sales, color = Gender)) +
geom_point() +
ggtitle("Unit Price vs Sales by Gender")
Interpretation : 1.The scatter plot shows how Unit Price and Sales are
related.
2.Each point represents a customer purchase, and the color shows Gender.
3.Higher unit prices generally lead to higher total sales, since expensive items increase bill value.
4.Both genders follow a similar pattern, meaning gender does not strongly change the relationship.
payment_count <- data %>% count(Payment)
ggplot(payment_count, aes(x = "", y = n, fill = Payment)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) +
ggtitle("Payment Method Distribution")
Interpretation : 1This pie chart shows how often each payment method is
used.
2.Some payment types (like Ewallet / Cash) are used more frequently, while others (like Credit Card) are used less often.
3.It helps identify the most popular payment method among customers.