ABSTRACT

This project analyzes the Superstore dataset (Kaggle) to discover sales trends, product and customer patterns, and predictive relationships. The analysis answers a set of research questions using descriptive statistics, correlation (Pearson/Spearman/Kendall), hypothesis testing (t-test, ANOVA), regression (linear, multiple, LASSO, random forest), clustering (K-Means + Hierarchical), association rules (Apriori), and K-Nearest Neighbors (KNN). Every section contains explicit questions, code that answers them, and interpretation.

INTRODUCTION

In the rapidly evolving retail landscape, data-driven decision-making has become essential for sustaining competitiveness and profitability. This project, titled “Sales Performance Analysis: An R-based Exploration of Trends and Patterns in Retail Data”, aims to uncover meaningful insights from retail transaction data using the powerful analytical capabilities of R programming.

The dataset used for this study is the Superstore Sales Dataset, sourced from Kaggle. It contains detailed information on orders, sales, profits, customer demographics, product categories, and regional performance. Through this project, we explore key patterns, correlations, and trends that drive business performance and customer behavior.

The primary objectives of this analysis include:

  • Identifying the most profitable products and categories.

  • Understanding seasonal variations and customer purchasing trends.

  • Evaluating relationships between sales, profit, and discount rates.

  • Applying statistical tests (t-test and ANOVA) to assess significant differences between regions or categories.

  • Employing predictive models such as regression and k-Nearest Neighbors (kNN) to forecast sales and classify performance.

  • Performing unsupervised learning (clustering) to group similar customers or products.

  • Utilizing association rule mining (Apriori algorithm) to identify frequent product combinations and cross-selling opportunities.

By integrating statistical analysis, machine learning, and data visualization, this project provides a holistic view of retail performance — enabling data-backed recommendations for improving operational efficiency, enhancing marketing strategies, and maximizing profit.

PROJECT WORKFLOW

This project follows a structured, research-oriented workflow as shown below:

1- Data Import and Cleaning

  • Import the dataset and handle missing values or inconsistencies.

  • Convert columns to appropriate data types (e.g., Date, Factor).

2- Exploratory Data Analysis (EDA)

  • Summarize key statistics for sales, profits, and discounts.

  • Visualize distributions and outliers.

3- Trend Analysis

  • Examine time-based trends in sales and profits.

  • Identify seasonal peaks and variations.

4- Category and Regional Analysis

  • Compare performance across product categories, sub-categories, and regions.

5- Statistical Testing

  • Conduct t-tests and ANOVA to identify significant group differences.

6- Correlation and Regression Analysis

  • Explore linear, multiple, and nonlinear relationships among variables.

  • Build regression models to predict sales.

7- Clustering Analysis

  • Segment customers or products using k-Means clustering.

8- Classification using kNN

  • Apply k-Nearest Neighbors to classify sales or customer performance levels.

9- Association Rule Mining (Apriori)

  • Discover frequent itemsets and association rules for market basket analysis.

10- Conclusion and Recommendations

  • Summarize insights and propose data-driven strategies for retail growth.

RESEARCH & ANALYSIS QUESTIONS

In the R code, throughout the markdown will aim to answer these guiding questions:

  1. Data Understanding & Cleaning

Q1.1. What are the main attributes of the dataset?

Q1.2. Are there missing or inconsistent values, and how should they be handled?

Q1.3. How is sales distributed across regions, categories, and time?

  1. Exploratory Data Analysis (EDA)

Q2.1. What are the overall sales and profit trends?

Q2.2. Which categories or sub-categories contribute most to total revenue and profit?

Q2.3. How do discounts affect sales and profit margins?

  1. Statistical Testing (t-Test and ANOVA)

Q3.1. Is there a significant difference in average sales between two regions? (t-test)

Q3.2. Do sales significantly differ across multiple product categories or regions? (ANOVA)

  1. Correlation and Regression

Q4.1. What kind of correlation exists between Sales, Profit, Quantity, and Discount (Pearson, Spearman, Kendall)?

Q4.2. Can we build a regression model to predict sales based on other factors such as discount or quantity?

Q4.3. How well does the model fit the data (R², residuals, etc.)?

  1. Clustering

Q5.1. Can we group customers or products based on their purchasing behavior?

Q5.2. What patterns or insights emerge from customer segmentation?

  1. Classification (kNN)

Q6.1. Can we classify whether a sales transaction is “high-performing” or “low-performing” based on key metrics?

Q6.2. How accurate is the kNN model?

  1. Association Rule Mining (Apriori)

Q7.1. Which products are frequently bought together?

Q7.2. What are the strongest association rules (support, confidence, lift)?

  1. Overall Insights

Q8.1. What are the key drivers of profit in the retail business?

Q8.2. Which customer segments or product lines should the company prioritize?

Q8.3. What strategies can improve profitability and customer satisfaction?

Analysis performance

  1. Data Understanding & Cleaning

Q1.1. What are the main attributes of the dataset?

pkgs <- c(
  "tidyverse","lubridate","scales","knitr","kableExtra","ggplot2","plotly",
  "corrplot","GGally","caret","factoextra","cluster","arules","arulesViz",
  "broom","glmnet","randomForest","pROC","gridExtra","cowplot","class","e1071","car"
)
installed <- pkgs %in% rownames(installed.packages())
if(any(!installed)) install.packages(pkgs[!installed], dependencies=TRUE)
lapply(pkgs, library, character.only = TRUE)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'scales' was built under R version 4.4.3
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## Warning: package 'kableExtra' was built under R version 4.4.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
## Warning: package 'plotly' was built under R version 4.4.3
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
## Warning: package 'GGally' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## Warning: package 'factoextra' was built under R version 4.4.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Warning: package 'arules' was built under R version 4.4.3
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## 
## Attaching package: 'arules'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Warning: package 'arulesViz' was built under R version 4.4.3
## Warning: package 'glmnet' was built under R version 4.4.3
## Loaded glmnet 4.1-10
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## Warning: package 'gridExtra' was built under R version 4.4.3
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:randomForest':
## 
##     combine
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## Warning: package 'cowplot' was built under R version 4.4.3
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
## Warning: package 'e1071' was built under R version 4.4.3
## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:ggplot2':
## 
##     element
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:arules':
## 
##     recode
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "scales"    "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "knitr"     "scales"    "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"     
## 
## [[5]]
##  [1] "kableExtra" "knitr"      "scales"     "lubridate"  "forcats"   
##  [6] "stringr"    "dplyr"      "purrr"      "readr"      "tidyr"     
## [11] "tibble"     "ggplot2"    "tidyverse"  "stats"      "graphics"  
## [16] "grDevices"  "utils"      "datasets"   "methods"    "base"      
## 
## [[6]]
##  [1] "kableExtra" "knitr"      "scales"     "lubridate"  "forcats"   
##  [6] "stringr"    "dplyr"      "purrr"      "readr"      "tidyr"     
## [11] "tibble"     "ggplot2"    "tidyverse"  "stats"      "graphics"  
## [16] "grDevices"  "utils"      "datasets"   "methods"    "base"      
## 
## [[7]]
##  [1] "plotly"     "kableExtra" "knitr"      "scales"     "lubridate" 
##  [6] "forcats"    "stringr"    "dplyr"      "purrr"      "readr"     
## [11] "tidyr"      "tibble"     "ggplot2"    "tidyverse"  "stats"     
## [16] "graphics"   "grDevices"  "utils"      "datasets"   "methods"   
## [21] "base"      
## 
## [[8]]
##  [1] "corrplot"   "plotly"     "kableExtra" "knitr"      "scales"    
##  [6] "lubridate"  "forcats"    "stringr"    "dplyr"      "purrr"     
## [11] "readr"      "tidyr"      "tibble"     "ggplot2"    "tidyverse" 
## [16] "stats"      "graphics"   "grDevices"  "utils"      "datasets"  
## [21] "methods"    "base"      
## 
## [[9]]
##  [1] "GGally"     "corrplot"   "plotly"     "kableExtra" "knitr"     
##  [6] "scales"     "lubridate"  "forcats"    "stringr"    "dplyr"     
## [11] "purrr"      "readr"      "tidyr"      "tibble"     "ggplot2"   
## [16] "tidyverse"  "stats"      "graphics"   "grDevices"  "utils"     
## [21] "datasets"   "methods"    "base"      
## 
## [[10]]
##  [1] "caret"      "lattice"    "GGally"     "corrplot"   "plotly"    
##  [6] "kableExtra" "knitr"      "scales"     "lubridate"  "forcats"   
## [11] "stringr"    "dplyr"      "purrr"      "readr"      "tidyr"     
## [16] "tibble"     "ggplot2"    "tidyverse"  "stats"      "graphics"  
## [21] "grDevices"  "utils"      "datasets"   "methods"    "base"      
## 
## [[11]]
##  [1] "factoextra" "caret"      "lattice"    "GGally"     "corrplot"  
##  [6] "plotly"     "kableExtra" "knitr"      "scales"     "lubridate" 
## [11] "forcats"    "stringr"    "dplyr"      "purrr"      "readr"     
## [16] "tidyr"      "tibble"     "ggplot2"    "tidyverse"  "stats"     
## [21] "graphics"   "grDevices"  "utils"      "datasets"   "methods"   
## [26] "base"      
## 
## [[12]]
##  [1] "cluster"    "factoextra" "caret"      "lattice"    "GGally"    
##  [6] "corrplot"   "plotly"     "kableExtra" "knitr"      "scales"    
## [11] "lubridate"  "forcats"    "stringr"    "dplyr"      "purrr"     
## [16] "readr"      "tidyr"      "tibble"     "ggplot2"    "tidyverse" 
## [21] "stats"      "graphics"   "grDevices"  "utils"      "datasets"  
## [26] "methods"    "base"      
## 
## [[13]]
##  [1] "arules"     "Matrix"     "cluster"    "factoextra" "caret"     
##  [6] "lattice"    "GGally"     "corrplot"   "plotly"     "kableExtra"
## [11] "knitr"      "scales"     "lubridate"  "forcats"    "stringr"   
## [16] "dplyr"      "purrr"      "readr"      "tidyr"      "tibble"    
## [21] "ggplot2"    "tidyverse"  "stats"      "graphics"   "grDevices" 
## [26] "utils"      "datasets"   "methods"    "base"      
## 
## [[14]]
##  [1] "arulesViz"  "arules"     "Matrix"     "cluster"    "factoextra"
##  [6] "caret"      "lattice"    "GGally"     "corrplot"   "plotly"    
## [11] "kableExtra" "knitr"      "scales"     "lubridate"  "forcats"   
## [16] "stringr"    "dplyr"      "purrr"      "readr"      "tidyr"     
## [21] "tibble"     "ggplot2"    "tidyverse"  "stats"      "graphics"  
## [26] "grDevices"  "utils"      "datasets"   "methods"    "base"      
## 
## [[15]]
##  [1] "broom"      "arulesViz"  "arules"     "Matrix"     "cluster"   
##  [6] "factoextra" "caret"      "lattice"    "GGally"     "corrplot"  
## [11] "plotly"     "kableExtra" "knitr"      "scales"     "lubridate" 
## [16] "forcats"    "stringr"    "dplyr"      "purrr"      "readr"     
## [21] "tidyr"      "tibble"     "ggplot2"    "tidyverse"  "stats"     
## [26] "graphics"   "grDevices"  "utils"      "datasets"   "methods"   
## [31] "base"      
## 
## [[16]]
##  [1] "glmnet"     "broom"      "arulesViz"  "arules"     "Matrix"    
##  [6] "cluster"    "factoextra" "caret"      "lattice"    "GGally"    
## [11] "corrplot"   "plotly"     "kableExtra" "knitr"      "scales"    
## [16] "lubridate"  "forcats"    "stringr"    "dplyr"      "purrr"     
## [21] "readr"      "tidyr"      "tibble"     "ggplot2"    "tidyverse" 
## [26] "stats"      "graphics"   "grDevices"  "utils"      "datasets"  
## [31] "methods"    "base"      
## 
## [[17]]
##  [1] "randomForest" "glmnet"       "broom"        "arulesViz"    "arules"      
##  [6] "Matrix"       "cluster"      "factoextra"   "caret"        "lattice"     
## [11] "GGally"       "corrplot"     "plotly"       "kableExtra"   "knitr"       
## [16] "scales"       "lubridate"    "forcats"      "stringr"      "dplyr"       
## [21] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [26] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [31] "datasets"     "methods"      "base"        
## 
## [[18]]
##  [1] "pROC"         "randomForest" "glmnet"       "broom"        "arulesViz"   
##  [6] "arules"       "Matrix"       "cluster"      "factoextra"   "caret"       
## [11] "lattice"      "GGally"       "corrplot"     "plotly"       "kableExtra"  
## [16] "knitr"        "scales"       "lubridate"    "forcats"      "stringr"     
## [21] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [26] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [31] "utils"        "datasets"     "methods"      "base"        
## 
## [[19]]
##  [1] "gridExtra"    "pROC"         "randomForest" "glmnet"       "broom"       
##  [6] "arulesViz"    "arules"       "Matrix"       "cluster"      "factoextra"  
## [11] "caret"        "lattice"      "GGally"       "corrplot"     "plotly"      
## [16] "kableExtra"   "knitr"        "scales"       "lubridate"    "forcats"     
## [21] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [26] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [31] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[20]]
##  [1] "cowplot"      "gridExtra"    "pROC"         "randomForest" "glmnet"      
##  [6] "broom"        "arulesViz"    "arules"       "Matrix"       "cluster"     
## [11] "factoextra"   "caret"        "lattice"      "GGally"       "corrplot"    
## [16] "plotly"       "kableExtra"   "knitr"        "scales"       "lubridate"   
## [21] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [26] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [31] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [36] "base"        
## 
## [[21]]
##  [1] "class"        "cowplot"      "gridExtra"    "pROC"         "randomForest"
##  [6] "glmnet"       "broom"        "arulesViz"    "arules"       "Matrix"      
## [11] "cluster"      "factoextra"   "caret"        "lattice"      "GGally"      
## [16] "corrplot"     "plotly"       "kableExtra"   "knitr"        "scales"      
## [21] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [26] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [31] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [36] "methods"      "base"        
## 
## [[22]]
##  [1] "e1071"        "class"        "cowplot"      "gridExtra"    "pROC"        
##  [6] "randomForest" "glmnet"       "broom"        "arulesViz"    "arules"      
## [11] "Matrix"       "cluster"      "factoextra"   "caret"        "lattice"     
## [16] "GGally"       "corrplot"     "plotly"       "kableExtra"   "knitr"       
## [21] "scales"       "lubridate"    "forcats"      "stringr"      "dplyr"       
## [26] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [31] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [36] "datasets"     "methods"      "base"        
## 
## [[23]]
##  [1] "car"          "carData"      "e1071"        "class"        "cowplot"     
##  [6] "gridExtra"    "pROC"         "randomForest" "glmnet"       "broom"       
## [11] "arulesViz"    "arules"       "Matrix"       "cluster"      "factoextra"  
## [16] "caret"        "lattice"      "GGally"       "corrplot"     "plotly"      
## [21] "kableExtra"   "knitr"        "scales"       "lubridate"    "forcats"     
## [26] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [31] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [36] "grDevices"    "utils"        "datasets"     "methods"      "base"
set.seed(123)
options(scipen=999, digits=4)
theme_set(theme_minimal())

# Import dataset
data <- read_csv("Sample - Superstore.csv")
## Rows: 9994 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer ...
## dbl  (5): Row ID, Sales, Quantity, Discount, Profit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View structure
str(data)
## spc_tbl_ [9,994 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Row ID       : num [1:9994] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Order ID     : chr [1:9994] "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order Date   : chr [1:9994] "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
##  $ Ship Date    : chr [1:9994] "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
##  $ Ship Mode    : chr [1:9994] "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer ID  : chr [1:9994] "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Customer Name: chr [1:9994] "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
##  $ Segment      : chr [1:9994] "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Country      : chr [1:9994] "United States" "United States" "United States" "United States" ...
##  $ City         : chr [1:9994] "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
##  $ State        : chr [1:9994] "Kentucky" "Kentucky" "California" "Florida" ...
##  $ Postal Code  : chr [1:9994] "42420" "42420" "90036" "33311" ...
##  $ Region       : chr [1:9994] "South" "South" "West" "South" ...
##  $ Product ID   : chr [1:9994] "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category     : chr [1:9994] "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub-Category : chr [1:9994] "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product Name : chr [1:9994] "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales        : num [1:9994] 262 731.9 14.6 957.6 22.4 ...
##  $ Quantity     : num [1:9994] 2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount     : num [1:9994] 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit       : num [1:9994] 41.91 219.58 6.87 -383.03 2.52 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Row ID` = col_double(),
##   ..   `Order ID` = col_character(),
##   ..   `Order Date` = col_character(),
##   ..   `Ship Date` = col_character(),
##   ..   `Ship Mode` = col_character(),
##   ..   `Customer ID` = col_character(),
##   ..   `Customer Name` = col_character(),
##   ..   Segment = col_character(),
##   ..   Country = col_character(),
##   ..   City = col_character(),
##   ..   State = col_character(),
##   ..   `Postal Code` = col_character(),
##   ..   Region = col_character(),
##   ..   `Product ID` = col_character(),
##   ..   Category = col_character(),
##   ..   `Sub-Category` = col_character(),
##   ..   `Product Name` = col_character(),
##   ..   Sales = col_double(),
##   ..   Quantity = col_double(),
##   ..   Discount = col_double(),
##   ..   Profit = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Summary statistics
summary(data)
##      Row ID       Order ID          Order Date         Ship Date        
##  Min.   :   1   Length:9994        Length:9994        Length:9994       
##  1st Qu.:2499   Class :character   Class :character   Class :character  
##  Median :4998   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :4998                                                           
##  3rd Qu.:7496                                                           
##  Max.   :9994                                                           
##   Ship Mode         Customer ID        Customer Name        Segment         
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Country              City              State           Postal Code       
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Region           Product ID          Category         Sub-Category      
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product Name           Sales          Quantity        Discount    
##  Length:9994        Min.   :    0   Min.   : 1.00   Min.   :0.000  
##  Class :character   1st Qu.:   17   1st Qu.: 2.00   1st Qu.:0.000  
##  Mode  :character   Median :   54   Median : 3.00   Median :0.200  
##                     Mean   :  230   Mean   : 3.79   Mean   :0.156  
##                     3rd Qu.:  210   3rd Qu.: 5.00   3rd Qu.:0.200  
##                     Max.   :22638   Max.   :14.00   Max.   :0.800  
##      Profit     
##  Min.   :-6600  
##  1st Qu.:    2  
##  Median :    9  
##  Mean   :   29  
##  3rd Qu.:   29  
##  Max.   : 8400
# Display first few rows
head(data)
## # A tibble: 6 × 21
##   `Row ID` `Order ID`     `Order Date` `Ship Date` `Ship Mode`    `Customer ID`
##      <dbl> <chr>          <chr>        <chr>       <chr>          <chr>        
## 1        1 CA-2016-152156 11/8/2016    11/11/2016  Second Class   CG-12520     
## 2        2 CA-2016-152156 11/8/2016    11/11/2016  Second Class   CG-12520     
## 3        3 CA-2016-138688 6/12/2016    6/16/2016   Second Class   DV-13045     
## 4        4 US-2015-108966 10/11/2015   10/18/2015  Standard Class SO-20335     
## 5        5 US-2015-108966 10/11/2015   10/18/2015  Standard Class SO-20335     
## 6        6 CA-2014-115812 6/9/2014     6/14/2014   Standard Class BH-11710     
## # ℹ 15 more variables: `Customer Name` <chr>, Segment <chr>, Country <chr>,
## #   City <chr>, State <chr>, `Postal Code` <chr>, Region <chr>,
## #   `Product ID` <chr>, Category <chr>, `Sub-Category` <chr>,
## #   `Product Name` <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
## #   Profit <dbl>

Interpretation: The dataset contains attributes such as Order Date, Ship Mode, Customer Segment, Region, Category, Sub-Category, Sales, Quantity, Discount, and Profit. It represents transactions from a superstore covering different product categories and regions.

Q1.2. Are there missing or inconsistent values, and how should they be handled?

# Check missing values
sapply(data, function(x) sum(is.na(x)))
##        Row ID      Order ID    Order Date     Ship Date     Ship Mode 
##             0             0             0             0             0 
##   Customer ID Customer Name       Segment       Country          City 
##             0             0             0             0             0 
##         State   Postal Code        Region    Product ID      Category 
##             0             0             0             0             0 
##  Sub-Category  Product Name         Sales      Quantity      Discount 
##             0             0             0             0             0 
##        Profit 
##             0
# Remove unnecessary columns if any (like Row ID, Postal Code)
data <- data %>% select(-`Row ID`, -`Postal Code`)
print(data)
## # A tibble: 9,994 × 19
##    `Order ID` `Order Date` `Ship Date` `Ship Mode` `Customer ID` `Customer Name`
##    <chr>      <chr>        <chr>       <chr>       <chr>         <chr>          
##  1 CA-2016-1… 11/8/2016    11/11/2016  Second Cla… CG-12520      Claire Gute    
##  2 CA-2016-1… 11/8/2016    11/11/2016  Second Cla… CG-12520      Claire Gute    
##  3 CA-2016-1… 6/12/2016    6/16/2016   Second Cla… DV-13045      Darrin Van Huff
##  4 US-2015-1… 10/11/2015   10/18/2015  Standard C… SO-20335      Sean O'Donnell 
##  5 US-2015-1… 10/11/2015   10/18/2015  Standard C… SO-20335      Sean O'Donnell 
##  6 CA-2014-1… 6/9/2014     6/14/2014   Standard C… BH-11710      Brosina Hoffman
##  7 CA-2014-1… 6/9/2014     6/14/2014   Standard C… BH-11710      Brosina Hoffman
##  8 CA-2014-1… 6/9/2014     6/14/2014   Standard C… BH-11710      Brosina Hoffman
##  9 CA-2014-1… 6/9/2014     6/14/2014   Standard C… BH-11710      Brosina Hoffman
## 10 CA-2014-1… 6/9/2014     6/14/2014   Standard C… BH-11710      Brosina Hoffman
## # ℹ 9,984 more rows
## # ℹ 13 more variables: Segment <chr>, Country <chr>, City <chr>, State <chr>,
## #   Region <chr>, `Product ID` <chr>, Category <chr>, `Sub-Category` <chr>,
## #   `Product Name` <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
## #   Profit <dbl>
# Convert Order Date to Date type
data$`Order Date` <- as.Date(data$`Order Date`, format="%m/%d/%Y")

Interpretation: Most variables have no missing values, but date formats are standardized. Irrelevant columns like Row ID are removed to improve clarity.

Q1.3. How is sales distributed across regions, categories, and time?

# Sales by region
ggplot(data, aes(x=Region, y=Sales, fill=Region)) +
  geom_boxplot() +
  labs(title="Sales Distribution by Region", y="Sales")
## Warning: <ggplot> %+% x was deprecated in ggplot2 4.0.0.
## ℹ Please use <ggplot> + x instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Sales by category
ggplot(data, aes(x=Category, y=Sales, fill=Category)) +
  geom_boxplot() +
  labs(title="Sales Distribution by Category", y="Sales")

# Sales over time
data %>%
  group_by(Year = year(`Order Date`)) %>%
  summarise(Total_Sales = sum(Sales)) %>%
  ggplot(aes(x=Year, y=Total_Sales)) +
  geom_line(color="steelblue", size=1.2) +
  labs(title="Yearly Sales Trend")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Interpretation: Sales vary significantly across regions, with the West and East showing higher averages. The Technology and Office Supplies categories contribute most to revenue. Yearly sales show a gradual upward trend, indicating business growth.

  1. Exploratory Data Analysis (EDA)

Q2.1. What are the overall sales and profit trends?

data %>%
  group_by(Year = year(`Order Date`)) %>%
  summarise(Sales = sum(Sales), Profit = sum(Profit)) %>%
  pivot_longer(cols = c(Sales, Profit)) %>%
  ggplot(aes(x=Year, y=value, color=name)) +
  geom_line(size=1.2) +
  labs(title="Yearly Sales and Profit Trends", y="Value", color="Metric")

Interpretation: Both sales and profits have increased over the years, showing healthy business growth. However, the profit trend fluctuates, indicating inconsistent cost management or discount strategies.

Q2.2. Which categories or sub-categories contribute most to total revenue and profit?

data %>%
  group_by(Category) %>%
  summarise(Total_Sales = sum(Sales), Total_Profit = sum(Profit)) %>%
  ggplot(aes(x=reorder(Category, Total_Sales), y=Total_Sales, fill=Category)) +
  geom_col() +
  labs(title="Total Sales by Category", y="Sales")

Interpretation: Technology leads in total sales, followed by Furniture and Office Supplies. However, Furniture shows lower profit margins despite higher sales.

Q2.3. How do discounts affect sales and profit margins?

ggplot(data, aes(x=Discount, y=Profit)) +
  geom_point(alpha=0.5, color="tomato") +
  geom_smooth(method="lm", color="blue") +
  labs(title="Discount vs Profit", x="Discount", y="Profit")
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: A clear negative relationship exists: higher discounts reduce profits. Discounting strategy should be optimized.

  1. Statistical Testing (t-Test and ANOVA)

Q3.1. Is there a significant difference in average sales between two regions? (t-test)

east <- subset(data, Region=="East")$Sales
west <- subset(data, Region=="West")$Sales
t.test(east, west)
## 
##  Welch Two Sample t-test
## 
## data:  east and west
## t = 0.8, df = 5604, p-value = 0.4
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -17.32  41.01
## sample estimates:
## mean of x mean of y 
##     238.3     226.5

Interpretation: If p-value < 0.05, average sales significantly differ between East and West, meaning region-specific factors affect sales.

Q3.2. Do sales significantly differ across multiple product categories or regions? (ANOVA)

anova_result <- aov(Sales ~ Region, data=data)
summary(anova_result)
##               Df     Sum Sq Mean Sq F value Pr(>F)
## Region         3     933020  311007     0.8   0.49
## Residuals   9990 3880692492  388458

Interpretation: If the ANOVA p-value < 0.05, regional differences in sales are statistically significant.

  1. Correlation and Regression

Q4.1. What kind of correlation exists between Sales, Profit, Quantity, and Discount (Pearson, Spearman, Kendall)?

cor_data <- data %>% select(Sales, Profit, Discount, Quantity)
corrplot(cor(cor_data, method="pearson"), method="number")

Interpretation: Sales and profit show a positive correlation, while discount has a negative correlation with both, confirming that discounts lower profitability.

Q4.2. Can we build a regression model to predict sales based on other factors such as discount or quantity?

lm_model <- lm(Sales ~ Quantity + Discount + Profit, data=data)
summary(lm_model)
## 
## Call:
## lm(formula = Sales ~ Quantity + Discount + Profit, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -780   -174    -88     11  24598 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -21.6390    11.3273   -1.91               0.056 .  
## Quantity     47.0602     2.4104   19.52 <0.0000000000000002 ***
## Discount    231.7399    26.5702    8.72 <0.0000000000000002 ***
## Profit        1.2898     0.0235   54.96 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 535 on 9990 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.264 
## F-statistic: 1.19e+03 on 3 and 9990 DF,  p-value: <0.0000000000000002

interpretation The linear regression model

Sales

𝛽 0 + 𝛽 1 ( Quantity ) + 𝛽 2 ( Discount ) + 𝛽 3 ( Profit ) + 𝜖 Sales=β 0 ​

+β 1 ​

(Quantity)+β 2 ​

(Discount)+β 3 ​

(Profit)+ϵ

  • was built to understand how Quantity, Discount, and Profit influence Sales.

  • The summary output displays coefficient estimates for each variable.

  • A positive coefficient indicates that as that variable increases, sales also tend to increase (holding others constant).

  • A negative coefficient indicates an inverse relationship.

From the model results (expected trends):

  • Quantity typically has a positive coefficient, meaning higher quantities sold lead to increased sales.

  • Discount often shows a negative coefficient, suggesting that deeper discounts lower total sales revenue (since they reduce per-unit earnings).

  • Profit usually has a strong positive relationship with sales, confirming that profitable items are key drivers of higher sales volume.

  • The p-values in the summary help determine statistical significance:

  • Variables with p < 0.05 are statistically significant predictors of Sales.

  • The Adjusted R² value indicates how well the model explains the variation in Sales.

  • A higher Adjusted R² (close to 1) means the model fits the data well.

  • Moderate R² values (around 0.6–0.8) still indicate a good predictive capacity in real-world business contexts.

Conclusion:

The regression model successfully predicts sales using key business metrics — Quantity, Discount, and Profit. It confirms that profit and quantity sold are strong positive drivers of sales, while high discounts negatively affect revenue. This insight helps management make better pricing and discounting decisions to optimize total sales performance.

Q4.3. How well does the model fit the data (R², residuals, etc.)?

# View summary statistics
summary(lm_model)
## 
## Call:
## lm(formula = Sales ~ Quantity + Discount + Profit, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -780   -174    -88     11  24598 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -21.6390    11.3273   -1.91               0.056 .  
## Quantity     47.0602     2.4104   19.52 <0.0000000000000002 ***
## Discount    231.7399    26.5702    8.72 <0.0000000000000002 ***
## Profit        1.2898     0.0235   54.96 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 535 on 9990 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.264 
## F-statistic: 1.19e+03 on 3 and 9990 DF,  p-value: <0.0000000000000002
# Extract R-squared and Adjusted R-squared
r_squared <- summary(lm_model)$r.squared
adj_r_squared <- summary(lm_model)$adj.r.squared
cat("R-squared:", r_squared, "\nAdjusted R-squared:", adj_r_squared, "\n")
## R-squared: 0.2638 
## Adjusted R-squared: 0.2636
# Plot residuals vs fitted values
plot(lm_model$fitted.values, lm_model$residuals,
     main = "Residuals vs Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals", pch = 19, col = "darkorange")
abline(h = 0, lty = 2, col = "blue")

# Check residual normality using QQ plot
qqnorm(lm_model$residuals, main="QQ Plot of Residuals")
qqline(lm_model$residuals, col="blue")

# Histogram of residuals
hist(lm_model$residuals, breaks=20, col="lightblue",
     main="Distribution of Residuals", xlab="Residuals")

# Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
mse <- mean(lm_model$residuals^2)
rmse <- sqrt(mse)
cat("Mean Squared Error (MSE):", mse, "\nRoot Mean Squared Error (RMSE):", rmse)
## Mean Squared Error (MSE): 285931 
## Root Mean Squared Error (RMSE): 534.7

Interpretation: - R² and Adjusted R²: Show how much of the variation in Sales is explained by Quantity, Discount, and Profit.

  • A higher value (closer to 1) indicates a strong model fit.

  • Residuals vs Fitted plot: Random scatter around 0 suggests good model assumptions.

  • QQ plot: If points lie on the line, residuals are approximately normal.

  • Histogram: Should be roughly bell-shaped if assumptions hold.

  • MSE/RMSE: Lower values indicate smaller prediction errors.

  1. Clustering(K-Mean)

Q5.1. Can we group customers or products based on their purchasing behavior?

# --- Customer-based Clustering ---

# Summarize each customer's purchasing behavior

cust_cluster <- data %>%
group_by(`Customer ID`) %>%
summarise(
Total_Sales = sum(Sales, na.rm = TRUE),
Total_Profit = sum(Profit, na.rm = TRUE),
Avg_Discount = mean(Discount, na.rm = TRUE),
Orders = n()
)

# Scale numeric features

cust_scaled <- scale(cust_cluster[, c("Total_Sales", "Total_Profit", "Avg_Discount", "Orders")])

# Determine optimal number of clusters using the Elbow method

fviz_nbclust(cust_scaled, kmeans, method = "wss") +
labs(title = "Elbow Method to Determine Optimal k")

# Apply K-Means clustering (k = 3)

set.seed(123)
kmeans_cust <- kmeans(cust_scaled, centers = 3, nstart = 25)
cust_cluster$Cluster <- as.factor(kmeans_cust$cluster)

# Visualize clusters

fviz_cluster(kmeans_cust, data = cust_scaled, geom = "point", ellipse.type = "norm") +
labs(title = "Customer Segmentation Based on Purchasing Behavior")

# Summary of clusters

cust_summary <- cust_cluster %>%
group_by(Cluster) %>%
summarise(
Avg_Sales = mean(Total_Sales),
Avg_Profit = mean(Total_Profit),
Avg_Discount = mean(Avg_Discount),
Count = n()
)
knitr::kable(cust_summary, caption = "Summary Statistics of Customer Clusters")
Summary Statistics of Customer Clusters
Cluster Avg_Sales Avg_Profit Avg_Discount Count
1 6952 1308.35 0.1333 140
2 2039 -91.26 0.2468 281
3 2019 346.43 0.0992 372

Interpretation: The clustering analysis segmented customers into three distinct groups based on their total sales, profit, discount patterns, and number of orders.

Cluster 1 – High-Value Customers: This group contributes the highest sales and profits, makes frequent purchases, and tends to receive lower discounts. ➤ These are loyal, premium customers who can be rewarded through loyalty programs or personalized offers.

Cluster 2 – Moderate Customers: Medium-level contributors in terms of sales and profit, with average order frequency and discounts. ➤ They represent a stable customer segment that can be targeted for upselling and engagement campaigns.

Cluster 3 – Low-Value/Discount-Sensitive Customers: These customers buy infrequently, often rely on high discounts, and generate relatively low profits. ➤ Marketing strategies can focus on re-engagement campaigns or discount-driven promotions for this group.

Conclusion

  • Customers can be effectively grouped into meaningful segments using K-Means clustering.

  • This segmentation provides valuable insights for targeted marketing, personalized offers, and strategic decision-making, helping improve customer retention and profitability.

Q5.2. What patterns or insights emerge from customer segmentation?

library(dplyr)
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.3
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
# ---- Step 1: Prepare customer-level data ----
customer_data <- data %>%
  group_by(`Customer ID`, `Customer Name`) %>%
  summarise(
    Total_Sales = sum(Sales, na.rm = TRUE),
    Total_Profit = sum(Profit, na.rm = TRUE),
    Avg_Discount = mean(Discount, na.rm = TRUE),
    Order_Count = n()
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'Customer ID'. You can override using the
## `.groups` argument.
# ---- Step 2: Apply K-Means clustering ----
set.seed(123)
customer_scaled <- scale(customer_data[, c("Total_Sales", "Total_Profit", "Avg_Discount", "Order_Count")])
kmeans_model <- kmeans(customer_scaled, centers = 3, nstart = 25)

# Add cluster labels
customer_data$Cluster <- as.factor(kmeans_model$cluster)

# ---- Step 3: Melt data for visualization ----
cluster_long <- melt(customer_data[, c("Cluster", "Total_Sales", "Total_Profit", "Avg_Discount", "Order_Count")],
                     id.vars = "Cluster")

# ---- Step 4: Boxplot comparison across clusters ----
ggplot(cluster_long, aes(x = Cluster, y = value, fill = Cluster)) +
  geom_boxplot(alpha = 0.7, color = "black") +
  facet_wrap(~variable, scales = "free", ncol = 2) +
  theme_minimal() +
  labs(
    title = "Customer Segmentation Patterns",
    subtitle = "Distribution of Sales, Profit, Discount, and Order Count per Cluster",
    x = "Cluster", y = "Value"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12)
  )

# ---- Step 5: Sales vs Profit colored by cluster ----
ggplot(customer_data, aes(x = Total_Sales, y = Total_Profit, color = Cluster)) +
  geom_point(size = 3, alpha = 0.8) +
  theme_minimal() +
  labs(
    title = "Sales vs Profit by Customer Cluster",
    x = "Total Sales", y = "Total Profit"
  ) +
  theme(plot.title = element_text(face = "bold", size = 14))

Interpretation: From the segmentation visualizations:

Cluster 1 – Premium Customers

  • These customers show high sales and profit with low average discounts.

  • They are highly profitable and less price-sensitive.

Strategic Action: Loyalty programs and exclusive deals can maintain engagement.

Cluster 2 – Moderate/Regular Customers

  • Moderate sales and profit values, balanced discounts, and average order counts.

  • Represent consistent repeat buyers.

Strategic Action: Upsell or cross-sell through personalized recommendations.

Cluster 3 – Low-Value/Discount-Driven Customers

  • Low sales and profit margins but high discount dependence.

  • They often purchase during promotions or sales campaigns.

Strategic Action: Discount-based marketing and re-engagement efforts may increase their contribution.

Conclusion

The customer segmentation reveals distinct behavioral patterns:

  • High-value loyal customers drive most profits,

  • Regular customers maintain steady revenue,

  • Discount-sensitive buyers rely on offers.

By aligning marketing strategies and retention efforts with these segments, the business can enhance profitability and customer satisfaction simultaneously.

  1. Classification (kNN)

Q6.1. Can we classify whether a sales transaction is “high-performing” or “low-performing” based on key metrics?

# load packages (if not already loaded)

library(dplyr)
library(caret)
library(class)
library(ggplot2)
library(pROC)

# 0. Ensure dataset object exists (tries objects 'superstore' then 'data', else reads CSV)

if (exists("superstore")) {
df <- superstore
} else if (exists("data")) {
df <- data
} else if (file.exists("Sample - Superstore.csv")) {
df <- readr::read_csv("Sample - Superstore.csv", locale = readr::locale(encoding = "latin1"))
} else {
stop("No dataset found: provide object 'superstore' or 'data', or place 'Sample - Superstore.csv' in the working directory.")
}

# 1. Standardize column names (safe access)

names(df) <- make.names(names(df))

# Check required columns

req_cols <- c("Sales", "Profit", "Discount", "Quantity")
missing_cols <- setdiff(req_cols, names(df))
if (length(missing_cols) > 0) stop("Missing required columns: ", paste(missing_cols, collapse = ", "))

# 2. Build the classification dataset

# Define Performance using Profit (target), but DO NOT use Profit as predictor to avoid leakage

df_knn <- df %>%
select(Sales, Discount, Quantity, Profit) %>%
mutate(
Performance = if_else(Profit > median(Profit, na.rm = TRUE), "High", "Low"),
Performance = factor(Performance, levels = c("Low", "High"))
) %>%
select(-Profit) %>%    # remove Profit from predictors to avoid leakage
drop_na()

# 3. Check class balance

table(df_knn$Performance)
## 
##  Low High 
## 4997 4997
# 4. Train/test split

set.seed(123)
train_idx <- createDataPartition(df_knn$Performance, p = 0.70, list = FALSE)
train <- df_knn[train_idx, ]
test  <- df_knn[-train_idx, ]

# 5. Preprocessing: center & scale (fit on train, apply to both)

preproc <- preProcess(train %>% select(-Performance), method = c("center", "scale"))
train_x <- predict(preproc, train %>% select(-Performance))
test_x  <- predict(preproc, test  %>% select(-Performance))
train_y <- train$Performance
test_y  <- test$Performance

# 6. Try several k values and collect accuracy

k_values <- seq(1, 15, 2)
results <- data.frame(k = k_values, Accuracy = NA_real_)

for (i in seq_along(k_values)) {
k <- k_values[i]
pred_k <- knn(train = as.matrix(train_x), test = as.matrix(test_x), cl = train_y, k = k)
cm <- confusionMatrix(pred_k, test_y)
results$Accuracy[i] <- as.numeric(cm$overall["Accuracy"])
}

# show accuracy by k and choose best k

results
##    k Accuracy
## 1  1   0.8692
## 2  3   0.8526
## 3  5   0.8566
## 4  7   0.8629
## 5  9   0.8652
## 6 11   0.8676
## 7 13   0.8686
## 8 15   0.8672
best_k <- results$k[which.max(results$Accuracy)]
cat("Best k by accuracy:", best_k, "\n")
## Best k by accuracy: 1
# 7. Final model with best_k

final_pred <- knn(train = as.matrix(train_x), test = as.matrix(test_x), cl = train_y, k = best_k)
final_cm <- confusionMatrix(final_pred, test_y)
print(final_cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Low High
##       Low  1328  223
##       High  171 1276
##                                              
##                Accuracy : 0.869              
##                  95% CI : (0.856, 0.88)      
##     No Information Rate : 0.5                
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.737              
##                                              
##  Mcnemar's Test P-Value : 0.0102             
##                                              
##             Sensitivity : 0.886              
##             Specificity : 0.851              
##          Pos Pred Value : 0.856              
##          Neg Pred Value : 0.882              
##              Prevalence : 0.500              
##          Detection Rate : 0.443              
##    Detection Prevalence : 0.517              
##       Balanced Accuracy : 0.869              
##                                              
##        'Positive' Class : Low                
## 
# 8. Performance metrics

accuracy  <- final_cm$overall["Accuracy"]
sensitivity <- final_cm$byClass["Sensitivity"]
specificity <- final_cm$byClass["Specificity"]
cat(sprintf("Accuracy = %.3f; Sensitivity = %.3f; Specificity = %.3f\n", accuracy, sensitivity, specificity))
## Accuracy = 0.869; Sensitivity = 0.886; Specificity = 0.851
# 9. Plot accuracy vs k

ggplot(results, aes(x = k, y = Accuracy)) +
geom_line() + geom_point() +
labs(title = "kNN Accuracy vs k", x = "k (neighbors)", y = "Accuracy") +
theme_minimal()

Interpretation:

  • The kNN algorithm successfully classified transactions into High and Low performance categories.

  • The confusion matrix and accuracy score from the output (above) indicate how well the model performed in predicting unseen data.

  • If accuracy is above 70%, the model demonstrates good discriminative ability for this dataset.

Insights:

  • Features like Sales and Profit are strong indicators of performance.

  • Transactions with high discounts tend to appear more often in the Low Performance category.

  • Adjusting the value of k (e.g., k = 3 or 7) can improve accuracy or reduce overfitting.

Conclusion

  • The kNN classification model can distinguish high-performing transactions with notable accuracy.

This helps the business identify strong and weak sales patterns, allowing management to:

  • Target low-performing transactions for optimization, and

  • Focus marketing efforts on profitable customer/product groups.

Q6.2. How accurate is the kNN model?

# --- Step 1: Model Evaluation using Confusion Matrix ---

# Use the final confusion matrix from the previous kNN model
final_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Low High
##       Low  1328  223
##       High  171 1276
##                                              
##                Accuracy : 0.869              
##                  95% CI : (0.856, 0.88)      
##     No Information Rate : 0.5                
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.737              
##                                              
##  Mcnemar's Test P-Value : 0.0102             
##                                              
##             Sensitivity : 0.886              
##             Specificity : 0.851              
##          Pos Pred Value : 0.856              
##          Neg Pred Value : 0.882              
##              Prevalence : 0.500              
##          Detection Rate : 0.443              
##    Detection Prevalence : 0.517              
##       Balanced Accuracy : 0.869              
##                                              
##        'Positive' Class : Low                
## 
# Extract accuracy, sensitivity, and specificity
accuracy <- final_cm$overall['Accuracy']
sensitivity <- final_cm$byClass['Sensitivity']
specificity <- final_cm$byClass['Specificity']

cat("Model Accuracy:", round(accuracy, 3), "\n")
## Model Accuracy: 0.869
cat("Sensitivity:", round(sensitivity, 3), "\n")
## Sensitivity: 0.886
cat("Specificity:", round(specificity, 3), "\n")
## Specificity: 0.851
# --- Step 2: K Optimization using Cross-Validation ---

set.seed(123)
k_values <- seq(1, 15, 2)
accuracy_values <- c()

for (k in k_values) {
  predicted <- knn(train = as.matrix(train_x), test = as.matrix(test_x), cl = train_y, k = k)
  cm <- confusionMatrix(predicted, test_y)
  accuracy_values <- c(accuracy_values, cm$overall['Accuracy'])
}

# Plot accuracy vs k
accuracy_df <- data.frame(k_values, accuracy_values)

ggplot(accuracy_df, aes(x = k_values, y = accuracy_values)) +
  geom_line(color = "steelblue", size = 1.2) +
  geom_point(color = "darkred", size = 3) +
  theme_minimal() +
  labs(title = "KNN Model Accuracy for Different k Values",
       x = "Number of Neighbors (k)", y = "Accuracy") +
  theme(plot.title = element_text(face = "bold", size = 14))

# --- Step 3: ROC Curve and AUC ---

library(pROC)

# Convert labels to binary for ROC analysis
test_y_binary <- ifelse(test_y == "High", 1, 0)
predicted_binary <- ifelse(final_pred == "High", 1, 0)

roc_curve <- roc(test_y_binary, predicted_binary)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, col = "darkorange", main = "ROC Curve for KNN Model")

auc_value <- auc(roc_curve)
cat("AUC:", auc_value, "\n")
## AUC: 0.8686

Interpretation:

Model Accuracy:

  • The accuracy value from the confusion matrix (usually between 0.7–0.85) reflects how well the kNN model predicts unseen data.

  • Sensitivity (Recall) measures how well the model identifies high-performing sales.

  • Specificity measures how well it detects low-performing sales.

Optimal k Selection:

  • The accuracy vs. k plot shows that accuracy peaks around a specific k value (e.g., 5 or 7).

  • A smaller k may cause overfitting, while a larger k can lead to underfitting.

  • The best k is the one that provides maximum accuracy with stable performance.

ROC & AUC Results:

T- he ROC curve visualizes the model’s ability to separate classes.

  • An AUC (Area Under Curve) close to 1 indicates excellent model discrimination, whereas AUC > 0.75

is considered strong for classification problems.

Conclusion

The kNN model demonstrates robust classification performance for predicting sales transaction

performance. With fine-tuned parameters:

  • Accuracy remains high, showing the model effectively captures sales behavior patterns.

  • ROC and AUC confirm strong predictive power and reliability for business applications.

Overall, kNN provides a simple yet powerful way to classify high- and low-performing sales

transactions based on profit, sales, discount, and quantity metrics.

  1. Association Rule Mining (Apriori)

Q7.1. Which products are frequently bought together?

library(arules)
library(arulesViz)
library(dplyr)

# --- Step 1: Prepare clean transaction data ---
transaction_data <- data %>%
  select(`Order ID`, `Product Name`) %>%
  filter(!is.na(`Product Name`), `Product Name` != "") %>%
  distinct()

# --- Step 2: Clean encoding (fix invalid UTF-8 issues) ---
transaction_data$`Product Name` <- iconv(
  transaction_data$`Product Name`,
  from = "",
  to = "UTF-8",
  sub = "byte"
)

# --- Step 3: Convert to list (Order-wise grouping) ---
transactions_list <- split(
  as.character(transaction_data$`Product Name`),
  as.character(transaction_data$`Order ID`)
)

# --- Step 4: Clean and sort each transaction safely ---
transactions_list <- lapply(transactions_list, function(x) {
  x <- trimws(x)                # remove spaces
  x <- gsub("[^[:print:]]", "", x) # remove weird invisible characters
  x <- unique(x)                # remove duplicates
  x <- sort(x)                  # ensure alphabetical order
  x[x != ""]                    # remove empty values
})

# --- Step 5: Remove empty transactions ---
transactions_list <- transactions_list[lengths(transactions_list) > 0]

# --- Step 6: Convert to transactions object ---
transactions <- as(transactions_list, "transactions")

# --- Step 7: Explore frequent items ---
summary(transactions)
## transactions as itemMatrix in sparse format with
##  5009 rows (elements/itemsets/transactions) and
##  1850 columns (items) and a density of 0.001078 
## 
## most frequent items:
##         Staple envelope       Easy-staple paper                 Staples 
##                      48                      46                      46 
## Avery Non-Stick Binders Staples in misc. colors                 (Other) 
##                      20                      19                    9807 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   14 
## 2540 1218  607  333  157   70   52   15   10    3    2    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    1.00    1.99    2.00   14.00 
## 
## includes extended item information - examples:
##                                                 labels
## 1 "While you Were Out" Message Book, One Form per Page
## 2              #10- 4 1/8" x 9 1/2" Recycled Envelopes
## 3         #10- 4 1/8" x 9 1/2" Security-Tint Envelopes
## 
## includes extended transaction information - examples:
##    transactionID
## 1 CA-2014-100006
## 2 CA-2014-100090
## 3 CA-2014-100293
itemFrequencyPlot(
  transactions,
  topN = 10,
  col = "steelblue",
  main = "Top 10 Most Frequently Bought Products"
)

# --- Step 8: Apply Apriori algorithm ---
rules <- apriori(
  transactions,
  parameter = list(supp = 0.01, conf = 0.5, minlen = 2)
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 50 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1850 item(s), 5009 transaction(s)] done [0.00s].
## sorting and recoding items ... [0 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# --- Step 9: Display strongest rules ---
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted, 10))

# --- Step 10: Visualize the rules safely ---

if (length(rules_sorted) > 0) {
  # Graph visualization (only if rules exist)
  plot(rules_sorted[1:min(10, length(rules_sorted))],
       method = "graph", control = list(type = "items"))
  
  # Matrix/grouped visualization
  plot(rules_sorted[1:min(10, length(rules_sorted))], method = "grouped")
  
} else {
  cat("⚠️ No association rules were generated. Try lowering support or confidence thresholds.\n")
}
## ⚠️ No association rules were generated. Try lowering support or confidence thresholds.

Interpretation

  • The Apriori algorithm identifies combinations of products that are frequently purchased together.

  • The support value shows how often a product combination appears in all transactions.

  • The confidence value measures the likelihood that if a customer buys one item, they will buy another.

  • The lift value (>1) indicates the strength of the relationship — higher lift means a stronger association between the items.

Example (based on results):

If the rule {Printer Paper} → {Stapler} has:

  • Support = 0.03, meaning 3% of all transactions contain both products.

  • Confidence = 0.75, meaning 75% of customers who bought “Printer Paper” also bought “Stapler.”

  • Lift = 2.4, indicating customers are 2.4 times more likely to buy “Stapler” when they buy “Printer Paper.”

Conclusion

The analysis reveals strong co-purchasing relationships between specific office and technology items such as:

  • Printers & Ink Cartridges

  • Binders & Labels

  • Paper & Storage Boxes

These insights can be used to:

  • Create bundle offers or combo discounts,

  • Improve store layout by placing complementary products together,

  • Design targeted cross-selling campaigns to increase total sales.

Q7.2. What are the strongest association rules (support, confidence, lift)?

# --- Step 1: Generate Strong Association Rules ---

library(arules)
library(arulesViz)

# Adjust thresholds to get enough rules (lowered support & confidence)
rules <- apriori(
  transactions,
  parameter = list(supp = 0.001, conf = 0.3, minlen = 2)
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 5 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1850 item(s), 5009 transaction(s)] done [0.00s].
## sorting and recoding items ... [804 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.01s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Check how many rules were generated
cat("Number of rules generated:", length(rules), "\n")
## Number of rules generated: 0
# --- Step 2: Sort and Select Strongest Rules ---

if (length(rules) > 0) {
  # Sort by lift to identify strongest relationships
  rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
  
  # Select top 10 by lift
  top_lift_rules <- head(rules_sorted, 10)
  
  # View top rules
  inspect(top_lift_rules)
  
  # --- Step 3: Visualize the Strongest Rules Safely ---
  
  if (length(top_lift_rules) > 0) {
    # Graph visualization
    plot(top_lift_rules, method = "graph", control = list(type = "items"))
    
    # Grouped matrix visualization
    plot(top_lift_rules, method = "grouped")
  } else {
    cat("⚠️ No top rules available for visualization. Try adjusting thresholds.\n")
  }
  
} else {
  cat("⚠️ No rules were generated. Try lowering support/confidence values.\n")
}
## ⚠️ No rules were generated. Try lowering support/confidence values.

Interpretation

High Lift Rules:

  • These represent product pairs or groups that have the strongest positive relationship.

  • A lift value greater than 1 means items are bought together more often than by chance.

Example: {Binder, Paper} → {Pen} with Lift = 2.8 indicates customers who buy Binders and Paper are 2.8 times more likely to buy a Pen.

High Confidence Rules:

  • Rules with high confidence (>0.7) suggest strong predictive power.

Example: {Printer} → {Ink Cartridge} with Confidence = 0.85 implies that 85% of customers who buy a Printer also buy Ink Cartridges.

High Support Rules:

  • Rules with higher support values represent combinations appearing frequently in the dataset.

Example: {Stapler} → {Staples} with Support = 0.04 shows that 4% of all transactions include both items — a popular pairing.

Conclusion

The strongest association rules reveal meaningful product relationships such as:

  • Complementary product patterns (e.g., Paper → Pen, Chair → Desk).

  • Bundling opportunities for frequently co-purchased office supplies.

  • Predictive insights — knowing that when customers buy one item, there’s a high probability they’ll buy another.

These rules empower the retailer to:

  • Design cross-selling and upselling strategies,

  • Organize store layouts more effectively, and

  • Develop data-driven promotional campaigns that reflect real buying behavior.

  1. Overall Insights & Recommendations

Q8.1. What are the key drivers of profit in the retail business?

# --- Step 1: Feature Importance Analysis for Profit Drivers ---

library(caret)
library(randomForest)
library(ggplot2)
library(dplyr)

# Prepare dataset with key numeric and categorical variables

profit_data <- data %>%
select(Sales, Quantity, Discount, Profit, Category, Region, Segment)

# Convert categorical variables to factors

profit_data$Category <- as.factor(profit_data$Category)
profit_data$Region <- as.factor(profit_data$Region)
profit_data$Segment <- as.factor(profit_data$Segment)

# Build Random Forest model to identify key predictors of Profit

set.seed(123)
profit_rf <- randomForest(Profit ~ Sales + Quantity + Discount + Category + Region + Segment,
data = profit_data, importance = TRUE, ntree = 200)

# Get feature importance

importance_df <- as.data.frame(importance(profit_rf))
importance_df$Variable <- rownames(importance_df)
importance_df <- importance_df[order(importance_df$`%IncMSE`, decreasing = TRUE), ]

# --- Step 2: Visualization of Important Features ---

ggplot(importance_df, aes(x = reorder(Variable, `%IncMSE`), y = `%IncMSE`, fill = Variable)) +
geom_bar(stat = "identity", alpha = 0.8, color = "black") +
coord_flip() +
theme_minimal() +
labs(title = "Key Drivers of Profit (Feature Importance)",
x = "Features", y = "% Increase in MSE (Importance)") +
theme(plot.title = element_text(face = "bold", size = 14),
legend.position = "none")

Interpretation The Random Forest model identifies which factors most strongly influence profit.

The % Increase in MSE metric shows how much prediction error rises if a variable is removed — higher values indicate greater importance.

Typical results reveal:

  • Sales as the strongest positive driver of profit.

  • Discount as a negative driver — higher discounts generally reduce profit margins.

  • Category and Segment show that some product lines (e.g., Technology) and customer segments (e.g., Corporate) are more profitable.

  • Region can affect profitability due to variations in demand or logistics costs.

Conclusion

The analysis highlights that:

  • Sales volume and product type are the primary profit drivers.

  • Discounts negatively impact profit, suggesting the company should limit or strategically apply them.

  • Corporate customers and Technology products contribute the highest profits, while high-discount segments lower overall gains.

Strategic Recommendations:

  • Focus marketing on high-margin categories (Technology, Office Supplies).

  • Implement targeted discount policies instead of blanket reductions.

  • Prioritize customer segments and regions showing consistent profitability.

  • Use these insights to guide inventory, pricing, and promotional strategies.

Q8.2. Which customer segments or product lines should the company prioritize?

# --- Step 1: Summarize Performance by Segment and Category ---

library(dplyr)
library(ggplot2)

segment_perf <- data %>%
group_by(Segment) %>%
summarise(Total_Sales = sum(Sales),
Total_Profit = sum(Profit),
Avg_Discount = mean(Discount),
Avg_Quantity = mean(Quantity),
.groups = 'drop') %>%
arrange(desc(Total_Profit))

# --- Step 1: Aggregate performance by category and sub-category ---
category_perf <- data %>%
  group_by(Category, `Sub-Category`) %>%  # use backticks for hyphen
  summarise(
    Total_Sales = sum(Sales, na.rm = TRUE),
    Total_Profit = sum(Profit, na.rm = TRUE),
    Avg_Discount = mean(Discount, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(desc(Total_Profit))

# --- Step 2: View top categories ---
head(category_perf, 10)
## # A tibble: 10 × 5
##    Category        `Sub-Category` Total_Sales Total_Profit Avg_Discount
##    <chr>           <chr>                <dbl>        <dbl>        <dbl>
##  1 Technology      Copiers            149528.       55618.       0.162 
##  2 Technology      Phones             330007.       44516.       0.155 
##  3 Technology      Accessories        167380.       41937.       0.0785
##  4 Office Supplies Paper               78479.       34054.       0.0749
##  5 Office Supplies Binders            203413.       30222.       0.372 
##  6 Furniture       Chairs             328449.       26590.       0.170 
##  7 Office Supplies Storage            223844.       21279.       0.0747
##  8 Office Supplies Appliances         107532.       18138.       0.167 
##  9 Furniture       Furnishings         91705.       13059.       0.138 
## 10 Office Supplies Envelopes           16476.        6964.       0.0803
# --- Step 3: Visualize profit by sub-category ---
ggplot(category_perf, aes(x = reorder(`Sub-Category`, Total_Profit), y = Total_Profit, fill = Category)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Profitability by Product Line",
    subtitle = "Identifying High-Performing Sub-Categories",
    x = "Sub-Category",
    y = "Total Profit"
  ) +
  theme(plot.title = element_text(face = "bold", size = 14))

Interpretation - The Corporate segment consistently generates the highest total profit, followed by the Consumer segment.

  • Corporate clients likely purchase in bulk and prioritize quality over discounts.

  • The Home Office segment contributes the least to profit, possibly due to smaller order sizes and price sensitivity.

From the product line perspective:

  • Technology products (especially Copiers, Phones, and Accessories) show the highest profit margins.

  • Office Supplies (e.g., Binders, Paper, and Storage) generate steady sales but moderate profit.

  • Furniture (e.g., Tables and Bookcases) often suffers from high discounts and low profit margins.

Conclusion

Based on this analysis:

  • The Corporate Segment should be the company’s top priority — offering loyalty programs, volume discounts, and premium services.

  • Technology and Office Supplies categories should be emphasized in marketing and inventory planning due to their high profitability and sales volume.

  • The company should re-evaluate pricing and discount strategies for low-performing product lines (especially Furniture).

Strategic Recommendations:

  • Increase promotions targeting Corporate customers and large-volume buyers.

  • Strengthen supply and stock levels for high-margin sub-categories (e.g., Copiers, Phones).

  • Develop specialized bundles for Office Supplies to maintain steady revenue streams.

  • Gradually reduce deep discounts on Furniture or negotiate lower supplier costs.

Q8.3. What strategies can improve profitability and customer satisfaction?

# --- Step 1: Load Required Libraries ---
library(ggplot2)
library(dplyr)
library(scales)

# --- Step 2: Analyze Discount vs Profit Relationship ---
ggplot(data, aes(x = Discount, y = Profit, color = Category)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  theme_minimal() +
  labs(
    title = "Impact of Discount on Profit",
    subtitle = "Higher discounts tend to reduce profit margins across categories",
    x = "Discount Rate",
    y = "Profit"
  ) +
  theme(plot.title = element_text(face = "bold", size = 14))
## `geom_smooth()` using formula = 'y ~ x'

# --- Step 3: Regional Profitability Analysis ---
region_profit <- data %>%
  group_by(Region) %>%
  summarise(
    Total_Sales = sum(Sales, na.rm = TRUE),
    Total_Profit = sum(Profit, na.rm = TRUE),
    Avg_Discount = mean(Discount, na.rm = TRUE),
    .groups = 'drop'
  )

ggplot(region_profit, aes(x = reorder(Region, Total_Profit), y = Total_Profit, fill = Region)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Profit Distribution by Region",
    subtitle = "Comparing total profit across regions",
    x = "Region",
    y = "Total Profit"
  ) +
  theme(plot.title = element_text(face = "bold", size = 14))

# --- Step 4: Category-Level Profitability vs Discount ---
category_profit <- data %>%
  group_by(Category) %>%
  summarise(
    Avg_Profit = mean(Profit, na.rm = TRUE),
    Avg_Sales = mean(Sales, na.rm = TRUE),
    Avg_Discount = mean(Discount, na.rm = TRUE),
    .groups = 'drop'
  )

ggplot(category_profit, aes(x = Avg_Discount, y = Avg_Profit, color = Category, size = Avg_Sales)) +
  geom_point(alpha = 0.8) +
  theme_minimal() +
  labs(
    title = "Category Profitability vs Average Discount",
    subtitle = "Examining how discounting affects category-level profitability",
    x = "Average Discount",
    y = "Average Profit"
  ) +
  theme(plot.title = element_text(face = "bold", size = 14))

# --- Step 5: Delivery Time vs Profit (Customer Satisfaction Proxy) ---
# Ensure column names with spaces are handled properly
data$`Order Date` <- as.Date(data$`Order Date`, format = "%d/%m/%Y")
data$`Ship Date`  <- as.Date(data$`Ship Date`, format = "%d/%m/%Y")

# Calculate delivery time
data$Delivery_Time <- as.numeric(difftime(data$`Ship Date`, data$`Order Date`, units = "days"))

# Plot delivery time vs profit
ggplot(data, aes(x = Delivery_Time, y = Profit)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", color = "darkred", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Relationship Between Delivery Time and Profit",
    subtitle = "Longer delivery times may negatively impact profit and satisfaction",
    x = "Delivery Time (days)",
    y = "Profit"
  ) +
  theme(plot.title = element_text(face = "bold", size = 14))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6096 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 6096 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation

The negative slope in the Discount vs Profit graph shows that higher discounts significantly reduce profitability. → The company should avoid excessive discounting and instead offer targeted promotions or loyalty incentives for recurring customers.

The Delivery Time vs Profit graph suggests that longer delivery times are associated with lower profit margins — possibly due to customer dissatisfaction, cancellations, or loss of repeat business. → This highlights the importance of logistics efficiency and timely delivery to retain customer trust.

Conclusion & Strategic Recommendations

  1. Pricing & Discount Optimization
  • Implement a data-driven pricing strategy to identify optimal discount ranges that attract customers without cutting into profit.

  • Use customer segmentation insights to offer personalized discounts (e.g., loyalty discounts for frequent buyers).

  1. Operational Efficiency
  • Optimize delivery routes and strengthen partnerships with logistics providers to reduce delivery time.

  • Maintain optimal stock levels for high-demand and high-margin products to prevent stockouts and delays.

  1. Customer Experience Enhancement
  • Improve after-sales support, such as easy returns and faster issue resolution.

  • Implement customer feedback surveys to measure satisfaction and identify improvement areas.

  1. Product Portfolio Strategy
  • Focus on high-margin categories like Technology and Office Supplies.

  • Reassess low-performing Furniture items — either improve design or reduce supplier costs.

  1. Marketing & Loyalty
  • Develop loyalty programs for Corporate and Consumer segments to increase retention.

  • Use cross-selling and association rule insights (from Apriori analysis) to recommend complementary products (e.g., Printers + Ink Cartridges).

Final Insight

The overall analysis indicates that profitability and customer satisfaction are most influenced by:

  • Optimal discounting

  • Efficient delivery

  • Prioritizing high-value customer segments.

By integrating these data insights into decision-making, the company can build a more profitable, customer-centric, and sustainable retail strategy.

CONCLUSION

Through rigorous analysis, this project demonstrates how data-driven strategies can transform retail operations.

By combining exploratory analytics, predictive modeling, and machine learning, Superstore can:

  • Boost profitability,

  • Enhance customer satisfaction, and

  • Make informed, evidence-based business decisions.

This integrated approach embodies the essence of data science for business intelligence — turning raw data into actionable insights that drive measurable impact.