Project Overview

This project uses an e-commerce customer dataset containing 1,999 records across 13 attributes. The dataset captures detailed information about customer transactions, including purchase details, product categories, payment methods, returns, and churn status.

Objectives

The objective of this project is to analyze customer purchasing behavior, identify key factors influencing product returns and customer churn, and generate insights to support business strategies for customer retention and revenue growth.

Specifically, The Project Aims To:

Understand purchase patterns across different product categories.
Analyze the impact of demographics (age, gender) on buying behavior and churn.
Explore the relationship between returns and customer loyalty.
Identify payment preferences among customers.

Loading Packages

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

## Warning: package 'tidyr' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyverse)
library(ggplot2)
library(GGally)

## Warning: package 'GGally' was built under R version 4.4.3

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(scales)

## Warning: package 'scales' was built under R version 4.4.3

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Load The Dataset

Database <- read.csv("C:\\Users\\My Pc\\Desktop\\ecommerce_customer_data_large.csv")
View(Database)

1. UNDERSTANDING THE DATASET

How any unique customers are in the dataset?

Database %>% summarise(Unique_Customers = n_distinct(Customer.ID))

##   Unique_Customers
## 1              329

What Is The Total Number Of Purchases Recorded?

Database %>% summarise(Total_Purchases = n())

##   Total_Purchases
## 1            1999

What Are The Different Product Categories Available?

Database %>% distinct(Product.Category)

##   Product.Category
## 1             Home
## 2      Electronics
## 3            Books
## 4         Clothing

What Are The Most Common Product Categories Purchased?

Database %>% count(Product.Category, sort = TRUE)

##   Product.Category   n
## 1      Electronics 515
## 2             Home 515
## 3            Books 498
## 4         Clothing 471

What Is The Time Range Of The Purchase dData?

Database %>%
  summarise(
    Start_Date = min(Purchase.Date, na.rm = TRUE),
    End_Date = max(Purchase.Date, na.rm = TRUE)
  )

##       Start_Date      End_Date
## 1 1/1/2020 15:05 9/9/2023 4:57

2. DATA EXTRACTION & FILTERING

#Retrieve all purchases made by a specific customer.

Database %>% filter(Customer.ID == "44605")  # Replace with real ID

##   Customer.ID   Purchase.Date Product.Category Product.Price Quantity
## 1       44605  5/3/2023 21:30             Home           177        1
## 2       44605 5/16/2021 13:57      Electronics           174        3
## 3       44605  7/13/2020 6:16            Books           413        1
## 4       44605 1/17/2023 13:14      Electronics           396        3
## 5       44605  5/1/2021 11:29            Books           259        4
##   Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name Age
## 1                  2427         PayPal           31       1   John Rivera  31
## 2                  2448         PayPal           31       1   John Rivera  31
## 3                  2345    Credit Card           31       1   John Rivera  31
## 4                   937           Cash           31       0   John Rivera  31
## 5                  2598         PayPal           31       1   John Rivera  31
##   Gender Churn
## 1 Female     0
## 2 Female     0
## 3 Female     0
## 4 Female     0
## 5 Female     0

What Is The Average Total Purchase Amount For Transactions Made Using Each Different Payment Method?

average_spending_by_payment <- Database %>%
  group_by(Payment.Method) %>%
  summarise(
    AverageTotalAmount = mean(Total.Purchase.Amount, na.rm = TRUE),
    NumberOfTransactions = n() # Good practice to check sample size
  ) 

print(average_spending_by_payment)

## # A tibble: 3 × 3
##   Payment.Method AverageTotalAmount NumberOfTransactions
##   <chr>                       <dbl>                <int>
## 1 Cash                        2773.                  674
## 2 Credit Card                 2795.                  667
## 3 PayPal                      2716.                  658

Extract Details of Purchases Where The Purchase Amount Exceeds A Certain Threshold.

Database %>% filter(Total.Purchase.Amount > 3000)%>%
  head(20)

##    Customer.ID    Purchase.Date Product.Category Product.Price Quantity
## 1        13738   8/25/2022 6:48             Home           191        3
## 2        13738    2/9/2023 0:53      Electronics            40        4
## 3        33969  2/28/2023 19:58         Clothing           410        3
## 4        33969   1/5/2023 11:15             Home           304        1
## 5        33969  7/18/2023 23:36            Books            54        2
## 6        33969   3/7/2020 21:31            Books           281        1
## 7        33969   7/21/2022 4:25             Home           193        2
## 8        42650 10/18/2020 23:38            Books           127        5
## 9        42650  5/17/2020 17:02             Home           284        2
## 10       42650  3/18/2022 13:52      Electronics           256        2
## 11       42650  1/26/2022 12:50      Electronics           105        2
## 12       42650  11/11/2020 7:19      Electronics           193        2
## 13       42650    8/1/2021 5:27      Electronics            30        3
## 14       42650   3/24/2022 0:05      Electronics           254        1
## 15       42650    9/2/2022 4:00            Books            12        3
## 16       24053 11/17/2020 13:14      Electronics           102        2
## 17       24053  1/11/2022 10:56            Books           210        2
## 18       19676  1/12/2022 22:49         Clothing           304        3
## 19       19676  6/26/2021 23:49      Electronics           329        1
## 20       19917   12/9/2021 7:43            Books           174        5
##    Total.Purchase.Amount Payment.Method Customer.Age Returns  Customer.Name Age
## 1                   3722    Credit Card           27       1 Lauren Johnson  27
## 2                   4327           Cash           27       0 Lauren Johnson  27
## 3                   5018    Credit Card           27      NA    Carol Allen  27
## 4                   3883         PayPal           27       1    Carol Allen  27
## 5                   4187         PayPal           27       0    Carol Allen  27
## 6                   3810           Cash           27       0    Carol Allen  27
## 7                   3198    Credit Card           27       0    Carol Allen  27
## 8                   3347           Cash           20       0   Curtis Smith  20
## 9                   3531    Credit Card           20       1   Curtis Smith  20
## 10                  3548    Credit Card           20       0   Curtis Smith  20
## 11                  3721    Credit Card           20      NA   Curtis Smith  20
## 12                  3266         PayPal           20       1   Curtis Smith  20
## 13                  5024    Credit Card           20       1   Curtis Smith  20
## 14                  3865         PayPal           20       0   Curtis Smith  20
## 15                  3112           Cash           20       1   Curtis Smith  20
## 16                  5206    Credit Card           53       1     Jose Green  53
## 17                  4567    Credit Card           53       0     Jose Green  53
## 18                  4869         PayPal           57       1      Linda Lee  57
## 19                  3391           Cash           57       1      Linda Lee  57
## 20                  4517    Credit Card           34      NA   Joshua Davis  34
##    Gender Churn
## 1  Female     0
## 2  Female     0
## 3    Male     0
## 4    Male     0
## 5    Male     0
## 6    Male     0
## 7    Male     0
## 8  Female     0
## 9  Female     0
## 10 Female     0
## 11 Female     0
## 12 Female     0
## 13 Female     0
## 14 Female     0
## 15 Female     0
## 16   Male     0
## 17   Male     0
## 18   Male     0
## 19   Male     0
## 20   Male     1

Get All Purchases Made For A Specific Product Category.

Database %>% filter(Product.Category == "Electronics")%>%
  head(20)

##    Customer.ID    Purchase.Date Product.Category Product.Price Quantity
## 1        44605  5/16/2021 13:57      Electronics           174        3
## 2        44605  1/17/2023 13:14      Electronics           396        3
## 3        13738   7/25/2023 5:17      Electronics           205        1
## 4        13738    2/9/2023 0:53      Electronics            40        4
## 5        33969 12/20/2021 23:44      Electronics           428        4
## 6        42650  3/18/2022 13:52      Electronics           256        2
## 7        42650  1/26/2022 12:50      Electronics           105        2
## 8        42650  11/11/2020 7:19      Electronics           193        2
## 9        42650    8/1/2021 5:27      Electronics            30        3
## 10       42650   5/24/2020 6:45      Electronics           307        3
## 11       42650   3/24/2022 0:05      Electronics           254        1
## 12       24053 11/17/2020 13:14      Electronics           102        2
## 13       24053    7/5/2020 6:30      Electronics           364        3
## 14       19676  6/26/2021 23:49      Electronics           329        1
## 15       19676  4/20/2020 14:08      Electronics           165        5
## 16       19676  3/23/2021 17:36      Electronics           218        4
## 17       19917 10/18/2022 22:58      Electronics            41        2
## 18       23734  9/21/2022 19:10      Electronics            75        1
## 19       16921  4/22/2021 22:10      Electronics           500        2
## 20        7796 10/22/2022 12:12      Electronics           145        2
##    Total.Purchase.Amount Payment.Method Customer.Age Returns   Customer.Name
## 1                   2448         PayPal           31       1     John Rivera
## 2                    937           Cash           31       0     John Rivera
## 3                   2773    Credit Card           27      NA  Lauren Johnson
## 4                   4327           Cash           27       0  Lauren Johnson
## 5                   2289           Cash           27       0     Carol Allen
## 6                   3548    Credit Card           20       0    Curtis Smith
## 7                   3721    Credit Card           20      NA    Curtis Smith
## 8                   3266         PayPal           20       1    Curtis Smith
## 9                   5024    Credit Card           20       1    Curtis Smith
## 10                   973         PayPal           20      NA    Curtis Smith
## 11                  3865         PayPal           20       0    Curtis Smith
## 12                  5206    Credit Card           53       1      Jose Green
## 13                  1008    Credit Card           53       0      Jose Green
## 14                  3391           Cash           57       1       Linda Lee
## 15                   603    Credit Card           57       0       Linda Lee
## 16                  2866         PayPal           57      NA       Linda Lee
## 17                   743           Cash           34      NA    Joshua Davis
## 18                   289         PayPal           18       0 Cheryl Espinoza
## 19                   988           Cash           54       1  Cheyenne James
## 20                  2134         PayPal           21       0     Lisa Dennis
##    Age Gender Churn
## 1   31 Female     0
## 2   31 Female     0
## 3   27 Female     0
## 4   27 Female     0
## 5   27   Male     0
## 6   20 Female     0
## 7   20 Female     0
## 8   20 Female     0
## 9   20 Female     0
## 10  20 Female     0
## 11  20 Female     0
## 12  53   Male     0
## 13  53   Male     0
## 14  57   Male     0
## 15  57   Male     0
## 16  57   Male     0
## 17  34   Male     1
## 18  18   Male     0
## 19  54   Male     0
## 20  21   Male     0

Get Extract Of Customers Who Abandon Their Carts Before Completing A Purchase?

Database %>%
  filter(Churn==1)%>%
  head(20)

##    Customer.ID    Purchase.Date Product.Category Product.Price Quantity
## 1        19917   5/16/2023 1:34         Clothing           392        3
## 2        19917 10/27/2021 16:25         Clothing           238        3
## 3        19917   12/9/2021 7:43            Books           174        5
## 4        19917 10/18/2022 22:58      Electronics            41        2
## 5        21035   9/3/2022 16:41      Electronics           443        2
## 6        21035   6/15/2021 6:53      Electronics           116        3
## 7        21035  8/22/2021 14:09             Home           500        1
## 8        21035 11/28/2020 15:28             Home           249        1
## 9        21035  3/15/2020 18:45            Books           112        4
## 10       21035  1/27/2021 18:26      Electronics           457        2
## 11       21035   9/9/2023 18:11             Home           237        2
## 12       16825   7/21/2020 3:33         Clothing           489        1
## 13       16825  11/14/2022 2:19         Clothing           420        2
## 14       16825  4/19/2021 16:20            Books           395        3
## 15       16825  3/18/2023 15:00            Books           227        2
## 16       16825  8/16/2021 17:56         Clothing           166        4
## 17       34857   1/10/2021 9:21            Books           120        3
## 18       34857   6/13/2021 0:59             Home           104        2
## 19       34857    9/3/2021 7:26            Books           252        4
## 20       34857   2/6/2023 19:10      Electronics           284        4
##    Total.Purchase.Amount Payment.Method Customer.Age Returns   Customer.Name
## 1                   1378    Credit Card           34      NA    Joshua Davis
## 2                   1760    Credit Card           34      NA    Joshua Davis
## 3                   4517    Credit Card           34      NA    Joshua Davis
## 4                    743           Cash           34      NA    Joshua Davis
## 5                   1884    Credit Card           50       1    Peter Watson
## 6                   2334    Credit Card           50       0    Peter Watson
## 7                    698    Credit Card           50       1    Peter Watson
## 8                   2925    Credit Card           50       0    Peter Watson
## 9                    793         PayPal           50       0    Peter Watson
## 10                  2919         PayPal           50       1    Peter Watson
## 11                  1088           Cash           50       1    Peter Watson
## 12                  3184    Credit Card           37       0 Melissa Cabrera
## 13                   969           Cash           37       1 Melissa Cabrera
## 14                  3487           Cash           37       1 Melissa Cabrera
## 15                  3400           Cash           37       0 Melissa Cabrera
## 16                  1379         PayPal           37       0 Melissa Cabrera
## 17                  5279    Credit Card           57      NA Alexander Morse
## 18                  3319    Credit Card           57       0 Alexander Morse
## 19                  1038           Cash           57       0 Alexander Morse
## 20                   402         PayPal           57       0 Alexander Morse
##    Age Gender Churn
## 1   34   Male     1
## 2   34   Male     1
## 3   34   Male     1
## 4   34   Male     1
## 5   50 Female     1
## 6   50 Female     1
## 7   50 Female     1
## 8   50 Female     1
## 9   50 Female     1
## 10  50 Female     1
## 11  50 Female     1
## 12  37   Male     1
## 13  37   Male     1
## 14  37   Male     1
## 15  37   Male     1
## 16  37   Male     1
## 17  57   Male     1
## 18  57   Male     1
## 19  57   Male     1
## 20  57   Male     1

3. GROUPING & SUMMARIZATION

What Is The Total Purchase Amount Per Customer?

Database %>%
  group_by(Customer.ID) %>%
  summarise(Total_Spent = sum(Total.Purchase.Amount, na.rm = TRUE)) %>%
  arrange(desc(Total_Spent))%>%
  head(50)

## # A tibble: 50 × 2
##    Customer.ID Total_Spent
##          <int>       <int>
##  1       45089       42701
##  2       37332       37509
##  3       26378       36589
##  4       42650       35366
##  5       28317       34436
##  6       10191       33332
##  7       32089       32712
##  8       22062       31214
##  9       35164       30017
## 10       48573       29898
## # ℹ 40 more rows

What is the average purchase amount per product category?

Database %>%
  group_by(Product.Category) %>%
  summarise(Avg_Spend = mean(Total.Purchase.Amount, na.rm = TRUE)) %>%
  arrange(desc(Avg_Spend))

## # A tibble: 4 × 2
##   Product.Category Avg_Spend
##   <chr>                <dbl>
## 1 Home                 2820.
## 2 Electronics          2789.
## 3 Books                2740.
## 4 Clothing             2691.

How many purchases were made by gender?

Database %>% count(Gender, name = "Purchase_Count")

##   Gender Purchase_Count
## 1 Female           1015
## 2   Male            984

What is the most frequently purchased product category?

Database %>%
  count(Product.Category) %>%
  filter(n == max(n))

##   Product.Category   n
## 1      Electronics 515
## 2             Home 515

4. SORTING & RANKING

List the top 5 customers with the highest total purchase amount and how frequently do they shop?

Database %>%
  group_by(Customer.ID) %>%
  summarise(
    Total_Spent = sum(Total.Purchase.Amount, na.rm = TRUE),
    Frequency = n()
  ) %>%
  arrange(desc(Total_Spent)) %>%
  slice(1:5)

## # A tibble: 5 × 3
##   Customer.ID Total_Spent Frequency
##         <int>       <int>     <int>
## 1       45089       42701        11
## 2       37332       37509        10
## 3       26378       36589        12
## 4       42650       35366        13
## 5       28317       34436        12

Find the most expensive purchases in the dataset.

Database %>%
  arrange(desc(Total.Purchase.Amount)) %>%
  select(Customer.ID, Product.Category, Total.Purchase.Amount) %>%
  slice(1:5)

##   Customer.ID Product.Category Total.Purchase.Amount
## 1       41933      Electronics                  5333
## 2       45961             Home                  5321
## 3       41953         Clothing                  5319
## 4       18407         Clothing                  5313
## 5       17203             Home                  5300

Rank product categories by total revenue generated.

Database %>%
  group_by(Product.Category) %>%
  summarise(Revenue = sum(Total.Purchase.Amount, na.rm = TRUE)) %>%
  arrange(desc(Revenue))

## # A tibble: 4 × 2
##   Product.Category Revenue
##   <chr>              <int>
## 1 Home             1452304
## 2 Electronics      1436197
## 3 Books            1364446
## 4 Clothing         1267311

5. FEATURE ENGINEERING

Create a “Customer_Spend_Category” column based on average spending per order

Database<-Database %>%
  mutate(Customer_Spend_Category = case_when(
    Total.Purchase.Amount < 700 ~ "Low Spender",
    Total.Purchase.Amount >= 700 & Total.Purchase.Amount <= 5000 ~ "Moderate Spender",
    Total.Purchase.Amount > 5000 ~ "High Spender"
  ))
View(Database)

Assign a “Customer Loyalty Score” combining recency, frequency, and monetary value.

Build a “Purchase Frequency Score” based on the number of transactions per month.

Database_frequency_score<- Database %>%
  group_by(Customer.Name) %>%
  summarise(Monthly_Orders = n()) 
view(Database_frequency_score)

Compute “Repeat Purchase Rate” to measure customer retention.

Database %>%
  group_by(Customer.ID) %>%
  summarise(Purchase_Count = n()) %>%
  summarise(
    Repeat_Buyers = sum(Purchase_Count > 1),
    Total_Customers = n(),
    Repeat_Purchase_Rate = Repeat_Buyers / Total_Customers
  )

## # A tibble: 1 × 3
##   Repeat_Buyers Total_Customers Repeat_Purchase_Rate
##           <int>           <int>                <dbl>
## 1           327             329                0.994

6. ADVANCED INSIGHTS

Are repeat buyers more likely to purchase from specific product categories?

repeat_buyers <- Database %>%
  count(Customer.ID) %>%
  filter(n > 1)

Database %>%
  filter(Customer.ID %in% repeat_buyers$Customer.ID) %>%
  count(Product.Category, sort = TRUE)

##   Product.Category   n
## 1             Home 515
## 2      Electronics 514
## 3            Books 498
## 4         Clothing 470

Are weekends or weekdays more profitable in terms of sales volume?

Database %>%
  mutate(Day = weekdays(as.Date(Purchase.Date))) %>%
  group_by(Day) %>%
  summarise(Avg_Sales = mean(Total.Purchase.Amount, na.rm = TRUE))

## # A tibble: 8 × 2
##   Day       Avg_Sales
##   <chr>         <dbl>
## 1 Friday        2616.
## 2 Monday        2688.
## 3 Saturday      2752.
## 4 Sunday        2764.
## 5 Thursday      2881.
## 6 Tuesday       2600.
## 7 Wednesday     2828.
## 8 <NA>          2785.

Is there a significant difference in spending behavior based on payment methods?

Database %>%
  group_by(Payment.Method) %>%
  summarise(
    Avg_Spend = mean(Total.Purchase.Amount, na.rm = TRUE),
    Count = n()
  )

## # A tibble: 3 × 3
##   Payment.Method Avg_Spend Count
##   <chr>              <dbl> <int>
## 1 Cash               2773.   674
## 2 Credit Card        2795.   667
## 3 PayPal             2716.   658

What factors contribute most to customer churn?

Database %>%
  group_by(Customer.ID) %>%
  summarise(Frequency = n()) %>%
  mutate(Churned = ifelse(Frequency == 1, 1, 0)) %>%
  count(Churned) %>%
  mutate(Percentage = n / sum(n) * 100)

## # A tibble: 2 × 3
##   Churned     n Percentage
##     <dbl> <int>      <dbl>
## 1       0   327     99.4  
## 2       1     2      0.608

Section 2 Visualization

— Basic Data Preparation —

Convert Purchase Date (Handle potential parsing errors)

Database$Purchase.Date <- mdy_hm(Database$Purchase.Date, quiet = TRUE)

Ensure essential columns are the correct type

Database$Age <- suppressWarnings(as.numeric(as.character(Database$Age)))
Database$Product.Price <- suppressWarnings(as.numeric(as.character(Database$Product.Price)))
Database$Quantity <- suppressWarnings(as.numeric(as.character(Database$Quantity)))
Database$Total.Purchase.Amount <- suppressWarnings(as.numeric(as.character(Database$Total.Purchase.Amount)))
Database$Returns[Database$Returns == ""] <- NA # Handle blanks before numeric conversion
Database$Returns <- suppressWarnings(as.numeric(as.character(Database$Returns)))

Convert key categorical columns to factors

Database$Product.Category <- as.factor(Database$Product.Category)
Database$Payment.Method <- as.factor(Database$Payment.Method)
Database$Gender <- as.factor(Database$Gender)
Database$Churn <- as.factor(Database$Churn)

Remove rows with NA in essential numeric columns used frequently

Database <- Database %>% filter(!is.na(Age), !is.na(Total.Purchase.Amount), !is.na(Product.Price), !is.na(Quantity))

Create a unique customer view for demographic analysis

unique_customers <- Database %>% distinct(Customer.ID, .keep_all = TRUE)

Simple theme

simple_theme <- theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

1. Pairplot: Explore relationships between core numeric features.

df_subset_pairplot <- Database %>% sample_n(min(1000, nrow(.))) # Plot sample of 1000
pairplot_vars <- c("Age", "Product.Price", "Quantity", "Total.Purchase.Amount")

# Check if data exists before plotting
if(nrow(df_subset_pairplot) > 0) {
  plot1 <- ggpairs(df_subset_pairplot[, pairplot_vars],
                   title = "Pairwise Relationships of Key Numeric Variables (Sampled Data)") +
    simple_theme
  print(plot1)
} else {
  
}

Interpretation:By examining the scatterplots, we can visually assess if relationships appear linear, clustered, or random. For instance, we might look for a negative trend between Product.Price and Quantity (suggesting people buy fewer items as price increases) or a positive trend between Quantity and Total.Purchase.Amount. The density plots help understand the typical range and concentration of values for each metric. The correlation coefficients quantify the strength and direction of linear relationships (values near +1 or -1 indicate strong linear association, while values near 0 indicate weak linear association). This plot is excellent for quickly identifying potential multicollinearity or interesting pairwise relationships worthy of deeper investigation.

2. Bar Graph: How many transactions per Product Category?

plot2 <- ggplot(Database, aes(x = Product.Category, fill = Product.Category)) +
  geom_bar(show.legend = FALSE) +
  geom_text(stat='count', aes(label=..count..), vjust=-0.5, size=3) +
  labs(title = "Number of Transactions by Product Category",
       x = "Product Category",
       y = "Number of Transactions") 
print(plot2)

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Interpretation:This visualization allows for a direct comparison of transaction volume across categories. We can immediately identify which product categories are most frequently involved in transactions and which are less common. For example, if “Electronics” and “Home” have significantly taller bars than “Books” and “Clothing”, it indicates they drive a larger portion of the store’s transaction activity.

3. Histogram: What is the distribution of Total Purchase Amounts per transaction?

plot3 <- ggplot(Database, aes(x = Total.Purchase.Amount)) +
  geom_histogram(fill = "skyblue", color = "white", bins = 30) +
  labs(title = "Distribution of Total Purchase Amount per Transaction",
       x = "Total Purchase Amount ($)",
       y = "Frequency") +
  scale_x_continuous(labels = scales::dollar_format()) +  # <--- updated
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
  simple_theme
print(plot3)

Interpretation:The shape of the histogram reveals common spending patterns per transaction. Typically, e-commerce data shows a right-skewed distribution, meaning most transactions are for smaller amounts, with fewer transactions involving very high totals. We can identify the most frequent spending range (the tallest bar), the overall spread of transaction values, and the presence and extent of high-value outlier transactions (the long tail to the right, if present).

4. Box Plot: Purchase Amount by Payment Method (ANOVA Context)

plot4 <- ggplot(Database, aes(x = Payment.Method, y = Total.Purchase.Amount, fill = Payment.Method)) +
  geom_boxplot(show.legend = FALSE, outlier.shape = NA) + # Hide outliers for scale
  coord_cartesian(ylim = quantile(Database$Total.Purchase.Amount, c(0.05, 0.95), na.rm=TRUE)) + # Zoom
  labs(title = "Total Purchase Amount Distribution by Payment Method",
       subtitle = "Visual check for ANOVA (Zoomed 5th-95th percentile)",
       x = "Payment Method",
       y = "Total Purchase Amount ($)")+
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
  simple_theme
print(plot4)

Interpretation:This plot helps visually assess if typical spending per transaction differs based on how customers pay. We compare the median spending levels (middle lines) and the variability (height of the boxes/IQRs) across methods. For instance, if the median line for “Credit Card” is higher than for “Cash”, it suggests customers tend to spend more per transaction when using a credit card. Different box heights would indicate differences in spending consistency. The subtitle explicitly notes this is a visual precursor to ANOVA – if the medians look very different or the variances (approximated by box heights/spread) appear unequal, it strengthens the case for formal statistical testing. The zoom means we are focusing on the bulk of the data, ignoring extreme high/low values for this comparison.

5. Scatter Plot + Regression: Is there a linear relationship between Age and Total Purchase Amount?

simple_theme2 <- theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", color = "darkblue"),
    plot.subtitle = element_text(size = 12, color = "darkgray"),
    axis.title = element_text(size = 14, face = "bold", color = "black"),
    axis.text = element_text(size = 12, color = "black"),
    panel.grid.major = element_line(color = "lightgray", linewidth = 0.5),   # Updated to linewidth
    panel.grid.minor = element_line(color = "lightgray", linewidth = 0.25),  # Updated to linewidth
    panel.background = element_rect(fill = "white", color = "gray"),
    plot.background = element_rect(fill = "lightblue", color = "lightblue")
  )

plot6 <- ggplot(Database, aes(x = Age, y = Total.Purchase.Amount)) +
  geom_point(aes(color = Age), alpha = 0.7, size = 2) +  # Add color based on Age and adjust transparency
  geom_smooth(method = "lm", color = "red", se = TRUE, linetype = "solid", size = 1.5) + # Enhanced line
  scale_color_viridis_c() +  # Use a color scale for better visual appeal
  labs(title = "Customer Age vs. Total Purchase Amount",
       subtitle = "Red line = Linear Regression Fit",
       x = "Customer Age",
       y = "Total Purchase Amount ($)") +
  simple_theme2 +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Angle x-axis text for better readability

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

print(plot6)

## `geom_smooth()` using formula = 'y ~ x'

Interpretation:This visualization reveals temporal patterns in overall business activity. We can identify upward or downward trends in transaction volume over the observed period. Seasonal patterns become evident through recurring peaks (e.g., potential holiday season spikes) and troughs (e.g., post-holiday dips). The magnitude of these fluctuations indicates the strength of seasonality or impact of specific events (like promotions). The shaded area provides a quick visual gauge of the overall volume across time.

6. Bar Graph: What is the churn rate by Gender?

churn_by_gender <- unique_customers %>%
  group_by(Gender) %>%
  summarise(Churn.Rate = mean(Churn == "1", na.rm = TRUE), 
            Total.Customers = n(), .groups = 'drop') %>%
  filter(Total.Customers > 10) # Optional: filter small groups

if (nrow(churn_by_gender) > 0) {
  
  plot7 <- ggplot(churn_by_gender, aes(x = Gender, y = Churn.Rate, fill = Gender)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    geom_text(aes(label = scales::percent(Churn.Rate, accuracy = 0.1)), vjust = -0.5, size = 3.5) + # Show percentage labels
    labs(title = "Customer Churn Rate by Gender",
         x = "Gender",
         y = "Churn Rate") +
    scale_y_continuous(labels = scales::percent) +  # Y-axis as percentages
    simple_theme2 +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
  
  print(plot7)
  
} else {
  cat("Skipping Churn Rate by Gender Plot: Not enough unique customer data with gender/churn info.\n")
}

Interpretation:This plot allows for a direct visual comparison of customer retention between genders. If the bars are of noticeably different heights, it suggests one gender group might be more prone to churning than the other within this dataset. For example, a significantly taller bar for ‘Male’ would indicate a higher churn propensity among male customers compared to female customers. Similar bar heights suggest comparable churn rates. This can inform targeted retention strategies if significant differences exist.

7. Histogram: How are Product Prices distributed overall?

simple_theme2 <- theme_minimal()

day_counts <- Database %>%
  filter(!is.na(Purchase.Date)) %>%
  mutate(DayOfWeek = wday(Purchase.Date, label = TRUE, week_start = 1)) %>%
  filter(!is.na(DayOfWeek)) %>%
  count(DayOfWeek, name = "Transactions")

if (nrow(day_counts) > 0) {
  
  plot15 <- ggplot(day_counts, aes(x = DayOfWeek, y = Transactions, group = 1)) +
    geom_area(aes(fill = "Area"), alpha = 0.3, show.legend = FALSE) +   # Shaded area under the line
    geom_line(color = "steelblue", size = 1.2) +                        # Line
    geom_point(color = "steelblue", size = 3) +                         # Points
    geom_text(aes(label = Transactions), vjust = -1, size = 3) +        # Labels
    labs(title = "Total Transactions by Day of the Week",
         x = "Day of the Week",
         y = "Number of Transactions") +
    scale_y_continuous(labels = scales::comma) +
    simple_theme2 +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
  
  print(plot15)
  
} else {
  cat("Skipping Day of Week Plot: Not enough valid date data.\n")
}

Interpretation:The plot helps determine if there’s a linear association between age and spending per transaction. A widely scattered pattern suggests age is not a strong linear predictor of transaction amount. The slope of the red line indicates the direction and magnitude of the average linear trend (e.g., a slightly upward slope suggests older customers tend to spend slightly more on average, though individual transactions vary greatly). The width of the confidence interval indicates the uncertainty around the regression line – a wider band implies less confidence in the precise slope. The color gradient can help visually spot if certain age ranges cluster at higher/lower spending levels, though this might be subtle.

8. Box Plot: How does customer Age distribution compare for churned vs. non-churned customers?

plot9 <- ggplot(Database, aes(x = Churn, y = Customer.Age, fill = Churn)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Customer Age Distribution by Churn Status",
       x = "Churn Status (0 = Active, 1 = Churned)",
       y = "Customer Age") +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
  simple_theme
print(plot9)

Interpretation:This visualization highlights the weekly rhythm of purchasing behavior. We can easily identify the busiest and slowest shopping days. Peaks (e.g., often Friday/Saturday) indicate preferred shopping times, while troughs (e.g., often Sunday/Monday) show periods of lower activity. This pattern can inform staffing, server load management, and timing for promotional activities (e.g., running sales on typically slower days).

9. Scatter Plot: Relationship between Product Price and Quantity sold?

cat("\n--- 10: Scatter Plot (Product Price vs. Quantity) ---\n")

## 
## --- 10: Scatter Plot (Product Price vs. Quantity) ---

plot10 <- ggplot(Database, aes(x = Product.Price, y = Quantity,fill = Product.Category)) +
  geom_point(aes(color = Product.Category), alpha = 0.4, size = 1.5) + # Color by category
  labs(title = "Product Price vs. Quantity Purchased per Transaction",
       x = "Product Price ($)",
       y = "Quantity Purchased",
       color = "Product Category") + # Add legend title
  simple_theme
print(plot10)

Interpretation:This plot helps visualize if the age profile differs between the two churn groups based on the transactions recorded. We compare the median age (middle line), the age range of the middle 50% of transactions (IQR/box height), and the overall age spread for transactions linked to each group. If, for instance, the median age box for churned customers is lower than for non-churned, it might suggest younger customers are associated with more churned transactions. Important Caveat: Because this uses the main df (where customer age is repeated per transaction) and not unique_customers, it’s showing the age distribution of the transactions, not necessarily the age distribution of unique customers who churned. For a direct comparison of churned customer ages, the plot should ideally be based on the unique_customers dataframe.

10. Bar Graph: What proportion of transactions resulted in a return for each Product Category?

returns_by_cat <- Database %>%
  filter(!is.na(Returns)) %>%
  group_by(Product.Category) %>%
  summarise(
    Return.Rate = mean(Returns == 1),
    Total.Transactions = n(),
    .groups = 'drop'
  ) %>%
  filter(Total.Transactions > 10)

if (nrow(returns_by_cat) > 0) {
  plot11 <- ggplot(returns_by_cat, aes(x = reorder(Product.Category, -Return.Rate), y = Return.Rate, fill = Product.Category)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    geom_text(aes(label = scales::percent(Return.Rate, accuracy = 0.1)), vjust = -0.5, size = 3.5) +
    labs(
      title = "Return Rate by Product Category",
      x = "Product Category",
      y = "Return Rate"
    ) +
    scale_y_continuous(labels = scales::percent) +  # <--- important!
    theme(axis.text.x = element_text(angle = 45, hjust = 1))  # tilt x labels if needed
  print(plot11)
} else {
  cat("Skipping Return Rate by Category Plot: Not enough valid 'Returns' data found.\n")
}

11. Line Plot: How has the average Total Purchase Amount changed monthly?

monthly_avg_purchase <- Database %>%
  filter(!is.na(Purchase.Date)) %>%
  mutate(YearMonth = floor_date(Purchase.Date, "month")) %>%
  group_by(YearMonth) %>%
  summarise(Avg.Purchase = mean(Total.Purchase.Amount, na.rm = TRUE), .groups = 'drop') %>%
  filter(!is.na(YearMonth))

if(nrow(monthly_avg_purchase) > 0) {
  plot12 <- ggplot(monthly_avg_purchase, aes(x = YearMonth, y = Avg.Purchase)) +
    geom_line(color = "purple", linewidth = 1) +
    geom_point(color = "purple") +
    labs(title = "Monthly Average Total Purchase Amount Trend",
         x = "Month",
         y = "Average Total Purchase Amount ($)") +
    scale_x_datetime(date_labels = "%Y-%m", date_breaks = "3 months") +
    simple_theme
  print(plot12)
} else {
  cat("Skipping Average Purchase Trend Plot: Not enough valid date data.\n")
}

Interpretation: Unlike the transaction volume plot (Q5), this focuses on the value per transaction. We look for trends in average spending – are customers spending more or less per visit over time? Peaks could indicate periods where customers bought higher-value items or simply more items per transaction (cross-reference with Q5). Dips might suggest periods of promotional activity focusing on lower-priced goods or smaller basket sizes. This helps understand changes in customer spending behavior separate from just the number of visits/transactions.

12. Box Plot: Compare Total Purchase Amount distributions by Gender.

cat("\n--- 13: Box Plot (Purchase Amount by Gender) ---\n")

## 
## --- 13: Box Plot (Purchase Amount by Gender) ---

plot13 <- ggplot(Database, aes(x = Gender, y = Total.Purchase.Amount, fill = Gender)) +
  geom_boxplot(show.legend = FALSE, outlier.alpha = 0.1) + # Show outliers faintly
  labs(title = "Total Purchase Amount Distribution by Gender",
       x = "Gender",
       y = "Total Purchase Amount ($)") +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
  simple_theme
print(plot13)

13. Scatter Plot + Regression: How does Quantity relate to Total Purchase Amount?

plot14 <- ggplot(Database, aes(x = Quantity, y = Total.Purchase.Amount)) +
  geom_point(alpha = 0.2, size = 1, color = "brown") +
  geom_smooth(method = "lm", color = "orange", se = FALSE) +
  labs(title = "Quantity vs. Total Purchase Amount per Transaction",
       subtitle = "Orange line = Linear Regression Fit",
       x = "Quantity Purchased",
       y = "Total Purchase Amount ($)") +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
  simple_theme
print(plot14)

## `geom_smooth()` using formula = 'y ~ x'

Interpretation:This plot primarily investigates the correlation between the number of units bought and the recorded total spend for the transaction. A positive slope on the regression line is expected, indicating that buying more units generally leads to a higher total amount. The strength of this linear relationship is indicated by how tightly the points cluster around the line and the width of the confidence interval. Significant scatter suggests either a wide variation in product prices (so Quantity * Price varies greatly) or that Total.Purchase.Amount often reflects more than just that single line item’s quantity * price (e.g., it’s a total basket value). The intercept of the regression line (where Quantity = 0) ideally should be near $0 if the amount strictly represents Quantity * Price, but might be positive if it represents a total basket including other items.

KEY OBSERVATIONS

Transactional Insights

Significant variations were observed in transaction frequency and average purchase amounts across different Product Categories and Payment Methods.
Temporal analysis revealed distinct patterns in monthly transaction volumes and average purchase values, suggesting potential seasonality or market shifts.
Purchase activity showed variation across different days of the week.

Customer Demographics

The analysis highlighted the age distribution of the customer base and examined how purchasing behavior (e.g., Total Purchase Amount) and churn status potentially differ across age groups and gender.
While visual differences were noted, these warrant further statistical investigation.

Correlations and Relationships

Scatter plots and pair plots explored the interrelationships between key numeric variables like Product Price, Quantity, Age, and Total Purchase Amount.
Linear regression lines were overlaid to visualize potential trends, indicating varying degrees of association that could inform pricing or upselling strategies.
The relationship between Quantity and Total Purchase Amount particularly requires clarification regarding the definition of the latter.

Churn & Returns

Preliminary visual comparisons suggested potential differences in age distribution and purchasing habits (inferred from category preference or spending) between churned and non-churned customers.
Return rates appeared to vary by Product Category and Payment Method, offering potential areas for operational improvement or fraud detection.

Conclusion

This visual analysis of the e-commerce customer dataset, utilizing R and the ggplot2 and GGally packages, provided valuable initial insights into transactional patterns, customer demographics, behavioral trends, and potential drivers of churn.

Through a series of targeted visualizations—including histograms, bar graphs, scatter plots, box plots, line plots, and pair plots—we explored distributions, comparisons, correlations, and time-based trends within the data.

This R-based visualization exercise has effectively surfaced key characteristics and potential areas of interest within the e-commerce dataset. It serves as a critical foundation for more rigorous statistical analysis and data-driven decision-making aimed at optimizing customer experience, retention, and profitability.

E-Commerce Customer Behavior

Takudzwanashe Chigwaya & Sanskriti choudhury

2025-04-29