This project uses an e-commerce customer dataset containing 1,999 records across 13 attributes. The dataset captures detailed information about customer transactions, including purchase details, product categories, payment methods, returns, and churn status.
The objective of this project is to analyze customer purchasing behavior, identify key factors influencing product returns and customer churn, and generate insights to support business strategies for customer retention and revenue growth.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(lubridate)
library(scales)
## Warning: package 'scales' was built under R version 4.4.3
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(readr)
Database <- read.csv("C:\\Users\\My Pc\\Desktop\\ecommerce_customer_data_large.csv")
View(Database)
library(dplyr)
Database %>%
summarise(Unique_Customers = n_distinct(Customer.ID))
## Unique_Customers
## 1 329
INTERPRETATION:The dataset contains 329 unique customers, based on distinct Customer.ID values. This indicates that purchase or interaction records come from 329 individual users.
Database %>% summarise(Total_Purchases = n())
## Total_Purchases
## 1 1999
INTERPRETATION:TOTAL purchase record for this dataset is 1999
Database %>% distinct(Product.Category)
## Product.Category
## 1 Home
## 2 Electronics
## 3 Books
## 4 Clothing
INTERPRETAION:The dataset includes products from four distinct categories: Home, Electronics, Books, and Clothing, indicating a diverse range of items purchased by customers.
Database %>% count(Product.Category, sort = TRUE)
## Product.Category n
## 1 Electronics 515
## 2 Home 515
## 3 Books 498
## 4 Clothing 471
INTERPRETATION:The most commonly purchased product categories are Electronics and Home, each with 515 purchases, followed by Books (498) and Clothing (471).
Database %>%
summarise(
Start_Date = min(Purchase.Date, na.rm = TRUE),
End_Date = max(Purchase.Date, na.rm = TRUE)
)
## Start_Date End_Date
## 1 1/1/2020 15:05 9/9/2023 4:57
INTERPRETATION:The purchase data spans a time range from January 1, 2020, 15:05 to September 9, 2023, 04:57, indicating nearly 3 years and 9 months of recorded transactions.
#Retrieve all purchases made by a specific customer.
Database %>% filter(Customer.ID == "44605") # Replace with real ID
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 44605 5/3/2023 21:30 Home 177 1
## 2 44605 5/16/2021 13:57 Electronics 174 3
## 3 44605 7/13/2020 6:16 Books 413 1
## 4 44605 1/17/2023 13:14 Electronics 396 3
## 5 44605 5/1/2021 11:29 Books 259 4
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name Age
## 1 2427 PayPal 31 1 John Rivera 31
## 2 2448 PayPal 31 1 John Rivera 31
## 3 2345 Credit Card 31 1 John Rivera 31
## 4 937 Cash 31 0 John Rivera 31
## 5 2598 PayPal 31 1 John Rivera 31
## Gender Churn
## 1 Female 0
## 2 Female 0
## 3 Female 0
## 4 Female 0
## 5 Female 0
*INTERPRETATION:Customer 44605 made multiple purchases between 2020 and 2023, including:
They tend to purchase items in multiple quantities (e.g., 3 Electronics units twice and 4 Books once), indicating strong buying behavior and diverse interests.*
average_spending_by_payment <- Database %>%
group_by(Payment.Method) %>%
summarise(
AverageTotalAmount = mean(Total.Purchase.Amount, na.rm = TRUE),
NumberOfTransactions = n() # Good practice to check sample size
)
print(average_spending_by_payment)
## # A tibble: 3 × 3
## Payment.Method AverageTotalAmount NumberOfTransactions
## <chr> <dbl> <int>
## 1 Cash 2773. 674
## 2 Credit Card 2795. 667
## 3 PayPal 2716. 658
Database %>% filter(Total.Purchase.Amount > 3000)%>%
head(20)
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 13738 8/25/2022 6:48 Home 191 3
## 2 13738 2/9/2023 0:53 Electronics 40 4
## 3 33969 2/28/2023 19:58 Clothing 410 3
## 4 33969 1/5/2023 11:15 Home 304 1
## 5 33969 7/18/2023 23:36 Books 54 2
## 6 33969 3/7/2020 21:31 Books 281 1
## 7 33969 7/21/2022 4:25 Home 193 2
## 8 42650 10/18/2020 23:38 Books 127 5
## 9 42650 5/17/2020 17:02 Home 284 2
## 10 42650 3/18/2022 13:52 Electronics 256 2
## 11 42650 1/26/2022 12:50 Electronics 105 2
## 12 42650 11/11/2020 7:19 Electronics 193 2
## 13 42650 8/1/2021 5:27 Electronics 30 3
## 14 42650 3/24/2022 0:05 Electronics 254 1
## 15 42650 9/2/2022 4:00 Books 12 3
## 16 24053 11/17/2020 13:14 Electronics 102 2
## 17 24053 1/11/2022 10:56 Books 210 2
## 18 19676 1/12/2022 22:49 Clothing 304 3
## 19 19676 6/26/2021 23:49 Electronics 329 1
## 20 19917 12/9/2021 7:43 Books 174 5
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name Age
## 1 3722 Credit Card 27 1 Lauren Johnson 27
## 2 4327 Cash 27 0 Lauren Johnson 27
## 3 5018 Credit Card 27 NA Carol Allen 27
## 4 3883 PayPal 27 1 Carol Allen 27
## 5 4187 PayPal 27 0 Carol Allen 27
## 6 3810 Cash 27 0 Carol Allen 27
## 7 3198 Credit Card 27 0 Carol Allen 27
## 8 3347 Cash 20 0 Curtis Smith 20
## 9 3531 Credit Card 20 1 Curtis Smith 20
## 10 3548 Credit Card 20 0 Curtis Smith 20
## 11 3721 Credit Card 20 NA Curtis Smith 20
## 12 3266 PayPal 20 1 Curtis Smith 20
## 13 5024 Credit Card 20 1 Curtis Smith 20
## 14 3865 PayPal 20 0 Curtis Smith 20
## 15 3112 Cash 20 1 Curtis Smith 20
## 16 5206 Credit Card 53 1 Jose Green 53
## 17 4567 Credit Card 53 0 Jose Green 53
## 18 4869 PayPal 57 1 Linda Lee 57
## 19 3391 Cash 57 1 Linda Lee 57
## 20 4517 Credit Card 34 NA Joshua Davis 34
## Gender Churn
## 1 Female 0
## 2 Female 0
## 3 Male 0
## 4 Male 0
## 5 Male 0
## 6 Male 0
## 7 Male 0
## 8 Female 0
## 9 Female 0
## 10 Female 0
## 11 Female 0
## 12 Female 0
## 13 Female 0
## 14 Female 0
## 15 Female 0
## 16 Male 0
## 17 Male 0
## 18 Male 0
## 19 Male 0
## 20 Male 1
INTERPRETATION:This filtered dataset shows the top 20 transactions where the Total Purchase Amount exceeded 3000. High-value purchases involve multiple customers (e.g., 13738, 33969, 42650) and span various categories such as Home, Electronics, Books, and Clothing, often with higher quantities or expensive items, indicating significant spending behavior.
Database %>% filter(Product.Category == "Electronics")%>%
head(20)
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 44605 5/16/2021 13:57 Electronics 174 3
## 2 44605 1/17/2023 13:14 Electronics 396 3
## 3 13738 7/25/2023 5:17 Electronics 205 1
## 4 13738 2/9/2023 0:53 Electronics 40 4
## 5 33969 12/20/2021 23:44 Electronics 428 4
## 6 42650 3/18/2022 13:52 Electronics 256 2
## 7 42650 1/26/2022 12:50 Electronics 105 2
## 8 42650 11/11/2020 7:19 Electronics 193 2
## 9 42650 8/1/2021 5:27 Electronics 30 3
## 10 42650 5/24/2020 6:45 Electronics 307 3
## 11 42650 3/24/2022 0:05 Electronics 254 1
## 12 24053 11/17/2020 13:14 Electronics 102 2
## 13 24053 7/5/2020 6:30 Electronics 364 3
## 14 19676 6/26/2021 23:49 Electronics 329 1
## 15 19676 4/20/2020 14:08 Electronics 165 5
## 16 19676 3/23/2021 17:36 Electronics 218 4
## 17 19917 10/18/2022 22:58 Electronics 41 2
## 18 23734 9/21/2022 19:10 Electronics 75 1
## 19 16921 4/22/2021 22:10 Electronics 500 2
## 20 7796 10/22/2022 12:12 Electronics 145 2
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name
## 1 2448 PayPal 31 1 John Rivera
## 2 937 Cash 31 0 John Rivera
## 3 2773 Credit Card 27 NA Lauren Johnson
## 4 4327 Cash 27 0 Lauren Johnson
## 5 2289 Cash 27 0 Carol Allen
## 6 3548 Credit Card 20 0 Curtis Smith
## 7 3721 Credit Card 20 NA Curtis Smith
## 8 3266 PayPal 20 1 Curtis Smith
## 9 5024 Credit Card 20 1 Curtis Smith
## 10 973 PayPal 20 NA Curtis Smith
## 11 3865 PayPal 20 0 Curtis Smith
## 12 5206 Credit Card 53 1 Jose Green
## 13 1008 Credit Card 53 0 Jose Green
## 14 3391 Cash 57 1 Linda Lee
## 15 603 Credit Card 57 0 Linda Lee
## 16 2866 PayPal 57 NA Linda Lee
## 17 743 Cash 34 NA Joshua Davis
## 18 289 PayPal 18 0 Cheryl Espinoza
## 19 988 Cash 54 1 Cheyenne James
## 20 2134 PayPal 21 0 Lisa Dennis
## Age Gender Churn
## 1 31 Female 0
## 2 31 Female 0
## 3 27 Female 0
## 4 27 Female 0
## 5 27 Male 0
## 6 20 Female 0
## 7 20 Female 0
## 8 20 Female 0
## 9 20 Female 0
## 10 20 Female 0
## 11 20 Female 0
## 12 53 Male 0
## 13 53 Male 0
## 14 57 Male 0
## 15 57 Male 0
## 16 57 Male 0
## 17 34 Male 1
## 18 18 Male 0
## 19 54 Male 0
## 20 21 Male 0
INTERPRETATION:This table displays all purchase transactions where the product category is Electronics. Multiple customers (e.g., 44605, 13738, 42650) have made purchases across different dates, with varying product prices and quantities. The data helps analyze consumer interest and purchase behavior specifically for electronics over time.
Database %>%
filter(Churn==1)%>%
head(20)
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 19917 5/16/2023 1:34 Clothing 392 3
## 2 19917 10/27/2021 16:25 Clothing 238 3
## 3 19917 12/9/2021 7:43 Books 174 5
## 4 19917 10/18/2022 22:58 Electronics 41 2
## 5 21035 9/3/2022 16:41 Electronics 443 2
## 6 21035 6/15/2021 6:53 Electronics 116 3
## 7 21035 8/22/2021 14:09 Home 500 1
## 8 21035 11/28/2020 15:28 Home 249 1
## 9 21035 3/15/2020 18:45 Books 112 4
## 10 21035 1/27/2021 18:26 Electronics 457 2
## 11 21035 9/9/2023 18:11 Home 237 2
## 12 16825 7/21/2020 3:33 Clothing 489 1
## 13 16825 11/14/2022 2:19 Clothing 420 2
## 14 16825 4/19/2021 16:20 Books 395 3
## 15 16825 3/18/2023 15:00 Books 227 2
## 16 16825 8/16/2021 17:56 Clothing 166 4
## 17 34857 1/10/2021 9:21 Books 120 3
## 18 34857 6/13/2021 0:59 Home 104 2
## 19 34857 9/3/2021 7:26 Books 252 4
## 20 34857 2/6/2023 19:10 Electronics 284 4
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name
## 1 1378 Credit Card 34 NA Joshua Davis
## 2 1760 Credit Card 34 NA Joshua Davis
## 3 4517 Credit Card 34 NA Joshua Davis
## 4 743 Cash 34 NA Joshua Davis
## 5 1884 Credit Card 50 1 Peter Watson
## 6 2334 Credit Card 50 0 Peter Watson
## 7 698 Credit Card 50 1 Peter Watson
## 8 2925 Credit Card 50 0 Peter Watson
## 9 793 PayPal 50 0 Peter Watson
## 10 2919 PayPal 50 1 Peter Watson
## 11 1088 Cash 50 1 Peter Watson
## 12 3184 Credit Card 37 0 Melissa Cabrera
## 13 969 Cash 37 1 Melissa Cabrera
## 14 3487 Cash 37 1 Melissa Cabrera
## 15 3400 Cash 37 0 Melissa Cabrera
## 16 1379 PayPal 37 0 Melissa Cabrera
## 17 5279 Credit Card 57 NA Alexander Morse
## 18 3319 Credit Card 57 0 Alexander Morse
## 19 1038 Cash 57 0 Alexander Morse
## 20 402 PayPal 57 0 Alexander Morse
## Age Gender Churn
## 1 34 Male 1
## 2 34 Male 1
## 3 34 Male 1
## 4 34 Male 1
## 5 50 Female 1
## 6 50 Female 1
## 7 50 Female 1
## 8 50 Female 1
## 9 50 Female 1
## 10 50 Female 1
## 11 50 Female 1
## 12 37 Male 1
## 13 37 Male 1
## 14 37 Male 1
## 15 37 Male 1
## 16 37 Male 1
## 17 57 Male 1
## 18 57 Male 1
## 19 57 Male 1
## 20 57 Male 1
INTERPRETATION:This R code snippet filters customers who have
churned (i.e., Churn == 1) from the Database
and displays the first 20 records. The previewed dataset shows
customers’ purchase history, including Customer.ID,
Purchase.Date, Product.Category,
Product.Price, and Quantity. It highlights
behavior patterns of users who added items to their cart but abandoned
it before completing the purchase, which is useful for churn analysis
and retention strategies.
Database %>%
group_by(Customer.ID) %>%
summarise(Total_Spent = sum(Total.Purchase.Amount, na.rm = TRUE)) %>%
arrange(desc(Total_Spent))%>%
head(50)
## # A tibble: 50 × 2
## Customer.ID Total_Spent
## <int> <int>
## 1 45089 42701
## 2 37332 37509
## 3 26378 36589
## 4 42650 35366
## 5 28317 34436
## 6 10191 33332
## 7 32089 32712
## 8 22062 31214
## 9 35164 30017
## 10 48573 29898
## # ℹ 40 more rows
INTERPRETATION:This R code performs grouping and summarization on the dataset to calculate the total purchase amount per customer. It groups the data by Customer.ID, sums up the Total.Purchase.Amount for each customer , and arranges the results in descending order of total spending. The output shows the top 50 customers based on their total spending, with the highest spender having spent 42,701 units. This helps in identifying high-value customers for loyalty programs or targeted marketing.
Database %>%
group_by(Product.Category) %>%
summarise(Avg_Spend = mean(Total.Purchase.Amount, na.rm = TRUE)) %>%
arrange(desc(Avg_Spend))
## # A tibble: 4 × 2
## Product.Category Avg_Spend
## <chr> <dbl>
## 1 Home 2820.
## 2 Electronics 2789.
## 3 Books 2740.
## 4 Clothing 2691.
INTERPRETATION:The Home category has the highest average spend (₹2820), followed closely by Electronics (₹2788), Books (₹2739), and Clothing (₹2690).This suggests customers tend to spend more per transaction on home and electronic products compared to clothing.
Database %>% count(Gender, name = "Purchase_Count")
## Gender Purchase_Count
## 1 Female 1015
## 2 Male 984
INTERPRETATION:Female customers made slightly more purchases (1015) than male customers (984).The purchasing behavior appears balanced across genders, with a marginally higher contribution from females.
Database %>%
count(Product.Category) %>%
filter(n == max(n))
## Product.Category n
## 1 Electronics 515
## 2 Home 515
INTERPRETATION:Both Electronics and Home categories are equally the most frequently purchased, with 515 purchases each.This indicates a strong consumer interest in these two categories.
Database %>%
group_by(Customer.ID) %>%
summarise(
Total_Spent = sum(Total.Purchase.Amount, na.rm = TRUE),
Frequency = n()
) %>%
arrange(desc(Total_Spent)) %>%
slice(1:5)
## # A tibble: 5 × 3
## Customer.ID Total_Spent Frequency
## <int> <int> <int>
## 1 45089 42701 11
## 2 37332 37509 10
## 3 26378 36589 12
## 4 42650 35366 13
## 5 28317 34436 12
INTERPRETATION:The table lists the top 5 customers based on total spending.The highest spender (Customer.ID 45089) spent ₹42,701 across 11 purchases, followed closely by others. This analysis shows not only who spends the most, but also how often they shop, helping in identifying loyal and high-value customers for special engagement strategies.
Database %>%
arrange(desc(Total.Purchase.Amount)) %>%
select(Customer.ID, Product.Category, Total.Purchase.Amount) %>%
slice(1:5)
## Customer.ID Product.Category Total.Purchase.Amount
## 1 41933 Electronics 5333
## 2 45961 Home 5321
## 3 41953 Clothing 5319
## 4 18407 Clothing 5313
## 5 17203 Home 5300
INTERPRETATION:The top 5 most expensive individual purchases range from ₹5,300 to ₹5,333, involving categories like:Electronics, Home, and Clothing,The most expensive purchase was made by Customer ID 41933 on an Electronics item.This helps identify high-value transactions and possibly premium products or big-ticket items.
Database %>%
group_by(Product.Category) %>%
summarise(Revenue = sum(Total.Purchase.Amount, na.rm = TRUE)) %>%
arrange(desc(Revenue))
## # A tibble: 4 × 2
## Product.Category Revenue
## <chr> <int>
## 1 Home 1452304
## 2 Electronics 1436197
## 3 Books 1364446
## 4 Clothing 1267311
INTERPRETATION:Categories ranked by total revenue generated are:Home–₹1,452,304, Electronics – ₹1,436,197,Books – ₹1,364,446,Clothing – ₹1,267,311.Although Home and Electronics have similar purchase counts (seen earlier), Home slightly leads in revenue, suggesting higher average prices or larger quantity purchases in that category.
Database<-Database %>%
mutate(Customer_Spend_Category = case_when(
Total.Purchase.Amount < 700 ~ "Low Spender",
Total.Purchase.Amount >= 700 & Total.Purchase.Amount <= 5000 ~ "Moderate Spender",
Total.Purchase.Amount > 5000 ~ "High Spender"
))
View(Database)
Database_frequency_score<- Database %>%
group_by(Customer.Name) %>%
summarise(Monthly_Orders = n())
print(Database_frequency_score)
## # A tibble: 328 × 2
## Customer.Name Monthly_Orders
## <chr> <int>
## 1 Aaron Butler 7
## 2 Aaron Jacobs 5
## 3 Abigail Lopez 3
## 4 Adriana Parrish 5
## 5 Albert Suarez 11
## 6 Alexander Morse 8
## 7 Alicia Shaw 4
## 8 Alisha Juarez 4
## 9 Amanda Best 6
## 10 Amanda Heath 9
## # ℹ 318 more rows
Database %>%
group_by(Customer.ID) %>%
summarise(Purchase_Count = n()) %>%
summarise(
Repeat_Buyers = sum(Purchase_Count > 1),
Total_Customers = n(),
Repeat_Purchase_Rate = Repeat_Buyers / Total_Customers
)
## # A tibble: 1 × 3
## Repeat_Buyers Total_Customers Repeat_Purchase_Rate
## <int> <int> <dbl>
## 1 327 329 0.994
INTERPRETATION:Nearly all customers (99.4%) returned for additional purchases, which is an exceptionally high customer retention rate. This suggests:High customer satisfaction,Strong product-market fit,Possibly effective loyalty or engagement strategy
repeat_buyers <- Database %>%
count(Customer.ID) %>%
filter(n > 1)
Database %>%
filter(Customer.ID %in% repeat_buyers$Customer.ID) %>%
count(Product.Category, sort = TRUE)
## Product.Category n
## 1 Home 515
## 2 Electronics 514
## 3 Books 498
## 4 Clothing 470
INTERPRETATION:Repeat buyers are fairly evenly distributed across categories.Slightly higher tendency toward Home and Electronics, suggesting:These categories might offer more frequent-use or upgradeable products.Or they could be part of ongoing customer needs, prompting repeat purchases.
Database %>%
mutate(Day = weekdays(as.Date(Purchase.Date))) %>%
group_by(Day) %>%
summarise(Avg_Sales = mean(Total.Purchase.Amount, na.rm = TRUE))
## # A tibble: 8 × 2
## Day Avg_Sales
## <chr> <dbl>
## 1 Friday 2616.
## 2 Monday 2688.
## 3 Saturday 2752.
## 4 Sunday 2764.
## 5 Thursday 2881.
## 6 Tuesday 2600.
## 7 Wednesday 2828.
## 8 <NA> 2785.
INTERPRETATION:Weekdays are slightly more profitable overall, especially Wednesday and Thursday, which top the sales chart.Weekends (Saturday & Sunday) also perform strongly, but not quite as high.Tuesday and Friday have the lowest average sales.The NA row likely represents records with missing or improperly formatted dates.
Database %>%
group_by(Payment.Method) %>%
summarise(
Avg_Spend = mean(Total.Purchase.Amount, na.rm = TRUE),
Count = n()
)
## # A tibble: 3 × 3
## Payment.Method Avg_Spend Count
## <chr> <dbl> <int>
## 1 Cash 2773. 674
## 2 Credit Card 2795. 667
## 3 PayPal 2716. 658
INTERPRETATION:Credit card users tend to spend the most on average, followed closely by cash.PayPal users spend the least on average.The distribution of counts across methods is relatively even, making this comparison fair. Promotions or offers targeted at credit card users could potentially yield higher transaction values.
Database %>%
group_by(Customer.ID) %>%
summarise(Frequency = n()) %>%
mutate(Churned = ifelse(Frequency == 1, 1, 0)) %>%
count(Churned) %>%
mutate(Percentage = n / sum(n) * 100)
## # A tibble: 2 × 3
## Churned n Percentage
## <dbl> <int> <dbl>
## 1 0 327 99.4
## 2 1 2 0.608
INTERPRETATION:The churn rate is very low — only 0.61% of customers made just one purchase and never returned.A whopping 99.39% of customers are repeat buyers — indicating strong retention.Whatever strategies are in place for customer engagement and satisfaction are highly effective. Continue emphasizing loyalty programs, customer service, and relevant product offerings.
Database$Purchase.Date <- mdy_hm(Database$Purchase.Date, quiet = TRUE)
Database$Age <- suppressWarnings(as.numeric(as.character(Database$Age)))
Database$Product.Price <- suppressWarnings(as.numeric(as.character(Database$Product.Price)))
Database$Quantity <- suppressWarnings(as.numeric(as.character(Database$Quantity)))
Database$Total.Purchase.Amount <- suppressWarnings(as.numeric(as.character(Database$Total.Purchase.Amount)))
Database$Returns[Database$Returns == ""] <- NA # Handle blanks before numeric conversion
Database$Returns <- suppressWarnings(as.numeric(as.character(Database$Returns)))
Database$Product.Category <- as.factor(Database$Product.Category)
Database$Payment.Method <- as.factor(Database$Payment.Method)
Database$Gender <- as.factor(Database$Gender)
Database$Churn <- as.factor(Database$Churn)
Database <- Database %>% filter(!is.na(Age), !is.na(Total.Purchase.Amount), !is.na(Product.Price), !is.na(Quantity))
unique_customers <- Database %>% distinct(Customer.ID, .keep_all = TRUE)
simple_theme <- theme_minimal(base_size = 11) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom"
)
df_subset_pairplot <- Database %>% sample_n(min(1000, nrow(.))) # Plot sample of 1000
pairplot_vars <- c("Age", "Product.Price", "Quantity", "Total.Purchase.Amount")
# Check if data exists before plotting
if(nrow(df_subset_pairplot) > 0) {
plot1 <- ggpairs(df_subset_pairplot[, pairplot_vars],
title = "Pairwise Relationships of Key Numeric Variables (Sampled Data)") +
simple_theme
print(plot1)
} else {
}
INTERPRETAION:By examining the scatterplots, we can visually assess
if relationships appear linear, clustered, or random. For instance, we
might look for a negative trend between Product.Price and Quantity
(suggesting people buy fewer items as price increases) or a positive
trend between Quantity and Total.Purchase.Amount. The density plots help
understand the typical range and concentration of values for each
metric. The correlation coefficients quantify the strength and direction
of linear relationships (values near +1 or -1 indicate strong linear
association, while values near 0 indicate weak linear association). This
plot is excellent for quickly identifying potential multicollinearity or
interesting pairwise relationships worthy of deeper
investigation.
plot2 <- ggplot(Database, aes(x = Product.Category, fill = Product.Category)) +
geom_bar(show.legend = FALSE) +
geom_text(stat='count', aes(label=..count..), vjust=-0.5, size=3) +
labs(title = "Number of Transactions by Product Category",
x = "Product Category",
y = "Number of Transactions")
print(plot2)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
INTERPRETAION:This visualization allows for a direct comparison of
transaction volume across categories. We can immediately identify which
product categories are most frequently involved in transactions and
which are less common. For example, if “Electronics” and “Home” have
significantly taller bars than “Books” and “Clothing”, it indicates they
drive a larger portion of the store’s transaction activity.
plot3 <- ggplot(Database, aes(x = Total.Purchase.Amount)) +
geom_histogram(fill = "skyblue", color = "white", bins = 30) +
labs(title = "Distribution of Total Purchase Amount per Transaction",
x = "Total Purchase Amount ($)",
y = "Frequency") +
scale_x_continuous(labels = scales::dollar_format()) + # <--- updated
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot3)
INTERPRETAION:The shape of the histogram reveals common spending
patterns per transaction. Typically, e-commerce data shows a
right-skewed distribution, meaning most transactions are for smaller
amounts, with fewer transactions involving very high totals. We can
identify the most frequent spending range (the tallest bar), the overall
spread of transaction values, and the presence and extent of high-value
outlier transactions (the long tail to the right, if present).
?
plot4 <- ggplot(Database, aes(x = Payment.Method, y = Total.Purchase.Amount, fill = Payment.Method)) +
geom_boxplot(show.legend = FALSE, outlier.shape = NA) + # Hide outliers for scale
coord_cartesian(ylim = quantile(Database$Total.Purchase.Amount, c(0.05, 0.95), na.rm=TRUE)) + # Zoom
labs(title = "Total Purchase Amount Distribution by Payment Method",
subtitle = "Visual check for ANOVA (Zoomed 5th-95th percentile)",
x = "Payment Method",
y = "Total Purchase Amount ($)")+
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot4)
INTERPRETAION:This plot helps visually assess if typical spending
per transaction differs based on how customers pay. We compare the
median spending levels (middle lines) and the variability (height of the
boxes/IQRs) across methods. For instance, if the median line for “Credit
Card” is higher than for “Cash”, it suggests customers tend to spend
more per transaction when using a credit card. Different box heights
would indicate differences in spending consistency. The subtitle
explicitly notes this is a visual precursor to ANOVA – if the medians
look very different or the variances (approximated by box
heights/spread) appear unequal, it strengthens the case for formal
statistical testing. The zoom means we are focusing on the bulk of the
data, ignoring extreme high/low values for this comparison.
simple_theme2 <- theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", color = "darkblue"),
plot.subtitle = element_text(size = 12, color = "darkgray"),
axis.title = element_text(size = 14, face = "bold", color = "black"),
axis.text = element_text(size = 12, color = "black"),
panel.grid.major = element_line(color = "lightgray", linewidth = 0.5), # Updated to linewidth
panel.grid.minor = element_line(color = "lightgray", linewidth = 0.25), # Updated to linewidth
panel.background = element_rect(fill = "white", color = "gray"),
plot.background = element_rect(fill = "lightblue", color = "lightblue")
)
plot6 <- ggplot(Database, aes(x = Age, y = Total.Purchase.Amount)) +
geom_point(aes(color = Age), alpha = 0.7, size = 2) + # Add color based on Age and adjust transparency
geom_smooth(method = "lm", color = "red", se = TRUE, linetype = "solid", size = 1.5) + # Enhanced line
scale_color_viridis_c() + # Use a color scale for better visual appeal
labs(title = "Customer Age vs. Total Purchase Amount",
subtitle = "Red line = Linear Regression Fit",
x = "Customer Age",
y = "Total Purchase Amount ($)") +
simple_theme2 +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Angle x-axis text for better readability
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(plot6)
## `geom_smooth()` using formula = 'y ~ x'
INTERPRETAION:This visualization reveals temporal patterns in
overall business activity. We can identify upward or downward trends in
transaction volume over the observed period. Seasonal patterns become
evident through recurring peaks (e.g., potential holiday season spikes)
and troughs (e.g., post-holiday dips). The magnitude of these
fluctuations indicates the strength of seasonality or impact of specific
events (like promotions). The shaded area provides a quick visual gauge
of the overall volume across time.
churn_by_gender <- unique_customers %>%
group_by(Gender) %>%
summarise(Churn.Rate = mean(Churn == "1", na.rm = TRUE),
Total.Customers = n(), .groups = 'drop') %>%
filter(Total.Customers > 10) # Optional: filter small groups
if (nrow(churn_by_gender) > 0) {
plot7 <- ggplot(churn_by_gender, aes(x = Gender, y = Churn.Rate, fill = Gender)) +
geom_bar(stat = "identity", show.legend = FALSE) +
geom_text(aes(label = scales::percent(Churn.Rate, accuracy = 0.1)), vjust = -0.5, size = 3.5) + # Show percentage labels
labs(title = "Customer Churn Rate by Gender",
x = "Gender",
y = "Churn Rate") +
scale_y_continuous(labels = scales::percent) + # Y-axis as percentages
simple_theme2 +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
print(plot7)
} else {
cat("Skipping Churn Rate by Gender Plot: Not enough unique customer data with gender/churn info.\n")
}
INTERPRETAION:This plot allows for a direct visual comparison of
customer retention between genders. If the bars are of noticeably
different heights, it suggests one gender group might be more prone to
churning than the other within this dataset. For example, a
significantly taller bar for ‘Male’ would indicate a higher churn
propensity among male customers compared to female customers. Similar
bar heights suggest comparable churn rates. This can inform targeted
retention strategies if significant differences exist.
simple_theme2 <- theme_minimal()
day_counts <- Database %>%
filter(!is.na(Purchase.Date)) %>%
mutate(DayOfWeek = wday(Purchase.Date, label = TRUE, week_start = 1)) %>%
filter(!is.na(DayOfWeek)) %>%
count(DayOfWeek, name = "Transactions")
if (nrow(day_counts) > 0) {
plot15 <- ggplot(day_counts, aes(x = DayOfWeek, y = Transactions, group = 1)) +
geom_area(aes(fill = "Area"), alpha = 0.3, show.legend = FALSE) + # Shaded area under the line
geom_line(color = "steelblue", size = 1.2) + # Line
geom_point(color = "steelblue", size = 3) + # Points
geom_text(aes(label = Transactions), vjust = -1, size = 3) + # Labels
labs(title = "Total Transactions by Day of the Week",
x = "Day of the Week",
y = "Number of Transactions") +
scale_y_continuous(labels = scales::comma) +
simple_theme2 +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
print(plot15)
} else {
cat("Skipping Day of Week Plot: Not enough valid date data.\n")
}
INTERPRETAION:The plot helps determine if there’s a linear
association between age and spending per transaction. A widely scattered
pattern suggests age is not a strong linear predictor of transaction
amount. The slope of the red line indicates the direction and magnitude
of the average linear trend (e.g., a slightly upward slope suggests
older customers tend to spend slightly more on average, though
individual transactions vary greatly). The width of the confidence
interval indicates the uncertainty around the regression line – a wider
band implies less confidence in the precise slope. The color gradient
can help visually spot if certain age ranges cluster at higher/lower
spending levels, though this might be subtle.
plot9 <- ggplot(Database, aes(x = Churn, y = Customer.Age, fill = Churn)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "Customer Age Distribution by Churn Status",
x = "Churn Status (0 = Active, 1 = Churned)",
y = "Customer Age") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot9)
INTERPRETAION:This visualization highlights the weekly rhythm of
purchasing behavior. We can easily identify the busiest and slowest
shopping days. Peaks (e.g., often Friday/Saturday) indicate preferred
shopping times, while troughs (e.g., often Sunday/Monday) show periods
of lower activity. This pattern can inform staffing, server load
management, and timing for promotional activities (e.g., running sales
on typically slower days).
cat("\n--- 10: Scatter Plot (Product Price vs. Quantity) ---\n")
##
## --- 10: Scatter Plot (Product Price vs. Quantity) ---
plot10 <- ggplot(Database, aes(x = Product.Price, y = Quantity,fill = Product.Category)) +
geom_point(aes(color = Product.Category), alpha = 0.4, size = 1.5) + # Color by category
labs(title = "Product Price vs. Quantity Purchased per Transaction",
x = "Product Price ($)",
y = "Quantity Purchased",
color = "Product Category") + # Add legend title
simple_theme
print(plot10)
INTERPRETAION:This plot helps visualize if the age profile differs
between the two churn groups based on the transactions recorded. We
compare the median age (middle line), the age range of the middle 50% of
transactions (IQR/box height), and the overall age spread for
transactions linked to each group. If, for instance, the median age box
for churned customers is lower than for non-churned, it might suggest
younger customers are associated with more churned transactions.
Important Caveat: Because this uses the main df (where customer age is
repeated per transaction) and not unique_customers, it’s showing the age
distribution of the transactions, not necessarily the age distribution
of unique customers who churned. For a direct comparison of churned
customer ages, the plot should ideally be based on the unique_customers
dataframe.
returns_by_cat <- Database %>%
filter(!is.na(Returns)) %>%
group_by(Product.Category) %>%
summarise(
Return.Rate = mean(Returns == 1),
Total.Transactions = n(),
.groups = 'drop'
) %>%
filter(Total.Transactions > 10)
if (nrow(returns_by_cat) > 0) {
plot11 <- ggplot(returns_by_cat, aes(x = reorder(Product.Category, -Return.Rate), y = Return.Rate, fill = Product.Category)) +
geom_bar(stat = "identity", show.legend = FALSE) +
geom_text(aes(label = scales::percent(Return.Rate, accuracy = 0.1)), vjust = -0.5, size = 3.5) +
labs(
title = "Return Rate by Product Category",
x = "Product Category",
y = "Return Rate"
) +
scale_y_continuous(labels = scales::percent) + # <--- important!
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # tilt x labels if needed
print(plot11)
} else {
cat("Skipping Return Rate by Category Plot: Not enough valid 'Returns' data found.\n")
}
INTERPRETAION:This plot helps visualize if the age profile differs
between the two churn groups based on the transactions recorded. We
compare the median age (middle line), the age range of the middle 50% of
transactions (IQR/box height), and the overall age spread for
transactions linked to each group. If, for instance, the median age box
for churned customers is lower than for non-churned, it might suggest
younger customers are associated with more churned transactions.
Important Caveat: Because this uses the main df (where customer age is
repeated per transaction) and not unique_customers, it’s showing the age
distribution of the transactions, not necessarily the age distribution
of unique customers who churned. For a direct comparison of churned
customer ages, the plot should ideally be based on the unique_customers
dataframe.
monthly_avg_purchase <- Database %>%
filter(!is.na(Purchase.Date)) %>%
mutate(YearMonth = floor_date(Purchase.Date, "month")) %>%
group_by(YearMonth) %>%
summarise(Avg.Purchase = mean(Total.Purchase.Amount, na.rm = TRUE), .groups = 'drop') %>%
filter(!is.na(YearMonth))
if(nrow(monthly_avg_purchase) > 0) {
plot12 <- ggplot(monthly_avg_purchase, aes(x = YearMonth, y = Avg.Purchase)) +
geom_line(color = "purple", linewidth = 1) +
geom_point(color = "purple") +
labs(title = "Monthly Average Total Purchase Amount Trend",
x = "Month",
y = "Average Total Purchase Amount ($)") +
scale_x_datetime(date_labels = "%Y-%m", date_breaks = "3 months") +
simple_theme
print(plot12)
} else {
cat("Skipping Average Purchase Trend Plot: Not enough valid date data.\n")
}
INTERPRETAION:Unlike the transaction volume plot (Q5), this focuses
on the value per transaction. We look for trends in average spending –
are customers spending more or less per visit over time? Peaks could
indicate periods where customers bought higher-value items or simply
more items per transaction (cross-reference with Q5). Dips might suggest
periods of promotional activity focusing on lower-priced goods or
smaller basket sizes. This helps understand changes in customer spending
behavior separate from just the number of visits/transactions.
cat("\n--- 13: Box Plot (Purchase Amount by Gender) ---\n")
##
## --- 13: Box Plot (Purchase Amount by Gender) ---
plot13 <- ggplot(Database, aes(x = Gender, y = Total.Purchase.Amount, fill = Gender)) +
geom_boxplot(show.legend = FALSE, outlier.alpha = 0.1) + # Show outliers faintly
labs(title = "Total Purchase Amount Distribution by Gender",
x = "Gender",
y = "Total Purchase Amount ($)") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot13)
INTERPRETAION:Unlike the transaction volume plot (Q5), this focuses
on the value per transaction. We look for trends in average spending –
are customers spending more or less per visit over time? Peaks could
indicate periods where customers bought higher-value items or simply
more items per transaction (cross-reference with Q5). Dips might suggest
periods of promotional activity focusing on lower-priced goods or
smaller basket sizes. This helps understand changes in customer spending
behavior separate from just the number of visits/transactions.
plot14 <- ggplot(Database, aes(x = Quantity, y = Total.Purchase.Amount)) +
geom_point(alpha = 0.2, size = 1, color = "brown") +
geom_smooth(method = "lm", color = "orange", se = FALSE) +
labs(title = "Quantity vs. Total Purchase Amount per Transaction",
subtitle = "Orange line = Linear Regression Fit",
x = "Quantity Purchased",
y = "Total Purchase Amount ($)") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot14)
## `geom_smooth()` using formula = 'y ~ x'
INTERPRETAION:This plot primarily investigates the correlation
between the number of units bought and the recorded total spend for the
transaction. A positive slope on the regression line is expected,
indicating that buying more units generally leads to a higher total
amount. The strength of this linear relationship is indicated by how
tightly the points cluster around the line and the width of the
confidence interval. Significant scatter suggests either a wide
variation in product prices (so Quantity Price varies greatly) or
that Total.Purchase.Amount often reflects more than just that single
line item’s quantity * price (e.g., it’s a total basket value). The
intercept of the regression line (where Quantity = 0) ideally should be
near $0 if the amount strictly represents Quantity * Price, but might be
positive if it represents a total basket including other items.*
This visual analysis of the e-commerce customer dataset, utilizing R
and the ggplot2 and GGally packages, provided
valuable initial insights into transactional patterns, customer
demographics, behavioral trends, and potential drivers of churn.
Through a series of targeted visualizations—including histograms, bar graphs, scatter plots, box plots, line plots, and pair plots—we explored distributions, comparisons, correlations, and time-based trends within the data.
This R-based visualization exercise has effectively surfaced key characteristics and potential areas of interest within the e-commerce dataset. It serves as a critical foundation for more rigorous statistical analysis and data-driven decision-making aimed at optimizing customer experience, retention, and profitability.