This project uses an e-commerce customer dataset containing 1,999 records across 13 attributes. The dataset captures detailed information about customer transactions, including purchase details, product categories, payment methods, returns, and churn status.
The objective of this project is to analyze customer purchasing behavior, identify key factors influencing product returns and customer churn, and generate insights to support business strategies for customer retention and revenue growth.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(scales)
## Warning: package 'scales' was built under R version 4.4.3
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
Database <- read.csv("C:\\Users\\My Pc\\Desktop\\ecommerce_customer_data_large.csv")
View(Database)
Database %>% summarise(Unique_Customers = n_distinct(Customer.ID))
## Unique_Customers
## 1 329
Database %>% summarise(Total_Purchases = n())
## Total_Purchases
## 1 1999
Database %>% distinct(Product.Category)
## Product.Category
## 1 Home
## 2 Electronics
## 3 Books
## 4 Clothing
Database %>% count(Product.Category, sort = TRUE)
## Product.Category n
## 1 Electronics 515
## 2 Home 515
## 3 Books 498
## 4 Clothing 471
Database %>%
summarise(
Start_Date = min(Purchase.Date, na.rm = TRUE),
End_Date = max(Purchase.Date, na.rm = TRUE)
)
## Start_Date End_Date
## 1 1/1/2020 15:05 9/9/2023 4:57
#Retrieve all purchases made by a specific customer.
Database %>% filter(Customer.ID == "44605") # Replace with real ID
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 44605 5/3/2023 21:30 Home 177 1
## 2 44605 5/16/2021 13:57 Electronics 174 3
## 3 44605 7/13/2020 6:16 Books 413 1
## 4 44605 1/17/2023 13:14 Electronics 396 3
## 5 44605 5/1/2021 11:29 Books 259 4
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name Age
## 1 2427 PayPal 31 1 John Rivera 31
## 2 2448 PayPal 31 1 John Rivera 31
## 3 2345 Credit Card 31 1 John Rivera 31
## 4 937 Cash 31 0 John Rivera 31
## 5 2598 PayPal 31 1 John Rivera 31
## Gender Churn
## 1 Female 0
## 2 Female 0
## 3 Female 0
## 4 Female 0
## 5 Female 0
average_spending_by_payment <- Database %>%
group_by(Payment.Method) %>%
summarise(
AverageTotalAmount = mean(Total.Purchase.Amount, na.rm = TRUE),
NumberOfTransactions = n() # Good practice to check sample size
)
print(average_spending_by_payment)
## # A tibble: 3 × 3
## Payment.Method AverageTotalAmount NumberOfTransactions
## <chr> <dbl> <int>
## 1 Cash 2773. 674
## 2 Credit Card 2795. 667
## 3 PayPal 2716. 658
Database %>% filter(Total.Purchase.Amount > 3000)%>%
head(20)
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 13738 8/25/2022 6:48 Home 191 3
## 2 13738 2/9/2023 0:53 Electronics 40 4
## 3 33969 2/28/2023 19:58 Clothing 410 3
## 4 33969 1/5/2023 11:15 Home 304 1
## 5 33969 7/18/2023 23:36 Books 54 2
## 6 33969 3/7/2020 21:31 Books 281 1
## 7 33969 7/21/2022 4:25 Home 193 2
## 8 42650 10/18/2020 23:38 Books 127 5
## 9 42650 5/17/2020 17:02 Home 284 2
## 10 42650 3/18/2022 13:52 Electronics 256 2
## 11 42650 1/26/2022 12:50 Electronics 105 2
## 12 42650 11/11/2020 7:19 Electronics 193 2
## 13 42650 8/1/2021 5:27 Electronics 30 3
## 14 42650 3/24/2022 0:05 Electronics 254 1
## 15 42650 9/2/2022 4:00 Books 12 3
## 16 24053 11/17/2020 13:14 Electronics 102 2
## 17 24053 1/11/2022 10:56 Books 210 2
## 18 19676 1/12/2022 22:49 Clothing 304 3
## 19 19676 6/26/2021 23:49 Electronics 329 1
## 20 19917 12/9/2021 7:43 Books 174 5
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name Age
## 1 3722 Credit Card 27 1 Lauren Johnson 27
## 2 4327 Cash 27 0 Lauren Johnson 27
## 3 5018 Credit Card 27 NA Carol Allen 27
## 4 3883 PayPal 27 1 Carol Allen 27
## 5 4187 PayPal 27 0 Carol Allen 27
## 6 3810 Cash 27 0 Carol Allen 27
## 7 3198 Credit Card 27 0 Carol Allen 27
## 8 3347 Cash 20 0 Curtis Smith 20
## 9 3531 Credit Card 20 1 Curtis Smith 20
## 10 3548 Credit Card 20 0 Curtis Smith 20
## 11 3721 Credit Card 20 NA Curtis Smith 20
## 12 3266 PayPal 20 1 Curtis Smith 20
## 13 5024 Credit Card 20 1 Curtis Smith 20
## 14 3865 PayPal 20 0 Curtis Smith 20
## 15 3112 Cash 20 1 Curtis Smith 20
## 16 5206 Credit Card 53 1 Jose Green 53
## 17 4567 Credit Card 53 0 Jose Green 53
## 18 4869 PayPal 57 1 Linda Lee 57
## 19 3391 Cash 57 1 Linda Lee 57
## 20 4517 Credit Card 34 NA Joshua Davis 34
## Gender Churn
## 1 Female 0
## 2 Female 0
## 3 Male 0
## 4 Male 0
## 5 Male 0
## 6 Male 0
## 7 Male 0
## 8 Female 0
## 9 Female 0
## 10 Female 0
## 11 Female 0
## 12 Female 0
## 13 Female 0
## 14 Female 0
## 15 Female 0
## 16 Male 0
## 17 Male 0
## 18 Male 0
## 19 Male 0
## 20 Male 1
Database %>% filter(Product.Category == "Electronics")%>%
head(20)
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 44605 5/16/2021 13:57 Electronics 174 3
## 2 44605 1/17/2023 13:14 Electronics 396 3
## 3 13738 7/25/2023 5:17 Electronics 205 1
## 4 13738 2/9/2023 0:53 Electronics 40 4
## 5 33969 12/20/2021 23:44 Electronics 428 4
## 6 42650 3/18/2022 13:52 Electronics 256 2
## 7 42650 1/26/2022 12:50 Electronics 105 2
## 8 42650 11/11/2020 7:19 Electronics 193 2
## 9 42650 8/1/2021 5:27 Electronics 30 3
## 10 42650 5/24/2020 6:45 Electronics 307 3
## 11 42650 3/24/2022 0:05 Electronics 254 1
## 12 24053 11/17/2020 13:14 Electronics 102 2
## 13 24053 7/5/2020 6:30 Electronics 364 3
## 14 19676 6/26/2021 23:49 Electronics 329 1
## 15 19676 4/20/2020 14:08 Electronics 165 5
## 16 19676 3/23/2021 17:36 Electronics 218 4
## 17 19917 10/18/2022 22:58 Electronics 41 2
## 18 23734 9/21/2022 19:10 Electronics 75 1
## 19 16921 4/22/2021 22:10 Electronics 500 2
## 20 7796 10/22/2022 12:12 Electronics 145 2
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name
## 1 2448 PayPal 31 1 John Rivera
## 2 937 Cash 31 0 John Rivera
## 3 2773 Credit Card 27 NA Lauren Johnson
## 4 4327 Cash 27 0 Lauren Johnson
## 5 2289 Cash 27 0 Carol Allen
## 6 3548 Credit Card 20 0 Curtis Smith
## 7 3721 Credit Card 20 NA Curtis Smith
## 8 3266 PayPal 20 1 Curtis Smith
## 9 5024 Credit Card 20 1 Curtis Smith
## 10 973 PayPal 20 NA Curtis Smith
## 11 3865 PayPal 20 0 Curtis Smith
## 12 5206 Credit Card 53 1 Jose Green
## 13 1008 Credit Card 53 0 Jose Green
## 14 3391 Cash 57 1 Linda Lee
## 15 603 Credit Card 57 0 Linda Lee
## 16 2866 PayPal 57 NA Linda Lee
## 17 743 Cash 34 NA Joshua Davis
## 18 289 PayPal 18 0 Cheryl Espinoza
## 19 988 Cash 54 1 Cheyenne James
## 20 2134 PayPal 21 0 Lisa Dennis
## Age Gender Churn
## 1 31 Female 0
## 2 31 Female 0
## 3 27 Female 0
## 4 27 Female 0
## 5 27 Male 0
## 6 20 Female 0
## 7 20 Female 0
## 8 20 Female 0
## 9 20 Female 0
## 10 20 Female 0
## 11 20 Female 0
## 12 53 Male 0
## 13 53 Male 0
## 14 57 Male 0
## 15 57 Male 0
## 16 57 Male 0
## 17 34 Male 1
## 18 18 Male 0
## 19 54 Male 0
## 20 21 Male 0
Database %>%
filter(Churn==1)%>%
head(20)
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 19917 5/16/2023 1:34 Clothing 392 3
## 2 19917 10/27/2021 16:25 Clothing 238 3
## 3 19917 12/9/2021 7:43 Books 174 5
## 4 19917 10/18/2022 22:58 Electronics 41 2
## 5 21035 9/3/2022 16:41 Electronics 443 2
## 6 21035 6/15/2021 6:53 Electronics 116 3
## 7 21035 8/22/2021 14:09 Home 500 1
## 8 21035 11/28/2020 15:28 Home 249 1
## 9 21035 3/15/2020 18:45 Books 112 4
## 10 21035 1/27/2021 18:26 Electronics 457 2
## 11 21035 9/9/2023 18:11 Home 237 2
## 12 16825 7/21/2020 3:33 Clothing 489 1
## 13 16825 11/14/2022 2:19 Clothing 420 2
## 14 16825 4/19/2021 16:20 Books 395 3
## 15 16825 3/18/2023 15:00 Books 227 2
## 16 16825 8/16/2021 17:56 Clothing 166 4
## 17 34857 1/10/2021 9:21 Books 120 3
## 18 34857 6/13/2021 0:59 Home 104 2
## 19 34857 9/3/2021 7:26 Books 252 4
## 20 34857 2/6/2023 19:10 Electronics 284 4
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name
## 1 1378 Credit Card 34 NA Joshua Davis
## 2 1760 Credit Card 34 NA Joshua Davis
## 3 4517 Credit Card 34 NA Joshua Davis
## 4 743 Cash 34 NA Joshua Davis
## 5 1884 Credit Card 50 1 Peter Watson
## 6 2334 Credit Card 50 0 Peter Watson
## 7 698 Credit Card 50 1 Peter Watson
## 8 2925 Credit Card 50 0 Peter Watson
## 9 793 PayPal 50 0 Peter Watson
## 10 2919 PayPal 50 1 Peter Watson
## 11 1088 Cash 50 1 Peter Watson
## 12 3184 Credit Card 37 0 Melissa Cabrera
## 13 969 Cash 37 1 Melissa Cabrera
## 14 3487 Cash 37 1 Melissa Cabrera
## 15 3400 Cash 37 0 Melissa Cabrera
## 16 1379 PayPal 37 0 Melissa Cabrera
## 17 5279 Credit Card 57 NA Alexander Morse
## 18 3319 Credit Card 57 0 Alexander Morse
## 19 1038 Cash 57 0 Alexander Morse
## 20 402 PayPal 57 0 Alexander Morse
## Age Gender Churn
## 1 34 Male 1
## 2 34 Male 1
## 3 34 Male 1
## 4 34 Male 1
## 5 50 Female 1
## 6 50 Female 1
## 7 50 Female 1
## 8 50 Female 1
## 9 50 Female 1
## 10 50 Female 1
## 11 50 Female 1
## 12 37 Male 1
## 13 37 Male 1
## 14 37 Male 1
## 15 37 Male 1
## 16 37 Male 1
## 17 57 Male 1
## 18 57 Male 1
## 19 57 Male 1
## 20 57 Male 1
Database %>%
group_by(Customer.ID) %>%
summarise(Total_Spent = sum(Total.Purchase.Amount, na.rm = TRUE)) %>%
arrange(desc(Total_Spent))%>%
head(50)
## # A tibble: 50 × 2
## Customer.ID Total_Spent
## <int> <int>
## 1 45089 42701
## 2 37332 37509
## 3 26378 36589
## 4 42650 35366
## 5 28317 34436
## 6 10191 33332
## 7 32089 32712
## 8 22062 31214
## 9 35164 30017
## 10 48573 29898
## # ℹ 40 more rows
Database %>%
group_by(Product.Category) %>%
summarise(Avg_Spend = mean(Total.Purchase.Amount, na.rm = TRUE)) %>%
arrange(desc(Avg_Spend))
## # A tibble: 4 × 2
## Product.Category Avg_Spend
## <chr> <dbl>
## 1 Home 2820.
## 2 Electronics 2789.
## 3 Books 2740.
## 4 Clothing 2691.
Database %>% count(Gender, name = "Purchase_Count")
## Gender Purchase_Count
## 1 Female 1015
## 2 Male 984
Database %>%
count(Product.Category) %>%
filter(n == max(n))
## Product.Category n
## 1 Electronics 515
## 2 Home 515
Database %>%
group_by(Customer.ID) %>%
summarise(
Total_Spent = sum(Total.Purchase.Amount, na.rm = TRUE),
Frequency = n()
) %>%
arrange(desc(Total_Spent)) %>%
slice(1:5)
## # A tibble: 5 × 3
## Customer.ID Total_Spent Frequency
## <int> <int> <int>
## 1 45089 42701 11
## 2 37332 37509 10
## 3 26378 36589 12
## 4 42650 35366 13
## 5 28317 34436 12
Database %>%
arrange(desc(Total.Purchase.Amount)) %>%
select(Customer.ID, Product.Category, Total.Purchase.Amount) %>%
slice(1:5)
## Customer.ID Product.Category Total.Purchase.Amount
## 1 41933 Electronics 5333
## 2 45961 Home 5321
## 3 41953 Clothing 5319
## 4 18407 Clothing 5313
## 5 17203 Home 5300
Database %>%
group_by(Product.Category) %>%
summarise(Revenue = sum(Total.Purchase.Amount, na.rm = TRUE)) %>%
arrange(desc(Revenue))
## # A tibble: 4 × 2
## Product.Category Revenue
## <chr> <int>
## 1 Home 1452304
## 2 Electronics 1436197
## 3 Books 1364446
## 4 Clothing 1267311
Database<-Database %>%
mutate(Customer_Spend_Category = case_when(
Total.Purchase.Amount < 700 ~ "Low Spender",
Total.Purchase.Amount >= 700 & Total.Purchase.Amount <= 5000 ~ "Moderate Spender",
Total.Purchase.Amount > 5000 ~ "High Spender"
))
View(Database)
Database_frequency_score<- Database %>%
group_by(Customer.Name) %>%
summarise(Monthly_Orders = n())
view(Database_frequency_score)
Database %>%
group_by(Customer.ID) %>%
summarise(Purchase_Count = n()) %>%
summarise(
Repeat_Buyers = sum(Purchase_Count > 1),
Total_Customers = n(),
Repeat_Purchase_Rate = Repeat_Buyers / Total_Customers
)
## # A tibble: 1 × 3
## Repeat_Buyers Total_Customers Repeat_Purchase_Rate
## <int> <int> <dbl>
## 1 327 329 0.994
repeat_buyers <- Database %>%
count(Customer.ID) %>%
filter(n > 1)
Database %>%
filter(Customer.ID %in% repeat_buyers$Customer.ID) %>%
count(Product.Category, sort = TRUE)
## Product.Category n
## 1 Home 515
## 2 Electronics 514
## 3 Books 498
## 4 Clothing 470
Database %>%
mutate(Day = weekdays(as.Date(Purchase.Date))) %>%
group_by(Day) %>%
summarise(Avg_Sales = mean(Total.Purchase.Amount, na.rm = TRUE))
## # A tibble: 8 × 2
## Day Avg_Sales
## <chr> <dbl>
## 1 Friday 2616.
## 2 Monday 2688.
## 3 Saturday 2752.
## 4 Sunday 2764.
## 5 Thursday 2881.
## 6 Tuesday 2600.
## 7 Wednesday 2828.
## 8 <NA> 2785.
Database %>%
group_by(Payment.Method) %>%
summarise(
Avg_Spend = mean(Total.Purchase.Amount, na.rm = TRUE),
Count = n()
)
## # A tibble: 3 × 3
## Payment.Method Avg_Spend Count
## <chr> <dbl> <int>
## 1 Cash 2773. 674
## 2 Credit Card 2795. 667
## 3 PayPal 2716. 658
Database %>%
group_by(Customer.ID) %>%
summarise(Frequency = n()) %>%
mutate(Churned = ifelse(Frequency == 1, 1, 0)) %>%
count(Churned) %>%
mutate(Percentage = n / sum(n) * 100)
## # A tibble: 2 × 3
## Churned n Percentage
## <dbl> <int> <dbl>
## 1 0 327 99.4
## 2 1 2 0.608
Database$Purchase.Date <- mdy_hm(Database$Purchase.Date, quiet = TRUE)
Database$Age <- suppressWarnings(as.numeric(as.character(Database$Age)))
Database$Product.Price <- suppressWarnings(as.numeric(as.character(Database$Product.Price)))
Database$Quantity <- suppressWarnings(as.numeric(as.character(Database$Quantity)))
Database$Total.Purchase.Amount <- suppressWarnings(as.numeric(as.character(Database$Total.Purchase.Amount)))
Database$Returns[Database$Returns == ""] <- NA # Handle blanks before numeric conversion
Database$Returns <- suppressWarnings(as.numeric(as.character(Database$Returns)))
Database$Product.Category <- as.factor(Database$Product.Category)
Database$Payment.Method <- as.factor(Database$Payment.Method)
Database$Gender <- as.factor(Database$Gender)
Database$Churn <- as.factor(Database$Churn)
Database <- Database %>% filter(!is.na(Age), !is.na(Total.Purchase.Amount), !is.na(Product.Price), !is.na(Quantity))
unique_customers <- Database %>% distinct(Customer.ID, .keep_all = TRUE)
simple_theme <- theme_minimal(base_size = 11) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom"
)
df_subset_pairplot <- Database %>% sample_n(min(1000, nrow(.))) # Plot sample of 1000
pairplot_vars <- c("Age", "Product.Price", "Quantity", "Total.Purchase.Amount")
# Check if data exists before plotting
if(nrow(df_subset_pairplot) > 0) {
plot1 <- ggpairs(df_subset_pairplot[, pairplot_vars],
title = "Pairwise Relationships of Key Numeric Variables (Sampled Data)") +
simple_theme
print(plot1)
} else {
}
Interpretation:By examining the scatterplots, we can visually assess if relationships appear linear, clustered, or random. For instance, we might look for a negative trend between Product.Price and Quantity (suggesting people buy fewer items as price increases) or a positive trend between Quantity and Total.Purchase.Amount. The density plots help understand the typical range and concentration of values for each metric. The correlation coefficients quantify the strength and direction of linear relationships (values near +1 or -1 indicate strong linear association, while values near 0 indicate weak linear association). This plot is excellent for quickly identifying potential multicollinearity or interesting pairwise relationships worthy of deeper investigation.
plot2 <- ggplot(Database, aes(x = Product.Category, fill = Product.Category)) +
geom_bar(show.legend = FALSE) +
geom_text(stat='count', aes(label=..count..), vjust=-0.5, size=3) +
labs(title = "Number of Transactions by Product Category",
x = "Product Category",
y = "Number of Transactions")
print(plot2)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Interpretation:This visualization allows for a direct comparison of transaction volume across categories. We can immediately identify which product categories are most frequently involved in transactions and which are less common. For example, if “Electronics” and “Home” have significantly taller bars than “Books” and “Clothing”, it indicates they drive a larger portion of the store’s transaction activity.
plot3 <- ggplot(Database, aes(x = Total.Purchase.Amount)) +
geom_histogram(fill = "skyblue", color = "white", bins = 30) +
labs(title = "Distribution of Total Purchase Amount per Transaction",
x = "Total Purchase Amount ($)",
y = "Frequency") +
scale_x_continuous(labels = scales::dollar_format()) + # <--- updated
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot3)
Interpretation:The shape of the histogram reveals common spending patterns per transaction. Typically, e-commerce data shows a right-skewed distribution, meaning most transactions are for smaller amounts, with fewer transactions involving very high totals. We can identify the most frequent spending range (the tallest bar), the overall spread of transaction values, and the presence and extent of high-value outlier transactions (the long tail to the right, if present).
?
plot4 <- ggplot(Database, aes(x = Payment.Method, y = Total.Purchase.Amount, fill = Payment.Method)) +
geom_boxplot(show.legend = FALSE, outlier.shape = NA) + # Hide outliers for scale
coord_cartesian(ylim = quantile(Database$Total.Purchase.Amount, c(0.05, 0.95), na.rm=TRUE)) + # Zoom
labs(title = "Total Purchase Amount Distribution by Payment Method",
subtitle = "Visual check for ANOVA (Zoomed 5th-95th percentile)",
x = "Payment Method",
y = "Total Purchase Amount ($)")+
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot4)
Interpretation:This plot helps visually assess if typical spending per transaction differs based on how customers pay. We compare the median spending levels (middle lines) and the variability (height of the boxes/IQRs) across methods. For instance, if the median line for “Credit Card” is higher than for “Cash”, it suggests customers tend to spend more per transaction when using a credit card. Different box heights would indicate differences in spending consistency. The subtitle explicitly notes this is a visual precursor to ANOVA – if the medians look very different or the variances (approximated by box heights/spread) appear unequal, it strengthens the case for formal statistical testing. The zoom means we are focusing on the bulk of the data, ignoring extreme high/low values for this comparison.
simple_theme2 <- theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", color = "darkblue"),
plot.subtitle = element_text(size = 12, color = "darkgray"),
axis.title = element_text(size = 14, face = "bold", color = "black"),
axis.text = element_text(size = 12, color = "black"),
panel.grid.major = element_line(color = "lightgray", linewidth = 0.5), # Updated to linewidth
panel.grid.minor = element_line(color = "lightgray", linewidth = 0.25), # Updated to linewidth
panel.background = element_rect(fill = "white", color = "gray"),
plot.background = element_rect(fill = "lightblue", color = "lightblue")
)
plot6 <- ggplot(Database, aes(x = Age, y = Total.Purchase.Amount)) +
geom_point(aes(color = Age), alpha = 0.7, size = 2) + # Add color based on Age and adjust transparency
geom_smooth(method = "lm", color = "red", se = TRUE, linetype = "solid", size = 1.5) + # Enhanced line
scale_color_viridis_c() + # Use a color scale for better visual appeal
labs(title = "Customer Age vs. Total Purchase Amount",
subtitle = "Red line = Linear Regression Fit",
x = "Customer Age",
y = "Total Purchase Amount ($)") +
simple_theme2 +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Angle x-axis text for better readability
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(plot6)
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:This visualization reveals temporal patterns in overall business activity. We can identify upward or downward trends in transaction volume over the observed period. Seasonal patterns become evident through recurring peaks (e.g., potential holiday season spikes) and troughs (e.g., post-holiday dips). The magnitude of these fluctuations indicates the strength of seasonality or impact of specific events (like promotions). The shaded area provides a quick visual gauge of the overall volume across time.
churn_by_gender <- unique_customers %>%
group_by(Gender) %>%
summarise(Churn.Rate = mean(Churn == "1", na.rm = TRUE),
Total.Customers = n(), .groups = 'drop') %>%
filter(Total.Customers > 10) # Optional: filter small groups
if (nrow(churn_by_gender) > 0) {
plot7 <- ggplot(churn_by_gender, aes(x = Gender, y = Churn.Rate, fill = Gender)) +
geom_bar(stat = "identity", show.legend = FALSE) +
geom_text(aes(label = scales::percent(Churn.Rate, accuracy = 0.1)), vjust = -0.5, size = 3.5) + # Show percentage labels
labs(title = "Customer Churn Rate by Gender",
x = "Gender",
y = "Churn Rate") +
scale_y_continuous(labels = scales::percent) + # Y-axis as percentages
simple_theme2 +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
print(plot7)
} else {
cat("Skipping Churn Rate by Gender Plot: Not enough unique customer data with gender/churn info.\n")
}
Interpretation:This plot allows for a direct visual comparison of customer retention between genders. If the bars are of noticeably different heights, it suggests one gender group might be more prone to churning than the other within this dataset. For example, a significantly taller bar for ‘Male’ would indicate a higher churn propensity among male customers compared to female customers. Similar bar heights suggest comparable churn rates. This can inform targeted retention strategies if significant differences exist.
simple_theme2 <- theme_minimal()
day_counts <- Database %>%
filter(!is.na(Purchase.Date)) %>%
mutate(DayOfWeek = wday(Purchase.Date, label = TRUE, week_start = 1)) %>%
filter(!is.na(DayOfWeek)) %>%
count(DayOfWeek, name = "Transactions")
if (nrow(day_counts) > 0) {
plot15 <- ggplot(day_counts, aes(x = DayOfWeek, y = Transactions, group = 1)) +
geom_area(aes(fill = "Area"), alpha = 0.3, show.legend = FALSE) + # Shaded area under the line
geom_line(color = "steelblue", size = 1.2) + # Line
geom_point(color = "steelblue", size = 3) + # Points
geom_text(aes(label = Transactions), vjust = -1, size = 3) + # Labels
labs(title = "Total Transactions by Day of the Week",
x = "Day of the Week",
y = "Number of Transactions") +
scale_y_continuous(labels = scales::comma) +
simple_theme2 +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
print(plot15)
} else {
cat("Skipping Day of Week Plot: Not enough valid date data.\n")
}
Interpretation:The plot helps determine if there’s a linear association between age and spending per transaction. A widely scattered pattern suggests age is not a strong linear predictor of transaction amount. The slope of the red line indicates the direction and magnitude of the average linear trend (e.g., a slightly upward slope suggests older customers tend to spend slightly more on average, though individual transactions vary greatly). The width of the confidence interval indicates the uncertainty around the regression line – a wider band implies less confidence in the precise slope. The color gradient can help visually spot if certain age ranges cluster at higher/lower spending levels, though this might be subtle.
plot9 <- ggplot(Database, aes(x = Churn, y = Customer.Age, fill = Churn)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "Customer Age Distribution by Churn Status",
x = "Churn Status (0 = Active, 1 = Churned)",
y = "Customer Age") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot9)
Interpretation:This visualization highlights the weekly rhythm of purchasing behavior. We can easily identify the busiest and slowest shopping days. Peaks (e.g., often Friday/Saturday) indicate preferred shopping times, while troughs (e.g., often Sunday/Monday) show periods of lower activity. This pattern can inform staffing, server load management, and timing for promotional activities (e.g., running sales on typically slower days).
cat("\n--- 10: Scatter Plot (Product Price vs. Quantity) ---\n")
##
## --- 10: Scatter Plot (Product Price vs. Quantity) ---
plot10 <- ggplot(Database, aes(x = Product.Price, y = Quantity,fill = Product.Category)) +
geom_point(aes(color = Product.Category), alpha = 0.4, size = 1.5) + # Color by category
labs(title = "Product Price vs. Quantity Purchased per Transaction",
x = "Product Price ($)",
y = "Quantity Purchased",
color = "Product Category") + # Add legend title
simple_theme
print(plot10)
Interpretation:This plot helps visualize if the age profile differs between the two churn groups based on the transactions recorded. We compare the median age (middle line), the age range of the middle 50% of transactions (IQR/box height), and the overall age spread for transactions linked to each group. If, for instance, the median age box for churned customers is lower than for non-churned, it might suggest younger customers are associated with more churned transactions. Important Caveat: Because this uses the main df (where customer age is repeated per transaction) and not unique_customers, it’s showing the age distribution of the transactions, not necessarily the age distribution of unique customers who churned. For a direct comparison of churned customer ages, the plot should ideally be based on the unique_customers dataframe.
returns_by_cat <- Database %>%
filter(!is.na(Returns)) %>%
group_by(Product.Category) %>%
summarise(
Return.Rate = mean(Returns == 1),
Total.Transactions = n(),
.groups = 'drop'
) %>%
filter(Total.Transactions > 10)
if (nrow(returns_by_cat) > 0) {
plot11 <- ggplot(returns_by_cat, aes(x = reorder(Product.Category, -Return.Rate), y = Return.Rate, fill = Product.Category)) +
geom_bar(stat = "identity", show.legend = FALSE) +
geom_text(aes(label = scales::percent(Return.Rate, accuracy = 0.1)), vjust = -0.5, size = 3.5) +
labs(
title = "Return Rate by Product Category",
x = "Product Category",
y = "Return Rate"
) +
scale_y_continuous(labels = scales::percent) + # <--- important!
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # tilt x labels if needed
print(plot11)
} else {
cat("Skipping Return Rate by Category Plot: Not enough valid 'Returns' data found.\n")
}
Interpretation:This plot helps visualize if the age profile differs between the two churn groups based on the transactions recorded. We compare the median age (middle line), the age range of the middle 50% of transactions (IQR/box height), and the overall age spread for transactions linked to each group. If, for instance, the median age box for churned customers is lower than for non-churned, it might suggest younger customers are associated with more churned transactions. Important Caveat: Because this uses the main df (where customer age is repeated per transaction) and not unique_customers, it’s showing the age distribution of the transactions, not necessarily the age distribution of unique customers who churned. For a direct comparison of churned customer ages, the plot should ideally be based on the unique_customers dataframe.
monthly_avg_purchase <- Database %>%
filter(!is.na(Purchase.Date)) %>%
mutate(YearMonth = floor_date(Purchase.Date, "month")) %>%
group_by(YearMonth) %>%
summarise(Avg.Purchase = mean(Total.Purchase.Amount, na.rm = TRUE), .groups = 'drop') %>%
filter(!is.na(YearMonth))
if(nrow(monthly_avg_purchase) > 0) {
plot12 <- ggplot(monthly_avg_purchase, aes(x = YearMonth, y = Avg.Purchase)) +
geom_line(color = "purple", linewidth = 1) +
geom_point(color = "purple") +
labs(title = "Monthly Average Total Purchase Amount Trend",
x = "Month",
y = "Average Total Purchase Amount ($)") +
scale_x_datetime(date_labels = "%Y-%m", date_breaks = "3 months") +
simple_theme
print(plot12)
} else {
cat("Skipping Average Purchase Trend Plot: Not enough valid date data.\n")
}
Interpretation: Unlike the transaction volume plot (Q5), this focuses on the value per transaction. We look for trends in average spending – are customers spending more or less per visit over time? Peaks could indicate periods where customers bought higher-value items or simply more items per transaction (cross-reference with Q5). Dips might suggest periods of promotional activity focusing on lower-priced goods or smaller basket sizes. This helps understand changes in customer spending behavior separate from just the number of visits/transactions.
cat("\n--- 13: Box Plot (Purchase Amount by Gender) ---\n")
##
## --- 13: Box Plot (Purchase Amount by Gender) ---
plot13 <- ggplot(Database, aes(x = Gender, y = Total.Purchase.Amount, fill = Gender)) +
geom_boxplot(show.legend = FALSE, outlier.alpha = 0.1) + # Show outliers faintly
labs(title = "Total Purchase Amount Distribution by Gender",
x = "Gender",
y = "Total Purchase Amount ($)") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot13)
Interpretation: Unlike the transaction volume plot (Q5), this focuses on the value per transaction. We look for trends in average spending – are customers spending more or less per visit over time? Peaks could indicate periods where customers bought higher-value items or simply more items per transaction (cross-reference with Q5). Dips might suggest periods of promotional activity focusing on lower-priced goods or smaller basket sizes. This helps understand changes in customer spending behavior separate from just the number of visits/transactions.
plot14 <- ggplot(Database, aes(x = Quantity, y = Total.Purchase.Amount)) +
geom_point(alpha = 0.2, size = 1, color = "brown") +
geom_smooth(method = "lm", color = "orange", se = FALSE) +
labs(title = "Quantity vs. Total Purchase Amount per Transaction",
subtitle = "Orange line = Linear Regression Fit",
x = "Quantity Purchased",
y = "Total Purchase Amount ($)") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
simple_theme
print(plot14)
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:This plot primarily investigates the correlation between the number of units bought and the recorded total spend for the transaction. A positive slope on the regression line is expected, indicating that buying more units generally leads to a higher total amount. The strength of this linear relationship is indicated by how tightly the points cluster around the line and the width of the confidence interval. Significant scatter suggests either a wide variation in product prices (so Quantity * Price varies greatly) or that Total.Purchase.Amount often reflects more than just that single line item’s quantity * price (e.g., it’s a total basket value). The intercept of the regression line (where Quantity = 0) ideally should be near $0 if the amount strictly represents Quantity * Price, but might be positive if it represents a total basket including other items.
This visual analysis of the e-commerce customer dataset, utilizing R
and the ggplot2 and GGally packages, provided
valuable initial insights into transactional patterns, customer
demographics, behavioral trends, and potential drivers of churn.
Through a series of targeted visualizations—including histograms, bar graphs, scatter plots, box plots, line plots, and pair plots—we explored distributions, comparisons, correlations, and time-based trends within the data.
This R-based visualization exercise has effectively surfaced key characteristics and potential areas of interest within the e-commerce dataset. It serves as a critical foundation for more rigorous statistical analysis and data-driven decision-making aimed at optimizing customer experience, retention, and profitability.