Introduction

We’ve all experienced the moment when a product catches our eye, leading us to make an impulse decision to purchase. Whether it’s walking into a store or scrolling online, that urge to buy doesn’t just happen by chance. Retailers invest billions each year trying to recreate that exact moment for consumers, utilizing web campaigns, social media, advertisements, and branding to nudge buyers toward making a purchase. But what drives this behavior? And how do businesses capture and repeat that pivotal moment when a consumer turns into a paying customer?

When it comes to Customer Purchase Behavior in the electronics market, understanding the drivers behind buying decisions becomes even more crucial. The choices consumers make are influenced by various factors, including product type, pricing, brand reputation, and promotional campaigns. By studying consumer purchase patterns, businesses can identify what resonates with their audience and refine their strategies accordingly.

This report provides an analysis of the Electronic Sales dataset for an electronics company over a one-year period, spanning from September 2023 to September 2024. It includes detailed information about customer demographics, product types, and purchase behaviors. The analysis aims to uncover insights into customer purchasing behavior and product performance.

Research Questions

What are the key factors influencing customer purchase behavior in the electronics market?
- This will explore how variables such as age, gender, loyalty membership, and payment method affect purchasing decisions.
How does product type influence customer purchase decisions?
- Investigate whether certain product categories, such as mobile devices or home electronics, are more popular among specific customer segments.
Does pricing impact customer purchasing patterns?
- Analyze whether there is a relationship between the price of products and the likelihood of purchase, as well as how customers respond to varying price points across product types.
How does customer behavior change across different seasons?
- Examine whether customer purchasing behavior, including order status (e.g., purchased vs. canceled), varies by season, particularly focusing on peak shopping seasons such as the holidays.
Do loyalty members show higher purchasing frequency or spend more than non-members?
- Analyze whether loyalty membership is associated with higher average spending or purchase frequency.
What role does brand reputation play in customer purchasing decisions?
- Explore whether customers tend to prefer specific brands over others and how brand loyalty impacts purchasing patterns.
Are there any patterns in customer payment method preferences, and do they influence the total amount spent?
- Investigate the relationship between different payment methods (e.g., credit card, PayPal) and overall spending behavior.
How do promotions and marketing campaigns influence customer purchase behavior?
- Evaluate the effectiveness of promotional campaigns (e.g., discounts, bundles) in driving purchases, and determine if certain types of promotions are more successful than others.
What customer demographics (age, gender, etc.) are most predictive of high spending or frequent purchases?
- Examine which demographic variables are most strongly associated with higher spending or purchase frequency.
How do shipping options affect purchase completion rates?
- Investigate whether offering premium or free shipping options influences the likelihood of a purchase being completed versus canceled.
Do add-on purchases increase overall spending?
- Analyze whether customers who purchase additional items (add-ons) spend more overall than those who do not.

Data Overview

The dataset contains 20,000 entries with 16 variables, including customer demographics, product details, and transaction information.

#load package
library(pacman)
library(psych)
library(viridis)

## Loading required package: viridisLite

library(gridExtra)
p_load(tidyverse, lubridate, knitr, kableExtra, gmodels, janitor, psych)
# load data
data <- read_csv('Electronic_sales.csv')

## Rows: 20000 Columns: 16

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): Gender, Loyalty Member, Product Type, SKU, Order Status, Payment M...
## dbl  (7): Customer ID, Age, Rating, Total Price, Unit Price, Quantity, Add-o...
## date (1): Purchase Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# we use clean name to standardize the variables names
data <- clean_names(data)

Key Features:

Customer ID: Unique identifier for each customer.
Age: Age of the customer (numeric)
Gender: Gender of the customer (Male or Female)
Loyalty Member: (Yes/No) (Values change by time, so pay attention to who cancelled and who signed up)
Product Type: Type of electronic product sold (e.g., Smartphone, Laptop, Tablet)
SKU: a unique code for each product.
Rating: Customer rating of the product (1-5 stars) (Should have no Null Ratings)
Order Status: Status of the order (Completed, Cancelled)
Payment Method: Method used for payment (e.g., Cash, Credit Card, Paypal) # Total Price: Total price of the transaction (numeric)
Unit Price: Price per unit of the product (numeric)
Quantity: Number of units purchased (numeric)
Purchase Date: Date of the purchase (format: YYYY-MM-DD)
Shipping Type: Type of shipping chosen (e.g., Standard, Overnight, Express)
Add-ons Purchased: List of any additional items purchased (e.g., Accessories, Extended Warranty)
Add-on Total: Total price of add-ons purchased (numeric)

# convert to correct data format
data$quantity <- as.integer(data$quantity)
data$age <- as.integer(data$age)
# Convert Purchase.Date to Date type
data$purchase_date <- as.Date(data$purchase_date)
data$gender <- as.factor(data$gender)
data$product_type <- as.factor(data$product_type)
data$loyalty_member <- as.factor(data$loyalty_member)
data$order_status <- as.factor(data$order_status)
data$sku <- as.factor(data$sku)
data$payment_method <- as.factor(data$payment_method)
data$shipping_type <- as.factor(data$shipping_type)

Data Exploratory

The data contains 16 variables which consist of 7 numeric, 8 character and 1 date datatypes. There are no duplicates values however, the seem to be missing values in the dataset, in total 4868 null values were found, and all the missing values are in the variables ‘add_ons_purchased’ which constitute about 24% of that column values

# check for missing values
sum(is.na(data)) #there are 4868 missing values

## [1] 4868

colSums(is.na(data))

##       customer_id               age            gender    loyalty_member 
##                 0                 0                 0                 0 
##      product_type               sku            rating      order_status 
##                 0                 0                 0                 0 
##    payment_method       total_price        unit_price          quantity 
##                 0                 0                 0                 0 
##     purchase_date     shipping_type add_ons_purchased      add_on_total 
##                 0                 0              4868                 0

# percentage of missing values per column
get_percent_missing <- function(df){
  
  #this function produce the missing percentage of each variable
  missing_percentage <- colMeans(is.na(df)) * 100
  missing_percentage <- data.frame(missing_percentage)
  missing_percentage <- missing_percentage %>% 
    arrange(desc(missing_percentage)) %>% round(., 2)
  
  return(missing_percentage)
}

missing_percentage = get_percent_missing(data)
missing_percentage #check for missing values per columns

# all missing values are in the variables , add_ons_purchased which consitute about 24% of the data
# which is the list of any additional items purchased 

# Check for duplicates
sum(duplicated(data))

## [1] 0

Summary Statistics

The summary statistics present some interesting observations in the data

The age of customers ranges from 18 to 80, with a median age of 49 years which appears reasonable with no obvious outliers.
The total price of transactions varies significantly, with a minimum of $20.75 and a maximum of $11,396.80, indicating a wide range of products and potential premium offerings.
The average total price is $3,180.13, while the median is $2,534.49, suggesting that a few high-priced items may skew the average. This implies present of outliers but could also represent legitimate high-value purchase.
The quantity ranges from 1 to 10, which seems reasonable for electronic products.
All ratings are within the expected range of 1 to 5.

summary(data)

##   customer_id         age           gender      loyalty_member
##  Min.   : 1000   Min.   :18.00   #N/A  :    1   No :15657     
##  1st Qu.: 5478   1st Qu.:33.00   Female: 9835   Yes: 4343     
##  Median :10500   Median :49.00   Male  :10164                 
##  Mean   :10484   Mean   :48.99                                
##  3rd Qu.:15504   3rd Qu.:65.00                                
##  Max.   :19998   Max.   :80.00                                
##                                                               
##      product_type       sku           rating         order_status  
##  Headphones:2011   TBL345 :2062   Min.   :1.000   Cancelled: 6568  
##  Laptop    :3973   SKU1002:2042   1st Qu.:2.000   Completed:13432  
##  Smartphone:5978   SKU1004:2019   Median :3.000                    
##  Smartwatch:3934   SKU1005:2012   Mean   :3.094                    
##  Tablet    :4104   HDP456 :2010   3rd Qu.:4.000                    
##                    SMP234 :1987   Max.   :5.000                    
##                    (Other):7868                                    
##        payment_method  total_price         unit_price         quantity     
##  Bank Transfer:3371   Min.   :   20.75   Min.   :  20.75   Min.   : 1.000  
##  Cash         :2492   1st Qu.: 1139.68   1st Qu.: 361.18   1st Qu.: 3.000  
##  Credit Card  :5868   Median : 2534.49   Median : 463.96   Median : 5.000  
##  Debit Card   :2471   Mean   : 3180.13   Mean   : 578.63   Mean   : 5.486  
##  Paypal       :2514   3rd Qu.: 4639.60   3rd Qu.: 791.19   3rd Qu.: 8.000  
##  PayPal       :3284   Max.   :11396.80   Max.   :1139.68   Max.   :10.000  
##                                                                            
##  purchase_date          shipping_type  add_ons_purchased   add_on_total    
##  Min.   :2023-09-24   Expedited:3272   Length:20000       Min.   :  0.000  
##  1st Qu.:2024-02-05   Express  :3366   Class :character   1st Qu.:  7.615  
##  Median :2024-04-24   Overnight:3357   Mode  :character   Median : 51.700  
##  Mean   :2024-04-18   Same Day :3280                      Mean   : 62.245  
##  3rd Qu.:2024-07-08   Standard :6725                      3rd Qu.: 93.843  
##  Max.   :2024-09-23                                       Max.   :292.770  
##

Data Visualization

We further our analysis of the customer transaction dataset to gain deeper insights into various aspects of purchasing behavior and trends and answer our research questions. We will examine the distribution of age and total price to understand how these factors influence purchasing decisions.

Additionally, we will conduct a comprehensive product analysis to assess the performance of different product types in terms of pricing and customer preferences. Insights related to order status will also be explored to evaluate fulfillment processes and overall customer satisfaction.

Furthermore, we will analyze purchase trends over time, focusing on monthly and seasonal patterns to identify fluctuations in customer activity and preferences throughout the year. Through these analyses, we aim to provide actionable insights that can inform strategic decisions and enhance business performance.

Age Distribution & Total Price

The age distribution of customers appears to be relatively normal, with a slight right skew. The majority of customers are between 30 and 50 years old, with a peak around 40-45 years. The distribution of total prices is right-skewed, with most purchases falling in the lower price range (under $5,000). There are some high-value purchases, which could represent bulk orders or expensive products like high-end laptops or multiple smartphones.

# -------------------- TOTAL PRICE -----------------------
ggplot(data, aes(x = total_price)) + 
  geom_histogram(bins = 50, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Total Price", x = "Total Price", y = "Frequency") +
  scale_x_continuous(labels = scales::dollar_format()) +
  theme_minimal() + theme(plot.title = element_text(hjust = 0.5))

# observation: There is a wide range of prices ($20.75 to $11,396.80), which may
#              include some outliers but could also represent legitimate high-value purchases.

# boxplot for total price
ggplot(data, aes(x = total_price)) + 
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Boxplot of Total Price", x = "Total Price") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# ---------------------- AGE --------------------
ggplot(data, aes(x = age)) + 
  geom_histogram(binwidth = 5,fill = "plum3", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Age", x = "Age", y = "Frequency") +
  theme_minimal() + theme(plot.title = element_text(hjust = 0.5))

#The age range appears reasonable (18 to 70 years), with no obvious outliers.

Scatter Plot

There appears to be a slight positive correlation between age and total price, suggesting that older customers tend to make slightly more expensive purchases. However, the relationship is not very strong, and there is considerable variation across all age groups. The scatter plot shows a positive relationship between Quantity and Total Price. As the quantity increases, the total price tends to increase as well. This is expected in a sales dataset.

# Relationship between Age and total Price
age_p_plt <- ggplot(data, aes(x = age, y = total_price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "red") +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs( x = "Age", y = "Total Price") +
  theme(plot.title = element_text(hjust = 0.5))

# ----------------- QUANTITY ---------------------
q_plt <- ggplot(data, aes(x = quantity, y = total_price)) +
  geom_point(color = "blue", alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  scale_y_continuous(labels = scales::dollar_format())  +
  labs(x = "Quantity", y = "Total Price") +
  theme_minimal() + theme(plot.title = element_text(hjust = 0.5))

#combine the scatter plot together
grid.arrange(age_p_plt, q_plt, ncol = 2)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Product Type Distribution

The most popular product types, in order, are: Smartphone, Laptop, Tablet, Smartwatch, Headphones.

# ------------ Product Type Distribution -----------------
ggplot(data, aes(x = product_type, fill = ..count..)) + 
  geom_bar() +
  scale_fill_gradient(low = "slategrey", high = "steelblue2") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of Product Types", x = "Product Type", y = "Count") +
  theme_minimal() + theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

One surprising observation was the higher average price of smartphones compared to laptops. Typically, one might expect laptops, being larger and more multifunctional devices, to be more expensive. However, in this case, smartphones have an average price of $3,599, compared to $3,095 for laptops.

# Summary of Total Price by Product Type
product_summary <- data %>%
  group_by(product_type) %>%
  summarise(
    Avg_Price = mean(total_price),
    Median_Price = median(total_price),
    Min_Price = min(total_price),
    Max_Price = max(total_price)
  ) %>%
  arrange(desc(Avg_Price))

Product Type Pricing
Product Type	Avg_Price	Median_Price	Min_Price	Max_Price
Smartphone	3599.32	3164.76	20.75	11396.80
Smartwatch	3567.94	3216.50	459.50	8448.3
Laptop	3094.95	2783.76	463.96	6743.2
Tablet	2853.80	2223.27	247.03	7864.1
Headphones	2009.65	2167.08	361.18	4130.2

Product Preference by Gender or Age

The question we want to answer is that; Do men and women (or different age groups) prefer different types of products?

# Violin plot to compare product preferences by gender
ggplot(data, aes(x = product_type, y = age, fill=gender)) +
  geom_violin(trim = FALSE) +  # Create violin plot without trimming tails
  scale_fill_manual(values = c("#FF9999", "#9999FF")) +  # Color fill for male/female
  labs(title = "Product Preferences by Gender", 
       x = "Product Type", 
       y = "Age",
       fill = "Gender") +  # Customize plot labels
  theme_minimal() +  # Use minimal theme for clean look
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.

From the visualization, there seem not be product type preference either by age or gender. The violin plot doesn’t seem to show any variation. We conducted a chi-square test, and it futher proof our assumption that there is no statistically significant relationship between age group or gender and product type preferences in this dataset.

# Cross-tabulation of product type by gender
product_gender_table <- table(data$product_type, data$gender)

# Chi-square test to check relationship
chisq.test(product_gender_table)

## 
##  Pearson's Chi-squared test
## 
## data:  product_gender_table
## X-squared = 7.5442, df = 8, p-value = 0.4792

# For age, create age groups and repeat similar analysis
data$age_group <- cut(data$age, breaks = c(18, 30, 40, 50, 60, 100))
age_product <- table(data$product_type, data$age_group)
# Chi-square test to check relationship
chisq.test(age_product)

## 
##  Pearson's Chi-squared test
## 
## data:  age_product
## X-squared = 19.127, df = 16, p-value = 0.2621

Loyalty Impact

Interestingly, loyalty members have a slightly lower average total price compared to non-members. However, they have a marginally higher average rating, which could indicate higher satisfaction despite spending less. The fact that loyalty members spend slightly less on average than non-members is intriguing. This could indicate that the loyalty program is attracting price-sensitive customers or that it’s offering discounts that reduce the overall spend. However, the higher average rating from loyalty members suggests that the program is successful in terms of customer satisfaction but not with much margin.

# -------------------- Loyalty Member Impact ----------
loyalty_impact <- data %>%
  group_by(loyalty_member) %>%
  summarise(
    Avg_Total_Price = mean(total_price),
    Avg_Quantity = mean(quantity),
    Avg_Rating = mean(rating)
  )

Loyalty Member Impact
Loyalty Member	Avg_Total Price	Avg_Quantity	Avg_Rating
No	3191.97	5.49	3.09
Yes	3137.44	5.46	3.10

# frequency of loyalty members
loyalty_summary <- data %>%
  group_by(loyalty_member) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

# Plotting the percentages
ggplot(loyalty_summary, aes(x = loyalty_member, y = Percentage, fill = loyalty_member)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 100)) +
  geom_text(aes(label = Percentage), vjust = -0.5, size = 5) +
  labs(x = "Loyalty Membership Status", y = "Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")

The bar plot clearly illustrates that non-loyalty members outnumber loyalty members in our dataset. This trend highlights a potential opportunity for the company to enhance its loyalty programs and strategies to attract more customers to become loyalty members. Notably, there isn’t a significant price difference between loyalty and non-loyalty members. I believe the company should consider increasing discounts for loyalty members to further incentivize enrollment in the loyalty program and improve customer retention.

Daily Sales Trend

The daily sales trend shows considerable fluctuations, with some noticeable peaks and troughs. Starting in January 2024, there has been a clear increase in total sales compared to 2023. This rise may be attributed to seasonal effects, promotional events, or new product launches driving higher customer spending during this period.

# --------------------- Time series of daily sales -----------------
daily_sales <- data %>%
  group_by(purchase_date) %>%
  summarise(Total_Sales = sum(total_price))
daily_sales$purchase_date <- as.Date(daily_sales$purchase_date)

Frequency of Product Type across the Years
Year	Headphones	Laptop	Smartphone	Smartwatch	Tablet
2023	0	521	1080	515	582
2024	2011	3452	4898	3419	3522

# Create the time series plot for daily sales
ggplot(daily_sales, aes(x = purchase_date, y = Total_Sales)) +
  geom_line(group = 1, color = "black") +  # Set group=1 to treat the entire dataset as one group
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  labs(title = "Daily Sales Over Time", x = "Date", y = "Total Sales") +
  theme_minimal() +  # Optional: Add a minimal theme
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

The increase in product types available in 2024 is notable, especially considering that the 2023 data only begins in October. This limited timeframe likely contributes to the significant surge in sales prices observed in 2024. The comparison highlights that the expanded availability of products in 2024, as opposed to the constrained data from late 2023, plays a crucial role in this trend. With the 2023 dataset covering only a few months, it fails to provide a comprehensive understanding of consumer purchasing behavior over the entire year.

Seasonal Trend

The analysis indicates a significant surge in total sales during the Spring and Summer seasons. This trend is likely linked to the timing of new smartphone releases, which typically occur during these months, attracting heightened consumer interest and engagement. As consumers seek out the latest technology and upgrades, the increased availability of new products during this period contributes to the overall sales boost.

# Function to categorize dates into seasons
get_season <- function(dates) {
  # Extract the month and day
  month_day <- format(dates, "%m-%d")
  
  # Assign season based on the month-day ranges
  ifelse(month_day >= "03-21" & month_day <= "06-20", "Spring",
         ifelse(month_day >= "06-21" & month_day <= "09-20", "Summer",
                ifelse(month_day >= "09-21" & month_day <= "12-20", "Fall", "Winter")))
}
data$season <- get_season(data$purchase_date)

season_sales <- data %>%
  group_by(season) %>%
  summarise(Total_Sales = sum(total_price))

ggplot(season_sales, aes(x = season, y = Total_Sales, fill=season)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_fill_manual(values = c("Winter" = "slategrey", "Spring" = "steelblue2", 
                               "Summer" = "lightblue", "Fall" = "skyblue")) +
  labs(x = 'Seasons', y='Total Sales') +
  theme(legend.position = 'none')

Conversely, the Winter season also shows significant sales figures, likely driven by holiday shopping and seasonal promotions. The winter months often see a spike in consumer spending due to events like Black Friday, Cyber Monday, and Christmas, where retailers offer significant discounts and promotions. Additionally, the colder weather can drive consumers indoors, encouraging online shopping and exploration of new products.

Day of the Week

Are there certain days when customers tend to buy more? The observation reveal that;

Monday to Wednesday: These days have relatively high sales, indicating that customers are more likely to make purchases at the start of the workweek, possibly due to beginning-of-week shopping habits.
Weekend: Strong sales on the weekend, particularly on Saturday, suggest that customers are making more purchases when they have more free time or are out shopping, likely for leisure or planned purchases.
Thursday Dip: The drop in sales on Thursday is interesting, as it represents a lull in customer activity. This could be a point of focus for targeted promotions or marketing to boost sales on this day.

data$month <- month(data$purchase_date, label = TRUE, abbr = TRUE) 
data$day_of_week <- wday(data$purchase_date, label = TRUE, abbr = TRUE) 
data$year <- format(data$purchase_date, "%Y")  # Extract day of the week 
# ------------ Day of the Week ---------
# Are there certain days when customers tend to buy more?
day_of_week_sales <- data %>%
  group_by(day_of_week) %>%
  summarise(Total_Sales = sum(total_price)) 

ggplot(day_of_week_sales, aes(y = day_of_week, x = Total_Sales, fill=Total_Sales)) +
  geom_bar(stat = 'identity') +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_fill_viridis() +
  #geom_text(aes(label = scales::dollar(Total_Sales)),  # Add labels with dollar formatting
  #          vjust = -0.5, size = 3.5) +
  labs(x = 'Day of the Week',
       y = ' Total Sales') + theme_minimal()

Gender

A closer look into the gender column reveals an unusual entry where instead of standard NA, we have #N/A for one customer with ID 19998. Further analysis will aim to determine how to impute this value using other available variables. In general, the data shows a slightly higher number of male customers(50.8%) compared to female customers(49.2%).

# ------------------- Gender analysis ----------
table(data$gender) #

## 
##   #N/A Female   Male 
##      1   9835  10164

# Filter the dataset for specific genders
filtered_data <- data %>%
  filter(gender %in% c("#N/A"))
filtered_data # shows that customer_id = 19998 has gender '#N/A'

# find the if any other customer has that id
data %>%
  filter(customer_id == 19998) # only 1 customer has that id

# Filter the dataset to remove entries where gender is '#N/A'
gender_data <- data %>%
  filter(gender != "#N/A")

# Count the occurrences of each gender and calculate percentages
gender_counts <- gender_data %>%
  group_by(gender) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)  # Calculate percentage

# Create the pie chart with custom colors and percentages
ggplot(gender_counts, aes(x = "", y = Count, fill = gender)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +  # Convert to pie chart
  labs(fill = "Gender") +
  theme_void() +  # Remove background grid and axes
  theme(plot.title = element_text(hjust = 0.5)) +  # Center the title
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), color = "white") +  # Add percentage labels
  scale_fill_manual(values = c("Male" = "steelblue2", "Female" = "tan2"))

Order Status

There are more completed orders than cancelled orders. Proportion-wise 67.2% of the orders are completed compared to 32.8%. This indicates that while most orders are successfully completed, a significant portion (about one-third) ends up being canceled, which could warrant further investigation into the reasons behind these cancellations to improve the completion rate.

# ---------------- Order Status --------------------
# Order status analysis
order_status_summary <- data %>%
  group_by(order_status) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

order_plt <- ggplot(order_status_summary, aes(x = reorder(order_status, -Count), y = Count, fill = order_status)) +
  geom_bar(stat = "identity") +
  theme_minimal() + 
  scale_fill_manual(values = c("Completed" = "plum3", "Cancelled" = "lightblue")) +
  labs(x = "Order Status", y = "Count") +
  theme(legend.position = 'none')

print(order_plt)

Investigating Reason for Cancellation

Several factors could contribute to this high cancellation rate. For instance, it may vary by year, as the comparison between 2023 and 2024 shows fluctuations. It could also be influenced by product type, with different categories (smartphones, laptops, etc.) potentially having different cancellation rates. Additionally, customer demographics, such as gender preferences, may play a role in understanding why certain products are more frequently canceled.

Purchase Year	Cancelled	Completed
2023	899 (0.333%)	1799 (0.667%)
2024	5669 (0.327%)	11633 (0.672%)

Product Type	Cancelled	Completed
Headphones	0.3232	0.6768
Laptop	0.3239	0.6761
Smartphone	0.3302	0.6698
Smartwatch	0.3299	0.6701
Tablet	0.3311	0.6689

The results show that cancellation rates are fairly consistent across all product types and year, with around 32-33% of orders being canceled.

Could the rating also impact the order status? This question prompted my analysis. The density plot didn’t show much of a difference, or really hard to tell

# rating and order status
ggplot(data, aes(x = rating, fill = order_status)) +
  geom_density(alpha = 0.6) +
  labs(x = "Rating", y = "Density") +
  scale_fill_manual(values = c("Cancelled" = "tomato", "Completed" = "steelblue2")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Hence we conducted a t-test to assess whether there is a significant difference in the mean ratings between cancelled and completed orders.

t.test(rating ~ order_status, data = data)

## 
##  Welch Two Sample t-test
## 
## data:  rating by order_status
## t = -0.59223, df = 13086, p-value = 0.5537
## alternative hypothesis: true difference in means between group Cancelled and group Completed is not equal to 0
## 95 percent confidence interval:
##  -0.04696007  0.02516780
## sample estimates:
## mean in group Cancelled mean in group Completed 
##                3.086632                3.097528

The p-value (0.5537) is much larger than the common significance threshold (0.05), indicating that there is no statistically significant difference in the average ratings between canceled and completed orders. In summary, based on the data, lower ratings do not seem to correlate with a higher likelihood of order cancellation.

Payment Method

The data reveals that the most commonly used payment method among customers is the Credit Card, closely followed by PayPal. This predominance of Credit Card usage aligns with common consumer behavior, as many individuals prefer to make purchases on credit.

# ---------------Payment method
table(data$payment_method) # paypal appears to have been save different

## 
## Bank Transfer          Cash   Credit Card    Debit Card        Paypal 
##          3371          2492          5868          2471          2514 
##        PayPal 
##          3284

# Replace 'Paypal' with 'PayPal'
data$payment_method <- gsub("Paypal", "PayPal", data$payment_method)
#re-run the table code above
prop.table(table(data$payment_method, data$order_status), margin = 1)

##                
##                 Cancelled Completed
##   Bank Transfer 0.3298724 0.6701276
##   Cash          0.3069823 0.6930177
##   Credit Card   0.3355487 0.6644513
##   Debit Card    0.3184945 0.6815055
##   PayPal        0.3337358 0.6662642

# Create a jitter plot
payment_data <- data %>% group_by(payment_method) %>%
  summarise(count = n())

#install.packages("ggbeeswarm")
library(ggbeeswarm)
swarm_plot <- ggplot(payment_data, aes(x = payment_method, y = count)) +
  geom_beeswarm(aes(color = payment_method), size = 3) +
  geom_text(aes(label = count), 
            position = position_jitter(width = 0.2), # Adjust position of text to avoid overlap
            vjust = -0.5,  # Position text slightly above the points
            size = 4,     # Adjust size of the text
            color = "black") +
  theme_minimal() +
  labs(x = "Payment Method",
       y = "Frequency") +
  theme(legend.position = "none")

# Display the plot
print(swarm_plot)

The question we are asking is that; Do customers using different payment methods tend to spend more or less? And from the ANOVA result, we know that there is a statistically significant difference in spending across different payment methods.

# Compare average total price by payment method
payment_method_comparison <- aggregate(total_price ~ payment_method, data = data, mean)

# ANOVA test for differences in spending by payment method
anova_payment <- aov(total_price ~ payment_method, data = data)
summary(anova_payment)

##                   Df    Sum Sq   Mean Sq F value Pr(>F)    
## payment_method     4 2.853e+09 713330440   112.6 <2e-16 ***
## Residuals      19995 1.267e+11   6335510                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Add-Ons-Purchase

It’s interesting to see that a lot of customers tend to add extra items to their purchases, with “Accessory” and “Impulse Item” being the most common additions. This behavior suggests that shoppers are not just looking for what they originally came for; they’re open to exploring complementary products. It shows a level of engagement and can be a good sign for businesses. When customers are willing to buy additional items, it indicates that they feel satisfied and perhaps excited about their purchases. This trend could encourage businesses to think about how they market accessories or impulse buys, as it seems to resonate well with customers!

# --- Add-ons-purchase
items_df <- data.frame(items = data$add_ons_purchased)
items_df <- na.omit(items_df)
items_long <- items_df %>%
  separate_rows(items, sep = ",") %>%
  mutate(items = trimws(items))

Add-ons-purchase
Accessory	Extended Warranty	Impulse Item
10,048	9,975	10,234

Conclusion

In conclusion, our analysis of the customer transaction dataset has uncovered key insights into purchasing behavior and trends. The daily sales trend shows notable fluctuations, with peaks and troughs throughout the period. A significant increase in sales is observed from January 2024 compared to 2023, likely driven by seasonal effects, promotional campaigns, or product launches. Additionally, male customers slightly outnumber female customers, and loyalty members, while spending less on average, report higher satisfaction through better ratings.

The product analysis showed that smartphones and tablets had the highest purchase frequency, making them the most popular product types. However, when looking at average prices, smartphones and smartwatches had the highest average price, followed by laptops, tablets, and headphones. This highlights that smartphones dominate both in popularity and pricing in the premium market, with smartwatches also commanding high prices despite lower purchase volumes.

An intriguing aspect of the data is the disparity between loyalty members and non-members—loyalty members tend to spend slightly less but express higher satisfaction through better ratings.

Regarding seasonal variations, the data suggests that order status (whether purchased or canceled) remained relatively stable across seasons. This implies that external factors like holidays or seasonal promotions do not significantly influence cancellations. This stability reinforces the idea that businesses should continue running campaigns throughout peak seasons and consider introducing more promotions in the fall to boost sales. By understanding what motivates customers to buy, businesses can build more robust strategies, with loyalty programs being a crucial element. These programs not only reward repeat customers but also foster stronger connections with the brand, driving customer retention and repeat purchases.

When examining demographics, no significant purchasing preferences were revealed across different age groups or genders. This suggests that product preferences, such as smartphones, tablets, and smartwatches, were consistent regardless of customer demographics, indicating that marketing strategies do not need to be heavily adjusted based on age or gender.

Our day-of-week analysis revealed that sales were consistently higher on weekdays, with Wednesday and Monday being the strongest sales days. This trend suggests that midweek promotions or work-related needs might drive consumer purchases.

Finally, the ANOVA test on payment methods showed a significant difference in spending across different methods. Customers using credit cards spent more on average than those using mobile payment options, highlighting the importance of optimizing payment methods to increase transaction values.

To capitalize on these insights, businesses should focus on targeted marketing strategies that engage both loyalty members and new customers. Personalized promotions, exclusive offers for loyalty members, and marketing campaigns aligned with seasonal trends can further enhance customer engagement and sales performance.

In conclusion, these findings provide critical guidance for refining marketing strategies, improving customer satisfaction, and increasing sales by better understanding customer behavior and preferences.

Reference

Baumer, B. S., Kaplan, D. T., & Horton, N. J. (2021). Modern Data Science with R. CRC Press. http://books.google.ie/books?id=SokmEAAAQBAJ&printsec=frontcover&dq=Wickham,+H.+(2022).+ggplot2:+Create+Elegant+Data+Visualisations+Using+the+Grammar+of+Graphics.+R+Foundation+for+Statistical+Computing.+URL:+https://cran.r-project.org/package%3Dggplot2&hl=&cd=3&source=gbs_api

Calzon, B. (2023, August 10). Misleading Statistics – Real Life Examples Of Data Misuse. BI Blog | Data Visualization & Analytics Blog | datapine. https://www.datapine.com/blog/misleading-statistics-and-data/

Customer purchase behavior - Electronic Sales Data. (2024, September 23). Kaggle. https://www.kaggle.com/datasets/cameronseamons/electronic-sales-sep2023-sep2024

Datar, R. (n.d.). Hands-On Exploratory Data Analysis with R. O’Reilly Online Learning. https://www.oreilly.com/library/view/hands-on-exploratory-data/9781789804379/5a7a58f4-66f7-4661-8f6d-9c5c09eaa23a.xhtml

Customer Purchase Behavior - Electronic Sales Data

Sheila Kwartemaa Boateng

2024-09-23