We’ve all experienced the moment when a product catches our eye, leading us to make an impulse decision to purchase. Whether it’s walking into a store or scrolling online, that urge to buy doesn’t just happen by chance. Retailers invest billions each year trying to recreate that exact moment for consumers, utilizing web campaigns, social media, advertisements, and branding to nudge buyers toward making a purchase. But what drives this behavior? And how do businesses capture and repeat that pivotal moment when a consumer turns into a paying customer?
When it comes to Customer Purchase Behavior in the electronics market, understanding the drivers behind buying decisions becomes even more crucial. The choices consumers make are influenced by various factors, including product type, pricing, brand reputation, and promotional campaigns. By studying consumer purchase patterns, businesses can identify what resonates with their audience and refine their strategies accordingly.
This report provides an analysis of the Electronic Sales dataset for an electronics company over a one-year period, spanning from September 2023 to September 2024. It includes detailed information about customer demographics, product types, and purchase behaviors. The analysis aims to uncover insights into customer purchasing behavior and product performance.
What are the key factors influencing customer purchase behavior in the electronics market?
How does product type influence customer purchase decisions?
Does pricing impact customer purchasing patterns?
How does customer behavior change across different seasons?
Do loyalty members show higher purchasing frequency or spend more than non-members?
What role does brand reputation play in customer purchasing decisions?
Are there any patterns in customer payment method preferences, and do they influence the total amount spent?
How do promotions and marketing campaigns influence customer purchase behavior?
What customer demographics (age, gender, etc.) are most predictive of high spending or frequent purchases?
How do shipping options affect purchase completion rates?
Do add-on purchases increase overall spending?
The dataset contains 20,000 entries with 16 variables, including customer demographics, product details, and transaction information.
#load package
library(pacman)
library(psych)
library(viridis)
## Loading required package: viridisLite
library(gridExtra)
p_load(tidyverse, lubridate, knitr, kableExtra, gmodels, janitor, psych)
# load data
data <- read_csv('Electronic_sales.csv')
## Rows: 20000 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Gender, Loyalty Member, Product Type, SKU, Order Status, Payment M...
## dbl (7): Customer ID, Age, Rating, Total Price, Unit Price, Quantity, Add-o...
## date (1): Purchase Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# we use clean name to standardize the variables names
data <- clean_names(data)
Key Features:
Customer ID: Unique identifier for each customer.
Age: Age of the customer (numeric)
Gender: Gender of the customer (Male or Female)
Loyalty Member: (Yes/No) (Values change by time, so pay attention to who cancelled and who signed up)
Product Type: Type of electronic product sold (e.g., Smartphone, Laptop, Tablet)
SKU: a unique code for each product.
Rating: Customer rating of the product (1-5 stars) (Should have no Null Ratings)
Order Status: Status of the order (Completed, Cancelled)
Payment Method: Method used for payment (e.g., Cash, Credit Card, Paypal) # Total Price: Total price of the transaction (numeric)
Unit Price: Price per unit of the product (numeric)
Quantity: Number of units purchased (numeric)
Purchase Date: Date of the purchase (format: YYYY-MM-DD)
Shipping Type: Type of shipping chosen (e.g., Standard, Overnight, Express)
Add-ons Purchased: List of any additional items purchased (e.g., Accessories, Extended Warranty)
Add-on Total: Total price of add-ons purchased (numeric)
# convert to correct data format
data$quantity <- as.integer(data$quantity)
data$age <- as.integer(data$age)
# Convert Purchase.Date to Date type
data$purchase_date <- as.Date(data$purchase_date)
data$gender <- as.factor(data$gender)
data$product_type <- as.factor(data$product_type)
data$loyalty_member <- as.factor(data$loyalty_member)
data$order_status <- as.factor(data$order_status)
data$sku <- as.factor(data$sku)
data$payment_method <- as.factor(data$payment_method)
data$shipping_type <- as.factor(data$shipping_type)
The data contains 16 variables which consist of 7 numeric, 8 character and 1 date datatypes. There are no duplicates values however, the seem to be missing values in the dataset, in total 4868 null values were found, and all the missing values are in the variables ‘add_ons_purchased’ which constitute about 24% of that column values
# check for missing values
sum(is.na(data)) #there are 4868 missing values
## [1] 4868
colSums(is.na(data))
## customer_id age gender loyalty_member
## 0 0 0 0
## product_type sku rating order_status
## 0 0 0 0
## payment_method total_price unit_price quantity
## 0 0 0 0
## purchase_date shipping_type add_ons_purchased add_on_total
## 0 0 4868 0
# percentage of missing values per column
get_percent_missing <- function(df){
#this function produce the missing percentage of each variable
missing_percentage <- colMeans(is.na(df)) * 100
missing_percentage <- data.frame(missing_percentage)
missing_percentage <- missing_percentage %>%
arrange(desc(missing_percentage)) %>% round(., 2)
return(missing_percentage)
}
missing_percentage = get_percent_missing(data)
missing_percentage #check for missing values per columns
# all missing values are in the variables , add_ons_purchased which consitute about 24% of the data
# which is the list of any additional items purchased
# Check for duplicates
sum(duplicated(data))
## [1] 0
The summary statistics present some interesting observations in the data
The age of customers ranges from 18 to 80, with a median age of 49 years which appears reasonable with no obvious outliers.
The total price of transactions varies significantly, with a minimum of $20.75 and a maximum of $11,396.80, indicating a wide range of products and potential premium offerings.
The average total price is $3,180.13, while the median is $2,534.49, suggesting that a few high-priced items may skew the average. This implies present of outliers but could also represent legitimate high-value purchase.
The quantity ranges from 1 to 10, which seems reasonable for electronic products.
All ratings are within the expected range of 1 to 5.
summary(data)
## customer_id age gender loyalty_member
## Min. : 1000 Min. :18.00 #N/A : 1 No :15657
## 1st Qu.: 5478 1st Qu.:33.00 Female: 9835 Yes: 4343
## Median :10500 Median :49.00 Male :10164
## Mean :10484 Mean :48.99
## 3rd Qu.:15504 3rd Qu.:65.00
## Max. :19998 Max. :80.00
##
## product_type sku rating order_status
## Headphones:2011 TBL345 :2062 Min. :1.000 Cancelled: 6568
## Laptop :3973 SKU1002:2042 1st Qu.:2.000 Completed:13432
## Smartphone:5978 SKU1004:2019 Median :3.000
## Smartwatch:3934 SKU1005:2012 Mean :3.094
## Tablet :4104 HDP456 :2010 3rd Qu.:4.000
## SMP234 :1987 Max. :5.000
## (Other):7868
## payment_method total_price unit_price quantity
## Bank Transfer:3371 Min. : 20.75 Min. : 20.75 Min. : 1.000
## Cash :2492 1st Qu.: 1139.68 1st Qu.: 361.18 1st Qu.: 3.000
## Credit Card :5868 Median : 2534.49 Median : 463.96 Median : 5.000
## Debit Card :2471 Mean : 3180.13 Mean : 578.63 Mean : 5.486
## Paypal :2514 3rd Qu.: 4639.60 3rd Qu.: 791.19 3rd Qu.: 8.000
## PayPal :3284 Max. :11396.80 Max. :1139.68 Max. :10.000
##
## purchase_date shipping_type add_ons_purchased add_on_total
## Min. :2023-09-24 Expedited:3272 Length:20000 Min. : 0.000
## 1st Qu.:2024-02-05 Express :3366 Class :character 1st Qu.: 7.615
## Median :2024-04-24 Overnight:3357 Mode :character Median : 51.700
## Mean :2024-04-18 Same Day :3280 Mean : 62.245
## 3rd Qu.:2024-07-08 Standard :6725 3rd Qu.: 93.843
## Max. :2024-09-23 Max. :292.770
##
We further our analysis of the customer transaction dataset to gain deeper insights into various aspects of purchasing behavior and trends and answer our research questions. We will examine the distribution of age and total price to understand how these factors influence purchasing decisions.
Additionally, we will conduct a comprehensive product analysis to assess the performance of different product types in terms of pricing and customer preferences. Insights related to order status will also be explored to evaluate fulfillment processes and overall customer satisfaction.
Furthermore, we will analyze purchase trends over time, focusing on monthly and seasonal patterns to identify fluctuations in customer activity and preferences throughout the year. Through these analyses, we aim to provide actionable insights that can inform strategic decisions and enhance business performance.
The age distribution of customers appears to be relatively normal, with a slight right skew. The majority of customers are between 30 and 50 years old, with a peak around 40-45 years. The distribution of total prices is right-skewed, with most purchases falling in the lower price range (under $5,000). There are some high-value purchases, which could represent bulk orders or expensive products like high-end laptops or multiple smartphones.
# -------------------- TOTAL PRICE -----------------------
ggplot(data, aes(x = total_price)) +
geom_histogram(bins = 50, fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Distribution of Total Price", x = "Total Price", y = "Frequency") +
scale_x_continuous(labels = scales::dollar_format()) +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
# observation: There is a wide range of prices ($20.75 to $11,396.80), which may
# include some outliers but could also represent legitimate high-value purchases.
# boxplot for total price
ggplot(data, aes(x = total_price)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = "Boxplot of Total Price", x = "Total Price") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
# ---------------------- AGE --------------------
ggplot(data, aes(x = age)) +
geom_histogram(binwidth = 5,fill = "plum3", color = "black", alpha = 0.7) +
labs(title = "Distribution of Age", x = "Age", y = "Frequency") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
#The age range appears reasonable (18 to 70 years), with no obvious outliers.
There appears to be a slight positive correlation between age and total price, suggesting that older customers tend to make slightly more expensive purchases. However, the relationship is not very strong, and there is considerable variation across all age groups. The scatter plot shows a positive relationship between Quantity and Total Price. As the quantity increases, the total price tends to increase as well. This is expected in a sales dataset.
# Relationship between Age and total Price
age_p_plt <- ggplot(data, aes(x = age, y = total_price)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "lm", color = "red") +
scale_y_continuous(labels = scales::dollar_format()) +
labs( x = "Age", y = "Total Price") +
theme(plot.title = element_text(hjust = 0.5))
# ----------------- QUANTITY ---------------------
q_plt <- ggplot(data, aes(x = quantity, y = total_price)) +
geom_point(color = "blue", alpha = 0.3) +
geom_smooth(method = "lm", color = "red") +
scale_y_continuous(labels = scales::dollar_format()) +
labs(x = "Quantity", y = "Total Price") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
#combine the scatter plot together
grid.arrange(age_p_plt, q_plt, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The most popular product types, in order, are: Smartphone, Laptop, Tablet, Smartwatch, Headphones.
# ------------ Product Type Distribution -----------------
ggplot(data, aes(x = product_type, fill = ..count..)) +
geom_bar() +
scale_fill_gradient(low = "slategrey", high = "steelblue2") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Distribution of Product Types", x = "Product Type", y = "Count") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
One surprising observation was the higher average price of smartphones compared to laptops. Typically, one might expect laptops, being larger and more multifunctional devices, to be more expensive. However, in this case, smartphones have an average price of $3,599, compared to $3,095 for laptops.
# Summary of Total Price by Product Type
product_summary <- data %>%
group_by(product_type) %>%
summarise(
Avg_Price = mean(total_price),
Median_Price = median(total_price),
Min_Price = min(total_price),
Max_Price = max(total_price)
) %>%
arrange(desc(Avg_Price))
Product Type | Avg_Price | Median_Price | Min_Price | Max_Price |
---|---|---|---|---|
Smartphone | 3599.32 | 3164.76 | 20.75 | 11396.80 |
Smartwatch | 3567.94 | 3216.50 | 459.50 | 8448.3 |
Laptop | 3094.95 | 2783.76 | 463.96 | 6743.2 |
Tablet | 2853.80 | 2223.27 | 247.03 | 7864.1 |
Headphones | 2009.65 | 2167.08 | 361.18 | 4130.2 |
The question we want to answer is that; Do men and women (or different age groups) prefer different types of products?
# Violin plot to compare product preferences by gender
ggplot(data, aes(x = product_type, y = age, fill=gender)) +
geom_violin(trim = FALSE) + # Create violin plot without trimming tails
scale_fill_manual(values = c("#FF9999", "#9999FF")) + # Color fill for male/female
labs(title = "Product Preferences by Gender",
x = "Product Type",
y = "Age",
fill = "Gender") + # Customize plot labels
theme_minimal() + # Use minimal theme for clean look
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
From the visualization, there seem not be product type preference either by age or gender. The violin plot doesn’t seem to show any variation. We conducted a chi-square test, and it futher proof our assumption that there is no statistically significant relationship between age group or gender and product type preferences in this dataset.
# Cross-tabulation of product type by gender
product_gender_table <- table(data$product_type, data$gender)
# Chi-square test to check relationship
chisq.test(product_gender_table)
##
## Pearson's Chi-squared test
##
## data: product_gender_table
## X-squared = 7.5442, df = 8, p-value = 0.4792
# For age, create age groups and repeat similar analysis
data$age_group <- cut(data$age, breaks = c(18, 30, 40, 50, 60, 100))
age_product <- table(data$product_type, data$age_group)
# Chi-square test to check relationship
chisq.test(age_product)
##
## Pearson's Chi-squared test
##
## data: age_product
## X-squared = 19.127, df = 16, p-value = 0.2621
Interestingly, loyalty members have a slightly lower average total price compared to non-members. However, they have a marginally higher average rating, which could indicate higher satisfaction despite spending less. The fact that loyalty members spend slightly less on average than non-members is intriguing. This could indicate that the loyalty program is attracting price-sensitive customers or that it’s offering discounts that reduce the overall spend. However, the higher average rating from loyalty members suggests that the program is successful in terms of customer satisfaction but not with much margin.
# -------------------- Loyalty Member Impact ----------
loyalty_impact <- data %>%
group_by(loyalty_member) %>%
summarise(
Avg_Total_Price = mean(total_price),
Avg_Quantity = mean(quantity),
Avg_Rating = mean(rating)
)
Loyalty Member | Avg_Total Price | Avg_Quantity | Avg_Rating |
---|---|---|---|
No | 3191.97 | 5.49 | 3.09 |
Yes | 3137.44 | 5.46 | 3.10 |
# frequency of loyalty members
loyalty_summary <- data %>%
group_by(loyalty_member) %>%
summarise(Count = n()) %>%
mutate(Percentage = Count / sum(Count) * 100)
# Plotting the percentages
ggplot(loyalty_summary, aes(x = loyalty_member, y = Percentage, fill = loyalty_member)) +
geom_bar(stat = 'identity') +
scale_y_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 100)) +
geom_text(aes(label = Percentage), vjust = -0.5, size = 5) +
labs(x = "Loyalty Membership Status", y = "Percentage") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
The bar plot clearly illustrates that non-loyalty members outnumber loyalty members in our dataset. This trend highlights a potential opportunity for the company to enhance its loyalty programs and strategies to attract more customers to become loyalty members. Notably, there isn’t a significant price difference between loyalty and non-loyalty members. I believe the company should consider increasing discounts for loyalty members to further incentivize enrollment in the loyalty program and improve customer retention.
The daily sales trend shows considerable fluctuations, with some noticeable peaks and troughs. Starting in January 2024, there has been a clear increase in total sales compared to 2023. This rise may be attributed to seasonal effects, promotional events, or new product launches driving higher customer spending during this period.
# --------------------- Time series of daily sales -----------------
daily_sales <- data %>%
group_by(purchase_date) %>%
summarise(Total_Sales = sum(total_price))
daily_sales$purchase_date <- as.Date(daily_sales$purchase_date)
Year | Headphones | Laptop | Smartphone | Smartwatch | Tablet |
---|---|---|---|---|---|
2023 | 0 | 521 | 1080 | 515 | 582 |
2024 | 2011 | 3452 | 4898 | 3419 | 3522 |
# Create the time series plot for daily sales
ggplot(daily_sales, aes(x = purchase_date, y = Total_Sales)) +
geom_line(group = 1, color = "black") + # Set group=1 to treat the entire dataset as one group
scale_y_continuous(labels = scales::dollar_format()) +
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
labs(title = "Daily Sales Over Time", x = "Date", y = "Total Sales") +
theme_minimal() + # Optional: Add a minimal theme
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
The increase in product types available in 2024 is notable, especially considering that the 2023 data only begins in October. This limited timeframe likely contributes to the significant surge in sales prices observed in 2024. The comparison highlights that the expanded availability of products in 2024, as opposed to the constrained data from late 2023, plays a crucial role in this trend. With the 2023 dataset covering only a few months, it fails to provide a comprehensive understanding of consumer purchasing behavior over the entire year.
The analysis indicates a significant surge in total sales during the Spring and Summer seasons. This trend is likely linked to the timing of new smartphone releases, which typically occur during these months, attracting heightened consumer interest and engagement. As consumers seek out the latest technology and upgrades, the increased availability of new products during this period contributes to the overall sales boost.
# Function to categorize dates into seasons
get_season <- function(dates) {
# Extract the month and day
month_day <- format(dates, "%m-%d")
# Assign season based on the month-day ranges
ifelse(month_day >= "03-21" & month_day <= "06-20", "Spring",
ifelse(month_day >= "06-21" & month_day <= "09-20", "Summer",
ifelse(month_day >= "09-21" & month_day <= "12-20", "Fall", "Winter")))
}
data$season <- get_season(data$purchase_date)
season_sales <- data %>%
group_by(season) %>%
summarise(Total_Sales = sum(total_price))
ggplot(season_sales, aes(x = season, y = Total_Sales, fill=season)) +
geom_bar(stat = 'identity') +
scale_y_continuous(labels = scales::dollar_format()) +
scale_fill_manual(values = c("Winter" = "slategrey", "Spring" = "steelblue2",
"Summer" = "lightblue", "Fall" = "skyblue")) +
labs(x = 'Seasons', y='Total Sales') +
theme(legend.position = 'none')
Conversely, the Winter season also shows significant sales figures, likely driven by holiday shopping and seasonal promotions. The winter months often see a spike in consumer spending due to events like Black Friday, Cyber Monday, and Christmas, where retailers offer significant discounts and promotions. Additionally, the colder weather can drive consumers indoors, encouraging online shopping and exploration of new products.
Are there certain days when customers tend to buy more? The observation reveal that;
Monday to Wednesday: These days have relatively high sales, indicating that customers are more likely to make purchases at the start of the workweek, possibly due to beginning-of-week shopping habits.
Weekend: Strong sales on the weekend, particularly on Saturday, suggest that customers are making more purchases when they have more free time or are out shopping, likely for leisure or planned purchases.
Thursday Dip: The drop in sales on Thursday is interesting, as it represents a lull in customer activity. This could be a point of focus for targeted promotions or marketing to boost sales on this day.
data$month <- month(data$purchase_date, label = TRUE, abbr = TRUE)
data$day_of_week <- wday(data$purchase_date, label = TRUE, abbr = TRUE)
data$year <- format(data$purchase_date, "%Y") # Extract day of the week
# ------------ Day of the Week ---------
# Are there certain days when customers tend to buy more?
day_of_week_sales <- data %>%
group_by(day_of_week) %>%
summarise(Total_Sales = sum(total_price))
ggplot(day_of_week_sales, aes(y = day_of_week, x = Total_Sales, fill=Total_Sales)) +
geom_bar(stat = 'identity') +
scale_x_continuous(labels = scales::dollar_format()) +
scale_fill_viridis() +
#geom_text(aes(label = scales::dollar(Total_Sales)), # Add labels with dollar formatting
# vjust = -0.5, size = 3.5) +
labs(x = 'Day of the Week',
y = ' Total Sales') + theme_minimal()
A closer look into the gender column reveals an unusual entry where instead of standard NA, we have #N/A for one customer with ID 19998. Further analysis will aim to determine how to impute this value using other available variables. In general, the data shows a slightly higher number of male customers(50.8%) compared to female customers(49.2%).
# ------------------- Gender analysis ----------
table(data$gender) #
##
## #N/A Female Male
## 1 9835 10164
# Filter the dataset for specific genders
filtered_data <- data %>%
filter(gender %in% c("#N/A"))
filtered_data # shows that customer_id = 19998 has gender '#N/A'
# find the if any other customer has that id
data %>%
filter(customer_id == 19998) # only 1 customer has that id
# Filter the dataset to remove entries where gender is '#N/A'
gender_data <- data %>%
filter(gender != "#N/A")
# Count the occurrences of each gender and calculate percentages
gender_counts <- gender_data %>%
group_by(gender) %>%
summarise(Count = n()) %>%
mutate(Percentage = Count / sum(Count) * 100) # Calculate percentage
# Create the pie chart with custom colors and percentages
ggplot(gender_counts, aes(x = "", y = Count, fill = gender)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) + # Convert to pie chart
labs(fill = "Gender") +
theme_void() + # Remove background grid and axes
theme(plot.title = element_text(hjust = 0.5)) + # Center the title
geom_text(aes(label = paste0(round(Percentage, 1), "%")),
position = position_stack(vjust = 0.5), color = "white") + # Add percentage labels
scale_fill_manual(values = c("Male" = "steelblue2", "Female" = "tan2"))
There are more completed orders than cancelled orders. Proportion-wise 67.2% of the orders are completed compared to 32.8%. This indicates that while most orders are successfully completed, a significant portion (about one-third) ends up being canceled, which could warrant further investigation into the reasons behind these cancellations to improve the completion rate.
# ---------------- Order Status --------------------
# Order status analysis
order_status_summary <- data %>%
group_by(order_status) %>%
summarise(Count = n()) %>%
mutate(Percentage = Count / sum(Count) * 100)
order_plt <- ggplot(order_status_summary, aes(x = reorder(order_status, -Count), y = Count, fill = order_status)) +
geom_bar(stat = "identity") +
theme_minimal() +
scale_fill_manual(values = c("Completed" = "plum3", "Cancelled" = "lightblue")) +
labs(x = "Order Status", y = "Count") +
theme(legend.position = 'none')
print(order_plt)
Several factors could contribute to this high cancellation rate. For instance, it may vary by year, as the comparison between 2023 and 2024 shows fluctuations. It could also be influenced by product type, with different categories (smartphones, laptops, etc.) potentially having different cancellation rates. Additionally, customer demographics, such as gender preferences, may play a role in understanding why certain products are more frequently canceled.
Purchase Year | Cancelled | Completed |
---|---|---|
2023 | 899 (0.333%) | 1799 (0.667%) |
2024 | 5669 (0.327%) | 11633 (0.672%) |
Product Type | Cancelled | Completed |
---|---|---|
Headphones | 0.3232 | 0.6768 |
Laptop | 0.3239 | 0.6761 |
Smartphone | 0.3302 | 0.6698 |
Smartwatch | 0.3299 | 0.6701 |
Tablet | 0.3311 | 0.6689 |
The results show that cancellation rates are fairly consistent across all product types and year, with around 32-33% of orders being canceled.
Could the rating also impact the order status? This question prompted my analysis. The density plot didn’t show much of a difference, or really hard to tell
# rating and order status
ggplot(data, aes(x = rating, fill = order_status)) +
geom_density(alpha = 0.6) +
labs(x = "Rating", y = "Density") +
scale_fill_manual(values = c("Cancelled" = "tomato", "Completed" = "steelblue2")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
Hence we conducted a t-test to assess whether there is a significant difference in the mean ratings between cancelled and completed orders.
t.test(rating ~ order_status, data = data)
##
## Welch Two Sample t-test
##
## data: rating by order_status
## t = -0.59223, df = 13086, p-value = 0.5537
## alternative hypothesis: true difference in means between group Cancelled and group Completed is not equal to 0
## 95 percent confidence interval:
## -0.04696007 0.02516780
## sample estimates:
## mean in group Cancelled mean in group Completed
## 3.086632 3.097528
The p-value (0.5537) is much larger than the common significance threshold (0.05), indicating that there is no statistically significant difference in the average ratings between canceled and completed orders. In summary, based on the data, lower ratings do not seem to correlate with a higher likelihood of order cancellation.
The data reveals that the most commonly used payment method among customers is the Credit Card, closely followed by PayPal. This predominance of Credit Card usage aligns with common consumer behavior, as many individuals prefer to make purchases on credit.
# ---------------Payment method
table(data$payment_method) # paypal appears to have been save different
##
## Bank Transfer Cash Credit Card Debit Card Paypal
## 3371 2492 5868 2471 2514
## PayPal
## 3284
# Replace 'Paypal' with 'PayPal'
data$payment_method <- gsub("Paypal", "PayPal", data$payment_method)
#re-run the table code above
prop.table(table(data$payment_method, data$order_status), margin = 1)
##
## Cancelled Completed
## Bank Transfer 0.3298724 0.6701276
## Cash 0.3069823 0.6930177
## Credit Card 0.3355487 0.6644513
## Debit Card 0.3184945 0.6815055
## PayPal 0.3337358 0.6662642
# Create a jitter plot
payment_data <- data %>% group_by(payment_method) %>%
summarise(count = n())
#install.packages("ggbeeswarm")
library(ggbeeswarm)
swarm_plot <- ggplot(payment_data, aes(x = payment_method, y = count)) +
geom_beeswarm(aes(color = payment_method), size = 3) +
geom_text(aes(label = count),
position = position_jitter(width = 0.2), # Adjust position of text to avoid overlap
vjust = -0.5, # Position text slightly above the points
size = 4, # Adjust size of the text
color = "black") +
theme_minimal() +
labs(x = "Payment Method",
y = "Frequency") +
theme(legend.position = "none")
# Display the plot
print(swarm_plot)
The question we are asking is that; Do customers using different payment methods tend to spend more or less? And from the ANOVA result, we know that there is a statistically significant difference in spending across different payment methods.
# Compare average total price by payment method
payment_method_comparison <- aggregate(total_price ~ payment_method, data = data, mean)
# ANOVA test for differences in spending by payment method
anova_payment <- aov(total_price ~ payment_method, data = data)
summary(anova_payment)
## Df Sum Sq Mean Sq F value Pr(>F)
## payment_method 4 2.853e+09 713330440 112.6 <2e-16 ***
## Residuals 19995 1.267e+11 6335510
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It’s interesting to see that a lot of customers tend to add extra items to their purchases, with “Accessory” and “Impulse Item” being the most common additions. This behavior suggests that shoppers are not just looking for what they originally came for; they’re open to exploring complementary products. It shows a level of engagement and can be a good sign for businesses. When customers are willing to buy additional items, it indicates that they feel satisfied and perhaps excited about their purchases. This trend could encourage businesses to think about how they market accessories or impulse buys, as it seems to resonate well with customers!
# --- Add-ons-purchase
items_df <- data.frame(items = data$add_ons_purchased)
items_df <- na.omit(items_df)
items_long <- items_df %>%
separate_rows(items, sep = ",") %>%
mutate(items = trimws(items))
Accessory | Extended Warranty | Impulse Item |
---|---|---|
10,048 | 9,975 | 10,234 |
In conclusion, our analysis of the customer transaction dataset has uncovered key insights into purchasing behavior and trends. The daily sales trend shows notable fluctuations, with peaks and troughs throughout the period. A significant increase in sales is observed from January 2024 compared to 2023, likely driven by seasonal effects, promotional campaigns, or product launches. Additionally, male customers slightly outnumber female customers, and loyalty members, while spending less on average, report higher satisfaction through better ratings.
The product analysis showed that smartphones and tablets had the highest purchase frequency, making them the most popular product types. However, when looking at average prices, smartphones and smartwatches had the highest average price, followed by laptops, tablets, and headphones. This highlights that smartphones dominate both in popularity and pricing in the premium market, with smartwatches also commanding high prices despite lower purchase volumes.
An intriguing aspect of the data is the disparity between loyalty members and non-members—loyalty members tend to spend slightly less but express higher satisfaction through better ratings.
Regarding seasonal variations, the data suggests that order status (whether purchased or canceled) remained relatively stable across seasons. This implies that external factors like holidays or seasonal promotions do not significantly influence cancellations. This stability reinforces the idea that businesses should continue running campaigns throughout peak seasons and consider introducing more promotions in the fall to boost sales. By understanding what motivates customers to buy, businesses can build more robust strategies, with loyalty programs being a crucial element. These programs not only reward repeat customers but also foster stronger connections with the brand, driving customer retention and repeat purchases.
When examining demographics, no significant purchasing preferences were revealed across different age groups or genders. This suggests that product preferences, such as smartphones, tablets, and smartwatches, were consistent regardless of customer demographics, indicating that marketing strategies do not need to be heavily adjusted based on age or gender.
Our day-of-week analysis revealed that sales were consistently higher on weekdays, with Wednesday and Monday being the strongest sales days. This trend suggests that midweek promotions or work-related needs might drive consumer purchases.
Finally, the ANOVA test on payment methods showed a significant difference in spending across different methods. Customers using credit cards spent more on average than those using mobile payment options, highlighting the importance of optimizing payment methods to increase transaction values.
To capitalize on these insights, businesses should focus on targeted marketing strategies that engage both loyalty members and new customers. Personalized promotions, exclusive offers for loyalty members, and marketing campaigns aligned with seasonal trends can further enhance customer engagement and sales performance.
In conclusion, these findings provide critical guidance for refining marketing strategies, improving customer satisfaction, and increasing sales by better understanding customer behavior and preferences.
Baumer, B. S., Kaplan, D. T., & Horton, N. J. (2021). Modern Data Science with R. CRC Press. http://books.google.ie/books?id=SokmEAAAQBAJ&printsec=frontcover&dq=Wickham,+H.+(2022).+ggplot2:+Create+Elegant+Data+Visualisations+Using+the+Grammar+of+Graphics.+R+Foundation+for+Statistical+Computing.+URL:+https://cran.r-project.org/package%3Dggplot2&hl=&cd=3&source=gbs_api
Calzon, B. (2023, August 10). Misleading Statistics – Real Life Examples Of Data Misuse. BI Blog | Data Visualization & Analytics Blog | datapine. https://www.datapine.com/blog/misleading-statistics-and-data/
Customer purchase behavior - Electronic Sales Data. (2024, September 23). Kaggle. https://www.kaggle.com/datasets/cameronseamons/electronic-sales-sep2023-sep2024
Datar, R. (n.d.). Hands-On Exploratory Data Analysis with R. O’Reilly Online Learning. https://www.oreilly.com/library/view/hands-on-exploratory-data/9781789804379/5a7a58f4-66f7-4661-8f6d-9c5c09eaa23a.xhtml