This project explores and analyzes e-commerce customer data from some of the largest cities in the United States. Using R, we visualize trends and relationships between variables such as gender, age, membership type, items purchased, and total amount spent. This analysis provides insight into customer behavior.
Online customer behavior is affected by many factors that differ from traditional shopping. E-commerce lacks many aspects of customer service and the sensory experiences of in-person shopping. Because of these differences, this analysis mainly uses quantitative variables, with a limited number of relevant categorical variables including gender, city, membership type, a true or false value for whether or not a discount was applied, and customer satisfaction level. This project uses R programming to visualize relationships between variables and attempts to draw conclusions on the impact of these relationships on the greater scope of customer behavior.
Data Source: Kaggle E-commerce Customer Behavior Dataset. Tools: R programming language, base packages, readr, dplyr, ggplot2. Data preprocessing: grouping and cleaning data. Data visualization: creating boxplots, scatterplots, and barplots using ggplot2.
## # A tibble: 3 × 6
## Membership.Type Min Q1 Median Q3 Max
## <fct> <chr> <chr> <chr> <chr> <chr>
## 1 Bronze " 410.80" " 440.90" " 475.25" " 500.75" " 530.40"
## 2 Silver " 660.30" " 690.60" " 770.20" " 800.90" " 830.90"
## 3 Gold "1120.20" "1160.60" "1210.60" "1470.50" "1520.10"
This box plot shows the distribution of total spending by membership type. The higher the membership tier, the higher the median spending: in the summary statistics above, the medians are $475.25 for Bronze, $770.20 for Silver, and $1210.60 for Gold. Higher tiers also have larger interquartile ranges (IQRs). The IQR is calculated by subtracting Q1 from Q3 and represents the range of the middle 50% of the data; in a box plot, it is the height of the box. A larger IQR (a taller box) indicates more spread in the data, while a smaller IQR indicates that the middle half of the values sit close to the median. In conclusion, this box plot shows that both typical spending and the spread of spending increase with membership tier.
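As a quick check of the IQR comparison, the subtraction can be done directly on the quartiles printed in the summary table. This is only a sketch: the Q1 and Q3 values are copied by hand from the table above, and the data frame name `spend_quartiles` is introduced here for illustration.

```r
# Quartiles copied from the summary table above (Total.Spend by membership type).
# `spend_quartiles` is an illustrative name, not part of the original analysis.
spend_quartiles <- data.frame(
  Membership.Type = c("Bronze", "Silver", "Gold"),
  Q1 = c(440.90, 690.60, 1160.60),
  Q3 = c(500.75, 800.90, 1470.50)
)

# IQR = Q3 - Q1, rounded to cents to avoid floating-point noise
spend_quartiles$IQR <- round(spend_quartiles$Q3 - spend_quartiles$Q1, 2)

print(spend_quartiles)
```

The resulting IQRs are 59.85 for Bronze, 110.30 for Silver, and 309.90 for Gold, which matches the increasing box heights described above.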
## [1] "Correlation Coefficient: 0.9724"
For this scatter plot, we explore the relationship between the total amount of money spent and items purchased in a given transaction. We can see there is a strong positive relationship between these two variables. The more items a customer purchases, the more they spend. This relationship is supported by a remarkably high correlation coefficient of 0.9724, which suggests a nearly perfect linear relationship.
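For readers unfamiliar with what `cor()` computes, the sketch below reproduces Pearson's correlation coefficient by hand as the covariance divided by the product of the standard deviations. The `items` and `spend` vectors are made-up toy values used only to illustrate the calculation; they are not taken from the Kaggle data set.

```r
# Hypothetical toy values, made up purely to illustrate the calculation.
items <- c(7, 9, 10, 12, 14, 17, 21)
spend <- c(430.8, 480.3, 520.4, 780.2, 820.9, 1170.3, 1480.1)

# Pearson's r: covariance scaled by both standard deviations
r_manual <- cov(items, spend) / (sd(items) * sd(spend))

# cor() uses the Pearson method by default, so the two should agree
r_builtin <- cor(items, spend)

print(c(manual = r_manual, builtin = r_builtin))
```

A value close to 1, like the 0.9724 reported for the real data, means the points lie close to an upward-sloping straight line.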
The bar plot above shows the average customer rating by city on a scale from 0 to 5. San Francisco, New York, and Los Angeles have the three highest ratings, while Miami, Chicago, and Houston make up the bottom three. The differences in ratings are likely due to factors that vary by location and are not accounted for in the data set. One example is the number of days between order and delivery, which may be shorter in the top three cities if distribution centers or company headquarters are located there. Determining the root cause of these differences would require adding more variables to the data set, further research on how these locations differ from one another, or both.
## [1] "Correlation Coefficient: -0.5401"
In this scatter plot, we explore the relationship between the days since the last purchase and the total spent. The trend line shows a moderate negative relationship, supported by our correlation coefficient of -0.5401. This value suggests that there is a meaningful relationship between these two variables. The relationship may also involve other variables or factors, which could be worth investigating further by fitting a statistical model. Model fitting is outside the scope of this project, but the correlation alone suggests that the longer a customer has gone without making a purchase, the less money they tend to spend. This may imply that customers who do not shop online frequently are less committed to online shopping overall, or that this group of customers supplements their traditional shopping online when they cannot find what they need.
## # A tibble: 2 × 6
## Discount.Applied Min Q1 Median Q3 Max
## <lgl> <chr> <chr> <chr> <chr> <chr>
## 1 FALSE 410.80 460.70 800.90 1450.50 1520.10
## 2 TRUE 475.25 513.25 690.60 1140.60 1210.60
The box plot above explores the impact of discounts on the total amount of money spent in a transaction. The median amount spent is higher when no discount has been applied ($800.90 versus $690.60). This makes sense, since a discount lowers the amount a customer pays compared with what they would otherwise spend. What is curious is that the IQR is much larger when no discount is applied ($989.80 versus $627.35), meaning that the amount a customer spends varies more without a discount. When a discount has been applied, customers spend less at the median and the spread of their spending is smaller. This could suggest that discounts are applied mostly to lower-value purchases or that discounts attract more price-conscious customers. However, further investigation would be necessary to see whether there are any statistically significant differences after accounting for how different types of discounts reduce the total amount spent in a transaction. Depending on its type, a discount can reduce the price of every item in a purchase or only of select items.
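One possible first step toward the significance check mentioned above is a two-sample Wilcoxon rank-sum test, which compares the two spending distributions without assuming normality. This is a swapped-in technique offered as a sketch, not something the original analysis ran. The `toy` data frame below holds made-up values for illustration only, and the test does not account for discount type.

```r
# Hypothetical toy data: the values are made up for illustration only.
# On the real data, `toy` would be replaced by customer_behavior_data.
toy <- data.frame(
  Discount.Applied = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  Total.Spend = c(480.1, 690.6, 505.8, 1140.6, 410.8, 800.9, 1450.5, 1520.1)
)

# Wilcoxon rank-sum test of Total.Spend between the two discount groups.
# exact = FALSE requests the normal approximation, which also tolerates ties.
spend_test <- wilcox.test(Total.Spend ~ Discount.Applied, data = toy, exact = FALSE)

print(spend_test$p.value)
```

A small p-value would indicate that the two groups differ in their spending distributions, but it would not say why, so the caveats about discount type still apply.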
## # A tibble: 3 × 6
## Membership.Type Min Q1 Median Q3 Max
## <fct> <int> <dbl> <dbl> <dbl> <int>
## 1 Bronze 35 37 39.5 42 43
## 2 Silver 26 27 32 34 35
## 3 Gold 28 29 30 31 36
This box plot looks at the age distribution of customers by membership type. Bronze members are the oldest group, aged 35 to 43, silver members are 35 and under (26 to 35), and most gold members are between 28 and 31. We also notice an outlier, shown as a red dot: a gold member aged 36. Silver members have the largest IQR (7 years, from 27 to 34), which tells us that this membership type has the most varied ages. The small IQR of the gold membership (2 years, from 29 to 31) tells us that gold members fall in a narrow age range. Tiered memberships of this kind tend to be associated with purchasing frequency, which may explain the age ranges (the whiskers, from the end of the bottom line to the end of the top line for each group) and IQRs (the height of the boxes) for each membership type.
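The gold-member outlier can be checked against the usual box plot rule, under which `geom_boxplot()` by default draws any point more than 1.5 times the IQR beyond the box edge as an outlier. The sketch below assumes the box edges match the Q1 and Q3 values printed in the table above; the variable names are introduced here for illustration.

```r
# Gold-tier age summary, copied from the table above
gold_q1 <- 29
gold_q3 <- 31
gold_max <- 36

# IQR and the upper fence used by the 1.5 * IQR outlier rule
gold_iqr <- gold_q3 - gold_q1
upper_fence <- gold_q3 + 1.5 * gold_iqr

# TRUE when the oldest gold member lies beyond the upper fence
is_outlier <- gold_max > upper_fence

print(c(iqr = gold_iqr, upper_fence = upper_fence, is_outlier = is_outlier))
```

With an IQR of 2 years, the upper fence is 34, so the 36-year-old gold member falls outside it and is plotted as a separate red point.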
## # A tibble: 3 × 6
## Satisfaction.Level Min Q1 Median Q3 Max
## <fct> <chr> <chr> <chr> <chr> <chr>
## 1 Unsatisfied 475.25 500.75 595.35 690.60 730.40
## 2 Neutral 410.80 440.90 470.50 800.20 830.90
## 3 Satisfied 820.75 1160.30 1200.80 1470.50 1520.10
The box plot above explores the relationship between customer satisfaction level and the total amount of money spent. Customers who spend more money tend to report being satisfied: the median for satisfied customers is $1200.80. Unsatisfied and neutral customers tend to spend less, and neutral customers have a lower median ($470.50) than unsatisfied customers ($595.35).
## # A tibble: 2 × 6
## Gender Min Q1 Median Q3 Max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Female 411. 461. 501. 1141. 1201.
## 2 Male 660. 710. 801. 1450. 1520.
This box plot explores the differences in spending by gender. The median spending and the overall spending range are higher for male customers. The IQRs are similar (about $680 for female customers and $740 for male customers); in other words, male and female customers have similar variability in their spending, but for male customers that variability sits across a higher range of values. These differences could be due to wage disparities between male and female customers, to who places the orders among heterosexual couples, or to differences in price consciousness by gender. However, the fact that male customers tend to spend more than female customers is likely due to a combination of many factors, and an informed conclusion would require further investigation.
In this bar plot, we can see that the three cities with the most items purchased in total are San Francisco, New York, and Los Angeles. We surmise that customers in these cities purchase more items because they tend to earn more, partly due to higher minimum wages, and because these areas may have lower shipping costs, especially if the company's distribution centers are located in California and New York, which seems likely in this case.
This project aimed to explore and analyze e-commerce customer data to observe and draw possible conclusions about customer behavior. We show the importance of considering various factors in the data and emphasize that with a topic such as customer behavior, additional information, variables, and model fitting are needed to make the most informed conclusions about the causes of certain observations within the data.
Kaggle Dataset: https://www.kaggle.com/datasets/uom190346a/e-commerce-customer-behavior-dataset
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(readr)
library(dplyr)
customer_behavior_data <- read.csv("E-commerce_Customer_Behavior.csv")
customer_behavior_data$Membership.Type <- factor(customer_behavior_data$Membership.Type,
levels = c("Bronze", "Silver", "Gold"))
ggplot(customer_behavior_data, aes(x = Membership.Type, y = Total.Spend, fill = Membership.Type)) +
geom_boxplot(outlier.colour = "red", outlier.size = 1) +
labs(title = "Total Spent Distribution by Membership Type",
x = "Membership Type",
y = "Total Spent") +
scale_fill_manual(values = c("Bronze" = "orange", "Silver" = "gray", "Gold" = "yellow")) +
theme_minimal() +
theme(legend.position = "none")
box_plot_summary_stats <- customer_behavior_data %>%
group_by(Membership.Type) %>%
summarise(Min = min(Total.Spend, na.rm = TRUE),
Q1 = quantile(Total.Spend, 0.25, na.rm = TRUE),
Median = median(Total.Spend, na.rm = TRUE),
Q3 = quantile(Total.Spend, 0.75, na.rm = TRUE),
Max = max(Total.Spend, na.rm = TRUE))
box_plot_summary_stats <- box_plot_summary_stats %>%
mutate(across(c(Min, Q1, Median, Q3, Max), ~ format(round(., 2), nsmall = 2)))
print(box_plot_summary_stats)
ggplot(customer_behavior_data, aes(x = Items.Purchased, y = Total.Spend)) +
geom_point(alpha = 0.4, color = "blue") +
labs(title = "Items Purchased vs. Total Spent",
x = "Items Purchased",
y = "Total Spent") +
theme_minimal()
correlation_coefficient1 <- cor(customer_behavior_data$Items.Purchased,
customer_behavior_data$Total.Spend,
use = "complete.obs")
print(paste("Correlation Coefficient:", round(correlation_coefficient1, 4)))
average_rating_by_city <- customer_behavior_data %>%
group_by(City) %>%
summarise(AverageRating = mean(Average.Rating, na.rm = TRUE))
ggplot(average_rating_by_city, aes(x = reorder(City, AverageRating), y = AverageRating)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(title = "Average Rating by City",
x = "City",
y = "Average Rating") +
coord_flip() +
theme_minimal()
ggplot(customer_behavior_data,
aes(x = Days.Since.Last.Purchase,
y = Total.Spend)) +
geom_point(alpha = 0.4, color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Days Since Last Purchase vs. Total Spent",
x = "Days Since Last Purchase",
y = "Total Spent") +
theme_minimal()
correlation_coefficient <- cor(customer_behavior_data$Days.Since.Last.Purchase,
customer_behavior_data$Total.Spend,
use = "complete.obs")
print(paste("Correlation Coefficient:", round(correlation_coefficient, 4)))
ggplot(customer_behavior_data, aes(x = Discount.Applied, y = Total.Spend)) +
geom_boxplot(fill = c("lightblue", "lightgreen"), outlier.colour = "red") +
labs(title = "Impact of Discount Application on Total Spent",
x = "Discount Applied",
y = "Total Spent") +
theme_minimal()
discount_summary_stats <- customer_behavior_data %>%
group_by(Discount.Applied) %>%
summarise(
Min = format(round(min(Total.Spend, na.rm = TRUE), 2), nsmall = 2),
Q1 = format(round(quantile(Total.Spend, 0.25, na.rm = TRUE), 2), nsmall = 2),
Median = format(round(median(Total.Spend, na.rm = TRUE), 2), nsmall = 2),
Q3 = format(round(quantile(Total.Spend, 0.75, na.rm = TRUE), 2), nsmall = 2),
Max = format(round(max(Total.Spend, na.rm = TRUE), 2), nsmall = 2)
)
print(discount_summary_stats)
ggplot(customer_behavior_data, aes(x = Membership.Type, y = Age, fill = Membership.Type)) +
geom_boxplot(outlier.colour = "red") +
labs(title = "Age Distribution of Customers by Membership Type",
x = "Membership Type",
y = "Age") +
scale_fill_manual(values = c("Bronze" = "orange", "Silver" = "gray", "Gold" = "yellow")) +
theme_minimal() +
theme(legend.position = "none")
age_summary_stats <- customer_behavior_data %>%
group_by(Membership.Type) %>%
summarise(
Min = min(Age, na.rm = TRUE),
Q1 = quantile(Age, 0.25, na.rm = TRUE),
Median = median(Age, na.rm = TRUE),
Q3 = quantile(Age, 0.75, na.rm = TRUE),
Max = max(Age, na.rm = TRUE)
)
print(age_summary_stats)
# Drop row 172 by position (hard-coded), then remove rows with a missing or
# empty satisfaction level or a missing total spend.
cleaned_data <- customer_behavior_data %>%
  filter(row_number() != 172) %>%
  filter(!is.na(Satisfaction.Level) & !is.na(Total.Spend)) %>%
  filter(Satisfaction.Level != "") %>%
  # Order the levels so the boxes run from least to most satisfied.
  mutate(Satisfaction.Level = factor(Satisfaction.Level,
                                     levels = c("Unsatisfied", "Neutral", "Satisfied")))
satisfaction_boxplot <- ggplot(cleaned_data, aes(x = Satisfaction.Level, y = Total.Spend)) +
geom_boxplot(fill = "lightblue", outlier.colour = "red") +
labs(title = "Total Spent by Customer Satisfaction Level",
x = "Satisfaction Level",
y = "Total Spent") +
theme_minimal()
print(satisfaction_boxplot)
satisfaction_summary_stats <- cleaned_data %>%
group_by(Satisfaction.Level) %>%
summarise(
Min = format(min(Total.Spend, na.rm = TRUE), nsmall = 2),
Q1 = format(quantile(Total.Spend, 0.25, na.rm = TRUE), nsmall = 2),
Median = format(median(Total.Spend, na.rm = TRUE), nsmall = 2),
Q3 = format(quantile(Total.Spend, 0.75, na.rm = TRUE), nsmall = 2),
Max = format(max(Total.Spend, na.rm = TRUE), nsmall = 2)
)
print(satisfaction_summary_stats)
ggplot(customer_behavior_data, aes(x = Gender, y = Total.Spend)) +
geom_boxplot(fill = "lightblue", outlier.colour = "red") +
labs(title = "Total Spent by Gender",
x = "Gender",
y = "Total Spent") +
theme_minimal()
gender_summary_stats <- customer_behavior_data %>%
group_by(Gender) %>%
summarise(
Min = min(Total.Spend, na.rm = TRUE),
Q1 = quantile(Total.Spend, 0.25, na.rm = TRUE),
Median = median(Total.Spend, na.rm = TRUE),
Q3 = quantile(Total.Spend, 0.75, na.rm = TRUE),
Max = max(Total.Spend, na.rm = TRUE)
)
print(gender_summary_stats)
items_by_city <- customer_behavior_data %>%
group_by(City) %>%
summarise(Total_Items = sum(Items.Purchased, na.rm = TRUE))
ggplot(items_by_city, aes(x = reorder(City, Total_Items), y = Total_Items)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(title = "Total Items Purchased by City",
x = "City",
y = "Total Items Purchased") +
coord_flip() +
theme_minimal()