1. Abstract

This project explores and analyzes e-commerce customer data from some of the largest cities in the United States. Using R, we visualize trends and relationships among variables such as gender, age, membership type, items purchased, and total amount spent. The analysis provides insight into customer behavior.

2. Introduction

Online customer behavior is affected by many factors that differ from traditional shopping. E-commerce lacks many aspects of customer service and the sensory experiences of in-person shopping. Because of these differences, this analysis mainly uses quantitative variables, with a limited number of relevant categorical variables: gender, city, membership type, whether a discount was applied (true or false), and customer satisfaction level. This project uses R to visualize relationships between variables and attempts to draw conclusions about what these relationships imply for customer behavior more broadly.

3. Methodology

Data Source: Kaggle E-commerce Customer Behavior Dataset. Tools: R programming language, base packages, readr, dplyr, ggplot2. Data preprocessing: grouping and cleaning data. Data visualization: creating boxplots, scatterplots, and barplots using ggplot2.

4. Results

4.1 Box plot of Total Amount Spent Based on Membership Type

## # A tibble: 3 × 6
##   Membership.Type Min       Q1        Median    Q3        Max      
##   <fct>           <chr>     <chr>     <chr>     <chr>     <chr>    
## 1 Bronze          " 410.80" " 440.90" " 475.25" " 500.75" " 530.40"
## 2 Silver          " 660.30" " 690.60" " 770.20" " 800.90" " 830.90"
## 3 Gold            "1120.20" "1160.60" "1210.60" "1470.50" "1520.10"

This box plot shows the distribution of the total amount of money spent by membership type. We can see that the higher the membership tier, the higher the median spending. In the summary statistics above, the medians are $475.25 for Bronze, $770.20 for Silver, and $1210.60 for Gold. We can also observe that higher membership tiers have larger interquartile ranges (IQRs). The IQR is calculated by subtracting Q1 from Q3 and represents the range of the middle 50% of the data. In a box plot, the IQR is the height of the box. A larger IQR, or taller box, indicates more variability in the data, while a smaller IQR indicates that the middle half of the data lies close to the median. In conclusion, this box plot shows that both typical spending and the spread of spending increase with membership tier.
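As a quick check of the IQR comparison, the subtraction can be reproduced directly from the summary table above. This is only a minimal sketch: it hard-codes the Q1 and Q3 values printed in the table rather than reading the dataset, and the vector names are illustrative.

```r
# Q1 and Q3 of Total.Spend per tier, copied from the summary table in 4.1
q1 <- c(Bronze = 440.90, Silver = 690.60, Gold = 1160.60)
q3 <- c(Bronze = 500.75, Silver = 800.90, Gold = 1470.50)

# IQR = Q3 - Q1, which is the height of each box in the plot
iqr <- round(q3 - q1, 2)
print(iqr)
```

With these values the IQRs come to 59.85 for Bronze, 110.30 for Silver, and 309.90 for Gold, so the spread grows with each tier.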

4.2 Scatter plot of Items Purchased vs. Total Spent Per Transaction

## [1] "Correlation Coefficient: 0.9724"

For this scatter plot, we explore the relationship between the total amount of money spent and items purchased in a given transaction. We can see there is a strong positive relationship between these two variables. The more items a customer purchases, the more they spend. This relationship is supported by a remarkably high correlation coefficient of 0.9724, which suggests a nearly perfect linear relationship.

4.3 Bar Plot of Average Ratings by City

The bar plot above shows the average customer rating by city on a scale from 0 to 5. San Francisco, New York, and Los Angeles have the three highest average ratings, while Miami, Chicago, and Houston make up the bottom three. The differences are likely due to factors that vary by location and are not captured in the dataset. One example is the number of days between order and delivery, which may be shorter in the top three cities if distribution centers or company headquarters are located there. Determining the root cause of these differences would require adding more variables to the dataset, further research on how these locations differ from one another, or both.

4.4 Scatter plot of Days Since Last Purchase vs. Total Spent

## [1] "Correlation Coefficient: -0.5401"

In this scatter plot, we explore the relationship between the days since the last purchase and the total spent. The trend line shows a moderate negative relationship, supported by a correlation coefficient of -0.5401. This value suggests a meaningful relationship between the two variables, although other variables or factors may also be involved, which could be investigated further by fitting a statistical model. Since our purposes do not involve fitting models, we note only that customers who have gone longer without making a purchase tend to have lower total spending. This may imply that customers who do not shop online frequently are less committed to online shopping overall, or that this group of customers supplements their traditional shopping online when they cannot find what they need.
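One way to read the size of a correlation coefficient is to square it: for a straight-line fit with a single predictor, the squared correlation is the share of variance in one variable accounted for by the other. The sketch below applies this to the two coefficients printed in 4.2 and 4.4. The variable names are illustrative, and no model is fitted to the data.

```r
# Correlation coefficients as printed in sections 4.2 and 4.4
r <- c(items_vs_spend = 0.9724, days_vs_spend = -0.5401)

# Squared correlation: share of variance accounted for by a simple linear fit
r_squared <- round(r^2, 4)
print(r_squared)
```

Under this reading, items purchased accounts for roughly 95% of the variance in total spend, while days since last purchase accounts for roughly 29%, which is consistent with describing the second relationship as moderate.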

4.5 Box plot of Total Spent Based on Discount Application

## # A tibble: 2 × 6
##   Discount.Applied Min    Q1     Median Q3      Max    
##   <lgl>            <chr>  <chr>  <chr>  <chr>   <chr>  
## 1 FALSE            410.80 460.70 800.90 1450.50 1520.10
## 2 TRUE             475.25 513.25 690.60 1140.60 1210.60

The box plot above explores the impact of discount application on the total amount spent in a transaction. The median amount spent is higher when no discount has been applied ($800.90 versus $690.60). This makes sense, since a discount lowers what the customer would otherwise pay. What is curious is that the IQR is also much larger when no discount is applied, meaning the amount a customer spends varies more without a discount. When a discount has been applied, typical spending is lower and its spread is smaller. This could suggest that discounts are applied mostly to lower-value purchases or that discounts attract more price-conscious customers. However, further investigation would be needed to test whether the differences are statistically significant after accounting for how different types of discounts reduce the total: some discounts lower the price of every item in a purchase, while others apply only to select items.
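The comparison of box heights and medians can be reproduced from the table above. The following sketch copies the printed quartiles into a small data frame (the name discount_summary is illustrative) instead of recomputing them from the dataset.

```r
# Quartiles of Total.Spend per group, copied from the summary table in 4.5
discount_summary <- data.frame(
  Discount.Applied = c(FALSE, TRUE),
  Q1     = c(460.70, 513.25),
  Median = c(800.90, 690.60),
  Q3     = c(1450.50, 1140.60)
)

# Box height (IQR) for each group
discount_summary$IQR <- round(discount_summary$Q3 - discount_summary$Q1, 2)

# Gap between the medians (no discount minus discount)
median_gap <- round(discount_summary$Median[1] - discount_summary$Median[2], 2)

print(discount_summary)
print(median_gap)
```

These are descriptive differences only: an IQR of 989.80 without a discount against 627.35 with one, and a median gap of 110.30. They say nothing about statistical significance.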

4.6 Box plot of Age Based on Membership Type

## # A tibble: 3 × 6
##   Membership.Type   Min    Q1 Median    Q3   Max
##   <fct>           <int> <dbl>  <dbl> <dbl> <int>
## 1 Bronze             35    37   39.5    42    43
## 2 Silver             26    27   32      34    35
## 3 Gold               28    29   30      31    36

This box plot looks at the age distribution of customers by membership type. Bronze members are the oldest group (ages 35 to 43), silver members are 35 and under (ages 26 to 35), and most gold members are between 28 and 31. We also notice an outlier, shown as a red dot, for a gold member aged 36. The larger IQR of the silver members (7 years) tells us that this membership type has the most diverse spread of ages, while the small IQR of the gold members (2 years) tells us that they fall in a narrow age range. Tiered memberships tend to be associated with purchasing frequency, which may explain the age ranges (from the end of the bottom whisker to the end of the top whisker for each group) and the IQRs (the height of the boxes) for each membership type.
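The red dot follows from the default rule in geom_boxplot(): whiskers extend to the most extreme value within 1.5 times the IQR of the box, and anything beyond that is drawn as an outlier point. The sketch below applies the rule to the values in the table above. It assumes the printed quartiles match the ones ggplot2 computes for the plot, and the data frame and column names are illustrative.

```r
# Quartiles and extremes of Age per tier, copied from the summary table in 4.6
age <- data.frame(
  tier = c("Bronze", "Silver", "Gold"),
  min  = c(35, 26, 28),
  q1   = c(37, 27, 29),
  q3   = c(42, 34, 31),
  max  = c(43, 35, 36)
)

age$iqr <- age$q3 - age$q1

# Fences: values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are plotted as outliers
age$lower_fence <- age$q1 - 1.5 * age$iqr
age$upper_fence <- age$q3 + 1.5 * age$iqr

# TRUE when the group's minimum or maximum falls outside its fences
age$has_outlier <- age$min < age$lower_fence | age$max > age$upper_fence

print(age)
```

With these numbers only Gold is flagged: its upper fence is 34, so the member aged 36 lies beyond it, while the Bronze and Silver extremes stay inside their fences.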

4.7 Box plot of Total Spent Based on Customer Satisfaction Level

## # A tibble: 3 × 6
##   Satisfaction.Level Min    Q1      Median  Q3      Max    
##   <fct>              <chr>  <chr>   <chr>   <chr>   <chr>  
## 1 Unsatisfied        475.25 500.75  595.35  690.60  730.40 
## 2 Neutral            410.80 440.90  470.50  800.20  830.90 
## 3 Satisfied          820.75 1160.30 1200.80 1470.50 1520.10

The box plot above explores the relationship between customer satisfaction level and the total amount of money spent. Customers who spend more tend to report being satisfied: the Satisfied group has a median of $1200.80, compared with $595.35 for Unsatisfied and $470.50 for Neutral. Unsatisfied and neutral customers tend to spend less, with neutral customers having a lower median than unsatisfied customers.

4.8 Box plot of Spending by Gender

## # A tibble: 2 × 6
##   Gender   Min    Q1 Median    Q3   Max
##   <chr>  <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 Female  411.  461.   501. 1141. 1201.
## 2 Male    660.  710.   801. 1450. 1520.

This box plot compares spending by gender. Median spending is higher for male customers (about $801 versus $501), and so are the minimum and maximum. The IQRs are similar (about $740 for male customers and $680 for female customers), so the two groups show comparable variability, but male spending is spread over a higher range of values. These differences could reflect wage disparities between male and female customers, who places the orders in heterosexual couples, or differences in price consciousness by gender. However, the gap is likely due to a combination of many factors, and further investigation would be needed to draw an informed conclusion.

4.9 Bar plot of Number of Items Purchased by City

In this bar plot, we can see that the top three cities by total items purchased are San Francisco, New York, and Los Angeles. Because the plot sums items across all customers in a city, the totals also depend on how many customers each city has in the dataset. We surmise that customers in these cities purchase more items because incomes there tend to be higher, partly owing to higher minimum wages, and because shipping costs may be lower, especially if the company's distribution centers are located in California and New York, which seems plausible in this case.

5. Conclusion

This project aimed to explore and analyze e-commerce customer data to observe and draw possible conclusions about customer behavior. We show the importance of considering various factors in the data and emphasize that with a topic such as customer behavior, additional information, variables, and model fitting are needed to make the most informed conclusions about the causes of certain observations within the data.

7. Appendices

7.1 Setup Code

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(readr)
library(dplyr)

7.2 Reading and Factoring Data & Box plot 4.1 with Summary Stats

customer_behavior_data <- read.csv("E-commerce_Customer_Behavior.csv")

customer_behavior_data$Membership.Type <- factor(customer_behavior_data$Membership.Type, 
                                                 levels = c("Bronze", "Silver", "Gold"))

ggplot(customer_behavior_data, aes(x = Membership.Type, y = Total.Spend, fill = Membership.Type)) +
  geom_boxplot(outlier.colour = "red", outlier.size = 1) +
  labs(title = "Total Spent Distribution by Membership Type",
       x = "Membership Type",
       y = "Total Spent") +
  scale_fill_manual(values = c("Bronze" = "orange", "Silver" = "gray", "Gold" = "yellow")) +
  theme_minimal() +
  theme(legend.position = "none")

box_plot_summary_stats <- customer_behavior_data %>%
  group_by(Membership.Type) %>%
  summarise(Min = min(Total.Spend, na.rm = TRUE),
            Q1 = quantile(Total.Spend, 0.25, na.rm = TRUE),
            Median = median(Total.Spend, na.rm = TRUE),
            Q3 = quantile(Total.Spend, 0.75, na.rm = TRUE),
            Max = max(Total.Spend, na.rm = TRUE))

box_plot_summary_stats <- box_plot_summary_stats %>%
  mutate(across(c(Min, Q1, Median, Q3, Max), ~ format(round(., 2), nsmall = 2)))

print(box_plot_summary_stats)

7.3 Scatter plot 4.2 with Correlation

ggplot(customer_behavior_data, aes(x = Items.Purchased, y = Total.Spend)) +
  geom_point(alpha = 0.4, color = "blue") +  
  labs(title = "Items Purchased vs. Total Spent",
       x = "Items Purchased",
       y = "Total Spent") +
  theme_minimal()

correlation_coefficient1 <- cor(customer_behavior_data$Items.Purchased, 
                               customer_behavior_data$Total.Spend, 
                               use = "complete.obs") 

print(paste("Correlation Coefficient:", round(correlation_coefficient1, 4)))

7.4 Grouping Data and Bar plot 4.3

average_rating_by_city <- customer_behavior_data %>%
  group_by(City) %>%
  summarise(AverageRating = mean(Average.Rating, na.rm = TRUE))

ggplot(average_rating_by_city, aes(x = reorder(City, AverageRating), y = AverageRating)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Average Rating by City",
       x = "City",
       y = "Average Rating") +
  coord_flip() + 
  theme_minimal()

7.5 Scatter plot 4.4 with Correlation

ggplot(customer_behavior_data, 
       aes(x = Days.Since.Last.Purchase, 
           y = Total.Spend)) +
  geom_point(alpha = 0.4, color = "blue") + 
  geom_smooth(method = "lm", se = FALSE, color = "red") + 
  labs(title = "Days Since Last Purchase vs. Total Spent",
       x = "Days Since Last Purchase",
       y = "Total Spent") +
  theme_minimal()

correlation_coefficient <- cor(customer_behavior_data$Days.Since.Last.Purchase, 
                               customer_behavior_data$Total.Spend, 
                               use = "complete.obs") 

print(paste("Correlation Coefficient:", round(correlation_coefficient, 4)))

7.6 Box plot 4.5 with Summary Stats

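# Map fill to the grouping variable so each box is coloured by its own group
ggplot(customer_behavior_data,
       aes(x = Discount.Applied, y = Total.Spend, fill = Discount.Applied)) +
  geom_boxplot(outlier.colour = "red") +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "lightgreen")) +
  labs(title = "Impact of Discount Application on Total Spent",
       x = "Discount Applied",
       y = "Total Spent") +
  theme_minimal() +
  theme(legend.position = "none")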

discount_summary_stats <- customer_behavior_data %>%
  group_by(Discount.Applied) %>%
  summarise(
    Min = format(round(min(Total.Spend, na.rm = TRUE), 2), nsmall = 2),
    Q1 = format(round(quantile(Total.Spend, 0.25, na.rm = TRUE), 2), nsmall = 2),
    Median = format(round(median(Total.Spend, na.rm = TRUE), 2), nsmall = 2),
    Q3 = format(round(quantile(Total.Spend, 0.75, na.rm = TRUE), 2), nsmall = 2),
    Max = format(round(max(Total.Spend, na.rm = TRUE), 2), nsmall = 2)
  )

print(discount_summary_stats)

7.7 Box plot 4.6 with Summary Stats

ggplot(customer_behavior_data, aes(x = Membership.Type, y = Age, fill = Membership.Type)) +
  geom_boxplot(outlier.colour = "red") +
  labs(title = "Age Distribution of Customers by Membership Type",
       x = "Membership Type",
       y = "Age") +
  scale_fill_manual(values = c("Bronze" = "orange", "Silver" = "gray", "Gold" = "yellow")) +
  theme_minimal() +
  theme(legend.position = "none")

age_summary_stats <- customer_behavior_data %>%
  group_by(Membership.Type) %>%
  summarise(
    Min = min(Age, na.rm = TRUE),
    Q1 = quantile(Age, 0.25, na.rm = TRUE),
    Median = median(Age, na.rm = TRUE),
    Q3 = quantile(Age, 0.75, na.rm = TRUE),
    Max = max(Age, na.rm = TRUE)
  )

print(age_summary_stats)

7.8 Data Cleaning and Box plot 4.7 with Summary Stats

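cleaned_data <- customer_behavior_data %>%
  filter(row_number() != 172) %>%  # drop row 172 (hard-coded; reason not documented here)
  filter(!is.na(Satisfaction.Level) & !is.na(Total.Spend)) %>%  # remove rows with missing values
  filter(Satisfaction.Level != "") %>%  # remove rows with a blank satisfaction level
  mutate(Satisfaction.Level = factor(Satisfaction.Level)) %>%
  droplevels()

# Order the levels from least to most satisfied for plotting
cleaned_data$Satisfaction.Level <- factor(cleaned_data$Satisfaction.Level, 
                                           levels = c("Unsatisfied", "Neutral", "Satisfied"))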

satisfaction_boxplot <- ggplot(cleaned_data, aes(x = Satisfaction.Level, y = Total.Spend)) +
  geom_boxplot(fill = "lightblue", outlier.colour = "red") +
  labs(title = "Total Spent by Customer Satisfaction Level",
       x = "Satisfaction Level",
       y = "Total Spent") +
  theme_minimal()

print(satisfaction_boxplot)

satisfaction_summary_stats <- cleaned_data %>%
  group_by(Satisfaction.Level) %>%
  summarise(
    Min = format(min(Total.Spend, na.rm = TRUE), nsmall = 2),
    Q1 = format(quantile(Total.Spend, 0.25, na.rm = TRUE), nsmall = 2),
    Median = format(median(Total.Spend, na.rm = TRUE), nsmall = 2),
    Q3 = format(quantile(Total.Spend, 0.75, na.rm = TRUE), nsmall = 2),
    Max = format(max(Total.Spend, na.rm = TRUE), nsmall = 2)
  )

print(satisfaction_summary_stats)

7.9 Box plot 4.8 with Summary Stats

ggplot(customer_behavior_data, aes(x = Gender, y = Total.Spend)) +
  geom_boxplot(fill = "lightblue", outlier.colour = "red") +
  labs(title = "Total Spent by Gender",
       x = "Gender",
       y = "Total Spent") +
  theme_minimal()

gender_summary_stats <- customer_behavior_data %>%
  group_by(Gender) %>%
  summarise(
    Min = min(Total.Spend, na.rm = TRUE),
    Q1 = quantile(Total.Spend, 0.25, na.rm = TRUE),
    Median = median(Total.Spend, na.rm = TRUE),
    Q3 = quantile(Total.Spend, 0.75, na.rm = TRUE),
    Max = max(Total.Spend, na.rm = TRUE)
  )

print(gender_summary_stats)

7.10 Data Grouping and Bar plot 4.9

items_by_city <- customer_behavior_data %>%
  group_by(City) %>%
  summarise(Total_Items = sum(Items.Purchased, na.rm = TRUE))

ggplot(items_by_city, aes(x = reorder(City, Total_Items), y = Total_Items)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Total Items Purchased by City",
       x = "City",
       y = "Total Items Purchased") +
  coord_flip() +
  theme_minimal()