Section 1: Introduction

In this project, I examine e-commerce customer behavior data and address the following questions:

  1. What is the distribution of the total amount of money spent across different membership types?
  2. How does the total spent vary by age group?
  3. Is there a relationship between the number of items purchased and the total spent?
  4. How do average ratings differ by city?
  5. What is the correlation between days since last purchase and total spent?
  6. What is the impact of discount application on total spent?
  7. What is the age distribution of customers for each membership type?
  8. How does customer satisfaction level correlate with the total spent?
  9. What are the spending patterns of different gender groups?
  10. How does the number of items purchased vary by city?

Section 2: The Dataset

For this project, I used the E-Commerce Customer Behavior Dataset from kaggle.com. A portion of the dataset is displayed below.

##   Customer.ID Gender Age          City Membership.Type Total.Spend
## 1         101 Female  29      New York            Gold     1120.20
## 2         102   Male  34   Los Angeles          Silver      780.50
## 3         103 Female  43       Chicago          Bronze      510.75
## 4         104   Male  30 San Francisco            Gold     1480.30
## 5         105   Male  27         Miami          Silver      720.40
## 6         106 Female  37       Houston          Bronze      440.80
##   Items.Purchased Average.Rating Discount.Applied Days.Since.Last.Purchase
## 1              14            4.6             TRUE                       25
## 2              11            4.1            FALSE                       18
## 3               9            3.4             TRUE                       42
## 4              19            4.7            FALSE                       12
## 5              13            4.0             TRUE                       55
## 6               8            3.1            FALSE                       22
##   Satisfaction.Level
## 1          Satisfied
## 2            Neutral
## 3        Unsatisfied
## 4          Satisfied
## 5        Unsatisfied
## 6            Neutral

Section 3: Distribution of Total Spent Across Membership Types

This box plot displays the distribution of total spent across the three membership types (Bronze, Silver, and Gold). Bronze members typically spend the least, and their spending varies little. Silver members spend more than Bronze members but less than Gold members, also with little variation. Gold members show the widest spread in spending, but overall they spend much more than Bronze or Silver members.
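A minimal sketch of how this comparison could be reproduced in base R. The data frame holds only the six preview rows from Section 2, so the numbers will not match the full dataset; the name `customers` and the use of the median and interquartile range as summaries are assumptions made for illustration.

```r
# Toy data: the six preview rows from Section 2, not the full dataset.
customers <- data.frame(
  Membership.Type = c("Gold", "Silver", "Bronze", "Gold", "Silver", "Bronze"),
  Total.Spend     = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)
)

# Order the tiers so tables and plots read Bronze, Silver, Gold.
customers$Membership.Type <- factor(
  customers$Membership.Type,
  levels = c("Bronze", "Silver", "Gold")
)

# Typical spending (median) and spread (interquartile range) per tier.
spend_median <- tapply(customers$Total.Spend, customers$Membership.Type, median)
spend_iqr    <- tapply(customers$Total.Spend, customers$Membership.Type, IQR)
print(round(cbind(median = spend_median, iqr = spend_iqr), 2))

# Box plot of total spent by membership type.
boxplot(Total.Spend ~ Membership.Type, data = customers,
        xlab = "Membership type", ylab = "Total spent")
```

The median and interquartile range are the quantities a box plot draws, which makes them a natural numeric companion to the figure.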

Section 4: How Total Spent Varies by Age Group

This box plot displays how total spent varies across age groups. The 19-30 and 31-40 groups show the widest spread in spending, while the 41-50 group shows very little. Customers in the 19-30 group tend to spend the most, and customers in the 41-50 group tend to spend the least.
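The age groups could be built with `cut()`, as in the sketch below. The break points are an assumption chosen to match the group labels used in this section, and the vectors hold only the six preview rows from Section 2.

```r
# Toy data: ages and total spent from the six preview rows in Section 2.
ages  <- c(29, 34, 43, 30, 27, 37)
spend <- c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)

# Assumed bins: (18, 30], (30, 40], (40, 50]. cut() closes intervals on the right
# by default, so age 30 falls in the "19-30" group.
age_group <- cut(ages,
                 breaks = c(18, 30, 40, 50),
                 labels = c("19-30", "31-40", "41-50"))

# Median total spent within each age group.
print(tapply(spend, age_group, median))

# Box plot of total spent by age group.
boxplot(spend ~ age_group, xlab = "Age group", ylab = "Total spent")
```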

Section 5: Relationship Between Items Bought and Total Spent

The scatterplot shows a strong linear relationship between the number of items bought and the total amount spent. The fitted linear model has an R-squared value of 0.94561 (shown below), meaning that about 95% of the variance in total spent is explained by the number of items purchased. The number of items purchased is therefore a good predictor of total spent.

## [1] 0.94561
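A sketch of how an R-squared value like this can be obtained with `lm()`. It uses only the six preview rows from Section 2, so it will not reproduce 0.94561; the names `customers`, `fit`, and `r_squared` are illustrative.

```r
# Toy data: the six preview rows from Section 2, not the full dataset.
customers <- data.frame(
  Items.Purchased = c(14, 11, 9, 19, 13, 8),
  Total.Spend     = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)
)

# Simple linear regression of total spent on number of items purchased.
fit <- lm(Total.Spend ~ Items.Purchased, data = customers)

# R-squared: the share of variance in Total.Spend explained by the model.
r_squared <- summary(fit)$r.squared
print(r_squared)

# With a single predictor, R-squared equals the squared Pearson correlation.
print(cor(customers$Items.Purchased, customers$Total.Spend)^2)
```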

Section 6: Difference in Average Rating by City

This bar plot displays the average ratings (ranging from 0 to 5) in each city. San Francisco has the highest average rating (4.81) and Houston has the lowest (3.19).

Section 7: Correlation Between Days Since Last Purchase and Total Spent

The scatterplot comparing days since last purchase and total spent shows a clear negative relationship, but the two variables do not appear to be strongly correlated. The correlation coefficient (shown below) is about -0.54, which indicates a moderate negative correlation.

## [1] -0.5400891
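For reference, the sketch below computes a Pearson correlation coefficient both with `cor()` and directly from its definition. The vectors hold only the six preview rows from Section 2, so the result will differ from the value reported for the full dataset.

```r
# Toy data: the six preview rows from Section 2, not the full dataset.
days  <- c(25, 18, 42, 12, 55, 22)
spend <- c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)

# Built-in Pearson correlation.
r <- cor(days, spend, method = "pearson")

# The same quantity from the definition: the sum of cross-products of the
# deviations, scaled by the root of the product of the sums of squares.
dx <- days - mean(days)
dy <- spend - mean(spend)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(c(cor = r, manual = r_manual))
```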

Section 8: Impact of Discount Application on Total Spent

This box plot shows that customers who do not use discounts have a wider spread in spending than those who do. The mean total spent for each group is marked by a red dot. Overall, applying a discount does not appear to have a large effect on total spent.
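One way to draw a box plot with the group means overlaid as red dots in base R is sketched below. This is an illustration on the six preview rows from Section 2, not the code behind the figure described here, and `customers` and `group_means` are illustrative names.

```r
# Toy data: the six preview rows from Section 2, not the full dataset.
customers <- data.frame(
  Discount.Applied = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  Total.Spend      = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)
)

# Mean total spent per group; groups are ordered FALSE, then TRUE.
group_means <- tapply(customers$Total.Spend, customers$Discount.Applied, mean)
print(group_means)

# Box plot per group, drawn in the same FALSE, TRUE order.
boxplot(Total.Spend ~ Discount.Applied, data = customers,
        xlab = "Discount applied", ylab = "Total spent")

# Overlay each group's mean as a filled red dot at its box position.
points(seq_along(group_means), group_means, col = "red", pch = 19)
```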

Section 9: Distribution of Age for Each Membership Type

The distribution of age by membership type is reflected in the box plots above. The ages of Bronze members appear to be between 35 and 42. Gold members are typically between 28 and 32, with one outlier at age 36. Silver members are between 22 and 35. Silver members have the widest spread of ages, while Gold members have the narrowest.

Section 10: Correlation Between Customer Satisfaction Level and Total Spent

## [1] 0.7974944

To calculate the Pearson correlation coefficient, we must first encode the satisfaction level as a numeric variable. After doing this, we can calculate the correlation, which is 0.7974944 (as shown above). This means that satisfaction level is fairly strongly and positively correlated with total spent.
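The numeric encoding could look like the sketch below. The mapping Unsatisfied = 1, Neutral = 2, Satisfied = 3 is an assumption, since the encoding actually used is not stated here, and the vectors hold only the six preview rows from Section 2.

```r
# Toy data: the six preview rows from Section 2, not the full dataset.
satisfaction <- c("Satisfied", "Neutral", "Unsatisfied",
                  "Satisfied", "Unsatisfied", "Neutral")
spend <- c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)

# Assumed ordering from least to most satisfied.
levels_ordered <- c("Unsatisfied", "Neutral", "Satisfied")

# Factor codes follow the level order: Unsatisfied = 1, Neutral = 2, Satisfied = 3.
score <- as.integer(factor(satisfaction, levels = levels_ordered))

# Pearson correlation between the numeric satisfaction score and total spent.
print(cor(score, spend))
```

Because the scores are ordinal ranks rather than measurements, `cor(score, spend, method = "spearman")` is a reasonable alternative to compare against.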

Section 11: Spending Patterns by Gender

This box plot displays the distribution of spending by gender. Male customers typically spent more than female customers, and both groups show approximately the same spread in spending.

Section 12: Variance in Number of Items Purchased by City

This series of box plots shows how the number of items purchased varies by city. Customers in San Francisco purchased the most items, while customers in Houston purchased the fewest. None of the cities shows a large spread in the number of items purchased, although Miami appears to have the widest.