In this project, I will examine E-commerce customer behavior data. I will address the following questions:
The data come from the E-Commerce Customer Behavior Dataset on Kaggle (kaggle.com). The first six rows of the dataset are displayed below.
##   Customer.ID Gender Age          City Membership.Type Total.Spend
## 1         101 Female  29      New York            Gold     1120.20
## 2         102   Male  34   Los Angeles          Silver      780.50
## 3         103 Female  43       Chicago          Bronze      510.75
## 4         104   Male  30 San Francisco            Gold     1480.30
## 5         105   Male  27         Miami          Silver      720.40
## 6         106 Female  37       Houston          Bronze      440.80
##   Items.Purchased Average.Rating Discount.Applied Days.Since.Last.Purchase
## 1              14            4.6             TRUE                       25
## 2              11            4.1            FALSE                       18
## 3               9            3.4             TRUE                       42
## 4              19            4.7            FALSE                       12
## 5              13            4.0             TRUE                       55
## 6               8            3.1            FALSE                       22
##   Satisfaction.Level
## 1          Satisfied
## 2            Neutral
## 3        Unsatisfied
## 4          Satisfied
## 5        Unsatisfied
## 6            Neutral
This box plot shows the distribution of total spend across the three membership types (Bronze, Silver, and Gold). Bronze members typically spend the least, and their spending varies little. Silver members spend more than Bronze members but less than Gold members, also with little variation. Gold members spend much more than either of the other groups and show the widest spread in spending.
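The comparison of spread across membership types can also be checked numerically. The following is a minimal sketch in base R, not the code used for the plot above. It assumes only the six preview rows shown earlier, so its numbers will differ from those of the full dataset; the names `customers`, `five_num`, and `spend_iqr` are illustrative.

```r
# Toy data: the six preview rows only, NOT the full Kaggle dataset.
customers <- data.frame(
  Membership.Type = c("Gold", "Silver", "Bronze", "Gold", "Silver", "Bronze"),
  Total.Spend     = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)
)

# Compute the statistics a box plot draws, without drawing it.
bp <- boxplot(Total.Spend ~ Membership.Type, data = customers, plot = FALSE)

# Label the five-number summary: one column per membership type.
five_num <- bp$stats
dimnames(five_num) <- list(
  c("lower whisker", "lower hinge", "median", "upper hinge", "upper whisker"),
  bp$names
)
print(five_num)

# Interquartile range per membership type as a simple measure of spread.
spend_iqr <- tapply(customers$Total.Spend, customers$Membership.Type, IQR)
print(spend_iqr)
```

Comparing the `median` row across columns corresponds to comparing the center lines of the boxes, and `spend_iqr` corresponds to the box heights.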
This box plot displays the distribution of total spend within each age group. The 19-30 and 31-40 groups show the widest spread in spending, while the 41-50 group shows very little. Customers in the 19-30 group tend to spend the most, and customers in the 41-50 group spend the least.
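Grouping customers by age requires binning the numeric `Age` column first. The sketch below shows one way to do this with base R's `cut()`. The break points are an assumption chosen to reproduce the 19-30, 31-40, and 41-50 labels used in the text, and the data are the six preview rows only.

```r
# Toy data: the six preview rows only, NOT the full Kaggle dataset.
customers <- data.frame(
  Age         = c(29, 34, 43, 30, 27, 37),
  Total.Spend = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)
)

# Bin ages into right-closed intervals (18,30], (30,40], (40,50].
# These break points are assumed, not taken from the original analysis.
customers$Age.Group <- cut(
  customers$Age,
  breaks = c(18, 30, 40, 50),
  labels = c("19-30", "31-40", "41-50")
)

# Number of customers and median spend in each age group.
group_counts <- table(customers$Age.Group)
group_median <- tapply(customers$Total.Spend, customers$Age.Group, median)
print(group_counts)
print(group_median)
```

Because `cut()` uses right-closed intervals by default, a customer aged exactly 30 falls in the 19-30 group rather than the 31-40 group.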
The scatterplot shows a strong linear relationship between the number of items purchased and the total amount spent. The fitted linear model has an R-squared value of 0.94561 (shown below), meaning that the number of items purchased explains roughly 94.6% of the variance in total spend and is a good predictor of it.
## [1] 0.94561
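For readers who want to see how such an R-squared value is obtained, here is a minimal sketch using `lm()` and `summary()`. It is fitted to the six preview rows only, so it will not reproduce the 0.94561 reported for the full dataset; the object names `fit` and `r_squared` are illustrative.

```r
# Toy data: the six preview rows only, NOT the full Kaggle dataset.
customers <- data.frame(
  Items.Purchased = c(14, 11, 9, 19, 13, 8),
  Total.Spend     = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)
)

# Simple linear regression of total spend on number of items purchased.
fit <- lm(Total.Spend ~ Items.Purchased, data = customers)

# R-squared: the share of variance in Total.Spend explained by the model.
r_squared <- summary(fit)$r.squared
print(r_squared)

# With a single predictor, R-squared equals the squared Pearson correlation.
print(cor(customers$Items.Purchased, customers$Total.Spend)^2)
```

The last line is a useful sanity check: for a one-predictor model with an intercept, the two printed values should agree.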
This bar plot displays the average ratings (ranging from 0 to 5) in each city. San Francisco has the highest average rating (4.81) and Houston has the lowest (3.19).
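The per-city averages behind a bar plot like this can be computed with a grouped mean. The sketch below uses base R's `aggregate()` on the six preview rows only, where each city appears once, so the values differ from the full-dataset averages quoted above. The name `city_ratings` is illustrative.

```r
# Toy data: the six preview rows only, NOT the full Kaggle dataset.
customers <- data.frame(
  City = c("New York", "Los Angeles", "Chicago",
           "San Francisco", "Miami", "Houston"),
  Average.Rating = c(4.6, 4.1, 3.4, 4.7, 4.0, 3.1)
)

# Mean rating per city.
city_ratings <- aggregate(Average.Rating ~ City, data = customers, FUN = mean)

# Sort from highest to lowest mean rating.
city_ratings <- city_ratings[order(-city_ratings$Average.Rating), ]
print(city_ratings)
```

Sorting the result makes the highest- and lowest-rated cities easy to read off from the first and last rows.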
The scatterplot comparing days since last purchase with total spend shows a clear negative relationship; however, the two variables do not appear to be strongly correlated. The correlation coefficient (shown below) is about -0.54, which indicates a moderate negative correlation.
## [1] -0.5400891
This box plot shows that customers who do not use discounts have a wider spread in spending than those who do. The mean total spend for each group is marked by a red dot. Overall, applying a discount does not have a large effect on total spend.
The distribution of age by membership type is reflected in the box plots above. Bronze members appear to be between 35 and 42 years old. Gold members are typically between 28 and 32, with one outlier of age 36. Silver members are between 22 and 35. Silver members have the widest spread of ages, while Gold members have the narrowest.
## [1] 0.7974944
To calculate Pearson’s correlation coefficient, the satisfaction level must first be converted to a numeric variable. The resulting correlation is 0.7974944 (shown above), which indicates a fairly strong positive correlation between satisfaction level and total spend.
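The conversion step can be sketched as follows. The original encoding is not stated, so the mapping used here (Unsatisfied = 1, Neutral = 2, Satisfied = 3) is an assumption, as are the names `levels_ordered` and `Satisfaction.Score`. The data are the six preview rows only, so the result will not match the 0.7974944 reported for the full dataset.

```r
# Toy data: the six preview rows only, NOT the full Kaggle dataset.
customers <- data.frame(
  Satisfaction.Level = c("Satisfied", "Neutral", "Unsatisfied",
                         "Satisfied", "Unsatisfied", "Neutral"),
  Total.Spend = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80)
)

# Assumed ordering of the satisfaction levels, from lowest to highest.
levels_ordered <- c("Unsatisfied", "Neutral", "Satisfied")

# Encode each level as its position in that ordering (1, 2, or 3).
customers$Satisfaction.Score <- as.integer(
  factor(customers$Satisfaction.Level, levels = levels_ordered)
)

# Pearson correlation between the numeric score and total spend.
satisfaction_cor <- cor(customers$Satisfaction.Score,
                        customers$Total.Spend,
                        method = "pearson")
print(satisfaction_cor)
```

Passing `levels` explicitly matters here: without it, `factor()` orders the levels alphabetically, which would place Neutral below Satisfied and Unsatisfied above both.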
This box plot displays the distribution of spending by gender. Male customers typically spent more than female customers, and both groups show approximately the same spread in spending.
This series of box plots shows the distribution of the number of items purchased in each city. Customers in San Francisco purchased the most items, while customers in Houston purchased the fewest. None of the cities shows a wide spread in the number of items purchased, although Miami appears to have the widest.