Project 2 (E-commerce)

Author

Griffin Lessinger

In this project, we examine a dataset consisting of supposed account information for an e-commerce institution. The dataset has 350 entries, corresponding to 350 accounts. A sample:

  Customer.ID Gender Age          City Membership.Type Total.Spend
1         101 Female  29      New York            Gold     1120.20
2         102   Male  34   Los Angeles          Silver      780.50
3         103 Female  43       Chicago          Bronze      510.75
4         104   Male  30 San Francisco            Gold     1480.30
5         105   Male  27         Miami          Silver      720.40

There exist 11 columns in total, mostly of numeric type.

1. What is the distribution of total spend across different membership types?

boxplotcommerce <- commerce
boxplotcommerce$Membership.Type <- match(
  boxplotcommerce$Membership.Type,
  c("Bronze", "Silver", "Gold")
)

boxplot(
  data = boxplotcommerce,
  Total.Spend ~ Membership.Type,
  xlab = "Membership Type",
  ylab = "Total Spend ($US)",
  main = "Total Spending by Membership Type",
  names = c("Bronze", "Silver", "Gold"),
  col = c("brown1", 'gray85', "gold")
)

As seen above, the boxes and their whiskers are completely disjoint on the vertical axis. Membership level is therefore strongly associated with how much a customer has spent: Bronze members have a median spend of roughly \$500, Silver members roughly \$800, and Gold members roughly \$1200.
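The medians read off the box plot can also be checked numerically. The following is a minimal sketch that rebuilds only the five sample rows shown at the top of this report, not the full 350-row `commerce` data, so its output will not match the figures quoted above. The names `sample_rows` and `tier_medians` are introduced here for illustration.

```r
# Five sample rows copied from the preview table at the top of the report.
sample_rows <- data.frame(
  Membership.Type = c("Gold", "Silver", "Bronze", "Gold", "Silver"),
  Total.Spend     = c(1120.20, 780.50, 510.75, 1480.30, 720.40)
)

# A factor with explicit levels keeps the tiers in Bronze, Silver, Gold order.
sample_rows$Membership.Type <- factor(
  sample_rows$Membership.Type,
  levels = c("Bronze", "Silver", "Gold")
)

# Median spend per tier (the thick line inside each box of a box plot).
tier_medians <- tapply(
  sample_rows$Total.Spend,
  sample_rows$Membership.Type,
  median
)
print(tier_medians)
```

Storing the tier as a factor with explicit levels is an alternative to recoding it with `match()`; both keep the groups in the intended order.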

2. Is there a relationship between the number of items purchased and total spend?

plot(
  x = commerce$Items.Purchased,
  y = commerce$Total.Spend,
  xlab = "Items Purchased",
  ylab = "Total Spend ($US)",
  main = "Items Purchased vs. Total Spending"
)
abline(
  (ivss_model <- lm(
    data = commerce,
    Total.Spend ~ Items.Purchased
  ))$coefficients,
  col = "red2",
  lwd = 2
)
text(
  x = 16.2,
  y = 1000,
  bty = "n",
  labels = paste0("R squared: ", trunc(1000*summary(ivss_model)$r.squared)/1000),
  cex = 0.8
)
legend(
  x = "topleft",
  lwd = 2,
  col = "red2",
  bty = "n",
  legend = "Linear Model",
  cex = 0.8
)

Yes. There is a clear positive correlation between the number of items purchased and total account spending, as expected, and the relationship is close to linear.
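For a model with a single predictor and an intercept, the R squared value printed on the plot is the square of the Pearson correlation, so the two numbers carry the same information apart from the sign. A minimal sketch of that identity follows; the vectors `items` and `spend` are made-up illustrative values, not rows from the `commerce` data.

```r
# Illustrative values only: these are NOT rows from the commerce data.
items <- c(7, 9, 11, 14, 19)
spend <- c(450, 520, 780, 1100, 1500)

fit <- lm(spend ~ items)

r_squared <- summary(fit)$r.squared  # the kind of value annotated on the plot
pearson_r <- cor(items, spend)       # its sign gives the direction

# For a one-predictor model with an intercept, R squared equals r^2.
print(c(r_squared = r_squared, pearson_r_squared = pearson_r^2))
```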

3. How do average ratings differ by city?

cityboxplot <- boxplot(
  data = commerce,
  Average.Rating ~ City,
  xlab = "City",
  ylab = "Average Rating (1 to 5)",
  main = "Average Rating by City",
  col = topo.colors(7, alpha = 0.5)[2:7],
  cex.axis = 0.85
)
legend(
  x = 0.24,
  y = 4.95,
  bty = "n",
  title = "Medians",
  legend = paste0(paste0(cityboxplot$names, ": "), cityboxplot$stats[3, ]),
  fill = topo.colors(7, alpha = 0.5)[2:7],
  cex = 0.8
)

City also appears to matter considerably for reviews. The typical review score (on a scale from 1 to 5) differs markedly between cities, with little overlap between boxes. Houston was the lowest, with a median review score of 3.15, and San Francisco was the highest, with a median review score of 4.8 (quite high!).
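The third row of the `$stats` matrix returned by `boxplot()` holds each group's median, which can differ from the mean. The sketch below checks this on a small made-up data frame; `ratings` and its values are hypothetical and are not taken from the `commerce` data.

```r
# Illustrative ratings only: NOT values from the commerce data.
ratings <- data.frame(
  City = c("Houston", "Houston", "Houston", "Chicago", "Chicago", "Chicago"),
  Average.Rating = c(3.0, 3.1, 3.8, 3.9, 4.0, 4.7)
)

# plot = FALSE returns the box plot statistics without drawing anything.
bp <- boxplot(Average.Rating ~ City, data = ratings, plot = FALSE)

# Row 3 of $stats is the centre line of each box.
box_centers <- setNames(bp$stats[3, ], bp$names)

# Per-city median and mean, computed directly for comparison.
by_city       <- split(ratings$Average.Rating, ratings$City)
group_medians <- sapply(by_city, median)
group_means   <- sapply(by_city, mean)

print(rbind(box_centers, group_medians, group_means))
```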

4. What is the correlation between days since last purchase and total spend?

plot(
  x = commerce$Days.Since.Last.Purchase,
  y = commerce$Total.Spend,
  xlab = "Days Since Last Purchase",
  ylab = "Total Spend ($US)",
  main = "Recent Purchasing vs. Total Spending"
)
abline(
  (dvss_model <- lm(
    data = commerce,
    Total.Spend ~ Days.Since.Last.Purchase
  ))$coefficients,
  col = "red2",
  lwd = 2
)
text(
  x = 40,
  y = 770,
  bty = "n",
  labels = paste0("R squared: ", trunc(1000*summary(dvss_model)$r.squared)/1000),
  cex = 0.8
)
legend(
  x = "topright",
  lwd = 2,
  col = "red2",
  bty = "n",
  legend = "Linear Model",
  cex = 0.8
)
text(
  x = 35.75,
  y = 1625,
  labels = paste0("Correlation: ", trunc(1000*cor(commerce$Total.Spend, commerce$Days.Since.Last.Purchase))/1000),
  cex = 0.9,
  xpd = TRUE
)

This plot is a bit messier. The relationship is nonlinear, with an overall negative correlation between days elapsed since the last purchase and total spend. This is to be expected: customers who order more frequently will likely have a higher total expenditure and fewer elapsed days between orders.
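When a relationship is monotone but not linear, the Pearson correlation can understate its strength, and a rank-based (Spearman) correlation is one common complement. The sketch below illustrates the difference on deliberately curved, made-up data; `days` and `spend` here are hypothetical vectors, not columns of `commerce`.

```r
# Illustrative, deliberately curved data: NOT values from the commerce data.
days  <- 1:10
spend <- 1000 / days  # strictly decreasing, but far from a straight line

pearson  <- cor(days, spend)                       # linear association
spearman <- cor(days, spend, method = "spearman")  # rank-based association

print(c(pearson = pearson, spearman = spearman))
```

On the real data the same comparison would be `cor(commerce$Days.Since.Last.Purchase, commerce$Total.Spend, method = "spearman")`, assuming both columns are numeric.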

5. What is the impact of discount application on total spend?

boxplot(
  data = commerce,
  Total.Spend ~ Discount.Applied,
  xlab = "Discounts Applied?",
  ylab = "Total Spend ($US)",
  ylim = c(0, 1500),
  main = "Discounted vs. Non-Discounted Spending",
  names = c("No", "Yes"),
  col = c("lightblue1", "skyblue3")
)

The typical total expenditure is similar for accounts that apply discounts and those that do not, but there is more variability among accounts that do not apply discounts. This is likely explained by the fact that discounts are generally targeted at specific goods rather than general inventory, and those who use discounts are likely frugal shoppers who purchase fewer items in total.

6. What is the age distribution of customers for each membership type?

library(ggplot2)

ggplot(commerce, aes(x = Age, fill = Membership.Type)) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(
    values = c(
      "Gold" = "gold",
      "Silver" = "grey85",
      "Bronze" = "brown1"
    )
  ) +
  labs(
    title = "Age Distribution by Membership Type",
    x = "Age",
    y = "Density"
  ) + theme_classic()

Membership class seems to partition the accounts fairly well by age: customers under ~26 years old tend to have Silver memberships, those between ~26 and ~32 tend to have Gold, those between ~32 and ~35 tend to have Silver again, and older customers tend to have Bronze.

A possible conclusion could be that younger people favor Gold memberships, older people favor Bronze memberships, and Silver is somewhat favored by most.
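The age partition described above can be tabulated by cutting `Age` into bands and cross-tabulating against membership type. This is a minimal sketch using only the five sample rows from the top of the report, with cut points taken from the approximate thresholds read off the density plot; `Age.Band`, `band_counts`, and the band labels are names introduced here.

```r
# Five sample rows copied from the preview table at the top of the report.
sample_rows <- data.frame(
  Age             = c(29, 34, 43, 30, 27),
  Membership.Type = c("Gold", "Silver", "Bronze", "Gold", "Silver")
)

# Cut points are the rough thresholds read off the density plot.
# Intervals are right-closed; the labels assume whole-number ages.
sample_rows$Age.Band <- cut(
  sample_rows$Age,
  breaks = c(-Inf, 26, 32, 35, Inf),
  labels = c("26 or under", "27 to 32", "33 to 35", "over 35")
)

# Count of accounts per age band and membership type.
band_counts <- table(sample_rows$Age.Band, sample_rows$Membership.Type)
print(band_counts)
```

With only five rows the table is sparse; applied to the full data it would show how cleanly the bands separate the tiers.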

7. How does customer satisfaction level correlate with total spend?

satisfaction <- commerce[commerce$Satisfaction.Level != "", ]
satisfaction$Satisfaction.Level <- match(satisfaction$Satisfaction.Level, c("Unsatisfied", "Neutral", "Satisfied"))

boxplot(
  data = satisfaction,
  Total.Spend ~ Satisfaction.Level,
  xlab = "Satisfaction Level",
  ylab = "Total Spend ($US)",
  ylim = c(0, 1500),
  main = "Customer Spending by Satisfaction Level",
  names = c("Unsatisfied", "Neutral", "Satisfied"),
  col = c("firebrick2", "red3", "maroon")
)
text(
  x = 2,
  y = 1625,
  labels = paste0("Correlation: ", trunc(1000*cor(satisfaction$Satisfaction.Level, satisfaction$Total.Spend))/1000),
  cex = 0.9,
  xpd = TRUE
)

Overall, there is a positive correlation between level of customer satisfaction and total spending. Again, this is expected, because happier customers are more likely to return and purchase more. Interestingly, the trend is not monotone: the typical total spending for customers with a Neutral satisfaction level was lower than for customers who were Unsatisfied!
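An overall positive correlation and a dip at the middle level can coexist, which is easier to see when the per-level medians are printed next to a rank correlation. The sketch below shows this on made-up data; the data frame `toy` and its values are hypothetical, and the use of Spearman correlation for the ordinal level is a choice made for this example rather than the method used in the plot above.

```r
# Illustrative values only: these are NOT rows from the commerce data.
toy <- data.frame(
  Satisfaction.Level = c("Unsatisfied", "Neutral", "Satisfied", "",
                         "Neutral", "Satisfied", "Unsatisfied"),
  Total.Spend = c(600, 500, 1200, 700, 550, 1300, 650)
)

# Drop rows with an empty satisfaction level, as in the report.
toy <- toy[toy$Satisfaction.Level != "", ]

# Encode the level as an ordered factor instead of recoding by hand.
toy$Satisfaction.Level <- factor(
  toy$Satisfaction.Level,
  levels = c("Unsatisfied", "Neutral", "Satisfied"),
  ordered = TRUE
)

# Median spend per level, and a rank correlation on the level codes.
level_medians <- tapply(toy$Total.Spend, toy$Satisfaction.Level, median)
rank_cor <- cor(
  as.integer(toy$Satisfaction.Level),
  toy$Total.Spend,
  method = "spearman"
)

print(level_medians)
print(rank_cor)
```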

8. What are the spending patterns of different gender groups?

hist(
  x = commerce[commerce$Gender == "Male", ]$Total.Spend,
  breaks = 25,
  xlim = c(400, 1600),
  main = "Histogram of Total Spend by Gender",
  xlab = "Total Spend ($US)",
  ylab = "Count",
  col = rgb(0.345, 0.592, 0.902, 0.4)
)
par(new = TRUE)
hist(
  x = commerce[commerce$Gender == "Female", ]$Total.Spend,
  breaks = 25,
  xlim = c(400, 1600),
  main = "",
  xlab = "",
  ylab = "",
  xaxt = "n",
  yaxt = "n",
  col = rgb(0.839, 0.345, 0.588, 0.4)
)
legend(
  x = 370,
  y = 63,
  bty = "n",
  fill = c(rgb(0.345, 0.592, 0.902, 0.4), rgb(0.839, 0.345, 0.588, 0.4)),
  legend = c("Male", "Female"),
  xpd = TRUE
)

The data is somewhat limited, but it seems reasonable to conclude that both Male and Female customers fall into two spending classes, a high class and a low class. In general, Female customers spend less than their Male counterparts, which might be explainable if we knew what the e-commerce firm typically sells.
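When two histograms are overlaid, drawing them separately can give each its own bins and y-axis scale, which makes bar heights hard to compare. One way to keep them comparable is to fix a shared set of breaks and draw the second histogram with `add = TRUE`. The sketch below does this using only the five sample rows from the top of the report; `shared_breaks` and the 200-dollar bin width are choices made for this example.

```r
# Total spend for the five sample rows at the top of the report, by gender.
male_spend   <- c(780.50, 1480.30, 720.40)
female_spend <- c(1120.20, 510.75)

# One set of bin edges for both groups (bin width chosen for illustration).
shared_breaks <- seq(400, 1600, by = 200)

# Compute the counts first so a common y-axis limit can be chosen.
male_hist   <- hist(male_spend, breaks = shared_breaks, plot = FALSE)
female_hist <- hist(female_spend, breaks = shared_breaks, plot = FALSE)
y_max <- max(male_hist$counts, female_hist$counts)

# Draw both on the same axes; add = TRUE reuses the existing plot region.
plot(
  male_hist,
  col = rgb(0.345, 0.592, 0.902, 0.4),
  ylim = c(0, y_max),
  main = "Total Spend by Gender (sample rows)",
  xlab = "Total Spend ($US)",
  ylab = "Count"
)
plot(female_hist, col = rgb(0.839, 0.345, 0.588, 0.4), add = TRUE)
legend(
  x = "topright",
  bty = "n",
  fill = c(rgb(0.345, 0.592, 0.902, 0.4), rgb(0.839, 0.345, 0.588, 0.4)),
  legend = c("Male", "Female")
)
```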

9. How does the number of items purchased vary by city?

cityboxplot2 <- boxplot(
  data = commerce,
  Items.Purchased ~ City,
  xlab = "City",
  ylab = "Items Purchased",
  main = "Items Purchased by City",
  col = topo.colors(7, alpha = 0.5)[2:7],
  cex.axis = 0.85
)
legend(
  x = 0.24,
  y = 21.3,
  bty = "n",
  title = "Medians",
  legend = paste0(paste0(cityboxplot2$names, ": "), cityboxplot2$stats[3, ]),
  fill = topo.colors(7, alpha = 0.5)[2:7],
  cex = 0.8
)

Customers from San Francisco typically purchased the greatest quantity of items by far, followed by New York, then Miami and Los Angeles. Again, this might be explainable if we knew what the e-commerce firm sells. One possible explanation is that the four highest cities are all coastal, whereas Chicago and Houston are not.