E-commerce Data Analysis

  1. Create an indicator variable (0/1) for whether the customer has children or not. Hint: use the function if_else() or the function as.double().

  2. Create a table with the count, mean, median, sd, min, and max of amount spent by customers who own vs. rent their homes. Hint: you can use the functions group_by(), summarize(), mean(), median(), sd(), min() and max().

  3. What proportion of male and female customers are homeowners? Represent this information in a two-way table. Hint: you can use the function tabyl().

  4. Use the appropriate statistical test to compare the proportion of males and females who are homeowners. Write down the null and alternative hypotheses of your test and interpret its results. Hint: you can use the function prop_test().

  5. Use the appropriate statistical test to compare the mean amount spent by customer location (far/close). Write down the null and alternative hypotheses of your test and interpret its results. Hint: you can use the function t_test().

  6. Create a confidence interval plot for mean amount spent by customers according to their purchase history. Briefly interpret your plot. Hint: you have to process the history variable first. Create a new variable new_hist in which any NA is converted to "New Customer" (you can use if_else() again), then convert new_hist to a factor with levels in the order c("New Customer", "Low", "Medium", "High").

  7. Use the appropriate statistical test to compare the mean amount spent by customers according to their purchase history. Write down the null and alternative hypotheses of your test and interpret its results.


Get started by loading libraries and reading data.

library(tidyverse)
library(stargazer)
library(infer)
library(janitor)

load("data/ecommerce.RData")

tb.ecommerce <- rename_with(tb.ecommerce, tolower)


  1. Create an indicator variable (0/1) for whether the customer has children or not.
# using as.double()
tb.ecommerce <- tb.ecommerce %>% mutate(child_flag = as.double(children > 0))

# using if_else()
tb.ecommerce <- tb.ecommerce %>% mutate(child_flag = if_else(children > 0, 1, 0))
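Both approaches give the same indicator, including for missing values: a comparison with NA yields NA, so the flag stays NA rather than silently becoming 0. The following is a minimal sketch on a hypothetical toy vector (`children_toy` is an assumption for illustration, not part of the data set). It uses base R's ifelse() in place of dplyr's if_else() so that it runs without loading any package.

```r
# Hypothetical toy vector; not taken from ecommerce.RData
children_toy <- c(0, 2, NA, 1, 0)

# Logical comparison converted to 0/1
flag_double <- as.double(children_toy > 0)

# Conditional assignment (base-R counterpart of dplyr::if_else())
flag_ifelse <- ifelse(children_toy > 0, 1, 0)

print(flag_double)
print(identical(flag_double, flag_ifelse))
```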


  2. Create a table with the count, mean, median, sd, min, max of amount spent by customers who own vs rent their homes.
tb.ecommerce %>% 
  group_by(ownhome) %>%
  summarize(n = n(),
            mean_amount = mean(amountspent),
            median_amount = median(amountspent),
            sd_amount = sd(amountspent),
            min_amount = min(amountspent),
            max_amount = max(amountspent))
## # A tibble: 2 × 7
##   ownhome     n mean_amount median_amount sd_amount min_amount max_amount
##   <fct>   <int>       <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
## 1 Own       516        70.5             0     106.           0       622.
## 2 Rent      484        37.7             0      68.9          0       583


  3. What proportion of male and female customers are homeowners? Represent this information in a two-way table.
# Two-way table of gender by home ownership
tb.ecommerce %>% tabyl(gender, ownhome) %>%
  # Add column and row totals
  adorn_totals(c("row", "col")) %>%
  # Convert values to percentages
  adorn_percentages("row") %>%
  # Format values with percent sign and two decimal places
  adorn_pct_formatting(2) %>%
  # Add counts along with the percentages
  adorn_ns("front")
##  gender          Own         Rent           Total
##  Female 240 (47.43%) 266 (52.57%)   506 (100.00%)
##    Male 276 (55.87%) 218 (44.13%)   494 (100.00%)
##   Total 516 (51.60%) 484 (48.40%) 1,000 (100.00%)


  4. Use the appropriate statistical test to compare the proportion of males and females who are homeowners. Write down the null and alternative hypotheses of your test and interpret its results.
# Test for two proportions
tb.ecommerce %>% prop_test(ownhome ~ gender, 
                           order = c("Female", "Male"), 
                           z = TRUE)
## # A tibble: 1 × 5
##   statistic p_value alternative lower_ci upper_ci
##       <dbl>   <dbl> <chr>          <dbl>    <dbl>
## 1     -2.67 0.00758 two.sided     -0.146  -0.0227
# note the z argument, a logical value for whether to report the statistic as 
# a standard normal deviate or a Pearson's chi-square statistic.

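Null hypothesis: the proportion of homeowners is the same among female and male customers (H0: p_female = p_male). Alternative hypothesis: the two proportions differ (H1: p_female ≠ p_male).

The p-value (0.00758) is below 0.01, so we reject the null hypothesis of equal proportions at the 1% significance level. The 95% confidence interval for the difference (female minus male) runs from -0.146 to -0.0227 and excludes zero, which suggests that a smaller share of female customers than of male customers own their homes (47.43% vs. 55.87% in the sample).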
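As a check on the reported statistic, the two-proportion z-test can be reproduced by hand from the counts in the two-way table. This is a sketch in base R under the assumption that the reported statistic is the pooled z-statistic without continuity correction; the variable names are illustrative.

```r
# Counts taken from the two-way table: homeowners / group size
x <- c(Female = 240, Male = 276)
n <- c(Female = 506, Male = 494)

p_hat  <- x / n                      # sample proportions of homeowners
p_pool <- sum(x) / sum(n)            # pooled proportion under H0
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# z-statistic for the difference Female - Male and its two-sided p-value
z       <- unname((p_hat["Female"] - p_hat["Male"]) / se)
p_value <- 2 * pnorm(-abs(z))

print(round(c(z = z, p_value = p_value), 4))

# Base-R cross-check: without continuity correction, X-squared equals z^2
chisq <- unname(prop.test(x, n, correct = FALSE)$statistic)
print(all.equal(chisq, z^2))
```

The rounded z-statistic and p-value agree with the prop_test() output, and squaring z gives the Pearson chi-square statistic that would be reported with z = FALSE.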


  5. Use the appropriate statistical test to compare the mean amount spent by customer location (far/close). Write down the null and alternative hypotheses of your test and interpret its results.
ggplot(tb.ecommerce, aes(x = amountspent, fill = location, y = after_stat(count)/sum(after_stat(count)))) + 
  geom_histogram(color = "white", position = 'identity', alpha = 0.8) +
  theme(text = element_text(size = 16)) +
  ylab("Relative Frequency")

ggplot(tb.ecommerce, aes(x=amountspent, y=location, fill=location)) + 
  geom_boxplot()

tb.ecommerce %>% group_by(location) %>% summarise(mean_amt = mean(amountspent))
## # A tibble: 2 × 2
##   location mean_amt
##   <fct>       <dbl>
## 1 Close        45.2
## 2 Far          77.8
# Test for two means; order makes the direction of the difference explicit
# ("Close" - "Far") and avoids the warning infer issues when it is omitted
tb.ecommerce %>% t_test(amountspent ~ location, order = c("Close", "Far"))
## # A tibble: 1 × 7
##   statistic  t_df   p_value alternative estimate lower_ci upper_ci
##       <dbl> <dbl>     <dbl> <chr>          <dbl>    <dbl>    <dbl>
## 1     -4.23  373. 0.0000296 two.sided      -32.6    -47.8    -17.5

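Null hypothesis: the mean amount spent is the same for customers located close and far (H0: mu_close = mu_far). Alternative hypothesis: the two means differ (H1: mu_close ≠ mu_far).

The p-value (0.0000296) is below 0.01, so we reject the null hypothesis of equal means at the 1% significance level. The estimated difference (Close minus Far) is -32.6, with a 95% confidence interval from -47.8 to -17.5, which suggests that customers located far away spend more on average (77.8 vs. 45.2 in the sample).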
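The non-integer degrees of freedom in the output indicate a Welch (unequal-variance) t-test. The sketch below shows how that statistic and its degrees of freedom are computed, and checks them against base R's t.test(). The data are simulated and hypothetical (`amt_close` and `amt_far` are assumptions for illustration), so the numbers will not match the output above.

```r
set.seed(1)
# Simulated, hypothetical spending data; not the ecommerce data set
amt_close <- rexp(300, rate = 1 / 45)
amt_far   <- rexp(120, rate = 1 / 78)

# Welch statistic: difference in means over its unpooled standard error
v_close <- var(amt_close) / length(amt_close)
v_far   <- var(amt_far) / length(amt_far)
t_stat  <- (mean(amt_close) - mean(amt_far)) / sqrt(v_close + v_far)

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (v_close + v_far)^2 /
  (v_close^2 / (length(amt_close) - 1) + v_far^2 / (length(amt_far) - 1))

# Cross-check against t.test(), which defaults to var.equal = FALSE
fit <- t.test(amt_close, amt_far)
print(all.equal(unname(fit$statistic), t_stat))
print(all.equal(unname(fit$parameter), welch_df))
```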


  6. Create a confidence interval plot for mean amount spent by customers according to their purchase history. Briefly interpret your plot.
# Factor variable new_hist
tb.ecommerce <- tb.ecommerce %>%
  mutate(new_hist = if_else(is.na(history), "New Customer", as.character(history)))
tb.ecommerce <- tb.ecommerce  %>%
  mutate(fact_new_hist = factor(x = new_hist,
                                levels = c("New Customer", "Low", "Medium", "High")))
# Create data set for plot
df_to_plot <- tb.ecommerce %>% 
  group_by(fact_new_hist) %>% 
  summarize(n = n(), 
            mean_amount = mean(amountspent), 
            sd_amount = sd(amountspent),
            se_amount = sd_amount/sqrt(n))

# Create plot with confidence interval for each group mean
ggplot(data = df_to_plot, 
       aes(x = fact_new_hist, y = mean_amount, color = fact_new_hist)) + 
  geom_point() +
  geom_errorbar(aes(ymin=mean_amount-1.96*se_amount,
                    ymax=mean_amount+1.96*se_amount),
                width=.2) +
  theme_classic() + 
  labs(x = "") + 
  theme(legend.position = "none")  # "none" is the documented value for hiding the legend
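The error bars above use 1.96, the 97.5% quantile of the standard normal distribution, which is a large-sample approximation. For small groups, the corresponding Student's t quantile is noticeably larger. A short sketch with hypothetical group sizes (not the group sizes in the data) shows how quickly the two converge:

```r
# Hypothetical group sizes, for illustration only
n <- c(10, 30, 100, 1000)

# 97.5% quantile of Student's t with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Compare with the normal quantile used in the plot (about 1.96)
print(data.frame(n = n,
                 t_crit = round(t_crit, 3),
                 normal = round(qnorm(0.975), 3)))
```

If a group in df_to_plot turned out to be small, one option would be to replace 1.96 with qt(0.975, n - 1) when computing ymin and ymax.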


  7. Use the appropriate statistical test to compare the mean amount spent by customers according to their purchase history. Write down the null and alternative hypotheses of your test and interpret its results.
# One-way ANOVA
one_way_aov <- aov(amountspent ~ fact_new_hist, data = tb.ecommerce)
summary(one_way_aov)
##                Df  Sum Sq Mean Sq F value Pr(>F)    
## fact_new_hist   3  945353  315118   42.65 <2e-16 ***
## Residuals     996 7358186    7388                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

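Null hypothesis: the mean amount spent is the same across all four purchase-history groups (New Customer, Low, Medium, High). Alternative hypothesis: at least one group mean differs from the others.

With F(3, 996) = 42.65 and p < 2e-16, we reject the null hypothesis at the 1% significance level and conclude that not all group means are equal. The ANOVA by itself does not indicate which groups differ.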
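The F value in the ANOVA table is the ratio of the between-group mean square to the residual mean square. The arithmetic can be reproduced from the sums of squares and degrees of freedom printed above; the variable names are illustrative, and the inputs are the rounded values from the table.

```r
# Values copied from the ANOVA table above
ss_between <- 945353;  df_between <- 3
ss_within  <- 7358186; df_within  <- 996

ms_between <- ss_between / df_between   # mean square for fact_new_hist
ms_within  <- ss_within / df_within     # residual mean square

# F statistic and its upper-tail p-value under H0
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value < 2e-16)
```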