E-commerce Data Analysis
Create an indicator variable (0/1) for whether the customer has children or not. Hint: use the function if_else() or the function as.double().
Create a table with the count, mean, median, sd, min, max of amount spent by customers who own vs rent their homes. Hint: you can use the functions group_by(), summarize(), mean(), median(), sd(), min() and max().
What proportion of male and female customers are homeowners? Represent this information in a two-way table. Hint: you can use the function tabyl().
Use the appropriate statistical test to compare the proportion of males and females who are homeowners. Write down the null and alternative hypotheses of your test and interpret its results. Hint: you can use the function prop_test().
Use the appropriate statistical test to compare the mean amount spent by customer location (far/close). Write down the null and alternative hypotheses of your test and interpret its results. Hint: you can use the function t_test().
Create a confidence interval plot for mean amount spent by customers according to their purchase history. Briefly interpret your plot. Hint: note that you have to process the history variable. First create a new variable new_hist where you convert any NA to "New Customer" (you can use if_else() again) and then convert new_hist to a factor with levels in the order c("New Customer", "Low", "Medium", "High").
Use the appropriate statistical test to compare the mean amount spent by customers according to their purchase history. Write down the null and alternative hypotheses of your test and interpret its results.
Get started by loading libraries and reading data.
library(tidyverse)
library(stargazer)
library(infer)
library(janitor)
load("data/ecommerce.RData")
tb.ecommerce <- rename_with(tb.ecommerce, tolower)
# using as.double()
tb.ecommerce <- tb.ecommerce %>% mutate(child_flag = as.double(children > 0))
# using if_else()
tb.ecommerce <- tb.ecommerce %>% mutate(child_flag = if_else(children > 0, 1, 0))
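As a quick check that the two approaches agree, the sketch below applies both to a small made-up tibble. The `toy` data and its values are hypothetical and are not taken from the e-commerce data set. The logical comparison `children > 0` becomes 1/0 when passed through as.double(), which is what if_else() spells out explicitly.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(children = c(0, 2, 1, 0, 3))

toy <- toy %>%
  mutate(flag_double = as.double(children > 0),
         flag_ifelse = if_else(children > 0, 1, 0))

# Both columns should hold the same 0/1 values
identical(toy$flag_double, toy$flag_ifelse)
```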
tb.ecommerce %>%
  group_by(ownhome) %>%
  summarize(n = n(),
            mean_amount = mean(amountspent),
            median_amount = median(amountspent),
            sd_amount = sd(amountspent),
            min_amount = min(amountspent),
            max_amount = max(amountspent))
## # A tibble: 2 × 7
##   ownhome     n mean_amount median_amount sd_amount min_amount max_amount
##   <fct>   <int>       <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
## 1 Own       516        70.5             0     106.           0       622.
## 2 Rent      484        37.7             0      68.9          0       583
# Comparing two proportions
tb.ecommerce %>%
  tabyl(gender, ownhome) %>%
  # Add column and row totals
  adorn_totals(c("row", "col")) %>%
  # Convert values to percentages
  adorn_percentages("row") %>%
  # Format values with percent sign and two decimal places
  adorn_pct_formatting(2) %>%
  # Add counts along with the percentages
  adorn_ns("front")
##  gender          Own         Rent           Total
##  Female 240 (47.43%) 266 (52.57%)   506 (100.00%)
##    Male 276 (55.87%) 218 (44.13%)   494 (100.00%)
##   Total 516 (51.60%) 484 (48.40%) 1,000 (100.00%)
\(H_0:\) The population proportion of females who own a home is equal to the population proportion of males who own a home.
\(H_a:\) The population proportion of females who own a home is not equal to the population proportion of males who own a home.
# Test for two proportions
tb.ecommerce %>%
  prop_test(ownhome ~ gender,
            order = c("Female", "Male"),
            z = TRUE)
## # A tibble: 1 × 5
##   statistic p_value alternative lower_ci upper_ci
##       <dbl>   <dbl> <chr>          <dbl>    <dbl>
## 1     -2.67 0.00758 two.sided     -0.146  -0.0227
# note the z argument, a logical value for whether to report the statistic as
# a standard normal deviate or a Pearson's chi-square statistic.
The p-value (0.0076) is below 0.01, so we reject the null hypothesis of equal proportions at the 1% significance level. The evidence suggests that the two population proportions differ: in the sample, 55.87% of males own their home compared with 47.43% of females.
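To see where the test statistic comes from, the following sketch recomputes it by hand in base R from the counts in the two-way table above. It assumes the usual pooled two-proportion z test without continuity correction, and an unpooled standard error for the interval; that this is what prop_test() does with z = TRUE is an assumption, although the results should agree with the reported output up to rounding.

```r
# Counts read off the two-way table above
own <- c(Female = 240, Male = 276)   # homeowners
n   <- c(Female = 506, Male = 494)   # group sizes

p_hat  <- own / n                    # sample proportions
p_pool <- sum(own) / sum(n)          # pooled proportion under H0
diff   <- unname(p_hat["Female"] - p_hat["Male"])

# z statistic for Female - Male, using the pooled standard error
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z       <- diff / se_pool
p_value <- 2 * pnorm(-abs(z))

# 95% interval for the difference, using the unpooled standard error
se_diff <- sqrt(sum(p_hat * (1 - p_hat) / n))
ci      <- diff + c(-1, 1) * qnorm(0.975) * se_diff

round(c(z = z, p_value = p_value, lower = ci[1], upper = ci[2]), 4)
```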
\(H_0:\) The population mean amount spent by customers located close to a store selling similar products equals the population mean amount spent by customers located far from such a store.
\(H_a:\) The population mean amount spent by customers located close to a store selling similar products does not equal the population mean amount spent by customers located far from such a store.
ggplot(tb.ecommerce, aes(x = amountspent, fill = location,
                         y = after_stat(count)/sum(after_stat(count)))) +
  geom_histogram(color = "white", position = "identity", alpha = 0.8) +
  theme(text = element_text(size = 16)) +
  ylab("Relative Frequency")
ggplot(tb.ecommerce, aes(x = amountspent, y = location, fill = location)) +
  geom_boxplot()
tb.ecommerce %>% group_by(location) %>% summarise(mean_amt = mean(amountspent))
## # A tibble: 2 × 2
##   location mean_amt
##   <fct>       <dbl>
## 1 Close        45.2
## 2 Far          77.8
# Test for two means
tb.ecommerce %>% t_test(amountspent ~ location)
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "Close" - "Far", or divided in the order "Close" / "Far" for ratio-based
## statistics. To specify this order yourself, supply `order = c("Close", "Far")`.
## # A tibble: 1 × 7
##   statistic  t_df   p_value alternative estimate lower_ci upper_ci
##       <dbl> <dbl>     <dbl> <chr>          <dbl>    <dbl>    <dbl>
## 1     -4.23  373. 0.0000296 two.sided      -32.6    -47.8    -17.5
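The p-value is below 0.01, so we reject the null hypothesis of equal means at the 1% significance level. The evidence suggests that the two population means differ: in the sample, customers located far from a similar store spent more on average (77.8) than customers located close to one (45.2).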
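The non-integer degrees of freedom (t_df of about 373) suggest that the test above is Welch's unequal-variance t-test. The sketch below shows how that statistic and its degrees of freedom are computed, using simulated data. The vectors `close` and `far` are hypothetical stand-ins, not the e-commerce data, so the numbers will not match the output above.

```r
set.seed(1)

# Hypothetical simulated spending for two groups (not the e-commerce data)
close <- rexp(200, rate = 1 / 45)
far   <- rexp(120, rate = 1 / 78)

# Welch statistic: difference in means over the unpooled standard error
v_close <- var(close) / length(close)
v_far   <- var(far) / length(far)
t_stat  <- (mean(close) - mean(far)) / sqrt(v_close + v_far)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_close + v_far)^2 /
  (v_close^2 / (length(close) - 1) + v_far^2 / (length(far) - 1))

p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# Base R's t.test() defaults to the same unequal-variance test
welch <- t.test(close, far)
all.equal(unname(welch$statistic), t_stat)
all.equal(unname(welch$parameter), df_welch)
```

Because the group variances are not pooled, the degrees of freedom depend on both sample sizes and both variances, which is why they are generally not a whole number.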
# Factor variable new_hist
tb.ecommerce <- tb.ecommerce %>%
  mutate(new_hist = if_else(is.na(history), "New Customer", as.character(history)))
tb.ecommerce <- tb.ecommerce %>%
  mutate(fact_new_hist = factor(x = new_hist,
                                levels = c("New Customer", "Low", "Medium", "High")))
# Create data set for plot
df_to_plot <- tb.ecommerce %>%
  group_by(fact_new_hist) %>%
  summarize(n = n(),
            mean_amount = mean(amountspent),
            sd_amount = sd(amountspent),
            se_amount = sd_amount / sqrt(n))
# Create plot with confidence interval for each group mean
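ggplot(data = df_to_plot,
       aes(x = fact_new_hist, y = mean_amount, color = fact_new_hist)) +
  geom_point() +
  # Approximate 95% interval: mean +/- 1.96 standard errors
  geom_errorbar(aes(ymin = mean_amount - 1.96 * se_amount,
                    ymax = mean_amount + 1.96 * se_amount),
                width = .2) +
  theme_classic() +
  labs(x = "") +
  # "none" hides the legend; an empty string is not a valid position
  theme(legend.position = "none")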
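The error bars use 1.96, the normal-approximation multiplier for a 95% interval. For small groups a t-based multiplier is somewhat wider. The sketch below compares the two for a few hypothetical group sizes, which are illustrative only and not the group sizes in this data set.

```r
# Hypothetical group sizes, for illustration only
n <- c(10, 30, 100, 500)

# Critical values for a 95% interval
z_mult <- qnorm(0.975)            # normal approximation, about 1.96
t_mult <- qt(0.975, df = n - 1)   # t multiplier, depends on group size

data.frame(n = n, z_mult = round(z_mult, 3), t_mult = round(t_mult, 3))
```

The t multiplier approaches 1.96 as the group size grows, so the choice matters mainly for the smaller groups.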
# One-way ANOVA
one_way_aov <- aov(amountspent ~ fact_new_hist, data = tb.ecommerce)
summary(one_way_aov)
##                Df  Sum Sq Mean Sq F value Pr(>F)    
## fact_new_hist   3  945353  315118   42.65 <2e-16 ***
## Residuals     996 7358186    7388                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
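The null hypothesis is that the population mean amount spent is the same in all four purchase-history groups; the alternative is that at least one group mean differs from the others. We reject the null hypothesis (p < 0.01) and conclude that not all group means are equal.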
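As a check on how the F value is formed, the sketch below recomputes it from the sums of squares and degrees of freedom printed in the ANOVA table above. The variable names are chosen for illustration.

```r
# Values copied from the ANOVA table above
ss_between <- 945353;  df_between <- 3
ss_within  <- 7358186; df_within  <- 996

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_stat     <- ms_between / ms_within

# Upper-tail probability of the F distribution under the null hypothesis
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

c(F = round(f_stat, 2), p_value = p_value)
```

A large F means the variation between group means is large relative to the variation within groups, which is the evidence against equal means.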