Research Question 1 (Correlation Analysis): Is there a correlation between the customer’s age and their total spending on the e-commerce platform?
Variables Involved: Age (Numeric); Total Spend (Numeric)
Hypothesis: Null Hypothesis (H0): There is no correlation between the customer’s age and total spending. Alternative Hypothesis (H1): There is a correlation between the customer’s age and total spending.
Research Question 2 (Pearson Chi-Squared Test): Is there an association between whether a discount was applied to a purchase and the satisfaction level of the customer?
Variables Involved: Discount Applied (Binary: True or False); Satisfaction Level (Categorical: Satisfied, Neutral, Unsatisfied)
Hypothesis: Null Hypothesis (H0): There is no association between whether a discount was applied and satisfaction level. Alternative Hypothesis (H1): There is an association between whether a discount was applied and satisfaction level.
Source of the dataset: https://www.kaggle.com/datasets/uom190346a/e-commerce-customer-behavior-dataset
# Loading my e-commerce customer behavior dataset
mydata <- read.csv("~/Bootcamp/E-commerce Customer Behavior - Sheet1.csv", header = TRUE)
# Display the first few rows of the dataset
head(mydata, 10)
## Customer.ID Gender Age City Membership.Type Total.Spend Items.Purchased Average.Rating Discount.Applied
## 1 101 Female 29 New York Gold 1120.20 14 4.6 TRUE
## 2 102 Male 34 Los Angeles Silver 780.50 11 4.1 FALSE
## 3 103 Female 43 Chicago Bronze 510.75 9 3.4 TRUE
## 4 104 Male 30 San Francisco Gold 1480.30 19 4.7 FALSE
## 5 105 Male 27 Miami Silver 720.40 13 4.0 TRUE
## 6 106 Female 37 Houston Bronze 440.80 8 3.1 FALSE
## 7 107 Female 31 New York Gold 1150.60 15 4.5 TRUE
## 8 108 Male 35 Los Angeles Silver 800.90 12 4.2 FALSE
## 9 109 Female 41 Chicago Bronze 495.25 10 3.6 TRUE
## 10 110 Male 28 San Francisco Gold 1520.10 21 4.8 FALSE
## Days.Since.Last.Purchase Satisfaction.Level
## 1 25 Satisfied
## 2 18 Neutral
## 3 42 Unsatisfied
## 4 12 Satisfied
## 5 55 Unsatisfied
## 6 22 Neutral
## 7 28 Satisfied
## 8 14 Neutral
## 9 40 Unsatisfied
## 10 9 Satisfied
Unit of Observation: Each entry in the dataset corresponds to a unique customer.
Sample Size: The dataset includes information for 350 customers, with entries ranging from Customer ID 101 to 450.
Definition of Variables:
Customer ID. Type: Numeric. Description: A unique identifier assigned to each customer.
Gender. Type: Categorical (Male, Female). Description: Specifies the gender of the customer.
Age. Type: Numeric. Description: Represents the age of the customer.
City. Type: Categorical (City names). Description: Indicates the city of residence for each customer.
Membership Type. Type: Categorical (Gold, Silver, Bronze). Description: Identifies the type of membership held by the customer.
Total Spend. Type: Numeric. Description: Records the total monetary expenditure by the customer on the e-commerce platform.
Items Purchased. Type: Numeric. Description: Quantifies the total number of items purchased by the customer.
Average Rating. Type: Numeric (0 to 5, with decimals). Description: Represents the average rating given by the customer for purchased items.
Discount Applied. Type: Boolean (True, False). Description: Indicates whether a discount was applied to the customer’s purchase.
Days Since Last Purchase. Type: Numeric. Description: Reflects the number of days elapsed since the customer’s most recent purchase.
Satisfaction Level. Type: Categorical (Satisfied, Neutral, Unsatisfied). Description: Captures the overall satisfaction level of the customer.
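# Let's begin analyzing Research Question 1
# Set Age values of 999 to NA (treated as a missing-value code)
mydata$Age <- ifelse(test = mydata$Age == 999,
                     yes = NA,
                     no = as.numeric(mydata$Age))
# Set Total.Spend values of 6666 or more to NA
mydata$Total.Spend <- ifelse(test = mydata$Total.Spend >= 6666,
                             yes = NA,
                             no = as.numeric(mydata$Total.Spend))
# Keep only rows without any missing value
mydata <- mydata[complete.cases(mydata), ]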
# Display the first few rows of the updated dataset
head(mydata)
## Customer.ID Gender Age City Membership.Type Total.Spend Items.Purchased Average.Rating Discount.Applied
## 1 101 Female 29 New York Gold 1120.20 14 4.6 TRUE
## 2 102 Male 34 Los Angeles Silver 780.50 11 4.1 FALSE
## 3 103 Female 43 Chicago Bronze 510.75 9 3.4 TRUE
## 4 104 Male 30 San Francisco Gold 1480.30 19 4.7 FALSE
## 5 105 Male 27 Miami Silver 720.40 13 4.0 TRUE
## 6 106 Female 37 Houston Bronze 440.80 8 3.1 FALSE
## Days.Since.Last.Purchase Satisfaction.Level
## 1 25 Satisfied
## 2 18 Neutral
## 3 42 Unsatisfied
## 4 12 Satisfied
## 5 55 Unsatisfied
## 6 22 Neutral
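The recoding step above sets Age values of 999 and Total.Spend values of 6666 or more to NA before dropping incomplete rows. A minimal sketch of the same idea on a small made-up data frame (the `toy` values are hypothetical and not taken from the dataset), using logical indexing instead of ifelse():

```r
# Hypothetical toy data: 999 and 6666 stand in for missing-value codes
toy <- data.frame(Age = c(29, 999, 43),
                  Total.Spend = c(1120.20, 780.50, 6666))

# Replace the codes with NA using logical indexing
toy$Age[toy$Age == 999] <- NA
toy$Total.Spend[toy$Total.Spend >= 6666] <- NA

# Keep only rows without any missing value
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)
```

Both forms give the same result; logical indexing simply avoids the extra as.numeric() call.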
Here are some descriptive statistics.
library(psych)
psych::describe(mydata[ , c("Age", "Total.Spend")])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Age 1 350 33.60 4.87 32.5 33.31 5.19 26.0 43.0 17.0 0.46 -0.79 0.26
## Total.Spend 2 350 845.38 362.06 775.2 816.78 444.71 410.8 1520.1 1109.3 0.56 -1.09 19.35
library(ggplot2)
# Scatterplot of total spend against age
ggplot(mydata, aes(x = Age, y = Total.Spend)) +
  geom_point()
# Bar chart of the age distribution (geom_bar() counts customers per age)
ggplot(mydata, aes(x = Age)) +
  geom_bar(colour = "black", fill = "pink") +
  ylab("Number of customers") +
  scale_x_continuous(breaks = 26:43)
We use the Pearson correlation coefficient.
cor(mydata$Age, mydata$Total.Spend,
method = "pearson")
## [1] -0.6779183
H0: ρ = 0; H1: ρ ≠ 0, where ρ is the population correlation between Age and Total.Spend.
cor(mydata$Age, mydata$Total.Spend,
method = "pearson",
use = "complete.obs")
## [1] -0.6779183
# cor.test() has no `use` argument; it drops incomplete pairs on its own
cor.test(mydata$Age, mydata$Total.Spend,
         method = "pearson")
##
## Pearson's product-moment correlation
##
## data: mydata$Age and mydata$Total.Spend
## t = -17.203, df = 348, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7308122 -0.6169314
## sample estimates:
## cor
## -0.6779183
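As a cross-check, the t statistic and the confidence interval reported by cor.test() can be reproduced from r and n alone. The sketch below assumes the values printed above (r = -0.6779183, n = 350) and uses the textbook formulas: t = r·sqrt(n − 2)/sqrt(1 − r²) with n − 2 degrees of freedom, and the Fisher z-transformation for the interval. The variable names are illustrative only.

```r
# Values copied from the output above
r <- -0.6779183
n <- 350

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z-transformation
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The results should agree with the cor.test() output up to rounding.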
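Based on the sample data, we can reject the null hypothesis (p < 0.001) and conclude that age and total spending are correlated. The correlation is negative (r = -0.68), so older customers in this sample tend to spend less.
Now let’s answer Research Question 2. See the beginning of the report for the description of the variables and the hypotheses.
Let’s start by converting the variables to factors.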
mydata$Discount.AppliedF <- factor(mydata$Discount.Applied,
levels = c(TRUE, FALSE),
labels = c("Applied", "Not Applied"))
mydata$Satisfaction.LevelF <- factor(mydata$Satisfaction.Level,
levels = c("Satisfied", "Neutral", "Unsatisfied"))
mydata <- mydata[complete.cases(mydata$Discount.AppliedF), ]
mydata <- mydata[complete.cases(mydata$Satisfaction.LevelF), ]
head(mydata)
## Customer.ID Gender Age City Membership.Type Total.Spend Items.Purchased Average.Rating Discount.Applied
## 1 101 Female 29 New York Gold 1120.20 14 4.6 TRUE
## 2 102 Male 34 Los Angeles Silver 780.50 11 4.1 FALSE
## 3 103 Female 43 Chicago Bronze 510.75 9 3.4 TRUE
## 4 104 Male 30 San Francisco Gold 1480.30 19 4.7 FALSE
## 5 105 Male 27 Miami Silver 720.40 13 4.0 TRUE
## 6 106 Female 37 Houston Bronze 440.80 8 3.1 FALSE
## Days.Since.Last.Purchase Satisfaction.Level Discount.AppliedF Satisfaction.LevelF
## 1 25 Satisfied Applied Satisfied
## 2 18 Neutral Not Applied Neutral
## 3 42 Unsatisfied Applied Unsatisfied
## 4 12 Satisfied Not Applied Satisfied
## 5 55 Unsatisfied Applied Unsatisfied
## 6 22 Neutral Not Applied Neutral
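When factor() is given an explicit `levels` argument, any value outside those levels becomes NA, which is why the rows are filtered again after the conversion. A small sketch with made-up values (the blank string is a hypothetical example of an unrecorded answer, not a claim about the raw file):

```r
# Hypothetical raw values; "" is not among the declared levels
raw <- c("Satisfied", "Neutral", "", "Unsatisfied")

f <- factor(raw, levels = c("Satisfied", "Neutral", "Unsatisfied"))

# The blank entry is now NA and can be dropped
is.na(f)
f_clean <- f[!is.na(f)]
length(f_clean)
```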
Let’s make a contingency table.
# Create the matrix manually; the counts are listed row by row,
# so byrow = TRUE is needed (matrix() fills column by column by default)
mytable <- matrix(c(
  sum(mydata$Discount.AppliedF == "Not Applied" & mydata$Satisfaction.LevelF == "Unsatisfied"),
  sum(mydata$Discount.AppliedF == "Not Applied" & mydata$Satisfaction.LevelF == "Neutral"),
  sum(mydata$Discount.AppliedF == "Not Applied" & mydata$Satisfaction.LevelF == "Satisfied"),
  sum(mydata$Discount.AppliedF == "Applied" & mydata$Satisfaction.LevelF == "Unsatisfied"),
  sum(mydata$Discount.AppliedF == "Applied" & mydata$Satisfaction.LevelF == "Neutral"),
  sum(mydata$Discount.AppliedF == "Applied" & mydata$Satisfaction.LevelF == "Satisfied")
), nrow = 2, byrow = TRUE)
# Specify row and column names
colnames(mytable) <- c("Unsatisfied", "Neutral", "Satisfied")
rownames(mytable) <- c("Not Applied", "Applied")
# Display the matrix
print(mytable)
## Unsatisfied Neutral Satisfied
## Not Applied 0 107 66
## Applied 116 0 59
The table adds up to 348, because there were two NA records in the Satisfaction Level variable. There are two variables: Discount Applied (2 categories) and Satisfaction Level (3 categories). The empirical frequencies are in the table above. For the hypotheses, see the beginning of the report.
chi_squared <- chisq.test(mytable,
                          correct = FALSE)
chi_squared
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 223.39, df = 2, p-value < 2.2e-16
addmargins(chi_squared$observed)
## Unsatisfied Neutral Satisfied Sum
## Not Applied 0 107 66 173
## Applied 116 0 59 175
## Sum 116 107 125 348
addmargins(round(chi_squared$expected, 2))
## Unsatisfied Neutral Satisfied Sum
## Not Applied 57.67 53.19 62.14 173
## Applied 58.33 53.81 62.86 175
## Sum 116.00 107.00 125.00 348
round(chi_squared$residuals, 2)
## Unsatisfied Neutral Satisfied
## Not Applied -7.59 7.38 0.49
## Applied 7.55 -7.34 -0.49
Based on the sample data, we can reject the null hypothesis (p < 0.001) and conclude that there is an association between whether a discount was applied and the customer’s satisfaction level.
Let’s explain the residual -7.59 (Not Applied & Unsatisfied):
There are fewer customers in our sample who had no discount applied and were unsatisfied than expected under independence (0 observed versus 57.67 expected; significant at alpha = 0.1%).
Let’s explain the residual -7.34 (Applied & Neutral):
There are fewer customers in our sample who had a discount applied and were neutral than expected under independence (0 observed versus 53.81 expected; significant at alpha = 0.1%).
Now let’s analyze the proportion tables. The first one is a regular proportion table. The second and third are conditional proportion tables.
addmargins(round(prop.table(chi_squared$observed), 3))
## Unsatisfied Neutral Satisfied Sum
## Not Applied 0.000 0.307 0.19 0.497
## Applied 0.333 0.000 0.17 0.503
## Sum 0.333 0.307 0.36 1.000
addmargins(round(prop.table(chi_squared$observed, 1), 3), 2)
## Unsatisfied Neutral Satisfied Sum
## Not Applied 0.000 0.618 0.382 1
## Applied 0.663 0.000 0.337 1
addmargins(round(prop.table(chi_squared$observed, 2), 3), 1)
## Unsatisfied Neutral Satisfied
## Not Applied 0 1 0.528
## Applied 1 0 0.472
## Sum 1 1 1.000
The conditions of the second table are: “Discount not applied” and “Discount applied”.
The conditions of the third table are: “Unsatisfied”, “Neutral”, and “Satisfied”.
Now let’s explain one example number from each table:
0.17 - Out of all customers, 17% had a discount applied and were satisfied.
0.337 - Out of all customers who had a discount applied, 33.7% were satisfied.
0.528 - Out of all satisfied customers, 52.8% did not have a discount applied.
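The three kinds of proportion tables differ only in the `margin` argument of prop.table(). A minimal sketch on a made-up 2 × 2 table (hypothetical counts, chosen so the arithmetic is easy to follow by hand):

```r
# Hypothetical counts, not taken from the dataset
toy <- matrix(c(10, 30,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Group = c("A", "B"),
                              Answer = c("Yes", "No")))

overall <- prop.table(toy)      # each cell divided by the grand total
by_row  <- prop.table(toy, 1)   # each cell divided by its row total
by_col  <- prop.table(toy, 2)   # each cell divided by its column total

overall["A", "Yes"]  # 10 / 100
by_row["A", "Yes"]   # 10 / 40
by_col["A", "Yes"]   # 10 / 30
```

The same cell therefore answers three different questions, depending on which total it is divided by.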
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata$Discount.AppliedF, mydata$Satisfaction.LevelF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.80 | [0.71, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
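Cramér's V rescales the chi-squared statistic to the range 0 to 1: V = sqrt(χ² / (n · min(r − 1, c − 1))), where r and c are the numbers of rows and columns. The sketch below computes the expected counts, the Pearson residuals, χ², and the unadjusted V by hand for a made-up 2 × 3 table and checks χ² against chisq.test(). The counts are hypothetical. Note also that the value printed above is labelled "adj.", that is, a bias-corrected variant, so it can differ slightly from the plain formula.

```r
# Hypothetical counts, not taken from the dataset
toy <- matrix(c(10, 20, 30,
                30, 20, 10),
              nrow = 2, byrow = TRUE)

n <- sum(toy)
expected <- outer(rowSums(toy), colSums(toy)) / n  # row total * column total / n
pearson_res <- (toy - expected) / sqrt(expected)   # Pearson residuals
chi2 <- sum(pearson_res^2)

# Unadjusted Cramér's V
v <- sqrt(chi2 / (n * min(nrow(toy) - 1, ncol(toy) - 1)))

# Cross-check against the built-in test
built_in <- chisq.test(toy, correct = FALSE)
all.equal(chi2, unname(built_in$statistic))
round(v, 3)
```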
interpret_cramers_v(0.80)
## [1] "very large"
## (Rules: funder2019)
We performed the Pearson chi-squared test and, based on the sample data, we can reject the null hypothesis (p < 0.001) and conclude that there is an association between whether a discount was applied and the customer’s satisfaction level. The proportion tables and residuals further support this association.
Based on the sample data, the effect size is very large (Cramér's V = 0.80).