Erdene Enkh, 20 January 2024

IMB; Multivariate Analysis;

Homework 2

E-commerce customer behavior analysis

Research question 1 (Correlation analysis): Is there a correlation between the customer’s age and their total spending on the e-commerce platform?

Variables Involved: Age (Numeric); Total Spend (Numeric)

Hypotheses: Null Hypothesis (H0): There is no correlation between the customer’s age and total spending. Alternative Hypothesis (H1): There is a correlation between the customer’s age and total spending.

Research Question 2 (Pearson Chi2 Test): Is there an association between whether a discount was applied to a purchase and the satisfaction level of the customer?

Variables Involved: Discount Applied (Binary: True or False); Satisfaction Level (Categorical: Satisfied, Neutral, Unsatisfied)

Hypotheses: Null Hypothesis (H0): There is no association between whether a discount was applied and satisfaction level. Alternative Hypothesis (H1): There is an association between whether a discount was applied and satisfaction level.

Source of the dataset: https://www.kaggle.com/datasets/uom190346a/e-commerce-customer-behavior-dataset

# Loading my e-commerce customer behavior dataset
mydata <- read.csv("~/Bootcamp/E-commerce Customer Behavior - Sheet1.csv", header = TRUE)

# Display the first few rows of the dataset
head(mydata, 10)

##    Customer.ID Gender Age          City Membership.Type Total.Spend Items.Purchased Average.Rating Discount.Applied
## 1          101 Female  29      New York            Gold     1120.20              14            4.6             TRUE
## 2          102   Male  34   Los Angeles          Silver      780.50              11            4.1            FALSE
## 3          103 Female  43       Chicago          Bronze      510.75               9            3.4             TRUE
## 4          104   Male  30 San Francisco            Gold     1480.30              19            4.7            FALSE
## 5          105   Male  27         Miami          Silver      720.40              13            4.0             TRUE
## 6          106 Female  37       Houston          Bronze      440.80               8            3.1            FALSE
## 7          107 Female  31      New York            Gold     1150.60              15            4.5             TRUE
## 8          108   Male  35   Los Angeles          Silver      800.90              12            4.2            FALSE
## 9          109 Female  41       Chicago          Bronze      495.25              10            3.6             TRUE
## 10         110   Male  28 San Francisco            Gold     1520.10              21            4.8            FALSE
##    Days.Since.Last.Purchase Satisfaction.Level
## 1                        25          Satisfied
## 2                        18            Neutral
## 3                        42        Unsatisfied
## 4                        12          Satisfied
## 5                        55        Unsatisfied
## 6                        22            Neutral
## 7                        28          Satisfied
## 8                        14            Neutral
## 9                        40        Unsatisfied
## 10                        9          Satisfied

Unit of Observation: Each entry in the dataset corresponds to a unique customer.

Sample Size: The dataset includes information for 350 customers, with entries ranging from Customer ID 101 to 450.

Definition of Variables:

Customer ID: Type: Numeric. Description: A unique identifier assigned to each customer.

Gender: Type: Categorical (Male, Female). Description: Specifies the gender of the customer.

Age: Type: Numeric. Description: Represents the age of the customer.

City: Type: Categorical (City names). Description: Indicates the city of residence for each customer.

Membership Type: Type: Categorical (Gold, Silver, Bronze). Description: Identifies the type of membership held by the customer.

Total Spend: Type: Numeric. Description: Records the total monetary expenditure by the customer on the e-commerce platform.

Items Purchased: Type: Numeric. Description: Quantifies the total number of items purchased by the customer.

Average Rating: Type: Numeric (0 to 5, with decimals). Description: Represents the average rating given by the customer for purchased items.

Discount Applied: Type: Boolean (True, False). Description: Indicates whether a discount was applied to the customer’s purchase.

Days Since Last Purchase: Type: Numeric. Description: Reflects the number of days elapsed since the customer’s most recent purchase.

Satisfaction Level: Type: Categorical (Satisfied, Neutral, Unsatisfied). Description: Captures the overall satisfaction level of the customer.

# Research question 1: treat Age == 999 and Total.Spend >= 6666 as missing (NA)
mydata$Age <- ifelse(test = mydata$Age == 999,
                     yes = NA,
                     no = as.numeric(mydata$Age))


mydata$Total.Spend <- ifelse(test = mydata$Total.Spend >= 6666,
                              yes = NA,
                              no = as.numeric(mydata$Total.Spend))

mydata <- mydata[complete.cases(mydata), ]

# Display the first few rows of the updated dataset
head(mydata)

##   Customer.ID Gender Age          City Membership.Type Total.Spend Items.Purchased Average.Rating Discount.Applied
## 1         101 Female  29      New York            Gold     1120.20              14            4.6             TRUE
## 2         102   Male  34   Los Angeles          Silver      780.50              11            4.1            FALSE
## 3         103 Female  43       Chicago          Bronze      510.75               9            3.4             TRUE
## 4         104   Male  30 San Francisco            Gold     1480.30              19            4.7            FALSE
## 5         105   Male  27         Miami          Silver      720.40              13            4.0             TRUE
## 6         106 Female  37       Houston          Bronze      440.80               8            3.1            FALSE
##   Days.Since.Last.Purchase Satisfaction.Level
## 1                       25          Satisfied
## 2                       18            Neutral
## 3                       42        Unsatisfied
## 4                       12          Satisfied
## 5                       55        Unsatisfied
## 6                       22            Neutral
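
The recoding step above turns the value 999 in Age and values of 6666 or more in Total.Spend into NA and then keeps complete rows only. A minimal sketch of the same idea on a small made-up data frame (the data frame `toy` and its values are hypothetical, not taken from the dataset) shows how to check how many values were recoded and how many rows remain:

```r
# Hypothetical toy data: 999 and 6666 play the role of missing-value codes
toy <- data.frame(Age = c(29, 999, 43, 30),
                  Total.Spend = c(1120.20, 780.50, 6666, 1480.30))

# Replace the codes with NA
toy$Age[toy$Age == 999] <- NA
toy$Total.Spend[toy$Total.Spend >= 6666] <- NA

# Count missing values per column, then keep complete rows only
missing_per_column <- colSums(is.na(toy))
toy_clean <- toy[complete.cases(toy), ]

print(missing_per_column)
print(nrow(toy_clean))
```

Comparing `nrow()` before and after the filter is a quick way to confirm how many customers were dropped.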

Here are some descriptive statistics.

library(psych)
psych::describe(mydata[ , c("Age", "Total.Spend")])

##             vars   n   mean     sd median trimmed    mad   min    max  range skew kurtosis    se
## Age            1 350  33.60   4.87   32.5   33.31   5.19  26.0   43.0   17.0 0.46    -0.79  0.26
## Total.Spend    2 350 845.38 362.06  775.2  816.78 444.71 410.8 1520.1 1109.3 0.56    -1.09 19.35

library(ggplot2)
ggplot(mydata, aes(x = Age, y = Total.Spend)) +
  geom_point()

# Distribution of customer age (bar heights are customer counts)
ggplot(mydata, aes(x = Age)) +
  geom_bar(colour = "black", fill = "pink") +
  ylab("Number of customers") +
  scale_x_continuous(breaks = 26:43)

We use the Pearson correlation coefficient.

cor(mydata$Age, mydata$Total.Spend, 
    method = "pearson")

## [1] -0.6779183

The negative sign (-) indicates a negative relationship between the customer’s age and total spending: older customers tend to spend less. The absolute value (0.68) falls between 0.3 and 0.7, so it is a semi-strong relationship.

H0: ρ = 0; H1: ρ ≠ 0 (ρ is the population correlation between Age and Total Spend)

Based on the sample data, we can reject H0 (p < 0.001) and conclude that there is a linear relationship between the customer’s age and total spending on the e-commerce platform. This answers Research question 1.

cor(mydata$Age, mydata$Total.Spend, 
    method = "pearson",
    use = "complete.obs")

## [1] -0.6779183

# cor.test() drops incomplete pairs itself, so no 'use' argument is needed
cor.test(mydata$Age, mydata$Total.Spend, 
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  mydata$Age and mydata$Total.Spend
## t = -17.203, df = 348, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7308122 -0.6169314
## sample estimates:
##        cor 
## -0.6779183
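
The t statistic and the confidence interval in the output above can be reproduced from the correlation coefficient and the sample size alone. The sketch below assumes the standard formulas, t = r * sqrt(n - 2) / sqrt(1 - r^2) for the test and Fisher's z transformation for the interval; the variable names are illustrative.

```r
r <- -0.6779183   # sample correlation reported above
n <- 350          # number of customers

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z transformation
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

The results should agree with the cor.test output to the printed precision (t = -17.203, df = 348).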

Now let’s answer Research question 2. The variables and hypotheses are described at the beginning of the report.

Let’s start by converting the variables to factors.

mydata$Discount.AppliedF <- factor(mydata$Discount.Applied,
                                  levels = c(TRUE, FALSE),
                                  labels = c("Applied", "Not Applied"))

mydata$Satisfaction.LevelF <- factor(mydata$Satisfaction.Level,
                                     levels = c("Satisfied", "Neutral", "Unsatisfied"))


mydata <- mydata[complete.cases(mydata$Discount.AppliedF), ]
mydata <- mydata[complete.cases(mydata$Satisfaction.LevelF), ]



head(mydata)

##   Customer.ID Gender Age          City Membership.Type Total.Spend Items.Purchased Average.Rating Discount.Applied
## 1         101 Female  29      New York            Gold     1120.20              14            4.6             TRUE
## 2         102   Male  34   Los Angeles          Silver      780.50              11            4.1            FALSE
## 3         103 Female  43       Chicago          Bronze      510.75               9            3.4             TRUE
## 4         104   Male  30 San Francisco            Gold     1480.30              19            4.7            FALSE
## 5         105   Male  27         Miami          Silver      720.40              13            4.0             TRUE
## 6         106 Female  37       Houston          Bronze      440.80               8            3.1            FALSE
##   Days.Since.Last.Purchase Satisfaction.Level Discount.AppliedF Satisfaction.LevelF
## 1                       25          Satisfied           Applied           Satisfied
## 2                       18            Neutral       Not Applied             Neutral
## 3                       42        Unsatisfied           Applied         Unsatisfied
## 4                       12          Satisfied       Not Applied           Satisfied
## 5                       55        Unsatisfied           Applied         Unsatisfied
## 6                       22            Neutral       Not Applied             Neutral
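
Because factor() turns every value that is not listed in levels into NA, it can help to check how many records are lost in the conversion. A small sketch with made-up values (the vector `raw` is hypothetical, not taken from the dataset):

```r
# Hypothetical raw values, including one blank entry
raw <- c("Satisfied", "Neutral", "", "Unsatisfied", "Neutral")

f <- factor(raw, levels = c("Satisfied", "Neutral", "Unsatisfied"))

# The blank string is not among the levels, so it becomes NA
n_missing <- sum(is.na(f))
counts <- table(f, useNA = "ifany")

print(n_missing)
print(counts)
```

Rows with NA in the factor are the ones removed by the complete.cases() filter.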

Let’s make a contingency table.

# Create a matrix manually; byrow = TRUE fills it row by row,
# in the same order in which the counts are listed below
mytable <- matrix(c(
  sum(mydata$Discount.AppliedF == "Not Applied" & mydata$Satisfaction.LevelF == "Unsatisfied"),
  sum(mydata$Discount.AppliedF == "Not Applied" & mydata$Satisfaction.LevelF == "Neutral"),
  sum(mydata$Discount.AppliedF == "Not Applied" & mydata$Satisfaction.LevelF == "Satisfied"),
  sum(mydata$Discount.AppliedF == "Applied" & mydata$Satisfaction.LevelF == "Unsatisfied"),
  sum(mydata$Discount.AppliedF == "Applied" & mydata$Satisfaction.LevelF == "Neutral"),
  sum(mydata$Discount.AppliedF == "Applied" & mydata$Satisfaction.LevelF == "Satisfied")
), nrow = 2, byrow = TRUE)

# Specify row and column names
colnames(mytable) <- c("Unsatisfied", "Neutral", "Satisfied")
rownames(mytable) <- c("Not Applied", "Applied")

# Display the matrix
print(mytable)

##             Unsatisfied Neutral Satisfied
## Not Applied           0     107        66
## Applied             116       0        59

The table adds up to 348, because there were two NA records in the Satisfaction Level variable. There are two variables: Discount applied (2 categories) and Satisfaction level (3 categories). The empirical frequencies are in the table above. For the hypotheses, see above.

chi_squared <- chisq.test(mytable, 
                          correct = FALSE)

chi_squared

## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 223.39, df = 2, p-value < 2.2e-16

addmargins(chi_squared$observed)

##             Unsatisfied Neutral Satisfied Sum
## Not Applied           0     107        66 173
## Applied             116       0        59 175
## Sum                 116     107       125 348

addmargins(round(chi_squared$expected, 2))

##             Unsatisfied Neutral Satisfied Sum
## Not Applied       57.67   53.19     62.14 173
## Applied           58.33   53.81     62.86 175
## Sum              116.00  107.00    125.00 348

round(chi_squared$residuals, 2)

##             Unsatisfied Neutral Satisfied
## Not Applied       -7.59    7.38      0.49
## Applied            7.55   -7.34     -0.49

Based on the sample data, we can reject the null hypothesis (p < 0.001) and conclude that there is an association between whether a discount was applied and the customer’s satisfaction level.

Let’s explain the standardized residual -7.59 (Not Applied & Unsatisfied):

There are fewer people in our sample who did not have discounts applied and were unsatisfied than we expected (0 observed against 57.67 expected; alpha = 0.1%).

Let’s explain the number -7.34 (Applied & Neutral):

There are fewer people in our sample who did have discounts applied and were neutral about their satisfaction than we expected (0 observed against 53.81 expected; alpha = 0.1%).

Now let’s analyze the proportion tables. The first one is a regular proportion table. The second and third are conditional proportion tables.

addmargins(round(prop.table(chi_squared$observed), 3))

##             Unsatisfied Neutral Satisfied   Sum
## Not Applied       0.000   0.307      0.19 0.497
## Applied           0.333   0.000      0.17 0.503
## Sum               0.333   0.307      0.36 1.000

addmargins(round(prop.table(chi_squared$observed, 1), 3), 2)

##             Unsatisfied Neutral Satisfied Sum
## Not Applied       0.000   0.618     0.382   1
## Applied           0.663   0.000     0.337   1

addmargins(round(prop.table(chi_squared$observed, 2), 3), 1)

##             Unsatisfied Neutral Satisfied
## Not Applied           0       1     0.528
## Applied               1       0     0.472
## Sum                   1       1     1.000

The conditions of the second table are: “Discount not applied” and “Discount applied”.

The conditions of the third table are: “Unsatisfied”, “Neutral”, and “Satisfied”.

Now let’s explain one example number from each table:

0.17 - Out of all people, 17% had a discount applied and were satisfied.

0.337 - Out of all people who had discounts, 33.7% were satisfied.

1 - Out of all people who were neutral about satisfaction, 100% didn’t have discounts.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cramers_v(mydata$Discount.AppliedF, mydata$Satisfaction.LevelF)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.80              | [0.71, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
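
For reference, the unadjusted Cramér's V can be computed directly from the chi-squared statistic as V = sqrt(X² / (n * (k - 1))), where k is the smaller of the number of rows and columns. The sketch below uses a made-up 2 x 3 table (the counts in `toy` are hypothetical, not the homework data); note that the output above reports the adjusted version, so small differences from a hand calculation are to be expected.

```r
# Hypothetical 2 x 3 table of counts (not the homework data)
toy <- matrix(c(10, 20, 30,
                30, 20, 10), nrow = 2, byrow = TRUE)

chi <- chisq.test(toy, correct = FALSE)
n <- sum(toy)                     # total number of observations
k <- min(nrow(toy), ncol(toy))    # smaller table dimension

# Unadjusted Cramer's V
v <- sqrt(unname(chi$statistic) / (n * (k - 1)))
print(round(v, 3))
```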

interpret_cramers_v(0.80)

## [1] "very large"
## (Rules: funder2019)

My answer to research question 2:

We performed the Pearson Chi2 test and, based on the sample data, we can reject the null hypothesis (p < 0.001) and conclude that there is an association between whether a discount was applied and the customer’s satisfaction level. The proportion tables and standardized residuals further support this association.

Based on the sample data, the effect size is very large (Cramér's V = 0.80).
