Part 1

mydata <- read.table("./HW_1.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".")
head(mydata)
##       Group Customer_Segment Sales_Before Sales_After
## 1   Control       High Value     240.5484    300.0076
## 2 Treatment       High Value     246.8621    381.3376
## 3   Control       High Value     156.9781    179.3305
## 4   Control     Medium Value     192.1267    229.2780
## 5   Control       High Value     229.6856    270.1677
## 6 Treatment        Low Value     135.5730    218.5600
##   Customer_Satisfaction_Before Customer_Satisfaction_After Purchase_Made
## 1                     74.68477                    74.09366            No
## 2                    100.00000                   100.00000           Yes
## 3                     98.78073                   100.00000            No
## 4                     49.33377                    39.81184           Yes
## 5                     83.97485                    87.73859           Yes
## 6                     58.07534                    69.40492            No
set.seed(123)
mydata <- mydata[sample(nrow(mydata), 500), ]

I have reduced the data set to a random sample of 500 observations so that the tests below can be performed; set.seed() makes the sample reproducible.

This data set contains information on sales and customer satisfaction before and after an intervention, as well as purchase behaviour, for a control and a treatment group.

The unit of observation is a customer transaction or purchase. The variables are:

- Group: whether the observation belongs to the Control or the Treatment group
- Customer_Segment: the customer's value segment (Low Value, Medium Value, or High Value)
- Sales_Before and Sales_After: sales before and after the intervention
- Customer_Satisfaction_Before and Customer_Satisfaction_After: customer satisfaction scores before and after the intervention
- Purchase_Made: whether a purchase was made (Yes or No)
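To check how R has parsed each column, str() gives a quick overview (a minimal sketch; the categorical variables Group, Customer_Segment, and Purchase_Made could also be converted to factors if needed):

str(mydata)   # type and preview of every variable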

Data set source: https://www.kaggle.com/datasets/matinmahmoudi/sales-and-satisfaction

Let us start with some descriptive statistics.

library(psych)
summary(mydata$Sales_Before)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.34  166.36  203.74  202.30  238.15  380.45

A few of these estimates of parameters: the minimum and maximum (36.34 and 380.45) give the range of sales before the intervention; the median (203.74) is the value below which half of the observations fall; the mean (202.30) is the arithmetic average; and the first and third quartiles (166.36 and 238.15) enclose the middle 50% of the observations. The mean and the median are very close, which suggests a fairly symmetric distribution.
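These estimates can also be reproduced individually with base R (a minimal cross-check):

mean(mydata$Sales_Before)                            # arithmetic mean
median(mydata$Sales_Before)                          # middle value
quantile(mydata$Sales_Before, probs = c(0.25, 0.75)) # first and third quartiles
sd(mydata$Sales_Before)                              # standard deviation, not shown by summary()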

Let us start with the first research question.

Research question 1: I am interested in whether there is a difference in sales before and after the intervention.

My hypotheses are:

H0: the mean difference between sales after and sales before the intervention is zero.
H1: the mean difference between sales after and sales before the intervention is different from zero.

library(pastecs)
round(stat.desc(mydata[ , c(3,4)]), 2)
##              Sales_Before Sales_After
## nbr.val            500.00      500.00
## nbr.null             0.00        0.00
## nbr.na               0.00        0.00
## min                 36.34       46.64
## max                380.45      587.44
## range              344.11      540.80
## sum             101148.36   139380.90
## median             203.74      271.86
## mean               202.30      278.76
## SE.mean              2.40        3.73
## CI.mean.0.95         4.71        7.33
## var               2878.68     6954.08
## std.dev             53.65       83.39
## coef.var             0.27        0.30
mydata$Difference <- mydata$Sales_After - mydata$Sales_Before

I have created a new variable, Difference, which is the difference between sales after the intervention and sales before the intervention.
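A quick look at the distribution of the new variable (a minimal sketch):

summary(mydata$Difference)   # five-number summary and mean of the after-minus-before differences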

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x = Difference)) +
  geom_histogram(position = "identity", binwidth = 25, colour = "red") +
  ylab("Frequency") +
  xlab("Difference")

Observing the histogram, the differences do not seem to be normally distributed. To confirm this formally, we perform the Shapiro-Wilk normality test.
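In addition to the formal test, a normal Q-Q plot provides a complementary visual check (a minimal base-R sketch); points straying from the reference line indicate departures from normality:

qqnorm(mydata$Difference)   # sample quantiles against theoretical normal quantiles
qqline(mydata$Difference)   # reference line through the first and third quartiles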

The hypotheses are:

H0: the differences are normally distributed.
H1: the differences are not normally distributed.

shapiro.test(mydata$Difference)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Difference
## W = 0.92626, p-value = 5.866e-15

H0 can be rejected at p < 0.001. With the help of the statistical test, I have formally shown that the differences are not normally distributed.

t.test(mydata$Sales_Before, mydata$Sales_After,
       paired = TRUE,
       alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  mydata$Sales_Before and mydata$Sales_After
## t = -39.659, df = 499, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -80.25323 -72.67692
## sample estimates:
## mean difference 
##       -76.46508

I have performed a paired samples t-test, whose assumptions are the following: the observations are paired and the pairs are independent of one another, the variable is measured on an interval or ratio scale, and the differences are approximately normally distributed. The last assumption is violated here, but with n = 500 the central limit theorem makes the t-test reasonably robust; the non-parametric alternative below serves as an additional check.

Since the p-value < 0.001, we can reject the H0, which says that the average sales before the intervention are the same as the average sales after the intervention, i.e., that the mean difference is zero. The mean difference is about -76.5, so sales after the intervention are on average about 76.5 higher than before.

Let us also perform the non-parametric alternative to the paired t-test: the Wilcoxon signed rank test.

My hypotheses are the following:

H0: the median of the differences is zero (the distribution of the differences is symmetric around zero).
H1: the median of the differences is different from zero.

wilcox.test(mydata$Sales_Before, mydata$Sales_After,
            paired = TRUE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon signed rank test
## 
## data:  mydata$Sales_Before and mydata$Sales_After
## V = 0, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Again, the p-value < 0.001, and the H0 can be rejected.

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
cohens_d(mydata$Difference)
## Cohen's d |       95% CI
## ------------------------
## 1.77      | [1.63, 1.91]
interpret_cohens_d(1.77, rules = "sawilowsky2009")
## [1] "very large"
## (Rules: sawilowsky2009)

Cohen's d of 1.77 means that the average difference in sales amounts to almost 1.8 standard deviations of the differences; the effect of the intervention on sales is very large.
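As a companion to the Wilcoxon signed rank test, the matching non-parametric effect size is the rank-biserial correlation; a minimal sketch with the effectsize package already loaded above:

rank_biserial(mydata$Sales_Before, mydata$Sales_After, paired = TRUE)   # non-parametric effect size for paired data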

Part 2

head(mydata)
##          Group Customer_Segment Sales_Before Sales_After
## 2463   Control     Medium Value     141.1834    178.6317
## 2511 Treatment     Medium Value     182.3470    292.2028
## 8718 Treatment       High Value     380.4529    587.4384
## 2986 Treatment     Medium Value     248.8122    384.3708
## 1842 Treatment        Low Value     177.7405    273.7634
## 9334   Control       High Value     246.6162    298.9664
##      Customer_Satisfaction_Before Customer_Satisfaction_After Purchase_Made
## 2463                     57.47312                    62.91412            No
## 2511                     53.03289                    35.49490            No
## 8718                     82.97648                    76.14568            No
## 2986                     57.98888                    38.54076           Yes
## 1842                     68.14757                    75.70201           Yes
## 9334                     93.00714                    86.44396            No
##      Difference
## 2463   37.44828
## 2511  109.85582
## 8718  206.98546
## 2986  135.55857
## 1842   96.02289
## 9334   52.35014
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(mydata[, c(5,6)], smooth = FALSE)

Research question 2: I would like to test whether there is a correlation between customer satisfaction before the intervention and customer satisfaction after the intervention.

My hypotheses are the following:

H0: there is no linear relationship between customer satisfaction before and after the intervention (the population correlation coefficient is zero).
H1: there is a linear relationship between customer satisfaction before and after the intervention (the population correlation coefficient is different from zero).

I will test these hypotheses with the help of the Pearson correlation coefficient.

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
cor.test(mydata$Customer_Satisfaction_Before, mydata$Customer_Satisfaction_After,
         method = "pearson")  # note: cor.test() has no 'use' argument; that option belongs to cor()
## 
##  Pearson's product-moment correlation
## 
## data:  mydata$Customer_Satisfaction_Before and mydata$Customer_Satisfaction_After
## t = 35.651, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8209649 0.8706121
## sample estimates:
##       cor 
## 0.8476336

H0 is rejected at p < 0.001.

cor(mydata$Customer_Satisfaction_Before, mydata$Customer_Satisfaction_After,
    method = "pearson",
    use = "complete.obs")
## [1] 0.8476336

The linear relationship between customer satisfaction before the intervention and customer satisfaction after the intervention is positive and strong (r = 0.85).
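As a robustness check, the rank-based Spearman correlation can be computed as well (a minimal sketch); it requires neither normality nor a strictly linear relationship:

cor.test(mydata$Customer_Satisfaction_Before, mydata$Customer_Satisfaction_After,
         method = "spearman",
         exact = FALSE)   # approximate p-value, since an exact one is not available with ties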

Part 3

head(mydata)
##          Group Customer_Segment Sales_Before Sales_After
## 2463   Control     Medium Value     141.1834    178.6317
## 2511 Treatment     Medium Value     182.3470    292.2028
## 8718 Treatment       High Value     380.4529    587.4384
## 2986 Treatment     Medium Value     248.8122    384.3708
## 1842 Treatment        Low Value     177.7405    273.7634
## 9334   Control       High Value     246.6162    298.9664
##      Customer_Satisfaction_Before Customer_Satisfaction_After Purchase_Made
## 2463                     57.47312                    62.91412            No
## 2511                     53.03289                    35.49490            No
## 8718                     82.97648                    76.14568            No
## 2986                     57.98888                    38.54076           Yes
## 1842                     68.14757                    75.70201           Yes
## 9334                     93.00714                    86.44396            No
##      Difference
## 2463   37.44828
## 2511  109.85582
## 8718  206.98546
## 2986  135.55857
## 1842   96.02289
## 9334   52.35014

Research question 3: I would like to test whether there is an association between the customer segment (whether customers are low, medium, or high value) and whether they have made a purchase.

My hypotheses are:

H0: customer segment and making a purchase are independent (there is no association between them).
H1: customer segment and making a purchase are not independent (there is an association between them).

results <- chisq.test(mydata$Customer_Segment, mydata$Purchase_Made)
results
## 
##  Pearson's Chi-squared test
## 
## data:  mydata$Customer_Segment and mydata$Purchase_Made
## X-squared = 0.66079, df = 2, p-value = 0.7186

As the p-value > 0.05 (p-value = 0.7186), we cannot reject the H0.

addmargins(results$observed)
##                        mydata$Purchase_Made
## mydata$Customer_Segment  No Yes Sum
##            High Value    83  70 153
##            Low Value     92  77 169
##            Medium Value  90  88 178
##            Sum          265 235 500

This table shows observed (empirical) frequencies. Out of 153 high-value customers, 83 have not made a purchase and 70 have. Out of 169 low-value customers, 92 have not made a purchase and 77 have. Out of 178 medium-value customers, 90 have not made a purchase and 88 have.

addmargins(round(results$expected, 2))
##                        mydata$Purchase_Made
## mydata$Customer_Segment     No    Yes Sum
##            High Value    81.09  71.91 153
##            Low Value     89.57  79.43 169
##            Medium Value  94.34  83.66 178
##            Sum          265.00 235.00 500

This table shows the expected (theoretical) frequencies, i.e., the frequencies we would expect if there were no association between customer segment and making a purchase: 81.09 high-value customers not making a purchase and 71.91 making one, 89.57 and 79.43 for low-value customers, and 94.34 and 83.66 for medium-value customers.

round(results$residuals, 2)
##                        mydata$Purchase_Made
## mydata$Customer_Segment    No   Yes
##            High Value    0.21 -0.23
##            Low Value     0.26 -0.27
##            Medium Value -0.45  0.47

The table above shows the Pearson residuals (results$residuals; the adjusted standardized residuals are stored in results$stdres). As the p-value of the Pearson's Chi-squared test is higher than 0.05, it is to be expected that no individual cell is significant either: none of the residuals reaches 1.96 in absolute terms, so no combination deviates significantly from its expected frequency.
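The adjusted standardized residuals, which are the ones usually compared against the ±1.96 cutoff, can be obtained directly (a minimal sketch):

round(results$stdres, 2)   # adjusted standardized residuals from chisq.test()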

library(effectsize)
effectsize::cramers_v(mydata$Customer_Segment, mydata$Purchase_Made)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)

The effect size is tiny. This means that there is no real difference, or only a tiny one, between the expected (theoretical) and observed (empirical) frequencies.
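The same conclusion is visible in the raw purchase rates, which are nearly identical across the three segments (a minimal sketch):

round(prop.table(results$observed, margin = 1), 2)   # row proportions: share of No/Yes within each segment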