Scenario 2: Human vs. AI Service

A customer service firm wants to test whether customer satisfaction scores differ between customers served by human agents and customers served by an AI chatbot. After each interaction, customers rate their satisfaction on a single-item rating scale. Is there a difference in the average satisfaction scores of the two groups?
QUESTION

What are the null and alternate hypotheses for YOUR research scenario?

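H0: There is no difference in satisfaction scores between customers served by human agents and customers served by the AI chatbot.

H1: There is a difference in satisfaction scores between customers served by human agents and customers served by the AI chatbot.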

# INSTALL REQUIRED PACKAGE
#install.packages("readxl")
# LOAD THE PACKAGE
library(readxl)
# IMPORT EXCEL FILE INTO R STUDIO
dataset <- read_excel("C:\\Users\\DELL\\Downloads\\A6R2.xlsx")
DESCRIPTIVE STATISTICS

PURPOSE: Calculate the mean, median, SD, and sample size for each group.

#INSTALL REQUIRED PACKAGE
#install.packages("dplyr")
# LOAD THE PACKAGE
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# CALCULATE THE DESCRIPTIVE STATISTICS
dataset %>%
  group_by(ServiceType) %>%
  summarise(
    Mean = mean(SatisfactionScore, na.rm = TRUE),
    Median = median(SatisfactionScore, na.rm = TRUE),
    SD = sd(SatisfactionScore, na.rm = TRUE),
    N = n()
  )
## # A tibble: 2 × 5
##   ServiceType  Mean Median    SD     N
##   <chr>       <dbl>  <dbl> <dbl> <int>
## 1 AI           3.6       3  1.60   100
## 2 Human        7.42      8  1.44   100
HISTOGRAMS

Purpose: Visually check the normality of the scores for each group.

#CREATE THE HISTOGRAMS 
hist(dataset$SatisfactionScore[dataset$ServiceType == "Human"],
     main = "Histogram of Human Scores",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 20)

hist(dataset$SatisfactionScore[dataset$ServiceType == "AI"],
     main = "Histogram of AI Scores",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black",
     breaks = 20)

QUESTIONS

Q1) Check the SKEWNESS of the VARIABLE 1 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?

A) Negatively skewed

Q2) Check the KURTOSIS of the VARIABLE 1 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?

A) The histogram does not have a proper bell shaped curve

Q3) Check the SKEWNESS of the VARIABLE 2 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?

A) Positively skewed

Q4) Check the KURTOSIS of the VARIABLE 2 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?

A) The histogram does not have a proper bell shaped curve
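The visual answers above can be cross-checked numerically. The following is a minimal sketch using base R only; the functions `skewness()` and `excess_kurtosis()` are hypothetical helpers written for this illustration (simple moment-based formulas), and the two score vectors are made-up examples, not the assignment data.

```r
# Moment-based sample skewness: negative = longer left tail,
# positive = longer right tail, near 0 = roughly symmetrical.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: negative = flatter than a normal curve,
# positive = taller/heavier-tailed, near 0 = close to a bell curve.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Made-up illustrative scores (NOT the assignment data)
left_tailed  <- c(2, 5, 7, 8, 8, 8, 9, 9, 9, 10)
right_tailed <- c(1, 2, 2, 2, 3, 3, 3, 4, 6, 9)

skewness(left_tailed)    # negative value
skewness(right_tailed)   # positive value
```

With the assignment data, the same helpers could be applied to `dataset$SatisfactionScore[dataset$ServiceType == "Human"]` and to the AI subset to see whether the signs agree with the histogram readings.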

SHAPIRO-WILK TEST

Purpose: Check the normality for each group’s score statistically. The Shapiro-Wilk Test is a test that checks skewness and kurtosis at the same time. The test is checking “Is this variable the SAME as normal data (null hypothesis) or DIFFERENT from normal data (alternate hypothesis)?” For this test, if p is GREATER than .05 (p > .05), the data is NORMAL. If p is LESS than .05 (p < .05), the data is NOT normal.

# CONDUCT THE SHAPIRO-WILK TEST
shapiro.test(dataset$SatisfactionScore[dataset$ServiceType == "Human"])
## 
##  Shapiro-Wilk normality test
## 
## data:  dataset$SatisfactionScore[dataset$ServiceType == "Human"]
## W = 0.93741, p-value = 0.0001344
shapiro.test(dataset$SatisfactionScore[dataset$ServiceType == "AI"])
## 
##  Shapiro-Wilk normality test
## 
## data:  dataset$SatisfactionScore[dataset$ServiceType == "AI"]
## W = 0.91143, p-value = 5.083e-06
QUESTION

Q1) Was the data normally distributed for Variable 1?

No, the data is not normally distributed (p < .001).

Q2) Was the data normally distributed for Variable 2?

No, the data is not normally distributed (p < .001).

If p > 0.05 (P-value is GREATER than .05) this means the data is NORMAL. Continue to the box-plot test below. If p < 0.05 (P-value is LESS than .05) this means the data is NOT normal (switch to Mann-Whitney U).
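The decision rule above can be written out as a small sketch. `choose_path()` is a hypothetical helper introduced only for illustration; the p-values passed to it are the ones reported in the Shapiro-Wilk output above.

```r
# Hypothetical helper: applies the rule "all groups normal -> stay on the
# t-test path (boxplot check next), otherwise -> Mann-Whitney U".
choose_path <- function(p_values, alpha = 0.05) {
  if (all(p_values > alpha)) {
    "normal: continue to boxplot check"
  } else {
    "not normal: switch to Mann-Whitney U"
  }
}

# Shapiro-Wilk p-values reported above for each group
reported_p <- c(Human = 0.0001344, AI = 5.083e-06)

choose_path(reported_p)   # "not normal: switch to Mann-Whitney U"
```

Both reported p-values are below .05, so the rule points to the Mann-Whitney U test for this scenario.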

BOXPLOT

Purpose: Check for any outliers impacting the mean for each group’s scores.

# INSTALL REQUIRED PACKAGE
# install.packages("ggplot2")
# install.packages("ggpubr")

# LOAD THE PACKAGE
library(ggplot2)
library(ggpubr)

#CREATE THE BOXPLOT
ggboxplot(dataset, x = "ServiceType", y = "SatisfactionScore",
          color = "ServiceType",
          palette = "jco",
          add = "jitter")

QUESTION

Q1) Were there any dots outside of the boxplot? Are these dots close to the whiskers of the boxplot or are they very far away?

For the Human scores, the boxplot shows many dots far away from the whiskers, while the AI scores have fewer dots outside the whiskers. Because of these outliers and the non-normal distributions, the analysis switches to the Mann-Whitney U test.
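Because `add = "jitter"` draws every observation as a dot, it can be easier to count outliers numerically. A minimal sketch, assuming the usual 1.5 x IQR whisker rule: `count_outliers()` is a hypothetical helper built on base R's `boxplot.stats()`, and the example vector is made up. Note that `boxplot.stats()` uses hinges, so its cut-offs may differ slightly from the quartiles ggplot2 uses for drawing.

```r
# Hypothetical helper: number of points beyond the boxplot whiskers
# (more than coef * IQR away from the hinges).
count_outliers <- function(x, coef = 1.5) {
  length(boxplot.stats(x[!is.na(x)], coef = coef)$out)
}

# Made-up illustrative scores (NOT the assignment data)
example_scores <- c(5, 6, 7, 7, 8, 8, 8, 9, 9, 1)

boxplot.stats(example_scores)$out   # 1
count_outliers(example_scores)      # 1
```

For the assignment data, `tapply(dataset$SatisfactionScore, dataset$ServiceType, count_outliers)` would give one count per group.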

MANN-WHITNEY U TEST

PURPOSE: Test if there was a difference between the distributions of the two groups.

wilcox.test(SatisfactionScore ~ ServiceType, data = dataset, exact = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  SatisfactionScore by ServiceType
## W = 497, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
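The W statistic printed by `wilcox.test()` is the Mann-Whitney U for the first group, which with the formula interface is the first factor level ("AI" sorts before "Human"). The sketch below reproduces W by hand from ranks on two small made-up vectors (not the assignment data) to show where the number comes from.

```r
# Made-up illustrative data (NOT the assignment data)
a <- c(1, 2, 2, 4)        # first group
b <- c(3, 5, 6, 6, 7)     # second group

# Rank all observations together; ties receive the average rank
ranks <- rank(c(a, b))

# W = (sum of ranks in the first group) - n1 * (n1 + 1) / 2
n1 <- length(a)
W_manual <- sum(ranks[seq_along(a)]) - n1 * (n1 + 1) / 2

# Same statistic as reported by wilcox.test()
W_r <- unname(wilcox.test(a, b, exact = FALSE)$statistic)

W_manual   # 1
W_r        # 1
```

A small W means the first group's values rarely exceed the second group's. In the output above, W = 497 out of a possible 100 x 100 = 10,000 pairs, which is consistent with AI scores being lower than Human scores in most pairs.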
DETERMINE STATISTICAL SIGNIFICANCE

If results were statistically significant (p < .05), continue to effect size section below. If results were NOT statistically significant (p > .05), skip to reporting section below.

NOTE: The Mann-Whitney U test is used when your data is not normally distributed or when the assumptions of the t-test are not met. It is not chosen based on whether the t-test was significant.

EFFECT-SIZE

PURPOSE: Determine how big of a difference there was between the group distributions.

# INSTALL REQUIRED PACKAGE
# install.packages("effectsize")
# LOAD THE PACKAGE
library(effectsize)
# CALCULATE EFFECT SIZE (R VALUE)
rank_biserial(SatisfactionScore ~ ServiceType, data = dataset, exact = FALSE)
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.90             | [-0.93, -0.87]
QUESTIONS

Q1) What is the size of the effect?

Large

Q2) Which group had the higher average rank?

The Human service group had the higher average rank.
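The rank-biserial value and its sign can be checked by hand from the test statistic. This is a sketch using the standard relationship r = 2U / (n1 x n2) - 1, with U taken as the W reported above for the first group (AI) and the group sizes from the descriptive statistics; it is an arithmetic check, not a description of how `rank_biserial()` computes the value internally.

```r
# Values reported earlier in this analysis
U  <- 497    # W from wilcox.test(), first group = AI
n1 <- 100    # AI
n2 <- 100    # Human

# Rank-biserial correlation from U
r_rb <- 2 * U / (n1 * n2) - 1
round(r_rb, 2)    # -0.9

# Share of AI-vs-Human pairs in which the AI score is higher
# (ties counted as half)
U / (n1 * n2)     # 0.0497
```

The negative sign reflects that the first group (AI) tends to have the lower scores, which agrees with the Human group having the higher average rank.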

REPORT FOR MANN-WHITNEY U TEST

A Mann-Whitney U test was conducted to compare satisfaction scores between customers served by human agents (n = 100) and customers served by an AI chatbot (n = 100). Customers served by human agents had a higher median satisfaction score (Mdn = 8) than customers served by the AI chatbot (Mdn = 3), U = 497, p < .001. The effect size was large (r = -0.90), indicating a meaningful difference between the two groups. Overall, satisfaction was higher for human customer service than for the AI chatbot.