Scenario 2: Human vs. AI Service
A customer service firm wants to test whether customer satisfaction scores differ between customers served by human agents and those served by an AI chatbot. After each interaction, customers rate their satisfaction on a single-item rating scale. Is there a difference in the average satisfaction scores of the two groups?
HYPOTHESIS
Null Hypothesis (H₀): There is no difference in average customer satisfaction scores between customers served by Human agents and those served by an AI chatbot.
Alternative Hypothesis (H₁): There is a difference in average customer satisfaction scores between customers served by Human agents and those served by an AI chatbot.
options(repos=c(CRAN="https://cloud.r-project.org"))
# Install readxl only if it is missing; reinstalling a package that is already
# in use triggers "Permission denied" warnings on Windows.
if (!requireNamespace("readxl", quietly = TRUE)) install.packages("readxl")
library(readxl)
A6R2 <- read_excel("C:\\Users\\sweth\\Downloads\\A6R2.xlsx")
# Install dplyr only if it is missing; reinstalling a package that is already
# in use triggers "Permission denied" warnings on Windows.
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
A6R2 %>%
  group_by(ServiceType) %>%
  summarise(
    Mean = mean(SatisfactionScore, na.rm = TRUE),
    Median = median(SatisfactionScore, na.rm = TRUE),
    SD = sd(SatisfactionScore, na.rm = TRUE),
    N = n()
  )
## # A tibble: 2 × 5
##   ServiceType  Mean Median    SD     N
##   <chr>       <dbl>  <dbl> <dbl> <int>
## 1 AI           3.6       3  1.60   100
## 2 Human        7.42      8  1.44   100
hist(A6R2$SatisfactionScore[A6R2$ServiceType == "Human"],
     main = "Histogram of Human SatisfactionScore",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 20)
hist(A6R2$SatisfactionScore[A6R2$ServiceType == "AI"],
     main = "Histogram of AI SatisfactionScore",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black",
     breaks = 20)
From the histograms, the Human satisfaction distribution appears slightly negatively skewed (left-skewed): most scores cluster toward the higher end (7–9), with a tail extending toward lower scores. The shape is moderately peaked, indicating a distribution that is close to normal but pulled left by a few lower scores. The AI satisfaction distribution shows a clear positive skew: many scores fall at the lower end (1–3), with a tail extending toward higher values. Its shape is more peaked, suggesting leptokurtosis, with a tight cluster at low scores and heavier tails. Taken together, these patterns suggest that human-delivered service receives higher and more consistently positive ratings, whereas AI service produces more low ratings.
shapiro.test(A6R2$SatisfactionScore[A6R2$ServiceType == "Human"])
##
## Shapiro-Wilk normality test
##
## data: A6R2$SatisfactionScore[A6R2$ServiceType == "Human"]
## W = 0.93741, p-value = 0.0001344
shapiro.test(A6R2$SatisfactionScore[A6R2$ServiceType == "AI"])
##
## Shapiro-Wilk normality test
##
## data: A6R2$SatisfactionScore[A6R2$ServiceType == "AI"]
## W = 0.91143, p-value = 5.083e-06
The Shapiro-Wilk tests showed that both the Human (W = 0.937, p < .001) and AI (W = 0.911, p < .001) satisfaction distributions deviated significantly from normality. The independent t-test can nevertheless be used: each group is large (n = 100), and t-tests are well documented to be robust to violations of normality when samples are large and group sizes are balanced. In addition, the distributions, while skewed, are not severely distorted, so the independent t-test is appropriate for comparing the two service conditions.
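The robustness argument can be illustrated with a small simulation. The sketch below is not part of the original analysis: the rating probabilities in `skewed_probs`, the seed, and the number of replications are illustrative assumptions, not values estimated from A6R2. It draws two groups of n = 100 from the same right-skewed 1–10 rating distribution, so the null hypothesis is true by construction, and records how often the pooled-variance t-test rejects at the 5% level.

```r
# Illustrative simulation; every setting below is an assumption, not taken from A6R2.
set.seed(123)

n_per_group <- 100   # matches the group size reported in this study
n_reps      <- 2000  # number of simulated experiments (assumed)
scores      <- 1:10

# Hypothetical right-skewed rating distribution (probabilities sum to 1)
skewed_probs <- c(0.10, 0.25, 0.25, 0.15, 0.10, 0.06, 0.04, 0.03, 0.01, 0.01)

# Both groups come from the same distribution, so any rejection is a false positive
p_values <- replicate(n_reps, {
  g1 <- sample(scores, n_per_group, replace = TRUE, prob = skewed_probs)
  g2 <- sample(scores, n_per_group, replace = TRUE, prob = skewed_probs)
  t.test(g1, g2, var.equal = TRUE)$p.value
})

# Empirical Type I error rate at alpha = 0.05
type1_rate <- mean(p_values < 0.05)
print(type1_rate)
```

If the t-test is robust in this setting, the printed rate should fall close to the nominal 0.05 despite the skewed, discrete scores.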
library(ggplot2)
library(ggpubr)
ggboxplot(A6R2, x = "ServiceType", y = "SatisfactionScore",
          color = "ServiceType",
          palette = "jco",
          add = "jitter")
These boxplots indicate that Human satisfaction scores tend to cluster between 7 and 9, with only a few mild outliers falling below 5. In contrast, AI satisfaction scores cluster much lower, between 2 and 4, with a small number of mild outliers near 6 to 7. None of these outliers is extreme relative to the rest of its distribution, and none appears severe enough to meaningfully distort the group means. Based on these boxplots, the independent t-test remains appropriate for comparing satisfaction scores across the Human and AI service conditions.
INDEPENDENT T-TEST
t.test(SatisfactionScore ~ ServiceType, data = A6R2, var.equal = TRUE)
##
## Two Sample t-test
##
## data: SatisfactionScore by ServiceType
## t = -17.792, df = 198, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group AI and group Human is not equal to 0
## 95 percent confidence interval:
## -4.243396 -3.396604
## sample estimates:
## mean in group AI mean in group Human
## 3.60 7.42
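As a cross-check on the output above, the pooled-variance t statistic can be recomputed by hand from the group means, standard deviations, and sample sizes in the descriptive table. This is a minimal sketch, not the author's procedure; the variable names are made up for illustration, and because the inputs are the rounded summary values, the results will only approximately match the `t.test()` output.

```r
# Summary statistics copied from the descriptive table (rounded values)
m_ai <- 3.60; sd_ai <- 1.60; n_ai <- 100
m_hu <- 7.42; sd_hu <- 1.44; n_hu <- 100

# Pooled variance and standard error of the mean difference (AI minus Human)
df_pooled  <- n_ai + n_hu - 2
pooled_var <- ((n_ai - 1) * sd_ai^2 + (n_hu - 1) * sd_hu^2) / df_pooled
se_diff    <- sqrt(pooled_var * (1 / n_ai + 1 / n_hu))

# t statistic, two-sided p-value, and 95% confidence interval
t_stat  <- (m_ai - m_hu) / se_diff
p_value <- 2 * pt(-abs(t_stat), df_pooled)
ci      <- (m_ai - m_hu) + c(-1, 1) * qt(0.975, df_pooled) * se_diff

print(c(t = t_stat, df = df_pooled, p = p_value))
print(ci)
```

The hand-computed values should land close to the reported t = -17.792 and the interval [-4.24, -3.40]; small discrepancies come from using SDs rounded to two decimals.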
# Install effectsize only if it is missing, so the report does not reinstall it on every run.
if (!requireNamespace("effectsize", quietly = TRUE)) install.packages("effectsize")
library(effectsize)
cohens_d_result <- cohens_d(SatisfactionScore ~ ServiceType, data = A6R2, pooled_sd = TRUE)
print(cohens_d_result)
## Cohen's d | 95% CI
## --------------------------
## -2.52 | [-2.89, -2.14]
##
## - Estimated using pooled SD.
The effect size was Cohen’s d = –2.52, with a 95% confidence interval ranging from –2.89 to –2.14. In absolute terms this is an extremely large effect, far exceeding the conventional benchmarks for small (0.20), medium (0.50), and large (0.80) effects. The negative sign reflects the direction of the comparison (AI minus Human): customers who interacted with the AI chatbot reported much lower satisfaction scores than those who interacted with human service agents. Because an absolute d value greater than 2 is exceptionally rare in behavioral and social science research, this finding suggests a very strong and practically meaningful difference between the two service types.
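For readers who want to see where this value comes from, Cohen's d with a pooled standard deviation is the mean difference divided by the pooled SD. The sketch below recomputes it from the rounded summary statistics in the descriptive table; `cohens_d_from_summary` is a hypothetical helper written for illustration and is not a function from the effectsize package.

```r
# Hypothetical helper (not part of effectsize): Cohen's d from summary statistics,
# using the pooled standard deviation of the two groups.
cohens_d_from_summary <- function(m1, sd1, n1, m2, sd2, n2) {
  pooled_sd <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  (m1 - m2) / pooled_sd
}

# AI group first, Human group second, matching the direction used by cohens_d() above
d_ai_vs_human <- cohens_d_from_summary(
  m1 = 3.60, sd1 = 1.60, n1 = 100,
  m2 = 7.42, sd2 = 1.44, n2 = 100
)
print(round(d_ai_vs_human, 2))
```

The result should be roughly -2.5, in line with the reported d = -2.52; the small gap is expected because the inputs are rounded.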
INDEPENDENT T-TEST
An independent-samples t-test was conducted to compare customer satisfaction scores between customers who were served by human agents (n = 100) and customers who were served by an AI chatbot (n = 100). Customers served by human agents reported significantly higher satisfaction scores (M = 7.42, SD = 1.44) than customers served by the AI chatbot (M = 3.60, SD = 1.60), t(198) = –17.79, p < .001, with a 95% confidence interval for the mean difference (AI minus Human) of [–4.24, –3.40]. The effect size was extremely large (d = –2.52, 95% CI [–2.89, –2.14]), indicating a very substantial difference between the two service types. The null hypothesis of no difference is therefore rejected. Overall, customers were far more satisfied when interacting with human agents than with the AI chatbot.