Scenario 2: Human vs. AI Service

A customer service firm wants to test whether customer satisfaction scores differ between those served by human agents versus those served by an AI chatbot. After interactions, customers rate their satisfaction (single-item rating scale). Is there a difference in the average satisfaction scores of the two groups?

H0: There is no difference in the average customer satisfaction scores between customers served by human agents and those served by an AI chatbot.

H1: There is a difference in the average customer satisfaction scores between customers served by human agents and those served by an AI chatbot.

Descriptive statistics & Normality check

# QUESTION
# What are the null and alternate hypotheses for YOUR research scenario?
# H0: There is no difference in the average customer satisfaction scores between customers served by human agents and those served by an AI chatbot.
# H1: There is a difference in the average customer satisfaction scores between customers served by human agents and those served by an AI chatbot.




# install.packages("readxl")


library(readxl)

A6R2 <- read_excel("C:/Users/sahit/Downloads/A6R2.xlsx")



# install.packages("dplyr")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

A6R2 %>%
  group_by(ServiceType) %>%
  summarise(
    Mean = mean(SatisfactionScore, na.rm = TRUE),
    Median = median(SatisfactionScore, na.rm = TRUE),
    SD = sd(SatisfactionScore, na.rm = TRUE),
    N = n()
  )

## # A tibble: 2 × 5
##   ServiceType  Mean Median    SD     N
##   <chr>       <dbl>  <dbl> <dbl> <int>
## 1 AI           3.6       3  1.60   100
## 2 Human        7.42      8  1.44   100

hist(A6R2$SatisfactionScore[A6R2$ServiceType == "Human"],
     main = "Histogram of Human Scores",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 20)

hist(A6R2$SatisfactionScore[A6R2$ServiceType == "AI"],
     main = "Histogram of AI Agent Scores",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black",
     breaks = 20)

# QUESTIONS
# Answer the questions below as comments within the R script:

# Q1) Check the SKEWNESS of the VARIABLE 1 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# The histogram for Human Agent Scores (Variable 1) appears negatively skewed (or skewed left) because the bulk of the data is on the right side (high scores), and the tail extends toward the lower (negative) scores.
# Q2) Check the KURTOSIS of the VARIABLE 1 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
# The distribution has a clear central peak (around 8), but the overall shape is not a proper bell curve. It looks slightly too tall in the center compared to a normal distribution.
# Q3) Check the SKEWNESS of the VARIABLE 2 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# The histogram for AI Chatbot Scores (Variable 2) appears positively skewed (or skewed right) because the bulk of the data is on the left side (low scores), and the tail extends toward the higher (positive) scores.
# Q4) Check the KURTOSIS of the VARIABLE 2 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
# The distribution has high frequencies at the low scores (1 and 2), which is characteristic of the skewness. It is not a proper bell curve.
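
The visual judgments about skewness and kurtosis can be cross-checked numerically. The block below is a minimal sketch and not part of the original assignment: `sample_skewness` and `excess_kurtosis` are hypothetical helper names, the formulas are the simple moment-based versions, and `demo_scores` is made-up data rather than values from A6R2.

```r
# Moment-based skewness (hypothetical helper).
# Negative values suggest a tail toward low scores, positive values a tail toward high scores.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Moment-based excess kurtosis (hypothetical helper).
# Values near 0 resemble a normal curve; positive is more peaked, negative is flatter.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^4) / mean(centered^2)^2 - 3
}

# Made-up scores with most values high and one low value (a left tail).
demo_scores <- c(1, 7, 8, 8, 9, 9, 9)
skew_demo <- sample_skewness(demo_scores)
kurt_demo <- excess_kurtosis(demo_scores)
```

With the real data, the same helpers could be applied per group, for example `tapply(A6R2$SatisfactionScore, A6R2$ServiceType, sample_skewness)`.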



shapiro.test(A6R2$SatisfactionScore[A6R2$ServiceType == "Human"])

## 
##  Shapiro-Wilk normality test
## 
## data:  A6R2$SatisfactionScore[A6R2$ServiceType == "Human"]
## W = 0.93741, p-value = 0.0001344

shapiro.test(A6R2$SatisfactionScore[A6R2$ServiceType == "AI"])

## 
##  Shapiro-Wilk normality test
## 
## data:  A6R2$SatisfactionScore[A6R2$ServiceType == "AI"]
## W = 0.91143, p-value = 5.083e-06

# QUESTION
# Answer the questions below as a comment within the R script:
# Was the data normally distributed for Variable 1?
# No. The p-value (0.0001344) is less than 0.05, indicating a significant violation of the normality assumption.
# Was the data normally distributed for Variable 2?
# No. The p-value (5.083e-06) is less than 0.05, indicating a significant violation of the normality assumption.



# install.packages("ggplot2")
# install.packages("ggpubr")


library(ggplot2)
library(ggpubr)



ggboxplot(A6R2, x = "ServiceType", y = "SatisfactionScore",
          color = "ServiceType",
          palette = "jco",
          add = "jitter")

# QUESTION
# Answer the questions below as a comment within the R script. Answer the questions for EACH boxplot:
# Q1) Were there any dots outside of the boxplot? Are these dots close to the whiskers of the boxplot (check if there are any dots past the lines on the boxes) or are they very far away?
# If there are no dots, continue with Independent t-test.
# If there are a few dots (two or less), and they are close to the whiskers, continue with the Independent t-test.
# If there are a few dots (two or less), and they are far away from the whiskers, consider switching to Mann Whitney U test.
# If there are many dots (more than one or two) and they are very far away from the whiskers, you should switch to the Mann Whitney U test.

# For the Human Agent group (left boxplot): Yes, there are a few dots (outliers) outside of the lower whisker (below the boxplot). These dots are relatively close to the whisker. (Approximately 3-4 dots visible.)
# For the AI Chatbot group (right boxplot): No, there are no dots outside of the whiskers for this group.
# Based on the Shapiro-Wilk test (p < 0.05 for both groups), which strongly indicates non-normality, we must switch to the Mann-Whitney U test, regardless of the minor outlier presence.
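
Counting the dots by eye can be backed up with a numeric check. The sketch below uses base R's `boxplot.stats()`, which flags points more than 1.5 times the hinge spread beyond the hinges. This is an assumption about what counts as a "dot": the rule in `ggboxplot` is similar but computes quartiles slightly differently, so counts may not match the plot exactly. `count_outliers` is a hypothetical helper and `toy` is made-up data that only reuses the column names from A6R2.

```r
# Hypothetical helper: number of points flagged by the 1.5 x IQR boxplot rule.
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Made-up data with the same column names as A6R2 (not the real scores).
toy <- data.frame(
  ServiceType = rep(c("Human", "AI"), each = 9),
  SatisfactionScore = c(2, 7, 7, 8, 8, 8, 9, 9, 10,
                        1, 2, 2, 3, 3, 4, 4, 5, 6)
)

# Outlier count per group.
outlier_counts <- tapply(toy$SatisfactionScore, toy$ServiceType, count_outliers)
```

Replacing `toy` with `A6R2` would give the per-group counts for the real data.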

MANN-WHITNEY U TEST

wilcox.test(SatisfactionScore ~ ServiceType, data = A6R2, exact = FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  SatisfactionScore by ServiceType
## W = 497, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

# install.packages("effectsize")


library(effectsize)


rank_biserial(SatisfactionScore ~ ServiceType, data = A6R2, exact = FALSE)

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.90             | [-0.93, -0.87]

# QUESTIONS
# Answer the questions below as a comment within the R script:

# Q1) What is the size of the effect?
# The effect size indicates how large the difference between the two groups was.
# ± 0.00 to 0.10 = ignore
# ± 0.10 to 0.30 = small
# ± 0.30 to 0.50 = moderate
# ± 0.50 and above = large
# Example 1) A rank-biserial correlation of 0.05 indicates the difference between the groups was not meaningful. There was no effect.
# Example 2) A rank-biserial correlation of 0.32 indicates the difference between the groups was moderate.

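# The rank-biserial correlation is r = -0.90.
# Based on the guidelines (± 0.50 and above), the absolute value of the effect size (|r| = 0.90) is large, indicating a very substantial difference in customer satisfaction between the two groups.
# The negative sign reflects group order only: AI is the first group, and its scores tended to be lower than the Human group's scores.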
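
The effect size can be tied back to the test statistic. For two independent groups, the rank-biserial correlation equals 2W / (n1 × n2) − 1, where W is the statistic that `wilcox.test()` reports for the first group. The sketch below applies that arithmetic to the values in the output above (W = 497, n = 100 per group) and checks the identity on a small example. `w_to_rank_biserial` is a hypothetical helper, and `toy_x` and `toy_y` are made-up vectors.

```r
# Hypothetical helper: convert the Wilcoxon W statistic to a rank-biserial correlation.
w_to_rank_biserial <- function(w, n1, n2) {
  2 * w / (n1 * n2) - 1
}

# Values taken from the wilcox.test output: W = 497, 100 customers per group.
r_reported <- w_to_rank_biserial(497, 100, 100)  # -0.9006, which rounds to -0.90

# Check the identity on made-up data against the pairwise definition:
# (pairs where x > y minus pairs where x < y) / all pairs.
toy_x <- c(1, 2, 3)
toy_y <- c(2, 4, 5, 6)
toy_w <- unname(wilcox.test(toy_x, toy_y, exact = FALSE)$statistic)
r_from_w <- w_to_rank_biserial(toy_w, length(toy_x), length(toy_y))

diffs <- outer(toy_x, toy_y, "-")
r_pairwise <- (sum(diffs > 0) - sum(diffs < 0)) / length(diffs)
```

Read this way, W = 497 means that the AI customer had the higher score in only about 5% of the 10,000 possible AI-versus-Human pairs, counting ties as half.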

# Q2) Which group had the higher average rank?
# The Mann-Whitney U test does not compare means directly. Instead, it looks at whether one group tends to have higher scores than the other.
# To determine which group ranked higher, look at the group means or medians in your dataset. 

# The Human Agent group had the higher average rank/score.
# This is confirmed by comparing the median scores: Human Agents (Mdn = 8.00) vs. AI Chatbot (Mdn = 3.00).
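
Medians are a convenient proxy, but the average rank can also be computed directly. The following sketch ranks all scores together and then averages the ranks within each group, which is one straightforward way to do it and not necessarily what the assignment intended. `toy` is made-up data that reuses the A6R2 column names, and `Rank` is a hypothetical new column.

```r
# Made-up data with the same column names as A6R2 (not the real scores).
toy <- data.frame(
  ServiceType = rep(c("AI", "Human"), each = 5),
  SatisfactionScore = c(2, 3, 3, 4, 5, 6, 7, 8, 8, 9)
)

# Rank all scores together; tied scores share the average of their ranks.
toy$Rank <- rank(toy$SatisfactionScore)

# Average rank per group; the larger value marks the group that tends to score higher.
mean_ranks <- tapply(toy$Rank, toy$ServiceType, mean)
higher_group <- names(which.max(mean_ranks))
```

Swapping `toy` for `A6R2` would report the mean ranks for the real groups.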


# WRITTEN REPORT FOR MANN-WHITNEY U TEST
# Write a paragraph summarizing your findings.

# 1) REVIEW YOUR OUTPUT
#    Collect the information below from your output:
#    1. The name of the inferential test used
#       Mann-Whitney U test
#    2. The names of the IV and DV (their proper names, which may not be their excel names).
#       IV: Service Type (Human vs. AI); DV: Customer Satisfaction Score
#    3. The sample size for each group (labeled as "n").
#       Human: n = 100; AI: n = 100
#    4. Whether the inferential test results were statistically significant (p < .05) or not (p > .05).
#       Statistically significant (p < .05)
#    5. The median for each group's score on the DV (rounded to two places after the decimal).
#       Human: 8.00; AI: 3.00
#    6. U statistic (from output).
#       497
#    7. EXACT p-value to three decimals. NOTE: If p > .05, just report p > .05. If p < .001, just report p < .001.
#       p < .001
#    8. Effect size (rank-biserial correlation) ** Only if the results were significant.
#       r = -0.90

REPORT

A Mann-Whitney U test was conducted to compare customer satisfaction scores between customers served by human agents (n = 100) and customers served by an AI chatbot (n = 100). The human agent group had significantly higher satisfaction scores (Mdn = 8.00) than the AI chatbot group (Mdn = 3.00), U = 497, p < .001. The effect size was large (r = -0.90), indicating a meaningful difference between the two service types. Overall, the results suggest that customers served by human agents report substantially higher satisfaction than customers served by the AI chatbot.

Scenario 2

Team 4

2025-11-21
