Task (instructions):
For any statistical test you run, you need to check, and fully report, each of the assumptions associated with the relevant statistical test. If assumptions are violated, you need to assess what that means for your analysis.
Please note, that large participant samples with multiple conditions often show some violations of normality as assessed by a p value of a specific statistical test. Instead of blindly following these results, please in addition use graphs to assess whether distributions are roughly normal, which is usually sufficient for the ANOVA family of tests. It is also important to ensure that skewness and kurtosis values fall within the range from -2 to +2. If no assumptions are violated, you still need to explicitly report that this is the case. You should report all relevant assumptions for each test, and discuss for each one whether they are violated or not, and eventually decide on your decision for what test to run based on this.
If there are outliers, based on any criterion you have set, or due to data input error, or because participants responded inappropriately or incorrectly to items or questions, you need to assess what the impact of these outliers/participants may be. You then need to decide on whether to include or exclude these participants. In all cases, you should report if outliers are present (or not) and what you have decided to do with them, i.e. how many outliers are removed from the analysis and why. Remember, outliers can come from a number of sources, such as evidence of not carrying out the tasks correctly, or extreme values, so you will need to inspect the data carefully.
Does a person’s deepfake detection performance change depending on the quality of deepfake (ie., between good vs bad fakes), and depending on the proportion of trials which were fake compared to real videos? And is there an interaction between the two factors?
Are people better at detecting fake or real videos? And does accuracy depend on the quality of the fake video they see?
Which of the following variables significantly predict overall deepfake detection accuracy: confidence, inspection time, gen AI use, gen AI image use, and age?
Which of the following variables - inspection time, gen AI use, gen AI image use, and age - predict overall deepfake detection performance, over and above (i.e. controlling for) confidence?
Variables codebook:
What I (Yuni) did:
so if I understand correctly RQs 3 and 4 are answered with the same model, right?
Reply:
Yes and no. For RQ4, I compared the model used in RQ3 with a base
model which only has percent_correct
and
mean_confidence
to control for confidence. It looks
like:
data5_base = lm(percent_correct ~ mean_confidence,data5)
data5_additional = lm(percent_correct ~ mean_confidence + inspection_time + AI_use + AI_img_use + age, data5)
knitr::opts_chunk$set(echo = T,message = F,warning = F)
rm(list=ls())
#setwd("~/R/Psi_Chi_R")
library(tidyverse)
#total for bottom row
sum_rows = function(x) {
x = as.data.frame(x)
sums = sapply(x,function(col) if (is.numeric(col)) sum(col, na.rm = T) else NA)
sums = as.data.frame(t(sums))
names(sums) = names(x)
rbind(x, sums)
}
## right column for total
sum_cols = function(x) {
x$Total = rowSums(x[sapply(x, is.numeric)], na.rm = T)
x
}
#dollar format function
dollars = function(x) {
paste0("$",format(x,big.mark= ",",scientific=F))
}
desc_stats = function(x){
c(min = min(x,na.rm=T),
median = median(x,na.rm=T),
max = max(x,na.rm=T),
mean = mean(x,na.rm=T),
sd = sd(x,na.rm=T))
}
#numeric notations
options(scipen=9999)
## Clean and EDA
data1 = read.csv('yuni_datafile1.csv')
data2 = read.csv('yuni_datafile2.csv')
#join datasets together
data3 = data1 %>%
inner_join(data2,by=join_by(participant==participant,condition==condition,sex==sex,age==age,prop_fake==prop_fake,fake_qual==fake_qual))
data4 = data3 %>%
mutate(prop_fake = as.factor(prop_fake),
fake_qual = as.factor(fake_qual),
condition = as.factor(condition),
sex = as.factor(sex))
#SmartEDA::ExpData(data3,type = 2) %>% arrange(desc(Per_of_Missing))
DV: percent_correct
IV: fake_qual, prop_fake
RQ1_main = aov(percent_correct ~ fake_qual * prop_fake, data4)
residuals_RQ1_main =residuals(RQ1_main)
hist(residuals_RQ1_main)
qqnorm(residuals_RQ1_main)
qqline(residuals_RQ1_main)
shapiro.test(residuals_RQ1_main)
##
## Shapiro-Wilk normality test
##
## data: residuals_RQ1_main
## W = 0.99051, p-value = 0.4692
Plot and test assumptions passed.
par(mfrow = c(2,2))
plot(RQ1_main)
var.test(data4$percent_correct[data4$fake_qual == "Good_fake"],data4$percent_correct[data4$fake_qual == "Bad_fake"])
##
## F test to compare two variances
##
## data: data4$percent_correct[data4$fake_qual == "Good_fake"] and data4$percent_correct[data4$fake_qual == "Bad_fake"]
## F = 0.79364, num df = 68, denom df = 69, p-value = 0.3416
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4924805 1.2800513
## sample estimates:
## ratio of variances
## 0.7936433
# tapply(data4$percent_correct, interaction(data4$fake_qual, data4$prop_fake), var) %>% data.frame() %>%
# select(variance = 1) %>%
# arrange(variance) %>%
# print()
Plots appear fine overall. Some outliers here and there, but nothing severe to exclude them from the analysis.
F-test shows no evidence of violating homogeneity (p-val. = 0.3416).
print(anova(RQ1_main))
## Analysis of Variance Table
##
## Response: percent_correct
## Df Sum Sq Mean Sq F value Pr(>F)
## fake_qual 1 21485.5 21485.5 511.0962 <0.0000000000000002 ***
## prop_fake 1 10.6 10.6 0.2515 0.6168
## fake_qual:prop_fake 1 29.1 29.1 0.6919 0.4070
## Residuals 135 5675.1 42.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, a person’s percent_correct
changes depending on
fake_qual
(p-val: 0.0000000000000002), but not on
prop_fake
(p-val: 0.6168).
Detection of bad vs good deepfakes is significant (p-val. = 0.0000000000000002).
Changing the proportion of fakes doesn’t impact performance (p-val. = 0.6168).
No interaction between fake_qual and prop_fake (p.val = 0.4070). The impact of fake_qual stays consistent regardless of how many fake trials were shown.
No, p-value is (fake_qual:prop_fake) 0.4070. There’s no interaction
between fake_qual
and prop_fake
.
data4 %>%
group_by(fake_qual, prop_fake) %>%
summarise(
mean_accuracy = mean(percent_correct),
se = sd(percent_correct) / sqrt(n()),
n = n(),
.groups = "drop") %>%
ggplot(aes(x = prop_fake, y = mean_accuracy, fill = fake_qual)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8), color = "black", width = 0.6) +
geom_errorbar(aes(ymin = mean_accuracy - se, ymax = mean_accuracy + se),
position = position_dodge(width = 0.8), width = 0.2) +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
labs(title = "Interaction of Fake Quality by Proportion of Fakes",
y = "Mean Accuracy") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))+
geom_text(aes(label = paste0(n, "\n(", round(mean_accuracy, 1), "%)")),position = position_dodge(width = 0.8),
vjust = 5,hjust=.5, size = 3, color = "black")
Bar chart shows no differences for interaction between the two factors. The pink bars look the same in either group, and the same with the teal bars as well. No differences seen.
This is the line chart replica of the above bar chart:
data4 %>%
group_by(fake_qual, prop_fake) %>%
summarise(
mean_accuracy = mean(percent_correct),
se = sd(percent_correct) / sqrt(n()),
n = n(),
.groups = "drop") %>%
ggplot(aes(x = prop_fake, y = mean_accuracy, color = fake_qual, group = fake_qual)) +
geom_line(size = 1) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_accuracy - se, ymax = mean_accuracy + se),
width = 0.1) +
labs(title = "Interaction of Fake Quality and Proportion of Fakes",
y = "Mean Accuracy") +
theme_bw()+
theme(plot.title = element_text(hjust =.5))+
scale_y_continuous(labels = scales::percent_format(scale = 1))+
geom_text(aes(label = paste0(n, "\n(", round(mean_accuracy, 1), "%)")),
position = position_dodge(width = 0.8), size = 3, color = "black")
percent_correct_fake (PCF) vs. percent_correct_real (PCR)
shapiro.test(data4$percent_correct_fake)
##
## Shapiro-Wilk normality test
##
## data: data4$percent_correct_fake
## W = 0.93846, p-value = 0.000008619
shapiro.test(data4$percent_correct_real)
##
## Shapiro-Wilk normality test
##
## data: data4$percent_correct_real
## W = 0.99036, p-value = 0.4554
hist(data4$percent_correct_fake)
hist(data4$percent_correct_real)
PCF violates normality, but PCR doesn’t.
var.test(data4$percent_correct_fake,data4$percent_correct_real)
##
## F test to compare two variances
##
## data: data4$percent_correct_fake and data4$percent_correct_real
## F = 2.5027, num df = 138, denom df = 138, p-value = 0.0000001231
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.790188 3.498935
## sample estimates:
## ratio of variances
## 2.502749
Variances are statistically different (p-val. = 0.0000001231).
PCR
is 2.5x smaller than PCF
.
t.test(data4$percent_correct_real,data4$percent_correct_fake, paired = T)
##
## Paired t-test
##
## data: data4$percent_correct_real and data4$percent_correct_fake
## t = 2.5966, df = 138, p-value = 0.01043
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 1.011367 7.469153
## sample estimates:
## mean difference
## 4.24026
data4.1 = data4 %>%
select(participant, percent_correct_fake, percent_correct_real) %>%
pivot_longer(cols = c(percent_correct_fake, percent_correct_real),
names_to = "video_type",
values_to = "accuracy")
data4.1_AOV = aov(accuracy ~ video_type + Error(participant/video_type), data4.1)
summary(data4.1_AOV)
##
## Error: participant
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 1 23.4 23.4
##
## Error: participant:video_type
## Df Sum Sq Mean Sq
## video_type 1 574 574
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## video_type 1 867 867.1 2.976 0.0857 .
## Residuals 274 79838 291.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data4_long = pivot_longer(data4, cols = c(percent_correct_fake, percent_correct_real),
names_to = "Video_Type", values_to = "Percent_Correct")
ggplot(data4_long, aes(x = Video_Type, y = Percent_Correct)) +
geom_boxplot(fill = "steelblue") +
stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +
theme_bw()+
labs(title = "percent_correct_fake vs. percent_correct_real")+
theme(plot.title = element_text(hjust = .5))
IV: fake_qual
DV: percent_correct
shapiro.test(data4$percent_correct[data4$fake_qual == "Good_fake"])
##
## Shapiro-Wilk normality test
##
## data: data4$percent_correct[data4$fake_qual == "Good_fake"]
## W = 0.99045, p-value = 0.8811
shapiro.test(data4$percent_correct[data4$fake_qual == "Bad_fake"])
##
## Shapiro-Wilk normality test
##
## data: data4$percent_correct[data4$fake_qual == "Bad_fake"]
## W = 0.98064, p-value = 0.3516
p-val. > 0.05. Normality met; proceed with homogeneity check.
var.test(data4$percent_correct[data4$fake_qual == "Good_fake"],data4$percent_correct[data4$fake_qual == "Bad_fake"])
##
## F test to compare two variances
##
## data: data4$percent_correct[data4$fake_qual == "Good_fake"] and data4$percent_correct[data4$fake_qual == "Bad_fake"]
## F = 0.79364, num df = 68, denom df = 69, p-value = 0.3416
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4924805 1.2800513
## sample estimates:
## ratio of variances
## 0.7936433
No significant difference between the two groups — homogeneity of variances is fine and variances are similar enough (p-val. = 0.3416).
t.test(data4$percent_correct ~ data4$fake_qual)
##
## Welch Two Sample t-test
##
## data: data4$percent_correct by data4$fake_qual
## t = 22.714, df = 135.63, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group Bad_fake and group Good_fake is not equal to 0
## 95 percent confidence interval:
## 22.70104 27.03099
## sample estimates:
## mean in group Bad_fake mean in group Good_fake
## 82.90115 58.03513
#t.test(data4$percent_correct_fake~data4$percent_correct_real)
p-val. = 0.00000000000000022.
“Predict overall deepfake detection performance” = percent_correct.
IV: mean_confidence, inspection_time, AI_use, AI_img_use, age
DV: percent_correct
model_4.1 = lm(percent_correct ~ mean_confidence + inspection_time + AI_use+ AI_img_use+ age,data4)
par(mfrow = c(2,2))
plot(model_4.1)
shapiro.test(residuals(model_4.1))
##
## Shapiro-Wilk normality test
##
## data: residuals(model_4.1)
## W = 0.97628, p-value = 0.01597
Shapiro test shows violation, but plot shows they aren’t severe and linear model could be used. Use generalized linear model instead to allow for flexibility of outliers.
summary(data4$percent_correct)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.59 58.51 68.33 70.56 82.35 98.73
shapiro.test(data4$percent_correct)
##
## Shapiro-Wilk normality test
##
## data: data4$percent_correct
## W = 0.9453, p-value = 0.00002767
hist(data4$percent_correct)
percent_check
appears to be a bimodal distribution.
data5 = data4 %>%
mutate(prop_percent_correct = (percent_correct + 0.01) / 100)
#this model replaces model_4.1
data5_GLM = data5 %>%
mutate(predicted = predict(glm(prop_percent_correct ~ mean_confidence + inspection_time + AI_use + AI_img_use + age, family = quasibinomial(), data = data5),type = "response"))
model=glm(prop_percent_correct ~ mean_confidence + inspection_time + AI_use + AI_img_use + age,
family = quasibinomial(), data = data5)
dispersion <- sum(residuals(model, type = "deviance")^2) / model$df.residual
print(dispersion)
## [1] 0.06719602
plot(model$fitted.values, residuals(model, type = "deviance"),
xlab = "Fitted values", ylab = "Deviance residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")
Dispersion is within parameters (anything over 1 indicates overdispersion). Plot shows line straight and nearing 0.
Given that percent_correct
is a proportion and is
bimodally distributed, use quasibinomial GLM.
data5_GLM_model = glm(prop_percent_correct ~ mean_confidence + inspection_time + AI_use + AI_img_use + age,family = quasibinomial(), data5_GLM)
summary(data5_GLM_model)
##
## Call:
## glm(formula = prop_percent_correct ~ mean_confidence + inspection_time +
## AI_use + AI_img_use + age, family = quasibinomial(), data = data5_GLM)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7084030 0.3865151 -4.420 0.0000203024643544 ***
## mean_confidence 1.0608461 0.1264319 8.391 0.0000000000000622 ***
## inspection_time 0.0004333 0.0019077 0.227 0.8207
## AI_use 0.0846463 0.0452686 1.870 0.0637 .
## AI_img_use -0.0142102 0.0516985 -0.275 0.7838
## age -0.0064289 0.0024781 -2.594 0.0105 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.0666086)
##
## Null deviance: 14.0458 on 138 degrees of freedom
## Residual deviance: 8.9371 on 133 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
ggplot(data5_GLM, aes(x = mean_confidence, y =prop_percent_correct)) +
geom_point(alpha = 0.6) +
geom_line(aes(y = predicted), color = "red", size = .5) +
labs(title = "Predicted percent_correct by mean_confidence") +
theme_bw()+
theme(plot.title = element_text(hjust = .5))
ggplot(data5_GLM, aes(x = mean_confidence, y =prop_percent_correct)) +
geom_point(alpha = 0.6) +
geom_line(aes(y = predicted), color = "red", size = .5) +
labs(title = "Predicted percent_correct by mean_confidence faceted by AI_use") +
theme_bw()+
theme(plot.title = element_text(hjust = .5),
plot.subtitle = element_text(hjust = .5))+
facet_wrap(~AI_use)
mean_confidence
(p-val. = 0.0000000000000622) had a
positive impact on percent_correct
while age
(p-val. = 0.0105) had a negative impact on percent_correct
.
AI_use
showed near significance, but not quite enough.
Looking at the split scatter plot above, AI_use
of 3
appears different from the other groups.
Higher confidence correlates with higher accuracy, whereas older age showed less accuracy in deepfake detection.
“Predict overall deepfake detection performance” = percent_correct. Create two models to compare them against each other. One model is the control and the other is the manipulation.
IV: mean_confidence, inspection_time, AI_use, AI_img_use, age
DV: percent_correct
#base model to control for confidence
data5_base = lm(percent_correct ~ mean_confidence,data5)
#add IVs for manipulation
data5_additional= lm(percent_correct ~ mean_confidence + inspection_time + AI_use + AI_img_use + age, data5)
par(mfrow = c(2,2))
plot(data5_base,main='base model')
par(mfrow = c(2,2))
plot(data5_additional,main='additional model')
Some mildness in Residuals vs Fitted, but nothing severe.
data5_cor_matrix = data5 %>%
select(mean_confidence, inspection_time, AI_use,AI_img_use, age) %>%
cor(use = "complete.obs")
data5_cor_plot = as.data.frame(as.table(data5_cor_matrix))
ggplot(data5_cor_plot, aes(Var1, Var2, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = round(Freq, 2)), color = "black", size = 4) +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
labs(title = "Correlation of IVs",
x = "", y = "", fill = "Correlation %") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = .5))
Values not >= 0.70 or <= -0.70 means multicollinearity isn’t violated. Proceed with hierarchical regression.
anova(data5_base,data5_additional) %>% print()
## Analysis of Variance Table
##
## Model 1: percent_correct ~ mean_confidence
## Model 2: percent_correct ~ mean_confidence + inspection_time + AI_use +
## AI_img_use + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 137 17921
## 2 133 16543 4 1378.5 2.7708 0.02984 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(data5_additional)
##
## Call:
## lm(formula = percent_correct ~ mean_confidence + inspection_time +
## AI_use + AI_img_use + age, data = data5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.5340 -7.6219 0.2639 8.2718 22.2999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.407914 7.652055 2.144 0.0338 *
## mean_confidence 22.010164 2.454837 8.966 0.00000000000000247 ***
## inspection_time 0.008713 0.036350 0.240 0.8109
## AI_use 1.668879 0.864559 1.930 0.0557 .
## AI_img_use -0.204375 1.001871 -0.204 0.8387
## age -0.133828 0.051680 -2.590 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.15 on 133 degrees of freedom
## Multiple R-squared: 0.3918, Adjusted R-squared: 0.369
## F-statistic: 17.14 on 5 and 133 DF, p-value: 0.0000000000004626
The p-val. = 0.0000000000004626 shows that this model is significant,
and that the adjusted R-squared shows 36.9% of the variance in
percent_correct
can be explained by the IVs.
mean_confidence
(p-val. = 0.00000000000000247) and
age
(p-val. = 0.0107) showed significance.
mean_confidence
with percent_correct
is
22.010164 while age
with percent_correct
is
-0.133828. This means each point of mean_confidence
increased accuracy by 22, while each point of age
decreased
accuracy by 0.13.