Introduction

In a recent paper we used the R package “statcheck” (Epskamp & Nuijten, 2015) to estimate the prevalence of reporting errors in psychology (Nuijten et al., in press). statcheck searches text for statistical results reported in APA style and uses the reported test statistic and degrees of freedom to recalculate the p-value. A result is flagged as an inconsistency if the recalculated p-value does not match the reported p-value, and as a gross inconsistency if the inconsistency could have affected the statistical conclusion (assuming \(\alpha = .05\)).
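
To illustrate this logic, here is a deliberately simplified sketch of the consistency check for a reported t-test. The values are hypothetical and this is not statcheck's actual implementation, which also accounts for the rounding of the reported numbers:

# simplified sketch of the consistency check (not statcheck's actual code);
# reported values are hypothetical
reported_t  <- 2.20
reported_df <- 58
reported_p  <- .03

# recompute the two-tailed p-value from the reported test statistic and df
recomputed_p <- 2 * pt(-abs(reported_t), df = reported_df)

# inconsistency: the recomputed p-value does not match the reported one
# (ignoring rounding and floating-point subtleties for this sketch)
inconsistency <- round(recomputed_p, 2) != reported_p

# gross inconsistency: the mismatch changes the significance decision at alpha = .05
gross_inconsistency <- inconsistency & ((recomputed_p <= .05) != (reported_p <= .05))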

As we repeatedly stressed in our paper, statcheck is an automated program that will never be as accurate as a manual search in pointing out reporting errors. However, we compared the results of statcheck with those of a manual search and concluded that even though statcheck’s results are sometimes noisy, they don’t seem to be systematically biased. The inter-rater reliability was .76 for the inconsistencies and .89 for the gross inconsistencies.

Recently, several people pointed out to me that statcheck wrongly interprets t-tests and F-tests that belong to item analyses and are reported as “t2(..) = …, p = …” as chi-square tests, and hence wrongly flags these results as (gross) inconsistencies (see, for instance, the discussion on PubPeer). I fixed this bug in the current version of statcheck (Epskamp & Nuijten, 2016), but the question remained to what extent this bug may have influenced our estimate of the error prevalence in psychology.
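
To see why this produced false flags, consider a hypothetical item analysis reported as “t2(25) = 2.50, p = .02”. Recomputing the p-value under the correct t-distribution reproduces the reported value, whereas recomputing it as if it were a chi-square test (as the old statcheck did) yields a completely different p-value:

# hypothetical reported result: t2(25) = 2.50, p = .02
test_value <- 2.50
df         <- 25

# two-tailed p-value under the correct t-distribution: approximately .019,
# consistent with the reported p = .02
2 * pt(-abs(test_value), df = df)

# p-value when the result is wrongly treated as a chi-square test: approximately 1,
# so the old statcheck flagged this result as a (gross) inconsistency
pchisq(test_value, df = df, lower.tail = FALSE)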

In this document I will analyze the influence of the wrongly flagged inconsistencies in t2/F2-tests on the estimated error prevalence in psychology as reported in Nuijten et al. (in press).

Results at the P-value Level

For this analysis I used the full data set we also used in Nuijten et al. (in press), which can be found at the Open Science Framework: https://osf.io/e9qbp/.

# read the full data set (one row per statistical result extracted by statcheck)
data <- read.csv2("150211FullFile_AllStatcheckData_Automatic1Tail.csv", header=TRUE)

Descriptives

I compared the full data set with the cases in which statcheck wrongly interpreted a t2 or F2 result as a chi-square test:

# total number of extracted statistics
nrow(data)
## [1] 258105
# total number of t2/F2 statistics
t2 <- data[grepl("t2|F2",data$Raw),]
nrow(t2)
## [1] 173
# % of total number of extracted p-values
nrow(t2)/nrow(data)*100
## [1] 0.06702699

These results show that only 0.067% of all extracted results were t2 or F2 tests wrongly interpreted as chi-square tests.

Error Prevalence

Next, I compared the error prevalence in the full data set (as we reported in Nuijten et al., in press) with that in the data set from which I excluded the t2/F2 cases.

# data without t2/F2 cases
data_not2 <- data[!grepl("t2|F2",data$Raw),]

# error prevalence full data
error_all <- sum(data$Error,na.rm=TRUE)/nrow(data)*100
dec_error_all <- sum(data$DecisionError,na.rm=TRUE)/nrow(data)*100

# error prevalence full data without t2
error_not2 <- sum(data_not2$Error,na.rm=TRUE)/nrow(data_not2)*100
dec_error_not2 <- sum(data_not2$DecisionError,na.rm=TRUE)/nrow(data_not2)*100
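
The comparison table below can then be assembled from these quantities, for example:

# combine the inconsistency rates with and without the t2/F2 cases
# (illustrative helper, not part of the original analysis script)
comparison <- data.frame(
  "% Inconsistencies"       = c(error_all, error_not2, error_all - error_not2),
  "% Gross Inconsistencies" = c(dec_error_all, dec_error_not2, dec_error_all - dec_error_not2),
  row.names = c("All Data", "Data Without t2", "Difference"),
  check.names = FALSE
)
comparison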
                  % Inconsistencies   % Gross Inconsistencies
All Data                  9.6708704                 1.3874198
Data Without t2           9.6195897                 1.3426019
Difference                0.0512807                 0.0448179

These results show that both the prevalence of inconsistencies and the prevalence of gross inconsistencies are slightly lower if I remove the wrongly flagged t2/F2 cases. However, the differences are very small (0.05 and 0.04 percentage points, respectively).

Results at the Article Level

Data Preparation

To have a complete picture of the influence of t2/F2 results, I also looked at results at the article level. To do this, I first reorganized the full data (at the p-value level) to be at article level.

library(plyr)

#------------------------------------------------------------------------------------
# data per article based on full data

# article/source
Source <- unique(data$Source)

# number of NHST results per article
nr_NHST <- by(data,data$Source,nrow)

# number of errors per article
total_errors <- ddply(data,"Source",function(x) sum(x$Error,na.rm=TRUE))

# number of decision errors per article
total_dec_errors <- ddply(data,"Source",function(x) sum(x$DecisionError,na.rm=TRUE))

# percentage of results that is an error per article
perc_errors <- round(total_errors[,2]/nr_NHST*100,2)

# percentage of results that is a decision error per article
perc_dec_errors <- round(total_dec_errors[,2]/nr_NHST*100,2)

# combine article summary info in a data frame
data_per_article <- data.frame(Source=Source,
                               nr_NHST=as.vector(nr_NHST),
                               errors=total_errors[,2],
                               dec_errors=total_dec_errors[,2],
                               perc_errors=as.vector(perc_errors),
                               perc_dec_errors=as.vector(perc_dec_errors))

#------------------------------------------------------------------------------------
# data per article based on full data without t2/F2 results

data_not2 <- data[!grepl("t2|F2",data$Raw),]

# article/source
Source <- unique(data_not2$Source)

# number of NHST results per article
nr_NHST <- by(data_not2,data_not2$Source,nrow)

# number of errors per article
total_errors <- ddply(data_not2,"Source",function(x) sum(x$Error,na.rm=TRUE))

# number of decision errors per article
total_dec_errors <- ddply(data_not2,"Source",function(x) sum(x$DecisionError,na.rm=TRUE))

# percentage of results that is an error per article
perc_errors <- round(total_errors[,2]/nr_NHST*100,2)

# percentage of results that is a decision error per article
perc_dec_errors <- round(total_dec_errors[,2]/nr_NHST*100,2)

# combine article summary info in a data frame
data_per_article_not2 <- data.frame(Source=Source,
                               nr_NHST=as.vector(nr_NHST),
                               errors=total_errors[,2],
                               dec_errors=total_dec_errors[,2],
                               perc_errors=as.vector(perc_errors),
                               perc_dec_errors=as.vector(perc_dec_errors))
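
As an aside, the same per-article summary could also be built in a single ddply call. A sketch for the data set without the t2/F2 cases, equivalent to the chunk above (this is not the code used for the paper):

# more compact construction of the per-article summary (alternative sketch)
data_per_article_not2_alt <- ddply(data_not2, "Source", function(x) {
  data.frame(
    nr_NHST         = nrow(x),
    errors          = sum(x$Error, na.rm = TRUE),
    dec_errors      = sum(x$DecisionError, na.rm = TRUE),
    perc_errors     = round(sum(x$Error, na.rm = TRUE) / nrow(x) * 100, 2),
    perc_dec_errors = round(sum(x$DecisionError, na.rm = TRUE) / nrow(x) * 100, 2)
  )
})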

Error Prevalence

Based on these newly organized data sets I could establish the error prevalence at the article level for data including and excluding t2/F2 cases.

## error prevalence full data
# % articles with at least one error
error_art_all <- round(sum(data_per_article$errors>0)/nrow(data_per_article)*100,2)
# % articles with at least one decision error
dec_error_art_all <- round(sum(data_per_article$dec_errors>0)/nrow(data_per_article)*100,2)

## error prevalence full data without t2/F2 cases
# % articles with at least one error
error_art_not2 <- round(sum(data_per_article_not2$errors>0)/nrow(data_per_article_not2)*100,2)
# % articles with at least one decision error
dec_error_art_not2 <- round(sum(data_per_article_not2$dec_errors>0)/nrow(data_per_article_not2)*100,2)
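
As before, the article-level comparison table below can be assembled from these quantities:

# combine the article-level prevalences with and without the t2/F2 cases
# (illustrative helper, not part of the original analysis script)
comparison_art <- data.frame(
  "% Articles With 1+ Inconsistencies"       = c(error_art_all, error_art_not2,
                                                 error_art_all - error_art_not2),
  "% Articles With 1+ Gross Inconsistencies" = c(dec_error_art_all, dec_error_art_not2,
                                                 dec_error_art_all - dec_error_art_not2),
  row.names = c("All Data", "Data Without t2", "Difference"),
  check.names = FALSE
)
comparison_art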
                  % Articles With 1+ Inconsistencies   % Articles With 1+ Gross Inconsistencies
All Data                                       49.55                                      12.88
Data Without t2                                49.49                                      12.79
Difference                                      0.06                                       0.09

Again, the prevalence of articles with at least one inconsistency or gross inconsistency is slightly lower if I exclude the wrongly flagged t2/F2 cases, but these differences are also very small (0.06 and 0.09 percentage points, respectively).

Conclusion

I investigated the influence of t2 and F2 tests wrongly flagged as chi-square tests on the estimated prevalence of reporting errors in psychology as reported in Nuijten et al. (in press). I found that if I exclude the erroneous t2/F2 cases, the estimated error prevalence is slightly lower, both at the p-value level and at the article level. However, these differences are so small that they do not substantively influence the conclusions in Nuijten et al. (in press).

References

Epskamp, S., & Nuijten, M. B. (2015). statcheck: Extract statistics from articles and recompute p values (R package version 1.0.1). Retrieved from https://cran.r-project.org/src/contrib/Archive/statcheck/

Epskamp, S., & Nuijten, M. B. (2016). statcheck: Extract statistics from articles and recompute p values (R package version 1.2.2). Retrieved from http://CRAN.R-project.org/package=statcheck

Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (in press). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. DOI: 10.3758/s13428-015-0664-2