# ggstatsplot的统计原理与实操

editor_options: markdown: wrap: 72 output: html_document: df_print: paged —

基本统计

参考资料:R语言实战

研究类型:基因表达与临床变量的关系

知识点:

  • 组间差异比较

  • 正态分布检验

  • 方差齐性检验

## 准备工作
rm(list=ls())
gc()
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 449586 24.1     962665 51.5   643731 34.4
## Vcells 803220  6.2    8388608 64.0  1648428 12.6
#加载R包
options(warn=0)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(car)
## 载入需要的程辑包:carData
## 
## 载入程辑包:'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(gplots)
## 
## 载入程辑包:'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
#install.packages("HH")
library(HH)
## 载入需要的程辑包:lattice
## 载入需要的程辑包:grid
## 载入需要的程辑包:latticeExtra
## 
## 载入程辑包:'latticeExtra'
## The following object is masked from 'package:ggplot2':
## 
##     layer
## 载入需要的程辑包:multcomp
## 载入需要的程辑包:mvtnorm
## 载入需要的程辑包:survival
## 载入需要的程辑包:TH.data
## 载入需要的程辑包:MASS
## 
## 载入程辑包:'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## 载入程辑包:'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
## 载入需要的程辑包:gridExtra
## 
## 载入程辑包:'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## 载入程辑包:'HH'
## The following object is masked from 'package:gplots':
## 
##     residplot
## The following objects are masked from 'package:car':
## 
##     logit, vif
## The following object is masked from 'package:purrr':
## 
##     transpose
#install.packages("rrcov")
library(rrcov)
## 载入需要的程辑包:robustbase
## 
## 载入程辑包:'robustbase'
## The following object is masked from 'package:survival':
## 
##     heart
## Scalable Robust Estimators with High Breakdown Point (version 1.6-0)
#install.packages("multicomp")
#library(multicomp)
#install.packages("effects")
library(effects)
## Use the command
##     lattice::trellis.par.set(effectsTheme())
##   to customize lattice options for effects plots.
## See ?effectsTheme for details.
#install.packages("MASS")
library(MASS)
#install.packages("mvoutlier")
library(mvoutlier)
## 载入需要的程辑包:sgeostat
#install.packages("pwr")
library(pwr)
#install.packages("FSA")
library(FSA)
## Registered S3 methods overwritten by 'FSA':
##   method       from
##   confint.boot car 
##   hist.boot    car
## ## FSA v0.9.1. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.
## 
## 载入程辑包:'FSA'
## The following object is masked from 'package:car':
## 
##     bootCase
library(ggstatsplot)
## Registered S3 method overwritten by 'WRS2':
##   method       from
##   print.ancova HH
## Registered S3 methods overwritten by 'parameters':
##   method                           from      
##   as.double.parameters_kurtosis    datawizard
##   as.double.parameters_skewness    datawizard
##   as.double.parameters_smoothness  datawizard
##   as.numeric.parameters_kurtosis   datawizard
##   as.numeric.parameters_skewness   datawizard
##   as.numeric.parameters_smoothness datawizard
##   print.parameters_distribution    datawizard
##   print.parameters_kurtosis        datawizard
##   print.parameters_skewness        datawizard
##   summary.parameters_kurtosis      datawizard
##   summary.parameters_skewness      datawizard
## You can cite this package as:
##      Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
##      Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
options(warn = 1)

组间差异比较

  1. 两组间比较还是多组比较?

  2. 参数还是非参数?

2组 多组 事后两两检验
参数 t-检验 ANOVA F检验;Gam es-Howell检验
非参数 wilcoxon秩 和(MW检验) KW检验 Dunn检验

2组间比较

Student’s t-Test

Description: Performs one and two sample t-tests on vectors of data.

适用条件:①under normality assumptions ②large samples(多大叫大?)

说明:比较的是均值,一般不是特别偏态分布就可以用

Wilcoxon Rank Sum and Signed Rank Tests

Description: Performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as Mann-Whitney test.

适用条件:①不符合t-test的条件(不满足正态分布);②因变量为等级、分类变量时;③是t-test的替代品,效力不如t-test

示例

## 载入数据
load("datt.Rdata")
head(datt)
##   agegroup Stage Expression
## 1      >65    IV   4.523562
## 2      >65    IV   5.000000
## 3     <=65    IV   3.169925
## 4     <=65    IV   1.000000
## 5     <=65    IV   5.700440
## 6      >65    IV   6.209453
summary(datt)
##  agegroup    Stage       Expression    
##  <=65:324   I   : 19   Min.   : 0.000  
##  >65 :175   II  : 95   1st Qu.: 2.000  
##  NA's:  1   III :102   Median : 3.459  
##             IV  :270   Mean   : 3.822  
##             NA's: 14   3rd Qu.: 5.459  
##                        Max.   :12.783
dat <- datt %>% dplyr::select(agegroup,Expression) %>% na.omit()
# 参数检验:
ggbetweenstats(data=dat,x=agegroup,y=Expression,type = "p")

t.test(Expression~agegroup,data = dat) 
## 
##  Welch Two Sample t-test
## 
## data:  Expression by agegroup
## t = -0.24934, df = 404.91, p-value = 0.8032
## alternative hypothesis: true difference in means between group <=65 and group >65 is not equal to 0
## 95 percent confidence interval:
##  -0.4929888  0.3820058
## sample estimates:
## mean in group <=65  mean in group >65 
##           3.802547           3.858039

说明:

ggbetweenstats()t-test结果一致

ggbetweenstats()t.test()默认方差不等,默认使用Welch’s t-test(即方差不齐,适合2组观测数量不等时使用的Student’s t-test的改良版)

可以通过var.equal=TRUE设置方差齐平

Welch’s t-test From Wikipedia: In statistics, Welch’s t-test, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. It is named for its creator, Bernard Lewis Welch, is an adaptation of Student’s t-test, and is more reliable when the two samples have unequal variances and/or unequal sample sizes. These tests are often referred to as “unpaired” or “independent samples” t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping. Given that Welch’s t-test has been less popular than Student’s t-test and may be less familiar to readers, a more informative name is “Welch’s unequal variances t-test” — or “unequal variances t-test” for brevity.

ggbetweenstats(data=dat,x=agegroup,y=Expression,type = "p",var.equal = TRUE)

t.test(Expression~agegroup,data = dat,var.equal=TRUE) 
## 
##  Two Sample t-test
## 
## data:  Expression by agegroup
## t = -0.23837, df = 497, p-value = 0.8117
## alternative hypothesis: true difference in means between group <=65 and group >65 is not equal to 0
## 95 percent confidence interval:
##  -0.5128847  0.4019017
## sample estimates:
## mean in group <=65  mean in group >65 
##           3.802547           3.858039

两者结果再次一致,此时ggbetweenstats()进行的是经典的Student's t-test

但是这种检验合理吗?Student’s t-test 要求符合正态分布或大样本

正态分布检验

方法: shapiro-wilk正态性检验

假设: H0:总体符合正态分布 VS 总体不符合正态分布

检验统计量:W

条件: P<0.05拒绝H0

# 正态分布检验
## QQ图检验:
x <- dat$Expression
#qqnorm(x)+qqline(x)

## shapiro-wilk test
shapiro.test(x)
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.96137, p-value = 3.639e-10

点距离线较远且p<0.05,拒绝H0(符合正态分布),即非正态

方差齐性检验

#非正态分布数据,用Levene's test
library(car)
leveneTest(Expression~agegroup,data=dat)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  5.6498 0.01784 *
##       497                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pr(>F) < 0.01拒绝H0,方差不齐 所以不能用Student’s t-test

MW秩和检验

效力低于Welch’s t-test

ggbetweenstats(data=dat,x=agegroup,y=Expression,type = "n")

wilcox.test(Expression~agegroup,data = dat)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Expression by agegroup
## W = 27037, p-value = 0.3927
## alternative hypothesis: true location shift is not equal to 0

多组间比较

Fit an Analysis of Variance Model(ANOVA)

Description: Fit an analysis of variance model by a call to lm for each stratum.

适用条件:正态分布、方差相齐

Kruskal-Wallis Rank Sum Test(KW test)

适用条件:多组间比较差异是否显著+不满足ANOVA条件

事后两两比较

参数:Games-Howell比较泛用,方差不齐也可以

非参数:Dunn检验

示例

## 方差齐性检验
#非正态分布数据,用Levene's test
dat <- datt %>% dplyr::select(Stage,Expression) %>% na.omit()
leveneTest(Expression~Stage,data=dat)#方差齐平
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3  0.0152 0.9974
##       482
## one-way ANOVA
ggbetweenstats(data = dat,x=Stage,y=Expression,type = "p",var.equal = TRUE)

## p=0.268
oneway.test(Expression~Stage,data = dat,var.equal = TRUE)
## 
##  One-way analysis of means
## 
## data:  Expression and Stage
## F = 1.3162, num df = 3, denom df = 482, p-value = 0.2684

结果一致,差异性用one-way ANOVA(Fisher 检验),事后组间两两比较用了Student’ t-test(此处应该是没办法改的?)

但是如果不设置方差齐平,则会默认进行Welch’s ANOVA检测

ggbetweenstats(data = dat,x=Stage,y=Expression,type = "p",var.equal = FALSE)

## p=0.310
oneway.test(Expression~Stage,data = dat,var.equal = FALSE)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  Expression and Stage
## F = 1.2165, num df = 3.00, denom df = 75.78, p-value = 0.3096

此时组间两两比较默认适用比较宽泛的Games-Howell test

## 非参数检验
ggbetweenstats(data=dat,x=Stage,y=Expression,type = "n")

p=0.234
#差异性检验
kruskal.test(Expression~Stage,data = dat) 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Expression by Stage
## Kruskal-Wallis chi-squared = 4.2719, df = 3, p-value = 0.2336
#组间两两比较
#perform Dunn's Test with Bonferroni correction for p-values
dunnTest(Expression ~ Stage,
         data=dat,
         method="holm")
## Dunn (1964) Kruskal-Wallis multiple comparison
##   p-values adjusted with the Holm method.
##   Comparison          Z    P.unadj     P.adj
## 1     I - II -1.7566221 0.07898224 0.4738934
## 2    I - III -1.4112275 0.15817753 0.6327101
## 3   II - III  0.6230429 0.53325631 0.5332563
## 4     I - IV -1.1029377 0.27005421 0.8101626
## 5    II - IV  1.5062327 0.13200746 0.6600373
## 6   III - IV  0.7816220 0.43443673 0.8688735

多组非参数检验适用KW秩和检验,事后组间两两比较使用Dunn test ### 总结: * ggbetweenstats 默认方差不齐,可以通过var.equal=TRUE设置,同时检验方法会改变

  • 决定ggbetweenstats的统计方法的参数有:

    • type(参数检验还是秩和检验,还可以选robust——稳健性检验;bayes——完全是另一个学派);

    • var.equal方差是否相齐: 两组间比较时:Student-t | Welch’s-t; 方差分析: Fisher(经典单因素方差分析) | Welch’s ANOVA; 事后组间比较:Games-Well | Student‘s t-test

    • p.adjust.method:P值矫正: 默认“holm”; 多组比较推荐“fdr”

    Adjustment method for p-values for multiple comparisons. Possible methods are: “holm” (default), “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”.

  • centrality.type中心点类型会随type变化:

    “parameteric” (for mean)

    “nonparametric” (for median)

    “robust” (for trimmed mean)

    “bayes” (for MAP estimator)

`