基本统计

参考资料：R语言实战

研究类型：基因表达与临床变量的关系

知识点：

组间差异比较
正态分布检验
方差齐性检验

## 准备工作
rm(list=ls())
gc()

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 449586 24.1     962665 51.5   643731 34.4
## Vcells 803220  6.2    8388608 64.0  1648428 12.6

#加载R包
options(warn=0)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(car)

## 载入需要的程辑包：carData

## 
## 载入程辑包：'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

library(gplots)

## 
## 载入程辑包：'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

#install.packages("HH")
library(HH)

## 载入需要的程辑包：lattice

## 载入需要的程辑包：grid

## 载入需要的程辑包：latticeExtra

## 
## 载入程辑包：'latticeExtra'

## The following object is masked from 'package:ggplot2':
## 
##     layer

## 载入需要的程辑包：multcomp

## 载入需要的程辑包：mvtnorm

## 载入需要的程辑包：survival

## 载入需要的程辑包：TH.data

## 载入需要的程辑包：MASS

## 
## 载入程辑包：'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## 
## 载入程辑包：'TH.data'

## The following object is masked from 'package:MASS':
## 
##     geyser

## 载入需要的程辑包：gridExtra

## 
## 载入程辑包：'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

## 
## 载入程辑包：'HH'

## The following object is masked from 'package:gplots':
## 
##     residplot

## The following objects are masked from 'package:car':
## 
##     logit, vif

## The following object is masked from 'package:purrr':
## 
##     transpose

#install.packages("rrcov")
library(rrcov)

## 载入需要的程辑包：robustbase

## 
## 载入程辑包：'robustbase'

## The following object is masked from 'package:survival':
## 
##     heart

## Scalable Robust Estimators with High Breakdown Point (version 1.6-0)

#install.packages("multicomp")
#library(multicomp)
#install.packages("effects")
library(effects)

## Use the command
##     lattice::trellis.par.set(effectsTheme())
##   to customize lattice options for effects plots.
## See ?effectsTheme for details.

#install.packages("MASS")
library(MASS)
#install.packages("mvoutlier")
library(mvoutlier)

## 载入需要的程辑包：sgeostat

#install.packages("pwr")
library(pwr)
#install.packages("FSA")
library(FSA)

## Registered S3 methods overwritten by 'FSA':
##   method       from
##   confint.boot car 
##   hist.boot    car

## ## FSA v0.9.1. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.

## 
## 载入程辑包：'FSA'

## The following object is masked from 'package:car':
## 
##     bootCase

library(ggstatsplot)

## Registered S3 method overwritten by 'WRS2':
##   method       from
##   print.ancova HH

## Registered S3 methods overwritten by 'parameters':
##   method                           from      
##   as.double.parameters_kurtosis    datawizard
##   as.double.parameters_skewness    datawizard
##   as.double.parameters_smoothness  datawizard
##   as.numeric.parameters_kurtosis   datawizard
##   as.numeric.parameters_skewness   datawizard
##   as.numeric.parameters_smoothness datawizard
##   print.parameters_distribution    datawizard
##   print.parameters_kurtosis        datawizard
##   print.parameters_skewness        datawizard
##   summary.parameters_kurtosis      datawizard
##   summary.parameters_skewness      datawizard

## You can cite this package as:
##      Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
##      Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167

options(warn = 1)

组间差异比较

两组间比较还是多组比较？
参数还是非参数？

	2组	多组	事后两两检验
参数	t-检验	ANOVA	F检验；Gam es-Howell检验
非参数	wilcoxon秩和（MW检验）	KW检验	Dunn检验

2组间比较

Student’s t-Test

Description: Performs one and two sample t-tests on vectors of data.

适用条件：①under normality assumptions ②large samples（多大叫大？）

说明：比较的是均值，一般不是特别偏态分布就可以用

Wilcoxon Rank Sum and Signed Rank Tests

Description: Performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as Mann-Whitney test.

适用条件：①不符合t-test的条件（不满足正态分布）；②因变量为等级、分类变量时；③是t-test的替代品，效力不如t-test

示例

## 载入数据
load("datt.Rdata")
head(datt)

##   agegroup Stage Expression
## 1      >65    IV   4.523562
## 2      >65    IV   5.000000
## 3     <=65    IV   3.169925
## 4     <=65    IV   1.000000
## 5     <=65    IV   5.700440
## 6      >65    IV   6.209453

summary(datt)

##  agegroup    Stage       Expression    
##  <=65:324   I   : 19   Min.   : 0.000  
##  >65 :175   II  : 95   1st Qu.: 2.000  
##  NA's:  1   III :102   Median : 3.459  
##             IV  :270   Mean   : 3.822  
##             NA's: 14   3rd Qu.: 5.459  
##                        Max.   :12.783

dat <- datt %>% dplyr::select(agegroup,Expression) %>% na.omit()
# 参数检验：
ggbetweenstats(data=dat,x=agegroup,y=Expression,type = "p")

t.test(Expression~agegroup,data = dat)

## 
##  Welch Two Sample t-test
## 
## data:  Expression by agegroup
## t = -0.24934, df = 404.91, p-value = 0.8032
## alternative hypothesis: true difference in means between group <=65 and group >65 is not equal to 0
## 95 percent confidence interval:
##  -0.4929888  0.3820058
## sample estimates:
## mean in group <=65  mean in group >65 
##           3.802547           3.858039

说明：

ggbetweenstats()和t-test结果一致

ggbetweenstats()和t.test()默认方差不等，默认使用Welch’s t-test(即方差不齐，适合2组观测数量不等时使用的Student’s t-test的改良版)

可以通过var.equal=TRUE设置方差齐平

Welch’s t-test From Wikipedia: In statistics, Welch’s t-test, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. It is named for its creator, Bernard Lewis Welch, is an adaptation of Student’s t-test, and is more reliable when the two samples have unequal variances and/or unequal sample sizes. These tests are often referred to as “unpaired” or “independent samples” t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping. Given that Welch’s t-test has been less popular than Student’s t-test and may be less familiar to readers, a more informative name is “Welch’s unequal variances t-test” — or “unequal variances t-test” for brevity.

ggbetweenstats(data=dat,x=agegroup,y=Expression,type = "p",var.equal = TRUE)

t.test(Expression~agegroup,data = dat,var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  Expression by agegroup
## t = -0.23837, df = 497, p-value = 0.8117
## alternative hypothesis: true difference in means between group <=65 and group >65 is not equal to 0
## 95 percent confidence interval:
##  -0.5128847  0.4019017
## sample estimates:
## mean in group <=65  mean in group >65 
##           3.802547           3.858039

两者结果再次一致，此时ggbetweenstats()进行的是经典的Student's t-test

但是这种检验合理吗？Student’s t-test 要求符合正态分布或大样本

正态分布检验

方法： shapiro-wilk正态性检验

假设： H0：总体符合正态分布 VS 总体不符合正态分布

检验统计量：W

条件： P<0.05拒绝H0

# 正态分布检验
## QQ图检验：
x <- dat$Expression
#qqnorm(x)+qqline(x)

## shapiro-wilk test
shapiro.test(x)

## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.96137, p-value = 3.639e-10

点距离线较远且p<0.05,拒绝H0（符合正态分布），即非正态

方差齐性检验

#非正态分布数据，用Levene's test
library(car)
leveneTest(Expression~agegroup,data=dat)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  5.6498 0.01784 *
##       497                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pr(>F) < 0.01拒绝H0，方差不齐所以不能用Student’s t-test

MW秩和检验

效力低于Welch’s t-test

ggbetweenstats(data=dat,x=agegroup,y=Expression,type = "n")

wilcox.test(Expression~agegroup,data = dat)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Expression by agegroup
## W = 27037, p-value = 0.3927
## alternative hypothesis: true location shift is not equal to 0

多组间比较

Fit an Analysis of Variance Model（ANOVA）

Description: Fit an analysis of variance model by a call to lm for each stratum.

适用条件：正态分布、方差相齐

Kruskal-Wallis Rank Sum Test(KW test)

适用条件：多组间比较差异是否显著+不满足ANOVA条件

事后两两比较

参数：Games-Howell比较泛用，方差不齐也可以

非参数：Dunn检验

示例

## 方差齐性检验
#非正态分布数据，用Levene's test
dat <- datt %>% dplyr::select(Stage,Expression) %>% na.omit()
leveneTest(Expression~Stage,data=dat)#方差齐平

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3  0.0152 0.9974
##       482

## one-way ANOVA
ggbetweenstats(data = dat,x=Stage,y=Expression,type = "p",var.equal = TRUE)

## p=0.268
oneway.test(Expression~Stage,data = dat,var.equal = TRUE)

## 
##  One-way analysis of means
## 
## data:  Expression and Stage
## F = 1.3162, num df = 3, denom df = 482, p-value = 0.2684

结果一致,差异性用one-way ANOVA（Fisher 检验），事后组间两两比较用了Student’ t-test（此处应该是没办法改的？）

但是如果不设置方差齐平，则会默认进行Welch’s ANOVA检测

ggbetweenstats(data = dat,x=Stage,y=Expression,type = "p",var.equal = FALSE)

## p=0.310
oneway.test(Expression~Stage,data = dat,var.equal = FALSE)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  Expression and Stage
## F = 1.2165, num df = 3.00, denom df = 75.78, p-value = 0.3096

此时组间两两比较默认适用比较宽泛的Games-Howell test

## 非参数检验
ggbetweenstats(data=dat,x=Stage,y=Expression,type = "n")

p=0.234
#差异性检验
kruskal.test(Expression~Stage,data = dat)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Expression by Stage
## Kruskal-Wallis chi-squared = 4.2719, df = 3, p-value = 0.2336

#组间两两比较
#perform Dunn's Test with Bonferroni correction for p-values
dunnTest(Expression ~ Stage,
         data=dat,
         method="holm")

## Dunn (1964) Kruskal-Wallis multiple comparison

##   p-values adjusted with the Holm method.

##   Comparison          Z    P.unadj     P.adj
## 1     I - II -1.7566221 0.07898224 0.4738934
## 2    I - III -1.4112275 0.15817753 0.6327101
## 3   II - III  0.6230429 0.53325631 0.5332563
## 4     I - IV -1.1029377 0.27005421 0.8101626
## 5    II - IV  1.5062327 0.13200746 0.6600373
## 6   III - IV  0.7816220 0.43443673 0.8688735

多组非参数检验适用KW秩和检验，事后组间两两比较使用Dunn test ### 总结： * ggbetweenstats 默认方差不齐，可以通过var.equal=TRUE设置，同时检验方法会改变

决定ggbetweenstats的统计方法的参数有：
- type(参数检验还是秩和检验，还可以选robust——稳健性检验;bayes——完全是另一个学派)；
- var.equal方差是否相齐：两组间比较时：Student-t | Welch’s-t; 方差分析： Fisher(经典单因素方差分析) | Welch’s ANOVA；事后组间比较：Games-Well | Student‘s t-test
- p.adjust.method:P值矫正: 默认“holm”; 多组比较推荐“fdr”
Adjustment method for p-values for multiple comparisons. Possible methods are: “holm” (default), “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”.
centrality.type中心点类型会随type变化：

“parameteric” (for mean)

“nonparametric” (for median)

“robust” (for trimmed mean)

“bayes” (for MAP estimator)

# ggstatsplot的统计原理与实操