For this exercise, we are using the following dataset:
Download data_example2.xlsx here
Import dataset “data_example2.xlsx” and rename as data2
to import data, we can use import button as in the environment window or we can use “readxl” package
library(readxl)
data2=read_excel("data_example2.xlsx")
attach data2 to access the variables present in the data2 without calling the data frame
attach(data2)
Pearson Chi Square Test / Fisher’s Exact Test
Example:
Association between hypertension in preganancy and gestational age
Q:Is hypertension in preganancy is affected by gestational age?
first, we need to install package “gmodels” and load the package using function ‘library’
library(gmodels)
using chisq.test function chi square test statistics
chisq.test(gestwks_grp,hypt) #check warning whether this test is appropriate or not
##
## Pearson's Chi-squared test
##
## data: gestwks_grp and hypt
## X-squared = 27.332, df = 3, p-value = 5.016e-06
if warning appeared, use fisher.test function
fisher.test(gestwks_grp,hypt)
##
## Fisher's Exact Test for Count Data
##
## data: gestwks_grp and hypt
## p-value = 6.976e-06
## alternative hypothesis: two.sided
*Notes:
Fisher’s Exact test was used when the assumptions of Pearson Chi-Square test were not met
Assumptions of Pearson Chi-square Test :
The number of cells with Expected Count (EC) less than 5, must be less than 25% of the total number of cells
The smallest EC must be at least 2
Independent t-test / Mann-Whitney U Test
Example 1:
Comparing birthweight by gender
first, we need to explore the data distribution to decide the appropriate test
need to install and load package “e1071” is for aggregate function
library(e1071)
using aggregate function to explore the distribution of data birthweight by gender
aggregate(bweight,by=list(sex),skewness,na.rm=TRUE,type=2) #skewness #type= 2:spss format
## Group.1 x
## 1 1 -0.8906878
## 2 2 -1.1561681
aggregate(bweight,by=list(sex),kurtosis,na.rm=TRUE,type=2) #kurtosis #type= 2:spss format
## Group.1 x
## 1 1 1.666833
## 2 2 2.124618
*Notes:
Distributions of numerical data can be checked using skewness(skew) and kurtosis values.
The data is normally distributed if skewness and kurtosis value lies between -1 to +11 and -3 to +32 respectively.
If the data is normally distributed, mean and standard deviation(sd) should be reported.
Median and interquartile range (IQR), if otherwise.
Inspection of skewness and kurtosis revealed that birthweight was not normally distributed in at least 1 of 2 gender groups. Therefore, a mann-whitney u test was run on the data.
Bulmer, M. G. (1979), Principles of Statistics. NY:Dover Books on Mathematics.
Kevin P. Balanda and H.L. MacGillivray. “Kurtosis: A Critical Review”. The American Statistician 42:2 [May 1988], pp 111-119
Mann-whitney U test for distribution difference between gender
wilcox.test(bweight~sex)
##
## Wilcoxon rank sum test with continuity correction
##
## data: bweight by sex
## W = 59774, p-value = 0.0003234
## alternative hypothesis: true location shift is not equal to 0
Example 2:
Comparing birthweight by hypertension in pregnancy
first, we need to explore the data distribution to decide the appropriate test
need to install and load package “e1071” is for aggregate function
library(e1071)
using aggregate function to explore the distribution of data birthweight by hypertension in pregnancy
aggregate(bweight,by=list(hypt),skewness,na.rm=TRUE,type=2) #skewness #type 2:spss format
## Group.1 x
## 1 1 -0.7423578
## 2 2 -0.8484070
aggregate(bweight,by=list(hypt),kurtosis,na.rm=TRUE,type=2) #kurtosis #type 2:spss format
## Group.1 x
## 1 1 0.03558354
## 2 2 1.99954957
*Notes:
Inspection of skewness and kurtosis revealed that birthweight was normally distributed in all 2 hypertension in pregnancy groups. Therefore, a Independent t-test was run on the data.
1 Bulmer, M. G. (1979), Principles of Statistics. NY:Dover Books on Mathematics.
2 Kevin P. Balanda and H.L. MacGillivray. “Kurtosis: A Critical Review”. The American Statistician 42:2 [May 1988], pp 111-119
Test mean difference of birthweight by hypertension in pregnancy
Step1:Homogeneity test for bweight by hypertension in pregnancy using Levene Test
need to install package “car” for leveneTest function
library("car")
## Loading required package: carData
first, since hypt is in numeric structure, need to change to factor
hypt=as.factor(hypt)
leveneTest(bweight~hypt,center=mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 18.319 2.155e-05 ***
## 639
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 2:Make decision from levene test
if p.value if p.value >0.05, equal variances is assumed.
if p.value if p.value <0.05, equal variances is NOT assumed.
to generate independent t-test
t.test(bweight~hypt,var.equal=FALSE) #var.equal=FALSE , since p-value of equal variance test <0.05
##
## Welch Two Sample t-test
##
## data: bweight by hypt
## t = -4.9991, df = 104.07, p-value = 2.341e-06
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -627.6273 -271.1197
## sample estimates:
## mean in group 1 mean in group 2
## 2742.157 3191.531
ANOVA / Kruskall Wallis Test
Example 1:
Comparing birthweight among maternal age group
first, we need to explore the data distribution to decide the appropriate test
need to install and load package “e1071” is for aggregate function
library(e1071)
using aggregate function to explore the distribution of data birthweight by maternal age group
aggregate(bweight,by=list(matage_grp),skewness,na.rm=TRUE,type=2) #skewness
## Group.1 x
## 1 1 -0.9375413
## 2 2 -0.9324707
## 3 3 -1.0135990
## 4 4 -0.8704457
aggregate(bweight,by=list(matage_grp),skewness,na.rm=TRUE,type=2) #kurtosis
## Group.1 x
## 1 1 -0.9375413
## 2 2 -0.9324707
## 3 3 -1.0135990
## 4 4 -0.8704457
*Notes:
Distributions of numerical data can be checked using skewness(skew) and kurtosis values.
The data is normally distributed if skewness and kurtosis value lies between -1 to +11 and -3 to +32 respectively.
If the data is normally distributed, mean and standard deviation(sd) should be reported.
Median and interquartile range (IQR), if otherwise.
Inspection of skewness and kurtosis revealed that birthweight was not normally distributed in at least 1 of 4 maternal age groups. Therefore, a kruskall-wallis test was run on the data.
Bulmer, M. G. (1979), Principles of Statistics. NY:Dover Books on Mathematics.
Kevin P. Balanda and H.L. MacGillivray. “Kurtosis: A Critical Review”. The American Statistician 42:2 [May 1988], pp 111-119
kruskal-wallis test for distribution difference between maternal age group
kruskal.test(bweight~matage_grp)
##
## Kruskal-Wallis rank sum test
##
## data: bweight by matage_grp
## Kruskal-Wallis chi-squared = 0.24263, df = 3, p-value = 0.9704
Example 2:
Comparing birthweight among gestational age group
first, we need to explore the data distribution to decide the appropriate test
need to install and load package “e1071” is for aggregate function
library(e1071)
using aggregate function to explore the distribution of data birthweight by maternal age group
aggregate(bweight,by=list(gestwks_grp),skewness,na.rm=TRUE,type=2) #skewness
## Group.1 x
## 1 1 -0.28600574
## 2 2 -0.20891565
## 3 3 -0.05784087
## 4 4 0.32891139
aggregate(bweight,by=list(gestwks_grp),kurtosis,na.rm=TRUE,type=2) #kurtosis
## Group.1 x
## 1 1 -0.65222924
## 2 2 0.38596809
## 3 3 1.09863570
## 4 4 -0.04515041
*Notes:
Inspection of skewness and kurtosis revealed that birthweight was normally distributed in all 4 gestational age groups. Therefore, a one-way anova test was run on the data.
1 Bulmer, M. G. (1979), Principles of Statistics. NY:Dover Books on Mathematics.
2 Kevin P. Balanda and H.L. MacGillivray. “Kurtosis: A Critical Review”. The American Statistician 42:2 [May 1988], pp 111-119
Test mean difference of birthweight between gestational age group
Step1:Homogeneity test for bweight by gestational age group
library(lawstat)
##
## Attaching package: 'lawstat'
## The following object is masked from 'package:car':
##
## levene.test
levene.test(bweight,gestwks_grp,location="mean")
##
## Classical Levene's test based on the absolute deviations from the mean
## ( none not applied because the location is not set to median )
##
## data: bweight
## Test Statistic = 18.486, p-value = 1.649e-11
kruskal.test(bweight~matage_grp)
##
## Kruskal-Wallis rank sum test
##
## data: bweight by matage_grp
## Kruskal-Wallis chi-squared = 0.24263, df = 3, p-value = 0.9704
Step 2:Make decision from levene test
if p.value from levene test is >0.05, homogeneity is assumed.Hence we can use anova test
to generate anova test
fit=aov(bweight~gestwks_grp,data=data2)
summary(fit) #to display result of anova test
## Df Sum Sq Mean Sq F value Pr(>F)
## gestwks_grp 1 99079210 99079210 364.6 <2e-16 ***
## Residuals 639 173640912 271739
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
if p.value from levene test is <0.05,homogeneity is NOT assumed.Hence we use welch test
library(lawstat) #load lawstat package
oneway.test(bweight~gestwks_grp) #welch test
##
## One-way analysis of means (not assuming equal variances)
##
## data: bweight and gestwks_grp
## F = 97.285, num df = 3.00, denom df = 170.74, p-value < 2.2e-16
Step 3:Multiple comparison or post hoc test
if anova test give significant result (pvalue<0.05), we can do further analysis with bonferroni test
pairwise.t.test(bweight,gestwks_grp,p.adj="bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: bweight and gestwks_grp
##
## 1 2 3
## 2 < 2e-16 - -
## 3 < 2e-16 1.9e-11 -
## 4 < 2e-16 2.8e-09 0.14
##
## P value adjustment method: bonferroni
if welch, we can do further analysis with games-howell test
source("http://aoki2.si.gunma-u.ac.jp/R/src/tukey.R") # need to upload script
tukey(bweight,gestwks_grp,method="Games-Howell") #games-howell test
## $result1
## n Mean Variance
## Group1 89 2151.067 517760.2
## Group2 198 3064.101 210502.3
## Group3 303 3386.739 203858.3
## Group4 51 3558.000 173190.7
##
## $Games.Howell
## t df p
## 1:2 11.007059 121.29311 5.773160e-14
## 1:3 15.338037 109.10625 4.329870e-14
## 1:4 14.657595 137.97658 2.298162e-14
## 2:3 7.743694 416.44177 0.000000e+00
## 2:4 7.396350 84.11482 7.383967e-10
## 3:4 2.684913 71.31131 4.368968e-02
Pearson Correlation / Spearman Correlation
Example 1:
Correlation between birthweight and maternal age
first, we need to explore the correlation pattern to decide the appropriate test
use function plot to produce scatter plot
plot(bweight,matage)
*Notes:
scatter plot will show pattern of homoscedasticity or heteroscedasticity as below examples
if the plot show homosceasticity pattern then pearson correlation will be used.
otherwise, spearman correlation is used
need to install and load package “ggpubr” for cor.test function
library(ggpubr)
## Loading required package: ggplot2
cor.test(bweight,matage,method="spearman",use="complete.obs")
## Warning in cor.test.default(bweight, matage, method = "spearman", use =
## "complete.obs"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: bweight and matage
## S = 42799650, p-value = 0.528
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.02496897
Example 1:
Correlation between birthweight and gestational age
first, we need to explore the correlation pattern to decide the appropriate test
use function plot to produce scatter plot
plot(bweight,gestwks)
Since the plot of birthweight and gestational age has homoscedasticity pattern, therefore pearson correlation was used
need to install and load package “ggpubr” for cor.test function
library(ggpubr)
cor.test(bweight,gestwks,method="pearson",use="complete.obs")
##
## Pearson's product-moment correlation
##
## data: bweight and gestwks
## t = 27.609, df = 639, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7000965 0.7709564
## sample estimates:
## cor
## 0.7375501