For this exercise, we are using the following dataset:

Download data_example2.xlsx here

Import dataset “data_example2.xlsx” and rename as data2

to import data, we can use import button as in the environment window or we can use “readxl” package

library(readxl)
data2=read_excel("data_example2.xlsx")

attach data2 to access the variables present in the data2 without calling the data frame

attach(data2)

Basic Inferential Statistical Tests

Association Between Categorical Variables

Pearson Chi Square Test / Fisher’s Exact Test

Example:

Association between hypertension in preganancy and gestational age

Q:Is hypertension in preganancy is affected by gestational age?

first, we need to install package “gmodels” and load the package using function ‘library’

library(gmodels) 

using chisq.test function chi square test statistics

chisq.test(gestwks_grp,hypt) #check warning whether this test is appropriate or not
## 
##  Pearson's Chi-squared test
## 
## data:  gestwks_grp and hypt
## X-squared = 27.332, df = 3, p-value = 5.016e-06

if warning appeared, use fisher.test function

fisher.test(gestwks_grp,hypt)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  gestwks_grp and hypt
## p-value = 6.976e-06
## alternative hypothesis: two.sided

*Notes:

Fisher’s Exact test was used when the assumptions of Pearson Chi-Square test were not met

Assumptions of Pearson Chi-square Test :

  1. The number of cells with Expected Count (EC) less than 5, must be less than 25% of the total number of cells

  2. The smallest EC must be at least 2

Compare differences between two independent groups when the dependent variable is numerical

Independent t-test / Mann-Whitney U Test

Example 1:

Comparing birthweight by gender

first, we need to explore the data distribution to decide the appropriate test

need to install and load package “e1071” is for aggregate function

library(e1071) 

using aggregate function to explore the distribution of data birthweight by gender

aggregate(bweight,by=list(sex),skewness,na.rm=TRUE,type=2) #skewness #type= 2:spss format
##   Group.1          x
## 1       1 -0.8906878
## 2       2 -1.1561681
aggregate(bweight,by=list(sex),kurtosis,na.rm=TRUE,type=2) #kurtosis #type= 2:spss format
##   Group.1        x
## 1       1 1.666833
## 2       2 2.124618

*Notes:

Distributions of numerical data can be checked using skewness(skew) and kurtosis values.

The data is normally distributed if skewness and kurtosis value lies between -1 to +11 and -3 to +32 respectively.

If the data is normally distributed, mean and standard deviation(sd) should be reported.

Median and interquartile range (IQR), if otherwise.

Inspection of skewness and kurtosis revealed that birthweight was not normally distributed in at least 1 of 2 gender groups. Therefore, a mann-whitney u test was run on the data.

  1. Bulmer, M. G. (1979), Principles of Statistics. NY:Dover Books on Mathematics.

  2. Kevin P. Balanda and H.L. MacGillivray. “Kurtosis: A Critical Review”. The American Statistician 42:2 [May 1988], pp 111-119

Mann-whitney U test for distribution difference between gender

wilcox.test(bweight~sex) 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  bweight by sex
## W = 59774, p-value = 0.0003234
## alternative hypothesis: true location shift is not equal to 0

Example 2:

Comparing birthweight by hypertension in pregnancy

first, we need to explore the data distribution to decide the appropriate test

need to install and load package “e1071” is for aggregate function

library(e1071) 

using aggregate function to explore the distribution of data birthweight by hypertension in pregnancy

aggregate(bweight,by=list(hypt),skewness,na.rm=TRUE,type=2) #skewness #type 2:spss format
##   Group.1          x
## 1       1 -0.7423578
## 2       2 -0.8484070
aggregate(bweight,by=list(hypt),kurtosis,na.rm=TRUE,type=2) #kurtosis #type 2:spss format
##   Group.1          x
## 1       1 0.03558354
## 2       2 1.99954957

*Notes:

Inspection of skewness and kurtosis revealed that birthweight was normally distributed in all 2 hypertension in pregnancy groups. Therefore, a Independent t-test was run on the data.

1 Bulmer, M. G. (1979), Principles of Statistics. NY:Dover Books on Mathematics.

2 Kevin P. Balanda and H.L. MacGillivray. “Kurtosis: A Critical Review”. The American Statistician 42:2 [May 1988], pp 111-119

Test mean difference of birthweight by hypertension in pregnancy

Step1:Homogeneity test for bweight by hypertension in pregnancy using Levene Test

need to install package “car” for leveneTest function

library("car")
## Loading required package: carData

first, since hypt is in numeric structure, need to change to factor

hypt=as.factor(hypt)
leveneTest(bweight~hypt,center=mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##        Df F value    Pr(>F)    
## group   1  18.319 2.155e-05 ***
##       639                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 2:Make decision from levene test

if p.value if p.value >0.05, equal variances is assumed.

if p.value if p.value <0.05, equal variances is NOT assumed.

to generate independent t-test

t.test(bweight~hypt,var.equal=FALSE) #var.equal=FALSE , since p-value of equal variance test <0.05
## 
##  Welch Two Sample t-test
## 
## data:  bweight by hypt
## t = -4.9991, df = 104.07, p-value = 2.341e-06
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -627.6273 -271.1197
## sample estimates:
## mean in group 1 mean in group 2 
##        2742.157        3191.531

Compare differences between 3 or more independent groups when the dependent variable is numerical

ANOVA / Kruskall Wallis Test

Example 1:

Comparing birthweight among maternal age group

first, we need to explore the data distribution to decide the appropriate test

need to install and load package “e1071” is for aggregate function

library(e1071) 

using aggregate function to explore the distribution of data birthweight by maternal age group

aggregate(bweight,by=list(matage_grp),skewness,na.rm=TRUE,type=2) #skewness
##   Group.1          x
## 1       1 -0.9375413
## 2       2 -0.9324707
## 3       3 -1.0135990
## 4       4 -0.8704457
aggregate(bweight,by=list(matage_grp),skewness,na.rm=TRUE,type=2) #kurtosis
##   Group.1          x
## 1       1 -0.9375413
## 2       2 -0.9324707
## 3       3 -1.0135990
## 4       4 -0.8704457

*Notes:

Distributions of numerical data can be checked using skewness(skew) and kurtosis values.

The data is normally distributed if skewness and kurtosis value lies between -1 to +11 and -3 to +32 respectively.

If the data is normally distributed, mean and standard deviation(sd) should be reported.

Median and interquartile range (IQR), if otherwise.

Inspection of skewness and kurtosis revealed that birthweight was not normally distributed in at least 1 of 4 maternal age groups. Therefore, a kruskall-wallis test was run on the data.

  1. Bulmer, M. G. (1979), Principles of Statistics. NY:Dover Books on Mathematics.

  2. Kevin P. Balanda and H.L. MacGillivray. “Kurtosis: A Critical Review”. The American Statistician 42:2 [May 1988], pp 111-119

kruskal-wallis test for distribution difference between maternal age group

kruskal.test(bweight~matage_grp) 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  bweight by matage_grp
## Kruskal-Wallis chi-squared = 0.24263, df = 3, p-value = 0.9704

Example 2:

Comparing birthweight among gestational age group

first, we need to explore the data distribution to decide the appropriate test

need to install and load package “e1071” is for aggregate function

library(e1071) 

using aggregate function to explore the distribution of data birthweight by maternal age group

aggregate(bweight,by=list(gestwks_grp),skewness,na.rm=TRUE,type=2) #skewness
##   Group.1           x
## 1       1 -0.28600574
## 2       2 -0.20891565
## 3       3 -0.05784087
## 4       4  0.32891139
aggregate(bweight,by=list(gestwks_grp),kurtosis,na.rm=TRUE,type=2) #kurtosis
##   Group.1           x
## 1       1 -0.65222924
## 2       2  0.38596809
## 3       3  1.09863570
## 4       4 -0.04515041

*Notes:

Inspection of skewness and kurtosis revealed that birthweight was normally distributed in all 4 gestational age groups. Therefore, a one-way anova test was run on the data.

1 Bulmer, M. G. (1979), Principles of Statistics. NY:Dover Books on Mathematics.

2 Kevin P. Balanda and H.L. MacGillivray. “Kurtosis: A Critical Review”. The American Statistician 42:2 [May 1988], pp 111-119

Test mean difference of birthweight between gestational age group

Step1:Homogeneity test for bweight by gestational age group

library(lawstat)
## 
## Attaching package: 'lawstat'
## The following object is masked from 'package:car':
## 
##     levene.test
levene.test(bweight,gestwks_grp,location="mean")
## 
##  Classical Levene's test based on the absolute deviations from the mean
##  ( none not applied because the location is not set to median )
## 
## data:  bweight
## Test Statistic = 18.486, p-value = 1.649e-11
kruskal.test(bweight~matage_grp) 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  bweight by matage_grp
## Kruskal-Wallis chi-squared = 0.24263, df = 3, p-value = 0.9704

Step 2:Make decision from levene test

if p.value from levene test is >0.05, homogeneity is assumed.Hence we can use anova test

to generate anova test

fit=aov(bweight~gestwks_grp,data=data2)
summary(fit) #to display result of anova test
##              Df    Sum Sq  Mean Sq F value Pr(>F)    
## gestwks_grp   1  99079210 99079210   364.6 <2e-16 ***
## Residuals   639 173640912   271739                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

if p.value from levene test is <0.05,homogeneity is NOT assumed.Hence we use welch test

library(lawstat) #load lawstat package
oneway.test(bweight~gestwks_grp) #welch test
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  bweight and gestwks_grp
## F = 97.285, num df = 3.00, denom df = 170.74, p-value < 2.2e-16

Step 3:Multiple comparison or post hoc test

if anova test give significant result (pvalue<0.05), we can do further analysis with bonferroni test

pairwise.t.test(bweight,gestwks_grp,p.adj="bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  bweight and gestwks_grp 
## 
##   1       2       3   
## 2 < 2e-16 -       -   
## 3 < 2e-16 1.9e-11 -   
## 4 < 2e-16 2.8e-09 0.14
## 
## P value adjustment method: bonferroni

if welch, we can do further analysis with games-howell test

source("http://aoki2.si.gunma-u.ac.jp/R/src/tukey.R") # need to upload script
tukey(bweight,gestwks_grp,method="Games-Howell") #games-howell test
## $result1
##          n     Mean Variance
## Group1  89 2151.067 517760.2
## Group2 198 3064.101 210502.3
## Group3 303 3386.739 203858.3
## Group4  51 3558.000 173190.7
## 
## $Games.Howell
##             t        df            p
## 1:2 11.007059 121.29311 5.773160e-14
## 1:3 15.338037 109.10625 4.329870e-14
## 1:4 14.657595 137.97658 2.298162e-14
## 2:3  7.743694 416.44177 0.000000e+00
## 2:4  7.396350  84.11482 7.383967e-10
## 3:4  2.684913  71.31131 4.368968e-02

Correlation between 2 numerical data

Pearson Correlation / Spearman Correlation

Example 1:

Correlation between birthweight and maternal age

first, we need to explore the correlation pattern to decide the appropriate test

use function plot to produce scatter plot

plot(bweight,matage)

*Notes:

scatter plot will show pattern of homoscedasticity or heteroscedasticity as below examples

if the plot show homosceasticity pattern then pearson correlation will be used.

otherwise, spearman correlation is used

need to install and load package “ggpubr” for cor.test function

library(ggpubr)
## Loading required package: ggplot2
cor.test(bweight,matage,method="spearman",use="complete.obs")
## Warning in cor.test.default(bweight, matage, method = "spearman", use =
## "complete.obs"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  bweight and matage
## S = 42799650, p-value = 0.528
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.02496897

Example 1:

Correlation between birthweight and gestational age

first, we need to explore the correlation pattern to decide the appropriate test

use function plot to produce scatter plot

plot(bweight,gestwks)

Since the plot of birthweight and gestational age has homoscedasticity pattern, therefore pearson correlation was used

need to install and load package “ggpubr” for cor.test function

library(ggpubr)
cor.test(bweight,gestwks,method="pearson",use="complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  bweight and gestwks
## t = 27.609, df = 639, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7000965 0.7709564
## sample estimates:
##       cor 
## 0.7375501