knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Previous sections:

  1. Clustering practice: https://rpubs.com/minhtri/923711
  2. Table Descriptive Analysis: https://rpubs.com/minhtri/929114
  3. Imputation - Dealing with missing data: https://rpubs.com/minhtri/968586
  4. The R environment - Data Wrangling: https://rpubs.com/minhtri/968607
  5. Correlation analysis: https://rpubs.com/minhtri/968611

 

Note: This analysis is for personal study purposes. In this section, I summarize some basic commands for assessing normality, based on several online sources.

 

R Packages:

# More packages will be shown during the analysis process
# Load the required library
library(tidyverse)    # Data Wrangling
library(conflicted)   # Dealing with conflict package
library(readxl)       # Read Excel files

 

Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions conflict. R’s default conflict resolution gives precedence to the most recently loaded package. This can make conflicts hard to detect, particularly when one is introduced by an update to an existing package.

  • Using the code below helps the entire section run properly. You may or may not need to look into the conflicted package for your work.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")

 

1 Simulated fakestroke data

Data used in these notes:
fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)

Load the data and set the type of each column:

# Load the data and set the type of each column
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx", 
                 col_types = c("text", "text", "numeric",
                               "text", "numeric", "text", "text", 
                               "numeric", "numeric", "text", "numeric", 
                               "text", "numeric", "numeric", "text", 
                               "numeric", "numeric", "numeric")) 
df = data
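
If you do not have a local xlsx copy, an alternative sketch is to read the CSV directly from the repository. This assumes the usual raw.githubusercontent.com path for the file linked above; readr will guess the column types, so you may still want to set them as done here.

# Alternative (a sketch): read the CSV straight from GitHub with readr
# library(readr)   # already attached via tidyverse
df <- read_csv("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/fakestroke.csv")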

The fakestroke.csv file contains the following 18 variables for 500 patients:

# To make a table in R markdown: 1st row: header, 2nd row: Alignment; the remaining row: for content
## |Variable |  Description |
## |:------- | :----------  |
## |studyid  | Study ID  # (z001 through z500) | 
| Variable  | Description |
|:--------- |:----------- |
| studyid   | Study ID # (z001 through z500) |
| trt       | Treatment group (Intervention or Control) |
| age       | Age in years |
| sex       | Male or Female |
| nihss     | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
| location  | Stroke Location - Left or Right Hemisphere |
| hx.isch   | History of Ischemic Stroke (Yes/No) |
| afib      | Atrial Fibrillation (1 = Yes, 0 = No) |
| dm        | Diabetes Mellitus (1 = Yes, 0 = No) |
| mrankin   | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death) |
| sbp       | Systolic blood pressure, in mm Hg |
| iv.altep  | Treatment with IV alteplase (Yes/No) |
| time.iv   | Time from stroke onset to start of IV alteplase (minutes) if iv.altep = Yes |
| aspects   | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0 - 10; higher scores indicate fewer early ischemic changes |
| ia.occlus | Intracranial arterial occlusion, based on vessel imaging - five categories |
| extra.ica | Extracranial ICA occlusion (1 = Yes, 0 = No) |
| time.rand | Time from stroke onset to study randomization, in minutes |
| time.punc | Time from stroke onset to groin puncture, in minutes (only if Intervention) |

 

2 Assumptions of parametric data

First, I will add a table of parametric and non-parametric tests for independent groups. This table gives an overview of which approach suits your variables:

  • Parametric tests: independent t-test, ANOVA and Pearson’s r.
  • Non-parametric tests: Wilcoxon rank-sum, Kruskal-Wallis and Spearman’s rho.

 

It is very important that you check the assumptions before deciding which statistical test is appropriate. Most parametric tests based on the normal distribution have four basic assumptions that must be met for the test to be accurate. Field et al. (2012) indicate that:

  1. Normally distributed data: In short, the rationale behind hypothesis testing relies on having something that is normally distributed.

  2. Homogeneity of variance: The variances should be the same throughout the data. In designs in which you test several groups of participants this assumption means that each of these samples comes from populations with the same variance. In correlational designs, this assumption means that the variance of one variable should be stable at all levels of the other variable.

  3. Interval data: Data should be measured at least at the interval level. This assumption is tested by common sense and so won’t be discussed further.

  4. Independence: This assumption, like that of normality, is different depending on the test you’re using. In some cases it means that data from different participants are independent, which means that the behaviour of one participant does not influence the behaviour of another. In repeated-measures designs (in which participants are measured in more than one experimental condition), we expect scores in the experimental conditions to be non-independent for a given participant, but behaviour between different participants should be independent.

 

3 Normality testing

According to the notes at thomaselove.github.io, data are well approximated by a Normal distribution if the shape of the data’s distribution is a good match for a Normal distribution with mean and standard deviation equal to the sample statistics: the data are symmetrically distributed about a single peak located at the sample mean.

Several tools for assessing Normality of a single batch of data:

  • A histogram with superimposed Normal distribution.

  • Histogram variants (like the boxplot) which provide information on the center, spread and shape of a distribution.

  • The Empirical Rule for interpreting a standard deviation (see the sketch after this list).

  • A specialized normal Q-Q plot (also called a normal probability plot or normal quantile-quantile plot) designed to reveal differences between a sample distribution and what we might expect from a normal distribution of a similar number of values with the same mean and standard deviation.
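
A minimal sketch of an Empirical Rule check for sbp (assuming df is loaded as above): for Normal data, the shares of observations within 1, 2 and 3 standard deviations of the mean should be near 68%, 95% and 99.7%.

# Empirical Rule check for sbp (a sketch; df as loaded above)
x <- df$sbp[!is.na(df$sbp)]
m <- mean(x); s <- sd(x)
sapply(1:3, function(k) mean(abs(x - m) < k * s))
# Compare with the Normal benchmarks: ~0.68, ~0.95, ~0.997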

     

3.1 Using graphs

3.1.1 Drawing histogram for distribution

According to the same notes, most of the time, when we want to understand whether our data are well approximated by a Normal distribution, we will use a graph to aid the decision.

One option is to build a histogram with a Normal density function (with the same mean and standard deviation as our data) superimposed. This is one way to help visualize deviations between our data and what might be expected from a Normal distribution.

 

Package: tidyverse
Command: ggplot(dataframe, aes(variable)) + geom_histogram(aes(y = ..density..)) + stat_function(fun = dnorm, args = list(mean = mean(numeric variable, na.rm = T), sd = sd(numeric variable, na.rm = T)))

Let’s examine the distributions of sbp and age:

ggplot(df, aes(sbp)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", fill = "white") + 
  stat_function(fun = dnorm, 
                args = list(mean = mean( df$sbp , na.rm = T), sd = sd( df$sbp , na.rm = T)),
                lwd = 1, col = "blue") + 
  labs(x = "sbp (mmHg)", y = "Density") +
  theme_bw()

ggplot(df, aes(age)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", fill = "white") + 
  stat_function(fun = dnorm, 
                args = list(mean = mean( df$age, na.rm = T), sd = sd( df$age, na.rm = T)),
                lwd = 1, col = "blue") + 
  labs(x = "age (years)", y = "Density") +
  theme_bw()

Discuss

  • sbp: The histogram is single-peaked and bell-shaped, and it tracks the superimposed Normal density curve closely.
    => We can say that the sbp distribution looks approximately Normal.

  • age: The distribution looks slightly left-skewed: it seems non-Normal.

  • However, a density plot is not very helpful for understanding tail behavior. I advise drawing a Normal Q-Q plot before reaching a conclusion.

     

3.1.2 Normal Q-Q Plot

In the Q-Q plot, the data are ranked and sorted. Each value is compared with the value it would be expected to have in a normal distribution, and the two are plotted against one another. If the data are normally distributed, the actual scores will match the scores we expect from a normal distribution, and you’ll get a lovely straight diagonal line.
Let’s draw a Q-Q plot for both the sbp and age variables. There are two types of commands for a Q-Q plot. First, let’s use a quick command:

Package: ggpubr

  • Command: ggqqplot(numeric variable, ylab = " ")
# Drawing QQ-plot for sbp:
library(ggpubr)
a = ggqqplot(df$sbp, ylab = "sbp")

# Drawing density plot for sbp:
b= ggplot(df, aes(sbp)) +
    geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + 
    stat_function(fun = dnorm, 
                args = list(mean = mean( df$sbp, na.rm = T), sd = sd( df$sbp, na.rm = T)),
                lwd = 1, col = "blue") + 
    labs(x = "sbp (mmHg)", y = "Density") +
    theme_bw()

# Combine plot: 
library(ggpubr)
plot = ggarrange(a,b, 
                 ncol=2, nrow=1, 
                 common.legend = FALSE,
                 legend="right", 
                 labels = c("A","B"))
annotate_figure(plot, bottom = text_grob("The distribution for sbp", 
              color = "red", face = "bold", size = 12))

# Drawing QQ-plot for age:
library(ggpubr)
a = ggqqplot( df$age, ylab = "age")

# Drawing density plot for age:
b = ggplot(df, aes(age)) +
      geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", fill = "white") + 
      stat_function(fun = dnorm, 
                    args = list(mean = mean( df$age, na.rm = T), sd = sd( df$age, na.rm = T)),
                    lwd = 1, col = "blue") + 
      labs(x = "age (years)", y = "Density") +
      theme_bw()

# Combine plot: 
library(ggpubr)
plot = ggarrange(a,b, 
                 ncol=2, nrow=1, 
                 common.legend = FALSE,
                 legend="right", 
                 labels = c("A","B"))
annotate_figure(plot, bottom = text_grob("The distribution for age", 
              color = "red", face = "bold", size = 12))

 

Let’s try using a more complicated command:

Package: tidyverse

  • Command: ggplot(data = , aes(sample = )) + geom_qq() + geom_qq_line() + labs(x = "", y = "")
# Drawing Q-Q plot for sbp
a = ggplot(data = df, aes(sample = sbp)) + 
      geom_qq() +
      geom_qq_line() +
      labs( x = "Theoretical", y = "sbp")

# Drawing density plot for sbp:
b= ggplot(df, aes(sbp)) +
    geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + 
    stat_function(fun = dnorm, 
                args = list(mean = mean(df$sbp, na.rm = T), sd = sd(df$sbp, na.rm = T)),
                lwd = 1, col = "blue") + 
    labs(x = "sbp (mmHg)", y = "Density") +
    theme_bw()

# Combine plot: 
library(ggpubr)
plot = ggarrange(a,b, 
                 ncol=2, nrow=1, 
                 common.legend = FALSE,
                 legend="right", 
                 labels = c("A","B"))
annotate_figure(plot, bottom = text_grob("The distribution for sbp", 
              color = "red", face = "bold", size = 12))

# Drawing QQ-plot for age:
library(ggpubr)
a = ggplot(data = df, aes(sample = age)) + 
      geom_qq() +
      geom_qq_line() +
      labs( x = "Theoretical", y = "age")

# Drawing density plot for age:
b = ggplot(df, aes(age)) +
      geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", fill = "white") + 
      stat_function(fun = dnorm, 
                    args = list(mean = mean(df$age, na.rm = T), sd = sd(df$age, na.rm = T)),
                    lwd = 1, col = "blue") + 
      labs(x = "age (years)", y = "Density") +
      theme_bw()

# Combine plot: 
library(ggpubr)
plot = ggarrange(a,b, 
                 ncol=2, nrow=1, 
                 common.legend = FALSE,
                 legend="right", 
                 labels = c("A","B"))
annotate_figure(plot, bottom = text_grob("The distribution for age", 
              color = "red", face = "bold", size = 12))

 

Discuss

For sbp: These simulated data appear to be well-modeled by the Normal distribution, because the points on the Normal Q-Q plot follow the diagonal reference line:

  • There is no substantial curve (such as we’d see with data that were skewed).
  • There is no particularly surprising behavior (curves away from the line) at either tail, so there’s no obvious problem with outliers.

For age:

  • The density plot shows left skew, with a longer tail on the left hand side and more clustered data at the right end of the distribution.

  • The Q-Q plot shows that most of the points lie below the straight line, curving down and away from it in both tails => left skew.

  • Conversely, if most of the points on a Q-Q plot lie above the straight line, curving up and away from it in both tails => right skew. The simulation below illustrates the left-skew pattern.
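
To see these patterns on data whose skew is known in advance, here is a small simulation (simulated values, not the fakestroke data):

# Simulated left-skewed data: points should dip below the Q-Q line in both tails
set.seed(432)
sim <- tibble(left_skew = -rexp(500))   # negated exponential => long left tail
ggplot(sim, aes(sample = left_skew)) +
  geom_qq() +
  geom_qq_line() +
  labs(x = "Theoretical", y = "Simulated left-skewed values")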

     

[Figure: Q-Q plot interpretation patterns, from https://thomaselove.github.io/431-notes]

 

3.2 Using p-value

3.2.1 psych package: describe() function

To examine descriptive statistics for the variables.

Package: psych
Command:

  • describe(cbind(numeric var 1, ..., numeric var n), basic = FALSE, norm = TRUE)
  • describe(dataframe[, c("var1", ..., "var n")], basic = FALSE, norm = TRUE)
library(psych)
psych::describe(cbind(df$age, df$sbp))
##    vars   n   mean    sd median trimmed   mad min max range  skew kurtosis   se
## X1    1 500  64.71 17.05  65.75   65.49 15.94  23  96    73 -0.33    -0.40 0.76
## X2    2 499 145.48 25.14 145.00  145.44 25.20  78 231   153  0.05    -0.11 1.13

Discuss

  • For age, the skew and kurtosis values are away from 0 => a high chance of a non-normal distribution.
  • For sbp, the skew and kurtosis values are close to 0 => consistent with a normal distribution.

=> We should use another command to obtain a p-value.

 

3.2.2 psych package: describe.by() function

To examine descriptive statistics for variables, overall or split by the levels of a factor. (In current versions of psych, describe.by() is deprecated in favor of describeBy().)

Package: psych
Command:

  • describe.by(dataframe, INDICES = categorical variable)
  • describe.by(continuous variable, categorical variable)
library(psych)
describe.by(cbind(df$age, df$sbp))
##    vars   n   mean    sd median trimmed   mad min max range  skew kurtosis   se
## X1    1 500  64.71 17.05  65.75   65.49 15.94  23  96    73 -0.33    -0.40 0.76
## X2    2 499 145.48 25.14 145.00  145.44 25.20  78 231   153  0.05    -0.11 1.13
describe.by(cbind(df$age, df$sbp), df$trt)
## 
##  Descriptive statistics by group 
## INDICES: Control
##    vars   n   mean   sd median trimmed   mad min max range  skew kurtosis   se
## V1    1 267  65.38 16.1   65.7    66.1 15.27  24  94    70 -0.29    -0.32 0.99
## V2    2 266 145.00 24.4  145.0   144.8 24.46  82 231   149  0.15     0.03 1.50
## ------------------------------------------------------------ 
## INDICES: Intervention
##    vars   n   mean    sd median trimmed   mad min max range  skew kurtosis   se
## V1    1 233  63.93 18.09   65.8   64.76 16.01  23  96    73 -0.33    -0.56 1.19
## V2    2 233 146.03 26.00  146.0  146.14 26.69  78 214   136 -0.06    -0.26 1.70

 

3.2.3 pastecs package: stat.desc() function (recommended)

To quickly examine descriptive statistics for the variables.
Package: pastecs
Command:

  • stat.desc(cbind(numeric var 1, ..., numeric var n), basic = TRUE, desc = TRUE, norm = FALSE, p = 0.95)
  • stat.desc(dataframe[, c("var1", ..., "var n")], basic = TRUE, desc = TRUE, norm = FALSE, p = 0.95)
library(pastecs)
stat.desc(cbind(df$age, df$sbp), basic=F, norm=T, p=0.95)
##                         V1           V2
## median        6.575000e+01 145.00000000
## mean          6.470580e+01 145.47895792
## SE.mean       7.627075e-01   1.12525785
## CI.mean.0.95  1.498514e+00   2.21083795
## var           2.908613e+02 631.83640373
## std.dev       1.705466e+01  25.13635621
## coef.var      2.635723e-01   0.17278345
## skewness     -3.305647e-01   0.04582062
## skew.2SE     -1.513328e+00   0.20955860
## kurtosis     -3.970448e-01  -0.11045899
## kurt.2SE     -9.106278e-01  -0.25308825
## normtest.W    9.744833e-01   0.99840676
## normtest.p    1.183624e-07   0.93364481

Discuss

  • For age (V1), the skew and kurtosis values are away from 0 and the p-value of the normality test is < 0.05 => non-normal distribution.
  • For sbp (V2), the skew and kurtosis values are close to 0 and the p-value of the normality test is 0.93 > 0.05 => consistent with a normal distribution. (Here normtest.W and normtest.p are the Shapiro-Wilk statistic and its p-value; skew.2SE and kurt.2SE are the skewness and kurtosis divided by twice their standard errors, so absolute values above 1 are significant at p < .05.)

=> This result should be used together with the Q-Q plot and density plot before concluding about normality.

 

3.2.4 by() function with pastecs::stat.desc() (recommended)

To quickly examine descriptive statistics within each level of a grouping factor.
Package: pastecs
Command:

  • by(data = outcome (continuous) variable, INDICES = grouping variable, FUN = the function you want to apply to each group)
library(pastecs)
by(cbind(df$age, df$sbp), df$trt, stat.desc, basic=F, norm=T )
## INDICES: Control
##                         V1           V2
## median        6.570000e+01 145.00000000
## mean          6.538240e+01 144.99624060
## SE.mean       9.853615e-01   1.49577831
## CI.mean.0.95  1.940100e+00   2.94512211
## var           2.592403e+02 595.13583487
## std.dev       1.610094e+01  24.39540602
## coef.var      2.462580e-01   0.16824854
## skewness     -2.927944e-01   0.15427514
## skew.2SE     -9.820483e-01   0.51648808
## kurtosis     -3.177592e-01   0.02906420
## kurt.2SE     -5.348310e-01   0.04882888
## normtest.W    9.699551e-01   0.99675936
## normtest.p    2.132393e-05   0.86710009
## ------------------------------------------------------------ 
## INDICES: Intervention
##                         V1          V2
## median        65.800000000 146.0000000
## mean          63.930472103 146.0300429
## SE.mean        1.185100119   1.7032014
## CI.mean.0.95   2.334933953   3.3557189
## var          327.239714000 675.9085763
## std.dev       18.089768213  25.9982418
## coef.var       0.282960029   0.1780335
## skewness      -0.331285618  -0.0643649
## skew.2SE      -1.038831068  -0.2018326
## kurtosis      -0.556078576  -0.2618235
## kurt.2SE      -0.875485563  -0.4122128
## normtest.W     0.973327147   0.9958385
## normtest.p     0.000222805   0.7869225

 

3.2.5 Shapiro-Wilk test

Mechanism: it compares the scores in the sample to a normally distributed set of scores with the same mean and standard deviation.

  • If the test is non-significant (p > 0.05), the distribution of the sample is not significantly different from a normal distribution, i.e., it is probably normal.
  • If the test is significant (p < .05), the distribution in question is significantly different from a normal distribution (i.e., it is non-normal).

Warning: In large samples this test can be significant even when the scores are only slightly different from a normal distribution. Therefore, it should always be interpreted in conjunction with histograms or Q-Q plots and the values of skew and kurtosis, as the simulation below illustrates.
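
A quick simulated illustration of this warning (the exact p-value depends on the seed; these are not the study data):

# n = 5000 draws from a t-distribution with 20 df: visually almost Normal,
# yet with this many observations Shapiro-Wilk will often reject normality
set.seed(431)
nearly_normal <- rt(5000, df = 20)   # shapiro.test() accepts at most n = 5000
shapiro.test(nearly_normal)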

 

Command:
* For numeric variables: shapiro.test(variable)
* For separate groups within one numeric variable: by(numeric variable, grouping/categorical variable, name of test)

As a final point, remember to look at the variable within separate groups. If our analysis involves comparing groups, then what matters is not the overall distribution but the distribution within each group.

shapiro.test(df$age)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$age
## W = 0.97448, p-value = 1.184e-07
shapiro.test(df$sbp)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$sbp
## W = 0.99841, p-value = 0.9336
by(df$age, df$trt, shapiro.test)
## df$trt: Control
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.96996, p-value = 2.132e-05
## 
## ------------------------------------------------------------ 
## df$trt: Intervention
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.97333, p-value = 0.0002228
by(df$sbp, df$trt, shapiro.test)
## df$trt: Control
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.99676, p-value = 0.8671
## 
## ------------------------------------------------------------ 
## df$trt: Intervention
## 
##  Shapiro-Wilk normality test
## 
## data:  dd[x, ]
## W = 0.99584, p-value = 0.7869

 

Discuss

For age:
* age overall: p-value < 0.05 => non-normal distribution.
* age within each trt group: p < 0.05 => non-normal distribution.

For sbp:
* sbp overall: p-value > 0.05 => consistent with a normal distribution.
* sbp within each trt group: p > 0.05 => consistent with a normal distribution.

 

4 Homogeneity of variance

4.1 Introduction

According to Field et al. 2012:

Homogeneity of variance: This assumption means that the variances should be the same throughout the data. In designs in which you test several groups of participants this assumption means that each of these samples comes from populations with the same variance. In correlational designs, this assumption means that the variance of one variable should be stable at all levels of the other variable.

Example: In Field’s concert/hearing-loss example, if you measured the vertical distance between the lowest score and the highest score at each level, all of these distances would be fairly similar. Although the means increase, the spread of hearing-loss scores is the same at each level of the concert variable.

 

There are many solutions to test for the equality (homogeneity) of variance across groups, including:

  • F-test: Compare the variances of two samples. The data must be normally distributed.

  • Bartlett’s test: Compare the variances of k samples, where k can be more than two samples. The data must be normally distributed.

  • Levene’s test: Compare the variances of k samples, where k can be more than two samples. It’s an alternative to the Bartlett’s test that is less sensitive to departures from normality.

  • Fligner-Killeen test: a non-parametric test which is very robust against departures from normality.

     

4.2 Variance Ratio Test

When there are only two groups the test we use to determine if the variance is the same is called a variance ratio test. The test involves dividing the variance of group one by the variance of group two. If this ratio is close to one the conclusion drawn is that the variance of each group is the same. If the ratio is far from one the conclusion drawn is that the variances are not the same.

 

Whether a variance ratio test should ever be conducted is an open question. Simulation studies have shown that in many cases where the standard t-test performs poorly, and hence we want the unequal-variance t-test formula, the variance ratio test has little power to detect that the variances of the two groups differ. Conversely, in scenarios where the variance ratio test does reliably detect a difference in variances, the standard t-test that (incorrectly) assumes equal variances still works very well. Combined, these two results have led some to conclude that it is better never to use a variance ratio test.

 

This test uses the following null and alternative hypotheses:

  • H0: The group variances are equal.

  • HA: The group variances are not equal.

     

Command:
* var.test(x, y, ratio = 1, alternative = c("two.sided", "less", "greater"), conf.level = 0.95, ...)
* var.test(formula, data, subset, na.action, …)

var.test(age ~ trt , data = df)
## 
##  F test to compare two variances
## 
## data:  age by trt
## F = 0.7922, num df = 266, denom df = 232, p-value = 0.06615
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.616568 1.015679
## sample estimates:
## ratio of variances 
##          0.7922029
var.test(sbp ~ trt , data = df)
## 
##  F test to compare two variances
## 
## data:  sbp by trt
## F = 0.8805, num df = 265, denom df = 232, p-value = 0.3155
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6851554 1.1291608
## sample estimates:
## ratio of variances 
##          0.8804975
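
As a sanity check, the ratio that var.test() reports can be reproduced by hand. A dplyr sketch (df as loaded above; the groups come out in the order Control, Intervention, matching var.test):

# Hand-computed ratio of the group variances for age
df %>%
  group_by(trt) %>%
  summarize(v = var(age, na.rm = TRUE)) %>%
  summarize(ratio = v[1] / v[2])   # should be close to the 0.7922 shown above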

 

Discuss

  • For age by trt: p-value > 0.05 => We fail to reject the null hypothesis that the group variances are the same.
    => However, this result contradicts Levene’s test and the Fligner-Killeen test, while agreeing with Bartlett’s test (see the next sections).
    => Because age is non-normally distributed, the Fligner-Killeen and Levene’s tests are the more suitable approaches here and their results should be more accurate.

  • For sbp by trt: p-value > 0.05 => This result matches what we find with Levene’s test and Bartlett’s test.
    => We do not have sufficient evidence to say the variance differs significantly across the groups.

     

4.3 Bartlett’s test

Generally we are interested in testing whether or not there is a difference in the group means. When testing for differences in group means the specific test statistic formula to use depends on whether or not the group variances are equal.

In statistics, Bartlett’s test is used to test whether k samples come from populations with equal variances. Equal variances across populations are called homoscedasticity or homogeneity of variances. Some statistical tests, for example the ANOVA test, assume that variances are equal across groups or samples; the Bartlett test can be used to verify that assumption. Bartlett’s test lets us compare the variance of two or more samples to decide whether they are drawn from populations with equal variance, and it is appropriate for normally distributed data.

 

This test uses the following null and alternative hypotheses:

  • H0: The variance among each group is equal.
  • HA: At least one group has a variance that is not equal to the rest.

Command: bartlett.test(numeric variable ~ group, dataset)

bartlett.test( age ~ trt , data = df)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  age by trt
## Bartlett's K-squared = 3.3654, df = 1, p-value = 0.06658
bartlett.test( sbp ~ trt , data = df)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  sbp by trt
## Bartlett's K-squared = 1.0019, df = 1, p-value = 0.3168

 

Discuss

  • For age by trt: p-value > 0.05 => We fail to reject the null hypothesis that the group variances are the same.
    => We do not have sufficient evidence to say the variance differs significantly across the groups.
    => However, this result contradicts the Levene’s test result (next section). This can be explained by the non-normal distribution of age within the trt groups, which leads Bartlett’s test to produce a different result.

  • For sbp by trt: p-value > 0.05 => This result matches what we found with Levene’s test.

     

4.4 Levene’s test

This test uses the following null and alternative hypotheses:

  • H0: The variance among each group is equal.

  • HA: At least one group has a variance that is not equal to the rest.

     

Package: car
Command: leveneTest(outcome variable, group, center = median/mean)

  • the outcome variable whose variances we want to test.
  • the grouping variable, which must be a factor.
  • center defaults to the median (the median is preferred). To override this default, add the option center = "mean".

Warning: In large samples Levene’s test can be significant even when group variances are not very different. Therefore, it should be interpreted in conjunction with the variance ratio.
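
As a companion to this warning, it helps to look at the group variances directly. A quick dplyr sketch (df as loaded above):

# Group variances to weigh alongside Levene's p-values
df %>%
  group_by(trt) %>%
  summarize(var_age = var(age, na.rm = TRUE),
            var_sbp = var(sbp, na.rm = TRUE))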

 

Example 1: For the age:

library(car)
leveneTest(df$age, df$trt)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  4.0963 0.04351 *
##       498                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(df$age, df$trt, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##        Df F value  Pr(>F)  
## group   1  4.3068 0.03847 *
##       498                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

Discuss

The result is significant for age (p-value < 0.05) regardless of whether we center on the median or the mean. We reject the null hypothesis that the group variances are the same.
=> We have sufficient evidence to say the variance differs significantly across the groups.
=> The assumption of homogeneity of variance is violated.
=> However, this result contradicts the Bartlett’s test result (previous section). This can be explained by the non-normal distribution of age within the trt groups; Levene’s test is therefore the more suitable approach.

 

Example 2: For the sbp:

library(car)
leveneTest(df$sbp, df$trt)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  1.2385 0.2663
##       497
leveneTest(df$sbp, df$trt, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##        Df F value Pr(>F)
## group   1  1.2387 0.2663
##       497

Discuss

The result is non-significant for sbp (p-value > 0.05) regardless of whether we center on the median or the mean. This indicates that the variances are not significantly different.

=> We do not have sufficient evidence to say the variance differs significantly across the groups.

 

4.5 Fligner-Killeen test

The Fligner-Killeen test is among the tests for homogeneity of variances that are most robust against departures from normality.

This test uses the following null and alternative hypotheses:

  • H0: The variance among each group is equal.

  • HA: At least one group has a variance that is not equal to the rest.

     

Command:

  • fligner.test(x, g, …)
  • fligner.test(formula, data, subset, na.action, …)

with:

  • formula: lhs ~ rhs where lhs gives the data values and rhs the corresponding groups.
  • na.action: a function which indicates what should happen when the data contain NAs. Defaults to getOption("na.action").

Example 1: Calculate age by trt

fligner.test(age ~ trt, data = df)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  age by trt
## Fligner-Killeen:med chi-squared = 4.0497, df = 1, p-value = 0.04418

 

Discuss

The result is significant for age (p-value < 0.05). We reject the null hypothesis that the group variances are the same.
=> We have sufficient evidence to say the variance differs significantly across the groups.
=> The assumption of homogeneity of variance is violated.
=> This result agrees with the Levene’s test result; because age is non-normally distributed, the Fligner-Killeen and Levene’s tests are the suitable approaches.

 

Example 2: Calculate sbp by trt

fligner.test(sbp ~ trt, data = df)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  sbp by trt
## Fligner-Killeen:med chi-squared = 1.3563, df = 1, p-value = 0.2442

Discuss: The result is non-significant (p-value > 0.05), similar to Levene’s test: we do not have sufficient evidence of a variance difference across the groups.