knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
Note: This analysis is for personal study purposes. In this section, I summarize some basic commands for assessing normality, based on several online sources.
R Packages:
# More packages will be shown during the analysis process
# Load the required library
library(tidyverse) # Data Wrangling
library(conflicted) # Dealing with conflict package
library(readxl) # Read Excel files
Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions are
in conflict. R’s default conflict resolution system gives precedence to
the most recently loaded package. This can make it hard to detect
conflicts, particularly when introduced by an update to an existing
package.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")
Data used in these notes
fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)
Loading the data and specifying the type of each column:
# Loading the data and specifying the type of each column
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx",
                   col_types = c("text", "text", "numeric",
                                 "text", "numeric", "text", "text",
                                 "numeric", "numeric", "text", "numeric",
                                 "text", "numeric", "numeric", "text",
                                 "numeric", "numeric", "numeric"))
df <- data
The fakestroke.csv file contains the following 18 variables for 500 patients:
# To make a table in R Markdown: 1st row: header; 2nd row: alignment; remaining rows: content
| Variable | Description |
|---|---|
| studyid | Study ID # (z001 through z500) |
| trt | Treatment group (Intervention or Control) |
| age | Age in years |
| sex | Male or Female |
| nihss | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
| location | Stroke Location - Left or Right Hemisphere |
| hx.isch | History of Ischemic Stroke (Yes/No) |
| afib | Atrial Fibrillation (1 = Yes, 0 = No) |
| dm | Diabetes Mellitus (1 = Yes, 0 = No) |
| mrankin | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death) |
| sbp | Systolic blood pressure, in mm Hg |
| iv.altep | Treatment with IV alteplase (Yes/No) |
| time.iv | Time from stroke onset to start of IV alteplase (minutes) if iv.altep=Yes |
| aspects | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0 - 10; higher scores indicate fewer early ischemic changes |
| ia.occlus | Intracranial arterial occlusion, based on vessel imaging - five categories |
| extra.ica | Extracranial ICA occlusion (1 = Yes, 0 = No) |
| time.rand | Time from stroke onset to study randomization, in minutes |
| time.punc | Time from stroke onset to groin puncture, in minutes (only if Intervention) |
According to Field et al. (2012), there are three options for dealing with outliers (a sketch of the second option follows this list):
1. The next highest score plus one: change the score to be one unit above the next highest score in the data set.
2. The mean plus two standard deviations: a variation on the above method is to use the mean plus two times the standard deviation (rather than three times the standard deviation).
3. Convert back from a z-score.
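As a quick illustration of the second option, here is a minimal sketch that caps (winsorizes) values beyond the mean plus or minus two standard deviations. The cap_outliers helper is my own illustration, not from Field et al., and the example call is hypothetical.
# Sketch of option 2: cap values beyond mean +/- 2 SD
# (cap_outliers is an illustrative helper, not from Field et al.)
cap_outliers <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  # pull any value outside [m - 2s, m + 2s] back to the nearest boundary
  pmin(pmax(x, m - 2 * s), m + 2 * s)
}
# Example (hypothetical): cap_outliers(df$sbp)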
According to thomaselove.github.io/431-notes, there are various transformations that can correct various problems. The Tukey ladder of powers (sometimes called the Bulging Rule) is a way to change the shape of a skewed distribution so that it becomes normal or nearly normal. It can also help to reduce error variability (heteroscedasticity).
The ladder of power transformations is particularly helpful when we are confronted with data that show skew.
- To handle right skew (where the mean exceeds the median), we usually apply powers below 1.
- To handle left skew (where the median exceeds the mean), we usually apply powers greater than 1.
The most common transformations are the square (power 2), the square root (power 1/2), the logarithm (power 0) and the inverse (power -1).
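To make the ladder concrete, here is a small sketch of a helper that applies a chosen rung of the ladder, using the usual convention that power 0 means the logarithm. The ladder() helper is my own illustration, not from the sources above, and it assumes strictly positive data.
# Tukey ladder sketch: power 0 is log by convention (assumes x > 0)
ladder <- function(x, p) {
  if (p == 0) log(x) else x^p
}
# ladder(df$age, 2)    # square
# ladder(df$age, 0.5)  # square root
# ladder(df$age, 0)    # natural log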
Note:
- As we move further away from the identity function (power = 1), we change the shape more and more in the same general direction. For instance, if a logarithm seems like too much of a change, we might try a square root instead.
- If the variable x can take on negative values, we might take a different approach.
- If x is a count of something that could be zero, we often simply add 1 to x before transformation, as in the sketch below.
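For instance, with a hypothetical count variable that can be zero:
# Counts with zeros: add 1 before taking logs
counts <- c(0, 1, 4, 10)   # hypothetical counts
log(counts + 1)            # same result as log1p(counts)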
From the previous section, we already know that the distribution of age is non-Normal. Therefore, I will practice with the age variable, transforming the data to better approximate a Normal distribution.
Here is the summary of original age variable:
library(pastecs)
# Descriptive statistics plus normality diagnostics (skewness, kurtosis, Shapiro-Wilk)
stat.desc(df$age, basic = FALSE, norm = TRUE)
## median mean SE.mean CI.mean.0.95 var
## 6.575000e+01 6.470580e+01 7.627075e-01 1.498514e+00 2.908613e+02
## std.dev coef.var skewness skew.2SE kurtosis
## 1.705466e+01 2.635723e-01 -3.305647e-01 -1.513328e+00 -3.970448e-01
## kurt.2SE normtest.W normtest.p
## -9.106278e-01 9.744833e-01 1.183624e-07
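Here the skewness is negative (about -0.33) and the Shapiro-Wilk test gives W of about 0.974 with p of about 1.2e-07, so age appears mildly left-skewed and clearly non-Normal by this test.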
# Drawing QQ-plot for age:
library(ggpubr)
p1 <- ggplot(data = df, aes(sample = age)) +
geom_qq() +
geom_qq_line() +
labs(x = "Theoretical", y = "age", title = "Normal Q-Q plot")
# Drawing density plot for age:
p2 <- ggplot(df, aes(age)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 1, colour = "black", fill = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(df$age, na.rm = TRUE), sd = sd(df$age, na.rm = TRUE)),
lwd = 1, col = "blue") +
labs(x = "age (years)", y = "Density", title = "Histogram with Normal fit") +
theme_bw()
# Drawing violin plot for age
p3 <- ggplot(df, aes(x = "", y = age)) +
geom_violin() +
geom_boxplot(width = 0.3, fill = "royalblue", outlier.color = "royalblue") +
coord_flip() +
labs(title = "Boxplot with Violin", x = "", y = "age (years)")
# Combine plot:
library(ggpubr)
# Use fig rather than plot as the object name, to avoid masking base::plot
fig <- ggarrange(p1, p2, p3,
ncol = 2, nrow = 2,
common.legend = FALSE,
legend = "right")
annotate_figure(fig, bottom = text_grob("Age original data",
color = "red", face = "bold", size = 12))
p1 <- ggplot(df, aes(sample = age^3)) +
geom_qq(col = "salmon") +
geom_qq_line(col = "black") +
labs(title = "Cube (power 3)",
y = "age, Cubed")
p2 <- ggplot(df, aes(sample = age^2)) +
geom_qq(col = "salmon") +
geom_qq_line(col = "black") +
labs(title = "Square (power 2)",
y = "age, Squared")
p3 <- ggplot(df, aes(sample = age)) +
geom_qq(col = "salmon") +
geom_qq_line(col = "black") +
labs(title = "Original Data",
y = "age")
p4 <- ggplot(df, aes(sample = sqrt(age))) +
geom_qq(col = "salmon") +
geom_qq_line(col = "black") +
labs(title = "sqrt (power 0.5)",
y = "Square Root of age")
p5 <- ggplot(df, aes(sample = log(age))) +
geom_qq(col = "salmon") +
geom_qq_line(col = "black") +
labs(title = "log (power 0)",
y = "Natural Log of age")
p6 <- ggplot(df, aes(sample = age^(-0.5))) +
geom_qq(col = "salmon") +
geom_qq_line(col = "black") +
labs(title = "1/sqrt (power -0.5)",
y = "1/Square Root(age)")
p7 <- ggplot(df, aes(sample = 1/age)) +
geom_qq(col = "salmon") +
geom_qq_line(col = "black") +
labs(title = "Inverse (power -1)",
y = "1/age")
p8 <- ggplot(df, aes(sample = 1/(age^2))) +
geom_qq(col = "salmon") +
geom_qq_line(col = "black") +
labs(title = "1/Square (power -2)",
y = "1 /(age squared)")
p9 <- ggplot(df, aes(sample = 1/(age^3))) +
geom_qq(col = "salmon") +
geom_qq_line(col = "black") +
labs(title = "1/Cube (power -3)",
y = "1/(age cubed)")
# Combine plot:
library(ggpubr)
fig <- ggarrange(p1, p2, p3, p4, p5, p6, p7, p8, p9,
ncol = 3, nrow = 3,
common.legend = FALSE,
legend = "left")
annotate_figure(fig, bottom = text_grob("Transformation of Age data",
color = "red", face = "bold", size = 12))
Discussion: Because the original data are left-skewed, a transformation with a power greater than 1 seems like a suitable choice to create a more symmetric (albeit light-tailed) distribution.
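As a numeric check on this suggestion (a sketch; the exact values will depend on the data), we can rerun the descriptive statistics on the squared values and compare skewness and the Shapiro-Wilk results against the originals:
# Compare normality diagnostics for age before and after squaring
library(pastecs)
round(stat.desc(data.frame(age = df$age, age_sq = df$age^2),
basic = FALSE, norm = TRUE), 3)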
References:
- https://thomaselove.github.io/431-notes/assessing-normality.html#assessing-normality
- Field, A., Miles, J. and Field, Z., 2012. Discovering Statistics Using R. Sage Publications.
- https://saestatsteaching.tech/section-varianceratio#equal-variance-testing---two-groups
- https://www.geeksforgeeks.org/bartletts-test-in-r-programming/
- http://www.sthda.com/english/wiki/compare-multiple-sample-variances-in-r