knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Previous sections:

  1. Clustering practice: https://rpubs.com/minhtri/923711
  2. Table Descriptive Analysis: https://rpubs.com/minhtri/929114
  3. Imputation - Dealing with missing data: https://rpubs.com/minhtri/968586
  4. The R environment - Data Wrangling: https://rpubs.com/minhtri/968607
  5. Correlation analysis: https://rpubs.com/minhtri/968611
  6. Parametric or Nonparametric data: https://rpubs.com/minhtri/968616

 

Note: This analysis is for personal study purposes. In this section, I will summarize some basic commands for dealing with outliers and non-normal data, based on several online sources.

 

R Packages:

# More packages will be shown during the analysis process
# Load the required library
library(tidyverse)    # Data Wrangling
library(conflicted)   # Dealing with conflict package
library(readxl)       # Read Excel files

 

Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions are in conflict. R’s default conflict resolution system gives precedence to the most recently loaded package. This can make conflicts hard to detect, particularly when one is introduced by an update to an existing package.

  • Using the code below helps the entire section run properly. You may or may not need to look into the conflicted package for your work.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")

 

1 Simulated fakestroke data

Data used in these notes:
fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)

Load the data and set the type of each column (the source csv was saved locally as an Excel file, so read_excel() is used here):

# Load the data, specifying the type of each column
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx", 
                 col_types = c("text", "text", "numeric",
                               "text", "numeric", "text", "text", 
                               "numeric", "numeric", "text", "numeric", 
                               "text", "numeric", "numeric", "text", 
                               "numeric", "numeric", "numeric")) 
df <- data
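As a quick optional check (glimpse() comes with the tidyverse loaded above), we can confirm that every column was read with the intended type:

# Confirm dimensions and column types after import
glimpse(df)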

The fakestroke.csv file contains the following 18 variables for 500 patients:

# To make a table in R Markdown: 1st row: header; 2nd row: alignment; remaining rows: content
## |Variable |  Description |
## |:------- | :----------  |
## |studyid  | Study ID  # (z001 through z500) | 
| Variable | Description |
|:---------|:------------|
| studyid | Study ID # (z001 through z500) |
| trt | Treatment group (Intervention or Control) |
| age | Age in years |
| sex | Male or Female |
| nihss | NIH Stroke Scale score (can range from 0-42; higher scores indicate more severe neurological deficits) |
| location | Stroke location - Left or Right hemisphere |
| hx.isch | History of ischemic stroke (Yes/No) |
| afib | Atrial fibrillation (1 = Yes, 0 = No) |
| dm | Diabetes mellitus (1 = Yes, 0 = No) |
| mrankin | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability; the complete range is 0 (no symptoms) to 6 (death) |
| sbp | Systolic blood pressure, in mm Hg |
| iv.altep | Treatment with IV alteplase (Yes/No) |
| time.iv | Time from stroke onset to start of IV alteplase (minutes), if iv.altep = Yes |
| aspects | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0-10; higher scores indicate fewer early ischemic changes |
| ia.occlus | Intracranial arterial occlusion, based on vessel imaging (five categories) |
| extra.ica | Extracranial ICA occlusion (1 = Yes, 0 = No) |
| time.rand | Time from stroke onset to study randomization, in minutes |
| time.punc | Time from stroke onset to groin puncture, in minutes (Intervention group only) |

 

2 Dealing with outliers

According to Field et al. (2012), there are three options for dealing with outliers:

  1. Remove the case: This entails deleting the data from the person who contributed the outlier. However, this should be done only if you have good reason to believe that this case is not from the population that you intended to sample. For example, if you were investigating factors that affected how much cats purr and one cat didn’t purr at all, this would likely be an outlier (all cats purr). Upon inspection, if you discovered that this cat was actually a dog wearing a cat costume (hence why it didn’t purr), then you’d have grounds to exclude this case because it comes from a different population (dogs who like to dress as cats) than your target population (cats).
  2. Transform the data: Outliers tend to skew the distribution; this skew (and, therefore, the impact of the outliers) can sometimes be reduced by applying a transformation to the data.
  3. Change the score: If transformation fails, you can consider replacing the score. On the face of it, this may seem like cheating (you’re changing the data from what was actually collected); however, if the score you’re changing is very unrepresentative and biases your statistical model anyway, then changing it is the lesser of two evils! There are several options for how to change the score:
  • The next highest score plus one: Change the score to be one unit above the next highest score in the data set.

  • Convert back from a z-score: Rearrange the z-score formula into X = (z × s) + x̄ and compute the replacement score from a cut-off such as z = 3.29 (the value Field et al. suggest).

  • The mean plus two standard deviations: A variation on the method above that uses the mean plus two standard deviations (rather than roughly three) as the replacement score; a minimal R sketch of this idea follows.
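Below is a minimal sketch of the mean-plus-two-standard-deviations idea in R, using the sbp variable purely for illustration (the variable choice and the pmin()/pmax() winsorizing approach are my assumptions, not prescribed by Field et al.):

# Replace values beyond mean +/- 2 SD with the boundary value itself
# (sbp is used here only as an example variable)
x     <- df$sbp
lower <- mean(x, na.rm = TRUE) - 2 * sd(x, na.rm = TRUE)
upper <- mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE)
x_adj <- pmin(pmax(x, lower), upper)
summary(x)      # original scores
summary(x_adj)  # scores after replacement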

     

3 Dealing with non-normality and unequal variances

3.1 Introduction

According to thomaselove.github.io/431-notes, there are various transformations that can correct various problems. The Tukey ladder of powers (sometimes called the Bulging Rule) is a way to change the shape of a skewed distribution so that it becomes normal or nearly normal. It can also help to reduce error variability (heteroscedasticity).

 

The ladder of power transformations is particularly helpful when we are confronted with data that shows skew.

  • To handle right skew (where the mean exceeds the median) we usually apply powers below 1.

  • To handle left skew (where the median exceeds the mean) we usually apply powers greater than 1.

  • The most common transformations are the square (power 2), the square root (power 1/2), the logarithm (power 0) and the inverse (power -1).

     

Note:

  • As we move further away from the identity function (power = 1) we change the shape more and more in the same general direction. For instance, if we try a logarithm, and this seems like too much of a change, we might try a square root instead.

  • If the variable x can take on negative values, we might take a different approach.

  • If x is a count of something that could be zero, we often simply add 1 to x before transformation (see the short sketch after this list).
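For example, here is a small sketch with a made-up count vector; R’s log1p() computes log(1 + x) directly:

# Made-up counts that include zero
x <- c(0, 1, 3, 8, 20)
log(x)      # returns -Inf for the zero count
log(x + 1)  # add 1 before taking the log, as suggested above
log1p(x)    # equivalent, with better numerical accuracy near zero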

     

3.2 Example

From the last section, we already know that the distribution of age is non-normal. Therefore, I will practice with the age variable, transforming the data to better approximate a Normal distribution.

3.2.1 Original Data

Here is a numerical summary of the original age variable, using stat.desc() from the pastecs package (norm = TRUE adds normality statistics, including the Shapiro-Wilk test reported as normtest.W and normtest.p):

library(pastecs)
stat.desc(df$age, basic=F, norm=T)
##        median          mean       SE.mean  CI.mean.0.95           var 
##  6.575000e+01  6.470580e+01  7.627075e-01  1.498514e+00  2.908613e+02 
##       std.dev      coef.var      skewness      skew.2SE      kurtosis 
##  1.705466e+01  2.635723e-01 -3.305647e-01 -1.513328e+00 -3.970448e-01 
##      kurt.2SE    normtest.W    normtest.p 
## -9.106278e-01  9.744833e-01  1.183624e-07
# Drawing QQ-plot for age:
p1 = ggplot(data = df, aes(sample = age)) + 
      geom_qq() +
      geom_qq_line() +
      labs( x = "Theoretical", y = "age", title = "Normal Q-Q plot")

# Drawing density plot for age:
p2 = ggplot(df, aes(age)) +
      geom_histogram(aes(y = after_stat(density)), binwidth = 1, colour = "black", fill = "white") + 
      stat_function(fun = dnorm, 
                    args = list(mean = mean(df$age, na.rm = T), sd = sd(df$age, na.rm = T)),
                    lwd = 1, col = "blue") + 
      labs(x = "age (years)", y = "Density", title = "Histogram with Normal fit") +
      theme_bw()

# Drawing violin plot for age
p3 = ggplot(df, aes(x = "", y = age)) +
      geom_violin() +
      geom_boxplot(width = 0.3, fill = "royalblue", outlier.color = "royalblue") +
      coord_flip() +
      labs(title = "Boxplot with Violin", x = "", y = "Simulated Data")

# Combine plot: 
library(ggpubr)
plot = ggarrange(p1,p2, p3, 
                 ncol=2, nrow=2, 
                 common.legend = FALSE,
                 legend="right"  
                 )
annotate_figure(plot, bottom = text_grob("Age original data", 
              color = "red", face = "bold", size = 12))

 

3.2.2 Apply transformations

p1 <- ggplot(df, aes(sample = age^3)) +
    geom_qq(col = "salmon") + 
    geom_qq_line(col = "black") +
    labs(title = "Cube (power 3)",
         y = "age, Cubed")

p2 <- ggplot(df, aes(sample = age^2)) +
    geom_qq(col = "salmon") + 
    geom_qq_line(col = "black") +
    labs(title = "Square (power 2)",
         y = "age, Squared")

p3 <- ggplot(df, aes(sample = age)) +
    geom_qq(col = "salmon") + 
    geom_qq_line(col = "black") +
    labs(title = "Original Data",
         y = "age")


p4 <- ggplot(df, aes(sample = sqrt(age))) +
    geom_qq(col = "salmon") + 
    geom_qq_line(col = "black") +
    labs(title = "sqrt (power 0.5)",
         y = "Square Root of age")

p5 <- ggplot(df, aes(sample = log(age))) +
    geom_qq(col = "salmon") + 
    geom_qq_line(col = "black") +
    labs(title = "log (power 0)",
         y = "Natural Log of age")

p6 <- ggplot(df, aes(sample = age^(-0.5))) +
    geom_qq(col = "salmon") + 
    geom_qq_line(col = "black") +
    labs(title = "1/sqrt (power -0.5)",
         y = "1/Square Root(age)")


p7 <- ggplot(df, aes(sample = 1/age)) +
    geom_qq(col = "salmon") + 
    geom_qq_line(col = "black") +
    labs(title = "Inverse (power -1)",
         y = "1/age")

p8 <- ggplot(df, aes(sample = 1/(age^2))) +
    geom_qq(col = "salmon") + 
    geom_qq_line(col = "black") +
    labs(title = "1/Square (power -2)",
         y = "1 /(age squared)")

p9 <- ggplot(df, aes(sample = 1/(age^3))) +
    geom_qq(col = "salmon") + 
    geom_qq_line(col = "black") +
    labs(title = "1/Cube (power -3)",
         y = "1/(age cubed)")

# Combine plot: 
library(ggpubr)
plot = ggarrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, 
                 ncol=3, nrow=3, 
                 common.legend = FALSE,
                 legend="left"  
                 )
annotate_figure(plot, bottom = text_grob("Transformation of Age data", 
              color = "red", face = "bold", size = 12))

 

Discussion: Because the original age data are left-skewed, a transformation with power > 1 (the square or the cube) is a suitable choice to create a more symmetric (albeit light-tailed) distribution.
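As an optional numeric check (my own addition, reusing stat.desc() from the pastecs package loaded earlier), we can confirm that squaring reduces the skewness:

# Compare skewness and Shapiro-Wilk results before and after squaring
stat.desc(df$age,   basic = FALSE, norm = TRUE)[c("skewness", "skew.2SE", "normtest.W", "normtest.p")]
stat.desc(df$age^2, basic = FALSE, norm = TRUE)[c("skewness", "skew.2SE", "normtest.W", "normtest.p")]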