COD_week7_1_MGK_BTE3207

Minsik Kim

2024-10-15

Before we begin…

Let’s load the SBP (Systolic Blood Pressure) dataset, which we will use throughout this document for various statistical analyses.

library(tidyverse) # loads the %>% pipe used below

read.csv(file = "Inha/5_Lectures/2024/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv") %>%
        reactable::reactable(., sortable = TRUE)

dataset_sbp <- read.csv(file = "Inha/5_Lectures/2024/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")

head(dataset_sbp)
##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 116  78  94   4 16.6
## 2   1     1 100  60  79   4 22.3
## 3   1     1 100  60  87   4 21.9
## 4   1     1 111  70  72   4 20.2
## 5   1     1 120  80  98   4 20.0
## 6   1     1 115  79  95   4 23.1

Creating a New Variable: Hypertension

Logical Variables in R:

Logical variables store TRUE or FALSE values. These variables are useful in decision-making processes (like identifying conditions where blood pressure exceeds a threshold). In R, the result of comparisons such as > or == is always logical (TRUE or FALSE).

Using ifelse() to Create Logical Variables:

The ifelse() function is a vectorized conditional function in R that checks a condition for each element in a vector. It returns one value if the condition is TRUE and another value if it is FALSE.

Syntax: ifelse(condition, value_if_TRUE, value_if_FALSE)

Below, we use ifelse() to create a logical variable called hypertension, which will be TRUE if either the systolic blood pressure (SBP) is greater than 130 or the diastolic blood pressure (DBP) is greater than 80, and FALSE otherwise.
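As a quick toy illustration (hypothetical values, not from the SBP dataset), ifelse() evaluates the condition element by element:

# Toy vector of hypothetical SBP readings
sbp_toy <- c(118, 135, 128, 142)
ifelse(sbp_toy > 130, TRUE, FALSE)  # FALSE TRUE FALSE TRUE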

Using str() to check the data type

In the last lecture (week 6-2), we learned how to check the type of data in R.

a <- 1

a
## [1] 1
str(a)
##  num 1
b <- c(1, 2, 3)

str(b)
##  num [1:3] 1 2 3
c <- c("a", "b", "c")

str(c)
##  chr [1:3] "a" "b" "c"

Logical values

From the Week 3-2 markdown (homework reading), we learned what logical variables are.

Logical variables are simply variables that hold logical values: TRUE or FALSE.

a = 1

a
## [1] 1

= does the same thing as <-.

To test whether two things are equal, R uses ==.

a
## [1] 1
1
## [1] 1
a == 1
## [1] TRUE

Because we assigned a <- 1 in the previous code chunk, this test returns TRUE.

"a"
## [1] "a"
1
## [1] 1
"a" == 1
## [1] FALSE

This tests whether the character "a" is the same as the numeric value 1. Since they are not, it returns FALSE.

Here, the TRUE and FALSE results are logical values, which are one type of binary variable. Comparisons work element-wise on longer vectors as well.

c(1, 2, 3, 4, 5) == c(1, 2, 2, 4, 5)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE
c(1, 2, 3, 4, 5) == 1
## [1]  TRUE FALSE FALSE FALSE FALSE

The comparison returns a vector containing the result of each logical test. Using this, we can filter data easily!

TRUE/FALSE values are often converted to a binary variable. Therefore, R also recognizes TRUE as 1 and FALSE as 0.
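Because of this, sum() and mean() of a logical vector give the number and proportion of TRUE values; here is a quick sketch with made-up values, and we will use the same idea later with the hypertension variable.

flags <- c(TRUE, FALSE, TRUE, TRUE)
sum(flags)   # 3, the number of TRUE values
mean(flags)  # 0.75, the proportion of TRUE values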

str(c(1,2,3))
##  num [1:3] 1 2 3
str(c("one","two","three"))
##  chr [1:3] "one" "two" "three"
str(c("1","2","3"))
##  chr [1:3] "1" "2" "3"
as.numeric(c("1","2","3"))
## [1] 1 2 3
str(as.numeric(c("1","2","3")))
##  num [1:3] 1 2 3
1 == 1
## [1] TRUE
TRUE == "1"
## [1] FALSE
TRUE == 1
## [1] TRUE

ifelse() does the same thing as the IF function in Excel.

ifelse(TRUE, "Right", "Wrong")
## [1] "Right"
ifelse(FALSE, "Right", "Wrong")
## [1] "Wrong"
ifelse(1 > 0, "Right", "Wrong")
## [1] "Right"
ifelse(13 > 20, "Right", "Wrong")
## [1] "Wrong"
ifelse(13 == 20, "Right", "Wrong")
## [1] "Wrong"

AND and OR

In R, | is “OR” and & is “AND”.

•   With & (AND), the result is TRUE only if both conditions are TRUE.
•   With | (OR), the result is TRUE if either condition is TRUE, and FALSE only when both conditions are FALSE.
#AND
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE
#OR

TRUE | TRUE
## [1] TRUE
FALSE | TRUE
## [1] TRUE
#AND

(1 == 1) & (2 < 3)
## [1] TRUE
(1 == 1) & (2 > 3)
## [1] FALSE
#OR

(1 == 1) | (2 < 3)
## [1] TRUE
(1 == 1) | (2 > 3)
## [1] TRUE
(1 == 2) | (2 > 3)
## [1] FALSE

Using ifelse() together with logical operators (here, |), we can create a new variable called hypertension.

# Create 'hypertension' variable based on SBP and DBP

dataset_sbp$hypertension <- ifelse(dataset_sbp$SBP > 130 |
                                           dataset_sbp$DBP > 80,
                                   T,
                                   F)

In the code, | is the logical OR operator. The condition checks if either SBP > 130 or DBP > 80 is true.

This new logical variable can now be used to classify individuals as hypertensive or non-hypertensive.
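As a quick check (a sketch assuming dataset_sbp was loaded as above), table() and mean() summarize the new variable:

table(dataset_sbp$hypertension)  # counts of FALSE / TRUE
mean(dataset_sbp$hypertension)   # proportion classified as hypertensive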

Understanding Normal Distribution Functions

  1. rnorm() – Random Sampling from a Normal Distribution

This function generates random numbers following a normal distribution.

# Generate 5 random numbers from a standard normal distribution (mean = 0, sd = 1)
rnorm(5)
## [1]  1.42876777 -0.13118732  1.76540265 -0.86351081  0.03604344
# Example with custom mean and standard deviation
# rnorm(100, mean = 100, sd = 1)
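With a larger sample, the sample mean and standard deviation approach the parameters we specify (a quick sketch; the exact numbers change on every run):

x_sim <- rnorm(100000, mean = 100, sd = 1)
mean(x_sim)  # close to 100
sd(x_sim)    # close to 1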
  2. pnorm() – Cumulative Probability for Z-scores (z-score table results)

pnorm() gives the cumulative probability for a given z-score (i.e., the area under the curve up to that point).

# Cumulative probabilities for different z-scores
pnorm(2)   # ~97.72% of the data lies below a z-score of 2
## [1] 0.9772499
pnorm(0)   # 50% of the data lies below the mean
## [1] 0.5
pnorm(1)   # ~84.13% of the data is below a z-score of 1
## [1] 0.8413447
pnorm(3)   # ~99.87% of the data is below a z-score of 3
## [1] 0.9986501
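Subtracting two pnorm() values gives the area between two z-scores, which reproduces the 68-95-99.7 rule:

pnorm(1) - pnorm(-1)  # ~0.6827, area within 1 SD of the mean
pnorm(2) - pnorm(-2)  # ~0.9545, area within 2 SD
pnorm(3) - pnorm(-3)  # ~0.9973, area within 3 SD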
  3. qnorm() – Find Z-scores from Cumulative Probability

qnorm() returns the z-score corresponding to a given cumulative probability.

# Z-scores for given probabilities
qnorm(0.97724985)  # Returns ~2, the z-score for 97.72%
## [1] 2
qnorm(0.5)         # Returns 0, the z-score for 50%
## [1] 0

Note: pnorm() and qnorm() are inverse operations of each other.
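For example, qnorm(0.975) gives the familiar 1.96 used for 95% confidence intervals, and applying qnorm() after pnorm() returns the original z-score:

qnorm(0.975)       # ~1.96, the z-score used for 95% confidence intervals
qnorm(pnorm(1.5))  # 1.5, confirming the two functions are inverses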

  4. dnorm() – Probability Density (Height of the Normal Curve)

dnorm() returns the value of the probability density function, i.e., the height of the normal curve at a given point.

# Calculating density at different points
dnorm(0)  # Height at the mean
## [1] 0.3989423
dnorm(1)  # Height at z = 1
## [1] 0.2419707
dnorm(-1) # Height at z = -1
## [1] 0.2419707

Far from the mean, the density drops to essentially zero:

dnorm(10000000000000000000000)
## [1] 0
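The link between dnorm() and pnorm(): integrating the density up to a point gives the cumulative probability (a small sketch using base R's integrate()):

integrate(dnorm, -Inf, 1)  # ~0.8413, matching pnorm(1)
pnorm(1)                   # 0.8413447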

Plotting the Normal Distribution

Plotting Discrete Values from dnorm().

Using a sequence of x values, we calculate dnorm(), the height of the normal density at each point, and then plot the results.

seq(-4, 4, by = 0.5) # generates a sequence from -4 to 4 in steps of 0.5
##  [1] -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0
## [16]  3.5  4.0
dnorm(seq(-4, 4, by = 0.5)) # height of the normal density at each value of the sequence
##  [1] 0.0001338302 0.0008726827 0.0044318484 0.0175283005 0.0539909665
##  [6] 0.1295175957 0.2419707245 0.3520653268 0.3989422804 0.3520653268
## [11] 0.2419707245 0.1295175957 0.0539909665 0.0175283005 0.0044318484
## [16] 0.0008726827 0.0001338302
# Generate sequence and plot dnorm results
x_vals <- seq(-4, 4, by = 0.5) #storing the data

plot(dnorm(x_vals) ~ x_vals, 
     ylab = "dnorm() result", 
     xlab = "dnorm() input")

Plotting a smooth normal curve

# Create a sequence of values and calculate their densities

x <- seq(-4, 4, length = 100)
y <- dnorm(x)

# Plot the normal curve with labeled x-axis
plot(x, y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "") # type = "l" draws a line

axis(1, at = -3:3, labels = c("-3σ", "-2σ", "-1σ", "mean", "1σ", "2σ", "3σ")) # add axis labels in units of σ
abline(v = 2, col = "red")  # Add a vertical line at z = 2
abline(v = -2, col = "red") # Add a vertical line at z = -2
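To highlight the area marked by the red lines, the region between -2σ and 2σ can be shaded with polygon() (an optional sketch that reuses x and y from the chunk above):

# Shade the central region between z = -2 and z = 2 (~95% of the area)
x_mid <- seq(-2, 2, length = 100)
polygon(c(-2, x_mid, 2), c(0, dnorm(x_mid), 0), col = "lightblue", border = NA)
lines(x, y, lwd = 2)  # redraw the curve on top of the shading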

Logical Variables and Statistical Tests

Now that we’ve created the hypertension variable as a logical variable, we can use it in statistical tests.

One-sample t-test (obtaining a confidence interval for the mean)

This tests if the mean of a sample is significantly different from a known value.

result <- t.test(c(1, 2, 3))

result
## 
##  One Sample t-test
## 
## data:  c(1, 2, 3)
## t = 3.4641, df = 2, p-value = 0.07418
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.4841377  4.4841377
## sample estimates:
## mean of x 
##         2
result # the result is a list of test components; use $ to explore them
## 
##  One Sample t-test
## 
## data:  c(1, 2, 3)
## t = 3.4641, df = 2, p-value = 0.07418
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.4841377  4.4841377
## sample estimates:
## mean of x 
##         2
result <- t.test(dataset_sbp$SBP) # one-sample t-test on SBP; the confidence interval is extracted below

result$p.value # effectively zero (below machine precision)
## [1] 0
result$conf.int
## [1] 121.8432 121.9003
## attr(,"conf.level")
## [1] 0.95
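By default t.test() compares the mean against 0; to compare against a specific reference value, pass mu (a sketch using a hypothetical reference of 120 mmHg):

# One-sample t-test of mean SBP against a hypothetical reference of 120 mmHg
t.test(dataset_sbp$SBP, mu = 120)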

Two-Sample t-test:

This tests if two groups have significantly different means.

Two-Sample t-Test Example:

result <- t.test(SBP ~ SEX, data = dataset_sbp) # compare mean SBP between the two sexes
result
## 
##  Welch Two Sample t-test
## 
## data:  SBP by SEX
## t = 170.86, df = 977496, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  4.860551 4.973355
## sample estimates:
## mean in group 1 mean in group 2 
##         124.280         119.363
result <- t.test(SBP ~ hypertension, data = dataset_sbp) # compare mean SBP between hypertension groups
result
## 
##  Welch Two Sample t-test
## 
## data:  SBP by hypertension
## t = -928.38, df = 551223, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -21.87828 -21.78610
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            114.9553            136.7875

Z-Test for Proportions

One-sample z-test using z.test()

This tests whether the proportion in a sample differs from a given value.

# Install and load the BSDA package

install.packages("BSDA")
## 
## The downloaded binary packages are in
##  /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//RtmpYD5cPD/downloaded_packages
library(BSDA)


# Perform a one-sample z-test on the proportion of hypertensive subjects
# (Note: z.test() expects sigma.x to be a standard deviation; strictly this
# should be sqrt(p * (1 - p)) rather than p * (1 - p), though with a sample
# this large the conclusion is unchanged.)

result <- z.test(dataset_sbp$hypertension,
                 sigma.x = mean(dataset_sbp$hypertension) *
                         (1-mean(dataset_sbp$hypertension))
                 )


result
## 
##  One-sample z-Test
## 
## data:  dataset_sbp$hypertension
## z = 1463.7, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.3163788 0.3172272
## sample estimates:
## mean of x 
##  0.316803
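As a cross-check (a sketch using base R instead of BSDA), prop.test() works directly from the count of TRUE values and also returns a confidence interval for the proportion:

# Base-R test and confidence interval for the hypertension proportion
prop.test(x = sum(dataset_sbp$hypertension), n = nrow(dataset_sbp))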

Hypertension table

table(dataset_sbp$SEX)
## 
##      1      2 
## 510227 489773
table(dataset_sbp$DIS)
## 
##      1      2      3      4 
##  53398 162826  43114 740662
table(dataset_sbp$SEX, dataset_sbp$hypertension)
##    
##      FALSE   TRUE
##   1 322199 188028
##   2 360998 128775

Two-Sample Z-Test:

This tests whether the proportion of hypertensive subjects differs between two groups (here, males vs. females).

# Subset the data by sex


library(tidyverse)

dataset_male <- dataset_sbp %>% filter(SEX == 1)
dataset_female <- dataset_sbp %>% filter(SEX == 2)

# Perform two-sample z-test

result <- z.test(x = dataset_male$hypertension,
                 y = dataset_female$hypertension,
                 sigma.x = mean(dataset_male$hypertension) * (1-mean(dataset_male$hypertension)),
                 sigma.y = mean(dataset_female$hypertension) * (1-mean(dataset_female$hypertension))
                 )


result
## 
##  Two-sample z-Test
## 
## data:  dataset_male$hypertension and dataset_female$hypertension
## z = 246.95, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1047524 0.1064284
## sample estimates:
## mean of x mean of y 
## 0.3685183 0.2629279
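The same comparison can be run as a two-proportion test with base R's prop.test(), using the counts from the hypertension table above:

# Hypertensive counts and group sizes by sex, taken from the table above
prop.test(x = c(188028, 128775), n = c(510227, 489773))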

Chi-square test for independence

Two-group Chi-square test

Check if two categorical variables are independent.

# Chi-square test between hypertension and sex

result <- chisq.test(x = dataset_sbp$hypertension,
                     y = dataset_sbp$SEX)

result
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dataset_sbp$hypertension and dataset_sbp$SEX
## X-squared = 12872, df = 1, p-value < 2.2e-16

Note: the p-value leads to the same conclusion as the two-sample z-test above; for a 2×2 table, the chi-square test is essentially equivalent to a two-proportion z-test.
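The stored result also contains the observed and expected counts, which can be inspected with $ (a sketch using the result object from above):

result$observed  # observed 2 x 2 counts
result$expected  # counts expected under independence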

Chi-square test for multiple groups

This checks for association between categorical variables that have more than two categories.

table(dataset_sbp$SEX, dataset_sbp$DIS)
##    
##          1      2      3      4
##   1  27979  79892  23900 378456
##   2  25419  82934  19214 362206
# Chi-square test between sex and disease status

result <- chisq.test(y = dataset_sbp$DIS,
           x = dataset_sbp$SEX)

result
## 
##  Pearson's Chi-squared test
## 
## data:  dataset_sbp$SEX and dataset_sbp$DIS
## X-squared = 627.3, df = 3, p-value < 2.2e-16
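To see which cells drive this association, the Pearson residuals (or standardized residuals) stored in the result can be examined (a sketch using the stored result):

result$residuals  # Pearson residuals for each cell
result$stdres     # standardized residuals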

Fisher’s exact test

Fisher’s exact test is used when sample sizes are small.

# Example data from a DataFlair tutorial
data_frame <- read.csv("https://goo.gl/j6lRXD")

# Build a contingency table of treatment vs. improvement
data <- table(data_frame$treatment, data_frame$improvement)


fisher.test(data)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  data
## p-value = 0.02889
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1582959 0.9212947
## sample estimates:
## odds ratio 
##  0.3878601
# or 

# Perform Fisher's exact test on the hypertension and sex table
fisher.test(table(dataset_sbp$SEX, dataset_sbp$hypertension))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(dataset_sbp$SEX, dataset_sbp$hypertension)
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.6060831 0.6165366
## sample estimates:
## odds ratio 
##   0.611277
The same test can also be written using a pipe:

table(dataset_sbp$SEX,
                  dataset_sbp$hypertension) %>%
        fisher.test()
## 
##  Fisher's Exact Test for Count Data
## 
## data:  .
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.6060831 0.6165366
## sample estimates:
## odds ratio 
##   0.611277

Because the sample size is very large, Fisher's exact test gives essentially the same result as the chi-square and z-tests.

Confidence intervals and statistical tests using epitools

The epitools package automates these calculations, providing confidence intervals and tests for risk ratios and odds ratios.

# Install and load epitools
install.packages("epitools")
## 
## The downloaded binary packages are in
##  /var/folders/qj/y615pkf53ms_hcxg1f9zjkyc0000gp/T//RtmpYD5cPD/downloaded_packages
library(epitools)

# Calculate risk ratio with confidence interval
riskratio.wald(data)
## $data
##              
##               improved not-improved Total
##   not-treated       26           29    55
##   treated           35           15    50
##   Total             61           44   105
## 
## $measure
##              risk ratio with 95% C.I.
##                estimate     lower    upper
##   not-treated 1.0000000        NA       NA
##   treated     0.5689655 0.3479293 0.930424
## 
## $p.value
##              two-sided
##               midp.exact fisher.exact chi.square
##   not-treated         NA           NA         NA
##   treated     0.02013718   0.02888541 0.01840777
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
# Calculate odds ratio with confidence interval
oddsratio.wald(data)
## $data
##              
##               improved not-improved Total
##   not-treated       26           29    55
##   treated           35           15    50
##   Total             61           44   105
## 
## $measure
##              odds ratio with 95% C.I.
##                estimate     lower     upper
##   not-treated 1.0000000        NA        NA
##   treated     0.3842365 0.1719968 0.8583743
## 
## $p.value
##              two-sided
##               midp.exact fisher.exact chi.square
##   not-treated         NA           NA         NA
##   treated     0.02013718   0.02888541 0.01840777
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
