A quick introduction to R for biostatistics

Introduction

This is the RMD file (rendered HTML if you are viewing it on RPubs) for my YouTube video on using R for biostatistics. Download the RMD file from Github and add your own notes during the tutorial! Watch the YouTube video tutorial at https://www.youtube.com/watch?v=zrXPG1onjiE&list=PLsu0TcgLDUiJznd-n-i7rMUmNjCuNhgpB&index=1 Dowload the files at https://github.com/juanklopper/R_statistics

Libraries

Installing R provides the core or base language. R can be extended using libraries (packages). These can be installed using the Package tab (right-bottom of the default interface).

library(tibble)  # Modern dataframes
library(readr)  # Imports and exports spreadsheet files
library(dplyr)  # Tidy data analysis
library(DT)  # Displays HTML tables

Simple arithmetic

At its heart, many computer languages are giant, fancy calculators. Below is some code for simple arithmetic and common mathematical functions.

2 + 2 + 4

## [1] 8

2 - 5 - 4

## [1] -7

2 * 4 * 3

## [1] 24

3 / 4

## [1] 0.75

3^3

## [1] 27

(2 + 4) * 3

## [1] 18

exp(1)

## [1] 2.718282

log10(1000)

## [1] 3

log(1000,
    base = exp(1))

## [1] 6.907755

Lists

Lists are object that contain elements, separated by commas. These can be numbers or words.

c(120, 120, 110, 130, 140)

## [1] 120 120 110 130 140

c("Pneumonia", "ARDS", "Chronic bronchitis")

## [1] "Pneumonia"          "ARDS"               "Chronic bronchitis"

Computer variables

Lists (among other things) can be stored in your computer’s memory as an object. The name given to the object is called a computer variable.

sbp <- c(120, 120, 110, 130, 140)

sbp

## [1] 120 120 110 130 140

# There are rules for computer variable names
# Below is an example of snake-case
patient.number <- seq(from = 1,
                      to = 100,
                      by = 2)
patient.number

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99

Addressing

Every element in a object has an address that can be referenced.

sbp[1]

## [1] 120

sbp[1:3]

## [1] 120 120 110

sbp[c(1, 3)]

## [1] 120 110

Distributions

Data point values for a variable are distributed according to a pattern, called a distribution.

set.seed(123)  # Forces a specific selection of values
age <- sample(18:85,
              500,
              replace = TRUE)

set.seed(123)
before.after <- rnorm(500)

set.seed(123)
sbp <- round(rnorm(500,
                   mean = 120,
                   sd = 20),
             digits = 0)

set.seed(123)
crp <- round(rchisq(500,
                    df = 2),
             digits = 1)

set.seed(123)
group <- sample(c("Control",
                  "Placebo"),
                500,
                replace = TRUE)

set.seed(123)
side.effects <- sample(c("No",
                         "Yes"),
                       500,
                       replace = TRUE,
                       prob = c(0.8,
                                0.2))

Descriptive statistics

The first part of the analysis of data is descriptive statistics.

mean(age)

## [1] 51.184

median(age)

## [1] 50

var(age)

## [1] 373.8539

sd(age)

## [1] 19.3353

range(age)

## [1] 18 85

IQR(age)

## [1] 33

quantile(age,
         0.25)

## 25% 
##  34

quantile(age,
         0.75)

## 75% 
##  67

summary(age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   34.00   50.00   51.18   67.00   85.00

unique(group)

## [1] "Control" "Placebo"

Data visualization

After describing the data, more information can be gained from visualizing it.

Box-and-whisker plot

boxplot(age)

boxplot(age,
        col = "deepskyblue",
        main = "Patient age",
        xlab = "Patients",
        ylab = "Age")

Histogram

hist(before.after,
     col = "pink",
     main = "Difference in measurement before and after treatment",
     xlab = "Measurment",
     ylab = "Counts")

Scatter plot

plot(age,
     sbp,
     col = "blue",
     main = "Age vs systolic blood pressure",
     xlab = "Age",
     ylab = "Systolic blood pressure")

Tibbles

Data for analysis are typically stored in dataframes. These are rows and columns of data. Each column represents a variable and each row contains the data point values for a specific subject.

Creating a tibble

my.data <- tibble(Age = age,
                  Difference = before.after,
                  CRP = crp,
                  Group = group,
                  sBP = sbp,
                  SideEffects = side.effects)

Exporting a tibble

write_csv(my.data,
          "Data.csv")

Importing data

Data in a spreadsheet file can be imported for analysis.

# Spreadhseet file is in same folder as this RMD file
# setwd(getwd()) was used initially
# Need only refer to file name (else use the full folder address)
data <- read_csv("ProjectData.csv")

Inspecting the data

datatable(data)

Selections

Subsets of a dataframe can be selected based on chosen rules.

control.group <- data %>% filter(Group == "I")
#control.group <- filter(data,
#                        filter(Group == "I"))

younger.patients <- data %>% filter(Age < 50)

younger.patients.select <- data %>% filter(Age < 50) %>% 
  select(sBP,
         CRP)

younger.patients.II <- data %>% filter((Age < 50) & (Group == "I"))

Descriptive statistics of the new data

Age by group

data %>% 
  group_by(Group) %>% 
  summarise(mean.age = mean(Age))

## # A tibble: 2 x 2
##   Group mean.age
##   <chr>    <dbl>
## 1 I         58.7
## 2 II        58.1

Side-effect count

data %>% group_by(SideEffects) %>% 
  summarise(count = n())

## # A tibble: 2 x 2
##   SideEffects count
##   <chr>       <int>
## 1 No            289
## 2 Yes           211

Side effects by group

data %>% 
  group_by(Group) %>% 
  count(SideEffects)

## # A tibble: 4 x 3
## # Groups:   Group [2]
##   Group SideEffects     n
##   <chr> <chr>       <int>
## 1 I     No            137
## 2 I     Yes           114
## 3 II    No            152
## 4 II    Yes            97

Visualizing the new data

Box plot of age by group

boxplot(Age ~ Group,
        data = data,
        col = c("deepskyblue", "orange"),
        main = "Age distribution by group",
        xlab = "Group",
        ylab = "Age",
        las = 1)

plot(sBP ~ Age,
     data = data,
     col = "blue",
     main = "Age vs systolic blood pressure",
     xlab = "Age",
     ylab = "Systolic blood pressure",
     las = 1)

Inferential statistics

Student’s t test

t.test(sBP ~ Group,
       data,
       alternative = c("two.sided"),
       mu = 0,
       paired = FALSE,
       var.equal = TRUE,
       conf.level = 0.95)

## 
##  Two Sample t-test
## 
## data:  sBP by Group
## t = 1.4147, df = 498, p-value = 0.1578
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5037952  3.0954846
## sample estimates:
##  mean in group I mean in group II 
##         125.1673         123.8715

Linear regression

summary(lm(sBP ~ Age,
           data = data))

## 
## Call:
## lm(formula = sBP ~ Age, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.7856  -6.9060   0.3047   7.0397  30.0940 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 128.03695    2.32548  55.058   <2e-16 ***
## Age          -0.06021    0.03906  -1.542    0.124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.24 on 498 degrees of freedom
## Multiple R-squared:  0.00475,    Adjusted R-squared:  0.002751 
## F-statistic: 2.377 on 1 and 498 DF,  p-value: 0.1238

\(\chi^2\) test for independence

data %>% 
  group_by(Group) %>% 
  count(SideEffects)

## # A tibble: 4 x 3
## # Groups:   Group [2]
##   Group SideEffects     n
##   <chr> <chr>       <int>
## 1 I     No            137
## 2 I     Yes           114
## 3 II    No            152
## 4 II    Yes            97

grpI <- c(137,
          114)
grpII <- c(152,97)
nrows <- 2
group.side.effect.matrix <- matrix(c(grpI,
                                     grpII),
                                   nrow = nrows,
                                   byrow = TRUE)
rownames(group.side.effect.matrix) <- c("Group I",
                                        "Group II")
colnames(group.side.effect.matrix) <- c("No",
                                       "Yes")
group.side.effect.matrix

##           No Yes
## Group I  137 114
## Group II 152  97

chisq.test(group.side.effect.matrix,
           correct = FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  group.side.effect.matrix
## X-squared = 2.1402, df = 1, p-value = 0.1435