This is the RMD file (rendered HTML if you are viewing it on RPubs) for my YouTube video on using R for biostatistics. Download the RMD file from Github and add your own notes during the tutorial! Watch the YouTube video tutorial at https://www.youtube.com/watch?v=zrXPG1onjiE&list=PLsu0TcgLDUiJznd-n-i7rMUmNjCuNhgpB&index=1 Dowload the files at https://github.com/juanklopper/R_statistics
Installing R provides the core or base language. R can be extended using libraries (packages). These can be installed using the Package tab (right-bottom of the default interface).
library(tibble) # Modern dataframes
library(readr) # Imports and exports spreadsheet files
library(dplyr) # Tidy data analysis
library(DT) # Displays HTML tables
At its heart, many computer languages are giant, fancy calculators. Below is some code for simple arithmetic and common mathematical functions.
2 + 2 + 4
## [1] 8
2 - 5 - 4
## [1] -7
2 * 4 * 3
## [1] 24
3 / 4
## [1] 0.75
3^3
## [1] 27
(2 + 4) * 3
## [1] 18
exp(1)
## [1] 2.718282
log10(1000)
## [1] 3
log(1000,
base = exp(1))
## [1] 6.907755
Lists are object that contain elements, separated by commas. These can be numbers or words.
c(120, 120, 110, 130, 140)
## [1] 120 120 110 130 140
c("Pneumonia", "ARDS", "Chronic bronchitis")
## [1] "Pneumonia" "ARDS" "Chronic bronchitis"
Lists (among other things) can be stored in your computer’s memory as an object. The name given to the object is called a computer variable.
sbp <- c(120, 120, 110, 130, 140)
sbp
## [1] 120 120 110 130 140
# There are rules for computer variable names
# Below is an example of snake-case
patient.number <- seq(from = 1,
to = 100,
by = 2)
patient.number
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
Every element in a object has an address that can be referenced.
sbp[1]
## [1] 120
sbp[1:3]
## [1] 120 120 110
sbp[c(1, 3)]
## [1] 120 110
Data point values for a variable are distributed according to a pattern, called a distribution.
set.seed(123) # Forces a specific selection of values
age <- sample(18:85,
500,
replace = TRUE)
set.seed(123)
before.after <- rnorm(500)
set.seed(123)
sbp <- round(rnorm(500,
mean = 120,
sd = 20),
digits = 0)
set.seed(123)
crp <- round(rchisq(500,
df = 2),
digits = 1)
set.seed(123)
group <- sample(c("Control",
"Placebo"),
500,
replace = TRUE)
set.seed(123)
side.effects <- sample(c("No",
"Yes"),
500,
replace = TRUE,
prob = c(0.8,
0.2))
The first part of the analysis of data is descriptive statistics.
mean(age)
## [1] 51.184
median(age)
## [1] 50
var(age)
## [1] 373.8539
sd(age)
## [1] 19.3353
range(age)
## [1] 18 85
IQR(age)
## [1] 33
quantile(age,
0.25)
## 25%
## 34
quantile(age,
0.75)
## 75%
## 67
summary(age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 34.00 50.00 51.18 67.00 85.00
unique(group)
## [1] "Control" "Placebo"
After describing the data, more information can be gained from visualizing it.
boxplot(age)
boxplot(age,
col = "deepskyblue",
main = "Patient age",
xlab = "Patients",
ylab = "Age")
hist(before.after,
col = "pink",
main = "Difference in measurement before and after treatment",
xlab = "Measurment",
ylab = "Counts")
plot(age,
sbp,
col = "blue",
main = "Age vs systolic blood pressure",
xlab = "Age",
ylab = "Systolic blood pressure")
Data for analysis are typically stored in dataframes. These are rows and columns of data. Each column represents a variable and each row contains the data point values for a specific subject.
my.data <- tibble(Age = age,
Difference = before.after,
CRP = crp,
Group = group,
sBP = sbp,
SideEffects = side.effects)
write_csv(my.data,
"Data.csv")
Data in a spreadsheet file can be imported for analysis.
# Spreadhseet file is in same folder as this RMD file
# setwd(getwd()) was used initially
# Need only refer to file name (else use the full folder address)
data <- read_csv("ProjectData.csv")
datatable(data)
Subsets of a dataframe can be selected based on chosen rules.
control.group <- data %>% filter(Group == "I")
#control.group <- filter(data,
# filter(Group == "I"))
younger.patients <- data %>% filter(Age < 50)
younger.patients.select <- data %>% filter(Age < 50) %>%
select(sBP,
CRP)
younger.patients.II <- data %>% filter((Age < 50) & (Group == "I"))
data %>%
group_by(Group) %>%
summarise(mean.age = mean(Age))
## # A tibble: 2 x 2
## Group mean.age
## <chr> <dbl>
## 1 I 58.7
## 2 II 58.1
data %>% group_by(SideEffects) %>%
summarise(count = n())
## # A tibble: 2 x 2
## SideEffects count
## <chr> <int>
## 1 No 289
## 2 Yes 211
data %>%
group_by(Group) %>%
count(SideEffects)
## # A tibble: 4 x 3
## # Groups: Group [2]
## Group SideEffects n
## <chr> <chr> <int>
## 1 I No 137
## 2 I Yes 114
## 3 II No 152
## 4 II Yes 97
boxplot(Age ~ Group,
data = data,
col = c("deepskyblue", "orange"),
main = "Age distribution by group",
xlab = "Group",
ylab = "Age",
las = 1)
plot(sBP ~ Age,
data = data,
col = "blue",
main = "Age vs systolic blood pressure",
xlab = "Age",
ylab = "Systolic blood pressure",
las = 1)
t.test(sBP ~ Group,
data,
alternative = c("two.sided"),
mu = 0,
paired = FALSE,
var.equal = TRUE,
conf.level = 0.95)
##
## Two Sample t-test
##
## data: sBP by Group
## t = 1.4147, df = 498, p-value = 0.1578
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5037952 3.0954846
## sample estimates:
## mean in group I mean in group II
## 125.1673 123.8715
summary(lm(sBP ~ Age,
data = data))
##
## Call:
## lm(formula = sBP ~ Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.7856 -6.9060 0.3047 7.0397 30.0940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.03695 2.32548 55.058 <2e-16 ***
## Age -0.06021 0.03906 -1.542 0.124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.24 on 498 degrees of freedom
## Multiple R-squared: 0.00475, Adjusted R-squared: 0.002751
## F-statistic: 2.377 on 1 and 498 DF, p-value: 0.1238
data %>%
group_by(Group) %>%
count(SideEffects)
## # A tibble: 4 x 3
## # Groups: Group [2]
## Group SideEffects n
## <chr> <chr> <int>
## 1 I No 137
## 2 I Yes 114
## 3 II No 152
## 4 II Yes 97
grpI <- c(137,
114)
grpII <- c(152,97)
nrows <- 2
group.side.effect.matrix <- matrix(c(grpI,
grpII),
nrow = nrows,
byrow = TRUE)
rownames(group.side.effect.matrix) <- c("Group I",
"Group II")
colnames(group.side.effect.matrix) <- c("No",
"Yes")
group.side.effect.matrix
## No Yes
## Group I 137 114
## Group II 152 97
chisq.test(group.side.effect.matrix,
correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: group.side.effect.matrix
## X-squared = 2.1402, df = 1, p-value = 0.1435