Biostatistics Lecture

Dr Michael Owusu (PhD, MSc)

2026-04-02

SESSION 1: Course Overview & Reproducible Research

Learning Objectives

Exercise 1

In your own words, explain why reproducibility is important in biomedical research.

SESSION 2: R as a Calculator & Object Assignment

6 * 4
## [1] 24
x <- 5
y <- 10
x + y
## [1] 15
z<- 20 / 5
k<-2^3
z+k
## [1] 12

Exercise 1

Write the equation in R:

\[ ax^2 + bx + 2 \]

# Students should practice writing the formula in R

Exercise 2: Rewrite the equation given the values

\[ ax^2 + bx + 2 \] Given the values of the variables a = 2, b = -5, c = 3 and x = 5, rewrite and solve the equation.

# Assign values
a <- 2
b <- -5
c <- 3
x <- 5
# Evaluate the equation at x = 5
y <- a * x^2 + b * x + 2
y
## [1] 27

Compute the value of D

# Compute discriminant value of D
D <- b^2 - 4*a*c
D
## [1] 1

Observe the formula below and write an equation in R to solve it:

\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

Exercise 4

# Student should write and solve the equation in R

Solution

# break the equation in two
x1 <- (-b + sqrt(D)) / (2*a) # equation 1
x2 <- (-b - sqrt(D)) / (2*a) # equation 2

x1
## [1] 1.5
x2
## [1] 1

Solving with functions for values x1 and x2

\[ f(x) = 2x^2 - 5x + 3 \]

# Write the formula in R and solve with functions
f <- function(x) 2*x^2 - 5*x + 3

Solving with functions given x = 4

x<- 4
f(x)
## [1] 15

Solving with function given values of x as 10, 11, 12, 13, 14, 15

x<-10:15
f(x)
## [1] 153 190 231 276 325 378
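A quick way to check the roots found with the quadratic formula is to pass them back into the function: a correct root should return 0. The sketch below repeats the values from the worked example (a = 2, b = -5, c = 3) so that it runs on its own; the object name roots is chosen here for illustration.

```r
# Coefficients from the worked example
a <- 2
b <- -5
c <- 3

# The same quadratic written as a function of x
f <- function(x) a * x^2 + b * x + c

# Roots from the quadratic formula
D <- b^2 - 4 * a * c
roots <- c((-b + sqrt(D)) / (2 * a), (-b - sqrt(D)) / (2 * a))

roots      # 1.5 and 1
f(roots)   # both values are 0, confirming the roots
```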

Exercise

Create objects age and height_cm, and compute BMI for a 70 kg individual. Assume age = 4 years and height = 170 cm.

# Students should try

Summary of syntax

Selected logical operators in R
Operator  Meaning                    Operator  Meaning
<         less than                  !         NOT
>         greater than               &         AND
<=        less than or equal to      |         OR
>=        greater than or equal to   ==        equal to
!=        not equal to               xor()     exclusive OR
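The operators in the table can be tried directly at the console. The short sketch below uses two illustrative objects, age and febrile, whose values are assumptions made for this example only.

```r
# Illustrative values (assumed for this example)
age <- 28
febrile <- TRUE

age >= 18               # TRUE
age != 28               # FALSE
age > 18 & febrile      # TRUE: both conditions hold
age < 18 | febrile      # TRUE: at least one condition holds
!febrile                # FALSE
xor(age > 18, febrile)  # FALSE: both are TRUE, so not exactly one
```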

SESSION 3: Data Types in R

Data types in R are mainly Integer, Double, Complex, Logical, Character, Factor, and Date and Time. To determine the data type, functions such as is.double(), is.integer(), is.logical(), is.character() and is.factor() can be used. class() can also be used.

Data types

Integer data types are made up of numeric variables that can be counted. By default R does not store numbers as integers but there may arise situations where numbers will have to be converted to integers to facilitate manipulations.

A Double is a number that can take any value including decimals. This is the default type of numeric variable used by R. Many decimal values stored as doubles are not stored exactly but rather as close approximations to real numbers.

A Logical is an object stored as TRUE or FALSE. For example, Z <- 5 < 8 creates Z from a statement asking if 5 is less than 8; Z therefore is a logical (TRUE). Logical objects have innate values in R such that FALSE is always considered to have a value of 0 while TRUE has a value of 1.

A Character is an object enclosed in quotes. These are often names and cannot be used in mathematical calculations. Examples include “red”, “Male” and “1”. As seen, “1” is a character and so cannot be used for calculations unless converted to another object form.

A Factor is a categorical variable such as sex (male & female). Factor variables in R have levels representing the different categories. Sex for instance naturally will have two levels, male and female. Factors can be created from numeric and character objects using as.factor()
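The behaviour described in the paragraphs above can be sketched in a few lines. The object names n, Z, febrile_vec and one are chosen here for illustration and are not part of the lecture data.

```r
# Integer versus double
n <- 5L                 # the L suffix stores 5 as an integer
class(n)                # "integer"
class(5)                # "numeric" (stored as a double by default)
as.integer(7.9)         # 7: conversion drops the decimal part

# Logical values behave as 0 and 1 in arithmetic
Z <- 5 < 8              # TRUE
Z + 1                   # 2
febrile_vec <- c(TRUE, FALSE, TRUE, TRUE)
sum(febrile_vec)        # 3: counts the TRUE values

# A character must be converted before it is used in calculations
one <- "1"
as.numeric(one) + 1     # 2
```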

Examples of data types

sex <- "Female"
age <- 28
febrile <- TRUE
class(sex); class(age); class(febrile) # use class() to determine the data type
## [1] "character"
## [1] "numeric"
## [1] "logical"
blood.grp<-c("O", "AB", "B", "A")
class(blood.grp) # Character object
## [1] "character"
blood.grp2<-as.factor(blood.grp) # Converts to factor variable
class(blood.grp2) # Now factor variable
## [1] "factor"
levels(blood.grp2) # Shows level of the object
## [1] "A"  "AB" "B"  "O"
unclass(blood.grp2) # shows numerical values allocated
## [1] 4 2 3 1
## attr(,"levels")
## [1] "A"  "AB" "B"  "O"

SESSION 4: Data Structures (Vectors, Matrices, Data Frames)

Different types of data structures in R include Vector, Matrix, Array, List, Data frame and Time-series. Data needs to be in a specific structure to perform the appropriate analysis. These data structures are often built from the data types described above.

Vectors

A vector is the simplest data structure in R. It is a collection of elements of the same data type, as described above. Many functions in R create vectors in a specific order. Some examples of generating vectors are shown below:

A<-c(3,2,4,5,6,2,3,1,7,9) # Combination of numbers

B<-1:10 # sequence of numbers from 1:10

C<-seq(from=0, to=20, by=2) # Sequence of numbers from 0 to 20 in steps of 2

D<-rep("B", times=10) # Replicate "B" 10 times
# Numeric object (patient age)
patient_age <- 45
# Character object (patient ID)
patient_id <- "PT-1023"

Logical object (HIV test result: TRUE = Positive, FALSE = Negative)

# Logical variable
hiv_positive <- TRUE


A<-7.1
is.double(A) # Determine if A is stored as a double
## [1] TRUE
is.integer(A) # Determine if A is stored as an integer
## [1] FALSE
21.3==A+A+A # Adding A three times is not equal to 21.3!
## [1] FALSE
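Because doubles are approximations, comparing them with == can give surprising results, as seen above. A common alternative is to compare within a small tolerance. The sketch below uses all.equal() from base R; the object name total and the tolerance 1e-9 are illustrative choices.

```r
A <- 7.1
total <- A + A + A

# all.equal() compares within a small numerical tolerance;
# wrapping it in isTRUE() gives a plain TRUE or FALSE
isTRUE(all.equal(21.3, total))   # TRUE

# The same idea written out by hand with an explicit tolerance
abs(21.3 - total) < 1e-9         # TRUE
```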

Matrices

Matrices are vectors arranged in two dimensions (r, c), so that the elements form rows and columns. There are a number of ways of creating a matrix; a few are shown below. First, a vector can be assigned a dimension of r x c with dim(). Names of the rows and columns can then be set with the rownames() and colnames() functions.

Example Matrix 1

X<-1:12 # create a vector
dim(X)<-c(3,4) # dim of 3 x 4 converts it into a matrix
colnames(X)<-c("Col1","Col2","Col3","Col4")
rownames(X)<-c("Row1","Row2","Row3")

Example Matrix 2

X<-1:12
matrix(X, nrow=3, ncol=4, byrow=FALSE,
       dimnames=list(c("Row1","Row2","Row3"), c("Col1","Col2","Col3","Col4")))
##      Col1 Col2 Col3 Col4
## Row1    1    4    7   10
## Row2    2    5    8   11
## Row3    3    6    9   12

Example Matrix 3

Row1<-c(1, 4, 7, 10)
Row2<-c(2, 5, 8, 11)
Row3<-c(3, 6, 9, 12)
X<-rbind(Row1, Row2, Row3)

Example Matrix 4

Col1<-1:3
Col2<-4:6
Col3<-7:9
Col4<-10:12
X<-cbind(Col1, Col2, Col3, Col4)
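The four approaches above should give matrices with the same values; only the row and column names differ. A minimal sketch to confirm this, using the hypothetical names X1 to X4 so that the matrices can be compared side by side:

```r
# 1. Assign dimensions to a vector
X1 <- 1:12
dim(X1) <- c(3, 4)

# 2. Use matrix(), filling column by column
X2 <- matrix(1:12, nrow = 3, ncol = 4, byrow = FALSE)

# 3. Bind rows together
X3 <- rbind(c(1, 4, 7, 10), c(2, 5, 8, 11), c(3, 6, 9, 12))

# 4. Bind columns together
X4 <- cbind(1:3, 4:6, 7:9, 10:12)

dim(X1)          # 3 4
all(X1 == X2)    # TRUE
all(X1 == X3)    # TRUE
all(X1 == X4)    # TRUE
dim(t(X1))       # 4 3: t() swaps rows and columns
```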

SESSION 5: Data Frame Structures

A data frame is used for routine statistical manipulations. It is essentially a table in which each column can hold a different class of data (for example numeric, character or factor), while all columns have the same length. Most standard statistical datasets are manipulated in R as data frames. Example data frames can be created in R as below:

age <- c(22, 25, 30, 28)
sex <- factor(c("M","F","F","M"))
df <- data.frame(sex, age)
str(df) # a function that displays data structures
## 'data.frame':    4 obs. of  2 variables:
##  $ sex: Factor w/ 2 levels "F","M": 2 1 1 2
##  $ age: num  22 25 30 28
sex<-gl(n=2, k=5, labels=c("Male","Female")) # Create factor vector
age<-c(5,2,5,6,5,6,7,8,7,7) # Create numeric vector
color<-rep(c("Red","Blue"), times=5) # Create character vector
old<-age>6 # Create logical vector
df1<-data.frame(sex, age, color, old) # Create data frame
df1
##       sex age color   old
## 1    Male   5   Red FALSE
## 2    Male   2  Blue FALSE
## 3    Male   5   Red FALSE
## 4    Male   6  Blue FALSE
## 5    Male   5   Red FALSE
## 6  Female   6  Blue FALSE
## 7  Female   7   Red  TRUE
## 8  Female   8  Blue  TRUE
## 9  Female   7   Red  TRUE
## 10 Female   7  Blue  TRUE

List

A list is similar to a data frame but with some differences. The components of a list can be objects other than vectors, including data frames. Also, the components of a list need not have the same length, as they must in a data frame. A list can be created with the list() function as below. The list below is made up of three elements. The first is a data frame called DF, the next is a numeric vector called Vec, and the last is a character object called Color.

list1<-list(DF=df1, Vec=1:5, Color="Red")
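Once a list exists, its components can be pulled out by name or position. The sketch below rebuilds a smaller version of the list (the two-row df1 here is an assumption made to keep the example short) and shows the usual ways of accessing components.

```r
# A small stand-in for the data frame used above (assumed values)
df1 <- data.frame(sex = c("Male", "Female"), age = c(5, 7))
list1 <- list(DF = df1, Vec = 1:5, Color = "Red")

names(list1)          # "DF" "Vec" "Color"
length(list1)         # 3 components

list1$Vec             # 1 2 3 4 5
list1[["Color"]]      # "Red"
list1$DF$age          # 5 7: a column of the data frame inside the list

# Single brackets return a smaller list, not the component itself
class(list1["Vec"])   # "list"
class(list1[["Vec"]]) # "integer"
```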

Indexing data types in R

All data types can be indexed. The simplest is indexing vectors. Numbers are placed in square brackets representing the elements or subset of the vector to be selected. Some examples are as shown below:

Indexing examples

A<-c(1,3,2,4,3,2,6,5,4,8) # Vector of length 10
A[3] # Extracts the 3rd element (2)
## [1] 2
A[4:7] # Extracts 4th to 7th elements
## [1] 4 3 2 6
A[-5] # Leaves out 5th element (3)
## [1] 1 3 2 4 2 6 5 4 8
A[A<5] # Returns elements less than 5
## [1] 1 3 2 4 3 2 4

Indexing Matrices

mat1<-matrix(1:9, nrow=3, ncol=3, byrow=TRUE)
mat1
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
mat1[2:3, 1:2] # Rows 2,3 and column 1,2 selected
##      [,1] [,2]
## [1,]    4    5
## [2,]    7    8
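Data frames can be indexed in the same [row, column] style as matrices, and columns can also be selected by name. A minimal sketch, assuming a small data frame similar to the one created in Session 5:

```r
# Small example data frame (values taken from the Session 5 example)
df <- data.frame(sex = c("M", "F", "F", "M"),
                 age = c(22, 25, 30, 28))

df$age                     # the age column: 22 25 30 28
df[2, ]                    # the second row, all columns
df[, "sex"]                # the sex column, selected by name
df[df$age > 24, ]          # rows where age is greater than 24
df[df$sex == "F", "age"]   # ages of females: 25 30
```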

Exercise

Identify variables with missing values.

SESSION 6: Data Cleaning & Transformation

Importing and exporting data in R

R comes with a number of datasets. To view all datasets that come with R, use data(). To load a specific dataset from a specific package, use data(x, package = "y"), where x is the dataset’s name and y is the package’s name.

Text files come in many forms, so first open and view the file in any text editor before attempting to import the data into R. Common file types are comma-delimited, imported with read.csv() or read.csv2(), and tab-delimited, imported with read.delim() or read.delim2(). There are many others, which students should explore.

The first step in reading foreign data is to tell R where to look for the data on your computer. This is done with the function setwd(). A list of the files in the directory can then be obtained with the function dir().
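A minimal sketch of an export and import round trip is given below. It writes the built-in airquality dataset to a temporary folder and reads it back, so no particular folder on your computer is assumed; the file name airquality_demo.csv and the object names out_file and aq_in are made up for this example.

```r
# Where is R currently looking for files?
getwd()

# Build a path inside R's temporary folder (hypothetical file name)
out_file <- file.path(tempdir(), "airquality_demo.csv")

# Export: write the data frame as a comma-delimited file
write.csv(airquality, out_file, row.names = FALSE)

# List the CSV files in that folder
dir(tempdir(), pattern = "\\.csv$")

# Import: read the file back into R
aq_in <- read.csv(out_file)
dim(aq_in)   # 153 rows and 6 columns, the same as airquality
```

In practice, setwd() would point to the folder holding your own data, and read.csv() would be given the name of your file.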

Data management in R

head(X): Shows the first six rows of a dataset X
tail(X): Shows the last six rows of a dataset X
str(X): Shows the structure of the dataset X
summary(X): Gives a summary of all variables in the dataset X
transform(X): Makes new variables in the dataset X
cut(Y): Converts a continuous vector/variable into a categorical one
set.seed(): Sets a number to make random number generation reproducible
subset(): Selects a part of a data frame
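Several of these functions can be sketched on the built-in airquality dataset. The cut points (50, 70, 85, 100), the category labels and the seed value 123 below are arbitrary choices made for illustration only.

```r
data(airquality)

tail(airquality, 3)    # last three rows

# subset(): keep days hotter than 90 degrees and three columns only
hot <- subset(airquality, Temp > 90, select = c(Month, Day, Temp))
nrow(hot)

# transform(): add a new variable (temperature in Celsius)
aq2 <- transform(airquality, Temp_C = (Temp - 32) * 5 / 9)
names(aq2)

# cut(): turn a continuous variable into categories
temp_grp <- cut(airquality$Temp,
                breaks = c(50, 70, 85, 100),
                labels = c("Cool", "Warm", "Hot"))
table(temp_grp)

# set.seed(): the same seed gives the same random sample
set.seed(123)
s1 <- sample(1:10, 3)
set.seed(123)
s2 <- sample(1:10, 3)
identical(s1, s2)      # TRUE
```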

Selected Statistical Operations on Vectors
Function Explanation
length(x) length of the vector x
sum(x) Add up all elements of vector x
mean(x) Mean of all elements of x
sd(x) Standard deviation of x
var(x) Variance of x
median(x) median of the elements of x
mad(x) Median absolute deviation
quantile(x, probs=) sample quantiles corresponding to the given probabilities
weighted.mean(x, w) mean of x with weights w
rank(x) ranks of the elements of x
density(x) kernel density estimates of x

Data manipulation

data(airquality)
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
aq <- na.omit(airquality)
aq$Temp_C <- (aq$Temp - 32) * 5/9
aq$HotDay <- ifelse(aq$Temp >= 85, "Hot", "Not Hot")
table(aq$HotDay)
## 
##     Hot Not Hot 
##      28      83
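Before removing rows with na.omit(), it helps to see where the missing values are. A short sketch using base R functions on the same dataset:

```r
# Number of missing values in each variable
colSums(is.na(airquality))        # Ozone has 37, Solar.R has 7

# Number of rows with no missing values in any variable
sum(complete.cases(airquality))   # 111, the rows kept by na.omit()

# Proportion of Ozone values that are missing
mean(is.na(airquality$Ozone))
```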

SESSION 7: Manipulation of Maternal & Child Health Dataset

library(MASS)
data(birthwt)
bw <- birthwt
bw$low <- factor(bw$low, labels=c("Normal BW","Low BW"))
summary(bw)
##         low           age             lwt             race      
##  Normal BW:130   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  Low BW   : 59   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##                  Median :23.00   Median :121.0   Median :1.000  
##                  Mean   :23.24   Mean   :129.8   Mean   :1.847  
##                  3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##                  Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Exercise

Calculate proportion of low birth weight babies.

SESSION 8: Data Visualization for Public Health

Visualizing various plots

# Load dataset
data(mtcars)

# View first few rows
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Relationship between horsepower (hp) and fuel efficiency (mpg)
plot(mtcars$hp, mtcars$mpg,
     main = "Scatter Plot: Horsepower vs Fuel Efficiency",
     xlab = "Horsepower",
     ylab = "Miles per Gallon",
     pch = 19,
     col = "blue")

# Fit linear model
model <- lm(mpg ~ hp, data = mtcars)

# Add regression line
abline(model, lwd = 2)

Histogram

# -----------------------------
# Histogram
# Distribution of miles per gallon
# -----------------------------

hist(mtcars$mpg,
     main = "Histogram of Miles per Gallon",
     xlab = "Miles per Gallon",
     col = "lightgreen",
     border = "black")

Compare mpg across number of cylinders

boxplot(mpg ~ cyl,
        data = mtcars,
        main = "Fuel Efficiency by Cylinder Number",
        xlab = "Number of Cylinders",
        ylab = "Miles per Gallon",
        col = "orange")

# Count of cars by cylinder type
# -----------------------------

cyl_counts <- table(mtcars$cyl)

barplot(cyl_counts,
        main = "Number of Cars by Cylinder Type",
        xlab = "Cylinders",
        ylab = "Frequency",
        col = "purple")

Plotting Probabilities

# Set plotting layout (2 rows, 3 columns)
par(mfrow = c(2,3))
# -----------------------------
# 1. Normal Distribution
# -----------------------------
x <- seq(-4,4,length=100)
y <- dnorm(x, mean=0, sd=1)

plot(x,y,type="l",
     main="Normal Distribution",
     xlab="x",
     ylab="Density",
     col="blue",
     lwd=2)

# -----------------------------
# 2. Binomial Distribution
# -----------------------------
x <- 0:10
y <- dbinom(x, size=10, prob=0.5)

barplot(y,
        names.arg=x,
        main="Binomial Distribution",
        xlab="Number of Successes",
        ylab="Probability",
        col="orange")

# -----------------------------
# 3. Poisson Distribution
# -----------------------------
x <- 0:15
y <- dpois(x, lambda=4)

barplot(y,
        names.arg=x,
        main="Poisson Distribution",
        xlab="Number of Events",
        ylab="Probability",
        col="green")

# -----------------------------
# 4. Exponential Distribution
# -----------------------------
x <- seq(0,10,length=100)
y <- dexp(x, rate=0.5)

plot(x,y,type="l",
     main="Exponential Distribution",
     xlab="x",
     ylab="Density",
     col="red",
     lwd=2)

# -----------------------------
# 5. Uniform Distribution
# -----------------------------
x <- seq(0,10,length=100)
y <- dunif(x, min=0, max=10)

plot(x,y,type="l",
     main="Uniform Distribution",
     xlab="x",
     ylab="Density",
     col="purple",
     lwd=2)

# -----------------------------
# 6. Chi-Square Distribution
# -----------------------------
x <- seq(0,10,length=100)
y <- dchisq(x, df=4)

plot(x,y,type="l",
     main="Chi-Square Distribution",
     xlab="x",
     ylab="Density",
     col="darkblue",
     lwd=2)
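As a sanity check on the curves and bars plotted above, the probabilities of a discrete distribution should add up to 1, and the area under a density curve should also be 1. The sketch below checks this for the binomial and normal examples using base R only; the object names p_total and area are chosen for illustration.

```r
# Binomial probabilities for 0 to 10 successes add up to 1
p_total <- sum(dbinom(0:10, size = 10, prob = 0.5))
p_total

# Area under the standard normal density curve is 1
area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
area

# pnorm() gives the area to the left of a value
pnorm(1.96)   # about 0.975
```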

SESSION 9: Descriptive Statistics & Reporting

Common Statistical Analyses in R and Their Interpretation
Category | Statistical test | Variable type | Typical use | R function | Interpretation
Descriptive | Mean and standard deviation | Continuous | Summarize average and spread of normally distributed data | mean(), sd() | Mean gives average; SD shows spread around the mean
Descriptive | Median and interquartile range | Continuous | Summarize central tendency of skewed data | median(), IQR(), quantile() | Median gives middle value; IQR shows spread of middle 50%
Descriptive | Frequency and percentage | Categorical | Summarize counts in categories | table(), prop.table() | Shows number and proportion in each category
Descriptive | Minimum and maximum | Continuous | Show data limits | min(), max(), range() | Shows smallest and largest values
One-sample test | One-sample t-test | Continuous | Compare a sample mean with a known value | t.test(x, mu = value) | Small p-value suggests sample mean differs from the hypothesized value
One-sample test | One-sample proportion test | Categorical (binary) | Compare a sample proportion with a known value | prop.test(x, n, p = value) | Small p-value suggests sample proportion differs from the hypothesized value
Two-sample test | Independent samples t-test | Continuous | Compare means of two independent groups | t.test(y ~ group, var.equal = TRUE) | Small p-value suggests the two group means are significantly different
Two-sample test | Welch two-sample t-test | Continuous | Compare means of two independent groups when variances are unequal | t.test(y ~ group) | Small p-value suggests the two group means are significantly different under unequal variances
Paired test | Paired t-test | Continuous | Compare two related measurements from the same subjects | t.test(before, after, paired = TRUE) | Small p-value suggests significant difference between paired measurements
ANOVA | One-way ANOVA | Continuous | Compare means across three or more independent groups | aov(y ~ group) | Small p-value suggests at least one group mean differs
ANOVA | Repeated measures ANOVA | Continuous | Compare repeated measurements on the same subjects | aov(y ~ subject + time) or ezANOVA() | Small p-value suggests at least one time point differs
Categorical test | Chi-square test of independence | Categorical | Assess association between two categorical variables | chisq.test(table(x, y)) | Small p-value suggests association between categorical variables
Categorical test | Fisher’s exact test | Categorical | Assess association between two categorical variables when expected counts are small | fisher.test(table(x, y)) | Small p-value suggests association between categorical variables with small samples
Categorical test | McNemar test | Categorical | Compare paired categorical responses | mcnemar.test(table(before, after)) | Small p-value suggests paired categorical responses differ
Association | Pearson correlation | Continuous | Measure linear association between two continuous variables | cor.test(x, y, method = "pearson") | Correlation coefficient shows strength and direction of linear relationship
Association | Spearman correlation | Ordinal / Continuous | Measure monotonic association between ranked or non-normal variables | cor.test(x, y, method = "spearman") | Correlation coefficient shows strength and direction of monotonic relationship
Regression | Simple linear regression | Continuous outcome | Model continuous outcome using predictor(s) | lm(y ~ x) | Regression coefficient shows expected change in outcome per unit increase in predictor
Regression | Logistic regression | Binary outcome | Model binary outcome using predictor(s) | glm(y ~ x, family = binomial) | Odds ratio or coefficients show effect of predictors on odds of outcome
Non-parametric | Mann-Whitney U test | Continuous / Ordinal | Compare two independent groups when normality assumption is not met | wilcox.test(y ~ group) | Small p-value suggests the two groups differ in distribution
Non-parametric | Wilcoxon signed-rank test | Continuous / Ordinal | Compare paired observations when normality assumption is not met | wilcox.test(before, after, paired = TRUE) | Small p-value suggests paired observations differ in distribution
Non-parametric | Kruskal-Wallis test | Continuous / Ordinal | Compare three or more groups when ANOVA assumptions are not met | kruskal.test(y ~ group) | Small p-value suggests at least one group differs in distribution
Non-parametric | Friedman test | Continuous / Ordinal | Compare repeated measurements when repeated ANOVA assumptions are not met | friedman.test(y ~ group | subject) | Small p-value suggests repeated measurements differ

mean(aq$Ozone)
## [1] 42.0991
sd(aq$Ozone)
## [1] 33.27597
median(bw$bwt)
## [1] 2977
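Note that mean(aq$Ozone) above (42.0991) differs slightly from the Ozone mean reported by summary(airquality) (42.13). The sketch below shows why: na.omit() drops every row with any missing value, including rows where only Solar.R is missing, while na.rm = TRUE removes missing values from the one variable being summarized.

```r
# With missing values present, mean() returns NA
mean(airquality$Ozone)

# na.rm = TRUE ignores the missing Ozone values only
mean(airquality$Ozone, na.rm = TRUE)   # about 42.13

# na.omit() removes whole rows, so fewer Ozone values remain
aq <- na.omit(airquality)
mean(aq$Ozone)                         # 42.0991
```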

Exercise

Report median and IQR of birth weight in one sentence.

SESSION 10: Inferential Statistics

t.test(bwt ~ smoke, data=bw)
## 
##  Welch Two Sample t-test
## 
## data:  bwt by smoke
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   78.57486 488.97860
## sample estimates:
## mean in group 0 mean in group 1 
##        3055.696        2771.919
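The result of t.test() can be stored in an object so that individual parts of the output are available for reporting. A minimal sketch, assuming the MASS package is installed; the object name res is chosen here for illustration.

```r
library(MASS)   # provides the birthwt dataset

# Store the test result instead of only printing it
res <- t.test(bwt ~ smoke, data = birthwt)

res$p.value     # about 0.007
res$conf.int    # 95% confidence interval for the difference in means
res$estimate    # mean birth weight in each smoking group
```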

Exercise

Interpret the p-value from the t-test.