Dr Michael Owusu (PhD, MSc)
2026-04-02
In your own words, explain why reproducibility is important in biomedical research.
## [1] 24
## [1] 15
## [1] 12
Write the equation in R:
\[ ax^2 + bx + 2 \]
\[ ax^2 + bx + 2 \]
Given the values of the variables a = 2, b = 5, c = 3 and x = 5, rewrite and solve the equation.
## [1] 27
\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]
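# break the equation in two
D <- b^2 - 4*a*c # discriminant (definition not shown in the original; taken from the formula above)
x1 <- (-b + sqrt(D)) / (2*a) # equation 1
x2 <- (-b - sqrt(D)) / (2*a) # equation 2
x1
## [1] 1.5
x2
## [1] 1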
\[ f(x) = 2x^2 - 5x + 3 \]
## [1] 153 190 231 276 325 378
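The root calculation and the evaluation of f(x) above can be reproduced with the short sketch below. The coefficients a = 2, b = -5 and c = 3 are read off f(x). The range x = 10 to 15 is an assumption, chosen because it yields the six values printed above.

```r
# Coefficients of f(x) = 2x^2 - 5x + 3
a <- 2
b <- -5
c <- 3

# Discriminant and the two roots from the quadratic formula
D  <- b^2 - 4 * a * c
x1 <- (-b + sqrt(D)) / (2 * a)
x2 <- (-b - sqrt(D)) / (2 * a)
x1   # 1.5
x2   # 1

# R arithmetic is vectorised, so f() can be evaluated at many x values at once
f <- function(x) a * x^2 + b * x + c
f(10:15)   # 153 190 231 276 325 378
```

Both roots can be checked by substitution: f(x1) and f(x2) return 0.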
Create objects age and height_cm, and compute BMI for a 70 kg individual. Assume age = 4 years and height = 170 cm.
| Operator | Meaning | Operator | Meaning |
|---|---|---|---|
| < | less than | ! | NOT |
| > | greater than | & | AND |
| <= | less than or equal to | \| | OR |
| >= | greater than or equal to | == | equal to |
| != | not equal to | xor() | exclusive OR |
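The operators in the table can be tried out with a minimal sketch such as the one below. The objects age and febrile hold illustrative values only.

```r
age <- 28
febrile <- TRUE

age < 30                # TRUE: 28 is less than 30
age >= 30               # FALSE
age == 28               # TRUE
age != 28               # FALSE
!febrile                # FALSE: NOT reverses a logical value
age > 18 & febrile      # TRUE: both conditions hold
age > 65 | febrile      # TRUE: at least one condition holds
xor(age > 18, febrile)  # FALSE: exclusive OR needs exactly one TRUE
```

Note that == tests equality, whereas a single = (like <-) assigns a value.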
Data types in R are mainly Integer, Double, Complex, Logical, Character, Factor, and Date and Time. To determine the data type, functions such as is.double(), is.integer(), is.logical(), is.character() and is.factor() can be used. class() can also be used.
Integer data types are whole numbers, such as counts. By default R does not store numbers as integers, but situations may arise where numbers have to be converted to integers, for example with as.integer(), to facilitate manipulations.
A Double is a number that can take any value, including decimals. This is the default type of numeric variable used by R. Doubles are stored in floating-point form, so many decimal values are held as close approximations rather than exactly.
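The difference between integers and doubles, and the approximate storage of doubles, can be seen in the following sketch. The values are illustrative.

```r
x <- 0.1 + 0.2
x == 0.3                   # FALSE: both sides carry tiny rounding errors
isTRUE(all.equal(x, 0.3))  # TRUE: equal within a small tolerance

n <- 5L                    # the L suffix stores 5 as an integer
is.integer(n)              # TRUE
is.integer(5)              # FALSE: plain numbers are stored as doubles
is.double(5)               # TRUE
as.integer(7.9)            # 7: conversion drops the decimal part
```

For this reason, all.equal() is usually a safer way to compare computed decimal values than ==.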
A Logical is an object stored as TRUE or FALSE. For example, Z <- 5 < 8 creates “Z” from a statement asking if 5 is less than 8, so Z is a logical (TRUE). Logical objects have numeric values in R: FALSE is always treated as 0 and TRUE as 1.
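A minimal sketch of the logical type follows. The vector febrile is a hypothetical set of fever indicators, used to show why treating TRUE as 1 is useful.

```r
Z <- 5 < 8
Z          # TRUE
class(Z)   # "logical"
Z + 1      # 2, because TRUE counts as 1

# Hypothetical fever indicators for five patients
febrile <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
sum(febrile)    # 3: number of febrile patients
mean(febrile)   # 0.6: proportion of febrile patients
```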
A Character is an object enclosed in quotes. These are often names and cannot be used in mathematical calculations. Examples include “red”, “Male” and “1”. As seen, “1” is a character and so cannot be used for calculations unless converted to a numeric type, for example with as.numeric().
A Factor is a categorical variable such as sex (male and female). Factor variables in R have levels representing the different categories. Sex, for instance, will naturally have two levels, male and female. Factors can be created from numeric and character objects using as.factor().
sex <- "Female"
age <- 28
febrile <- TRUE
class(sex); class(age); class(febrile) # use class() to determine the data type
## [1] "character"
## [1] "numeric"
## [1] "logical"
## [1] "character"
blood.grp2 <- as.factor(blood.grp) # Converts to factor variable
class(blood.grp2) # Now factor variable
## [1] "factor"
## [1] "A" "AB" "B" "O"
## [1] 4 2 3 1
## attr(,"levels")
## [1] "A" "AB" "B" "O"
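The definition of blood.grp is not shown above. The sketch below assumes a character vector whose values are inferred from the printed integer codes (4 2 3 1) and levels, and walks through the same conversion.

```r
# Assumed input, inferred from the printed codes and levels
blood.grp <- c("O", "AB", "B", "A")
class(blood.grp)      # "character"

blood.grp2 <- as.factor(blood.grp)
class(blood.grp2)     # "factor"
levels(blood.grp2)    # "A" "AB" "B" "O": levels are sorted alphabetically
unclass(blood.grp2)   # the underlying integer codes plus the levels attribute
```

Internally a factor is a vector of integer codes, each pointing to one of the levels, which is why unclass() prints numbers rather than the blood groups.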
Different types of data structures in R include Vector, Matrix, Array, List, Data frame and Time-series. Data need to be in a specific structure to perform the appropriate analysis. These data structures are built from the data types described above.
A vector is the simplest data structure in R. It is a collection of elements of the same data type. Many R functions, such as c(), seq() and rep(), create vectors in a specific order. Some examples of the generation of vectors are as shown below:
# Logical variable
hiv_positive <- TRUE
A <- 7.1
is.double(A) # Determine if A is stored as a double
## [1] TRUE
## [1] FALSE
## [1] FALSE
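A few common ways of generating vectors are sketched below. The object names and values are illustrative.

```r
ages   <- c(22, 25, 30, 28)            # combine values with c()
ids    <- 1:10                         # integer sequence from 1 to 10
doses  <- seq(0, 1, by = 0.25)         # 0.00 0.25 0.50 0.75 1.00
groups <- rep(c("A", "B"), times = 3)  # "A" "B" "A" "B" "A" "B"

length(doses)   # 5
class(ids)      # "integer"
class(groups)   # "character"

# A vector holds one data type only, so mixed input is coerced
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"
```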
Matrices are vectors arranged in two dimensions (r, c), so that the elements form rows and columns. There are a number of ways of creating a matrix; a few are shown below. First, a vector can be assigned a dimension of r x c with dim(). Names of the rows and columns can then be assigned with the rownames() and colnames() functions.
X<-1:12
matrix(X, nrow=3, ncol=4, byrow=FALSE,
       dimnames=list(c("Row1","Row2","Row3"), c("Col1","Col2","Col3","Col4")))
##      Col1 Col2 Col3 Col4
## Row1    1    4    7   10
## Row2    2    5    8   11
## Row3    3    6    9   12
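The same matrix can also be built by assigning a dimension to the vector, as described in the paragraph above. This is a minimal sketch of that route.

```r
X <- 1:12
dim(X) <- c(3, 4)   # 3 rows and 4 columns, filled column by column
rownames(X) <- c("Row1", "Row2", "Row3")
colnames(X) <- c("Col1", "Col2", "Col3", "Col4")
X

dim(X)              # 3 4
X["Row2", "Col3"]   # 8
```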
#Example Matrix 4
A data frame is used for routine statistical manipulations. It resembles a matrix, except that each column can hold a different class of data. Most standard statistical datasets are manipulated in R as data frames. Example data frames can be created in R as below:
age <- c(22, 25, 30, 28)
sex <- factor(c("M","F","F","M"))
df <- data.frame(sex, age)
str(df) # a function that displays data structures
## 'data.frame': 4 obs. of 2 variables:
## $ sex: Factor w/ 2 levels "F","M": 2 1 1 2
## $ age: num 22 25 30 28
sex <- gl(n=2, k=5, labels=c("Male","Female")) # Create factor vector
age <- c(5,2,5,6,5,6,7,8,7,7) # Create numeric vector
color <- rep(c("Red","Blue"), times=5) # Create character vector
old <- age > 6 # Create logical vector
df1 <- data.frame(sex, age, color, old) # Create data frame
df1
## sex age color old
## 1 Male 5 Red FALSE
## 2 Male 2 Blue FALSE
## 3 Male 5 Red FALSE
## 4 Male 6 Blue FALSE
## 5 Male 5 Red FALSE
## 6 Female 6 Blue FALSE
## 7 Female 7 Red TRUE
## 8 Female 8 Blue TRUE
## 9 Female 7 Red TRUE
## 10 Female 7 Blue TRUE
A list is similar to a data frame but with some differences. The components of a list can be objects other than vectors, including data frames. Also, the components of a list need not have the same length, as they must in a data frame. A list can be created with the list() function. For example, a list may be made up of three elements: first a data frame called DF, then a numeric vector called Vec, and last a character object called Color.
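A list of this kind can be sketched as follows. The component names DF, Vec and Color come from the text; their contents are hypothetical.

```r
DF    <- data.frame(sex = c("M", "F"), age = c(22, 25))  # hypothetical values
Vec   <- c(1, 3, 5, 7)
Color <- "Red"

L <- list(DF = DF, Vec = Vec, Color = Color)
length(L)       # 3 components
names(L)        # "DF" "Vec" "Color"
L$Vec           # access a component by name
L[["DF"]]$age   # a column of the data frame stored inside the list
str(L)          # compact overview of every component
```

The three components have different lengths and classes, which a data frame would not allow.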
#Indexing data types in R
All data types can be indexed. The simplest is indexing vectors. Numbers are placed in square brackets representing the elements or subset of the vector to be selected. Some examples are as shown below:
## [1] 2
## [1] 4 3 2 6
## [1] 1 3 2 4 2 6 5 4 8
## [1] 1 3 2 4 3 2 4
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [,1] [,2]
## [1,] 4 5
## [2,] 7 8
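The commands that produced the output above are not shown. The sketch below illustrates the same kinds of indexing on a hypothetical vector v and on a 3 x 3 matrix filled row by row, which matches the matrix printed above.

```r
v <- c(1, 3, 2, 4, 3, 2, 6, 5, 4, 8)   # hypothetical vector

v[3]            # third element: 2
v[c(4, 2, 3)]   # elements 4, 2 and 3: 4 3 2
v[1:4]          # first four elements: 1 3 2 4
v[-1]           # everything except the first element
v[v > 4]        # elements greater than 4: 6 5 8

m <- matrix(1:9, nrow = 3, byrow = TRUE)
m[2, 3]         # row 2, column 3: 6
m[2:3, 1:2]     # rows 2 to 3 and columns 1 to 2
m[, 1]          # the whole first column: 1 4 7
```

For matrices and data frames the index takes the form [row, column]; leaving one side empty selects every row or every column.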
Identify variables with missing values.
##Importing and exporting data in R
R comes with a number of built-in datasets. To view all datasets that come with R, use data(). To load a specific dataset from a specific package, use data(x, package="y"), where x is the dataset's name and y is the package's name.
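As a minimal sketch, the calls below list the datasets in the MASS package and load one of them. MASS and birthwt are chosen here because they are used later in these notes.

```r
# List the datasets shipped with the MASS package
ds <- data(package = "MASS")
head(ds$results[, "Item"])   # names of the first few datasets

# Load one dataset by name from that package
data(birthwt, package = "MASS")
dim(birthwt)                 # 189 rows and 10 columns
```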
Text files come in many forms, so first open and view the file in a text editor before attempting to import it into R. Common file types are comma-delimited, imported with read.csv() or read.csv2(), and tab-delimited, imported with read.delim() or read.delim2(). There are many others, which students should explore.
The first step in reading external data is to tell R where to look for the data on your computer. This is done with the function setwd(). A list of the files in that directory can then be obtained with the function dir().
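A round trip of exporting and importing a comma-delimited file can be sketched as below. The data frame and the file name example.csv are hypothetical, and the file is written to R's temporary folder so that the example does not change the working directory. With real data, setwd() would point to the folder that holds the file.

```r
# A small hypothetical data frame to export
df <- data.frame(id = 1:3, sex = c("M", "F", "F"), age = c(22, 25, 30))
path <- file.path(tempdir(), "example.csv")   # hypothetical file name

write.csv(df, path, row.names = FALSE)   # export to a comma-delimited file
basename(path) %in% dir(tempdir())       # TRUE: the file is now in that folder

df2 <- read.csv(path)                    # import it again
str(df2)
```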
##Data management in R
- head(X): Shows the first six rows of a dataset X
- tail(X): Shows the last six rows of a dataset X
- str(X): Shows the structure of the dataset X
- summary(X): Gives a summary of all variables in the dataset X
- transform(X): Makes new variables in the dataset X
- cut(Y): Converts a continuous vector/variable Y into a categorical one
- set.seed(): Sets a number to make random number generation reproducible
- subset(): Selects a part of a data frame
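The functions listed above can be tried on the built-in airquality data, as in the sketch below. The seed value, the cut points and the category labels are illustrative choices.

```r
# set.seed(): the same seed gives the same random numbers
set.seed(123)
r1 <- rnorm(3)
set.seed(123)
r2 <- rnorm(3)
identical(r1, r2)   # TRUE

aq <- airquality
head(aq, 3)         # first three rows
tail(aq, 3)         # last three rows

# transform(): add a temperature column in degrees Celsius
aq <- transform(aq, Temp_C = (Temp - 32) * 5 / 9)

# cut(): convert temperature into three categories
aq$Temp_grp <- cut(aq$Temp, breaks = c(50, 70, 85, 100),
                   labels = c("Cool", "Warm", "Hot"))
table(aq$Temp_grp)

# subset(): keep the July rows and two columns only
july <- subset(aq, Month == 7, select = c(Day, Temp))
nrow(july)          # 31
```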
| Function | Explanation |
|---|---|
| length(x) | length of the vector x |
| sum(x) | Add up all elements of vector x |
| mean(x) | Mean of all elements of x |
| sd(x) | Standard deviation of x |
| var(x) | Variance of x |
| median(x) | median of the elements of x |
| mad(x) | Median absolute deviation |
| quantile(x, probs=) | sample quantiles corresponding to the given probabilities |
| weighted.mean(x, w) | mean of x with weights w |
| rank(x) | ranks of the elements of x |
| density(x) | kernel density estimates of x |
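The summary functions in the table can be applied to a small vector, as sketched below. The vector x and the weights are hypothetical. The last two lines show how missing values affect these functions, using the Ozone variable of the built-in airquality data.

```r
x <- c(2, 4, 4, 5, 7, 9)   # hypothetical measurements

length(x)   # 6
sum(x)      # 31
mean(x)     # about 5.17
median(x)   # 4.5
var(x)      # sample variance
sd(x)       # square root of var(x)
mad(x)      # median absolute deviation
quantile(x, probs = c(0.25, 0.75))
rank(x)     # tied values share the average rank
weighted.mean(x, w = c(1, 1, 1, 1, 1, 5))

# Missing values propagate unless they are removed
mean(airquality$Ozone)                 # NA
mean(airquality$Ozone, na.rm = TRUE)   # mean of the non-missing values
```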
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
aq <- na.omit(airquality)
aq$Temp_C <- (aq$Temp - 32) * 5/9
aq$HotDay <- ifelse(aq$Temp >= 85, "Hot", "Not Hot")
table(aq$HotDay)
##
## Hot Not Hot
## 28 83
library(MASS)
data(birthwt)
bw <- birthwt
bw$low <- factor(bw$low, labels=c("Normal BW","Low BW"))
summary(bw)
## low age lwt race
## Normal BW:130 Min. :14.00 Min. : 80.0 Min. :1.000
## Low BW : 59 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :23.00 Median :121.0 Median :1.000
## Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
Calculate proportion of low birth weight babies.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Relationship between engine power (hp) and fuel efficiency (mpg)
plot(mtcars$hp, mtcars$mpg,
     main = "Scatter Plot: Horsepower vs Fuel Efficiency",
     xlab = "Horsepower",
     ylab = "Miles per Gallon",
     pch = 19,
     col = "blue")
# Fit linear model
model <- lm(mpg ~ hp, data = mtcars)
# Add regression line
abline(model, lwd = 2)
#Histogram
# -----------------------------
# 2. Histogram
# Distribution of miles per gallon
# -----------------------------
hist(mtcars$mpg,
main = "Histogram of Miles per Gallon",
xlab = "Miles per Gallon",
col = "lightgreen",
border = "black")
boxplot(mpg ~ cyl,
data = mtcars,
main = "Fuel Efficiency by Cylinder Number",
xlab = "Number of Cylinders",
ylab = "Miles per Gallon",
col = "orange")
# Count of cars by cylinder type
# -----------------------------
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts,
main = "Number of Cars by Cylinder Type",
xlab = "Cylinders",
ylab = "Frequency",
col = "purple")
# Set plotting layout (2 rows, 3 columns)
par(mfrow = c(2,3))
# -----------------------------
# 1. Normal Distribution
# -----------------------------
x <- seq(-4,4,length=100)
y <- dnorm(x, mean=0, sd=1)
plot(x,y,type="l",
main="Normal Distribution",
xlab="x",
ylab="Density",
col="blue",
lwd=2)
# -----------------------------
# 2. Binomial Distribution
# -----------------------------
x <- 0:10
y <- dbinom(x, size=10, prob=0.5)
barplot(y,
names.arg=x,
main="Binomial Distribution",
xlab="Number of Successes",
ylab="Probability",
col="orange")
# -----------------------------
# 3. Poisson Distribution
# -----------------------------
x <- 0:15
y <- dpois(x, lambda=4)
barplot(y,
names.arg=x,
main="Poisson Distribution",
xlab="Number of Events",
ylab="Probability",
col="green")
# -----------------------------
# 4. Exponential Distribution
# -----------------------------
x <- seq(0,10,length=100)
y <- dexp(x, rate=0.5)
plot(x,y,type="l",
main="Exponential Distribution",
xlab="x",
ylab="Density",
col="red",
lwd=2)
# -----------------------------
# 5. Uniform Distribution
# -----------------------------
x <- seq(0,10,length=100)
y <- dunif(x, min=0, max=10)
plot(x,y,type="l",
main="Uniform Distribution",
xlab="x",
ylab="Density",
col="purple",
lwd=2)
# -----------------------------
# 6. Chi-Square Distribution
# -----------------------------
x <- seq(0,10,length=100)
y <- dchisq(x, df=4)
plot(x,y,type="l",
main="Chi-Square Distribution",
xlab="x",
ylab="Density",
col="darkblue",
lwd=2)

| Category | Statistical test | Variable type | Typical use | R function | Interpretation |
|---|---|---|---|---|---|
| Descriptive | Mean and standard deviation | Continuous | Summarize average and spread of normally distributed data | mean(), sd() | Mean gives average; SD shows spread around the mean |
| Descriptive | Median and interquartile range | Continuous | Summarize central tendency of skewed data | median(), IQR(), quantile() | Median gives middle value; IQR shows spread of middle 50% |
| Descriptive | Frequency and percentage | Categorical | Summarize counts in categories | table(), prop.table() | Shows number and proportion in each category |
| Descriptive | Minimum and maximum | Continuous | Show data limits | min(), max(), range() | Shows smallest and largest values |
| One-sample test | One-sample t-test | Continuous | Compare a sample mean with a known value | t.test(x, mu = value) | Small p-value suggests sample mean differs from the hypothesized value |
| One-sample test | One-sample proportion test | Categorical (binary) | Compare a sample proportion with a known value | prop.test(x, n, p = value) | Small p-value suggests sample proportion differs from the hypothesized value |
| Two-sample test | Independent samples t-test | Continuous | Compare means of two independent groups | t.test(y ~ group, var.equal = TRUE) | Small p-value suggests the two group means are significantly different |
| Two-sample test | Welch two-sample t-test | Continuous | Compare means of two independent groups when variances are unequal | t.test(y ~ group) | Small p-value suggests the two group means are significantly different under unequal variances |
| Paired test | Paired t-test | Continuous | Compare two related measurements from the same subjects | t.test(before, after, paired = TRUE) | Small p-value suggests significant difference between paired measurements |
| ANOVA | One-way ANOVA | Continuous | Compare means across three or more independent groups | aov(y ~ group) | Small p-value suggests at least one group mean differs |
| ANOVA | Repeated measures ANOVA | Continuous | Compare repeated measurements on the same subjects | aov(y ~ subject + time) or ezANOVA() | Small p-value suggests at least one time point differs |
| Categorical test | Chi-square test of independence | Categorical | Assess association between two categorical variables | chisq.test(table(x, y)) | Small p-value suggests association between categorical variables |
| Categorical test | Fisher’s exact test | Categorical | Assess association between two categorical variables when expected counts are small | fisher.test(table(x, y)) | Small p-value suggests association between categorical variables with small samples |
| Categorical test | McNemar test | Categorical | Compare paired categorical responses | mcnemar.test(table(before, after)) | Small p-value suggests paired categorical responses differ |
| Association | Pearson correlation | Continuous | Measure linear association between two continuous variables | cor.test(x, y, method = 'pearson') | Correlation coefficient shows strength and direction of linear relationship |
| Association | Spearman correlation | Ordinal / Continuous | Measure monotonic association between ranked or non-normal variables | cor.test(x, y, method = 'spearman') | Correlation coefficient shows strength and direction of monotonic relationship |
| Regression | Simple linear regression | Continuous outcome | Model continuous outcome using predictor(s) | lm(y ~ x) | Regression coefficient shows expected change in outcome per unit increase in predictor |
| Regression | Logistic regression | Binary outcome | Model binary outcome using predictor(s) | glm(y ~ x, family = binomial) | Odds ratio or coefficients show effect of predictors on odds of outcome |
| Non-parametric | Mann-Whitney U test | Continuous / Ordinal | Compare two independent groups when normality assumption is not met | wilcox.test(y ~ group) | Small p-value suggests the two groups differ in distribution |
| Non-parametric | Wilcoxon signed-rank test | Continuous / Ordinal | Compare paired observations when normality assumption is not met | wilcox.test(before, after, paired = TRUE) | Small p-value suggests paired observations differ in distribution |
| Non-parametric | Kruskal-Wallis test | Continuous / Ordinal | Compare three or more groups when ANOVA assumptions are not met | kruskal.test(y ~ group) | Small p-value suggests at least one group differs in distribution |
| Non-parametric | Friedman test | Continuous / Ordinal | Compare repeated measurements when repeated ANOVA assumptions are not met | friedman.test(y ~ group \| subject) | Small p-value suggests repeated measurements differ |
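As an illustration of a few rows of the table, the sketch below applies the categorical and non-parametric tests to the birthwt data from the MASS package. The choice of variables (smoke, low and bwt) is an assumption made for the example.

```r
library(MASS)   # provides the birthwt data

# Cross-tabulate smoking status against low birth weight
tab <- table(smoke = birthwt$smoke, low = birthwt$low)
tab
prop.table(tab, margin = 1)   # row proportions within each smoking group

# Chi-square test of independence
chi <- chisq.test(tab)
chi$p.value

# Fisher's exact test, the alternative when expected counts are small
fisher.test(tab)$p.value

# Mann-Whitney U test of birth weight by smoking status
# (exact = FALSE because tied birth weights rule out an exact p-value)
wilcox.test(bwt ~ smoke, data = birthwt, exact = FALSE)$p.value
```

Each call returns an object whose components, such as p.value, can be extracted for reporting.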
## [1] 42.0991
## [1] 33.27597
## [1] 2977
Report median and IQR of birth weight in one sentence.
##
## Welch Two Sample t-test
##
## data: bwt by smoke
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 78.57486 488.97860
## sample estimates:
## mean in group 0 mean in group 1
## 3055.696 2771.919
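The command that produced this output is not shown. Judging from the line "data: bwt by smoke", a call of the following form on the birthwt data would give it; this is a sketch under that assumption.

```r
library(MASS)   # provides the birthwt data

# t.test() runs the Welch version by default (var.equal = FALSE)
res <- t.test(bwt ~ smoke, data = birthwt)
res

res$estimate   # mean birth weight in each smoking group
res$conf.int   # 95% confidence interval for the difference in means
res$p.value    # p-value of the test
```

Storing the result in an object makes it possible to extract individual components instead of reading them off the printed output.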
Interpret the p-value from the t-test.