Learning Objectives:
By the end of this tutorial, you will be able to:
Projects automatically set your working directory and keep all files organised. This helps prevent the classic “cannot open file” error.
Step-by-step instructions:
r_biomedical_introtutorial_01.R
Before we begin, let’s create our clinical dataset. Run this code once:
# Generate clinical dataset for Tutorial 1
set.seed(42)
n <- 80
clinical_data <- data.frame(
  patient_id = paste0("PT", sprintf("%03d", 1:n)),
  age = round(rnorm(n, mean = 52, sd = 14)),
  sex = sample(c("Male", "Female"), n, replace = TRUE),
  bmi = round(rnorm(n, mean = 26.5, sd = 4.2), 1),
  treatment = sample(c("Control", "Drug_A", "Drug_B"), n, replace = TRUE),
  wbc_count = round(rnorm(n, mean = 7.2, sd = 1.8), 2), # white blood cells x10^9/L
  systolic_bp = round(rnorm(n, mean = 130, sd = 18)),
  response = round(rnorm(n, mean = 50, sd = 20), 1) # % symptom reduction
)
# Save to CSV
write.csv(clinical_data, "clinical_data.csv", row.names = FALSE)
# Display first few rows
head(clinical_data)
## patient_id age sex bmi treatment wbc_count systolic_bp response
## 1 PT001 71 Male 20.2 Control 8.79 147 16.5
## 2 PT002 44 Female 20.3 Drug_A 9.44 143 85.7
## 3 PT003 57 Female 27.0 Control 5.31 127 4.5
## 4 PT004 61 Male 22.3 Drug_B 6.00 120 18.2
## 5 PT005 58 Female 26.5 Drug_A 8.52 144 45.1
## 6 PT006 51 Female 24.7 Control 6.28 125 42.9
R can be used as a powerful calculator:
## [1] 5
## [1] 6
## [1] 42
## [1] 25
## [1] 256
## [1] 12
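The expressions that produced the six output lines above did not survive in this copy of the tutorial. As an illustration only, the expressions below (chosen here, not necessarily the author's) evaluate to the same six values, in the same order:

```r
# Illustrative expressions; the tutorial's original inputs were not preserved
2 + 3       # addition
10 - 4      # subtraction
6 * 7       # multiplication
100 / 4     # division
2^8         # exponentiation
sqrt(144)   # square root
```

Each line is evaluated and printed as soon as it is run, which is why R works as a calculator without any extra setup.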
Use <- to store values in variables (objects):
# Variable assignment
# Use <- to store a value in a variable
# Think of <- as "gets": "x gets the value 42"
x <- 42
x # typing the variable name prints its value
## [1] 42
patient_count <- 80
mean_age <- 52.4
study_name <- "Cardio Trial 2024" # text (strings) go in quotes
# Print multiple variables
print(paste("Study:", study_name, "- Mean age:", mean_age))
## [1] "Study: Cardio Trial 2024 - Mean age: 52.4"
Note: R is case-sensitive. Age and age are different objects.
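A quick way to see case sensitivity in action is to create two objects whose names differ only in capitalisation. The names and values below are illustrative and not part of the tutorial's dataset:

```r
# Two objects whose names differ only in capitalisation
age <- 52   # lower-case name
Age <- 61   # upper-case initial: a separate object

age         # prints the value stored in `age`
Age         # prints the value stored in `Age`
```

Because the two names are unrelated as far as R is concerned, a typo in capitalisation usually shows up as an "object not found" error.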
A vector is R's most basic data structure: an ordered collection of values that all share the same type.
Use c() (combine/concatenate) to create vectors:
# A vector of white blood cell counts (x10^9/L) for 6 patients
wbc <- c(5.2, 7.8, 11.3, 4.9, 8.1, 6.7)
# A vector of patient ages
ages <- c(34, 56, 23, 67, 45, 71)
# A vector of treatment groups (character vector)
treatment <- c("Control", "Drug_A", "Drug_A", "Control", "Drug_B", "Drug_B")
# Display
wbc
## [1] 5.2 7.8 11.3 4.9 8.1 6.7
## [1] 6
## [1] 44
## [1] 7.333333
## [1] 7.25
## [1] 2.341509
## [1] 4.9 11.3
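The calls behind the six summary values above were lost in this copy. The base R functions below reproduce those values for the `wbc` vector defined earlier, so they are a plausible reconstruction rather than the author's confirmed code:

```r
wbc <- c(5.2, 7.8, 11.3, 4.9, 8.1, 6.7)  # same values as in the tutorial

length(wbc)   # number of elements
sum(wbc)      # total of all values
mean(wbc)     # arithmetic mean
median(wbc)   # middle value of the sorted vector
sd(wbc)       # sample standard deviation (n - 1 denominator)
range(wbc)    # minimum and maximum
```

All of these take the whole vector as input and return a short summary, which is the usual pattern for descriptive statistics in R.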
R indexes from 1 (not from 0, as Python does):
## [1] 5.2
## [1] 11.3
## [1] 5.2 11.3 8.1
## [1] 7.8 11.3 4.9
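The indexing calls for the four output lines above are missing from this copy. Given the `wbc` vector, the following square-bracket expressions return exactly those values; treat them as a reconstruction under that assumption:

```r
wbc <- c(5.2, 7.8, 11.3, 4.9, 8.1, 6.7)  # same values as in the tutorial

wbc[1]            # first element
wbc[3]            # third element
wbc[c(1, 3, 5)]   # first, third and fifth elements
wbc[2:4]          # elements two to four (2:4 is shorthand for c(2, 3, 4))
```

The index inside the brackets is itself a vector, so any way of building a vector of positions can be used to pick elements.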
# Logical indexing: find elements that meet a condition
wbc > 8 # returns TRUE/FALSE for each element
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
wbc[wbc > 8] # keep only the elements where the condition is TRUE
## [1] 11.3 8.1
# Which patients have elevated WBC?
# Normal adult WBC range is roughly 4.5 to 11.0 x10^9/L
high_wbc <- wbc[wbc > 11.0]
high_wbc
## [1] 11.3
Apply operations to all elements at once — no loops needed:
## [1] 5200 7800 11300 4900 8100 6700
## [1] -2.1333333 0.4666667 3.9666667 -2.4333333 0.7666667 -0.6333333
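The two outputs above are consistent with multiplying the vector by 1000 and subtracting its mean. The original calls were not preserved, so the sketch below is a reconstruction that yields the same numbers:

```r
wbc <- c(5.2, 7.8, 11.3, 4.9, 8.1, 6.7)  # same values as in the tutorial

# Multiply every element by 1000 (x10^9/L to cells per microlitre)
wbc * 1000

# Subtract the mean from every element (deviation from the mean)
wbc - mean(wbc)
```

In both cases the single value on the right (1000, or the mean) is recycled across all six elements, which is what makes an explicit loop unnecessary.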
A data frame is like a spreadsheet table — rows are observations, columns are variables.
We’ll use the tidyverse collection of packages:
## # A tibble: 6 × 8
## patient_id age sex bmi treatment wbc_count systolic_bp response
## <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 PT001 71 Male 20.2 Control 8.79 147 16.5
## 2 PT002 44 Female 20.3 Drug_A 9.44 143 85.7
## 3 PT003 57 Female 27 Control 5.31 127 4.5
## 4 PT004 61 Male 22.3 Drug_B 6 120 18.2
## 5 PT005 58 Female 26.5 Drug_A 8.52 144 45.1
## 6 PT006 51 Female 24.7 Control 6.28 125 42.9
## # A tibble: 6 × 8
## patient_id age sex bmi treatment wbc_count systolic_bp response
## <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 PT075 44 Male 22.5 Drug_A 10.2 155 47.3
## 2 PT076 60 Male 31.1 Drug_A 8.13 135 53.9
## 3 PT077 63 Male 28.2 Drug_B 7.11 135 31.9
## 4 PT078 58 Female 29 Drug_B 4.9 105 43.7
## 5 PT079 40 Female 34.1 Control 6.32 132 27
## 6 PT080 37 Female 27 Drug_A 9.52 137 12.5
## Rows: 80
## Columns: 8
## $ patient_id <chr> "PT001", "PT002", "PT003", "PT004", "PT005", "PT006", "PT0…
## $ age <dbl> 71, 44, 57, 61, 58, 51, 73, 51, 80, 51, 70, 84, 33, 48, 50…
## $ sex <chr> "Male", "Female", "Female", "Male", "Female", "Female", "M…
## $ bmi <dbl> 20.2, 20.3, 27.0, 22.3, 26.5, 24.7, 23.9, 18.0, 21.4, 27.3…
## $ treatment <chr> "Control", "Drug_A", "Control", "Drug_B", "Drug_A", "Contr…
## $ wbc_count <dbl> 8.79, 9.44, 5.31, 6.00, 8.52, 6.28, 6.43, 6.08, 7.56, 9.90…
## $ systolic_bp <dbl> 147, 143, 127, 120, 144, 125, 110, 119, 124, 151, 161, 147…
## $ response <dbl> 16.5, 85.7, 4.5, 18.2, 45.1, 42.9, 55.4, 59.1, -3.8, 67.3,…
## patient_id age sex bmi
## Length:80 Min. :10.00 Length:80 Min. :18.00
## Class :character 1st Qu.:43.00 Class :character 1st Qu.:23.70
## Mode :character Median :54.00 Mode :character Median :26.20
## Mean :52.31 Mean :26.02
## 3rd Qu.:61.25 3rd Qu.:28.57
## Max. :84.00 Max. :34.10
## treatment wbc_count systolic_bp response
## Length:80 Min. : 1.170 Min. : 87.0 Min. :-3.80
## Class :character 1st Qu.: 5.938 1st Qu.:118.8 1st Qu.:31.88
## Mode :character Median : 7.090 Median :132.0 Median :47.40
## Mean : 7.137 Mean :130.3 Mean :48.38
## 3rd Qu.: 8.220 3rd Qu.:141.5 3rd Qu.:64.88
## Max. :11.970 Max. :161.0 Max. :91.90
## [1] 80 8
## [1] 80
## [1] 8
## [1] "patient_id" "age" "sex" "bmi" "treatment"
## [6] "wbc_count" "systolic_bp" "response"
## [1] 71 44 57 61 58 51 73 51 80 51 70 84 33 48 50 61 48 15 18 70 48 27 50 69 79
## [26] 46 48 27 58 43 58 62 66 43 59 28 41 40 18 53 55 47 63 42 33 58 41 72 46 61
## [51] 57 41 74 61 53 56 62 53 10 56 47 55 60 72 42 70 57 67 65 62 37 51 61 39 44
## [76] 60 63 58 40 37
## [1] 52.3125
##
## Control Drug_A Drug_B
## 24 33 23
##
## Female Male
## 49 31
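The code that produced the exploration output above was not preserved. The tibble headers and the `Rows:` / `Columns:` listing suggest readr's `read_csv()` and dplyr's `glimpse()`, but that is an inference. The sketch below is a base R approximation that runs without extra packages. It uses the first four rows of the dataset shown earlier (four columns only), written to a temporary file that stands in for clinical_data.csv; the names `demo`, `csv_path` and `dat` are illustrative.

```r
# Stand-in for clinical_data.csv: first four rows, four columns
demo <- data.frame(
  patient_id = c("PT001", "PT002", "PT003", "PT004"),
  age        = c(71, 44, 57, 61),
  sex        = c("Male", "Female", "Female", "Male"),
  treatment  = c("Control", "Drug_A", "Control", "Drug_B")
)
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Import the file; with the tidyverse loaded, read_csv() returns a tibble
dat <- read.csv(csv_path)

head(dat)              # first rows
tail(dat)              # last rows
str(dat)               # column types and first values (compare glimpse())
summary(dat)           # per-column summaries
dim(dat)               # number of rows and columns
nrow(dat)              # number of rows
ncol(dat)              # number of columns
colnames(dat)          # column names
dat$age                # one column, extracted as a vector
mean(dat$age)          # mean of that column
table(dat$treatment)   # counts per treatment group
table(dat$sex)         # counts per sex
```

The `$` operator is the bridge between data frames and vectors: once a column is extracted, every vector function from the previous section applies to it.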
The pipe operator means “take this thing, THEN do this to it”:
## [1] 3.5
## [1] 3.5
Note: %>% (from magrittr) works much the same as |> (base R, available since R 4.1.0). For the simple pipelines in this tutorial, use whichever you prefer.
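The two identical results above came from code that is missing in this copy. As an illustration of the idea, the sketch below computes the same value once with a conventional call and once with the base pipe; the vector `values` is chosen here and is not taken from the tutorial.

```r
values <- c(2, 3, 4, 5)   # illustrative values, not from the tutorial

mean(values)        # conventional call
values |> mean()    # same call written with the base pipe (R >= 4.1.0)

# Pipes pay off when steps are chained: read left to right
values |> sqrt() |> sum() |> round(2)

# Equivalent nested form, read from the inside out
round(sum(sqrt(values)), 2)
```

The piped and nested versions do the same work; the pipe only changes the order in which the steps are written, placing them in the order they are carried out.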
Keep only rows that meet certain conditions:
# Keep only female patients
female_patients <- clinical_data |>
  filter(sex == "Female")
# Keep patients over 60 in the Drug_A group
older_drug_a <- clinical_data |>
  filter(age > 60, treatment == "Drug_A")
# Keep patients with elevated WBC (above normal range)
elevated_wbc <- clinical_data |>
  filter(wbc_count > 11.0)
# Display result
head(elevated_wbc)
## # A tibble: 2 × 8
## patient_id age sex bmi treatment wbc_count systolic_bp response
## <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 PT043 63 Male 27.2 Drug_B 11.6 144 36.2
## 2 PT046 58 Male 32.5 Drug_A 12.0 114 76.9
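filter() is the data frame counterpart of the logical indexing used on vectors earlier. To make that link concrete, here is a base R sketch on the first four rows of the dataset (three columns only); the object name `demo` is illustrative.

```r
# First four rows of the tutorial dataset, three columns only
demo <- data.frame(
  patient_id = c("PT001", "PT002", "PT003", "PT004"),
  age        = c(71, 44, 57, 61),
  sex        = c("Male", "Female", "Female", "Male")
)

# Base R counterpart of filter(sex == "Female"):
# a logical vector selects rows, the empty slot after the comma keeps all columns
demo[demo$sex == "Female", ]

# Conditions separated by commas in filter() are combined with AND (&)
demo[demo$age > 60 & demo$sex == "Male", ]

# subset() is a base R shorthand that avoids repeating demo$
subset(demo, age > 60 & sex == "Male")
```

One difference worth knowing: filter() and subset() drop rows where the condition evaluates to NA, whereas bracket indexing returns such rows filled with NA.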
Select specific columns you need:
# Select just the columns needed for a sub-analysis
patient_summary <- clinical_data |>
  select(patient_id, age, sex, treatment, response)
head(patient_summary)
## # A tibble: 6 × 5
## patient_id age sex treatment response
## <chr> <dbl> <chr> <chr> <dbl>
## 1 PT001 71 Male Control 16.5
## 2 PT002 44 Female Drug_A 85.7
## 3 PT003 57 Female Control 4.5
## 4 PT004 61 Male Drug_B 18.2
## 5 PT005 58 Female Drug_A 45.1
## 6 PT006 51 Female Control 42.9
Create new columns or modify existing ones:
# Create a new column: WBC converted to cells per microlitre
clinical_data <- clinical_data |>
  mutate(wbc_per_ul = wbc_count * 1000)
# Create a new column: categorise BMI
clinical_data <- clinical_data |>
  mutate(bmi_category = case_when(
    bmi < 18.5 ~ "Underweight",
    bmi >= 18.5 & bmi < 25 ~ "Normal",
    bmi >= 25 & bmi < 30 ~ "Overweight",
    bmi >= 30 ~ "Obese"
  ))
# Check the result
table(clinical_data$bmi_category)
##
## Normal Obese Overweight Underweight
## 30 13 36 1
The pipe operator shines when combining operations:
# Chain multiple operations together
drug_response_summary <- clinical_data |>
  filter(treatment != "Control") |> # exclude control group
  select(patient_id, treatment, response, age) |>
  filter(age >= 18 & age <= 65) |> # working-age adults
  mutate(high_responder = response > 60) # flag high responders
head(drug_response_summary)
## # A tibble: 6 × 5
## patient_id treatment response age high_responder
## <chr> <chr> <dbl> <dbl> <lgl>
## 1 PT002 Drug_A 85.7 44 TRUE
## 2 PT004 Drug_B 18.2 61 FALSE
## 3 PT005 Drug_A 45.1 58 FALSE
## 4 PT008 Drug_A 59.1 51 FALSE
## 5 PT013 Drug_A 42.4 33 FALSE
## 6 PT015 Drug_B 64.4 50 TRUE
Instructions: The dataset
blood_markers.csv (generated below) contains haematology
data from 40 patients. Complete the following tasks:
# Generate the exercise dataset
set.seed(99)
n2 <- 40
blood_markers <- data.frame(
  id = paste0("P", 1:n2),
  haemoglobin = round(rnorm(n2, mean = 13.5, sd = 1.8), 1), # g/dL
  platelets = round(rnorm(n2, mean = 250, sd = 60)), # x10^9/L
  condition = sample(c("Anaemia", "Healthy", "Polycythaemia"), n2, replace = TRUE),
  hospital = sample(c("City", "Royal", "St. Mary's"), n2, replace = TRUE)
)
write.csv(blood_markers, "blood_markers.csv", row.names = FALSE)
Tasks:
Load blood_markers.csv and display its structure using at least two exploration functions
[…] table())
Create a new column platelet_status that labels: […] "Low", […] "Normal", […] "High"
Select id, haemoglobin, condition, and platelet_status, and export to blood_markers_cleaned.csv
Reproducibility: When would you use
filter() vs. manually removing rows from a CSV in Excel?
What are the advantages of the scripted approach?
Data cleaning: You have a dataset where the
sex column contains entries like "M",
"male", "MALE", and "Male". How
would you standardise this in R? Why does this matter?
Random seeds: Why is it important to use
set.seed() when generating or sampling data? What could go
wrong if two researchers ran the same simulation without it?
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.1.4
## [5] purrr_1.2.0 readr_2.1.6 tidyr_1.3.1 tibble_3.3.0
## [9] ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] bit_4.6.0 gtable_0.3.6 jsonlite_2.0.0 crayon_1.5.3
## [5] compiler_4.5.1 tidyselect_1.2.1 parallel_4.5.1 jquerylib_0.1.4
## [9] scales_1.4.0 yaml_2.3.12 fastmap_1.2.0 R6_2.6.1
## [13] generics_0.1.4 knitr_1.50 bslib_0.9.0 pillar_1.11.1
## [17] RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.1.6 utf8_1.2.6
## [21] stringi_1.8.7 cachem_1.1.0 xfun_0.55 sass_0.4.10
## [25] S7_0.2.1 bit64_4.6.0-1 timechange_0.3.0 cli_3.6.5
## [29] withr_3.0.2 magrittr_2.0.4 digest_0.6.39 grid_4.5.1
## [33] vroom_1.6.7 rstudioapi_0.17.1 hms_1.1.4 lifecycle_1.0.4
## [37] vctrs_0.6.5 evaluate_1.0.5 glue_1.8.0 farver_2.1.2
## [41] rmarkdown_2.30 tools_4.5.1 pkgconfig_2.0.3 htmltools_0.5.9
Comments and Documentation
Lines starting with # are comments: R ignores them, along with anything that follows a # later on a line. Always comment your code!