Introduction

Learning Objectives:

By the end of this tutorial, you will be able to:

  • Navigate the RStudio interface and create an R project
  • Write, annotate, and run basic R code
  • Create and manipulate vectors and data frames using biomedical data
  • Import a CSV dataset and perform exploratory analysis
  • Filter, select, and transform data using dplyr
  • Export a cleaned dataset to CSV

1. Setting Up Your RStudio Project (10 minutes)

Why Projects Matter

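An RStudio project sets your working directory to the project folder every time you open it, so relative file names such as "clinical_data.csv" resolve correctly. This prevents the classic “cannot open file” error and keeps scripts, data, and outputs together in one place.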

Step-by-step instructions:

  1. Open RStudio
  2. Go to File → New Project → New Directory → New Project
  3. Name the project r_biomedical_intro
  4. Choose a sensible location (e.g. your Desktop or Documents folder)
  5. Click Create Project — RStudio will restart
  6. Go to File → New File → R Script — save it as tutorial_01.R
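
Once the project is open, a quick check in the Console confirms where R will look for files. This is a minimal sketch using base R only; the path printed depends on where you created the project, and the variable names are illustrative.

```r
# Show the current working directory (the project folder when a project is open)
project_dir <- getwd()
project_dir

# List the files R can see there without typing a full path
project_files <- list.files()
project_files

# Check for a specific file before trying to open it
file.exists("tutorial_01.R")
```

If file.exists() returns FALSE for a file you expect to be there, the working directory is probably not the project folder.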

Generate the Dataset

Before we begin, let’s create a simulated clinical dataset. Run this code once in your script; it also saves the data as clinical_data.csv in your project folder:

# Generate clinical dataset for Tutorial 1
set.seed(42)
n <- 80

clinical_data <- data.frame(
  patient_id  = paste0("PT", sprintf("%03d", 1:n)),
  age         = round(rnorm(n, mean = 52, sd = 14)),
  sex         = sample(c("Male", "Female"), n, replace = TRUE),
  bmi         = round(rnorm(n, mean = 26.5, sd = 4.2), 1),
  treatment   = sample(c("Control", "Drug_A", "Drug_B"), n, replace = TRUE),
  wbc_count   = round(rnorm(n, mean = 7.2, sd = 1.8), 2),   # white blood cells x10^9/L
  systolic_bp = round(rnorm(n, mean = 130, sd = 18)),
  response    = round(rnorm(n, mean = 50, sd = 20), 1)      # % symptom reduction
)

# Save to CSV
write.csv(clinical_data, "clinical_data.csv", row.names = FALSE)

# Display first few rows
head(clinical_data)
##   patient_id age    sex  bmi treatment wbc_count systolic_bp response
## 1      PT001  71   Male 20.2   Control      8.79         147     16.5
## 2      PT002  44 Female 20.3    Drug_A      9.44         143     85.7
## 3      PT003  57 Female 27.0   Control      5.31         127      4.5
## 4      PT004  61   Male 22.3    Drug_B      6.00         120     18.2
## 5      PT005  58 Female 26.5    Drug_A      8.52         144     45.1
## 6      PT006  51 Female 24.7   Control      6.28         125     42.9

2. Writing Basic R Code (10 minutes)

Comments and Documentation

Anything after a # on a line is a comment, and R ignores it. A comment can fill a whole line or follow code on the same line. Always comment your code!

# -------------------------------------------------------
# Tutorial 1: R Fundamentals
# Author: Your Name
# Date: Today's Date
# Description: Exploring a clinical dataset in R
# -------------------------------------------------------

# This is a comment — it documents what the code does
# Your future self will thank you for writing clear comments

Arithmetic Operations

R can be used as a powerful calculator:

# Basic arithmetic
2 + 3          # addition
## [1] 5
10 - 4         # subtraction
## [1] 6
6 * 7          # multiplication
## [1] 42
100 / 4        # division
## [1] 25
2^8            # exponentiation (2 to the power of 8)
## [1] 256
sqrt(144)      # square root — R has many built-in functions
## [1] 12
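
Two further points are worth a quick sketch: integer division with its remainder, and the order in which R evaluates operators. The numbers here are arbitrary illustrations.

```r
# Integer division and remainder
17 %/% 5        # 3: how many whole 5s fit into 17
17 %% 5         # 2: what is left over

# Operator precedence
2 + 3 * 4       # 14: multiplication happens before addition
(2 + 3) * 4     # 20: parentheses change the order
-2^2            # -4: exponentiation is applied before the minus sign
(-2)^2          # 4: parentheses make the base negative
```

When in doubt, add parentheses; they cost nothing and make the intent explicit.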

Variable Assignment

Use <- to store values in variables (objects):

# Variable assignment
# Think of <- as "gets": "x gets the value 42"
x <- 42
x              # typing the variable name prints its value
## [1] 42
patient_count <- 80
mean_age <- 52.4
study_name <- "Cardio Trial 2024"   # text (strings) go in quotes

# Print multiple variables
print(paste("Study:", study_name, "- Mean age:", mean_age))
## [1] "Study: Cardio Trial 2024 - Mean age: 52.4"

Note: R is case-sensitive. Age and age are different objects.
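
A short sketch of what case sensitivity looks like in practice. The misspelt name Mean_Age is a deliberate mistake, and tryCatch() is used here only so the example keeps running after the error.

```r
mean_age <- 52.4

# Same spelling and same case: works
mean_age

# Different capitalisation: R looks for a different object and fails
lookup <- tryCatch(
  Mean_Age,
  error = function(e) conditionMessage(e)
)
lookup   # an "object not found" message instead of 52.4
```

An "object not found" error is very often a capitalisation or spelling slip, so check the name first.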


3. Vectors (10 minutes)

A vector is R's fundamental data structure: an ordered collection of values that all share the same type (for example numeric, character, or logical).

Creating Vectors

Use c() (combine/concatenate) to create vectors:

# A vector of white blood cell counts (x10^9/L) for 6 patients
wbc <- c(5.2, 7.8, 11.3, 4.9, 8.1, 6.7)

# A vector of patient ages
ages <- c(34, 56, 23, 67, 45, 71)

# A vector of treatment groups (character vector)
treatment <- c("Control", "Drug_A", "Drug_A", "Control", "Drug_B", "Drug_B")

# Display
wbc
## [1]  5.2  7.8 11.3  4.9  8.1  6.7
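
Because a vector holds only one type, c() silently converts mixed inputs to the most flexible type present. The sketch below uses made-up values to show the effect.

```r
# Numbers, text, and logicals together: everything becomes character
mixed <- c(5.2, "Drug_A", TRUE)
class(mixed)      # "character"
mixed             # "5.2" "Drug_A" "TRUE"

# Numbers and logicals together: everything becomes numeric (TRUE = 1, FALSE = 0)
flags <- c(7.8, TRUE, FALSE)
class(flags)      # "numeric"

# Numbers stored as text must be converted back before doing arithmetic
wbc_text <- c("5.2", "7.8")
wbc_numeric <- as.numeric(wbc_text)
wbc_numeric * 1000
```

This matters when importing data: a single stray text entry in a numeric column turns the whole column into character.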

Exploring Vectors

# Basic statistics
length(wbc)       # how many elements?
## [1] 6
sum(wbc)          # total
## [1] 44
mean(wbc)         # average
## [1] 7.333333
median(wbc)       # median
## [1] 7.25
sd(wbc)           # standard deviation
## [1] 2.341509
range(wbc)        # min and max
## [1]  4.9 11.3

Indexing and Subsetting

R indexes from 1, not from 0 as Python does:

# Accessing elements by position
wbc[1]            # first element
## [1] 5.2
wbc[3]            # third element
## [1] 11.3
wbc[c(1, 3, 5)]   # first, third, and fifth elements
## [1]  5.2 11.3  8.1
wbc[2:4]          # elements 2 through 4
## [1]  7.8 11.3  4.9
# Logical indexing: find elements that meet a condition
wbc > 8           # returns TRUE/FALSE for each element
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE
wbc[wbc > 8]      # returns only values greater than 8
## [1] 11.3  8.1
# Which patients have elevated WBC?
# Normal adult WBC range is roughly 4.5 to 11.0 x10^9/L
high_wbc <- wbc[wbc > 11.0]
high_wbc
## [1] 11.3
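
A few related indexing patterns, sketched on the same wbc vector: combining conditions, getting positions rather than values, and dropping elements.

```r
wbc <- c(5.2, 7.8, 11.3, 4.9, 8.1, 6.7)

# Combine two conditions with & (AND): values inside the normal range
in_range <- wbc[wbc >= 4.5 & wbc <= 11.0]
in_range                 # 5.2 7.8 4.9 8.1 6.7

# which() returns the positions of matching elements, not the values
high_positions <- which(wbc > 8)
high_positions           # 3 5

# A negative index drops elements instead of keeping them
without_first <- wbc[-1]
without_first            # 7.8 11.3 4.9 8.1 6.7
```

Positions from which() are useful when the same rows need to be looked up in another vector, such as the matching patient ages.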

Vectorised Operations

Apply operations to all elements at once — no loops needed:

# Vectorised operations (a superpower of R)
wbc * 1000        # convert to cells per microlitre
## [1]  5200  7800 11300  4900  8100  6700
wbc - mean(wbc)   # centre the data (subtract mean from each value)
## [1] -2.1333333  0.4666667  3.9666667 -2.4333333  0.7666667 -0.6333333
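
Vectorised operations also work between two vectors of the same length, pairing elements by position. The sketch below computes BMI from hypothetical weights and heights; the vector names and values are illustrative and are not part of the tutorial dataset.

```r
# Hypothetical measurements for three patients
weight_kg <- c(60, 72, 85)
height_m  <- c(1.60, 1.80, 1.70)

# BMI = weight / height^2, calculated for every patient in one line
bmi_values <- round(weight_kg / height_m^2, 1)
bmi_values        # 23.4 22.2 29.4
```

Each weight is divided by the square of the height in the same position, so no loop is needed.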

4. Data Frames & Importing Data (15 minutes)

A data frame is like a spreadsheet table — rows are observations, columns are variables.
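
To make the idea concrete, here is a minimal sketch that builds a three-row data frame by hand with base R and reads one cell from it. The object name small_df and its values are illustrative.

```r
# Each argument becomes a column; all columns must have the same length
small_df <- data.frame(
  age       = c(34, 56, 23),
  wbc       = c(5.2, 7.8, 11.3),
  treatment = c("Control", "Drug_A", "Drug_A")
)

dim(small_df)             # 3 rows, 3 columns
small_df[2, "wbc"]        # row 2 of the wbc column: 7.8
str(small_df)             # compact overview of column types
```

Each column is itself a vector, which is why every value in a column shares one type.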

Loading Packages

We’ll use the tidyverse collection of packages:

# Install once if needed: install.packages("tidyverse")
# Load the tidyverse (includes ggplot2, dplyr, readr, tidyr, and more)
library(tidyverse)

Importing CSV Data

# Load the clinical dataset we created earlier
clinical_data <- read_csv("clinical_data.csv")

Exploring Data Frames

# First look at your data
head(clinical_data)         # first 6 rows
## # A tibble: 6 × 8
##   patient_id   age sex      bmi treatment wbc_count systolic_bp response
##   <chr>      <dbl> <chr>  <dbl> <chr>         <dbl>       <dbl>    <dbl>
## 1 PT001         71 Male    20.2 Control        8.79         147     16.5
## 2 PT002         44 Female  20.3 Drug_A         9.44         143     85.7
## 3 PT003         57 Female  27   Control        5.31         127      4.5
## 4 PT004         61 Male    22.3 Drug_B         6            120     18.2
## 5 PT005         58 Female  26.5 Drug_A         8.52         144     45.1
## 6 PT006         51 Female  24.7 Control        6.28         125     42.9
tail(clinical_data)         # last 6 rows
## # A tibble: 6 × 8
##   patient_id   age sex      bmi treatment wbc_count systolic_bp response
##   <chr>      <dbl> <chr>  <dbl> <chr>         <dbl>       <dbl>    <dbl>
## 1 PT075         44 Male    22.5 Drug_A        10.2          155     47.3
## 2 PT076         60 Male    31.1 Drug_A         8.13         135     53.9
## 3 PT077         63 Male    28.2 Drug_B         7.11         135     31.9
## 4 PT078         58 Female  29   Drug_B         4.9          105     43.7
## 5 PT079         40 Female  34.1 Control        6.32         132     27  
## 6 PT080         37 Female  27   Drug_A         9.52         137     12.5
glimpse(clinical_data)      # structure: column names, types, sample values
## Rows: 80
## Columns: 8
## $ patient_id  <chr> "PT001", "PT002", "PT003", "PT004", "PT005", "PT006", "PT0…
## $ age         <dbl> 71, 44, 57, 61, 58, 51, 73, 51, 80, 51, 70, 84, 33, 48, 50…
## $ sex         <chr> "Male", "Female", "Female", "Male", "Female", "Female", "M…
## $ bmi         <dbl> 20.2, 20.3, 27.0, 22.3, 26.5, 24.7, 23.9, 18.0, 21.4, 27.3…
## $ treatment   <chr> "Control", "Drug_A", "Control", "Drug_B", "Drug_A", "Contr…
## $ wbc_count   <dbl> 8.79, 9.44, 5.31, 6.00, 8.52, 6.28, 6.43, 6.08, 7.56, 9.90…
## $ systolic_bp <dbl> 147, 143, 127, 120, 144, 125, 110, 119, 124, 151, 161, 147…
## $ response    <dbl> 16.5, 85.7, 4.5, 18.2, 45.1, 42.9, 55.4, 59.1, -3.8, 67.3,…
summary(clinical_data)      # statistical summary of each column
##   patient_id             age            sex                 bmi       
##  Length:80          Min.   :10.00   Length:80          Min.   :18.00  
##  Class :character   1st Qu.:43.00   Class :character   1st Qu.:23.70  
##  Mode  :character   Median :54.00   Mode  :character   Median :26.20  
##                     Mean   :52.31                      Mean   :26.02  
##                     3rd Qu.:61.25                      3rd Qu.:28.57  
##                     Max.   :84.00                      Max.   :34.10  
##   treatment           wbc_count       systolic_bp       response    
##  Length:80          Min.   : 1.170   Min.   : 87.0   Min.   :-3.80  
##  Class :character   1st Qu.: 5.938   1st Qu.:118.8   1st Qu.:31.88  
##  Mode  :character   Median : 7.090   Median :132.0   Median :47.40  
##                     Mean   : 7.137   Mean   :130.3   Mean   :48.38  
##                     3rd Qu.: 8.220   3rd Qu.:141.5   3rd Qu.:64.88  
##                     Max.   :11.970   Max.   :161.0   Max.   :91.90
# Dimensions
dim(clinical_data)          # rows × columns
## [1] 80  8
nrow(clinical_data)         # number of rows
## [1] 80
ncol(clinical_data)         # number of columns
## [1] 8
colnames(clinical_data)     # column names
## [1] "patient_id"  "age"         "sex"         "bmi"         "treatment"  
## [6] "wbc_count"   "systolic_bp" "response"

Accessing Columns

# Using $ to pull out a column as a vector
clinical_data$age
##  [1] 71 44 57 61 58 51 73 51 80 51 70 84 33 48 50 61 48 15 18 70 48 27 50 69 79
## [26] 46 48 27 58 43 58 62 66 43 59 28 41 40 18 53 55 47 63 42 33 58 41 72 46 61
## [51] 57 41 74 61 53 56 62 53 10 56 47 55 60 72 42 70 57 67 65 62 37 51 61 39 44
## [76] 60 63 58 40 37
mean(clinical_data$age)
## [1] 52.3125
# Count observations per group
table(clinical_data$treatment)
## 
## Control  Drug_A  Drug_B 
##      24      33      23
table(clinical_data$sex)
## 
## Female   Male 
##     49     31
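
Columns pulled out this way can feed straight into group-wise calculations. The sketch below uses a tiny made-up data frame (trial) and base R functions only: [[ ]] as an alternative to $, prop.table() for proportions, and tapply() for a mean per group.

```r
# Hypothetical four-patient example
trial <- data.frame(
  treatment = c("Control", "Drug_A", "Drug_A", "Control"),
  response  = c(10, 50, 70, 30)
)

# [[ ]] with the column name in quotes returns the same vector as $
identical(trial$response, trial[["response"]])   # TRUE

# Proportion of patients in each group
group_share <- prop.table(table(trial$treatment))
group_share

# Mean response within each treatment group
group_means <- tapply(trial$response, trial$treatment, mean)
group_means       # Control 20, Drug_A 60
```

The [[ ]] form is handy when the column name is stored in a variable, which $ cannot handle.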

5. Data Wrangling with dplyr (10 minutes)

The Pipe Operator: |>

The pipe operator takes the result on its left and passes it as the first argument to the function on its right. In words: “take this thing, THEN do this to it”:

# Without pipe (nested, hard to read):
mean(sqrt(c(4, 9, 16, 25)))
## [1] 3.5
# With pipe (left to right, readable):
c(4, 9, 16, 25) |> sqrt() |> mean()
## [1] 3.5
# Read it aloud as "and then"

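Note: For simple chains like these, %>% (from magrittr, loaded with the tidyverse) behaves the same as |> (built into R since version 4.1). The two differ in some advanced features, such as placeholder syntax, so pick one and use it consistently.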
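
The sketch below shows how the pipe handles a function with extra arguments: the left-hand side fills the first argument and anything inside the parentheses follows it. It assumes R 4.1 or later; the values are arbitrary.

```r
values <- c(4.567, 9.123, 16.891)

# These two lines are equivalent: values becomes the first argument of round()
piped  <- values |> round(1)
nested <- round(values, 1)
identical(piped, nested)   # TRUE

# Longer chains still read left to right
largest_root <- values |> sqrt() |> round(2) |> max()
largest_root
```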

filter(): Keep Rows

Keep only rows that meet certain conditions:

# Keep only female patients
female_patients <- clinical_data |>
  filter(sex == "Female")

# Keep patients over 60 in the Drug_A group
older_drug_a <- clinical_data |>
  filter(age > 60, treatment == "Drug_A")

# Keep patients with elevated WBC (above normal range)
elevated_wbc <- clinical_data |>
  filter(wbc_count > 11.0)

# Display result
head(elevated_wbc)
## # A tibble: 2 × 8
##   patient_id   age sex     bmi treatment wbc_count systolic_bp response
##   <chr>      <dbl> <chr> <dbl> <chr>         <dbl>       <dbl>    <dbl>
## 1 PT043         63 Male   27.2 Drug_B         11.6         144     36.2
## 2 PT046         58 Male   32.5 Drug_A         12.0         114     76.9
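
Conditions separated by commas in filter() are combined with AND. For OR logic, or for matching several values at once, | and %in% are the usual tools. A minimal sketch, assuming dplyr is installed and using a small made-up data frame (patients):

```r
library(dplyr)

patients <- data.frame(
  patient_id = c("PT001", "PT002", "PT003", "PT004"),
  age        = c(71, 44, 57, 61),
  treatment  = c("Control", "Drug_A", "Control", "Drug_B")
)

# %in% keeps rows whose value matches any entry in the vector
on_drug <- patients |>
  filter(treatment %in% c("Drug_A", "Drug_B"))
on_drug$patient_id          # "PT002" "PT004"

# | means OR: over 60, or in the Drug_A group, or both
older_or_drug_a <- patients |>
  filter(age > 60 | treatment == "Drug_A")
older_or_drug_a$patient_id  # "PT001" "PT002" "PT004"
```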

select(): Keep Columns

Select specific columns you need:

# Select just the columns needed for a sub-analysis
patient_summary <- clinical_data |>
  select(patient_id, age, sex, treatment, response)

head(patient_summary)
## # A tibble: 6 × 5
##   patient_id   age sex    treatment response
##   <chr>      <dbl> <chr>  <chr>        <dbl>
## 1 PT001         71 Male   Control       16.5
## 2 PT002         44 Female Drug_A        85.7
## 3 PT003         57 Female Control        4.5
## 4 PT004         61 Male   Drug_B        18.2
## 5 PT005         58 Female Drug_A        45.1
## 6 PT006         51 Female Control       42.9
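
select() can also drop columns or pick them by pattern instead of naming each one. The sketch below assumes dplyr is installed and uses a one-row made-up data frame (patients) with the same column names as the tutorial data.

```r
library(dplyr)

patients <- data.frame(
  patient_id  = "PT001",
  age         = 71,
  bmi         = 20.2,
  wbc_count   = 8.79,
  systolic_bp = 147
)

# A minus sign drops a column and keeps the rest
no_bmi <- patients |> select(-bmi)
names(no_bmi)

# starts_with() picks columns by the beginning of their name
wbc_only <- patients |> select(starts_with("wbc"))
names(wbc_only)         # "wbc_count"

# where() picks columns by type; here the ID plus every numeric column
id_and_numbers <- patients |> select(patient_id, where(is.numeric))
names(id_and_numbers)
```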

mutate(): Create or Transform Columns

Create new columns or modify existing ones:

# Create a new column: WBC converted to cells per microlitre
clinical_data <- clinical_data |>
  mutate(wbc_per_ul = wbc_count * 1000)

# Create a new column: categorise BMI
clinical_data <- clinical_data |>
  mutate(bmi_category = case_when(
    bmi < 18.5              ~ "Underweight",
    bmi >= 18.5 & bmi < 25 ~ "Normal",
    bmi >= 25 & bmi < 30   ~ "Overweight",
    bmi >= 30               ~ "Obese"
  ))

# Check the result
table(clinical_data$bmi_category)
## 
##      Normal       Obese  Overweight Underweight 
##          30          13          36           1
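
As an alternative to case_when(), base R's cut() can bin a numeric vector in one call. This is a different technique from the one used above, sketched here on made-up BMI values chosen to sit on the category boundaries.

```r
bmi <- c(17.9, 18.5, 24.9, 25.0, 29.9, 30.0, NA)

bmi_category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE    # intervals include the lower bound, matching >= and < above
)

as.character(bmi_category)
# "Underweight" "Normal" "Normal" "Overweight" "Overweight" "Obese" NA

# cut() returns a factor, so table() lists categories in the order given in labels
table(bmi_category)
```

Missing BMI values stay missing with both approaches, so check for NA in the new column before analysing it.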

Chaining Multiple Steps

The pipe operator shines when combining operations:

# Chain multiple operations together
drug_response_summary <- clinical_data |>
  filter(treatment != "Control") |>                    # exclude control group
  select(patient_id, treatment, response, age) |>
  filter(age >= 18 & age <= 65) |>                     # working-age adults
  mutate(high_responder = response > 60)               # flag high responders

head(drug_response_summary)
## # A tibble: 6 × 5
##   patient_id treatment response   age high_responder
##   <chr>      <chr>        <dbl> <dbl> <lgl>         
## 1 PT002      Drug_A        85.7    44 TRUE          
## 2 PT004      Drug_B        18.2    61 FALSE         
## 3 PT005      Drug_A        45.1    58 FALSE         
## 4 PT008      Drug_A        59.1    51 FALSE         
## 5 PT013      Drug_A        42.4    33 FALSE         
## 6 PT015      Drug_B        64.4    50 TRUE
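
A logical column such as high_responder is easy to summarise, because R treats TRUE as 1 and FALSE as 0. The sketch below uses the six response values shown in the output above.

```r
# Response values from the first six rows of drug_response_summary
response <- c(85.7, 18.2, 45.1, 59.1, 42.4, 64.4)
high_responder <- response > 60

n_high <- sum(high_responder)      # number of high responders: 2
share_high <- mean(high_responder) # proportion of high responders
n_high
share_high
```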

Exporting Data

# Save the cleaned dataset
write_csv(clinical_data, "clinical_data_cleaned.csv")
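
After exporting, it is worth confirming that the file exists and reads back as expected. The sketch below uses base R's write.csv() and read.csv() in place of the readr functions, with a temporary folder and a made-up two-row data frame (cleaned) so that it runs anywhere.

```r
cleaned <- data.frame(
  patient_id = c("PT001", "PT002"),
  wbc_count  = c(8.79, 9.44)
)

# Write to a temporary location so no project files are overwritten
out_file <- file.path(tempdir(), "example_cleaned.csv")
write.csv(cleaned, out_file, row.names = FALSE)

file.exists(out_file)                   # TRUE if the export succeeded

# Read the file back and compare it with the original
reloaded <- read.csv(out_file)
identical(dim(reloaded), dim(cleaned))  # same number of rows and columns
identical(names(reloaded), names(cleaned))
```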

6. Formative Exercise

Instructions: First run the code below once to generate blood_markers.csv, a simulated haematology dataset of 40 patients. Then complete the tasks that follow.

# Generate the exercise dataset
set.seed(99)
n2 <- 40
blood_markers <- data.frame(
  id          = paste0("P", 1:n2),
  haemoglobin = round(rnorm(n2, mean = 13.5, sd = 1.8), 1),  # g/dL
  platelets   = round(rnorm(n2, mean = 250, sd = 60)),       # x10^9/L
  condition   = sample(c("Anaemia", "Healthy", "Polycythaemia"), n2, replace = TRUE),
  hospital    = sample(c("City", "Royal", "St. Mary's"), n2, replace = TRUE)
)
write.csv(blood_markers, "blood_markers.csv", row.names = FALSE)

Tasks:

  1. Import blood_markers.csv and display its structure using at least two exploration functions
  2. How many patients are in each condition group? (Hint: table())
  3. Filter to keep only patients with haemoglobin below 12.0 g/dL (a commonly used anaemia threshold; exact cut-offs vary by sex and guideline)
  4. Create a new column called platelet_status that labels:
    • platelets below 150 as "Low"
    • 150 to 400 (inclusive) as "Normal"
    • above 400 as "High"
  5. Select only id, haemoglobin, condition, and platelet_status, and export to blood_markers_cleaned.csv

7. Discussion Questions

  1. Reproducibility: When would you use filter() vs. manually removing rows from a CSV in Excel? What are the advantages of the scripted approach?

  2. Data cleaning: You have a dataset where the sex column contains entries like "M", "male", "MALE", and "Male". How would you standardise this in R? Why does this matter?

  3. Random seeds: Why is it important to use set.seed() when generating or sampling data? What could go wrong if two researchers ran the same simulation without it?
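
As a starting point for the third question, the sketch below shows what set.seed() does: the same seed reproduces the same "random" numbers. The seed value 42 and the object names are arbitrary.

```r
# Same seed, same draws
set.seed(42)
first_run <- rnorm(3)

set.seed(42)
second_run <- rnorm(3)

identical(first_run, second_run)   # TRUE

# Drawing again without resetting the seed continues the sequence,
# so the values differ from the first run
third_run <- rnorm(3)
identical(first_run, third_run)    # FALSE
```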


Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
##  [5] purrr_1.2.0     readr_2.1.6     tidyr_1.3.1     tibble_3.3.0   
##  [9] ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.6.0          gtable_0.3.6       jsonlite_2.0.0     crayon_1.5.3      
##  [5] compiler_4.5.1     tidyselect_1.2.1   parallel_4.5.1     jquerylib_0.1.4   
##  [9] scales_1.4.0       yaml_2.3.12        fastmap_1.2.0      R6_2.6.1          
## [13] generics_0.1.4     knitr_1.50         bslib_0.9.0        pillar_1.11.1     
## [17] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.6        utf8_1.2.6        
## [21] stringi_1.8.7      cachem_1.1.0       xfun_0.55          sass_0.4.10       
## [25] S7_0.2.1           bit64_4.6.0-1      timechange_0.3.0   cli_3.6.5         
## [29] withr_3.0.2        magrittr_2.0.4     digest_0.6.39      grid_4.5.1        
## [33] vroom_1.6.7        rstudioapi_0.17.1  hms_1.1.4          lifecycle_1.0.4   
## [37] vctrs_0.6.5        evaluate_1.0.5     glue_1.8.0         farver_2.1.2      
## [41] rmarkdown_2.30     tools_4.5.1        pkgconfig_2.0.3    htmltools_0.5.9