R is a powerful language for statistics and data analysis. This section introduces core ideas so students can continue independently.
R is an open‑source language designed for:
- Statistics
- Data analysis
- Data visualization
- Reproducible science
It is widely used in biology, ecology, genomics, and environmental sciences.
Install R from CRAN, then install RStudio as an IDE. RStudio makes coding, visualization, and documentation easier.
The console evaluates expressions immediately.
1 + 1
print("Hello World")
R can also be used as a calculator, but its real power comes from vectorized operations, plotting, and statistical analysis.
R stores information in objects. Objects have types that determine how R treats the data.
x <- 10 # numeric
class(x)
y <- "hello" # character
class(y)
z <- TRUE # logical
class(z)
R chooses the type automatically from the value being assigned.
When values of different types are combined in one vector, R coerces them all to the most flexible type (logical < numeric < character).
v <- c(1, "a", 3)
v # all become characters
class(v)
Use explicit coercion:
as.numeric("3.5")
as.character(42)
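Explicit coercion does not always succeed. As a minimal sketch (the input values below are made up for illustration), this shows what happens when a string cannot be parsed as a number, and how logical values behave when treated as numbers:

```r
# Hypothetical raw input, e.g. a column that was read in as text
raw <- c("3.5", "7", "n/a")

# Strings that cannot be parsed become NA (R also emits a warning,
# suppressed here to keep the output clean)
nums <- suppressWarnings(as.numeric(raw))
nums
is.na(nums)

# Logical values coerce to 0/1, which makes counting TRUEs easy
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true
```

Checking is.na() after a conversion is a simple way to spot values that did not convert.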
Even single values are stored as vectors of length 1.
x <- 5
length(x) # 1
Vectors are the fundamental data structure in R. Even a single value is a vector of length 1. Many R functions operate on vectors.
v1 <- c(1, 2, 3)
v2 <- 5:10
v3 <- seq(0, 1, by=0.2)
v4 <- rep(4, times=5)
v1; v2; v3; v4
Named vectors allow meaningful labels.
heights <- c(Alex=180, Maria=165, John=175)
heights
heights["Maria"]
x <- c(10, 20, 30, 40)
x[2]
x[c(1, 3)]
x[-2] # exclude index 2
Logical indexing is powerful for filtering.
z <- c(5, 10, 15, 20)
idx <- z > 10
idx # FALSE FALSE TRUE TRUE
z[idx] # 15 20
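Logical conditions can also be combined. The sketch below reuses the same vector and shows element-wise "and", the which() function for positions, and counting matches; the variable names are only illustrative:

```r
z <- c(5, 10, 15, 20)

# Combine conditions element-wise with & (and) and | (or)
mid <- z[z > 5 & z < 20]
mid

# which() returns the positions of the TRUE values
pos <- which(z > 10)
pos

# TRUE counts as 1, so sum() counts how many elements match
n_big <- sum(z > 10)
n_big
```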
R recycles shorter vectors when operating with longer ones.
a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b # b becomes 10,20,10,20
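Be careful: recycling is silent when the longer length is a multiple of the shorter one (R only warns otherwise), so it can cause subtle bugs.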
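One way to guard against accidental recycling is to check lengths explicitly. The following is a sketch only; add_strict is a hypothetical helper written for this example, not a base R function:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b   # lengths 4 and 2: recycled without any message

# Lengths 4 and 3: R still recycles, but raises a warning
res <- tryCatch(a + c(10, 20, 30), warning = function(w) "warning raised")
res

# Hypothetical helper: refuse to add vectors of different lengths
add_strict <- function(x, y) {
  stopifnot(length(x) == length(y))
  x + y
}
add_strict(a, c(10, 20, 30, 40))
```

Calling add_strict(a, b) stops with an error instead of silently recycling b.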
Most operations in R are vectorized: they apply to every element at once, which is more concise and usually faster than writing an explicit loop.
x <- 1:5
x * 2
x + 5
sin(x)
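To make the idea of vectorization concrete, here is a small sketch comparing an explicit loop with the equivalent vectorized expression; both give the same result:

```r
x <- 1:5

# Explicit loop: fill a pre-allocated result one element at a time
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# Vectorized form: one expression, same result
doubled_vec <- x * 2

all.equal(doubled_loop, doubled_vec)
```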
Lists can store objects of different types and sizes.
lst <- list(
numbers = c(1,2,3),
name = "oak",
temp = 22.5,
misc = list(inner = 5)
)
lst
lst$numbers
lst[["name"]]
lst$misc$inner
Lists are extremely flexible and often appear in results from statistical models.
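As an illustration of a model result being a list, the sketch below fits a simple linear model to mtcars, a dataset that ships with R. The choice of model is arbitrary and only serves to show list access:

```r
# mtcars is built into R; the model is for illustration only
fit <- lm(mpg ~ wt, data = mtcars)

is.list(fit)        # a fitted model is a list underneath
names(fit)          # components such as "coefficients" and "residuals"

fit$coefficients    # access a component like any other list element
coef(fit)           # accessor functions are usually preferred
```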
Matrices are two‑dimensional rectangular data structures containing elements all of the same type. They are useful in many biological analyses (e.g., gene expression matrices, distance matrices).
m <- matrix(1:12, nrow = 3, ncol = 4)
m
R fills matrices column‑wise by default.
m2 <- matrix(1:12, nrow = 3, byrow = TRUE)
m2
rownames(m) <- paste0("R", 1:3)
colnames(m) <- paste0("C", 1:4)
m
m[1, 2] # row 1, column 2
m[ , 3] # whole column
m[2, ] # whole row
m[1:2, 2:3]
m * 2
m + m2 # element-wise addition; both matrices are 3 x 4
t(m) # transpose
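Note that the arithmetic operators above work element by element. A short sketch, using small made-up matrices, of how this differs from the matrix product operator %*%:

```r
A <- matrix(1:4, nrow = 2)   # filled column-wise: rows are (1, 3) and (2, 4)
I2 <- diag(2)                # 2 x 2 identity matrix

elementwise <- A * A         # each entry squared
elementwise

A %*% I2                     # product with the identity returns A

product <- A %*% A           # true matrix product
product
```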
Imagine rows are genes and columns are samples.
expr <- matrix(rnorm(500), nrow=50, ncol=10)
rownames(expr) <- paste0("Gene_", 1:50)
colnames(expr) <- paste0("Sample_", 1:10)
expr[1:5, 1:5]
Find genes with high variance:
vars <- apply(expr, 1, var)
head(sort(vars, decreasing=TRUE))
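Once the variances are computed, the most variable genes can be used to subset the matrix by row name. This is a sketch on simulated data; the seed and the choice of five genes are assumptions made for the example:

```r
set.seed(1)  # assumed seed, only to make the simulated values repeatable
expr <- matrix(rnorm(500), nrow = 50, ncol = 10)
rownames(expr) <- paste0("Gene_", 1:50)

# Row-wise variance; the result keeps the gene names
vars <- apply(expr, 1, var)

# Names of the five most variable genes
top_genes <- names(sort(vars, decreasing = TRUE))[1:5]

# Subset the expression matrix by row name
expr_top <- expr[top_genes, ]
dim(expr_top)
```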
Data frames are one of the most important structures in R. They are used to store tabular data, where columns may be of different types (numeric, character, logical, etc.). This makes them ideal for biological and environmental datasets where variables can include quantitative measurements (temperature, pH) alongside categorical values (species, region).
A data frame is similar to a spreadsheet:
- Each row is an observation (e.g., one sampling site)
- Each column is a variable (e.g., temperature)
- Different columns can store different data types
df <- data.frame(
id = 1:3,
temp = c(20.1, 19.5, 22.3),
species = c("oak", "pine", "elm")
)
head(df)
tail(df)
str(df)
dim(df)
summary(df)
df$temp
Equivalent:
df[["temp"]]
df$region <- c("north", "south", "east") # add a new column
df$region <- NULL # remove the column again
We often import data from .csv or .tsv files.
env <- read.csv("environmental_measurements.csv")
str(env)
Check for missing values:
sum(is.na(env))
colSums(is.na(env))
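Missing values propagate through most calculations, so they usually need explicit handling. The sketch below uses env_demo, a hypothetical stand-in with made-up values, since the contents of the real file are not shown here:

```r
# Hypothetical stand-in for the imported data; values are made up
env_demo <- data.frame(
  temperature = c(20.1, NA, 22.3, 18.7),
  humidity    = c(55, 60, NA, 70)
)

mean(env_demo$temperature)                 # NA: one value is missing
mean(env_demo$temperature, na.rm = TRUE)   # mean of the observed values

# Keep only rows with no missing values in any column
env_complete <- env_demo[complete.cases(env_demo), ]
nrow(env_complete)
```

Whether to drop incomplete rows or only ignore missing values per calculation depends on the analysis.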
Filter rows:
subset(env, species == "oak")
Sort values:
env[order(env$temperature), ]
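Filtering and sorting can also be done with bracket indexing alone. A minimal sketch on env_demo, a hypothetical stand-in with made-up values and the same column names as used above:

```r
# Hypothetical stand-in for env; values are made up
env_demo <- data.frame(
  species     = c("oak", "pine", "oak", "elm"),
  temperature = c(20.1, 19.5, 22.3, 18.7)
)

# Filter rows using more than one condition
warm_oak <- env_demo[env_demo$species == "oak" & env_demo$temperature > 21, ]
warm_oak

# Sort from warmest to coldest
by_temp <- env_demo[order(env_demo$temperature, decreasing = TRUE), ]
by_temp$species
```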
Data frames allow us to combine diverse measurements in one structure. This is essential in environmental biology, where datasets include many types of variables: climate (temp, humidity), chemical (pH), biological (species), and geographic (location).
They form the basis for:
- Visualization
- Statistical analysis
- Modeling
- Ecological inference
Understanding how to manipulate them is crucial.
tapply(env$temperature, env$species, mean)
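Grouped summaries like the one above can be returned either as a named vector or as a data frame. The sketch below compares tapply() with aggregate() on env_demo, a hypothetical stand-in with made-up values:

```r
# Hypothetical stand-in for env; values are made up
env_demo <- data.frame(
  species     = c("oak", "pine", "oak", "pine"),
  temperature = c(20, 18, 22, 16)
)

# tapply() returns the group means as a named vector
means_vec <- tapply(env_demo$temperature, env_demo$species, mean)
means_vec

# aggregate() returns the same summary as a data frame
means_df <- aggregate(temperature ~ species, data = env_demo, FUN = mean)
means_df
```

The data frame form is often more convenient when the summary feeds into further processing or plotting.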
plot(env$temperature, env$humidity) # scatter plot of two variables
hist(env$temperature) # distribution of temperature
boxplot(temperature ~ species, data = env) # temperature by species
library(ggplot2)
ggplot(env, aes(temperature, humidity, color = species)) + geom_point()
summary(env)
tapply(env$temperature, env$species, mean) # mean temperature per species
cor(env$temperature, env$humidity)
evo <- read.csv("evolutionary_traits.csv")
boxplot(trait ~ pop, data=evo)
t.test(trait ~ pop, data=evo)
In this section, we will explore realistic biological questions using the provided datasets. We introduce each problem, explain its relevance, reference related research when appropriate, and comment on the results.
Research often investigates how different species occupy varying environmental niches.
Related concept: Niche differentiation (Hutchinson, 1957).
env <- read.csv("environmental_measurements.csv")
boxplot(temperature ~ species, data=env)
tapply(env$temperature, env$species, mean)
We observe differences in temperature among species, suggesting niche specialization.
Soil pH and humidity can influence plant community composition.
Reference: Brady & Weil, “The Nature and Properties of Soils”.
cor(env$ph, env$humidity)
A moderate correlation may indicate linked environmental gradients.
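To judge how strong and how reliable a correlation is, cor.test() reports a p-value and confidence interval alongside the estimate. The sketch below runs on simulated values (not the course dataset), with an assumed seed and one deliberately missing value:

```r
set.seed(42)  # assumed seed for repeatability
# Simulated values, NOT the course dataset
ph       <- rnorm(30, mean = 6.5, sd = 0.5)
humidity <- 60 + 5 * (ph - 6.5) + rnorm(30, sd = 3)
humidity[3] <- NA   # introduce one missing value

cor(ph, humidity)                              # NA because of the missing value
r <- cor(ph, humidity, use = "complete.obs")   # use only complete pairs
r

# cor.test() drops incomplete pairs and adds a p-value and confidence interval
ct <- cor.test(ph, humidity)
ct$estimate
ct$p.value
```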
Trait divergence between populations is important in evolutionary biology.
Reference: Lande (1976) on quantitative genetic models.
t.test(trait ~ pop, data=evo)
A statistically significant difference in mean trait values is consistent with population divergence.
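The object returned by t.test() is a list, so individual results can be extracted for reporting. A sketch on simulated data; evo_demo, the group labels, and the seed are assumptions for this example, not the course dataset:

```r
set.seed(7)  # assumed seed for repeatability
# Simulated values, NOT the course dataset
evo_demo <- data.frame(
  pop   = rep(c("A", "B"), each = 20),
  trait = c(rnorm(20, mean = 10), rnorm(20, mean = 11))
)

# The formula interface needs exactly two groups; Welch's test is the default
tt <- t.test(trait ~ pop, data = evo_demo)

tt$p.value     # p-value for the difference in means
tt$estimate    # the two group means
tt$conf.int    # 95% confidence interval for the difference
```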
Principal Component Analysis (PCA) is used to reduce dimensionality.
Reference: Jolliffe, “Principal Component Analysis”.
env_scaled <- scale(env[, c("temperature", "humidity", "ph")])
pc <- prcomp(env_scaled)
summary(pc)
plot(pc$x[,1], pc$x[,2], col=as.factor(env$species))
PCA can reveal clustering by species, suggesting environmental adaptation.
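Before interpreting a PCA plot, it helps to check how much variance each component explains and which variables drive it. A sketch on simulated data; env_demo and the seed are assumptions, and the column names merely mirror those used above:

```r
set.seed(3)  # assumed seed for repeatability
# Simulated values, NOT the course dataset
env_demo <- data.frame(
  temperature = rnorm(40, 20, 3),
  humidity    = rnorm(40, 60, 8),
  ph          = rnorm(40, 6.5, 0.5)
)

pc_demo <- prcomp(scale(env_demo))

# Proportion of variance explained by each component
var_explained <- pc_demo$sdev^2 / sum(pc_demo$sdev^2)
round(var_explained, 2)

# Loadings: how each variable contributes to each component
pc_demo$rotation
```

If the first two components explain only a small share of the variance, a two-dimensional plot may hide important structure.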
1.4 Comments
Use # for comments; R ignores everything from # to the end of the line.