Step 1: Read Expression Data
# Read expression data
df = read.delim("raw_expression_data.txt", row.names = 1, header = TRUE)
# Read sample information
target = read.delim("sample_info.txt")- Expression Data: Load the gene expression data from a file, setting the first column as row names (gene identifiers) and including a header row.
- Sample Information: Load the sample metadata, which contains details like sample IDs and experimental conditions.
Step 2: Match Expression Data with Sample Information
- This step aligns the columns of the expression data
(
df) with the sample order specified in thetargetfile. It ensures that the expression data corresponds correctly to the sample metadata.
Step 4: Data Quality Check
# Check for any NA, 0, or negative values
max(df)
min(df)
# Convert NA values to 0
df[is.na(df)] = 0- Check for NAs, 0s, or negative values: It’s crucial to identify and handle missing or problematic values in the dataset.
- Handling NAs: Replace any missing values (NAs) with 0, which might be appropriate depending on the context of the data.
Step 5: Normalization
- Quantile Normalization: This step normalizes the expression data across arrays (samples) using the quantile normalization method. It ensures that the distribution of expression values is the same for each array, making the data comparable across different samples. This method is particularly useful in removing technical biases and ensuring that the measurements are on the same scale.
Step 6: Log Transformation
- Log Transformation: Transform the expression data to a log scale, typically log2. This transformation helps normalize the data and reduce skewness, making it more suitable for linear modeling.
Step 7: Create Design Matrix
# Create design matrix based on the group
conditions = paste(target$Group)
conditions = factor(conditions, levels=unique(conditions))
design = model.matrix(~0 + conditions)
colnames(design) = levels(conditions)
unique(conditions)- Factor Levels: Convert the group information into a factor with levels corresponding to the experimental groups.
- Design Matrix: Create a design matrix using
model.matrix. The~0 +syntax excludes the intercept, making each column of the design matrix represent a group.
Step 8: Fit Linear Model
# Fit the design to the expression data
fit = lmFit(exp, design)
# Make contrast matrix
contrast.matrix <- makeContrasts(deg1 = Group2 - Group1, deg2 = Group3 - Group1, levels = design)- Fit Linear Model: Use
lmFitto fit a linear model to the expression data, considering the design matrix. - Contrast Matrix: Define contrasts to compare specific groups, specifying the comparisons of interest (e.g., Group2 vs Group1, Group3 vs Group1).
Step 9: Apply Contrasts
- Apply Contrasts: Adjust the fitted model using the contrast matrix to focus on the specific comparisons of interest.
Step 10: Empirical Bayes Moderation
- Empirical Bayes Moderation: Use the
eBayesfunction to apply empirical Bayes moderation, which shrinks the estimated variances of the coefficients towards a common value. This step improves the reliability of the statistical tests, especially for small sample sizes.
Step 11: Extract Differentially Expressed Genes (DEGs)
# Generate DEGs by comparison
res1 = topTable(fit.cont, coef = "degs1", adjust.method = "BH", number = nrow(exp), sort.by = 'none')
res2 = topTable(fit.cont, coef = "degs2", adjust.method = "BH", number = nrow(exp), sort.by = 'none')- Extract DEGs: Use
topTableto extract the top differentially expressed genes for the specified contrast (coef = "deg1"). Theadjust.method = "BH"parameter applies the Benjamini-Hochberg correction for multiple testing, andnumber = nrow(exp)returns results for all genes. - View Results: Display the top differentially
expressed genes, showing the most significant changes in expression.
Similar for (
coef = "deg2")
This limma workflow allows for the identification of
differentially expressed genes across different conditions, accounting
for multiple testing and ensuring robust statistical inference.