worksheet3

Quality Control

Detailed quality control (QC) should be an essential part of every statistical data analysis, since the quality of the data is crucial for the validity and generalizability of statistical results. The goal of quality control is not only to assess data quality, but also to verify the assumptions made or required for further data analysis. In this chapter, we address the quality control of raw data. In the case of microarray data, a second quality control step after data preprocessing is necessary. We discuss this second QC step in Chapter 4, together with preprocessing.

Affymetrix

In the following section, we perform quality control (QC) of Affymetrix microarray data using the Bioconductor packages affy, arrayQualityMetrics, and affyPLM. Together, these packages provide functions to assess array quality, visualize intensity distributions, identify outlier arrays, and evaluate probe-level model (PLM) residuals. The arrayQualityMetrics package, in particular, generates automated HTML reports summarizing multiple QC metrics and visualizations. We demonstrate these tools using the Dilution dataset from the affydata package, which contains four Human Genome U95A GeneChips (HG-U95A). We first install the required packages:

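# Install BiocManager first if it is not yet available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("affy", "affydata", "hgu95av2cdf",
                       "arrayQualityMetrics", "affyPLM", "gridExtra"))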

Next, we load the example dataset Dilution from the affydata package (Gautier, 2011). This dataset consists of four Human Genome U95 Type A GeneChips (HG-U95A). The abbreviation U95 refers to Build 95 of UniGene (NCBI, 2012), which was used to define the probe sets on this array. The underlying dilution experiment used two sources of complementary RNA (cRNA): A, human liver tissue, and B, a cell line derived from the central nervous system. The four arrays in Dilution were all hybridized with liver cRNA; the sn19 column of the phenodata shown below, which records the amount of the second source, is 0 for every array. In the sample names, the numbers 10 and 20 indicate the amount of liver cRNA hybridized (10 µg and 20 µg), and the letters A and B distinguish the two replicate arrays at each amount, which were read on scanners 1 and 2. This dataset is used to demonstrate dilution effects and perform quality control on microarray data.

The following code loads the necessary packages and dataset:

# Load necessary libraries
library(affy)               # Handles Affymetrix microarray data processing
library(affydata)           # Provides the Dilution dataset for demonstration

     Package    LibPath                                         Item      
[1,] "affydata" "/home/fatma/R/x86_64-pc-linux-gnu-library/4.5" "Dilution"
     Title                        
[1,] "AffyBatch instance Dilution"

library(hgu95av2cdf)        # Chip definition file for HG-U95Av2 arrays
library(arrayQualityMetrics) # Generates comprehensive quality control reports
library(affyPLM)            # Provides NUSE and RLE metrics for additional QC
library(gridExtra)          # For rendering tables in plots (replaces titlePage)

# Load the Dilution dataset
data(Dilution)
head(pData(Dilution))        # Phenodata: liver cRNA (µg), sn19 amount, scanner

    liver sn19 scanner
20A    20    0       1
20B    20    0       2
10A    10    0       1
10B    10    0       2

head(exprs(Dilution))

     20A   20B    10A   10B
1  149.0 112.0  129.0  60.0
2 1153.5 575.3 1262.3 564.8
3  142.0  98.0  128.0  56.0
4 1051.0 597.0 1269.0 570.0
5   91.0  77.0   90.0  46.0
6  136.0 133.0  117.0  62.0

# Boxplots of raw intensities on log scale
par(mar = c(7,4,2,1))
boxplot(log2(exprs(Dilution)), las = 2,
        main = "Raw log2 intensity distributions (CEL)",
        ylab = "log2 intensity")

Clearly shifted medians or very different spreads between arrays can indicate hybridization or scanning problems. Moderate differences between arrays are normal before normalization.
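The visual comparison of medians and spreads can be backed up with numbers. The following is a minimal base-R sketch; the function name array_summary and the simulated matrix toy are illustrative assumptions and are not part of the worksheet or of any package.

```r
# Per-array median and interquartile range (IQR) of log2 intensities.
# `log_mat` is a numeric matrix: rows = probes, columns = arrays.
array_summary <- function(log_mat) {
  data.frame(
    median = apply(log_mat, 2, median),
    IQR    = apply(log_mat, 2, IQR),
    row.names = colnames(log_mat)
  )
}

# Simulated log2 intensities (NOT the Dilution data): array "s3" is shifted up.
set.seed(1)
toy <- cbind(s1 = rnorm(1000, mean = 8,   sd = 1),
             s2 = rnorm(1000, mean = 8,   sd = 1),
             s3 = rnorm(1000, mean = 9.5, sd = 1))

qc <- array_summary(toy)
# Distance of each array's median from the median of all array medians
qc$shift <- qc$median - median(qc$median)
print(round(qc, 2))
```

With the worksheet data, the same function could be applied to log2(exprs(Dilution)). What counts as a suspicious shift is a judgment call that depends on the platform and the experiment.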

# Density overlays (coarse but helpful)
hist(Dilution, log = TRUE, which = "pm", main = "Raw log2 PM intensity densities")

Curves whose modes or tails differ markedly from those of the other arrays point to potential bias or quality problems in the raw (pre-normalization) data.
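To compare the modes of the density curves numerically, one option is to locate the peak of a kernel density estimate for each array. This is only a rough sketch in base R: the helper density_mode and the simulated matrix toy are hypothetical, and the result depends on the default bandwidth of density().

```r
# Location of the highest peak of a kernel density estimate.
density_mode <- function(x) {
  d <- density(x)           # base-R kernel density estimate, default bandwidth
  d$x[which.max(d$y)]       # x position where the estimated density is highest
}

# Simulated log2 PM intensities (NOT the Dilution data): array "s3" peaks higher.
set.seed(42)
toy <- cbind(s1 = rnorm(2000, mean = 7, sd = 1),
             s2 = rnorm(2000, mean = 7, sd = 1),
             s3 = rnorm(2000, mean = 9, sd = 1))

modes <- apply(toy, 2, density_mode)
print(round(modes, 2))
```

With the worksheet data, the function could be applied column-wise to log2(pm(Dilution)), which holds the perfect match intensities plotted above.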

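Next, we normalize the data using the Robust Multi-array Average (RMA) method, so that the QC report below is computed on comparable expression values. RMA performs background correction, quantile normalization, and median-polish summarization of the probes in each probe set, and returns one expression value per probe set and array on the log2 scale.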

Dilution_norm <- rma(Dilution, verbose = FALSE)
head(exprs(Dilution_norm))

               20A      20B      10A      10B
100_g_at  8.039969 7.891522 8.022329 7.910679
1000_at   7.790839 7.661209 7.716827 7.622688
1001_at   5.175568 5.067023 5.082672 5.091804
1002_f_at 5.916532 5.754500 5.816628 5.741990
1003_s_at 6.335194 6.010436 6.198295 6.133995
1004_at   6.390533 6.147808 6.357680 6.260663
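One of the steps inside RMA is quantile normalization, which forces all arrays to share the same intensity distribution. The following base-R sketch illustrates that single step only; it is not the implementation used by rma(). The function name quantile_normalize and the toy matrix are illustrative assumptions, and ties are broken by order of appearance for simplicity.

```r
# Quantile normalization of a numeric matrix (rows = probes, columns = arrays).
quantile_normalize <- function(mat) {
  # Rank of every value within its own array
  ranks  <- apply(mat, 2, rank, ties.method = "first")
  # Sort each array, then average across arrays at each rank position
  sorted <- apply(mat, 2, sort)
  ref    <- rowMeans(sorted)            # shared reference distribution
  # Replace each value by the reference value of the same rank
  out <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(mat)
  out
}

# Small made-up example (NOT the Dilution data)
toy <- cbind(a = c(5, 2, 3, 4),
             b = c(4, 1, 6, 2),
             c = c(3, 4, 7, 8))
toy_qn <- quantile_normalize(toy)
print(toy_qn)

# After normalization every array contains exactly the same set of values
print(apply(toy_qn, 2, sort))
```

The within-array ordering of the probes is preserved; only the values attached to each rank change.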

We can now generate a quality control report using arrayQualityMetrics. The function writes an HTML report with plots (boxplots, density plots, MA plots, PCA) to the directory given in outdir; these plots are used to assess data quality.

arrayQualityMetrics(expressionset = Dilution_norm,
                    outdir = "Dilution_QC_Report",
                    force = TRUE, # Overwrite existing report directory
                    do.logtransform = FALSE)  # RMA data is already log-transformed
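The MA plots in such a report compare each array with a reference. As a rough sketch of the underlying quantities, the code below assumes that the reference is a pseudo-array formed from the probe-wise median across all arrays; this is a common convention, and the exact definition used in the report should be checked in the report itself. The function name ma_values and the toy matrix are illustrative assumptions.

```r
# M and A values of each array against a median pseudo-array.
# `log_mat` holds log2 expression values: rows = features, columns = arrays.
ma_values <- function(log_mat) {
  ref <- apply(log_mat, 1, median)   # pseudo-array: row-wise median
  list(
    M = log_mat - ref,               # log2 ratio of array vs. reference
    A = (log_mat + ref) / 2          # average log2 intensity
  )
}

# Small made-up example (NOT the Dilution data): array "c" is one log2 unit higher
toy <- cbind(a = c(1, 2, 3),
             b = c(1, 2, 3),
             c = c(2, 3, 4))
ma <- ma_values(toy)
print(ma$M)
print(ma$A)
```

For a well-behaved array, the M values scatter around zero over the whole range of A; a trend or a large spread in M suggests a problem with that array.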

In the raw data, we observe certain differences in the signal distributions among the four arrays, which is typical for microarray data. One objective of data preprocessing is to correct for these differences (see Chapter 4).

Note: Since gene expression is represented by raw intensity values ranging from 0 to 2¹⁶ − 1 = 65,535, which are strongly right-skewed, expression values are usually displayed on a logarithmic scale (typically log2).
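The effect of the log2 transformation on this range can be checked with a few lines of base R. The example intensities below are made up for illustration and are not taken from the Dilution data.

```r
# Largest raw intensity for 16-bit scanner values
max_intensity <- 2^16 - 1
print(max_intensity)
print(log2(max_intensity))     # just under 16

# Made-up raw intensities spanning the range
raw <- c(50, 100, 200, 1000, 10000, 65535)
log2_raw <- log2(raw)
print(round(log2_raw, 2))

# On the log2 scale, a doubling of intensity is always a difference of 1
print(log2(200) - log2(100))
print(log2(10000) - log2(5000))
```

On the log2 scale the values fall between 0 and 16 (for intensities of at least 1), and fold changes become differences, which makes the arrays easier to compare in boxplots and density plots.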