Worksheet lec2.2 data import
# Install the required packages
BiocManager::install(c("affy", "affydata", "hgu133acdf"))
Data import
Before discussing the specific options for importing gene expression data, it should be emphasized that R already offers extensive general capabilities for data import. In particular, R provides a wide variety of functions for reading text files, so no additional tools are required to import well-formatted text data. Many of the functions introduced in this section build on the standard R functions already presented in Section 1.1.8 and essentially consist of one or more calls to these functions. Depending on the array technology used to generate the gene expression data, various import methods are available. The following sections present some of the most important R and Bioconductor packages relevant for this purpose.
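As a minimal sketch of this point, the following base R code writes a small tab-delimited expression table to a temporary file and reads it back with read.delim(). The file contents, probe names, and sample names are invented for illustration only.

```r
# Toy tab-delimited expression table (hypothetical probe and sample names)
toy_lines <- c(
  "ProbeID\tsample_A\tsample_B",
  "probe_1\t7.31\t7.02",
  "probe_2\t5.10\t6.45",
  "probe_3\t9.88\t9.91"
)
toy_file <- tempfile(fileext = ".txt")
writeLines(toy_lines, toy_file)

# read.delim() is read.table() with defaults suited to tab-separated files;
# row.names = 1 uses the first column as row names
expr_df <- read.delim(toy_file, row.names = 1, check.names = FALSE)

# Most expression containers expect a numeric matrix (features x samples)
expr_mat <- as.matrix(expr_df)
dim(expr_mat)
```

The resulting matrix has features in rows and samples in columns, which is the orientation used by the Bioconductor containers discussed in this worksheet.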
Affymetrix
One of the leading microarray technologies was developed by Affymetrix Inc., a company that used photolithography to manufacture microarrays and introduced the first commercial microarray in 1994. Affymetrix GeneChip® arrays consist of a large number of microscopic cells.

Affymetrix uses two types of short oligonucleotide probes:
Perfect Match (PM) probes, which contain an exact subsequence of the target sequence.
Mismatch (MM) probes, which are identical to the PM probes except for a single nucleotide substitution in the middle position — the base is replaced by its Watson–Crick complement (i.e., adenine (A) ↔︎ thymine (T), and guanine (G) ↔︎ cytosine (C)).
These probes are always designed in pairs (PM–MM). The expression level of a gene is not derived from a single probe pair but from a set of multiple probe pairs that together cover the target sequence. This collection is referred to as a probe set. For more details on the technology, see Affymetrix Inc. (2002) and the company’s technical documentation (e.g., Affymetrix Inc., 2012).
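The relationship between a PM probe and its MM partner can be sketched in a few lines of base R. The helper make_mm_probe() and the example sequence are hypothetical, and the sketch assumes the usual 25-base probe length, so that the middle position is base 13.

```r
# Build the mismatch (MM) probe for a given perfect-match (PM) probe by
# replacing the middle base with its Watson-Crick complement.
make_mm_probe <- function(pm) {
  bases <- strsplit(toupper(pm), "")[[1]]
  stopifnot(all(bases %in% c("A", "C", "G", "T")))
  mid <- (length(bases) + 1) %/% 2          # position 13 for a 25-mer
  complement <- c(A = "T", T = "A", G = "C", C = "G")
  bases[mid] <- complement[[bases[mid]]]
  paste(bases, collapse = "")
}

# Hypothetical 25-mer PM sequence (invented for illustration)
pm <- "ACGTTGCAAGCTGATCCGTAGGCTA"
mm <- make_mm_probe(pm)

# Positions at which PM and MM differ
which(strsplit(pm, "")[[1]] != strsplit(mm, "")[[1]])
```

Only the middle base changes, so the MM probe is intended to capture non-specific hybridization for the same sequence context.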

To import files generated using the Affymetrix system, functions from various Bioconductor packages can be used. The most important of these is the Bioconductor package affxparser (Bengtsson et al., 2012). It provides functions for fast and memory-efficient parsing of Affymetrix files and makes use of the so-called Fusion SDK provided by Affymetrix Inc. (Affymetrix Inc., 2011). Since the import functions, such as readCel, only return lists of numerical values, it is recommended not to use this package directly, but rather in combination with the R package aroma.affymetrix (Bengtsson et al., 2008), which additionally provides data classes and analysis functions. Both packages are part of the aroma project (Bengtsson, 2012), which offers extensive functionality for analyzing all types of Affymetrix data. Due to its memory-efficient implementation, this project is particularly useful when a large number of arrays are analyzed simultaneously. However, the high flexibility of these packages also means that their use is more complex than with the standard packages that will be introduced below. Therefore, we will not go into further detail on these packages in this book. The fundamental Bioconductor package for the analysis of Affymetrix GeneChip array data is affy (Gautier et al., 2004). Among other things, it includes the functions read.affybatch and ReadAffy, which can be used to import the CEL files generated during the scanning of GeneChip arrays.
For demonstration purposes, we will load a dataset from the Bioconductor package affydata (Gautier, 2011) and use the corresponding annotation package hgu133acdf (The Bioconductor Project, 2012). We begin by installing the required packages.
Example of loading Affymetrix CEL files in R
## Load the necessary packages
library(affy)
library(affydata)
library(hgu133acdf)
## Path to example data
path_affydata <- system.file("celfiles", package = "affydata")
## List CEL files
cel_files <- list.celfiles(path = path_affydata, full.names = TRUE)
## Read CEL files
data_affy <- ReadAffy(filenames = cel_files[1])
## Display the object
data_affy
AffyBatch object
size of arrays=712x712 features (18 kb)
cdf=HG-U133A (22283 affyids)
number of samples=1
number of genes=22283
annotation=hgu133a
notes=
The result of importing the CEL files is an object of the class AffyBatch, which can store the measured intensities from multiple GeneChip arrays. This class is derived from the more general eSet class provided by Biobase. The eSet class serves as a core container structure for storing both genomic data (e.g., expression values) and the associated experimental metadata (e.g., sample information, annotation, and protocol details). To explore the structure of this class and its components (called slots in S4 object-oriented programming), you can use the following command:
getClass("eSet")
Virtual Class "eSet" [package "Biobase"]

Slots:

Name:           assayData          phenoData        featureData
Class:          AssayData AnnotatedDataFrame AnnotatedDataFrame

Name:      experimentData         annotation       protocolData
Class:              MIAxE          character AnnotatedDataFrame

Name:   .__classVersion__
Class:           Versions

Extends:
Class "VersionedBiobase", directly
Class "Versioned", by class "VersionedBiobase", distance 2

Known Subclasses: "ExpressionSet", "NChannelSet", "MultiSet", "SnpSet"
The eSet class is composed of several important slots that organize the experimental data and its associated metadata in a structured way:
assayData – This slot is used to store the measured data. It consists of a list or an environment that contains one or more matrices of the same dimensions. The rows of these matrices typically correspond to features (e.g., probes or genes), while the columns correspond to samples.
phenoData – This slot contains variables that describe the phenotypes or experimental conditions of the samples (i.e., the columns in the assayData slot). The data is stored in an AnnotatedDataFrame, which extends the basic data.frame structure with additional metadata.
featureData – This slot holds variables that describe the features (i.e., the rows in the assayData slot). It is usually based on the annotation specified in the annotation slot. Like phenoData, it is also an AnnotatedDataFrame.
experimentData – This slot contains details about the experimental methods and design. The structure follows the recommendations of the Functional Genomics Data Society (FGED).
annotation – This slot stores the name of the annotation package used (e.g., hgu133a for the HG-U133A array). It links the data to gene-level information such as gene symbols, descriptions, or transcript IDs.
protocolData – This slot contains variables generated by the experimental equipment. These variables provide additional information about the samples (i.e., the columns in the assayData slot). Like phenoData and featureData, it is an AnnotatedDataFrame.
In most cases, we do not need to deal directly with these internal slots when analyzing gene expression data. Instead, Bioconductor provides a set of accessor methods—special functions designed to access or modify the content of these objects in a structured and user-friendly way.
Details can be found in the “Methods” section of the help page for the eSet class, which can be opened using the command ?eSet. In the following chapters, we will use only a small subset of these available methods.
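To make the idea of slots and accessor methods concrete, here is a minimal sketch using only the methods package that ships with R. The class ToyExprSet and the accessor toyExprs() are hypothetical stand-ins and are not part of Biobase; they merely mimic the layout of an eSet-like object (a features x samples matrix plus one row of sample metadata per column).

```r
library(methods)

# Toy container: an assay matrix, sample metadata, and an annotation label
setClass(
  "ToyExprSet",
  slots = c(assay = "matrix", pheno = "data.frame", annotation = "character"),
  validity = function(object) {
    if (!identical(colnames(object@assay), rownames(object@pheno))) {
      return("column names of 'assay' must match row names of 'pheno'")
    }
    TRUE
  }
)

# Accessor generic and method, so users need not touch the slots with '@'
setGeneric("toyExprs", function(object) standardGeneric("toyExprs"))
setMethod("toyExprs", "ToyExprSet", function(object) object@assay)

m <- matrix(c(7.3, 5.1, 7.0, 6.4), nrow = 2,
            dimnames = list(c("probe_1", "probe_2"), c("s1", "s2")))
pd <- data.frame(group = c("control", "treated"), row.names = c("s1", "s2"))

toy <- new("ToyExprSet", assay = m, pheno = pd, annotation = "hypothetical_chip")
toyExprs(toy)

# A mismatch between samples and metadata is rejected by the validity check
bad <- tryCatch(
  new("ToyExprSet", assay = m, pheno = pd[2:1, , drop = FALSE], annotation = "x"),
  error = function(e) "invalid object"
)
bad
```

The validity function illustrates why such containers are convenient: expression values and sample metadata cannot silently get out of sync.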
The raw fluorescence intensities measured by the scanner are stored within the assayData slot of the AffyBatch object. This slot contains the probe-level data before any background correction, normalization, or summarization steps. In practice, we rarely access this slot directly, but it is helpful to know where the raw data live and how to inspect them. You can extract these values using the function exprs(), which retrieves the expression (intensity) matrix stored inside assayData. Each row corresponds to a probe (one cell on the array), and each column corresponds to one array (CEL file).
# Extract the raw intensity matrix
raw_values <- exprs(data_affy)
head(raw_values)
  binary.cel
1      115.0
2    15556.0
3      144.0
4    15401.0
5       54.8
6      124.8
Example of downloading real CEL files from GEO
# Install Bioconductor package
BiocManager::install("GEOquery")
# Load library
library(GEOquery)
# Example: Download a GEO dataset (GSE package ID)
gse <- getGEO("GSE228408", GSEMatrix = TRUE)
# Check what is included
summary(gse)
                               Length Class         Mode
GSE228408_series_matrix.txt.gz 1      ExpressionSet S4
exprs_data <- exprs(gse[[1]])
head(exprs_data)
                  GSM7120344 GSM7120345 GSM7120346 GSM7120347
TC0100006437.hg.1       4.19       4.31       4.49       4.87
TC0100006476.hg.1       6.18       6.01       6.17       6.28
TC0100006479.hg.1       4.27       5.04       4.29       4.53
TC0100006480.hg.1       4.94       5.41       4.97       5.32
TC0100006483.hg.1       5.34       4.64       4.91       5.47
TC0100006486.hg.1       6.25       6.14       5.95       6.13
The table shows the first rows of the gene expression matrix extracted from a GEO microarray dataset. Each row represents a feature (a gene, a probe or, as here, a transcript cluster), and each column represents a sample. The values are log₂-transformed expression levels; higher values reflect higher expression of the respective feature in the respective sample.
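Because the values are on the log₂ scale, differences between samples correspond to log₂ fold changes. The following sketch copies two rows and two samples from the output above and is purely arithmetic; comparing two single samples in this way is not a statistical analysis.

```r
# Two rows and two samples copied from the head(exprs_data) output
log2_expr <- matrix(
  c(4.19, 4.31,
    6.18, 6.01),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("TC0100006437.hg.1", "TC0100006476.hg.1"),
                  c("GSM7120344", "GSM7120345"))
)

# Difference of log2 values = log2 fold change between the two samples
log2_fc <- log2_expr[, "GSM7120345"] - log2_expr[, "GSM7120344"]

# Back-transform to a ratio on the original intensity scale
fold_change <- 2^log2_fc

round(cbind(log2_fc, fold_change), 3)
```

A log₂ fold change of 1 would correspond to a doubling of the intensity, and a value of -1 to a halving.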
Illumina
Illumina BeadArrays are composed of microscopic silica beads, each approximately 3 μm in diameter, which are randomly arranged on a fiber-optic bundle or a flat silica slide. Each bead is coated with hundreds of thousands of copies of a specific oligonucleotide sequence, and each bead belongs to a defined bead type. Importantly, each bead type is represented by multiple replicate beads on an array, and the intensities measured for the individual beads constitute the bead-level data. These raw bead-level measurements are then summarized across replicates to produce bead-summary data, which represent the final expression values for each bead type.
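The step from bead-level to bead-summary data can be sketched in base R as a per-bead-type average. This is a simplified illustration: it takes the mean on the log₂ scale without any outlier removal, which may differ from what dedicated software does. The intensities for ProbeID 10008 are taken from the readIllumina() output shown later in this worksheet; ProbeID 99999 and its values are invented.

```r
# Toy bead-level data: one row per bead
beads <- data.frame(
  ProbeID = c(10008, 10008, 10008, 10008, 10008, 99999, 99999, 99999),
  Grn     = c(355, 377, 452, 267, 431, 1210, 1185, 1302)
)

# Work on the log2 scale, then summarize the replicate beads of each bead type
beads$log2Grn <- log2(beads$Grn)
bead_summary <- data.frame(
  ProbeID   = sort(unique(beads$ProbeID)),
  mean_log2 = as.numeric(tapply(beads$log2Grn, beads$ProbeID, mean)),
  se_log2   = as.numeric(tapply(beads$log2Grn, beads$ProbeID,
                                function(x) sd(x) / sqrt(length(x)))),
  n_beads   = as.integer(tapply(beads$log2Grn, beads$ProbeID, length))
)
bead_summary
```

Each bead type ends up with one summarized value, a measure of its variability, and the number of beads that contributed.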

In this book, we will focus on single-channel gene expression BeadArrays. In addition to gene expression profiling, Illumina BeadArrays can also be used for other applications, including SNP genotyping, DNA methylation profiling, and copy number variation (CNV) analysis (see Kuhn et al., 2004). If the aggregation of bead-level data has already been performed using BeadStudio or GenomeStudio, the resulting files can be imported directly into R using the functions lumiR() and lumiR.batch() from the lumi package (Du et al., 2008). lumiR() is designed to import a single file, while lumiR.batch() allows you to import multiple files at once, making it more convenient for larger studies.
Example of Importing a GenomeStudio “Probe Profile” File from GEO
In this example, we will demonstrate how to download a publicly available GenomeStudio Probe Profile file from GEO and import it using lumiR().
## Install required packages
BiocManager::install(c("lumi", "GEOquery", "R.utils"))
## Load the packages
library(lumi)
library(R.utils)
library(GEOquery)
# Choose a working folder
dir.create("geo_dl", showWarnings = FALSE)
# fetch supplementary files for this GSM (downloads a .txt.gz)
getGEOSuppFiles("GSM418285", baseDir = "geo_dl")
                                    size isdir mode               mtime
geo_dl/GSM418285/GSM418285.txt.gz 696976 FALSE  664 2025-10-20 12:40:15
                                                ctime               atime  uid
geo_dl/GSM418285/GSM418285.txt.gz 2025-10-20 12:40:15 2025-10-20 12:40:13 1000
                                   gid uname grname
geo_dl/GSM418285/GSM418285.txt.gz 1000 fatma  fatma
The accession GSM418285 has a supplementary Probe Profile TXT file (compressed as .txt.gz). The call above downloads it to geo_dl/GSM418285/GSM418285.txt.gz.
Unzip the file:
gunzip("geo_dl/GSM418285/GSM418285.txt.gz", overwrite = TRUE)
# Now you have: geo_dl/GSM418285/GSM418285.txt
Import the file with lumiR():
dat.lumi <- lumiR(file = "geo_dl/GSM418285/GSM418285.txt", QC = FALSE)
dat.lumi
Summary of data information:
Data File Information:
Major Operation History:
submitted finished
1 2025-10-20 12:40:15.182873 2025-10-20 12:40:15.861132
command lumiVersion
1 lumiR("geo_dl/GSM418285/GSM418285.txt", QC = FALSE) 2.60.0
Object Information:
LumiBatch (storageMode: lockedEnvironment)
assayData: 48803 features, 1 samples
element names: beadNum, detection, exprs, se.exprs
protocolData: none
phenoData
sampleNames: miR-31 #2
varLabels: sampleID
varMetadata: labelDescription
featureData
featureNames: 6450255 2570615 ... 4120753 (48803 total)
fvarLabels: ProbeID
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
Control Data: N/A
QC information: Please run summary(x, 'QC') for details!
Quick peek at the imported data:
exprs(dat.lumi)[1:5, ]
6450255 2570615 6370619 2600039 2650615
   46.8    77.6    55.7    60.6    57.1
The command displays the first five probe expression values from the imported Illumina dataset, showing raw intensity measurements for a single sample.
Example of Built-in Dataset
For demonstration purposes, we can also use the exampleSummaryData dataset from the beadarrayExampleData package. This dataset can be loaded directly into R without handling raw bead-level measurements. It is stored in an ExpressionSetIllumina object compatible with standard Bioconductor methods.
BiocManager::install(c("beadarray", "beadarrayExampleData"))
library(beadarray)
library(beadarrayExampleData)
data("exampleSummaryData") # load the example data into your session
From here, you can inspect expression values via exprs(), sample metadata via pData(), and feature information via featureData().
# exampleSummaryData
illumina_exprs_data <- exprs(exampleSummaryData) # expression matrix
illumina_exprs_data[1:5, 1:5] # view the first few values
             G:4613710017_B G:4613710052_B G:4613710054_B G:4616443079_B
ILMN_1802380       8.454468       8.616796       8.523001       8.420796
ILMN_1893287       5.388161       5.419345       5.162849       5.133287
ILMN_1736104       5.268626       5.457679       5.012766       4.988511
ILMN_1792389       6.767519       7.183788       6.947624       7.168571
ILMN_1854015       5.556947       5.721614       5.595413       5.520391
             G:4616443093_B
ILMN_1802380       8.527748
ILMN_1893287       5.221987
ILMN_1736104       5.284026
ILMN_1792389       7.386435
ILMN_1854015       5.558717
The command exprs(exampleSummaryData)[1:5, 1:5] displays the first five probes and their log₂-transformed expression levels across the first five Illumina microarray samples.
featureData(exampleSummaryData)[1:5, ] # probe annotations
An object of class 'AnnotatedDataFrame'
rowNames: ILMN_1802380 ILMN_1893287 ... ILMN_1854015 (5 total)
varLabels: ArrayAddressID IlluminaID Status
varMetadata: labelDescription
pData(exampleSummaryData)
                 sampleID SampleFac
4613710017_B 4613710017_B      UHRR
4613710052_B 4613710052_B      UHRR
4613710054_B 4613710054_B      UHRR
4616443079_B 4616443079_B      UHRR
4616443093_B 4616443093_B      UHRR
4616443115_B 4616443115_B      UHRR
4616443081_B 4616443081_B     Brain
4616443081_H 4616443081_H     Brain
4616443092_B 4616443092_B     Brain
4616443107_A 4616443107_A     Brain
4616443136_A 4616443136_A     Brain
4616494005_A 4616494005_A     Brain
annotation(exampleSummaryData)
[1] "Humanv3"
Information about the exampleSummaryData object
## Printing the object shows a summary of its contents
exampleSummaryData
ExpressionSetIllumina (storageMode: list)
assayData: 49576 features, 12 samples
element names: exprs, se.exprs, nObservations
protocolData: none
phenoData
rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
varLabels: sampleID SampleFac
varMetadata: labelDescription
featureData
featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1846115 (49576
total)
fvarLabels: ArrayAddressID IlluminaID Status
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: Humanv3
QC Information
Available Slots:
QC Items: Date, Matrix, ..., SampleGroup, numBeads
sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A
Example of Importing Bead-Level Illumina Data
This section demonstrates how to read bead-level Illumina microarray data, which represents the rawest form of Illumina measurements. The function readIllumina() can directly extract intensity values from the image files generated by the Illumina BeadArray scanner, enabling full preprocessing within R, including background correction, intensity extraction, and normalization. The example uses the BeadArrayUseCases package (Dunning et al., 2012c), which provides sample bead-level datasets, and the annotation package illuminaHumanv3.db (Dunning et al., 2012b), which links Illumina probe IDs to corresponding biological information such as gene names and transcript identifiers.
BiocManager::install(c("illuminaHumanv3.db", "AnnotationHub", "BeadArrayUseCases"))
The annotation package links each Illumina probe ID to biological information such as gene names and transcript identifiers.
library(illuminaHumanv3.db)
library(AnnotationHub)
library(BeadArrayUseCases)
library(beadarray)
Strictly speaking, it is not necessary to load the BeadArrayUseCases package itself, because the example data can be accessed directly from its installation directory via system.file().
# Find the data folder
path <- system.file("extdata/Chips", package = "BeadArrayUseCases")
path
[1] "/home/fatma/R/x86_64-pc-linux-gnu-library/4.5/BeadArrayUseCases/extdata/Chips"
list.files(path)
 [1] "4613710017"      "4613710052"      "4613710054"      "4616443079"
 [5] "4616443081"      "4616443092"      "4616443093"      "4616443107"
 [9] "4616443115"      "4616443136"      "4616494005"      "Metrics.txt"
[13] "sampleSheet.csv"
ss <- file.path(path, "sampleSheet.csv") # full file path
ss
[1] "/home/fatma/R/x86_64-pc-linux-gnu-library/4.5/BeadArrayUseCases/extdata/Chips/sampleSheet.csv"
read.csv(ss, nrows = 3)
          X.Header.    X X.1 X.2
1 Investigator Name Anon  NA  NA
2      Project Name MAQC  NA  NA
3   Experiment Name       NA  NA
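The preview above shows that the sample sheet starts with header lines rather than with the sample table itself. A sheet of this kind can be parsed manually by skipping to its data section, as in the following sketch. The toy file is modeled on the fields printed in this worksheet; the exact layout of the real file, in particular a section marker named [Data], is an assumption.

```r
# Toy sample sheet (hypothetical layout with [Header] and [Data] sections)
sheet_lines <- c(
  "[Header],,,",
  "Investigator Name,Anon,,",
  "Project Name,MAQC,,",
  "[Data],,,",
  "Sample_Name,Sample_Group,Sentrix_ID,Sentrix_Position",
  "UHRR-1,UHRR,4613710017,B",
  "Brain-1,Brain,4616443081,B"
)
sheet_file <- tempfile(fileext = ".csv")
writeLines(sheet_lines, sheet_file)

# Find the line that opens the [Data] section and skip everything up to it
all_lines  <- readLines(sheet_file)
data_start <- grep("^\\[Data\\]", all_lines)

# Read all columns as character so that long chip IDs are kept verbatim
samples <- read.csv(sheet_file, skip = data_start, colClasses = "character")

# Section names of the form <Sentrix_ID>_<Sentrix_Position>
samples$section <- paste(samples$Sentrix_ID, samples$Sentrix_Position, sep = "_")
samples
```

The combined identifiers have the same form as the section names that appear in the bead-level output of this worksheet (for example 4613710017_B).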
BLData <- readIllumina(
dir = path,
sampleSheet = ss,
fileExt = "", # files are named like 4613710052 (no .txt)
useImages = FALSE,
illuminaAnnotation = "Humanv3"
)
BLData
-----------------
Experiment information (@experimentData)
-----------------
$sdfFile
[1] "/home/fatma/R/x86_64-pc-linux-gnu-library/4.5/BeadArrayUseCases/extdata/Chips/4613710017.sdf"
$platformClass
[1] "Slide"
$`Investigator Name`
[1] "Anon"
$`Project Name`
[1] "MAQC"
$`Experiment Name`
[1] ""
$Date
[1] ""
$sampleSheet
Sample_Name Sample_Group Sentrix_ID Sentrix_Position
1 UHRR-1 UHRR 4613710017 B
2 UHRR-2 UHRR 4613710052 B
3 UHRR-3 UHRR 4613710054 B
4 UHRR-4 UHRR 4616443079 B
5 UHRR-5 UHRR 4616443093 B
6 UHRR-6 UHRR 4616443115 B
7 Brain-1 Brain 4616443081 B
8 Brain-2 Brain 4616443081 H
9 Brain-3 Brain 4616443092 B
10 Brain-4 Brain 4616443107 A
11 Brain-5 Brain 4616443136 A
12 Brain-6 Brain 4616494005 A
$annotation
[1] "Humanv3"
-----------------
Per-section data (@sectionData)
-----------------
Targets
directory
1 /home/fatma/R/x86_64-pc-linux-gnu-library/4.5/BeadArrayUseCases/extdata/Chips/4613710017
2 /home/fatma/R/x86_64-pc-linux-gnu-library/4.5/BeadArrayUseCases/extdata/Chips/4613710052
3 /home/fatma/R/x86_64-pc-linux-gnu-library/4.5/BeadArrayUseCases/extdata/Chips/4613710054
4 /home/fatma/R/x86_64-pc-linux-gnu-library/4.5/BeadArrayUseCases/extdata/Chips/4616443079
5 /home/fatma/R/x86_64-pc-linux-gnu-library/4.5/BeadArrayUseCases/extdata/Chips/4616443093
sectionName textFile greenImage greenLocs greenxml
1 4613710017_B 4613710017_B.bab <NA> 4613710017_B.bab <NA>
2 4613710052_B 4613710052_B.bab <NA> 4613710052_B.bab <NA>
3 4613710054_B 4613710054_B.bab <NA> 4613710054_B.bab <NA>
4 4616443079_B 4616443079_B.bab <NA> 4616443079_B.bab <NA>
5 4616443093_B 4616443093_B.bab <NA> 4616443093_B.bab <NA>
... 7 more rows of data
Metrics
Date Matrix Section RegGrn FocusGrn SatGrn P95Grn P05Grn
2 3/13/2009 6:45:04 PM 4613710017 B 0.13 0.70 0 704 36
21 3/24/2009 5:30:53 PM 4613710052 B 0.14 0.64 0 763 37
22 3/17/2009 5:58:00 PM 4613710054 B 0.10 0.67 0 815 36
23 3/31/2009 4:04:09 PM 4616443079 B 0.10 0.64 0 604 34
24 3/26/2009 1:29:30 PM 4616443093 B 0.12 0.66 0 716 34
RegRed FocusRed SatRed P95Red P05Red
2 0 0 0 0 0
21 0 0 0 0 0
22 0 0 0 0 0
23 0 0 0 0 0
24 0 0 0 0 0
... 7 more rows of data
SampleGroup
[1] "4613710017_B" "4613710052_B" "4613710054_B" "4616443079_B" "4616443093_B"
[6] "4616443115_B" "4616443081_B" "4616443081_H" "4616443092_B" "4616443107_A"
[11] "4616443136_A" "4616494005_A"
numBeads
[1] 1088369 1082665 1082037 1105956 1100921 1069340 1069685 1044881 1104819
[10] 1109626 1095595 1100773
-----------------
Per-bead data (@beadData)
-----------------
Raw data from section 4613710017_B
ProbeID GrnX GrnY Grn
[1,] 10008 900.6661 10781.320 355
[2,] 10008 1992.5400 11352.000 377
[3,] 10008 1257.4790 7559.513 452
[4,] 10008 1700.1600 6351.157 267
[5,] 10008 1814.5210 3299.495 431
... 1088364 more rows of data
... data for 11 more section/s
“Humanv3” refers to the Illumina HumanHT-12 version 3 Expression BeadChip platform, which contains approximately 48,000 probes designed to measure gene expression in human samples. The annotation package illuminaHumanv3.db links these probe IDs to corresponding gene symbols and biological information.
Quick checks
# Summarize bead-level data (averaging bead replicates)
BLData_summary <- summarize(BLData)
# Now you can use exprs(), pData(), fData()
head(exprs(BLData_summary))
The object returned by readIllumina() belongs to the beadLevelData class, which mainly consists of a list structure containing the raw bead-level measurements for each array.
getClass("beadLevelData")
Class "beadLevelData" [package "beadarray"]

Slots:

Name:        beadData    sectionData experimentData        history
Class:           list           list           list      character
Finally, it should be mentioned that the function read.ilmn from the limma package (Smyth, 2005) can also be used to import “Probe Profile” files. This package will be used repeatedly throughout the following chapters and provides various functions for the analysis of gene expression data. In particular, it is one of the standard packages for applying linear models, using empirical Bayes methods for statistical inference (see Chapter 5).
In current workflows, read.ilmn() is particularly useful for Illumina expression data exported from GenomeStudio, while read.idat() and neqc() in the same package support modern IDAT-based pipelines.
Other array technologies
In this section, we introduce important packages for the analysis of microarray data that provide flexible import functions for various array technologies and result file formats. The marray package (Yang et al., 2009) includes functions such as read.marrayRaw(), read.GenePix(), read.SMD(), read.Spot(), and read.Agilent(), which allow users to import data from several common microarray scanners and software platforms — including Spot, Agilent, GenePix, and SMD (Stanford Microarray Database).
Importing Example Spot Files
BiocManager::install("marray")
library(marray)
Loading required package: limma

Attaching package: 'limma'

The following object is masked from 'package:beadarray':

    imageplot

The following object is masked from 'package:BiocGenerics':

    plotMA
As a first step, we read in the experimental information.
## Path to the example data
Path <- system.file("swirldata", package = "marray")
## List all files in the directory
Files <- list.files(Path, full.names = TRUE)
## Read the experimental information
## (the targets file is addressed by name, because its position in Files
## depends on the locale-specific sort order)
Exp.Info <- read.marrayInfo(file.path(Path, "SwirlSample.txt"))
## Display a summary of the experiment information
summary(Exp.Info)
The example dataset swirldata comes from a two-color microarray experiment often used to demonstrate normalization and visualization methods in the marray package.
After importing the experimental information, the result is an object of the class marrayInfo. It contains metadata about each microarray slide, such as the slide number, experimental conditions, dye channels (Cy3 and Cy5), and additional notes.
## Show the object
Exp.Info
## Access slots (S4)
Exp.Info@maLabels # character vector
Exp.Info@maInfo # data.frame with slide, Cy3/Cy5, date, comments
Exp.Info@maNotes # notes
## OR: use accessors (recommended)
marray::maLabels(Exp.Info)
marray::maInfo(Exp.Info)
marray::maNotes(Exp.Info)
## Open in the RStudio viewer
View(marray::maInfo(Exp.Info))
To inspect the structure of this class, you can use:
# Display information about the "marrayInfo" class
getClass("marrayInfo")
Class "marrayInfo" [package "marray"]

Slots:

Name:    maLabels     maInfo    maNotes
Class:  character data.frame  character

Extends: "ShowLargeObject"
In the second step, we load the corresponding GAL file, which contains details about the array layout and spot annotation.
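The array layout described by a GAL file can be thought of as a grid of print-tip grids, each of which contains a grid of spots. The following base R sketch enumerates such a layout. The dimensions (4 x 4 grids of 22 x 24 spots) are an assumption chosen to match the swirl arrays and should be verified against the layout summary on your installation; all variable names are hypothetical.

```r
# Assumed layout dimensions
n_grid_rows <- 4;  n_grid_cols <- 4    # print-tip grids on the slide
n_spot_rows <- 22; n_spot_cols <- 24   # spots within each grid

# Enumerate every spot by its grid and within-grid coordinates
layout_tbl <- expand.grid(
  spot_col = seq_len(n_spot_cols),
  spot_row = seq_len(n_spot_rows),
  grid_col = seq_len(n_grid_cols),
  grid_row = seq_len(n_grid_rows)
)
layout_tbl$spot_index <- seq_len(nrow(layout_tbl))

nrow(layout_tbl)          # total number of spots on the array
head(layout_tbl, 3)
```

Each row of the table identifies one spot, which is the unit to which the spot annotation from the GAL file refers.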
Minimal working example (using the included data)
Path <- system.file("swirldata", package = "marray")
list.files(Path)
Array.Info <- read.Galfile("fish.gal", path = Path)
summary(Array.Info)
summary(Array.Info$layout)
# summary(Array.Info$gnames)
# getClass("marrayLayout")
# data_marry <- read.Spot(path = Path,
#                         layout = Array.Info$layout,
#                         gnames = Array.Info$gnames,
#                         target = Exp.Info)
# summary(data_marry)
# getClass("marrayRaw")
## Vignettes
vignette("marrayClasses")
vignette("marrayClassesShort")
vignette("marrayInput")