Data structures in R - an Overview
Data needs to be organized
- numbers and text need to be organized for computers to work with
them
- DATA STRUCTURES are different ways computers organize and work with
data
- R mostly uses fairly intuitive data structures that are similar to a
printed table or a spreadsheet

Tables / spreadsheets have a few basic components
Key structural elements
- columns
- rows
- the whole table
- ?
Other elements
- headings
- column “INDEX”: A, B, C, ….
- row “INDEX”: 1, 2, 3, ….
- ?
Another element of spreadsheets is shown in this image

WORKBOOKS are another key part of spreadsheets
- A file that holds several spreadsheets (“worksheets”) = a “workbook”
- each sheet is accessed through a tab at the bottom

Data structures in R
In order of increasing size:
- Vectors of length 1 (scalars)
- Vectors
- Matrices (plural of matrix)
- Dataframes
- Lists
- NOTE: in R, ALL of these are generically called “OBJECTS”
In math, a single number is called a SCALAR
This terminology is often first introduced in linear / matrix algebra
courses
# make a scalar
x <- 1
# shows its length
length(x)
## [1] 1
R does NOT use the terminology “scalar”
- No object is called a scalar in R!
- “scalars” mathematically are just VECTORS of LENGTH 1
- …So R just calls them vectors
# make a scalar
x <- 1
is(x)
## [1] "numeric" "vector"
Vectors are series of data that travel together
- vectors are made with the function c()
- “c” stands for concatenate (to stick together)
It’s tempting to call a vector a “list” of data…
- but LISTS are an important class of data structure
- we’ll talk about them later in the course
- so, vector =/= LIST
Conceptually, a vector is like a COLUMN in a table or
spreadsheet
## x
## 1 21
## 2 51
## 3 86
## 4 112
## 5 118
A vector can also act like a ROW in a table / spreadsheet
- This is the default way to print out data
- Even though this is how it’s shown…
- …not the way to think about it (in most cases)
## [1] 21 51 86 112 118
Vectors are made with the c() function
packs <- c( 3, 9, 9,11, 11, 8,10,14,13.5,16,
13,13,11,12, 14,11,10,10,10, 11,
10,11,11, 9, 8, 9)
EACH element within the vector is separated by a comma (,)
packs <- c( 3, 9, 9,11, 11, 8,10,14,13.5,16,
13,13,11,12, 14,11,10,10,10, 11,
10,11,11, 9, 8, 9)
Long vectors are often split across lines
Spaces are ok too - I use them to line things up
packs <- c( 3, 9, 9,11, 11, 8,10,14,13.5,16,
13,13,11,12, 14,11,10,10,10, 11,
10,11,11, 9, 8, 9)
Vectors can contain text
Text MUST be in quotes
wolf_names <- c("white fang", "fluffy", "bingo","minnie", "percy")
Vectors can be named (almost) anything
# 1 letter
x <- c(21,51,86,112,118)
# 2 letters
zz <- c(21,51,86,112,118)
# 3 letters
dog <- c(21,51,86,112,118)
# lots of letters
dogsrbetterthancats <- c(21,51,86,112,118)
Periods “.” and UNDERSCORES “_” can be used in object names
- Periods .
- Underscores _
- (Dashes “-” CANNOT be used)
i.like.cats.too <- c(21,51,86,112,118)
but_i_like_dogs_better <- c(21,51,86,112,118)
at_least.cats.use_litter_boxes <- c(21,51,86,112,118)
Periods have a special use in Python so I try to use
UNDERSCORES
But you will see both!
i.like.cats.too <- c(21,51,86,112,118)
but_i_like_dogs_better <- c(21,51,86,112,118)
at_least.cats.use_litter_boxes <- c(21,51,86,112,118)
A common standard is to use underscores and lowercase
- easiest to type (no need to hit “shift”)
- easiest to read (many upper case and lower case letters can look the
same)
- this is called “SNAKE CASE” 🐍
x <- c(21,51,86,112,118)
wolves_n <- c(21,51,86,112,118)
wolves_population_size <- c(21,51,86,112,118)
Some people like “CAMEL CASE” 🐫
- mixes upper and lower case
- I think it’s more confusing …
- “Did I call that vector wolvesN, wolvesn, or Wolvesn?”
wolvesN <- c(21,51,86,112,118)
wolvesn <- c(21,51,86,112,118)
Wolvesn <- c(21,51,86,112,118)
My general rules 🐍
# usually this
wolves_K <- 200
wolves_N <- c(21,51,86,112,118)
sometimes I do this to save space
- only use capital letters for key variable names that have a standard
meaning in biology (N = population size, K = carrying capacity)
- sometimes omit the underscore if using a capital letter
wolvesK <- 200
wolvesN <- c(21,51,86,112,118)
occasionally this if I’m
not thinking
wolves.K <- 200
wolves.N <- c(21,51,86,112,118)
Single ELEMENTS of a vector are accessed using square BRACKETS
aka BRACKET NOTATION
# names
wolf_names <- c("white fang", "fluffy", "bingo","minnie", "percy")
# first name - uses 1 (not 0!)
wolf_names[1]
## [1] "white fang"
# second name - indexing starts at 1 (not 0!)
wolf_names[2]
## [1] "fluffy"
Multiple ELEMENTS of a vector are accessed using :
# first two
wolf_names[1:2]
## [1] "white fang" "fluffy"
# second and third
wolf_names[2:3]
## [1] "fluffy" "bingo"
This is called INDEXING
“The index for ‘fluffy’ is 2”
## [1] "fluffy"
MUST have 2 values when using :
Leaving one off, e.g. wolf_names[1:], ERRORS
Can call everything if you want
## [1] "white fang" "fluffy" "bingo" "minnie" "percy"
Can call everything BUT the 1st like this
## [1] "fluffy" "bingo" "minnie" "percy"
First real programming trick
Use NEGATIVE INDEXING to drop elements
## [1] "fluffy" "bingo" "minnie" "percy"
## [1] "white fang" "bingo" "minnie" "percy"
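A minimal sketch of indexing calls that would produce the outputs above. The slides show only the output here, so the exact calls are an assumption.

```r
# Wolf names from the earlier slide
wolf_names <- c("white fang", "fluffy", "bingo", "minnie", "percy")

# call everything: indices 1 through 5
wolf_names[1:5]

# everything BUT the 1st: indices 2 through 5
wolf_names[2:5]

# NEGATIVE INDEXING: drop the 1st element
wolf_names[-1]

# NEGATIVE INDEXING: drop the 2nd element
wolf_names[-2]
```

Negative indexing saves you from having to know how long the vector is.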
Next real programming trick - can “pass” vectors of indices to
vectors
A vector
wolf_names <- c("white fang", "fluffy", "bingo","minnie", "percy")
a VECTOR ELEMENT via an
index value
## [1] "white fang"
2 vector elements via index values
## [1] "white fang" "fluffy"
a vector of indices
2 vector elements via a
VECTOR of INDICES
## [1] "white fang" "fluffy"
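A short sketch of the three steps above, assuming the missing code looked roughly like this. The name `i_first_two` is hypothetical and only used for illustration.

```r
wolf_names <- c("white fang", "fluffy", "bingo", "minnie", "percy")

# a VECTOR ELEMENT via an index value
wolf_names[1]

# 2 vector elements via index values wrapped in c()
wolf_names[c(1, 2)]

# a vector of indices (hypothetical name)...
i_first_two <- c(1, 2)

# ...passed to the vector inside the brackets
wolf_names[i_first_two]
```

Storing the indices in their own vector means the same selection can be reused on other vectors of the same length.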
I can make a vector of numbers 2 ways
I will usually use the
first for clarity
i1 <- c(1, 2, 3)
i2 <- c(1:3)
How do I test for equality of the two vectors?
## [1] TRUE
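One possible way to get a single TRUE, sketched under the assumption that the slide compared the values element by element. Note that `identical()` is stricter: it also compares how the numbers are stored.

```r
i1 <- c(1, 2, 3)
i2 <- c(1:3)

# element-by-element comparison: one TRUE/FALSE per pair
i1 == i2

# collapse to a single answer: are ALL pairs equal?
all(i1 == i2)

# stricter test: FALSE here, because c(1, 2, 3) is stored as
# "double" while 1:3 is stored as "integer"
identical(i1, i2)
typeof(i1)
typeof(i2)
```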
Operations on vectors can be VECTORIZED
Some vectors
DNA <- c("A","T","C","G")
RNA <- c("A","U","C","G")
Their length
## [1] 4
## [1] 4
The equality of their
lengths
length(DNA) == length(RNA)
## [1] TRUE
Access 1st elements
## [1] "A"
## [1] "A"
Equality of first elements
## [1] TRUE
Access 2nd elements
## [1] "T"
## [1] "U"
IN-Equality of 2nd elements
## [1] FALSE
Assess equality of ALL elements
- the function == has been applied to each pair of elements
- this is a VECTORIZED operation
## [1] TRUE FALSE TRUE TRUE
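A sketch that walks through the comparisons above in one place. The slides show only the outputs for most steps, so these calls are an assumption about what was run.

```r
DNA <- c("A", "T", "C", "G")
RNA <- c("A", "U", "C", "G")

# lengths of each vector
length(DNA)
length(RNA)

# equality of first elements: "A" vs "A"
DNA[1] == RNA[1]

# equality of 2nd elements: "T" vs "U"
DNA[2] == RNA[2]

# VECTORIZED: == is applied to each pair of elements in turn
DNA == RNA
```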
VECTORIZED operations are common in R
Taking the log of something is very common in math, stats, ML, bio…
- natural log = log() in R
- log base 10 = log10()
- log base 2 = log2()
## [1] 2.302585
## [1] 1
## [1] 3.321928
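The three outputs above are consistent with taking each kind of log of 10. A minimal sketch, assuming that is what the slide ran:

```r
# natural log of 10
log(10)

# log base 10 of 10
log10(10)

# log base 2 of 10
log2(10)
```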
Functions can be applied to entire vectors
VECTORIZED operations
are common in R
## [1] 3.044522 3.931826 4.454347 4.718499 4.770685 4.779123 4.882802 4.997212
## [9] 5.159055 5.141664 4.770685 4.912655 5.141664 4.820282 4.564348 4.574711
## [17] 4.584967 4.418841 4.553877 4.644391 4.584967 4.682131 4.574711 4.382027
## [25] 4.543295 4.812184
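A small sketch of applying a function to a whole vector. It uses only the first five wolf counts that appear elsewhere in these slides, so it reproduces just the first five values of the output above.

```r
# first five yearly wolf counts
wolves <- c(21, 51, 86, 112, 118)

# log() is applied to every element at once: no loop needed
log(wolves)

# the result has the same length as the input
length(log(wolves))
```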
Math can be done on entire vectors
The average wolf is 95
pounds
Wolf BIOMASS each year
## [1] 1995 4845 8170 10640 11210 11305 12540 14060 16530 16245 11210 12920
## [13] 16245 11780 9120 9215 9310 7885 9025 9880 9310 10260 9215 7600
## [25] 8930 11685
Yellowstone NP is 3500
square miles
Wolves per square mile
## [1] 0.00600000 0.01457143 0.02457143 0.03200000 0.03371429 0.03400000
## [7] 0.03771429 0.04228571 0.04971429 0.04885714 0.03371429 0.03885714
## [13] 0.04885714 0.03542857 0.02742857 0.02771429 0.02800000 0.02371429
## [19] 0.02714286 0.02971429 0.02800000 0.03085714 0.02771429 0.02285714
## [25] 0.02685714 0.03514286
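A sketch of the arithmetic behind the two outputs above, again using only the first five wolf counts from these slides. The calls themselves are an assumption, since the slides show only the results.

```r
# first five yearly wolf counts
wolves <- c(21, 51, 86, 112, 118)

# wolf BIOMASS each year: every count multiplied by 95 pounds
wolves * 95

# wolves per square mile: every count divided by 3500 square miles
wolves / 3500
```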
Math can be done using variables
Make variables with
constants
wolve_weight <- 95
YNP_size <- 3500
Do math using variables
## [1] 0.2210526 0.5368421 0.9052632 1.1789474 1.2421053 1.2526316 1.3894737
## [8] 1.5578947 1.8315789 1.8000000 1.2421053 1.4315789 1.8000000 1.3052632
## [15] 1.0105263 1.0210526 1.0315789 0.8736842 1.0000000 1.0947368 1.0315789
## [22] 1.1368421 1.0210526 0.8421053 0.9894737 1.2947368
## [1] 0.00600000 0.01457143 0.02457143 0.03200000 0.03371429 0.03400000
## [7] 0.03771429 0.04228571 0.04971429 0.04885714 0.03371429 0.03885714
## [13] 0.04885714 0.03542857 0.02742857 0.02771429 0.02800000 0.02371429
## [19] 0.02714286 0.02971429 0.02800000 0.03085714 0.02771429 0.02285714
## [25] 0.02685714 0.03514286
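A sketch of one way the stored constants might be used in place of raw numbers. The specific calculations shown are an assumption, and only the first five wolf counts from these slides are used.

```r
# first five yearly wolf counts
wolves <- c(21, 51, 86, 112, 118)

# constants stored in variables, named as on the slide
wolve_weight <- 95
YNP_size <- 3500

# same math as before, but the names document what the numbers mean
wolves * wolve_weight
wolves / YNP_size
```

If the park size or average weight changes, only the variable needs to be edited.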
When two vectors are used in an operation their elements are
compared PAIRWISE
Setup
- Vectors of the number of wolves and the number of packs in the same years
## [1] 21 51 86 112 118
## [1] 3 9 9 11 11
## [1] 1995 1996 1997 1998 1999
Number of wolves and packs
year 1
## [1] 21
## [1] 3
Wolves per pack
## [1] 7
Wolves per pack via index
values
- math jargon: dividing one SCALAR by another SCALAR
## [1] 7
Wolves per pack for ALL
YEARS
## [1] 7.000000 5.666667 9.555556 10.181818 10.727273 14.875000 13.200000
## [8] 10.571429 12.888889 10.687500 9.076923 10.461538 15.545455 10.333333
## [15] 6.857143 8.818182 9.800000 8.300000 9.500000 9.454545 9.800000
## [22] 9.818182 8.818182 8.888889 11.750000 13.666667
Wolves per pack 1st 2 years
Using :
## [1] 7.000000 5.666667
Using indices in a vector via
c()
## [1] 21 51
Using raw index
wolves[c(1,2)]/packs[c(1,2)]
## [1] 7.000000 5.666667
Wolves per pack
ignoring just first year
## [1] 5.666667 9.555556 10.181818 10.727273 14.875000 13.200000 10.571429
## [8] 12.888889 10.687500 9.076923 10.461538 15.545455 10.333333 6.857143
## [15] 8.818182 9.800000 8.300000 9.500000 9.454545 9.800000 9.818182
## [22] 8.818182 8.888889 11.750000 13.666667
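A compact sketch of the pairwise operations in this section, using the first five values of each vector shown in the setup. Most of the calls are an assumption, because the slides mostly show only output.

```r
# first five years of data from the setup slide
wolves <- c(21, 51, 86, 112, 118)
packs  <- c(3, 9, 9, 11, 11)

# year 1 only: one SCALAR divided by another
wolves[1] / packs[1]

# ALL years: element 1 / element 1, element 2 / element 2, ...
wolves / packs

# first 2 years using :
wolves[1:2] / packs[1:2]

# ignoring just the first year via negative indexing
wolves[-1] / packs[-1]
```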
Vectors have 3 key features: LENGTH, STRUCTURE, and TYPE/CLASS
- length: length()
- data structure: is()
- type of content: class()
## [1] 5
## [1] "numeric" "vector"
## [1] "numeric"
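A sketch of the three calls that would give the outputs above, assuming they were run on the five-element wolf vector used earlier.

```r
# is() lives in the methods package, which is normally attached already
library(methods)

wolves_N <- c(21, 51, 86, 112, 118)

# LENGTH: how many elements
length(wolves_N)

# STRUCTURE: what kind of object
is(wolves_N)

# TYPE/CLASS: what kind of content
class(wolves_N)
```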
Different kinds of data have different CLASSES in R
The 2 most important
classes for beginners:
- numeric = numbers
- character = letters, words, sentences
CHARACTER data is REALLY important for computational biology
- Characters MUST be enclosed in quotes
# make a character vector
x <- c("a", "t", "c", "g")
## [1] "a" "t" "c" "g"
# check its class and structure
class(x)
## [1] "character"
## [1] "character" "vector" "data.frameRowLabels"
## [4] "SuperClassMethod"
Character data MUST be enclosed in quotation or other marks
- Quotation marks “…”
- Apostrophes ‘…’
- (In some very specific contexts: backticks `...`)
- Missing quotation marks are VERY common typos - if you get an error,
check for missing “…”
# quotes
x <- c("a", "t", "c", "g")
# apostrophes
y <- c('a', 't', 'c', 'g')
y
## [1] "a" "t" "c" "g"
# mix
z <- c('a', "t", 'c', "g")
z
## [1] "a" "t" "c" "g"
NUMERIC data includes INTEGERS AND numbers with decimal points
- integers not special in R
wolves_N <- c(21,51,86,112,118)
class(wolves_N)
## [1] "numeric"
wolves_N <- c(21.0, 51.0, 86.0, 112.0, 118.0)
class(wolves_N)
## [1] "numeric"
wolf_mass <- c(22.28, 24.99, 19.45, 22.6)
class(wolf_mass)
## [1] "numeric"
Most programming languages DO care about integers versus “floating
point” numbers
- INTEGERS: -100, -89, -9, 0, 1, 10, 1001, 100000123
- FLOATING POINT: -100.1, 0.89, 8.99
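A side note sketched in code: R does have a separate integer type, but you only get it if you ask for it (for example with the `L` suffix). Whole numbers typed the usual way are stored as ordinary numeric values, which is why beginners can mostly ignore the distinction.

```r
# a whole number typed normally is "numeric" (stored as a double)
class(21)
typeof(21)

# the L suffix explicitly requests an integer
class(21L)

# both count as numbers for doing math
is.numeric(21)
is.numeric(21L)
```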
Vectors can only contain ONE type of data
## [1] 1 2 3
## [1] "a" "b" "c"
## [1] "numeric"
## [1] "character"
z <- c(1, "c", 3)
class(z)
## [1] "character"
## [1] "1" "c" "3"
Computers think about numbers 2 ways - as things to do math with, and
as things to print that are read by humans
x <- c(1 , 2, 3)
y <- c("1", "2", "3")
class(x) == class(y)
## [1] FALSE
When presented with numbers and text in a single vector,
R COERCES them all to character data
- General concept: COERCION: conversion between data types AND/OR
structures
- verbs: COERCE, COERCED, COERCES
## [1] "1" "c" "3"
## [1] "character"
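A minimal sketch of coercion in both directions. The automatic step matches the slide above; the explicit conversion back with `as.numeric()` is an addition for illustration.

```r
# mixing numbers and text: R silently COERCES everything to character
z <- c(1, "c", 3)
z
class(z)

# numbers stored as text can't be used for math until converted back
y <- c("1", "2", "3")
class(y)

# explicit coercion from character to numeric
y_num <- as.numeric(y)
class(y_num)
y_num + 1
```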
When making vectors, R isn’t very concerned with spacing
# no spaces is ok
nucleotides <- c("A","T","C","G")
# one space after comma is standard
nucleotides <- c("A", "T", "C", "G")
# can add as many spaces as you want!
nucleotides <- c( "A" , "T" , "C" , "G" )
When making vectors, R isn’t very concerned with LINE
BREAKS
# line breaks after commas are ok
nucleotides <- c("A",
"T",
"C",
"G")
When making CHARACTER vectors, R IS concerned with spaces between
the quotes
How many different things are in this vector?
nucleotides <- c("A"," A", "A "," A ")
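One way to answer the question, as a sketch: count the unique values. The clean-up step with `trimws()` is an addition for illustration.

```r
nucleotides <- c("A", " A", "A ", " A ")

# spaces INSIDE the quotes are part of the data,
# so R sees four different strings
unique(nucleotides)
length(unique(nucleotides))

# trimws() strips leading and trailing spaces
unique(trimws(nucleotides))
```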
Vectors are the building blocks of 2 other data structures
MATRICES
DATAFRAMES
- Analogous to spreadsheets
MATRICES can be built from vectors
- matrices can hold numeric OR character data
- BUT ONLY ONE TYPE AT A TIME
a_matrix <- cbind(DNA, RNA)
a_matrix
## DNA RNA
## [1,] "A" "A"
## [2,] "T" "U"
## [3,] "C" "C"
## [4,] "G" "G"
## [1] "matrix" "array" "structure" "vector"
MATRICES can be built with cbind()
a_matrix <- cbind(DNA, RNA)
a_matrix <- cbind(DNA,
RNA)
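A sketch of what “only one type at a time” means for matrices: if a character vector and a numeric vector are bound together, the numbers are coerced to text. The name `mixed_matrix` is hypothetical.

```r
DNA <- c("A", "T", "C", "G")
proportion <- c(0.25, 0.25, 0.25, 0.25)

# bind a character column and a numeric column into one matrix
mixed_matrix <- cbind(DNA, proportion)
mixed_matrix

# everything is now character data, including the proportions
typeof(mixed_matrix)

# 4 rows, 2 columns
dim(mixed_matrix)
```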
DATAFRAMES can be built with data.frame()
- dataframes can hold numeric OR character data
- and both at the same time
a_df <- data.frame(DNA, RNA)
is(a_df)
## [1] "data.frame" "list" "oldClass" "vector"
## DNA RNA
## 1 A A
## 2 T U
## 3 C C
## 4 G G
proportion <- c(0.25,0.25,0.25,0.25)
another_df <- data.frame(DNA, proportion)
another_df
## DNA proportion
## 1 A 0.25
## 2 T 0.25
## 3 C 0.25
## 4 G 0.25
Columns in dataframes can be accessed by name using $
notation
## [1] "A" "T" "C" "G"
## [1] 0.25 0.25 0.25 0.25
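A sketch of the `$` calls that would give the two outputs above, assuming they were run on `another_df`. `stringsAsFactors = FALSE` is set explicitly so the text column stays character data in older versions of R too.

```r
DNA <- c("A", "T", "C", "G")
proportion <- c(0.25, 0.25, 0.25, 0.25)
another_df <- data.frame(DNA, proportion, stringsAsFactors = FALSE)

# pull out one column by name: dataframe$column
another_df$DNA
another_df$proportion

# each column keeps its own type
class(another_df$DNA)
class(another_df$proportion)
```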
Columns in dataframes act like vectors
another_df$proportion*100
## [1] 25 25 25 25