Data structures in R - an Overview

Data needs to be organized

Tables / spreadsheets have a few basic components

Key structural elements

Other elements

Another element of spreadsheets is shown in this image

WORKBOOKS are another key part of spreadsheets

Data structures in R

In order of increasing size:

In math, a single number is called a SCALAR

This terminology is often first introduced in linear / matrix algebra courses

# make a scalar
x <- 1

# shows its length
length(x)
## [1] 1
# look at it
x
## [1] 1

R does NOT use the terminology “scalar”

# make a scalar
x <- 1

is(x)
## [1] "numeric" "vector"
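
A quick check of the point above, as a minimal sketch: a single number in R is simply a vector of length 1. The object name x follows the slides; is.vector() and length() are base R functions.

```r
# a "scalar" in R is stored as a vector with one element
x <- 1

# is it a vector?
is_vec <- is.vector(x)
is_vec

# how many elements does it have?
n_elements <- length(x)
n_elements
```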

Vectors are series of data that travel together

x <- c(21,51,86,112,118)

It's tempting to call a vector a “list” of data…

Conceptually, a vector is like a COLUMN in a table or spreadsheet

##     x
## 1  21
## 2  51
## 3  86
## 4 112
## 5 118

A vector can also act like a ROW in a table / spreadsheet

x
## [1]  21  51  86 112 118

Vectors are made with the c() function

packs  <- c( 3, 9,  9,11,  11, 8,10,14,13.5,16,
             13,13,11,12,  14,11,10,10,10,  11,
             10,11,11, 9,   8, 9)

EACH element within the vector is separated by a comma (,)

packs  <- c( 3, 9,  9,11, 11, 8,10,14,13.5,16,
             13,13,11,12, 14,11,10,10,10,  11,
             10,11,11, 9,  8, 9)

Long vectors are often split across lines

Spaces are OK too - I use them to line things up

packs  <- c( 3, 9,  9,11, 11, 8,10,14,13.5,16,
             13,13,11,12, 14,11,10,10,10,  11,
             10,11,11, 9, 8, 9)

Vectors can contain text

Text MUST be in quotes

wolf_names  <- c("white fang", "fluffy", "bingo","minnie", "percy")

Vectors can be named (almost) anything

# 1 letter
x <- c(21,51,86,112,118)

# 2 letters
zz <- c(21,51,86,112,118)

# 3 letters
dog <- c(21,51,86,112,118)

# lots of letters
dogsrbetterthancats <- c(21,51,86,112,118)

Periods “.” and UNDERSCORES “_” can be used in object names

i.like.cats.too <- c(21,51,86,112,118)

but_i_like_dogs_better <- c(21,51,86,112,118)

at_least.cats.use_litter_boxes <- c(21,51,86,112,118)

Periods have a special use in Python so I try to use UNDERSCORES

But you will see both!

i.like.cats.too <- c(21,51,86,112,118)

but_i_like_dogs_better <- c(21,51,86,112,118)

at_least.cats.use_litter_boxes <- c(21,51,86,112,118)

A common standard is to use underscores and lowercase

x <- c(21,51,86,112,118)

wolves_n <- c(21,51,86,112,118)

wolves_population_size <- c(21,51,86,112,118)

Some people like “CAMEL CASE” 🐫

wolvesN <- c(21,51,86,112,118)

wolvesn <- c(21,51,86,112,118)

Wolvesn <- c(21,51,86,112,118)

My general rules 🐍

# usually this
wolves_K <- 200
wolves_N <- c(21,51,86,112,118) 

My general rules

sometimes I do this to save space

wolvesK <- 200
wolvesN <- c(21,51,86,112,118) 

My general rules

occasionally this if I’m not thinking

wolves.K <- 200
wolves.N <- c(21,51,86,112,118) 

Single ELEMENTS of a vector are accessed using square BRACKETS

aka BRACKET NOTATION

# names
wolf_names  <- c("white fang", "fluffy", "bingo","minnie", "percy")

# first name - uses 1 (not 0!)
wolf_names[1]
## [1] "white fang"
# 2nd name - indexing starts at 1 (not 0!)
wolf_names[2]
## [1] "fluffy"

Multiple ELEMENTS of a vector are accessed using :

# first two
wolf_names[1:2]
## [1] "white fang" "fluffy"
# 2nd and 3rd
wolf_names[2:3]
## [1] "fluffy" "bingo"

This is called INDEXING

“The index for ‘fluffy’ is 2”

wolf_names[2]
## [1] "fluffy"

MUST have 2 values when using :

Both of these throw ERRORS

# missing the end value
wolf_names[1:]
# missing the start value
wolf_names[:2]

Can call everything if you want

wolf_names[1:5]
## [1] "white fang" "fluffy"     "bingo"      "minnie"     "percy"

Can call everything BUT the 1st like this

wolf_names[2:5]
## [1] "fluffy" "bingo"  "minnie" "percy"
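
Because : always needs both a start and an end value, one way to say "from the 2nd element to the end" is to let length() supply the end. This is a small sketch; n and last_four are hypothetical names used only for illustration.

```r
wolf_names <- c("white fang", "fluffy", "bingo", "minnie", "percy")

# store the number of elements (n is a hypothetical helper name)
n <- length(wolf_names)

# everything from the 2nd element to the last one
last_four <- wolf_names[2:n]
last_four
```

This keeps working if names are later added to or removed from wolf_names, whereas a hard-coded 2:5 would not.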

First real programming trick

Use NEGATIVE INDEXING to drop elements

wolf_names[-1]
## [1] "fluffy" "bingo"  "minnie" "percy"
wolf_names[-2]
## [1] "white fang" "bingo"      "minnie"     "percy"

Next real programming trick - can “pass” vectors of indices to vectors

A vector

wolf_names  <- c("white fang", "fluffy", "bingo","minnie", "percy")

a VECTOR ELEMENT via an index value

wolf_names[1]
## [1] "white fang"

Next real programming trick - can “pass” vectors of indices to vectors

2 vector elements via a range of index values

wolf_names[1:2]
## [1] "white fang" "fluffy"

Next real programming trick - can “pass” vectors of indices to vectors

a vector of indices

i <- c(1,2)

2 vector elements via a VECTOR of INDICES

wolf_names[i]
## [1] "white fang" "fluffy"
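
A vector of indices can also be combined with negative indexing to drop several elements at once. A minimal sketch, where i follows the slides and kept is a hypothetical name:

```r
wolf_names <- c("white fang", "fluffy", "bingo", "minnie", "percy")

# a vector of indices
i <- c(1, 2)

# put a minus sign in front of the index vector to DROP those elements
kept <- wolf_names[-i]
kept

# the same thing using : (the parentheses matter here)
wolf_names[-(1:2)]
```

Without the parentheses, -1:2 would mean the sequence -1, 0, 1, 2, which mixes negative and positive indices and throws an error.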

I can make a vector of numbers 2 ways

I will usually use the first for clarity

i1 <- c(1, 2, 3)

i2 <- c(1:3)

How do I test for equality of the two vectors?

length(i1) == length(i2)
## [1] TRUE
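
Comparing lengths only shows that the two vectors are the same size. A hedged sketch of two further checks using the base R functions all() and identical(); same_values and same_object are hypothetical names.

```r
i1 <- c(1, 2, 3)
i2 <- c(1:3)

# compare element by element
i1 == i2

# are ALL of the elements equal?
same_values <- all(i1 == i2)
same_values

# identical() is stricter: it also checks how the numbers are stored
same_object <- identical(i1, i2)
same_object

# the two vectors hold the same values but have different classes
class(i1)
class(i2)
```

Here all() reports TRUE, while identical() reports FALSE because 1:3 creates integers and c(1, 2, 3) creates ordinary numeric values.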

Operations on vectors can be VECTORIZED

Some vectors

DNA <- c("A","T","C","G")

RNA <- c("A","U","C","G")

Their length

length(DNA)
## [1] 4
length(RNA)
## [1] 4

Operations on vectors can be VECTORIZED

The equality of their lengths

length(DNA) == length(RNA)
## [1] TRUE

Access 1st elements

DNA[1]
## [1] "A"
RNA[1]
## [1] "A"

Operations on vectors can be VECTORIZED

Equality of first elements

DNA[1] == RNA[1]
## [1] TRUE

Access 2nd elements

DNA[2]
## [1] "T"
RNA[2]
## [1] "U"

Operations on vectors can be VECTORIZED

IN-Equality of 2nd elements

DNA[2] == RNA[2]
## [1] FALSE

Operations on vectors can be VECTORIZED

Assess equality of ALL elements

DNA == RNA
## [1]  TRUE FALSE  TRUE  TRUE

VECTORIZED operations are common in R

Taking the log of something is very common in math, stats, ML, bio…

natural log = log() in R

log base 10 = log10()

log base 2 = log2()

log(10)
## [1] 2.302585
log10(10)
## [1] 1
log2(10)
## [1] 3.321928
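
The three log functions are related: log() also accepts a base argument, so the shortcuts can be checked against it. A small sketch; the input values 8 and 100 are chosen only because their logs are easy to verify by hand.

```r
# 2^3 = 8, so log base 2 of 8 is 3
log2(8)
log(8, base = 2)

# 10^2 = 100, so log base 10 of 100 is 2
log10(100)
log(100, base = 10)

# with no base given, log() is the natural log, so log(exp(1)) is 1
log(exp(1))
```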

Functions can be applied to entire vectors

VECTORIZED operations are common in R

log(wolves)
##  [1] 3.044522 3.931826 4.454347 4.718499 4.770685 4.779123 4.882802 4.997212
##  [9] 5.159055 5.141664 4.770685 4.912655 5.141664 4.820282 4.564348 4.574711
## [17] 4.584967 4.418841 4.553877 4.644391 4.584967 4.682131 4.574711 4.382027
## [25] 4.543295 4.812184
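
The full wolves vector is not defined on these slides, so here is a self-contained sketch using only the first five values (the wolves_N vector from the naming slides). The names log_wolves and biomass are hypothetical.

```r
# first five years of wolf counts
wolves_N <- c(21, 51, 86, 112, 118)

# one call to log() is applied to every element
log_wolves <- log(wolves_N)
log_wolves

# the same is true of arithmetic: each element is multiplied by 95
biomass <- wolves_N * 95
biomass
```

The results should match the first five values printed for log(wolves) and wolves*95 on the slides.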

Math can be done on entire vectors

The average wolf is 95 pounds

Wolf BIOMASS each year

wolves*95
##  [1]  1995  4845  8170 10640 11210 11305 12540 14060 16530 16245 11210 12920
## [13] 16245 11780  9120  9215  9310  7885  9025  9880  9310 10260  9215  7600
## [25]  8930 11685

Math can be done on entire vectors

Yellowstone NP is 3500 square miles

Wolves per square mile

wolves / 3500
##  [1] 0.00600000 0.01457143 0.02457143 0.03200000 0.03371429 0.03400000
##  [7] 0.03771429 0.04228571 0.04971429 0.04885714 0.03371429 0.03885714
## [13] 0.04885714 0.03542857 0.02742857 0.02771429 0.02800000 0.02371429
## [19] 0.02714286 0.02971429 0.02800000 0.03085714 0.02771429 0.02285714
## [25] 0.02685714 0.03514286

Math can be done using variables

Make variables with constants

wolf_weight <- 95
YNP_size <- 3500

Math can be done using variables

Do math using variables

wolves/wolf_weight
##  [1] 0.2210526 0.5368421 0.9052632 1.1789474 1.2421053 1.2526316 1.3894737
##  [8] 1.5578947 1.8315789 1.8000000 1.2421053 1.4315789 1.8000000 1.3052632
## [15] 1.0105263 1.0210526 1.0315789 0.8736842 1.0000000 1.0947368 1.0315789
## [22] 1.1368421 1.0210526 0.8421053 0.9894737 1.2947368
wolves/YNP_size
##  [1] 0.00600000 0.01457143 0.02457143 0.03200000 0.03371429 0.03400000
##  [7] 0.03771429 0.04228571 0.04971429 0.04885714 0.03371429 0.03885714
## [13] 0.04885714 0.03542857 0.02742857 0.02771429 0.02800000 0.02371429
## [19] 0.02714286 0.02971429 0.02800000 0.03085714 0.02771429 0.02285714
## [25] 0.02685714 0.03514286

When two vectors are used in an operation their elements are compared PAIRWISE

Setup

wolves[1:5]
## [1]  21  51  86 112 118
packs[1:5]
## [1]  3  9  9 11 11
year[1:5]
## [1] 1995 1996 1997 1998 1999

When 2 vectors are used in an operation their elements are compared PAIRWISE

Number of wolves and packs year 1

wolves[1]
## [1] 21
packs[1]
## [1] 3

Wolves per pack

21/3
## [1] 7

When two vectors are used in an operation their elements are compared PAIRWISE

Wolves per pack via index values

wolves[1]/packs[1]
## [1] 7

When two vectors are used in an operation their elements are compared PAIRWISE

Wolves per pack for ALL YEARS

wolves/packs
##  [1]  7.000000  5.666667  9.555556 10.181818 10.727273 14.875000 13.200000
##  [8] 10.571429 12.888889 10.687500  9.076923 10.461538 15.545455 10.333333
## [15]  6.857143  8.818182  9.800000  8.300000  9.500000  9.454545  9.800000
## [22]  9.818182  8.818182  8.888889 11.750000 13.666667

When two vectors are used in an operation their elements are compared PAIRWISE

Wolves per pack 1st 2 years

Using :

wolves[1:2]/packs[1:2]
## [1] 7.000000 5.666667

When two vectors are used in an operation their elements are compared PAIRWISE

Using indices in a vector via c()

i <- c(1,2)
wolves[i]
## [1] 21 51

When two vectors are used in an operation their elements are compared PAIRWISE

Using raw index

wolves[c(1,2)]/packs[c(1,2)]
## [1] 7.000000 5.666667

When two vectors are used in an operation their elements are compared PAIRWISE

Wolves per pack ignoring just first year

wolves[-1]/packs[-1]
##  [1]  5.666667  9.555556 10.181818 10.727273 14.875000 13.200000 10.571429
##  [8] 12.888889 10.687500  9.076923 10.461538 15.545455 10.333333  6.857143
## [15]  8.818182  9.800000  8.300000  9.500000  9.454545  9.800000  9.818182
## [22]  8.818182  8.888889 11.750000 13.666667
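
The pairwise behavior can be reproduced with just the first five years shown in the Setup slide. A minimal sketch; wolves_5, packs_5 and per_pack are hypothetical names for this example.

```r
# first five years of data, copied from the Setup slide
wolves_5 <- c(21, 51, 86, 112, 118)
packs_5  <- c( 3,  9,  9,  11,  11)

# element 1 is divided by element 1, element 2 by element 2, and so on
per_pack <- wolves_5 / packs_5
per_pack

# the result has one value per pair
length(per_pack)

# doing the first pair by hand gives the same answer
wolves_5[1] / packs_5[1]
```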

Vectors have 3 key features: LENGTH, STRUCTURE, and TYPE/CLASS

length(wolves_N)
## [1] 5
is(wolves_N)
## [1] "numeric" "vector"
class(wolves_N)
## [1] "numeric"

Different kinds of data have different CLASSES in R

The 2 most important classes for beginners: CHARACTER and NUMERIC

CHARACTER data is REALLY important for computational biology

# make a character vector
x <- c("a", "t", "c", "g")

CHARACTER data is REALLY important for computational biology

# look at the vector
x
## [1] "a" "t" "c" "g"
# check its class and structure
class(x)
## [1] "character"
is(x)
## [1] "character"           "vector"              "data.frameRowLabels"
## [4] "SuperClassMethod"

Character data MUST be enclosed in quotes (double or single)

# quotes
x <- c("a", "t", "c", "g")

Character data MUST be enclosed in quotes (double or single)

# apostrophes
y <- c('a', 't', 'c', 'g')

y
## [1] "a" "t" "c" "g"
# mix
z <- c('a', "t", 'c', "g")

z
## [1] "a" "t" "c" "g"

NUMERIC data includes INTEGERS AND numbers with decimal points

wolves_N <- c(21,51,86,112,118)
class(wolves_N)
## [1] "numeric"

NUMERIC data includes INTEGERS AND numbers with decimal points

Whole numbers are not treated specially in R: 21 and 21.0 are both numeric

wolves_N <- c(21.0, 51.0, 86.0, 112.0, 118.0)
class(wolves_N)
## [1] "numeric"

NUMERIC data includes INTEGERS AND numbers with decimal points

Integers are not special in R

wolf_mass <- c(22.28, 24.99, 19.45, 22.6)
class(wolf_mass)
## [1] "numeric"

Most programming languages DO care about integers versus “floating point” numbers
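
For the curious, R does have a separate integer class, but you have to ask for it and it mixes freely with ordinary numeric data. A hedged sketch; the L suffix and the : operator are standard R, and the values are arbitrary.

```r
# a plain number is "numeric", even with no decimal point
class(21)

# adding L after the number asks for an integer
class(21L)

# the : operator also makes integers
class(1:3)

# the two kinds of numbers compare as equal
21 == 21L

# math on integers can return numeric values
class(21L / 2L)
```

For beginner work this distinction can usually be ignored.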

Vectors can only contain ONE type of data

x <- c(1, 2, 3)
x
## [1] 1 2 3
y <- c("a", "b", "c")
y
## [1] "a" "b" "c"
class(x)
## [1] "numeric"
class(y)
## [1] "character"

Vectors can only contain ONE type of data

z <- c(1, "c", 3)


class(z)
## [1] "character"
z
## [1] "1" "c" "3"

Computers think about numbers 2 ways: as things to do math with, and as text to print for humans to read

x <- c(1 , 2, 3)

y <- c("1", "2", "3")

class(x) == class(y)
## [1] FALSE

When presented with numbers and text in a single vector, R COERCES everything to character data

z <- c(1, "c", 3)
z
## [1] "1" "c" "3"
class(z)
## [1] "character"
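
Numbers stored as character data can't be used for math until they are converted back. A minimal sketch using the base R function as.numeric(); y follows the earlier slide and y_num is a hypothetical name.

```r
# numbers stored as text
y <- c("1", "2", "3")
class(y)

# y + 1 would throw an error because y is character data

# convert the text back to numbers
y_num <- as.numeric(y)
class(y_num)

# now math works
y_num + 1
```

Text that does not look like a number, such as "c", cannot be converted; as.numeric() returns NA for it and gives a warning.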

When making vectors, R isn’t very concerned with spacing

# no spaces is ok
nucleotides <- c("A","T","C","G")

# one space after comma is standard
nucleotides <- c("A", "T", "C", "G")

When making vectors, R isn’t very concerned with spacing

# can add as many spaces as you want!
nucleotides <- c( "A" ,  "T" ,  "C" , "G"                 )

When making vectors, R isn’t very concerned with spacing

# can add as many spaces as you want!
nucleotides <- c(        "A" ,    "T" ,  "C" , "G"                 )

When making vectors, R isn’t very concerned with LINE BREAKS

# one element per line is ok
nucleotides <- c("A",
                 "T",
                 "C",
                 "G")

When making vectors, R isn’t very concerned with LINE BREAKS

# blank lines between elements are ok too
nucleotides <- c("A",
                 
                 "T",
                 
                 "C",
                 
                 "G")

When making CHARACTER vectors, R IS concerned with spaces between the quotes

How many different things are in this vector?

nucleotides <- c("A"," A", "A "," A ")
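
One way to check the answer is to ask R for the unique values. A hedged sketch using the base R functions unique() and trimws(); n_unique and cleaned are hypothetical names.

```r
nucleotides <- c("A", " A", "A ", " A ")

# how many DIFFERENT values does R see?
n_unique <- length(unique(nucleotides))
n_unique

# only the first element is exactly equal to "A"
nucleotides == "A"

# trimws() strips spaces from the start and end of each element
cleaned <- trimws(nucleotides)
length(unique(cleaned))
```

R treats all four as different values because the spaces inside the quotes are part of the data; after trimming only one unique value remains.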

Vectors are the building blocks of 2 other data structures

MATRICES

DATAFRAMES

MATRICES can be built from vectors

a_matrix <- cbind(DNA, RNA)

a_matrix
##      DNA RNA
## [1,] "A" "A"
## [2,] "T" "U"
## [3,] "C" "C"
## [4,] "G" "G"
is(a_matrix)
## [1] "matrix"    "array"     "structure" "vector"

MATRICES can be built with cbind()

a_matrix <- cbind(DNA, RNA)

a_matrix <- cbind(DNA, 
                  RNA)
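
Bracket notation extends to matrices, with a row index and a column index separated by a comma. This is a brief sketch that goes slightly beyond the slides; dim() is a base R function.

```r
DNA <- c("A", "T", "C", "G")
RNA <- c("A", "U", "C", "G")
a_matrix <- cbind(DNA, RNA)

# number of rows, then number of columns
dim(a_matrix)

# row 2, column 2
a_matrix[2, 2]

# leave the row index blank to get a whole column, here by name
a_matrix[, "RNA"]
```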

DATAFRAMES can be built with data.frame()

a_df <- data.frame(DNA, RNA)

is(a_df)
## [1] "data.frame" "list"       "oldClass"   "vector"
a_df
##   DNA RNA
## 1   A   A
## 2   T   U
## 3   C   C
## 4   G   G

DATAFRAMES can be built with data.frame()

proportion <- c(0.25,0.25,0.25,0.25)
another_df <- data.frame(DNA, proportion)

another_df
##   DNA proportion
## 1   A       0.25
## 2   T       0.25
## 3   C       0.25
## 4   G       0.25

Columns in dataframes can be accessed by name using $ notation

another_df$DNA
## [1] "A" "T" "C" "G"
another_df$proportion
## [1] 0.25 0.25 0.25 0.25

Columns in dataframes act like vectors

another_df$proportion*100
## [1] 25 25 25 25
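
Because columns act like vectors, the result of a calculation can be stored as a new column with the same $ notation. A minimal sketch; the column name percent is hypothetical.

```r
DNA <- c("A", "T", "C", "G")
proportion <- c(0.25, 0.25, 0.25, 0.25)
another_df <- data.frame(DNA, proportion)

# assigning to a column name that doesn't exist yet creates the column
another_df$percent <- another_df$proportion * 100

another_df

# functions that work on vectors work on columns too
sum(another_df$proportion)
```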