In math, a single number is called a SCALAR

This terminology is often first introduced in linear / matrix algebra courses

# make a scalar
x <- 1

# shows its length
length(x)

## [1] 1

# look at it
x

## [1] 1

R does NOT use the terminology “scalar”

No object in R is called a scalar!
“scalars” mathematically are just VECTORS of LENGTH 1
…So R just calls them vectors

# make a scalar
x <- 1

is(x)

## [1] "numeric" "vector"
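A quick sketch of the same idea using base R checks (`is.vector()` and `identical()` are not used in the slides above, so treat this as a side note):

```r
# A single number is stored as a vector of length 1
x <- 1

is.vector(x)   # TRUE: x is a vector
length(x)      # 1: it has exactly one element
x[1]           # the first (and only) element

# c() with one value builds the same object
identical(x, c(1))
```

So anything that works on a vector also works on a "scalar" in R.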

Vectors are series of data that travel together

Vectors are made with the function
“c”, for concatenate (to stick together)

x <- c(21,51,86,112,118)

It’s tempting to call a vector a “list” of data…

but LISTS are an important class of data structure
we’ll talk about them later in the course
so, vector =/= LIST

Conceptually, a vector is like a COLUMN in a table or spreadsheet

##     x
## 1  21
## 2  51
## 3  86
## 4 112
## 5 118

A vector can also act like a ROW in a table / spreadsheet

This is the default way R prints a vector
Even though this is how it’s shown…
…it’s not the way to think about it (in most cases)

## [1]  21  51  86 112 118

Vectors are made with the `c()` function

packs  <- c( 3, 9,  9,11,  11, 8,10,14,13.5,16,
             13,13,11,12,  14,11,10,10,10,  11,
             10,11,11, 9,   8, 9)

EACH element within the vector is separated by a comma `,`

packs  <- c( 3, 9,  9,11, 11, 8,10,14,13.5,16,
             13,13,11,12, 14,11,10,10,10,  11,
             10,11,11, 9,  8, 9)

Long vectors are often split across lines

Spaces are ok too - I use them to line things up

packs  <- c( 3, 9,  9,11, 11, 8,10,14,13.5,16,
             13,13,11,12, 14,11,10,10,10,  11,
             10,11,11, 9, 8, 9)

Vectors can contain text

Text MUST be in quotes

wolf_names  <- c("white fang", "fluffy", "bingo","minnie", "percy")

Vectors can be named (almost) anything

# 1 letter
x <- c(21,51,86,112,118)

# 2 letters
zz <- c(21,51,86,112,118)

# 3 letters
dog <- c(21,51,86,112,118)

# lots of letters
dogsrbetterthancats <- c(21,51,86,112,118)

Periods “.” and UNDERSCORES “_” can be used in object names

(Dashes “-” CANNOT be used)

i.like.cats.too <- c(21,51,86,112,118)

but_i_like_dogs_better <- c(21,51,86,112,118)

at_least.cats.use_litter_boxes <- c(21,51,86,112,118)
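One way to check a name before using it, as a rough sketch: base R's `make.names()` rewrites an invalid name into a valid one, so a name that comes back unchanged is valid. The candidate names below are made up for illustration.

```r
# Candidate object names, stored as text so we can test them safely
candidates <- c("wolves_n", "wolves.n", "wolves-n", "2wolves")

# make.names() rewrites any invalid name into a valid one,
# so a name is valid if make.names() leaves it unchanged
is_valid <- make.names(candidates) == candidates
is_valid

# What R would turn the invalid names into
make.names(candidates[!is_valid])
```

The dash fails because R reads `-` as a minus sign, so `wolves-n` means "wolves minus n".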

Periods have a special use in Python so I try to use UNDERSCORES

But you will see both!

i.like.cats.too <- c(21,51,86,112,118)

but_i_like_dogs_better <- c(21,51,86,112,118)

at_least.cats.use_litter_boxes <- c(21,51,86,112,118)

A common standard is to use underscores and lowercase

easiest to type (no need to hit “shift”)
easiest to read (many upper case and lower case letters can look the same)
this is called “SNAKE CASE” 🐍

x <- c(21,51,86,112,118)

wolves_n <- c(21,51,86,112,118)

wolves_population_size <- c(21,51,86,112,118)

Some people like “CAMEL CASE” 🐫

mixes upper and lower case
I think it’s more confusing …
“Did I call that vector wolvesN, wolvesn, or Wolvesn?”…

wolvesN <- c(21,51,86,112,118)

wolvesn <- c(21,51,86,112,118)

Wolvesn <- c(21,51,86,112,118)

My general rules 🐍

use SNAKE CASE

# usually this
wolves_K <- 200
wolves_N <- c(21,51,86,112,118)

My general rules

sometimes I do this to save space

only use capital letters for key variable names that have a standard meaning in biology (N = population size, K = carrying capacity)
sometimes omit the underscore if using a capital letter

wolvesK <- 200
wolvesN <- c(21,51,86,112,118)

My general rules

occasionally this if I’m not thinking

wolves.K <- 200
wolves.N <- c(21,51,86,112,118)

Single ELEMENTS of a vector accessed using BRACKETS

aka BRACKET NOTATION

# names
wolf_names  <- c("white fang", "fluffy", "bingo","minnie", "percy")

# first name - uses 1 (not 0!)
wolf_names[1]

## [1] "white fang"

# second name - indexing starts at 1 (not 0!)
wolf_names[2]

## [1] "fluffy"

Multiple ELEMENTS of a vector are accessed using `:`

# first two
wolf_names[1:2]

## [1] "white fang" "fluffy"

# 2nd and 3rd
wolf_names[2:3]

## [1] "fluffy" "bingo"

This is called INDEXING

“The index for ‘fluffy’ is 2”

wolf_names[2]

## [1] "fluffy"

MUST have 2 values when using `:`

Both of these give ERRORS

wolf_names[1:]

wolf_names[:2]

Can call everything if you want

wolf_names[1:5]

## [1] "white fang" "fluffy"     "bingo"      "minnie"     "percy"

Can call everything BUT the 1st like this

wolf_names[2:5]

## [1] "fluffy" "bingo"  "minnie" "percy"
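Since `:` always needs both ends, a possible way to avoid typing the last index by hand is to ask `length()` for it. This is a sketch using base R only; `head()` and `tail()` are not covered in the slides above.

```r
wolf_names <- c("white fang", "fluffy", "bingo", "minnie", "percy")

# length() supplies the last index, so nothing is hard-coded
n <- length(wolf_names)

wolf_names[1:n]   # everything
wolf_names[2:n]   # everything but the 1st

# head() and tail() give the first and last few elements
head(wolf_names, 2)   # first two
tail(wolf_names, 2)   # last two
```

This keeps working if wolves are added to or removed from the vector.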

First real programming trick

ready for it?

Use NEGATIVE INDEXING to drop elements

Drop just 1st

wolf_names[-1]

## [1] "fluffy" "bingo"  "minnie" "percy"

Drop just 2nd

wolf_names[-2]

## [1] "white fang" "bingo"      "minnie"     "percy"
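Negative indexing also works with more than one index. A small sketch, assuming the same `wolf_names` vector as above:

```r
wolf_names <- c("white fang", "fluffy", "bingo", "minnie", "percy")

# Drop the 1st AND 2nd elements with a negated vector of indices
dropped_two <- wolf_names[-c(1, 2)]
dropped_two

# Drop the last element, whatever the length is
dropped_last <- wolf_names[-length(wolf_names)]
dropped_last

# Negative indexing returns a copy; the original is unchanged
length(wolf_names)
```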

Next real programming trick - can “pass” vectors of indices to vectors

what does that mean?

A vector

wolf_names  <- c("white fang", "fluffy", "bingo","minnie", "percy")

a VECTOR ELEMENT via an index value

wolf_names[1]

## [1] "white fang"

Next real programming trick - can “pass” vectors of indices to vectors

2 vector elements via index values

wolf_names[1:2]

## [1] "white fang" "fluffy"

Next real programming trick - can “pass” vectors of indices to vectors

a vector of indices

i <- c(1,2)

2 vector elements via a VECTOR of INDICES

wolf_names[i]

## [1] "white fang" "fluffy"
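Why bother with a vector of indices when `1:2` does the same thing? A sketch of what `:` cannot do on its own (the particular index values are picked just for illustration):

```r
wolf_names <- c("white fang", "fluffy", "bingo", "minnie", "percy")

# Indices do not need to be next to each other
i <- c(1, 3, 5)
every_other <- wolf_names[i]
every_other

# Indices can be in any order
reordered <- wolf_names[c(5, 1)]
reordered

# An index can be repeated
repeated <- wolf_names[c(2, 2)]
repeated
```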

I can make a vector of numbers 2 ways

I will usually use the first for clarity

i1 <- c(1, 2, 3)

i2 <- c(1:3)

How do I test for equality of the two vectors? A first check: are they the same length?

length(i1) == length(i2)

## [1] TRUE
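Equal lengths alone do not prove the contents match. A sketch of some stricter checks with base R (`all()` and `identical()` are additions, not part of the slides above):

```r
i1 <- c(1, 2, 3)
i2 <- c(1:3)

# Compare element by element, then check that every comparison is TRUE
i1 == i2
all(i1 == i2)

# identical() is stricter: it also compares how the numbers are stored
class(i1)          # "numeric"
class(i2)          # "integer"
identical(i1, i2)  # FALSE, because the storage types differ
```

The values are the same, but `:` builds integers while `c(1, 2, 3)` builds ordinary numerics, so `identical()` says no.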

Operations on vectors can be VECTORIZED

Some vectors

DNA <- c("A","T","C","G")

RNA <- c("A","U","C","G")

Their length

length(DNA)

## [1] 4

length(RNA)

## [1] 4

Operations on vectors can be VECTORIZED

The equality of their lengths

length(DNA) == length(RNA)

## [1] TRUE

Access 1st elements

DNA[1]

## [1] "A"

RNA[1]

## [1] "A"

Operations on vectors can be VECTORIZED

Equality of first elements

DNA[1] == RNA[1]

## [1] TRUE

Access 2nd elements

DNA[2]

## [1] "T"

RNA[2]

## [1] "U"

Operations on vectors can be VECTORIZED

IN-Equality of 2nd elements

DNA[2] == RNA[2]

## [1] FALSE

Operations on vectors can be VECTORIZED

Assess equality of ALL elements

the function == has been applied to each pair of elements
this is a VECTORIZED operation

DNA == RNA

## [1]  TRUE FALSE  TRUE  TRUE
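The TRUE/FALSE vector that `==` returns can be used for more than looking at. A short sketch with base R helpers (`sum()`, `all()` and `which()` are additions here):

```r
DNA <- c("A", "T", "C", "G")
RNA <- c("A", "U", "C", "G")

same <- DNA == RNA

sum(same)      # number of matching positions (TRUE counts as 1)
all(same)      # are ALL positions the same?
which(!same)   # index of each position that differs
DNA[!same]     # the DNA base at those positions
```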

VECTORIZED operations are common in R

taking the log of something is very common in math, stats, ML, bio…

natural log = `log()` in R

log base 10 = `log10()`

log base 2 = `log2()`

log(10)

## [1] 2.302585

log10(10)

## [1] 1

log2(10)

## [1] 3.321928
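The three log functions are related. A sketch showing that `log()` accepts a `base` argument and that logs can be undone (the input numbers are arbitrary):

```r
# log() takes an optional base argument
log_base2  <- log(8, base = 2)      # same as log2(8)
log_base10 <- log(100, base = 10)   # same as log10(100)

# exp() undoes the natural log
back_from_ln <- exp(log(10))

# Raising the base to the log gets the original number back
back_from_log10 <- 10^log10(1000)
back_from_log2  <- 2^log2(8)

c(log_base2, log_base10, back_from_ln, back_from_log10, back_from_log2)
```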

Functions can be applied to entire vectors

VECTORIZED operations are common in R

log(wolves)

##  [1] 3.044522 3.931826 4.454347 4.718499 4.770685 4.779123 4.882802 4.997212
##  [9] 5.159055 5.141664 4.770685 4.912655 5.141664 4.820282 4.564348 4.574711
## [17] 4.584967 4.418841 4.553877 4.644391 4.584967 4.682131 4.574711 4.382027
## [25] 4.543295 4.812184

Math can be done on entire vectors

The average wolf is 95 pounds

Wolf BIOMASS each year

wolves*95

##  [1]  1995  4845  8170 10640 11210 11305 12540 14060 16530 16245 11210 12920
## [13] 16245 11780  9120  9215  9310  7885  9025  9880  9310 10260  9215  7600
## [25]  8930 11685

Math can be done on entire vectors

Yellowstone NP is 3500 square miles

Wolves per square mile

wolves / 3500

##  [1] 0.00600000 0.01457143 0.02457143 0.03200000 0.03371429 0.03400000
##  [7] 0.03771429 0.04228571 0.04971429 0.04885714 0.03371429 0.03885714
## [13] 0.04885714 0.03542857 0.02742857 0.02771429 0.02800000 0.02371429
## [19] 0.02714286 0.02971429 0.02800000 0.03085714 0.02771429 0.02285714
## [25] 0.02685714 0.03514286

Math can be done using variables

Make variables with constants

wolf_weight <- 95
YNP_size <- 3500

Math can be done using variables

Do math using variables

wolves/wolf_weight

##  [1] 0.2210526 0.5368421 0.9052632 1.1789474 1.2421053 1.2526316 1.3894737
##  [8] 1.5578947 1.8315789 1.8000000 1.2421053 1.4315789 1.8000000 1.3052632
## [15] 1.0105263 1.0210526 1.0315789 0.8736842 1.0000000 1.0947368 1.0315789
## [22] 1.1368421 1.0210526 0.8421053 0.9894737 1.2947368

wolves/YNP_size

##  [1] 0.00600000 0.01457143 0.02457143 0.03200000 0.03371429 0.03400000
##  [7] 0.03771429 0.04228571 0.04971429 0.04885714 0.03371429 0.03885714
## [13] 0.04885714 0.03542857 0.02742857 0.02771429 0.02800000 0.02371429
## [19] 0.02714286 0.02971429 0.02800000 0.03085714 0.02771429 0.02285714
## [25] 0.02685714 0.03514286
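Variables and vectors can be chained in one line. A sketch using only the first five wolf counts shown earlier; `biomass` and `biomass_density` are names made up for this example.

```r
# First five yearly wolf counts (the full vector has 26 values)
wolves <- c(21, 51, 86, 112, 118)

# Constants stored as variables
wolf_weight <- 95     # pounds per wolf
YNP_size    <- 3500   # square miles

# Biomass per year, in pounds
biomass <- wolves * wolf_weight
biomass

# Pounds of wolf per square mile: two steps chained together
biomass_density <- wolves * wolf_weight / YNP_size
biomass_density
```

If the park size or average weight changes, only the one variable needs editing.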

When two vectors are used in an operation their elements are compared PAIRWISE

Setup

Vectors of wolves and number of packs at same time

wolves[1:5]

## [1]  21  51  86 112 118

packs[1:5]

## [1]  3  9  9 11 11

year[1:5]

## [1] 1995 1996 1997 1998 1999

When 2 vectors are used in an operation their elements are compared PAIRWISE

Number of wolves and packs year 1

wolves[1]

## [1] 21

packs[1]

## [1] 3

Wolves per pack

21/3

## [1] 7

When two vectors are used in an operation their elements are compared PAIRWISE

Wolves per pack via index values

math jargon: dividing one SCALAR by another SCALAR

wolves[1]/packs[1]

## [1] 7

When two vectors are used in an operation their elements are compared PAIRWISE

Wolves per pack for ALL YEARS

A VECTORIZED OPERATION

wolves/packs

##  [1]  7.000000  5.666667  9.555556 10.181818 10.727273 14.875000 13.200000
##  [8] 10.571429 12.888889 10.687500  9.076923 10.461538 15.545455 10.333333
## [15]  6.857143  8.818182  9.800000  8.300000  9.500000  9.454545  9.800000
## [22]  9.818182  8.818182  8.888889 11.750000 13.666667

When two vectors are used in an operation their elements are compared PAIRWISE

Wolves per pack 1st 2 years

Using `:`

wolves[1:2]/packs[1:2]

## [1] 7.000000 5.666667

When two vectors are used in an operation their elements are compared PAIRWISE

Using indices in a vector via `c()`

i <- c(1,2)
wolves[i]

## [1] 21 51

When two vectors are used in an operation their elements are compared PAIRWISE

Using raw index

wolves[c(1,2)]/packs[c(1,2)]

## [1] 7.000000 5.666667

When two vectors are used in an operation their elements are compared PAIRWISE

Wolves per pack ignoring just first year

wolves[-1]/packs[-1]

##  [1]  5.666667  9.555556 10.181818 10.727273 14.875000 13.200000 10.571429
##  [8] 12.888889 10.687500  9.076923 10.461538 15.545455 10.333333  6.857143
## [15]  8.818182  9.800000  8.300000  9.500000  9.454545  9.800000  9.818182
## [22]  8.818182  8.888889 11.750000 13.666667
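Pairwise math assumes the two vectors line up. A sketch of what happens when they do not: R quietly reuses ("recycles") the shorter vector, which is also why dividing by a single number works. The numbers in the first two lines are arbitrary.

```r
# Same length: elements are paired up one to one
paired <- c(10, 20, 30, 40) / c(1, 2, 5, 10)
paired

# Shorter vector is recycled: 1, 2, 1, 2
recycled <- c(10, 20, 30, 40) / c(1, 2)
recycled

# A safety check before doing pairwise math on two vectors
wolves <- c(21, 51, 86, 112, 118)
packs  <- c(3, 9, 9, 11, 11)
stopifnot(length(wolves) == length(packs))
wolves_per_pack <- wolves / packs
wolves_per_pack
```

Recycling is handy with a single number but easy to trigger by accident, so a length check is cheap insurance.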

Vectors have 3 key features: LENGTH, STRUCTURE, and TYPE/CLASS

length: `length()`
data structure: `is()`
type of content: `class()`

length(wolves_N)

## [1] 5

is(wolves_N)

## [1] "numeric" "vector"

class(wolves_N)

## [1] "numeric"
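A few related base R checks, sketched here as an aside (the `is.*()` functions and `str()` are additions to the three functions above):

```r
wolves_N <- c(21, 51, 86, 112, 118)

length(wolves_N)      # how many elements
class(wolves_N)       # type of content
is.vector(wolves_N)   # TRUE/FALSE check of the structure
is.numeric(wolves_N)  # TRUE/FALSE check of the type

# str() gives a compact one-line summary of all three
str(wolves_N)
```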

Different kinds of data have different CLASSES in R

The 2 most important classes for beginners:

numeric = numbers
character = letters, words, sentences

CHARACTER data is REALLY important for computational biology

Character data MUST be enclosed in quotes

# make a character vector
x <- c("a", "t", "c", "g")

CHARACTER data is REALLY important for computational biology

# look at the vector
x

## [1] "a" "t" "c" "g"

# check its class and structure
class(x)

## [1] "character"

is(x)

## [1] "character"           "vector"              "data.frameRowLabels"
## [4] "SuperClassMethod"

Character data MUST be enclosed in quotation marks or similar marks

Quotation marks “…”
Apostrophes ‘…’
(In some very specific contexts: backticks `...`)
Missing quotation marks are VERY common typos - if you get an error, check for missing “…”

# quotes
x <- c("a", "t", "c", "g")

Character data MUST be enclosed in quotation marks or similar marks

# apostrophes
y <- c('a', 't', 'c', 'g')

y

## [1] "a" "t" "c" "g"

# mix
z <- c('a', "t", 'c', "g")

z

## [1] "a" "t" "c" "g"
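Does the choice of mark matter? A sketch suggesting it does not, plus one case where having both kinds is useful (the `pack_note` text is made up):

```r
x <- c("a", "t", "c", "g")   # quotes
y <- c('a', 't', 'c', 'g')   # apostrophes

# R stores both the same way
identical(x, y)

# One kind of mark can go inside the other
pack_note <- c("the pack's den", 'the "alpha" wolf')
pack_note

# nchar() counts the characters in each element
nchar(x)
```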

NUMERIC data includes INTEGERS AND numbers with decimal points

integers are not treated specially in R (by default)

wolves_N <- c(21,51,86,112,118)
class(wolves_N)

## [1] "numeric"

NUMERIC data includes INTEGERS AND numbers with decimal points

integers are not treated specially in R (by default)

wolves_N <- c(21.0, 51.0, 86.0, 112.0, 118.0)
class(wolves_N)

## [1] "numeric"

NUMERIC data includes INTEGERS AND numbers with decimal points

integers are not treated specially in R (by default)

wolf_mass <- c(22.28, 24.99, 19.45, 22.6)
class(wolf_mass)

## [1] "numeric"
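R does have a separate integer class; you just have to ask for it. A sketch of the two common ways one shows up (the `L` suffix and `:`), neither of which appears in the examples above:

```r
# Typed numbers are stored as numeric, with or without a decimal point
class(21)
class(21.5)

# An L suffix asks for an integer explicitly
class(21L)

# The : operator also produces integers
class(1:3)

# Integers still count as numbers, so math works the same
is.numeric(21L)
half <- 21L / 2
half
```

For most beginner work the difference can be ignored, which is the point of the slides above.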

Most programming languages DO care about integers versus “floating point” numbers

INTEGERS: -100, -89, -9, 0, 1, 10, 1001, 100000123
FLOATING POINT: -100.1, 0.89, 8.99

Vectors can only contain ONE type of data

x <- c(1, 2, 3)
x

## [1] 1 2 3

y <- c("a", "b", "c")
y

## [1] "a" "b" "c"

class(x)

## [1] "numeric"

class(y)

## [1] "character"

Vectors can only contain ONE type of data

z <- c(1, "c", 3)

class(z)

## [1] "character"

z

## [1] "1" "c" "3"

Computers think about numbers 2 ways - as things to do math with, and as things to print that are read by humans

x <- c(1 , 2, 3)

y <- c("1", "2", "3")

class(x) == class(y)

## [1] FALSE

When presented with two different types of data in a single vector, R COERCES all of it to character data

General concept: COERCION: conversion between data types AND/OR structures
verbs: COERCE, COERCED, COERCES

z <- c(1, "c", 3)
z

## [1] "1" "c" "3"

class(z)

## [1] "character"
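Coercion can also be done on purpose. A sketch using the base R functions `as.numeric()` and `as.character()`, which are not shown in the slides above:

```r
# Numbers stored as text
y <- c("1", "2", "3")
class(y)

# Math on text fails, so coerce to numeric first
y_num <- as.numeric(y)
class(y_num)
y_num * 2

# Going the other way
as.character(c(1, 2, 3))

# Text that is not a number becomes NA
# (suppressWarnings() hides the warning R would otherwise print)
z_num <- suppressWarnings(as.numeric(c(1, "c", 3)))
z_num
```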

When making vectors, R isn’t very concerned with spacing

# no spaces is ok
nucleotides <- c("A","T","C","G")

# one space after comma is standard
nucleotides <- c("A", "T", "C", "G")

When making vectors, R isn’t very concerned with spacing

# can add as many spaces as you want!
nucleotides <- c( "A" ,  "T" ,  "C" , "G"                 )

# even more spaces still works
nucleotides <- c(        "A" ,    "T" ,  "C" , "G"                 )

When making vectors, R isn’t very concerned with LINE BREAKS

# line breaks after commas are ok
nucleotides <- c("A",
                 "T",
                 "C",
                 "G")

When making vectors, R isn’t very concerned with LINE BREAKS

# blank lines between elements are ok too
nucleotides <- c("A",
                 
                 "T",
                 
                 "C",
                 
                 "G")

When making CHARACTER vectors, R IS concerned with spaces between the quotes

How many different things are in this vector?

nucleotides <- c("A"," A", "A "," A ")
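One way to let R answer that question, sketched with base R functions that are not in the slides above (`unique()`, `nchar()`, `trimws()`):

```r
nucleotides <- c("A", " A", "A ", " A ")

# R sees four different values because the spaces count
unique(nucleotides)
n_before <- length(unique(nucleotides))
n_before

# nchar() reveals the hidden spaces
nchar(nucleotides)

# trimws() strips leading and trailing spaces
cleaned <- trimws(nucleotides)
n_after <- length(unique(cleaned))
n_after
```

Stray spaces are a common problem in data typed by hand or copied from spreadsheets.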

Vectors are the building blocks of 2 other data structures

MATRICES

plural of MATRIX

DATAFRAMES

Analogous to spreadsheets

MATRICES can be built from vectors

matrices can hold numeric OR character data
BUT ONLY ONE TYPE AT A TIME

a_matrix <- cbind(DNA, RNA)

a_matrix

##      DNA RNA
## [1,] "A" "A"
## [2,] "T" "U"
## [3,] "C" "C"
## [4,] "G" "G"

is(a_matrix)

## [1] "matrix"    "array"     "structure" "vector"
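A sketch of how to look inside a matrix and what happens when types are mixed. The `[row, column]` indexing and the `mixed` matrix are additions for illustration.

```r
DNA <- c("A", "T", "C", "G")
RNA <- c("A", "U", "C", "G")

a_matrix <- cbind(DNA, RNA)

dim(a_matrix)        # rows, then columns

# Elements are accessed with [row, column]
a_matrix[2, "RNA"]   # row 2 of the RNA column
a_matrix[, "DNA"]    # the whole DNA column

# Mixing types coerces everything to character, as with vectors
mixed <- cbind(DNA, proportion = c(0.25, 0.25, 0.25, 0.25))
class(mixed[1, "proportion"])
```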

MATRICES can be built with `cbind()`

a_matrix <- cbind(DNA, RNA)

a_matrix <- cbind(DNA, 
                  RNA)

DATAFRAMES can be built with `data.frame()`

matrices can hold numeric OR character data
dataframes can hold both at the same time

a_df <- data.frame(DNA, RNA)

is(a_df)

## [1] "data.frame" "list"       "oldClass"   "vector"

a_df

##   DNA RNA
## 1   A   A
## 2   T   U
## 3   C   C
## 4   G   G

DATAFRAMES can be built with `data.frame()`

matrices can hold numeric OR character data
dataframes can hold both at the same time

proportion <- c(0.25,0.25,0.25,0.25)
another_df <- data.frame(DNA, proportion)

another_df

##   DNA proportion
## 1   A       0.25
## 2   T       0.25
## 3   C       0.25
## 4   G       0.25

Columns in dataframes can be accessed by name using `$` notation

another_df$DNA

## [1] "A" "T" "C" "G"

another_df$proportion

## [1] 0.25 0.25 0.25 0.25

Columns in dataframes act like vectors

another_df$proportion*100

## [1] 25 25 25 25
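Because columns act like vectors, the result of column math can be stored back in the dataframe. A sketch assuming R 4.0 or later (where text columns stay as character by default); the `percent` column name is made up.

```r
DNA <- c("A", "T", "C", "G")
proportion <- c(0.25, 0.25, 0.25, 0.25)
another_df <- data.frame(DNA, proportion)

# Each column keeps its own class
class(another_df$DNA)
class(another_df$proportion)

# Size of the table
nrow(another_df)
ncol(another_df)

# The result of column math can be stored as a new column
another_df$percent <- another_df$proportion * 100
another_df
```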
