Lists

Lists are a fundamental data structure in R that empower you to handle diverse and complex data seamlessly

Unlike vectors or matrices, lists can store elements of different types, making them versatile containers for various data structures


Creating a List

In R, a list is created using the list() function

Enclose the desired elements within the parentheses

Named elements in a list provide clarity and accessibility, offering an organized way to reference specific information within the list

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)

Attributes of a List

attributes() -> the attributes of the list, such as the names of its elements

length() -> how many elements are in the list

class() -> the class of the list

str() -> how the elements of the list are organized
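
As a quick illustration, these functions can be tried on a small list. This is a minimal sketch: small.list is a made-up example rather than part of the lecture, and str() is base R's function for displaying the structure of an object.

```r
# Hypothetical two-element list used only for illustration
small.list <- list(name = "glucokinase", symbol = c("GCK", "GLK"))

attributes(small.list)  # the attributes of the list; here only the element names
length(small.list)      # number of top-level elements
class(small.list)       # the class of the object
str(small.list)         # compact display of how the elements are organized
```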


Access Elements in Lists

By Name:

Elements in a list can be accessed by their assigned names using the $ operator

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list$sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"


By Index:

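Elements can also be accessed by position using square brackets. Single square brackets, as in the example here, return a sublist containing the selected element; double square brackets return the element itself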

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list[1]
## $Gene.id
## [1] 2645


Extracting Multiple Elements:

To extract multiple elements, use a vector of indices or names

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list[c("Gene.id","symbol")]
## $Gene.id
## [1] 2645
## 
## $symbol
## [1] "GCK" "GLK" "HK4" "GK"


Using Double Square Brackets:

Single square brackets return a sublist, which is still a list. Use double square brackets to access the content of an element of the list; this returns the element itself.

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list[4] #Returns a sublist containing the vector
## $symbol
## [1] "GCK" "GLK" "HK4" "GK"
class(gene.list[4]) #Note the class is still list
## [1] "list"
gene.list[[4]] #Returns the value of the vector
## [1] "GCK" "GLK" "HK4" "GK"
class(gene.list[[4]]) #Note the class is the class of the vector
## [1] "character"
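
Because double square brackets return the element itself, the result can be indexed again straight away. A minimal sketch, assuming a made-up list called small.list:

```r
# Hypothetical list used only for illustration
small.list <- list(id = 2645, symbol = c("GCK", "GLK", "HK4", "GK"))

# Double brackets pull out the character vector, then [2] picks its second value
second.symbol <- small.list[["symbol"]][2]
second.symbol

# The $ operator also returns the element itself, so this is equivalent
same.symbol <- small.list$symbol[2]
```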

Add Elements to Lists

You can add elements into an existing list

Using index number/square brackets:

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)

#Adding an element with an index number/square brackets
gene.list[8]<-11 #Create the element
names(gene.list)[8] <- "exons" #name the element
gene.list
## $Gene.id
## [1] 2645
## 
## $name
## [1] "glucokinase"
## 
## $length
## [1] 46227
## 
## $symbol
## [1] "GCK" "GLK" "HK4" "GK" 
## 
## $organism
## [1] "human"
## 
## $prot.coding
## [1] TRUE
## 
## $sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"
## 
## $exons
## [1] 11


Using $ symbol:

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list$exon <- 11
gene.list
## $Gene.id
## [1] 2645
## 
## $name
## [1] "glucokinase"
## 
## $length
## [1] 46227
## 
## $symbol
## [1] "GCK" "GLK" "HK4" "GK" 
## 
## $organism
## [1] "human"
## 
## $prot.coding
## [1] TRUE
## 
## $sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"
## 
## $exon
## [1] 11

Remove Elements from Lists

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA",
  exon = 11
)

gene.list
## $Gene.id
## [1] 2645
## 
## $name
## [1] "glucokinase"
## 
## $length
## [1] 46227
## 
## $symbol
## [1] "GCK" "GLK" "HK4" "GK" 
## 
## $organism
## [1] "human"
## 
## $prot.coding
## [1] TRUE
## 
## $sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"
## 
## $exon
## [1] 11
gene.list$exon <- NULL
gene.list
## $Gene.id
## [1] 2645
## 
## $name
## [1] "glucokinase"
## 
## $length
## [1] 46227
## 
## $symbol
## [1] "GCK" "GLK" "HK4" "GK" 
## 
## $organism
## [1] "human"
## 
## $prot.coding
## [1] TRUE
## 
## $sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"

The lapply() and sapply()

lapply() applies a function to every element of a list and returns the results as a list, much like apply() does for the rows or columns of a matrix

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
lapply(gene.list,length) #will return the length of each element
## $Gene.id
## [1] 1
## 
## $name
## [1] 1
## 
## $length
## [1] 1
## 
## $symbol
## [1] 4
## 
## $organism
## [1] 1
## 
## $prot.coding
## [1] 1
## 
## $sequence
## [1] 1


sapply() does the same, but tries to simplify the result into a vector:

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
sapply(gene.list, length)
##     Gene.id        name      length      symbol    organism prot.coding 
##           1           1           1           4           1           1 
##    sequence 
##           1
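
The simplification only happens when every result has the same length. The sketch below, which uses a made-up list called small.list, shows one call that simplifies to a vector and one that falls back to a list.

```r
# Hypothetical list used only for illustration
small.list <- list(id = 2645, symbol = c("GCK", "GLK", "HK4", "GK"), coding = TRUE)

# Every element has exactly one class, so the result simplifies to a named vector
classes <- sapply(small.list, class)
classes

# rev() returns results of different lengths (1, 4, 1), so sapply() returns a list
reversed <- sapply(small.list, rev)
is.list(reversed)
```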

unlist()

The unlist() function converts a list into a vector

The resulting vector contains all the elements of gene.list. Because a vector can hold only one type, mixed elements are coerced to a common type (here, character)

Keep in mind that if the list has nested lists or multiple levels of hierarchy, unlist() will concatenate the elements recursively

It’s important to note that using unlist() may lose the structural information present in nested lists, and the resulting vector may not fully represent the original list’s organization

Use this function carefully, based on your specific data manipulation needs

gene.list <- list(
  Gene.id = 2645,
  name = "glucokinase",
  length = 46227,
  symbol = c("GCK", "GLK", "HK4", "GK"),
  organism = "human",
  prot.coding = TRUE,
  sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
unlist(gene.list)
##                      Gene.id                         name 
##                       "2645"                "glucokinase" 
##                       length                      symbol1 
##                      "46227"                        "GCK" 
##                      symbol2                      symbol3 
##                        "GLK"                        "HK4" 
##                      symbol4                     organism 
##                         "GK"                      "human" 
##                  prot.coding                     sequence 
##                       "TRUE" "ACTCCACACCTGGCTGGAGCAGGAAA"
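
The effect on a nested list can be seen in a small sketch. The list nested below is a made-up example: after unlist(), the inner list survives only as a prefix in the names, and the number is coerced to character.

```r
# Hypothetical nested list used only for illustration
nested <- list(gene = "GCK", location = list(chromosome = 7, strand = "-"))

flat <- unlist(nested)
flat
names(flat)  # the nesting is flattened into names joined with a period
class(flat)  # everything has been coerced to a single type
```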

Use Lists

A common usage of lists is to combine multiple values into a single object

For example, the result of the hist() function is a list

data(Nile)
Nile[1:10]
##  [1] 1120 1160  963 1210 1160 1160  813 1230 1370 1140
ret <- hist(Nile)

(ret)
## $breaks
##  [1]  400  500  600  700  800  900 1000 1100 1200 1300 1400
## 
## $counts
##  [1]  1  0  5 20 25 19 12 11  6  1
## 
## $density
##  [1] 0.0001 0.0000 0.0005 0.0020 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001
## 
## $mids
##  [1]  450  550  650  750  850  950 1050 1150 1250 1350
## 
## $xname
## [1] "Nile"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
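
Since the result is a list, its components can be pulled out with $ like any other list element. A short sketch, using plot = FALSE so that hist() only computes the values without drawing:

```r
ret <- hist(Nile, plot = FALSE)  # compute the histogram without plotting it

names(ret)      # the components stored in the list
ret$counts      # number of observations in each bin

# Midpoint of the bin that holds the most observations
fullest.bin <- ret$mids[which.max(ret$counts)]
fullest.bin
```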

Factors and Dataframes


Dataframes

Dataframes are a fundamental data structure in the R programming language, serving as a powerful tool for data manipulation and analysis

A dataframe is essentially a two-dimensional table with rows and columns, resembling a spreadsheet or a database table

In R, dataframes provide an organized way to store, manage, and analyze structured data

The columns in a dataframe may have different classes, such as numeric, character, or logical, but all values within a single column must be of the same type

Vectors and factors within a data frame must have the same length
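
These two restrictions can be checked with a small sketch. The data frames ok and bad are made-up examples; tryCatch() is used only so the error message can be captured instead of stopping the script.

```r
# Columns of different classes but equal length are accepted
ok <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Passed = c(TRUE, FALSE))
sapply(ok, class)

# Columns of length 2 and 3 cannot be lined up, so data.frame() raises an error
bad <- tryCatch(
  data.frame(Name = c("Alice", "Bob"), Age = c(25, 30, 22)),
  error = function(e) conditionMessage(e)
)
bad  # the captured error message
```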


Dataframes vs Lists


Output of Dataframe vs output of List

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25,30,22),
  Score = c(95,89,75)
)
df
my.list <- list(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25,30,22),
  Score = c(95,89,75)
)
my.list
## $Name
## [1] "Alice"   "Bob"     "Charlie"
## 
## $Age
## [1] 25 30 22
## 
## $Score
## [1] 95 89 75

Creating Dataframes

Dataframes are created with the data.frame() function

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25,30,22),
  Score = c(95,89,75)
)

df


Using the Read function

The read.table() function reads a table from a file in the working directory. I don’t have this table, so I am going to type out the code as if I did.

Note: The table must be in your working directory


To Read from a Text Document

car_data <- read.table("mtcars_table.txt")
View(car_data)


To Read From a .csv File

car_data_csv <- read.csv("mtcars_table.csv")

By default, the rows read from a csv are numbered. To use the first column of the file as the row names instead, set row.names = 1.

car_data_csv <- read.csv("mtcars_table.csv", header = TRUE, row.names = 1)

View(car_data_csv)
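
The effect of row.names = 1 can be tried without the lecture's file. The sketch below is an assumption-laden stand-in: it writes the first three rows of the built-in mtcars data to a temporary csv file and reads that file back both ways.

```r
# Write a small table to a temporary file so there is something to read
tmp <- tempfile(fileext = ".csv")
write.csv(head(datasets::mtcars, 3), tmp)  # row names are written as the first column

no.names   <- read.csv(tmp)                                # rows are numbered
with.names <- read.csv(tmp, header = TRUE, row.names = 1)  # first column becomes row names

rownames(no.names)
rownames(with.names)

unlink(tmp)  # remove the temporary file
```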


To Read From an Excel File:

This requires the readxl package

library(readxl)
read_excel("")


General Attributes of Dataframes

There are several functions we can use to access the attributes of a dataframe

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25,30,22),
  Score = c(95,89,75)
)
df
dim(df) #Dimensions
## [1] 3 3
class(df) #Class of the dataframe
## [1] "data.frame"
str(df) #Structure of the dataframe
## 'data.frame':    3 obs. of  3 variables:
##  $ Name : chr  "Alice" "Bob" "Charlie"
##  $ Age  : num  25 30 22
##  $ Score: num  95 89 75
nrow(df) #Number of rows in the dataframe
## [1] 3
ncol(df) #number of columns in the dataframe
## [1] 3
names(df) #the names in the dataframe
## [1] "Name"  "Age"   "Score"
colnames(df) #The names of the columns in the dataframe
## [1] "Name"  "Age"   "Score"
rownames(df) #The names of the rows in the dataframe
## [1] "1" "2" "3"
summary(df) #basic summary of all columns
##      Name                Age            Score      
##  Length:3           Min.   :22.00   Min.   :75.00  
##  Class :character   1st Qu.:23.50   1st Qu.:82.00  
##  Mode  :character   Median :25.00   Median :89.00  
##                     Mean   :25.67   Mean   :86.33  
##                     3rd Qu.:27.50   3rd Qu.:92.00  
##                     Max.   :30.00   Max.   :95.00

Coerce in Dataframes

You can coerce an object into a data frame as long as its components conform to the restrictions above, for example a list whose elements all have the same length

#Create a simple list
my_list <- list(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25,30,22),
  Score = c(95,89,75)
)
my_list
## $Name
## [1] "Alice"   "Bob"     "Charlie"
## 
## $Age
## [1] 25 30 22
## 
## $Score
## [1] 95 89 75
#Coerce the List into a Dataframe
my_dataframe <- data.frame(my_list)
my_dataframe

Access and Subsetting in Dataframes

Like lists, the elements in a data frame can be accessed by using $ or by using numeric indices

Note: The data structure is [rows,columns]


Accessing specific Columns:

mtcars
mtcars$mpg #Access one column
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars[,c("mpg","disp","hp")] #Access multiple columns using names
mtcars[,1:5] #Access multiple columns using index


Accessing specific Rows:

mtcars[10,] #Access one row using index
mtcars[3:5,] #Access multiple rows using index


Accessing specific Rows and Columns:

mtcars[1:5, 1:10]


Accessing multiple Rows, but only one Column:

mtcars[1:4,2] #Notice that this returns a vector
## [1] 6 6 4 6
mtcars[1:4,2, drop = FALSE] #you have to set drop to false to maintain the dataframe

Adding/Removing

Rows can be added to a dataframe either by manually inputting the data or by using the rbind() function

Columns can be added using $ symbol

new_car <- c(mpg = 28, cyl = 4, disp = 120, hp = 110, drat = 3.9, wt = 2.8,
             qsec = 16.5, vs = 0, am = 1, gear = 4, carb = 2)
new_car
##   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
##  28.0   4.0 120.0 110.0   3.9   2.8  16.5   0.0   1.0   4.0   2.0
mtcars #before bound to new_car
mtcars <- rbind(mtcars,new_car) #after bound to new_car
mtcars
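
Adding a column with $ works the same way as adding an element to a list. A minimal sketch, assuming a small copy of the built-in data called small_cars and a made-up column name hp_per_wt:

```r
# Work on a small copy so the built-in mtcars data is left untouched
small_cars <- head(datasets::mtcars, 3)

# Hypothetical new column: horsepower divided by weight
small_cars$hp_per_wt <- small_cars$hp / small_cars$wt

names(small_cars)  # the new column is appended at the end
```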

In the lecture video, they go on to name this new value, but this is covered later in the lecture, so I am not writing it down a second time.


Removing the data can be done using the numbers of the rows and columns.


Removing Rows

mtcars[-1,] #Returns mtcars without the first row (mtcars itself is unchanged)
mtcars_new <- mtcars[-c(1:5),] #Drop the first five rows and store the remaining rows in mtcars_new
mtcars_new


Removing Columns

mtcars[,-4] #Delete the 4th column


Removing Rows and Columns

mtcars[-5,-6] #Remove the 5th row and the 6th column

Filtering criteria can be applied when the rows to keep or remove are defined by a condition rather than by position


Filtering

Using Conditional Operators:

mtcars[mtcars$mpg>20,] #dataframe where the value of mpg is greater than 20
mtcars[mtcars$am==1 & mtcars$mpg>20,] #dataframe where the value of am = 1 and the value of mpg>20


Using subset() function

subset(mtcars, am==0 & mpg>20) #dataframe where am = 0 and mpg is greater than 20

NA Values

After subsetting or filtering dataframes, we may perform a subsequent analysis. If there are NA values in the input vector, we might want to tell R to ignore them

c(1,4,6,NA,9)
## [1]  1  4  6 NA  9
mean(c(1,4,6,NA,9))
## [1] NA
mean(c(1,4,6,NA,9), na.rm = TRUE) #using na.rm
## [1] 5

In a dataframe, you can use na.omit() to drop every row that contains an NA
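
A short sketch of na.omit() on a dataframe, using a made-up table called scores:

```r
# Hypothetical data with one missing score
scores <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Score = c(95, NA, 75)
)

complete <- na.omit(scores)  # drops the row that contains the NA
complete

# na.rm = TRUE gives the same mean without removing the row from the data
mean(scores$Score, na.rm = TRUE)
```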


Names

You can set both the row and column names of matrices using colnames(), rownames(), or dimnames():

x <- c(1, 2, "TRUE", "FALSE")
dim(x) <- c(2,2)
x
##      [,1] [,2]   
## [1,] "1"  "TRUE" 
## [2,] "2"  "FALSE"
colnames(x) <- c("col1","col2") #Naming with colnames
rownames(x) <- c("row1","row2") #Naming with rownames
x
##      col1 col2   
## row1 "1"  "TRUE" 
## row2 "2"  "FALSE"
dimnames(x) <- list(c("row-A","row-B"), c("col-A","col-B")) #Naming with dimnames
x
##       col-A col-B  
## row-A "1"   "TRUE" 
## row-B "2"   "FALSE"

Factor

Factors are useful for representing categorical data. In general, factors are sequences of integers with associated labels

x <- factor(c("yes", "yes", "no", "yes", "no"))
x
## [1] yes yes no  yes no 
## Levels: no yes
class(x)
## [1] "factor"
table(x) #table outputs the count of each level of the factor
## x
##  no yes 
##   2   3

Factor Levels

The default ordering of the levels of a factor is alphabetical. You can set the order using the levels argument

x <- factor(c("yes", "yes", "no", "yes", "no"))
x
## [1] yes yes no  yes no 
## Levels: no yes
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
x
## [1] yes yes no  yes no 
## Levels: yes no
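
The integer codes behind a factor follow the order of its levels, which can be checked with as.integer(). A brief sketch using the same values as above:

```r
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))

levels(x)              # the labels, in the order that was set
codes <- as.integer(x) # the underlying integers: "yes" is 1, "no" is 2
codes
```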

Tables

Tables are usually created using table()

The first argument of table() is either a factor or a list of factors

table() is used to compute a contingency table. It counts the frequency of each value.

table(c(5,5,4,5,3,11,11))
## 
##  3  4  5 11 
##  1  1  3  2
table(mtcars$cyl) #count of each kind of cylinder in mtcars
## 
##  4  6  8 
## 12  7 14

The table can then be turned into a dataframe.

as.data.frame(table(mtcars$cyl))
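
Passing two factors to table() gives a two-way contingency table. The sketch below uses datasets::mtcars so that it runs against the unmodified built-in data; the names cyl and am in the call only label the two dimensions.

```r
# Cross-tabulate number of cylinders against transmission type
two.way <- table(cyl = datasets::mtcars$cyl, am = datasets::mtcars$am)
two.way

# As a dataframe: one row per combination, with the count in a Freq column
two.way.df <- as.data.frame(two.way)
two.way.df
```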

String Manipulation


Character String

An R string is a vector of length 1, with atomic class of “character”

"This is a string!"
## [1] "This is a string!"
class("This is a string!")
## [1] "character"
my.string <- "This is my first string" #strings can also be stored in variables

Common Functions

R has many functions to manipulate strings.


nchar() counts the number of characters

nchar("South Pole")
## [1] 10
length("South Pole") #length() counts the elements of the vector, not the characters
## [1] 1
length(c("South Pole", "North Pole"))
## [1] 2


grep() searches for a given pattern within a string/vector of strings and returns the index

grep("Pole", "North Pole")
## [1] 1
grep("Pole", c("Equator", "North Pole", "South Pole"))
## [1] 2 3
grep("pole", c("Equator", "North Pole", "South Pole")) #grep is case-sensitive
## integer(0)


gsub() searches for a given pattern within a string/vector and replaces with the given string

gsub("Pole", "POLE", c("Equator", "North Pole", "South Pole"))
## [1] "Equator"    "North POLE" "South POLE"


paste() puts multiple strings together

paste("North","and", "South", "Pole") #basic concatenation
## [1] "North and South Pole"
paste("South", "Pole", sep = "") #a character string to separate the terms
## [1] "SouthPole"
paste(c("South", "North"), c("Pole", "Poles")) #pastes the vectors element by element
## [1] "South Pole"  "North Poles"
paste(c("South", "North"), "Pole") #the shorter vector is recycled
## [1] "South Pole" "North Pole"
paste(c("South", "North"), c("Pole", "Poles"), collapse = "_") #an optional character string to separate the results
## [1] "South Pole_North Poles"
paste(c("South", "North"), c("Pole", "Poles"), sep = "_") #note the placement of the underscore when using sep
## [1] "South_Pole"  "North_Poles"


strsplit() splits a string by the given delimiter

strsplit("North Pole", " ")
## [[1]]
## [1] "North" "Pole"
strsplit("North Pole", "") #an empty delimiter will split each character
## [[1]]
##  [1] "N" "o" "r" "t" "h" " " "P" "o" "l" "e"


sprintf() assembles the string using other given variables

sprintf("the square of %d is %d", 3, 3^2)
## [1] "the square of 3 is 9"


substr() extracts substrings

substr("North Pole", 7,10)
## [1] "Pole"
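
Several of these functions can be combined. As a sketch, the snippet below reuses the glucokinase sequence from the lists section, splits it into single bases with strsplit(), counts them with table(), and formats the share of G and C bases with sprintf(). The variable names are made up for illustration.

```r
sequence <- "ACTCCACACCTGGCTGGAGCAGGAAA"

# strsplit() returns a list, so [[1]] extracts the character vector of bases
bases <- strsplit(sequence, "")[[1]]

base.counts <- table(bases)  # how often each base occurs
base.counts

# Fraction of bases that are G or C, formatted as a percentage
gc <- sum(bases %in% c("G", "C")) / nchar(sequence)
gc.label <- sprintf("GC content: %.1f%%", 100 * gc)
gc.label
```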

Regular Expression

A regular expression is a kind of wildcard: a shorthand for specifying a broad class of strings. For example, [au] matches any string that contains an a or a u:

grep("[au]", c("Equator", "North Pole", "South Pole"))
## [1] 1 3


A period matches any single character, so o.e matches an o and an e separated by exactly one character:

grep("o.e", c("Equator", "North Pole", "South Pole", "Toae"))
## [1] 2 3 4
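
A few more options are often useful alongside these patterns. The sketch below uses standard base R arguments (value and ignore.case), the ^ and $ anchors, and grepl(); the vector places is just a convenience name.

```r
places <- c("Equator", "North Pole", "South Pole")

# value = TRUE returns the matching strings instead of their indices
matched <- grep("[au]", places, value = TRUE)

# ignore.case = TRUE makes the search case-insensitive
any.case <- grep("pole", places, ignore.case = TRUE)

# ^ anchors the pattern to the start of the string
starts.S <- grep("^S", places)

# $ anchors to the end; grepl() returns TRUE/FALSE for every element
ends.r <- grepl("r$", places)
```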

Escape Sequences

Backslash is used for escape sequences

# \" refers to a double quote
# \\. matches a literal period in a regular expression
# \t refers to tab
# \n refers to new line
x <- "say \"Hello!\""
cat(x) #must use cat for it to print properly
## say "Hello!"
grep("\"", c("gt", x)) #grep can still find the escaped quote
## [1] 2
grep(".", c("abc", "de", "f.g")) #remember, a "." is used to indicate "any character"
## [1] 1 2 3
grep("\\.", c("abc", "de", "f.g")) #grep can find a literal period using the escape sequence
## [1] 3
gsub("\\.", "new", c("abc", "de", "f.g")) #gsub can also be used with the escape sequence
## [1] "abc"   "de"    "fnewg"
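
An alternative to escaping is the fixed = TRUE argument of grep() and gsub(), which treats the pattern as plain text rather than as a regular expression. A brief sketch, with words as a convenience name:

```r
words <- c("abc", "de", "f.g")

# With fixed = TRUE the period is matched literally, so no backslashes are needed
literal.match <- grep(".", words, fixed = TRUE)
literal.sub   <- gsub(".", "new", words, fixed = TRUE)

# \t and \n only show up as a tab and a new line when printed with cat()
cat("col1\tcol2\nA\tB\n")
```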