Creating Lists
Attributes of a List
Access Elements in Lists
Add Elements to Lists
Remove Elements from Lists
The lapply() and sapply()
unlist()
Use Lists
Lists are a fundamental data structure in R that let you handle diverse and complex data
Unlike vectors or matrices, lists can store elements of different types, making them versatile containers for various data structures
In R, a list is created using the list() function
Enclose the desired elements within the parentheses, separated by commas
Named elements in a list provide clarity and accessibility, offering an organized way to reference specific information within the list
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
attributes() -> the attributes of the list, including the names of its elements
length() -> how many elements are in the list
class() -> the class of the list
str() -> how the elements of the list are organized
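As a minimal sketch of these inspection functions, the snippet below uses a shortened, hypothetical list called gene (not the full gene.list from these notes) so that it runs on its own.

```r
# Hypothetical small list, used only for illustration
gene <- list(id = 2645, symbol = c("GCK", "GLK"), prot.coding = TRUE)

attributes(gene)  # a list holding the $names attribute
names(gene)       # just the element names
length(gene)      # number of top-level elements
class(gene)       # "list"
str(gene)         # compact display of each element's type and contents
```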
By Name:
Elements in a list can be accessed by their assigned names using the $ operator
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list$sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"
By Index:
Accessing elements in a list by index with single square brackets returns a sublist (a list containing the selected element)
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list[1]
## $Gene.id
## [1] 2645
Extracting Multiple Elements:
To extract multiple elements, use a vector of indices or names
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list[c("Gene.id","symbol")]
## $Gene.id
## [1] 2645
##
## $symbol
## [1] "GCK" "GLK" "HK4" "GK"
Using Double Square Brackets:
You can use double square brackets to access the content of a single element of the list. Single square brackets return a sublist instead.
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list[4] #Returns a sublist containing the vector
## $symbol
## [1] "GCK" "GLK" "HK4" "GK"
class(gene.list[4]) #Note the class is still list
## [1] "list"
gene.list[[4]] #Returns the value of the vector
## [1] "GCK" "GLK" "HK4" "GK"
class(gene.list[[4]]) #Note the class is the class of the vector
## [1] "character"
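Double square brackets also accept a name, and the extracted vector can be indexed again. The sketch below assumes a shortened, illustrative list called gene rather than the full gene.list.

```r
# Hypothetical shortened list, used only for illustration
gene <- list(id = 2645, symbol = c("GCK", "GLK", "HK4", "GK"))

by_name  <- gene[["symbol"]]      # same content as gene$symbol
second   <- gene[["symbol"]][2]   # index into the extracted vector
same_way <- gene$symbol[2]        # equivalent, using $

by_name
second
```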
You can add elements into an existing list
Using index number/square brackets:
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
#Adding an element with an index number/square brackets
gene.list[8]<-11 #Create the element
names(gene.list)[8] <- "exons" #name the element
gene.list
## $Gene.id
## [1] 2645
##
## $name
## [1] "glucokinase"
##
## $length
## [1] 46227
##
## $symbol
## [1] "GCK" "GLK" "HK4" "GK"
##
## $organism
## [1] "human"
##
## $prot.coding
## [1] TRUE
##
## $sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"
##
## $exons
## [1] 11
Using $ symbol:
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
gene.list$exon <- 11
gene.list
## $Gene.id
## [1] 2645
##
## $name
## [1] "glucokinase"
##
## $length
## [1] 46227
##
## $symbol
## [1] "GCK" "GLK" "HK4" "GK"
##
## $organism
## [1] "human"
##
## $prot.coding
## [1] TRUE
##
## $sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"
##
## $exon
## [1] 11
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA",
exon = 11
)
gene.list
## $Gene.id
## [1] 2645
##
## $name
## [1] "glucokinase"
##
## $length
## [1] 46227
##
## $symbol
## [1] "GCK" "GLK" "HK4" "GK"
##
## $organism
## [1] "human"
##
## $prot.coding
## [1] TRUE
##
## $sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"
##
## $exon
## [1] 11
gene.list$exon <- NULL #Assigning NULL removes the element
gene.list
## $Gene.id
## [1] 2645
##
## $name
## [1] "glucokinase"
##
## $length
## [1] 46227
##
## $symbol
## [1] "GCK" "GLK" "HK4" "GK"
##
## $organism
## [1] "human"
##
## $prot.coding
## [1] TRUE
##
## $sequence
## [1] "ACTCCACACCTGGCTGGAGCAGGAAA"
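Besides assigning NULL, elements can be dropped by negative index or by name. This is a small sketch on a hypothetical list called gene; the variable names no_first, drop, and trimmed are illustrative only.

```r
# Hypothetical list, used only for illustration
gene <- list(id = 2645, name = "glucokinase", organism = "human", exon = 11)

# Drop one element by position with a negative index (returns a new list)
no_first <- gene[-1]

# Drop several elements by name in one step
drop <- c("organism", "exon")
trimmed <- gene[!(names(gene) %in% drop)]

names(no_first)
names(trimmed)
```

Neither form changes gene itself; reassign the result if the removal should persist.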
lapply() applies a function to each element of a list and returns a list of the results, working similarly to the apply() function
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
lapply(gene.list,length) #will return the length of each element
## $Gene.id
## [1] 1
##
## $name
## [1] 1
##
## $length
## [1] 1
##
## $symbol
## [1] 4
##
## $organism
## [1] 1
##
## $prot.coding
## [1] 1
##
## $sequence
## [1] 1
sapply() works like lapply() but tries to simplify the output to a vector:
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
sapply(gene.list, length)
## Gene.id name length symbol organism prot.coding
## 1 1 1 4 1 1
## sequence
## 1
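The word "tries" matters: when the results cannot be simplified, sapply() returns a list. A minimal sketch on a hypothetical two-element list:

```r
# Hypothetical list, used only for illustration
gene <- list(id = 2645, symbol = c("GCK", "GLK", "HK4", "GK"))

# Every result has length 1, so sapply() simplifies to a named vector
lens <- sapply(gene, length)

# Results differ in length (1 vs 4), so sapply() cannot simplify and returns a list
reversed <- sapply(gene, rev)

lens
is.list(reversed)
```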
The unlist() function converts a list into a vector
The resulting vector contains all the elements of gene.list, coerced to a common type (here, character)
Keep in mind that if the list has nested lists or multiple levels of hierarchy, unlist() will concatenate the elements recursively
It’s important to note that using unlist() may lose the structural information present in nested lists, and the resulting vector may not fully represent the original list’s organization
Use this function carefully, based on your specific data manipulation needs
gene.list <- list(
Gene.id = 2645,
name = "glucokinase",
length = 46227,
symbol = c("GCK", "GLK", "HK4", "GK"),
organism = "human",
prot.coding = TRUE,
sequence = "ACTCCACACCTGGCTGGAGCAGGAAA"
)
unlist(gene.list)
## Gene.id name
## "2645" "glucokinase"
## length symbol1
## "46227" "GCK"
## symbol2 symbol3
## "GLK" "HK4"
## symbol4 organism
## "GK" "human"
## prot.coding sequence
## "TRUE" "ACTCCACACCTGGCTGGAGCAGGAAA"
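The gene.list example is not nested, so here is a small sketch of what happens with one level of nesting. The list nested and its contents are hypothetical.

```r
# Hypothetical nested list, used only for illustration
nested <- list(id = 1, info = list(symbol = c("GCK", "GLK"), coding = TRUE))

flat <- unlist(nested)
flat          # one character vector; the numeric and logical values became strings
names(flat)   # names are built from the path through the nesting

# recursive = FALSE flattens only one level, so the pieces keep their own types
one_level <- unlist(nested, recursive = FALSE)
str(one_level)
```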
A common usage of lists is to combine multiple values into a single object
For example, the result of the hist() function is a list
data(Nile)
Nile[1:10]
## [1] 1120 1160 963 1210 1160 1160 813 1230 1370 1140
ret <- hist(Nile)
(ret)
## $breaks
## [1] 400 500 600 700 800 900 1000 1100 1200 1300 1400
##
## $counts
## [1] 1 0 5 20 25 19 12 11 6 1
##
## $density
## [1] 0.0001 0.0000 0.0005 0.0020 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001
##
## $mids
## [1] 450 550 650 750 850 950 1050 1150 1250 1350
##
## $xname
## [1] "Nile"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
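Because the result is a list, its pieces can be pulled out with $ like any other list. A minimal sketch, using plot = FALSE so that no plot is drawn; the variable name h is illustrative.

```r
data(Nile)
h <- hist(Nile, plot = FALSE)  # plot = FALSE returns the list without drawing

h$counts                       # bin counts
h$breaks                       # bin boundaries
sum(h$counts) == length(Nile)  # every observation falls in exactly one bin
h$mids[which.max(h$counts)]    # midpoint of the fullest bin
```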
Dataframes are a fundamental data structure in the R programming language, serving as a powerful tool for data manipulation and analysis
A dataframe is essentially a two-dimensional table with rows and columns, resembling a spreadsheet or a database table
In R, dataframes provide an organized way to store, manage, and analyze structured data
The columns in a dataframe may have different classes like numeric, character, or logical, but all values within a single column must have the same type
Vectors and factors within a data frame must have the same length
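A small sketch of the length restriction: columns of length 2 and 3 cannot be combined. The tryCatch() wrapper and the message string are illustrative choices so that the snippet runs to the end instead of stopping.

```r
ok <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30))
sapply(ok, class)   # one class per column

# Columns of length 2 and 3 cannot be combined, so data.frame() signals an error
result <- tryCatch(
  data.frame(Name = c("Alice", "Bob"), Age = c(25, 30, 22)),
  error = function(e) "error: columns differ in length"
)
result
```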
Output of Dataframe vs output of List
df <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25,30,22),
Score = c(95,89,75)
)
df
my.list <- list(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25,30,22),
Score = c(95,89,75)
)
my.list
## $Name
## [1] "Alice" "Bob" "Charlie"
##
## $Age
## [1] 25 30 22
##
## $Score
## [1] 95 89 75
Dataframes are created with the data.frame() function
df <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25,30,22),
Score = c(95,89,75)
)
df
Using the read functions
The read functions read a table from a file in the working directory. I don’t have this table, so I am going to type out the code as if I did.
Note: The table must be in your working directory
To Read from a Text Document
car_data <- read.table("mtcars_table.txt")
View(car_data)
To Read From a .csv File
car_data_csv <- read.csv("mtcars_table.csv")
In a csv, the row names will be numbers by default. To use the first column as row names instead, use the following steps.
car_data_csv <- read.csv("mtcars_table.csv", header = TRUE, row.names = 1)
View(car_data_csv)
To Read From an Excel File:
requires the readxl package
read_excel("")
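Since the table file is not available here, one way to try the .csv steps is to write the built-in mtcars data to a temporary file and read it back. This is only a sketch; csv_path, plain, and named are illustrative names, and the temporary path stands in for a real file in the working directory.

```r
# Write the built-in mtcars data to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
write.csv(datasets::mtcars, csv_path)

# Default read: row names are numbers and the car names land in a column called "X"
plain <- read.csv(csv_path)
head(rownames(plain))

# row.names = 1 uses the first column as row names instead
named <- read.csv(csv_path, header = TRUE, row.names = 1)
head(rownames(named))

unlink(csv_path)  # clean up the temporary file
```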
There are several functions we can use to access the attributes of a dataframe
df <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25,30,22),
Score = c(95,89,75)
)
df
dim(df) #Dimensions
## [1] 3 3
class(df) #Class of the dataframe
## [1] "data.frame"
str(df) #Structure of the dataframe
## 'data.frame': 3 obs. of 3 variables:
## $ Name : chr "Alice" "Bob" "Charlie"
## $ Age : num 25 30 22
## $ Score: num 95 89 75
nrow(df) #Number of rows in the dataframe
## [1] 3
ncol(df) #number of columns in the dataframe
## [1] 3
names(df) #the names in the dataframe
## [1] "Name" "Age" "Score"
colnames(df) #The names of the columns in the dataframe
## [1] "Name" "Age" "Score"
rownames(df) #The names of the rows in the dataframe
## [1] "1" "2" "3"
summary(df) #basic summary of all columns
## Name Age Score
## Length:3 Min. :22.00 Min. :75.00
## Class :character 1st Qu.:23.50 1st Qu.:82.00
## Mode :character Median :25.00 Median :89.00
## Mean :25.67 Mean :86.33
## 3rd Qu.:27.50 3rd Qu.:92.00
## Max. :30.00 Max. :95.00
You can coerce an object into a data frame as long as its components conform to the restrictions
#Create a simple list
my_list <- list(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25,30,22),
Score = c(95,89,75)
)
my_list
## $Name
## [1] "Alice" "Bob" "Charlie"
##
## $Age
## [1] 25 30 22
##
## $Score
## [1] 95 89 75
#Coerce the List into a Dataframe
my_dataframe <- data.frame(my_list)
my_dataframe
Like lists, the elements in a data frame can be accessed by using “$” or by using numeric indices
Note: Indexing takes the form [rows, columns]
Accessing specific Columns:
mtcars
mtcars$mpg #Access one column
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars[,c("mpg","disp","hp")] #Access multiple columns using names
mtcars[,1:5] #Access multiple columns using index
Accessing specific Rows:
mtcars[10,] #Access one row using index
mtcars[3:5,] #Access multiple rows using index
Accessing specific Rows and Columns:
mtcars[1:5, 1:10]
Accessing multiple Rows, but only one Column:
mtcars[1:4,2] #Notice that this returns a vector
## [1] 6 6 4 6
mtcars[1:4,2, drop = FALSE] #you have to set drop to false to maintain the dataframe
Rows can be added to a dataframe either by manually inputting the data or with the rbind() function
Columns can be added using $ symbol
new_car <- c(mpg = 28, cyl = 4, disp = 120, hp = 110, drat = 3.9, wt = 2.8,
qsec = 16.5, vs = 0, am = 1, gear = 4, carb = 2)
new_car
## mpg cyl disp hp drat wt qsec vs am gear carb
## 28.0 4.0 120.0 110.0 3.9 2.8 16.5 0.0 1.0 4.0 2.0
mtcars #before bound to new_car
mtcars <- rbind(mtcars,new_car) #after bound to new_car
mtcars
In the lecture video, they go on to name this new value, but this is covered later in the lecture, so I am not writing it down a second time.
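The notes mention adding columns with $ but do not show it. Below is a minimal sketch on a small copy of mtcars; the column kpl (a km-per-litre conversion of mpg) and the row name "New Car" are hypothetical choices for illustration.

```r
cars <- head(datasets::mtcars, 3)   # small copy so the original data is untouched

# Add a column with $: here, a hypothetical km-per-litre conversion of mpg
cars$kpl <- cars$mpg * 0.425144

# Add a row with rbind(); a one-row data frame must have the same column names
new_row <- data.frame(mpg = 28, cyl = 4, disp = 120, hp = 110, drat = 3.9, wt = 2.8,
                      qsec = 16.5, vs = 0, am = 1, gear = 4, carb = 2,
                      kpl = 28 * 0.425144, row.names = "New Car")
cars <- rbind(cars, new_row)
dim(cars)
```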
Removing data can be done using negative row and column numbers. The result is a new dataframe, so assign it to a variable to keep it.
Removing Rows
mtcars[-1,] #Delete the first row
mtcars_new <- mtcars[-c(1:5),] #Delete the first five rows and store the remaining rows in mtcars_new
mtcars_new
Removing Columns
mtcars[,-4] #Delete the 4th column
Removing Rows and Columns
mtcars[-5,-6] #Remove the 5th row and the 6th column
Filtering criteria can be applied to keep only the rows that meet a condition
Using Conditional Operators:
mtcars[mtcars$mpg>20,] #dataframe where the value of mpg is greater than 20
mtcars[mtcars$am==1 & mtcars$mpg>20,] #dataframe where the value of am = 1 and the value of mpg>20
Using subset() function
subset(mtcars, am==0 & mpg>20) #dataframe where am = 0 and mpg is greater than 20
After subsetting or filtering dataframes, we may perform a subsequent analysis. If there are NA values in the input vector, we might want to tell R to ignore them
c(1,4,6,NA,9)
## [1] 1 4 6 NA 9
mean(c(1,4,6,NA,9))
## [1] NA
mean(c(1,4,6,NA,9), na.rm = TRUE) #using na.rm
## [1] 5
In a dataframe, you can use na.omit() to drop the rows that contain NA
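A minimal sketch of na.omit() on a dataframe, using a hypothetical scores table with one missing value:

```r
# Hypothetical data, used only for illustration
scores <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                     Score = c(95, NA, 75))

complete <- na.omit(scores)   # drops every row that contains at least one NA
complete
nrow(complete)

mean(scores$Score, na.rm = TRUE)   # column-level alternative that keeps all rows
```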
You can set both the row and column names of matrices using colnames(), rownames(), or dimnames():
x <- c(1, 2, "TRUE", "FALSE")
dim(x) <- c(2,2)
x
## [,1] [,2]
## [1,] "1" "TRUE"
## [2,] "2" "FALSE"
colnames(x) <- c("col1","col2") #Naming with colnames
rownames(x) <- c("row1","row2") #Naming with rownames
x
## col1 col2
## row1 "1" "TRUE"
## row2 "2" "FALSE"
dimnames(x) <- list(c("row-A","row-B"), c("col-A","col-B")) #Naming with dimnames
x
## col-A col-B
## row-A "1" "TRUE"
## row-B "2" "FALSE"
Factors are useful for representing categorical data. In general, factors are sequences of integers with associated labels
x <- factor(c("yes", "yes", "no", "yes", "no"))
x
## [1] yes yes no yes no
## Levels: no yes
class(x)
## [1] "factor"
table(x) #table outputs the count of factor
## x
## no yes
## 2 3
The default ordering of levels of factor is alphabetical. You can set the order using the levels argument
x <- factor(c("yes", "yes", "no", "yes", "no"))
x
## [1] yes yes no yes no
## Levels: no yes
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
x
## [1] yes yes no yes no
## Levels: yes no
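To see the "integers with associated labels" idea directly, the sketch below pulls out the integer codes and the labels of the same factor.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))

levels(x)                  # the labels, in the order given
as.integer(x)              # the underlying codes: position of each value in levels(x)
levels(x)[as.integer(x)]   # mapping the codes back recovers the original strings
```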
Tables are usually created using table()
The first argument of table() is either a factor or a list of factors
table() is used to compute a contingency table. It counts the frequency of each value.
table(c(5,5,4,5,3,11,11))
##
## 3 4 5 11
## 1 1 3 2
table(mtcars$cyl) #count of each kind of cylinder in mtcars
##
## 4 6 8
## 12 7 14
The table can then be turned into a dataframe.
as.data.frame(table(mtcars$cyl))
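With two factors, table() cross-tabulates them. The sketch below uses datasets::mtcars, the unmodified built-in copy, so it does not include the row added with rbind() earlier in these notes; the names cars and two_way are illustrative.

```r
cars <- datasets::mtcars            # unmodified built-in copy

two_way <- table(cyl = cars$cyl, am = cars$am)
two_way                             # rows: cylinders, columns: transmission (0/1)

as.data.frame(two_way)              # long format: one row per cyl/am combination
```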
A string in R is a vector of length 1, with atomic class “character”
"This is a string!"
## [1] "This is a string!"
class("This is a string!")
## [1] "character"
my.string <- "This is my first string" #strings can also be stored in variables
R has many functions to manipulate strings.
nchar() counts the number of characters
nchar("South Pole")
## [1] 10
length("South Pole") #length is 1 because the vector holds one string
## [1] 1
length(c("South Pole", "North Pole"))
## [1] 2
grep() searches for a given pattern within a string/vector of strings and returns the index of each match
grep("Pole", "North Pole")
## [1] 1
grep("Pole", c("Equator", "North Pole", "South Pole"))
## [1] 2 3
grep("pole", c("Equator", "North Pole", "South Pole")) #grep is case-sensitive
## integer(0)
gsub() searches for a given pattern within a string/vector and replaces it with the given string
gsub("Pole", "POLE", c("Equator", "North Pole", "South Pole"))
## [1] "Equator" "North POLE" "South POLE"
paste() puts multiple strings together
paste("North","and", "South", "Pole") #basic concatenation
## [1] "North and South Pole"
paste("South", "Pole", sep = "") #a character string to separate the terms
## [1] "SouthPole"
paste(c("South", "North"), c("Pole", "Poles")) #vectors are pasted element by element
## [1] "South Pole" "North Poles"
paste(c("South", "North"), "Pole") #the shorter argument is recycled
## [1] "South Pole" "North Pole"
paste(c("South", "North"), c("Pole", "Poles"), collapse = "_") #an optional character string to separate the results
## [1] "South Pole_North Poles"
paste(c("South", "North"), c("Pole", "Poles"), sep = "_") #note the placement of the underscore when using sep
## [1] "South_Pole" "North_Poles"
strsplit() splits a string by the given delimiter
strsplit("North Pole", " ")
## [[1]]
## [1] "North" "Pole"
strsplit("North Pole", "") #an empty delimiter will split each character
## [[1]]
## [1] "N" "o" "r" "t" "h" " " "P" "o" "l" "e"
sprintf() assembles the string using other given variables
sprintf("the square of %d is %d", 3, 3^2)
## [1] "the square of 3 is 9"
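A short sketch of a few other sprintf() placeholders: %s for strings, %.1f for one decimal place, and %02d for zero-padded integers. The variables gene and len are illustrative.

```r
gene <- "GCK"
len  <- 46227

one <- sprintf("%s is %d bp long (%.1f kb)", gene, len, len / 1000)
one

# sprintf() is vectorised: one output string per element
many <- sprintf("%s-%02d", c("GCK", "GLK"), 1:2)
many
```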
substr() extracts substrings
substr("North Pole", 7,10)
## [1] "Pole"
A regular expression is a kind of wild card. It’s shorthand to specify a broad class of strings. E.g.
grep("[au]", c("Equator", "North Pole", "South Pole"))
## [1] 1 3
grep("o.e", c("Equator", "North Pole", "South Pole", "Toae")) #"." matches any single character
## [1] 2 3 4
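grep() has a few close relatives that are often handy with regular expressions. The sketch below assumes the same three place names and shows grepl() plus the value and ignore.case arguments of grep().

```r
places <- c("Equator", "North Pole", "South Pole")

hits    <- grepl("[au]", places)                     # logical vector, one entry per string
matched <- grep("[au]", places, value = TRUE)        # the matching strings themselves
anycase <- grep("pole", places, ignore.case = TRUE)  # case-insensitive match

hits
matched
anycase
```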
Backslash is used for escape sequences
# \" refers to a quote
# \\. refers to a period
# \t refers to a tab
# \n refers to a new line
x <- "say \"Hello!\""
cat(x) #must use cat for it to print properly
## say "Hello!"
grep("\"", c("gt", x)) #grep can still find the escaped quote
## [1] 2
grep(".", c("abc", "de", "f.g")) #remember, a "." is used to indicate "any character"
## [1] 1 2 3
grep("\\.", c("abc", "de", "f.g")) #grep can find a literal period "." using the escape sequence
## [1] 3
gsub("\\.", "new", c("abc", "de", "f.g")) #gsub can also be used with the escape sequence
## [1] "abc" "de" "fnewg"
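An alternative to escaping is the fixed = TRUE argument, which tells grep() and gsub() to treat the pattern as plain text rather than a regular expression. A minimal sketch on the same vector; the variable names are illustrative.

```r
x <- c("abc", "de", "f.g")

idx      <- grep(".", x, fixed = TRUE)          # "." is now a literal period
replaced <- gsub(".", "new", x, fixed = TRUE)   # only the literal period is replaced

idx
replaced
identical(idx, grep("\\.", x))                  # same result as the escaped pattern
```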