Week2 E115 L: R Basics (2)
by Grace Yuh Chwen Lee, Spring 2024
Recap from last week…
Vectors contain multiple values. The following are some general ways to create vectors:
v1 = c(1, 3, 5, 7)
v2 = 10:20
v3 = seq(0,20,2)
v4 = rep(3, 4)
In addition to vectors, there are several other useful data structures in R, such as matrices, lists, and data frames!
Matrices
Matrices are very much like vectors (for example, they can only contain one kind of data), but they have 2 dimensions (rows and columns). By default, matrices are filled one column at a time.
m1 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
m2 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
m2
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
m3 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)
m3
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
Logical comparisons such as == can also be used on matrices, BUT the matrices need to have the same dimensions.
Try the following code to get a sense of what it is doing. What answers did you get?
m1 == m2
m1 == m3
You can transpose a matrix with the t( ) function.
t(m1)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
t(m1) == m3
     [,1] [,2]
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
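As a small sketch of the dimension rule above: dim( ) reports the number of rows and columns, so checking it first shows why one comparison works and the other does not. The matrices are recreated here so the block runs on its own; result is a name introduced only for this illustration.

```r
m1 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m2 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
m3 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)

#rows and columns of each matrix
dim(m1)   # 2 3
dim(m3)   # 3 2

#same dimensions: values are compared element by element
m1 == m2

#all( ) collapses a logical matrix into a single TRUE/FALSE
all(t(m1) == m3)   # TRUE

#different dimensions: R stops with an error
#try( ) captures the error so the rest of the script keeps running
result = try(m1 == m3, silent = TRUE)
class(result)   # "try-error"
```

Transposing m1 first makes its dimensions match m3, which is why t(m1) == m3 works while m1 == m3 does not.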
Matrices can also be created by combining vectors, either as columns with cbind or as rows with rbind.
a = 1:3
b = 4:6
cbind(a, b)
     a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
rbind(a, b)
  [,1] [,2] [,3]
a    1    2    3
b    4    5    6
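To see how the two functions relate, here is a short sketch that compares the shapes of their results. The names by_col and by_row are made up for this example.

```r
a = 1:3
b = 4:6

by_col = cbind(a, b)   #each vector becomes a column
by_row = rbind(a, b)   #each vector becomes a row

dim(by_col)   # 3 2
dim(by_row)   # 2 3

#the vector names are kept as column or row names
colnames(by_col)   # "a" "b"
rownames(by_row)   # "a" "b"

#the two results are transposes of each other
all(t(by_col) == by_row)   # TRUE
```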
Tip: if you are confused by what a function is doing, remember you can always consult its help page by typing ? followed by the function name, e.g. ?cbind.
Matrices can be subset in the same way as vectors, but you need to specify both rows and columns as matrix[row,column].
#recall matrix 1 again
m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
#call some elements in matrix 1 by specifying row and column
m1[1, 3]
[1] 5
m1[2, 2]
[1] 4
Let’s spend some time figuring out what the following code is doing.
m1[,2] > 3
[1] FALSE  TRUE
bool = m1[,2] > 3
m1[bool,]
[1] 2 4 6
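One way to read the code above is step by step, as in the sketch below. The variable col2 and the drop = FALSE variant are additions for illustration and are not part of the exercise itself.

```r
m1 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

#step 1: leaving the row position empty takes the whole second column
col2 = m1[, 2]    # 3 4

#step 2: the comparison gives one TRUE/FALSE per row
bool = col2 > 3   # FALSE TRUE

#step 3: a logical vector in the row position keeps only the TRUE rows
m1[bool, ]        # 2 4 6 (returned as a vector)

#drop = FALSE keeps the result as a 1 x 3 matrix instead of a vector
m1[bool, , drop = FALSE]

#the same idea works for columns: keep columns whose first-row value is > 2
m1[, m1[1, ] > 2]
```

In other words, m1[bool,] returns every row whose value in column 2 is larger than 3.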
Lists
Vectors and matrices can only contain one data type. Lists, in contrast, can contain different kinds of data in different elements (think of each element as a container box), and each element can be of a different size. A list can even contain matrices! Lists are the most flexible data structure in R.
list1 = list(fruits = c("blueberry", "gooseberry", "strawberry"),
             letters = matrix(LETTERS[1:25], nrow=5),
             prime = c(2,3,5,7,11,13,17,19))
list1
$fruits
[1] "blueberry"  "gooseberry" "strawberry"

$letters
     [,1] [,2] [,3] [,4] [,5]
[1,] "A"  "F"  "K"  "P"  "U"
[2,] "B"  "G"  "L"  "Q"  "V"
[3,] "C"  "H"  "M"  "R"  "W"
[4,] "D"  "I"  "N"  "S"  "X"
[5,] "E"  "J"  "O"  "T"  "Y"

$prime
[1]  2  3  5  7 11 13 17 19
Also check how list1 shows up in the Environment window.
There are several ways to filter/call a subset of a list. We could subset it just like a vector/matrix to get each element.
#getting the first container in a list
list1[1]
$fruits
[1] "blueberry"  "gooseberry" "strawberry"

We could use double [[ ]] to get inside the box. Once we do that, we can further subset the values inside that element, just like a vector.
#being able to access the first container
list1[[1]]
[1] "blueberry"  "gooseberry" "strawberry"
#get an element out from the first container
list1[[1]][2]
[1] "gooseberry"
If the elements of a list have names, you could use the name to call a subset of a list.
list1$prime
[1]  2  3  5  7 11 13 17 19
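The difference between [ ], [[ ]], and $ is easiest to see by checking what kind of object each one returns. The sketch below recreates list1 so it runs on its own; the specific elements picked out are only examples.

```r
list1 = list(fruits = c("blueberry", "gooseberry", "strawberry"),
             letters = matrix(LETTERS[1:25], nrow=5),
             prime = c(2,3,5,7,11,13,17,19))

#single [ ] returns a smaller list (still a box)
class(list1[1])      # "list"

#double [[ ]] returns what is inside the box
class(list1[[1]])    # "character"

#names work with [[ ]] too, and give the same result as $
identical(list1[["prime"]], list1$prime)   # TRUE

#once inside the box, subset as usual for that data type
list1$letters[2, 3]            # "L" (row 2, column 3 of the matrix)
list1$prime[list1$prime > 10]  # 11 13 17 19

#basic information about the list itself
names(list1)     # "fruits" "letters" "prime"
length(list1)    # 3
```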
Data frame
Data frames are the most commonly used data structure for data analysis. They are like spreadsheets in R. Technically, data frames are lists in which every column is a vector and all vectors have the same length. A data frame has two dimensions: the number of columns (ncol) and the number of rows (nrow).
We can create a data frame by merging vectors, using data.frame( ).
#create a data frame
time = 0:24
N = 2^time
growth = data.frame(time, N)
Check how growth shows up in the Environment window.
You can also change the column names of a data.frame during this process. names(data.frame) tells you the column names of that data.frame.
growth2 = data.frame(new_time = time, new_N = N)
names(growth)
[1] "time" "N"
names(growth2)
[1] "new_time" "new_N"
As with other types of variables, we can type a data frame's name in the console to see its values. However, data frames are often big, which makes them hard to read if we print out the whole spreadsheet. Instead, we can use head to see the first few rows of a data frame.
head(growth)
  time  N
1    0  1
2    1  2
3    2  4
4    3  8
5    4 16
6    5 32
Alternatively, we could specify which rows and columns we want to see, again using [ ].
growth[3,1]
growth[3,2]
growth[3,]
growth[,1]
Because a data.frame is a list, we can also call columns by their name using $.
growth$time
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
growth$time[2]
[1] 1
growth$N[growth$time > 3]
Get the basics of a data frame
Oftentimes, you get a data frame by loading data, which we will learn in a second. To get a quick idea of what the data frame looks like (especially when it is big), try the following functions.
treatment = c(rep('control', 5), rep('drug1', 5), rep('drug2', 5))
genotype = rep(LETTERS[1:5], 3)
phenotype = 1:15
tdata = data.frame(treatment, genotype, phenotype)
#get the number of columns (ncol) for a data frame
length(tdata[1,])
[1] 3
#get the number of rows (nrow) for a data frame
length(tdata[,1])
[1] 15
#super helpful function for data frame!
summary(tdata)
  treatment           genotype           phenotype
 Length:15          Length:15          Min.   : 1.0
 Class :character   Class :character   1st Qu.: 4.5
 Mode  :character   Mode  :character   Median : 8.0
                                       Mean   : 8.0
                                       3rd Qu.:11.5
                                       Max.   :15.0
table(tdata$treatment)

control   drug1   drug2
      5       5       5
table(tdata$genotype)

A B C D E
3 3 3 3 3
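As a complement to the functions above, base R also has nrow( ), ncol( ), dim( ), and str( ), which report the same basics more directly. The sketch below recreates tdata so it runs on its own; drug1_rows is a name made up for this example.

```r
treatment = c(rep('control', 5), rep('drug1', 5), rep('drug2', 5))
genotype = rep(LETTERS[1:5], 3)
phenotype = 1:15
tdata = data.frame(treatment, genotype, phenotype)

#number of rows and columns, separately and together
nrow(tdata)   # 15
ncol(tdata)   # 3
dim(tdata)    # 15 3

#compact overview: one line per column with its type and first values
str(tdata)

#keep only the rows of one treatment, using a logical vector as the row index
drug1_rows = tdata[tdata$treatment == "drug1", ]
nrow(drug1_rows)            # 5
mean(drug1_rows$phenotype)  # 8

#table( ) with two columns counts every treatment x genotype combination
table(tdata$treatment, tdata$genotype)
```

The row filter uses the same logical subsetting idea as growth$N[growth$time > 3], applied to whole rows instead of a single column.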
Clean up data frames
Sometimes (or oftentimes?), experiments fail. That results in some data missing from the data set. However, as long as one has a sufficient number of observations, one can still do statistical analysis. Even if the amount of data is insufficient, getting a sense of the data you already have also helps you plan future experiments. Now we will learn how to deal with missing data in a variable and a data.frame.
In R, missing data is specified as NA (without quotation marks; "NA" is interpreted as a character string). You can use the is.na( ) function to check whether a variable is missing data.
a = NA
is.na(a)
[1] TRUE
v_wNA = c(19, 32, 28, NA, 10)
is.na(v_wNA)
[1] FALSE FALSE FALSE  TRUE FALSE
Many R functions return NA if any of the values in a vector is missing.
sum(v_wNA)
[1] NA
There are different ways to deal with missing data.
?sum
sum(v_wNA, na.rm = TRUE)
?na.omit
sum(na.omit(v_wNA))
In the following classes, we will learn other techniques to deal with missing data in a data.frame.
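A few more things one might try on the same vector are sketched below: counting the missing values, using na.rm with another function, and removing NA by logical subsetting. The name v_clean is made up for this example.

```r
v_wNA = c(19, 32, 28, NA, 10)

#how many values are missing? TRUE counts as 1 when summed
sum(is.na(v_wNA))            # 1

#na.rm = TRUE works in many summary functions, not only sum( )
mean(v_wNA)                  # NA
mean(v_wNA, na.rm = TRUE)    # 22.25

#keep only the non-missing values with logical subsetting
v_clean = v_wNA[!is.na(v_wNA)]
length(v_clean)              # 4

#comparing with == does not detect NA, which is why is.na( ) exists
NA == NA                     # NA

#"NA" in quotation marks is an ordinary character string, not missing data
is.na("NA")                  # FALSE
```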
Setting the Working Directory
Setting the working directory tells R the specific location (i.e. file path) where it should read data/files from and write results to. You will need this to load the data you collected during the wet labs!
You can set this manually by clicking Session > Set Working Directory > Choose Directory… Alternatively, if you are familiar with file paths on your computer, you can set the working directory with the setwd() function:
setwd("/Users/Anteater/Desktop")
To check your working directory, use the getwd() function.
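Here is a minimal sketch of the set-then-check routine. It uses tempdir( ), R's temporary folder, as a stand-in for your own data folder so that it runs on any computer; old_wd and in_temp are names made up for this example.

```r
#remember where we started so we can go back afterwards
old_wd = getwd()

#move to another folder (tempdir( ) is only a stand-in for your own path)
setwd(tempdir())

#confirm that the working directory has changed
in_temp = getwd()
in_temp

#list the files R can see in the current working directory
list.files()

#go back to the original working directory
setwd(old_wd)
getwd()
```

If a file you expect is missing from the list.files( ) output, the working directory is probably not the folder where the file was saved.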
Importing Data
After setting the working directory, we are now ready to load data.
Before loading your own data, let’s try playing with some existing data. If you haven’t already, please download the following two data sets from Canvas to where you set your working directory.
Week2_R_basics_MLB_weights_vector.tsv
Week2_R_basics_MLB_data.tsv
v_mlb = scan("Week2_R_basics_MLB_weights_vector.tsv")
Remember, you need the quotation marks (" ") around the file name.
The code above “scans” the values in Week2_R_basics_MLB_weights_vector.tsv and saves them into the vector v_mlb. Let’s get some basics of v_mlb.
length(v_mlb)
[1] 1034
summary(v_mlb)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  150.0   187.0   200.0   201.7   215.0   290.0
More generally, we use read.table to read in something that looks like a spreadsheet.
data_mlb1 = read.table("Week2_R_basics_MLB_data.tsv")
summary(data_mlb1)
data_mlb2 = read.table("Week2_R_basics_MLB_data.tsv", header = TRUE)
summary(data_mlb2)
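To see what header = TRUE changes without needing the course files, the sketch below writes a tiny tab-separated file to a temporary location and reads it both ways. The file contents and the column names name and weight are invented for illustration and are not taken from the MLB data set.

```r
#write a small made-up table to a temporary .tsv file
tmp = tempfile(fileext = ".tsv")
writeLines(c("name\tweight",
             "playerA\t180",
             "playerB\t215"), tmp)

#without header = TRUE, the first line is treated as data
no_header = read.table(tmp)
names(no_header)               # "V1" "V2"
nrow(no_header)                # 3
is.numeric(no_header$V2)       # FALSE, because "weight" is mixed in with the numbers

#with header = TRUE, the first line becomes the column names
with_header = read.table(tmp, header = TRUE)
names(with_header)             # "name" "weight"
nrow(with_header)              # 2
is.numeric(with_header$weight) # TRUE

#remove the temporary file
unlink(tmp)
```

Comparing summary(data_mlb1) with summary(data_mlb2) should show a similar difference: when the header line is read as data, numeric columns are no longer summarized as numbers.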