Week2 E115 L: R Basics (2)

by Grace Yuh Chwen Lee, Spring 2024


Recap from last week…

Vectors contain multiple values. Here are some common ways to create them:

v1 = c(1, 3, 5, 7)
v2 = 10:20
v3 = seq(0,20,2)
v4 = rep(3, 4)
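
As a quick check, the sketch below recreates these four vectors and looks at what each one holds. The variable lengths_all is a name introduced here only for illustration.

```r
# Recreate the four vectors from the recap
v1 = c(1, 3, 5, 7)    # combine individual values
v2 = 10:20            # every integer from 10 to 20
v3 = seq(0, 20, 2)    # from 0 to 20 in steps of 2
v4 = rep(3, 4)        # the value 3 repeated 4 times

# Print two of the vectors to see their contents
print(v3)
print(v4)

# Count how many values each vector holds (lengths_all is an illustrative name)
lengths_all = c(length(v1), length(v2), length(v3), length(v4))
print(lengths_all)
```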

In addition to vectors, there are several other useful data structures in R, such as matrices, lists, and data frames!

Matrices

Matrices are very much like vectors (for example, they can only contain one kind of data), but they have 2 dimensions (rows and columns). By default, matrices are filled one column at a time.

m1 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
m2 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
m2
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
m3 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)
m3
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
Q1

What is the code above doing? Is m1 the same as m2, and why? Is m1 the same as m3, and why?

Hint: try ?matrix or help(matrix)

Logical operators can also be used on matrices, BUT the matrices need to have the same dimensions.

Try the following code to get a sense of what it is doing. What answers did you get?

m1 == m2
m1 == m3
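
Before comparing two matrices element by element, it can help to check that their dimensions match. A minimal sketch, using two small made-up matrices (x, y, and same_shape are hypothetical names, not part of the lab):

```r
# Two hypothetical matrices with the same values but different shapes
x = matrix(1:4, nrow = 2)   # 2 rows, 2 columns
y = matrix(1:4, nrow = 1)   # 1 row, 4 columns

dim(x)   # number of rows, then number of columns
dim(y)

# Only compare element by element when the shapes agree
same_shape = identical(dim(x), dim(y))
if (same_shape) {
  print(x == y)
} else {
  print("x and y have different dimensions, so == would give an error")
}
```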

You can transpose a matrix with the t( ) function.

t(m1)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
t(m1) == m3
     [,1] [,2]
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE

Matrices can also be created by combining vectors, either as columns with cbind or as rows with rbind.

a = 1:3
b = 4:6
cbind(a, b)
     a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
rbind(a, b)
  [,1] [,2] [,3]
a    1    2    3
b    4    5    6
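
One way to see the difference between the two functions is to compare the dimensions of their results. The sketch below reuses a and b from above; by_col and by_row are names introduced only for illustration.

```r
a = 1:3
b = 4:6

by_col = cbind(a, b)   # a and b become columns: 3 rows, 2 columns
by_row = rbind(a, b)   # a and b become rows: 2 rows, 3 columns

dim(by_col)
dim(by_row)

# Transposing the column version gives the same values as the row version
all(t(by_col) == by_row)
```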

Tip: if you are confused by what a function is doing, remember you can always consult its help page by typing ? followed by the function name (for example, ?cbind).

Matrices can be subset in the same way as vectors, but you need to specify both the row and the column, as matrix[row, column].

#recall matrix 1 again
m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
#call some elements in matrix 1 by specifying row and column
m1[1, 3]
[1] 5
m1[2, 2]
[1] 4
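
Leaving the row or the column position blank returns everything along that dimension. A short sketch with m1 (row1, col3, and cols13 are illustrative names):

```r
m1 = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

row1 = m1[1, ]          # leave the column blank to get the whole first row
col3 = m1[, 3]          # leave the row blank to get the whole third column
cols13 = m1[, c(1, 3)]  # a vector of indices picks several columns at once

print(row1)
print(col3)
print(cols13)
```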

Let’s spend some time figuring out what the following code is doing.

m1[,2] > 3 
[1] FALSE  TRUE
bool = m1[,2] > 3
m1[bool,]
[1] 2 4 6
Q2

Write 6 temperatures (32, 69, 72, 80, 100, 621) in Fahrenheit and save them in a vector.

Make a matrix with two columns, one with degrees in Fahrenheit and one in Celsius.

Subset every row where the degrees Celsius are more than 25 (use a logical statement to do this).


Lists

Vectors and matrices can only contain one data type. Lists, in contrast, can contain different kinds of data in different elements (think of each element as a container box), and each element can be of a different size. A list can even contain matrices! Lists are the most flexible data structure in R.

list1 = list(fruits = c("blueberry", "gooseberry", "strawberry"),
          letters = matrix(LETTERS[1:25], nrow=5),
          prime = c(2,3,5,7,11,13,17,19))
list1
$fruits
[1] "blueberry"  "gooseberry" "strawberry"

$letters
     [,1] [,2] [,3] [,4] [,5]
[1,] "A"  "F"  "K"  "P"  "U" 
[2,] "B"  "G"  "L"  "Q"  "V" 
[3,] "C"  "H"  "M"  "R"  "W" 
[4,] "D"  "I"  "N"  "S"  "X" 
[5,] "E"  "J"  "O"  "T"  "Y" 

$prime
[1]  2  3  5  7 11 13 17 19

Also check how list1 shows up in the Environment window.
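
The console offers a similar overview. A minimal sketch, using a small made-up list (box and its contents are hypothetical, chosen only to show elements of different types and sizes):

```r
# A small hypothetical list with three elements of different types and sizes
box = list(name = "sample A",
           counts = c(12, 15, 9),
           passed = TRUE)

length(box)   # number of elements (containers) in the list
names(box)    # the name of each element
str(box)      # compact summary: type and first values of each element
```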

There are several ways to filter/call a subset of a list. We could subset it just like a vector/matrix to get each element.

#getting the first container in a list
list1[1]
$fruits
[1] "blueberry"  "gooseberry" "strawberry"

We can use double brackets [[ ]] to get inside the box. Once we do that, we can further subset a value within the element, just like with a vector.

#being able to access the first container
list1[[1]]
[1] "blueberry"  "gooseberry" "strawberry"
#get an element out from the first container
list1[[1]][2]
[1] "gooseberry"
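
The difference between single and double brackets shows up in the type of object you get back. A short sketch with a shortened version of list1 (one_bracket and two_brackets are illustrative names):

```r
# A shortened version of list1, so that this example runs on its own
list1 = list(fruits = c("blueberry", "gooseberry", "strawberry"),
             prime = c(2, 3, 5, 7))

one_bracket = list1[1]     # still a list: a smaller box holding one element
two_brackets = list1[[1]]  # the contents of the box: a character vector

class(one_bracket)
class(two_brackets)
```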
Q3

Subset letter L from list1 using [ ].

If the elements of a list have names, you could use the name to call a subset of a list.

list1$prime
[1]  2  3  5  7 11 13 17 19
Q4

Using $, can you obtain only those >= 11 from prime?

Hint: use a logical operator.


Data frame

Data frames are the most commonly used data structure for data analysis. They are like spreadsheets in R. Technically, data frames are lists in which every column is a vector and all vectors are of the same length. A data frame has two dimensions: the number of columns (ncol) and the number of rows (nrow).

We can create a data frame by merging vectors, using data.frame( ).

#create a data frame
time = 0:24
N = 2^time
growth = data.frame(time, N)

Check how growth shows up in the Environment window.
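
The statement that a data frame is a list of equal-length columns can be checked in the console. A minimal sketch using growth (col_lengths is an illustrative name):

```r
time = 0:24
N = 2^time
growth = data.frame(time, N)

is.list(growth)   # a data frame is a special kind of list
length(growth)    # as a list, its length is the number of columns

# Length of every column; all columns hold the same number of values
col_lengths = sapply(growth, length)
print(col_lengths)

str(growth)       # type of each column plus the first few values
```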

You can also set the column names of a data.frame during this process. names(data.frame) tells you the column names of that data.frame.

growth2 = data.frame(new_time = time, new_N = N)

names(growth)
[1] "time" "N"   
names(growth2)
[1] "new_time" "new_N"   
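
Column names can also be changed after a data frame has been created, by assigning a new vector of names to names( ). A short sketch; the new names "hour" and "cells" are made up for illustration, and a shorter version of growth is used:

```r
# A shorter version of growth, so that this example runs on its own
growth = data.frame(time = 0:5, N = 2^(0:5))

names(growth)                       # current column names
names(growth) = c("hour", "cells")  # hypothetical new names, one per column
names(growth)
```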

Like other types of variables, we can type a data frame's name in the console to see its values. However, data frames are often big, which makes them hard to read if we print out the whole spreadsheet. Instead, we can use head to see the first few rows of a data frame.

head(growth)
  time  N
1    0  1
2    1  2
3    2  4
4    3  8
5    4 16
6    5 32
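
head shows six rows by default. Its n argument changes how many rows are shown, and tail does the same from the bottom of the data frame. A brief sketch (first3 and last3 are illustrative names):

```r
growth = data.frame(time = 0:24, N = 2^(0:24))

first3 = head(growth, n = 3)   # first three rows only
last3 = tail(growth, n = 3)    # last three rows
print(first3)
print(last3)
```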

Alternatively, we could specify which rows and columns we want to see, again using [ ].

Q5

Try the following code. Can you figure out which one is specifying row and which one is specifying column?

growth[3,1]
growth[3,2]
growth[3,]
growth[,1]

Because a data.frame is a list, we can also call columns by their name using $.

growth$time
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
growth$time[2]
[1] 1
Q6

What is the code below doing? What answer do you get?

growth$N[growth$time > 3]

Get the basics of a data frame

Oftentimes, you get a data frame by loading data, which we will learn in a second. To get a quick idea of what the data frame looks like (especially when it is big), try the following functions.

treatment = c(rep('control', 5), rep('drug1', 5), rep('drug2', 5))
genotype = rep(LETTERS[1:5], 3)
phenotype = 1:15

tdata = data.frame(treatment, genotype, phenotype)
#get the number of columns (ncol) for a data frame
length(tdata[1,])
[1] 3
#get the number of rows (nrow) for a data frame
length(tdata[,1])
[1] 15
#super helpful function for data frame!
summary(tdata)
  treatment           genotype           phenotype   
 Length:15          Length:15          Min.   : 1.0  
 Class :character   Class :character   1st Qu.: 4.5  
 Mode  :character   Mode  :character   Median : 8.0  
                                       Mean   : 8.0  
                                       3rd Qu.:11.5  
                                       Max.   :15.0  
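
The functions ncol, nrow, and dim report the same dimensions more directly. A minimal sketch on the same tdata:

```r
treatment = c(rep('control', 5), rep('drug1', 5), rep('drug2', 5))
genotype = rep(LETTERS[1:5], 3)
phenotype = 1:15
tdata = data.frame(treatment, genotype, phenotype)

ncol(tdata)   # same answer as length(tdata[1,])
nrow(tdata)   # same answer as length(tdata[,1])
dim(tdata)    # number of rows, then number of columns
str(tdata)    # type of each column plus the first few values
```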
Q7

What does the following code do?

table(tdata$treatment)

control   drug1   drug2 
      5       5       5 
table(tdata$genotype)

A B C D E 
3 3 3 3 3 

Clean up data frames

Sometimes (or oftentimes?), experiments fail, which leaves some data missing from the experiment as a whole. However, as long as one has a sufficient number of observations, one can still do statistical analysis. Even if the amount of data is insufficient, getting a sense of the data you already have also helps you plan future experiments. Now we will learn how to deal with missing data in a variable and in a data.frame.

In R, missing data is specified as NA (without quotation marks; "NA" is interpreted as a character string). You can use the is.na( ) function to check whether a variable is missing data.

a = NA
is.na(a)
[1] TRUE
v_wNA = c(19, 32, 28, NA, 10)
is.na(v_wNA)
[1] FALSE FALSE FALSE  TRUE FALSE
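
Because is.na( ) returns TRUE/FALSE values, it can also be used to count and locate missing values. A short sketch (n_missing and where_missing are illustrative names):

```r
v_wNA = c(19, 32, 28, NA, 10)

n_missing = sum(is.na(v_wNA))        # TRUE counts as 1, so this counts the NAs
where_missing = which(is.na(v_wNA))  # positions of the missing values
print(n_missing)
print(where_missing)

# Comparing with == does not work for NA: the result is itself NA,
# which is why is.na( ) is needed
print(NA == NA)
```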

Many R functions return NA if any of the values in a vector is missing.

sum(v_wNA)
[1] NA

There are different ways to deal with missing data.

Q8

Try the following code (including opening the help pages of the functions) and describe how each function is used to address missing data.

?sum
sum(v_wNA, na.rm = TRUE)

?na.omit
sum(na.omit(v_wNA))

In the following classes, we will learn other techniques to deal with missing data in data.frame.


Setting the Working Directory

Setting the working directory tells R the specific location (i.e. file path) from which to read data files and to which to write results. You will need this to load the data you collected during the wet labs!

You can set this manually by clicking Session > Set Working Directory > Choose Directory… Alternatively, you can set the working directory with the setwd() function (if you are familiar with file paths on your computer):

setwd("/Users/Anteater/Desktop")

To check your working directory, use the getwd() function.
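
Once the working directory is set, list.files() and file.exists() can confirm that R sees the files you expect. A minimal sketch, assuming the temporary folder from tempdir() as a stand-in for your own folder; "my_data.tsv" is a hypothetical file name, and old_dir and new_dir are illustrative names:

```r
old_dir = getwd()            # remember where we started

setwd(tempdir())             # tempdir() is a temporary folder R creates for each session
new_dir = getwd()
print(new_dir)

list.files()                 # files in the current working directory
file.exists("my_data.tsv")   # hypothetical file name; TRUE only if such a file is there

setwd(old_dir)               # go back to the original working directory
```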

Q9

Execute getwd() before and after changing your working directory (using either the drop-down menu or setwd), and write down the paths to the working directory before and after.


Importing Data

After setting the working directory, we are now ready to load data.

Before loading your own data, let’s try playing with some existing data. If you haven’t already, please download the following two data sets from Canvas to where you set your working directory.

  • Week2_R_basics_MLB_weights_vector.tsv

  • Week2_R_basics_MLB_data.tsv

v_mlb = scan("Week2_R_basics_MLB_weights_vector.tsv")

Remember, you need quotation marks (" ") around the file name.

The code above “scans” the values in Week2_R_basics_MLB_weights_vector.tsv and saves them into the vector v_mlb. Let’s get some basics of v_mlb.

length(v_mlb)
[1] 1034
summary(v_mlb)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  150.0   187.0   200.0   201.7   215.0   290.0 
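
To see how scan works without the Canvas file, the sketch below writes a few made-up numbers to a temporary file and reads them back. The values, tmp_file, and v_demo are all hypothetical and are not the MLB data.

```r
# Write a few made-up numbers to a temporary file (not the MLB data)
tmp_file = tempfile(fileext = ".tsv")
writeLines(c("180", "215", "210", "188"), tmp_file)

# scan() reads the numbers back into a numeric vector
v_demo = scan(tmp_file)
print(v_demo)
length(v_demo)
summary(v_demo)
```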
Q10

Get all the values under 200 and save them as a new vector called v200_mlb. Get the mean, median, variance, and lower and upper quartiles of this new variable.

More generally, we use read.table to read in something that looks like a spreadsheet.

Q11

Try the following two pieces of code. What data types are the columns in data_mlb1 and data_mlb2?

What is the major difference between them, and which one do you think we should use?

Hint: try ?read.table

data_mlb1 = read.table("Week2_R_basics_MLB_data.tsv")
summary(data_mlb1)
data_mlb2 = read.table("Week2_R_basics_MLB_data.tsv", header = TRUE)
summary(data_mlb2)
Q12 and Q13

Practice what you learned above for extracting a subset from data frames…

  • Get the first five rows of data_mlb2 and save them to data_mlb2_subset1. Then, get the mean Height for this subset of data.

  • Create a vector of logical values for rows with heights larger than 70. Use that vector to create a subset of data and save it to data_mlb2_subset2. Get the mean Weight for these individuals.


Data visualization

Let’s take a look at your serial dilution data….

Q14

(If you haven’t already) Prepare your serial dilution data as a table with the following three columns:

  • color

  • dilution factor

  • CFU

Save it as a csv or tsv file (in Google spreadsheet, you can go to File -> Download -> then choose the file type).

Import your data as mydata. (We will do the first two steps together in class.)

Use the dilution_factor and CFU columns to calculate the original concentration, just like you did in the Google spreadsheet, but this time using R.
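
read.table can read both csv and tsv files if you tell it which separator the file uses. A minimal sketch of the import step only, assuming a file whose first line holds the column names; the rows below are made up for illustration and are not real measurements, and tmp_csv and demo_csv are hypothetical names:

```r
# Made-up rows for illustration only; the header follows the three columns listed above
tmp_csv = tempfile(fileext = ".csv")
writeLines(c("color,dilution_factor,CFU",
             "red,100,52",
             "white,1000,7"), tmp_csv)

# For a comma-separated file, tell read.table that the separator is a comma
demo_csv = read.table(tmp_csv, header = TRUE, sep = ",")
# For a tab-separated (tsv) file, use sep = "\t" instead

names(demo_csv)   # column names taken from the first line of the file
str(demo_csv)     # check that the number columns were read as numbers
```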