How to use the problem sets

Each problem set will contain narrative text interspersed with R code. Some of this code will already be completed for you, while some you will need to fill in. You should read through all of the text (just as you are doing right now). The exercises (Activities) are interspersed throughout the text.

Right now, go up to the header and change the line "author:" to have your name and your group number.

Important

As you work through the exercises, really think about what you are doing. In the upcoming problem sets, you will be learning how to think through analyses step by step, from loading data, to visualizing it, to actually carrying out statistical tests, and then interpreting the results.

Read the code like you would a textbook (meaning studying and digesting as you go). It's easy to copy and paste or just run code chunks without really thinking about what the code is doing (we've done it too), but that's not a good way to learn.

R Markdown \(\rightarrow\) HTML

We will be working in R Markdown files. Markdown is a markup language that is designed to be not only readable in its text form, but also convertible into other formats. Markdown has a simple syntax which allows for things like bold ("Activities" above), italics, and facilities for including math notation (\(\alpha, \beta, \gamma, x^2\)). Even some pretty fancy math using LaTeX:

\[\bar{Y} = \frac{\sum^{n}_{i = 1}Y_i}{n}\]

R Markdown is the marriage of R and markdown, allowing you to write text and R code together that can include analysis, figures, and tables. R Markdown has been extended for making slides (like the ones we use in this class), adding references, bibliographies, and cross references.

Our usage in this class will be fairly pedestrian, but you can do some really complex things, like writing entire manuscripts using the bookdown package. Read more about R Markdown and bookdown.

R Markdown (.Rmd) files can be converted to different formats. We will use HTML (PDF and Word are other options).

  • Click "Knit HTML" at the top of this window right now.

You might get a message about installing some packages. Click yes to install the required packages. After a few seconds, another window should open with this document rendered as a visually pleasing file.

You have just compiled an R Markdown file into HTML. These Rmd and HTML files will be the basic currency of the problem sets you will do in this class.

Insert an R code chunk

An Rmd file can include R code that is run at compile time.

Activity

Activity means that you need to do something (i.e., be active). Before you submit your problem set, search (Ctrl-f / Cmd-f) for "Activity" and make sure you have answered all the questions.

Place the cursor on the line below this text and press Ctrl-Alt-i (Windows) / Cmd-Option-i (Mac) to insert an R code chunk.

Enter some R code into the chunk on the blank line: sqrt(2). Then compile the HTML. Your file will show the R code that is run and R's output (\(\sqrt{2} \approx 1.41\)).

You can also run code interactively from an Rmd file to the R console. To run a single line, press Ctrl-Enter / Cmd-Return. To run the current chunk, use Ctrl-Alt-c / Cmd-Shift-Return. This is a good way to test code that you are working on, rather than waiting to compile the Rmd to HTML (or whatever format you are using).

You can also enter the code for code chunks manually, but I find it easier to use the insert code chunk shortcut.

Naming R objects

There are not many restrictions on what you can name R objects. R reserves only a small set of words (such as if, else, function, TRUE, FALSE, and NULL); beyond those, pretty much anything goes. Object names can't start with a number (no 1a <- 1; that will give you an "unexpected symbol" error), but otherwise you are free to do what you want, even things you probably should not do. One of the main things to avoid is giving an object the same name as an R function.

Some names to avoid: c, mean, df, matrix, t, T, F. The last two are acceptable abbreviations for TRUE and FALSE. To avoid ambiguity, we recommend writing out TRUE and FALSE explicitly, rather than using the abbreviations.
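Here is a small sketch of why we recommend writing TRUE and FALSE in full. It assumes a fresh R session in which nobody has redefined T yet; the name t_is_true is just a placeholder for this example.

```r
# T starts out as an alias for TRUE
identical(T, TRUE)

# But T is an ordinary variable, so it can be overwritten by accident
T <- 0
t_is_true <- isTRUE(T)   # FALSE: T now holds the number 0

# TRUE itself is a reserved word, so it cannot be reassigned this way.

# Clean up: remove our copy so that T refers to the built-in value again
rm(T)
identical(T, TRUE)
```

Any code that relied on T meaning TRUE would have silently changed behavior between the assignment and the rm() call.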

If you want to take the mean of a vector x, we recommend using mean_x, x_bar, or x_mean.[1] There are two benefits of using one of these variable names over using mean.

  1. You don't confuse your mean object with the mean() function.
  2. What if you later want to take the mean of a different vector? Which one does mean refer to?

You could do this, for example:

sd <- sd(1:6)
sd
## [1] 1.870829
sd(4:10)
## [1] 2.160247

Activity

Execute the chunk above and look at the R console output. Explain what we have done here and what R must be doing without telling you. Write your answer after the ">" below. (">" is the Rmd syntax for a block quote, which will make finding your answers easier.)

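We stored the standard deviation of 1:6 (about 1.87) in an object named sd, and then called sd(4:10), which still returned the correct answer (about 2.16). So R does allow an object to share a name with a function: when a name is used as a function call, R looks for a function with that name and skips over objects that are not functions. It works, but it is confusing to read, which is why it is best avoided.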
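As a minimal sketch of what R is doing behind the scenes (assuming a fresh session; obj_is_function and sd_other are placeholder names for this example), you can check what the bare name sd refers to before and after the assignment:

```r
sd <- sd(1:6)                      # `sd` now also names a numeric object
obj_is_function <- is.function(sd) # FALSE: the bare name finds our number
sd_other <- sd(4:10)               # a call still finds the function sd()
sd_other

rm(sd)                             # remove our object
is.function(sd)                    # the name refers to the function again
```

Removing the object with rm() is a quick way to recover if you mask a function by accident.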

Vectors

Vectors are one of the fundamental data structures in R. They consist of data of all the same type (numeric, character, etc.) in a 1 X n structure. You can manually create vectors using the combine function c(). Some functions like seq(), rep(), and the random number generators (rnorm(), runif(), etc.) produce vectors by default.
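To see what "all the same type" means in practice, here is a short sketch (mixed and nums are placeholder names). When c() is given values of different types, R converts them all to a single common type rather than raising an error.

```r
mixed <- c(1, "a", TRUE)   # numeric, character, and logical inputs
class(mixed)               # "character": every element is converted to text
mixed

nums <- c(1, TRUE, FALSE)  # numeric and logical inputs
class(nums)                # "numeric": TRUE becomes 1 and FALSE becomes 0
nums
```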

Activity

Assign vectors with the following characteristics:

  1. 1, 6, 10, 14.75
  2. TRUE, TRUE, FALSE
  3. a, aa, aaa (as characters)
  4. The sequence 5 to 100 by 1
  5. The sequence 5 to 100 by 5
  6. The sequence starting with 5 and ending at 100 with a length of 60
  7. 17 repeated 10 times
  8. The sequence 1, 2, 3 where each is repeated 10 times in a row
  9. The sequence 1, 2, 3 repeated 10 times

Choose names for these vectors. Add your code to the block below.

df1 <- c(1,6,10,14.75)
df2 <- c(TRUE, TRUE, FALSE)
df3 <- c("a", "aa", "aaa")
df4 <- seq(5,100, by=1)
df5 <- seq(5,100, by=5)
df6 <- seq(5,100, length=60)
df7 <- rep(17, 10)
df8 <- rep(1:3, each=10)
df9 <- rep(1:3, 10)
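A brief sketch of two details in the code above (by_len and by_abbr are placeholder names). The full name of the seq() argument is length.out; writing length = 60 works because R matches argument names partially. For rep(), each and times give different orderings.

```r
by_len  <- seq(5, 100, length.out = 60)  # full argument name
by_abbr <- seq(5, 100, length = 60)      # abbreviated, matched partially
identical(by_len, by_abbr)
length(by_len)

rep(1:3, each = 2)    # 1 1 2 2 3 3
rep(1:3, times = 2)   # 1 2 3 1 2 3
```

Spelling out length.out and times makes the intent easier to read later.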

Working with relational operators

Relational operators compare values and are very important in R for selecting, subsetting, and choosing variables. The relational operators in R are:

  • == Equals
  • != Does not equal
  • > Greater than
  • < Less than
  • >= Greater than or equal to
  • <= Less than or equal to
  • %in% Is the comparator in the set?

When these operators are applied to vectors, the result is a vector of logicals (TRUEs and FALSEs).
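As a small illustration (scores and big are hypothetical names, not part of the problem set), the logical vector produced by a comparison can be stored, counted, and used to pick out elements:

```r
scores <- c(3, 8, 12, 8)   # example data
big <- scores > 5          # FALSE TRUE TRUE TRUE
big

scores[big]                # keep only the elements where `big` is TRUE
sum(scores == 8)           # TRUE counts as 1, so this counts the matches
c(8, 99) %in% scores       # TRUE FALSE: one answer per value on the left
```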

Activity

Use your vectors from above in the same order to test the following relational operators.

  1. Which values are greater than 5?
  2. Which values equal FALSE?
  3. Does this vector contain the string "a"?
  4. Which values are less than or equal to 10?
  5. Which values are greater than or equal to 10?
  6. Count the number of values less than 50 (hint, use sum())
  7. Count the number of values equal to 17
  8. Which values equal 1?
  9. Which values do not equal 1?
df1 > 5
## [1] FALSE  TRUE  TRUE  TRUE
df2 == FALSE
## [1] FALSE FALSE  TRUE
"a"%in%df3
## [1] TRUE
df4 <= 10
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
df5 >= 10
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
sum(df6 < 50)
## [1] 28
sum(df7 == 17)
## [1] 10
df8 == 1
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
df9 != 1
##  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
## [12]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [23]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE

Perils of relational operators

Outside of computer algebra systems, computers rarely store numbers in an exact form, particularly after any kind of numeric operation.

Instead, numerics are stored as floating-point values with limited precision (R's numerics carry 53 binary digits, which is about 15 to 16 significant decimal digits). For example:

a <- sqrt(2)
a
## [1] 1.414214
a * a
## [1] 2
a * a == 2
## [1] FALSE
all.equal(a * a, 2)
## [1] TRUE

Line by line, explain what the statements above are doing and the R output of each. Look at the help for all.equal() if you need to. Enter your explanation after the > below.

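a stores the square root of 2, and printing it shows the value rounded to 7 significant digits (1.414214). a * a prints as 2 because of that display rounding, but the stored result is not exactly 2: the square root of 2 cannot be represented exactly in 53 binary digits, so a tiny rounding error remains. That is why a * a == 2 returns FALSE. The all.equal() function instead tests whether the two values are equal within a small tolerance, which is TRUE in this case.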
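The same issue shows up with very ordinary numbers. The sketch below (lhs is a placeholder name) assumes a standard platform where R uses IEEE double precision, and shows two common ways to compare with a tolerance instead of ==.

```r
.Machine$double.digits         # binary digits of precision (53 on IEEE platforms)

lhs <- 0.1 + 0.2
lhs == 0.3                     # FALSE: the two sides differ by a rounding error
isTRUE(all.equal(lhs, 0.3))    # TRUE: equal within all.equal()'s tolerance
abs(lhs - 0.3) < 1e-9          # TRUE: an explicit tolerance of our choosing

print(sqrt(2)^2, digits = 17)  # show more digits than the default display
```

Wrapping all.equal() in isTRUE() is useful inside if statements, because all.equal() returns a description of the difference (not FALSE) when the values differ.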

Matrices

Matrices are rectangular objects (rows and columns) in which all of the cells have the same type of data. In most cases when you use matrices, you will have numbers only; however, matrices can hold characters, logicals, or factors as well.

Activity

  a. Use the matrix() function and rnorm(36, mean = 10, sd = 5) to create a 6 X 6 matrix. The rnorm() function draws random, normally distributed numbers. By supplying the mean and sd arguments, we can specify the mean and standard deviation of the distribution.
mat_1 <- matrix(rnorm(36, mean = 10, sd=5), nrow = 6, ncol = 6)
  b. Use the colMeans() function to calculate the column means of your matrix.
colMeans(mat_1)
## [1]  6.242655  9.674011  8.425551 12.261214  9.518980  9.531194
  c. Use an inequality to find all values less than 10.
mat_1 < 10
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## [2,]  TRUE FALSE  TRUE FALSE FALSE FALSE
## [3,]  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [4,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [5,]  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [6,]  TRUE FALSE FALSE FALSE  TRUE FALSE

What kind of matrix is returned?

A logical matrix.

  d. Use your inequality from part (c) and colMeans() to calculate the average proportion of values less than 10 in your matrix.
col_means <- colMeans(mat_1 <10)
mean(col_means)
## [1] 0.5555556

Compare the results of the column means with the results of part (c). What is R doing with the TRUE and FALSE values in order to be able to use colMeans()?

R converts each TRUE to 1 and each FALSE to 0 and then averages them, so each column mean is the proportion of values in that column that are less than 10.
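A small deterministic sketch of the same idea, using a made-up 2 X 3 matrix (m and below are placeholder names) so that the results can be checked by hand:

```r
m <- matrix(1:6, nrow = 2)   # filled by column: columns are (1,2), (3,4), (5,6)
below <- m < 4               # logical matrix: TRUE for the values 1, 2, and 3
below

colMeans(below)              # 1.0 0.5 0.0: proportion of TRUEs per column
mean(below)                  # 0.5: proportion of TRUEs overall
sum(below)                   # 3: number of TRUEs
```

Because every column has the same number of rows, the mean of the column means equals the overall mean of the logical matrix.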

data.frames

data.frames are one of the most important objects that you will work with in R. They are the closest thing to an Excel spreadsheet in R (with the added restriction that a column must be all of one data type). When you read in files from csv or from Excel, you will almost always get a data.frame or its cousin the "tibble" (tbl_df).

Activity

Create a data.frame with the following columns:

  • A character vector Tx_Group with the values "control" and "high", each repeated 10 times.
  • A numeric vector Replicate with values 1 and 2, each repeated 5 times for each of the 10 values of "control" and "high" in Tx_Group.
  • A numeric vector Productivity, where the first 10 values are normally distributed with a mean of 5 and standard deviation of 2 and the second 10 values are normally distributed with a mean of 8 and standard deviation of 2. c() will be useful here.

Include the argument stringsAsFactors = FALSE to tell R not to convert the strings to factors.

df <- data.frame(
  Tx_Group = rep(c("control", "high"), each = 10),
  Replicate = rep(1:2, each = 5),  # length 10, recycled to fill all 20 rows
  Productivity = c(rnorm(10, mean = 5, sd = 2), rnorm(10, mean = 8, sd = 2)),
  stringsAsFactors = FALSE
)

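Use the str() function to get information about the data.frame. This will allow you to verify that Tx_Group has the type character. Note that Replicate is stored as an integer (int) here because 1:2 creates integers. Had it been built with c(1, 2), R would treat it as a numeric (num), even though it only contains the whole numbers 1 and 2.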

str(df)
## 'data.frame':    20 obs. of  3 variables:
##  $ Tx_Group    : chr  "control" "control" "control" "control" ...
##  $ Replicate   : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ Productivity: num  6.027 4.257 6.26 0.841 8.121 ...
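The int label for Replicate in the output above comes from how the values were created. As a short sketch (toy, rep_int, and rep_num are made-up names), compare whole numbers made with the colon operator to the same values made with c():

```r
class(1:2)                 # "integer"
class(c(1, 2))             # "numeric"
class(c(1L, 2L))           # "integer": the L suffix marks an integer
identical(1:2, c(1, 2))    # FALSE: same values, different storage types
1:2 == c(1, 2)             # TRUE TRUE: the values still compare as equal

toy <- data.frame(
  rep_int = rep(1:2, each = 2),
  rep_num = rep(c(1, 2), each = 2)
)
str(toy)                   # rep_int is int, rep_num is num
```

For the analyses in this class the distinction rarely matters, but it explains why str() can report either type for whole numbers.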

Indexing

Taking subsets of objects in R is very common. This can include slicing or filtering rows, extracting one or more columns, and referencing columns in other functions.

You can use standard bracket notation [ ] to subset vectors, matrices, and data.frames. The latter two require a comma to denote rows and columns: [rows, columns]. You can also take a single column of a data.frame with the $ operator.
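Here is a brief sketch of these notations on a small made-up data.frame (toy and its column names are hypothetical and are not the data.frame from the activity):

```r
toy <- data.frame(
  id  = 1:4,
  grp = c("a", "a", "b", "b"),
  val = c(2.5, 3.1, 4.8, 5.0)
)

toy[, "val"]                 # one column by name: returns a vector
toy[, 3]                     # the same column by position
toy$val                      # the same column with $
toy[1:2, ]                   # rows 1 and 2, all columns
toy[1:2, "val"]              # rows 1 and 2 of one column
toy[, "val", drop = FALSE]   # keep the result as a one-column data.frame
```

Leaving the row or column position empty, as in toy[1:2, ], means "all of them".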

Activity

Use your data.frame from the question above. Extract the following subsets:

  1. The column Productivity using bracket notation
  2. The column Productivity using $ notation
  3. The second column (assume you don't know its name)
  4. Rows 1-10 of the entire data.frame
  5. Rows 1-10 of only the Productivity column
# 1
df[,3]
##  [1] 6.0267205 4.2569402 6.2599705 0.8409544 8.1214173 6.1348463 5.8128912
##  [8] 6.2865919 3.5764575 5.0107614 7.1426600 6.2947463 7.4059573 9.4081988
## [15] 9.5743480 5.3912148 8.8230215 4.5290656 9.4657620 8.1202318
#2
df$Productivity
##  [1] 6.0267205 4.2569402 6.2599705 0.8409544 8.1214173 6.1348463 5.8128912
##  [8] 6.2865919 3.5764575 5.0107614 7.1426600 6.2947463 7.4059573 9.4081988
## [15] 9.5743480 5.3912148 8.8230215 4.5290656 9.4657620 8.1202318
#3
df[, 2]
##  [1] 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2
#4
df[1:10,]
##    Tx_Group Replicate Productivity
## 1   control         1    6.0267205
## 2   control         1    4.2569402
## 3   control         1    6.2599705
## 4   control         1    0.8409544
## 5   control         1    8.1214173
## 6   control         2    6.1348463
## 7   control         2    5.8128912
## 8   control         2    6.2865919
## 9   control         2    3.5764575
## 10  control         2    5.0107614
#5
df[1:10,3]
##  [1] 6.0267205 4.2569402 6.2599705 0.8409544 8.1214173 6.1348463 5.8128912
##  [8] 6.2865919 3.5764575 5.0107614

We will do more complex filtering next week (e.g., only rows from replicate 1 where productivity is less than 5).
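As a preview, one possible way to write that kind of filter in base R is to combine two relational tests with & and use the result to select rows. This is only a sketch on made-up values (toy and keep are placeholder names), and next week's approach may differ.

```r
toy <- data.frame(
  Replicate    = c(1, 1, 2, 2),
  Productivity = c(4.2, 6.1, 3.9, 7.5)
)

# TRUE only where both conditions hold
keep <- toy$Replicate == 1 & toy$Productivity < 5
keep

toy[keep, ]   # rows from replicate 1 with productivity below 5
```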

Basic calculations in R

R can do basic (and not so basic) calculations. First set up a vector of numbers to work with.

# Set the random number generator seed
set.seed(5)

# Generate 10 random numbers between 0 and 1
x <- runif(10)
x
##  [1] 0.2002145 0.6852186 0.9168758 0.2843995 0.1046501 0.7010575 0.5279600
##  [8] 0.8079352 0.9565001 0.1104530

Activity

Try out some R functions: mean(x), sd(x), median(x), t(x). Type them into the code chunk below and compile.

These functions take a vector or matrix and return a single value or a new matrix. Contrast this behavior with x^2. Enter that and see what you get.

Try functions operating on the matrix you created above.

mean(x)
## [1] 0.5295264
sd(x)
## [1] 0.3313606
median(x)
## [1] 0.6065893
t(x)
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]    [,7]
## [1,] 0.2002145 0.6852186 0.9168758 0.2843995 0.1046501 0.7010575 0.52796
##           [,8]      [,9]    [,10]
## [1,] 0.8079352 0.9565001 0.110453
x^2
##  [1] 0.04008583 0.46952452 0.84066119 0.08088305 0.01095165 0.49148156
##  [7] 0.27874174 0.65275929 0.91489249 0.01219987
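To make the contrast concrete, here is a short sketch with a small made-up vector (v is a placeholder name) comparing the size of what each kind of operation returns:

```r
v <- c(1, 2, 3, 4)

length(mean(v))   # 1: a summary function collapses the vector to one value
length(v^2)       # 4: arithmetic is applied to each element in turn
v^2               # 1 4 9 16

dim(v)            # NULL: a plain vector has no dimensions
dim(t(v))         # 1 4: t() returns a matrix with 1 row and 4 columns
```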

[1] KMM prefers the latter (x_mean) because of RStudio's auto-completion feature.