How to use the problem sets

Each problem set will contain narrative text interspersed with R code. Some of this code will already be completed for you, while some you will need to fill in. You should read through all of the text (just as you are doing right now). The exercises (Activities) are interspersed throughout the text.

Right now, go up to the header and change the line “author:” to have your name and your group number.

Important

As you work through the exercises, really think about what you are doing. In the upcoming problem sets, you will be learning how to think through analyses step by step, from loading data, to visualizing it, to actually carrying out statistical tests, and then interpreting the results.

Read the code like you would a textbook (meaning studying and digesting as you go). It’s easy to copy and paste or just run code chunks without really thinking about what the code is doing (we’ve done it too), but that’s not a good way to learn.

R Markdown \(\rightarrow\) HTML

We will be working in R Markdown files. Markdown is a markup language that is designed to be not only readable in its text form, but also able to be converted into other formats. Markdown has a simple syntax which allows for things like bold (“Activities” above), italics, and facilities for including math notation (\(\alpha, \beta, \gamma, x^2\)). Even some pretty fancy math using LaTeX:

\[\bar{Y} = \frac{\sum^{n}_{i = 1}Y_i}{n}\]

R Markdown is the marriage of R and markdown, allowing you to write text and R code together that can include analysis, figures, and tables. R Markdown has been extended for making slides (like the ones we use in this class), adding references, bibliographies, and cross references.

Our usage in this class will be fairly pedestrian, but you can do some really complex things, like writing entire manuscripts using the bookdown package, making a CV, or generating a website. Read more about R Markdown.

R Markdown (.Rmd) files can be converted to different formats. We will use HTML (PDF and Word are other options).

  • Click “Knit” at the top of this window right now.

You might get a message about installing some packages. Click yes to install the required packages. After a few seconds, another window should open with this document rendered as a visually pleasing file.

You have just compiled an R Markdown file into HTML. These Rmd and HTML files will be the basic currency of the problem sets and progress checks you will do in this class.

Insert an R code chunk

An Rmd file can include R code that is run at compile time.

Activity

Activity means that you need to do something (i.e., be active). Before you submit your problem set, search (Ctrl-f / Cmd-f) for “Activity” and make sure you have answered all the questions.

Place the cursor on the line below this text and press Ctrl-Alt-i (Windows) / Cmd-Option-i (Mac) to insert an R code chunk.

sqrt(2)
## [1] 1.414214

Enter some R code into the chunk on the blank line: sqrt(2). Then compile the HTML. Your file will show the R code that is run and R’s output (\(\sqrt{2} \approx 1.41\)).

You can also run code interactively from an Rmd file to the R console. To run a single line, press Ctrl-Enter / Cmd-Return. To run the current chunk, use Ctrl-Alt-c / Cmd-Shift-Return. This is a good way to test code that you are working on, rather than waiting to compile the Rmd to HTML (or whatever format you are using).

You can also enter the code to set up code chunks manually, but we find it easier to use the insert code chunk shortcut.

Naming R objects

There are not many restrictions on what you can name R objects. R reserves only a small set of words (for example if, function, TRUE, and NULL; see ?Reserved); beyond those, pretty much anything goes. Object names can’t start with a number (no 1a <- 1; that will give you an “unexpected symbol” error), but otherwise you are free to do what you want, even things you probably should not do. One of the main things to avoid is naming your objects the same as an R function.

Some names to avoid: c, mean, df, matrix, t, T, F. The last two are acceptable abbreviations for TRUE and FALSE. To avoid ambiguity, we recommend writing out TRUE and FALSE explicitly, rather than using the abbreviations.
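A minimal sketch of why the abbreviations are risky: T and F are ordinary variables that happen to start out as TRUE and FALSE, so they can be reassigned, whereas TRUE and FALSE themselves cannot. The names T_before and T_after are made up for this illustration, and the sketch assumes T has not already been reassigned in your session.

```r
# T is just a variable that is initially bound to TRUE
T_before <- T        # TRUE in a fresh session

T <- 0               # legal, but T no longer means TRUE
T_after <- T

isTRUE(T_before)     # TRUE
isTRUE(T_after)      # FALSE: any code relying on T is now silently wrong

rm(T)                # remove our copy so T refers to base R's TRUE again
```

Writing TRUE and FALSE in full avoids this problem because R will not let you assign to them.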

If you want to take the mean of a vector x, we recommend using mean_x, x_bar, or x_mean (see footnote 1). There are two benefits of using one of these variable names over using mean.

  1. You don’t confuse your mean object with the mean() function. R will usually figure out which one you want, but we always encourage users to be explicit rather than relying on defaults.
  2. What if you later want to take the mean of a different vector? Which one does mean refer to?

You could do this, for example:

sd <- sd(1:6)  # set sd to the standard deviation of 1 through 6
sd             # print the value stored in sd
## [1] 1.870829
sd(4:10)       # calculate the standard deviation of 4 through 10
## [1] 2.160247

Activity

Execute the chunk above and look at the R console output. Explain what we have done here and what R must be doing without telling you. Write your answer after the “>” below. (“>” is the Rmd syntax for a block quote, which will make finding your answers easier.)

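> Line 1: set the variable sd to the standard deviation of the values 1, 2, 3, 4, 5, and 6. Line 2: print the value of sd. Line 3: calculate the standard deviation of the values 4, 5, 6, 7, 8, 9, and 10 and print it. Even though an object named sd now exists, R still finds the function sd() when the name is followed by parentheses, so the object and the function coexist.

Now add comments to the code chunk above briefly annotating your answer above.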

Vectors

Vectors are one of the fundamental data structures in R. They consist of data of all the same type (numeric, character, etc.) in a 1 X n structure. You can manually create vectors using the combine function c(). Some functions like seq(), rep(), and the random number generators (rnorm(), runif(), etc.) produce vectors by default.
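As a rough illustration of the functions named above, the sketch below builds a few small vectors. The object names (mixed, evens, grid, abab, aabb) and the values are invented for this example and are deliberately different from the activity that follows.

```r
# c() combines values; mixing types coerces everything to one common type
mixed <- c(1, "two", TRUE)
class(mixed)                                     # "character"

# seq() builds regular sequences, either by step size or by length
evens <- seq(from = 2, to = 20, by = 2)          # 2, 4, ..., 20
grid  <- seq(from = 0, to = 1, length.out = 5)   # 0, 0.25, 0.5, 0.75, 1

# rep() repeats: times repeats the whole vector, each repeats every element
abab <- rep(c("a", "b"), times = 3)              # "a" "b" "a" "b" "a" "b"
aabb <- rep(c("a", "b"), each = 3)               # "a" "a" "a" "b" "b" "b"

length(evens)                                    # 10
```

Note how c(1, "two", TRUE) does not raise an error: because a vector can hold only one type, R quietly converts every element to character.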

Activity

Assign vectors with the following characteristics:

  1. 1, 6, 10, 14.75
  2. TRUE, TRUE, FALSE
  3. a, aa, aaa (as characters)
  4. The sequence 5 to 100 by 1
  5. The sequence 5 to 100 by 5
  6. The sequence starting with 5 and ending at 100 with a length of 60
  7. 17 repeated 10 times
  8. The sequence 1, 2, 3 where each is repeated 10 times in a row
  9. The sequence 1, 2, 3 repeated 10 times

Choose names for these vectors. Add your code to the block below.

#1
v1<-c(1,6,10,14.75)
#2
v2<-c(TRUE,TRUE,FALSE)
#3
v3<-c("a","aa","aaa")  
#4  
v4<-seq(from=5, to=100, by=1) # seq(5:100) would return 1, 2, ..., 96 instead
#5
v5<-seq(from=5, to=100, by=5)
#6
v6<-seq(from=5, to=100, length.out=60)
#7
v7<-rep(17, 10)
#8
v8<-rep(1:3, each=10)
#9
v9<-rep(1:3, 10)

Working with relational operators

Binary operations are very important in R for selecting, subsetting, and choosing variables. The relational operators in R are:

  • == Equals
  • != Does not equal
  • > Greater than
  • < Less than
  • >= Greater than or equal to
  • <= Less than or equal to
  • %in% Is the comparator in the set?

When these operators are applied to vectors, the result is a vector of logicals (TRUEs and FALSEs).

Activity

Use your vectors from above in the same order to test the following relational operators.

  1. Which values are greater than 5?
  2. Which values equal FALSE?
  3. Does this vector contain the string “a”?
  4. Which values are less than or equal to 10?
  5. Which values are greater than or equal to 10?
  6. Count the number of values less than 50 (hint, use sum())
  7. Count the number of values equal to 17
  8. Which values equal 1?
  9. Which values do not equal 1?

#1
v1>5
## [1] FALSE  TRUE  TRUE  TRUE
#2
v2==FALSE
## [1] FALSE FALSE  TRUE
#3
v3%in%"a"
## [1]  TRUE FALSE FALSE
#4
v4<=10
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#5
v5>=10
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#6
sum(v6<50)
## [1] 28
#7
sum(v7==17)
## [1] 10
#8
v8==1
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE
#9
v9!=1
##  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
## [13] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
## [25] FALSE  TRUE  TRUE FALSE  TRUE  TRUE
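As a compact recap of the relational operators in this section, here is a small sketch on a made-up vector called scores (a hypothetical name and values, not part of the problem set). It also shows two common follow-ups to a logical vector: counting with sum() and locating matches with which().

```r
scores <- c(3, 8, 12, 8, 20)    # hypothetical values

scores > 8                      # FALSE FALSE  TRUE FALSE  TRUE
sum(scores > 8)                 # 2: TRUE counts as 1, FALSE as 0
which(scores == 8)              # 2 4: positions of the matches
scores[scores != 8]             # 3 12 20: keep only values that pass the test

# %in% asks, for each element on the left, whether it appears on the right
8 %in% scores                   # TRUE: is 8 anywhere in scores?
scores %in% c(8, 20)            # FALSE  TRUE FALSE  TRUE  TRUE
```

The direction of %in% matters: 8 %in% scores returns a single answer about the whole vector, while scores %in% c(8, 20) returns one answer per element of scores.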

Perils of relational operators

Computers only infrequently store numbers internally as integers (computer algebra systems do this), particularly after any kind of numeric operation.

In contrast, numerics are often rounded to some level of accuracy (R stores them with 53 binary digits of precision, which is about 15 to 16 significant decimal digits). For example:

a <- sqrt(2)         # set a to the square root of 2
a                    # print the value of a
## [1] 1.414214
a * a                # multiply a by itself; the printed value is rounded to 2
## [1] 2
a * a == 2           # is a * a exactly equal to 2? It is not, so this prints FALSE
## [1] FALSE
all.equal(a * a, 2)  # is a * a equal to 2 within a small tolerance? It is, so this prints TRUE
## [1] TRUE

Line by line, explain what the statements above are doing and the R output of each. Look at the help for all.equal() if you need to. Enter your explanation after the > below.

> Line 1: assign the square root of 2 to a. Line 2: print the value of a, which is displayed rounded to 7 significant digits. Line 3: multiply a by itself and print the result, which is displayed as 2 after rounding. Line 4: test whether a * a is exactly equal to 2; because the stored value differs slightly from 2, R prints FALSE. Line 5: test whether a * a and 2 are equal within a small tolerance; because they are very close, R prints TRUE.
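The same pitfall shows up with simple decimal fractions. The sketch below uses a made-up variable name, total, and shows two common ways to compare with a tolerance. One detail worth knowing: all.equal() returns either TRUE or a character description of the difference, so wrapping it in isTRUE() is the safe form inside an if() statement. The tolerance 1e-8 is an arbitrary choice for illustration.

```r
total <- 0.1 + 0.2

total == 0.3                     # FALSE: the stored values differ slightly
isTRUE(all.equal(total, 0.3))    # TRUE: equal within a small tolerance
abs(total - 0.3) < 1e-8          # TRUE: the same idea with an explicit tolerance

# Ask for more digits than the default to reveal what printing normally hides
print(total, digits = 17)
print(sqrt(2)^2, digits = 17)
```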

Matrices

Matrices are rectangular objects (rows and columns) in which all of the cells have the same type of data. In most cases when you use matrices, you will have numbers only. However, matrices can hold characters, logicals, or factors as well.

Activity

  a. Use the matrix() function and rnorm(36, mean = 10, sd = 5) to create a 6 X 6 matrix. The rnorm() function draws random normally distributed numbers. By supplying the mean and sd arguments, we can specify the mean and standard deviation of the distribution.

You might need to refer to the help for matrix (?matrix) for assistance with arguments.

set.seed(37)
m1 <- matrix(rnorm(36, mean = 10, sd = 5), nrow = 6, ncol = 6) # data first, then the number of rows and columns

What is the default order for creating matrices?

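By default, matrix() fills the matrix column by column (byrow = FALSE): the first six values go down column 1, the next six down column 2, and so on. Its arguments are, in order, data, nrow, ncol, and byrow, so in the matrix above the rnorm() output is the data argument.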
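To see the fill order directly, it helps to build a matrix from values whose order is obvious. This is a small sketch with invented object names (by_col, by_row), using 1:6 instead of random numbers.

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                 # default: byrow = FALSE
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```

With the default, the values run down each column before moving to the next one; byrow = TRUE fills across each row instead.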

  b. Use the colMeans() function to calculate the column means of your matrix.
colMeans(m1)
## [1]  9.692717 11.586470  7.173573  9.263722  8.954953 10.591404
  c. Use an inequality to find all values less than 10.
m1<10 #TRUE where the value is less than 10, FALSE otherwise
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] FALSE  TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [3,] FALSE FALSE FALSE  TRUE FALSE FALSE
## [4,]  TRUE FALSE  TRUE FALSE FALSE  TRUE
## [5,]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## [6,]  TRUE FALSE  TRUE  TRUE FALSE FALSE

What kind of matrix is returned?

It is a 6 x 6 matrix of logicals, where each cell is either TRUE or FALSE.

  d. Use your inequality from part (c) and colMeans() to calculate the proportion of values less than 10 in each column of your matrix.
colMeans(m1<10) #find mean of each column if true = 1 and false = 0
## [1] 0.5000000 0.3333333 0.5000000 0.5000000 0.3333333 0.3333333

Compare the results of the column means with the results of part (c). What is R doing with the TRUE and FALSE values in order to be able to use colMeans()?

R coerces FALSE to 0 and TRUE to 1, so the mean of each logical column is the proportion of TRUE values in that column.
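The coercion of logicals to numbers can be checked on a tiny example. The names flags and small below are hypothetical and chosen only for this sketch.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)   # hypothetical logical vector

as.numeric(flags)                     # 1 0 1 1
sum(flags)                            # 3: the number of TRUE values
mean(flags)                           # 0.75: the proportion of TRUE values

# The same idea column by column on a small matrix
small <- matrix(c(1, 12, 3, 14, 5, 16), nrow = 2)
small < 10                            # one TRUE and one FALSE in every column
colMeans(small < 10)                  # 0.5 0.5 0.5
```

This is why sum() counts matches and mean() or colMeans() gives proportions when applied to the result of a comparison.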

data.frames

data.frames (and their cousins, tibbles) are among the most important objects that you will work with in R. They are the closest thing to an Excel spreadsheet in R (with the added restriction that a column must be all of one data type).

Activity

Create a data.frame with the following columns:

  • A character vector Tx_Group with the values “control” and “high”, each repeated 10 times.
  • A numeric vector Replicate with values 1 and 2, repeated 5 times each for both values in Tx_Group.
  • A numeric vector Productivity, where the first 10 values are normally distributed with a mean of 5 and standard deviation of 2 and the second 10 values are normally distributed with a mean of 8 and standard deviation of 2. c() will be useful here.

Include the argument stringsAsFactors = FALSE to tell R not to convert the strings to factors.

set.seed(1)
#Tx_Group Vector
Tx_Group <- rep(c("control", "high"), each = 10)

#Replicate vector
controlrep <- rep(1:2, each =5)
highrep <- rep(1:2, each =5)
Replicate <- c(controlrep, highrep)

#Productivity vector
first_10 <- rnorm(10, mean = 5, sd = 2)
second_10 <- rnorm(10, mean = 8, sd = 2)
Productivity <- c(first_10, second_10)

#data.frame
df1<- data.frame(Tx_Group, Replicate, Productivity, stringsAsFactors = FALSE)

Use the str() function to get information about the data.frame. This will allow you to verify that Tx_Group has the type character. Note that Replicate is stored as an integer (int) in the output below because it was built from 1:2; building it from c(1, 2) would give a numeric (num) instead.

str(df1)
## 'data.frame':    20 obs. of  3 variables:
##  $ Tx_Group    : chr  "control" "control" "control" "control" ...
##  $ Replicate   : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ Productivity: num  3.75 5.37 3.33 8.19 5.66 ...
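Whether str() reports int or num depends on how the vector was built. The following sketch (with made-up names int_rep and num_rep) contrasts the colon operator, which produces integers, with c(), which produces doubles from plain number literals.

```r
class(1:2)                  # "integer"
class(c(1, 2))              # "numeric"
class(c(1L, 2L))            # "integer": the L suffix marks integer literals

int_rep <- rep(1:2, each = 5)
num_rep <- rep(c(1, 2), each = 5)

is.integer(int_rep)         # TRUE
is.integer(num_rep)         # FALSE
is.numeric(int_rep)         # TRUE: is.numeric() accepts integers as well
all(int_rep == num_rep)     # TRUE: the values compare equal either way
```

For most analyses in this kind of problem set the distinction is harmless, since the values compare equal and arithmetic works on both.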

Indexing

Taking subsets of objects in R is very common. This can include slicing or filtering rows, extracting one or more columns, and referencing columns in other functions.

You can use standard bracket notation [ ] to subset vectors, matrices, and data.frames. The latter two require a comma to denote rows and columns: [rows, columns]. You can also take a single column of a data.frame with the $ operator.
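A short sketch of the [rows, columns] pattern on a toy data.frame may help before the activity. The object toy and its columns are invented for this example, and the last two lines go slightly beyond the text above (drop = FALSE and a logical row filter) as optional extras.

```r
toy <- data.frame(id = 1:4,
                  group = c("a", "a", "b", "b"),
                  value = c(2.5, 7.1, 4.0, 9.3),
                  stringsAsFactors = FALSE)

toy[2, ]                               # row 2, all columns
toy[, "value"]                         # one column, returned as a vector
toy$value                              # the same column with $ notation
toy[c(1, 3), c("id", "value")]         # rows 1 and 3, two named columns

toy[, "value", drop = FALSE]           # keep the result as a data.frame
toy[toy$value > 4, ]                   # rows where value is greater than 4
```

Leaving the row position empty (as in toy[, "value"]) means "all rows", and leaving the column position empty means "all columns".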

Activity

Use your data.frame from the question above. Extract the following subsets:

  1. The column Productivity using bracket notation
  2. The column Productivity using $ notation
  3. The second column (assume you don’t know its name)
  4. Rows 1-10 of the entire data.frame
  5. Rows 1-10 of only the Productivity column
# 1
df1[,"Productivity"]
##  [1]  3.747092  5.367287  3.328743  8.190562  5.659016  3.359063  5.974858
##  [8]  6.476649  6.151563  4.389223 11.023562  8.779686  6.757519  3.570600
## [15] 10.249862  7.910133  7.967619  9.887672  9.642442  9.187803
# 2
df1$Productivity
##  [1]  3.747092  5.367287  3.328743  8.190562  5.659016  3.359063  5.974858
##  [8]  6.476649  6.151563  4.389223 11.023562  8.779686  6.757519  3.570600
## [15] 10.249862  7.910133  7.967619  9.887672  9.642442  9.187803
# 3
df1[,2]
##  [1] 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2
# 4
df1[1:10,]
##    Tx_Group Replicate Productivity
## 1   control         1     3.747092
## 2   control         1     5.367287
## 3   control         1     3.328743
## 4   control         1     8.190562
## 5   control         1     5.659016
## 6   control         2     3.359063
## 7   control         2     5.974858
## 8   control         2     6.476649
## 9   control         2     6.151563
## 10  control         2     4.389223
# 5
df1[1:10,"Productivity"]
##  [1] 3.747092 5.367287 3.328743 8.190562 5.659016 3.359063 5.974858 6.476649
##  [9] 6.151563 4.389223

We will do more complex filtering next week (e.g., only rows from replicate 1 where productivity is less than 5).

Basic calculations in R

R can do basic (and not so basic) calculations. First set up a vector of numbers to work with.

# Set the random number generator seed
set.seed(5)

# Generate 10 random numbers between 0 and 1
x <- runif(10)
x
##  [1] 0.2002145 0.6852186 0.9168758 0.2843995 0.1046501 0.7010575 0.5279600
##  [8] 0.8079352 0.9565001 0.1104530

Activity

Try out some R functions: mean(x), sd(x), median(x), t(x). Type them into the code chunk below and compile.

These functions take a vector or matrix and return a single value or a new matrix. Contrast this behavior with x^2. Enter that and see what you get.

Try functions operating on the matrix you created above.
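The contrast between summarizing functions and element-wise operations can be sketched on a tiny vector and matrix. The names v and m are placeholders for this illustration only.

```r
v <- c(1, 2, 3, 4)

mean(v)              # 2.5: one value summarizing the whole vector
v^2                  # 1 4 9 16: one result per element
length(mean(v))      # 1
length(v^2)          # 4

m <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)

m^2                  # element-wise squares
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16

m %*% m              # matrix product, which is a different operation
##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22
```

The ^ operator always works element by element; matrix multiplication in R uses %*%.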

#mean of matrix
mean(m1)
## [1] 9.543806
#standard deviation of matrix
sd(m1)
## [1] 5.020639
#median of matrix
median(m1)
## [1] 10.51905
#transpose matrix
t(m1)
##           [,1]      [,2]      [,3]      [,4]       [,5]      [,6]
## [1,] 10.623770 11.910373 12.896214  8.531259  5.8582542  8.336432
## [2,]  9.039202 16.814914 14.279772 11.079977  8.1114895 10.193468
## [3,] 17.124075 14.911550 11.552322  1.662358 -3.5242485  1.315379
## [4,] 12.015376  6.339636  8.608195 16.421153 14.1201465 -1.922178
## [5,] 13.695011  7.704763 10.414325 10.120118 -0.5198454 12.315343
## [6,] 11.785339  7.392983 13.276978  8.895007 11.5495392 10.648575
#square each element of the matrix (element-wise, not a matrix product)
m1^2
##           [,1]      [,2]       [,3]       [,4]        [,5]      [,6]
## [1,] 112.86449  81.70718 293.233956 144.369257 187.5533339 138.89422
## [2,] 141.85698 282.74132 222.354309  40.190990  59.3633748  54.65620
## [3,] 166.31233 203.91189 133.456151  74.101024 108.4581635 176.27813
## [4,]  72.78239 122.76590   2.763435 269.654280 102.4167922  79.12114
## [5,]  34.31914  65.79626  12.420327 199.378537   0.2702393 133.39186
## [6,]  69.49610 103.90678   1.730222   3.694768 151.6676835 113.39216

Cleaning up code

Whitespace is your friend. Imaginetryingtoreadasentencethathasnospacing. It’s pretty difficult. Writing clean code helps your future self and those you work with to understand the steps of your analysis.

Activity

Take the following code chunk and make the code more readable by adding whitespace (and carriage returns) as well as explicitly naming some of the arguments (check the help files for the assumed ordering of arguments).

library(tibble) # provides tibble(); assumed to be installed

M1 <- data.frame(x = 11:20,
                 y = 20:11,
                 l = letters[1:10]) # letters holds the 26 lowercase letters

M2 <- tibble(ID = seq(from = 1, to = 100),
             # 100 draws from a normal distribution with mean 10 and sd 5
             x = rnorm(n = 100, mean = 10, sd = 5),
             # 8.4 times each x, plus standard normal noise (mean 0, sd 1)
             y = 8.4 * x + rnorm(n = 100))

What is letters?

letters is a built-in constant, not a function: it is a character vector containing the 26 lowercase letters of the alphabet.

Explain how y is constructed in M2.

Each value in the vector x is multiplied by 8.4, and a random value drawn from a standard normal distribution (rnorm(100), with mean 0 and sd 1) is added to it.


  1. KMM prefers the latter (x_mean) because of RStudio’s auto-completion feature.