Each problem set will contain narrative text interspersed with R code. Some of this code will already be completed for you, while some you will need to fill in. You should read through all of the text (just as you are doing right now). The exercises (Activities) are interspersed throughout the text.
Right now, go up to the header and change the line "author:" to have your name and your group number.
As you work through the exercises, really think about what you are doing. In the upcoming problem sets, you will be learning how to think through analyses step by step, from loading data, to visualizing it, to actually carrying out statistical tests, and then interpreting the results.
Read the code like you would a textbook (meaning studying and digesting as you go). It's easy to copy and paste or just run code chunks without really thinking about what the code is doing (we've done it too), but that's not a good way to learn.
We will be working in R Markdown files. Markdown is a markup language that is designed to be not only readable in its text form, but also able to be converted into other formats. Markdown has a simple syntax which allows for things like bold ("Activities" above), italics, and facilities for including math notation (\(\alpha, \beta, \gamma, x^2\)). Even some pretty fancy math using LaTeX:
\[\bar{Y} = \frac{\sum^{n}_{i = 1}Y_i}{n}\]
R Markdown is the marriage of R and markdown, allowing you to write text and R code together that can include analysis, figures, and tables. R Markdown has been extended for making slides (like the ones we use in this class), adding references, bibliographies, and cross references.
Our usage in this class will be fairly pedestrian, but you can do some really complex things, like writing entire manuscripts using the bookdown package. Read more about R Markdown and bookdown.
R Markdown (.Rmd) files can be converted to different formats. We will use HTML (PDF and Word are other options).
You might get a message about installing some packages. Click yes to install the required packages. After a few seconds, another window should open with this document rendered as a visually pleasing file.
You have just compiled an R Markdown file into HTML. These Rmd and HTML files will be the basic currency of the problem sets you will do in this class.
An Rmd file can include R code that is run at compile time.
Activity means that you need to do something (i.e., be active). Before you submit your problem set, search (Ctrl-f / Cmd-f) for "Activity" and make sure you have answered all the questions.
Place the cursor on the line below this text and press Ctrl-Alt-i (Windows) / Cmd-Option-i (Mac) to insert an R code chunk.
Enter some R code into the chunk on the blank line: sqrt(2). Then compile the HTML. Your file will show the R code that is run and R's output (\(\sqrt{2} = 1.41\)).
You can also run code interactively from an Rmd file to the R console. To run a single line, press Ctrl-Enter / Cmd-Return. To run the current chunk, use Ctrl-Alt-c / Cmd-Shift-Return. This is a good way to test code that you are working on, rather than waiting to compile the Rmd to HTML (or whatever format you are using).
You can also enter the code for code chunks manually, but I find it easier to use the insert code chunk shortcut.
There are not many restrictions on what you can name R objects. R reserves only a handful of words (such as if, else, TRUE, FALSE, and NULL), so beyond those, pretty much anything goes. Object names can't start with a number (no 1a <- 1; that will give you an "unexpected symbol" error), but otherwise you are free to do what you want, even things you probably should not do. One of the main things to avoid is naming your objects the same as an R function.
Some names to avoid: c, mean, df, matrix, t, T, F. The last two are acceptable abbreviations for TRUE and FALSE. To avoid ambiguity, we recommend writing out TRUE and FALSE explicitly, rather than using the abbreviations.
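As a small sketch of why the abbreviations are risky: T and F are ordinary variables that can be reassigned, while TRUE and FALSE are reserved words. The object names t_before, t_after, and true_is_protected are made up for this illustration.

```r
# T starts out as an ordinary variable bound to TRUE
t_before <- isTRUE(T)

# Nothing stops you (or someone else's script) from reassigning T
T <- 0
t_after <- isTRUE(T)

# TRUE is a reserved word, so trying to assign to it is an error
attempt <- try(eval(parse(text = "TRUE <- 0")), silent = TRUE)
true_is_protected <- inherits(attempt, "try-error")

# Remove our copy of T so that T refers to base R's TRUE again
rm(T)

print(c(t_before, t_after, true_is_protected))
```

Code that relies on T would silently change meaning after the reassignment, which is the ambiguity that writing TRUE and FALSE in full avoids.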
If you want to take the mean of a vector x, we recommend using mean_x, x_bar, or x_mean.[1] There are two benefits of using one of these variable names over using mean:

- You will not confuse a mean object with the mean() function.
- Otherwise, which one would mean refer to?

You could do this, for example:
sd <- sd(1:6)
sd
## [1] 1.870829
sd(4:10)
## [1] 2.160247
Execute the chunk above and look at the R console output. Explain what we have done here and what R must be doing without telling you. Write your answer after the ">" below. (">" is the Rmd for a block quote, which will make finding your answers easier.)
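We assigned the standard deviation of 1:6 (1.87) to an object named sd, and then called sd(4:10), which still returned 2.16. So R does allow an object to have the same name as a function: when a name is used in a function call, R skips over objects that are not functions and finds the sd() function. It works, but it is confusing to read, so it is better to choose a different name.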
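The behavior in this activity can be checked directly. This is only a sketch of R's lookup rules; sd_of_4_to_10 is a name chosen for the illustration.

```r
# Store a number under the same name as the sd() function
sd <- sd(1:6)

# Used as a value, the name refers to our numeric object
is.numeric(sd)

# Used in a call, R skips non-function objects and finds stats::sd()
sd_of_4_to_10 <- sd(4:10)

# The function itself was never overwritten, only masked by our object
is.function(stats::sd)

# Remove our object to avoid confusion later
rm(sd)
```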
Vectors are one of the fundamental data structures in R. They consist of data of all the same type (numeric, character, etc.) in a 1 X n structure. You can manually create vectors using the combine function c(). Some functions like seq(), rep(), and the random number generators (rnorm(), runif(), etc.) produce vectors by default.
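A brief sketch of the "all the same type" rule, using made-up object names: when c() is given mixed types, R coerces everything to a single type rather than raising an error.

```r
num_vec <- c(1.5, 2, 10)
mixed   <- c(1, "a", TRUE)   # mixed input: everything becomes character

class(num_vec)   # "numeric"
class(mixed)     # "character"
mixed            # "1" "a" "TRUE"

# seq() and rep() return vectors as well
quarter_steps <- seq(0, 1, by = 0.25)
length(quarter_steps)              # 5
xy_twice <- rep(c("x", "y"), times = 2)
xy_twice                           # "x" "y" "x" "y"
```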
Assign vectors with the following characteristics:
Choose names for these vectors. Add your code to the block below.
df1 <- c(1,6,10,14.75)
df2 <- c(TRUE, TRUE, FALSE)
df3 <- c("a", "aa", "aaa")
df4 <- seq(5,100, by=1)
df5 <- seq(5,100, by=5)
df6 <- seq(5,100, length=60)
df7 <- rep(17, 10)
df8 <- rep(1:3, each=10)
df9 <- rep(1:3, 10)
Binary operations are very important in R for selecting, subsetting, and choosing variables. The relational operators in R are:
==    Equals
!=    Does not equal
>     Greater than
<     Less than
>=    Greater than or equal to
<=    Less than or equal to
%in%  Is the comparator in the set?

When these operators are applied to vectors, the result is a vector of logicals (TRUEs and FALSEs).
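As a quick illustration on a made-up vector (scores is not part of the problem set), each comparison returns one logical per element, and sum() counts how many are TRUE:

```r
scores <- c(3, 8, 5, 10)

above_5 <- scores > 5            # FALSE  TRUE FALSE  TRUE
not_5   <- scores != 5           #  TRUE  TRUE FALSE  TRUE
in_set  <- scores %in% c(5, 10)  # FALSE FALSE  TRUE  TRUE

# sum() counts the TRUE values
n_above <- sum(above_5)          # 2
```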
Use your vectors from above in the same order to test the following relational operators.
df1 > 5
## [1] FALSE TRUE TRUE TRUE
df2 == FALSE
## [1] FALSE FALSE TRUE
"a"%in%df3
## [1] TRUE
df4 <= 10
## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
df5 >= 10
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
sum(df6 < 50)
## [1] 28
sum(df7 == 17)
## [1] 10
df8 == 1
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
df9 != 1
## [1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [12] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
## [23] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
Computers only infrequently store numbers internally as integers (computer algebra systems do this), particularly after any kind of numeric operation.
In contrast, numerics are stored as floating-point values with limited precision (R uses 53 binary digits, which is about 15 to 16 significant decimal digits). For example:
a <- sqrt(2)
a
## [1] 1.414214
a * a
## [1] 2
a * a == 2
## [1] FALSE
all.equal(a * a, 2)
## [1] TRUE
Line by line, explain what the statements above are doing and the R output of each. Look at the help for all.equal() if you need to. Enter your explanation after the > below.
a stores the square root of 2, which prints as 1.414214. a * a prints as 2 because R shows only about 7 significant digits by default. a * a == 2 is FALSE because a double carries only 53 bits of precision (about 16 significant decimal digits), so sqrt(2) is rounded and the product is not exactly 2. all.equal() tests whether two values are equal within a small tolerance, so it returns TRUE here.
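The same effect shows up with simple decimal fractions. The sketch below assumes a standard platform where R's doubles are IEEE 754 values; tiny_error and sum_is_close are names made up for the illustration.

```r
# Number of binary digits in the significand of a double
.Machine$double.digits

# 0.1, 0.2, and 0.3 cannot be represented exactly in binary
0.1 + 0.2 == 0.3                       # FALSE

# Printing more digits reveals the rounding error
print(0.1 + 0.2, digits = 17)

# The error from squaring sqrt(2) is tiny but not zero
tiny_error <- sqrt(2)^2 - 2

# Compare with a tolerance instead of ==
sum_is_close <- isTRUE(all.equal(0.1 + 0.2, 0.3))
```

Wrapping all.equal() in isTRUE() is useful inside if() statements, because all.equal() returns a character description rather than FALSE when the values differ.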
Matrices are rectangular objects (rows and columns) in which all of the cells have the same type of data. In most cases when you use matrices, you will have numbers only; however, matrices can hold characters, logicals, or factors as well.
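A minimal sketch with a small 2 X 3 matrix (the object names are made up): matrix() fills column by column unless byrow = TRUE is given, and every cell shares one type.

```r
# Filled column by column: the columns are (1, 2), (3, 4), (5, 6)
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)            # 2 3
m[2, 1]           # 2

# Filled row by row: the rows are (1, 2, 3) and (4, 5, 6)
m_by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_by_row[2, 1]    # 4

# A character matrix
chr_mat <- matrix(c("a", "b", "c", "d"), nrow = 2)
typeof(chr_mat)   # "character"
```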
Use the matrix() function and rnorm(36, mean = 10, sd = 5) to create a 6 X 6 matrix. The rnorm() function draws random, normally distributed numbers. By supplying the mean and sd arguments, we can specify the mean and standard deviation of the distribution.
mat_1 <- matrix(rnorm(36, mean = 10, sd = 5), nrow = 6, ncol = 6)
Use the colMeans() function to calculate the column means of your matrix.
colMeans(mat_1)
## [1] 6.242655 9.674011 8.425551 12.261214 9.518980 9.531194
mat_1 < 10
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## [2,]  TRUE FALSE  TRUE FALSE FALSE FALSE
## [3,]  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [4,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [5,]  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [6,]  TRUE FALSE FALSE FALSE  TRUE FALSE
What kind of matrix is returned?
Logical matrix
Use colMeans() to calculate the average proportion of values less than 10 in your matrix.
col_means <- colMeans(mat_1 < 10)
mean(col_means)
## [1] 0.5555556
Compare the results of the column means with the results of part (c). What is R doing with the TRUE and FALSE values in order to be able to use colMeans()?
Each TRUE is treated as 1 and each FALSE as 0, so the mean of a logical column is the proportion of TRUE values in that column.
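This coercion can be seen on a small made-up example (flags and lgl_mat are illustrative names only):

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(flags)   # 1 0 1 1
sum(flags)          # 3, the number of TRUE values
mean(flags)         # 0.75, the proportion of TRUE values

# The same idea column by column; the columns are (T, F), (T, T), (F, F)
lgl_mat <- matrix(c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE), nrow = 2)
col_props <- colMeans(lgl_mat)   # 0.5 1.0 0.0

# Every column has the same number of rows, so averaging the
# column proportions matches the overall proportion
overall_matches <- isTRUE(all.equal(mean(col_props), mean(lgl_mat)))
```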
data.frames are one of the most important objects that you will work with in R. They are the closest thing to an Excel spreadsheet in R (with the added restriction that a column must be all of one data type). When you read in files from csv or from Excel, you will almost always get a data.frame or its cousin the "tibble" (tbl_df).
Create a data.frame with the following columns:
- Tx_Group with the values "control" and "high", each repeated 10 times.
- Replicate with values 1 and 2, each repeated 5 times for each of the 10 values of "control" and "high" in Tx_Group.
- Productivity, where the first 10 values are normally distributed with a mean of 5 and standard deviation of 2 and the second 10 values are normally distributed with a mean of 8 and standard deviation of 2. c() will be useful here.

Include the argument stringsAsFactors = FALSE to tell R not to convert the strings to factors.
df <- data.frame(
  Tx_Group = rep(c("control", "high"), each = 10),
  Replicate = rep(1:2, each = 5),
  Productivity = c(rnorm(10, mean = 5, sd = 2), rnorm(10, mean = 8, sd = 2)),
  stringsAsFactors = FALSE
)
Use the str() function to get information about the data.frame. This will allow you to verify that Tx_Group has the type character. Note that Replicate is stored as an integer (int) here because 1:2 produces integers; had it been built with c(1, 2), R would treat it as a numeric (num).
str(df)
## 'data.frame': 20 obs. of 3 variables:
## $ Tx_Group : chr "control" "control" "control" "control" ...
## $ Replicate : int 1 1 1 1 1 2 2 2 2 2 ...
## $ Productivity: num 6.027 4.257 6.26 0.841 8.121 ...
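The str() output lists Replicate as int. A short sketch of where that type comes from (the object names are made up for the illustration): the colon operator produces integers, while c() with plain numbers produces doubles.

```r
from_colon <- 1:2
from_c     <- c(1, 2)
from_L     <- c(1L, 2L)   # the L suffix asks for integers explicitly

class(from_colon)   # "integer"
class(from_c)       # "numeric"
class(from_L)       # "integer"

# The values compare as equal, but the objects are not identical
from_colon == from_c            # TRUE TRUE
identical(from_colon, from_c)   # FALSE

# rep() keeps the type of its input
class(rep(1:2, each = 5))       # "integer"
```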
Taking subsets of objects in R is very common. This can include slicing or filtering rows, extracting one or more columns, and referencing columns in other functions.
You can use standard bracket notation [ ] to subset vectors, matrices, and data.frames. The latter two require a comma to denote rows and columns: [rows, columns]. You can also take a single column of a data.frame with the $ operator.
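A minimal sketch of each form on small made-up objects (v, m, and toy are illustrative names, not part of the problem set):

```r
# Vectors: a single index or logical vector inside [ ]
v <- c(10, 20, 30, 40)
v[2:3]       # 20 30
v[v > 15]    # 20 30 40

# Matrices: [rows, columns]; leave one side empty to keep all of it
m <- matrix(1:6, nrow = 2)
m[1, ]       # first row: 1 3 5
m[, 2]       # second column: 3 4

# data.frames: brackets or $
toy <- data.frame(id = 1:3, value = c(2.5, 4.0, 1.5))
toy$value          # a plain vector
toy[, "value"]     # the same vector
toy[2, ]           # second row, still a data.frame

# drop = FALSE keeps a single column as a data.frame
one_col <- toy[, "value", drop = FALSE]
```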
Use your data.frame from the question above. Extract the following subsets:
- Productivity using bracket notation
- Productivity using $ notation

# 1
df[,3]
## [1] 6.0267205 4.2569402 6.2599705 0.8409544 8.1214173 6.1348463 5.8128912
## [8] 6.2865919 3.5764575 5.0107614 7.1426600 6.2947463 7.4059573 9.4081988
## [15] 9.5743480 5.3912148 8.8230215 4.5290656 9.4657620 8.1202318
#2
df$Productivity
## [1] 6.0267205 4.2569402 6.2599705 0.8409544 8.1214173 6.1348463 5.8128912
## [8] 6.2865919 3.5764575 5.0107614 7.1426600 6.2947463 7.4059573 9.4081988
## [15] 9.5743480 5.3912148 8.8230215 4.5290656 9.4657620 8.1202318
#3
df[,"Replicate"]
## [1] 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2
#4
df[1:10,]
##    Tx_Group Replicate Productivity
## 1   control         1    6.0267205
## 2   control         1    4.2569402
## 3   control         1    6.2599705
## 4   control         1    0.8409544
## 5   control         1    8.1214173
## 6   control         2    6.1348463
## 7   control         2    5.8128912
## 8   control         2    6.2865919
## 9   control         2    3.5764575
## 10  control         2    5.0107614
#5
df[1:10,3]
## [1] 6.0267205 4.2569402 6.2599705 0.8409544 8.1214173 6.1348463 5.8128912
## [8] 6.2865919 3.5764575 5.0107614
We will do more complex filtering next week (e.g., only rows from replicate 1 where productivity is less than 5).
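As a preview, here is one possible base R sketch of that kind of filter. It uses a small made-up data.frame (toy) and is not necessarily the approach we will use next week.

```r
# Hypothetical data, only for illustration
toy <- data.frame(
  Replicate = c(1, 1, 2, 2, 1),
  Productivity = c(4.2, 6.1, 3.9, 7.5, 4.8)
)

# Combine two relational tests with & (element-wise AND)
keep <- toy$Replicate == 1 & toy$Productivity < 5
keep

# Use the logical vector in the row position of [rows, columns]
filtered <- toy[keep, ]
filtered

# subset() expresses the same filter without repeating toy$
filtered_2 <- subset(toy, Replicate == 1 & Productivity < 5)
```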
R can do basic (and not so basic) calculations. First set up a vector of numbers to work with.
# Set the random number generator seed
set.seed(5)
# Generate 10 random numbers between 0 and 1
x <- runif(10)
x
## [1] 0.2002145 0.6852186 0.9168758 0.2843995 0.1046501 0.7010575 0.5279600
## [8] 0.8079352 0.9565001 0.1104530
Try out some R functions: mean(x), sd(x), median(x), t(x). Type them into the code chunk below and compile.
These functions take a vector or matrix and return a single value or a new matrix. Contrast this behavior with x^2. Enter that and see what you get.
Try functions operating on the matrix you created above.
mean(x)
## [1] 0.5295264
sd(x)
## [1] 0.3313606
median(x)
## [1] 0.6065893
t(x)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.2002145 0.6852186 0.9168758 0.2843995 0.1046501 0.7010575 0.52796
## [,8] [,9] [,10]
## [1,] 0.8079352 0.9565001 0.110453
x^2
## [1] 0.04008583 0.46952452 0.84066119 0.08088305 0.01095165 0.49148156
## [7] 0.27874174 0.65275929 0.91489249 0.01219987
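The contrast between summarizing and element-wise operations can be checked by looking at the size of each result. The vector vals below is made up so that the values are easy to verify by hand.

```r
vals <- c(1, 2, 3, 4)

# Summary functions collapse the vector to a single value
length(mean(vals))     # 1
length(median(vals))   # 1

# Arithmetic operators work element by element
squared <- vals^2
squared                # 1 4 9 16
length(squared)        # 4

# t() turns the vector into a matrix with one row
dim(t(vals))           # 1 4
```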
[1] KMM prefers the latter because of RStudio's auto-completion feature.