Each problem set will contain narrative text interspersed with R code. Some of this code will already be completed for you, while some you will need to fill in. You should read through all of the text (just as you are doing right now). The exercises (Activities) are interspersed throughout the text.
Right now, go up to the header and change the line “author:” to have your name and your group number.
As you work through the exercises, really think about what you are doing. In the upcoming problem sets, you will be learning how to think through analyses step by step, from loading data, to visualizing it, to actually carrying out statistical tests, and then interpreting the results.
Read the code like you would a textbook (meaning studying and digesting as you go). It’s easy to copy and paste or just run code chunks without really thinking about what the code is doing (we’ve done it too), but that’s not a good way to learn.
We will be working in R Markdown files. Markdown is a markup language that is designed to be not only readable in its text form, but also able to be converted into other formats. Markdown has a simple syntax which allows for things like bold (“Activities” above), italics, and facilities for including math notation (\(\alpha, \beta, \gamma, x^2\)). Even some pretty fancy math using LaTeX:
\[\bar{Y} = \frac{\sum^{n}_{i = 1}Y_i}{n}\]
R Markdown is the marriage of R and Markdown, allowing you to write text and R code together that can include analysis, figures, and tables. R Markdown has been extended for making slides (like the ones we use in this class), adding references, bibliographies, and cross references.
Our usage in this class will be fairly pedestrian, but you can do some really complex things, like writing entire manuscripts using the bookdown package, making a CV, or generating a website. Read more about R Markdown.
R Markdown (.Rmd) files can be converted to different formats. We will use HTML (PDF and Word are other options).
You might get a message about installing some packages. Click yes to install the required packages. After a few seconds, another window should open with this document rendered as a visually pleasing file.
You have just compiled an R Markdown file into HTML. These Rmd and HTML files will be the basic currency of the problem sets and progress checks you will do in this class.
An Rmd file can include R code that is run at compile time.
Activity means that you need to do something (i.e., be active). Before you submit your problem set, search (Ctrl-f / Cmd-f) for “Activity” and make sure you have answered all the questions.
Place the cursor on the line below this text and press Ctrl-Alt-i (Windows) / Cmd-Option-i (Mac) to insert an R code chunk.
sqrt(2)
## [1] 1.414214
Enter some R code into the chunk on the blank line: sqrt(2). Then compile the HTML. Your file will show the R code that is run and R’s output (\(\sqrt{2} \approx 1.41\)).
You can also run code interactively from an Rmd file to the R console. To run a single line, press Ctrl-Enter / Cmd-Return. To run the current chunk, use Ctrl-Alt-c / Cmd-Shift-Return. This is a good way to test code that you are working on, rather than waiting to compile the Rmd to HTML (or whatever format you are using).
You can also enter the code to set up code chunks manually, but we find it easier to use the insert code chunk shortcut.
There are not many restrictions on what you can name R objects.
R reserves only a small set of words (such as if, for, function, TRUE,
and NULL); beyond those, pretty much anything goes. Object names can’t
start with a number (no 1a <- 1; that will give you an “unexpected symbol”
error), but otherwise you are free to do what you want, even things you
probably should not do. One of the main things to avoid is naming your
objects the same as an R function.
Some names to avoid: c, mean,
df, matrix, t, T,
F. The last two are acceptable abbreviations for
TRUE and FALSE. To avoid ambiguity, we
recommend writing out TRUE and FALSE
explicitly, rather than using the abbreviations.
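A minimal sketch of why the abbreviations are risky, assuming a fresh R session: T and F are ordinary variables that merely start out equal to TRUE and FALSE, so they can be overwritten, while TRUE and FALSE themselves are reserved words. The object names t_is_true and true_is_true are made up for this illustration.

```r
# T is an ordinary variable, so this assignment is legal (but a bad idea).
T <- 0

t_is_true <- isTRUE(T)        # FALSE: T now holds the number 0
true_is_true <- isTRUE(TRUE)  # TRUE: the reserved word cannot be reassigned

# Remove our copy so that T refers to base R's TRUE again.
rm(T)
```

Any later code that used T to mean TRUE would have silently changed behavior while the overwritten copy existed.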
If you want to take the mean of a vector x, we recommend using mean_x, x_bar, or x_mean (see footnote 1). There are two benefits of using one of these variable names over using mean:
1. You will not confuse a mean object with the mean() function. R will usually figure out which one you want, but we always encourage users to be explicit rather than relying on defaults.
2. The name says which mean it holds. If you have more than one mean, what does mean refer to?
You could do this, for example:
sd <- sd(1:6) # set sd to the standard deviation of 1 through 6
sd # print the value of sd
## [1] 1.870829
sd(4:10) # calculate the standard deviation of 4 through 10
## [1] 2.160247
Execute the chunk above and look at the R console output. Explain what we have done here and what R must be doing without telling you. Write your answer after the “>” below. (“>” is the Rmd syntax for a block quote, which will make finding your answers easier.)
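line 1: set the variable sd to the standard deviation of the values 1, 2, 3, 4, 5, and 6. line 2: print the value of sd. line 3: take the standard deviation of the values 4, 5, 6, 7, 8, 9, and 10 and print the result. Even though sd now names a number, R still finds the sd() function in line 3, because a name followed by parentheses is looked up as a function.
Now add comments to the code chunk above briefly annotating your answer above.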
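The lookup behavior behind this exercise can be sketched as follows. The names sd_value and sd_call are hypothetical, chosen only for this illustration; everything else is base R.

```r
sd <- sd(1:6)          # sd is now also the name of a numeric object

sd_value <- sd         # a bare name finds the numeric object
sd_call <- sd(4:10)    # a name followed by ( skips non-functions and finds stats::sd

is_fun_before <- is.function(sd)  # FALSE: the nearest sd is the number

rm(sd)                 # remove the object; only the function remains
is_fun_after <- is.function(sd)   # TRUE
```

This works, but it relies on R resolving the ambiguity for you, which is exactly what a name like sd_x avoids.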
Vectors are one of the fundamental data structures in R. They consist
of data of all the same type (numeric, character, etc.) in a 1 X n
structure. You can manually create vectors using the combine function
c(). Some functions like seq(),
rep(), and the random number generators
(rnorm(), runif(), etc.) produce vectors by
default.
Assign vectors with the following characteristics:
Choose names for these vectors. Add your code to the block below.
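#1
v1 <- c(1, 6, 10, 14.47)
#2
v2 <- c(TRUE, TRUE, FALSE)
#3
v3 <- c("a", "aa", "aaa")
#4
# Note: seq(5:100) is treated as seq_along(5:100), so v4 holds 1 to 96;
# use seq(5, 100) or 5:100 if the integers 5 to 100 are intended.
v4 <- seq(5:100)
#5
v5 <- seq(from = 5, to = 100, by = 5)
#6
v6 <- seq(from = 5, to = 100, length.out = 60)
#7
v7 <- rep(17, 10)
#8
v8 <- rep(1:3, each = 10)
#9
v9 <- rep(1:3, 10)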
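One detail in the chunk above is easy to miss: passing a single vector to seq() does not use its values as endpoints. A small sketch of the difference, with object names (from_one, five_to_100, colon_form) that are assumptions for this example only:

```r
# A single vector argument: seq() counts along it, like seq_along().
from_one <- seq(5:100)      # 1, 2, ..., 96

# Two arguments: from and to.
five_to_100 <- seq(5, 100)  # 5, 6, ..., 100

# The colon operator gives the same sequence as seq(5, 100).
colon_form <- 5:100
```

Both vectors have 96 elements, which is why the mistake does not show up in length() alone.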
Binary operations are very important in R for selecting, subsetting, and choosing variables. The relational operators in R are:
== Equals
!= Does not equal
> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
%in% Is the comparator in the set?
When these operators are applied to vectors, the result is a vector of logicals (TRUEs and FALSEs).
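As a brief illustration of how these operators act element by element, here is a sketch on a made-up vector x (not one of the vectors from the activity):

```r
x <- c(2, 5, 8, 11)

gt5 <- x > 5               # FALSE FALSE  TRUE  TRUE
not5 <- x != 5             #  TRUE FALSE  TRUE  TRUE
in_set <- x %in% c(5, 11)  # FALSE  TRUE FALSE  TRUE

# sum() counts the TRUE values in a logical vector.
n_gt5 <- sum(x > 5)        # 2
```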
Use your vectors from above in the same order to test the following relational operators.
#1
v1>5
## [1] FALSE TRUE TRUE TRUE
#2
v2==FALSE
## [1] FALSE FALSE TRUE
#3
v3%in%"a"
## [1] TRUE FALSE FALSE
#4
v4<=10
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#5
v5>=10
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#6
sum(v6<50)
## [1] 28
#7
sum(v7==17)
## [1] 10
#8
v8==1
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE
#9
v9!=1
## [1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## [13] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## [25] FALSE TRUE TRUE FALSE TRUE TRUE
Computers only rarely store numbers internally in exact form (computer algebra systems do this), particularly after any kind of numeric operation.
Instead, numerics are rounded to a fixed level of precision (R uses 53 binary digits, roughly 15 to 16 significant decimal digits). For example:
a <- sqrt(2) # set a to the square root of 2
a # print the value of a
## [1] 1.414214
a * a # multiply a by itself; the printed result is rounded to 2
## [1] 2
a * a == 2 # test whether a * a is exactly equal to 2; it is not, so this prints FALSE
## [1] FALSE
all.equal(a * a, 2) # test whether a * a is nearly equal to 2 (within a small tolerance); it is, so this prints TRUE
## [1] TRUE
Line by line, explain what the statements above are doing and the R
output of each. Look at the help for all.equal() if you
need to. Enter your explanation after the > below.
> line 1: assign the square root of 2 to the variable a.
line 2: print the value of a.
line 3: multiply a by a and print the result, which displays as 2 after rounding.
line 4: check whether a times a is exactly equal to 2; because the stored value differs very slightly from 2, this prints FALSE.
line 5: compare a times a with 2 using all.equal(), which allows a small tolerance; the two values are nearly equal, so it prints TRUE.
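To see how small the discrepancy is, and how one might safely use all.equal() inside a condition, consider the sketch below. The names gap, exact, and close_enough are assumptions for this example. Wrapping in isTRUE() matters because all.equal() returns a character description, not FALSE, when the values differ.

```r
a <- sqrt(2)

gap <- a * a - 2       # tiny but not zero (on the order of 1e-16)
exact <- a * a == 2    # FALSE: exact comparison sees the gap

# all.equal() compares within a small tolerance; isTRUE() turns its
# result into a single TRUE or FALSE that is safe to use in if().
close_enough <- isTRUE(all.equal(a * a, 2))

# Printing with more digits reveals the stored value.
print(a * a, digits = 17)
```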
Matrices are rectangular objects (rows and columns) in which all of the cells have the same type of data. In most cases when you use matrices, you will have numbers only; however, matrices can hold characters or logicals as well.
Use the matrix() function and rnorm(36, mean = 10, sd = 5) to create a 6 X 6 matrix. The rnorm() function draws random normally distributed numbers. By supplying the mean and sd arguments, we can specify the mean and standard deviation of the distribution.
You might need to refer to the help for matrix (?matrix) for assistance with arguments.
set.seed(37)
m1 <- matrix(rnorm(36, mean = 10, sd = 5), nrow = 6, ncol = 6) # 36 random normal draws arranged as 6 rows by 6 columns
What is the default order for creating matrices?
By default, matrix() fills the matrix column by column (byrow = FALSE): the first 6 values from rnorm() fill column 1, the next 6 fill column 2, and so on. In the argument list of matrix(), the data come first, followed by nrow and ncol.
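The fill order is easiest to see with ordered values instead of random ones. A minimal sketch, using the made-up names by_col and by_row:

```r
# Default: fill down the columns.
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_col[1, ]   # 1 3 5

# byrow = TRUE: fill across the rows instead.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row[1, ]   # 1 2 3
```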
Use the colMeans() function to calculate the column means of your matrix.
colMeans(m1)
## [1] 9.692717 11.586470 7.173573 9.263722 8.954953 10.591404
m1 < 10 # TRUE where a value is less than 10, FALSE otherwise
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] FALSE TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE TRUE TRUE TRUE
## [3,] FALSE FALSE FALSE TRUE FALSE FALSE
## [4,] TRUE FALSE TRUE FALSE FALSE TRUE
## [5,] TRUE TRUE TRUE FALSE TRUE FALSE
## [6,] TRUE FALSE TRUE TRUE FALSE FALSE
What kind of matrix is returned?
It is a 6x6 matrix of logicals where each position is either true or false
Use colMeans() to calculate the proportion of values less than 10 in each column of your matrix.
colMeans(m1 < 10) # mean of each column, counting TRUE as 1 and FALSE as 0
## [1] 0.5000000 0.3333333 0.5000000 0.5000000 0.3333333 0.3333333
Compare the results of the column means with the results of part (c).
What is R doing with the TRUE and FALSE values
in order to be able to use colMeans()?
R coerces FALSE to 0 and TRUE to 1, so the mean of each column of m1 < 10 is the proportion of values in that column that are less than 10.
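The same coercion can be checked on a short logical vector. This is only a sketch; flags is a made-up example vector.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

as_numbers <- as.numeric(flags)  # 1 0 1 1
n_true <- sum(flags)             # 3: the count of TRUE values
prop_true <- mean(flags)         # 0.75: the proportion of TRUE values
```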
data.frames (and their cousins tibbles) are
one of the most important objects that you will work with in R. They are
the closest thing to an Excel spreadsheet in R (with the added
restriction that a column must be all of one data type).
Create a data.frame with the following columns:
- Tx_Group with the values “control” and “high”, each repeated 10 times.
- Replicate with values 1 and 2, repeated 5 times each for both values in Tx_Group.
- Productivity, where the first 10 values are normally distributed with a mean of 5 and standard deviation of 2 and the second 10 values are normally distributed with a mean of 8 and standard deviation of 2. c() will be useful here.
Include the argument stringsAsFactors = FALSE to tell R not to convert the strings to factors.
set.seed(1)
# Tx_Group vector
Tx_Group <- rep(c("control", "high"), each = 10)
# Replicate vector
controlrep <- rep(1:2, each = 5)
highrep <- rep(1:2, each = 5)
Replicate <- c(controlrep, highrep)
# Productivity vector (the first argument of rnorm() is the number of draws)
first_10 <- rnorm(10, mean = 5, sd = 2)
second_10 <- rnorm(10, mean = 8, sd = 2)
Productivity <- c(first_10, second_10)
# data.frame
df1 <- data.frame(Tx_Group, Replicate, Productivity, stringsAsFactors = FALSE)
Use the str() function to get information about the
data.frame. This will allow you to verify that Tx_Group has
the type character. Note that Replicate, which contains only the integers
1 and 2, is stored as an integer (int) when it is built with 1:2; building
it with c(1, 2) would give a numeric (num).
str(df1)
## 'data.frame': 20 obs. of 3 variables:
## $ Tx_Group : chr "control" "control" "control" "control" ...
## $ Replicate : int 1 1 1 1 1 2 2 2 2 2 ...
## $ Productivity: num 3.75 5.37 3.33 8.19 5.66 ...
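The int versus num distinction in the str() output comes from how the values were created. A short sketch of the rule, using hypothetical object names:

```r
from_colon <- class(1:2)        # "integer": the colon operator makes integers
from_c <- class(c(1, 2))        # "numeric": bare numbers are doubles
from_suffix <- class(c(1L, 2L)) # "integer": the L suffix asks for integers

# Integers still count as numbers for is.numeric().
int_is_numeric <- is.numeric(1:2)  # TRUE
```

For most analyses in this class the difference will not matter, but it explains why str() can report int for a column you think of as numeric.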
Taking subsets of objects in R is very common. This can include slicing or filtering rows, extracting one or more columns, and referencing columns in other functions.
You can use standard bracket notation [ ] to subset
vectors, matrices, and data.frames. The latter two require a comma to
denote rows and columns: [rows, columns]. You can also take
a single column of a data.frame with the $ operator.
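A small sketch of these notations on a toy data.frame may help. The data.frame toy and its columns id and score are invented for this example and are not part of the problem set.

```r
toy <- data.frame(id = 1:4, score = c(2.5, 4.0, 6.5, 8.0))

col_bracket <- toy[, "score"]  # all rows, one column: returns a vector
col_dollar <- toy$score        # the same column with $
first_two <- toy[1:2, ]        # rows 1 and 2, all columns: a data.frame
one_cell <- toy[3, "score"]    # row 3 of score: 6.5
```

Leaving the row or column position empty, as in toy[1:2, ], means "all of them".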
Use your data.frame from the question above. Extract the following subsets:
1. Productivity using bracket notation
2. Productivity using $ notation
# 1
df1[,"Productivity"]
## [1] 3.747092 5.367287 3.328743 8.190562 5.659016 3.359063 5.974858
## [8] 6.476649 6.151563 4.389223 11.023562 8.779686 6.757519 3.570600
## [15] 10.249862 7.910133 7.967619 9.887672 9.642442 9.187803
# 2
df1$Productivity
## [1] 3.747092 5.367287 3.328743 8.190562 5.659016 3.359063 5.974858
## [8] 6.476649 6.151563 4.389223 11.023562 8.779686 6.757519 3.570600
## [15] 10.249862 7.910133 7.967619 9.887672 9.642442 9.187803
# 3
df1[,2]
## [1] 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2
# 4
df1[1:10,]
## Tx_Group Replicate Productivity
## 1 control 1 3.747092
## 2 control 1 5.367287
## 3 control 1 3.328743
## 4 control 1 8.190562
## 5 control 1 5.659016
## 6 control 2 3.359063
## 7 control 2 5.974858
## 8 control 2 6.476649
## 9 control 2 6.151563
## 10 control 2 4.389223
# 5
df1[1:10,"Productivity"]
## [1] 3.747092 5.367287 3.328743 8.190562 5.659016 3.359063 5.974858 6.476649
## [9] 6.151563 4.389223
We will do more complex filtering next week (e.g., only rows from replicate 1 where productivity is less than 5).
R can do basic (and not so basic) calculations. First set up a vector of numbers to work with.
# Set the random number generator seed
set.seed(5)
# Generate 10 random numbers between 0 and 1
x <- runif(10)
x
## [1] 0.2002145 0.6852186 0.9168758 0.2843995 0.1046501 0.7010575 0.5279600
## [8] 0.8079352 0.9565001 0.1104530
Try out some R functions: mean(x), sd(x),
median(x), t(x). Type them into the code chunk
below and compile.
These functions take a vector or matrix and return a single value or
a new matrix. Contrast this behavior with x^2. Enter that
and see what you get.
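The contrast can be sketched on a short, non-random vector so the results are easy to check by hand. The vector y and the result names are made up for this illustration.

```r
y <- c(1, 2, 3, 4)

y_mean <- mean(y)      # 2.5: one value summarizing the whole vector
y_median <- median(y)  # 2.5
y_squared <- y^2       # 1 4 9 16: one result per element
y_t <- t(y)            # a matrix with 1 row and 4 columns
```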
Try functions operating on the matrix you created above.
#mean of matrix
mean(m1)
## [1] 9.543806
#standard deviation of matrix
sd(m1)
## [1] 5.020639
#median of matrix
median(m1)
## [1] 10.51905
#transpose matrix
t(m1)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 10.623770 11.910373 12.896214 8.531259 5.8582542 8.336432
## [2,] 9.039202 16.814914 14.279772 11.079977 8.1114895 10.193468
## [3,] 17.124075 14.911550 11.552322 1.662358 -3.5242485 1.315379
## [4,] 12.015376 6.339636 8.608195 16.421153 14.1201465 -1.922178
## [5,] 13.695011 7.704763 10.414325 10.120118 -0.5198454 12.315343
## [6,] 11.785339 7.392983 13.276978 8.895007 11.5495392 10.648575
# square each element of the matrix (element-wise, not matrix multiplication)
m1^2
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 112.86449 81.70718 293.233956 144.369257 187.5533339 138.89422
## [2,] 141.85698 282.74132 222.354309 40.190990 59.3633748 54.65620
## [3,] 166.31233 203.91189 133.456151 74.101024 108.4581635 176.27813
## [4,] 72.78239 122.76590 2.763435 269.654280 102.4167922 79.12114
## [5,] 34.31914 65.79626 12.420327 199.378537 0.2702393 133.39186
## [6,] 69.49610 103.90678 1.730222 3.694768 151.6676835 113.39216
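Note that ^ works element by element, which differs from multiplying a matrix by itself. A minimal sketch on a 2 X 2 matrix (m, elementwise, and product are hypothetical names):

```r
m <- matrix(1:4, nrow = 2)  # column 1 is 1, 2; column 2 is 3, 4

elementwise <- m^2          # squares each cell: 1, 4, 9, 16
product <- m %*% m          # matrix multiplication: rows times columns

elementwise[1, 2]           # 9  (3 squared)
product[1, 2]               # 15 (1 * 3 + 3 * 4)
```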
Whitespace is your friend. Imaginetryingtoreadasentencethathasnospacing. It’s pretty difficult. Writing clean code helps your future self and those you work with to understand the steps of your analysis.
Take the following code chunk and make the code more readable by adding whitespace (and carriage returns) as well as explicitly naming some of the arguments (check the help files for the assumed ordering of arguments).
library(tibble) # provides tibble()

M1 <- data.frame(x = c(11:20),
                 y = c(20:11),
                 l = letters[1:10]) # the first 10 of the 26 lowercase letters

M2 <- tibble(ID = seq(from = 1, to = 100),
             x = rnorm(n = 100, mean = 10, sd = 5), # 100 draws with mean 10 and sd 5
             y = 8.4 * x + rnorm(n = 100)           # 8.4 times x plus noise with mean 0 and sd 1
)
What is letters?
letters is a built-in constant: a character vector of the 26 lowercase letters of the alphabet.
Explain how y is constructed in M2.
Each value in the vector x is multiplied by 8.4, and then a random value from rnorm(100) (mean 0, standard deviation 1) is added to it.
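Both answers can be checked directly in the console. The sketch below uses a fixed seed and the made-up names x_demo, noise, and y_demo so that it does not overwrite anything from the activity.

```r
# letters is a character vector, not a function.
letters_is_function <- is.function(letters)  # FALSE
n_letters <- length(letters)                 # 26
first_three <- letters[1:3]                  # "a" "b" "c"

# y is a linear function of x plus standard normal noise.
set.seed(1)
x_demo <- rnorm(n = 100, mean = 10, sd = 5)
noise <- rnorm(n = 100)          # defaults: mean = 0, sd = 1
y_demo <- 8.4 * x_demo + noise   # computed element by element
```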
1. KMM prefers the latter because of RStudio’s auto-completion feature.