Why R Markdown?

R Markdown adds an extra layer to R programming, but it is one of the best tools for presenting analyses and plots in a way that non-R users can understand and that is also reproducible: your code can be copied and your analyses rerun exactly. This is different from other statistics programs that use a graphical user interface (GUI) with point-and-click functions. While these may be more intuitive and easier to learn, they are not reproducible and shareable in the precise way that R code and R Markdown documents are.

In this lab, we will illustrate the basic functions of R Markdown and R.

Do you know why headers start with ##? The # symbols tell Markdown that the text that follows is a header. The more # symbols you add, the deeper the sub-header level in the knitted document. This is incredibly useful for breaking documents into meaningful units. You will see # symbols at the beginning of each header and sub-header.

The other thing to note is the gray boxes. These are the code chunks. Three backticks tell knitr that you are starting a code chunk. The {r} tells knitr which language the code in this chunk is written in. The final three backticks indicate that the code chunk has ended. All three of these must be present when you create a new code chunk.

Try creating one now! Look ahead at the other code chunks if you are having trouble.

Next, if you look to the left, you will see that the lines are numbered. This helps you and R reference sections of code. If you get an error, R will tell you in the console where it encountered that error. In addition, RStudio will sometimes show red error markers in your script. This tells you that there is an issue. Most of the time it is a misspelling, a missing comma (,), or a missing parenthesis.

Next to these numbers you will see little up and down arrows. These fold and unfold chunks and header sections. If an arrow is pointing up, you have hidden that section; if it is pointing down, you have revealed that chunk or section. If you think code has disappeared, it might just be hidden.

Last, within the code chunks (gray boxes) you will see a green arrow in the top right corner. This is the run button; it runs the code in the code chunk and prints the results below it. This is one way to run the code. Another is to copy the code from the chunk and paste it into the console below.

Note that R runs code in order, from beginning to end. If you try to run code that uses an object before you have run the code that created that object, R will give you an error. You can always check whether the object you are calling exists by looking for it in the Environment window in the top right corner of RStudio.
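As a small illustration of the point above, the sketch below checks whether an object exists before and after the line that creates it has been run. The object name my_value is made up for this example and is not part of the lab.

```r
# my_value is a hypothetical object name used only for this illustration.
if (exists("my_value")) rm(my_value)   # start from a clean slate

before <- exists("my_value")   # FALSE: nothing called my_value has been created yet
my_value <- 2 + 2              # running this line creates the object
after <- exists("my_value")    # TRUE: the object is now available

before
after
```

Running a line that uses my_value before the assignment would stop with an "object not found" error, which is the situation described above.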

Exercises

Exercise 0

R is an incredibly powerful tool, but it is also a calculator. To illustrate this and the information covered above, let's run through some basics.

Try executing this chunk by clicking the Run button (little green arrow) within the chunk.

2+2

## [1] 4

7-1

## [1] 6

9*2

## [1] 18

27/3

## [1] 9

Describe what just happened!

EKA: The Calculations are done.
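Beyond the four operations above, R follows the usual order of operations and has a few more arithmetic operators. The lines below are an illustrative sketch and are not part of the original exercise; the object names are made up.

```r
a <- 2 + 3 * 4      # multiplication happens before addition: 14
b <- (2 + 3) * 4    # parentheses change the order: 20
p <- 2^3            # exponentiation: 8
quot <- 7 %/% 2     # integer division: 3
rem <- 7 %% 2       # remainder after division: 1

c(a, b, p, quot, rem)
```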

Exercise 1

Now, let's warm up for some statistics. To do statistics you will need data. Most of the data today will be “random” numbers.

Try executing the next chunk by clicking the Run button within it. This code will make your computer print a random number between 0 and 1.

runif(1)

## [1] 0.08196117

Try running the chunk several times to see whether you get the same number.
You can also run code in the console window. Copy the code, paste it into the console, and hit Enter.

You can run the following to learn more about any function.

# The # symbol marks a comment: R ignores the rest of that line and does not run it. To run the line below, you would first remove the #. Since it opens a help page, just copy the code without the # and paste it into the console.

#?runif

runif(n=1, min=5, max=16) #note that runif() defaults to min=0 and max=1, but you can change these to any numbers you want.

## [1] 5.645783

Note n, min, max. What are these in the context of R? Function, parameter, variable, or data? EKA: They are parameters (arguments) of the function. n is the number of observations to generate. EKA: min and max define the lower and upper limits of the range the numbers are drawn from.
What about runif? EKA: A function. It generates random numbers from a uniform distribution (the name is short for “random uniform”, not “run if”).
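To make the role of n, min and max concrete, here is a minimal sketch showing that the same call can be written with named or positional arguments. The seed value is an arbitrary choice for this illustration; set.seed() is used only so that both calls start from the same point and can be compared.

```r
names(formals(runif))                      # the arguments runif() accepts: "n" "min" "max"

set.seed(42)                               # arbitrary seed so the draw is repeatable
named <- runif(n = 1, min = 5, max = 16)   # arguments matched by name

set.seed(42)                               # reset to the same starting point
positional <- runif(1, 5, 16)              # the same arguments matched by position

identical(named, positional)               # TRUE: both calls mean the same thing
```

Naming the arguments is usually easier to read, especially when a function has many of them.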

Exercise 2

With R it is possible to give (part of) your data a name of your choice. This is known as creating an object. R is an object-oriented programming language. For now your “data” will be a random number between 0 and 1.

Run the next piece of code, which will generate a random number and save it under the name r1 (the <- part of the code is the assignment operator; it means “give the name”).

r1 <- runif(1)

The code only generated the number, but did not display it.

To see which number your computer gave you, run the next piece of code:

r1

## [1] 0.7018028

print(r1)

## [1] 0.7018028

Note that in your environment in the top right you now see the object r1 along with its value.

What do you think the print() function does? EKA: It displays the value of an object in the console or below the code chunk.
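The difference between assigning a value and displaying it can be sketched as follows. The names value_a and value_b are hypothetical and used only here.

```r
# value_a and value_b are hypothetical names for this illustration.
value_a <- runif(1)      # assignment alone displays nothing
value_a                  # typing the name displays the value (auto-printing)
print(value_a)           # print() displays it explicitly

(value_b <- runif(1))    # wrapping an assignment in parentheses assigns and displays
```

At the top level of a chunk, typing the name and calling print() give the same result. print() becomes necessary inside loops and functions, where values are not displayed automatically.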

Exercise 3

Now, let us make some more random numbers:

Run the next code boxes and discuss what the code in the boxes does.

r2 <- runif(1)
r2

## [1] 0.5933874

r3 <- runif(1)
r3

## [1] 0.7913255

r4 <- runif(1)
r4

## [1] 0.754808

r5 <- runif(1)
r5

## [1] 0.6424698

r6 <- runif(1)
r6

## [1] 0.9410589

Did you notice that all your numbers also appear in the Environment window to the right?

Exercise 3 (continued)

Now it is your turn to write some code! Look carefully at the code chunks above. What do you notice about the parts that are R code, the parts that are text (like this), and the parts that are headers?

In your new R chunk, write code that generates and prints four more random numbers: r7, r8, r9, r10.

EKA: The () refers to ???

r7 <- runif(1)
r7

## [1] 0.9393032

r8 <- runif(1)
r8

## [1] 0.1126541

r9 <- runif(1)
r9

## [1] 0.9618108

r10 <- runif(1)
r10

## [1] 0.945175

Exercise 4

We are almost ready to do some statistics on your random numbers.

But first we make a list (vector) with all your numbers. For that we use the c() function (c for concatenate).

Adjust the list below to also include r7, r8, r9 and r10.

r_list <- c(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10)  # c means concatenate (def: link together in a chain or series)
print(r_list)

##  [1] 0.7018028 0.5933874 0.7913255 0.7548080 0.6424698 0.9410589 0.9393032
##  [8] 0.1126541 0.9618108 0.9451750

r_list

##  [1] 0.7018028 0.5933874 0.7913255 0.7548080 0.6424698 0.9410589 0.9393032
##  [8] 0.1126541 0.9618108 0.9451750

Extend the code chunk to also print the “r_list”
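The sketch below shows, under made-up object names, how c() builds a vector and how individual elements can be read back out of it. It is an illustration alongside the exercise, not part of it.

```r
# first and second are hypothetical names for this illustration.
first <- runif(1)
second <- runif(1)

pair <- c(first, second)      # combine two values into one vector
length(pair)                  # number of elements in the vector
pair[2]                       # square brackets pick out an element by position

longer <- c(pair, runif(1))   # c() can also extend an existing vector
length(longer)
```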

Exercise 5

Now we can calculate the mean.

IMPORTANT: Before you do this, try to think about what the mean of your random numbers will be. Note that we are drawing from a uniform distribution.

After you have made your guesses, use the code chunk below to calculate the mean.

r_mean <- mean(r_list)

mean_dif <- 0.5 - r_mean  # guess of 0.5 minus the actual mean (uses r_mean instead of a hard-coded number)

Also try to extend the code chunk so that it saves the mean under the name “r_mean” and prints it.
Finally, write a code chunk that calculates the difference between your guess and the actual mean. You can get inspiration from Exercise 0.
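For a uniform distribution between min and max, the expected mean is (min + max) / 2, which is 0.5 for the defaults. A sample of only 10 numbers can land some distance from that value, while a large sample tends to land close to it. The sketch below illustrates this; the seed and the sample sizes are arbitrary choices and not part of the lab.

```r
set.seed(123)                      # arbitrary seed so the sketch is repeatable

expected <- (0 + 1) / 2            # theoretical mean of runif() with min = 0, max = 1
small <- runif(10)                 # a small sample, like r_list
large <- runif(100000)             # a much larger sample

abs(mean(small) - expected)        # can be noticeably different from 0
abs(mean(large) - expected)        # typically very close to 0
```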

Exercise 6

In the exercises above we created all the random numbers one at a time. We could also have created them all at once.

Run the code chunk below and discuss what it might do.

r_long <- runif(100)
r_long

##   [1] 0.730070238 0.658520974 0.972493692 0.659948956 0.739603211 0.436245293
##   [7] 0.711156723 0.550769480 0.636405011 0.784397414 0.930321685 0.086628236
##  [13] 0.765174360 0.513990997 0.698986398 0.342948998 0.278876404 0.652822978
##  [19] 0.499494605 0.547732381 0.105259286 0.454860187 0.476615187 0.747203789
##  [25] 0.822798520 0.883884885 0.437633683 0.741495993 0.928417177 0.196087902
##  [31] 0.373623308 0.963736031 0.873422671 0.972338451 0.053403564 0.359785156
##  [37] 0.104324209 0.443684720 0.828328293 0.586310610 0.611180832 0.692971979
##  [43] 0.672866126 0.054789532 0.009874568 0.729656289 0.152138554 0.774589846
##  [49] 0.413302465 0.361305832 0.875190920 0.720269777 0.837610557 0.573141821
##  [55] 0.570442036 0.258253170 0.402558297 0.698314931 0.737297381 0.834738616
##  [61] 0.268925380 0.200546464 0.230257859 0.526267602 0.158737940 0.607945477
##  [67] 0.814985045 0.369135464 0.723496686 0.345437354 0.018521146 0.146327385
##  [73] 0.756736601 0.647371715 0.908180137 0.609784047 0.687517355 0.093861284
##  [79] 0.624294871 0.332607105 0.270216087 0.037475637 0.772392970 0.083122959
##  [85] 0.062133359 0.781305026 0.383265236 0.434085222 0.944576579 0.133595499
##  [91] 0.218057332 0.894429188 0.828530233 0.317273935 0.096979904 0.821751233
##  [97] 0.364403016 0.364448656 0.950836806 0.144320450
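To see that one call with a larger n does the same job as several single calls, the sketch below compares the two approaches from the same starting point. It assumes R's default random number generator; the seed is an arbitrary value chosen for this illustration.

```r
set.seed(2022)                                  # arbitrary seed
one_by_one <- c(runif(1), runif(1), runif(1))   # three separate calls

set.seed(2022)                                  # reset to the same starting point
all_at_once <- runif(3)                         # one call asking for three numbers

identical(one_by_one, all_at_once)              # compare the two results
length(all_at_once)                             # how many numbers were generated
```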

Exercise 7

The (unfinished) code chunk below will calculate some standard statistics of r_long: the mean, the minimum and the maximum.

Again, each group member makes a guess for the mean, the minimum and the maximum.
After you have made your guesses, try to finish and run the code chunk to see who had the better guess.

gs_mean <- 0.46
gs_min <- 0.87
gs_max <- 0.12

# The # symbol marks a comment: R ignores the rest of that line and does not run it. To run commented-out code, first remove the # and then add any missing code.

lg_mean <- mean(r_long)
lg_max <- max(r_long)
lg_min <- min(r_long)

df_meanlg <- gs_mean-lg_mean
df_maxlg <- gs_max-lg_max
df_minlg <- gs_min-lg_min
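When comparing guesses, the sign of the difference matters less than its size. A minimal sketch of that comparison, using named vectors and abs(), is shown below. The guess values and the seed are hypothetical and are not the ones entered in the lab.

```r
set.seed(11)                        # arbitrary seed
x <- runif(100)

# Hypothetical guesses, not the ones entered in the lab.
guess  <- c(mean = 0.50, min = 0.01, max = 0.99)
actual <- c(mean = mean(x), min = min(x), max = max(x))

error <- abs(guess - actual)        # size of each miss, ignoring direction
error
names(which.min(error))             # which statistic was guessed best
```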

Exercise 8

An even simpler way to obtain standard statistics on your data set is to use the “summary()” function.

Execute the code chunk below and discuss the results. Do you know what “1st Qu.” and “3rd Qu.” are, and if you do, are you surprised by the results?

summary(r_long)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.009875 0.307675 0.571792 0.525044 0.749587 0.972494
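“1st Qu.” and “3rd Qu.” are the first and third quartiles: the values below which roughly 25% and 75% of the data fall. The sketch below computes them directly with quantile() on a made-up sample; the seed and object names are assumptions for this illustration.

```r
set.seed(7)                                          # arbitrary seed
x <- runif(100)

quarts <- quantile(x, probs = c(0.25, 0.50, 0.75))   # 1st quartile, median, 3rd quartile
quarts

median(x)                                            # matches the 50% quantile
mean(x <= quarts[["25%"]])                           # share of values at or below the 1st quartile
```

For numbers drawn uniformly between 0 and 1, the quartiles are expected to fall near 0.25, 0.5 and 0.75, although a sample of 100 will not hit them exactly.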

Exercise 9

In the final exercise today we will move a little beyond using only numbers to describe and understand our data set (which, as a reminder, is a list of 100 random numbers between 0 and 1).

Instead we will try to make graphical visualizations of the data set. Such graphics are often called a plot in statistics lingo.

Plot the histogram of our data set “r_long” by running the code chunk below. You might have to press run twice!

hist(r_long)

We will discuss histograms a lot more in later lectures, but for now we will just use what you already know.

What does the histogram show?
Run the code chunk below to make an even larger data set, r_longer, with 10000 random numbers.

r_longer <- runif(n=10000,min=0,max=1)

summary(r_longer)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 2.245e-05 2.537e-01 5.038e-01 5.035e-01 7.575e-01 1.000e+00

hist(r_longer)

Make a histogram of r_longer and describe how it looks.

EKA: With that many numbers, the bars all look similar in height, so the histogram is nearly flat between 0 and 1.
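A histogram is a count of how many values fall into each interval (bin). The sketch below computes those counts without drawing anything, using ten equal-width bins as an assumption for this illustration. With 10000 uniform numbers, each bin is expected to hold about 10000 / 10 = 1000 values.

```r
set.seed(99)                                     # arbitrary seed
x <- runif(10000)

# plot = FALSE returns the bin information instead of drawing the histogram.
h <- hist(x, breaks = seq(0, 1, by = 0.1), plot = FALSE)

h$counts                                         # number of values in each of the ten bins
sum(h$counts)                                    # all 10000 values are accounted for
10000 / 10                                       # expected count per bin for a uniform sample
```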

Bonus!

What is the maximum number of unique numbers we can generate between 0 and 1?

EKA: Infinity
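Mathematically there are infinitely many numbers between 0 and 1, but a computer stores numbers with finite precision, so only a finite set of values can actually be produced. The R help page for the random number generators (?RNG) notes that most of the built-in uniform generators yield at most 2^32 distinct values. Treat the sketch below as a rough illustration under that assumption.

```r
.Machine$double.eps                    # smallest gap between 1 and the next larger stored number

max_distinct <- 2^32                   # upper bound noted in ?RNG for most built-in generators
format(max_distinct, big.mark = ",")   # show the bound with thousands separators
```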

Archaeological Example

Sweet! Now let's move on to some archaeological examples.

First, let’s talk about R packages. We are going to use a number of packages, but for data we will rely on two. The first, which you encountered in class, is yarrr. The second is archdata. First we need to install archdata. We did this with knitr in the first class. In the pane in the lower right corner, select the Packages tab and click Install, then type archdata and select it when it appears. Once it is installed, we will need to load it.

Loading the data

Once you have a package installed on your machine, all you need to do is tell R to load it. Below we will load the package we just installed.

library(archdata)
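If library() stops with an error saying there is no package called archdata, the package has not been installed yet. A minimal sketch of checking for it from code, rather than through the Packages tab, is shown below. The object name has_archdata is made up for this illustration.

```r
# Check for the package without loading it; TRUE if it is installed.
has_archdata <- requireNamespace("archdata", quietly = TRUE)

if (has_archdata) {
  library(archdata)   # load it for this session
} else {
  message("archdata is not installed; run install.packages(\"archdata\") first")
}
```

Installing is needed once per machine, while library() is needed in every new R session.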

Next, let's load some measurements of Early and Late Bronze Age ceramic cups from Italy analysed by Lukesh and Howe (1978). The data are stored within the package archdata. Below, we use the data() function to load them into R. Note that most data are not stored within R; we will learn how to load your own data from spreadsheets later.

data("BACups")

What did the data load as? (Look in your environment for the new object if you don’t know.)
What are the variables within this data? Can you remember the two functions used in the lecture that show this?

#na___()

Use the ? help operator to learn what the variable names represent.

?BACups

What do H and ND represent? EKA: Total height and Neck Diameter
What type of data do you think these are? EKA: Continuous numeric data, as they are true measurements (in cm) rather than counts.
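R can report how each column of a data frame is stored. The sketch below uses the built-in iris data set as a stand-in so that it runs without archdata. Like BACups, it combines numeric measurement columns with one categorical column.

```r
class(iris$Sepal.Length)           # "numeric": a continuous measurement
class(iris$Species)                # "factor": a categorical variable

col_types <- sapply(iris, class)   # storage type of every column at once
col_types
```

The same calls should work on BACups once it is loaded, with the column names swapped in.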

Summarizing the data

Next, let’s try to figure out how to describe these data. We will focus on height for now.

Create a code chunk below and find the mean, min, and max heights of BACups.
Next, let’s make a new code chunk and create a histogram of the BACups heights.
Are you able to add a line indicating the mean? (See the lecture slide!)
Try renaming the main title, x-axis label, and y-axis label.

summary(BACups)

##        RD               ND               SD               H         
##  Min.   : 6.600   Min.   : 6.200   Min.   : 7.000   Min.   : 3.300  
##  1st Qu.: 9.725   1st Qu.: 8.975   1st Qu.: 9.875   1st Qu.: 4.950  
##  Median :12.050   Median :10.900   Median :12.000   Median : 6.300  
##  Mean   :14.020   Mean   :13.040   Mean   :14.063   Mean   : 6.987  
##  3rd Qu.:18.500   3rd Qu.:17.000   3rd Qu.:18.125   3rd Qu.: 8.850  
##  Max.   :29.500   Max.   :28.000   Max.   :28.000   Max.   :14.300  
##        NH                  Phase   
##  Min.   :1.400   Protoapennine:20  
##  1st Qu.:2.275   Subapennine  :40  
##  Median :3.000                     
##  Mean   :3.155                     
##  3rd Qu.:4.000                     
##  Max.   :5.300

hist(BACups$H,
     xlab = "height (cm)",
     ylab = "count",
     main = "Cup height")
abline(v = mean(BACups$H))  # vertical line at the mean height on the x axis
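In abline(), v = draws a vertical line at a position on the x axis, which is where the mean of the plotted variable belongs, while h = draws a horizontal line at a position on the count axis. The sketch below marks both the mean and the median on a histogram. It uses the built-in iris data as a stand-in, and the colours and line styles are arbitrary choices.

```r
x <- iris$Sepal.Length                          # stand-in data for this illustration

h <- hist(x,
          xlab = "sepal length (cm)",
          ylab = "count",
          main = "Sepal length")

abline(v = mean(x), col = "red", lwd = 2)       # solid red vertical line at the mean
abline(v = median(x), col = "blue", lty = 2)    # dashed blue vertical line at the median
```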

Try it again

Let’s try this again, but with a different variable of BACups. Choose your favorite!

What are the variable's mean, min, and max? (Show them in a code chunk.)
Can you make a histogram?

summary(BACups)

##        RD               ND               SD               H         
##  Min.   : 6.600   Min.   : 6.200   Min.   : 7.000   Min.   : 3.300  
##  1st Qu.: 9.725   1st Qu.: 8.975   1st Qu.: 9.875   1st Qu.: 4.950  
##  Median :12.050   Median :10.900   Median :12.000   Median : 6.300  
##  Mean   :14.020   Mean   :13.040   Mean   :14.063   Mean   : 6.987  
##  3rd Qu.:18.500   3rd Qu.:17.000   3rd Qu.:18.125   3rd Qu.: 8.850  
##  Max.   :29.500   Max.   :28.000   Max.   :28.000   Max.   :14.300  
##        NH                  Phase   
##  Min.   :1.400   Protoapennine:20  
##  1st Qu.:2.275   Subapennine  :40  
##  Median :3.000                     
##  Mean   :3.155                     
##  3rd Qu.:4.000                     
##  Max.   :5.300

hist(BACups$ND,
     xlab = "diameter (cm)",
     ylab = "count",
     main = "Neck diameter")
abline(v = mean(BACups$ND))  # vertical line at the mean of the plotted variable (ND, not H)
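When the same plot is repeated for different variables, it is easy to change the variable in one place and forget it in another. One way to guard against that is to wrap the steps in a small function so the variable is named only once. The helper below, plot_with_mean(), is a hypothetical name introduced for this sketch, and the built-in iris data stands in for BACups.

```r
# plot_with_mean() is a hypothetical helper, not part of any package.
plot_with_mean <- function(x, label) {
  hist(x, xlab = label, ylab = "count", main = label)   # histogram of the variable
  abline(v = mean(x))                                   # mean line for the same variable
  invisible(mean(x))                                    # return the mean without printing it
}

m <- plot_with_mean(iris$Petal.Length, "petal length (cm)")
m
```

With archdata loaded, a call such as plot_with_mean(BACups$ND, "neck diameter (cm)") would follow the same pattern.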

Finished!

Once you have completed the above, click the ‘knit’ button at the top of this window to compile it into a tidy HTML document. If you have any errors in your code, it will produce an error and show you where that error is.

Once the document is created, look through it and see if there are any issues. The HTML document is a document in the form of a webpage. If you send it to someone, it will open in their web browser and show them the knitted document. This is another aspect of R and knitr that makes the work widely distributable. Anyone with a web browser can access your work! No need to install Adobe or MS Word.

Lab Session 1 - Intro to R

Peter Yaworsky

2022-08-08
