CEC R tutorial

Published

October 21, 2023

About

This is a workshop developed for California Ecology and Conservation Fall 2023 written by An Bui. By the end of this workshop, students will understand:
1. how to use objects and functions to “tell R” to do things
2. how packages fit into R use
3. how to read in and visualize data
4. how to do basic statistical tests

1. Parts of code

Code lives in scripts. A script is a file where all of your code lives. You can think of your script as a place where you can save all the code you write.

Code shows up in the default RStudio color scheme in black, blue, and dark green. You can change the color scheme by going to Tools > Global Options > Appearance and choosing an “Editor Theme”. I’ll change the scheme to “Solarized Dark”, which is a little easier to see.

Comments show up in grey: these are useful for annotating your code. Annotations are super useful for keeping notes about what your code is doing. You can create a comment by putting a pound sign at the beginning of a line.

You can also orient yourself in your script using code sections. You can insert a section by going to Code > Insert Section from the toolbar. There is a table of contents you can use to navigate in the right hand side of the script pane, or in the bottom left.

Objects

Objects are saved in the upper right corner of the RStudio window. Objects can take on any values: you assign an object a value using the assignment operator: a left-facing arrow <-

In this line of code, we’ll
1. create a new object called “yellow”
2. use the assignment operator to assign the object a value (5)

You can run code by putting your cursor on the line of code you want to run and hitting command + return (Mac) or control + enter (Windows).

yellow <- 5
yellow
[1] 5

If you were using RStudio, you would now see this in the upper right corner.

Try creating an object called “blue” and assign it the value 10.

blue <- 10
blue
[1] 10

Try creating an object called “greeting” and assign it the value “Hello!”

greeting <- "Hello!"
greeting
[1] "Hello!"

When creating the objects yellow and blue, you assigned them numerical values. “Hello!” is not a number - it is a character.

Functions

Functions are the backbone of R usage. They allow you to do things: take the mean of a bunch of numbers, store your data, create figures, and much more.

Functions always take the form of an object with a set of parentheses at the end, and almost always are named after what they do. For example, the function to take the mean of something is mean().

Within the parentheses, functions take “arguments”, which tell the function what to work on. You can think of using an argument as telling R: use this function on this thing.

If you ever forget what arguments a function takes, you can look at the function’s help page, which you can access by typing ?function name in the console. Try looking at the help page for mean() by running ?mean.

?mean

Sometimes, you might have a list of 10 numbers, for example:

big_list <- c(4, 5, 1, 2, 3, 5, 6, 7, 8, 9)

You can put an object in a function as an argument. Try taking the mean of big_list.

mean(big_list)
[1] 5

But let’s say you want to save the mean value that you just calculated so that you don’t have to do it over and over again. Try saving mean(big_list) as a object called mean_calculation.

mean_calculation <- mean(big_list)
mean_calculation
[1] 5

2. Packages

Packages are collections of functions. You can see what packages you have by looking at the “Packages” tab in the lower right corner of the RStudio window. Some packages are pre-installed, but you can also install other packages as you wish.

Lots of people write packages to share code with others - if you ever want to do something in R, chances are that someone has written a package for it.

You can install a package using the function install.packages() in the Console. We’ll try installing a package called tidyverse, which is a bunch of packages in one.

Even though you’ve installed a package, you still need to “load” it. You can do this using the library() function.

library(tidyverse)

3. Reading in data

You can work with data in R by “reading it in”. You can create an object to “save” the data in R, then do whatever you want to it without ever changing the original file, like what we did with mean_calculation.

We’ll use the unpaired anemone data from earlier this week, and read it in using the function read_csv() from the tidyverse.

anemone_data <- read_csv("anemone_unpaired.csv")

You can look at the data using the function View() in the console. The code to do that is:

View(anemone_data)

Without the View() function, you can still see the data by typing in the object name and running the code:

anemone_data
# A tibble: 40 × 3
   replicate_pool treatment column_temperature
   <chr>          <chr>                  <dbl>
 1 1r             Removal                20.8 
 2 2r             Removal                15.2 
 3 3r             Removal                19.5 
 4 4r             Removal                13.2 
 5 5r             Removal                 7.5 
 6 6r             Removal                 7.75
 7 7r             Removal                16   
 8 8r             Removal                16   
 9 9r             Removal                12.8 
10 10r            Removal                15.5 
# ℹ 30 more rows

4. Wrangling your data

Data is almost always messy - there might be typos, spaces where there shouldn’t be, wrong numbers, etc. There are lots of functions that allow you to “wrangle” your data: any form of cleaning or manipulation of data to eventually be able to plot and analyze it.

The anemone dataset is not especially messy, but there are a few ways we can wrangle it to make it easier to work with. I’ll refer to the anemone_data object as a “dataframe”, which is the standard term for any object with rows and columns in R.

Selecting columns

Let’s say we only want to work with the replicate pool and treatment columns in the data frame.

The function for selecting columns is from the tidyverse package, and is called select(). Before using it, look at the help page for select() using ?select in the console.

Try selecting the columns replicate_pool and treatment using the select() function.

# create an object called "selected"
# take the anemone_data object AND THEN
selected <- anemone_data |>
  # select the columns replicate_pool and treatment
  select(replicate_pool, treatment)

You can then view the new object to verify that it actually selected the columns you wanted.

5. Creating figures

After looking at the data, it’s often useful to visualize it in a graph to get a sense of what the patterns might be before you start trying any statistical tests.

There are a few ways to plot things in R, but for this workshop we’ll use ggplot(), which is a function within the tidyverse.

ggplot works in the following way:
1. tell R that you want to use the ggplot function and include the data frame name as the data argument
2. specify the aesthetics using the aes() function: what are your x and y axes?
3. specify a geometry using a geom_() function: what kind of plot do you want to make?

Let’s try making a boxplot with the anemone data.

anemone_boxplot <- ggplot(data = anemone_data,
       aes(x = treatment, y = column_temperature, fill = treatment)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Control" = "orange", "Removal" = "salmon")) +
  labs(x = "Treatment", y = "Column Temperature (C)",
       title = "Median removal temperature is slightly higher than control") +
  theme_bw() +
  theme(legend.position = "none")

anemone_boxplot

Sometimes, a boxplot can be kind of confusing to look at if you want to compare means (which is what you’re doing with a t-test). We can try a different kind of plot in ggplot to do that.

anemone_jitter <- ggplot(data = anemone_data,
       aes(x = treatment, y = column_temperature)) +
  geom_jitter(aes(color = treatment),
              alpha = 0.4) +
  stat_summary(geom = "pointrange", fun.data = mean_se, aes(color = treatment)) +
  scale_color_manual(values = c("Control" = "purple", "Removal" = "goldenrod")) +
  labs(x = "Treatment", y = "Column Temperature (C)",
       title = "Mean removal temperature is slightly higher than control") +
  theme_bw() +
  theme(legend.position = "none")

anemone_jitter

6. statistical tests

Most statistical tests in R take the same form: test(response ~ predictor).

So for example, let’s say we want to do a t-test on the anemone data, with column temperature as the response and treatment as the predictor.

t.test(column_temperature ~ treatment,
       data = anemone_data)

    Welch Two Sample t-test

data:  column_temperature by treatment
t = -1.0203, df = 37.973, p-value = 0.314
alternative hypothesis: true difference in means between group Control and group Removal is not equal to 0
95 percent confidence interval:
 -4.699915  1.549915
sample estimates:
mean in group Control mean in group Removal 
              13.2375               14.8125 

Or a paired t-test:

t.test(column_temperature ~ treatment,
       data = anemone_data,
       paired = TRUE)

    Paired t-test

data:  column_temperature by treatment
t = -2.2963, df = 19, p-value = 0.0332
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -3.0105576 -0.1394424
sample estimates:
mean difference 
         -1.575 

Same deal with a Wilcox test (the non-parametric equivalent of a t-test):

wilcox.test(column_temperature ~ treatment,
            data = anemone_data)
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with ties

    Wilcoxon rank sum test with continuity correction

data:  column_temperature by treatment
W = 166, p-value = 0.3645
alternative hypothesis: true location shift is not equal to 0

Or a Mann-Whitney U (the non-parametric equivalent of a paired t-test):

wilcox.test(column_temperature ~ treatment,
            data = anemone_data,
            paired = TRUE)
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with ties

    Wilcoxon signed rank test with continuity correction

data:  column_temperature by treatment
V = 39, p-value = 0.01436
alternative hypothesis: true location shift is not equal to 0

7. Working with pond data

a. load in the data

pond_data <- read_csv("pond_data_simplified.csv")

b. make a figure

ggplot(data = pond_data,
       aes(x = surface_temp_c, y = temp_at_05_m_c)) +
  geom_point()

c. linear regression

pond_model <- lm(temp_at_05_m_c ~ surface_temp_c,
                 data = pond_data)
pond_model

Call:
lm(formula = temp_at_05_m_c ~ surface_temp_c, data = pond_data)

Coefficients:
   (Intercept)  surface_temp_c  
        7.2727          0.6431  

Look at the summary using summary():

summary(pond_model)

Call:
lm(formula = temp_at_05_m_c ~ surface_temp_c, data = pond_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6140 -1.1219 -0.1559  0.5569  5.1941 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.2727     2.7788   2.617   0.0123 *  
surface_temp_c   0.6431     0.1466   4.386 7.59e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.671 on 42 degrees of freedom
Multiple R-squared:  0.3142,    Adjusted R-squared:  0.2979 
F-statistic: 19.24 on 1 and 42 DF,  p-value: 7.588e-05

You can also look at the confidence interval around the estimate for the slope of the line:

confint(pond_model)
                  2.5 %     97.5 %
(Intercept)    1.664957 12.8805161
surface_temp_c 0.347228  0.9389841

Interpretation: as surface temperature increases, temperature at 0.5 m also increases. For every 1 degree of increased surface temperature, we expect an increase of 0.6 [95% confidence interval: 0.35, 0.94] degrees in temperature at 0.5 m.

d. visualizing model predictions with raw data

After creating a linear model, you might want to put the model line on top of the raw data. You can do this using the packages ggeffects and the ggplot function together. In general, the steps are:
1. generate model predictions (the x- and y- values of the linear regression line) using the ggpredict function from the ggeffects package
2. create a plot using two data frames: pond_data, which has your raw data, and model_predictions, which is the data frame of model predictions

Try generating the model predictions below.

model_predictions <- ggpredict(pond_model, terms = "surface_temp_c")
model_predictions
# Predicted values of temp_at_05_m_c

surface_temp_c | Predicted |         95% CI
-------------------------------------------
            15 |     16.92 | [15.70, 18.14]
            16 |     17.56 | [16.60, 18.52]
            17 |     18.21 | [17.47, 18.94]
            18 |     18.85 | [18.29, 19.40]
            20 |     20.13 | [19.54, 20.72]
            21 |     20.78 | [19.99, 21.56]
            22 |     21.42 | [20.40, 22.45]
            24 |     22.71 | [21.15, 24.26]

And then create a plot:

# first, use the pond_data data frame to plot surface_temp_c on the x axis and temp_at_05_m_c on the y axis
ggplot(data = pond_data,
       aes(x = surface_temp_c, y = temp_at_05_m_c)) +
  # plot the raw data
  geom_point(shape = 21) +
  
  # then, use the model_predictions data frame to plot the x column on the x axis, and the predicted column on the y axis
  geom_line(data = model_predictions,
            aes(x = x, y = predicted),
            color = "blue",
            linewidth = 1) +
  # add a ribbon around the confidence interval
  geom_ribbon(data = model_predictions,
              aes(x = x, y = predicted,
                  ymax = conf.high, ymin = conf.low),
              # change the transparency of the ribbon
              alpha = 0.2,
              fill = "blue") +
  
  # label the axes and title
  labs(x = "Surface temperature (degrees Celsius)",
       y = "Temperature at 0.5 m (degrees Celsius)",
       title = "Surface temperature significantly predicts temperature at 0.5 m") +
  
  # theme options
  theme_bw()