1 Introduction

1.1 What are R and RStudio Cloud?

R is a language for statistical computing and graphics. R is free, open-source software that runs on a wide variety of platforms. It has a large online community and many packages that extend its functionality.

RStudio is a free front end that layers on top of R and makes it significantly more user friendly. In this course we will use R through the vehicle of RStudio Cloud. Our class can be accessed through Canvas.

1.2 Orientation to RStudio Cloud

First, click File > New File > R Script.

You will now see that RStudio Cloud has four windows:

Script Editor: upper-left. For composing R scripts (code you write) and viewing objects like dataframes. Save your script often!

Console: bottom-left. Shows the commands you have submitted and their output.

Environment and History: top-right. The Environment tab shows objects like dataframes, variables and vectors from the current session. You can view them by clicking on them. When you exit RStudio, it will ask you if you want to save your environment. You should say no so that you start from a clean slate next time. The History tab shows the code you’ve submitted.

Files, Plots, Packages, Help: bottom-right. In the Files tab, you can browse and open your data and code files. The code from this tutorial will be found in the file “R Cookbook code.R”. Clicking that file will open the script, and you can copy and paste helpful code into your own script. The Plots tab will show you any plots you have made, and you can use the arrows to access previous plots. The Export button allows you to save plots, but it is more reproducible to use ggsave (see Exporting plots). The Packages tab allows you to load and unload packages by checking them. In the Help tab, you can search for help on any loaded package, and then search within the page using Find in Topic.

2 Working with code in R

An R script (aka code in R) is the way you make R work for you. In this section you will learn how to run, write, save and open an R script.

In this R Cookbook, R code (that you can copy and use) will look like this:

#R code will be shown in a box like this. 

R output (what appears in the console window) will then be shown like this:

[1] "R outputs will be shown in a box like this"

2.1 Submitting your first code to R

  1. Copy the code below and paste it into the Script Editor (upper-left window).

  2. You will then need to submit it to R. Highlight the code you want to run (this can be multiple lines) or place your cursor anywhere in a single line. Then either press the Run button at the top of the Script Editor, or press Ctrl+Enter (Cmd+Enter on Mac). R then responds in the console window.

7*9/(3+4)^2-8
[1] -6.714286
  3. Copy the same code, paste it directly into the console (lower-left window), and hit Enter.

See how this does the same thing? But the R script is better because it gives you a reproducible record of what you’ve done that you can save, share, re-open, and re-run.
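To see where -6.714286 comes from, it can help to break the expression into the steps R takes, following the usual order of operations (parentheses, then exponents, then multiplication and division, then subtraction). This is only a sketch; the object names step1 to step4 are made up for the illustration.

```r
# Break 7*9/(3+4)^2-8 into the order in which R evaluates it
step1 <- 3 + 4          # parentheses first: 7
step2 <- step1^2        # then the exponent: 49
step3 <- 7 * 9 / step2  # multiplication and division, left to right: 63/49
step4 <- step3 - 8      # subtraction last
step4

# The one-line version gives the same answer
7*9/(3+4)^2-8
```

If your own math problem gives an unexpected answer, adding parentheses around the parts you want calculated first is usually the quickest fix.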

2.2 Writing your first R script

As you write your own R scripts, annotate your code a lot! This is your way to mark off sections, take notes, and state clearly what your code does. This is as useful to your future self as it is to others. A comment looks like this: # Comment a lot! One or more # characters tell R to ignore the rest of the line. Here’s an example:

## R script by Joe Schmoe September 1, 2025

#This will calculate the answer to 7*9/(3+4)^2-8
7*9/(3+4)^2-8
[1] -6.714286

Besides adding explanatory comments and headers, there are other rules of good coding etiquette (https://ourcodingclub.github.io/tutorials/etiquette/index.html) that you should follow to make using R more enjoyable.

Now to write your own R script in the Script Editor window:

  1. Open a new, empty R script (File > New File > R Script). This will appear in your Script Editor window. Alternatively, you can select the new script icon (the green plus) from the top of the Script Editor window.

NOTE: for many of our projects you will be given an R script to work from and will not have to create your own. It will be located in the Files tab on the bottom right and you click it to open and start coding.

  2. Start your R script with an annotation, stating your name, the date, and whatever else might help you remember what this R script is about.

  3. Type your own math problem to solve.

  4. Type another annotation.

  5. Type another math problem.

  6. Submit the first math problem.

  7. Submit the second math problem.

  8. Annotate your code, stating the answers to your math problems right after the code for each equation.

2.3 Saving your first R script

Save your R script either using Ctrl+S (Cmd+S on Mac) or by clicking on the save icon at the top of the Script Editor window.

2.4 Opening a saved R script

Open your new R script by clicking the name you gave it in the Files tab on the bottom right, or by finding it through File > Recent Files or File > Open File.

Now you see how you can create an R script, save it, and reopen it to use it again or continue working on it. This is the magic of code-based statistical analysis. Whereas point-and-click programs require you to execute the same tasks by hand every time you perform or repeat an analysis, with code-based analysis in R you can re-run all your analyses in just seconds once the R script is written.

2.5 R Cookbook code conventions

Below we will provide scripts giving you the code you need for analyses. In these R scripts, object names written in CAPS are placeholders that you can, or need to, edit. We will use the following conventions:

  • YOUR.IV indicates the independent variable. If there is more than one, then we will use YOUR.IV1, YOUR.IV2, etc.

  • YOUR.DV indicates the dependent variable. If there is more than one, then we will use YOUR.DV1, YOUR.DV2, etc.

  • YOUR.DATA indicates the name of the dataset you are working with.

  • There will be other CAPS as well, so watch out for them in the offered code.

  • Two-parted names including a . (e.g. my.name) need not be two-parted and you can edit them as desired (e.g. myname).
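As an illustration of these conventions, here is a sketch of how a line written with CAPS placeholders turns into runnable code. The template line is a hypothetical example in the cookbook's style, and the built-in iris dataset stands in for your own data.

```r
# A template in the cookbook's style (will not run until the CAPS are replaced):
#   mean(YOUR.DATA$YOUR.DV)

# The same line with the placeholders filled in:
#   YOUR.DATA -> iris, YOUR.DV -> Sepal.Width
sepal.width.mean <- mean(iris$Sepal.Width)
sepal.width.mean
```

Only the CAPS parts change. The function name (mean), the $ and the parentheses stay exactly as written.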

2.6 Troubleshooting

The R language (like programming languages in general) is often unforgiving in its syntax and cryptic in its errors and warnings. But many of the problems you encounter are common and can be solved with attention to detail, by testing a simpler example that should work, and with some intuition about how R is interpreting your code.

Here are tips for solving some of the most common problems:

  • RStudio will give you hints about the most common syntax errors by flagging the line (and following lines) with a red X. Hover over the X to see what is wrong.

  • Parentheses () need to match up. If you put your cursor next to a parenthesis, RStudio will highlight what it thinks is the matching parenthesis. If you execute a statement with a missing closing parenthesis, R will wait for more input in the console, showing a + character where the > character usually is at the beginning of the line. To get out of this state, press Esc twice.

  • Quotes "" need to match up. If you see RStudio has highlighted all of your code after a certain point, look for an unclosed string. Syntax errors like this usually generate an error like unexpected symbol in __.

  • Object and function names are case sensitive. You will often get the error object __ not found if you misspell something, or have not yet run code that creates the object. Use your Environment window to check on the names of objects you have created.

  • Missing functions. If R says it could not find function __, make sure you have loaded all the necessary packages with library. For this cookbook you need ggplot2, dplyr, and car. Make sure they are loaded into your session by using the code library(ggplot2); library(dplyr); library(car).

  • The type of each vector matters. Check your dataframe with str and make the necessary conversions with as.factor, as.integer, as.numeric, as.character.

  • Missing values (NA) can throw off some analyses. Make sure you know why they are missing and then use na.omit(YOUR.DATA) to delete those rows.

  • Getting help. If you need to customize something, or want to know what arguments a function is expecting, look up how the functions work by typing ? in front of their name, like ?lm. Or use the Help tab on the bottom-right pane.
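Several of these tips can be tried out on a small example. The sketch below uses a made-up dataframe called toy (not part of the course data) to show how str reveals column types and how a single NA affects a calculation until the incomplete row is removed.

```r
# A made-up dataframe with one missing height
toy <- data.frame(group  = c("a", "a", "b", "b"),
                  height = c(1.2, NA, 3.4, 2.8))

str(toy)            # check the type of each column
mean(toy$height)    # returns NA because one value is missing

# Drop every row that contains an NA, then try again
toy.complete <- na.omit(toy)
nrow(toy.complete)  # 3 rows remain
mean(toy.complete$height)
```

Testing on a tiny dataframe like this, where you can check the answer by eye, is often the fastest way to find out whether the problem lies in your code or in your data.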

3 Loading packages

R provides many built-in functions that allow you to explore and plot your data. The majority of what is covered in the R Cookbook uses these base R tools, but there are also packages that can be downloaded and loaded into R to increase its power and functionality. We have already downloaded the three packages we will use in this course.

Once a package is installed, you still need to load it every time you launch R. To load a package, use code like this:

library("PACKAGE.NAME")

Let’s load three very useful packages for data manipulation, plotting, and ANOVA. You will need these packages to run the remainder of the cookbook.

library(car)
library(dplyr)
library(ggplot2)

With this code these packages are loaded into your current R session.
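If you ever work outside the course project (for example in a workspace of your own), the packages may not be installed yet. Here is a minimal sketch for checking, assuming the same three package names. The object names needed and installed are made up for this example.

```r
# Which of the cookbook's packages are installed on this machine?
needed <- c("car", "dplyr", "ggplot2")
installed <- sapply(needed, requireNamespace, quietly = TRUE)
installed

# Install any that are missing (uncomment to run; needs internet access)
# install.packages(needed[!installed])
```

Installing only has to happen once per project, whereas library has to be run in every new session.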

4 Reading in data

In this section, you’ll learn how to read in an external data file for analysis. This original data file will often be created in a spreadsheet.

4.1 Organizing your data

In the next section you’ll read about how to format your data in a spreadsheet, i.e. the right file types, acceptable characters, etc. But before you start that process, you need to understand how to organize your data in a tidy way. This is absolutely critical to being able to work with your data in R.

The simplest explanation of tidy organization for your data follows two commandments:

  1. every column is a variable.

  2. every row is an observation.

The simplest test for whether you have followed the two commandments is this: Do you have a single value for your dependent variable on each row? If so, you’re probably doing this correctly.

As an example, let’s say that you are studying the effects of herbivores on plants. Let’s say you excluded herbivores and compared plant size with and without herbivores at three time points. Plant size is your dependent variable and herbivory (yes, no) is your independent variable.

There are two ways you might organize your data, a right way and a wrong way.

The right way is to also have time (time=1, time=2, time=3) as an independent variable. So your columns (variables) would be herbivory (yes, no), time (1, 2 or 3) and size (one observation of plant size). See how in this format every row contains a single value for your dependent variable (plant size)?

One of several wrong ways is to have 3 dependent variables, size at time=1, size at time=2 and size at time=3. This would give you a single row for each experimental subject (the individual plant), but multiple values for dependent variable (plant size) on the same row.

Now that you know how your data need to be organized, let’s dig into the mechanics of doing this in a spreadsheet.
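Before moving on, here is a sketch of the herbivory example in both layouts. All of the numbers are made up for illustration, and the dataframes wide and tidy are hypothetical names.

```r
# Made-up sizes for four plants measured at three time points

# The wrong (wide) way: one row per plant, three size columns
wide <- data.frame(plant     = 1:4,
                   herbivory = c("yes", "yes", "no", "no"),
                   size.t1   = c(2.1, 1.9, 2.0, 2.2),
                   size.t2   = c(2.5, 2.4, 3.1, 3.3),
                   size.t3   = c(2.8, 2.6, 4.0, 4.4))

# The right (tidy) way: one row per plant per time point
tidy <- data.frame(plant     = rep(wide$plant, times = 3),
                   herbivory = rep(wide$herbivory, times = 3),
                   time      = rep(1:3, each = nrow(wide)),
                   size      = c(wide$size.t1, wide$size.t2, wide$size.t3))

nrow(wide)  # 4 rows, three size values on each
nrow(tidy)  # 12 rows, one size value on each
head(tidy)
```

In practice you would enter your data in the tidy layout from the start, so no conversion is needed. The conversion is shown here only to make the difference between the two layouts concrete.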

4.2 Preparing your data

Always follow these rules and procedures to create data files that you can import into R.

4.2.1 Formatting

  1. Open a new Excel or Google Sheets workbook and select one worksheet (tab at bottom) to be your data set.

  2. Row 1 should contain variable names. Variable names should not have special characters (only letters and numbers, no spaces). They can be two parted (like wet.weight or plant.height). They are case sensitive.

  3. Individual observations should begin on Row 2 and continue consecutively (no blank rows).

  4. Make sure values for variables entered within each cell are formatted consistently. Leave cells with missing data blank. If they are values for a quantitative variable then obviously they are numbers. If they are values for a categorical variable then they should not have special characters (only letters and numbers, no spaces). Words are case-sensitive.

  5. Make sure you do not have any other text anywhere else in the spreadsheet besides your dataset.

4.2.2 Saving the file

  1. Save the entire workbook as an Excel document or let Google Sheets auto save. It is fine to have multiple worksheets in this workbook.

  2. View the worksheet with your data that you want to save (i.e. not some other worksheet in the same workbook)

  3. Select File > Save As (Excel) or File > Download (Google Sheets), choose comma separated values (CSV) as the file format, and save it to your desktop. Ignore any warnings about certain file features being removed or not saved.

Now you have your data in a workbook, and you’ve created a second version with the .csv extension that R can read. If you need to edit your data, do so in the workbook (so this remains your master version) and repeat the steps above.

If you’re curious, open this file in a text editor and you’ll see that each row is a line, but values are separated by commas (as the name implies).
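You can also look at the raw text of a .csv file from within R. The sketch below writes a tiny made-up dataset to a temporary file and reads it back as plain text. The names toy, csv.path and csv.lines are hypothetical.

```r
# Write a tiny made-up dataset to a temporary .csv file
toy <- data.frame(plant.height = c(10.5, 12.1),
                  treatment    = c("control", "fertilized"))
csv.path <- tempfile(fileext = ".csv")
write.csv(toy, csv.path, row.names = FALSE)

# Read the file back as plain text: one line per row, commas between values
csv.lines <- readLines(csv.path)
csv.lines
```

The first line holds the variable names and each later line holds one observation, which is the layout described under Formatting.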

4.3 Importing data for this tutorial

You’ll now learn how to read in a dataset.

  1. If you do not have data to work with, you can create an example data file. Read in the iris dataset:
data(iris)

The iris dataset is a classic dataset often used in statistical instruction, and it ships with R in the built-in datasets package.

For teaching purposes we’re going to add a variable for flower color called Flower.Color with 2 levels (either red or yellow), and save this dataset to your working directory:

#make up a categorical variable Flower.Color and select the columns we want
iris <- iris %>% 
  mutate(Flower.Color = factor(rep(c("red","yellow"), length.out = nrow(iris)))) %>%
  select(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Species", "Flower.Color"))

#Save this dataset to your Files as a comma-delimited file called "YOUR.DATA.csv"
#row.names = FALSE keeps R from writing the row numbers as an extra column
write.csv(iris, "YOUR.DATA.csv", row.names = FALSE)

Don’t worry about the syntax here. Look at your working directory. You should now see the file YOUR.DATA.csv.

4.4 Importing your own data

To get a csv file from your computer into RStudio Cloud, you need to upload it to the Files tab on the bottom right. To do so, go to Files, click Upload, and navigate to the csv file. You will then see the csv file in the Files tab of your RStudio Cloud project.

Now, you need to load your data into R from your Files. To do so, use this code:

YOUR.DATA <- read.csv("YOUR.DATA.csv", stringsAsFactors=TRUE)

Here’s what this syntax tells R to do: read in a .csv file from the Files tab, look for the column names in the first row, expect comma-separated values (i.e. the .csv format), convert text columns to factors (stringsAsFactors=TRUE), and name the resulting object (which in this case is a dataframe) YOUR.DATA (or whatever name you give it if you change the name).

In this code, and in the following examples, we will call the dataset YOUR.DATA, but in the future you should give your dataset a more meaningful name. This is especially important if you are working with multiple datasets simultaneously.

Sometimes we will provide the data file in the RStudio Cloud project. If we tell you the data is already in the project, you will see the data file under the Files tab on the bottom right. You can load the data file into your R session with this code:

YOUR.DATA <- read.csv("YOUR.DATA.csv", stringsAsFactors=TRUE)

4.5 View your data

You can print the first few rows of your data to the console with this code:

head(YOUR.DATA)
  Sepal.Length Sepal.Width Petal.Length Species Flower.Color
1          5.1         3.5          1.4  setosa          red
2          4.9         3.0          1.4  setosa       yellow
3          4.7         3.2          1.3  setosa          red
4          4.6         3.1          1.5  setosa       yellow
5          5.0         3.6          1.4  setosa          red
6          5.4         3.9          1.7  setosa       yellow

Here you can see the variable names and the first 6 observations.

Alternatively, you can view it nicely in RStudio Cloud with this code:

View(YOUR.DATA)

You can also look at the structure of your data with this code:

str(YOUR.DATA)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Flower.Color: Factor w/ 2 levels "red","yellow": 1 2 1 2 1 2 1 2 1 2 ...

This tells you important characteristics of your variables that will matter when you do analyses. In most cases, there are two important distinctions to make between types of variables.

Categorical variables have a discrete set of possible values. With the str command, R indicates these as Factor variables. For example, in the iris dataset the variable Species is categorical and has three levels (possible values), which are the three species.

Continuous variables are variables where the values are numbers. With the str command, R will distinguish between integers, indicated as int, and non-integer numbers, indicated as num.

Note that there are other types of variables (character, dates) that we will not touch on here.

4.6 QAQC your data

In some (most!) cases, R may not classify variables correctly, and this is important to fix. For example, if you were analyzing the test scores of three classes (independent variable YOUR.IV) that you coded as 1, 2 and 3, R would treat YOUR.IV as an integer. You would need to tell R that YOUR.IV is a factor variable, using code like this:

YOUR.DATA$YOUR.IV <- as.factor(YOUR.DATA$YOUR.IV)
str(YOUR.DATA)

Similarly, use as.numeric or as.integer to convert a character to a number or integer.

You have now learned to read your data in, check the structure of the dataset, and change the classification of your variables as needed.
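One conversion deserves extra care. If a column of numbers was read in as a factor (for example because one cell contained a stray letter), calling as.numeric directly on the factor returns the internal level codes, not the original numbers. Here is a sketch with a made-up vector called scores:

```r
# A made-up factor whose labels happen to be numbers
scores <- factor(c("10", "20", "20", "35"))

as.numeric(scores)                # level codes: 1 2 2 3
as.numeric(as.character(scores))  # the actual numbers: 10 20 20 35
```

Converting to character first and then to numeric recovers the values you expect. Run str again afterwards to confirm the column now has the type you intended.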

5 Summarizing data

All of what you’ve done so far is learning your way around R and getting your data ready to use. Here is where R starts to work for you.

One important task in all data analysis is to calculate summary statistics.

Most studies include at least one categorical independent variable. We frequently want to obtain descriptive statistics for one or more dependent variables for each level of these categorical independent variables. For example, with the iris dataset we might want the mean and standard error for sepal length for each species.

This code uses dplyr’s group_by and summarize to calculate descriptive statistics for each level (value) of a categorical (factor) independent variable:

YOUR.DATA.SUMMARY <- YOUR.DATA %>% 
  group_by(YOUR.IV) %>% 
  summarize(n= n(), 
            mean = mean(YOUR.DV), 
            sd = sd(YOUR.DV),
            se = sd(YOUR.DV)/sqrt(n()))%>% 
  print()
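As a concrete sketch of the template above, here are the placeholders filled in for the built-in iris dataset, giving the mean and standard error of sepal length for each species. The object name iris.summary is made up for this example.

```r
library(dplyr)

# YOUR.DATA -> iris, YOUR.IV -> Species, YOUR.DV -> Sepal.Length
iris.summary <- iris %>%
  group_by(Species) %>%
  summarize(n    = n(),
            mean = mean(Sepal.Length),
            sd   = sd(Sepal.Length),
            se   = sd(Sepal.Length) / sqrt(n()))
iris.summary
```

The result has one row per species. The se column is simply the sd column divided by the square root of n, which you can check by hand for one row.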

5.1 Summarizing by multiple categorical values

To summarize by more than one categorical variable, just include them in the group_by command. For example, with the iris dataset we might want to get summary statistics for Sepal.Width for each Species but separately for red and yellow Flower.Color. We would use group_by(Species, Flower.Color).

YOUR.DATA.SUMMARY <- YOUR.DATA %>% 
  group_by(YOUR.IV1, YOUR.IV2) %>% 
  summarize(n= n(), 
            mean = mean(YOUR.DV), 
            sd = sd(YOUR.DV),
            se = sd(YOUR.DV)/sqrt(n()))%>% 
  print()

Here is the summary table for the iris dataset, providing summary statistics for sepal width for each species and flower color:

# A tibble: 6 × 6
# Groups:   Species [3]
  Species    Flower.Color     n  mean    sd     se
  <fct>      <fct>        <int> <dbl> <dbl>  <dbl>
1 setosa     red             25  3.48 0.325 0.0651
2 setosa     yellow          25  3.38 0.426 0.0853
3 versicolor red             25  2.78 0.336 0.0672
4 versicolor yellow          25  2.76 0.297 0.0594
5 virginica  red             25  2.94 0.287 0.0574
6 virginica  yellow          25  3.01 0.356 0.0713

The code creates a dataframe called YOUR.DATA.SUMMARY. The first two columns show the levels for the categorical (factor) independent variables. Then the columns n, mean, sd and se give the descriptive statistics (sample size, mean/average, standard deviation, standard error of the mean) for each combination of those independent variables. The code then prints YOUR.DATA.SUMMARY to the console for you to view.

5.2 Subsetting your data

There are times you may want to analyze just a subset of your data. Use the dplyr filter command:

YOUR.DATA.SUBSET <- filter(YOUR.DATA, YOUR.IV == "VALUE")

Here a new dataset called YOUR.DATA.SUBSET is created that is a subset of your full dataset, based on specific values for your independent variable YOUR.IV. You can combine filters with & (logical AND) and | (logical OR).
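Here is a sketch of single and combined filters, using the built-in iris dataset in place of YOUR.DATA. The names of the three subsets are made up for the example.

```r
library(dplyr)

# Keep only one species
setosa.only <- filter(iris, Species == "setosa")

# AND: setosa flowers with sepals longer than 5
long.setosa <- filter(iris, Species == "setosa" & Sepal.Length > 5)

# OR: either of the other two species
not.setosa <- filter(iris, Species == "versicolor" | Species == "virginica")

nrow(setosa.only)  # 50
nrow(not.setosa)   # 100
```

Note the double equals sign (==) for comparisons. A single = would be read as an argument name, and filter would stop with an error.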

6 Exporting data

You might want to export dataframes (like your summary data table) to refer to later.

Here is code you would use for saving your summary table as a .csv file to your Files:

write.csv(YOUR.DATA.SUMMARY, "YOUR.DATA.SUMMARY.csv", row.names = FALSE)

7 Making plots

First, load the ggplot2 library and set the theme to black and white:

library(ggplot2) 
theme_set(theme_bw())

In the ggplot “grammar of graphics”, you specify a dataframe and then map each variable to “aesthetics” of the plot such as X and Y coordinates, color, fill, shape, etc. Next you add layers of “geoms” that specify points, lines, boxplots, bars, error bars, etc. To tweak the look of the rest of the plot, you can add labels, custom scales, and themes. To combine all of these elements, just add them together with + and optionally a newline to keep things neat.
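To make the layering idea concrete, the sketch below builds one plot a piece at a time, storing it in an object called p (a name chosen for this example) and adding to it with +. It uses the built-in iris dataset.

```r
library(ggplot2)

# 1. Dataframe and aesthetics: which variable goes on which axis
p <- ggplot(iris, aes(x = Species, y = Sepal.Width))

# 2. A geom layer that draws the data
p <- p + geom_boxplot()

# 3. Labels and a theme to adjust the look
p <- p +
  labs(title = "Sepal width by species", x = "Species", y = "Sepal width") +
  theme_bw()

p  # printing the object draws the plot
```

The recipes in this section write everything as one chain of + instead of reassigning p, which produces the same plot.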

7.1 Boxplot with quartiles

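To visualize your raw data, you can use a boxplot that shows the median value (center line), the 25th and 75th percentiles, i.e. the first and third quartiles (edges of the box), the most extreme values within 1.5 times the interquartile range of the box (the ends of the whiskers), and outliers beyond that range (individual points).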

ggplot(YOUR.DATA, aes(x = YOUR.IV, y = YOUR.DV)) + 
  labs(title = "TITLE", x = "X AXIS LABEL", y = "Y AXIS LABEL") +
  geom_boxplot()

You can also display boxplots for two independent variables in combination by setting the fill aesthetic separate from x. For instance, here is the boxplot for the iris dataset, providing quartiles for sepal width for each species (mapped to x) and color (mapped to fill). We overrode the default color scale to label the legend and give it colors corresponding to the flowers.

ggplot(YOUR.DATA, aes(x = Species, fill=Flower.Color, y = Sepal.Width)) + 
  labs(title = "Sepal Widths for Two Colors of Three Species", x = "Species", y = "Sepal Width") +
  geom_boxplot() + 
  scale_fill_manual("Flower Color", values=c("red","yellow")) +
  scale_y_continuous(limits=c(0,NA))

7.2 Barplot with means with error bars

We can use the summary statistics table we made above to plot the means with error bars indicating the standard error of the mean. The barplot is common in the literature but unfortunately hides important features of your data, such as variance, shape, outliers, and sample size. In fact, barplots can be positively misleading: unlike the median, the mean is thrown off disproportionately by outliers, and the standard error of the mean keeps decreasing the more samples you include.

You can use geom_col and geom_errorbar to make a barplot. The adjustment to the y scale keeps the bars from floating off the x axis.

ggplot(YOUR.DATA.SUMMARY, aes(x = YOUR.IV1,y = mean)) +
  labs(title="TITLE", x = "X AXIS", y = "Y AXIS")+ 
  geom_col(fill = "steelblue") +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.3) +
  scale_y_continuous(expand=expansion(c(0,0.05)))

You can also display bar plots for two independent variables in combination by setting the fill aesthetic separate from x.

ggplot(YOUR.DATA.SUMMARY, aes(x = YOUR.IV1, fill = YOUR.IV2, y = mean)) +
  labs(title="TITLE", x = "X AXIS", y = "Y AXIS")+
  geom_col(position = position_dodge()) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.3, position = position_dodge(width=0.9)) +
  scale_y_continuous(expand=expansion(c(0,0.05))) +
  scale_fill_manual("YOUR.IV2.LABEL", values=c("COLOR1","COLOR2"))

For instance, here is the figure for the iris dataset, providing the mean and standard error for sepal width for each species and flower color.

We overrode the default color scale to label the legend and give it colors corresponding to the flowers. If your fill group had more than two categories, you would need to specify more than two colors.

7.3 Scatterplot of two quantitative variables

If you have two quantitative variables you can make a scatterplot with a best-fit line (using a linear model, or lm to fit the line).

ggplot(YOUR.DATA, aes(x = YOUR.IV, y = YOUR.DV)) + 
  labs(title = "TITLE", x = "X AXIS LABEL", y = "Y AXIS LABEL") + 
  geom_point() + 
  geom_smooth(method="lm", se = FALSE)

For instance, here is the figure for the iris dataset, plotting sepal length against sepal width.

If you have two quantitative variables and a categorical independent variable, you can make a scatterplot with best-fit lines (using a linear model, or lm to fit the line) for each level of the categorical independent variable.

ggplot(YOUR.DATA, aes(x = YOUR.IV, color = YOUR.CAT.IV, y = YOUR.DV)) + 
  labs(title = "TITLE", x = "X AXIS LABEL", y = "Y AXIS LABEL") + 
  geom_point() + 
  geom_smooth(method="lm", se = FALSE) + 
  scale_color_manual("YOUR.CAT.IV.LABEL", values=c("COLOR1","COLOR2"))

For instance, here is the figure for the iris dataset, plotting sepal length against sepal width for each species.

You now know how to make several types of figures, which are extremely useful for understanding your data.

7.4 Exporting plots

To export your plot to a PNG image file in Files, use ggsave after plotting with ggplot. Change the width and height (measured in inches) until the text looks good when placed in your document or slides. If you need a higher resolution for viewing up close, increase the dpi (dots per inch).

ggsave("MY.PLOT.png", width = 5, height = 5, dpi = 300)

After running this code you will find “MY.PLOT.png” in the Files tab. You can click to open it in a new window and download it to your computer.
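By default ggsave saves the last plot displayed. A slightly more explicit sketch stores the plot in an object and passes it through the plot argument, which makes it clear which figure ends up in the file. The object name and file name are made up, and tempdir() is used here only so the example does not clutter your Files.

```r
library(ggplot2)

# Store the plot in an object instead of only displaying it
iris.boxplot <- ggplot(iris, aes(x = Species, y = Sepal.Width)) +
  geom_boxplot()

# Save that specific plot to a PNG file
out.file <- file.path(tempdir(), "iris.boxplot.png")
ggsave(out.file, plot = iris.boxplot, width = 5, height = 5, dpi = 300)

file.exists(out.file)  # TRUE if the file was written
```

In your own project you would replace out.file with a plain file name such as "MY.PLOT.png" so that the image appears in the Files tab.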

8 Models

Models are a way to quantify and test patterns in your data based on your experimental design. You should choose a family of models appropriate to the kinds of data you have and the questions you are asking. When you fit a model to your data, you are estimating coefficients of your independent variables (also known as predictors) that best predict your dependent variable. The remaining differences between the model’s predictions and the data are called residuals. The goal is to capture the important patterns in your coefficients so that the residuals just represent noise.

Here we will cover (general) linear models. A familiar linear model is a linear regression in the form y = a x + b. Here y is your quantitative dependent variable, x is your quantitative independent variable, and a and b (the slope and intercept) are the model coefficients. You can imagine more complicated models with more terms.
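As a small sketch of what estimating coefficients means, the code below makes up data that follow y = 2x + 1 exactly and lets lm recover the slope and intercept. The numbers and the object name fit are invented for the illustration. Real data would include noise, so the estimates would only be close to the true values.

```r
# Made-up data that follow y = 2x + 1 exactly (no noise)
x <- 1:10
y <- 2 * x + 1

fit <- lm(y ~ x)
coef(fit)  # "(Intercept)" is b, "x" is the slope a
```

In the formula y ~ x, the dependent variable goes on the left of the ~ and the independent variable on the right, the same pattern used in the analyses that follow.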

The tests of linear models are given many names (t-tests, one- and two-way ANOVA, linear regression, ANCOVA) depending on how many of each kind of independent variable you have. The tests allow you to infer something about your data as it relates to a hypothesis. The classic output is a p-value, which is the probability that you would get data at least as extreme as you did given that a null hypothesis of no effect is true.

The effect size (the magnitude and direction of the phenomenon) is just as important to drawing scientific conclusions as p-values. You can express the effect size as an absolute or percentage difference of means (for categorical IVs) or the slope of a line (for quantitative IVs).
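For a categorical independent variable, the effect size can be computed directly from the group means. The sketch below compares the sepal width of two species in the built-in iris dataset. The object names are made up, and the choice of versicolor as the baseline for the percentage is an arbitrary one made for this example.

```r
# Mean sepal width of each species
species.means <- tapply(iris$Sepal.Width, iris$Species, mean)
species.means

# Effect size: setosa compared with versicolor
abs.diff <- species.means[["setosa"]] - species.means[["versicolor"]]
pct.diff <- 100 * abs.diff / species.means[["versicolor"]]

abs.diff  # difference in the units of the measurement
pct.diff  # the same difference as a percentage of the versicolor mean
```

Reporting the direction (which group is larger) along with the size makes the result much easier to interpret than a p-value alone.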

8.1 Choosing an analysis

Use the following table to decide which analysis is appropriate given your dependent (DV) and independent (IV) variables. These are all kinds of (generalized) linear models that you can specify with lm or glm in R.

Analysis                  Dependent variable  Independent variable(s)
One-way ANOVA             1 continuous        1 categorical
Two-way ANOVA             1 continuous        2+ categorical
Simple linear regression  1 continuous        1 continuous
Multiple regression       1 continuous        2+ continuous
ANCOVA                    1 continuous        1+ continuous, 1+ categorical
Logistic regression       1 categorical       0+ continuous, 0+ categorical

8.2 One-way ANOVA

A one-way ANOVA is one of our most basic statistical tests. We use this when we have a single categorical (factor) independent variable and a single quantitative dependent variable. For example, in the iris dataset, testing for an effect of species (with 3 levels, “setosa”, “versicolor”, and “virginica”) on sepal width would be a one-way ANOVA. A special case of a one-way ANOVA with only two levels of the predictor (say, if you were comparing two species) is the t-test.

You first need to make sure that R is treating your independent variable as a categorical (factor) variable and the dependent variable as a quantitative variable (numeric or integer) using str(YOUR.DATA). If your variables are not listed correctly, see Reading in data above for how to fix this.

Now let’s perform our ANOVA using this code:

YOUR.MODEL <- lm(YOUR.DV ~ YOUR.IV, data = YOUR.DATA)
anova(YOUR.MODEL)

This fits a linear model (lm) to YOUR.DATA with the defined dependent and independent variables and saves the output into the object YOUR.MODEL. Then it runs an ANOVA test of the effects in that model.

For instance, here is the ANOVA summary table for the iris dataset, providing the results for the effect of species on sepal width.

Analysis of Variance Table

Response: Sepal.Width
           Df Sum Sq Mean Sq F value    Pr(>F)    
Species     2 11.345  5.6725   49.16 < 2.2e-16 ***
Residuals 147 16.962  0.1154                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Pr(>F) value listed in the row for your independent variable is the P value for the test of the effect of your independent variable on your dependent variable.

Don’t forget to calculate summary statistics and make a figure so you can understand the magnitude and direction of any effects.
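The statement that the t-test is a special case of the one-way ANOVA can be checked directly. In the sketch below, two species are kept from the built-in iris dataset and the p-value from the ANOVA is compared with the p-value from a two-sample t-test. The object names are made up. Note the assumption var.equal = TRUE: R's t.test defaults to the Welch version, which does not assume equal variances and would give a slightly different p-value.

```r
# Keep two species and drop the unused factor level
two.species <- droplevels(subset(iris, Species != "setosa"))

# One-way ANOVA with a two-level predictor
fit <- lm(Sepal.Width ~ Species, data = two.species)
p.anova <- anova(fit)[["Pr(>F)"]][1]

# Classic two-sample t-test assuming equal variances
p.ttest <- t.test(Sepal.Width ~ Species, data = two.species,
                  var.equal = TRUE)$p.value

c(anova = p.anova, t.test = p.ttest)
all.equal(p.anova, p.ttest)
```

Because the two approaches agree, you can use the lm and anova recipe above whether your independent variable has two levels or more.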

8.3 Two-way ANOVA

A two-way ANOVA is like a one-way ANOVA except that there are two categorical independent variables and a single quantitative dependent variable. You may be interested in the interaction of the two independent variables (whether the effect of one independent variable depends on the other).

You first need to make sure that R is treating your independent variables as categorical (factor) variables and the dependent variable as a quantitative variable (numeric or integer) using str(YOUR.DATA). If your variables are not listed correctly, see Reading in data above for how to fix this.

Now let’s perform our two-way ANOVA using this code. The * between the two IVs indicates that both the main effects and their interaction should be in the model. For a two-way ANOVA with unbalanced data (sample sizes are not equal among treatments), we have to set orthogonal contrasts before setting up the model, and then use the Anova (notice the capital A) function from the car package to get type III sums of squares. Don’t worry about that terminology, but make sure to run the options line once per session before you construct your model with lm.

library(car)
options(contrasts = c("contr.sum", "contr.poly"))
YOUR.MODEL <- lm(YOUR.DV ~ YOUR.IV1 * YOUR.IV2, data = YOUR.DATA)
Anova(YOUR.MODEL, type = 3)

For instance, here is the ANOVA summary table for the iris dataset, providing the results for the effect of species and flower color on sepal width.

Anova Table (Type III tests)

Response: Sepal.Width
                      Sum Sq  Df    F value Pr(>F)    
(Intercept)          1402.09   1 12051.8004 <2e-16 ***
Species                11.34   2    48.7581 <2e-16 ***
Flower.Color            0.01   1     0.0573 0.8111    
Species:Flower.Color    0.20   2     0.8704 0.4210    
Residuals              16.75 144                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that these results include test statistics for the effects of each independent variable, as well as their interaction, which tests whether the effect of one independent variable depends on the other.
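
If you are curious how R reads the * in a model formula, the sketch below asks R which terms it builds from two ways of writing the same model. The variable names (len, supp, dose) come from the built-in ToothGrowth dataset and are only stand-ins for YOUR.DV, YOUR.IV1 and YOUR.IV2.

```r
# Terms R builds from the short form (just the *)
short <- attr(terms(len ~ supp * dose), "term.labels")

# Terms R builds from the long form (main effects written out, plus the *)
long <- attr(terms(len ~ supp + dose + supp * dose), "term.labels")

print(short)
print(long)

# Both formulas describe the same model: two main effects and their interaction
identical(short, long)
```

Either way of writing the formula gives the two main effects plus the interaction (shown with a colon, as in the ANOVA table).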

Don’t forget to calculate summary statistics and make a figure so you can understand the magnitude and direction of any effects. For a two-way ANOVA, you will want descriptive statistics and figures for each combination of the levels of the two independent variables (e.g. in the iris dataset you would want them for each species * each color, for a total of 6 means and SEs).
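
Here is a sketch of how those per-combination statistics could be calculated. It assumes the dplyr package is installed and uses the built-in ToothGrowth dataset as a stand-in (2 supplement types * 3 doses = 6 combinations). The names tooth and cell.summary are placeholders.

```r
library(dplyr)

# Stand-in data: treat dose as a categorical IV
tooth <- ToothGrowth
tooth$dose <- as.factor(tooth$dose)

# One row per combination of the two IVs, with n, mean and SE
cell.summary <- tooth %>%
  group_by(supp, dose) %>%
  summarize(n = n(),
            mean = mean(len),
            se = sd(len) / sqrt(n()),
            .groups = "drop")

cell.summary
```

The n column is also a quick way to see whether your data are balanced: if every combination has the same n, the design is balanced.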

8.4 Simple linear regression

A simple regression is when there is one quantitative independent variable and one quantitative dependent variable.

You first need to make sure that R is treating both your independent variable and your dependent variable as quantitative (numeric or integer) variables using str(YOUR.DATA). If your variables are not listed correctly, see Reading in data above for how to fix this.

Now let’s perform our regression using this code:

YOUR.MODEL <- lm(YOUR.DV ~ YOUR.IV, data = YOUR.DATA)
anova(YOUR.MODEL)
paste("R2 = ", round(summary(YOUR.MODEL)$r.squared, 3))

This performs a regression on the dataframe called YOUR.DATA with the defined dependent and independent variables, saves the output into the object YOUR.MODEL, displays an ANOVA table, and outputs the R2.

For instance, here is the regression summary table for the iris dataset, showing the results for the effect of sepal length on sepal width.

Analysis of Variance Table

Response: Sepal.Width
              Df  Sum Sq Mean Sq F value Pr(>F)
Sepal.Length   1  0.3913 0.39128  2.0744 0.1519
Residuals    148 27.9157 0.18862               
[1] "R2 =  0.014"

The Pr(>F) value listed in the row for your independent variable is the P value for the test of the effect of your independent variable on your dependent variable. For regression, you are also interested in the value for R2 (R squared), printed below the table. R2 is the coefficient of determination, and represents the proportion of variance in your dependent variable explained by your independent variable.
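
To see where R2 comes from, the sketch below recalculates it by hand from the Sum Sq column of the ANOVA table, using the built-in iris dataset: the sum of squares for the independent variable divided by the total sum of squares. The object names (model, tab, r2.by.hand) are placeholders.

```r
# Regression of sepal width on sepal length (built-in iris data)
model <- lm(Sepal.Width ~ Sepal.Length, data = iris)
tab <- anova(model)

# Pull the two sums of squares out of the ANOVA table
ss.iv <- tab["Sepal.Length", "Sum Sq"]
ss.resid <- tab["Residuals", "Sum Sq"]

# R2 = variance explained by the IV / total variance
r2.by.hand <- ss.iv / (ss.iv + ss.resid)

round(r2.by.hand, 3)
round(summary(model)$r.squared, 3)
```

Both lines should print the same value, because summary(model)$r.squared is this same ratio.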

Don’t forget to make a figure so you can understand the magnitude and direction of any effects.
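
One possible figure for a simple regression is a scatter plot with the fitted line drawn through it. This is only a sketch, assuming ggplot2 is installed and using the built-in iris dataset. The axis labels and the object name p are placeholders.

```r
library(ggplot2)

# The sign of the slope tells you the direction of the relationship
model <- lm(Sepal.Width ~ Sepal.Length, data = iris)
coef(model)

# Scatter plot with the fitted regression line and its confidence band
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Sepal length (cm)", y = "Sepal width (cm)")

p
```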

8.5 Multiple regression

A multiple regression is when there are several quantitative independent variables and a single quantitative dependent variable.

You first need to make sure that R is treating your independent variables and your dependent variable as quantitative (numeric or integer) variables using str(YOUR.DATA). If your variables are not listed correctly, see Reading in data above for how to fix this.

Now let’s perform our regression using this code:

library(car)
options(contrasts = c("contr.sum", "contr.poly"))
YOUR.MODEL <- lm(YOUR.DV ~ YOUR.IV1 + YOUR.IV2 + YOUR.IV1 * YOUR.IV2, data = YOUR.DATA)
Anova(YOUR.MODEL, type = 3)
paste("R2 = ", round(summary(YOUR.MODEL)$r.squared, 3))

This performs a regression on the dataframe called YOUR.DATA with the defined dependent and independent variables, saves the output into the object YOUR.MODEL, displays an ANOVA table with type III sums of squares, and outputs the R2.

For instance, here is the regression summary table for the iris dataset, showing the results for the effects of sepal length and petal length on sepal width.

Anova Table (Type III tests)

Response: Sepal.Width
                           Sum Sq  Df F value    Pr(>F)    
(Intercept)                0.5780   1  5.5095 0.0202570 *  
Sepal.Length               1.3774   1 13.1299 0.0004005 ***
Petal.Length               1.3795   1 13.1499 0.0003966 ***
Sepal.Length:Petal.Length  0.0707   1  0.6738 0.4130632    
Residuals                 15.3165 146                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "R2 =  0.459"

The Pr(>F) values listed in the rows for your independent variables are the P values for the tests of the effect of each independent variable on your dependent variable. Note that these results include test statistics for the effects of each independent variable, as well as their interaction, which tests whether the effect of one independent variable depends on the other. For regression, you are also interested in the value for R2, printed below the table. This is the coefficient of determination, and represents the proportion of variance in your dependent variable explained by your independent variables.
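
To get a feel for what R2 tells you here, this sketch compares the R2 of the simple regression (sepal length only) with the R2 of the multiple regression (sepal length, petal length and their interaction), using the built-in iris dataset. The object names are placeholders.

```r
# Simple regression: one quantitative IV
simple <- lm(Sepal.Width ~ Sepal.Length, data = iris)

# Multiple regression: two quantitative IVs and their interaction
multiple <- lm(Sepal.Width ~ Sepal.Length * Petal.Length, data = iris)

r2.simple <- summary(simple)$r.squared
r2.multiple <- summary(multiple)$r.squared

# Proportion of variance in sepal width explained by each model
round(c(simple = r2.simple, multiple = r2.multiple), 3)
```

Adding independent variables can never lower R2, so a higher value on its own does not mean the extra variables matter. Use the P values in the ANOVA table for that.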

Don’t forget to make a figure so you can understand the magnitude and direction of any effects that you find.

8.6 ANCOVA

An ANCOVA is a combination of ANOVA and regression and, as a result, is used when there is one (or more) quantitative independent variable, one (or more) categorical independent variable, and a single quantitative dependent variable.

You first need to make sure that R is treating your independent variables correctly (one as categorical, i.e. factor, and one as quantitative, i.e. numeric or integer) using str(YOUR.DATA). If your variables are not listed correctly, see Reading in data above for how to fix this.

Now let’s perform our ANCOVA using this code:

library(car)
options(contrasts = c("contr.sum", "contr.poly"))
YOUR.MODEL <- lm(YOUR.DV ~ YOUR.IV.QUANTITATIVE + YOUR.IV.CATEGORICAL + YOUR.IV.QUANTITATIVE * YOUR.IV.CATEGORICAL, data = YOUR.DATA)
Anova(YOUR.MODEL, type = 3)

This performs an ANCOVA on the dataframe called YOUR.DATA with the defined dependent and independent variables, saves the output into the object YOUR.MODEL, and displays an ANOVA table with type III sums of squares.

For instance, here is the ANCOVA summary table for the iris dataset, showing the results for the effects of sepal length and species on sepal width.

Anova Table (Type III tests)

Response: Sepal.Width
                      Sum Sq  Df F value    Pr(>F)    
(Intercept)           0.3374   1  4.5496   0.03462 *  
Sepal.Length          6.2572   1 84.3675 4.109e-16 ***
Species               0.6441   2  4.3425   0.01475 *  
Sepal.Length:Species  1.5132   2 10.2011 7.190e-05 ***
Residuals            10.6800 144                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Pr(>F) values listed in the rows for your independent variables are the P values for the tests of the effect of each independent variable on your dependent variable. Note that these results include test statistics for the effects of each independent variable, as well as their interaction, which tests whether the effect of one independent variable depends on the other. In the case of an ANCOVA, this interaction can best be thought of as meaning that the slope of the regression of your dependent variable on your quantitative independent variable differs between the levels of your categorical independent variable. In this case, you see sepal width is significantly affected by the interaction between sepal length and species, meaning that the effect of sepal length on sepal width (a regression) depends on species, or that the slopes of these regression lines differ.
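
One way to see what "different slopes" means is to fit a separate regression for each species and compare the slopes. This sketch does that with the built-in iris dataset. It is only meant to illustrate the interaction, not to replace the ANCOVA itself, and the object name slopes is a placeholder.

```r
# Split the data by species, fit a regression within each species,
# and keep only the slope for sepal length
slopes <- sapply(split(iris, iris$Species), function(d) {
  coef(lm(Sepal.Width ~ Sepal.Length, data = d))[["Sepal.Length"]]
})

# One slope per species; different values are what the interaction term tests
round(slopes, 3)
```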

Don’t forget to make a figure so you can understand the magnitude and direction of any effects.

8.7 Logistic regression

A logistic regression is used when there are quantitative and/or categorical independent variables and a single categorical dependent variable (usually binomial, e.g. yes/no, presence/absence). In this case, you can think of how the independent variables affect the likelihood of the dependent variable having one of two states (e.g. yes vs. no; presence vs. absence).

You first need to make sure that R is treating your variables appropriately using str(YOUR.DATA). Especially important (and unique from the other analyses covered in the R Cookbook) is that the dependent variable is categorical, with two factor levels (a binary response). If it makes sense in your context, you should make the first level be “failure” and the second level “success”. If your variables are not listed correctly, see Reading in data above for how to fix this.

Now let’s perform our logistic regression using this code:

library(car)
YOUR.MODEL <- glm(YOUR.DV ~ YOUR.IV, data = YOUR.DATA, family = "binomial")
Anova(YOUR.MODEL, type = 3)

This performs a logistic regression on the dataframe called YOUR.DATA with the defined categorical (binomial, i.e. yes/no, presence/absence) dependent variable and one independent variable, saves the result in YOUR.MODEL, and displays an analysis of deviance table for that object. Note that you can include more than one independent variable, in any combination of quantitative and categorical variables.

For instance, we might want to perform a logistic regression for the iris dataset, testing for the effect of sepal width on flower color. That is, how does sepal width affect the likelihood of flower color being red?

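Before we can do that analysis, we need to reorder the levels of our dependent variable so that red is the second level, corresponding to a “success”. Note that assigning new names with levels() would rename the existing levels (relabelling your data) rather than reorder them, so instead we rebuild the factor with the levels listed in the order we want.

YOUR.DATA$Flower.Color <- factor(YOUR.DATA$Flower.Color, levels = c("yellow", "red"))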
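
Reordering factor levels and renaming them are easy to confuse. The sketch below uses a small made-up vector (not the course data) to show the difference. The object names color, reordered and renamed are placeholders.

```r
# A made-up factor; R orders the levels alphabetically: "red", "yellow"
color <- factor(c("red", "yellow", "red", "yellow"))
levels(color)

# Reordering: rebuild the factor with the levels in the order you want.
# The data values stay the same; only the level order changes.
reordered <- factor(color, levels = c("yellow", "red"))
levels(reordered)
as.character(reordered)

# Renaming: assigning to levels() relabels the existing levels in place.
# Every "red" becomes "yellow" and vice versa, which changes the data.
renamed <- color
levels(renamed) <- c("yellow", "red")
as.character(renamed)
```

It is worth running str() or head() on your data after any change to factor levels to confirm the values are still what you expect.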

Having recoded the data, here is the logistic regression summary table for the iris dataset, providing the results for the effect of sepal width on the likelihood of the flower being red.

Analysis of Deviance Table (Type III tests)

Response: Flower.Color
            LR Chisq Df Pr(>Chisq)
Sepal.Width 0.035331  1     0.8509

The value for Pr(>Chisq) listed in the row for your independent variable is the P value for the test of the effect of your independent variable on your dependent variable. In this case, you see flower color (the probability that a flower is red) is not significantly affected by sepal width.

Don’t forget to make a figure so you can understand the magnitude and direction of any effects. In this case, the best plot is a scatter plot, but your Y variable will only have values of 0 and 1.
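
Here is a sketch of what such a figure could look like. Because flower color is not part of the built-in iris dataset, this example swaps in the built-in mtcars dataset, where am is already coded 0/1 and wt is quantitative. It assumes ggplot2 is installed. The axis labels and object names are placeholders.

```r
library(ggplot2)

# Logistic regression on stand-in data: am (0 or 1) as a function of wt
model <- glm(am ~ wt, data = mtcars, family = "binomial")

# Scatter plot of the 0/1 values with the fitted logistic curve on top
p <- ggplot(mtcars, aes(x = wt, y = am)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ x,
              method.args = list(family = "binomial"), se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Probability of am = 1")

p
```

The curve shows the predicted probability of a "success" (a value of 1) across the range of the independent variable, which makes the direction of the effect easy to read.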

8.8 Saving statistical output

To export the summary table from a statistical test as a .csv file, use this code:

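# Anova() returns a table that write.csv can save; summary(YOUR.MODEL) returns a list, which write.csv cannot
library(car)
SUMMARY.MODEL <- Anova(YOUR.MODEL, type = 3)
write.csv(SUMMARY.MODEL, "SUMMARY.MODEL.csv")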

In this code, you’re creating an object SUMMARY.MODEL that is the summary table, and then saving it as a .csv file to your Files. You can export the file to your computer by checking the box next to the csv file you want to download, clicking “More”, and then clicking Export. If multiple files are selected, they are combined into a zip file.
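
If you want to check what actually gets written, here is a sketch that saves an ANOVA table (which is a data frame, so write.csv can handle it) and reads it back in. It uses the built-in iris dataset and writes to a temporary folder via tempdir() so it does not clutter your Files. The names anova.table and out.file are placeholders.

```r
# Fit a model and get its ANOVA table (built-in iris data)
model <- lm(Sepal.Width ~ Sepal.Length, data = iris)
anova.table <- anova(model)

# Save to a temporary location; in your own script, use a file name instead
out.file <- file.path(tempdir(), "SUMMARY.MODEL.csv")
write.csv(anova.table, out.file)

# Read the file back in to confirm what was saved
saved <- read.csv(out.file, row.names = 1)
saved
```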

9 Example R script

Here is an example R Script for reading in a .csv dataset, getting and saving descriptive statistics, making a figure, and performing a two-way ANOVA.

## Kailen Mooney October 2, 2016
## Updated for RStudio Cloud, Celia Symons July 22, 2025

## An example R Script for reading in a .csv dataset, getting and saving descriptive statistics, making a figure, and performing a two-way ANOVA

## This R script needs to be edited (variables in CAPS), but is meant to give a sense of what a complete R script for the entire process of data analysis should look like

## Reading in and inspecting your data  -----                                               

#Read in your data
YOUR.DATA <- read.csv(file="YOUR.DATA.csv", stringsAsFactors=TRUE)

#Look at your data in the console
head(YOUR.DATA)

#View  your data
View(YOUR.DATA)

#Inspect the structure of your data, making sure the categorical variables are listed as "factor"
str(YOUR.DATA)

#If a categorical variable is not listed as factor, fix it
YOUR.DATA$YOUR.IV1 <- as.factor(YOUR.DATA$YOUR.IV1)
YOUR.DATA$YOUR.IV2 <- as.factor(YOUR.DATA$YOUR.IV2)

#Check to make sure it worked
str(YOUR.DATA)

##Calculate descriptive statistics and make figure -----

#Call up dplyr for use in this R session (it provides %>%, group_by and summarize)
library(dplyr)

#Calculate descriptive statistics for a dependent variable at each combination of levels of your two categorical (factor) independent variables
YOUR.DATA.SUMMARY <- YOUR.DATA %>% 
  group_by(YOUR.IV1, YOUR.IV2) %>% 
  summarize(n = n(), 
            mean = mean(YOUR.DV), 
            sd = sd(YOUR.DV),
            se = sd(YOUR.DV)/sqrt(n())) %>% 
  print()

#View your table of descriptive statistics
YOUR.DATA.SUMMARY

#Save your descriptive statistics to your Files tab
write.csv(YOUR.DATA.SUMMARY,"YOUR.DATA.SUMMARY.csv")

#Call up ggplot2 for use in this R session
library(ggplot2)

#Make your figure showing means +- 1SE for your dependent variable at each combination of levels of your two categorical (factor) independent variables. This is based on using the descriptive statistics summary table you already made.
ggplot(YOUR.DATA.SUMMARY, aes(x = YOUR.IV1, fill = YOUR.IV2, y = mean)) +
  labs(title="TITLE",x="X AXIS", y="Y AXIS")+
  geom_col(position = position_dodge(width = 0.9)) +
  #Dodge the error bars by the same width as the bars so they line up
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.3, position = position_dodge(width = 0.9)) +
  scale_y_continuous(expand=expansion(c(0,0.05)))

ggsave("MY.PLOT.png", width = 5, height = 5, dpi=300)

##Perform ANOVA testing ----
#for effects of your two categorical (factor) independent variables, and their interaction, on your dependent variable

#Perform a two-way ANOVA and view results
library(car)
options(contrasts = c("contr.sum","contr.poly"))
YOUR.MODEL <- lm(YOUR.DV ~ YOUR.IV1 + YOUR.IV2 + YOUR.IV1 * YOUR.IV2, data=YOUR.DATA)
Anova(YOUR.MODEL, type=3)

#Save results to your Files
SUMMARY.MODEL<-Anova(YOUR.MODEL, type=3)
write.csv(SUMMARY.MODEL,"SUMMARY.MODEL.csv")

##You're done, now use your summary statistics and plots to prepare a report