An Introduction to Programming in R

Author

Mark Pezzo

Published

June 12, 2023


Run these first in the console:

library(tidyverse)

library(dplyr)

library(magrittr)


1 Quick Introduction

1.1 About this document

I created this HTML document using Quarto. Quarto is the successor to RMarkdown, and allows you to “knit” together content and executable R code into a finished document (HTML, PDF, Word). You can see your syntax, the results, and your writeup all together. It’s great for creating educational material, but it also supports the idea of reproducible research. Gone are the days when you can’t remember how you recoded your data, or which syntax file you used. Gone too are the days when you have to copy over all your results into a new table in Word for your academic paper. Everything you need is in the document, and you can simply hide stuff (usually syntax) that you don’t want people to see. But it’s still there for a year from now when you can’t recall what you did. Students can use Quarto to take notes! https://quarto.org

1.2 R vs R Studio vs Posit

R is the language. R Studio is the Integrated Development Environment (IDE) where you write and run the code. Other examples of IDEs are “Visual Studio,” “Brackets,” and “Atom.” IDEs usually have a file viewer, autocomplete, error messages, and “visual” editing for formatted output (that means instead of typing <bold>Hello World</bold>, you simply type Hello World and click on the bold button). Visual editing is what you’ve probably always done. The name is for old people like me, who used to have to do things the hard way. Speaking of, in theory you could write your code in a simple text editor, but even I wouldn’t do that anymore. In addition to R, you can also write code in Python, C++, and SQL in R Studio. That’s confusing, which is part of why the company that created R Studio changed its own name from RStudio to Posit in 2022. The desktop IDE is still called R Studio, but the cloud version is now called “Posit Cloud.” Again, this is because you can program in lots of other languages. I’m using Posit Cloud to create this document in Quarto demonstrating R. I know. It’s a lot.

1.3 Sections of the IDE

Here’s what the IDE looks like (again, it’s called “Posit Cloud” if you use it online and “R Studio” if you’re using the desktop version).

There are three areas. You generally write your code in the script window. Your variables, output, packages (code written by other people to make your life easier) and also your file browser can all be found in the viewer area, and people who want to kick it old school and write one line of code at a time can use the console window. The console window will also echo everything that you run in the script window, and any warnings or errors will show up there too. Most of the time, you can ignore that stuff. But it’s also where the output shows up, so you can’t completely ignore the console window. Note that each line of console output starts with a number in brackets, e.g., [1]. That number is the position of the first value printed on that line. So, let’s say we typed 7 * 7 in the script window and then clicked on RUN. The gray box shows your code, and below that is what you would see in the console window.

7 * 7
[1] 49

Sometimes your output will have multiple lines, and then each line starts with the position of its first value, so you may see something like [16] or [31] at the beginning of the later lines.
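To see how those bracketed numbers behave, here is a small sketch with a made-up vector (the name x is only for illustration). How many values fit on a line depends on the width of your console, so your line breaks may differ.

```r
# A vector with 30 values, long enough to wrap onto several console lines
x <- seq(10, 300, by = 10)
print(x)

# The bracketed number at the start of each printed line is the position
# of the first value on that line, so you can look it up by indexing:
x[1]    # the first value
x[16]   # the sixteenth value
```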

2 Getting Data

2.1 Built-in Data

R comes with a large number of built-in data sets to get you started. Here is a list of some of the most popular:

  • airquality - New York Air Quality Measurements

  • AirPassengers - Monthly Airline Passenger Numbers 1949-1960

  • mtcars - Motor Trend Car Road Tests

  • iris - Edgar Anderson’s Iris (flowers) Data

If you want to learn about other built-in datasets, go here: The R Datasets Package.
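You can also list the built-in data sets from inside R. This is a minimal sketch using the base data() function; the variable name available is just an illustration.

```r
# List the data sets that ship with the built-in "datasets" package
available <- data(package = "datasets")$results

# Each row describes one data set: its name (Item) and a short title
head(available[, c("Item", "Title")])

# In an interactive session, open the help page for one data set
# to see what its columns mean:
# ?mtcars
```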

To view one of these data sets, type print(name) into the script window and click Run (or into the console and hit Return), replacing “name” with the name of the dataset, like mtcars:

print(mtcars)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
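Printing all the rows gets unwieldy with bigger data sets. As a sketch, these base R functions give a quicker overview of the same data:

```r
head(mtcars)   # just the first six rows
dim(mtcars)    # number of rows, then number of columns
str(mtcars)    # one line per column: its type and first few values
```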

2.2 Creating a Data Frame

Most of the data you analyze in R will be in the form of a data frame. This is like a (hidden) Excel spreadsheet with columns of data, one per variable. Many of the built-in data sets (including mtcars) are data frames, and graphing with ggplot2 requires data frames. To create one by hand we must enter the data for each column in the form of vectors. Like this:

NameOfVector <- c(1, 2, 3, 4)

The little “c” is what defines it as a vector. It officially stands for “combine,” but you can think of it as standing for “column.” Use quotes around your data if the vector is supposed to hold text.

Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)

df <- data.frame(Name, Age)

Often people name their data frames “df”, but you could call it “George”:

Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)

George <- data.frame(Name, Age)

print(George)
   Name Age
1   Jon  23
2  Bill  41
3 Maria  32
4   Ben  58
5  Tina  26

You can also do it this way:

TheWeekend <- data.frame(Name = c("Jon", "Bill", "Maria", "Ben", "Tina"),
                 Age = c(23, 41, 32, 58, 26)
                 )
print(TheWeekend)
   Name Age
1   Jon  23
2  Bill  41
3 Maria  32
4   Ben  58
5  Tina  26
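After building a data frame by hand, it is worth checking that R stored it the way you intended. A minimal sketch, reusing the same values (the name people is made up for this example):

```r
# Text goes in quotes; numbers do not
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)

# Every column needs the same number of values
length(Name) == length(Age)

# stringsAsFactors = FALSE keeps Name as plain text on older versions of R
people <- data.frame(Name, Age, stringsAsFactors = FALSE)

str(people)     # column types: chr for Name, num for Age
nrow(people)    # number of rows
names(people)   # column names
```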

2.3 Referencing a column within a dataframe

Let’s look at mtcars again. . . Do you see the column for mpg?

print(mtcars)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

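If we try to print mpg on its own, it won’t work, because mpg is a column inside mtcars rather than an object of its own. In a fresh R session you get an error saying the object 'mpg' was not found. (If ggplot2 is loaded, you instead get an unrelated ggplot2 dataset that happens to be named mpg.)

print(mpg)

Instead, name the data frame first and use $ to pick out the column: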

print(mtcars$mpg)
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
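The $ is the most common way to reach a column, but not the only one. As a sketch, here are three equivalent forms in base R (the variable names are made up for illustration):

```r
# Three equivalent ways to pull the mpg column out of mtcars
by_dollar <- mtcars$mpg
by_name   <- mtcars[["mpg"]]
by_index  <- mtcars[, "mpg"]

identical(by_dollar, by_name)    # TRUE
identical(by_dollar, by_index)   # TRUE

# Once you have the column, you can pass it to any function
mean(mtcars$mpg)
range(mtcars$mpg)
```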

2.4 Simple Analyses

Okay, let’s plot some data, then run some correlations, and then a t-test with this car data. First let’s plot data. R has a pretty simple graphics package built in. A package is code that someone else wrote. It’s like running ANOVA in SPSS. There is a ton of code required to run an ANOVA, but all you have to do is use the keyword and enter some variables and a few other options. The beauty of R is that it’s open source, and anybody can write a package. Some packages come with R and others you install yourself, but most of them still have to be loaded before you can use them. To see if you already have a package, go to the Packages tab in the Viewer window. Any package that’s listed is installed, but if the little box to the left isn’t checked, then it’s not yet in your active library. Either check the box, or type this code at the top of your script:

library(name_of_package)

Let’s use this:

plot(mtcars$wt, mtcars$mpg)

Not bad, but the base graphics package is pretty “basic.” Let’s use the famous ggplot2 graphics package. I checked my Packages tab and it’s not even installed. As it turns out, ggplot2 and a few other packages are bundled together in a very popular package called tidyverse. I’ll type this in my script to install tidyverse and all the other packages that come with it (including ggplot2):

install.packages("tidyverse")

This may take a few minutes, and you’ll see a LOT of scary messages in red in the Console. That’s okay. Eventually it’ll tell you that the “downloaded source packages are in” . . . well, wherever it puts those things. Now, when you look at your Packages tab you should see ggplot2.

Now, you still have to load it into the library. You can check the box, or type this into your script (I recommend not checking the box, and doing everything via script):

library(ggplot2)

As a reminder, you can type this in the script window or in the console window. It’ll work either way, but it’s better to keep ALL of your code in the script. It also makes it easier to share your script with someone who might not have all of your packages already installed on their computer.
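If you share a script, the other person may not have your packages installed. One common pattern is to install a package only when it is missing. This is a sketch, and ensure_package is a made-up helper name, not a built-in function:

```r
# Hypothetical helper: install a package only if it is not already installed
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_package("stats")      # ships with R, so nothing is downloaded
# ensure_package("ggplot2")  # uncomment to install ggplot2 if it is missing
```

You would still call library() afterwards to load the package into your session.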

Okay, now that we have ggplot2, let’s plot a simple scatterplot. Scroll back up to the printout of the data for mtcars, and you’ll see columns for mpg and hp (horsepower). I’ll bet those are inversely related:

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

That’s much nicer! And if we had two series, we could color code those.

Note that even though the package is called ggplot2, we only have to type ggplot. The 2 is part of the package’s name (it is the second incarnation of ggplot), while the function inside it that draws the plot is simply called ggplot().

Oops! I wanted mpg x horsepower, didn’t I? Okay, let’s do that one:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

Notice that this time, I didn’t need to include the library(ggplot2) because ggplot2 is already loaded into memory.

Regardless, the relationship between mpg and horsepower is negative, just like the one between mpg and wt. So, hp and wt are probably (positively) related to each other. Let’s check:

ggplot(data=mtcars,mapping=aes(x=wt,y=hp)) + geom_point()

It’s a little messy, but that’s definitely a positive relationship!
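The scatterplots show the direction of each relationship, and a correlation puts a number on it. A minimal sketch using base R’s cor() and cor.test():

```r
# Pearson correlations between the three variables plotted above
cor(mtcars$mpg, mtcars$wt)   # negative: heavier cars get fewer miles per gallon
cor(mtcars$mpg, mtcars$hp)   # negative: more powerful cars get fewer miles per gallon
cor(mtcars$wt,  mtcars$hp)   # positive: heavier cars tend to be more powerful

# cor.test() adds a significance test and a confidence interval
cor.test(mtcars$wt, mtcars$hp)
```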

2.5 Color Coding Different Groups

Let’s see where different cylinder engines show up in that last scatterplot. This is the same as creating a series in Excel. We simply add an extra “aesthetic” called col and map it to the variable for the number of cylinders, cyl.

ggplot(data=mtcars,mapping=aes(x=wt,y=hp, col=cyl)) + geom_point()

Hmm… that’s not what I expected. It’s using one color and a gradient, making it hard to tell the difference between 4, 5, 6, 7, and 8 cylinders (there’s a 7 cylinder car??). The problem is that ggplot automatically color codes variables it thinks are numerical using a gradient. We want it to think of the different cylinder counts as categorical, not numerical (like coding in an ANOVA, where Group = 1, 2, or 3 doesn’t mean that Group 2 has twice as much . . . stuff . . . as Group 1). To do this, we need to turn cylinder into a factor. Just like in a factorial ANOVA, the values aren’t treated as numbers, but simply as categories. We use this code to do it:

mtcars$cyl <- as.factor(mtcars$cyl)

What does the $ mean? Whenever you refer to a variable, R needs to know which dataset it comes from (because you might have multiple datasets open). In the function to create a scatterplot we did that by saying data = mtcars. Outside of ggplot, we name the data frame first and then use $ to pick out the column, as in mtcars$cyl.

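What does the <- mean? That’s the assignment operator. It works like “equals” in most other languages: it stores whatever is on the right under the name on the left. We’re saying: turn the variable cyl from the data frame mtcars into a factor, and save the result back into that same column.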
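Here is a small sketch of both ideas on a copy of the data, so the built-in mtcars stays untouched. The names x and cars_copy are made up for illustration.

```r
# <- stores the value on its right under the name on its left
x <- 5
x

# Work on a fresh copy of the built-in data
cars_copy <- datasets::mtcars
class(cars_copy$cyl)                        # "numeric"

# Convert the column and store the result back in the same column
cars_copy$cyl <- as.factor(cars_copy$cyl)
class(cars_copy$cyl)                        # "factor"
levels(cars_copy$cyl)                       # "4" "6" "8"
```

The levels show that only 4, 6, and 8 cylinder cars are in the data. The in-between shades in the gradient legend never stood for real cars.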

mtcars$cyl <- as.factor(mtcars$cyl)
ggplot(data=mtcars,mapping=aes(x=wt,y=hp, col=cyl)) + geom_point()
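If you would rather not change the data frame itself, one alternative is to convert the column inside the plot call. This is a sketch of that approach, assuming ggplot2 is installed; cyl_plot is a made-up name.

```r
library(ggplot2)

# factor(cyl) converts the column only for this plot; the data is not changed
cyl_plot <- ggplot(data = datasets::mtcars,
                   mapping = aes(x = wt, y = hp, col = factor(cyl))) +
  geom_point() +
  labs(col = "Cylinders")   # nicer legend title than "factor(cyl)"

cyl_plot
```

The trade-off is that you have to repeat factor(cyl) in every plot, whereas converting the column once changes it for everything that follows.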

2.6 More on Scatterplots

mpg <- mtcars$mpg 
print(mtcars)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
print(mpg)
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

3 Analysis of Variance

3.1 Prerequisites

Make sure you have the following R packages:

  • tidyverse for data manipulation and visualization

  • ggpubr for easily creating publication-ready plots

  • rstatix for pipe-friendly R functions for easy statistical analyses

  • datarium for the data sets required in this chapter

Get them this way:

install.packages("tidyverse")
install.packages("ggpubr")
install.packages("rstatix")
install.packages("datarium")

Once they’re installed, you don’t have to do this again. You can find a list of all installed packages in the Packages tab on the right side of R Studio.

But even if they’re installed, you do still have to load them into memory every time so that the rest of your code works. Load them using the “library” command, like this:

library(tidyverse)
library(ggpubr)
library(rstatix)

Alternatively, you can skip the code and simply check the box next to each package that you want to load into memory. That is more likely to cause errors, and will cause problems if you share your code with someone else.

Key R functions: anova_test() [rstatix package], wrapper around the function car::Anova().

3.2 Data preparation

We’ll use the jobsatisfaction dataset [datarium package], which contains the job satisfaction score organized by gender and education levels.

In this study, a researcher wants to evaluate whether there is a significant two-way interaction between gender and education_level in explaining the job satisfaction score. An interaction effect occurs when the effect of one independent variable on an outcome variable depends on the level of the other independent variables. If an interaction effect does not exist, main effects could be reported.

Load the data and inspect one random row by groups:

library(datarium)
library(dplyr)
set.seed(123)
data("jobsatisfaction", package = "datarium")
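The inspection step could look something like the sketch below. It assumes the rstatix and datarium packages are installed, uses sample_n_by() from rstatix, and assumes the grouping columns are named gender and education_level, as in the text above.

```r
library(rstatix)   # provides sample_n_by()
library(dplyr)     # provides the %>% pipe

data("jobsatisfaction", package = "datarium")
set.seed(123)      # so the "random" rows are the same every time

# One randomly chosen row from each gender x education_level group
jobsatisfaction %>% sample_n_by(gender, education_level, size = 1)

# How many people are in each group?
table(jobsatisfaction$gender, jobsatisfaction$education_level)
```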

In this example, “education_level” is our focal variable, that is, our primary concern. It is thought that the effect of “education_level” will depend on one other factor, “gender”, which is called a moderator variable.
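As a sketch of where this is heading, the two-way ANOVA itself can be run with anova_test() from rstatix. This assumes the outcome column is named score; res is a made-up name for the result.

```r
library(rstatix)
library(dplyr)

data("jobsatisfaction", package = "datarium")

# gender * education_level expands to both main effects plus their interaction
res <- jobsatisfaction %>%
  anova_test(score ~ gender * education_level)

res   # one row per effect; the interaction row is gender:education_level
```

The interaction row is the one to read first: if it is significant, the effect of education_level differs by gender, and the main effects should be interpreted with that in mind.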