Introduction

Welcome to the wonderful world of R. Today I will introduce you to some of the core utility of R by explaining the most fundamental parts of the program. By the end of this tutorial, you should be able to do the following:

Installation

It probably isn’t helpful to read an R tutorial without having the program. To that end, it would be useful to learn how to install it. First, you will need to install the base R software. To do so, navigate to https://www.r-project.org/. As of 2023, it should look something like this:

From there, follow the instructions: go to the download R link, pick a mirror, and then download the appropriate version of R for your operating system from the links that follow. After you go through the install prompt, you can open R and it will look something like this:

While this basic setup works, it is usually advised to download a more convenient interface, known as an integrated development environment (IDE). Most recommend RStudio, which provides a variety of useful tools for running R. Simply navigate to https://posit.co/download/rstudio-desktop/ and follow the download instructions there (you may skip the step about installing base R, since we just did that). Once you install RStudio and open the program, it will look like this:

This is what it looks like during a typical R session:

You can see that RStudio looks very different in a real session. You can also see 4 distinct panes. The bottom left section is the Console, which functions similarly to the core R program we saw earlier. Here you can type in code and run it as-is. I recommend not using this area of RStudio for writing code, but I do think it is useful for reading error messages and troubleshooting, which we will talk about later.

The top left section is the Source Editor, usually your R script, and arguably the most important part of your coding experience. Here you can write a ton of code, save it for later, and rerun your analysis at another time. To run a chunk of code, either click on a part of the code chunk or highlight the whole thing, then press Ctrl + Enter (Cmd + Return on a Mac). Afterwards you should see it run in the Console below. If you are running plots or saving objects, you will also see changes in the other two panes to the right.

The top right section is your Environment. This is what stores your data, objects, and functions for use later. You can see for example that I have saved two objects here, x, and y, each with 1,000 rows of data of varying decimal values. You can also clear or remove things from this area.

The bottom right pane doesn’t have a distinct name, but this is where you find a lot of useful tabs, including Plots, Help, and the Viewer. You can see in the screenshot that I have plotted a histogram, which is shown in the Plots tab. These tabs will be shown to a degree later, but for now just know where they are located.

R: A Very Fancy Calculator

Something that makes R easier to understand is that it is built for statistics, which by extension means that it is built on mathematical operations. To illustrate this, I will show you a series of commands you can run, but first let’s orient ourselves to running a command. As mentioned earlier, you can type commands into the Console pane and hit Enter, or you can write them in the R script in the Source Editor and press Ctrl + Enter, as shown below:

Now that you know how to enter commands, run the following code in either the R script or the Console. Note: the spacing doesn’t matter (you can also run 3+4 without problems).

3 + 4 
19 - 5 
20 / 4 
7 * 8 
6 ^ 7 

If you run just the first line, 3 + 4, the Console prints the result. The Console always shows the output of whatever you run, whether from the Console itself or from your R script:

## [1] 7

Beyond just basic arithmetic, there are a number of mathematical functions in R that are also helpful, which leads us to our first use of functions! Be careful though. R is case-sensitive, which means that if you capitalize something that isn’t supposed to be capitalized, R will stop with an error. R cannot guess past typos either, so make sure you have spelled the function name correctly. In any case, let’s say we want to get the square root of 49. We would simply run the following code:

sqrt(49)
## [1] 7
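To make the case-sensitivity warning concrete, here is a small sketch that calls the function with the wrong capitalization. It wraps the bad call in tryCatch (a base R function not covered in this tutorial) purely so the error message can be stored and printed instead of halting the script. The object names result and correct are just placeholders I chose.

```r
# sqrt() exists in base R, but Sqrt() does not: R treats them as different names.
result <- tryCatch(
  Sqrt(49),                                  # wrong capitalization on purpose
  error = function(e) conditionMessage(e)    # keep the error text instead of stopping
)
result   # prints the error message R produced

# The correctly spelled, lowercase version works as expected.
correct <- sqrt(49)
correct  # 7
```

If you run this, the first result is the text of R's complaint rather than a number, which is what you would see in red in the Console.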

You can see this function operates quite simply…you type the function name, enter the number you want to use, and then run it in R. However, it is important to note that functions don’t always operate like this. Generally speaking, functions are typically composed of the following elements:

For a practical example, we can use the round function, which takes a decimal value and rounds it to the number of decimal places you specify. The two arguments we will use for round are x, the number to round, and digits, the number of decimal places to keep.

To demonstrate, we will round the number 40.5789 to the 2nd decimal place. If a function takes multiple arguments, do not forget to include a comma (,) to separate them.

round(x = 40.5789, digits = 2)
## [1] 40.58

Functions don’t always require you to explicitly name the arguments. We could have done the exact same thing with the following code:

round(40.5789, 2)
## [1] 40.58
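One practical difference between the two styles is worth a quick sketch: named arguments can be written in any order, while unnamed arguments are matched by position. The object names below (named, reordered, positional, swapped) are only illustrative.

```r
named      <- round(x = 40.5789, digits = 2)
reordered  <- round(digits = 2, x = 40.5789)  # names given, so order does not matter
positional <- round(40.5789, 2)               # no names: first is x, second is digits

identical(named, reordered)   # TRUE
identical(named, positional)  # TRUE

# Swapping unnamed arguments changes the meaning: this rounds the number 2
# to roughly 40 decimal places, which just gives back 2.
swapped <- round(2, 40.5789)
swapped
```

This is one reason to keep naming arguments while you are learning: a swapped pair of unnamed arguments often runs without any error and quietly returns the wrong answer.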

As you become more adept at programming in R, you will rely less on explicit arguments, but for now, we will use them to make explanation easier.

Data Types in R

Vectors

There are a number of data types in R. We will not explore everything about them here, but I will mention at least some to familiarize you with what they are. The first form of data is a vector. Vectors are simply a collection of values stored in the following format:

$$\vec{x} = \begin{bmatrix} x_1 & x_2 & \dots & x_j \end{bmatrix}$$

To show you what that practically means, we can create a vector and run it in R. If you run the code below, R will return a vector. The c function means “concatenate”, which is simply a fancy way of saying “combine these numbers.”

c(3,4,0,1,3)
## [1] 3 4 0 1 3

It may be useful to save these values for later, so we will save them as an object. In R, objects are simply items saved to be used later, and they are the building blocks of both simple and complex code. To create an object, give it a name (we will call this one X1; remember to capitalize the X) and assign the values to that name with the <- operator. Whatever is written to the right of this operator will be saved under the name on the left. Let’s go ahead and do so.

X1 <- c(3,4,0,1,3)

Once you run this, it will be saved in your environment. You will notice that the Environment pane lists this object as a numeric vector (num), shows that it has five elements (positions 1 to 5, written [1:5]), and lists the specific numbers in order.

You can also simply call this object by typing its name and running it like any code in R.

X1
## [1] 3 4 0 1 3

Numeric vectors have nice properties. For example you can use arithmetic functions with numeric vectors and it will apply it to the entire vector. As an example, let’s multiply the entire vector by 5.

X1 * 5
## [1] 15 20  0  5 15

You can also apply math between vectors. Let’s save another vector, X2, and add it to X1.

X2 <- c(2,4,9,1,0)
X1 + X2
## [1] 5 8 9 2 3
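Besides arithmetic operators, many built-in functions take a whole vector as input. As a brief sketch using only base R functions (length, sum, and mean are not introduced elsewhere in this tutorial, and the names n_values, total, and average are my own), here is how to summarize X1 and pull out a single element by position:

```r
X1 <- c(3, 4, 0, 1, 3)

n_values <- length(X1)   # how many elements the vector holds: 5
total    <- sum(X1)      # 3 + 4 + 0 + 1 + 3 = 11
average  <- mean(X1)     # 11 / 5 = 2.2

# Single square brackets pull an element by its position.
second <- X1[2]          # 4
```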

Vectors are made up of many classes, or types. These include:

  • Numeric (integer: 45 / double: 45.8)
  • Logical (TRUE/FALSE)
  • Character ("This is a sentence.")
  • Lists (examples shown later)
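If you are ever unsure what kind of value you are holding, you can ask R directly. The sketch below uses the base functions class and typeof. One detail worth knowing: a plain whole number such as 45 is stored as a double unless you add the L suffix, which marks it as an integer.

```r
class(45.8)                    # "numeric"
class(TRUE)                    # "logical"
class("This is a sentence.")   # "character"
class(list(1, "a"))            # "list"

# typeof shows how a number is stored internally.
typeof(45)    # "double"
typeof(45L)   # "integer"
```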

We won’t go into these in much detail for now (I will save that for another time), but as another example of a vector, we can create a vector called X3 using text, which makes it a character vector.

X3 <- c("Shawn watches too many movies.")
X3
## [1] "Shawn watches too many movies."

You may have guessed this, but character vectors have different attributes, and thus can’t share the same properties as numeric vectors. As an example, run this code and see what happens:

X1 * X3
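Running that line stops with an error, because R cannot multiply numbers by text. If you would like to inspect the message without interrupting a longer script, one option is to wrap the call in tryCatch, a base R function outside the scope of this tutorial. This is only a sketch, and msg is a name I made up.

```r
X1 <- c(3, 4, 0, 1, 3)
X3 <- c("Shawn watches too many movies.")

# Try the multiplication; if it fails, keep the error text instead of stopping.
msg <- tryCatch(
  X1 * X3,
  error = function(e) conditionMessage(e)
)
msg   # the error message R produced for mixing numeric and character data
```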

Matrix

A matrix is a two-dimensional data structure. Its behavior is very similar to a vector, but instead of holding a single row or column of data, it arranges values into multiple rows and columns. An example of a 2 x 3 matrix (2 rows and 3 columns) in mathematical notation:

$$\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{bmatrix}$$

As a practical example, let’s create a 2x3 matrix and save it as mat. First, enter the values you want into the data argument, then tell the matrix you want it organized into two rows with the nrow argument.

mat <- matrix(data = c(2,4,5,1,7,3), nrow = 2)
mat
##      [,1] [,2] [,3]
## [1,]    2    5    7
## [2,]    4    1    3

You will notice the bracketed labels along the top and left of the matrix. These give the locations of the values in matrix notation: the [X,] labels on the left are row numbers, and the [,X] labels along the top are column numbers. This is useful because the same notation can be used to pull data from specific locations. For instance, if we want to pull the “4” value from Row 2 Column 1, we would run the following code:

mat[2,1]
## [1] 4

And you can simply pull an entire row or column by listing only one location. If for example we want the values 2, 5, and 7, we would use this code instead:

mat[1,]
## [1] 2 5 7
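The same bracket notation works for columns and for several positions at once. Here is a short sketch on the same matrix. The dim function is an addition of mine, and it reports the number of rows and columns.

```r
mat <- matrix(data = c(2, 4, 5, 1, 7, 3), nrow = 2)

col_two <- mat[, 2]          # entire second column: 5 1
picked  <- mat[2, c(1, 3)]   # row 2, columns 1 and 3: 4 3
shape   <- dim(mat)          # rows then columns: 2 3
```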

Notice that your code over time may get more difficult to read if you cram a bunch of arguments into the same line. One thing that may make coding easier is to indent each argument by hitting Enter after each argument’s comma. An example is shown below:

mat <- matrix(data = c(2,4,5,1,7,3), 
              nrow = 2)

Here you can now clearly see each argument. The first line is the data argument, and nrow is listed as both the second line and second argument. This makes reading your code much easier over time, as you can go line-by-line through the code after writing. This becomes very useful for debugging code later on as you develop longer and longer scripts.

Just as a side note, this matrix could also have been specified with ncol, which sets how many columns to organize the data into instead. Either will form the same matrix. However, note the ordering of the values. R fills the matrix vertically, column by column, which may not be what you expect. To fill the values horizontally, we can use the byrow argument and set it to TRUE (or its shorthand T), a logical value that “turns on” this part of the function. By default, its value is FALSE (shorthand F), which “turns off” this behavior, so values are not filled in by row. Spelling out TRUE and FALSE is the safer habit, because T and F are ordinary variables that can be overwritten.

mat <- matrix(data = c(2,4,5,1,7,3), 
              nrow = 2,
              byrow = T)
mat
##      [,1] [,2] [,3]
## [1,]    2    4    5
## [2,]    1    7    3
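To check the claim that nrow and ncol are interchangeable here, the sketch below builds the matrix both ways and compares the results. The names by_nrow and by_ncol are only for illustration.

```r
values <- c(2, 4, 5, 1, 7, 3)

# Six values in 2 rows implies 3 columns, and vice versa.
by_nrow <- matrix(data = values, nrow = 2, byrow = TRUE)
by_ncol <- matrix(data = values, ncol = 3, byrow = TRUE)

identical(by_nrow, by_ncol)   # TRUE
```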

Now you can see the values are filled into the first row, then filled in sequentially into the second row. If you would like to supply names to this matrix (for something like a chi-squared test), you can use the colnames and rownames functions to change them. Let’s say we want to name the columns by occupation and the rows by gender. We would use the code below. This may look awkward because you are assigning values to a function call. Here we are basically calling up the names of the matrix first, then replacing them with the values on the right.

colnames(mat) <- c("Doctor","Model","Custodian")
rownames(mat) <- c("Male","Female")
mat
##        Doctor Model Custodian
## Male        2     4         5
## Female      1     7         3

Before we move on, remember that matrices behave in similar ways to vectors. If we want to subtract 4 from each value in the matrix, you may do so.

mat - 4
##        Doctor Model Custodian
## Male       -2     0         1
## Female     -3     3        -1

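Likewise, you can multiply matrices by each other. Note that the * operator multiplies element by element, pairing up values in the same position. It is not the matrix product from linear algebra, which R writes as %*%.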

mat * mat
##        Doctor Model Custodian
## Male        4    16        25
## Female      1    49         9
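For readers who know some linear algebra, here is a brief sketch contrasting the element-wise result above with a true matrix product. It uses the %*% operator and the t function (transpose), both from base R but not otherwise covered in this tutorial. Because mat is 2 x 3, it cannot be matrix-multiplied by itself, so I multiply it by its own transpose instead.

```r
mat <- matrix(data = c(2, 4, 5, 1, 7, 3),
              nrow = 2,
              byrow = TRUE)

elementwise <- mat * mat        # squares each value in place; still 2 x 3
product     <- mat %*% t(mat)   # (2 x 3) times (3 x 2) gives a 2 x 2 matrix

dim(elementwise)   # 2 3
dim(product)       # 2 2
product[1, 1]      # 2*2 + 4*4 + 5*5 = 45
```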

Lists

I like to characterize lists as a shopping bag of data…it contains a variety of objects that don’t necessarily have to be the same thing, but can all be combined for use later.

A typical shopping list.

For instance, we can pair a numeric vector with a character vector, then call it later.

lis <- list(c(4,5,6,8,1,0),
            c("French","Belgian","German"))
lis
## [[1]]
## [1] 4 5 6 8 1 0
## 
## [[2]]
## [1] "French"  "Belgian" "German"

Did you notice the new notation? To fetch each part of the list, we now use [[]] and specify where we want to pull from.

lis[[1]]
## [1] 4 5 6 8 1 0

We can even name the list elements like we did with the matrix, using names this time.

names(lis) <- c("Numeric", "Character")
lis
## $Numeric
## [1] 4 5 6 8 1 0
## 
## $Character
## [1] "French"  "Belgian" "German"

Notice the notation has changed now. For a named list, we can now use the $ operator to pull each part of the list.

lis$Character
## [1] "French"  "Belgian" "German"

You can also just use the original [[]] operator as well, but keep this $ operator in mind. It will be used in our discussion on data frames.
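As a small sketch of how these operators relate, the code below rebuilds the same list with its names supplied up front (an alternative to calling names afterwards) and compares the different ways of reaching inside it. The single-bracket form is included as a contrast: it returns a smaller list rather than the contents themselves.

```r
lis <- list(Numeric   = c(4, 5, 6, 8, 1, 0),
            Character = c("French", "Belgian", "German"))

by_dollar <- lis$Character        # pull by name with $
by_name   <- lis[["Character"]]   # pull by name with [[ ]]
identical(by_dollar, by_name)     # TRUE

first_word <- lis[[2]][1]         # second element, then its first value: "French"

class(lis[[1]])   # "numeric": the contents of the first element
class(lis[1])     # "list": a one-element list, still wrapped in the bag
```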

Data Frames

A data frame from R.

Data frames are similar to matrices. They are “rectangular” or two-dimensional data composed of rows and columns. The major difference is that data frames, like lists, can store multiple types of data without creating them in a list format, which makes the data easier to read. Data frames also allow you to quickly create variables by naming vectors within a frame. To create a data frame, we can use the data.frame function. To assign each vector a name, simply write the name for the vector, save it with the = operator, and list what you want to save to the right of this operator. This is functionally similar to the <- operator, but the major difference is that the Treatment and Mean_Response vectors won’t be saved as additional objects while making your data frame.

df <- data.frame(Treatment = c("Group_1","Group_2"),
                 Mean_Response = c(78,56))
df
##   Treatment Mean_Response
## 1   Group_1            78
## 2   Group_2            56

You may have noticed the slightly peculiar naming scheme I have employed: snake case, which puts an underscore (_) where a space would otherwise go. A period (.) is sometimes used the same way in R. I have done this because names with spaces cause far more problems than they solve. I won’t go into detail as to why here, but you can experiment on your own and see the trouble they cause. For now, simply avoid spaces and use snake case from here on.

You will now see that the data frame has 2 rows and 2 columns holding both character and numeric values. To pull a column from the data frame, you can use the $ operator.

df$Mean_Response
## [1] 78 56

Since a data frame is laid out in rows and columns like a matrix, you can also retrieve values by [row, column] position.

df[2,2]
## [1] 56
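Bracket indexing on a data frame also accepts column names and logical conditions. The sketch below shows both on the same df. The cutoff of 60 and the object names response and high are arbitrary choices of mine for illustration.

```r
df <- data.frame(Treatment = c("Group_1", "Group_2"),
                 Mean_Response = c(78, 56))

# Pull a column by name instead of by position.
response <- df[, "Mean_Response"]   # 78 56

# Keep only the rows where a condition is TRUE (leave the column slot empty).
high <- df[df$Mean_Response > 60, ]
high   # only the Group_1 row remains
```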

Perhaps you want to save a part of this data frame for a separate part of your script. Simply save these vectors as objects for use later.

trt.groups <- df$Treatment
trt.groups
## [1] "Group_1" "Group_2"

There are many datasets that automatically come with R, and you can use many of these to practice statistical analysis. If you run library(help = "datasets"), you will see a full list of all the pre-packaged data that comes with R. To open one, simply type its name and run it as you would any code. For example, we can run the mtcars data in this way:

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Oftentimes it is difficult to read a gigantic dataset within the Console. Two functions that help are head, which pulls the first 6 rows of data, and tail, which pulls the last 6 rows.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
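Six rows is only the default. Both head and tail accept an n argument to control how many rows come back, and dim and names (base functions I am adding here) give a quick sense of a data frame's size and columns. A brief sketch:

```r
first_three <- head(mtcars, n = 3)   # first 3 rows instead of 6
last_three  <- tail(mtcars, n = 3)   # last 3 rows

size <- dim(mtcars)      # rows then columns: 32 11
cols <- names(mtcars)    # the column names, starting with "mpg"
```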

You can also view the entire data frame in a separate window with the View function.

View(mtcars)

This may not be super useful on its own since the window is still collapsed. If you press Ctrl + Shift + 1, it will expand the view to only the data frame. Pressing Ctrl + Shift + 1 again returns the view to its original layout.

By the way, RStudio has many keyboard shortcuts like this. For a full list, check out this link: https://support.posit.co/hc/en-us/articles/200711853-Keyboard-Shortcuts-in-the-RStudio-IDE

What would data be without a nice plot? Visualization allows data to speak to us. To plot the histogram from the beginning of this tutorial, you can use the hist function. First, pass the vector you want to plot to the x argument (here we use miles per gallon, or mpg). Next, change the color to steel blue with the col argument. Then change the x-axis label with xlab. Finally, name the title with main.

hist(x = mtcars$mpg,
     col = "steelblue1",
     xlab = "Miles per Gallon",
     main = "Histogram of Miles per Gallon")

And now you have created your first plot in R! Plots will be explained in a lot more detail in subsequent tutorials, but for now you can try this histogram out on all the data frames that come with R. There are also other forms of data out there one can experiment with (arrays, tibbles, etc.), but this tutorial covers the basics and will be useful for now.
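If you are curious about the numbers behind the histogram, hist can return them without drawing anything when you set plot = FALSE. This is just a sketch for exploration, and h is a name I picked.

```r
# Compute the histogram but skip the drawing step.
h <- hist(x = mtcars$mpg, plot = FALSE)

h$breaks        # the edges of each bar along the x-axis
h$counts        # how many cars fall into each bar
sum(h$counts)   # every car is counted once, so this equals nrow(mtcars)
```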

Expanding Your Vocabulary with Libraries

While the data and functions that are available as defaults in R are useful, they are fairly limited. It is essential as an R user to know how to install and load packages (often called libraries), which are collections of functions and data written by other users. Without this ability, the utility of R is bounded by what is already available. One of the most useful packages available is the tidyverse, which is actually a collection of packages rolled into one. To install it, simply use the following code:

install.packages("tidyverse")

Once the Console is finished installing the library, simply load the library with the following code:

library(tidyverse)

Or alternatively:

library("tidyverse")

Now you will be able to use all of the functions within the tidyverse. Let’s use the filter function to keep only the rows of mtcars where horsepower (hp) is less than 100.

filter(mtcars, hp < 100)
##                 mpg cyl  disp hp drat    wt  qsec vs am gear carb
## Datsun 710     22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1
## Merc 240D      24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1 97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0 66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3 91 4.43 2.140 16.70  0  1    5    2

You will now see that it brings back a much shorter list of entries based on our filter. The tidyverse is quite a dense collection of packages with a somewhat steep learning curve. I will not explore its functionality in this tutorial, but I will point you to the useful book R for Data Science by its creator Hadley Wickham if you want to learn more.
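For comparison, the same rows can be selected without loading any package. The sketch below shows two base R alternatives: bracket indexing with a logical condition, and the subset function. Neither is the tidyverse approach, and the object names are my own.

```r
# Bracket indexing: keep rows where hp is below 100, keep all columns.
low_hp_brackets <- mtcars[mtcars$hp < 100, ]

# subset() lets you write the condition without repeating mtcars$.
low_hp_subset <- subset(mtcars, hp < 100)

nrow(low_hp_brackets)       # 9 cars, matching the filter output
rownames(low_hp_brackets)   # the car names that passed the condition
```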

Creating Functions

Functions and libraries are what make R so useful. But this doesn’t mean you have to rely on the functions of others…you can even create your own! As a very basic example, here I have created a function called add.2, which simply adds 2 to any number you include.

add.2 <- function(x){
  x + 2
}
add.2(5)
## [1] 7

This is achieved by doing the following:

  1. Name the function. First write the name to the left of the <- operator. Then use function, wherein we name our only argument as x, which is our input.
  2. Wrap the function. Simply type the curly braces you see to the right of function(x) and hit Enter after the opening brace to indent the body the way I have.
  3. Write the code in the “body” of the bracket. The x here functions as a placeholder for whatever you enter into the function. We are going to use it simply as a number or vector. To add the number 2 to whatever we enter into the function, we enter “x + 2” to accomplish this goal.
  4. Run the entire code to save it.
  5. Use the function!

As I mentioned, we can use a number or a vector. Let’s create a random vector and use our function on it.

random.vector <- c(16,43,51,28)
add.2(random.vector)
## [1] 18 45 53 30
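A natural next step is a function with more than one argument. The sketch below is a hypothetical variation I am calling add.n, where the amount to add is itself an argument with a default value of 2, much like the digits argument of round has a default.

```r
# n = 2 is a default: it is used whenever the caller does not supply n.
add.n <- function(x, n = 2){
  x + n
}

add.n(5)                           # 7, same as add.2(5)
add.n(5, n = 10)                   # 15
add.n(c(16, 43, 51, 28), n = 1)    # 17 44 52 29
```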

You have now officially created your first function. Congrats! Programming in R is a dense subject, so we will simply stop here and explore this topic another time.

Troubleshooting R

You may have noticed that if you enter something incorrectly, R will yell at you in its own subtle way. See below where I accidentally capitalized the wrong letter. The console will tell me at the bottom that it cannot find the function.

A mental check I first do when troubleshooting:

If many of these checks don’t immediately solve the problem, you can check the help pages in R. Simply use the ? operator with the function you are using, and it should pull up a page in R listing the information you require.

?mean

You can see the help page on the bottom right pane.

However, it may be easier to see if you expand it. If you hit Ctrl + Shift + 3, this will maximize the Help tab so it is easier to read.
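When all you need is a reminder of a function's argument names, you can also check them from the Console without opening the help page. This sketch uses the base functions args and formals on matrix, the function from earlier. The object names are placeholders.

```r
# Print the function's argument list along with any default values.
args(matrix)

# Or grab just the argument names as a character vector.
arg_names <- names(formals(matrix))
arg_names   # "data" "nrow" "ncol" "byrow" "dimnames"

# Defaults can be looked up individually; byrow is off unless you turn it on.
byrow_default <- formals(matrix)$byrow
byrow_default   # FALSE
```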

The help pages will generally have the following sections:

Oftentimes these help pages can be embarrassingly bad and barely helpful for the reader. Another great resource is Stack Overflow, a coding community with a Q & A forum format that can often get you answers when R’s documentation doesn’t. You can find Stack Overflow at https://www.stackoverflow.com.

Conclusion

And there you have it. You have now accomplished a lot already. Play around with the features you have learned here and experiment in your own time with other features in R. Happy coding!