The purpose of this tutorial is to provide a basic understanding of R. I want to cover the basic concepts you need to know and introduce widely used R packages that can help with your data analysis. The R language is vast, and a comprehensive tutorial would take many lifetimes and textbooks. As with any computer language, mastering it requires dedication, patience and long-term commitment. But even with a basic understanding of R, you will be able to automate monotonous tasks, work more efficiently with data and produce insightful and informative analysis. This tutorial is a jumping-off point to help you get started and to point you towards resources created by people far smarter and more adept at R than I am.
Note: Much of this tutorial is just a copy and paste job from different sources. No use reinventing the wheel. Credit is given in the citations at the end.
To get started, you will need to download R and RStudio. R is the language that your computer interprets to run programs and commands, while RStudio is an integrated development environment (IDE), which combines developer tools into a single graphical user interface (GUI). You can think of R as the paint, and RStudio as the tools, like the paintbrushes, canvas and palette, that you as the painter will use to create a new piece of art (or just a simple program that makes your life easier).
R for Beginners, which provides a basic introduction to base R and is the basis for most of this lesson.
R for Data Science, which covers the basics of the tidyverse, a group of packages widely used for data science in R, created by Hadley Wickham (somewhat of an R celebrity).
The R Project for Statistical Computing, which is the website for R with links to useful updates and manuals (see Manuals Section).
R Documentation, which allows you to search the documentation of R packages (try searching for the ggplot2 or dplyr packages, which are part of the tidyverse).
RPubs, a website maintained by RStudio where you can view and publish R projects created through R Markdown. It is useful for generating new ideas as well as sharing your work with the community. This tutorial is available through my RPubs account and you can view some other projects I’ve done (or don’t, I don’t care).
Stack Overflow, which is a public forum where users can post and help resolve questions for many languages, not limited to R. Other useful forums and online discussion sites are available, but this is one of the most commonly used.
I found these resources extremely useful when I started learning R, but they are far from the only useful resources. R has a lot of online support, particularly because it is an open-source language where regular users can build upon the language by creating new packages (R is similar to Python in this way).
R is a simpler computing language to learn than most for two reasons:
First, R is an interpreted language, not a compiled one, meaning that commands typed on the keyboard are executed directly, without the need to build a complete program first.
Second, R’s syntax is simple and intuitive.
To get started learning this language, it is important to understand how to navigate the R Console and RStudio IDE.
When you start R on your computer (not RStudio), the R console shown below will open. You can try typing some simple mathematical commands yourself. It is good to understand the console, but the scripts that we write and execute will be done in the much more flexible and powerful RStudio IDE.
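As a minimal sketch of the kind of commands you might try at the console prompt (these particular expressions are my own illustrations, not taken from the screenshot):

```r
# Each line is evaluated as soon as you press Enter
2 + 3      # 5
10 / 4     # 2.5
2^10       # 1024
sqrt(16)   # 4
```

The console prints each result straight away, which is the practical meaning of R being an interpreted language.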
One of the great advantages of R over point-and-click analysis software like Excel is that you can save your work as a script. You can edit and save scripts using a text editor. There are many IDEs available, but the RStudio IDE is the most widely used as it is developed specifically for working with R (it also supports other languages such as SQL and Python). Look at the picture below, which shows an example of a simple plot using the AirPassengers data set from the datasets package:
The top left quadrant is the Code Editor, where you can write and save your script. Note that the code editor is not immediately open when you open RStudio. You must start a new script (click File -> New File -> R Script).
The bottom left quadrant is the R Console which shows the lines of code that have been run and their outputs, as well as any error messages.
The top right quadrant is the Environment/History section, which displays any objects, such as variables, datasets or packages, that are active in the current R session.
The bottom right quadrant is the Other Panes section, which can show files, plots, packages, help support and viewer. In the picture below, the output from the plot(AirPassengers) command is currently in view.
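A short script along the following lines would produce a plot like the one described. The exact script used in the picture is not reproduced in the text, so treat everything here other than plot(AirPassengers) as an assumption.

```r
# AirPassengers ships with the datasets package, which R loads by default
class(AirPassengers)     # "ts", a time series object
summary(AirPassengers)   # quick numerical summary, printed in the console

# Draw the series; in RStudio the figure appears in the Plots tab
plot(AirPassengers)
```

Saving these lines as an R script lets you rerun the whole analysis later, which is the advantage over point-and-click tools.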
For more in-depth discussion on RStudio, refer to the link to Getting Started with R and RStudio in section 1 above. Learning how to use RStudio effectively will be just as important as learning the language itself.
To make the most of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on them.
R has six basic data types (in addition to the five listed below, there is also the raw data type, which is not discussed here). We will go through each data type, and use the typeof() function to confirm it.
The character data type is a string, or words and letters. Below, the variable named character_variable is assigned the value Patrick. When assigning a character variable, the value needs to be enclosed in quotation marks, as shown below.
Note: to assign a value to an object in R, we use the <- operator as shown below. This is how you will assign a value to any variable, data structure or other object.
character_variable <- "Patrick"
typeof(character_variable)
## [1] "character"
The double data type stores real numbers, with or without a fractional part. Below, the variable named numerical_variable is assigned the value 10.5.
numerical_variable <- 10.5
typeof(numerical_variable)
## [1] "double"
The integer data type stores integers. Below, the variable named integer_variable is assigned the value 5. Note that we assign the value as 5L. By default, R stores any numerical value as a double; the L suffix tells R to store the value as an integer. When the variable is called, only the number appears (see code below).
Integers are often the natural choice, for example when analysing non-divisible units (such as people). An integer also takes 4 bytes of memory where a double takes 8, so storing whole-number data as integers halves the memory used by large vectors.
integer_variable <- 5L
typeof(integer_variable)
## [1] "integer"
integer_variable
## [1] 5
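A quick way to check both points for yourself, as a sketch (the variable names are illustrative, and object.size() reports slightly different totals across platforms, so no exact sizes are shown):

```r
# Without the L suffix a whole number is still stored as a double
whole_number <- 5
typeof(whole_number)       # "double"
is.integer(whole_number)   # FALSE

# Compare the memory used by one million integers and one million doubles
int_values <- integer(1e6)
dbl_values <- numeric(1e6)
object.size(int_values)
object.size(dbl_values)
```

The double vector should report roughly twice the size of the integer vector.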
The logical data type is a binary TRUE or FALSE value. Note that R is case sensitive: when assigning a logical value, TRUE and FALSE need to be in all upper case. You can also use the shorthand T for TRUE and F for FALSE.
true_variable <- TRUE
false_variable <- FALSE
typeof(true_variable)
## [1] "logical"
typeof(false_variable)
## [1] "logical"
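One caveat with the T and F shorthand: TRUE and FALSE are reserved words, but T and F are ordinary variables that happen to hold those values, so they can be overwritten. A small sketch of what can go wrong:

```r
# T is just a variable name, so this assignment is allowed
T <- FALSE
identical(T, TRUE)   # FALSE: T no longer means TRUE

# Remove our copy so that T refers to base R's value again
rm(T)
identical(T, TRUE)   # TRUE
```

For this reason, spelling out TRUE and FALSE is the safer habit in scripts.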
The complex data type stores complex numbers, which have a real part and an imaginary part. This type is not commonly used outside of STEM fields.
complex_variable <- 1 + 2i
typeof(complex_variable)
## [1] "complex"
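Base R also provides functions for pulling a complex number apart. A brief sketch, using an illustrative variable z:

```r
z <- 3 + 4i
Re(z)     # 3, the real part
Im(z)     # 4, the imaginary part
Mod(z)    # 5, the modulus (distance from zero)
Conj(z)   # 3-4i, the complex conjugate
```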
This section summarizes the most basic data structures in base R. I will provide a brief overview of each data structure, and how they are connected.
R’s basic data structures are listed below. They can be organized by their dimensionality (1D, 2D, ND) and whether they are homogeneous (all contents must be of the same data type) or heterogeneous (the contents can be of different data types).
| Dimensions | Homogeneous | Heterogeneous |
|---|---|---|
| 1D | Atomic vector | List |
| 2D | Matrix | Data frame |
| ND | Array | |
Given an object, the best way to understand what data structures it’s composed of is to use the str() function. str() is short for structure and it gives a compact, human readable description of any R data structure.
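To make the table concrete, the sketch below builds one object of each kind and inspects it with str(). The object names and values are made up for illustration.

```r
vec <- c(1.5, 2.5, 3.5)                                # 1D, homogeneous
lst <- list(1L, "a", TRUE)                             # 1D, heterogeneous
mat <- matrix(1:6, nrow = 2)                           # 2D, homogeneous
df  <- data.frame(id = 1:3, score = c(9.5, 7, 8.25))   # 2D, heterogeneous
arr <- array(1:24, dim = c(2, 3, 4))                   # ND, homogeneous

# str() prints a compact description of each structure
str(vec)
str(lst)
str(mat)
str(df)
str(arr)
```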
The basic data structure in R is the vector. The vector comes in two types: atomic vectors and lists. They have three common properties:
typeof(), what it is.
length(), how many elements it contains.
attributes(), additional arbitrary metadata.
Atomic vectors and lists differ in the types of their elements: all elements of an atomic vector must be of the same data type, whereas lists can support elements of different data types.
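The three properties can be inspected directly. A minimal sketch using a named vector (named_vector is an illustrative name):

```r
named_vector <- c(a = 1, b = 2, c = 3)

typeof(named_vector)       # "double"
length(named_vector)       # 3
attributes(named_vector)   # a list holding the names "a" "b" "c"
```

Names are the most common attribute; a plain vector such as c(1, 2, 3) has no attributes, so attributes() returns NULL for it.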
The four common types of atomic vectors are logical, integer, double (often referred to as numeric) and character. Atomic vectors are usually created with c(), short for combine.
dbl_vector <- c(1, 2.5, 4.5, 8.5)
#With the L suffix, you get an integer rather than a double
int_vector <- c(1L, 2L, 3L, 4L)
#Use TRUE and FALSE (or T and F) to create logical vectors
logical_vector <- c(TRUE, T, FALSE, F)
#Enclose each element in a character vector in double quotations
char_vector <- c("Hello", "World")
Given a vector you can determine its type with typeof(), or check for a specific type with an “is” function: is.character(), is.double(), is.integer(), is.logical(), or, more generally, is.atomic().
is.double(dbl_vector)
## [1] TRUE
is.integer(int_vector)
## [1] TRUE
typeof(char_vector)
## [1] "character"
is.atomic(logical_vector)
## [1] TRUE
If you attempt to combine elements of different types into one atomic vector, R will coerce the vector into one data type. For example, combining a character and a number yields a character vector:
str(c("character", 1))
## chr [1:2] "character" "1"
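Coercion always moves towards the more flexible type, in the order logical, integer, double, character. The following sketch checks each step with typeof():

```r
typeof(c(TRUE, 1L))    # "integer":   logical is coerced to integer
typeof(c(1L, 2.5))     # "double":    integer is coerced to double
typeof(c(2.5, "a"))    # "character": double is coerced to character
typeof(c(TRUE, "a"))   # "character": logical is coerced to character
```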
When a logical vector is coerced to a double or integer, TRUE becomes 1 and FALSE becomes zero. This is useful in conjunction with sum() and mean().
as.numeric(logical_vector)
## [1] 1 1 0 0
#Total number of TRUEs
sum(logical_vector)
## [1] 2
#Proportion of TRUEs
mean(logical_vector)
## [1] 0.5
You can also coerce a vector manually, using the “as” functions: as.character(), as.double(), as.integer() and as.logical().
as.double(int_vector)
## [1] 1 2 3 4
as.character(logical_vector)
## [1] "TRUE" "TRUE" "FALSE" "FALSE"
If you coerce a double vector to an integer vector, R truncates each element towards zero; it drops the fractional part rather than rounding to the closest integer:
as.integer(dbl_vector)
## [1] 1 2 4 8
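Note that as.integer() discards the fractional part instead of rounding to the nearest whole number. If rounding is what you want, one option is to call round() first, as in this sketch (values is an illustrative name):

```r
values <- c(1.7, 2.9, -1.7)

# as.integer() drops the fractional part (truncation towards zero)
as.integer(values)          # 1 2 -1

# Rounding first gives the nearest whole numbers
as.integer(round(values))   # 2 3 -2
```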
But be careful: if you try to coerce elements to a data type they cannot be converted to, R replaces them with NA, which represents a missing value (not available), and issues a warning.
as.double(char_vector)
## Warning: NAs introduced by coercion
## [1] NA NA
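Only the elements that cannot be converted become NA; the rest convert normally. The sketch below shows one way to find and work around them. The vector mixed is made up for illustration, and suppressWarnings() is used only to keep the output quiet.

```r
mixed <- c("1.5", "abc", "3")

# "abc" cannot be read as a number, so it becomes NA
converted <- suppressWarnings(as.double(mixed))
converted                      # 1.5  NA 3.0

# is.na() flags the missing values
is.na(converted)               # FALSE  TRUE FALSE

# Many functions accept na.rm = TRUE to skip missing values
sum(converted, na.rm = TRUE)   # 4.5
```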
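Arithmetic operations on vectors are applied element by element. When both vectors have the same length, the result has that length too. When the lengths differ, R recycles the shorter vector to match the longer one, which is why combining a vector with a scalar works. Think back to linear algebra 101.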
x <- c(1,2,3)
y <- c(4,5,6)
#You can conduct operations on vectors of the same length
x+y
## [1] 5 7 9
y/x
## [1] 4.0 2.5 2.0
#You can also conduct operations on vectors using a scalar
x+3
## [1] 4 5 6
y-2
## [1] 2 3 4
x^2
## [1] 1 4 9
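The scalar examples are a special case of R's recycling rule: the shorter operand is repeated until it matches the longer one. A minimal sketch with a length-2 vector (the names are illustrative):

```r
long_vector  <- c(1, 2, 3, 4)
short_vector <- c(10, 20)

# short_vector is recycled to 10, 20, 10, 20 before the addition
long_vector + short_vector   # 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still returns a result but issues a warning, which is usually a sign of a mistake in the code.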
Operations on character vectors are also possible, but they are beyond the scope of this tutorial. There are many ways to work with character vectors. My preferred package is stringr, which is part of the tidyverse.
Lists are different from atomic vectors because their elements can be of any type, including lists. You construct lists by using list() instead of c().
x <- list(1:3, "a", c(TRUE, FALSE), c(2.3, 5.9))
str(x)
## List of 4
## $ : int [1:3] 1 2 3
## $ : chr "a"
## $ : logi [1:2] TRUE FALSE
## $ : num [1:2] 2.3 5.9
Lists are sometimes referred to as recursive vectors, because a list can contain another list.
x <- list(list(list()))
str(x)
## List of 1
## $ :List of 1
## ..$ : list()
c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them. Compare the results of list() and c():
x <- list(list(1,2), c(3,4))
y <- c(list(1,2), c(3,4))
str(x)
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ : num [1:2] 3 4
str(y)
## List of 4
## $ : num 1
## $ : num 2
## $ : num 3
## $ : num 4
The typeof() of a list is "list". You can test for a list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with unlist(). If the elements of a list have different types, unlist() uses the same coercion rules as c().
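These functions can be tried on a small example. The sketch below uses an illustrative list named mixed_list:

```r
mixed_list <- list(1, "a", TRUE)

typeof(mixed_list)    # "list"
is.list(mixed_list)   # TRUE

# as.list() turns each element of a vector into a list element
int_list <- as.list(1:3)
length(int_list)      # 3

# unlist() flattens to an atomic vector; the mixed types become character
flat <- unlist(mixed_list)
flat                  # "1" "a" "TRUE"
```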
Lists are used to build up many of the more complicated data structures in R. For example, both data frames and linear model objects (as produced by lm()) are lists.
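You can verify this with typeof() and is.list(). The sketch below uses the built-in mtcars data set and a simple regression chosen only for illustration:

```r
# A data frame is a list of equal-length columns
typeof(mtcars)    # "list"
is.list(mtcars)   # TRUE

# A fitted linear model is a list of components
fit <- lm(mpg ~ wt, data = mtcars)
typeof(fit)       # "list"
is.list(fit)      # TRUE
names(fit)        # component names, including "coefficients" and "residuals"
```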