Welcome to the wonderful world of R. Today I will introduce you to some of the core utility of R by explaining the most fundamental parts of the program. By the end of this tutorial, you should be able to do the following:
It probably isn’t helpful to read an R tutorial without having the program. To that end, it would be useful to learn how to install it. First, you will need to install the base R software. To do so, navigate to https://www.r-project.org/. As of 2023, it should look something like this:
From there, follow the instructions by going to the download R link, pick a mirror, and then download the appropriate R software from the subsequent links thereafter. After you go through the install prompt, you can open R and it will look something like this:
While this basic setup may have utility, it is usually advised to download a graphical user interface (GUI) that is more useful. Most recommend RStudio, which provides a variety of useful tools for running R. Simply navigate to https://posit.co/download/rstudio-desktop/ and follow the same download instructions (though at this point you may skip the steps for downloading base R, which we just did). Once you install RStudio and open the program, it will look like this:
This is what it looks like during a typical R session:
You can see that R looks very different in a real environment. You can also see 4 distinct panes. The bottom left section is the Console, which functions similarly to the core R program we saw earlier. Here you can type in code and run it as-is. I recommend not using this area of RStudio for writing code, but I do think it is useful for reading error messages and troubleshooting, which we will talk about later.
The top left section is the Source Editor, usually your R script, and arguably the most important part of your coding experience. Here you can write a ton of code, save it for later, and rerun your analysis at another time. To run a chunk of code, either click on a part of the code chunk or highlight the whole thing, then press Ctrl + Enter. Afterwards you should see it run in the Console below. If you are running plots or saving objects, you will also see changes in the other two panes to the right.
The top right section is your Environment. This is what stores your data, objects, and functions for use later. You can see, for example, that I have saved two objects here, x and y, each with 1,000 rows of data of varying decimal values. You can also clear or remove things from this area.
The bottom right pane doesn’t have a distinct name, but this is where you find a lot of useful tabs, including Plots, Help, and the Viewer. You can see in the screenshot that I have plotted a histogram, which is shown in the Plots tab. These tabs will be shown to a degree later, but for now just know where they are located.
Something that makes R easier to understand is that it is built to conduct statistics, which by extension means that it is built with mathematical properties. To illustrate this, I will show you a series of commands you can run, but first let’s orient ourselves to running a command. As mentioned earlier, you can type commands into the Console pane then hit Enter, or you can write them in the R script in the Source Editor area and then use Ctrl + Enter, as shown below:
Now that you know how to enter commands, run the following code in either the R script or the Console. Note: the spacing doesn’t matter (you can also run 3+4 without problems).
3 + 4
19 - 5
20 / 4
7 * 8
6 ^ 7
If you run just the first line, 3 + 4, you should see the following output in the Console, which always displays the results of what you run, whether from the Console or your R script:
## [1] 7
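As a small extra sketch (the specific numbers are my own illustration), R follows the usual mathematical order of operations, and parentheses change that order:

```r
# Multiplication is evaluated before addition
3 + 4 * 2
## [1] 11

# Parentheses force the addition to happen first
(3 + 4) * 2
## [1] 14

# Exponents are evaluated before multiplication
2 * 3 ^ 2
## [1] 18
```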
Beyond just basic arithmetic, there are a number of mathematical functions in R that are also helpful, which leads us to our first use of functions! Be careful though. R is case-sensitive, which means that if you capitalize things that aren’t supposed to be, R will explode. Additionally, R will have trouble with typos as well, so make sure you have spelled the function correctly. In any case, let’s say we want to get the square root of 49. We would simply run the following code:
sqrt(49)
## [1] 7
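To see the case-sensitivity point in action, here is a minimal sketch that deliberately misspells the function as Sqrt. The tryCatch wrapper is my own addition so the script keeps running and the error message is stored as text instead of stopping everything:

```r
# sqrt() exists, but Sqrt() does not, because R treats upper and lower case as different
result <- tryCatch(
  Sqrt(49),
  error = function(e) conditionMessage(e)  # keep the error message as a string
)
result

# The correctly spelled version works as before
sqrt(49)
```

Typing Sqrt(49) directly into the Console would show the same message as an error.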
You can see this function operates quite simply…you type the function name, enter the number you want to use, and then run it in R. However, it is important to note that functions don’t always operate like this. Generally speaking, functions are typically composed of the following elements:
For a practical example, we can use the round function, which takes a decimal value and rounds it to the number of digits you specify. The arguments for round are the following:
- x: the number you want to round.
- digits: the number of decimal places you want to round to.
To demonstrate, we will try to round the number 40.5789 to the 2nd decimal place. If a function requires multiple arguments, do not forget to include the , to separate them.
round(x = 40.5789, digits = 2)
## [1] 40.58
Functions don’t always require you to explicitly name the arguments. We could have done the exact same thing with the following code:
round(40.5789, 2)
## [1] 40.58
As you become more adept at programming in R, you will rely less on explicit arguments, but for now, we will use them to make explanation easier.
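A short sketch of how named and unnamed arguments differ, using the same round call. The reordered call and the call without digits are my own illustrations:

```r
# Named arguments can be given in any order
round(digits = 2, x = 40.5789)

# Unnamed arguments are matched by position, so the order matters here
round(40.5789, 2)

# Leaving out digits falls back to its default of 0 decimal places
round(40.5789)
```

The first two calls return the same value, while the last one rounds to a whole number.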
There are a number of data types in R. We will not explore everything about them here, but I will mention at least some to familiarize you with what they are. The first form of data is a vector. Vectors are simply a collection of values stored in the following format:
$$\vec{x} = [x_1, x_2, \ldots, x_j]$$
To show you what that practically means, we can create a vector and run it in R. If you run the code below, R will return a vector. The c function means “concatenate”, which is simply a fancy way of saying “combine these numbers.”
c(3,4,0,1,3)
## [1] 3 4 0 1 3
It may be useful to save these values and use them for later. Thus we will save this as an object. In R, objects are simply items saved to be used later. They can be useful for performing a lot of simple and complex code. To assign these values as an object, simply give it a name (we will call this X1…remember to capitalize this), and assign it to the name with the <- operator. Whatever is written to the right of this operator will be saved as an object. Let’s go ahead and do so.
X1 <- c(3,4,0,1,3)
Once you run this, it will be saved in your environment. You will notice in your Environment that it lists this data as a numeric vector (num), shows that it has five elements indexed from 1 to 5 ([1:5]), and lists the specific numbers in order.
You can also simply call this object by typing its name and running it like any code in R.
X1
## [1] 3 4 0 1 3
Numeric vectors have nice properties. For example, you can use arithmetic operators with numeric vectors and R will apply them to the entire vector. As an example, let’s multiply the entire vector by 5.
X1 * 5
## [1] 15 20 0 5 15
You can also apply math between vectors. Let’s save another vector, X2, and add it to X1.
X2 <- c(2,4,9,1,0)
X1 + X2
## [1] 5 8 9 2 3
Vectors are made up of many classes, or types. These include:
- Numeric (integer: 45 / double: 45.8)
- Logical (TRUE/FALSE)
- Character ("This is a sentence.")
We won’t go into these in great detail for now (I will save that for another time), but just as another example of a vector, we can create a vector called X3 using text, which constitutes a character vector.
X3 <- c("Shawn watches too many movies.")
X3
## [1] "Shawn watches too many movies."
You may have guessed this, but character vectors have different attributes, and thus can’t share the same properties as numeric vectors. As an example, run this code and see what happens:
X1 * X3
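If you want to check what kind of vector you are dealing with before doing math on it, the class function reports the type. Below is a minimal sketch with stand-in objects named num_vec and chr_vec (hypothetical names that copy the values of X1 and X3), again using tryCatch so the error is captured as text rather than halting the script:

```r
num_vec <- c(3, 4, 0, 1, 3)                     # stand-in for X1
chr_vec <- c("Shawn watches too many movies.")  # stand-in for X3

class(num_vec)  # "numeric"
class(chr_vec)  # "character"

# Multiplying a numeric vector by a character vector is an error;
# tryCatch() stores the message instead of stopping the script
msg <- tryCatch(
  num_vec * chr_vec,
  error = function(e) conditionMessage(e)
)
msg
```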
A matrix is a two-dimensional data structure. Its behavior is very similar to a vector, but instead of being composed of one row or column of data, it can be composed of multiple rows and columns. An example of a 2 x 3 matrix (2 rows and 3 columns) in mathematical notation:
$$\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{bmatrix}$$
As a practical example, let’s create a 2x3 matrix and save it as mat. First, enter the values you want into the data argument, then tell the matrix you want it organized into two rows with the nrow argument.
mat <- matrix(data = c(2,4,5,1,7,3), nrow = 2)
mat
## [,1] [,2] [,3]
## [1,] 2 5 7
## [2,] 4 1 3
You will notice two things about this matrix. First, it has brackets listed at the top and left of the matrix. These specify the locations of the values in matrix notation. Second, you will notice there are specific values listed as either [X,] or [,X]. The [X,] notation indicates the row number, and the [,X] notation naturally lists the column location. This is useful because this can be used to pull data from specific locations. For instance, if we want to pull the “4” value from Row 2 Column 1, we would run the following code:
mat[2,1]
## [1] 4
And you can simply pull an entire row or column by listing only one location. If for example we want the values 2, 5, and 7, we would use this code instead:
mat[1,]
## [1] 2 5 7
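The same idea works for columns. As a quick sketch, the matrix below is rebuilt under the hypothetical name m so it can be run on its own:

```r
m <- matrix(data = c(2, 4, 5, 1, 7, 3), nrow = 2)  # same values as mat above

m[, 2]   # leaving the row blank pulls the whole second column
## [1] 5 1

dim(m)   # number of rows, then number of columns
## [1] 2 3
```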
Notice that your code over time may get more difficult to read if you cram a bunch of arguments into the same line. One thing that may make coding easier is to indent each argument by hitting Enter after each argument’s comma. An example is shown below:
mat <- matrix(data = c(2,4,5,1,7,3),
              nrow = 2)
Here you can now clearly see each argument. The first line is the data argument, and nrow is listed as both the second line and second argument. This makes reading your code much easier over time, as you can go line-by-line through the code after writing. This becomes very useful for debugging code later on as you develop longer and longer scripts.
Just as a side note, this matrix could have also been specified by ncol, which notes how many columns to organize the data into instead. Either will form the matrix by the specification you like. However, note the ordering of the matrix. It fills in the values vertically, which may not be as intuitive as it appears. To fill the values horizontally, we can use the byrow argument and set it to either T or TRUE, which is a logical value that “turns on” this part of the function. By default, its argument is set to F for FALSE, which “turns off” this argument, thus not filling in values by row.
mat <- matrix(data = c(2,4,5,1,7,3),
              nrow = 2,
              byrow = T)
mat
## [,1] [,2] [,3]
## [1,] 2 4 5
## [2,] 1 7 3
Now you can see the values are filled into the first row, then filled in sequentially into the second row. If you would like to supply names to this matrix (for something like a chi-squared test), you can use the colnames and rownames functions to change them. Let’s say we want to name the columns by occupation and the rows by gender. We would use the code below. This may look awkward because you are assigning values to a function. Here we are basically calling the names of the matrix first, then saving their respective values to be used later.
colnames(mat) <- c("Doctor","Model","Custodian")
rownames(mat) <- c("Male","Female")
mat
## Doctor Model Custodian
## Male 2 4 5
## Female 1 7 3
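Once a matrix has names, you can also pull values by name instead of by number. The sketch below rebuilds the matrix under the hypothetical name m_named, this time using ncol (the alternative to nrow mentioned earlier) to show that it produces the same layout:

```r
m_named <- matrix(data = c(2, 4, 5, 1, 7, 3),
                  ncol = 3,       # three columns instead of nrow = 2
                  byrow = TRUE)
colnames(m_named) <- c("Doctor", "Model", "Custodian")
rownames(m_named) <- c("Male", "Female")

m_named["Female", "Model"]   # a single cell by row and column name
## [1] 7

m_named[, "Doctor"]          # a whole column by name
##   Male Female
##      2      1
```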
Before we move on, remember that matrices behave in similar ways to vectors. If we want to subtract 4 from each value in the matrix, you may do so.
mat - 4
## Doctor Model Custodian
## Male -2 0 1
## Female -3 3 -1
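Likewise, you can multiply matrices by each other. Note that * multiplies the matrices element by element; true matrix multiplication uses the %*% operator instead.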
mat * mat
## Doctor Model Custodian
## Male 4 16 25
## Female 1 49 9
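To make the difference between the two kinds of multiplication concrete, here is a minimal sketch. The object name m and the use of t (transpose) are my own choices for illustration, since a 2x3 matrix can only be matrix-multiplied by something with 3 rows:

```r
m <- matrix(data = c(2, 4, 5, 1, 7, 3), nrow = 2, byrow = TRUE)

# Element-wise multiplication: each value is multiplied by itself
m * m
##      [,1] [,2] [,3]
## [1,]    4   16   25
## [2,]    1   49    9

# Matrix multiplication of the 2x3 matrix with its 3x2 transpose gives a 2x2 result
m %*% t(m)
##      [,1] [,2]
## [1,]   45   45
## [2,]   45   59
```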
I like to characterize lists as a shopping bag of data…it contains a variety of objects that don’t necessarily have to be the same thing, but can all be combined for use later.
For instance, we can pair a numeric vector with a character vector, then call it later.
lis <- list(c(4,5,6,8,1,0),
            c("French","Belgian","German"))
lis
## [[1]]
## [1] 4 5 6 8 1 0
##
## [[2]]
## [1] "French" "Belgian" "German"
Did you notice the new notation? To fetch each part of the list, we now use [[]] and specify where we want to pull from.
lis[[1]]
## [1] 4 5 6 8 1 0
We can even name the list elements like we did with the matrix, using names this time.
names(lis) <- c("Numeric", "Character")
lis
## $Numeric
## [1] 4 5 6 8 1 0
##
## $Character
## [1] "French" "Belgian" "German"
Notice the notation has changed now. For a named list, we can now use the $ operator to pull each part of the list.
lis$Character
## [1] "French" "Belgian" "German"
You can also just use the original [[]] operator as well, but keep this $ operator in mind. It will be used in our discussion on data frames.
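As a quick sketch of the different ways to reach into a named list, the example below builds a list called shopping (a hypothetical name, with the same contents as lis) and names its elements at creation time:

```r
shopping <- list(Numeric = c(4, 5, 6, 8, 1, 0),
                 Character = c("French", "Belgian", "German"))

shopping[["Character"]]   # by name with [[ ]]
shopping[[2]]             # by position with [[ ]]
shopping$Character        # by name with $

# Once you have the element, you can index inside it like any vector
shopping$Character[2]
## [1] "Belgian"
```

All three of the first calls return the same character vector.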
Data frames are similar to matrices. They are “rectangular” or two-dimensional data composed of rows and columns. The major difference is that data frames, like lists, can store multiple types of data without creating them in a list format, which makes the data easier to read. Data frames also allow you to quickly create variables by naming vectors within a frame. To create a data frame, we can use the data.frame function. To assign each vector a name, simply write the name for the vector, save it with the = operator, and list what you want to save to the right of this operator. This is functionally similar to the <- operator, but the major difference is that the Treatment and Mean_Response vectors won’t be saved as additional objects while making your data frame.
df <- data.frame(Treatment = c("Group_1","Group_2"),
                 Mean_Response = c(78,56))
df
## Treatment Mean_Response
## 1 Group_1 78
## 2 Group_2 56
You may have noticed the slightly peculiar naming scheme I have employed: snake case. This is done by putting _ in place of the spaces between words in a name (a . is sometimes used the same way). The reason I have done this is because assigning names with spaces causes a lot more problems than solutions. I won’t go into detail as to why here, but you can experiment on your own and see why not using this can be a problem. For now, simply use snake case from here on.
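If you would like a head start on that experiment, here is a minimal sketch of what happens when a column name contains a space. The object names fixed and spaced are hypothetical, and check.names is an existing argument of data.frame that controls whether R repairs such names:

```r
# By default (check.names = TRUE), R silently replaces the space with a dot
fixed <- data.frame("Mean Response" = c(78, 56))
names(fixed)
## [1] "Mean.Response"

# Keeping the space is possible, but then every use of $ needs backticks
spaced <- data.frame("Mean Response" = c(78, 56), check.names = FALSE)
names(spaced)
## [1] "Mean Response"
spaced$`Mean Response`
## [1] 78 56
```

Either way, the name you typed is not quite the name you end up working with, which is one reason to avoid spaces from the start.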
You will now see that the data frame is laid out as a 2x2 table with character and numeric values. To pull a column from the data frame, you can now use the $ operator.
df$Mean_Response
## [1] 78 56
Since the columns are technically vectors arranged in rows and columns, you can also retrieve values the same way as with a matrix, by giving the row and column.
df[2,2]
## [1] 56
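A few more ways to index a data frame, sketched with a copy of the same data under the hypothetical name scores:

```r
scores <- data.frame(Treatment = c("Group_1", "Group_2"),
                     Mean_Response = c(78, 56))

scores[2, ]                 # the whole second row (still a data frame)
scores[, "Mean_Response"]   # a column by name, returned as a vector
scores$Mean_Response[1]     # the first value of that column
## [1] 78
```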
Perhaps you want to save a part of this data frame for a separate part of your script. Simply save these vectors as objects for use later.
trt.groups <- df$Treatment
trt.groups
## [1] "Group_1" "Group_2"
There are many datasets that automatically come with R, and you can use many of these to practice statistical analysis. If you run library(help = "datasets"), you will see a full list of all the pre-packaged data that comes with R. To open one, simply write its name and run it as you would any code. For example, we can run the mtcars data in this way:
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Oftentimes it is difficult to read a gigantic dataset in R within the Console. Two functions that are useful for this are head, which pulls the first 6 rows of data, and tail, which pulls the last 6 rows of data.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
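As a short sketch of the companion function and a common variation, both head and tail accept an n argument for how many rows to show, and dim reports the overall size of the data:

```r
tail(mtcars)          # the last 6 rows
head(mtcars, n = 3)   # only the first 3 rows
dim(mtcars)           # 32 rows and 11 columns in total
```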
You can also view the entire data frame in a separate window with the View function.
View(mtcars)
This may not be super useful on its own since the window is still collapsed. If you press Ctrl + Shift + 1, it will expand the view of only the data frame. Pressing Ctrl + Shift + 1 again returns the view to its original position.
By the way, you can use many keyboard shortcuts in R like this. For a full list, check out this link: https://support.posit.co/hc/en-us/articles/200711853-Keyboard-Shortcuts-in-the-RStudio-IDE
What would data be without a nice plot? Visualization allows data to speak to us. To plot the histogram from the beginning of this tutorial, you can use the hist function. First, enter the vector you want to plot as the x argument (here we use miles per gallon, or mpg). Next, change the color to steel blue with the col argument. Then change the x-axis label with xlab. Finally, name the title with main.
hist(x = mtcars$mpg,
     col = "steelblue1",
     xlab = "Miles per Gallon",
     main = "Histogram of Miles per Gallon")
And now you have created your first plot in R! Plots will be explained in a lot more detail in subsequent tutorials, but for now you can try this histogram out on all the data frames that come with R. There are also other forms of data out there one can experiment with (arrays, tibbles, etc.), but this tutorial covers the basics and will be useful for now.
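If you are curious about the numbers behind the bars, one option is to save the result of hist as an object. This is only a sketch: the name h is hypothetical, and plot = FALSE is an existing argument that skips the drawing so you can inspect the values instead.

```r
h <- hist(x = mtcars$mpg, plot = FALSE)

h$breaks        # the edges of each bin
h$counts        # how many cars fall into each bin
sum(h$counts)   # adds up to the number of cars in mtcars
## [1] 32
```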
While the data and functions that are available as defaults in R are useful, they are fairly limited. It is essential as an R user to know how to install and load libraries, which are collections of functions written by other users. Without this ability, the utility of R is bounded by what is already available. One of the most useful packages available is the tidyverse, which is actually a collection of packages rolled into one. To install it, simply use the following code:
install.packages("tidyverse")
Once the Console is finished installing the library, simply load the library with the following code:
library(tidyverse)
Or alternatively:
library("tidyverse")
Now you will be able to use all of the functions within the tidyverse. Let’s use the filter function to keep only the rows of mtcars with a horsepower of less than 100.
filter(mtcars, hp < 100)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
You will now see that it brings back a much more limited list of entries based on our filter. The tidyverse is quite a dense package with a somewhat steep learning curve. I will not explore its functionality in this tutorial, but I will point you to the useful book R for Data Science by its creator Hadley Wickham if you want to learn more.
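For comparison, the same rows can be pulled without any extra packages by combining the [row, column] notation from earlier with a logical condition. This is a base R sketch of the idea rather than what filter does internally, and low_hp is a hypothetical name:

```r
# Keep the rows where hp is below 100, and keep all columns
low_hp <- mtcars[mtcars$hp < 100, ]

nrow(low_hp)       # 9 cars meet the condition
rownames(low_hp)   # the names of those cars
```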
Functions and libraries are what make R so useful. But this doesn’t mean you have to rely on the functions of others…you can even create your own! As a very basic example, here I have created a function called add.2, which simply adds 2 to any number you include.
add.2 <- function(x){
  x + 2
}
add.2(5)
## [1] 7
This is achieved by doing the following:
1. Give the function a name (here add.2) and assign to it with the <- operator. Then use function, wherein we name our only argument as x, which is our input.
2. Add an opening curly bracket after function(x) and hit Enter after the first bracket to indent it the way I have.
3. Write the body of the function. x here functions as a placeholder for whatever you enter into the function. We are going to use it simply as a number or vector. To add the number 2 to whatever we enter into the function, we enter “x + 2” to accomplish this goal.
As I mentioned, we can use a number or a vector. Let’s create a random vector and use our function on it.
random.vector <- c(16,43,51,28)
add.2(random.vector)
## [1] 18 45 53 30
You have now officially created your first function. Congrats! Programming in R is a dense subject, so we will simply stop here and explore this topic another time.
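As one possible next step, a function can take more than one argument, and an argument can be given a default value just like digits in round. The function add.n below is a hypothetical extension of add.2, not something from a package:

```r
# add.n adds n to x; n defaults to 2 so it behaves like add.2 when n is omitted
add.n <- function(x, n = 2){
  x + n
}

add.n(5)                      # uses the default n = 2
## [1] 7
add.n(5, n = 10)              # overrides the default
## [1] 15
add.n(c(16, 43, 51, 28), 3)   # works on a vector too
## [1] 19 46 54 31
```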
You may have noticed that if you have entered something incorrectly, R will yell at you in its own subtle way. See below where I accidentally capitalized the wrong letter. The Console will tell me at the bottom that it cannot find the function.
A mental check I first do when troubleshooting:
If many of these checks don’t immediately solve the problem, you can check the help pages in R. Simply use the ? operator with the function you are using, and it should pull up a page in R listing the information you require.
?mean
You can see the help page on the bottom right pane.
However, it may be easier to see if you expand it. If you hit Ctrl + Shift + 3, this will maximize the Help tab so it is easier to read.
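When you only need a reminder of a function's arguments, a lighter option than opening the full help page is args. The sketch below uses round from earlier; arg_names is a hypothetical name for the stored result:

```r
# help("mean") opens the same page as ?mean

# args() prints a function's arguments without opening the help page
args(round)

# The argument names can also be pulled out as a character vector
arg_names <- names(formals(args(round)))
arg_names
```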
The help pages will generally have the following sections:
- The function name and its package at the top of the page (e.g., mean {base}, indicating that this comes from the base package in R)
Oftentimes these help pages can be embarrassingly bad and barely helpful for the reader. Another great resource is Stack Overflow, a coding community with a Q&A forum format that can often get you answers when the R help pages don’t. You can find Stack Overflow at https://www.stackoverflow.com.
And there you have it. You have now accomplished a lot already. Play around with the features you have learned here and experiment in your own time with other features in R. Happy coding!