R has been around since 1993, and it was built to make statistics, linear algebra, operations research, artificial intelligence and machine learning tools a single function call away. R is a wonderful tool for statistical analysis, visualization and reporting. Its usefulness is best seen in the wide variety of fields and organizations where it is used, for example banks, tech start-ups, international organizations, economics, engineering, epidemiology, ecology, real estate and accounting, among others.
To download R go to the Comprehensive R Archive Network (CRAN), the network of servers that distributes R. At the top of the page are links to download R for Windows, macOS and Linux.
RStudio is an integrated development environment (IDE), an interface that facilitates interaction and workflow with R. R’s usability has greatly improved over the past few years, mainly thanks to RStudio.
To download RStudio go to RStudio.com. Scroll down until you see the RStudio Desktop free open source license download option. Note that for RStudio to run you must have already downloaded and installed R.
RStudio is highly customizable, but the basic interface looks roughly like the figures below.
Once you have opened RStudio, start a new script (the place where your code will go) by clicking the highlighted icon in the image below and selecting R Script. Or simply use the keyboard shortcut Ctrl+Shift+N.
If you have successfully opened RStudio and created a new R script, your screen should now look like the image below.
After opening your new R script, the RStudio window consists of four (4) panes, each with a different functionality.
1. The R script pane: this is where you’ll write your main code.
2. The console pane: this is where you’ll see the outputs (except for plots).
3. The environment, history, and connections pane: this is where you’ll see the values of the variables in memory, the R code history, and any open connections to outside databases.
4. The files, plots, packages, help, and viewer pane: in the Files tab you’ll be able to see, generally, the files inside your working directory. In the Plots tab you’ll see your output plots. In the Help tab you’ll see descriptions of R functions; go to the section below to learn more about the help options.
The Help tab in the bottom-right pane is very helpful when you are starting to use R and RStudio. From this tab you can get anything from general information about R to very specific information about a single function. Below are some helpful tips to get the most out of the Help tab.
To open the help page of a function, type ?function_name in your console (pane 2).
To run an example from a help page, select it and press Ctrl+Enter. The example output will show in your console (pane 2).
This introduction to R will show you the basic tools to explore and analyze your data. To follow along with the examples below download and save this file.
Recommendation: create a new folder on your desktop, name it intro2R, and save the file in there.
The working directory is where your code and data files will be stored. It is recommended to always set your working directory at the beginning of your script. Below are three different ways to set your working directory.
Use the function setwd(). To run the command in R you can press the Run button located in the top-right corner of the script pane, or press Ctrl+Enter.
Tip: If you do not recall the entire path of where your file is stored you can press Tab for a list of possible directories. To do this simply write setwd("") in your script, place your cursor between the quotes and press Tab.
> at the beginning) and paste it into your script. By doing this, the next time you open your script you just have to run the line.
Use the File tab (located in the bottom-right pane). First click on the ... located in the top-right corner of the File tab and select the folder you wish to be your working directory. Now, to set the selected path as your working directory click on the gear icon More and select the option Set As Working Directory. The path to your working directory will show in your console.
One of the ways to import data into R is to use text files. Our data is contained in a CSV (comma-separated values) file. A CSV file uses commas to separate the different elements in a line, so the data resembles an Excel worksheet. To read the file into R we use the R function read.csv().
The expression read.csv() is a function call that asks R to run the function read.csv. The values inside the parentheses are called arguments.
Since we didn’t tell R to do anything else with the function’s output, the console will display the full contents of the file. read.csv reads the file, but we can’t use the data unless we assign it to a variable.
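As a minimal sketch of such a call: the name of the tutorial's data file is not given here, so the example first writes a small stand-in CSV (the six rows that head() prints later in this introduction) to a temporary file. The name csv_path is hypothetical; in practice you would pass the path of the file you saved in your intro2R folder.

```r
# Stand-in for the tutorial's data file: six rows written to a temporary CSV.
# csv_path is a hypothetical name; use the path to your own file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "MoNTH,weight,species,RESUTLT",
  "January,26.5,S,Positive",
  "February,13,R,Positive",
  "February,8,R,Negative",
  "March,18.5,S,Negative",
  "March,10.5,R,Positive",
  "April,6.5,S,Negative"
), csv_path)

# Function call: read.csv is the function, file = csv_path is its argument.
# The result is not assigned, so R prints the whole table to the console.
read.csv(file = csv_path)
```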
To assign a value to a variable we use the assignment operator <-.
Notice that when we assign a variable name to the data, in this case the number 5, R does not print the value in the console. After running this line our data will be stored in our global environment as x with a value of 5. We can access that value by simply using the variable name.
## [1] 5
We can also treat our variable as a regular number, for example:
## [1] 0
## [1] 10
## [1] 9
Tip: We can add comments to our code using the # character. It is useful to document our code in this way so that others (and us the next time we read it) have an easier time following what the code is doing.
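The assignment steps above can be sketched as follows. The assignment of 5 to x comes from the text; the three arithmetic expressions are illustrative choices that happen to give the values 0, 10 and 9.

```r
# Assign the value 5 to the variable x (nothing is printed)
x <- 5

# Typing the variable name prints its value
x        # 5

# x can be used like a regular number (illustrative expressions)
x - 5    # 0
x * 2    # 10
x + 4    # 9
```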
Going back to our data: in order to make use of it we need to store it in a variable. In this case we will call our variable “data”.
In the case of a data frame there are some important things to be on the lookout for:
Here are some of the most common ways to find important information about our data:
str(): displays the internal structure of a data frame
## 'data.frame': 50 obs. of 4 variables:
## $ MoNTH : chr "January" "February" "February" "March" ...
## $ weight : num 26.5 13 8 18.5 10.5 6.5 1 5 26 0.5 ...
## $ species: chr "S" "R" "R" "S" ...
## $ RESUTLT: chr "Positive" "Positive" "Negative" "Negative" ...
head(): displays the first few rows of your data frame
## MoNTH weight species RESUTLT
## 1 January 26.5 S Positive
## 2 February 13.0 R Positive
## 3 February 8.0 R Negative
## 4 March 18.5 S Negative
## 5 March 10.5 R Positive
## 6 April 6.5 S Negative
dim(): returns the dimensions of your data frame, where the first value corresponds to the number of rows and the second to the number of columns
## [1] 50 4
nrow(): returns the number of rows in the data frame
## [1] 50
ncol(): returns the number of columns in the data frame
## [1] 4
summary(): displays the statistical summary of each of the columns in the data frame
## MoNTH weight species RESUTLT
## Length:50 Min. : 0.50 Length:50 Length:50
## Class :character 1st Qu.: 5.50 Class :character Class :character
## Mode :character Median : 9.50 Mode :character Mode :character
## Mean :11.67
## 3rd Qu.:16.50
## Max. :39.50
typeof(): returns the data type of an object
## [1] "double"
## [1] "character"
class(): returns the class attribute of an R object
## [1] "data.frame"
rownames(): returns the names of the rows in the data frame
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
## [46] "46" "47" "48" "49" "50"
colnames() and names(): return the names of the columns in the data frame
## [1] "MoNTH" "weight" "species" "RESUTLT"
## [1] "MoNTH" "weight" "species" "RESUTLT"
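The calls that produce the outputs above can be sketched as follows. This is an illustration on a stand-in data frame typed by hand from the six rows that head() prints, so dim() reports 6 rows rather than 50. Applying typeof() to the weight and MoNTH columns is an assumption; the original calls are not shown.

```r
# Stand-in data frame: the first six rows of the tutorial's data, typed by hand
data <- data.frame(
  MoNTH   = c("January", "February", "February", "March", "March", "April"),
  weight  = c(26.5, 13, 8, 18.5, 10.5, 6.5),
  species = c("S", "R", "R", "S", "R", "S"),
  RESUTLT = c("Positive", "Positive", "Negative", "Negative", "Positive", "Negative"),
  stringsAsFactors = FALSE
)

str(data)            # internal structure
head(data)           # first few rows
dim(data)            # 6 4
nrow(data)           # 6
ncol(data)           # 4
summary(data)        # statistical summary of each column
typeof(data$weight)  # "double"
typeof(data$MoNTH)   # "character"
class(data)          # "data.frame"
rownames(data)       # "1" "2" "3" "4" "5" "6"
colnames(data)       # same result as names(data)
```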
All these are useful commands to help clean and understand the structure of your data. For example, using the last functions, colnames() and names(), we can rename our columns.
Again, you can check the current column names by using the function names(data_frame_name):
## [1] "MoNTH" "weight" "species" "RESUTLT"
To make changes to the column names use the function colnames() and specify the new names using c("name1", "name2", ...).
The c() function in R combines its arguments into a vector, a one-dimensional collection of items such as names.
## [1] "Month" "Weight" "Species" "Diagnostic"
Now assign the new names to the columns just as if you were assigning a value to a variable name.
Double check that you have correctly named your columns by rerunning or typing the function names().
## [1] "Month" "Weight" "Species" "Diagnostic"
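The renaming steps above can be sketched as follows, on a two-row stand-in data frame that carries the original column names. The variable name new_names is hypothetical.

```r
# Two-row stand-in with the original column names
data <- data.frame(
  MoNTH   = c("January", "February"),
  weight  = c(26.5, 13),
  species = c("S", "R"),
  RESUTLT = c("Positive", "Positive"),
  stringsAsFactors = FALSE
)

names(data)       # "MoNTH" "weight" "species" "RESUTLT"

# Build the vector of new names with c() and assign it to the column names
new_names <- c("Month", "Weight", "Species", "Diagnostic")
colnames(data) <- new_names

names(data)       # "Month" "Weight" "Species" "Diagnostic"
```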
There are three main ways for addressing data inside R objects.
1. By index (subsetting)
2. By logical vector
3. By name
We can address our data in the data frame by indexes. If we want to get a single value from the data frame, we can provide an index in square brackets [ ]. The first number specifies the row and the second the column:
## [1] "January"
To select all columns but only a single row, leave the space after the comma blank. A blank space means “include all”.
## Month Weight Species Diagnostic
## 1 January 26.5 S Positive
And vice versa, to select all rows but only a single column, leave the space before the comma blank:
## [1] "January" "February" "February" "March" "March" "April"
## [7] "April" "April" "May" "May" "May" "May"
## [13] "June" "June" "June" "June" "June" "July"
## [19] "July" "July" "July" "July" "July" "July"
## [25] "July" "August" "August" "August" "August" "August"
## [31] "August" "August" "August" "August" "August" "September"
## [37] "September" "September" "September" "September" "September" "September"
## [43] "October" "October" "October" "October" "November" "November"
## [49] "November" "December"
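A minimal sketch of these three selections, using a three-row stand-in data frame built from the first rows shown above:

```r
# Three-row stand-in built from the first rows of the tutorial's data
data <- data.frame(
  Month      = c("January", "February", "February"),
  Weight     = c(26.5, 13, 8),
  Species    = c("S", "R", "R"),
  Diagnostic = c("Positive", "Positive", "Negative"),
  stringsAsFactors = FALSE
)

data[1, 1]   # row 1, column 1: "January"
data[1, ]    # row 1, all columns
data[, 1]    # all rows, column 1: "January" "February" "February"
```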
Now, to select more than one row and column at a time you can use either the : operator or the c() function. Use : when you want to select consecutive columns or rows; this operator generates a sequence of numbers. Use c() when you want to select non-consecutive rows or columns. Note that you can also use c() for consecutive selections.
## [1] 1 2 3 4 5
## Month Weight Species Diagnostic
## 1 January 26.5 S Positive
## 2 February 13.0 R Positive
## 3 February 8.0 R Negative
## 4 March 18.5 S Negative
## 5 March 10.5 R Positive
## Month Diagnostic
## 1 January Positive
## 3 February Negative
## Month Weight Species
## 1 January 26.5 S
## 3 February 8.0 R
## 5 March 10.5 R
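The calls behind the outputs above are not shown; a sketch consistent with them, on a five-row stand-in data frame, could be:

```r
# Five-row stand-in built from the rows shown above
data <- data.frame(
  Month      = c("January", "February", "February", "March", "March"),
  Weight     = c(26.5, 13, 8, 18.5, 10.5),
  Species    = c("S", "R", "R", "S", "R"),
  Diagnostic = c("Positive", "Positive", "Negative", "Negative", "Positive"),
  stringsAsFactors = FALSE
)

1:5                      # 1 2 3 4 5
data[1:5, ]              # rows 1 to 5, all columns
data[c(1, 3), c(1, 4)]   # rows 1 and 3, columns Month and Diagnostic
data[c(1, 3, 5), 1:3]    # rows 1, 3 and 5, columns 1 to 3
```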
Another special function that can help generate sequences of numbers is seq(). This function takes three parameters: from where, to where, and by what increment.
## [1] 1 2 3 4 5
If we want the information in the first 10 rows but skipping every other row, we can use seq() as follows:
## Month Weight Species Diagnostic
## 1 January 26.5 S Positive
## 3 February 8.0 R Negative
## 5 March 10.5 R Positive
## 7 April 1.0 R Negative
## 9 May 26.0 R Negative
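A sketch of the seq() calls described above, on a stand-in data frame holding the first ten rows of the data (values taken from the outputs printed in this introduction). The variable name every_other is hypothetical.

```r
# Stand-in: the first ten rows of the tutorial's data
data <- data.frame(
  Month      = c("January", "February", "February", "March", "March",
                 "April", "April", "April", "May", "May"),
  Weight     = c(26.5, 13, 8, 18.5, 10.5, 6.5, 1, 5, 26, 0.5),
  Species    = c("S", "R", "R", "S", "R", "S", "R", "R", "R", "R"),
  Diagnostic = c("Positive", "Positive", "Negative", "Negative", "Positive",
                 "Negative", "Negative", "Negative", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

seq(from = 1, to = 5, by = 1)                  # 1 2 3 4 5

# Rows 1, 3, 5, 7 and 9: the first ten rows, skipping every other one
every_other <- seq(from = 1, to = 10, by = 2)
data[every_other, ]
```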
We can also address our data in the data frame by names. If we know the name of a row or column, we can get single or multiple values from the data frame by providing the row name or column name inside square brackets [ , ], where the first entry specifies the row and the second the column, or with the help of the $ operator.
Remember: to get the row names of the data frame you can use the function rownames(), and to get the column names you can use colnames().
For the examples we will only use column names, since our row names are just numbers.
## [1] "Month" "Weight" "Species" "Diagnostic"
If we want all the values of a single column, we can use the $ operator.
## [1] 26.5 13.0 8.0 18.5 10.5 6.5 1.0 5.0 26.0 0.5 14.5 9.5 4.0 16.5 6.0
## [16] 3.5 17.0 5.5 13.0 9.5 16.5 39.5 17.0 6.0 9.0 18.0 5.5 1.0 3.0 8.5
## [31] 7.5 4.0 18.5 10.0 12.0 2.5 10.5 20.0 10.0 31.0 23.5 2.5 7.5 5.0 12.5
## [46] 34.0 8.0 5.5 12.5 8.0
If we only want a few rows of a single column we can combine the $ operator with brackets []. Note that using $ is equivalent to specifying the column in the second position of the brackets, so now we only need to tell R which row or rows we would like; i.e., only one number, sequence, or vector is required inside the [] and there is no comma.
## [1] 26.5 13.0 8.0
If we want all the values for multiple columns, we can select the columns by listing their names with c().
## Weight Diagnostic
## 1 26.5 Positive
## 2 13.0 Positive
## 3 8.0 Negative
## 4 18.5 Negative
## 5 10.5 Positive
## 6 6.5 Negative
## 7 1.0 Negative
## 8 5.0 Negative
## 9 26.0 Negative
## 10 0.5 Negative
## 11 14.5 Negative
## 12 9.5 Positive
## 13 4.0 Positive
## 14 16.5 Positive
## 15 6.0 Negative
## 16 3.5 Positive
## 17 17.0 Negative
## 18 5.5 Negative
## 19 13.0 Positive
## 20 9.5 Positive
## 21 16.5 Positive
## 22 39.5 Negative
## 23 17.0 Negative
## 24 6.0 Negative
## 25 9.0 Negative
## 26 18.0 Positive
## 27 5.5 Positive
## 28 1.0 Positive
## 29 3.0 Negative
## 30 8.5 Positive
## 31 7.5 Positive
## 32 4.0 Positive
## 33 18.5 Negative
## 34 10.0 Positive
## 35 12.0 Negative
## 36 2.5 Negative
## 37 10.5 Negative
## 38 20.0 Positive
## 39 10.0 Positive
## 40 31.0 Negative
## 41 23.5 Negative
## 42 2.5 Negative
## 43 7.5 Positive
## 44 5.0 Positive
## 45 12.5 Positive
## 46 34.0 Negative
## 47 8.0 Positive
## 48 5.5 Positive
## 49 12.5 Positive
## 50 8.0 Positive
And we can also specify the rows we would like, like this:
## Weight Diagnostic
## 6 6.5 Negative
## 7 1.0 Negative
## 8 5.0 Negative
## 9 26.0 Negative
## 10 0.5 Negative
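The name-based selections above can be sketched as follows, again on a stand-in data frame with the first ten rows of the data rather than all 50:

```r
# Stand-in: the first ten rows of the tutorial's data
data <- data.frame(
  Month      = c("January", "February", "February", "March", "March",
                 "April", "April", "April", "May", "May"),
  Weight     = c(26.5, 13, 8, 18.5, 10.5, 6.5, 1, 5, 26, 0.5),
  Species    = c("S", "R", "R", "S", "R", "S", "R", "R", "R", "R"),
  Diagnostic = c("Positive", "Positive", "Negative", "Negative", "Positive",
                 "Negative", "Negative", "Negative", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

names(data)                             # the column names
data$Weight                             # all values of one column
data$Weight[1:3]                        # 26.5 13.0 8.0 (no comma inside the brackets)
data[, c("Weight", "Diagnostic")]       # all rows, two columns selected by name
data[6:10, c("Weight", "Diagnostic")]   # rows 6 to 10 of the same two columns
```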
Logical vectors can be created using relational operators, e.g. <, >, ==, !=, %in%. Below are some examples of how to use each of them.
To highlight the use of the different operators we will use the following vector of random numbers.
Using < and >:
## [1] FALSE FALSE FALSE FALSE FALSE
## [1] TRUE TRUE TRUE FALSE FALSE
Note that the returned values for both are only TRUE or FALSE; this is the answer to the comparison. R compares every single element in the vector.
Using == and !=:
## [1] FALSE TRUE FALSE FALSE FALSE
## [1] TRUE FALSE TRUE TRUE TRUE
Using %in%. The %in% operator checks, for each value on its left, whether it appears in the list or vector of objects on its right.
## [1] TRUE TRUE TRUE
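The vector used for the outputs above is not shown, so the sketch below uses an illustrative vector v and illustrative comparison values. They are assumptions, chosen only so that each comparison reproduces the pattern of TRUE and FALSE printed above.

```r
# Illustrative vector (an assumption, not the vector from the original page)
v <- c(10, 7, 6, 3, 2)

v < 1                 # FALSE FALSE FALSE FALSE FALSE
v > 5                 # TRUE  TRUE  TRUE FALSE FALSE
v == 7                # FALSE TRUE FALSE FALSE FALSE
v != 7                # TRUE FALSE  TRUE  TRUE  TRUE

# %in% checks each value on the left against the vector on the right
c(10, 7, 6) %in% v    # TRUE TRUE TRUE
```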
Now let’s use the operators to subset things in our data frame, combining a couple of the things we have covered so far.
In case we don’t remember what our data looks like we can check the structure and the head:
## 'data.frame': 50 obs. of 4 variables:
## $ Month : chr "January" "February" "February" "March" ...
## $ Weight : num 26.5 13 8 18.5 10.5 6.5 1 5 26 0.5 ...
## $ Species : chr "S" "R" "R" "S" ...
## $ Diagnostic: chr "Positive" "Positive" "Negative" "Negative" ...
## Month Weight Species Diagnostic
## 1 January 26.5 S Positive
## 2 February 13.0 R Positive
## 3 February 8.0 R Negative
## 4 March 18.5 S Negative
## 5 March 10.5 R Positive
## 6 April 6.5 S Negative
If we want to see where a specific value occurs in a given column we would do something like this:
## [1] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [13] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## [25] FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
## [37] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
## [49] TRUE TRUE
Now if we save this into a variable, we can use it to subset our data
## Month Weight Species Diagnostic
## 1 January 26.5 S Positive
## 2 February 13.0 R Positive
## 5 March 10.5 R Positive
## 12 May 9.5 S Positive
## 13 June 4.0 R Positive
## 14 June 16.5 S Positive
## 16 June 3.5 S Positive
## 19 July 13.0 S Positive
## 20 July 9.5 R Positive
## 21 July 16.5 R Positive
## 26 August 18.0 R Positive
## 27 August 5.5 R Positive
## 28 August 1.0 R Positive
## 30 August 8.5 R Positive
## 31 August 7.5 S Positive
## 32 August 4.0 R Positive
## 34 August 10.0 R Positive
## 38 September 20.0 S Positive
## 39 September 10.0 R Positive
## 43 October 7.5 S Positive
## 44 October 5.0 R Positive
## 45 October 12.5 S Positive
## 47 November 8.0 R Positive
## 48 November 5.5 S Positive
## 49 November 12.5 R Positive
## 50 December 8.0 S Positive
## Month Weight Species Diagnostic
## 1 January 26.5 S Positive
## Month Weight Species Diagnostic
## 3 February 8.0 R Negative
## 6 April 6.5 S Negative
## 7 April 1.0 R Negative
## 8 April 5.0 R Negative
## 10 May 0.5 R Negative
## 12 May 9.5 S Positive
## 13 June 4.0 R Positive
## 15 June 6.0 R Negative
## 16 June 3.5 S Positive
## 18 July 5.5 R Negative
## 20 July 9.5 R Positive
## 24 July 6.0 S Negative
## 25 July 9.0 S Negative
## 27 August 5.5 R Positive
## 28 August 1.0 R Positive
## 29 August 3.0 S Negative
## 30 August 8.5 R Positive
## 31 August 7.5 S Positive
## 32 August 4.0 R Positive
## 36 September 2.5 R Negative
## 42 September 2.5 S Negative
## 43 October 7.5 S Positive
## 44 October 5.0 R Positive
## 47 November 8.0 R Positive
## 48 November 5.5 S Positive
## 50 December 8.0 S Positive
And we can also specify what column we want to see
## [1] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [7] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [13] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [19] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [25] "Positive" "Positive"
## [1] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [7] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [13] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [19] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [25] "Positive" "Positive"
## [1] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [7] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [13] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [19] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [25] "Positive" "Positive"
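The subsetting steps above can be sketched as follows on a ten-row stand-in data frame. The variable name is_positive is hypothetical, and the exact conditions used for the original outputs are inferred from those outputs, so treat them as assumptions.

```r
# Stand-in: the first ten rows of the tutorial's data
data <- data.frame(
  Month      = c("January", "February", "February", "March", "March",
                 "April", "April", "April", "May", "May"),
  Weight     = c(26.5, 13, 8, 18.5, 10.5, 6.5, 1, 5, 26, 0.5),
  Species    = c("S", "R", "R", "S", "R", "S", "R", "R", "R", "R"),
  Diagnostic = c("Positive", "Positive", "Negative", "Negative", "Positive",
                 "Negative", "Negative", "Negative", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the diagnostic is positive
is_positive <- data$Diagnostic == "Positive"
is_positive

# Use the logical vector (or a condition written in place) to pick rows
data[is_positive, ]               # rows 1, 2 and 5
data[data$Month == "January", ]   # row 1
data[data$Weight < 10, ]          # rows 3, 6, 7, 8 and 10

# Three equivalent ways to also pick a single column
data[is_positive, "Diagnostic"]
data[is_positive, 4]
data$Diagnostic[is_positive]
```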
R also has functions for common calculations. To highlight the use of the different functions we will use the following vector of random numbers.
min(): returns the minimum of the R object
## [1] 1
max(): returns the maximum of the R object
## [1] 88
mean(): returns the mean of the R object
## [1] 20.5
sd(): returns the standard deviation of the R object
## [1] 24.78078
range(): returns the range of the values contained in the R object. The first value corresponds to the minimum and the second to the maximum.
## [1] 1 88
median(): returns the median of the values in the R object
## [1] 9
Remember you can also get the statistical summary using the function summary():
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.75 9.00 20.50 26.75 88.00
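The vector behind the outputs above is not shown. The sketch below applies the same functions to a small illustrative vector named values (an assumption), so its results differ from the numbers above.

```r
# Illustrative vector (an assumption, not the vector from the original page)
values <- c(12, 8, 10, 8, 12)

min(values)      # 8
max(values)      # 12
mean(values)     # 10
sd(values)       # 2 (sample standard deviation)
range(values)    # 8 12
median(values)   # 10
summary(values)  # minimum, quartiles, median, mean and maximum at once
```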
You can create your own user-defined functions in R. This is most useful when you want to run a repetitive task throughout your code.
The basic syntax of a function is as follows:
Note that the word function is in a different color; we use this word to declare a function in R. The statements within the curly braces {} constitute the body of the function. And finally, to save our new function we must assign it to a variable, i.e., give the function a name, function_name.
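The syntax described above can be sketched as follows. The names function_name, argument and result are placeholders, and the body (doubling the argument) is only there to make the skeleton runnable.

```r
# function_name: the name we give the function by assigning it with <-
# argument:      a placeholder for the input the function receives
function_name <- function(argument) {
  # statements: the body of the function goes between the curly braces
  result <- argument * 2
  return(result)
}

function_name(argument = 21)   # 42
```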
Let’s create a function that will allow us to convert from feet to miles, and let’s call the function feet_to_miles.
# This function transforms feet to miles
feet_to_miles <- function(feet) {
  miles <- feet / 5280
  print(paste(feet, "feet are", miles, "miles"))
}
Our function feet_to_miles takes one argument, converts it to miles, and prints the result with an informative message.
Now we can use our function as follows:
## [1] "5200 feet are 0.984848484848485 miles"
## [1] "10000 feet are 1.89393939393939 miles"
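The two calls behind the messages above are not shown; a sketch consistent with them, repeating the function definition so that the block runs on its own, would be:

```r
# Same definition as above, repeated so this block is self-contained
feet_to_miles <- function(feet) {
  miles <- feet / 5280
  print(paste(feet, "feet are", miles, "miles"))
}

# Each call prints one message to the console
feet_to_miles(5200)
feet_to_miles(10000)
```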
Instead of using a print statement in our function we can send a result back using the return() statement. Using the return statement allows us to save the answer into a variable.
# This function transforms feet to miles and returns the result
feet_to_miles <- function(feet) {
  miles <- feet / 5280
  return(miles)
}
x <- feet_to_miles(feet = 5200)
x
## [1] 0.9848485
Let’s create a function that calculates the travel time based on the speed and the distance to be traveled (note: time = distance / speed).
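The function's code is not shown here. A minimal sketch, consistent with the formula above and with the name and argument labels used in the calls in this section, would be:

```r
# Travel time = distance / speed.
# With distance in miles and speed in miles/hour, the result is in hours.
travel_time <- function(distance, speed) {
  time <- distance / speed
  return(time)
}
```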
We can call the above function as follows:
# You can store the arguments for your function in variables
d <- 100 # miles
s <- 50 # miles/hour
travel_time(distance = d, speed = s)
## [1] 2
The arguments follow a positional order, meaning that if labels are not added R will assume you have written them in the correct position.
## [1] 2
## [1] 0.5
And if only one is labeled, R will assume the other one corresponds to the missing label. In this case the order is not important.
## [1] 2
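The calls behind the three outputs above are not shown. The sketch below is consistent with them; it assumes a travel_time(distance, speed) function that simply divides distance by speed.

```r
# Assumed definition: time = distance / speed
travel_time <- function(distance, speed) {
  time <- distance / speed
  return(time)
}

# No labels: R matches by position (distance first, speed second)
travel_time(100, 50)           # 2
travel_time(50, 100)           # 0.5, because the values are swapped

# One label: the unlabeled value goes to the remaining argument (distance)
travel_time(speed = 50, 100)   # 2
```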
You can call multiple functions within a function. Following our previous examples, let’s say we have a distance in feet but need it in miles; we can nest our functions to convert the units and return the correct travel time.
travel_time_nested <- function(distance, speed) {
  distance <- feet_to_miles(distance)
  time <- distance / speed
  return(time)
}
d <- 528000 # in feet
s <- 10 # in miles/hour
travel_time_nested(distance = d, speed = s)
## [1] 10
In this case, nesting the functions may not be the best solution; it would be better to convert the units before passing the distance to the function.
d <- 528000 #in feet
s <- 10 #in miles/hour
d_in_miles <- feet_to_miles(feet = d)
time <- travel_time(distance = d_in_miles, speed = s)
time
## [1] 10
Regardless, in both cases we obtain the same result.
To make more complex decisions, we need to write code that automatically decides between multiple options. The computer can make these decisions through logical comparisons, using if and else statements.
The basic syntax of a general conditional statement looks something like this:
And reads: if the condition is met, do this (1); if not, do this (2).
For example, let’s say we want to check whether a value is greater than some boundary; we would do this as follows.
## [1] "not greater"
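The conditional's code is not shown; the sketch below gives the general form as comments, followed by one example. The variable names num and boundary and their values are illustrative assumptions, chosen so that the else branch runs.

```r
# General form:
# if (condition) {
#   do this (1)
# } else {
#   do this (2)
# }

num <- 37         # illustrative value
boundary <- 100   # illustrative boundary

if (num > boundary) {
  print("greater")
} else {
  print("not greater")
}
```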
We can also combine tests with two operators: two ampersands, &&, symbolize “and”; two vertical bars, ||, symbolize “or”.
Note that && is only true if both parts are true, while || is true if either part is true:
if (1 > 0 && -1 > 0) {
  print("both parts are true")
} else {
  print("at least one part is not true")
}
## [1] "at least one part is not true"
## [1] "at least one part is true"
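The code for the second message is not shown; a sketch that mirrors the && example with || would be the following. The wording of the else message is an assumption.

```r
# || is TRUE when at least one side is TRUE: here 1 > 0 is TRUE, -1 > 0 is FALSE
if (1 > 0 || -1 > 0) {
  print("at least one part is true")
} else {
  print("neither part is true")
}
```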
So now if we use our data frame and combine everything so far, we can do something like:
## Month Weight Species Diagnostic
## 13 June 4.0 R Positive
## 14 June 16.5 S Positive
## 16 June 3.5 S Positive
Note here the use of the single &. The & and | operators are designed to work on vectors, comparing them element by element.
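The code for this subset is not shown. One condition that reproduces rows 13, 14 and 16 is "month is June and diagnostic is positive"; this is an inference from the output, so treat it as an assumption. The sketch uses a stand-in data frame holding rows 12 to 16 of the data.

```r
# Stand-in: rows 12 to 16 of the tutorial's data, keeping their row numbers
data <- data.frame(
  Month      = c("May", "June", "June", "June", "June"),
  Weight     = c(9.5, 4, 16.5, 6, 3.5),
  Species    = c("S", "R", "S", "R", "S"),
  Diagnostic = c("Positive", "Positive", "Positive", "Negative", "Positive"),
  row.names  = 12:16,
  stringsAsFactors = FALSE
)

# & compares the two logical vectors element by element
june_positive <- data$Month == "June" & data$Diagnostic == "Positive"
data[june_positive, ]   # rows 13, 14 and 16
```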
Coming Soon…
The structure of this introduction to R is loosely based on the Software Carpentry (c) workshop Programming with R. For more information about Software Carpentry visit their website at https://software-carpentry.org/.