  • 1 Introduction
    • 1.1 What is R?
      • 1.1.1 Downloading R
    • 1.2 What is RStudio?
      • 1.2.1 Downloading RStudio
  • 2 Working with R and RStudio Basics
    • 2.1 RStudio Layout
    • 2.2 Get Help
  • 3 Programming with R
    • 3.1 Set Working Directory
    • 3.2 Read Data into R
    • 3.3 Variables
      • 3.3.1 Setting variables
    • 3.4 Working with Data Frames
      • 3.4.1 Addressing data
      • 3.4.2 Addressing and Subsetting Data
        • 3.4.2.1 Addressing by Index
        • 3.4.2.2 Addressing by Name
        • 3.4.2.3 Addressing by logical vector
    • 3.5 Using simple statistics
    • 3.6 User Defined Functions
      • 3.6.1 Syntax of a function
      • 3.6.2 One argument function
      • 3.6.3 Two argument function
      • 3.6.4 Nested functions
    • 3.7 Conditionals and if-else Statements
    • 3.8 Switch Statement
    • 3.9 Loops
      • 3.9.1 For Loops
      • 3.9.2 While Loops
      • 3.9.3 Apply Base R Loops
  • 4 More Learning Resources

1 Introduction

1.1 What is R?

R has been around since 1993 and was built to put tools for statistics, linear algebra, operations research, artificial intelligence, and machine learning a single function call away. R is a wonderful tool for statistical analysis, visualization, and reporting. Its usefulness is best seen in the wide variety of fields and organizations where it is used, for example banks, tech start-ups, international organizations, economics, engineering, epidemiology, ecology, real estate, and accounting, among others.

1.1.1 Downloading R

To download R, go to the Comprehensive R Archive Network (CRAN), which hosts and distributes R. At the top of the page are links to download R for Windows, Mac OS X, and Linux.

1.2 What is RStudio?

RStudio is an integrated development environment (IDE), an interface that facilitates interaction and workflow with R. R’s usability has greatly improved over the past few years, mainly thanks to RStudio.

1.2.1 Downloading RStudio

To download RStudio, go to RStudio.com. Scroll down until you see the RStudio Desktop free open source license download option. Note that for RStudio to run you must have already downloaded and installed R.

2 Working with R and RStudio Basics

RStudio is highly customizable, but the basic interface looks roughly like the figures below.

Once you have opened RStudio, start a new script (the place where your code will go) by clicking the highlighted icon in the image below and selecting R Script. Or simply use the keyboard shortcut Ctrl+Shift+N.

If you have successfully opened RStudio and created a new R script, your screen should now look like the image below.

2.1 RStudio Layout

After you open your new R script, the RStudio window consists of four (4) panes, each with a different function.

  1. R Script pane: this is where you’ll write your main code.

  2. The console pane: this is where you’ll see the outputs (except for plots)

  3. The environment, history, and connections pane: this is where you’ll see the values of the variables in memory, the R code history, and any open connections to outside databases.

  4. The files, plots, packages, help, and viewer pane: in the Files tab, you’ll generally see the files inside your working directory. In the Plots tab, you’ll see your output plots. In the Help tab, you’ll see descriptions of R functions; see the section below to learn more about the help options.

2.2 Get Help

The Help tab in the bottom-right pane is very helpful when you are starting to use R and RStudio. From this tab you can get anything from general information about R to very specific information about a single function. Below are some helpful tips to get the most out of the Help tab.

  • Tip 1: Get cheat sheets for different functions in R.
  • Tip 2: To get help on any function, either (1) type the function name in the search bar or (2) run ?function_name in your console (pane 2).
  • Tip 3: After you have searched for or run the command to get help and you are on the help page, scroll down to the bottom for examples. If you wish to run an example, select the lines and hit Run (the Run button is located in the top-right corner of the R Script pane, pane 1) or simply press Ctrl+Enter. The example output will show in your console (pane 2).

3 Programming with R

This introduction to R will show you the basic tools to explore and analyze your data. To follow along with the examples below, download and save this file.

Recommendation: create a new folder on your desktop, name it intro2R, and save the file there.

3.1 Set Working Directory

The working directory is where your code and data files will be stored. It is recommended to always set your working directory at the beginning of your script. Below are three different ways to set your working directory.

  1. To set the working directory using the base R function setwd().

To run a command in R, you can press the Run button located in the top-right corner of the script pane or press Ctrl+Enter.

Tip: If you do not recall the entire path to where your file is stored, you can press Tab for a list of possible directories. To do this, write setwd("") in your script, place your cursor between the quotes, and press Tab.

  2. To set your working directory using the menu bar. Click on Session in the menu bar, then click on Set Working Directory, then click Choose Directory… and select the folder where your data and script are, or will be, saved.
  • After you select your working directory, a command with the path to your folder will appear in your console. Copy this line (without the blue prompt > at the beginning) and paste it into your script. That way, the next time you open your script you just have to run the line.
  3. To set the working directory using the Files tab (located in the bottom-right pane). First click on the ... located in the top-right corner of the Files tab and select the folder you wish to be your working directory. Then, to set the selected path as your working directory, click on the gear icon More and select the option Set As Working Directory. The path to your working directory will show in your console.
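The first option, setwd(), can be sketched as follows. This is a minimal example, assuming the intro2R folder recommended earlier lives on your desktop; the path is hypothetical and should be replaced with the location on your own computer.

```r
# Assumption: "~/Desktop/intro2R" is a hypothetical location for the folder;
# replace it with the path on your own computer.
project_dir <- path.expand(file.path("~", "Desktop", "intro2R"))

# Only change directory if the folder exists, so the script does not stop
# with an error on a machine where the path is different.
if (dir.exists(project_dir)) {
  setwd(project_dir)
}

# Print the current working directory to confirm where R is looking
getwd()
```

Running getwd() after setwd() is a quick way to confirm that the change took effect.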

3.2 Read Data into R

One of the ways to import data into R is to use text files. Our data is contained in a CSV (comma-separated values) file, which uses commas to separate the different elements in a line. The data resembles an Excel worksheet. To read the file into R we use the R function read.csv().

The expression read.csv() is a function call that asks R to run the function read.csv. The values inside the parentheses are called arguments.
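A call to read.csv() might look like the sketch below. The name of the tutorial's data file is not shown here, so this example writes a small temporary CSV file containing only the first six rows that appear later in the head() output, and then reads it back. The variable names csv_lines and sample_file are hypothetical.

```r
# Assumption: the real file name is unknown, so we build a small stand-in
# file from the first six rows shown later in this tutorial.
csv_lines <- c(
  "MoNTH,weight,species,RESUTLT",
  "January,26.5,S,Positive",
  "February,13.0,R,Positive",
  "February,8.0,R,Negative",
  "March,18.5,S,Negative",
  "March,10.5,R,Positive",
  "April,6.5,S,Negative"
)
sample_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, sample_file)

# file and header are arguments of read.csv(); header = TRUE tells R that
# the first line contains the column names.
read.csv(file = sample_file, header = TRUE)
```

With your own data you would pass the path to the downloaded file instead of sample_file.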

3.3 Variables

Since we didn’t tell R to do anything else with the function’s output, the console will display the full contents of the file. read.csv() reads the file, but we can’t use the data unless we assign it to a variable.

3.3.1 Setting variables

To assign a value to a variable we use the assignment operator <-.

Notice that when we assign a value, in this case the number 5, to a variable name, R does not print the value in the console. After running this line, the value is stored in our global environment as x with a value of 5. We can access that value by simply using the new variable name.

## [1] 5

We can also treat our variable as a regular number, for example:

## [1] 0
## [1] 10
## [1] 9
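The steps above can be sketched as follows. The assignment x <- 5 comes from the text; the three arithmetic expressions are assumptions, chosen only because they reproduce the outputs 0, 10, and 9 shown above.

```r
# Assign the value 5 to the variable x; nothing is printed
x <- 5

# Typing the variable name prints its value
x        # [1] 5

# Assumption: the original expressions are not shown; these three are
# illustrative and happen to give the same outputs as above.
x - 5    # [1] 0
x * 2    # [1] 10
x + 4    # [1] 9
```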

Tip: We can add comments to our code using the # character. It is useful to document our code in this way so that others (and us the next time we read it) have an easier time following what the code is doing.

3.4 Working with Data Frames

Going back to our data, in order to make use of it we need to store it in a variable. In this case we will call our variable “data”.

3.4.1 Addressing data

In the case of a data frame, there are some important things to be on the lookout for:

  • Number of columns
  • Number of rows
  • Spelling of things
  • Type of data within each column
  • Basic formats

Here are some of the most common ways to find important information about our data:

  1. str(): displays the internal structure of a data frame
## 'data.frame':    50 obs. of  4 variables:
##  $ MoNTH  : chr  "January" "February" "February" "March" ...
##  $ weight : num  26.5 13 8 18.5 10.5 6.5 1 5 26 0.5 ...
##  $ species: chr  "S" "R" "R" "S" ...
##  $ RESUTLT: chr  "Positive" "Positive" "Negative" "Negative" ...
  2. head(): displays the first few rows of your data frame
##      MoNTH weight species  RESUTLT
## 1  January   26.5       S Positive
## 2 February   13.0       R Positive
## 3 February    8.0       R Negative
## 4    March   18.5       S Negative
## 5    March   10.5       R Positive
## 6    April    6.5       S Negative
  3. dim(): returns the dimensions of your data frame, where the first value corresponds to the number of rows and the second to the number of columns
## [1] 50  4
  4. nrow(): returns the number of rows in the data frame
## [1] 50
  5. ncol(): returns the number of columns in the data frame
## [1] 4
  6. summary(): displays a statistical summary of each of the columns in the data frame
##     MoNTH               weight        species            RESUTLT
##  Length:50          Min.   : 0.50   Length:50          Length:50
##  Class :character   1st Qu.: 5.50   Class :character   Class :character
##  Mode  :character   Median : 9.50   Mode  :character   Mode  :character
##                     Mean   :11.67
##                     3rd Qu.:16.50
##                     Max.   :39.50
  7. typeof(): returns the data type of an object
## [1] "double"
## [1] "character"
  8. class(): returns the class of an R object
## [1] "data.frame"
  9. rownames(): returns the names of the rows in the data frame
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
## [46] "46" "47" "48" "49" "50"
  10. colnames() and names(): return the names of the columns in the data frame
## [1] "MoNTH"   "weight"  "species" "RESUTLT"
## [1] "MoNTH"   "weight"  "species" "RESUTLT"
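The calls that produce output like the above might look like the sketch below. To keep the example self-contained, it assumes a small data frame built from only the first six rows shown by head(), so dim() and nrow() report 6 rows here rather than the 50 in the full file. Which columns were passed to typeof() is also an assumption.

```r
# Assumption: a six-row stand-in for the full 50-row data set
data <- data.frame(
  MoNTH   = c("January", "February", "February", "March", "March", "April"),
  weight  = c(26.5, 13.0, 8.0, 18.5, 10.5, 6.5),
  species = c("S", "R", "R", "S", "R", "S"),
  RESUTLT = c("Positive", "Positive", "Negative", "Negative", "Positive", "Negative"),
  stringsAsFactors = FALSE
)

str(data)             # structure: column names, types, first values
head(data)            # first few rows
dim(data)             # number of rows, then number of columns
nrow(data)            # number of rows
ncol(data)            # number of columns
summary(data)         # statistical summary of each column
typeof(data$weight)   # "double"
typeof(data$MoNTH)    # "character"
class(data)           # "data.frame"
rownames(data)        # row names
colnames(data)        # column names
names(data)           # column names (same result for a data frame)
```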

All these are useful commands to help clean and understand the structure of your data. For example, using the last functions, colnames() and names(), we can rename our columns.

Again, you can check the current column names by using the function names(data_frame_name).

## [1] "MoNTH"   "weight"  "species" "RESUTLT"

To make changes to the column names, use the function colnames() and specify the new names using c("name1", "name2", ...).

The c() function in R combines its arguments into a vector (a one-dimensional collection of items), for example a vector of names.

## [1] "Month"      "Weight"     "Species"    "Diagnostic"

Now assign the new names to the columns, just as if you were assigning a value to a variable name.

Double check that you have correctly named your columns by rerunning or typing the function names( ).

## [1] "Month"      "Weight"     "Species"    "Diagnostic"
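The renaming steps can be sketched as follows. The example assumes a two-row stand-in for the data frame, and new_names is a hypothetical variable name used only for illustration.

```r
# Assumption: a two-row stand-in with the original, misspelled column names
data <- data.frame(
  MoNTH   = c("January", "February"),
  weight  = c(26.5, 13.0),
  species = c("S", "R"),
  RESUTLT = c("Positive", "Positive"),
  stringsAsFactors = FALSE
)

names(data)   # check the current column names

# Build a character vector with the new names using c()
new_names <- c("Month", "Weight", "Species", "Diagnostic")

# Assign the new names to the columns
colnames(data) <- new_names

names(data)   # "Month" "Weight" "Species" "Diagnostic"
```

The vector of new names must have one entry per column, in the same order as the columns.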

3.4.2 Addressing and Subsetting Data

There are three main ways for addressing data inside R objects.

1. By index (subsetting)
2. By name
3. By logical vector

3.4.2.1 Addressing by Index

We can address our data in the data frame by indexes. If we want to get a single value from the data frame, we can provide an index in square brackets [ ]. The first number specifies the row and the second the column:

## [1] "January"

To select all columns but only a single row, you can leave the space after the comma blank. A blank space is the same as saying “include all”.

##     Month Weight Species Diagnostic
## 1 January   26.5       S   Positive

And vice versa, to select all rows but only a single column, leave the space before the comma blank:

##  [1] "January"   "February"  "February"  "March"     "March"     "April"    
##  [7] "April"     "April"     "May"       "May"       "May"       "May"      
## [13] "June"      "June"      "June"      "June"      "June"      "July"     
## [19] "July"      "July"      "July"      "July"      "July"      "July"     
## [25] "July"      "August"    "August"    "August"    "August"    "August"   
## [31] "August"    "August"    "August"    "August"    "August"    "September"
## [37] "September" "September" "September" "September" "September" "September"
## [43] "October"   "October"   "October"   "October"   "November"  "November" 
## [49] "November"  "December"
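The three index forms described above can be sketched as follows, assuming a six-row stand-in for the renamed data frame (so the last call returns 6 values rather than 50).

```r
# Assumption: the first six rows of the renamed data frame
data <- data.frame(
  Month      = c("January", "February", "February", "March", "March", "April"),
  Weight     = c(26.5, 13.0, 8.0, 18.5, 10.5, 6.5),
  Species    = c("S", "R", "R", "S", "R", "S"),
  Diagnostic = c("Positive", "Positive", "Negative", "Negative", "Positive", "Negative"),
  stringsAsFactors = FALSE
)

data[1, 1]   # row 1, column 1: "January"
data[1, ]    # row 1, all columns
data[, 1]    # all rows, column 1 (returned as a vector)
```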

Now, to select more than one row or column at a time you can use either the : operator or the c() function. Use : when you want to select consecutive rows or columns; this operator generates a sequence of numbers. Use c() when you want to select non-consecutive rows or columns. Note that you can also use c() for consecutive selections.

## [1] 1 2 3 4 5
##      Month Weight Species Diagnostic
## 1  January   26.5       S   Positive
## 2 February   13.0       R   Positive
## 3 February    8.0       R   Negative
## 4    March   18.5       S   Negative
## 5    March   10.5       R   Positive
##      Month Diagnostic
## 1  January   Positive
## 3 February   Negative
##      Month Weight Species
## 1  January   26.5       S
## 3 February    8.0       R
## 5    March   10.5       R

Another function that can help generate sequences of numbers is seq(). This function takes three parameters: from where, to where, and by what increment.

## [1] 1 2 3 4 5

If we want the first 10 rows but skipping every other row, we can use seq() as follows:

##      Month Weight Species Diagnostic
## 1  January   26.5       S   Positive
## 3 February    8.0       R   Negative
## 5    March   10.5       R   Positive
## 7    April    1.0       R   Negative
## 9      May   26.0       R   Negative
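Selections with :, c(), and seq() might be written as in the sketch below. It assumes a stand-in data frame holding the first ten rows of the data, with values taken from the outputs shown in this tutorial. The exact calls are a plausible reconstruction, not necessarily the original ones.

```r
# Assumption: the first ten rows of the renamed data frame
data <- data.frame(
  Month      = c("January", "February", "February", "March", "March",
                 "April", "April", "April", "May", "May"),
  Weight     = c(26.5, 13.0, 8.0, 18.5, 10.5, 6.5, 1.0, 5.0, 26.0, 0.5),
  Species    = c("S", "R", "R", "S", "R", "S", "R", "R", "R", "R"),
  Diagnostic = c("Positive", "Positive", "Negative", "Negative", "Positive",
                 "Negative", "Negative", "Negative", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

1:5                       # sequence 1 2 3 4 5
data[1:5, ]               # rows 1 to 5, all columns
data[c(1, 3), c(1, 4)]    # rows 1 and 3, columns Month and Diagnostic
data[c(1, 3, 5), 1:3]     # rows 1, 3 and 5, first three columns

seq(1, 5, 1)              # from 1 to 5 in steps of 1
data[seq(1, 10, 2), ]     # rows 1, 3, 5, 7 and 9
```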

3.4.2.2 Addressing by Name

We can also address our data in the data frame by name. If we know the name of a row or column, we can get a single value or multiple values from the data frame by providing the row name or column name inside square brackets [ , ], where the first entry specifies the row and the second the column, or with the help of the $ operator.

Remember: to get the row names of the data frame you can use the function rownames(), and to get the column names you can use colnames().

For the examples we will only use column names, since we know that our row names are just numbers.

## [1] "Month"      "Weight"     "Species"    "Diagnostic"

If we want all the values of a single column, we can use the $ operator.

##  [1] 26.5 13.0  8.0 18.5 10.5  6.5  1.0  5.0 26.0  0.5 14.5  9.5  4.0 16.5  6.0
## [16]  3.5 17.0  5.5 13.0  9.5 16.5 39.5 17.0  6.0  9.0 18.0  5.5  1.0  3.0  8.5
## [31]  7.5  4.0 18.5 10.0 12.0  2.5 10.5 20.0 10.0 31.0 23.5  2.5  7.5  5.0 12.5
## [46] 34.0  8.0  5.5 12.5  8.0

If we only want a few rows of a single column, we can combine the $ operator with brackets []. Using $ is equivalent to specifying the column in the second position of the brackets, so now we only need to tell R which row or rows we would like. That is, only one number, sequence, or vector is required inside the [], and there is no comma.

## [1] 26.5 13.0  8.0

If we want all the values for multiple columns, we can select them by listing their names with c().

##    Weight Diagnostic
## 1    26.5   Positive
## 2    13.0   Positive
## 3     8.0   Negative
## 4    18.5   Negative
## 5    10.5   Positive
## 6     6.5   Negative
## 7     1.0   Negative
## 8     5.0   Negative
## 9    26.0   Negative
## 10    0.5   Negative
## 11   14.5   Negative
## 12    9.5   Positive
## 13    4.0   Positive
## 14   16.5   Positive
## 15    6.0   Negative
## 16    3.5   Positive
## 17   17.0   Negative
## 18    5.5   Negative
## 19   13.0   Positive
## 20    9.5   Positive
## 21   16.5   Positive
## 22   39.5   Negative
## 23   17.0   Negative
## 24    6.0   Negative
## 25    9.0   Negative
## 26   18.0   Positive
## 27    5.5   Positive
## 28    1.0   Positive
## 29    3.0   Negative
## 30    8.5   Positive
## 31    7.5   Positive
## 32    4.0   Positive
## 33   18.5   Negative
## 34   10.0   Positive
## 35   12.0   Negative
## 36    2.5   Negative
## 37   10.5   Negative
## 38   20.0   Positive
## 39   10.0   Positive
## 40   31.0   Negative
## 41   23.5   Negative
## 42    2.5   Negative
## 43    7.5   Positive
## 44    5.0   Positive
## 45   12.5   Positive
## 46   34.0   Negative
## 47    8.0   Positive
## 48    5.5   Positive
## 49   12.5   Positive
## 50    8.0   Positive

And we can also specify the rows we would like, like this:

##    Weight Diagnostic
## 6     6.5   Negative
## 7     1.0   Negative
## 8     5.0   Negative
## 9    26.0   Negative
## 10    0.5   Negative
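Addressing by name can be sketched as follows. The example assumes a stand-in data frame with only the first ten rows, so the full-column calls return 10 values instead of 50.

```r
# Assumption: the first ten rows of the renamed data frame
data <- data.frame(
  Month      = c("January", "February", "February", "March", "March",
                 "April", "April", "April", "May", "May"),
  Weight     = c(26.5, 13.0, 8.0, 18.5, 10.5, 6.5, 1.0, 5.0, 26.0, 0.5),
  Species    = c("S", "R", "R", "S", "R", "S", "R", "R", "R", "R"),
  Diagnostic = c("Positive", "Positive", "Negative", "Negative", "Positive",
                 "Negative", "Negative", "Negative", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

colnames(data)                           # names available for addressing
data$Weight                              # all values of one column
data$Weight[1:3]                         # first three values: 26.5 13.0 8.0
data[, c("Weight", "Diagnostic")]        # all rows of two named columns
data[6:10, c("Weight", "Diagnostic")]    # rows 6 to 10 of the same columns
```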

3.4.2.3 Addressing by logical vector

Logical vectors can be created using relational operators, e.g. <, >, ==, !=, and the matching operator %in%. Below are some examples of how to use each of them.

To highlight the use of the different operators, we will use the following vector of random numbers.

Using <, >.

## [1] FALSE FALSE FALSE FALSE FALSE
## [1]  TRUE  TRUE  TRUE FALSE FALSE

Note that the returned values for both are only TRUE or FALSE; this is the answer to the comparison. R compares every single element in the vector.

Using ==, !=.

## [1] FALSE  TRUE FALSE FALSE FALSE
## [1]  TRUE FALSE  TRUE  TRUE  TRUE

Using %in%. The %in% operator allows you to check whether values are contained in a list or vector of objects.

## [1] TRUE TRUE TRUE
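The operators above can be tried on any numeric vector. The sketch below uses a hypothetical vector, numbers, because the original vector of random numbers is not shown; the values and thresholds were picked only so that the TRUE/FALSE patterns match the outputs above.

```r
# Assumption: a hypothetical vector, not the one used originally
numbers <- c(8, 5, 7, 2, 1)

numbers < 1     # FALSE FALSE FALSE FALSE FALSE
numbers > 4     #  TRUE  TRUE  TRUE FALSE FALSE

numbers == 5    # FALSE  TRUE FALSE FALSE FALSE
numbers != 5    #  TRUE FALSE  TRUE  TRUE  TRUE

# %in% checks, for each value on the left, whether it appears on the right
c(8, 5, 7) %in% numbers   # TRUE TRUE TRUE
```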

Now let’s use the operators to subset our data frame, combining a couple of the things we have covered so far.

In case we don’t remember what our data looked like, we can check the structure and head:

## 'data.frame':    50 obs. of  4 variables:
##  $ Month     : chr  "January" "February" "February" "March" ...
##  $ Weight    : num  26.5 13 8 18.5 10.5 6.5 1 5 26 0.5 ...
##  $ Species   : chr  "S" "R" "R" "S" ...
##  $ Diagnostic: chr  "Positive" "Positive" "Negative" "Negative" ...
##      Month Weight Species Diagnostic
## 1  January   26.5       S   Positive
## 2 February   13.0       R   Positive
## 3 February    8.0       R   Negative
## 4    March   18.5       S   Negative
## 5    March   10.5       R   Positive
## 6    April    6.5       S   Negative

If we want to see where some specific value is in a given column, we would do something like this:

##  [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [13]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [25] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
## [37] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [49]  TRUE  TRUE

Now if we save this into a variable, we can use it to subset our data:

##        Month Weight Species Diagnostic
## 1    January   26.5       S   Positive
## 2   February   13.0       R   Positive
## 5      March   10.5       R   Positive
## 12       May    9.5       S   Positive
## 13      June    4.0       R   Positive
## 14      June   16.5       S   Positive
## 16      June    3.5       S   Positive
## 19      July   13.0       S   Positive
## 20      July    9.5       R   Positive
## 21      July   16.5       R   Positive
## 26    August   18.0       R   Positive
## 27    August    5.5       R   Positive
## 28    August    1.0       R   Positive
## 30    August    8.5       R   Positive
## 31    August    7.5       S   Positive
## 32    August    4.0       R   Positive
## 34    August   10.0       R   Positive
## 38 September   20.0       S   Positive
## 39 September   10.0       R   Positive
## 43   October    7.5       S   Positive
## 44   October    5.0       R   Positive
## 45   October   12.5       S   Positive
## 47  November    8.0       R   Positive
## 48  November    5.5       S   Positive
## 49  November   12.5       R   Positive
## 50  December    8.0       S   Positive
##     Month Weight Species Diagnostic
## 1 January   26.5       S   Positive
##        Month Weight Species Diagnostic
## 3   February    8.0       R   Negative
## 6      April    6.5       S   Negative
## 7      April    1.0       R   Negative
## 8      April    5.0       R   Negative
## 10       May    0.5       R   Negative
## 12       May    9.5       S   Positive
## 13      June    4.0       R   Positive
## 15      June    6.0       R   Negative
## 16      June    3.5       S   Positive
## 18      July    5.5       R   Negative
## 20      July    9.5       R   Positive
## 24      July    6.0       S   Negative
## 25      July    9.0       S   Negative
## 27    August    5.5       R   Positive
## 28    August    1.0       R   Positive
## 29    August    3.0       S   Negative
## 30    August    8.5       R   Positive
## 31    August    7.5       S   Positive
## 32    August    4.0       R   Positive
## 36 September    2.5       R   Negative
## 42 September    2.5       S   Negative
## 43   October    7.5       S   Positive
## 44   October    5.0       R   Positive
## 47  November    8.0       R   Positive
## 48  November    5.5       S   Positive
## 50  December    8.0       S   Positive

And we can also specify what column we want to see

##  [1] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
##  [7] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [13] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [19] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [25] "Positive" "Positive"
##  [1] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
##  [7] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [13] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [19] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [25] "Positive" "Positive"
##  [1] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
##  [7] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [13] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [19] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
## [25] "Positive" "Positive"
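Subsetting with a logical vector might look like the sketch below. It assumes a six-row stand-in for the data frame, so it returns fewer rows than the outputs above. The variable name positive and the exact conditions are illustrative reconstructions based on those outputs.

```r
# Assumption: the first six rows of the renamed data frame
data <- data.frame(
  Month      = c("January", "February", "February", "March", "March", "April"),
  Weight     = c(26.5, 13.0, 8.0, 18.5, 10.5, 6.5),
  Species    = c("S", "R", "R", "S", "R", "S"),
  Diagnostic = c("Positive", "Positive", "Negative", "Negative", "Positive", "Negative"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row: is the diagnostic positive?
positive <- data$Diagnostic == "Positive"
positive

data[positive, ]                   # keep only the rows where positive is TRUE
data[data$Month == "January", ]    # the condition can be written inline
data[data$Weight < 10, ]           # rows with a weight below 10

# Three equivalent ways to keep a single column of the selected rows
data[positive, "Diagnostic"]
data[positive, 4]
data$Diagnostic[positive]
```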

3.5 Using simple statistics

R also has functions for common calculations. To highlight the use of the different functions, we will use the following vector of random numbers.

  1. min() returns the minimum of the R object
## [1] 1
  2. max() returns the maximum of the R object
## [1] 88
  3. mean() returns the mean of the R object
## [1] 20.5
  4. sd() returns the standard deviation of the R object
## [1] 24.78078
  5. range() returns the range of the values contained in the R object. The first value corresponds to the minimum and the second to the maximum.
## [1]  1 88
  6. median() returns the median of the values in the R object
## [1] 9

Remember, you can also get the statistical summary using the function summary().

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.75    9.00   20.50   26.75   88.00
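The functions above can be applied to any numeric vector. The original vector is not shown, so the sketch below uses a hypothetical vector, vals, and its results differ from the outputs above.

```r
# Assumption: a hypothetical vector, not the one used originally
vals <- c(2, 4, 4, 4, 5, 5, 7, 9)

min(vals)      # 2
max(vals)      # 9
mean(vals)     # 5
sd(vals)       # sample standard deviation
range(vals)    # 2 9
median(vals)   # 4.5
summary(vals)  # minimum, quartiles, median, mean and maximum in one call
```

Note that sd() computes the sample standard deviation (dividing by n - 1).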

3.6 User Defined Functions

You can create your own user-defined functions in R. This is most useful when you want to run a repetitive task throughout your code.

3.6.1 Syntax of a function

The basic syntax of a function is as follows:

We use the keyword function to declare a function in R (RStudio shows it in a different color). The statements within the curly braces {} constitute the body of the function. Finally, to save our new function we must assign it to a variable, i.e., give the function a name, function_name.
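The syntax described above can be sketched as follows. The names argument_1, argument_2, and result are placeholders chosen for illustration, and the body (adding the two arguments) is only an example.

```r
# function_name is the variable that stores the function
function_name <- function(argument_1, argument_2) {
  # Body of the function: statements that use the arguments
  result <- argument_1 + argument_2
  # Send the result back to the caller
  return(result)
}

# Call the function by name, passing values for the arguments
function_name(1, 2)   # 3
```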

3.6.2 One argument function

Let’s create a function that will allow us to convert from feet to miles, and let’s call the function feet_to_miles.

Our function feet_to_miles takes one argument and converts it to miles and prints the result with an informative message.

Now we can use our function as follows:

## [1] "5200 feet are 0.984848484848485 miles"
## [1] "10000 feet are 1.89393939393939 miles"

Instead of using a print statement in our function, we can send the result back using the return() statement. Using return() allows us to save the answer in a variable.

## [1] 0.9848485
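Both versions of the function might look like the sketch below. It uses the conversion 1 mile = 5280 feet, which matches the outputs above. The second function is named feet_to_miles_value here only so that both versions can coexist in one example, and the variable miles is likewise hypothetical.

```r
# Version 1: print the result with an informative message
feet_to_miles <- function(feet) {
  miles <- feet / 5280
  print(paste(feet, "feet are", miles, "miles"))
}

feet_to_miles(5200)
feet_to_miles(10000)

# Version 2: return the result so it can be stored in a variable
feet_to_miles_value <- function(feet) {
  miles <- feet / 5280
  return(miles)
}

miles <- feet_to_miles_value(5200)
miles
```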

3.6.3 Two argument function

Let’s create a function that calculates the travel time based on speed and distance to be traveled (note: time = distance / speed).

We can call the above function as follows:

## [1] 2

The arguments follow a positional order, meaning that if labels are not added R will assume you have written them in the correct position.

## [1] 2
## [1] 0.5

And if only one is labeled, R will assume the other one corresponds to the missing label. In this case the order is not important.

## [1] 2
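A two-argument function and the different ways of calling it can be sketched as follows. The function name travel_time, the argument order (distance first, then speed), and the values 100 and 50 are assumptions chosen to reproduce the outputs 2 and 0.5 above.

```r
# Assumption: distance is the first argument and speed the second
travel_time <- function(distance, speed) {
  time <- distance / speed
  return(time)
}

# Both arguments labeled: order does not matter
travel_time(distance = 100, speed = 50)   # 2

# No labels: R matches the values by position
travel_time(100, 50)                      # 2
travel_time(50, 100)                      # 0.5 (the values are swapped)

# One label: the unlabeled value goes to the remaining argument
travel_time(speed = 50, 100)              # 2
```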

3.6.4 Nested functions

You can call multiple functions within a function. Following our previous examples, let’s say we have a distance in feet but need it in miles. We can nest our functions to convert the units and return the correct travel time.

## [1] 10

In this case, nesting the functions may not be the best solution; it may be better to convert the units before sending the value to the function.

## [1] 10

Regardless, in both cases we obtain the same result.
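The two approaches can be compared in the sketch below. The helper name travel_time_feet and the input values (264000 feet at 5 miles per hour) are hypothetical, chosen so that the result is 10 as in the outputs above.

```r
# Helper functions, redefined here so the example is self-contained
feet_to_miles <- function(feet) {
  return(feet / 5280)
}

travel_time <- function(distance, speed) {
  return(distance / speed)
}

# Approach 1: nest the conversion inside a new function
travel_time_feet <- function(distance_feet, speed) {
  distance_miles <- feet_to_miles(distance_feet)
  return(travel_time(distance_miles, speed))
}
travel_time_feet(264000, 5)                  # 264000 ft = 50 miles; 50 / 5 = 10

# Approach 2: convert the units first, then call the original function
distance_miles <- feet_to_miles(264000)
travel_time(distance_miles, 5)               # 10
```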

3.7 Conditionals and if-else Statements

To make more complex decisions, we need to write code that automatically decides between multiple options. The computer can make these decisions through logical comparisons, using if and else statements.

The basic syntax of a general conditional statement looks something like this:

And reads: if the condition is met, do this (1); if not, do this (2).

For example, let’s say we want to check whether a value is greater than some boundary. We would do this as follows:

## [1] "not greater"
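An if-else statement of this form can be sketched as follows. The variable names value and boundary and their values are hypothetical, and the message "greater" for the first branch is an assumption; only "not greater" appears in the output above.

```r
# General form:
#   if (condition) {
#     do this (1)
#   } else {
#     do this (2)
#   }

# Assumption: illustrative values
value <- 3
boundary <- 10

if (value > boundary) {
  print("greater")
} else {
  print("not greater")
}
```

With these values the condition is FALSE, so the else branch runs and "not greater" is printed.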

We can also combine tests with two operators: two ampersands, &&, symbolize “and”; two vertical bars, ||, symbolize “or”.

Note that && is only true if both parts are true, while || is true if either part is true:

## [1] "at least one part is not true"
## [1] "at least one part is true"
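The behavior of && and || can be seen in the sketch below. The two comparisons (1 > 0 is TRUE, -1 > 0 is FALSE) and the messages in the branches that do not run are assumptions for illustration.

```r
# && is TRUE only if both parts are TRUE
if (1 > 0 && -1 > 0) {
  print("both parts are true")
} else {
  print("at least one part is not true")
}

# || is TRUE if either part is TRUE
if (1 > 0 || -1 > 0) {
  print("at least one part is true")
} else {
  print("neither part is true")
}
```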

So now if we use our data frame and combine everything so far, we can do something like:

##    Month Weight Species Diagnostic
## 13  June    4.0       R   Positive
## 14  June   16.5       S   Positive
## 16  June    3.5       S   Positive

Note here the use of the single &. The & and | operators are designed to work on vectors, comparing them element by element.
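Combining two conditions with & to subset a data frame might look like the sketch below. It assumes a six-row stand-in for the data frame, which contains no rows for June, so the example filters on February instead.

```r
# Assumption: the first six rows of the renamed data frame
data <- data.frame(
  Month      = c("January", "February", "February", "March", "March", "April"),
  Weight     = c(26.5, 13.0, 8.0, 18.5, 10.5, 6.5),
  Species    = c("S", "R", "R", "S", "R", "S"),
  Diagnostic = c("Positive", "Positive", "Negative", "Negative", "Positive", "Negative"),
  stringsAsFactors = FALSE
)

# & compares the two logical vectors element by element
both <- data$Month == "February" & data$Diagnostic == "Positive"
both

# Keep only the rows where both conditions are TRUE
data[both, ]
```

Here both is a hypothetical variable name; the condition could also be written directly inside the brackets.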

3.8 Switch Statement

Coming Soon…

3.9 Loops

Coming Soon…

3.9.1 For Loops

Coming Soon…

3.9.2 While Loops

Coming Soon…

3.9.3 Apply Base R Loops

Coming Soon…

4 More Learning Resources


The structure of this introduction to R is loosely based on the Software Carpentry (c) Workshop Programming with R. For more information about Software Carpentry visit their website at https://software-carpentry.org/.