Introduction to R

Peter Caya
August 20, 2016

Introduction

John Chambers on S, the precursor to R:

“[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.”

This philosophy is evident in R:

The software is free.
Massive user community supporting open source philosophy.
Use of command prompt interpreter to test code.
Environment of packages to extend the language.
Powerful graphics capabilities.

What is R Good for?

Getting and cleaning data.
Exploratory data analysis.
Statistical/probabilistic modeling.
Graphics production.
Report production.

(This presentation was put together using R!)

What is R Not Good For?

High performance computing.
Certain specialized applications.

R can handle big data but holds data in memory. This limits what the user can do without workarounds. Expensive operations are often passed to compiled code such as C++.

Some Important R Qualities

Large ecosystem of packages to augment R functionality.
Command line is used to investigate data.
Vectorization heavily favored over use of loops.

Goals of This Presentation:

Give a functional understanding and demonstration of how R is used.
Introduce the variable types used in R.
Introduce commonly used functions which allow R users to:
- Manipulate objects in the R environment.
- Clean data.
- Import data sets.
- Generate graphics

If there is time:

Introduce the plyr and dplyr packages, which greatly ease data analysis.
Introduce R Markdown, which allows users to generate interactive documents incorporating R code from within RStudio.

R Basics

R as a Calculator

Command-line computing in R is fairly straightforward:

2+2

[1] 4

2^2

[1] 4

2*2

[1] 4

Scientific notation can be entered using “e” in numbers:

2e10

[1] 2e+10

R Basics

Basic Variable Types

Numeric:

class(2)

[1] "numeric"

Integer - behaves like a numeric but holds only whole numbers. Note that as.integer() truncates toward zero rather than rounding:

as.integer(2.6)

[1] 2
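
The difference between truncating and rounding can be checked directly. The following is a small sketch using only base R; the variable names are illustrative and not part of the original slides.

```r
# as.integer() drops the fractional part (truncation toward zero).
truncated_pos <- as.integer(2.6)    # 2
truncated_neg <- as.integer(-2.6)   # -2

# round() goes to the nearest whole number but still returns a numeric.
rounded <- round(2.6)               # 3
class(rounded)                      # "numeric"

# The L suffix creates an integer value directly.
class(2L)                           # "integer"
```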

Complex numbers:

z = 1 + 2i
z

[1] 1+2i

Logical - TRUE or FALSE

x <- 1
z <- 1
test <- x == z
class(test)

[1] "logical"

R Basics

Working with Logical Values

If statements in R work similarly to the equivalent in other languages.

Logical evaluations are performed with the following operators:

Equal (when testing for equivalence): ==
Not equal: !=
Greater than: >
Less than: <
Greater than or equal to: >=
Less than or equal to: <=

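OR and AND operators come in two varieties:

Element-wise OR over entire vectors: |.
OR for single (length-one) values, with short-circuit evaluation: ||.
Element-wise AND over entire vectors: &.
AND for single (length-one) values, with short-circuit evaluation: &&.

Older versions of R let || and && silently use only the first element of a longer vector; since R 4.3.0 this is an error.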
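
The two varieties can be compared side by side. This is a short sketch with made-up vectors (a, b and x are illustrative names), not an example from the original slides.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# & and | work element by element and return a vector.
elementwise_and <- a & b   # TRUE FALSE FALSE
elementwise_or  <- a | b   # TRUE TRUE TRUE

# && and || are meant for single values, e.g. inside if().
x <- 5
in_range <- (x > 0) && (x < 10)   # TRUE

# any() and all() collapse a logical vector to a single value.
any_both <- any(a & b)     # TRUE
all_either <- all(a | b)   # TRUE
```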

R Basics

Basic Variable Types

Character:

class("abc")

[1] "character"

The class() function is useful for determining the type of a variable. There are other data types, but these are the ones involved in most of the programming you will be doing.

R Basics

Variable Coercion

Variable types can be coerced by using an operator or a coercion function.

Coercion by using an operator:

TRUE*1

[1] 1

R Basics

Variable Coercion Using a Function

z

[1] 1

class(z)

[1] "numeric"

as.integer(z)

[1] 1
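
A few more of the as.*() coercion functions in base R, as a brief sketch. The variable names are illustrative; the last example shows what happens when a value cannot be converted.

```r
num_from_text <- as.numeric("3.14")   # 3.14
text_from_num <- as.character(42)     # "42"
logical_from_int <- as.logical(0:2)   # FALSE TRUE TRUE (zero is FALSE, non-zero is TRUE)
int_from_logical <- as.integer(TRUE)  # 1

# Text that is not a number becomes NA (R also issues a warning,
# which is suppressed here to keep the output clean).
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)                   # TRUE
```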

R Basics

Vectors and Matrices

R is meant to be used in a way where operations are performed on entire vectors or matrices. These are composed of the basic data types discussed earlier.

Some basic ways to generate a vector:

vec1 <- c(1, 2, 3)
vec2 <- seq(from = 1, to = 3, by = 1)
vec3 <- rep(1, 10)

vec1

[1] 1 2 3

vec2

[1] 1 2 3

vec3

 [1] 1 1 1 1 1 1 1 1 1 1

R Basics

Operations on Vectors

Vectors can also be added, subtracted, multiplied and divided element by element:

vec1+vec2

[1] 2 4 6

vec1*vec2

[1] 1 4 9

R Basics

Vectors and Matrices

We can also perform more conventional matrix-algebra style operations using vectors. To do this we need to use the following:

t() - transpose the vector.
%*% - vector/matrix multiplication.

t(vec1)

     [,1] [,2] [,3]
[1,]    1    2    3

Now, find $V_1^T V_1$, where $V_1$ is vec1:

t(vec1) %*% vec1

     [,1]
[1,]   14

R Basics

Operations on Vectors and Matrices

Matrix creation is somewhat more complicated than vector creation. A 3 x 3 matrix of all 1's:

mat1 <- matrix(rep(1, 3 * 3), nrow = 3, ncol = 3)

A subset of a matrix can be obtained using brackets and numbers indicating relevant columns and rows:

mat1[2:3,2:3]

     [,1] [,2]
[1,]    1    1
[2,]    1    1

The same operations can be used on matrices that were used on vectors:

mat2 <- matrix(seq(from = 1, to = 9, by = 1), nrow = 3, ncol = 3)
# Original matrix:
mat2

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

# Transposition:
t(mat2)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

# Matrix multiplication:
t(mat2)%*%mat2

     [,1] [,2] [,3]
[1,]   14   32   50
[2,]   32   77  122
[3,]   50  122  194

R Basics

Lists

Lists are a kind of catch-all object in R. They act as a way of storing one or more objects of any type.

mat1 <- matrix(seq(from = 1,to = 9,by=1),nrow = 3,ncol=3)
string1 <- "abc"
reg1 <- lm(mat1[,1]~mat1[,2])

example_list <- list(mat1,string1,reg1)

example_list

[[1]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

[[2]]
[1] "abc"

[[3]]

Call:
lm(formula = mat1[, 1] ~ mat1[, 2])

Coefficients:
(Intercept)    mat1[, 2]  
         -3            1
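
Elements of a list can be pulled back out by position or by name. The sketch below builds a smaller list than the one above; the element names values and label are illustrative and not from the original slides.

```r
mat1 <- matrix(1:9, nrow = 3, ncol = 3)
example_list <- list(values = mat1, label = "abc")

# Double brackets return the stored object itself.
first_by_position <- example_list[[1]]   # the matrix

# $ does the same thing by name.
label_by_name <- example_list$label      # "abc"

# Single brackets return a shorter list, not the object inside it.
sub_list <- example_list[1]
class(sub_list)                          # "list"

names(example_list)                      # "values" "label"
```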

R Basics

What Are NA Values?

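NA values are placeholders for missing values. Most operations on NA return NA, but NA takes part in arithmetic like a number, and a result that does not depend on the missing value is still returned: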

NA^0

[1] 1
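
In day-to-day work, NA usually shows up as a gap in a data vector. The sketch below uses a made-up vector (measurements is an illustrative name) to show how NA propagates and how it can be detected or skipped.

```r
measurements <- c(2.5, NA, 4.0)

NA + 1                               # NA: the result depends on the missing value
is.na(measurements)                  # FALSE TRUE FALSE

# Summary functions return NA unless told to drop missing values.
mean_with_na <- mean(measurements)                    # NA
mean_without_na <- mean(measurements, na.rm = TRUE)   # 3.25

# The result is known whatever the missing value is, so it is not NA.
known_result <- NA | TRUE            # TRUE
```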

R Basics

Getting and Setting the Workspace

Most of our needs for working with directories can be satisfied with getwd(), which gives the name of the current working directory, and setwd(), which changes it.

We can also list the files in the directory using list.files().
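
A short sketch of the three functions working together. It uses R's per-session temporary directory so nothing in your own folders is touched; the file name scratch_example.txt is hypothetical.

```r
original_dir <- getwd()          # remember where we started
setwd(tempdir())                 # switch to the session's temporary directory

file.create("scratch_example.txt")
files_here <- list.files()       # names of the files in the working directory
"scratch_example.txt" %in% files_here

file.remove("scratch_example.txt")
setwd(original_dir)              # go back to the original directory
```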

R Basics

Getting Help

Any function can have its documentation brought up using ? or help(). For example:

?t

# Or:
help(t)

Other useful resources can be readily found at:

The provided documentation.
Websites like R-bloggers and Stack Exchange.
Free guides which are available online (R Data Import/Export, The R Guide).
Googling.

R Basics

Installing and Loading Libraries

One of the best features of R is how easy it is to extend the language with packages written by other people and distributed through the CRAN network. Once you know which package you want, download it with the install.packages() function. Then, to load the package, use the library() function.

If you wanted to download and load the stringr package you would simply do the following:

install.packages("stringr")
library(stringr)

R Basics

If Statements

Here's an example:

a <- 2
if (a == 1) {
  TRUE
} else if (a == 2) {
  print("twice")
} else {
  print("Nope")
}

[1] "twice"

Note: In the else-if part, the criteria are passed to a second if statement, which is evaluated only when the first condition is not true.
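
An if statement checks a single condition. To apply a condition to every element of a vector, base R provides ifelse(). The sketch below is an illustration rather than part of the original example; test_labels is an illustrative name.

```r
test <- c(0, 1, 0, 1, 0, 1, 0, 1)

# ifelse(condition, value_if_true, value_if_false) works element by element.
test_labels <- ifelse(test == 1, "one", "zero")
test_labels
```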

R Basics

Loops in R

R offers the same kinds of loops that are available in most other languages. The syntax for a for-loop is shown below:

for (i in 1:10) {
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

Unlike in C, the loop variable does not have to be declared or initialized beforehand: the for-loop creates it and assigns it each element of the sequence in turn.

This being said, for-loops should generally be avoided in favor of vectorization unless absolutely necessary.

Because of the way R was originally designed, for-loops are generally slower than vectorized operations.
Loops can make code harder to read than a shorter one-line alternative.
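
To make the comparison concrete, here is the same calculation written both ways. This is a minimal sketch with illustrative names; both versions give the same result, but the vectorized one is a single line.

```r
values <- 1:10

# Loop version: fill a result vector one element at a time.
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Vectorized version: the operator is applied to the whole vector at once.
squares_vectorized <- values^2

identical(squares_loop, squares_vectorized)   # TRUE
```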

R Basics

Vectorization

R provides a set of functions which are preferred when applying a function over the elements, columns or rows of a matrix or data frame.

These are the apply() family of functions:

apply() - Applies a function over the margins of an array or matrix.
lapply() - Applies a function to each element of a list or vector and gives its output as a list.
sapply() - A wrapper around lapply() which simplifies the result to a vector or matrix where possible.

R Basics

Vectorization Examples

apply() takes three main arguments:

The dataframe or matrix.
The margin type. 1 indicates the function being applied row-wise. 2 indicates columns and c(1,2) indicates rows and columns.
The function to be applied.

Let's take the mean of each of the columns of the mtcars dataset:

apply(mtcars,2,mean)

       mpg        cyl       disp         hp       drat         wt 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250 
      qsec         vs         am       gear       carb 
 17.848750   0.437500   0.406250   3.687500   2.812500
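
Because a data frame is a list of columns, the same column means can be obtained with the other members of the family. The following sketch compares them; the variable names are illustrative.

```r
col_means_apply <- apply(mtcars, 2, mean)    # named numeric vector
col_means_sapply <- sapply(mtcars, mean)     # same values, simplified from a list
col_means_list <- lapply(mtcars, mean)       # a list with one element per column
col_means_builtin <- colMeans(mtcars)        # dedicated base R function

all.equal(col_means_apply, col_means_sapply)
class(col_means_list)                        # "list"
```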

R Basics

Writing Functions

R allows users to define their own functions in the same manner as other languages:

new_function <-function(x){
  x+1
}
new_function(2)

[1] 3

We can also use return() to explicitly specify the value that will be returned to the user. If it is not used, the value of the last evaluated expression is returned. In the example below the function exits at return(), so x+1 is never evaluated:

new_function <- function(x) {
  return(10)
  x + 1  # never reached
}

new_function(2)

[1] 10

Lists are useful for returning more than one object from a function:

new_function <- function(x) {
  answer <- x + 1
  return(list(x, answer))
}
new_function(2)

[[1]]
[1] 2

[[2]]
[1] 3
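
Naming the elements of the returned list makes the result easier to use afterwards. This is a sketch of that variation; the function name increment and the element names input and answer are illustrative.

```r
increment <- function(x) {
  answer <- x + 1
  # Named elements can later be retrieved with $.
  list(input = x, answer = answer)
}

result <- increment(2)
result$input    # 2
result$answer   # 3
```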

Data Frames

What is a Data Frame?

A data frame is a structure used in R to store data as a list of named vectors of equal length. It is the default way to represent tabular data in R and allows users to review and edit data.

A simple example is the iris data frame which comes preloaded into R:

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Note that the columns can have different variable types. Most of the data is numeric, but Species is a factor (a categorical variable with the levels setosa, versicolor and virginica).
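
The type of every column can be checked in one step. The sketch below uses only base R on the built-in iris data; the variable names are illustrative.

```r
# class() applied to each column of the data frame.
column_classes <- sapply(iris, class)
column_classes

# Species is stored as a factor with three levels.
levels(iris$Species)     # "setosa" "versicolor" "virginica"

# A factor can be converted to plain text when needed.
species_as_text <- as.character(iris$Species)
class(species_as_text)   # "character"
```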

Data Frames

Selecting Columns and Rows:

Using the $ operator allows the user to select a column of the data frame and to then use it as a vector:

iris$Sepal.Length>5

  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [12] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [23] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
 [34]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [45]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [56]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [89]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[100]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
[111]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[122]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[144]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Specific rows and columns can be selected in several different ways:

Columns can be selected by name. This gives the values of the column as a vector:

irisdata <- head(iris,10)
names(irisdata)

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
[5] "Species"

irisdata$Sepal.Length

 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

class(irisdata$Sepal.Length)

[1] "numeric"

The columns can also be accessed by using bracket notation. There are two ways to get a column:

# Produces a vector which is equivalent to the earlier example:
irisdata[,1]

 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

# Produces a one-column data frame rather than a vector:
irisdata[1]

   Sepal.Length
1           5.1
2           4.9
3           4.7
4           4.6
5           5.0
6           5.4
7           4.6
8           5.0
9           4.4
10          4.9

Rows are also accessed through the bracket operator:

irisdata[1,]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa

The bracket notation can also be used to select rows and columns at the same time:

irisdata[3:6, 2:5]

  Sepal.Width Petal.Length Petal.Width Species
3         3.2          1.3         0.2  setosa
4         3.1          1.5         0.2  setosa
5         3.6          1.4         0.2  setosa
6         3.9          1.7         0.4  setosa

Data Frames

Subsetting a Dataframe

Finally, a vector can be passed to the brackets in order to specify which rows and columns we want:

rows <- seq(from = 2, to = 10, by = 2)
irisdata[rows, 2:4]

   Sepal.Width Petal.Length Petal.Width
2          3.0          1.4         0.2
4          3.1          1.5         0.2
6          3.9          1.7         0.4
8          3.4          1.5         0.2
10         3.1          1.5         0.1

A vector of logical values can be passed to the brackets of a data frame in order to specify a subset of rows. The vector should have one value per row, because a shorter vector is silently recycled. For instance:

example <- head(mtcars)
logivec <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
example[logivec, ]

            mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
Datsun 710 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1
Valiant    18.1   6  225 105 2.76 3.46 20.22  1  0    3    1

We can use the fact that a vector can be evaluated using logical criteria to subset a data frame. For example, say that we want to take the mtcars data frame and select only rows where disp < 150:

mtcars.subset <- mtcars[mtcars$disp < 150, ]
head(mtcars.subset)

                mpg cyl  disp hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1
Merc 240D      24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1

We can also produce more complex criteria using the & and | operators:

mtcars.subset <- mtcars[mtcars$disp < 150 & mtcars$wt > 3, ]
mtcars.subset

           mpg cyl  disp hp drat   wt qsec vs am gear carb
Merc 240D 24.4   4 146.7 62 3.69 3.19 20.0  1  0    4    2
Merc 230  22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2
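
Base R also has a subset() function that expresses the same kind of filter without repeating the data frame name. The sketch below reproduces the filter above and adds an OR condition; the variable names are illustrative.

```r
# Same rows as the bracket version with & shown above.
small_heavy <- subset(mtcars, disp < 150 & wt > 3)
rownames(small_heavy)   # "Merc 240D" "Merc 230"

# | keeps rows where at least one condition holds.
four_cyl_or_efficient <- mtcars[mtcars$cyl == 4 | mtcars$mpg > 30, ]
nrow(four_cyl_or_efficient)

# Rows and columns can be filtered in the same step.
small_engines <- mtcars[mtcars$disp < 150, c("mpg", "disp")]
head(small_engines, 3)
```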

Data Frames

Ordering

Reordering a data frame is somewhat more complex, but it takes a one-line command that employs the order() function. This function produces a vector of row indices, which we can pass as the first argument in the brackets to rearrange the data frame.

order(mtcars$disp)

 [1] 20 19 18 26 28  3 21 27 32  9 30  8  1  2 10 11  6  4 12 13 14 31 23
[24] 22 24 29  5  7 25 17 16 15

order_on_disp <- mtcars[order(mtcars$disp),]
head(order_on_disp,10)

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
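
order() can also sort in descending order or on more than one column. The following is a brief sketch of both variations; the variable names are illustrative.

```r
# Largest displacement first.
by_disp_desc <- mtcars[order(mtcars$disp, decreasing = TRUE), ]
head(by_disp_desc[, c("disp", "cyl")], 3)

# Sort by cyl (ascending), then by mpg (descending) within each cyl group.
# Negating a numeric column reverses its sort direction.
by_cyl_then_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(by_cyl_then_mpg[, c("cyl", "mpg")], 3)
```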

Importing Data

Data Source Types

Importing data to R is fairly simple and can be done from nearly any source imaginable with some work. Some of the more common sources are:

csv (preferred)
Other delimited text files
Excel (requires outside packages)
Databases
Web scraping (HTML, XML)

Importing Data

read.table()

Most of your file importing needs can be fulfilled with the read.table() function.

For example, loading the file example.csv into the workspace requires a one-line command:

example_data <- read.table("example.csv", sep = ",", header = TRUE)

head(example_data)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

read.table() offers a wide array of options which can be seen by using ?read.table. Some of these are:

sep = "," states that elements are separated by commas.
header = TRUE makes the first row be used as the column names.

Because of the flexibility of this function, it should satisfy most of your data import needs.
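
Since example.csv is not distributed with these slides, the sketch below first writes a small CSV file to R's temporary directory and then reads it back. The file name iris_example.csv is hypothetical. read.csv() is a base R wrapper around read.table() with sep = "," and header = TRUE as its defaults.

```r
csv_path <- file.path(tempdir(), "iris_example.csv")
write.csv(head(iris, 10), csv_path, row.names = FALSE)

# Explicit options with read.table():
from_read_table <- read.table(csv_path, sep = ",", header = TRUE)

# The same import with the CSV-specific wrapper:
from_read_csv <- read.csv(csv_path)

dim(from_read_table)     # 10 rows, 5 columns
names(from_read_table)
```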

Plotting in R

The plot() Function

The workhorse of basic R plotting is the plot() function, which allows the user to generate a wide range of graphics and specify the way they appear.

An example using the iris dataset:

plot(iris)

We just produced a scatterplot matrix: every variable is plotted against every other variable!

Let's put together a more descriptive plot of sepal width versus sepal length:

plot(iris$Sepal.Length,iris$Sepal.Width)

Now, add titles to the plot so it is more readable:

plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Sepal Length", ylab = "Sepal Width", main = "Plot of Sepal Length Against Width")

Add color and change the shape of the dots used in the scatterplot:

plot(iris$Sepal.Length,iris$Sepal.Width,xlab = "Sepal Length", ylab = "Sepal Width", main = "Plot of Sepal Length Against Width",pch = 5,col = "darkblue")

We can also add vertical and horizontal lines to the plot. In this case, let's plot the means of the width and length using the abline() function:

plot(iris$Sepal.Length,iris$Sepal.Width,xlab = "Sepal Length", ylab = "Sepal Width", main = "Plot of Sepal Length Against Width",pch = 5,col = "darkblue")
# Add a vertical line for the sepal length's mean:
abline(v = mean(iris$Sepal.Length))
# Add a horizontal line for the sepal width's mean:
abline(h = mean(iris$Sepal.Width))

Let's finish this plot by graphing everything from before and add labels to the data points based on species type. This is done with the text() function.

# Redraw the scatterplot:
plot(iris$Sepal.Length,iris$Sepal.Width,xlab = "Sepal Length", ylab = "Sepal Width", main = "Plot of Sepal Length Against Width",pch = 5,col = "darkblue")
# Label each point with its species, above the point (pos = 3) and in a smaller font (cex = .6):
text(iris$Sepal.Length,iris$Sepal.Width,iris$Species,pos = 3,cex = .6)
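
With 150 points the text labels overlap heavily. One alternative, sketched here as a suggestion rather than as part of the original deck, is to color the points by species and add a legend. The color choices and variable names are illustrative.

```r
# One color per species; indexing by species name gives one color per row.
species_colors <- c(setosa = "darkblue", versicolor = "darkgreen", virginica = "darkred")
point_colors <- species_colors[as.character(iris$Species)]

plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Plot of Sepal Length Against Width",
     pch = 5, col = point_colors)

# legend() adds a key explaining the colors.
legend("topright", legend = names(species_colors), col = species_colors, pch = 5)
```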

Plotting in R

Other Plot Types

R also offers other functions for plotting:

hist()
boxplot()

Plotting in R

boxplot() Example:

virginica_vec <- iris$Species=="virginica"
setosa_vec <- iris$Species=="setosa"
versicolor_vec <- iris$Species=="versicolor"
virginica <- iris[virginica_vec,]$Sepal.Width
setosa <- iris[setosa_vec,]$Sepal.Width
versicolor <- iris[versicolor_vec,]$Sepal.Width
new_frame <- data.frame(virginica,setosa,versicolor)

boxplot(new_frame,main = "Box and Whisker Plot of the Sepal Width",ylab = "Width in cm")
# Add color:
boxplot(new_frame,main = "Box and Whisker Plot of the Sepal Width",ylab = "Width in cm",col =c("green", "blue","red"))
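
boxplot() also accepts a formula, which avoids building a separate data frame for each group. This sketch is an alternative to the code above, not the original author's approach; box_stats is an illustrative name. Here the groups appear in the order of the factor levels.

```r
# Sepal.Width ~ Species means "Sepal.Width split by Species".
box_stats <- boxplot(Sepal.Width ~ Species, data = iris,
                     main = "Box and Whisker Plot of the Sepal Width",
                     xlab = "Species", ylab = "Width in cm",
                     col = c("blue", "red", "green"))

# boxplot() invisibly returns the statistics it used to draw the plot.
box_stats$names   # "setosa" "versicolor" "virginica"
box_stats$n       # number of observations in each group
```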

Plotting in R

hist() Example:

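library(datasets)
toothy_data <- ToothGrowth
# The x-axis shows tooth length; the y-axis shows how many observations fall in each bin.
hist(toothy_data$len, breaks = length(toothy_data$len)/4, xlab = "Tooth Length", ylab = "Frequency", main = "Plot of Tooth Length", col = "darkgreen")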

Plotting in R

Other Plot Frameworks

R has several packages which extend its graphics capabilities even further:

lattice
ggplot2