Introduction to R

Katia Oleinik
koleinik@bu.edu

Helpful Links:

SCV examples: http://scv.bu.edu/examples/r/
Examples of tasks replicated in SAS and R: http://sas-and-r.blogspot.com/
Many examples of statistical analysis using R: http://www.ats.ucla.edu/stat/r/
R for SAS, Stata and SPSS users: http://scv.bu.edu/examples/r/tutorials/

Login to SCC Cluster

Login: tuta30

Password: VizTut30

ssh -X scc4.bu.edu

cp -r /scratch/r-intro-2 .

R as a scientific calculator

Try executing the following commands in your Console Window:

> 2+3          # addition

[1] 5

> 2^3          # power

[1] 8

> log(2)       # built-in functions

[1] 0.6931472

Double Precision

By default R displays 7 significant digits, which resembles single precision. All calculations, however, are always done in double precision.

> options(digits=15)   # display up to 15 significant digits
> exp(3)

[1] 20.0855369231877

Return to the default 7-digit output:

> options(digits=7) 
> exp(3)

[1] 20.08554
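
The digits option changes only how numbers are printed, not how they are stored. A minimal sketch that illustrates this (the variable name y is used only for this example):

```r
y <- exp(3)

# Printing with more or fewer digits does not change the stored value
print(y, digits = 15)
print(y, digits = 7)

# Double-precision arithmetic is still subject to rounding error,
# so compare floating-point results with all.equal() rather than ==
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```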

Variables

> a <- 3
> A <- 7  # R is case sensitive
> 
> b = -5

Both assignment operators are equivalent here; the first one (<-) is more “traditional”.

Variables

Rules:

The name should contain only letters, digits, periods and underscores.
No other special characters ($, %, etc.) are allowed.
It should start with either a letter or a period (a leading period must not be followed by a digit).
Avoid using the names c, t, cat, F, T, D, as those are built-in functions/constants.
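
One way to check these rules is sketched below. It relies on make.names(), which rewrites a string into a syntactically valid name; the candidate names are made up for illustration.

```r
# Candidate names to check (chosen only for illustration)
candidates <- c("my.var", "my_var2", ".hidden", "2abc", "my$var", ".2x")

# A string that make.names() returns unchanged was already a valid name
is.valid <- make.names(candidates) == candidates

data.frame(name = candidates, valid = is.valid)
```

The first three names pass the check; the last three break the rules on the first character or on special characters.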

Variables

Let's create a few variables:

> str.var  <- "character variable"   
> num.var  <- 21.17                  
> bool.var <- TRUE                   
> comp.var <- 1-3i

To view a variable's value, either type the variable name or use the print() function:

> num.var

[1] 21.17

> print(str.var)

[1] "character variable"

Variables

Check the mode of the variable (its type):

> mode(bool.var)

[1] "logical"

> mode(num.var)

[1] "numeric"

> mode(str.var)

[1] "character"
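
mode() is not the only way to inspect a variable. The sketch below compares it with typeof() and class(); int.var is an extra variable introduced only for this example.

```r
int.var  <- 5L        # the L suffix creates an integer
num.var  <- 21.17
comp.var <- 1-3i

mode(int.var)      # "numeric"
typeof(int.var)    # "integer"
typeof(num.var)    # "double"
mode(comp.var)     # "complex"
class(num.var)     # "numeric"
```

Integers and doubles share the mode "numeric"; typeof() shows how the value is actually stored.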

Data Types

R has a wide variety of data types including:

Vectors (numerical, character, boolean, complex)
Factors (categorical variables)
Matrices
Data Frames
Lists

Vectors

Vector - an array of R objects of the same type.

To create a vector, use the function c() (concatenate):

> ( names <- c ("Alex", "Nick", "Mike") )

[1] "Alex" "Nick" "Mike"

> ( numbers <- c (21, -3, 7.25) )

[1] 21.00 -3.00  7.25

Note: If the R command is enclosed in parentheses, then after the command is executed the result is also printed to the screen.
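
Because all elements of a vector must have the same type, c() silently converts mixed input to a common type. A small sketch (the names mixed1 and mixed2 are used only here):

```r
mixed1 <- c(1, TRUE, FALSE)     # logical values become numbers
mixed2 <- c(1, "a", TRUE)       # everything becomes character

mixed1          # 1 1 0
mode(mixed1)    # "numeric"
mixed2          # "1" "a" "TRUE"
mode(mixed2)    # "character"
```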

Vectors

Vectors can be defined in a number of ways:

> ( vals <- c (2, -7, 5, 3, -1 ) )

[1]  2 -7  5  3 -1

> ( vals <- 1:7 )

[1] 1 2 3 4 5 6 7

> ( vals <- seq(0, 3, by=0.5) )

[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0

> ( vals <- rep("o", 7) )

[1] "o" "o" "o" "o" "o" "o" "o"

Vectors

Vectors can be defined in a number of ways:

> ( vals <- numeric(9) )

[1] 0 0 0 0 0 0 0 0 0

> ( vals <- rnorm(5,2,1.5 ) )

[1] 1.242416 5.075357 2.803972 2.021579 3.121984

> ( vals <- rpois(5,4 )  )

[1] 3 1 5 1 1
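
rnorm(5, 2, 1.5) draws 5 values from a normal distribution with mean 2 and standard deviation 1.5, and rpois(5, 4) draws 5 values from a Poisson distribution with rate 4, so the output differs on every run. The sketch below shows how set.seed() makes such results repeatable; the seed value 42 is arbitrary.

```r
set.seed(42)                            # fix the random number generator state
first  <- rnorm(5, mean = 2, sd = 1.5)

set.seed(42)                            # the same seed repeats the same sequence
second <- rnorm(5, mean = 2, sd = 1.5)

identical(first, second)                # TRUE
```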

Vectors

Vector elements can have labels:

> heights<-c(Alex=180, Bob=175, Clara=165, Don=185)
> heights

 Alex   Bob Clara   Don 
  180   175   165   185

Vector arithmetic:

Do not use loops for element-wise operations on vectors. R arithmetic operators work on whole vectors!

> a <- 1:5
> b <- seq(2,10, by=2)
> 
> a+b

[1]  3  6  9 12 15

> b/a

[1] 2 2 2 2 2
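
To illustrate the point about loops, the sketch below computes the same sum both ways; loop.sum and vec.sum are names chosen for this example.

```r
a <- 1:5
b <- seq(2, 10, by = 2)

# Explicit loop: works, but is verbose and usually slower
loop.sum <- numeric(length(a))
for (i in seq_along(a)) {
  loop.sum[i] <- a[i] + b[i]
}

# Vectorized form: one expression, same result
vec.sum <- a + b
all(loop.sum == vec.sum)   # TRUE

# A single number is recycled across the whole vector
a * 10                     # 10 20 30 40 50
```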

Vector slicing (subsetting)

You can access particular elements in the vector in the following manner:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> 
> x[2]     # returns second element

[1] 145

> x[2:4]   # returns 2nd through 4th

[1] 145 958 456

Vector slicing (subsetting)

You can access particular elements in the vector in the following manner:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> x[c(1,3,5)]  # returns 1st, 3rd and 5th elements

[1] 734 958 924

> x[-2]        # returns all but 2nd element

[1] 734 958 456 924

Vector slicing (subsetting)

You can access particular elements in the vector in the following manner:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> x[c(TRUE, TRUE, FALSE, FALSE, TRUE)]   # returns 1st, 2nd, and 5th elements

[1] 734 145 924

Vector slicing (subsetting)

You can access particular elements in the vector in the following manner:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> # returns only those elements that are less than 500
> x[x<500]

[1] 145 456
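
Conditions can be combined with & (and) and | (or), and which() returns positions instead of values. A short sketch using the same vector:

```r
x <- c(734, 145, 958, 456, 924)

x[x > 200 & x < 900]    # both conditions hold: 734 456
x[x < 200 | x > 900]    # either condition holds: 145 958 924
which(x < 500)          # positions rather than values: 2 4
```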

Vector Functions

max(x), min(x), sum(x)

mean(x), median(x), range(x)

var(x) , cor(x,y)

sort(x), rank(x), order(x)

cumsum(x), cumprod(x), cummin(x), cummax(x)

duplicated(x), unique(x)
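
A few of these functions are easy to confuse. The sketch below applies them to one vector: sort() returns the sorted values, order() returns the positions that would sort the vector, and rank() returns each element's place in the sorted order.

```r
x <- c(734, 145, 958, 456, 924)

range(x)      # smallest and largest value: 145 958
cumsum(x)     # running total: 734 879 1837 2293 3217
sort(x)       # 145 456 734 924 958
order(x)      # 2 4 1 5 3
rank(x)       # 3 1 5 2 4
x[order(x)]   # same as sort(x)
```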

Vector Functions

Example:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> mean(x)

[1] 643.4

> sort(x)

[1] 145 456 734 924 958

Vector Functions

A useful vector function:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  145.0   456.0   734.0   643.4   924.0   958.0

Missing Values

To define a missing value, use NA:

> # define a numeric vector
> x <- c(734, 145, NA, 456, NA)    
> 
> # Check if the element in the vector is missing
> is.na(x)

[1] FALSE FALSE  TRUE FALSE  TRUE

> # Which elements in the vector are missing
> which(is.na(x))

[1] 3 5

Missing Values

A missing value cannot be compared to anything:

> # define a numeric vector
> x <- c(734, 145, NA, 456, NA)    
> 
> x == NA   # this does not work !

[1] NA NA NA NA NA

> # Use is.na() instead
> is.na(x)

[1] FALSE FALSE  TRUE FALSE  TRUE
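
Missing values also propagate through most calculations. Many summary functions accept an na.rm argument to skip them, as this sketch shows:

```r
x <- c(734, 145, NA, 456, NA)

mean(x)                  # NA, because missing values propagate
mean(x, na.rm = TRUE)    # 445, the mean of the three observed values
sum(is.na(x))            # 2 missing values
x[!is.na(x)]             # 734 145 456
```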

Factors

A factor is a special type of vector that stores “categorical” variables. To convert a vector into a factor, use the factor() function:

> x <- c(0, 1, 1, 1, 0, 0, 1, 0)    
> x <- factor(x)
> table(x)

x
0 1 
4 4

Factors

Each level in the factor variable can be named

> x <- factor( c(0, 0, 1, 1, 0, 0, 1, 0), labels=c("Fail","Success"))   
> 
> table(x)

x
   Fail Success 
      5       3

Factors

Factors are treated differently by the summary() function:

> # define a factor with labeled levels
> x <- factor( c(0, 0, 1, 1, 0, 0, 1, 0), labels=c("Fail","Success"))   
> 
> summary(x)

   Fail Success 
      5       3
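
Internally a factor stores integer codes that point into its set of levels. A brief sketch for inspecting both:

```r
x <- factor(c(0, 0, 1, 1, 0, 0, 1, 0), labels = c("Fail", "Success"))

levels(x)        # "Fail" "Success"
nlevels(x)       # 2
as.integer(x)    # 1 1 2 2 1 1 2 1
```

Note that the integer codes are positions in levels(x), starting at 1, not the original 0/1 values.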

Matrices

A matrix is a 2-dimensional array of elements of the same type:

> matr <- matrix( c(1,2,3,4,5,6) , ncol=2)
> matr

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Note: R matrix is “column-major”.

Matrices

A matrix is a 2-dimensional array of elements of the same type:

> matr <- matrix( c(1,2,3,4,5,6) , nrow=2) 
> matr

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrices

You can also convert a vector into a matrix:

> # First define the vector
> matr <- c(1,2,3,4,5,6) 
> # Then change dimensions
> dim(matr) <- c(2,3)      
> 
> matr

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrices

You can fill a matrix by row:

> matr <- matrix( c(1,2,3,4,5,6) , ncol=2, byrow=TRUE)
> matr

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
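
Matrix elements are accessed with [row, column] indices; leaving an index empty selects the whole row or column. A small sketch using the column-major matrix defined earlier:

```r
matr <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)

dim(matr)       # 3 2  (rows, columns)
matr[2, 1]      # 2
matr[, 2]       # second column: 4 5 6
matr[3, ]       # third row: 3 6
```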

Matrix Operations

> smatr = matrix( c(1,-3, 2, 5) , ncol=2) 
> smatr

     [,1] [,2]
[1,]    1    2
[2,]   -3    5

> # transpose matrix
> t(smatr)

     [,1] [,2]
[1,]    1   -3
[2,]    2    5

Matrix Operations

> smatr = matrix( c(1,-3, 2, 5) , ncol=2) 
> smatr

     [,1] [,2]
[1,]    1    2
[2,]   -3    5

> # Inverse matrix
> solve(smatr)

          [,1]        [,2]
[1,] 0.4545455 -0.18181818
[2,] 0.2727273  0.09090909
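
The inverse can be checked by multiplying it with the original matrix. solve() also accepts a right-hand side, which solves a linear system without forming the inverse explicitly. In the sketch below, rhs and sol are names chosen only for illustration.

```r
smatr <- matrix(c(1, -3, 2, 5), ncol = 2)

# A matrix times its inverse gives the identity (up to rounding error)
isTRUE(all.equal(smatr %*% solve(smatr), diag(2)))   # TRUE

# Solve smatr %*% sol = rhs directly
rhs <- c(1, 2)
sol <- solve(smatr, rhs)
smatr %*% sol            # reproduces rhs as a one-column matrix
```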

Matrix Operations

> smatr

     [,1] [,2]
[1,]    1    2
[2,]   -3    5

> # element-wise product of matrices
> smatr*smatr

     [,1] [,2]
[1,]    1    4
[2,]    9   25

Matrix Operations

> smatr

     [,1] [,2]
[1,]    1    2
[2,]   -3    5

> # matrix product
> smatr %*% smatr

     [,1] [,2]
[1,]   -5   12
[2,]  -18   19

Matrix Operations

> smatr

     [,1] [,2]
[1,]    1    2
[2,]   -3    5

> # reciprocal of each element of the matrix
> smatr^(-1)

           [,1] [,2]
[1,]  1.0000000  0.5
[2,] -0.3333333  0.2

Matrix Operations

Some useful matrix functions:
colMeans(); rowMeans(); colSums(); rowSums()

> smatr

     [,1] [,2]
[1,]    1    2
[2,]   -3    5

> # sum of the elements in each row
> rowSums(smatr)

[1] 3 2
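
The other functions in this family work the same way, and apply() generalizes them to any function: its second argument is 1 for rows and 2 for columns. A short sketch:

```r
smatr <- matrix(c(1, -3, 2, 5), ncol = 2)

colSums(smatr)           # -2  7
colMeans(smatr)          # -1.0  3.5
rowMeans(smatr)          # 1.5 1.0
apply(smatr, 1, max)     # largest value in each row: 2 5
apply(smatr, 2, min)     # smallest value in each column: -3 2
```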

R help

Access the help file for an R function:

> ?matrix

> help(matrix)

R help

You can search for help using the help.search() function:

> help.search("matrix")

Or using two question marks:

> ??matrix

R help

Get arguments of a function

> args(matrix)

function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL) 
NULL

R help

Examples of function usage

> example(matrix)


matrix> is.matrix(as.matrix(1:10))
[1] TRUE

matrix> !is.matrix(warpbreaks)  # data.frame, NOT matrix!
[1] TRUE

matrix> warpbreaks[1:10,]
   breaks wool tension
1      26    A       L
2      30    A       L
3      54    A       L
4      25    A       L
5      70    A       L
6      52    A       L
7      51    A       L
8      26    A       L
9      67    A       L
10     18    A       M

matrix> as.matrix(warpbreaks[1:10,])  # using as.matrix.data.frame(.) method
   breaks wool tension
1  "26"   "A"  "L"    
2  "30"   "A"  "L"    
3  "54"   "A"  "L"    
4  "25"   "A"  "L"    
5  "70"   "A"  "L"    
6  "52"   "A"  "L"    
7  "51"   "A"  "L"    
8  "26"   "A"  "L"    
9  "67"   "A"  "L"    
10 "18"   "A"  "M"    

matrix> ## Example of setting row and column names
matrix> mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,
matrix+                dimnames = list(c("row1", "row2"),
matrix+                                c("C.1", "C.2", "C.3")))

matrix> mdat
     C.1 C.2 C.3
row1   1   2   3
row2  11  12  13

Clean current R session

Check variables in the current session

> objects()

 [1] "a"        "A"        "b"        "bool.var" "comp.var" "heights" 
 [7] "matr"     "mdat"     "names"    "num.var"  "numbers"  "smatr"   
[13] "str.var"  "vals"     "x"

> ls()

 [1] "a"        "A"        "b"        "bool.var" "comp.var" "heights" 
 [7] "matr"     "mdat"     "names"    "num.var"  "numbers"  "smatr"   
[13] "str.var"  "vals"     "x"

Clean current R session

Remove the object x from memory

> rm(x)

Remove everything from your working environment

> rm(list=ls())

Running an R script in batch mode

R CMD BATCH Rprog.R

Rscript Rprog.R

R -q --vanilla < Rprog.R
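
The contents of Rprog.R are not shown in this tutorial. The sketch below is one hypothetical version of such a script; it reads an optional command-line argument, which is available when the script is started as "Rscript Rprog.R 10".

```r
# Rprog.R (hypothetical contents)
# Arguments given after the script name arrive as character strings
args <- commandArgs(trailingOnly = TRUE)

# Use 5 observations unless a count is given on the command line
n <- if (length(args) > 0) as.integer(args[1]) else 5L

vals <- rnorm(n)
print(summary(vals))
```

With R CMD BATCH the printed output is written to the file Rprog.Rout by default, while Rscript prints it to the terminal.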

Data Frames

A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). It is similar to SAS and SPSS datasets.

> names <- c("Alex", "Bob", "Cat")
> ages <- c(12,5,7)
> sex <- c("M","M","F")
> kids <- data.frame(Names=names, Ages=ages, Sex=sex, stringsAsFactors=TRUE)
> kids

  Names Ages Sex
1  Alex   12   M
2   Bob    5   M
3   Cat    7   F

Data Frames

The summary() function recognizes the type of each variable:

> kids

  Names Ages Sex
1  Alex   12   M
2   Bob    5   M
3   Cat    7   F

> summary(kids)

  Names        Ages      Sex  
 Alex:1   Min.   : 5.0   F:1  
 Bob :1   1st Qu.: 6.0   M:2  
 Cat :1   Median : 7.0        
          Mean   : 8.0        
          3rd Qu.: 9.5        
          Max.   :12.0
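
Columns of a data frame are accessed with $, and rows and columns with [row, column] indices, much like a matrix. A minimal sketch using the same data:

```r
kids <- data.frame(Names = c("Alex", "Bob", "Cat"),
                   Ages  = c(12, 5, 7),
                   Sex   = c("M", "M", "F"))

kids$Ages                 # one column as a vector: 12 5 7
kids[2, ]                 # second row
kids[kids$Ages > 6, ]     # rows for Alex and Cat
nrow(kids)                # 3
str(kids)                 # compact description of every column
```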

Data Frames

Read the data from a file
Clean the dataset if necessary, check for missing values
Get summary of the data
Use R graphics to explore each variable in the dataset
Perform statistical analysis
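
The steps above can be sketched as follows. The file, its columns and the choice of analysis are made up for illustration; a real analysis would read an existing data file instead of writing a temporary one.

```r
# Write a small example file so that the sketch is self-contained
csv.file <- tempfile(fileext = ".csv")
writeLines(c("name,age,height",
             "Alex,12,150",
             "Bob,5,NA",
             "Cat,7,120",
             "Dan,9,135"), csv.file)

# 1. Read the data from a file
dat <- read.csv(csv.file)

# 2. Check for missing values and drop incomplete rows
colSums(is.na(dat))
clean <- dat[complete.cases(dat), ]

# 3. Get a summary of the data
summary(clean)

# 4. Explore a variable graphically
hist(clean$height)

# 5. Perform a simple statistical analysis
cor(clean$age, clean$height)
```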

Data Frames

Read the data frame
