Instructor: Edgar Franco
COMMENTS. To insert comments use “#”. Windows shortcut: Control + Shift + C
TIP: ALWAYS Comment your code
SHORTCUTS. To run a command directly from the script, place the cursor at the end of the command line and press: Mac users: Command + Enter. Windows users: Control + R or Control + Enter.
Just to check if R is working
# Try it now with the following command
1 + 1
# Now, calculate your first statistics
mean(1:5)
# Let's start with an empty working directory
rm(list = ls()) ### Remove all objects in the working environment
### Use with caution!!
### Similar to "clear all" in Stata
gc() ### Garbage collector. It can be useful to call gc() after a large object has been removed, as this may prompt R to return memory to the operating system.
How to get help in R:
?mean #opens the help page for the mean function
## starting httpd help server ...
## done
?"-" #opens the help page for subtraction
?"if" #opens the help page for if
??summarizing #searches for topics containing words like "summarizing"
??"least squares" #searches for topics containing phrases like this
### The functions help() and help.search() do the same thing as ? and ??
help("mean")
help.search("least squares")
#### The apropos() function finds variables and functions that match its input
apropos("vector")
### You can use apropos with regular expressions
### Example: Every function ending in "z"
apropos("z$")
The working directory is the place in your computer where R will be running the script. To set the working directory you can use the drop down menu. You can also change the working directory by typing setwd("Path").
# PLEASE CHANGE YOUR WORKING DIRECTORY NOW (uncomment and type your own path)
#setwd("Your directory")
We recommend setting the working directory from your script instead of using the drop down menu. This is particularly useful when working in teams or from a shared folder.
The working directory can be set in your computer, in your AFS space at Stanford, or in some other location that you can access through your computer.
For example, a working directory in the AFS space at Stanford looks like this
“/afs/ir.stanford.edu/users/g/r/grobles/Documents”
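NOTE FOR WINDOWS USERS: a single backslash is an escape character in R strings, so write paths with forward slashes or with doubled backslashes:
"c:/path/folder"
OR
"c:\\path\\folder"
# To display the current working directory use getwd()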
getwd()
Once the working directory is set, it is relatively simple to access files in sub-folders and parent folders. We will learn how to do that in this session.
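As a rough sketch of how sub-folders and parent folders can be reached from the working directory, base R offers file.path(), dirname() and basename(). The folder name "data" and the file name "survey.csv" are hypothetical and used only for illustration; nothing is read from disk here.

```r
# Current working directory (whatever it happens to be on your machine)
wd <- getwd()

# file.path() joins folder names with "/" as the separator.
# "data" and "survey.csv" are hypothetical names used for illustration.
file.path(wd, "data", "survey.csv")   # a file in a sub-folder
file.path("..", "survey.csv")         # a file in the parent folder
dirname(wd)                           # the parent folder itself
basename(wd)                          # the name of the current folder
```

Building paths this way avoids typing separators by hand, which helps when a script is shared between Mac and Windows users.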
The command line prompt (>) is an invitation to type commands or expressions. After you write a command, press Enter to execute it. However, it is more convenient to run the command directly from your script.
Remember: MAC users: command + enter Windows users: Control + enter OR Control + R
# R can work as a calculator, for example,
2+3 #Addition
2*3 #Multiplication
2^3 #Exponentiation
2**3 #Exponentiation (alternative to ^)
2/3 ##Division
Now, going back to the first statistic that we calculated, mean(1:5): we can use “:” to create a sequence of integers, with the syntax from:to
1:5
6:10
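## What happens if we sum these
1:5 + 6:10
#Equivalent to
c(1,2,3,4,5) + c(6,7,8,9,10)
## If the lengths differ, R recycles the shorter vector and warns when the
## longer length is not a multiple of the shorter one
1:5 + 6:12
## Warning in 1:5 + 6:12: longer object length is not a multiple of shorter
## object length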
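A small sketch of what R does when two vectors of different lengths are added: the shorter one is recycled until it matches the longer one. The object names short and long are made up for this illustration.

```r
# Hypothetical vectors, chosen only to illustrate recycling
short <- 1:2
long  <- 1:6

# 'short' is reused three times: 1 2 1 2 1 2
recycled_sum <- short + long
recycled_sum

# Same result written out by hand with rep()
manual_sum <- rep(short, times = 3) + long
identical(recycled_sum, manual_sum)
```

No warning appears here because 6 is a multiple of 2; the warning only shows up when the lengths do not divide evenly.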
Assigning Variables
In R you can create different objects and give them a name. To create/declare an object, use “<-”. The arrow ‘<-’ means “the values on the right are assigned to the name on the left”.
Unlike in some other languages, in R you don’t have to specify what type of variable you are creating.
# 2.1. The simplest objects in R are scalars, for example
A <- 2
# An object called "A" that contains the number 2 is now in the workspace.
# You can call any object in the workspace by typing its name.
A
# This object "A" is now a global variable, which means that you can perform operations with this object by calling its name. For example,
A + 3
A * 3
A * A
# A famous scalar already stored in R
pi
# NOTE:
# Two different objects can't have the same name. R will overwrite the previous contents of the object with the new one.
A <- 3
A
A <- 2
A
# NOTE:
# There are alternative ways to declare objects in R
## You can change the order of the instruction
4 -> A
A
## Or you can use the equal sign '='
A = 5
A
## Nevertheless, with '=' you cannot reverse the order of the instruction.
5 = A
## Error in 5 = A: invalid (do_set) left-hand side to assignment
# This will display an error
TIP: It can be better to use arrows ‘<-’ or ‘->’ rather than the equal sign. It’s easier to track the direction of the instruction, especially when creating new objects from old ones. For example, B <- A vs A <- B
Also, as we will see, the commands ‘==’ and ‘!=’ will be used for conditionals and this might create some confusion and errors in the code.
TIP: If you want to assign and print in one line you have two possibilities
# Use ;
k <- rnorm(5) ; k
# Use ()
(kk <- rnorm(5))
NOTE: Special numbers
Inf, -Inf, NaN and NA are special numeric values
c(Inf + 1, Inf-1, Inf-Inf)
c(sqrt(Inf), sin(Inf))
## Warning in sin(Inf): NaNs produced
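To tell these special values apart in practice, base R provides is.na(), is.nan(), is.finite() and is.infinite(). The sketch below uses a hypothetical vector v:

```r
# Hypothetical vector mixing ordinary and special values
v <- c(1, NA, NaN, Inf, -Inf)

is.na(v)        # TRUE for both NA and NaN
is.nan(v)       # TRUE only for NaN
is.finite(v)    # TRUE only for ordinary numbers
is.infinite(v)  # TRUE for Inf and -Inf

# NA == NA is NA, not TRUE, so use is.na() rather than ==
NA == NA
```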
Objects containing strings can also be created in R. String variables are declared by using quotes " " or apostrophes ‘’
B <- "R Workshop"
B
B <- 'R Workshop, Summer 2016'
B
TIP: Don’t forget to use and close quotes for string variables, otherwise, you might mistakenly call other objects in your workspace.
B <- "A"
B
B <- A
B
Let’s keep this string for now
B <- "R Workshop"
Vectors of numbers or strings are another type of object in R. To create a vector, we need to “concatenate” a series of elements using the function “c()”. “c()” is probably the most important and often-used function in R. It creates a vector from a series of elements.
C <- c(100,200,300,400,550)
C
C <- c("red","blue","black")
C
NOTE: R vectors have no dimensions of their own, but matrix functions such as as.matrix() treat them as column vectors. Also note that “c()” is a function and “C” is a vector in the workspace.
c
C
Remember: To create a series of numbers, use “:”. Syntax: “from:to”
1:10
This is vectorizable:
NOTE: This series is not an object in the workspace until you assign it to a name.
C
C <- c(1:10)
C
To create a vector of repeated numbers or strings, you can use the function rep(). Syntax: rep(value, times)
rep(2,10)
rep("index",5)
To create other sequences, use the function ‘seq()’. Syntax: seq(from, to, by)
seq(1,10)
seq(1,10,2)
seq(2,10,2)
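Both rep() and seq() accept named arguments that are often clearer than positional ones. A brief sketch of a few standard options (by, length.out, times, each):

```r
seq(0, 1, by = 0.25)         # step size of 0.25
seq(0, 1, length.out = 5)    # five evenly spaced values, same result
seq_len(4)                   # 1 2 3 4
rep(c("a", "b"), times = 2)  # "a" "b" "a" "b"
rep(c("a", "b"), each = 2)   # "a" "a" "b" "b"
```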
To create random numbers, use the function ‘runif()’. It draws n values from a uniform distribution (between 0 and 1 by default). Syntax: runif(n)
runif(10)
NOTE: You can combine different functions to create your vectors
c(1,2,rep(3,3))
NOTE: R keeps a count on the number of elements in a vector.
This will be useful for selecting cases and subsampling data.
c(1:500)
seq(2,1000,2)
Note that “[ ]” indicates the position of an element in a vector.
To call an element in a vector, use the following notation: vector[position]
#For example:
C
C[2] # Second element of vector C
C[c(2:4)] # Elements 2 to 4 of the vector C
# You can also ask R to hide some elements.
C[-2] # All elements of C except the second one
NOTE: R starts indexing at 1; many other languages start indexing at 0.
You can also explore and change some characteristics of the vector:
## length
length(C)
## [1] 10
## Add names
names(C) <- c("Stanford", "Harvard", "MIT", "Princeton", "Berkeley", "Columbia", "NYU",
"Oxford", "Cambridge", "Notre Dame")
C
## Stanford Harvard MIT Princeton Berkeley Columbia
## 1 2 3 4 5 6
## NYU Oxford Cambridge Notre Dame
## 7 8 9 10
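Once a vector has names, its elements can be selected by name as well as by position, and a logical condition can be used inside the brackets too. A minimal sketch with a hypothetical named vector scores:

```r
# Hypothetical named vector, similar in spirit to C above
scores <- c(Stanford = 1, Harvard = 2, MIT = 3)

scores["MIT"]                    # select by name
scores[c("Stanford", "MIT")]     # several names at once
scores[scores > 1]               # select by a logical condition
names(scores)[scores == 2]       # name of the element equal to 2
```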
There are three vectorized logical operators in R: ! (not), & (and) and | (or).
x <- 1:10; x
x >= 5 # which numbers in x are greater than or equal to 5
### The %% operator means remainder after division
(y <- 1:10 %% 2)
y == 0 # which numbers in x are even
(x >= 5) & (y == 0) # Both are TRUE: numbers that are greater than or equal
# to 5 and have remainder zero
(x >= 5) | (y == 0) # At least one is TRUE
!(x >= 5) # Negation: numbers smaller than 5
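Logical vectors become more useful when combined with which(), any(), all() and sum(). The following sketch stores each condition under a hypothetical name (is_large, is_even) before combining them:

```r
x <- 1:10
is_large <- x >= 5        # logical vector
is_even  <- x %% 2 == 0   # logical vector

!is_large                 # negation
is_large & is_even        # TRUE where both conditions hold
is_large | is_even        # TRUE where at least one holds

which(is_large & is_even) # positions: 6 8 10
any(is_large)             # is at least one element TRUE?
all(is_large)             # are all elements TRUE?
sum(is_even)              # TRUE counts as 1, so this counts the evens
```

Naming the conditions first makes it harder to combine the raw numeric vectors by mistake.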
Matrices are another type of object in R. We can think about them as two-dimensional vectors with columns and rows.
To create a matrix, use the function ‘matrix()’
Syntax: matrix(vector, number of rows, number of columns)
matrix(c(10,20,30,40), 2, 2)
# You can also call vectors in the workspace
C
matrix(C,2,5)
D <- matrix(c(10,20,30,40), nrow=2,ncol=2)
D
# Note that "[ , ]" indicates the position "[row,column]" of an element in a matrix.
# To call an element in a matrix, use the following notation matrix[row,column]
# For example:
D
D[2,1] # Second row, first column
D[,] # All elements
D[2,] # Second row, all columns
D[,2] # All rows, second column
# Let's go back to our first matrix
D <- matrix(c(10,20,30,40), nrow=2,ncol=2)
We can explore some characteristics of the matrix:
nrow(D) #Number of rows
ncol(D) #Number of columns
length(D) #Product of dimensions
dim(D) # Both, rows and columns
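One detail worth seeing once: matrix() fills the matrix column by column unless told otherwise. A short sketch using the standard byrow argument, with hypothetical object names by_col and by_row:

```r
# By default the values fill the matrix column by column
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_col

# byrow = TRUE fills row by row instead
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row

by_col[1, 2]  # 3
by_row[1, 2]  # 2
```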
Data frames are used to store spreadsheet-like data.
In other words, they are like matrices whose columns can hold different kinds of data.
To create a dataset, use the function data.frame().
The arguments for ‘data.frame()’ are a series of vectors.
You can give variable names to each of these vectors.
Syntax: data.frame(vector1,vector2,vector3,vector4)
data.frame(age = c(20,24,26,23,29),
sex = c("Female","Male","Female","Male","Female"),
treatment = c(1,0,0,1,1),
income = c(1000,1500,2000,2500,3000))
# NOTE: All parentheses in a function call should be balanced; otherwise, R will be expecting more input and won't execute the command.
# TIP: You can take advantage of this to keep your code clear. Use new lines and tabs to make your commands more legible.
# Note the '+' sign in the command window, which indicates that R is expecting more input.
# NOTE: RStudio users might not be able to run commands from multiple lines.
E <- data.frame(
  age = c(20,24,26,23,29),
  sex = c("Female","Male","Female","Male","Female"),
  treatment = c(1,0,0,1,1),
  income = c(1000,1500,2000,2500,3000)
)
E
# Similar to matrices, you can select elements in a data frame by using the following notation
# dataframe[row,column]
# For example:
E
E[4 ,4] # Fourth row, fourth column
E[ , ] # All elements
E[4 , ] # Fourth row, all columns
E[ , 4] # All rows, fourth column (variable income)
# Nevertheless, it is more convenient to use the $ operator when selecting elements in a dataset.
# The '$' operator refers to the parent database a particular variable belongs to.
# Syntax: database$variable
# For example
E
E$age # Variable "age" in database "E"
E$sex # Variable "sex" in database "E"
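# "Levels" in the output indicates that R is treating a variable as a categorical/factor variable.
# Since R 4.0.0, data.frame() keeps strings as character vectors unless you set stringsAsFactors = TRUE.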
# Another way to select variables is by typing their names
E[,"age"] # Column "age" in database "E"
E[,"sex"] # Column "sex" in database "E"
E[,c("age","sex")] # Columns "age" and "sex" in database "E"
# Finally, you can choose a particular element of a variable by using '[ ]'
E$age # Variable "age" in database "E"
E$age[2] # Second element of variable "age" in database "E"
# Note that the following notations are equivalent
E$age[2] # Second element of variable "age" in database "E"
E[ , "age"][2] # Second element of column "age" in database "E"
E[2, 1] # Second row and first column in database "E"
E[ , 1][2] # Second element of first column in database "E"
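The row position in dataframe[row, column] can also be a logical condition, which is how cases are usually selected. A minimal sketch, assuming a hypothetical data frame called small:

```r
# Hypothetical data frame, a smaller cousin of E
small <- data.frame(
  age    = c(20, 24, 26),
  income = c(1000, 1500, 2000)
)

small[small$age > 21, ]              # rows where age is above 21, all columns
small[small$age > 21, "income"]      # income for those rows only
small$income[small$age == 20]        # same idea with the $ operator
nrow(small[small$income >= 1500, ])  # number of rows meeting a condition
```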
A list is a generalization of a vector. In a list, elements can be of different types.
# Let's bring our objects back
A <- 5
B <- "R workshop"
C <- c(1:10)
D <- matrix(c(10,20,30,40), nrow=2,ncol=2)
E <- data.frame(
age = c(20,24,26,23,29),
sex = c("Female","Male","Female","Male","Female"),
treatment = c(1,0,0,1,1),
income = c(1000,1500,2000,2500,3000)
)
ls()
# Lists are commonly used objects in R.
# They are a collection of other objects, broadly defined.
# Here we make a list of all the objects that we have created so far.
# Use the function 'list()' to create lists, it works similarly to the concatenate function 'c()'
# The difference is that 'list()' creates lists and 'c()' creates vectors.
# The syntax to retrieve elements from them differ.
list(A,B,C,D,E)
# Note that "[[ ]]" will indicate the position of an object in the list.
# Remember: "[ ]" indicates the position of an element in a vector.
# "( )" are used to call functions and to group expressions.
# "{ }" are used to program loops and functions.
global.list <- list(A,B,C,D,E)
global.list
# To call an object in a list, use the following notation list[[position]]
# For example:
global.list[[3]] # Third object [3] in list "global.list"
# You can also name objects in the list:
names(global.list) <- c("number", "string", "vector", "matrix", "data.frame")
## And then call them with $
global.list$vector
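The difference between "[ ]" and "[[ ]]" on a list is easy to miss, so here is a small sketch with a hypothetical list my.list: single brackets return a shorter list, double brackets return the stored object itself.

```r
# Hypothetical list used only for this illustration
my.list <- list(number = 5, vector = 1:3)

class(my.list["vector"])    # "list": single brackets return a sub-list
class(my.list[["vector"]])  # "integer": double brackets return the element
my.list[["vector"]][2]      # second element of the stored vector
identical(my.list$vector, my.list[["vector"]])
```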
Many R objects have a class attribute, a character vector giving the names of the classes an object belongs to.
To know the type or “class” of an object, you can use the function class() Syntax: class(object)
class(A)
class(B)
class(C)
class(D)
class(E)
class(global.list)
# Note that in your R-script, some classes may have a different color
# Note: This varies according to the appearance settings you choose!
#(Go to Tools > Global options > Appearance)
# Numbers : Orange
3
# Strings : Green
"R workshop"
# Functions : White
mean(C)
# Object names : White
C
# You can change the class of an object by using some of the following commands
# This often comes in handy when reading in a dataset from another format, like Excel.
# as.numeric() : converts a string variable to numeric.
# as.character(): converts a numeric variable to string.
# as.vector() : converts a numeric or string matrix to a vector.
# as.matrix() : converts a numeric or string vector to a matrix.
# as.factor() : converts a numeric or string variable to a categorical variable.
# Examples:
A
as.character(A) # Numeric variable expressed as a string.
D
as.vector(D) # A 2x2 matrix expressed as a column vector of length 4.
C
as.matrix(C) # A column vector of length 10 expressed as a 10x1 matrix
E$treatment # This is a numeric variable.
as.factor(E$treatment) # A numeric variable expressed as a categorical variable.
E$sex # A categorical variable (stored as character since R 4.0.0, as a factor before).
as.numeric(as.factor(E$sex)) # A categorical variable forced to a numeric variable (the factor codes).
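A common surprise when converting classes: as.numeric() applied to a factor returns the internal codes, not the labels. The sketch below uses a hypothetical factor f whose labels look like numbers:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10"))

as.numeric(f)                # internal codes: 1 2 1
as.numeric(as.character(f))  # the labels as numbers: 10 20 10
levels(f)                    # the distinct labels
```

Going through as.character() first is the usual way to recover the original numbers.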
To keep track of the objects you’ve created so far use the function ls()
The function ls() “lists” all user-defined objects in the workspace.
ls()
### Advanced search:
ls(pattern ="A")
# To remove an object from the working space, use 'remove()' or 'rm()'
# Syntax: rm(object)
# For example, let's remove object A
remove(A)
ls()
# To remove all objects in the workspace, type 'rm(list=ls())'
# or choose "Clear Workspace" in the drop down menu
rm(list = ls())
ls()
# Let's bring our objects back
A <- 5
B <- "R workshop"
C <- c(1:10)
D <- matrix(c(10,20,30,40), nrow=2,ncol=2)
E <- data.frame(
age = c(20,24,26,23,29),
sex = c("Female","Male","Female","Male","Female"),
treatment = c(1,0,0,1,1),
income = c(1000,1500,2000,2500,3000)
)
ls()
Pair up and answer the following questions:
Assign the numbers 1 to 1000 to a variable ‘x’. The reciprocal of a number is obtained by dividing 1 by the number (1/number). Define y as the reciprocal of x.
Calculate the inverse tangent (that is, arctan) of y and assign the result to a variable w. Hint: take a look at the ?Trig help page to find a function that calculates the inverse tangent.
Assign to a variable z the reciprocal of the tangent of w.
Compare z and x using a logical statement. Compare the first element of x and the first element of z. Before running the command, think about what we should expect.
Note that not all elements are equal even though they seem to be. Now compare the elements using the function identical(). Then, use the function all.equal(). Again, first read about these functions using help.
Some built-in functions expect their data as a single vector rather than as separate arguments. Try the following
mean(1:5)
mean(1,2,3,4,5)
mean(c(1,2,3,4,5))
Explain the different results.
Take a look at the diag() function. Create a 21-by-21 matrix with the sequence 10 to 0 to 10 (i.e. 10, 9, …, 0, 1, …, 10) on the diagonal. The rest of the elements should be zero.
What is the length of the following list: list(a = 2, list(b = 2, g = 3, d = 4), e = NULL)
Explain your answer
Functions are used to do things with data. We can think about them as verbs rather than nouns.
First, let's take a look at some built-in functions
# Let's take our vector C
C
# These are some common functions for numeric vectors
mean(C) # mean
sd(C) # standard deviation
var(C) # variance
max(C) # maximum
min(C) # minimum
median(C) # median
sum(C) # sum
prod(C) # product
quantile(C,probs=0.5) # quantiles
length(C) # length of the vector
range(C) # range
# These functions perform element-wise operations
log(C) # logarithm
exp(C) # exponential
sqrt(C) # square root
Matrix operators
# Let's work with our matrix D
D
t(D) # Transpose of a matrix t()
# Note the following difference
D
D*D # Element wise multiplication
D^2 # Element wise exponentiation
D%*%D # Matrix multiplication
D%o%D # Outer product
# These are some common functions for matrices
D
rowSums(D) # Row sums
colSums(D) # Column sums
rowMeans(D) # Row means
colMeans(D) # Columns means
diag(D) # Diagonal of a matrix
solve(D) # Inverse of a matrix
cov(D) # Variance covariance matrix
cor(D) # Correlation matrix
# Let's work with our string variable B
B <- "R workshop"
B
# These are some common functions for strings
paste(B,"2016", sep = ",") # Concatenates two or more string vectors.
# Syntax: paste(string1, string2, sep = separator).
substr(B, 1, 6) # Substrings in a character vector.
# Syntax: substr(string,start,stop).
strsplit(B,"work") # Splits a string according to a substring.
# Syntax: strsplit(string,split).
grep("work", B) # Returns the positions of the elements that match a pattern or regular expression.
# Syntax: grep(pattern, string).
# Is the word "work" in B? Use grepl("work", B) for a TRUE/FALSE answer.
gsub("workshop","awesome workshop",B) # Replaces every match of a regular expression.
# Syntax: gsub(pattern, replacement, string).
tolower(B) # Converts a string to lowercase.
toupper(B) # Converts a string to upper case.
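The string functions above also work element by element on longer character vectors. A brief sketch with a hypothetical vector titles, contrasting grep() (positions) with grepl() (TRUE/FALSE):

```r
# Hypothetical character vector
titles <- c("R workshop", "Stata seminar", "Python workshop")

grep("workshop", titles)                 # positions of the matches
grepl("workshop", titles)                # one TRUE/FALSE per element
grep("workshop", titles, value = TRUE)   # the matching strings themselves
sub("workshop", "class", titles)         # replaces the first match in each string
nchar(titles)                            # number of characters in each string
```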
We will have a full session on datasets. But here are some common functions for data frames.
# Let's call our data frame E
E
dim(E) # Dimensions of the data frame. Syntax: dim(x)
head(E,3) # Shows first n rows. Syntax: head(x,n)
tail(E,3) # Shows last n rows. Syntax: tail(x,n)
str(E) # Displays the structure of an object. Syntax: str(x)
summary(E) # Displays summary statistics. Syntax: summary(x)
# You can use the following commands to browse and edit your data
fix(E) # Opens a database for browsing and editing
edit(E) # Opens the database for editing
View(E) #Opens a separate window
TIP: Although it might take more effort and time, we strongly recommend editing your data from the R script. You will be able to keep track of all the steps taken to clean your data and your results will be replicable by others.
To create your own function, assign it to a name like any other object, listing its inputs as parameters.
Syntax: function(element1, element2, …){ statements return(object) }
## Let's write a function that returns the square root of the sum of two squares
hypotenuse <- function(x, y){
  sqrt(x^2 + y^2)
}
## This should work with different parameters
hypotenuse(2,3)
hypotenuse(5,5)
### As with built-in functions we can pass vectors:
normalize <- function(z, m = mean(z), s = sd(z)){
  (z - m) / s
}
normalize(c(1,3,6,10,15))
## We can keep track of each step with print
normalize <- function(z, m = mean(z), s = sd(z)){
  print(m); print(s)
  (z - m) / s
}
normalize(c(1,3,6,10,15))
### Note that if one element is missing, every element of the result will be missing too:
normalize(c(1,3,6,10,NA))
### We can tell R to remove missing values before the calculations:
normalize <- function(z, m = mean(z, na.rm = TRUE), s = sd(z, na.rm = TRUE)){
  (z - m) / s
}
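A minimal sketch of how the na.rm version behaves on data with a missing value. The function is defined again under a hypothetical name, normalize.na, only so that the example runs on its own:

```r
# Same idea as normalize() above, under a hypothetical name so that
# this sketch is self-contained
normalize.na <- function(z, m = mean(z, na.rm = TRUE), s = sd(z, na.rm = TRUE)) {
  (z - m) / s
}

with.na <- c(1, 3, 6, 10, NA)
result <- normalize.na(with.na)
result

is.na(result)               # only the missing input stays missing
mean(result, na.rm = TRUE)  # approximately 0
sd(result, na.rm = TRUE)    # approximately 1
```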
## The functions above contain a single statement. When a function has several statements, it is convenient to use 'return'
bad.function <- function(x, y) {
  z1 <- 2*x + y
  z2 <- x + 2*y
  z3 <- 2*x + 2*y
  z4 <- x/y
}
bad.function(1, 2) # Returns only the last value (z4), and invisibly, so nothing is printed
good.function <- function(x, y) {
  z1 <- 2*x + y
  z2 <- x + 2*y
  z3 <- 2*x + 2*y
  z4 <- x/y
  return(c(z1, z2, z3, z4))
}
good.function(1,2) # Returns all four values
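A function whose last expression is an assignment still returns a value, but invisibly. The sketch below uses two hypothetical functions, quiet() and loud(), to show the difference:

```r
# Hypothetical functions used only for this illustration
quiet <- function(x) {
  y <- x * 2        # the last expression is an assignment
}
loud <- function(x) {
  y <- x * 2
  y                 # the last expression is the value itself
}

quiet(3)            # prints nothing
out <- quiet(3)     # but the value is still returned
out
(quiet(3))          # parentheses force printing
loud(3)             # prints 6
```

Ending a function with the object name, or with an explicit return(), avoids the silent case.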
NOTE: R runs a script from top to bottom, so a function can only be called after its definition has been run (unless you customize your environment). Try to keep all functions at the beginning of your code.
Create a function that takes a vector c(1,2,3,4,5) and returns a vector containing the mean, the sum, the standard deviation and the median.
Now, modify your function to return a list instead of a vector.
Create a function that receives the matrix D defined above. The result should be the sum of the diagonal of the inner product.
Create a function that receives an input of integers and returns a logical vector that is TRUE when the input is even and FALSE when it is odd. HINT: Remember that %% is the operator for the remainder of a division
There is a vast online library of functions in R created by other users.
These functions come in “packages”, which are collections of objects including databases, functions, models, and compiled code.
Some packages are already installed in your computer and contain baseline functions and data. Other functions need to be downloaded from the Comprehensive R Archive Network (CRAN). CRAN has close to 6,000 packages available for download.
For example, the package “foreign” includes a function ‘read.dta()’ that allows users to read Stata datasets (.dta files).
Until this package has been installed and loaded, there is no help file available for the function ‘read.dta’.
TRY:
?read.dta
help(read.dta)
To install a package, use the function ‘install.packages()’. Syntax: install.packages("package.name")
install.packages("foreign")
You will have to select a server or CRAN mirror from which the package will be downloaded to your computer.
Choose the server USA (CA 2) for faster downloads.
You can also use the Package Installer interface located in the drop down menu “Packages & Data”
NOTE: Some packages use functions from other packages. When installing a package make sure to install all dependencies as well by selecting the “Install Dependencies” option.
The directory where packages are stored is called the library.
To get the location of the library in your computer, type ‘.libPaths()’
.libPaths()
To see all the packages installed in your computer, call the library by typing ‘library()’
library()
NOTE: A package must be installed in your computer ONCE.
Nevertheless, you have to call or “load” a package into EACH R session you are going to use it.
In other words, you have to select the packages that will be active during your session.
This will avoid confusion on variable, data, and function names.
To load a package installed in your computer, use the function ‘library()’
Syntax: ‘library(“package.name”)’.
Go ahead and type:
library("foreign")
Once a package is installed and loaded, its functions, data, and code are available for use. Try:
?read.dta
Right now we don’t have Stata datasets but we will create some in the next session.
You can also use the Package Manager interface located in the drop down menu “Packages & Data”
Simply click on the packages to be loaded into your session.
TIP: We recommend installing and loading packages from your script.
This is especially true when running in BATCH mode, sharing code, and for replication purposes.
To see all the packages loaded into your session, use the function ‘search()’.
search()
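When installing and loading from a script, one possible pattern is to install a package only if it is missing. The helper name load.or.install is hypothetical; requireNamespace(), install.packages() and library() are standard functions. The example call uses "tools", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load.or.install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  # Report whether the package now appears in the search path
  paste0("package:", pkg) %in% search()
}

# "tools" ships with every R installation, so nothing is downloaded here;
# replace it with "foreign" or any other CRAN package name
load.or.install("tools")
```

This keeps a shared script from reinstalling packages every time it runs.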
There are different types of help files for packages.
library(help="foreign")
A help file is structured as follows:
help(package="foreign")
The pdf version of the help documentation for package “foreign” is available at: https://cran.r-project.org/web/packages/foreign/foreign.pdf
TIP: We recommend reading the pdf version of the help documentation when getting familiar with a package.
“foreign”
“xlsx”
“Zelig”
“dplyr”
Go to the full documentation for dplyr (https://cran.r-project.org/web/packages/dplyr/dplyr.pdf). Read the entry for the dplyr function ‘select’. Discuss the logic of this function.
Go to the full documentation of foreign (https://cran.r-project.org/web/packages/foreign/foreign.pdf). Take a look at the different formats you can read with this package.