Source file ⇒ 2017-lec10.Rmd
In the uniform distribution every number in the interval is equally likely. To create a vector of 5 numbers drawn from the uniform distribution on the interval 0 to 1, use the code:
runif(n=5,min=0,max=1)
## [1] 0.6753299 0.4565588 0.9966289 0.5636669 0.3566742
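Because runif() draws random numbers, the values above will differ on every run. A minimal sketch, assuming you want reproducible draws, is to call set.seed() first. The seed 133 is an arbitrary choice for illustration and does not come from the lecture.

```r
# Fix the random seed so the same numbers are drawn each time.
set.seed(133)            # 133 is an arbitrary seed chosen for illustration
draws_a <- runif(n = 5, min = 0, max = 1)

set.seed(133)            # resetting the seed replays the same sequence
draws_b <- runif(n = 5, min = 0, max = 1)

identical(draws_a, draws_b)            # TRUE: same seed, same draws
all(draws_a >= 0 & draws_a <= 1)       # TRUE: every draw lies in [0, 1]
```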
R has a special mechanism for allowing you to use the same name in different places in your code and have it refer to different objects.
This means you can create new variables in your functions without worrying about whether variables with the same name already exist in the workspace.
For example:
w <- 3
my_func=function(x,y,z){
w <- x^2
print(w)
}
my_func(2,3,4)
## [1] 4
w
## [1] 3
What is happening here is that the w inside the function my_func is a separate variable from the w outside my_func. Because it is a separate variable, assigning to w inside the function doesn’t change w outside of the function.
Variables declared inside a function are local to that function. For instance:
foo <- function() {
bar <- 1
}
foo()
bar
gives the error Error: object 'bar' not found
If you want to make bar a global variable, use the <<- operator:
foo <- function() {
bar <<- 1
}
foo()
bar
## [1] 1
In this case bar is accessible from outside the function. Using global variables isn’t recommended because they can cause bugs that are very hard to fix.
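One way to avoid a global variable, sketched here under the assumption that the function only needs to hand back a single value, is to return the value and assign it where the function is called. The names make_bar, bar_local and bar_value are hypothetical.

```r
# Instead of writing to the global environment with <<-,
# return the value and let the caller decide where to store it.
make_bar <- function() {
  bar_local <- 1     # local to this call; gone once the function returns
  bar_local          # the last expression is the return value
}

bar_value <- make_bar()
bar_value
exists("bar_local")  # FALSE: nothing leaked out of the function
```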
To understand the scope of named objects better we need to discuss environments.
The environment of your console is called the global environment.
environment()
## <environment: R_GlobalEnv>
When you call a function, R creates a new workspace containing just the variables defined by the arguments of that function. This collection of variables is called a frame.
We can list the variables in the frame using ls():
a <- 5
my_func=function(x,y,z){
w <- x^2
print(ls()) #if not the last command in the function you need print() to see the output
print(a)
}
my_func(2,3,4)
## [1] "w" "x" "y" "z"
## [1] 5
head(ls()) #Global environment has lots of variables!
## [1] "a" "bar" "foo" "my_func"
## [5] "show_answers" "w"
When R looks for a named object, it first looks in the current environment; if a matching name is found, the corresponding value is returned. If the name isn’t found, R looks in the next environment down the tree. An environment is just a frame (a collection of variables) plus a pointer to the next environment down the tree to look in.
If R doesn’t find the name of an object it needs in the environment defined by a function, the “next environment to look in” is called the enclosing environment. The enclosing environment of a function is the environment where it was created. Most of the time the enclosing environment is the global environment (i.e. you defined the function in the global environment).
The environment inside of a function is called the evaluation environment. These stack on top of the global environment. In the above example, a was needed inside the function’s evaluation environment but was not defined there, so R had to search for the value of a in the function’s enclosing environment (i.e. the global environment).
f1 <- function() {
f2 <- function(){
env <- environment()
list(env, parent.env(env), parent.env(parent.env(env)))
}
f2()
}
f1()
## [[1]]
## <environment: 0x7f843047d0a8>
##
## [[2]]
## <environment: 0x7f843047d1c0>
##
## [[3]]
## <environment: R_GlobalEnv>
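The enclosing environment matters most when a function is created inside another function. The following is a small sketch (make_adder and add_two are hypothetical names, not from the lecture) in which the inner function finds n in its enclosing environment rather than in the global environment.

```r
# make_adder() returns a function whose enclosing environment is the
# evaluation environment of that particular call to make_adder().
make_adder <- function(n) {
  function(x) x + n      # n is not local here; R finds it in the enclosing environment
}

add_two <- make_adder(2)
add_two(10)                                  # 12

# The enclosing environment of add_two is not the global environment ...
identical(environment(add_two), globalenv()) # FALSE
# ... and it holds the n that was passed to make_adder().
get("n", envir = environment(add_two))       # 2
```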
The environments form a tree with the empty environment at the bottom. Here the top of the tree is the global environment but it doesn’t need to be.
tree() #you can see how I wrote the function tree() in the source code for this lecture.
## + R_GlobalEnv
## + package:DataComputing
## + package:ggplot2
## + package:dplyr
## + package:stats
## + package:graphics
## + package:grDevices
## + package:utils
## + package:datasets
## + package:methods
## + Autoloads
## + base
## + R_EmptyEnv
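The source of tree() is not reproduced here. As a rough sketch of the same idea, assuming all we want is the names of the environments from the global environment down to the empty environment, one can follow parent.env() repeatedly. The helper env_chain is hypothetical and is not the tree() function used above.

```r
# Collect the names of all environments reachable from `start`
# by repeatedly following parent.env() until the empty environment.
env_chain <- function(start = globalenv()) {
  env <- start
  chain <- character(0)
  while (!identical(env, emptyenv())) {
    chain <- c(chain, environmentName(env))
    env <- parent.env(env)
  }
  c(chain, environmentName(emptyenv()))   # finish with "R_EmptyEnv"
}

chain <- env_chain()
head(chain, 1)   # "R_GlobalEnv"
tail(chain, 2)   # "base" "R_EmptyEnv"
```

The entries in between depend on which packages are attached in your session.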
If R reaches the global environment and still can’t find the name of an object it needs, it continues down the tree through a list of additional environments, which hold the functions of attached packages and user-attached data.
You can see the list of environments, starting with the global environment, by typing search() in the console.
search()
## [1] ".GlobalEnv" "package:DataComputing"
## [3] "package:ggplot2" "package:dplyr"
## [5] "package:stats" "package:graphics"
## [7] "package:grDevices" "package:utils"
## [9] "package:datasets" "package:methods"
## [11] "Autoloads" "package:base"
You can have, for example, several objects named pi in different environments.
pi <- 3
base::pi
## [1] 3.141593
pi
## [1] 3
rm(pi)
pi
## [1] 3.141593
We can “attach” a new environment containing a data table to our tree using attach(). The new entry is inserted into the environment tree at the position given by the pos argument of attach(). Because this parameter defaults to pos = 2L, most of the time we attach just underneath the global environment:
attach(mtcars)
tree()
## + R_GlobalEnv
## + mtcars
## + package:DataComputing
## + package:ggplot2
## + package:dplyr
## + package:stats
## + package:graphics
## + package:grDevices
## + package:utils
## + package:datasets
## + package:methods
## + Autoloads
## + base
## + R_EmptyEnv
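An attached data table stays on the search path until it is removed with detach(). A minimal sketch, using a small made-up data frame called scores (a hypothetical name) rather than mtcars so that the tree shown above is not disturbed:

```r
# A small, made-up data frame used only for this illustration.
scores <- data.frame(student = c("a", "b", "c"), marks = c(7, 9, 8))

attach(scores)               # inserted at position 2 of the search path
search()[2]                  # "scores"
mean(marks)                  # 8: the column is found via the search path

detach(scores)               # remove it again
"scores" %in% search()       # FALSE

# with() gives the same convenience without changing the search path.
with(scores, mean(marks))    # 8
```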
When loading libraries, the function library() works on a similar basis and uses the same default, pos = 2L:
library(MASS)
tree()
## + R_GlobalEnv
## + package:MASS
## + mtcars
## + package:DataComputing
## + package:ggplot2
## + package:dplyr
## + package:stats
## + package:graphics
## + package:grDevices
## + package:utils
## + package:datasets
## + package:methods
## + Autoloads
## + base
## + R_EmptyEnv
Do questions 1 and 2:
http://gandalf.berkeley.edu:3838/alucas/Lecture-10-collection/
The term passing a variable to a function is used when a previously defined variable is an argument of a function call.
For example:
myAge <- 14
month <- 1
calculateBirthYear <- function(yourAge){
2016-yourAge
}
calculateBirthYear(myAge)
## [1] 2002
The variable myAge is passed to the function calculateBirthYear. There are two ways the variable myAge could have been passed to the function. The terms “pass by value” and “pass by reference” are used to describe how variables are passed on. Briefly, pass by value means the actual value is passed on. Pass by reference means an address is passed on, which defines where the value is stored.
To understand how passing variables to functions works it helps to know a little about how objects are stored in the memory of your computer.
To make it simple, let’s think of memory as many blocks which are next to each other. Each block has a number (the memory address). If you define a variable in your code, the value of the variable will be stored somewhere in the memory (your operating system will automatically decide where the best storage place is). The illustration below shows a part of some memory. The gray numbers on top of each block show the address of the block in memory, and the colored numbers at the bottom show values which are stored in memory.
memory:
The variables myAge and month are defined in your code, and they will be stored in memory as shown in the illustration above. For example, the value of myAge is stored at the address 106 and the value of month is stored at the address 113.
Passing by value means that the value of the function parameter is copied into another location of your memory, and when accessing or modifying the variable within your function, only the copy is accessed/modified and the original value is left untouched. Passing by value is how your values are passed in R.
The following example shows a variable passed by value:
myAge <- 14
calculateBirthYear(myAge)
memory:
As soon as your software starts processing the calculateBirthYear function, the value myAge is copied to somewhere else in your computer’s memory. To make this more clear, the variable within the function is named age in this example.
calculateBirthYear <- function(age){
birthYear <- 2017-age
birthYear
}
Everything that happens to age from now on does not affect the value of myAge (which is outside the scope of calculateBirthYear) at all.
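This behaviour can be checked directly, also for individual elements of a vector. A minimal sketch (zero_first and ages are hypothetical names):

```r
# Changing an element of an argument inside a function only affects
# the function's own copy of the vector.
zero_first <- function(v) {
  v[1] <- 0     # modifies the local v, not the caller's vector
  v
}

ages <- c(14, 15, 16)
zero_first(ages)   # 0 15 16: the modified copy is returned
ages               # 14 15 16: the original is untouched
```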
So how do you modify/update a variable outside of your function? When passing a variable by value, the only way to update the source variable is by using the return value of the function.
A very simple example:
increaseAge <- function(age) age+1
myAge <- 14
myAge <- increaseAge(myAge)
myAge
## [1] 15
memory:
The variable myAge now holds the value 15
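Passing by reference means that the memory address of the variable (a pointer to the memory location) is passed to the function. This is unlike passing by value, where the value of a variable is passed on. R passes arguments by value in the sense described above. Python passes references to objects, so a function can change a mutable object (such as a list) that it receives. In C, arguments are passed by value, but passing a pointer gives the effect of passing by reference.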
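One caveat, offered as an aside rather than as part of the lecture: environments are an exception to R's copying behaviour. An environment handed to a function is not copied, so changes made inside the function are visible outside. A minimal sketch with hypothetical names:

```r
# Ordinary values behave as if passed by value; environments do not.
set_age <- function(person, new_age) {
  person$age <- new_age   # for a list this changes a local copy only
  invisible(NULL)
}

person_list <- list(age = 14)
person_env  <- new.env()
person_env$age <- 14

set_age(person_list, 15)
set_age(person_env, 15)

person_list$age   # 14: the list outside the function is unchanged
person_env$age    # 15: the environment was modified in place
```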
Answer: 0
math_magic <- function(a,b=1){
if(b==0){
return(0)
}
a*b +a/b
}
math_magic(4,0)
## [1] 0
Answer: 7
increment <- function(x, inc=1) {
x <- x+inc
x
}
count <- 5
count <- increment(count,2)
count
## [1] 7
lapply() and sapply() (chapter 4 of DataCamp’s Intermediate R)
Here is a function to convert Fahrenheit to Celsius:
to_celsius <- function(x) {
(x-32)*5/9
}
The function to_celsius happens to be a vectorized function:
to_celsius(c(32, 40, 50, 60, 70))
## [1] 0.000000 4.444444 10.000000 15.555556 21.111111
Here is another example of a vectorized function:
square_me <- function(vec) vec^2
square_me(c(1,2,3))
## [1] 1 4 9
What happens if we feed to_celsius() a list?
list(32,40,50,70)
## [[1]]
## [1] 32
##
## [[2]]
## [1] 40
##
## [[3]]
## [1] 50
##
## [[4]]
## [1] 70
# trying to_celsius() on a list
to_celsius(list(32, 40, 50, 60, 70))
Outputs: Error in x - 32 : non-numeric argument to binary operator
to_celsius() does not work with a list.
One solution is to use a for loop:
temps_fahrenheit <- list(32, 40, 50, 60, 70)
temps_celsius <- c()
for(temp in temps_fahrenheit){
temps_celsius <- c(temps_celsius,to_celsius(temp))
}
temps_celsius
## [1] 0.000000 4.444444 10.000000 15.555556 21.111111
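Growing temps_celsius with c() inside the loop copies the vector on every iteration. A common variation, sketched here with the hypothetical names temps and temps_celsius_pre, allocates the result once and fills it by index. The inputs are defined again so the block can be run on its own.

```r
to_celsius <- function(x) (x - 32) * 5 / 9
temps <- list(32, 40, 50, 60, 70)

# Allocate the full result vector up front, then fill it position by position.
temps_celsius_pre <- numeric(length(temps))
for (i in seq_along(temps)) {
  temps_celsius_pre[i] <- to_celsius(temps[[i]])
}
temps_celsius_pre
```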
R provides a set of functions to “vectorize” functions over the elements of lists: lapply(), sapply() and vapply().
These functions allow us to avoid writing explicit loops and often give more concise, readable code.
The simplest apply function is lapply().
lapply() stands for list apply. It takes a list or vector and a function as inputs and returns a list.
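Applied to the temperature list from before, lapply() replaces the loop with a single call. A short sketch (the definitions are repeated so the block runs on its own; temps and celsius_list are hypothetical names):

```r
to_celsius <- function(x) (x - 32) * 5 / 9
temps <- list(32, 40, 50, 60, 70)

celsius_list <- lapply(temps, to_celsius)   # one list element per input element
class(celsius_list)                         # "list"
unlist(celsius_list)                        # flatten to a numeric vector
```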
The function class() gives the class of an object. For example,
class(c(1,2,3))
## [1] "numeric"
Suppose we want to know the class of the elements of the following list.
nyc <- list(pop = 8404837, boroughs = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"), capital = FALSE)
Instead of writing a for loop to do this you can use lapply()
nyc %>% lapply(class)
## $pop
## [1] "numeric"
##
## $boroughs
## [1] "character"
##
## $capital
## [1] "logical"
The output of lapply is a list.
The function length() gives the number of elements of a vector. For example,
length(c(1,2,3))
## [1] 3
players <- list(
warriors = c('kurry', 'iguodala', 'thompson', 'green'),
cavaliers = c('james', 'shumpert', 'thompson'),
rockets = c('harden', 'howard')
)
players %>% lapply(length)
## $warriors
## [1] 4
##
## $cavaliers
## [1] 3
##
## $rockets
## [1] 2
lapply() applies the function length() to each element of the list. The output is another list.
What about functions that take more than one argument? For example,
paste() concatenates the elements of a vector, connecting them with the value of the argument collapse:
paste(c("stat", 133, "go"), collapse= "-")
## [1] "stat-133-go"
After the function argument in lapply() we put the extra arguments of the function.
players %>% lapply(paste, collapse = '-')
## $warriors
## [1] "kurry-iguodala-thompson-green"
##
## $cavaliers
## [1] "james-shumpert-thompson"
##
## $rockets
## [1] "harden-howard"
You can use your own functions in lapply()
num_chars <- function(x) {
nchar(x)
}
lapply(players, num_chars)
## $warriors
## [1] 5 8 8 5
##
## $cavaliers
## [1] 5 8 8
##
## $rockets
## [1] 6 6
num_chars1 <- function(x,y) {
nchar(x) + y
}
players %>% lapply(num_chars1, 3)
## $warriors
## [1] 8 11 11 8
##
## $cavaliers
## [1] 8 11 11
##
## $rockets
## [1] 9 9
# lapply(players, num_chars1, 3) is an equivalent way to write this
You can define a function with no name (an “anonymous” function)
1:3 %>% lapply(function(x) x^2)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
Another example:
The paste() function connects the string “mr” with each of the Warriors players’ last names.
paste("mr",c('kurry', 'iguodala', 'thompson', 'green'))
## [1] "mr kurry" "mr iguodala" "mr thompson" "mr green"
To apply paste() to the list of players:
players %>% lapply( function(x) paste("mr",x))
## $warriors
## [1] "mr kurry" "mr iguodala" "mr thompson" "mr green"
##
## $cavaliers
## [1] "mr james" "mr shumpert" "mr thompson"
##
## $rockets
## [1] "mr harden" "mr howard"
Remember that a data.frame is internally stored as a list with one element per column. (A matrix, by contrast, is stored as a vector with a dim attribute.)
df <- data.frame(
name = c('Luke', 'Leia', 'R2-D2', 'C-3PO'),
gender = c('male', 'female', 'male', 'male'),
height = c(1.72, 1.50, 0.96, 1.67),
weight = c(77, 49, 32, 75)
)
df %>% lapply(class)
## $name
## [1] "factor"
##
## $gender
## [1] "factor"
##
## $height
## [1] "numeric"
##
## $weight
## [1] "numeric"
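Because a data frame is a list of columns, list tools work on it directly. A small sketch using a made-up two-column data frame (droids is a hypothetical name):

```r
droids <- data.frame(height = c(0.96, 1.67), weight = c(32, 75))

is.list(droids)         # TRUE: a data frame is a list of columns
length(droids)          # 2: one list element per column
lapply(droids, mean)    # column means, returned as a named list
```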
Do questions 3 and 4:
http://gandalf.berkeley.edu:3838/alucas/Lecture-10-collection/
sapply() is a modified version of lapply().
sapply() stands for simplified apply and will output the result as an array if possible.
1:3 %>% sapply(function(x) x^2)
## [1] 1 4 9
Here we output a one-dimensional array (i.e. a vector). Notice this is the same as
1:3 %>% lapply(function(x) x^2) %>% unlist()
## [1] 1 4 9
We have seen examples where the list output by lapply() has elements that are vectors of different sizes. It isn’t possible to coerce such a list to an array, since all the rows of an array must be the same size. However, if the output of lapply() has vectors of equal length, we can display our results as an array.
For example:
first_and_last <- function(name){
name <- gsub(" ","",name)
letters <- strsplit(name,split = "")[[1]]
c(first=min(letters), last=max(letters))
}
first_and_last("New York")
## first last
## "e" "Y"
Now let’s apply this function to the following vector of cities.
cities <- c("New York", "Paris", "London", "Tokyo", "Rio de Janeiro", "Cape Town")
cities %>% sapply(first_and_last)
## New York Paris London Tokyo Rio de Janeiro Cape Town
## first "e" "a" "d" "k" "a" "a"
## last "Y" "s" "o" "y" "R" "w"
Notice that the output is a two dimensional array with meaningful row and column names.
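When the results have different lengths, sapply() cannot build an array and returns a list instead, just as lapply() would. A minimal sketch with a made-up list (ragged, res and same_len are hypothetical names):

```r
ragged <- list(a = 1:3, b = 1:2)

res <- sapply(ragged, function(v) v * 2)
class(res)      # "list": the results have lengths 3 and 2, so no simplification
res$a           # 2 4 6

same_len <- sapply(list(a = 1:2, b = 3:4), function(v) v * 2)
dim(same_len)   # 2 2: equal-length results are bound into a matrix
```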
In DataCamp you will learn about vapply(), which produces the same output as sapply() but requires you to specify the type and length of each result ahead of time. This makes it safer to use, and it can also be somewhat faster.
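As a preview, here is a hedged sketch of vapply(). Its FUN.VALUE argument is a template for what each result must look like; if a result does not match, vapply() stops with an error instead of silently changing the shape of the output. The players list is repeated so the block runs on its own, and team_sizes and bad are hypothetical names.

```r
players <- list(
  warriors  = c("kurry", "iguodala", "thompson", "green"),
  cavaliers = c("james", "shumpert", "thompson"),
  rockets   = c("harden", "howard")
)

# Each result must be a single integer, as declared by FUN.VALUE.
team_sizes <- vapply(players, length, FUN.VALUE = integer(1))
team_sizes   # named integer vector: warriors 4, cavaliers 3, rockets 2

# nchar() returns one number per player, not one per team, so this fails.
bad <- tryCatch(
  vapply(players, nchar, FUN.VALUE = integer(1)),
  error = function(e) "error"
)
bad          # "error"
```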
Next time: Efficient programming, distributions and statistics