R excels at letting users work with data interactively and fluidly. This is achieved through an interactive (line-by-line) interpreter, flexible data structures, and an expressive syntax. Building on the basics of data types, vectors, and subsetting, we can use the functions and methods presented here to mold data into the form needed for further analysis and modeling. This is by no means an exhaustive or even representative overview of data processing functions and methods, but it covers some of the fundamentals.
This overview focuses on functions in the base installation of R. Another overview will briefly introduce data processing within the ‘Tidyverse’, a set of packages and functions designed explicitly for processing, tidying, and visualizing data in R.
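Introduced here are operators, control structures, functions (both built-in and user-defined), and the apply family of functions.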
R contains a series of mathematical, logical, relational, and other operators that are the foundation of working with data. An operator is typically a one- or two-character function that performs a comparative or arithmetic operation on data objects. The most basic examples are + for addition, > and < to test relative magnitude (greater than or less than), == to test equality, and <- to assign a value to an object. Below are examples of a few of these.
R has all the functionality needed to work as a massively expanded calculator. A wide range of arithmetic operators and functions can be used on numeric objects.
2 + 3 # addition
[1] 5
3 - 5 # subtraction
[1] -2
3 * 3 # multiplication
[1] 9
3^2 # raise to power of
[1] 9
34 %% 15 # modulo
[1] 4
sqrt(532) # square root
[1] 23.06513
log(1003) # natural log
[1] 6.910751
exp(34) # exponentiate
[1] 5.834617e+14
sum(1:10) # sum
[1] 55
mean(1:10) # mean
[1] 5.5
median(1:10) # median
[1] 5.5
var(1:10) # variance
[1] 9.166667
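The arithmetic operators and element-wise functions above (such as +, ^, %%, sqrt(), log(), and exp()) are vectorized, meaning that they are automatically applied to each element of a vector and return a vector of the same length. Summary functions such as sum(), mean(), median(), and var() instead reduce a vector to a single value.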
# build a data.frame to work on
set.seed(717)
df2 <- data.frame(col1 = rnorm(10,0,1),
                  col2 = rnorm(10,4,0.5),
                  col3 = sample(c(TRUE, FALSE),10,replace=TRUE),
                  col4 = letters[1:10],
                  stringsAsFactors = FALSE)
print(df2)
col1 col2 col3 col4
1 1.3982972 3.817427 TRUE a
2 0.6425140 4.025979 TRUE b
3 -0.1128888 3.601069 TRUE c
4 -0.2124540 3.997407 FALSE d
5 0.2063796 4.041947 TRUE e
6 0.2751510 4.793777 TRUE f
7 0.6880224 3.529871 TRUE g
8 1.2245582 3.920952 FALSE h
9 0.7984930 4.040930 TRUE i
10 -1.2060176 3.946755 TRUE j
df2$col5 <- df2$col1 + df2$col2 # addition
print(df2$col5)
[1] 5.215724 4.668493 3.488180 3.784953 4.248327 5.068928 4.217893
[8] 5.145510 4.839423 2.740737
df2$col5 <- df2$col1 - df2$col2 # subtraction
print(df2$col5)
[1] -2.419129 -3.383465 -3.713958 -4.209861 -3.835568 -4.518626 -2.841848
[8] -2.696394 -3.242437 -5.152773
df2$col5 <- df2$col1 / df2$col2 # division
print(df2$col5)
[1] 0.36629314 0.15959198 -0.03134870 -0.05314795 0.05105944
[6] 0.05739754 0.19491436 0.31231143 0.19760132 -0.30557195
df2$col5 <- df2$col1 * df2$col2 # multiplication
print(df2$col5)
[1] 5.3378967 2.5867482 -0.4065205 -0.8492651 0.8341753 1.3190126
[7] 2.4286302 4.8014338 3.2266541 -4.7598562
df2$col5 <- df2$col1 %% 0.2 # modulo
print(df2$col5)
[1] 0.198297165 0.042514022 0.087111162 0.187546021 0.006379555
[6] 0.075151006 0.088022448 0.024558150 0.198493031 0.193982374
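To make the difference between element-wise and summary behaviour concrete, here is a minimal sketch. The vectors `v` and `w` are hypothetical and used only for illustration; they are not part of the data above.

```r
# 'v' and 'w' are hypothetical vectors used only for illustration
v <- c(1, 2, 3, 4)
w <- c(10, 20, 30, 40)

elementwise_sum <- v + w   # pairs elements: 11 22 33 44
scaled          <- v * 2   # the single value 2 is recycled across v: 2 4 6 8
squared         <- v^2     # 1 4 9 16

total   <- sum(v)          # one value: 10
average <- mean(v)         # one value: 2.5

length(elementwise_sum)    # 4, same length as the inputs
length(total)              # 1, the vector was reduced to a single value
```

Operators keep the shape of their input, which is why their results can be assigned straight back into a data.frame column of the same length.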
Relational and logical operators return TRUE or FALSE for each element they evaluate, producing a logical vector of the same length as the input. That logical vector can then be used, directly or through which(), to subset a vector or data.frame. The resulting subset is usually shorter than the input.
df2$col4 == "b" # equals (returns logical)
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
df2[which(df2$col4 == "b"),] # using which() to return the matching rows
col1 col2 col3 col4 col5
2 0.642514 4.025979 TRUE b 0.04251402
df2[which(df2$col1 < -0.5), "col1"] # less than
[1] -1.206018
df2[which(df2$col1 >= -0.5), "col1"] # greater than or equal to
[1] 1.3982972 0.6425140 -0.1128888 -0.2124540 0.2063796 0.2751510
[7] 0.6880224 1.2245582 0.7984930
isTRUE(df2[1,"col3"]) # is a value TRUE
[1] TRUE
df2$col3 != TRUE # does not equal (returns logical T/F)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
df2$col4 %in% c("b","h","j") # within a set, returns logical
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
df2[which(df2$col4 %in% c("b","h","j")), "col4"] # returning values using which
[1] "b" "h" "j"
df2[which(df2$col1 < -0.5 | df2$col1 > 0.5), "col1"] # element wise or
[1] 1.3982972 0.6425140 0.6880224 1.2245582 0.7984930 -1.2060176
df2[which(df2$col1 < -0.5 & df2$col3 == TRUE), ] # element wise and
col1 col2 col3 col4 col5
10 -1.206018 3.946755 TRUE j 0.1939824
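One practical reason to wrap a comparison in which() is missing data. The sketch below uses a hypothetical vector `x` containing an NA to show how subsetting with a logical vector differs from subsetting with the positions returned by which().

```r
# 'x' is a hypothetical vector with one missing value
x <- c(-1.2, 0.4, NA, 2.5, -0.7)

keep <- x > 0                  # logical vector: FALSE TRUE NA TRUE FALSE
by_logical <- x[keep]          # an NA in the index yields an NA element: 0.4 NA 2.5
by_which   <- x[which(keep)]   # which() keeps only the TRUE positions: 0.4 2.5

which(keep)                    # integer positions 2 and 4
sum(keep, na.rm = TRUE)        # number of TRUE values: 2
```

With no missing values the two forms give the same result, so the choice mostly matters when NA values may be present.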
Control structures are a main tool in building a program. While operators allow you to manipulate data one function at a time (or across a vector), control structures allow you to add a logical flow to the operations. The basic structures demonstrated here are common to most programming languages, although the syntax varies. These structures can be nested to add increasing levels of control (and complexity). Grasping the utility of control structures is an important step in making the conceptual leap from R as a high-powered calculator to R as an environment for building an analysis. Once these are understood, even more powerful structures exist in R.
# if statement
if (condition) {
  # do something
}
# if else statement
if (condition) {
  # do something
} else {
  # or do something else
}
# for loop
for (variable in vector) {
  # do something for each variable
}
# while loop
while (condition) {
  # do something while condition is TRUE
}
# nested control structures
if (condition) {
  for (variable in vector) {
    # do something for each variable
  }
} else {
  if (condition) {
    # do something
  } else {
    # do something else
  }
}
# if statement
some_data <- rnorm(10,0,1)
if (mean(some_data) > 0) {
  some_new_data <- mean(some_data)
  print(mean(some_data))
}
[1] 0.4159586
# if else statement
some_data <- rnorm(10,0,1)
if (mean(some_data) > 0) {
  some_new_data <- mean(some_data)
  print(mean(some_data))
} else {
  some_new_data <- NULL
  print("Mean of data is less than zero")
}
[1] "Mean of data is less than zero"
# for loop
some_data <- rnorm(10,0,1)
for (i in seq_along(some_data)) {
  # square each element in turn and print it
  some_new_data <- some_data[i]^2
  print(some_new_data)
}
[1] 3.986782
[1] 2.973403
[1] 2.627902
[1] 0.5789869
[1] 0.1446968
[1] 2.11807
[1] 0.05899888
[1] 4.886508
[1] 0.7530598
[1] 0.1821432
# if/else nested in a for loop
some_data <- rnorm(10,0,1)
some_new_data <- NULL
for (i in seq_along(some_data)) {
  iter_data <- some_data[i]
  if (iter_data > 0) {
    some_new_data[i] <- iter_data^2
    print("Data positive")
  } else {
    some_new_data[i] <- abs(iter_data)^2
    print("Data < 0, applied abs()")
  }
}
[1] "Data positive"
[1] "Data positive"
[1] "Data < 0, applied abs()"
[1] "Data positive"
[1] "Data positive"
[1] "Data positive"
[1] "Data positive"
[1] "Data < 0, applied abs()"
[1] "Data positive"
[1] "Data positive"
print(some_new_data)
[1] 0.03677341 0.52403013 0.07776627 0.97219775 1.89653583 1.52338704
[7] 0.68885827 0.18850034 0.11198929 0.75448516
# while loop; be careful, a condition that never becomes FALSE will loop forever
i <- 0
while (i < 4) {
  new_value <- rbinom(1,1,0.5) # randomly 0 or 1
  i <- i + new_value
  print(paste0("Value of 'i' is: ", i))
}
[1] "Value of 'i' is: 0"
[1] "Value of 'i' is: 0"
[1] "Value of 'i' is: 0"
[1] "Value of 'i' is: 1"
[1] "Value of 'i' is: 2"
[1] "Value of 'i' is: 2"
[1] "Value of 'i' is: 2"
[1] "Value of 'i' is: 3"
[1] "Value of 'i' is: 4"
# vectorized version of above
some_new_data2 <- ifelse(some_data > 0, some_data^2, abs(some_data)^2)
print(some_new_data2)
[1] 0.03677341 0.52403013 0.07776627 0.97219775 1.89653583 1.52338704
[7] 0.68885827 0.18850034 0.11198929 0.75448516
identical(some_new_data, some_new_data2)
[1] TRUE
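A few defensive habits make loops safer. The sketch below is illustrative rather than definitive: `empty`, `values`, and `max_iter` are hypothetical names introduced here. It shows why seq_along() is safer than 1:length() when a vector might be empty, how to pre-allocate a result vector, and how to cap the number of iterations of a while loop.

```r
# seq_along() is safer than 1:length() when a vector may be empty
empty <- numeric(0)
bad_index  <- 1:length(empty)    # 1:0 counts down to give 1 0, so a loop would run twice
good_index <- seq_along(empty)   # integer(0), so a loop would not run at all

# pre-allocate the result instead of growing it from NULL
values  <- c(-2, 0.5, 3)         # hypothetical input
squares <- numeric(length(values))
for (i in seq_along(values)) {
  squares[i] <- values[i]^2
}

# guard a while loop with a maximum number of iterations
max_iter <- 1000                 # hypothetical safety limit
i <- 0
n_iter <- 0
while (i < 4 && n_iter < max_iter) {
  i <- i + rbinom(1, 1, 0.5)     # add 0 or 1 at random
  n_iter <- n_iter + 1
}
```

The iteration cap means the loop ends even if the random draws never push `i` up to 4, at the cost of having to check afterwards which condition stopped it.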
Both operators and control structures open the door to functional programming, but there are thousands of pre-built functions available in R and the package ecosystem, and an unlimited number of user-defined functions that you can create for your own needs. The real work of “learning R” is to learn what functions exist, how they work, how to write your own, and how to do so efficiently and defensively. By creating functions, you unlock the power and potential of R.
It is well beyond the scope of this class to cover the thousands of base functions, but a very small sample is presented. The live demo and example rmarkdown files will demonstrate many more.
df2$col5 <- df2$col1 - mean(df2$col1)
print(df2$col5)
[1] 1.02809167 0.27230853 -0.48309433 -0.58265947 -0.16382594
[6] -0.09505449 0.31781695 0.85435266 0.42828754 -1.57622312
df2$col5 <- scale(df2$col1, center = TRUE, scale = FALSE)
print(df2$col5)
[,1]
[1,] 1.02809167
[2,] 0.27230853
[3,] -0.48309433
[4,] -0.58265947
[5,] -0.16382594
[6,] -0.09505449
[7,] 0.31781695
[8,] 0.85435266
[9,] 0.42828754
[10,] -1.57622312
attr(,"scaled:center")
[1] 0.3702055
df2$col5 <- sign(df2$col1)
print(df2$col5)
[1] 1 1 -1 -1 1 1 1 1 1 -1
df2$col5 <- ifelse(sign(df2$col1) == 1, TRUE, FALSE)
print(df2$col5)
[1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
df_sample <- sample(df2$col1, 20, replace = TRUE)
print(df_sample)
[1] -0.1128888 -0.1128888 1.3982972 -1.2060176 0.2063796 1.3982972
[7] -1.2060176 1.3982972 0.6425140 0.2063796 0.2751510 0.6880224
[13] 1.3982972 1.2245582 0.6425140 1.3982972 1.2245582 -0.1128888
[19] 1.3982972 0.6880224
sort(df_sample)
[1] -1.2060176 -1.2060176 -0.1128888 -0.1128888 -0.1128888 0.2063796
[7] 0.2063796 0.2751510 0.6425140 0.6425140 0.6880224 0.6880224
[13] 1.2245582 1.2245582 1.3982972 1.3982972 1.3982972 1.3982972
[19] 1.3982972 1.3982972
rev(sort(df_sample))
[1] 1.3982972 1.3982972 1.3982972 1.3982972 1.3982972 1.3982972
[7] 1.2245582 1.2245582 0.6880224 0.6880224 0.6425140 0.6425140
[13] 0.2751510 0.2063796 0.2063796 -0.1128888 -0.1128888 -0.1128888
[19] -1.2060176 -1.2060176
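Note that scale() returns a one-column matrix carrying attributes, which is why its printed output differs from the plain vector produced by subtracting the mean. A minimal sketch, using a hypothetical vector `z`, of converting that result back to an ordinary vector, plus the `decreasing` argument of sort() as an alternative to rev(sort(x)):

```r
z <- c(2, 4, 6, 8)                               # hypothetical values

centered_manual <- z - mean(z)                   # plain numeric vector: -3 -1 1 3
centered_scale  <- scale(z, center = TRUE, scale = FALSE)

is.matrix(centered_scale)                        # TRUE: scale() returns a matrix
centered_vector <- as.numeric(centered_scale)    # drop dimensions and attributes

all.equal(centered_manual, centered_vector)      # TRUE

# sort() has a 'decreasing' argument, an alternative to rev(sort(x))
sorted_desc <- sort(z, decreasing = TRUE)        # 8 6 4 2
identical(sorted_desc, rev(sort(z)))             # TRUE
```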
Often in R, there is no available function to do what you want. This may be because you are creating a new statistical method, because your need is very specific, or because you are extending the limits of an existing function. In this case, you may create your own function. Creating a function is very easy in R. The structure of a new function is:
name_of_new_function <- function(arguments){
  # do something here
  return(results)
}
Functions can have any name that follows convention, or no name at all (an anonymous function), can take any object type as input arguments, and can return any object type. In the example below:
- The function is named my_function.
- It takes three arguments: x, y, and constant.
- ifelse() evaluates whether each \(x_i\) is greater than or equal to zero.
- If TRUE, \(x_i\) is squared; if FALSE, \(x_i\) is divided by 2.
- new_value is assigned the mean of y multiplied by x (after x has been altered by the ifelse()), plus constant.
- new_value is returned to where the function was called. In this case, it is assigned to df2$col5.
Note: if any of the arguments are not of type numeric, this will typically result in an error. A more defensive function would include a test of the arguments that returns a helpful error message if they are anything but numeric. Further, this function will work with either numeric vectors or single numeric values. However, the assignment back to df2$col5 will result in an error if the length of the returned value cannot be recycled to the number of rows in df2 (for example, a vector of length 3 assigned to a 10-row data.frame).
my_function <- function(x, y, constant){
  x <- ifelse(x >= 0, x^2, x/2)
  new_value <- mean(y) * x + constant
  return(new_value)
}
df2$col5 <- my_function(x = df2$col1, y = df2$col2, constant = 0.5)
print(df2$col5)
[1] 8.26543337 2.13957755 0.27582470 0.07810768 0.66916094
[6] 0.80068305 2.38006108 6.45560066 3.03226413 -1.89491665
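The note above suggests a more defensive version that tests its arguments. One possible sketch follows. The name `my_function_checked` and the wording of the error message are assumptions made for illustration, not part of the original function.

```r
# 'my_function_checked' is a hypothetical variant of my_function
my_function_checked <- function(x, y, constant) {
  # stop early with a readable message if any argument is not numeric
  if (!is.numeric(x) || !is.numeric(y) || !is.numeric(constant)) {
    stop("x, y, and constant must all be numeric")
  }
  x <- ifelse(x >= 0, x^2, x / 2)
  mean(y) * x + constant       # the last evaluated expression is returned
}

checked_result <- my_function_checked(x = c(-2, 3), y = c(1, 3), constant = 0.5)
print(checked_result)
# [1] -1.5 18.5

# tryCatch() captures the error so the message can be inspected
msg <- tryCatch(
  my_function_checked(x = "a", y = c(1, 3), constant = 0.5),
  error = function(e) conditionMessage(e)
)
print(msg)
# [1] "x, y, and constant must all be numeric"
```

Checking inputs at the top of the function means a bad argument fails immediately with a clear message, instead of failing later inside ifelse() or the arithmetic with a less obvious one.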
One important thing to learn about functions is what's referred to as scoping, or where R looks for the variables you are using inside your function. Oversimplifying a bit, the function looks first at the arguments passed to it and the variables assigned within it. If it does not find a name there, it looks outside the function, in the environment where the function was defined (usually your script's global environment). That means if you define the object myVar <- 5 outside of the function but do not pass it in, the function will still find and use the global myVar, and its result will silently depend on whatever value myVar happens to hold at the time. Bottom line: while learning, do not use global variables. If you want to use an object in a function, create it there or pass it in as an argument. A quick example:
# function environment - look internally for myVar
inside_scoping <- function(x){
  myVar <- 1 # myVar defined inside function
  new_value <- x + myVar
  return(new_value)
}
myVar <- 5 # myVar also defined outside function
myValue1 <- inside_scoping(1)
print(myValue1) # function uses internal myVar
## [1] 2
## or ##
# script environment - look globally for myVar
outside_scoping <- function(x){
  # myVar no longer defined in function, but used in here
  new_value <- x + myVar
  return(new_value)
}
myVar <- 5 # myVar only defined outside (globally)
myValue2 <- outside_scoping(1)
print(myValue2) # function finds myVar in the global environment and uses it
## [1] 6
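The risk of relying on a global variable is easier to see when the global changes between calls. In this sketch, `shift_by`, `add_shift`, `set_local`, and `add_shift_arg` are hypothetical names used only for illustration.

```r
shift_by <- 5
add_shift <- function(x) {
  x + shift_by                  # 'shift_by' is not an argument; R finds the global one
}
first_result <- add_shift(1)    # 6

shift_by <- 100                 # changing the global silently changes the result
second_result <- add_shift(1)   # 101

# assignment inside a function creates a local variable only
set_local <- function() {
  shift_by <- -1                # local to this call; the global is untouched
  shift_by
}
local_value <- set_local()      # -1
shift_by                        # still 100

# safer: pass the value in explicitly as an argument
add_shift_arg <- function(x, shift_by) {
  x + shift_by
}
explicit_result <- add_shift_arg(1, shift_by = 5)   # 6, regardless of the global
```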
apply functions
Putting the ideas of control structures, vectorizing, and functions together, we end up with the family of apply functions in R. To paraphrase a common sentiment among R users, “If your first thought is ‘how can I loop this’, your second thought should be ‘how can I vectorize this instead’”. This sentiment is born from the fact that loops (for, while, etc.) can be slow and cumbersome compared to vectorized forms, which take advantage of faster code and are more presentable (if you are comfortable reading them). The apply functions are a first step toward this more efficient style of R programming. They take some getting used to, but can save you major headaches down the road.
The apply functions perform more complex operations to slice and aggregate data in vectors, matrices, and lists than the basic functions used earlier, and do so faster and in fewer lines of code than writing loops to do the job. Here are some examples (see the link in the previous paragraph for a more in-depth treatment):
# make a sample data.frame
df3 <- data.frame(col1 = rnorm(100,0,1),
                  col2 = rnorm(100,4,0.5),
                  col3 = rbinom(100,1,0.5))
# apply() # returns vector by applying function over margins of a matrix
column_means <- apply(df3,2,mean) # compute the mean of each column
print(column_means)
col1 col2 col3
0.0435805 3.9454374 0.5500000
row_means <- apply(df3,1,mean) # compute the mean of each row
head(row_means)
[1] 1.096285 1.473665 1.726870 1.308236 1.597097 1.308172
sqrd_matrix <- apply(df3,1:2, function(x) x^2) # square each value in a matrix
head(sqrd_matrix)
col1 col2 col3
[1,] 0.1545383 13.55689 0
[2,] 0.8851056 12.11175 0
[3,] 0.9736380 17.58863 0
[4,] 3.3043945 22.49139 1
[5,] 0.7357716 15.47258 0
[6,] 0.4727123 13.04696 1
# colMeans(), rowMeans(), colSums(), rowSums() stand in for the above
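As a quick check of that equivalence, the sketch below compares apply() with the dedicated functions on a small hypothetical matrix `m` (not the df3 data above).

```r
# 'm' is a small hypothetical matrix: columns are (1,2), (3,4), (5,6)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

via_apply    <- apply(m, 2, mean)   # 1.5 3.5 5.5
via_colMeans <- colMeans(m)         # 1.5 3.5 5.5
all.equal(via_apply, via_colMeans)  # TRUE

row_via_apply   <- apply(m, 1, sum) # 9 12
row_via_rowSums <- rowSums(m)       # 9 12
all.equal(row_via_apply, row_via_rowSums)  # TRUE
```

The dedicated functions are generally the better choice for plain sums and means; apply() remains useful when the function applied to each row or column is something else.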
The by() function performs operations by groups.
# by() # operations by a group
group_col_means <- by(df3[,1:2], df3$col3, colMeans) # column means by group
print(group_col_means)
df3$col3: 0
col1 col2
0.2129485 3.9322471
--------------------------------------------------------
df3$col3: 1
col1 col2
-0.09499329 3.95622939
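The object returned by by() prints clearly but can be awkward to use in later steps. Two other base R functions cover similar ground: tapply() for a single vector and aggregate() when a data.frame result is wanted. The following is a sketch on a small hypothetical data.frame `g`, not a replacement for the by() example.

```r
# 'g' is a small hypothetical data.frame with a grouping column
g <- data.frame(value = c(1, 3, 10, 20, 30),
                group = c("a", "a", "b", "b", "b"))

# tapply(): apply a function to one vector, split by a grouping vector
group_means <- tapply(g$value, g$group, mean)    # a = 2, b = 20

# aggregate(): the same idea, but the result is a data.frame
group_means_df <- aggregate(value ~ group, data = g, FUN = mean)
group_means_df$value                             # 2 20
```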
lapply() applies a function to each element of a list and returns a list.
l2 <- list(part1 = rnorm(5,0,1), part2 = rnorm(12,3,1))
print(l2)
$part1
[1] -0.3919431 -0.8969979 -1.1335531 0.4434971 0.3799945
$part2
[1] 3.204727 2.566988 2.726823 1.999813 2.968780 1.980304 4.366408
[8] 2.571043 2.964162 3.243171 1.545260 3.280595
list_means <- lapply(l2, mean)
print(list_means)
$part1
[1] -0.3198005
$part2
[1] 2.78484
list_sums <- lapply(l2, sum)
print(list_sums)
$part1
[1] -1.599002
$part2
[1] 33.41807
# sapply - similar to lapply, but returns a vector or matrix
list_means2 <- sapply(l2, mean)
print(list_means2)
part1 part2
-0.3198005 2.7848395
### Other apply methods should you need them
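Two other members of the family that are often useful are vapply() and mapply(). This is a brief sketch on a hypothetical list `parts`. It also shows an anonymous function passed to sapply().

```r
# 'parts' is a hypothetical list used only for illustration
parts <- list(first = c(1, 2, 3), second = c(10, 20))

# vapply(): like sapply(), but the type and length of each result is declared
part_means <- vapply(parts, mean, FUN.VALUE = numeric(1))   # first = 2, second = 15

# sapply() with an anonymous function (a function with no name)
part_ranges <- sapply(parts, function(p) max(p) - min(p))   # first = 2, second = 10

# mapply(): apply a function over several vectors in parallel
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)         # 5 7 9
```

Because vapply() raises an error when a result does not match the declared `FUN.VALUE`, it is a reasonable default inside functions where the shape of the output matters.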