A function is a block of organized, reusable code that performs a specific task.
Functions allow you to break down your program into smaller, modular components, making it easier to understand, maintain, and debug.
Functions help in promoting code reusability and modularity.
They make your code more readable and easier to maintain by dividing it into logical sections.
A function is defined with the function keyword and assigned to a name with <-, as in name <- function(parameters).
Parameters are the inputs that the function accepts, and they are enclosed within parentheses.
The body of the function consists of the code that defines its behavior, enclosed within curly braces {}.
Functions in R can return a value using return(), although it’s not always necessary: without it, the function returns the value of the last evaluated expression.
add_numbers <- function(x,y) {
return(x + y)
}
#Calling the function
result <- add_numbers(5, 3)
print(result)
## [1] 8
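As a small sketch of the point that return() is optional, the variant below relies on R returning the value of the last evaluated expression. The name add_numbers_implicit is hypothetical and used only for illustration.

```r
# Same addition as add_numbers(), but without an explicit return():
# R returns the value of the last evaluated expression in the body.
add_numbers_implicit <- function(x, y) {
  x + y
}

result_implicit <- add_numbers_implicit(5, 3)
print(result_implicit)
## [1] 8
```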
If we only have one dataset to analyze/plot, writing scripts is easy and simple.
If we have twelve files to analyze/plot and may have more in the future, writing scripts can become very complex. Writing functions allows us to repeat several operations with a single command.
Converts temperatures from Fahrenheit to Celsius:
fahrenheit_to_celsius <- function(temp_F){
temp_C <- (temp_F - 32) * 5 / 9
return(temp_C)
}
* Function Name: fahrenheit_to_celsius
* Argument(s): temp_F
* Body: statements that are executed when the function runs
fahrenheit_to_celsius(32)
## [1] 0
one_tmp <- 212
fahrenheit_to_celsius(one_tmp)
## [1] 100
If you define your own functions using a vectorized operation/function, your newly defined function will also be vectorized.
celsius_to_kelvin <- function(temp_C){
temp_K <- temp_C + 273.15
return(temp_K)
}
celsius_to_kelvin(0)
## [1] 273.15
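To illustrate the vectorization point, here is a short sketch that passes a whole vector to the same function. The vector temps_C is a made-up input; because + is vectorized, the function should be too.

```r
celsius_to_kelvin <- function(temp_C) {
  temp_K <- temp_C + 273.15   # + works element by element
  return(temp_K)
}

temps_C <- c(0, 25, 100)      # hypothetical input vector
celsius_to_kelvin(temps_C)
## [1] 273.15 298.15 373.15
```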
To convert Fahrenheit to Kelvin, we could write out the formula directly. But we can also compose the two functions we have already created:
fahrenheit_to_kelvin <- function(temp_F){
temp_C <- fahrenheit_to_celsius(temp_F)
temp_K <- celsius_to_kelvin(temp_C)
return(temp_K)
}
fahrenheit_to_kelvin(32.0)
## [1] 273.15
The previous example shows how large software programs are built:
* Define basic operations
* Combine them in larger chunks
* Real-world functions are longer than our example, but shouldn’t be longer than a few dozen lines, or the next person who reads the code won’t be able to understand it
We named R objects as nouns.
Here, the function names are usually verbs:
* convert_temperature()
* get_colors()
A variable that is visible only within the function body is said to be local to that function.
Local variables disappear after a function call.
Variables created outside functions are global and are available within functions as well.
A global variable will not be changed if it was used in a function as an argument, because the function works on its own copy.
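The scoping rules above can be sketched as follows. All names here (scale_factor, double_value, doubled, my_value) are hypothetical and chosen only for the illustration.

```r
scale_factor <- 2                 # global variable

double_value <- function(x) {
  doubled <- x * scale_factor     # 'doubled' is local; 'scale_factor' is found globally
  x <- 0                          # modifies only the function's own copy of the argument
  return(doubled)
}

my_value <- 10
double_value(my_value)
## [1] 20
my_value                          # the global variable is unchanged by the call
## [1] 10
exists("doubled")                 # the local variable is gone after the call
## [1] FALSE
```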
Software carpentry:
Once functions are defined, we need to start testing that those functions are working correctly.
Testing functions is a crucial aspect of software development to ensure that they produce the expected results under various conditions.
Proper testing helps identify bugs, errors, and edge cases, ensuring the reliability and robustness of your code.
Example: Define the function to normalize data around a specified midpoint
Context: you provide some data and a midpoint; the resulting normalized data will be the original data adjusted so that its mean is centered around the specified midpoint.
We will test the following function:
normalize <- function(data, midpoint){
new_data <- (data - mean(data)) + midpoint
return(new_data)
}
Test Case 1: Normalize data around midpoint 0
data1 <- c(1,2,3,4,5)
midpoint1 <- 0
result1 <- normalize(data1, midpoint1)
print(result1)
## [1] -2 -1 0 1 2
We have a dataset data1 with values [1, 2, 3, 4, 5] and we want to normalize it around a midpoint of 0.
The normalized data would be such that the mean of the new data is 0. This means that the values will be adjusted accordingly to center the distribution around 0.
Test Case 2: Normalize data around midpoint 10
data2 <- c(10,20,30,40,50)
midpoint2 <- 10
result2 <- normalize(data2, midpoint2)
print(result2)
## [1] -10 0 10 20 30
We have a dataset data2 with values [10, 20, 30, 40, 50] and we want to normalize it around a midpoint of 10.
The normalized data would be such that the mean of the new data is 10. This involves adjusting the values to center the distribution around 10.
Write multiple test cases.
Write test cases for rare (edge) cases, for example:
* Testing a logical vector as input to a numerical function
* Testing vectors of length one if the input is usually longer than one
Write the test cases into a separate function.
When testing whether two numeric values are equal, floating-point rounding can introduce a very small difference far down in the decimal places, so == may report FALSE. all.equal() compares values within a small tolerance and can be used in those cases.
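One way to follow these suggestions is sketched below. The helper test_normalize() is a hypothetical name, and the chosen inputs are only examples; it collects several cases in one function and uses all.equal() instead of == for numeric comparisons.

```r
normalize <- function(data, midpoint) {
  new_data <- (data - mean(data)) + midpoint
  return(new_data)
}

# Hypothetical helper that groups several test cases in one function.
test_normalize <- function() {
  # Typical case: the mean of the result should equal the midpoint
  stopifnot(isTRUE(all.equal(mean(normalize(c(1, 2, 3, 4, 5), 0)), 0)))
  # Non-integer data: compare within a tolerance rather than with ==
  stopifnot(isTRUE(all.equal(mean(normalize(c(0.1, 0.2, 0.3), 10)), 10)))
  # Rare case: a vector of length one
  stopifnot(isTRUE(all.equal(normalize(7, 10), 10)))
  # Rare case: logical input, which R coerces to 0/1 in arithmetic
  stopifnot(isTRUE(all.equal(mean(normalize(c(TRUE, FALSE, TRUE), 0)), 0)))
  return(TRUE)
}

test_normalize()
## [1] TRUE

# Why all.equal() is preferred over == for decimal values
0.1 + 0.2 == 0.3
## [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE
```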
What if we have missing data (NA values) in the data argument we provide to normalize()?
data3 <- c(10,20,30,NA,50)
normalize(data3, 10)
## [1] NA NA NA NA NA
We may actually wish to not consider NA values in our normalize() function.
normalize <- function(data, midpoint) {
new_data <- (data - mean(data, na.rm = TRUE)) + midpoint
return(new_data)
}
data3 <- c(10,20,30,NA,50)
normalize(data3, 10)
## [1] -7.5 2.5 12.5 NA 32.5
Input with the wrong class.
normalize(as.character(data3), 10)
Error in data - mean(data, na.rm = TRUE) :
  non-numeric argument to binary operator
In addition: Warning message:
In mean.default(data, na.rm = TRUE) :
  argument is not numeric or logical: returning NA
You may use the stopifnot(), warning(), and stop() functions to handle such cases.
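A minimal sketch of such input checks is shown below. The function name normalize_checked and the wording of the messages are assumptions made for illustration, not part of the original function.

```r
normalize_checked <- function(data, midpoint) {
  # stop() aborts with a custom, readable error message
  if (!is.numeric(data)) {
    stop("`data` must be numeric, not ", class(data)[1])
  }
  # stopifnot() aborts if any of its conditions is not TRUE
  stopifnot(is.numeric(midpoint), length(midpoint) == 1)
  # warning() reports a problem but lets the function continue
  if (anyNA(data)) {
    warning("`data` contains NA values; they are ignored when computing the mean")
  }
  new_data <- (data - mean(data, na.rm = TRUE)) + midpoint
  return(new_data)
}

# Wrong class: capture the error message instead of halting the script
msg <- tryCatch(normalize_checked(c("10", "20"), 10),
                error = function(e) conditionMessage(e))
print(msg)
## [1] "`data` must be numeric, not character"

# Missing values: the result is still computed, and a warning is issued
normalize_checked(c(10, 20, 30, NA, 50), 10)
## [1] -7.5 2.5 12.5 NA 32.5
```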
So far, we have passed arguments to functions in three ways:
* Directly: dim(dat)
* By name: read.csv(file = "data/inflammation-01.csv", header = FALSE)
* Without naming them (order matters): dat <- read.csv("data/inflammation-01.csv", FALSE)
Let’s re-define our function:
normalize <- function(data, midpoint = 0) {
new_data <- (data - mean(data)) + midpoint
return(new_data)
}
The second argument is now written midpoint = 0, which makes 0 the default value used when the caller does not supply a midpoint.
normalize(data1, 0)
## [1] -2 -1 0 1 2
normalize(data1)
## [1] -2 -1 0 1 2
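As a short follow-up sketch, the default can still be overridden, and named arguments may be given in any order. The midpoint of 10 is an arbitrary example value.

```r
normalize <- function(data, midpoint = 0) {
  new_data <- (data - mean(data)) + midpoint
  return(new_data)
}

data1 <- c(1, 2, 3, 4, 5)

# Override the default by naming the argument
normalize(data1, midpoint = 10)
## [1] 8 9 10 11 12

# With names, the order of the arguments no longer matters
normalize(midpoint = 10, data = data1)
## [1] 8 9 10 11 12
```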
A common way to add documentation is to comment directly on your code.
Formal documentation for R functions is written in separate .Rd files using a markup language similar to LaTeX.
You see the result of this documentation when you look at the help file for a given function, e.g. ?read.csv.
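A sketch of comment-based documentation is given below. The layout of the comment header (a summary line, then Arguments and Returns) is only one possible convention assumed here, not a required format.

```r
# fahrenheit_to_celsius: convert temperatures from Fahrenheit to Celsius.
#
# Arguments:
#   temp_F  numeric vector of temperatures in degrees Fahrenheit
#
# Returns:
#   numeric vector of the same length, in degrees Celsius
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

fahrenheit_to_celsius(212)
## [1] 100
```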
Control statements are powerful constructs in programming that allow you to control the flow of execution in your code.
In R, there are several types of control statements:
1. Conditional statements: if, else if, else
2. Looping statements: for, while, repeat
3. Control flow statements: break, next
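Assuming the control flow statements meant here are R's next and break, a small sketch of how they alter a loop could look like this. The vector visited is a hypothetical name used to record which values are reached.

```r
visited <- c()
for (n in 1:10) {
  if (n %% 2 == 0) next    # skip even numbers and move on to the next iteration
  if (n > 7) break         # leave the loop entirely once n exceeds 7
  visited <- c(visited, n)
}
print(visited)
## [1] 1 3 5 7
```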
Using a for loop, we will be able to perform a certain operation or function over each element in a vector, list, etc.
for (n in 1:4){
print(n^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
Another example iterating through a vector using indices i.
my.vec <- c(1,3,34,22,16)
for (i in 1:length(my.vec)) {
print(my.vec[i])
}
## [1] 1
## [1] 3
## [1] 34
## [1] 22
## [1] 16
However, if my.vec is an empty vector, 1:length(my.vec) is 1:0, which counts down from 1 to 0, so the loop body still runs twice.
my.vec <- NULL
for (i in 1:length(my.vec)) {
print(my.vec[i])
}
## NULL
## NULL
for (i in seq(my.vec)){
print(my.vec[i])
}
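seq() returns a sequence of values. Given a vector with more than one element, it returns the indices of that vector; given an empty vector, it returns an empty sequence, so the loop above never runs and prints nothing.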
my.vec <- c(1,3,34,22,16)
seq(my.vec)
## [1] 1 2 3 4 5
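One caveat worth sketching: when seq() receives a single number, it treats it as an upper bound rather than as a vector to index. The base R function seq_along() always returns the indices of its argument, so it may be the safer choice in loops. The names empty.vec, one.vec and n_iterations are hypothetical.

```r
empty.vec <- NULL
seq_along(empty.vec)      # empty sequence: a loop over it never runs
## integer(0)

one.vec <- 5
seq(one.vec)              # treated like seq(5), not as "indices of one.vec"
## [1] 1 2 3 4 5
seq_along(one.vec)        # one element, so one index
## [1] 1

n_iterations <- 0
for (i in seq_along(empty.vec)) {
  n_iterations <- n_iterations + 1
}
n_iterations
## [1] 0
```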
Using get() within a for loop to iterate over a set of objects.
get() takes a string as its input argument and returns the object of that name.
lm() is used to fit linear models, including multivariate ones. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov() may provide a more convenient interface for these).
u <- c(1,1,2,2,3,4)
dim(u) <- c(3,2)
#or
u <- matrix(c(1,1,2,2,3,4), nrow = 3, ncol = 2)
v <- c(8,15,12,10,20,2)
dim(v) <- c(3,2)
#or
v <- matrix(c(8,15,12,10,20,2), nrow = 3, ncol = 2)
for (m in c("u", "v")){
  z <- get(m) # z becomes the matrix named "u", then the matrix named "v"
  # Left side of ~ is the dependent variable; right side is the independent variable(s)
  print(lm(z[,2]~z[,1]))
}
##
## Call:
## lm(formula = z[, 2] ~ z[, 1])
##
## Coefficients:
## (Intercept) z[, 1]
## 1.0 1.5
##
##
## Call:
## lm(formula = z[, 2] ~ z[, 1])
##
## Coefficients:
## (Intercept) z[, 1]
## -3.838 1.243
The first iteration works on matrix u, the second iteration on matrix v.
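To see what get() does on its own, here is a small sketch for a single object. It also extracts the fitted coefficients with coef(), which for matrix u should agree with the intercept and slope printed above.

```r
u <- matrix(c(1, 1, 2, 2, 3, 4), nrow = 3, ncol = 2)

z <- get("u")             # look up the object whose name is the string "u"
identical(z, u)
## [1] TRUE

fit <- lm(z[, 2] ~ z[, 1])
coef(fit)                 # named vector holding the intercept and the slope
```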
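The if else statement allows you to make a decision between different options.
Be careful about indentation and brace placement: keep else on the same line as the closing brace } of the preceding block.
The else if branch is optional and may not be needed in some situations.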
my.score <- 83
if(my.score < 60){
print("fail")
} else if (my.score > 100) {
print("wrong input")
} else {
print("pass")
}
## [1] "pass"
Some basic operations/functions commonly used:
Check eligibility for a loan based on age and income
age <- 25
income <- 50000
if (age >= 18 && income >= 40000){
  print("Congratulations! You are eligible for a loan.")
} else {
  print("Sorry, you are not eligible for a loan at this time.")
}
## [1] "Congratulations! You are eligible for a loan."
Check conditions using if-else statement
x <- 10
y <- 20
z <- NA
if (x == 10 && !is.na(z)){
print("Condition 1: x equals 10 and z is not NA")
} else if (y >= 20) {
print("Condition 2: y is greater than or equal to 20")
} else {
print("None of these conditions is satisfied")
}
## [1] "Condition 2: y is greater than or equal to 20"
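The examples above use both & and &&. A brief sketch of the difference: & compares vectors element by element, while && works on single values and stops evaluating as soon as the result is known, which is why it suits if conditions. The stop() call below is only there to show that the right-hand side is never reached.

```r
# & is vectorized: one result per element
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
## [1] TRUE FALSE FALSE

# && short-circuits: the right-hand side is not evaluated when the left is FALSE
FALSE && stop("this is never evaluated")
## [1] FALSE

# || short-circuits in the same way when the left-hand side is TRUE
TRUE || stop("this is never evaluated")
## [1] TRUE
```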
I have the following pt.types dataframe of breast cancer patients. It shows whether a patient is positive (TRUE) or negative (FALSE) for each receptor type, which decides the type of the patient.
I want a function “get.condition()” that uses an if else condition.
If I call get.condition() with a Pt.ID and the pt.types dataframe, it should return:
“the patient is triple-negative breast cancer type” if all the column values are FALSE,
“the patient is double negative breast cancer type” if two of the columns are FALSE,
“the patient is single negative breast cancer type” if one of the columns is FALSE and the other two are TRUE
pt.types<- data.frame(Pt.ID = c("BCPt1", "BCPt2", "BCPt3", "BCPt4", "BCPt5"),
ER = c(TRUE, FALSE, TRUE, TRUE, FALSE),
PR = c(TRUE, FALSE, FALSE, TRUE, FALSE),
HER2 = c(FALSE, FALSE, TRUE, FALSE, TRUE))
get.condition <- function(ID, dataframe){
  patient <- subset(dataframe, Pt.ID == ID)
  # An unknown ID matches no rows
  if (nrow(patient) == 0){
    return("Invalid ID")
  }
  # Count the receptors that are negative (FALSE)
  neg_count <- sum(patient[,-1] == FALSE)
  if (neg_count == 3){
    return("the patient is triple negative")
  } else if (neg_count == 2){
    return("the patient is double negative")
  } else if (neg_count == 1){
    return("the patient is single negative")
  } else {
    return("the patient is not negative for any receptor")
  }
}
get.condition("BCPt5", pt.types)
## [1] "the patient is double negative"
A for loop and an if else statement can be combined in a single task.
scores <- c(67,65,88,54)
for (score in scores){
print(score)
if(score > 60){
print("pass")
} else {
print("fail")
}
}
## [1] 67
## [1] "pass"
## [1] 65
## [1] "pass"
## [1] 88
## [1] "pass"
## [1] 54
## [1] "fail"
Manipulate strings for better output:
for (score in scores){
if(score > 60){
output.string <- paste(score, "pass", "\n")
} else {
output.string <- paste(score, "fail", "\n")
}
cat(output.string)
}
## 67 pass
## 65 pass
## 88 pass
## 54 fail
#or
for (score in scores){
if(score > 60){
cat(score, "pass", "\n")
} else {
cat(score, "fail", "\n")
}
}
## 67 pass
## 65 pass
## 88 pass
## 54 fail
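I would like to add another column “Pt.type” to the pt.types dataframe.
The values of “Pt.type” should be “triple”, “double”, or “single”, according to the number of negative receptors, and “Invalid” otherwise.
Use if else and a for loop for this.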
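pt.types$Pt.type <- NA_character_ # create the new column before filling it
for (i in seq_len(nrow(pt.types))){
  # Count the negative (FALSE) receptors in columns 2 to 4 of row i
  neg_count <- sum(pt.types[i, 2:4] == FALSE)
  if (neg_count == 3){
    pt.types$Pt.type[i] <- "triple"
  } else if (neg_count == 2){
    pt.types$Pt.type[i] <- "double"
  } else if (neg_count == 1){
    pt.types$Pt.type[i] <- "single"
  } else {
    pt.types$Pt.type[i] <- "Invalid"
  }
}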
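For comparison only, the same column could also be built without a loop. This is a sketch of an alternative approach, not the solution the exercise asks for: it counts the negatives per row with rowSums() and uses a lookup vector (type_labels, a hypothetical name) to translate the counts into labels.

```r
pt.types <- data.frame(Pt.ID = c("BCPt1", "BCPt2", "BCPt3", "BCPt4", "BCPt5"),
                       ER = c(TRUE, FALSE, TRUE, TRUE, FALSE),
                       PR = c(TRUE, FALSE, FALSE, TRUE, FALSE),
                       HER2 = c(FALSE, FALSE, TRUE, FALSE, TRUE))

# Count the FALSE values in each row across the three receptor columns
neg_count <- rowSums(as.matrix(pt.types[, c("ER", "PR", "HER2")]) == FALSE)

# Position 1 corresponds to 0 negatives, position 4 to 3 negatives
type_labels <- c("Invalid", "single", "double", "triple")
pt.types$Pt.type <- type_labels[neg_count + 1]

pt.types$Pt.type
## [1] "single" "triple" "single" "single" "double"
```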