Here I would like to give a brief lesson on functional programming in R. As I am not a computer science expert, I will not go deep into how functional programming differs from other styles such as Object-Oriented Programming (OOP). If you are interested, there is plenty of reading available online.
For me, the advantage of functional programming in R is that it gives me a way of envisioning problems and a reliable, systematic method for finding solutions. We will begin by writing a function in R.
multiply_by_2 <- function(x) {
  2 * x
}
multiply_by_2(3)
## [1] 6
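Because functions in R are ordinary values, they can also be handed to other functions, which is the habit the rest of this lesson builds on. As a minimal sketch (the vector `inputs` is a made-up example), `sapply()` calls `multiply_by_2` once for each element:

```r
# Same function as above, redefined so this block runs on its own.
multiply_by_2 <- function(x) {
  2 * x
}

inputs <- c(1, 2, 3)  # hypothetical example values

# Pass the function itself as an argument; sapply() applies it to each element.
doubled <- sapply(inputs, multiply_by_2)
doubled
## [1] 2 4 6

# An anonymous function can be passed in the same way.
plus_one <- sapply(inputs, function(x) x + 1)
plus_one
## [1] 2 3 4
```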
The function multiply_by_2 can be thought of as the equation \[f(x) = 2x\] where the input that we have given is \(x=3\). This is how we think about problems in a strict mathematical sense, where we are dealing with ordinary numbers and mathematical operations. Instead, we would like to think beyond numbers and start seeing random variables which can take on an array of possible values.
\[g(X,Y,Z)=\frac{\mathrm{length}(X) + \mathrm{sum}(Y)}{Z}\] Here we can think of \(X\) as a random string, for example, ‘Hello, nice to meet you.’, where \(\mathrm{length}(X)\) means its number of characters (in R this is nchar(), since length() returns 1 for a single string). We can think of \(Y\) as a vector of numbers (e.g., 3, 4, 5), while \(Z\) is just a whole number from 1 to 5.
set.seed(101)
(X <- 'Hello, nice to meet you.')
## [1] "Hello, nice to meet you."
(Y <- c(3, 4, 5))
## [1] 3 4 5
(Z <- sample(1:5, 1))
## [1] 2
g_function <- function(local_x, local_y, local_z) {
  print('Calculating g function...')
  result <- (nchar(local_x) + sum(local_y)) / local_z
  return(result)
}
g_function(local_x = X, local_y = Y, local_z = Z)
## [1] "Calculating g function..."
## [1] 18
What we have just done is write a custom function that is specific to the data types at hand and performs a calculation on our input of random variables. We would like to extend this way of thinking to the big-data situations we may run into when practicing data science.
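Since a function like this only makes sense for particular input types, one option is to check the inputs at the top of the function. The sketch below is my own assumption about how such checks might look and is not part of the function above; the name `g_function_checked` is hypothetical.

```r
g_function_checked <- function(local_x, local_y, local_z) {
  # Stop with an error if an argument has the wrong type or size.
  stopifnot(
    is.character(local_x), length(local_x) == 1,
    is.numeric(local_y),
    is.numeric(local_z), length(local_z) == 1, local_z != 0
  )
  (nchar(local_x) + sum(local_y)) / local_z
}

result <- g_function_checked('Hello, nice to meet you.', c(3, 4, 5), 2)
result
## [1] 18

# Without the check, nchar() would quietly convert the number 12345 to text
# and count its digits. With the check, the call fails instead.
bad <- tryCatch(
  g_function_checked(12345, c(3, 4, 5), 2),
  error = function(e) 'invalid input'
)
bad
## [1] "invalid input"
```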
Let us pretend we have some data for two students in an imaginary Data Science 101 course. They have just finished the semester and have had their scores calculated for the midterm and final exams. They both have an average score of 87, plus some randomness that could be positive or negative. Since we have two students who have each taken two exams, each with three section scores, we are left with a vector of 12 values representing their corresponding scores on the tests.
Below is an example of such a dataset. All the information is stored in a list called data_science_101, which is itself a list of two other lists (i.e., it is a list of lists). Each sub-list represents a single student and holds two data frames. These data frames show the three section scores that the student received on the midterm and the final for Data Science 101.
set.seed(101)
b_score <- rep(87, 12) # B-average
random_mistakes <- rnorm(12, sd = 5) # Random mistakes added to scores
actual_score <- b_score + random_mistakes # Vector of scores
(data_science_101 <- list(
  # A list of lists containing exam scores for two students. This idea can easily
  # be extended to 30 or 200 students. Additionally, we can add other
  # factors such as homework, attendance, and participation.
  student_1 = list(
    midterm = data.frame(subj_category = c('True_False', 'Stat_Theory', 'Calculations'),
                         scores = actual_score[1:3]),
    final_exam = data.frame(subj_category = c('True_False', 'Stat_Theory', 'Calculations'),
                            scores = actual_score[4:6])
  ),
  student_2 = list(
    midterm = data.frame(subj_category = c('True_False', 'Stat_Theory', 'Calculations'),
                         scores = actual_score[7:9]),
    final_exam = data.frame(subj_category = c('True_False', 'Stat_Theory', 'Calculations'),
                            scores = actual_score[10:12])
  )
))
## $student_1
## $student_1$midterm
## subj_category scores
## 1 True_False 85.36982
## 2 Stat_Theory 89.76231
## 3 Calculations 83.62528
##
## $student_1$final_exam
## subj_category scores
## 1 True_False 88.07180
## 2 Stat_Theory 88.55385
## 3 Calculations 92.86983
##
##
## $student_2
## $student_2$midterm
## subj_category scores
## 1 True_False 90.09395
## 2 Stat_Theory 86.43633
## 3 Calculations 91.58514
##
## $student_2$final_exam
## subj_category scores
## 1 True_False 85.88370
## 2 Stat_Theory 89.63224
## 3 Calculations 83.02578
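The comment in the code notes that the same layout extends to more students. One way this might be done, as a sketch only, is to wrap the construction of a single student in a function and call it once per student with `lapply()`. The helper `make_student`, the count `n_students`, and the object `big_class` are hypothetical, and the random scores will not match the ones printed above.

```r
set.seed(101)
categories <- c('True_False', 'Stat_Theory', 'Calculations')

# Hypothetical helper: builds one student as a list of two data frames.
make_student <- function(student_id) {
  # student_id is not used in the body; lapply() supplies one value per student.
  make_exam <- function() {
    data.frame(subj_category = categories,
               scores = 87 + rnorm(length(categories), sd = 5))
  }
  list(midterm = make_exam(), final_exam = make_exam())
}

n_students <- 30
big_class <- lapply(seq_len(n_students), make_student)
names(big_class) <- paste0('student_', seq_len(n_students))

length(big_class)
## [1] 30
nrow(big_class$student_30$midterm)
## [1] 3
```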
Now we would like a way to work through this more complicated data structure and perform a calculation on it.
(NOTE: It is important to be mindful of data structures during this process. Certain functions in R only work on certain data structures, and we want to become familiar with these before working with large datasets. The common data structures to consider are: vectors, matrices, data frames, and lists.)
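To make this note concrete, the sketch below builds one small object of each kind and shows that even a simple question such as "how many?" is answered by different functions depending on the structure. The object names are made up for illustration.

```r
v  <- c(3, 4, 5)                        # vector
m  <- matrix(1:6, nrow = 2)             # matrix: 2 rows, 3 columns
df <- data.frame(id = 1:3, score = v)   # data frame: 3 rows, 2 columns
l  <- list(numbers = v, table = df)     # list holding different types

length(v)   # number of elements in the vector
## [1] 3
dim(m)      # rows and columns of the matrix
## [1] 2 3
nrow(df)    # rows of the data frame
## [1] 3
length(df)  # a data frame is a list of columns, so this counts columns
## [1] 2

# Single brackets keep the list wrapper; double brackets extract the element.
class(l['table'])
## [1] "list"
class(l[['table']])
## [1] "data.frame"

# Chained indexing reaches into nested structures, as with data_science_101.
l[['table']][2, 'score']
## [1] 4
```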
Let’s imagine that there were more students and that the class average was unfortunately lower than that of the two students shown. We are interested in discovering a method of grading so that the students' overall grades are curved upwards. As a solution, rather than taking the average of all the scores, we can take the average of only the top score from each exam. For example, Student 1's highest section scores are 89.76 on the midterm and 92.87 on the final. Therefore, their grade will be the average of these two values:
mean(c(data_science_101[[1]][[1]][2,2], data_science_101[[1]][[2]][3,2]))
## [1] 91.31607
Now, we would like to instead write a function which can perform this task rather than having us manually derive this through hard coding.
(NOTE: Hard coding is when we manually input numbers into our code which make sense only in certain cases and would not function correctly with a different sample of data. Occasionally hard coding is okay if it is a very minor case in which it provides a quick and simple solution to an otherwise unwieldy problem. Many beginners write in this style because no noticeable problems have yet arisen when they code in such a manner.)
avg_scores <- function(student_scores = data_science_101[[1]]) {
  # The avg_scores function calculates a curved score, given that each
  # student has at least one data frame of scores for an exam.
  # Highest section score within each exam.
  exam_high_score_list <- lapply(student_scores,
                                 function(x) max(as.data.frame(x)[, 'scores']))
  # Stack the per-exam maxima into a one-column matrix (one row per exam).
  comb_df <- do.call('rbind', exam_high_score_list)
  # Average the maxima across exams.
  curved_score <- sum(comb_df) / nrow(comb_df)
  return(curved_score)
}
lapply(data_science_101, avg_scores)
## $student_1
## [1] 91.31607
##
## $student_2
## [1] 90.60869
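`lapply()` always returns a list. If a plain named vector of grades is more convenient, `vapply()` is one alternative. The sketch below uses a small made-up dataset (`toy_class`) and a hypothetical function name (`avg_top_scores`), so the numbers differ from those above; it is meant only to illustrate the pattern.

```r
# Toy data in the same shape as data_science_101 (values are made up).
toy_class <- list(
  student_1 = list(
    midterm    = data.frame(scores = c(80, 90, 70)),
    final_exam = data.frame(scores = c(85, 75, 95))
  ),
  student_2 = list(
    midterm    = data.frame(scores = c(60, 65, 70)),
    final_exam = data.frame(scores = c(90, 80, 85))
  )
)

# Hypothetical variant of avg_scores: mean of the best score on each exam.
avg_top_scores <- function(student_scores) {
  best_per_exam <- vapply(student_scores,
                          function(exam) max(exam[, 'scores']),
                          numeric(1))
  mean(best_per_exam)
}

# vapply() checks that every result is a single number and returns a vector.
curved <- vapply(toy_class, avg_top_scores, numeric(1))
curved
## student_1 student_2
##      92.5      80.0
```

Because nothing in `avg_top_scores` refers to a particular student or exam by position, the same call works unchanged if more students or more exams are added to the list.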