Overview
R greatly facilitates data analyses. Coding in R, we can efficiently perform exploratory data analysis, build data analysis pipelines, and prepare data visualizations to communicate results.
However, R is not just a data analysis environment, but a programming language. Advanced R programmers can develop complex packages, and even suggest ways to improve R itself.
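There are three key programming concepts in R: conditionals, for-loops, and functions. These are not just the key building blocks of programming; they also often come in handy during data analyses. We start with conditionals. The if-else statement below prints the reciprocal of a unless a is 0.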
a <- 0 # a is equal to 0
if (a != 0) { # if a is not equal to 0
  print(1/a) # print 1/a
} else { # otherwise
  print("No reciprocal for 0.") # report that 0 has no reciprocal
}
[1] "No reciprocal for 0."
Because a is equal to 0, I received the ‘No reciprocal for 0.’ message as the result. If I change a to a nonzero value, I will get 1/a instead. Let’s check:
# If a is not equal to zero
a <- 3 # a is equal to 3
if (a != 0) { # if a is not equal to 0
  print(1/a) # print 1/a
} else { # otherwise
  print("No reciprocal for 0.") # report that 0 has no reciprocal
}
[1] 0.3333333
Yes. I got 0.3333.
The general form of the if-else statement:
knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/gen.JPG")
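In case the image does not render, the general form can also be sketched in text: a condition in parentheses, a block that runs when the condition is TRUE, and an else block that runs otherwise. The function name describe_sign below is hypothetical and only illustrates the structure.

```r
# Hypothetical helper that illustrates the general if-else form
describe_sign <- function(x) {
  if (x > 0) {          # condition to test
    "positive"          # runs when the condition is TRUE
  } else if (x < 0) {   # optional second condition
    "negative"
  } else {              # runs when no condition is TRUE
    "zero"
  }
}

describe_sign(3)
describe_sign(-2)
describe_sign(0)
```

Note that the condition inside if() must be a single TRUE or FALSE, not a vector.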
Let’s try a few more times using the US murders data frame.
library(dslabs)
data(murders)
m_rate <- murders$total/murders$population*100000
Question: which states, if any, have murder rates lower than 0.5? The if statement protects us from the case in which no state satisfies the condition. Here we check the state with the lowest rate, so we can write something like this:
ind <- which.min(m_rate)
if (m_rate[ind] < 0.5) {
  print(murders$state[ind])
} else {
  print("No State has murder rate that low")
}
[1] "Vermont"
We got the state with the lowest murder rate, which is below 0.5. What if we lower the threshold to 0.20?
ind <- which.min(m_rate)
if (m_rate[ind] < 0.20) {
  print(murders$state[ind])
} else {
  print("No State has murder rate that low")
}
[1] "No State has murder rate that low"
The output tells us that no state has a murder rate that low.
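Checking only the minimum tells us whether at least one state qualifies, but not which ones. Here is a minimal sketch of one way to list every match, using a small made-up named vector. The names, the rates, and the helper low_rate_names are assumptions for illustration, not values from the murders data.

```r
# Hypothetical rates per 100,000 for four made-up regions
rates <- c(A = 0.32, B = 2.10, C = 0.45, D = 5.00)

# Hypothetical helper: names of all entries below the threshold
low_rate_names <- function(rates, threshold) {
  ind <- which(rates < threshold)  # positions that satisfy the condition
  if (length(ind) > 0) {
    names(rates)[ind]
  } else {
    "No region has a rate that low"
  }
}

low_rate_names(rates, 0.5)  # "A" "C"
low_rate_names(rates, 0.2)  # the fallback message
```

The if statement plays the same protective role as before: it handles the case where which() finds nothing.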
The ifelse() function takes three arguments: a logical and two possible answers. If the logical is TRUE, the first answer is returned; if it is FALSE, the second answer is returned. For example:
a <- 0
ifelse(a > 0, 1/a, NA)
[1] NA
The value of a is zero, so we received NA as the output.
a <- 2
ifelse(a > 0, 1/a, NA)
[1] 0.5
Now a is 2, so we receive 1/2, or 0.5, as the outcome. The function is particularly useful because it works on vectors. It examines each element of the logical vector and returns the corresponding answer for each one.
a <- c(0, 1, 2, -4, 5)
ifelse(a > 0, 1/a, NA) # 1/a if a is positive, NA if not
[1] NA 1.0 0.5 NA 0.2
This table shows how we received the answer above:
knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/ifelse.JPG")
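If the image is unavailable, the same element-by-element logic can be laid out as a data frame. This is only a sketch, and the column names are made up for illustration.

```r
a <- c(0, 1, 2, -4, 5)

# One row per element: the test, both candidate answers, and the final result
steps <- data.frame(
  a               = a,
  is_positive     = a > 0,
  answer_if_true  = 1 / a,   # computed for every element (1/0 gives Inf)
  answer_if_false = NA,
  result          = ifelse(a > 0, 1 / a, NA)
)
steps
```

Each row of result takes answer_if_true where is_positive is TRUE and answer_if_false otherwise.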
This function can easily be used to replace all the missing values with zeros. Replacing NAs with some other value is a very common use of this function.
# Loading the required data set
data(na_example)
# Checking total NAs in the data set
sum(is.na(na_example))
[1] 145
# Creating a new vector that replaces all the NAs with zeros
no_nas <- ifelse(is.na(na_example), 0, na_example)
# Checking total NAs in the new data set
sum(is.na(no_nas))
[1] 0
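The same pattern works for replacement values other than zero. Here is a minimal sketch on a small made-up vector (x is hypothetical, not part of na_example) that fills NAs either with 0 or with the mean of the observed values:

```r
x <- c(2, NA, 4, NA, 6)  # hypothetical data with two missing values

filled_zero <- ifelse(is.na(x), 0, x)                      # NAs become 0
filled_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # NAs become the mean of 2, 4, 6

filled_zero  # 2 0 4 0 6
filled_mean  # 2 4 4 4 6
```

Which replacement is appropriate depends on the analysis; this sketch only shows the mechanics.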
m <- c(TRUE, FALSE, TRUE)
any(m)
[1] TRUE
The any() function tests whether any element of a logical vector is TRUE, and that is the case here: m contains two TRUEs. If I change the elements the following way:
n <- c(FALSE, FALSE, FALSE)
any(n)
[1] FALSE
I get FALSE because there isn’t any TRUE element.
m <- c(TRUE, FALSE, TRUE)
all(m)
[1] FALSE
The all() function tests whether every element is TRUE. Here it returns FALSE because not all the values are TRUE; there is one FALSE.
n <- c(FALSE, FALSE, FALSE)
all(n)
[1] FALSE
But we get TRUE if we change all the elements to TRUE:
m <- c(TRUE, TRUE, TRUE)
all(m)
[1] TRUE
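In practice, any() and all() are usually applied to the result of a comparison rather than to a hand-typed logical vector. Here is a small sketch with made-up values (heights_in is hypothetical):

```r
heights_in <- c(61, 70, 75)  # hypothetical heights in inches

any(heights_in > 72)  # TRUE: at least one value exceeds 72
all(heights_in > 60)  # TRUE: every value exceeds 60
all(heights_in > 72)  # FALSE: only one value exceeds 72

# Both return a single TRUE or FALSE, so they fit directly inside if()
if (any(heights_in > 72)) {
  print("At least one height is above 72 inches.")
}
```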
As data scientists, we often repeat the same operations over and over, such as computing an average. We can calculate the average of a vector x by dividing the sum of its values, sum(x), by its length (the total number of values), length(x), i.e., sum(x)/length(x). Typing this every time is longer than it needs to be, so we can define a function that does it for us. That is more efficient, and it is the reason R has the mean() function. However, we will encounter situations where the function we need does not already exist and we have to create one. Let’s create a function to calculate the average by using the following syntax:
knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/average.JPG")
myavg <- function(k){
  l <- sum(k)
  m <- length(k)
  l/m
}
k <- 50:100
myavg(k)
[1] 75
# Checking if myavg(k) and mean(k) are identical
identical(mean(k), myavg(k))
[1] TRUE
Yes. They are identical.
Note that variables assigned or created inside a function, like l and m in the function above, are not stored in the workspace. They exist only while the function runs.
The general way we define a function is as follows:
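The point about variables created inside a function can be checked directly. In this sketch, scope_demo and inner_total are hypothetical names chosen so they do not clash with anything else in the session:

```r
# Hypothetical function with one internal variable
scope_demo <- function(k) {
  inner_total <- sum(k)     # exists only while the function runs
  inner_total / length(k)
}

scope_demo(1:10)       # 5.5
exists("inner_total")  # FALSE: the variable was never created outside the function
```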
knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/genfunc.JPG")
knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/multfunc.JPG")
Here’s an example of a function with multiple arguments.
myavg <- function(x, arithmetic = TRUE){
  n <- length(x)
  ifelse(arithmetic, sum(x)/n, prod(x)^(1/n))
}
x <- 1:15
myavg(x)
[1] 8
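Because arithmetic is a single TRUE or FALSE rather than a vector, a plain if-else would work equally well here. Below is a sketch of that variant, including a call that switches to the geometric mean. The name my_mean is hypothetical and is not part of the original code.

```r
# Hypothetical variant of the two-argument average using if-else
my_mean <- function(x, arithmetic = TRUE) {
  n <- length(x)
  if (arithmetic) {
    sum(x) / n        # arithmetic mean
  } else {
    prod(x)^(1 / n)   # geometric mean
  }
}

my_mean(1:15)                         # default: arithmetic mean, 8
my_mean(c(2, 8), arithmetic = FALSE)  # geometric mean, i.e. sqrt(2 * 8)
```

The second argument has a default value, so it only needs to be supplied when the geometric mean is wanted.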
Here is an example of a function that computes the sum of the integers 1 through n:
compute_s_n <- function(n){
  x <- 1:n
  sum(x)
}
compute_s_n(5) # 1+2+3+4+5
[1] 15
compute_s_n(101) # adds all the values 1 through 101
[1] 5151
for (i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
We like to write i in a loop, but the variable can be named anything, for example water or money.
The for-loop above printed the numbers 1 through 5. The variable i keeps its last value after the loop ends, so if I print i now, I will get 5. Let’s check:
i
[1] 5
Here we go.
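To illustrate both points at once, here is a short sketch of a loop whose variable has a different name and whose last value remains available afterwards. The name water is arbitrary.

```r
# The loop variable can have any valid name
for (water in 1:3) {
  print(water * 10)
}

water  # 3: the loop variable keeps its last value after the loop ends
```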
# a for-loop for our summation
m <- 25
s_n <- vector(length = m) # create an empty vector
for (n in 1:m) {
  s_n[n] <- compute_s_n(n)
}
We can check whether we did it right by plotting the values.
# creating a plot for our summation function
n <- 1:m
plot(n, s_n)
It looks like the relationship is quadratic, so we are on the right track. We can also overlay the two results by using the lines() function:
# overlaying our function with the summation formula
plot(n, s_n)
lines(n, n*(n+1)/2)
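A plot gives a visual check. The same comparison against the formula n(n+1)/2 can also be made numerically. The sketch below redefines compute_s_n so that it runs on its own:

```r
compute_s_n <- function(n) {
  x <- 1:n
  sum(x)
}

m <- 25
s_n <- vector("numeric", length = m)  # numeric vector of length m
for (n in 1:m) {
  s_n[n] <- compute_s_n(n)
}

n <- 1:m
all(s_n == n * (n + 1) / 2)  # TRUE if every loop result matches the formula
```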
We talked about for-loops, but in practice we rarely use them in R. The concepts are still really important. Among the many alternatives, probably the most important functions are those in the “apply” family.
A. APPLY FAMILY FUNCTIONS
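As a minimal sketch of what the apply family offers, sapply() can replace the for-loop used earlier for the summation. It applies a function to each element of a vector and collects the results. compute_s_n is redefined here so that the block runs on its own.

```r
compute_s_n <- function(n) {
  x <- 1:n
  sum(x)
}

m <- 25
s_n <- sapply(1:m, compute_s_n)  # one call per element of 1:m

head(s_n)  # 1 3 6 10 15 21
```

There is no need to create an empty vector first or to manage an index; sapply() handles both.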
B. OTHER FUNCTIONS
Load the dslabs package and heights data set.
library(dslabs)
data(heights)
str(heights)
'data.frame': 1050 obs. of 2 variables:
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 2 ...
$ height: num 75 70 68 74 61 65 66 62 66 67 ...
This data set has two variables, sex and height. There are a total of 1050 observations.
Q.1. Write an ifelse() statement that returns 1 if the sex is Female and 2 if the sex is Male. What is the sum of the resulting vector?
coded_sex <- ifelse(heights$sex == "Female", 1, 2)
sum(coded_sex)
[1] 1862
Based on the outcome the sum of the resulting vector is 1862.
Q.2. Write an ifelse() statement that takes the height column and returns the height if it is greater than 72 inches and returns 0 otherwise. What is the mean of the resulting vector?
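library(dplyr)
# t_height keeps heights above 72 inches and sets all others to 0
new_height <- mutate(heights, t_height = ifelse(height > 72, height, 0))
# mean of the heights above 72 inches only (zeros excluded)
tall <- filter(new_height, height > 72)
mean(tall$height)
[1] 74.53096
The value 74.53096 inches is the mean height of the individuals taller than 72 inches, not the mean of the resulting vector. The question asks for mean(new_height$t_height), which averages over all 1050 rows with the zeros included and is therefore much smaller.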
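One thing worth checking in this question is whether the zeros are included in the mean. Here is a minimal sketch with a small made-up vector (h is hypothetical, not taken from the heights data) that shows how much the two readings differ:

```r
h <- c(60, 70, 74, 76)  # hypothetical heights in inches

kept <- ifelse(h > 72, h, 0)  # 0 0 74 76

mean(kept)       # 37.5: zeros included, as the question asks
mean(h[h > 72])  # 75: only the values above 72
```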
Q.3. Write a function inches_to_ft that takes a number of inches x and returns the number of feet. One foot equals 12 inches. What is inches_to_ft(144)?
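inches_to_ft <- function(n){
  n/12
}
inches_to_ft(144) # 144 inches converted to feet
[1] 12
# applying the function to the heights$height column
heights <- mutate(heights, ft = inches_to_ft(height))
# height in feet of the 144th individual (not the answer to Q.3)
heights$ft[144]
[1] 6
# or, without mutate
height_to_feet <- heights$height/12
height_to_feet[144]
[1] 6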
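Because division in R is vectorized, the same function accepts a single number or a whole vector. The short sketch below redefines inches_to_ft so it runs on its own; the input values are made up.

```r
inches_to_ft <- function(n) {
  n / 12
}

inches_to_ft(144)             # 12
inches_to_ft(c(60, 72, 144))  # 5 6 12
```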
Q.3i. How many individuals in the heights dataset have a height less than 5 feet?
sum(heights$ft < 5)
[1] 20
There are 20 individuals who are shorter than 5 feet.
Q.4. Given an integer x, the factorial of x is called x! and is the product of all integers up to and including x. The factorial() function computes factorials in R. For example, factorial(4) returns 4! = 4 × 3 × 2 × 1 = 24. Now,
m <- 12 # length of the vector
f_n <- vector(length = m) # define a vector of length m
for (n in 1:m) { # fill f_n with the factorials 1! through m!
  f_n[n] <- factorial(n)
}
f_n # inspect f_n
[1] 1 2 6 24 120 720 5040
[8] 40320 362880 3628800 39916800 479001600
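The loop is the point of this exercise, but the result can be cross-checked without one. This is a sketch under the assumption that a numeric comparison with all.equal() is sufficient: factorial() accepts a vector, and cumprod() gives the running product of 1 through m.

```r
m <- 12

# Loop version, as in the exercise
f_loop <- vector("numeric", length = m)
for (n in 1:m) {
  f_loop[n] <- factorial(n)
}

f_vec <- factorial(1:m)  # vectorized call
f_cum <- cumprod(1:m)    # running product 1, 1*2, 1*2*3, ...

isTRUE(all.equal(f_loop, f_vec))
isTRUE(all.equal(f_loop, f_cum))
```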