Overview

Programming Basics in R

R greatly facilitates data analyses. Coding in R, we can efficiently perform exploratory data analysis, build data analysis pipelines, and prepare data visualizations to communicate results.

However, R is not just a data analysis environment, but a programming language. Advanced R programmers can develop complex packages, and even suggest ways to improve R itself.

There are three key programming concepts in R: conditionals, for-loops, and functions. These are not just the key building blocks of programming; they also often come in handy during data analyses.

Basic Conditionals

  • The most common conditional expression in programming is an if-else statement, which has the form “if [condition], perform [expression], else perform [alternative expression]”.
a <- 0 # a is equal to 0
if(a != 0){ # if a is not equal to 0
  print(1/a) #print 1/a
} else{ # Otherwise
  print("No reciprocal for 0.") # No reciprocal for 0
}
[1] "No reciprocal for 0."

Because a is equal to 0, I received the ‘No reciprocal for 0’ message as the result. If I change the value of a, I will get the 1/a outcome. Let’s check:

# If a is not equal to zero
a <- 3 # a is equal to 3
if(a != 0){ # if a is not equal to 0
  print(1/a) #print 1/a
} else{ # Otherwise
  print("No reciprocal for 0.") # No reciprocal for 0
}
[1] 0.3333333

Yes. I got 0.3333.

The general form of the if-else statement:

knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/gen.JPG")
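
A minimal sketch of that general form in code. The names x and result are placeholders chosen only for illustration:

```r
# General shape of an if-else statement (x and result are placeholder names)
x <- 4
if (x %% 2 == 0) {    # a condition that evaluates to a single TRUE or FALSE
  result <- "even"    # expression run when the condition is TRUE
} else {
  result <- "odd"     # alternative expression run when the condition is FALSE
}
print(result)
```

Only one of the two branches is ever evaluated, which is why the earlier example can safely contain 1/a in the branch that is skipped when a is 0.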

Let’s try a few more times using the US murders data frame.

library(dslabs)
data(murders)
m_rate <- murders$total/murders$population*100000

Question: which states, if any, have murder rates lower than 0.5? The if statement protects us from the case in which no state satisfies the condition. So we can write something like this.

ind <- which.min(m_rate)
if(m_rate[ind] < 0.5){
  print(murders$state[ind])
} else{
  print("No State has murder rate that low")
}
[1] "Vermont"

We got the state with the lowest murder rate, which is below 0.5. What if we lower the threshold to 0.20?

ind <- which.min(m_rate)
if(m_rate[ind] < 0.20){
  print(murders$state[ind])
} else{
  print("No State has murder rate that low")
}
[1] "No State has murder rate that low"

The output tells us that there isn’t any state with a murder rate that low.

  • The ifelse() function works similarly to an if-else statement, but it is particularly useful since it works on vectors by examining each element of the vector and returning a corresponding answer accordingly.

This function takes three arguments: a logical and two possible answers. If the logical is TRUE, the first answer is returned; if it is FALSE, the second answer is returned. For example:

a <- 0
ifelse(a > 0, 1/a, NA)
[1] NA

The value of a is zero, so the condition a > 0 is FALSE and we receive NA.

a <- 2
ifelse(a > 0, 1/a, NA)
[1] 0.5

Now a is 2, so we receive 1/2, or 0.5, as the outcome. The function is particularly useful because it works on vectors: it examines each element and returns the corresponding answer.

a <-c(0,1,2,-4,5)
ifelse(a > 0, 1/a, NA) # if a is positive, 1/a if not NA
[1]  NA 1.0 0.5  NA 0.2

This table shows how we received the answer above:

knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/ifelse.JPG")
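
The same step-by-step logic can be sketched as a small data frame. The column names below are my own labels and are not taken from the original table:

```r
a <- c(0, 1, 2, -4, 5)
steps <- data.frame(
  a = a,
  is_positive = a > 0,     # the logical test, evaluated for every element
  answer_if_true = 1/a,    # 1/0 is Inf, but it is never selected for a = 0
  answer_if_false = NA,    # the value used whenever the test is FALSE
  result = ifelse(a > 0, 1/a, NA)
)
print(steps)
```

Each row shows one element of a, whether it passed the test, the two candidate answers, and the answer that ifelse() picked.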

This function can easily be applied to replace all the missing values with zeros. Replacing NAs with some other value is a very common use of this function.

# Loading the required data set
data(na_example)
# Checking total NAs in the data set
sum(is.na(na_example))
[1] 145
#creating a new vector that replaces all the NAs with Zeros
no_nas <- ifelse(is.na(na_example), 0, na_example)
# Checking total NAs in the new data set
sum(is.na(no_nas))
[1] 0

  • The any() function takes a vector of logicals and returns TRUE if any of the entries are TRUE.
m <- c(TRUE, FALSE, TRUE)
any(m)
[1] TRUE

The any() function tests whether any element of the vector is TRUE, and that’s what we got: there are two TRUEs in m. If I use a vector with no TRUE elements instead:

n <- c(FALSE, FALSE, FALSE)
any(n)
[1] FALSE

I get FALSE because there isn’t any TRUE element.

  • The all() function takes a vector of logicals and returns TRUE if all of the entries are TRUE.
m <- c(TRUE, FALSE, TRUE)
all(m)
[1] FALSE

The result is FALSE because not all the values are TRUE; there is one FALSE. A vector with no TRUE elements also gives FALSE:

n <- c(FALSE, FALSE, FALSE)
all(n)
[1] FALSE

But we get TRUE if we change all the elements to TRUE.

m <- c(TRUE, TRUE, TRUE)
all(m)
[1] TRUE
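
In practice, the logical vector usually comes from a comparison rather than being typed by hand. A minimal sketch, assuming a made-up numeric vector x:

```r
x <- c(3, -1, 7, 0)   # made-up values for illustration
any(x < 0)            # TRUE: at least one element is negative
all(x >= 0)           # FALSE: not every element is non-negative
# any() and all() return a single logical, so they fit directly into if()
if (any(x < 0)) {
  msg <- "x contains at least one negative value"
} else {
  msg <- "x has no negative values"
}
print(msg)
```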

Programming Basics: Functions

As data scientists, we repeat many operations over and over, such as computing an average. We can calculate the average of a vector x by dividing the sum of its values, sum(x), by its length (the total number of values), length(x), i.e., sum(x)/length(x). Typing this every time is longer than it needs to be, so it is more efficient to define a function that does it automatically; that is why R has the mean() function. However, we will encounter situations where the function we need does not already exist and we have to create one. Let’s create a function to calculate the average by using the following syntax:

knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/average.JPG")

myavg <- function(k){
  l <- sum(k)
  m <- length(k)
  l/m
}
k <- 50:100
myavg(k)
[1] 75
# Checking if myavg(k) and mean(k) are identical
identical(mean(k), myavg(k))
[1] TRUE

Yes. They are identical.

Note that variables assigned or created inside a function, like l and m in the function above, are not stored in the workspace. They are created when the function is called and exist only while it runs.
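
A small sketch of this behaviour. The names scoped_avg, local_total, and local_n are hypothetical and used only for this illustration:

```r
# scoped_avg, local_total and local_n are hypothetical names for this illustration
scoped_avg <- function(v){
  local_total <- sum(v)    # created when the function is called
  local_n <- length(v)     # likewise local to the function
  local_total/local_n
}
scoped_avg(1:10)
# FALSE, as long as nothing else in the session defined local_total
exists("local_total")
```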

The general way we define a function is as follows:

knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/genfunc.JPG")

knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/multfunc.JPG")

Here’s an example of a function with multiple arguments.

myavg <- function(x,arithmetic=TRUE){
  n <- length(x)
  ifelse(arithmetic, sum(x)/n, prod(x)^(1/n))
}
x <- 1:15
myavg(x)
[1] 8
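
To see the second argument in action, here is a short sketch that repeats the function so the snippet runs on its own and uses a made-up vector y:

```r
# Same two-argument function as above, repeated so this snippet is self-contained
myavg <- function(x, arithmetic = TRUE){
  n <- length(x)
  ifelse(arithmetic, sum(x)/n, prod(x)^(1/n))
}
y <- c(2, 8)                  # made-up values
myavg(y)                      # default: arithmetic mean, (2 + 8)/2 = 5
myavg(y, arithmetic = FALSE)  # geometric mean, sqrt(2 * 8) = 4
```

Because arithmetic has a default value of TRUE, it can be left out of the call, and the function then behaves like an ordinary average.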

Programming Basics: For-Loops

Here is an example of a function that computes the sum of the integers 1 through n:

compute_s_n <- function(n){
  x <- 1:n
  sum(x)
}
compute_s_n(5) # 1+2+3+4+5
[1] 15
compute_s_n(101)# adds all the values 1 through 101
[1] 5151

# a simple for-loop that prints each value of i
for(i in 1:5){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

We usually call the loop variable i, but it can have any name, for example water or money.

The for-loop above printed the numbers 1 through 5. After the loop finishes, i keeps its last value, so if I check the value of i, I will get 5. Let’s check:

i
[1] 5

Here we go.
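
A for-loop can also run over something other than a sequence of numbers, and the loop variable can have a descriptive name. A minimal sketch with made-up values:

```r
states <- c("Vermont", "Hawaii", "Iowa")  # made-up example values
seen <- character(0)                      # empty character vector to collect results
for(state in states){                     # the loop variable is called state instead of i
  seen <- c(seen, toupper(state))
}
print(seen)
state  # after the loop, state keeps its last value, "Iowa"
```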

# a for-loop for our summation
m <- 25
s_n <- vector(length = m) # create an empty vector
for(n in 1:m){
  s_n[n] <- compute_s_n(n)
}

We can check whether we did it right by plotting the values.

# creating a plot for our summation function
n <- 1:m
plot(n, s_n)

It looks like the relationship is quadratic, so we are on the right track. We can also overlay the two results by using the lines() function:

# overlaying our function with the summation formula
plot(n, s_n)
lines(n, n*(n+1)/2)
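
The agreement can also be checked numerically instead of visually. This is a sketch that repeats the definitions above so it runs on its own; it uses all.equal() rather than identical() because the loop produces integers while the formula produces doubles:

```r
compute_s_n <- function(n){
  sum(1:n)
}
m <- 25
s_n <- vector(length = m)
for(n in 1:m){
  s_n[n] <- compute_s_n(n)
}
n <- 1:m
# all.equal() compares the values with a small tolerance
all.equal(s_n, n*(n+1)/2)
```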

Other Functions

We talked about for-loops, but in practice we rarely use them in R. The concepts are still really important. Among many other functions, probably the most important ones are the “APPLY” family.

A. APPLY FAMILY FUNCTIONS

  1. apply
  2. sapply
  3. tapply
  4. mapply

B. OTHER FUNCTIONS

  1. split
  2. cut
  3. quantile
  4. reduce
  5. identical
  6. unique, etc.
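
As one illustration of the apply family, the summation for-loop from the previous section could be written with sapply(). This is a sketch of one possible approach, repeating compute_s_n so the snippet is self-contained:

```r
compute_s_n <- function(n){
  sum(1:n)
}
n <- 1:25
s_n <- sapply(n, compute_s_n)  # calls compute_s_n on each element of n
head(s_n)                      # the first six partial sums
```

There is no need to create an empty vector first or to manage an index; sapply() collects the results into a vector for us.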

Sample Q/As

Load the dslabs package and heights data set.

library(dslabs)
data(heights)
str(heights)
'data.frame':   1050 obs. of  2 variables:
 $ sex   : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 2 ...
 $ height: num  75 70 68 74 61 65 66 62 66 67 ...

This data set has two variables, sex and height. There are a total of 1050 observations.

Q.1. Write an ifelse() statement that returns 1 if the sex is Female and 2 if the sex is Male. What is the sum of the resulting vector?

coded_sex <- ifelse(heights$sex == "Female", 1, 2)
sum(coded_sex)
[1] 1862

Based on the outcome the sum of the resulting vector is 1862.

Q.2. Write an ifelse() statement that takes the height column and returns the height if it is greater than 72 inches and returns 0 otherwise. What is the mean of the resulting vector?

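library(dplyr)
# t_height keeps heights above 72 inches and is 0 for everyone else
new_height <- mutate(heights, t_height = ifelse(height > 72, height, 0))
# keep only the rows with heights above 72 inches
new_height <- filter(new_height, height > 72)
mean(new_height$height)
[1] 74.53096

The value 74.53096 inches is the mean of only the heights greater than 72 inches, because the other rows were filtered out first. The question asks for the mean of the whole resulting vector, zeros included, which is mean(t_height) computed before any filtering. That mean is smaller, because every height of 72 inches or less contributes a 0.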
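
Whether the zeros are kept or dropped changes the mean considerably. A small sketch with made-up numbers (not taken from the heights data) shows the difference:

```r
toy_heights <- c(60, 70, 74, 76)  # made-up values, not from the heights data
kept <- ifelse(toy_heights > 72, toy_heights, 0)
kept                                 # 0 0 74 76
mean(kept)                           # zeros included: (0 + 0 + 74 + 76)/4 = 37.5
mean(toy_heights[toy_heights > 72])  # zeros dropped: (74 + 76)/2 = 75
```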

Q.3. Write a function inches_to_ft that takes a number of inches x and returns the number of feet. One foot equals 12 inches. What is inches_to_ft(144)?

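inches_to_ft <- function(x){
  x/12
}
# 144 inches converted to feet
inches_to_ft(144)
[1] 12
# applying the function to the heights$height column
heights <- mutate(heights, ft = inches_to_ft(height))
# note: heights$ft[144] is the height in feet of the 144th individual, not inches_to_ft(144)
heights$ft[144]
[1] 6
# OR, without mutate()
height_to_feet <- heights$height/12
height_to_feet[144]
[1] 6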

Q.3i. How many individuals in the heights dataset have a height less than 5 feet?

sum(heights$ft < 5)
[1] 20

There are 20 individuals who are shorter than 5 feet.

Q.4. Given an integer x, the factorial of x is called x! and is the product of all integers up to and including x. The factorial() function computes factorials in R. For example, factorial(4) returns 4! = 4 × 3 × 2 × 1 = 24. Now,

  1. define a vector of length m,
  2. make a vector of factorials f_n,
  3. inspect f_n.

m <- 12 # set the length m of the vector
f_n <- vector(length = m) # make an empty vector f_n of length m to hold the factorials
for(n in 1:m){
  f_n[n] <- factorial(n)
}
f_n # inspect f_n
 [1]         1         2         6        24       120       720      5040
 [8]     40320    362880   3628800  39916800 479001600
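
As an aside, factorial() itself accepts a vector, so the same values can be obtained without a loop. A minimal sketch; the name f_n_vec is my own:

```r
m <- 12
f_n_vec <- factorial(1:m)  # one factorial per element of 1:m
f_n_vec[4]                 # 4!
f_n_vec[m]                 # 12!
```

The for-loop version is still a useful exercise, since the same pattern works for functions that do not accept vectors.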