MCD Workshop 5

Ibrahim Inal

Programming in R

Syntax

One part of programming involves writing functions. We have already seen examples of functions, and those primarily rely on existing functions. However, most useful functions also use control structures.

Control structures are expressions that control the flow of execution in a program based on the conditions you supply. They let the program decide what to do next after evaluating a variable or expression.

Flow controls

Control structures allow you to put some “logic” into your R code. The same structures exist in other languages such as Python and C.

If-else

These are called conditional execution, because what gets executed depends on whether a condition holds.

if(condition){
    statements
    ....
    ....
}

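The above code does nothing if the condition is false. If you want something to run in that case, add an else branch. When the code is typed at the top level (outside a function or other braces), else must start on the same line as the closing brace of the if block; otherwise R reports a syntax error.

if(condition){
    statements
    ....
    ....
} else {
    statements
    ....
    ....
}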

If you would like to add more conditions, use else if:

if (condition1) {
    statements1
    ....
    ....
} else if (condition2) {
    statements2
    ....
    ....
} else if (condition3) {
    statements3
    ....
    ....
} else {
    statements4
    ....
    ....
}
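
As a concrete illustration of the else if chain above, here is a minimal sketch that turns a numeric score into a label. The function name classify_score and the cut-offs 70 and 50 are assumptions made for this example; they are not part of the workshop data.

```r
# Hypothetical helper: classify a single exam score (cut-offs are assumed)
classify_score <- function(score) {
  if (score >= 70) {
    label <- "Distinction"
  } else if (score >= 50) {
    label <- "Pass"
  } else {
    label <- "Fail"
  }
  return(label)
}

print(classify_score(82))
print(classify_score(55))
print(classify_score(41))
```

The conditions are checked from top to bottom and only the first branch whose condition is TRUE runs, which is why the second test does not need to repeat score < 70.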

Just as a brief recap:

  • Relational operators: <, >, <=, >=, ==, !=.
  • Logical operators: !, &, |, &&, ||, xor().
  • Value matching: x %in% y, and its negation !(x %in% y) (often used with character vectors).
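
The operators in this recap can be tried out on small vectors. The sketch below uses made-up values (x and subjects are assumptions for illustration) to show the difference between the element-wise and the single-value logical operators, and how %in% behaves.

```r
x <- c(3, 8, 12)

# & and | work element by element and return a logical vector
in_range <- x > 5 & x < 10
print(in_range)

# && and || expect single TRUE/FALSE values, which is what if() needs
both_hold <- (length(x) == 3) && (x[1] < x[2])
print(both_hold)

# %in% asks, for each element on the left, whether it occurs on the right
subjects <- c("maths", "art")
known <- subjects %in% c("english", "maths", "science")
print(known)
print(!known)  # negated value matching

# xor() is TRUE when exactly one of its two arguments is TRUE
print(xor(TRUE, FALSE))
```

Inside if () you will usually want && and ||, because the condition has to be a single TRUE or FALSE.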
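
As an example of if statements, the function below computes income tax using simplified UK-style bands (the thresholds and rates are the ones hard-coded in the function):

calculate_uk_income_tax <- function(income) {
  tax <- 0

  # No tax on the first 12,500
  if (income <= 12500) {
    tax <- 0
  }

  # 20% on income between 12,500 and 50,000
  if (income > 12500 && income <= 50000) {
    tax <- 0.2 * (income - 12500)
  }

  # 40% on income between 50,000 and 150,000
  if (income > 50000 && income <= 150000) {
    tax <- 0.2 * (50000 - 12500) + 0.4 * (income - 50000)
  }

  # 45% on income above 150,000
  if (income > 150000) {
    tax <- 0.2 * (50000 - 12500) + 0.4 * (150000 - 50000) + 0.45 * (income - 150000)
  }

  return(tax)
}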

or, equivalently, using else if, so that each condition only needs an upper bound:

calculate_uk_income_tax2 <- function(income) {
  tax <- 0

  if (income <= 12500) {
    tax <- 0
  } else if (income <= 50000) {
    tax <- 0.2 * (income - 12500)
  } else if (income <= 150000) {
    tax <- 0.2 * (50000 - 12500) + 0.4 * (income - 50000)
  } else {
    tax <- 0.2 * (50000 - 12500) + 0.4 * (150000 - 50000) + 0.45 * (income - 150000)
  }

  return(tax)
}

If your code is nested or long, you may want to add sanity checks to your function:

  • stop(): signals an error and immediately stops execution.
  • warning(): issues a warning, but execution continues.

my_sqrt <- function(x) {
  # Sanity checks: fail early with a clear message.
  # any() makes the second check work for vectors as well as single numbers.
  if (!is.numeric(x)) stop("Argument 'x' must be numeric!")
  if (any(x < 0))     stop("Argument 'x' must be non-negative")

  result <- sqrt(x)
  return(result)
}
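
To see the difference between stop() and warning() in practice, here is a minimal sketch. The function safe_sqrt is a hypothetical variant written for this example: it still stops on non-numeric input, but only warns about negative values and returns NA for them.

```r
# Hypothetical variant: warn (rather than stop) when negatives are present
safe_sqrt <- function(x) {
  if (!is.numeric(x)) stop("Argument 'x' must be numeric!")
  if (any(x < 0)) {
    warning("Negative values found; returning NA for them")
    x[x < 0] <- NA
  }
  sqrt(x)
}

# tryCatch() captures the error message instead of halting the script
msg <- tryCatch(safe_sqrt("a"), error = function(e) conditionMessage(e))
print(msg)

# The warning does not stop the computation; suppressWarnings() just hides it
res <- suppressWarnings(safe_sqrt(c(4, -1, 9)))
print(res)
```

Whether a problem deserves an error or a warning is a design choice: use stop() when continuing would give a meaningless result, and warning() when the result is still usable but the user should be told.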

ifelse() also provides control flow, applied element by element to a vector. Its general syntax is ifelse(test, yes, no). For example:

x <- c(-1:4)

sqrt( ifelse(x>=0, x , NA) ) 
[1]       NA 0.000000 1.000000 1.414214 1.732051 2.000000
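
Another small sketch of ifelse(), this time returning labels. The vector scores and the pass mark of 50 are assumptions for illustration.

```r
scores <- c(45, 72, 50, 38)

# test is evaluated for every element; each result picks "Pass" or "Fail"
labels <- ifelse(scores >= 50, "Pass", "Fail")
print(labels)
```

Unlike if, which needs a single TRUE or FALSE, ifelse() handles the whole vector in one call and returns a vector of the same length.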

For loop

In addition to conditional execution, repeated execution is very common. Constructs for this are called loops, and for is the simplest form of loop.

for (value in values) { 
  statements
    ....
    ....
}

A typical use is to loop over an integer sequence \(i=\{1,2,3,...,n\}\).

for (i in 1:10) {
  cat("This is iteration")
  print(i)
}
This is iteration[1] 1
This is iteration[1] 2
This is iteration[1] 3
This is iteration[1] 4
This is iteration[1] 5
This is iteration[1] 6
This is iteration[1] 7
This is iteration[1] 8
This is iteration[1] 9
This is iteration[1] 10
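
A slightly more useful pattern is to build up a result across iterations. The following is a minimal sketch with made-up numbers (values is an assumption for illustration) that adds up the elements of a vector one at a time.

```r
values <- c(3, 8, 1, 6)
total <- 0

# seq_along(values) gives 1, 2, ..., length(values)
for (i in seq_along(values)) {
  total <- total + values[i]
}

print(total)
```

seq_along(values) is a safer habit than 1:length(values): if the vector is empty, 1:0 would still run the loop body twice (for 1 and 0), whereas seq_along() gives an empty sequence and the loop is skipped.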

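Obviously this loop is pointless. However, loops become useful when working with data, for example to compute the average score of each student: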

library(tidyverse)

exam_df <- data.frame(
  name=c("Amos","Barnabas", "Chris", "Damien", "Ester", "Fairuz", "Gao" ),
  year=c("Junior", "Senior", "Senior", "Senior", "Junior", "Senior", "Junior"),
  english=c(60,66, 70,73, 55, 60, 70),
  maths=c(90, 55, 63, 76, 52, 80, 64),
  science=c(70, 62, 57, 43, 75, 80, 82),
  history= c(55, 45, 62, 90, 41, 57, 60),
  economics=c(42,45,60,44,57,65, 39),
  stringsAsFactors = FALSE
)

# get the student names
students <- exam_df$name

# loop through each student by filtering the data for that student
for (student in students) {
  scores <- exam_df %>%
    filter(name == student) %>%
    # mean() averages a single vector, so combine the five scores with c()
    mutate(avgscore = mean(c(english, maths, science, history, economics)))

  # combine the student's name with their average score
  sp <- c(student, scores$avgscore)

  cat(sp, "\n")
}
Amos 63.4
Barnabas 54.6
Chris 62.4
Damien 65.2
Ester 56
Fairuz 68.4
Gao 63

Note that there is a group of functions, apply(), lapply(), sapply(), vapply(), tapply() and mapply(), that can be used instead of for loops. They are known as the apply functions. The general syntax of apply() is apply(X, MARGIN, FUN), where

  • X is an array or matrix (this is the data that you will be performing the function on)
  • MARGIN specifies whether you want to apply the function over rows (1) or columns (2)
  • FUN is the function you want to use
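
To make the three arguments concrete, here is a minimal sketch of apply() on a small matrix. The matrix m and its values are made up for illustration.

```r
# Rows are (hypothetical) students, columns are subjects
m <- matrix(c(60, 66, 70,
              90, 55, 63),
            nrow = 3,
            dimnames = list(c("A", "B", "C"), c("english", "maths")))

# MARGIN = 1: apply mean() to each row (average per student)
row_means <- apply(m, 1, mean)
print(row_means)

# MARGIN = 2: apply mean() to each column (average per subject)
col_means <- apply(m, 2, mean)
print(col_means)
```

Note that matrix() fills column by column, so the first three numbers are the english scores and the last three are the maths scores.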

The student loop above can also be written with lapply(), which calls a function on each element of a vector and returns the results as a list:

calculate_avg_score <- function(student) {
  scores <- exam_df %>%
    filter(name == student) %>%
    # mean() averages a single vector, so combine the five scores with c()
    mutate(avgscore = mean(c(english, maths, science, history, economics)))

  sp <- c(student, scores$avgscore)
  return(sp)
}

# get the student names
students <- exam_df$name

# Apply the function to each student using lapply()
result <- lapply(students, calculate_avg_score)

# Print the results
print(result)
[[1]]
[1] "Amos" "63.4"

[[2]]
[1] "Barnabas" "54.6"

[[3]]
[1] "Chris" "62.4"

[[4]]
[1] "Damien" "65.2"

[[5]]
[1] "Ester" "56"

[[6]]
[1] "Fairuz" "68.4"

[[7]]
[1] "Gao" "63"

Now if we go back to loops, we could get the average of each subject in each year.

# Unique years
unique_years <- unique(exam_df$year)

# Initialize an empty data frame to store results
average_scores <- data.frame(Year = character(0), Subject = character(0), AvgScore = numeric(0))

# Loop through each year
for (year in unique_years) {
  # Keep only the rows for this year and show them
  year_data <- exam_df[exam_df$year == year, ]
  print(year_data)

  # Columns 3 to 7 hold the subject scores
  for (subject in colnames(year_data)[3:7]) {
    avg_score <- mean(year_data[[subject]])
    average_scores <- rbind(average_scores, data.frame(Year = year, Subject = subject, AvgScore = avg_score))
  }
}

# Print the results
print(average_scores)
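
For comparison, the same kind of summary can be sketched without explicit loops by combining split() with sapply(). The data frame small_df below is a cut-down, made-up stand-in for exam_df (four students, two subjects), so the numbers are illustrative only.

```r
# Hypothetical cut-down data set
small_df <- data.frame(
  year    = c("Junior", "Senior", "Senior", "Junior"),
  english = c(60, 66, 70, 55),
  maths   = c(90, 55, 63, 52),
  stringsAsFactors = FALSE
)

# split() makes one data frame of scores per year;
# sapply() then applies colMeans() to each of them
by_year <- split(small_df[c("english", "maths")], small_df$year)
avg_by_year <- sapply(by_year, colMeans)

print(avg_by_year)
```

The result is a matrix with one row per subject and one column per year, rather than the long data frame built by the nested loop.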

Task: Write a for loop that finds the top performers in each subject in each year.

while loop

A while loop is another common form of loop: it repeats its body for as long as the condition is TRUE.

while(expression){
    statement
    ....
    ....
}

Here’s an example of a while loop using our toy data set.

i <- 1

while (i <= nrow(exam_df)) {
  
  student <- exam_df$name[i]
  
  english_score <- exam_df$english[i]
  
  cat("Student:", student, "- English Score:", english_score, "\n")
  
  i <- i + 1
}
Student: Amos - English Score: 60 
Student: Barnabas - English Score: 66 
Student: Chris - English Score: 70 
Student: Damien - English Score: 73 
Student: Ester - English Score: 55 
Student: Fairuz - English Score: 60 
Student: Gao - English Score: 70 
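
A while loop is most natural when you do not know in advance how many iterations are needed. As a minimal sketch, the following counts how many years it takes a balance to double at a fixed growth rate. The starting balance of 100 and the 5% rate are assumptions for illustration.

```r
balance <- 100   # assumed starting amount
rate <- 0.05     # assumed yearly growth rate
years <- 0

# Keep going until the balance has at least doubled
while (balance < 200) {
  balance <- balance * (1 + rate)
  years <- years + 1
}

print(years)
```

With these numbers the loop stops after 15 iterations, since 1.05^14 is still below 2 and 1.05^15 is above it.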

repeat loop

repeat runs its body indefinitely: it has no exit condition of its own, so it must be combined with a break statement to leave the loop. Note that this type of loop is less common.

repeat { 
   statements
   ....
   .... 
   if(expression) {
      break
   }
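}

For example, with our toy data set (no student scored above 80 in English, so this loop prints nothing):
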
i <- 1

repeat {
  if (i > nrow(exam_df)) {
    break  
  }
  
  student <- exam_df$name[i]
  english_score <- exam_df$english[i]
  
  if (english_score > 80) {
    cat("Student:", student, "- English Score:", english_score, "\n")
  }
  
  i <- i + 1
}
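
Here is another small sketch of repeat with break, chosen so that the number of iterations is not obvious beforehand. Starting from an assumed value of 6, the number is halved when even and replaced by 3n + 1 when odd, until it reaches 1.

```r
n <- 6       # assumed starting value
steps <- 0

repeat {
  # Exit condition: leave the loop once n has reached 1
  if (n == 1) {
    break
  }

  if (n %% 2 == 0) {
    n <- n / 2
  } else {
    n <- 3 * n + 1
  }

  steps <- steps + 1
}

print(steps)
```

For a starting value of 6 the sequence is 6, 3, 10, 5, 16, 8, 4, 2, 1, so the loop runs 8 times before break is reached.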

In some cases you may want to skip the rest of the current iteration and move on to the next one. Use next for this.

for (i in 1:nrow(exam_df)) {
  
  student <- exam_df$name[i]
  english_score <- exam_df$english[i]
  
  if (english_score < 70) {
    next  
  }
  
  cat("Student:", student, "- English Score:", english_score, "\n")
}
Student: Chris - English Score: 70 
Student: Damien - English Score: 73 
Student: Gao - English Score: 70 
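
break also works inside a for loop, where it ends the loop early instead of skipping a single iteration. A minimal sketch, using a made-up vector of scores and an assumed threshold of 70:

```r
scores <- c(60, 66, 70, 73, 55)
first_high <- NA
checked <- 0

for (s in scores) {
  checked <- checked + 1

  if (s >= 70) {
    first_high <- s
    break   # stop at the first score of 70 or more
  }
}

print(first_high)
print(checked)
```

Here the loop stops at the third element, so the remaining scores are never examined.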

Some exercises

  1. Write a loop that iterates through the rows of exam_df and prints the names of students who have failed in at least one subject (scored less than 50 in any subject).
  2. Write a loop that finds and prints the name of the student who scored the highest in each subject.

Coding practice

You can never practise coding too much. One resource that I’d recommend is swirl, an R package that teaches R interactively inside the R console.

install.packages("swirl")
library(swirl)

Practical

Football data

In this part we will use the toolkit to produce a league table. I will use data from https://www.football-data.co.uk. The data cover the EPL 2022/23 season.

Note that we will not use all variables in the data set. For the meaning of the variable names, refer to https://www.football-data.co.uk/notes.txt. Let’s take the variables we will need.

Column     Content
HomeTeam   Name of the team that played at home
AwayTeam   Name of the team that played away
FTHG       Full Time Home Goals
FTAG       Full Time Away Goals

In order to create a league table,

  • we need to know the total number of wins, draws and losses for each team.

  • we need to calculate the number of league points won by each team. The winner of a match gets 3 points and the loser gets 0 (zero) points. Each team gets 1 point in case of a draw.
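
As a starting point, the points rule above can be sketched for a single match. The function match_points is a hypothetical helper written for this example; it takes the full-time home and away goals (FTHG and FTAG in the data) and is only one possible building block for the league table, not a complete solution.

```r
# Hypothetical helper: league points for the two sides of one match
match_points <- function(home_goals, away_goals) {
  if (home_goals > away_goals) {
    points <- c(home = 3, away = 0)   # home win
  } else if (home_goals < away_goals) {
    points <- c(home = 0, away = 3)   # away win
  } else {
    points <- c(home = 1, away = 1)   # draw
  }
  return(points)
}

print(match_points(2, 1))
print(match_points(0, 0))
```

Because if works on single values, this helper handles one match at a time; for whole columns you would call it in a loop or use a vectorised alternative such as ifelse().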

The finished league table should look like this:

# A tibble: 20 × 5
   Team           TotalWins TotalLosses TotalDraw TotalPoints
   <chr>              <int>       <int>     <int>       <dbl>
 1 Man City              28           5         5          89
 2 Arsenal               26           6         6          84
 3 Man United            23           9         6          75
 4 Newcastle             19           5        14          71
 5 Liverpool             19           9        10          67
 6 Brighton              18          12         8          62
 7 Aston Villa           18          13         7          61
 8 Tottenham             18          14         6          60
 9 Brentford             15           9        14          59
10 Fulham                15          16         7          52
11 Crystal Palace        11          15        12          45
12 Chelsea               11          16        11          44
13 Wolves                11          19         8          41
14 West Ham              11          20         7          40
15 Bournemouth           11          21         6          39
16 Nott'm Forest          9          18        11          38
17 Everton                8          18        12          36
18 Leicester              9          22         7          34
19 Leeds                  7          21        10          31
20 Southampton            6          25         7          25

Optional

This part uses a very famous data set called gapminder.

  1. Filter the data for the year 2007.
  2. Create a new variable named incomeCategory by using gdpPercap. If the gdpPercap is less than 1000, then the country is called Low Income. If the gdpPercap is between 1000 and 12000 then the country is called Middle Income. If the gdpPercap is higher than 12000 then the country will be called High Income.
  3. Use a for loop to calculate the average life expectancy for each income category.

For the rest use the original data set.

  4. Calculate the average life expectancy for each continent.
  5. Count the number of countries in each income category.
  6. Calculate the growth rate of population for each country. Note that there is a function called lag() that might be helpful for this purpose.
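
As a hint for the last item, here is a minimal sketch of a growth rate computed from a lagged series, using a made-up population vector for a single country. The lag is built by hand in base R so that the snippet runs without extra packages; with dplyr loaded, lag(pop) produces the same shifted vector.

```r
# Hypothetical population figures for one country in consecutive periods
pop <- c(100, 110, 121)

# Shift the series by one position: the first value has no predecessor
pop_lagged <- c(NA, head(pop, -1))

# Growth rate = (current - previous) / previous
growth <- (pop - pop_lagged) / pop_lagged
print(growth)
```

In the gapminder data the lag has to be taken within each country, so that the first observation of one country is not compared with the last observation of another.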