What is an algorithm?

An algorithm is a step by step method of solving a problem, expressed as a finite list of well-defined instructions.

input: an algorithm has zero or more inputs
output: has one or more outputs, which are specifically determined by the inputs (if any)
finiteness: must always terminate after a finite number of steps
definiteness: each step must be precisely defined; the actions to be carried out must be rigorously and unambiguously specified
effectiveness: all operations to be performed must be sufficiently basic that they can be done exactly, and in a finite length of time

Reference

Algorithms - Applications

an internet search engine:
- input: key words in the search box
- output: a sorted list of websites
a trip planner
- input: starting position, destination, a map of roads
- output: the fastest route
a recommendation system (e.g. YouTube, Netflix…)
- input: history of user’s utilisation
- output: a suggestion of what they may also like
parameter estimation (in statistics)
- input: data, model assumptions
- output: estimated parameters (for example maximising likelihood, more in ACTL2131)

Pseudocode - Definition

‘Pseudocode is a detailed yet readable description of what a computer program or algorithm must do, expressed in a formally-styled natural language rather than in a programming language’.
To be read by humans, not computers
Must be clear and concise
Not typically concerned with issues of software engineering

Reference

Example of Algorithm & Pseudocode: The Collatz Sequence

The Collatz sequence goes as follows: start at any positive integer

if the number is \(1\), stop
if the number is even, divide it by \(2\)
if the number is odd, multiply it by \(3\) and add \(1\)

It is conjectured but not proven that the sequence always reaches \(1\), eventually. For example, the Collatz sequence starting at \(5\) is: \[ 5 \to 16 \to 8 \to 4 \to 2 \to 1 \] A function which takes any positive integer and returns the next number in the sequence is an algorithm! Expressed in pseudocode this would look like:

Declare 'collatzNext' as function of x
If x is 1 
  result is 1
If x is even 
  result is x/2
Otherwise 
  result is 3*x + 1
Return result

Advantages of pseudocode

Allows you to think thoroughly (and plan ahead) before programming something
Describes the key steps of an algorithm, but is environment-independent
Easier to communicate with others, who possibly don’t know the specific language you are using

Exercise - Fibonacci sequence

Starting at \(1,1, \ldots\) every next element of the Fibonacci sequence is obtained by summing the previous two elements: \(1, 1, 2, 3, 5, 8, 13, 21, \ldots\)
Write a pseudocode for an algorithm that finds the \(n^{th}\) term of the Fibonacci sequence.

Sample solution

Pseudocode:

Let x=1
Let y=1
Let z=x+y

Repeat (n-3) times the following: # we need (n-3) numbers (since we have the first 3)
    re-assign variable x to equal current value of y
    re-assign variable y to equal current value of z
    re-assign variable z to equal current value of x+y
Return z

R code:

# We do this in a function with argument n
get_fibo <- function(n){
x <- 1
y <- 1
z <- 2

for (i in 1:(n-3)){
      x <- y # update x
      y <- z # update y
      z <- x + y # find next number in sequence
}
  return(z)
}
get_fibo(7)

## [1] 13

Exercise: The sorting problem

Consider the problem of sorting a list of numbers, from smallest to biggest. How would you use an algorithm to do that task?

Have in mind the characteristics of an algorithm we saw before
Write first your solution as a pseudocode
Then, you can write an actual R code based on your pseudocode

Possible Solution - Strategy

There are many possible sorting algorithms. The one presented here is often called insertion sort

Input: a sequence of \(n\) numbers, \(A=(a_1,a_2,\ldots,a_n)\)
Output: a rearrangement \((a_1',a_2',\ldots,a_n')\) of the input sequence such that \(a_1'\leq a_2'\leq \ldots \leq a_n'\)

General strategy:

Consider the \(j\) \(th\) number (starting at \(2\), finishing at \(n\))
Compare it to the \((j-1)th\) number: if the \((j-1)th\) is bigger, swap them
Then move to compare the (new) \((j-1)th\) number with the \((j-2)th\) number, swap them if the \((j-2)th\) is bigger. Repeat until either:
- you can’t swap numbers
- you reach the first number in the list
Let \(j = j+1\) and go to step 1.

Possible Solution - Code

Pseudocode:

for j=2 to length(A)
    # store the value of A[j] in a variable called key
    key <- A[j] 
    # initiate the comparison with the (j-1)th element
    i <- j-1 
    # the comparison continues as long as A[i]>key (or one exhausts the list)
    while i>0 and A[i]>key 
      A[i+1] = A[i] # shift A[i] to the right (since its bigger than key)
      A[i]   = key
      i=i-1  # move to the previous

In R code:

# We put this in a function 'insert_sort'
insert_sort <- function(A){
  # Length of inputted sequence
  n <- length(A)
  # Scan all numbers, from 2 to n
  for (j in 2:n){
    # Store the jth value
    key <- A[j]
    # Start comparison one backwards
    i <- j-1
    # Do this until we reach 1, or we find a number smaller or equal to 'key'
    while (i > 0 && A[i] > key){
      # Swap the bigger number with 'key'
      A[i+1] <- A[i]
      A[i] <- key
      i = i - 1
    } # End 'while' loop
  } # End 'for' loop
return(A)
} # End function

insert_sort(c(2,5,6,7,3,4,5,1))

## [1] 1 2 3 4 5 5 6 7

Take aways for good pseudocode

Clear structure and steps
Informative explanations
Not every detail is there, but each step should be unambiguous. Ask yourself: would a competent programmer be able to implement this without thinking much?
Use indentation: improves readability of codes
Don’t be afraid to comment (#commentsRgreat) !

Divide and Conquer Method

The ‘Divide and Conquer’ method to design an algorithm breaks down a complex task into simpler blocks, or ‘sub-problems’. Each of those blocks is a programmable step. Those algorithms have the general structure:

Divide: Break the given problem into (smaller/simpler) sub-problems of the same type.
Conquer: Recursively solve these sub-problems.
Combine: Appropriately combine the answers to get the solution of the original problem.

Reference

Divide and Conquer: An Exercise

You are given a sorted list of numbers: \(A = (n_1, n_2, \ldots)\) and another number, say \(N\), which might be in the list. You want to find if the number is in the list and if so, its position in the list.

How would you apply the divide and conquer method to solve this? Write this as pseudocode.

Possible Solution

A ‘naive’ solution would be to scan all the numbers, and stop whenever you find \(N\). That would work, but would not be efficient.

Instead, we can take advantage of the fact the list is sorted. General strategy:

Look at the middle number, if it is equal to \(N\), you are done.
Keep only half the numbers where \(N\) can possibly be: the first half if \(N <\) middle, the second half if \(N >\) middle. Find the middle of that half and start again.
When the process is done, return the position of the number if you found it, or return ‘unsuccessful’ otherwise

In pseudocode:

# Let the list be A, with A[i] its i^th element
# Set the Left (L) and Right (R) indices of your searching range
L = 1 and R = n
If  N < A[L] OR N > A[R], return 'unsuccessful' 
Let m = floor((L+R)/2) # Update the middle point
    If (A[m] == N), return m (the search is done)
    If R = L, return 'unsuccessful'
    If A[m]<N, let L = m + 1  and go back to step "let m = floor(..)"
    If A[m]>N, let R = m - 1  and go back to step "let m = floor(..)"

In R code:

# We put this in a function 'find_N'
find_N <- function(A, N){
  L <- 1 # Set left bound at 1st position
  R <- length(A) # Set right bound at last position
  if (N < A[L] || N > A[R]){return('unsuccessful')} 
  found <- F # Thus far, N has not be found
  while (found == F){ # As long as we have not found N, we will keep going
    m <- floor((L+R)/2) # Update the middle point
    if (A[m] == N){return(paste(N, "is in position",m))}
    if (R == L){return('unsuccessful')}
    if (A[m]<N){L <- m + 1} # if N is bigger than middle point, update left bound
    if (A[m]>N){R <- m - 1} # if N is smaller than middle point, update right bound
  }
} # End function
find_N(A=1:50, N = 42)

## [1] "42 is in position 42"

find_N(A=c(4,8,10,12,14), N = 12)

## [1] "12 is in position 4"

find_N(A=c(4,8,10,12,14), N = 13)

## [1] "unsuccessful"

More on Algorithms: Recursion

Recursion is a popular approach in ‘dividing and conquering’ a problem.
To cite this Reference:

‘The process in which a function calls itself directly or indirectly is called recursion and the corresponding function is called as recursive function. Using recursive algorithm, certain problems can be solved quite easily.’

Each step in the recursion gets you closer to the solution, by solving a smaller/simpler (but identical) version of the original problem.

An old joke: To understand recursion you must first understand recursion.

Three Laws of Recursion

A recursive algorithm needs to follow three important ‘Laws’:

A recursive algorithm must have a base case. When such a base case is reached, the algorithm terminates.
At each new step, a recursive algorithm must change its state and progress towards the base case.
A recursive algorithm must call itself, recursively.

Reference

Example

This is best understood through a simple example. Suppose we want to compute \(n! = n \times (n-1) \times \dots \times 1\). A classical way to do this is to use a loop as follows:

n <- 10
  result<-1
  for (i in 1:n){
      result<-result*i
  }
  result

## [1] 3628800

A recursive solution would require us to define a function my.factorial(n) which is used inside itself!

my.factorial<-function(n){
    # This is the base case
    if (n==1){
        return(1)
    }
    # If we are not at the base case, the algorithm moves towards the base case by calling itself
    else{
        return(n*my.factorial(n-1))
    }
}
my.factorial(10)

## [1] 3628800

Breakdown

The \(\verb|my.factorial(10)|\) function call does the following:

\(10 \times \verb|my.factorial(9)|\)
\(10 \times 9 \times \verb|my.factorial(8)|\)
\(10 \times 9 \times 8 \times \verb|my.factorial(7)|\)

\(\dots\)

\(10 \times 9 \times 8 \times \dots 2 \times \verb|my.factorial(1)|\)

\(\verb|my.factorial(1)|\) returns \(1\) because we defined this behaviour here: \[ \verb|if (n==1){return(1)}| \]

Necessary properties of recursion

Handles a simple base case, whose solution is obvious
Each call of the function gets you closer to the solution
Avoids infinite cycles, i.e. you know it will eventually terminate!

Exercise

Write a recursive algorithm that returns the \(n^{\text{th}}\) number in the Fibonacci sequence \(1,1,2,3,5,8,13,\ldots\)

Solution

my.fibo <-function(n){
    # This is the base case
    if (n==1 || n==2){
        return(1)
    }
    # If we are not at the base case, the algorithm moves towards the base case by calling itself
    else{
        return(my.fibo(n-1) + my.fibo(n-2))
    }
}
my.fibo(7)

## [1] 13

Note: This is a simple example to illustrate what a recursive algorithm is. However, it is a rather inefficient may of solving the Fibonacci sequence problem (much slower than a previous solution we gave). Can you see why?

Beware of infinite loops

As another example of algorithm, consider what we call a ‘happy number’:

Start with a positive integer
Replace that number by the sum of the squares of its digits
Repeat the process until the number EITHER equals \(1\), and in that case you have a happy number
OR the process loops endlessly in a cycle that does not include \(1\), you then have a sad number.

For example, \(19\) is happy:

\[\begin{align*} & 1^2 + 9^2 = 82 \\ & 8^2 + 2^2 = 68 \\ & 6^2 + 8^2 = 100 \\ & 1^2 + 0^2 + 0^2 = 1. \end{align*}\]

\(4\) is sad:

\[\begin{align*} & 4^2 = 16 \\ & 1^2 + 6^2 = 37 \\ & 3^2 + 7^2 = 58 \\ & 5^2 + 8^2 = 89 \\ & 8^2 + 9^2 = 145 \\ & 1^2 + 4^2 + 5^2= 42 \\ & 4^2 + 2^2 = 20 \\ & 2^2 + 0^2 = 4 \\ & 4^2 = 16 \\ & \ldots \end{align*}\]

Write an algorithm (start with pseudocode) that takes as input a positive integer and returns TRUE if it is happy, FALSE otherwise.

Happy number - Sample Solution

The tricky part here is to realise that if we ever get the same number twice in the sequence, then we don’t have a happy number, and the sequence will repeat itself forever. We then want to exit the process and declare the number sad. So, we need to keep a history of all numbers obtained.

In pseudocode:

# We do the task in a function
Declare function 'is_happy' with argument 'x'
Inside the function:
  # We store all numbers of the sequence in a vector, S
  Declare a vector S
  While x is not 1 AND S contains only unique elements
    - Isolate all digits of x: d1, d2, etc.
    - Update x as d1^2 + d2^2 + ...
    - Add x to vector S
  # While loop is now completed
  If x = 1, return TRUE  
  Otherwise, return FALSE

In R code:

is_happy <- function(x){
S <- NULL # declare an empty vector, to contain all numbers of the sequence

# while we have not reached 1, and we have not gotten any duplicates
while (x != 1 && any(duplicated(S)) == F){ 
  x <- as.integer(strsplit(as.character(x), '')[[1]]) # get all digits of x in a vector
  x <- sum(x^2)
  S <- c(S,x)
}
  ifelse(x == 1, "TRUE", "FALSE") # return result
}
is_happy(19)

## [1] "TRUE"

is_happy(4)

## [1] "FALSE"

Advanced Algorithm: PageRank

Algorithms are the building blocks of computer programs. Arguably, the most widely used algorithm today is PageRank, the ranking system invented by Google co-founders Larry Page and Sergey Brin.
PageRank gives a score to any webpage (in a given network). This score represents the probability that a person who is randomly clicking links on the Internet lands on that specific page.
So if a page \(A\) has a \(0.5\) PageRank within a network, and someone is randomly clicking on links, there is a \(50\%\) chance this person will be directed to page \(A\) after clicking randomly on a given link.
The overall idea behind PageRank is very simple: pages that receive more links from other pages should have a higher score, see the illustration below. Each smiley face represents a webpage, and each pointed finger represents an outbound link from one page to another. The size of each smiley face is proportional to their Page Rank.
Said otherwise: PageRank gives a page’s ‘popularity’; websites that are frequently linked to have higher PageRanks.

The license for the image above can be found here Click me.

More formally, call \(A\) some webpage, then
\[PR(A) = \frac{1-d}{N} + d \left(\frac{PR(T_1)}{C(T_1)} + ... + \frac{PR(T_n)}{C(T_n)}\right)\] Where:

-\(PR(A)\) is the PageRank of page \(A\)

-\(N\) is the total number of pages in the network

-\(PR(T_i)\) is the PageRank of pages \(T_i\) which links to page \(A\)

-\(C(T_i)\) is the total number of outbound links from page \(T_i\) to other pages

-\(d\) is a damping factor which can be set between \(0\) and \(1\) (we take it as \(1\) for the remainder of this section, and leave it to you to investigate it further if you are interested)

Reference - PageRanks are calculated recursively because the PageRank of a webpage \(A\) depends on the PageRank of other webpages, which in turn are dependent on the PageRank of \(A\)…

Larry Page and Sergey Brin showed in their research paper that the PageRanks will eventually converge (why this is the case is outside the scope of this course), meaning that the algorithm will eventually terminate. Back then (1998), it took 52 iterations for the PageRanks of 322 million webpages to be determined.

PageRank example

Assume that \(d=1\) and that there are only \(N=4\) websites on the internet: \(A\), \(B\), \(C\) and \(D\). Links are given by the image below. That is, \(A\) has a link to \(C\) only. \(B\) has a link to \(A\) and \(D\). \(C\) has a link to all three other pages. \(D\) has a link to \(A\) only.

Initially, all Pages are assigned the same rank, \(1/4 = 0.25\).
At the first iteration, we have \[\begin{align*} PR(A) & = \frac{0.25}{2} + \frac{0.25}{3} + \frac{0.25}{1} = 11/24 \\ PR(B) & = \frac{0.25}{3} = 2/24 \\ PR(C) & = \frac{0.25}{1} = 6/24 \\ PR(D) & = \frac{0.25}{2} + \frac{0.25}{3} = 5/24 \\ \end{align*}\]
Then for the second iterations, we use the new Page Ranks obtained in the first one to get updated Pages Ranks, i.e. \[\begin{align*} PR(A) & = \frac{2/24}{2} + \frac{6/24}{3} + \frac{5/24}{1} = 8/24 \\ PR(B) & = \frac{6/24}{3} = 2/24 \\ PR(C) & = \frac{11/24}{1} = 11/24 \\ PR(D) & = \frac{2/24}{2} + \frac{6/24}{3} = 3/24 \\ \end{align*}\]
And so on… Can you write an \(\verb|R|\) program to do this automatically and find the converging \(PR\) values?

Final remarks on Algorithms

Writing the pseudocode of your algorithm before actually coding is good practice (especially for complex tasks)
Decomposing a problem into several ‘easier’ tasks is, in general, a good approach
- Each building block is an doable task
- You connect the building blocks to get a solution
Have efficiency in mind: can the task be done in fewer/faster steps?
The fact modern computers are powerful is not a valid excuse for ‘sloppy’, inefficient programming. In your actuarial careers you will encounter huge datasets, and the running time of your programs is likely to be relevant. Be efficient!
Additional Reference: Click me

Week7: Algorithms and pseudocode

Term 2 2023

What is an algorithm?

Algorithms - Applications

Pseudocode - Definition

Example of Algorithm & Pseudocode: The Collatz Sequence

Advantages of pseudocode

Exercise - Fibonacci sequence

Sample solution

Exercise: The sorting problem

Possible Solution - Strategy

Possible Solution - Code

Take aways for good pseudocode

Divide and Conquer Method

Divide and Conquer: An Exercise

Possible Solution

More on Algorithms: Recursion

Three Laws of Recursion

Example

Breakdown

Necessary properties of recursion

Exercise

Solution

Beware of infinite loops

Happy number - Sample Solution

Advanced Algorithm: PageRank

PageRank example

Final remarks on Algorithms