Time Complexity of Algorithms

let’s assume that we all know what is the subject of this presentation. In the following we are going to watch how deciding on type of our algorithm can make difference in run time of our algorithm. I suggest if you don’t know the main idea just google the main title of this page.

first let’s see a complexity chart

I wanted to hide the code for creating this plot but sometimes I don’t feel like being humble. so :

xd <- data.frame(x = c(0.01, 15))
cl <- brewer.pal(n = 6,name = 'Set2')

g <- ggplot(data = xd, aes(x = x))
g <- g + stat_function(fun = n , size = 1.5, aes(color = cl[1])) +
      stat_function(fun = logn , size = 1.5, aes(color = cl[2])) +
      stat_function(fun = nlogn , size = 1.5, aes(color = cl[3])) +
      stat_function(fun = n2, size = 1.5, aes(color = cl[4])) +
      stat_function(fun = pow2n , size = 1.5, aes(color = cl[5])) +
      stat_function(fun = factn , size = 1.5, aes(color = cl[6]))
g <- g + scale_color_identity(
      name = 'complexity Chart',
      breaks = cl,
      labels = c('n', 'log(n)', 'n*log(n)', 'n^2', '2^n', 'n!'),
      guide = 'legend'
)
g <- g + ylim(c(-10, 200))
g <- g + labs(x = 'Elements',y ='Operations')
g

Search algorithms

our first two algorithms to investigate are binary search and linear search. binary works with log(n) and linear as its name suggests has a big-O of n.

O( \(\log n\) ) | Binary Search

in this section an algorithm with Big-O of level log(n) will be explained and simulated.

binary search is one of the fastest algorithms which has some assumptions. let’s take a look at this GIF :

I believe this image prepared us for jumping into codeing section but before that let’s declare or assumptions:

the array we pass to our function must be sorted in an ascending way. why? think about we we’re going to measure
another assumption is our target element is inside the array

bin_search <- function (v, t) {
      lo <- 1
      hi <- length(v)
      while (lo <= hi) {
            mid = round((lo + hi) / 2)
            
            if (abs(v[mid] - t) <= 0 ) {
                  return(mid)
                  break
            } else if (v[mid] < t) {
                  lo = mid + 1
            } else {
                  hi = mid - 1
            }
      }
}

let’s test this function with an array of random generated numbers , huh? :

test_array <- seq(from = 7,to = 2100,by = 7)
test_array[276]

## [1] 1932

so we want to find out whether the function outputs 276 when we pass 1932 to it or not. we’ll see:

bin_search(v = test_array,t = 1932)

## [1] 276

looks like our function works properly. now it’s time for

Simulation

in order to report a rather precise number I couldn’t use R language’s built in function which is named Sys.time() so I went through an advanced package named microbenchmark. I used uniform distribution for creating random numbers and I ran my time measuring for 50 times in order to have more coherent data.

df <- microbenchmark(
      bin2   =  bin_search(v = seq(1,100),t = round(runif(1,min = 1,max = 100))),
      bin3 =  bin_search(v = seq(1,1000),t = round(runif(1,min = 1,max = 1000))),
      bin4   =  bin_search(v = seq(1,10000),t = round(runif(1,min = 1,max = 10000))),
      bin5 =  bin_search(v = seq(1,100000),t = round(runif(1,min = 1,max = 100000))),
      bin6   =  bin_search(v = seq(1,1000000),t = round(runif(1,min = 1,max = 1000000))),
      times = 50,unit = 'ms')

df

## Unit: milliseconds
##  expr      min       lq       mean    median       uq      max neval
##  bin2 0.009440 0.012838 0.01443168 0.0135940 0.014349 0.034738    50
##  bin3 0.011706 0.015104 0.01927988 0.0162365 0.016615 0.131399    50
##  bin4 0.015481 0.018124 0.01897778 0.0186905 0.019258 0.033605    50
##  bin5 0.016237 0.020390 0.02255730 0.0211450 0.021901 0.064945    50
##  bin6 0.020390 0.023411 0.02473964 0.0241660 0.024543 0.054372    50

and finally plots of these numbers :

cl2 <- brewer.pal(name = 'Accent', n =6)
boxplot(df,col = cl2,log = T,outline = F,ylime = c(10,55))

autoplot(df)

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

O(n) | linear search

as an example of linear search I’m going to implement linear search which which is a loop over an array to find the target.

ls <- function(v,t){
      c = F
      l <- length(v)
      for (i in 1:l){
            if (v[i] == t){
                  return(i)
                  c = T
            }
      }
      if (c == F){
            return('Not in Vector')
      }
}

it was an easy one. let’s test it.

vector_test <- seq(1,700,by = 2)
target <- 451
ans <- ls(vector_test,target)
ans

## [1] 226

doing double check :

vector_test[ans]

## [1] 451

it works. so now we can jump to simulation: I’m going to repeat what we did for binary search.so I just hide the code for simulation.

df1 <- microbenchmark(
      bin2   =  ls(v = seq(1,100),t = round(runif(1,min = 1,max = 100))),
      bin3   =  ls(v = seq(1,1000),t = round(runif(1,min = 1,max = 1000))),
      bin4   =  ls(v = seq(1,10000),t = round(runif(1,min = 1,max = 10000))),
      bin5   =  ls(v = seq(1,100000),t = round(runif(1,min = 1,max = 100000))),
      bin6   =  ls(v = seq(1,1000000),t = round(runif(1,min = 1,max = 1000000))),
      times = 50,unit = 'ms')

df1

## Unit: milliseconds
##  expr      min        lq        mean     median        uq       max neval
##  bin2 0.009440  0.012839  0.02550240  0.0147260  0.036248  0.065322    50
##  bin3 0.011328  0.023033  0.04429838  0.0353045  0.059281  0.135552    50
##  bin4 0.011328  0.124980  0.24445430  0.2131455  0.371918  0.589782    50
##  bin5 0.220509  1.426880  2.46040374  2.7168875  3.653856  4.825490    50
##  bin6 0.729110 11.818674 24.57978278 24.0232355 38.757970 51.957835    50

autoplot(df1)

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

sort algorithms

merge sort
bubble sort

O(\(nlog n\)) | merge sort algorithm

Merge sort is a sorting technique based on divide and conquer technique. With worst-case time complexity being O(\(n log n\)), it is one of the most respected algorithms.

Merge sort first divides the array into equal halves and then combines them in a sorted manner.

let’s see how it works visually :

let’s go for code part and the simulation : code part includes two main function. mmerge function sorts divided by half elements which mergesort function passes to it. mergesort function is a recursive one which takes an array and splits it into halves until we get single elements.

mmerge<-function(a,b) {
      r<-numeric(length(a)+length(b))
      ai<-1; bi<-1; j<-1;
      for(j in 1:length(r)) {
            if((ai<=length(a) && a[ai]<b[bi]) || bi>length(b)) {
                  r[j] <- a[ai]
                  ai <- ai+1
            } else {
                  r[j] <- b[bi]
                  bi <- bi+1          
            }
      }
      r
}
mmergesort<-function(A) {
      if(length(A)>1) {
            q <- ceiling(length(A)/2)
            a <- mmergesort(A[1:q])
            b <- mmergesort(A[(q+1):length(A)])
            mmerge(a,b)
      } else {
            A
      }
}

as we did in previous steps, we test our code chunk passing an unordered array :

test_arr <- c(4,11,3,0,6,4,3,8,6) 
mmergesort(test_arr)

## [1]  0  3  3  4  4  6  6  8 11

in this step we want to use microbenchmark package of R to benchmark our algorithm.

dfmerget <- microbenchmark(
      n25  = mmergesort(runif(n = 25,min = 1, max = 25)),
      n50  = mmergesort(runif(n = 50,min = 1, max = 50)),
      n75 = mmergesort(runif(n = 75,min = 1, max = 75)),
      n100 = mmergesort(runif(n = 100,min = 1, max = 100)),
      n200 = mmergesort(runif(n = 200,min = 1,max = 200))
      
)

dfmerget

## Unit: microseconds
##  expr      min        lq      mean    median        uq      max neval
##   n25  126.113  130.4550  159.9289  134.4195  170.8565  334.916   100
##   n50  265.818  278.6560  316.7536  283.9420  304.3310  596.202   100
##   n75  417.983  434.4075  531.6464  444.7915  510.1135 2185.441   100
##  n100  575.057  590.9155  678.1216  603.1865  632.4490 2728.025   100
##  n200 1226.007 1267.9180 1458.0309 1290.7620 1379.4940 3239.271   100

Mdfmerge <- dfmerget %>%  group_by(expr) %>%  summarise(median(time/1000)) 
Mdfmerge['ops'] <- c(25,50,75,100,200)
names(Mdfmerge)[2] <- 'median' ;

ggplot(Mdfmerge,aes(ops,median),size = 1.08) + geom_point(aes(colour = 'data curve'),size=1.1) + geom_line(aes(colour = 'data curve')) + stat_function(fun = nlogn,size = 1.08,aes(colour = 'n*logn curve')) + labs(x = 'length of array',y = 'median of computing time [miliseconds]')

we can see that the curve is similar approximately a good fit to it’s big O-notation.

O(\(n^2\)) | bubble sort

DescriptionBubble sort, sometimes referred to as sinking sort, is a simple sorting algorithm that repeatedly steps through the list, compares adjacent pairs and swaps them if they are in the wrong order. The pass through the list is repeated until the list is sorted. it consists of two main loops which makes its big O equal to O(\(n^2\)).

code part is as fast forward as the above Gif.

bubblesort <- function(arr){
      n <- length(arr)
      for(i in 1:n){
            for(j in 1:(n-1)){
                  if(arr[j] > arr[j + 1]){
                        arr[c(j,j+1)] <- arr[c(j+1,j)]
                  }
            }
      }
      return(arr)
}

again we take a test from our function to see it works or not.

bubblesort(test_arr)

## [1]  0  3  3  4  4  6  6  8 11

last step in this section is doing the simulation. before getting into our simulation we have to remind to ourselves that \(n^2\) is an extreme increase rate which can give us a blue screen if we don’t pay attention to simulation size and array lengths. unlike search algorithms that we even tried \(10^7\) abd yet it was a little number,now we can’t go further 1000!

bst <- microbenchmark(
      n25  = bubblesort(runif(n = 25,min = 1, max = 25)),
      n50  = bubblesort(runif(n = 50,min = 1, max = 50)),
      n75  = bubblesort(runif(n = 75,min = 1, max = 75)),
      n100 = bubblesort(runif(n = 100,min = 1, max = 100)),
      n200 = bubblesort(runif(n = 200,min = 1,max = 200))
      
)
bst

## Unit: microseconds
##  expr      min         lq       mean     median         uq       max neval
##   n25  126.112   157.4520   173.0273   169.9125   183.8825   248.449   100
##   n50  528.992   627.3525   719.1307   660.7680   707.3990  2629.477   100
##   n75 1176.166  1413.4765  1560.3743  1477.4765  1544.4970  3232.852   100
##  n100 2333.076  2529.2290  2786.0411  2652.1315  2808.4510  5098.858   100
##  n200 9650.979 10379.5225 11133.9604 10783.9125 11589.6705 16988.895   100

Mdbst <- bst %>%  group_by(expr) %>%  summarise(median(time/1000)) 
Mdbst['ops'] <- c(25,50,75,100,200)
names(Mdbst)[2] <- 'median'

ggplot(Mdbst,aes(ops,median),size = 1.08) +
geom_point(aes(colour = 'data curve'),size=1.1) +
geom_line(aes(colour = 'data curve')) +
stat_function(fun = n2,size = 1.08,aes(colour = 'n^2')) +
stat_function(fun = n,aes(colour= 'n')) + labs(x = 'length of array',y = 'median of computing time [miliseconds]')

it is abvious that our samples are not worst case and so they won’t fit on \(n^2\). they will be normally somewhere between best case which is a sorted input array which will fit on n and worst case which is \(n^2\).

comparison merge and bubble sort

finally we are going to compare two sort algorithms. in the plot below we can find out how the rates make difference. \(n^2\) is a immense load on system when it gets bigger than 1000 while \(n logn\) stays efficient even with \(10^5\) even larger numbers.

ggplot(Mdbst,aes(ops,median),size = 1.08) + geom_point(aes(colour = 'bubble'),size=1.1) + geom_line(aes(colour = 'bubble')) + stat_function(fun = n2,size = 1.08,aes(colour = 'n^2')) + stat_function(fun = nlogn,aes(colour= 'nlogn'),size = 1.07)+
      geom_line(data = Mdfmerge,aes(x = ops,y = median,colour = 'merge'),size = 1.08) +
      geom_point(data = Mdfmerge,aes(x = ops,y = median,colour = 'merge')) + labs(x = 'length of array',y = 'median of computing time [miliseconds]')

Big-O notation simulation

Rooholamin Rasooli

April 4, 2019