Time Complexity of Algorithms
let’s assume that we all know what is the subject of this presentation. In the following we are going to watch how deciding on type of our algorithm can make difference in run time of our algorithm. I suggest if you don’t know the main idea just google the main title of this page.
first let’s see a complexity chart
I wanted to hide the code for creating this plot but sometimes I don’t feel like being humble. so :
xd <- data.frame(x = c(0.01, 15))
cl <- brewer.pal(n = 6,name = 'Set2')
g <- ggplot(data = xd, aes(x = x))
g <- g + stat_function(fun = n , size = 1.5, aes(color = cl[1])) +
stat_function(fun = logn , size = 1.5, aes(color = cl[2])) +
stat_function(fun = nlogn , size = 1.5, aes(color = cl[3])) +
stat_function(fun = n2, size = 1.5, aes(color = cl[4])) +
stat_function(fun = pow2n , size = 1.5, aes(color = cl[5])) +
stat_function(fun = factn , size = 1.5, aes(color = cl[6]))
g <- g + scale_color_identity(
name = 'complexity Chart',
breaks = cl,
labels = c('n', 'log(n)', 'n*log(n)', 'n^2', '2^n', 'n!'),
guide = 'legend'
)
g <- g + ylim(c(-10, 200))
g <- g + labs(x = 'Elements',y ='Operations')
gSearch algorithms
our first two algorithms to investigate are binary search and linear search. binary works with log(n) and linear as its name suggests has a big-O of n.
O( \(\log n\) ) | Binary Search
in this section an algorithm with Big-O of level log(n) will be explained and simulated.
binary search is one of the fastest algorithms which has some assumptions. let’s take a look at this GIF :
I believe this image prepared us for jumping into codeing section but before that let’s declare or assumptions:
the array we pass to our function must be sorted in an ascending way. why? think about we we’re going to measure
another assumption is our target element is inside the array
bin_search <- function (v, t) {
lo <- 1
hi <- length(v)
while (lo <= hi) {
mid = round((lo + hi) / 2)
if (abs(v[mid] - t) <= 0 ) {
return(mid)
break
} else if (v[mid] < t) {
lo = mid + 1
} else {
hi = mid - 1
}
}
}let’s test this function with an array of random generated numbers , huh? :
test_array <- seq(from = 7,to = 2100,by = 7)
test_array[276]## [1] 1932
so we want to find out whether the function outputs 276 when we pass 1932 to it or not. we’ll see:
bin_search(v = test_array,t = 1932)## [1] 276
looks like our function works properly. now it’s time for
Simulation
in order to report a rather precise number I couldn’t use R language’s built in function which is named Sys.time() so I went through an advanced package named microbenchmark. I used uniform distribution for creating random numbers and I ran my time measuring for 50 times in order to have more coherent data.
df <- microbenchmark(
bin2 = bin_search(v = seq(1,100),t = round(runif(1,min = 1,max = 100))),
bin3 = bin_search(v = seq(1,1000),t = round(runif(1,min = 1,max = 1000))),
bin4 = bin_search(v = seq(1,10000),t = round(runif(1,min = 1,max = 10000))),
bin5 = bin_search(v = seq(1,100000),t = round(runif(1,min = 1,max = 100000))),
bin6 = bin_search(v = seq(1,1000000),t = round(runif(1,min = 1,max = 1000000))),
times = 50,unit = 'ms')
df## Unit: milliseconds
## expr min lq mean median uq max neval
## bin2 0.009440 0.012838 0.01443168 0.0135940 0.014349 0.034738 50
## bin3 0.011706 0.015104 0.01927988 0.0162365 0.016615 0.131399 50
## bin4 0.015481 0.018124 0.01897778 0.0186905 0.019258 0.033605 50
## bin5 0.016237 0.020390 0.02255730 0.0211450 0.021901 0.064945 50
## bin6 0.020390 0.023411 0.02473964 0.0241660 0.024543 0.054372 50
and finally plots of these numbers :
cl2 <- brewer.pal(name = 'Accent', n =6)
boxplot(df,col = cl2,log = T,outline = F,ylime = c(10,55))autoplot(df)## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
O(n) | linear search
as an example of linear search I’m going to implement linear search which which is a loop over an array to find the target.
ls <- function(v,t){
c = F
l <- length(v)
for (i in 1:l){
if (v[i] == t){
return(i)
c = T
}
}
if (c == F){
return('Not in Vector')
}
}it was an easy one. let’s test it.
vector_test <- seq(1,700,by = 2)
target <- 451
ans <- ls(vector_test,target)
ans## [1] 226
doing double check :
vector_test[ans]## [1] 451
it works. so now we can jump to simulation: I’m going to repeat what we did for binary search.so I just hide the code for simulation.
df1 <- microbenchmark(
bin2 = ls(v = seq(1,100),t = round(runif(1,min = 1,max = 100))),
bin3 = ls(v = seq(1,1000),t = round(runif(1,min = 1,max = 1000))),
bin4 = ls(v = seq(1,10000),t = round(runif(1,min = 1,max = 10000))),
bin5 = ls(v = seq(1,100000),t = round(runif(1,min = 1,max = 100000))),
bin6 = ls(v = seq(1,1000000),t = round(runif(1,min = 1,max = 1000000))),
times = 50,unit = 'ms')
df1## Unit: milliseconds
## expr min lq mean median uq max neval
## bin2 0.009440 0.012839 0.02550240 0.0147260 0.036248 0.065322 50
## bin3 0.011328 0.023033 0.04429838 0.0353045 0.059281 0.135552 50
## bin4 0.011328 0.124980 0.24445430 0.2131455 0.371918 0.589782 50
## bin5 0.220509 1.426880 2.46040374 2.7168875 3.653856 4.825490 50
## bin6 0.729110 11.818674 24.57978278 24.0232355 38.757970 51.957835 50
autoplot(df1)## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
sort algorithms
merge sort
bubble sort
O(\(nlog n\)) | merge sort algorithm
Merge sort is a sorting technique based on divide and conquer technique. With worst-case time complexity being O(\(n log n\)), it is one of the most respected algorithms.
Merge sort first divides the array into equal halves and then combines them in a sorted manner.
let’s see how it works visually :
let’s go for code part and the simulation : code part includes two main function. mmerge function sorts divided by half elements which mergesort function passes to it. mergesort function is a recursive one which takes an array and splits it into halves until we get single elements.
mmerge<-function(a,b) {
r<-numeric(length(a)+length(b))
ai<-1; bi<-1; j<-1;
for(j in 1:length(r)) {
if((ai<=length(a) && a[ai]<b[bi]) || bi>length(b)) {
r[j] <- a[ai]
ai <- ai+1
} else {
r[j] <- b[bi]
bi <- bi+1
}
}
r
}
mmergesort<-function(A) {
if(length(A)>1) {
q <- ceiling(length(A)/2)
a <- mmergesort(A[1:q])
b <- mmergesort(A[(q+1):length(A)])
mmerge(a,b)
} else {
A
}
}as we did in previous steps, we test our code chunk passing an unordered array :
test_arr <- c(4,11,3,0,6,4,3,8,6)
mmergesort(test_arr)## [1] 0 3 3 4 4 6 6 8 11
in this step we want to use microbenchmark package of R to benchmark our algorithm.
dfmerget <- microbenchmark(
n25 = mmergesort(runif(n = 25,min = 1, max = 25)),
n50 = mmergesort(runif(n = 50,min = 1, max = 50)),
n75 = mmergesort(runif(n = 75,min = 1, max = 75)),
n100 = mmergesort(runif(n = 100,min = 1, max = 100)),
n200 = mmergesort(runif(n = 200,min = 1,max = 200))
)
dfmerget## Unit: microseconds
## expr min lq mean median uq max neval
## n25 126.113 130.4550 159.9289 134.4195 170.8565 334.916 100
## n50 265.818 278.6560 316.7536 283.9420 304.3310 596.202 100
## n75 417.983 434.4075 531.6464 444.7915 510.1135 2185.441 100
## n100 575.057 590.9155 678.1216 603.1865 632.4490 2728.025 100
## n200 1226.007 1267.9180 1458.0309 1290.7620 1379.4940 3239.271 100
Mdfmerge <- dfmerget %>% group_by(expr) %>% summarise(median(time/1000))
Mdfmerge['ops'] <- c(25,50,75,100,200)
names(Mdfmerge)[2] <- 'median' ; ggplot(Mdfmerge,aes(ops,median),size = 1.08) + geom_point(aes(colour = 'data curve'),size=1.1) + geom_line(aes(colour = 'data curve')) + stat_function(fun = nlogn,size = 1.08,aes(colour = 'n*logn curve')) + labs(x = 'length of array',y = 'median of computing time [miliseconds]')we can see that the curve is similar approximately a good fit to it’s big O-notation.
O(\(n^2\)) | bubble sort
DescriptionBubble sort, sometimes referred to as sinking sort, is a simple sorting algorithm that repeatedly steps through the list, compares adjacent pairs and swaps them if they are in the wrong order. The pass through the list is repeated until the list is sorted. it consists of two main loops which makes its big O equal to O(\(n^2\)).
code part is as fast forward as the above Gif.
bubblesort <- function(arr){
n <- length(arr)
for(i in 1:n){
for(j in 1:(n-1)){
if(arr[j] > arr[j + 1]){
arr[c(j,j+1)] <- arr[c(j+1,j)]
}
}
}
return(arr)
}again we take a test from our function to see it works or not.
bubblesort(test_arr)## [1] 0 3 3 4 4 6 6 8 11
last step in this section is doing the simulation. before getting into our simulation we have to remind to ourselves that \(n^2\) is an extreme increase rate which can give us a blue screen if we don’t pay attention to simulation size and array lengths. unlike search algorithms that we even tried \(10^7\) abd yet it was a little number,now we can’t go further 1000!
bst <- microbenchmark(
n25 = bubblesort(runif(n = 25,min = 1, max = 25)),
n50 = bubblesort(runif(n = 50,min = 1, max = 50)),
n75 = bubblesort(runif(n = 75,min = 1, max = 75)),
n100 = bubblesort(runif(n = 100,min = 1, max = 100)),
n200 = bubblesort(runif(n = 200,min = 1,max = 200))
)
bst## Unit: microseconds
## expr min lq mean median uq max neval
## n25 126.112 157.4520 173.0273 169.9125 183.8825 248.449 100
## n50 528.992 627.3525 719.1307 660.7680 707.3990 2629.477 100
## n75 1176.166 1413.4765 1560.3743 1477.4765 1544.4970 3232.852 100
## n100 2333.076 2529.2290 2786.0411 2652.1315 2808.4510 5098.858 100
## n200 9650.979 10379.5225 11133.9604 10783.9125 11589.6705 16988.895 100
Mdbst <- bst %>% group_by(expr) %>% summarise(median(time/1000))
Mdbst['ops'] <- c(25,50,75,100,200)
names(Mdbst)[2] <- 'median'ggplot(Mdbst,aes(ops,median),size = 1.08) +
geom_point(aes(colour = 'data curve'),size=1.1) +
geom_line(aes(colour = 'data curve')) +
stat_function(fun = n2,size = 1.08,aes(colour = 'n^2')) +
stat_function(fun = n,aes(colour= 'n')) + labs(x = 'length of array',y = 'median of computing time [miliseconds]')it is abvious that our samples are not worst case and so they won’t fit on \(n^2\). they will be normally somewhere between best case which is a sorted input array which will fit on n and worst case which is \(n^2\).
comparison merge and bubble sort
finally we are going to compare two sort algorithms. in the plot below we can find out how the rates make difference. \(n^2\) is a immense load on system when it gets bigger than 1000 while \(n logn\) stays efficient even with \(10^5\) even larger numbers.
ggplot(Mdbst,aes(ops,median),size = 1.08) + geom_point(aes(colour = 'bubble'),size=1.1) + geom_line(aes(colour = 'bubble')) + stat_function(fun = n2,size = 1.08,aes(colour = 'n^2')) + stat_function(fun = nlogn,aes(colour= 'nlogn'),size = 1.07)+
geom_line(data = Mdfmerge,aes(x = ops,y = median,colour = 'merge'),size = 1.08) +
geom_point(data = Mdfmerge,aes(x = ops,y = median,colour = 'merge')) + labs(x = 'length of array',y = 'median of computing time [miliseconds]')