DAT 301 Homework 3 - Statistics in Computer Science

2025-09-16

Using Statistics in Computer Science

Statistics appears in many areas of computer science. This presentation covers three of them:

Algorithms - The run time and memory usage of an algorithm vary with its input, so we characterize them by cases such as the worst case and measure them empirically, which lets us choose the best algorithm for the circumstances
Machine Learning - Regression concepts such as logistic regression are used to separate data into classes, and statistics are also used to evaluate the performance of a machine learning model
Network Systems - Statistics are used to model packet traffic and to decide how to distribute network load so that every computer connected to a network gets adequate performance

Algorithms

Statistics helps us describe the worst-case run time of different algorithms, as shown.

Algorithms Continued

Different operations take different amounts of time. We can see how these differences affect performance by simulating arrays ranging in size from 500 up to 500,000 elements and performing various operations on each one. Comparing the run times across array sizes shows how each operation slows as the size of the array, \(n\), grows.

We graphically compare three operations: summing every number in the array, which is linear at \(O(n)\); an optimized sort, which is \(O(n \log n)\); and the very slow bubble sort, which is \(O(n^2)\). The differences in run time become more dramatic as the array size increases.
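To make the gap between these growth rates concrete, the sketch below compares idealized operation counts at the smallest and largest array sizes mentioned above. The helper `work()` is a hypothetical name introduced here, and the counts ignore constant factors, so the ratios describe how the work scales rather than actual run times.

```r
# Hypothetical helper: idealized operation counts for an array of size n,
# ignoring constant factors
work <- function(n) {
  c(linear = n, n_log_n = n * log(n), quadratic = n^2)
}

# How many times more work the largest array (500,000) needs
# compared with the smallest (500)
ratio <- work(500000) / work(500)
print(round(ratio, 1))
```

The array is 1,000 times larger, so the linear operation does 1,000 times the work, the \(O(n \log n)\) sort does roughly 2,100 times the work, and bubble sort does 1,000,000 times the work.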

Algorithm Timing Examples

Let’s track the run time of three different operations on an array:

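# Summing is a single-pass operation - O(n)
time_sum = function(n) {
  v = runif(n)
  # Return the elapsed wall-clock time in seconds
  system.time(tmp <- sum(v))[["elapsed"]]
}

# Sorting with R's built-in function - O(n log n)
time_sort_optimized = function(n) {
  v = runif(n)
  system.time(tmp <- sort(v))[["elapsed"]]
}

# Bubble sort - O(n^2); returns the sorted vector
bubble_sort = function(v) {
  n = length(v)
  # Vectors of length 0 or 1 are already sorted
  if (n < 2) return(v)
  for (i in seq_len(n - 1)) {
    for (j in seq_len(n - i)) {
      # Swap adjacent elements that are out of order
      if (v[j] > v[j + 1]) {
        tmp <- v[j]; v[j] <- v[j + 1]; v[j + 1] <- tmp
      }
    }
  }
  v
}

# Time bubble sort on a random vector of length n, matching the helpers above
time_sort_bubble = function(n) {
  v = runif(n)
  system.time(tmp <- bubble_sort(v))[["elapsed"]]
}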
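As a rough sketch of how such timings might be collected, the block below times all three operations over a few array sizes and gathers the results in a data frame. It defines its own helpers (`time_op` and `bubble` are hypothetical names) so it runs on its own, and it assumes much smaller sizes than 500,000 so that bubble sort finishes quickly. Elapsed times depend on the machine and will differ from run to run.

```r
# Hypothetical helper: elapsed seconds for applying f to a random vector of length n
time_op <- function(f, n) {
  v <- runif(n)
  system.time(f(v))[["elapsed"]]
}

# Plain bubble sort, repeated here so the block is self-contained
bubble <- function(v) {
  n <- length(v)
  if (n < 2) return(v)
  for (i in seq_len(n - 1)) {
    for (j in seq_len(n - i)) {
      if (v[j] > v[j + 1]) {
        tmp <- v[j]; v[j] <- v[j + 1]; v[j + 1] <- tmp
      }
    }
  }
  v
}

# Assumed sizes, kept small for a quick demonstration
sizes <- c(250, 500, 1000, 2000)

# One row per array size, one column of elapsed seconds per operation
timings <- data.frame(
  n      = sizes,
  sum    = sapply(sizes, function(n) time_op(sum, n)),
  sort   = sapply(sizes, function(n) time_op(sort, n)),
  bubble = sapply(sizes, function(n) time_op(bubble, n))
)
print(timings)
```

At these sizes the sum and built-in sort may register as 0 seconds because they finish faster than the timer's resolution, while the bubble sort column should grow noticeably as \(n\) doubles.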

Algorithm Timing Examples - Visualized

Machine Learning

Statistics is used heavily in machine learning. One use is finding boundaries that separate data into multiple classes, and logistic regression is one technique for doing so. It estimates, from the properties of each data point, the probability that the point belongs to a given class, so we can assign each point to its most likely class.

For example, suppose we have a data set of flowers that records each flower's growth time from some baseline and the brightness of its petal color. We can use machine learning to draw a boundary in the space of these two features and then predict which species a flower belongs to from its growth time and petal brightness.
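A minimal sketch of this idea with base R's `glm()` follows. The data are simulated, not real measurements: the two species labels, the feature means, and the spreads are all assumptions chosen only so that the classes overlap a little.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated (not real) measurements for two hypothetical species, A and B
n_per_class <- 100
flowers <- data.frame(
  growth_time = c(rnorm(n_per_class, mean = 10, sd = 2),
                  rnorm(n_per_class, mean = 14, sd = 2)),
  brightness  = c(rnorm(n_per_class, mean = 0.4, sd = 0.1),
                  rnorm(n_per_class, mean = 0.6, sd = 0.1)),
  species     = factor(rep(c("A", "B"), each = n_per_class))
)

# Logistic regression: models the probability that a flower is species B
model <- glm(species ~ growth_time + brightness,
             data = flowers, family = binomial)

# Classify each flower by thresholding the fitted probability at 0.5
p_b <- predict(model, type = "response")
predicted <- ifelse(p_b > 0.5, "B", "A")
accuracy <- mean(predicted == flowers$species)
print(accuracy)

# The decision boundary is the line where
# b0 + b1 * growth_time + b2 * brightness = 0, solved here for brightness
b <- coef(model)
boundary_intercept <- -b[["(Intercept)"]] / b[["brightness"]]
boundary_slope     <- -b[["growth_time"]] / b[["brightness"]]
```

Flowers above the line `brightness = boundary_intercept + boundary_slope * growth_time` are classified as species B and those below it as species A. The accuracy here is measured on the same data used for fitting, so it is an optimistic estimate.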

Machine Learning - Visualized

Network Systems

Computer networks connect many computers with different purposes so that they can communicate and exchange data with one another. The frequency, size, and type of data transferred between computers are not deterministic and can vary widely, which makes statistics a good fit for modeling that randomness. A router receives and transmits packets for all the connected computers, and we can model the number of packets arriving each second with a Poisson distribution, where \(X\) is the random number of packets arriving per second, \(\lambda\) is the average number of packet arrivals per second, and \(k\) is the specific number of packets per second whose probability we want to compute:

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \]
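The formula can be written out directly and compared against R's built-in `dpois()`. This is only a sketch: `poisson_pmf` is a hypothetical name, and the rate of 4 packets per second is an assumed value used for illustration.

```r
# Poisson probability written directly from the formula:
# lambda^k * exp(-lambda) / k!
poisson_pmf <- function(k, lambda) {
  lambda^k * exp(-lambda) / factorial(k)
}

lambda <- 4   # assumed average arrivals per second, for illustration only
k <- 0:20     # packet counts to evaluate

manual  <- poisson_pmf(k, lambda)
builtin <- dpois(k, lambda)

# The hand-written formula should agree with R's built-in density
print(all.equal(manual, builtin))

# The probabilities over k = 0..20 should add up to almost exactly 1
print(sum(manual))
```

In practice `dpois()` is the safer choice, since the direct formula can overflow for large \(k\) when `factorial(k)` becomes too big to represent.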

Network Systems Example

Suppose we measure that our router handles an average of 6 packets per second, so \(\lambda = 6\). To find the probability that exactly 2 packets arrive in a given second, we evaluate the distribution at \(k = 2\):

\[ P(X = 2) = \frac{6^2 e^{-6}}{2!} = 18 e^{-6} \approx 0.0446 \]
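This result can be checked numerically. The sketch below computes the exact value with `dpois()` and then compares it with the fraction of simulated one-second windows that contain exactly 2 packets. The number of simulated windows and the seed are arbitrary choices.

```r
lambda <- 6  # average packets per second, from the example above

# Exact probability of exactly 2 arrivals in one second
p_exact <- dpois(2, lambda)
print(round(p_exact, 4))  # 0.0446

# Simulate many one-second windows and count how often exactly 2 packets arrive
set.seed(1)
arrivals <- rpois(100000, lambda)
p_simulated <- mean(arrivals == 2)
print(p_simulated)
```

The simulated fraction should land close to the exact value, with the gap shrinking as more windows are simulated.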

Network Systems Example - Visualized

Conclusions

Almost every area of computer science applies statistics. Even where statistics is not used directly, it can be applied indirectly in virtually any computer science field to measure performance, gauge user satisfaction, or evaluate any other metric of interest.