2024-02-03

Standard Deviation

Standard deviation is a measure of how dispersed your data is. In other words, it’s how variable your data is. It can be used alongside the mean to describe a distribution, typically a normal distribution, and to plot that distribution as a graph. The distribution can then be used to estimate how likely a certain outcome or value is, that is, its probability of occurring.

In a normal distribution, most values fall close to the mean and become less common the further you move away from it. How quickly they become less common depends on the standard deviation: a low one means a steep drop-off, a high one a gradual drop-off (covered later).
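To make “how likely” concrete, here is a minimal R sketch. The mean of 50 and standard deviation of 10 are made-up values for illustration, not the data used later on; pnorm() is base R’s cumulative distribution function for the normal distribution.

```r
# Hypothetical parameters, chosen only for illustration
mu <- 50
sigma <- 10

# Probability that a value falls within one standard deviation of the mean
p_within_1sd <- pnorm(mu + sigma, mean = mu, sd = sigma) -
  pnorm(mu - sigma, mean = mu, sd = sigma)

# Probability that a value lands more than two standard deviations above the mean
p_above_2sd <- pnorm(mu + 2 * sigma, mean = mu, sd = sigma, lower.tail = FALSE)

print(round(p_within_1sd, 4))  # about 0.6827
print(round(p_above_2sd, 4))   # about 0.0228
```

Roughly 68% of values land within one standard deviation of the mean, while values more than two standard deviations above it are rare.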

The Standard Deviation Formula (1/2)

Standard deviation can easily be calculated using the following formula:

\[ \sigma =\sqrt{\frac{\sum \left ( {x_i}-\mu \right )^2}{N}} \]

Let’s break it down:

  • σ: This is the symbol for standard deviation.
  • xᵢ: This represents the i-th value in the vector; the sum ∑ runs over every value.
  • μ: This represents the mean of the values in a vector.
  • N: This represents the size of the vector, that is the number of elements in it.
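
For completeness, the mean μ used in the formula is itself just the sum of the values divided by N. Writing the summation bounds out explicitly (an addition here, the formula above leaves them implied) gives:

\[ \mu =\frac{1}{N}\sum_{i=1}^{N} x_i \]

\[ \sigma =\sqrt{\frac{\sum_{i=1}^{N} \left ( {x_i}-\mu \right )^2}{N}} \]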

The Standard Deviation Formula (2/2)

\[ \sigma =\sqrt{\frac{\sum \left ( {x_i}-\mu \right )^2}{N}} \]

In other words, to find the standard deviation of a vector, we follow these steps:

  • For each value in the vector, subtract the mean of all vector values from it.
  • For each value in the vector, square the result from the subtraction, then add the new results together.
  • Divide the total by the number of values in the vector, and take the square root of this new result.
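
The three steps above can be sketched as a small R function. The name pop_sd is a hypothetical helper, not something from base R, and the input vector is made up so the arithmetic is easy to check by hand.

```r
# pop_sd is a hypothetical helper name, not part of base R
pop_sd <- function(x) {
  deviations <- x - mean(x)            # step 1: subtract the mean from each value
  sum_of_squares <- sum(deviations^2)  # step 2: square each result and add them up
  sqrt(sum_of_squares / length(x))     # step 3: divide by N, then take the square root
}

# Small made-up vector with mean 5 and squared deviations summing to 32
result <- pop_sd(c(2, 4, 4, 4, 5, 5, 7, 9))
print(result)  # 2, since sqrt(32 / 8) = 2
```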

Let’s go through an example of standard deviation. For that, we’ll generate a vector:

##  [1] 11 16 23 34 42 45 55 58 79 89

Example of Standard Deviation

  • For each value in the vector, subtract the mean of all vector values from it.
## Mean:  45.2
## Vector after subtracting mean:  -34.2 -29.2 -22.2 -11.2 -3.2 -0.2 9.8 12.8 33.8 43.8
  • For each value in the vector, square the result from the subtraction, then add the new results together.
## Sum of squares:  5971.6
  • Divide the total by the number of values in the vector, and take the square root of this new result.
## Standard Deviation: 24.43686
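
The numbers above can be reproduced with a few lines of R. This is a sketch rather than the code behind the slides; the variable names are assumptions. One thing to watch for: base R’s sd() divides by N − 1 (the sample standard deviation), so it will not match the result of the formula used here, which divides by N.

```r
vec <- c(11, 16, 23, 34, 42, 45, 55, 58, 79, 89)

mean_val <- mean(vec)                 # 45.2
deviations <- vec - mean_val          # each value minus the mean
sum_sq <- sum(deviations^2)           # 5971.6
pop_sd <- sqrt(sum_sq / length(vec))  # divides by N, about 24.43686

# sd() divides by N - 1 instead, so it returns a slightly larger value
sample_sd <- sd(vec)

cat("Mean: ", mean_val, "\n")
cat("Sum of squares: ", sum_sq, "\n")
cat("Standard Deviation (N): ", pop_sd, "\n")
cat("sd() result (N - 1): ", sample_sd, "\n")
```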

High vs Low Standard Deviation

A standard deviation is usually described as either high or low. Data with a high standard deviation is spread out and varied, while data with a low standard deviation is clustered around the mean.

Typically, a distribution with a low standard deviation also has a higher “peak” than one with a high standard deviation. Think of high standard deviation as a hill, and low standard deviation as more of a mountain.
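
The hill-versus-mountain picture can be checked numerically: the height of a normal curve at its mean shrinks as the standard deviation grows. The shared mean of 50 and the two standard deviations below (5 and 25) are invented for illustration.

```r
mu <- 50        # hypothetical mean shared by both curves
sd_low <- 5     # hypothetical "mountain"
sd_high <- 25   # hypothetical "hill"

# Height of each normal density curve at its mean
peak_low <- dnorm(mu, mean = mu, sd = sd_low)
peak_high <- dnorm(mu, mean = mu, sd = sd_high)

print(peak_low)
print(peak_high)
print(peak_low / peak_high)  # 5: the peak scales with 1 / sd
```

Because the peak height of a normal curve is 1 / (σ√(2π)), making the standard deviation five times larger makes the peak five times lower.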

We can visually determine whether the standard deviation is high or low by using the calculated mean and standard deviation from our vector, and creating a normal distribution plot with it. We’ll show a histogram and a smooth-curved graph.

As a refresher:

## Mean:  45.2
## Standard Deviation:  24.43686

The following two plots show the normal distribution defined by this mean and standard deviation.

Normal Distribution Function (1/3)

Created using plot_ly()

Normal Distribution Function (2/3)

Created using plot()

Code for Normal Distribution

If you’re interested in the normal distribution code used to create the two previous graphs, this is what it looks like:

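# Setup: meanVal and stDev come from the example vector
# (the name vec is an assumption; stDev divides by N, as in the formula)
library(plotly)
vec <- c(11, 16, 23, 34, 42, 45, 55, 58, 79, 89)
meanVal <- mean(vec)
stDev <- sqrt(mean((vec - meanVal)^2))

# PLOT_LY(): histogram of 1000 random draws from the normal distribution
set.seed(123)
x <- rnorm(1000, mean = meanVal, sd = stDev)

plot_ly(x = x, 
        type = "histogram"
        )

# PLOT(): smooth density curve evaluated at 0, 1, ..., 100
sequence <- seq(from = 0, to = 100)
density <- dnorm(sequence, mean = meanVal, sd = stDev)
plot(sequence, density, 
     type = "l", 
     xlab = "x", 
     ylab = "Probability Density")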

The line set.seed(123) fixes the starting point of R’s random number generator, so rnorm produces the same values every time the code is run and the histogram is reproducible.
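
A quick way to see the effect of set.seed() is to draw the same numbers twice. This is a small sketch with made-up variable names; the seed value itself is arbitrary.

```r
set.seed(123)
first_run <- rnorm(5)

set.seed(123)             # resetting the seed restarts the random stream
second_run <- rnorm(5)

unseeded_run <- rnorm(5)  # continues the stream, so these values differ

print(identical(first_run, second_run))    # TRUE
print(identical(first_run, unseeded_run))  # FALSE
```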

Normal Distribution Function (3/3)

Ok, let’s take a look. We can see that the values are spread out across most of the 0 to 100 range, so the curve looks more like a hill than a mountain. Though it is centered on the mean, there isn’t a large spike there; the curve is smooth. Therefore, this dataset likely has a high standard deviation.

To better get the point across, let’s look at an example of low and high standard deviation using these two vectors to calculate mean and standard deviation.

## Low standard deviation:  30 31 32 36 40 43 44 45 48 50
## High standard deviation:  1 13 33 42 45 50 54 68 77 83 96 100
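
Before looking at the plots, the two vectors can be compared numerically. This sketch uses the same divide-by-N formula as the rest of the walkthrough; pop_sd is a hypothetical helper name, not a base R function.

```r
low_vec <- c(30, 31, 32, 36, 40, 43, 44, 45, 48, 50)
high_vec <- c(1, 13, 33, 42, 45, 50, 54, 68, 77, 83, 96, 100)

# Population standard deviation (divide by N), matching the formula used earlier
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

cat("Low:  mean =", mean(low_vec), " sd =", pop_sd(low_vec), "\n")
cat("High: mean =", mean(high_vec), " sd =", pop_sd(high_vec), "\n")
```

The first vector comes out at roughly 6.9 and the second at roughly 29.7, so the second is more than four times as spread out.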

Low Standard Deviation

This plot is an example of low standard deviation. Notice the single large spike in the middle, rather than an even spread across the range.

High Standard Deviation

And this plot is an example of high standard deviation. Notice how it appears flatter and smoother, with no sudden peak.

Why Standard Deviation?

Ok, that’s all well and good. However, why do we use standard deviation in data analysis anyway?

As stated before, standard deviation can be used to show how dispersed or variable your data is. It can be used to help data analysts confirm the reliability of their data, as well as determine how consistent it is across scales.

When comparing two datasets, like we did here, it can be used to visualize just how varied one is compared to the other. Looking at the numbers in these small datasets, it’s easy to “eyeball” that one has a higher standard deviation than the other. However, when dealing with hundreds upon hundreds of values, that becomes increasingly hard. Calculating the standard deviation and plotting the distribution makes the difference easy to spot.
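
As a rough illustration of that last point, the sketch below simulates two larger datasets that would be tedious to compare by eye. The data is randomly generated with made-up parameters (same mean, different spread), and the names tight and wide are assumptions.

```r
set.seed(42)  # arbitrary seed so the simulated data is reproducible

# Two simulated datasets of 1000 values each, same mean but different spread
tight <- rnorm(1000, mean = 50, sd = 5)
wide <- rnorm(1000, mean = 50, sd = 20)

# sd() divides by N - 1; with 1000 values the difference from dividing by N is negligible
cat("tight: mean =", mean(tight), " sd =", sd(tight), "\n")
cat("wide:  mean =", mean(wide), " sd =", sd(wide), "\n")
```

The two means come out close to each other, while the standard deviations differ by roughly a factor of four, which is exactly the kind of difference that is hard to see in a long column of raw numbers.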