2023-06-08

What is standard deviation?

Why is it used in data analysis?

  • Standard deviation is a measure of the amount of variation that exists in a data set
  • This variation is calculated in relation to the mean (average) of the data
  • Standard deviation is used in data analysis to show how far values typically are from the mean

Calculating Standard Deviation

\(\sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2}\)

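  • \(\sigma\) = standard deviation (with the \(N-1\) divisor shown, this is the sample standard deviation, often written \(s\));
  • \(N\) = the number of data points;
  • \(x_i\) = each of the values of the data;
  • \(\overline{x}\) = the mean of the \(x_i\)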
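
As a minimal sketch, the formula can be worked through step by step in R and compared with the built-in sd() function, which also divides by N - 1. The vector x holds made-up values chosen only for illustration; it is not data from these slides.

```r
# Hypothetical example values (an assumption, for illustration only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

N <- length(x)               # number of data points
x_bar <- mean(x)             # mean of the values
sq_dev <- (x - x_bar)^2      # squared deviation of each value from the mean

# Sum the squared deviations, divide by N - 1, take the square root
sd_manual <- sqrt(sum(sq_dev) / (N - 1))

sd_manual
sd(x)                        # built-in; uses the same N - 1 divisor
```

Both lines should print the same number, since sd() implements the same calculation.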

The 68 - 95 - 99.7 Rule

Measuring from the mean (\(\mu\); written \(\overline{x}\) on the previous slide):

  • Approximately 68% of normally distributed data falls within 1 standard deviation (\(\sigma\)) of the mean
  • Around 95% falls within 2 standard deviations (\(\sigma\))
  • About 99.7% falls within 3 standard deviations (\(\sigma\))
  • About 99.9999998% falls within 6 standard deviations (\(\sigma\))
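
The percentages in this rule can be checked against the normal distribution directly. The sketch below assumes perfectly normal data and uses pnorm(), R's normal cumulative distribution function; the names k and coverage are introduced here for illustration.

```r
# Number of standard deviations on either side of the mean
k <- c(1, 2, 3, 6)

# pnorm(k) is the probability of a standard normal value being below k,
# so pnorm(k) - pnorm(-k) is the probability of falling within k sigma
coverage <- pnorm(k) - pnorm(-k)

# Show as percentages, with enough digits to see the 6-sigma value
data.frame(k = k, percent = format(100 * coverage, digits = 12))
```

Real data is rarely exactly normal, so these figures are best read as a guide rather than a guarantee.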

From: sixsigmastudyguide.com

Displaying a Normal Distribution

Mean = 0, plotted out to 6 standard deviations (\(\sigma\)) on either side of the mean, so about 99.9999998% of all values are displayed

Low Standard Deviation

Data is clustered around the mean (\(\mu\)) and is less spread out

High Standard Deviation

Data is further from the mean (\(\mu\)) and is more spread out

R code for Low and High Standard Deviation Graphs

library(ggplot2)  # provides ggplot(), geom_bar(), geom_point(), labs()

# Low standard deviation: 1000 normal draws with sd = 2, floored to integers
set.seed(102)
g1data = floor(rnorm(n = 1000, mean = 0, sd = 2))
table_g1data = table(g1data)        # count how often each integer occurs
dfg1 = as.data.frame(table_g1data)  # columns: g1data (factor) and Freq
g1plot = ggplot(dfg1, aes(x = g1data, y = Freq)) +
  geom_bar(stat = "identity", fill = "green", alpha = 0.5) +
  labs(title = "Example of a Low Standard Deviation", x = 'Number',
       y = 'Frequency')
g1plot

# High standard deviation: same steps with sd = 8, drawn as points
set.seed(777)
g2data = floor(rnorm(n = 1000, mean = 0, sd = 8))
table_g2data = table(g2data)
dfg2 = as.data.frame(table_g2data)
g2plot = ggplot(dfg2, aes(x = g2data, y = Freq)) +
  geom_point(stat = "identity", color = "blue", alpha = 0.5) +
  labs(title = "Example of a High Standard Deviation", x = 'Number',
       y = 'Frequency')
g2plot
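
As a quick numerical check on the two graphs, the sketch below regenerates both samples with the same seeds and parameters and compares their spread. The names low_sd_data, high_sd_data, sd_low and sd_high are introduced here for illustration and do not appear in the code above.

```r
# Regenerate the two samples with the same seeds and parameters as above
set.seed(102)
low_sd_data <- floor(rnorm(n = 1000, mean = 0, sd = 2))
set.seed(777)
high_sd_data <- floor(rnorm(n = 1000, mean = 0, sd = 8))

# Sample standard deviations; these should land roughly near 2 and 8
sd_low <- sd(low_sd_data)
sd_high <- sd(high_sd_data)
c(low = sd_low, high = sd_high)

# Share of values within 2 units of the sample mean in each data set
mean(abs(low_sd_data - mean(low_sd_data)) <= 2)
mean(abs(high_sd_data - mean(high_sd_data)) <= 2)
```

A much larger share of the low standard deviation sample sits close to its mean, which is what the clustered shape of the first graph reflects.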