While R provides a straightforward function (sd) for calculating standard deviation, I often understand a programming language better when I work out what is “going on behind the scenes.” So I’m going to calculate the standard deviation of a vector of values from scratch: starting from the formula, creating new vectors and variables, and using only basic R functions to arrive at the result.
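The formula I’m going to use to calculate the standard deviation of vector elements is the population standard deviation, where N is the number of elements, x_i is each element, and μ is their mean:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
(Standard deviation formulas, 2014)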
I’m going to calculate the standard deviation for Miles Per Gallon (mpg) from the “mtcars” data set. Note that mtcars ships with base R in the “datasets” package, so it is available with or without loading “dplyr”. In short, I’m going to create a vector of numbers called “mpg” that holds the values of the mpg column in mtcars. I will do this using the following R code:
require (dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
mpg <- mtcars$mpg  # select the mpg column by name rather than by position
mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
Next, I need the mean of this vector for use in the formula above. I calculate it with the “mean” function and store it in a new variable, mean_mpg, using the following code:
mean_mpg <- mean(mpg)
mean_mpg
## [1] 20.09062
So the mean is 20.09062.
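In the same behind-the-scenes spirit, the mean itself can be sketched as a sum divided by a count. The name manual_mean below is my own illustrative choice, and the block assumes a default R session where the built-in mtcars data set is available.

```r
# Sketch: compute the mean by hand and compare it with mean().
# `manual_mean` is an illustrative name, not part of the walkthrough above.
mpg <- mtcars$mpg                      # mtcars comes from R's built-in datasets package
manual_mean <- sum(mpg) / length(mpg)  # total divided by the number of elements
all.equal(manual_mean, mean(mpg))      # TRUE when the two calculations agree
```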
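Summing the squared differences is a bit more involved and requires a bit more “behind the scenes” code. First we must subtract the mean from each element in mpg. Because arithmetic in R is vectorized, subtracting the single value mean_mpg from the vector mpg does this for every element in one step. I will create a new vector called “differences” using the following code: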
differences <- mpg - mean_mpg
differences
## [1] 0.909375 0.909375 2.709375 1.309375 -1.390625 -1.990625 -5.790625
## [8] 4.309375 2.709375 -0.890625 -2.290625 -3.690625 -2.790625 -4.890625
## [15] -9.690625 -9.690625 -5.390625 12.309375 10.309375 13.809375 1.409375
## [22] -4.590625 -4.890625 -6.790625 -0.890625 7.209375 5.909375 10.309375
## [29] -4.290625 -0.390625 -5.090625 1.309375
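To make the vectorized subtraction less of a black box, here is a rough sketch of the same step written as an explicit loop. The name loop_differences is hypothetical and only there for comparison. The last line also hints at why the differences get squared later: the raw deviations from the mean cancel each other out.

```r
# Sketch: the vectorized subtraction is equivalent to an explicit loop.
# `loop_differences` is an illustrative name, not part of the walkthrough above.
mpg <- mtcars$mpg
mean_mpg <- mean(mpg)

loop_differences <- numeric(length(mpg))   # pre-allocate the result vector
for (i in seq_along(mpg)) {
  loop_differences[i] <- mpg[i] - mean_mpg # one element at a time
}

all.equal(loop_differences, mpg - mean_mpg) # same values as the one-line version
sum(loop_differences)                       # numerically zero: deviations cancel out
```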
Next I need to square each of these numbers and then sum them. This is easily done with the ^ operator and the “sum” function in R. The code and results look like this:
square_differences <- differences ^ 2
square_differences
## [1] 0.8269629 0.8269629 7.3407129 1.7144629 1.9338379
## [6] 3.9625879 33.5313379 18.5707129 7.3407129 0.7932129
## [11] 5.2469629 13.6207129 7.7875879 23.9182129 93.9082129
## [16] 93.9082129 29.0588379 151.5207129 106.2832129 190.6988379
## [21] 1.9863379 21.0738379 23.9182129 46.1125879 0.7932129
## [26] 51.9750879 34.9207129 106.2832129 18.4094629 0.1525879
## [31] 25.9144629 1.7144629
sum_of_squares <- sum(square_differences)
sum_of_squares
## [1] 1126.047
However, this separate summation step is not necessary for the population formula: the “mean” function in R sums the elements and divides that sum by the number of elements in a single call, leading us to:
mean_of_sq_differences <- mean(square_differences)
mean_of_sq_differences
## [1] 35.18897
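As a quick sanity check, the two routes to this number can be compared directly. This is only a sketch that recomputes the intermediate values so it can run on its own; it assumes the built-in mtcars data set.

```r
# Sketch: dividing the sum of squares by N matches mean() of the squared differences.
mpg <- mtcars$mpg
square_differences <- (mpg - mean(mpg))^2
sum_of_squares <- sum(square_differences)

# Both expressions implement (1/N) * sum((x - mean)^2)
all.equal(sum_of_squares / length(mpg), mean(square_differences))
```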
Finally we need to take the square root of the mean of the squared differences we calculated, which is easily done with the following code:
Stnd_Dev_mpg <- sqrt(mean_of_sq_differences)
Stnd_Dev_mpg
## [1] 5.93203
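The steps so far can be collapsed into one small function. This is a minimal sketch rather than anything R provides: population_sd is a name I made up, and it does no handling of missing values or empty input.

```r
# Sketch: population standard deviation (divide by N) as a single function.
# `population_sd` is a hypothetical helper, not a built-in R function.
population_sd <- function(x) {
  differences <- x - mean(x)   # deviation of each element from the mean
  sqrt(mean(differences^2))    # root of the mean squared deviation
}

population_sd(mtcars$mpg)              # same value as Stnd_Dev_mpg above
population_sd(c(2, 4, 4, 4, 5, 5, 7, 9))  # small check by hand: mean 5, result 2
```

Wrapping the steps this way keeps the intermediate vectors out of the workspace while still following the formula term by term.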
I found it interesting that R’s built-in Standard Deviation function (sd) returns a different value for mpg than my calculation did. Using sd, the result was:
SD_of_mpg_calculated_by_R <- sd(mpg)
SD_of_mpg_calculated_by_R
## [1] 6.026948
I wondered why this was the case. It occurred to me that R might be calculating a “Sample Standard Deviation” rather than the population standard deviation that I was calculating. Therefore I recalculated the result using “Bessel’s Correction”, which divides by N - 1 instead of N:
$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$
(Standard deviation formulas, 2014)
Dividing by N - 1 instead of N, I got the same result as R, as shown by the following code and calculations:
N <- length(mpg)
N
## [1] 32
Sample_Stnd_Dev_mpg <- sqrt((sum_of_squares)/(N-1))
Sample_Stnd_Dev_mpg
## [1] 6.026948
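Since the two formulas differ only in the divisor, one value can be converted into the other with a scaling factor of sqrt((N - 1) / N). The sketch below illustrates this. The variable names are my own, and it recomputes everything so it runs on its own.

```r
# Sketch: converting between sample (N - 1) and population (N) standard deviation.
# Variable names here are illustrative, not part of the walkthrough above.
mpg <- mtcars$mpg
N <- length(mpg)

sample_sd <- sd(mpg)                                   # R's sd() divides by N - 1
population_sd_from_sd <- sample_sd * sqrt((N - 1) / N) # rescale to divide by N

population_sd_from_sd                                  # matches the from-scratch value
all.equal(population_sd_from_sd, sqrt(mean((mpg - mean(mpg))^2)))
```

With only 32 observations the gap between the two is visible in the second decimal place. It shrinks as N grows, because (N - 1) / N approaches 1.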
Standard deviation formulas. (2014). Retrieved from https://www.mathsisfun.com/data/standard-deviation-formulas.html.