0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1
Calculate variance and standard deviation directly in R.
Calculate variance and standard deviation by hand in R.
Compare the numbers and ascertain they are the same.
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 525530 28.1 1166868 62.4 NA 669291 35.8
## Vcells 966200 7.4 8388608 64.0 32768 1840395 14.1
cat("\f") # Clear the console
dev.off() # Clear plots. Can use par(mfrow=c(1,1))
## null device
## 1
v <- c(6,2,4,9,8,3,1,6,5,7) # vector we randomly created in class.
I will reference it as v
, though we could give it a more
meaningful name like raw_data
.
RECOMMENDED READING, BUT NOT REQUIRED (IF TIME CONSTRAINED)
class(v)
## [1] "numeric"
typeof(v) # what is the object’s data type (low-level)? double.
## [1] "double"
## R is treating the inputs as double precision real numbers, not whole numbers. If you wanted to explicitly create integers, you need to add an L to each element (or coerce to the integer type using as.integer()).
d <- c(2L,8L)
typeof(d)
## [1] "integer"
remove(d) # remove objects you do not need
numeric here.
objects can be character, numeric (real or decimal), integer, logical or complex more generally.
Note R calculates sample variance, not population variance by default.
sd
, which tells you R is
calculating the sample sd
(division by n-1) and not the
population sd
(division by n)?sd
sd(v)
## [1] 2.601282
sd(x = v)
## [1] 2.601282
You may want to specify the arguments explicitly for recall value / makes it easier for others to understand your code, but of course R will still give you the same answer.
sd_Rformula <- sd(v) # I omit arguments inside sd command as they are obvious here
sd_Rformula # print out the results
## [1] 2.601282
?print
print(sd_Rformula)
## [1] 2.601282
?var
var_Rformula <- var(v)
READ UP ON THE FOLLOWING SECTIONS ATLEAST -
Basic arithmetic operations; Vector Section - Create a vector, Calculations with vectors
# Addition
6+2+4+9+8+3+1+6+5+7
## [1] 51
?sum
sum(v) # adds all the elements/scalars in the vector v directly
## [1] 51
## Arithmetic Mean
(6+2+4+9+8+3+1+6+5+7)/10
## [1] 5.1
sum(v)/10
## [1] 5.1
sum(v)/length(v) # better code as this will be robust to changing size v
## [1] 5.1
?mean
mean(v) # calculates arithmetic mean of all the elements in the vector v directly
## [1] 5.1
v - mean(v) # deviation from the mean. C2-C11 in Excel.
## [1] 0.9 -3.1 -1.1 3.9 2.9 -2.1 -4.1 0.9 -0.1 1.9
### deviations from the mean add up to 0. Confirm with the command -
sum( v - mean(v) )
## [1] 3.552714e-15
( v - mean(v) )^2 # squared deviation from the mean. D2-D11 in Excel.
## [1] 0.81 9.61 1.21 15.21 8.41 4.41 16.81 0.81 0.01 3.61
### squared deviations from the mean DO NOT add up to 0, which we utilize in calculation for variance.
sum( (v - mean(v) )^2 ) # total sum of squared deviation from the mean. D12
## [1] 60.9
sum((v-mean(v))^2)/length(v) # average squared deviation from the mean. E16
## [1] 6.09
variance_hand_sample <- sum( (v-mean(v))^2 ) / (length(v)-1) # E14
variance_hand_population <- sum( (v-mean(v))^2 ) / length(v) # E16
##
var_Rformula # E15
## [1] 6.766667
?paste0
print(paste0("Sample variance by hand is ", variance_hand_sample,
" while Population variance by hand is ", variance_hand_population,
". Furthermore, Sample variance by R is ", var_Rformula, " and is calculated by R directly."
)
)
## [1] "Sample variance by hand is 6.76666666666667 while Population variance by hand is 6.09. Furthermore, Sample variance by R is 6.76666666666667 and is calculated by R directly."
R, like many statistical software, calculates sample variance instead of population variance as the default measure. Since we have a sample most of the time instead of the data on the entire population, this is fine. Furthermore, Sample variance and Population variance are very close when the sample is large enough ($n > 30$).
As long as you can replicate the output of the default R command, you really understand what is happening “behind the scenes”.
# From section 2.1
sd_Rformula <- sd(v)
print(sd_Rformula)
## [1] 2.601282
# Repeat Section 3.1 steps, only need to add one (sqrt) command -
sd_hand_sample <- sqrt(variance_hand_sample) # we know sd is just the square root of variance. All prior steps to calculate the variance remain the same.
print(paste0("Sample sd by hand is ", sd_hand_sample,
" while Sample variance by R is ", sd_Rformula, " (which) is calculated by R directly."
)
)
## [1] "Sample sd by hand is 2.60128173535022 while Sample variance by R is 2.60128173535022 (which) is calculated by R directly."