0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1

1 Prompt

  1. Calculate variance and standard deviation directly in R.

  2. Calculate variance and standard deviation by hand in R.

  3. Compare the numbers and ascertain they are the same.

2 Clear the work space

rm(list = ls())   # Clear environment 
gc()              # Clear unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 525530 28.1    1166868 62.4         NA   669291 35.8
## Vcells 966200  7.4    8388608 64.0      32768  1840395 14.1
cat("\f")         # Clear the console
dev.off()         # Clear plots.  Can use par(mfrow=c(1,1)) 
## null device 
##           1

3 Raw Data

v <- c(6,2,4,9,8,3,1,6,5,7) # vector we randomly created in class.  

I will reference it as v, though we could give it a more meaningful name like raw_data.

RECOMMENDED READING, BUT NOT REQUIRED (IF TIME CONSTRAINED)

class(v)  
## [1] "numeric"
typeof(v) # what is the object’s data type (low-level)?   double.  
## [1] "double"
## R is treating the inputs as double precision real numbers, not whole numbers.  If you wanted to explicitly create integers, you need to add an L to each element (or coerce to the integer type using as.integer()).

d <- c(2L,8L)
typeof(d)
## [1] "integer"
remove(d)  # remove objects you do not need

3.1 What kind of object is it (high-level)?

  • numeric here.

  • objects can be character, numeric (real or decimal), integer, logical or complex more generally.

4 Calculate variance and standard deviation directly in R

  • Note R calculates sample variance, not population variance by default.

    • Read up on syntax of sd, which tells you R is calculating the sample sd (division by n-1) and not the population sd (division by n)
?sd 
sd(v)
## [1] 2.601282
sd(x = v) 
## [1] 2.601282

You may want to specify the arguments explicitly for recall value / makes it easier for others to understand your code, but of course R will still give you the same answer.

  • R is object oriented programming, so you can store the results/calculations as “objects”.
sd_Rformula <- sd(v) #  I omit arguments inside sd command as they are obvious here

sd_Rformula          # print out the results
## [1] 2.601282
?print
print(sd_Rformula)
## [1] 2.601282
?var
var_Rformula <- var(v)

5 Calculate variance by hand / manually in R

5.1 Brush up the basics / you can use R a calculator.

READ UP ON THE FOLLOWING SECTIONS ATLEAST -

Basic arithmetic operations; Vector Section - Create a vector, Calculations with vectors

5.2 Different ways to get the same result. Commands make it more efficient.

# Addition
6+2+4+9+8+3+1+6+5+7
## [1] 51
?sum
sum(v)            # adds all the elements/scalars in the vector v directly
## [1] 51
## Arithmetic Mean
(6+2+4+9+8+3+1+6+5+7)/10
## [1] 5.1
sum(v)/10
## [1] 5.1
sum(v)/length(v)  # better code as this will be robust to changing size v
## [1] 5.1
?mean
mean(v)           # calculates arithmetic mean of all the elements in the vector v directly
## [1] 5.1

5.3 3.1 Start your vectorised operations to create columns like we saw in class in Excel.

v - mean(v)    # deviation from the mean.   C2-C11 in Excel.  
##  [1]  0.9 -3.1 -1.1  3.9  2.9 -2.1 -4.1  0.9 -0.1  1.9
### deviations from the mean add up to 0.  Confirm with the command - 
sum( v - mean(v) )
## [1] 3.552714e-15
( v - mean(v) )^2 # squared deviation from the mean.  D2-D11 in Excel.  
##  [1]  0.81  9.61  1.21 15.21  8.41  4.41 16.81  0.81  0.01  3.61
### squared deviations from the mean DO NOT add up to 0, which we utilize in calculation for variance.


sum( (v - mean(v) )^2 ) # total sum of squared deviation from the mean.  D12
## [1] 60.9
sum((v-mean(v))^2)/length(v) # average squared deviation from the mean.  E16  
## [1] 6.09
  • NOTE, variance is just the average squared deviation from the mean

5.4 Store the important results into objects so that you can call them back later.

variance_hand_sample       <- sum( (v-mean(v))^2 ) / (length(v)-1)   # E14
variance_hand_population   <- sum( (v-mean(v))^2 ) /  length(v)      # E16

## 
  • Sample variance is greater than population variance due to a smaller denominator (N-1 instead of N).
                                                        var_Rformula # E15
## [1] 6.766667

5.5 3.3. Use paste command to organize your calculations.

?paste0
print(paste0("Sample variance by hand is ", variance_hand_sample, 
             " while Population variance by hand is ", variance_hand_population,
             ". Furthermore, Sample variance by R is ", var_Rformula, " and is calculated by R directly."
             )
      )
## [1] "Sample variance by hand is 6.76666666666667 while Population variance by hand is 6.09. Furthermore, Sample variance by R is 6.76666666666667 and is calculated by R directly."

5.5.1 SUMMARY

  • R, like many statistical software, calculates sample variance instead of population variance as the default measure. Since we have a sample most of the time instead of the data on the entire population, this is fine. Furthermore, Sample variance and Population variance are very close when the sample is large enough ($n > 30$).

  • As long as you can replicate the output of the default R command, you really understand what is happening “behind the scenes”.

6 Calculate sd by hand / manually in R

# From section 2.1

sd_Rformula <- sd(v)
print(sd_Rformula)
## [1] 2.601282
# Repeat Section 3.1 steps, only need to add one (sqrt) command -

sd_hand_sample <- sqrt(variance_hand_sample) # we know sd is just the square root of variance.  All prior steps to calculate the variance remain the same.

print(paste0("Sample sd by hand is ", sd_hand_sample, 
              " while Sample variance by R is ", sd_Rformula, " (which) is calculated by R directly."
)
)
## [1] "Sample sd by hand is 2.60128173535022 while Sample variance by R is 2.60128173535022 (which) is calculated by R directly."