Lecture 3 Goals:

  1. Download the priceless R reference card
  2. Learn functions for generating data from probability distributions: rnorm(), runif()
  3. Learn functions for basic descriptive statistics: mean(), median(), sd(), var(), min(), max()

The R Reference Card

Over the next few lessons, you will be learning lots of new functions. Wouldn’t it be nice if someone had created a cheat sheet of many common R functions? Yes it would, and thankfully Tom Short has done just that with the R Reference Card. You can download a copy at https://dl.dropboxusercontent.com/u/7618380/RReferenceCard.pdf. I highly encourage you to print this out and start highlighting functions as you learn them!

Generating data from specified probability distributions

Before we learn how to calculate descriptive statistics, let’s figure out how to easily generate some sample data from specified probability distributions. We’ll start with the Normal and Uniform distributions:

The Normal (Gaussian) distribution: rnorm(n, mean, sd)

Let’s start with the most famous distribution in statistics: the Normal (or if you want to sound pretentious, the Gaussian) distribution. From our intro stats class, we know that the Normal distribution is bell-shaped, and has two parameters: a mean and a standard deviation.

# Normal Distribution Plot code
curve(dnorm, from = -3, to = 3, xlab = "x", 
      lwd = 2, main = "Standard Normal Distribution\nmean = 0, sd = 1")

[Plot: standard Normal density curve for x from -3 to 3, titled "Standard Normal Distribution, mean = 0, sd = 1"]

To generate samples from a normal distribution, we use the function rnorm(n, mean, sd).

a <- rnorm(10, mean = 0, sd = 1) # 10 samples from standard normal
a # print a to console
##  [1] -1.56233 -0.10097  0.88004  0.23129  0.01313 -1.00554 -0.26293
##  [8] -0.23252  0.09160  1.16139
b <- rnorm(100, mean = 100, sd = 10) # 100 samples from a Normal distribution with mean = 100 and sd = 10
b # print b to console
##   [1]  99.03 101.20  80.00 104.39  97.23 111.55  96.10  99.44  91.48 100.85
##  [11] 111.20 102.59 123.93  81.91  98.33  96.58  92.19 107.96  84.76 103.27
##  [21] 118.23  97.15  95.81  84.22 113.95 109.88  96.78 116.17  93.64 115.51
##  [31]  96.00  97.68  97.06 104.73 102.57 103.84  99.71  93.69 104.28  98.40
##  [41] 100.35 100.47  99.36  90.75 101.40 110.70  97.47  82.43  91.36  80.78
##  [51]  97.30  94.40 106.87 101.04  95.86 105.35 106.46 106.67  98.09 109.40
##  [61] 117.59 110.04 113.09  99.33  74.48  96.45  87.18 107.70 119.53 102.47
##  [71]  98.69 108.37  89.61  92.37 109.29 110.78 110.99  93.36 115.78  96.42
##  [81]  98.58  96.87 104.44 114.09  98.93  91.27 103.85  97.69  90.33  78.52
##  [91] 108.79 108.65 104.45 101.16  93.16 104.50  93.82 101.60 119.81  92.98
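
Because rnorm() draws random numbers, your output will differ from the values printed above each time you run the code. Here is a minimal sketch, assuming you want draws you can reproduce: base R's set.seed() fixes the starting point of the random number generator. The seed value 123 and the object names x1 and x2 are arbitrary choices for illustration, not part of the lecture's code.

```r
set.seed(123)                      # fix the random number generator (123 is arbitrary)
x1 <- rnorm(5, mean = 0, sd = 1)   # 5 samples from the standard normal

set.seed(123)                      # reset to the same seed
x2 <- rnorm(5)                     # mean = 0 and sd = 1 are rnorm()'s defaults

identical(x1, x2)                  # TRUE: same seed, same draws
length(x1)                         # 5: the first argument is the number of samples
```

Note that leaving out mean and sd gives the standard normal, so rnorm(5) and rnorm(5, mean = 0, sd = 1) describe the same distribution.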

The Uniform distribution: runif(n, min, max)

Next, let’s move on to the uniform distribution. The uniform distribution is rectangular and gives equal probability to all values between the minimum and maximum values:

# Uniform Distribution Plot code
curve(dunif, from = 0, to = 1, xlab = "x", 
      lwd = 2, main = "Uniform Distribution\nmin = 0, max = 1", xlim = c(-.5, 1.5))

[Plot: Uniform density curve for x from -0.5 to 1.5, titled "Uniform Distribution, min = 0, max = 1"]

To generate samples from a uniform distribution, we use the function runif(n, min, max).

a <- runif(15, min = 0, max = 1) # 15 samples from uniform dist with bounds at 0 and 1
a # print a to console
##  [1] 0.73364 0.85812 0.56840 0.42676 0.37394 0.42830 0.04892 0.13061
##  [9] 0.10768 0.32932 0.34041 0.88473 0.09510 0.11826 0.38699
b <- runif(100, min = -100, max = 100) # 100 samples from U[-100, 100]
b # print b to console
##   [1]  70.374 -21.697   7.738 -85.800  79.100  98.568  47.754 -24.021
##   [9]  13.764 -83.542  41.388 -90.771  -3.714  68.744 -50.477  90.191
##  [17]  60.193 -16.821  63.447 -17.803 -84.460 -10.310 -44.901   7.417
##  [25] -12.430  84.790  26.801 -49.643  99.142   3.955  84.680 -39.734
##  [33]  39.585  82.555  21.990 -90.319  67.230  38.863 -52.012 -86.856
##  [41] -34.425 -79.452  96.683 -23.117 -85.485 -24.219  95.218 -37.597
##  [49] -81.681 -53.353  21.066  52.284  27.000  45.815  97.813   3.222
##  [57] -40.792  23.472 -97.193  41.726  31.963 -87.475 -16.334 -15.093
##  [65]  10.832 -37.809  95.128 -37.632 -84.416  34.930  89.586  51.726
##  [73] -55.367  20.759  91.553 -93.158   1.292 -30.957  92.217  52.217
##  [81]  58.963  61.679  62.149   2.604  60.209  12.171  22.138 -34.060
##  [89]  48.612  28.194  95.233  16.174  65.404  16.164  41.550 -54.355
##  [97]  -7.308  24.543 -58.917 -57.102
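
A quick sanity check on runif() is that every draw falls between min and max. The sketch below is one way to confirm this. The object names u and v and the sample sizes are made up for illustration.

```r
u <- runif(1000, min = -100, max = 100)  # 1000 samples from U[-100, 100]

all(u >= -100 & u <= 100)                # TRUE: every draw lies inside the bounds

v <- runif(3)                            # min = 0 and max = 1 are runif()'s defaults
all(v >= 0 & v <= 1)                     # TRUE
```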

OK, now that we can generate some data, let’s learn the basic descriptive statistics functions.

mean() and median()

Let’s start with the sample mean and median:

  1. mean(x): The arithmetic mean of the vector x
  2. median(x): The median of the vector x
a <- rnorm(100, mean = 0, sd = 1) # Generate 100 samples from standard normal and assign to object a
mean(a) # What is the mean of the vector a? (should be close to 0)
## [1] 0.1009
median(a) # What is the median of the vector a?
## [1] 0.1933
b <- runif(1000, min = -50, max = 0) # Generate 1000 samples from uniform dist with bounds at -50 and 0 and assign to object b

mean(b) # What is the mean of the vector b? (should be close to -25)
## [1] -24.95
median(b) # What is the median of the vector b?
## [1] -25.01
vec <- c(4, 2, 1, 5, 2, 7, 4)
mean(vec)
## [1] 3.571
median(vec)
## [1] 4
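
To see what mean() and median() are doing, it can help to compute both by hand on the small vector vec from above. This is only an illustrative sketch: the names sorted and vec_outlier, and the extra value 100, are not part of the lecture's code.

```r
vec <- c(4, 2, 1, 5, 2, 7, 4)

sum(vec) / length(vec)       # 25 / 7, the same value mean(vec) returns

sorted <- sort(vec)          # 1 2 2 4 4 5 7
sorted[4]                    # 4: the middle of 7 sorted values, same as median(vec)

# One extreme value pulls the mean far more than the median
vec_outlier <- c(vec, 100)
mean(vec_outlier)            # 15.625
median(vec_outlier)          # 4
```

With an even number of values, as in vec_outlier, median() averages the two middle values (here 4 and 4).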

min(), max()

  1. min(x): The minimum value in the vector x
  2. max(x): The maximum value in the vector x
max(a) # What is the maximum value of the vector a?
## [1] 2.496
min(a) # What is the minimum value of the vector a?
## [1] -2.148
Range <- max(a) - min(a) # Let's calculate the range of a
Range
## [1] 4.644
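
Base R also has a range() function that returns the minimum and maximum together, and diff() turns that pair into a single number. The sketch below uses a small made-up vector x so the results are easy to check by eye.

```r
x <- c(3, -1, 8, 2)    # illustrative data

range(x)               # -1 8: the minimum and the maximum as a vector of length 2
diff(range(x))         # 9: the same as max(x) - min(x)
```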

sd() and var()

  1. sd(x): The standard deviation of the vector x
  2. var(x): The variance of the vector x
sd(a) # What is the standard deviation of the vector a?
## [1] 1.058
var(a) # What is the variance of the vector a?
## [1] 1.12
sd(a) ^ 2 # Should be the variance!
## [1] 1.12
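
One detail worth knowing: var() and sd() compute the sample variance and sample standard deviation, which divide by n - 1 rather than n. The sketch below reproduces both by hand on a small made-up vector. The names x, n, and manual_var are for illustration only.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)                  # illustrative data, mean = 5
n <- length(x)

manual_var <- sum((x - mean(x))^2) / (n - 1)    # squared deviations sum to 32
manual_var                                      # 32 / 7, about 4.571

all.equal(manual_var, var(x))                   # TRUE
all.equal(sqrt(manual_var), sd(x))              # TRUE
```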

A quick test of the Law of Large Numbers

According to the law of large numbers, the sample mean tends to get closer to the population mean as the sample size grows. Let’s test this by drawing both a small (N = 5) and a large (N = 1,000,000) sample from a Normal distribution with mean = 100 and sd = 20:

Small <- rnorm(5, mean = 100, sd = 20) # 5 observations
Large <- rnorm(1000000, mean = 100, sd = 20) # One million observations

mean(Small) # What is the mean of the small sample?
## [1] 111.8
mean(Large) # What is the mean of the large sample?
## [1] 100
mean(Small) - 100 # How far is the mean of Small from 100?
## [1] 11.77
mean(Large) - 100 # How far is the mean of Large from 100?
## [1] 6.858e-05
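
How close should we expect the sample mean to be? For independent draws, the standard deviation of the sample mean (its standard error) is the population sd divided by the square root of the sample size. The sketch below computes this for a few sample sizes and then measures the actual error for one random sample of each size. The names pop_sd, sizes, std_error, and errors, the intermediate sample sizes, and the seed are illustrative assumptions; sapply() simply applies the function to each element of sizes.

```r
pop_sd <- 20
sizes  <- c(5, 100, 10000, 1000000)

std_error <- pop_sd / sqrt(sizes)   # typical distance of the sample mean from 100
std_error                           # about 8.94, 2, 0.2, 0.02

set.seed(1)                         # arbitrary seed so the draws can be reproduced
errors <- sapply(sizes, function(n) {
  abs(mean(rnorm(n, mean = 100, sd = pop_sd)) - 100)  # observed absolute error
})
errors                              # one observed error per sample size
```

Any single small sample can still land close to 100 by luck, so the observed errors need not shrink at every step, but the expected error falls steadily as N grows.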

Finished!

You’re done with lecture 3! Now you know how to generate samples from the Normal and Uniform distributions and calculate basic descriptive statistics! In the next lecture, we’ll cover the matrix and dataframe objects.