Over the next few lessons, you will be learning lots of new functions. Wouldn’t it be nice if someone created a Cheatsheet / Notecard of many common R functions? Yes it would, and thankfully Tom Short has done this in his creation of the R Reference Card. You can download a copy at https://dl.dropboxusercontent.com/u/7618380/RReferenceCard.pdf. I highly encourage you to print this out and start highlighting functions as you learn them!
Before we learn how to calculate descriptive statistics, let’s figure out how to easily generate some sample data from specified probability distributions. We’ll cover two of them: the Normal and the Uniform.
Let’s start with the most famous distribution in statistics: the Normal (or if you want to sound pretentious, the Gaussian) distribution. From our intro stats class, we know that the Normal distribution is bell-shaped, and has two parameters: a mean and a standard deviation.
# Normal Distribution Plot code
curve(dnorm, from = -3, to = 3, xlab = "x",
lwd = 2, main = "Standard Normal Distribution\nmean = 0, sd = 1")
To generate samples from a normal distribution, we use the function rnorm(n, mean, sd).
a <- rnorm(10, mean = 0, sd = 1) # 10 samples from standard normal
a # print a to console
## [1] -1.56233 -0.10097 0.88004 0.23129 0.01313 -1.00554 -0.26293
## [8] -0.23252 0.09160 1.16139
b <- rnorm(100, mean = 100, sd = 10) # 100 samples from a Normal distribution with mean = 100 and sd = 10
b # print b to console
## [1] 99.03 101.20 80.00 104.39 97.23 111.55 96.10 99.44 91.48 100.85
## [11] 111.20 102.59 123.93 81.91 98.33 96.58 92.19 107.96 84.76 103.27
## [21] 118.23 97.15 95.81 84.22 113.95 109.88 96.78 116.17 93.64 115.51
## [31] 96.00 97.68 97.06 104.73 102.57 103.84 99.71 93.69 104.28 98.40
## [41] 100.35 100.47 99.36 90.75 101.40 110.70 97.47 82.43 91.36 80.78
## [51] 97.30 94.40 106.87 101.04 95.86 105.35 106.46 106.67 98.09 109.40
## [61] 117.59 110.04 113.09 99.33 74.48 96.45 87.18 107.70 119.53 102.47
## [71] 98.69 108.37 89.61 92.37 109.29 110.78 110.99 93.36 115.78 96.42
## [81] 98.58 96.87 104.44 114.09 98.93 91.27 103.85 97.69 90.33 78.52
## [91] 108.79 108.65 104.45 101.16 93.16 104.50 93.82 101.60 119.81 92.98
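Because rnorm() draws random values, your output will not match the numbers printed above, and it will change every time you run the code. A minimal sketch of how to make the draws repeatable, assuming you want reproducible results: call set.seed() before sampling. The seed value 42 and the object names x1 and x2 are arbitrary choices for illustration.

```r
set.seed(42)                          # fix the random number generator state (42 is arbitrary)
x1 <- rnorm(5, mean = 100, sd = 10)   # first draw of 5 values

set.seed(42)                          # reset the generator to the same state
x2 <- rnorm(5, mean = 100, sd = 10)   # second draw of 5 values

identical(x1, x2)                     # TRUE: same seed, same draws

rnorm(3)                              # mean and sd default to 0 and 1 when omitted
```

Without the second call to set.seed(), x2 would almost certainly differ from x1.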
Next, let’s move on to the Uniform distribution. The Uniform distribution is rectangular and gives equal probability to all values between its minimum and maximum:
# Uniform Distribution Plot code
curve(dunif, from = 0, to = 1, xlab = "x",
lwd = 2, main = "Uniform Distribution\nmin = 0, max = 1", xlim = c(-.5, 1.5))
To generate samples from a uniform distribution, we use the function runif(n, min, max).
a <- runif(15, min = 0, max = 1) # 15 samples from uniform dist with bounds at 0 and 1
a # print a to console
## [1] 0.73364 0.85812 0.56840 0.42676 0.37394 0.42830 0.04892 0.13061
## [9] 0.10768 0.32932 0.34041 0.88473 0.09510 0.11826 0.38699
b <- runif(100, min = -100, max = 100) # 100 samples from U[-100, 100]
b # print b to console
## [1] 70.374 -21.697 7.738 -85.800 79.100 98.568 47.754 -24.021
## [9] 13.764 -83.542 41.388 -90.771 -3.714 68.744 -50.477 90.191
## [17] 60.193 -16.821 63.447 -17.803 -84.460 -10.310 -44.901 7.417
## [25] -12.430 84.790 26.801 -49.643 99.142 3.955 84.680 -39.734
## [33] 39.585 82.555 21.990 -90.319 67.230 38.863 -52.012 -86.856
## [41] -34.425 -79.452 96.683 -23.117 -85.485 -24.219 95.218 -37.597
## [49] -81.681 -53.353 21.066 52.284 27.000 45.815 97.813 3.222
## [57] -40.792 23.472 -97.193 41.726 31.963 -87.475 -16.334 -15.093
## [65] 10.832 -37.809 95.128 -37.632 -84.416 34.930 89.586 51.726
## [73] -55.367 20.759 91.553 -93.158 1.292 -30.957 92.217 52.217
## [81] 58.963 61.679 62.149 2.604 60.209 12.171 22.138 -34.060
## [89] 48.612 28.194 95.233 16.174 65.404 16.164 41.550 -54.355
## [97] -7.308 24.543 -58.917 -57.102
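A quick way to convince yourself that runif() respects its bounds is to check the smallest and largest draws. This is only a sketch; the object name u and the sample size of 1000 are assumptions made for the example.

```r
u <- runif(1000, min = -100, max = 100)  # 1000 draws from U[-100, 100]

range(u)                                 # smallest and largest value drawn
all(u >= -100 & u <= 100)                # TRUE: every draw lies within the bounds

head(runif(3))                           # min and max default to 0 and 1 when omitted
```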
Ok, now that we can generate some data, let’s learn the basic descriptive statistics functions.
Let’s start with the sample mean and median:
a <- rnorm(100, mean = 0, sd = 1) # Generate 100 samples from standard normal and assign to object a
mean(a) # What is the mean of the vector a? (should be close to 0)
## [1] 0.1009
median(a) # What is the median of the vector a?
## [1] 0.1933
b <- runif(1000, min = -50, max = 0) # Generate 1000 samples from uniform dist with bounds at -50 and 0 and assign to object b
mean(b) # What is the mean of the vector b? (should be close to -25)
## [1] -24.95
median(b) # What is the median of the vector b?
## [1] -25.01
vec <- c(4, 2, 1, 5, 2, 7, 4)
mean(vec)
## [1] 3.571
median(vec)
## [1] 4
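To see what mean() and median() are doing, here is a rough sketch that reproduces both results by hand for the same vector. The name sorted is introduced only for this example, and the middle-position shortcut assumes the vector has an odd number of elements; with an even number, median() averages the two middle values.

```r
vec <- c(4, 2, 1, 5, 2, 7, 4)

# Mean: add up all the values and divide by how many there are
sum(vec) / length(vec)                 # same value as mean(vec)

# Median: sort the values and take the one in the middle
sorted <- sort(vec)                    # 1 2 2 4 4 5 7
sorted[(length(vec) + 1) / 2]          # the 4th of 7 values, same as median(vec)
```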
max(a) # What is the maximum value of the vector a?
## [1] 2.496
min(a) # What is the minimum value of the vector a?
## [1] -2.148
Range <- max(a) - min(a) # Let's calculate the range of a
Range
## [1] 4.644
sd(a) # What is the standard deviation of the vector a?
## [1] 1.058
var(a) # What is the variance of the vector a?
## [1] 1.12
sd(a) ^ 2 # Should be the variance!
## [1] 1.12
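It can help to know that var() and sd() compute the sample versions of these statistics, dividing by n - 1 rather than n. The sketch below checks this by hand; the object names x, n, and manual_var and the seed value are assumptions chosen for the example.

```r
set.seed(1)                                   # arbitrary seed so the example is repeatable
x <- rnorm(100, mean = 0, sd = 1)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))                 # TRUE
all.equal(sqrt(manual_var), sd(x))            # TRUE: sd is the square root of the variance

# The range can also be computed in one step
diff(range(x))                                # same as max(x) - min(x)
```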
According to the law of large numbers, the larger our sample size, the closer our sample mean tends to be to the population mean. Let’s test this by drawing both a small (N = 5) and a large (N = 1,000,000) number of observations from a Normal distribution with mean = 100 and sd = 20:
Small <- rnorm(5, mean = 100, sd = 20) # 5 observations
Large <- rnorm(1000000, mean = 100, sd = 20) # One million observations
mean(Small) # What is the mean of the small sample?
## [1] 111.8
mean(Large) # What is the mean of the large sample?
## [1] 100
mean(Small) - 100 # How far is the mean of Small from 100?
## [1] 11.77
mean(Large) - 100 # How far is the mean of Large from 100?
## [1] 6.858e-05
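Two sample sizes only show the extremes. As a rough sketch, we can repeat the same comparison over several sample sizes and watch the error shrink. The names sizes and abs_error, the particular sizes, and the seed are all assumptions made for this example.

```r
set.seed(123)                                 # arbitrary seed so the example is repeatable
sizes <- c(5, 50, 500, 5000, 50000)           # sample sizes to compare

# For each sample size, draw a sample and measure how far its mean is from 100
abs_error <- sapply(sizes, function(n) {
  abs(mean(rnorm(n, mean = 100, sd = 20)) - 100)
})

names(abs_error) <- sizes                     # label each error with its sample size
abs_error
```

The errors generally get smaller as n grows, but because each sample is random, a larger sample can occasionally land further from 100 than a smaller one.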
You’re done with lecture 3! Now you know how to generate samples from the Normal and Uniform distributions and calculate basic descriptive statistics! In the next lecture, we’ll cover the matrix and dataframe objects.