Seth J. Chandler
September 4, 2014
R is capable of doing basic descriptive statistics. Let’s start with 19 randomly generated numbers. I’m going to take them from a uniform distribution over 0 to 10. To make sure I get the same results every time, I will set the seed of the random number generator to the day I am writing this, September 4, 2014.
set.seed(942014)
print(rv<-runif(19,0,10))
## [1] 2.2442 1.8975 1.5346 3.5228 7.7944 4.6094 0.2697 8.0016 3.4022 3.4805
## [11] 9.4281 6.4563 6.1863 0.1859 6.3925 6.3446 5.2364 2.3600 1.9805
I can add these numbers up and divide them by the number of elements in the list.
sum(rv)/length(rv)
## [1] 4.28
Or, I can ask R to compute the mean of the numbers.
mean(rv)
## [1] 4.28
The results are identical. You probably know that this identity exists because the definition of “mean” or, if you want to be fancy “arithmetic mean” is the sum of the sample divided by the length of the sample. Using traditional mathematical notation,\(\bar{x}=\frac{\sum _{i=1}^n x_i}{n}\)
What if we put the numbers in order?
print(srv<-sort(rv))
## [1] 0.1859 0.2697 1.5346 1.8975 1.9805 2.2442 2.3600 3.4022 3.4805 3.5228
## [11] 4.6094 5.2364 6.1863 6.3446 6.3925 6.4563 7.7944 8.0016 9.4281
Let’s take the middle element of the sorted list so that an equal number of elements in the sorted list are to the left of the middle element and to the right of the middle element.
print(middleElement<-ceiling(length(srv)/2))
## [1] 10
srv[middleElement]
## [1] 3.523
This middle value is called the median of the sample.
What if we have an even number of elements in our list. There is no single element that is then in the middle. The most common convention is to say that the median is the mean of the two most middle elements.
print(srv2<-sort(rv2<-c(rv,rnorm(1,0,1))))
## [1] 0.08722 0.18592 0.26966 1.53461 1.89750 1.98050 2.24422 2.35996
## [9] 3.40221 3.48051 3.52276 4.60938 5.23643 6.18629 6.34462 6.39245
## [17] 6.45630 7.79443 8.00162 9.42815
mean(srv2[c(length(srv2)/2,1+length(srv2)/2)])
## [1] 3.502
median(srv2)
## [1] 3.502
We also might want to know what element in a list is most common. Let’s make another list, this time drawing 100 values from the numbers 1 to 50. Notice that I am using the sample command to do this. I am setting the replace option in sample to TRUE so that I can generate more than one of the same value from the list of possible values. This is called “sampling with replacement.”
rv3<-sample(1:50,100,replace=TRUE)
There is no built-in mode command in R. In fact, there is a command mode but it does not do what you think it does. There are then several choices of how to proceed. Here’s one.
freqlist<-table(rv3)
print(sortedfreqlist<-sort(freqlist,partial=c(2)))
## rv3
## 8 9 10 11 12 18 20 22 24 31 35 42 48 1 4 7 21 25 26 27 28 29 30 33 39
## 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
## 44 49 2 13 16 23 32 34 40 43 3 5 17 36 38 50 19 15
## 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 5 6
At this point, you might think you could execute the following code
sortedfreqlist[length[sortedfreqlist),2]]
and get the answer. But it does not work. This is because R is storing the results of table as this kind of strange data structure called – guess what – a table. One can take the names of a table object.
names(freqlist)
## [1] "1" "2" "3" "4" "5" "7" "8" "9" "10" "11" "12" "13" "15" "16"
## [15] "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [29] "31" "32" "33" "34" "35" "36" "38" "39" "40" "42" "43" "44" "48" "49"
## [43] "50"
What you see is kind of the first row of sortedfreqlist, which are the values in our sample. So, if we could just figure out the position in the second row of the maximum value in the second row, we could then use that as an index into names and get our mode. Here’s how we do it.
print(freqlist==max(freqlist))
## rv3
## 1 2 3 4 5 7 8 9 10 11 12 13
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 15 16 17 18 19 20 21 22 23 24 25 26
## TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 27 28 29 30 31 32 33 34 35 36 38 39
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 40 42 43 44 48 49 50
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE
print(which(freqlist==max(freqlist)))
## 15
## 13
names(freqlist)[which(freqlist==max(freqlist))]
## [1] "15"
as.numeric(names(freqlist)[which(freqlist==max(freqlist))])
## [1] 15
Pretty ghastly! 1 We could write a function to make sure we never have to do this again.
commonest<-function (x) {freqlist<-table(x)
as.numeric(names(freqlist)[which(freqlist==max(freqlist))])}
commonest(rv3)
## [1] 15
There’s also an R built-in command, summary that provides descriptive statistics. It provides the minimum, the median, the mean, and the maximum. It also provides “quartiles,” which are just the 25th and 75th quantiles.
summary(rv3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 14.5 23.0 24.1 36.0 50.0
It is often helpful to have a measure of how spread out or dispersed the numbers contain in some sample are. There are several measures of dispersion2, but the most commonly used is the “standard deviation.” And, technically, there are two standard deviations, the standard deviation of a population and the standard deviation of a sample from the population. Let’s start with the standard deviation of a population. Let’s take the distance of each element of our sample rv above and subtract the mean of the sample from it. \(\bar{x}=\frac{\sum _{i=1}^n x_i}{n}\) This gives us the distribution of the distance from the mean of the sample.
print(distanceFromMean<-rv-mean(rv))
## [1] -2.0362 -2.3829 -2.7458 -0.7576 3.5140 0.3290 -4.0107 3.7212
## [9] -0.8782 -0.7999 5.1478 2.1759 1.9059 -4.0945 2.1121 2.0642
## [17] 0.9560 -1.9204 -2.2999
Here are some things we could do at this point. We could take the mean of the distances. \(\frac{\sum _{i=1}^n \left(x_i-\bar{x}\right)}{n}\) That’s not a very good way of measuring dispersion, though, because the negative distances will tend to cancel out the positive distances. We could take the mean of the absolute value of the distances from the mean. \(\frac{\sum _{i=1}^n \left| x_i-\bar{x} \right| }{n}\)3
paste("the mean distance from the mean is",mean(distanceFromMean))
## [1] "the mean distance from the mean is -3.7400409888971e-16"
paste("the mean absolute distance from the mean is",mean(abs(distanceFromMean)))
## [1] "the mean absolute distance from the mean is 2.30801252588898"
paste("the median absolute distance from the median is",median(abs(rv-median(rv))))
## [1] "the median absolute distance from the median is 1.98815264040604"
In practice what it generally done, however, is to take the sum of the squares of the distances, divide that value by something – which is the kind of complicated, technical issue – and then take the square root of that quotient. For the moment, let’s have “something” be the length of our distance vector. \(\sqrt{\frac{\sum _{i=1}^n \left(x_i-\bar{x}\right){}^2}{n}}\) This, by the way, means that we are really taking the square root of the mean of the squared distances of the distance vector. Here’s some R code doing this computation. I’ll do it two ways. First, I’ll take the sum and divided by the length of the distance vector. Second I’ll use the mean function to do the same thing.
m<-mean(rv)
print(sqrt(sum((rv-m)^2)/length(rv)))
## [1] 2.636
print(sqrt(mean((rv-m)^2)))
## [1] 2.636
We can save ourselves some trouble, however, by using the built-in standard deviation function sd in R to compute standard deviations. Let’s give it a try on our data.
sd(rv)
## [1] 2.708
Uh oh. What’s going on? The value sd computed is different than the value we computed. And this is where that complexity I adverted to earlier comes into play. Remember, I said we had to divide by “something.” What if, in stead of dividing by the length of the distances vector, I divide by one less than the length of the distance vector.
m<-mean(rv)
print(sqrt(sum((rv-m)^2)/(length(rv)-1)))
## [1] 2.708
Now I get the same answer as I got when we used sd. So, it appears that R’s sd function is using this formula to compute standard deviation: \(\sqrt{\frac{\sum _{i=1}^n \left(x_i-\bar{x}\right){}^2}{n-1}}\). R is doing this because it is assuming that our numbers are not all the numbers that could ever be produced from some underlying distribution. They are not “the population.” Rather, R is assuming that the numbers we are using are but a sample from a larger population. And, when you are drawing from a sample population, it turns out that dividing by the length of the distance vector gives you an answer that will prove to small if you were trying to predict the standard deviation of a large (or infinite) number of draws from the underlying distribution. It is giving you what is known as a “biased” estimator. So, when you are asked to compute a standard deviation, you really should ask whether you are being asked to compute the standard deviation, assuming that you have a population, or the standard deviation, assuming that you have a sample.4 If you have a sample, the sd command works just fine. If you really have a population, however, you need to do a correction. Just take your sd result and multiply it by \(\frac{\sqrt{n-1}}{\sqrt{n}}\).
Variance is another measure of dispersion. It’s just the square of the standard deviation.
Makes you appreciate Mathematica’s Commonest command.↩
Wikipedia contains a good entry on statistical dispersion.↩
We could also take the median of the absolute value of the distances from the median. This statistic is sometimes called the median absolute deviation. You can compute median absolute deviation in R using the mad function.↩
Google Spreadsheets and Excel have at least two formulas for standard deviation. STDEV is for the standard deviation of a sample and STDEVP is for the standard deviation of a population.↩