This refresher course is based on:
Franken, W.M. & Bouts, R.A. (2002). Wiskunde voor statistiek: Een voorbereiding. Bussum, Netherlands: Uitgeverij Coutinho
The refresher course consists of five chapters:
The module illustrates all topics discussed in the textbook, and shows how they can be implemented in R.
In statistics, and in data science for that matter, we usually work with data sets.
Common operations are taking the sum of a range of values in the data set, or computing their average value (which first takes the sum of n values, and then divides the sum by n).
For example, if we have the following set of seven values: 6, 7, 3, 12, 8, 5 and 0, then we can easily compute the sum, as \(6+7+3+12+8+5+0=41\).
We can refer to the first value (6) as \(x_1\), and to the third value (3) as \(x_3\). More generally, we can refer to any element of the series as \(x_i\).
For the sum of a set of values, we use the Greek capital letter sigma: ∑
Shorthand for \(x_1 + x_2 + \dots + x_7\) then is: \(\sum_{i=1}^{7} x_i\)
The average or mean value of a set of values is defined as the sum of all values divided by the number of values. The mean is denoted as \(\bar{x}\), so for \(n\) values \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\).
To hone our skills in reading statistical formulas and to train our analytical thinking: the sum of deviations from the mean is, by definition, zero!
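The claim follows directly from the definition of the mean. A short derivation, writing \(n\) for the number of values:

\[
\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0
\]

The middle step uses \(\sum_{i=1}^{n} x_i = n\bar{x}\), which is the definition of the mean (the sum divided by \(n\)) multiplied by \(n\) on both sides.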
For the record, let's see if this is true.
series<-c(6,7,3,12,8,5,0)
sum(series)
## [1] 41
mean(series)
## [1] 5.857143
(seriesMinusMean <- series - mean(series))
## [1] 0.1428571 1.1428571 -2.8571429 6.1428571 2.1428571 -0.8571429 -5.8571429
round(sum(seriesMinusMean),8)
## [1] 0
round(mean(seriesMinusMean),8)
## [1] 0
The output above shows that, as we already found by hand, the sum of the seven values is equal to 41. The average or mean value of the \(x_i\) is \(41/7 \approx 5.86\), rounded to two decimals.
We have subtracted this mean from the original series, and stored the results in seriesMinusMean. The sum of that series equals zero, and so of course does the average.
This is equivalent to the following rule:
For the sake of completeness, the rules for multiplying all values by a fixed factor, and adding a constant, work the same way in combination. The formula makes life easier in statistical programming tasks.
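A quick numerical check may help here. The sketch below assumes the combined rule has the form \(\sum (a x_i + b) = a \sum x_i + n b\); the factor `a` and the constant `b` are arbitrary values chosen for illustration, not taken from the textbook.

```r
# Same series as in the example above
series <- c(6, 7, 3, 12, 8, 5, 0)
n <- length(series)

a <- 2   # hypothetical fixed factor
b <- 10  # hypothetical constant

lhs <- sum(a * series + b)      # transform every value, then sum
rhs <- a * sum(series) + n * b  # sum once, then apply the rule
lhs
## [1] 152
rhs
## [1] 152

# The mean follows the same linear pattern: mean(a*x + b) = a*mean(x) + b
all.equal(mean(a * series + b), a * mean(series) + b)
## [1] TRUE
```

In programming terms, the right-hand side needs only one pass over the data, after which any rescaling can be applied to the stored sum.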
It also makes it clear that the level and scale of measurement have important consequences. Suppose that we are measuring income in dollars. Due to inflation, a dollar now is worth less than a dollar some decades ago, so the same incomes are expressed in larger amounts. The variation (or in statistics: the variance), which is measured in terms of squared deviations from the mean, will widen with larger amounts of dollars.
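The effect of scale on the variance can be checked with the small series used earlier. This is only a sketch: the doubling factor and the shift of 100 are made-up values, and `var()` in R computes the sample variance (dividing by \(n-1\)).

```r
series <- c(6, 7, 3, 12, 8, 5, 0)

inflated <- 2 * series    # hypothetical: every amount doubles
shifted  <- series + 100  # hypothetical: every amount rises by a constant

# Multiplying by a factor multiplies the variance by the square of that factor
all.equal(var(inflated), 4 * var(series))
## [1] TRUE

# Adding a constant moves the mean but leaves the variance unchanged
all.equal(var(shifted), var(series))
## [1] TRUE
```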
The notations we have used above are meant to make life easier. But it seems that, sometimes, and for some of us, all they do is confuse. That's understandable. When reading through textbooks or articles that use these notations without a lot of explanation or examples, try to figure out the logic and work through a couple of small numerical examples yourself. If the statistical notation confuses you, then, to make things worse, we sometimes need to take the sum over several (two or more) series, using two or more indices.
In a simple example, suppose we have the grades of students (series 1) for a range of subjects they studied (series 2). Computing the "average grade" over all students and subjects can be done in two steps. First, we compute the average grade of each student (averaging over all subjects). Secondly, we take the average of these averages, to calculate the overall average. This works quite well if all students have grades for the same set of topics, and all students have grades for all of these topics! What happens if student 1 didn't sit the exams for 2 out of the 5 topics, due to illness? And/or what if student 2 had permission to skip one of the five topics, and was allowed to study a sixth topic offered by another school? More generally, data sets are never complete or perfect, and the analyst has to make choices, and be aware of the consequences of these choices, marginal as they may be!
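In the "perfect" situation, with a group of ten students who have all done exams in the same five topics, the calculation of the average grade would be: \(\bar{x} = \frac{1}{10 \cdot 5} \sum_{i=1}^{10} \sum_{j=1}^{5} x_{ij}\), where \(x_{ij}\) is the grade of student \(i\) for topic \(j\).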
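To see what the analyst's choices mean in practice, here is a small sketch with made-up grades for three students and five topics (fewer students than in the text, to keep it readable). Student 1 missed two exams, coded as `NA`. All names and numbers are hypothetical.

```r
# Hypothetical grades: 3 students (rows) x 5 topics (columns); NA = exam not taken
grades <- matrix(
  c(7, 8, NA, 6, NA,
    6, 6,  7, 8,  9,
    9, 8,  8, 7,  8),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("student", 1:3), paste0("topic", 1:5))
)

# Step 1: average per student, ignoring missing grades
student_means <- rowMeans(grades, na.rm = TRUE)

# Step 2: average of the student averages
mean_of_means <- mean(student_means)
mean_of_means
## [1] 7.4

# Alternative choice: pool all available grades and average once
pooled_mean <- mean(grades, na.rm = TRUE)
pooled_mean
## [1] 7.461538
```

With complete data the two numbers coincide. With missing grades they differ, because the average of averages gives each student equal weight while the pooled mean gives each grade equal weight.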
A factorial is a special kind of multiplication.
It is denoted as \(x!\)
In general, \(x! = x*(x-1)*(x-2)*...*1\), x∊ℕ (that is, x is a positive integer or natural number: 1, 2, 3, ...)
For example:
\(4! = 4*3*2*1 = 24\)
\(8! = 8*7*6*5*4*3*2*1 = ...\)
Best to use the factorial() function in R.
factorial(8)
## [1] 40320
Note that \(3! * 4! \neq 12!\)
factorial(3)
## [1] 6
factorial(4)
## [1] 24
factorial(3)*factorial(4)
## [1] 144
factorial(12)
## [1] 479001600
In data science, we will frequently encounter problems related to probability, and to the likelihood and frequency of occurrences, or events.
An example is the following. Suppose you have just bought six books on data science, and you want to arrange them on your book shelf.
This is quite a demanding challenge, for any data scientist! Obviously, there are many strategies you can choose. You can arrange them alphabetically, by title or (first) author name; or by size, color, year of publication; or you can do it just randomly.
An interesting view of strategies is our human inclination to make one choice out of the many choices we have.
In this simple example, the number of possible solutions to arrange six books on your bookshelf is no less than \(6! = 720\). Why?
The logic is that for the first book, you have six positions on the book shelf. After deciding on a spot for the first book, only five spots are open for the second book, then four for the third, and so on. For the last (sixth) book, you do not have a choice, really, as all spots except one are already taken. So, the number of permutations is \(6! = 6*5*4*3*2*1\).
Trial-and-error would be pretty time consuming, especially if in the process you want to evaluate, somehow, these arrangements. Human beings have learned to use intuition and strategies, to save time. Imagine what a person with 200 books has to go through!
factorial(6)
## [1] 720
But the psychological roots are of no concern to us, here.
The rules for counting the number of possibilities, or permutations, are:
There are many variations to the above challenge. Suppose we have bought four books, and our book shelf still has room for six. If we do allow space between the books, then again we have six spots available for the first book; five for the second book; four for the third book; and still three for the last book!
That is, we have \(6*5*4*3 = 360\) options.
The difference, a factor 2 (\(720/360 = 2\)), can be explained by the fact that the two empty spots are interchangeable. Empty is empty. Before, we had two more books to put in those two positions, and the order in which they filled those spots mattered!
To illustrate what we mean by "the order matters", consider another simple example.
Suppose we have a group of three people. Two will be selected to receive prizes, of 200 and 100 dollars. The first person selected gets the bigger prize. The order matters!
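The count of 360 can also be written with factorials, as \(n!/(n-k)!\) with \(n\) spots and \(k\) distinct books. A minimal check in R, where the variable names `spots` and `books` are chosen here for illustration:

```r
spots <- 6  # positions available on the shelf
books <- 4  # distinct books to place

# Direct multiplication: 6 * 5 * 4 * 3
prod(spots:(spots - books + 1))
## [1] 360

# Same count via factorials: dividing by 2! removes the orderings of the two empty spots
factorial(spots) / factorial(spots - books)
## [1] 360
```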
Picking 2 out of 3, when order matters, follows the rule for variations: \(\frac{n!}{(n-k)!}\), here with \(n = 3\) and \(k = 2\).
factorial(3)/factorial(3-2)
## [1] 6
Since the numbers are small, we can enumerate the permutations. If we call our three persons A, B and C, we have:
AB AC BA BC CA CB
Note that AB differs from BA, as A and B receive different prizes!
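For larger groups, writing out the list by hand becomes tedious. One possible way to generate it in base R is sketched below; it builds all ordered pairs with `expand.grid()` and then drops the pairs in which the same person would win both prizes. The column names `first` and `second` are illustrative.

```r
persons <- c("A", "B", "C")

# All ordered pairs (first prize, second prize), including impossible ones such as AA
pairs <- expand.grid(first = persons, second = persons, stringsAsFactors = FALSE)

# A person cannot win both prizes
pairs <- pairs[pairs$first != pairs$second, ]

sort(paste0(pairs$first, pairs$second))
## [1] "AB" "AC" "BA" "BC" "CA" "CB"
nrow(pairs)
## [1] 6
```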
We can imagine a situation in which the order doesn't matter. If we want to sample two persons out of a group of three, for an interview, then the samples AB and BA are identical (assuming it doesn't matter who is interviewed first).
In terms of our earlier example, let's imagine your data science teacher, who has bought four copies of his favorite textbook (to lend to his students). It doesn't really matter in which order they end up on his book shelf, which has a capacity of six such books. The formula for combinations, with \(n\) spots and \(k\) identical books, is \(\frac{n!}{(n-k)! * k!}\).
Using R code:
spaces <- 6 # 6 spots on our book shelf
books <- 4 # 4 identical books
factorial(spaces)/(factorial(spaces-books)*factorial(books))
## [1] 15
Again, we can enumerate these combinations. The numbers are the shelf positions of the four books. The dots are the empty spots on the shelf.
Combination:
You probably detect the logic. You probably also notice that this task is time consuming and error prone, even with small numbers. Having a formula to do the counting, is great!
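R can also produce the enumeration itself. The sketch below uses `combn()` from base R; the way each shelf is drawn (position numbers for books, dots for empty spots) is an assumption about the layout described above.

```r
spaces <- 6  # 6 spots on our book shelf
books <- 4   # 4 identical books

# Each column of the result is one set of occupied shelf positions
positions <- combn(spaces, books)
ncol(positions)
## [1] 15

# Draw every arrangement: the position number where a book stands, "." for an empty spot
shelves <- apply(positions, 2, function(occupied) {
  paste(ifelse(1:spaces %in% occupied, 1:spaces, "."), collapse = " ")
})
head(shelves, 3)
## [1] "1 2 3 4 . ." "1 2 3 . 5 ." "1 2 3 . . 6"
```

Printing `shelves` in full lists all 15 arrangements.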
Applying the formula to our sample of two interviewees out of a population of three persons:
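A minimal sketch of that calculation, with the variable names `population` and `sample_size` chosen here for illustration. Base R's `choose()` gives the same count directly, and `combn()` lists the samples.

```r
population <- 3   # persons A, B and C
sample_size <- 2  # two interviewees

# Combinations: n! / ((n - k)! * k!)
factorial(population) / (factorial(population - sample_size) * factorial(sample_size))
## [1] 3

# Built-in equivalent
choose(population, sample_size)
## [1] 3

# The three samples; AB and BA count as one
combn(c("A", "B", "C"), sample_size, paste, collapse = "")
## [1] "AB" "AC" "BC"
```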