The refresher course consists of ten chapters:

  1. Set Theory
  2. Mathematical Operations
  3. Equations & Inequalities
  4. Functions & Graphs
  5. Statistical Operations
  6. The Language of Statistics
  7. Working with Numbers, and Data Display
  8. The Normal Distribution
  9. Central Tendency
  10. Variability

The first five chapters of this refresher course are based on:

Franken, W.M. & Bouts, R.A. (2002). Wiskunde voor statistiek: Een voorbereiding. Bussum, Netherlands: Uitgeverij Coutinho

Chapters 6 to 10 are based on:

Landers, R.N. (2019). A Step-by-Step Introduction to Statistics for Business. Sage Publications

The module illustrates all topics discussed in the textbooks, and shows how they can be implemented in R.

Along the way, you will learn many of the basic functions of R, as well as functions from handy R packages.

This file is about chapter 1: sets in mathematics.

1. Set Theory

1.1 Sets in Mathematics

Intuitively, we understand the concept of sets. In dictionaries, you will find definitions like a group or collection of things that belong together or resemble one another.

In mathematics, sets have a pretty precise definition:

  • A set is a well-defined collection of objects.
  • Each object in a set is called an element of the set.
  • Two sets are equal if they have exactly the same elements in them.
  • A set that contains no elements is called a null set or an empty set.
  • If every element in Set A is also in Set B, then Set A is a subset of Set B.

Example: if A is the set of the first five letters of the alphabet, then: A = {a,b,c,d,e}

In R we can create this set as follows.

We can decide on the name of the set. Here, we use SetA.

The two characters “<” and “-” look like an arrow.

We combine the characters on the right of the arrow, using the c() function.

We then assign this combination to SetA, on the left-hand side.

Once we have defined SetA, we can check whether certain elements, like "a" and "f", are elements of the set.

SetA <- c("a","b","c","d","e")
"a" %in% SetA
## [1] TRUE
"f" %in% SetA
## [1] FALSE

Likewise, we can create a set of numbers, from 1 to 100.

SetN <- c(1:100)  
length(SetN)
## [1] 100
88 %in% SetN  # Is 88  an element of our set? [TRUE, it is!]
## [1] TRUE
105 %in% SetN # Is 105 an element of our set? [FALSE, it is not!]
## [1] FALSE

1.2 Relations between Sets

Empty Set

An empty set is a set without elements.

SetEmpty <- NULL
length(SetEmpty)   # the length of an empty set is zero
## [1] 0
"2" %in% SetEmpty  # is 2 part of our empty set?
## [1] FALSE

Identical Sets

Apart from the sequence of elements, the two sets below are identical.

All elements of A occur in B, and all elements of B are part of A.

SetA <- c("1","3","4","6")
SetB <- c("1","4","6","3")
SetB %in% SetA
## [1] TRUE TRUE TRUE TRUE
SetA %in% SetB
## [1] TRUE TRUE TRUE TRUE

We can test the differences and similarities using the commands below.

The overlap between two sets is called the intersection.

In a Venn diagram, the intersection is the region where the two sets overlap.

All elements that are part of two sets combined (either A, B or both) form the union of the two sets.

In a Venn diagram, the union is the area covered by either set (or both).

The intersection of two sets is written as \(A \cap B\)

The union of two sets is written as \(A \cup B\)

setdiff(SetA, SetB)
## character(0)
setequal(SetA, SetB)
## [1] TRUE
intersect(SetA, SetB)
## [1] "1" "3" "4" "6"

Let’s look at some other examples.

The two sets below are not identical, and contain duplicates.

From the result we see that duplicates are removed! The element “a” appears twice in SetA, but only once in the union of both sets.

SetA <- c("a","a","b","c","d")
SetB <- c("c","c","d","e")
union(SetA, SetB)
## [1] "a" "b" "c" "d" "e"

There are various functions in R that help you detect unique and duplicate cases.

Some of these functions are unique() and duplicated().

Applying unique() to a set filters out any duplicates.

The function duplicated() indicates, for each element, starting from the first element in the set, if it is the same as any of the preceding elements. Since the first element, by definition, has no predecessor, the indication has to be FALSE. If, like in SetA, the second element is identical to the first element, it will be TRUE. Going through the entire set, we generate a list of duplicates: elements with indicator TRUE.

Note that duplication means that the same element occurs twice or more. That is, triplication, quadruplication, and so on, are just subsets of duplication: elements appearing three times or more are a subset of elements appearing twice or more.

We can use the result of duplicated() to index SetA. Since SetA has five elements (including duplicates), we can refer to elements using square brackets.

For example, SetA[1] returns the first element of SetA. And SetA[3:5] returns the third to fifth elements of SetA. An alternative way for the latter would be to use a list of TRUE and FALSE indicators, here c(F,F,T,T,T).

Since the duplicated() function produces such a list, we can use it to index SetA. However, duplicated() returns TRUE for duplicated elements. If we want to detect unique elements, we use the NOT operator (the exclamation mark, !) to change the TRUE indicators to FALSE, and FALSE indicators to TRUE.

SetA[1]
## [1] "a"
SetA[3:5]
## [1] "b" "c" "d"
SetA[c(F,F,T,T,T)] # Equivalent to previous!
## [1] "b" "c" "d"

(SetA)                   # Putting the object between brackets prints it in the console
## [1] "a" "a" "b" "c" "d"
unique(SetA)             # Removes duplicates
## [1] "a" "b" "c" "d"
duplicated(SetA)
## [1] FALSE  TRUE FALSE FALSE FALSE
SetA[!duplicated(SetA)]  # Removes duplicates, in a slightly more complicated way
## [1] "a" "b" "c" "d"

Differences between sets can be checked using the setdiff() function. The function only compares two sets at a time, and the order is important, as you see below!

setdiff(SetA, SetB)  # a and b occur in A, but not in B
## [1] "a" "b"
setdiff(SetB, SetA)  # e occurs in B, but not in A
## [1] "e"

Advanced

To challenge your skills, we can define the union of sets A and B as follows:

  • Elements unique to A, plus:
  • Elements unique to B, plus:
  • Elements in the intersection of A and B.

We can sort the combination of these three parts alphabetically, and check if indeed the result is equal to what we have defined as the union:

sort(c(setdiff(SetA, SetB), setdiff(SetB, SetA), intersect(SetB, SetA)))
## [1] "a" "b" "c" "d" "e"
all(sort(c(setdiff(SetA, SetB), setdiff(SetB, SetA), intersect(SetB, SetA))) == union(SetA, SetB))
## [1] TRUE

The all() function can be used to see if all elements of one set, are contained by the other.

All elements of set C (the element "a", duplicated) are part of D.

One element of D is not contained in C.

SetC <- c("a","a"); SetD <- c("a","b")
SetC %in% SetD
## [1] TRUE TRUE
all(SetC %in% SetD)
## [1] TRUE
SetD %in% SetC
## [1]  TRUE FALSE
all(SetD %in% SetC)
## [1] FALSE

1.3 Sets of Numbers

Sets can contain any type of element: characters, binary indicators (like TRUE or FALSE), numbers, and more.

It can be tedious to create sets of characters or words, as you have to do a lot of typing.

Sets of numbers are often easier, especially if it’s a line of numbers, say from 1, 2 .. to 99.

You can use the length() function to return the number of elements (here: numbers).

You can access specific elements using indexing, between square brackets. Note that round brackets () do not work for indexing. Computers can be very picky.

(SetN <- c(-10:+10)) # Putting the expression between brackets, prints the object in the console!
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
## [20]   9  10
length(SetN)
## [1] 21
SetN[3]
## [1] -8
SetN[3] < SetN[4]
## [1] TRUE

1.4 Application of Set Theory, and Sets

You may wonder why all of the above would be of interest to a data scientist like yourself.

There are at least three reasons we would like to stress here.

  • Logical thinking

First of all, dealing with data and data analysis will force you to think logically about what it all means. Even working with just one data set, say, a (partial) list of employees in your organization who have filled out a survey questionnaire, requires you to first assess the quality of the list. How many employees have filled out the survey twice or more (and what do we do with duplicates, especially if responses by duplicate respondents are inconsistent)? Is the list of (unique) respondents a subset of the employees? Is the response rate the same for all departments of the organization? And so on and so forth.

Answers to such questions can be obtained using the tools discussed above, or many other tools. But it is the thinking that counts!
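As a minimal sketch (with made-up, purely illustrative names), the questions above translate directly into the set operations from this chapter:

employees   <- c("Anna", "Bram", "Carla", "Dirk", "Els")    # hypothetical employee list
respondents <- c("Anna", "Bram", "Bram", "Carla", "Frank")  # hypothetical survey responses
respondents[duplicated(respondents)]      # who filled out the survey more than once?
## [1] "Bram"
all(unique(respondents) %in% employees)   # are all (unique) respondents employees?
## [1] FALSE
setdiff(respondents, employees)           # respondents who are not on the employee list
## [1] "Frank"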

  • Comparing data sets

One of the starting points in the data mining process, is data understanding. Often, your analysis is based on data from various sources. Before you start your analysis, you have to understand all sources in depth. Combining, or merging, data sets from various sources almost without exception reveals errors, inconsistencies and gaps. Applying the basics of set theory will help you in preparing high-quality, analyzable data sets. Or to put it negatively: data science follows the garbage-in-garbage-out principle!

  • Probability theory

In some of the modules, we will use probability theory. Concepts like conditional probability have close links to set theory. It helps if you're familiar with the terminologies, and with notations like \(\cup\) and \(\cap\).

1.5 Exercises

1.5.1 Exercise 1

Given are the following sets:

A = {2,4,6,8,10,12,14,16} B = {1,2,3,4,5,6,7} C = {3,6,9,12,15,18}

  1. A \(\cap\) B = (Intersection of A and B)
  2. A \(\cap\) C =
  3. B \(\cap\) C =
  4. A \(\cap\) B \(\cap\) C =
  5. A \(\cup\) B = (Union of A and B)
  6. A \(\cup\) B \(\cup\) C = (Union of A, B and C)

Solutions

Let’s first create the sets A, B and C

A <- seq(2, 16, 2)
B <- c(1:7)
C <- seq(3, 18, 3)
A; B; C  # Prints the three sets
## [1]  2  4  6  8 10 12 14 16
## [1] 1 2 3 4 5 6 7
## [1]  3  6  9 12 15 18
  1. A \(\cap\) B
(intersect(A, B)) # Putting the command between brackets, prints the result
## [1] 2 4 6
  2. A \(\cap\) C
(intersect(A, C)) 
## [1]  6 12
  3. B \(\cap\) C
(intersect(B, C)) 
## [1] 3 6
  4. A \(\cap\) B \(\cap\) C

intersect(A, B, C) will give an error message, as intersect() only works on two sets!

We can do it in 2 steps; first, we determine the intersection of A and B; followed by the intersection of that result with C!

You can imagine that this is doable for three sets. But if we were to find the intersection between many sets, then the expression would get very lengthy!

A better alternative is to use the Reduce() function.

Often, you can find solutions to this kind of challenge by just googling.

The Reduce() function, for example, we found on link

intersect(intersect(A,B),C)
## [1] 6
Reduce(intersect, list(A,B,C))
## [1] 6
  5. A \(\cup\) B
sort((union(A, B)))
##  [1]  1  2  3  4  5  6  7  8 10 12 14 16
  6. A \(\cup\) B \(\cup\) C
sort(Reduce(union, list(A,B, C)))
##  [1]  1  2  3  4  5  6  7  8  9 10 12 14 15 16 18

Are you in for a challenge?

Which elements are in intersections of sets, but not in all sets?

(AB <- intersect(A,B)) # Pairwise intersection A and B; stored as set AB
## [1] 2 4 6
(AC <- intersect(A,C)) # Id, for A and C
## [1]  6 12
(BC <- intersect(B,C)) # Id, for B and C
## [1] 3 6
(sort(pairs <- Reduce(union, list(AB,AC,BC)))) # All pairwise intersection combined, stored as "pairs"
## [1]  2  3  4  6 12
(trios <- Reduce(intersect, list(A,B,C))) # Elements present in the intersection of all three sets
## [1] 6
sort(setdiff(pairs, trios)) # Elements in pairs, but not in trios
## [1]  2  3  4 12

You see that elements 2, 3, 4 and 12 are in pairwise intersections, but not in the three-way intersection!

Graphically, a Venn diagram of the three sets would show the same thing.

1.5.2 Exercise 2

Are the following statements true or false?

  1. {2,4} \(\subset\) (A \(\cap\) B)
  2. 6 \(\in\) (A \(\cap\) B \(\cap\) C)
  3. {6} \(\subset\) (A \(\cap\) B)
  4. (A \(\cap\) B) \(\subset\) (A \(\cup\) B)

Solutions

  1. {2,4} \(\subset\) (A \(\cap\) B)
SetTest <- c(2,4)
all(SetTest %in% intersect(A, B))  # check against the intersection of A and B
## [1] TRUE
  2. 6 \(\in\) (A \(\cap\) B \(\cap\) C)
  3. {6} \(\subset\) (A \(\cap\) B)
# Using the formulation of statement 3 (6 as a set with one element)

SetTest <- c(6)
trios
## [1] 6
all(SetTest %in% trios) # Remember we stored the intersect as "trios"!
## [1] TRUE
# Alternatively (as in statement 2), define 6 as a single element rather than a set of one element

6 %in% trios
## [1] TRUE
# Note that statements 2 and 3 are basically the same!
  4. (A \(\cap\) B) \(\subset\) (A \(\cup\) B)

This is true in general. Elements in the intersection of A and B, are by definition part of both A and B. The intersection is therefore a subset of all elements in A or B!

Just to train the formulation of this exercise in R:

all(intersect(A, B) %in% union(A, B))
## [1] TRUE

1.5.3 Exercise 3

A = {1,2,3,4,5,6} B = {5,6} C = {1,2,5,6} D = {2,3,4} E = {2,3,4,5}

Complete the statement with one of the symbols:

  • \(\in\) (element of)
  • \(\notin\) (not an element of)
  • \(\subset\) (subset of)
  • \(\supset\) (superset; if A is subset of B, then B is superset of A)
  • \(\cap\) (intersection)
  • \(\cup\) (union)
  1. B….C
  2. B….C = B
  3. B….C = C
  4. B….D = 0 (empty set)
  5. C….D = A
  6. D….E = D
  7. 4….C \(\cap\) B
  8. D….E….A

Solutions

Use reason to answer each of the questions!

As an additional challenge, formulate the sets in R, and use any of the commands introduced in this chapter to check your answer!

  1. \(\subset\) (B is obviously a subset of C, as all elements of B are also in C)
  2. \(\cap\) (the intersection of B and C is equal to B, as B is a subset of C)
  3. \(\cup\) (the combined elements of B and C, are equal to C, as C is a superset of B)
  4. \(\cap\) (as B and D have no elements in common, the intersection is the empty set)
  5. \(\cup\) (all elements of C and D combined, match A)
  6. \(\cap\) or \(\subset\) (D is a subset of E, so the intersection of D and E equals D)
  7. \(\notin\) (4 is not part of the intersection of B and C; it is not even part of the union of B and C)
  8. \(\subset\) (D is a subset of E, which in turn is a subset of A; you can conclude that therefore D is a subset of A)

Working with data files (in data science) requires logical thinking. Set theory is a good exercise in logical thinking!

1.5.4 Exercise 4

Consider the following sets.

  • A = {x | x is an even number and x<20; x is a positive natural number}
  • B = {x | x is a multiple of 3 and x<20; x is a positive natural number}
  • C = {x | x is an odd number and x<20; x is a positive natural number}

This looks cryptic. Set A, for example, reads like the set of numbers x conditioned by the following rules: x is a positive natural number (1, 2, 3 to infinity), smaller than 20 and divisible by 2. We can enumerate these numbers easily: 2, 4, 6 .. up to 18.

Set B then is 3, 6, 9 .. up to 18.

Set C is 1, 3, 5 .. up to 19.

Determine:

  1. A \(\cap\) B
  2. A \(\cap\) B \(\cap\) C
  3. A \(\cup\) B

Again, create the sets in R and use the proper functions to get the solutions!

Solutions

  1. {6, 12, 18}
  2. 0 (empty set)
  3. {2,3,4,6,8,9,10,12,14,15,16,18}
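Following the exercise's own suggestion, here is one way to create these sets in R and check the answers (a minimal sketch; note that this overwrites the sets A, B and C from Exercise 1):

A <- seq(2, 18, 2)   # even numbers below 20
B <- seq(3, 18, 3)   # multiples of 3 below 20
C <- seq(1, 19, 2)   # odd numbers below 20
intersect(A, B)
## [1]  6 12 18
Reduce(intersect, list(A, B, C))
## numeric(0)
sort(union(A, B))
##  [1]  2  3  4  6  8  9 10 12 14 15 16 18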

2. Mathematical Operations

2.1 Addition

Probably the most basic mathematical operation is adding two or more numbers.

You can use R as a calculator. Type your mathematical expression in the console, and get the result instantaneously.

A better way to use R is to write your code in R scripts. Most often, you will assign values to objects. Below we assign the value 2 to a, and 5 to b. You can then assign the addition of \(a+b\) to another object, \(c\).

2+5
## [1] 7
a = 2; b = 5
c = a + b
cat(a,"+",b,"=",c,"\n")
## 2 + 5 = 7
# if c = a + b, then b = c - a
c-a
## [1] 5
b
## [1] 5

2.2 Multiplication and Division

Multiplication is like adding a number several times. Adding 5 four times is equivalent to multiplying 5 by 4. See below.

Often, in data science, you will multiply non-integer (decimal) numbers, like \(4.5*3.88\). The analogy (adding 4.5, 3.88 times) is somewhat harder to envision.

The operator for multiplication is the asterisk (*). Although in mathematical textbooks you may find \(xy\) as shorthand for x times y, that doesn't work in R and other software. You have to use \(x*y\) in your code!

5+5+5+5
## [1] 20
4*5
## [1] 20

\(a+a+a+a = 4*a\) (or 4a, for short)

\(4a = 20\)

We can divide both sides by 4 to find a:

\(4a/4 = 20/4\) \(\Rightarrow\) \(a=5\)

2.3 Exponentiation

Exponentiation is equivalent to multiplying by the same number, several times. For instance, \(5*5\) is the same as 5 raised to the power 2, or \(5^{2}\). Exponentiation in R uses the operator ^, like in the example below.

Exponents do not have to be integers (1, 2, …), but can also be non-integer numbers (e.g. 1.2, 2.8, …). A special case is an exponent of \(0.5\): raising a number to the power \(0.5\) is the same as taking its square root. Taking the square root of a number is the reverse of taking the square.

If \(x^{2} = y\), then \(y^{0.5} = x\). For example, the square of 5 is \(5*5 = 25\); reversely, the square root of 25 is 5.

Other special cases are exponents of 0 and 1.

\(x^{0} = 1\)

\(x^{1} = x\)

5*5*5*5
## [1] 625
5^4
## [1] 625
5*5
## [1] 25
5^2
## [1] 25
sqrt(25)
## [1] 5
25^(1/2)
## [1] 5
5^0
## [1] 1
5^1
## [1] 5

Exponentiation has the following structure:

\(a^b=c\)

In this formula:

  • a is the base
  • b is the exponent
  • c is the power

2.4 Rooting and Logarithms

There is a relationship between exponentiation, rooting and logarithms.

In a simple example, 10 squared (or \(10^{2}\)) is \(10*10 = 100\).

That is:

\(10^2 = 100\)

Rooting is:

\(\sqrt{100} = 10\) (the square root of 100)

Logarithm:

\(log(100) = 2\) (using 10 as the base for the logarithm).

Note that the three numbers (2; 10; and 100) keep coming back, in different settings!

In R this would look like:

cat("The square of 10, or 10*10, equals",10^2,"\n")
## The square of 10, or 10*10, equals 100
cat("The square root of 100 equals",sqrt(100),"\n")
## The square root of 100 equals 10
cat("The logarithm of 100 (base 10) equals", log10(100),"\n")
## The logarithm of 100 (base 10) equals 2

Since 10 is an exceptional base, and squaring and square roots are special cases, we can use a more general version.

Suppose we do the same for \(2^3 = 2*2*2 = 8\).

cat("2 raised to the power 3 equals",2^3,"\n")
## 2 raised to the power 3 equals 8
cat("The cubic root of 8 equals",8^(1/3),"\n")
## The cubic root of 8 equals 2
cat("The logarithm of 8 (base 2) equals", log(8, 2),"\n")
## The logarithm of 8 (base 2) equals 3

For an easy explanation of the links between roots and exponents, have a look at this video.

2.5 The Order of Operations

The order of operations is governed by the principle of PEMDAS.

  1. Parentheses
  2. Exponents
  3. Multiplication and Division
  4. Addition and Subtraction

As a general rule, in programming for data science and statistics it is best to use parentheses (brackets) in order to avoid confusion.

Some examples:

2+5*8 
## [1] 42
2+(5*8)
## [1] 42
(2+5)*8
## [1] 56
3+2^2
## [1] 7
(3+2)^2
## [1] 25
12-24-34+12
## [1] -34
12-(24-34)+12
## [1] 34
12-24-(34+12)
## [1] -58
12-(24-34+12)
## [1] 10

As you see, formulas are prone to errors!

2.6 Negative Numbers in Multiplication

As a rule:

  • A positive number times a positive number gives a positive number
  • A positive number times a negative number gives a negative number
  • A negative number times a negative number gives a positive number

Examples:

8*7
## [1] 56
8*-7
## [1] -56
-8*7
## [1] -56
-8*-7
## [1] 56

2.7 Rules for Operations

  • Commutative law

\(a+b = b+a\)

  • Associative law

\((a+b)+c = a+(b+c)\)

This holds for addition, not for subtraction!

(8+2)+3 
## [1] 13
8+(2+3)  # Same as above
## [1] 13
(8-2)-3 
## [1] 3
8-(2-3)  # Not the same as above!!
## [1] 9
  • Distributive law

\(a*(b+c) = (a*b) + (a*c)\)

8*(2+3) 
## [1] 40
8*2 + 8*3
## [1] 40

We can use one command line to evaluate if the latter two expressions are identical.

For these evaluations, you have to use the \(a == b\) format (double =), which evaluates whether \(a\) and \(b\) are the same. \(a = b\) (single =) would allocate the value of \(b\) to \(a\), which is not what we want.

8*(2+3) == 8*2 + 8*3 # Evaluate if the expressions are identical
## [1] TRUE

Another rule:

\((b+c) / a = (b/a)+(c/a)\)

(8+7)/3
## [1] 5
(8/3) + (7/3)
## [1] 5
(8+7)/3 == (8/3) + (7/3)
## [1] TRUE

But: \(a/(b+c) \neq (a/b) + (a/c)\)

15/(2+3)
## [1] 3
15/2 + 15/3
## [1] 12.5
15/(2+3) == 15/2 + 15/3
## [1] FALSE

2.8 Rules for Fractions

Rule 1: \((a/p) + (b/p) = (a+b)/p\)

8/4 + 3/4 == (8 + 3)/4
## [1] TRUE

Rule 2: \((a/p)*(b/q) = (a*b)/(p*q)\)

(8/4)*(6/3) == (8*6)/(4*3)
## [1] TRUE

Rule 3: \((a/p)/(b/q) = (a/p)*(q/b)\)

(8/4)/(6/3) == (8/4)*(3/6)
## [1] TRUE

2.9 Rules for Exponentiation

Rule 1: \(a^n = a * a * a * ...\) (n times)

Rule 2: \(a^n\) is positive if \(a>0\)

Rule 3: \(a^n\) is positive if \(a<0\) and n is an even number

Rule 4: \(a^n\) is negative if \(a<0\) and n is an odd number

Examples:

8^2    # Rule 2
## [1] 64
(-8)^2 # Rule 3
## [1] 64
-8^3   # Rule 4
## [1] -512
-8^2   # Is the outcome what you expected??
## [1] -64

In the expression \(-8^2\), the PEMDAS rule forces R to first evaluate \(8^2\) (E, for exponentiation), before multiplying (M) by -1!!

If you intend to square -8 (\(-8*-8=64\)), then the code should read:

(-8)^2   # Is the outcome what you expected??
## [1] 64

Parentheses make all the difference!

3. Equations & Inequalities

3.1 Introduction

Consider the following statements:

  1. \(4 > 8\) (“4 is greater than 8”)
  2. \(4 < 8\) (“4 is less than 8”)

The first statement is false. The second statement is true.

A statement like:

\(x>8\)

is an “open” statement, if x is not known. x is called a variable.

4>8
## [1] FALSE
4<8
## [1] TRUE

3.2 Solving Equations with One Variable

Rule 1: x + z = y + z, if x = y

Equations, that is, remain equivalent when adding the same number to the left-hand and right-hand sides of the equals ("=") sign.

Rule 2: xz = yz, if x = y

Remember that \(xz\) is short for \(x*z\), or in words, x times z.

Equations are equivalent when multiplying the left hand and right hand sides by the same number.

Below, we assign the value 3 to both \(x\) and \(y\). Obviously, \(x=y\). Next, we multiply both \(x\) and \(y\) by the same number \(z=2\). Since \(x=y\), \(xz=yz\).

We use an if/else statement to evaluate whether \(x\) and \(y\) are the same. The cat() function prints the output.

x <- 3
y <- 3
z <- 2
xz <- x*z; yz <- y*z
if (xz == yz) {
  cat("Yes, xz=yz!!")
} else {
  cat("Nope, they differ")
}
## Yes, xz=yz!!
x <- 4
y <- 3
z <- 2
xz <- x*z; yz <- y*z
if (xz == yz) {
  cat("Yes, xz=yz!!")
} else {
  cat("Nope, they differ")
}
## Nope, they differ

Using these transformation rules we can solve equations.

For example:

\(2x + 5 = 10\)

subtract 5, from both sides of the equation:

\(2x + 5 - 5 = 10 - 5\)

\(2x = 5\)

divide both sides by 2:

\(2x/2 = 5/2\)

\(x = 2.5\)

Check for yourself:

\(3x + 5 = x - 1\)

Answer: x = -3

\((x + 2)/5 = 3\)

Answer: x = 13
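As a quick sanity check in R (a minimal sketch that simply substitutes the answers back into the original equations):

x <- -3
3*x + 5 == x - 1
## [1] TRUE
x <- 13
(x + 2)/5 == 3
## [1] TRUE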

3.3 Linear Inequalities

Rule 1:

\(x + z > y + z\), if \(x>y\)

\(x + z < y + z\), if \(x<y\)

Rule 2a (z>0)

\(xz > yz\), if \(x>y\)

\(xz < yz\), if \(x<y\)

Rule 2b (z<0)

\(xz > yz\), if \(x<y\)

\(xz < yz\), if \(x>y\)

Let’s use R to illustrate these rules.

In the example below, \(z\) is a positive number. If \(x<y\) is true, then the claim is that after multiplying both sides by \(z\), the result is still true.

# Example (rule 2a)
z=2
x=3
y=4
x<y
## [1] TRUE
z*x < z*y
## [1] TRUE

You see that the inequality still holds when multiplying both sides by a positive number.

But if \(z\) is a negative number, then things change!

While \(x<y\) (indeed, \(3<4\)!), multiplying \(x\) and \(y\) by the same negative number reverses the sign: \(z*x>z*y\) (as \(-2*3 >-2*4\), or \(-6>-8\)).

Using the same values for \(x\) and \(y\) as above, but using \(z=-2\), we will get the following.

# Example (rule 2b)
z=-2 # z now is a negative number!!
x=3
y=4
x<y
## [1] TRUE
z*x < z*y
## [1] FALSE
(x<y) == (z*x < z*y) # Are both statements either TRUE or FALSE (!) ??
## [1] FALSE

The inequality no longer holds when multiplying both sides by a negative number.

Or, to put it differently, the sign reverses, from “smaller than (\(<\))” to “larger than (\(>\))”.

# Example (rule 2b)
z=-2 # z now is a negative number!!
x=3
y=4
x<y        # x is less than y
## [1] TRUE
z*x > z*y  # zx is greater than zy, if z<0
## [1] TRUE
(x<y) == (z*x > z*y) # Are both statements either TRUE or FALSE if the sign reverses ??
## [1] TRUE

Check for yourself:

\(-x + 1 > 6\)

Answer: x < -5

A slightly more complicated expression is the following. Here, we have two inequalities in one statement. Basically it says that \(-x\) is in between \(-8\) and \(-5\). But what does that mean for \(x\)?

\(-8 < -x < -5\)

Answer: 5 < x < 8

Just follow the rules: multiply by -1 to get \(x\). That will reverse the signs, from \(<\) to \(>\). Or: \(8 > x > 5\).

x, that is, is a number in between 8 and 5.

Since our line of numbers runs from small to large, it is more common to state that x lies between 5 and 8 (x is larger than 5 but smaller than 8). If \(8>x\), then \(x<8\) (applying all the rules)! And if \(x>5\), then \(5<x\). Cosmetically, it looks nicer. But mathematically speaking, there’s no difference.
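A small numerical sketch in R illustrates the same point: of a range of candidate values for x, only those strictly between 5 and 8 satisfy the original statement.

x <- seq(0, 10, by = 0.5)
x[-8 < -x & -x < -5]   # which values of x satisfy -8 < -x < -5 ?
## [1] 5.5 6.0 6.5 7.0 7.5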

3.4 Equations with Two Variables (or “Unknowns”)

Consider the following pair of equations:

[1] \(x = y + 4\)   [2] \(x = 0.5*y - 3\)

Rather than a line of numbers, we now have two dimensions. The relations represent lines in a two-dimensional space. As long as the lines do not run parallel to one another, there must be one single combination of values of \(x\) and \(y\) where the lines meet.

One method of solving linear equations is by “substitution”.

The first equation already provides an expression for \(x\), in terms of \(y\). Otherwise, we could use the transformation rules to obtain such an expression.

Substituting \(x\) [1] in [2], we get:

\(y + 4 = 0.5y - 3\)

And using the rules of transformation:

\(0.5y = -7\) (subtract \(0.5y\) from both sides; subtract 4 from both sides)

Hence: \(y = -14\) (multiply both sides by 2)

and: \(x = -14 + 4 = -10\)

This method can be extended to sets of 3 equations and 3 variables, or in general to sets of n equations and n variables.

As you can imagine, it will get increasingly tedious and error prone to find the solutions for all variables!

The matlib package

Luckily, you can make use of the matlib package in R, and of the functions it provides.

# install.packages("matlib")
library(matlib)
A <- matrix(c(1,1,-1,-.5),2,2)
b <- c(4,-3)
showEqn(A,b)
## 1*x1   - 1*x2  =   4 
## 1*x1 - 0.5*x2  =  -3
plotEqn(A,b)
## x[1]   - 1*x[2]  =   4 
## x[1] - 0.5*x[2]  =  -3

Solve(A,b,fractions=TRUE)
## x1    =  -10 
##   x2  =  -14

Let’s inspect these lines of code, step-by-step.

  • We first invoke the functions in the matlib package, using the library() function. Before doing so, see to it that you have installed the package. In RStudio, check the Packages installed in the user library.

  • We then define a 2*2 matrix containing the coefficients of the equations on the left hand side. Note that the equations have to be rewritten in the format:

\(c1*x + c3*y = b1\)

\(c2*x + c4*y = b2\)

In our case, we have:

\(x = y + 4\), to be rewritten as:

\(x-y = 4\), for the first equation.

And:

\(x = 0.5y - 3\), to be written as:

\(x -0.5y = -3\), for the second equation.

Summarizing:

\(x-y = 4\)

\(x-0.5y = -3\)

  • In our case, the 4 coefficients are then 1, 1, -1 and -0.5, for \(c1\) to \(c4\) respectively. Use the correct order, from \(c1\) to \(c4\)!!

The first value (1) is the coefficient (\(c1\)) of the first variable (\(x\)) in the first equation.

The second value (1) is the coefficient (\(c2\)) of the first variable (\(x\)) in the second equation.

The third value (-1) is the coefficient (\(c3\)) of the second variable (\(y\)) in the first equation.

And the fourth value (-0.5) is the coefficient (\(c4\)) of the second variable (\(y\)) in the second equation.

  • The numbers 2 and 2 in the matrix() function indicate the number of rows and columns.

  • The right-hand-side constants b are 4 and -3.

  • The showEqn() function allows you to check if the implied equations match the ones you want to solve. You can use plotEqn() to see what the lines look like in the two-dimensional space.

  • The Solve() function gives the results for x and y (although the software uses x1 and x2).

As an example, and to get some practice, adapt the above to find x and y for the following set of equations:

\(3x + 5y = 11\)

\(2x - y = 16\)

library(matlib)
A <- matrix(c(3,2,5,-1),2,2)
b <- c(11,16)
Solve(A,b,fractions=TRUE)
## x1    =   7 
##   x2  =  -2

There are two special cases.

The first case is the existence of identical lines, as in:

\(3x + 6y = 9\)

\(5x + 10y = 15\)

Multiplying the first equation by 5, and the second one by 3 gives:

\(15x + 30y = 45\)

\(15x + 30y = 45\)

library(matlib)
A <- matrix(c(3,5,6,10),2,2)
b <- c(9,15)
showEqn(A,b)
## 3*x1  + 6*x2  =   9 
## 5*x1 + 10*x2  =  15
plotEqn(A,b)
## 3*x[1]  + 6*x[2]  =   9 
## 5*x[1] + 10*x[2]  =  15

Solve(A,b,fractions=TRUE)
## x1 + 2*x2  =  3 
##         0  =  0

The plot produces, indeed, one thick line, representing both equations.

The two equations are identical, and produce the same line. Every point on that line satisfies both equations, so there are infinitely many solutions. Or, as the output shows, 0=0; which is true regardless of x and y!

A second instance is when the lines do not coincide but run parallel. In that case there is no solution, as the lines never meet.

\(2x - y = -6\)

\(6x - 3y = 12\)

library(matlib)
A <- matrix(c(2,6,-1,-3),2,2)
b <- c(6,12)
showEqn(A,b)
## 2*x1 - 1*x2  =   6 
## 6*x1 - 3*x2  =  12
plotEqn(A,b)
## 2*x[1] - 1*x[2]  =   6 
## 6*x[1] - 3*x[2]  =  12

Solve(A,b,fractions=TRUE)
## x1 - 1/2*x2  =  2 
##           0  =  2

The solution reads 0=2, which is never true. A bit of a cryptic way to state that no solutions for x and y can be found that solve the set of equations.

3.5 Quadratic Equations

Quadratic equations have the form:

\(a*x^{2} + b*x + c = 0\)

The trick to solving this equation is based on the following notion. We are going to dig back in your (hopefully not long forgotten) high-school mathematics.

Suppose that we start from:

\((x+p)*(x+q)= 0\)

This expression is true if either \(x=-p\) or \(x=-q\). If \(x=-p\), then the first part of the equation is zero, and any number times zero gives zero. Likewise, \(x=-q\) would have the same effect. For \(x \neq -p\) and \(x \neq -q\), the left hand side of the equation implies multiplication of two non-zero numbers, which gives a non-zero outcome (and hence, the equality doesn't hold).

The equation can be rewritten as:

\((x)*(x+q) + p*(x+q)= 0\)

\(x^{2} + (p+q)*x + p*q = 0\)

Since \((x+p)*(x+q)= 0\) if one of the terms \((x+p)\) or \((x+q)\) equals zero, the trick is to find values p and q such that their sum equals b, and their product equals c.

For example, if we want to solve:

\(x^{2} + 7*x + 10 = 0\)

then 5 and 2 are the numbers we are looking for, since \(p+q = 5+2 = 7\) and \(p*q = 5*2 = 10\).

The equation then is true if \(x=-2\) or \(x=-5\). Check this for yourself!

We can use R-commands to plot the function, and confirm that, indeed, at \(x=-2\) and \(x=-5\), y takes on a value of zero.

curve(x^2+7*x+10, from=-10, to=2, xlab="x", ylab="y")
abline(h = 0, lty = 1)
abline(v = 0, lty = 1)
abline(v = -2, lty = 2)
abline(v = -5, lty = 2)

Unfortunately, it is not always easy to find the values p and q. It is better to make use of a general formula to solve the quadratic equation.

Maybe hidden at the very back parts of your brain, there’s information stored on a scary formula that you can use for solving quadratic equations.

It looks like:

\[x = {-b \pm \sqrt{b^2-4ac} \over 2a}\]

The \(\pm\) reads like plus or minus, implying that this formula generates two solutions!

Some advanced R-code creates a function to easily solve any quadratic equation. We use the code written by Kiki Hatzistavrou.

It is beyond the scope of this module to explain the code below in detail. All you have to know is that:

  • This code defines a user-written function result(a,b,c) which you can use to solve your equation. The a, b and c represent the coefficients in your quadratic equation.
  • Since R has so many users, odds are that you can find R-code for specific tasks written by somebody who has decided to share it with the rest of the world. No guarantee that it works, but most of the time it does!
  • If you find the code useful, store it somewhere. The code has to be run from a script, since user-written code is not (yet) included in official packages!
result <- function(a,b,c){    # Constructing the quadratic formula
  if(delta(a,b,c) > 0){ # first case D>0
        x_1 = (-b+sqrt(delta(a,b,c)))/(2*a)
        x_2 = (-b-sqrt(delta(a,b,c)))/(2*a)
        result = c(x_1,x_2)
  }
  else if(delta(a,b,c) == 0){ # second case D=0
        x = -b/(2*a)
  }
  else {"There are no real roots."} # third case D<0
}

delta<-function(a,b,c){       # Constructing delta
      b^2-4*a*c
}

You can now use this user-written function result(a,b,c). The terms a, b and c between brackets are just the coefficients in the formula

\(a*x^{2} + b*x + c = 0\)

So, as an example, you can easily solve our earlier example.

\(x^{2} + 7*x + 10 = 0\)

using:

(result(1,7,10))
## [1] -2 -5

Solve the following examples from the textbook yourself, using the result() function.

\(x^{2} = 9\)

\(x^{2} +4*x + 4 = 0\)

\(3*x^{2} + x -2 = 0\)

(result(1,0,-9))
## [1]  3 -3
(result(1,4,4))
## [1] -2
(result(3,1,-2))
## [1]  0.6666667 -1.0000000
library(MASS)
fractions(result(3,1,-2))
## [1] 2/3  -1

Note that in the second example, there is only one solution: \(x=-2\). Or, to put it differently, p and q have the same value (2).

By default, the formula gives the solution in decimals. We have used the fractions() function from the MASS package to give us fractions.

3.6 Exercises

3.6.1 Exercise 1

Find x for the following equations and inequalities. Note that you may first have to rewrite them, using the transformation rules from this chapter!

\(42 = x+13\)

\(x - 3 = -12\)

\(12 - x = 11 - x\)

\(-x - 5 = -7\)

\(20 - (3x/5) = x - 12\)

\(3x - 4 > 5x\)

\(-3x > -x - 5\)

3.6.2 Exercise 2

Solve x and y for the following pairs of equations. Use the showEqn(), plotEqn() and Solve() functions to check your answers.

\(7x + y = 22\)

\(5x + y = 14\)

(Answer: x=4; y=-6)

\(2x - y = 5\)

\(3x + 3y = 21\)
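For the first pair, we can sketch a quick verification with matlib (assuming the package is installed, as in section 3.4); the stated answer x=4, y=-6 should come back:

library(matlib)
A <- matrix(c(7,5,1,1),2,2)
b <- c(22,14)
Solve(A,b,fractions=TRUE)
## x1    =   4 
##   x2  =  -6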

3.6.3 Exercise 3

Solve x for the following quadratic equations. Note that you have to write them in the format \(a*x^{2} + b*x + c = 0\) before applying the quadratic formula (or the result() function)!

\(25 - 20x + 4x^{2} = 0\)

\(x^{2} + 20x = -19\)

\(-4x^{2} + 16x + 25 = 0\)

\(-x^{2} + x = -1\)

4. Relations, Functions & Graphs

4.1 Relations Between Sets

We have discussed sets, in chapter 1.

We have also discussed differences and overlaps between sets.

We can relate sets to one another, by linking all elements of one set, to all elements of the other set. This gives the product of the two sets.

If we have two sets, a and b, consisting of four elements each, then the product of the two sets gives 16 pairs, as below. We use expand.grid() to generate the product, and store the result as a data frame.

a<-c(3,5,12,16)
b<-c(5,6,10,12)
ab <- as.data.frame(expand.grid(a,b))
ab

A relation between two sets is a special version (a subset) of the product. For example, we can consider only those pairs satisfying the condition that the element of a is greater than the element of b.

In the last line of code below, we select from the data frame only those records that satisfy the condition Var1>Var2. The variable names are generated by the software. We can apply our own labels, if we want to. Here, we will settle for the default names.

Between the brackets [] we specify the conditions. Since data frames have two dimensions, we have to specify both the rows (before the comma) and the columns (after the comma) that we want to keep. Here, we keep the rows that meet the condition, and we keep all variables.

a<-c(3,5,12,16)
b<-c(5,6,10,12)
ab <- as.data.frame(expand.grid(a,b))
abr <- ab[ab$Var1>ab$Var2,]; abr

4.2 Functions Graphically Displayed

4.2.1 Linear Functions

We have already looked at plots of functions in the previous chapter. In that chapter, we examined pairs of linear equations, which are relations between x and y. These relations are often called functions, in which, traditionally, y is considered a function of x.

Linear equations, as the name implies, can be graphically displayed as straight lines. Unless two lines run parallel or coincide, there is exactly one point where they intersect. That point is a unique combination of one pair (x,y) where both equations are true simultaneously.

Below, we plot the linear function \(y = 0.5*x\). This can be rewritten as \(0.5*x - y = 0\). The latter way is convenient when writing the corresponding R-code. The matrix is of size (1*2), that is one row and two columns. The right hand side for one equation, is just one number (0).

We have also applied our own labels, x and y, for the two variables, overriding the defaults x1 and x2.

library(matlib)
A <- matrix(c(0.5,-1),1,2)
b <- 0
showEqn(A,b,vars=list("x","y"))
## 0.5*x - 1*y  =  0
plotEqn(A,b,vars=list("x","y"))
## 0.5*x - 1*y  =  0

We can write

\(0.5*x - y = 0\)

or alternatively

\(f(x) = 0.5*x\)

indicating that \(f(x)\) (or \(g(x)\), and so on) are functions of \(x\).

Consider two functions:

\(f(x) = 0.5*x + 5\) and

\(g(x) = 3*x - 2\)

When plotting these functions by hand, it suffices to compute two ordered pairs of values for \(x\) and \(f(x)\) (or, alternatively, \(y\)).

For \(f(x)\) we can derive those ordered pairs for, for example, \(x=0\) and \(x=10\)

For \(x=0\), \(f(x)= 0.5*0 + 5 = 5\), so we have the ordered pair (0,5)

For \(x=10\), \(f(x)= 0.5*10 + 5 = 10\), so we have the ordered pair (10,10)

Likewise, for \(g(x)\) we can derive ordered pairs (0,-2) and (2,4). Check this for yourself.

library(matlib)
A <- matrix(c(-0.5,1),1,2)
b <- 5
C <- matrix(c(-3,1),1,2)
d <- -2
showEqn(A,b,vars=list("x","y"))
## -0.5*x + 1*y  =  5
showEqn(C,d,vars=list("x","y"))
## -3*x + 1*y  =  -2
plotEqn(A,b,vars=list("x","y"))
## -0.5*x + y  =  5

plotEqn(C,d,vars=list("x","y"))
## -3*x + y  =  -2

Above we have plotted the two functions separately. The automatic settings for the minimum and maximum values of the \(x\) and \(y\) axes see to it that the lines seemingly run parallel, while actually they don’t.

It is better, of course, to draw both functions in one graph, to better compare their behaviors. Where do they cross the \(x\) and \(y\) axes? Where do the lines intersect? Which line is steeper? And so on.

Below, we use ggplot, which not only produces better-looking graphs but is also easier to adapt to our needs.

The commands look frightening, don't they? Don't worry, most of it is just an adaptation of what we found from smart googling. For example, the search string "ggplot plot two functions" guided us to this link, from which we copied and adapted the code below.

library(ggplot2)
fx <- function(x) 0.5*x+5
gx <- function(x) 3*x-2
ggplot(data.frame(x=c(-10, 10)), aes(x=x)) + stat_function(fun=fx, color="red") + stat_function(fun=gx, color="blue") + geom_hline(yintercept=0, size=2)+ geom_vline(xintercept=0, size=2)

There are some things to note:

  1. The blue line is steeper
  2. The function \(f(x)\) intersects with the y-axis at \(y=5\)
  3. The lines do intersect.

In the previous chapter, section 3.4, we have learned how to solve equations with two unknowns. We can use the same approach for solving the values of \(x\) and \(y\) where the lines intersect!

# install.packages("matlib")
library(matlib)
A <- matrix(c(-0.5,-3,1,1),2,2)
b <- c(5,-2)
showEqn(A,b)
## -0.5*x1 + 1*x2  =   5 
##   -3*x1 + 1*x2  =  -2
plotEqn(A,b)
## -0.5*x[1] + x[2]  =   5 
##   -3*x[1] + x[2]  =  -2

Solve(A,b,fractions=TRUE)
## x1    =  14/5 
##   x2  =  32/5
Solve(A,b,fractions=FALSE)
## x1    =  2.8 
##   x2  =  6.4

Finding the values for \(x\) and \(y\) where the functions intersect with the \(x\) and \(y\) axes is straightforward.

For example, for \(f(x)\), the line crosses the \(x\) axis if \(y=0\). Therefore, we can write:

\(f(x) = 0.5*x + 5 = 0\)

\(0.5*x = -5\)

\(x = -10\)

Using R commands:

# Calling solve() function to calculate value of x in ax = b, 
# where a and b are taken as the arguments 
solve(0.5, -5) 
## [1] -10

The lines intersect with the \(y\) axis at \(x=0\).

For \(f(x)\) we have:

\(y = f(x) = 0.5*x + 5\)

\(y = 0.5*0 + 5 = 5\)

4.2.2 Quadratic Functions

In the same vein, we can display quadratic functions.

Quadratic functions have the form:

\(f(x) = a*x^{2} + b*x + c\)

We can plot \(f(x) = x^{2} - 4\) using ggplot as follows.

library(ggplot2)
fx <- function(x) x^2-4
ggplot(data.frame(x=c(-10, 10)), aes(x=x)) + stat_function(fun=fx, color="red") + geom_hline(yintercept=0, size=2)+ geom_vline(xintercept=0, size=2)

We can add a second (quadratic) function \(g(x) = x^{2} - 4*x + 8\)

library(ggplot2)
fx <- function(x) x^2-4
gx <- function(x) x^2-4*x+8
ggplot(data.frame(x=c(-10, 10)), aes(x=x)) + stat_function(fun=fx, color="red") + stat_function(fun=gx, color="blue") + geom_hline(yintercept=0, size=2)+ geom_vline(xintercept=0, size=2)

You can pick up the challenge, and use what you learned in the previous chapter to compute the point \((x,y)\) where the two curves (parabolas) intersect. Here, you don't even need the formula introduced in section 3.5

Since we have:

\(y = x^{2} - 4\) (1)

and

\(y = x^{2} - 4*x + 8\) (2)

We can subtract (2) from (1), to get:

\(0 = 4x - 12\), and therefore \(x = 3\)

Substituting \(x = 3\) in \(f(x)\) we get \(y = 3^{2} - 4 = 5\).
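A quick check in R (a minimal sketch that evaluates both functions at x = 3):

fx <- function(x) x^2 - 4
gx <- function(x) x^2 - 4*x + 8
fx(3); gx(3)   # both should return 5
## [1] 5
## [1] 5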

4.3 Exercises

4.3.1 Exercise 1

Consider the following functions.

\(f(x) = 5x - 8\)
\(g(x) = -2x - 1\)
\(p(x) = -0.5x + 3.5\)

  1. Plot the functions in one graph
  2. Compute the intercept of these functions (the value of the function at \(x=0\))
library(ggplot2)
fx <- function(x) 5*x-8
gx <- function(x) -2*x-1
px <- function(x) -0.5*x+3.5
ggplot(data.frame(x=c(-10, 10)), aes(x=x)) + stat_function(fun=fx, color="red") + stat_function(fun=gx, color="blue")+ stat_function(fun=px, color="green")+ geom_hline(yintercept=0, size=2)+ geom_vline(xintercept=0, size=2)

The intercept of \(f(x)\):

\(f(x) = 5*0 - 8 = -8\)

The intercept of \(g(x)\):

\(g(x) = -2*0 - 1 = -1\)

And the intercept of \(p(x)\) = \(3.5\).

Since it’s hard to visually inspect if these numbers are correct, given the default scales in the plot above, we can instruct ggplot to use different minimum and maximum values for the axes, and to use smaller scale breaks (including grid lines).

library(ggplot2)
fx <- function(x) 5*x-8
gx <- function(x) -2*x-1
px <- function(x) -0.5*x+3.5
ggplot(data.frame(x=c(-10, 10)), aes(x=x)) + stat_function(fun=fx, color="red") + stat_function(fun=gx, color="blue")+ stat_function(fun=px, color="green")+ geom_hline(yintercept=0, size=2)+ geom_vline(xintercept=0, size=2)+ scale_x_continuous(breaks=c(seq(-10,+10,1)))+ scale_y_continuous(breaks=c(seq(-10,+10,2)),limits=c(-10,+10))
## Warning: Removed 81 row(s) containing missing values (geom_path).
## Warning: Removed 51 row(s) containing missing values (geom_path).

4.3.2 Exercise 2

Consider the following functions.

\(f(x) = 2x - 2\)
\(g(x) = -(2/3)*x + 2\)
\(p(x) = -x^{2} + 2\)

  1. Determine the point where \(f(x)\) and \(g(x)\) intersect
  2. Plot \(f(x)\) and \(g(x)\) in one graph
  3. Determine the points where \(p(x)\) intersects with the x axis
  4. Plot \(p(x)\) and \(g(x)\) in one graph
  5. Determine the points where \(p(x)\) and \(g(x)\) intersect

Answers:

1. (1.5; 1)
3. (-1.41; 0) and (+1.41; 0)
5. (0; 2) and (0.67; 1.56)

4.3.3 Exercise 3

Consider the following function.

\(f(x) = 2x^{2} - 8x + 11\)

  1. Plot the function
  2. Determine the intersections of this parabola with the axes
  3. Determine the minimum value of this function

Answers:

library(ggplot2)
fx <- function(x) 2*(x^2)-8*x+11
ggplot(data.frame(x=c(0, 8)), aes(x=x)) + stat_function(fun=fx, color="red") + geom_hline(yintercept=0, size=1)+ geom_vline(xintercept=0, size=1)+ scale_x_continuous(breaks=c(seq(0,8,1)))+ scale_y_continuous(breaks=c(seq(0,+12,1)),limits=c(0,+12))
## Warning: Removed 49 row(s) containing missing values (geom_path).

From the graph, it can be seen that the graph does not intersect with the x axis. The intersection with the y axis occurs when \(x=0\).

\(y = 2x^{2} - 8x + 11 = 2*0^{2} - 8*0 + 11 = 0 + 0 + 11 = 11\)

We have not formally looked at the minimum or maximum values of functions. Intuitively, you can imagine that if the function is a line with some positive or negative slope (that is, in the equation \(y = ax + b\), a≠0) then the function returns values from -∞ to +∞ (minus to plus infinity).

In data science, we use algorithms that search for minimum or maximum values in functions that are often quite complex.

A quadratic function, like here, has a maximum (or here: minimum) value.

Determining the maximum or minimum can be achieved using algebra. In particular, for quadratic functions we can use the (first) derivative, which gives the slope of the curve at each point. Maybe you happen to remember some algebra (or abracadabra, to some) from your days in high school. If you don't, do not worry, as long as you understand the intuition behind it.

Looking at the parabola, the slope changes with different values of x. Coming from the left, the slope is negative, and then increases; at \(x=2\) the slope is flat (zero), which is exactly the point where the function reaches its minimum!
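Instead of algebra, you can also let R find the minimum numerically. A minimal sketch using the base-R optimize() function:

fx <- function(x) 2*x^2 - 8*x + 11
optimize(fx, interval = c(0, 8))  # $minimum is (very close to) 2, $objective is (very close to) 3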

5. Statistical Operations

5.1 Summation

In statistics, and in data science for that matter, we usually work with data sets.

Common operations are taking the sum of a range of values in the data set, or computing their average value (which first takes the sum of n values, and then divides the sum by n).

For example, if we have the following set of seven values: 6, 7, 3, 12, 8, 5 and 0, then we can easily compute the sum, as \(6+7+3+12+8+5+0=41\).

We can refer to the first value (6) as \(x_1\), and to the third value (3) as \(x_3\). More in general, we can refer to any element of the series as \(x_i\).

For the sum of a set of values, we use the Greek capital letter sigma: \(\Sigma\)

Shorthand for \(x_1 + x_2 + \ldots + x_7\) then is: \(\sum_{i=1}^{7} x_i\)

The average or mean value of a set of values is defined as the sum of all values divided by the number of values. The mean is denoted as \(\bar{x}\): \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\)

To hone our skills in reading statistical formulas and to train our analytical thinking: the sum of deviations from the mean is, by definition, zero: \(\sum_{i=1}^{n}(x_i - \bar{x}) = 0\)

For the record, let's see if this is true.

series<-c(6,7,3,12,8,5,0)
sum(series)
## [1] 41
mean(series)
## [1] 5.857143
(seriesMinusMean <- series - mean(series))
## [1]  0.1428571  1.1428571 -2.8571429  6.1428571  2.1428571 -0.8571429 -5.8571429
round(sum(seriesMinusMean),8)
## [1] 0
round(mean(seriesMinusMean),8)
## [1] 0

The output above shows that, as we had already calculated by hand, the sum of the seven values is equal to 41. The average or mean value of \(x_i\) is \(41/7 = 5.86\), to two decimals of precision.

We have subtracted this mean from the original series, and stored the results in seriesMinusMean. The sum of that series equals zero, and so of course does the average.

This is equivalent to the following rule: \(\sum_{i=1}^{n}(x_i - c) = \sum_{i=1}^{n} x_i - n c\). With \(c = \bar{x}\), and since \(\sum_{i=1}^{n} x_i = n \bar{x}\), the result is zero.

For the sake of completeness, the rules for multiplying all values by a fixed factor and adding a constant work the same way in combination: \(\sum_{i=1}^{n}(a + b x_i) = n a + b \sum_{i=1}^{n} x_i\). This formula makes life easier in statistical programming tasks.
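A quick numerical check of this rule in R; a minimal sketch reusing the series defined above and arbitrary constants a = 3 and b = 2:

a <- 3; b <- 2
n <- length(series)
sum(a + b*series) == n*a + b*sum(series)
## [1] TRUE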

It also makes it clear that the level and scale of measurement have important consequences. Suppose that we are measuring income in dollars. However, due to inflation, dollars now are worth less than dollars some decades ago. The variation (or, in statistics, the variance), which is measured in terms of squared deviations from the mean, will widen with larger amounts of dollars.

The notations we have used above are meant to make life easier. But it seems that, sometimes, and for some of us, all they do is confuse. That's understandable. When reading through textbooks or articles that use these notations without a lot of explanation or examples, try to figure out the logic and work through a couple of small numerical examples yourself. To make things worse, it sometimes happens that we need to take the sum over several (two or more) series, using two or more indices.

In a simple example, suppose we have the grades of students for a range of subjects they studied. Computing the "average grade" over all students and subjects can be done in two steps. First, we compute the average grade of each student (averaging over all subjects). Second, we take the average of these averages, to obtain the overall average. This works well as long as all students have grades for the same set of topics, and have grades for all of these topics! What happens if student 1 didn't join the exams for 2 out of the 5 topics, due to illness? And/or what if student 2 had permission to skip one of the five topics, and was allowed to study a sixth topic offered by another school? More in general, data sets are never complete or perfect, and the analyst has to make choices, and be aware of the consequences of these choices - marginal as they may be!

In the "perfect" situation, the average grade for a group of ten students who have all done exams in five topics would be calculated as: \(\bar{x} = \frac{1}{10 \cdot 5}\sum_{i=1}^{10}\sum_{j=1}^{5} x_{ij}\), where \(x_{ij}\) is the grade of student \(i\) for topic \(j\).
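As a minimal sketch of this "perfect" situation, with made-up grades (purely illustrative) for ten students and five topics:

set.seed(123)
grades <- matrix(sample(1:10, 10*5, replace = TRUE), nrow = 10, ncol = 5)
rowMeans(grades)        # average grade per student (over the five topics)
mean(rowMeans(grades))  # average of the student averages
mean(grades)            # grand mean over all 50 grades; identical, since every student has all five grades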

5.2 Factorials and Permutations

5.2.1 Factorials

A factorial is a special kind of multiplication.

It is denoted as \(x!\)

In general, \(x! = x*(x-1)*(x-2)*...*1\), with x∊ℕ (that is, x is a positive integer or natural number: 1, 2, 3, …)

For example:

\(4! = 4*3*2*1 = 24\)

\(8! = 8*7*6*5*4*3*2*1 = ...\)

Best to use the factorial() function in R.

factorial(8)
## [1] 40320

Note that \(3! * 4! \neq 12!\)

factorial(3)
## [1] 6
factorial(4)
## [1] 24
factorial(3)*factorial(4)
## [1] 144
factorial(12)
## [1] 479001600

5.2.2 Permutations

In data science, we will frequently encounter problems related to probability, and to the likelihood and frequency of occurrences, or events.

An example is the following. Suppose you have just bought six books on data science, and you want to arrange them on your book shelf.

This is quite a demanding challenge, for any data scientist! Obviously, there are many strategies you can choose. You can do it alphabetically, by title or (first) author name; by size, color, or year of publication; or you can do it just randomly.

An interesting aspect of such strategies is our human inclination to settle on one choice out of the many possible arrangements.

In this simple example, the number of possible solutions to arrange six books on your bookshelf is no less than \(6! = 720\). Why?

The logic is that for the first book, you have six positions on the book shelf. After deciding on a spot for the first book, only five spots are open for the second book. For the last (sixth) book, you do not have a choice, really, as all spots except one are already taken. So, the number of permutations is \(6! = 6*5*4*3*2*1\).

Trial-and-error would be pretty time consuming, especially if in the process you want to evaluate, somehow, these arrangements. Human beings have learned to use intuition and strategies, to save time. Imagine what a person with 200 books has to go through!

factorial(6)
## [1] 720

But the psychological roots are of no concern to us, here.

The rules for counting the number of possibilities, or permutations, are.

There are many variations to the above challenge. Suppose we have bought four books, and our book shelf still has room for six. If we do allow space between the books, then again we have six spots available for the first book; five for the second book; four for the third book; and still three for the last book!

That is, we have \(6*5*4*3 = 360\) options.

The difference, a factor 2 (\(720/360 = 2\)), can be explained by the fact that the two empty spots cannot be told apart. Empty is empty. Before, we had two (different) books to put on the first or the second of those remaining positions; that is, the order of filling those spots mattered!

To illustrate what we mean by “the order matters”, another simple example.

Suppose we have a group of three people. Two will be selected to receive prizes, of 200 and 100 dollar. The first person selected gets the bigger prize. The order matters!

Picking 2 out of 3, when order matters, follows the rule for variations (permutations of \(k\) out of \(n\)): \(n!/(n-k)!\).

factorial(3)/factorial(3-2)
## [1] 6

Since the numbers are small, we can enumerate the permutations. If we call our three persons A, B and C, we have:

AB AC BA BC CA CB

Note that AB differs from BA, as A and B receive different prizes!
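We can also let R enumerate these ordered pairs. A minimal sketch using only base R (expand.grid() generates all pairs; we then drop the ones where the same person appears twice):

people <- c("A", "B", "C")
pairs  <- expand.grid(first = people, second = people, stringsAsFactors = FALSE)
pairs  <- pairs[pairs$first != pairs$second, ]  # order matters; a person cannot win both prizes
nrow(pairs)  # 6 ordered pairs, matching 3!/(3-2)!
## [1] 6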

5.2.3 Combinations

We can imagine a situation in which the order doesn’t matter. If we want to sample two persons out of a group of three, for an interview, then the samples AB and BA are identical (assuming it doesn’t matter who is interviewed first).

The formula for combinations is \(\binom{n}{k} = \frac{n!}{(n-k)! \, k!}\). In terms of our earlier example, let's imagine your data science teacher who has bought four copies of his favorite textbook (to lend to his students). It doesn't really matter in which order they end up on his book shelf that has a capacity of six such books.

Using R code:

spaces <- 6 # 6 spots on our book shelf
books  <- 4 # 4 identical books 
factorial(spaces)/(factorial(spaces-books)*factorial(books))
## [1] 15

Again, we can enumerate these combinations. The numbers are the shelf positions of the four books. The dots are the empty spots on the shelf.

Combination:

  1. 1234..
  2. 123.5.
  3. 123..6
  4. 12.45.
  5. 12.4.6
  6. 12..56
  7. 1.345.
  8. 1.34.6
  9. 1.3.56
  10. 1..456
  11. .2345.
  12. .234.6
  13. .23.56
  14. .2.456
  15. ..3456

You probably detect the logic. You probably also notice that this task is time consuming and error prone, even with small numbers. Having a formula to do the counting, is great!

Applying the formula to our sample of two interviewees out of a population of three persons: \(\frac{3!}{(3-2)! \, 2!} = 3\) (the pairs AB, AC and BC).
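Base R also has a built-in function for counting combinations, choose(n, k), so we do not have to spell out the factorials ourselves. A quick sketch for both examples:

choose(3, 2)  # 2 interviewees out of 3 persons
## [1] 3
choose(6, 4)  # 4 identical books on a shelf with 6 spots
## [1] 15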

5.2.4 Probabilities

One of the applications of permutations and combinations, is in probability theory.

Probabilities are commonly referred to as P.

Summing Probabilities

If we have mutually exclusive events, then we add up the probabilities of these events.

A standard 52-card pack comprises 13 ranks in each of the four French suits: clubs (♣), diamonds (♦), hearts (♥) and spades (♠). The ranks are the numbers 2 to 10; jack, queen, king, and ace.

When drawing a card from the deck, the probability that it is one of the red suits (hearts or diamonds), equals

P(hearts or diamonds) = P(hearts) + P(diamonds) = 13/52 + 13/52 = 26/52 = 1/2

since your card cannot be both hearts and diamonds!

This summation rule cannot be used if the events are not mutually exclusive.

For example, the probability of drawing an ace or clubs is not equal to

P(ace or clubs) ≠ P(ace) + P(clubs) = 4/52 + 13/52 = 17/52

This is because we would be counting the ace of clubs twice. The true probability is 16/52.

Multiplying Probabilities

The summation rule is used when one event or another event occurs.

For independent events occurring in combination (event 1 and event 2), we use the multiplication rule.

For example, if we test your knowledge of data science in 5 multiple choice questions with two possible answers (yes and no, for example) each, then the probability of answering all questions correctly by just guessing, is:

P(all 5 questions answered correctly) = \(0.5 * 0.5 * 0.5 * 0.5 * 0.5 = 0.5^{5} = 0.03125\)

The calculation is slightly more complicated if there are more combinations. For five correct answers, there’s only one combination: all answers are correct. But for four correct answers (and one incorrect answer), there are five combinations: either the first question is answered incorrectly, or the second one, or … up to the fifth one.

That is, the probability has to be multiplied by the number of combinations. Let’s assume that each question has four answers out of which only one is correct. The chance of guessing correctly, is now only 25% (0.25) rather than 50%.

How do we go about this?

The formula for guessing three out of five answers correctly, is given below.
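In binomial form, with 10 ways to pick which 3 of the 5 questions are answered correctly:

\(P(3 \text{ correct out of } 5) = \binom{5}{3} \times 0.25^{3} \times 0.75^{2} \approx 0.0879\)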

Luckily, there’s an easy function in R to do this kind of calculation: dbinom(). Between the parentheses, you enter the number of successes, the number of tries, and the probability of success for each attempt.

dbinom(3, size=5, prob=0.25) 
## [1] 0.08789063

Probabilities should add up to 1. We can get a probability distribution of the number of correctly guessed questions. With five questions, the number of successes ranges from 0 to 5. Let’s try this in R.

x <- 0:5
ps <- 0.25
n  <- 5
cumulative <- 0
for (i in x) {
  success <- dbinom(i, size=n, prob=ps) 
  cumulative <- cumulative + success
  cat("Probability of",i,"successes",round(success,4),"\n")
  cat("Cumulative probability of",i,"or less successes",round(cumulative,4),"\n","\n")
}
## Probability of 0 successes 0.2373 
## Cumulative probability of 0 or less successes 0.2373 
##  
## Probability of 1 successes 0.3955 
## Cumulative probability of 1 or less successes 0.6328 
##  
## Probability of 2 successes 0.2637 
## Cumulative probability of 2 or less successes 0.8965 
##  
## Probability of 3 successes 0.0879 
## Cumulative probability of 3 or less successes 0.9844 
##  
## Probability of 4 successes 0.0146 
## Cumulative probability of 4 or less successes 0.999 
##  
## Probability of 5 successes 0.001 
## Cumulative probability of 5 or less successes 1 
## 

Some explanation on the above.

  • First, we assign the numbers 0 to 5 to an array x
  • Next, we assign 0.25 (the probability of success) to an object ps
  • And the number of tries (5), to n
  • We define the cumulative probability (cumulative), and give it a starting value of 0
  • We then construct a so-called for loop. In this loop the dbinom() function is applied to the three parameters (number of successes; number of tries; and probability of success per try)
  • Within the loop, we print the output using the cat() function. The string "\n" jumps to the next line
  • The result success is the probability of i successes (i=0 to 5) in 5 tries.
  • The result cumulative is the cumulative probability, that is, the probability of i or fewer successes. Obviously, the cumulative probability should be 1 for 5 or fewer successes in five tries!

If your teacher tells you that you will pass the exam if at least 3 questions are answered correctly, then guessing is not a wise strategy. The probability of 3 or more correct answers is 1 minus the cumulative probability of answering fewer than three questions correctly (following the summation rule!). That is: \(1 - 0.8965 = 0.1035\) (slightly more than 10%).
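As a side note, the cumulative probabilities computed in the loop above are also available directly through base R’s pbinom() function, so the pass probability can be obtained in one line:

# Probability of at least 3 correct answers out of 5, when guessing with p = 0.25
1 - pbinom(2, size=5, prob=0.25)
## [1] 0.1035156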

6. The Language of Statistics

6.1 Introduction

In statistics and data science, we often use data sets which have a rectangular shape. A rectangle, or matrix, has two dimensions which we call rows and columns.

The rows represent cases (or persons, objects, research units). The columns provide information about these cases. Although we will encounter some other ways to store and present information, this type of data set is by far the most common.

The data are measurements of the variables that appear in the columns.

For example, the first column is a simple serial number, from 1 to 7, to identify the case.

Each case represents, for example, a transaction. The second column then indicates which of our three stores A to C is responsible for this transaction.

The variables q1 to q3 contain structured information. What do we mean by structure? In principle, q1 is measured for each and every transaction, and we don’t have variables that are unique to specific transactions.
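A minimal sketch of such a rectangular data set in R (the store labels and q values below are made up, purely for illustration):

transactions <- data.frame(
  id    = 1:7,                             # serial number identifying the case
  store = c("A","B","A","C","B","C","A"),  # store responsible for the transaction
  q1    = c(5, 3, 4, 2, 5, 1, 3),          # structured variables q1 to q3,
  q2    = c(2, 4, 4, 3, 5, 2, 4),          # measured for every transaction
  q3    = c(1, 5, 3, 4, 2, 3, 5)
)
transactions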

Unstructured information, in comparison, could be transcripts from interviews with several interviewees on a particular topic. Even if you ask the same questions, the answers can go in any direction, and what your interviewees say, how many and which words they use, will be unique to every interviewee.

6.2 Scales of Measurement

The information in your data set is based on measurements of some sort.

When measuring things, we can make use of four types of scales.

Nominal: a scale of measurement for data with meaningful labels

Example: Gender

Ordinal: a scale of measurement for nominal data with order

Example: “1st”, “2nd”, “3rd”

Interval: a scale of measurement for ordinal data with meaningful distances between values

Example: temperature in degrees Celsius

Ratio: a scale of measurement for interval data with a meaningful zero point

Example: sales in euros.

The difference between interval and ratio scales is subtle. If the outside temperature on a particular day is 20 degrees Celsius, you cannot say that it is twice as warm compared to 10 degrees Celsius. It depends on the scale used. Had we used Fahrenheit instead of Celsius, the ratio would not have been a factor of 2.

For the relationship between the two scales we can use the formula:

\(F = \frac{9}{5}C + 32\)

Below, we have written the user-defined function CtoF() for the conversion from Celsius to Fahrenheit.

We apply the formula to a range of temperatures in degrees Celsius.

CtoF <- function(c)
{
  f <- 9/5*c + 32
  cat(c,"degrees Celsius =",f,"degrees Fahrenheit\n")
}

tempC <- c(0,5,10,20,30)
for(i in tempC) {
  CtoF(i)
}
## 0 degrees Celsius = 32 degrees Fahrenheit
## 5 degrees Celsius = 41 degrees Fahrenheit
## 10 degrees Celsius = 50 degrees Fahrenheit
## 20 degrees Celsius = 68 degrees Fahrenheit
## 30 degrees Celsius = 86 degrees Fahrenheit

Indeed, you see that while it is tempting to state that 10 degrees Celsius is twice as warm as 5 degrees Celsius, the very same temperatures on the Fahrenheit scale are 50 and 41, respectively. That ratio is nowhere near a factor of 2!

Still, repeatedly adding 5 degrees Celsius, is equivalent to adding 9 degrees on the Fahrenheit scale. That is, differences (of, say, 5 degrees) on the Celsius scale correspond to differences (of 9 degrees) on the Fahrenheit scale. These scales satisfy the criteria for interval scales, but not for ratio scales.
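A quick numeric check of this point, using a small variant of the conversion that returns the Fahrenheit value rather than printing it (CtoF_value() is introduced here only for this check):

CtoF_value <- function(c) 9/5*c + 32  # returns the converted value
10 / 5                                # ratio on the Celsius scale
## [1] 2
CtoF_value(10) / CtoF_value(5)        # ratio of the same temperatures in Fahrenheit
## [1] 1.219512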

Ordinal scales are quite common. We use them frequently, in for example surveys. We can ask you the following question about the contents of this chapter.

To what extent do you agree or disagree with the following statement? This module is the best refresher course in statistics in the world

  • totally disagree
  • disagree
  • neutral
  • agree
  • totally agree
Typically, we would code the answers numerically, for instance from 1=totally disagree to 5=totally agree. This coding scheme suggests that the differences between the points on the scale are the same (equidistant), like temperatures on the Celsius or Fahrenheit scales! But this is doubtful. It may be that the threshold for respondents to totally agree or disagree is very high, while at the same time they feel comfortable avoiding the neutral category even if they only slightly agree or disagree with the statement. That is, the distances at the extreme points of the scale are larger than in the middle. A coding scheme like {1 4 5 6 9} may better reflect the true distances between the answers.

But we do not know, and therefore - for the sake of ease - we assume equidistance. This assumption allows us to compute mean scores. A mean score of 3.6 would then mean that overall the feeling is slightly positive, somewhere in between neutral and agree.

Ordinal scales only imply that a score of 4 is more, or better than a score of 3, but we do not know how much better.

Nominal scales are just labels. A question on gender can be numerically coded (1 or 2, for males and females, or the other way round) but the numerical code is just used for convenience. Taking the average (1.5, if our group has as many males as females) makes no sense.

Later on we will see that we can still use nominal scales (like group membership) when analyzing quantitative data.

6.3 Exercise

Sally has opened a pizza restaurant. She delivers pizzas in the neighbourhood. For each delivery she collects the following information:

  1. Size of the pizza (small, medium, large)
  2. Number of pizzas ordered
  3. Day of the week
  4. Distance from restaurant to delivery address
  5. Billing amount

What are the scales of measurement for items 1 to 5?

7. Working with Numbers, and Data Display

7.1 Introduction

In statistics and data science, we are dealing with lots of data. The key challenge is to briefly summarize the data. To that end, we make use of graphs and numbers.

For example, your teachers may make use of different types of assessment, to check if you have studied hard enough to master all the things they have tried to teach you. At the end of the course, you may have gone through many exams and assignments, all of them graded with a number between 1 and 10. This system will generate a lot of information. For all students in the class, we will have grades on all exams and assignments, and the data can be organized in a data set.

If we have a small class and only two subjects, then an example of a data set may look like this.

The data is organized in an Excel sheet, of which the figure above is a screenshot. Excel files can be read directly by R, but it’s more common to share data in text files in which the data are separated by commas - hence, csv (comma separated values) files.

csv files, too, can be read by R directly. The first line of a csv data file normally contains the names (labels) of the variables, just like we would do in Excel. You have to tell R that this is the case.

grades <- read.table("grades.csv", header=TRUE, sep=",")
grades

You can have a look at the data, and find an answer to many questions.

  1. How many male and female students are there in class?
  2. What is the minimum, maximum and average age?
  3. What are the minimum, maximum and average grades for the two topics?
  4. What is the average grade for each student?
  5. Is there a (positive) correlation between the grades for the two topics? That is, do students with low/high grades for subject 1, also have higher grades for subject 2?

Try to find these answers from visual inspection! You will notice two things:

  1. Firstly, some tasks are easier than others. Especially (5) is not straightforward!
  2. It is time consuming and error prone, even with this small data set.

Well, that’s why we have computers and software. We can give simple instructions, and the computer will do it faster and without errors. Try the summary() command.

summary(grades)
##        id            name              gender               age      
##  Min.   : 1.00   Length:10          Length:10          Min.   :18.0  
##  1st Qu.: 3.25   Class :character   Class :character   1st Qu.:20.0  
##  Median : 5.50   Mode  :character   Mode  :character   Median :20.0  
##  Mean   : 5.50                                         Mean   :20.3  
##  3rd Qu.: 7.75                                         3rd Qu.:21.0  
##  Max.   :10.00                                         Max.   :22.0  
##     grade_1         grade_2     
##  Min.   : 4.00   Min.   : 4.00  
##  1st Qu.: 6.25   1st Qu.: 6.25  
##  Median : 8.00   Median : 8.50  
##  Mean   : 7.50   Mean   : 7.70  
##  3rd Qu.: 9.00   3rd Qu.: 9.00  
##  Max.   :10.00   Max.   :10.00

Computers are smart. But not that smart. In the output you will find that R recognizes that name is a string variable, containing text. You cannot take an average of text, or apply most other statistical operations to it. On the other hand, since we added id as a numerical variable or code, R blindly computes its average value.

For age and grade, statistical information is available. The summary() command returns the minimum and maximum values; the median and the mean, among other things. No student has a grade below 4 on either of the subjects. The mean grade for subject 2 is slightly higher than for subject 1. And so on.
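Question 5 in the list above, the correlation between the two grades, is not part of the summary() output. The cor() function computes it; no output is shown here, since the value depends on the grades.csv data:

# Correlation between the grades for the two subjects (question 5 above)
cor(grades$grade_1, grades$grade_2)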

The grades (stored in grade) follow a so-called distribution, in the range from 4 to 10. While the mean grade is an interesting statistic, we are also interested in the distribution per se. Are the grades evenly distributed, or more bell shaped? We can plot a histogram of the data.

While the summary() command can be applied to the data set grades and all variables that are part of it, histograms are plotted for a single variable. Variable grade_1 has to be referred to as grades$grade_1. Just grade_1 does not suffice, since in principle it can be an object by itself!

You have control over the appearance of the histogram. Below, we set the number of breaks and colored the bars red. When these options are omitted, the hist() command uses default values.

hist(grades$grade_1, breaks=10, col="red")

Again, ggplot produces nicer (and publication quality) graphs. To the basic part in which we tell R that grade_1 is the x-axis, we add layers (“geoms”).

Note that in the basic part we first have to tell R which data set to use, and R then knows that grade_1 is a variable in that data set!

It is common to play around with settings like binwidth and colours and sizes, until you are happy with the result. We also added a dashed red line representing the mean of grade_1.

library(ggplot2)
ggplot(grades, aes(x=grade_1)) +
    geom_histogram(binwidth=1, colour="black", fill="white") +
    geom_vline(aes(xintercept=mean(grade_1, na.rm=T)), color="red", 
    linetype="dashed", size=1)    

7.2 Exercises

7.2.1 Exercise 1

Generate 1,000 random numbers between 0 and 10.

Summarize the data set of random numbers.

  • Is the mean value as expected?
  • Is the median value higher or lower than the mean?
  • Are the minimum and maximum values exactly 0 and 10?
  • Make a histogram that shows how evenly or unevenly the numbers are distributed.

# Generate 1,000 random numbers between 0 and 10
set.seed(12321)
e1 <- runif(1000, min=0, max=10)
cbind(summary(e1))
##                [,1]
## Min.    0.002596928
## 1st Qu. 2.605566192
## Median  5.105626716
## Mean    5.170922919
## 3rd Qu. 7.757992846
## Max.    9.971542840

As you can see, the mean of the set of 1,000 random numbers is higher than 5.

Should we conclude that the random number generator is not performing as expected? To check this, we can use a t-test. Since the random numbers are in the range from 0 to 10, the expected mean is 5 under the assumption that the data generating process is indeed random.

# one sample t-test
t.test(e1,mu=5) 
## 
##  One Sample t-test
## 
## data:  e1
## t = 1.8685, df = 999, p-value = 0.06198
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  4.991420 5.350426
## sample estimates:
## mean of x 
##  5.170923

Usually, we test with 95% confidence. If the probability of the outcome in our sample is less than 5%, then we would reject the hypothesis that the \(mean=5\).

Our test reveals that the probability is quite low (p=0.062) but higher than 0.05 (5%). So, we do not reject the hypothesis that the \(mean=5\).

However, you have to keep in mind that the numbers are drawn randomly, and in 5 out of 100 cases you are bound to find significant differences!

We used set.seed(12321) to generate a specific set of 1,000 random numbers. Try other starting numbers than 12321 and see if you stumble upon a random series with a mean significantly different from 5!

Let’s make a histogram of the data.

We first have to convert the vector e1 into a data frame.

e1df <- as.data.frame(e1)
library(ggplot2)
ggplot(e1df, aes(x=e1)) +
  geom_histogram(binwidth=1, colour="green", fill="yellow") +
  geom_vline(aes(xintercept=mean(e1, na.rm=T)), color="red", 
             linetype="dashed", size=1)   

Owh, computers 😈 …

There’s something weird with this graph. Using the default settings when using the binwidth option, the brackets will range from -0.5 to .5; .5 to 1.5; up to 9.5 to 10.5. But since values lower than zero or higher than 10 cannot occur, the first and last brackets contain approximately half of the number of observations in the other brackets. Moreover, there are 11 brackets, rather than 10.

A wise lesson: check, and double-check.

The easy way out, is to use breaks, rather than binwidth. As below. Let’s also get rid of these eye-blinding colors.

ggplot(e1df, aes(x=e1)) +
  geom_histogram(breaks=seq(0, 10, by=1), colour="black", fill="white") +
  geom_vline(aes(xintercept=mean(e1, na.rm=T)), color="red", 
             linetype="dashed", size=1) 

Well, not entirely flat. But pretty even.

7.2.2 Exercise 2: Let’s Roll a Die

The numbers in the previous exercise are continuous numbers, in the range from 0 to 10. Any number is unique.

The outcome of a throw with a die, in contrast, is an integer number, from 1 to 6.

Let’s roll a die 20 times, and see if the die is a fair die.

Since the runif() function with min=1 and max=7 produces continuous numbers from 1.00000… to 6.999999…, the floor() of these numbers will be integers from 1 to 6 (that is: 1, 2, 3, 4, 5 or 6).

We convert the vector of 20 integers to a data frame. We summarize the data. And then produce a bar chart using the geom_bar() in ggplot.

set.seed(987)
# Get 20 integers from 1 to 6
# Use max=7 because it will never actually equal 7
roll <- floor(runif(20, min=1, max=7))
summary(roll)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.00    3.45    5.00    6.00
rolldf <- as.data.frame(roll)
ggplot(rolldf, aes(x=roll)) +
  geom_bar(stat="count")

Pretty good. We are lucky to throw few 1’s. But more 2’s and 3’s than expected. Try some other starting numbers in set.seed() to see how often you are really (un)lucky.

The sample mean is very close to the theoretical mean (\((1+6)/2=3.5\)).

The bar chart is quite OK. But not perfect. The labels for 1, 3 and 5 are not shown. It is often better to define the variable to be counted as a factor, using the as.factor() function.

Let’s do so, and look at the result.

rolldf$rollfac <- as.factor(rolldf$roll)
ggplot(rolldf, aes(x=rollfac)) +
  geom_bar(stat="count")
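As an aside, base R’s sample() function draws the integers directly, which is arguably a more natural way to simulate dice rolls than flooring runif() output:

set.seed(987)
roll2 <- sample(1:6, 20, replace=TRUE)  # 20 rolls of a fair die
table(roll2)                            # frequency of each outcome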

8. The Normal Distribution

8.1 Understanding the Normal Distribution

The normal distribution is the most common type of distribution assumed in statistical analyses.

The normal distribution has two parameters: the mean and the standard deviation. The standard normal distribution is the special case with a mean of 0 and a standard deviation of 1.

For a normal distribution:

  • 68% of the observations are within ± one standard deviation of the mean
  • 95% are within ± two standard deviations, and
  • 99.7% are within ± three standard deviations.
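These percentages can be verified with pnorm(), the cumulative distribution function of the (standard) normal distribution:

pnorm(1) - pnorm(-1)  # within one standard deviation of the mean
## [1] 0.6826895
pnorm(2) - pnorm(-2)  # within two standard deviations
## [1] 0.9544997
pnorm(3) - pnorm(-3)  # within three standard deviations
## [1] 0.9973002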

The normal distribution model is motivated by the Central Limit Theorem which states that averages calculated from independent, identically distributed random variables have approximately normal distributions, regardless of the type of distribution from which the variables are sampled.

8.2 Skewness and Kurtosis

Real life data rarely, if ever, follow a perfect normal distribution.

The skewness and kurtosis coefficients measure how different a given distribution is from a normal distribution.

Skewness measures the symmetry of a distribution. The normal distribution is symmetric and has a skewness of zero. If the distribution of a data set has a skewness less than zero, or negative skewness, then the left tail of the distribution is longer than the right tail; positive skewness implies that the right tail of the distribution is longer than the left.

Kurtosis measures the thickness of the tail ends of a distribution in relation to the tails of the normal distribution. Distributions with large kurtosis exhibit tail data exceeding the tails of the normal distribution. Distributions with low kurtosis exhibit tail data that is generally less extreme than the tails of the normal distribution.
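Both coefficients are easy to compute in R. One option (among several) is the moments package, which provides skewness() and kurtosis(); this is just a sketch, as the package is not used elsewhere in this module:

# install.packages("moments")
library(moments)
set.seed(123)
x <- rnorm(1000)  # data drawn from a normal distribution
skewness(x)       # close to 0 for (approximately) normal data
kurtosis(x)       # close to 3 (moments reports raw, not excess, kurtosis)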

Surprisingly many things in real life closely follow a normal distribution:

  • Height of people
  • Blood pressure
  • IQs

One website mentioned salaries as another example of normally distributed variables. However, that is doubtful. Maybe it holds true in some economies, but not in all economies and not worldwide. One (theoretical) issue is that normal distributions are continuous distributions ranging from -∞ to +∞. Well, salaries don’t. I guess it’s fiscally possible to have negative incomes, but if your employer offers you -10 dollars per hour, chances are you prefer staying home and “earn” 10 dollars more by doing nothing. That is, on the left of the distribution, salaries are cut off at zero, while on the right some people get incredibly high salaries. In other words, the distribution is not symmetric, but right-skewed.

You can find out more about the normal distribution here, and in the video below.

8.3 Exercise & Illustration

In the file income3.dta (a STATA file), we have collected the incomes of 1,000 persons in three countries (a, b and c).

We want to look at the income distributions, and see if they are normal.

You can download the file from here

We first read the data, using the read.dta() function from the foreign package.

library(foreign)
library(ggplot2)
income <- read.dta("income_12.dta")
summary(income)
##        a                b                 c        
##  Min.   :  4580   Min.   :  424.2   Min.   :    0  
##  1st Qu.: 49874   1st Qu.: 9211.1   1st Qu.: 9874  
##  Median : 59395   Median :16723.8   Median :19395  
##  Mean   : 60032   Mean   :20000.0   Mean   :20623  
##  3rd Qu.: 69880   3rd Qu.:26781.6   3rd Qu.:29880  
##  Max.   :125529   Max.   :99220.2   Max.   :85529
ggplot(data = income) + geom_histogram(mapping = aes(x = a, y=..density..), 
  fill="steelblue", colour="black", binwidth = 1000) +
  ggtitle("Income Distribution Country A") + xlab("Income")

The income distribution looks pretty normal, and at least it’s symmetric.

To see the difference between the actual distribution and a theoretical normal distribution with the same mean and standard deviation, we can add the latter to our graph.

ggplot(data = income) + geom_histogram(mapping = aes(x = a, y=..density..), 
  fill="steelblue", colour="black", binwidth = 1000) +
  ggtitle("Income Distribution Country A") + xlab("Income") +
  stat_function(fun = dnorm, args = list(mean = mean(income$a), sd = sd(income$a))) 

Our hopes are confirmed: although some brackets fall below the normal distribution curve and some above, by and large the distributions are similar. There is, at least, no systematic difference. There are some pretty high peaks in the middle range. This also depends on the binwidth: the narrower the bins, the more peaks and troughs we can expect.

For illustration, let’s double the binwidth.

ggplot(data = income) + geom_histogram(mapping = aes(x = a, y=..density..), 
  fill="steelblue", colour="black", binwidth = 2000) +
  ggtitle("Income Distribution Country A") + xlab("Income") +
  stat_function(fun = dnorm, args = list(mean = mean(income$a), sd = sd(income$a))) 

In statistical modeling, many techniques assume normal data. Of course, we can say that the data look normal, but the favorite hobby of statisticians is to test whether the assumption holds.

There are several tests for normality.

The Kolmogorov-Smirnov test and the Shapiro-Wilk test are the most widely used methods to test the normality of the data.

Let’s use the latter one.

shapiro.test(income$a)
## 
##  Shapiro-Wilk normality test
## 
## data:  income$a
## W = 0.99664, p-value = 0.03135

The p-value of the test statistic is smaller than 0.05, which means that we have to reject the null hypothesis that the incomes are normally distributed. This is all the more surprising because the data were generated using a function that draws randomly from a normal distribution!

The thing is, with (very) large samples, even small deviations from the normal distribution are significant. Or, to put it differently, the test is sensitive to sample size. In the good old days, this was not a big problem, since samples tended to be small. But in the era of digitization and big data, normality tests are losing their meaning. Nothing is normal, any more.

Well, speaking of the new normal …

So, we can see what happens if the sample is smaller. We pick a random subsample of 100 persons.

income_sample <- sample(income$a, 100)
income_sample <- as.data.frame(income_sample)
shapiro.test(income_sample$income_sample)
## 
##  Shapiro-Wilk normality test
## 
## data:  income_sample$income_sample
## W = 0.98764, p-value = 0.4819

OK, that’s country A.

Let’s do the same for country B.

ggplot(data = income) + geom_histogram(mapping = aes(x = b, y=..density..), 
  fill="steelblue", colour="black", binwidth = 2000) +
  ggtitle("Income Distribution Country B") + xlab("Income") +
  stat_function(fun = dnorm, args = list(mean = mean(income$b), sd = sd(income$b))) 

The income distribution is (as is the case in many countries), right skewed. There are some people earning loads of money, like successful entrepreneurs. Since income is seldom negative, a long tail does not appear on the left hand part of the distribution, and a lot of people are clustered at incomes lower than the mean.

Just out of curiosity, let’s check what the normality tests tell us. We would expect p-values close to zero, suggesting that the normal distribution hypothesis is to be rejected. Rejection is normal (oops, wrong word) for large samples, so let’s see what happens for a subsample.

shapiro.test(income$b)
## 
##  Shapiro-Wilk normality test
## 
## data:  income$b
## W = 0.86683, p-value < 2.2e-16
income_sample_b <- sample(income$b, 100)
income_sample_b <- as.data.frame(income_sample_b)
shapiro.test(income_sample_b$income_sample_b)
## 
##  Shapiro-Wilk normality test
## 
## data:  income_sample_b$income_sample_b
## W = 0.86053, p-value = 2.964e-08

Definitely not normal.

Now that you know the drill, try to do the same thing for country C.

What is your conclusion from the graph?

And what do the tests tell you?

ggplot(data = income) + geom_histogram(mapping = aes(x = c, y=..density..), 
  fill="green", colour="black", binwidth = 2000) +
  ggtitle("Income Distribution Country C") + xlab("Income") +
  stat_function(fun = dnorm, args = list(mean = mean(income$c), sd = sd(income$c))) 

shapiro.test(income$c)
## 
##  Shapiro-Wilk normality test
## 
## data:  income$c
## W = 0.96604, p-value = 1.619e-14
income_sample_c <- sample(income$c, 100)
income_sample_c <- as.data.frame(income_sample_c)
shapiro.test(income_sample_c$income_sample_c)
## 
##  Shapiro-Wilk normality test
## 
## data:  income_sample_c$income_sample_c
## W = 0.95027, p-value = 0.0008641

9. Central Tendency

9.1 Introduction

Above we said that the theoretical normal distribution is characterized by various parameters. In actual, or empirical, distributions we compute statistics. For the normal distribution, it suffices to know the mean and the standard deviation. If we know these two, then we know the exact shape of the distribution, and implicitly also other parameters like skewness and kurtosis.

We will focus on the mean and the standard deviation.

9.2 Mean (or Average)

The mean (or average) is a measure of central tendency. Most of us know, at least intuitively, what the mean stands for. Technically, the mean is the sum of all the observations divided by the number of observations. In other words, the mean represents the center of the data.

Although we seldom think about it too deeply, using the mean has some drawbacks. One is that it is sensitive to outliers. Suppose that we measure the incomes of two groups of people. The two groups happen to be identical, except for the one person with the highest income. An outlier lifts the mean of the whole group, and often substantially.

group1 <- c(10,8,12,15,7,8,11,10,14,13)
group2 <- c(10,8,12,15,7,8,11,10,14,130)
(mean(group1))
## [1] 10.8
(mean(group2))
## [1] 22.5

You notice that the means differ strongly! In the second group there is only 1 person out of 10 that earns above average; the vast majority earns much less.

9.3 Median

As an alternative, you can make use of the median as a measure of central tendency. The median is simply the middle value in the distribution after sorting them from low to high. The median is less sensitive to (or robust to) extreme values at the end of the distribution. For incomes and salaries, it is usually more informative to use the median, as it tells us more about the center of the group!


We can ask R to do the sorting:

(sort(group2))
##  [1]   7   8   8  10  10  11  12  14  15 130

A slight complication is that, with an even number of observations, there is no observation in the middle. Here, with 10 observations, we have five observations at the bottom and five at the top, and the (imaginary) middle observation would be somewhere in between the fifth and the sixth observation. In case of an even number of observations, we then compute the median as:

Median = (5th observation + 6th observation)/2

In our example:

Sorted data: 7 8 8 10 10 / 11 12 14 15 130

Median = (10 + 11) / 2 = 10.5

R will do the job for you, with the median() function.

(median(group2))
## [1] 10.5

9.4 Mode

For the sake of completeness, we mention a third measure of central tendency: the mode.

The mode is the most common value in the data set. In our example, there are actually two modes, 8 and 10. Both values occur twice.

You can verify this with the table() command, which lists all possible values and their frequency in the data set. We apply cbind() to the output of table, to show the values and their frequencies in columns rather than rows.

(cbind(table(group2)))
##     [,1]
## 7      1
## 8      2
## 10     2
## 11     1
## 12     1
## 14     1
## 15     1
## 130    1

You can imagine that for many continuous variables, most values are unique. For example, net salaries in the Netherlands are pretty unique, as they are derived from gross salaries minus taxes, contributions to pension schemes, individual allowances, and so on and so forth. Since contributions to pension schemes depend on individual traits (age), and taxes depend on household situation and other sources of income, among others, the probability of two persons earning the exact same income is close to zero. Each net salary is unique, and there are as many modes as there are individuals and net salaries! Not very helpful.

A way out of this problem is to define income or salary brackets, like 1,000 to 2,000; 2,000 to 3,000 Euro, and so on. A bracket then contains all unique salaries in a range of salaries.
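A sketch of how such brackets can be created in R with the cut() function, using a handful of made-up net salaries:

# Hypothetical net salaries, for illustration only
salaries <- c(1540, 1735, 1980, 2210, 2450, 2660, 2815, 3120)
brackets <- cut(salaries, breaks=seq(1000, 4000, by=1000), dig.lab=4)
table(brackets)  # 3 salaries in (1000,2000], 4 in (2000,3000], 1 in (3000,4000]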

Distributions can be displayed using histograms. The hist() function, by default, selects the number of brackets, and the width of brackets. It is generally better to override the default values, for the best presentation of the data. But to give you an idea, below is the histogram of the group 1 data.

Note that the definition of the brackets makes a big difference! The brackets are 6-8; 8-10; and so on. But since we do have values of 8 and 10, on the dividing line between these brackets, it is crucially important to know whether, for example, 8 is in the first or second bracket!

For continuous variables, in contrast to the seemingly discrete numbers in group 1, this is not a big concern.

hist(group1)

10. Variability

10.1 Introduction

Now that we know how to identify the middle of the distribution, the next step is to identify the width of it.

10.2 Range

One way to identify the width of the distribution, is to find the lowest and highest values in the distribution (the minimum and the maximum). The range then is simply computed as \(maximum - minimum\).

(min1 <- min(group1))
## [1] 7
(max1 <- max(group1))
## [1] 15
cat("The range of group 1 incomes is", max1 - min1)
## The range of group 1 incomes is 8
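Base R also offers the range() function, which returns the minimum and maximum in one call; combined with diff(), it gives the range as defined above:

range(group1)        # minimum and maximum
## [1]  7 15
diff(range(group1))  # maximum - minimum
## [1] 8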

It is obvious that the range is highly sensitive to outliers. In group 2, containing one person with a very high income, the range is dominated by this one outlier.

(min2 <- min(group2))
## [1] 7
(max2 <- max(group2))
## [1] 130
cat("The range of group 1 incomes is", max2 - min2)
## The range of group 1 incomes is 123

The range is seldom used, in statistics and data science, for analytic purposes. However, minimum, maximum and range are often used in the early stages of the data analysis process, especially when checking the validity of the data. For example, if your school registers grades from 1 to 10 (as in the Dutch system), then analyzing all grades for whatever purpose starts with checking that all grades have a \(minimum \geq 1\) and a \(maximum \leq 10\). If not, then something is wrong (for example, missing grades coded 99 are included in the data).

10.3 Deviation

A lot of analyses in statistics and data science are based on the principle of differences, deviations and distances. You will encounter these terms time and again throughout the program!

Taking our group 1 incomes, we can say something about all 10 group members, in terms of their differences to one another, or in terms of their deviation from the group average.

We have already calculated the mean income for group 1. It is then easy to calculate how far all members are from the mean of the group.

For example, we now see that the first person who earns 10, is 0.8 below the mean of the group.

(mean1 <- mean(group1))
## [1] 10.8
(group1dev <- group1 - mean1)
##  [1] -0.8 -2.8  1.2  4.2 -3.8 -2.8  0.2 -0.8  3.2  2.2

But our interest is not in the individual deviations; we want a single measure of how far, overall, the observations are removed from the mean. It is tempting then to take the sum of all deviations as such a measure. But there’s a problem here!

Using the sum() function to take the sum of the differences, stored in group1dev, we see that the outcome is zero. That is, the positive and negative deviations cancel out! By definition, the mean is the centre of the data, and the sum of deviations from the mean will cancel out!

Note that we use the round() function. We do so to prevent R displaying a veeeeeery small number, something ending in “e-15”, suggesting that the first part of the expression has to be multiplied by \(10^{-15}\), or 0.000000000000001.

round(sum(group1dev),4)
## [1] 0

In order to prevent negative and positive deviations cancelling each other out, we can reason as follows. A deviation, positive or negative, is a measure of the variability. So rather than taking the sum of the deviations, it makes sense to take the sum of the absolute deviations.

The absolute deviation is just the deviation without the sign (plus or minus). For example: \(|-0.8| = 0.8\), and \(|+1.2| = 1.2\). The “\(|x|\)” means “the absolute value of \(x\)”. Absolute values are non-negative, by definition.

Since the absolute deviations are always positive, their sum is a measure of the variability of the data.

Let’s see how it pans out. Given the outlier we expect this measure to be substantially bigger in group 2 as compared to group 1.

For group 1 we have:

(mean1 <- mean(group1))
## [1] 10.8
group1dev    <- group1 - mean1
group1absdev <- abs(group1 - mean1)
(cbind(group1dev,group1absdev))
##       group1dev group1absdev
##  [1,]      -0.8          0.8
##  [2,]      -2.8          2.8
##  [3,]       1.2          1.2
##  [4,]       4.2          4.2
##  [5,]      -3.8          3.8
##  [6,]      -2.8          2.8
##  [7,]       0.2          0.2
##  [8,]      -0.8          0.8
##  [9,]       3.2          3.2
## [10,]       2.2          2.2
sum(group1dev)
## [1] -7.105427e-15
sum(group1absdev)
## [1] 22

For group 2 we have:

(mean2 <- mean(group2))
## [1] 22.5
group2dev    <- group2 - mean2
group2absdev <- abs(group2 - mean2)
(cbind(group2dev,group2absdev))
##       group2dev group2absdev
##  [1,]     -12.5         12.5
##  [2,]     -14.5         14.5
##  [3,]     -10.5         10.5
##  [4,]      -7.5          7.5
##  [5,]     -15.5         15.5
##  [6,]     -14.5         14.5
##  [7,]     -11.5         11.5
##  [8,]     -12.5         12.5
##  [9,]      -8.5          8.5
## [10,]     107.5        107.5
sum(group2dev)
## [1] 0
sum(group2absdev)
## [1] 215

Since the sum of absolute deviations will go up with the number of observations, a measure to compare groups of different sizes would be to divide the sum by the number of observations, to obtain the mean absolute deviation.
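In R, that is simply the mean of the absolute deviations we computed above:

mean(group1absdev)  # 22 / 10
## [1] 2.2
mean(group2absdev)  # 215 / 10
## [1] 21.5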

10.4 Standard Deviation

For a variety of reasons, the (mean) absolute deviation is not used often.

An alternative way to deal with the challenge of negative deviations is to square them. As we have learned, the product of two negative numbers gives a positive number:

\(-a * -a = a^{2}\)

\(-0.2 * -0.2 = 0.04\)

\(-0.8 * -0.8 = 0.64\)

Note that the effect of squaring is that larger deviations get a bigger weight. While (absolute) deviations of 0.2 and 0.8 differ by a factor 4, their squared counterparts differ by a factor of 16! That is, squared deviations are more sensitive to observations at the extreme ends of the distribution.

The procedure for establishing a (mean) squared deviation is similar to what we did for absolute deviations.

Let’s do this step-by-step, to familiarize yourself with the concept of the standard deviation - which is vital to many of the techniques you will encounter in statistics and data science.

(mean1 <- mean(group1))
## [1] 10.8
group1dev    <- group1 - mean1
group1sqdev  <- group1dev * group1dev
(cbind(group1dev,group1sqdev))
##       group1dev group1sqdev
##  [1,]      -0.8        0.64
##  [2,]      -2.8        7.84
##  [3,]       1.2        1.44
##  [4,]       4.2       17.64
##  [5,]      -3.8       14.44
##  [6,]      -2.8        7.84
##  [7,]       0.2        0.04
##  [8,]      -0.8        0.64
##  [9,]       3.2       10.24
## [10,]       2.2        4.84
sum(group1sqdev)
## [1] 65.6

The sum of squared deviations will increase with the number of observations. Therefore, it is more meaningful to take the average, which allows comparing this statistic across groups (or data sets) of uneven size.

The result, the mean of the squared deviations, is what we call the variance.

varg1 <- sum(group1sqdev)/length(group1)
cat("The variance of incomes in group 1 is",varg1)
## The variance of incomes in group 1 is 6.56

A disadvantage of the variance measure is that, by squaring the deviations, the result is in squared units (dollars squared, if income is measured in dollars). In order to get the result in the same units, we take the square root.

Variance = \(\sigma^2\) = Sum of Squared Deviations / n

Standard deviation = \(\sigma = \sqrt{Variance}\)

As you would expect, R has functions for variance and sd: var(), and sd().

var(group1)
## [1] 7.288889
sd(group1)
## [1] 2.699794

Oops!!

The var() gives a different result from ours. How come?

It is typically not possible to measure entire populations. Researchers use samples from populations, in order to estimate the parameters (like the mean and variance) in the population. Smart statisticians have found that variances based on samples underestimate the population variance. This is called bias.

Luckily, there’s a simple way to set things right, and get an unbiased estimate of the population variance. Rather than dividing by the size of the sample (n), we divide by n-1. In large samples, this correction is negligible, as \(n \approx n-1\), but in our small set of incomes of 10 group members, it makes a difference.

The functions var() and sd() use the correction for bias, assuming (rightly or wrongly) that our data are sample data. Since, in data science, we are relying on large volumes of data (say, big data) we don’t have to worry about it. But for the sake of completeness, let’s make an amendment to our own earlier calculations.

varg1 <- sum(group1sqdev)/(length(group1)-1)
cat("The variance of incomes in group 1 is",varg1,"\nAnd the standard deviation is", sqrt(varg1))
## The variance of incomes in group 1 is 7.288889 
## And the standard deviation is 2.699794

10.5 Exercises

10.5.1 Exercise 1

Below are 7 scores from employee performance appraisals of a company. The 7 scores can be considered a sample from the large population of employees of the company.

Compute, for each of the four aspects of employee performance:

  • The mean and the median
  • The minimum, maximum and range
  • The variance and standard deviation

Solution

It is always handy to put these data in a data frame.

jk <- c(2,3,4,5,7,7,3)
cs <- c(3,2,4,4,6,6,7)
at <- c(1,4,4,6,5,6,2)
di <- c(3,6,3,7,7,7,1)
appraisal <- data.frame(jk,cs,at,di); appraisal

It is, of course, perfectly fine to replicate some of the R-code we have used above to illustrate the concepts. Actually, we would encourage you to do so, for two reasons. Firstly, you will familiarize yourself with using R-code, and secondly, doing it step-by-step gives you a deeper understanding of what R is doing behind the scenes.

But once you know what you are doing, and fully understand the various measures of central tendency and variability, it is better to find a package that does all of the work for you. A great source for popular packages is Quick-R. Here you can find, among many other things, packages for descriptive statistics. One of these packages is pastecs, which has many functions including stat.desc().

You have to install pastecs before using it. After installing it, you can invoke its functions by running the library(pastecs) command line. pastecs has options that are TRUE by default. See for yourself what happens if you switch them off, by setting them to FALSE!

Again, we use the round() functions, to create reader-friendly output (2 decimals, rather than the default of many decimals).

Note that we have used T, as short for TRUE; likewise, you can use F for FALSE. These abbreviations are regarded as bad practice in programming, as T and F can be used as object names, which may confuse R.

# install.packages("pastecs")
library(pastecs)
round(stat.desc(appraisal, basic=T,desc=T),2)

Guiding you through the output:

  • nbr.val is short for the number of values (7 valid observations, for all 4 variables)
  • nbr.null and nbr.na are the numbers of undefined and not available values. Most often, these are referred to as missing values. A null is a value that just isn’t there. A missing value represents a value that for some reason we did not record. The difference between the two is subtle, and in almost all projects you will encounter na’s but no null’s.
  • min and max are the lowest and highest values for each of the four variables. They are all in the range from 1 to 7, which is OK as the appraisal employs a 7-point scale
  • The range is simply \(max - min\)
  • The sum is the sum of the values. As the sum is highest for discipline (34), the employees score better on that aspect than on the other three. Apart from such comparisons, the sum itself has little meaning here; adding individual incomes into a group income, in contrast, would make sense.
  • The median and mean are the same values that would result from using mean() and median(), for all the variables separately. Applying the stat.desc() function to a data frame, is considerably quicker!

Note that, as measures of central tendency, the median and mean are quite similar for three of the four aspects (or variables). However, as we have argued before, the median is not sensitive to outliers. In discipline there’s an outlier (1) at the bottom of the distribution. (More than) half of the values are 6 or higher, implying a median of 6. The mean is sensitive to the outlying value of 1, and brings down the group mean to 4.86. In general, \(median<mean\) is typical for right-skewed distributions, while \(median>mean\) (like here) indicates a left-skewed distribution, with some pretty low values (potentially outliers). The definition and detection of outliers is a topic by itself, to which we will turn as the need arises.

  • We will not discuss SE.mean and CI.mean, here.
  • var and st.dev give the same results as var() and sd()

10.5.2 Exercise 2

Employee gender, employee salary, sales by employee are provided below, for company XYZ. In addition, the company uses productivity ratios (sales divided by salary) which still have to be computed from the available data.

Compute all appropriate measures of central tendency and variability.
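Since the actual XYZ data are not reproduced here, below is only a sketch of the approach, with made-up values for gender, salary and sales. The idea is to compute the productivity ratio first, then apply stat.desc() to the numeric variables and table() to the nominal one:

# Hypothetical data, for illustration only
gender <- c("m","f","f","m","f","m","f")
salary <- c(2400, 2600, 3100, 2900, 2500, 2700, 3300)
sales  <- c(31000, 45000, 52000, 39000, 43000, 36000, 58000)
xyz    <- data.frame(gender, salary, sales)
xyz$productivity <- xyz$sales / xyz$salary   # ratio of sales to salary

library(pastecs)
round(stat.desc(xyz[, c("salary","sales","productivity")], basic=T, desc=T), 2)
table(xyz$gender)   # gender is nominal: report frequencies, not a mean or variance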