Probability in R

Sample Spaces

For a random experiment E, the set of all possible outcomes of E is called the sample space and is denoted by the letter S. For the coin-toss experiment, S would be the results “Head” and “Tail”, which we may represent by S = {H, T}. Formally, the performance of a random experiment is the unpredictable selection of an outcome in S. In this project we will make use of the prob package.

Example

Consider the random experiment of dropping a Styrofoam cup onto the floor from a height of four feet. The cup hits the ground and eventually comes to rest. It could land upside down, right side up, or it could land on its side. We represent these possible outcomes of the random experiment by the following.

s<- data.frame(land = c("down","up","side"))
s

##   land
## 1 down
## 2   up
## 3 side

The sample space s contains the column land, which stores the outcomes “down”, “up”, and “side”.

Consider the random experiment of tossing a coin. The outcomes are H and T. We can set up the sample space quickly with the tosscoin function by calling tosscoin(1). The number 1 tells tosscoin that we want to toss the coin only once. We could also toss it three times:

tosscoin(3)

##   toss1 toss2 toss3
## 1     H     H     H
## 2     T     H     H
## 3     H     T     H
## 4     T     T     H
## 5     H     H     T
## 6     T     H     T
## 7     H     T     T
## 8     T     T     T
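To see what such a sample space amounts to, the three-toss table can be sketched in base R with expand.grid. This is only an illustration that reproduces the row ordering shown above (first toss varying fastest); it is not the prob package's own implementation, and the names outcomes and three_tosses are made up for the example.

```r
# Base R sketch of the sample space for three coin tosses.
# expand.grid() varies its first argument fastest, as in the table above.
outcomes <- c("H", "T")
three_tosses <- expand.grid(toss1 = outcomes,
                            toss2 = outcomes,
                            toss3 = outcomes,
                            stringsAsFactors = FALSE)

nrow(three_tosses)   # 2^3 = 8 outcomes
three_tosses[2, ]    # second row: T H H
```

Each additional toss doubles the number of rows, so the size of the space grows as 2^n.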

Alternatively we could roll a fair die:

rolldie(1)

##   X1
## 1  1
## 2  2
## 3  3
## 4  4
## 5  5
## 6  6

The rolldie function defaults to a 6-sided die, but we can specify others with the nsides argument. The command rolldie(3, nsides = 4) would be used to roll a 4-sided die three times.

Or perhaps we would like to draw one card from a standard deck of playing cards:

head(cards())

##   rank suit
## 1    2 Club
## 2    3 Club
## 3    4 Club
## 4    5 Club
## 5    6 Club
## 6    7 Club

The cards function that we just used has optional arguments jokers (if you would like Jokers to be in the deck) and makespace, which we will discuss later.
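Judging from the probs column that appears in outputs later in this document, makespace = TRUE attaches an equal probability to every row. A minimal base R sketch of that idea for a 52-card deck follows. The names ranks, suits, and deck are hypothetical, and the code illustrates the equally likely model rather than the package internals.

```r
# Build a 52-card deck and attach equally likely probabilities.
ranks <- c(2:10, "J", "Q", "K", "A")              # coerced to character
suits <- c("Club", "Diamond", "Heart", "Spade")
deck <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

# Every card gets probability 1/52 under the equally likely model.
deck$probs <- rep(1 / nrow(deck), nrow(deck))

nrow(deck)        # 52
sum(deck$probs)   # the probabilities add up to 1
head(deck, 3)
```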

Events

An event A is merely a collection of outcomes, or in other words, a subset of the sample space. After the performance of a random experiment E we say that the event A occurred if the experiment’s outcome belongs to A. We say that events A1, A2, A3, . . . are mutually exclusive or disjoint if Ai ∩ Aj = ∅ for any distinct pair Ai, Aj. For instance, in the coin-toss experiment the events A = {Heads} and B = {Tails} would be mutually exclusive.
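The disjointness condition can be checked mechanically: two events are mutually exclusive when their intersection is empty. The sketch below uses a single roll of a six-sided die as an assumed example (it is not taken from the text above), and is_disjoint is a hypothetical helper written for this illustration.

```r
# Events as sets of outcomes for one roll of a six-sided die.
even <- c(2, 4, 6)
odd  <- c(1, 3, 5)
high <- c(5, 6)

# Two events are disjoint when they share no outcomes.
is_disjoint <- function(a, b) length(intersect(a, b)) == 0

is_disjoint(even, odd)    # TRUE: no outcome is both even and odd
is_disjoint(even, high)   # FALSE: both events contain 6
```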

Given a data frame sample/probability space s, we may extract rows using the [] operator:

s<-tosscoin(2, makespace = TRUE)

s[c(2,4),]

##   toss1 toss2 probs
## 2     T     H  0.25
## 4     T     T  0.25

s[1:3,]

##   toss1 toss2 probs
## 1     H     H  0.25
## 2     T     H  0.25
## 3     H     T  0.25

and so forth. We may also extract rows that satisfy a logical expression using the subset function, for instance

s<-cards()
subset(s,suit == "Heart")

##    rank  suit
## 27    2 Heart
## 28    3 Heart
## 29    4 Heart
## 30    5 Heart
## 31    6 Heart
## 32    7 Heart
## 33    8 Heart
## 34    9 Heart
## 35   10 Heart
## 36    J Heart
## 37    Q Heart
## 38    K Heart
## 39    A Heart

subset(s, rank %in% 7:9)

##    rank    suit
## 6     7    Club
## 7     8    Club
## 8     9    Club
## 19    7 Diamond
## 20    8 Diamond
## 21    9 Diamond
## 32    7   Heart
## 33    8   Heart
## 34    9   Heart
## 45    7   Spade
## 46    8   Spade
## 47    9   Spade

We could continue indefinitely. Also note that mathematical expressions are allowed:

subset(rolldie(3),X1+X2+X3 > 16)

##     X1 X2 X3
## 180  6  6  5
## 210  6  5  6
## 215  5  6  6
## 216  6  6  6

Functions for Finding Subsets

The %in% function

The function %in% helps to learn whether each value of one vector lies somewhere inside another vector.

x<-1:10
y<-8:12
y %in% x

## [1]  TRUE  TRUE  TRUE FALSE FALSE

Notice that the returned value is a vector of length 5 which tests whether each element of y is in x, in turn.

The isin function

More commonly, we want to know whether the whole vector y is in x. We can do this with the isin function.

isin(x,y)

## [1] FALSE
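For comparison, a base R idiom that asks a similar question is all(y %in% x). The sketch below shows it on the same vectors. Our understanding is that isin additionally respects repeated values (so asking for two 3s in 1:10 would fail), which the base R idiom does not; treat that as an assumption to check against ?isin. The vector z is made up for the example.

```r
x <- 1:10
y <- 8:12

y %in% x        # elementwise membership test
all(y %in% x)   # FALSE: 11 and 12 are not in x

# all(%in%) ignores multiplicity: x holds only one 3, yet this is TRUE.
z <- c(3, 3, 7)
all(z %in% x)   # TRUE
```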

Set Union, Intersection, and Difference

Given subsets A and B, it is often useful to manipulate them in an algebraic fashion. To this end, we have three set operations at our disposal: union, intersection, and difference. The table below summarizes the pertinent information about these operations.

Name          Denoted   Defined by elements    Code
Union         A ∪ B     in A or B or both      union(A, B)
Intersection  A ∩ B     in both A and B        intersect(A, B)
Difference    A \ B     in A but not in B      setdiff(A, B)

Some examples follow.

s = cards()
a = subset(s, suit == "Heart")
b = subset(s, rank %in% 7:9)

Doing some algebra:

union(a, b)

##    rank    suit
## 6     7    Club
## 7     8    Club
## 8     9    Club
## 19    7 Diamond
## 20    8 Diamond
## 21    9 Diamond
## 27    2   Heart
## 28    3   Heart
## 29    4   Heart
## 30    5   Heart
## 31    6   Heart
## 32    7   Heart
## 33    8   Heart
## 34    9   Heart
## 35   10   Heart
## 36    J   Heart
## 37    Q   Heart
## 38    K   Heart
## 39    A   Heart
## 45    7   Spade
## 46    8   Spade
## 47    9   Spade

intersect(a, b)

##    rank  suit
## 32    7 Heart
## 33    8 Heart
## 34    9 Heart

setdiff(a, b)

##    rank  suit
## 27    2 Heart
## 28    3 Heart
## 29    4 Heart
## 30    5 Heart
## 31    6 Heart
## 35   10 Heart
## 36    J Heart
## 37    Q Heart
## 38    K Heart
## 39    A Heart

setdiff(b, a)

##    rank    suit
## 6     7    Club
## 7     8    Club
## 8     9    Club
## 19    7 Diamond
## 20    8 Diamond
## 21    9 Diamond
## 45    7   Spade
## 46    8   Spade
## 47    9   Spade

Notice that setdiff is not symmetric. Further, note that we can calculate the complement of a set A, denoted A^c and defined to be the elements of S that are not in A simply with setdiff(S,A).
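The same operations behave identically on plain vectors in base R, which makes the asymmetry of setdiff and the complement easy to see on a small scale. The sets S, A, and B below are a made-up example based on one roll of a die, not data from the text.

```r
S <- 1:6             # sample space: one roll of a six-sided die
A <- c(2, 4, 6)      # even outcomes
B <- c(4, 5, 6)      # outcomes of at least 4

union(A, B)          # 2 4 6 5
intersect(A, B)      # 4 6
setdiff(A, B)        # 2  (in A but not in B)
setdiff(B, A)        # 5  (in B but not in A)

# Complement of A relative to the sample space S.
A_complement <- setdiff(S, A)
A_complement         # 1 3 5
```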

Note: When the prob package loads you will notice a message: “The following object(s) are masked from package:base: intersect, setdiff, union”. The reason for this message is that there already exist methods for the functions intersect, setdiff, subset, and union in the base package which ships with R.

Conditional Probability

Example

Toss a six-sided die twice. The sample space consists of all ordered pairs (i, j) of the numbers 1, 2, . . . , 6, that is, S = {(1, 1), (1, 2), . . . , (6, 6)}. Let A = {outcomes match} and B = {sum of outcomes at least 8}. The sample space may be represented by a 6 × 6 matrix in which the outcomes lying in the event A are marked with the symbol “X”, the outcomes falling in B are marked with “O”, and those in both A and B are marked “⊗”. Counting the marks gives P(A) = 6/36, P(B) = 15/36, and P(A ∩ B) = 3/36. Finally,

P(A | B) = P(A ∩ B) / P(B) = (3/36) / (15/36) = 1/5,   P(B | A) = P(A ∩ B) / P(A) = (3/36) / (6/36) = 1/2.

Again, we see that given the knowledge that B occurred (the 15 outcomes in the lower right triangle), there are 3 of the 15 that fall into the set A, thus the probability is 3/15. Similarly, given that A occurred (we are on the diagonal), there are 3 out of 6 outcomes that also fall in B, thus the probability of B given A is 1/2.
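The counting argument can be reproduced by brute force in base R. The sketch below enumerates the 36 outcomes, marks them in a 6 × 6 matrix, and computes the two conditional probabilities as ratios of proportions. It assumes the equally likely model; the names rolls, in_a, in_b, and marks are hypothetical, and the string "XO" stands in for the ⊗ symbol.

```r
# Enumerate the 36 ordered outcomes of two rolls of a six-sided die.
rolls <- expand.grid(X1 = 1:6, X2 = 1:6)

in_a <- rolls$X1 == rolls$X2        # A: outcomes match
in_b <- rolls$X1 + rolls$X2 >= 8    # B: sum of outcomes at least 8

# Under the equally likely model, probabilities are proportions of outcomes.
p_a  <- mean(in_a)                  # 6/36
p_b  <- mean(in_b)                  # 15/36
p_ab <- mean(in_a & in_b)           # 3/36

p_a_given_b <- p_ab / p_b           # 3/15
p_b_given_a <- p_ab / p_a           # 3/6

# Mark the outcomes in a matrix (rows: first roll, columns: second roll).
marks <- matrix(".", nrow = 6, ncol = 6,
                dimnames = list(as.character(1:6), as.character(1:6)))
cells <- cbind(rolls$X1, rolls$X2)
marks[cells[in_a, , drop = FALSE]] <- "X"
marks[cells[in_b, , drop = FALSE]] <- "O"
marks[cells[in_a & in_b, , drop = FALSE]] <- "XO"   # in both A and B
marks
```

Printing marks shows the diagonal (event A) and the lower right triangle (event B) described above.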

s<-rolldie(2,makespace = TRUE)#assumes ELM
head(s)                       #first few rows

##   X1 X2      probs
## 1  1  1 0.02777778
## 2  2  1 0.02777778
## 3  3  1 0.02777778
## 4  4  1 0.02777778
## 5  5  1 0.02777778
## 6  6  1 0.02777778

Next we define the events

a <- subset(s, X1 == X2)
b <- subset(s, X1 + X2 >= 8)

And now we are ready to calculate probabilities. To do conditional probability, we use the given argument of the Prob function:

Prob(a, given = b)

## [1] 0.2

Prob(b, given = a)

## [1] 0.5

Note that we do not actually need to define the events A and B separately as long as we reference the original probability space s as the first argument of the Prob calculation:

Prob(s, X1 == X2, given = (X1 + X2 >= 8))

## [1] 0.2

Prob(s, X1+X2 >=8, given = (X1==X2))

## [1] 0.5

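Example

Consider an urn containing 7 red and 3 green balls, from which we draw 3 balls in order without replacement. We set up the probability space as follows.

L <- rep(c("red","green"), times = c(7, 3))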
M <- urnsamples(L, size=3, replace =FALSE, ordered = TRUE)
N <- probspace(M)

Now let us think about how to set up the event {all 3 balls are red}. Rows of N that satisfy this condition have X1 == "red" & X2 == "red" & X3 == "red", but there must be an easier way. Indeed, there is. The isrep function (short for “is repeated”) in the prob package was written for this purpose. The command isrep(N, "red", 3) will test each row of N to see whether the value "red" appears 3 times. The result is exactly what we need to define an event with the Prob function. Observe:

Prob(N, isrep(N, "red", 3)) # matches (7/10) * (6/9) * (5/8)

## [1] 0.2916667

Now let us try some other probability questions. What is the probability of getting two “red”s?

Prob(N, isrep(N,"red", 2))

## [1] 0.525
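Both answers can be cross-checked without enumerating the sample space. Drawing without replacement from 7 red and 3 green balls is a hypergeometric setting, so base R's dhyper gives the same numbers; the chain of conditional probabilities gives the first one directly. This is a sketch for verification only, with variable names chosen for the example.

```r
# All three balls red, by multiplying conditional probabilities.
p_three_red <- (7 / 10) * (6 / 9) * (5 / 8)

# The same value from the hypergeometric distribution:
# x red balls in k = 3 draws from m = 7 red and n = 3 green.
p_three_red_hyper <- dhyper(3, m = 7, n = 3, k = 3)

# Exactly two red balls among the three drawn.
p_two_red <- dhyper(2, m = 7, n = 3, k = 3)

c(p_three_red, p_three_red_hyper, p_two_red)
```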

Independent Event

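Toss a fair coin ten times. The probability of observing at least one Head is one minus the probability of all Tails:

s<-tosscoin(10, makespace = TRUE)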
a<-subset(s, isrep(s, vals = "T", nrep = 10))
1-Prob(a)

## [1] 0.9990234
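Because the ten tosses are independent, the probability of all Tails factors into a product of ten one-toss probabilities, so the value above can be checked by hand. The sketch below assumes a fair coin, and its variable names are chosen for the example.

```r
n_tosses <- 10

# Independence: P(T on every toss) = P(T)^10.
p_all_tails <- 0.5^n_tosses

# Complement rule: at least one Head is the complement of all Tails.
p_at_least_one_head <- 1 - p_all_tails
p_at_least_one_head
```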

Bayes Rule

In this section we introduce a rule that allows us to update our probabilities when new information becomes available.

Example

Suppose the boss gets a change of heart and does not fire anybody. But the next day (s)he randomly selects another file and again finds it to be misplaced. To decide whom to fire now, the boss would use the same procedure, with one small change. (S)he would not use the prior probabilities 60%, 30%, and 10%; those are old news. Instead, (s)he would replace the prior probabilities with the posterior probabilities just calculated. After the math (s)he will have new posterior probabilities, updated even more from the day before. In this way, probabilities found by Bayes’ rule are always on the cutting edge, always updated with respect to the best information available at the time.

Example Misfiling assistants (continued from Example 4.44). We store the prior probabilities and the likelihoods in vectors and go to town.

prior <- c(0.6,0.3,0.1)
like <- c(0.003, 0.007, 0.01)
post <- prior * like
post/sum(post)

## [1] 0.3673469 0.4285714 0.2040816

Example Let us incorporate the posterior probability (post) information from the last example and suppose that the assistants misfile seven more documents. Using Bayes’ Rule, what would the new posterior probabilities be?

newprior <- post
post <- newprior * like^7
post/sum(post)

## [1] 0.0003355044 0.1473949328 0.8522695627
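Both calculations apply the same rule: the posterior is proportional to prior × likelihood, and dividing by the sum normalizes it. Because only proportions matter, it makes no difference that post above was stored unnormalized before being reused as the new prior. The sketch below wraps the rule in a hypothetical helper, bayes_update (not part of any package), and checks that updating one document at a time gives the same answer as a single update with all eight misfiled documents.

```r
# Posterior is proportional to prior * likelihood; normalize to sum to 1.
bayes_update <- function(prior, like) {
  unnormalized <- prior * like
  unnormalized / sum(unnormalized)
}

prior <- c(0.6, 0.3, 0.1)        # prior probabilities from the example
like  <- c(0.003, 0.007, 0.01)   # misfiling rates from the example

post_one <- bayes_update(prior, like)   # after the first misfiled document

# Update sequentially for eight misfiled documents in total (1 + 7).
post_seq <- prior
for (i in 1:8) post_seq <- bayes_update(post_seq, like)

# One-shot update using the likelihood of all eight documents at once.
post_batch <- bayes_update(prior, like^8)

all.equal(post_seq, post_batch)
```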

Random Variable

In this section, we are interested in a number that is associated with the experiment. We conduct a random experiment E and after learning the outcome ω in S we calculate a number X. That is, to each outcome ω in the sample space we associate a number X(ω) = x.

How to use it in R:

The primary vessel for this task is the addrv function. There are two ways to use it, and we will describe both.

Supply a Defining Formula

The first method is based on the transform function. See ?transform. The idea is to write a formula defining the random variable inside the function, and it will be added as a column to the data frame. As an example, let us roll a 4-sided die three times, and let us define the random variable U = X1 − X2 + X3.

s <- rolldie(3, nsides = 4, makespace = TRUE)
s <- addrv(s, U = X1 - X2 + X3)

Now let’s take a look at the values of U. In the interest of space, we will only reproduce the first few rows of s (there are 4^3 = 64 rows in total).

head(s)

##   X1 X2 X3 U    probs
## 1  1  1  1 1 0.015625
## 2  2  1  1 2 0.015625
## 3  3  1  1 3 0.015625
## 4  4  1  1 4 0.015625
## 5  1  2  1 0 0.015625
## 6  2  2  1 1 0.015625

We see from the U column that it is operating just as it should. We can now answer questions like the following.

Prob(s, U >6)

## [1] 0.015625
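Since the formula method is described as being based on transform, the same computation can be sketched in base R alone. The code below rebuilds the 64-row space by hand under the equally likely model and adds U with transform. The name space is hypothetical, and the column order differs from the addrv output (here U is appended after probs).

```r
# Sample space for three rolls of a 4-sided die, equally likely outcomes.
space <- expand.grid(X1 = 1:4, X2 = 1:4, X3 = 1:4)
space$probs <- rep(1 / nrow(space), nrow(space))

# Define the random variable U = X1 - X2 + X3 as a new column.
space <- transform(space, U = X1 - X2 + X3)

# P(U > 6): add up the probabilities of the rows where the event holds.
p_u_gt_6 <- sum(space$probs[space$U > 6])
p_u_gt_6          # only the outcome (4, 1, 4) gives U = 7

range(space$U)    # U takes values from -2 to 7
```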

Supply a Function

Sometimes we have a function lying around that we would like to apply to some of the outcome variables, but it is unfortunately tedious to write out the formula defining what the new variable would be. The addrv function has an argument FUN specifically for this case. Its value should be a legitimate function from R, such as sum, mean, median, etc. Or, you can define your own function. Continuing the previous example, let’s define V = max(X1, X2, X3) and W = X1 + X2 + X3.

s<- addrv(s, FUN = max, invars = c("X1","X2","X3"), name = "V")
s<- addrv(s, FUN = sum, invars = c("X1","X2","X3"), name = "W")
head(s)

##   X1 X2 X3 U V W    probs
## 1  1  1  1 1 1 3 0.015625
## 2  2  1  1 2 2 4 0.015625
## 3  3  1  1 3 3 5 0.015625
## 4  4  1  1 4 4 6 0.015625
## 5  1  2  1 0 2 4 0.015625
## 6  2  2  1 1 2 5 0.015625

Notice that addrv has an invars argument to specify exactly to which columns one would like to apply the function FUN. If no input variables are specified, then addrv will apply FUN to all non-probs columns. Further, addrv has an optional argument name to give the new variable; this can be useful when adding several random variables to a probability space (as above). If not specified, the default name is “X”.
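The effect of FUN and invars can be imitated in base R by applying a function across the chosen columns of each row. The sketch below is an analogy rather than the addrv implementation; space and roll_cols are names chosen for the example.

```r
# Sample space for three rolls of a 4-sided die.
space <- expand.grid(X1 = 1:4, X2 = 1:4, X3 = 1:4)

# Columns the function should see (the role played by invars).
roll_cols <- c("X1", "X2", "X3")

# Apply a function to each row (MARGIN = 1) of the selected columns.
space$V <- apply(space[, roll_cols], 1, max)   # V = max(X1, X2, X3)
space$W <- apply(space[, roll_cols], 1, sum)   # W = X1 + X2 + X3

head(space)
```

Restricting the columns matters: once V has been added, applying sum to every column would include V in the total, which is why the input columns are listed explicitly.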

References:

Introduction to Probability and Statistics Using R
R in Action, Second Edition
Edureka R Blogs
r-tutor.com

Frason Francis

10/28/2020
