For the R Studio server go to: rstudio.saintannsny.org:8787
In this lab we’ll lab we’ll use the DiscreteRV package and simulate, football games, basketball games and, time permitting, the 2016 Presidential Election.
The library we’ll use in this package is discreteRV for discrete random variable. A discete random variable is a random variable with distinct or discontinous outcomes. Random variables can also be continous. A random variable that can take only values 1,2,3… , for instance, is discrete. A random variable that could take any real value between 0 and 1, on the other hand, is continouus.
Let’s start by loading the package:
library(discreteRV)
Using the RV function you can create a set of outcomes and their association probabilities. Here’s one random variable that looks like the roll of a die and a few ways to look at it.
X <- RV(outcomes = 1:6, probs = 1/6)
X
probs(X)
plot(X)
You can also find the expected value, variance and standard deviations of random variables. The (long) way to get the expected value is to calculated the weighted average of the outcomes, \(E[X] = \sum x \cdot p(x)\) as follows:
sum(probs(X)*outcomes(X))
The easy way is to use the Expectation function in the discreteRV package:
E(X)
Similarly, you can calculate variance by calculating the weighted average of the squared deviations, \(\sum p(x)(x-E[X])^2\):
sum(probs(X)*(outcomes(X) - E(X))^2)
or you can use the Variance function, V, and the standard deviation function, SD:
V(X)
SD(X)
What if you roll two dice and add the results? You could treat this as the sum of two independent and identically distributed (iid) random variables. In the discreteRV package the SofIID function gives the sum of iid random variables:
SofIID(X, n=2)
two_dice <- SofIID(X, n=2) #remember X was one die and we're using two
plot(two_dice)
What if you roll 20 dice and add them all up? What will this distribution look like?
twenty_dice <- SofIID(X, n=20)
plot(twenty_dice)
Remember this shape – the distribution of possible outcomes of the sums of 20 dice – while we look at the sums of other random varibles. This is a shape we’ll see again and again and learn more about tomorrow.
In the game of roulette, if you bet on either red or black you have an 18/38 chance of winning and a 20/38 chance of losing (there are 18 red slices, 18 black slices and 2 green slices in American casinos). Let’s say that you bet $100 on a spin. The result could be expressed as a random variable:
spin <- RV(outcomes = c(100, -100), probs = c(18/38, 20/38))
E(spin)
plot(spin)
The choice of y-axis limits is somewhat unfortunate here and the following gives a better sense of the bet:
plot(spin, ylim=c(0,1))
What happens if you’re foolish enough to bet on 50 spins of roulette?
fifty_spins <- SofIID(spin, n=50)
plot(fifty_spins)
How does this distribution compare to the sums of twenty dice?
Note that if we want to know how likely we are to make money after 50 spins, we can use the discreteRV package for this too:
P(fifty_spins >0)
Q1: How likely are we to be winning after 200 spins?
Q2: After 500 spins? Q3: How likely are we to have made $500 or more dollars after 20 spins?
Let’s create a football team that scores a touchdown on 20% of its drives, a field goal on 20% of its drive and is scoreless on the remaining 60%. We can further imagine that it has twelve drives in a game and calulate the expected number of points in a game, the standard deviation of points in a game and plot the distribution of points score in a game:
possesion <- RV(outcomes = c(0,3,7), probs = c(0.6, 0.2, 0.2))
game <- SofIID(possesion, n=12)
E(game); SD(game)
plot(game)
Q4: What is the most common score (or scores) for this team in one game? Q5: What is the probability that they score more than 40 points in any game? Q6: What is the probability that they are held to scoring fewer than 10 points?
We could also simulate two teams playing against each other. Below, I’ll simulate the first half of a game. The final line in the code below find the 10 most frequent half-time scores:
half_game <- SofIID(possesion, n=6)
half_time <- joint(half_game, half_game)
probs(half_time)[order(-probs(half_time))[1:10]]
Let’s assume that on any given possesion the Steamer’s basketball team has a 50% chance of not scoring, a 10% chance of scoring 1 point, a 30% chance of scoring 2 points and a 10% chance of socring 3 points. Once again, we can summarize this information as a random variable:
possesion <- RV(outcomes = 0:3, probs = c(0.5, 0.1, 0.3, 0.1))
E(possesion)
plot(possesion)
Now assume that the Steamer’s have 100 independent possesions in a game. Calculate and plot the distribution of possible game scores.
Q7: What is the expected value of points scored in 100 possesions? Q8: What are the variance and standard deviation in points score in 100 possesions? Q9: What is the probability that this team scores more than 110 points in 100 possesions?
According to PredictWise.com (which aggregates betting markets), Clinton has 13 states, the District of Columbia, and 181 electoral college votes locked up. Trump has 13 states and 81 electoral college votes in the bag. The other 24 states (with 276 electoral college votes) can be represented as random variables.
Note: Values as of 11/6/2016
ME <- RV(outcomes=c(0,4), probs=c(0.02, 0.98))
MN <- RV(outcomes=c(0,10), probs=c(0.01, 0.99))
NM <- RV(outcomes=c(0,5), probs=c(0.02, 0.98))
MI <- RV(outcomes=c(0,16), probs=c(0.05, 0.95))
VA <- RV(outcomes=c(0,13), probs=c(0.03, 0.97))
NH <- RV(outcomes=c(0,4), probs=c(0.09, 0.81))
CO <- RV(outcomes=c(0,9), probs=c(0.07, 0.93))
PA <- RV(outcomes=c(0,20), probs=c(0.06, 0.94))
WI <- RV(outcomes=c(0,10), probs=c(0.02, 0.98))
NV <- RV(outcomes=c(0,6), probs=c(0.07, 0.93))
FL <- RV(outcomes=c(0,29), probs=c(0.22, 0.78))
NC <- RV(outcomes=c(0,15), probs=c(0.30, 0.70))
OH <- RV(outcomes=c(0,18), probs=c(0.64, 0.36))
IA <- RV(outcomes=c(0,6), probs=c(0.84, 0.16))
AZ <- RV(outcomes=c(0,11), probs=c(0.74, 0.26))
GA <- RV(outcomes=c(0,16), probs=c(0.91, 0.09))
MT <- RV(outcomes=c(0,3), probs=c(0.93, 0.07))
MO <- RV(outcomes=c(0,10), probs=c(0.99, 0.01))
IN <- RV(outcomes=c(0,11), probs=c(0.98, 0.02))
SC <- RV(outcomes=c(0,9), probs=c(0.99, 0.01))
TX <- RV(outcomes=c(0,38), probs=c(0.98, 0.02))
AK <- RV(outcomes=c(0,3), probs=c(0.92, 0.08))
UT <- RV(outcomes=c(0,6), probs=c(0.99, 0.01))
SD <- RV(outcomes=c(0,3), probs=c(0.99, 0.01))
You can then create a random variable the is the sum of Clinton’s electoral college votes and plot it. This code below also draws a line at 269 electoral college votes (an electoral college tie) and return’s Clinton 10 most common electoral college results:
Clinton <- 181+ME+MN+NM+MI+VA+NH+CO+PA+WI+NV+FL+NC+OH+IA+AZ+GA+MT+MO+IN+SC+TX+AK+SD+UT
plot(Clinton)
abline(v=269, col="red")
probs(Clinton)[order(-probs(Clinton))[1:10]]
Q10: Based on this model, what is Clinton’s chance of winning the election (note that she needs 270 electoral college votes in order to win)?
Q11: Based on this model, what is Clinton’s chance of winning 300 or more electoral college votes?
Q12: There is (at least) one major shortcoming of this model. What is it? Assuming that the probabilies for each state are correct does this model accurately predict Clinton’s chances of winning, underestimate her chances or overestimate her chances?