For the R Studio server go to: **http://13.59.138.42/**
In this lab we’ll lab we’ll use the DiscreteRV package to observe the Central Limit Theorem and, time permitting, simulate the 2016 Presidential Election.
The library we’ll use in this package is discreteRV for discrete random variable. A discete random variable is a random variable with distinct or discontinous outcomes. Random variables can also be continous. A random variable that can take only values 1,2,3… , for instance, is discrete. A random variable that could take any real value between 0 and 1, on the other hand, is continouus.
Let’s start by loading the package:
library(discreteRV)
Using the RV function you can create a set of outcomes and their association probabilities. Here’s one random variable that looks like the roll of a die.
X <- RV(outcomes = 1:6, probs = 1/6)
plot(X)
You can also find the expected value, variance and standard deviations of random variables. The (long) way to get the expected value is to use the formula \(E[X] = \sum x \cdot p(x)\) as follows:
sum(probs(X)*outcomes(X))
The easy way is to use the Expectation function in the discreteRV package:
E(X)
Similarly, you can calculate variance by our formula \(\sum p(x)(x-E[X])^2\):
sum(probs(X)*(outcomes(X) - E(X))^2)
or you can use the Variance function, V, and take the square root to get the standard deviation
V(X)
sqrt(V(X))
What if you roll two dice and add the results? You could treat this as the sum of two independent and identically distributed (iid) random variables. In the discreteRV package the SofIID function gives the sum of iid random variables:
two_dice <- SofIID(X, n=2) #remember X was one die
plot(two_dice)
What if you roll 20 dice and add them all up? What will this distribution look like?
twenty_dice <- SofIID(X, n=20)
plot(twenty_dice)
Remember this shape – the distribution of possible outcomes of the sums of 20 dice – while we look at the sums of other random varibles.
In the game of roulette, if you bet on either red or black you have an 18/38 chance of winning and a 20/38 chance of losing (there are 18 red slices, 18 black slices and 2 green slices in American casinos). Let’s say that you bet $100 on a spin. The result could be expressed as a random variable:
spin <- RV(outcomes = c(100, -100), probs = c(18/38, 20/38))
E(spin)
plot(spin)
The choice of y-axis limits is somewhat unfortunate here and the following gives a better sense of the bet:
plot(spin, ylim=c(0,1))
What happens if you’re foolish enough to bet on 50 spins of roulette?
fifty_spins <- SofIID(spin, n=50)
plot(fifty_spins)
How does this distribution compare to the sums of twenty dice?
Note that if we want to know how likely we are to make money after 50 spins, we can use the discreteRV package for this too:
P(fifty_spins >0)
How likely are we to be winning after 200 spins? After 500 spins?
Let’s assume that on any given possesion the Steamer’s basketball team has a 50% chance of not scoring, a 10% chance of scoring 1 point, a 30% chance of scoring 2 points and a 10% chance of socring 3 points. Once again, we can summarize this information as a random variable:
possesion <- RV(outcomes = 0:3, probs = c(0.5, 0.1, 0.3, 0.1))
E(possesion)
plot(possions)
Now assume that the Steamer’s have 50 independent possesions in a game. Calculate and plot the distribution of possible game scores. How does this distribution compare to the distributions above?
A geneticist grows 90 \(F_2\) Wisconsin Fast Plants in such a way that they will each have a 3/4 chance of having a purple stem. Use the discreteRV package to calculate and plot the distribution of possible numbers of plants with purple stems. How does this distributions compare to the distributions of sums above?
According to the Central Limit Theorem (usually just the CLT among friends) the sum or mean of a sufficiently large number of iterations of ANY random variable will be approximately normally distributed – that is, it will take the shape of a bell curve. The approximation can be as good as you want it to be, you’ll just need to increase your definition of “sufficiently large” accordingly. Proving this mathematically takes some work but we saw how this works in practice by simulating die rolls, roulette spins, basketball possesions and plant stems.
For the mathematically inclined, a normal distribution (also called a Gaussian distribution) follows the curve:
f(x) = \(\frac 1 {\sigma\sqrt {2\pi}}\) \(e^{-\frac {(x-\mu)^2} {2\sigma^2}}\)
where \(\mu\) is the mean or expected value of the distribution and \(\sigma\) is the standard deviation of the distribution.
This strange fact that the sums or means of independent results from any underlying distributions converge upon this normal/Gaussian distribution is one of the most important results in statistics and gives the normal distribution great importance in mathematics, science and social science.
Suppose we build a model to predict the 2020 Presidential Election. According to our model the Democratic party has 14 states, the District of Columbia, and 186 electoral college votes locked up. Trump has 16 states and 93 electoral college votes in the bag. The other 20 states (with 259 electoral college votes) can be represented as random variables (as shown below).
OR <- RV(outcomes=c(0,7), probs=c(0.05, 0.95))
WA <- RV(outcomes=c(0,12), probs=c(0.05, 0.95))
ME <- RV(outcomes=c(0,4), probs=c(0.10, 0.90))
MN <- RV(outcomes=c(0,10), probs=c(0.15, 0.85))
NM <- RV(outcomes=c(0,5), probs=c(0.15, 0.85))
CO <- RV(outcomes=c(0,9), probs=c(0.15, 0.85))
VA <- RV(outcomes=c(0,13), probs=c(0.15, 0.85))
NH <- RV(outcomes=c(0,4), probs=c(0.15, 0.85))
NV <- RV(outcomes=c(0,6), probs=c(0.25, 0.75))
MI <- RV(outcomes=c(0,16), probs=c(0.25, 0.75))
WI <- RV(outcomes=c(0,10), probs=c(0.40, 0.60))
PA <- RV(outcomes=c(0,20), probs=c(0.40, 0.60))
FL <- RV(outcomes=c(0,29), probs=c(0.50, 0.50))
NC <- RV(outcomes=c(0,15), probs=c(0.55, 0.45))
OH <- RV(outcomes=c(0,18), probs=c(0.65, 0.35))
IA <- RV(outcomes=c(0,6), probs=c(0.65, 0.35))
AZ <- RV(outcomes=c(0,11), probs=c(0.85, 0.15))
GA <- RV(outcomes=c(0,16), probs=c(0.90, 0.10))
MT <- RV(outcomes=c(0,3), probs=c(0.93, 0.07))
MO <- RV(outcomes=c(0,10), probs=c(0.95, 0.05))
TX <- RV(outcomes=c(0,38), probs=c(0.95, 0.05))
SC <- RV(outcomes=c(0,9), probs=c(0.97, 0.03))
IN <- RV(outcomes=c(0,11), probs=c(0.97, 0.03))
You can then create a random variable the is the sum of Democratic candidate’s electoral college votes and plot it:
Dems <- 186+OR+WA+ME+MN+NM+MI+VA+NH+CO+PA+WI+NV+FL+NC+OH+IA+AZ+GA+MT+MO+IN+SC+TX
plot(Dems)
abline(v=269, col="red")
According to this simple model, what is the Democratic party’s chance of winning the election (note that they needs 270 electoral college votes in order to win)?
What is the Democratic party’s chance of winning 300 or more electoral college votes?
There is (at least) one major shortcoming of this model. What is it? Assuming that the probabilies for each state are correct does this model accurately predict the Democratic party’s chances of winning, underestimate their chances or overestimate their chances? Why?