Probability Theory & Distributions HCD - 594

Kevin Linares
Spring, 2015

What is probability?

Book Definition:

  • “with a random sample or randomized experiment, the probability an observation has a particular outcome is the proportion of times that outcome would occur in a very long sequence of observations”

  • Informal definition:

    • the probability of an event is its relative frequency among the outcomes of a very long run of observations
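The long-run idea in the definition can be illustrated with a small simulation. This is only a sketch: the probability of 0.70, the sequence length, and the seed are assumptions chosen for illustration, not values from any real data.

```r
# Illustrative simulation: the relative frequency of an event
# settles near its probability as the sequence gets longer.
set.seed(594)        # arbitrary seed, only for reproducibility
p_true <- 0.70       # assumed probability of the event on any one trial
n_obs  <- 10000      # length of the "very long sequence" (assumed)

# 1 = event occurred, 0 = it did not
event <- rbinom(n_obs, size = 1, prob = p_true)

# Relative frequency of the event after each observation
running_prop <- cumsum(event) / seq_len(n_obs)

# Relative frequency after 10, 100, 1,000 and 10,000 observations
running_prop[c(10, 100, 1000, 10000)]
```

Early values can wander quite far from 0.70; later values should sit much closer to it.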

Probability function: P(A) or Pr(A)

  • P = probability, A = event(s)

What is the probability that it will snow tomorrow? P(S) = .70 = 70%

What is the probability that it will not snow tomorrow? P(not S) = .30 = 30%

P(S) + P(not S) = .70 + .30 = 1.00 or 100%

Rules of the game:

  1. P(S) = 1
    • S = sample space, the set of all possible outcomes (not the snow event from the previous example)
  2. P(A) is between 0 and 1, inclusive: 0 ≤ P(A) ≤ 1
  3. P(A) is never negative
  • Example: for a family with 2 kids, what is the sample space? S = {(B, B), (B, G), (G, B), (G, G)}
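The sample space for the two-kid family can also be listed by machine. A minimal sketch using base R's expand.grid; the column names first and second are labels introduced here for illustration.

```r
# Enumerate every (first child, second child) combination
sexes <- c("B", "G")
S <- expand.grid(first = sexes, second = sexes, stringsAsFactors = FALSE)

S          # one row per outcome in the sample space
nrow(S)    # number of outcomes: 2 * 2 = 4
```

The rows come out in a different order than the set written above, but they are the same four outcomes.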

Putting things in perspective

Babies born in 1981

  • 1,769,000 girls (48.7%): 1,860,000 boys (51.3%)

  • P(B, B) = .513² = .2632

  • P(G, G) = .487² = .2372

  • P(B, G) = .513 * .487 = .2498

  • P(G, B) = .487 * .513 = .2498

  • P(S) = P(B, B) + P(B, G) + P(G, B) + P(G, G)
    = .263 + .250 + .250 + .237 = ?
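The four probabilities are products of the single-birth proportions, which treats the two births as independent. Under that assumption, a short sketch can compute them all at once and check that they sum to the probability of the whole sample space.

```r
# Single-birth probabilities from the 1981 proportions
p <- c(B = 0.513, G = 0.487)

# joint[i, j] = P(first child = i) * P(second child = j),
# which assumes the two births are independent
joint <- outer(p, p)

round(joint, 4)   # the four outcome probabilities
sum(joint)        # total probability of the sample space
```

Because .513 + .487 = 1, the total is (.513 + .487)² = 1, as rule 1 requires.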

Distributions

Binomial distribution Bin(n,p)

  • p = probability of success on each trial
  • n = number of trials

Out of 100 births, what would be an estimated proportion of boys given that 51.3% of all births in 1981 were males?
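Before simulating, the question can be answered from the binomial distribution itself. A sketch, assuming the 100 births are independent with the same probability of a boy; the 45 to 55 range is an arbitrary illustration, not something from the course material.

```r
n <- 100      # number of births (trials)
p <- 0.513    # probability that a birth is a boy

expected_boys <- n * p                  # mean of Bin(n, p): 51.3 boys
sd_boys <- sqrt(n * p * (1 - p))        # standard deviation of Bin(n, p): about 5

# Probability of seeing between 45 and 55 boys, inclusive
prob_45_55 <- pbinom(55, n, p) - pbinom(44, n, p)

c(expected = expected_boys, sd = sd_boys, prob_45_55 = prob_45_55)
```

So the expected proportion is 51.3%, but any single sample of 100 can easily land several percentage points away from it.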

Binomial distribution

0 = females, 1 = males. Hmm, 48% were boys in this sample, which is close to our 51.3%

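# Simulate 100 births as single trials: 1 = boy (p = 0.513), 0 = girl
Boys <- rbinom(100, size = 1, prob = 0.513)
# Relative frequency of each outcome
Boys100 <- table(Boys) / length(Boys)
barplot(Boys100)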

[Figure: bar plot of the proportions of girls (0) and boys (1) among 100 simulated births]

Increasing the sample to 1000

0 = females, 1 = males. 50.8% were boys, which is closer to our 51.3%

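# Simulate 1,000 births as single trials: 1 = boy (p = 0.513), 0 = girl
Boys <- rbinom(1000, size = 1, prob = 0.513)
# Relative frequency of each outcome
Boys1000 <- table(Boys) / length(Boys)
barplot(Boys1000)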

[Figure: bar plot of the proportions of girls (0) and boys (1) among 1,000 simulated births]
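Why does the larger sample land closer to 51.3%? One way to see it is the standard error of a sample proportion, sqrt(p(1 - p)/n), which shrinks as n grows. A short sketch; the third sample size of 10,000 is added here only for illustration.

```r
p <- 0.513
n <- c(100, 1000, 10000)

# Standard error of the sample proportion for each sample size
se <- sqrt(p * (1 - p) / n)

data.frame(n = n, se = round(se, 4))
```

With n = 100 the standard error is about 5 percentage points, so a sample with 48% boys is not surprising. With n = 1000 it drops to about 1.6 points.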

Normal Distribution

Gaussian distribution

  • Y ~ N(μ, σ²)

Empirical rule

  • within 1 SD of the mean: about 68% of observations
  • within 2 SD of the mean: about 95%
  • within 3 SD of the mean: about 99.7%
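These percentages come straight from the normal cumulative distribution function. A quick check with base R's pnorm:

```r
k <- 1:3

# Share of a normal distribution within k standard deviations of the mean
coverage <- pnorm(k) - pnorm(-k)

round(coverage, 4)   # 0.6827 0.9545 0.9973
```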

Let's test this assumption on our data using reading scores from the ECLS-K

Summary data for ECLS-K reading scores

[Figure: summary of ECLS-K reading scores]

Testing the empirical rule

Hmm, only partially close to the empirical rule (N = 16,109):

  • M ± 1 SD = [25.47939, 52.21027]: n/N = 12,577 / 16,109 = .78 = 78%
  • M ± 2 SD = [12.11395, 65.57571]: n/N = 15,394 / 16,109 = .96 = 96%
  • M ± 3 SD = [-1.25149, 78.94115]: n/N = 15,758 / 16,109 = .98 = 98%
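A check like this can be written as a small helper. The sketch below is not the code used for these slides: within_k_sd is a hypothetical function name, and scores is simulated placeholder data (normal, with a mean and SD of roughly 38.8 and 13.4, as implied by the intervals above), not the actual ECLS-K reading scores.

```r
# Hypothetical helper: share of observations within k SDs of the mean
within_k_sd <- function(x, k) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  mean(x >= m - k * s & x <= m + k * s, na.rm = TRUE)
}

set.seed(594)   # arbitrary seed, only for reproducibility
# Placeholder data, NOT the ECLS-K scores
scores <- rnorm(16109, mean = 38.8, sd = 13.4)

coverage <- sapply(1:3, function(k) within_k_sd(scores, k))
round(coverage, 3)
```

On truly normal data like this placeholder, the three shares should sit near 68%, 95%, and 99.7%. Shares of 78%, 96%, and 98% are a sign that the real scores are not normally distributed.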

What can explain the outliers that are skewing our data? Maybe the type of school?

Type of schools

Most schools fall under the public school type: 78%

[Figure: schools by type]

Relative Frequencies for school types

Public = 79%, Private = 12%, Religious = 6%, Catholic = 3%

[Figure: relative frequencies of school types]

Let's get a Z-distribution from our data

[Figure: z-distribution of the reading scores]
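A z-score expresses each observation as a number of standard deviations from the mean: z = (x - M) / SD. A minimal sketch of the transformation, again using simulated placeholder data rather than the ECLS-K scores.

```r
set.seed(594)   # arbitrary seed, only for reproducibility
# Placeholder data, NOT the ECLS-K scores
scores <- rnorm(1000, mean = 38.8, sd = 13.4)

# Standardize: subtract the mean, divide by the standard deviation
z <- (scores - mean(scores)) / sd(scores)

# After standardizing, the mean is 0 and the SD is 1 (up to rounding error)
c(mean = mean(z), sd = sd(z))

# Base R's scale() performs the same transformation
z_scaled <- as.numeric(scale(scores))
```

Standardizing changes the units, not the shape: a skewed set of scores stays skewed after conversion to z-scores.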

Distributions of reading scores by school type

The outliers might be within the private schools.

[Figure: distributions of reading scores by school type]

Finally, barplots of reading scores by school type

Lots of variability in private schools.

[Figure: barplots of reading scores by school type]