Dice rolls. If you roll a pair of fair dice, what is the probability of getting a sum of 1? A sum of 5? A sum of 12?
The probability of getting a sum of 1 is zero. The minimum possible sum is 2, since each die has a smallest face value of 1 and both dice are rolled. To get a 1 we would have to roll only one die, in which case the probability of rolling a 1 would be 1/6.
Since the outcomes that produce a given sum are disjoint, we can add their probabilities. The two dice are independent, and the probability of each die face is 1/6.
P(sum=5) = P(1 and 4) + P(2 and 3) + P(3 and 2) + P(4 and 1)
P(sum=5) = (1/6)(1/6) + (1/6)(1/6) + (1/6)(1/6) + (1/6)(1/6)
P(sum=5) = 4 * 1/36 = 4/36 = 0.1111111
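Before simulating, we can verify this result exactly by enumerating all 36 equally likely outcomes in R (a minimal sketch; expand.grid builds every (die1, die2) combination):
#enumerate all 36 equally likely outcomes of two fair dice
outcomes<-expand.grid(die1=1:6,die2=1:6)
#fraction of outcomes whose sum is 5, i.e. the exact probability 4/36
mean(rowSums(outcomes)==5)
## [1] 0.1111111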
We can do this manually since the possible outcomes are rather limited. But if we had, say, three or more dice, calculating by hand would become a complicated undertaking. We can also run a simulation and estimate the probability from the resulting histogram.
#we define two dice
d1<-c(1,2,3,4,5,6)
d2<-c(1,2,3,4,5,6)
#we simulate rolling the dice 5000 times
#we could have simply used one die, but with two we could make each unfair in different ways. Here we left both fair, i.e. each face has probability 1/6
diceRoll<-vector()
for (i in 1:5000) {
diceRoll<-c(diceRoll,sample(d1,1)+sample(d2,1))
}
#create a histogram from the simulation
h<-hist(diceRoll,plot = FALSE,breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13))
#convert the histogram counts to proportions, which gives us an empirical probability distribution
h$counts<-h$counts/sum(h$counts)
plot(h)
h
## $breaks
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13
##
## $counts
## [1] 0.02639472 0.05558888 0.08338332 0.11317736 0.13337333 0.16136773
## [7] 0.15136973 0.11237752 0.08218356 0.05698860 0.02379524 0.00000000
##
## $density
## [1] 0.02639472 0.05558888 0.08338332 0.11317736 0.13337333 0.16136773
## [7] 0.15136973 0.11237752 0.08218356 0.05698860 0.02379524 0.00000000
##
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5
##
## $xname
## [1] "diceRoll"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
# Since this is a probability distribution, the probabilities of the disjoint outcomes should sum to 1
sum(h$counts)
## [1] 1
The estimated probability of a sum of 5 is given by h$density[4]:
h$density[4]
## [1] 0.1131774
Same as before, but here we have only one possible outcome of the dice roll:
P(sum=12) = P(6 and 6)
P(sum=12) = 1/6 * 1/6
P(sum=12) = 1/36 = 0.02777778
We can re-run the simulation as before.
diceRoll<-vector()
for (i in 1:5000) {
diceRoll<-c(diceRoll,sample(d1,1)+sample(d2,1))
}
h<-hist(diceRoll,plot = FALSE,breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13))
h$counts<-h$counts/sum(h$counts)
plot(h)
The estimated P(sum=12) is in h$density[11]:
h$density[11]
## [1] 0.02759448
Poverty and language. The American Community Survey is an ongoing survey that provides data every year to give communities the current information they need to plan investments and services. The 2010 American Community Survey estimates that 14.6% of Americans live below the poverty line, 20.7% speak a language other than English (foreign language) at home, and 4.2% fall into both categories.
Are living below the poverty line and speaking a foreign language at home disjoint?
No. The two events are not disjoint, since a person can both live below the poverty line and speak a foreign language at home; the problem states that 4.2% of Americans fall into both categories.
Draw a Venn diagram summarizing the variables and their associated probabilities.
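Although the diagram itself is not drawn here, the three Venn regions follow directly from the given probabilities (a small sketch; the variable names are our own):
#probabilities given in the problem
povertyP<-0.146
foreignP<-0.207
bothP<-0.042
#the three regions of the Venn diagram
povertyP-bothP #poverty only: 0.104
foreignP-bothP #foreign language only: 0.165
bothP #overlap: 0.042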
What percent of Americans live below the poverty line and only speak English at home?
Below Poverty Line and Speak English = Below Poverty Line - Both
Below Poverty Line and Speak English = 14.6% - 4.2% = 10.4%
What percent of Americans live below the poverty line or speak a foreign language at home?
Below Poverty Line or Speak Foreign Language = Below Poverty Line + Speak Foreign Language - Both
Below Poverty Line or Speak Foreign Language = 14.6% + 20.7% - 4.2% = 31.1%
What percent of Americans live above the poverty line and only speak English at home?
Above Poverty Line and Speak English = 100% - (Below Poverty Line or Speak Foreign Language)
Above Poverty Line and Speak English = 100% - 31.1% = 68.9%
Living below the poverty line does not determine whether a person speaks a foreign language at home: we find cases in which they do (4.2%) and cases in which they don't (10.4%). Note, however, that the two events are not independent in the formal sense, since P(below poverty) * P(foreign language) = 0.146 * 0.207 ≈ 0.030, which differs from the observed joint probability of 0.042.
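The same independence check in R:
#if the events were independent, P(A)P(B) would equal P(A and B) = 0.042
0.146*0.207
## [1] 0.030222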
Assortative mating is a nonrandom mating pattern where individuals with similar genotypes and/or phenotypes mate with one another more frequently than what would be expected under a random mating pattern. Researchers studying this topic collected data on eye colors of 204 Scandinavian men and their female partners. The table below summarizes the results. For simplicity, we only include heterosexual relationships in this exercise.
To answer the questions in this exercise, we compute the table proportions from the given contingency table. We do this by dividing each count by the table total, 204.
#we compute the totals instead of copying them, as a sanity check
contTable<-matrix(c(78,23,13,19,23,12,11,9,16),byrow = TRUE,ncol = 3)
contTable<-cbind(contTable,c(sum(contTable[1,]),sum(contTable[2,]),sum(contTable[3,])))
contTable<-rbind(contTable,c(sum(contTable[,1]),sum(contTable[,2]),sum(contTable[,3]),sum(contTable[,4])))
colnames(contTable)<-c("Blue","Brown","Green","Total")
rownames(contTable)<-c("Blue","Brown","Green","Total")
contTable
## Blue Brown Green Total
## Blue 78 23 13 114
## Brown 19 23 12 54
## Green 11 9 16 36
## Total 108 55 41 204
propTable<-contTable/contTable[4,4]
propTable
## Blue Brown Green Total
## Blue 0.38235294 0.11274510 0.06372549 0.5588235
## Brown 0.09313725 0.11274510 0.05882353 0.2647059
## Green 0.05392157 0.04411765 0.07843137 0.1764706
## Total 0.52941176 0.26960784 0.20098039 1.0000000
P(male blue eyes)
#using the Contingency Table
contTable[1,4]/contTable[4,4]
## [1] 0.5588235
#using the Proportions Table
propTable[1,4]
## [1] 0.5588235
P(partner blue eyes)
#using the Contingency Table
contTable[4,1]/contTable[4,4]
## [1] 0.5294118
#using the Proportions Table
propTable[4,1]
## [1] 0.5294118
The probability of a male respondent having blue eyes or the partner having blue eyes can be calculated by: P(male blue) + P(partner blue) - P(male and partner blue)
#P(male blue or partner blue)
#using the Proportions Table
propTable[1,4]+propTable[4,1]-propTable[1,1]
## [1] 0.7058824
P(partner blue eye | male blue eye) = P(partner blue eye and male blue eye) / P(male blue eye)
#using the Contingency Table
(contTable[1,1]/contTable[4,4])/(contTable[1,4]/contTable[4,4])
## [1] 0.6842105
#using the Proportions Table
propTable[1,1]/propTable[1,4]
## [1] 0.6842105
P(partner blue | male brown eyes) = P(partner blue and male brown eyes) / P(male brown eyes)
#using the Contingency Table
contTable[2,1]/contTable[2,4]
## [1] 0.3518519
#using the Proportions Table
propTable[2,1]/propTable[2,4]
## [1] 0.3518519
P(partner blue | male green eyes) = P(partner blue and male green eyes) / P(male green eyes)
#using the Contingency Table
contTable[3,1]/contTable[3,4]
## [1] 0.3055556
#using the Proportions Table
propTable[3,1]/propTable[3,4]
## [1] 0.3055556
There seems to be some dependency: from the data, males and partners tend to pair up with the same eye color. We can see this in both tables: in each eye-color row, the highest value falls in the column corresponding to the same eye color. Keep in mind, though, that this is an observational study, not an experiment, so extrapolating these results should be done with caution.
#for each row, the column with the maximum value corresponds to the same eye color
which.max(contTable[1,1:3]) #Blue Eye
## Blue
## 1
which.max(contTable[2,1:3]) #Brown Eye
## Brown
## 2
which.max(contTable[3,1:3]) #Green Eye
## Green
## 3
which.max(propTable[1,1:3]) #Blue Eye
## Blue
## 1
which.max(propTable[2,1:3]) #Brown Eye
## Brown
## 2
which.max(propTable[3,1:3]) #Green Eye
## Green
## 3
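As an additional check (our own sketch, not part of the original exercise), a chi-squared test of independence on the 3x3 counts would quantify this association:
#chi-squared test of independence between male and partner eye color
#(exploratory only, since this is an observational study)
chisq.test(contTable[1:3,1:3])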
Books on a bookshelf. The table below shows the distribution of books on a bookcase based on whether they are non-fiction or fiction and hardcover or paperback.
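The table is not reproduced here, but it can be reconstructed from the counts used in the calculations below (13 hardcover fiction, 59 paperback fiction, 15 hardcover non-fiction, 8 paperback non-fiction, 95 books in total):
#bookshelf contingency table, reconstructed from the counts used below
books<-matrix(c(13,59,15,8),byrow = TRUE,ncol = 2,
              dimnames = list(c("Fiction","Non-fiction"),c("Hardcover","Paperback")))
addmargins(books)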
P(hardcover first) = 28/95 = 0.2947368
P(paperback fiction second | hardcover first) = 59/94 = 0.6276596
P(hardcover first, then paperback fiction second) = 0.2947368 * 0.6276596 = 0.1849944
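The same product in R as a quick check:
#hardcover first, then paperback fiction second, without replacement
(28/95)*(59/94)
## [1] 0.1849944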
P(fiction first) = 72/95 = 0.7578947
For the second draw we have two cases: either the first fiction book drawn was a hardcover, or it was a paperback.
P(hardcover second, without replacement, given the first fiction book was a hardcover) = 27/94 = 0.287234
P(hardcover second, without replacement, given the first fiction book was a paperback) = 28/94 = 0.2978723
To calculate the desired combined probability we weight each case by the probability of it happening. These weights are the conditional probabilities of the first book being a hardcover (13/72) or a paperback (59/72), given that it was fiction:
P(hardcover second and first fiction was hardcover) = (27/94)(13/72) = 0.0518617
P(hardcover second and first fiction was paperback) = (28/94)(59/72) = 0.2440898
We add these two probabilities since we want one case or the other and the two cases are mutually exclusive:
P(hardcover second | fiction first) = 0.0518617 + 0.2440898 = 0.2959515
Finally we multiply this by the probability of the first book being fiction:
P(fiction first and hardcover second) = 0.7578947 * 0.2959515 = 0.2243001
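The whole calculation can also be checked in R (variable names are our own):
#fiction first, then hardcover second, without replacement
pFictionFirst<-72/95
pHardcoverSecond<-(13/72)*(27/94)+(59/72)*(28/94)
pFictionFirst*pHardcoverSecond
## [1] 0.2243001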
We can also use a probability tree to calculate all the probabilities.
P(fiction first) = 72/95 = 0.7578947
P(hardcover second with replacement) = 28/95 = 0.2947368
P(fiction first and hardcover second with replacement) = P(fiction first) * P(hardcover second with replacement)
P(fiction first and hardcover second with replacement) = 0.7578947 * 0.2947368 = 0.2233795
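And the with-replacement product in R:
#fiction first, then hardcover second, with replacement
(72/95)*(28/95)
## [1] 0.2233795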
P(hardcover second | fiction first) = P(hardcover) because we replaced the first book, so the first and second draws are independent: what was drawn first does not affect the second draw.
The answers are very similar because the number of books is large enough that removing a single book makes little difference. Replacing the book or not barely changes the total number of books, so the with- and without-replacement probabilities are very similar.
Baggage fees. An airline charges the following baggage fees: $25 for the first bag and $35 for the second. Suppose 54% of passengers have no checked luggage, 34% have one piece of checked luggage and 12% have two pieces. We suppose a negligible portion of people check more than two bags.
The average revenue per passenger is the expected value E(X) = sum of x * P(X = x), and the variance is Var(X) = sum of (x - E(X))^2 * P(X = x); the code below builds these sums column by column.
fees<-c(0,25,35+25) #$0, $25, and $25+$35 = $60 for two bags
prob<-c(0.54,0.34,0.12)
df<-data.frame(fees,prob)
df$feeXprob<-df$fees*df$prob #x * P(x)
EV<-sum(df$feeXprob) #expected value
df$xmu<-df$fees-EV #x - mu
df$sqrxmu<-(df$xmu)^2 #(x - mu)^2
df$sqrxmuXprob<-df$sqrxmu*df$prob #(x - mu)^2 * P(x)
df
df
## fees prob feeXprob xmu sqrxmu sqrxmuXprob
## 1 0 0.54 0.0 -15.7 246.49 133.1046
## 2 25 0.34 8.5 9.3 86.49 29.4066
## 3 60 0.12 7.2 44.3 1962.49 235.4988
variance<-sum(df$sqrxmuXprob)
standardDev<-sqrt(variance)
#results
#Average revenue per passenger
EV
## [1] 15.7
#Standard deviation
standardDev
## [1] 19.95019
Assuming the baggage fee per passenger is an independent event, that is, one passenger won't check more or fewer bags based on how many bags other passengers check, both the expected values and the variances of the individual passengers add.
#Expected revenue for 120 passengers
EV120<-EV*120
EV120
## [1] 1884
#variance for 120 passengers: variances of independent random variables add, so Var(sum) = 120 * Var(X)
var120<-120*variance
var120
## [1] 47761.2
#standard deviation is the square root of the variance
sd120<-sqrt(var120)
sd120
## [1] 218.5434
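As a sanity check on these numbers, we can run a quick Monte Carlo sketch (our own addition; the exact values will vary from run to run):
#simulate the total revenue from 120 independent passengers, 10000 times
revenues<-replicate(10000,sum(sample(fees,120,replace = TRUE,prob = prob)))
mean(revenues) #should be close to 1884
sd(revenues) #should be close to 218.5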
Income and gender. The relative frequency table below displays the distribution of annual total personal income (in 2009 inflation-adjusted dollars) for a representative sample of 96,420,486 Americans. These data come from the American Community Survey for 2005-2009. This sample is comprised of 59% males and 41% females.
Looking at the distribution below we can see that it is slightly bimodal, with a peak at incomes of $35,000 to $49,999 and a smaller one at incomes of $100,000 or more.
income <- c("$1 - $9,999 or loss",
"$10,000 to $14,999",
"$15,000 to $24,999",
"$25,000 to $34,999",
"$35,000 to $49,999",
"$50,000 to $64,000",
"$65,000 to $74,999",
"$75,000 to $99,999",
"$100,000 or more")
total <- c(0.022, 0.047, 0.158, 0.183, 0.212, 0.139, 0.058, 0.084, 0.097)
df<- data.frame(income, total)
df
## income total
## 1 $1 - $9,999 or loss 0.022
## 2 $10,000 to $14,999 0.047
## 3 $15,000 to $24,999 0.158
## 4 $25,000 to $34,999 0.183
## 5 $35,000 to $49,999 0.212
## 6 $50,000 to $64,999 0.139
## 7 $65,000 to $74,999 0.058
## 8 $75,000 to $99,999 0.084
## 9 $100,000 or more 0.097
op <- par(mar=c(11,4,4,2)) #widen the bottom margin so the rotated labels fit
barplot(df$total,main='American Community Survey for 2005-2009',names.arg = df$income,las=3)
par(op) #restore the previous margins
The probability that a randomly chosen US resident makes less than $50,000 per year is the sum of the first five brackets:
sum(df$total[1:5])
## [1] 0.622
We are told the sample contains 59% males and 41% females. If we assume this split is the same for all income groups, that is, in each bin we have 59% males and 41% females, then the desired probability is:
sum(df$total[1:5])*0.41
## [1] 0.25502
The assumption made in c) is not valid, since the split of males and females is not the same across all income groups.
sampleSize<-96420486
malesFemalesUnder50k<-sampleSize*sum(df$total[1:5])
malesFemalesUnder50k
## [1] 59973542
#in c) 41% of these are females
malesFemalesUnder50k*.41
## [1] 24589152
#in d) however 71.8% of females make less than $50k, so the total number of females with income below $50k is not the same as in c)
totalFemales<-sampleSize*.41
totalMales<-sampleSize*.59
femalesUnder50k<-totalFemales*.718
femalesUnder50k
## [1] 28384263
# from which we can back-calculate the number of males under $50k and verify the total as a check
malesUnder50k<-malesFemalesUnder50k-femalesUnder50k
femalesUnder50k+malesUnder50k
## [1] 59973542