Akula-DATA606-Week02-Homework

Libraries used

#install.packages('VennDiagram')
#install.packages('pander')

library(grid)
library(futile.logger)
library(VennDiagram)
library(knitr)

Practice Questions:

Q 1: (2.5) Coin flips. If you flip a fair coin 10 times, what is the probability of

getting all tails?
getting all heads?
getting at least one tails?

A: Probability of fair coin landing heads or tails is 50%. And event is mutually exclusive, each coin flip will land hear or tail.

Since each event is independent P(C_T) = 1/2. Since coin is flipped 10 times it (1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2) = 1/1024 = 0.0009765625. This suggests probability of getting all tails is 0.098(rounded to 3^rd decimal).
Since each event is indecent P(C_H) = 1/2. Since coin is flipped 10 times it (1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2) = 1/1024 = 0.0009765625. This suggests probability of getting all heads is 0.098(rounded to 3^rd decimal).
Since coin is flipped 10 times, there are 10 chances of getting tails at least once. It can be derived by ¹⁰C₁ = 10!/1!(10 - 1)! = 10. On each flip chance is 1/2. So 10((1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)*(1/2)) = 0.009765625. This suggests probability of getting at least one tail is 0.98(rounded to 2^nd decimal).

Q 2: (2.7) Swing voters. A 2012 Pew Research survey asked 2,373 randomly sampled registered voters their political affiliation (Republican, Democrat, or Independent) and whether or not they identify as swing voters. 35% of respondents identified as Independent, 23% identified as swing voters, and 11% identified as both.

Are being Independent and being a swing voter disjoint, i.e. mutually exclusive?
Draw a Venn diagram summarizing the variables and their associated probabilities.
What percent of voters are Independent but not swing voters?
What percent of voters are Independent or swing voters?
What percent of voters are neither Independent nor swing voters?
Is the event that someone is a swing voter independent of the event that someone is a political Independent

# 11% of 2373 are swing and independent
# round((2372*.23) - 2373*11) are swing and non independent
# round((2372*.35) - (2373*11)) are non swing and indenpendent
# (2373 - (round(2373*11) + round((2372*.23) - 2373*11) + round((2372*.35) - 2373*11))) are non swing and non independent

voter.data <- matrix(c(round(2373*.11), round((2372*.23) - 2373*.11), round((2372*.35) - (2373*.11)), (2373 - (round(2373*.11) + round((2372*.23) - 2373*.11) + round((2372*.35) - 2373*.11)))),ncol=2,byrow=TRUE)

# calculate percent for swing and independent, swing and non independent, swing and non independent, non swing and indenpendent
voter.data.percentages <- round((voter.data/2373),2) 

# Attach column and row names
colnames(voter.data)<- c("Independent:Yes","Independent:No")
rownames(voter.data)<- c("Swing:Yes","Swing:No")

# Attach column and row names for percent also
colnames(voter.data.percentages)<- c("Independent:Yes","Independent:No")
rownames(voter.data.percentages)<- c("Swing:Yes","Swing:No")

# Calculate row and column sums
voter.data <- cbind(voter.data, Total = rowSums(voter.data))
voter.data <- rbind(voter.data, Total = colSums(voter.data))

# Calculate row and column sums
voter.data.percentages <- cbind(voter.data.percentages, Total = rowSums(voter.data.percentages))
voter.data.percentages <- rbind(voter.data.percentages, Total = colSums(voter.data.percentages))

voter.data

##           Independent:Yes Independent:No Total
## Swing:Yes             261            285   546
## Swing:No              569           1258  1827
## Total                 830           1543  2373

voter.data.percentages

##           Independent:Yes Independent:No Total
## Swing:Yes            0.11           0.12  0.23
## Swing:No             0.24           0.53  0.77
## Total                0.35           0.65  1.00

A: (a) If a random voter is picket from the population probability of being swing voter is 0.23. Probability of being independent voter is 0.35. Probability of of being swing and independent = 0.11. P(V_s) + P(V_i) = 0.23 * 0.35 = 0.08 Since the equation does not hold. Event being independent and being a swing is disjoint.

(b), (c), (d), (e)

# Percent of voters that are Idependent (no party affiliation) and decided whom to vote (Non Swing) : 0.24
# Percent of voters that are Idependent (no party affiliation) and not yet decided whom to vote (Swing) : 0.11
# Percent of voters that are Non Idependent (has party affiliation) and not yet decided whom to (Swing) : 0.12
# Percent of voters that are Non Idependent (has party affiliation) and decided whom to vote (Non Swing) : 0.53

grid.newpage()
# Venn diagram for Swing and Independent
draw.pairwise.venn(area1 = voter.data.percentages[1,3], area2 = voter.data.percentages[3,1], cross.area = voter.data.percentages[1,1], category = c("Swing","Independent"))

## (polygon[GRID.polygon.1], polygon[GRID.polygon.2], polygon[GRID.polygon.3], polygon[GRID.polygon.4], text[GRID.text.5], text[GRID.text.6], text[GRID.text.7], text[GRID.text.8], text[GRID.text.9])

Yes.

Q 3: (2.19) Burger preferences. A 2010 Survey USA poll asked 500 Los Angeles residents, “What is the best hamburger place in Southern California? Five Guys Burgers? In-N-Out Burger? Fat Burger? Tommy’s Hamburgers? Umami Burger? Or somewhere else?” The distribution of responses by gender is shown below

Are being female and liking Five Guys Burgers mutually exclusive?
What is the probability that a randomly chosen male likes In-N-Out the best?
What is the probability that a randomly chosen female likes In-N-Out the best?
What is the probability that a man and a woman who are dating both like In-N-Out the best? Note any assumption you make and evaluate whether you think that assumption is reasonable.
What is the probability that a randomly chosen person likes Umami best or that person is female?

A: (a) Yes. Being female and liking Five Guys Burger is mutually exclusive (b) Of 248 total males 162 like In-N-Out. Probability of liking is 162*100/248 = 65.32% (c) Of 252 total males 181 like In-N-Out. Probability of liking is 181*100/252 = 71.83% (d) Probability of men liking In-N-Out is 0.6532 and women liking In-N-Out 0.7183. In this case picking of second person is dependent on first person. First person being man or woman probability is 0.5. But if first person is man then second person has to be woman. And if first person is woman then second person has to be man. With assumptions probability of liking In-N-Out by man and woman who are dating is (0.6532*0.7183)*(0.5+0.5) = 0.4692. There is 46.92 percent chance that man and woman who are dating will like In-N-Out. (e) Of 500 people 6 like Umami Probability of liking is 6/500 = 0.012. Probability of being female is 252/500 = 0.504. There is one case were female liking Umami. Probability of that person being selected is 1/500 = 0.002. Since that person counted twice in female and Umami population it needs to subtracted. Probability of liking Umami or female is 0.012 + 0.504 - 0.002 = 0.514

Q 4: (2.29) Chips in a bag. Imagine you have a bag containing 5 red, 3 blue, and 2 orange chips.

Suppose you draw a chip and it is blue. If drawing without replacement, what is the probability the next is also blue?
Suppose you draw a chip and it is orange, and then you draw a second chip without replacement. What is the probability this second chip is blue?
If drawing without replacement, what is the probability of drawing two blue chips in a row?
When drawing without replacement, are the draws independent? Explain.

A: (a) Total chips = 5 + 3 + 2 = 10. Probability of drawing blue on first draw = 3/10. Since chip is not replaced probability of second event = 2/9. Probability of getting blue both times = (3/10)*(2/9) = 1/15 = 6.67% (b) On the second draw total chips = 5 + 3 + 1 = 9. Probability of drawing blue = 3/9 = 1/3 = 33.33% (c) 3/10*2/9 = 6.67% (d) Drawing without replacement is independent. It is called dependent probability.

Q 5: (2.43) Cat weights. The histogram shown below represents the weights (in kg) of 47 female and 97 male cats.

What fraction of these cats weigh less than 2.5 kg?
What fraction of these cats weigh between 2.5 and 2.75 kg?
What fraction of these cats weigh between 2.75 and 3.5 kg?

A: (a) (30 + 33)/Total cats = 63/144 (b) (20 + 27)/Total cats = 47/144 (c) (15 + 12)/Total cats = 27/144

Graded Questions

Q 1: (2.6) Dice rolls. If you roll a pair of fair dice, what is the probability of

getting a sum of 1?
getting a sum of 5?
getting a sum of 12?

A: (a) Zero. Because fair dice has a minimum value of 1. Hence sum cannot be equal to 1. (b) Sum of 5 can happen multiple ways. 1+4, 2+3, 3+2, 4+1. Hence probability of getting 5 is 4/36. (c) Sum of 12 can happen only one way. 6+6. Hence probability of getting 12 is 1/36.

Q 2: (2.8) Poverty and language. The American Community Survey is an ongoing survey that provides data every year to give communities the current information they need to plan investments and services. The 2010 American Community Survey estimates that 14.6% of Americans live below the poverty line, 20.7% speak a language other than English (foreign language) at home, and 4.2% fall into both categories.

Are living below the poverty line and speaking a foreign language at home disjoint?
Draw a Venn diagram summarizing the variables and their associated probabilities.
What percent of Americans live below the poverty line and only speak English at home?
What percent of Americans live below the poverty line or speak a foreign language at home?
What percent of Americans live above the poverty line and only speak English at home?
Is the event that someone lives below the poverty line independent of the event that the person speaks a foreign language at home

# 14.6% of population live below Poverty (0.146 percent)
# 20.7% speak No English (0.207 percent)
# 4.2% fall into both categories, speak No English and live below the poverty line (0.042 percent)
# Percent speak No English and live above poverty line = *(0.207 - 0.042)*
# Percent living above poverty line *(1 - 0.146)*
# Percent speaking English (1 - 0.207)
# Percent speaking English and live above poverty line = *Percent living above poverty line - Percent speak No English and live above poverty line*

population.data <- matrix(c(round((0.146-0.042),3), round((1-0.207-(0.146-0.042)),3), round(0.042,3), (round((0.207-0.042),3))),ncol=2,byrow=TRUE)

# Attach column and row names
colnames(population.data)<- c("Poverty:Yes","Poverty:No")
rownames(population.data)<- c("English:Yes","English:No")

# Calculate row and column sums
population.data <- cbind(population.data, Total = rowSums(population.data))
population.data <- rbind(population.data, Total = colSums(population.data))

population.data

##             Poverty:Yes Poverty:No Total
## English:Yes       0.104      0.689 0.793
## English:No        0.042      0.165 0.207
## Total             0.146      0.854 1.000

(a) If a random *person (family)* is picked from the population probability of living below the poverty line is 0.146. Probability of speaking a foreign language at home is 0.207. Probability of living below the poverty and speaking a foreign language at home = 0.042. *P(A~p~) + P(A~e~) = 0.146 \* 0.207 = 0.030* Since the equation does not hold. Event living below the poverty and speaking a foreign language at home is disjoint.
(b), (c), (d), (e)

# Percent of population living above the Poverty line and speaking a English language : 0.689
# Percent of population living below the Poverty line and speaking a English language : 0.104
# Percent of population living below the Poverty line and speaking a No English language : 0.042
# Percent of population living above the Poverty line and speaking a No English language : 0.165

grid.newpage()
# Venn diagram for Swing and Independent
draw.pairwise.venn(area1 = population.data[1,3], area2 = population.data[3,1], cross.area = population.data[1,1], category = c("English:Yes","Poverty:Yes"))

## (polygon[GRID.polygon.10], polygon[GRID.polygon.11], polygon[GRID.polygon.12], polygon[GRID.polygon.13], text[GRID.text.14], text[GRID.text.15], text[GRID.text.16], text[GRID.text.17], text[GRID.text.18])

Yes.

Q 3: (2.20) Assortative mating. Assortative mating is a nonrandom mating pattern where individuals with similar genotypes and/or phenotypes mate with one another more frequently than what would be expected under a random mating pattern. Researchers studying this topic collected data on eye colors of 204 Scandinavian men and their female partners. The table below summarizes the results. For simplicity, we only include heterosexual relationships in this exercise.65

What is the probability that a randomly chosen male respondent or his partner has blue eyes?
What is the probability that a randomly chosen male respondent with blue eyes has a partner with blue eyes?
What is the probability that a randomly chosen male respondent with brown eyes has a partner with blue eyes? What about the probability of a randomly chosen male respondent with green eyes having a partner with blue eyes?
Does it appear that the eye colors of male respondents and their partners are independent? Explain your reasoning.

A: (a) Probability of picking male with blue eyes = 114 / 204 = 0.5588. Probability of picking female with blue eyes = 108 / 204 = 0.5294. Probability of picking male and female with blue eyes = 77 / 204 = 0.3774. By applying additional rule probability of randomly chosen male respondent or his partner has blue eyes = (P_m) + (P_f) - (P_m and P_f) = (0.5588 + 0.5294 - 0.3774) = 0.7108. (b) Partner has blue eyes given random male has blue eyes. 78 / 114 = 0.6842 (c) Partner has blue eyes given random male has brown eyes. 19 / 54 = 0.3518. Partner has blue eyes given random male has green eyes. 11 / 36 = 0.3055 (d) By looking at the data it appears high percentage of male respondents and their partners have blue eyes. Hence it is not independent.

Q 4: (2.30) Books on a bookshelf. The table below shows the distribution of books on a bookcase based on whether they are nonfiction or fiction and hardcover or paperback.

Find the probability of drawing a hardcover book first then a paperback fiction book second when drawing without replacement.
Determine the probability of drawing a fiction book first and then a hardcover book second, when drawing without replacement.
Calculate the probability of the scenario in part (b), except this time complete the calculations under the scenario where the first book is placed back on the bookcase before randomly drawing the second book.
The final answers to parts (b) and (c) are very similar. Explain why this is the case.

Book data

book.data <- matrix(c(13, 59, 15, 8),ncol=2,byrow=TRUE)

# Attach column and row names
colnames(book.data)<- c("Format:Hardcover","Format:Paperback")
rownames(book.data)<- c("Type:Fiction","Type:Nonfiction")

# Calculate row and column sums
book.data <- cbind(book.data, Total = rowSums(book.data))
book.data <- rbind(book.data, Total = colSums(book.data))

book.data

##                 Format:Hardcover Format:Paperback Total
## Type:Fiction                  13               59    72
## Type:Nonfiction               15                8    23
## Total                         28               67    95

Probability of drawing a hardcover = Total hardcover books / Total books = 28/95 = 0.2947. Since second book is drawn without replacement probability of drawing a fiction = ((72 - 1) / (95 - 1)) = 0.7553. Probability of drawing a hardcover book first then a paperback fiction book second when drawing without replacement is (28/95 * ((72 - 1) / (95 - 1))) = 0.2226
Probability of drawing a fiction = Total fiction books / Total books = 72/95 = 0.7578. Since second book is drawn without replacement probability of drawing a hardcover = ((28 - 1) / (95 - 1)) = 0.2872. Probability of drawing a hardcover book first then a paperback fiction book second when drawing without replacement is (72/95 * ((28 - 1) / (95 - 1))) = 0.21769
Probability of drawing a fiction = Total fiction books / Total books = 72/95 = 0.7578. Since second book is drawn with replacement probability of drawing a hardcover = 28/95 = 0.2947. Probability of drawing a hardcover book first then a paperback fiction book second when drawing with replacement is (72/95 * (28/95)) = 0.2233
When drawing books without replacement two events were dependent. Result of the first draw will impact second draw. Where as drawing with replacement two events are independent. Results of the first draw will not have any effect on second draw. Hence probabilities are slightly different.

Q: 5: (2.38) Baggage fees. An airline charges the following baggage fees: $25 for the first bag and $35 for the second. Suppose 54% of passengers have no checked luggage, 34% have one piece of checked luggage and 12% have two pieces. We suppose a negligible portion of people check more than two bags.

Build a probability model, compute the average revenue per passenger, and compute the corresponding standard deviation.
About how much revenue should the airline expect for a flight of 120 passengers? With what standard deviation? Note any assumptions you make and if you think they are justified.

A: (a) Lets assume there will be n number of passengers. No fees is collected if zero bags are checked in, 1 bag $25 and 2 bags $60($25 + $35)

Number of passengers with zero bags checked = n*0.54

Number of passengers with one bags checked = n*0.34

Number of passengers with two bags checked = n*0.12

Fees collected from passengers with zero bags checked = n*0.54*0

Fees collected from passengers with one bags checked = n*0.34*25

Fees collected from passengers with one bags checked = n*0.12*60

Expected revenue from fees is (n*0.54*0) + (n*0.54*25) + (n*0.54*60)

Average revenue per passenger is m = ((n*0.54*0) + (n*0.54*25) + (n*0.54*60)) / n

Standard deviation,
1. Calculate difference in revenues per passenger based on mean(average m) (((n*[percentage of bag checkers]*0) - m) for each percentage class(0.54, 0.34, 0.12)

Calculate square for each above values as (((n*[percentage of bag chekers]*0) - m)² for each percentage class(0.54, 0.34, 0.12)
Multiple values from step 2 with baggage checker percentages (((n*[percentage of bag chekers]*0) - m)² * [percentage of bag chekers]
Sum all the values from step 3 and find the square root to get standard deviation.

#Total Passangers = 120
passengers.data <- matrix(c(round(120 * 0.54), 0.54, 0, 0, round(120 * 0.34), 0.34, 1, 25, round(120 * 0.12), 0.12, 2, 25+35),ncol=4,byrow=TRUE)
                          
# Attach column and row names
colnames(passengers.data)<- c("Passengers","Percentage","NumberOfBags","Fees")

# Total fees collected
passengers.data <- cbind(passengers.data, TotalFeesCollected = passengers.data[,1] * passengers.data[,4])

# Create a table
passenger.table <- data.frame(passengers.data)

#Average fee
totalRevenuePerPassanger <- sum(passenger.table$TotalFeesCollected) / 120 #Average fee

#Calculate difference in fee from average
passenger.table$DiffInFee <- passenger.table$Fees - totalRevenuePerPassanger

#Square the difference
passenger.table$SqDiff <- passenger.table$DiffInFee^2

#Calulate weighed difference fo squared value
passenger.table$WtDiff <- passenger.table$SqDiff * passenger.table$Percentage

#Get sum of weighted differences. Also known as variance
sumWtDiff <- sum(passenger.table$WtDiff)

#Standard deviation
sdValue <- sqrt(sumWtDiff)

Total revenues : $1865

Average revenue per passenger : $15.54

Standard deviation : $20

Q 6: (2.44) Income and gender. The relative frequency table below displays the distribution of annual total personal income (in 2009 inflation-adjusted dollars) for a representative sample of 96,420,486 Americans. These data come from the American Community Survey for 2005-2009. This sample is comprised of 59% males and 41% females.

Describe the distribution of total personal income.
What is the probability that a randomly chosen US resident makes less than $50,000 per year?
What is the probability that a randomly chosen US resident makes less than$50,000 per year and is female? Note any assumptions you make.
The same data source indicates that 71.8% of females make less than $50,000 per year. Use this value to determine whether or not the assumption you made in part (c) is valid.

A: (a) Distribution of total personal income looks like bell curve, with peak around $35,000 to $49,999 income range.

Data shows percentage of people making less than $50,000 is 2.2 + 4.7 + 15.8 + 18.3 + 21.2 = 62.2%. So probability is 0.622.
If person is picked randomly probability of male or female is 0.5. From earlier problem it is established probability of picking a person making less than $50,000 is 0.622. Probability of picking female making less $50,000 is 0.622*0.5 = 0.311.
If 71.8% females are making less than $50,000, means gender distribution is uneven. Probability of picking a female may go up a little. Assuming person is picked at random my assumption from above problem still holds.

Akula-DATA606-Week02-Homework

Pavan Akula

February 11, 2017