Data Science Math Week 2 Assignment

Problem 1: Dice Rolls

If you roll a pair of fair dice, what is the probability of:

(a) getting a sum of 1?
(b) getting a sum of 5?
(c) getting a sum of 12?

Answer 1:

Let’s define each die:

Die 1: {1,2,3,4,5,6} Die 2: {1,2,3,4,5,6}

Let’s define the possible sums of the faces when we roll the dice. Here sumN lists the sums when the first die shows N and the second die runs from 1 to 6.

sum1: {2,3,4,5,6,7}
sum2: {3,4,5,6,7,8}
sum3: {4,5,6,7,8,9}
sum4: {5,6,7,8,9,10}
sum5: {6,7,8,9,10,11}
sum6: {7,8,9,10,11,12}

Let’s create a matrix for each sum so we can visualize and answer each question:

sum1 <- c(2,3,4,5,6,7)
sum2 <- c(3,4,5,6,7,8)
sum3 <- c(4,5,6,7,8,9)
sum4 <- c(5,6,7,8,9,10)
sum5 <- c(6,7,8,9,10,11)
sum6 <- c(7,8,9,10,11,12)

dice <- matrix(c(sum1,sum2,sum3,sum4,sum5,sum6), nrow=6, byrow=TRUE)
dice

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    3    4    5    6    7
## [2,]    3    4    5    6    7    8
## [3,]    4    5    6    7    8    9
## [4,]    5    6    7    8    9   10
## [5,]    6    7    8    9   10   11
## [6,]    7    8    9   10   11   12

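Possible sums: {2,3,4,5,6,7,8,9,10,11,12}

Let’s also define the probability of each sum. There are 36 equally likely outcomes, so each probability is the number of cells in the matrix that show that sum, divided by 36.

prob: {1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36}

Let’s store the possible sums and their probabilities as vectors.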

# possible sums vector

sum <- c(2,3,4,5,6,7,8,9,10,11,12)
sum

##  [1]  2  3  4  5  6  7  8  9 10 11 12

# probability of these sums vector

prob<-c(1/36,2/36,3/36,4/36,5/36,6/36,5/36,4/36,3/36,2/36,1/36)
prob

##  [1] 0.02777778 0.05555556 0.08333333 0.11111111 0.13888889 0.16666667
##  [7] 0.13888889 0.11111111 0.08333333 0.05555556 0.02777778

We can look at the probability distribution to see the probability of getting sum of 1, sum of 5 and sum of 12.

prob.dist <- cbind(sum, prob)
prob.dist

##       sum       prob
##  [1,]   2 0.02777778
##  [2,]   3 0.05555556
##  [3,]   4 0.08333333
##  [4,]   5 0.11111111
##  [5,]   6 0.13888889
##  [6,]   7 0.16666667
##  [7,]   8 0.13888889
##  [8,]   9 0.11111111
##  [9,]  10 0.08333333
## [10,]  11 0.05555556
## [11,]  12 0.02777778

We can also visualize this with a bar graph.

barplot(prob, names.arg=sum, main = 'Probability of dice sums', xlab='Sum of two Dice', col='blue')

Based on our probability distribution, the answers are:

(a) The probability of getting a sum of 1 is 0, because the smallest possible sum is 2.

(b) The probability of getting a sum of 5 is 4/36 ≈ 0.111.

(c) The probability of getting a sum of 12 is 1/36 ≈ 0.028.
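As a cross-check, the same probabilities can be computed by enumerating all 36 outcomes instead of typing the fractions by hand. This is only a minimal sketch; `die`, `sums` and `p_sum` are names introduced here for illustration and are not part of the solution above.

```r
# Enumerate the 36 equally likely outcomes of rolling two fair dice
die <- 1:6
sums <- outer(die, die, "+")   # 6 x 6 matrix of sums, same layout as the dice matrix

# Probability of a given sum = share of the 36 cells that equal it
p_sum <- function(s) mean(sums == s)

p_sum(1)    # no cell equals 1, so this is 0
p_sum(5)    # 4 of the 36 cells
p_sum(12)   # 1 of the 36 cells
```

Counting cells this way avoids transcription mistakes in the hand-typed `prob` vector.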

Problem 2. School absences.

Data collected at elementary schools in DeKalb County, GA suggest that each year roughly 25% of students miss exactly one day of school, 15% miss 2 days, and 28% miss 3 or more days due to sickness.

(a) What is the probability that a student chosen at random doesn’t miss any days of school due to sickness this year?
(b) What is the probability that a student chosen at random misses no more than one day?
(c) What is the probability that a student chosen at random misses at least one day?
(d) If a parent has two kids at a DeKalb County elementary school, what is the probability that neither kid will miss any school? Note any assumption you must make to answer this question.
(e) If a parent has two kids at a DeKalb County elementary school, what is the probability that both kids will miss some school, i.e. at least one day? Note any assumption you make.
(f) If you made an assumption in part (d) or (e), do you think it was reasonable? If you didn’t make any assumptions, double check your earlier answers.

Answer 2:

Let’s call:

P(A) - Probability of missing exactly 1 day of school: 25/100 = 0.25
P(B) - Probability of missing 2 days of school: 15/100 = 0.15
P(C) - Probability of missing 3 or more days of school: 28/100 = 0.28
P(D) - Probability of missing 0 days of school, which we need to figure out.

The total probability must equal 1 (100%). These four events are mutually exclusive and cover every case, so we add the given probabilities and subtract the sum from 1 to find P(D), the probability of picking a student who didn’t miss any days of school.

a <- 0.25
b <- 0.15
c <- 0.28

d <- 1-(0.25+0.15+0.28)
d

## [1] 0.32

The answer of (a) is 0.32 (32%)

In question (b) we need the probability that a student misses no more than one day, i.e. 0 or 1 days. This is the complement of missing more than one day (2 days, or 3 or more days), so we subtract those probabilities from the total probability, which is 1.

$1 - P(B \cup C)$

Since B and C are mutually exclusive, this equals 1 - (P(B) + P(C)).

e <- 1-(b + c)
e

## [1] 0.57

The answer of (b) is 0.57
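The answer to (b) can be reached two ways: directly, by adding the probabilities of missing 0 and 1 days, or through the complement. The sketch below checks that both routes agree; the variable names are illustrative and chosen here so they do not overwrite `a`, `b`, `c` and `d` above.

```r
# Given probabilities from the problem statement
p_one        <- 0.25   # misses exactly 1 day
p_two        <- 0.15   # misses exactly 2 days
p_three_plus <- 0.28   # misses 3 or more days
p_zero       <- 1 - (p_one + p_two + p_three_plus)

# Route 1: add the two cases that count as "no more than one day"
direct <- p_zero + p_one

# Route 2: subtract the cases that do not count
complement <- 1 - (p_two + p_three_plus)

# all.equal() compares with a small tolerance, which is safer than == for decimals
all.equal(direct, complement)
```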

In question (c) we are asked for the probability that a randomly chosen student misses at least 1 day.

That is the sum of the three mutually exclusive cases (equivalently, $1-P(D)$):

$P(A)+P(B)+P(C)$

f <- a+b+c
f

## [1] 0.68

The answer of (c) is 0.68

In question (d) we want the probability that neither kid will miss any school.

The probability that the first kid misses no school is P(D), and the probability that the second kid misses no school is also P(D). We can apply the multiplication rule.

$P(D)*P(D)$

We are assuming that the two events are independent: whether one kid misses school does not change the probability that the other kid does.

g <- d*d
g

## [1] 0.1024

The answer of (d) is 0.1024

In question (e) we want the probability that both kids will miss some school.

Similar to our approach in question (d), we start from the probability that a student misses at least one day of school, which is

$1-P(D) = 0.68$ (68%). We have two kids, so we multiply the probabilities: $0.68*0.68$

The same independence assumption that we made in question (d) applies here as well.

h <- 0.68 * 0.68
h

## [1] 0.4624

The answer of (e) is 0.4624
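Note that the answers to (d) and (e) do not add up to 1, because there is a third case in which exactly one of the two kids misses school. The sketch below, which still assumes the two kids are independent, checks that the three cases together cover everything. The variable names are illustrative.

```r
p_zero <- 0.32          # P(a kid misses no school), from part (a)
p_some <- 1 - p_zero    # P(a kid misses at least one day)

p_neither     <- p_zero^2              # part (d)
p_both        <- p_some^2              # part (e)
p_exactly_one <- 2 * p_zero * p_some   # either kid could be the one who misses

# The three mutually exclusive cases should sum to 1
p_neither + p_both + p_exactly_one
```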

The answer of question (f) is: the independence assumption makes the calculation simple, but it is probably not fully reasonable. Both kids live in the same household, so an illness that keeps one of them home is likely to spread to the other, which means their absences are not really independent.

Question 3

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey designed to identify risk factors in the adult population and report emerging health trends. The following table displays the distribution of health status of respondents to this survey (excellent, very good, good, fair, poor) and whether or not they have health insurance.

mat=matrix(c(.023, 0.0364, 0.0427, 0.0192, 0.0050,0.2099, 0.3123 ,0.2410 ,0.0817,0.0289), byrow=TRUE, nrow=2)
colnames(mat)=c("Excellent", "Very Good","Good", "Fair","Poor")
rownames(mat)=c("No Coverage","Coverage")
mat

##             Excellent Very Good   Good   Fair   Poor
## No Coverage    0.0230    0.0364 0.0427 0.0192 0.0050
## Coverage       0.2099    0.3123 0.2410 0.0817 0.0289

(a) Are being in excellent health and having health coverage mutually exclusive?
(b) What is the probability that a randomly chosen individual has excellent health?
(c) What is the probability that a randomly chosen individual has excellent health given that he has health coverage?
(d) What is the probability that a randomly chosen individual has excellent health given that he doesn’t have health coverage?
(e) Do having excellent health and having health coverage appear to be independent?

Answer 3:

Let’s define the probabilities we need: the probability of coverage, of no coverage, and of each health status (excellent, very good, good, fair and poor).

P(COVERAGE) P(NO COVERAGE) P(EXCELLENT) P(VERY GOOD) P(GOOD) P(FAIR) P(POOR)

To do this, let’s look at our matrix.

mat[1,]

## Excellent Very Good      Good      Fair      Poor 
##    0.0230    0.0364    0.0427    0.0192    0.0050

These are the values for no coverage, we can do the same for coverage row. We can essentially add a Total column to find the total value for each row and column. This way we can find out the probabilities that are in question.

total_no_coverage <- sum(mat[1,])
total_coverage <- sum(mat[2,])
total_excellent <- sum(mat[,1])
total_very_good <- sum(mat[,2])
total_good <- sum(mat[,3])
total_fair <- sum(mat[,4])
total_poor <- sum(mat[,5])
total <- total_coverage + total_no_coverage

total_no_coverage

## [1] 0.1263

total_coverage

## [1] 0.8738

total_excellent

## [1] 0.2329

total_very_good

## [1] 0.3487

total_good

## [1] 0.2837

total_fair

## [1] 0.1009

total_poor

## [1] 0.0339

Let’s add these values to our matrix and create a new matrix with total values.

mat_new=matrix(c(0.0230, 0.0364, 0.0427, 0.0192, 0.0050, total_no_coverage, 0.2099, 0.3123 ,0.2410 ,0.0817,0.0289, total_coverage, total_excellent, total_very_good, total_good, total_fair, total_poor, total), byrow=TRUE, nrow=3)
colnames(mat_new)=c("Excellent", "Very Good", "Good Health", "Fair Health", "Poor Health", "Total")
rownames(mat_new)=c("No Coverage", "Coverage", "Total")
mat_new

##             Excellent Very Good Good Health Fair Health Poor Health  Total
## No Coverage    0.0230    0.0364      0.0427      0.0192      0.0050 0.1263
## Coverage       0.2099    0.3123      0.2410      0.0817      0.0289 0.8738
## Total          0.2329    0.3487      0.2837      0.1009      0.0339 1.0001
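The totals can also be attached without retyping the cell values, using base R's `rowSums()` and `colSums()`. This is a sketch of an alternative to the hand-built `mat_new`; the names `with_row_totals` and `mat_totals` are introduced here for illustration.

```r
# Joint distribution of coverage and health status (same values as mat above)
mat <- matrix(c(0.0230, 0.0364, 0.0427, 0.0192, 0.0050,
                0.2099, 0.3123, 0.2410, 0.0817, 0.0289),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("No Coverage", "Coverage"),
                              c("Excellent", "Very Good", "Good", "Fair", "Poor")))

# Append a Total column (row sums), then a Total row (column sums)
with_row_totals <- cbind(mat, Total = rowSums(mat))
mat_totals <- rbind(with_row_totals, Total = colSums(with_row_totals))
mat_totals
```

The grand total comes out as 1.0001 instead of exactly 1, presumably because the table entries are rounded to four decimals.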

In question (a) we are asked whether excellent health and health coverage are mutually exclusive. Basically, can an adult have excellent health and health coverage at the same time? We need to look at the joint probability of excellent health and coverage.

Is $P(EXCELLENT \cap COVERAGE) = 0$?

Based on our new matrix mat_new, $P(EXCELLENT \cap COVERAGE)$ is 0.2099, which is greater than 0.

The answer of (a) is that excellent health and having health coverage are not mutually exclusive.

In question (b), we need to find the probability of excellent health.

Based on the Total row of our new matrix mat_new,

the answer of (b), the probability of excellent health, is 0.2329.

In question (c), we need to find the probability of excellent health given that the person has health coverage.

$P(EXCELLENT \mid COVERAGE) = P(EXCELLENT \cap COVERAGE) / P(COVERAGE)$

This means:

prob_excellent_coverage <- 0.2099/0.8738
prob_excellent_coverage

## [1] 0.2402152

The answer of (c) is 0.2402152

In question (d), we are asked for the probability of excellent health given that the person doesn’t have health coverage.

$P(EXCELLENT \mid NO\ COVERAGE) = P(EXCELLENT \cap NO\ COVERAGE) / P(NO\ COVERAGE)$

This means:

prob_excellent_no_coverage <- 0.0230/0.1263
prob_excellent_no_coverage

## [1] 0.1821061

The answer of (d) is 0.1821061

In question (e), we need to find out whether excellent health and having health coverage appear to be independent.

If they were independent, we would have $P(EXCELLENT \cap COVERAGE) = P(EXCELLENT)*P(COVERAGE)$.

result <- 0.2329 * 0.8738 # result is P(EXCELLENT)*P(COVERAGE)
result

## [1] 0.203508

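# result does not equal P(EXCELLENT and COVERAGE), which is 0.2099

The answer of (e) is that excellent health and having health coverage do NOT appear to be independent. This agrees with (c) and (d): the probability of excellent health is about 0.24 with coverage but about 0.18 without it.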
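The conditional probabilities and the independence check can also be read straight from the matrix by row and column name, which avoids retyping rounded totals. This is a hedged sketch; the `p_*` names are introduced here for illustration.

```r
# Joint distribution of coverage and health status (same values as mat above)
mat <- matrix(c(0.0230, 0.0364, 0.0427, 0.0192, 0.0050,
                0.2099, 0.3123, 0.2410, 0.0817, 0.0289),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("No Coverage", "Coverage"),
                              c("Excellent", "Very Good", "Good", "Fair", "Poor")))

p_excellent <- sum(mat[, "Excellent"])      # marginal: column total
p_coverage  <- sum(mat["Coverage", ])       # marginal: row total
p_joint     <- mat["Coverage", "Excellent"] # P(EXCELLENT and COVERAGE)

# Conditional probability = joint probability / probability of the condition
p_excellent_given_coverage    <- p_joint / p_coverage
p_excellent_given_no_coverage <- mat["No Coverage", "Excellent"] / sum(mat["No Coverage", ])

# Under independence the joint probability would equal the product of the marginals
c(joint = p_joint, product = p_excellent * p_coverage)
```

Because the table is survey data with rounded entries, a small gap between the two numbers would not be conclusive on its own; here the gap and the difference between the two conditional probabilities point the same way.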

Question 4. Exit Poll

Edison Research gathered exit poll results from several sources for the Wisconsin recall election of Scott Walker. They found that 53% of the respondents voted in favor of Scott Walker. Additionally, they estimated that of those who did vote in favor for Scott Walker, 37% had a college degree, while 44% of those who voted against Scott Walker had a college degree. Suppose we randomly sampled a person who participated in the exit poll and found that he had a college degree. What is the probability that he voted in favor of Scott Walker?

Answer 4.

P(SCOTT) = 0.53: 53% of the respondents voted for Scott Walker.

P(COLLEGE | SCOTT) = 0.37: 37% of those who voted for Scott Walker had a college degree.

P(COLLEGE | NO_SCOTT) = 0.44: 44% of those who voted against Scott Walker had a college degree.

prob_scott <-0.53 
prob_no_scott <- 1-prob_scott # whoever did not vote Scott , voted against him.
prob_no_scott

## [1] 0.47

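0.47 (47%) voted against Scott Walker. 0.37 (37%) of Scott Walker voters have a college degree, and 0.44 (44%) of non Scott Walker voters have a college degree. Multiplying each conditional probability by the share of voters gives the joint probabilities: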

prob_scott_college <- prob_scott * 0.37
prob_scott_college

## [1] 0.1961

prob_no_scott_college <- prob_no_scott * 0.44
prob_no_scott_college

## [1] 0.2068

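By Bayes’ theorem, P(SCOTT | COLLEGE) = prob_scott_college / prob_total, where prob_total = prob_scott_college + prob_no_scott_college is the overall probability of having a college degree.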

p <- (prob_scott_college)/((prob_scott_college)+(prob_no_scott_college))
p

## [1] 0.4867213

The answer is that the probability that the person with a college degree voted for Scott Walker is 0.4867213, about 48.7%.
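The calculation above is an instance of Bayes' theorem for two groups, so it can be wrapped in a small helper. The function `bayes_posterior` and its argument names are hypothetical, introduced here only to make the structure of the calculation explicit.

```r
# Posterior probability of a group given some evidence, for a two-group split.
#   prior:            P(group)
#   lik_in_group:     P(evidence | group)
#   lik_out_of_group: P(evidence | not group)
bayes_posterior <- function(prior, lik_in_group, lik_out_of_group) {
  joint_in  <- prior * lik_in_group             # P(group and evidence)
  joint_out <- (1 - prior) * lik_out_of_group   # P(not group and evidence)
  joint_in / (joint_in + joint_out)             # divide by P(evidence)
}

# P(voted for Walker | college degree)
bayes_posterior(prior = 0.53, lik_in_group = 0.37, lik_out_of_group = 0.44)
```

Even though Walker voters are the majority (53%), the posterior falls below 50% because a college degree was more common among those who voted against him.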

Question 5

The table below shows the distribution of books on a bookcase based on whether they are nonfiction or fiction and hardcover or paperback.

mymat2=matrix(c(13,59,15,8),nrow=2,byrow=TRUE)
colnames(mymat2)=c("hard","paper")
rownames(mymat2)=c("fiction","nonfiction")


mymat2

##            hard paper
## fiction      13    59
## nonfiction   15     8

(a) Find the probability of drawing a hardcover book first then a paperback fiction book second when drawing without replacement.
(b) Determine the probability of drawing a fiction book first and then a hardcover book second, when drawing without replacement.
(c) Calculate the probability of the scenario in part (b), except this time complete the calculations under the scenario where the first book is placed back on the bookcase before randomly drawing the second book.
(d) The final answers to parts (b) and (c) are very similar. Explain why this is the case.

Answer 5

There are 13 + 15 = 28 hard books

There are 59 + 8 = 67 paper books

There are total of 28 + 67 = 95 books

There are total of 13+59= 72 fiction books

There are total of 15+8=23 non fiction books

total_books <- 13+59+15+8
hard_books <- 13+15
paper_books <- 59 + 8
fiction_books <- 13+59
nonfiction_books <-15+8

total_books

## [1] 95

hard_books

## [1] 28

paper_books

## [1] 67

fiction_books

## [1] 72

nonfiction_books

## [1] 23

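For the probability of drawing a hardcover book first and then a paperback fiction book, we can use the general multiplication rule below. Because we draw without replacement, only 94 books remain for the second draw; all 59 paperback fiction books are still there, since the first book drawn was a hardcover.

$P(A \cap B)=P(A) \times P(B \mid A)$

Probability of a hardcover book on the first draw: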

p_hardcover <- hard_books/total_books
p_hardcover

## [1] 0.2947368

n_paperback_fiction <- 59 # number of paperback fiction books (a count, not a probability)
n_paperback_fiction

## [1] 59

= (hard_books / total_books)*(n_paperback_fiction/(total_books-1))

probability_1 <- (hard_books / total_books)*(n_paperback_fiction/(total_books-1))
probability_1

## [1] 0.1849944

The answer of (a) is 0.1849944
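A quick Monte Carlo simulation can serve as a sanity check on part (a). This is only a sketch: the labels in `books`, the seed and the number of simulations are arbitrary choices made here, and the simulated proportion will only approximate the exact answer.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# One label per book on the bookcase
books <- c(rep("hard_fiction", 13), rep("paper_fiction", 59),
           rep("hard_nonfiction", 15), rep("paper_nonfiction", 8))

n_sims <- 100000
hits <- replicate(n_sims, {
  draw <- sample(books, 2)  # two draws without replacement (the default)
  # success: first book is any hardcover, second book is paperback fiction
  startsWith(draw[1], "hard") && draw[2] == "paper_fiction"
})

mean(hits)  # should land close to the exact value (28/95) * (59/94)
```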

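In question (b) we need the probability of drawing a fiction book first and then a hardcover book second (without replacement). The approach is similar, but the number of hardcover books left for the second draw depends on whether the first fiction book was a hardcover (13 of the 72) or a paperback (59 of the 72), so we add the two cases.

probability_2 <- (13/total_books)*((hard_books-1)/(total_books-1)) + (59/total_books)*(hard_books/(total_books-1))
probability_2

## [1] 0.2243001

The answer of (b) is 0.2243001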

In question (c) we need the same scenario as (b), but we put the first book we drew back on the bookcase. So we do not subtract one from the total number of books, and the first draw does not affect the second.

probability_3 <- (fiction_books/total_books)*(hard_books/total_books)
probability_3

## [1] 0.2233795

The answer of (c) is 0.2233795

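The answer of (d): They are similar because drawing without replacement only removes 1 book out of 95, so the makeup of the bookcase barely changes between the two draws. 95 is not a very small number; with a small bookcase (such as 4 to 6 books), removing one book would have a much bigger impact.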

Question 6. Is it worth it?

Andy is always looking for ways to make money fast. Lately, he has been trying to make money by gambling. Here is the game he is considering playing: The game costs 2 dollars to play. He draws a card from a deck. If he gets a number card (2-10), he wins nothing. For any face card (jack, queen or king), he wins 3 dollars. For any ace, he wins 5 dollars and he wins an extra $20 if he draws the ace of clubs.

(a) Create a probability model and find Andy’s expected profit per game.
(b) Would you recommend this game to Andy as a good way to make money? Explain.

Answer 6.

There are total 52 cards in a deck.

There are 4 jacks, 4 queens, 4 kings.

There are 4 aces

There is only one ace of clubs.

Number cards (2-10): 9*4=36

total_cards <- 52 
total_jacks <- 4
total_queens <-4
total_kings <- 4
total_numbers <- 4*9
total_club_ace <- 1
total_face <- total_jacks + total_kings + total_queens
total_ace <- total_club_ace + 3

total_cards

## [1] 52

total_jacks

## [1] 4

total_queens

## [1] 4

total_kings

## [1] 4

total_numbers

## [1] 36

total_club_ace

## [1] 1

total_face

## [1] 12

total_ace

## [1] 4

Let’s outline the probabilities.

probability_numbers <- total_numbers/total_cards # number cards (2-10), he wins nothing
probability_face <- total_face/total_cards # jack, queen or king, he wins $3
probability_ace <- (total_ace - total_club_ace)/total_cards # any ace except the ace of clubs, he wins $5
probability_club_ace <- total_club_ace/total_cards # ace of clubs, he wins $5 plus an additional $20

probability_numbers

## [1] 0.6923077

probability_face

## [1] 0.2307692

probability_ace

## [1] 0.05769231

probability_club_ace

## [1] 0.01923077

Let’s define the profit model.

# multiply the probabilities with the dollar amount for each probability win and add them up.

andy_makes<- 0 * probability_numbers + 3* probability_face + 5*probability_ace + 25*probability_club_ace
andy_makes

## [1] 1.461538

# let's not forget he had to pay $2 to play the game

andy_profit <- andy_makes -2
andy_profit

## [1] -0.5384615

The answer of (a) is -0.5384615

In question (b), we are asked whether we would recommend this game to Andy. If Andy’s expected profit were greater than 0, I would recommend that he play. But based on our profit model he loses about $0.54 per game on average, so in the long run he will lose money, and I won’t recommend that he play.
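The probability model can also be laid out as a table with one row per outcome, which makes it easy to confirm that the probabilities sum to 1 and to compute the spread of the profit as well as its mean. This is a sketch; the data frame `model` and its column names are introduced here for illustration.

```r
# One row per outcome of a single draw from a standard 52-card deck
model <- data.frame(
  outcome  = c("number card (2-10)", "face card", "ace (not clubs)", "ace of clubs"),
  winnings = c(0, 3, 5, 25),          # dollars won; the ace of clubs pays 5 + 20
  prob     = c(36, 12, 3, 1) / 52
)
model$profit <- model$winnings - 2    # the game costs $2 to play

sum(model$prob)                       # sanity check: should be 1

# Expected profit and standard deviation of the profit per game
expected_profit <- sum(model$profit * model$prob)
sd_profit <- sqrt(sum((model$profit - expected_profit)^2 * model$prob))

expected_profit
sd_profit
```

The standard deviation is large compared with the expected loss, so Andy may well win in any single game, but the negative expected profit is what matters over many games.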

Question 8. Scooping ice cream

Ice cream usually comes in 1.5 quart boxes (48 fluid ounces), and ice cream scoops hold about 2 ounces. However, there is some variability in the amount of ice cream in a box as well as the amount of ice cream scooped out. We represent the amount of ice cream in the box as X and the amount scooped out as Y . Suppose these random variables have the following means, standard deviations, and variances:

mymat3=matrix(c(48,1,1, 2,.25,.0625), nrow=2, byrow=TRUE)
colnames(mymat3)=c("mean", "SD", "Var")
rownames(mymat3)=c("X, In Box","Y, Scooped")
mymat3

##            mean   SD    Var
## X, In Box    48 1.00 1.0000
## Y, Scooped    2 0.25 0.0625

(a) An entire box of ice cream, plus 3 scoops from a second box is served at a party. How much ice cream do you expect to have been served at this party? What is the standard deviation of the amount of ice cream served?
(b) How much ice cream would you expect to be left in the box after scooping out one scoop of ice cream? That is, find the expected value of X - Y. What is the standard deviation of the amount left in the box?
(c) Using the context of this exercise, explain why we add variances when we subtract one random variable from another.

Answer 8.

Box of ice cream <- 48 ounces

Scoop of ice cream <- 2 ounces

3 scoops of ice cream in theory <- 3*2=6 ounces

party_icecream <- 48 +3*2
party_icecream # amount of ice cream served at the party

## [1] 54

54 ounces of ice cream served at the party.

The standard deviation is the square root of the variance.

Assuming the box and the scoops are independent, the variance of the ice cream served at the party is 1 (one full box) + 3*0.0625 (three scoops).

Let’s calculate that:

party_variance <- 1+(3*0.0625)
party_sd <- sqrt(party_variance) # standard deviation of the ice cream served at the party
party_sd

## [1] 1.089725

The answer of (a) is 54 ounces of ice cream served at the party and standard deviation is 1.089725 for the ice cream that is served at the party.

If we take one scoop of ice cream from the box we will have 48 ounces - 2 ounces ice cream which is 46 ounces of ice cream left.

X_left <- 48-2
X_left

## [1] 46

We can take the same approach as earlier. The standard deviation is the square root of the variance. variance_after_one_scoop = 1+0.0625

variance_after_one_scoop <- 1+0.0625
standard_deviation_after_one_scoop <- sqrt(variance_after_one_scoop)
standard_deviation_after_one_scoop

## [1] 1.030776

The answer of (b) is 46 ounces of ice cream left, and the standard deviation of the amount left in the box is 1.030776

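Question (c) asks us to explain why we add variances even though we are subtracting one random variable from another.

The answer of (c) is that the variance increases regardless of whether we add to or subtract from the total amount. The amount in the box and the amount scooped out both vary independently, so after scooping we are uncertain about both, and the two sources of variability accumulate instead of cancelling. So in this case the variance went from 1 to 1.0625.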
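A small simulation can illustrate why the variances add. The exercise only gives means and standard deviations, so the sketch below assumes, purely for illustration, that X and Y are independent and normally distributed; the seed and sample size are arbitrary.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible
n <- 100000

# ASSUMPTION: independent normal draws; the exercise does not specify a distribution
x <- rnorm(n, mean = 48, sd = 1)     # ounces in the box
y <- rnorm(n, mean = 2,  sd = 0.25)  # ounces in one scoop

left <- x - y   # amount left in the box after one scoop

mean(left)      # close to 48 - 2 = 46
var(left)       # close to 1 + 0.0625 = 1.0625, not 1 - 0.0625
sd(left)        # close to sqrt(1.0625)
```

The simulated variance of `left` sits near the sum of the two variances, which matches the rule Var(X - Y) = Var(X) + Var(Y) for independent X and Y.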