R Packages

library(tidyverse) # load the libraries needed for this assignment
#library(readxl)
library(plyr)
library(dplyr)
library(dice)
library(VennDiagram)
#library(help = "dice")
#library(DBI)
#library(dbplyr)
#library(data.table)
#library(rstudioapi)
#library(RJDBC)
#library(odbc)
#library(RSQLite)

Dice rolls. (3.6, p. 92)

If you roll a pair of fair dice, what is the probability of (a) getting a sum of 1? (b) getting a sum of 5? (c) getting a sum of 12?

# (a) Probability of getting a sum of 1? List the cases: one fair die (six faces numbered 1 to 6) shows 1, 2, 3, 4, 5 or 6. A pair of dice gives the ordered outcomes (1,1), (1,2), (1,3), ... (6,6), so there are 6 x 6 = 36 equally likely outcomes. The smallest possible sum is 2, from (1,1), because no face is numbered 0. No outcome gives a sum of 1, so this event is impossible and its probability is 0/36 = 0.
a = 0/36
a

## [1] 0

# (b) Probability of getting a sum of 5? The ordered outcomes whose faces add up to 5 are (1,4), (4,1), (2,3) and (3,2): 4 of the 36 outcomes, so the probability is 4/36 = 0.1111111
b = 4/36
b

## [1] 0.1111111

# getEventProb(nrolls = 6,ndicePerRoll = 1,nsidesPerDie = 6,eventList = list(4, 3, c(1,2)),orderMatters = FALSE)
# getEventProb( nrolls = 3, ndicePerRoll = 2,nsidesPerDie = 6,eventList = list(10, 4, c(2:6, 8:12)), orderMatters = TRUE)
# getSumProbs(ndicePerRoll = 5,nsidesPerDie = 6,nkept = 3,dropLowest = TRUE)
# getEventProb(nrolls = 1,ndicePerRoll = 1,nsidesPerDie = 6,eventList = list(6)) # rolling a dice (6 faces numbered 1 to 6) one time and getting 6
# dice(rolls = 1, ndice = 2, sides = 6, plot.it = FALSE, load = rep(1, sides))
# 
# (c) Probability of getting a sum of 12? 12 is the highest sum two dice can give, and only (6,6) produces it, so the probability is 1/36 = 0.02777778
# For comparison, a sum of 3 comes from two outcomes, (1,2) and (2,1), so its probability is (1/36)+(1/36) = 2/36 = 0.05555556
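
The case-by-case counting above can be checked by enumerating all 36 equally likely outcomes. This is a minimal sketch in base R; the names `rolls`, `sum_probs` and `prob_of_sum` are illustrative and not part of the original solution.

```r
# Enumerate every ordered outcome of two fair six-sided dice
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
rolls$total <- rolls$die1 + rolls$die2

# Probability of each sum that can occur (2 through 12)
sum_probs <- table(rolls$total) / nrow(rolls)

# Helper: probability of a given sum, 0 if the sum cannot occur
prob_of_sum <- function(s) {
  key <- as.character(s)
  if (key %in% names(sum_probs)) as.numeric(sum_probs[key]) else 0
}

prob_of_sum(1)   # a sum of 1 is impossible
prob_of_sum(5)   # (1,4), (4,1), (2,3), (3,2)
prob_of_sum(12)  # only (6,6)
```

Enumerating ordered outcomes avoids the question of whether (1,4) and (4,1) count once or twice: each row of `rolls` is one equally likely outcome.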

Poverty and language. (3.8, p. 93)

The American Community Survey is an ongoing survey that provides data every year to give communities the current information they need to plan investments and services. The 2010 American Community Survey estimates that 14.6% of Americans live below the poverty line (PL), 20.7% speak a language other than English (foreign language =FL) at home, and 4.2% fall into both categories.

(a) Are living below the poverty line and speaking a foreign language at home disjoint?
(b) Draw a Venn diagram summarizing the variables and their associated probabilities.
(c) What percent of Americans live below the poverty line and only speak English at home?
(d) What percent of Americans live below the poverty line or speak a foreign language at home?
(e) What percent of Americans live above the poverty line and only speak English at home?
(f) Is the event that someone lives below the poverty line independent of the event that the person speaks a foreign language at home?

# (a) Are living below the poverty line and speaking a foreign language at home disjoint? 
# These events are not disjoint: 4.2% of Americans fall into both categories, so the two events can occur together.
# 
# (b) Draw a Venn diagram summarizing the variables and their associated probabilities.
#VennDiagram <- draw.pairwise.venn(area1 = 0.146, area2 = 0.207, cross.area = 0.042, category = c("poverty line", "foreign language")) 
#grid.draw(VennDiagram)
# ======================-------------------------
# =             -               =               -
# = PL=0.146    -  PLFL= 0.042  =  FL=0.207     -
# ======================------------------------- 

#(c) What percent of Americans live below the poverty line and only speak English at home? Call it P(PL and not FL)
#P(PL and not FL) = P(PL) - P(PL and FL) = 0.146 - 0.042 = 0.104 -> 10.4%

#(d) What percent of Americans live below the poverty line or speak a foreign language at home?
#P(PL or FL) = P(PL) + P(FL) - P(PL and FL) = 0.146 + 0.207 - 0.042 = 0.311 -> 31.1%
#
#(e) What percent of Americans live above the poverty line and only speak English at home? Call it P(e); this event is the complement of event (d)

#P(e) = 1 - P(PL or FL) = 1 - 0.311 = 0.689 -> 68.9%  

#(f)Is the event that someone lives below the poverty line independent of the event that the person speaks a foreign language at home?
#No, the events are dependent. Two events are independent only if the occurrence of one does not change the probability of the other, which requires P(PL and FL) = P(PL) x P(FL). Here P(PL) x P(FL) = 0.146 x 0.207 = 0.0302, but P(PL and FL) = 0.042, so the events are not independent.
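
The poverty and language calculations can be collected in one place. The sketch below uses only the three percentages given in the problem; the variable names (`p_pl`, `p_fl`, `p_both` and so on) are illustrative assumptions, not part of the original solution.

```r
# Probabilities given in the problem statement
p_pl   <- 0.146  # below the poverty line (PL)
p_fl   <- 0.207  # foreign language at home (FL)
p_both <- 0.042  # both PL and FL

p_pl_only <- p_pl - p_both         # (c) PL and only English
p_either  <- p_pl + p_fl - p_both  # (d) PL or FL (general addition rule)
p_neither <- 1 - p_either          # (e) complement of (d)

# (f) independence would require P(PL and FL) to equal P(PL) * P(FL)
p_if_independent <- p_pl * p_fl

round(c(pl_only = p_pl_only, either = p_either, neither = p_neither,
        joint = p_both, product = p_if_independent), 4)
```

Comparing the `joint` and `product` entries of the printed vector is the independence check: they would match if the two events were independent.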

Assortative mating. (3.18, p. 111)

(a) What is the probability that a randomly chosen male respondent or his partner has blue eyes? (b) What is the probability that a randomly chosen male respondent with blue eyes has a partner with blue eyes? (c) What is the probability that a randomly chosen male respondent with brown eyes has a partner with blue eyes? What about the probability of a randomly chosen male respondent with green eyes having a partner with blue eyes? (d) Does it appear that the eye colors of male respondents and their partners are independent? Explain your reasoning.

# (a) What is the probability that a randomly chosen male respondent or his partner has blue eyes?
# P(a) = P(male blue) + P(partner blue) - P(male blue and partner blue) = (114/204) + (108/204) - (78/204) = 0.7058824

(114/204) + (108/204) - (78/204)

## [1] 0.7058824

# (b) What is the probability that a randomly chosen male respondent with blue eyes has a partner with blue eyes?
# P(b) = P(male blue and partner blue)/P(male blue) = (78/204)/(114/204) = (78/114) = 0.6842105
#   
(78/204)/ (114/204)

## [1] 0.6842105

# (c) What is the probability that a randomly chosen male respondent with brown eyes has a partner with blue eyes? What about the probability of a randomly chosen male respondent with green eyes having a partner with blue eyes? 
# The 108 blue-eyed partners split as 78 + 19 + 11 across blue-, brown- and green-eyed male respondents.
# P(partner blue | male brown) = P(male brown and partner blue)/P(male brown) = (19/204)/(54/204) = 19/54 = 0.3518519
#   
(19/204)/(54/204)

## [1] 0.3518519

# P(partner blue | male green) = P(male green and partner blue)/P(male green) = (11/204)/(36/204) = 11/36 = 0.3055556
# (11/204)/(36/204)

# (d) Does it appear that the eye colors of male respondents and their partners are independent? Explain your reasoning.
# No, the eye colors appear to be dependent. If they were independent, P(male blue and partner blue) would equal P(male blue) x P(partner blue) = (114/204) * (108/204) = 0.2958, but the observed value is 78/204 = 0.3824.
# Equivalently, P(partner blue | male blue) = 0.684 is well above the overall P(partner blue) = 108/204 = 0.529.
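
The blue-eye calculations for parts (a), (b) and (d) can be reproduced from the four counts quoted in the solution. This is a minimal sketch; the variable names are illustrative, and only the counts 204, 114, 108 and 78 come from the text above.

```r
# Counts quoted in the solution above
n_total     <- 204  # couples in the sample
male_blue   <- 114  # male respondents with blue eyes
female_blue <- 108  # partners with blue eyes
both_blue   <- 78   # couples where both have blue eyes

# (a) general addition rule
p_either <- (male_blue + female_blue - both_blue) / n_total

# (b) conditional probability: restrict to blue-eyed males
p_partner_blue_given_male_blue <- both_blue / male_blue

# (d) independence check: joint probability vs product of the marginals
p_joint <- both_blue / n_total
p_joint_if_independent <- (male_blue / n_total) * (female_blue / n_total)

round(c(either = p_either,
        conditional = p_partner_blue_given_male_blue,
        joint = p_joint,
        product = p_joint_if_independent), 4)
```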

Books on a bookshelf. (3.26, p. 114)

The table below shows the distribution of books on a bookcase based on whether they are nonfiction or fiction and hardcover or paperback.

(a) Find the probability of drawing a hardcover book first then a paperback fiction book second when drawing without replacement. (b) Determine the probability of drawing a fiction book first and then a hardcover book second, when drawing without replacement. (c) Calculate the probability of the scenario in part (b), except this time complete the calculations under the scenario where the first book is placed back on the bookcase before randomly drawing the second book. (d) The final answers to parts (b) and (c) are very similar. Explain why this is the case.

# (a) Find the probability of drawing a hardcover book first then a paperback fiction book second when drawing without replacement.
# P(a) = P(hardcover first and paperback fiction second) = P(hardcover first) x P(paperback fiction second | hardcover first) = (28/95)*(59/94) = 0.1849944

(28/95)*(59/94)

## [1] 0.1849944

# 
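# (b) Determine the probability of drawing a fiction book first and then a hardcover book second, when drawing without replacement.
# The first fiction book may itself be a hardcover, which changes how many hardcovers remain. Of the 72 fiction books, 59 are paperback, so 72 - 59 = 13 are hardcover.
# P(fiction first and hardcover second) = P(hardcover fiction first) x P(hardcover second) + P(paperback fiction first) x P(hardcover second) = (13/95)*(27/94) + (59/95)*(28/94) = 0.2243001
(13/95)*(27/94) + (59/95)*(28/94)

## [1] 0.2243001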

# (c) Calculate the probability of the scenario in part (b), except this time complete the calculations under the scenario where the first book is placed back on the bookcase before randomly drawing the second book.
# P(fiction first and hardcover second) = P(fiction first) x P(hardcover second) = (72/95)*(28/95) = 0.2233795
(72/95)*(28/95)

## [1] 0.2233795

# 
# (d) The final answers to parts (b) and (c) are very similar. Explain why this is the case.
# Removing one book from a bookcase of 95 changes the second-draw probabilities only slightly: the denominator goes from 95 to 94 and the number of hardcovers changes by at most one. With a small number of books the difference would be larger.
# 
#
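
The with-replacement and without-replacement answers for "fiction first, hardcover second" can be checked by listing every ordered pair of books. The sketch below is a brute-force check, not the original method. It assumes the four cell counts implied by the figures used above (95 books, 28 hardcover, 72 fiction, 59 paperback fiction, hence 13 hardcover fiction, 15 hardcover nonfiction and 8 paperback nonfiction). All variable names are illustrative.

```r
# One row per book, built from the implied cell counts (an assumption
# derived from the totals 95, 28, 72 and 59 used in the solution)
books <- data.frame(
  cover = c(rep("hard", 13), rep("hard", 15), rep("paper", 59), rep("paper", 8)),
  genre = c(rep("fiction", 13), rep("nonfiction", 15),
            rep("fiction", 59), rep("nonfiction", 8))
)
n <- nrow(books)

# Every ordered pair (first draw, second draw), including drawing the same book twice
pairs <- expand.grid(first = seq_len(n), second = seq_len(n))
hit <- books$genre[pairs$first] == "fiction" & books$cover[pairs$second] == "hard"

# With replacement: all n * n pairs are equally likely
p_with <- mean(hit)

# Without replacement: exclude pairs that reuse the same book
distinct <- pairs$first != pairs$second
p_without <- mean(hit[distinct])

c(with_replacement = p_with, without_replacement = p_without)
```

Because the without-replacement case simply drops the pairs where the same book is drawn twice, it automatically accounts for the first fiction book being a hardcover.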

Baggage fees. (3.34, p. 124)

An airline charges the following baggage fees: $25 for the first bag and $35 for the second. Suppose 54% of passengers have no checked luggage, 34% have one piece of checked luggage and 12% have two pieces. We suppose a negligible portion of people check more than two bags.

(a) Build a probability model, compute the average revenue per passenger, and compute the corresponding standard deviation. (b) About how much revenue should the airline expect for a flight of 120 passengers? With what standard deviation? Note any assumptions you make and if you think they are justified.

# Treating the event of checking more than two bags as negligible makes sense here, since 54% + 34% + 12% = 100%
#   
# (a)Build a probability model, compute the average revenue per passenger, and compute the corresponding standard deviation.

# how many bags possible
bags_number <- c(0, 1, 2)  # 3 or more bags are excluded
bags_number

## [1] 0 1 2

# bags fees
bag_0 = 0; bag_1 = 25; bag_2 = 25+35 # checking a second bag means paying for the first bag and then the second

bag_1

## [1] 25

# this variable holds all possible fees paid by a passenger
bags_fees <- c(bag_0, bag_1, bag_2)
# this variable holds the distribution of passengers by number of bags checked (p_Percent = passenger percent)
p_Percent <- c( 0.54, 0.34, 0.12)
# each fee weighted by its probability: x * P(x)
X_Px <- bags_fees * p_Percent
passenger_tb <- data.frame(bags_number, bags_fees,p_Percent, X_Px)
passenger_tb

revenu_tb <- select(passenger_tb,bags_fees,p_Percent, X_Px)
revenu_tb

# What is the average revenue per passenger? It is the sum of each fee times its probability
revenu_passenger = sum(bags_fees*p_Percent)
revenu_passenger # answer = 15.7

## [1] 15.7

# Standard deviation
# sd() computes a sample standard deviation and ignores the probabilities, so it does not apply to a probability model; use sqrt(sum((x - mean)^2 * P(x))) instead
# first term of the variance
(0-15.7)^2*0.54

## [1] 133.1046

# all three terms, then the square root
sqrt((0-15.7)^2*0.54 + (25-15.7)^2*0.34 + (60-15.7)^2*0.12)

## [1] 19.95019
   
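# (b) About how much revenue should the airline expect for a flight of 120 passengers? With what standard deviation? Note any assumptions you make and if you think they are justified.
#    
# expected revenue = 120*15.7 = 1884
# standard deviation = sqrt(120)*19.95 = 218.54, assuming passengers check bags independently of one another. This assumption may not fully hold, since families and groups travelling together tend to make similar baggage choices.
# term by term, expected revenue = (0*0.54*120 + 25*0.34*120 + 60*0.12*120) = 1884, the same value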
(0*0.54*120 + 25*0.34*120 + 60*0.12*120)

## [1] 1884
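
The mean and standard deviation of the baggage-fee model can be wrapped in two small helpers. This is a sketch under one loudly stated assumption: the flight-level standard deviation treats the 120 passengers as independent. The function names `discrete_mean` and `discrete_sd` are hypothetical helpers, not functions from base R or any package.

```r
# Probability model: fee paid and the share of passengers paying it
fees  <- c(0, 25, 60)
probs <- c(0.54, 0.34, 0.12)

# Mean and standard deviation of a discrete random variable
discrete_mean <- function(x, p) sum(x * p)
discrete_sd   <- function(x, p) sqrt(sum((x - discrete_mean(x, p))^2 * p))

mu    <- discrete_mean(fees, probs)  # average revenue per passenger
sigma <- discrete_sd(fees, probs)    # standard deviation per passenger

# Flight of 120 passengers, ASSUMING passengers are independent:
# means add, and variances add, so the sd scales with sqrt(n)
n_passengers <- 120
flight_mean  <- n_passengers * mu
flight_sd    <- sqrt(n_passengers) * sigma

round(c(mu = mu, sigma = sigma, flight_mean = flight_mean, flight_sd = flight_sd), 2)
```

If passengers travel in groups with similar baggage habits, the independence assumption is too optimistic and the true flight-level standard deviation would likely be larger.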

Income and gender. (3.38, p. 128)

The relative frequency table below displays the distribution of annual total personal income (in 2009 inflation-adjusted dollars) for a representative sample of 96,420,486 Americans. These data come from the American Community Survey for 2005-2009. This sample is comprised of 59% males and 41% females.

(a) Describe the distribution of total personal income.
(b) What is the probability that a randomly chosen US resident makes less than $50,000 per year? (c) What is the probability that a randomly chosen US resident makes less than $50,000 per year and is female? Note any assumptions you make. (d) The same data source indicates that 71.8% of females make less than $50,000 per year. Use this value to determine whether or not the assumption you made in part (c) is valid.

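#(a) Describe the distribution of total personal income.
#The median falls in the $35,000 to $49,999 bracket, which is also the most common bracket (21.2%). The percentages fall off on both sides of it, with a longer tail toward higher incomes (9.7% earn $100,000 or more), so the distribution is right-skewed. A plot shows this more clearly.
## let's create a data frame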
income <- c("$1 to $9,999","$10,000 to $14,999","$15,000 to $24,999","$25,000 to $34,999","$35,000 to $49,999","$50,000 to $64,999","$65,000 to $74,999","$75,000 to $99,999","$100,000 or more")

total_percent <- c(2.2, 4.7, 15.8, 18.3, 21.2, 13.9, 5.8, 8.4, 9.7)

income_gender <- data.frame(income, total_percent)
income_gender

# ggplot(data = nycflights, aes(x = origin, fill = dep_type))  +
#   geom_bar()
#   
#barplot(income_gender$total_percent, names.arg=income, fill = income)

with(income_gender, barplot(total_percent, 
                  names.arg = income, 
                  xlab = "income ", 
                  ylab = "Total in percent", 
                  col = rainbow(10)))

#(b) What is the probability that a randomly chosen US resident makes less than $50,000 per year?

#by hand, P(b) = 0.022 + 0.047 + 0.158 + 0.183 + 0.212 = 0.622
sum(income_gender [1:5, 2]) # in percent: 62.2% = 0.622

## [1] 62.2

# (c) What is the probability that a randomly chosen US resident makes less than $50,000 per year and is female? Note any assumptions you make.
# This is a bit tricky because we don't know how females and males are distributed across the income brackets. Assume that income and gender are independent, so P(less than $50,000 and female) = P(less than $50,000) x P(female)
# P(c) = 0.622*0.41 = 0.255
# (d) The same data source indicates that 71.8% of females make less than $50,000 per year. Use this value to determine whether or not the assumption you made in part (c) is valid
# 
# The 71.8% is a conditional probability: P(less than $50,000 | female) = 0.718. If income and gender were independent, this would equal the overall P(less than $50,000) = 0.622. Since 0.718 is not equal to 0.622, the independence assumption made in part (c) is not valid. Using the conditional probability instead, P(less than $50,000 and female) = 0.718*0.41 = 0.294
#   
#
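
The median bracket in part (a) and the independence check in parts (c) and (d) can be computed directly from the relative frequencies. This is a minimal sketch in base R. The vectors repeat the values used in the solution so the block runs on its own, and the names `brackets`, `cum_percent` and the `joint_*` variables are illustrative.

```r
# Income brackets and their relative frequencies (in percent), as above
brackets <- c("$1 to $9,999", "$10,000 to $14,999", "$15,000 to $24,999",
              "$25,000 to $34,999", "$35,000 to $49,999", "$50,000 to $64,999",
              "$65,000 to $74,999", "$75,000 to $99,999", "$100,000 or more")
total_percent <- c(2.2, 4.7, 15.8, 18.3, 21.2, 13.9, 5.8, 8.4, 9.7)

# (a) the median lies in the first bracket whose cumulative share reaches 50%
cum_percent <- cumsum(total_percent)
median_bracket <- brackets[which(cum_percent >= 50)[1]]

# (b) probability of making less than $50,000 (first five brackets)
p_under_50k <- sum(total_percent[1:5]) / 100

# (c) joint probability IF income and gender were independent
p_female <- 0.41
joint_if_independent <- p_under_50k * p_female

# (d) joint probability from the conditional P(under $50,000 | female) = 0.718
p_under_50k_given_female <- 0.718
joint_from_conditional <- p_under_50k_given_female * p_female

median_bracket
round(c(independent = joint_if_independent, conditional = joint_from_conditional), 3)
```

The gap between the two joint probabilities is the same evidence as the gap between 0.718 and 0.622: under independence the two would agree.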

Chapter 3 - Probability

Alexis Mekueko