Probability questions

Binomial distribution

Basics
library(magrittr)
?dbinom  ## dbinom gives the binomial density (probability mass) function
x <- 0:5
## the probability of 0:5 successes out of 5 trials, where each trial succeeds with probability 0.4
p <- dbinom(x, 5, .4)  ## probability density function (PDF)
p
## [1] 0.07776 0.25920 0.34560 0.23040 0.07680 0.01024
x[p == max(p)]
## [1] 2
x <- 0:5
p <- dbinom(x, 5, .5)
x[p==max(p)]
## [1] 2 3
x <- 0:5
p <- pbinom(x, 5, .4)   # CDF: cumulative probability
p
## [1] 0.07776 0.33696 0.68256 0.91296 0.98976 1.00000
## if we draw 15 heads out of 20 tosses, what is the probability of a result at least this extreme if the coin is fair (binomial distribution)?
p <- pbinom(q = 15-1, 
            size = 20, 
            prob = .5,
            lower.tail = FALSE)

p <- p*2  ## two-tailed: double the one-tail probability (valid here because p = 0.5 makes the distribution symmetric)
# this is the same as
binom.test(15, 20, 0.5)
## 
##  Exact binomial test
## 
## data:  15 and 20
## number of successes = 15, number of trials = 20, p-value = 0.04139
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5089541 0.9134285
## sample estimates:
## probability of success 
##                   0.75
### confidence intervals for a binomial proportion
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
binconf(5, 10, alpha = 0.05, method = c('wilson'))
##  PointEst     Lower     Upper
##       0.5 0.2365931 0.7634069
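As a cross-check, the Wilson interval can be computed by hand (a sketch; wilson_ci is my own helper, not part of Hmisc):

wilson_ci <- function(x, n, conf = 0.95) {
    z <- qnorm(1 - (1 - conf)/2)
    phat <- x/n
    centre <- (phat + z^2/(2*n)) / (1 + z^2/n)
    half <- z * sqrt(phat*(1 - phat)/n + z^2/(4*n^2)) / (1 + z^2/n)
    c(lower = centre - half, upper = centre + half)
}
wilson_ci(5, 10)  ## 0.2365931 0.7634069, matching binconf() above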
* If you are trying to estimate the parameter p of the binomial distribution B(n, p) from data where the number of successes is 0, what is the confidence interval for the p estimate?
library(binom)
p = seq(0,1,.001)
coverage = binom.coverage(p, 25, method="asymptotic")$coverage
plot(p, coverage, type="l")
binom.confint(0,25)
##           method x  n       mean       lower      upper
## 1  agresti-coull 0 25 0.00000000 -0.02439494 0.15758719
## 2     asymptotic 0 25 0.00000000  0.00000000 0.00000000
## 3          bayes 0 25 0.01923077  0.00000000 0.07323939
## 4        cloglog 0 25 0.00000000  0.00000000 0.13718517
## 5          exact 0 25 0.00000000  0.00000000 0.13718517
## 6          logit 0 25 0.00000000  0.00000000 0.13718517
## 7         probit 0 25 0.00000000  0.00000000 0.13718517
## 8        profile 0 25 0.00000000  0.00000000 0.12291101
## 9            lrt 0 25 0.00000000  0.00000000 0.07398085
## 10     prop.test 0 25 0.00000000  0.00000000 0.16577301
## 11        wilson 0 25 0.00000000  0.00000000 0.13319225
abline(h=.95, col="red")

binom.confint(0, 25, method=c("wilson", "bayes", "agresti-coull"),
  type="central")
##          method x  n       mean       lower      upper
## 1 agresti-coull 0 25 0.00000000 -0.02439494 0.15758719
## 2         bayes 0 25 0.01923077  0.00000000 0.07323939
## 3        wilson 0 25 0.00000000  0.00000000 0.13319225
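For x = 0 the exact (Clopper-Pearson) upper limit has a closed form, since P(0 successes in n trials | p) = (1 - p)^n; a quick check against the 'exact' row above:

1 - 0.025^(1/25)  ## 0.1372, matching the exact upper limit
3/25              ## 0.12, the quick 'rule of three' approximation 3/n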
*  You have 3 baskets and each basket contains exactly 4 balls; each ball is of the same size. Each ball is either red, black, yellow, or orange, and there is one of each color in each basket. If you randomly draw 1 ball from each basket, what is the chance that you would have exactly 2 red balls? Solution:
choose(3, 2)*((1/4)^2)*(3/4)  ## choose(3, 2) ways to pick which 2 of the 3 baskets yield red
## [1] 0.140625
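A quick Monte Carlo check (my own sketch, not part of the original solution):

set.seed(1)
draws <- replicate(1e5, sum(sample(4, 3, replace = TRUE) == 1))  # color 1 = red, one draw per basket
mean(draws == 2)  ## should be close to 0.1406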
* A bag contains x number of 1 rupee coins and y number of 50 paise coins. One coin is taken from the bag and put away. If a coin is now taken at random from the bag, what is the probability that it is a 1 rupee coin? 
## two situations
# first coin removed is a rupee:
# x/(x+y) * (x-1)/(x+y-1)

# first coin removed is a paise:
# y/(x+y) * x/(x+y-1)

# total probability = sum of the two:
# [x(x-1) + xy] / [(x+y)(x+y-1)] = x(x+y-1) / [(x+y)(x+y-1)] = x/(x+y)

## the answer is x/(x+y)
# intuition: since we learn nothing about the removed coin, it cannot change the
# probability for the second draw, which stays x/(x+y)
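A small simulation to sanity-check x/(x+y), with assumed values x = 3 rupee coins and y = 2 paise coins:

set.seed(1)
bag <- c(rep(1, 3), rep(0, 2))            # 1 = rupee, 0 = paise
second <- replicate(1e5, sample(bag)[2])  # shuffle; the 2nd coin is the draw after one removal
mean(second)                              ## ~ 3/5 = x/(x+y)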
* Bob is selected to interview for 3 posts. The numbers of candidates interviewing for the three posts are 3, 4, and 2. What is the probability of getting at least one offer? [Assume each post's offer goes to a candidate picked at random after the interviews.]
## probability of not getting any offer
2/3 * 3/4 * 1/2 
## [1] 0.25
1 - (2/3 * 3/4 * 1/2)  ## P(at least one offer) = 1 - P(no offer)
## [1] 0.75
* Mr. Jones has two children. The older child is a girl. What is the probability that both children are girls? (Part B): Mr. Smith has two children. At least one of them is a boy. What is the probability that both children are boys?
# part A: the younger child's sex is independent of the older child's
1/2
## [1] 0.5
# part B: the three equally likely two-child families with at least one boy:
# b + b (older + younger)
# b + g (older + younger)
# g + b (older + younger)
# => P(both boys) = 1/3
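A simulation check of part B (my own sketch):

set.seed(1)
kids <- matrix(sample(c('b', 'g'), 2e5, replace = TRUE), ncol = 2)  # 1e5 two-child families
has_boy <- rowSums(kids == 'b') >= 1
mean(rowSums(kids[has_boy, ] == 'b') == 2)  ## ~ 1/3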
* You have two jars, 50 red marbles, and 50 blue marbles. You’ll first randomly pick a jar, and then randomly pick a marble out of that jar. You can arrange the marbles however you like, but each marble must be in a jar. You need to place all the marbles into the jars such that when you blindly pick one marble out of the chosen jar, you maximize the chances that it will be red. 
0.5*1 + 0.5*(25/75)  # 25 red alone in jar 1; remaining 25 red + 50 blue in jar 2
## [1] 0.6666667
0.5*1 + 0.5*0        # all 50 red in jar 1, all 50 blue in jar 2
## [1] 0.5
k = c(1:50)  # number of red marbles placed alone in jar 1
p = rep(0, length(k))

for (i in k) {
    # jar 1 holds i red marbles only (a red draw is certain);
    # jar 2 holds the remaining 50 - i red and all 50 blue
    p[i] = 0.5*1 + 0.5*((50 - i)/(100 - i))
}
plot(p~k)  # maximized at k = 1: one red alone in one jar, everything else in the other
max(p)     # 0.5 + 0.5*49/99 = 0.7474...
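A Monte Carlo check of the optimal arrangement (my own sketch):

set.seed(1)
pick_red <- function() {
    if (runif(1) < 0.5) TRUE                          # jar 1: the single red marble
    else sample(c(rep(TRUE, 49), rep(FALSE, 50)), 1)  # jar 2: 49 red + 50 blue
}
mean(replicate(1e5, pick_red()))  ## ~ 0.7475 = 0.5 + 0.5*49/99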

* An angel offers you a choice of three doors; you can choose only one. Two of the doors hide nothing and one hides a jackpot. After you choose a door, the angel opens one of the other two doors, behind which there is nothing, and gives you the chance to switch to the remaining door or stick with your original choice. Should you switch, or does it not matter?

## you should always switch because the unchosen, unrevealed door carries the 2/3 probability that your initial pick missed the jackpot
#####################################################
# Simulation of the Monty Hall Problem
# Demonstrates that switching is always better
# than staying with your initial guess
#
# Corey Chivers, 2012
#####################################################

rm(list=ls())
monty <- function(strat = 'stay', N = 1000, print_games = TRUE)
{
    doors <- 1:3   # initialize the doors, behind one of which is a good prize
    win <- 0       # to keep track of the number of wins

    for (i in 1:N)
    {
        prize <- floor(runif(1, 1, 4))  # randomize which door has the good prize
        guess <- floor(runif(1, 1, 4))  # guess a door at random

        ## Reveal one of the doors you didn't pick which has a bum prize
        if (prize != guess)
            reveal <- doors[-c(prize, guess)]
        else
            reveal <- sample(doors[-c(prize, guess)], 1)

        ## Stay with your initial guess or switch
        if (strat == 'switch')
            select <- doors[-c(reveal, guess)]
        if (strat == 'stay')
            select <- guess
        if (strat == 'random')
            select <- sample(doors[-reveal], 1)

        ## Count up your wins
        if (select == prize)
        {
            win <- win + 1
            outcome <- 'Winner!'
        } else
            outcome <- 'Loser!'

        if (print_games)
            cat(paste('Guess: ', guess,
                      '\nRevealed: ', reveal,
                      '\nSelection: ', select,
                      '\nPrize door: ', prize,
                      '\n', outcome, '\n\n', sep = ''))
    }
    # print the win percentage of your strategy
    cat(paste('Using the ', strat, ' strategy, your win percentage was ',
              win/N*100, '%\n', sep = ''))
}
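For example, to compare strategies over many games without printing each one:

monty(strat = 'switch', N = 10000, print_games = FALSE)  # win percentage near 66.7%
monty(strat = 'stay',   N = 10000, print_games = FALSE)  # win percentage near 33.3%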

Normal distribution

* With a data set from a normal distribution, suppose you can't observe any value greater than 5; how do you estimate the mean?
  • Maximum likelihood (ML) estimation
  • The idea is to
  1. Write down the pdf of the assumed distribution.

  2. Form the negative log-likelihood from it, and minimize that function to solve for the parameters (a truncated-likelihood sketch follows the plain fit below).

## fit a normal distribution by maximum likelihood (ignoring the truncation for now)
set.seed(100)
x = rnorm(5, mean = 3, sd = 2)
LL <- function(mu, sigma){
    R = dnorm(x, mu, sigma)   # likelihood of each observation
    -sum(log(R))              # negative log-likelihood
}
library(stats4)
mle(LL, start = list(mu = 1, sigma = 1))
## Warning in dnorm(x, mu, sigma): NaNs produced
## (repeated several times: the optimizer tries negative sigma values along the way)
## 
## Call:
## mle(minuslogl = LL, start = list(mu = 1, sigma = 1))
## 
## Coefficients:
##        mu     sigma 
## 3.2216713 0.9008312
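The fit above ignores the truncation in the question. A minimal sketch for the truncated case (assuming values above 5 are simply never observed): renormalize the density by P(X <= 5) = pnorm(5, mu, sigma).

## truncated-normal MLE sketch: each observed density is dnorm / pnorm(5, mu, sigma)
set.seed(100)
x_all <- rnorm(1000, mean = 3, sd = 2)
x_trunc <- x_all[x_all <= 5]              # only values <= 5 are observed
LL_trunc <- function(mu, sigma) {
    R <- dnorm(x_trunc, mu, sigma) / pnorm(5, mu, sigma)
    -sum(log(R))
}
mle(LL_trunc, start = list(mu = 1, sigma = 1))  # estimates should be near 3 and 2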
Maximum likelihood estimation

### input our data: traffic-light right-turn counts, expanded from their observed frequencies
X <- c(rep(0,14), rep(1,30), rep(2,36), rep(3,68), rep(4, 43), rep(5,43), rep(6, 30), rep(7,14), rep(8,10), rep(9, 6), rep(10,4), rep(11,1), rep(12,1))

hist(X)

mean(X)
## [1] 3.893333

Here “nlm” is the nonlinear minimization function provided by R:

  • the first argument is the objective function to be minimized;

  • the second argument, “p=c(0.5)”, specifies the initial value of the unknown parameter; here we start at 0.5;

  • the third argument, “hessian = TRUE”, says that we want the Hessian matrix as part of the output. The optimization result and relevant information are stored in a structure; in this example, the variable “out” holds it.

## define a negative log likelihood function
n <- length(X)
negloglike<-function(lam) { n* lam -sum(X) *log(lam) + sum(log(factorial(X))) }

out<-nlm(negloglike,p=c(0.5), hessian = TRUE)  ## nlm (non-linear minimization)
## Warning in log(lam): NaNs produced
## Warning in nlm(negloglike, p = c(0.5), hessian = TRUE): NA/Inf replaced by
## maximum positive value
## Warning in log(lam): NaNs produced
## Warning in nlm(negloglike, p = c(0.5), hessian = TRUE): NA/Inf replaced by
## maximum positive value
out
## $minimum
## [1] 667.183
## 
## $estimate
## [1] 3.893331
## 
## $gradient
## [1] -2.569636e-05
## 
## $hessian
##          [,1]
## [1,] 77.03948
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10
  • out$minimum is the minimized negative log-likelihood,

  • out$estimate is the maximum likelihood estimate of the parameter,

  • out$gradient is the gradient of the negative log-likelihood function at this point,

  • out$hessian is the value of the second derivative at this point (see the standard-error sketch below),

  • out$iterations is the number of iterations needed to converge. The notation “$” extracts a component of the output variable “out”.
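A quick sketch using the `out` object above: the inverse of the Hessian approximates the variance of the estimate, and the Poisson MLE also has the closed form lambda-hat = mean(X).

sqrt(1/out$hessian)  ## ~0.114, the approximate standard error of lambda-hat
mean(X)              ## 3.8933, the analytic Poisson MLE, matching out$estimate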

    • How do you analyze data if they don’t follow a normal distribution?
  • Consider other distributions: beta, lognormal, Poisson, gamma

## Poisson distribution
# rpois  random variates
# dpois  pdf
# ppois  cdf
# qpois  quantile
barplot(table(rpois(n = 100, lambda = 0.8)))

barplot(dpois(x = 0:10, lambda = 0.7, log = FALSE))  # PDF (note the mass at 0, missed if x starts at 1)

## the same r/d/p/q naming convention applies to norm, binom, unif, etc.

# gamma distribution: rgamma, dgamma, pgamma, qgamma (a short sketch follows)
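A quick sketch of the same r/d/p/q quartet for the gamma distribution, with assumed example parameters:

curve(dgamma(x, shape = 2, rate = 1), from = 0, to = 10)  # PDF
pgamma(2, shape = 2, rate = 1)                            # CDF at 2
qgamma(0.5, shape = 2, rate = 1)                          # median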

Other questions

* How large must a party be before at least one pair of guests shares a birthday with probability greater than 0.5? (the birthday problem)
365*364/(365^2)          # P(no shared birthday) among 2 people
## [1] 0.9972603
365*364*363/(365^3)      # among 3 people
## [1] 0.9917958
365*364*363*362/(365^4)  # among 4 people
## [1] 0.9836441
## generalizing to a party of i people:
prob = rep(0, 99)
for (i in c(2:100)) {
    upper = c(365:(365-i+1))
    result = 1
    for (j in upper) {
        result = result* j
    }
    prob[i-1] = result/(365^i)
}
plot(prob~c(2:100))

prob[which(prob > 0.5)] %>% length()  # party sizes 2..22 still have P(no match) > 0.5
## [1] 21
# so at 23 people a shared birthday becomes more likely than not
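Base R's stats package also ships pbirthday() and qbirthday(), which answer this directly:

pbirthday(23)          ## ~0.507: with 23 people a match is more likely than not
qbirthday(prob = 0.5)  ## 23, the smallest such party size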
* Monte Carlo simulation of the birthday problem
birthday = rep(0, 100)
for (i in c(1:100)) {
    birthday[i] = sample(1:365, 1)  # uniform over days (round(runif()) would bias the endpoints)
    if (birthday[i] %in% birthday[-i]) {
        print(i)  # the guest index at which the first match occurs (one random run)
        break()
    }
}
## [1] 21
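A fuller Monte Carlo estimate (my own sketch): simulate many parties of size 23 and count how often a birthday repeats:

set.seed(1)
shared <- replicate(1e4, any(duplicated(sample(1:365, 23, replace = TRUE))))
mean(shared)  ## ~0.507, matching pbirthday(23)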
* Monty Hall simulation
monty <- function(strat = 'Stay', n = 100, printgame = TRUE) {
    doors = c(1, 2, 3)
    win = 0
    for (i in c(1:n)) {
        prize = floor(runif(1, 1, 4))  # door hiding the prize
        guess = floor(runif(1, 1, 4))  # initial guess

        if (guess != prize) {
            reveal = doors[-c(prize, guess)]      # only one losing door can be opened
        } else {
            reveal = sample(doors[-c(prize)], 1)  # either of the other two doors
        }

        if (strat == 'Stay') {
            select = guess
        }
        if (strat == 'Switch') {
            select = doors[-c(reveal, guess)]
        }
        ## determine the result
        if (select == prize) {
            outcome = 'win'
            win = win + 1
        } else {
            outcome = 'lose'
        }
        if (printgame) {
            cat(paste('Guess:', guess, '\nRevealed:', reveal,
                      '\nSelection:', select, '\nPrize:', prize,
                      '\nOutcome:', outcome, '\n\n'))
        }
    }
    cat(paste('using the', strat, 'strategy, won', win, 'out of', n, 'games\n'))
}
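For example:

monty('Switch', n = 10000, printgame = FALSE)  # wins roughly 2/3 of games
monty('Stay',   n = 10000, printgame = FALSE)  # wins roughly 1/3 of games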