Welcome to the Midterm! Please read the instructions below before starting the exam.
Good luck!
I have two copies of the famous "Dummy Variables for Dummies" textbook. If offered a copy, a random student will accept it with probability 0.3. Calculate the probability that I have to offer it to exactly 6 students before I get rid of both copies. Assume I offer the book to each student sequentially, and that their decisions are independent of each other's.
Calculate this probability from scratch, without using any statistical function in R (Binomial, Poisson, etc.). Attach an annotated R code detailing all the steps you took and the reasoning behind it.
# Assign Variables
r <- 2 # number of successes
n <- 6 # number of trials
p <- 0.3 # probability of success
f <- (n-r) # number of failures
# P(X = n) = choose(n-1,r-1) * p^r * (1-p)^(n-r):
# (# of ways the first 5 students can contain exactly one yes) * (probability of that sequence, ending with the 6th student's yes)
firstFiveComb <- choose(n-1,r-1) # number of ways the first 5 students can have exactly one yes
sixthYes <- p^r * (1-p)^(n-r) # probability of r yeses and n-r nos in one particular order, with the 6th student saying yes
q1Answer <- firstFiveComb * sixthYes
q1Answer
## [1] 0.108045
The question in the previous part is an example of the Negative Binomial distribution. The random variable in this distribution is the number of failures before a given number \(r\) of successes is obtained, given that each trial has a Bernoulli distribution with probability of success \(p\).
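Written out, if \(X\) counts the number of failures before the \(r\)th success, this definition gives the pmf \[P(X = f) = \binom{f+r-1}{r-1} p^r (1-p)^f, \qquad f = 0, 1, 2, \ldots\] With \(r = 2\), \(p = 0.3\), and \(f = 4\) (so \(n = f + r = 6\) offers), this reproduces the answer in the first part.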
Write a function in R that will calculate the probability distribution function for any \(r \geq 1\) and \(p \in (0,1)\). You may use any statistical function you want, except the built-in Negative Binomial. Attach an annotated R code detailing all the steps and the reasoning behind them. Show that your function in R gives you the same answer as the answer you calculated in the first part.
# r - number of successes
# n - number of trials up to and including the rth success
# Note: this parameterization differs from dnbinom, which takes the number of
# failures and the number of successes; the caller can convert via n = failures + r.
myNBinom <- function(r,n,p) {
nMinusOne <- n-1
rMinusOne <- r-1
nCombos <- choose(nMinusOne,rMinusOne)
lastSuccess <- p^r * (1-p)^(n-r)
ans <- nCombos * lastSuccess
return(ans)
}
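As a quick check that the function reproduces the hand calculation from the first part, call it with the same parameters:

myNBinom(2, 6, 0.3) # same as q1Answer above, 0.108045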
Check that the function you wrote generates the same probability distribution function for the Negative Binomial as the built in function in R. Check this with four graphs, each for a given value of parameters \(p\) and \(r\). For each graph, show the probability distribution function for values of the number of failures from 0 to 20 calculated by your function and the probability distribution function calculated by the built-in function in R.
Attach an annotated R code detailing all the steps. Attach the graphs.
Check that the function above gives the same answer as the built-in dnbinom for 2 successes in exactly 6 attempts:
r <- 2 # number of successes
n <- 6 # number of trials
p <- 0.3 # probability of success
f <- (n-r) # number of failures
dnbinom(f, r, p) # probability of f failures before the rth success, each trial succeeding with probability p
## [1] 0.108045
Graphs:
fvec <- c(0:20)
par(mfrow=c(1,2))
r <- 2 # number of successes
p <- 0.3 # probability of success
plot(fvec,dnbinom(fvec,r,p),col="red", main=paste("Negative Binomial w/ P=",p),xlab=paste("# of Failures before", r, "Successes"),ylab="Probability")
lines(fvec,myNBinom(r,(fvec+r),p),col="blue")
# Change r and p
r <- 4 # number of successes
p <- 0.8 # probability of success
plot(fvec,dnbinom(fvec,r,p),col="red", main=paste("Negative Binomial w/ P=",p),xlab=paste("# of Failures before", r, "Successes"),ylab="Probability")
lines(fvec,myNBinom(r,(fvec+r),p),col="blue")
# Change r and p
r <- 7 # number of successes
p <- 0.2 # probability of success
plot(fvec,dnbinom(fvec,r,p),col="red", main=paste("Negative Binomial w/ P=",p),xlab=paste("# of Failures before", r, "Successes"),ylab="Probability")
lines(fvec,myNBinom(r,(fvec+r),p),col="blue")
# Change r and p
r <- 10 # number of successes
p <- 0.15 # probability of success
plot(fvec,dnbinom(fvec,r,p),col="red", main=paste("Negative Binomial w/ P=",p),xlab=paste("# of Failures before", r, "Successes"),ylab="Probability")
lines(fvec,myNBinom(r,(fvec+r),p),col="blue")
Yahtzee is a game played with five regular six-sided dice.
What is the probability all five dice will show the same number (this is known as a Yahtzee)? Attach an annotated R code detailing all the steps for your calculations.
# Each die is an independent event, and each face has probability 1/6.
# Two equivalent ways to compute the probability of a Yahtzee:
#   1. Whatever the first die shows, each of the other four must match it: (1/6)^4
#   2. All five dice show one specific number, times the six possible numbers: 6 * (1/6)^5
n <- 6 # number of faces per die
pn <- 1/n # probability of a given face
nDice <- 5 # number of dice
pYahtzee1 <- pn^(nDice-1)
pYahtzee1 # the other four dice match the first
## [1] 0.0007716049
pYahtzee2 <- pn^(nDice) * n
pYahtzee2 # six ways for all five dice to show the same specific number
## [1] 0.0007716049
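Not required by the question, but a quick Monte Carlo sanity check of the same number (the seed is an arbitrary choice, used only for reproducibility):

set.seed(1) # arbitrary seed so the run is reproducible
mean(replicate(200000, length(unique(sample(1:6, 5, replace = TRUE))) == 1)) # should be near 0.00077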
What is the probability all five dice will show a different number? Attach an annotated R code detailing all the steps for your calculations.
# Each die is an independent event, and each face has probability 1/6.
# Fix the first die at whatever it shows; each subsequent die must then avoid
# all of the numbers already rolled:
# 1 * 5/6 * 4/6 * 3/6 * 2/6
n <- 6 # number of faces per die
p1 <- 1 # probability of rolling any number on the first die
p2 <- (n-1)/n # probability of rolling a different number than the first
p3 <- (n-2)/n # probability of rolling a different number than the first two
p4 <- (n-3)/n # probability of rolling a different number than the first three
p5 <- (n-4)/n # probability of rolling a different number than the first four
pAllDiff <- p1 * p2 * p3 * p4 * p5
pAllDiff # Probability that all dice are different
## [1] 0.09259259
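Again as an optional sanity check (arbitrary seed), a simulation of the same probability:

set.seed(1) # arbitrary seed so the run is reproducible
mean(replicate(200000, length(unique(sample(1:6, 5, replace = TRUE))) == 5)) # should be near 0.0926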
In the game of Yahtzee, a player rolls the dice three times. After each roll, the player decides which dice, if any, to roll again. Suppose a player is trying to obtain a Yahtzee (same number on all five dice). Suppose after the first roll, the player sees two dice showing the same number and three showing other numbers (for example, 1, 1, 3, 5, 6). The player decides to keep the two 1s and roll the other three dice again. Calculate the probability the player will end up having exactly three dice showing the same number. Attach an annotated R code detailing all the steps for your calculations.
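A from-scratch calculation for this part (a sketch; it assumes "exactly three dice showing the same number" counts three of a kind on any face after the second roll):

# Keep the pair of 1s and reroll the other three dice. Exactly three of a kind
# can then happen in two disjoint ways:
# (a) exactly one of the three rerolled dice is a 1, giving three 1s
pA <- dbinom(1, 3, 1/6) # choose(3,1) * (1/6) * (5/6)^2 = 75/216, about 0.347
# (b) the three rerolled dice all match each other on one of the 5 other faces
pB <- 5 * (1/6)^3 # 5/216, about 0.023
pA + pB # about 0.370; if only three 1s count, the answer is pA alone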
Assuming the player always keeps the largest set of dice showing the same number and rolls the other dice, what is the probability the player will get a Yahtzee within the three rolls (after the first, or after the second, or after the third)?
Write a piece of code in R with a simulation for this probability. Attach the annotated code, detailing the reasoning behind all steps.
# Simulation: each turn, roll five dice, keep the largest set showing the same
# number, and reroll the remaining dice on rolls two and three.
nTurns <- 500000
yVec <- c(0,0,0) # count of Yahtzees on each roll (1,2,3)
for (i in 1:nTurns) {
nHighCount <- 0 # best match count this turn; must be reset inside the loop
nF <- 0 # the number we are trying to get a Yahtzee on
countOne <- 0
countTwo <- 0
countThree <- 0
# First Roll
oD1 <- sample(1:6,5,replace=TRUE)
for (j in 1:6) {
nMatches <- sum(oD1 == j)
if (nMatches > nHighCount) {
countOne <- nMatches
nHighCount <- nMatches
nF <- j
}
}
# Check if we have Yahtzee
if (countOne == 5) {
yVec[1] = yVec[1] + 1 # increment number of Yahtzees on 1st roll
next # skip to the next turn so this Yahtzee isn't counted again on rolls 2 and 3
}
# Second Roll
nRemDice <- 5-countOne
oD2 <- sample(1:6,nRemDice,replace=TRUE)
countTwo <- countOne + sum(oD2 == nF) # add additional matches
if (countTwo == 5) {
yVec[2] = yVec[2] + 1
next
}
# It's possible the second roll produced a different number with more matches;
# do a loop similar to the first roll and compare counts
for (j in c(1:6)){
nMatches <- sum(oD2 == j)
if (nMatches > countTwo){
countTwo <- nMatches
nF <- j # this is new best number
}
}
# Can't have Yahtzee on new number since first was a different number
# Go to roll #3, with better of first and second roll, which is still nF
nRemDice <- 5-countTwo
oD3 <- sample(1:6,nRemDice,replace=TRUE)
countThree <- countTwo + sum(oD3 == nF) # add additional matches
if (countThree == 5) {
yVec[3] = yVec[3] + 1
}
}
ansY <- (yVec[1] + yVec[2] + yVec[3]) / nTurns
ansY
The estimate should land near the commonly cited analytic value of roughly 0.046 for obtaining a Yahtzee within three rolls under this keep-the-largest-set strategy.
In 1986, the Challenger space shuttle exploded during "throttle up" due to catastrophic failure of o-rings (seals) around the rocket booster. The (real) data on all space shuttle launches prior to the Challenger disaster are in the file "challenger.csv".
The variables in the data set are defined as follows: launch (the launch number; in this file the launches are ordered by temperature), temp (temperature at launch, in degrees Fahrenheit), incident (Yes/No indicator of whether an o-ring incident occurred), and o_ring_probs (the number of o-ring problems observed on that launch).
Load the data into R and answer the following questions. Include all R code.
challenger <- read.csv("/Users/valcourc/Personal/courses/ADEC7301.02/Midterm/challenger.csv")
Print the measures of center (like mean, median, mode,…), spread (like sd,min,max,…) and shape (skewness,kurtosis,…) for the variables in the data. HINT: You can use the describe function in “psych” package for this.
library("psych")
describe(challenger)
## vars n mean sd median trimmed mad min max range skew
## launch 1 23 12.00 6.78 12.0 12.00 8.90 1.0 23.0 22 0.00
## temp 2 23 69.02 6.97 69.8 69.33 5.34 53.6 80.6 27 -0.40
## incident* 3 23 1.30 0.47 1.0 1.26 0.00 1.0 2.0 1 0.80
## o_ring_probs 4 23 0.43 0.79 0.0 0.26 0.00 0.0 3.0 3 1.81
## kurtosis se
## launch -1.36 1.41
## temp -0.44 1.45
## incident* -1.42 0.10
## o_ring_probs 2.69 0.16
Second, what are the levels of measurement of these 4 variables? Discuss/Justify.
launch - an incremental index, so its center, spread, and shape are not meaningful in themselves; the value matters only because it orders the shuttle launches (here, by temperature). Effectively an ordinal label.

temp - a continuous (interval) measurement, so the measures of center are informative. If temperature were correlated with o-ring problems and/or incidents, we would expect launches near the average temperature to have few problems, with incidents clustering at the extremes of the range. The mild left skew and light tails of temp also matter when evaluating the risk of a future launch at an unusual temperature.

incident - a binary (Yes/No) variable, i.e., nominal. Coded as 1/0, its mean would be the proportion of launches with an incident; describe() is treating Yes as 1 and No as 2, so the reported mean of 1.30 corresponds to about 0.30 (7 of 23 launches) under a 0/1 coding (see the recoding sketch after this list). Treating a launch as the unit of exposure, incidents could be modeled as Poisson with \(\lambda = 0.30\) and \(\mu = \lambda \cdot t = 0.30 \cdot 1\) per launch. With Yes/No coded as 1 and 2, some transformation of the data is needed before the summary measures are interpretable.

o_ring_probs - a count (ratio) variable, which could likewise be modeled as events per launch. The mean is the average number of problems per launch; its spread is larger than that of incident because its range (0 to 3) is wider, and the shape statistics indicate strong right skew and a heavy tail.
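A minimal sketch of the 0/1 recoding mentioned above (assuming the incident column holds the values "Yes" and "No"):

# Recode incident to 0/1 so its mean is the incident proportion (assumes Yes/No values)
incident01 <- as.numeric(challenger$incident == "Yes")
mean(incident01) # about 0.30, i.e., 7 of the 23 launches had an incident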
Third, provide an appropriate graph for the variable o_ring_probs. Interpret. Boxplot is acceptable, though histogram would be better.
hist(challenger$o_ring_probs, xlim=c(0,3)) # change x range to 0 through 3
This shows that the majority of launches had either 0 or 1 o-ring problems, but there are a few outliers with more. Plotted as a histogram, the distribution looks a bit like an exponential distribution.
The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?
boxplot(challenger$temp~challenger$incident, col='lightblue', main="Temp per Incident", xlab="Incident", ylab="Temperature")
Analysis: Incidents tended to occur at lower temperatures. The Challenger's 36-degree launch temperature was well below the IQR (interquartile range) of launches without incidents, and even below the IQR of launches with incidents, so it should have been an indicator of the elevated risk of launching in cold weather.
In the already temperature-sorted dataset ( order(mydata$temp) ), find on which observation the first successful launch occurred (one with no incident). Answer using a command (instead of eyeballing the data).
which(challenger$incident == 'No')[1] # index of the first observation with a successful (no-incident) launch
## [1] 5
How many incidents occurred above 65 degrees F? Answer using a command (instead of eyeballing the data).
sum(challenger$temp > 65.0 & challenger$incident == 'Yes') # count the launches above 65.0 F that had an incident
High-density lipoprotein (HDL) is sometimes called the "good cholesterol" because low values are associated with a higher risk of heart disease. According to the American Heart Association, people over the age of 20 years should have at least 40 milligrams per deciliter (mg/dl) of HDL. United States women aged 20 and over have a mean HDL of 47 mg/dl with a standard deviation of 3.46. Assume that the distribution is Normal. For each question below, attach an annotated script in R.
What percent of women in the United States have low values of HDL (40 mg/dl or less)?
round(pnorm(40,47,3.46) * 100,2) # times 100, rounded to 2 decimal places for percentage
## [1] 2.15
What percent of women in the United States have a value of HDL larger than 50 mg/dl?
round((1-pnorm(50,47,3.46)) * 100,2) # times 100, rounded to 2 decimal places for percentage
## [1] 19.3
What percent of women in the United States have a value of HDL between 46 and 50 mg/dl?
round((pnorm(50,47,3.46)-pnorm(46,47,3.46)) * 100,2) # difference between 46 and 50, times 100 and rounded to 2 decimal places for a percentage
## [1] 42.08
What values of HDL would place a woman in the United States in the lowest 15% of the distribution?
round(qnorm(.15,47,3.46),2) # HDL value (mg/dl) at the 15th percentile, rounded to 2 decimal places
## [1] 43.41
Similarly to the definition of outliers in a sample, we can define an outlier for a population as an observation which is smaller than \(Q_1 - 1.5 \cdot IQR\) or larger than \(Q_3 + 1.5 \cdot IQR\), where \(IQR\) is the inter-quartile range (\(IQR = Q_3 - Q_1\)). What proportion of observations in a Normal distribution are outliers according to this definition? Attach an R script with details for your calculations.
# Assume a standard normal distribution, mu = 0, sigma = 1 (the answer is the
# same for any Normal, since the quartiles and fences all scale with sigma)
q1 <- qnorm(.25,0,1)
q3 <- qnorm(.75,0,1)
iqr <- q3 - q1 # IQR of the Normal, about 1.349
lowerLimit <- q1 - 1.5*iqr
upperLimit <- q3 + 1.5*iqr
qInside <- pnorm(upperLimit,0,1) - pnorm(lowerLimit,0,1) # probability between the lower and upper fences
round((1-qInside)*100,2) # percent of observations outside the fences, i.e., outliers
## [1] 0.7
The distribution of the natural log of family income in the US in 2017 was Normal with mean 11.237 and standard deviation 0.747, according to data from the St. Louis Fed. For all questions below attach an R script with details for your calculations.
What is the median family income in the US in 2017?
# The median of a lognormal is e^mu, where mu is the mean of the log of income
# (the log is monotone, so the median maps straight through it)
mu <- 11.237
sd <- 0.747
medIncome <- exp(mu)
medIncome
## [1] 75886.95
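A cross-check using R's built-in lognormal quantile function, which should return the same number:

qlnorm(0.5, meanlog = 11.237, sdlog = 0.747) # median of the lognormal; equals exp(11.237)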
Calculate the interquartile range for the distribution of natural log of family income in the US in 2017.
mu <- 11.237
sd <- 0.747
lnQ1 <- mu - (2/3) * sd # class approximation: the quartiles sit about 2/3 of a sd from the mean (the exact z is qnorm(.75) = 0.6745)
lnQ3 <- mu + (2/3) * sd
lnIQR <- lnQ3 - lnQ1
lnIQR
## [1] 0.996
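For comparison, the exact quartiles via qnorm rather than the 2/3 rule of thumb:

qnorm(.75, 11.237, 0.747) - qnorm(.25, 11.237, 0.747) # about 1.008, close to the 0.996 above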
What is the interquartile range for the distribution of family income in the US in 2017.
mu <- 11.237
sd <- 0.747
lnQ1 <- mu - (.674) * sd # class notes used 2/3; the exact multiplier is qnorm(.75) = 0.6744898, so .674 is used here
lnQ3 <- mu + (.674) * sd
incomeIQR <- exp(lnQ3) - exp(lnQ1)
incomeIQR
## [1] 79684.38
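The same quantity can be cross-checked with qlnorm directly (it should agree up to the rounding of the .674 multiplier):

qlnorm(.75, meanlog = 11.237, sdlog = 0.747) - qlnorm(.25, meanlog = 11.237, sdlog = 0.747) # about 79,750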
In 2017, families earning more than $84,500 were subject to Alternative Minimum Tax. The Tax Reform Act passed in 2017 raised that threshold to $109,400 for 2018. Assuming the distribution of income didn’t change from 2017 to 2018, what percentage of families are no longer subject to AMT in 2018?
mu <- 11.237
sd <- 0.747
origAMT <- log(84500)
newAMT <- log(109400)
pctNoAMT <- round((pnorm(newAMT,mu,sd) - pnorm(origAMT,mu,sd))*100,2) # percent of families between the old and new thresholds
pctNoAMT
## [1] 13.06
Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer's expected lifetime of 10 years, i.e., the machine fails on average once in 10 years, or the probability of the machine failing in any given year is \(\frac{1}{10}\). For each question, attach an annotated R code.
What is the probability that the machine will fail after 8 years? Model as a Poisson. (Hint: Don't forget to use \(\lambda \cdot t\) rather than just \(\lambda\).)
lambda <- 0.1 # failure rate per year
t <- 8 # years
x <- 0 # zero failures in the first 8 years means the machine fails after year 8
pX <- exp(-lambda*t)*(lambda*t)^x/factorial(x) # Poisson pmf P(X = x) with mean lambda*t
pX # probability that the machine fails after 8 years
## [1] 0.449329
dpois(x,(lambda*t)) # check with R's built-in Poisson pmf
## [1] 0.449329
What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If \(X\) is a random variable measuring counts of failure, then we want to find the probability of 0 failures in 8 years.)
pf <- 0.1 # probability the machine fails in any given year
n <- 8 # years, each treated as an independent Bernoulli trial
x <- 0 # zero failures in 8 years
pX <- dbinom(x,size=n,prob=pf) # P(X = 0) = (1-pf)^8
pX # probability that the machine fails after 8 years
## [1] 0.4304672
# Note: this differs slightly from the Poisson answer (0.4493) because the
# Poisson model spreads the failure rate continuously over time, while the
# binomial uses discrete yearly trials.