Midterm

Question 1 1.1 I have two copies of the famous ‘’Dummy Variables for Dummies’’ textbook. If offered a copy, a random student will accept it with probability 0.3. Calculate the probability that I have to offer it to exactly 6 students before I get rid of both copies. Assume I am offering each student the book sequentially, and their decisions are independent of each other’s.

Calculate this probability from scratch, without using any statistical function in R (Binomial, Poisson, etc.). Attach an annotated R code detailing all the steps you took and the reasoning behind it.

p<-0.3
n<-6
r<-2
p1 <- choose(n-1,r-1)
p1_1<-p1*(p^r * (1-p)^(n-r))
print(p1_1)

## [1] 0.108045

1.2 The question in the previous part is an example of Negative Binomial distribution. The random variable in this distribution is the number of failures before a given number r of successes is obtained, given that each trial has a Bernouli distribution with probability of success p.

Write a function in R that will calculate the probability distribution function for any r≥1 and p∈(0,1). You may use any statistical function you want, except the built-in Negative Binomial. Attach an annotated R code detailing all the steps and the reasoning behind them. Show that your function in R gives you the same answer as the answer you calculated in the first part.

p1_2<- function(r,n,p){
  p1 <- choose(n-1,r-1)
  p2<-p^r * (1-p)^(n-r)
  return(p1*p2)}

1.3 Check that the function you wrote generates the same probability distribution function for the Negative Binomial as the built in function in R. Check this with four graphs, each for a given value of parameters p and r. For each graph, show the probability distribution function for values of the number of failures from 0 to 20 calculated by your function and the probability distribution function calculated by the built-in function in R.

Attach an annotated R code detailing all the steps. Attach the graphs.

p<-0.3
n<-6
r<-2
p1_3<-dnbinom(n-r, r, p)
print(p1_3)

## [1] 0.108045

p<-0.8
n<-20
r<-16
plot(x=0:n,
     y= dnbinom(x =0:n,r,p), 
     type="o",
     main=" Negative Binomial Distribution (r=2, p=0.3)",
     col="blue",
     xlab="Number of failure before success",
     ylab="Probability",
     lwd  = 3)

p1_2(r,n,p)

## [1] 0.1745595

p1_3<-dnbinom(n-r, r, p)
print(p1_3)

## [1] 0.1745595

p<-0.4
r<-10
n<-20
plot(x=0:n,
     y= dnbinom(x =0:n,r,p), 
     type="o",
     main=" Negative Binomial Distribution (r=10, p=0.4)",
     col="purple",
     xlab="Number of failure before success",
     ylab="Probability",
     lwd  = 3)

p1_2(r,n,p)

## [1] 0.05857078

p1_3<-dnbinom(n-r, r, p)
print(p1_3)

## [1] 0.05857078

p<-0.6
n<-20
r<-3
plot(x=0:20,
     y= dnbinom(x =0:20,r,p), 
     type="o",
     main=" Negative Binomial Distribution (r=3, p=0.6)",
     col="pink",
     xlab="Number of failure before success",
     ylab="Probability",
     lwd  = 3)

p1_2(r,n,p)

## [1] 6.345556e-06

p1_3<-dnbinom(n-r, r, p)
print(p1_3)

## [1] 6.345556e-06

p<-0.15
r<-15
n<-20
plot(x=0:20,
     y= dnbinom(x =0:20,r,p), 
     type="o",
     main=" Negative Binomial Distribution (r=15, p=0.15)",
     col="brown",
     xlab="Number of failure before success",
     ylab="Probability",
     lwd  = 3)

p1_2(r,n,p)

## [1] 2.259272e-09

p1_3<-dnbinom(n-r, r, p)
print(p1_3)

## [1] 2.259272e-09

Question 2 Yathzee is a game played with five regular six-sided dice.

2.1 What is the probability all five dice will show the same number (this is known as an Yahtzee)? Attach an annotated R code detailing all the steps for your calculations.

n<-6
Dice<-5
pside<-1/n
p2_1<-n*pside^Dice
print(p2_1)

## [1] 0.0007716049

the probability all five dice will show the same number is 0.0007716049.

2.2 What is the probability all five dice will show a different number? Attach an annotated R code detailing all the steps for your calculations.

n <- 6
Dice<-5
p2_2<-factorial(n)/(n^Dice)
print(p2_2)

## [1] 0.09259259

the probability all five dice will show a different number is 0.09259259.

2.3 In the game of Yahtzee, a player rolls the dice three times. After each roll, the player decides which dice, if any, to roll again. Suppose a player is trying to obtain an Yahtzee (same number on all five dice). Suppose after the first roll, the player sees two dice showing the same number and three showing other numbers (for example, 1, 1, 3, 5, 6). The player decide to keep the two 1s, and roll again the other three dice. Calculate the probability the player will end up having exactly three dice showing the same number. Attach an annotated R code detailing all the steps for your calculations.

n <- 6
Dice<-5
pside<-1/n
p2 <- function (a,b){
  return(factorial(a)/factorial(a - b))}
p2_3<-p2(3,1)*pside*(1-pside)*(1-pside)
print(p2_3)

## [1] 0.3472222

the probability the player will end up having exactly three dice showing the same number is 0.3472222.

2.4 Assuming the player always keeps largest set of dice showing the same number and rolls the other dice, what is the probability the player will get an Yahtzee within the three rolls (after the first, or after the second, or after the third)?

Write a piece of code in R with a simulation for this probability. Attach the annotated code, detailing the reasoning behind all steps.

M <- matrix(c(120/1296, 900/1296, 250/1296, 25/1296, 1/1296,
               0, 120/216, 80/216, 15/216, 1/216,
               0, 0, 25/36, 10/36, 1/36,
               0, 0, 0, 5/6, 1/6,
               0, 0, 0, 0, 1), nrow = 5,ncol=5,byrow="TRUE")
print(M)

##            [,1]      [,2]      [,3]       [,4]         [,5]
## [1,] 0.09259259 0.6944444 0.1929012 0.01929012 0.0007716049
## [2,] 0.00000000 0.5555556 0.3703704 0.06944444 0.0046296296
## [3,] 0.00000000 0.0000000 0.6944444 0.27777778 0.0277777778
## [4,] 0.00000000 0.0000000 0.0000000 0.83333333 0.1666666667
## [5,] 0.00000000 0.0000000 0.0000000 0.00000000 1.0000000000

P2_4<-M%*%M%*%M
print(P2_4[1,5])

## [1] 0.04602864

Question 3 In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file “challenger.csv”.

The variables in the data set are defined as follows: launch: this numbers the temperature-sorted observations from 1 to 23. temp: temperature in degrees Fahrenheit at the time of launch. incident: If there was an incident with an O-Ring, then it is coded “Yes”. o_ring_probs: counts the number of O-ring partial failures experienced on the flight. Load the data into R and answer the following questions. Include all R code.

mydata <- read.csv("C:/Users/dingz/Downloads/challenger.csv")

3.1 Print the measures of center (like mean, median, mode,…), spread (like sd,min,max,…) and shape (skewness,kurtosis,…) for the variables in the data. HINT: You can use the describe function in “psych” package for this.

library("psych")
describe(mydata)

##              vars  n  mean   sd median trimmed  mad  min  max range  skew
## launch          1 23 12.00 6.78   12.0   12.00 8.90  1.0 23.0    22  0.00
## temp            2 23 69.02 6.97   69.8   69.33 5.34 53.6 80.6    27 -0.40
## incident*       3 23  1.30 0.47    1.0    1.26 0.00  1.0  2.0     1  0.80
## o_ring_probs    4 23  0.43 0.79    0.0    0.26 0.00  0.0  3.0     3  1.81
##              kurtosis   se
## launch          -1.36 1.41
## temp            -0.44 1.45
## incident*       -1.42 0.10
## o_ring_probs     2.69 0.16

3.2 Second, what are the levels of measurement of these 4 variables? Discuss/Justify.

str(mydata)

## 'data.frame':    23 obs. of  4 variables:
##  $ launch      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ temp        : num  53.6 57.2 57.2 62.6 66.2 66.2 66.2 66.2 66.2 68 ...
##  $ incident    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ o_ring_probs: int  3 1 1 1 0 0 0 0 0 0 ...

Launch: This is a numerical variable that represent the number of temperature from 1 to 23. It does not have a meaningful zero point and the spread and shape is not significant, but it sequenced the space shuttle launches. Hence, it is an ordinal variable.

Temp: This is a quantitative variable that represent the temperature by Fahrenheit during time. Since temperature does not have a zero point and can be measured on a continuous scale, it is a interval variable.

Incident: This is a categorical variable that represent whether there is an incident on the O-ring. The result is always shows “Yes” or “No”. As it does not have a numerical interpretation, it is a nominal variable.

O_ring_probs: This is a numerical variable that represent the number of O-ring partial failures experienced on the flight. Since it has a meaningful zero point, it isan ratio variable.

3.3 Third, provide an appropriate graph for the variable o_ring_probs. Interpret. Boxplot is acceptable, though histogram would be better.

hist(mydata$o_ring_probs,main = "o_ring_probs", col="grey", xlim=c(0,3))

3.4 The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?

boxplot(mydata$temp~mydata$incident, main="Temperature by Incident", xlab="Incident", ylab="Temperature", col="darkgreen")

The boxplot indicate that launch O-Rings at lower temperatures is more likely to result in accidents. It could be suggest that temperature is a factor of leading indicent.

3.5 In the already temperature-sorted dataset ( order(mydata$temp) ), find on which observation the first successful launch occurred (one with no incident). Answer using a command (instead of eyeballing the data).

P3_5 <- which(mydata$incident == "No")[1]
print(P3_5)

## [1] 5

3.6 How many incidents occurred above 65 degrees F? Answer using a command (instead of eyeballing the data).

p3_6<-sum(mydata$temp > 65 & mydata$incident == "Yes")
print(p3_6)

## [1] 3

Question 4

High-density lipoprotein (HDL) is sometimes called the “good cholesterol” because low values are associated with a higher risk of heart disease. According to the American Health Association, people over the age of 20 years should have at least 40 milligrams per deciliter (mg/dl) of HDL. United States women aged 20 and over have a mean HDL of 47 mg/dl with a standard deviation of 3.46. Assume that the distribution is Normal. For each question below, attach an annotated script in R.

4.1 What percent of women in the United States have low values of HDL (40 mg/dl) or less?

mean1<-47
sd1<-3.46
p4_1<-pnorm(40,mean1,sd1)
round(p4_1*100,2)

## [1] 2.15

There are 2.15% of women have low values of HDL or less.

4.2 What percent of women in the United States have a value of HDL larger than 50 mg/dl?

mean1<-47
sd1<-3.46
p4_2<-1-pnorm(50,mean1,sd1)
round(p4_2*100,2)

## [1] 19.3

There are 19.3% of women in the United States have a value of HDL larger than 50 mg/dl.

4.3 What percent of women in the United States have a value of HDL between 46 and 50 mg/dl?

mean1<-47
sd1<-3.46
p4_3<-pnorm(50,mean1,sd1)-pnorm(46,mean1,sd1)
round(p4_3*100,2)

## [1] 42.08

There are 42.08% of women in the United States have a value of HDL between 46 and 50 mg/dl.

4.4 What values of HDL would place a woman in the United States in the lowest 15% of the distribution?

mean1<-47
sd1<-3.46
p4_4<-qnorm(.15,mean1,sd1)
round(p4_4,2)

## [1] 43.41

Woman with 43.41 HDL woukd plcace in the lowest 15% of the distribution.

Question 5 5.1 Similarly to the definition of outliers in a sample, we can define an outlier for a population as on observation which is smaller than Q1−1.5⋅IQR or larger than Q3+1.5⋅IQR where IQR is the inter-quartile range (IQR=Q3−Q1). What proportion of observations in a Normal distribution are outliers acording to this definition? Attach an R script with details for your calculations.

mean1<-0
sd1<-1
Q1<-qnorm(0.25,mean1,sd1)
Q3<-qnorm(0.75,mean1,sd1)
p5_1<-(pnorm(Q1-(1.5*(Q3-Q1)),mean1,sd1)+
         (1-pnorm(Q3+(1.5*(Q3-Q1)),mean1,sd1)))*100
print(p5_1)

## [1] 0.6976603

5.2 The distribution of the natural log of family income in the US in 2017 was Normal with mean 11.237 and standard deviation 0.747, according to data from the St. Louis Fed. For all questions below attach an R script with details for your calculations.

5.2.1 What is the median family income in the US in 2017?

mean1<-11.237
sd1<-0.747
p5_2_1<-exp(qnorm(0.5,mean1,sd1))
print(p5_2_1)

## [1] 75886.95

5.2.2 Calculate the interquartile range for the distribution of natural log of family income in the US in 2017.

mean1<-11.237
sd1<-0.747
Q1<-qnorm(0.25,mean1,sd1)
Q3<-qnorm(0.75,mean1,sd1)
p5_2_2<-Q3-Q1
print(p5_2_2)

## [1] 1.007688

5.2.3 What is the interquartile range for the distribution of family income in the US in 2017.

mean1<-11.237
sd1<-0.747
Q1dis<-exp(qnorm(0.25,mean1,sd1))
Q3dis<-exp(qnorm(0.75,mean1,sd1))
p5_2_3<-Q3dis-Q1dis
print(p5_2_3)

## [1] 79747.1

5.2.4 In 2017, families earning more than $84,500 were subject to Alternative Minimum Tax. The Tax Reform Act passed in 2017 raised that threshold to $109,400 for 2018. Assuming the distribution of income didn’t change from 2017 to 2018, what percentage of families are no longer subject to AMT in 2018? 5.

mean2017<-11.237
sd2017<-0.747
mean2018<-11.237
sd2018<-0.747
p5_2_4<-(pnorm(log(109400),mean2018,sd2018)-pnorm(log(84500),mean2017,sd2017))*100
round(p5_2_4,2)

## [1] 13.06

Question 6 Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years i.e. the machine fails once in 10 years, or the probability of the machine failing in any given year is 1/10. For each question, attach an annotated R code.

6.1 What is the probability that the machine will fail after 8 years? Model as a Poisson.

lambda<-.1
t<-8
x<-0
p6_1<-((lambda*t)^x)*(exp(-1*lambda*t))/factorial(x)
round(p6_1,4)

## [1] 0.4493

6.2 What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If X is a random variable measuring counts of failure, then we want to find the probability of 0 success in 8 years.)

p<-.1
n<-8
failure<-0
p6_2<-dbinom(failure,n,p)
round(p6_2,4)

## [1] 0.4305

Midterm

Zhaoyang Ding

2024-09-19