Midterm exam

Author

Allison Shrivastava

0.1 1.1

I have two copies of the famous ‘’Dummy Variables for Dummies’’ textbook. If offered a copy, a random student will accept it with probability 0.3. Calculate the probability that I have to offer it to exactly 6 students before I get rid of both copies. Assume I am offering each student the book sequentially, and their decisions are independent of each other’s.

Calculate this probability from scratch, without using any statistical function in R (Binomial, Poisson, etc.). Attach an annotated R code detailing all the steps you took and the reasoning behind it.

#probability a student takes a book
p<-0.3

#probability a student doesn't take a book 1-0.3
q<-0.7

#probability of one sequence (1 acceptance, 4 rejections in first 5 student) then acceptance by the 6th student
single_run<-p*q*q*q*q*p

# there are 5 sequences
5*single_run

[1] 0.11

0.2 1.2

The question in the previous part is an example of Negative Binomial distribution. The random variable in this distribution is the number of failures before a given number 𝑟 of successes is obtained, given that each trial has a Bernouli distribution with probability of success 𝑝.

Write a function in R that will calculate the probability distribution function for any 𝑟≥1 and 𝑝∈(0,1). You may use any statistical function you want, except the built-in Negative Binomial. Attach an annotated R code detailing all the steps and the reasoning behind them. Show that your function in R gives you the same answer as the answer you calculated in the first part.

# write the function with substitute variables

prob_funct<-function(k,r,p){
  coefficient<-choose(k+r-1, r-1)
 success<-p^r
 failure<-(1-p)^k
 prob<-coefficient*success*failure
 return(prob)
}

#now add parameters 
# failures
k<-4
# success
r<-2
#prob of accetance

# now put into function and compute

prob_funct(k,r,p)

[1] 0.11

0.3 1.3

Check that the function you wrote generates the same probability distribution function for the Negative Binomial as the built in function in R. Check this with four graphs, each for a given value of parameters 𝑝 and 𝑟. For each graph, show the probability distribution function for values of the number of failures from 0 to 20 calculated by your function and the probability distribution function calculated by the built-in function in R.

Attach an annotated R code detailing all the steps. Attach the graphs.

##set the failure values 
failure<-0:20

# set the parameters
params<-list(
  list(r=2, p=0.3),
  list(r=3, p=0.5),
  list(r=5, p=0.4),
  list(r=1, p=0.7)
)

par(mfrow= c(2,2))

## write a function for the graphs themselves
for (param in params){
  r<-param$r
  p<-param$p
  cust2<-prob_funct(failure, r,p)
  probs_in<-dnbinom(failure, size=r,prob=p)
  
  plot(
   failure, cust2,
    type="h",
    lwd=3,
    col="purple",
    ylim=c(0,max(cust2)),
    xlab="",
    ylab=""
  )
  points(failure,probs_in,
         pch=18,
         col="black")
  
  
}

1 2 Question 2

Yathzee is a game played with five regular six-sided dice.

1.1 2.1

What is the probability all five dice will show the same number (this is known as an Yahtzee)? Attach an annotated R code detailing all the steps for your calculations.

# sides of die
sides<-6
# total outcomes with 5 die
total<-sides^5
# sides divided by total outcomes with 5 die
sides/total

[1] 0.00077

1.2 2.2

What is the probability all five dice will show a different number? Attach an annotated R code detailing all the steps for your calculations.

# permutations of 5 different from 6
winning2<-factorial(6)/factorial(6-5)

# probability all 5 are different
winning2/total

[1] 0.093

1.3 2.3

In the game of Yahtzee, a player rolls the dice three times. After each roll, the player decides which dice, if any, to roll again. Suppose a player is trying to obtain an Yahtzee (same number on all five dice). Suppose after the first roll, the player sees two dice showing the same number and three showing other numbers (for example, 1, 1, 3, 5, 6). The player decide to keep the two 1s, and roll again the other three dice. Calculate the probability the player will end up having exactly three dice showing the same number. Attach an annotated R code detailing all the steps for your calculations.

## total possibilities with 3 die rolls
total2<-6^3
# first, probability that one additional matching die
# cxhoose which of the die match
proba<-choose(3,1)*(1/6)*(5/6)^2*(4/5)

# now 3 of a kind with non-matching numbers, choose one of 5
probb<-5*(1/6)^3

## now calculate the total probability
proba+probb

[1] 0.3

1.4 2.4

Assuming the player always keeps largest set of dice showing the same number and rolls the other dice, what is the probability the player will get an Yahtzee within the three rolls (after the first, or after the second, or after the third)?

Write a piece of code in R with a simulation for this probability. Attach the annotated code, detailing the reasoning behind all steps.

# given this is a siumulation, we need to set seed for reproducability
set.seed(44)
# number of simulations
simus<-100

# now write this into a function to simulate the yatzee attempts
yahtzee<-function(){
  dice<-sample(1:6, 5, replace=TRUE)## as the dice are not being pulled
  for (roll in 1:3){
    if(length(unique(dice))==1){
      return(TRUE)
    }
    count<-table(dice)
    top_match<-names(which.max(count))
    keep<-max(count)
    match_take<-rep(as.numeric(top_match), keep)
    if(keep<5){
      fresh<-sample(1:6,5-keep, replace = TRUE)
      dice<-c(keep, fresh)
    }}
return(FALSE)} ## if no yatzee

# now run the simulation and calculate the probability
results<-replicate(simus, yahtzee())
mean(results)

[1] 0.05

2 3 Question 3

In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file “challenger.csv”.

The variables in the data set are defined as follows:

launch: this numbers the temperature-sorted observations from 1 to 23.
temp: temperature in degrees Fahrenheit at the time of launch.
incident: If there was an incident with an O-Ring, then it is coded “Yes”.
o_ring_probs: counts the number of O-ring partial failures experienced on the flight.

Load the data into R and answer the following questions. Include all R code.

2.1 3.1

Print the measures of center (like mean, median, mode,…), spread (like sd,min,max,…) and shape (skewness,kurtosis,…) for the variables in the data. HINT: You can use the describe function in “psych” package for this.

#read in the data and the psych package
library(psych)


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

setwd("/Users/allisonshrivastava/Downloads")
challenger<-read.csv("challenger.csv")

# check data structure
str(challenger)

'data.frame':   23 obs. of  4 variables:
 $ launch      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ temp        : num  53.6 57.2 57.2 62.6 66.2 66.2 66.2 66.2 66.2 68 ...
 $ incident    : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ o_ring_probs: int  3 1 1 1 0 0 0 0 0 0 ...

# get stats
describe(challenger)

             vars  n  mean   sd median trimmed mad min max range skew kurtosis
launch          1 23 12.00 6.78     12   12.00 8.9   1  23    22  0.0    -1.36
temp            2 23 69.02 6.97     70   69.33 5.3  54  81    27 -0.4    -0.44
incident*       3 23  1.30 0.47      1    1.26 0.0   1   2     1  0.8    -1.42
o_ring_probs    4 23  0.43 0.79      0    0.26 0.0   0   3     3  1.8     2.69
               se
launch       1.41
temp         1.45
incident*    0.10
o_ring_probs 0.16

2.2 3.2

Second, what are the levels of measurement of these 4 variables? Discuss/Justify.

Launch is an ordinal variable as they describe the order of the launches. Were this just a label for the launch without order, it would be different

Temeprature is a continuous/interval variable as differences between temps are meaningful, and it is a numerical measurement (not categorical)

Incident is a nominal variable, as it just takes on the values yes or no without any numeric assigned or order in place

O-ring_robs is a ratio variable as it can be categorized and ranked. The differences between the numbers are meaningful as well

2.3 3.3

Third, provide an appropriate graph for the variable o_ring_probs. Interpret. Boxplot is acceptable, though histogram would be better.

Most of the launches had 0 failures as the majority of the incidents have value ‘0’,

and very few launches had 2 to 3 failures.

challenger%>%
  ggplot(aes(x=o_ring_probs))+
  geom_histogram(binwidth=1, fill="skyblue",
                 boundary=-0.5)+
  scale_x_continuous(breaks=0:max(challenger$o_ring_probs))+
                 ggtitle("Histograph of o-ring failures")+
                   theme_minimal()

2.4 3.4

The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?

Most successful launches occurred at higher temperatures, so a temperature of 36 degrees might be less likely to be successful as there seems to be a correlation between high temps and successful launches

ggplot(challenger, aes(x=factor(incident), y=temp))+
  geom_boxplot(fill="skyblue",color="navy")+
  labs(
    title="Temperature vs. Incident"
  )+
  theme_minimal()

2.5 3.5

In the already temperature-sorted dataset ( order(mydata$temp) ), find on which observation the first successful launch occurred (one with no incident). Answer using a command (instead of eyeballing the data).

challenger%>%
  arrange(temp)%>%
  filter(incident=="No")%>%
  slice_head(n=1)

  launch temp incident o_ring_probs
1      5   66       No            0

2.6 3.6

How many incidents occurred above 65 degrees F? Answer using a command (instead of eyeballing the data).

count(challenger%>%
  filter(temp>65,
         incident=="Yes"))

  n
1 3

3 4 Question 4

High-density lipoprotein (HDL) is sometimes called the “good cholesterol” because low values are associated with a higher risk of heart disease. According to the American Health Association, people over the age of 20 years should have at least 40 milligrams per deciliter (mg/dl) of HDL. United States women aged 20 and over have a mean HDL of 47 mg/dl with a standard deviation of 3.46. Assume that the distribution is Normal. For each question below, attach an annotated script in R.

3.1 4.1

What percent of women in the United States have low values of HDL (40 mg/dl) or less?

# low hdl is hdl<40
# mean HDL of 47
# SD of 3.46
# normal distribution

(pnorm(40, mean=47, sd=3.46)*100)

[1] 2.2

3.2 4.2

What percent of women in the United States have a value of HDL larger than 50 mg/dl?

# high hdl is hdl>50 as mean+sd = about 50
# mean HDL of 47
# SD of 3.46
# normal distribution

below<-pnorm(50,mean=47,sd=3.46)
(1-below)*100

[1] 19

3.3 4.3

What percent of women in the United States have a value of HDL between 46 and 50 mg/dl?

# high 50
# low 46
# mean HDL of 47
# SD of 3.46
# normal distribution

upper_prob<-pnorm(50, mean=47, sd=3.46)
lower_prob<-pnorm(46, mean=47, sd=3.46)

## now calculate the difference
(upper_prob-lower_prob)*100

[1] 42

3.4 4.4

What values of HDL would place a woman in the United States in the lowest 15% of the distribution?

# mean 47
#sd 3.46
# lowest 15%

qnorm(0.15, mean=47, sd=3.46)

[1] 43

4 5 Question 5

4.1 5.1

Similarly to the definition of outliers in a sample, we can define an outlier for a population as on observation which is smaller than 𝑄1−1.5⋅𝐼𝑄𝑅 or larger than 𝑄3+1.5⋅𝐼𝑄𝑅 where 𝐼𝑄𝑅 is the inter-quartile range (𝐼𝑄𝑅=𝑄3−𝑄1). What proportion of observations in a Normal distribution are outliers acording to this definition? Attach an R script with details for your calculations.

# define quartiles and IQR
q1<-qnorm(0.25)
q3<-qnorm(0.75)
iqr<-q3-q1

# now define what is an outlier 
lower<-q1-1.5*iqr
upper<-q3+1.5*iqr

# now compute the proportions
pnorm(lower)+(1-pnorm(upper))

[1] 0.007

4.2 5.2

The distribution of the natural log of family income in the US in 2017 was Normal with mean 11.237 and standard deviation 0.747, according to data from the St. Louis Fed. For all questions below attach an R script with details for your calculations.

4.2.1 5.2.1

What is the median family income in the US in 2017?

#normal distribution
# mean =11.237
# standard deviation = 0.747
exp(11.237)

[1] 75887

4.2.2 5.2.2

Calculate the interquartile range for the distribution of natural log of family income in the US in 2017.

# set parameters and variables to keep everything straight
mu<-11.237
sig<-0.747
# calculate median
median_money<-exp(mu)
# now calc IQR of ln(income)
i1<-qnorm(0.25)
i3<-qnorm(0.75)

q1<-mu+i1*sig
q3<-mu+i3*sig
iqr_ln<-q3-q1

#print results
iqr_ln

[1] 1

q1

[1] 11

q3

[1] 12

4.2.3 5.2.3

What is the interquartile range for the distribution of family income in the US in 2017.

#quartiles of ln(income)
q1ln<-mu+qnorm(0.25)*sig
q3ln<-mu+qnorm(0.75)*sig

#quartiles income
q1i<-exp(q1ln)
q3i<-exp(q3ln)

# now iqr of income 
income<-q3i-q1i

#print results
q1i

[1] 45851

q3i

[1] 125598

income

[1] 79747

4.2.4 5.2.4

In 2017, families earning more than $84,500 were subject to Alternative Minimum Tax. The Tax Reform Act passed in 2017 raised that threshold to $109,400 for 2018. Assuming the distribution of income didn’t change from 2017 to 2018, what percentage of families are no longer subject to AMT in 2018? 5.

# set thresholds
amt_2017<-84500
amt_2018<-109400

# convert to ln

ln_2017<-log(amt_2017)
ln_2018<-log(amt_2018)

# add in parameters
pa1<-(ln_2017-mu)/sig
pa2<-(ln_2018-mu)/sig

# not calculate the probability of exemption
pnorm(pa2)-pnorm(pa1)

[1] 0.13

5 6 Question 6

Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years i.e. the machine fails once in 10 years, or the probability of the machine failing in any given year is 110. For each question, attach an annotated R code.

5.1 6.1

What is the probability that the machine will fail after 8 years? Model as a Poisson. (Hint: Don’t forget to use𝜆⋅𝑡 rather just 𝜆).

# definitions
lambda<-1/10
time<-8

lambda_time<-lambda*time

#probability of no failure in time=8
dpois(0,lambda_time)

[1] 0.45

5.2 6.2

What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If 𝑋 is a random variable measuring counts of failure, then we want to find the probability of 0 success in 8 years.

#probability of failure in 1 year
p<-0.1

#probability of no failures in time=8 years
dbinom(0, size=8, prob=p)

[1] 0.43