I have two copies of the famous ‘’Dummy Variables for Dummies’’ textbook. If offered a copy, a random student will accept it with probability 0.3. Calculate the probability that I have to offer it to exactly 6 students before I get rid of both copies. Assume I am offering each student the book sequentially, and their decisions are independent of each other’s.
Calculate this probability from scratch, without using any statistical function in R (Binomial, Poisson, etc.). Attach an annotated R code detailing all the steps you took and the reasoning behind it.
#probability a student takes a bookp<-0.3#probability a student doesn't take a book 1-0.3q<-0.7#probability of one sequence (1 acceptance, 4 rejections in first 5 student) then acceptance by the 6th studentsingle_run<-p*q*q*q*q*p# there are 5 sequences5*single_run
[1] 0.11
0.21.2
The question in the previous part is an example of Negative Binomial distribution. The random variable in this distribution is the number of failures before a given number 𝑟 of successes is obtained, given that each trial has a Bernouli distribution with probability of success 𝑝.
Write a function in R that will calculate the probability distribution function for any 𝑟≥1 and 𝑝∈(0,1). You may use any statistical function you want, except the built-in Negative Binomial. Attach an annotated R code detailing all the steps and the reasoning behind them. Show that your function in R gives you the same answer as the answer you calculated in the first part.
# write the function with substitute variablesprob_funct<-function(k,r,p){ coefficient<-choose(k+r-1, r-1) success<-p^r failure<-(1-p)^k prob<-coefficient*success*failurereturn(prob)}#now add parameters # failuresk<-4# successr<-2#prob of accetance# now put into function and computeprob_funct(k,r,p)
[1] 0.11
0.31.3
Check that the function you wrote generates the same probability distribution function for the Negative Binomial as the built in function in R. Check this with four graphs, each for a given value of parameters 𝑝 and 𝑟. For each graph, show the probability distribution function for values of the number of failures from 0 to 20 calculated by your function and the probability distribution function calculated by the built-in function in R.
Attach an annotated R code detailing all the steps. Attach the graphs.
##set the failure values failure<-0:20# set the parametersparams<-list(list(r=2, p=0.3),list(r=3, p=0.5),list(r=5, p=0.4),list(r=1, p=0.7))par(mfrow=c(2,2))## write a function for the graphs themselvesfor (param in params){ r<-param$r p<-param$p cust2<-prob_funct(failure, r,p) probs_in<-dnbinom(failure, size=r,prob=p)plot( failure, cust2,type="h",lwd=3,col="purple",ylim=c(0,max(cust2)),xlab="",ylab="" )points(failure,probs_in,pch=18,col="black")}
12 Question 2
Yathzee is a game played with five regular six-sided dice.
1.12.1
What is the probability all five dice will show the same number (this is known as an Yahtzee)? Attach an annotated R code detailing all the steps for your calculations.
# sides of diesides<-6# total outcomes with 5 dietotal<-sides^5# sides divided by total outcomes with 5 diesides/total
[1] 0.00077
1.22.2
What is the probability all five dice will show a different number? Attach an annotated R code detailing all the steps for your calculations.
# permutations of 5 different from 6winning2<-factorial(6)/factorial(6-5)# probability all 5 are differentwinning2/total
[1] 0.093
1.32.3
In the game of Yahtzee, a player rolls the dice three times. After each roll, the player decides which dice, if any, to roll again. Suppose a player is trying to obtain an Yahtzee (same number on all five dice). Suppose after the first roll, the player sees two dice showing the same number and three showing other numbers (for example, 1, 1, 3, 5, 6). The player decide to keep the two 1s, and roll again the other three dice. Calculate the probability the player will end up having exactly three dice showing the same number. Attach an annotated R code detailing all the steps for your calculations.
## total possibilities with 3 die rollstotal2<-6^3# first, probability that one additional matching die# cxhoose which of the die matchproba<-choose(3,1)*(1/6)*(5/6)^2*(4/5)# now 3 of a kind with non-matching numbers, choose one of 5probb<-5*(1/6)^3## now calculate the total probabilityproba+probb
[1] 0.3
1.42.4
Assuming the player always keeps largest set of dice showing the same number and rolls the other dice, what is the probability the player will get an Yahtzee within the three rolls (after the first, or after the second, or after the third)?
Write a piece of code in R with a simulation for this probability. Attach the annotated code, detailing the reasoning behind all steps.
# given this is a siumulation, we need to set seed for reproducabilityset.seed(44)# number of simulationssimus<-100# now write this into a function to simulate the yatzee attemptsyahtzee<-function(){ dice<-sample(1:6, 5, replace=TRUE)## as the dice are not being pulledfor (roll in1:3){if(length(unique(dice))==1){return(TRUE) } count<-table(dice) top_match<-names(which.max(count)) keep<-max(count) match_take<-rep(as.numeric(top_match), keep)if(keep<5){ fresh<-sample(1:6,5-keep, replace =TRUE) dice<-c(keep, fresh) }}return(FALSE)} ## if no yatzee# now run the simulation and calculate the probabilityresults<-replicate(simus, yahtzee())mean(results)
[1] 0.05
23 Question 3
In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file “challenger.csv”.
The variables in the data set are defined as follows:
launch: this numbers the temperature-sorted observations from 1 to 23.
temp: temperature in degrees Fahrenheit at the time of launch.
incident: If there was an incident with an O-Ring, then it is coded “Yes”.
o_ring_probs: counts the number of O-ring partial failures experienced on the flight.
Load the data into R and answer the following questions. Include all R code.
2.13.1
Print the measures of center (like mean, median, mode,…), spread (like sd,min,max,…) and shape (skewness,kurtosis,…) for the variables in the data. HINT: You can use the describe function in “psych” package for this.
#read in the data and the psych packagelibrary(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
setwd("/Users/allisonshrivastava/Downloads")challenger<-read.csv("challenger.csv")# check data structurestr(challenger)
vars n mean sd median trimmed mad min max range skew kurtosis
launch 1 23 12.00 6.78 12 12.00 8.9 1 23 22 0.0 -1.36
temp 2 23 69.02 6.97 70 69.33 5.3 54 81 27 -0.4 -0.44
incident* 3 23 1.30 0.47 1 1.26 0.0 1 2 1 0.8 -1.42
o_ring_probs 4 23 0.43 0.79 0 0.26 0.0 0 3 3 1.8 2.69
se
launch 1.41
temp 1.45
incident* 0.10
o_ring_probs 0.16
2.23.2
Second, what are the levels of measurement of these 4 variables? Discuss/Justify.
Launch is an ordinal variable as they describe the order of the launches. Were this just a label for the launch without order, it would be different
Temeprature is a continuous/interval variable as differences between temps are meaningful, and it is a numerical measurement (not categorical)
Incident is a nominal variable, as it just takes on the values yes or no without any numeric assigned or order in place
O-ring_robs is a ratio variable as it can be categorized and ranked. The differences between the numbers are meaningful as well
2.33.3
Third, provide an appropriate graph for the variable o_ring_probs. Interpret. Boxplot is acceptable, though histogram would be better.
Most of the launches had 0 failures as the majority of the incidents have value ‘0’,
and very few launches had 2 to 3 failures.
challenger%>%ggplot(aes(x=o_ring_probs))+geom_histogram(binwidth=1, fill="skyblue",boundary=-0.5)+scale_x_continuous(breaks=0:max(challenger$o_ring_probs))+ggtitle("Histograph of o-ring failures")+theme_minimal()
2.43.4
The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?
Most successful launches occurred at higher temperatures, so a temperature of 36 degrees might be less likely to be successful as there seems to be a correlation between high temps and successful launches
ggplot(challenger, aes(x=factor(incident), y=temp))+geom_boxplot(fill="skyblue",color="navy")+labs(title="Temperature vs. Incident" )+theme_minimal()
2.53.5
In the already temperature-sorted dataset ( order(mydata$temp) ), find on which observation the first successful launch occurred (one with no incident). Answer using a command (instead of eyeballing the data).
High-density lipoprotein (HDL) is sometimes called the “good cholesterol” because low values are associated with a higher risk of heart disease. According to the American Health Association, people over the age of 20 years should have at least 40 milligrams per deciliter (mg/dl) of HDL. United States women aged 20 and over have a mean HDL of 47 mg/dl with a standard deviation of 3.46. Assume that the distribution is Normal. For each question below, attach an annotated script in R.
3.14.1
What percent of women in the United States have low values of HDL (40 mg/dl) or less?
# low hdl is hdl<40# mean HDL of 47# SD of 3.46# normal distribution(pnorm(40, mean=47, sd=3.46)*100)
[1] 2.2
3.24.2
What percent of women in the United States have a value of HDL larger than 50 mg/dl?
# high hdl is hdl>50 as mean+sd = about 50# mean HDL of 47# SD of 3.46# normal distributionbelow<-pnorm(50,mean=47,sd=3.46)(1-below)*100
[1] 19
3.34.3
What percent of women in the United States have a value of HDL between 46 and 50 mg/dl?
# high 50# low 46# mean HDL of 47# SD of 3.46# normal distributionupper_prob<-pnorm(50, mean=47, sd=3.46)lower_prob<-pnorm(46, mean=47, sd=3.46)## now calculate the difference(upper_prob-lower_prob)*100
[1] 42
3.44.4
What values of HDL would place a woman in the United States in the lowest 15% of the distribution?
# mean 47#sd 3.46# lowest 15%qnorm(0.15, mean=47, sd=3.46)
[1] 43
45 Question 5
4.15.1
Similarly to the definition of outliers in a sample, we can define an outlier for a population as on observation which is smaller than 𝑄1−1.5⋅𝐼𝑄𝑅 or larger than 𝑄3+1.5⋅𝐼𝑄𝑅 where 𝐼𝑄𝑅 is the inter-quartile range (𝐼𝑄𝑅=𝑄3−𝑄1). What proportion of observations in a Normal distribution are outliers acording to this definition? Attach an R script with details for your calculations.
# define quartiles and IQRq1<-qnorm(0.25)q3<-qnorm(0.75)iqr<-q3-q1# now define what is an outlier lower<-q1-1.5*iqrupper<-q3+1.5*iqr# now compute the proportionspnorm(lower)+(1-pnorm(upper))
[1] 0.007
4.25.2
The distribution of the natural log of family income in the US in 2017 was Normal with mean 11.237 and standard deviation 0.747, according to data from the St. Louis Fed. For all questions below attach an R script with details for your calculations.
4.2.15.2.1
What is the median family income in the US in 2017?
#normal distribution# mean =11.237# standard deviation = 0.747exp(11.237)
[1] 75887
4.2.25.2.2
Calculate the interquartile range for the distribution of natural log of family income in the US in 2017.
# set parameters and variables to keep everything straightmu<-11.237sig<-0.747# calculate medianmedian_money<-exp(mu)# now calc IQR of ln(income)i1<-qnorm(0.25)i3<-qnorm(0.75)q1<-mu+i1*sigq3<-mu+i3*sigiqr_ln<-q3-q1#print resultsiqr_ln
[1] 1
q1
[1] 11
q3
[1] 12
4.2.35.2.3
What is the interquartile range for the distribution of family income in the US in 2017.
#quartiles of ln(income)q1ln<-mu+qnorm(0.25)*sigq3ln<-mu+qnorm(0.75)*sig#quartiles incomeq1i<-exp(q1ln)q3i<-exp(q3ln)# now iqr of income income<-q3i-q1i#print resultsq1i
[1] 45851
q3i
[1] 125598
income
[1] 79747
4.2.45.2.4
In 2017, families earning more than $84,500 were subject to Alternative Minimum Tax. The Tax Reform Act passed in 2017 raised that threshold to $109,400 for 2018. Assuming the distribution of income didn’t change from 2017 to 2018, what percentage of families are no longer subject to AMT in 2018? 5.
# set thresholdsamt_2017<-84500amt_2018<-109400# convert to lnln_2017<-log(amt_2017)ln_2018<-log(amt_2018)# add in parameterspa1<-(ln_2017-mu)/sigpa2<-(ln_2018-mu)/sig# not calculate the probability of exemptionpnorm(pa2)-pnorm(pa1)
[1] 0.13
56 Question 6
Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years i.e. the machine fails once in 10 years, or the probability of the machine failing in any given year is 110. For each question, attach an annotated R code.
5.16.1
What is the probability that the machine will fail after 8 years? Model as a Poisson. (Hint: Don’t forget to use𝜆⋅𝑡 rather just 𝜆).
# definitionslambda<-1/10time<-8lambda_time<-lambda*time#probability of no failure in time=8dpois(0,lambda_time)
[1] 0.45
5.26.2
What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If 𝑋 is a random variable measuring counts of failure, then we want to find the probability of 0 success in 8 years.
#probability of failure in 1 yearp<-0.1#probability of no failures in time=8 yearsdbinom(0, size=8, prob=p)