Q1

The sensitivity and specificity of the polygraph has been a subject of study and debate for years. A 2001 study of the use of polygraph for screening purposes suggested that the probability of detecting a liar was .8 (sensitivity) and that the probability of detectinga “truth teller” was .9 (specificity). We estimate that about 30% of individuals selected for the screening polygraph will lie.

What is the probability that an individual is actually a liar given that the polygraph detected him/her as such? (Show me the table or the formulaic solution or both.)
What is the probability that an individual is actually a truth-teller given that the polygraph detected him/her as such? (Show me the table or the formulaic solution or both.)
What is the probability that a randomly selected individual is either a liar or was identified as a liar by the polygraph? Be sure to write the probability statement.

##Q1

#Probability the test identifies a liar
#Sensitivity = A/A+c
#Specificity = D/D+B
A<-.8
D<-.9
C<-1-A
B<-1-D
Liar<-A+C
Truth<-B+D
PredLiar<-A+B
PredTruth<-C+D
Total<-A+B+C+D

#True Pos
ProbA<-A/Total
#False Negative
ProbC<-C/Total
#False Positive
ProbB<-B/Total
#True Neg
ProbD<-D/Total

#Table
polygraph <- matrix(c(A,B,PredLiar,C,D,PredTruth,Liar,Truth,Total),ncol=3,byrow=TRUE)
colnames(polygraph) <- c("Liar","Truth","Total")
rownames(polygraph) <- c("Test Liar","Test Truth","Total")
polygraph <- as.table(polygraph)
polygraph

##            Liar Truth Total
## Test Liar   0.8   0.1   0.9
## Test Truth  0.2   0.9   1.1
## Total       1.0   1.0   2.0

#a
#Probability individual is actually a liar given the test detected liar
AgivenLiar<-A/PredLiar

#b
#Probability individual is actually a truth-teller given the test detected truth-teller
#D/(D+C)
DgivenTruth<-D/PredTruth

#c
probliar<-.3

cat("a.The probability the individual is actually a liar given the test detected him/her as such is",
    "\n",round(AgivenLiar,2),
    "\n", "b. The probability the individual is actually a truth-teller given the test detected him/her as such is", 
    "\n",round(DgivenTruth,2), 
    "\n", "c. The probability a randomly selected individual is actually a liar or was identified as such is", 
    "\n", probliar+(PredLiar/Total))

## a.The probability the individual is actually a liar given the test detected him/her as such is 
##  0.89 
##  b. The probability the individual is actually a truth-teller given the test detected him/her as such is 
##  0.82 
##  c. The probability a randomly selected individual is actually a liar or was identified as such is 
##  0.75

Q2

Your organization owns an MRI. This machine has a manufacturer’s expected lifetime of 15 years. This means that we expect one failure every 15 years. (Include the probability statements and R Code for each part.)

What is the probability that the machine will fail at or after 10 years? Model using probability based on the assumption independent trials.
What is the probability that the machine will fail at or after 10 years? Provide also the expected value and standard deviation. Model as a binomial.
What is the probability that the machine will fail at or after 10 years? Provide also the expected value and standard deviation. Model as a Poisson.

#a
p<-1/15
mygeom<-dgeom(0:15,p)
sum(mygeom[0:11])

## [1] 0.531829

#b.Section 4.3 in OpenIntro Statistics
p<-1/15
n<-15
#I couldn't get this to work with a binom function and couldn't figure out why
mybinom<-dbinom(0:n, n, p)
sum(mybinom[0:11]) #>>> This gave me 1

## [1] 1

#So I did it by hand using formulas from 4.3 in OpenIntro Statistics
p<-1/15
n<-10
#0 fails before 10 years
x=0
prob10<-1−(p^x)*(1−p)^(n−x)
prob10

## [1] 0.4983882

#Expected Valued
ex.binom<-n*p
ex.binom

## [1] 0.6666667

#Standard Deviation
sd.binom = sqrt(n*p*(1-p))
sd.binom

## [1] 0.7888106

#c. Section 4.5 in OpenIntro Statistics
p<-1/15
n<-10
t<-1
#λ=np/t
lambda<-n*p/t
pois<-dpois(1:10000,(lambda*t))
sum(pois)

## [1] 0.4865829

#Expected Value is lambda
ex.pois<-n*p/1
ex.pois

## [1] 0.6666667

#Standard Deviation
sd.pois = sqrt(lambda)
sd.pois

## [1] 0.8164966

Q3

In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file challenger.csv. Load the data and attach.

The variables in the data set are defined as follows: launch: this numbers the temperature-sorted observations from 1 to 23. temp: temperature in degrees Fahrenheit at the time of launch incident: If there was an incident with an O-Ring, then it is coded “Yes.” o_ring_probs: counts the number of O-ring partial failures experienced on the flight.

What are the levels of measurement (nominal, ordinal, interval, ratio) of these variables? Justify.
Provide appropriate descriptive statistics and graphs for the variable o_ring_probs. Interpret. Provide measures of center, spread, shape, position, and two appropriate plots that are appropriate for the level of measurement. Do not provide statistics that are not appropriate. Discuss the distribution based on the graphs and statistics. Does its shape remind you of any distribution we have evaluated so far?
The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident). Why might this have been a concern?

#set directory
setwd("C:/Users/hrall/OneDrive/Documents/R")
#read data
df <- read.csv("challenger.csv",header=TRUE)
#a. 
cat("I. launch: Nominal - Each observation is numeric, but the values are used as unique identifiers and do not represent specific measures. The order is arbitrary and the data cannot be used to calculate a meaningful output such as a mean or median.",
"\n", "II.  temp: Interval - The number provided represents a specific numeric value, but there is no true zero value (or absence of temperature).", 
"\n", "III. incident: Nominal - The values are used to categorize the data. The values cannot be ranked and there are not measurable intervals between each value.", 
"\n", "IV.  o_ring_probs: Ratio – The number provided represents a specific numeric value and these values have equal intervals. Additionally, there is a true zero.")

## I.   launch: Nominal - Each observation is numeric, but the values are used as unique identifiers and do not represent specific measures. The order is arbitrary and the data cannot be used to calculate a meaningful output such as a mean or median. 
##  II. temp: Interval - The number provided represents a specific numeric value, but there is no true zero value (or absence of temperature). 
##  III.    incident: Nominal - The values are used to categorize the data. The values cannot be ranked and there are not measurable intervals between each value. 
##  IV. o_ring_probs: Ratio – The number provided represents a specific numeric value and these values have equal intervals. Additionally, there is a true zero.

#b
#Create a histogram
hist(df$o_ring_probs, main = "Frequency of O-ring failures",xlab="Number of O-ring Failures", ylab="Frequency" )

#Create a normal distribution curve
mean.oring<-mean(df$o_ring_probs)
sd.oring<-sd(df$o_ring_probs)
curve(dnorm(x, mean=mean.oring, sd=sd.oring), xlab = "O-ring Failures", ylab = "Distribution")

summary(df$o_ring_probs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4348  1.0000  3.0000

#Create 95% confidence interval
p<-mean.oring
n<-length(df$o_ring_probs)
alpha=.05 
z=qnorm(1-alpha/2) 
se=sqrt(p*(1-p)/n) 
lower=round((p-z*se),3) 
upper=round((p+z*se),3)
cat("With a 95% confidence interval, launches will include between", lower,"and", upper, "O-ring failures. Since an O-ring failure must be a whole number, we can assume with 95% confidence that a launch will contain 0 or 1 O-ring failures.")

## With a 95% confidence interval, launches will include between 0.232 and 0.637 O-ring failures. Since an O-ring failure must be a whole number, we can assume with 95% confidence that a launch will contain 0 or 1 O-ring failures.

#c Boxplot 
boxplot(temp~o_ring_probs, data=df, main="Temperature at Time of Launch vs O-ring Failures",
        xlab="Number of O-ring Failures", ylab="Temperature")

cat("The box plot shows that 0 or 1 O-ring failure occurred within the upper quartile range ending at",
    "\n","approximately 75 degrees and the lower quartile range ending at approximately 57 degrees.",
    "\n","There are no box plots for 2 and 3 o-ring failures because each only occurred once",
    "\n","within the data. Both occurred at temperatures outside the lower and upper quartiles",
    "\n","mentioned. We can therefore infer that temperatures above approximately 75 degrees Fahrenheit",
    "\n","and below 57 degrees Fahrenheit increases the probability of more than one O-ring failure.")

## The box plot shows that 0 or 1 O-ring failure occurred within the upper quartile range ending at 
##  approximately 75 degrees and the lower quartile range ending at approximately 57 degrees. 
##  There are no box plots for 2 and 3 o-ring failures because each only occurred once 
##  within the data. Both occurred at temperatures outside the lower and upper quartiles 
##  mentioned. We can therefore infer that temperatures above approximately 75 degrees Fahrenheit 
##  and below 57 degrees Fahrenheit increases the probability of more than one O-ring failure.

Q4

Fischer’s Iris dataset provides sepal length, sepal width, petal length, and petal width data for three species of iris flowers. Provide your SINGLE, most meaningful, exploratory graphical chart comparing the four quantitative variables across species. Explain it. If you provide more than one chart, you will earn zero points.

mean(iris$Sepal.Length)

## [1] 5.843333

sd(iris$Sepal.Length)

## [1] 0.8280661

install.packages("car", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/hrall/OneDrive/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)

## package 'car' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\hrall\AppData\Local\Temp\Rtmp0cMCMH\downloaded_packages

library(car)

## Warning: package 'car' was built under R version 4.1.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.1.2

scatterplot( Sepal.Length ~ Petal.Length | Species, data=iris,
             xlab="Sepal Length", ylab="Petal Length",
             main="Sepal vs Petal Length")

cat("For all species of iris flowers, petal length increases as sepal length increases.",
    "\n","Setosa iris flowers have the shortest petal and sepal length and virginica have the longest.",
    "\n","Additionally, Setosa iris flowers appear to have less of a correlation between the sepal length",
    "\n","and petal length than the versicolor or virginica iris flowers as the data is more scattered",
    "\n","and less congregated around the line of best fit.")

## For all species of iris flowers, petal length increases as sepal length increases. 
##  Setosa iris flowers have the shortest petal and sepal length and virginica have the longest. 
##  Additionally, Setosa iris flowers appear to have less of a correlation between the sepal length 
##  and petal length than the versicolor or virginica iris flowers as the data is more scattered 
##  and less congregated around the line of best fit.

Q5

You are constructing a histogram for describing the distribution of salaries of those individuals who are 40 years are older but not yet retired. Draw the probable shape of the distribution labeling the mean, median, and mode locations as well as both axes using R. (You may have to search for the functions that can help you.) Justify your response

N <- 10000 
n <- 100 
p <- .9 
x <- rbinom(N,n,p)
hist(x, 
     xlim = c(min(x), max(x)), 
     probability = TRUE, 
     #bins
     nclass = max(x) - min(x) + 1, 
     main = 'Salary Distribution (Age 40 and Over, Not Retired)',
     xlab = "Salary in Thousands")
lines(density(x,bw=1), lwd = 3)

mean(x)

## [1] 90.0334

median(x)

## [1] 90

#mode function
getmode <- function(y) {
  uniqy <- unique(y)
  uniqy[which.max(tabulate(match(y, uniqy)))]
}
getmode(x)

## [1] 91

cat("The distribution is skewed left with the median and mode being greater than the mean.
    This distribution indicates a heavy upper tail. In other words, a smaller percentage of 
    people observed have a larger portion of the total amount of income (the area under the 
    curve is larger on the left if the curve is split at the mean).")

## The distribution is skewed left with the median and mode being greater than the mean.
##     This distribution indicates a heavy upper tail. In other words, a smaller percentage of 
##     people observed have a larger portion of the total amount of income (the area under the 
##     curve is larger on the left if the curve is split at the mean).

Q6

I recently conducted some animal research where I was investigating survival of swine based on what drug was given to them. Data are shown below.

Survived Died Totals Drug 1 7 0 7 Drug 2 5 2 7 Totals 12 2 14

Let A represent the drug provided (A1=drug 1, A2=drug 2). Let B represent the pig’s survival. (B1=survived, B2=died). For each cell, calculate the joint probability. In other words, calculate P(A1B1), P(A1B2), P(A2B1), P(A2B2). Place these probabilities in the following table. (Don’t panic here. This is as easy as you think it is.)
```
   Survived Died
```
Drug 1 Drug 2
For each row and column, calculate the marginal probability. In other words, calculate the four marginal probabilities, P(A1), P(A2), P(B1), P(B2). Place them in the next table with the results from part a. (Just remember how we calculated marginal probability in class.) Survived Died Totals Drug 1 Drug 2 Totals
Independence of events means that P(AiBj) = P(Ai) x P(Bj) for all values of i and j. For true independence of events, the joint (cell) probabilities should equal the appropriate marginal probabilities multiplied by each other. In other words, you should be able to multiply the row and column marginal probabilities to obtain the cell probability. If this is not the case, then the events are not truly independent from a noninferential point of view. Demonstrate that survival and drug choice are not independent solely based on the definition of independence. In other words, investigate if P(AiBj) = P(Ai) x P(Bj) for all values of i and j.

#a
B1<-12
B2<-2
a.A1B1<-round(7/B1,2)
a.A1B2<-round(0/B2,2)
a.A2B1<-round(5/B1,2)
a.A2B2<-round(2/B2,2)

pigs1 <- matrix(c(a.A1B1,a.A1B2,a.A2B1,a.A2B2),ncol=2,byrow=TRUE)
colnames(pigs1) <- c("Survived","Died")
rownames(pigs1) <- c("Drug 1","Drug 2")
pigs1 <- as.table(pigs1)
pigs1

##        Survived Died
## Drug 1     0.58 0.00
## Drug 2     0.42 1.00

#c
c.A1B1<-round(7/14,2)
c.A1B2<-round(0/14,2)
c.A2B1<-round(5/14,2)
c.A2B2<-round(2/14,2)

pigs <- matrix(c(c.A1B1,c.A1B2,c.A1B1+c.A1B2,c.A2B1,c.A2B2,c.A2B1+c.A2B2,c.A1B1+c.A2B1, c.A1B2+c.A2B2, c.A1B1+c.A1B2+c.A2B1+c.A2B2),ncol=3,byrow=TRUE)
colnames(pigs) <- c("Survived","Died","Total")
rownames(pigs) <- c("Drug 1","Drug 2","Total")
pigs <- as.table(pigs)
pigs

##        Survived Died Total
## Drug 1     0.50 0.00  0.50
## Drug 2     0.36 0.14  0.50
## Total      0.86 0.14  1.00

#d
#If independent then P(AiBj) == P(Ai)*P(Bj)
A1<-7/14
A2<-7/14
B1<-12/14
B2<-2/14

#Manually entering a.AiBj values as rounding could affect the test
t1<-.58 == (A1*B1)
s1<-round(A1*B1,2)
#a.A1B2
t2<-0 == (A1*B2)
s2<-round(A1*B2,2)
#a.A2B1
t3<-.42 == (A2*B1)
s3<-round(A2*B1,2)
#a.A2B2
t4<-1 == (A2*B2)
s4<-round(A2*B2,2)
cat("True or False: P(A1B1)=P(A1)*P(B1)?",t1,"\n",
    "P(A1B2)=P(A1)*P(B2)?",t2,"\n",
    "P(A2B1)=P(A2)*P(B1)?",t3,"\n",
    "P(A2B2)=P(A2)*P(B2)?",t4,"\n",
    "What does P(A1)*P(B1) equal?", s1,"\n",
    "What does P(A1)*P(B2) equal?", s2,"\n",
    "What does P(A2)*P(B1) equal?", s3,"\n",
    "What does P(A2)*P(B2) equal?", s4)

## True or False: P(A1B1)=P(A1)*P(B1)? FALSE 
##  P(A1B2)=P(A1)*P(B2)? FALSE 
##  P(A2B1)=P(A2)*P(B1)? FALSE 
##  P(A2B2)=P(A2)*P(B2)? FALSE 
##  What does P(A1)*P(B1) equal? 0.43 
##  What does P(A1)*P(B2) equal? 0.07 
##  What does P(A2)*P(B1) equal? 0.43 
##  What does P(A2)*P(B2) equal? 0.07

Q7

The following graph represents GDP growth for the US and the Euro area. Identify the problems associated with this graph. Then, generate your own graph that portrays the data in an improved way

cat("The graph compares the GDP growth by percentage between the two countries.","\n",
    "It is unclear how this percentage was calculated and if the two contries GDPs",
    "\n","can be compared to each other. An alternative would be to compare the percentage",
    "\n","change in GDP year over year and create a separate curve for each country.",
    "\n", "By doing this, we are comparing the countries to themselves, and observing the",
    "\n","differences in change, thus eliminiating potential measurement discrepancies.")

## The graph compares the GDP growth by percentage between the two countries. 
##  It is unclear how this percentage was calculated and if the two contries GDPs 
##  can be compared to each other. An alternative would be to compare the percentage 
##  change in GDP year over year and create a separate curve for each country. 
##  By doing this, we are comparing the countries to themselves, and observing the 
##  differences in change, thus eliminiating potential measurement discrepancies.

#set directory
setwd("C:/Users/hrall/OneDrive/Documents/R")
#read data
install.packages("ggplot2", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/hrall/OneDrive/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)

## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\hrall\AppData\Local\Temp\Rtmp0cMCMH\downloaded_packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.1.2

## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.

dfgdp <- read.csv("GDP_Data.csv",header=TRUE)

ggplot(dfgdp, aes(x=Year, y=Change_._Year_Over_Year, color=Country))+
  geom_line() + geom_point() +
  ggtitle("GDP Growth Change Year Over Year") + xlab("Year")+ ylab("Change in GDP Percentage")

## Warning: Removed 2 row(s) containing missing values (geom_path).

## Warning: Removed 2 rows containing missing values (geom_point).

## Q8

A sample of 900 computer chips revealed that 58% of the chips fail in the first 1000 hours of their use. The company’s promotional literature states that 54% of the chips fail in the first 1000 hours of their use. The quality control manager wants to test the claim that the actual percentage that fail is different from the stated percentage. Construct a 98% confidence interval for the proportion of chips that fail. Then conduct a hypothesis test and show all the steps (including null and alternative). Is there enough evidence at the 0.02 level to support the manager’s claim?

p=.58
n=900
#alpha is .02 for a 98% CI
alpha=.02 
z=qnorm(1-alpha/2) #getting Z value for use
se=sqrt(p*(1-p)/n) #calculating standard error
lower=p-z*se   #getting lower CI
upper=p+z*se  #getting upper CI
interval=c(lower,p,upper) #building answer
names(interval)=c('Lower 1% CI', 'proportion estimate', 'Upper 1% CI') #printing answer
interval

##         Lower 1% CI proportion estimate         Upper 1% CI 
##            0.541727            0.580000            0.618273

cat("Using the sample of 900 computers to create a 98% confidence interval,", 
    "\n","the percentage of chips that fail in the first 1000 hours will be between 54.17% and 61.82%.",
    "\n", "The 54% stated in the promotional literature is outside of that interval and therfore incorrect",
    "\n", "based on the CI constructed.")

## Using the sample of 900 computers to create a 98% confidence interval, 
##  the percentage of chips that fail in the first 1000 hours will be between 54.17% and 61.82%. 
##  The 54% stated in the promotional literature is outside of that interval and therfore incorrect 
##  based on the CI constructed.

#Hypothesis test
#Null Hypothesis:  54% of chips fail in the first 1000 hours of their use; pi = .54
#Alternative Hypothesis:  More or less than 54% of chips fail in the first 1000 hours of their use; pi does not = .54

H0<-.54
alpha = .02
1-pnorm(p,H0,se)

## [1] 0.007521905

cat("The p-value is less than alpha, therefore, we can reject Ho and accept the alternative hypothesis.",
    "\n", "In other words, there is enough evidence to support the manager's claim that the actual percentage",
    "\n"," of chips that fail in the first 1000 hours of use is different from the stated percentage.")

## The p-value is less than alpha, therefore, we can reject Ho and accept the alternative hypothesis. 
##  In other words, there is enough evidence to support the manager's claim that the actual percentage 
##   of chips that fail in the first 1000 hours of use is different from the stated percentage.

Midterm

Haylee Allen

2/13/2022

Q1

Q2

Q3

Q4

Q5

Q6

Q7