Question 1

Flip a fair coin nine times and write down the number of heads obtained. Now repeat this process 100,000 times. Obviously you don’t want to have to do that by hand, so create the necessary lines of R code to do it for you. Hint: You will need both the rbinom() function and the table() function. Write down the results and explain in your own words what they mean.

table(rbinom(n=1, size = 9, prob = 0.5))
## 
## 4 
## 1

Running the above code simulates a single trial of nine flips, which returned 4 heads. This is as expected for a fair coin (probability 0.5), since the count should land near half of the nine flips, i.e. around 4 or 5.

n100k  <- table(rbinom(n=100000, size = 9, prob = 0.5))
n100k
## 
##     0     1     2     3     4     5     6     7     8     9 
##   221  1722  7149 16393 24486 24553 16354  7157  1772   193

In this example, repeating the nine-flip trial 100,000 times, 4 or 5 heads came up in 49,039 trials, so the likelihood of 4 or 5 heads in 9 flips is approximately 49%. This is as expected, since the coin is fair. It also shows that the probability decreases as you move away from 4 and 5, so it is less probable that you'll end up with 3 heads and 6 tails or 7 heads and 2 tails. It is extremely unlikely that the result will be 0 heads or 9 heads: together these occurred in only 414 of the 100,000 trials (about 0.4%).
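As a rough check on these simulated frequencies, they can be compared with the exact binomial probabilities from dbinom(). The sketch below re-enters the counts printed above by hand; the names counts, simulated and theoretical are illustrative and not part of the original answer.

```r
# Counts copied from the simulation output above (0 through 9 heads)
counts <- c(221, 1722, 7149, 16393, 24486, 24553, 16354, 7157, 1772, 193)
names(counts) <- 0:9

# Convert the counts to relative frequencies
simulated <- counts / sum(counts)

# Exact binomial probabilities for 9 flips of a fair coin
theoretical <- dbinom(0:9, size = 9, prob = 0.5)
names(theoretical) <- 0:9

# Side-by-side comparison, rounded for readability
print(round(cbind(simulated, theoretical), 4))

# Probability of 4 or 5 heads: simulated versus exact (252/512)
p45_simulated <- sum(simulated[c("4", "5")])
p45_exact <- sum(theoretical[c("4", "5")])
print(c(simulated = p45_simulated, exact = p45_exact))
```

The exact probability of 4 or 5 heads is 252/512, about 0.492, so the simulated value of about 0.490 is close to what theory predicts.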

Question 2

Using the output from Exercise 1, summarize the results of your 100,000 trials of nine flips each in a bar plot using the appropriate commands in R. Convert the results to probabilities and represent that in a bar plot as well. Write a brief interpretive analysis that describes what each of these bar plots signifies and how the two bar plots are related. Make sure to comment on the shape of each bar plot and why you believe that the bar plot has taken that shape. Also make sure to say something about the center of the bar plot and why it is where it is.

This bar plot illustrates the raw counts from the 100,000 trials. The underlying distribution is binomial; it is symmetric and bell-shaped, so it closely resembles a normal distribution.

This is a representation of the same data, displayed as probabilities instead of raw counts.

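As they show the same data, both bar plots have the same symmetric, bell-like shape; only the vertical scale differs (counts versus proportions). The shape arises because far more sequences of nine flips produce a middle result than an extreme one: there are 126 sequences with exactly 4 heads but only 1 with 0 heads. Given that the coin is fair, the number of heads in 9 flips averages 4.5 over time, which is why the center sits between the bars for 4 and 5 heads. As you move further away from the mean the outcomes become less likely, so 7/2 or 2/7 splits are uncommon and 0/9 or 9/0 splits are extremely improbable.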
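The two bar plots described here could be produced along the following lines. This is a minimal sketch that re-enters the counts from Question 1 by hand; the axis labels, titles and variable names are assumptions, not the original plotting code.

```r
# Counts from the 100,000 simulated trials in Question 1
counts <- c(221, 1722, 7149, 16393, 24486, 24553, 16354, 7157, 1772, 193)
names(counts) <- 0:9

# Bar plot of the raw counts
barplot(counts,
        xlab = "Number of heads in 9 flips",
        ylab = "Number of trials",
        main = "Raw counts")

# Convert counts to probabilities and plot again
probs <- counts / sum(counts)
barplot(probs,
        xlab = "Number of heads in 9 flips",
        ylab = "Proportion of trials",
        main = "Probabilities")

# Center of the distribution: the average number of heads per trial
center <- sum(0:9 * probs)
print(center)
```

Both calls draw bars of identical relative height, because dividing by the total only rescales the vertical axis.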

Question 6

One hundred students took a statistics test. Fifty of them are high school students and 50 are college students. Eighty students passed and 20 students failed. You now have enough information to create a two-by-two contingency table with all of the marginal totals specified (although the four main cells of the table are still blank). Draw that table and write in the marginal totals. I’m now going to give you one additional piece of information that will fill in one of the four blank cells: only three college students failed the test. With that additional information in place, you should now be able to fill in the remaining cells of the two-by-two table. Comment on why that one additional piece of information was all you needed in order to figure out all four of the table’s main cells. Finally, create a second copy of the complete table, replacing the counts of students with probabilities. What is the pass rate for high school students? In other words, if one focuses only on high school students, what is the probability that a student will pass the test?

statsTest <- matrix(c(47,33,3,17),ncol = 2, byrow = TRUE)
colnames(statsTest) <-c('College','High School')
rownames(statsTest) <-c('Pass','Fail')
statsTest <- as.table(statsTest)
statsTest
##      College High School
## Pass      47          33
## Fail       3          17

This table could be completed because all of the marginal totals were given: a sample of 100 students, 50 in high school and 50 in college, of whom 80 passed and 20 failed. Given that only 3 of the 20 failures were college students, the remaining 17 failures must be high school students. Subtracting from the column totals then gives 47 college passes and 33 high school passes. With the margins fixed, a two-by-two table has only one free cell, so one additional count is enough to determine the other three.
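The subtraction argument can be written out step by step. The sketch below uses only the figures given in the question; the variable names are illustrative.

```r
# Marginal totals given in the question
n_college <- 50
n_high_school <- 50
n_pass <- 80
n_fail <- 20

# The single cell supplied by the question
college_fail <- 3

# Every other cell follows by subtraction from a margin
high_school_fail <- n_fail - college_fail
college_pass <- n_college - college_fail
high_school_pass <- n_high_school - high_school_fail

filled <- as.table(matrix(
  c(college_pass, high_school_pass, college_fail, high_school_fail),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Pass", "Fail"), c("College", "High School"))
))

# Show the table together with its marginal totals
print(addmargins(filled))
```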

statsProbs <- statsTest/margin.table(statsTest)
statsProbs
##      College High School
## Pass    0.47        0.33
## Fail    0.03        0.17
statsProbs[,2]/sum(statsProbs[,2])
## Pass Fail 
## 0.66 0.34
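The same column-wise normalization can also be done in one step with prop.table(). This is an alternative sketch, not the code used above; it rebuilds the table so that it runs on its own, and colRates is an illustrative name.

```r
# Rebuild the completed contingency table
statsTest <- as.table(matrix(
  c(47, 33, 3, 17),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Pass", "Fail"), c("College", "High School"))
))

# margin = 2 divides each column by its own total,
# giving pass and fail rates within each student group
colRates <- prop.table(statsTest, margin = 2)
print(colRates)

# Pass rate among high school students only (33 of 50)
hs_pass_rate <- colRates["Pass", "High School"]
print(hs_pass_rate)
```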

Normalizing the High School column so that it sums to 1, we find that the pass rate for high school students is 66% (33 of 50); the remaining 34% of high school students fail the test.

Question 7

In a typical year, 71 out of 100,000 homes in the United Kingdom is repossessed by the bank because of mortgage default (the owners did not pay their mortgage for many months). Barclays Bank has developed a screening test that they want to use to predict whether a mortgagee will default. The bank spends a year collecting test data: 93,935 households pass the test and 6,065 households fail the test. Interestingly, 5,996 of those who failed the test were actually households that were doing fine on their mortgage (i.e., they were not defaulting and did not get repossessed). Construct a complete contingency table from this information. Hint: The 5,996 is the only number that goes in a cell; the other numbers are marginal totals. What percentage of customers both pass the test and do not have their homes repossessed?

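# Fail & Repo'd = 6065 - 5996 = 69; Pass & Repo'd = 71 - 69 = 2;
# Pass & Not Repo'd = 93935 - 2 = 93933
repoTest <- matrix(c(2,93933,69,5996),ncol = 2, byrow = TRUE)
colnames(repoTest) <-c("Repo'd","Not Repo'd")
rownames(repoTest) <-c('Pass','Fail')
repoTest <- as.table(repoTest)
repoTest
##      Repo'd Not Repo'd
## Pass      2      93933
## Fail     69       5996
repoProps <- repoTest/margin.table(repoTest)
repoProps[1,2]
## [1] 0.93933

Of the 6,065 households that failed the test, 5,996 were not repossessed, so 69 households failed the test and were repossessed. Since 71 of the 100,000 homes were repossessed in total, the other 2 repossessed households must have passed the test. That leaves 93,935 - 2 = 93,933 households that passed the test and were not repossessed, so the percentage of customers who both pass the test and do not have their homes repo'd is 93.933%.

Question 8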

Imagine that Barclays deploys the screening test from Exercise 6 on a new customer and the new customer fails the test. What is the probability that this customer will actually default on his or her mortgage? Show your work and especially show the tables that you set up to help with your reasoning.

# Pass & Repo'd = 71 - 69 = 2, so Pass & Not Repo'd = 93935 - 2 = 93933
repoTest <- matrix(c(2,93933,69,5996),ncol = 2, byrow = TRUE)
colnames(repoTest) <-c("Repo'd","Not Repo'd")
rownames(repoTest) <-c('Pass','Fail')
repoTest <- as.table(repoTest)
repoTest
##      Repo'd Not Repo'd
## Pass      2      93933
## Fail     69       5996
repoProps <- repoTest/margin.table(repoTest)
repoProps[2,]/sum(repoProps[2,])
##     Repo'd Not Repo'd 
## 0.01137675 0.98862325

Using the contingency table from the previous question, it can be seen that, even though 6,065 households failed the test, most failures did not lead to a defaulted mortgage. If the Fail row is normalized to sum to 1, only about 1.1% of households that fail the test (69 of 6,065) actually have their home repo'd.
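As a cross-check, the same probability can be reached from the figures stated in the question, first by rebuilding the cell counts from the marginal totals and then by applying Bayes' theorem. This is a sketch under those stated figures; all variable names are illustrative.

```r
# Figures stated in the question
total <- 100000
repossessed <- 71      # homes repossessed in a typical year
passed <- 93935        # households passing the screening test
failed <- 6065         # households failing the screening test
fail_not_repo <- 5996  # failed the test but were not repossessed

# Remaining cells follow by subtraction from the margins
fail_repo <- failed - fail_not_repo
pass_repo <- repossessed - fail_repo
pass_not_repo <- passed - pass_repo

# Bayes' theorem: P(default | fail) = P(fail | default) * P(default) / P(fail)
p_default <- repossessed / total
p_fail <- failed / total
p_fail_given_default <- fail_repo / repossessed
p_default_given_fail <- p_fail_given_default * p_default / p_fail

# Direct calculation from the Fail row, for comparison
direct <- fail_repo / failed
print(c(bayes = p_default_given_fail, direct = direct))
```

Both routes give the same value, because Bayes' theorem here reduces algebraically to the count in the Fail and Repo'd cell divided by the Fail row total.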