Homework 4

Problem 1

Part (a)

Last week, we met Lazy Larry. Larry wanted to group all the strawberry ice cream eaters so he could deal with them first in his experiment. His solution was to zap all the strawberry ice cream eaters, and randomize zapping and not zapping to the remaining vanilla and chocolate eaters.

Propose a procedure for randomizing zapping/not zapping to the subjects that would allow dealing with all the strawberry eaters first, but will maintain the key properties that the zapped and not zapped groups are similar on average on the pre-test variable and, also, that the procedure gets the correct average difference between treated and control units on average.

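We can randomize within each group. First randomly choose which strawberry eaters to zap, then randomly choose which of the remaining vanilla and chocolate eaters to zap, zapping the same fraction of each group. This amounts to taking a simple random sample within each group, so every subject has the same chance of being zapped. That eliminates bias: the zapped and not zapped groups are similar on the pre-test on average, and the estimated difference is correct on average. Balancing the groups on ice cream flavor can also reduce variability, and Larry still gets to deal with all the strawberry eaters first.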

Now implement your solution. Here is Larry’s procedure. How can you change it to get a valid randomization procedure (i.e., get the red and blue lines to be in the same place)?

results.outcome <- numeric(100)
results.pretest <- numeric(100)
for(i in 1:100) {
  #### CHANGE THIS SECTION
  # split up the data into those that like strawberry and the others
  strawberry <- experiment[ice.cream == "strawberry", ]
  others <- experiment[ice.cream != "strawberry", ]
  # randomize the others 
  rand.rest <- sample.int(42, 17) # 42 people left, treat 17 of them to get 25 total
  
  treated.table <- rbind(strawberry, others[rand.rest, ]) # rbind joins two tables
  control.table <- others[-rand.rest, ]
  
  #### END CHANGES
  
  
  # generate the difference outcomes we would observe if this experiment were run
  results.outcome[i] <- mean(treated.table$zapped) - mean(control.table$not.zapped)
  
  # now create the difference in pre test values we would see if this were run
  results.pretest[i] <- mean(treated.table$pre.test) - mean(control.table$pre.test)
}

par(mfrow = c(1,2)) # this just puts two figures on the same line.

# distribution of outcome mean differences, should be centered around the truth
hist(results.outcome, main = "Outcome")
abline(v = mean(results.outcome), col = "red")
abline(v = mean(zapped - not.zapped), col = "blue") 

# distribution of pre-test mean differences, should be centered around 0

hist(results.pretest, main = "Pretest")
abline(v = mean(results.pretest), col = "red")
abline(v = 0, col = "blue")
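
The within-group randomization proposed in Part (a) could be sketched as follows. This is only an illustration under stated assumptions: the `experiment` table built here is a made-up stand-in (8 strawberry eaters and 42 others, inferred from the comment in Larry's code), the treatment effect of 5 is invented, and zapping exactly half of each group is one possible choice of equal fractions.

```r
# Hypothetical stand-in for the course's `experiment` table (not the real data):
# 8 strawberry eaters and 42 others, with an invented treatment effect of 5.
set.seed(1)
experiment <- data.frame(
  ice.cream = c(rep("strawberry", 8), rep(c("vanilla", "chocolate"), each = 21)),
  pre.test  = rnorm(50, mean = 50, sd = 10)
)
experiment$not.zapped <- experiment$pre.test + rnorm(50, sd = 2)
experiment$zapped     <- experiment$not.zapped + 5

strawberry <- experiment[experiment$ice.cream == "strawberry", ]
others     <- experiment[experiment$ice.cream != "strawberry", ]

results.outcome <- numeric(1000)
results.pretest <- numeric(1000)
for (i in 1:1000) {
  # randomize within each group: zap half of the strawberry eaters first,
  # then half of everyone else (assumed split, 4 + 21 = 25 treated)
  rand.straw <- sample.int(nrow(strawberry), nrow(strawberry) / 2)
  rand.rest  <- sample.int(nrow(others), nrow(others) / 2)

  treated.table <- rbind(strawberry[rand.straw, ], others[rand.rest, ])
  control.table <- rbind(strawberry[-rand.straw, ], others[-rand.rest, ])

  results.outcome[i] <- mean(treated.table$zapped) - mean(control.table$not.zapped)
  results.pretest[i] <- mean(treated.table$pre.test) - mean(control.table$pre.test)
}

true.effect <- mean(experiment$zapped - experiment$not.zapped)
c(estimate = mean(results.outcome), truth = true.effect)
mean(results.pretest) # should be close to 0
```

Because every subject has the same chance of being zapped, the average of the simulated differences should sit near the true effect and the average pre-test difference near 0, which is what the red and blue lines check.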

Part (b)

For the next items, explain what is wrong with each of the following random sampling procedures, and explain how you would do the randomization correctly.

To determine the reading level of an introductory statistics text, you evaluate all the written material in the 3rd chapter.

This is undercoverage. By looking only at the third chapter, you get a misleading view of the reading level of the whole text, because the other chapters have no chance of being evaluated. To randomize properly, put numbered slips of paper representing each chapter in a hat, shake the hat, and randomly select two or three chapters to evaluate.

You want to sample student opinions about a proposed change in procedures for changing majors. You hand questionnaires to 100 students as they arrive for class at 7:30am.

This is convenience sampling. Because you hand the questionnaires only to the first 100 students you see arriving for class at 7:30am, there is undercoverage. Many students who go to class that early are dedicated students who are likely to have a settled major choice, so they may be rather indifferent to the proposed change. To fix this, you could flip a coin for each student you encounter: heads, give them the survey; tails, do not. Continue until all the surveys are handed out. Doing this at different times of day would also reach students who do not have early classes.

A population of subjects is put in alphabetical order and a simple random sample of size 10 is taken by selecting the first 10 subjects on the list.

Taking the first 10 subjects on an alphabetical list is not a simple random sample, because subjects further down the list have no chance of being selected. A valid method would be to write each subject’s name on a slip of paper, mix the slips thoroughly inside a hat, and then select 10 of them without looking. This proposed method satisfies the conditions of a simple random sample.
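
The difference between taking the first 10 names and drawing a simple random sample can be sketched in R. The subject names below are hypothetical placeholders, and `sample()` stands in for drawing slips from a hat.

```r
# Hypothetical alphabetized population of 50 subject names
set.seed(2)
subjects <- sort(paste0("Subject", sprintf("%02d", 1:50)))

# Not random: the same 10 people are chosen every time
first.ten <- subjects[1:10]

# Simple random sample: every set of 10 subjects is equally likely
srs.ten <- sample(subjects, size = 10, replace = FALSE)

first.ten
srs.ten
```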

Problem 2

The Rmarkdown version of this document contains a table called CSDATA. There are 224 students in the database that form a population. The data set contains information about each student’s GPA (the gpa column) and SAT verbal score (the satv column).

Make histograms for variable gpa and satv, and calculate the population mean and standard deviation of these two variables. Recall that you can access the columns of a table using the $ operator: CSDATA$gpa or CSDATA$satv.

hist(CSDATA$gpa,xlab = "GPA" ,ylab = "Frequency")

mean(CSDATA$gpa)

## [1] 2.635223

sd(CSDATA$gpa)

## [1] 0.7793949

hist(CSDATA$satv, xlab = "SAT Verbal Score", ylab = "Frequency")

mean(CSDATA$satv)

## [1] 504.5491

sd(CSDATA$satv)

## [1] 92.61046

Draw 10 random samples of size 20 from the population and calculate the sample mean and sample standard deviation of variable gpa and satv of each random sample you have drawn.

# set up two matrices to hold the results
results.gpa <- matrix(0, nrow = 10, ncol = 2)
colnames(results.gpa) <- c("Mean", "SD")

results.mean <- matrix(0,nrow = 10)

results.satv <- matrix(0, nrow = 10, ncol = 2)
colnames(results.satv) <- c("Mean", "SD")

# repeat the sampling procedure 10 times.
for (i in 1:10) {
  ### draw a sample of 20 students from CSDATA and compute the mean and SD of each column
  # sample row numbers once so gpa and satv come from the same 20 students
  samp <- sample(nrow(CSDATA), size = 20, replace = FALSE)
  sampGPA <- CSDATA$gpa[samp]
  sampSatv <- CSDATA$satv[samp]
  
  # save your results here
  results.gpa[i, "Mean"] <- mean(sampGPA)
  results.gpa[i, "SD"] <- sd(sampGPA)
  
  # keep the GPA sample means separately for the histogram below
  results.mean[i] <- mean(sampGPA)
  
  results.satv[i, "Mean"] <- mean(sampSatv)
  results.satv[i, "SD"] <- sd(sampSatv)
}

# Uncomment these lines to show your results
 print(results.gpa)

##         Mean        SD
##  [1,] 2.7425 0.8788083
##  [2,] 2.3710 0.8594913
##  [3,] 2.2375 0.9781232
##  [4,] 2.7060 0.5388428
##  [5,] 2.4960 0.5210253
##  [6,] 2.6370 0.8686717
##  [7,] 2.6855 0.7602318
##  [8,] 2.8625 0.6472767
##  [9,] 2.4640 0.8517252
## [10,] 2.7370 0.7703321

 print(results.satv)

##         Mean        SD
##  [1,] 509.05  95.46423
##  [2,] 530.70  95.26976
##  [3,] 508.15 101.57718
##  [4,] 486.00  82.74247
##  [5,] 496.80  67.76243
##  [6,] 483.40  72.34159
##  [7,] 518.00  76.47497
##  [8,] 529.75  77.39977
##  [9,] 497.45  91.44885
## [10,] 521.10 130.89365

Make a histogram for the 10 sample means from (b), and compare with population mean, what do you observe? Describe and explain.

hist(results.mean, xlab = "MEAN GPA")

The population mean GPA is 2.635. In the histogram of our sample means, the values are split fairly evenly between GPA < 2.6 and GPA > 2.6. However, the distribution peaks at 2.7-2.8, which is higher than the population mean. If we were to take more and more samples, the histogram would center more closely on 2.635. With only 10 samples of size 20, though, there is room for considerable variability, as the histogram shows.
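
The claim that more samples would center on the population mean can be checked with a small simulation. Since CSDATA is not reproduced here, the sketch uses a made-up population of 224 values with roughly similar mean and spread; `population` and `sample.means` are illustrative names, not part of the assignment.

```r
set.seed(3)
# Hypothetical stand-in for the 224 GPAs in CSDATA (values clipped to 0-4)
population <- pmin(4, pmax(0, rnorm(224, mean = 2.6, sd = 0.8)))

# Draw n.samples samples of size 20 and return their means
sample.means <- function(n.samples) {
  replicate(n.samples, mean(sample(population, size = 20)))
}

means.10   <- sample.means(10)
means.1000 <- sample.means(1000)

# The average of many sample means should land very close to the population mean
c(population = mean(population),
  ten        = mean(means.10),
  thousand   = mean(means.1000))
```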

Problem 3

The game of craps is played with two six sided dice, which are added together to get a score. We can simulate the process of rolling the pair of dice with the following function:

roll <- function() { sample.int(6, size = 1) + sample.int(6, size = 1) }

The game of craps starts with a “come-out”, in which the player immediately wins if the sum of the dice is either 7 or 11. Using the roll function, generate 100 rolls of the dice to estimate the probability that the player wins on the first roll.

results.roll <- numeric(100) # one slot for each of the 100 rolls
for (i in 1:100) {
  # record the sum of one roll of the pair of dice
  results.roll[i] <- roll()
}
results.roll

##   [1] 10  6  4  8  4  6  6  3  6  3  7 10  5  6  3  4  5 10  4  7 10 11  7
##  [24]  6  3  8  6 10 10  9  4  7  9 10 10 10  4  4  3 10 10  7  8  2  4  8
##  [47]  4  4 10  7  4  6  6  6  4 11 10  6  7  7  7  3 11  9  5  8  4  7  5
##  [70]  7  3  7 11  7  4  7  6 10  8 11  5  9  7  6  8 10  9  8  9  4  6  6
##  [93] 11  3  9  6  3  9  6  8
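
The estimate itself is the proportion of the 100 rolls that came up 7 or 11. A minimal sketch of that calculation, using a fresh set of simulated rolls (so the value will differ from the rolls printed above) and an arbitrary seed chosen only for reproducibility:

```r
roll <- function() { sample.int(6, size = 1) + sample.int(6, size = 1) }

set.seed(4) # arbitrary seed, an assumption of this sketch
rolls <- replicate(100, roll())

# proportion of come-out rolls that win immediately
estimate <- mean(rolls %in% c(7, 11))
estimate
```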

The probability of rolling a 7 or 11 on the first roll is the same as the probability of rolling a 7 or 11 on any roll, since the rolls are independent. The possible sums range from 2 to 12, but these 11 sums are not equally likely (there is only one way to roll a 2 but six ways to roll a 7), so the probability is not simply 2/11. Instead, we need to count how many of the equally likely dice combinations give each sum.

We can also write out a table of all the possible outcomes for rolling two dice, each of which is equally likely:

outcomes <- outer(1:6, 1:6, `+`) 
dimnames(outcomes) <- list("Die A" = 1:6, "Die B" = 1:6)
print(outcomes)

##      Die B
## Die A 1 2 3  4  5  6
##     1 2 3 4  5  6  7
##     2 3 4 5  6  7  8
##     3 4 5 6  7  8  9
##     4 5 6 7  8  9 10
##     5 6 7 8  9 10 11
##     6 7 8 9 10 11 12

table(outcomes)

## outcomes
##  2  3  4  5  6  7  8  9 10 11 12 
##  1  2  3  4  5  6  5  4  3  2  1

Use the outcomes table to find the distribution of possible values from adding two dice. For each possible value from 2 to 12, this distribution tells us the probability of rolling that value. (Hint: see the table function.)

8 of the 36 possible outcomes yield a sum of 7 or 11. When we simplify this, we get 2/9.
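
Dividing the counts from `table` by the 36 equally likely outcomes gives the full distribution. A short sketch (the names `dist` and `p.win` are introduced here only for illustration):

```r
outcomes <- outer(1:6, 1:6, `+`)

# probability of each sum from 2 to 12
dist <- table(outcomes) / length(outcomes)
round(dist, 3)

# probability of winning immediately: P(7) + P(11) = 6/36 + 2/36
p.win <- sum(dist[c("7", "11")])
p.win
```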

Using the previous distribution what is the exact probability of rolling either 7 or 11 on the first roll. How close was your estimate?

The exact probability is 2/9, or about 0.222. My estimate using the roll function was 21/100 = 0.21. The two values are relatively close to each other, differing by about 0.012.

If a player rolls either 2 or 12 on the first roll, she immediately loses. You see that the player rolled a “1” for one of her dice. What is the probability that the player either won immediately or lost immediately?

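1/3. Since you know she rolled a 1 on one of the dice, the sum can only be between 2 and 7, so 11 and 12 are impossible. She loses immediately only if the other die is also a 1 (sum of 2), and she wins immediately only if the other die is a 6 (sum of 7). The dice are independent and the other die has six equally likely outcomes, so the probability is 1/6 + 1/6 = 1/3. (If the information is read instead as “at least one of the two dice shows a 1”, there are 11 equally likely outcomes and 3 of them end the game, giving 3/11.)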
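
This conditional probability can be checked by listing all 36 outcomes and keeping only those consistent with what was seen. The sketch below covers two possible readings of “rolled a 1 for one of her dice”; which reading the problem intends is an assumption.

```r
# all 36 equally likely outcomes for two dice
dice <- expand.grid(a = 1:6, b = 1:6)
dice$sum <- dice$a + dice$b

# the game ends at once on 7 or 11 (win) or on 2 or 12 (loss)
ends.game <- dice$sum %in% c(2, 7, 11, 12)

# Reading 1: one particular die (here die a) is seen to show a 1
p.specific <- mean(ends.game[dice$a == 1]) # 2/6 = 1/3

# Reading 2: at least one of the two dice shows a 1
p.at.least.one <- mean(ends.game[dice$a == 1 | dice$b == 1]) # 3/11

c(specific = p.specific, at.least.one = p.at.least.one)
```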

Problem 4

The probability of an event $A$ is 0.224. What is the probability that $A$ does not occur?

The probability that A does not occur is 1 - 0.224 = 0.776.

A coin is tossed 3 times. The probability of 3 heads is 1/8. The probability of 3 tails is 1/8. What is the probability getting 3 of the same side of the coin?

The probability of getting 3 of the same side of the coin is 1/4. There are 8 equally likely possibilities, one of which is all heads and one of which is all tails. These two events cannot both happen, so their probabilities add: 1/8 + 1/8 = 1/4.

For 3 coin tosses, what is the probability that there is at least one head AND at least one tail.

This is the complement of the event in the previous part. Any sequence of 3 tosses that is not all heads or all tails must include at least one head and at least one tail. Since the probability of getting 3 of the same side is 1/4, the probability is 1 - 1/4 = 3/4.
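
Both coin answers can be confirmed by enumerating the 8 equally likely sequences of 3 tosses. A brief sketch (variable names are illustrative):

```r
# all 2^3 = 8 equally likely sequences of three tosses
tosses <- expand.grid(t1 = c("H", "T"), t2 = c("H", "T"), t3 = c("H", "T"),
                      stringsAsFactors = FALSE)
n.heads <- rowSums(tosses == "H")

p.same  <- mean(n.heads %in% c(0, 3))        # all tails or all heads: 2/8
p.mixed <- mean(n.heads >= 1 & n.heads <= 2) # at least one of each: 6/8

c(same = p.same, mixed = p.mixed)
```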

Is the following statement possible? The probability of A is 0.5. The probability of B is 0.6. The events are disjoint.

No. Since the problem states that the events are disjoint, they are mutually exclusive (they can never occur at the same time). For disjoint events, the probability that A or B occurs is the sum P(A) + P(B) = 0.5 + 0.6 = 1.1. A probability cannot exceed 1, so events with these probabilities cannot be disjoint.

Problem 5

All human blood can be ABO-typed as one of O, A, B, or AB. The distribution of the types varies among groups of people. Here is the distribution of blood types for a randomly chosen person in the United States:

Blood Type	A	B	AB	O
US. Prob	0.42	0.11	??	0.44

What is the probability of having AB blood in the United States?

The probability of having AB blood in the United States is the same as the probability of not having any of the other types. The total probability of having type A, B, or O blood is 0.42 + 0.11 + 0.44 = 0.97, so the probability of AB is 1 - 0.97 = 0.03.

Maria has type B blood. She can receive transfusions from people with B or O blood. What is the probability that a randomly chosen person could donate for Maria?

The probability that a randomly chosen person could donate for Maria is 0.11 + 0.44 = 0.55.

Here is a table of blood types for Ireland:

Blood Type	A	B	AB	O
Ireland	0.35	0.10	0.03	0.52

What is the probability that two people, one from the US and one from Ireland, would both have the type O blood? What is the probability that these two people would have the same blood type?

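The probability of a random person having type O is 0.44 in the United States and 0.52 in Ireland. Assuming the two people are chosen independently, the probability that both have type O is 0.44 * 0.52 = 0.2288. For the two people to have the same blood type, they must both be A, both be B, both be AB, or both be O. Using 0.03 for AB in the United States, this is (0.42 * 0.35) + (0.11 * 0.10) + (0.03 * 0.03) + (0.44 * 0.52) = 0.147 + 0.011 + 0.0009 + 0.2288 = 0.3877.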
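
The blood type calculations in this problem can be reproduced with named vectors. The sketch assumes the US and Irish person are chosen independently; the variable names are introduced here for illustration.

```r
# probabilities from the two tables; US AB is filled in as the remainder
us <- c(A = 0.42, B = 0.11, AB = NA, O = 0.44)
us["AB"] <- 1 - sum(us, na.rm = TRUE)
ireland <- c(A = 0.35, B = 0.10, AB = 0.03, O = 0.52)

# Maria (type B) can receive from B or O donors
p.donor <- sum(us[c("B", "O")])

# both type O, assuming the two people are independent
p.both.O <- unname(us["O"] * ireland["O"])

# same type: sum the matching products over A, B, AB, and O
p.same <- sum(us * ireland)

c(AB = unname(us["AB"]), donor = p.donor, both.O = p.both.O, same = p.same)
```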

Blood types have positive and negative subtypes. In the United States, the rate of “O-” blood is 7%. If 10 people are randomly picked from the US population, what is the probability that at least one of them is type “O-”? (Hint: what is the complement of “at least one”?)

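The complement of “at least one” is “none of the 10 people is O-”. Each person has a 0.93 chance of not being O-, so if the 10 people are independent, the probability that none is O- is 0.93^10 ≈ 0.484. The probability that at least one of them is O- is therefore 1 - 0.93^10 ≈ 0.516.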
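
The complement calculation for “at least one” can be written out in R, with the binomial distribution as a cross-check. This sketch assumes the 10 people are independent draws, each with a 7% chance of being O-.

```r
p.o.neg <- 0.07 # rate of O- blood in the US, from the problem
n <- 10         # number of people picked

# complement of "at least one" is "none of the n people are O-"
p.none <- (1 - p.o.neg)^n
p.at.least.one <- 1 - p.none
p.at.least.one

# cross-check: 1 minus the binomial probability of exactly 0 successes
p.check <- 1 - dbinom(0, size = n, prob = p.o.neg)
p.check
```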