Last week, we met Lazy Larry. Larry wanted to group all the strawberry ice cream eaters so he could deal with them first in his experiment. His solution was to zap all the strawberry ice cream eaters, and to randomly assign zapping or not zapping among the remaining vanilla and chocolate eaters.
Instead, we can randomize which strawberry eaters we zap and which we do not. In effect, we take a simple random sample within each group (the strawberry eaters and everyone else) to decide who is treated. Because treatment is randomized within each group, we eliminate the bias and reduce the variability.
results.outcome <- numeric(100)
results.pretest <- numeric(100)
for(i in 1:100) {
#### CHANGE THIS SECTION
# split up the data into those that like strawberry and the others
strawberry <- experiment[ice.cream == "strawberry", ]
others <- experiment[ice.cream != "strawberry", ]
# randomize the others
rand.rest <- sample.int(42, 17) # 42 people left, treat 17 of them to get 25 total
treated.table <- rbind(strawberry, others[rand.rest, ]) # rbind joins two tables
control.table <- others[-rand.rest, ]
#### END CHANGES
# generate the difference outcomes we would observe if this experiment were run
results.outcome[i] <- mean(treated.table$zapped) - mean(control.table$not.zapped)
# now create the difference in pre test values we would see if this were run
results.pretest[i] <- mean(treated.table$pre.test) - mean(control.table$pre.test)
}
par(mfrow = c(1,2)) # this just puts two figures on the same line.
# distribution of outcome mean differences, should be centered around the truth
hist(results.outcome, main = "Outcome")
abline(v = mean(results.outcome), col = "red")
abline(v = mean(zapped - not.zapped), col = "blue")
# distribution of pre-test mean differences, should be centered around 0
hist(results.pretest, main = "Pretest")
abline(v = mean(results.pretest), col = "red")
abline(v = 0, col = "blue")
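The loop above still places every strawberry eater in the treated group. A minimal sketch of randomizing within each group instead might look like the following. The experiment table here is a synthetic stand-in so the code runs on its own: the 8 strawberry eaters and 42 others follow from the counts in the comment above, but the vanilla/chocolate split, the score values, and the constant 5-point effect are assumptions.

```r
set.seed(1)

# Hypothetical stand-in for the course's experiment table (synthetic values).
experiment <- data.frame(
  ice.cream = rep(c("strawberry", "vanilla", "chocolate"), times = c(8, 21, 21)),
  pre.test  = rnorm(50, mean = 50, sd = 10)
)
experiment$not.zapped <- experiment$pre.test + rnorm(50)
experiment$zapped     <- experiment$not.zapped + 5  # assumed treatment effect

# split up the data into those that like strawberry and the others
strawberry <- experiment[experiment$ice.cream == "strawberry", ]
others     <- experiment[experiment$ice.cream != "strawberry", ]

# simple random sample within each group: treat half of each
rand.straw <- sample.int(nrow(strawberry), nrow(strawberry) %/% 2)  # 4 of 8
rand.rest  <- sample.int(nrow(others), nrow(others) %/% 2)          # 21 of 42

treated.table <- rbind(strawberry[rand.straw, ], others[rand.rest, ])
control.table <- rbind(strawberry[-rand.straw, ], others[-rand.rest, ])

# difference in outcomes and in pre-test values for this one randomization
mean(treated.table$zapped) - mean(control.table$not.zapped)
mean(treated.table$pre.test) - mean(control.table$pre.test)
```

With this assignment both groups contain the same number of strawberry eaters, so ice cream preference can no longer be confounded with treatment.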
For the next items, explain what is wrong with each of the following random sampling procedures, and explain how you would do the randomization correctly.
This is undercoverage. Simply put, by only looking at the third chapter, you get a misrepresented view of the reading level of the statistics text. To do a proper randomization, you can put slips of paper numbered for each chapter of the book in a hat, shake the hat, and randomly select two or three chapters to evaluate for your decision.
This is convenience sampling. Because you hand the questionnaires to the first 100 students you see arriving for class in the morning, there is undercoverage. Many students who go to class that early are dedicated students who are likely to have settled on a major, so they may be rather indifferent to the survey. To fix this, you could flip a coin for each student you encounter: heads, give them the survey; tails, do not. Continue until all the surveys have been handed out.
They did not detail the sampling procedure, but this method would be valid if, for example, you put each subject's name on a slip of paper, mixed the slips thoroughly inside a hat, and then selected 10 of them without looking. My proposed method satisfies the conditions of a simple random sample.
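Drawing names from a hat can be mimicked in R with sample. This is only a sketch: the roster of 40 subjects below is hypothetical, since the problem does not say how many subjects there are.

```r
# Hypothetical roster; replace with the real list of subject names.
subjects <- paste("Subject", 1:40)

# Draw 10 names without replacement. Every subject is equally likely to be
# chosen, and every group of 10 is equally likely to be the sample.
chosen <- sample(subjects, size = 10, replace = FALSE)
print(chosen)
```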
The Rmarkdown version of this document contains a table called CSDATA. There are 224 students in the database that form a population. The data set contains information about each student's GPA (the gpa column) and SAT verbal score (the satv column).
Plot histograms of gpa and satv, and calculate the population mean and standard deviation of these two variables. Recall that you can access the columns of a table using the $ operator: CSDATA$gpa or CSDATA$satv.
hist(CSDATA$gpa, xlab = "GPA", ylab = "Frequency")
mean(CSDATA$gpa)
## [1] 2.635223
sd(CSDATA$gpa)
## [1] 0.7793949
hist(CSDATA$satv, xlab = "SAT Verbal Score", ylab = "Frequency")
mean(CSDATA$satv)
## [1] 504.5491
sd(CSDATA$satv)
## [1] 92.61046
Compute the mean and standard deviation of gpa and satv of each random sample you have drawn.
# set up two matrices to hold the results
results.gpa <- matrix(0, nrow = 10, ncol = 2)
colnames(results.gpa) <- c("Mean", "SD")
results.mean <- matrix(0,nrow = 10)
results.satv <- matrix(0, nrow = 10, ncol = 2)
colnames(results.satv) <- c("Mean", "SD")
# repeat the sampling procedure 10 times.
for (i in 1:10) {
### draw a sample of 20 students from the CSDATA gpa and satv columns and compute the mean and SD of each
sampGPA <- sample(CSDATA$gpa, size = 20, replace = FALSE)
sampSatv <- sample(CSDATA$satv, size = 20, replace = FALSE)
# save your results here
results.gpa[i, "Mean"] <- mean(sampGPA)
results.gpa[i, "SD"] <- sd(sampGPA)
results.mean[i] <- mean(sampGPA)
results.satv[i, "Mean"] <- mean(sampSatv)
results.satv[i, "SD"] <- sd(sampSatv)
}
# Uncomment these lines to show your results
print(results.gpa)
## Mean SD
## [1,] 2.7425 0.8788083
## [2,] 2.3710 0.8594913
## [3,] 2.2375 0.9781232
## [4,] 2.7060 0.5388428
## [5,] 2.4960 0.5210253
## [6,] 2.6370 0.8686717
## [7,] 2.6855 0.7602318
## [8,] 2.8625 0.6472767
## [9,] 2.4640 0.8517252
## [10,] 2.7370 0.7703321
print(results.satv)
## Mean SD
## [1,] 509.05 95.46423
## [2,] 530.70 95.26976
## [3,] 508.15 101.57718
## [4,] 486.00 82.74247
## [5,] 496.80 67.76243
## [6,] 483.40 72.34159
## [7,] 518.00 76.47497
## [8,] 529.75 77.39977
## [9,] 497.45 91.44885
## [10,] 521.10 130.89365
#
hist(results.mean, xlab = "MEAN GPA")
The population mean gpa is 2.635. In the histogram of our sample means, the values are split fairly evenly between gpa < 2.6 and gpa > 2.6. However, the distribution peaks at 2.7-2.8, which is higher than the population mean. If we were to take more and more samples, the center of the histogram would settle closer to 2.635. With only 10 samples of 20 students each, there is a lot of room for variability, as the histogram shows.
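The effect of taking more samples can be illustrated with a small simulation. This is a sketch under stated assumptions: CSDATA is not reproduced here, so population.gpa below is a synthetic population of 224 values, and the names few.means and many.means are made up for this example.

```r
set.seed(42)

# Synthetic stand-in for CSDATA$gpa: 224 values clipped to the range 0 to 4.
population.gpa <- round(pmin(pmax(rnorm(224, mean = 2.6, sd = 0.8), 0), 4), 2)
pop.mean <- mean(population.gpa)

# Means of 10 samples versus 10,000 samples, each of 20 students.
few.means  <- replicate(10, mean(sample(population.gpa, size = 20)))
many.means <- replicate(10000, mean(sample(population.gpa, size = 20)))

# The average of many sample means sits much closer to the population mean.
c(population = pop.mean, few = mean(few.means), many = mean(many.means))
```

Individual sample means still vary just as much from sample to sample; what improves is how well the histogram of sample means is centered on the population mean.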
The game of craps is played with two six sided dice, which are added together to get a score. We can simulate the process of rolling the pair of dice with the following function:
roll <- function() { sample.int(6, size = 1) + sample.int(6, size = 1) }
results.roll <- numeric(100) # one slot for each of the 100 rolls
for (i in 1:100)
{
results.roll[i] <- roll()
}
results.roll
## [1] 10 6 4 8 4 6 6 3 6 3 7 10 5 6 3 4 5 10 4 7 10 11 7
## [24] 6 3 8 6 10 10 9 4 7 9 10 10 10 4 4 3 10 10 7 8 2 4 8
## [47] 4 4 10 7 4 6 6 6 4 11 10 6 7 7 7 3 11 9 5 8 4 7 5
## [70] 7 3 7 11 7 4 7 6 10 8 11 5 9 7 6 8 10 9 8 9 4 6 6
## [93] 11 3 9 6 3 9 6 8
The probability of rolling a 7 or 11 on the first roll is the same as the probability of rolling a 7 or 11 on any roll, since the rolls are independent. It is tempting to look at the range of possible sums, 2 through 12, and say that rolling a 7 or 11 has probability 2/11. That treats all 11 sums as equally likely, which they are not, so the exact probability has to come from counting the 36 equally likely pairs of die faces.
outcomes <- outer(1:6, 1:6, `+`)
dimnames(outcomes) <- list("Die A" = 1:6, "Die B" = 1:6)
print(outcomes)
## Die B
## Die A 1 2 3 4 5 6
## 1 2 3 4 5 6 7
## 2 3 4 5 6 7 8
## 3 4 5 6 7 8 9
## 4 5 6 7 8 9 10
## 5 6 7 8 9 10 11
## 6 7 8 9 10 11 12
table(outcomes)
## outcomes
## 2 3 4 5 6 7 8 9 10 11 12
## 1 2 3 4 5 6 5 4 3 2 1
Use the outcomes table to find the distribution of possible values from adding two dice. For each possible value from 2 to 12, this distribution tells us the probability of rolling that value. (Hint: see the table function.)
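One way to turn the counts from table into probabilities is to divide by the 36 equally likely cells. The name roll.dist is introduced here only for illustration.

```r
outcomes <- outer(1:6, 1:6, `+`)

# Each of the 36 cells is equally likely, so count / 36 is the probability.
roll.dist <- table(outcomes) / length(outcomes)
print(round(roll.dist, 3))

# Probability of a 7 or an 11: 6/36 + 2/36
sum(roll.dist[c("7", "11")])
```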
8 of the 36 possible outcomes yield a sum of 7 or 11. When we simplify this, we get 2/9.
The exact probability is 2/9, or about 0.222. My estimate using the roll function was 21/100 = 0.21. The two values are relatively close to each other.
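With only 100 rolls the estimate can easily be off by a few percentage points. A sketch of the same estimate with many more rolls follows; the seed and the choice of 100,000 rolls are arbitrary.

```r
set.seed(2024)
roll <- function() { sample.int(6, size = 1) + sample.int(6, size = 1) }

# Simulate many first rolls and record how often a 7 or 11 comes up.
rolls <- replicate(100000, roll())
estimate <- mean(rolls %in% c(7, 11))

c(estimate = estimate, exact = 2/9)
```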
1/6. Since you know she rolled a 1 on one of the dice, she loses only if the other die also shows a 1. Each die roll is independent and has six equally likely outcomes, so the probability is 1/6.
The probability of A not happening is .776
The probability of getting the same side of the coin on all 3 tosses is 1/4, because there are 8 equally likely possibilities, one of which is all heads and one of which is all tails: 1/8 + 1/8 = 1/4.
The answer here is the probability of the complement of the event in part b. Since the probability of b occurring is 1/4, the only outcomes left are the ones that include both heads and tails. The probability is 1 - 1/4 = 3/4.
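Both answers can be checked by listing all 8 outcomes of three tosses, assuming a fair coin so that each outcome is equally likely. The names tosses, n.heads, and all.same are introduced only for this sketch.

```r
# All 2 * 2 * 2 = 8 sequences of three tosses.
tosses <- expand.grid(first = c("H", "T"), second = c("H", "T"), third = c("H", "T"))

n.heads  <- rowSums(tosses == "H")
all.same <- n.heads %in% c(0, 3)  # TTT or HHH

mean(all.same)   # 2 of 8 outcomes
mean(!all.same)  # the remaining 6 of 8 outcomes
```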
No. Since the problem states that the events are disjoint, they are mutually exclusive (they can never occur at the same time), so the probability that one or the other occurs is the sum of their probabilities. That sum cannot exceed 1, so two disjoint events cannot have probabilities that add up to more than 1.
All human blood can be ABO-typed as one of O, A, B, or AB. The distribution of the types varies among groups of people. Here is the distribution of blood types for a randomly chosen person in the United States:
Blood Type | A | B | AB | O |
---|---|---|---|---|
US. Prob | 0.42 | 0.11 | ?? | 0.44 |
The probability of having AB blood in the United States is the same as the probability of not having any of the other types. The total probability of having type A, B, or O blood is 0.42 + 0.11 + 0.44 = 0.97, so the probability of type AB is 1 - 0.97 = 0.03.
The probability that a randomly chosen person could donate for Maria is 0.11 + 0.44 = 0.55.
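The two calculations can be written out in R as a check. The sketch assumes, based on the 0.11 + 0.44 sum above, that Maria can receive blood from types B and O; the vector name us is introduced here for illustration.

```r
# U.S. probabilities given in the table (AB is the unknown entry).
us <- c(A = 0.42, B = 0.11, O = 0.44)

# The four probabilities must sum to 1, so AB is whatever is left over.
p.ab <- 1 - sum(us)
us <- c(us, AB = p.ab)

# Assumption: donors for Maria are people with type B or type O blood.
p.donor <- sum(us[c("B", "O")])

c(AB = p.ab, donor = p.donor)
```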
Blood Type | A | B | AB | O |
---|---|---|---|---|
Ireland | 0.35 | 0.10 | 0.03 | 0.52 |
What is the probability that two people, one from the US and one from Ireland, would both have the type O blood? What is the probability that these two people would have the same blood type?
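The probability of a random person having type O is 0.44 in the United States and 0.52 in Ireland. Assuming the two people are chosen independently, the probability that both have type O is 0.44 * 0.52 = 0.2288. The probability that they have the same blood type is the sum over all four types: 0.42 * 0.35 + 0.11 * 0.10 + 0.03 * 0.03 + 0.44 * 0.52 = 0.147 + 0.011 + 0.0009 + 0.2288 = 0.3877.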
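A short sketch of the calculation, assuming the two people are chosen independently and taking the U.S. probability of type AB to be 0.03 (the value that makes the U.S. row sum to 1). The vector names are introduced for illustration.

```r
us      <- c(A = 0.42, B = 0.11, AB = 0.03, O = 0.44)
ireland <- c(A = 0.35, B = 0.10, AB = 0.03, O = 0.52)

# Both people have type O: multiply the two type O probabilities.
both.o <- us[["O"]] * ireland[["O"]]

# Same type: multiply type by type, then add up the four products.
same.type <- sum(us * ireland)

c(both.o = both.o, same.type = same.type)
```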
0.93^9 = 0.5204 is the probability that none of the 9 people is O-negative. The probability of at least one of them being O-negative is therefore 1 - 0.5204 = 0.4796.
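The numbers in this calculation can be reproduced with the complement rule. The sketch assumes that 0.93 is the chance that one person is not O-negative (a 7% O-negative rate) and that there are 9 independent people; both values are read off the arithmetic above rather than from a stated problem.

```r
p.o.neg <- 0.07  # assumed probability that one person is O-negative
n <- 9           # assumed number of independent people

p.none <- (1 - p.o.neg)^n     # nobody is O-negative
p.at.least.one <- 1 - p.none  # complement: at least one is O-negative

round(c(none = p.none, at.least.one = p.at.least.one), 4)
```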