You can complete this homework by filling the rest of the .Rmd
document. When you click the Knit button in RStudio a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
There are four sets of data prepared by the statistician Frank Anscombe to illustrate the dangers of calculating without first plotting the data.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## x 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00
## y 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## x 10.00 8.00 13.00 9.00 11.00 14.0 6.00 4.0 12.00 7.00 5.00
## y 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 4.74
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## x 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00
## y 7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## x 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 19.0
## y 6.58 5.76 7.71 8.84 8.47 7.04 5.25 5.56 7.91 6.89 12.5
plot(Data1$x, Data1$y, xlab = "X", ylab = "Y")
Weak linear relationship. Positive association between X and Y. There does not seem to be any outliers within this scatter plot. Moderately positive correlation.
plot(Data2$x, Data2$y, xlab = "X", ylab = "Y")
Moderate to strong positive correlation between X and Y. There does not seem to be any outliers in this scatter plot. Weak linear relationship.
plot(Data3$x, Data3$y, xlab = "X", ylab = "Y")
Very strong linear relationship. Strong positive correlation between X and Y. One outlier exists at approximately x=13.
plot(Data4$x, Data4$y, xlab = "X", ylab = "Y")
Non-existent linear relationship. Extremely strong zero correlation. One outlier exists at approximately x =19. No direction for association.
```
plot(Data1$x, Data1$y, xlab = "X", ylab = "Y")
abline(lm(Data1$y ~ Data1$x))
This one fits rather nicely. An even amount of points on each side of the LSRL.
plot(Data2$x, Data2$y, xlab = "X", ylab = "Y")
abline(lm(Data2$y ~ Data2$x))
The second line is influenced by the parabolic arch of the points heavily, and manages to only touch two points from the scatterplot as such.
plot(Data3$x, Data3$y, xlab = "X", ylab = "Y")
abline(lm(Data3$y ~ Data3$x))
In this scatteplot, the outlier influences the LSRL heavily, drawing it away from the near perfect linear relationship of the rest of the points.
plot(Data4$x, Data4$y, xlab = "X", ylab = "Y")
abline(lm(Data4$y ~ Data4$x))
This line is heavily influenced by the one outlier. Therefore, the line goes through that outlier ignoring the majority of the rest of the data.
point.lm = lm(y ~ x, data= Data1)
point.res = resid(point.lm)
plot(Data1$x, point.res, xlab = "X", ylab = "Residuals")
abline(0, 0)
No clear pattern present. Good enough. A linear model is ok for this data set.
point.lm = lm(y ~ x, data= Data2)
point.res = resid(point.lm)
plot(Data2$x, point.res, xlab = "X", ylab = "Residuals")
abline(0, 0)
Clear pattern present. A non- linear model is appropriate for this data set.
point.lm = lm(y ~ x, data= Data3)
point.res = resid(point.lm)
plot(Data3$x, point.res, xlab = "X", ylab = "Residuals")
abline(0, 0)
Clear pattern present. A non- linear model is appropriate for this data set.
point.lm = lm(y ~ x, data= Data4)
point.res = resid(point.lm)
plot(Data4$x, point.res, xlab = "X", ylab = "Residuals")
abline(0, 0)
Residual plot was heavily influenced by one lone outlier. A non-linear model is appropriate for this data set.
On April 15, 1912, on her maiden voyage, the Titanic collided with an iceberg and sank. The ship did not have enough lifeboats for 2224 passengers and crew on board. As a result of the collision, 1502 people died. The ship had three classes of passengers. There were 323 passengers in first class, 200 of them survived. There were 277 in second class and 119 survived, and 709 in third class and 181 survived.
ctab = matrix(c(200, 123,119,158,181,528), 3, 2, byrow=TRUE)
rownames(ctab) = c('FirstClass', 'SecondClass', 'ThirdClass')
colnames(ctab) = c('Survived', 'Died')
ctab
## Survived Died
## FirstClass 200 123
## SecondClass 119 158
## ThirdClass 181 528
The explanatory variable is what class the passengers were in. This is because depending on what class you were, the more likely or less likely you were to survive. For example, although people of every class on board the Titanic died, you were less likely to die if you were part of first class rather than third class. Therefore, the response variable is survived.
prop.table(ctab)
## Survived Died
## FirstClass 0.15278839 0.09396486
## SecondClass 0.09090909 0.12070283
## ThirdClass 0.13827349 0.40336134
Third class passengers had the highest proportion of deaths. Approximately 40 % of all people on board died and were from third class. About 15 % of all passengers on board the titanic were first class that survived. Those are the two points that stand out the most. The other values are between these two.
I would choose a conditional distribution based on the class of the passengers. The survival rates for passengers on board was heavily influenced by what class they were a part of. First class passengers had the highest surival rate, while the majority of third class passengers died.
The conditional distributions are as follows: Probability of surviving given first class: .619 Probability of surviving given second class: .43 Probability of surviving given third class: .255
Basically, the higher class you were a part of, the probability of surviving the accident increased dramatically. The highest was first class with around 62 % surviving, and third class as the lowest at 25.5% surviving.
Echinacea In a study designed to evaluate the benefits of taking echinacea when you have a cold, 719 patients were randomly divided into four groups: (1) no pills, (2) pills that had no echinacea, (3) pills that had echinacea but the subjects did not know whether or not the pills contained echinacea, and (4) pills that had echinacea and the bottle containing the pills stated that the contents included echinacea. The outcome was a measure of the severity of cold.
Experimental units were the patients who were part of the study.
Treatments- 4 of them, (1): no pills, (2): pills that had no echinacea, (3): pills that had echinacea but subjects did not know whether or not the pills contained in, (4): pills that had echinacea and subjects knew that the pills contained it.
Outcome: Measure of the severity of cold.
First off, the factor is the name for the explanatory variable which in this case is Echinacea. The levels in this case are the treatments that used this factor. So, pills that had no echinacea, pills that had echniacea but the subject did not know whether or not the pills contained echinacea, and pills that had echinacea and the bottle containing the pills stated that the contents included echinacea.
(Attached to a seperate sheet of paper)
This summer, you get a job working for Dr. Lawrence von Statistein (“Lazy Larry”, to his friends) in his mad science laboratory. Your primary task is testing his latest invention: the IQ Enhancement Ray. The ray has been shown to be safe, but the exact amount of power required to raise a subject’s IQ has not yet been determined. Sadly, the ray can only be used on a subject once, so you can’t just keep ratcheting up the power until you see a change. Your job will be to use the ray gun on a set of volunteers to determine if a setting of 10 millijoules is enough to see an effect. You have already lined up 50 volunteers, and you plan to zap 25 of them (the treatment group), while only pretending to zap 25 (the control group). Each group will then complete an intelligence test to see if the ray has an effect. As payment for volunteering, you are hosting an ice cream party for the participants, and you’ve recorded each person’s favorite flavor so that you will have enough chocolate, vanilla, and strawberry for everyone. You also have a copy of a shorter version of the IQ test that people took last year during the screening process. Such a test usually called a “pre-test.”
The night before the experiment, you are working late at the laboratory, when you accidentally bump the dial on the ray, giving yourself a dose of 1000 gigawatts of brain power! Suddenly, you realize that the ray has given you the ability to see the future! Even more amazingly, you know for every single subject in your study the next day, what he or she would score on the test if exposed to the ray or if exposed to the placebo. You quickly write down the information and rush out to buy as many lottery tickets as you can carry. The data look like this:
# type `View(experiment)` to see the full data as a table after copying this chunk to the console
# or using the "Run Current Chunk" from the "Chunk" menu on the upper left side of this panel.
experiment <- structure(list(ice.cream = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("chocolate",
"strawberry", "vanilla"), class = "factor"), pre.test = c(111,
119, 117, 95, 110, 111, 108, 98, 94, 99, 92, 105, 94, 99, 96,
92, 98, 112, 105, 96, 94, 118, 101, 114, 117, 113, 91, 103, 95,
99, 105, 86, 88, 103, 84, 99, 96, 82, 106, 105, 107, 96, 80,
79, 99, 102, 80, 99, 82, 96), not.zapped = c(111, 119, 118, 93,
109, 112, 107, 96, 96, 98, 92, 106, 95, 101, 97, 90, 99, 111,
107, 97, 93, 120, 102, 116, 115, 114, 88, 102, 96, 99, 106, 82,
87, 106, 83, 100, 95, 78, 105, 108, 106, 99, 77, 77, 98, 101,
76, 99, 85, 98), zapped = c(124, 134, 132, 108, 124, 127, 122,
111, 112, 113, 107, 122, 110, 116, 110, 105, 115, 125, 120, 111,
108, 135, 116, 131, 129, 129, 103, 116, 110, 114, 121, 98, 103,
122, 97, 117, 110, 92, 121, 124, 120, 115, 92, 91, 113, 116,
92, 114, 100, 113)), .Names = c("ice.cream", "pre.test", "not.zapped",
"zapped"), row.names = c(NA, -50L), class = "data.frame")
attach(experiment) # make the ice.cream, pre.test, not.zapped, and zapped vectors available
The next day, you stumble back to the lab, having spent all your winnings on M&Ms and Pokemon cards. The volunteers have arrived, and Larry is about to start the zapping. Looking at the volunteers waiting, you notice that many of the volunteers in the “zap” group (the treatment group) like strawberry ice cream, even though it was the least popular flavor overall.
“Hey, Larry,” you say. “Didn’t you randomize the zapping like we talked about?”
“Sorta,” the mad scientist responds. “I randomized them, but then I decided that I should grab all the people who like strawberry. I like strawberry the best, so I can zap them, and we can all go eat ice cream together.” (They don’t call him Lazy Larry for nothing.) “We’ll get the same answer either way, right?”
“No!” you exclaim. “And I can prove it to you!”
In the next sections, we are going to use R to generate random numbers. To make sure R always generates the same random numbers, we will set the random seed.
set.seed(20160212)
You can learn more about why this works at the wikipedia page for “pseudo random number generators”.
zapped
and not.zapped
columns in your experiment
table. Compute this quantity. On average, does Larry’s ray help the subjects?meanZapped = mean(experiment$zapped)
meanNotZapped = mean(experiment$not.zapped)
meanZapped-meanNotZapped
## [1] 14.9
This value of 14.9 clearly shows that his ray guy increases the intelligence of people zapped.
zapped
column and the mean of the not.zapped
column. What is the difference of these two values?meanZapped
## [1] 114.2
meanNotZapped
## [1] 99.3
The difference between these two values is 14.9
zapped
column of the table and for the control group we see the not.zapped
column. Suggest a way to estimate the value you computed in part 1, using only the data we would usually get to see. How can we use the data we do see to make a reasonable guess about data we don’t see?You can compare the results of when they are zapped or not zapped to the pretest that each individual took prior to volunteering in this experiment. Since not zapped should net the same scores as the pre test, we can compare the values from all zapped individuals to their pre test scores to see what kind of effect the zapping had on their intelligence.
To show Larry what would happen for a real experiment, we need to generate example randomizations that partition our data into treated and control groups. We can do that with the following R code:
difference.random <- numeric(100)
for(i in 1:100) {
treatment.group.ids <- sample.int(50, size = 25)
difference.random[i] <- mean(zapped[treatment.group.ids]) - mean(not.zapped[-treatment.group.ids])
}
Iterates through a list of 100 random people and creates a mean of zapped and subtracts that from the mean of notzapped. (Both of those means are including a subtraction of variable treatment.group.ids)
difference.random
. Where is the center of the distribution? Is the distribution symmetric? How does this distribution compare to the values you computed when you knew the entire truth?plot(difference.random)
The center of the distribution is at index value 50. The distribution is not symmetric because of the randomization of each index’s value. This distribution is different compared to the values I computed when I knew the entire truth.
Continued from the previous scenario:
We can simulate Larry’s procedure with the following code:
difference.larry <- numeric(100)
for(i in 1:100) {
# split up the data into those that like strawberry and the others
strawberry <- experiment[ice.cream == "strawberry", ]
others <- experiment[ice.cream != "strawberry", ]
# always take the strawberry ice cream people
treated.outcomes <- strawberry$zapped
# and then randomize the rest
rand.rest <- sample.int(42, 17) # 42 people left, treat 17 of them to get 25 total
# add a few to the treatment group
treated.outcomes <- c(treated.outcomes, others$zapped[rand.rest])
control.outcomes <- others$not.zapped[-rand.rest]
difference.larry[i] <- mean(treated.outcomes) - mean(control.outcomes)
}
As before, make a plot of the results of running experiments using Larry’s strategy. What is different about this distribution? Does Larry’s strategy “get the right answer” on average?
plot(difference.larry)
Looking at the plot of the results using Larry’s strategy, it seems to be all over the place. However, it does seem to work on average because the mean difference is around 14 in the chart. There is alot more variability however.
In a real experiment, we don’t get to see both the zapped
and not.zapped
columns for all the subjects, but we do get to see the ice.cream
and pre.test
columns for all subjects. Just as with the true outcomes, we can use these variables to see if our randomization procedure works.
pre.test
variable for the treated and control groups (i.e. replace both zapped
and not.zapped
with pre.test
).difference.random <- numeric(100)
for(i in 1:100) {
treatment.group.ids <- sample.int(50, size = 25)
difference.random[i] <- mean(pre.test[treatment.group.ids]) - mean(pre.test[-treatment.group.ids])
}
difference.larry <- numeric(100)
for(i in 1:100) {
# split up the data into those that like strawberry and the others
strawberry <- experiment[ice.cream == "strawberry", ]
others <- experiment[ice.cream != "strawberry", ]
# always take the strawberry ice cream people
treated.outcomes <- strawberry$pre.test
# and then randomize the rest
rand.rest <- sample.int(42, 17) # 42 people left, treat 17 of them to get 25 total
# add a few to the treatment group
treated.outcomes <- c(treated.outcomes, others$zapped[rand.rest])
control.outcomes <- others$pre.test[-rand.rest]
difference.larry[i] <- mean(treated.outcomes) - mean(control.outcomes)
}
plot(difference.random)
plot(difference.larry)
zapped
and not.zapped
. What does randomization provide that Larry’s strategy does not?There is a big difference between the random graph and the graph using Larry’s strategy. When using randomization, the range of values is smaller and some of them are negative. When using Larry’s strategy, all values are positive and there is a much larger range in values present.