Homework 3

You can complete this homework by filling the rest of the .Rmd document. When you click the Knit button in RStudio a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Problem 1

There are four sets of data prepared by the statistician Frank Anscombe to illustrate the dangers of calculating without first plotting the data.

##    [,1] [,2]  [,3] [,4]  [,5]  [,6] [,7] [,8]  [,9] [,10] [,11]
## x 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00  7.00  5.00
## y  8.04 6.95  7.58 8.81  8.33  9.96 7.24 4.26 10.84  4.82  5.68

##    [,1] [,2]  [,3] [,4]  [,5] [,6] [,7] [,8]  [,9] [,10] [,11]
## x 10.00 8.00 13.00 9.00 11.00 14.0 6.00  4.0 12.00  7.00  5.00
## y  9.14 8.14  8.74 8.77  9.26  8.1 6.13  3.1  9.13  7.26  4.74

##    [,1] [,2]  [,3] [,4]  [,5]  [,6] [,7] [,8]  [,9] [,10] [,11]
## x 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00  7.00  5.00
## y  7.46 6.77 12.74 7.11  7.81  8.84 6.08 5.39  8.15  6.42  5.73

##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## x 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00  8.00  19.0
## y 6.58 5.76 7.71 8.84 8.47 7.04 5.25 5.56 7.91  6.89  12.5

Make scatterplots for each of the four data sets. Describe the overall pattern you observed.

plot(Data1$x, Data1$y, xlab = "X", ylab = "Y")

Weak linear relationship. Positive association between X and Y. There does not seem to be any outliers within this scatter plot. Moderately positive correlation.

plot(Data2$x, Data2$y, xlab = "X", ylab = "Y")

Moderate to strong positive correlation between X and Y. There does not seem to be any outliers in this scatter plot. Weak linear relationship.

plot(Data3$x, Data3$y, xlab = "X", ylab = "Y")

Very strong linear relationship. Strong positive correlation between X and Y. One outlier exists at approximately x=13.

plot(Data4$x, Data4$y, xlab = "X", ylab = "Y")

Non-existent linear relationship. Extremely strong zero correlation. One outlier exists at approximately x =19. No direction for association.

```

Find the least-square regression lines for each of the four data sets and add the corresponding lines to your scatterplots. Do you notice any discrepancy? Describe it.

plot(Data1$x, Data1$y, xlab = "X", ylab = "Y")
abline(lm(Data1$y ~ Data1$x))

This one fits rather nicely. An even amount of points on each side of the LSRL.

plot(Data2$x, Data2$y, xlab = "X", ylab = "Y")
abline(lm(Data2$y ~ Data2$x))

The second line is influenced by the parabolic arch of the points heavily, and manages to only touch two points from the scatterplot as such.

plot(Data3$x, Data3$y, xlab = "X", ylab = "Y")
abline(lm(Data3$y ~ Data3$x))

In this scatteplot, the outlier influences the LSRL heavily, drawing it away from the near perfect linear relationship of the rest of the points.

plot(Data4$x, Data4$y, xlab = "X", ylab = "Y")
abline(lm(Data4$y ~ Data4$x))

This line is heavily influenced by the one outlier. Therefore, the line goes through that outlier ignoring the majority of the rest of the data.

Make residual plots for each of the four data sets. Summarize what the residuals tell you about the data sets.

point.lm = lm(y ~ x, data= Data1)
point.res = resid(point.lm)
plot(Data1$x, point.res, xlab = "X", ylab = "Residuals")
abline(0, 0)

No clear pattern present. Good enough. A linear model is ok for this data set.

point.lm = lm(y ~ x, data= Data2)
point.res = resid(point.lm)
plot(Data2$x, point.res, xlab = "X", ylab = "Residuals")
abline(0, 0)

Clear pattern present. A non- linear model is appropriate for this data set.

point.lm = lm(y ~ x, data= Data3)
point.res = resid(point.lm)
plot(Data3$x, point.res, xlab = "X", ylab = "Residuals")
abline(0, 0)

Clear pattern present. A non- linear model is appropriate for this data set.

point.lm = lm(y ~ x, data= Data4)
point.res = resid(point.lm)
plot(Data4$x, point.res, xlab = "X", ylab = "Residuals")
abline(0, 0)

Residual plot was heavily influenced by one lone outlier. A non-linear model is appropriate for this data set.

Problem 2

On April 15, 1912, on her maiden voyage, the Titanic collided with an iceberg and sank. The ship did not have enough lifeboats for 2224 passengers and crew on board. As a result of the collision, 1502 people died. The ship had three classes of passengers. There were 323 passengers in first class, 200 of them survived. There were 277 in second class and 119 survived, and 709 in third class and 181 survived.

Construct a 3x2 table with class and survival as variables (for survival variable use survived and died as its categories).

ctab = matrix(c(200, 123,119,158,181,528), 3, 2, byrow=TRUE)


rownames(ctab) = c('FirstClass', 'SecondClass', 'ThirdClass')


colnames(ctab) = c('Survived', 'Died')


ctab

##             Survived Died
## FirstClass       200  123
## SecondClass      119  158
## ThirdClass       181  528

Which variable is the explanatory variable and which is the response variable. Give reasons for your answer.

The explanatory variable is what class the passengers were in. This is because depending on what class you were, the more likely or less likely you were to survive. For example, although people of every class on board the Titanic died, you were less likely to die if you were part of first class rather than third class. Therefore, the response variable is survived.

Find the marginal distributions and summarize the major features of these distributions.

prop.table(ctab)

##               Survived       Died
## FirstClass  0.15278839 0.09396486
## SecondClass 0.09090909 0.12070283
## ThirdClass  0.13827349 0.40336134

Third class passengers had the highest proportion of deaths. Approximately 40 % of all people on board died and were from third class. About 15 % of all passengers on board the titanic were first class that survived. Those are the two points that stand out the most. The other values are between these two.

Which conditional distribution would you choose to explain the relationship between these two variables? Explain you answer.

I would choose a conditional distribution based on the class of the passengers. The survival rates for passengers on board was heavily influenced by what class they were a part of. First class passengers had the highest surival rate, while the majority of third class passengers died.

Find the conditional distribution you chose in (d) and summarize your interpretation of the relationship based on the results.

The conditional distributions are as follows: Probability of surviving given first class: .619 Probability of surviving given second class: .43 Probability of surviving given third class: .255

Basically, the higher class you were a part of, the probability of surviving the accident increased dramatically. The highest was first class with around 62 % surviving, and third class as the lowest at 25.5% surviving.

Problem 3

Echinacea In a study designed to evaluate the benefits of taking echinacea when you have a cold, 719 patients were randomly divided into four groups: (1) no pills, (2) pills that had no echinacea, (3) pills that had echinacea but the subjects did not know whether or not the pills contained echinacea, and (4) pills that had echinacea and the bottle containing the pills stated that the contents included echinacea. The outcome was a measure of the severity of cold.

Identify the experimental units, the treatments, and the outcome.

Experimental units were the patients who were part of the study.

Treatments- 4 of them, (1): no pills, (2): pills that had no echinacea, (3): pills that had echinacea but subjects did not know whether or not the pills contained in, (4): pills that had echinacea and subjects knew that the pills contained it.

Outcome: Measure of the severity of cold.

Describe the factor and its levels.

First off, the factor is the name for the explanatory variable which in this case is Echinacea. The levels in this case are the treatments that used this factor. So, pills that had no echinacea, pills that had echniacea but the subject did not know whether or not the pills contained echinacea, and pills that had echinacea and the bottle containing the pills stated that the contents included echinacea.

Draw a diagram similar to Figure 3.3 in textbook (pg. 180) that describes the experiment. You can find figure 3.3 in pdf version of HW3.

(Attached to a seperate sheet of paper)

Problem 4

This summer, you get a job working for Dr. Lawrence von Statistein (“Lazy Larry”, to his friends) in his mad science laboratory. Your primary task is testing his latest invention: the IQ Enhancement Ray. The ray has been shown to be safe, but the exact amount of power required to raise a subject’s IQ has not yet been determined. Sadly, the ray can only be used on a subject once, so you can’t just keep ratcheting up the power until you see a change. Your job will be to use the ray gun on a set of volunteers to determine if a setting of 10 millijoules is enough to see an effect. You have already lined up 50 volunteers, and you plan to zap 25 of them (the treatment group), while only pretending to zap 25 (the control group). Each group will then complete an intelligence test to see if the ray has an effect. As payment for volunteering, you are hosting an ice cream party for the participants, and you’ve recorded each person’s favorite flavor so that you will have enough chocolate, vanilla, and strawberry for everyone. You also have a copy of a shorter version of the IQ test that people took last year during the screening process. Such a test usually called a “pre-test.”

The night before the experiment, you are working late at the laboratory, when you accidentally bump the dial on the ray, giving yourself a dose of 1000 gigawatts of brain power! Suddenly, you realize that the ray has given you the ability to see the future! Even more amazingly, you know for every single subject in your study the next day, what he or she would score on the test if exposed to the ray or if exposed to the placebo. You quickly write down the information and rush out to buy as many lottery tickets as you can carry. The data look like this:

# type `View(experiment)` to see the full data as a table after copying this chunk to the console
# or using the "Run Current Chunk" from the "Chunk" menu on the upper left side of this panel.
experiment <- structure(list(ice.cream = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("chocolate", 
"strawberry", "vanilla"), class = "factor"), pre.test = c(111, 
119, 117, 95, 110, 111, 108, 98, 94, 99, 92, 105, 94, 99, 96, 
92, 98, 112, 105, 96, 94, 118, 101, 114, 117, 113, 91, 103, 95, 
99, 105, 86, 88, 103, 84, 99, 96, 82, 106, 105, 107, 96, 80, 
79, 99, 102, 80, 99, 82, 96), not.zapped = c(111, 119, 118, 93, 
109, 112, 107, 96, 96, 98, 92, 106, 95, 101, 97, 90, 99, 111, 
107, 97, 93, 120, 102, 116, 115, 114, 88, 102, 96, 99, 106, 82, 
87, 106, 83, 100, 95, 78, 105, 108, 106, 99, 77, 77, 98, 101, 
76, 99, 85, 98), zapped = c(124, 134, 132, 108, 124, 127, 122, 
111, 112, 113, 107, 122, 110, 116, 110, 105, 115, 125, 120, 111, 
108, 135, 116, 131, 129, 129, 103, 116, 110, 114, 121, 98, 103, 
122, 97, 117, 110, 92, 121, 124, 120, 115, 92, 91, 113, 116, 
92, 114, 100, 113)), .Names = c("ice.cream", "pre.test", "not.zapped", 
"zapped"), row.names = c(NA, -50L), class = "data.frame")

attach(experiment) # make the ice.cream, pre.test, not.zapped, and zapped vectors available

The next day, you stumble back to the lab, having spent all your winnings on M&Ms and Pokemon cards. The volunteers have arrived, and Larry is about to start the zapping. Looking at the volunteers waiting, you notice that many of the volunteers in the “zap” group (the treatment group) like strawberry ice cream, even though it was the least popular flavor overall.

“Hey, Larry,” you say. “Didn’t you randomize the zapping like we talked about?”

“Sorta,” the mad scientist responds. “I randomized them, but then I decided that I should grab all the people who like strawberry. I like strawberry the best, so I can zap them, and we can all go eat ice cream together.” (They don’t call him Lazy Larry for nothing.) “We’ll get the same answer either way, right?”

“No!” you exclaim. “And I can prove it to you!”

Preliminaries

In the next sections, we are going to use R to generate random numbers. To make sure R always generates the same random numbers, we will set the random seed.

set.seed(20160212)

You can learn more about why this works at the wikipedia page for “pseudo random number generators”.

Part (a)

Larry wants to know if his ray works, on average, for raising the IQ of the subjects in this study. In other words, he wants to know the mean difference between the zapped and not.zapped columns in your experiment table. Compute this quantity. On average, does Larry’s ray help the subjects?

meanZapped = mean(experiment$zapped)
meanNotZapped = mean(experiment$not.zapped)

meanZapped-meanNotZapped

## [1] 14.9

This value of 14.9 clearly shows that his ray guy increases the intelligence of people zapped.

Also compute the mean of the zapped column and the mean of the not.zapped column. What is the difference of these two values?

meanZapped

## [1] 114.2

meanNotZapped

## [1] 99.3

The difference between these two values is 14.9

Usually, we don’t have super mind powers, so we can only know how a subject responds under one of the two conditions, either being zapped or not zapped. In other words, for our treated subjects we see the value of the zapped column of the table and for the control group we see the not.zapped column. Suggest a way to estimate the value you computed in part 1, using only the data we would usually get to see. How can we use the data we do see to make a reasonable guess about data we don’t see?

You can compare the results of when they are zapped or not zapped to the pretest that each individual took prior to volunteering in this experiment. Since not zapped should net the same scores as the pre test, we can compare the values from all zapped individuals to their pre test scores to see what kind of effect the zapping had on their intelligence.

Part (b)

To show Larry what would happen for a real experiment, we need to generate example randomizations that partition our data into treated and control groups. We can do that with the following R code:

difference.random <- numeric(100)
for(i in 1:100) {
  treatment.group.ids <- sample.int(50, size = 25)
  difference.random[i] <- mean(zapped[treatment.group.ids]) - mean(not.zapped[-treatment.group.ids])
}

Describe in words what the simulation in the previous R code chunk is doing.

Iterates through a list of 100 random people and creates a mean of zapped and subtracts that from the mean of notzapped. (Both of those means are including a subtraction of variable treatment.group.ids)

Provide a graph of the distribution of difference.random. Where is the center of the distribution? Is the distribution symmetric? How does this distribution compare to the values you computed when you knew the entire truth?

plot(difference.random)

The center of the distribution is at index value 50. The distribution is not symmetric because of the randomization of each index’s value. This distribution is different compared to the values I computed when I knew the entire truth.

Problem 5

Continued from the previous scenario:

Part (c)

We can simulate Larry’s procedure with the following code:

difference.larry <- numeric(100)
for(i in 1:100) {
  # split up the data into those that like strawberry and the others
  strawberry <- experiment[ice.cream == "strawberry", ]
  others <- experiment[ice.cream != "strawberry", ]
  
  # always take the strawberry ice cream people
  treated.outcomes <- strawberry$zapped 
  
  # and then randomize the rest
  rand.rest <- sample.int(42, 17) # 42 people left, treat 17 of them to get 25 total
  
  # add a few to the treatment group
  treated.outcomes <- c(treated.outcomes, others$zapped[rand.rest])
  control.outcomes <- others$not.zapped[-rand.rest]
  
  difference.larry[i] <- mean(treated.outcomes) - mean(control.outcomes)
}

As before, make a plot of the results of running experiments using Larry’s strategy. What is different about this distribution? Does Larry’s strategy “get the right answer” on average?

plot(difference.larry)

Looking at the plot of the results using Larry’s strategy, it seems to be all over the place. However, it does seem to work on average because the mean difference is around 14 in the chart. There is alot more variability however.

Part (d)

In a real experiment, we don’t get to see both the zapped and not.zapped columns for all the subjects, but we do get to see the ice.cream and pre.test columns for all subjects. Just as with the true outcomes, we can use these variables to see if our randomization procedure works.

Repeat the simulations (you can copy and paste the randomization chunk and Larry’s strategy chunk), but this time compare the difference of mean the pre.test variable for the treated and control groups (i.e. replace both zapped and not.zapped with pre.test).

difference.random <- numeric(100)
for(i in 1:100) {
  treatment.group.ids <- sample.int(50, size = 25)
  difference.random[i] <- mean(pre.test[treatment.group.ids]) - mean(pre.test[-treatment.group.ids])
}

difference.larry <- numeric(100)
for(i in 1:100) {
  # split up the data into those that like strawberry and the others
  strawberry <- experiment[ice.cream == "strawberry", ]
  others <- experiment[ice.cream != "strawberry", ]
  
  # always take the strawberry ice cream people
  treated.outcomes <- strawberry$pre.test 
  
  # and then randomize the rest
  rand.rest <- sample.int(42, 17) # 42 people left, treat 17 of them to get 25 total
  
  # add a few to the treatment group
  treated.outcomes <- c(treated.outcomes, others$zapped[rand.rest])
  control.outcomes <- others$pre.test[-rand.rest]
  
  difference.larry[i] <- mean(treated.outcomes) - mean(control.outcomes)
}

Make distribution plots for simulations.

plot(difference.random)

plot(difference.larry)

(Written) Use these plots to show Larry why we might be worried about his strategy for choosing a treatment group even if we didn’t have the ability to see the future and write down the zapped and not.zapped. What does randomization provide that Larry’s strategy does not?

There is a big difference between the random graph and the graph using Larry’s strategy. When using randomization, the range of values is smaller and some of them are negative. When using Larry’s strategy, all values are positive and there is a much larger range in values present.