Assignment 4

5.60

Should you use the binomial distribution? In each of the following situations, is it reasonable to use a binomial distribution for the random variable X? Give reasons for your answer in each case.

In a random sample of students in a fitness study, X is the mean daily exercise time of the sample.

No. X is the mean of a continuous measurement (daily exercise time), not a count of successes in a fixed number of trials, so a binomial distribution does not apply. Exercise time also cannot be classified into just two outcomes, success or failure.

A manufacturer of running shoes picks a random sample of 20 shoes from the production of shoes each day for a detailed inspection. X is the number of pairs of shoes with a defect.

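Yes. Each sampled item is either defective (a success) or not defective (a failure), the number of trials is fixed at n = 20, and the items are chosen at random from a large production run, so the trials are approximately independent with the same probability of a defect. X, the number with a defect, is therefore reasonably modeled as binomial.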

A nutrition study chooses an SRS of college students. They are asked whether or not they usually eat at least five servings of fruits or vegetables per day. X is the number who say that they do.

Yes. Each student either says they usually eat at least five servings per day (a success) or says they do not (a failure). The sample is an SRS of fixed size from a large population, so the responses are approximately independent with the same probability of success. It is therefore appropriate to use the binomial distribution for X.

X is the number of days during the school year when you skip a class.

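No. Each school day can be classified as a day I skip a class or a day I do not, but the probability of skipping is unlikely to be the same every day, and skipping on one day may affect whether I skip on the next, so the days are not independent trials with a common probability. A binomial distribution is therefore not a reasonable model for X.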

5.62

Illegal downloading. New regulations in Canada require all Internet service providers (ISPs) to send a notice to subscribers who are downloading files illegally asking them to stop. This “notice and notice” system was already in place with Rogers Cable. That company says that prior to these new regulations, 67% of its subscribers who received a notice did not reoffend. Consider a random sample of 50 of these Rogers subscribers who received a first notice.

What is the distribution of the number X of subscribers who reoffend? Explain your answer.

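Each of the 50 subscribers either reoffends or does not, the sample is random, and the sample size is fixed, so the count is binomial. Because 67% do not reoffend, the number X who reoffend is B(50, 0.33). The code below takes "does not reoffend" as the success, with p = 0.67 and n = 50, and uses the dbinom function to draw that distribution: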

# design a function that helps to calculate the binomial distribution
# and draw the discrete graph
graph <- function(n, p) {
  x <- (dbinom(0:n, size = n, prob = p))
  barplot(x, 
          col = "orange",
          ylim = c(0, 0.3),
          names.arg = 0:n,
          main = sprintf(paste('Binomial Distribution (n,p)' , n, p, sep = ', ')))
}

# draw the binomial distribution of random sample 50 and probability of 0.67
graph(50, .67)

The graph above shows the binomial distribution of the number who do not reoffend in a random sample of 50 Rogers subscribers who received a first notice. With p = 0.67 the distribution is roughly bell shaped, with mean 50 × 0.67 = 33.5. The bars begin to rise at about 23 and end at about 49; outcomes outside this range have very small probability. So in a sample of 50 people, we expect about 33 or 34 not to reoffend after the first notice, and about 16 or 17 to reoffend.
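As a numerical cross-check on the plot, the mean and standard deviation of this distribution can be computed from the formulas np and sqrt(np(1 - p)) and compared with values obtained from the probabilities returned by dbinom. This is a minimal sketch; the variable names are illustrative and are not part of the original assignment code.

```r
# Sketch: mean and sd of B(50, 0.67), by formula and from the pmf itself
n <- 50
p <- 0.67
k <- 0:n
pmf <- dbinom(k, size = n, prob = p)

mean_formula <- n * p                      # np
sd_formula   <- sqrt(n * p * (1 - p))      # sqrt(np(1 - p))

mean_pmf <- sum(k * pmf)                   # sum of k * P(Y = k)
sd_pmf   <- sqrt(sum((k - mean_pmf)^2 * pmf))

print(c(mean_formula = mean_formula, mean_pmf = mean_pmf))
print(c(sd_formula = sd_formula, sd_pmf = sd_pmf))
```

The two routes should agree up to floating-point error, which gives a more reliable centre than reading it off the bars.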

What is the probability that at least 18 of the 50 subscribers in your sample reoffend?

We find this probability with the dbinom function. The code below sums the binomial probabilities P(Y = k) for k = 18 to 50, where Y ~ B(50, 0.67) is the number who do not reoffend, and subtracts the sum from 1, which gives P(Y ≤ 17):

# helper that returns the full binomial pmf for k = 0..n (not used below)
binom_pmf <- function(n, p) {
  dbinom(0:n, size = n, prob = p)
}

# 1 minus the sum of P(Y = k) for k = 18..50, i.e. P(Y <= 17)
1 - sum(dbinom(18:50, 50, 0.67))

## [1] 1.847945e-06

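This number, about 0.0000018, is a probability, not a proportion of subscribers: it is the probability that at most 17 of the 50 subscribers do not reoffend, which is the same as at least 33 reoffending. The event in the question, at least 18 reoffending, is X ≥ 18 with X ~ B(50, 0.33); since the mean of X is 50 × 0.33 = 16.5, that probability is far larger.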
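The event "at least 18 reoffend" can also be evaluated directly in terms of the reoffending count. The sketch below assumes X ~ B(50, 0.33), since 67% do not reoffend, and shows three equivalent ways to obtain P(X ≥ 18); the variable names are illustrative.

```r
# Sketch: P(X >= 18) for X ~ B(50, 0.33), the number who reoffend
n <- 50
p_reoffend <- 0.33

# upper tail: P(X > 17) = P(X >= 18)
p_upper_tail <- pbinom(17, size = n, prob = p_reoffend, lower.tail = FALSE)

# direct sum of the pmf over 18..50
p_direct_sum <- sum(dbinom(18:n, size = n, prob = p_reoffend))

# same event seen from the other side: at most 32 do NOT reoffend
p_complement <- pbinom(32, size = n, prob = 1 - p_reoffend)

print(c(p_upper_tail, p_direct_sum, p_complement))
```

All three values should match, which is a useful guard against summing over the wrong range or using the wrong success probability.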

5.64

Illegal downloading, continued. Refer to Exercise 5.62. Given the new regulations, suppose that 75% of the Canadian ISP subscribers will not reoffend after receiving a notice.

If you choose at random 15 subscribers who received a notice, what is the mean of the count X who will not reoffend? What is the mean of the proportion p̂ in your sample who will not reoffend?

In this case, the probability of success (p), which is the probability that a subscriber will not reoffend, is 0.75. We also have a random sample of 15 subscribers. Let's graph it first to see the distribution:

graph(15, .75)

This distribution shows that the mean of the count X is approximately 11. To be precise, the mean of a binomial distribution is the sample size (n = 15) multiplied by the probability of success (p = 0.75):

15*0.75

## [1] 11.25

So the mean of the count X who will not reoffend is 11.25. This means that out of 15 people, we expect about 11 not to reoffend. The mean of the sample proportion p̂ = X/n is p itself, 11.25/15 = 0.75.
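As a rough check on these means, one could simulate many samples of 15 subscribers and average the counts and proportions. This is only a sketch: the seed, the number of simulated samples, and the variable names are assumptions chosen for illustration.

```r
# Sketch: simulate many samples of size 15 with p = 0.75
set.seed(42)                      # arbitrary seed, for reproducibility only
n <- 15
p <- 0.75
n_sims <- 100000                  # number of simulated samples (assumption)

counts <- rbinom(n_sims, size = n, prob = p)   # simulated values of X
props  <- counts / n                           # simulated values of p-hat

mean_count <- mean(counts)        # should be close to n * p
mean_prop  <- mean(props)         # should be close to p

print(c(mean_count = mean_count, mean_prop = mean_prop))
```

The simulated averages will not equal 11.25 and 0.75 exactly, but they should be close, and they get closer as the number of simulated samples grows.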

Repeat the calculations in part (a) for samples of size 150 and 1500. What happens to the mean count of successes as the sample size increases?

We can find the mean count of successes either by reading it off the graphs or by using the formula. We will use both approaches.

First, we will draw the distributions for samples of size 150 and 1500:

# function for graph with sample size of 150
graph2 <- function(n, p) {
  x <- (dbinom(0:n, size = n, prob = p))
  barplot(x, 
          col = "orange",  # barplot() takes col, not color
          ylim = c(0, 0.09),
          names.arg = 0:n,
          main = sprintf(paste('Binomial Distribution (n,p)' , n, p, sep = ', ')))
}

# function for graph with 1500 sample size
graph3 <- function(n, p) {
  x <- (dbinom(0:n, size = n, prob = p))
  barplot(x, 
          col = "orange",  # barplot() takes col, not color
          ylim = c(0, 0.025),
          names.arg = 0:n,
          main = sprintf(paste('Binomial Distribution (n,p)' , n, p, sep = ', ')))
}

# draw the graph
par(mfrow=c(2,1))
graph3(1500, 0.75)
graph2(150, 0.75)

In the graphs above, the distribution for the sample of 1500 appears to be centered around 1117, while the one for the sample of 150 appears to be centered around 117. We can calculate the means exactly by multiplying the probability of success by the sample size:

150*0.75

## [1] 112.5

1500*0.75

## [1] 1125

These calculations show that the mean for a sample size of 150 is 112.5, while the mean for a sample size of 1500 is 1125. The mean count of successes grows in proportion to the sample size.
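The plotting helpers used in this exercise differ only in their y-axis limit, so a single function that derives the limit from the data could replace them. The following is a sketch under that assumption; plot_binomial and its arguments are hypothetical names, not part of the original code.

```r
# Sketch: one plotting helper for any n and p, with the y-limit taken from the pmf
plot_binomial <- function(n, p, col = "orange") {
  k <- 0:n
  pmf <- dbinom(k, size = n, prob = p)
  barplot(pmf,
          col = col,
          ylim = c(0, 1.1 * max(pmf)),   # leave 10% headroom above the tallest bar
          names.arg = k,
          xlab = "Number of successes",
          ylab = "Probability",
          main = sprintf("Binomial Distribution (n = %d, p = %.2f)",
                         as.integer(n), p))
  invisible(pmf)                         # return the pmf without printing it
}
```

For example, plot_binomial(150, 0.75) and plot_binomial(1500, 0.75) would draw the two distributions without a hand-picked ylim for each sample size.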

What happens to the mean proportion of successes?

The mean proportion of successes is equal to the mean count of successes divided by the sample size, which is exactly 75% in both cases:

112.5/150

## [1] 0.75

1125/1500

## [1] 0.75

Therefore, the mean proportion of successes remains the same even as the sample size increases.
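The pattern across the three sample sizes can be collected in one small table. This sketch also includes the standard deviation of the sample proportion, sqrt(p(1 - p)/n), which is not asked for in the exercise and is shown only to illustrate that the spread of p̂ shrinks while its mean stays fixed; the column names are illustrative.

```r
# Sketch: mean count and mean proportion for n = 15, 150, 1500 with p = 0.75
p <- 0.75
sizes <- c(15, 150, 1500)

summary_table <- data.frame(
  n               = sizes,
  mean_count      = sizes * p,                 # np grows with n
  mean_proportion = (sizes * p) / sizes,       # always equal to p
  sd_proportion   = sqrt(p * (1 - p) / sizes)  # shrinks as n grows
)

print(summary_table)
```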

[C]

In class we’ll do an experiment in the cafeteria. Write up the results of your observations of the “can you taste the difference between Coke and Pepsi?” study. Pay close attention to the three pillars of experimental design: control for irrelevant variables, randomize, and replicate. Be as precise as possible, and state your results.

Before running the experiment, we had to make sure it was randomized. We used eight cups, labelled 1 to 8 on the underside. Four cups held one drink and four held the other. Four drinks were available (Pepsi, Coke, Diet Pepsi, and Coke Zero), but we were only allowed to choose two of them to test the subjects.

If four cups held Pepsi, the other four held Coke. We then shuffled 8 cards numbered from 1 (A) to 8 and assigned the even numbers to one of the two drinks and the odd numbers to the other. This ensured that the experimenters did not know which drink was in which cup.
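One way to carry out this kind of random assignment in R, as an alternative to shuffling cards, is with sample(). This is a sketch only: the seed and the variable names are assumptions made for reproducibility and do not describe the experiment as it was actually run.

```r
# Sketch: randomly assign 4 Coke and 4 Pepsi to cups labelled 1..8
set.seed(2024)                                     # arbitrary seed (assumption)
cups <- 1:8
drinks <- sample(rep(c("Coke", "Pepsi"), each = 4))  # random permutation of the 8 drinks

assignment <- data.frame(cup = cups, drink = drinks)
print(assignment)

# sanity check: exactly four cups of each drink
table(assignment$drink)
```

Keeping the printed assignment sealed until the tasting is over would preserve blinding in the same way the hidden labels under the cups do.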

Our subjects were asked not to look in advance at whether they were tasting Coke or Pepsi. This controlled for an irrelevant variable: subjects knowing beforehand which drink they were being given. The experiment was therefore double-blind, because neither the experimenters nor the subjects knew which cups held Pepsi and which held Coke. A double-blind design helps reduce bias in the experiment.

We defined a success as a cup whose drink the subject identified correctly. For example, a subject who correctly identified 6 cups misidentified the other 2.

We ran 3 trials in total (24 cups tested), each with a different one of our three chosen subjects.

Our results were as follows:

drink <- data.frame("States" = c(rep("Correct", 3), rep("Incorrect", 3)),
                    "Results" = c(6, 6, 4, 2, 2, 4),
                    "Trials" =  c(1, 2, 3, 1, 2, 3))

knitr::kable(drink)

States	Results	Trials
Correct	6	1
Correct	6	2
Correct	4	3
Incorrect	2	1
Incorrect	2	2
Incorrect	4	3

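library(ggplot2)   # ggplot(), aes(), geom_bar(), labs()
library(magrittr)  # provides the %>% pipe
# side-by-side bars of correct and incorrect counts for each trial
drink %>% ggplot(aes(x = factor(Trials), y = Results)) +
  geom_bar(stat = "identity",
           aes(fill = factor(States)),
           position = "dodge") +
  labs(title = "Correct and incorrect identifications for each trial",
       x = "Trials", fill = "States")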

In the table above, we can see that in the first two trials our subjects identified 6 out of 8 cups correctly, while in the final trial our subject identified 4 out of 8 correctly.

What we need from our results is an estimate of the probability of success (p). We estimate p by dividing the number of successes by the total number of cups tested:

# estimated probability of success:
# total number of cups tested: 8 * 3 = 24
# number of successes: 6 + 6 + 4 = 16
16/24

## [1] 0.6666667

So the estimated probability of success is about 0.67 (truncated to 0.66 in the plot below). Each trial used 8 cups, so we draw the binomial distribution with n = 8:

graph4 <- function(n, p) {
  x <- (dbinom(0:n, size = n, prob = p))
  barplot(x, 
          col = "orange",
          ylim = c(0, 0.6),
          names.arg = 0:n,
          main = sprintf(paste('Binomial Distribution (n,p)' , n, p, sep = ', ')))
}

graph4(8, .66)

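This shows the distribution of the number of correct identifications out of 8 cups when each cup is identified correctly with probability 0.66. The distribution is roughly bell shaped but skewed to the left, and most of the probability lies on outcomes with more successes than failures, with 5 or 6 correct being the most likely results.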
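The estimate of p and the shape of this distribution can also be checked numerically. The sketch below rebuilds the results table, recomputes the estimate from it, and then inspects the B(8, 0.66) probabilities directly; the variable names are illustrative.

```r
# Sketch: estimate p from the results table and inspect the B(8, 0.66) pmf
drink <- data.frame(States  = c(rep("Correct", 3), rep("Incorrect", 3)),
                    Results = c(6, 6, 4, 2, 2, 4),
                    Trials  = c(1, 2, 3, 1, 2, 3))

n_correct <- sum(drink$Results[drink$States == "Correct"])  # total correct cups
n_total   <- sum(drink$Results)                             # total cups tested
p_hat     <- n_correct / n_total                            # estimated p

k   <- 0:8
pmf <- dbinom(k, size = 8, prob = 0.66)   # same n and p as the plot above

most_likely     <- k[which.max(pmf)]      # outcome with the highest probability
prob_at_least_5 <- sum(pmf[k >= 5])       # chance of 5 or more correct

print(round(data.frame(k = k, probability = pmf), 4))
```

Printing the probabilities alongside the plot makes it easier to see the left skew, since the bars for small counts are too short to compare by eye.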