This report presents the solutions, and their findings, for the six given problems.
The purpose of this assessment is to apply your understanding of key concepts covered in this course, and to explore issues around the ability of statistical tests to correctly assess differences in population means.
This assessment covers the following topic areas:
• Probability distributions
• Sampling distributions
• Confidence intervals and hypothesis testing
• ANOVA
Setting up the knitr options for the report.
Installing and loading the packages we need to produce the report below.
library(plyr) # Split-apply-combine helpers (loaded before dplyr to avoid masking issues)
library(dplyr) # Useful for data manipulation
library(ggplot2) # Useful for building data visualisations
library(knitr) # Useful for creating nice tables
library(magrittr) # For pipes
library(tidyr) # for wrangling
library(stringr) # for str_interp()
library(MASS) # mvrnorm()
library(car) # for the leveneTest()
Before generating random data, we call the set.seed() function with our student number as the argument value.
seed <- 3035196
set.seed(seed)
Throughout this report we are going to use some given functions from the class notes to generate density plots in a visually appealing manner; the first one is plotUnif().
plotUnif <- function (min, max, type = "b", col = "black")
{
if (!is.numeric(min) | !is.vector(min) | any(!is.finite(min))) {
stop("The minimum of the uniform distribution, 'min', must be a single number")
}
if (length(min) != 1) {
stop("The minimum of the uniform distribution, 'min', must be a single number")
}
if (!is.numeric(max) | !is.vector(max) | any(!is.finite(max))) {
stop("The maximum of the uniform distribution, 'max', must be a single number")
}
if (length(max) != 1) {
stop("The maximum of the uniform distribution, 'max', must be a single number")
}
if ((max - min) <= 0)
stop("The parameters min and max represent the minimum and maximum values of the uniform distribution, respectively")
if (!type %in% c("b", "dis", "den", "q"))
stop("The argument 'type' must be 'b', 'dis', 'den' or 'q'")
if (length(col) != 1)
stop("The argument 'col' must be a single colour")
if (col %in% c(NA, NaN, Inf, -Inf))
stop("The argument 'col' must be a single colour")
oldpar <- par(no.readonly = TRUE)
on.exit(par(oldpar))
dom = (max - min)/10
x = seq(min - dom, max + dom, by = 0.001)
fx = dunif(x, min = min, max = max)
Fx = punif(x, min = min, max = max)
Finvx = qunif(seq(0, 1, by = 0.01), min = min, max = max)
if (type == "b") {
par(mfrow = c(1, 3))
plot(x, fx, type = "l", main = "Density Function", ylab = "f(x)",
lwd = 2, col = col)
segments(min, 0, min, 1/(max - min), lty = 2, col = "white",
lwd = 3)
segments(max, 0, max, 1/(max - min), lty = 2, col = "white",
lwd = 3)
plot(x, Fx, type = "l", main = "Distribution Function",
ylab = "F(x)", lwd = 2, col = col)
abline(h = c(0, 1), lty = 2, col = "gray")
plot(seq(0, 1, by = 0.01), Finvx, type = "l", xlab = expression(tau),
ylab = "", main = "Quantile Function", lwd = 2,
col = col)
title(ylab = expression(paste("F"^"-1", (tau), sep = "")),
line = 2.5, cex.lab = 1)
}
else if (type == "dis") {
plot(x, Fx, type = "l", main = "Distribution Function",
ylab = "f(x)", lwd = 2, col = col)
abline(h = c(0, 1), lty = 2, col = "gray")
}
else if (type == "den") {
plot(x, fx, type = "l", main = "Density Function", ylab = "f(x)",
lwd = 2, col = col)
segments(min, 0, min, 1/(max - min), lty = 2, col = "white",
lwd = 3)
segments(max, 0, max, 1/(max - min), lty = 2, col = "white",
lwd = 3)
}
else if (type == "q") {
plot(seq(0, 1, by = 0.01), Finvx, type = "l", xlab = expression(tau),
ylab = "", main = "Quantile Function", lwd = 2,
col = col)
title(ylab = expression(paste("F"^"-1", (tau), sep = "")),
line = 2.5, cex.lab = 1)
}
}
Using the function runif() for a continuous uniform distribution within the interval [0,1], we can generate the data for this question.
n1 <- 100 # sample size
a1 <- 0
b1 <- 1
p1_data <- runif(n1, min = a1, max = b1)
head(p1_data, 10) # look at first 10 sampled values
## [1] 0.82810262 0.55569027 0.82210910 0.97474690 0.53336131 0.79425917
## [7] 0.23539538 0.03721066 0.56871249 0.76273760
ggplot() +
geom_histogram(aes(p1_data),
boundary = b1,
bins = 20,
color = 'white',
fill = 'steelblue') +
scale_x_continuous(limits = c(a1, b1)) +
labs(x = "x",
y = 'Frequency',
title = "Histogram of generated uniform data")
The values appear skewed to the left, with certain bins in the interval reaching only half, or even a third, of the maximum bin count. After trying different numbers of bins, 20 was selected because it shows the trend and the larger differences most clearly, whereas 10 bins miss the frequency variations and 30 make the trend difficult to spot.
If we plot the uniform density function (plotUnif()) with the code given in the class notes,
plotUnif(min = a1, max = b1, type = "den", col = "red")
we see that we could have expected a more rectangular distribution, roughly equal across the interval, whereas the histogram shows the values skewed to the left. It is a somewhat surprising result, which may simply be an artefact of the chosen seed.
Taking the ideal distribution, the mean can be calculated as \(\mu = (a + b)/2\), and the variance as \(\sigma^2 = (b - a)^2 / 12\).
mean_dens_func_p1 <- (a1 + b1) / 2
cat(paste0('Density Function Mean: ',round(mean_dens_func_p1,2)))
## Density Function Mean: 0.5
And the standard deviation as \(\sigma = (b - a)/\sqrt{12}\).
var_dens_func_p1 <- (b1 - a1)^2 / 12
sd_dens_func_p1 <- sqrt(var_dens_func_p1)
cat(paste0('Density Function Standard deviation: ',round(sd_dens_func_p1,2)))
## Density Function Standard deviation: 0.29
Now we calculate these values for the generated data.
mean_vals_p1 <- sum(p1_data) / n1
# mean_vals_p1 <- mean(p1_data) # using the R function
cat(paste0('Mean for the generated values: ',round(mean_vals_p1,2)))
## Mean for the generated values: 0.54
This sample mean, being greater than that of the ideal density, reflects the same skew the plot showed.
sd_vals_p1 <- sd(p1_data)
cat(paste0('Standard deviation for the generated values: ',
round(sd_vals_p1,2)))
## Standard deviation for the generated values: 0.3
This value, being slightly higher than the ideal density's standard deviation, reaffirms the same tendency as the difference in means.
First for the uniform distribution:
xi <- 0.5
prob_xi_ud <- punif(xi, min = a1, max = b1)
cat(paste0('Probability of 0.5 or less for the uniform distribution: ',
round(prob_xi_ud,2)))
## Probability of 0.5 or less for the uniform distribution: 0.5
As expected in this case. To calculate it for the sample data, we count the number of elements that are less than or equal to 0.5.
n_lessoreqto_0.5 <- length(p1_data[p1_data<=0.5])
prob_xi <- n_lessoreqto_0.5 / n1
cat(paste0('Probability of 0.5 or less for the generated data: ',
round(prob_xi,2)))
## Probability of 0.5 or less for the generated data: 0.42
This matches the tendency already seen with the mean and variance; the difference in probability is -0.08.
We can rewrite this as: Pr(0.3 ≤ X ≤ 0.5) = Pr(X ≤ 0.5) - Pr(X ≤ 0.3).
xii <- 0.3
prob_xii_ud <- prob_xi_ud - punif(xii, min = a1, max = b1)
cat(paste0('Pr(0.3 ≤ X ≤ 0.5) for the uniform distribution: ',
round(prob_xii_ud,2)))
## Pr(0.3 ≤ X ≤ 0.5) for the uniform distribution: 0.2
This is as expected for a uniform distribution, since it is the area of a rectangle of height 1 between those two bounds. Now for the generated data.
n_lessoreqto_0.3 <- length(p1_data[p1_data<=0.3])
prob_xii <- (n_lessoreqto_0.5 - n_lessoreqto_0.3) / n1
cat(paste0('Pr(0.3 ≤ X ≤ 0.5) for the generated data: ',
round(prob_xii,2)))
## Pr(0.3 ≤ X ≤ 0.5) for the generated data: 0.16
The difference in probability of -0.04 once again corroborates that we have fewer values below the mean.
Here we need a conditional probability: the probability that a value is at most 0.7 given that it is above 0.25. By the definition of conditional probability, Pr(X ≤ 0.7 | X > 0.25) = Pr(0.25 < X ≤ 0.7) / Pr(X > 0.25), where Pr(X > 0.25) = 1 - Pr(X ≤ 0.25).
x25 <- 0.25
prob_x25_ud <- 1 - punif(x25, min = a1, max = b1)
x7 <- 0.7
prob_x7_ud <- punif(x7, min = a1, max = b1)
# Only the probability mass between 0.25 and 0.7 counts in the numerator.
prob_xiii_ud <- (prob_x7_ud - punif(x25, min = a1, max = b1)) / prob_x25_ud
cat(paste0('Pr(X ≤ 0.7 | X > 0.25) for the uniform distribution: ',
round(prob_xiii_ud,2)))
## Pr(X ≤ 0.7 | X > 0.25) for the uniform distribution: 0.6
If we do this for the generated data, we count, among the values above 0.25, those that are also at most 0.7.
n_morethan_0.25 <- length(p1_data[p1_data>0.25])
n_in_0.25_0.7 <- length(p1_data[p1_data>0.25 & p1_data<=0.7])
prob_xiii <- n_in_0.25_0.7 / n_morethan_0.25
cat(paste0('Pr(X ≤ 0.7 | X > 0.25) for the generated data: ',
round(prob_xiii,2)))
Comparing this empirical proportion with the theoretical 0.6 gives one more check on the shape of the generated data: relatively more values above 0.7 would push the empirical value below the theoretical one, consistent with the left skew already observed.
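As a quick hedged sanity check (a throwaway sample, not the assessment data), a much larger uniform sample gives an empirical conditional proportion close to the theoretical 0.6:
# Large-sample check: converges to 0.45 / 0.75 = 0.6.
big_sample <- runif(1e6, min = a1, max = b1)
mean(big_sample > 0.25 & big_sample <= 0.7) / mean(big_sample > 0.25)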
Two hundred adults take part in an experiment to examine the efficacy of having a flu jab ahead of the winter season. They are randomly assigned to two groups of equal size. Group A received the flu jab and group B received a placebo jab. At the end of the winter season all participants are asked to disclose whether they contracted the flu.
n2 <- 200 # total study
n2_g <- n2 / 2 # group size
We will get tabulated values in the form below.
Condition                        Group A   Group B
Contracted the flu
Did not contract the flu
TOTAL                            100       100
It can be assumed that the number in each group who contract the flu will follow a Binomial distribution.
Participants in group A are expected to have a 10% chance of contracting the flu whereas, for those in group B, the chance of catching the flu is 30%.
In order to generate the synthetic data, we are going to use rbinom() with a single trial per participant (size = 1), which corresponds to the Bernoulli distribution, for the 100 patients in each group. First for group A.
seed <- 3035196
set.seed(seed)
prob_A <- 0.1 # probability of flu cases for group A
g_A <- rbinom(n2_g, 1, prob_A)
g_A
## [1] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0
## [75] 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
g_A_f <- mean(g_A) * 100 # number out of 100 who got the flu
Now for group B.
seed <- 3035196
set.seed(seed)
prob_B <- 0.3 # probability of flu cases for group B
g_B <- rbinom(n2_g, 1, prob_B)
g_B
## [1] 1 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 1 1
## [38] 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 1
## [75] 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 1
g_B_f <- mean(g_B) * 100 # number out of 100 who got the flu
To generate the table we can create a data frame from these vectors.
Condition <- c('Contracted the flu (CF)', "Did not contract the flu (DNCF)", "TOTAL")
GroupA <- c(g_A_f, 100-g_A_f, 100 )
GroupB <- c(g_B_f, 100-g_B_f, 100)
Totals <- c(g_A_f+g_B_f, 200-(g_A_f+g_B_f), 200)
df2 <- data.frame(Condition, GroupA, GroupB, Totals)
df2
A problem has been found with this approach: the generated data does not stay static between renders even with a given seed, so the tabulated values will vary with each sample taken. I separately attempted to run a larger number of trials and use the mean values to generate the data, though the end result is that the percentages adjust to the expected means, as the Central Limit Theorem predicts; that, I believe, defeats the purpose of a random sample of synthetically generated data, which is why we stick with this data generation method.
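As a hedged aside (an illustrative alternative, not the method used above), the group-level count of flu cases could be drawn in a single call, since a binomial random variable is the sum of independent Bernoulli trials:
# Single Binomial(100, 0.1) draw of the number of flu cases in group A.
rbinom(1, size = n2_g, prob = prob_A)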
Given the variability of the data upon generation, we are going to refer to the tabulated values in order to calculate the probabilities.
p2bi <- (100-g_B_f) # percentage of group B who did not get the flu
cat(paste0('Probability of NOT getting the flu in Group B: ',
round(p2bi,2), "%"))
## Probability of NOT getting the flu in Group B: 62%
In order to give the confidence intervals, we need to calculate a few values from the synthetic data. It should be noted that the calculations are based on the generated sample for group B.
sd2b <- sd(g_B)
conf_level2bii <- 0.95
#Calculations
s_error2bii <- sd2b/sqrt(n2_g) # standard error
alpha2bii <- 1-conf_level2bii
df2bii <- n2_g-1 # degrees of freedom
t_value2bii <- qt(alpha2bii/2, df = df2bii, lower.tail = FALSE)
m_error2bii <- t_value2bii*s_error2bii # margin of error
# Confidence limits
LCL2bii <- 1-mean(g_B)-m_error2bii # Lower
UCL2bii <- 1-mean(g_B)+m_error2bii # Upper
# Output
cat(str_interp("${conf_level2bii*100}% CI: [${round(LCL2bii,2)}, ${round(UCL2bii,2)}]"))
## 95% CI: [0.52, 0.72]
We are 95% confident that the proportion of people who do not contract the flu in that season lies within the interval.
If we instead calculate the margin of error using the normal approximation for a proportion:
sample_prop_2bii <- 1-mean(g_B) # sample proportion
SE_2bii <- sqrt(sample_prop_2bii*(1-sample_prop_2bii)/n2_g)
z_95 <- 1.96
ME_2bii <- SE_2bii*z_95
cat(paste0('Margin of error for NOT getting the flu in Group B: ',
round(ME_2bii,2)))
## Margin of error for NOT getting the flu in Group B: 0.1
which matches the margin of error obtained above (both round to 0.1).
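As a hedged cross-check, base R's prop.test() produces a confidence interval for a proportion directly; it uses a Wilson-type method, so its limits may differ slightly from those above:
# 95% CI for the proportion of group B not contracting the flu.
prop.test(sum(g_B == 0), n2_g, conf.level = 0.95)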
If we use the numbers from the synthetic data to calculate this weighted probability:
vax_pop_per <- 0.4
total_pop_cf <- vax_pop_per*mean(g_A) + (1-vax_pop_per)*mean(g_B)
cat(paste0('Percentage of the population contracting the flu if 40% vaccinated: ',
round(total_pop_cf,3)*100), "%")
## Percentage of the population contracting the flu if 40% vaccinated: 28 %
Using pbinom() with the probability derived from the synthetic data, we compute Pr(X > 2), i.e. Pr(X ≥ 3):
prob_3ormorein10 <- pbinom(2, 10, mean(g_A), lower.tail = FALSE)
cat(paste0('Probability of contracting the flu 3 or more times in 10 years: ',
round(prob_3ormorein10,3)))
## Probability of contracting the flu 3 or more times in 10 years: 0.131
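Equivalently, as a cross-check, the same probability can be written as an explicit sum over the binomial probability mass function:
# Pr(X >= 3) as the sum of the binomial pmf over 3..10; matches pbinom() above.
sum(dbinom(3:10, size = 10, prob = mean(g_A)))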
This question explores the distribution of the sample mean when the underlying variable has an exponential distribution with mean 10.
mu3 <- 10 # mean value
As with the previous problem, we are going to generate the synthetic distribution using rexp(), with the given requirements.
seed <- 3035196
set.seed(seed) # resetting the seed for this generator
n3 <- 1000 # number of observations
lambda3 <- 1/mu3
exp_dist3 <- rexp(n3, lambda3)
head(exp_dist3)
## [1] 6.562052 1.113805 6.442182 1.631636 29.633299 1.374250
First we calculate the mean for the synthetic distribution,
mu_synth3 <- mean(exp_dist3)
cat(paste0('Mean value for the synthetic distribution: ',
round(mu_synth3,3)))
## Mean value for the synthetic distribution: 10.482
which we can compare directly with the given value of 10, and which tells us that the time between successive events is slightly longer in the generated data.
And now the standard deviation.
sd_synth3 <- sd(exp_dist3)
cat(paste0('Standard deviation for the synthetic distribution: ',
round(sd_synth3,3)))
## Standard deviation for the synthetic distribution: 10.617
We know that an exponential distribution has equal mean and standard deviation (both 1/λ), so we can say the deviation is also slightly higher in the synthetic data: there is more variability than in the ideal exponential distribution.
plot3ii <- ggplot(data = data.frame(x = exp_dist3), aes(x)) +
stat_function(fun = dexp,
geom = "area",
fill = "red",
n = n3,
args = list(rate = 1/mu_synth3)) +
stat_function(fun = dexp,
geom = "area",
fill = "steelblue",
alpha= 0.6,
n = n3,
args = list(rate = lambda3)) +
labs(x = "Values",
y = 'Frequency',
title = "Histogram comparison between ideal (Blue) and synthetic data (Red)")
plot3ii
As we can see, there is not much difference in the covered area: the ideal distribution has a slightly larger section close to zero, which appears to be offset by a slight wedge, more perceptible above 25. It seems, although it is hard to tell without magnification, that the synthetic distribution has a heavier tail than the ideal one.
Using the same distribution mean for the whole problem, we can generate the data required. Given that the sample size of 2 is well below 30, we cannot rely on the Central Limit Theorem (CLT), so we are going to generate two separate columns of random values.
set.seed(seed)
n3b <- 2 # sample size
rows3 <- 1000 # samples
x31 <- rexp(rows3, lambda3)
x32 <- rexp(rows3, lambda3)
df3b <- data.frame(x31, x32)
df3b <- df3b %>% mutate(samplemeans = (x31+x32)/2)
head(df3b)
mu_x3 <- mean(df3b$samplemeans)
plot3bi <- ggplot(data = data.frame(x = df3b$samplemeans),
aes(x)) +
stat_function(fun = dexp,
geom = "area",
fill = "red",
n = n3,
args = list(rate = 1/mu_x3)) +
stat_function(fun = dexp,
geom = "area",
fill = "steelblue",
alpha= 0.6,
n = n3,
args = list(rate = lambda3)) +
labs(x = "Values",
y = 'Frequency',
title = "Histogram of 1000 size 2 sample means vs ideal distribution") +
xlim(0, 75)
plot3bi
To compare with the previous distribution we have plotted both, and in this case we can observe fewer differences than with the size-1 synthetic distribution; we can also see that the tail is shorter. Overall, a tighter fit in the extremes than with the single-observation generation.
In this case we can use the CLT, exploiting the properties of the normal distribution of sample means: we first generate the total amount of data required and then arrange it as 1000 samples of size 30, which requires generating a vector of 30,000 values.
set.seed(seed)
n3c <- 30 # sample size
rows3c <- 1000 # samples
x3c <- rexp(n3c*rows3c, lambda3)
head(x3c, 10) # look at first 10 sampled values
## [1] 6.562052 1.113805 6.442182 1.631636 29.633299 1.374250 5.254752
## [8] 10.192977 8.633720 6.884347
Now we can generate the vector of means by reshaping the current vector into a matrix/dataframe of 1000 rows and 30 columns. From that, a simple calculation gives us the vector of averages.
m3c <- matrix(x3c, rows3c)
x3c_mus <- apply(m3c, 1, mean) # 1 returns a vector of row means
head(x3c_mus)
## [1] 14.025403 11.414775 10.408267 8.742751 10.604882 9.522752
mu_x3c <- mean(x3c_mus)
plot3ci <- ggplot(data = data.frame(x = x3c_mus), aes(x)) +
stat_function(fun = dexp,
geom = "area",
fill = "red",
n = n3,
args = list(rate = 1/mu_x3c)) +
stat_function(fun = dexp,
geom = "area",
fill = "steelblue",
alpha= 0.6,
n = n3,
args = list(rate = lambda3)) +
labs(x = "Values",
y = 'Frequency',
title = "Histogram of 1000 size 30 sample means vs ideal distribution") +
xlim(0, 75)
plot3ci
In this case the differences are even less noticeable across the whole curve; only at the right extreme is the blue hue barely visible. So the fitted distribution of the sample means matches the ideal distribution better the larger the samples are.
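As a supplementary hedged check (not required by the question), the CLT predicts that the size-30 sample means are approximately normal with mean 10 and standard deviation 10/sqrt(30); overlaying that normal density on a histogram of x3c_mus makes this visible directly:
# Histogram of the sample means with the CLT normal approximation overlaid.
ggplot(data.frame(x = x3c_mus), aes(x)) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
color = 'white', fill = 'steelblue') +
stat_function(fun = dnorm,
args = list(mean = mu3, sd = mu3 / sqrt(n3c)),
color = 'red') +
labs(x = 'Sample mean',
y = 'Density',
title = 'Size-30 sample means vs CLT normal approximation')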
A dairy processor is considering whether to change the design of its 2-litre containers of milk, and so they engage an advertising agency to develop a mock-up of what the new packaging will look like. When the agency completed their mock-up of the new design, they decided to conduct a small market research study with 20 randomly selected participants. To begin with, each participant was shown the current design and asked to rate its attractiveness. Then they were shown the mock-up of the new design, and asked to rate its attractiveness. Attractiveness was measured on an 11-point scale, ranging from 0 to 10, where 0 means a design “is not at all attractive” and 10 means it “is very attractive”. The study resulted in the current and new designs averaging 7.5 and 8.2 respectively, with the standard deviation of the difference being 1.9. When senior management was shown the new design, and told the results of the research study, they were impressed - especially with the potential improvement in the attractiveness score that the new design would bring. Nonetheless, they turn to you as their chief data scientist to help them make sense of these findings.
n4 <- 20
mu4_old <- 7.5
mu4_new <- 8.2
sd4 <- 1.9
Given that the sample size is small (n = 20) and that the samples are dependent, as the same participants rated both containers, we are going to use a paired test. This is compliant with having a single standard deviation, here that of the differences; we have to assume that the differences are normally distributed, something we would ideally test if we had more, or all, of the data rather than just the averages.
First let’s state the statistical hypotheses given the above information. \(\mu_{\Delta}\) denotes the mean difference of the container ratings (current minus new).
\[H_{0} : \mu_{\Delta} = 0\] The null hypothesis is that the mean difference is 0; at the given significance level there is then no clear advantage in using the newer design.
\[H_{A} : \mu_{\Delta} < 0\] The alternative hypothesis is that the mean difference (current minus new) is negative, i.e. the newer design rates higher on average; since only an improvement in the rating is of interest, we conduct a one-tailed test.
Given that we only have the summary values, using t.test() is not possible unless we create a synthetic data set, but that is not required in this case.
We can substitute the numbers into the equation given in the module notes for n ≤ 30 and paired sample mean tests, shown below:
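\[ t = \frac{\bar{x}_{\Delta}}{s_{\Delta} / \sqrt{n}} \] where \(\bar{x}_{\Delta}\) is the mean of the paired differences and \(s_{\Delta}\) their standard deviation.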
t_stat4 <- (mu4_old - mu4_new) / (sd4 / sqrt(n4))
#cat(round(t_stat4,3))
cat(paste0('t-test value for mean ratings difference: ', round(t_stat4,3)))
## t-test value for mean ratings difference: -1.648
p_value4 <- pt(abs(t_stat4),n4-1, lower.tail = FALSE)
#cat(round(p_value4,3))
cat(paste0('p-value for mean ratings difference: ', round(p_value4,3)))
## p-value for mean ratings difference: 0.058
Since the p-value (0.058) > significance level (0.05), we fail to reject the null hypothesis of equal means and conclude that there is no significant difference between the average container ratings, even though the raw results seem to show an improvement in the rating.
Given the amount of data available to us, there is no sound way to extrapolate or to create synthetic data sets with the support of the CLT; it would therefore be easier either to repeat the study or to obtain all the data rather than just the summarised form.
Given that the p-value is so close to the significance level, and given the production, marketing and development costs that a container change entails, it would be of interest to conduct more in-depth research before making a decision.
Ideally, we would like to get the individual ratings instead of the averages, as we could then verify whether some of the assumptions made prior to hypothesis testing hold; for example, whether the differences are normally distributed, using the three checks we have for that hypothesis: histograms and correlation, Q-Q plots, or the Shapiro-Wilk normality test.
On the subject of individual ratings, we could use independent groups to rate the designs, and also increase the group sizes. Perhaps a combination of both: provide two groups of 40 people with opposite designs (group A rates the new design, group B the old one), which guarantees approximate sample normality through the CLT and yields independent data; then let each group rate the other design (group A the old, group B the new) for the dependent values. In this way we could run both independent and dependent tests, check whether their results differ, and look for an emotional response that appears once participants can compare the designs, something that cannot happen with single independent test groups.
By designing our sampling to take advantage of the options the CLT offers, we could improve the organisation's decision-making capabilities, and also apply similar designs to other products or services.
First naming the data:
mu5_1 <- 50
mu5_2 <- 55
sigma5_1 <- 10
sigma5_2 <- sigma5_1
rho5 <- 0.8
Now, let’s create the 10 pairs of observations from a bivariate normal distribution based on the given parameters, using the mvrnorm() function and inputting the means and the covariance matrix implied by σ and ρ. After some research regarding how to input both σ and ρ into the function, we used references [2] and [3] for the Sigma matrix.
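Since \(\Sigma\) carries \(\sigma^2\) on the diagonal and \(\rho\,\sigma_1\sigma_2\) off it, a hedged alternative is to build the matrix from the named parameters instead of hard-coding the numbers:
# Covariance matrix from sigma and rho:
# off-diagonal = rho * sigma1 * sigma2 = 0.8 * 10 * 10 = 80.
Sigma5 <- matrix(c(sigma5_1^2, rho5 * sigma5_1 * sigma5_2,
rho5 * sigma5_1 * sigma5_2, sigma5_2^2),
nrow = 2)
This Sigma5 could then be passed to mvrnorm() as the Sigma argument.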
set.seed(seed)
n5a <- 10
data5a <- mvrnorm(n = n5a, # Create random data
mu = c(50, 55),
Sigma = matrix(c(100, 80, #with variance sigma^2 = 100
80, 100),
nrow = 2))
df5a <- data.frame(data5a)
head(df5a,2)
First we need to calculate the summary statistics for both.
round(colMeans(data5a),3) # produces sample means
## [1] 47.072 53.928
round(var(data5a),4) # produces sample variance-covariance matrix
## [,1] [,2]
## [1,] 111.5398 91.8049
## [2,] 91.8049 128.0900
round(cor(data5a),4) # produces sample correlation matrix
## [,1] [,2]
## [1,] 1.0000 0.7681
## [2,] 0.7681 1.0000
From this, we can gather that the variables have some level of correlation, with the correlation coefficient just below 0.8, which means they are strongly correlated.
If we check individually for each column, first for X1.
stats_X1a <- summarise(df5a,
mean = round(mean(X1, na.rm = TRUE), 3),
sd = round(sd(X1, na.rm = TRUE), 4),
n = sum(!is.na(X1)))
stats_X1a
and then for X2.
stats_X2a <- summarise(df5a,
mean = round(mean(X2, na.rm = TRUE), 3),
sd = round(sd(X2, na.rm = TRUE), 4),
n = sum(!is.na(X2)))
stats_X2a
We can also observe that the number of observations is below 30 and that the sample standard deviations differ, thus we will conduct a t-test for small dependent samples.
One last check, before testing, is to confirm that the samples have normal distributions, which, to keep the report short, we will do with the Shapiro-Wilk test.
The null hypothesis of this test is that the data are normally distributed; the alternative is that the data follow some other distribution. If the p-value < 0.05 we reject normality; otherwise we do not.
p5ai_val_1 <- shapiro.test(df5a$X1)
p5ai_val_2 <- shapiro.test(df5a$X2)
# print the p-values
cat(paste('p-value for testing normality (X1) = ', p5ai_val_1$p.value, " and ", 'p-value for testing normality (X2) = ', p5ai_val_2$p.value))
## p-value for testing normality (X1) = 0.701038413436943 and p-value for testing normality (X2) = 0.0862340087566203
Both p-values > 0.05, so we do not reject the normality assumption.
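As a hedged visual cross-check, and in the spirit of the Q-Q plots mentioned in the previous problem, base R can complement the Shapiro-Wilk test:
# Q-Q plots for a visual normality check of both columns.
par(mfrow = c(1, 2))
qqnorm(df5a$X1, main = 'Q-Q plot: X1'); qqline(df5a$X1)
qqnorm(df5a$X2, main = 'Q-Q plot: X2'); qqline(df5a$X2)
par(mfrow = c(1, 1))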
First let’s state the statistical hypotheses given the above information. \(\mu_{\Delta}\) denotes the mean difference of the paired observations.
\[H_{0} : \mu_{\Delta} = 0\] The null hypothesis is that the mean difference is 0: there is no mean difference between the samples.
\[H_{A} : \mu_{\Delta} \ne 0\] The alternative hypothesis is that there is a difference between the samples, and thus we can expect different outcomes from them; in this case, a higher average for X2.
Knowing that we meet the assumptions of dependent samples and normality, we can perform a paired two-sample t-test.
t.test(df5a$X1, df5a$X2, paired = TRUE)
##
## Paired t-test
##
## data: df5a$X1 and df5a$X2
## t = -2.8965, df = 9, p-value = 0.0177
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.209853 -1.501463
## sample estimates:
## mean of the differences
## -6.855658
And since the p-value (0.0177) < significance level (0.05), we can reject the hypothesis of equal means and conclude that there is a significant difference between the sample means, with X2 providing larger values than X1 (the mean of the differences X1 - X2 is negative).
Given that we were able to test, or confirm, all of the assumptions necessary to run the test, there is no reason to think that the conclusion reached is incorrect.
Using the same method as in part a, we are going to create 30 pairs of observations from a bivariate normal distribution.
set.seed(seed)
n5b <- 30
data5b <- mvrnorm(n = n5b, # Create random data
mu = c(50, 55),
Sigma = matrix(c(100, 80,
80, 100),
nrow = 2))
df5b <- data.frame(data5b)
head(df5b,2)
First we need to calculate the summary statistics for both.
round(colMeans(data5b),3) # produces sample means
## [1] 51.101 55.156
round(var(data5b),4) # produces sample variance-covariance matrix
## [,1] [,2]
## [1,] 146.3607 120.7054
## [2,] 120.7054 140.4260
round(cor(data5b),4) # produces sample correlation matrix
## [,1] [,2]
## [1,] 1.000 0.842
## [2,] 0.842 1.000
From this, we can gather that the variables have a strong correlation, with the correlation coefficient just above 0.8, which means there is a dependency between the variables.
If we check individually for each column, first for X1.
stats_X1b <- summarise(df5b,
mean = round(mean(X1, na.rm = TRUE), 3),
sd = round(sd(X1, na.rm = TRUE), 4),
n = sum(!is.na(X1)))
stats_X1b
and then for X2.
stats_X2b <- summarise(df5b,
mean = round(mean(X2, na.rm = TRUE), 3),
sd = round(sd(X2, na.rm = TRUE), 4),
n = sum(!is.na(X2)))
stats_X2b
Since the population standard deviations are known and the sample sizes, at 30 values per group, are large enough, a z-test can be conducted.
To perform a z-test we will need to calculate the z-statistic for the sample mean difference.
Assume the null hypothesis is true, meaning there is no difference in population means: H0 : μ1 - μ2 = 0 or, equivalently, H0 : μD = 0.
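For dependent (paired) samples with known population parameters, the statistic uses the standard deviation of the paired differences: \[ z = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sigma_{\Delta} / \sqrt{n}}, \qquad \sigma_{\Delta} = \sqrt{\sigma_{1}^{2} + \sigma_{2}^{2} - 2\rho\,\sigma_{1}\sigma_{2}}. \]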
# Standard deviation of the paired differences from the known parameters:
# sigma_D = sqrt(100 + 100 - 2*0.8*10*10) = sqrt(40)
sigma_D5b <- sqrt(sigma5_1^2 + sigma5_2^2 - 2*rho5*sigma5_1*sigma5_2)
z_stat5b <- (stats_X1b$mean - stats_X2b$mean) / (sigma_D5b / sqrt(n5b))
cat(paste('z-value for two large dependent samples = ', round(z_stat5b,3)))
## z-value for two large dependent samples = -3.512
and now the p value:
p_value5b <- 2*pnorm(abs(z_stat5b),0,1, lower.tail = FALSE)
#round(p_value5b,3)
cat(paste('p-value for two large dependent samples = ', round(p_value5b, 4)))
## p-value for two large dependent samples = 4e-04
The p-value (0.0004) < significance level (0.05), so we reject the null hypothesis that the population means are the same.
We can run the same t-test as for the n = 10 samples to double-check the result.
t.test(df5b$X1, df5b$X2, paired = TRUE)
##
## Paired t-test
##
## data: df5b$X1 and df5b$X2
## t = -3.2971, df = 29, p-value = 0.002586
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.570268 -1.539619
## sample estimates:
## mean of the differences
## -4.054943
A similarly small p-value of (0.002586) < significance level (0.05) supports rejecting the hypothesis of equal means and concluding that X2’s mean score is significantly greater than X1’s.
As above, provided the assumptions on which this test is based hold, there is no reason to doubt the conclusion reached.
Given that the assumptions for both parts are very similar, and that the z and t statistics are calculated in a very similar manner, it is interesting to note that the corresponding p-values show stronger evidence against the null hypothesis for the larger sample size.
Everything said, the conclusions are the same: the variable with the larger mean is significantly greater than the other for the given initial data. Most importantly, the sample size does not change the outcome when testing normally distributed, dependent samples.
A supermarket chain decides to conduct an experiment to compare the sales effectiveness of three promotional scenarios for its home brand chocolate blocks: 1. Include home brand chocolate blocks in its weekly catalogue of specials 2. Promote home brand chocolate blocks with an end-of-aisle display 3. Include home brand chocolate blocks in its weekly catalogue of specials, and at the same time promote the chocolate blocks with an end-of-aisle display.
Eighteen comparable stores in terms of sales and store size are selected for running this experiment over a given month. Each scenario is randomly allocated to six stores.
As indicated in the discussion group regarding the amount of data needed: “There are 3 scenarios and 18 stores. 6 stores are allocated to each scenario. For each store we generate monthly sales data over 6 months. In total, the amount of data generated is 3x6x6 = 108 observations.” (David Stewart). We follow this, even though “…over a given month” seems to indicate only 18 observations.
• All sales are normally distributed with a standard deviation of 10
sd6a <- 10
n6 <- 36 # number of simulated values - 6 stores in 6 months
• Scenario 1 sales have a mean of 50
Generating normally distributed data with the values provided.
mu_6scn1 <- 50
set.seed(seed)
scn1a <- rnorm(n6, mu_6scn1, sd6a)
head(scn1a, 2)
## [1] 59.46694 59.23433
• Scenario 2 sales have a mean of 60
mu_6scn2 <- 60
set.seed(seed)
scn2a <- rnorm(n6, mu_6scn2, sd6a)
head(scn2a, 2)
## [1] 69.46694 69.23433
• Scenario 3 sales have a mean of 70.
mu_6scn3 <- 70
set.seed(seed)
scn3a <- rnorm(n6, mu_6scn3, sd6a)
head(scn3a, 2)
## [1] 79.46694 79.23433
If we put all of them together in a single dataframe.
m_salesa <- c(scn1a, scn2a, scn3a)
scenario <- as.factor(c(rep("Scenario1", n6),
rep("Scenario2", n6),
rep("Scenario3", n6)))
df6a <- data.frame(m_salesa, scenario)
head(df6a, 2)
Given that we have three independent samples, we are going to check the requirements for the ANOVA test, which will simplify both the calculations and the comparisons. If any of the conditions below fails, we will need a different approach.
1. First we check the independence of the observations: given that the values represent the monthly sales figures of six different stores under each of the three scenarios, independence is provided by the design, captured by the scenario factor.
2. Then we look at the normality of the distributions, to be tested for each group with the Shapiro-Wilk test, as was done in the previous problem. The null hypothesis says the data are drawn from a normal distribution, versus the alternative hypothesis that they come from some other type of distribution.
shapiro.test(df6a$m_salesa[scenario == "Scenario1"])
##
## Shapiro-Wilk normality test
##
## data: df6a$m_salesa[scenario == "Scenario1"]
## W = 0.97755, p-value = 0.6623
shapiro.test(df6a$m_salesa[scenario == "Scenario2"])
##
## Shapiro-Wilk normality test
##
## data: df6a$m_salesa[scenario == "Scenario2"]
## W = 0.97755, p-value = 0.6623
shapiro.test(df6a$m_salesa[scenario == "Scenario3"])
##
## Shapiro-Wilk normality test
##
## data: df6a$m_salesa[scenario == "Scenario3"]
## W = 0.97755, p-value = 0.6623
Given that the p-value is greater than 0.05 in all cases, there is insufficient evidence to conclude that the data are drawn from anything other than a normal distribution. (The three tests return identical statistics because each scenario was generated from the same seed, so the groups are shifted copies of the same draws, and the Shapiro-Wilk statistic is location invariant.)
3. Finally, we check the homogeneity of variance with Levene’s test. Here, the null hypothesis (H0) is that the variance among the groups is equal, and the alternative hypothesis (HA) is the contrary.
leveneTest(m_salesa ~ scenario, data = df6a, center = mean)
The p-value of the test is 1, which is greater than our significance level of 0.05, so we cannot reject the null hypothesis that the variance among the three groups is equal (again a consequence of the groups being shifted copies of the same draws, which have identical sample variances).
After these conditions are met, we are ready to apply the ANOVA test to the data.
Running ANOVA, using the function aov().
anova6a <- aov(m_salesa ~ scenario, data = df6a)
summary(anova6a)
## Df Sum Sq Mean Sq F value Pr(>F)
## scenario 2 7200 3600 25.32 1.06e-09 ***
## Residuals 105 14928 142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output we can see that the scenario effect is statistically significant at the 0.05 significance level, with a p-value of 1.06e-09.
In other words, there is a statistically significant difference between the monthly sales that results from the three different scenarios.
• Scenario 1 sales have a mean of 50
Generating normally distributed data with the values provided.
sd6b <- 5
set.seed(seed)
scn1b <- rnorm(n6, mu_6scn1, sd6b)
head(scn1b, 2)
## [1] 54.73347 54.61716
• Scenario 2 sales have a mean of 60
set.seed(seed)
scn2b <- rnorm(n6, mu_6scn2, sd6b)
head(scn2b, 2)
## [1] 64.73347 64.61716
• Scenario 3 sales have a mean of 70.
set.seed(seed)
scn3b <- rnorm(n6, mu_6scn3, sd6b)
head(scn3b, 2)
## [1] 74.73347 74.61716
If we put all of them together in a single dataframe.
m_salesb <- c(scn1b, scn2b, scn3b)
df6b <- data.frame(m_salesb, scenario)
head(df6b, 2)
Making sure the normality assumption holds for the new data with the Shapiro-Wilk test: the null hypothesis is that the data are drawn from a normal distribution, versus the alternative hypothesis that they are drawn from some other (unspecified) distribution.
shapiro.test(df6b$m_salesb[scenario == "Scenario1"])
##
## Shapiro-Wilk normality test
##
## data: df6b$m_salesb[scenario == "Scenario1"]
## W = 0.97755, p-value = 0.6623
shapiro.test(df6b$m_salesb[scenario == "Scenario2"])
##
## Shapiro-Wilk normality test
##
## data: df6b$m_salesb[scenario == "Scenario2"]
## W = 0.97755, p-value = 0.6623
shapiro.test(df6b$m_salesb[scenario == "Scenario3"])
##
## Shapiro-Wilk normality test
##
## data: df6b$m_salesb[scenario == "Scenario3"]
## W = 0.97755, p-value = 0.6623
All p-values are greater than 0.05, so there is insufficient evidence to conclude that the data are drawn from anything other than a normal distribution.
Homogeneity of variance with Levene’s test.
leveneTest(m_salesb ~ scenario, data = df6b, center = mean)
The p-value of the test is 1, which is greater than our significance level of 0.05, so we cannot reject the null hypothesis of equal variances among the three groups.
Running ANOVA for this different standard deviation.
anova6b <- aov(m_salesb ~ scenario, data = df6b)
summary(anova6b)
## Df Sum Sq Mean Sq F value Pr(>F)
## scenario 2 7200 3600 101.3 <2e-16 ***
## Residuals 105 3732 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At the 0.05 significance level, the p-value (< 2e-16) shows an even greater statistical difference between the monthly sales resulting from the three scenarios than with the larger standard deviation.
We can conclude that the campaigns work very differently, though in order to know which one is better we need to compare the scenarios pairwise, for example with a post-hoc test, sketched below.
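One standard option is Tukey's HSD on the fitted ANOVA object; a minimal hedged sketch, applied here to the part b model:
# Pairwise scenario comparisons with family-wise error control;
# the output indicates which scenario means differ significantly.
TukeyHSD(anova6b)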
Both standard deviations give us really small p-values, though the orders of magnitude indicate that the larger the deviation, the more statistical overlap there is between the scenarios' monthly sales, which is to be expected, as the values have more spread.
[1] Unknown (2022). ‘R rbinom – Simulate Binomial or Bernoulli trials’. ProgrammingR. Viewed June 2, 2022: https://www.programmingr.com/examples/neat-tricks/sample-r-function/r-rbinom/
[2] Schork, J. (n.d.). ‘Generate Multivariate Random Data in R (2 Examples)’. Statistics Globe. Viewed June 4, 2022: https://statisticsglobe.com/generate-multivariate-random-data-r
[3] Rizopoulos, D. (2004). ‘Simulating from a Multivariate Normal Distribution Using a Correlation Matrix’. R-help mailing list. Viewed June 4, 2022: https://stat.ethz.ch/pipermail/r-help/2004-June/053386.html