MATH2406 Applied Analytics

Setup

Install and load the packages you need to produce the report here:

library(dplyr) # Useful for data manipulation
library(ggplot2) # Useful for building data visualisations
library(knitr) # Useful for creating nice tables
library(tidyverse)

set.seed(3975239)

Instructions

Task 1

There are two groups, A and B, who receive and do not receive a flu jab, respectively. We will first randomly generate a table (grouped_df) which returns a boolean value for A and B given that they contract the flu given that the flu has a contagion rate of 30%, which is lessened to 10% with the flu jab.

# Set seed for reproducibility
set.seed(3975239)

# Set parameters for Binomial distribution
n_group_a <- 100
n_group_b <- 100
p_a <- 0.1
p_b <- 0.3

# Generate synthetic data
group_a <- rbinom(1, n_group_a, p_a)
group_b <- rbinom(1, n_group_b, p_b)

# Create data frame
df <- data.frame(group = c(rep("A", n_group_a), rep("B", n_group_b)), contracted_flu = c(rep(1, group_a), rep(0, n_group_a - group_a), rep(1, group_b), rep(0, n_group_b - group_b)))

# Tabulate the results by group and contracted_flu
grouped_df <- df %>%  group_by(group, contracted_flu) %>%   summarize(count = n())

Explanation of the table

The two variables, “A” took the flu jab and “B” took the placebo jab.Both groups have 100 individuals, and with the seed set at 3975239, the random generator filled in the two conditions, “0” did not catch the flu and “1” caught the flu.This makes the column, “contracted_flu” boolean data.

Questions

i.

What is your estimated probability that a person receiving the placebo will not contract the flu during the winter season? The total amount of individuals in the “palcebo” condition (B) = 100. The estimated probability that someone receiving the placebo will NOT contract the flu during the winter season should be the probability of not catching the flu (66) over the total number of participants (100), multiplied by a random variable for the winter season (a greater chance of catching the flu due to the winter months). ( 60 / 100 )*k = (assuming k is negligible as there have been no clear indication of0 winter’s effects) Probability = 0.6 or 60%

ii.

Derive a 95% confidence interval for the proportion of people receiving the placebo who do not contract the flu in that winter season. Give a non-technical explanation of your result.

# Load the library for binomial confidence intervals
library(binom)

# Get the number of people in group B who did not contract the flu
no_flu_b <- grouped_df[grouped_df$group == "B" & grouped_df$contracted_flu == 0, "count"]

# Get the total number of people in group B
total_b <- sum(grouped_df[grouped_df$group == "B", "count"])

# Calculate the proportion of people in group B who did not contract the flu
prop_no_flu_b <- no_flu_b/total_b

# Derive the 95% confidence interval for the proportion using the binom.confint function
binom.confint(no_flu_b, total_b, conf.level = 0.95, methods = "wilson")

In laymans terms, we make a few assumptions. We must assume that the individuals are a true sample of the population and also a random distribution of individuals who: 1. took the placebo, and 2. did not contract the flu. We then took the majority of those people (95% of those individuals) to see what the upper and lower boundaries were for that group of people. We can then say that we are highly confident that this volume of people will not contract the flu. We can say that for people who took the placebo, between 50 and 69% will not contract the flu (or 31 and 50% WILL contract the flu). Essentially, we retrieved the people in the first and second standard deviations from the mean (the vast majority of individuals), and looked at the upper and lower margin of the second standard deviation. We found that 95% of the population sat between 0.502 and 0.690 (for this seed).

ii

if, 40% of the wider adult population receive the flu vaccine, what percentage would you anticipate contracting the flu that year?

In this scenario, 40% of the wider population are given the flu vaccine. That means that the population (count) of the whole population (100%) who have the flu vaccine (group A) only comes to 40%. This means that the other 60% of the population do not have the flu vaccine. The anticipated contraction rate will equal: 9% of group A, plus 40% of group B. So that’s 9 x 40%, plus 40 x 60%

(9*0.4) + (40*0.6)

## [1] 27.6

I would expect 27.6% of the whole population would contract the flu.

iv.

**What is your estimate of the probability that a person who is vaccinated against the flu for ten years running, will contract the flu on at least three of those years?*

We need to make an assumption that the flu vaccine is administered once per year, but that the risk of infection (waves of flu contraction) occur every year. This means that for every year, we must reapply the same probability generated in the first step. For every year, given group A (vaccinated), we will have a 9% (0.09) chance of infection. This is necessarily a CDF (cumulative distribution function). So that would be complete certainty, minus the probability that the jab is effective. Or, mathematically 1 minus the probability that the flu jab is effective (pbinom), for any of the 10 years.

1 - pbinom (2, 10, 0.09)

## [1] 0.05404002

I would expect that the probability of them contracting the flu would be 5.4% (given this seed).

Task 2

This question explores the distribution of the sample mean when the underlying variable has an exponential distribution with mean 10.

# Set seed for reproducibility
set.seed(3975239)

# Generate 1000 observations from exponential distribution with mean 10
x <- rexp(1000, rate = 1/10)

# a) i. Compare sample estimates with population mean and standard deviation
cat("Sample mean:", mean(x), "\n")

## Sample mean: 10.05274

cat("Population mean:", 1/10, "\n")

## Population mean: 0.1

cat("Sample standard deviation:", sd(x), "\n")

## Sample standard deviation: 9.843974

cat("Population standard deviation:", sqrt(1/(10^2)), "\n")

## Population standard deviation: 0.1

# a) ii. Compare histogram with exponential density function
hist(x, main = "Histogram of Observed Data", xlab = "Observed Data", col = "red")
curve(dexp(x, rate = 1/10), add = TRUE, col = "blue", lwd = 2)
legend("topright", c("Observed Data", "Exponential Density"), col = c("red", "blue"), lwd = 2)

Within this 1000 sample distribution, the bulk of the observations are found in the first standard deviation from the mean. This means that the bulk of the observations appear +/- 9.834974 from the mean of 0.1 (68% within the first ST, which is from 0 to 9.734974). As the scale increases, the observations are less frequent because they’re at the tail end of the exponential distribution.

# b) i. Generate 1000 sample means of sample size 2 and display histogram
x2 <- replicate(1000, mean(rexp(2, rate = 1/10)))
hist(x2, main = "Histogram of Sample Means (n = 2)", xlab = "Sample Means", col = "red")

# c) i. Generate 1000 sample means of sample size 30 and display histogram
x30 <- replicate(1000, mean(rexp(30, rate = 1/10)))
hist(x30, main = "Histogram of Sample Means (n = 30)", xlab = "Sample Means", col = "red")

This is a visual representation/reminder of the central limit theorum. As the sample sizes increase, the distribution approaches a normal distribution (a bell curve). The larger the sample size, the more normal we can expect the observations of means (no matter what the original distribution looks like). This is the central mathematical observation which we use to generate all manner of statistical tests.

Task 3

A dairy processor is considering whether to change the design of its 2-litre containers of milk, and so they engage an advertising agency to develop a mock-up of what the new packaging will look like. When the agency completed their mock-up of the new design, they decided to conduct a small market research study with 20 randomly selected participants. To begin with, each participant was shown the current design and asked to rate its attractiveness. Then they were shown the mock-up of the new design and asked to rate its attractiveness.

# a
mean1 <- 7.5
mean2 <- 8.2
s <- 1.9
n <- 20 # Assume equal sample size for both groups

d <- mean2 - mean1
se <- s / sqrt(n)
t <- d / se
p <- pt(t, n - 1, lower.tail = FALSE)

print(d)

## [1] 0.7

print(se)

## [1] 0.4248529

print(t)

## [1] 1.647629

print(p)

## [1] 0.05793374

The difference between the two groups can be seen by looking at the difference between the given means (mean1 is the average ‘attractiveness’ of the old design, mean2 is the average ‘attractiveness’ of the new design). We are asking the central question, are these means different from each other? And, if they are different, are they different enough to suggest that there is a large difference between the perception of ‘attractiveness’ of the two designs. We are looking to see if the means are significantly different. Specifically, is the new mean MORE than the old one. This is generated through a 1 tailed t-test (assuming that mean2 > mean1). We generate a p value (the value we’re using for our test) against an arbitrary significance value (0.05). This significance value means that the average attractiveness of the new design is more than 2 standard deviations away from that of the old design. We could set this significance value anywhere we wanted to, but some industries have standards for ‘significance’ (often at the 2SD from the mean).

if (p < 0.05) {
  cat("The result is statistically significant at the 5% level.\n")
} else {
  cat("The result is not statistically significant at the 5% level.\n")
}

## The result is not statistically significant at the 5% level.

My advice for management

For this data, there is NOT a significant difference between the responses of the two designs. Although there is a difference between the average scores on the two surveys (7.5 and 8.2), they are not different enough to suggest that the new design is worth pursuing (given a significance level > 5%).

if (p < 0.5) {
  cat("The result is statistically different given a preference between the two.\n")
} else {
  cat("The result is not statistically different given a preference between the two.\n")
}

## The result is statistically different given a preference between the two.

If we compare the two responses purely as a preference (assume a 50% null hypothesis), then we can safely say that individuals prefer the second design. However, this difference might not be worth the investment to change the design of the produce (not a highly significant difference). We can continue to do this and change the threshold ‘p’ value that we’re looking for. By doing this we are: 1. p-hacking (a statistically dubious thing to do) 2. guaranteed to get a result where we can get to our desired outcome Essentially, given that the investment to change the packaging needs to be built on a clear, significant margin between the two preferences, we should NOT go ahead with this new design, at this time. I will look at the 10%, and 1% difference below as well (essentially, the outcome of p-hacking).

if (p < 0.1) {
  cat("The result is statistically different given a preference between the two at a 10% rate.\n")
} else {
  cat("The result is not statistically different given a preference between the two at a 10% rate.\n")
}

## The result is statistically different given a preference between the two at a 10% rate.

if (p < 0.01) {
  cat("The result is statistically different given a preference between the two at a 1% rate.\n")
} else {
  cat("The result is not statistically different given a preference between the two at a 1% rate.\n")
}

## The result is not statistically different given a preference between the two at a 1% rate.

Task 4

Initially, I’m going to check the observations with plots so that I can get this idea into my mind (the distributions). Then I will attend to the questions systematically.

# a) Generating 10 pairs of observations
set.seed(3975239)
library(MASS)

# Generating bivariate normal distribution
bivariate_normal_10 <- mvrnorm(10, mu=c(50, 55), Sigma=matrix(c(100, 80, 80, 100), nrow=2))

# Plotting the observations
plot(bivariate_normal_10, xlab="Observation 1", ylab="Observation 2")

# Hypothesis: test of the difference between the two means
# We'll use a two-sample t-test, as we did above. We have two sets of observations and do not know the population
t.test(bivariate_normal_10[,1], bivariate_normal_10[,2], alternative="two.sided", var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  bivariate_normal_10[, 1] and bivariate_normal_10[, 2]
## t = -1.59, df = 18, p-value = 0.1292
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.746832   2.179747
## sample estimates:
## mean of x mean of y 
##  45.40880  52.19234

# b) Generating 30 pairs of observations
set.seed(3975239)
bivariate_normal_30 <- mvrnorm(30, mu=c(50, 55), Sigma=matrix(c(100, 80, 80, 100), nrow=2))

# Plotting the observations
plot(bivariate_normal_30, xlab="Observation 1", ylab="Observation 2")

# Hypothesis test of the difference between the two means
t.test(bivariate_normal_30[,1], bivariate_normal_30[,2], alternative="two.sided", var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  bivariate_normal_30[, 1] and bivariate_normal_30[, 2]
## t = -1.6029, df = 58, p-value = 0.1144
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.4808594  0.9383827
## sample estimates:
## mean of x mean of y 
##  49.99710  53.76833

And now, I’m going to answer the questions, a, b, c.

a

mu1 <- 50 mu2 <- 55 sigma1 <- 10 sigma2 <- 10 rho <- 0.8

Given this information, we have a very low number for observations (n=10). This rules out larger kinds of tests which need more observations.

i

An appropriate test for this kind of data would be either a bivariate (2 tailed) t-test or z-test. Because of the unknown number of observations, we cannot assume that the distribution will clearly match a normal distribution. The distribution assumes continuous data, so therefore a t or z test. A t-test assumes the number of the population is unknown, but a z-test assumes it’s known. In this case, n=10, so the data given matches the needs of a bivariate t-test.

library(MASS)

mu1 <- 50
mu2 <- 55
sigma1 <- 10
sigma2 <- 10
rho <- 0.8
n <- 10

# Generate correlated data
set.seed(3975239)

cov_matrix <- matrix(c(sigma1^2, sigma1 * sigma2 * rho, sigma1 * sigma2 * rho, sigma2^2), nrow = 2)
data <- mvrnorm(n, mu = c(mu1, mu2), Sigma = cov_matrix)

# Perform t-test
t.test(data[,1], data[,2], var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  data[, 1] and data[, 2]
## t = -1.59, df = 18, p-value = 0.1292
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.746832   2.179747
## sample estimates:
## mean of x mean of y 
##  45.40880  52.19234

ii

For a 5% confidence interval (p < 0.05) assuming n = 10 (a very low number, but passable for a two sample t-test) The p value for the bivariate t-test is 0.1292 which is greater than the confidence interval of 0.05. This means that the two samples cannot be significantly different, and we must come to the conclusion that (for this set of parameters) we are unable to reject the Null hypothesis. Therefore: these two samples are NOT significantly different However, to run this test, we need a value for n (n = 10). The biggest problem with the ability to derive a correct conclusion is the n value. having a low n value means we are unable to be confident that this data will follow a normal distribution. However, the t-test assumes this as it is testing the relative overlap between two normally distributed areas under the curve. This means that the above data is inherently false or misleading.

b

mu1 <- 50
mu2 <- 55
sigma1 <- 10
sigma2 <- 10
rho <- 0.8
n <- 30

# Generate correlated data
set.seed(3975239)

cov_matrix <- matrix(c(sigma1^2, sigma1 * sigma2 * rho, sigma1 * sigma2 * rho, sigma2^2), nrow = 2)
data <- mvrnorm(n, mu = c(mu1, mu2), Sigma = cov_matrix)

# Perform t-test
t.test(data[,1], data[,2], var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  data[, 1] and data[, 2]
## t = -1.6029, df = 58, p-value = 0.1144
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.4808594  0.9383827
## sample estimates:
## mean of x mean of y 
##  49.99710  53.76833

i

Like in the sample for (a) above, this number of observations is quite small (n=30). Although more robust than n=10, a low number of observations can give misleading conclusions because we are assuming a bivariate normal distribution. ### ii The p value of the new t-test is p = 0.1144. This is still above a threshold of p = 0.05, and therefore, we are unable to reject the null hypothesis. Therefore: these two samples are NOT significantly different. ### c My answers for part a and b are identical. No matter the volume of observations (n = 10 or 30), the threshold to reject the null hypothesis is not met (p > 0.05). Therefore: these two samples are NOT significantly different.

Task 5

A supermarket chain decides to conduct an experiment to compare the sales effectiveness of three promotional scenarios for its home brand chocolate blocks: 1. Include home brand chocolate blocks in its weekly catalogue of specials 2. Promote home brand chocolate blocks with an end of isle display 3. Include home brand chocolate blocks in its weekly catalogue of specials, and at the same time promote the chocolate blocks with an end of isle display. note: 3. is a combination of 1 and 2. ### a Generate random data for the observations of 6 stores for 6 months (n = 36)

note: these numbers are in 1000 units but for simplicity sake, we are using mean and SD of units/100

set.seed(3975239)
# Scenario 1: mean = 50, SD = 30, n = 6*6
data1 <- rnorm(6*6, mean = 50, sd = 30)

# Scenario 2: mean = 55, SD = 30, n = 6*6
data2 <- rnorm(6*6, mean = 55, sd = 30)

# Scenario 3: mean = 60, SD = 30, n = 6*6
data3 <- rnorm(6*6, mean = 60, sd = 30)

Given these random sales generated above ### i The most appropriate test for this kind of data is a one tailed T-test. We are attempting to discover if any of the normal distributions are significantly larger than the others. If they are larger then that scenario offers a larger sales/volume than the alternative. If there are no observable differences between the scenarios, then the p values will not be p > 0.05 (a measure imposed by the task/store/merchant/client). If there are no observable differences, then the company should consider another tactic (possibly a group of experiments rather than single experiments).

# Scenario 1: mean = 50, SD = 30, n = 6*6
t.test(data1, mu = 50)

## 
##  One Sample t-test
## 
## data:  data1
## t = -0.7647, df = 35, p-value = 0.4496
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  37.73991 55.55104
## sample estimates:
## mean of x 
##  46.64547

# Scenario 2: mean = 55, SD = 30, n = 6*6
t.test(data2, mu = 55)

## 
##  One Sample t-test
## 
## data:  data2
## t = -0.31709, df = 35, p-value = 0.7531
## alternative hypothesis: true mean is not equal to 55
## 95 percent confidence interval:
##  43.77160 63.19462
## sample estimates:
## mean of x 
##  53.48311

# Scenario 3: mean = 60, SD = 30, n = 6*6
t.test(data3, mu = 60)

## 
##  One Sample t-test
## 
## data:  data3
## t = -0.75153, df = 35, p-value = 0.4574
## alternative hypothesis: true mean is not equal to 60
## 95 percent confidence interval:
##  44.92622 66.92866
## sample estimates:
## mean of x 
##  55.92744

ii

For this set of experiments, there is no significant difference between them, because none of them meet or exceed the p = 0.05 threshold.

b

Adjusting for a SD = 25

set.seed(3975239)
# Scenario 1: mean = 50, SD = 30, n = 6*6
data1_2 <- rnorm(6*6, mean = 50, sd = 25)

# Scenario 2: mean = 55, SD = 30, n = 6*6
data2_2 <- rnorm(6*6, mean = 55, sd = 25)

# Scenario 3: mean = 60, SD = 30, n = 6*6
data3_2 <- rnorm(6*6, mean = 60, sd = 25)

# Scenario 1: mean = 50, SD = 25, n = 6*6
t.test(data1_2, mu = 50)

## 
##  One Sample t-test
## 
## data:  data1_2
## t = -0.7647, df = 35, p-value = 0.4496
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  39.78326 54.62587
## sample estimates:
## mean of x 
##  47.20456

# Scenario 2: mean = 55, SD = 25, n = 6*6
t.test(data2_2, mu = 55)

## 
##  One Sample t-test
## 
## data:  data2_2
## t = -0.31709, df = 35, p-value = 0.7531
## alternative hypothesis: true mean is not equal to 55
## 95 percent confidence interval:
##  45.64300 61.82885
## sample estimates:
## mean of x 
##  53.73592

# Scenario 3: mean = 60, SD = 25, n = 6*6
t.test(data3_2, mu = 60)

## 
##  One Sample t-test
## 
## data:  data3_2
## t = -0.75153, df = 35, p-value = 0.4574
## alternative hypothesis: true mean is not equal to 60
## 95 percent confidence interval:
##  47.43852 65.77388
## sample estimates:
## mean of x 
##   56.6062

b

Like scenario a, there is no significant difference between these samples, given a p = 0.05 threshold. ### c We can conclude that each of these measures are not significantly different to any of the other measures. Note: we have not compared these to a baseline set of sales (should be done to see if these strategies work relative to what happens regularly). The combined strategy (catalogue + end display) is not significantly different from the others independently.