Inferential Analysis On Data Simulation Highlighting The Toothgrowth Dataset In R

Analysis on data simulation.

We aim to conduct a simulation by generating samples of 40 independent random variables drawn from an exponential distribution with specified parameters. This process is repeated 1,000 times, resulting in 1,000 simulated datasets. For each simulation, we compute the sample mean, ultimately producing a distribution of 1,000 sample means.

The simulation setup commences with the loading of essential R packages to support efficient data handling and graphical representation. Subsequently, we specify the parameters of the exponential distribution—namely, the rate (lamda)—and establish the experimental conditions by defining the sample size per replication and the total number of simulation iterations.

Load packages

library(dplyr, warn.conflicts = F)
library(ggplot2)
library(ggthemes)
##Exponential function parameters
lambda <- 0.2
n <- 40
num.of.sim <- 1000
#set the seed
set.seed(119983)
##Create a 1000x40 matrix containing the results of the simulation
sim.distrib <- matrix(data=rexp(n * num.of.sim, lambda), nrow=num.of.sim)

Sample mean vs theoretical mean

To facilitate analysis, we calculate the mean of each of the 1,000 simulated observations and store the results in a data frame named sim_mns, which serves as a standard data structure in R and integrates seamlessly with the dplyr and ggplot2 packages. This enables streamlined computation and visualization workflows.

The objective is to evaluate how closely the empirical mean of our simulated exponential distribution approximates its theoretical counterpart, given bymu= 1⁄lamda = 5. After computing the sample means, we summarize the resulting data to obtain a single estimate of the overall simulated mean.

A histogram is then constructed to visualize the distribution of these sample means, with a vertical line indicating the calculated mean. This allows for direct visual comparison against theoretical expectations.

# Compute the mean for each of the 1000 simulations (rows)
sim_mns <- data.frame(means = apply(sim.distrib, 1, mean))
# Optional: convert data frame to tibble for cleaner printing (tbl_df is deprecated)
sim_mns <- tibble::as_tibble(sim_mns)
# Compute the mean of the simulated means
simulated.mean <- sim_mns %>%
  summarize(simulated.mean = mean(means)) %>%
  pull(simulated.mean)
# Plot sample means distribution with vertical line for calculated mean
sim_mns %>%
  ggplot(aes(x = means)) +
  geom_histogram(alpha = 0.4, binwidth = 0.25, fill = "orange", col = "white") +
  geom_vline(xintercept = simulated.mean, color = "black", linewidth = 0.5) +
  ggtitle("Distribution of Simulated Means")

The plot indicates that the distribution of the simulated sample means is centered near the empirical mean of 4.9824, represented by the black vertical line. This value is remarkably close to the theoretical mean of an exponential distribution,mu= 1 ⁄ lamda = 5, reinforcing the consistency between simulation results and theoretical expectations.

Sample Variance versus Theoretical Variance

##Compute the variance of the sample means
sd.samp <- sim_mns  %>% select(means)  %>% unlist() %>%  sd()
(var.samp <- sd.samp ^ 2)

## [1] 0.628253

#Theoretical variance of the exponential distribution
(((1/lambda))/sqrt(40))^2

## [1] 0.625

Normality of the Distribution

From the Central limit theorem we know that the distribution of averages of normalized variables becomes that of a standard normal distribution as the sample size increases. Here we want to normalize our sample means. To do this we need to transform each mean in our simulated datset according to the following formula:

Based on the CLT the result should be a normal distribution centered at(more or less) zero. To see if this is the case we create a plot comparing the density of our transformed sample means distribution with the density of the standard normal distribution.

##Compute mean of our normalized means
(z_mean <- sim_mns %>%  mutate(z_score = (means - 1/lambda) / (1/lambda / sqrt(n)))  %>% select(z_score) %>%
   summarise(z_mean = mean(z_score)) %>% unlist())

##      z_mean 
## -0.02230626

##create Z scores from the sample means
sim_mns %>%  mutate(z_score = (means - 1/lambda) / (1/lambda / sqrt(n))) %>% 
  ggplot(aes(x = z_score)) + 
   geom_histogram(alpha=0.1,  binwidth = 0.3, fill="brown", color="black", aes(y = after_stat(density))) +
   stat_function(fun = dnorm, linewidth = 1.3) +
   geom_vline(xintercept = z_mean, color="purple", linewidth = 0.5) +
   ggtitle("Distribution of standardized simulated means") +
   xlab("z-scores")

The plot reveals that the standardized distribution of sample means closely aligns with the standard normal distribution, as evidenced by its similarity to the overlaid density function represented by the black bell-shaped curve. Furthermore, the computed mean of -0.0223 approximates zero, consistent with the Central Limit Theorem, which posits that the distribution of sample means approaches normality with a mean equal to the population mean as sample size increases.

Inferential Data Analysis On The Toothgrowth dataset in R

This is study about the effect of Vitamin C on Tooth Growth in Guinea Pigs. The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, (orange juice or ascorbic acid (a form of vitamin C and coded as VC). Format, given in the documentation of the dataset, is as follows and can be seen after loading the data. A data frame with 60 observations on 3 variables.

Exploratory data analysis

*Loading the required packages and the dataset.

#Load data and convert to tbl format
ToothGrowth <- tibble::as_tibble(ToothGrowth)
ToothGrowth %>% select(dose) %>% unique()

## # A tibble: 3 × 1
##    dose
##   <dbl>
## 1   0.5
## 2   1  
## 3   2

ToothGrowth <- ToothGrowth %>% mutate(dose = as.factor(dose))

Plots

This multipanel plot emphasizes the relationship between teeth length and dose level for each supplement type. It appears to be a positive relationship for both supplement types. In other words, as the amount of supplement increases, so does teeth length.

ToothGrowth %>%
ggplot(aes(x = supp, y = len)) +
geom_boxplot(aes(fill = supp)) +
facet_wrap(~ dose) +
scale_fill_brewer(palette = "Set1") +
theme_bw() +
ggtitle("Teeth Length vs Supplement type \nby Dose level ") +
labs(x="supplement type", y= "teeth length ") +
guides(fill=guide_legend(title="Supplement type"))

ToothGrowth  %>% filter(dose == 2)  %>% group_by(supp)   %>%  summarise(avg.length = mean(len))

## # A tibble: 2 × 2
##   supp  avg.length
##   <fct>      <dbl>
## 1 OJ          26.1
## 2 VC          26.1

Hypothesis Tests

Now we want to further compare teeth growth by supplement type and dose levels. This time we’ll use statistical tests, t-test. As seen before, in our dataset we have two levels for supp: OJ and VC and three levels for dose: 0.5, 1, 2. Thus we’ll have to run one hypothesis test for factor supp and one for each possible pair of the 3 levels in the factor dose, that is, we will run a total of 4 tests. We start by

Testing by dose levels

#Exract the len and dose vectors from the df ToothGrowth
len_a <- ToothGrowth %>% filter(dose %in% c(0.5,1)) %>% select(len) %>% unlist()
dose_a <- ToothGrowth %>% filter(dose %in% c(0.5,1)) %>% select(dose) %>% unlist()
#Test
(Test.a <- t.test(len_a~dose_a))

## 
##  Welch Two Sample t-test
## 
## data:  len_a by dose_a
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means between group 0.5 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

#Exract the len and dose vectors from the df ToothGrowth
len_b <- ToothGrowth %>% filter(dose %in% c(0.5,2)) %>% select(len) %>% unlist()
dose_b <- ToothGrowth %>% filter(dose %in% c(0.5, 2)) %>% select(dose) %>% unlist()
#Test
(Test.b <- t.test(len_b~dose_b))

## 
##  Welch Two Sample t-test
## 
## data:  len_b by dose_b
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means between group 0.5 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean in group 0.5   mean in group 2 
##            10.605            26.100

#Exract the len and dose vectors from the df ToothGrowth
len_c <- ToothGrowth %>% filter(dose %in% c(1,2)) %>% select(len) %>% unlist()
dose_c <- ToothGrowth %>% filter(dose %in% c(1,2)) %>% select(dose) %>% unlist()
#Test c
(Test.c <- t.test(len_c~dose_c))

## 
##  Welch Two Sample t-test
## 
## data:  len_c by dose_c
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

We went through all possible combinations of levels from the factor variable dose and in all cases the p-value is lower than the default signficance level 0.05. Thus, we reject Ho. In other words there appears to be a positive relationship between dose level and teeth length

Testing by Supplement

#Exract the len and supp vectors from the df ToothGrowth
len <- ToothGrowth %>% select(len) %>% unlist()
supp <- ToothGrowth %>% select(supp) %>% unlist()
#Test
t.test(len~supp)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

We can see that the p-value of the test is 0.06. Since the p-value is greater than 0.05 and the confidence interval of the test contains zero, we can reject the null hypothesis and say that supplement types don’t seem to have any impact on teeth growth. In other words, there’s no significant statistical difference between them

Conclusions

In the previous section of this report we drew some conclusions from our tests. However, before using any statistical test we should always make sure that some conditions are met. In our case, t-tests, we should check for: * Independence: there must be random sampling/assignment * Normality: the population distribution must be normal or quasi-normal Assuming all the previous conditions are met we can now restate our conclusions.

It appears that there is a statistically significant difference between teeth length and dose levels across both delivery methods, in other words, as the dose increases so does teeth length.

On the other hand, there doesn’t seem to be a statistically significant difference between delivery methods, with Orange juice apparently more effective at dose levels 0.5 and 1, and VC slightly more effective at dose level 2