This project consists of two parts:
The first part investigates the exponential distribution through simulation in R and compares the results with the Central Limit Theorem. The exponential distribution is simulated with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda, and its standard deviation is also 1/lambda. All simulations use lambda = 0.2 and examine the distribution of averages of 40 exponentials.
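As a quick sanity check of these two properties, the sketch below draws one large sample with `rexp` and compares its empirical mean and standard deviation with 1/lambda. The seed and the sample size of 100,000 are arbitrary choices made for this illustration, not settings from the project.

```r
# Illustrative check: seed and sample size are assumptions, not project settings
set.seed(1)
lambda <- 0.2
x <- rexp(100000, rate = lambda)

emp_mean <- mean(x)  # should be close to 1/lambda = 5
emp_sd   <- sd(x)    # should also be close to 1/lambda = 5
c(mean = emp_mean, sd = emp_sd, theoretical = 1 / lambda)
```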
hist(runif(1000))
mns = NULL
for (i in 1 : 1000) mns = c(mns, mean(runif(40)))
hist(mns)
As shown above, the distribution of the averages (right) looks far more Gaussian than the original uniform distribution (left). This is a consequence of the Central Limit Theorem, which states that the distribution of averages of independent, identically distributed draws with finite variance becomes approximately normal as the sample size grows, even if the sampled data are not normally distributed.
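One way to make the theorem concrete is to standardize the averages as sqrt(n) * (mean - mu) / sigma and check that the result has mean near 0 and standard deviation near 1. The sketch below does this for averages of uniform draws. The helper name `standardized_means`, the seed, the sample sizes and the 2000 replicates are all assumptions made for this example.

```r
# Illustrative sketch: seed, sample sizes and replicate count are assumptions
set.seed(2)
mu    <- 0.5           # mean of Uniform(0, 1)
sigma <- sqrt(1 / 12)  # standard deviation of Uniform(0, 1)

# Hypothetical helper: standardized averages of n uniform draws
standardized_means <- function(n, nosim = 2000) {
  draws <- matrix(runif(n * nosim), nrow = nosim)  # one sample per row
  sqrt(n) * (rowMeans(draws) - mu) / sigma
}

z5  <- standardized_means(5)
z40 <- standardized_means(40)

# Both sets should have mean near 0 and standard deviation near 1
c(mean_5 = mean(z5), sd_5 = sd(z5), mean_40 = mean(z40), sd_40 = sd(z40))
```

Calling `hist(z40)` on the result should give a histogram close to the standard normal curve.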
set.seed(1500)
n <- 40
lambda <- 0.2
simu <- 1000
dataset <- matrix(rexp(n * simu, lambda), nrow = n, ncol = simu) ## One sample of 40 exponentials per column
## Calculate the mean of each simulated sample
samMeans <- colMeans(dataset)
sMean <- mean(samMeans) ## Mean of the 1000 sample means
##Calculate the theoretical mean of the distribution
tMean <- 1/lambda
The mean of the 1000 simulated averages is 5.0316789, while the theoretical mean of the distribution is 1/lambda = 5. The two values agree closely.
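To judge how close "approximately the same" is, the difference can be expressed in standard errors. The sketch below is a worked calculation under the assumption that the mean of 1000 averages of 40 exponentials behaves like the mean of 40,000 independent draws. The variable names are hypothetical, and the value 5.0316789 is copied from the text above.

```r
# Worked arithmetic; variable names are illustrative, not from the project code
lambda <- 0.2
n      <- 40
simu   <- 1000

sMean_reported <- 5.0316789   # value reported in the text
tMean          <- 1 / lambda  # theoretical mean

# Standard error of the mean of all n * simu draws
se_grand <- (1 / lambda) / sqrt(n * simu)
z        <- (sMean_reported - tMean) / se_grand
c(standard_error = se_grand, z = z)
```

The standard error is 0.025, so the reported mean lies roughly 1.27 standard errors above the theoretical value, which is well within ordinary sampling variation.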
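The variability of the simulation was examined in two ways: the standard deviation within each sample of 40 exponentials, and the variance of the 1000 sample averages. Note that a variance is the square of a standard deviation, not its square root.
samDevs <- apply(dataset, 2, sd) ## Standard deviation of each simulated sample
mSd <- mean(samDevs) ## Average within-sample standard deviation
tSd <- 1/lambda ## Theoretical standard deviation of the distribution
sVar <- var(samMeans) ## Sample variance of the 1000 averages
tVar <- (1/lambda)^2 / n ## Theoretical variance of an average of n exponentials
The average within-sample standard deviation is about 4.93 (square root 2.2193017), close to the theoretical standard deviation of 1/lambda = 5 (square root 2.236068). The theoretical variance of an average of 40 exponentials is (1/lambda)^2/n = 25/40 = 0.625, and sVar, the variance of the simulated averages, is the quantity to compare with it.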
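The quantity that matters for the Central Limit Theorem is the variance of the averages, which for n independent draws with standard deviation sigma equals sigma^2/n. The sketch below checks this by simulation. It uses the same n, lambda and number of simulations as the text, but the seed and variable names are assumptions, so the empirical value will not match the project's own run exactly.

```r
# Illustrative sketch: the seed is an assumption; n, lambda and simu follow the text
set.seed(3)
n      <- 40
lambda <- 0.2
simu   <- 1000
sims   <- matrix(rexp(n * simu, lambda), nrow = n, ncol = simu)

var_of_means <- var(colMeans(sims))  # empirical variance of the 1000 averages
theory       <- (1 / lambda)^2 / n   # sigma^2 / n = 25 / 40 = 0.625
c(empirical = var_of_means, theoretical = theory)
```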
library(ggplot2)
g <- ggplot(data.frame(column = samMeans), aes(x = column))
g <- g + geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, fill = 'coral', color = 'black')
g <- g + stat_function(fun = dnorm, args = list(mean = lambda^-1, sd=(lambda*sqrt(n))^-1), size=2)
g <- g + labs(title = "Distribution of Averages of 40 Exponentials", x = "Simulated means of 40 exponentials", y = "Density")
g
Based on the comparison between the overlaid normal density and the histogram of the sample means, the distribution of the averages of 40 exponentials is well approximated by a normal distribution with mean 1/lambda and standard deviation 1/(lambda*sqrt(n)).
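A normal Q-Q plot is a common complement to the histogram overlay. The sketch below draws one with base R's `qqnorm` and `qqline` and summarizes how straight it is with the correlation between theoretical and sample quantiles. It re-simulates the averages with its own seed, which is an assumption of this example, so it does not reproduce the exact data plotted above.

```r
# Illustrative sketch: the seed is an assumption; other settings follow the text
set.seed(4)
n      <- 40
lambda <- 0.2
simu   <- 1000
means  <- colMeans(matrix(rexp(n * simu, lambda), nrow = n))

qq <- qqnorm(means, main = "Normal Q-Q plot of averages of 40 exponentials")
qqline(means, col = "red")

# Correlation between theoretical and sample quantiles; values near 1
# indicate an approximately straight Q-Q plot
qq_cor <- cor(qq$x, qq$y)
qq_cor
```

Because an average of exponentials follows a scaled gamma distribution, a slight right skew is expected in the upper tail. The normal approximation is good for n = 40 but not exact.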
The objective of this part of the project is to perform a basic inferential data analysis of the ToothGrowth data set provided in R. The data set records the effect of vitamin C on tooth growth in 60 guinea pigs. It includes the following variables:
i. len: the length of odontoblasts, the cells responsible for tooth growth.
ii. supp: the delivery method of vitamin C, either orange juice (OJ) or ascorbic acid (VC).
iii. dose: the amount given to the guinea pigs, either 0.5, 1, or 2 mg/day.
library(datasets)
data(ToothGrowth)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
To give a basic overview of the data, a panel plot was made.
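data <- ToothGrowth
g1 <- ggplot(data, aes(x = factor(dose), y = len))
g1 <- g1 + facet_grid(. ~ supp)
## Plot the mean length per dose; geom_col() on the raw data would stack (sum) all observations
g1 <- g1 + stat_summary(aes(fill = factor(dose)), fun = mean, geom = "bar")
g1 <- g1 + labs(title = "Guinea pig tooth length by dosage for each supplement",
x = "Dose (mg/day)", y = "Mean tooth length", fill = "Dose (mg/day)")
g1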
The average tooth length increases with the dose given to the guinea pigs. Irrespective of the delivery method (i.e. supplement) used, the greatest tooth growth came from guinea pigs that were given the highest vitamin C dose (2 mg/day).
library(dplyr)
library(magrittr)
tooth_summ <- data %>%
  group_by(dose, supp) %>%
  summarise(ave_length = mean(len)) %>%
  arrange(desc(ave_length))
tooth_summ
## # A tibble: 6 x 3
## # Groups:   dose [3]
##    dose supp  ave_length
##   <dbl> <fct>      <dbl>
## 1   2   VC         26.1
## 2   2   OJ         26.1
## 3   1   OJ         22.7
## 4   1   VC         16.8
## 5   0.5 OJ         13.2
## 6   0.5 VC          7.98
From the table above, the highest average tooth length was obtained from guinea pigs given 2 mg/day of vitamin C, where the two delivery methods are practically indistinguishable (both averages display as 26.1, with VC listed marginally ahead). For the other two doses (0.5 and 1 mg/day), delivering vitamin C via orange juice (OJ) appears to result in greater tooth length.
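The following tests examine which supplement (VC vs. OJ) results in greater tooth growth at each of the three dosages. For each dosage, the null hypothesis is that both supplements result in the same mean tooth length. Each call to t.test uses the default settings, that is, a two-sided Welch two-sample t-test with a 95% confidence interval.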
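With its default arguments, `t.test` performs Welch's two-sample t-test, which does not assume equal variances in the two groups. To make that concrete, the sketch below recomputes the test statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand for the 0.5 mg/day subset, then compares them with `t.test`. The variable names are illustrative only.

```r
# Illustrative sketch: Welch's t-test by hand for dose 0.5 (names are hypothetical)
library(datasets)
d  <- subset(ToothGrowth, dose == 0.5)
oj <- d$len[d$supp == "OJ"]
vc <- d$len[d$supp == "VC"]

v_oj <- var(oj) / length(oj)  # squared standard error of the OJ mean
v_vc <- var(vc) / length(vc)  # squared standard error of the VC mean

t_stat <- (mean(oj) - mean(vc)) / sqrt(v_oj + v_vc)
# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v_oj + v_vc)^2 /
  (v_oj^2 / (length(oj) - 1) + v_vc^2 / (length(vc) - 1))
p_two_sided <- 2 * pt(-abs(t_stat), df)

# Reference result from t.test() with default settings
ref <- t.test(len ~ supp, data = d)
c(manual = p_two_sided, t.test = ref$p.value)
```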
1. 0.5mg/day: *_alternative hypothesis_* is that OJ will result in higher tooth length when used as a way to deliver 0.5mg/day vitamin C, compared to VC.
h1 <- t.test(len~supp, data = subset(data, dose == 0.5))
CI1 <- h1$conf.int
CI1
## [1] 1.719057 8.780943
## attr(,"conf.level")
## [1] 0.95
h1_pvalue <- h1$p.value
h1_pvalue
## [1] 0.006358607
At a dosage of 0.5 mg/day, the 95% confidence interval for the difference in mean tooth length (OJ minus VC) is (1.7190573, 8.7809427) and the p-value of the t-test is 0.0063586. The p-value is below the significance level of 0.05 and the interval excludes zero, so the null hypothesis is rejected: delivering 0.5 mg/day of vitamin C via orange juice results in significantly greater tooth length than ascorbic acid.
2. 1mg/day: *_alternative hypothesis_* is that OJ will result in higher tooth length when used as a way to deliver 1mg/day vitamin C, compared to VC.
h2 <- t.test(len~supp, data = subset(data, dose == 1.0))
CI2 <- h2$conf.int
CI2
## [1] 2.802148 9.057852
## attr(,"conf.level")
## [1] 0.95
h2_pvalue <- h2$p.value
h2_pvalue
## [1] 0.001038376
At a dosage of 1.0 mg/day, the 95% confidence interval for the difference in mean tooth length (OJ minus VC) is (2.8021482, 9.0578518) and the p-value of the t-test is 0.0010384. The p-value is below the significance level of 0.05 and the interval excludes zero, so the null hypothesis is rejected: delivering 1.0 mg/day of vitamin C via orange juice results in significantly greater tooth length than ascorbic acid.
3. 2mg/day: *_alternative hypothesis_* is that VC will result in higher tooth length when used as a way to deliver 2mg/day vitamin C, compared to OJ.
h3 <- t.test(len~supp, data = subset(data, dose == 2.0))
CI3 <- h3$conf.int
CI3
## [1] -3.79807 3.63807
## attr(,"conf.level")
## [1] 0.95
h3_pvalue <- h3$p.value
h3_pvalue
## [1] 0.9638516
At a dosage of 2.0 mg/day, the 95% confidence interval for the difference in mean tooth length (OJ minus VC) is (-3.7980705, 3.6380705) and the p-value of the t-test is 0.9638516. The p-value is above the significance level of 0.05 and the interval contains zero, so the null hypothesis cannot be rejected: there is no evidence that delivering 2.0 mg/day of vitamin C via ascorbic acid results in greater tooth length than orange juice.
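The alternative hypotheses stated for each dose are directional, whereas `t.test` with default arguments reports a two-sided p-value. The sketch below shows, for the 0.5 mg/day subset, how a one-sided test could be requested through the `alternative` argument. It is an illustration of the option and was not part of the analysis reported above.

```r
# Illustrative sketch: one-sided version of the dose 0.5 comparison
library(datasets)
d <- subset(ToothGrowth, dose == 0.5)

two_sided <- t.test(len ~ supp, data = d)
# The factor levels of supp are ordered OJ, VC, so "greater" tests
# whether mean(OJ) - mean(VC) is greater than 0
one_sided <- t.test(len ~ supp, data = d, alternative = "greater")

c(two_sided = two_sided$p.value, one_sided = one_sided$p.value)
```

When the observed difference lies in the hypothesized direction, as it does here, the one-sided p-value is half the two-sided one, so the conclusion at the 0.05 level is unchanged.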
At dosage levels of 0.5 and 1.0 mg/day, delivering vitamin C to guinea pigs via orange juice resulted in significantly greater tooth length than ascorbic acid. At 2.0 mg/day, on the other hand, there is no evidence that either delivery method (OJ vs. VC) is better.
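The three tests follow the same pattern, so they can be collected into a single table for side-by-side comparison. The sketch below is one possible way to do that with base R. The object name `results` and its column names are choices made for this example.

```r
# Illustrative sketch: the three Welch t-tests collected into one table
library(datasets)
doses <- c(0.5, 1, 2)

results <- do.call(rbind, lapply(doses, function(d) {
  # Same call as in the text: two-sided Welch t-test of len by supp
  tt <- t.test(len ~ supp, data = subset(ToothGrowth, dose == d))
  data.frame(dose     = d,
             ci_lower = tt$conf.int[1],
             ci_upper = tt$conf.int[2],
             p_value  = tt$p.value)
}))
results
```

Each row should reproduce the confidence interval and p-value reported for the corresponding dose above.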