This is an analysis first of a simple simulation of Exponential distribution and then an actual inferential data analysis of ToothGrowth sample dataset in R. The goal of the analysis is first to demonstrate working knowledge of Central Limit Theorem using the simulation. And then second to use hypothesis testing and/or confidence interval to perform some inferential analysis of the sample dataset and draw conclusions.
The simulated data for this analysis is an exponential distribution with a rate (lamdba) of 0.2. The theoretical mean and sd of an exponential distribution is 1/lambda. The following script loads the simulated data and verifies that the theoretical and actual mean and SD are fairly close to each other although not being same. As ‘n’ the number of observation becomes larger the actual mean and SD will get closer to their theoretical counterparts.
# Load required Libraries
library(dplyr)
library(ggplot2)
expdata <- rexp(1000, rate = 0.2)
sprintf("Theoretical Mean of exponential distribution (1/lambda) is %f", 1/0.2)
## [1] "Theoretical Mean of exponential distribution (1/lambda) is 5.000000"
sprintf("Actual mean of sample data is %f", mean(expdata))
## [1] "Actual mean of sample data is 4.860085"
sprintf("Theoretical SD of exponential distribution (1/lambda) is %f", 1/0.2)
## [1] "Theoretical SD of exponential distribution (1/lambda) is 5.000000"
sprintf("Actual SD of sample data is %f", sd(expdata))
## [1] "Actual SD of sample data is 4.883684"
According to the central limit theorem, the distribution of sample mean of sufficiently large sample size will be approximately normal with variance equal to the variance of the population divided by the sample size. The following chart shows the histogram of the sampling distribution of 1000 samples of averages of 40 exponentials. The histograms shows a normal distribution with mean that is an accurate estimate of the theoretical mean and standard deviation equaling the population standard devialaion divided by the square root of the sample size.
mns = NULL
for(i in 1 : 1000) mns = c(mns, mean(rexp(40, rate = 0.2)))
hist(mns, main = "Sampling Distribution of Sample mean of Exponentials", xlab = "Sample Mean")
sprintf("Mean of sampling Distribution of Sample mean is %f", mean(mns))
## [1] "Mean of sampling Distribution of Sample mean is 5.015166"
sprintf("SD of sampling Distribution of Sample mean is %f which is equal to the population sd / sqrt(n) %f", sd(mns), 5/sqrt(40))
## [1] "SD of sampling Distribution of Sample mean is 0.785354 which is equal to the population sd / sqrt(n) 0.790569"
The ToothGrowth dataset tabulates the effection of Vitamin C on Tooth Growth in Guinea Pigs.
The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, (orange juice or ascorbic acid (a form of vitamin C and coded as VC).
A data frame with 60 observations on 3 variables.
1.len - numeric Tooth length 2.supp - factor Supplement type (VC or OJ). 3.dose - numeric Dose in milligrams/day
A quick summary of the 3 columns of the dataset is done below. Len is an quantitative variable that ranges from 4.20 to 33.90. Supp is a factor with 2 levels (OJ, VC) the two forms of providing the supplements. Dose is also a factor with values (0,5, 1, 2) specifying the dosage (mg/day) of the Vitamin provided to the guinea pigs
data("ToothGrowth")
summary(mutate(ToothGrowth, dose = as.factor(dose)))
## len supp dose
## Min. : 4.20 OJ:30 0.5:20
## 1st Qu.:13.07 VC:30 1 :20
## Median :19.25 2 :20
## Mean :18.81
## 3rd Qu.:25.27
## Max. :33.90
A quick pair wise comparison of the columns indicate that there is definite correlation between length of teeth and dosage administered. There is possible relationship between supplement type and length of teeth that this needs further analysis.
pairs(ToothGrowth)
Figure 1
Hypothesis testing and/or calculating the confidence interval is the prescribed way to perform inferences described in this course. For Hypothesis testing the null hypothesis ho says that there isn’t any difference in mean of len between the types of supp. The alternate hypothesis ha says the opposite that there is a difference. A p-value of < .05 under null hypothesis can result in rejecting the null hypothesis with 95% confidence. The R function t.test is the goto function since it performs test, calculates the test statistic, p-value and the confidence interval.
The following script performs t.test to test the significance of both supp and dose on len.
# Hypothesis testing the significance of Supp on length of teeth
with(ToothGrowth, t.test(len ~ supp, paired = FALSE))
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
#Hypothesis testing the significance of Dose of length of teeth
with(ToothGrowth, t.test(len - dose))
##
## One Sample t-test
##
## data: len - dose
## t = 19.106, df = 59, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 15.79850 19.49483
## sample estimates:
## mean of x
## 17.64667
#
Under the null hypothesis ho to test the significance of supp on len has a p-value of 0.06 which is > that 0.05. Hence the ho can’t be rejected. i.e there may not be a significant difference between effect of supp on len based on the sample data with 95% confidence.
Under the null hypothesis ho to test the significance of dose on len, the t.test returns a p-value of 2.2e-16 which is practically 0. Hence ho can be rejected and it can be concluded with > that 99% confidence that dose has significant effect on len the lenght of the teeth.
These inferences are based on a dataset of 60 observations. This is not very high. More data may be need to reduce chances of modelling error.
The range of the predictor dose is very narrow between 0.5 and 2. Prediction error will likely increase for higher dosages