Statistical Inference Course Project

Yuming Liu

Overview

In Part 1, I investigate the exponential distribution through the mean, variance and distribution of its sample means. In Part 2, I do some basic inferential data analysis on the “ToothGrowth” dataset by applying t-tests.

Part 1

Simulations

1st, I assign values to the simulation parameters: the rate lambda, the sample size n, and the number of simulations nosim.

lambda <- 0.2
n <- 40
nosim <- 1000

2nd, I create a dataset of 1000 values, where each value is the mean of 40 exponentials.

sampleExp <- matrix(rexp(n * nosim, rate = lambda), nosim)
sampleData <- rowMeans(sampleExp)
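The simulation draws random numbers, so each run gives slightly different values. A minimal sketch of a reproducible version, assuming an arbitrary seed (the value 2024 is hypothetical and will not reproduce the exact numbers reported in this document):

```r
# Parameters as defined above
lambda <- 0.2
n <- 40
nosim <- 1000

# Hypothetical seed, chosen only for illustration
set.seed(2024)

# One row per simulation, one column per exponential draw
sampleExp <- matrix(rexp(n * nosim, rate = lambda), nrow = nosim, ncol = n)

# Mean of each row: 1000 means of 40 exponentials
sampleData <- rowMeans(sampleExp)

dim(sampleExp)
length(sampleData)
```

Naming nrow and ncol explicitly makes it clear that rowMeans averages over the 40 draws of each simulation and not over the simulations.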

Sample Mean versus Theoretical Mean

3rd, I calculate the sample mean (the mean of the 1000 simulated means) and the theoretical mean, 1 / lambda.

sample_mean <- mean(sampleData)
sample_mean

## [1] 5.001388

theoretical_mean <- 1 / lambda
theoretical_mean

## [1] 5

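The sample mean and the theoretical mean are very close to each other.

Sample Variance versus Theoretical Variance

4th, I calculate the sample variance and the theoretical variance. Each exponential has variance (1 / lambda)^2, so the mean of n independent exponentials has variance (1 / lambda)^2 / n.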

sample_variance <- var(sampleData)
sample_variance

## [1] 0.5861573

theoretical_variance <- (1 / lambda)^2 / (n)
theoretical_variance

## [1] 0.625

The sample variance and the theoretical variance are close to each other too.

5th, I calculate the sample standard deviation and the theoretical one.

sample_sd <- sd(sampleData)
sample_sd

## [1] 0.7656091

theoretical_sd <- 1 / (lambda * sqrt(n))
theoretical_sd

## [1] 0.7905694

The two standard deviations are also close to each other.

Distribution

6th, I plot the distribution of the means of exponentials to show that it is approximately normal. On the plot, I draw vertical lines at the sample mean and the theoretical mean to show how little they differ. I also overlay two normal density curves, one using the sample mean and standard deviation and one using the theoretical values, to show the difference in variance.

library(ggplot2)
Data <- data.frame(sampleData)
distribution <- ggplot(Data, aes(x = sampleData))
# Histogram on the density scale so that the normal curves are comparable
distribution <- distribution + geom_histogram(aes(y = after_stat(density)), col = "black", fill = "pink")
# Vertical lines: sample mean (red) and theoretical mean (blue)
distribution <- distribution + geom_vline(xintercept = c(sample_mean, theoretical_mean), col = c("red", "blue"), size = 1.5)
# Normal curves: sample values (red) and theoretical values (blue)
distribution <- distribution + stat_function(fun = dnorm, args = list(mean = sample_mean, sd = sample_sd), col = "red", size = 1.5)
distribution <- distribution + stat_function(fun = dnorm, args = list(mean = theoretical_mean, sd = theoretical_sd), col = "blue", size = 1.5)
distribution <- distribution + xlab("sample mean") + ylab("density") + ggtitle("Distribution of sample mean")
distribution

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

On the plot, the pink histogram shows the distribution of the sample means. The red vertical line marks the sample mean and the blue one marks the theoretical mean. The red and blue curves are the normal densities based on the sample values and the theoretical values respectively. The red curve is slightly narrower than the blue one, because the sample variance (0.586) is a little smaller than the theoretical variance (0.625). The histogram also shows that the distribution of the sample means is approximately normal in shape.
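A normal quantile-quantile plot is another way to check the approximate normality described above. This is a sketch only: it re-runs the simulation with a hypothetical seed so that it is self-contained, and the names sample_means and qq are introduced here for illustration.

```r
set.seed(2024)  # hypothetical seed, for illustration only
lambda <- 0.2
n <- 40
nosim <- 1000
sample_means <- rowMeans(matrix(rexp(n * nosim, rate = lambda), nosim))

# Points close to the reference line indicate approximate normality
qq <- qqnorm(sample_means, main = "Normal Q-Q plot of sample means")
qqline(sample_means, col = "blue")

# Correlation between theoretical and observed quantiles (closer to 1 is more normal)
cor(qq$x, qq$y)
```

Because the exponential distribution is right-skewed, the means of 40 draws may still bend away from the line a little in the upper tail.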

Part 2

1. Load the ToothGrowth data and perform some basic exploratory data analyses

1st, I load the ToothGrowth data and look at its first rows, a summary of each column, and the distinct values of supp and dose.

library(datasets)
data(ToothGrowth)
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

unique(ToothGrowth$supp)

## [1] VC OJ
## Levels: OJ VC

unique(ToothGrowth$dose)

## [1] 0.5 1.0 2.0

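supp has 2 types, “VC” and “OJ”, and dose has 3 values: 0.5, 1.0 and 2.0.

2. Provide a basic summary of the data

2nd, I draw a bar chart of len by dose, with one panel per supplement type, as a basic summary of the data. Because geom_bar(stat = "identity") stacks the individual observations, the height of each bar is the total tooth length of the observations in that group.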

basic_summary <- ggplot(aes(x = as.factor(dose), y = len, fill = supp), data = ToothGrowth)
basic_summary <- basic_summary + geom_bar(stat = "identity")
basic_summary <- basic_summary + facet_grid(.~ supp)
basic_summary <- basic_summary + xlab("dose") + ylab("tooth length") + ggtitle("Tooth length on Dose amount and Supplement type")
basic_summary
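The bars drawn by geom_bar(stat = "identity") add up the observations in each group, so a table of group means can be easier to compare. A minimal sketch using base R's aggregate (the name group_means is introduced here for illustration):

```r
library(datasets)
data(ToothGrowth)

# Mean tooth length for each supplement type and dose
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
group_means

# Number of observations in each supp-by-dose group
table(ToothGrowth$supp, ToothGrowth$dose)
```

The count table is worth checking alongside the means, since totals and means rank the groups the same way only when every group has the same number of observations.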

3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose

3rd, I apply t-tests to len by supp and by dose. In each test, the null hypothesis is that the two groups being compared have the same mean tooth length.

# Apply t-test of len on supp, the whole dataset
t.test(len ~ supp, data = ToothGrowth)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

# Apply t-test of len on dose, sub-dataset when dose is 0.5 or 1.0
t.test(len ~ dose, data = subset(ToothGrowth, dose %in% c(0.5, 1.0)))

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

# Apply t-test of len on dose, sub-dataset when dose is 1.0 or 2.0
t.test(len ~ dose, data = subset(ToothGrowth, dose %in% c(1.0, 2.0)))

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

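The t-test on the whole dataset gives p = 0.06, which is not significant at the 0.05 level, and the 95 percent confidence interval (-0.17, 7.57) contains 0. So I cannot reject the hypothesis that the two supplement types have the same mean effect. Then I apply t-tests to the sub-datasets with dose in (0.5, 1.0) and dose in (1.0, 2.0) to find out whether an increase in dose has an effect on tooth length. Both p-values are far below 0.05 and both confidence intervals lie entirely below 0, so I can reject the hypothesis and say that an increase in dose has a positive effect on tooth growth.

4. State your conclusions and the assumptions needed for your conclusions

1. Over all doses combined, the type of supplement has no statistically significant effect on tooth growth at the 0.05 level.
2. An increase in dosage has a positive effect on tooth growth.

These conclusions assume that the observations are independent, that the two groups in each comparison are independent of each other (not paired), and that tooth length is approximately normally distributed within each group. The Welch t-tests used above do not assume equal variances.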
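The test of len by supp pools all three doses. As a possible follow-up, the same Welch t-test can be run within each dose level to see whether the supplement comparison depends on dose. This is a sketch, not part of the original analysis, and the names by_dose and p_values are introduced here for illustration.

```r
library(datasets)
data(ToothGrowth)

# Split the data by dose and run a Welch t-test of len by supp in each subset
by_dose <- split(ToothGrowth, ToothGrowth$dose)
p_values <- sapply(by_dose, function(d) t.test(len ~ supp, data = d)$p.value)

# One p-value per dose level (0.5, 1, 2)
p_values
```

If the p-values differ across doses, the first conclusion is best read as a statement about the pooled data only.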