Abstract

In this project we shall investigate the exponential distribution in R and compare it with the Central Limit Theorem. Next, we shall perform hypothesis testing on data.

library(datasets)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)
library(ggplot2)

Visualizing the exponential

The exponential distribution can be considered as the continuous form of the poisson distribution, which is discrete. They even have the same quantity, lambda, that determines their mean and variance. By discrete we mean that the Poisson random variable only takes rational values (numbers that are not irrational eg. not pi). It will serve us well to see this for ourselves.

N = 10000
lambda = 0.2

rexp(N, lambda) %>% as.data.frame %>%
  ggplot(aes(.)) +
  geom_histogram() +
  labs(title = "Exponential distribution", tag = "A1")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

rpois(N, lambda) %>% as.data.frame %>%
  ggplot(aes(.)) +
  geom_histogram() +
  labs(title = "Poisson distribution", tag = "A2")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sample and Mean

We will sample from the exponential distribution and compare its mean with the theoretical mean. The procedure consists in taking 40 samples from rexp(), then taking their mean (I am using the pipeline operator %>% a lot here). Then we repeat this step 1000 times and create a distribution from this. This is what we are calling the sampling distribution.

set.seed(123)
lambda = 0.2

sample_of_means = replicate(1000, rexp(n = 40, lambda) %>% mean) %>%
                    as.data.frame
emp_mean = mean(sample_of_means$.)
emp_variance = var(sample_of_means$.)

Variance

The sampling distribution of the variance seems to be skewed and not gaussian. It is no wonder that it is not touched on in the lectures. Perhaps it is not as important as that of the mean.

replicate(1000, rexp(n = 40, lambda) %>% var) %>%
                    as.data.frame %>%
    ggplot(aes(.)) +
    geom_histogram(bins = 50) +
    labs(title = "Sampling Distribution of the Variance", tag = "B")

So the mean of this sampling distribution is given by 5.0119113, which should be pretty close to the actual mean of the exponential distribution 5. The sampling distribution variance = 0.6004928 has its square root as the standard error to the mean. It has (I believe), nothing to do with the variance of the exponential distribution itself. That variance, = 25. This accounts for the drastic difference between the two.

The Sampling Distribution and the Normal

The distribution should be roughly normal. The more we sample the means (n), the more this sampling distribution approximates the Gaussian.

#Plotting
## Sample distribution
  ggplot(sample_of_means, aes(.)) +
    geom_histogram() +
    geom_vline(xintercept = emp_mean) +
    geom_vline(xintercept = emp_mean + c(2, -2) * sqrt(emp_variance), 
               col = "Red") +
    labs(title = "The Sampling Distribution", tag = "C1")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Normal distribution
m = 10000  
rnorm(m, mean = 1/lambda, sd = 1) %>% as.data.frame %>%
  ggplot(aes(.)) +
  geom_histogram(fill = "Blue", alpha = 0.5) +
  geom_vline(xintercept = emp_mean) +
  labs(title = "The Normal Distribution", tag = "C2")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The red lines show the 95% confidence interval.

Part 2

This section will be concerned with using statistical techniques to analyse data; the ToothGrowth data in the R datasets package. It measures the length of odontoblasts (cells responsible for tooth growth) in response to dosage levels and how said doses are administered.

Exploratory Analysis

The data consists of three variables. The len and dose variables are essentially factor variables with 2 and 3 levels respectively.

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

table(ToothGrowth[, 2:3])

##     dose
## supp 0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10

First we shall investigate the distribution of teeth length. There are sixty teeth of differing lengths.

ggplot(ToothGrowth, aes(len)) +
  geom_histogram() +
  labs(title = "Length Distribution", tag = "D1") +
  geom_vline(xintercept = mean(ToothGrowth$len))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Bar plot showing length pert tooth
ggplot(ToothGrowth, 
       aes(y = len, x = 1:nrow(ToothGrowth))
       ) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Plot of Lengths", tag = "D2") +
  xlab("Count") +
  ylab("Length")

Hypothesis Testing

We are going to Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.

Facet Plots

Now we delve into the details of the distribution to see the composition by factor variable.

#Making dose a factor type variable
ToothGrowth = mutate(ToothGrowth, dose = as.factor(dose))

# Factoring by supp
  ggplot(ToothGrowth, aes(len, fill = supp)) +
    geom_histogram(position = "dodge") +
    labs(title = "Factoring by Dose Type", tag = "E1")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Factoring by dose
  ggplot(ToothGrowth, aes(len, fill = dose)) +
    geom_histogram(position = "dodge") +
    labs(title = "Factoring by Dose Level", tag = "E2")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the dose factored histogram it is possible to infer that greater dosage is correlated wiith greater length on average. That is a cursory observation that will be tested later.

Testing

Now that we have gotten a preliminary view of our data, we can then proceed to the creation of hypotheses. Our null hypothesis, H0 : None of the dosage methods(dose) or vitamin levels (supp) have any effect on the length variable. We shall have our significance level alpha = 0.05

Vitamin Levels

oj = filter(ToothGrowth, supp == "OJ") # Oeange juice group
vc = filter(ToothGrowth, supp == "VC") # Vitamin C group

# t.test returns a named list
## We can reference each of these elements
alt = "two.sided"
t_test =  t.test(vc$len, oj$len, paired = TRUE, var.equal = TRUE, 
                 conf.level = .95, alternative = alt
       )

We arbitrarily take one group as the control, in our case, the orange juice group. We are carrying out a two.sided test

It will be helpful to visualize this

# Visualizing
  p = rt(10000, df = t_test$parameter, ncp = t_test$estimate) %>% 
    as.data.frame %>%
  
    ggplot(aes(.)) +
      geom_histogram(bins = 50, alpha = 0.5, fill = "Brown") +
      geom_vline(xintercept = (t_test$estimate + t_test$statistic), 
                 col="Blue") +
      geom_vline(xintercept = t_test$conf.int[1], size = 2) +
      labs(title = "T-Distribution", tag = "F2")

# Add text
annotation <- data.frame(
   x = c(-7.7, -4.4),
   y = c(1000, 1000),
   label = c("p-Value", "5% significance level")
)

p + geom_text(data=annotation, aes( x=x, y=y, label=label), 
           color="black" )

The p_value 0.0025498 is far less than our significance level, 0.05 and so we reject the null hypothesis. The conclusion is that there is indeed a difference between the two groups if we regard the observations to be paired. the confidence interval is given by -5.9913414, -1.4086586

Dosage Methods

group_05 = filter(ToothGrowth, dose == "0.5")
group_1 = filter(ToothGrowth, dose == "1")
group_2 = filter(ToothGrowth, dose == "2")

test_21 = t.test(group_2$len, group_1$len, paired = TRUE, var.equal = TRUE)
test_205 = t.test(group_2$len, group_05$len, paired = TRUE, var.equal = TRUE)
test_105 = t.test(group_1$len, group_05$len, paired = TRUE, var.equal = TRUE)

The point of the code is to do two sample testing between two of the three groups exhaustively; 2 and 1, 2 and 0.5, 1 and 0.5.

The results are as follows - 2 and 1: 1.934186^{-4} - 2 and 0.5: 7.1902545^{-10} - 1 and 0.5: 1.2254367^{-6}

For all the p value is less than 0.05 so we reject the null and accept the alternative hypothesis; there is a significant difference between all the groups.

Conclusion

We concluded that there are meaninful variations in the effect of treatment types and dosage levels. The null is false; the way doses are administered and in what form do in fact affect the length of tooth growing cells. From our exploratory graph by dose factor (see E2), it appears that there is a positive correlation between increasing dosage and cell length. However, that will be better tested by regression. But the takeaway is that the results are significant.