Statistical_Simulation_with_Basic_Inferential

Overview

Simulation Exercises: In this project we will investigate exponential distribution of random numbers with R-language and compare the distribution to Central Limit Theorem. The exponential distribution will be calculated in R-language with rexp(n, lambda) formula, where lambda is the rate parameter. Here the mean of exponential distribution is 1/lambda, the standard deviation is 1/lambda and lambda = 0.2 for all calculation.

We will investigate the distribution of averages of 40 exponentials and then we will need to do a thousand simulations.

The following project instructions were provided for project data simulation calculation.

Set lambda = 0.2 for all of the simulations.
Investigate the distribution of averages of 40 exponentials.
Do a thousand simulations.

Part 1: Simulation exercise

In the code below we ran a simulation of 1000 with average of n = 40 exponentials with the given formula rexp(n, lambda) where lambda is the rate of parameter = 0.2

Next,we calculated the means of the exponential simulations. Following these findings, we will do code-design and plot histogram of simulated mean values.

Project variables for simulation analysis

# setting seed for exact reproduction of simulated data 
set.seed(2019)

# setting lambda value
lambda  <- 0.2    

# setting  sample number of exponential
n <- 40 

# number of simulations
num_Simulation  <- 1000   

# now running suggested simulation with given formula
Simulation_data  <-  replicate(num_Simulation, rexp(n, lambda))

# view the dimension of the newly formed simulated data-frame
dim(Simulation_data)

## [1]   40 1000

# calculate the means of the simulated data and preview it
Simdata_means <- apply(Simulation_data, 2, mean)


# histogram view of the simulated data means
hist(Simdata_means, breaks=40, xlim=c(2,9), main="Simulated data means distribution",
       col="salmon")

Question 1: Quantifying the sample mean with Confidence Interval

# displaying means of sample mean
print(mean(Simdata_means))

## [1] 5.038755

# Confidence interval of the sample means
Sample_confidence_interval <- mean(Simdata_means) + c(-1,1) * 1.96 * sd(Simdata_means)/sqrt(n)
print(paste ("Confidence Interval of sample means = ", Sample_confidence_interval) )

## [1] "Confidence Interval of sample means =  4.79369123562415"
## [2] "Confidence Interval of sample means =  5.283819374181"

Analysis: This Confidence interval range of sample means is entirely above zero.The range of this confidence interval is relatively narrow suggests a more precise estimate of sample population mean.

Since our interval doesn’t include 0, we are 95% Confident that our estimated interval for the simulated data mean is between 4.793691 and 5.283819.

Question 2: sample mean vs theoretical mean on a histogram

# theoretical mean
theoretical_mean <- 1/lambda 
print(theoretical_mean)

## [1] 5

# plotting histogram with abline both(Theoretical & Sample) mean position   
hist(Simdata_means, col="olivedrab", main="Sample mean vs. Theoretical mean",
                      xlim = c(2,9), breaks=40,  xlab = "Simulation Means")

# drawing a vertical line with the mean of sample data mean on the histogram
abline(v=mean(Simdata_means), lwd="2", col="maroon4")

# drawing another vertical line for the theoretical mean on the histogram
abline( v= theoretical_mean, lwd="2", col="blue")

Analysis: The sample mean abline(maroon) and theoretical mean abline(blue) are two vertical lines almost visually superimposed on each other position on the histogram. These overlaid lines show how the sample and population means are practically on the same number position on the graph. So we can infer that they are truly representative of the dataset.

Question 3: Show how variable the sample variance compare to the theoretical variance of the distribution.

# sample standard deviation and variance calculation
Sample_Stdev  <-  sd( Simdata_means)
Sample_Var <-  Sample_Stdev^2
print(paste("Sample standard deviation is = ", round(Sample_Stdev, 3)))

## [1] "Sample standard deviation is =  0.791"

print(paste ("Sample variance is = ", round (Sample_Var, 3)))

## [1] "Sample variance is =  0.625"

# theoretical standard deviation and variance
Theoretical_sd  <- 1/lambda/sqrt(n)
print(paste ("Theoretical standard deviatiton = ",round(Theoretical_sd,3)))

## [1] "Theoretical standard deviatiton =  0.791"

Theoretical_var  <- (1/lambda)^2/n
print(paste ("Theoretical/Population variance = ", round( Theoretical_var, 3)))

## [1] "Theoretical/Population variance =  0.625"

Note: It is very obvious that sample and theoretical variance are in a breathing distance.

Question 4: Show that the simulated exponential distribution is approximately normal

# plotting histogram to prove that distribution in normal
hist(Simdata_means, prob=TRUE, col="turquoise4", main="Normal distribution of simulated
        means", breaks=40, xlim=c(2,9),xlab = "Simulated data means",  ylab="Density")

# shows the sample mean with a vertical abline on the graph
abline(v=mean(Simdata_means), lwd="3", col="white")

# drawing a contour line(RED) on the graph to show sample mean distribution on histogram
lines(density(Simdata_means), lwd=3, col="red")

# plotting a normal density curve line to compare with the calculated simulated density line
xfit <- seq(min(Simdata_means), max(Simdata_means), length=2*n)
yfit <- dnorm(xfit, mean=1/lambda, sd=sqrt(((1/lambda)/sqrt(n))^2))
lines(xfit, yfit, pch=22, col="black", lwd=3)

Note: The shape of the ‘black’ contour line is the true representative bell curve.

Analysis: The histogram contour lines(red) of simulated sample data distribution visually follows a normal distribution pattern.The true density line of the normal distribution is shown with the ‘black’ color. We can see from the graph, the distribution of averages of 40 exponential distributions(‘red’) is close to the normal distribution with the given lambda value.

Part 2: Basic inferential data analysis

Course project guidelines instruct us the following steps should occur:

Load the ‘ToothGrowth’ data and perform some basic exploratory data analyses
Provide a basic summary of the data.
Use confidence intervals and/or hypothesis tests to compare tooth growth by ‘supp’ and ‘dose’ (only use the techniques from class, even if there are other approaches worth considering).
State your conclusions and the assumptions needed for your conclusions.

Exploratory data analysis:

Dataset (Tooth Growth) resource in summary:

60 data observation
OJ: orange juice
VC: vitamin C
Dose level: 0.5, 1.0 and 2.0 mg
Len: Tooth length in non-sequential numbers

# loading the dataset
data("ToothGrowth")

# calculating the 'mean' of the 'len' variable coupled with supplement element
LengthSupp <- split(ToothGrowth$len, ToothGrowth$supp)
sapply(LengthSupp, mean)

##       OJ       VC 
## 20.66333 16.96333

# quantifying the 'median' of the 'len' variable coupled with supplement element
sapply(LengthSupp, median)

##   OJ   VC 
## 22.7 16.5

# calculating the 'variance' of the 'len' variable associated with supplement
LengthSupp <- split(ToothGrowth$len, ToothGrowth$supp)
sapply(LengthSupp, var)

##       OJ       VC 
## 43.63344 68.32723

# calculating the 'mean' of 'len' value assigned to various dosage regimen
LengthDose <- split(ToothGrowth$len, ToothGrowth$dose)
sapply(LengthDose, mean)

##    0.5      1      2 
## 10.605 19.735 26.100

# finding the median of different dosage associated with length
sapply(LengthDose, median)

##   0.5     1     2 
##  9.85 19.25 25.95

# summary of the ToothGrowth dataset
summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

# A linear view of data summary
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

# Summary: Tooth-length growth summary, separated by each supplement with single dosage relevance
by(ToothGrowth$len, INDICES = list(ToothGrowth$supp, ToothGrowth$dose), summary)

## : OJ
## : 0.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.20    9.70   12.25   13.23   16.18   21.50 
## -------------------------------------------------------- 
## : VC
## : 0.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20    5.95    7.15    7.98   10.90   11.50 
## -------------------------------------------------------- 
## : OJ
## : 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.50   20.30   23.45   22.70   25.65   27.30 
## -------------------------------------------------------- 
## : VC
## : 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.60   15.27   16.50   16.77   17.30   22.50 
## -------------------------------------------------------- 
## : OJ
## : 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.40   24.58   25.95   26.06   27.08   30.90 
## -------------------------------------------------------- 
## : VC
## : 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.50   23.38   25.95   26.14   28.80   33.90

# displaying some data from the front of the dataset
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

# displaying data from back end of the dataset
tail(ToothGrowth)

##     len supp dose
## 55 24.8   OJ    2
## 56 30.9   OJ    2
## 57 26.4   OJ    2
## 58 27.3   OJ    2
## 59 29.4   OJ    2
## 60 23.0   OJ    2

# displaying total number of dosage by different dose level
summary(ToothGrowth$dose <- factor(ToothGrowth$dose))

## 0.5   1   2 
##  20  20  20

Statistical Inference Analysis: ( Confidence Interval & Hypothesis test )

We know from our basic statistical analysis that mean, median and variance are consistently different between two ‘supp’ elements. This means that we will calculate a two sided confidence interval test for detail analysis.We will calculate multiple combination of ‘t-test’ to analyse tooth growth observation.

# Redoing variance of 'supp' associated to 'len'
LengthSupp <- split(ToothGrowth$len, ToothGrowth$sup)
sapply(LengthSupp, var)

##       OJ       VC 
## 43.63344 68.32723

Variances of these two Supplement groups are visibly very wide, so in the t-test we will assume variances are not equal and the datasets are independent meaning not paired.

1. Analyzing data to find correlation between tooth growths and supplement usage.

t.test( len ~ supp, paired=F, var.equal=F, data=ToothGrowth )

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

Analysis: This 95% confidence interval range [-0.171, 7.571] contains zero(0). This interval series contains both positive and negative numbers in its wide range.

We know, if the quantified interval contains zero then the relevant null hypothesis cannot be rejected, since the null hypothesis value contains within the range of interval. Since null hypothesis represents a status quo, if we can’t reject null-hypothesis most reliably then we won’t be able to accept the alternative hypothesis.

The p-value (0.06063 > 0.05) is a noticeable proof that null hypothesis value cannot be reliably rejected and the hypothesis test is statistically not significant. We know at any time a significance test fails to reject the null hypothesis, the direction of the effect in unknown. This signifies that there is no good correlation between tooth length and supplement

2. Using Supplement as a Factor within Dosage Level

Finding correlation in tooth growth and supplement usage among different dosage level

Subset: While dosage limit is 0.5 mg

Dose1 <- subset(ToothGrowth, dose == 0.5 )
t.test( len ~ supp, paired = F, var.equal = F, data = Dose1 )

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98

Analysis : The confidence interval range (1.719057, 8.780943) does not contain zero(0), thereby null hypothesis can be rejected at the 0.05 level. Also our p-value is 0.006359 < 0.05 (significance level), says that null hypothesis should be rejected and the test is statistically significant, thereby this interval range does not contain the null hypothesis value.

Since there are no negative value in the interval, the direction of the tooth growth is connected to supplement. Since all the values in the interval are positive, the tooth growth is related to dose level. We also see mean growth with OJ is greater than VC, 13.23 > 7.98.

Subset: While dosage limit is 1.0 mg

Dose2 <- subset(ToothGrowth, dose == 1.0)
t.test ( len ~ supp, paired=F, var.equal = F, data= Dose2)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77

Analysis: The confidence interval range (2.802148, 9.057852) does not contain zero(0), thereby null hypothesis can be rejected. Also our p-value is 0.001038 < 0.05 (significance level), proves that null hypothesis can be rejected in favor of alternative hypotheis.

We know, if the confidence interval does not contain the null hypothesis value, the results are statistically significant. We can infer that tooth growth is higher as dosage increase with OJ than VC ( 22.70 > 16.77 ).

Subset: While dosage limit is 2.0 mg

Dose3 <- subset(ToothGrowth, dose == 2.0 )
t.test( len ~ supp, paired = F, var.equal = F, data = Dose3 )

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

Analysis: This confidence interval range (-3.79807, 3.63807 ) contains zero, which delimits that null hypothesis cannot be rejected at 0.05 significance level. Here also we see that p-value = 0.9639 > 0.05 is lot higher, verifies that null hypothesis cannot be rejected.

This test is not statistically significant and the interval may contain the null hypothesis value. The mean values ( 26.06 <=26.14 ) almost close to zero. The effect of tooth growth at dose level 2.0 is not significant. We see VC growth holds a trivial edge on OJ.

3. Dosage variation range as a Factor: Analyzing dosage range as a variable for t-test

DoseRange: (0.5 to 1.0) with unequal variance

DoseRange_1 <- subset( ToothGrowth, dose %in%  c(0.5, 1.0) )

# presenting the variance with dose range   
len_Dose_1 <- split(DoseRange_1$len, DoseRange_1$dose)
sapply(len_Dose_1, var)

##      0.5        1        2 
## 20.24787 19.49608       NA

# t-test where dose are independent and variance are not equal.
t.test ( len ~ dose, paired=F, var.equal = F, data= DoseRange_1 )

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

Analysis: The resultant 95% interval (-11.983781, -6.276219) is entirely below zero. It is evident that dose level 1 has higher mean growth than level 0.5 as we see ( 10.605 < 19.735 ).

It is visible that p-value of this test is essentially zero also the confidence interval of the test does not cross over zero (0). The hypothesis test is statistically significant and the confidence interval will not contain the null hypothesis value.

We should reject the null hypothesis and accept that higher dosage level has greater impact in tooth growth.

DoseRange: (0.5 to 2.0) with unequal variance

DoseRange_2 <- subset( ToothGrowth, dose %in%  c(0.5, 2.0))

# presenting the variance with dose range
len_Dose_2 <- split(DoseRange_2$len, DoseRange_2$dose)
sapply(len_Dose_2, var)

##      0.5        1        2 
## 20.24787       NA 14.24421

# t-test where dose are independent and variance are not equal
t.test ( len ~ dose, paired=F, var.equal = F, data= DoseRange_2 )

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean in group 0.5   mean in group 2 
##            10.605            26.100

Analysis: The resultant interval (-18.15617, -12.83383) is entirely below zero. The difference in interval range is not wide. The hypothesis test is statistically significant and the confidence interval will not contain the null hypothesis value.

It is visible that p-value of this test is essentially zero and the confidence interval of the test does not cross over zero (0). It suggest that at 95% interval level dosage 2 has higher mean growth than dose level 0.5 . These properties allow us reject the null hypothesis, thereby we can surmise that dosage increase to higher level enhances tooth length significantly.

DoseRange: (1.0 to 2.0) with unequal variance

DoseRange_3 <- subset ( ToothGrowth, dose %in% c(1.0, 2.0))

# presenting the variance with dose range
 len_Dose_3 <- split( DoseRange_3$len, DoseRange_3$dose )
sapply( len_Dose_3, var )

##      0.5        1        2 
##       NA 19.49608 14.24421

# t-test where dose are independent and variance are not equal
 t.test ( len ~ dose, paired = F, var.equal = F, data = DoseRange_3 )

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

Analysis : The resultant interval ( -8.996481 -3.733519) is entirely below zero. The contracted difference in the interval suggest that at 95% confidence interval group 2.0 has slightly higher mean growth than group 1.0 .

It is obvious that p-value of this test was essentially zero and the confidence interval of the test does not cross over zero (0). We can reject the null hypothesis and surmise that as dosage increases effects tooth growth.

Visual data analysis with multiple plot projection

Here we did some exploratory data analysis using scatterplot, boxplot and bar-graph. We used four plots to analyze the given dataset. The summary sentence will describe the projected statistical data analysis.

Scatterplot to analyze the Tooth Lenght (len) relative to Dosage(dose) & Supplement (supp)
Boxplot to analyze tooth length in accordance with dose & supplement data spread.
Comparing Tooth growth relative to Supplement exclusively.
Determine if within a ‘Dosage - supplements’ range have different effects on tooth growth.

1. Scatterplot to analyze the Tooth Lenght (len) relative to Dosage(dose) & Supplement (supp)

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.3

qplot(supp, len, color=len,data=ToothGrowth, facets =.~dose,xlab="Supplement(OJ & VC)",
      ylab="Tooth Growth",main="Scatterplot by supplement and dose variation")

Analysis: As we can see that scatterplot is showing a sense of correlation between dosage increase and tooth-length growth. This graph portrays that as dosage increases in range from 0.5 to 1.0 the tooth length grows consistently and OJ maintains a visual lead over VC.

Tooth-length growth is in positive correlation with dosage(0.5 to 1.0) increase.However, as dose reaches out to 2.0 mg level, tooth length gets a boost and at this point VC holds a lead on OJ.

2. Boxplot to analyze tooth length in accordance with dose & supplement data spread.

Boxplot: Here I plotted some comparative boxplots with the same data elements, which will offer a better visual of the whole data phenomenon. These boxplots displays detail data spread which will project ‘tooth growth’ while different supplement is in use in accordance with different dosage. These boxplots visually explicates the detail data spread with minimum, Lower-Quartile, Median, Upper Quartile, Maximum, and whiskers data point.

boxplot(len~supp*dose, data=ToothGrowth, col=c("green", "gold"), xlab="Supplement",
           ylab="Tooth Growth", main="Boxplot with supplement and dose variation")

Analysis: This visual boxplot projection depicts tooth-growth sequentially increases as dosage application grows. When dosage is in 0.5 mg level data are right skewed on both supp-element and at 1.0 mg level data are more on left skewed. Adding to that, we see there are significant tooth growths at second higher dosage level than the first.

On the contrary, while the dose is 2.0 mg the tooth length reaches to a symmetric growth pattern on both supplements with marginal edge on VC growth. Overall, it deems that OJ has better tooth growth potential than VC at lower range(0.5 - 1.0) dosage distribution.

3. Comparing Tooth growth relative to Supplement (OJ and VC) usage

library(ggplot2)
ggplot(aes(x=supp, y=len), data=ToothGrowth) + geom_boxplot(aes(fill=supp))+
         scale_fill_manual(values=c("chocolate4","deepskyblue4"))+
           xlab("Supplement Type")+ylab("Tooth Length")

Analysis: We know boxplots usually offer a good visual of relative data distribution. These comparative boxplots presenting tooth growth in effect with the supplement usage. We see that the median of tooth length with ‘Orange juice’ is 22.7 and ‘Vitamin C’ is 16.7, which clarifies the tooth growth is generally higher with Orange juice.

The distribution of data shows the median(22.7) of ‘orange juice’ nearly equals to the level of the third quartile(22.7) of ‘vitamin C’. The data spread of ‘orange juice’ is skewed to the left and ‘vitamin C’ is skewed slightly off to the right. The majority of ‘vitamin C’ related tooth length distribution remains below the median line of ‘OJ’, explains tooth growth with VC is lower than OJ.

4. Determine if within a ‘Dosage - supplements’ range have different effects on tooth growth.

library(ggplot2)
ggplot(data=ToothGrowth, aes(x=as.factor(dose), y=len, fill=supp)) + 
       geom_bar(stat="identity",) +
        scale_fill_manual(values=c("hotpink4", "deepskyblue4")) +
          facet_grid(.~supp)+ xlab("Dosage by miligrams")+ylab("Tooth Length") +
            guides(fill=guide_legend(title="Supplemnt Type"))

Analysis: We see incremental tooth growth pattern as dosage increases from( 0.5 to 1.0 to 2.0)mg with both supplement OJ and VC.

Assumption on supplement dosage impact:

We are analyzing an experiment designed to quantify tooth growth of a sample of 60 Guinea Pigs using two different supplements (OJ and VC) with incremented dosage level.

On the purpose of analysis with this small dataset, we made some elemental data-population assumption. We are assuming that each supplement(OJ, VC) sub-population is independent and their variances are not equal . We are conforming to those assumption on the ‘t-test’ by defining ‘paired = False’ and ‘variable.equals = False’.

We made these following statistical inferential suppositions about the projected data analysis.

Supplement usage impacts tooth growth on various dosage levels differently.
When dosage level grows from 0.5 to 1.0 mg, ‘tooth-length’ grows considerably.
Dosage near 2.0 mg does not increase ‘tooth-length’ noticeably compare to (0.5 to 1.0)mg level.
Orange Juice (OJ) has higher ‘growth-control’ on tooth growth than Vitamin C (VC) on lower dosage level. Contrary to that at higher (2.0 mg) dosage level VC holds a linear growth edge.

Statistical_Simulation_with_Basic_Inferential_Analysis

Md Ahmed

March 28th, 2017