Coursera - Statistical Inference

An overview of the project

This work completes the Statistical Inference course project in the Coursera Data Science class. It consists of two parts. Part 1: run a simulation to create some random data and analyse it in the light of the Central Limit Theorem. Part 2: using one of the datasets in the R datasets library, do some analysis, draw some inferences and state a conclusion about the data.

Part 1 : Simulation exercise

Using rexp, we will draw 40 random values from an exponential distribution with lambda = 0.2, take the mean of these 40 draws, and repeat this to build a data vector with 1500 of these means. Then we analyse the distribution of the means. We are looking for the kind of distribution they follow.

#Load libraries to help 
library(ggplot2)
#Set parameters
ECHO=TRUE
set.seed(2222)
lambda=0.2
exponentials=40
#Create the values
simulationMeans = NULL
for (i in 1:1500)simulationMeans = c(simulationMeans,mean(rexp(exponentials, lambda)))
summary(simulationMeans)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.847   4.450   4.928   4.967   5.441   7.248
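
Growing a vector with c() inside a loop works, but the same simulation can be written more compactly. A minimal sketch using replicate, which should reproduce the same means for the same seed; the names nSimulations and simulationMeansAlt are introduced here only for illustration:

```r
# Same simulation without growing a vector inside a loop
set.seed(2222)
lambda <- 0.2
exponentials <- 40
nSimulations <- 1500  # hypothetical name for the number of means

# replicate() evaluates the expression nSimulations times and collects the results
simulationMeansAlt <- replicate(nSimulations, mean(rexp(exponentials, lambda)))
summary(simulationMeansAlt)
```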

#Obtain the Mean of the Means
mean(simulationMeans)

## [1] 4.967263

#calculate the theoretical Mean
theoreticalmean<-lambda^-1
theoreticalmean

## [1] 5

#Let's plot a histogram
hist(simulationMeans, col="#B1EFFF", main="Sample Mean versus Theoretical Mean", breaks=20)
#and draw the two lines for the means (lwd takes a number, and theoreticalmean is already a single value)
abline(v=mean(simulationMeans), lwd=2, col="#149403")
abline(v=theoreticalmean, lwd=2, col="#d90b23")
text(6.5, 150, paste("Actual mean (green)= ", round(mean(simulationMeans),3), "\nTheoretical mean (red)= ", round(theoreticalmean,3)), col="#888888")

abs(mean(simulationMeans)-theoreticalmean)

## [1] 0.03273665

The difference between the simulated and the theoretical mean is very small (about 0.03), so the simulation agrees with the theory: the distribution of the sample means is centred on the theoretical mean 1/lambda = 5, which is consistent with what the Central Limit Theorem predicts. As the number of simulated means increases, their average gets closer to the theoretical value.
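
The claim that the average gets closer to the theoretical value as more means are simulated can be checked with a running average. This is only a sketch: it regenerates the means with the same parameters, and the names means and runningMean are illustrative.

```r
# Running average of the simulated means versus the theoretical mean
set.seed(2222)
lambda <- 0.2
exponentials <- 40
means <- replicate(1500, mean(rexp(exponentials, lambda)))

# Average of the first k means, for k = 1, ..., 1500
runningMean <- cumsum(means) / seq_along(means)

plot(runningMean, type="l", col="#149403",
     xlab="Number of simulated means", ylab="Running average",
     main="Running average of the sample means")
# Theoretical mean 1/lambda as a horizontal reference line
abline(h=1/lambda, lwd=2, col="#d90b23")
```

The green line should settle near the red reference line as the number of means grows.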

Compare Variances

#Sample variance
simulationvar<-var(simulationMeans)
simulationvar

## [1] 0.5899575

#Theoretical Variance
Theoreticalvar<-(lambda * sqrt(exponentials))^-2
Theoreticalvar

## [1] 0.625

# Comparison 
simulationvar-Theoreticalvar

## [1] -0.0350425

Comparing the two values, the variance of the simulated means (0.590) is very close to the theoretical variance (0.625).
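
The theoretical variance used above comes from the variance of a single exponential draw, (1/lambda)^2, divided by the number of draws in each mean. The arithmetic can be sketched as follows; the names n, sigma, theoreticalVarCheck and theoreticalSd are illustrative:

```r
lambda <- 0.2
n <- 40                      # draws averaged in each mean

sigma <- 1 / lambda          # standard deviation of one exponential draw
theoreticalVarCheck <- sigma^2 / n     # variance of the mean of n draws
theoreticalSd <- sigma / sqrt(n)       # standard deviation (standard error) of that mean

# Same value as the expression (lambda * sqrt(n))^-2
stopifnot(isTRUE(all.equal(theoreticalVarCheck, (lambda * sqrt(n))^-2)))

theoreticalVarCheck
theoreticalSd
```

Note that the standard deviation of the means is the square root of 0.625 (about 0.79), which is the value a normal curve for the means needs as its sd.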

Can we say that the distribution is normal?

#Let's draw the histogram for the simulation
hist(simulationMeans, prob=TRUE, col="#FFF4A1", main="Distribution of the Means", breaks=20)
text(6.4, 0.55, paste("red: density function for the simulation\nblue: the theoretical normal distribution"), col="#888888")
#create random values from a normal distribution with the theoretical mean and standard deviation
#(sd must be the square root of the theoretical variance 0.625, not the variance itself)
x<-rnorm(10000, mean=theoreticalmean, sd=sqrt(Theoreticalvar))
#and compare with the kernel density estimate, which spreads the mass of simulationMeans over a grid of 512 points and smooths it with a Gaussian kernel by default
lines(density(simulationMeans), lwd=3, col="#E6652E")
lines(density(x), lwd=3, col="#4444AB")

Conclusion of the simulation

The 1500 means, each obtained from 40 random draws from an exponential distribution with lambda = 0.2, follow a distribution that is close to normal.
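
Beyond comparing density curves by eye, approximate normality can be checked in other ways. A minimal sketch, assuming the same simulation parameters: a normal Q-Q plot, plus the share of means that fall within 1.96 theoretical standard errors of the theoretical mean, which would be about 95% under normality. The names means, se and coverage are illustrative.

```r
set.seed(2222)
lambda <- 0.2
exponentials <- 40
means <- replicate(1500, mean(rexp(exponentials, lambda)))

# Points close to the reference line indicate an approximately normal distribution
qqnorm(means, main="Normal Q-Q plot of the simulated means")
qqline(means, lwd=2, col="#d90b23")

# Share of means within 1.96 standard errors of the theoretical mean
se <- (1 / lambda) / sqrt(exponentials)
coverage <- mean(abs(means - 1 / lambda) <= 1.96 * se)
coverage
```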

Part 2 : Basic Inferential Data Analysis

#Use the ToothGrowth dataset
data("ToothGrowth")

This dataset refers to the effect of Vitamin C on Tooth Growth in Guinea Pigs. The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice, coded as OJ, or ascorbic acid (a form of vitamin C), coded as VC.

Source of information: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/ToothGrowth.html

What can we say about the influence of the dose and the delivery method on tooth growth in guinea pigs?

#Let's have an overview of each variable in the dataset
summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

#View the first occurrences
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

#Know the unique values for each variable
unique(ToothGrowth$len)

##  [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2  5.2  7.0 16.5 15.2 17.3 22.5 13.6
## [15] 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5 17.6
## [29]  9.7  8.2  9.4 19.7 20.0 25.2 25.8 21.2 27.3 22.4 24.5 24.8 30.9 29.4
## [43] 23.0

unique(ToothGrowth$supp)

## [1] VC OJ
## Levels: OJ VC

unique(ToothGrowth$dose)

## [1] 0.5 1.0 2.0
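
The summaries above describe each variable separately. To see how tooth length varies across the six combinations of delivery method and dose, the group means can be tabulated. A small sketch using aggregate from base R; the name groupMeans is illustrative:

```r
data("ToothGrowth")

# Mean tooth length for each combination of delivery method and dose
groupMeans <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
groupMeans

# Number of animals in each combination (10 per group in this dataset)
table(ToothGrowth$supp, ToothGrowth$dose)
```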

#Copy the data and turn dose into a labelled factor (dose is numeric, so levels() alone does not relabel it)
t=ToothGrowth
t$dose<-factor(t$dose, levels=c(0.5, 1, 2), labels=c("0.5mg", "1mg", "2mg"))
ggplot(t, aes(x=supp, y=len))+facet_grid(.~dose)+geom_boxplot(aes(fill=dose), show.legend=TRUE)+labs(title="Tooth length of Guinea Pigs \naccording to doses and supply methods", x="Supply Type", y="Tooth Length")

Assuming that the sample of 60 guinea pigs is representative of the population and that the doses and delivery methods were randomly assigned, the boxplots suggest that increasing the dose is associated with increased tooth growth. At the lower doses (0.5 and 1 mg/day), orange juice reached greater tooth lengths than ascorbic acid. At a dose of 2 mg/day, the mean tooth growth is roughly equivalent for both delivery methods.
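
The conclusion above is read from the boxplots. It could be backed by hypothesis tests. The following is only a sketch, assuming independent groups and using Welch's unequal-variance t-test (the default of t.test); the names doses, suppPValues and dosePValue are illustrative. With several comparisons, a correction for multiple testing may also be worth considering.

```r
data("ToothGrowth")

# Compare the two delivery methods (OJ vs VC) separately within each dose
doses <- sort(unique(ToothGrowth$dose))
suppPValues <- sapply(doses, function(d) {
  t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ])$p.value
})
names(suppPValues) <- paste0(doses, "mg")
suppPValues

# Compare the highest and the lowest dose, pooling both delivery methods
dosePValue <- t.test(ToothGrowth$len[ToothGrowth$dose == 2],
                     ToothGrowth$len[ToothGrowth$dose == 0.5])$p.value
dosePValue
```

A p-value below the chosen significance level (for example 0.05) would indicate a difference in mean tooth length between the two groups being compared.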

My best regards. Thanks for reading. C. Werneck

Coursera - Statistical Inference - Course Project

cwerneck - Cláudia Werneck