The project consists of two parts:
In this project I will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. I will investigate the distribution of averages of 40 exponentials.
First, we’ll load the necessary libraries for the project
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We run the histogram of the of the distribution of the mean of 40 exponentials
means=NULL
lamda<-0.2
n_simul<-1000
set.seed(1000)
for (i in 1 : n_simul) means = c(means, mean(rexp(40,lamda)))
means<-data.frame(means=means)
g<-ggplot(means, aes(x = means)) +
geom_histogram(binwidth = 0.1, color = "black", fill = "lightblue") +
labs(title = "Histogram of the Means of 40 Exponential Distributions",
x = "Means",
y = "Frequency")
g
Comparing the sample means with the theoritical mean
theoretical_mean <- 1 / lamda
k_values <- 0:40
P_X_k <- lamda * exp(-lamda * k_values)
sum_result <- sum(k_values * P_X_k)
sample_mean<-mean(means$means)
vline_data <- data.frame(
x = c(theoretical_mean, sample_mean, sum_result),
label = c("theoretical_mean", "sample_mean", "expected_value") # Labels for the legend
)
g<-g+geom_vline(data = vline_data, aes(xintercept = x, color = label), linetype = "dashed", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(sample_mean)
## [1] 4.986963
print(theoretical_mean)
## [1] 5
print(sum_result)
## [1] 4.969574
g
We can see that the expected value is close to the result for the
simulation while the theoritical sample that takes into account an
infinite number of values for the exponential is perfect (5)
Comparing the sample variance with the theoretical variance
sample_variance <- var(means$means)
theoretical_variance <- ((1/lamda)^2)/40
cbind(sample_variance, theoretical_variance)
## sample_variance theoretical_variance
## [1,] 0.654343 0.625
The variances are close here
Showing that the distribution is approximately normal
g <- ggplot(means, aes(x=means)) +
geom_histogram(aes(y=..density.., fill=..density..)) +
labs(title = "Histogram of Exponentials over Simulations", y="Density", x="Average") +
stat_function(fun = dnorm, args = list(mean=1/lamda, sd=sqrt(theoretical_variance)), color="green")+
geom_density(stat="density", alpha=I(0.2))
g
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution is somehow close to the one for the normal
distribution. What we’ll do here is to increase the number of
simulations and to see the results
mea=NULL
lamda=0.2
for (i in 1 : 10000) mea = c(mea, mean(rexp(40,lamda)))
mea<-data.frame(mea=mea)
g <- ggplot(mea, aes(x=mea)) +
geom_histogram(aes(y=..density.., fill=..density..)) +
labs(title = "Histogram of Exponentials over Simulations", y="Density", x="Average") +
stat_function(fun = dnorm, args = list(mean=1/lamda, sd=sqrt(theoretical_variance)), color="green")+
geom_density(stat="density", alpha=I(0.2))
g
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Then when we do the analysis with the plots we have here, that there is
a kind of supersposition between the two densities which means that
there is kind of equality between these two distributions ## Part 2:
Basic inferential data analysis
library(datasets)
data(ToothGrowth)
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
We need first to inspect some of the data with some tables
print(head(ToothGrowth))
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
print(table(ToothGrowth$supp))
##
## OJ VC
## 30 30
print(table(ToothGrowth$dose))
##
## 0.5 1 2
## 20 20 20
After loading the dataset, we’ll begin by the exploratory data analysis :
After inspecting the data we can see that there is only 2 supplements and 3 doses which gives only 6 combinations possible between these two variable, let’s see the effect of each combination on the length :
boxplot(len ~ supp * dose, data=ToothGrowth,xlab="Supplement and dose", ylab="Tooth Length", main="Tooth Growth by supplements and dosis", col=c("blue", "lightblue", "red", "pink", "green", "cyan"))
Given the three columns in our data, we have given a very detailed analysis of the variables that we have showing how the combination relates to the lenght, however we can also detail one thing if we see each supplement alone we clearly observe a linear relationship between the doses. Let’s plot them and see.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
plot1 <- ggplot(ToothGrowth[ToothGrowth$supp == "VC", ], aes(x = dose, y = len)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
ggtitle("Length vs Doses for VC") +
theme_minimal()
plot2 <- ggplot(ToothGrowth[ToothGrowth$supp == "OJ", ], aes(x = dose, y = len)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("Length vs Doses for OJ") +
theme_minimal()
grid.arrange(plot1, plot2, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Providing a basic summary of the data
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
Using confidence intervals and hypothesis tests to compare tooth growth by supp and dose with conclusions
t.test(len ~ supp, data = ToothGrowth)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
The p-value (0.06063) is slightly above the common significance level of 0.05. This means there is not strong enough evidence to reject the null hypothesis at the 5% significance level. However, it is close to 0.05, suggesting a trend toward a significant difference.
The 95% confidence interval includes 0, which also indicates that the test does not find a statistically significant difference between the means at the 5% level.
There is some indication that the means might be different (with a p-value close to 0.05), but it is not statistically significant at the 5% level. The results suggest that while OJ appears to have a higher mean length than VC, the evidence isn’t strong enough to confirm a significant difference between them
dose_0.5 <- ToothGrowth[ToothGrowth$dose == 0.5, ]
dose_1 <- ToothGrowth[ToothGrowth$dose == 1, ]
dose_2 <- ToothGrowth[ToothGrowth$dose == 2, ]
t_test_0.5_vs_1 <- t.test(dose_0.5$len, dose_1$len)
t_test_0.5_vs_2 <- t.test(dose_0.5$len, dose_2$len)
t_test_1_vs_2 <- t.test(dose_1$len, dose_2$len)
print(t_test_0.5_vs_1)
##
## Welch Two Sample t-test
##
## data: dose_0.5$len and dose_1$len
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.983781 -6.276219
## sample estimates:
## mean of x mean of y
## 10.605 19.735
print(t_test_0.5_vs_2)
##
## Welch Two Sample t-test
##
## data: dose_0.5$len and dose_2$len
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.15617 -12.83383
## sample estimates:
## mean of x mean of y
## 10.605 26.100
print(t_test_1_vs_2)
##
## Welch Two Sample t-test
##
## data: dose_1$len and dose_2$len
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.996481 -3.733519
## sample estimates:
## mean of x mean of y
## 19.735 26.100
he test results provide strong evidence to reject the null hypothesis that the means of doses length are equal . There is a significant mean difference.