Statistical Inference Project

Part 1 : Simulation Exercise

In this project I will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. I will investigate the distribution of averages of 40 exponentials.

First, we’ll load the necessary libraries for the project

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

We run the histogram of the of the distribution of the mean of 40 exponentials

means=NULL
lamda<-0.2
n_simul<-1000
set.seed(1000)
for (i in 1 : n_simul) means = c(means, mean(rexp(40,lamda)))
means<-data.frame(means=means)
g<-ggplot(means, aes(x = means)) +
  geom_histogram(binwidth = 0.1, color = "black", fill = "lightblue") +
  labs(title = "Histogram of the Means of 40 Exponential Distributions",
       x = "Means",
       y = "Frequency")
g

Comparing the sample means with the theoritical mean

theoretical_mean <- 1 / lamda
k_values <- 0:40
P_X_k <- lamda * exp(-lamda * k_values)
sum_result <- sum(k_values * P_X_k)
sample_mean<-mean(means$means)
vline_data <- data.frame(
  x = c(theoretical_mean, sample_mean, sum_result),
  label = c("theoretical_mean", "sample_mean", "expected_value")  # Labels for the legend
)
g<-g+geom_vline(data = vline_data, aes(xintercept = x, color = label), linetype = "dashed", size = 1)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

print(sample_mean)

## [1] 4.986963

print(theoretical_mean)

## [1] 5

print(sum_result)

## [1] 4.969574

We can see that the expected value is close to the result for the simulation while the theoritical sample that takes into account an infinite number of values for the exponential is perfect (5)

Comparing the sample variance with the theoretical variance

sample_variance <- var(means$means)
theoretical_variance <- ((1/lamda)^2)/40
cbind(sample_variance, theoretical_variance)

##      sample_variance theoretical_variance
## [1,]        0.654343                0.625

The variances are close here

Showing that the distribution is approximately normal

g <- ggplot(means, aes(x=means)) +
  geom_histogram(aes(y=..density.., fill=..density..)) +
  labs(title = "Histogram of Exponentials over Simulations", y="Density", x="Average") +
  stat_function(fun = dnorm, args = list(mean=1/lamda, sd=sqrt(theoretical_variance)), color="green")+
  geom_density(stat="density", alpha=I(0.2))
g

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution is somehow close to the one for the normal distribution. What we’ll do here is to increase the number of simulations and to see the results

mea=NULL
lamda=0.2
for (i in 1 : 10000) mea = c(mea, mean(rexp(40,lamda)))
mea<-data.frame(mea=mea)

g <- ggplot(mea, aes(x=mea)) +
  geom_histogram(aes(y=..density.., fill=..density..)) +
  labs(title = "Histogram of Exponentials over Simulations", y="Density", x="Average") +
  stat_function(fun = dnorm, args = list(mean=1/lamda, sd=sqrt(theoretical_variance)), color="green")+
  geom_density(stat="density", alpha=I(0.2))
g

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Then when we do the analysis with the plots we have here, that there is a kind of supersposition between the two densities which means that there is kind of equality between these two distributions ## Part 2: Basic inferential data analysis

library(datasets)
data(ToothGrowth)
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

We need first to inspect some of the data with some tables

print(head(ToothGrowth))

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

print(table(ToothGrowth$supp))

## 
## OJ VC 
## 30 30

print(table(ToothGrowth$dose))

## 
## 0.5   1   2 
##  20  20  20

After loading the dataset, we’ll begin by the exploratory data analysis :

After inspecting the data we can see that there is only 2 supplements and 3 doses which gives only 6 combinations possible between these two variable, let’s see the effect of each combination on the length :

boxplot(len ~ supp * dose, data=ToothGrowth,xlab="Supplement and dose", ylab="Tooth Length", main="Tooth Growth by supplements and dosis", col=c("blue", "lightblue", "red", "pink", "green", "cyan"))

Given the three columns in our data, we have given a very detailed analysis of the variables that we have showing how the combination relates to the lenght, however we can also detail one thing if we see each supplement alone we clearly observe a linear relationship between the doses. Let’s plot them and see.

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

plot1 <- ggplot(ToothGrowth[ToothGrowth$supp == "VC", ], aes(x = dose, y = len)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  ggtitle("Length vs Doses for VC") +
  theme_minimal()

plot2 <- ggplot(ToothGrowth[ToothGrowth$supp == "OJ", ], aes(x = dose, y = len)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Length vs Doses for OJ") +
  theme_minimal()

grid.arrange(plot1, plot2, ncol = 2)

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

Providing a basic summary of the data

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Using confidence intervals and hypothesis tests to compare tooth growth by supp and dose with conclusions

t.test(len ~ supp, data = ToothGrowth)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

The p-value (0.06063) is slightly above the common significance level of 0.05. This means there is not strong enough evidence to reject the null hypothesis at the 5% significance level. However, it is close to 0.05, suggesting a trend toward a significant difference.

The 95% confidence interval includes 0, which also indicates that the test does not find a statistically significant difference between the means at the 5% level.

There is some indication that the means might be different (with a p-value close to 0.05), but it is not statistically significant at the 5% level. The results suggest that while OJ appears to have a higher mean length than VC, the evidence isn’t strong enough to confirm a significant difference between them

dose_0.5 <- ToothGrowth[ToothGrowth$dose == 0.5, ]
dose_1 <- ToothGrowth[ToothGrowth$dose == 1, ]
dose_2 <- ToothGrowth[ToothGrowth$dose == 2, ]
t_test_0.5_vs_1 <- t.test(dose_0.5$len, dose_1$len)
t_test_0.5_vs_2 <- t.test(dose_0.5$len, dose_2$len)
t_test_1_vs_2 <- t.test(dose_1$len, dose_2$len)
print(t_test_0.5_vs_1)

## 
##  Welch Two Sample t-test
## 
## data:  dose_0.5$len and dose_1$len
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean of x mean of y 
##    10.605    19.735

print(t_test_0.5_vs_2)

## 
##  Welch Two Sample t-test
## 
## data:  dose_0.5$len and dose_2$len
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean of x mean of y 
##    10.605    26.100

print(t_test_1_vs_2)

## 
##  Welch Two Sample t-test
## 
## data:  dose_1$len and dose_2$len
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean of x mean of y 
##    19.735    26.100

he test results provide strong evidence to reject the null hypothesis that the means of doses length are equal . There is a significant mean difference.

Statistical Inference Project

A.Rachid

2024-11-04

Graded Assignment: Statistical Inference Course Project

Part 1 : Simulation Exercise