One of the great advantages of having statistical software like R available, even for a course in statistical theory, is the ability to simulate samples from various probability distributions and statistical models.This area is worth studying when learning R programming because simulations can be computationally intensive so learning effective programming techniques is worthwhile. We begin with some background on R functions associated with distributions.
The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter.The mean of exponential distribution is 1/lambda and the standard deviation is also also 1/lambda.Set lambda = 0.2 for all of the simulations. In this simulation, you will investigate the distribution of averages of 40 exponential(0.2)s. Note that you will need to do a thousand or so simulated averages of 40 exponentials.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponential(0.2)s. You should
A simulation study typically begins with a probability model for the data and simulation of responses from this model. For several common probability distributions R provides a set of functions, sometimes called a d-p-q-r family, to evaluate the probability density function (for continuous distributions - the probability mass function for discrete distributions), the cumulative distribution function or the quantile function (inverse of the c.d.f) and for simulation of a random sample.
As shown in the table the names of the functions are composed of the initial letter indicating d - density function (or probability mass function) p - (cumulative) probability function (values are always in the interval [0,1]) q - quantile function - the inverse (more-or-less) of the p function r - simulation of a random sample from the distributionand an abbreviation of the distribution name, as shown in this table,which also states the parameter names used for the distribution.
Although the “random” numbers generated by the r
The random number stream depends on a seed value. If we want to produce a reproducible example (so, for example,we can talk about properties of a particular sample and the reader can generate the same sample for herself so she can examine it) then we set the seed to a known value before generating the sample. The stored seed is a complicated structure but we can set it to an integer, often something trivial like
set.seed(1)
Ensure a reproducible stream If we simulate a sample,
s1 = rexp(10)
s1
## [1] 0.7552 1.1816 0.1457 0.1398 0.4361 2.8950 1.2296 0.5397 0.9566 0.1470
Note that the echo = FALSE
parameter was added to the code chunk to prevent printing of the R code that generated the plot.
s1 = rexp(10)
s1
## [1] 1.3907 0.7620 1.2376 4.4239 1.0545 1.0352 1.8760 0.6547 0.3369 0.5885
Then simulate a second sample
s2 = rexp(10)
s2
## [1] 2.36452 0.64189 0.29412 0.56587 0.10607 0.05944 0.57871 3.95893
## [9] 1.17331 0.99681
we get different values. However, if we reset the seed to 1 and then simulate the second sample we reproduce the original values.
Let’s come to our distribution.There are two parameters exist in Exponential Distribution.The mean & The Standard Deviation.The mean of exponential distribution is 1/lambda and the standard deviation is also also 1/lambda. Here, given that, Lambda = .2 So, mean = sd = 1/.2 = 5 Also, instruction is to use 4 exponentials. So, n = 4
In this project, we will explore the distribution of sample means when the samples are drawn from an exponential distribution.
We draw one random sample of size 4 from the exponentials distribution of sample means after 1000 simulation. We will use the command rexp(n=4,rate=.2). In R “rate” is what we call Lambda, n is the sample size and “rexp” stands for random generation from the exponential distribution.We can store your sample in an object called “data” using the command, data <- rexp(n = 4, rate = .2). Type data to viewour sample.
data = rexp(n = 4, rate = .2)
data
## [1] 7.1764 0.1863 1.6201 6.6023
Now, we can calculate mean of the sampling ( exponential ) distribution. mean(data)
Repeat step 1.
data = rexp(n = 4, rate = .2)
data
## [1] 1.018 5.114 1.509 3.626
mean(data)
## [1] 2.816
To understand the behavior of all possible sample means from samples of size 4, we need to repeat step 1, many, many times and record the resulting sample means. This is tedious to do by hand, so we use the following lines of code to generate 1000 sample of size 4 from an exponential) distribution.generate 4000 (ie. 4 X 1000) random samples from Exponential Distribution.
simdata = rexp(n = 4000, rate = .2)
format simdata as matrix
matrixdata = matrix(simdata, nrow = 1000, ncol = 4)
dim(matrixdata)
## [1] 1000 4
Now type matrixdata to see the random samples you just generated. Note each row is one sample of size 4. Since there are 4000 rows, we have 1000 samples of size 4.
str(matrixdata)
## num [1:1000, 1:4] 3.76 1.18 5.4 5.14 6.46 ...
Now get the sample mean of each row of data takes the mean of each row
means.exp = apply(matrixdata, 1, mean)
Now, mean, sd & var of the 1000 sample means
mean(means.exp)
## [1] 5.034
sd(means.exp)
## [1] 2.527
var(means.exp)
## [1] 6.388
Also, we shoyld see the histogram
hist(means.exp)
Estimate the mean of the 1000 sample means : 5.034 How does the estimate compare to the true value of mean (ie.1/lambda= 1/.2 =5) ? The answer is: The estimate value of the mean of the 1000 sample means is very close to sample mean.so, the distribution is the centered at and if we compare it to the theoretical center of the distribution,both are very close to each other. Estimate sd using the 1000 sample means : 2.527 Estimate variance using the 1000 sample means : 6.388
Show how variable it is compare to the theoretical variance of the distribution ( ie. sqr(sd) = sqr(5) = 25 ). The answer is that a good percentage of variance decreases.
Do the sample means appear to be normally distributed? No, the sample means do not appear to be normally distributed.From the picture of Histogram, it is clear that the distribution is right skewed.
Now repeat earlier Steps once again, for samples of size 40. So start by generating 1000 random samples of size 40 from Exp(2).
simdata = rexp(n=40000,rate=.2)
matrixdata = matrix(simdata,nrow=1000,ncol=40)
means.exp = apply(matrixdata,1,mean)
mean(means.exp)
## [1] 4.994
sd(means.exp)
## [1] 0.799
var(means.exp)
## [1] 0.6384
hist(means.exp)
Estimate the mean of the 1000 sample means when sample size is 40 : 4.994
The answer is: The estimate value of the mean of the 1000 sample means is very close to sample mean.so, the distribution is the centered at and if we compare it to the theoretical center of the distribution,both are very close to each other. Estimate sd using the 1000 sample means : 0.799 Estimate variance using the 1000 sample means : .6384
The answer is that a good percentage of variance decreases more compare to the first case.
YES. the sample means do appear to be normally distributed.From the picture of Histogram, it is clear that the distribution is normally distributed.
Now, One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we have calculated the mean of means of samples of size 40 after 1000 times of simulation.
Return for a moment to the question that first motivated our analysis: based on this sample, what can we infer about the population? Based on these samples, the best estimate would be the sample mean (here in this is 5). That serves as a good point estimate but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval. We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate.
se = sd(means.exp)/sqrt(40)
lower = mean(means.exp) - 1.96 * se
upper = mean(means.exp) + 1.96 * se
c(lower, upper)
## [1] 4.746 5.242
This is an important inference that weve just made: even though we dont know what the full population looks like, were 95% confident that the true mean lies between the values lower and upper. Though, there are a few conditions that must be met for this interval to be valid, but, that discussion is beyound the scope of this project. However, the lower & upper limit of the values are
respectively. In this case we have the luxury of knowing the true population mean since we have data on the entire population.So, we know that, our estimate is correct.