Import the data set called workers:
workers<-read.file("/home/emesekennedy/Data/Ch5/workers.txt")
## Reading data with read.table()
We will treat the 1000 observations (workers) as the entire population.
The mean salary \(\mu\) for the population is
mu<-mean(~salary, data=workers)
mu
## [1] 23600.19
The standard deviation\(\sigma\) for the salary of the population is
sigma<-sd(~salary, data=workers)
sigma
## [1] 19269.82
Now, let’s take a SRS of size 400 and call the entire sample S:
S<-sample(workers, 400, replace=T)
The mean salary \(\bar{x}\) for this one sample S is
mean(~salary, data=S)
## [1] 23652.33
The standard deviation \(s\) for this one sample S is
sd(~salary, data=S)
## [1] 17463.28
Now, let’s take 1000 samples of size 400, and record the mean for each sample in the variable called samples:
samples<-do(1000)*mean(~salary, data=sample(workers, 400, replace=T))
The above command created a data set with 1000 observations, here every line of this data set represent the mean for one of the samples. To make things less confusing, let’s rename the name of the column in samples to xbar (by default of the do command, it is called result):
names(samples)[1]<-"xbar"
Now, let’s find the mean \(\mu_{\bar{x}}\) for these sample means:
mean(~xbar, data=samples)
## [1] 23529.06
\(\bar{x}\) is a random variable, and when we take large samples, it’s mean (\(\mu_{\bar{x}}\)) should be close to the population mean (\(\mu\)).
We can also find the standard deviation \(\sigma_{\bar{x}}\) for these sample means:
sd(~xbar, data=samples)
## [1] 966.7334
The standard deviation of \(\bar{x}\) should be close to the population standard deviation divided by the square root of the sample size (i.e close to \(\frac{\sigma}{\sqrt{n}}=\frac{19269.82}{\sqrt{400}}=963.4911\))
Now, let’s create a histogram to see the distribution of the 1000 sample means:
histogram(~xbar, data=samples)
By the Central Limit Theorem, when the number of observations in each of the samples (here, \(n=400\)) is large, this distribution should be approximately Normal \(N(\mu_{\bar{x}}, \sigma_{\bar{x}})=N(\mu, \frac{\sigma}{\sqrt{n}})\). So for this example, the distribution should be close to \(N(23600.19, 963.4911)\). To verify that this is true, let’s add a Normal curve with this distribution to the histogram:
plotDist("norm", mean=mu, sd=sigma/sqrt(400), add=T)
As you can see, the Normal curve captures the shape of the distribution fairly well.
We can use this Normal distribution to answer questions about probabilities for the mean a sample. For example, to find the probability that the mean salary \(\bar{x}\) for a SRS of size 400 is between 23000 and 24000, we can use the pdist command:
pdist("norm", mean=mu, sd=sigma/sqrt(400), c(23000, 24000))
## [1] 0.2666658 0.6609158
So the probability that the average salary for a SRS of size 400 is between 23000 and 24000 is \(P(23000\le\bar{x}\le 24000)=0.394\).