Import the data set called workers:
workers<-read.file("/home/emesekennedy/Data/Ch5/workers.txt")
## Reading data with read.table()
We will treat the 1000 observations (workers) as the entire population.
The mean salary \(\mu\) for the population is
mu<-mean(~salary, data=workers)
mu
## [1] 23600.19
The standard deviation\(\sigma\) for the salary of the population is
sigma<-sd(~salary, data=workers)
sigma
## [1] 19269.82
Now, let’s take a SRS of size 400 and call the entire sample S:
S<-sample(workers, 400, replace=T)
The mean salary \(\bar{x}\) for this one sample S is
mean(~salary, data=S)
## [1] 23389.45
The standard deviation \(s\) for this one sample S is
sd(~salary, data=S)
## [1] 15357.52
Now, let’s take 1000 samples of size 400, and record the mean for each sample in the variable called samples:
samples<-do(1000)*mean(~salary, data=sample(workers, 400, replace=T))
The above command created a data set with 1000 observations, here every line of this data set represent the mean for one of the samples. To make things less confusing, let’s rename the name of the column in samples to xbar (by default of the do command, it is called result):
names(samples)[1]<-"xbar"
Now, let’s find the mean \(\mu_{\bar{x}}\) for these sample means:
mean(~xbar, data=samples)
## [1] 23599.35
\(\bar{x}\) is a random variable, and when we take large samples, it’s mean (\(\mu_{\bar{x}}\)) should be close to the population mean (\(\mu\)).
We can also find the standard deviation \(\sigma_{\bar{x}}\) for these sample means:
sd(~xbar, data=samples)
## [1] 964.064
The standard deviation of \(\bar{x}\) should be close to the population standard deviation divided by the square root of the sample size (i.e close to \(\frac{\sigma}{\sqrt{n}}=\frac{19269.82}{\sqrt{400}}=963.4911\))
From the facts about sample means, the distribution of the sample means should be more close to Normal than individual observations. We can verify this by using a Normal quantile plot. First, let’s create a Normal quantile plot for the entire population:
xqqmath(~salary, data=workers)
The plot clearly indicates that the salaries for the entire population are not Normally distributed. In fact, when we create a histogram of the salaries for the population, we can see that the data is skewed to the right:
histogram(~salary, data=workers)
Now let’s create a Normal quantile plot for the sample means:
xqqmath(~xbar, data=samples)
We can see that the sample means are very close to Normally distributed.
Now, let’s create a histogram to see the distribution of the sample means:
histogram(~xbar, data=samples)
By the Central Limit Theorem, when the number of observations in each of the samples (here, \(n=400\)) is large, this distribution should be approximately Normal \(N(\mu_{\bar{x}}, \sigma_{\bar{x}})=N(\mu, \frac{\sigma}{\sqrt{n}})\). So for this example, the distribution should be close to \(N(23600.19, 963.4911)\). To verify that this is true, let’s add a Normal curve with this distribution to the histogram:
plotDist("norm", mean=mu, sd=sigma/sqrt(400), add=T)
As you can see, the Normal curve captures the shape of the distribution fairly well.
We can use this Normal distribution to answer questions about probabilities for the mean a sample. For example, to find the probability that the mean salary \(\bar{x}\) for a SRS of size 400 is between 23000 and 24000, we can use the pdist command:
pdist("norm", mean=mu, sd=sigma/sqrt(400), c(23000, 24000))
## [1] 0.2666658 0.6609158
So the probability that the average salary for a SRS of size 400 is between 23000 and 24000 is \(P(23000\le\bar{x}\le 24000)=0.394\).
The distribution of the random variable X that records the number of heads in the four tosses is \(B(4, 0.5)\). We can plot this distribution using the plotDist command:
plotDist("binom", 4, 0.5)
We can enter the optional command kind=“histogram” to make the graph look like a histogram:
plotDist("binom", 4, 0.5, kind="histogram")
This histogram is of course the same as we saw before in Chapter 4, but this is an easier way to create it.