Part 1: Samples from a population
For this part of the lab you must work in a small group.
There are several bins of beads in the classroom. Right now, join a
group around one of the bins (aim for the same number of students in
each group), and introduce yourself. You will be working together for
Part 1.
Premise
The bin of beads represents a small town, with each bead representing
a person in town with a specific opinion. The green beads support a
ballot initiative to increase sales tax to fund bicycle trails, and the
white beads oppose the initiative. The students in your group are pollsters:
you want to determine how much support the initiative has in your
town.
The sampling paddle is your tool to take simple random samples from
the town’s population.
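The paddle can be imitated in base R as a rough sketch. The bin size (1000 beads) and the 50/50 split below are assumptions made only for illustration, not facts about the classroom bins; the sample size of 40 matches the paddle used in this lab.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical town: 1000 beads, half green (support) and half white (oppose)
bin <- rep(c("green", "white"), times = c(500, 500))

# One paddle scoop: a simple random sample of 40 beads, drawn without replacement
paddle <- sample(bin, size = 40)

# Sample statistic: the proportion of green beads in the scoop
p_hat <- mean(paddle == "green")
p_hat
```

Rerunning the last three lines without resetting the seed gives a different `p_hat` each time, which is the sample-to-sample variability that the rest of Part 1 explores.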
TASK 1.1 What is the population parameter you are
interested in learning about? Use correct notation.
**Response** \(p\), the proportion of people in the town who support the ballot initiative.
TASK 1.2 Using the sampling paddle, take a sample of
the population and answer the questions below. Each person in the
group must take their own sample.
**Response**
- What is n, the sample size? 40
- What is your sample statistic, both notation and value? \(\hat{p}\) = 17/40 = 0.425
TASK 1.3 With your group, discuss the differences
between your individual sample estimates. Summarize your conversation in
a sentence or two here.
**Response** Half of our group had sample statistics slightly above .5 (between .5 and .53), and the other half had sample statistics below .5 (between .375 and .5), so the low estimates were spread over a wider range than the high ones.
TASK 1.4 Working together, take at least 30 samples
(more if you’d like!) and make a vector of the sample statistics in R.
Make a dotplot of your sample statistics and discuss the center and
spread of the dotplot.
**Response** The dotplot ranges from .325 to .575, and the center is around .5.
# gf_dotplot() and the formula-style sd() used below come with the mosaic package
library(mosaic)
SampleStatistics <- c(.375, .425, .51, .53, .45, .5, .475, .45, .55, .575, .475, .475, .5, .525, 22/40, 19/40, 21/40, 23/40, 20/40, 22/40, 18/40, 21/40, 20/40, 20/40, 13/40, 17/40, 22/40, 20/40, 17/40, 23/40)
gf_dotplot( ~ SampleStatistics)
TASK 1.5 Based on everything you’ve done so far,
what do you think is the best guess for the population parameter in your
town? Write a sentence justifying your answer.
**Response** Based on our sample statistics, our best guess is that the population parameter is about .5, because the distribution of our sample statistics is roughly bell-shaped and centered near .5.
TASK 1.6 Calculate the standard deviation of your
sample statistics. The specific name for this value is the standard
error. Whenever we discuss the standard error of a statistic, it
describes how variable the statistic is when it is calculated from
different samples drawn from the same population.
**Response** SE = .05902488
sd(~SampleStatistics)
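As a cross-check that does not depend on the mosaic formula interface, the same summary can be sketched in base R. The vector is copied from 1.4; the names `p_hats`, `center`, and `se` are illustrative only.

```r
# The 30 sample proportions recorded in Task 1.4
p_hats <- c(.375, .425, .51, .53, .45, .5, .475, .45, .55, .575,
            .475, .475, .5, .525, 22/40, 19/40, 21/40, 23/40, 20/40, 22/40,
            18/40, 21/40, 20/40, 20/40, 13/40, 17/40, 22/40, 20/40, 17/40, 23/40)

center <- mean(p_hats)  # center of the sample statistics
se     <- sd(p_hats)    # their standard deviation, i.e. the estimated standard error

range(p_hats)                          # smallest and largest sample proportion
round(c(center = center, se = se), 4)  # center is about 0.492, se about 0.059
```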
Part 2: Baseball Player Salaries
We can also create sampling distributions for the rest of the ‘big 5’
parameters. In this section, you will create several sampling
distributions for means of different populations and of different
sizes.
Sampling Distribution
To create a sampling distribution from simple random samples, we must
have access to the entire population. For this section we have access to
all opening day salaries for major league baseball players in 2019 (in
millions of dollars). We can load the dataset, called
BaseballSalaries2019, from our textbook using the code below:
data("BaseballSalaries2019")
head(BaseballSalaries2019)
TASK 2.1 Find the mean and standard deviation of
salary in the population. Recall that the commands mean(~Y, data =
DataSetName) and sd(~Y, data = DataSetName) can help you accomplish this
task. Include proper notation for each quantity.
**Response** \(\mu\) = 4.509924 and \(\sigma\) = 6.334217
mean(~Salary, data = BaseballSalaries2019)
sd(~Salary, data = BaseballSalaries2019)
TASK 2.2 Create a histogram of the salaries and
describe the shape of the distribution. Hint: remember
gf_histogram()
**Response** The histogram is asymmetric and skewed right.
gf_histogram(~Salary, data = BaseballSalaries2019)
TASK 2.3 Use the code below to generate 2000 samples
of size 100, saving the sample mean salary for each sample, and creating
a histogram. What does an observation plotted in the histogram
represent?
**Response** The mean salary (x bar) of a random sample of 100 players.
# save space for the means
SalaryMeans <- rep(NA, 2000)
# generate 2000 samples, saving the mean of each one
for(i in 1:2000){
  # take a sample
  TemporarySample <- sample_n(BaseballSalaries2019, 100)
  # save the mean
  SalaryMeans[i] <- mean(~Salary, data = TemporarySample)
}
gf_histogram(~SalaryMeans)
TASK 2.4 Describe the shape of your sampling
distribution, and compare it to the shape of the population.
**Response** The sampling distribution is symmetric, roughly bell-shaped (approximately normal), and centered around 4.5, while the population distribution is strongly skewed right.
TASK 2.5 Calculate the center of your sampling
distribution, as measured by the mean of the vector of sample means. How
does this value compare to the population mean?
**Response** The mean of the sample means is 4.503148, which is within 0.007 of \(\mu\) = 4.509924.
mean(SalaryMeans)
TASK 2.6 Calculate the standard error of the sample
mean using your vector of sample means. Recall that the standard error
of a statistic is the standard deviation of the sampling
distribution.
**Response** SE = 0.6054506
sd(SalaryMeans)
TASK 2.7 Hopefully the standard error you calculated
in 2.6 roughly matches the standard error you could estimate from the
histogram in 2.3. Explain how you could estimate the SE from the
histogram, and show how it is roughly the same.
**Response** You can estimate the SE from the histogram by eyeballing the range in which the middle 95% of the sample means fall and dividing that range by 4, since that range spans about 2 standard errors on each side of the center. For instance, if I take 5.75 (the approximate 97.5th percentile) and subtract 3.25 (the approximate 2.5th percentile), then divide the difference by 4, I get .625. This value is quite close to the standard error from 2.6, which is 0.6054506.
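The "middle 95% range divided by 4" shortcut can be checked on simulated data. The sketch below does not use the baseball salaries; it assumes a hypothetical right-skewed population of 900 values drawn from a gamma distribution, and all names in it are illustrative.

```r
set.seed(2019)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed population (an assumption, not the real salary data)
population <- rgamma(900, shape = 0.5, rate = 0.1)

# Sampling distribution: means of 2000 simple random samples of size 100
means <- replicate(2000, mean(sample(population, size = 100)))

# Standard error computed directly, as in Task 2.6
se_sd <- sd(means)

# Standard error estimated from the middle 95% of the sample means
bounds   <- quantile(means, c(0.025, 0.975))
se_range <- unname(diff(bounds)) / 4

c(sd = se_sd, range_over_4 = se_range)
```

The two numbers should land close to each other, because for a roughly bell-shaped sampling distribution the middle 95% covers about 4 standard errors.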
Confidence intervals
TASK 2.8 For each of the sample means
below (assumed to be means for samples of baseball player salaries)
calculate the corresponding 95% confidence interval. You will need to
use the standard error you calculated from 2.6. Indicate whether the
confidence interval successfully captures the true population mean
salary.
TASK 2.8.1 \(\bar{X}\)= 4
TASK 2.8.2 \(\bar{X}\)= 3.1
TASK 2.8.3 \(\bar{X}\)= 5.2
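A minimal sketch of the calculation for the three sample means above, assuming the 95% interval is built as statistic ± 2·SE (the intervals reported in Part 3 are consistent with that multiplier). It uses the SE from 2.6 and \(\mu\) from 2.1; the variable names are illustrative only.

```r
# Values reported earlier in this lab
se <- 0.6054506  # standard error of the sample mean (Task 2.6)
mu <- 4.509924   # population mean salary in millions of dollars (Task 2.1)

# Sample means from Tasks 2.8.1 to 2.8.3
xbar <- c(4, 3.1, 5.2)

# Assumption: 95% confidence interval = statistic +/- 2 * SE
lower <- xbar - 2 * se
upper <- xbar + 2 * se

# Does each interval contain the population mean?
captures <- lower <= mu & mu <= upper

data.frame(xbar, lower, upper, captures)
```

Under that assumption, the intervals around 4 and 5.2 contain \(\mu\), while the interval around 3.1 (upper end about 4.31) does not.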
Part 3: Sample size and confidence intervals
For this activity we will use the dataset ‘AllCountries’ from the
Lock5Data package. These data consist of measurements from all
countries. We will study the variable ‘FemaleLabor’, which provides the
percentage of females aged 15-64 who participate in each country’s
workforce. Our goal will be to build sampling distributions from samples
of various sizes for the mean of this variable.
TASK 3.0 Modify the code below to generate a
sampling distribution of the mean with 2000 samples, using a sample size
of n=10.
# load the dataset (AllCountries comes from the Lock5Data package)
library(Lock5Data)
data("AllCountries")
# wrangle the data a little bit to select only a couple variables and to remove missing (NA) values
AllCountries <- AllCountries %>%
  dplyr::select(Country, FemaleLabor, LifeExpectancy) %>%
  na.omit()
# allocate space to store your sample means
SampleMeans <- rep(NA, 2000)
# draw the correct number of samples, and for each of them save the sample mean
for(i in 1:2000){
  TemporarySample <- sample_n(AllCountries, size = 10)
  SampleMeans[i] <- mean(~FemaleLabor, data = TemporarySample)
}
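The loop above can also be wrapped in a small reusable function, which makes it easier to change the sample size in later tasks. This is only a sketch in base R: `sampling_means()` is a hypothetical helper, and the population is a made-up stand-in (180 uniform values), not the AllCountries data.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in population (an assumption, not the FemaleLabor variable)
population <- runif(180, min = 15, max = 90)

# Hypothetical helper: draw `reps` simple random samples of size `n`
# and return the mean of each one
sampling_means <- function(pop, n, reps = 2000) {
  replicate(reps, mean(sample(pop, size = n)))
}

means_n10 <- sampling_means(population, n = 10)

mean(means_n10)  # center of the sampling distribution
sd(means_n10)    # standard error of the sample mean for n = 10
```

Calling `sampling_means(population, n = 50)` or `n = 100` reuses the same code for the larger sample sizes.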
TASK 3.1 Use your vector of sample means to create a
histogram of your sampling distribution, and to calculate the standard
error of the sample mean. Answer the questions below.
gf_histogram(~SampleMeans)
mean(SampleMeans)
sd(SampleMeans)
A. Where is the center of the distribution?
**Response** 58.12275
B. What is the standard error?
**Response** 5.329335
- If we were to build a 95% confidence interval using one of the
sample means, how wide would it be?
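**Response** About 21.32 percentage points wide (2 SE on each side); for example, [47.4641, 68.7841] for a sample mean at the center of the distribution.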
TASK 3.2 Now generate a sampling distribution for
the mean using samples of size n=50. Again, you’ll need to calculate the
means for 2000 samples and save them as a vector. Then produce the
histogram, calculate the standard error, and answer the questions below.
You should copy and paste the code from 3.0 and 3.1, changing the relevant
numbers.
# load the dataset
data("AllCountries")
# wrangle the data a little bit to select only a couple variables and to remove missing (NA) values
AllCountries <- AllCountries %>%
  dplyr::select(Country, FemaleLabor, LifeExpectancy) %>%
  na.omit()
# allocate space to store your sample means
SampleMeans <- rep(NA, 2000)
# draw the correct number of samples, and for each of them save the sample mean
for(i in 1:2000){
  TemporarySample <- sample_n(AllCountries, size = 50)
  SampleMeans[i] <- mean(~FemaleLabor, data = TemporarySample)
}
gf_histogram(~SampleMeans)
mean(SampleMeans)
sd(SampleMeans)
A. Where is the center of the distribution?
**Response** 57.98732
B. What is the standard error?
**Response** 2.03178
- If we were to build a 95% confidence interval using one of the
sample means, how wide would it be?
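**Response** About 8.13 percentage points wide (2 SE on each side); for example, [53.9238, 62.0509] for a sample mean at the center of the distribution.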
TASK 3.3 Finally, generate a sampling distribution
for the mean using samples of size n=100. Again, you’ll need to
calculate the means for 2000 samples and save them as a vector. Then
produce the histogram, calculate the standard error, and answer the
questions below.
# load the dataset
data("AllCountries")
# wrangle the data a little bit to select only a couple variables and to remove missing (NA) values
AllCountries <- AllCountries %>%
  dplyr::select(Country, FemaleLabor, LifeExpectancy) %>%
  na.omit()
# allocate space to store your sample means
SampleMeans <- rep(NA, 2000)
# draw the correct number of samples, and for each of them save the sample mean
for(i in 1:2000){
  TemporarySample <- sample_n(AllCountries, size = 100)
  SampleMeans[i] <- mean(~FemaleLabor, data = TemporarySample)
}
gf_histogram(~SampleMeans)
mean(SampleMeans)
sd(SampleMeans)
A. Where is the center of the distribution?
**Response** 57.9323
B. What is the standard error?
**Response** 1.169775
- If we were to build a 95% confidence interval using one of the
sample means, how wide would it be?
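**Response** About 4.68 percentage points wide (2 SE on each side); for example, [55.5928, 60.2719] for a sample mean at the center of the distribution.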
TASK 3.4 What happens to the center of the
distribution as the sample size increases?
**Response** It remains nearly unchanged.
TASK 3.5 What happens to the standard error, and the
width of confidence intervals as the sample size increases?
**Response** The standard error decreases substantially (from 5.33 at n=10 to 2.03 at n=50 to 1.17 at n=100), and the confidence intervals narrow accordingly as the sample size increases.
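The trend can be tabulated from the standard errors reported in 3.1 to 3.3. This sketch assumes the 95% interval is statistic ± 2·SE, so its width is 4·SE; the variable names are illustrative.

```r
# Sample sizes and the standard errors reported in Tasks 3.1, 3.2 and 3.3
n  <- c(10, 50, 100)
se <- c(5.329335, 2.03178, 1.169775)

# Assumption: 95% interval = statistic +/- 2 * SE, so width = 4 * SE
width <- 4 * se

data.frame(n, se, width)  # widths are about 21.32, 8.13 and 4.68
```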
