---
title: "Lab 1-2: Sampling and experiments"
output: html_notebook
---

```{r, echo = F, message = F}
# Clear workspace
rm(list = ls()) 

# your code will be included in the html for this assignment
knitr::opts_chunk$set(echo=TRUE) 

# load packages we need for this lab
library(mosaic, warn.conflicts = FALSE) 
library(ggformula, warn.conflicts = FALSE)
library(Lock5Data, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

```

### Objectives:

For this lab you should...

-   Use manipulatives to explore sampling variability and create a sampling distribution
-   Use technology to create a sampling distribution
-   Calculate standard error of a statistic from a sampling distribution
-   Build and interpret confidence intervals using the standard error from a sampling distribution.
-   Use confidence intervals to make conclusions about a population parameter.

# Part 1: Samples from a population

For this part of the lab *you must work in a small group*. There are several bins of beads in the classroom. Right now, join a group around one of the bins (aim for the same number of students in each group), and introduce yourself. You will be working together for Part 1.

### Premise

The bin of beads represents a small town, with each bead representing a person in town with a specific opinion. The green beads support a ballot initiative to increase sales tax to fund bicycle trails, and the white beads oppose the initiative. The students in your group are pollsters: you want to determine how much support the initiative has in your town.

The sampling paddle is your tool to take simple random samples from the town's population.

**TASK 1.1** What is the population parameter you are interested in learning about? Use correct notation.

    **Response** p, the proportion of all residents of the town who support the ballot initiative.

**TASK 1.2** Using the sampling paddle, take a sample of the population and answer the questions below. *Each person in the group must take their own sample*.

    **Response**
    - What is n, the sample size? 40
    - What is your sample statistic, both notation and value? p-hat = 17/40 = 0.425
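
The distinction between the parameter and the statistic can be sketched in code. This is only an illustration: the bin contents below (500 green and 500 white beads) are an assumption, not the real classroom bin, and the names `bin`, `paddle`, `p`, and `p_hat` are made up for this sketch.

```{r}
# Hypothetical bin: 1 = green bead (supports), 0 = white bead (opposes)
bin <- rep(c(1, 0), times = c(500, 500))

# population parameter: proportion of green beads in the whole bin
p <- mean(bin)

# one paddle of 40 beads, drawn as a simple random sample without replacement
paddle <- sample(bin, size = 40)

# sample statistic: proportion of green beads on the paddle
p_hat <- mean(paddle)

c(p = p, p_hat = p_hat)
```

The parameter `p` stays fixed, while `p_hat` changes every time the chunk is rerun.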

**TASK 1.3** With your group, discuss the differences between your individual sample estimates. Summarize your conversation in a sentence or two here.

    **Response** Half of our group had sample proportions at or slightly above 0.5 (between 0.5 and 0.53), and the other half had sample proportions below 0.5 (between 0.375 and 0.5).

**TASK 1.4** Working together, take at least 30 samples (more if you'd like!) and make a vector of the sample statistics in R. Make a dotplot of your sample statistics and discuss the center and spread of the dotplot.

    **Response** The dotplot ranges from 0.325 to 0.575, and its center is around 0.5.

```{r}
SampleStatistics <- c(.375, .425, .51, .53, .45, .5, .475, .45, .55, .575, .475, .475, .5, .525, 22/40, 19/40, 21/40, 23/40, 20/40, 22/40, 18/40, 21/40, 20/40, 20/40, 13/40, 17/40, 22/40, 20/40, 17/40, 23/40)

gf_dotplot( ~ SampleStatistics)
```

**TASK 1.5** Based on everything you've done so far, what do you think is the best guess for the population parameter in your town? Write a sentence justifying your answer.

    **Response** Based on our sample statistics, our best guess is that the population parameter is about 0.5, because the distribution of our sample proportions is roughly bell-shaped and centered near 0.5.

**TASK 1.6** Calculate the standard deviation of your sample statistics. The specific name for this value is the *standard error*. The standard error of a statistic describes how much the statistic varies when it is calculated from different samples drawn from the same population.

    **Response** SE = .05902488

```{r}
sd(~SampleStatistics)
```
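
As a rough cross-check, the standard error from the bead samples can be set beside the textbook formula for a sample proportion, sqrt(p(1 - p)/n). The sketch below uses base R only; the value p = 0.5 and the assumption of independent draws are illustrative assumptions, and the names `props`, `p_assumed`, `se_empirical`, and `se_formula` are made up for this example.

```{r}
# the 30 sample proportions recorded in Task 1.4 (each from a paddle of n = 40)
props <- c(.375, .425, .51, .53, .45, .5, .475, .45, .55, .575,
           .475, .475, .5, .525, 22/40, 19/40, 21/40, 23/40, 20/40, 22/40,
           18/40, 21/40, 20/40, 20/40, 13/40, 17/40, 22/40, 20/40, 17/40, 23/40)

# empirical standard error: the spread of the observed statistics
se_empirical <- sd(props)

# formula-based standard error, ASSUMING p = 0.5 and independent draws
p_assumed <- 0.5
se_formula <- sqrt(p_assumed * (1 - p_assumed) / 40)

c(center = mean(props), se_empirical = se_empirical, se_formula = se_formula)
```

The two values need not agree exactly: the formula describes independent draws from a very large population, while the paddle takes all 40 beads at once from a finite bin.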

# Part 2: Baseball Player Salaries

We can also create sampling distributions for the rest of the 'big 5' parameters. In this section, you will create sampling distributions for sample means, using different populations and different sample sizes.

### Sampling Distribution

To create a sampling distribution from simple random samples, we must have access to the entire population. For this section we have access to all opening day salaries for major league baseball players in 2019 (in millions of dollars). We can load the dataset, called BaseballSalaries2019, from our textbook using the code below:

```{r}
data("BaseballSalaries2019")

head(BaseballSalaries2019)
```

**TASK 2.1** Find the mean and standard deviation of salary in the population. Recall that the commands mean(\~Y, data = DataSetName) and sd(\~Y, data = DataSetName) can help you accomplish this task. *Include proper notation for each quantity.*

    **Response** mu = 4.509924 and sigma = 6.334217 (both in millions of dollars)

```{r}
mean(~Salary, data = BaseballSalaries2019)
sd(~Salary, data = BaseballSalaries2019)

```
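
One detail worth knowing: R's `sd()` divides by n - 1, whereas a population standard deviation divides by N. The sketch below shows the difference on a small made-up population (the values in `toy_pop` are hypothetical and are not salary data). For a large population the two results are nearly identical.

```{r}
# toy population (hypothetical values, not the salary data)
toy_pop <- c(2, 4, 4, 4, 5, 5, 7, 9)
N <- length(toy_pop)

mu <- mean(toy_pop)

# sd() uses the N - 1 denominator
sd_function <- sd(toy_pop)

# population standard deviation uses the N denominator
sigma <- sqrt(mean((toy_pop - mu)^2))

c(mu = mu, sd_function = sd_function, sigma = sigma)
```

Multiplying the `sd()` result by `sqrt((N - 1) / N)` converts it to the population version.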

**TASK 2.2** Create a histogram of the salaries and describe the shape of the distribution. Hint: remember gf_histogram()

    **Response** The histogram is asymmetric and skewed right.

```{r}
gf_histogram(~Salary, data = BaseballSalaries2019)

```

**TASK 2.3** Use the code below to generate 2000 samples of size 100, saving the sample mean salary for each sample, and creating a histogram. What does an observation plotted in the histogram represent?

    **Response** The mean salary (x bar) of a random sample of 100 players.

```{r}
# save space for the means
SalaryMeans <- rep(NA, 2000)

# generate 2000 samples, saving the mean of each one
for(i in 1:2000){
  # take a sample
  TemporarySample <- sample_n(BaseballSalaries2019, 100)
  # save the mean
  SalaryMeans[i] <- mean(~Salary, data = TemporarySample)
}

gf_histogram(~SalaryMeans)
```

**TASK 2.4** Describe the shape of your sampling distribution, and compare it to the shape of the population.

    **Response** The sampling distribution is symmetric, approximately normal (bell-shaped), and centered around 4.5, whereas the population distribution is strongly skewed right.
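
The contrast between a skewed population and a nearly symmetric sampling distribution can be checked numerically. The sketch below is an illustration in base R under stated assumptions: the population is a hypothetical exponential one (not the salary data), and `skewness()` is a simple helper defined here, not a function from the lab's packages.

```{r}
set.seed(1)  # arbitrary seed so the sketch is repeatable

# hypothetical right-skewed population (NOT the salary data)
skewed_pop <- rexp(5000, rate = 1 / 4.5)

# simple moment-based skewness: near 0 for symmetric data, positive for right skew
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# 2000 sample means, each from a simple random sample of size 100
sim_means <- replicate(2000, mean(sample(skewed_pop, size = 100)))

c(population = skewness(skewed_pop), sample_means = skewness(sim_means))
```

The skewness of the sample means should come out much closer to 0 than the skewness of the population.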

**TASK 2.5** Calculate the center of your sampling distribution, as measured by the mean of the vector of sample means. How does this value compare to the population mean?

    **Response** The mean of the sample means is 4.503148, which is within 0.007 of mu = 4.509924.

```{r}
mean(SalaryMeans)

```

**TASK 2.6** Calculate the standard error of the sample mean using your vector of sample means. Recall that the standard error of a statistic is the standard deviation of the sampling distribution.

    **Response** SE = 0.6054506

```{r}
sd(SalaryMeans)

```

**TASK 2.7** Hopefully the standard error you calculated in 2.6 roughly matches the standard error you could estimate from the histogram in 2.3. Explain how you could estimate the SE from the histogram, and show how it is roughly the same.

    **Response** You can estimate the SE from the histogram by eyeballing the range that contains the middle 95% of the sample means and dividing that range by 4, since about 95% of a bell-shaped distribution lies within 2 standard errors of its center. For instance, taking 5.75 (approximate 97.5th percentile) minus 3.25 (approximate 2.5th percentile) and dividing the difference by 4 gives 0.625, which is close to the standard error of 0.6054506 from Task 2.6.
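
The "middle 95% divided by 4" idea can also be carried out with `quantile()` instead of by eye. The sketch below is a minimal illustration in base R; the vector `sim_means2` is a hypothetical stand-in generated from an exponential population, not the actual vector of salary means.

```{r}
set.seed(2)  # arbitrary seed so the sketch is repeatable

# hypothetical stand-in for a vector of 2000 sample means (n = 100 each)
sim_means2 <- replicate(2000, mean(rexp(100, rate = 1 / 4.5)))

# endpoints of the middle 95% of the sample means
middle95 <- quantile(sim_means2, probs = c(0.025, 0.975))

# range of the middle 95%, divided by 4
se_from_range <- unname(diff(middle95)) / 4

# standard deviation of the sample means
se_from_sd <- sd(sim_means2)

c(se_from_range = se_from_range, se_from_sd = se_from_sd)
```

The two estimates should be close whenever the sampling distribution is roughly bell-shaped.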

### Confidence intervals

**TASK 2.8 Directions** For each of the sample means below (assumed to be means for samples of baseball player salaries) calculate the corresponding 95% confidence interval. You will need to use the standard error you calculated from 2.6. Indicate whether the confidence interval successfully captures the true population mean salary.

**TASK 2.8.1** $\bar{X} = 4$

-   **Confidence Interval** [2.7891, 5.2109]

-   **Captured mu?** Yes

**TASK 2.8.2** $\bar{X} = 3.1$

-   **Confidence Interval** [1.8891, 4.3109]

-   **Captured mu?** No

**TASK 2.8.3** $\bar{X} = 5.2$

-   **Confidence Interval** [3.9891, 6.4109]

-   **Captured mu?** Yes
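
The three intervals above can be reproduced in a few lines. The sketch assumes the "statistic plus or minus 2 standard errors" rule for a 95% interval, which is what the reported endpoints correspond to, and it hard-codes the values from Tasks 2.1 and 2.6. The names `se_xbar`, `mu_salary`, `xbars`, and `ci` are made up for this example.

```{r}
se_xbar <- 0.6054506     # standard error from Task 2.6
mu_salary <- 4.509924    # population mean from Task 2.1
xbars <- c(4, 3.1, 5.2)  # the three sample means given above

# 95% interval: sample mean plus or minus 2 standard errors
ci <- data.frame(
  xbar  = xbars,
  lower = xbars - 2 * se_xbar,
  upper = xbars + 2 * se_xbar
)

# does each interval contain the population mean?
ci$captures_mu <- ci$lower <= mu_salary & mu_salary <= ci$upper
ci
```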

# Part 3: Sample size and confidence intervals

For this activity we will use the dataset 'AllCountries' from the Lock5Data package. These data consist of measurements from all countries. We will study the variable 'FemaleLabor', which provides the percentage of females aged 15-64 who participate in each country's workforce. Our goal will be to build sampling distributions from samples of various sizes for the mean of this variable.

**TASK 3.0** Modify the code below to generate a sampling distribution of the mean with 2000 samples, using a sample size of n=10.

```{r}
# load the dataset
data("AllCountries")

# wrangle the data a little bit to select only a couple variables and to remove missing (NA) values
AllCountries <- AllCountries %>% 
  dplyr::select(Country,FemaleLabor,LifeExpectancy) %>%
  na.omit()

# allocate space to store your sample means
SampleMeans <- rep(NA, 2000)

# draw the correct number of samples, and for each of them save the sample mean
for(i in 1:2000){
  TemporarySample <- sample_n(AllCountries, size = 10)
  SampleMeans[i] <- mean(~FemaleLabor, data = TemporarySample)
}
```

**TASK 3.1** Use your vector of sample means to create a histogram of your sampling distribution, and to calculate the standard error of the sample mean. Answer the questions below.

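```{r}
gf_histogram(~SampleMeans)  # histogram of the sampling distribution
sd(SampleMeans)             # standard error of the sample mean
mean(SampleMeans)           # center of the sampling distribution
```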

A. Where is the center of the distribution?

    **Response** 58.12275

B. What is the standard error?

    **Response** 5.329335

C.  If we were to build a 95% confidence interval using one of the sample means, how wide would it be?

```{=html}
<!-- -->
```
    **Response** About 21.32 percentage points wide (4 x 5.329335). Centered at 58.12275, the interval is [47.4641, 68.7814].

**TASK 3.2** Now generate a sampling distribution for the mean using samples of size n=50. Again, you'll need to calculate the means for 2000 samples and save them as a vector. Then produce the histogram, calculate the standard error, and answer the questions below. You should copy and paste the code from Tasks 3.0 and 3.1, changing the relevant numbers.

```{r}
# load the dataset
data("AllCountries")

# wrangle the data a little bit to select only a couple variables and to remove missing (NA) values
AllCountries <- AllCountries %>% 
  dplyr::select(Country,FemaleLabor,LifeExpectancy) %>%
  na.omit()

# allocate space to store your sample means
SampleMeans <- rep(NA, 2000)

# draw the correct number of samples, and for each of them save the sample mean
for(i in 1:2000){
  TemporarySample <- sample_n(AllCountries, size = 50)
  SampleMeans[i] <- mean(~FemaleLabor, data = TemporarySample)
}

gf_histogram(~SampleMeans)

mean(SampleMeans)
sd(SampleMeans)
```

A. Where is the center of the distribution?

    **Response** 57.98732

B. What is the standard error?

    **Response** 2.03178

C.  If we were to build a 95% confidence interval using one of the sample means, how wide would it be?

```{=html}
<!-- -->
```
    **Response** About 8.13 percentage points wide (4 x 2.03178). Centered at 57.98732, the interval is [53.9238, 62.0509].

**TASK 3.3** Finally, generate a sampling distribution for the mean using samples of size n=100. Again, you'll need to calculate the means for 2000 samples and save them as a vector. Then produce the histogram, calculate the standard error, and answer the questions below.

```{r}
# load the dataset
data("AllCountries")

# wrangle the data a little bit to select only a couple variables and to remove missing (NA) values
AllCountries <- AllCountries %>% 
  dplyr::select(Country,FemaleLabor,LifeExpectancy) %>%
  na.omit()

# allocate space to store your sample means
SampleMeans <- rep(NA, 2000)

# draw the correct number of samples, and for each of them save the sample mean
for(i in 1:2000){
  TemporarySample <- sample_n(AllCountries, size = 100)
  SampleMeans[i] <- mean(~FemaleLabor, data = TemporarySample)
}

gf_histogram(~SampleMeans)

mean(SampleMeans)
sd(SampleMeans)

```

A. Where is the center of the distribution?

    **Response** 57.9323

B. What is the standard error?

    **Response** 1.169775

C.  If we were to build a 95% confidence interval using one of the sample means, how wide would it be?

```{=html}
<!-- -->
```
    **Response** About 4.68 percentage points wide (4 x 1.169775). Centered at 57.9323, the interval is [55.5928, 60.2719].

**TASK 3.4** What happens to the center of the distribution as the sample size increases?

    **Response** It remains nearly unchanged.

**TASK 3.5** What happens to the standard error, and the width of confidence intervals as the sample size increases?

    **Response** The standard error decreases substantially (from about 5.33 at n = 10 to 2.03 at n = 50 and 1.17 at n = 100), so the confidence interval narrows as the sample size increases.
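
The pattern can be tabulated from the three standard errors reported in Tasks 3.1 to 3.3. The sketch assumes a 95% interval of the form "mean plus or minus 2 SE", so its width is 4 SE. The column `se_if_root_n` shows what the standard error would be if it shrank exactly in proportion to 1/sqrt(n) starting from the n = 10 value; that comparison is an added illustration, not part of the lab.

```{r}
n_sizes <- c(10, 50, 100)
se_obs <- c(5.329335, 2.03178, 1.169775)  # standard errors from Tasks 3.1 to 3.3

# width of a 95% interval of the form mean +/- 2 * SE
width_95 <- 4 * se_obs

# standard error predicted by pure 1/sqrt(n) scaling from the n = 10 value
se_if_root_n <- se_obs[1] * sqrt(n_sizes[1] / n_sizes)

data.frame(n = n_sizes, se_obs, width_95, se_if_root_n)
```

The observed standard errors fall below the 1/sqrt(n) prediction at n = 50 and n = 100. A likely reason is that `sample_n()` draws without replacement from a finite list of countries, so large samples cover a sizeable share of the population.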

# Part 4: Interpreting a confidence interval.

Using a sample of 24 deliveries described in "Diary of a Pizza Girl" on the Slice website, we find a 95% confidence interval for the mean tip given for a pizza delivery to be \$2.18 to \$3.90. Which of the following is a correct interpretation of this interval?\
a. I am 95% sure that all pizza delivery tips will be between \$2.18 and \$3.90.\
b. 95% of all pizza delivery tips will be between \$2.18 and \$3.90.\
c. I am 95% sure that the mean pizza delivery tip for this sample will be between \$2.18 and \$3.90.\
d. I am 95% sure that the mean tip for all pizza deliveries in this area will be between \$2.18 and \$3.90.\
e. I am 95% sure that the confidence interval for the mean pizza delivery tip will be between \$2.18 and \$3.90.

    **Response** The answer is d. A confidence interval estimates a population parameter, here the mean tip for all pizza deliveries in the area the sample was drawn from, so we are 95% confident that this mean lies between 2.18 and 3.90 dollars. Options a and b describe individual tips rather than the mean, c refers to the sample mean (which is known exactly), and e refers to the interval itself rather than the parameter.
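
What "95% confident" means can be illustrated with a small simulation: if many samples of 24 deliveries were taken and an interval built from each, about 95% of those intervals should contain the population mean. The sketch below uses a hypothetical normal population of tips (mean 3, SD 2, chosen only for illustration) and base R's `t.test()` to build each interval, which is a different method from the "2 SE" rule used earlier in the lab.

```{r}
set.seed(3)  # arbitrary seed so the sketch is repeatable

# hypothetical population mean tip in dollars (illustrative value only)
true_mean <- 3

# for each of 2000 samples of 24 tips, record whether the 95% interval
# contains the population mean
covers <- replicate(2000, {
  tips <- rnorm(24, mean = true_mean, sd = 2)
  interval <- t.test(tips, conf.level = 0.95)$conf.int
  interval[1] <= true_mean && true_mean <= interval[2]
})

# proportion of intervals that captured the population mean
mean(covers)
```

The proportion describes how often the procedure captures the population mean. It says nothing about where individual tips fall.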
