*Submit your homework to Canvas by the due date and time. Email Miranda (mrintoul@wisc.edu) if you have extenuating circumstances and need to request an extension.

*If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.

*If a problem does not specify how to compute the answer, you may use any appropriate method. I may ask you to use R or manual calculations on your exams, so practice accordingly.

*You must include an explanation and/or intermediate calculations for an exercise to be complete.

*Be sure to submit the HWK5 Autograde Quiz, which will give you ~20 of your 40 accuracy points.

Sampling Distributions and CLT

Exercise 1. Retail stores experience their heaviest volume of transactions that include returns on December 26th and December 27th each year. The distribution for the Number of Items Returned (X) by Macy’s customers who do a return transaction on those days last year is given in the table below. It has mean: \(\mu=2.61\) and variance \(\sigma^2 \approx 1.80\).

Number of Items Returned in Transaction (X)	Probability
x=1	0.25
x=2	0.28
x=3	0.20
x=4	0.17
x=5	0.08
x=6	0.02

vals <- c(1,2,3,4,5,6)
probs <- c(0.25, 0.28, 0.20, 0.17, 0.08, 0.02)
EV_pop=sum(vals*probs)
Var_pop <- sum(probs*(vals-EV_pop)^2) #exact var: 1.7979 rounded to 1.80 for computational ease

Is this population distribution left skewed, symmetric, or right skewed? How do you know?

The population distribution is right skewed. Most of the probability sits on the small values: 1, 2, and 3 items account for 0.73 of the transactions, while the probabilities taper off toward 5 and 6, leaving a tail on the right side of the graph. The mean (2.61) is also larger than the median (2), which is what we expect when a right tail pulls the mean upward.
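One way to back up the right-skew claim numerically is to compare the mean with the median and to compute the standardized third moment straight from the probability table. The sketch below is only a check on the answer; the names `cum_probs`, `median_pop`, and `skew_pop` are introduced here for illustration.

```r
# Population pmf from the table
vals  <- 1:6
probs <- c(0.25, 0.28, 0.20, 0.17, 0.08, 0.02)

mu    <- sum(vals * probs)                 # population mean (2.61)
sigma <- sqrt(sum(probs * (vals - mu)^2))  # population sd (about 1.34)

# Median: smallest value whose cumulative probability reaches 0.5
cum_probs  <- cumsum(probs)
median_pop <- vals[which(cum_probs >= 0.5)[1]]

# Standardized third moment; a positive value indicates right skew
skew_pop <- sum(probs * (vals - mu)^3) / sigma^3

c(mean = mu, median = median_pop, skewness = skew_pop)
```

A mean above the median together with a positive skewness value is consistent with a tail on the right.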

In what percent of returns did retail workers see three or more items in the return?

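sum(probs[vals >= 3])

## [1] 0.47

Because X is discrete, the probability comes straight from the table: \(P(X \ge 3) = 0.20+0.17+0.08+0.02=0.47\), so retail workers saw three or more items in about 47% of returns.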

In this year, a random sample of size 45 customer return transactions will be selected for review. Since this is a SRS from a very large population, you can consider each draw \(X_i\) iid to the population X.

The simulation below selects 45 customer return transactions from the population with replacement and computes the sample mean and sample sum. It then repeats this process many times, stores the sample mean and sample sum values in vectors, and then creates histograms of those vectors of values. Identify which histogram displays (1) the population \(X\) values, (2) the simulated sampling distribution of the sample mean \(\bar{X}\), (3) the simulated sampling distribution of the sample sum \(S\). Briefly explain how you know.

items <- c(rep(1, 25), rep(2, 28), rep(3, 20),
           rep(4, 17), rep(5, 8), rep(6, 2))
iterations <- 500000
samp_mean <- rep(0, iterations)
samp_sum <- rep(0, iterations)
for (i in 1:iterations){
  samp=sample(items, size = 45, replace = TRUE)
  samp_mean[i] <- mean(samp)
  samp_sum[i] <- sum(samp)
}

par(mfrow=c(3,1))
hist(samp_sum, breaks=seq(60, 170, 1), main="Histogram A", xlab="")
hist(items, breaks=seq(0.5, 6.5, 1), freq=FALSE,
     main="Histogram B", xlab="")
hist(samp_mean, breaks=seq(1.5,3.75, 0.01), main="Histogram C", xlab="")

par(mfrow=c(1,1))

Histogram A shows the simulated sampling distribution of the sample sum. The expected sum is \(45 \times 2.61 = 117.45\), and this histogram is centered near 117.45. Histogram B shows the population X values: it has only the six possible numbers of items returned (1 through 6), with bar heights matching their probabilities. Histogram C shows the simulated sampling distribution of the sample mean: it is centered near 2.61, the population mean.

Describe the sampling distribution (shape, mean, and standard deviation) of the sample mean number of items returned in 45 return transactions \(\bar{X}=\frac{X_1+X_2+...+X_{45}}{45}\) according to theory. Make sure to name any theorems you are using. (You can compute the mean and sd of one of the vectors constructed above to make sure your theoretical values are close to what you get in the simulation.)

By the Central Limit Theorem, the sampling distribution of \(\bar{X}\) is approximately normal, because the 45 draws are iid and \(n=45\) is reasonably large. Its mean is \(\mu_{\bar{X}}=\mu=2.61\) and its standard deviation is \(\sigma_{\bar{X}}=\sqrt{\sigma^2/n}=\sqrt{1.8/45}=0.2\).
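The theoretical values come from \(E(\bar{X})=\mu\) and \(SD(\bar{X})=\sigma/\sqrt{n}\). Here is a minimal sketch of that arithmetic, with a small seeded simulation as a sanity check. The seed and the 20,000 repetitions are arbitrary choices made here, not part of the assignment.

```r
mu     <- 2.61   # population mean from the exercise
sigma2 <- 1.80   # population variance (rounded)
n      <- 45     # sample size

mean_xbar <- mu                 # E(X-bar) = mu
sd_xbar   <- sqrt(sigma2 / n)   # SD(X-bar) = sigma / sqrt(n) = 0.2

# Optional check against a smaller simulation than the one in the exercise
set.seed(371)
vals  <- 1:6
probs <- c(0.25, 0.28, 0.20, 0.17, 0.08, 0.02)
sim_means <- replicate(20000,
                       mean(sample(vals, size = n, replace = TRUE, prob = probs)))

c(theory_mean = mean_xbar, sim_mean = mean(sim_means),
  theory_sd = sd_xbar, sim_sd = sd(sim_means))
```

Sampling with `prob = probs` draws directly from the table, which plays the same role as the `items` vector built with `rep()` in the exercise.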

What is the probability that the mean number of items returned in the 45 return transactions reviewed will be 3 or more items?

1-pnorm(3,2.61,sqrt(1.8/45))

## [1] 0.02558806

Explain why the value you found in e. was so much smaller than the value found in b.

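Part b asks about a single transaction, while part e asks about the mean of 45 transactions. Individual transactions vary a lot (\(\sigma=\sqrt{1.8}\approx 1.34\)), so returning 3 or more items is common (probability 0.47). The mean of 45 transactions varies much less (\(\sigma_{\bar{X}}=0.2\)), so a sample mean of 3 is about 1.95 standard deviations above 2.61, which is unusual (probability about 0.026). Averaging smooths out the occasional large return, so sample means cluster tightly around 2.61.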
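The contrast between parts b and e comes down to which standard deviation applies. The sketch below puts the two side by side; the variable names are introduced here for illustration only.

```r
mu     <- 2.61
sigma2 <- 1.80
n      <- 45

sd_single <- sqrt(sigma2)       # spread of one transaction, about 1.34
sd_mean   <- sqrt(sigma2 / n)   # spread of the mean of 45 transactions, 0.2

# How many standard deviations above the center is the value 3?
z_single <- (3 - mu) / sd_single
z_mean   <- (3 - mu) / sd_mean

# Part b is an exact sum over the table; part e uses the CLT normal approximation
p_single <- sum(c(0.20, 0.17, 0.08, 0.02))
p_mean   <- pnorm(3, mean = mu, sd = sd_mean, lower.tail = FALSE)

c(z_single = z_single, z_mean = z_mean, p_single = p_single, p_mean = p_mean)
```

The same cutoff of 3 is a fraction of a standard deviation from the center for one transaction but nearly two standard deviations away for the mean of 45.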

Consider the total number of items returned in 45 customer return transactions. Describe the sampling distribution (shape, center, and spread) of the total number of items returned \(Sum=X_1+X_2+...+X_{45}\). Make sure to name any theorems you are using.

By the Central Limit Theorem, the sampling distribution of the sum is approximately normal, since \(n=45\) iid draws is large enough for the normal approximation to be reasonable. Its mean is \(45 \times 2.61 = 117.45\) and its standard deviation is \(\sqrt{45 \times 1.8}=\sqrt{81}=9\).
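For a sum of \(n\) iid draws, the center is \(n\mu\) and the spread is \(\sqrt{n\sigma^2}\). A short sketch of that arithmetic, using the rounded variance 1.80 from the exercise:

```r
mu     <- 2.61
sigma2 <- 1.80
n      <- 45

mean_sum <- n * mu              # 45 * 2.61 = 117.45
sd_sum   <- sqrt(n * sigma2)    # sqrt(45 * 1.8) = sqrt(81) = 9

# Same spread expressed through the standard error of the mean: n * (sigma / sqrt(n))
sd_sum_alt <- n * sqrt(sigma2 / n)

c(mean_sum = mean_sum, sd_sum = sd_sum, sd_sum_alt = sd_sum_alt)
```

The second form shows that the sum is just the sample mean rescaled by \(n\), so both its center and its spread are 45 times those of \(\bar{X}\).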

Find an upper bound b such that the total number of items returned in 45 customers’ return transactions will be less than b with probability 0.95.

qnorm(.95,117.45,9)

## [1] 132.2537

The upper bound is approximately 132.25.
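The `qnorm()` call is equivalent to taking the center and adding the 0.95 standard normal quantile times the standard deviation. A minimal sketch of that equivalence, assuming the normal model for the sum described in the previous part:

```r
mean_sum <- 117.45
sd_sum   <- 9

z_95 <- qnorm(0.95)                 # standard normal 0.95 quantile, about 1.645
b    <- mean_sum + z_95 * sd_sum    # same value as qnorm(0.95, 117.45, 9)

# Check: the normal model puts probability 0.95 below b
pnorm(b, mean = mean_sum, sd = sd_sum)
b
```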

Estimating unknown population mean and proportion with point and interval estimators

Exercise 2. Consider the tree data set in R, trees. (You can access the data by just typing trees as you would any other variable you’ve set)

Construct histograms and qqnorm plots for all three of the quantitative variables recorded on the 31 trees. For which of the three variables do we have the strongest evidence that the population of values may not be well approximated by a normal random variable?

Volume:

volumes <- trees$Volume
(x_bar <- mean(volumes))

## [1] 30.17097

(s <- sd(volumes))

## [1] 16.43785

(n <- length(volumes))

## [1] 31

par(mfrow = c(1,2))
hist(volumes)
qqnorm(volumes); qqline(volumes)

par(mfrow = c(1,1))

Girth:

Girth <- trees$Girth
(x_bar <- mean(Girth))

## [1] 13.24839

(s <- sd(Girth))

## [1] 3.138139

(n <- length(Girth))

## [1] 31

par(mfrow = c(1,2))
hist(Girth)
qqnorm(Girth); qqline(Girth)

par(mfrow = c(1,1))

Height:

Height <- trees$Height
(x_bar <- mean(Height))

## [1] 76

(s <- sd(Height))

## [1] 6.371813

(n <- length(Height))

## [1] 31

par(mfrow = c(1,2))
hist(Height)
qqnorm(Height); qqline(Height)

par(mfrow = c(1,1))

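Volume gives us the strongest evidence that the population of values may not be well approximated by a normal random variable. Its mean (30.17) is well above its median (24.20) and its maximum (77.00) is far from the bulk of the data, which points to a right-skewed distribution.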
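A numerical summary can supplement the plots. The sketch below computes a simple moment-based sample skewness for each variable. This is an added check, not something the exercise asks for, and the helper name `sample_skew` is hypothetical.

```r
# Moment-based sample skewness: average cubed z-score (one of several common definitions)
sample_skew <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# trees is a built-in R data set with columns Girth, Height, Volume
skews <- sapply(trees, sample_skew)
round(skews, 2)

# Variable whose skewness is farthest from 0 (0 is what a normal population would give)
names(which.max(abs(skews)))
```

A skewness value is only a rough guide with 31 observations, so it should be read alongside the histograms and QQ plots rather than in place of them.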

Since \(n=31\) for each of these variables, we believe the CLT will make \(\bar{X} \approx N\) even for the possibly non-normal populations referenced above. Construct \(90\%\) t confidence intervals “by hand” for all three variables using the sample data found in the trees data set. Summaries of the variables are given below, and you should use an R function to find the relevant multiplier for your margin of error.

summary(trees$Girth); sd(trees$Girth); length(trees$Girth)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.30   11.05   12.90   13.25   15.25   20.60

## [1] 3.138139

## [1] 31

summary(trees$Height); sd(trees$Height); length(trees$Height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      63      72      76      76      80      87

## [1] 6.371813

## [1] 31

summary(trees$Volume); sd(trees$Volume); length(trees$Volume)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.20   19.40   24.20   30.17   37.30   77.00

## [1] 16.43785

## [1] 31

qt(0.95, df = 30) # multiplier for a 90% two-sided interval: 5% in each tail, df = n - 1 = 30

## [1] 1.697261

Height:

6.371813/sqrt(31) #se

## [1] 1.144411

qt(0.95,30) #critical value

## [1] 1.697261

1.697261*1.144411 #margin of error

## [1] 1.942364

76+1.942364 #upper bound

## [1] 77.94236

76-1.942364 #lower bound

## [1] 74.05764

Girth:

3.138139/sqrt(31) #se

## [1] 0.5636264

qt(0.95,30) #critical value

## [1] 1.697261

1.697261*0.5636 #margin of error

## [1] 0.9565763

13.25+0.9565763 #upper bound

## [1] 14.20658

13.25-0.9565763 #lower bound

## [1] 12.29342

Volume:

16.43785/sqrt(31) #se

## [1] 2.952325

qt(0.95,30) #critical value

## [1] 1.697261

1.697261*2.95 #margin of error

## [1] 5.00692

30.17+5.00692 #upper bound

## [1] 35.17692

30.17-5.00692 #lower bound

## [1] 25.16308
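The three by-hand intervals repeat the same arithmetic, so they can be wrapped in one small function. `t_ci` is a hypothetical helper written for this illustration. It uses the unrounded sample statistics, so its endpoints may differ slightly from the rounded by-hand values.

```r
# Hypothetical helper: two-sided t confidence interval from summary statistics
t_ci <- function(x_bar, s, n, conf_level = 0.90) {
  alpha  <- 1 - conf_level
  t_star <- qt(1 - alpha / 2, df = n - 1)   # 0.95 quantile for a 90% interval
  moe    <- t_star * s / sqrt(n)            # margin of error
  c(lower = x_bar - moe, upper = x_bar + moe)
}

# Apply to each variable in the built-in trees data set
ci_by_hand <- sapply(trees, function(x) t_ci(mean(x), sd(x), length(x)))
round(ci_by_hand, 2)
```

Keeping the confidence level as an argument makes it clear where the 0.95 in `qt(0.95, 30)` comes from: a 90% interval leaves 5% in each tail.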

Construct the same confidence intervals that you constructed in (b) above using the t.test() command in R. Confirm that you get very similar endpoints.

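# R shortcut with "t.test"
# t.test() needs one numeric variable at a time; passing the whole data frame
# pools all 93 values (df = 92) instead of analyzing each variable with df = 30
t.test(trees$Girth, conf.level = 0.9)$conf.int
t.test(trees$Height, conf.level = 0.9)$conf.int
t.test(trees$Volume, conf.level = 0.9)$conf.int

These intervals should agree closely with the by-hand intervals in (b): approximately (12.29, 14.21) for Girth, (74.06, 77.94) for Height, and (25.16, 35.18) for Volume.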

Suppose this data came from 31 trees cut down by a single logger. How does that affect the conclusions we can draw? Suppose this data came from 31 trees selected at the saw mill from a variety of logging companies, how does that affect the conclusions we can draw?

If all 31 trees came from a single logger, the sample would not be representative of a broader population of trees. That logger most likely cut the trees in one area, so their sizes reflect the same climate and environment, and possibly the logger's own choices about which trees to cut. The results could only be generalized to trees like the ones this logger harvests, and the trees may not be independent of one another. If instead the 31 trees were selected at the saw mill from a variety of logging companies, there would be less bias: the sample would represent a wider range of locations and growing conditions, and the iid assumption would be more plausible because the trees grew independently of each other. The intervals could then more reasonably be applied to the population of trees arriving at the mill, especially if the trees were selected at random.

Suppose the 31 trees in the trees data set is a random sample from those at a saw mill. The mill would like to use this sample to estimate the proportion of trees that they have at their mill with Volume over 65 cubic ft. Use the following code to determine what count of trees in this sample have Volume over 65 cubic ft. Then, explain why they should not do a large-sample z confidence interval for the proportion of trees at their mill with Volume over 65 cubic feet with this sample of 31 trees. (Hint: Consider what assumption for our large sample Z confidence interval for \(\pi\) is not well met.)

sum(trees$Volume>65)

## [1] 1

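Only 1 of the 31 trees has Volume over 65 cubic feet, so \(\hat{p}=1/31\approx 0.032\). The large-sample z interval relies on the sampling distribution of \(\hat{p}\) being approximately normal, which requires a reasonable number of both successes and failures in the sample. Here there are 30 failures but only 1 success, so that assumption is not well met: the distribution of \(\hat{p}\) would be strongly right skewed, and the interval would not be trustworthy.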
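The condition check can be made concrete. The sketch below counts successes and failures and, purely to illustrate what goes wrong, computes a 95% large-sample (Wald) interval anyway. The 95% level is an assumption, since the exercise does not specify one.

```r
x <- sum(trees$Volume > 65)   # number of trees with Volume over 65 cubic ft
n <- nrow(trees)              # sample size
p_hat <- x / n                # sample proportion

# Counts the large-sample interval relies on; both should be reasonably large
c(successes = x, failures = n - x)

# Wald interval computed anyway, to show the symptom of the failed condition
z_star <- qnorm(0.975)
moe    <- z_star * sqrt(p_hat * (1 - p_hat) / n)
wald   <- c(lower = p_hat - moe, upper = p_hat + moe)
wald   # the lower endpoint is negative, which is impossible for a proportion
```

An interval that extends below 0 is a visible sign that the normal approximation does not fit a sample with a single success.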

Stat 371 Homework #5

Alexa Schram
