Sampling Distributions and CLT

Exercise 1. Retail stores experience their heaviest volume of transactions that include returns on December 26th and December 27th each year. The distribution of the number of items returned (X) by Macy’s customers who made a return transaction on those days last year is given in the table below. It has mean \(\mu=2.61\) and variance \(\sigma^2 \approx 1.80\).

| Number of Items Returned in Transaction (x) | Probability |
|---|---|
| 1 | 0.25 |
| 2 | 0.28 |
| 3 | 0.20 |
| 4 | 0.17 |
| 5 | 0.08 |
| 6 | 0.02 |
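
The stated mean and variance can be checked directly from the table above. A minimal sketch in R, where the vector names `x` and `p` are illustrative and not part of the exercise:

```r
# Values and probabilities copied from the table above
x <- 1:6
p <- c(0.25, 0.28, 0.20, 0.17, 0.08, 0.02)

# A valid probability distribution must sum to 1
stopifnot(isTRUE(all.equal(sum(p), 1)))

mu <- sum(x * p)               # E(X)
sigma2 <- sum((x - mu)^2 * p)  # Var(X) = E[(X - mu)^2]

mu      # 2.61
sigma2  # 1.7979, which rounds to 1.80
```
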
  1. Is this population distribution left skewed, symmetric, or right skewed? How do you know?
  1. What proportion of returns had three or more items?
  1. Identify which histogram below displays (1) the population \(X\) values, (2) the simulated sampling distribution of the sample mean \(\bar{X}\), and (3) the simulated sampling distribution of the sample total \(T\). Briefly explain how you know.

  1. Describe the sampling distribution (shape, mean, and standard deviation) of the sample mean number of items returned in 45 return transactions \(\bar{X}=\frac{X_1+X_2+\cdots+X_{45}}{45}\) according to theory. Make sure to name any theorems you are using.
  1. What is the probability that the mean number of items returned in the 45 return transactions reviewed will be 3 or more items?
1-pnorm(3, 2.61, 0.2)
## [1] 0.02558806
  1. Explain why the probability found in (e) for the sample mean is so much smaller than the proportion found in (b) for a single transaction.
  1. Consider the total number of items returned in 45 customer return transactions. Describe the sampling distribution (shape, center, and spread) of the total number of items returned \(T=X_1+X_2+\cdots +X_{45}\). Make sure to name any theorems you are using.
  1. Find an upper bound b such that the total number of items returned in 45 customers’ return transactions will be less than b with probability 0.95.
qnorm(0.95, 117.45, 9.0)
## [1] 132.2537
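
The theoretical values used in the pnorm() and qnorm() calls above can be compared against a simulation. The sketch below is an illustration, not the simulation behind the worksheet's histograms: the seed, the number of replications, and all object names are assumptions.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

x <- 1:6
p <- c(0.25, 0.28, 0.20, 0.17, 0.08, 0.02)
n <- 45        # transactions per sample
reps <- 10000  # number of simulated samples (arbitrary choice)

# Each column of `draws` is one simulated sample of 45 transactions
draws <- matrix(sample(x, n * reps, replace = TRUE, prob = p), nrow = n)
xbar <- colMeans(draws)  # simulated sample means
total <- colSums(draws)  # simulated sample totals

# Simulated mean and SD of Xbar, next to the theoretical SD sigma / sqrt(n)
c(mean(xbar), sd(xbar), sqrt(1.80 / n))

# Simulated mean and SD of T, next to the theoretical n * mu and sigma * sqrt(n)
c(mean(total), sd(total), n * 2.61, sqrt(n * 1.80))

hist(xbar)   # sampling distribution of the sample mean
hist(total)  # sampling distribution of the sample total
```

The two histograms share the same shape and differ only in scale, since \(T = 45\bar{X}\).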

Interval estimation for a population mean

Exercise 2. Consider the built-in R data set trees, a data frame that contains multiple variables. We can access a specific variable’s data by using the $ operator. For example:

# The data frame contains 3 columns (vectors)
trees

# This is how to access the "Girth" data specifically
trees$Girth

# We can use this vector in our usual R functions
mean(trees$Girth)
  1. Construct histograms and normal Q-Q plots (qqnorm) for all three of the quantitative variables recorded on the 31 trees. For which of the three variables do we have the strongest evidence that the population of values may not be well approximated by a normal distribution?
hist(trees$Girth)

qqnorm(trees$Girth)

hist(trees$Height)

qqnorm(trees$Height)

hist(trees$Volume)

qqnorm(trees$Volume)

  1. Since \(n=31\) for each of these variables, we expect the CLT to make the sampling distribution of \(\bar{X}\) approximately normal even for the possibly non-normal populations referenced above. Construct \(90\%\) t confidence intervals “by hand” for the mean of each of the three variables in the trees data set. Summaries of the variables are given below; use an R function to find the relevant t critical value.
mean(trees$Girth); sd(trees$Girth); length(trees$Girth)
## [1] 13.24839
## [1] 3.138139
## [1] 31
mean(trees$Height); sd(trees$Height); length(trees$Height)
## [1] 76
## [1] 6.371813
## [1] 31
mean(trees$Volume); sd(trees$Volume); length(trees$Volume)
## [1] 30.17097
## [1] 16.43785
## [1] 31
qt(0.950, 30)
## [1] 1.697261
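
One way to organize the by-hand calculation is a small helper that applies \(\bar{x} \pm t^{*}\, s/\sqrt{n}\) to a vector. The function name `t_interval` is hypothetical and only for illustration; it relies on base R's qt(), mean(), sd(), and length().

```r
# Hypothetical helper: t confidence interval for a mean, computed "by hand"
t_interval <- function(x, conf.level = 0.90) {
  n <- length(x)
  # Critical value with (1 - conf.level) / 2 in each tail,
  # e.g. qt(0.95, 30) for a 90% interval with n = 31
  t_star <- qt(1 - (1 - conf.level) / 2, df = n - 1)
  margin <- t_star * sd(x) / sqrt(n)  # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

t_interval(trees$Girth)
t_interval(trees$Height)
t_interval(trees$Volume)
```
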
  1. Construct the same confidence intervals that you constructed in (b) above using the t.test() command in R. Confirm that you get very similar endpoints.
t.test(trees$Girth, conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  trees$Girth
## t = 23.506, df = 30, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  12.29177 14.20501
## sample estimates:
## mean of x 
##  13.24839
t.test(trees$Height, conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  trees$Height
## t = 66.41, df = 30, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  74.05764 77.94236
## sample estimates:
## mean of x 
##        76
t.test(trees$Volume, conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  trees$Volume
## t = 10.219, df = 30, p-value = 2.753e-11
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  25.16010 35.18183
## sample estimates:
## mean of x 
##  30.17097
  1. Suppose these data came from 31 trees cut down by a single logger. How does that affect the conclusions we can draw? Suppose instead that the data came from 31 trees selected at the saw mill from a variety of logging companies. How does that affect the conclusions we can draw?
  1. Suppose the 31 trees in the trees data set are a random sample from those at a saw mill. The mill would like to use this sample to estimate the proportion of trees at the mill with Volume over 65 cubic ft. Use the code below to count the trees in this sample with Volume over 65 cubic ft. Then explain why the mill should not construct a large-sample z confidence interval for this proportion from this sample of 31 trees.
sum(trees$Volume > 65)
## [1] 1
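
A common rule of thumb for the large-sample z interval is that the sample should contain at least 10 successes and at least 10 failures. The threshold of 10 is an assumption here, since textbooks differ on the exact cutoff. A sketch of the check, with illustrative object names:

```r
n <- nrow(trees)
successes <- sum(trees$Volume > 65)  # trees with Volume over 65 cubic ft
failures <- n - successes
p_hat <- successes / n               # sample proportion

c(successes = successes, failures = failures, p_hat = p_hat)

# Rule-of-thumb check (the cutoff of 10 is an assumed convention)
successes >= 10 && failures >= 10  # FALSE: the sample has only 1 success
```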