These exercises are taken from the OpenIntro Statistics textbook (https://www.openintro.org/book/os/). Refer to the page numbers listed by each problem to see the whole question and any additional context.

Example

Now you will start showing the output of your calculations in your problem sets. To do this, include code chunks like the ones below. When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

mean(c(2,4,6))
## [1] 4
78+52
## [1] 130

Problem 2.2 Associations. (p. 56)

Indicate which of the plots show (a) a positive association, (b) a negative association, or (c) no association. Also determine if the positive and negative associations are linear or nonlinear. Each part may refer to more than one plot.

  1. Positive Association; Linear
  2. No Association
  3. Positive Association; Nonlinear
  4. Negative Association; Linear

Problem 2.5 Parameters and statistics. (p. 56)

Identify which value represents the sample mean and which value represents the claimed population mean.

  1. American households spent an average of about $52 in 2007 on Halloween merchandise such as costumes, decorations and candy. To see if this number had changed, researchers conducted a new survey in 2008 before industry numbers were reported. The survey included 1,500 households and found that average Halloween spending was $58 per household.

Sample Mean: $58 per household; Population Mean: $52 per household

  2. The average GPA of students in 2001 at a private university was 3.37. A survey on a sample of 203 students from this university yielded an average GPA of 3.59 a decade later.

Sample Mean: 3.59 GPA; Population Mean: 3.37 GPA

Problem 2.8 Medians and IQRs. (p. 57)

For each part, compare distributions 1. and 2. based on their medians and IQRs. You DO need to calculate these statistics. Make sure you explain your reasoning.

    1. 3, 5, 6, 7, 9
    2. 3, 5, 6, 7, 20

Sets 1 and 2 have the same median (6) and the same IQR (7 - 5 = 2). Changing the largest value from 9 to 20 affects neither statistic.

fivenum(c(3,5,6,7,9))
## [1] 3 5 6 7 9
fivenum(c(3,5,6,7,20))
## [1]  3  5  6  7 20

    1. 3, 5, 6, 7, 9
    2. 3, 5, 7, 8, 9

Set 2 has both a higher median (7 vs. 6) and a higher IQR (8 - 5 = 3 vs. 7 - 5 = 2).

fivenum(c(3,5,6,7,9))
## [1] 3 5 6 7 9
fivenum(c(3,5,7,8,9))
## [1] 3 5 7 8 9
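
The comparisons in both parts can also be computed directly rather than read off the five-number summaries. The sketch below is one way to do that. `summarize_spread` is a hypothetical helper name, and it takes the IQR to be the distance between the two hinges that fivenum() reports (for these four data sets, base R's IQR() gives the same values).

```r
# Hypothetical helper: median and hinge-based IQR taken from fivenum()
summarize_spread <- function(x) {
  f <- fivenum(x)  # min, lower hinge, median, upper hinge, max
  c(median = f[3], IQR = f[4] - f[2])
}

# Part 1: only the largest value differs between the two sets
part_a <- rbind(set1 = summarize_spread(c(3, 5, 6, 7, 9)),
                set2 = summarize_spread(c(3, 5, 6, 7, 20)))

# Part 2: the middle values differ
part_b <- rbind(set1 = summarize_spread(c(3, 5, 6, 7, 9)),
                set2 = summarize_spread(c(3, 5, 7, 8, 9)))

part_a
part_b
```

In the first table the two rows match; in the second, set 2 is larger in both columns.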

Problem 2.9 Means and SDs. (p. 57)

For each part, compare distributions 1. and 2. based on their means and standard deviations. You DO need to calculate these statistics. Make sure you explain your reasoning.

    1. 3, 5, 5, 5, 8, 11, 11, 11, 13
    2. 3, 5, 5, 5, 8, 11, 11, 11, 20

Both the mean and standard deviation are higher in set #2.

mean(c(3, 5, 5, 5, 8, 11, 11, 11, 13))
## [1] 8
sd(c(3, 5, 5, 5, 8, 11, 11, 11, 13))
## [1] 3.605551
mean(c(3, 5, 5, 5, 8, 11, 11, 11, 20))
## [1] 8.777778
sd(c(3, 5, 5, 5, 8, 11, 11, 11, 20))
## [1] 5.214829

    1. -20, 0, 0, 0, 15, 25, 30, 30
    2. -40, 0, 0, 0, 15, 25, 30, 30

While set #1 has the higher mean (10 vs. 7.5), set #2 has the higher standard deviation (about 23.3 vs. 17.9).

mean(c(-20, 0, 0, 0, 15, 25, 30, 30))
## [1] 10
sd(c(-20, 0, 0, 0, 15, 25, 30, 30))
## [1] 17.92843
mean(c(-40, 0, 0, 0, 15, 25, 30, 30))
## [1] 7.5
sd(c(-40, 0, 0, 0, 15, 25, 30, 30))
## [1] 23.29929
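
sd() reports the sample standard deviation, which divides the sum of squared deviations from the mean by n - 1. As a rough sketch, the same arithmetic can be written out by hand for the two sets in the second part. `manual_sd`, `set1`, and `set2` are names made up for this illustration.

```r
# Sample standard deviation written out step by step (illustrative helper)
manual_sd <- function(x) {
  deviations <- x - mean(x)                    # distance of each value from the mean
  sqrt(sum(deviations^2) / (length(x) - 1))   # n - 1 in the denominator
}

set1 <- c(-20, 0, 0, 0, 15, 25, 30, 30)
set2 <- c(-40, 0, 0, 0, 15, 25, 30, 30)

c(manual = manual_sd(set1), builtin = sd(set1))
c(manual = manual_sd(set2), builtin = sd(set2))
```

Lowering the minimum from -20 to -40 pulls the mean down and moves that point farther from the center, which is why the standard deviation rises.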

Problem 2.12 Median vs. mean. (p. 58)

Estimate the median for the 400 observations shown in the histogram, and note whether you expect the mean to be higher or lower than the median.

The median is between 80 and 85. I would expect the mean to be lower than the median because the distribution is left skewed.
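
To illustrate why a left skew pulls the mean below the median, here is a small sketch with made-up values. These are not the 400 observations from the textbook histogram; `scores` is a hypothetical vector with a few low values trailing off to the left.

```r
# Made-up, left-skewed values (NOT the data behind the textbook histogram)
scores <- c(60, 70, 75, 80, 82, 84, 85, 86, 88, 90)

median(scores)  # 83
mean(scores)    # 80
```

The few low values drag the mean down, while the median only depends on the middle of the ordered data.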

Problem 2.13 Histograms vs. box plots. (p. 58)

Compare the two plots below. What characteristics of the distribution are apparent in the histogram and not in the box plot? What characteristics are apparent in the box plot but not in the histogram?

The histogram identifies the bimodal (two peaked) nature of the data. The box plot, on the other hand, displays the outliers with glaring clarity.
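
The same contrast can be reproduced with a small made-up data set. The sketch below assumes a hypothetical vector `x` with two clusters and one extreme value; it is not the data behind the textbook plots.

```r
# Made-up data: two clusters (around 5 and around 15) plus one extreme value
x <- c(4, 5, 5, 5, 6, 6, 14, 15, 15, 15, 16, 16, 40)

hist(x, breaks = seq(0, 45, by = 5))  # the two separate peaks are visible
boxplot(x, horizontal = TRUE)         # the extreme value is drawn as its own point

# Values that boxplot() draws beyond the whiskers
flagged <- boxplot.stats(x)$out
flagged
```

The box plot collapses the two clusters into a single box, so the two peaks are hidden, but it marks the extreme value explicitly.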

Problem 2.17 Income at the coffee shop. (p. 59)

The first histogram below shows the distribution of the yearly incomes of 40 patrons at a college coffee shop. Suppose two new people walk into the coffee shop: one making $225,000 and the other $250,000. The second histogram shows the new income distribution. Summary statistics are also provided.

  1. Would the mean or the median best represent what we might think of as a typical income for the 42 patrons at this coffee shop? What does this say about the robustness of the two measures?

The median. Because it is a robust statistic, it is far less affected by extreme values such as the two new incomes, whereas the mean is pulled strongly toward them.

  2. Would the standard deviation or the IQR best represent the amount of variability in the incomes of the 42 patrons at this coffee shop? What does this say about the robustness of the two measures?

The IQR. It depends only on Q1 and Q3, so it describes the spread of the middle 50% of the data and is barely affected by the two extreme incomes. The standard deviation uses every observation's distance from the mean, so it is not robust.
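
The robustness argument can be sketched with a small hypothetical sample. The 40 actual incomes are not listed in the problem, so the `before` vector below is made up; only the two added incomes come from the problem statement. IQR() here uses R's default quantile method, which may differ slightly from the textbook's summary statistics.

```r
# Hypothetical incomes; the real 40 values are not given in the problem
before <- c(61000, 63000, 64000, 65000, 66000, 67000, 69000)
after  <- c(before, 225000, 250000)  # the two new patrons from the problem

compare <- rbind(
  before = c(mean = mean(before), median = median(before),
             sd = sd(before), IQR = IQR(before)),
  after  = c(mean = mean(after), median = median(after),
             sd = sd(after), IQR = IQR(after))
)
round(compare)
```

In this made-up sample the mean and standard deviation jump sharply once the two large incomes are added, while the median and IQR move far less.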

Problem I Made Up

Use the information, histograms, and summary statistics provided in Problem 2.17 to answer the following questions:

  1. The middle 50% of the 40 patrons at a college coffee shop earn what yearly incomes? $65,240 per year
  2. The 25% highest incomes of the 42 patrons at a college coffee shop earn at least what yearly income? $66,540 per year
  3. Determine if the 41st and 42nd customers at a college coffee shop were outliers. Show your work. Yes, both of the additional customers are outliers. To identify outliers, I first calculated the IQR by subtracting Q1 from Q3. The IQR of the data set is 2,830. To find upper outliers beyond Q3, I used the formula Upper Fence = Q3 + (1.5 * IQR). In this case, the calculation was 66,540 + (1.5 * 2,830) = 70,785. Any value above this threshold is an outlier, and both $225,000 and $250,000 exceed it, which confirms the 41st and 42nd customers as outliers.
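
The fence calculation in the last answer can be checked in R. This is a minimal sketch that uses only the figures quoted in that answer (Q3 = 66,540 and IQR = 2,830) and the two new incomes from Problem 2.17; the variable names are illustrative.

```r
# Values quoted in the answer to question 3
q3  <- 66540
iqr <- 2830
q1  <- q3 - iqr  # implied by IQR = Q3 - Q1

upper_fence <- q3 + 1.5 * iqr
new_incomes <- c(225000, 250000)

upper_fence                # 70785
new_incomes > upper_fence  # TRUE TRUE
```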