These exercises are taken from the OpenIntro Statistics textbook (https://www.openintro.org/book/os/). Refer to the page numbers listed by each problem to see the whole question and any additional context.
Now you will start showing the output of your calculations in your problem sets. To do this, include code chunks like the ones below. When you click the Knit button, a document is generated that includes both the content and the output of any R code chunks embedded in the document. You can embed an R code chunk like this:
mean(c(2,4,6))
## [1] 4
78+52
## [1] 130
Indicate which of the plots show (a) a positive association, (b) a negative association, or (c) no association. Also determine if the positive and negative associations are linear or nonlinear. Each part may refer to more than one plot.
Identify which value represents the sample mean and which value represents the claimed population mean.
Sample mean: $58 per household. Claimed population mean: $52 per household.
Sample mean: 3.59 GPA. Claimed population mean: 3.37 GPA.
For each part, compare distributions 1. and 2. based on their medians and IQRs. You DO need to calculate these statistics. Make sure you explain your reasoning.
Sets 1 and 2 have identical medians (6) and IQRs (7 - 5 = 2). The only difference between them, the maximum (9 vs. 20), affects neither statistic.
fivenum(c(3,5,6,7,9))
## [1] 3 5 6 7 9
fivenum(c(3,5,6,7,20))
## [1] 3 5 6 7 20
The second set has both a higher median (7 vs. 6) and a higher IQR (8 - 5 = 3 vs. 7 - 5 = 2).
fivenum(c(3,5,6,7,9))
## [1] 3 5 6 7 9
fivenum(c(3,5,7,8,9))
## [1] 3 5 7 8 9
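The comparisons above can also be checked by calling `median()` and `IQR()` directly rather than reading the values off `fivenum()`. This is a minimal sketch: the variable names and the `center_spread` helper are illustrative only. Note that `IQR()` uses R's default quantile method, which agrees with the hinges from `fivenum()` for these five-value sets but need not agree in general.

```r
# First comparison: the sets differ only in their largest value
a1 <- c(3, 5, 6, 7, 9)
a2 <- c(3, 5, 6, 7, 20)

# Second comparison: the middle values of the second set are shifted upward
b1 <- c(3, 5, 6, 7, 9)
b2 <- c(3, 5, 7, 8, 9)

# Illustrative helper (not a base R function) returning the two statistics compared
center_spread <- function(x) c(median = median(x), IQR = IQR(x))

center_spread(a1)
center_spread(a2)
center_spread(b1)
center_spread(b2)
```

For the first pair, both calls return a median of 6 and an IQR of 2. For the second pair, the second set returns a median of 7 and an IQR of 3.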
For each part, compare distributions 1. and 2. based on their means and standard deviations. You DO need to calculate these statistics. Make sure you explain your reasoning.
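Both the mean (8.78 vs. 8) and the standard deviation (5.21 vs. 3.61) are higher in set #2, because replacing 13 with 20 raises the average and increases the spread.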
mean(c(3, 5, 5, 5, 8, 11, 11, 11, 13))
## [1] 8
sd(c(3, 5, 5, 5, 8, 11, 11, 11, 13))
## [1] 3.605551
mean(c(3, 5, 5, 5, 8, 11, 11, 11, 20))
## [1] 8.777778
sd(c(3, 5, 5, 5, 8, 11, 11, 11, 20))
## [1] 5.214829
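Set #1 has the higher mean (10 vs. 7.5), while set #2 has the higher standard deviation (23.30 vs. 17.93): lowering the minimum from -20 to -40 pulls the mean down and widens the spread.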
mean(c(-20, 0, 0, 0, 15, 25, 30, 30))
## [1] 10
sd(c(-20, 0, 0, 0, 15, 25, 30, 30))
## [1] 17.92843
mean(c(-40, 0, 0, 0, 15, 25, 30, 30))
## [1] 7.5
sd(c(-40, 0, 0, 0, 15, 25, 30, 30))
## [1] 23.29929
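To see why one extreme value moves both statistics, it can help to write the sample standard deviation out by hand. The sketch below reuses the two sets from the second comparison above. `manual_sd` is an illustrative helper, not a base R function, and it should agree with the built-in `sd()`.

```r
# Sample standard deviation written out: sqrt(sum((x - mean)^2) / (n - 1))
manual_sd <- function(x) {
  deviations <- x - mean(x)
  sqrt(sum(deviations^2) / (length(x) - 1))
}

set1 <- c(-20, 0, 0, 0, 15, 25, 30, 30)
set2 <- c(-40, 0, 0, 0, 15, 25, 30, 30)

# Lowering the minimum from -20 to -40 pulls the mean down and widens the spread
c(mean = mean(set1), sd = manual_sd(set1))
c(mean = mean(set2), sd = manual_sd(set2))

# The hand-written version should match the built-in
all.equal(manual_sd(set1), sd(set1))
```

Because each deviation is squared, the single value of -40 contributes far more to the sum than any other observation, which is why the standard deviation reacts so strongly to it.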
Estimate the median for the 400 observations shown in the histogram, and note whether you expect the mean to be higher or lower than the median.
The median is between 80 and 85. I would expect the mean to be lower than the median because the data are left skewed.
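The relationship between skew and the mean can be illustrated with a small made-up data set. The values below are hypothetical and are not the 400 observations from the histogram; they are only chosen to have a long tail toward low values.

```r
# Hypothetical left-skewed scores: most values are high, with a tail toward low values
scores <- c(60, 70, 75, 80, 82, 84, 85, 86, 88, 90)

median(scores)
mean(scores)

# TRUE when the low tail drags the mean below the median
mean(scores) < median(scores)
```

The few low values pull the mean down, while the median depends only on the middle of the ordered data.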
Compare the two plots below. What characteristics of the distribution are apparent in the histogram and not in the box plot? What characteristics are apparent in the box plot but not in the histogram?
The histogram identifies the bimodal (two peaked) nature of the data. The box plot, on the other hand, displays the outliers with glaring clarity.
The first histogram below shows the distribution of the yearly incomes of 40 patrons at a college coffee shop. Suppose two new people walk into the coffee shop: one making $225,000 and the other $250,000. The second histogram shows the new income distribution. Summary statistics are also provided.
The median, because it is a robust statistic, is much less affected by extreme values such as outliers.
The IQR, because it depends only on Q1 and Q3 and is therefore a robust statistic: it reflects the spread of the middle 50% of the data rather than the extremes.
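The robustness of the median and IQR can be sketched with a small example. The incomes below are hypothetical, not the 40 patrons from the exercise; only the two added values ($225,000 and $250,000) come from the problem statement. `income_stats` is an illustrative helper name.

```r
# Hypothetical incomes in dollars; NOT the data from the exercise
incomes <- c(60000, 62000, 63000, 64000, 65000, 66000, 67000, 68000)

# The two new patrons from the problem statement
with_new <- c(incomes, 225000, 250000)

# Illustrative helper collecting the four statistics being compared
income_stats <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
}

# Compare each statistic before and after the two high earners arrive
rbind(before = income_stats(incomes), after = income_stats(with_new))
```

In this sketch the mean and standard deviation shift substantially once the two large incomes are added, while the median and IQR change only slightly.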
Use the information, histograms, and summary statistics provided in Problem 2.17 to answer the following questions: