Part I

Please put the answers for Part I next to the question number (2pts each):

  1. b - daysDrive
  2. a - mean = 3.3, median = 3.5
  3. d - Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients.
  4. there is an association between natural hair color and eye color
  5. b - 17.8 and 69.0, from \(Q1 - 1.5 \times IQR\) and \(Q3 + 1.5 \times IQR\)
  6. d - median and interquartile range; mean and standard deviation
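
The fences 17.8 and 69.0 quoted above follow the 1.5 × IQR rule. A minimal sketch of that arithmetic in Python; the quartiles Q1 = 37.0 and Q3 = 49.8 do not appear in this answer sheet and are back-solved here from the two fences, so treat them as an assumption. The helper name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Assumed quartiles, back-solved from the fences 17.8 and 69.0
# (they are not given in the source).
lower, upper = iqr_fences(37.0, 49.8)
print(round(lower, 1), round(upper, 1))
```

Observations below the lower fence or above the upper fence would be flagged as potential outliers.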

7a. Describe the two distributions (2pts).
The distribution of A is unimodal and slightly skewed to the right by potential outliers. Provided the sample size is large enough, we can assume the distribution is nearly normal.
The distribution of B appears to be normal, with a much smaller range.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
According to the Central Limit Theorem, for any population distribution with mean \(\mu\) and standard deviation \(\sigma\), the sampling distribution of the sample mean \(\bar{X}\) is approximately normal with mean \(\mu\) and standard deviation \(\sigma / \sqrt{n}\), and the approximation improves as \(n\) increases. Both distributions are therefore centered near the same mean, while the standard deviation of the sample means shrinks by a factor of \(\sqrt{n}\). This explains the difference in standard deviation between our population and sample, where \(3.22 / \sqrt{30} \approx 0.59\).
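
The arithmetic above can be checked with a short Python sketch. The first part uses the values quoted in the answer (3.22 and 30). The second part is a hypothetical simulation, not taken from the exam: it draws repeated samples from a right-skewed exponential population with mean 1 and standard deviation 1, and compares the spread of the sample means with \(\sigma / \sqrt{n}\).

```python
import math
import random
import statistics

def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

se = standard_error(3.22, 30)  # values quoted in the answer
print(round(se, 2))  # 0.59

# Hypothetical check by simulation: 5000 samples of size 30 from an
# exponential population (mean 1, standard deviation 1).
random.seed(1)
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(30))
    for _ in range(5000)
]
simulated_sd = statistics.stdev(sample_means)
theoretical_sd = standard_error(1.0, 30)
print(round(simulated_sd, 3), round(theoretical_sd, 3))
```

The two printed values in the last line should be close to each other, and the mean of `sample_means` should stay near the population mean of 1.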

7c. What is the statistical principle that describes this phenomenon (2 pts)?
Central limit theorem.

Part II

Consider the four datasets, each with two columns (x and y), provided below.

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

##   x   y
## 1 9 7.5
## 2 9 7.5
## 3 9 7.5
## 4 9 7.5

b. The median (for x and y separately; 1 pt).

##   x   y
## 1 9 7.6
## 2 9 8.1
## 3 9 7.1
## 4 8 7.0

c. The standard deviation (for x and y separately; 1 pt).

##     x y
## 1 3.3 2
## 2 3.3 2
## 3 3.3 2
## 4 3.3 2

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

##      x    y
## x 1.00 0.82
## y 0.82 1.00
##      x    y
## x 1.00 0.82
## y 0.82 1.00
##      x    y
## x 1.00 0.82
## y 0.82 1.00
##      x    y
## x 1.00 0.82
## y 0.82 1.00

e. Linear regression equation (2 pts).

  1. \(y_{ 1 } = 3 + 0.5 \times x_{ 1 }\)
  2. \(y_{ 2 } = 3 + 0.5 \times x_{ 2 }\)
  3. \(y_{ 3 } = 3 + 0.5 \times x_{ 3 }\)
  4. \(y_{ 4 } = 3 + 0.5 \times x_{ 4 }\)

f. R-Squared (2 pts).

  1. 0.67
  2. 0.67
  3. 0.67
  4. 0.67
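
The quantities in parts (a) through (f) can be reproduced with the Python standard library. The data themselves are not reproduced in this answer sheet; the sketch below assumes they are Anscombe's quartet, which matches every statistic reported above. The helper name `summarize` is hypothetical.

```python
import math
import statistics

# Assumed data: Anscombe's quartet (not listed in the source).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Summary statistics and least-squares fit for one (x, y) pair."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    return {
        "mean_x": mx, "mean_y": my,
        "median_x": statistics.median(x), "median_y": statistics.median(y),
        "sd_x": statistics.stdev(x), "sd_y": statistics.stdev(y),  # sample SD
        "r": r,
        "intercept": my - slope * mx, "slope": slope,
        "r_squared": r ** 2,  # equals r^2 for simple linear regression
    }

results = {k: summarize(x, y) for k, (x, y) in datasets.items()}
for k, res in results.items():
    print(k, {name: round(value, 2) for name, value in res.items()})
```

Under that assumption, every dataset rounds to means of 9 and 7.5, a correlation of 0.82, the line \(y = 3 + 0.5 \times x\), and an R-squared of 0.67, in line with the answers above.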

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Model 1 meets the linearity and constant variability conditions; however, the normality condition is not met, so a linear regression model is not appropriate.
Model 2 doesn’t meet any of the conditions required for a linear regression.
Model 3 can be used tentatively, as it meets the necessary conditions but has outliers.
Model 4 doesn’t meet the linearity, constant variability or normality conditions, so it is not appropriate to use.
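
These conditions are normally judged from scatter and residual plots. As a plot-free sketch of what such plots show, the residuals from the common fitted line \(y = 3 + 0.5 \times x\) can be listed in order of x. The data are again assumed to be Anscombe's quartet (only datasets 2 to 4 are used here), and the helper name `residuals` is hypothetical.

```python
# Assumed data: datasets 2, 3 and 4 of Anscombe's quartet (not listed in the source).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def residuals(x, y, intercept=3.0, slope=0.5):
    """Return (x, observed - fitted) pairs sorted by x for the line above."""
    pairs = [(a, b - (intercept + slope * a)) for a, b in zip(x, y)]
    return sorted(pairs)

res2 = residuals(*datasets[2])
res3 = residuals(*datasets[3])

# Dataset 2: residuals are negative at both ends and positive in the
# middle, the signature of a curved relationship.
print([(a, round(r, 2)) for a, r in res2])

# Dataset 3: one observation sits far above the line.
worst_x, worst_residual = max(res3, key=lambda p: abs(p[1]))
print(worst_x, round(worst_residual, 2))

# Dataset 4: x takes only two distinct values, so the slope is
# determined by a single high-leverage point.
distinct_x4 = sorted(set(datasets[4][0]))
print(distinct_x4)
```

A residual plot displays the same patterns at a glance, which is why the question asks for plots alongside the numeric summaries.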

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

All of the datasets above have the same mean, standard deviation, and correlation, and roughly the same median. Plotting, however, reveals that not all of them satisfy the conditions of normality, linearity and constant variability. Hence, visualizations are essential for seeing patterns that are not visible from basic summary statistics alone.