Part I

Please put the answers for Part I next to the question number (2pts each):

  1. d - daysDrive, gasMonth
  2. a - mean = 3.3, median = 3.5
  3. a - Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups.
  4. c - there is an association between natural hair color and eye color
  5. a - 37.0 and 49.8
  6. d - median and interquartile range; mean and standard deviation

7a. Describe the two distributions (2pts).

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

7c. What is the statistical principle that describes this phenomenon (2 pts)?

The Central Limit Theorem states that the distribution of sample means is approximately normal, provided the individual observations are independent and each sample is sufficiently large (typically 30 or more observations).
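As a quick illustration (a minimal sketch using a simulated skewed population, not the exam data): draws from Exp(1) are strongly right-skewed, yet the means of repeated samples of n = 30 pile up in a roughly normal shape.

# CLT illustration: the exponential distribution is heavily right-skewed,
# but the means of 1,000 samples of n = 30 are approximately normal
set.seed(1)
sample_means <- replicate(1000, mean(rexp(30, rate = 1)))
hist(sample_means, breaks = 20,
     main = "Means of 1,000 samples of n = 30 from Exp(1)")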

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
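(These are the values of Anscombe's quartet; base R ships the same data in wide format as the built-in anscombe data frame, with columns x1-x4 and y1-y4.)

head(anscombe)  # same values, wide format: x1-x4 and y1-y4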

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

library(purrr)

# c() flattens the four data frames into a single list of eight columns,
# so mean() is applied to each column in turn
map_dbl(c(data1, data2, data3, data4), mean)
##   x   y   x   y   x   y   x   y 
## 9.0 7.5 9.0 7.5 9.0 7.5 9.0 7.5

Each has a mean of 9.0 for x and a mean of 7.5 for y.

b. The median (for x and y separately; 1 pt).

map_dbl(c(data1, data2, data3, data4), median)
##   x   y   x   y   x   y   x   y 
## 9.0 7.6 9.0 8.1 9.0 7.1 8.0 7.0

Medians of data1 - x: 9.0, y: 7.6
Medians of data2 - x: 9.0, y: 8.1
Medians of data3 - x: 9.0, y: 7.1
Medians of data4 - x: 8.0, y: 7.0

c. The standard deviation (for x and y separately; 1 pt).

map_dbl(c(data1, data2, data3, data4), sd)
##   x   y   x   y   x   y   x   y 
## 3.3 2.0 3.3 2.0 3.3 2.0 3.3 2.0

Each has a standard deviation of 3.3 for x and 2.0 for y.

For each x and y pair, calculate (also to two decimal places):

d. The correlation (1 pt).

map_dbl(list(data1, data2, data3, data4),
        function(df) cor(df$x, df$y))
## [1] 0.82 0.82 0.82 0.82

For each x, y pair, the correlation is 0.82.
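As a sanity check, the same number falls out of the definition of correlation as covariance scaled by the two standard deviations; a minimal sketch for data1:

# r = cov(x, y) / (sd(x) * sd(y)); matches cor(data1$x, data1$y) above
cov(data1$x, data1$y) / (sd(data1$x) * sd(data1$y))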

e. Linear regression equation (2 pts).

map(list(data1, data2, data3, data4),
    function(df) lm(df$y ~ df$x))
## [[1]]
## 
## Call:
## lm(formula = df$y ~ df$x)
## 
## Coefficients:
## (Intercept)         df$x  
##         3.0          0.5  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = df$y ~ df$x)
## 
## Coefficients:
## (Intercept)         df$x  
##         3.0          0.5  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = df$y ~ df$x)
## 
## Coefficients:
## (Intercept)         df$x  
##         3.0          0.5  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = df$y ~ df$x)
## 
## Coefficients:
## (Intercept)         df$x  
##         3.0          0.5

  1. y = 0.5x + 3
  2. y = 0.5x + 3
  3. y = 0.5x + 3
  4. y = 0.5x + 3
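Rather than reading the coefficients off the printed output, they can also be extracted directly with coef(); a minimal sketch:

# Pull the intercept and slope out of each fitted model
map(list(data1, data2, data3, data4),
    function(df) round(coef(lm(df$y ~ df$x)), 2))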

f. R-Squared (2 pts).

map(list(data1, data2, data3, data4),
    function(df) summary(lm(df$y ~ df$x)))
## [[1]]
## 
## Call:
## lm(formula = df$y ~ df$x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## df$x           0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
## 
## 
## [[2]]
## 
## Call:
## lm(formula = df$y ~ df$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## df$x           0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
## 
## 
## [[3]]
## 
## Call:
## lm(formula = df$y ~ df$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## df$x           0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
## 
## 
## [[4]]
## 
## Call:
## lm(formula = df$y ~ df$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## df$x           0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

  1. 0.67
  2. 0.67
  3. 0.67
  4. 0.67

(These are the Multiple R-squared values from the output above; the Adjusted R-squared is approximately 0.63 for each pair.)
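R-squared can also be extracted from each summary object directly instead of being scanned out of the printed output; a minimal sketch:

# Multiple R-squared for each fit (use $adj.r.squared for the adjusted value)
map_dbl(list(data1, data2, data3, data4),
        function(df) summary(lm(df$y ~ df$x))$r.squared)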

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

First, just look at x vs. y:

plot(data1$y ~ data1$x)

plot(data2$y ~ data2$x)

plot(data3$y ~ data3$x)

plot(data4$y ~ data4$x)
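The four scatterplots are easier to compare side by side; a minimal sketch that draws them in one 2-by-2 grid:

# Draw all four x-y scatterplots in a single 2x2 panel
par(mfrow = c(2, 2))
datasets <- list(data1, data2, data3, data4)
for (i in 1:4) {
  plot(datasets[[i]]$x, datasets[[i]]$y,
       main = paste0("data", i), xlab = "x", ylab = "y")
}
par(mfrow = c(1, 1))  # reset the plotting layout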

We need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

  1. Linearity
reg1 <- lm(data1$y ~ data1$x)
plot(reg1$residuals ~ data1$x)
abline(h = 0, lty = 3)

reg2 <- lm(data2$y ~ data2$x)
plot(reg2$residuals ~ data2$x)
abline(h = 0, lty = 3)

reg3 <- lm(data3$y ~ data3$x)
plot(reg3$residuals ~ data3$x)
abline(h = 0, lty = 3)

reg4 <- lm(data4$y ~ data4$x)
plot(reg4$residuals ~ data4$x)
abline(h = 0, lty = 3)

data1: residuals look randomly distributed above and below zero for all values of x - linearity condition met

data2: residuals do not look randomly distributed across the values of x - linearity condition not met (the residuals follow a clearly curved, non-linear pattern).

data3: residuals do not look randomly distributed across the values of x - linearity condition not met (the outlier at x = 13 has an outsized effect on otherwise linearly distributed values).

data4: residuals do not look randomly distributed across the values of x - linearity condition not met (there is effectively only one distinct x value, so linearity cannot be assessed; the y values at x = 8 are heterogeneously distributed).

  2. Nearly normal residuals
hist(reg1$residuals, breaks = 8)

hist(reg2$residuals, breaks = 8)

hist(reg3$residuals, breaks = 8)

hist(reg4$residuals, breaks = 8)

qqnorm(reg1$residuals)
qqline(reg1$residuals)  # adds diagonal line to the normal prob plot

qqnorm(reg2$residuals)
qqline(reg2$residuals)

qqnorm(reg3$residuals)
qqline(reg3$residuals)

qqnorm(reg4$residuals)
qqline(reg4$residuals)

Only data4 looks to have residuals that are approximately normally distributed according to these diagnostics. It is surprising that data1's residuals are not more normally distributed (though they look second closest); this may be because the data are somewhat curvilinear in their distribution.

  3. Constant variability

Only data1 looks to have points with constant variability across all values of x. data4 clearly violates this condition. For data2, the variance appears smaller for the central values of x. For data3, the outlier confounds the diagnosis of this condition.
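Base R bundles all of these checks: calling plot() on a fitted lm object draws residuals-vs-fitted, normal Q-Q, scale-location, and residuals-vs-leverage plots in one step. For example, for data1:

# Standard built-in regression diagnostics for reg1
par(mfrow = c(2, 2))
plot(reg1)
par(mfrow = c(1, 1))  # reset the plotting layout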

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Visualizations are important for diagnostic, exploratory, and persuasive purposes.

As seen above, routine tests might not fully account for idiosyncratic data distributions, but such idiosyncrasies can be immediately apparent upon visual inspection. For instance, it is immediately clear from this graph that there is an outlier value that should be considered:

plot(data3$y ~ data3$x)

For persuasive purposes, graphs can act as a “universal language”: trends under discussion can be engaged with readily, without necessarily understanding the underlying data. For instance, a reader may not know the formulas behind the diagnostic tests above, but looking at this graph, it is easy to see that something concerning is going on in the distribution of the data:

plot(data4$y ~ data4$x)
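For a non-technical audience, a little annotation helps; a minimal sketch that adds a title, axis labels, and the fitted line (which is, misleadingly, identical across all four datasets):

# Annotated version of the data4 plot, with the fitted line overlaid
plot(data4$y ~ data4$x,
     main = "data4: a single unusual x value drives the fit",
     xlab = "x", ylab = "y")
abline(reg4, lty = 2)  # y = 0.5x + 3, the same line as the other datasets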