Please put the answers for Part I next to the question number (2 pts each):
7a. Describe the two distributions (2 pts).
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
7c. What is the statistical principle that describes this phenomenon (2 pts)?
The Central Limit Theorem states that the distribution of sample means is approximately normal, provided the individual observations are independent and each sample aggregates a sufficient number of them (typically 30 or more).
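As a quick illustration (a hypothetical simulation, not part of the assignment data), sample means drawn from even a strongly skewed distribution look approximately normal once each sample is large enough:
# Minimal CLT sketch: the means of 1,000 samples of n = 30 draws from a
# skewed exponential distribution are approximately normally distributed.
set.seed(1)
sample_means <- replicate(1000, mean(rexp(30, rate = 1)))
hist(sample_means, breaks = 30)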
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73))
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
For each column, calculate (to two decimal places):
library(purrr)
# c() flattens the four data frames into a single list of eight column
# vectors, so map_dbl() returns the mean of each column.
map_dbl(c(data1, data2, data3, data4), mean)
## x y x y x y x y
## 9.0 7.5 9.0 7.5 9.0 7.5 9.0 7.5
Each dataset has a mean of 9.0 for x and a mean of 7.5 for y.
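An equivalent, more explicit form (a sketch) keeps each dataset's results separate by mapping over a list of data frames:
# Returns a list of four named vectors, one per dataset.
map(list(data1, data2, data3, data4),
    function(df) map_dbl(df, mean))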
map_dbl(c(data1, data2, data3, data4), median)
## x y x y x y x y
## 9.0 7.6 9.0 8.1 9.0 7.1 8.0 7.0
Medians of data1 - x: 9.0, y: 7.6
Medians of data2 - x: 9.0, y: 8.1
Medians of data3 - x: 9.0, y: 7.1
Medians of data4 - x: 8.0, y: 7.0
map_dbl(c(data1, data2, data3, data4), sd)
## x y x y x y x y
## 3.3 2.0 3.3 2.0 3.3 2.0 3.3 2.0
Std of data1 - x: 3.3, y: 2.0
Std of data2 - x: 3.3, y: 2.0
Std of data3 - x: 3.3, y: 2.0
Std of data4 - x: 3.3, y: 2.0
map_dbl(list(data1, data2, data3, data4),
        function(df) cor(df$x, df$y))
## [1] 0.82 0.82 0.82 0.82
For each dataset, the correlation between x and y is 0.82.
map(list(data1, data2, data3, data4),
    function(df) lm(df$y ~ df$x))
## [[1]]
##
## Call:
## lm(formula = df$y ~ df$x)
##
## Coefficients:
## (Intercept) df$x
## 3.0 0.5
##
##
## [[2]]
##
## Call:
## lm(formula = df$y ~ df$x)
##
## Coefficients:
## (Intercept) df$x
## 3.0 0.5
##
##
## [[3]]
##
## Call:
## lm(formula = df$y ~ df$x)
##
## Coefficients:
## (Intercept) df$x
## 3.0 0.5
##
##
## [[4]]
##
## Call:
## lm(formula = df$y ~ df$x)
##
## Coefficients:
## (Intercept) df$x
## 3.0 0.5
map(list(data1, data2, data3, data4),
    function(df) summary(lm(df$y ~ df$x)))
## [[1]]
##
## Call:
## lm(formula = df$y ~ df$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## df$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
##
##
## [[2]]
##
## Call:
## lm(formula = df$y ~ df$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## df$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
##
##
## [[3]]
##
## Call:
## lm(formula = df$y ~ df$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## df$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
##
##
## [[4]]
##
## Call:
## lm(formula = df$y ~ df$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## df$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
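To compare the four nearly identical fits side by side, the coefficient tables can be collected into a single data frame; a minimal sketch, assuming the broom package is installed:
library(broom)
# Stack the tidied coefficient tables, labeling each fit by dataset.
map_dfr(list(data1, data2, data3, data4),
        function(df) tidy(lm(y ~ x, data = df)),
        .id = "dataset")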
First, just look at x vs. y:
plot(data1$y ~ data1$x)
plot(data2$y ~ data2$x)
plot(data3$y ~ data3$x)
plot(data4$y ~ data4$x)
We need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
reg1 <- lm(data1$y ~ data1$x)
plot(reg1$residuals ~ data1$x)
abline(h = 0, lty = 3)
reg2 <- lm(data2$y ~ data2$x)
plot(reg2$residuals ~ data2$x)
abline(h = 0, lty = 3)
reg3 <- lm(data3$y ~ data3$x)
plot(reg3$residuals ~ data3$x)
abline(h = 0, lty = 3)
reg4 <- lm(data4$y ~ data4$x)
plot(reg4$residuals ~ data4$x)
abline(h = 0, lty = 3)
data1: residuals look randomly scattered above and below zero for all values of x - linearity condition met
data2: residuals do not look randomly scattered for all values of x - linearity condition not met (the distribution is clearly non-linear, curving systematically with x)
data3: residuals do not look randomly scattered for all values of x - linearity condition not met (the outlier at x = 13 has an outsized effect on otherwise linearly distributed values)
data4: residuals do not look randomly scattered for all values of x - linearity condition not met (there are not enough distinct x values to judge, and the values at x = 8 are much more heterogeneously distributed)
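For easier side-by-side comparison, the same four residual plots can also be drawn in a single 2 x 2 grid (a sketch using base graphics):
par(mfrow = c(2, 2))  # arrange the four plots in a 2 x 2 grid
plot(reg1$residuals ~ data1$x); abline(h = 0, lty = 3)
plot(reg2$residuals ~ data2$x); abline(h = 0, lty = 3)
plot(reg3$residuals ~ data3$x); abline(h = 0, lty = 3)
plot(reg4$residuals ~ data4$x); abline(h = 0, lty = 3)
par(mfrow = c(1, 1))  # reset to one plot per figure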
hist(reg1$residuals, breaks = 8)
hist(reg2$residuals, breaks = 8)
hist(reg3$residuals, breaks = 8)
hist(reg4$residuals, breaks = 8)
qqnorm(reg1$residuals)
qqline(reg1$residuals) # adds diagonal line to the normal prob plot
qqnorm(reg2$residuals)
qqline(reg2$residuals)
qqnorm(reg3$residuals)
qqline(reg3$residuals)
qqnorm(reg4$residuals)
qqline(reg4$residuals)
Only data4's residuals look approximately normally distributed according to these diagnostics. It is surprising that data1's residuals are not more normal (though they look second closest); I suspect this is due to the data being somewhat curvilinear in its distribution.
Only data1 appears to have constant variability in the residuals across all values of x. data4 clearly violates this condition. For data2, the variance appears smaller for the central x values. For data3, the outlier confounds diagnosis of this condition.
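A numeric complement to the QQ plots is a formal normality test such as Shapiro-Wilk (shapiro.test in base R); a sketch - though with only 11 observations per dataset the test has little power, so it supplements rather than replaces visual inspection:
# Test each regression's residuals for departure from normality.
map(list(reg1, reg2, reg3, reg4),
    function(reg) shapiro.test(reg$residuals))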
Visualizations are important for diagnostic, exploratory and persuasive purposes.
As seen above, routine tests might not fully account for idiosyncratic data distributions, but these idiosyncrasies can be immediately apparent upon visual inspection:
For instance, it’s immediately clear looking at this graph that there is an outlier value that should be considered:
plot(data3$y ~ data3$x)
For persuasive purposes, graphs of data can act as a "universal language": even where in-depth (or superficial) analysis has uncovered underlying trends, a graph lets an audience engage with those trends without necessarily understanding the underlying data. For instance, a reader who knows none of the formulas behind the diagnostic tests above can still look at this graph and readily see that something concerning is going on in the distribution of the data:
plot(data4$y ~ data4$x)