Please put the answers for Part I next to the question number (2pts each):
7a. Describe the two distributions (2 pts). Distribution a is unimodal and strongly right skewed, with a center around five. Distribution b is approximately normal with a much smaller spread.
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts). Distribution b is a sampling distribution: each observation is the mean of a random sample. By the central limit theorem, since each sample size is 30 or greater, this sampling distribution is approximately normal and centered at the population mean, so the two means are similar. Its standard deviation is much smaller because averaging within each sample cancels out much of the variability of individual observations; the mean of a random sample varies far less than a single value does.
7c. What is the statistical principle that describes this phenomenon (2 pts)? The central limit theorem.
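A quick simulation illustrates the point (a hypothetical sketch, not part of the assignment data; the population shape and parameters here are assumptions):
# Hypothetical sketch: a right-skewed population with mean near 5,
# versus the distribution of means of samples of size 30 drawn from it
set.seed(1)
population <- rexp(1e4, rate = 1/5)   # skewed, mean about 5
sample_means <- replicate(1000, mean(sample(population, 30)))
mean(population); mean(sample_means)  # similar centers
sd(population); sd(sample_means)      # sd of means is about sd(population)/sqrt(30)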
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
options(digits=3)
mean(data1$x)
## [1] 9
mean(data1$y)
## [1] 7.5
mean(data2$x)
## [1] 9
mean(data2$y)
## [1] 7.5
mean(data3$x)
## [1] 9
mean(data3$y)
## [1] 7.5
mean(data4$x)
## [1] 9
mean(data4$y)
## [1] 7.5
Note: I show two decimal places only where they are meaningful (trailing zeros are dropped).
options(digits=3)
median(data1$x)
## [1] 9
median(data1$y)
## [1] 7.58
median(data2$x)
## [1] 9
median(data2$y)
## [1] 8.14
median(data3$x)
## [1] 9
median(data3$y)
## [1] 7.11
median(data4$x)
## [1] 8
median(data4$y)
## [1] 7.04
options(digits=3)
sd(data1$x)
## [1] 3.32
sd(data1$y)
## [1] 2.03
sd(data2$x)
## [1] 3.32
sd(data2$y)
## [1] 2.03
sd(data3$x)
## [1] 3.32
sd(data3$y)
## [1] 2.03
sd(data4$x)
## [1] 3.32
sd(data4$y)
## [1] 2.03
options(digits=2)
cor(data1$x, data1$y)
## [1] 0.82
cor(data2$x, data2$y)
## [1] 0.82
cor(data3$x, data3$y)
## [1] 0.82
cor(data4$x, data4$y)
## [1] 0.82
lmdata1 <- lm(x ~ y, data = data1)
lmdata2 <- lm(x ~ y, data = data2)
lmdata3 <- lm(x ~ y, data = data3)
lmdata4 <- lm(x ~ y, data = data4)
summary(lmdata1)
##
## Call:
## lm(formula = x ~ y, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.652 -1.512 -0.266 1.234 3.895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.998 2.434 -0.41 0.6916
## y 1.333 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
summary(lmdata2)
##
## Call:
## lm(formula = x ~ y, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.852 -1.432 -0.344 0.847 4.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.995 2.435 -0.41 0.6925
## y 1.332 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
summary(lmdata3)
##
## Call:
## lm(formula = x ~ y, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.987 -1.373 -0.027 1.320 3.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.000 2.436 -0.41 0.6910
## y 1.333 0.315 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
summary(lmdata4)
##
## Call:
## lm(formula = x ~ y, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.786 -1.412 -0.185 1.455 3.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004 2.435 -0.41 0.6898
## y 1.334 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
Data 1: Expected x = -0.998 + 1.333 * y
Data 2: Expected x = -0.995 + 1.332 * y
Data 3: Expected x = -1.000 + 1.333 * y
Data 4: Expected x = -1.004 + 1.334 * y
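The same coefficients can also be pulled directly from the fitted objects rather than read off the summary tables:
# Intercept and slope for each fitted model, side by side
sapply(list(lmdata1, lmdata2, lmdata3, lmdata4), coef)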
From the model summaries above:
Data 1: R-squared = 0.667
Data 2: R-squared = 0.666
Data 3: R-squared = 0.666
Data 4: R-squared = 0.667
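These values can likewise be extracted programmatically:
# R-squared for each model, pulled from the summary objects
sapply(list(lmdata1, lmdata2, lmdata3, lmdata4),
       function(m) summary(m)$r.squared)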
Data 1
# Linearity
plot(x ~ y, data = data1)
# Nearly normal residuals
hist(lmdata1$residuals)
# Equal variance
plot(x ~ y, data = data1)
abline(lmdata1)
1. Linearity: yes
2. Nearly normal residuals: yes, nearly normal
3. Equal variance: yes
Yes, a linear model is appropriate for data 1.
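Base R can also produce these diagnostics directly from the fitted model; a minimal sketch:
# Residuals vs. fitted (equal variance) and normal Q-Q (normality) for data 1
par(mfrow = c(1, 2))
plot(lmdata1, which = 1:2)
par(mfrow = c(1, 1))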
Data 2
# Linearity
plot(x ~ y, data = data2)
# Nearly normal residuals
hist(lmdata2$residuals)
# Equal variance
plot(x ~ y, data = data2)
abline(lmdata2)
1. Linearity: no
2. Nearly normal residuals: no
3. Equal variance: no
No, it is not appropriate to use a linear model for data 2.
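For data 2 specifically, the residuals-vs-fitted plot makes the failure easy to see: the residuals form a systematic pattern rather than random scatter, which signals a violated linearity assumption:
# Systematic (non-random) residual pattern indicates curvature a line misses
plot(lmdata2, which = 1)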
Data 3
# Linearity
plot(x ~ y, data = data3)
# Nearly normal residuals
hist(lmdata3$residuals)
# Equal variance
plot(x ~ y, data = data3)
abline(lmdata3)
1. Linearity: no
2. Nearly normal residuals: no, not really normal
3. Equal variance: no
All of these checks would hold if we took out the outlier. However, because there are so few observations, it may make more sense to keep the outlier and therefore not use a linear model. No, a linear model is not appropriate for data 3 without removing the outlier; a sketch of that refit follows below.
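As a sketch of that refit (assuming the point with the largest y, (13, 12.74), is the outlier in question):
# Hypothetical: drop the apparent outlier in data3 and refit
data3_trim <- data3[-which.max(data3$y), ]
lmdata3_trim <- lm(x ~ y, data = data3_trim)
coef(lmdata3_trim)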
Data 4
# Linearity
plot(x ~ y, data = data4)
# Nearly normal residuals
hist(lmdata4$residuals)
# Equal variance
plot(x ~ y, data = data4)
abline(lmdata4)
1. Linearity: no
2. Nearly normal residuals: no, not really normal
3. Equal variance: no
No, a linear model is not appropriate for data 4 without removing the large outlier, and as the sketch below shows, removing it leaves nothing to model.
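A hypothetical check shows that dropping the outlier is no fix here: without the x = 19 point, x is constant, so there is no relationship left to regress:
# Hypothetical: drop the high-leverage point in data4
data4_trim <- data4[data4$x != 19, ]
unique(data4_trim$x)  # all remaining x values are 8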
It is important to create visualizations because datasets with nearly identical summary statistics can be shaped in very different ways, which distorts any model fit to them. Comparing data 1 and data 2 makes this particularly clear, as does the side-by-side view below.
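A side-by-side plot of all four datasets makes the contrast immediate (a small sketch in base graphics):
# All four datasets on one screen: same summary statistics, different shapes
par(mfrow = c(2, 2))
plot(x ~ y, data = data1, main = "data1")
plot(x ~ y, data = data2, main = "data2")
plot(x ~ y, data = data3, main = "data3")
plot(x ~ y, data = data4, main = "data4")
par(mfrow = c(1, 1))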