Part I
Ans: b. daysDrive
Ans: b. mean = 3.5, median = 3.3
Ans: a. Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups
Ans: c. there is an association between natural hair colour and eye colour
Ans: c. 36.0 and 52.8
Ans: median and interquartile range; mean and standard deviation
Ans: Distribution A is positively skewed and Distribution B is close to normally distributed
Ans: Both distributions peak at around 5 because Distribution B is sampled from Distribution A. Distribution B, however, has less spread because it contains fewer values than Distribution A.
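Ans: The Central Limit Theorem says that when samples are taken from a population, the sampling distribution of the sample mean becomes closer to normal as the sample size grows (around 30 is a common rule of thumb), even if the population itself is not normally distributed.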
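The Central Limit Theorem answer can be illustrated with a small simulation. This is only a sketch: the exponential population, the sample sizes of 5 and 30, and the names `population` and `sample_means` are illustrative assumptions, not part of the original question.

```r
# Sketch: sampling distribution of the mean from a right-skewed population.
# The exponential population and the sample sizes are illustrative assumptions.
set.seed(1)
population <- rexp(100000, rate = 1)   # skewed population with mean close to 1

# Draw `reps` samples of size n and return the mean of each sample
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

means_n5  <- sample_means(5)
means_n30 <- sample_means(30)

# Side-by-side histograms of the two sets of sample means
par(mfrow = c(1, 2))
hist(means_n5,  main = "Sample means, n = 5",  xlab = "mean")
hist(means_n30, main = "Sample means, n = 30", xlab = "mean")
par(mfrow = c(1, 1))
```

With the larger sample size the histogram of means should look more symmetric and narrower, even though the population itself is skewed.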
Part II
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
mean(data1$x)
## [1] 9
mean(data1$y)
## [1] 7.5
mean(data2$x)
## [1] 9
mean(data2$y)
## [1] 7.5
mean(data3$x)
## [1] 9
mean(data3$y)
## [1] 7.5
mean(data4$x)
## [1] 9
mean(data4$y)
## [1] 7.5
median(data1$x)
## [1] 9
median(data1$y)
## [1] 7.6
median(data2$x)
## [1] 9
median(data2$y)
## [1] 8.1
median(data3$x)
## [1] 9
median(data3$y)
## [1] 7.1
median(data4$x)
## [1] 8
median(data4$y)
## [1] 7
sd(data1$x)
## [1] 3.3
sd(data1$y)
## [1] 2
sd(data2$x)
## [1] 3.3
sd(data2$y)
## [1] 2
sd(data3$x)
## [1] 3.3
sd(data3$y)
## [1] 2
sd(data4$x)
## [1] 3.3
sd(data4$y)
## [1] 2
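The repeated calls above could also be collected into one table. The sketch below assumes that R's built-in `anscombe` data frame holds the same values as data1 to data4; the names `quartet` and `summary_stats` are hypothetical.

```r
# Sketch: one summary table for all four datasets.
# Assumption: the built-in `anscombe` data frame matches data1..data4.
quartet <- list(
  data1 = data.frame(x = anscombe$x1, y = anscombe$y1),
  data2 = data.frame(x = anscombe$x2, y = anscombe$y2),
  data3 = data.frame(x = anscombe$x3, y = anscombe$y3),
  data4 = data.frame(x = anscombe$x4, y = anscombe$y4)
)

# One row per dataset, one column per statistic
summary_stats <- t(sapply(quartet, function(d) {
  c(mean_x = mean(d$x),     mean_y = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x = sd(d$x),         sd_y = sd(d$y))
}))

print(round(summary_stats, 2))
```

Laying the statistics out side by side makes it easier to see that the means and standard deviations agree while the medians differ.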
cor(data1)
## x y
## x 1.00 0.82
## y 0.82 1.00
cor(data2)
## x y
## x 1.00 0.82
## y 0.82 1.00
cor(data3)
## x y
## x 1.00 0.82
## y 0.82 1.00
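The correlation for data4 is not shown above. A minimal sketch that computes the x-y correlation for all four datasets in one pass, again assuming the built-in `anscombe` data frame matches data1 to data4 (`quartet` and `correlations` are hypothetical names):

```r
# Sketch: x-y correlation for each dataset.
# Assumption: the built-in `anscombe` data frame matches data1..data4.
quartet <- list(
  data1 = data.frame(x = anscombe$x1, y = anscombe$y1),
  data2 = data.frame(x = anscombe$x2, y = anscombe$y2),
  data3 = data.frame(x = anscombe$x3, y = anscombe$y3),
  data4 = data.frame(x = anscombe$x4, y = anscombe$y4)
)

# cor(x, y) returns a single number, so sapply gives a named vector
correlations <- sapply(quartet, function(d) cor(d$x, d$y))
print(round(correlations, 2))
```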
data1.lm <- lm(y~x, data=data1)
data2.lm <- lm(y~x, data=data2)
data3.lm <- lm(y~x, data=data3)
summary(data1.lm)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
summary(data2.lm)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
summary(data3.lm)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
Data1 regression equation is:
\[\hat{y} = 3.0 + 0.5 * x\]
The R-squared = 0.667
Data2 regression equation is:
\[\hat{y} = 3.001 + 0.5 * x\]
The R-squared = 0.666
Data3 regression equation is:
\[\hat{y} = 3.002 + 0.5 * x\]
The R-squared = 0.666
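The coefficients and R-squared values can also be pulled out of the fitted models directly instead of being read off the summaries. The sketch below assumes the built-in `anscombe` data frame matches data1 to data4, and it also fits data4, which the solution above does not; `quartet`, `fits` and `fit_table` are hypothetical names.

```r
# Sketch: intercept, slope and R-squared for each dataset.
# Assumption: the built-in `anscombe` data frame matches data1..data4.
quartet <- list(
  data1 = data.frame(x = anscombe$x1, y = anscombe$y1),
  data2 = data.frame(x = anscombe$x2, y = anscombe$y2),
  data3 = data.frame(x = anscombe$x3, y = anscombe$y3),
  data4 = data.frame(x = anscombe$x4, y = anscombe$y4)
)

# Fit y ~ x separately for each dataset
fits <- lapply(quartet, function(d) lm(y ~ x, data = d))

# coef() gives the intercept and slope; summary()$r.squared gives R-squared
fit_table <- t(sapply(fits, function(f) {
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    r_squared = summary(f)$r.squared)
}))

print(round(fit_table, 3))
```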
Appropriateness of linear model estimation:
Condition 1: Linearity
Condition 2: Constant variability
Condition 3: Nearly normal residuals
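The three conditions are usually checked with a residual plot, a histogram of residuals and a normal Q-Q plot. A minimal sketch of a reusable helper follows; `check_conditions` is a hypothetical function, not part of the original solution, and the example call assumes the built-in `anscombe` data frame matches data1.

```r
# Hypothetical helper: draws the three diagnostic plots for a simple
# linear regression fit and returns the residuals invisibly.
check_conditions <- function(fit, x, label = "") {
  res <- residuals(fit)
  par(mfrow = c(1, 3))
  # Linearity and constant variability: residuals against the predictor
  plot(res ~ x, main = paste(label, "residuals vs x"), ylab = "residuals")
  abline(h = 0, lty = 3)
  # Nearly normal residuals: histogram and normal Q-Q plot
  hist(res, main = paste(label, "residual histogram"), xlab = "residuals")
  qqnorm(res, main = paste(label, "normal Q-Q"))
  qqline(res)
  par(mfrow = c(1, 1))
  invisible(res)
}

# Example call (assumes `anscombe` matches data1)
d <- data.frame(x = anscombe$x1, y = anscombe$y1)
res <- check_conditions(lm(y ~ x, data = d), d$x, label = "Data1")
```

The residuals are plotted against the predictor x here; plotting them against the fitted values would serve the same purpose.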
Data1 plots:
# Residuals vs. the predictor x (residuals are correlated with y by construction)
plot(data1.lm$residuals ~ data1$x)
abline(h=0, lty=3)
hist(data1.lm$residuals)
qqnorm(data1.lm$residuals)
qqline(data1.lm$residuals)
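Data 1 reasonably satisfies the conditions: the relationship looks roughly linear and the residuals scatter around zero without an obvious pattern, although 11 points are too few for a firm check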
Data2 plots:
# Residuals vs. the predictor x (residuals are correlated with y by construction)
plot(data2.lm$residuals ~ data2$x)
abline(h=0, lty=3)
hist(data2.lm$residuals)
qqnorm(data2.lm$residuals)
qqline(data2.lm$residuals)
Data 2 does not satisfy the conditions: the relationship between x and y is curved, so the linearity condition fails
Data 3 plots:
# Residuals vs. the predictor x (residuals are correlated with y by construction)
plot(data3.lm$residuals ~ data3$x)
abline(h=0, lty=3)
hist(data3.lm$residuals)
qqnorm(data3.lm$residuals)
qqline(data3.lm$residuals)
Data 3 does not satisfy the conditions: apart from one large residual (about 3.24) the residuals are nearly normal, but that outlier distorts the fit
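Plots are important because they reveal patterns that summary statistics alone can hide. The datasets above have nearly identical means, standard deviations, correlations and fitted lines, so only the plots can show how they differ.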
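A quick way to see this is to draw the four scatterplots on shared axes with their fitted lines. This is a sketch only: it assumes the built-in `anscombe` data frame matches data1 to data4, and the axis limits and the name `quartet` are illustrative choices.

```r
# Sketch: scatterplots of the four datasets with their least-squares lines.
# Assumption: the built-in `anscombe` data frame matches data1..data4.
quartet <- list(
  data1 = data.frame(x = anscombe$x1, y = anscombe$y1),
  data2 = data.frame(x = anscombe$x2, y = anscombe$y2),
  data3 = data.frame(x = anscombe$x3, y = anscombe$y3),
  data4 = data.frame(x = anscombe$x4, y = anscombe$y4)
)

par(mfrow = c(2, 2))
for (name in names(quartet)) {
  d <- quartet[[name]]
  # Same axis limits in every panel so the shapes are directly comparable
  plot(y ~ x, data = d, main = name, xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x, data = d), lty = 2)   # fitted regression line
}
par(mfrow = c(1, 1))
```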