Final Exam - Data 606

Part I

Ans: b. daysDrive

Ans: b. mean = 3.5, median = 3.3

Ans: a. Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups

Ans: c. there is an association between natural hair colour and eye colour

Ans: c. 36.0 and 52.8

Ans: median and interquartile range; mean and standard deviation

Ans: Distribution A is positively skewed and Distribution B is close to normally distributed.

Ans: Both distributions peak at around 5 because Distribution B is a sample of Distribution A. Distribution B, however, has less spread since it involves fewer values than Distribution A.

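Ans: The Central Limit Theorem says that when samples are taken from a population, the sampling distribution of the sample mean becomes closer to normal as the sample size grows (a common rule of thumb is around 30 or more), even when the population itself is not normal.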
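A small simulation can make the last two answers concrete. The sketch below is illustrative only: the exponential population, the sample size of 30, and the 1,000 repetitions are assumptions chosen for the example, not values taken from the exam.

```r
# Illustrative simulation; the population and sizes below are assumptions.
set.seed(606)

# A right-skewed "population" (exponential with mean 5)
population <- rexp(100000, rate = 1 / 5)

# Draw 1,000 samples of size 30 and keep each sample's mean
n <- 30
sample_means <- replicate(1000, mean(sample(population, n)))

# The sample means center near the population mean ...
mean(population)
mean(sample_means)

# ... but their spread is close to sd(population) / sqrt(n)
sd(population) / sqrt(n)
sd(sample_means)

# Side-by-side histograms: skewed population vs. roughly normal sample means
par(mfrow = c(1, 2))
hist(population, main = "Population", xlab = "value")
hist(sample_means, main = "Sample means (n = 30)", xlab = "mean")
par(mfrow = c(1, 1))
```

The two centers should agree closely, while the standard deviation of the sample means should be smaller than that of the population by roughly a factor of the square root of the sample size.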

Part II

options(digits = 2)
data1 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x = c(8,8,8,8,8,8,8,19,8,8,8),
                    y = c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

mean(data1$x)

## [1] 9

mean(data1$y)

## [1] 7.5

mean(data2$x)

## [1] 9

mean(data2$y)

## [1] 7.5

mean(data3$x)

## [1] 9

mean(data3$y)

## [1] 7.5

mean(data4$x)

## [1] 9

mean(data4$y)

## [1] 7.5

median(data1$x)

## [1] 9

median(data1$y)

## [1] 7.6

median(data2$x)

## [1] 9

median(data2$y)

## [1] 8.1

median(data3$x)

## [1] 9

median(data3$y)

## [1] 7.1

median(data4$x)

## [1] 8

median(data4$y)

## [1] 7

sd(data1$x)

## [1] 3.3

sd(data1$y)

## [1] 2

sd(data2$x)

## [1] 3.3

sd(data2$y)

## [1] 2

sd(data3$x)

## [1] 3.3

sd(data3$y)

## [1] 2

sd(data4$x)

## [1] 3.3

sd(data4$y)

## [1] 2
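The individual calls above can also be collected into one table, which makes the near-identical summary statistics easier to compare. This is a sketch of one way to do it; the names `datasets` and `stats` are introduced here for illustration and do not appear in the original analysis.

```r
# Same four data sets as above, repeated so the block runs on its own
data1 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x = c(8,8,8,8,8,8,8,19,8,8,8),
                    y = c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

# Hypothetical helper list so each statistic is computed once per data set
datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)

# One row per data set, one column per statistic
stats <- t(sapply(datasets, function(d) c(
  mean_x   = mean(d$x),   mean_y   = mean(d$y),
  median_x = median(d$x), median_y = median(d$y),
  sd_x     = sd(d$x),     sd_y     = sd(d$y)
)))
round(stats, 2)
```

Reading down each column shows that the means and standard deviations agree across the four data sets, while the medians do not.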

cor(data1)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

cor(data2)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

cor(data3)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

data1.lm <- lm(y~x, data=data1)
data2.lm <- lm(y~x, data=data2)
data3.lm <- lm(y~x, data=data3)

summary(data1.lm)

## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

summary(data2.lm)

## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

summary(data3.lm)

## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

Data1 Regression equation is:

\[\hat{y} = 3.0 + 0.5 * x\]

The R-squared = 0.667

Data2 Regression equation is:

\[\hat{y} = 3.001 + 0.5 * x\]

The R-squared = 0.666

Data3 Regression equation is:

\[\hat{y} = 3.002 + 0.5 * x\]

The R-squared = 0.666
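Rather than copying the values from the printed summaries, the coefficients and R-squared can be read directly from the fitted model. The sketch below does this for the first data set only; the names `fit`, `b`, and `r2` are assumptions made for the example.

```r
# data1 repeated from above so the block runs on its own
data1 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
fit <- lm(y ~ x, data = data1)

# Named vector with elements "(Intercept)" and "x"
b <- coef(fit)

# R-squared is stored on the summary object
r2 <- summary(fit)$r.squared

# Build the regression equation from the fitted values
sprintf("y-hat = %.3f + %.3f * x, R-squared = %.3f",
        b[["(Intercept)"]], b[["x"]], r2)

# For simple linear regression, R-squared equals the squared correlation
all.equal(r2, cor(data1$x, data1$y)^2)
```

The same calls work unchanged on `data2.lm` and `data3.lm`.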

Appropriateness of linear model estimation:

Condition 1: Linearity

Condition 2: Constant variability

Condition 3: Nearly normal residuals

Data1 plots:

# Residuals vs. the predictor x (residuals plotted against y show a built-in trend)
plot(data1.lm$residuals ~ data1$x)
abline(h=0, lty=3)

hist(data1.lm$residuals)

qqnorm(data1.lm$residuals)
qqline(data1.lm$residuals)

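Data 1 appears to satisfy the conditions reasonably well for a sample of 11 points: the residuals (about -1.9 to 1.8) are centered near zero and show no clear curvature or extreme outlier.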

Data2 plots:

# Residuals vs. the predictor x (residuals plotted against y show a built-in trend)
plot(data2.lm$residuals ~ data2$x)
abline(h=0, lty=3)

hist(data2.lm$residuals)

qqnorm(data2.lm$residuals)
qqline(data2.lm$residuals)

Data 2 does not satisfy the conditions: y rises and then falls as x increases, so the relationship is curved rather than linear.

Data3 plots:

# Residuals vs. the predictor x (residuals plotted against y show a built-in trend)
plot(data3.lm$residuals ~ data3$x)
abline(h=0, lty=3)

hist(data3.lm$residuals)

qqnorm(data3.lm$residuals)
qqline(data3.lm$residuals)

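Data 3, while nearly normal, does not satisfy the other conditions: the point at x = 13 (y = 12.74) is an outlier with a residual of about 3.2, and the remaining points lie almost exactly on a straight line.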

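Plots are important because they show patterns, such as curvature or outliers, that summary numbers alone can hide. The data sets above have nearly identical means, standard deviations, correlations, and regression lines, so those statistics alone do not distinguish them.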
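One way to see this is to draw each data set with its fitted line in a single grid. The sketch below is an illustration rather than part of the original answer: it also fits `data4`, which is defined above but not modelled there, and the axis limits and the names `datasets` and `fits` are assumptions made for the example.

```r
# Same four data sets as above, repeated so the block runs on its own
data1 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x = c(8,8,8,8,8,8,8,19,8,8,8),
                    y = c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)

# 2 x 2 grid: one scatterplot per data set, each with its least-squares line
par(mfrow = c(2, 2))
fits <- lapply(names(datasets), function(nm) {
  d <- datasets[[nm]]
  fit <- lm(y ~ x, data = d)
  # Shared axis limits (an assumption) so the panels are directly comparable
  plot(d$x, d$y, main = nm, xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14))
  abline(fit, lty = 2)
  fit
})
par(mfrow = c(1, 1))
names(fits) <- names(datasets)

# Intercepts and slopes side by side
round(sapply(fits, coef), 3)
```

The fitted lines are almost the same in every panel, so any differences between the data sets show up only in how the points sit around the line.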


N. Nedd

2017-05-25
