Part I

Please put the answers for Part I next to the question number (2pts each):

1. b. daysDrive
2. a. mean = 3.3, median = 3.5
3. d. Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients
4. c. There is an association between natural hair color and eye color
5. b. 17.8 and 69.0
6. d. median and interquartile range; mean and SD.

7a. Describe the two distributions (2pts).

Both distributions appear to be approximately normal. The sampling distribution in Figure B has a much smaller spread than the distribution in Figure A.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

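In Figure B the sample size is at least 30, so the mean of the sampling distribution closely approximates the mean of the population in Figure A. The standard deviations differ because the spread of a sampling distribution of means is the standard error, σ/√n, which is smaller than the population standard deviation σ and shrinks as the sample size grows.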

7c. What is the statistical principle that describes this phenomenon (2 pts)?

The Central Limit Theorem.
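The behavior described in 7a through 7c can be checked with a small simulation. This is only a sketch: the population below (normal, mean 50, SD 10), the seed, and the number of replicates are made-up values chosen for illustration and are not the data behind Figures A and B.

```r
set.seed(606)  # arbitrary seed, used only so the run is reproducible

# Hypothetical population (an assumption, not the exam's Figure A)
population <- rnorm(10000, mean = 50, sd = 10)

# Sampling distribution: means of 1,000 random samples of size n = 30
n <- 30
sample_means <- replicate(1000, mean(sample(population, n)))

# The centers agree closely
c(pop_mean = mean(population), mean_of_means = mean(sample_means))

# The spreads differ by roughly a factor of sqrt(n)
c(pop_sd = sd(population),
  sd_of_means = sd(sample_means),
  theoretical_se = sd(population) / sqrt(n))
```

The last vector puts the observed spread of the sample means next to the standard error predicted by σ/√n, so the two can be compared directly.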

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

round(mean(data1$x),2)

## [1] 9

round(mean(data1$y),2)

## [1] 7.5

round(mean(data2$x),2)

## [1] 9

round(mean(data2$y),2)

## [1] 7.5

round(mean(data3$x),2)

## [1] 9

round(mean(data3$y),2)

## [1] 7.5

round(mean(data4$x),2)

## [1] 9

round(mean(data4$y),2)

## [1] 7.5

b. The median (for x and y separately; 1 pt).

round(median(data1$x),2)

## [1] 9

round(median(data1$y),2)

## [1] 7.6

round(median(data2$x),2)

## [1] 9

round(median(data2$y),2)

## [1] 8.1

round(median(data3$x),2)

## [1] 9

round(median(data3$y),2)

## [1] 7.1

round(median(data4$x),2)

## [1] 8

round(median(data4$y),2)

## [1] 7

c. The standard deviation (for x and y separately; 1 pt).

round(sd(data1$x),2)

## [1] 3.3

round(sd(data1$y),2)

## [1] 2

round(sd(data2$x),2)

## [1] 3.3

round(sd(data2$y),2)

## [1] 2

round(sd(data3$x),2)

## [1] 3.3

round(sd(data3$y),2)

## [1] 2

round(sd(data4$x),2)

## [1] 3.3

round(sd(data4$y),2)

## [1] 2
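Because `options(digits=2)` limits printing to two significant digits, values such as the standard deviation of y print as `2` rather than to two decimal places. The sketch below computes the same statistics in one table and formats them with `formatC`, which always shows two decimals. It assumes the four datasets are the same as R's built-in `anscombe` data frame (the values listed above appear to match it); the name `quartet` is introduced here only for illustration, and `data1` through `data4` can be substituted directly.

```r
# Assumption: data1..data4 match the built-in `anscombe` data frame
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]], y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# One row per dataset: mean, median, and SD of x and y
stats <- t(sapply(quartet, function(d) {
  c(mean_x = mean(d$x), mean_y = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x = sd(d$x), sd_y = sd(d$y))
}))

# Fixed-point formatting is unaffected by options(digits = 2)
print(formatC(stats, format = "f", digits = 2), quote = FALSE)
```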

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

round(cor(data1),2)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

round(cor(data2),2)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

round(cor(data3),2)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

round(cor(data4),2)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

e. Linear regression equation (2 pts).

model1 <- lm(y~x, data1)
summary(model1)

## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

model2 <- lm(y~x, data2)
summary(model2)

## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

model3 <- lm(y~x, data3)
summary(model3)

## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

model4 <- lm(y~x, data4)
summary(model4)

## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

\[\hat{y}_1 = 3 + 0.5 \cdot x\]
\[\hat{y}_2 = 3 + 0.5 \cdot x\]
\[\hat{y}_3 = 3 + 0.5 \cdot x\]
\[\hat{y}_4 = 3 + 0.5 \cdot x\]
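The intercepts and slopes in these equations can also be pulled out of the fitted models with `coef()` instead of being read off each `summary()` printout. A minimal sketch, assuming the four datasets match R's built-in `anscombe` data frame (`quartet` and `fits` are illustrative names, not part of the original analysis):

```r
# Assumption: data1..data4 match the built-in `anscombe` data frame
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]], y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Fit y ~ x for each dataset and collect the coefficients in one table
fits <- lapply(quartet, function(d) lm(y ~ x, data = d))
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")

print(formatC(coefs, format = "f", digits = 2), quote = FALSE)
```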

f. R-Squared (2 pts).

summary(model1)$r.squared

## [1] 0.67

summary(model2)$r.squared

## [1] 0.67

summary(model3)$r.squared

## [1] 0.67

summary(model4)$r.squared

## [1] 0.67
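In simple linear regression with one predictor, R-squared is the square of the correlation between x and y, which is why a correlation of 0.82 in part d goes with an R-squared of 0.67 here. A short check for the first dataset, assuming it matches the `x1` and `y1` columns of R's built-in `anscombe` data frame (`d1` is an illustrative name):

```r
# Assumption: data1 matches columns x1 and y1 of the built-in `anscombe` data
d1 <- data.frame(x = anscombe$x1, y = anscombe$y1)

r <- cor(d1$x, d1$y)                        # scalar correlation, not a matrix
r_squared <- summary(lm(y ~ x, data = d1))$r.squared

# The two quantities agree up to floating-point error
all.equal(r^2, r_squared)
```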

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Data1: No. The residuals do not seem to follow a nearly normal distribution, even though the main data suggests linearity.

Data2: No. The data shows something other than a linear relationship. The residuals do not seem to follow a normal distribution.

Data3: Yes. The data shows some type of linearity, with some outliers producing an effect on the slope. The data does show a linear trend and nearly normal residuals.

Data4: No. The data does not show a linear trend.

par(mfrow=c(2,2))
plot(data1)
hist(model1$residuals)
qqnorm(model1$residuals)
qqline(model1$residuals)

par(mfrow=c(2,2))
plot(data2)
hist(model2$residuals)
qqnorm(model2$residuals)
qqline(model2$residuals)

par(mfrow=c(2,2))
plot(data3)
hist(model3$residuals)
qqnorm(model3$residuals)
qqline(model3$residuals)

par(mfrow=c(2,2))
plot(data4)
hist(model4$residuals)
qqnorm(model4$residuals)
qqline(model4$residuals)
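A residuals-versus-fitted plot is a common complement to the histograms and Q-Q plots above, since curvature or changing spread around zero points to a violated linearity or constant-variance condition. The sketch below draws one panel per dataset. It assumes the four datasets match R's built-in `anscombe` data frame; `quartet` is an illustrative name.

```r
# Assumption: data1..data4 match the built-in `anscombe` data frame
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]], y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

par(mfrow = c(2, 2))  # one panel per dataset
for (name in names(quartet)) {
  fit <- lm(y ~ x, data = quartet[[name]])
  plot(fitted(fit), resid(fit), main = name,
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)  # reference line at zero residual
}
par(mfrow = c(1, 1))  # restore the default layout
```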

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

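It is important to include appropriate visualizations when analyzing data because they make it possible to see how well the linear model fits the data. They also make it easier to see whether the residuals have constant variability around the line. The four datasets here have nearly identical means, standard deviations, correlations, and regression equations, yet the patterns in the data differ from one dataset to the next. See the plots above.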
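One way to make this comparison visible is to draw all four scatterplots on shared axes with their fitted regression lines. This is a sketch rather than part of the original submission. It assumes the four datasets match R's built-in `anscombe` data frame, and the axis limits are chosen by hand to cover all four datasets.

```r
# Assumption: data1..data4 match the built-in `anscombe` data frame
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]], y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

par(mfrow = c(2, 2))
for (name in names(quartet)) {
  d <- quartet[[name]]
  # Shared axis limits so the four panels are directly comparable
  plot(d$x, d$y, main = name, xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x, data = d), col = "blue")  # fitted regression line
}
par(mfrow = c(1, 1))
```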

DATA 606 Spring 2018 - Final Exam

Nicholas Schettini