Part I

Please put the answers for Part I next to the question number (2 pts each):

  1. daysDrive
  2. mean = 3.5, median = 3.3
  3. D - Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients
  4. C - there is an association between natural hair color and eye color
  5. B - 17.8 and 69.0
  6. D - median and interquartile range; mean and standard deviation

7a. Describe the two distributions (2 pts).

Observations (A) is right skewed, and its standard deviation of 3.22 shows that the variation in the data is high. The sampling distribution built from Observations, however, appears nearly normal, with a much smaller standard deviation.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

The means are similar because the mean of a sampling distribution equals the mean of the underlying observations. The standard deviations differ because the individual data points in Observations (A) are spread widely, while sample means cluster much more tightly around the population mean; this smaller spread is the standard error, sd/sqrt(n).

7c. What is the statistical principle that describes this phenomenon (2 pts)?

The principle at work here is the Central Limit Theorem: the distribution of sample means will be nearly normal, centered at the population mean, with a standard deviation equal to the standard error.
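As a quick illustration, a minimal simulation sketch (the exponential population and the sample size of 30 are stand-in assumptions, since the original observations are not reproduced here):

set.seed(1)
population <- rexp(10000, rate = 1/3.5)                 # right-skewed stand-in population
sample_means <- replicate(1000, mean(sample(population, 30)))
sd(population)                                          # large: spread of the raw observations
sd(sample_means)                                        # close to sd(population)/sqrt(30)
hist(sample_means)                                      # roughly bell-shaped, per the CLT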

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

# Apply the given summary function to every column of a data frame.
executeFunction <- function(dataSet, funcName) {
  lapply(dataSet, funcName)
}
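Note that options(digits=2) controls significant digits, not decimal places, so 7.50 prints as 7.5. If exactly two decimal places are required, the summary function could be wrapped with round() and format(); a small sketch (meanTwoDp is a hypothetical helper name):

meanTwoDp <- function(v) format(round(mean(v), 2), nsmall = 2)   # e.g. "7.50" rather than "7.5"
executeFunction(data1, meanTwoDp)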

executeFunction(data1, mean)
## $x
## [1] 9
## 
## $y
## [1] 7.5
executeFunction(data2, mean)
## $x
## [1] 9
## 
## $y
## [1] 7.5
executeFunction(data3, mean)
## $x
## [1] 9
## 
## $y
## [1] 7.5
executeFunction(data4, mean)
## $x
## [1] 9
## 
## $y
## [1] 7.5

b. The median (for x and y separately; 1 pt).

executeFunction(data1, median)
## $x
## [1] 9
## 
## $y
## [1] 7.6
executeFunction(data2, median)
## $x
## [1] 9
## 
## $y
## [1] 8.1
executeFunction(data3, median)
## $x
## [1] 9
## 
## $y
## [1] 7.1
executeFunction(data4, median)
## $x
## [1] 8
## 
## $y
## [1] 7

c. The standard deviation (for x and y separately; 1 pt).

executeFunction(data1, sd)
## $x
## [1] 3.3
## 
## $y
## [1] 2
executeFunction(data2, sd)
## $x
## [1] 3.3
## 
## $y
## [1] 2
executeFunction(data3, sd)
## $x
## [1] 3.3
## 
## $y
## [1] 2
executeFunction(data4, sd)
## $x
## [1] 3.3
## 
## $y
## [1] 2
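The pattern is easier to see when the same statistics are tabulated side by side; a minimal sketch (the datasets list is a hypothetical helper, not part of the assignment code):

datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
sapply(datasets, function(d) c(mean_x = mean(d$x), mean_y = mean(d$y),
                               sd_x   = sd(d$x),   sd_y   = sd(d$y)))

All four datasets agree to two significant digits on every one of these statistics.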

For each x and y pair, calculate (also to two decimal places):

d. The correlation (1 pt).

plot(data1)

cor(data1)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
plot(data2)

cor(data2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
plot(data3)

cor(data3)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
plot(data4)

cor(data4)
##      x    y
## x 1.00 0.82
## y 0.82 1.00

e. Linear regression equation (2 pts).

lm1 <- lm(y~x, data1)
summary(lm1)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
plot(lm1)

lm2 <- lm(y~x, data2)
summary(lm2)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
plot(lm2)

lm3 <- lm(y~x, data3)
summary(lm3)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
plot(lm3)

lm4 <- lm(y~x, data4)
summary(lm4)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216
plot(lm4)
## Warning: not plotting observations with leverage one:
##   8

## Warning: not plotting observations with leverage one:
##   8

f. R-Squared (2 pts).

summary(lm1)$r.squared
## [1] 0.67
summary(lm2)$r.squared
## [1] 0.67
summary(lm3)$r.squared
## [1] 0.67
summary(lm4)$r.squared
## [1] 0.67
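The same comparison can be made in one step; a small sketch using a hypothetical models list:

models <- list(lm1 = lm1, lm2 = lm2, lm3 = lm3, lm4 = lm4)
sapply(models, function(m) summary(m)$r.squared)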

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

plot(lm1)

hist(lm1$residuals)

qqnorm(lm1$residuals)
qqline(lm1$residuals)

For data1, the scatterplot shows a clear upward linear trend and the residuals look roughly normal, so it is appropriate to fit a linear regression model to data1.

plot(lm2)

hist(lm2$residuals)

qqnorm(lm2$residuals)
qqline(lm2$residuals)

For data2, the relationship between x and y is clearly curved, so a straight line cannot describe the data and a linear regression model is NOT appropriate.
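To illustrate the curvature, adding a squared term fits data2 almost perfectly; a quick sketch beyond the required answer:

lm2_quad <- lm(y ~ x + I(x^2), data2)   # quadratic fit capturing the curve
summary(lm2_quad)$r.squared             # very close to 1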

plot(lm3)

hist(lm3$residuals)

qqnorm(lm3$residuals)
qqline(lm3$residuals)

For data3, the plot shows a strong linear relationship except for a single outlier, which pulls the fitted line away from the rest of the points and makes the residuals non-normal. Aside from that one point, a linear regression model is appropriate for data3.
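The outlier can be located directly from the residuals; a one-line sketch:

which.max(abs(resid(lm3)))   # observation 3 (x = 13, y = 12.74) has the largest residual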

plot(lm4)
## Warning: not plotting observations with leverage one:
##   8

## Warning: not plotting observations with leverage one:
##   8

hist(lm4$residuals)

qqnorm(lm4$residuals)
qqline(lm4$residuals)

For data4, all but one observation share the same x value, so the apparent relationship is created entirely by a single high-leverage point (hence the leverage-one warnings above). A linear regression model is NOT reliable for data4.
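The leverage can be confirmed directly; a one-line sketch:

hatvalues(lm4)   # observation 8 (x = 19) has leverage exactly 1, matching the warning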

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Summary statistics alone can be misleading. All four datasets here (the classic Anscombe's quartet) share nearly identical means, standard deviations, correlations, regression coefficients, and R-squared values, yet their scatterplots reveal completely different structures: linear, curved, outlier-driven, and leverage-driven. Visualizations make trends, curvature, outliers, and high-leverage points immediately visible, so the data can be interpreted more easily and quickly.
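One compact way to produce the supporting visualization is a 2-by-2 panel of the four scatterplots with their fitted lines; a minimal sketch:

par(mfrow = c(2, 2))                    # 2 x 2 grid of plots
for (i in 1:4) {
  d <- get(paste0("data", i))           # data1 ... data4
  plot(d$x, d$y, main = paste0("data", i), xlab = "x", ylab = "y")
  abline(lm(y ~ x, d))                  # nearly identical fitted line in every panel
}
par(mfrow = c(1, 1))                    # reset the plotting layout

Identical summary statistics, four visibly different datasets: exactly why the plots matter.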