Part I

Please put the answers for Part I next to the question number (2 pts each):

    1. daysDrive is both quantitative and discrete
    2. mean = 3.3, median = 3.5
    3. Both studies (a) and (b) could be conducted to establish that the treatment does indeed cause improvement with regard to fever in Ebola patients.
    4. There is an association between natural hair color and eye color
    5. 17.8 and 69.0
    6. median and interquartile range; mean and SD

7a. Describe the two distributions (2 pts).

Ans: Both Figure A and Figure B appear roughly normally distributed, but the spread of the sampling distribution in Figure B is much smaller than the spread of the distribution in Figure A.

More precisely, the distribution in Figure A has a moderate right skew and lower kurtosis, while the distribution in Figure B is approximately normal with higher kurtosis.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

Ans: Figure A is the distribution of an observed variable, whereas Figure B is the distribution of the means of 500 random samples of size 30 drawn from A. Each sample mean estimates the same population mean, so the two distributions are centered in roughly the same place; the standard deviation of the sampling distribution, however, is the standard error, approximately sd(A)/sqrt(30), which is much smaller than the standard deviation of the original variable.

7c. What is the statistical principle that describes this phenomenon (2 pts)?

Ans: The Central Limit Theorem.
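
A quick simulation illustrates the principle (a minimal sketch in R; the exponential population below is an assumed stand-in for the skewed variable in Figure A, not the actual exam data):

set.seed(1)                                      # reproducibility
pop <- rexp(10000, rate = 1/5)                   # a right-skewed "observed" variable
sample_means <- replicate(500, mean(sample(pop, 30)))
c(mean(pop), mean(sample_means))                 # the centers nearly coincide
c(sd(pop), sd(sample_means), sd(pop)/sqrt(30))   # sd of the means is close to sd(pop)/sqrt(30)

The sample means cluster around the population mean with a spread close to sd(pop)/sqrt(30), which is exactly the pattern seen in Figures A and B.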

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2) # note: this sets *significant* digits for printing, so some values below display with fewer than two decimal places (e.g. 7.58 prints as 7.6)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

round(mean(data1$x),2)
## [1] 9
round(mean(data1$y),2)
## [1] 7.5
round(mean(data2$x),2)
## [1] 9
round(mean(data2$y),2)
## [1] 7.5
round(mean(data3$x),2)
## [1] 9
round(mean(data3$y),2)
## [1] 7.5
round(mean(data4$x),2)
## [1] 9
round(mean(data4$y),2)
## [1] 7.5

b. The median (for x and y separately; 1 pt).

round(median(data1$x),2)
## [1] 9
round(median(data1$y),2)
## [1] 7.6
round(median(data2$x),2)
## [1] 9
round(median(data2$y),2)
## [1] 8.1
round(median(data3$x),2)
## [1] 9
round(median(data3$y),2)
## [1] 7.1
round(median(data4$x),2)
## [1] 8
round(median(data4$y),2)
## [1] 7

c. The standard deviation (for x and y separately; 1 pt).

round(sd(data1$x),2)
## [1] 3.3
round(sd(data1$y),2)
## [1] 2
round(sd(data2$x),2)
## [1] 3.3
round(sd(data2$y),2)
## [1] 2
round(sd(data3$x),2)
## [1] 3.3
round(sd(data3$y),2)
## [1] 2
round(sd(data4$x),2)
## [1] 3.3
round(sd(data4$y),2)
## [1] 2
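
The same statistics can also be computed in one pass over all four datasets (a compact alternative, not required by the assignment):

datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
round(sapply(datasets, function(d)
  c(mean_x = mean(d$x),     mean_y = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x = sd(d$x),         sd_y = sd(d$y))), 2)

The columns are strikingly similar: every dataset has mean x = 9, mean y = 7.5, sd x = 3.32, and sd y = 2.03, with only the medians differing slightly.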

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

round(cor(data1),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
round(cor(data2),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
round(cor(data3),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
round(cor(data4),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
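
Since each matrix above is symmetric with ones on the diagonal, one pairwise correlation per dataset carries all the information (a compact equivalent):

sapply(list(data1, data2, data3, data4),
       function(d) round(cor(d$x, d$y), 2))   # approximately 0.82 for every pair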

e. Linear regression equation (2 pts).

lm1 <- lm(y ~ x, data = data1)
summary(lm1)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
lm2 <- lm(y ~ x, data = data2)
summary(lm2)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
lm3 <- lm(y ~ x, data = data3)
summary(lm3)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
lm4 <- lm(y ~ x, data = data4)
summary(lm4)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216
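
All four fits give essentially the same equation, y-hat ≈ 3.00 + 0.50x. The coefficients can be pulled straight out of the model objects:

round(sapply(list(lm1, lm2, lm3, lm4), coef), 2)   # intercepts ~3.00, slopes ~0.50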

f. R-Squared (2 pts).

summary(lm1)$r.squared
## [1] 0.67
summary(lm2)$r.squared
## [1] 0.67
summary(lm3)$r.squared
## [1] 0.67
summary(lm4)$r.squared
## [1] 0.67

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Conditions for a pair to be appropriate for a linear regression model:

  • Linearity
  • Nearly normal residuals
  • Constant variability
  • Independent observations

par(mfrow=c(2,2))
plot(data1)
hist(lm1$residuals)
qqnorm(lm1$residuals)
qqline(lm1$residuals)
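
A residuals-versus-fitted plot (a small addition to the plots above, shown here only for data1) directly exposes non-linearity and non-constant variance:

plot(lm1$fitted.values, lm1$residuals,
     xlab = "Fitted values", ylab = "Residuals")   # look for random scatter around 0
abline(h = 0, lty = 2)

The same two lines can be repeated with lm2, lm3, and lm4 for the other datasets.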

Data 1 IS appropriate for a linear regression model: the scatterplot is roughly linear, the residuals are nearly normal, and the variability around the line is roughly constant.

par(mfrow=c(2,2))
plot(data2)
hist(lm2$residuals)
qqnorm(lm2$residuals)
qqline(lm2$residuals)

Data 2 is NOT appropriate for a linear regression model, as the clearly curved scatterplot violates the conditions of linearity, nearly normal residuals, and constant variability.

par(mfrow=c(2,2))
plot(data3)
hist(lm3$residuals)
qqnorm(lm3$residuals)
qqline(lm3$residuals)

Data 3 is NOT appropriate for a linear regression model, as a single extreme outlier violates the conditions of nearly normal residuals and constant variability.

par(mfrow=c(2,2))
plot(data4)
hist(lm4$residuals)
qqnorm(lm4$residuals)
qqline(lm4$residuals)

Data 4 is NOT appropriate for a linear regression model, as the fit is driven entirely by one high-leverage point (x = 19), violating the conditions of linearity and constant variability.

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Data visualization is an important step when analyzing data because it exposes patterns that summary statistics alone cannot. These four datasets (Anscombe's quartet) have nearly identical means, standard deviations, correlations, regression equations, and R-squared values, yet plotting them reveals four completely different relationships. Below is an example of the diagnostic plots for data1.

par(mfrow=c(2,2))
plot(lm1)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
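
A side-by-side scatterplot of all four pairs (a quick sketch using only base R) makes the same point even more directly:

# Plot each x-y pair with its (essentially identical) fitted line
datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
par(mfrow = c(2, 2))
for (nm in names(datasets)) {
  d <- datasets[[nm]]
  plot(d$x, d$y, main = nm, xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14), pch = 19)
  abline(lm(y ~ x, data = d), col = "red")   # nearly the same line in every panel
}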