Part I

Please put the answers for Part I next to the question number (please enter only the letter options; 4 points each):

1.B
2.A
3.D
4.B 
5.B 
6.E 
7.D 
8.E
9.B 
10.C 

Part II

Consider the three datasets, each with two columns (x and y), provided below. Be sure to replace the NA with your answer for each part (e.g. assign the mean of x for data1 to the data1.x.mean variable). When you Knit your answer document, a table will be generated with all the answers.

For each column, calculate (to four decimal places):

a. The mean (for x and y separately; 5 pt).

data1.x.mean <- mean(data1$x)
data1.y.mean <- mean(data1$y)
data2.x.mean <- mean(data2$x)
data2.y.mean <- mean(data2$y)
data3.x.mean <- mean(data3$x)
data3.y.mean <- mean(data3$y)

b. The median (for x and y separately; 5 pt).

data1.x.median <- median(data1$x)
data1.y.median <- median(data1$y)
data2.x.median <- median(data2$x)
data2.y.median <- median(data2$y)
data3.x.median <- median(data3$x)
data3.y.median <- median(data3$y)

c. The standard deviation (for x and y separately; 5 pt).

data1.x.sd <- sd(data1$x)
data1.y.sd <- sd(data1$y)
data2.x.sd <- sd(data2$x)
data2.y.sd <- sd(data2$y)
data3.x.sd <- sd(data3$x)
data3.y.sd <- sd(data3$y)
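As a cross-check (not required by the assignment), all three statistics for both columns of a data frame can be computed in one call; a minimal sketch, assuming the data frames are already loaded:

# Cross-check: mean, median, and sd for both columns of data1 at once
sapply(data1, function(col) c(mean = mean(col), median = median(col), sd = sd(col)))

Applying the same call to data2 and data3 should reproduce the values in the summary table below.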

For each x and y pair, calculate (to two decimal places):

d. The correlation (5 pt).

round(cor(data1),2)
##       x     y
## x  1.00 -0.06
## y -0.06  1.00
round(cor(data2),2)
##       x     y
## x  1.00 -0.07
## y -0.07  1.00
round(cor(data3),2)
##       x     y
## x  1.00 -0.06
## y -0.06  1.00
data1.correlation <- round(cor(data1$x, data1$y), 2)  # -0.06
data2.correlation <- round(cor(data2$x, data2$y), 2)  # -0.07
data3.correlation <- round(cor(data3$x, data3$y), 2)  # -0.06

e. Linear regression equation (5 points).

lm1 <- lm(x ~ y, data = data1)
summary(lm1)
## 
## Call:
## lm(formula = x ~ y, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.58 -10.56  -0.98  10.29  43.38 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.1827     2.8792   19.51   <2e-16 ***
## y            -0.0401     0.0525   -0.76     0.45    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 140 degrees of freedom
## Multiple R-squared:  0.00416,    Adjusted R-squared:  -0.00296 
## F-statistic: 0.584 on 1 and 140 DF,  p-value: 0.446
# equation1 (from lm(x ~ y), so x is the response): x = -0.0401*y + 56.1827

lm2 <- lm(x ~ y, data = data2)
summary(lm2)
## 
## Call:
## lm(formula = x ~ y, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35.91 -11.20  -0.02  10.33  40.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.3218     2.8788   19.56   <2e-16 ***
## y            -0.0429     0.0525   -0.82     0.41    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 140 degrees of freedom
## Multiple R-squared:  0.00476,    Adjusted R-squared:  -0.00235 
## F-statistic: 0.669 on 1 and 140 DF,  p-value: 0.415
# equation2 (from lm(x ~ y), so x is the response): x = -0.0429*y + 56.3218


lm3 <- lm(x ~ y, data = data3)
summary(lm3)
## 
## Call:
## lm(formula = x ~ y, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37.42 -13.76  -0.69  15.03  38.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.1756     2.8799   19.51   <2e-16 ***
## y            -0.0399     0.0525   -0.76     0.45    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 140 degrees of freedom
## Multiple R-squared:  0.00411,    Adjusted R-squared:  -0.003 
## F-statistic: 0.578 on 1 and 140 DF,  p-value: 0.448
# equation3 (from lm(x ~ y), so x is the response): x = -0.0399*y + 56.1756

data1.slope <- -0.0401
data2.slope <- -0.0429
data3.slope <- -0.0399

data1.intercept <- 56.1827
data2.intercept <- 56.3218
data3.intercept <- 56.1756
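
Alternatively, the intercepts and slopes can be pulled straight from the fitted model objects rather than transcribed by hand, which avoids copy errors; a sketch using the lm fits above:

# coef() returns the (Intercept) and y coefficients of each fit,
# so the hard-coded values above can be reproduced programmatically
data1.intercept <- round(coef(lm1)[[1]], 4)
data1.slope     <- round(coef(lm1)[[2]], 4)
data2.intercept <- round(coef(lm2)[[1]], 4)
data2.slope     <- round(coef(lm2)[[2]], 4)
data3.intercept <- round(coef(lm3)[[1]], 4)
data3.slope     <- round(coef(lm3)[[2]], 4)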

f. R-Squared (5 points).

data1.rsquared <- summary(lm1)$r.squared
data2.rsquared <- summary(lm2)$r.squared
data3.rsquared <- summary(lm3)$r.squared
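
Rounded to four decimal places for the summary table, these should match the Multiple R-squared values reported in the regression output above:

# expected, from the summaries above: 0.0042 0.0048 0.0041
round(c(data1 = data1.rsquared, data2 = data2.rsquared, data3 = data3.rsquared), 4)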

Summary Table

            Data 1              Data 2              Data 3
            x        y          x        y          x        y
Mean        54.2633  47.8323    54.2678  47.8359    54.2661  47.8347
Median      53.3333  46.0256    53.1352  46.4013    53.3403  47.5353
SD          16.7651  26.9354    16.7668  26.9361    16.7698  26.9397
r           -0.0600             -0.0700             -0.0600
Intercept   56.1827             56.3218             56.1756
Slope       -0.0401             -0.0429             -0.0399
R-Squared   0.0042              0.0048              0.0041

g. For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (15 points)

Plots for data1

#data1 plots
plot(x ~ y, data1)
abline(lm1)

hist(lm1$residuals)

qqnorm(lm1$residuals)
qqline(lm1$residuals)

The scatterplot for data1 suggests some linearity, although there are outliers, and the residuals appear to follow an approximately normal distribution.

Plots for data2

#data2 plots
plot(x ~ y, data2)
abline(lm2)

hist(lm2$residuals)

qqnorm(lm2$residuals)
qqline(lm2$residuals)

The scatterplot for data2 does not suggest a linear relationship, and the residuals do not appear to follow a nearly normal distribution.

Plots for data3

#data3 plots
plot(x ~ y, data3)
abline(lm3)

hist(lm3$residuals)

qqnorm(lm3$residuals)
qqline(lm3$residuals)

The plots for data3 show no linear relationship in the data, and the residuals are heavy-tailed and do not follow a normal distribution.
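
As a supplementary numeric check on the residual-normality claims above (not part of the assignment), a Shapiro-Wilk test can be run on the residuals of each fit; small p-values are evidence against normality:

# Shapiro-Wilk p-values for the residuals of lm1, lm2, and lm3
sapply(list(lm1 = lm1, lm2 = lm2, lm3 = lm3),
       function(m) shapiro.test(residuals(m))$p.value)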

h. Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (15 points)

A linear model is only valid if the necessary conditions have been met, and visualizations are an effective way to check for linearity, nearly normal residuals, and, when the data collection order is provided, independence. Summary statistics alone can mislead: the three datasets here have nearly identical means, standard deviations, and correlations, yet their plots look very different. Please see the plots above and the side-by-side comparison below.
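
To illustrate, a minimal sketch (assuming data1 through data3 are loaded) that plots the three datasets side by side using base R graphics:

# Side-by-side scatterplots: the summary statistics are nearly identical,
# but the plots make the structural differences immediately visible
par(mfrow = c(1, 3))
plot(y ~ x, data = data1, main = "data1")
plot(y ~ x, data = data2, main = "data2")
plot(y ~ x, data = data3, main = "data3")
par(mfrow = c(1, 1))  # reset the plotting layout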