Please put the answers for Part I next to the question number (please enter only the letter options; 4 points each):
1.B
2.A
3.D
4.B
5.B
6.E
7.D
8.E
9.B
10.C
Consider the three datasets, each with two columns (x and y),
provided below. Be sure to replace the NA with your answer
for each part (e.g. assign the mean of x for
data1 to the data1.x.mean variable). When you
Knit your answer document, a table will be generated with all the
answers.
For each column, calculate (to four decimal places):
data1.x.mean <- mean(data1$x)
data1.y.mean <- mean(data1$y)
data2.x.mean <- mean(data2$x)
data2.y.mean <- mean(data2$y)
data3.x.mean <- mean(data3$x)
data3.y.mean <- mean(data3$y)
data1.x.median <- median(data1$x)
data1.y.median <- median(data1$y)
data2.x.median <- median(data2$x)
data2.y.median <- median(data2$y)
data3.x.median <- median(data3$x)
data3.y.median <- median(data3$y)
data1.x.sd <- sd(data1$x)
data1.y.sd <- sd(data1$y)
data2.x.sd <- sd(data2$x)
data2.y.sd <- sd(data2$y)
data3.x.sd <- sd(data3$x)
data3.y.sd <- sd(data3$y)
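The assignments above repeat the same three calls for every column. As a minimal sketch, the statistics could also be computed in one pass and rounded to the four decimal places the instructions ask for. The `demo` data frame and the `column_stats` helper are hypothetical names introduced for illustration; they are not part of the assignment or its data.

```r
# Hypothetical stand-in for data1; the assignment's real data is not reproduced here.
demo <- data.frame(x = c(1, 2, 3, 4, 10), y = c(2, 4, 6, 8, 5))

# Return the mean, median, and sample SD of each column, rounded to `digits` decimals.
column_stats <- function(df, digits = 4) {
  round(
    sapply(df, function(col) c(mean = mean(col), median = median(col), sd = sd(col))),
    digits
  )
}

# Rows are the statistics, columns are the variables.
stats <- column_stats(demo)
stats
```

Individual values can then be read by name, for example `stats["mean", "x"]`.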
round(cor(data1),2)
## x y
## x 1.00 -0.06
## y -0.06 1.00
round(cor(data2),2)
## x y
## x 1.00 -0.07
## y -0.07 1.00
round(cor(data3),2)
## x y
## x 1.00 -0.06
## y -0.06 1.00
data1.correlation <- -0.06
data2.correlation <- -0.07
data3.correlation <- -0.06
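The correlations above are copied by hand from output rounded to two decimals, while the instructions ask for four. A minimal sketch of assigning the value directly from `cor()` follows; the `demo` data frame is a made-up stand-in, not the assignment's data.

```r
# Hypothetical stand-in data; not the assignment's data1.
demo <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 1, 4, 3, 5))

# cor() on two vectors returns a single number rather than a 2 x 2 matrix.
demo.correlation <- round(cor(demo$x, demo$y), 4)

# The same value is the off-diagonal entry of the correlation matrix.
stopifnot(isTRUE(all.equal(demo.correlation, round(cor(demo)["x", "y"], 4))))
demo.correlation
```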
lm1 <- lm(x ~ y, data = data1)
summary(lm1)
##
## Call:
## lm(formula = x ~ y, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.58 -10.56 -0.98 10.29 43.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.1827 2.8792 19.51 <2e-16 ***
## y -0.0401 0.0525 -0.76 0.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.8 on 140 degrees of freedom
## Multiple R-squared: 0.00416, Adjusted R-squared: -0.00296
## F-statistic: 0.584 on 1 and 140 DF, p-value: 0.446
# equation 1: x = 56.1827 - 0.0401 * y  (lm(x ~ y) makes x the response and y the predictor)
lm2 <- lm(x ~ y, data = data2)
summary(lm2)
##
## Call:
## lm(formula = x ~ y, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.91 -11.20 -0.02 10.33 40.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.3218 2.8788 19.56 <2e-16 ***
## y -0.0429 0.0525 -0.82 0.41
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.8 on 140 degrees of freedom
## Multiple R-squared: 0.00476, Adjusted R-squared: -0.00235
## F-statistic: 0.669 on 1 and 140 DF, p-value: 0.415
# equation 2: x = 56.3218 - 0.0429 * y  (lm(x ~ y) makes x the response and y the predictor)
lm3 <- lm(x ~ y, data = data3)
summary(lm3)
##
## Call:
## lm(formula = x ~ y, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.42 -13.76 -0.69 15.03 38.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.1756 2.8799 19.51 <2e-16 ***
## y -0.0399 0.0525 -0.76 0.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.8 on 140 degrees of freedom
## Multiple R-squared: 0.00411, Adjusted R-squared: -0.003
## F-statistic: 0.578 on 1 and 140 DF, p-value: 0.448
# equation 3: x = 56.1756 - 0.0399 * y  (lm(x ~ y) makes x the response and y the predictor)
data1.slope <- -0.0401
data2.slope <- -0.0429
data3.slope <- -0.0399
data1.intercept <- 56.1827
data2.intercept <- 56.3218
data3.intercept <- 56.1756
data1.rsquared <- summary(lm1)$r.squared
data2.rsquared <- summary(lm2)$r.squared
data3.rsquared <- summary(lm3)$r.squared
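The slopes and intercepts above are typed in from the printed summaries, whereas R-squared is read from the model object. As a sketch of reading all three from the fit, which avoids transcription slips, the following uses `coef()` and `summary()`. The `demo` data and the `demo_lm`, `demo.intercept`, `demo.slope`, and `demo.rsquared` names are hypothetical.

```r
# Hypothetical stand-in data; not the assignment's data1.
demo <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 1, 4, 3, 5))

# Same formula direction as the models above: x is the response, y the predictor.
demo_lm <- lm(x ~ y, data = demo)

# coef() returns a named vector with "(Intercept)" and the predictor name.
demo.intercept <- round(unname(coef(demo_lm)["(Intercept)"]), 4)
demo.slope     <- round(unname(coef(demo_lm)["y"]), 4)
demo.rsquared  <- round(summary(demo_lm)$r.squared, 4)

c(intercept = demo.intercept, slope = demo.slope, rsquared = demo.rsquared)
```

Because the formula is `x ~ y`, the slope is the expected change in x per unit of y. Fitting `y ~ x` instead would generally give a different slope and intercept but the same R-squared.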
Summary Table
| Statistic | data1 x | data1 y | data2 x | data2 y | data3 x | data3 y |
|---|---|---|---|---|---|---|
| Mean | 54.2633 | 47.8323 | 54.2678 | 47.8359 | 54.2661 | 47.8347 |
| Median | 53.3333 | 46.0256 | 53.1352 | 46.4013 | 53.3403 | 47.5353 |
| SD | 16.7651 | 26.9354 | 16.7668 | 26.9361 | 16.7698 | 26.9397 |

| Statistic | data1 | data2 | data3 |
|---|---|---|---|
| r | -0.0600 | -0.0700 | -0.0600 |
| Intercept | 56.1827 | 56.3218 | 56.1756 |
| Slope | -0.0401 | -0.0429 | -0.0399 |
| R-Squared | 0.0042 | 0.0048 | 0.0041 |
#data1 plots
plot(x ~ y, data1)
abline(lm1)
hist(lm1$residuals)
qqnorm(lm1$residuals)
qqline(lm1$residuals)
The scatterplot for data1 suggests a roughly linear relationship, although there are outliers. The residuals appear to follow an approximately normal distribution.
#data2 plots
plot(x ~ y, data2)
abline(lm2)
hist(lm2$residuals)
qqnorm(lm2$residuals)
qqline(lm2$residuals)
The scatterplot for data2 does not suggest linearity, and the residuals do not appear to follow a nearly normal distribution.
#data3 plots
plot(x ~ y, data3)
abline(lm3)
hist(lm3$residuals)
qqnorm(lm3$residuals)
qqline(lm3$residuals)
The plots for data3 show no linear relationship in the data, and the residuals are heavy-tailed and do not follow a normal distribution.
A linear model is only valid if the necessary conditions have been met, and creating visualizations is a great way to check for linearity, nearly normal residuals, and sometimes even independence when the data collection order is provided. Please see the plots above.
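The same three diagnostic plots are drawn for each dataset, so they could be wrapped in one function. The sketch below is one possible way to do that; `check_conditions` is a hypothetical helper, and `demo` is simulated stand-in data rather than any of the assignment's datasets.

```r
# Hypothetical helper: draws a scatterplot with the fitted line, a histogram of
# the residuals, and a normal Q-Q plot for a fitted simple regression.
check_conditions <- function(model) {
  mf <- model.frame(model)          # column 1 is the response, column 2 the predictor
  old_par <- par(mfrow = c(1, 3))   # three panels side by side
  on.exit(par(old_par))             # restore the previous layout on exit
  plot(mf[[2]], mf[[1]], xlab = names(mf)[2], ylab = names(mf)[1],
       main = "Scatterplot with fit")
  abline(model)
  hist(residuals(model), main = "Residuals", xlab = "Residual")
  qqnorm(residuals(model))
  qqline(residuals(model))
  invisible(residuals(model))       # return the residuals without printing them
}

# Simulated stand-in data; not the assignment's data.
set.seed(606)
demo <- data.frame(y = runif(50, 0, 100))
demo$x <- 50 - 0.1 * demo$y + rnorm(50, sd = 10)
res <- check_conditions(lm(x ~ y, data = demo))
```

With the models above, the call would be `check_conditions(lm1)`, and likewise for `lm2` and `lm3`.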