Part I

Please put the answers for Part I next to the question number (please enter only the letter options; 4 points each):

1.B
2.A
3.D
4.B 
5.B 
6.E 
7.D 
8.E
9.B 
10.C 

Part II

Consider the three datasets, each with two columns (x and y), provided below. Be sure to replace the NA with your answer for each part (e.g. assign the mean of x for data1 to the data1.x.mean variable). When you Knit your answer document, a table will be generated with all the answers.

For each column, calculate (to four decimal places):

a. The mean (for x and y separately; 5 pt).

data1.x.mean <- mean(data1$x)
data1.y.mean <- mean(data1$y)
data2.x.mean <- mean(data2$x)
data2.y.mean <- mean(data2$y)
data3.x.mean <- mean(data3$x)
data3.y.mean <- mean(data3$y)

b. The median (for x and y separately; 5 pt).

data1.x.median <- median(data1$x)
data1.y.median <- median(data1$y)
data2.x.median <- median(data2$x)
data2.y.median <- median(data2$y)
data3.x.median <- median(data3$x)
data3.y.median <- median(data3$y)

c. The standard deviation (for x and y separately; 5 pt).

data1.x.sd <- sd(data1$x)
data1.y.sd <- sd(data1$y)
data2.x.sd <- sd(data2$x)
data2.y.sd <- sd(data2$y)
data3.x.sd <- sd(data3$x)
data3.y.sd <- sd(data3$y)
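As a cross-check (not required by the assignment), all three statistics for both columns of a data frame can be computed in one call; a minimal sketch, assuming the data frames are already loaded:

# Cross-check: mean, median, and sd for both columns of data1 at once
sapply(data1, function(col) c(mean = mean(col), median = median(col), sd = sd(col)))

Applying the same call to data2 and data3 should reproduce the values in the summary table below.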

For each x and y pair, calculate (to two decimal places):

d. The correlation (5 pt).

round(cor(data1),2)
##       x     y
## x  1.00 -0.06
## y -0.06  1.00
round(cor(data2),2)
##       x     y
## x  1.00 -0.07
## y -0.07  1.00
round(cor(data3),2)
##       x     y
## x  1.00 -0.06
## y -0.06  1.00
data1.correlation <- round(cor(data1$x, data1$y), 2)  # -0.06
data2.correlation <- round(cor(data2$x, data2$y), 2)  # -0.07
data3.correlation <- round(cor(data3$x, data3$y), 2)  # -0.06

e. Linear regression equation (5 points).

lm1 <- lm(x ~ y, data = data1)
summary(lm1)
## 
## Call:
## lm(formula = x ~ y, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.58 -10.56  -0.98  10.29  43.38 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.1827     2.8792   19.51   <2e-16 ***
## y            -0.0401     0.0525   -0.76     0.45    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 140 degrees of freedom
## Multiple R-squared:  0.00416,    Adjusted R-squared:  -0.00296 
## F-statistic: 0.584 on 1 and 140 DF,  p-value: 0.446
# equation1 (from lm(x ~ y), so x is the response): x = -0.0401*y + 56.1827

lm2 <- lm(x ~ y, data = data2)
summary(lm2)
## 
## Call:
## lm(formula = x ~ y, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35.91 -11.20  -0.02  10.33  40.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.3218     2.8788   19.56   <2e-16 ***
## y            -0.0429     0.0525   -0.82     0.41    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 140 degrees of freedom
## Multiple R-squared:  0.00476,    Adjusted R-squared:  -0.00235 
## F-statistic: 0.669 on 1 and 140 DF,  p-value: 0.415
# equation2 (from lm(x ~ y), so x is the response): x = -0.0429*y + 56.3218


lm3 <- lm(x ~ y, data = data3)
summary(lm3)
## 
## Call:
## lm(formula = x ~ y, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37.42 -13.76  -0.69  15.03  38.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.1756     2.8799   19.51   <2e-16 ***
## y            -0.0399     0.0525   -0.76     0.45    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 140 degrees of freedom
## Multiple R-squared:  0.00411,    Adjusted R-squared:  -0.003 
## F-statistic: 0.578 on 1 and 140 DF,  p-value: 0.448
# equation3 (from lm(x ~ y), so x is the response): x = -0.0399*y + 56.1756

data1.slope <- -0.0401
data2.slope <- -0.0429
data3.slope <- -0.0399

data1.intercept <- 56.1827
data2.intercept <- 56.3218
data3.intercept <- 56.1756
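
Alternatively, the intercepts and slopes can be pulled straight from the fitted model objects rather than transcribed by hand, which avoids copy errors; a sketch using the lm fits above:

# coef() returns the (Intercept) and y coefficients of each fit,
# so the hard-coded values above can be reproduced programmatically
data1.intercept <- round(coef(lm1)[[1]], 4)
data1.slope     <- round(coef(lm1)[[2]], 4)
data2.intercept <- round(coef(lm2)[[1]], 4)
data2.slope     <- round(coef(lm2)[[2]], 4)
data3.intercept <- round(coef(lm3)[[1]], 4)
data3.slope     <- round(coef(lm3)[[2]], 4)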

f. R-Squared (5 points).

data1.rsquared <- summary(lm1)$r.squared
data2.rsquared <- summary(lm2)$r.squared
data3.rsquared <- summary(lm3)$r.squared
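
Rounded to four decimal places for the summary table, these should match the Multiple R-squared values reported in the regression output above:

# expected, from the summaries above: 0.0042 0.0048 0.0041
round(c(data1 = data1.rsquared, data2 = data2.rsquared, data3 = data3.rsquared), 4)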

Summary Table

            Data 1              Data 2              Data 3
            x        y          x        y          x        y
Mean        54.2633  47.8323    54.2678  47.8359    54.2661  47.8347
Median      53.3333  46.0256    53.1352  46.4013    53.3403  47.5353
SD          16.7651  26.9354    16.7668  26.9361    16.7698  26.9397
r           -0.0600             -0.0700             -0.0600
Intercept   56.1827             56.3218             56.1756
Slope       -0.0401             -0.0429             -0.0399
R-Squared   0.0042              0.0048              0.0041

g. For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (15 points)

Plots for data1

#data1 plots
plot(x ~ y, data1)
abline(lm1)

hist(lm1$residuals)

qqnorm(lm1$residuals)
qqline(lm1$residuals)

The scatterplot for data1 suggests some linearity, although there are outliers, and the residuals appear to follow an approximately normal distribution.

Plots for data2

#data2 plots
plot(x ~ y, data2)
abline(lm2)

hist(lm2$residuals)

qqnorm(lm2$residuals)
qqline(lm2$residuals)

The scatterplot for data2 does not suggest a linear relationship, and the residuals do not appear to follow a nearly normal distribution.

Plots for data3

#data3 plots
plot(x ~ y, data3)
abline(lm3)

hist(lm3$residuals)

qqnorm(lm3$residuals)
qqline(lm3$residuals)

The plots for data3 show no linear relationship in the data, and the residuals are heavy-tailed and do not follow a normal distribution.
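
As a supplementary numeric check on the residual-normality claims above (not part of the assignment), a Shapiro-Wilk test can be run on the residuals of each fit; small p-values are evidence against normality:

# Shapiro-Wilk p-values for the residuals of lm1, lm2, and lm3
sapply(list(lm1 = lm1, lm2 = lm2, lm3 = lm3),
       function(m) shapiro.test(residuals(m))$p.value)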

h. Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (15 points)

A linear model is only valid if the necessary conditions have been met, and visualizations are an effective way to check for linearity, nearly normal residuals, and, when the data collection order is provided, independence. Summary statistics alone can mislead: the three datasets here have nearly identical means, standard deviations, and correlations, yet their plots look very different. Please see the plots above and the side-by-side comparison below.
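
To illustrate, a minimal sketch (assuming data1 through data3 are loaded) that plots the three datasets side by side using base R graphics:

# Side-by-side scatterplots: the summary statistics are nearly identical,
# but the plots make the structural differences immediately visible
par(mfrow = c(1, 3))
plot(y ~ x, data = data1, main = "data1")
plot(y ~ x, data = data2, main = "data2")
plot(y ~ x, data = data3, main = "data3")
par(mfrow = c(1, 1))  # reset the plotting layout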