Part I

Please put the answers for Part I next to the question number (please enter only the letter options; 4 points each):

  1. B
  2. A
  3. D
  4. B
  5. B
  6. E
  7. D
  8. E
  9. B
  10. C

Part II

Consider the three datasets, each with two columns (x and y), provided below. Be sure to replace the NA with your answer for each part (e.g. assign the mean of x for data1 to the data1.x.mean variable). When you Knit your answer document, a table will be generated with all the answers.

For each column, calculate (to four decimal places):

a. The mean (for x and y separately; 5 pt).

data1.x.mean <- round(mean(data1$x), 4)
data1.y.mean <- round(mean(data1$y), 4)
data2.x.mean <- round(mean(data2$x), 4)
data2.y.mean <- round(mean(data2$y), 4)
data3.x.mean <- round(mean(data3$x), 4)
data3.y.mean <- round(mean(data3$y), 4)

b. The median (for x and y separately; 5 pt).

data1.x.median <- round(median(data1$x), 4)
data1.y.median <- round(median(data1$y), 4)
data2.x.median <- round(median(data2$x), 4)
data2.y.median <- round(median(data2$y), 4)
data3.x.median <- round(median(data3$x), 4)
data3.y.median <- round(median(data3$y), 4)

c. The standard deviation (for x and y separately; 5 pt).

data1.x.sd <- round(sd(data1$x), 4)
data1.y.sd <- round(sd(data1$y), 4)
data2.x.sd <- round(sd(data2$x), 4)
data2.y.sd <- round(sd(data2$y), 4)
data3.x.sd <- round(sd(data3$x), 4)
data3.y.sd <- round(sd(data3$y), 4)

For each x and y pair, calculate (also to four decimal places):

d. The correlation (5 pt).

data1.correlation <- round(cor(data1$x, data1$y), 4)
data2.correlation <- round(cor(data2$x, data2$y), 4)
data3.correlation <- round(cor(data3$x, data3$y), 4)

e. Linear regression equation (5 points).

reg1 <- lm(y~x, data = data1)
reg2 <- lm(y~x, data = data2)
reg3 <- lm(y~x, data = data3)
# Extract coefficients by name; [[ ]] drops the name so only the number is stored.
data1.slope <- round(coef(reg1)[["x"]], 4)
data2.slope <- round(coef(reg2)[["x"]], 4)
data3.slope <- round(coef(reg3)[["x"]], 4)

data1.intercept <- round(coef(reg1)[["(Intercept)"]], 4)
data2.intercept <- round(coef(reg2)[["(Intercept)"]], 4)
data3.intercept <- round(coef(reg3)[["(Intercept)"]], 4)
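With a single predictor, the least-squares slope and intercept follow directly from the statistics in parts a, c and d: slope = r * sd(y) / sd(x) and intercept = mean(y) - slope * mean(x). The sketch below checks this against `lm()` on simulated data; the data frame `d` and its parameters are assumptions made for illustration, not the assignment's data.

```r
# Hypothetical data standing in for one of the assignment's data frames.
set.seed(42)
d <- data.frame(x = rnorm(100, mean = 54, sd = 17))
d$y <- 50 - 0.1 * d$x + rnorm(100, sd = 27)

fit <- lm(y ~ x, data = d)

# Closed-form least-squares estimates for a single predictor.
slope_manual     <- cor(d$x, d$y) * sd(d$y) / sd(d$x)
intercept_manual <- mean(d$y) - slope_manual * mean(d$x)

# Both comparisons should be TRUE (up to floating-point tolerance).
all.equal(coef(fit)[["x"]], slope_manual)
all.equal(coef(fit)[["(Intercept)"]], intercept_manual)
```

This also explains why datasets with matching means, standard deviations and correlations end up with nearly the same fitted line.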

f. R-Squared (5 points).

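# Take R-squared from the fitted models and round it, instead of squaring the already rounded r.
data1.rsquared <- round(summary(reg1)$r.squared, 4)
data2.rsquared <- round(summary(reg2)$r.squared, 4)
data3.rsquared <- round(summary(reg3)$r.squared, 4)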
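For a regression with one predictor, R-squared equals the square of the Pearson correlation, so either route gives the same quantity before rounding. A minimal check on simulated data (the data frame `d` is an assumption for illustration):

```r
# Hypothetical data; any x-y pair would do for this check.
set.seed(7)
d <- data.frame(x = runif(80, 0, 100))
d$y <- 48 + 0.05 * d$x + rnorm(80, sd = 25)

fit <- lm(y ~ x, data = d)

r2_from_model <- summary(fit)$r.squared  # R-squared reported by the model
r2_from_cor   <- cor(d$x, d$y)^2         # squared Pearson correlation

all.equal(r2_from_model, r2_from_cor)    # should be TRUE

# Rounding r before squaring can shift the late digits of the result.
round(r2_from_model, 4)
round(round(cor(d$x, d$y), 4)^2, 4)
```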

Summary Table

            Data 1              Data 2              Data 3
            x        y          x        y          x        y
Mean        54.2633  47.8323    54.2678  47.8359    54.2661  47.8347
Median      53.3333  46.0256    53.1352  46.4013    53.3403  47.5353
SD          16.7651  26.9354    16.7668  26.9361    16.7698  26.9397
r           -0.0645             -0.0690             -0.0641
Intercept   53.4530             53.8497             53.4251
Slope       -0.1036             -0.1108             -0.1030
R-Squared   0.0042              0.0048              0.0041

g. For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (15 points)

Data set 1

Let’s make a scatter plot of y against x.

library(ggplot2)
ggplot(data = data1, aes(x, y))+
  geom_point()

No

Why? Because the scatter plot shows that the data points trace the outline of a T-Rex head rather than a linear trend. The overall correlation between x and y is close to zero (r = -0.0645), although different portions of the shape have different local trends, so a single straight line cannot describe the data.

Data set 2

Yes

Why?

Because the scatter plot shows a weak negative correlation (r = -0.0690), although the data points appear to form four parallel rows. The regression line can still give the general trend in the dataset.

ggplot(data = data2, aes(x, y))+
  geom_point()

Data set 3

No

Why?

The scatter plot does not show any linear relationship between x and y (r = -0.0641). Thus a regression is not useful for this dataset.

ggplot(data = data3, aes(x, y))+
  geom_point()
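A residual plot is another common way to judge whether a straight line is appropriate: if the residuals show a clear pattern against x, the linear model is missing structure. The sketch below uses simulated curved data (an assumption for illustration, not one of the three datasets) where the overall correlation is near zero even though y clearly depends on x.

```r
# Hypothetical non-linear data: y follows a curve in x, not a line.
set.seed(3)
d <- data.frame(x = seq(0, 100, length.out = 120))
d$y <- 0.02 * (d$x - 50)^2 + rnorm(120, sd = 3)

cor(d$x, d$y)              # close to zero despite the strong dependence

fit <- lm(y ~ x, data = d)

# Residuals against x: a patternless band around zero supports a linear model,
# while a U-shape like this one shows the line misses the structure.
plot(d$x, resid(fit), xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)
```

The same check could be run on `reg1`, `reg2` and `reg3` by plotting `resid()` of each model against the corresponding x column.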

h. Why is it important to include appropriate visualizations when analyzing data? Be sure to ground your reasoning in the context of the analyses completed above. Include any visualization(s) you create. (15 points)

Appropriate visualization is very important in any data analysis because it gives insight into the data. It also helps us choose an appropriate analysis technique, such as regression or classification, based on the nature of the dataset. In addition, visuals such as scatter plots help us find outliers, so the data can be preprocessed before any analysis technique is applied.

In the analysis above, regression was found to be inappropriate for data1 and data3 because there is no strong correlation between x and y in either dataset. The summary statistics of the three datasets are nearly identical, so only the plots reveal how differently the points are arranged. Let us look at the scatter plot of dataset 1.

plot(data1, type = "p")
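The same point can be illustrated with Anscombe's quartet, which ships with base R as the `anscombe` data frame. It is used here only as a stand-in, since data1, data2 and data3 are not reproduced in this document: the four x-y pairs have almost identical summary statistics but very different scatter plots.

```r
# anscombe is part of base R (datasets package): columns x1..x4 and y1..y4.
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_y = mean(y), sd_y = sd(y), r = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)
round(stats, 2)  # the four columns are nearly identical

# The scatter plots, however, look nothing alike.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(op)
```

A table of means, standard deviations and correlations would not distinguish these four sets, just as the summary table above does not distinguish the three datasets in this assignment.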