Part II

Consider the three datasets, each with two columns (x and y), provided below. Be sure to replace the NA with your answer for each part (e.g. assign the mean of x for data1 to the data1.x.mean variable). When you Knit your answer document, a table will be generated with all the answers.

For each column, calculate (to four decimal places):

a. The mean (for x and y separately; 5 pt).

data1.x.mean <- round(mean(data1$x), 4)
data1.y.mean <- round(mean(data1$y), 4)
data2.x.mean <- round(mean(data2$x), 4)
data2.y.mean <- round(mean(data2$y), 4)
data3.x.mean <- round(mean(data3$x), 4)
data3.y.mean <- round(mean(data3$y), 4)

b. The median (for x and y separately; 5 pt).

data1.x.median <- round(median(data1$x), 4)
data1.y.median <- round(median(data1$y), 4)
data2.x.median <- round(median(data2$x), 4)
data2.y.median <- round(median(data2$y), 4)
data3.x.median <- round(median(data3$x), 4)
data3.y.median <- round(median(data3$y), 4)

c. The standard deviation (for x and y separately; 5 pt).

data1.x.sd <- round(sd(data1$x), 4)
data1.y.sd <- round(sd(data1$y), 4)
data2.x.sd <- round(sd(data2$x), 4)
data2.y.sd <- round(sd(data2$y), 4)
data3.x.sd <- round(sd(data3$x), 4)
data3.y.sd <- round(sd(data3$y), 4)
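As a cross-check on parts a to c, all three statistics can be computed for every column in one pass. The sketch below is only an illustration: `toy` is a made-up four-row data frame, not one of the exam datasets. Note that `sd()` returns the sample standard deviation (denominator n - 1).

```r
# Hypothetical stand-in data; NOT one of the exam datasets
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 9))

# One column of results per variable; rows are mean, median, and sd,
# each rounded to four decimal places
stats <- sapply(toy, function(col) {
  round(c(mean = mean(col), median = median(col), sd = sd(col)), 4)
})

print(stats)
```

Individual values can be pulled out by name, for example `stats["mean", "x"]`, which makes it easy to compare against the variables assigned one at a time above.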

For each x and y pair, calculate (also to four decimal places):

d. The correlation (5 pt).

data1.correlation <- round(cor(data1$x, data1$y), 4)
data2.correlation <- round(cor(data2$x, data2$y), 4)
data3.correlation <- round(cor(data3$x, data3$y), 4)

e. Linear regression equation (5 points).

reg1 <- lm(y ~ x, data = data1)
reg2 <- lm(y ~ x, data = data2)
reg3 <- lm(y ~ x, data = data3)

# unname() drops the coefficient label so that only the number is stored
data1.slope <- round(unname(coef(reg1)["x"]), 4)
data2.slope <- round(unname(coef(reg2)["x"]), 4)
data3.slope <- round(unname(coef(reg3)["x"]), 4)

data1.intercept <- round(unname(coef(reg1)["(Intercept)"]), 4)
data2.intercept <- round(unname(coef(reg2)["(Intercept)"]), 4)
data3.intercept <- round(unname(coef(reg3)["(Intercept)"]), 4)
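In a simple linear regression the coefficients are tied to the statistics from parts a, c, and d: the slope equals r * sd(y) / sd(x), and the intercept equals mean(y) - slope * mean(x). The sketch below checks this against `lm()`. The data frame `toy` is simulated for illustration and is an assumption, not exam data.

```r
# Simulated stand-in data; the exam's data1..data3 are not reproduced here
set.seed(606)
toy <- data.frame(x = rnorm(100, mean = 54, sd = 17))
toy$y <- 50 - 0.1 * toy$x + rnorm(100, sd = 27)

fit <- lm(y ~ x, data = toy)

# Rebuild the coefficients from the mean, SD, and correlation
r <- cor(toy$x, toy$y)
slope_check <- r * sd(toy$y) / sd(toy$x)
intercept_check <- mean(toy$y) - slope_check * mean(toy$x)

# Compare with the coefficients estimated by lm()
slope_matches <- all.equal(unname(coef(fit)["x"]), slope_check)
intercept_matches <- all.equal(unname(coef(fit)["(Intercept)"]), intercept_check)
print(c(slope = slope_matches, intercept = intercept_matches))
```

The same relationship holds for the reported Data 1 values: -0.0645 * 26.9354 / 16.7651 is approximately -0.1036, which agrees with the slope in the summary table.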

f. R-Squared (5 points).

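# Take R-squared from the fitted model and round it to four decimal places,
# rather than squaring the already-rounded correlation
data1.rsquared <- round(summary(reg1)$r.squared, 4)
data2.rsquared <- round(summary(reg2)$r.squared, 4)
data3.rsquared <- round(summary(reg3)$r.squared, 4)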
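For a regression with a single predictor, R-squared is the square of the Pearson correlation, so the two routes to the answer should agree when neither input has been rounded. A minimal sketch of that check, again on simulated stand-in data (`toy` is hypothetical):

```r
# Simulated stand-in data; NOT one of the exam datasets
set.seed(606)
toy <- data.frame(x = rnorm(100, mean = 54, sd = 17))
toy$y <- 50 - 0.1 * toy$x + rnorm(100, sd = 27)

fit <- lm(y ~ x, data = toy)

r2_from_cor <- cor(toy$x, toy$y)^2      # squared correlation
r2_from_lm <- summary(fit)$r.squared    # R-squared reported by lm()

# TRUE when the two agree up to floating-point tolerance
r2_agree <- all.equal(r2_from_cor, r2_from_lm)
print(r2_agree)
```

Squaring a correlation that was first rounded to four decimal places can shift the later digits slightly, so rounding is best left to the final step.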

Summary Table

            Data 1              Data 2              Data 3
            x         y         x         y         x         y
Mean        54.2633   47.8323   54.2678   47.8359   54.2661   47.8347
Median      53.3333   46.0256   53.1352   46.4013   53.3403   47.5353
SD          16.7651   26.9354   16.7668   26.9361   16.7698   26.9397
r           -0.0645             -0.0690             -0.0641
Intercept   53.4530             53.8497             53.4251
Slope       -0.1036             -0.1108             -0.1030
R-Squared   0.0042              0.0048              0.0041

g. For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (15 points)

Data set 1

Let’s make a scatter plot of y against x.

library(ggplot2)
ggplot(data = data1, aes(x, y))+
  geom_point()

No.

Why? Because the scatter plot shows that the data points form the head of a T-Rex dinosaur. There is no correlation between x and y across the full dataset, although different portions of the data show different local correlations, so a single straight line cannot describe the relationship.

Data set 2

Yes

Why?

Because the scatter plot shows a negative correlation, although the data points appear to fall into four parallel rows. The regression line can still give the general trend in the dataset.

ggplot(data = data2, aes(x, y))+
  geom_point()

Data set 3

No.

Why?

The scatter plot does not show any correlation between x and y, so a linear regression is not useful for this dataset.

ggplot(data = data3, aes(x, y))+
  geom_point()
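A scatter plot with the fitted line overlaid, together with a residuals-versus-fitted plot, is one common way to judge whether a straight line is appropriate. The sketch below assumes ggplot2 is installed and uses a made-up curved dataset (`toy`), not the exam data; the same code could be pointed at `data1`, `data2`, or `data3`.

```r
library(ggplot2)

# Made-up stand-in with a curved pattern; NOT one of the exam datasets
set.seed(606)
toy <- data.frame(x = runif(150, 20, 90))
toy$y <- 90 - 0.05 * (toy$x - 40)^2 + rnorm(150, sd = 5)

fit <- lm(y ~ x, data = toy)

# Scatter plot with the least-squares line overlaid
p_fit <- ggplot(toy, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Residuals vs fitted values: a patternless band around zero supports a
# linear model; curves, clusters, or funnels argue against it
diag_df <- data.frame(fitted = fitted(fit), residual = resid(fit))
p_resid <- ggplot(diag_df, aes(fitted, residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

print(p_fit)
print(p_resid)
```

In this stand-in the residuals trace a clear arch, which is the kind of structure that signals a linear model is missing the real shape of the data.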

h. Why is it important to include appropriate visualizations when analyzing data? Be sure to ground your reasoning in the context of the analyses completed above. Include any visualization(s) you create. (15 points)

Appropriate visualization is very important in any data analysis because it gives insight into the data. It also helps us choose a suitable analysis technique, such as regression or classification, based on the nature of the dataset. In addition, visuals such as scatter plots help us find outliers, so the data can be preprocessed before any analysis technique is applied.

In the analysis above, regression was found to be inappropriate for data1 and data3 because there is no strong correlation between x and y in either dataset. Let us look at the scatter plot of dataset 1.

plot(data1, type = "p")
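The summary table reports nearly identical means, standard deviations, and correlations for the three datasets, so only a plot can tell them apart. Drawing the datasets side by side makes the comparison direct. The sketch below assumes ggplot2 is installed; `toy1` to `toy3` are made-up stand-ins (a ring, four parallel rows, and noise) and do not reproduce the exam data or its matching statistics.

```r
library(ggplot2)

# Made-up stand-ins for data1..data3; NOT the exam data
set.seed(606)
n <- 120
angle <- runif(n, 0, 2 * pi)
toy1 <- data.frame(x = 50 + 25 * cos(angle), y = 50 + 25 * sin(angle))

row_id <- sample(0:3, n, replace = TRUE)
toy2 <- data.frame(x = runif(n, 20, 90))
toy2$y <- 20 * row_id + 25 - 0.2 * toy2$x

toy3 <- data.frame(x = rnorm(n, 54, 17), y = rnorm(n, 48, 27))

# Stack the three data frames with a label column for faceting
toy_all <- rbind(
  cbind(dataset = "Stand-in 1", toy1),
  cbind(dataset = "Stand-in 2", toy2),
  cbind(dataset = "Stand-in 3", toy3)
)

# One scatter panel per dataset, on shared axes
p <- ggplot(toy_all, aes(x, y)) +
  geom_point(size = 0.8) +
  facet_wrap(~ dataset)
print(p)

# Correlation per dataset, for comparison with what the panels show
cors <- sapply(split(toy_all, toy_all$dataset), function(d) cor(d$x, d$y))
print(round(cors, 4))
```

Replacing the stand-ins with `data1`, `data2`, and `data3` would put the three exam scatter plots next to each other under the same axes.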

DATA 606 Fall 2023 - Final Exam

Frederick Jones

Part I