Please put the answers for Part I next to the question number (please enter only the letter options; 4 points each):
Consider the three datasets, each with two columns (x and y), provided below. Be sure to replace the NA with your answer for each part (e.g. assign the mean of x for data1 to the data1.x.mean variable). When you Knit your answer document, a table will be generated with all the answers.
For each column, calculate (to four decimal places):
data1.x.mean <- round(mean(data1$x), 4)
data1.y.mean <- round(mean(data1$y), 4)
data2.x.mean <- round(mean(data2$x), 4)
data2.y.mean <- round(mean(data2$y), 4)
data3.x.mean <- round(mean(data3$x), 4)
data3.y.mean <- round(mean(data3$y), 4)
data1.x.median <- round(median(data1$x), 4)
data1.y.median <- round(median(data1$y), 4)
data2.x.median <- round(median(data2$x), 4)
data2.y.median <- round(median(data2$y), 4)
data3.x.median <- round(median(data3$x), 4)
data3.y.median <- round(median(data3$y), 4)
data1.x.sd <- round(sd(data1$x), 4)
data1.y.sd <- round(sd(data1$y), 4)
data2.x.sd <- round(sd(data2$x), 4)
data2.y.sd <- round(sd(data2$y), 4)
data3.x.sd <- round(sd(data3$x), 4)
data3.y.sd <- round(sd(data3$y), 4)
data1.correlation <- round(cor(data1$x, data1$y), 4)
data2.correlation <- round(cor(data2$x, data2$y), 4)
data3.correlation <- round(cor(data3$x, data3$y), 4)
reg1 <- lm(y~x, data = data1)
reg2 <- lm(y~x, data = data2)
reg3 <- lm(y~x, data = data3)
# unname() drops the coefficient label so only the rounded number is stored
data1.slope <- round(unname(coef(reg1)["x"]), 4)
data2.slope <- round(unname(coef(reg2)["x"]), 4)
data3.slope <- round(unname(coef(reg3)["x"]), 4)
data1.intercept <- round(unname(coef(reg1)["(Intercept)"]), 4)
data2.intercept <- round(unname(coef(reg2)["(Intercept)"]), 4)
data3.intercept <- round(unname(coef(reg3)["(Intercept)"]), 4)
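# Take R-squared from the fitted model rather than squaring the already-rounded correlation
data1.rsquared <- round(summary(reg1)$r.squared, 4)
data2.rsquared <- round(summary(reg2)$r.squared, 4)
data3.rsquared <- round(summary(reg3)$r.squared, 4)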
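The assignments above repeat the same ten calculations for each dataset. As a minimal sketch of how they could be collected in one place, the helper below computes all of them for a single data frame. The function name `summarise_xy` and the stand-in data are assumptions for illustration only: the built-in `cars` dataset is renamed to x and y because data1, data2 and data3 are not reproduced here.

```r
# Hypothetical helper: collect the summary statistics for one data frame
# that has numeric columns named x and y.
summarise_xy <- function(df, digits = 4) {
  fit <- lm(y ~ x, data = df)
  stats <- c(
    x.mean      = mean(df$x),
    y.mean      = mean(df$y),
    x.median    = median(df$x),
    y.median    = median(df$y),
    x.sd        = sd(df$x),
    y.sd        = sd(df$y),
    correlation = cor(df$x, df$y),
    intercept   = unname(coef(fit)["(Intercept)"]),
    slope       = unname(coef(fit)["x"]),
    rsquared    = summary(fit)$r.squared
  )
  round(stats, digits)
}

# Stand-in data (assumption): the built-in cars dataset, renamed to x and y.
demo <- data.frame(x = cars$speed, y = cars$dist)
demo.stats <- summarise_xy(demo)
print(demo.stats)
```

With the assignment's data loaded, a call such as `sapply(list(data1 = data1, data2 = data2, data3 = data3), summarise_xy)` should return one column of statistics per dataset.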
Summary Table
Statistic | data1 x | data1 y | data2 x | data2 y | data3 x | data3 y |
---|---|---|---|---|---|---|
Mean | 54.2633 | 47.8323 | 54.2678 | 47.8359 | 54.2661 | 47.8347 |
Median | 53.3333 | 46.0256 | 53.1352 | 46.4013 | 53.3403 | 47.5353 |
SD | 16.7651 | 26.9354 | 16.7668 | 26.9361 | 16.7698 | 26.9397 |

Statistic | data1 | data2 | data3 |
---|---|---|---|
r | -0.0645 | -0.0690 | -0.0641 |
Intercept | 53.4530 | 53.8497 | 53.4251 |
Slope | -0.1036 | -0.1108 | -0.1030 |
R-Squared | 0.0042 | 0.0048 | 0.0041 |
Let’s make a scatter plot of y against x.
library(ggplot2)
ggplot(data = data1, aes(x, y))+
geom_point()
No
Why? Because the scatter plot shows that the data points form the outline of a T-Rex dinosaur's head. There is no overall correlation between x and y across the whole dataset, although different portions of the data have different correlations.
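The point that an overall correlation near zero can hide strong local correlations can be illustrated with a small sketch. The V-shaped toy data below is an assumption made up for this example and is not the assignment's data1; it only shows how correlations computed on portions of a dataset can differ from the correlation of the whole.

```r
# Toy data (assumption): a symmetric V shape, not the assignment's data1.
x <- seq(-1, 1, length.out = 101)
toy <- data.frame(x = x, y = abs(x))

# Correlation over all points: the two arms cancel each other out.
overall.r <- cor(toy$x, toy$y)

# Correlation within each arm of the V.
left  <- toy[toy$x < 0, ]
right <- toy[toy$x > 0, ]
left.r  <- cor(left$x, left$y)    # falling arm
right.r <- cor(right$x, right$y)  # rising arm

print(round(c(overall = overall.r, left = left.r, right = right.r), 4))
```

A single regression line fitted to all of these points would be nearly flat, even though y is fully determined by x within each arm.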
Yes
Why?
Because the scatter plot shows a weak negative correlation (r = -0.0690), although the data points appear to fall into four parallel rows. The regression line can still give a general trend for the dataset.
ggplot(data = data2, aes(x, y))+
geom_point()
No
Why?
The scatter plot does not show any correlation between x and y. Thus the regression is not useful for this dataset.
ggplot(data = data3, aes(x, y))+
geom_point()
Appropriate visualization is very important in any data analysis because it gives insight into the data. It also helps us choose a suitable analysis technique, such as regression or classification, based on the nature of the dataset. In addition, visuals such as scatter plots help us find outliers, so the data can be preprocessed before any analysis technique is applied.
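A well-known illustration of this is Anscombe's quartet, which ships with base R as the `anscombe` data frame (columns x1 to x4 and y1 to y4). The sketch below is not part of the assignment's data; it computes the same kind of summary statistics for the four sets, which come out nearly identical even though the sets look very different when plotted. The object name `quartet.stats` is chosen here for illustration.

```r
# anscombe ships with base R (datasets package): columns x1..x4 and y1..y4.
quartet.stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(
    x.mean    = mean(x),
    y.mean    = mean(y),
    r         = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2])
  )
})
colnames(quartet.stats) <- paste0("set", 1:4)

# One column per set; compare the rows across columns.
print(round(quartet.stats, 2))
```

Passing each pair of columns to `plot()`, for example `plot(anscombe$x2, anscombe$y2)`, shows the differences that the table of statistics cannot.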
In the analysis above, regression was found to be inappropriate for data1 and data3 because neither dataset shows a strong correlation between x and y. Let us draw a scatter plot of dataset 1.
plot(data1, type = "p")