Please put the answers for Part I next to the question number (please enter only the letter options; 4 points each):
Consider the three datasets, each with two columns (x and y),
provided below. Be sure to replace the NA with your answer
for each part (e.g. assign the mean of x for
data1 to the data1.x.mean variable). When you
Knit your answer document, a table will be generated with all the
answers.
For each column, calculate (to four decimal places):
# Means, rounded to four decimal places
data1.x.mean <- round(mean(data1$x), 4)
data1.y.mean <- round(mean(data1$y), 4)
data2.x.mean <- round(mean(data2$x), 4)
data2.y.mean <- round(mean(data2$y), 4)
data3.x.mean <- round(mean(data3$x), 4)
data3.y.mean <- round(mean(data3$y), 4)
# Medians
data1.x.median <- round(median(data1$x), 4)
data1.y.median <- round(median(data1$y), 4)
data2.x.median <- round(median(data2$x), 4)
data2.y.median <- round(median(data2$y), 4)
data3.x.median <- round(median(data3$x), 4)
data3.y.median <- round(median(data3$y), 4)
# Standard deviations
data1.x.sd <- round(sd(data1$x), 4)
data1.y.sd <- round(sd(data1$y), 4)
data2.x.sd <- round(sd(data2$x), 4)
data2.y.sd <- round(sd(data2$y), 4)
data3.x.sd <- round(sd(data3$x), 4)
data3.y.sd <- round(sd(data3$y), 4)
# Correlation between x and y for each dataset
data1.correlation <- round(cor(data1$x, data1$y), 4)
data2.correlation <- round(cor(data2$x, data2$y), 4)
data3.correlation <- round(cor(data3$x, data3$y), 4)
# Simple linear regression of y on x for each dataset
m1 <- lm(y ~ x, data = data1)
m2 <- lm(y ~ x, data = data2)
m3 <- lm(y ~ x, data = data3)
# Slope, intercept, and R-squared extracted from each fitted model
data1.slope <- round(coef(m1)["x"], 4)
data2.slope <- round(coef(m2)["x"], 4)
data3.slope <- round(coef(m3)["x"], 4)
data1.intercept <- round(coef(m1)["(Intercept)"], 4)
data2.intercept <- round(coef(m2)["(Intercept)"], 4)
data3.intercept <- round(coef(m3)["(Intercept)"], 4)
data1.rsquared <- round(summary(m1)$r.squared, 4)
data2.rsquared <- round(summary(m2)$r.squared, 4)
data3.rsquared <- round(summary(m3)$r.squared, 4)
Summary Table
| Statistic | Data 1 (x) | Data 1 (y) | Data 2 (x) | Data 2 (y) | Data 3 (x) | Data 3 (y) |
|---|---|---|---|---|---|---|
| Mean | 54.2633 | 47.8323 | 54.2678 | 47.8359 | 54.2661 | 47.8347 |
| Median | 53.3333 | 46.0256 | 53.1352 | 46.4013 | 53.3403 | 47.5353 |
| SD | 16.7651 | 26.9354 | 16.7668 | 26.9361 | 16.7698 | 26.9397 |
| r (x, y) | -0.0645 | | -0.0690 | | -0.0641 | |
| Intercept | 53.4530 | | 53.8497 | | 53.4251 | |
| Slope | -0.1036 | | -0.1108 | | -0.1030 | |
| R-Squared | 0.0042 | | 0.0048 | | 0.0041 | |

(Correlation and regression values describe each dataset as a whole, so they appear once per dataset.)
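For reference, a minimal sketch of how these rounded values could be collected into a table like the one above (assuming the knitr package is available; summary.table is just an illustrative name, and the assignment template may generate its table differently):

# Gather the rounded statistics into one data frame, one row per statistic;
# correlation and regression values apply to a whole dataset, so the y columns are left NA.
summary.table <- data.frame(
  Statistic = c("Mean", "Median", "SD", "r", "Intercept", "Slope", "R-Squared"),
  data1.x = c(data1.x.mean, data1.x.median, data1.x.sd,
              data1.correlation, data1.intercept, data1.slope, data1.rsquared),
  data1.y = c(data1.y.mean, data1.y.median, data1.y.sd, NA, NA, NA, NA),
  data2.x = c(data2.x.mean, data2.x.median, data2.x.sd,
              data2.correlation, data2.intercept, data2.slope, data2.rsquared),
  data2.y = c(data2.y.mean, data2.y.median, data2.y.sd, NA, NA, NA, NA),
  data3.x = c(data3.x.mean, data3.x.median, data3.x.sd,
              data3.correlation, data3.intercept, data3.slope, data3.rsquared),
  data3.y = c(data3.y.mean, data3.y.median, data3.y.sd, NA, NA, NA, NA)
)
knitr::kable(summary.table)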
No
Even though Data Set 1 has a mean, standard deviation, correlation, slope, and R-squared that all look reasonable, a linear regression model is not appropriate once we actually look at the data.
When you plot x vs y, the relationship is clearly nonlinear. The points follow a curved pattern, not a straight-line trend. Linear regression assumes that the relationship between x and y is approximately linear, and that assumption is violated here.
So even though the summary statistics suggest a weak linear relationship, the scatterplot tells a different story. The model would be misleading because it forces a straight line onto data that clearly bends.
ggplot(data1, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Data Set 1: x vs y",
subtitle = "Curved relationship makes linear regression inappropriate"
)
## `geom_smooth()` using formula = 'y ~ x'
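Though not part of the assigned answer, a quick residual check on the m1 fit above makes the same point: if the relationship were linear, the residuals would scatter randomly around zero instead of following a systematic curve. A minimal sketch, assuming ggplot2 is loaded:

# Residuals vs. fitted values for m1; a curved or patterned band here
# confirms that the straight-line model misses structure in Data Set 1.
ggplot(data.frame(fitted = fitted(m1), residual = resid(m1)),
       aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Data Set 1: Residuals vs Fitted Values")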
No
Even though the summary statistics for Data Set 2 (mean, standard deviation, correlation, slope, and R²) are very similar to those of Data Sets 1 and 3, the scatterplot shows that the relationship between x and y is clearly non-linear. The data form a curved and uneven pattern rather than clustering around a straight line.
Because one of the key assumptions of linear regression is a linear relationship between the predictor and response variables, this condition is violated. The fitted regression line does not meaningfully represent the pattern in the data, making linear regression misleading for this dataset.
# Scatterplot of Data Set 2 with the fitted least-squares line overlaid
ggplot(data2, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
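One optional way to make the non-linearity explicit, sketched below rather than required by the assignment, is to draw both a straight-line fit and a flexible loess smoother; a large gap between the two suggests the linearity assumption fails.

# Dashed line: least-squares fit; red curve: loess smoother.
# If the smoother bends away from the straight line, the linear model misses the pattern.
ggplot(data2, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +
  geom_smooth(method = "loess", se = FALSE, colour = "red")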
No
Even though Data Set 3 has summary statistics that look reasonable (mean, standard deviation, correlation, and even a regression line), the scatterplot tells a very different story.
When I plot x versus y for Data Set 3, the points form a clearly curved, non-linear pattern, not a straight-line relationship. The data rise and fall in a way that a straight line simply cannot capture well.
Linear regression assumes a linear relationship between x and y and residuals that scatter randomly around the fitted line. Both of those assumptions are violated here: a straight line would systematically miss large portions of the data, leading to misleading predictions and conclusions.
This dataset is a good example of why you cannot rely only on statistics like correlation or R² — you must look at the plot.
ggplot(data3, aes(x = x, y = y)) +
geom_point() +
labs(
title = "Data Set 3: Nonlinear Relationship",
x = "x",
y = "y"
)
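As an optional check (not part of the original answer), overlaying a least-squares line, the same line the m3 fit would give, on the Data Set 3 scatterplot makes the mismatch visible:

# Same scatterplot with a straight-line fit overlaid; the line cuts through
# the curved pattern and systematically misses much of the data.
ggplot(data3, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Data Set 3: Linear Fit Overlaid on a Nonlinear Pattern")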
Including appropriate visualizations is critical because numerical summaries alone can be misleading and hide important patterns in the data. In the analyses above, all three datasets had nearly identical means, medians, standard deviations, correlations, and even regression results. Based on the numbers alone, it would be easy to assume that all three datasets behave similarly and that a linear regression model is appropriate in every case.
However, once the data were visualized, it became clear that this was not true:

- One dataset showed a curved, non-linear pattern, making linear regression inappropriate.
- Another dataset contained influential points that strongly affected the regression line and correlation.
- A third dataset showed little to no meaningful linear relationship at all.
The visualizations revealed structure, outliers, non-linearity, and clustering that the summary statistics failed to capture. Without plotting the data, these issues would have gone unnoticed, leading to incorrect conclusions and misuse of statistical models.
This demonstrates that visualizations are not optional or decorative — they are a necessary step in data analysis. They help confirm whether model assumptions are reasonable, whether relationships are real or artificial, and whether results make practical sense. In short, visualizations protect analysts from drawing false conclusions based solely on “good-looking” numbers.
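As a closing illustration, a minimal sketch of viewing all three datasets side by side, assuming data1, data2, and data3 each contain the same x and y columns and that the dplyr and ggplot2 packages are available:

# Stack the three datasets with a label column and facet the scatterplots;
# nearly identical summary statistics hide three very different shapes.
library(dplyr)
library(ggplot2)

all.data <- bind_rows(
  data1 %>% mutate(dataset = "Data Set 1"),
  data2 %>% mutate(dataset = "Data Set 2"),
  data3 %>% mutate(dataset = "Data Set 3")
)

ggplot(all.data, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ dataset) +
  labs(title = "Three Datasets with Nearly Identical Summary Statistics")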