Part I

Please put the answers for Part I next to the question numbers (enter only the letter options; 4 points each):

  1. B
  2. A
  3. D
  4. B
  5. B
  6. E
  7. D
  8. E
  9. B
 10. C

Part II

Consider the three datasets, each with two columns (x and y), provided below. Be sure to replace each NA with your answer for the corresponding part (e.g., assign the mean of x for data1 to the data1.x.mean variable). When you Knit your answer document, a table will be generated with all the answers.

For each column, calculate (to four decimal places):

a. The mean (for x and y separately; 5 pt).

data1.x.mean <- round(mean(data1$x), 4)
data1.y.mean <- round(mean(data1$y), 4)

data2.x.mean <- round(mean(data2$x), 4)
data2.y.mean <- round(mean(data2$y), 4)

data3.x.mean <- round(mean(data3$x), 4)
data3.y.mean <- round(mean(data3$y), 4)

b. The median (for x and y separately; 5 pt).

data1.x.median <- round(median(data1$x), 4)
data1.y.median <- round(median(data1$y), 4)

data2.x.median <- round(median(data2$x), 4)
data2.y.median <- round(median(data2$y), 4)

data3.x.median <- round(median(data3$x), 4)
data3.y.median <- round(median(data3$y), 4)

c. The standard deviation (for x and y separately; 5 pt).

data1.x.sd <- round(sd(data1$x), 4)
data1.y.sd <- round(sd(data1$y), 4)

data2.x.sd <- round(sd(data2$x), 4)
data2.y.sd <- round(sd(data2$y), 4)

data3.x.sd <- round(sd(data3$x), 4)
data3.y.sd <- round(sd(data3$y), 4)

For each x and y pair, calculate (also to four decimal places):

d. The correlation (5 pt).

data1.correlation <- round(cor(data1$x, data1$y), 4)
data2.correlation <- round(cor(data2$x, data2$y), 4)
data3.correlation <- round(cor(data3$x, data3$y), 4)

e. The linear regression equation (5 pt).

# Fit a simple linear regression of y on x for each dataset.
m1 <- lm(y ~ x, data = data1)
m2 <- lm(y ~ x, data = data2)
m3 <- lm(y ~ x, data = data3)

data1.slope <- round(coef(m1)["x"], 4)
data2.slope <- round(coef(m2)["x"], 4)
data3.slope <- round(coef(m3)["x"], 4)

data1.intercept <- round(coef(m1)["(Intercept)"], 4)
data2.intercept <- round(coef(m2)["(Intercept)"], 4)
data3.intercept <- round(coef(m3)["(Intercept)"], 4)
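
With these coefficients, each fitted line has the form ŷ = intercept + slope · x; for Data set 1, for example, ŷ = 53.4530 − 0.1036 x (the values shown in the summary table below).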

f. R-squared (5 pt).

data1.rsquared <- round(summary(m1)$r.squared, 4)
data2.rsquared <- round(summary(m2)$r.squared, 4)
data3.rsquared <- round(summary(m3)$r.squared, 4)
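
When the document is knit, these variables are collected into the summary table below. As a minimal sketch of how the per-pair rows might be assembled (assuming the knitr package is available; the pair.stats name is illustrative, not part of the assignment template):

library(knitr)

# Gather the per-pair answers computed above into one data frame and render
# it as a table at knit time.
pair.stats <- data.frame(
  Statistic = c("r", "Intercept", "Slope", "R-Squared"),
  `Data 1`  = c(data1.correlation, data1.intercept, data1.slope, data1.rsquared),
  `Data 2`  = c(data2.correlation, data2.intercept, data2.slope, data2.rsquared),
  `Data 3`  = c(data3.correlation, data3.intercept, data3.slope, data3.rsquared),
  check.names = FALSE
)
kable(pair.stats)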

Summary Table

            Data 1                Data 2                Data 3
            x         y           x         y           x         y
Mean        54.2633   47.8323     54.2678   47.8359     54.2661   47.8347
Median      53.3333   46.0256     53.1352   46.4013     53.3403   47.5353
SD          16.7651   26.9354     16.7668   26.9361     16.7698   26.9397
r           -0.0645               -0.0690               -0.0641
Intercept   53.4530               53.8497               53.4251
Slope       -0.1036               -0.1108               -0.1030
R-Squared   0.0042                0.0048                0.0041

g. For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific about why for each pair, and include appropriate plots! (15 pt)

Data set 1

No.

Even though Data set 1 has a mean, standard deviation, correlation, slope, and R-squared that all look reasonable, a linear regression model is not appropriate once we actually look at the data.

When you plot x vs y, the relationship is clearly nonlinear. The points follow a curved pattern, not a straight-line trend. Linear regression assumes that the relationship between x and y is approximately linear, and that assumption is violated here.

So even though the summary statistics suggest a weak linear relationship, the scatterplot tells a different story. The model would be misleading because it forces a straight line onto data that clearly bends.

library(ggplot2)  # assumed not already loaded in a setup chunk

ggplot(data1, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Data Set 1: x vs y",
    subtitle = "Curved relationship makes linear regression inappropriate"
  )
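
One way to make the bend visible is to overlay a flexible smoother on the same scatterplot; a minimal sketch (the loess smoother and its default span are illustrative choices, not part of the original answer):

# Contrast the straight-line fit with a loess smoother: where the two curves
# diverge, the linearity assumption is doing real damage.
ggplot(data1, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  labs(title = "Data Set 1: linear fit (blue) vs. loess smoother (red)")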

Data set 2

No.

Even though the summary statistics for Data set 2 (mean, standard deviation, correlation, slope, and R²) are very similar to Data sets 1 and 3, the scatterplot shows that the relationship between x and y is clearly non-linear. The data form a curved and uneven pattern rather than clustering around a straight line.

One of the key assumptions of linear regression is a linear relationship between the predictor and response variables, and that condition is violated here. The fitted regression line does not meaningfully represent the pattern in the data, making linear regression misleading for this dataset (the residuals-vs-fitted sketch below makes the violation visible).

ggplot(data2, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
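
The violated linearity assumption also shows up in the residuals. As a minimal sketch (assuming m2 from part e is still in scope; the d2.diag helper name is illustrative), a residuals-vs-fitted plot should look like a patternless horizontal band when a linear model is appropriate:

# Residuals vs. fitted values for m2: any clear pattern (curvature, funneling)
# indicates the straight-line model is mis-specified for data2.
d2.diag <- data.frame(fitted = fitted(m2), resid = resid(m2))

ggplot(d2.diag, aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Data Set 2: residuals vs. fitted values",
    x = "Fitted values",
    y = "Residuals"
  )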

Data set 3

No.

Even though Data set 3 has summary statistics that look reasonable (mean, standard deviation, correlation, and even a regression line), the scatterplot tells a very different story.

When I plot x versus y for Data set 3, the points form a clear curved, non-linear pattern, not a straight-line relationship. The data rise and fall in a way that a straight line simply cannot capture well.

Because linear regression assumes a linear relationship, with residuals scattered randomly around the fitted line, those assumptions are violated here. A straight line would systematically miss large portions of the data, leading to misleading predictions and conclusions; the residual sketch after the plot below illustrates this.

This dataset is a good example of why you cannot rely only on statistics like correlation or R²: you must also look at the plot.

ggplot(data3, aes(x = x, y = y)) +
  geom_point() +
  labs(
    title = "Data Set 3: Nonlinear Relationship",
    x = "x",
    y = "y"
  )
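
Since the answer above leans on the residual assumption, a small sketch can make it concrete. Assuming m3 from part e is still in scope (the data3.aug helper name is illustrative), this overlays the fitted line and draws each residual as a vertical segment:

# Overlay the fitted line from m3 and draw each residual as a vertical
# segment, showing how the straight line systematically misses whole regions.
data3.aug <- data3
data3.aug$fitted <- fitted(m3)

ggplot(data3.aug, aes(x = x, y = y)) +
  geom_segment(aes(xend = x, yend = fitted), color = "grey60") +
  geom_point() +
  geom_line(aes(y = fitted), color = "blue") +
  labs(title = "Data Set 3: residuals around the fitted line")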

h. Why is it important to include appropriate visualizations when analyzing data? Be sure to ground your reasoning in the context of the analyses completed above. Include any visualization(s) you create. (15 pt)

Including appropriate visualizations is critical because numerical summaries alone can be misleading and hide important patterns in the data. In the analyses above, all three datasets had nearly identical means, medians, standard deviations, correlations, and even regression results. Based on the numbers alone, it would be easy to assume that all three datasets behave similarly and that a linear regression model is appropriate in every case.

However, once the data were visualized, it became clear that this was not true:

- One dataset showed a curved, non-linear pattern, making linear regression inappropriate.
- Another dataset contained influential points that strongly affected the regression line and correlation.
- A third dataset showed little to no meaningful linear relationship at all.

The visualizations revealed structure, outliers, non-linearity, and clustering that the summary statistics failed to capture. Without plotting the data, these issues would have gone unnoticed, leading to incorrect conclusions and misuse of statistical models.

This demonstrates that visualizations are not optional or decorative; they are a necessary step in data analysis. They help confirm whether model assumptions are reasonable, whether relationships are real or artificial, and whether results make practical sense. In short, visualizations protect analysts from drawing false conclusions based solely on “good-looking” numbers.
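
As one such visualization, a single faceted scatterplot that places all three datasets side by side makes the contrast immediate. A minimal sketch, assuming data1, data2, and data3 are data frames with columns x and y (the combined helper name is illustrative):

library(ggplot2)

# Stack the three datasets with a label, then facet: nearly identical summary
# statistics, visibly different shapes.
combined <- rbind(
  cbind(data1, dataset = "Data 1"),
  cbind(data2, dataset = "Data 2"),
  cbind(data3, dataset = "Data 3")
)

ggplot(combined, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ dataset) +
  labs(title = "Same summary statistics, very different patterns")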