Part I

Please put the answers for Part I next to the question numbers (enter only the letter options; 4 points each):

  1. B
  2. A
  3. D
  4. B
  5. B
  6. E
  7. D
  8. E
  9. B
 10. C

Part II

Consider the three datasets, each with two columns (x and y), provided below. Be sure to replace each NA with your answer for the corresponding part (e.g., assign the mean of x for data1 to the data1.x.mean variable). When you Knit your answer document, a table will be generated with all the answers.

For each column, calculate (to four decimal places):

a. The mean (for x and y separately; 5 pt).

data1.x.mean <- round(mean(data1$x), 4)
data1.y.mean <- round(mean(data1$y), 4)

data2.x.mean <- round(mean(data2$x), 4)
data2.y.mean <- round(mean(data2$y), 4)

data3.x.mean <- round(mean(data3$x), 4)
data3.y.mean <- round(mean(data3$y), 4)

b. The median (for x and y separately; 5 pt).

data1.x.median <- round(median(data1$x), 4)
data1.y.median <- round(median(data1$y), 4)

data2.x.median <- round(median(data2$x), 4)
data2.y.median <- round(median(data2$y), 4)

data3.x.median <- round(median(data3$x), 4)
data3.y.median <- round(median(data3$y), 4)

c. The standard deviation (for x and y separately; 5 pt).

data1.x.sd <- round(sd(data1$x), 4)
data1.y.sd <- round(sd(data1$y), 4)

data2.x.sd <- round(sd(data2$x), 4)
data2.y.sd <- round(sd(data2$y), 4)

data3.x.sd <- round(sd(data3$x), 4)
data3.y.sd <- round(sd(data3$y), 4)

For each x and y pair, calculate (also to four decimal places):

d. The correlation (5 pt).

data1.correlation <- round(cor(data1$x, data1$y), 4)
data2.correlation <- round(cor(data2$x, data2$y), 4)
data3.correlation <- round(cor(data3$x, data3$y), 4)

e. The linear regression equation (5 pt).

# Fit a simple linear regression of y on x for each dataset.
m1 <- lm(y ~ x, data = data1)
m2 <- lm(y ~ x, data = data2)
m3 <- lm(y ~ x, data = data3)

data1.slope <- round(coef(m1)["x"], 4)
data2.slope <- round(coef(m2)["x"], 4)
data3.slope <- round(coef(m3)["x"], 4)

data1.intercept <- round(coef(m1)["(Intercept)"], 4)
data2.intercept <- round(coef(m2)["(Intercept)"], 4)
data3.intercept <- round(coef(m3)["(Intercept)"], 4)
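
With these coefficients, each fitted line has the form ŷ = intercept + slope · x; for Data set 1, for example, ŷ = 53.4530 − 0.1036 x (the values shown in the summary table below).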

f. R-squared (5 pt).

data1.rsquared <- round(summary(m1)$r.squared, 4)
data2.rsquared <- round(summary(m2)$r.squared, 4)
data3.rsquared <- round(summary(m3)$r.squared, 4)
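
When the document is knit, these variables are collected into the summary table below. As a minimal sketch of how the per-pair rows might be assembled (assuming the knitr package is available; the pair.stats name is illustrative, not part of the assignment template):

library(knitr)

# Gather the per-pair answers computed above into one data frame and render
# it as a table at knit time.
pair.stats <- data.frame(
  Statistic = c("r", "Intercept", "Slope", "R-Squared"),
  `Data 1`  = c(data1.correlation, data1.intercept, data1.slope, data1.rsquared),
  `Data 2`  = c(data2.correlation, data2.intercept, data2.slope, data2.rsquared),
  `Data 3`  = c(data3.correlation, data3.intercept, data3.slope, data3.rsquared),
  check.names = FALSE
)
kable(pair.stats)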

Summary Table

            Data 1                Data 2                Data 3
            x         y           x         y           x         y
Mean        54.2633   47.8323     54.2678   47.8359     54.2661   47.8347
Median      53.3333   46.0256     53.1352   46.4013     53.3403   47.5353
SD          16.7651   26.9354     16.7668   26.9361     16.7698   26.9397
r           -0.0645               -0.0690               -0.0641
Intercept   53.4530               53.8497               53.4251
Slope       -0.1036               -0.1108               -0.1030
R-Squared   0.0042                0.0048                0.0041

g. For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific about why for each pair, and include appropriate plots! (15 pt)

Data set 1

No.

Even though Data set 1 has a mean, standard deviation, correlation, slope, and R-squared that all look reasonable, a linear regression model is not appropriate once we actually look at the data.

When you plot x vs y, the relationship is clearly nonlinear. The points follow a curved pattern, not a straight-line trend. Linear regression assumes that the relationship between x and y is approximately linear, and that assumption is violated here.

So even though the summary statistics suggest a weak linear relationship, the scatterplot tells a different story. The model would be misleading because it forces a straight line onto data that clearly bends.

library(ggplot2)  # assumed not already loaded in a setup chunk

ggplot(data1, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Data Set 1: x vs y",
    subtitle = "Curved relationship makes linear regression inappropriate"
  )
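
One way to make the bend visible is to overlay a flexible smoother on the same scatterplot; a minimal sketch (the loess smoother and its default span are illustrative choices, not part of the original answer):

# Contrast the straight-line fit with a loess smoother: where the two curves
# diverge, the linearity assumption is doing real damage.
ggplot(data1, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  labs(title = "Data Set 1: linear fit (blue) vs. loess smoother (red)")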

Data set 2

No.

Even though the summary statistics for Data set 2 (mean, standard deviation, correlation, slope, and R²) are very similar to Data sets 1 and 3, the scatterplot shows that the relationship between x and y is clearly non-linear. The data form a curved and uneven pattern rather than clustering around a straight line.

One of the key assumptions of linear regression is a linear relationship between the predictor and response variables, and that condition is violated here. The fitted regression line does not meaningfully represent the pattern in the data, making linear regression misleading for this dataset (the residuals-vs-fitted sketch below makes the violation visible).

ggplot(data2, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
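
The violated linearity assumption also shows up in the residuals. As a minimal sketch (assuming m2 from part e is still in scope; the d2.diag helper name is illustrative), a residuals-vs-fitted plot should look like a patternless horizontal band when a linear model is appropriate:

# Residuals vs. fitted values for m2: any clear pattern (curvature, funneling)
# indicates the straight-line model is mis-specified for data2.
d2.diag <- data.frame(fitted = fitted(m2), resid = resid(m2))

ggplot(d2.diag, aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Data Set 2: residuals vs. fitted values",
    x = "Fitted values",
    y = "Residuals"
  )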

Data set 3

No.

Even though Data set 3 has summary statistics that look reasonable (mean, standard deviation, correlation, and even a regression line), the scatterplot tells a very different story.

When I plot x versus y for Data set 3, the points form a clear curved, non-linear pattern, not a straight-line relationship. The data rise and fall in a way that a straight line simply cannot capture well.

Because linear regression assumes a linear relationship, with residuals scattered randomly around the fitted line, those assumptions are violated here. A straight line would systematically miss large portions of the data, leading to misleading predictions and conclusions; the residual sketch after the plot below illustrates this.

This dataset is a good example of why you cannot rely only on statistics like correlation or R²: you must also look at the plot.

ggplot(data3, aes(x = x, y = y)) +
  geom_point() +
  labs(
    title = "Data Set 3: Nonlinear Relationship",
    x = "x",
    y = "y"
  )
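
Since the answer above leans on the residual assumption, a small sketch can make it concrete. Assuming m3 from part e is still in scope (the data3.aug helper name is illustrative), this overlays the fitted line and draws each residual as a vertical segment:

# Overlay the fitted line from m3 and draw each residual as a vertical
# segment, showing how the straight line systematically misses whole regions.
data3.aug <- data3
data3.aug$fitted <- fitted(m3)

ggplot(data3.aug, aes(x = x, y = y)) +
  geom_segment(aes(xend = x, yend = fitted), color = "grey60") +
  geom_point() +
  geom_line(aes(y = fitted), color = "blue") +
  labs(title = "Data Set 3: residuals around the fitted line")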

h. Why is it important to include appropriate visualizations when analyzing data? Be sure to ground your reasoning in the context of the analyses completed above. Include any visualization(s) you create. (15 pt)

Including appropriate visualizations is critical because numerical summaries alone can be misleading and hide important patterns in the data. In the analyses above, all three datasets had nearly identical means, medians, standard deviations, correlations, and even regression results. Based on the numbers alone, it would be easy to assume that all three datasets behave similarly and that a linear regression model is appropriate in every case.

However, once the data were visualized, it became clear that this was not true:

- One dataset showed a curved, non-linear pattern, making linear regression inappropriate.
- Another dataset contained influential points that strongly affected the regression line and correlation.
- A third dataset showed little to no meaningful linear relationship at all.

The visualizations revealed structure, outliers, non-linearity, and clustering that the summary statistics failed to capture. Without plotting the data, these issues would have gone unnoticed, leading to incorrect conclusions and misuse of statistical models.

This demonstrates that visualizations are not optional or decorative; they are a necessary step in data analysis. They help confirm whether model assumptions are reasonable, whether relationships are real or artificial, and whether results make practical sense. In short, visualizations protect analysts from drawing false conclusions based solely on “good-looking” numbers.
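
As one such visualization, a single faceted scatterplot that places all three datasets side by side makes the contrast immediate. A minimal sketch, assuming data1, data2, and data3 are data frames with columns x and y (the combined helper name is illustrative):

library(ggplot2)

# Stack the three datasets with a label, then facet: nearly identical summary
# statistics, visibly different shapes.
combined <- rbind(
  cbind(data1, dataset = "Data 1"),
  cbind(data2, dataset = "Data 2"),
  cbind(data3, dataset = "Data 3")
)

ggplot(combined, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ dataset) +
  labs(title = "Same summary statistics, very different patterns")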