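This R Markdown document is for the Discussion 11 regression assignment in CUNY SPS DATA 605, Fall 2025. The goal of this regression is to use the parents' combined height to predict the height of their adult child. The predictor is the dataset's midparentHeight column, which, judging from the rows shown below, appears to be (father + 1.08 × mother) / 2 rather than a plain average.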
Kaggle dataset link: [https://www.kaggle.com/datasets/jacopoferretti/parents-heights-vs-children-heights-galton-data?resource=download]
The regression assumptions I check in this project come from an AnalystPrep article [https://analystprep.com/study-notes/cfa-level-2/assumptions-of-the-simple-linear-regression-model/]
The dataset I chose is a CSV from Kaggle. The data was gathered in a study of the relationship between the heights of adult children and their parents' heights.
library(tidyverse)
library(broom)
library(ggplot2)
# Import the provided data
heights_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA605/refs/heads/main/GaltonFamilies.csv")
# Save the data frame
heights_df <- heights_raw
# Count missing vals and see the provided data
colSums(is.na(heights_df))
##        rownames          family          father          mother midparentHeight
##               0               0               0               0               0
##        children        childNum          gender     childHeight
##               0               0               0               0
head(heights_df)
## # A tibble: 6 × 9
##   rownames family father mother midparentHeight children childNum gender
##      <dbl> <chr>   <dbl>  <dbl>           <dbl>    <dbl>    <dbl> <chr>
## 1        1 001      78.5   67              75.4        4        1 male
## 2        2 001      78.5   67              75.4        4        2 female
## 3        3 001      78.5   67              75.4        4        3 female
## 4        4 001      78.5   67              75.4        4        4 female
## 5        5 002      75.5   66.5            73.7        4        1 male
## 6        6 002      75.5   66.5            73.7        4        2 male
## # ℹ 1 more variable: childHeight <dbl>
# Plot the midparentHeight and childHeight to test linearity
ggplot(heights_df, aes(x = midparentHeight, y = childHeight)) +
  geom_point(alpha = 0.6, color = "cadetblue") +
  geom_smooth(method = "lm", se = FALSE, color = "brown3") +
  labs(title = "Linearity Check: Midparent vs. Child Height",
       x = "Midparent Height",
       y = "Child Height")
The scatterplot shows a roughly linear trend, suggesting a linear relationship between midparent and child heights.
# Fit the model
lm_model <- lm(childHeight ~ midparentHeight, data = heights_df)
# Plot residuals vs fitted values
plot(lm_model$fitted.values, lm_model$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Homoskedasticity Check: Residuals vs Fitted")
abline(h = 0, col = "salmon", lwd = 2)
The residuals appear to be randomly scattered around zero, which supports the assumption of constant variance (homoskedasticity).
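The visual check above could be backed by a formal test. The following is a minimal sketch of a studentized Breusch-Pagan statistic computed with base R only. It runs on simulated stand-in data, not the Galton data, and the helper name `bp_check` and the simulation parameters are assumptions made for illustration.

```r
# Simulated stand-in data (NOT the Galton data); parameters are illustrative
set.seed(605)
midparent <- rnorm(200, mean = 69, sd = 1.8)
child <- 22.6 + 0.64 * midparent + rnorm(200, sd = 3.4)
sim_model <- lm(child ~ midparent)

# Hypothetical helper: studentized Breusch-Pagan statistic, i.e. n * R^2 from
# the auxiliary regression of squared residuals on the model's predictors
bp_check <- function(model) {
  e2 <- residuals(model)^2
  X <- model.matrix(model)              # design matrix, includes the intercept
  aux <- lm.fit(X, e2)                  # auxiliary regression
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df <- ncol(X) - 1                     # number of predictors, excluding intercept
  list(statistic = stat,
       df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

bp <- bp_check(sim_model)
bp
```

A large p-value here would be consistent with constant variance. Passing `lm_model` instead of `sim_model` would apply the same check to the fitted height model.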
# Plot residuals in order of observation to check randomness
plot(lm_model$residuals, type = "o",
     main = "Independence Check: Residuals Over Observations",
     xlab = "Observation Order",
     ylab = "Residuals")
abline(h = 0, col = "salmon", lwd = 2)
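The residuals don’t show any visible pattern across observation order, which is consistent with independence. One caveat: the rows are grouped by family and siblings share the same midparent height, so this plot alone cannot rule out correlation within families.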
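As a numeric companion to the residual-order plot, the Durbin-Watson statistic can be computed by hand. This is a minimal sketch on simulated stand-in data, not the Galton data. The statistic only reflects first-order correlation in row order, so it is a rough screen rather than a full test of independence.

```r
# Simulated stand-in data (NOT the Galton data); parameters are illustrative
set.seed(605)
midparent <- rnorm(200, mean = 69, sd = 1.8)
child <- 22.6 + 0.64 * midparent + rnorm(200, sd = 3.4)
sim_model <- lm(child ~ midparent)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares
e <- residuals(sim_model)
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

The statistic lies between 0 and 4, and values near 2 suggest little first-order serial correlation. Replacing `sim_model` with `lm_model` would compute it for the height regression in its current row order.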
# Q-Q plot for residuals
qqnorm(lm_model$residuals)
qqline(lm_model$residuals, col = "salmon", lwd = 2)
# Histogram for residuals
hist(lm_model$residuals,
     breaks = 30,
     main = "Normality Check: Residual Distribution",
     xlab = "Residuals",
     col = "cadetblue",
     border = "white")
The Q–Q plot shows that the residuals deviate slightly from the line, suggesting mild left-skewness. While the residuals are not perfectly normal, this small deviation is not severe enough to invalidate this regression.
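The visual impression from the Q–Q plot and histogram could be supplemented with numbers. Below is a minimal sketch using `shapiro.test()` from base R plus a hand-computed sample skewness. It uses simulated stand-in data, not the Galton data, so the values it prints say nothing about the real residuals.

```r
# Simulated stand-in data (NOT the Galton data); parameters are illustrative
set.seed(605)
midparent <- rnorm(200, mean = 69, sd = 1.8)
child <- 22.6 + 0.64 * midparent + rnorm(200, sd = 3.4)
sim_model <- lm(child ~ midparent)

e <- residuals(sim_model)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(e)

# Sample skewness: negative values indicate a longer left tail,
# positive values a longer right tail
skew <- mean((e - mean(e))^3) / mean((e - mean(e))^2)^1.5

sw$p.value
skew
```

With a sample as large as the one in this report, Shapiro-Wilk can flag small departures that matter little in practice, so the skewness value and the plots are best read together.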
# Refit the model (identical to the fit above) so this section runs on its own
lm_model <- lm(childHeight ~ midparentHeight, data = heights_df)
# Summarize the model
summary(lm_model)
##
## Call:
## lm(formula = childHeight ~ midparentHeight, data = heights_df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9570 -2.6989 -0.2155  2.7961 11.6848
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     22.63624    4.26511   5.307 1.39e-07 ***
## midparentHeight  0.63736    0.06161  10.345  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.392 on 932 degrees of freedom
## Multiple R-squared: 0.103, Adjusted R-squared: 0.102
## F-statistic: 107 on 1 and 932 DF, p-value: < 2.2e-16
# Visualize regression line
ggplot(heights_df, aes(x = midparentHeight, y = childHeight)) +
  geom_point(alpha = 0.6, color = "cadetblue") +
  geom_smooth(method = "lm", se = TRUE, color = "brown3") +
  labs(title = "Simple Linear Regression: Child Height ~ Midparent Height",
       x = "Midparent Height",
       y = "Child Height")
# Tidy results
tidy(lm_model)
## # A tibble: 2 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)       22.6      4.27        5.31 1.39e- 7
## 2 midparentHeight    0.637    0.0616     10.3  8.05e-24
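I would say linear regression is a reasonable choice for this data. The slope of 0.637 is positive and highly significant (p < 2e-16), which matches the intuition that as a person’s midparent height increases, their own height tends to increase. However, the R-squared of 0.103 means midparent height alone explains only about 10% of the variation in child height, and the residual standard error of 3.392 shows that predictions for any individual child remain imprecise.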
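To make the fitted coefficients concrete, here is a small sketch that turns the estimates from the summary output into point predictions. The function name `predict_child` and the example midparent heights are assumptions chosen for illustration. Heights are taken to be in inches, the unit the Galton data appears to use.

```r
# Coefficients copied from the summary(lm_model) output in this report
intercept <- 22.63624
slope <- 0.63736

# Hypothetical helper: point prediction of child height from midparent height
predict_child <- function(midparent) {
  intercept + slope * midparent
}

# Illustrative midparent heights (not specific rows from the dataset)
example_midparent <- c(66, 70, 74)
predicted <- predict_child(example_midparent)
predicted

# Change in predicted child height per 4-inch change in midparent height
diff(predicted)
```

A midparent height of 70 gives a predicted child height of about 67.25. Because the slope is below 1, each 4-inch step in midparent height moves the prediction by only about 2.55 inches, the regression-toward-the-mean pattern this dataset is known for. With the model object available, `predict(lm_model, newdata = data.frame(midparentHeight = 70))` should give the same point estimate up to rounding.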