Introduction

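This R Markdown document contains my work for the Discussion 11 regression assignment in CUNY SPS DATA 605, Fall 2025. The goal of this regression is to use the midparent height to predict the height of the adult child. In this dataset, midparentHeight is not a plain average of the two parents: it equals (father + 1.08 × mother) / 2, which matches the values in the data preview (for example, (78.5 + 1.08 × 67) / 2 ≈ 75.4).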

Kaggle dataset link: [https://www.kaggle.com/datasets/jacopoferretti/parents-heights-vs-children-heights-galton-data?resource=download]

The regression assumptions I check in this project come from an AnalystPrep article [https://analystprep.com/study-notes/cfa-level-2/assumptions-of-the-simple-linear-regression-model/]

Data Exploration

The dataset I chose is a CSV file from Kaggle. The data was gathered in a study of the relationship between the heights of adult children and their parents' heights.

Load Libraries

library(tidyverse)
library(broom)
library(ggplot2)

Download and describe the height dataset

# Import the provided data
heights_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA605/refs/heads/main/GaltonFamilies.csv")

# Work on a copy so the raw import stays unchanged
heights_df <- heights_raw

# Count missing vals and see the provided data
colSums(is.na(heights_df))
##        rownames          family          father          mother midparentHeight 
##               0               0               0               0               0 
##        children        childNum          gender     childHeight 
##               0               0               0               0
head(heights_df)
## # A tibble: 6 × 9
##   rownames family father mother midparentHeight children childNum gender
##      <dbl> <chr>   <dbl>  <dbl>           <dbl>    <dbl>    <dbl> <chr> 
## 1        1 001      78.5   67              75.4        4        1 male  
## 2        2 001      78.5   67              75.4        4        2 female
## 3        3 001      78.5   67              75.4        4        3 female
## 4        4 001      78.5   67              75.4        4        4 female
## 5        5 002      75.5   66.5            73.7        4        1 male  
## 6        6 002      75.5   66.5            73.7        4        2 male  
## # ℹ 1 more variable: childHeight <dbl>
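
The midparentHeight values in the preview can be reproduced by hand. The sketch below assumes the usual convention for this dataset, in which the mother's height is multiplied by 1.08 before averaging; that factor is not stated in the CSV itself, so treat it as an assumption that the printed values happen to support.

```r
# Values copied from the first two families in the head() preview
father <- c(78.5, 75.5)
mother <- c(67, 66.5)

# Assumed convention: scale the mother's height by 1.08, then average
midparent_scaled <- (father + 1.08 * mother) / 2

# Plain average of the two parents, for comparison
midparent_plain <- (father + mother) / 2

round(midparent_scaled, 1)  # 75.4 73.7, matching the midparentHeight column
round(midparent_plain, 1)   # noticeably lower than the column values
```

Because the scaled version matches the column and the plain average does not, the predictor in this regression is better described as a sex-adjusted parental average.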

Assumptions

Assumption 1: Linearity

# Plot the midparentHeight and childHeight to test linearity
ggplot(heights_df, aes(x = midparentHeight, y = childHeight)) +
  geom_point(alpha = 0.6, color = "cadetblue") +
  geom_smooth(method = "lm", se = FALSE, color = "brown3") +
  labs(title = "Linearity Check: Midparent vs. Child Height",
       x = "Midparent Height",
       y = "Child Height")

The scatterplot shows a roughly linear trend, suggesting a linear relationship between midparent and child heights.
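
A visual check can be backed up with a numeric one. The sketch below is one possible approach, not part of the original analysis: it compares a straight-line fit against a quadratic fit with an F-test. The helper name `check_linearity` is hypothetical, and the data frame `sim_df` is simulated stand-in data rather than the Galton data, used only so the example runs on its own.

```r
# Hypothetical helper: does a squared term improve on the straight line?
check_linearity <- function(df) {
  fit_lin  <- lm(childHeight ~ midparentHeight, data = df)
  fit_quad <- lm(childHeight ~ poly(midparentHeight, 2), data = df)
  # p-value of the F-test comparing the two nested models
  anova(fit_lin, fit_quad)[["Pr(>F)"]][2]
}

# Simulated stand-in data (NOT the Galton data), generated from a linear model
set.seed(605)
sim_df <- data.frame(midparentHeight = rnorm(200, mean = 69, sd = 1.8))
sim_df$childHeight <- 22.6 + 0.64 * sim_df$midparentHeight + rnorm(200, sd = 3.4)

p_quadratic <- check_linearity(sim_df)
p_quadratic  # a large p-value gives no evidence of curvature
```

Calling `check_linearity(heights_df)` would apply the same comparison to the real data, since the column names match.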

Assumption 2: Homoskedasticity

# Fit the model
lm_model <- lm(childHeight ~ midparentHeight, data = heights_df)

# Plot residuals vs fitted values
plot(lm_model$fitted.values, lm_model$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Homoskedasticity Check: Residuals vs Fitted")
abline(h = 0, col = "salmon", lwd = 2)

The residuals appear to be randomly scattered around zero, which supports the assumption of constant variance (homoskedasticity).
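
The residual plot can be complemented by a formal test. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The helper name `bp_check` is hypothetical, and the example runs on simulated stand-in data rather than the Galton data.

```r
# Hypothetical helper: studentized Breusch-Pagan check for an lm fit
bp_check <- function(model) {
  e2  <- residuals(model)^2
  X   <- model.matrix(model)[, -1, drop = FALSE]  # predictors, intercept dropped
  aux <- lm(e2 ~ X)                               # auxiliary regression
  stat <- length(e2) * summary(aux)$r.squared     # n * R^2
  c(statistic = stat,
    p.value   = pchisq(stat, df = ncol(X), lower.tail = FALSE))
}

# Simulated stand-in data (NOT the Galton data) with constant error variance
set.seed(605)
sim_df <- data.frame(midparentHeight = rnorm(200, mean = 69, sd = 1.8))
sim_df$childHeight <- 22.6 + 0.64 * sim_df$midparentHeight + rnorm(200, sd = 3.4)

bp_result <- bp_check(lm(childHeight ~ midparentHeight, data = sim_df))
bp_result  # a large p-value is consistent with constant variance
```

Running `bp_check(lm_model)` would apply the same test to the model fitted above.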

Assumption 3: Independence

# Plot residuals in order of observation to check randomness
plot(lm_model$residuals, type = "o",
     main = "Independence Check: Residuals Over Observations",
     xlab = "Observation Order",
     ylab = "Residuals")
abline(h = 0, col = "salmon", lwd = 2)

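The residuals show no visible pattern across observation order, which is consistent with independent errors. One caveat: the rows are grouped by family, and siblings share the same midparentHeight, so heights within a family may still be correlated in ways this plot does not reveal.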
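
A numeric companion to the observation-order plot is the Durbin-Watson statistic, which measures correlation between consecutive residuals. The sketch below computes it by hand; the helper name `durbin_watson` is hypothetical, and the example runs on simulated stand-in data rather than the Galton data. Values near 2 suggest no first-order correlation, while values well below 2 suggest that neighboring rows have similar residuals.

```r
# Hypothetical helper: Durbin-Watson statistic from an lm fit
durbin_watson <- function(model) {
  e <- residuals(model)
  # Sum of squared successive differences over the sum of squared residuals
  sum(diff(e)^2) / sum(e^2)
}

# Simulated stand-in data (NOT the Galton data) with independent errors
set.seed(605)
sim_df <- data.frame(midparentHeight = rnorm(200, mean = 69, sd = 1.8))
sim_df$childHeight <- 22.6 + 0.64 * sim_df$midparentHeight + rnorm(200, sd = 3.4)

dw <- durbin_watson(lm(childHeight ~ midparentHeight, data = sim_df))
dw  # expected to land near 2 for independent errors
```

Because the real rows are ordered by family, `durbin_watson(lm_model)` would mostly reflect how similar siblings' residuals are to each other.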

Assumption 4: Normality

# Q-Q plot for residuals
qqnorm(lm_model$residuals)
qqline(lm_model$residuals, col = "salmon", lwd = 2)

# Histogram for residuals
hist(lm_model$residuals,
     breaks = 30,
     main = "Normality Check: Residual Distribution",
     xlab = "Residuals",
     col = "cadetblue",
     border = "white")

The Q–Q plot shows the residuals deviating slightly from the reference line, which suggests mild left-skewness. The residuals are not perfectly normal, but the deviation is not severe enough to invalidate this regression.
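
The direction and size of any skew can also be checked numerically. The sketch below computes an approximate sample skewness and a Shapiro-Wilk p-value for a model's residuals. The helper name `residual_shape` is hypothetical, and the example runs on simulated stand-in data rather than the Galton data. A negative skewness indicates a longer left tail and a positive one a longer right tail.

```r
# Hypothetical helper: skewness and Shapiro-Wilk p-value for lm residuals
residual_shape <- function(model) {
  e <- residuals(model)
  z <- (e - mean(e)) / sd(e)        # standardized residuals
  list(skewness  = mean(z^3),       # approximate sample skewness
       shapiro_p = shapiro.test(e)$p.value)
}

# Simulated stand-in data (NOT the Galton data) with normal errors
set.seed(605)
sim_df <- data.frame(midparentHeight = rnorm(200, mean = 69, sd = 1.8))
sim_df$childHeight <- 22.6 + 0.64 * sim_df$midparentHeight + rnorm(200, sd = 3.4)

shape <- residual_shape(lm(childHeight ~ midparentHeight, data = sim_df))
shape$skewness   # near 0 for symmetric residuals
shape$shapiro_p  # a small p-value would indicate non-normal residuals
```

Applying `residual_shape(lm_model)` to the fitted model would show whether the sign of the skewness agrees with the reading of the Q–Q plot.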

Linear Regression

# Refit the same model used in the assumption checks
lm_model <- lm(childHeight ~ midparentHeight, data = heights_df)

# Summarize the model
summary(lm_model)
## 
## Call:
## lm(formula = childHeight ~ midparentHeight, data = heights_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9570 -2.6989 -0.2155  2.7961 11.6848 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     22.63624    4.26511   5.307 1.39e-07 ***
## midparentHeight  0.63736    0.06161  10.345  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.392 on 932 degrees of freedom
## Multiple R-squared:  0.103,  Adjusted R-squared:  0.102 
## F-statistic:   107 on 1 and 932 DF,  p-value: < 2.2e-16
# Visualize regression line
ggplot(heights_df, aes(x = midparentHeight, y = childHeight)) +
  geom_point(alpha = 0.6, color = "cadetblue") +
  geom_smooth(method = "lm", se = TRUE, color = "brown3") +
  labs(title = "Simple Linear Regression: Child Height ~ Midparent Height",
       x = "Midparent Height",
       y = "Child Height")

# Tidy results
tidy(lm_model)
## # A tibble: 2 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)       22.6      4.27        5.31 1.39e- 7
## 2 midparentHeight    0.637    0.0616     10.3  8.05e-24
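
The reported statistics can be cross-checked against each other. In a regression with a single predictor, the F-statistic equals the square of the slope's t value, and the square root of R-squared gives the size of the correlation between predictor and response. The sketch below uses only numbers copied from the output above.

```r
# Values copied from the summary(lm_model) output
t_slope   <- 10.345
r_squared <- 0.103

f_from_t <- t_slope^2        # about 107, matching the reported F-statistic
r_abs    <- sqrt(r_squared)  # about 0.32, the implied correlation

f_from_t
r_abs
```

A correlation of roughly 0.32 is positive but modest, which is consistent with the wide scatter around the fitted line.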

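Linear regression is a reasonable choice for this data. The slope of 0.637 is highly significant (p < 2e-16), so each additional unit of midparent height is associated with about 0.64 additional units of child height. That matches the intuition that taller parents tend to have taller children. The fit is loose, though: the model explains only about 10% of the variance in child height (R-squared = 0.103), so midparent height on its own is a weak predictor of any individual child's height.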
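
To make the fitted equation concrete, the sketch below plugs one value into it using the coefficients printed in the model summary. The midparent height of 70 is a hypothetical input chosen for illustration, and the range at the end is only a rough guide, because it uses the residual standard error alone and ignores uncertainty in the coefficients.

```r
# Coefficients and residual standard error copied from summary(lm_model)
intercept <- 22.63624
slope     <- 0.63736
resid_se  <- 3.392

# Hypothetical input: a midparent height of 70
midparent_new <- 70

predicted_child <- intercept + slope * midparent_new
predicted_child  # 67.25144

# Rough 95% range: prediction plus or minus two residual standard errors
rough_range <- predicted_child + c(-2, 2) * resid_se
rough_range
```

With the fitted model in memory, `predict(lm_model, newdata = data.frame(midparentHeight = 70), interval = "prediction")` would give the same point estimate with a properly computed interval.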