I want to predict my girlfriend Natalia’s weight using her height as a predictor variable. I will do this by using the in-built ‘women’ dataset in R to create a simple linear model to predict weight from height.

The dataset I will be using contains the height (inches) and weight (lbs) of 15 women.

Let’s start by loading all neccesary packages for this project.

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.8     ✔ rsample      1.3.0
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.8     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.1     ✔ yardstick    1.3.2
## ✔ recipes      1.3.0     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(ggthemes)

Now let’s view the head of the ‘women’ dataset.

head(women)
##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129

Let’s split the our data into a subset to train our model and a subset to test our model.

set.seed(1234)
index <- initial_split(prop = 0.8, data = women)
train_data <- training(index)
test_data <- testing(index)

Next, let’s create the actual regression model and look at some summary stats.

model1 <- lm(weight ~ height, data = train_data)
summary(model1)
## 
## Call:
## lm(formula = weight ~ height, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5919 -1.0645 -0.3145  0.8498  2.6342 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -94.5784     6.8024   -13.9 7.23e-08 ***
## height        3.5548     0.1028    34.6 9.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.42 on 10 degrees of freedom
## Multiple R-squared:  0.9917, Adjusted R-squared:  0.9909 
## F-statistic:  1197 on 1 and 10 DF,  p-value: 9.644e-12
rmse <- sqrt(mean(model1$residuals^2))
rmse
## [1] 1.296028

We see that our r-squared value is 0.99, which is great. And our p-value is close to zero. Also good. Our Root Mean Squared Error is 1.29 which means our predictions are only off by about a 1 and a half pounds. That’s pretty accurate.

Let’s plot the model to evaluate it visually.

model1plot <- ggplot(model1, aes(x = height, y = weight)) + 
  geom_point() + 
  stat_smooth(method = 'lm', color = 'orange') + 
  labs(title = "Regression Line for the Height and Weight of the Women Dataset", 
       x = "Height in Inches", 
       y = "Weight in Pounds") + 
  theme_economist_white()

model1plot
## `geom_smooth()` using formula = 'y ~ x'

We can see that as height increases, so too does weight. The data points are all very close to the regression line which a great sign. This means our residuals are small. Our model should be effective at predicting the values in our testing data.

Let’s predict the values of our testing data and see how they compare to the actual data. Then we will compare our model’s predictions to the actual data points in our testing data subset.

predictions <- predict(model1, newdata = test_data)
print(predictions)
##        1        2        3 
## 111.5989 118.7084 132.9275
data.frame(predictions, test_data$weight) 
##   predictions test_data.weight
## 1    111.5989              115
## 2    118.7084              120
## 3    132.9275              132

We can see that the model is pretty good at predicting weight from height in our testing data.

Now let’s try using our model to predict Natalia’s weight from her height (63 inches). First, I input Natalia’s height in inches into a variable called new_data.

new_data <- data.frame(height = c(63))

Let’s view our model’s prediction of Natalia’s weight.

pred <- predict(model1, newdata = new_data, interval = 'confidence')
  pred
##        fit      lwr     upr
## 1 129.3728 128.2186 130.527

Our model predicts that Natalia should weight about 129 lbs! This is about 25 lbs greater than her actual weight. Why did this happen?

I hypothesize this is because Natalia’s BMI is less than the average BMI of the dataset.

Let’s find out if my hypothesis is true. We can do this with a one-sample t-test.

Let’s first define our hypotheses:

H1: There is a significant difference between Natalia’s BMI and the average BMI of the dataset.

Null hypothesis: There is no difference between Natalia’s BMI and the average BMI of the dataset

I’ll start by creating a new row in the data to calculate BMI.

BMI <- women %>%
    mutate( bmi = (weight/height^2) * 703) 
head(BMI)
##   height weight      bmi
## 1     58    115 24.03240
## 2     59    117 23.62856
## 3     60    120 23.43333
## 4     61    123 23.23811
## 5     62    126 23.04318
## 6     63    129 22.84883

Let’s find the average BMI of the women in the dataset and save the result to a variable ‘average_bmi’.

average_bmi <- mean(BMI$bmi)
average_bmi
## [1] 22.72443

We find that the average BMI for women in our data is 22.72.

Now let’s calculate Natalia’s BMI.

natalia_bmi <- (105/63^2)*703
natalia_bmi
## [1] 18.59788

We find that Natalia’s BMI is 18.6.

Natalia’s BMI is less than the average in the women dataset.

Let’s see if this difference is statistically significant using a one sample t-test.

In this t-test I will compare the average BMI of the ‘women’ dataset to Natalia’s BMI. I will save the result to a variable ‘result’ and print it.

result <- t.test(BMI$bmi,  mu = natalia_bmi)
print(result)
## 
##  One Sample t-test
## 
## data:  BMI$bmi
## t = 25.849, df = 14, p-value = 3.241e-13
## alternative hypothesis: true mean is not equal to 18.59788
## 95 percent confidence interval:
##  22.38203 23.06682
## sample estimates:
## mean of x 
##  22.72443

Our t-test returned t = 25, and a p-value of 3.241 x 10^-13 (very significant).

This shows that the average BMI’s of the women dataset is significantly larger than Natalia’s BMI. This would explain why the regression model we created significantly overestimates Natalia’s weight.

The model works well on our training and testing data, but it doesn’t generalize well to women whose body mass index is outside of the range of the training data.

This exhibits the limitations of using such a small dataset to create a linear regression model.