US Presidents BMI and US Adult Male Population

Applied Analytics Assignment 2

Gracy Whelihan S3889669

Last updated: 20 May, 2021

Problem Statement

US Presidents are among the most closely watched people in the world, and are often seen as a representation of the US population. Research conducted by the United States Centers for Disease Control and Prevention (CDC) has found that the average Body Mass Index (BMI) for adult males in the US is 26.8, which is considered overweight. As the US President is one of the most influential leaders in the world, data on each president's health has been compiled. The question answered in this investigation is whether the US Presidents' BMIs provide evidence that the presidents are an accurate representation of the health of adult males in the US. There has never been a female president, so presidents are compared only to males in this investigation.

To solve this problem, the population mean BMI of 26.8 will be compared with the data on each president's BMI using a hypothesis test.

Data

The data used in this assignment is a data set about US presidents, excluding the current President, Joe Biden. The data set is Historical US President Physical Data (https://www.kaggle.com/atmcfarland/historical-us-president-physical-data-more), which contains information on US presidents' height, weight, BMI, and more. The data set is open source and was found on Kaggle. In the data every US president is included. The data was cleaned, and then only certain attributes were selected, as some were redundant. The cleaned data set was named usPresidents, and will be used for the remainder of the investigation.

Variables

Here are the variables for the usPresidents data set:

Cleaning Data

Attributes that were not in the correct data type were converted to the correct data types. NA values were inspected and dealt with according to the reason for each one.

The corrected_iq variable was calculated before Trump and Biden were in office, so their values are missing. Replacing these NA values with the mean keeps the corrected_iq attribute consistent with the presidents who have recorded IQs. The mean is rounded because IQs are reported as whole numbers.
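The mean imputation described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the code used in the assignment: the `toy` object and its values are made up, and only the column name corrected_iq comes from the data set.

```r
# Toy stand-in for usPresidents; the IQ values are illustrative only.
toy <- data.frame(
  president    = c("A", "B", "C", "D"),
  corrected_iq = c(130, 142, NA, NA)
)

# Mean of the recorded values, rounded to a whole number.
iq_mean <- round(mean(toy$corrected_iq, na.rm = TRUE))

# Replace only the missing entries; recorded values are left untouched.
toy$corrected_iq[is.na(toy$corrected_iq)] <- iq_mean

toy
```

Indexing with is.na() on the left-hand side changes only the missing entries, so the recorded IQs are not overwritten.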

For body_mass_index_range, the two NA values belong to the same president, who served two non-consecutive terms. The body_mass_index_range is derived from body_mass_index: a BMI between 30 and 40 means the range is “Obese”. For these NA values the body_mass_index is 36.4, so they should be replaced with the level labeled “Obese”. If there were more NA values like this, it would be best to define a rule and impute them using the deductive and validate packages, but because there are only two NA values and both need to be replaced with “Obese”, they were replaced using is.na.
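A minimal sketch of the is.na replacement, assuming body_mass_index_range is stored as a factor that already has an “Obese” level. The toy rows and the other level names are hypothetical; only the column names, the value 36.4, and the “Obese” label come from the text above.

```r
# Toy stand-in for usPresidents; level names other than "Obese" are assumed.
toy <- data.frame(
  body_mass_index = c(24.1, 36.4, 27.3, 36.4),
  body_mass_index_range = factor(
    c("Normal", NA, "Overweight", NA),
    levels = c("Underweight", "Normal", "Overweight", "Obese")
  )
)

# Both missing ranges have a BMI of 36.4 (between 30 and 40), so they become "Obese".
toy$body_mass_index_range[is.na(toy$body_mass_index_range)] <- "Obese"

table(toy$body_mass_index_range, useNA = "ifany")
```

The assignment only works cleanly because “Obese” is an existing factor level; assigning a label that is not a level would produce NA with a warning.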

For the death_date variable there are blank values, because these presidents have not died yet. These will be left blank, as blank cells are different from NA, and a blank indicates that the president is still alive.
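The difference between a blank cell and NA can be checked directly in R. The sketch below assumes death_date is held as a character vector; the date shown and its format are illustrative, not taken from the data set.

```r
# Illustrative values: a recorded date, a blank cell, and a true missing value.
death_date <- c("1799-12-14", "", NA)

# is.na() flags only the true missing value, not the blank string.
is.na(death_date)

# A blank (but not NA) entry is what marks a president as still alive.
still_alive <- !is.na(death_date) & death_date == ""
still_alive
```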

Descriptive Statistics and Visualisation

In this investigation, the important variable is body_mass_index. This variable will be summarized to show its key statistics. A box plot and abline were used to visualize the US president statistics. The sample mean is 26.4, as shown in the summary and in the visualization. There is one outlier in this data, but it was not removed. It is accurate that President William Taft was morbidly obese, and his BMI is accurately recorded in this data set, so it should be included. The red line (abline) shows the population mean BMI for adult males, which looks to be slightly higher than the observed mean of the presidents' BMIs. A hypothesis test will give more information on the statistical significance.

boxplot(usPresidents$body_mass_index, ylab = "BMI")
abline(h=26.8, col = "red")#population mean

Descriptive Statistics Cont.

library(dplyr) # provides %>%, summarise() and n()

# Five-number summary, mean, SD, sample size and missing count for BMI
summary_usPresidents <- usPresidents %>%
  summarise(Min = min(body_mass_index, na.rm = TRUE),
            Q1 = quantile(body_mass_index, probs = .25, na.rm = TRUE),
            Median = median(body_mass_index, na.rm = TRUE),
            Q3 = quantile(body_mass_index, probs = .75, na.rm = TRUE),
            Max = max(body_mass_index, na.rm = TRUE),
            Mean = mean(body_mass_index, na.rm = TRUE),
            SD = sd(body_mass_index, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(body_mass_index)))

knitr::kable(summary_usPresidents)
|  Min|     Q1| Median|     Q3|  Max|     Mean|       SD|  n| Missing|
|----:|------:|------:|------:|----:|--------:|--------:|--:|-------:|
| 18.6| 23.475|  25.25| 28.225| 46.6| 26.36522| 4.999387| 46|       0|

Hypothesis Testing

\[H_0: \mu = 26.8 \] \[H_A: \mu \ne 26.8 \]

A one-sample t-test will be used because there is a known population mean, a sample mean, and an unknown population standard deviation. The US President data has 46 observations (n > 30), so by the central limit theorem the sampling distribution of the mean can be assumed to be approximately normal. The two-tailed critical value at the 0.05 significance level, with n - 1 = 45 degrees of freedom, is:

qt(0.025, 45, lower.tail = FALSE)
## [1] 2.014103
t_test_output <- t.test(usPresidents$body_mass_index, mu = 26.8)
t_test_output
## 
##  One Sample t-test
## 
## data:  usPresidents$body_mass_index
## t = -0.58984, df = 45, p-value = 0.5582
## alternative hypothesis: true mean is not equal to 26.8
## 95 percent confidence interval:
##  24.88058 27.84985
## sample estimates:
## mean of x 
##  26.36522
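As a check on the output above, the test statistic, p-value, and confidence interval can be recomputed by hand from the summary statistics (mean 26.36522, SD 4.999387, n = 46). This is only a sketch; the variable names are chosen here for illustration, and small differences from the t.test output are expected because the inputs are rounded.

```r
# Summary statistics taken from the descriptive statistics table.
x_bar <- 26.36522  # sample mean BMI
s     <- 4.999387  # sample standard deviation
n     <- 46        # number of observations
mu0   <- 26.8      # hypothesised population mean

se     <- s / sqrt(n)          # standard error of the mean
t_stat <- (x_bar - mu0) / se   # one-sample t statistic
df     <- n - 1                # degrees of freedom

# Two-sided p-value and 95% confidence interval for the mean.
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- x_bar + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value, lower = ci[1], upper = ci[2]), 4)
```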

Hypothesis Testing Cont.

Discussion

The decision for this investigation is to fail to reject H0: mu = 26.8, as p = 0.558 > 0.05 and the 95% CI for the mean BMI of US Presidents, [24.88, 27.85], captures mu = 26.8. The sample mean was 26.4. Therefore the results of the one-sample t-test were not statistically significant. This means there is no evidence that the mean BMI of US Presidents differs from the mean BMI of US adult males, 26.8. Failing to reject H0 does not prove that the two means are equal, so the test alone cannot confirm that US Presidents' health is representative of the US adult male population in terms of BMI; it shows only that the data are consistent with that claim.

This investigation does have some drawbacks. The US Presidents data includes males who lived in the 1700s, 1800s, 1900s, and 2000s. Lifestyles were drastically different for people living over 100 years ago, and if all of the presidents were alive today (2021) their BMIs might be different and more representative. Also, US Presidents are typically adult males over 50, while the adult male population of the United States covers a much wider age range. In the future, when there are female presidents, it may be possible to test whether the Presidents are an accurate representation of the US adult population including females.

Another interesting question for this investigation is whether US presidents are over-represented by any astrological sign. A Chi-square Goodness of Fit Test would be the natural tool, but one major assumption, that the expected count for each cell must be at least 5, would be violated, because there are too few presidents to spread across 12 signs.
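The expected-count problem can be illustrated with a short sketch. It assumes the 46 observations from the summary table and, purely for illustration, that all 12 signs are equally likely under the null hypothesis; a real test might use birth-rate-based proportions instead.

```r
n_presidents <- 46  # observations in the usPresidents summary
n_signs      <- 12  # astrological signs

# Expected count per sign under an assumed equal-probability null hypothesis.
expected <- rep(n_presidents / n_signs, n_signs)

min(expected)        # smallest expected cell count
all(expected >= 5)   # does every cell meet the minimum of 5?
```

Under this assumption every cell falls below 5, which is why the goodness of fit test would not be appropriate for this sample.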

References