library(tidyverse)
library(ggplot2)

Overview

The article You Can’t Trust What You Read About Nutrition explains some of the main reasons nutrition studies and the reporting around them are problematic. This is mainly due to studies relying on questionable survey data as well as p-hacking. In regards to the latter issue, spurious associations are often found when conducting statistical tests across an inordinate amount of variable combinations.

Link to article: https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/

Loading the data

# Load nutrition csv from Github repo (if you prefer)
#nutrition <- read.csv("raw_anonymized_data.csv")

# Load nutrition csv from URL
nutrition <- read.csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/nutrition-studies/raw_anonymized_data.csv"))

# Preview the data
head(nutrition)

# Rows x Columns
dim(nutrition)

Transforming the data

# Narrow data frame to conduct a more focused analysis
characteristics_colums <- c("cancer", "diabetes", "heart_disease", "ever_smoked", "currently_smoke")
df_characteristics <- nutrition[characteristics_colums]

# Convert binary yes and no characteristics to ones and zeros
df_characteristics <- ifelse(df_characteristics == "Yes", 1, 0)

# Select ID and PIZZAFREQ to join with characteristics
id_and_pizza_columns <- c("ID", "PIZZAFREQ")
df_id_pizza <- nutrition[id_and_pizza_columns]

# Rename PIZZAFREQ column to match formatting of other column names
names(df_id_pizza)[names(df_id_pizza) == "PIZZAFREQ"] <- "pizza_freq"

# Join all the variables of interest into one data frame
df_pizza_disease <- data.frame(df_id_pizza, df_characteristics)

Exploratory data analysis

# Review summary statistics of our one discrete variable pizza_freq
summary(df_pizza_disease$pizza_freq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   3.593   4.750   6.000
# Review means of binary characteristics
sapply(df_pizza_disease[c(3:7)], mean, na.rm=TRUE)
##          cancer        diabetes   heart_disease     ever_smoked currently_smoke 
##      0.51851852      0.27777778      0.37037037      0.29629630      0.09259259
# Create histogram of our one discrete variable pizza_freq
hist(df_pizza_disease$pizza_freq, col="steelblue", 
     xlab = "Frequency of Pizza", ylab = "Frequency of Participants", 
     main="Frequency of Pizza")

# Create bar plots for characteristics
barplot_columns <- function(df_columns) {
  df_columns <- as.data.frame(df_columns)
  for (col in colnames(df_columns)) {
    # Print bar plot for column
    print(ggplot(df_columns, aes(df_columns[ , col])) 
          + geom_bar(fill="steelblue") + xlab(as.character(col)))
  }
}

barplot_columns(df_characteristics)

Modeling

# Logistic regression with multiple variables
summary(glm(heart_disease ~ pizza_freq + currently_smoke, 
            df_pizza_disease, family = "binomial"))
## 
## Call:
## glm(formula = heart_disease ~ pizza_freq + currently_smoke, family = "binomial", 
##     data = df_pizza_disease)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3665  -0.9336  -0.8978   1.4211   1.4856  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -0.46087    0.89396  -0.516    0.606
## pizza_freq      -0.04794    0.23591  -0.203    0.839
## currently_smoke  1.03906    0.96117   1.081    0.280
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 71.188  on 53  degrees of freedom
## Residual deviance: 69.951  on 51  degrees of freedom
## AIC: 75.951
## 
## Number of Fisher Scoring iterations: 4
# Logistic regression with multiple variables and interaction
summary(glm(diabetes ~ pizza_freq * currently_smoke, 
            df_pizza_disease, family = "binomial"))
## 
## Call:
## glm(formula = diabetes ~ pizza_freq * currently_smoke, family = "binomial", 
##     data = df_pizza_disease)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9842  -0.8367  -0.7937   1.5150   1.7089  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                -0.62059    0.95285  -0.651    0.515
## pizza_freq                 -0.08296    0.25408  -0.327    0.744
## currently_smoke            -3.46751    5.38953  -0.643    0.520
## pizza_freq:currently_smoke  0.80598    1.36210   0.592    0.554
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63.811  on 53  degrees of freedom
## Residual deviance: 63.234  on 50  degrees of freedom
## AIC: 71.234
## 
## Number of Fisher Scoring iterations: 4

Conclusions

Since the purpose of this FiveThirtyEight analysis was to prove how easy it is to p-hack nutrition data, extending the work would not be advised. Instead, above is a subset of characteristics and an eating frequency variable of interest. To conduct a more sound analysis, more thought should be put into the subset of diseases and explanatory variables to test. Additionally, more research should be done to better understand the data values as FiveThirtyEight does not identify specifics around time frames, units and criteria of the data columns.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.