library(tidyverse)
library(ggplot2)
The article You Can’t Trust What You Read About Nutrition explains some of the main reasons nutrition studies and the reporting around them are problematic. This is mainly due to studies relying on questionable survey data as well as p-hacking. In regards to the latter issue, spurious associations are often found when conducting statistical tests across an inordinate amount of variable combinations.
Link to article: https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/
# Load nutrition csv from Github repo (if you prefer)
#nutrition <- read.csv("raw_anonymized_data.csv")
# Load nutrition csv from URL
nutrition <- read.csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/nutrition-studies/raw_anonymized_data.csv"))
# Preview the data
head(nutrition)
# Rows x Columns
dim(nutrition)
# Narrow data frame to conduct a more focused analysis
characteristics_colums <- c("cancer", "diabetes", "heart_disease", "ever_smoked", "currently_smoke")
df_characteristics <- nutrition[characteristics_colums]
# Convert binary yes and no characteristics to ones and zeros
df_characteristics <- ifelse(df_characteristics == "Yes", 1, 0)
# Select ID and PIZZAFREQ to join with characteristics
id_and_pizza_columns <- c("ID", "PIZZAFREQ")
df_id_pizza <- nutrition[id_and_pizza_columns]
# Rename PIZZAFREQ column to match formatting of other column names
names(df_id_pizza)[names(df_id_pizza) == "PIZZAFREQ"] <- "pizza_freq"
# Join all the variables of interest into one data frame
df_pizza_disease <- data.frame(df_id_pizza, df_characteristics)
# Review summary statistics of our one discrete variable pizza_freq
summary(df_pizza_disease$pizza_freq)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.593 4.750 6.000
# Review means of binary characteristics
sapply(df_pizza_disease[c(3:7)], mean, na.rm=TRUE)
## cancer diabetes heart_disease ever_smoked currently_smoke
## 0.51851852 0.27777778 0.37037037 0.29629630 0.09259259
# Create histogram of our one discrete variable pizza_freq
hist(df_pizza_disease$pizza_freq, col="steelblue",
xlab = "Frequency of Pizza", ylab = "Frequency of Participants",
main="Frequency of Pizza")
# Create bar plots for characteristics
barplot_columns <- function(df_columns) {
df_columns <- as.data.frame(df_columns)
for (col in colnames(df_columns)) {
# Print bar plot for column
print(ggplot(df_columns, aes(df_columns[ , col]))
+ geom_bar(fill="steelblue") + xlab(as.character(col)))
}
}
barplot_columns(df_characteristics)
# Logistic regression with multiple variables
summary(glm(heart_disease ~ pizza_freq + currently_smoke,
df_pizza_disease, family = "binomial"))
##
## Call:
## glm(formula = heart_disease ~ pizza_freq + currently_smoke, family = "binomial",
## data = df_pizza_disease)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3665 -0.9336 -0.8978 1.4211 1.4856
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.46087 0.89396 -0.516 0.606
## pizza_freq -0.04794 0.23591 -0.203 0.839
## currently_smoke 1.03906 0.96117 1.081 0.280
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 71.188 on 53 degrees of freedom
## Residual deviance: 69.951 on 51 degrees of freedom
## AIC: 75.951
##
## Number of Fisher Scoring iterations: 4
# Logistic regression with multiple variables and interaction
summary(glm(diabetes ~ pizza_freq * currently_smoke,
df_pizza_disease, family = "binomial"))
##
## Call:
## glm(formula = diabetes ~ pizza_freq * currently_smoke, family = "binomial",
## data = df_pizza_disease)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9842 -0.8367 -0.7937 1.5150 1.7089
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.62059 0.95285 -0.651 0.515
## pizza_freq -0.08296 0.25408 -0.327 0.744
## currently_smoke -3.46751 5.38953 -0.643 0.520
## pizza_freq:currently_smoke 0.80598 1.36210 0.592 0.554
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63.811 on 53 degrees of freedom
## Residual deviance: 63.234 on 50 degrees of freedom
## AIC: 71.234
##
## Number of Fisher Scoring iterations: 4
Since the purpose of this FiveThirtyEight analysis was to prove how easy it is to p-hack nutrition data, extending the work would not be advised. Instead, above is a subset of characteristics and an eating frequency variable of interest. To conduct a more sound analysis, more thought should be put into the subset of diseases and explanatory variables to test. Additionally, more research should be done to better understand the data values as FiveThirtyEight does not identify specifics around time frames, units and criteria of the data columns.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.