This notebook highlights the acquisition of data analysis skills learned from manipulating real data. This notebook demonstrates the writing of code, production of output, and interpretation of output.
The following is a series of tasks to involving the wrangling, visualization, description, analysis, and interpretation the wtloss.Rds dataframe.
Here is the scenario:
This is a mock scenario of a nutrition and exercise scientist team focused on promoting healthy eating, exercise, and maintenance of a healthy weight. They are keenly aware that an individual’s behaviors aren’t just driven by personal and intrinsic factors, but rather the context in which one works, lives, and plays drives behavior as well. One salient factor for healthy eating is the availability of healthy foods.
The United States Department of Agriculture’s Economic Research Service identified 6,500 food deserts in the United States based on 2000 Census and 2006 data on locations of supermarkets, supercenters, and large grocery stores. A food desert is an area where people have limited access to healthy and affordable food. A paper by Morland, Diez Roux, and Wing found that people who live in a food desert are more likely to have a poor diet and be obese. Of course, food deserts are not evenly distributed across the country. The communities classified as food deserts tend to be home to poorer families and families of color according to a report by the USDA. Thus, food deserts are a factor in producing and maintaining health disparities.
The team seeks to design a weight loss program for overweight individuals living in the North Denver neighborhoods that have been designated as food deserts. An effective weight loss program in these neighborhoods should have two components. First, a cognitive-behavioral therapy (CBT) component in which a dietitian and a therapist can work with the individual to teach them about healthy eating, provide them with the skills necessary to prepare healthy foods, and counsel them on adopting new healthy eating behaviors. Second, a mobile grocery store that visits their home once a week that is full of fresh and wholesome foods to purchase at an affordable price.
The team desires to determine if the combination of the two intervention components produces better results than no intervention at all, or just one of the interventions alone. Therefore, the team designs a randomized controlled trial (RCT) with four arms:
Recruitment consists of overweight adults (body mass index (BMI) > 25) interested in losing weight and who reside in one of the North Denver neighborhoods designated as food deserts. Recruitment efforts are highly successful, and 10,000 people respond to participate. Of those who agree to participate, 400 are randomly selected, then randomly assigned to one of the four arms of the study (100 people per arm). Prior to the intervention, each participant’s height and weight is measured and their BMI is calculated (to verify that they meet the study criteria).
The interventions are delivered over the course of three months. During the study, information about calorie intake and calorie expenditure is ascertained so that a total caloric deficit can be computed. After the three months of intervention, the team will again weigh each participant to calculate the total pounds lost or gained over the intervention period.
In the dataframe called wtloss.Rds, the following variables are recorded:
pid is the participant ID
condition is the condition the participant was randomly assigned to receive (control, CBT, mobile groceries, CBT + mobile groceries)
bmi is the participant’s body mass index at the start of the study
caldef is the sum of caloric deficit (in kilocalories) across all three months of the intervention. Caloric deficit is the difference between calories consumed and calories expended (positive number = more calories burned than consumed, negative number = more calories consumed than burned).
lbs_lost is the total pounds lost during the program (positive number = weight loss, negative number = weight gain)
fv5 is a binary indicator that is coded “yes” if the individual ate at least 5 fruits and vegetables per day on at least four days during the last week of the intervention, and is coded “no” if the individual did not meet this criteria.
library(skimr)
library(moderndive)
library(here)
library(magrittr)
library(tidyverse)
wtloss <- readRDS(here("data", "wtloss.Rds"))
Use the glimpse() function to take a look at the variables in the data frame.
wtloss %>% glimpse()
## Rows: 400
## Columns: 7
## $ pid <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ condition <fct> control, control, control, control, control, control, contro…
## $ caldef <dbl> -9965.8, 7218.2, -6172.1, 20293.9, 10654.2, 9872.2, 274.5, 6…
## $ lbs_lost <dbl> -2.14369251, 1.15335885, 2.36789011, 0.56314651, 10.08834935…
## $ bmi <dbl> 35.43, 41.92, 38.42, 38.24, 39.42, 36.88, 48.58, 36.45, 40.3…
## $ sex <fct> male, female, female, female, male, male, female, male, male…
## $ fv5 <fct> yes, yes, no, no, yes, no, yes, no, no, yes, no, no, no, no,…
Use the skim() function to calculate descriptive statistics for all variables for the whole group.
wtloss %>% skim()
| Name | Piped data |
| Number of rows | 400 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| condition | 0 | 1 | FALSE | 4 | con: 100, mob: 100, CBT: 100, CBT: 100 |
| sex | 0 | 1 | FALSE | 2 | fem: 203, mal: 197 |
| fv5 | 0 | 1 | FALSE | 2 | yes: 220, no: 180 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| pid | 0 | 1 | 200.50 | 115.61 | 1.00 | 100.75 | 200.50 | 300.25 | 400.00 | ▇▇▇▇▇ |
| caldef | 0 | 1 | 17418.56 | 25779.13 | -49516.10 | -157.22 | 12761.50 | 31364.38 | 101476.30 | ▁▇▆▃▁ |
| lbs_lost | 0 | 1 | 4.89 | 9.42 | -19.01 | -1.58 | 4.57 | 10.11 | 32.43 | ▁▆▇▃▁ |
| bmi | 0 | 1 | 38.93 | 3.85 | 26.06 | 36.18 | 39.14 | 41.60 | 48.58 | ▁▃▇▇▂ |
Subset the dataframe to select only condition, caldef, and lbs_lost. Next, use the group_by() function to group the dataframe by condition, then use the skim() function to calculate descriptive statistics of lbs_lost by condition.
wtloss_subset <- wtloss %>%
select(condition, caldef, lbs_lost)
wtloss_subset %>%
group_by(condition) %>%
skim(lbs_lost)
| Name | Piped data |
| Number of rows | 400 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | condition |
Variable type: numeric
| skim_variable | condition | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| lbs_lost | control | 0 | 1 | -0.32 | 6.16 | -14.77 | -4.69 | -1.15 | 4.26 | 14.90 | ▂▇▇▆▂ |
| lbs_lost | mobile groceries | 0 | 1 | 2.10 | 7.30 | -19.01 | -2.44 | 2.36 | 7.29 | 20.21 | ▁▃▇▇▂ |
| lbs_lost | CBT | 0 | 1 | 6.45 | 9.42 | -16.49 | 0.98 | 6.20 | 12.80 | 30.67 | ▂▅▇▅▂ |
| lbs_lost | CBT + mobile groceries | 0 | 1 | 11.32 | 9.89 | -14.36 | 4.51 | 11.07 | 19.30 | 32.43 | ▂▅▇▇▂ |
Create a boxplot of lbs_lost, with one box per condition. Label the x- and y- axis and give the graph a title. Describe what’s seen
lbs_lost_condition <- wtloss_subset
lbs_lost_condition %>%
ggplot(aes(x = lbs_lost, y = condition)) +
geom_boxplot() +
theme_bw() +
labs(title = "Pounds lost given conditional intervention",
x = "Lbs Lost",
y = "Condition")
commentary: Most successful weight loss was acheived through the combination of Cognitive Behavioral Therapy (CBT) and Mobile Grocery shopping. Given a singular condition, CBT showed the most significant difference, while Mobile Groceries showed a difference from the control group, but only minimally.
Filter the dataframe to include only the control condition, then create a scatterplot of caldef on the x-axis and lbs_lost on the y-axis. Label the x- and y- axis appropriately and give the graph a title. Below the graph, write what’s seen.
wtloss_control <- wtloss_subset %>%
filter(condition == "control")
wtloss_control %>%
ggplot(mapping = aes(x = caldef, y = lbs_lost)) +
geom_point(aes(color = lbs_lost)) +
theme_bw() +
labs(title = "Weight loss and caloric deficit - Control Variable",
x = "Caloric Deficit",
y = "Pounds Lost")
commentary: there is a trend of higher caloric deficit leading towards a larger amount of weight loss. Conversely, there is also a trend towards more weight gained when there is a caloric excess. Note: a negative(-) number represents weight gain, while a positive(+) number represents weight loss.
Create a dataframe called “control”, that includes only the people who were in the control condition. Add a variable to this dataset called caldef1000, in which caldef is divided by 1000. Note that for this new variable, a one-unit increase in caldef1000 signifies a 1000 calorie increase in caloric deficit (i.e, additional 1000 calories burned).
control <- wtloss_control %>%
mutate(caldef1000 = caldef/1000)
Use the control dataframe that you just created to do the following:
Create a scatterplot of caldef1000 on the x-axis and lbs_lost on the y-axis. Add a best fit line. Be sure to label the x- and y- axis appropriately and give the graph a title. Below the graph, write what’s seen.
control %>%
ggplot(mapping = aes(x = caldef1000, y = lbs_lost)) +
geom_point(aes(size = caldef1000)) +
stat_smooth(aes(x = caldef1000,
y = lbs_lost,
color = "Linear Fit\n& 95% CI"),
formula = y ~ x, method = "lm", se = TRUE, level = 0.95, fill = "grey70") +
theme_bw() +
labs(title = "Pounds lost by caloric deficit",
x = "Caloric Deficit",
y = "Pounds Lost",
size = "Caloric Deficit in Thousands",
colour = "")
commentary: There looks to be a correlation between number of pounds lost and a higher caloric deficit, thought not in all cases. Size of the scatterplots represent a higher caloric deficit, while smaller scatterplots represent caloric excess.
Regress lbs_lost on caldef1000. Use get_regression_table() to obtain the regression estimates for the intercept and slope. Interpret the intercept and slope estimate.
caldef1000 <- control %>%
lm(caldef1000 ~ lbs_lost, data = .)
caldef1000 %>% get_regression_table()
commentary: estimates for coefficients look to show a positive relationship between caloric deficit leading towards more pounds lost.
Use get_regression_summaries() to obtain the R-squared. Describe the R-squared.
caldef1000 %>% get_regression_summaries()
commentary: R-squared is low. Statistically low correlation. Ideally we would observe 0.5 and greater, or -0.5 or lower to give more conclusive data ___
Create a dataframe called “compare”, that includes only people in the CBT or the CBT + mobile groceries interventions. Add a dummy coded variable to the dataframe, name it CBT_MG. Code CBT_MG to be 1 if condition equals CBT + mobile groceries and 0 if condition equals CBT.
compare <- lbs_lost_condition %>%
filter(condition == "CBT"| condition == "CBT + mobile groceries")
compare <- compare %>%
mutate(CBT_MG = case_when(condition == "CBT" ~ 1,
condition == "CBT + mobile groceries" ~ 0))
Use the compare dataframe to do the following:
Create a density plot of lbs_lost that is grouped by CBT_MG (there should be two density curves on the same graph, one for the CBT group and one for the CBT + mobile groceries group). Label the x- axis appropriately and give the graph a title. Describe the graph.
compare %>%
group_by(CBT_MG) %>%
ggplot(aes(x = lbs_lost, fill = condition )) +
geom_density(alpha = .7) +
theme_bw() +
labs(title = "Distribution of caloric deficit by intervetion type",
x = "Caloric deficit", y = "Lbs Lost",
fill = "CBT vs. CBT+MG")
my commentary: You can a noticable difference between CBT vs. CBT & Mobile Groceries. Though there is a noticable difference in caloric deficit in a sizable portion of CBT + Mobile Groceries, there is still a good amount of overlap between the two density plots.
Fit a simple linear regression model, regress lbs_lost on the dummy coded indicator created (CBT_MG). Use get_regression_table() to obtain the parameter estimates. Explain what the estimates of the intercept and slope represent. Interpret the statistic, p-value, and 95% CI for the slope.
mod1 <- compare %>%
lm(lbs_lost ~ CBT_MG, data = .)
get_regression_table(mod1)
commentary: Intercept estimate represents the predicted value of y when x is held constant. The lower and upper bound CIs of 9.420, and 13.229 do not contain zero, which corresponds to rejecting the null hypothesis.
Fit an independent samples t-test that will provide the same test of the slope in simple linear regression. Describe the results.
compare %$% t.test(lbs_lost ~ CBT_MG, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: lbs_lost by CBT_MG
## t = 3.5699, df = 197.53, p-value = 0.0004483
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 2.182103 7.568214
## sample estimates:
## mean in group 0 mean in group 1
## 11.324391 6.449233
commentary: Results indicate a significant difference in caloric deficit for participants who revceived both CBT & Mobile Groceries (M = 11.324, SD = 9.89), and those who only had CBT (M = 6.45, SD = 9.42), t(197)=3.57, p<.001.
conclusions: the use of CBT and mobile grocery shopping could add valuable tools to those positioned in food deserts or those who need more options to make better decisions regarding food. In addition to the 4 conditions in this study, additional data could be analyzed from the fv5 indicator to see any correlation between categorical choices of food and weight loss. Future studies may include additional categories to be recorded, as well as calculations on BMI as it correlates to muscle mass.