Project2 Nutrition And Physical Risk Factors

Author

Mamokotjo Letjama

Introduction

The dataset I choose is from OpenData.gov. and the topic is topic, Nutrition, Physical Activity, and Obesity, collected by Behavioral Risk Factor Surveillance System. Source: https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system

The variables in this dataset include YearStart (ranging from 2011 to 2023), State, Class (categorizing “Fruits and Vegetables” and “Physical Activity”), Percentage, Educational level, and Question (specifying lifestyle habits such as fruit consumption or exercise frequency). This dataset was selected to analyze public health trends related to healthy living. Regular physical activity is associated with a reduced risk of chronic diseases and improved quality of life. According to the World Health Organization (WHO), physical inactivity is recognized as a significant risk factor for noncommunicable disease mortality. Monitoring state-level performance may provide relevant information for policy makers regarding future interventions. I started by loading the necessary packages

Load Packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(leaflet)
library(ggrepel)
library(ggthemes)
library(viridis)

Loading required package: viridisLite

Setting working directory and load dataset

setwd("C:/Users/tmats/OneDrive/DATA110/Working Directories")
health_behavior <- read_csv("Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv")

Rows: 104272 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): LocationAbbr, LocationDesc, Datasource, Class, Topic, Question, Da...
dbl  (8): YearStart, YearEnd, Data_Value_Unit, Data_Value, Data_Value_Alt, L...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(health_behavior)

# A tibble: 6 × 33
  YearStart YearEnd LocationAbbr LocationDesc Datasource Class    Topic Question
      <dbl>   <dbl> <chr>        <chr>        <chr>      <chr>    <chr> <chr>   
1      2011    2011 AK           Alaska       BRFSS      Obesity… Obes… Percent…
2      2011    2011 AK           Alaska       BRFSS      Obesity… Obes… Percent…
3      2011    2011 AK           Alaska       BRFSS      Physica… Phys… Percent…
4      2011    2011 AK           Alaska       BRFSS      Obesity… Obes… Percent…
5      2011    2011 AK           Alaska       BRFSS      Obesity… Obes… Percent…
6      2011    2011 AK           Alaska       BRFSS      Obesity… Obes… Percent…
# ℹ 25 more variables: Data_Value_Unit <dbl>, Data_Value_Type <chr>,
#   Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
#   `Age(years)` <chr>, Education <chr>, Sex <chr>, Income <chr>,
#   `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
#   QuestionID <chr>, DataValueTypeID <chr>, LocationID <chr>, …

health_behavior1 <- health_behavior |>
  filter(!is.na(GeoLocation)) |>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", "")) |>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
health_behavior1

# A tibble: 102,340 × 34
   YearStart YearEnd LocationAbbr LocationDesc Datasource Class   Topic Question
       <dbl>   <dbl> <chr>        <chr>        <chr>      <chr>   <chr> <chr>   
 1      2011    2011 AK           Alaska       BRFSS      Obesit… Obes… Percent…
 2      2011    2011 AK           Alaska       BRFSS      Obesit… Obes… Percent…
 3      2011    2011 AK           Alaska       BRFSS      Physic… Phys… Percent…
 4      2011    2011 AK           Alaska       BRFSS      Obesit… Obes… Percent…
 5      2011    2011 AK           Alaska       BRFSS      Obesit… Obes… Percent…
 6      2011    2011 AK           Alaska       BRFSS      Obesit… Obes… Percent…
 7      2011    2011 AK           Alaska       BRFSS      Physic… Phys… Percent…
 8      2011    2011 AK           Alaska       BRFSS      Obesit… Obes… Percent…
 9      2011    2011 AK           Alaska       BRFSS      Physic… Phys… Percent…
10      2011    2011 AK           Alaska       BRFSS      Physic… Phys… Percent…
# ℹ 102,330 more rows
# ℹ 26 more variables: Data_Value_Unit <dbl>, Data_Value_Type <chr>,
#   Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
#   `Age(years)` <chr>, Education <chr>, Sex <chr>, Income <chr>,
#   `Race/Ethnicity` <chr>, lat <dbl>, long <dbl>, ClassID <chr>, …

Select data to explore

life_style <- health_behavior1 |>
  filter(Class %in% c("Physical Activity", "Fruits and Vegetables")) |>
  filter(Question %in% c('Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)', 'Percent of adults who report consuming fruit less than one time daily', 'Percent of adults who report consuming vegetables less than one time daily')) |>
  filter(YearStart == YearStart) |>
  rename(Percentage = Data_Value) |>
  rename(State = LocationAbbr) |>
  select (YearStart, 
          State, 
          Class, 
          Percentage, 
          Question,
          Education,
          Sex,
          lat, long,
          Income,
          StratificationCategory1)
head(life_style)

# A tibble: 6 × 11
  YearStart State Class   Percentage Question Education Sex     lat  long Income
      <dbl> <chr> <chr>        <dbl> <chr>    <chr>     <chr> <dbl> <dbl> <chr> 
1      2011 AK    Physic…       56.2 Percent… <NA>      <NA>   64.8 -148. <NA>  
2      2011 AK    Physic…       67.2 Percent… College … <NA>   64.8 -148. <NA>  
3      2011 AK    Physic…       46.5 Percent… <NA>      <NA>   64.8 -148. <NA>  
4      2011 AK    Physic…       59.3 Percent… <NA>      <NA>   64.8 -148. $50,0…
5      2011 AK    Physic…       55.3 Percent… <NA>      <NA>   64.8 -148. <NA>  
6      2011 AK    Physic…       62.8 Percent… <NA>      <NA>   64.8 -148. <NA>  
# ℹ 1 more variable: StratificationCategory1 <chr>

Removing Na’s from the dataset

lifestyle_clean <- life_style |>
  filter(!is.na(Percentage))

Convert Income Ranges to Numeric Values

Create a mapping system that assigns a numeric value to each range based on income level:

lifesyle_Stat <- lifestyle_clean |>
  filter(!is.na(Income)) |>
mutate(Income_numeric = recode_factor(Income,
    "Less than $15,000" = 15000,
    "$15,000 - $24,999" = 25000,
    "$25,000 - $34,999" = 35000,
    "$35,000 - $49,999" = 55000,
    "$50,000 - $74,999" = 75000,
    "$75,000 or greater" = 85000,
    "Data not reported" = 0
  ))

convert the Income_numeric to numeric

lifesyle_Stat$Income_numeric <- as.numeric(lifesyle_Stat$Income_numeric)

Statistical Analysis: linear regression with the command, lm(y~x)

cor(lifesyle_Stat$Income_numeric, lifesyle_Stat$Percentage)

[1] 0.05344259

model <- lm(Percentage ~ Income_numeric, data = lifesyle_Stat)
summary(model)


Call:
lm(formula = Percentage ~ Income_numeric, data = lifesyle_Stat)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.895  -8.678   1.834   9.818  38.605 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     39.1684     0.4680  83.701  < 2e-16 ***
Income_numeric   0.3712     0.1046   3.547 0.000394 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.88 on 4392 degrees of freedom
Multiple R-squared:  0.002856,  Adjusted R-squared:  0.002629 
F-statistic: 12.58 on 1 and 4392 DF,  p-value: 0.000394

Summary

Model: = 39.1684 + 0.3712×Income_numeric The model suggests that for each increase in income bracket (from lower to higher), percentage of adults engaging in lifestyle behavior is predicted to increase by 0.37 Correlation = 0.05344 is very weak correlation, nearly close to zero suggesting no linear relationship between income levels and lifestyle behavior P-value = 0.000394 small suggesting that it is less meaningful to explain the linear increase. R^2 value = 0.0026 ~ 0.26% is very low, indicating the income alone cannot explain variation in lifestyle behavior. Other variables may be considered such as educational level regional health initiatives or access to recreational facilities. Adjusted R^2 confirms the model is weak to explain the variation

Preparing data to plot lifestyle behaviors against educationnal level

Edu_lifestyle <- lifestyle_clean |>
  filter(!is.na(Education)) |>
  group_by(State)

Craeting a plot

Plot1 <- ggplot(Edu_lifestyle, aes(x = Education, y = Percentage, fill = Class  )) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    x = "Education Level",
    y = "Percentage Engaged",
    fill = "Lifestyle Category",
    title = "Engagement in Healthy Lifestyle Behaviors by Education Level") +
  theme()
Plot1

summary of the Plot1

College graduates demonstrate the highest levels of engagement in physical activity. Individuals with less than a high school education tend to exhibit higher rates of fruit and vegetable consumption.

Loading highchater

library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(RColorBrewer)

Select 5 state

# prepare data
lifestyle_10state<- lifestyle_clean |>
  filter(State %in% c("MD","DC", "VA", "NC", "MI")) |>
  arrange(YearStart)

View of five state MD”,“DC”, “VA”, “NC”, “MI”

highchart() |>
  hc_colors(c("#1b9e77", "#d95f02", "#7570b3")) |>
  hc_add_series(data = lifestyle_10state,
                type = "area", 
                hcaes(x = YearStart,
                      y = Percentage, 
                      group = Class)) |>
  hc_xAxis(title = list(text="YearStart")) |>
  hc_yAxis(title = list(text="Percentage")) |>
  hc_plotOptions(series = list(marker = list(symbol = "circle"))) |>
  hc_legend(align = "right", 
            verticalAlign = "top")

Summary of the Graph

This graph shows that the physical activity line remains consistently higher than that of fruit and vegetable consumption between 2021 and 2022.

# Filter for Fruits and Vegetables class
fv_data <- lifestyle_10state |> #selecting "Fruit and Vegetables to explore
  filter(Class == "Fruits and Vegetables") |>
  group_by(State, Education) |>
  summarize(Average_Percentage = mean(Percentage, na.rm = TRUE))

`summarise()` has grouped output by 'State'. You can override using the
`.groups` argument.

Preparing data for the Heatmap Graph

fv_data |>
  filter(!is.na(Education))

# A tibble: 20 × 3
# Groups:   State [5]
   State Education                        Average_Percentage
   <chr> <chr>                                         <dbl>
 1 DC    College graduate                               20.0
 2 DC    High school graduate                           32.3
 3 DC    Less than high school                          34.0
 4 DC    Some college or technical school               27.0
 5 MD    College graduate                               21.6
 6 MD    High school graduate                           32.8
 7 MD    Less than high school                          38.0
 8 MD    Some college or technical school               28.7
 9 MI    College graduate                               22.4
10 MI    High school graduate                           33.5
11 MI    Less than high school                          38.0
12 MI    Some college or technical school               28.6
13 NC    College graduate                               20.8
14 NC    High school graduate                           33.1
15 NC    Less than high school                          37.7
16 NC    Some college or technical school               25.9
17 VA    College graduate                               21.6
18 VA    High school graduate                           33.2
19 VA    Less than high school                          35.9
20 VA    Some college or technical school               27.8

ggplot(fv_data, aes(x = Education, y = State, fill = Average_Percentage)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "#edf8fb", high = "#006d2c") +
  labs(
    title = "Fruit & Vegetable Consumption by State and Education Level",
    x = "Education Level",
    y = "State",
    fill = "Avg %",
    caption = "Source: CDC BRFSS"
  ) +
theme_minimal(base_size = 12) +
  theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(angle = 45, hjust = 1))

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

fv_matrix <- fv_data |>
  tidyr::pivot_wider(names_from = Education, values_from = Average_Percentage) |>
  column_to_rownames("State") |>
  as.matrix()

plot_ly(
  x = colnames(fv_matrix),
  y = rownames(fv_matrix),
  z = fv_matrix,
  type = "heatmap",
  colorscale = "Greens",
  hoverongaps = FALSE
) |>
  plotly::layout(
    title = "Interactive Heatmap: Fruit & Vegetable Consumption by State and Education",
    xaxis = list(title = "Education Level", tickangle = 45),
    yaxis = list(title = "State"),
    margin = list(l = 100, r = 50, b = 100, t = 100),
    annotations = list(
      xref = "paper", yref = "paper",
        x = 0, y = -0.2, showarrow = FALSE))

Summary

Summary Fluctuations across the years, suggesting that public engagement with healthy habits hasn’t followed a linear trajectory. For instance, the area graph indicates that fruit and vegetable consumption was just above 40% around 2019 and then drop thereafter went up in 2021. The rise could be due be related to shifts in health awareness or lifestyle changes associated with lockdowns and remote living. The physical activity line has remained consistently higher than that of fruit and vegetable consumption, which may indicate that one behavior has been more stable or more difficult to maintain over time. The statistical analysis shows that there is no correlation between the level of income and for the individuals to living a healthy lifestyle The data contains many NA and as the result, there was lack of exploration