The dataset I choose is from OpenData.gov. and the topic is topic, Nutrition, Physical Activity, and Obesity, collected by Behavioral Risk Factor Surveillance System. Source: https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system
The variables in this dataset include YearStart (ranging from 2011 to 2023), State, Class (categorizing “Fruits and Vegetables” and “Physical Activity”), Percentage, Educational level, and Question (specifying lifestyle habits such as fruit consumption or exercise frequency). This dataset was selected to analyze public health trends related to healthy living. Regular physical activity is associated with a reduced risk of chronic diseases and improved quality of life. According to the World Health Organization (WHO), physical inactivity is recognized as a significant risk factor for noncommunicable disease mortality. Monitoring state-level performance may provide relevant information for policy makers regarding future interventions. I started by loading the necessary packages
Load Packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 104272 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): LocationAbbr, LocationDesc, Datasource, Class, Topic, Question, Da...
dbl (8): YearStart, YearEnd, Data_Value_Unit, Data_Value, Data_Value_Alt, L...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 102,340 × 34
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2011 2011 AK Alaska BRFSS Obesit… Obes… Percent…
2 2011 2011 AK Alaska BRFSS Obesit… Obes… Percent…
3 2011 2011 AK Alaska BRFSS Physic… Phys… Percent…
4 2011 2011 AK Alaska BRFSS Obesit… Obes… Percent…
5 2011 2011 AK Alaska BRFSS Obesit… Obes… Percent…
6 2011 2011 AK Alaska BRFSS Obesit… Obes… Percent…
7 2011 2011 AK Alaska BRFSS Physic… Phys… Percent…
8 2011 2011 AK Alaska BRFSS Obesit… Obes… Percent…
9 2011 2011 AK Alaska BRFSS Physic… Phys… Percent…
10 2011 2011 AK Alaska BRFSS Physic… Phys… Percent…
# ℹ 102,330 more rows
# ℹ 26 more variables: Data_Value_Unit <dbl>, Data_Value_Type <chr>,
# Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
# Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
# High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
# `Age(years)` <chr>, Education <chr>, Sex <chr>, Income <chr>,
# `Race/Ethnicity` <chr>, lat <dbl>, long <dbl>, ClassID <chr>, …
Select data to explore
life_style <- health_behavior1 |>filter(Class %in%c("Physical Activity", "Fruits and Vegetables")) |>filter(Question %in%c('Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)', 'Percent of adults who report consuming fruit less than one time daily', 'Percent of adults who report consuming vegetables less than one time daily')) |>filter(YearStart == YearStart) |>rename(Percentage = Data_Value) |>rename(State = LocationAbbr) |>select (YearStart, State, Class, Percentage, Question, Education, Sex, lat, long, Income, StratificationCategory1)head(life_style)
# A tibble: 6 × 11
YearStart State Class Percentage Question Education Sex lat long Income
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 2011 AK Physic… 56.2 Percent… <NA> <NA> 64.8 -148. <NA>
2 2011 AK Physic… 67.2 Percent… College … <NA> 64.8 -148. <NA>
3 2011 AK Physic… 46.5 Percent… <NA> <NA> 64.8 -148. <NA>
4 2011 AK Physic… 59.3 Percent… <NA> <NA> 64.8 -148. $50,0…
5 2011 AK Physic… 55.3 Percent… <NA> <NA> 64.8 -148. <NA>
6 2011 AK Physic… 62.8 Percent… <NA> <NA> 64.8 -148. <NA>
# ℹ 1 more variable: StratificationCategory1 <chr>
model <-lm(Percentage ~ Income_numeric, data = lifesyle_Stat)summary(model)
Call:
lm(formula = Percentage ~ Income_numeric, data = lifesyle_Stat)
Residuals:
Min 1Q Median 3Q Max
-34.895 -8.678 1.834 9.818 38.605
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.1684 0.4680 83.701 < 2e-16 ***
Income_numeric 0.3712 0.1046 3.547 0.000394 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.88 on 4392 degrees of freedom
Multiple R-squared: 0.002856, Adjusted R-squared: 0.002629
F-statistic: 12.58 on 1 and 4392 DF, p-value: 0.000394
Summary
Model: = 39.1684 + 0.3712×Income_numeric The model suggests that for each increase in income bracket (from lower to higher), percentage of adults engaging in lifestyle behavior is predicted to increase by 0.37 Correlation = 0.05344 is very weak correlation, nearly close to zero suggesting no linear relationship between income levels and lifestyle behavior P-value = 0.000394 small suggesting that it is less meaningful to explain the linear increase. R^2 value = 0.0026 ~ 0.26% is very low, indicating the income alone cannot explain variation in lifestyle behavior. Other variables may be considered such as educational level regional health initiatives or access to recreational facilities. Adjusted R^2 confirms the model is weak to explain the variation
Preparing data to plot lifestyle behaviors against educationnal level
Plot1 <-ggplot(Edu_lifestyle, aes(x = Education, y = Percentage, fill = Class )) +geom_bar(stat ="identity", position ="dodge") +theme(axis.text.x =element_text(angle =45, hjust =1)) +labs(x ="Education Level",y ="Percentage Engaged",fill ="Lifestyle Category",title ="Engagement in Healthy Lifestyle Behaviors by Education Level") +theme()Plot1
summary of the Plot1
College graduates demonstrate the highest levels of engagement in physical activity. Individuals with less than a high school education tend to exhibit higher rates of fruit and vegetable consumption.
Loading highchater
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
This graph shows that the physical activity line remains consistently higher than that of fruit and vegetable consumption between 2021 and 2022.
# Filter for Fruits and Vegetables classfv_data <- lifestyle_10state |>#selecting "Fruit and Vegetables to explorefilter(Class =="Fruits and Vegetables") |>group_by(State, Education) |>summarize(Average_Percentage =mean(Percentage, na.rm =TRUE))
`summarise()` has grouped output by 'State'. You can override using the
`.groups` argument.
Preparing data for the Heatmap Graph
fv_data |>filter(!is.na(Education))
# A tibble: 20 × 3
# Groups: State [5]
State Education Average_Percentage
<chr> <chr> <dbl>
1 DC College graduate 20.0
2 DC High school graduate 32.3
3 DC Less than high school 34.0
4 DC Some college or technical school 27.0
5 MD College graduate 21.6
6 MD High school graduate 32.8
7 MD Less than high school 38.0
8 MD Some college or technical school 28.7
9 MI College graduate 22.4
10 MI High school graduate 33.5
11 MI Less than high school 38.0
12 MI Some college or technical school 28.6
13 NC College graduate 20.8
14 NC High school graduate 33.1
15 NC Less than high school 37.7
16 NC Some college or technical school 25.9
17 VA College graduate 21.6
18 VA High school graduate 33.2
19 VA Less than high school 35.9
20 VA Some college or technical school 27.8
ggplot(fv_data, aes(x = Education, y = State, fill = Average_Percentage)) +geom_tile(color ="white") +scale_fill_gradient(low ="#edf8fb", high ="#006d2c") +labs(title ="Fruit & Vegetable Consumption by State and Education Level",x ="Education Level",y ="State",fill ="Avg %",caption ="Source: CDC BRFSS" ) +theme_minimal(base_size =12) +theme(axis.text.y =element_text(size =10),axis.text.x =element_text(angle =45, hjust =1))
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
fv_matrix <- fv_data |> tidyr::pivot_wider(names_from = Education, values_from = Average_Percentage) |>column_to_rownames("State") |>as.matrix()plot_ly(x =colnames(fv_matrix),y =rownames(fv_matrix),z = fv_matrix,type ="heatmap",colorscale ="Greens",hoverongaps =FALSE) |> plotly::layout(title ="Interactive Heatmap: Fruit & Vegetable Consumption by State and Education",xaxis =list(title ="Education Level", tickangle =45),yaxis =list(title ="State"),margin =list(l =100, r =50, b =100, t =100),annotations =list(xref ="paper", yref ="paper",x =0, y =-0.2, showarrow =FALSE))
Summary
Summary Fluctuations across the years, suggesting that public engagement with healthy habits hasn’t followed a linear trajectory. For instance, the area graph indicates that fruit and vegetable consumption was just above 40% around 2019 and then drop thereafter went up in 2021. The rise could be due be related to shifts in health awareness or lifestyle changes associated with lockdowns and remote living. The physical activity line has remained consistently higher than that of fruit and vegetable consumption, which may indicate that one behavior has been more stable or more difficult to maintain over time. The statistical analysis shows that there is no correlation between the level of income and for the individuals to living a healthy lifestyle The data contains many NA and as the result, there was lack of exploration