Project 2

Author

Davi Krause

Project 2 - Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System

Obesity is a serius matter on human life, a health and life quality concern. The U.S. Department of Health & Human Services monitores it on the contry to help local and national potilics to face the issue. They have disposed metadate on the issue avaliable https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system

I believe in enhancing quality of life, thats why I chose this particular dataset. To help to understand and visualise the problem.

the datset is huge it has 33 variables from initial year, and ending year, Location abbraviation, name, source, class, topic, question presented, the unit, tha value type, the value, some value statistics as alterations, limit, high confidense, sample size, then the classification in the different characteristcs analysed: Race, income level, gender, education and gender range; This followed by the coordinates, the classification of the row: in its class, question, topic, value time, the stratification chategory, gande and ids. That is a lot of things and a lot of unusefull things. But it is rich, rich in information. But there is a lot of data, 93249 observations, and its formating is metadata, and so for it to be workable there must be a lot of cleaning for unwanted and unecessary variables, and for a better formating. The variables from the variable topic must be transformed into many different variables, because it is the one that defines the statistc observed in the dataset it separates overweight, obesity, sedestarism, and other habbits and characteristics of the percentages. Those are the first steps to conquer the data.

For this particular project The team chose to use only income variables, that took the teams curitsity, is their a obisity disparity because of income? A sedentarism disparity, and if so why? Affordability of leasure or helthy food? And so we chose to initially explore this.

A further reading on the topic

1st Source: https://www.nhlbi.nih.gov/health/overweight-and-obesity/causes 2nd source: https://www.who.int/news-room/questions-and-answers/item/obesity-health-consequences-of-being-overweight

Obisity is a alarming health concern it makes the life spam of people, their life quality and their mental health. It accours upon not having good habits of food and drinks, lack of physical activity, some medications, stress and genetics are also involved. But its consequences are nefarious: cardiovasscular diceses, type 3 diabetes, musculoskeletal disorders like osteoarthritis, and cancers. Something we must run away from, protect our children and lifes. And to act upon it we need the department of health data and to understand it better and the problem it describes we must visualize it. One of this vizualisations is this following project.

The data

The data is still on metadata and so we must look into it to understand and extract a programing friendly data set from it.

library(tidyverse)
library(tidyr)
library(highcharter)
library(leaflet)

metadata <- read_csv("Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv")

head(metadata)
# A tibble: 6 × 33
  YearStart YearEnd LocationAbbr LocationDesc Datasource    Class Topic Question
      <dbl>   <dbl> <chr>        <chr>        <chr>         <chr> <chr> <chr>   
1      2020    2020 US           National     Behavioral R… Phys… Phys… Percent…
2      2014    2014 GU           Guam         Behavioral R… Obes… Obes… Percent…
3      2013    2013 US           National     Behavioral R… Obes… Obes… Percent…
4      2013    2013 US           National     Behavioral R… Obes… Obes… Percent…
5      2015    2015 US           National     Behavioral R… Phys… Phys… Percent…
6      2015    2015 GU           Guam         Behavioral R… Phys… Phys… Percent…
# ℹ 25 more variables: Data_Value_Unit <lgl>, Data_Value_Type <chr>,
#   Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
#   `Age(years)` <chr>, Education <chr>, Gender <chr>, Income <chr>,
#   `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
#   QuestionID <chr>, DataValueTypeID <chr>, LocationID <chr>, …

We can see 33 collums, with year, and year again, they naver change. Location abbraviation, name, the source, class, topic, question presented, the unit, tha value type, the value, some value statistics as alterations, limit, high confidense, sample size, then the classification in the different characteristcs analysed: Race, income level, gender, education and gender range; This followed by the coordinates, the classification of the row: in its class, question, topic, value time, the stratification chategory, gande and ids.

And we have 93249 different observations.

Hwoooo, thats a lot.

And since the time and process power is limited we must of must contract the data frame.

Uppon inspection the income category and its influence cought the attention of the team, and it chose this to study further. To understand how income level explains your apportability into helthy food.

The selected variables

Also to understand the todays america we chose to use only, the last year.

Filterin

First applying a filter into year and income.

filtered <- metadata |>
  filter(YearStart == 2022) |>
  filter(StratificationCategory1 == "Income") 
head(filtered)
# A tibble: 6 × 33
  YearStart YearEnd LocationAbbr LocationDesc Datasource Class    Topic Question
      <dbl>   <dbl> <chr>        <chr>        <chr>      <chr>    <chr> <chr>   
1      2022    2022 US           National     BRFSS      Physica… Phys… Percent…
2      2022    2022 US           National     BRFSS      Physica… Phys… Percent…
3      2022    2022 US           National     BRFSS      Obesity… Obes… Percent…
4      2022    2022 US           National     BRFSS      Obesity… Obes… Percent…
5      2022    2022 US           National     BRFSS      Obesity… Obes… Percent…
6      2022    2022 US           National     BRFSS      Physica… Phys… Percent…
# ℹ 25 more variables: Data_Value_Unit <lgl>, Data_Value_Type <chr>,
#   Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
#   High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
#   `Age(years)` <chr>, Education <chr>, Gender <chr>, Income <chr>,
#   `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
#   QuestionID <chr>, DataValueTypeID <chr>, LocationID <chr>, …

Now the data frame with 1155 observations seems more workable.

Dara vemoving

usable  <- filtered |>
  select(-YearEnd,-Datasource,-Class,-Datasource,-Topic, ,-Data_Value_Unit,-Data_Value_Type,-Data_Value_Alt,-Data_Value_Footnote_Symbol,-Low_Confidence_Limit,-High_Confidence_Limit,-Data_Value_Footnote_Symbol,-Data_Value_Footnote,-Sample_Size,-Total,-`Age(years)`,-Education,-Gender,-`Race/Ethnicity`,-GeoLocation,-ClassID,-TopicID,-QuestionID,-DataValueTypeID,-LocationID,-StratificationCategory1,-Stratification1,-StratificationCategoryId1,-StratificationID1)
head(usable)
# A tibble: 6 × 6
  YearStart LocationAbbr LocationDesc Question                 Data_Value Income
      <dbl> <chr>        <chr>        <chr>                         <dbl> <chr> 
1      2022 US           National     Percent of adults who e…       21.4 $50,0…
2      2022 US           National     Percent of adults who e…       40   Less …
3      2022 US           National     Percent of adults aged …       28.8 Less …
4      2022 US           National     Percent of adults aged …       38.4 $15,0…
5      2022 US           National     Percent of adults aged …       29.5 Data …
6      2022 US           National     Percent of adults who e…       37   $15,0…

Now the data is clearer and we can understand it better. And work on it.

There are always 21 values for each region, there are 7 diferent incomes, one of them beegin the general; for each of them their are 3 questions: if they are obese, overweith and sedentary.

But we can owganize the dataframe in a more workable display.

workable<-usable %>%
  group_by(LocationDesc, Income)%>%
  summarize(overweight=sum(Data_Value[Question=="Percent of adults aged 18 years and older who have an overweight classification"]), Sedentary=sum(Data_Value[Question=="Percent of adults who engage in no leisure-time physical activity"]), Obesity=sum(Data_Value[Question=="Percent of adults aged 18 years and older who have obesity"]), Abbr=unique(LocationAbbr))
`summarise()` has grouped output by 'LocationDesc'. You can override using the
`.groups` argument.
head(workable)
# A tibble: 6 × 6
# Groups:   LocationDesc [1]
  LocationDesc Income             overweight Sedentary Obesity Abbr 
  <chr>        <chr>                   <dbl>     <dbl>   <dbl> <chr>
1 Alabama      $15,000 - $24,999        26.9      38.1    46.1 AL   
2 Alabama      $25,000 - $34,999        31.8      30.4    38   AL   
3 Alabama      $35,000 - $49,999        34.7      27      42.3 AL   
4 Alabama      $50,000 - $74,999        30.6      24.7    42.4 AL   
5 Alabama      $75,000 or greater       35.6      17.7    37.6 AL   
6 Alabama      Data not reported        37.2      27.9    33.8 AL   

SUCESS!!! We have now it all planned out and will be able to work with it. The process for any other visualisation from this dataset should look like this.

There is only one thing left to do before we start to work with our values their is “Data not resported” witch is actually the total. So lets replace it

data<-replace(df\(Marks, df\)Marks<0, 0)

workable$Income[workable$Income=="Data not reported"]<-"Total"
head(workable)
# A tibble: 6 × 6
# Groups:   LocationDesc [1]
  LocationDesc Income             overweight Sedentary Obesity Abbr 
  <chr>        <chr>                   <dbl>     <dbl>   <dbl> <chr>
1 Alabama      $15,000 - $24,999        26.9      38.1    46.1 AL   
2 Alabama      $25,000 - $34,999        31.8      30.4    38   AL   
3 Alabama      $35,000 - $49,999        34.7      27      42.3 AL   
4 Alabama      $50,000 - $74,999        30.6      24.7    42.4 AL   
5 Alabama      $75,000 or greater       35.6      17.7    37.6 AL   
6 Alabama      Total                    37.2      27.9    33.8 AL   

Relationships

Lets see how each variable can tell something about the other, lets see with we can observe arelationship with a regression model.

But to correlate we can only work with numeric values, for this it is needed to transform income, that is categorical into numerical, at least in a abstract sence by using the middleground for it all.

correlational <- workable %>%
  filter(Income != "Total") |>
  mutate(inc=case_when(Income=="$15,000 - $24,999"~20000,Income=="$25,000 - $34,999"~30000,Income=="$35,000 - $49,999"~42500,Income=="Less than $15,000"~7500,Income=="$50,000 - $74,999"~62500,Income=="$75,000 or greater"~80000))
head(correlational)
# A tibble: 6 × 7
# Groups:   LocationDesc [1]
  LocationDesc Income             overweight Sedentary Obesity Abbr    inc
  <chr>        <chr>                   <dbl>     <dbl>   <dbl> <chr> <dbl>
1 Alabama      $15,000 - $24,999        26.9      38.1    46.1 AL    20000
2 Alabama      $25,000 - $34,999        31.8      30.4    38   AL    30000
3 Alabama      $35,000 - $49,999        34.7      27      42.3 AL    42500
4 Alabama      $50,000 - $74,999        30.6      24.7    42.4 AL    62500
5 Alabama      $75,000 or greater       35.6      17.7    37.6 AL    80000
6 Alabama      Less than $15,000        28.4      51.4    39.5 AL     7500

Lets see it in a grath, initially for income and sedentarism.

P1<-ggplot(correlational, aes(x=inc, y=Sedentary))
P1+geom_point()+geom_smooth(color="blue")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Look at this, This is tremendus, lets see the corrolations then.

cor(correlational$inc,correlational$Sedentary)
[1] -0.8772038
fit1 <- lm(Sedentary ~ inc, data = correlational)
summary(fit1)

Call:
lm(formula = Sedentary ~ inc, data = correlational)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.4495  -3.0197  -0.0808   2.7821  13.5132 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.236e+01  4.779e-01   88.65   <2e-16 ***
inc         -3.337e-04  1.009e-05  -33.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.53 on 328 degrees of freedom
Multiple R-squared:  0.7695,    Adjusted R-squared:  0.7688 
F-statistic:  1095 on 1 and 328 DF,  p-value: < 2.2e-16

So we have that

Sedentarism = - 0.000337 (income) + 4236

For this it is important to understand income is in dollarsand so for every dollar in a year salary their is 0.0003 % less chance for you to be sedentary so only a thousand dollar diferrence there is 0.3%.

Lets see then how it all correlates.

library(GGally)
Warning: package 'GGally' was built under R version 4.3.3
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
ggpairs(correlational, columns = c(7, 3,4,5))

We can se that income as a great relationship with sedentarism and overweight and not that much with obesity. But lets try to find a model for obesety with all other variables.

fit_obese<- lm(Obesity ~ overweight + inc + Sedentary, data = correlational)
summary(fit_obese)

Call:
lm(formula = Obesity ~ overweight + inc + Sedentary, data = correlational)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.3228  -2.4146   0.4984   2.6443  12.0989 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.047e+01  2.849e+00  14.203  < 2e-16 ***
overweight  -5.738e-01  6.112e-02  -9.389  < 2e-16 ***
inc          1.286e-04  1.932e-05   6.657 1.19e-10 ***
Sedentary    3.304e-01  4.934e-02   6.696 9.36e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.041 on 326 degrees of freedom
Multiple R-squared:  0.3224,    Adjusted R-squared:  0.3161 
F-statistic: 51.69 on 3 and 326 DF,  p-value: < 2.2e-16

The equation is then:

Obesity = -5.738e-01 * Overweight + 1.286e-04 * income + 3.304e-01 * sedentary

This is some interresting findings.

Lets visualize it all then.

Data vizualization

Lets plot it all, using high chart for a more interrective experience.

mine <- setNames(c("#0E90FF", "#6495ED","#4169E1", "#0000FF", "#0000CD","#4B0082","#B0E0E6"), levels(workable$Income))

highchart () |>
  hc_add_series(data = workable,
                type = "scatter" ,
                hcaes(x = Sedentary,
                      y = overweight,
                      group = Income,
                      color = mine),
                tooltip = list(useHTML = TRUE,
                pointFormat = "{point.LocationDesc}({point.Abbr}),
                               <br>
                               Obesity: {point.Obesity}<br>
                               Overweight: {point.overweight}<br>
                               Sedentarism: {point.Sedentary}<br><br>
                Source: U.S. Department of Health & Human Services")) |>
hc_xAxis(title = list(text="Population of Sedentaries (%)"), color=mine) |>
hc_yAxis(title = list(text="Population of overweight (%))"), color=mine) |>
hc_title(
    text = "Overweight population against Sedentary population segregated by income levels",
    margin = 20,
    align = "left",
    style = list(color = "#0000FF", useHTML = TRUE)
  )|>
hc_subtitle(
    text = "<i>Source : The U.S. Department of Health & Human Services</i>",
    margin = 20,
    align = "left",
    style = list(color = "#4169E1", useHTML = TRUE)) |>
hc_add_theme(hc_theme_smpl()) |>
hc_labels(color=mine)  

The grath is Great, the team just did not manage to set the labels to mach the colors

mine <- setNames(c("#0E90FF", "#6495ED","#4169E1", "#0000FF", "#0000CD","#4B0082","#B0E0E6"), levels(workable$Income))

highchart () |>
  hc_add_series(data = workable,
                type = "scatter" ,
                hcaes(x = Sedentary,
                      y = overweight,
                      group = Income),
                tooltip = list(useHTML = TRUE,
                pointFormat = "{point.LocationDesc}({point.Abbr}),
                               <br>
                               Obesity: {point.Obesity}<br>
                               Overweight: {point.overweight}<br>
                               Sedentarism: {point.Sedentary}<br><br>
                Source: U.S. Department of Health & Human Services")) |>
hc_xAxis(title = list(text="Population of Sedentaries (%)"), color=mine) |>
hc_yAxis(title = list(text="Population of overweight (%))"), color=mine) |>
hc_title(
    text = "Overweight population against Sedentary population segregated by income levels",
    margin = 20,
    align = "left",
    style = list(color = "#0000FF", useHTML = TRUE)
  )|>
hc_subtitle(
    text = "<i>Source : The U.S. Department of Health & Human Services</i>",
    margin = 20,
    align = "left",
    style = list(color = "#4169E1", useHTML = TRUE)) |>
hc_add_theme(hc_theme_smpl()) |>
hc_labels(color=mine)  

Conclusion

The date is enourmous and if possible the team would love to work on it more and more, exploring other characteristcs and variables. It indicates that people that are have higher income are less sedentary, and a little bit less obese and overweight that is because helthy food is expensive, and leasure time for activites is also expensive. For better visualizations the team would love to have the blue shades in the graph, and a map would also be interresting. A mapplot.