Project 2 - Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System
Obesity is a serius matter on human life, a health and life quality concern. The U.S. Department of Health & Human Services monitores it on the contry to help local and national potilics to face the issue. They have disposed metadate on the issue avaliable https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system
I believe in enhancing quality of life, thats why I chose this particular dataset. To help to understand and visualise the problem.
the datset is huge it has 33 variables from initial year, and ending year, Location abbraviation, name, source, class, topic, question presented, the unit, tha value type, the value, some value statistics as alterations, limit, high confidense, sample size, then the classification in the different characteristcs analysed: Race, income level, gender, education and gender range; This followed by the coordinates, the classification of the row: in its class, question, topic, value time, the stratification chategory, gande and ids. That is a lot of things and a lot of unusefull things. But it is rich, rich in information. But there is a lot of data, 93249 observations, and its formating is metadata, and so for it to be workable there must be a lot of cleaning for unwanted and unecessary variables, and for a better formating. The variables from the variable topic must be transformed into many different variables, because it is the one that defines the statistc observed in the dataset it separates overweight, obesity, sedestarism, and other habbits and characteristics of the percentages. Those are the first steps to conquer the data.
For this particular project The team chose to use only income variables, that took the teams curitsity, is their a obisity disparity because of income? A sedentarism disparity, and if so why? Affordability of leasure or helthy food? And so we chose to initially explore this.
Obisity is a alarming health concern it makes the life spam of people, their life quality and their mental health. It accours upon not having good habits of food and drinks, lack of physical activity, some medications, stress and genetics are also involved. But its consequences are nefarious: cardiovasscular diceses, type 3 diabetes, musculoskeletal disorders like osteoarthritis, and cancers. Something we must run away from, protect our children and lifes. And to act upon it we need the department of health data and to understand it better and the problem it describes we must visualize it. One of this vizualisations is this following project.
The data
The data is still on metadata and so we must look into it to understand and extract a programing friendly data set from it.
We can see 33 collums, with year, and year again, they naver change. Location abbraviation, name, the source, class, topic, question presented, the unit, tha value type, the value, some value statistics as alterations, limit, high confidense, sample size, then the classification in the different characteristcs analysed: Race, income level, gender, education and gender range; This followed by the coordinates, the classification of the row: in its class, question, topic, value time, the stratification chategory, gande and ids.
And we have 93249 different observations.
Hwoooo, thats a lot.
And since the time and process power is limited we must of must contract the data frame.
Uppon inspection the income category and its influence cought the attention of the team, and it chose this to study further. To understand how income level explains your apportability into helthy food.
The selected variables
Also to understand the todays america we chose to use only, the last year.
# A tibble: 6 × 6
YearStart LocationAbbr LocationDesc Question Data_Value Income
<dbl> <chr> <chr> <chr> <dbl> <chr>
1 2022 US National Percent of adults who e… 21.4 $50,0…
2 2022 US National Percent of adults who e… 40 Less …
3 2022 US National Percent of adults aged … 28.8 Less …
4 2022 US National Percent of adults aged … 38.4 $15,0…
5 2022 US National Percent of adults aged … 29.5 Data …
6 2022 US National Percent of adults who e… 37 $15,0…
Now the data is clearer and we can understand it better. And work on it.
There are always 21 values for each region, there are 7 diferent incomes, one of them beegin the general; for each of them their are 3 questions: if they are obese, overweith and sedentary.
But we can owganize the dataframe in a more workable display.
workable<-usable %>%group_by(LocationDesc, Income)%>%summarize(overweight=sum(Data_Value[Question=="Percent of adults aged 18 years and older who have an overweight classification"]), Sedentary=sum(Data_Value[Question=="Percent of adults who engage in no leisure-time physical activity"]), Obesity=sum(Data_Value[Question=="Percent of adults aged 18 years and older who have obesity"]), Abbr=unique(LocationAbbr))
`summarise()` has grouped output by 'LocationDesc'. You can override using the
`.groups` argument.
head(workable)
# A tibble: 6 × 6
# Groups: LocationDesc [1]
LocationDesc Income overweight Sedentary Obesity Abbr
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Alabama $15,000 - $24,999 26.9 38.1 46.1 AL
2 Alabama $25,000 - $34,999 31.8 30.4 38 AL
3 Alabama $35,000 - $49,999 34.7 27 42.3 AL
4 Alabama $50,000 - $74,999 30.6 24.7 42.4 AL
5 Alabama $75,000 or greater 35.6 17.7 37.6 AL
6 Alabama Data not reported 37.2 27.9 33.8 AL
SUCESS!!! We have now it all planned out and will be able to work with it. The process for any other visualisation from this dataset should look like this.
There is only one thing left to do before we start to work with our values their is “Data not resported” witch is actually the total. So lets replace it
data<-replace(df\(Marks, df\)Marks<0, 0)
workable$Income[workable$Income=="Data not reported"]<-"Total"head(workable)
# A tibble: 6 × 6
# Groups: LocationDesc [1]
LocationDesc Income overweight Sedentary Obesity Abbr
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Alabama $15,000 - $24,999 26.9 38.1 46.1 AL
2 Alabama $25,000 - $34,999 31.8 30.4 38 AL
3 Alabama $35,000 - $49,999 34.7 27 42.3 AL
4 Alabama $50,000 - $74,999 30.6 24.7 42.4 AL
5 Alabama $75,000 or greater 35.6 17.7 37.6 AL
6 Alabama Total 37.2 27.9 33.8 AL
Relationships
Lets see how each variable can tell something about the other, lets see with we can observe arelationship with a regression model.
But to correlate we can only work with numeric values, for this it is needed to transform income, that is categorical into numerical, at least in a abstract sence by using the middleground for it all.
correlational <- workable %>%filter(Income !="Total") |>mutate(inc=case_when(Income=="$15,000 - $24,999"~20000,Income=="$25,000 - $34,999"~30000,Income=="$35,000 - $49,999"~42500,Income=="Less than $15,000"~7500,Income=="$50,000 - $74,999"~62500,Income=="$75,000 or greater"~80000))head(correlational)
# A tibble: 6 × 7
# Groups: LocationDesc [1]
LocationDesc Income overweight Sedentary Obesity Abbr inc
<chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 Alabama $15,000 - $24,999 26.9 38.1 46.1 AL 20000
2 Alabama $25,000 - $34,999 31.8 30.4 38 AL 30000
3 Alabama $35,000 - $49,999 34.7 27 42.3 AL 42500
4 Alabama $50,000 - $74,999 30.6 24.7 42.4 AL 62500
5 Alabama $75,000 or greater 35.6 17.7 37.6 AL 80000
6 Alabama Less than $15,000 28.4 51.4 39.5 AL 7500
Lets see it in a grath, initially for income and sedentarism.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Look at this, This is tremendus, lets see the corrolations then.
cor(correlational$inc,correlational$Sedentary)
[1] -0.8772038
fit1 <-lm(Sedentary ~ inc, data = correlational)summary(fit1)
Call:
lm(formula = Sedentary ~ inc, data = correlational)
Residuals:
Min 1Q Median 3Q Max
-20.4495 -3.0197 -0.0808 2.7821 13.5132
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.236e+01 4.779e-01 88.65 <2e-16 ***
inc -3.337e-04 1.009e-05 -33.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.53 on 328 degrees of freedom
Multiple R-squared: 0.7695, Adjusted R-squared: 0.7688
F-statistic: 1095 on 1 and 328 DF, p-value: < 2.2e-16
So we have that
Sedentarism = - 0.000337 (income) + 4236
For this it is important to understand income is in dollarsand so for every dollar in a year salary their is 0.0003 % less chance for you to be sedentary so only a thousand dollar diferrence there is 0.3%.
Lets see then how it all correlates.
library(GGally)
Warning: package 'GGally' was built under R version 4.3.3
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
ggpairs(correlational, columns =c(7, 3,4,5))
We can se that income as a great relationship with sedentarism and overweight and not that much with obesity. But lets try to find a model for obesety with all other variables.
fit_obese<-lm(Obesity ~ overweight + inc + Sedentary, data = correlational)summary(fit_obese)
Call:
lm(formula = Obesity ~ overweight + inc + Sedentary, data = correlational)
Residuals:
Min 1Q Median 3Q Max
-16.3228 -2.4146 0.4984 2.6443 12.0989
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.047e+01 2.849e+00 14.203 < 2e-16 ***
overweight -5.738e-01 6.112e-02 -9.389 < 2e-16 ***
inc 1.286e-04 1.932e-05 6.657 1.19e-10 ***
Sedentary 3.304e-01 4.934e-02 6.696 9.36e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.041 on 326 degrees of freedom
Multiple R-squared: 0.3224, Adjusted R-squared: 0.3161
F-statistic: 51.69 on 3 and 326 DF, p-value: < 2.2e-16
Lets plot it all, using high chart for a more interrective experience.
mine <-setNames(c("#0E90FF", "#6495ED","#4169E1", "#0000FF", "#0000CD","#4B0082","#B0E0E6"), levels(workable$Income))highchart () |>hc_add_series(data = workable,type ="scatter" ,hcaes(x = Sedentary,y = overweight,group = Income,color = mine),tooltip =list(useHTML =TRUE,pointFormat ="{point.LocationDesc}({point.Abbr}), <br> Obesity: {point.Obesity}<br> Overweight: {point.overweight}<br> Sedentarism: {point.Sedentary}<br><br> Source: U.S. Department of Health & Human Services")) |>hc_xAxis(title =list(text="Population of Sedentaries (%)"), color=mine) |>hc_yAxis(title =list(text="Population of overweight (%))"), color=mine) |>hc_title(text ="Overweight population against Sedentary population segregated by income levels",margin =20,align ="left",style =list(color ="#0000FF", useHTML =TRUE) )|>hc_subtitle(text ="<i>Source : The U.S. Department of Health & Human Services</i>",margin =20,align ="left",style =list(color ="#4169E1", useHTML =TRUE)) |>hc_add_theme(hc_theme_smpl()) |>hc_labels(color=mine)
The grath is Great, the team just did not manage to set the labels to mach the colors
mine <-setNames(c("#0E90FF", "#6495ED","#4169E1", "#0000FF", "#0000CD","#4B0082","#B0E0E6"), levels(workable$Income))highchart () |>hc_add_series(data = workable,type ="scatter" ,hcaes(x = Sedentary,y = overweight,group = Income),tooltip =list(useHTML =TRUE,pointFormat ="{point.LocationDesc}({point.Abbr}), <br> Obesity: {point.Obesity}<br> Overweight: {point.overweight}<br> Sedentarism: {point.Sedentary}<br><br> Source: U.S. Department of Health & Human Services")) |>hc_xAxis(title =list(text="Population of Sedentaries (%)"), color=mine) |>hc_yAxis(title =list(text="Population of overweight (%))"), color=mine) |>hc_title(text ="Overweight population against Sedentary population segregated by income levels",margin =20,align ="left",style =list(color ="#0000FF", useHTML =TRUE) )|>hc_subtitle(text ="<i>Source : The U.S. Department of Health & Human Services</i>",margin =20,align ="left",style =list(color ="#4169E1", useHTML =TRUE)) |>hc_add_theme(hc_theme_smpl()) |>hc_labels(color=mine)
Conclusion
The date is enourmous and if possible the team would love to work on it more and more, exploring other characteristcs and variables. It indicates that people that are have higher income are less sedentary, and a little bit less obese and overweight that is because helthy food is expensive, and leasure time for activites is also expensive. For better visualizations the team would love to have the blue shades in the graph, and a map would also be interresting. A mapplot.