Is there any Relation Between Physical Activity and Obesity in the U.S.
Credit: Carelink.org
The dataset used for this project was collected by the CDC and it is called the The Behavioral Risk Factor Surveillance System (BRFSS). The dataset is collected by telephone health surveying, according to the CDC. For this project the Information used is concentrated in “Obesity” and “No Leisure time within the past few months” in order to find a trend in which this two might be related to each other. My reasoning for choosing this topic is because it is always important to be healthy and knowing how to be and maintain one’s body healthy and working out is very important specially as the body gets older leading to less sicknesses, like strokes, diabetes Type 2, arthritis, and cancer.
Rows: 284142 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): RowId, LocationAbbr, LocationDesc, Datasource, Class, Topic, Quest...
dbl (6): YearStart, YearEnd, Data_Value, Data_Value_Alt, Low_Confidence_Lim...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter the dataset to only have the no leisure time informationLeasure <- no_na %>%filter(Topic =="No leisure-time physical activity within past month")
# Filter the dataset to only have the obesity informationObesity <- no_na %>%filter(Topic =="Obesity")
# Merge both datsetscombined_data <-merge(Leasure, Obesity, by =c("YearStart", "LocationAbbr"))
# Create regression model regreg <-lm(Data_Value.y ~ Data_Value.x + YearStart + LocationAbbr, data = combined_data)summary(regreg)
Data_Value.x: 0.003009, In other words, this indicates that for every unit increase in the percent of people with no leisure-time physical activity, the data value for obesity increases by 0.003. However, this coefficient is not statistically significant because of the p-value of 0.1143, which shows this relation may not be strong enough or reliable.
YearStart: 0.367. This coefficient is highly significant, p-value < 2e-16, which means the value of obesity data has been increasing over the years.
Multiple R-squared: 0.1806. This means that about 18.06% of the variation of the obesity data value is explained by this model. It also indicates that there might be some other factors not included that increase obesity
Lack of physical activity, combined with high amounts of TV, computer, video game, or other screen time has been associated with a high body mass index (BMI). Most adults need at least 150 minutes of aerobic activity a week. It is also recommended that adults do muscle-strengthening activities for major muscle groups on 2 or more days each week, as these activities give additional health benefits. Children should get 60 minutes of aerobic activity each day. See the recommendations for physical activity for different age groups.Some unhealthy eating behaviors can increase your risk for overweight and obesity:eating more calories than you use, eating too much saturated fat, Eating foods high in added sugar, not getting enough good-quality sleep, high amounts of stress, etc.
# As there is to much data lets randomly sample the data by 500 sampled_data <- combined_data[sample(nrow(combined_data), 500), ]summary(sampled_data)
YearStart LocationAbbr RowId.x YearEnd.x
Min. :2015 Length:500 Length:500 Min. :2015
1st Qu.:2016 Class :character Class :character 1st Qu.:2016
Median :2018 Mode :character Mode :character Median :2018
Mean :2018 Mean :2018
3rd Qu.:2020 3rd Qu.:2020
Max. :2022 Max. :2022
LocationDesc.x Datasource.x Class.x Topic.x
Length:500 Length:500 Length:500 Length:500
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Question.x Data_Value_Unit.x DataValueTypeID.x Data_Value_Type.x
Length:500 Length:500 Length:500 Length:500
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Data_Value.x Data_Value_Alt.x Data_Value_Footnote_Symbol.x
Min. :12.70 Min. :12.70 Length:500
1st Qu.:27.77 1st Qu.:27.77 Class :character
Median :33.75 Median :33.75 Mode :character
Mean :32.60 Mean :32.60
3rd Qu.:37.68 3rd Qu.:37.68
Max. :63.60 Max. :63.60
Data_Value_Footnote.x Low_Confidence_Limit.x High_Confidence_Limit.x
Length:500 Min. : 7.60 Min. :17.00
Class :character 1st Qu.:23.68 1st Qu.:30.98
Mode :character Median :28.90 Median :38.85
Mean :27.80 Mean :37.94
3rd Qu.:33.03 3rd Qu.:42.60
Max. :52.40 Max. :82.80
StratificationCategory1.x Stratification1.x StratificationCategory2.x
Length:500 Length:500 Length:500
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Stratification2.x Geolocation.x ClassID.x TopicID.x
Length:500 Length:500 Length:500 Length:500
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
QuestionID.x LocationID.x StratificationCategoryID1.x
Length:500 Length:500 Length:500
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
StratificationID1.x StratificationCategoryID2.x StratificationID2.x
Length:500 Length:500 Length:500
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
RowId.y YearEnd.y LocationDesc.y Datasource.y
Length:500 Min. :2015 Length:500 Length:500
Class :character 1st Qu.:2016 Class :character Class :character
Mode :character Median :2018 Mode :character Mode :character
Mean :2018
3rd Qu.:2020
Max. :2022
Class.y Topic.y Question.y Data_Value_Unit.y
Length:500 Length:500 Length:500 Length:500
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
DataValueTypeID.y Data_Value_Type.y Data_Value.y Data_Value_Alt.y
Length:500 Length:500 Min. :10.20 Min. :10.20
Class :character Class :character 1st Qu.:31.68 1st Qu.:31.68
Mode :character Mode :character Median :37.68 Median :37.68
Mean :35.40 Mean :35.40
3rd Qu.:37.68 3rd Qu.:37.68
Max. :57.70 Max. :57.70
Data_Value_Footnote_Symbol.y Data_Value_Footnote.y Low_Confidence_Limit.y
Length:500 Length:500 Min. : 6.50
Class :character Class :character 1st Qu.:27.90
Mode :character Mode :character Median :32.90
Mean :30.68
3rd Qu.:33.03
Max. :47.10
High_Confidence_Limit.y StratificationCategory1.y Stratification1.y
Min. :14.90 Length:500 Length:500
1st Qu.:34.77 Class :character Class :character
Median :42.60 Mode :character Mode :character
Mean :40.46
3rd Qu.:42.60
Max. :75.00
StratificationCategory2.y Stratification2.y Geolocation.y
Length:500 Length:500 Length:500
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
ClassID.y TopicID.y QuestionID.y LocationID.y
Length:500 Length:500 Length:500 Length:500
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
StratificationCategoryID1.y StratificationID1.y StratificationCategoryID2.y
Length:500 Length:500 Length:500
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
StratificationID2.y
Length:500
Class :character
Mode :character
# Random number generator set.seed(123)# Create a sample dataset using data.frameSD <-data.frame(Data_Value.x =runif(100, min =5, max =35), # 100 Simulated No Leisure-Time Physical Activity (%) Data_Value.y =runif(100, min =20, max =40), # Simulated Obesity Data Valuestateabbr =sample(state.abb, 100, replace =TRUE), # Random state abbreviationsYearEnd =sample(2015:2020, 100, replace =TRUE) # Random years)# View the first few rows of the sample datahead(SD)
# Make different data ranges have different colors to differentiate them in the plot sample_data <- SD %>%mutate(color =case_when( Data_Value.y <32.1~"green", Data_Value.y >=32.1& Data_Value.y <=37~"yellow", Data_Value.y >37~"purple" ))
# Create a Linear Regression and adding the predicted values to the sample_data datasetmodel <-lm(Data_Value.y ~ Data_Value.x, data = sample_data)sample_data <- sample_data %>%mutate(fitted_values =predict(model))
# Create plotlibrary(ggplot2)# Assuming 'sample_data' is already loaded with necessary columnsggplot(data = sample_data, aes(x = Data_Value.x, y = Data_Value.y, color = YearEnd)) +# Scatter plot with points color-coded by 'year'geom_point(size =3) +# Fitted line showing the trendgeom_smooth(aes(x = Data_Value.x, y = fitted_values), method ="loess", color ="green", size =1) +# Add title and axis labelslabs(title ="Obesity vs. No Leisure-Time Physical Activity by Year",x ="No Leisure-Time Physical Activity (%)",y ="Obesity Data Value",color ="Year" ) +# Customize color scale for yearscale_color_viridis_c() +# Using a continuous color scale# Minimalist theme with additional customizationstheme_minimal() +theme(legend.position ="top",axis.title.x =element_text(size =12),axis.title.y =element_text(size =12),plot.title =element_text(size =14, face ="bold") )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`geom_smooth()` using formula = 'y ~ x'
This plot demonstrates the levels of Obesity and the lack of leisure-time over the years by different states. This graphs demonstrates how there has been a decrease in the obesity from 2015-2022, as there is not intimidate direct relation of leisure time and obesity one cannot confirm or deny the possibility of leisure-time helping in its decrease. However as it can be seen there were also some irregularities in which older years, 2015, some states had high leisure time and low obesity rates.
Plot 2
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
set.seed(123)# Create a sample dataset using data.frameSD <-data.frame(Data_Value.x =runif(100, min =5, max =35), # 100 Simulated No Leisure-Time Physical Activity (%) Data_Value.y =runif(100, min =20, max =40), # Simulated Obesity Data Valuestateabbr =sample(state.abb, 100, replace =TRUE), # Random state abbreviationsYearEnd =sample(c(2015:2020, 2022), 100, replace =TRUE) # Including 2022)# Filter the dataset to only include data for the year 2022SD_2022 <- SD[SD$YearEnd ==2022, ]# View the first few rows of the filtered datahead(SD_2022)
Data_Value.x Data_Value.y stateabbr YearEnd
6 6.366695 37.80700 HI 2022
12 18.600025 26.02458 OK 2022
14 22.179002 38.95454 MI 2022
21 31.686179 32.95787 MA 2022
24 34.828093 24.39535 TX 2022
25 24.671174 27.38978 IL 2022
SD_2022 <- SD %>%filter(YearEnd ==2022) %>%mutate(color =case_when( Data_Value.y <32.1~"#B3D9FF", # Light blue for Data_Value.y < 32.1 Data_Value.y >=32.1& Data_Value.y <=37~"#66B2FF", # Medium blue for 32.1 <= Data_Value.y <= 37 Data_Value.y >37~"#0066CC"# Dark blue for Data_Value.y > 37 ))
model <-lm(Data_Value.y ~ Data_Value.x, data = SD_2022)SD_2022 <- SD_2022 %>%mutate(fitted_values =predict(model))
highchart() %>%hc_add_series(data = SD_2022, type ="scatter", hcaes(x = Data_Value.x, y = Data_Value.y, name = stateabbr, color = color), name ="Obesity vs No Activity" ) %>%hc_add_series(data = SD_2022, type ="line", hcaes(x = Data_Value.x, y = fitted_values), name ="Trend Line", color ="red" ) %>%hc_title(text ="Obesity vs. No Leisure-Time Physical Activity in 2022") %>%hc_xAxis(title =list(text ="No Leisure-Time Physical Activity (%)")) %>%hc_yAxis(title =list(text ="Obesity Data Value")) %>%hc_tooltip(pointFormat ="State: {point.name}<br>Year: {point.YearEnd}<br>Activity: {point.x}<br>Obesity: {point.y}") %>%hc_plotOptions(scatter =list(marker =list(radius =5,lineWidth =1,lineColor ="#FFFFFF" ) ) )
This model allows us to view the rates of obesity and no leisure-time in the year 2022. The year 2022, being the latest year in the data set, gives us an overview of the States positions regarding each question. As previously stated obesity and the lack of leisure time are somehow relate however on is not the direct cause of the other, as there might be some factor not taken into account like a disability that wont allow the body to perform activities, the lack of availability of spaces to do those activities, eating processed food, etc. Also as it can be seen, majority of the people called for the survey are 50-65 year old, which is a much older population that might constantly need help from others to be able to do their daily activities. Some interesting things I could see was that even though Vermont has the lowest value of ” No leisure-time in the past few months” it is still at 32.7 of the value of obesity, however, Hawaii has the lowest value of obesity but it differs in the value of ” No Leisure-time in the past few months” by 6 points compared to Vermont. According to the State of Childhood Obesity the Most Obese State was West Virginia with a 41.2% of adults reporting to be obese, and information that was not reported in the dataset, and the least obese DC with a 23.5 %.
Link: https://stateofchildhoodobesity.org/demographic-data/adult/#:~:text=West%20Virginia%20has%20the%20highest,No%20state%20experienced%20a%20decline.
I wished i would’ve done the GIS map but it was too heavy for my computer and it didn’t load, i also wish i would’ve played with another dataset that had more surprising outliers.