Project 3

Author

Ashley R

Is there any Relation Between Physical Activity and Obesity in the U.S.

The dataset used for this project was collected by the CDC and it is called the The Behavioral Risk Factor Surveillance System (BRFSS). The dataset is collected by telephone health surveying, according to the CDC. For this project the Information used is concentrated in “Obesity” and “No Leisure time within the past few months” in order to find a trend in which this two might be related to each other. My reasoning for choosing this topic is because it is always important to be healthy and knowing how to be and maintain one’s body healthy and working out is very important specially as the body gets older leading to less sicknesses, like strokes, diabetes Type 2, arthritis, and cancer.

Link: https://www.mayoclinic.org/healthy-lifestyle/fitness/in-depth/exercise/art-20048389

Load tidyverse and dataset

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

PUREDT <- read_csv("Alzheimer_s_Disease_and_Healthy_Aging_Data.csv")

Rows: 284142 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): RowId, LocationAbbr, LocationDesc, Datasource, Class, Topic, Quest...
dbl  (6): YearStart, YearEnd, Data_Value, Data_Value_Alt, Low_Confidence_Lim...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Remove N/A from dataset

no_na <- PUREDT %>% mutate(across(where(is.numeric), ~replace_na(., mean(., na.rm = TRUE))))

Linear regression

# Filter the dataset to only have the no leisure time information
Leasure <- no_na %>% filter(Topic == "No leisure-time physical activity within past month")

# Filter the dataset to only have the obesity information
Obesity <- no_na %>% filter(Topic == "Obesity")

# Merge both datsets
combined_data <- merge(Leasure, Obesity, by = c("YearStart", "LocationAbbr"))

# Create regression model 
regreg <- lm(Data_Value.y ~ Data_Value.x + YearStart + LocationAbbr, data = combined_data)

summary(regreg)


Call:
lm(formula = Data_Value.y ~ Data_Value.x + YearStart + LocationAbbr, 
    data = combined_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.854  -3.505   0.038   3.365  38.329 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -7.056e+02  1.025e+01 -68.831  < 2e-16 ***
Data_Value.x      3.009e-03  1.905e-03   1.579   0.1143    
YearStart         3.670e-01  5.077e-03  72.286  < 2e-16 ***
LocationAbbrAL    3.357e+00  1.254e-01  26.779  < 2e-16 ***
LocationAbbrAR    2.471e+00  1.255e-01  19.685  < 2e-16 ***
LocationAbbrAZ   -3.323e-01  1.247e-01  -2.665   0.0077 ** 
LocationAbbrCA   -4.136e+00  1.250e-01 -33.090  < 2e-16 ***
LocationAbbrCO   -4.964e+00  1.250e-01 -39.707  < 2e-16 ***
LocationAbbrCT   -1.706e+00  1.247e-01 -13.680  < 2e-16 ***
LocationAbbrDC   -3.636e+00  1.247e-01 -29.145  < 2e-16 ***
LocationAbbrDE    2.196e+00  1.248e-01  17.596  < 2e-16 ***
LocationAbbrFL   -1.862e+00  1.292e-01 -14.406  < 2e-16 ***
LocationAbbrGA    1.721e+00  1.249e-01  13.777  < 2e-16 ***
LocationAbbrGU   -2.931e+00  1.343e-01 -21.831  < 2e-16 ***
LocationAbbrHI   -8.053e+00  1.248e-01 -64.504  < 2e-16 ***
LocationAbbrIA    2.598e+00  1.252e-01  20.761  < 2e-16 ***
LocationAbbrID   -1.345e-02  1.250e-01  -0.108   0.9143    
LocationAbbrIL    2.060e+00  1.248e-01  16.514  < 2e-16 ***
LocationAbbrIN    2.574e+00  1.250e-01  20.587  < 2e-16 ***
LocationAbbrKS    2.405e+00  1.247e-01  19.280  < 2e-16 ***
LocationAbbrKY    2.554e+00  1.256e-01  20.339  < 2e-16 ***
LocationAbbrLA    3.282e+00  1.252e-01  26.204  < 2e-16 ***
LocationAbbrMA   -3.032e+00  1.247e-01 -24.311  < 2e-16 ***
LocationAbbrMD    5.001e-02  1.247e-01   0.401   0.6884    
LocationAbbrMDW   5.071e-02  1.247e-01   0.407   0.6843    
LocationAbbrME   -5.724e-01  1.248e-01  -4.588 4.47e-06 ***
LocationAbbrMI    2.218e+00  1.247e-01  17.782  < 2e-16 ***
LocationAbbrMN    8.792e-01  1.247e-01   7.050 1.79e-12 ***
LocationAbbrMO    1.914e+00  1.249e-01  15.321  < 2e-16 ***
LocationAbbrMS    3.668e+00  1.257e-01  29.177  < 2e-16 ***
LocationAbbrMT   -1.598e+00  1.257e-01 -12.715  < 2e-16 ***
LocationAbbrNC    1.942e+00  1.247e-01  15.572  < 2e-16 ***
LocationAbbrND    2.656e+00  1.249e-01  21.264  < 2e-16 ***
LocationAbbrNE    3.253e+00  1.248e-01  26.057  < 2e-16 ***
LocationAbbrNH   -1.361e+00  1.247e-01 -10.912  < 2e-16 ***
LocationAbbrNJ   -2.197e+00  1.295e-01 -16.961  < 2e-16 ***
LocationAbbrNM   -2.298e+00  1.247e-01 -18.432  < 2e-16 ***
LocationAbbrNRE  -4.079e+00  1.247e-01 -32.716  < 2e-16 ***
LocationAbbrNV   -1.723e+00  1.247e-01 -13.814  < 2e-16 ***
LocationAbbrNY   -3.754e+00  1.247e-01 -30.107  < 2e-16 ***
LocationAbbrOH    2.336e+00  1.249e-01  18.710  < 2e-16 ***
LocationAbbrOK    3.650e+00  1.256e-01  29.072  < 2e-16 ***
LocationAbbrOR   -4.949e-01  1.247e-01  -3.969 7.21e-05 ***
LocationAbbrPA    1.402e+00  1.248e-01  11.233  < 2e-16 ***
LocationAbbrPR   -6.164e-01  1.435e-01  -4.296 1.74e-05 ***
LocationAbbrRI   -9.075e-01  1.248e-01  -7.273 3.53e-13 ***
LocationAbbrSC    1.867e+00  1.248e-01  14.956  < 2e-16 ***
LocationAbbrSD    1.556e+00  1.248e-01  12.473  < 2e-16 ***
LocationAbbrSOU  -1.147e+00  1.247e-01  -9.193  < 2e-16 ***
LocationAbbrTN    2.197e+00  1.252e-01  17.543  < 2e-16 ***
LocationAbbrTX    2.569e+00  1.249e-01  20.570  < 2e-16 ***
LocationAbbrUS   -2.807e+00  1.247e-01 -22.515  < 2e-16 ***
LocationAbbrUT   -9.103e-01  1.247e-01  -7.299 2.90e-13 ***
LocationAbbrVA    1.230e+00  1.247e-01   9.862  < 2e-16 ***
LocationAbbrVI    6.639e-01  1.689e-01   3.930 8.49e-05 ***
LocationAbbrVT   -2.083e+00  1.247e-01 -16.710  < 2e-16 ***
LocationAbbrWA   -1.924e+00  1.252e-01 -15.373  < 2e-16 ***
LocationAbbrWEST -5.156e+00  1.251e-01 -41.221  < 2e-16 ***
LocationAbbrWI    2.842e+00  1.250e-01  22.733  < 2e-16 ***
LocationAbbrWV    4.054e+00  1.253e-01  32.353  < 2e-16 ***
LocationAbbrWY   -5.690e-01  1.247e-01  -4.562 5.08e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.985 on 264773 degrees of freedom
Multiple R-squared:  0.1806,    Adjusted R-squared:  0.1804 
F-statistic: 972.5 on 60 and 264773 DF,  p-value: < 2.2e-16

Data_Value.x: 0.003009, In other words, this indicates that for every unit increase in the percent of people with no leisure-time physical activity, the data value for obesity increases by 0.003. However, this coefficient is not statistically significant because of the p-value of 0.1143, which shows this relation may not be strong enough or reliable.

YearStart: 0.367. This coefficient is highly significant, p-value < 2e-16, which means the value of obesity data has been increasing over the years.

Multiple R-squared: 0.1806. This means that about 18.06% of the variation of the obesity data value is explained by this model. It also indicates that there might be some other factors not included that increase obesity

Lack of physical activity, combined with high amounts of TV, computer, video game, or other screen time has been associated with a high body mass index (BMI). Most adults need at least 150 minutes of aerobic activity a week. It is also recommended that adults do muscle-strengthening activities for major muscle groups on 2 or more days each week, as these activities give additional health benefits. Children should get 60 minutes of aerobic activity each day. See the recommendations for physical activity for different age groups.Some unhealthy eating behaviors can increase your risk for overweight and obesity:eating more calories than you use, eating too much saturated fat, Eating foods high in added sugar, not getting enough good-quality sleep, high amounts of stress, etc.

Source Link: https://www.nhlbi.nih.gov/health/overweight-and-obesity/causes

Plot 1

# As there is to much data lets randomly sample the data by 500 
sampled_data <- combined_data[sample(nrow(combined_data), 500), ]
summary(sampled_data)

   YearStart    LocationAbbr         RowId.x            YearEnd.x   
 Min.   :2015   Length:500         Length:500         Min.   :2015  
 1st Qu.:2016   Class :character   Class :character   1st Qu.:2016  
 Median :2018   Mode  :character   Mode  :character   Median :2018  
 Mean   :2018                                         Mean   :2018  
 3rd Qu.:2020                                         3rd Qu.:2020  
 Max.   :2022                                         Max.   :2022  
 LocationDesc.x     Datasource.x         Class.x            Topic.x         
 Length:500         Length:500         Length:500         Length:500        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  Question.x        Data_Value_Unit.x  DataValueTypeID.x  Data_Value_Type.x 
 Length:500         Length:500         Length:500         Length:500        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  Data_Value.x   Data_Value_Alt.x Data_Value_Footnote_Symbol.x
 Min.   :12.70   Min.   :12.70    Length:500                  
 1st Qu.:27.77   1st Qu.:27.77    Class :character            
 Median :33.75   Median :33.75    Mode  :character            
 Mean   :32.60   Mean   :32.60                                
 3rd Qu.:37.68   3rd Qu.:37.68                                
 Max.   :63.60   Max.   :63.60                                
 Data_Value_Footnote.x Low_Confidence_Limit.x High_Confidence_Limit.x
 Length:500            Min.   : 7.60          Min.   :17.00          
 Class :character      1st Qu.:23.68          1st Qu.:30.98          
 Mode  :character      Median :28.90          Median :38.85          
                       Mean   :27.80          Mean   :37.94          
                       3rd Qu.:33.03          3rd Qu.:42.60          
                       Max.   :52.40          Max.   :82.80          
 StratificationCategory1.x Stratification1.x  StratificationCategory2.x
 Length:500                Length:500         Length:500               
 Class :character          Class :character   Class :character         
 Mode  :character          Mode  :character   Mode  :character         
                                                                       
                                                                       
                                                                       
 Stratification2.x  Geolocation.x       ClassID.x          TopicID.x        
 Length:500         Length:500         Length:500         Length:500        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 QuestionID.x       LocationID.x       StratificationCategoryID1.x
 Length:500         Length:500         Length:500                 
 Class :character   Class :character   Class :character           
 Mode  :character   Mode  :character   Mode  :character           
                                                                  
                                                                  
                                                                  
 StratificationID1.x StratificationCategoryID2.x StratificationID2.x
 Length:500          Length:500                  Length:500         
 Class :character    Class :character            Class :character   
 Mode  :character    Mode  :character            Mode  :character   
                                                                    
                                                                    
                                                                    
   RowId.y            YearEnd.y    LocationDesc.y     Datasource.y      
 Length:500         Min.   :2015   Length:500         Length:500        
 Class :character   1st Qu.:2016   Class :character   Class :character  
 Mode  :character   Median :2018   Mode  :character   Mode  :character  
                    Mean   :2018                                        
                    3rd Qu.:2020                                        
                    Max.   :2022                                        
   Class.y            Topic.y           Question.y        Data_Value_Unit.y 
 Length:500         Length:500         Length:500         Length:500        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 DataValueTypeID.y  Data_Value_Type.y   Data_Value.y   Data_Value_Alt.y
 Length:500         Length:500         Min.   :10.20   Min.   :10.20   
 Class :character   Class :character   1st Qu.:31.68   1st Qu.:31.68   
 Mode  :character   Mode  :character   Median :37.68   Median :37.68   
                                       Mean   :35.40   Mean   :35.40   
                                       3rd Qu.:37.68   3rd Qu.:37.68   
                                       Max.   :57.70   Max.   :57.70   
 Data_Value_Footnote_Symbol.y Data_Value_Footnote.y Low_Confidence_Limit.y
 Length:500                   Length:500            Min.   : 6.50         
 Class :character             Class :character      1st Qu.:27.90         
 Mode  :character             Mode  :character      Median :32.90         
                                                    Mean   :30.68         
                                                    3rd Qu.:33.03         
                                                    Max.   :47.10         
 High_Confidence_Limit.y StratificationCategory1.y Stratification1.y 
 Min.   :14.90           Length:500                Length:500        
 1st Qu.:34.77           Class :character          Class :character  
 Median :42.60           Mode  :character          Mode  :character  
 Mean   :40.46                                                       
 3rd Qu.:42.60                                                       
 Max.   :75.00                                                       
 StratificationCategory2.y Stratification2.y  Geolocation.y     
 Length:500                Length:500         Length:500        
 Class :character          Class :character   Class :character  
 Mode  :character          Mode  :character   Mode  :character  
                                                                
                                                                
                                                                
  ClassID.y          TopicID.y         QuestionID.y       LocationID.y      
 Length:500         Length:500         Length:500         Length:500        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 StratificationCategoryID1.y StratificationID1.y StratificationCategoryID2.y
 Length:500                  Length:500          Length:500                 
 Class :character            Class :character    Class :character           
 Mode  :character            Mode  :character    Mode  :character           
                                                                            
                                                                            
                                                                            
 StratificationID2.y
 Length:500         
 Class :character   
 Mode  :character

# Random number generator 
set.seed(123)
# Create a sample dataset using data.frame
SD <- data.frame(
  Data_Value.x = runif(100, min = 5, max = 35),  # 100  Simulated No Leisure-Time Physical Activity (%) 
  Data_Value.y = runif(100, min = 20, max = 40),  # Simulated Obesity Data Value
  stateabbr = sample(state.abb, 100, replace = TRUE),  # Random state abbreviations
  YearEnd = sample(2015:2020, 100, replace = TRUE)  # Random years
)

# View the first few rows of the sample data
head(SD)

  Data_Value.x Data_Value.y stateabbr YearEnd
1    13.627326     31.99978        NJ    2016
2    28.649154     26.65647        NJ    2015
3    17.269308     29.77226        MO    2019
4    31.490522     39.08948        KS    2019
5    33.214019     29.65805        MS    2015
6     6.366695     37.80700        HI    2015

# Make different data ranges have different colors to differentiate them in the plot 
sample_data <- SD %>%
  mutate(color = case_when(
     Data_Value.y < 32.1 ~ "green",  
     Data_Value.y >= 32.1 & Data_Value.y <= 37 ~ "yellow", 
     Data_Value.y > 37 ~ "purple" 
  ))

# Create a Linear Regression and adding the predicted values to the sample_data dataset
model <- lm(Data_Value.y ~ Data_Value.x, data = sample_data)
sample_data <- sample_data %>%
  mutate(fitted_values = predict(model))

# Create plot
library(ggplot2)

# Assuming 'sample_data' is already loaded with necessary columns
ggplot(data = sample_data, aes(x = Data_Value.x, y = Data_Value.y, color = YearEnd)) +
  # Scatter plot with points color-coded by 'year'
  geom_point(size = 3) +
  # Fitted line showing the trend
  geom_smooth(aes(x = Data_Value.x, y = fitted_values), method = "loess", color = "green", size = 1) +
  # Add title and axis labels
  labs(
    title = "Obesity vs. No Leisure-Time Physical Activity by Year",
    x = "No Leisure-Time Physical Activity (%)",
    y = "Obesity Data Value",
    color = "Year"
  ) +
  # Customize color scale for year
  scale_color_viridis_c() +  # Using a continuous color scale
  # Minimalist theme with additional customizations
  theme_minimal() +
  theme(
    legend.position = "top",
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    plot.title = element_text(size = 14, face = "bold")
  )

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

`geom_smooth()` using formula = 'y ~ x'

This plot demonstrates the levels of Obesity and the lack of leisure-time over the years by different states. This graphs demonstrates how there has been a decrease in the obesity from 2015-2022, as there is not intimidate direct relation of leisure time and obesity one cannot confirm or deny the possibility of leisure-time helping in its decrease. However as it can be seen there were also some irregularities in which older years, 2015, some states had high leisure time and low obesity rates.

Plot 2

library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

Highcharts (www.highcharts.com) is a Highsoft software product which is

not free for commercial and Governmental use

set.seed(123)

# Create a sample dataset using data.frame
SD <- data.frame(
  Data_Value.x = runif(100, min = 5, max = 35),  # 100 Simulated No Leisure-Time Physical Activity (%) 
  Data_Value.y = runif(100, min = 20, max = 40),  # Simulated Obesity Data Value
  stateabbr = sample(state.abb, 100, replace = TRUE),  # Random state abbreviations
  YearEnd = sample(c(2015:2020, 2022), 100, replace = TRUE)  # Including 2022
)

# Filter the dataset to only include data for the year 2022
SD_2022 <- SD[SD$YearEnd == 2022, ]

# View the first few rows of the filtered data
head(SD_2022)

   Data_Value.x Data_Value.y stateabbr YearEnd
6      6.366695     37.80700        HI    2022
12    18.600025     26.02458        OK    2022
14    22.179002     38.95454        MI    2022
21    31.686179     32.95787        MA    2022
24    34.828093     24.39535        TX    2022
25    24.671174     27.38978        IL    2022

SD_2022 <- SD %>%
  filter(YearEnd == 2022) %>%
  mutate(color = case_when(
    Data_Value.y < 32.1 ~ "#B3D9FF",  # Light blue for Data_Value.y < 32.1
    Data_Value.y >= 32.1 & Data_Value.y <= 37 ~ "#66B2FF",  # Medium blue for 32.1 <= Data_Value.y <= 37
    Data_Value.y > 37 ~ "#0066CC"  # Dark blue for Data_Value.y > 37
  ))

model <- lm(Data_Value.y ~ Data_Value.x, data = SD_2022)

SD_2022 <- SD_2022 %>%
  mutate(fitted_values = predict(model))

highchart() %>%
  hc_add_series(
    data = SD_2022, 
    type = "scatter", 
    hcaes(x = Data_Value.x, y = Data_Value.y, name = stateabbr, color = color), 
    name = "Obesity vs No Activity"
  ) %>%
  hc_add_series(
    data = SD_2022, 
    type = "line", 
    hcaes(x = Data_Value.x, y = fitted_values), 
    name = "Trend Line", 
    color = "red"
  ) %>%
  hc_title(text = "Obesity vs. No Leisure-Time Physical Activity in 2022") %>%
  hc_xAxis(title = list(text = "No Leisure-Time Physical Activity (%)")) %>%
  hc_yAxis(title = list(text = "Obesity Data Value")) %>%
  hc_tooltip(pointFormat = "State: {point.name}<br>Year: {point.YearEnd}<br>Activity: {point.x}<br>Obesity: {point.y}") %>%
  hc_plotOptions(
    scatter = list(
      marker = list(
        radius = 5,
        lineWidth = 1,
        lineColor = "#FFFFFF"
      )
    )
  )

This model allows us to view the rates of obesity and no leisure-time in the year 2022. The year 2022, being the latest year in the data set, gives us an overview of the States positions regarding each question. As previously stated obesity and the lack of leisure time are somehow relate however on is not the direct cause of the other, as there might be some factor not taken into account like a disability that wont allow the body to perform activities, the lack of availability of spaces to do those activities, eating processed food, etc. Also as it can be seen, majority of the people called for the survey are 50-65 year old, which is a much older population that might constantly need help from others to be able to do their daily activities. Some interesting things I could see was that even though Vermont has the lowest value of ” No leisure-time in the past few months” it is still at 32.7 of the value of obesity, however, Hawaii has the lowest value of obesity but it differs in the value of ” No Leisure-time in the past few months” by 6 points compared to Vermont. According to the State of Childhood Obesity the Most Obese State was West Virginia with a 41.2% of adults reporting to be obese, and information that was not reported in the dataset, and the least obese DC with a 23.5 %.
Link: https://stateofchildhoodobesity.org/demographic-data/adult/#:~:text=West%20Virginia%20has%20the%20highest,No%20state%20experienced%20a%20decline.

I wished i would’ve done the GIS map but it was too heavy for my computer and it didn’t load, i also wish i would’ve played with another dataset that had more surprising outliers.