Biostatistics 2 Final Assignment

# Set your working directory to where the file is saved (optional)
setwd("C:\\Users\\sudipta.gupta\\Pictures")
# Import the dataset
physical_activity <- read.csv("physical_activity.csv", header = TRUE)
# View the first few rows
head(physical_activity)

##   participant_id age_group gender marital_status education_level occupation
## 1              1     18-29   Male         Single       Secondary     Farmer
## 2              2     18-29   Male        Married       Secondary   Business
## 3              3     45-59 Female        Married         Primary     Farmer
## 4              4     45-59 Female       Divorced      University    Student
## 5              5     45-59   Male        Widowed      Illiterate    Student
## 6              6     45-59   Male        Married       Secondary     Farmer
##   monthly_income physical_activity chronic_disease self_rated_health
## 1          32696               Low              No              Good
## 2          40891              High             Yes              Good
## 3          28615          Moderate              No         Excellent
## 4          57190               Low             Yes              Fair
## 5          56976          Moderate             Yes         Excellent
## 6          34214          Moderate              No              Fair

# Check basic info
str(physical_activity)

## 'data.frame':    250 obs. of  10 variables:
##  $ participant_id   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age_group        : chr  "18-29" "18-29" "45-59" "45-59" ...
##  $ gender           : chr  "Male" "Male" "Female" "Female" ...
##  $ marital_status   : chr  "Single" "Married" "Married" "Divorced" ...
##  $ education_level  : chr  "Secondary" "Secondary" "Primary" "University" ...
##  $ occupation       : chr  "Farmer" "Business" "Farmer" "Student" ...
##  $ monthly_income   : int  32696 40891 28615 57190 56976 34214 41112 8866 14249 25496 ...
##  $ physical_activity: chr  "Low" "High" "Moderate" "Low" ...
##  $ chronic_disease  : chr  "No" "Yes" "No" "Yes" ...
##  $ self_rated_health: chr  "Good" "Good" "Excellent" "Fair" ...

summary(physical_activity)

##  participant_id    age_group            gender          marital_status    
##  Min.   :  1.00   Length:250         Length:250         Length:250        
##  1st Qu.: 63.25   Class :character   Class :character   Class :character  
##  Median :125.50   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :125.50                                                           
##  3rd Qu.:187.75                                                           
##  Max.   :250.00                                                           
##  education_level     occupation        monthly_income  physical_activity 
##  Length:250         Length:250         Min.   : 5052   Length:250        
##  Class :character   Class :character   1st Qu.:19940   Class :character  
##  Mode  :character   Mode  :character   Median :33588   Mode  :character  
##                                        Mean   :33184                     
##                                        3rd Qu.:46991                     
##                                        Max.   :59937                     
##  chronic_disease    self_rated_health 
##  Length:250         Length:250        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

# Load packages
library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.1

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.1

## Warning: package 'ggplot2' was built under R version 4.5.1

## Warning: package 'tibble' was built under R version 4.5.1

## Warning: package 'tidyr' was built under R version 4.5.1

## Warning: package 'readr' was built under R version 4.5.1

## Warning: package 'purrr' was built under R version 4.5.1

## Warning: package 'dplyr' was built under R version 4.5.1

## Warning: package 'stringr' was built under R version 4.5.1

## Warning: package 'forcats' was built under R version 4.5.1

## Warning: package 'lubridate' was built under R version 4.5.1

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Generate a descriptive summary for all variables
summary_table <- physical_activity %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ± {sd}",   # Show mean ± SD for numeric vars
      all_categorical() ~ "{n} ({p}%)"      # Show n (%) for categorical vars
    ),
    missing = "no"  # or "ifany" to show missing counts
  )
# Print the summary table
summary_table

Characteristic	N = 250¹
participant_id	126 ± 72
age_group
18-29	69 (28%)
30-44	89 (36%)
45-59	54 (22%)
60+	38 (15%)
gender
Female	132 (53%)
Male	118 (47%)
marital_status
Divorced	25 (10%)
Married	132 (53%)
Single	67 (27%)
Widowed	26 (10%)
education_level
Illiterate	31 (12%)
Primary	82 (33%)
Secondary	90 (36%)
University	47 (19%)
occupation
Business	42 (17%)
Farmer	72 (29%)
Service	66 (26%)
Student	41 (16%)
Unemployed	29 (12%)
monthly_income	33,184 ± 16,041
physical_activity
High	54 (22%)
Low	95 (38%)
Moderate	101 (40%)
chronic_disease	87 (35%)
self_rated_health
Excellent	44 (18%)
Fair	91 (36%)
Good	85 (34%)
Poor	30 (12%)
¹ Mean ± SD; n (%)

### Interpretation

The study included 250 participants. The `participant_id` row (126 ± 72) summarizes an arbitrary row identifier, not age or any other measurement, so it should be ignored; age was recorded only in groups. Most participants were aged 30–44 years (36%), followed by 18–29 years (28%), 45–59 years (22%), and 60+ years (15%). The gender distribution was fairly balanced (53% female, 47% male). Over half were married (53%) and about one-fourth were single (27%); divorced and widowed participants each made up about 10%. Educational attainment was modest: only 19% had a university education, the largest share had secondary education (36%), followed by primary (33%), and about 12% were illiterate.

Regarding occupation, participants were spread across farmers (29%), service holders (26%), and businesspersons (17%), with smaller proportions of students (16%) and unemployed (12%). The mean monthly income was 33,184 ± 16,041 BDT (range 5,052 to 59,937), so the standard deviation is roughly half the mean, indicating a wide spread of incomes.

In terms of physical activity, most participants reported moderate activity (40%), followed by low (38%) and high (22%) levels. About 35% (87 of 250) had a chronic disease. Self-rated health was mostly fair (36%) or good (34%); 18% rated their health as excellent and 12% as poor.
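Because `participant_id` is only a row identifier, one option is to drop it before building the descriptive table. The following is a minimal sketch under stated assumptions: `toy` is a hypothetical stand-in built from the first six rows printed by `head()` (three columns only), not the full dataset, and `monthly_income` is forced to the continuous type because gtsummary may treat a numeric column with few distinct values as categorical.

```r
library(dplyr)
library(gtsummary)

# Stand-in for physical_activity: first six rows shown by head(), three columns only
toy <- data.frame(
  participant_id = 1:6,
  gender         = c("Male", "Male", "Female", "Female", "Male", "Male"),
  monthly_income = c(32696, 40891, 28615, 57190, 56976, 34214)
)

# Drop the identifier so it is not summarised as if it were a measurement
analysis_vars <- toy %>% select(-participant_id)

summary_table <- analysis_vars %>%
  tbl_summary(
    type = list(monthly_income ~ "continuous"),  # only needed for this tiny toy data
    statistic = list(
      all_continuous()  ~ "{mean} ± {sd}",
      all_categorical() ~ "{n} ({p}%)"
    ),
    missing = "no"
  )

summary_table
```

With the real data, the same `select(-participant_id)` step would go directly before `tbl_summary()`, and the `type` argument should not be needed.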

# If you haven't already imported the data and converted the categorical columns to factors, run:
df <- readr::read_csv("physical_activity.csv") %>%
  mutate(
    chronic_disease   = factor(chronic_disease, levels = c("No","Yes")),
    gender            = factor(gender),
    age_group         = factor(age_group),
    marital_status    = factor(marital_status),
    education_level   = factor(education_level),
    occupation        = factor(occupation),
    physical_activity = factor(physical_activity),
    self_rated_health = factor(self_rated_health)
  )

## Rows: 250 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): age_group, gender, marital_status, education_level, occupation, phy...
## dbl (2): participant_id, monthly_income
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
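`factor()` sorts levels alphabetically unless told otherwise, which is why the tables in this report list physical activity as High, Low, Moderate. The sketch below shows how an explicit level order could be set for an ordinal variable. It uses only the first four `physical_activity` values printed by `str()`, and the ordering Low < Moderate < High is an assumption about the intended scale.

```r
# Values copied from the str() output (first four rows)
activity <- c("Low", "High", "Moderate", "Low")

# Default behaviour: levels are sorted alphabetically
levels(factor(activity))

# Explicit levels in their natural order; the first level becomes the reference category
activity_ordered <- factor(activity, levels = c("Low", "Moderate", "High"))
levels(activity_ordered)
```

Changing the level order changes the row order in summary tables and the reference category in regression models, so the results reported here (which use High as the reference) would need to be regenerated.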

# list of variables you want in bivariate tables
vars_cat <- c("age_group","marital_status","education_level",
              "occupation","physical_activity","self_rated_health")
vars_num <- c("monthly_income")  # add other numeric vars if present

# By gender
tbl_by_gender <- df %>%
  select(all_of(c(vars_cat, vars_num)), gender) %>%
  tbl_summary(
    by = gender,
    statistic = all_continuous() ~ "{mean} ({sd})",
    digits = all_continuous() ~ 1,
    missing = "ifany"
  ) %>%
  add_p(
    test = list(all_categorical() ~ "chisq.test",
                all_continuous()  ~ "t.test")
  ) %>%
  add_q() %>%                    # optional: add Benjamini-Hochberg FDR q-values
  modify_header(label = "**Variable**") %>%
  bold_labels()

tbl_by_gender

Variable	Female N = 132¹	Male N = 118¹	p-value²	q-value³
age_group			0.6	>0.9
18-29	33 (25%)	36 (31%)
30-44	46 (35%)	43 (36%)
45-59	30 (23%)	24 (20%)
60+	23 (17%)	15 (13%)
marital_status			0.8	>0.9
Divorced	13 (9.8%)	12 (10%)
Married	72 (55%)	60 (51%)
Single	32 (24%)	35 (30%)
Widowed	15 (11%)	11 (9.3%)
education_level			0.8	>0.9
Illiterate	19 (14%)	12 (10%)
Primary	43 (33%)	39 (33%)
Secondary	46 (35%)	44 (37%)
University	24 (18%)	23 (19%)
occupation			0.7	>0.9
Business	23 (17%)	19 (16%)
Farmer	33 (25%)	39 (33%)
Service	38 (29%)	28 (24%)
Student	23 (17%)	18 (15%)
Unemployed	15 (11%)	14 (12%)
physical_activity			0.4	>0.9
High	32 (24%)	22 (19%)
Low	46 (35%)	49 (42%)
Moderate	54 (41%)	47 (40%)
self_rated_health			>0.9	>0.9
Excellent	24 (18%)	20 (17%)
Fair	50 (38%)	41 (35%)
Good	43 (33%)	42 (36%)
Poor	15 (11%)	15 (13%)
monthly_income	32,486.1 (16,010.1)	33,965.5 (16,107.5)	0.5	>0.9
¹ n (%); Mean (SD)
² Pearson’s Chi-squared test; Welch Two Sample t-test
³ False discovery rate correction for multiple testing

### Interpretation

When comparing males and females:

- Age group, marital status, education, and occupation showed no significant differences (all p ≥ 0.6).
- Physical activity (p = 0.4) and self-rated health (p > 0.9) were similarly distributed in both genders.
- Mean monthly income was slightly higher among males (33,965.5 BDT) than females (32,486.1 BDT), but the difference was not statistically significant (p = 0.5).

Overall, gender differences across all variables were small and not statistically significant (all q > 0.9), suggesting that demographic and health patterns are similar between males and females.
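The chi-squared p-values in the gender table can be checked by hand from the printed counts. As a rough sketch, the block below rebuilds the physical activity by gender cross-tabulation from the table and runs base R's `chisq.test()` on it; the object names are illustrative.

```r
# Counts copied from the gender table: rows = activity level, columns = gender
activity_by_gender <- matrix(
  c(32, 22,
    46, 49,
    54, 47),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    physical_activity = c("High", "Low", "Moderate"),
    gender            = c("Female", "Male")
  )
)

# Pearson's chi-squared test of independence on the 3 x 2 table
res <- chisq.test(activity_by_gender)

res$expected            # expected counts under independence
round(res$p.value, 1)   # compare with the p-value column in the table
```

Inspecting `res$expected` is also a quick way to confirm that the expected counts are large enough for the chi-squared approximation to be reasonable.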

tbl_by_disease <- df %>%
  select(all_of(c(vars_cat, vars_num)), chronic_disease) %>%
  tbl_summary(
    by = chronic_disease,
    statistic = all_continuous() ~ "{mean} ({sd})",
    digits = all_continuous() ~ 1,
    missing = "ifany"
  ) %>%
  add_p(
    test = list(all_categorical() ~ "chisq.test",
                all_continuous()  ~ "t.test")
  ) %>%
  add_q() %>%
  modify_header(label = "**Variable**") %>%
  bold_labels()

tbl_by_disease

Variable	No N = 163¹	Yes N = 87¹	p-value²	q-value³
age_group			0.7	0.9
18-29	46 (28%)	23 (26%)
30-44	61 (37%)	28 (32%)
45-59	33 (20%)	21 (24%)
60+	23 (14%)	15 (17%)
marital_status			0.3	0.8
Divorced	18 (11%)	7 (8.0%)
Married	91 (56%)	41 (47%)
Single	39 (24%)	28 (32%)
Widowed	15 (9.2%)	11 (13%)
education_level			0.3	0.8
Illiterate	16 (9.8%)	15 (17%)
Primary	57 (35%)	25 (29%)
Secondary	57 (35%)	33 (38%)
University	33 (20%)	14 (16%)
occupation			0.6	0.9
Business	26 (16%)	16 (18%)
Farmer	49 (30%)	23 (26%)
Service	40 (25%)	26 (30%)
Student	26 (16%)	15 (17%)
Unemployed	22 (13%)	7 (8.0%)
physical_activity			>0.9	>0.9
High	34 (21%)	20 (23%)
Low	62 (38%)	33 (38%)
Moderate	67 (41%)	34 (39%)
self_rated_health			0.6	0.9
Excellent	31 (19%)	13 (15%)
Fair	62 (38%)	29 (33%)
Good	51 (31%)	34 (39%)
Poor	19 (12%)	11 (13%)
monthly_income	31,445.6 (15,951.9)	36,442.1 (15,786.1)	0.019	0.13
¹ n (%); Mean (SD)
² Pearson’s Chi-squared test; Welch Two Sample t-test
³ False discovery rate correction for multiple testing
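The Welch t-test result for `monthly_income` in the table above can be reproduced approximately from the printed means, standard deviations, and group sizes. This is a sketch using the rounded summary statistics rather than the raw data, so the last digits may differ slightly from what `t.test()` gives on the full dataset.

```r
# Summary statistics copied from the monthly_income row (No vs Yes chronic disease)
m_no  <- 31445.6; s_no  <- 15951.9; n_no  <- 163
m_yes <- 36442.1; s_yes <- 15786.1; n_yes <- 87

# Welch's t statistic: difference in means over the unpooled standard error
v_no   <- s_no^2  / n_no
v_yes  <- s_yes^2 / n_yes
t_stat <- (m_yes - m_no) / sqrt(v_no + v_yes)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_no + v_yes)^2 / (v_no^2 / (n_no - 1) + v_yes^2 / (n_yes - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```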

### Interpretation

Comparing participants with and without chronic disease, there were no significant differences in age group, marital status, education, or occupation (all p ≥ 0.3; all q ≥ 0.8). Mean monthly income was higher among those with chronic disease (36,442 BDT vs. 31,446 BDT; p = 0.019), but the difference was no longer significant after the false discovery rate correction for multiple testing (q = 0.13). Physical activity (p > 0.9) and self-rated health (p = 0.6) were similarly distributed in both groups. This suggests that chronic disease was not strongly linked to socioeconomic or lifestyle differences in this sample, apart from a possible income-related trend.
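The q-value for income can be traced with base R's `p.adjust()`, which implements the Benjamini-Hochberg procedure. The sketch below feeds it the p-values as printed in the chronic disease table. These are rounded, and ">0.9" is entered as 0.9, which is an assumption, so only the smallest q-value should be expected to match the table closely.

```r
# p-values as printed in the chronic disease table (rounded; ">0.9" entered as 0.9,
# which is an assumption because the exact value is not shown)
p_printed <- c(
  age_group         = 0.7,
  marital_status    = 0.3,
  education_level   = 0.3,
  occupation        = 0.6,
  physical_activity = 0.9,
  self_rated_health = 0.6,
  monthly_income    = 0.019
)

# Benjamini-Hochberg adjustment across the seven tests
q_values <- p.adjust(p_printed, method = "BH")

round(q_values, 2)
```

For the smallest of seven p-values the adjusted value is 7 × p, here 7 × 0.019 ≈ 0.13, which agrees with the q-value reported for `monthly_income`.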

# outcome: chronic_disease (factor with levels No/Yes)
tbl_unadjusted <- df %>%
  select(chronic_disease, gender, age_group, education_level, physical_activity, monthly_income) %>%
  tbl_uvregression(
    method = glm,
    y = chronic_disease,
    method.args = list(family = binomial),
    exponentiate = TRUE
  ) %>%
  bold_labels()

tbl_unadjusted

Characteristic	N	OR	95% CI	p-value
gender	250
Female		—	—
Male		1.00	0.59, 1.68	>0.9
age_group	250
18-29		—	—
30-44		0.92	0.47, 1.80	0.8
45-59		1.27	0.60, 2.68	0.5
60+		1.30	0.57, 2.96	0.5
education_level	250
Illiterate		—	—
Primary		0.47	0.20, 1.09	0.079
Secondary		0.62	0.27, 1.42	0.3
University		0.45	0.17, 1.15	0.10
physical_activity	250
High		—	—
Low		0.90	0.45, 1.83	0.8
Moderate		0.86	0.43, 1.73	0.7
monthly_income	250	1.00	1.00, 1.00	0.020
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

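### Interpretation

In the unadjusted (univariable) logistic regression models, gender, age group, education, and physical activity were not significantly associated with chronic disease (all p > 0.05). Monthly income showed a statistically significant positive association (p = 0.020). Its odds ratio appears as 1.00 (95% CI 1.00, 1.00) only because it is expressed per 1 BDT of income and rounded to two decimals, so the table does not convey the size of the effect; expressing income in larger units (for example, per 10,000 BDT) would make it readable. These are crude, unadjusted associations and do not by themselves show that higher income increases the likelihood of chronic disease.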
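To illustrate why the unit of income matters for the displayed odds ratio, the sketch below fits the same logistic model twice on synthetic data, once with income in single units and once per 10,000 units. The data are simulated for illustration only (the coefficients and the uniform income distribution are arbitrary assumptions) and do not reproduce the study's estimates.

```r
set.seed(1)

# Synthetic data for illustration only; NOT the study data
n       <- 250
income  <- runif(n, min = 5000, max = 60000)
disease <- rbinom(n, size = 1, prob = plogis(-1.5 + 0.00002 * income))
sim     <- data.frame(disease, income, income_10k = income / 10000)

# Same model, two units for the predictor
fit_unit <- glm(disease ~ income,     family = binomial, data = sim)
fit_10k  <- glm(disease ~ income_10k, family = binomial, data = sim)

or_unit <- exp(coef(fit_unit)[["income"]])      # odds ratio per 1 unit of income
or_10k  <- exp(coef(fit_10k)[["income_10k"]])   # odds ratio per 10,000 units

round(c(per_unit = or_unit, per_10k = or_10k), 2)

# Both describe the same effect on different scales
all.equal(or_unit^10000, or_10k, tolerance = 1e-6)
```

In the actual analysis, the equivalent step would be to add a rescaled column (for example `monthly_income / 10000`) before calling `tbl_uvregression()`; the p-value is unchanged by the rescaling.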

Sudipta Das Gupta

2025-10-23
