Homework 3

Getting Started:

# Start Session
rm(list = ls())
gc()

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 543959 29.1    1210938 64.7   686460 36.7
## Vcells 993085  7.6    8388608 64.0  1876791 14.4

# Load Packages
library(readxl)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(sf)

## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE

library(sp)
library(tidyr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.3     ✔ stringr   1.5.1
## ✔ purrr     1.0.2     ✔ tibble    3.2.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(descr) 
library(leaflet)
library(ggthemes)
library(writexl)
library(readr)
library(haven)
library(knitr)

Part 1:

Basic Information:

The dataset used for this analysis is derived from the National Health and Nutrition Examination Survey (NHANES) for the period August 2021 – August 2023. This dataset is collected by the National Center for Health Statistics (NCHS), a division of the Centers for Disease Control and Prevention (CDC). It is publicly available on the NHANES website and is released in SAS transport format (xpt), which has been converted into Excel (xlsx) for this analysis. The data collection process follows standardized protocols to ensure accuracy and reliability.

For this analysis, data from three NHANES components will be used:

Demographic Data (DEMO_L) – Includes variables on age, gender, race/ethnicity, education level, and income-to-poverty ratio.
Alcohol Use Questionnaire Data (ALQ_L) – Captures information on alcohol consumption, binge drinking frequency, and drinking habits over the past 12 months.
Income Questionnaire Data (INQ_L) – Provides detailed insights into household income, monthly poverty level, and financial resources.

Research Questions:

I aim to explore the relationship between demographic characteristics, income level, and alcohol consumption behavior in the U.S. adult population. Key research questions include:

How does alcohol consumption frequency differ across income levels and demographic groups?
Is there a significant association between income level (poverty index) and binge drinking behavior?
Does education status play a role in influencing drinking habits?
What are the demographic factors that contribute to excessive alcohol consumption?

Variables of Interest

The following variables will be examined in the analysis:

Demographic Variables (DEMO_L)

RIDAGEYR – Age in years
RIAGENDR – Gender (Male/Female)
RIDRETH3 – Race/Ethnicity
INDFMPIR – Income-to-poverty ratio
DMDEDUC2 – Highest education level attained

Alcohol Use Variables (ALQ_L)

ALQ111 – Ever had a drink of alcohol (Yes/No)
ALQ121 – Frequency of alcohol consumption in the past 12 months
ALQ130 – Average number of drinks consumed per drinking day
ALQ142 – Number of days having 4+ (women) or 5+ (men) drinks in a single day (binge drinking)
ALQ170 – Number of times consuming 4+/5+ drinks in the past 30 days

Income Variables (INQ_L)

INDFMMPI – Monthly income-to-poverty index
INDFMMPC – Categorized income-to-poverty index
INQ300 – Whether the family has more than $20,000 in savings (Yes/No)

Planned Analyses:

Descriptive Statistics: Summarize alcohol consumption patterns and income distribution across different demographic groups.

Regression Analysis:

Logistic regression to assess whether income and demographic factors predict binge drinking.
Multiple linear regression to determine how income influences alcohol consumption frequency.

Data Visualization: Use bar charts, histograms, and scatterplots to illustrate patterns and correlations.
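The planned logistic regression could be sketched roughly as follows. This is only a minimal illustration on simulated data: the data frame `toy`, its columns, and the binary `binge` outcome are hypothetical stand-ins, not NHANES values. NHANES is a complex survey, so a real analysis would likely also need the survey weights and design variables (for example WTINT2YR, SDMVSTRA, and SDMVPSU), which this sketch ignores.

```r
# Simulated stand-in data; these values are NOT from NHANES.
set.seed(42)
n <- 500
toy <- data.frame(
  income_ratio = runif(n, 0, 5),                    # stand-in for INDFMPIR
  age          = sample(21:80, n, replace = TRUE),  # stand-in for RIDAGEYR
  gender       = factor(sample(c("Male", "Female"), n, replace = TRUE))
)

# Hypothetical binary outcome: 1 = at least one binge-drinking day reported
toy$binge <- rbinom(
  n, 1,
  plogis(-0.5 - 0.02 * (toy$age - 50) + 0.1 * toy$income_ratio)
)

# Logistic regression: do income and demographics predict binge drinking?
fit <- glm(binge ~ income_ratio + age + gender, data = toy, family = binomial)

# Odds ratios with Wald 95% confidence intervals
odds_ratios <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(odds_ratios, 2))
```

An odds ratio above 1 would indicate higher odds of binge drinking as the predictor increases, holding the other predictors fixed.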

Dataset(s):

# NHANES August 2021 – August 2023 Dataset
Demo <- read_xlsx("DEMO_L.xlsx")

## New names:
## • `` -> `...1`

Income <- read_xlsx("INQ_L.xlsx")

## New names:
## • `` -> `...1`

Alcohol <- read_xlsx("ALQ_L.xlsx")

## New names:
## • `` -> `...1`

# Variable names
names(c(Demo, Income, Alcohol))

##  [1] "...1"     "SEQN"     "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR"
##  [7] "RIDAGEMN" "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGM" "DMQMILIZ"
## [13] "DMDBORN4" "DMDYRUSR" "DMDEDUC2" "DMDMARTZ" "RIDEXPRG" "DMDHHSIZ"
## [19] "DMDHRGND" "DMDHRAGZ" "DMDHREDZ" "DMDHRMAZ" "DMDHSEDZ" "WTINT2YR"
## [25] "WTMEC2YR" "SDMVSTRA" "SDMVPSU"  "INDFMPIR" "...1"     "SEQN"    
## [31] "INDFMMPI" "INDFMMPC" "INQ300"   "IND310"   "...1"     "SEQN"    
## [37] "ALQ111"   "ALQ121"   "ALQ130"   "ALQ142"   "ALQ270"   "ALQ280"  
## [43] "ALQ151"   "ALQ170"
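Since all three files share the respondent identifier SEQN, they could be combined into one analysis table with joins. The sketch below uses small made-up data frames (`demo_toy`, `income_toy`, `alcohol_toy`) rather than the real files, so the values are purely illustrative. With the real data, the unnamed `...1` column that read_xlsx() reports would probably need to be dropped from each table first.

```r
library(dplyr)

# Toy stand-ins for the three tables; values are illustrative, not NHANES data.
demo_toy    <- data.frame(SEQN = c(1, 2, 3), RIDAGEYR = c(43, 66, 5))
income_toy  <- data.frame(SEQN = c(1, 2, 3), INDFMMPI = c(5, 1.4, 2.1))
alcohol_toy <- data.frame(SEQN = c(1, 2),    ALQ121   = c(3, 0))

# Keep every participant from the demographic table; participants with no
# alcohol questionnaire record get NA in the alcohol columns.
merged_toy <- demo_toy %>%
  left_join(income_toy,  by = "SEQN") %>%
  left_join(alcohol_toy, by = "SEQN")

print(merged_toy)
```

A left join is used here on the assumption that the demographic file is the reference list of participants; an inner join would instead keep only participants present in all three tables.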

Disclaimer:

The dataset may change over the semester, as I am not yet fully committed to it for my thesis. I would like to use the weekly homework assignments to find the dataset I want to use for my thesis.

Part 2:

To demonstrate my understanding of pivot_longer() and pivot_wider(), I will use the Demo dataset from Part 1.

The pivot_longer() and pivot_wider() functions from the tidyr package (part of the tidyverse) reshape datasets between two common structures: long format and wide format. pivot_longer() transforms data from a wide format (where multiple columns represent different variables) into a long format (where one column contains the variable names and another holds their values). This is useful when dealing with repeated measurements across multiple time points or categories, because it makes the data easier to analyze with group-based operations. Conversely, pivot_wider() reverses this process by spreading the values of a categorical variable back into separate columns, returning the data to wide format.

In the code below, the dataset is first subset to key demographic variables from the NHANES data (Age, Gender, and Income Poverty Ratio, all renamed for readability) for the first 200 participants. Because an initial error was encountered due to a mismatch in data types, mutate(across(..., as.numeric)) converts all selected columns to numeric before pivoting. pivot_longer() then restructures the dataset from wide to long format, creating a “Variable” column that stores the original column names (Age, Gender, and Income_Poverty_Ratio) and a “Value” column that holds the corresponding data. Finally, pivot_wider() converts the dataset back into its original wide format, restoring each variable as a separate column for each participant. Reshaping in both directions lets the same data serve different analytical needs, such as visualization, statistical modeling, or machine learning applications.

The kable() function is also demonstrated here, since Professor Song recommended it in the comments on last week's homework assignment. kable(), from the knitr package, formats data frames or tables into well-structured, readable tables for Markdown or HTML output.
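The type mismatch mentioned above can be reproduced on a small example. The sketch below uses a made-up data frame `mixed` (not the NHANES data) in which one column is numeric and another is character, which is an assumption about how the columns may have arrived from Excel. It shows two possible ways around the problem: converting the columns beforehand, or using the values_transform argument of pivot_longer().

```r
library(tidyr)

# Toy table with mixed column types; values are illustrative only.
mixed <- data.frame(
  ID     = c(1, 2),
  Age    = c(43, 66),      # numeric
  Gender = c("1", "2"),    # character, as it might arrive from Excel
  stringsAsFactors = FALSE
)

# Pivoting directly fails: Age and Gender cannot share one "Value" column type.
result <- tryCatch(
  pivot_longer(mixed, cols = -ID, names_to = "Variable", values_to = "Value"),
  error = function(e) "pivot failed: incompatible column types"
)
print(result)

# Option A: convert to a common type first (the approach used in this homework).
mixed_num <- mixed
mixed_num$Gender <- as.numeric(mixed_num$Gender)
long_num <- pivot_longer(mixed_num, cols = -ID,
                         names_to = "Variable", values_to = "Value")

# Option B: let pivot_longer() coerce each column while pivoting.
long_chr <- pivot_longer(mixed, cols = -ID,
                         names_to = "Variable", values_to = "Value",
                         values_transform = list(Value = as.character))
```

Option A keeps the values numeric, which suits later calculations; Option B avoids coercion warnings but leaves every value as text.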

# Selecting relevant columns and first 200 rows for the example using Demo Dataset
demo_subset <- Demo %>%
  select(SEQN, RIDAGEYR, RIAGENDR, INDFMPIR) %>%
  rename(ID = SEQN, Age = RIDAGEYR, Gender = RIAGENDR, Income_Poverty_Ratio = INDFMPIR) %>%
  head(200)

# Converting all numeric-like columns to numeric type because initially I got an error
demo_subset <- demo_subset %>%
  mutate(across(c(Age, Gender, Income_Poverty_Ratio), as.numeric))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(c(Age, Gender, Income_Poverty_Ratio), as.numeric)`.
## Caused by warning:
## ! NAs introduced by coercion

# Checking if conversion was successful
str(demo_subset)

## tibble [200 × 4] (S3: tbl_df/tbl/data.frame)
##  $ ID                  : num [1:200] 130378 130379 130380 130381 130382 ...
##  $ Age                 : num [1:200] 43 66 44 5 2 3 43 65 34 68 ...
##  $ Gender              : num [1:200] 1 1 2 2 1 2 1 2 1 2 ...
##  $ Income_Poverty_Ratio: num [1:200] 5 5 1.41 1.53 3.6 NA 0.63 5 1.33 1.32 ...

# Viewing the original wide format
head(demo_subset)

## # A tibble: 6 × 4
##       ID   Age Gender Income_Poverty_Ratio
##    <dbl> <dbl>  <dbl>                <dbl>
## 1 130378    43      1                 5   
## 2 130379    66      1                 5   
## 3 130380    44      2                 1.41
## 4 130381     5      2                 1.53
## 5 130382     2      1                 3.6 
## 6 130383     3      2                NA

# Converting from WIDE to LONG format using pivot_longer()
demo_long <- demo_subset %>%
  pivot_longer(cols = -ID,  # Select all columns except ID
               names_to = "Variable", 
               values_to = "Value")

# Viewing transformed long format
head(demo_long)

## # A tibble: 6 × 3
##       ID Variable             Value
##    <dbl> <chr>                <dbl>
## 1 130378 Age                     43
## 2 130378 Gender                   1
## 3 130378 Income_Poverty_Ratio     5
## 4 130379 Age                     66
## 5 130379 Gender                   1
## 6 130379 Income_Poverty_Ratio     5

# Converting back from LONG to WIDE format using pivot_wider()
demo_wide <- demo_long %>%
  pivot_wider(names_from = Variable, 
              values_from = Value)

# View reconstructed wide format
head(demo_wide)

## # A tibble: 6 × 4
##       ID   Age Gender Income_Poverty_Ratio
##    <dbl> <dbl>  <dbl>                <dbl>
## 1 130378    43      1                 5   
## 2 130379    66      1                 5   
## 3 130380    44      2                 1.41
## 4 130381     5      2                 1.53
## 5 130382     2      1                 3.6 
## 6 130383     3      2                NA
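Rather than comparing the printed tables by eye, the round trip could also be checked programmatically. The following is a minimal sketch on a made-up data frame (`original` is hypothetical, not the NHANES subset); the same idea should carry over to demo_subset and demo_wide.

```r
library(tidyr)

# Toy wide table; values are illustrative only.
original <- data.frame(ID = c(1, 2), Age = c(43, 66), Gender = c(1, 2))

# Wide -> long -> wide
long <- pivot_longer(original, cols = -ID,
                     names_to = "Variable", values_to = "Value")
wide <- pivot_wider(long, names_from = Variable, values_from = Value)

# Compare the reconstructed table with the starting point.
round_trip_ok <- isTRUE(all.equal(original, as.data.frame(wide)))
print(round_trip_ok)
```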

# Demonstrating kable(), as suggested in the Homework 2 comments
kable(head(demo_long))

ID	Variable	Value
130378	Age	43
130378	Gender	1
130378	Income_Poverty_Ratio	5
130379	Age	66
130379	Gender	1
130379	Income_Poverty_Ratio	5

kable(head(demo_wide))

ID	Age	Gender	Income_Poverty_Ratio
130378	43	1	5.00
130379	66	1	5.00
130380	44	2	1.41
130381	5	2	1.53
130382	2	1	3.60
130383	3	2	NA
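kable() also accepts a few optional arguments that can make a table more readable. The sketch below is one possible use, with a made-up data frame `toy_wide` standing in for demo_wide; the column labels and caption are illustrative choices, not part of the assignment.

```r
library(knitr)

# Toy rows mirroring the structure of demo_wide; values are illustrative only.
toy_wide <- data.frame(
  ID = c(1, 2, 3),
  Age = c(43, 66, 5),
  Gender = c(1, 1, 2),
  Income_Poverty_Ratio = c(5, 1.412, NA)
)

tbl <- kable(
  toy_wide,
  digits    = 2,  # round numeric columns to two decimals
  col.names = c("ID", "Age", "Gender", "Income-to-Poverty Ratio"),
  caption   = "Illustrative participant table (toy data)"
)
tbl
```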