Poverty and Diabetes: Is There A Connection?

library(ROSE)
library(tidyverse)
library(magrittr)
library(janitor)
library(tidyverse)
library(RODBC)
library(pscl)
library(caret)
library(car)
library(InformationValue)
library(rvest)
library(jsonlite)
library(xml2)
library(XML)
library(stringr)
library(kableExtra)

Diabetes and Poverty: How does poverty affect health outcomes?

We know that poverty affects health outcomes. We also know that obesity, poor diet choices, and poor lifestyle choices often result from poverty and inadequate access to resources. However, what are the actual poverty markers that lead to these outcomes - is it lack of transportation options? Housing insecurity? Lack of High School defree? In this study we will take a deeper look. To answer these questions we will examine data from a client survey on economic needs from my workplace. We will supplement the survey info with other client health-related databases from my workplace, as well as data from outside sources such as the NY Department of Health.

Altogether we will use data from 4 different sources:

Files from a proprietary survey of economic needs, stripped of identity markers (stored in a local csv file)
Tables of client health information in a proprietary SQL Server database, stripped of identity markers and password protected through keyring (SQL Server database on Azure)
Health data from the New York Department of Health (in json format accessed through APIs)
Zip code level data on population size and poverty (scraped from html tables on the web)

After collecting the data we will store it in the database so that we can easily combine the datasets in productive ways using SQL. In particular, we will generate dataframes that contain zip code level information on poverty and public health, and client level information on economic needs and individual client health. Then we will test for correlations and other relationships within the data.

Note - Certain operations that load data and are very time consuming will be contained within functions that can be run or not by changing the value of the boolean variable “run” which appears at the top of each function.

Loading The Data

1. CSV file of a Survey of Client Economic Needs

This file is the result of a client survey taken at my workplace of over 3,000 clients. We did not create the survey instrument - the file was provided by the organization that did. The Ids have been mathced to those of the clients in the SQL database. The file is otherwise free of any personal identifying information.

For security purposes these files are stored locally. However, they have already been read into the sql server database and can be accessed from there through keyring. Leave “run = false” in the following code, and in the code in which the file is read into the database, to use this rmd successfully with the survey.

We also input a file that simply contains the list of survey needs.

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
dfSDOH_raw <- read.csv("D:\\Buffer\\SDOH__En__1_0 (3).csv", encoding = "UTF-8")
}

dfListOfNeeds <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Final%20Project/ListOfNeeds.csv", encoding = "UTF-8")
dfListOfNeeds %<>%
  rename(Need=X.U.FEFF.Need)

The survey asks these questions:

kableExtra::kable(dfListOfNeeds)

Need
Do you or your family ever feel hungry?
Do you have a safe place to live right now?
Are you worried about losing your housing?
Do you have a job?
Are you able to afford basic needs?
Do you have a high school diploma or a GED?
Do you ever need help understanding what your doctor tells you?
Are you sad or lonely?
Do problems with childcare make it hard for you?
Do you ever have a problem getting the right type of clothes for each season?
Do you have access to transportation to get where you need to go?
Do you feel safe: emotionally, physically?

Since the survey instrument and resultant data were created and delivered by an outside agency, the data needs extensive reforming. The file contains 800,000 observations with only two columns- the columns are “Field” and “Value”. Each record takes up 211 observations.

We begin by pivoting wider every 211 records, taking our names from “Field” and our values from “Value”, and binding each to a final dataframe.

run=FALSE

if (run) {
dfSDOH <- dfSDOH_raw %>%
    filter(as.numeric(Num) <=211) %>%
    pivot_wider(Org, names_from = "Field", values_from = "Value")

dfFinal <- dfSDOH

x=211

for (i in 1: 3752)
{

dfSDOH <- dfSDOH_raw %>%
    filter(as.numeric(Num) > x & as.numeric(Num)<= (x+211)) %>%
    pivot_wider(Org, names_from = "Field", values_from = "Value")

x=x+211

dfFinal <- rbind(dfFinal, dfSDOH)
}
}

We eliminate bad records that have no ClientID.

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
dfFinal %<>%
  filter (ClientID > 0)
}

We select the columns which describe needs, and rename them to reflect the type of need.

NeedsVector = c("Need_Food",  "Need_SafePlace",  "Need_LoseHousing",  "Need_Job",  "Need_AffordNeeds",  "Need_HighSchool",  "Need_HelpUnderstanding",  "Need_Sad",  "Need_Childcare",  "Need_Clothes",  "Need_Transport",  "Need_Safe") 


#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {

dfFinal2 <- dfFinal %>%
  rename(Need_Food=DoYouOrYou, Need_SafePlace = DoYouHaveA, Need_LoseHousing = AreYouWorr, Need_Job = DoYouHaveA1, Need_AffordNeeds = AreYouAble, Need_HighSchool = DoYouHaveA2, Need_HelpUnderstanding=DoYouEverN, Need_Sad = SocialAreY, Need_Childcare = ChildCareD, Need_Clothes = DoYouEverH, Need_Transport = DoYouHaveA3, Need_Safe = DoYouFeelS )

dfFinal3 <- dfFinal2 %>%
  select(ClientID, starts_with('Need_'))
}

Each of the 12 need columns is binary, containing either a yes or no in response to a question about a need. However, sometimes ‘yes’ means the client has a need (“Do you or your family ever go hungry”) and sometimes ‘yes’ means the client does not have a need (“Do you feel safe physically and emotionally”).

We will recode the questions so that 1 means ‘has a need’ and 0 means ‘does not have a need’ for all variables.

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
QCorrect<- function(x){
    ifelse(x=="Yes", 1,0)
}

QReverse<- function(x){
    ifelse(x=="No", 1,0)
}
 
dfFinal4 <- dfFinal3 %>%
  select(Need_Food, Need_Sad, Need_HelpUnderstanding, Need_Childcare, Need_Clothes, Need_LoseHousing) %>%
   mutate_all(QCorrect)

dfFinal5 <- dfFinal3 %>%
  select(Need_SafePlace, Need_Job, Need_AffordNeeds, Need_HighSchool, Need_Transport, Need_Safe) %>%
   mutate_all(QReverse)

dfFinal6 <- dfFinal3 %>%
  select(ClientID)

dfFinal6$ClientID <- as.numeric(as.character(dfFinal6$ClientID))

dfFinal10 <- cbind(dfFinal6, dfFinal4, dfFinal5)
}

2. SQL Server Database Data

I built this database for my workplace. It contains information on client health screens for diabetes, hypertension and other chronic diseases. We will use three tables:

Demographic data: A table of 5,433 unique clients with demographic data and the client’s first A1C result (A1C is a test that indicates diabetes.)
Health data: A table of 39,491 observations for 8,546 unique clients with A1C, Blood Pressure and other health screen data.
Vaccine data: A table of 2122 unique clients with info on vaccine status.

The data is password protected by keyring. Here we create a connection.

## Please enter password in TK window (Alt+Tab)

# This is the connection string:

strConnection <- paste0(
  'Driver={ODBC Driver 13 for SQL Server};
   Server=tcp:ehtmp.database.windows.net,1433;
   Database=HC_A1C;
   Encrypt=yes;
   TrustServerCertificate=no;
   Connection Timeout=30;',
   'Uid=',keyring::key_get(service = "EH_606_username", keyring = "EH_606_keyring"),';',
   'Pwd=', keyring::key_get(service = "EH_606_pwd", keyring = "EH_606_keyring"), ';'
)

library(RODBC)
dbConnection <- odbcDriverConnect(strConnection)
#keyring_lock

3. New York DOH Data

We read the New York DOH data from the APIs. The RSocrata library makes this easy.

library("RSocrata")

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE 

if(run) {


#Community Health Obesity and Diabetes Related Indicators: 2008 - 2012
dfCommunityHealthObesityAndDiabetes <- read.socrata(
  "https://health.data.ny.gov/resource/tchg-ruva.json")

#Community Health: Age-adjusted percentage of adults with physician diagnosed diabetes: 2008 - 2009
dfAdultsDiagnosed <- read.socrata(
  "https://health.data.ny.gov/resource/9j5w-7zpd.json")

#AH Provisional Diabetes Death Counts, 2020
dfDeath <- read.socrata(
  "https://data.cdc.gov/resource/qdcb-uzft.json")

#500 Cities: Diagnosed diabetes among adults aged >=18 years
dfCityComparison <- read.socrata(
  "https://chronicdata.cdc.gov/resource/cn78-b9bj.json")

#Conditions contributing to deaths involving coronavirus disease 2019 (COVID-19), by age group and state, United States.
dfDeathsForCovid <- read.socrata(
  "https://data.cdc.gov/resource/hk9y-quqm.json")

#Community Health: Diabetes Short-term Complications Hospitalization Rate per 10,000 - Aged 18+ Years by County Map: Latest Data
dfHospitalizations <- read.socrata(
  "https://health.data.ny.gov/resource/xuwq-ppg8.json")

#Medicaid Chronic Conditions, Inpatient Admissions and Emergency Room Visits by Zip Code: Beginning 2012
dfMedicaidByZip <- read.socrata(
  "https://health.data.ny.gov/resource/2yck-xisk.json")

}

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE 

if(run) {
#Medicaid Program Enrollment by Month: Beginning 2009
dfMedicaidEnrollment <- read.socrata(
  "https://health.data.ny.gov/resource/m4hz-kzn3.json")
}

4. Zip Code Level Population and Poverty Data Scraped From the web

We scrape and then clean 17 pages of zipatlas data to compile a table of zip codes, population levels and poverty levels. We also read a csv file, taken from the web, which contains information on zip code and broadband access. The file is compiled by the city of New York and is downloaded from here:

https://data.cityofnewyork.us/City-Government/Broadband-Adoption-and-Infrastructure-by-Zip-Code/qz5f-yx82

library(XML)

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {


x="http://www.zipatlas.com/us/ny/zip-code-comparison/population-below-poverty-level.htm"
dfPopAndPovertyLevel = as.data.frame(readHTMLTable(x, header=T,which=5,strings2factors=F))

for (i in 1:16) {
x= str_c("http://www.zipatlas.com/us/ny/zip-code-comparison/population-below-poverty-level.", i, ".htm")
dfx = as.data.frame(readHTMLTable(x, header=T,which=5,strings2factors=F))
dfPopAndPovertyLevel=rbind(dfPopAndPovertyLevel, dfx)
}
}

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
  
dfPopAndPovertyLevel %<>%
  filter(str_length(V2)==5 & !is.na(V4)) %<>%
  rename(number = V1, zip=V2, location=V3, city=V4, population=V5, percent_below_poverty=V6, rank=V7) %<>%
  mutate(population = as.numeric(gsub(",", "", population))) %<>%
  mutate(percent_below_poverty = as.numeric(gsub(" %", "", percent_below_poverty))) %<>%
  mutate(rank = as.numeric(gsub("#", "", rank)))

}


dfBroadband_raw <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Final%20Project/Broadband_Adoption.csv", encoding = "UTF-8")

Loading all of the Datasets into the SQL Server Database

If we load our datasets into persistent storage (the SQL Sever database), then we can use SQL to easily combine data from several tables and read the result back into a dataframe. Just as important, many of the datasets in this project are very large and take a long time to load. Therefore we can filter and prepare them, read them into the database, and then use the database exclusively to read them back to cut down on load time.

We begin with 3 NYDOH files: MedicaidByZip, a NYDOH file with over a million records on medicaid information; dfAdultsDiagnosed which has 63 records about Diabetes diagnoses throughout New York State; and dfCommunityHealthObesityAndDiabtes, a dataset with 2733 records with detailed county-level information about diabetes.

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE
if(run) {

DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblMedicaidByZip")
PopulateTable <- sqlSave(dbConnection, dfMedicaidByZip, "tblMedicaidByZip", append=TRUE)
}

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE
if(run) {

DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblAdultsDiagnosed")
PopulateTable <- sqlSave(dbConnection, dfAdultsDiagnosed, "tblAdultsDiagnosed", append=TRUE)
}

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE
if(run) {

DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblCommunityHealth")
PopulateTable <- sqlSave(dbConnection, dfCommunityHealthObesityAndDiabetes, "tblCommunityHealth", append=TRUE)
}

Next we clean and load the scraped dataset with 1711 zip code level observations on population and poverty in NY State:

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblPopAndPovertyLevel")
PopulateTable <- sqlSave(dbConnection, dfPopAndPovertyLevel, "tblPopAndPovertyLevel", append=TRUE)
}

Next we load the survey data into a table called tblNeeds:

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {

  
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblNeeds")
PopulateTable <- sqlSave(dbConnection, dfFinal10, "tblNeeds", append=TRUE)
}

Finally, we add the broadband_by_zip_code dataframe

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {

DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblBroadBand")
PopulateTable <- sqlSave(dbConnection, dfBroadband_raw, "tblBroadband", append=TRUE)
}

Creating Dataframes for Analysis By Joining Tables in the SQL Server Database

We will create the following dataframes by combining elements from a number of different tables in the database:

dfZipInfo, a dataframe with 99 observations at the zip code level with information on population level, poverty level, medicaid Info, average A1C from client data, and broadband info.
dfdNeedsInfo, a dataframe with client-level observations with information on needs from the survey, chronic diseases, demographics and initial A1C. The dataframe only includes only those clients who took the survey AND have been screened for diabetes.
dfNeedsInfo_Vaccine, a dataframe with 304 client-level observations with information on needs from the survey and vaccine information. The dataframe only includes only those clients who took the survey AND have been screened for diabetes.
dfA1CAndDemographics, a dataframe with 5433 observations at the client level with information on demographics and initial A1C readings.

dfZipInfo_raw <- sqlQuery(dbConnection, "SELECT tblMedicaidByZip.zip_code, tblMedicaidByZip.major_diagnostic_category, tblPopAndPovertyLevel.population, tblPopAndPovertyLevel.percent_below_poverty, tblMedicaidByZip.ip_admits, tblMedicaidByZip.er_visits, tblA1CAndDemographics.ClientID, tblA1CAndDemographics.Zip, tblA1CAndDemographics.InitialA1C, tblA1CAndDemographics.Gender, tblA1CAndDemographics.RaceandEthnicity, tblA1CAndDemographics.Age, tblBroadband.NoInternetAccessPercentageOfHouseholds
FROM (tblMedicaidByZip INNER JOIN (tblA1CAndDemographics INNER JOIN tblPopAndPovertyLevel ON tblA1CAndDemographics.Zip = tblPopAndPovertyLevel.zip) ON tblMedicaidByZip.zip_code = tblPopAndPovertyLevel.zip) INNER JOIN tblBroadband ON tblPopAndPovertyLevel.zip = tblBroadband.ZipCode;
")


dfZipInfo10 <- dfZipInfo_raw %>%
  filter(major_diagnostic_category=="Diabetes Mellitus") %>%
  na.omit() %>%
  group_by(zip_code) %>%
  summarize(n=n(), ave_A1C=mean(InitialA1C), pov=mean(percent_below_poverty), pop=mean(population), NoInternet=mean(NoInternetAccessPercentageOfHouseholds*100), percent_ip = sum(ip_admits/population), percent_er=sum(er_visits/population))


dfZipInfo_SI <- dfZipInfo10 %>%
  filter(as.numeric(zip_code) >= 10301 & as.numeric(zip_code) <= 10314)

dfA1CAndDemographics <- sqlQuery(dbConnection, "SELECT * FROM tblA1CAndDemographics")

dfA1CAndDemographics %<>% 
  mutate(BlackOrNot = ifelse(RaceandEthnicity=="Black/African American", 1, 0)) %<>%
  mutate(MaleOrNot = ifelse(Gender=="Male", 1, 0)) %>%
  mutate(A1CAboveMean = ifelse(InitialA1C > 7, 1, 0))

dfNeedsInfo_raw <- sqlQuery(dbConnection, "SELECT tblNeeds.ClientID, tblChronic.Chronic, tblReadings.Type,  tblNeeds.Need_Food, tblNeeds.Need_Sad, tblNeeds.Need_HelpUnderstanding, tblNeeds.Need_Childcare, tblNeeds.Need_Clothes, tblNeeds.Need_LoseHousing, tblNeeds.Need_SafePlace, tblNeeds.Need_Job, tblNeeds.Need_AffordNeeds, tblNeeds.Need_HighSchool, tblNeeds.Need_Transport, tblNeeds.Need_Safe
FROM tblReadings INNER JOIN (tblChronic RIGHT JOIN tblNeeds ON tblChronic.ClientID = tblNeeds.ClientID) ON tblReadings.ClientID = tblNeeds.ClientID
WHERE tblReadings.Type='A1C';")

    
dfNeedsInfo <- dfNeedsInfo_raw %>%
    group_by(ClientID) %>%
  
    mutate(DiabetesOrNot = ifelse(Chronic == "Diabetes Type I"|Chronic=="Diabetes Type II"|Chronic == "Diabetes (doctor denied)"|Chronic == "Diabetes type I", 1, 0)) %>%
    mutate(HypertensionOrNot = ifelse(Chronic == "Hypertension", 1,0)) %>%
    mutate(HypertensionOrDiabetesOrNot = ifelse(Chronic == "Diabetes Type I"|Chronic=="Diabetes Type II"|Chronic == "Hypertension", 1,0)) %>%
  
    summarize(sum(DiabetesOrNot), sum(HypertensionOrNot), sum(HypertensionOrDiabetesOrNot), across(NeedsVector, mean, na.rm = TRUE)) %>%
  
    rename(DiabetesOrNot="sum(DiabetesOrNot)") %>%
      rename(HypertensionOrNot="sum(HypertensionOrNot)") %>%
      rename(HypertensionOrDiabetesOrNot="sum(HypertensionOrDiabetesOrNot)") %>%
  
      mutate(DiabetesOrNot = ifelse(DiabetesOrNot>0, 1, 0)) %>%
      mutate(HypertensionOrNot = ifelse(HypertensionOrNot>0, 1, 0)) %>%
      mutate(HypertensionOrDiabetesOrNot = ifelse(HypertensionOrDiabetesOrNot>0, 1, 0)) %>%
      mutate(NumOfNeeds = Need_Sad+Need_HelpUnderstanding+Need_Childcare+Need_Clothes+ Need_LoseHousing+Need_SafePlace+ Need_Job+Need_AffordNeeds+Need_HighSchool+Need_Transport+Need_Safe) 

dfNeedsInfo$DiabetesOrNot <- replace(dfNeedsInfo$DiabetesOrNot, is.na(dfNeedsInfo$DiabetesOrNot), 0)

dfNeedsInfo_Vaccine <- sqlQuery(dbConnection, "SELECT tblVaccine.*, tblNeeds.Need_Food, tblNeeds.Need_Sad, tblNeeds.Need_HelpUnderstanding, tblNeeds.Need_Childcare, tblNeeds.Need_Clothes, tblNeeds.Need_LoseHousing, tblNeeds.Need_SafePlace, tblNeeds.Need_Job, tblNeeds.Need_AffordNeeds, tblNeeds.Need_HighSchool, tblNeeds.Need_Transport, tblNeeds.Need_Safe
FROM tblNeeds INNER JOIN tblVaccine ON tblVaccine.ClientID = tblNeeds.ClientID;")

dfNeedsInfo3_raw <- sqlQuery(dbConnection, "SELECT * from tblNeeds")

These are the final dataframes which will be used for analysis:

kable(psych::describe(dfZipInfo10), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
zip_code	1	99	10732.727273	545.141993	1.047300e+04	1.073309e+04	6.983046e+02	1.00010e+04	11694.00000	1693.00000	-0.0643484	-1.7025451	54.7888317
n	2	99	1608.767677	5167.760737	1.410000e+02	2.300000e+02	1.690164e+02	6.00000e+00	30874.00000	30868.00000	3.9544099	15.6486294	519.3794961
ave_A1C	3	99	6.076375	1.541382	5.802632e+00	5.853841e+00	5.891384e-01	0.00000e+00	13.30000	13.30000	1.5766543	7.9789903	0.1549147
pov	4	99	29.846162	17.291370	2.396000e+01	2.874667e+01	1.826563e+01	4.96000e+00	70.12000	65.16000	0.5426009	-0.8320327	1.7378481
pop	5	99	53790.494949	24502.945900	5.387700e+04	5.346051e+04	3.139406e+04	1.13540e+04	106154.00000	94800.00000	0.1239264	-1.0908126	2462.6387214
NoInternet	6	99	17.884646	6.273084	1.715000e+01	1.777926e+01	6.523440e+00	5.95000e+00	32.36000	26.41000	0.2072927	-0.6523270	0.6304687
percent_ip	7	99	4.578043	15.794267	3.664816e-01	6.727343e-01	4.551523e-01	1.14795e-02	97.95367	97.94219	4.3155717	18.5438846	1.5873836
percent_er	8	99	5.872413	21.252716	3.893043e-01	7.301394e-01	4.936495e-01	1.46325e-02	132.47984	132.46520	4.3585059	18.7755648	2.1359784

dfNeedsInfo

kable(psych::describe(dfNeedsInfo), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
ClientID	1	348	2.966458e+04	1.575818e+04	25923	2.916106e+04	18595.51	7168	64237	57069	0.3037166	-1.3249444	844.7271667
DiabetesOrNot	2	348	1.896552e-01	3.925926e-01	0	1.142857e-01	0.00	0	1	1	1.5764586	0.4866449	0.0210452
HypertensionOrNot	3	233	6.523605e-01	4.772461e-01	1	6.898396e-01	0.00	0	1	1	-0.6357584	-1.6026343	0.0312654
HypertensionOrDiabetesOrNot	4	233	8.068670e-01	3.956063e-01	1	8.823529e-01	0.00	0	1	1	-1.5447171	0.3878708	0.0259170
Need_Food	5	348	8.807471e-01	3.189546e-01	1	9.732143e-01	0.00	0	1	1	-2.3384605	3.5621568	0.0170978
Need_SafePlace	6	348	1.149430e-02	9.976950e-02	0	0.000000e+00	0.00	0	1	1	9.0577938	83.3778152	0.0053482
Need_LoseHousing	7	348	1.724140e-02	1.247077e-01	0	0.000000e+00	0.00	0	1	1	7.3623562	53.7030566	0.0066850
Need_Job	8	348	2.159962e-01	3.973691e-01	0	1.470238e-01	0.00	0	1	1	1.3748729	-0.0018337	0.0213012
Need_AffordNeeds	9	348	7.854410e-02	2.600390e-01	0	0.000000e+00	0.00	0	1	1	3.1196035	8.0028700	0.0139396
Need_HighSchool	10	348	8.812260e-02	2.801915e-01	0	0.000000e+00	0.00	0	1	1	2.8878253	6.4365180	0.0150198
Need_HelpUnderstanding	11	348	2.155170e-02	1.429250e-01	0	0.000000e+00	0.00	0	1	1	6.5581311	41.5425502	0.0076616
Need_Sad	12	348	1.963600e-02	1.227322e-01	0	0.000000e+00	0.00	0	1	1	6.7636810	47.2894576	0.0065791
Need_Childcare	13	348	1.149430e-02	1.067468e-01	0	0.000000e+00	0.00	0	1	1	9.1263064	81.5237573	0.0057222
Need_Clothes	14	348	4.022990e-02	1.893168e-01	0	0.000000e+00	0.00	0	1	1	4.6484797	20.1796933	0.0101484
Need_Transport	15	348	7.183900e-03	8.020220e-02	0	0.000000e+00	0.00	0	1	1	11.5677700	136.0490731	0.0042993
Need_Safe	16	348	3.879310e-02	1.838297e-01	0	0.000000e+00	0.00	0	1	1	4.7366709	21.1986878	0.0098543
NumOfNeeds	17	348	5.502874e-01	9.631922e-01	0	3.500000e-01	0.00	0	8	8	3.1730058	16.0423113	0.0516325

dfNeedsInfo_Vaccine

kableExtra::kable(psych::describe(dfNeedsInfo_Vaccine), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
rownames	1	304	8.146513e+02	4.696786e+02	749	8.035615e+02	578.9553	8	2081	2073	0.2284195	-0.9306780	26.9379129
ClientID	2	304	4.803288e+04	1.770246e+04	54953	5.004211e+04	13790.4039	6908	68217	61309	-0.8006545	-0.6991243	1015.3055418
SumOfVaccinated	3	304	6.250000e-01	4.849211e-01	1	6.557377e-01	0.0000	0	1	1	-0.5138519	-1.7416530	0.0278121
SumOfNotVaccinated	4	304	3.947368e-01	4.896000e-01	0	3.688525e-01	0.0000	0	1	1	0.4285821	-1.8222793	0.0280805
SumOfWantsVaccine	5	304	7.894740e-02	2.701012e-01	0	0.000000e+00	0.0000	0	1	1	3.1074840	7.6817580	0.0154914
SumOfUnsure	6	304	2.763158e-01	4.479122e-01	0	2.213115e-01	0.0000	0	1	1	0.9955005	-1.0122761	0.0256895
SumOfAppointmentdatewithSunRiverReferraltoJenille	7	304	7.894740e-02	2.701012e-01	0	0.000000e+00	0.0000	0	1	1	3.1074840	7.6817580	0.0154914
SumOfNoAnswer	8	304	1.973680e-02	1.393239e-01	0	0.000000e+00	0.0000	0	1	1	6.8715176	45.3670198	0.0079908
DateOfContact*	9	304	1.068421e+01	6.401239e+00	11	1.052869e+01	8.8956	1	22	21	0.0722340	-1.1901540	0.3671362
Need_Food	10	304	9.769737e-01	1.502343e-01	1	1.000000e+00	0.0000	0	1	1	-6.3288430	38.1798775	0.0086165
Need_Sad	11	304	9.868400e-03	9.901150e-02	0	0.000000e+00	0.0000	0	1	1	9.8679276	95.6908007	0.0056787
Need_HelpUnderstanding	12	304	0.000000e+00	0.000000e+00	0	0.000000e+00	0.0000	0	0	0	NaN	NaN	0.0000000
Need_Childcare	13	304	0.000000e+00	0.000000e+00	0	0.000000e+00	0.0000	0	0	0	NaN	NaN	0.0000000
Need_Clothes	14	304	9.868400e-03	9.901150e-02	0	0.000000e+00	0.0000	0	1	1	9.8679276	95.6908007	0.0056787
Need_LoseHousing	15	304	3.289500e-03	5.735390e-02	0	0.000000e+00	0.0000	0	1	1	17.2639113	297.0197044	0.0032895
Need_SafePlace	16	304	0.000000e+00	0.000000e+00	0	0.000000e+00	0.0000	0	0	0	NaN	NaN	0.0000000
Need_Job	17	304	5.263160e-02	2.236651e-01	0	0.000000e+00	0.0000	0	1	1	3.9871836	13.9435325	0.0128281
Need_AffordNeeds	18	304	4.276320e-02	2.026563e-01	0	0.000000e+00	0.0000	0	1	1	4.4975934	18.2885386	0.0116231
Need_HighSchool	19	304	9.868400e-03	9.901150e-02	0	0.000000e+00	0.0000	0	1	1	9.8679276	95.6908007	0.0056787
Need_Transport	20	304	0.000000e+00	0.000000e+00	0	0.000000e+00	0.0000	0	0	0	NaN	NaN	0.0000000
Need_Safe	21	304	1.644740e-02	1.273980e-01	0	0.000000e+00	0.0000	0	1	1	7.5662433	55.4304067	0.0073068

dfA1CAndDemographics

kableExtra::kable(psych::describe(dfA1CAndDemographics), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
rownames	1	5433	2.717000e+03	1.568516e+03	2717.0	2.717000e+03	2013.37080	1	5433	5432	0.0000000	-1.2006626	21.2798810
ClientID	2	5433	2.614950e+04	1.360138e+04	23056.0	2.507657e+04	13318.19580	5724	64237	58513	0.6163751	-0.7000766	184.5284125
InitialA1C	3	5387	5.996850e+00	1.890963e+00	5.6	5.695646e+00	0.59304	0	59	59	11.7031865	285.2218961	0.0257638
Gender*	4	5433	2.732192e+00	9.849138e-01	2.0	2.655394e+00	0.00000	1	7	6	0.7206722	-0.8799277	0.0133622
RaceandEthnicity*	5	5433	9.787963e+00	3.699842e+00	9.0	9.800092e+00	4.44780	1	14	13	0.0580228	-1.7376694	0.0501953
Age	6	5433	5.167090e+01	1.554934e+01	53.0	5.161353e+01	16.30860	-4	175	179	0.0706890	0.1713094	0.2109561
Zip	7	5432	1.033860e+04	1.329142e+03	10303.0	1.030484e+04	2.96520	0	93543	93543	46.8664053	2871.5735410	18.0339738
BlackOrNot	8	5433	3.868949e-01	4.870841e-01	0.0	3.586381e-01	0.00000	0	1	1	0.4643317	-1.7847245	0.0066082
MaleOrNot	9	5433	3.591018e-01	4.797813e-01	0.0	3.239015e-01	0.00000	0	1	1	0.5872354	-1.6554591	0.0065091
A1CAboveMean	10	5387	1.236310e-01	3.291906e-01	0.0	2.969150e-02	0.00000	0	1	1	2.2862083	3.2273478	0.0044851

Analysis - Exploring the Connection Between Poverty and Diabetes

Poverty and diabetes in Staten Island

Richmond county (Staten Island) is a relatively middle class borough. The NY DOH reports that the incidence of diabetes in Staten Island is 8.5%. The chart below shows the incidence of diabetes in New York City. Staten Island is the second lowest.

dfIncidenceOfDiabetes <- sqlQuery(dbConnection, "SELECT * FROM tblAdultsDiagnosed Where region_name = 'New York City';")


ggplot(dfIncidenceOfDiabetes, aes(percentage_rate, reorder(county_name, as.numeric(percentage_rate)))) +
  geom_col() +
  ggtitle("Incidence of Diabetes by New York City County") + 
  ylab("") +
  xlab("Per Cent of Population with Diabetes")

However, among the 292 clients from our survey who were also screened for diabetes, the incidence is much higher at 19%. What is happening here?

  diab_percent <- dfNeedsInfo %>%
      summarize(PercentDiabetes = sum(DiabetesOrNot/n()))
  diab_percent

## # A tibble: 1 x 1
##   PercentDiabetes
##             <dbl>
## 1           0.190

While Richmond county is middle class, the clients in our survey are economically disadvantaged - all are from poorer zip codes seeking assistance from a community health organization.

Public health information helps us get at the relationship between poverty and diabetes health outcomes in the aggregate. For example, consider the hospitalization rate per 100,000 for all individuals with diabetes in Staten Island:

dfCommunityHealth <- sqlQuery(dbConnection, "SELECT * from tblCommunityHealth;")

HospRate <- dfCommunityHealth %>%
  filter(indicator=="Diabetes hospitalization rate per 10,000 (any diagnosis)" & county_name=="Richmond") %>%
  mutate(percentage_rate = percentage_rate/10) %>%
  select(percentage_rate)
HospRate

##   percentage_rate
## 1           26.38

Now consider these scatterplots of hospital admissions for diabetes patients with Medicaid, against indicators of poverty per zip code.

a <- ggplot(dfZipInfo_SI, aes(pov, percent_ip)) +
  geom_point() +
  ggtitle("Poverty And Medicaid Hospitalizations") +
   theme(plot.title = element_text(size = 10, face="bold")) +
      xlab("Percent below the poverty line") +
  ylab("Hospitalizations per 100,000") +
  geom_hline(yintercept=26.4, size=2, color='red', linetype='dotted')

b <- ggplot(dfZipInfo_SI, aes(NoInternet, percent_ip)) +
  geom_point() +
  ggtitle("Poverty And Medicaid Hospitalizations") +
   theme(plot.title = element_text(size = 10, face="bold")) +
      xlab("Percent without internet") +
  ylab("Hospitalizations per 100,000") +
  geom_hline(yintercept=26.4, size=2, color='red', linetype='dotted')


gridExtra::grid.arrange(a,b,ncol=2)

While the strong relationship between poverty level and hospitalization rate for medicaid recipients is likely due in part to the fact that there are more medicaid recipients in poorer zip codes, nonetheless we can see from these graphs that the 5 poorest zip codes have higher rates of hospitalization for medicaid recipients alone than the average hospitalization rate for all recipients.

Poverty and diabetes at the individual level - looking at the mechanisms of poverty

The second graph suggests that poverty affects individuals in a direct way - lack of internet may lead to lack of knowledge and poor health choices.

Our survey data allow us to examine diabetes and poverty at the individual level. We will look first to see if incidence of diabetes and particular economic and social needs are correlated.

We examine the correlation between the incidence of diabetes and social needs through chi squares and a multiple regression:

m10 <- glm(DiabetesOrNot ~ Need_Sad + Need_HelpUnderstanding + Need_Childcare + Need_Clothes + Need_LoseHousing + Need_SafePlace + Need_Job + Need_AffordNeeds + Need_HighSchool + Need_Transport + Need_Safe, family = "binomial", data=dfNeedsInfo)
summary(m10)

## 
## Call:
## glm(formula = DiabetesOrNot ~ Need_Sad + Need_HelpUnderstanding + 
##     Need_Childcare + Need_Clothes + Need_LoseHousing + Need_SafePlace + 
##     Need_Job + Need_AffordNeeds + Need_HighSchool + Need_Transport + 
##     Need_Safe, family = "binomial", data = dfNeedsInfo)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3436  -0.6284  -0.6284  -0.3168   2.4564  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.52183    0.16911  -8.999  < 2e-16 ***
## Need_Sad                -0.19058    1.36930  -0.139  0.88931    
## Need_HelpUnderstanding   0.47490    1.13057   0.420  0.67445    
## Need_Childcare         -14.61899  686.36417  -0.021  0.98301    
## Need_Clothes            -0.70005    1.05310  -0.665  0.50621    
## Need_LoseHousing         1.37903    1.29617   1.064  0.28736    
## Need_SafePlace          -0.65462    2.21737  -0.295  0.76782    
## Need_Job                 0.03591    0.40870   0.088  0.92998    
## Need_AffordNeeds         1.42953    0.49329   2.898  0.00376 ** 
## Need_HighSchool         -1.48072    0.78499  -1.886  0.05926 .  
## Need_Transport           0.73656    1.51552   0.486  0.62696    
## Need_Safe                0.38788    0.73477   0.528  0.59757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 338.06  on 347  degrees of freedom
## Residual deviance: 319.72  on 336  degrees of freedom
## AIC: 343.72
## 
## Number of Fisher Scoring iterations: 14

dfNeedsChi <- dfNeedsInfo %>%
  select(DiabetesOrNot, NeedsVector) %>%
  summarise_all(funs(chisq.test(.,dfNeedsInfo$DiabetesOrNot)$p.value))
kableExtra::kable(head(dfNeedsChi), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")

DiabetesOrNot	Need_Food	Need_SafePlace	Need_LoseHousing	Need_Job	Need_AffordNeeds	Need_HighSchool	Need_HelpUnderstanding	Need_Sad	Need_Childcare	Need_Clothes	Need_Transport	Need_Safe
0	0.4783364	0.3761446	0.2533187	0.2993787	0.0198248	0.2186276	0.7193608	0.083128	0.7400691	0.6070138	0.4745132	0.4813973

tbl <- dfNeedsInfo %>%
  select(DiabetesOrNot, Need_AffordNeeds) %>%
  mutate(Need_AffordNeeds = ifelse(Need_AffordNeeds==0,0,1))

tbl2 <- table(tbl)
tbl2

##              Need_AffordNeeds
## DiabetesOrNot   0   1
##             0 263  19
##             1  54  12

psych::phi(tbl2)

## [1] 0.16

Those clients who cannot afford basic needs have a higher incidence of diabetes in both the chi squares and regression analysis (p=.02). However, we cannot reject the null hypothesis for other needs.

Diabetes is particularly dangerous for clients with Covid. We therefore look at the relationship between poverty and whether a client is vaccinated. Some of the needs are all zeros so we remove them.

m10 <- glm(SumOfNotVaccinated ~ Need_Sad + Need_Clothes + Need_LoseHousing + Need_Job + Need_AffordNeeds + Need_HighSchool + Need_Safe, family = "binomial", data=dfNeedsInfo_Vaccine)
summary(m10)

## 
## Call:
## glm(formula = SumOfNotVaccinated ~ Need_Sad + Need_Clothes + 
##     Need_LoseHousing + Need_Job + Need_AffordNeeds + Need_HighSchool + 
##     Need_Safe, family = "binomial", data = dfNeedsInfo_Vaccine)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6651  -0.9558  -0.9558   1.4165   1.4165  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.5465     0.1263  -4.328 1.51e-05 ***
## Need_Sad           10.9534  4248.3854   0.003   0.9979    
## Need_Clothes      -16.1358  1022.8449  -0.016   0.9874    
## Need_LoseHousing  -15.5864  2859.7350  -0.005   0.9957    
## Need_Job            1.6452     0.6785   2.425   0.0153 *  
## Need_AffordNeeds    0.8830     0.5990   1.474   0.1404    
## Need_HighSchool   -25.5406  4406.0046  -0.006   0.9954    
## Need_Safe          19.2175  3955.3301   0.005   0.9961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 407.86  on 303  degrees of freedom
## Residual deviance: 384.66  on 296  degrees of freedom
## AIC: 400.66
## 
## Number of Fisher Scoring iterations: 15

NeedsVector1 <- NeedsVector[! NeedsVector %in% c("Need_SafePlace", "Need_HelpUnderstanding", "Need_Childcare", "Need_Transport")] 


dfNeedsChi1 <- dfNeedsInfo_Vaccine %>%
  select(SumOfNotVaccinated, NeedsVector1) %>%
  summarise_all(funs(chisq.test(.,dfNeedsInfo_Vaccine$SumOfNotVaccinated)$p.value))
kableExtra::kable(head(dfNeedsChi1), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")

SumOfNotVaccinated	Need_Food	Need_LoseHousing	Need_Job	Need_AffordNeeds	Need_HighSchool	Need_Sad	Need_Clothes	Need_Safe
0	0.3230587	0.8292155	0.0278985	0.1695744	0.4166836	0.7077675	1	0.0197716

tbl <- dfNeedsInfo_Vaccine %>%
  select(SumOfNotVaccinated, Need_Job) %>%
  mutate(Need_Job = ifelse(Need_Job==0,0,1))

tbl1 <- table(tbl)
tbl1

##                   Need_Job
## SumOfNotVaccinated   0   1
##                  0 179   5
##                  1 109  11

psych::phi(tbl1)

## [1] 0.14

tbl2 <- dfNeedsInfo_Vaccine %>%
  select(SumOfNotVaccinated, Need_Safe) %>%
  mutate(Need_Safe = ifelse(Need_Safe==0,0,1))

tbl3 <- table(tbl2)
tbl3

##                   Need_Safe
## SumOfNotVaccinated   0   1
##                  0 184   0
##                  1 115   5

psych::phi(tbl3)

## [1] 0.16

Unemployed clients and those that do not feel safe are correlated with no vaccine (p=.028 and p=.02 respectively).

The effect is not strong enough to run a successful prediction model. The model does no better than guessing. This is perhaps because the data is very unbalanced:

#https://www.statology.org/logistic-regression-in-r/
#make this example reproducible
set.seed(1)

#Use 70% of dfNeedsInfoset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(dfNeedsInfo_Vaccine), replace=TRUE, prob=c(0.7,0.3))
train <- dfNeedsInfo_Vaccine[sample, ]
test <- dfNeedsInfo_Vaccine[!sample, ]  

#fit logistic regression model
model1 <- glm(SumOfNotVaccinated ~ Need_Safe + Need_Job, family="binomial", data=train)

summary(model1)

## 
## Call:
## glm(formula = SumOfNotVaccinated ~ Need_Safe + Need_Job, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5518  -0.9756  -0.9756   1.3937   1.3937  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.4953     0.1436  -3.448 0.000564 ***
## Need_Safe     16.5805  1005.3119   0.016 0.986841    
## Need_Job       1.3426     0.7049   1.905 0.056805 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 298.72  on 220  degrees of freedom
## Residual deviance: 285.54  on 218  degrees of freedom
## AIC: 291.54
## 
## Number of Fisher Scoring iterations: 15

#model <- glm(HypertensionOrDiabetesOrNot ~ NumOfNeeds, family="binomial", data=train)
#model
#disable scientific notation for model summary
#options(scipen=999)


#Shows fit - ranges from 0 to almost 1
pscl::pR2(model1)["McFadden"]

## fitting null model for pseudo-r2

##   McFadden 
## 0.04413668

#Ranks importance
caret::varImp(model1)

##              Overall
## Need_Safe 0.01649294
## Need_Job  1.90481077

#multicolinearity (VIF)
car::vif(model1)

## Need_Safe  Need_Job 
##         1         1

#calculate probability of default for each individual in test dataset
predicted <- predict(model1, test, type="response")
p2 <- as.data.frame(predicted)

p3 <- test %>%
  select(SumOfNotVaccinated)

Predictions <- cbind(p2, p3)
Predictions$predicted <- ifelse(Predictions$predicted >.38, 1, 0)
Predictions$predicted=as.factor(Predictions$predicted)
Predictions$SumOfNotVaccinated=as.factor(Predictions$SumOfNotVaccinated)

psych::describe(Predictions)

##                     vars  n mean   sd median trimmed mad min max range skew
## predicted*             1 83 1.05 0.22      1    1.00   0   1   2     1 4.14
## SumOfNotVaccinated*    2 83 1.36 0.48      1    1.33   0   1   2     1 0.57
##                     kurtosis   se
## predicted*             15.35 0.02
## SumOfNotVaccinated*    -1.70 0.05

#find optimal cutoff probability to use to maximize accuracy
optimal <- InformationValue::optimalCutoff(test$SumOfNotVaccinated, predicted)[1]
optimal

## [1] 0.38

#ConfusionMatrix doesn't work
caret::confusionMatrix(Predictions$SumOfNotVaccinated, Predictions$predicted)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 51  2
##          1 28  2
##                                           
##                Accuracy : 0.6386          
##                  95% CI : (0.5257, 0.7412)
##     No Information Rate : 0.9518          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0356          
##                                           
##  Mcnemar's Test P-Value : 5.01e-06        
##                                           
##             Sensitivity : 0.64557         
##             Specificity : 0.50000         
##          Pos Pred Value : 0.96226         
##          Neg Pred Value : 0.06667         
##              Prevalence : 0.95181         
##          Detection Rate : 0.61446         
##    Detection Prevalence : 0.63855         
##       Balanced Accuracy : 0.57278         
##                                           
##        'Positive' Class : 0               
##

#calculate sensitivity
#sensitivity(test$HypertensionOrDiabetesOrNot, predicted)

#calculate specificity
#specificity(test$HypertensionOrDiabetesOrNot, predicted)

#calculate total misclassification error rate
#misClassError(Predictions$HypertensionOrDiabetesOrNot, Predictions$predicted, threshold=optimal`)

Lastly we look at race and other demographics.

rBase1 <- lm(InitialA1C ~ BlackOrNot + MaleOrNot + Age,  data = dfA1CAndDemographics)
summary(rBase1)

## 
## Call:
## lm(formula = InitialA1C ~ BlackOrNot + MaleOrNot + Age, data = dfA1CAndDemographics)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.704 -0.702 -0.358  0.099 53.070 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.787345   0.092460  51.778  < 2e-16 ***
## BlackOrNot  0.094788   0.052052   1.821  0.06866 .  
## MaleOrNot   0.165160   0.052783   3.129  0.00176 ** 
## Age         0.021538   0.001628  13.228  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.859 on 5383 degrees of freedom
##   (46 observations deleted due to missingness)
## Multiple R-squared:  0.03374,    Adjusted R-squared:  0.0332 
## F-statistic: 62.65 on 3 and 5383 DF,  p-value: < 2.2e-16

Here we can see that race, gender and age all affect the initial A1C at a client’s first screening. The R2 is very low, however.

Conclusion

This study touched the surface of the question, "how are poverty and diabetes connected?’ We found statistical associations between poverty and diabetes hospitalization rates, between being unable to afford basic needs and diabetes incidence rates, between having not been vaccinated against not having a job and feeling unsafe, and between high A1C and race, age and gender. Clearly there is more to explore here.

This analysis only shows a small handful of the anlyses that were run on this data. I also analyzed the relationship between poverty and chronic hypertension, A1V levels, success in A1C management, and so on. Thistype of “fishing” increases the likelihood ofsampling errors. A cursory analysis of survey results by program and staff suggests a lack of consistency in how the survey was applied. In the second round of survey impkementation I plan to make sure that suveys are conducted in a consistent manager accross staff and programs to provide the possibility for a more robust analysis.