library(ROSE)
library(tidyverse)
library(magrittr)
library(janitor)
library(tidyverse)
library(RODBC)
library(pscl)
library(caret)
library(car)
library(InformationValue)
library(rvest)
library(jsonlite)
library(xml2)
library(XML)
library(stringr)
library(kableExtra)

Diabetes and Poverty: How does poverty affect health outcomes?

We know that poverty affects health outcomes. We also know that obesity, poor diet choices, and poor lifestyle choices often result from poverty and inadequate access to resources. However, what are the actual poverty markers that lead to these outcomes - is it lack of transportation options? Housing insecurity? Lack of High School defree? In this study we will take a deeper look. To answer these questions we will examine data from a client survey on economic needs from my workplace. We will supplement the survey info with other client health-related databases from my workplace, as well as data from outside sources such as the NY Department of Health.

Altogether we will use data from 4 different sources:

  1. Files from a proprietary survey of economic needs, stripped of identity markers (stored in a local csv file)
  2. Tables of client health information in a proprietary SQL Server database, stripped of identity markers and password protected through keyring (SQL Server database on Azure)
  3. Health data from the New York Department of Health (in json format accessed through APIs)
  4. Zip code level data on population size and poverty (scraped from html tables on the web)

After collecting the data we will store it in the database so that we can easily combine the datasets in productive ways using SQL. In particular, we will generate dataframes that contain zip code level information on poverty and public health, and client level information on economic needs and individual client health. Then we will test for correlations and other relationships within the data.

Note - Certain operations that load data and are very time consuming will be contained within functions that can be run or not by changing the value of the boolean variable “run” which appears at the top of each function.

Loading The Data

1. CSV file of a Survey of Client Economic Needs

This file is the result of a client survey taken at my workplace of over 3,000 clients. We did not create the survey instrument - the file was provided by the organization that did. The Ids have been mathced to those of the clients in the SQL database. The file is otherwise free of any personal identifying information.

For security purposes these files are stored locally. However, they have already been read into the sql server database and can be accessed from there through keyring. Leave “run = false” in the following code, and in the code in which the file is read into the database, to use this rmd successfully with the survey.

We also input a file that simply contains the list of survey needs.

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
dfSDOH_raw <- read.csv("D:\\Buffer\\SDOH__En__1_0 (3).csv", encoding = "UTF-8")
}

dfListOfNeeds <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Final%20Project/ListOfNeeds.csv", encoding = "UTF-8")
dfListOfNeeds %<>%
  rename(Need=X.U.FEFF.Need)

The survey asks these questions:

kableExtra::kable(dfListOfNeeds)
Need
Do you or your family ever feel hungry?
Do you have a safe place to live right now?
Are you worried about losing your housing?
Do you have a job?
Are you able to afford basic needs?
Do you have a high school diploma or a GED?
Do you ever need help understanding what your doctor tells you?
Are you sad or lonely?
Do problems with childcare make it hard for you?
Do you ever have a problem getting the right type of clothes for each season?
Do you have access to transportation to get where you need to go?
Do you feel safe: emotionally, physically?


Since the survey instrument and resultant data were created and delivered by an outside agency, the data needs extensive reforming. The file contains 800,000 observations with only two columns- the columns are “Field” and “Value”. Each record takes up 211 observations.

We begin by pivoting wider every 211 records, taking our names from “Field” and our values from “Value”, and binding each to a final dataframe.

run=FALSE

if (run) {
dfSDOH <- dfSDOH_raw %>%
    filter(as.numeric(Num) <=211) %>%
    pivot_wider(Org, names_from = "Field", values_from = "Value")

dfFinal <- dfSDOH

x=211

for (i in 1: 3752)
{

dfSDOH <- dfSDOH_raw %>%
    filter(as.numeric(Num) > x & as.numeric(Num)<= (x+211)) %>%
    pivot_wider(Org, names_from = "Field", values_from = "Value")

x=x+211

dfFinal <- rbind(dfFinal, dfSDOH)
}
}

We eliminate bad records that have no ClientID.

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
dfFinal %<>%
  filter (ClientID > 0)
}

We select the columns which describe needs, and rename them to reflect the type of need.

NeedsVector = c("Need_Food",  "Need_SafePlace",  "Need_LoseHousing",  "Need_Job",  "Need_AffordNeeds",  "Need_HighSchool",  "Need_HelpUnderstanding",  "Need_Sad",  "Need_Childcare",  "Need_Clothes",  "Need_Transport",  "Need_Safe") 


#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {

dfFinal2 <- dfFinal %>%
  rename(Need_Food=DoYouOrYou, Need_SafePlace = DoYouHaveA, Need_LoseHousing = AreYouWorr, Need_Job = DoYouHaveA1, Need_AffordNeeds = AreYouAble, Need_HighSchool = DoYouHaveA2, Need_HelpUnderstanding=DoYouEverN, Need_Sad = SocialAreY, Need_Childcare = ChildCareD, Need_Clothes = DoYouEverH, Need_Transport = DoYouHaveA3, Need_Safe = DoYouFeelS )

dfFinal3 <- dfFinal2 %>%
  select(ClientID, starts_with('Need_'))
}

Each of the 12 need columns is binary, containing either a yes or no in response to a question about a need. However, sometimes ‘yes’ means the client has a need (“Do you or your family ever go hungry”) and sometimes ‘yes’ means the client does not have a need (“Do you feel safe physically and emotionally”).

We will recode the questions so that 1 means ‘has a need’ and 0 means ‘does not have a need’ for all variables.

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
QCorrect<- function(x){
    ifelse(x=="Yes", 1,0)
}

QReverse<- function(x){
    ifelse(x=="No", 1,0)
}
 
dfFinal4 <- dfFinal3 %>%
  select(Need_Food, Need_Sad, Need_HelpUnderstanding, Need_Childcare, Need_Clothes, Need_LoseHousing) %>%
   mutate_all(QCorrect)

dfFinal5 <- dfFinal3 %>%
  select(Need_SafePlace, Need_Job, Need_AffordNeeds, Need_HighSchool, Need_Transport, Need_Safe) %>%
   mutate_all(QReverse)

dfFinal6 <- dfFinal3 %>%
  select(ClientID)

dfFinal6$ClientID <- as.numeric(as.character(dfFinal6$ClientID))

dfFinal10 <- cbind(dfFinal6, dfFinal4, dfFinal5)
}

2. SQL Server Database Data

I built this database for my workplace. It contains information on client health screens for diabetes, hypertension and other chronic diseases. We will use three tables:

  1. Demographic data: A table of 5,433 unique clients with demographic data and the client’s first A1C result (A1C is a test that indicates diabetes.)

  2. Health data: A table of 39,491 observations for 8,546 unique clients with A1C, Blood Pressure and other health screen data.

  3. Vaccine data: A table of 2122 unique clients with info on vaccine status.

The data is password protected by keyring. Here we create a connection.

## Please enter password in TK window (Alt+Tab)
# This is the connection string:

strConnection <- paste0(
  'Driver={ODBC Driver 13 for SQL Server};
   Server=tcp:ehtmp.database.windows.net,1433;
   Database=HC_A1C;
   Encrypt=yes;
   TrustServerCertificate=no;
   Connection Timeout=30;',
   'Uid=',keyring::key_get(service = "EH_606_username", keyring = "EH_606_keyring"),';',
   'Pwd=', keyring::key_get(service = "EH_606_pwd", keyring = "EH_606_keyring"), ';'
)

library(RODBC)
dbConnection <- odbcDriverConnect(strConnection)
#keyring_lock

3. New York DOH Data

We read the New York DOH data from the APIs. The RSocrata library makes this easy.

library("RSocrata")

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE 

if(run) {


#Community Health Obesity and Diabetes Related Indicators: 2008 - 2012
dfCommunityHealthObesityAndDiabetes <- read.socrata(
  "https://health.data.ny.gov/resource/tchg-ruva.json")

#Community Health: Age-adjusted percentage of adults with physician diagnosed diabetes: 2008 - 2009
dfAdultsDiagnosed <- read.socrata(
  "https://health.data.ny.gov/resource/9j5w-7zpd.json")

#AH Provisional Diabetes Death Counts, 2020
dfDeath <- read.socrata(
  "https://data.cdc.gov/resource/qdcb-uzft.json")

#500 Cities: Diagnosed diabetes among adults aged >=18 years
dfCityComparison <- read.socrata(
  "https://chronicdata.cdc.gov/resource/cn78-b9bj.json")

#Conditions contributing to deaths involving coronavirus disease 2019 (COVID-19), by age group and state, United States.
dfDeathsForCovid <- read.socrata(
  "https://data.cdc.gov/resource/hk9y-quqm.json")

#Community Health: Diabetes Short-term Complications Hospitalization Rate per 10,000 - Aged 18+ Years by County Map: Latest Data
dfHospitalizations <- read.socrata(
  "https://health.data.ny.gov/resource/xuwq-ppg8.json")

#Medicaid Chronic Conditions, Inpatient Admissions and Emergency Room Visits by Zip Code: Beginning 2012
dfMedicaidByZip <- read.socrata(
  "https://health.data.ny.gov/resource/2yck-xisk.json")

}
#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE 

if(run) {
#Medicaid Program Enrollment by Month: Beginning 2009
dfMedicaidEnrollment <- read.socrata(
  "https://health.data.ny.gov/resource/m4hz-kzn3.json")
}

4. Zip Code Level Population and Poverty Data Scraped From the web

We scrape and then clean 17 pages of zipatlas data to compile a table of zip codes, population levels and poverty levels. We also read a csv file, taken from the web, which contains information on zip code and broadband access. The file is compiled by the city of New York and is downloaded from here:

https://data.cityofnewyork.us/City-Government/Broadband-Adoption-and-Infrastructure-by-Zip-Code/qz5f-yx82

library(XML)

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {


x="http://www.zipatlas.com/us/ny/zip-code-comparison/population-below-poverty-level.htm"
dfPopAndPovertyLevel = as.data.frame(readHTMLTable(x, header=T,which=5,strings2factors=F))

for (i in 1:16) {
x= str_c("http://www.zipatlas.com/us/ny/zip-code-comparison/population-below-poverty-level.", i, ".htm")
dfx = as.data.frame(readHTMLTable(x, header=T,which=5,strings2factors=F))
dfPopAndPovertyLevel=rbind(dfPopAndPovertyLevel, dfx)
}
}

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
  
dfPopAndPovertyLevel %<>%
  filter(str_length(V2)==5 & !is.na(V4)) %<>%
  rename(number = V1, zip=V2, location=V3, city=V4, population=V5, percent_below_poverty=V6, rank=V7) %<>%
  mutate(population = as.numeric(gsub(",", "", population))) %<>%
  mutate(percent_below_poverty = as.numeric(gsub(" %", "", percent_below_poverty))) %<>%
  mutate(rank = as.numeric(gsub("#", "", rank)))

}


dfBroadband_raw <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Final%20Project/Broadband_Adoption.csv", encoding = "UTF-8")

Loading all of the Datasets into the SQL Server Database

If we load our datasets into persistent storage (the SQL Sever database), then we can use SQL to easily combine data from several tables and read the result back into a dataframe. Just as important, many of the datasets in this project are very large and take a long time to load. Therefore we can filter and prepare them, read them into the database, and then use the database exclusively to read them back to cut down on load time.

We begin with 3 NYDOH files: MedicaidByZip, a NYDOH file with over a million records on medicaid information; dfAdultsDiagnosed which has 63 records about Diabetes diagnoses throughout New York State; and dfCommunityHealthObesityAndDiabtes, a dataset with 2733 records with detailed county-level information about diabetes.

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE
if(run) {

DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblMedicaidByZip")
PopulateTable <- sqlSave(dbConnection, dfMedicaidByZip, "tblMedicaidByZip", append=TRUE)
}
#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE
if(run) {

DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblAdultsDiagnosed")
PopulateTable <- sqlSave(dbConnection, dfAdultsDiagnosed, "tblAdultsDiagnosed", append=TRUE)
}
#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE
if(run) {

DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblCommunityHealth")
PopulateTable <- sqlSave(dbConnection, dfCommunityHealthObesityAndDiabetes, "tblCommunityHealth", append=TRUE)
}

Next we clean and load the scraped dataset with 1711 zip code level observations on population and poverty in NY State:

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblPopAndPovertyLevel")
PopulateTable <- sqlSave(dbConnection, dfPopAndPovertyLevel, "tblPopAndPovertyLevel", append=TRUE)
}

Next we load the survey data into a table called tblNeeds:

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {

  
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblNeeds")
PopulateTable <- sqlSave(dbConnection, dfFinal10, "tblNeeds", append=TRUE)
}

Finally, we add the broadband_by_zip_code dataframe

#This operation is time consuming.  Set run to TRUE the first time you run ii, and FALSE afterward.
run=FALSE

if(run) {

DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblBroadBand")
PopulateTable <- sqlSave(dbConnection, dfBroadband_raw, "tblBroadband", append=TRUE)
}

Creating Dataframes for Analysis By Joining Tables in the SQL Server Database

We will create the following dataframes by combining elements from a number of different tables in the database:

  1. dfZipInfo, a dataframe with 99 observations at the zip code level with information on population level, poverty level, medicaid Info, average A1C from client data, and broadband info.

  2. dfdNeedsInfo, a dataframe with client-level observations with information on needs from the survey, chronic diseases, demographics and initial A1C. The dataframe only includes only those clients who took the survey AND have been screened for diabetes.

  3. dfNeedsInfo_Vaccine, a dataframe with 304 client-level observations with information on needs from the survey and vaccine information. The dataframe only includes only those clients who took the survey AND have been screened for diabetes.

  4. dfA1CAndDemographics, a dataframe with 5433 observations at the client level with information on demographics and initial A1C readings.

dfZipInfo_raw <- sqlQuery(dbConnection, "SELECT tblMedicaidByZip.zip_code, tblMedicaidByZip.major_diagnostic_category, tblPopAndPovertyLevel.population, tblPopAndPovertyLevel.percent_below_poverty, tblMedicaidByZip.ip_admits, tblMedicaidByZip.er_visits, tblA1CAndDemographics.ClientID, tblA1CAndDemographics.Zip, tblA1CAndDemographics.InitialA1C, tblA1CAndDemographics.Gender, tblA1CAndDemographics.RaceandEthnicity, tblA1CAndDemographics.Age, tblBroadband.NoInternetAccessPercentageOfHouseholds
FROM (tblMedicaidByZip INNER JOIN (tblA1CAndDemographics INNER JOIN tblPopAndPovertyLevel ON tblA1CAndDemographics.Zip = tblPopAndPovertyLevel.zip) ON tblMedicaidByZip.zip_code = tblPopAndPovertyLevel.zip) INNER JOIN tblBroadband ON tblPopAndPovertyLevel.zip = tblBroadband.ZipCode;
")


dfZipInfo10 <- dfZipInfo_raw %>%
  filter(major_diagnostic_category=="Diabetes Mellitus") %>%
  na.omit() %>%
  group_by(zip_code) %>%
  summarize(n=n(), ave_A1C=mean(InitialA1C), pov=mean(percent_below_poverty), pop=mean(population), NoInternet=mean(NoInternetAccessPercentageOfHouseholds*100), percent_ip = sum(ip_admits/population), percent_er=sum(er_visits/population))


dfZipInfo_SI <- dfZipInfo10 %>%
  filter(as.numeric(zip_code) >= 10301 & as.numeric(zip_code) <= 10314) 
dfA1CAndDemographics <- sqlQuery(dbConnection, "SELECT * FROM tblA1CAndDemographics")

dfA1CAndDemographics %<>% 
  mutate(BlackOrNot = ifelse(RaceandEthnicity=="Black/African American", 1, 0)) %<>%
  mutate(MaleOrNot = ifelse(Gender=="Male", 1, 0)) %>%
  mutate(A1CAboveMean = ifelse(InitialA1C > 7, 1, 0))
dfNeedsInfo_raw <- sqlQuery(dbConnection, "SELECT tblNeeds.ClientID, tblChronic.Chronic, tblReadings.Type,  tblNeeds.Need_Food, tblNeeds.Need_Sad, tblNeeds.Need_HelpUnderstanding, tblNeeds.Need_Childcare, tblNeeds.Need_Clothes, tblNeeds.Need_LoseHousing, tblNeeds.Need_SafePlace, tblNeeds.Need_Job, tblNeeds.Need_AffordNeeds, tblNeeds.Need_HighSchool, tblNeeds.Need_Transport, tblNeeds.Need_Safe
FROM tblReadings INNER JOIN (tblChronic RIGHT JOIN tblNeeds ON tblChronic.ClientID = tblNeeds.ClientID) ON tblReadings.ClientID = tblNeeds.ClientID
WHERE tblReadings.Type='A1C';")

    
dfNeedsInfo <- dfNeedsInfo_raw %>%
    group_by(ClientID) %>%
  
    mutate(DiabetesOrNot = ifelse(Chronic == "Diabetes Type I"|Chronic=="Diabetes Type II"|Chronic == "Diabetes (doctor denied)"|Chronic == "Diabetes type I", 1, 0)) %>%
    mutate(HypertensionOrNot = ifelse(Chronic == "Hypertension", 1,0)) %>%
    mutate(HypertensionOrDiabetesOrNot = ifelse(Chronic == "Diabetes Type I"|Chronic=="Diabetes Type II"|Chronic == "Hypertension", 1,0)) %>%
  
    summarize(sum(DiabetesOrNot), sum(HypertensionOrNot), sum(HypertensionOrDiabetesOrNot), across(NeedsVector, mean, na.rm = TRUE)) %>%
  
    rename(DiabetesOrNot="sum(DiabetesOrNot)") %>%
      rename(HypertensionOrNot="sum(HypertensionOrNot)") %>%
      rename(HypertensionOrDiabetesOrNot="sum(HypertensionOrDiabetesOrNot)") %>%
  
      mutate(DiabetesOrNot = ifelse(DiabetesOrNot>0, 1, 0)) %>%
      mutate(HypertensionOrNot = ifelse(HypertensionOrNot>0, 1, 0)) %>%
      mutate(HypertensionOrDiabetesOrNot = ifelse(HypertensionOrDiabetesOrNot>0, 1, 0)) %>%
      mutate(NumOfNeeds = Need_Sad+Need_HelpUnderstanding+Need_Childcare+Need_Clothes+ Need_LoseHousing+Need_SafePlace+ Need_Job+Need_AffordNeeds+Need_HighSchool+Need_Transport+Need_Safe) 

dfNeedsInfo$DiabetesOrNot <- replace(dfNeedsInfo$DiabetesOrNot, is.na(dfNeedsInfo$DiabetesOrNot), 0)
dfNeedsInfo_Vaccine <- sqlQuery(dbConnection, "SELECT tblVaccine.*, tblNeeds.Need_Food, tblNeeds.Need_Sad, tblNeeds.Need_HelpUnderstanding, tblNeeds.Need_Childcare, tblNeeds.Need_Clothes, tblNeeds.Need_LoseHousing, tblNeeds.Need_SafePlace, tblNeeds.Need_Job, tblNeeds.Need_AffordNeeds, tblNeeds.Need_HighSchool, tblNeeds.Need_Transport, tblNeeds.Need_Safe
FROM tblNeeds INNER JOIN tblVaccine ON tblVaccine.ClientID = tblNeeds.ClientID;")
dfNeedsInfo3_raw <- sqlQuery(dbConnection, "SELECT * from tblNeeds")

These are the final dataframes which will be used for analysis:

kable(psych::describe(dfZipInfo10), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
vars n mean sd median trimmed mad min max range skew kurtosis se
zip_code 1 99 10732.727273 545.141993 1.047300e+04 1.073309e+04 6.983046e+02 1.00010e+04 11694.00000 1693.00000 -0.0643484 -1.7025451 54.7888317
n 2 99 1608.767677 5167.760737 1.410000e+02 2.300000e+02 1.690164e+02 6.00000e+00 30874.00000 30868.00000 3.9544099 15.6486294 519.3794961
ave_A1C 3 99 6.076375 1.541382 5.802632e+00 5.853841e+00 5.891384e-01 0.00000e+00 13.30000 13.30000 1.5766543 7.9789903 0.1549147
pov 4 99 29.846162 17.291370 2.396000e+01 2.874667e+01 1.826563e+01 4.96000e+00 70.12000 65.16000 0.5426009 -0.8320327 1.7378481
pop 5 99 53790.494949 24502.945900 5.387700e+04 5.346051e+04 3.139406e+04 1.13540e+04 106154.00000 94800.00000 0.1239264 -1.0908126 2462.6387214
NoInternet 6 99 17.884646 6.273084 1.715000e+01 1.777926e+01 6.523440e+00 5.95000e+00 32.36000 26.41000 0.2072927 -0.6523270 0.6304687
percent_ip 7 99 4.578043 15.794267 3.664816e-01 6.727343e-01 4.551523e-01 1.14795e-02 97.95367 97.94219 4.3155717 18.5438846 1.5873836
percent_er 8 99 5.872413 21.252716 3.893043e-01 7.301394e-01 4.936495e-01 1.46325e-02 132.47984 132.46520 4.3585059 18.7755648 2.1359784
  1. dfNeedsInfo
kable(psych::describe(dfNeedsInfo), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
vars n mean sd median trimmed mad min max range skew kurtosis se
ClientID 1 348 2.966458e+04 1.575818e+04 25923 2.916106e+04 18595.51 7168 64237 57069 0.3037166 -1.3249444 844.7271667
DiabetesOrNot 2 348 1.896552e-01 3.925926e-01 0 1.142857e-01 0.00 0 1 1 1.5764586 0.4866449 0.0210452
HypertensionOrNot 3 233 6.523605e-01 4.772461e-01 1 6.898396e-01 0.00 0 1 1 -0.6357584 -1.6026343 0.0312654
HypertensionOrDiabetesOrNot 4 233 8.068670e-01 3.956063e-01 1 8.823529e-01 0.00 0 1 1 -1.5447171 0.3878708 0.0259170
Need_Food 5 348 8.807471e-01 3.189546e-01 1 9.732143e-01 0.00 0 1 1 -2.3384605 3.5621568 0.0170978
Need_SafePlace 6 348 1.149430e-02 9.976950e-02 0 0.000000e+00 0.00 0 1 1 9.0577938 83.3778152 0.0053482
Need_LoseHousing 7 348 1.724140e-02 1.247077e-01 0 0.000000e+00 0.00 0 1 1 7.3623562 53.7030566 0.0066850
Need_Job 8 348 2.159962e-01 3.973691e-01 0 1.470238e-01 0.00 0 1 1 1.3748729 -0.0018337 0.0213012
Need_AffordNeeds 9 348 7.854410e-02 2.600390e-01 0 0.000000e+00 0.00 0 1 1 3.1196035 8.0028700 0.0139396
Need_HighSchool 10 348 8.812260e-02 2.801915e-01 0 0.000000e+00 0.00 0 1 1 2.8878253 6.4365180 0.0150198
Need_HelpUnderstanding 11 348 2.155170e-02 1.429250e-01 0 0.000000e+00 0.00 0 1 1 6.5581311 41.5425502 0.0076616
Need_Sad 12 348 1.963600e-02 1.227322e-01 0 0.000000e+00 0.00 0 1 1 6.7636810 47.2894576 0.0065791
Need_Childcare 13 348 1.149430e-02 1.067468e-01 0 0.000000e+00 0.00 0 1 1 9.1263064 81.5237573 0.0057222
Need_Clothes 14 348 4.022990e-02 1.893168e-01 0 0.000000e+00 0.00 0 1 1 4.6484797 20.1796933 0.0101484
Need_Transport 15 348 7.183900e-03 8.020220e-02 0 0.000000e+00 0.00 0 1 1 11.5677700 136.0490731 0.0042993
Need_Safe 16 348 3.879310e-02 1.838297e-01 0 0.000000e+00 0.00 0 1 1 4.7366709 21.1986878 0.0098543
NumOfNeeds 17 348 5.502874e-01 9.631922e-01 0 3.500000e-01 0.00 0 8 8 3.1730058 16.0423113 0.0516325
  1. dfNeedsInfo_Vaccine
kableExtra::kable(psych::describe(dfNeedsInfo_Vaccine), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
vars n mean sd median trimmed mad min max range skew kurtosis se
rownames 1 304 8.146513e+02 4.696786e+02 749 8.035615e+02 578.9553 8 2081 2073 0.2284195 -0.9306780 26.9379129
ClientID 2 304 4.803288e+04 1.770246e+04 54953 5.004211e+04 13790.4039 6908 68217 61309 -0.8006545 -0.6991243 1015.3055418
SumOfVaccinated 3 304 6.250000e-01 4.849211e-01 1 6.557377e-01 0.0000 0 1 1 -0.5138519 -1.7416530 0.0278121
SumOfNotVaccinated 4 304 3.947368e-01 4.896000e-01 0 3.688525e-01 0.0000 0 1 1 0.4285821 -1.8222793 0.0280805
SumOfWantsVaccine 5 304 7.894740e-02 2.701012e-01 0 0.000000e+00 0.0000 0 1 1 3.1074840 7.6817580 0.0154914
SumOfUnsure 6 304 2.763158e-01 4.479122e-01 0 2.213115e-01 0.0000 0 1 1 0.9955005 -1.0122761 0.0256895
SumOfAppointmentdatewithSunRiverReferraltoJenille 7 304 7.894740e-02 2.701012e-01 0 0.000000e+00 0.0000 0 1 1 3.1074840 7.6817580 0.0154914
SumOfNoAnswer 8 304 1.973680e-02 1.393239e-01 0 0.000000e+00 0.0000 0 1 1 6.8715176 45.3670198 0.0079908
DateOfContact* 9 304 1.068421e+01 6.401239e+00 11 1.052869e+01 8.8956 1 22 21 0.0722340 -1.1901540 0.3671362
Need_Food 10 304 9.769737e-01 1.502343e-01 1 1.000000e+00 0.0000 0 1 1 -6.3288430 38.1798775 0.0086165
Need_Sad 11 304 9.868400e-03 9.901150e-02 0 0.000000e+00 0.0000 0 1 1 9.8679276 95.6908007 0.0056787
Need_HelpUnderstanding 12 304 0.000000e+00 0.000000e+00 0 0.000000e+00 0.0000 0 0 0 NaN NaN 0.0000000
Need_Childcare 13 304 0.000000e+00 0.000000e+00 0 0.000000e+00 0.0000 0 0 0 NaN NaN 0.0000000
Need_Clothes 14 304 9.868400e-03 9.901150e-02 0 0.000000e+00 0.0000 0 1 1 9.8679276 95.6908007 0.0056787
Need_LoseHousing 15 304 3.289500e-03 5.735390e-02 0 0.000000e+00 0.0000 0 1 1 17.2639113 297.0197044 0.0032895
Need_SafePlace 16 304 0.000000e+00 0.000000e+00 0 0.000000e+00 0.0000 0 0 0 NaN NaN 0.0000000
Need_Job 17 304 5.263160e-02 2.236651e-01 0 0.000000e+00 0.0000 0 1 1 3.9871836 13.9435325 0.0128281
Need_AffordNeeds 18 304 4.276320e-02 2.026563e-01 0 0.000000e+00 0.0000 0 1 1 4.4975934 18.2885386 0.0116231
Need_HighSchool 19 304 9.868400e-03 9.901150e-02 0 0.000000e+00 0.0000 0 1 1 9.8679276 95.6908007 0.0056787
Need_Transport 20 304 0.000000e+00 0.000000e+00 0 0.000000e+00 0.0000 0 0 0 NaN NaN 0.0000000
Need_Safe 21 304 1.644740e-02 1.273980e-01 0 0.000000e+00 0.0000 0 1 1 7.5662433 55.4304067 0.0073068
  1. dfA1CAndDemographics
kableExtra::kable(psych::describe(dfA1CAndDemographics), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
vars n mean sd median trimmed mad min max range skew kurtosis se
rownames 1 5433 2.717000e+03 1.568516e+03 2717.0 2.717000e+03 2013.37080 1 5433 5432 0.0000000 -1.2006626 21.2798810
ClientID 2 5433 2.614950e+04 1.360138e+04 23056.0 2.507657e+04 13318.19580 5724 64237 58513 0.6163751 -0.7000766 184.5284125
InitialA1C 3 5387 5.996850e+00 1.890963e+00 5.6 5.695646e+00 0.59304 0 59 59 11.7031865 285.2218961 0.0257638
Gender* 4 5433 2.732192e+00 9.849138e-01 2.0 2.655394e+00 0.00000 1 7 6 0.7206722 -0.8799277 0.0133622
RaceandEthnicity* 5 5433 9.787963e+00 3.699842e+00 9.0 9.800092e+00 4.44780 1 14 13 0.0580228 -1.7376694 0.0501953
Age 6 5433 5.167090e+01 1.554934e+01 53.0 5.161353e+01 16.30860 -4 175 179 0.0706890 0.1713094 0.2109561
Zip 7 5432 1.033860e+04 1.329142e+03 10303.0 1.030484e+04 2.96520 0 93543 93543 46.8664053 2871.5735410 18.0339738
BlackOrNot 8 5433 3.868949e-01 4.870841e-01 0.0 3.586381e-01 0.00000 0 1 1 0.4643317 -1.7847245 0.0066082
MaleOrNot 9 5433 3.591018e-01 4.797813e-01 0.0 3.239015e-01 0.00000 0 1 1 0.5872354 -1.6554591 0.0065091
A1CAboveMean 10 5387 1.236310e-01 3.291906e-01 0.0 2.969150e-02 0.00000 0 1 1 2.2862083 3.2273478 0.0044851

Analysis - Exploring the Connection Between Poverty and Diabetes

Poverty and diabetes in Staten Island

Richmond county (Staten Island) is a relatively middle class borough. The NY DOH reports that the incidence of diabetes in Staten Island is 8.5%. The chart below shows the incidence of diabetes in New York City. Staten Island is the second lowest.

dfIncidenceOfDiabetes <- sqlQuery(dbConnection, "SELECT * FROM tblAdultsDiagnosed Where region_name = 'New York City';")


ggplot(dfIncidenceOfDiabetes, aes(percentage_rate, reorder(county_name, as.numeric(percentage_rate)))) +
  geom_col() +
  ggtitle("Incidence of Diabetes by New York City County") + 
  ylab("") +
  xlab("Per Cent of Population with Diabetes")

However, among the 292 clients from our survey who were also screened for diabetes, the incidence is much higher at 19%. What is happening here?

  diab_percent <- dfNeedsInfo %>%
      summarize(PercentDiabetes = sum(DiabetesOrNot/n()))
  diab_percent
## # A tibble: 1 x 1
##   PercentDiabetes
##             <dbl>
## 1           0.190

While Richmond county is middle class, the clients in our survey are economically disadvantaged - all are from poorer zip codes seeking assistance from a community health organization.

Public health information helps us get at the relationship between poverty and diabetes health outcomes in the aggregate. For example, consider the hospitalization rate per 100,000 for all individuals with diabetes in Staten Island:

dfCommunityHealth <- sqlQuery(dbConnection, "SELECT * from tblCommunityHealth;")

HospRate <- dfCommunityHealth %>%
  filter(indicator=="Diabetes hospitalization rate per 10,000 (any diagnosis)" & county_name=="Richmond") %>%
  mutate(percentage_rate = percentage_rate/10) %>%
  select(percentage_rate)
HospRate
##   percentage_rate
## 1           26.38

Now consider these scatterplots of hospital admissions for diabetes patients with Medicaid, against indicators of poverty per zip code.

a <- ggplot(dfZipInfo_SI, aes(pov, percent_ip)) +
  geom_point() +
  ggtitle("Poverty And Medicaid Hospitalizations") +
   theme(plot.title = element_text(size = 10, face="bold")) +
      xlab("Percent below the poverty line") +
  ylab("Hospitalizations per 100,000") +
  geom_hline(yintercept=26.4, size=2, color='red', linetype='dotted')

b <- ggplot(dfZipInfo_SI, aes(NoInternet, percent_ip)) +
  geom_point() +
  ggtitle("Poverty And Medicaid Hospitalizations") +
   theme(plot.title = element_text(size = 10, face="bold")) +
      xlab("Percent without internet") +
  ylab("Hospitalizations per 100,000") +
  geom_hline(yintercept=26.4, size=2, color='red', linetype='dotted')


gridExtra::grid.arrange(a,b,ncol=2)

While the strong relationship between poverty level and hospitalization rate for medicaid recipients is likely due in part to the fact that there are more medicaid recipients in poorer zip codes, nonetheless we can see from these graphs that the 5 poorest zip codes have higher rates of hospitalization for medicaid recipients alone than the average hospitalization rate for all recipients.

Poverty and diabetes at the individual level - looking at the mechanisms of poverty

The second graph suggests that poverty affects individuals in a direct way - lack of internet may lead to lack of knowledge and poor health choices.

Our survey data allow us to examine diabetes and poverty at the individual level. We will look first to see if incidence of diabetes and particular economic and social needs are correlated.

We examine the correlation between the incidence of diabetes and social needs through chi squares and a multiple regression:

m10 <- glm(DiabetesOrNot ~ Need_Sad + Need_HelpUnderstanding + Need_Childcare + Need_Clothes + Need_LoseHousing + Need_SafePlace + Need_Job + Need_AffordNeeds + Need_HighSchool + Need_Transport + Need_Safe, family = "binomial", data=dfNeedsInfo)
summary(m10)
## 
## Call:
## glm(formula = DiabetesOrNot ~ Need_Sad + Need_HelpUnderstanding + 
##     Need_Childcare + Need_Clothes + Need_LoseHousing + Need_SafePlace + 
##     Need_Job + Need_AffordNeeds + Need_HighSchool + Need_Transport + 
##     Need_Safe, family = "binomial", data = dfNeedsInfo)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3436  -0.6284  -0.6284  -0.3168   2.4564  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.52183    0.16911  -8.999  < 2e-16 ***
## Need_Sad                -0.19058    1.36930  -0.139  0.88931    
## Need_HelpUnderstanding   0.47490    1.13057   0.420  0.67445    
## Need_Childcare         -14.61899  686.36417  -0.021  0.98301    
## Need_Clothes            -0.70005    1.05310  -0.665  0.50621    
## Need_LoseHousing         1.37903    1.29617   1.064  0.28736    
## Need_SafePlace          -0.65462    2.21737  -0.295  0.76782    
## Need_Job                 0.03591    0.40870   0.088  0.92998    
## Need_AffordNeeds         1.42953    0.49329   2.898  0.00376 ** 
## Need_HighSchool         -1.48072    0.78499  -1.886  0.05926 .  
## Need_Transport           0.73656    1.51552   0.486  0.62696    
## Need_Safe                0.38788    0.73477   0.528  0.59757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 338.06  on 347  degrees of freedom
## Residual deviance: 319.72  on 336  degrees of freedom
## AIC: 343.72
## 
## Number of Fisher Scoring iterations: 14
dfNeedsChi <- dfNeedsInfo %>%
  select(DiabetesOrNot, NeedsVector) %>%
  summarise_all(funs(chisq.test(.,dfNeedsInfo$DiabetesOrNot)$p.value))
kableExtra::kable(head(dfNeedsChi), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
DiabetesOrNot Need_Food Need_SafePlace Need_LoseHousing Need_Job Need_AffordNeeds Need_HighSchool Need_HelpUnderstanding Need_Sad Need_Childcare Need_Clothes Need_Transport Need_Safe
0 0.4783364 0.3761446 0.2533187 0.2993787 0.0198248 0.2186276 0.7193608 0.083128 0.7400691 0.6070138 0.4745132 0.4813973
tbl <- dfNeedsInfo %>%
  select(DiabetesOrNot, Need_AffordNeeds) %>%
  mutate(Need_AffordNeeds = ifelse(Need_AffordNeeds==0,0,1))

tbl2 <- table(tbl)
tbl2
##              Need_AffordNeeds
## DiabetesOrNot   0   1
##             0 263  19
##             1  54  12
psych::phi(tbl2)
## [1] 0.16

Those clients who cannot afford basic needs have a higher incidence of diabetes in both the chi squares and regression analysis (p=.02). However, we cannot reject the null hypothesis for other needs.

Diabetes is particularly dangerous for clients with Covid. We therefore look at the relationship between poverty and whether a client is vaccinated. Some of the needs are all zeros so we remove them.

m10 <- glm(SumOfNotVaccinated ~ Need_Sad + Need_Clothes + Need_LoseHousing + Need_Job + Need_AffordNeeds + Need_HighSchool + Need_Safe, family = "binomial", data=dfNeedsInfo_Vaccine)
summary(m10)
## 
## Call:
## glm(formula = SumOfNotVaccinated ~ Need_Sad + Need_Clothes + 
##     Need_LoseHousing + Need_Job + Need_AffordNeeds + Need_HighSchool + 
##     Need_Safe, family = "binomial", data = dfNeedsInfo_Vaccine)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6651  -0.9558  -0.9558   1.4165   1.4165  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.5465     0.1263  -4.328 1.51e-05 ***
## Need_Sad           10.9534  4248.3854   0.003   0.9979    
## Need_Clothes      -16.1358  1022.8449  -0.016   0.9874    
## Need_LoseHousing  -15.5864  2859.7350  -0.005   0.9957    
## Need_Job            1.6452     0.6785   2.425   0.0153 *  
## Need_AffordNeeds    0.8830     0.5990   1.474   0.1404    
## Need_HighSchool   -25.5406  4406.0046  -0.006   0.9954    
## Need_Safe          19.2175  3955.3301   0.005   0.9961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 407.86  on 303  degrees of freedom
## Residual deviance: 384.66  on 296  degrees of freedom
## AIC: 400.66
## 
## Number of Fisher Scoring iterations: 15
NeedsVector1 <- NeedsVector[! NeedsVector %in% c("Need_SafePlace", "Need_HelpUnderstanding", "Need_Childcare", "Need_Transport")] 


dfNeedsChi1 <- dfNeedsInfo_Vaccine %>%
  select(SumOfNotVaccinated, NeedsVector1) %>%
  summarise_all(funs(chisq.test(.,dfNeedsInfo_Vaccine$SumOfNotVaccinated)$p.value))
kableExtra::kable(head(dfNeedsChi1), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
SumOfNotVaccinated Need_Food Need_LoseHousing Need_Job Need_AffordNeeds Need_HighSchool Need_Sad Need_Clothes Need_Safe
0 0.3230587 0.8292155 0.0278985 0.1695744 0.4166836 0.7077675 1 0.0197716
tbl <- dfNeedsInfo_Vaccine %>%
  select(SumOfNotVaccinated, Need_Job) %>%
  mutate(Need_Job = ifelse(Need_Job==0,0,1))

tbl1 <- table(tbl)
tbl1
##                   Need_Job
## SumOfNotVaccinated   0   1
##                  0 179   5
##                  1 109  11
psych::phi(tbl1)
## [1] 0.14
tbl2 <- dfNeedsInfo_Vaccine %>%
  select(SumOfNotVaccinated, Need_Safe) %>%
  mutate(Need_Safe = ifelse(Need_Safe==0,0,1))

tbl3 <- table(tbl2)
tbl3
##                   Need_Safe
## SumOfNotVaccinated   0   1
##                  0 184   0
##                  1 115   5
psych::phi(tbl3)
## [1] 0.16

Unemployed clients and those that do not feel safe are correlated with no vaccine (p=.028 and p=.02 respectively).

The effect is not strong enough to run a successful prediction model. The model does no better than guessing. This is perhaps because the data is very unbalanced:

#https://www.statology.org/logistic-regression-in-r/
#make this example reproducible
set.seed(1)

#Use 70% of dfNeedsInfoset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(dfNeedsInfo_Vaccine), replace=TRUE, prob=c(0.7,0.3))
train <- dfNeedsInfo_Vaccine[sample, ]
test <- dfNeedsInfo_Vaccine[!sample, ]  

#fit logistic regression model
model1 <- glm(SumOfNotVaccinated ~ Need_Safe + Need_Job, family="binomial", data=train)

summary(model1)
## 
## Call:
## glm(formula = SumOfNotVaccinated ~ Need_Safe + Need_Job, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5518  -0.9756  -0.9756   1.3937   1.3937  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.4953     0.1436  -3.448 0.000564 ***
## Need_Safe     16.5805  1005.3119   0.016 0.986841    
## Need_Job       1.3426     0.7049   1.905 0.056805 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 298.72  on 220  degrees of freedom
## Residual deviance: 285.54  on 218  degrees of freedom
## AIC: 291.54
## 
## Number of Fisher Scoring iterations: 15
#model <- glm(HypertensionOrDiabetesOrNot ~ NumOfNeeds, family="binomial", data=train)
#model
#disable scientific notation for model summary
#options(scipen=999)


#Shows fit - ranges from 0 to almost 1
pscl::pR2(model1)["McFadden"]
## fitting null model for pseudo-r2
##   McFadden 
## 0.04413668
#Ranks importance
caret::varImp(model1)
##              Overall
## Need_Safe 0.01649294
## Need_Job  1.90481077
#multicolinearity (VIF)
car::vif(model1)
## Need_Safe  Need_Job 
##         1         1
#calculate probability of default for each individual in test dataset
predicted <- predict(model1, test, type="response")
p2 <- as.data.frame(predicted)

p3 <- test %>%
  select(SumOfNotVaccinated)

Predictions <- cbind(p2, p3)
Predictions$predicted <- ifelse(Predictions$predicted >.38, 1, 0)
Predictions$predicted=as.factor(Predictions$predicted)
Predictions$SumOfNotVaccinated=as.factor(Predictions$SumOfNotVaccinated)

psych::describe(Predictions)
##                     vars  n mean   sd median trimmed mad min max range skew
## predicted*             1 83 1.05 0.22      1    1.00   0   1   2     1 4.14
## SumOfNotVaccinated*    2 83 1.36 0.48      1    1.33   0   1   2     1 0.57
##                     kurtosis   se
## predicted*             15.35 0.02
## SumOfNotVaccinated*    -1.70 0.05
#find optimal cutoff probability to use to maximize accuracy
optimal <- InformationValue::optimalCutoff(test$SumOfNotVaccinated, predicted)[1]
optimal
## [1] 0.38
#ConfusionMatrix doesn't work
caret::confusionMatrix(Predictions$SumOfNotVaccinated, Predictions$predicted)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 51  2
##          1 28  2
##                                           
##                Accuracy : 0.6386          
##                  95% CI : (0.5257, 0.7412)
##     No Information Rate : 0.9518          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0356          
##                                           
##  Mcnemar's Test P-Value : 5.01e-06        
##                                           
##             Sensitivity : 0.64557         
##             Specificity : 0.50000         
##          Pos Pred Value : 0.96226         
##          Neg Pred Value : 0.06667         
##              Prevalence : 0.95181         
##          Detection Rate : 0.61446         
##    Detection Prevalence : 0.63855         
##       Balanced Accuracy : 0.57278         
##                                           
##        'Positive' Class : 0               
## 
#calculate sensitivity
#sensitivity(test$HypertensionOrDiabetesOrNot, predicted)

#calculate specificity
#specificity(test$HypertensionOrDiabetesOrNot, predicted)

#calculate total misclassification error rate
#misClassError(Predictions$HypertensionOrDiabetesOrNot, Predictions$predicted, threshold=optimal`)
  1. Lastly we look at race and other demographics.
rBase1 <- lm(InitialA1C ~ BlackOrNot + MaleOrNot + Age,  data = dfA1CAndDemographics)
summary(rBase1)
## 
## Call:
## lm(formula = InitialA1C ~ BlackOrNot + MaleOrNot + Age, data = dfA1CAndDemographics)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.704 -0.702 -0.358  0.099 53.070 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.787345   0.092460  51.778  < 2e-16 ***
## BlackOrNot  0.094788   0.052052   1.821  0.06866 .  
## MaleOrNot   0.165160   0.052783   3.129  0.00176 ** 
## Age         0.021538   0.001628  13.228  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.859 on 5383 degrees of freedom
##   (46 observations deleted due to missingness)
## Multiple R-squared:  0.03374,    Adjusted R-squared:  0.0332 
## F-statistic: 62.65 on 3 and 5383 DF,  p-value: < 2.2e-16

Here we can see that race, gender and age all affect the initial A1C at a client’s first screening. The R2 is very low, however.

Conclusion

This study touched the surface of the question, "how are poverty and diabetes connected?’ We found statistical associations between poverty and diabetes hospitalization rates, between being unable to afford basic needs and diabetes incidence rates, between having not been vaccinated against not having a job and feeling unsafe, and between high A1C and race, age and gender. Clearly there is more to explore here.

This analysis only shows a small handful of the anlyses that were run on this data. I also analyzed the relationship between poverty and chronic hypertension, A1V levels, success in A1C management, and so on. Thistype of “fishing” increases the likelihood ofsampling errors. A cursory analysis of survey results by program and staff suggests a lack of consistency in how the survey was applied. In the second round of survey impkementation I plan to make sure that suveys are conducted in a consistent manager accross staff and programs to provide the possibility for a more robust analysis.