library(ROSE)
library(tidyverse)
library(magrittr)
library(janitor)
library(RODBC)
library(pscl)
library(caret)
library(car)
library(InformationValue)
library(rvest)
library(jsonlite)
library(xml2)
library(XML)
library(stringr)
library(kableExtra)
We know that poverty affects health outcomes. We also know that obesity, poor diet, and poor lifestyle choices often result from poverty and inadequate access to resources. But what are the actual poverty markers that lead to these outcomes - is it a lack of transportation options? Housing insecurity? The lack of a high school degree? In this study we take a deeper look. To answer these questions we will examine data from a client survey on economic needs administered at my workplace, supplemented with other client health-related databases from my workplace as well as data from outside sources such as the NY Department of Health.
Altogether we will use data from four different sources: the client needs survey, my workplace's client health database, New York State Department of Health and CDC open-data APIs, and zip-code-level web data on poverty and broadband access.
After collecting the data we will store it in the database so that we can easily combine the datasets in productive ways using SQL. In particular, we will generate dataframes that contain zip code level information on poverty and public health, and client level information on economic needs and individual client health. Then we will test for correlations and other relationships within the data.
Note - Certain operations that load data and are very time consuming are wrapped in if(run) blocks. They can be run or skipped by changing the value of the boolean variable "run", which appears at the top of each chunk.
This file is the result of a survey of over 3,000 clients taken at my workplace. We did not create the survey instrument - the file was provided by the organization that did. The IDs have been matched to those of the clients in the SQL database. The file is otherwise free of any personal identifying information.
For security purposes these files are stored locally. However, they have already been read into the SQL Server database and can be accessed from there through keyring. Leave "run = FALSE" in the following code, and in the code in which the file is read into the database, to use this Rmd successfully with the survey.
We also input a file that simply contains the list of survey needs.
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
dfSDOH_raw <- read.csv("D:\\Buffer\\SDOH__En__1_0 (3).csv", encoding = "UTF-8")
}
dfListOfNeeds <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Final%20Project/ListOfNeeds.csv", encoding = "UTF-8")
dfListOfNeeds %<>%
rename(Need = X.U.FEFF.Need) #X.U.FEFF is the UTF-8 byte-order mark that read.csv picked up in the first column name
The survey asks these questions:
kableExtra::kable(dfListOfNeeds)
| Need |
|---|
| Do you or your family ever feel hungry? |
| Do you have a safe place to live right now? |
| Are you worried about losing your housing? |
| Do you have a job? |
| Are you able to afford basic needs? |
| Do you have a high school diploma or a GED? |
| Do you ever need help understanding what your doctor tells you? |
| Are you sad or lonely? |
| Do problems with childcare make it hard for you? |
| Do you ever have a problem getting the right type of clothes for each season? |
| Do you have access to transportation to get where you need to go? |
| Do you feel safe: emotionally, physically? |
Since the survey instrument and resultant data were created and delivered by an outside agency, the data need extensive reshaping. The file contains 800,000 observations with only two columns, "Field" and "Value"; each client record takes up 211 observations.
We begin by pivoting wider every 211 records, taking our names from “Field” and our values from “Value”, and binding each to a final dataframe.
run=FALSE
if (run) {
#Pivot the first 211-row record into a single wide row
dfSDOH <- dfSDOH_raw %>%
filter(as.numeric(Num) <= 211) %>%
pivot_wider(Org, names_from = "Field", values_from = "Value")
dfFinal <- dfSDOH
#Repeat for each remaining 211-row block and bind the rows to dfFinal
x <- 211
for (i in 1:3752)
{
dfSDOH <- dfSDOH_raw %>%
filter(as.numeric(Num) > x & as.numeric(Num) <= (x + 211)) %>%
pivot_wider(Org, names_from = "Field", values_from = "Value")
x <- x + 211
dfFinal <- rbind(dfFinal, dfSDOH)
}
}
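As an aside, the chunked loop above can be slow. A vectorized alternative - just a sketch, assuming dfSDOH_raw has the Num, Org, Field and Value columns used above and that field names are unique within each 211-row record - would assign each row a record index and pivot once:
run=FALSE
if (run) {
#Assign every row to a record: rows 1-211 -> record 0, rows 212-422 -> record 1, and so on
dfFinal_alt <- dfSDOH_raw %>%
mutate(record = (as.numeric(Num) - 1) %/% 211) %>%
pivot_wider(id_cols = c(record, Org), names_from = "Field", values_from = "Value")
}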
We eliminate bad records that have no ClientID.
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
dfFinal %<>%
filter (ClientID > 0)
}
We select the columns which describe needs, and rename them to reflect the type of need.
NeedsVector = c("Need_Food", "Need_SafePlace", "Need_LoseHousing", "Need_Job", "Need_AffordNeeds", "Need_HighSchool", "Need_HelpUnderstanding", "Need_Sad", "Need_Childcare", "Need_Clothes", "Need_Transport", "Need_Safe")
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
dfFinal2 <- dfFinal %>%
rename(Need_Food=DoYouOrYou, Need_SafePlace = DoYouHaveA, Need_LoseHousing = AreYouWorr, Need_Job = DoYouHaveA1, Need_AffordNeeds = AreYouAble, Need_HighSchool = DoYouHaveA2, Need_HelpUnderstanding=DoYouEverN, Need_Sad = SocialAreY, Need_Childcare = ChildCareD, Need_Clothes = DoYouEverH, Need_Transport = DoYouHaveA3, Need_Safe = DoYouFeelS )
dfFinal3 <- dfFinal2 %>%
select(ClientID, starts_with('Need_'))
}
Each of the 12 need columns is binary, containing either a yes or no in response to a question about a need. However, sometimes 'yes' means the client has a need ("Do you or your family ever feel hungry?") and sometimes 'yes' means the client does not have a need ("Do you feel safe: emotionally, physically?").
We will recode the questions so that 1 means ‘has a need’ and 0 means ‘does not have a need’ for all variables.
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
#For these questions, a 'Yes' answer indicates a need
QCorrect <- function(x){
ifelse(x == "Yes", 1, 0)
}
#These questions are reverse-worded, so a 'No' answer indicates a need
QReverse <- function(x){
ifelse(x == "No", 1, 0)
}
dfFinal4 <- dfFinal3 %>%
select(Need_Food, Need_Sad, Need_HelpUnderstanding, Need_Childcare, Need_Clothes, Need_LoseHousing) %>%
mutate_all(QCorrect)
dfFinal5 <- dfFinal3 %>%
select(Need_SafePlace, Need_Job, Need_AffordNeeds, Need_HighSchool, Need_Transport, Need_Safe) %>%
mutate_all(QReverse)
dfFinal6 <- dfFinal3 %>%
select(ClientID)
dfFinal6$ClientID <- as.numeric(as.character(dfFinal6$ClientID))
dfFinal10 <- cbind(dfFinal6, dfFinal4, dfFinal5)
}
I built this database for my workplace. It contains information on client health screens for diabetes, hypertension and other chronic diseases. We will use three tables:
Demographic data: A table of 5,433 unique clients with demographic data and the client’s first A1C result (A1C is a test that indicates diabetes.)
Health data: A table of 39,491 observations for 8,546 unique clients with A1C, Blood Pressure and other health screen data.
Vaccine data: A table of 2122 unique clients with info on vaccine status.
The database is password protected, with credentials retrieved through keyring. Here we create a connection.
## Please enter password in TK window (Alt+Tab)
# This is the connection string:
strConnection <- paste0(
'Driver={ODBC Driver 13 for SQL Server};
Server=tcp:ehtmp.database.windows.net,1433;
Database=HC_A1C;
Encrypt=yes;
TrustServerCertificate=no;
Connection Timeout=30;',
'Uid=',keyring::key_get(service = "EH_606_username", keyring = "EH_606_keyring"),';',
'Pwd=', keyring::key_get(service = "EH_606_pwd", keyring = "EH_606_keyring"), ';'
)
library(RODBC)
dbConnection <- odbcDriverConnect(strConnection)
#keyring::keyring_lock("EH_606_keyring") - lock the keyring again when finished
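The keyring entries referenced in the connection string must already exist on the machine. The one-time setup might look like the sketch below; the keyring and service names match those used above, but the username and password values are placeholders for the real credentials.
run=FALSE
if(run) {
#One-time setup (sketch): create the keyring and store the database credentials in it
keyring::keyring_create("EH_606_keyring")
keyring::key_set_with_value(service = "EH_606_username", password = "db_username_goes_here", keyring = "EH_606_keyring")
keyring::key_set_with_value(service = "EH_606_pwd", password = "db_password_goes_here", keyring = "EH_606_keyring")
}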
We read the New York DOH data from the APIs. The RSocrata library makes this easy.
library("RSocrata")
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
#Community Health Obesity and Diabetes Related Indicators: 2008 - 2012
dfCommunityHealthObesityAndDiabetes <- read.socrata(
"https://health.data.ny.gov/resource/tchg-ruva.json")
#Community Health: Age-adjusted percentage of adults with physician diagnosed diabetes: 2008 - 2009
dfAdultsDiagnosed <- read.socrata(
"https://health.data.ny.gov/resource/9j5w-7zpd.json")
#AH Provisional Diabetes Death Counts, 2020
dfDeath <- read.socrata(
"https://data.cdc.gov/resource/qdcb-uzft.json")
#500 Cities: Diagnosed diabetes among adults aged >=18 years
dfCityComparison <- read.socrata(
"https://chronicdata.cdc.gov/resource/cn78-b9bj.json")
#Conditions contributing to deaths involving coronavirus disease 2019 (COVID-19), by age group and state, United States.
dfDeathsForCovid <- read.socrata(
"https://data.cdc.gov/resource/hk9y-quqm.json")
#Community Health: Diabetes Short-term Complications Hospitalization Rate per 10,000 - Aged 18+ Years by County Map: Latest Data
dfHospitalizations <- read.socrata(
"https://health.data.ny.gov/resource/xuwq-ppg8.json")
#Medicaid Chronic Conditions, Inpatient Admissions and Emergency Room Visits by Zip Code: Beginning 2012
dfMedicaidByZip <- read.socrata(
"https://health.data.ny.gov/resource/2yck-xisk.json")
}
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
#Medicaid Program Enrollment by Month: Beginning 2009
dfMedicaidEnrollment <- read.socrata(
"https://health.data.ny.gov/resource/m4hz-kzn3.json")
}
We scrape and then clean 17 pages of zipatlas data to compile a table of zip codes, population levels and poverty levels. We also read a csv file containing information on zip codes and broadband access. That file is compiled by the City of New York; a copy is read from the project's GitHub repository below.
library(XML)
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
x="http://www.zipatlas.com/us/ny/zip-code-comparison/population-below-poverty-level.htm"
dfPopAndPovertyLevel = as.data.frame(readHTMLTable(x, header=T,which=5,strings2factors=F))
for (i in 1:16) {
x= str_c("http://www.zipatlas.com/us/ny/zip-code-comparison/population-below-poverty-level.", i, ".htm")
dfx = as.data.frame(readHTMLTable(x, header=T,which=5,strings2factors=F))
dfPopAndPovertyLevel=rbind(dfPopAndPovertyLevel, dfx)
}
}
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
dfPopAndPovertyLevel %<>%
filter(str_length(V2) == 5 & !is.na(V4)) %>%
rename(number = V1, zip = V2, location = V3, city = V4, population = V5, percent_below_poverty = V6, rank = V7) %>%
mutate(population = as.numeric(gsub(",", "", population))) %>%
mutate(percent_below_poverty = as.numeric(gsub(" %", "", percent_below_poverty))) %>%
mutate(rank = as.numeric(gsub("#", "", rank)))
}
dfBroadband_raw <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Final%20Project/Broadband_Adoption.csv", encoding = "UTF-8")
If we load our datasets into persistent storage (the SQL Server database), then we can use SQL to easily combine data from several tables and read the result back into a dataframe. Just as important, many of the datasets in this project are very large and take a long time to load. We can therefore filter and prepare them once, write them to the database, and then read them back from the database exclusively to cut down on load time.
We begin with 3 NYDOH files: dfMedicaidByZip, a NYDOH file with over a million records of Medicaid information; dfAdultsDiagnosed, which has 63 records about diabetes diagnoses throughout New York State; and dfCommunityHealthObesityAndDiabetes, a dataset with 2733 records of detailed county-level information about diabetes.
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblMedicaidByZip")
PopulateTable <- sqlSave(dbConnection, dfMedicaidByZip, "tblMedicaidByZip", append=TRUE)
}
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblAdultsDiagnosed")
PopulateTable <- sqlSave(dbConnection, dfAdultsDiagnosed, "tblAdultsDiagnosed", append=TRUE)
}
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblCommunityHealth")
PopulateTable <- sqlSave(dbConnection, dfCommunityHealthObesityAndDiabetes, "tblCommunityHealth", append=TRUE)
}
Next we clean and load the scraped dataset with 1711 zip code level observations on population and poverty in NY State:
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblPopAndPovertyLevel")
PopulateTable <- sqlSave(dbConnection, dfPopAndPovertyLevel, "tblPopAndPovertyLevel", append=TRUE)
}
Next we load the survey data into a table called tblNeeds:
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblNeeds")
PopulateTable <- sqlSave(dbConnection, dfFinal10, "tblNeeds", append=TRUE)
}
Finally, we add the broadband-by-zip-code dataframe:
#This operation is time consuming. Set run to TRUE the first time you run it, and FALSE afterward.
run=FALSE
if(run) {
DropTable <- sqlQuery(dbConnection, "DROP TABLE IF EXISTS tblBroadband")
PopulateTable <- sqlSave(dbConnection, dfBroadband_raw, "tblBroadband", append=TRUE)
}
We will create the following dataframes by combining elements from a number of different tables in the database:
dfZipInfo, a dataframe with 99 zip-code-level observations with information on population, poverty level, Medicaid utilization, average A1C from client data, and broadband access.
dfNeedsInfo, a dataframe with client-level observations with information on needs from the survey, chronic diseases, demographics and initial A1C. The dataframe includes only those clients who took the survey AND have been screened for diabetes.
dfNeedsInfo_Vaccine, a dataframe with 304 client-level observations with information on needs from the survey and vaccine status. The dataframe includes only those clients who took the survey AND have vaccine information.
dfA1CAndDemographics, a dataframe with 5433 client-level observations with information on demographics and initial A1C readings.
dfZipInfo_raw <- sqlQuery(dbConnection, "SELECT tblMedicaidByZip.zip_code, tblMedicaidByZip.major_diagnostic_category, tblPopAndPovertyLevel.population, tblPopAndPovertyLevel.percent_below_poverty, tblMedicaidByZip.ip_admits, tblMedicaidByZip.er_visits, tblA1CAndDemographics.ClientID, tblA1CAndDemographics.Zip, tblA1CAndDemographics.InitialA1C, tblA1CAndDemographics.Gender, tblA1CAndDemographics.RaceandEthnicity, tblA1CAndDemographics.Age, tblBroadband.NoInternetAccessPercentageOfHouseholds
FROM (tblMedicaidByZip INNER JOIN (tblA1CAndDemographics INNER JOIN tblPopAndPovertyLevel ON tblA1CAndDemographics.Zip = tblPopAndPovertyLevel.zip) ON tblMedicaidByZip.zip_code = tblPopAndPovertyLevel.zip) INNER JOIN tblBroadband ON tblPopAndPovertyLevel.zip = tblBroadband.ZipCode;
")
dfZipInfo10 <- dfZipInfo_raw %>%
filter(major_diagnostic_category=="Diabetes Mellitus") %>%
na.omit() %>%
group_by(zip_code) %>%
summarize(n=n(), ave_A1C=mean(InitialA1C), pov=mean(percent_below_poverty), pop=mean(population), NoInternet=mean(NoInternetAccessPercentageOfHouseholds*100), percent_ip = sum(ip_admits/population), percent_er=sum(er_visits/population))
dfZipInfo_SI <- dfZipInfo10 %>%
filter(as.numeric(zip_code) >= 10301 & as.numeric(zip_code) <= 10314)
dfA1CAndDemographics <- sqlQuery(dbConnection, "SELECT * FROM tblA1CAndDemographics")
dfA1CAndDemographics %<>%
mutate(BlackOrNot = ifelse(RaceandEthnicity == "Black/African American", 1, 0)) %>%
mutate(MaleOrNot = ifelse(Gender=="Male", 1, 0)) %>%
mutate(A1CAboveMean = ifelse(InitialA1C > 7, 1, 0))
dfNeedsInfo_raw <- sqlQuery(dbConnection, "SELECT tblNeeds.ClientID, tblChronic.Chronic, tblReadings.Type, tblNeeds.Need_Food, tblNeeds.Need_Sad, tblNeeds.Need_HelpUnderstanding, tblNeeds.Need_Childcare, tblNeeds.Need_Clothes, tblNeeds.Need_LoseHousing, tblNeeds.Need_SafePlace, tblNeeds.Need_Job, tblNeeds.Need_AffordNeeds, tblNeeds.Need_HighSchool, tblNeeds.Need_Transport, tblNeeds.Need_Safe
FROM tblReadings INNER JOIN (tblChronic RIGHT JOIN tblNeeds ON tblChronic.ClientID = tblNeeds.ClientID) ON tblReadings.ClientID = tblNeeds.ClientID
WHERE tblReadings.Type='A1C';")
dfNeedsInfo <- dfNeedsInfo_raw %>%
group_by(ClientID) %>%
mutate(DiabetesOrNot = ifelse(Chronic == "Diabetes Type I"|Chronic=="Diabetes Type II"|Chronic == "Diabetes (doctor denied)"|Chronic == "Diabetes type I", 1, 0)) %>%
mutate(HypertensionOrNot = ifelse(Chronic == "Hypertension", 1,0)) %>%
mutate(HypertensionOrDiabetesOrNot = ifelse(Chronic == "Diabetes Type I"|Chronic=="Diabetes Type II"|Chronic == "Hypertension", 1,0)) %>%
summarize(sum(DiabetesOrNot), sum(HypertensionOrNot), sum(HypertensionOrDiabetesOrNot), across(all_of(NeedsVector), ~ mean(.x, na.rm = TRUE))) %>%
rename(DiabetesOrNot="sum(DiabetesOrNot)") %>%
rename(HypertensionOrNot="sum(HypertensionOrNot)") %>%
rename(HypertensionOrDiabetesOrNot="sum(HypertensionOrDiabetesOrNot)") %>%
mutate(DiabetesOrNot = ifelse(DiabetesOrNot>0, 1, 0)) %>%
mutate(HypertensionOrNot = ifelse(HypertensionOrNot>0, 1, 0)) %>%
mutate(HypertensionOrDiabetesOrNot = ifelse(HypertensionOrDiabetesOrNot>0, 1, 0)) %>%
mutate(NumOfNeeds = Need_Sad+Need_HelpUnderstanding+Need_Childcare+Need_Clothes+ Need_LoseHousing+Need_SafePlace+ Need_Job+Need_AffordNeeds+Need_HighSchool+Need_Transport+Need_Safe)
dfNeedsInfo$DiabetesOrNot <- replace(dfNeedsInfo$DiabetesOrNot, is.na(dfNeedsInfo$DiabetesOrNot), 0)
dfNeedsInfo_Vaccine <- sqlQuery(dbConnection, "SELECT tblVaccine.*, tblNeeds.Need_Food, tblNeeds.Need_Sad, tblNeeds.Need_HelpUnderstanding, tblNeeds.Need_Childcare, tblNeeds.Need_Clothes, tblNeeds.Need_LoseHousing, tblNeeds.Need_SafePlace, tblNeeds.Need_Job, tblNeeds.Need_AffordNeeds, tblNeeds.Need_HighSchool, tblNeeds.Need_Transport, tblNeeds.Need_Safe
FROM tblNeeds INNER JOIN tblVaccine ON tblVaccine.ClientID = tblNeeds.ClientID;")
dfNeedsInfo3_raw <- sqlQuery(dbConnection, "SELECT * from tblNeeds")
These are the final dataframes which will be used for analysis:
kable(psych::describe(dfZipInfo10), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zip_code | 1 | 99 | 10732.727273 | 545.141993 | 1.047300e+04 | 1.073309e+04 | 6.983046e+02 | 1.00010e+04 | 11694.00000 | 1693.00000 | -0.0643484 | -1.7025451 | 54.7888317 |
| n | 2 | 99 | 1608.767677 | 5167.760737 | 1.410000e+02 | 2.300000e+02 | 1.690164e+02 | 6.00000e+00 | 30874.00000 | 30868.00000 | 3.9544099 | 15.6486294 | 519.3794961 |
| ave_A1C | 3 | 99 | 6.076375 | 1.541382 | 5.802632e+00 | 5.853841e+00 | 5.891384e-01 | 0.00000e+00 | 13.30000 | 13.30000 | 1.5766543 | 7.9789903 | 0.1549147 |
| pov | 4 | 99 | 29.846162 | 17.291370 | 2.396000e+01 | 2.874667e+01 | 1.826563e+01 | 4.96000e+00 | 70.12000 | 65.16000 | 0.5426009 | -0.8320327 | 1.7378481 |
| pop | 5 | 99 | 53790.494949 | 24502.945900 | 5.387700e+04 | 5.346051e+04 | 3.139406e+04 | 1.13540e+04 | 106154.00000 | 94800.00000 | 0.1239264 | -1.0908126 | 2462.6387214 |
| NoInternet | 6 | 99 | 17.884646 | 6.273084 | 1.715000e+01 | 1.777926e+01 | 6.523440e+00 | 5.95000e+00 | 32.36000 | 26.41000 | 0.2072927 | -0.6523270 | 0.6304687 |
| percent_ip | 7 | 99 | 4.578043 | 15.794267 | 3.664816e-01 | 6.727343e-01 | 4.551523e-01 | 1.14795e-02 | 97.95367 | 97.94219 | 4.3155717 | 18.5438846 | 1.5873836 |
| percent_er | 8 | 99 | 5.872413 | 21.252716 | 3.893043e-01 | 7.301394e-01 | 4.936495e-01 | 1.46325e-02 | 132.47984 | 132.46520 | 4.3585059 | 18.7755648 | 2.1359784 |
kable(psych::describe(dfNeedsInfo), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ClientID | 1 | 348 | 2.966458e+04 | 1.575818e+04 | 25923 | 2.916106e+04 | 18595.51 | 7168 | 64237 | 57069 | 0.3037166 | -1.3249444 | 844.7271667 |
| DiabetesOrNot | 2 | 348 | 1.896552e-01 | 3.925926e-01 | 0 | 1.142857e-01 | 0.00 | 0 | 1 | 1 | 1.5764586 | 0.4866449 | 0.0210452 |
| HypertensionOrNot | 3 | 233 | 6.523605e-01 | 4.772461e-01 | 1 | 6.898396e-01 | 0.00 | 0 | 1 | 1 | -0.6357584 | -1.6026343 | 0.0312654 |
| HypertensionOrDiabetesOrNot | 4 | 233 | 8.068670e-01 | 3.956063e-01 | 1 | 8.823529e-01 | 0.00 | 0 | 1 | 1 | -1.5447171 | 0.3878708 | 0.0259170 |
| Need_Food | 5 | 348 | 8.807471e-01 | 3.189546e-01 | 1 | 9.732143e-01 | 0.00 | 0 | 1 | 1 | -2.3384605 | 3.5621568 | 0.0170978 |
| Need_SafePlace | 6 | 348 | 1.149430e-02 | 9.976950e-02 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 9.0577938 | 83.3778152 | 0.0053482 |
| Need_LoseHousing | 7 | 348 | 1.724140e-02 | 1.247077e-01 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 7.3623562 | 53.7030566 | 0.0066850 |
| Need_Job | 8 | 348 | 2.159962e-01 | 3.973691e-01 | 0 | 1.470238e-01 | 0.00 | 0 | 1 | 1 | 1.3748729 | -0.0018337 | 0.0213012 |
| Need_AffordNeeds | 9 | 348 | 7.854410e-02 | 2.600390e-01 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 3.1196035 | 8.0028700 | 0.0139396 |
| Need_HighSchool | 10 | 348 | 8.812260e-02 | 2.801915e-01 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 2.8878253 | 6.4365180 | 0.0150198 |
| Need_HelpUnderstanding | 11 | 348 | 2.155170e-02 | 1.429250e-01 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 6.5581311 | 41.5425502 | 0.0076616 |
| Need_Sad | 12 | 348 | 1.963600e-02 | 1.227322e-01 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 6.7636810 | 47.2894576 | 0.0065791 |
| Need_Childcare | 13 | 348 | 1.149430e-02 | 1.067468e-01 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 9.1263064 | 81.5237573 | 0.0057222 |
| Need_Clothes | 14 | 348 | 4.022990e-02 | 1.893168e-01 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 4.6484797 | 20.1796933 | 0.0101484 |
| Need_Transport | 15 | 348 | 7.183900e-03 | 8.020220e-02 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 11.5677700 | 136.0490731 | 0.0042993 |
| Need_Safe | 16 | 348 | 3.879310e-02 | 1.838297e-01 | 0 | 0.000000e+00 | 0.00 | 0 | 1 | 1 | 4.7366709 | 21.1986878 | 0.0098543 |
| NumOfNeeds | 17 | 348 | 5.502874e-01 | 9.631922e-01 | 0 | 3.500000e-01 | 0.00 | 0 | 8 | 8 | 3.1730058 | 16.0423113 | 0.0516325 |
kableExtra::kable(psych::describe(dfNeedsInfo_Vaccine), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rownames | 1 | 304 | 8.146513e+02 | 4.696786e+02 | 749 | 8.035615e+02 | 578.9553 | 8 | 2081 | 2073 | 0.2284195 | -0.9306780 | 26.9379129 |
| ClientID | 2 | 304 | 4.803288e+04 | 1.770246e+04 | 54953 | 5.004211e+04 | 13790.4039 | 6908 | 68217 | 61309 | -0.8006545 | -0.6991243 | 1015.3055418 |
| SumOfVaccinated | 3 | 304 | 6.250000e-01 | 4.849211e-01 | 1 | 6.557377e-01 | 0.0000 | 0 | 1 | 1 | -0.5138519 | -1.7416530 | 0.0278121 |
| SumOfNotVaccinated | 4 | 304 | 3.947368e-01 | 4.896000e-01 | 0 | 3.688525e-01 | 0.0000 | 0 | 1 | 1 | 0.4285821 | -1.8222793 | 0.0280805 |
| SumOfWantsVaccine | 5 | 304 | 7.894740e-02 | 2.701012e-01 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 3.1074840 | 7.6817580 | 0.0154914 |
| SumOfUnsure | 6 | 304 | 2.763158e-01 | 4.479122e-01 | 0 | 2.213115e-01 | 0.0000 | 0 | 1 | 1 | 0.9955005 | -1.0122761 | 0.0256895 |
| SumOfAppointmentdatewithSunRiverReferraltoJenille | 7 | 304 | 7.894740e-02 | 2.701012e-01 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 3.1074840 | 7.6817580 | 0.0154914 |
| SumOfNoAnswer | 8 | 304 | 1.973680e-02 | 1.393239e-01 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 6.8715176 | 45.3670198 | 0.0079908 |
| DateOfContact* | 9 | 304 | 1.068421e+01 | 6.401239e+00 | 11 | 1.052869e+01 | 8.8956 | 1 | 22 | 21 | 0.0722340 | -1.1901540 | 0.3671362 |
| Need_Food | 10 | 304 | 9.769737e-01 | 1.502343e-01 | 1 | 1.000000e+00 | 0.0000 | 0 | 1 | 1 | -6.3288430 | 38.1798775 | 0.0086165 |
| Need_Sad | 11 | 304 | 9.868400e-03 | 9.901150e-02 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 9.8679276 | 95.6908007 | 0.0056787 |
| Need_HelpUnderstanding | 12 | 304 | 0.000000e+00 | 0.000000e+00 | 0 | 0.000000e+00 | 0.0000 | 0 | 0 | 0 | NaN | NaN | 0.0000000 |
| Need_Childcare | 13 | 304 | 0.000000e+00 | 0.000000e+00 | 0 | 0.000000e+00 | 0.0000 | 0 | 0 | 0 | NaN | NaN | 0.0000000 |
| Need_Clothes | 14 | 304 | 9.868400e-03 | 9.901150e-02 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 9.8679276 | 95.6908007 | 0.0056787 |
| Need_LoseHousing | 15 | 304 | 3.289500e-03 | 5.735390e-02 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 17.2639113 | 297.0197044 | 0.0032895 |
| Need_SafePlace | 16 | 304 | 0.000000e+00 | 0.000000e+00 | 0 | 0.000000e+00 | 0.0000 | 0 | 0 | 0 | NaN | NaN | 0.0000000 |
| Need_Job | 17 | 304 | 5.263160e-02 | 2.236651e-01 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 3.9871836 | 13.9435325 | 0.0128281 |
| Need_AffordNeeds | 18 | 304 | 4.276320e-02 | 2.026563e-01 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 4.4975934 | 18.2885386 | 0.0116231 |
| Need_HighSchool | 19 | 304 | 9.868400e-03 | 9.901150e-02 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 9.8679276 | 95.6908007 | 0.0056787 |
| Need_Transport | 20 | 304 | 0.000000e+00 | 0.000000e+00 | 0 | 0.000000e+00 | 0.0000 | 0 | 0 | 0 | NaN | NaN | 0.0000000 |
| Need_Safe | 21 | 304 | 1.644740e-02 | 1.273980e-01 | 0 | 0.000000e+00 | 0.0000 | 0 | 1 | 1 | 7.5662433 | 55.4304067 | 0.0073068 |
kableExtra::kable(psych::describe(dfA1CAndDemographics), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rownames | 1 | 5433 | 2.717000e+03 | 1.568516e+03 | 2717.0 | 2.717000e+03 | 2013.37080 | 1 | 5433 | 5432 | 0.0000000 | -1.2006626 | 21.2798810 |
| ClientID | 2 | 5433 | 2.614950e+04 | 1.360138e+04 | 23056.0 | 2.507657e+04 | 13318.19580 | 5724 | 64237 | 58513 | 0.6163751 | -0.7000766 | 184.5284125 |
| InitialA1C | 3 | 5387 | 5.996850e+00 | 1.890963e+00 | 5.6 | 5.695646e+00 | 0.59304 | 0 | 59 | 59 | 11.7031865 | 285.2218961 | 0.0257638 |
| Gender* | 4 | 5433 | 2.732192e+00 | 9.849138e-01 | 2.0 | 2.655394e+00 | 0.00000 | 1 | 7 | 6 | 0.7206722 | -0.8799277 | 0.0133622 |
| RaceandEthnicity* | 5 | 5433 | 9.787963e+00 | 3.699842e+00 | 9.0 | 9.800092e+00 | 4.44780 | 1 | 14 | 13 | 0.0580228 | -1.7376694 | 0.0501953 |
| Age | 6 | 5433 | 5.167090e+01 | 1.554934e+01 | 53.0 | 5.161353e+01 | 16.30860 | -4 | 175 | 179 | 0.0706890 | 0.1713094 | 0.2109561 |
| Zip | 7 | 5432 | 1.033860e+04 | 1.329142e+03 | 10303.0 | 1.030484e+04 | 2.96520 | 0 | 93543 | 93543 | 46.8664053 | 2871.5735410 | 18.0339738 |
| BlackOrNot | 8 | 5433 | 3.868949e-01 | 4.870841e-01 | 0.0 | 3.586381e-01 | 0.00000 | 0 | 1 | 1 | 0.4643317 | -1.7847245 | 0.0066082 |
| MaleOrNot | 9 | 5433 | 3.591018e-01 | 4.797813e-01 | 0.0 | 3.239015e-01 | 0.00000 | 0 | 1 | 1 | 0.5872354 | -1.6554591 | 0.0065091 |
| A1CAboveMean | 10 | 5387 | 1.236310e-01 | 3.291906e-01 | 0.0 | 2.969150e-02 | 0.00000 | 0 | 1 | 1 | 2.2862083 | 3.2273478 | 0.0044851 |
Richmond County (Staten Island) is a relatively middle-class borough. The NY DOH reports that the incidence of diabetes in Staten Island is 8.5%. The chart below shows the incidence of diabetes by New York City county; Staten Island has the second-lowest rate.
dfIncidenceOfDiabetes <- sqlQuery(dbConnection, "SELECT * FROM tblAdultsDiagnosed Where region_name = 'New York City';")
ggplot(dfIncidenceOfDiabetes, aes(percentage_rate, reorder(county_name, as.numeric(percentage_rate)))) +
geom_col() +
ggtitle("Incidence of Diabetes by New York City County") +
ylab("") +
xlab("Per Cent of Population with Diabetes")
However, among the 292 clients from our survey who were also screened for diabetes, the incidence is much higher at 19%. What is happening here?
diab_percent <- dfNeedsInfo %>%
summarize(PercentDiabetes = sum(DiabetesOrNot/n()))
diab_percent
## # A tibble: 1 x 1
## PercentDiabetes
## <dbl>
## 1 0.190
While Richmond County is middle class, the clients in our survey are economically disadvantaged - all come from poorer zip codes and are seeking assistance from a community health organization.
Public health information helps us get at the relationship between poverty and diabetes health outcomes in the aggregate. For example, consider the hospitalization rate per 100,000 for all individuals with diabetes in Staten Island:
dfCommunityHealth <- sqlQuery(dbConnection, "SELECT * from tblCommunityHealth;")
HospRate <- dfCommunityHealth %>%
filter(indicator=="Diabetes hospitalization rate per 10,000 (any diagnosis)" & county_name=="Richmond") %>%
mutate(percentage_rate = percentage_rate/10) %>%
select(percentage_rate)
HospRate
## percentage_rate
## 1 26.38
Now consider these scatterplots of hospital admissions for diabetes patients with Medicaid, against indicators of poverty per zip code.
a <- ggplot(dfZipInfo_SI, aes(pov, percent_ip)) +
geom_point() +
ggtitle("Poverty And Medicaid Hospitalizations") +
theme(plot.title = element_text(size = 10, face="bold")) +
xlab("Percent below the poverty line") +
ylab("Hospitalizations per 100,000") +
geom_hline(yintercept=26.4, size=2, color='red', linetype='dotted')
b <- ggplot(dfZipInfo_SI, aes(NoInternet, percent_ip)) +
geom_point() +
ggtitle("Poverty And Medicaid Hospitalizations") +
theme(plot.title = element_text(size = 10, face="bold")) +
xlab("Percent without internet") +
ylab("Hospitalizations per 100,000") +
geom_hline(yintercept=26.4, size=2, color='red', linetype='dotted')
gridExtra::grid.arrange(a,b,ncol=2)
The strong relationship between poverty level and hospitalization rate for Medicaid recipients is likely due in part to the fact that there are more Medicaid recipients in poorer zip codes. Nonetheless, we can see from these graphs that the five poorest zip codes have higher rates of hospitalization for Medicaid recipients alone than the average hospitalization rate for all individuals with diabetes (the red dotted line).
The second graph suggests that poverty may also affect individuals in a more direct way - lack of internet access may lead to lack of health knowledge and poor health choices.
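To put a rough number on the relationship in the first panel, we can compute the correlation between the poverty rate and the Medicaid inpatient admission measure across the Staten Island zip codes (a quick check on the pattern, not part of the original modeling):
#Correlation between percent below poverty and Medicaid inpatient admissions per capita, Staten Island zip codes
cor(dfZipInfo_SI$pov, dfZipInfo_SI$percent_ip, use = "complete.obs")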
Our survey data allow us to examine diabetes and poverty at the individual level. We look first to see whether the incidence of diabetes is associated with particular economic and social needs.
We examine the relationship between the incidence of diabetes and social needs through chi-square tests and a logistic regression:
m10 <- glm(DiabetesOrNot ~ Need_Sad + Need_HelpUnderstanding + Need_Childcare + Need_Clothes + Need_LoseHousing + Need_SafePlace + Need_Job + Need_AffordNeeds + Need_HighSchool + Need_Transport + Need_Safe, family = "binomial", data=dfNeedsInfo)
summary(m10)
##
## Call:
## glm(formula = DiabetesOrNot ~ Need_Sad + Need_HelpUnderstanding +
## Need_Childcare + Need_Clothes + Need_LoseHousing + Need_SafePlace +
## Need_Job + Need_AffordNeeds + Need_HighSchool + Need_Transport +
## Need_Safe, family = "binomial", data = dfNeedsInfo)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3436 -0.6284 -0.6284 -0.3168 2.4564
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.52183 0.16911 -8.999 < 2e-16 ***
## Need_Sad -0.19058 1.36930 -0.139 0.88931
## Need_HelpUnderstanding 0.47490 1.13057 0.420 0.67445
## Need_Childcare -14.61899 686.36417 -0.021 0.98301
## Need_Clothes -0.70005 1.05310 -0.665 0.50621
## Need_LoseHousing 1.37903 1.29617 1.064 0.28736
## Need_SafePlace -0.65462 2.21737 -0.295 0.76782
## Need_Job 0.03591 0.40870 0.088 0.92998
## Need_AffordNeeds 1.42953 0.49329 2.898 0.00376 **
## Need_HighSchool -1.48072 0.78499 -1.886 0.05926 .
## Need_Transport 0.73656 1.51552 0.486 0.62696
## Need_Safe 0.38788 0.73477 0.528 0.59757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 338.06 on 347 degrees of freedom
## Residual deviance: 319.72 on 336 degrees of freedom
## AIC: 343.72
##
## Number of Fisher Scoring iterations: 14
dfNeedsChi <- dfNeedsInfo %>%
select(DiabetesOrNot, all_of(NeedsVector)) %>%
summarise_all(~ chisq.test(.x, dfNeedsInfo$DiabetesOrNot)$p.value)
kableExtra::kable(head(dfNeedsChi), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
| DiabetesOrNot | Need_Food | Need_SafePlace | Need_LoseHousing | Need_Job | Need_AffordNeeds | Need_HighSchool | Need_HelpUnderstanding | Need_Sad | Need_Childcare | Need_Clothes | Need_Transport | Need_Safe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.4783364 | 0.3761446 | 0.2533187 | 0.2993787 | 0.0198248 | 0.2186276 | 0.7193608 | 0.083128 | 0.7400691 | 0.6070138 | 0.4745132 | 0.4813973 |
tbl <- dfNeedsInfo %>%
select(DiabetesOrNot, Need_AffordNeeds) %>%
mutate(Need_AffordNeeds = ifelse(Need_AffordNeeds==0,0,1))
tbl2 <- table(tbl)
tbl2
## Need_AffordNeeds
## DiabetesOrNot 0 1
## 0 263 19
## 1 54 12
psych::phi(tbl2)
## [1] 0.16
Clients who cannot afford basic needs have a higher incidence of diabetes in both the chi-square test (p = .02) and the regression (p = .004). However, we cannot reject the null hypothesis for the other needs.
Diabetes is particularly dangerous for clients with Covid. We therefore look at the relationship between economic needs and whether a client is vaccinated. Some of the needs are all zeros in this subset, so we remove them.
m10 <- glm(SumOfNotVaccinated ~ Need_Sad + Need_Clothes + Need_LoseHousing + Need_Job + Need_AffordNeeds + Need_HighSchool + Need_Safe, family = "binomial", data=dfNeedsInfo_Vaccine)
summary(m10)
##
## Call:
## glm(formula = SumOfNotVaccinated ~ Need_Sad + Need_Clothes +
## Need_LoseHousing + Need_Job + Need_AffordNeeds + Need_HighSchool +
## Need_Safe, family = "binomial", data = dfNeedsInfo_Vaccine)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6651 -0.9558 -0.9558 1.4165 1.4165
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5465 0.1263 -4.328 1.51e-05 ***
## Need_Sad 10.9534 4248.3854 0.003 0.9979
## Need_Clothes -16.1358 1022.8449 -0.016 0.9874
## Need_LoseHousing -15.5864 2859.7350 -0.005 0.9957
## Need_Job 1.6452 0.6785 2.425 0.0153 *
## Need_AffordNeeds 0.8830 0.5990 1.474 0.1404
## Need_HighSchool -25.5406 4406.0046 -0.006 0.9954
## Need_Safe 19.2175 3955.3301 0.005 0.9961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 407.86 on 303 degrees of freedom
## Residual deviance: 384.66 on 296 degrees of freedom
## AIC: 400.66
##
## Number of Fisher Scoring iterations: 15
NeedsVector1 <- NeedsVector[! NeedsVector %in% c("Need_SafePlace", "Need_HelpUnderstanding", "Need_Childcare", "Need_Transport")]
dfNeedsChi1 <- dfNeedsInfo_Vaccine %>%
select(SumOfNotVaccinated, all_of(NeedsVector1)) %>%
summarise_all(~ chisq.test(.x, dfNeedsInfo_Vaccine$SumOfNotVaccinated)$p.value)
kableExtra::kable(head(dfNeedsChi1), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
| SumOfNotVaccinated | Need_Food | Need_LoseHousing | Need_Job | Need_AffordNeeds | Need_HighSchool | Need_Sad | Need_Clothes | Need_Safe |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.3230587 | 0.8292155 | 0.0278985 | 0.1695744 | 0.4166836 | 0.7077675 | 1 | 0.0197716 |
tbl <- dfNeedsInfo_Vaccine %>%
select(SumOfNotVaccinated, Need_Job) %>%
mutate(Need_Job = ifelse(Need_Job==0,0,1))
tbl1 <- table(tbl)
tbl1
## Need_Job
## SumOfNotVaccinated 0 1
## 0 179 5
## 1 109 11
psych::phi(tbl1)
## [1] 0.14
tbl2 <- dfNeedsInfo_Vaccine %>%
select(SumOfNotVaccinated, Need_Safe) %>%
mutate(Need_Safe = ifelse(Need_Safe==0,0,1))
tbl3 <- table(tbl2)
tbl3
## Need_Safe
## SumOfNotVaccinated 0 1
## 0 184 0
## 1 115 5
psych::phi(tbl3)
## [1] 0.16
Being unemployed and not feeling safe are both correlated with being unvaccinated (p = .028 and p = .02 respectively).
The effect is not strong enough to support a successful prediction model - the model does no better than guessing. This is perhaps because the data are very unbalanced:
#https://www.statology.org/logistic-regression-in-r/
#make this example reproducible
set.seed(1)
#Use 70% of dfNeedsInfo_Vaccine as the training set and the remaining 30% as the testing set
sample <- sample(c(TRUE, FALSE), nrow(dfNeedsInfo_Vaccine), replace=TRUE, prob=c(0.7,0.3))
train <- dfNeedsInfo_Vaccine[sample, ]
test <- dfNeedsInfo_Vaccine[!sample, ]
#fit logistic regression model
model1 <- glm(SumOfNotVaccinated ~ Need_Safe + Need_Job, family="binomial", data=train)
summary(model1)
##
## Call:
## glm(formula = SumOfNotVaccinated ~ Need_Safe + Need_Job, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5518 -0.9756 -0.9756 1.3937 1.3937
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4953 0.1436 -3.448 0.000564 ***
## Need_Safe 16.5805 1005.3119 0.016 0.986841
## Need_Job 1.3426 0.7049 1.905 0.056805 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 298.72 on 220 degrees of freedom
## Residual deviance: 285.54 on 218 degrees of freedom
## AIC: 291.54
##
## Number of Fisher Scoring iterations: 15
#model <- glm(HypertensionOrDiabetesOrNot ~ NumOfNeeds, family="binomial", data=train)
#model
#disable scientific notation for model summary
#options(scipen=999)
#Shows fit - ranges from 0 to almost 1
pscl::pR2(model1)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.04413668
#Ranks importance
caret::varImp(model1)
## Overall
## Need_Safe 0.01649294
## Need_Job 1.90481077
#multicolinearity (VIF)
car::vif(model1)
## Need_Safe Need_Job
## 1 1
#calculate the predicted probability of being unvaccinated for each individual in the test dataset
predicted <- predict(model1, test, type="response")
p2 <- as.data.frame(predicted)
p3 <- test %>%
select(SumOfNotVaccinated)
Predictions <- cbind(p2, p3)
Predictions$predicted <- ifelse(Predictions$predicted >.38, 1, 0)
Predictions$predicted=as.factor(Predictions$predicted)
Predictions$SumOfNotVaccinated=as.factor(Predictions$SumOfNotVaccinated)
psych::describe(Predictions)
## vars n mean sd median trimmed mad min max range skew
## predicted* 1 83 1.05 0.22 1 1.00 0 1 2 1 4.14
## SumOfNotVaccinated* 2 83 1.36 0.48 1 1.33 0 1 2 1 0.57
## kurtosis se
## predicted* 15.35 0.02
## SumOfNotVaccinated* -1.70 0.05
#find optimal cutoff probability to use to maximize accuracy
optimal <- InformationValue::optimalCutoff(test$SumOfNotVaccinated, predicted)[1]
optimal
## [1] 0.38
#Note: caret::confusionMatrix(data, reference) expects the predictions first and the observed values second - the arguments here are reversed, so interpret the matrix with care
caret::confusionMatrix(Predictions$SumOfNotVaccinated, Predictions$predicted)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 51 2
## 1 28 2
##
## Accuracy : 0.6386
## 95% CI : (0.5257, 0.7412)
## No Information Rate : 0.9518
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0356
##
## Mcnemar's Test P-Value : 5.01e-06
##
## Sensitivity : 0.64557
## Specificity : 0.50000
## Pos Pred Value : 0.96226
## Neg Pred Value : 0.06667
## Prevalence : 0.95181
## Detection Rate : 0.61446
## Detection Prevalence : 0.63855
## Balanced Accuracy : 0.57278
##
## 'Positive' Class : 0
##
#calculate sensitivity
#sensitivity(test$HypertensionOrDiabetesOrNot, predicted)
#calculate specificity
#specificity(test$HypertensionOrDiabetesOrNot, predicted)
#calculate total misclassification error rate
#misClassError(Predictions$HypertensionOrDiabetesOrNot, Predictions$predicted, threshold=optimal)
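One way to address the class imbalance noted above would be to rebalance the training set with the ROSE package (loaded at the top of this file) before refitting the model. The following is a minimal sketch of that idea, using the train/test split created earlier; it is not part of the analysis that produced the results above.
run=FALSE
if(run) {
#Resample the training set to roughly 50/50 unvaccinated vs. vaccinated, then refit the same model
train_balanced <- ROSE::ovun.sample(SumOfNotVaccinated ~ Need_Safe + Need_Job, data = train, method = "both", p = 0.5, N = nrow(train), seed = 1)$data
model_bal <- glm(SumOfNotVaccinated ~ Need_Safe + Need_Job, family = "binomial", data = train_balanced)
summary(model_bal)
}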
rBase1 <- lm(InitialA1C ~ BlackOrNot + MaleOrNot + Age, data = dfA1CAndDemographics)
summary(rBase1)
##
## Call:
## lm(formula = InitialA1C ~ BlackOrNot + MaleOrNot + Age, data = dfA1CAndDemographics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.704 -0.702 -0.358 0.099 53.070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.787345 0.092460 51.778 < 2e-16 ***
## BlackOrNot 0.094788 0.052052 1.821 0.06866 .
## MaleOrNot 0.165160 0.052783 3.129 0.00176 **
## Age 0.021538 0.001628 13.228 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.859 on 5383 degrees of freedom
## (46 observations deleted due to missingness)
## Multiple R-squared: 0.03374, Adjusted R-squared: 0.0332
## F-statistic: 62.65 on 3 and 5383 DF, p-value: < 2.2e-16
Here we can see that gender and age are significantly associated with a client's initial A1C at first screening, with race marginally so (p = .07). The R2 is very low, however.
This study scratched the surface of the question, "How are poverty and diabetes connected?" We found statistical associations between poverty and diabetes hospitalization rates, between being unable to afford basic needs and diabetes incidence, between not being vaccinated and both not having a job and not feeling safe, and between high A1C and race, age and gender. Clearly there is more to explore here.
This analysis shows only a small handful of the analyses that were run on this data. I also analyzed the relationship between poverty and chronic hypertension, A1C levels, success in A1C management, and so on. This type of "fishing" increases the likelihood of sampling errors. A cursory analysis of survey results by program and staff suggests a lack of consistency in how the survey was applied. In the second round of survey implementation I plan to make sure that surveys are conducted in a consistent manner across staff and programs to provide the possibility of a more robust analysis.