DATA 607: Final Project

Goal: What is the relationship between obesity and social economic status in NYC?

As someone who grew up in NYC, I remember living in an area where there was an abundance of fast food restaurants. It almost seemed like there was one on every corner. Often times it felt like for every 10 fast food restaurant, there was maybe one supermarket. The people around me, including myself had cheap and easy access to sugary drinks, candy, chips, and all sorts of goodies. We all know these foods contribute the Obesity.

For this project, I want to study and pin point where the obesity instances are the highest in NYC. Is it the same as where the Lowest Income Adults live?

The study outline is as follows: I) Main data collection and cleaning transformations (Any additional data sources will be added in adlib and stored in MySQL)

EDA
Analysis and conclusions

Data Collection:

Plan: -Gather data on income on by NYC region -Gather data on obesity by NYC region

We will first consider data from the U.S census. There is a handy API which allows us to query data directly from source. You will need to sign up for an API key. Our goal is to gain aggregated data from the census that gives provides social economic status by region. Tutorial: https://cran.r-project.org/web/packages/censusapi/vignettes/getting-started.html

We are able to view and identify the data that we need

#block out View otherwise it will yield large output when performing knit 
library(censusapi)

## 
## Attaching package: 'censusapi'

## The following object is masked from 'package:methods':
## 
##     getFunction

apis <- listCensusApis()
#View(apis)

We will be using the Small Area Health Insurance Estimates API.

Identify the variables that would contribute to this study

sahie_vars <- listCensusMetadata(name = "timeseries/healthins/sahie", type = "variables")
head(sahie_vars, 20)

##          name
## 1    AGE_DESC
## 2    NUI_LB90
## 3       STATE
## 4     NIC_MOE
## 5     NIPR_PT
## 6     RACECAT
## 7      SEXCAT
## 8    PCTIC_PT
## 9        YEAR
## 10   NUI_UB90
## 11   NIPR_MOE
## 12     COUNTY
## 13     NIC_PT
## 14    STABREV
## 15  RACE_DESC
## 16   NIC_UB90
## 17  NIPR_LB90
## 18 PCTUI_UB90
## 19 PCTIC_UB90
## 20     AGECAT
##                                                                                                        label
## 1                                                                                   Age Category Description
## 2                                                  Number Uninsured, Lower Bound for 90% Confidence Interval
## 3                                                                                            State FIPS Code
## 4                                                                            Number Insured, Margin of Error
## 5                                            Number in Demographic Group for Selected Income Range, Estimate
## 6                                                                                              Race Category
## 7                                                                                               Sex Category
## 8                                   Percent Insured in Demographic Group for Selected Income Range, Estimate
## 9                                                                                              Estimate Year
## 10                                                 Number Uninsured, Upper Bound for 90% Confidence Interval
## 11                                    Number in Demographic Group for Selected Income Range, Margin of Error
## 12                                                                                          County FIPS Code
## 13                                                                                  Number Insured, Estimate
## 14                                                                            Two-letter Postal Abbreviation
## 15                                                                                 Race Category Description
## 16                                                   Number Insured, Upper Bound for 90% Confidence Interval
## 17            Number in Demographic Group for Selected Income Range, Lower Bound for 90% Confidence Interval
## 18 Percent Uninsured in Demographic Group for Selected Income Range, Upper Bound for 90% Confidence Interval
## 19                                                  Percent Insured, Upper Bound for 90% Confidence Interval
## 20                                                                                              Age Category
##                concept predicateType group limit          required
## 1       Demographic ID           int   N/A     0              <NA>
## 2  Uncertainty Measure           int   N/A     0              <NA>
## 3        Geographic ID           int   N/A     0              <NA>
## 4  Uncertainty Measure           int   N/A     0              <NA>
## 5             Estimate           int   N/A     0              <NA>
## 6       Demographic ID           int   N/A     0 default displayed
## 7       Demographic ID           int   N/A     0 default displayed
## 8             Estimate           int   N/A     0              <NA>
## 9     Reference Period           int   N/A     0              <NA>
## 10 Uncertainty Measure           int   N/A     0              <NA>
## 11 Uncertainty Measure           int   N/A     0              <NA>
## 12       Geographic ID           int   N/A     0              <NA>
## 13            Estimate           int   N/A     0              <NA>
## 14       Geographic ID           int   N/A     0              <NA>
## 15      Demographic ID           int   N/A     0              <NA>
## 16 Uncertainty Measure           int   N/A     0              <NA>
## 17 Uncertainty Measure           int   N/A     0              <NA>
## 18 Uncertainty Measure           int   N/A     0              <NA>
## 19 Uncertainty Measure           int   N/A     0              <NA>
## 20      Demographic ID           int   N/A     0 default displayed

We want the following variables: NAME- Name of geography returned AGE_DESC- Age Category Description IPRCAT: Income Poverty Ratio Category IPR_DESC: Income Poverty Ratio Category Description PCTUI_PT: Percent Uninsured in Demographic Group for Selected Income Range, Estimate COUNTY County FIPS Code

The percentage of people who are not covered by insurance should be a sufficient proxy measure. There are several papers and articles that study the correlation between having insurance and economic status, for example https://www.ncbi.nlm.nih.gov/pubmed/11297885

Filter down to the state level

listCensusMetadata(name = "timeseries/healthins/sahie", type = "geography")

##     name geoLevelId referenceDate requires wildcard optionalWithWCFor
## 1     us        010    2015-01-01     NULL     NULL              <NA>
## 2 county        050    2015-01-01    state    state             state
## 3  state        040    2015-01-01     NULL     NULL              <NA>

census_directory<-getCensus(name = "timeseries/healthins/sahie",
    vars = c("NAME","AGE_DESC",  "IPRCAT", "IPR_DESC", "PCTUI_PT", "COUNTY"), 
    region = "county:*", time = 2015)
head(census_directory)

##   time state county               NAME       AGE_DESC IPRCAT    IPR_DESC
## 1 2015    01    001 Autauga County, AL Under 65 years      0 All Incomes
## 2 2015    01    003 Baldwin County, AL Under 65 years      0 All Incomes
## 3 2015    01    005 Barbour County, AL Under 65 years      0 All Incomes
## 4 2015    01    007    Bibb County, AL Under 65 years      0 All Incomes
## 5 2015    01    009  Blount County, AL Under 65 years      0 All Incomes
## 6 2015    01    011 Bullock County, AL Under 65 years      0 All Incomes
##   PCTUI_PT COUNTY
## 1      9.4    001
## 2     11.5    003
## 3     13.3    005
## 4     11.9    007
## 5     14.0    009
## 6     14.5    011

We do not need all counties. We just want to focus on counties within NY. We can use the following link to identify the FIPS code for NYC counties and apply it to our filter http://library.columbia.edu/locations/dssc/data/nycounty_fips.html

We first identify by NYS, and then subset this data frame by the FIPS code:

nyc_counties <- getCensus(name = "timeseries/healthins/sahie",
    vars = c("NAME","AGE_DESC",  "IPRCAT", "IPR_DESC","PCTUI_PT", "COUNTY"), 
    region = "county:*", regionin = "state:36", time = 2015)
head(nyc_counties, n=12L)

##    time state county                NAME       AGE_DESC IPRCAT
## 1  2015    36    001   Albany County, NY Under 65 years      0
## 2  2015    36    001   Albany County, NY Under 65 years      1
## 3  2015    36    001   Albany County, NY Under 65 years      2
## 4  2015    36    001   Albany County, NY Under 65 years      3
## 5  2015    36    001   Albany County, NY Under 65 years      4
## 6  2015    36    001   Albany County, NY Under 65 years      5
## 7  2015    36    003 Allegany County, NY Under 65 years      0
## 8  2015    36    003 Allegany County, NY Under 65 years      1
## 9  2015    36    003 Allegany County, NY Under 65 years      2
## 10 2015    36    003 Allegany County, NY Under 65 years      3
## 11 2015    36    003 Allegany County, NY Under 65 years      4
## 12 2015    36    003 Allegany County, NY Under 65 years      5
##                   IPR_DESC PCTUI_PT COUNTY
## 1              All Incomes      5.5    001
## 2       <= 200% of Poverty     10.6    001
## 3       <= 250% of Poverty     10.3    001
## 4       <= 138% of Poverty     10.6    001
## 5       <= 400% of Poverty      8.3    001
## 6  138% to 400% of Poverty      7.2    001
## 7              All Incomes      6.2    003
## 8       <= 200% of Poverty      8.5    003
## 9       <= 250% of Poverty      8.4    003
## 10      <= 138% of Poverty      8.9    003
## 11      <= 400% of Poverty      7.3    003
## 12 138% to 400% of Poverty      6.3    003

Subset by FIPS code 005 - Bronx 047 - Kings (Brooklyn) 061 - New York (Manhattan) 081 - Queens 085 - Richmond (Staten Island)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

k<-c('005', '047', '061', '081', '085')
df.nyincome <- select(filter(nyc_counties, county %in% k),c('time', 'NAME', 'AGE_DESC', 'IPRCAT', 'IPR_DESC', 'PCTUI_PT'))
head(df.nyincome)

##   time             NAME       AGE_DESC IPRCAT                IPR_DESC
## 1 2015 Bronx County, NY Under 65 years      0             All Incomes
## 2 2015 Bronx County, NY Under 65 years      1      <= 200% of Poverty
## 3 2015 Bronx County, NY Under 65 years      2      <= 250% of Poverty
## 4 2015 Bronx County, NY Under 65 years      3      <= 138% of Poverty
## 5 2015 Bronx County, NY Under 65 years      4      <= 400% of Poverty
## 6 2015 Bronx County, NY Under 65 years      5 138% to 400% of Poverty
##   PCTUI_PT
## 1     11.1
## 2     11.5
## 3     11.8
## 4     10.7
## 5     11.9
## 6     13.3

We should convert the columns to the correct data types. For example, PCTUI_PT should be a numeric and not char.

df.nyincome$PCTUI_PT <- as.numeric(as.character(df.nyincome$PCTUI_PT))
head(df.nyincome)

##   time             NAME       AGE_DESC IPRCAT                IPR_DESC
## 1 2015 Bronx County, NY Under 65 years      0             All Incomes
## 2 2015 Bronx County, NY Under 65 years      1      <= 200% of Poverty
## 3 2015 Bronx County, NY Under 65 years      2      <= 250% of Poverty
## 4 2015 Bronx County, NY Under 65 years      3      <= 138% of Poverty
## 5 2015 Bronx County, NY Under 65 years      4      <= 400% of Poverty
## 6 2015 Bronx County, NY Under 65 years      5 138% to 400% of Poverty
##   PCTUI_PT
## 1     11.1
## 2     11.5
## 3     11.8
## 4     10.7
## 5     11.9
## 6     13.3

Convert the column names into something more constant

names(df.nyincome)[names(df.nyincome) == "NAME"] <- "county"
names(df.nyincome)[names(df.nyincome) == "AGE_DESC"] <- "age_group"
names(df.nyincome)[names(df.nyincome) == "IPRCAT"] <- "poverty_level"
names(df.nyincome)[names(df.nyincome) == "IPR_DESC"] <- "poverty_group"
names(df.nyincome)[names(df.nyincome) == "PCTUI_PT"] <- "percent_uninsured"
names(df.nyincome)

## [1] "time"              "county"            "age_group"        
## [4] "poverty_level"     "poverty_group"     "percent_uninsured"

I want to store this data in MySQL for safekeeping and easier access for future use. library(RMySQL) mydb = dbConnect(MySQL(), user=‘’, password=’‘, dbname=’‘, host=’’)

dbWriteTable(mydb,"NYCounty_Income",df.nyincome,overwrite=T)

## [1] TRUE

dbListTables(mydb)

## [1] "NYC_Poverty"           "NYC_obesity_estimates" "NYC_zip"              
## [4] "NYCounty_Income"

Close connection (Not needed yet)

#dbDisconnect(mydb)

We can also bring in data from an additional source. It has more granularity for the poverty rate on the neighborhood level. The data comes from: http://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=103,4466a0,109,Summarize

We store this data in Github: https://github.com/vindication09/DATA-607-Final-Project/blob/master/Poverty.csv

library(readr)
poverty.df <- read_csv("https://raw.githubusercontent.com/vindication09/DATA-607-Final-Project/master/Poverty.csv")

## Parsed with column specification:
## cols(
##   Year = col_character(),
##   GeoTypeName = col_character(),
##   Borough = col_character(),
##   Geography = col_character(),
##   `Geography ID` = col_integer(),
##   `Indicator Name` = col_character(),
##   Number = col_number(),
##   Percent = col_double()
## )

head(poverty.df)

## # A tibble: 6 x 8
##   Year   GeoTypeName  Borough Geography    `Geography ID` `Indicator Name`
##   <chr>  <chr>        <chr>   <chr>                 <int> <chr>           
## 1 2011-… Neighborhoo… Queens  Bayside - L…            404 Poverty         
## 2 2011-… Neighborhoo… Brookl… Bedford Stu…            203 Poverty         
## 3 2011-… Neighborhoo… Brookl… Bensonhurst…            209 Poverty         
## 4 2011-… Neighborhoo… Brookl… Borough Park            206 Poverty         
## 5 2011-… Neighborhoo… Brookl… Canarsie - …            208 Poverty         
## 6 2011-… Neighborhoo… Manhat… Central Har…            302 Poverty         
## # ... with 2 more variables: Number <dbl>, Percent <dbl>

Store this data in MySQL

dbWriteTable(mydb,"NYC_Poverty",poverty.df,overwrite=T)

## [1] TRUE

dbListTables(mydb)

## [1] "NYC_Poverty"           "NYC_obesity_estimates" "NYC_zip"              
## [4] "NYCounty_Income"

The NYC income data has now been transformed and saved in a relational database.

The next set of data that we need should contain information for region and obesity measurments.

This dataset gives us information at the NYC Neighborhood level. We are given the estimated number of adults classified as obese in 2015. http://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=2063,4466a0,113,Summarize

After some manual cleaning (the spreadhseet is not large), the data is uploaded into Github for storage. https://github.com/vindication09/DATA-607-Final-Project/blob/master/Obese%20Adults.csv

library(readr)
obesity.df <- read_csv("https://raw.githubusercontent.com/vindication09/DATA-607-Final-Project/master/Obese%20Adults.csv")

## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   GeoTypeName = col_character(),
##   Borough = col_character(),
##   Geography = col_character(),
##   `Geography ID` = col_integer(),
##   `Indicator Name` = col_character(),
##   Number = col_number(),
##   Percent = col_double()
## )

head(obesity.df)

## # A tibble: 6 x 8
##    Year GeoTypeName   Borough Geography    `Geography ID` `Indicator Name`
##   <int> <chr>         <chr>   <chr>                 <int> <chr>           
## 1  2015 Neighborhood… Queens  Bayside Lit…         404406 Obese Adults    
## 2  2015 Neighborhood… Brookl… Bedford Stu…            203 Obese Adults    
## 3  2015 Neighborhood… Brookl… Bensonhurst…            209 Obese Adults    
## 4  2015 Neighborhood… Brookl… Borough Park            206 Obese Adults    
## 5  2015 Neighborhood… Brookl… Canarsie - …            208 Obese Adults    
## 6  2015 Neighborhood… Manhat… Central Har…            302 Obese Adults    
## # ... with 2 more variables: Number <dbl>, Percent <dbl>

This dataset does not seem to need any transformations for cleaning. Lets go ahead and save this data in our local instance of MySQL as well

dbWriteTable(mydb,"NYC_obesity_estimates",obesity.df,overwrite=T)

## [1] TRUE

dbListTables(mydb)

## [1] "NYC_Poverty"           "NYC_obesity_estimates" "NYC_zip"              
## [4] "NYCounty_Income"

EDA: Lets analyze the data we collected using visualizations.

The NY income data gives us aggregated data which looks at the percentage of people, by Borough, who are uninsured.

names(df.nyincome)

## [1] "time"              "county"            "age_group"        
## [4] "poverty_level"     "poverty_group"     "percent_uninsured"

Lets convert the data from wide to long in order to visualize with a stacked bar plot

df.nyincome_b <- select(filter(df.nyincome),c('county', 'poverty_level', 'percent_uninsured'))
head(df.nyincome_b)

##             county poverty_level percent_uninsured
## 1 Bronx County, NY             0              11.1
## 2 Bronx County, NY             1              11.5
## 3 Bronx County, NY             2              11.8
## 4 Bronx County, NY             3              10.7
## 5 Bronx County, NY             4              11.9
## 6 Bronx County, NY             5              13.3

Add a new column that divides the current percent by 100 in order to scale in a stacked bar plot.

df.nyincome_b["percent"] <- NA
df.nyincome_b$percent <- df.nyincome_b$percent_uninsured/100

library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:readr':
## 
##     col_factor

#library(easyGgplot2)

ggplot(df.nyincome_b, aes(x=county, y=percent, fill=poverty_level)) + 
  geom_bar(stat="identity") +
  xlab("\nCounty") +
  ylab("Percent\n") +
  guides(fill=FALSE) +
  theme_bw() +
  scale_y_continuous(labels = percent_format())+
  facet_grid(poverty_level ~ .)

The color scales are ordered by poverty level. 0 pertains to households above the poverty line while 5 pertains to households who are classified as high poverty. According the the chart, Queens County has the highest percentage of uninsured people who are categoried in the highest level of poverty.

We have another dataset that gives us neighborhood level granularity on poverty. Lets look at each instance of poverty by neighborhood for each Borough

Note: I Originally planned to Map GEO code to zip code and then map zip code to neighborhood. This proved to be more difficult by requiring the usage of “shapefiles”, which I am not familiar with yet.

Lets analyze the poverty rate by individual neighborhood.

Queens:

Brooklyn:

Manhattan:

Bronx:

Staten Island:

Note: Certain peaks that are out of order in the plot are aggregated values for a whole Neighborhood’s sub areas.

We will plot the similar charts to get a break down of the estimated number of adults classified as Obese per neighborhood. Queens:

Brooklyn:

Manhhatan:

Bronx:

Staten Island:

We see that the Obesity data is not as robust as the income data. There are a lot of neighborhoods missing. Solving the problem of getting a more complete dataset should be considered for future study.

Analysis and Conclusion:

In the meantime, we can join the poverty dataset and the income dataset and see a side by side comparison for any matched neighborhood. Lets join using MySQL

joinsql<-("select 
         a.Geography, a.Number as 'ObeseAdults', b.Number as 'LowIncomeAdults'
         from NYC_obesity_estimates a 
          join NYC_Poverty b 
         on (a.Borough=b.Borough and a.Geography=b.Geography);")

joinclean<-dbGetQuery(mydb, joinsql)
head(joinclean)

##                              Geography ObeseAdults LowIncomeAdults
## 1   Bedford Stuyvesant - Crown Heights       86000           89354
## 2              Bensonhurst - Bay Ridge       35000           33668
## 3                         Borough Park       39000           92821
## 4                 Canarsie - Flatlands       45000           27775
## 5 Central Harlem - Morningside Heights       37000           51123
## 6        Coney Island - Sheepshead Bay       66000           56608

library(reshape)

## 
## Attaching package: 'reshape'

## The following object is masked from 'package:dplyr':
## 
##     rename

joinclean.df<- melt(joinclean, id.vars='Geography')
ggplot(joinclean.df, aes(Geography, value)) +   
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  geom_bar(aes(fill = variable), position = "dodge", stat="identity")

From the looks of the chart, we can see that there are certain Neighborhoods where the number of Obese adults are almost the same as Low Income Adults, for example Bedford Stuyvesant - Crown Heights. There is not enough information in this plot to make a statistical inference.

The ggpubr package allows us to perform Hypothesis testing for correlation between Number of Low Income Adults and Number of Adults who are obese. GGPubr tutorial: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r

Is there a linear relationship between Low Income Adults and Obesity?

library("ggpubr")

## Loading required package: magrittr

ggscatter(joinclean, x = "LowIncomeAdults", y = "ObeseAdults", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Number of Low Income Adults", ylab = "Number of Obese Adults")

There is indeed a linear positive correlation between Adult Obesity and Low Income. The correlation coefficient is 0.56. Lets try a different function to verify the correlation. It should be noted that the low p value also implies statistically significant correlation.

Is there a correlation between Low Income Adults and Obesity?

test <- cor.test(joinclean$LowIncomeAdults, joinclean$ObeseAdults, 
                    method = "pearson")
test

## 
##  Pearson's product-moment correlation
## 
## data:  joinclean$LowIncomeAdults and joinclean$ObeseAdults
## t = 4.0528, df = 36, p-value = 0.0002584
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2923693 0.7459421
## sample estimates:
##       cor 
## 0.5597365

The alternate method also yields the same result. The correlation is significant at the 95 percent confidence level.

Conclusion: -The correlation between Low Income Adults and Obese Adults is significant. The correlation constant is .55 rounded up. Given the time to find a more complete dataset of city and Obesity instances, I do believe the results would have been more conclusive. I had originally planned on doing something clever with maps. I had data with a unique GEO ID and another with zip codes. There has to be a better way to map GEO ID to zip code and then map zip code to Lattitude and Longitude. With spatial parameters, we are able to plot on a map, the Obesity instances and the Low income Adults. I think this would be a great addition to this study.

DATA 607: Final Project

Vinicio Haro

5/12/2018