Goal: What is the relationship between obesity and social economic status in NYC?
As someone who grew up in NYC, I remember living in an area where there was an abundance of fast food restaurants. It almost seemed like there was one on every corner. Often times it felt like for every 10 fast food restaurant, there was maybe one supermarket. The people around me, including myself had cheap and easy access to sugary drinks, candy, chips, and all sorts of goodies. We all know these foods contribute the Obesity.
For this project, I want to study and pin point where the obesity instances are the highest in NYC. Is it the same as where the Lowest Income Adults live?
The study outline is as follows: I) Main data collection and cleaning transformations (Any additional data sources will be added in adlib and stored in MySQL)
EDA
Analysis and conclusions
Data Collection:
Plan: -Gather data on income on by NYC region -Gather data on obesity by NYC region
We will first consider data from the U.S census. There is a handy API which allows us to query data directly from source. You will need to sign up for an API key. Our goal is to gain aggregated data from the census that gives provides social economic status by region. Tutorial: https://cran.r-project.org/web/packages/censusapi/vignettes/getting-started.html
We are able to view and identify the data that we need
#block out View otherwise it will yield large output when performing knit
library(censusapi)
##
## Attaching package: 'censusapi'
## The following object is masked from 'package:methods':
##
## getFunction
apis <- listCensusApis()
#View(apis)
We will be using the Small Area Health Insurance Estimates API.
Identify the variables that would contribute to this study
sahie_vars <- listCensusMetadata(name = "timeseries/healthins/sahie", type = "variables")
head(sahie_vars, 20)
## name
## 1 AGE_DESC
## 2 NUI_LB90
## 3 STATE
## 4 NIC_MOE
## 5 NIPR_PT
## 6 RACECAT
## 7 SEXCAT
## 8 PCTIC_PT
## 9 YEAR
## 10 NUI_UB90
## 11 NIPR_MOE
## 12 COUNTY
## 13 NIC_PT
## 14 STABREV
## 15 RACE_DESC
## 16 NIC_UB90
## 17 NIPR_LB90
## 18 PCTUI_UB90
## 19 PCTIC_UB90
## 20 AGECAT
## label
## 1 Age Category Description
## 2 Number Uninsured, Lower Bound for 90% Confidence Interval
## 3 State FIPS Code
## 4 Number Insured, Margin of Error
## 5 Number in Demographic Group for Selected Income Range, Estimate
## 6 Race Category
## 7 Sex Category
## 8 Percent Insured in Demographic Group for Selected Income Range, Estimate
## 9 Estimate Year
## 10 Number Uninsured, Upper Bound for 90% Confidence Interval
## 11 Number in Demographic Group for Selected Income Range, Margin of Error
## 12 County FIPS Code
## 13 Number Insured, Estimate
## 14 Two-letter Postal Abbreviation
## 15 Race Category Description
## 16 Number Insured, Upper Bound for 90% Confidence Interval
## 17 Number in Demographic Group for Selected Income Range, Lower Bound for 90% Confidence Interval
## 18 Percent Uninsured in Demographic Group for Selected Income Range, Upper Bound for 90% Confidence Interval
## 19 Percent Insured, Upper Bound for 90% Confidence Interval
## 20 Age Category
## concept predicateType group limit required
## 1 Demographic ID int N/A 0 <NA>
## 2 Uncertainty Measure int N/A 0 <NA>
## 3 Geographic ID int N/A 0 <NA>
## 4 Uncertainty Measure int N/A 0 <NA>
## 5 Estimate int N/A 0 <NA>
## 6 Demographic ID int N/A 0 default displayed
## 7 Demographic ID int N/A 0 default displayed
## 8 Estimate int N/A 0 <NA>
## 9 Reference Period int N/A 0 <NA>
## 10 Uncertainty Measure int N/A 0 <NA>
## 11 Uncertainty Measure int N/A 0 <NA>
## 12 Geographic ID int N/A 0 <NA>
## 13 Estimate int N/A 0 <NA>
## 14 Geographic ID int N/A 0 <NA>
## 15 Demographic ID int N/A 0 <NA>
## 16 Uncertainty Measure int N/A 0 <NA>
## 17 Uncertainty Measure int N/A 0 <NA>
## 18 Uncertainty Measure int N/A 0 <NA>
## 19 Uncertainty Measure int N/A 0 <NA>
## 20 Demographic ID int N/A 0 default displayed
We want the following variables: NAME- Name of geography returned AGE_DESC- Age Category Description IPRCAT: Income Poverty Ratio Category IPR_DESC: Income Poverty Ratio Category Description PCTUI_PT: Percent Uninsured in Demographic Group for Selected Income Range, Estimate COUNTY County FIPS Code
The percentage of people who are not covered by insurance should be a sufficient proxy measure. There are several papers and articles that study the correlation between having insurance and economic status, for example https://www.ncbi.nlm.nih.gov/pubmed/11297885
Filter down to the state level
listCensusMetadata(name = "timeseries/healthins/sahie", type = "geography")
## name geoLevelId referenceDate requires wildcard optionalWithWCFor
## 1 us 010 2015-01-01 NULL NULL <NA>
## 2 county 050 2015-01-01 state state state
## 3 state 040 2015-01-01 NULL NULL <NA>
census_directory<-getCensus(name = "timeseries/healthins/sahie",
vars = c("NAME","AGE_DESC", "IPRCAT", "IPR_DESC", "PCTUI_PT", "COUNTY"),
region = "county:*", time = 2015)
head(census_directory)
## time state county NAME AGE_DESC IPRCAT IPR_DESC
## 1 2015 01 001 Autauga County, AL Under 65 years 0 All Incomes
## 2 2015 01 003 Baldwin County, AL Under 65 years 0 All Incomes
## 3 2015 01 005 Barbour County, AL Under 65 years 0 All Incomes
## 4 2015 01 007 Bibb County, AL Under 65 years 0 All Incomes
## 5 2015 01 009 Blount County, AL Under 65 years 0 All Incomes
## 6 2015 01 011 Bullock County, AL Under 65 years 0 All Incomes
## PCTUI_PT COUNTY
## 1 9.4 001
## 2 11.5 003
## 3 13.3 005
## 4 11.9 007
## 5 14.0 009
## 6 14.5 011
We do not need all counties. We just want to focus on counties within NY. We can use the following link to identify the FIPS code for NYC counties and apply it to our filter http://library.columbia.edu/locations/dssc/data/nycounty_fips.html
We first identify by NYS, and then subset this data frame by the FIPS code:
nyc_counties <- getCensus(name = "timeseries/healthins/sahie",
vars = c("NAME","AGE_DESC", "IPRCAT", "IPR_DESC","PCTUI_PT", "COUNTY"),
region = "county:*", regionin = "state:36", time = 2015)
head(nyc_counties, n=12L)
## time state county NAME AGE_DESC IPRCAT
## 1 2015 36 001 Albany County, NY Under 65 years 0
## 2 2015 36 001 Albany County, NY Under 65 years 1
## 3 2015 36 001 Albany County, NY Under 65 years 2
## 4 2015 36 001 Albany County, NY Under 65 years 3
## 5 2015 36 001 Albany County, NY Under 65 years 4
## 6 2015 36 001 Albany County, NY Under 65 years 5
## 7 2015 36 003 Allegany County, NY Under 65 years 0
## 8 2015 36 003 Allegany County, NY Under 65 years 1
## 9 2015 36 003 Allegany County, NY Under 65 years 2
## 10 2015 36 003 Allegany County, NY Under 65 years 3
## 11 2015 36 003 Allegany County, NY Under 65 years 4
## 12 2015 36 003 Allegany County, NY Under 65 years 5
## IPR_DESC PCTUI_PT COUNTY
## 1 All Incomes 5.5 001
## 2 <= 200% of Poverty 10.6 001
## 3 <= 250% of Poverty 10.3 001
## 4 <= 138% of Poverty 10.6 001
## 5 <= 400% of Poverty 8.3 001
## 6 138% to 400% of Poverty 7.2 001
## 7 All Incomes 6.2 003
## 8 <= 200% of Poverty 8.5 003
## 9 <= 250% of Poverty 8.4 003
## 10 <= 138% of Poverty 8.9 003
## 11 <= 400% of Poverty 7.3 003
## 12 138% to 400% of Poverty 6.3 003
Subset by FIPS code 005 - Bronx 047 - Kings (Brooklyn) 061 - New York (Manhattan) 081 - Queens 085 - Richmond (Staten Island)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
k<-c('005', '047', '061', '081', '085')
df.nyincome <- select(filter(nyc_counties, county %in% k),c('time', 'NAME', 'AGE_DESC', 'IPRCAT', 'IPR_DESC', 'PCTUI_PT'))
head(df.nyincome)
## time NAME AGE_DESC IPRCAT IPR_DESC
## 1 2015 Bronx County, NY Under 65 years 0 All Incomes
## 2 2015 Bronx County, NY Under 65 years 1 <= 200% of Poverty
## 3 2015 Bronx County, NY Under 65 years 2 <= 250% of Poverty
## 4 2015 Bronx County, NY Under 65 years 3 <= 138% of Poverty
## 5 2015 Bronx County, NY Under 65 years 4 <= 400% of Poverty
## 6 2015 Bronx County, NY Under 65 years 5 138% to 400% of Poverty
## PCTUI_PT
## 1 11.1
## 2 11.5
## 3 11.8
## 4 10.7
## 5 11.9
## 6 13.3
We should convert the columns to the correct data types. For example, PCTUI_PT should be a numeric and not char.
df.nyincome$PCTUI_PT <- as.numeric(as.character(df.nyincome$PCTUI_PT))
head(df.nyincome)
## time NAME AGE_DESC IPRCAT IPR_DESC
## 1 2015 Bronx County, NY Under 65 years 0 All Incomes
## 2 2015 Bronx County, NY Under 65 years 1 <= 200% of Poverty
## 3 2015 Bronx County, NY Under 65 years 2 <= 250% of Poverty
## 4 2015 Bronx County, NY Under 65 years 3 <= 138% of Poverty
## 5 2015 Bronx County, NY Under 65 years 4 <= 400% of Poverty
## 6 2015 Bronx County, NY Under 65 years 5 138% to 400% of Poverty
## PCTUI_PT
## 1 11.1
## 2 11.5
## 3 11.8
## 4 10.7
## 5 11.9
## 6 13.3
Convert the column names into something more constant
names(df.nyincome)[names(df.nyincome) == "NAME"] <- "county"
names(df.nyincome)[names(df.nyincome) == "AGE_DESC"] <- "age_group"
names(df.nyincome)[names(df.nyincome) == "IPRCAT"] <- "poverty_level"
names(df.nyincome)[names(df.nyincome) == "IPR_DESC"] <- "poverty_group"
names(df.nyincome)[names(df.nyincome) == "PCTUI_PT"] <- "percent_uninsured"
names(df.nyincome)
## [1] "time" "county" "age_group"
## [4] "poverty_level" "poverty_group" "percent_uninsured"
I want to store this data in MySQL for safekeeping and easier access for future use. library(RMySQL) mydb = dbConnect(MySQL(), user=‘’, password=’‘, dbname=’‘, host=’’)
dbWriteTable(mydb,"NYCounty_Income",df.nyincome,overwrite=T)
## [1] TRUE
dbListTables(mydb)
## [1] "NYC_Poverty" "NYC_obesity_estimates" "NYC_zip"
## [4] "NYCounty_Income"
Close connection (Not needed yet)
#dbDisconnect(mydb)
We can also bring in data from an additional source. It has more granularity for the poverty rate on the neighborhood level. The data comes from: http://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=103,4466a0,109,Summarize
We store this data in Github: https://github.com/vindication09/DATA-607-Final-Project/blob/master/Poverty.csv
library(readr)
poverty.df <- read_csv("https://raw.githubusercontent.com/vindication09/DATA-607-Final-Project/master/Poverty.csv")
## Parsed with column specification:
## cols(
## Year = col_character(),
## GeoTypeName = col_character(),
## Borough = col_character(),
## Geography = col_character(),
## `Geography ID` = col_integer(),
## `Indicator Name` = col_character(),
## Number = col_number(),
## Percent = col_double()
## )
head(poverty.df)
## # A tibble: 6 x 8
## Year GeoTypeName Borough Geography `Geography ID` `Indicator Name`
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 2011-… Neighborhoo… Queens Bayside - L… 404 Poverty
## 2 2011-… Neighborhoo… Brookl… Bedford Stu… 203 Poverty
## 3 2011-… Neighborhoo… Brookl… Bensonhurst… 209 Poverty
## 4 2011-… Neighborhoo… Brookl… Borough Park 206 Poverty
## 5 2011-… Neighborhoo… Brookl… Canarsie - … 208 Poverty
## 6 2011-… Neighborhoo… Manhat… Central Har… 302 Poverty
## # ... with 2 more variables: Number <dbl>, Percent <dbl>
Store this data in MySQL
dbWriteTable(mydb,"NYC_Poverty",poverty.df,overwrite=T)
## [1] TRUE
dbListTables(mydb)
## [1] "NYC_Poverty" "NYC_obesity_estimates" "NYC_zip"
## [4] "NYCounty_Income"
The NYC income data has now been transformed and saved in a relational database.
The next set of data that we need should contain information for region and obesity measurments.
This dataset gives us information at the NYC Neighborhood level. We are given the estimated number of adults classified as obese in 2015. http://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=2063,4466a0,113,Summarize
After some manual cleaning (the spreadhseet is not large), the data is uploaded into Github for storage. https://github.com/vindication09/DATA-607-Final-Project/blob/master/Obese%20Adults.csv
library(readr)
obesity.df <- read_csv("https://raw.githubusercontent.com/vindication09/DATA-607-Final-Project/master/Obese%20Adults.csv")
## Parsed with column specification:
## cols(
## Year = col_integer(),
## GeoTypeName = col_character(),
## Borough = col_character(),
## Geography = col_character(),
## `Geography ID` = col_integer(),
## `Indicator Name` = col_character(),
## Number = col_number(),
## Percent = col_double()
## )
head(obesity.df)
## # A tibble: 6 x 8
## Year GeoTypeName Borough Geography `Geography ID` `Indicator Name`
## <int> <chr> <chr> <chr> <int> <chr>
## 1 2015 Neighborhood… Queens Bayside Lit… 404406 Obese Adults
## 2 2015 Neighborhood… Brookl… Bedford Stu… 203 Obese Adults
## 3 2015 Neighborhood… Brookl… Bensonhurst… 209 Obese Adults
## 4 2015 Neighborhood… Brookl… Borough Park 206 Obese Adults
## 5 2015 Neighborhood… Brookl… Canarsie - … 208 Obese Adults
## 6 2015 Neighborhood… Manhat… Central Har… 302 Obese Adults
## # ... with 2 more variables: Number <dbl>, Percent <dbl>
This dataset does not seem to need any transformations for cleaning. Lets go ahead and save this data in our local instance of MySQL as well
dbWriteTable(mydb,"NYC_obesity_estimates",obesity.df,overwrite=T)
## [1] TRUE
dbListTables(mydb)
## [1] "NYC_Poverty" "NYC_obesity_estimates" "NYC_zip"
## [4] "NYCounty_Income"
EDA: Lets analyze the data we collected using visualizations.
The NY income data gives us aggregated data which looks at the percentage of people, by Borough, who are uninsured.
names(df.nyincome)
## [1] "time" "county" "age_group"
## [4] "poverty_level" "poverty_group" "percent_uninsured"
Lets convert the data from wide to long in order to visualize with a stacked bar plot
df.nyincome_b <- select(filter(df.nyincome),c('county', 'poverty_level', 'percent_uninsured'))
head(df.nyincome_b)
## county poverty_level percent_uninsured
## 1 Bronx County, NY 0 11.1
## 2 Bronx County, NY 1 11.5
## 3 Bronx County, NY 2 11.8
## 4 Bronx County, NY 3 10.7
## 5 Bronx County, NY 4 11.9
## 6 Bronx County, NY 5 13.3
Add a new column that divides the current percent by 100 in order to scale in a stacked bar plot.
df.nyincome_b["percent"] <- NA
df.nyincome_b$percent <- df.nyincome_b$percent_uninsured/100
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
#library(easyGgplot2)
ggplot(df.nyincome_b, aes(x=county, y=percent, fill=poverty_level)) +
geom_bar(stat="identity") +
xlab("\nCounty") +
ylab("Percent\n") +
guides(fill=FALSE) +
theme_bw() +
scale_y_continuous(labels = percent_format())+
facet_grid(poverty_level ~ .)
The color scales are ordered by poverty level. 0 pertains to households above the poverty line while 5 pertains to households who are classified as high poverty. According the the chart, Queens County has the highest percentage of uninsured people who are categoried in the highest level of poverty.
We have another dataset that gives us neighborhood level granularity on poverty. Lets look at each instance of poverty by neighborhood for each Borough
Note: I Originally planned to Map GEO code to zip code and then map zip code to neighborhood. This proved to be more difficult by requiring the usage of “shapefiles”, which I am not familiar with yet.
Lets analyze the poverty rate by individual neighborhood.
Queens:
Brooklyn:
Manhattan:
Bronx:
Staten Island:
Note: Certain peaks that are out of order in the plot are aggregated values for a whole Neighborhood’s sub areas.
We will plot the similar charts to get a break down of the estimated number of adults classified as Obese per neighborhood. Queens:
Brooklyn:
Manhhatan:
Bronx:
Staten Island:
We see that the Obesity data is not as robust as the income data. There are a lot of neighborhoods missing. Solving the problem of getting a more complete dataset should be considered for future study.
Analysis and Conclusion:
In the meantime, we can join the poverty dataset and the income dataset and see a side by side comparison for any matched neighborhood. Lets join using MySQL
joinsql<-("select
a.Geography, a.Number as 'ObeseAdults', b.Number as 'LowIncomeAdults'
from NYC_obesity_estimates a
join NYC_Poverty b
on (a.Borough=b.Borough and a.Geography=b.Geography);")
joinclean<-dbGetQuery(mydb, joinsql)
head(joinclean)
## Geography ObeseAdults LowIncomeAdults
## 1 Bedford Stuyvesant - Crown Heights 86000 89354
## 2 Bensonhurst - Bay Ridge 35000 33668
## 3 Borough Park 39000 92821
## 4 Canarsie - Flatlands 45000 27775
## 5 Central Harlem - Morningside Heights 37000 51123
## 6 Coney Island - Sheepshead Bay 66000 56608
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
joinclean.df<- melt(joinclean, id.vars='Geography')
ggplot(joinclean.df, aes(Geography, value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
geom_bar(aes(fill = variable), position = "dodge", stat="identity")
From the looks of the chart, we can see that there are certain Neighborhoods where the number of Obese adults are almost the same as Low Income Adults, for example Bedford Stuyvesant - Crown Heights. There is not enough information in this plot to make a statistical inference.
The ggpubr package allows us to perform Hypothesis testing for correlation between Number of Low Income Adults and Number of Adults who are obese. GGPubr tutorial: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r
Is there a linear relationship between Low Income Adults and Obesity?
library("ggpubr")
## Loading required package: magrittr
ggscatter(joinclean, x = "LowIncomeAdults", y = "ObeseAdults",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Number of Low Income Adults", ylab = "Number of Obese Adults")
There is indeed a linear positive correlation between Adult Obesity and Low Income. The correlation coefficient is 0.56. Lets try a different function to verify the correlation. It should be noted that the low p value also implies statistically significant correlation.
Is there a correlation between Low Income Adults and Obesity?
test <- cor.test(joinclean$LowIncomeAdults, joinclean$ObeseAdults,
method = "pearson")
test
##
## Pearson's product-moment correlation
##
## data: joinclean$LowIncomeAdults and joinclean$ObeseAdults
## t = 4.0528, df = 36, p-value = 0.0002584
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2923693 0.7459421
## sample estimates:
## cor
## 0.5597365
The alternate method also yields the same result. The correlation is significant at the 95 percent confidence level.
Conclusion: -The correlation between Low Income Adults and Obese Adults is significant. The correlation constant is .55 rounded up. Given the time to find a more complete dataset of city and Obesity instances, I do believe the results would have been more conclusive. I had originally planned on doing something clever with maps. I had data with a unique GEO ID and another with zip codes. There has to be a better way to map GEO ID to zip code and then map zip code to Lattitude and Longitude. With spatial parameters, we are able to plot on a map, the Obesity instances and the Low income Adults. I think this would be a great addition to this study.