Last night I thought I’d play around and learn some new R packages and data sources. For packages, new to me here are:

For data sources, I wanted to explore the Data.gov (http://www.data.gov) catalog. While browsing the site and its catalog, I found a straightforward data set titled the USDA Food Environmental Atlas. The data is stored in a multisheet XLS spreadsheet and the item record for this data is located at:

Direct link to the XLS spreadsheet:

I’m using the following packages in this post:

library("dplyr")
library("ggplot2")
library("maps")
library("xlsx")
library("directlabels")
library("choroplethr")
library("choroplethrMaps")

The HEALTH sheet in the XLS file (sheet index #7) contains US county level diabetes and obesity rates in the US. I’m using the 2009 data, for no apparent reason, but some of it goes up to 2012.


I used the xlsx package to read the the HEALTH sheet into R as a data frame. The data file is in the current working directory, and the sheet is index #7.

health <- read.xlsx("DataDownload.xls", 7, "HEALTH", as.data.frame = TRUE)

To avoid state by state comparisons, it seems nicer to categorize states by US Census Divisions. The US Census provides a nice structured text file to help with this:

Main web page:

Data page:

After extracting the diabetes data from the USDA HEALTH sheet for the year 2009, I took the mean for all reporting counties by state:

dia09            <- tapply(health$PCT_DIABETES_ADULTS09,
                           health$State, FUN = mean)
dia09            <- as.data.frame(dia09)
dia09$State      <- as.factor(row.names(dia09))
row.names(dia09) <- NULL
dia09            <- na.exclude(dia09)

Did the same for 2009 obesity data:

obe09            <- tapply(health$PCT_OBESE_ADULTS09,
                           health$State, FUN = mean)
obe09            <- as.data.frame(obe09)
obe09$State      <- as.factor(row.names(obe09))
row.names(obe09) <- NULL
obe09            <- na.exclude(obe09)

Then I combined the data with an inner join:

dia_act          <- merge(dia09, obe09, by = "State")
rm(dia09, obe09)

And manually entered US Census Divisions data:

census_divisions   <- c("New England", "Middle Atlantic",
                        "East North Central", "West North Central",
                        "South Atlantic", "East South Central",
                        "West South Central", "Mountain",
                        "Pacific")
division           <- seq(1:9)
census_divisions   <- cbind(census_divisions, division)
census_divisions   <- as.data.frame(census_divisions)
rm(division)

I used the maps package to help organize state data:

data(state.fips)
state_fips       <- state.fips ; rm(state.fips)
state_fips$State <- state_fips$polyname
dia_act$abb      <- dia_act$State
dia_act          <- merge(dia_act, state_fips, by = "abb")
dia_act          <- merge(dia_act, census_divisions, by = "division")
rm(state_fips, census_divisions)

And then created a simple dataframe. The second line below, where I slice the data, is necessary because the state names (variable is polyname) include some names of states with names of more local areas in this format: state:local_name. I excluded those by grepping for them and then inverting the grepped results (like a boolean NOT).

dia_act <- distinct(select(dia_act,
                           c(division, abb, obe09, dia09,
                             polyname, census_divisions)))
dia_act <- slice(dia_act,
                 grep("[a-z]:[a-z]", dia_act$polyname, invert = TRUE))

ggplot2: Here I plotted the 2009 obesity rates against the 2009 diabetes rates and set color to US Census Divisions. The default is to plot with shaded confidence regions, but with the separate regression lines, it becomes difficult to read. Here, then, is the plot without the shaded confidence regions. The directlabels package adds label lines since I found the default legend, with so many colors, not very helpful in identifying some of the lines:

p <- ggplot(dia_act, aes(x = obe09,
                         y = dia09,
                         color = census_divisions))

p + geom_point(shape = 1) +
        geom_smooth(method = lm, se = FALSE) +
        xlab("Obesity Rates, 2009") +
        ylab("Diabetes Rates, 2009") +
        ggtitle("Relationship beteen 2009 Obesity and Diabetes Rates\nby US Census Division. Source:\nhttp://catalog.data.gov/dataset/food-environment-atlas-f4a22") +
        scale_colour_discrete(guide = FALSE) +
        geom_dl(aes(label = census_divisions), method = "smart.grid")

Here’s the more difficult to read plot with shaded confidence regions and the legend:

p + geom_point(shape = 1) +
        geom_smooth(method = lm) +
        xlab("Obesity Rates, 2009") +
        ylab("Diabetes Rates, 2009") +
        ggtitle("Relationship beteen 2009 Obesity and Diabetes Rates\nby US Census Division. Source:\nhttp://catalog.data.gov/dataset/food-environment-atlas-f4a22") +
        scale_colour_discrete(name = "Census Divisions")

Since the data is spatial, I thought I’d explore the choropleth package. Here’s a map of mean obesity rates per state (no data for states colored black):

dia_act$region <- dia_act$polyname
dia_act$value <- dia_act$obe09
state_choropleth(dia_act,
                 title = "2009 Obesity Rates",
                 legend = "Obesity Ranges")

And here’s a map of the mean diabetes rates per state:

dia_act$value <- dia_act$dia09
state_choropleth(dia_act,
                 title = "2009 Diabetes Rates",
                 legend = "Diabetes Ranges")

Here I regress diabetes rates on obesity rates:

model.1 <- lm(dia_act$dia09 ~ dia_act$obe09)
summary(model.1)
## 
## Call:
## lm(formula = dia_act$dia09 ~ dia_act$obe09)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6496 -0.4957  0.1215  0.3276  1.7940 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.13081    0.93895  -3.334  0.00188 ** 
## dia_act$obe09  0.44357    0.03185  13.925  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7646 on 39 degrees of freedom
## Multiple R-squared:  0.8325, Adjusted R-squared:  0.8283 
## F-statistic: 193.9 on 1 and 39 DF,  p-value: < 2.2e-16
dia_act$value <- predict(model.1)

And I used the outcome of predict(model.1) to generate a plot where the values are a result of the prediction.

state_choropleth(dia_act,
        title = "Prediction Based on Linear Regression (diabetes ~ obesity)",
        legend = "Prediction Ranges")