Project 2: Data Preparation & Transformation Pt. I

Data Set I - US Chronic Disease Indicators (CDI)
Load Libraries
Load Data
Preview Data Structure
Conclusion

Data Set I - US Chronic Disease Indicators (CDI)

This data set was posted by Niteen Kumar. “The dataset actually outlines how different states are impacted by certain types of disease category along with clear indicators such as alcohol use among youth, Binge drinking prevalence among adults aged >= 18 years, Heavy drinking among adults aged >= 18 years, Chronic liver disease mortality etc.”

My goals with this data set are as follows:

(-) Load, tidy, and transform the data set.
(-) Investigate how each state and location is affected by several types of chronic disease.

Load Libraries

library("tidyverse")

## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## -- Conflicts --------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library("stringr")
library("DT")
library("leaflet")
library("geojsonio")

## 
## Attaching package: 'geojsonio'

## The following object is masked from 'package:base':
## 
##     pretty

library("ggplot2")
library("fiftystater")
library("colorplaner")
library("mapproj")

## Loading required package: maps

## 
## Attaching package: 'maps'

## The following object is masked from 'package:purrr':
## 
##     map

Load Data

data <- read.csv("data/U.S._Chronic_Disease_Indicators__CDI.csv", sep = ",", header = TRUE, stringsAsFactors=FALSE,fileEncoding="UTF-8-BOM")

Preview Data Structure

str(data)

## 'data.frame':    53469 obs. of  19 variables:
##  $ Year                   : chr  "2013" "2013" "2013" "2013" ...
##  $ LocationAbbr           : chr  "AL" "AK" "AZ" "AR" ...
##  $ LocationDesc           : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Category               : chr  "Alcohol" "Alcohol" "Alcohol" "Alcohol" ...
##  $ Indicator              : chr  "Alcohol use among youth" "Alcohol use among youth" "Alcohol use among youth" "Alcohol use among youth" ...
##  $ Datasource             : chr  "YRBSS" "YRBSS" "YRBSS" "YRBSS" ...
##  $ DataValueUnit          : chr  "%" "%" "%" "%" ...
##  $ DataValueType          : chr  "Crude Prevalence" "Crude Prevalence" "Crude Prevalence" "Crude Prevalence" ...
##  $ DataValue              : num  35 22.5 36 36.3 NA NA 36.7 36.3 31.4 34.8 ...
##  $ DataValueAlt           : chr  "35.0" "22.5" "36.0" "36.3" ...
##  $ DataValueFootnoteSymbol: chr  "" "" "" "" ...
##  $ DataValueFootnote      : chr  "" "" "" "" ...
##  $ Gender                 : chr  "" "" "" "" ...
##  $ StratificationID1      : chr  "" "" "" "" ...
##  $ IndicatorID            : chr  "ALC1_1" "ALC1_1" "ALC1_1" "ALC1_1" ...
##  $ LocationID             : chr  "01" "02" "04" "05" ...
##  $ LowConfidenceInterval  : num  30.1 19.3 31.4 32.3 NA NA 32.7 33.7 30.2 33.1 ...
##  $ HighConfidenceInterval : num  40.3 26.1 40.9 40.4 NA NA 41 39 32.5 36.6 ...
##  $ GeoLocation            : chr  "(32.84057112200048, -86.63186076199969)" "(64.84507995700051, -147.72205903599973)" "(34.865970280000454, -111.76381127699972)" "(34.74865012400045, -92.27449074299966)" ...

As we can see, this data set is complex: it has 19 columns, 53,469 rows, and missing values. Furthermore, the variables aren’t standardized. For example, the Year variable may contain a single year or a range of years, and the values in DataValue may represent different quantities (a percentage, a rate per 100,000, a nominal value, etc.).
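
The mixed formats in Year can be handled with a small parsing step. The following is a minimal base R sketch, not part of the original analysis: the toy values, the assumed "YYYY" and "YYYY-YYYY" formats, and the names year_start and year_end are all assumptions made for illustration.

```r
# Toy values standing in for the Year column
# (assumed formats: a single year "YYYY" or a range "YYYY-YYYY")
years <- c("2013", "2009-2013", "2014")

# Split on the hyphen: a single year yields one piece, a range yields two
parts <- strsplit(years, "-", fixed = TRUE)

# First piece is the start year; last piece is the end year
year_start <- as.integer(vapply(parts, function(p) p[1], character(1)))
year_end   <- as.integer(vapply(parts, function(p) p[length(p)], character(1)))

data.frame(Year = years, year_start, year_end)
```

For a single-year value the start and end are the same, so both columns can be used uniformly when filtering.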

I will focus on the prevalence of alcohol-related indicators across different states.

Select only cases with a category of Alcohol.

data <- filter(data,Category == "Alcohol")

Remove columns that are not needed for analysis.

data <- select(data, Year, LocationDesc, Category, Indicator,IndicatorID, DataValueUnit, DataValueType, DataValue, GeoLocation)
datatable(data, options = list(filter = FALSE),filter="top")

Geospatial analysis typically requires separate fields for latitude and longitude. To facilitate this, we need to split the GeoLocation variable and remove the extra punctuation. An example value of this variable is “(32.84057112200048, -86.63186076199969)”.

Create separate columns to represent latitude and longitude.

# str_match() returns one row per case: the full match, then each captured group
coords <- str_match( data$GeoLocation, "(-?[[:digit:]]+\\.[[:digit:]]+),\\s*(-?[[:digit:]]+\\.[[:digit:]]+)" )
data$lats <- as.numeric( coords[, 2] )
data$lons <- as.numeric( coords[, 3] )
data <- select( data, -GeoLocation )
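
To see how the coordinates come apart for individual values, here is a minimal sketch in base R. The first two strings are taken from the str() output above. The empty string is an assumption about how a missing location might appear, and the names geo and matches are hypothetical.

```r
# Two GeoLocation values from the str() output, plus an empty string
# standing in for a missing location (assumption)
geo <- c("(32.84057112200048, -86.63186076199969)",
         "(64.84507995700051, -147.72205903599973)",
         "")

# gregexpr() locates every signed decimal number in each string;
# regmatches() returns them as a list with one character vector per string
matches <- regmatches(geo, gregexpr("-?[0-9]+\\.[0-9]+", geo))

# The first number is the latitude, the second the longitude;
# indexing an empty result gives NA, so missing locations stay missing
lats <- as.numeric(sapply(matches, function(m) m[1]))
lons <- as.numeric(sapply(matches, function(m) m[2]))

data.frame(GeoLocation = geo, lats, lons)
```

The key point is that the extraction has to pick the first and second match within each row, rather than the matches of the first and second rows.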

Preview coordinates

datatable( data, options = list( filter = FALSE ), filter="top" )

Map the prevalence of binge drinking among adults across states.

There are multiple cases for the same LocationDesc, Indicator, and DataValueType. These must be grouped and summarized before we can filter and map our indicator of interest.
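
As a rough illustration of why grouping is needed, the sketch below uses a toy data frame with invented values (not taken from the CDI data) to find keys that occur more than once and collapse them to one row each. The names toy, dupes, collapsed, total, and average are hypothetical.

```r
library(dplyr)

# Toy rows (invented values): Alabama has two entries for the same indicator
toy <- data.frame(
  LocationDesc  = c("Alabama", "Alabama", "Alaska"),
  IndicatorID   = c("ALC2_2", "ALC2_2", "ALC2_2"),
  DataValueType = "Age-adjusted Prevalence",
  DataValue     = c(12, 14, 20),
  stringsAsFactors = FALSE
)

# Count rows per key; n > 1 means the key is not unique
dupes <- toy %>%
  count(LocationDesc, IndicatorID, DataValueType) %>%
  filter(n > 1)

# Collapse to one row per key; sum() and mean() give different summaries
collapsed <- toy %>%
  group_by(LocationDesc, IndicatorID, DataValueType) %>%
  summarise(total   = sum(DataValue, na.rm = TRUE),
            average = mean(DataValue, na.rm = TRUE)) %>%
  ungroup()
```

Which summary is appropriate depends on what the repeated entries represent. For percentages, a sum over several entries is no longer a percentage, whereas a mean stays on the original scale.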

I did not use the coordinates in this case. Instead, I used the fiftystater package, whose fifty_states data set provides the boundaries for each state. I then filtered the data to the indicator and value type of interest to produce the map.

This method can be used to map most variables in the data set.

indicatorID <- 'ALC2_2'
indicator <- "Binge drinking prevalence among Adults Aged >= 18 years"
dataValueType <- 'Age-adjusted Prevalence'
data("fifty_states") # Load state properties

# perform grouping to consider multiple entries for the same state
mapData <- data  %>%
  group_by(LocationDesc,IndicatorID,DataValueType) %>%
    summarise(DataValue = sum(DataValue) )
# consider only the indicator and values of interest
mapData <- filter( mapData, IndicatorID == indicatorID, DataValueType == dataValueType )

# the state data set is in lower case, create a column with lower case states for referencing
mapData <- data.frame( state = tolower(mapData$LocationDesc), mapData )

# plot map
p <- ggplot(mapData, aes(map_id = state)) + 
  # map points to the fifty_states shape data
  geom_map(aes(fill = DataValue), map = fifty_states, size = 0.15, color = "#ffffff") + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  scale_fill_gradient2(name = dataValueType, midpoint = median(mapData$DataValue, na.rm = TRUE) ,
                       low = "blue", mid = "green",
                       high = "red") +
  coord_map() +
  labs(x = NULL, y = NULL,
       title = indicator) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  theme(legend.position = "bottom", 
        panel.background = element_blank())

p + fifty_states_inset_boxes()
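
Because geom_map() matches rows to regions by map_id, a location whose lower-cased name has no counterpart in the map data is simply not drawn. A minimal check for such names might look like the sketch below. Both vectors are hypothetical stand-ins: map_ids for the region ids in the map data, and locations for values of LocationDesc.

```r
# Hypothetical stand-in for the region ids in the map data (lower-case names)
map_ids <- c("alabama", "alaska", "arizona")

# Hypothetical location names as they might appear in LocationDesc
locations <- c("Alabama", "Alaska", "Guam", "United States")

# Names with no matching region after lower-casing
unmatched <- setdiff(tolower(locations), map_ids)
unmatched
```

Running a check like this before plotting makes it clear which rows the map leaves out.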

Conclusion

The map shows that binge drinking among adults is more prominent in the northern states. North Dakota stands out the most; could it be the cold-weather blues? Even more alarming, only a few states have a mapped value below 40! Note, however, that DataValue was summed over all entries for each state, so the mapped value is not a single percentage. It is also important to note that the indicator does not convey the exact amount of alcohol consumed per day, so the values should be used with caution.

I have merely scratched the surface of this data set. The procedures above can be replicated to map and compare other indicators.

This data set provides an excellent learning opportunity.