US Chronic Disease Data per state
Description from https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi-e50c9 is as follows:
CDC’s Division of Population Health provides cross-cutting set of 124 indicators that were developed by consensus and that allows states and territories and large metropolitan areas to uniformly define, collect, and report chronic disease data that are important to public health practice and available for states, territories and large metropolitan areas. In addition to providing access to state-specific indicator data, the CDI web site serves as a gateway to additional information and data resources.
library(tidyr)
library(dplyr)
library(ggplot2)
library(utils)
# Load the CSV file
orig <- read.csv("C:/Users/vikas/cuny/data607/projects/Project_2/US_Chronic_Disease/U.S._Chronic_Disease_Indicators__CDI_.csv")
# check the dimensions and column names
dim(orig)
## [1] 403984 34
names(orig)
## [1] "YearStart" "YearEnd"
## [3] "LocationAbbr" "LocationDesc"
## [5] "DataSource" "Topic"
## [7] "Question" "Response"
## [9] "DataValueUnit" "DataValueType"
## [11] "DataValue" "DataValueAlt"
## [13] "DataValueFootnoteSymbol" "DatavalueFootnote"
## [15] "LowConfidenceLimit" "HighConfidenceLimit"
## [17] "StratificationCategory1" "Stratification1"
## [19] "StratificationCategory2" "Stratification2"
## [21] "StratificationCategory3" "Stratification3"
## [23] "GeoLocation" "ResponseID"
## [25] "LocationID" "TopicID"
## [27] "QuestionID" "DataValueTypeID"
## [29] "StratificationCategoryID1" "StratificationID1"
## [31] "StratificationCategoryID2" "StratificationID2"
## [33] "StratificationCategoryID3" "StratificationID3"
This is a wide data frame with 34 columns. Let’s attempt to tidy it, starting by removing variables that have a constant value, are always NA, or are redundant in some other obvious way.
# Examine the first few variables.
unique(orig$YearStart)
## [1] 2015 2013 2014 2012 2011 2010 2009 2016 2008 2007 2001
unique(orig$YearEnd)
## [1] 2015 2013 2014 2012 2011 2010 2016 2008 2007 2001
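The unique values above differ: 2009 appears in YearStart but not in YearEnd, so at least some rows seem to cover a multi-year period. A minimal sketch of how the two columns could be compared pairwise is shown here. The `toy` data frame and its values are invented for illustration; with the real data, `orig` would take its place.

```r
library(dplyr)

# Toy stand-in for `orig`; the values are invented for illustration only.
toy <- data.frame(
  YearStart = c(2015, 2009, 2010, 2001),
  YearEnd   = c(2015, 2013, 2010, 2001)
)

# Count each (YearStart, YearEnd) pair and compute the length of the period.
year_pairs <- toy %>%
  count(YearStart, YearEnd) %>%
  mutate(span_years = YearEnd - YearStart + 1)

print(year_pairs)

# Keep only the pairs whose reporting period covers more than one year.
multi_year <- filter(year_pairs, span_years > 1)
print(multi_year)
```

If `multi_year` is empty on the real data, YearEnd duplicates YearStart and one of the two could be dropped; otherwise both columns carry information.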
LocationDesc is simply the expanded form of LocationAbbr and adds nothing to the analysis of this data set, so it can be removed. To verify, we can print the unique values of both columns and check that they have the same number of levels.
unique(orig$LocationAbbr)
## [1] AK AL AR AZ CA CO CT DC FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI
## [24] MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TX US UT
## [47] VA VT WA WI WV WY DE TN VI
## 55 Levels: AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY ... WY
unique(orig$LocationDesc)
## [1] Alaska Alabama Arkansas
## [4] Arizona California Colorado
## [7] Connecticut District of Columbia Florida
## [10] Georgia Guam Hawaii
## [13] Iowa Idaho Illinois
## [16] Indiana Kansas Kentucky
## [19] Louisiana Massachusetts Maryland
## [22] Maine Michigan Minnesota
## [25] Missouri Mississippi Montana
## [28] North Carolina North Dakota Nebraska
## [31] New Hampshire New Jersey New Mexico
## [34] Nevada New York Ohio
## [37] Oklahoma Oregon Pennsylvania
## [40] Puerto Rico Rhode Island South Carolina
## [43] South Dakota Texas United States
## [46] Utah Virginia Vermont
## [49] Washington Wisconsin West Virginia
## [52] Wyoming Delaware Tennessee
## [55] Virgin Islands
## 55 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
orig2 <- orig %>% select(-LocationDesc)
# With the redundant column dropped, this should show one fewer column.
dim(orig2)
## [1] 403984 33
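Matching level counts (55 and 55) suggest, but do not prove, that each abbreviation maps to exactly one description. A stricter check could look at the distinct pairs directly. The sketch below uses a small invented `toy` data frame in place of `orig`, and `one_to_one` is a name chosen here for illustration.

```r
library(dplyr)

# Toy stand-in for `orig`; only the two location columns matter here.
toy <- data.frame(
  LocationAbbr = c("AK", "AL", "AK", "US"),
  LocationDesc = c("Alaska", "Alabama", "Alaska", "United States")
)

# Distinct (abbreviation, description) pairs actually present in the data.
pairs <- distinct(toy, LocationAbbr, LocationDesc)

# The mapping is one-to-one when neither column repeats a value
# among the distinct pairs.
one_to_one <- anyDuplicated(pairs$LocationAbbr) == 0 &&
  anyDuplicated(pairs$LocationDesc) == 0
print(one_to_one)
```

The same check applies to any pair of columns suspected of encoding the same information.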
Let’s see if any of the variables have a constant value or are always NA. Those can be removed.
unique(orig2$Response)
## [1] NA
orig2[1,]$Response
## [1] NA
# Since Response is always NA, it can be removed
orig3 <- orig2 %>% select(-Response)
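Checking columns one at a time with unique() works, but with more than 30 columns it may be easier to flag every all-NA or constant column in one pass. The following is a sketch in base R under that assumption; the `toy` data frame, its `Source` column, and its values are invented for illustration and stand in for `orig2`.

```r
# Toy stand-in for `orig2`; values are invented for illustration.
toy <- data.frame(
  Topic      = c("Diabetes", "Asthma", "Diabetes"),
  Response   = c(NA, NA, NA),
  ResponseID = c(NA, NA, NA),
  Source     = c("BRFSS", "BRFSS", "BRFSS")
)

# TRUE for columns in which every value is NA.
all_na <- vapply(toy, function(col) all(is.na(col)), logical(1))

# TRUE for columns with exactly one distinct non-NA value.
constant <- vapply(
  toy,
  function(col) length(unique(col[!is.na(col)])) == 1,
  logical(1)
)

print(names(toy)[all_na])
print(names(toy)[constant])

# Drop the all-NA columns; constant columns are only reported, not dropped.
toy_clean <- toy[, !all_na, drop = FALSE]
```

Constant columns are reported rather than removed because a single repeated value can still be worth recording before it is discarded.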
Many NA values appear in the variables beginning with “Stratification”. Let’s see if any of those variables are always NA.
unique(orig3$Stratification2)
## [1] NA
orig3[1,]$Stratification2
## [1] NA
unique(orig3$Stratification3)
## [1] NA
orig3[1,]$Stratification3
## [1] NA
# Stratification2, Stratification3 can be removed entirely.
orig3 <- orig3 %>% select(-Stratification2, -Stratification3)
# StratificationCategory3 appears to be NA always.
unique(orig3$StratificationCategory3)
## [1] NA
orig3[1,]$StratificationCategory3
## [1] NA
# ResponseID appears to be NA always.
unique(orig3$ResponseID)
## [1] NA
orig3[1,]$ResponseID
## [1] NA
# LocationID appears to be a numeric code for LocationAbbr: it has 55 distinct values, the same count as LocationAbbr.
unique(orig3$LocationID)
## [1] 2 1 5 4 6 8 9 11 12 13 66 15 19 16 17 18 20 21 22 25 24 23 26
## [24] 27 29 28 30 37 38 31 33 34 35 32 36 39 40 41 42 72 44 45 46 48 59 49
## [47] 51 50 53 55 54 56 10 47 78
# StratificationCategoryID2 appears to be all NAs.
unique(orig3$StratificationCategoryID2)
## [1] NA
# StratificationCategoryID3 appears to be all NAs.
unique(orig3$StratificationCategoryID3)
## [1] NA
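# Drop the all-NA and redundant columns identified above in a single select().
# StratificationID2 and StratificationID3 were not inspected above; they are
# assumed to be all NA like Stratification2 and Stratification3.
orig4 <- orig3 %>%
  select(-StratificationCategory3, -ResponseID, -LocationID,
         -StratificationCategoryID2, -StratificationCategoryID3,
         -StratificationID2, -StratificationID3)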
dim(orig4)
## [1] 403984 23
names(orig4)
## [1] "YearStart" "YearEnd"
## [3] "LocationAbbr" "DataSource"
## [5] "Topic" "Question"
## [7] "DataValueUnit" "DataValueType"
## [9] "DataValue" "DataValueAlt"
## [11] "DataValueFootnoteSymbol" "DatavalueFootnote"
## [13] "LowConfidenceLimit" "HighConfidenceLimit"
## [15] "StratificationCategory1" "Stratification1"
## [17] "StratificationCategory2" "GeoLocation"
## [19] "TopicID" "QuestionID"
## [21] "DataValueTypeID" "StratificationCategoryID1"
## [23] "StratificationID1"
# summary statistics
summary(orig4)
## YearStart YearEnd LocationAbbr
## Min. :2001 Min. :2001 KY : 7779
## 1st Qu.:2011 1st Qu.:2012 NC : 7779
## Median :2013 Median :2013 NE : 7779
## Mean :2013 Mean :2013 NJ : 7779
## 3rd Qu.:2014 3rd Qu.:2014 NV : 7779
## Max. :2016 Max. :2016 SC : 7779
## (Other):357310
## DataSource
## BRFSS :299921
## NVSS : 79756
## State Inpatient Data : 9324
## CMS Part A Claims Data: 1872
## SEDD; SID : 1866
## YRBSS : 1497
## (Other) : 9748
## Topic
## Diabetes : 67342
## Cardiovascular Disease : 62678
## Chronic Obstructive Pulmonary Disease: 52536
## Asthma : 31958
## Arthritis : 31652
## Overarching Conditions : 30092
## (Other) :127726
## Question
## Mortality from heart failure : 6136
## Asthma mortality rate : 6135
## Chronic liver disease mortality : 6135
## Mortality due to diabetes reported as any listed cause of death: 6135
## Mortality from cerebrovascular disease (stroke) : 6135
## Mortality from coronary heart disease : 6135
## (Other) :367173
## DataValueUnit DataValueType
## % :280119 Crude Prevalence :150281
## cases per 100,000 : 49392 Age-adjusted Prevalence:128577
## : 29865 Number : 32419
## Number : 24137 Age-adjusted Rate : 31311
## cases per 10,000 : 8204 Crude Rate : 31311
## cases per 1,000,000: 5130 Age-adjusted Mean : 10875
## (Other) : 7137 (Other) : 19210
## DataValue DataValueAlt DataValueFootnoteSymbol
## :106167 Min. : 0 :215965
## : 23064 1st Qu.: 19 **** : 80363
## 1 : 991 Median : 42 : 56188
## 1.1 : 706 Mean : 727 - : 26097
## 3.6 : 666 3rd Qu.: 71 ~ : 22686
## 3.7 : 657 Max. :3967333 * : 1758
## (Other):271733 NA's :130318 (Other): 927
## DatavalueFootnote
## :216131
## Sample size of denominator and/or age group for age-standardization is less than 50 or relative standard error is more than 30%: 80363
## : 56022
## No data available : 26097
## Data not shown because of too few respondents or cases : 22686
## 50 States + DC: US Median : 1758
## (Other) : 927
## LowConfidenceLimit HighConfidenceLimit StratificationCategory1
## Min. : 0.20 Min. : 0.42 Gender : 94150
## 1st Qu.: 13.00 1st Qu.: 20.20 Overall : 63215
## Median : 31.10 Median : 45.60 Race/Ethnicity:246619
## Mean : 49.73 Mean : 62.65
## 3rd Qu.: 57.40 3rd Qu.: 72.20
## Max. :1293.90 Max. :2088.00
## NA's :157165 NA's :157165
## Stratification1 StratificationCategory2
## Overall :63215 Mode:logical
## Black, non-Hispanic:49324 NA's:403984
## Hispanic :49324
## White, non-Hispanic:49324
## Female :47075
## Male :47075
## (Other) :98647
## GeoLocation TopicID
## (33.998821303000454, -81.04537120699968) : 7779 DIA : 67342
## (35.466220975000454, -79.15925046299964) : 7779 CVD : 62678
## (37.645970271000465, -84.77497104799966) : 7779 COPD : 52536
## (39.493240390000494, -117.07184056399967): 7779 AST : 31958
## (40.13057004800049, -74.27369128799967) : 7779 ART : 31652
## (41.6410409880005, -99.36572062299967) : 7779 OVC : 30092
## (Other) :357310 (Other):127726
## QuestionID DataValueTypeID StratificationCategoryID1
## CVD1_4 : 6136 CrdPrev :150281 GENDER : 94150
## ALC6_0 : 6135 AgeAdjPrev:128577 OVERALL: 63215
## AST4_1 : 6135 Nmbr : 32419 RACE :246619
## CKD1_0 : 6135 AgeAdjRate: 31311
## COPD1_1: 6135 CrdRate : 31311
## COPD1_2: 6135 AgeAdjMean: 10875
## (Other):367173 (Other) : 19210
## StratificationID1
## OVR :63215
## BLK :49324
## HIS :49324
## WHT :49324
## GENF :47075
## GENM :47075
## (Other):98647
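The summary points to two further cleanup candidates. StratificationCategory2 is reported as logical with 403984 NA's, so it is entirely NA and could be dropped like the other stratification columns. DataValue also shows a large unlabeled level, which we assume here to be empty strings rather than true NA values. A minimal sketch of both steps follows, using an invented `toy` data frame in place of `orig4`.

```r
library(dplyr)

# Toy stand-in for `orig4`; values are invented for illustration.
toy <- data.frame(
  StratificationCategory1 = c("Gender", "Overall", "Race/Ethnicity"),
  StratificationCategory2 = c(NA, NA, NA),
  DataValue               = c("3.6", "", "1.1"),
  stringsAsFactors = FALSE
)

# Confirm the column is entirely NA before dropping it.
stopifnot(all(is.na(toy$StratificationCategory2)))
toy2 <- toy %>% select(-StratificationCategory2)

# Empty strings are not NA to R; convert them so missingness is explicit.
# Assumption: the blank level seen in the summary is the empty string "".
toy2 <- toy2 %>% mutate(DataValue = na_if(DataValue, ""))

print(names(toy2))
print(sum(is.na(toy2$DataValue)))
```

On the real data, DataValue was read as a factor, so it would need as.character() before na_if(); if the blanks turn out to be whitespace rather than empty strings, trimws() would be needed first.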