US Chronic Disease Data per state

The dataset description from https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi-e50c9 reads as follows:

CDC’s Division of Population Health provides a cross-cutting set of 124 indicators that were developed by consensus and that allow states, territories, and large metropolitan areas to uniformly define, collect, and report chronic disease data that are important to public health practice and available for states, territories, and large metropolitan areas. In addition to providing access to state-specific indicator data, the CDI web site serves as a gateway to additional information and data resources.

library(tidyr)
library(dplyr)
library(ggplot2)
library(utils)

# Load the CSV file
orig <- read.csv("C:/Users/vikas/cuny/data607/projects/Project_2/US_Chronic_Disease/U.S._Chronic_Disease_Indicators__CDI_.csv")

# check the dimensions and column names
dim(orig)
## [1] 403984     34
names(orig)
##  [1] "YearStart"                 "YearEnd"                  
##  [3] "LocationAbbr"              "LocationDesc"             
##  [5] "DataSource"                "Topic"                    
##  [7] "Question"                  "Response"                 
##  [9] "DataValueUnit"             "DataValueType"            
## [11] "DataValue"                 "DataValueAlt"             
## [13] "DataValueFootnoteSymbol"   "DatavalueFootnote"        
## [15] "LowConfidenceLimit"        "HighConfidenceLimit"      
## [17] "StratificationCategory1"   "Stratification1"          
## [19] "StratificationCategory2"   "Stratification2"          
## [21] "StratificationCategory3"   "Stratification3"          
## [23] "GeoLocation"               "ResponseID"               
## [25] "LocationID"                "TopicID"                  
## [27] "QuestionID"                "DataValueTypeID"          
## [29] "StratificationCategoryID1" "StratificationID1"        
## [31] "StratificationCategoryID2" "StratificationID2"        
## [33] "StratificationCategoryID3" "StratificationID3"

This is a wide data frame with 34 columns. Let’s attempt to tidy it, beginning with variables that are constant, always NA, or redundant in some other obvious way.

# Examine the first few variables.
unique(orig$YearStart)
##  [1] 2015 2013 2014 2012 2011 2010 2009 2016 2008 2007 2001
unique(orig$YearEnd)
##  [1] 2015 2013 2014 2012 2011 2010 2016 2008 2007 2001
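
The two outputs differ: YearStart has 11 distinct values while YearEnd has 10 (2009 is absent), so at least some rows cover a span of more than one year. A minimal sketch of how such rows could be listed, using base R on a small made-up data frame (`toy` and its values are hypothetical stand-ins for `orig`):

```r
# Hypothetical toy data standing in for orig; the values are made up.
toy <- data.frame(
  YearStart = c(2009, 2013, 2015),
  YearEnd   = c(2013, 2013, 2015)
)

# Rows whose reporting period spans more than one year.
multi_year <- toy[toy$YearStart != toy$YearEnd, ]
print(multi_year)

# Distinct (YearStart, YearEnd) combinations present in the data.
year_pairs <- unique(toy[, c("YearStart", "YearEnd")])
print(year_pairs)
```

Running the same two expressions against `orig` would show which year ranges actually occur in the full data set.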

The LocationDesc field is just the spelled-out form of LocationAbbr and adds nothing to the analysis of this data set, so we can remove it. As a check, we print the unique values of both fields and compare their counts.

unique(orig$LocationAbbr)
##  [1] AK AL AR AZ CA CO CT DC FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI
## [24] MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TX US UT
## [47] VA VT WA WI WV WY DE TN VI
## 55 Levels: AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY ... WY
unique(orig$LocationDesc)
##  [1] Alaska               Alabama              Arkansas            
##  [4] Arizona              California           Colorado            
##  [7] Connecticut          District of Columbia Florida             
## [10] Georgia              Guam                 Hawaii              
## [13] Iowa                 Idaho                Illinois            
## [16] Indiana              Kansas               Kentucky            
## [19] Louisiana            Massachusetts        Maryland            
## [22] Maine                Michigan             Minnesota           
## [25] Missouri             Mississippi          Montana             
## [28] North Carolina       North Dakota         Nebraska            
## [31] New Hampshire        New Jersey           New Mexico          
## [34] Nevada               New York             Ohio                
## [37] Oklahoma             Oregon               Pennsylvania        
## [40] Puerto Rico          Rhode Island         South Carolina      
## [43] South Dakota         Texas                United States       
## [46] Utah                 Virginia             Vermont             
## [49] Washington           Wisconsin            West Virginia       
## [52] Wyoming              Delaware             Tennessee           
## [55] Virgin Islands      
## 55 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
orig2 <- orig %>% select(-LocationDesc)

# With the redundant column dropped, this should show one fewer column.
dim(orig2)
## [1] 403984     33
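
Equal counts of unique values (55 in each field) are consistent with a one-to-one mapping but do not prove it on their own. A stricter check looks at the distinct pairs and confirms that neither side repeats. The sketch below uses base R on a made-up data frame (`toy` and its rows are hypothetical stand-ins for `orig`):

```r
# Hypothetical toy data standing in for orig; the rows are made up.
toy <- data.frame(
  LocationAbbr = c("AK", "AL", "AK", "US"),
  LocationDesc = c("Alaska", "Alabama", "Alaska", "United States"),
  stringsAsFactors = FALSE
)

# Keep one row per distinct (abbreviation, description) pair.
pairs <- unique(toy[, c("LocationAbbr", "LocationDesc")])

# The mapping is one-to-one if no abbreviation and no description
# appears in more than one distinct pair.
is_one_to_one <- anyDuplicated(pairs$LocationAbbr) == 0 &&
  anyDuplicated(pairs$LocationDesc) == 0
print(is_one_to_one)
```

The same pattern would apply to any other pair of columns suspected of carrying the same information, such as LocationID and LocationAbbr.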

Let’s see if any of the variables have a constant value or are always NA. Those can be removed.

unique(orig2$Response)
## [1] NA
orig2[1,]$Response
## [1] NA
# Since Response is always NA, it can be removed
orig3 <- orig2 %>% select(-Response)
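
Checking columns one at a time works, but with more than 30 columns it can be quicker to scan them all at once. The following is a minimal sketch, assuming base R only and a made-up data frame (`toy` is a hypothetical stand-in for `orig2`), that lists columns that are entirely NA and columns that hold a single non-missing value:

```r
# Hypothetical toy data standing in for orig2; the values are made up.
toy <- data.frame(
  Topic      = c("Asthma", "Diabetes", "Asthma"),
  Response   = c(NA, NA, NA),
  DataSource = c("BRFSS", "BRFSS", "BRFSS"),
  stringsAsFactors = FALSE
)

# Columns with no non-missing values at all.
all_na <- names(toy)[colSums(!is.na(toy)) == 0]

# Number of distinct non-missing values in each column.
n_distinct_vals <- sapply(toy, function(col) length(unique(col[!is.na(col)])))

# Columns that take exactly one non-missing value.
constant <- names(toy)[n_distinct_vals == 1]

print(all_na)
print(constant)
```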

A number of NA values are listed in the variables beginning with “Stratification”. Let’s see if any of those variables are always NA.

unique(orig3$Stratification2)
## [1] NA
orig3[1,]$Stratification2
## [1] NA
unique(orig3$Stratification3)
## [1] NA
orig3[1,]$Stratification3
## [1] NA
# Stratification2 and Stratification3 are always NA, so both can be removed entirely.
orig3 <- orig3 %>% select(-Stratification2, -Stratification3)

# StratificationCategory3 appears to be NA always.
unique(orig3$StratificationCategory3)
## [1] NA
orig3[1,]$StratificationCategory3
## [1] NA
# ResponseID appears to be NA always.
unique(orig3$ResponseID)
## [1] NA
orig3[1,]$ResponseID
## [1] NA
# LocationID appears to be completely dependent on LocationAbbr (both have 55 unique values).
unique(orig3$LocationID)
##  [1]  2  1  5  4  6  8  9 11 12 13 66 15 19 16 17 18 20 21 22 25 24 23 26
## [24] 27 29 28 30 37 38 31 33 34 35 32 36 39 40 41 42 72 44 45 46 48 59 49
## [47] 51 50 53 55 54 56 10 47 78
# StratificationCategoryID2 appears to be all NAs.
unique(orig3$StratificationCategoryID2)
## [1] NA
# StratificationCategoryID3 appears to be all NAs.
unique(orig3$StratificationCategoryID3)
## [1] NA
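# Drop the all-NA columns and LocationID (redundant with LocationAbbr).
# StratificationID2 and StratificationID3 were not inspected above; they are
# dropped here on the assumption that they are also entirely NA.
orig4 <- orig3 %>%
  select(-StratificationCategory3, -ResponseID, -LocationID,
         -StratificationCategoryID2, -StratificationCategoryID3,
         -StratificationID2, -StratificationID3)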

dim(orig4)
## [1] 403984     23
names(orig4)
##  [1] "YearStart"                 "YearEnd"                  
##  [3] "LocationAbbr"              "DataSource"               
##  [5] "Topic"                     "Question"                 
##  [7] "DataValueUnit"             "DataValueType"            
##  [9] "DataValue"                 "DataValueAlt"             
## [11] "DataValueFootnoteSymbol"   "DatavalueFootnote"        
## [13] "LowConfidenceLimit"        "HighConfidenceLimit"      
## [15] "StratificationCategory1"   "Stratification1"          
## [17] "StratificationCategory2"   "GeoLocation"              
## [19] "TopicID"                   "QuestionID"               
## [21] "DataValueTypeID"           "StratificationCategoryID1"
## [23] "StratificationID1"
# summary statistics
summary(orig4)
##    YearStart       YearEnd      LocationAbbr   
##  Min.   :2001   Min.   :2001   KY     :  7779  
##  1st Qu.:2011   1st Qu.:2012   NC     :  7779  
##  Median :2013   Median :2013   NE     :  7779  
##  Mean   :2013   Mean   :2013   NJ     :  7779  
##  3rd Qu.:2014   3rd Qu.:2014   NV     :  7779  
##  Max.   :2016   Max.   :2016   SC     :  7779  
##                                (Other):357310  
##                   DataSource    
##  BRFSS                 :299921  
##  NVSS                  : 79756  
##  State Inpatient Data  :  9324  
##  CMS Part A Claims Data:  1872  
##  SEDD; SID             :  1866  
##  YRBSS                 :  1497  
##  (Other)               :  9748  
##                                    Topic       
##  Diabetes                             : 67342  
##  Cardiovascular Disease               : 62678  
##  Chronic Obstructive Pulmonary Disease: 52536  
##  Asthma                               : 31958  
##  Arthritis                            : 31652  
##  Overarching Conditions               : 30092  
##  (Other)                              :127726  
##                                                             Question     
##  Mortality from heart failure                                   :  6136  
##  Asthma mortality rate                                          :  6135  
##  Chronic liver disease mortality                                :  6135  
##  Mortality due to diabetes reported as any listed cause of death:  6135  
##  Mortality from cerebrovascular disease (stroke)                :  6135  
##  Mortality from coronary heart disease                          :  6135  
##  (Other)                                                        :367173  
##              DataValueUnit                    DataValueType   
##  %                  :280119   Crude Prevalence       :150281  
##  cases per 100,000  : 49392   Age-adjusted Prevalence:128577  
##                     : 29865   Number                 : 32419  
##  Number             : 24137   Age-adjusted Rate      : 31311  
##  cases per 10,000   :  8204   Crude Rate             : 31311  
##  cases per 1,000,000:  5130   Age-adjusted Mean      : 10875  
##  (Other)            :  7137   (Other)                : 19210  
##    DataValue       DataValueAlt     DataValueFootnoteSymbol
##         :106167   Min.   :      0          :215965         
##         : 23064   1st Qu.:     19   ****   : 80363         
##  1      :   991   Median :     42          : 56188         
##  1.1    :   706   Mean   :    727   -      : 26097         
##  3.6    :   666   3rd Qu.:     71   ~      : 22686         
##  3.7    :   657   Max.   :3967333   *      :  1758         
##  (Other):271733   NA's   :130318    (Other):   927         
##                                                                                                                        DatavalueFootnote 
##                                                                                                                                 :216131  
##  Sample size of denominator and/or age group for age-standardization is less than 50 or relative standard error is more than 30%: 80363  
##                                                                                                                                 : 56022  
##  No data available                                                                                                              : 26097  
##  Data not shown because of too few respondents or cases                                                                         : 22686  
##  50 States + DC: US Median                                                                                                      :  1758  
##  (Other)                                                                                                                        :   927  
##  LowConfidenceLimit HighConfidenceLimit   StratificationCategory1
##  Min.   :   0.20    Min.   :   0.42     Gender        : 94150    
##  1st Qu.:  13.00    1st Qu.:  20.20     Overall       : 63215    
##  Median :  31.10    Median :  45.60     Race/Ethnicity:246619    
##  Mean   :  49.73    Mean   :  62.65                              
##  3rd Qu.:  57.40    3rd Qu.:  72.20                              
##  Max.   :1293.90    Max.   :2088.00                              
##  NA's   :157165     NA's   :157165                               
##             Stratification1  StratificationCategory2
##  Overall            :63215   Mode:logical           
##  Black, non-Hispanic:49324   NA's:403984            
##  Hispanic           :49324                          
##  White, non-Hispanic:49324                          
##  Female             :47075                          
##  Male               :47075                          
##  (Other)            :98647                          
##                                     GeoLocation        TopicID      
##  (33.998821303000454, -81.04537120699968) :  7779   DIA    : 67342  
##  (35.466220975000454, -79.15925046299964) :  7779   CVD    : 62678  
##  (37.645970271000465, -84.77497104799966) :  7779   COPD   : 52536  
##  (39.493240390000494, -117.07184056399967):  7779   AST    : 31958  
##  (40.13057004800049, -74.27369128799967)  :  7779   ART    : 31652  
##  (41.6410409880005, -99.36572062299967)   :  7779   OVC    : 30092  
##  (Other)                                  :357310   (Other):127726  
##    QuestionID       DataValueTypeID   StratificationCategoryID1
##  CVD1_4 :  6136   CrdPrev   :150281   GENDER : 94150           
##  ALC6_0 :  6135   AgeAdjPrev:128577   OVERALL: 63215           
##  AST4_1 :  6135   Nmbr      : 32419   RACE   :246619           
##  CKD1_0 :  6135   AgeAdjRate: 31311                            
##  COPD1_1:  6135   CrdRate   : 31311                            
##  COPD1_2:  6135   AgeAdjMean: 10875                            
##  (Other):367173   (Other)   : 19210                            
##  StratificationID1
##  OVR    :63215    
##  BLK    :49324    
##  HIS    :49324    
##  WHT    :49324    
##  GENF   :47075    
##  GENM   :47075    
##  (Other):98647
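
The summary shows StratificationCategory2 as a logical column with 403984 NA's, which is every row, so it is still a candidate for removal from orig4. A minimal sketch of dropping any remaining all-NA columns in one step, using base R on a made-up data frame (`toy` is a hypothetical stand-in for `orig4`):

```r
# Hypothetical toy data standing in for orig4; the values are made up.
toy <- data.frame(
  Stratification1         = c("Overall", "Male", "Female"),
  StratificationCategory2 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE for every column that has at least one non-missing value.
keep <- colSums(!is.na(toy)) > 0

# drop = FALSE keeps the result a data frame even if one column remains.
toy_clean <- toy[, keep, drop = FALSE]
print(names(toy_clean))
```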