Team Members

Mia Siracusa; Javer Wilson; Kleber Perez; Yohannes Getahun Deboch

Introduction

Data science skills project focused on exploratory analysis of the US Chronic Disease Indicators data set.

Objective of the Project

The objective of this project is to practice the soft skills needed to work in a virtual team. During the project we practiced collaborating, sharing knowledge, and solving problems remotely as a team.

Load the libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages ---------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  2.0.1     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.5.3
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

Approach

For this data science skills project, after a detailed discussion, we decided to use the US Chronic Disease Indicators data set proposed by Kleber (one of the team members). Using this data set, we took the following approach:

1. Means of communication: we created a WhatsApp group where we could share documents and hold group calls to discuss how to do the project. To share data sets we used GitHub and RPubs.
2. Divide the work among the team members.
3. Each team member completes their part of the work.
4. Combine everyone’s work into the final product.

Motivation

Our motivation for working as a team is the need to collaborate effectively in a virtual environment. Working in a virtual team is essential because it strengthens one of the basic data science soft skills: effective communication and collaboration. Nowadays people prefer more flexible work schedules and working from home because of the desire to maintain work-life balance. This project has let us practice maintaining that balance while working in a virtual team.

Data information

For this data science skills project we’re using the US Chronic Disease Indicators data set.

Data source

The data was downloaded from the following website in csv format. https://chronicdata.cdc.gov/Chronic-Disease-Indicators/U-S-Chronic-Disease-Indicators-CDI-/g4ie-h725

Data on Google Drive

https://drive.google.com/file/d/14lQCOt5gHB6lk9995hsgB5cxgH8BUMU1/view?usp=sharing

How to load it

We loaded the data using read.csv, treating the empty string as the marker for missing values (na.strings = "").

Load the data

url <- "https://github.com/jonygeta/Data-607-Data-Science-Skills-Project-/raw/master/USChronicDiseaseIndicators.zip"
download.file(url,"disease.zip")
unzip("disease.zip")
getwd()
## [1] "C:/Users/Yohannes/Desktop/DATA 607 PROJECT 3/Data-607-Data-Science-Skills-Project--master"
disease <- read.csv("USChronicDiseaseIndicators/USChronicDiseaseIndicators.csv", na.strings = "")
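As a small illustration of what `na.strings = ""` changes, the sketch below reads a made-up three-row CSV from an inline string. The column names and values are invented for the example and are not taken from the CDC file.

```r
# Toy CSV with empty fields; invented data, not the CDC file.
csv_text <- "state,value,footnote
AL,13,
AK,18.2,*
AZ,,~"

# Default settings: an empty character field is read as "" rather than NA.
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings = "", empty fields become NA in every column.
na_read <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

print(is.na(default_read$footnote))
print(is.na(na_read$footnote))
```

Without the argument, empty text fields would be stored as empty strings and would not be counted by `is.na()` in the missing value checks later on.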

Getting an overview of the data

glimpse(disease)
## Observations: 519,718
## Variables: 34
## $ YearStart                 <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ YearEnd                   <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ LocationAbbr              <fct> US, AL, AK, AZ, AR, CA, CO, CT, DE, ...
## $ LocationDesc              <fct> United States, Alabama, Alaska, Ariz...
## $ DataSource                <fct> BRFSS, BRFSS, BRFSS, BRFSS, BRFSS, B...
## $ Topic                     <fct> Alcohol, Alcohol, Alcohol, Alcohol, ...
## $ Question                  <fct> Binge drinking prevalence among adul...
## $ Response                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DataValueUnit             <fct> %, %, %, %, %, %, %, %, %, %, %, %, ...
## $ DataValueType             <fct> Crude Prevalence, Crude Prevalence, ...
## $ DataValue                 <fct> 16.9, 13, 18.2, 15.6, 15, 16.3, 19, ...
## $ DataValueAlt              <dbl> 16.9, 13.0, 18.2, 15.6, 15.0, 16.3, ...
## $ DataValueFootnoteSymbol   <fct> *, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ DatavalueFootnote         <fct> 50 States + DC: US Median, NA, NA, N...
## $ LowConfidenceLimit        <dbl> 16.0, 11.9, 16.0, 14.3, 13.0, 15.4, ...
## $ HighConfidenceLimit       <dbl> 18.0, 14.1, 20.6, 16.9, 17.2, 17.2, ...
## $ StratificationCategory1   <fct> Overall, Overall, Overall, Overall, ...
## $ Stratification1           <fct> Overall, Overall, Overall, Overall, ...
## $ StratificationCategory2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Stratification2           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StratificationCategory3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Stratification3           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ GeoLocation               <fct> NA, "(32.84057112200048, -86.6318607...
## $ ResponseID                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LocationID                <int> 59, 1, 2, 4, 5, 6, 8, 9, 10, 11, 12,...
## $ TopicID                   <fct> ALC, ALC, ALC, ALC, ALC, ALC, ALC, A...
## $ QuestionID                <fct> ALC2_2, ALC2_2, ALC2_2, ALC2_2, ALC2...
## $ DataValueTypeID           <fct> CRDPREV, CRDPREV, CRDPREV, CRDPREV, ...
## $ StratificationCategoryID1 <fct> OVERALL, OVERALL, OVERALL, OVERALL, ...
## $ StratificationID1         <fct> OVR, OVR, OVR, OVR, OVR, OVR, OVR, O...
## $ StratificationCategoryID2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StratificationID2         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StratificationCategoryID3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StratificationID3         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

This data set has 519,718 observations and 34 variables.

head(disease)
##   YearStart YearEnd LocationAbbr  LocationDesc DataSource   Topic
## 1      2016    2016           US United States      BRFSS Alcohol
## 2      2016    2016           AL       Alabama      BRFSS Alcohol
## 3      2016    2016           AK        Alaska      BRFSS Alcohol
## 4      2016    2016           AZ       Arizona      BRFSS Alcohol
## 5      2016    2016           AR      Arkansas      BRFSS Alcohol
## 6      2016    2016           CA    California      BRFSS Alcohol
##                                                  Question Response
## 1 Binge drinking prevalence among adults aged >= 18 years       NA
## 2 Binge drinking prevalence among adults aged >= 18 years       NA
## 3 Binge drinking prevalence among adults aged >= 18 years       NA
## 4 Binge drinking prevalence among adults aged >= 18 years       NA
## 5 Binge drinking prevalence among adults aged >= 18 years       NA
## 6 Binge drinking prevalence among adults aged >= 18 years       NA
##   DataValueUnit    DataValueType DataValue DataValueAlt
## 1             % Crude Prevalence      16.9         16.9
## 2             % Crude Prevalence        13         13.0
## 3             % Crude Prevalence      18.2         18.2
## 4             % Crude Prevalence      15.6         15.6
## 5             % Crude Prevalence        15         15.0
## 6             % Crude Prevalence      16.3         16.3
##   DataValueFootnoteSymbol         DatavalueFootnote LowConfidenceLimit
## 1                       * 50 States + DC: US Median               16.0
## 2                    <NA>                      <NA>               11.9
## 3                    <NA>                      <NA>               16.0
## 4                    <NA>                      <NA>               14.3
## 5                    <NA>                      <NA>               13.0
## 6                    <NA>                      <NA>               15.4
##   HighConfidenceLimit StratificationCategory1 Stratification1
## 1                18.0                 Overall         Overall
## 2                14.1                 Overall         Overall
## 3                20.6                 Overall         Overall
## 4                16.9                 Overall         Overall
## 5                17.2                 Overall         Overall
## 6                17.2                 Overall         Overall
##   StratificationCategory2 Stratification2 StratificationCategory3
## 1                      NA              NA                      NA
## 2                      NA              NA                      NA
## 3                      NA              NA                      NA
## 4                      NA              NA                      NA
## 5                      NA              NA                      NA
## 6                      NA              NA                      NA
##   Stratification3                               GeoLocation ResponseID
## 1              NA                                      <NA>         NA
## 2              NA   (32.84057112200048, -86.63186076199969)         NA
## 3              NA  (64.84507995700051, -147.72205903599973)         NA
## 4              NA (34.865970280000454, -111.76381127699972)         NA
## 5              NA   (34.74865012400045, -92.27449074299966)         NA
## 6              NA  (37.63864012300047, -120.99999953799971)         NA
##   LocationID TopicID QuestionID DataValueTypeID StratificationCategoryID1
## 1         59     ALC     ALC2_2         CRDPREV                   OVERALL
## 2          1     ALC     ALC2_2         CRDPREV                   OVERALL
## 3          2     ALC     ALC2_2         CRDPREV                   OVERALL
## 4          4     ALC     ALC2_2         CRDPREV                   OVERALL
## 5          5     ALC     ALC2_2         CRDPREV                   OVERALL
## 6          6     ALC     ALC2_2         CRDPREV                   OVERALL
##   StratificationID1 StratificationCategoryID2 StratificationID2
## 1               OVR                        NA                NA
## 2               OVR                        NA                NA
## 3               OVR                        NA                NA
## 4               OVR                        NA                NA
## 5               OVR                        NA                NA
## 6               OVR                        NA                NA
##   StratificationCategoryID3 StratificationID3
## 1                        NA                NA
## 2                        NA                NA
## 3                        NA                NA
## 4                        NA                NA
## 5                        NA                NA
## 6                        NA                NA
tail(disease)
##        YearStart YearEnd LocationAbbr         LocationDesc
## 519713      2015    2015           DC District of Columbia
## 519714      2015    2015           FL              Florida
## 519715      2015    2015           HI               Hawaii
## 519716      2015    2015           VI       Virgin Islands
## 519717      2015    2015           VT              Vermont
## 519718      2013    2013           NM           New Mexico
##                  DataSource      Topic
## 519713                YRBSS    Tobacco
## 519714                YRBSS    Tobacco
## 519715                YRBSS    Tobacco
## 519716                YRBSS    Tobacco
## 519717                YRBSS    Tobacco
## 519718 ACS 1-Year Estimates Disability
##                                         Question Response DataValueUnit
## 519713 Current smokeless tobacco use among youth       NA             %
## 519714 Current smokeless tobacco use among youth       NA             %
## 519715 Current smokeless tobacco use among youth       NA             %
## 519716 Current smokeless tobacco use among youth       NA             %
## 519717 Current smokeless tobacco use among youth       NA             %
## 519718  Disability among adults aged >= 65 years       NA             %
##           DataValueType DataValue DataValueAlt DataValueFootnoteSymbol
## 519713 Crude Prevalence      <NA>           NA                       -
## 519714 Crude Prevalence      <NA>           NA                       -
## 519715 Crude Prevalence      <NA>           NA                       -
## 519716 Crude Prevalence      <NA>           NA                       -
## 519717 Crude Prevalence      <NA>           NA                       -
## 519718 Crude Prevalence      <NA>           NA                       ~
##                                             DatavalueFootnote
## 519713                                      No data available
## 519714                                      No data available
## 519715                                      No data available
## 519716                                      No data available
## 519717                                      No data available
## 519718 Data not shown because of too few respondents or cases
##        LowConfidenceLimit HighConfidenceLimit StratificationCategory1
## 519713                 NA                  NA                 Overall
## 519714                 NA                  NA                 Overall
## 519715                 NA                  NA                 Overall
## 519716                 NA                  NA                 Overall
## 519717                 NA                  NA                 Overall
## 519718                 NA                  NA          Race/Ethnicity
##            Stratification1 StratificationCategory2 Stratification2
## 519713             Overall                      NA              NA
## 519714             Overall                      NA              NA
## 519715             Overall                      NA              NA
## 519716             Overall                      NA              NA
## 519717             Overall                      NA              NA
## 519718 Black, non-Hispanic                      NA              NA
##        StratificationCategory3 Stratification3
## 519713                      NA              NA
## 519714                      NA              NA
## 519715                      NA              NA
## 519716                      NA              NA
## 519717                      NA              NA
## 519718                      NA              NA
##                                      GeoLocation ResponseID LocationID
## 519713                   (38.907192, -77.036871)         NA         11
## 519714  (28.932040377000476, -81.92896053899966)         NA         12
## 519715 (21.304850435000446, -157.85774940299973)         NA         15
## 519716                   (18.335765, -64.896335)         NA         78
## 519717   (43.62538123900049, -72.51764079099962)         NA         50
## 519718  (34.52088095200048, -106.24058098499967)         NA         35
##        TopicID QuestionID DataValueTypeID StratificationCategoryID1
## 519713     TOB     TOB2_1         CRDPREV                   OVERALL
## 519714     TOB     TOB2_1         CRDPREV                   OVERALL
## 519715     TOB     TOB2_1         CRDPREV                   OVERALL
## 519716     TOB     TOB2_1         CRDPREV                   OVERALL
## 519717     TOB     TOB2_1         CRDPREV                   OVERALL
## 519718     DIS     DIS1_0         CRDPREV                      RACE
##        StratificationID1 StratificationCategoryID2 StratificationID2
## 519713               OVR                        NA                NA
## 519714               OVR                        NA                NA
## 519715               OVR                        NA                NA
## 519716               OVR                        NA                NA
## 519717               OVR                        NA                NA
## 519718               BLK                        NA                NA
##        StratificationCategoryID3 StratificationID3
## 519713                        NA                NA
## 519714                        NA                NA
## 519715                        NA                NA
## 519716                        NA                NA
## 519717                        NA                NA
## 519718                        NA                NA

Table of summary Statistics

summary(disease)
##    YearStart       YearEnd      LocationAbbr      LocationDesc   
##  Min.   :2001   Min.   :2001   AZ     :  9923   Arizona :  9923  
##  1st Qu.:2012   1st Qu.:2012   FL     :  9923   Florida :  9923  
##  Median :2013   Median :2013   IA     :  9923   Iowa    :  9923  
##  Mean   :2013   Mean   :2013   KY     :  9923   Kentucky:  9923  
##  3rd Qu.:2015   3rd Qu.:2015   NC     :  9923   Nebraska:  9923  
##  Max.   :2016   Max.   :2016   NE     :  9923   Nevada  :  9923  
##                                (Other):460180   (Other) :460180  
##                   DataSource    
##  BRFSS                 :364425  
##  NVSS                  : 79755  
##  CMS Part A Claims Data: 29952  
##  State Inpatient Data  : 18423  
##  ACS 1-Year Estimates  :  7403  
##  SEDD; SID             :  6924  
##  (Other)               : 12836  
##                                    Topic       
##  Diabetes                             : 79631  
##  Chronic Obstructive Pulmonary Disease: 78729  
##  Cardiovascular Disease               : 75787  
##  Arthritis                            : 41765  
##  Overarching Conditions               : 39362  
##  Asthma                               : 39261  
##  (Other)                              :165183  
##                                                                                                                                  Question     
##  Hospitalization for chronic obstructive pulmonary disease as any diagnosis among Medicare-eligible persons aged >= 65 years         :  7488  
##  Hospitalization for chronic obstructive pulmonary disease as first-listed diagnosis among Medicare-eligible persons aged >= 65 years:  7488  
##  Hospitalization for heart failure among Medicare-eligible persons aged >= 65 years                                                  :  7488  
##  Hospitalization for hip fracture among Medicare-eligible persons aged >= 65 years                                                   :  7488  
##  Asthma mortality rate                                                                                                               :  6135  
##  Chronic liver disease mortality                                                                                                     :  6135  
##  (Other)                                                                                                                             :477496  
##  Response                 DataValueUnit   
##  Mode:logical   %                :349869  
##  NA's:519718    cases per 100,000: 49080  
##                 Number           : 28930  
##                 cases per 1,000  : 19968  
##                 cases per 10,000 : 16898  
##                 (Other)          : 11301  
##                 NA's             : 43672  
##                  DataValueType      DataValue       DataValueAlt      
##  Crude Prevalence       :191522          : 23091   Min.   :      0.0  
##  Age-adjusted Prevalence:156810   1      :  1005   1st Qu.:     18.5  
##  Number                 : 46125   3.6    :   829   Median :     41.0  
##  Age-adjusted Rate      : 45018   3.8    :   816   Mean   :    891.8  
##  Crude Rate             : 45018   3.7    :   813   3rd Qu.:     70.3  
##  Mean                   : 13160   (Other):348010   Max.   :2600878.0  
##  (Other)                : 22065   NA's   :145154   NA's   :169383     
##  DataValueFootnoteSymbol
##  ****   : 98370         
##         : 56098         
##  -      : 39252         
##  ~      : 30532         
##  *      :  2062         
##  (Other):  1004         
##  NA's   :292400         
##                                                                                                                        DatavalueFootnote 
##  Sample size of denominator and/or age group for age-standardization is less than 50 or relative standard error is more than 30%: 98370  
##                                                                                                                                 : 55932  
##  No data available                                                                                                              : 39252  
##  Data not shown because of too few respondents or cases                                                                         : 30532  
##  50 States + DC: US Median                                                                                                      :  2062  
##  (Other)                                                                                                                        :  1004  
##  NA's                                                                                                                           :292566  
##  LowConfidenceLimit HighConfidenceLimit   StratificationCategory1
##  Min.   :   0.20    Min.   :   0.42     Gender        :121660    
##  1st Qu.:  12.70    1st Qu.:  18.90     Overall       : 77888    
##  Median :  30.20    Median :  43.80     Race/Ethnicity:320170    
##  Mean   :  46.76    Mean   :  58.99                              
##  3rd Qu.:  55.40    3rd Qu.:  70.40                              
##  Max.   :1330.66    Max.   :2088.00                              
##  NA's   :208656     NA's   :208656                               
##             Stratification1   StratificationCategory2 Stratification2
##  Overall            : 77888   Mode:logical            Mode:logical   
##  Black, non-Hispanic: 64034   NA's:519718             NA's:519718    
##  Hispanic           : 64034                                          
##  White, non-Hispanic: 64034                                          
##  Female             : 60830                                          
##  Male               : 60830                                          
##  (Other)            :128068                                          
##  StratificationCategory3 Stratification3
##  Mode:logical            Mode:logical   
##  NA's:519718             NA's:519718    
##                                         
##                                         
##                                         
##                                         
##                                         
##                                     GeoLocation     ResponseID    
##  (28.932040377000476, -81.92896053899966) :  9923   Mode:logical  
##  (33.998821303000454, -81.04537120699968) :  9923   NA's:519718   
##  (34.865970280000454, -111.76381127699972):  9923                 
##  (35.466220975000454, -79.15925046299964) :  9923                 
##  (37.645970271000465, -84.77497104799966) :  9923                 
##  (Other)                                  :466500                 
##  NA's                                     :  3603                 
##    LocationID       TopicID         QuestionID       DataValueTypeID  
##  Min.   : 1.00   DIA    : 79631   COPD5_3:  7488   CRDPREV   :191522  
##  1st Qu.:17.00   COPD   : 78729   COPD5_4:  7488   AGEADJPREV:156810  
##  Median :30.00   CVD    : 75787   CVD2_0 :  7488   NMBR      : 46125  
##  Mean   :30.99   ART    : 41765   OLD1_0 :  7488   AGEADJRATE: 45018  
##  3rd Qu.:45.00   OVC    : 39362   ALC6_0 :  6135   CRDRATE   : 45018  
##  Max.   :78.00   AST    : 39261   AST4_1 :  6135   MEAN      : 13160  
##                  (Other):165183   (Other):477496   (Other)   : 22065  
##  StratificationCategoryID1 StratificationID1 StratificationCategoryID2
##  GENDER :121660            OVR    : 77888    Mode:logical             
##  OVERALL: 77888            BLK    : 64034    NA's:519718              
##  RACE   :320170            HIS    : 64034                             
##                            WHT    : 64034                             
##                            GENF   : 60830                             
##                            GENM   : 60830                             
##                            (Other):128068                             
##  StratificationID2 StratificationCategoryID3 StratificationID3
##  Mode:logical      Mode:logical              Mode:logical     
##  NA's:519718       NA's:519718               NA's:519718      
##                                                               
##                                                               
##                                                               
##                                                               
## 

Visualization of the missing value pattern

We can visualize the missing value pattern in the data set. Red areas of the following graph indicate missing values.

missmap(disease)

From the summary statistics and the missing value plot we can see that 37% of the values in the data set are missing, mostly in the following variables:

names(which(colMeans(is.na(disease)) > 0.1))
##  [1] "Response"                  "DataValue"                
##  [3] "DataValueAlt"              "DataValueFootnoteSymbol"  
##  [5] "DatavalueFootnote"         "LowConfidenceLimit"       
##  [7] "HighConfidenceLimit"       "StratificationCategory2"  
##  [9] "Stratification2"           "StratificationCategory3"  
## [11] "Stratification3"           "ResponseID"               
## [13] "StratificationCategoryID2" "StratificationID2"        
## [15] "StratificationCategoryID3" "StratificationID3"
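The overall and per-column missing shares can both be computed directly from `is.na()`. The following is a minimal sketch on a toy data frame with invented values, using the same 10% threshold as the text.

```r
# Toy data frame; values are invented for illustration.
toy <- data.frame(
  year     = c(2015, 2016, 2016, 2016),
  value    = c(16.9, NA, 18.2, NA),
  response = c(NA, NA, NA, NA),
  state    = c("AL", "AK", NA, "AZ"),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix; the mean of a
# logical vector or matrix is the proportion of TRUE values.
overall_missing <- mean(is.na(toy))

# Share of missing values in each column.
col_missing <- colMeans(is.na(toy))

# Names of the columns above the 10% threshold.
high_missing <- names(col_missing)[col_missing > 0.1]

print(overall_missing)
print(col_missing)
print(high_missing)
```

Applied to the per-column counts reported for this data set, the same calculation gives 6,561,270 missing cells out of 519,718 × 34 = 17,670,412, or about 37%.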

Count the number of missing values in each column

lapply(disease, function(x){sum(is.na(x))})
## $YearStart
## [1] 0
## 
## $YearEnd
## [1] 0
## 
## $LocationAbbr
## [1] 0
## 
## $LocationDesc
## [1] 0
## 
## $DataSource
## [1] 0
## 
## $Topic
## [1] 0
## 
## $Question
## [1] 0
## 
## $Response
## [1] 519718
## 
## $DataValueUnit
## [1] 43672
## 
## $DataValueType
## [1] 0
## 
## $DataValue
## [1] 145154
## 
## $DataValueAlt
## [1] 169383
## 
## $DataValueFootnoteSymbol
## [1] 292400
## 
## $DatavalueFootnote
## [1] 292566
## 
## $LowConfidenceLimit
## [1] 208656
## 
## $HighConfidenceLimit
## [1] 208656
## 
## $StratificationCategory1
## [1] 0
## 
## $Stratification1
## [1] 0
## 
## $StratificationCategory2
## [1] 519718
## 
## $Stratification2
## [1] 519718
## 
## $StratificationCategory3
## [1] 519718
## 
## $Stratification3
## [1] 519718
## 
## $GeoLocation
## [1] 3603
## 
## $ResponseID
## [1] 519718
## 
## $LocationID
## [1] 0
## 
## $TopicID
## [1] 0
## 
## $QuestionID
## [1] 0
## 
## $DataValueTypeID
## [1] 0
## 
## $StratificationCategoryID1
## [1] 0
## 
## $StratificationID1
## [1] 0
## 
## $StratificationCategoryID2
## [1] 519718
## 
## $StratificationID2
## [1] 519718
## 
## $StratificationCategoryID3
## [1] 519718
## 
## $StratificationID3
## [1] 519718
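The list output above is long because `lapply()` returns one list element per column. A more compact alternative, sketched here on a toy data frame with invented values, is `colSums(is.na(...))`, which returns a single named vector that can be filtered and sorted.

```r
# Toy data frame; values are invented for illustration.
toy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "x"),
  c = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# One NA count per column, as a named numeric vector.
na_counts <- colSums(is.na(toy))

# Keep only columns with at least one NA, largest count first.
na_counts_sorted <- sort(na_counts[na_counts > 0], decreasing = TRUE)

print(na_counts)
print(na_counts_sorted)
```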

Tidying and Transformation

Drop columns in which more than 10% of the values are missing

# Keep only the columns where at most 10% of the values are missing
disease_no_miss <- disease[, colMeans(is.na(disease)) <= 0.1]

Dimensions after dropping the columns with many missing values

dim(disease_no_miss)
## [1] 519718     18

After dropping the columns with more than 10% missing values, only 18 variables are left.
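When dropping columns by missingness, selecting with a logical mask is safer than negating a `which()` index, because the two behave differently when no column exceeds the threshold. The sketch below shows the difference on a toy data frame (invented values) that has no missing data at all.

```r
# Toy data frame with no missing values; invented for illustration.
toy <- data.frame(id = 1:3, score = c(0.5, 0.7, 0.9), label = c("a", "b", "c"))

# No column exceeds the threshold, so which() returns an empty index.
too_missing <- which(colMeans(is.na(toy)) > 0.1)

# Negating an empty index still selects nothing: every column is dropped.
dropped_by_which <- toy[, -too_missing, drop = FALSE]

# A logical mask keeps every column that passes the test.
kept_by_mask <- toy[, colMeans(is.na(toy)) <= 0.1, drop = FALSE]

print(ncol(dropped_by_which))
print(ncol(kept_by_mask))
```

On the chronic disease data both forms give the same 18 columns, since 16 columns exceed the threshold; the difference only appears in the empty case.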

Get a glimpse of the data.

glimpse(disease_no_miss)
## Observations: 519,718
## Variables: 18
## $ YearStart                 <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ YearEnd                   <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ LocationAbbr              <fct> US, AL, AK, AZ, AR, CA, CO, CT, DE, ...
## $ LocationDesc              <fct> United States, Alabama, Alaska, Ariz...
## $ DataSource                <fct> BRFSS, BRFSS, BRFSS, BRFSS, BRFSS, B...
## $ Topic                     <fct> Alcohol, Alcohol, Alcohol, Alcohol, ...
## $ Question                  <fct> Binge drinking prevalence among adul...
## $ DataValueUnit             <fct> %, %, %, %, %, %, %, %, %, %, %, %, ...
## $ DataValueType             <fct> Crude Prevalence, Crude Prevalence, ...
## $ StratificationCategory1   <fct> Overall, Overall, Overall, Overall, ...
## $ Stratification1           <fct> Overall, Overall, Overall, Overall, ...
## $ GeoLocation               <fct> NA, "(32.84057112200048, -86.6318607...
## $ LocationID                <int> 59, 1, 2, 4, 5, 6, 8, 9, 10, 11, 12,...
## $ TopicID                   <fct> ALC, ALC, ALC, ALC, ALC, ALC, ALC, A...
## $ QuestionID                <fct> ALC2_2, ALC2_2, ALC2_2, ALC2_2, ALC2...
## $ DataValueTypeID           <fct> CRDPREV, CRDPREV, CRDPREV, CRDPREV, ...
## $ StratificationCategoryID1 <fct> OVERALL, OVERALL, OVERALL, OVERALL, ...
## $ StratificationID1         <fct> OVR, OVR, OVR, OVR, OVR, OVR, OVR, O...

Exploratory Data Analysis

Top 6 locations by number of records

disease_no_miss %>%
    count(LocationDesc)%>%
    arrange(desc(n)) %>%
    head()
## # A tibble: 6 x 2
##   LocationDesc     n
##   <fct>        <int>
## 1 Arizona       9923
## 2 Florida       9923
## 3 Iowa          9923
## 4 Kentucky      9923
## 5 Nebraska      9923
## 6 Nevada        9923

Bottom 6 locations by number of records

disease_no_miss %>%
    count(LocationDesc)%>%
    arrange(n) %>%
    head()
## # A tibble: 6 x 2
##   LocationDesc       n
##   <fct>          <int>
## 1 United States   3603
## 2 Guam            7139
## 3 Virgin Islands  7187
## 4 Puerto Rico     7305
## 5 Alabama         9530
## 6 Alaska          9530
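The same top and bottom rankings can be produced without extra packages. This is a base R sketch on an invented vector of location names, offered as an equivalent of the `count()` and `arrange()` pipeline rather than a replacement for it; the column names `location` and `n` are chosen for the example.

```r
# Invented location labels, one per record.
locations <- c("Arizona", "Guam", "Arizona", "Florida", "Arizona", "Florida")

# Tabulate, then convert to a data frame with one row per location.
counts <- as.data.frame(table(locations), stringsAsFactors = FALSE)
names(counts) <- c("location", "n")

# Sort by count in each direction and keep the first two rows.
most_records   <- head(counts[order(-counts$n), ], 2)
fewest_records <- head(counts[order(counts$n), ], 2)

print(most_records)
print(fewest_records)
```

Note that these are counts of rows per location, so they describe how many records each location contributes to the data set.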

Top 6 data sources by number of records

disease_no_miss %>%
    count(DataSource)%>%
    arrange(desc(n)) %>%
    head()
## # A tibble: 6 x 2
##   DataSource                  n
##   <fct>                   <int>
## 1 BRFSS                  364425
## 2 NVSS                    79755
## 3 CMS Part A Claims Data  29952
## 4 State Inpatient Data    18423
## 5 ACS 1-Year Estimates     7403
## 6 SEDD; SID                6924

Bottom 6 data sources by number of records

disease_no_miss %>%
    count(DataSource)%>%
    arrange(n) %>%
    head()
## # A tibble: 6 x 2
##   DataSource                    n
##   <fct>                     <int>
## 1 Birth Certificate, NVSS      52
## 2 Current Population Survey    55
## 3 NVSS, Mortality             104
## 4 AEDS                        110
## 5 ANRF                        110
## 6 InfoUSA; USDA               110

Bar plot of Chronic Diseases

barplot(table(disease_no_miss$Topic), main = "Distribution of Chronic Disease Topics")

The bar plot shows that the records are concentrated in a few topics. The six most frequent topics are:

disease_no_miss %>%
    count(Topic)%>%
    arrange(desc(n)) %>%
    head()
## # A tibble: 6 x 2
##   Topic                                     n
##   <fct>                                 <int>
## 1 Diabetes                              79631
## 2 Chronic Obstructive Pulmonary Disease 78729
## 3 Cardiovascular Disease                75787
## 4 Arthritis                             41765
## 5 Overarching Conditions                39362
## 6 Asthma                                39261
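Bar plots of long category names, such as the topics here, are easier to read when the bars are sorted and the labels are rotated. The following is a minimal sketch using base graphics on invented topic labels; the margin and label sizes are illustrative choices, not settings from the original analysis.

```r
# Invented topic labels, one per record.
topics <- c("Diabetes", "Asthma", "Diabetes", "Arthritis", "Diabetes", "Asthma")

# Sort the frequency table so the tallest bar comes first.
topic_counts <- sort(table(topics), decreasing = TRUE)

# Widen the bottom margin so rotated labels fit, then restore it.
op <- par(mar = c(8, 4, 2, 1))
barplot(topic_counts,
        las = 2,          # rotate axis labels to be perpendicular to the axis
        cex.names = 0.8,  # shrink category labels slightly
        main = "Records per topic",
        ylab = "Number of records")
par(op)
```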

Visualization of data value type

barplot(table(disease_no_miss$DataValueType))

The most frequent data value types are the following:

disease_no_miss %>%
    count(DataValueType)%>%
    arrange(desc(n)) %>%
    head()
## # A tibble: 6 x 2
##   DataValueType                n
##   <fct>                    <int>
## 1 Crude Prevalence        191522
## 2 Age-adjusted Prevalence 156810
## 3 Number                   46125
## 4 Age-adjusted Rate        45018
## 5 Crude Rate               45018
## 6 Mean                     13160

Stratification Category

barplot(table(disease_no_miss$StratificationCategory1))

Data Value Type Plot

barplot(table(disease_no_miss$DataValueTypeID))

Findings

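Real-world data sets are often messy and have lots of missing values. From the exploratory analysis we found that Diabetes, Chronic Obstructive Pulmonary Disease, Cardiovascular Disease, and Arthritis are the topics with the most records in the data set. From the overall analysis we learned that chronic diseases are a major concern in the USA, although the data set is missing a good deal of necessary information. Arizona, Florida, and Iowa have the most records, while Guam and the Virgin Islands have the fewest among individual locations (the national "United States" aggregate has fewer still). Because these figures are counts of rows rather than prevalence values, they show how much was reported for each topic and location, not how common each disease is.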