Mia Siracusa; Javern; Kleber Perez; Yohannes Getahun Deboch
Data Science Skills Project: Exploratory Analysis of the US Chronic Disease Indicators Data Set
The objective of this project is to practice the soft skills needed to work in a virtual team. During the project we practiced collaborating, sharing knowledge, and solving problems remotely as a team.
Load the libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 2.0.1 v dplyr 0.7.8
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.3.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.5.3
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
For this data science skills project, after a detailed discussion, we decided to use the US Chronic Disease Indicators data set proposed by Kleber (one of the team members). Using this data set, we took the following approach:
1. Means of communication: we created a WhatsApp group where we could share documents and hold group calls to discuss how to do the project. To share data sets and results we used GitHub and RPubs.
2. Divide the work among the team members.
3. Each member completes their part of the work.
4. Combine everyone’s work into the final product.
Our motivation for working as a team is the need to collaborate effectively in a virtual environment. Working in a virtual team is essential because it strengthens one of the basic data science soft skills: effective communication and collaboration. Nowadays people prefer more flexible work schedules and working from home, out of a desire to maintain work-life balance. This project let us practice maintaining that balance while working in a virtual team.
For this data science skills project we’re using the US Chronic Disease Indicators data set.
The data was downloaded from the following website in csv format. https://chronicdata.cdc.gov/Chronic-Disease-Indicators/U-S-Chronic-Disease-Indicators-CDI-/g4ie-h725
https://drive.google.com/file/d/14lQCOt5gHB6lk9995hsgB5cxgH8BUMU1/view?usp=sharing
We loaded the data using read.csv, with the empty string as the identifier for missing values.
Load the data
url <- "https://doc-04-28-docs.googleusercontent.com/docs/securesc/4gka1akk08hvpslffei635a9a065i68g/li4rhokegsnad94d1cg1n01qjkvbqgpc/1553011200000/07658601726910950510/07658601726910950510/14lQCOt5gHB6lk9995hsgB5cxgH8BUMU1?e=download&nonce=dj7n4jm3ie44c&user=07658601726910950510&hash=orscqeqh98gop1iet5s6gqlov65qn8o1"
disease <- read.csv("USChronicDiseaseIndicators.csv", na.strings = "")
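The effect of na.strings = "" can be seen on a small inline CSV. This is a minimal sketch, not the project data: the three rows and the names csv_text, default_read and blank_as_na are made up for illustration. By default read.csv already treats blank fields in numeric columns as missing, but keeps blank text fields as empty strings; na.strings = "" turns those into NA as well.

```r
# Toy CSV (made-up rows): one blank text field and one blank numeric field.
csv_text <- "LocationAbbr,DataValueUnit,DataValueAlt
US,%,16.9
AL,,13.0
AK,%,"

# Default: blank numeric fields become NA, blank text fields stay "".
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# As in the report: empty strings are treated as missing everywhere.
blank_as_na <- read.csv(text = csv_text, na.strings = "",
                        stringsAsFactors = FALSE)

# Missing values per column under each setting.
print(colSums(is.na(default_read)))
print(colSums(is.na(blank_as_na)))
```

Without na.strings = "", blank text fields would not be counted by is.na(), so the missing value summaries later in the report would understate the gaps in the text columns.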
Getting an overview of the data
glimpse(disease)
## Observations: 519,718
## Variables: 34
## $ YearStart <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ YearEnd <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ LocationAbbr <fct> US, AL, AK, AZ, AR, CA, CO, CT, DE, ...
## $ LocationDesc <fct> United States, Alabama, Alaska, Ariz...
## $ DataSource <fct> BRFSS, BRFSS, BRFSS, BRFSS, BRFSS, B...
## $ Topic <fct> Alcohol, Alcohol, Alcohol, Alcohol, ...
## $ Question <fct> Binge drinking prevalence among adul...
## $ Response <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DataValueUnit <fct> %, %, %, %, %, %, %, %, %, %, %, %, ...
## $ DataValueType <fct> Crude Prevalence, Crude Prevalence, ...
## $ DataValue <fct> 16.9, 13, 18.2, 15.6, 15, 16.3, 19, ...
## $ DataValueAlt <dbl> 16.9, 13.0, 18.2, 15.6, 15.0, 16.3, ...
## $ DataValueFootnoteSymbol <fct> *, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ DatavalueFootnote <fct> 50 States + DC: US Median, NA, NA, N...
## $ LowConfidenceLimit <dbl> 16.0, 11.9, 16.0, 14.3, 13.0, 15.4, ...
## $ HighConfidenceLimit <dbl> 18.0, 14.1, 20.6, 16.9, 17.2, 17.2, ...
## $ StratificationCategory1 <fct> Overall, Overall, Overall, Overall, ...
## $ Stratification1 <fct> Overall, Overall, Overall, Overall, ...
## $ StratificationCategory2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Stratification2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StratificationCategory3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Stratification3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ GeoLocation <fct> NA, "(32.84057112200048, -86.6318607...
## $ ResponseID <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LocationID <int> 59, 1, 2, 4, 5, 6, 8, 9, 10, 11, 12,...
## $ TopicID <fct> ALC, ALC, ALC, ALC, ALC, ALC, ALC, A...
## $ QuestionID <fct> ALC2_2, ALC2_2, ALC2_2, ALC2_2, ALC2...
## $ DataValueTypeID <fct> CRDPREV, CRDPREV, CRDPREV, CRDPREV, ...
## $ StratificationCategoryID1 <fct> OVERALL, OVERALL, OVERALL, OVERALL, ...
## $ StratificationID1 <fct> OVR, OVR, OVR, OVR, OVR, OVR, OVR, O...
## $ StratificationCategoryID2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StratificationID2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StratificationCategoryID3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StratificationID3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
This data set has 519,718 observations and 34 variables.
head(disease)
## YearStart YearEnd LocationAbbr LocationDesc DataSource Topic
## 1 2016 2016 US United States BRFSS Alcohol
## 2 2016 2016 AL Alabama BRFSS Alcohol
## 3 2016 2016 AK Alaska BRFSS Alcohol
## 4 2016 2016 AZ Arizona BRFSS Alcohol
## 5 2016 2016 AR Arkansas BRFSS Alcohol
## 6 2016 2016 CA California BRFSS Alcohol
## Question Response
## 1 Binge drinking prevalence among adults aged >= 18 years NA
## 2 Binge drinking prevalence among adults aged >= 18 years NA
## 3 Binge drinking prevalence among adults aged >= 18 years NA
## 4 Binge drinking prevalence among adults aged >= 18 years NA
## 5 Binge drinking prevalence among adults aged >= 18 years NA
## 6 Binge drinking prevalence among adults aged >= 18 years NA
## DataValueUnit DataValueType DataValue DataValueAlt
## 1 % Crude Prevalence 16.9 16.9
## 2 % Crude Prevalence 13 13.0
## 3 % Crude Prevalence 18.2 18.2
## 4 % Crude Prevalence 15.6 15.6
## 5 % Crude Prevalence 15 15.0
## 6 % Crude Prevalence 16.3 16.3
## DataValueFootnoteSymbol DatavalueFootnote LowConfidenceLimit
## 1 * 50 States + DC: US Median 16.0
## 2 <NA> <NA> 11.9
## 3 <NA> <NA> 16.0
## 4 <NA> <NA> 14.3
## 5 <NA> <NA> 13.0
## 6 <NA> <NA> 15.4
## HighConfidenceLimit StratificationCategory1 Stratification1
## 1 18.0 Overall Overall
## 2 14.1 Overall Overall
## 3 20.6 Overall Overall
## 4 16.9 Overall Overall
## 5 17.2 Overall Overall
## 6 17.2 Overall Overall
## StratificationCategory2 Stratification2 StratificationCategory3
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Stratification3 GeoLocation ResponseID
## 1 NA <NA> NA
## 2 NA (32.84057112200048, -86.63186076199969) NA
## 3 NA (64.84507995700051, -147.72205903599973) NA
## 4 NA (34.865970280000454, -111.76381127699972) NA
## 5 NA (34.74865012400045, -92.27449074299966) NA
## 6 NA (37.63864012300047, -120.99999953799971) NA
## LocationID TopicID QuestionID DataValueTypeID StratificationCategoryID1
## 1 59 ALC ALC2_2 CRDPREV OVERALL
## 2 1 ALC ALC2_2 CRDPREV OVERALL
## 3 2 ALC ALC2_2 CRDPREV OVERALL
## 4 4 ALC ALC2_2 CRDPREV OVERALL
## 5 5 ALC ALC2_2 CRDPREV OVERALL
## 6 6 ALC ALC2_2 CRDPREV OVERALL
## StratificationID1 StratificationCategoryID2 StratificationID2
## 1 OVR NA NA
## 2 OVR NA NA
## 3 OVR NA NA
## 4 OVR NA NA
## 5 OVR NA NA
## 6 OVR NA NA
## StratificationCategoryID3 StratificationID3
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
tail(disease)
## YearStart YearEnd LocationAbbr LocationDesc
## 519713 2015 2015 DC District of Columbia
## 519714 2015 2015 FL Florida
## 519715 2015 2015 HI Hawaii
## 519716 2015 2015 VI Virgin Islands
## 519717 2015 2015 VT Vermont
## 519718 2013 2013 NM New Mexico
## DataSource Topic
## 519713 YRBSS Tobacco
## 519714 YRBSS Tobacco
## 519715 YRBSS Tobacco
## 519716 YRBSS Tobacco
## 519717 YRBSS Tobacco
## 519718 ACS 1-Year Estimates Disability
## Question Response DataValueUnit
## 519713 Current smokeless tobacco use among youth NA %
## 519714 Current smokeless tobacco use among youth NA %
## 519715 Current smokeless tobacco use among youth NA %
## 519716 Current smokeless tobacco use among youth NA %
## 519717 Current smokeless tobacco use among youth NA %
## 519718 Disability among adults aged >= 65 years NA %
## DataValueType DataValue DataValueAlt DataValueFootnoteSymbol
## 519713 Crude Prevalence <NA> NA -
## 519714 Crude Prevalence <NA> NA -
## 519715 Crude Prevalence <NA> NA -
## 519716 Crude Prevalence <NA> NA -
## 519717 Crude Prevalence <NA> NA -
## 519718 Crude Prevalence <NA> NA ~
## DatavalueFootnote
## 519713 No data available
## 519714 No data available
## 519715 No data available
## 519716 No data available
## 519717 No data available
## 519718 Data not shown because of too few respondents or cases
## LowConfidenceLimit HighConfidenceLimit StratificationCategory1
## 519713 NA NA Overall
## 519714 NA NA Overall
## 519715 NA NA Overall
## 519716 NA NA Overall
## 519717 NA NA Overall
## 519718 NA NA Race/Ethnicity
## Stratification1 StratificationCategory2 Stratification2
## 519713 Overall NA NA
## 519714 Overall NA NA
## 519715 Overall NA NA
## 519716 Overall NA NA
## 519717 Overall NA NA
## 519718 Black, non-Hispanic NA NA
## StratificationCategory3 Stratification3
## 519713 NA NA
## 519714 NA NA
## 519715 NA NA
## 519716 NA NA
## 519717 NA NA
## 519718 NA NA
## GeoLocation ResponseID LocationID
## 519713 (38.907192, -77.036871) NA 11
## 519714 (28.932040377000476, -81.92896053899966) NA 12
## 519715 (21.304850435000446, -157.85774940299973) NA 15
## 519716 (18.335765, -64.896335) NA 78
## 519717 (43.62538123900049, -72.51764079099962) NA 50
## 519718 (34.52088095200048, -106.24058098499967) NA 35
## TopicID QuestionID DataValueTypeID StratificationCategoryID1
## 519713 TOB TOB2_1 CRDPREV OVERALL
## 519714 TOB TOB2_1 CRDPREV OVERALL
## 519715 TOB TOB2_1 CRDPREV OVERALL
## 519716 TOB TOB2_1 CRDPREV OVERALL
## 519717 TOB TOB2_1 CRDPREV OVERALL
## 519718 DIS DIS1_0 CRDPREV RACE
## StratificationID1 StratificationCategoryID2 StratificationID2
## 519713 OVR NA NA
## 519714 OVR NA NA
## 519715 OVR NA NA
## 519716 OVR NA NA
## 519717 OVR NA NA
## 519718 BLK NA NA
## StratificationCategoryID3 StratificationID3
## 519713 NA NA
## 519714 NA NA
## 519715 NA NA
## 519716 NA NA
## 519717 NA NA
## 519718 NA NA
Table of summary Statistics
summary(disease)
## YearStart YearEnd LocationAbbr LocationDesc
## Min. :2001 Min. :2001 AZ : 9923 Arizona : 9923
## 1st Qu.:2012 1st Qu.:2012 FL : 9923 Florida : 9923
## Median :2013 Median :2013 IA : 9923 Iowa : 9923
## Mean :2013 Mean :2013 KY : 9923 Kentucky: 9923
## 3rd Qu.:2015 3rd Qu.:2015 NC : 9923 Nebraska: 9923
## Max. :2016 Max. :2016 NE : 9923 Nevada : 9923
## (Other):460180 (Other) :460180
## DataSource
## BRFSS :364425
## NVSS : 79755
## CMS Part A Claims Data: 29952
## State Inpatient Data : 18423
## ACS 1-Year Estimates : 7403
## SEDD; SID : 6924
## (Other) : 12836
## Topic
## Diabetes : 79631
## Chronic Obstructive Pulmonary Disease: 78729
## Cardiovascular Disease : 75787
## Arthritis : 41765
## Overarching Conditions : 39362
## Asthma : 39261
## (Other) :165183
## Question
## Hospitalization for chronic obstructive pulmonary disease as any diagnosis among Medicare-eligible persons aged >= 65 years : 7488
## Hospitalization for chronic obstructive pulmonary disease as first-listed diagnosis among Medicare-eligible persons aged >= 65 years: 7488
## Hospitalization for heart failure among Medicare-eligible persons aged >= 65 years : 7488
## Hospitalization for hip fracture among Medicare-eligible persons aged >= 65 years : 7488
## Asthma mortality rate : 6135
## Chronic liver disease mortality : 6135
## (Other) :477496
## Response DataValueUnit
## Mode:logical % :349869
## NA's:519718 cases per 100,000: 49080
## Number : 28930
## cases per 1,000 : 19968
## cases per 10,000 : 16898
## (Other) : 11301
## NA's : 43672
## DataValueType DataValue DataValueAlt
## Crude Prevalence :191522 : 23091 Min. : 0.0
## Age-adjusted Prevalence:156810 1 : 1005 1st Qu.: 18.5
## Number : 46125 3.6 : 829 Median : 41.0
## Age-adjusted Rate : 45018 3.8 : 816 Mean : 891.8
## Crude Rate : 45018 3.7 : 813 3rd Qu.: 70.3
## Mean : 13160 (Other):348010 Max. :2600878.0
## (Other) : 22065 NA's :145154 NA's :169383
## DataValueFootnoteSymbol
## **** : 98370
## : 56098
## - : 39252
## ~ : 30532
## * : 2062
## (Other): 1004
## NA's :292400
## DatavalueFootnote
## Sample size of denominator and/or age group for age-standardization is less than 50 or relative standard error is more than 30%: 98370
## : 55932
## No data available : 39252
## Data not shown because of too few respondents or cases : 30532
## 50 States + DC: US Median : 2062
## (Other) : 1004
## NA's :292566
## LowConfidenceLimit HighConfidenceLimit StratificationCategory1
## Min. : 0.20 Min. : 0.42 Gender :121660
## 1st Qu.: 12.70 1st Qu.: 18.90 Overall : 77888
## Median : 30.20 Median : 43.80 Race/Ethnicity:320170
## Mean : 46.76 Mean : 58.99
## 3rd Qu.: 55.40 3rd Qu.: 70.40
## Max. :1330.66 Max. :2088.00
## NA's :208656 NA's :208656
## Stratification1 StratificationCategory2 Stratification2
## Overall : 77888 Mode:logical Mode:logical
## Black, non-Hispanic: 64034 NA's:519718 NA's:519718
## Hispanic : 64034
## White, non-Hispanic: 64034
## Female : 60830
## Male : 60830
## (Other) :128068
## StratificationCategory3 Stratification3
## Mode:logical Mode:logical
## NA's:519718 NA's:519718
##
##
##
##
##
## GeoLocation ResponseID
## (28.932040377000476, -81.92896053899966) : 9923 Mode:logical
## (33.998821303000454, -81.04537120699968) : 9923 NA's:519718
## (34.865970280000454, -111.76381127699972): 9923
## (35.466220975000454, -79.15925046299964) : 9923
## (37.645970271000465, -84.77497104799966) : 9923
## (Other) :466500
## NA's : 3603
## LocationID TopicID QuestionID DataValueTypeID
## Min. : 1.00 DIA : 79631 COPD5_3: 7488 CRDPREV :191522
## 1st Qu.:17.00 COPD : 78729 COPD5_4: 7488 AGEADJPREV:156810
## Median :30.00 CVD : 75787 CVD2_0 : 7488 NMBR : 46125
## Mean :30.99 ART : 41765 OLD1_0 : 7488 AGEADJRATE: 45018
## 3rd Qu.:45.00 OVC : 39362 ALC6_0 : 6135 CRDRATE : 45018
## Max. :78.00 AST : 39261 AST4_1 : 6135 MEAN : 13160
## (Other):165183 (Other):477496 (Other) : 22065
## StratificationCategoryID1 StratificationID1 StratificationCategoryID2
## GENDER :121660 OVR : 77888 Mode:logical
## OVERALL: 77888 BLK : 64034 NA's:519718
## RACE :320170 HIS : 64034
## WHT : 64034
## GENF : 60830
## GENM : 60830
## (Other):128068
## StratificationID2 StratificationCategoryID3 StratificationID3
## Mode:logical Mode:logical Mode:logical
## NA's:519718 NA's:519718 NA's:519718
##
##
##
##
##
Visualization of missing value pattern
The following graph shows the missing value pattern in the data set. Red areas of the graph indicate missing values.
missmap(disease)
From the summary statistics and the missing value plot we can see that about 37% of the values in the data set are missing, and most of them are in the following variables:
names(which(colMeans(is.na(disease)) > 0.1))
## [1] "Response" "DataValue"
## [3] "DataValueAlt" "DataValueFootnoteSymbol"
## [5] "DatavalueFootnote" "LowConfidenceLimit"
## [7] "HighConfidenceLimit" "StratificationCategory2"
## [9] "Stratification2" "StratificationCategory3"
## [11] "Stratification3" "ResponseID"
## [13] "StratificationCategoryID2" "StratificationID2"
## [15] "StratificationCategoryID3" "StratificationID3"
lapply(disease, function(x){sum(is.na(x))})
## $YearStart
## [1] 0
##
## $YearEnd
## [1] 0
##
## $LocationAbbr
## [1] 0
##
## $LocationDesc
## [1] 0
##
## $DataSource
## [1] 0
##
## $Topic
## [1] 0
##
## $Question
## [1] 0
##
## $Response
## [1] 519718
##
## $DataValueUnit
## [1] 43672
##
## $DataValueType
## [1] 0
##
## $DataValue
## [1] 145154
##
## $DataValueAlt
## [1] 169383
##
## $DataValueFootnoteSymbol
## [1] 292400
##
## $DatavalueFootnote
## [1] 292566
##
## $LowConfidenceLimit
## [1] 208656
##
## $HighConfidenceLimit
## [1] 208656
##
## $StratificationCategory1
## [1] 0
##
## $Stratification1
## [1] 0
##
## $StratificationCategory2
## [1] 519718
##
## $Stratification2
## [1] 519718
##
## $StratificationCategory3
## [1] 519718
##
## $Stratification3
## [1] 519718
##
## $GeoLocation
## [1] 3603
##
## $ResponseID
## [1] 519718
##
## $LocationID
## [1] 0
##
## $TopicID
## [1] 0
##
## $QuestionID
## [1] 0
##
## $DataValueTypeID
## [1] 0
##
## $StratificationCategoryID1
## [1] 0
##
## $StratificationID1
## [1] 0
##
## $StratificationCategoryID2
## [1] 519718
##
## $StratificationID2
## [1] 519718
##
## $StratificationCategoryID3
## [1] 519718
##
## $StratificationID3
## [1] 519718
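The per-column counts above add up to 6,561,270 missing cells out of 519,718 × 34 = 17,670,412, which is where the figure of about 37% comes from. A minimal sketch of computing the same two summaries more compactly is shown below on a toy data frame; the data frame toy and its values are made up for illustration.

```r
# Toy data frame (made-up values) with one fully missing column.
toy <- data.frame(
  Topic        = c("Alcohol", "Alcohol", "Tobacco", "Tobacco"),
  DataValueAlt = c(16.9, NA, NA, 13.0),
  Response     = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Share of all cells that are NA (is.na() on a data frame returns a matrix).
overall_missing <- mean(is.na(toy))

# Share of NA per column, largest first, as one named vector.
missing_by_col <- sort(colMeans(is.na(toy)), decreasing = TRUE)

print(overall_missing)
print(missing_by_col)
```

Applied to the real data, mean(is.na(disease)) would give the overall share, and the sorted named vector replaces the long list printed by lapply.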
Drop columns that have more than 10% missing values
disease_no_miss <- disease[,-which(colMeans(is.na(disease))>0.1)]
Dimensions after dropping the columns with missing values
dim(disease_no_miss)
## [1] 519718 18
After dropping the columns with more than 10% missing values, only 18 variables are left.
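Dropping columns with a negative which() index works here because some columns exceed the threshold, but it has an edge case: if no column exceeds it, which() returns an empty vector and the result has no columns at all. A minimal sketch of a logical-index alternative follows; the helper drop_sparse_columns and the data frame complete_toy are hypothetical names introduced for illustration.

```r
# Keep only columns whose share of NA is at most max_missing.
# A logical index behaves correctly even when nothing needs dropping.
drop_sparse_columns <- function(df, max_missing = 0.1) {
  keep <- colMeans(is.na(df)) <= max_missing
  df[, keep, drop = FALSE]
}

# Toy data frame (made-up values) with no missing values at all.
complete_toy <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Negative which() index: which() is empty, so every column is lost.
lost <- complete_toy[, -which(colMeans(is.na(complete_toy)) > 0.1)]

# Logical index: both columns are kept.
kept <- drop_sparse_columns(complete_toy)

print(ncol(lost))
print(ncol(kept))
```

On the project data, drop_sparse_columns(disease) would select the same 18 columns as the code above, since the threshold and the comparison are the same.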
Get a glimpse of the data.
glimpse(disease_no_miss)
## Observations: 519,718
## Variables: 18
## $ YearStart <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ YearEnd <int> 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ LocationAbbr <fct> US, AL, AK, AZ, AR, CA, CO, CT, DE, ...
## $ LocationDesc <fct> United States, Alabama, Alaska, Ariz...
## $ DataSource <fct> BRFSS, BRFSS, BRFSS, BRFSS, BRFSS, B...
## $ Topic <fct> Alcohol, Alcohol, Alcohol, Alcohol, ...
## $ Question <fct> Binge drinking prevalence among adul...
## $ DataValueUnit <fct> %, %, %, %, %, %, %, %, %, %, %, %, ...
## $ DataValueType <fct> Crude Prevalence, Crude Prevalence, ...
## $ StratificationCategory1 <fct> Overall, Overall, Overall, Overall, ...
## $ Stratification1 <fct> Overall, Overall, Overall, Overall, ...
## $ GeoLocation <fct> NA, "(32.84057112200048, -86.6318607...
## $ LocationID <int> 59, 1, 2, 4, 5, 6, 8, 9, 10, 11, 12,...
## $ TopicID <fct> ALC, ALC, ALC, ALC, ALC, ALC, ALC, A...
## $ QuestionID <fct> ALC2_2, ALC2_2, ALC2_2, ALC2_2, ALC2...
## $ DataValueTypeID <fct> CRDPREV, CRDPREV, CRDPREV, CRDPREV, ...
## $ StratificationCategoryID1 <fct> OVERALL, OVERALL, OVERALL, OVERALL, ...
## $ StratificationID1 <fct> OVR, OVR, OVR, OVR, OVR, OVR, OVR, O...
Top 6 locations by number of records
disease_no_miss %>%
count(LocationDesc)%>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## LocationDesc n
## <fct> <int>
## 1 Arizona 9923
## 2 Florida 9923
## 3 Iowa 9923
## 4 Kentucky 9923
## 5 Nebraska 9923
## 6 Nevada 9923
Bottom 6 locations by number of records
disease_no_miss %>%
count(LocationDesc)%>%
arrange(n) %>%
head()
## # A tibble: 6 x 2
## LocationDesc n
## <fct> <int>
## 1 United States 3603
## 2 Guam 7139
## 3 Virgin Islands 7187
## 4 Puerto Rico 7305
## 5 Alabama 9530
## 6 Alaska 9530
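The counts above are numbers of rows per location, which mostly reflect how many indicators each location reports. A minimal sketch of ranking locations by a reported value instead is shown below. It uses only the six 2016 crude prevalence rows for binge drinking (QuestionID ALC2_2) printed by head(disease) earlier; the data frame name binge_2016 is hypothetical. The US row is left out because its footnote marks it as the median of the 50 states and DC rather than a single location. Base R is used so the sketch runs without extra packages.

```r
# The six rows shown by head(disease): binge drinking, crude prevalence, 2016.
binge_2016 <- data.frame(
  LocationAbbr = c("US", "AL", "AK", "AZ", "AR", "CA"),
  DataValueAlt = c(16.9, 13.0, 18.2, 15.6, 15.0, 16.3),
  stringsAsFactors = FALSE
)

# Drop the national median so only individual states are compared.
states_only <- binge_2016[binge_2016$LocationAbbr != "US", ]

# Order states from highest to lowest reported prevalence.
ranked <- states_only[order(states_only$DataValueAlt, decreasing = TRUE), ]

print(head(ranked, 3))
```

On the full data the same idea would need a filter on one QuestionID, one DataValueTypeID, one stratification and one year first, because values from different indicators and units are not comparable.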
Top 6 data sources by number of records
disease_no_miss %>%
count(DataSource)%>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## DataSource n
## <fct> <int>
## 1 BRFSS 364425
## 2 NVSS 79755
## 3 CMS Part A Claims Data 29952
## 4 State Inpatient Data 18423
## 5 ACS 1-Year Estimates 7403
## 6 SEDD; SID 6924
Bottom 6 data sources by number of records
disease_no_miss %>%
count(DataSource)%>%
arrange(n) %>%
head()
## # A tibble: 6 x 2
## DataSource n
## <fct> <int>
## 1 Birth Certificate, NVSS 52
## 2 Current Population Survey 55
## 3 NVSS, Mortality 104
## 4 AEDS 110
## 5 ANRF 110
## 6 InfoUSA; USDA 110
Bar plot of Chronic Diseases
barplot(table(disease_no_miss$Topic), main = "Distribution of Chronic Disease Topics")
From the bar plot we can see that the topics with the most records are the following:
disease_no_miss %>%
count(Topic)%>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## Topic n
## <fct> <int>
## 1 Diabetes 79631
## 2 Chronic Obstructive Pulmonary Disease 78729
## 3 Cardiovascular Disease 75787
## 4 Arthritis 41765
## 5 Overarching Conditions 39362
## 6 Asthma 39261
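Because the topic names are long, a default vertical bar plot tends to drop or overlap labels. A minimal sketch of a sorted, horizontal version is given below, using only the six topic counts printed above; the vector name topic_counts and the margin setting are assumptions chosen for illustration.

```r
# The six topic counts printed above.
topic_counts <- c(
  "Diabetes" = 79631,
  "Chronic Obstructive Pulmonary Disease" = 78729,
  "Cardiovascular Disease" = 75787,
  "Arthritis" = 41765,
  "Overarching Conditions" = 39362,
  "Asthma" = 39261
)

# Ascending order: horizontal bars are drawn bottom to top,
# so the largest count ends up at the top of the plot.
sorted_counts <- sort(topic_counts)

# Widen the left margin so the long topic names fit, then restore it.
old_par <- par(mar = c(4, 16, 2, 1))
barplot(sorted_counts, horiz = TRUE, las = 1,
        xlab = "Number of records",
        main = "Records per chronic disease topic")
par(old_par)
```

For all topics in the data, sort(table(disease_no_miss$Topic)) could be passed to barplot in the same way.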
Visualization of data value type
barplot(table(disease_no_miss$DataValueType))
Top data value types are the following
disease_no_miss %>%
count(DataValueType)%>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## DataValueType n
## <fct> <int>
## 1 Crude Prevalence 191522
## 2 Age-adjusted Prevalence 156810
## 3 Number 46125
## 4 Age-adjusted Rate 45018
## 5 Crude Rate 45018
## 6 Mean 13160
Stratification Category
barplot(table(disease_no_miss$StratificationCategory1))
Data Value Type Plot
barplot(table(disease_no_miss$DataValueTypeID))
Real-world data sets are often messy and have lots of missing values. From the exploratory analysis we found that Diabetes, Chronic Obstructive Pulmonary Disease, Cardiovascular Disease, and Arthritis are the most frequently recorded chronic disease topics. From the overall analysis we learned that chronic diseases are a major concern in the USA, although the data set is missing a lot of necessary information. Arizona, Florida, and Iowa have the most records in the data set, while Guam and the Virgin Islands have the fewest among individual states and territories. These are counts of indicator records, so they reflect how much each location reports rather than how common chronic disease is there.