Tyding and Transforming Data

Project 2:

Choose three of the wide “data sets” identified in Week 6 discussion items and practice tyding and Transformation of those data sets.

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

DATASET 1 :U.S Chronic Disease Indicators

https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi-e50c9

Dataset description: The Center of Disease Control and Prevention provides and report chronic diseases data that are relevant information for hospitals, clinics and providers that allows states ,territories and metropolitan areas to access and tackle some indicators. Dataset created date: Jun 2, 2016 Dataset updated date: September 12, 2017 Data Source: The Behavioral Risk Factor Surveillance System (BRFSS ) The Youth Risk Behavior Surveillance System (YRBSS)

And many other government organizations that collects, analyzes and distributes information about chronic diseases state-wide.

This dataset is wide enough containing 403984 observations. However, I used postgreSQL to create a small table called state_chronic filtering the observations where state = “NY” and Year is between 2015 and 2016.

Setting up enviroment

library("DBI", lib.loc="~/R/win-library/3.4")
library("RPostgreSQL", lib.loc="~/R/win-library/3.4")
library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

## sqldf will default to using PostgreSQL

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.4.2

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Warning: package 'tidyr' was built under R version 3.4.2

## Warning: package 'purrr' was built under R version 3.4.2

## Warning: package 'dplyr' was built under R version 3.4.2

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(stringr)
library(ggplot2)

Connectting with my Repository

con = dbConnect(PostgreSQL(), user="postgres", password="Doradu8030",
                 host="localhost", port=5432, dbname="mysql")

Getting data from my new table state_chronic

NY_chronic_ind = dbGetQuery(con,"select * from state_chronic")

s_chronic_ind <- transform(NY_chronic_ind, yearstart = as.numeric(yearstart), 
                                         yearend = as.numeric(yearend),
                                         datavalue = as.numeric(datavalue))

## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs
## introduced by coercion

dim(s_chronic_ind)

## [1] 1180   10

str(s_chronic_ind)

## 'data.frame':    1180 obs. of  10 variables:
##  $ yearstart        : num  2015 2015 2015 2015 2015 ...
##  $ yearend          : num  2015 2015 2015 2015 2015 ...
##  $ locationabbr     : chr  "NY" "NY" "NY" "NY" ...
##  $ topic            : chr  "Alcohol" "Alcohol" "Alcohol" "Alcohol" ...
##  $ question         : chr  "Alcohol use among youth" "Binge drinking prevalence among youth" "Binge drinking prevalence among adults aged >= 18 years" "Binge drinking prevalence among adults aged >= 18 years" ...
##  $ datavaluetype    : chr  "Crude Prevalence" "Crude Prevalence" "Age-adjusted Prevalence" "Age-adjusted Prevalence" ...
##  $ datavalue        : num  29.7 15.6 12.5 12.7 16.9 22.6 16.9 11.2 17.5 21.1 ...
##  $ questionid       : chr  "ALC1_1" "ALC2_1" "ALC2_2" "ALC2_2" ...
##  $ stratification1  : chr  "Overall" "Overall" "Black, non-Hispanic" "Female" ...
##  $ stratificationid1: chr  "OVR" "OVR" "BLK" "GENF" ...

This data set contains 1180 observation for New York State only. The sample population is adults aged >= 18 years old and this data set analyzes categorical data by stratified random sampling, leading in the causes of disability and death in New York State

Variables: yearStart: year in which the study starts = 2015 and 2016 yearEnd: year in which the study ends = 2015 and 2016 LocationAbbr: Abbreviation for state Topic: categories that have reported as a possible cause of a chronic disease Question: description of category given in string ex. Heavy drinking among adults aged >=18 QuestionID: identifiers of each question DataValuetypeID: Standardized rate decease identifier DataValue: Percentage of acquire a chronic diseases and conditions base on each topic. QuestionID: identifiers of each question Stratification1: value for each stratification category ID1 ex. Female, male, Hispanic, black, white, etc. StratificationCategoryID1: categories in which each observation is classified as gender, race and overall

After I studied each variable selected, I identified the variables “Question” and “Stratification1” that would need to be transformed for meaningful use.

#droping NA values
table1 <- as_tibble(NY_chronic_ind)
table2 <- drop_na(table1, datavalue)

table2$stratification1 <- str_replace(table2$stratification1,"Female|Male" , "Gender")

#table2$stratification1 <- str_replace(table2$stratification1, pattern = "([:alpha:]+?.[:punct:][:punct:][:alpha:])","Race and Ethnicity" )

table2$stratificationid1 <- str_replace(table2$stratificationid1, "GENF", "FEMALE")
table2$stratificationid1 <- str_replace(table2$stratificationid1, "GENM", "MALE")

dim(table2)

## [1] 959  10

g_topic <- group_by(s_chronic_ind,topic)
c_topic <- count(s_chronic_ind,topic) 
#Number of obsevartions by topic
c_topic

## # A tibble: 15 x 2
##                                              topic     n
##                                              <chr> <int>
##  1                                         Alcohol    82
##  2                                       Arthritis   128
##  3                                          Asthma    86
##  4                                          Cancer    46
##  5                          Cardiovascular Disease   134
##  6                          Chronic Kidney Disease    16
##  7           Chronic Obstructive Pulmonary Disease   128
##  8                                        Diabetes   198
##  9                                    Immunization    16
## 10                                   Mental Health    22
## 11 Nutrition, Physical Activity, and Weight Status   148
## 12                                     Oral Health     1
## 13                          Overarching Conditions    77
## 14                             Reproductive Health     6
## 15                                         Tobacco    92

table2$datavalue <-  as.numeric(table2$datavalue)

## Warning: NAs introduced by coercion

fplot <- table2 %>% 
  group_by(topic) %>% 
  summarise(topsum = sum(datavalue, na.rm = T))

ggplot(fplot, aes(topic, topsum))+
  geom_bar(stat = 'identity') +
  theme(axis.text.x=element_text(angle=90, hjust=1))+ coord_flip()+
  ggtitle('Posible Causes of Death in New York')

g_topic <- group_by(s_chronic_ind,topic)
c_topic <- count(s_chronic_ind,topic) 
s_value <- summarise(s_chronic_ind, avg = mean(datavalue))
c_topic

## # A tibble: 15 x 2
##                                              topic     n
##                                              <chr> <int>
##  1                                         Alcohol    82
##  2                                       Arthritis   128
##  3                                          Asthma    86
##  4                                          Cancer    46
##  5                          Cardiovascular Disease   134
##  6                          Chronic Kidney Disease    16
##  7           Chronic Obstructive Pulmonary Disease   128
##  8                                        Diabetes   198
##  9                                    Immunization    16
## 10                                   Mental Health    22
## 11 Nutrition, Physical Activity, and Weight Status   148
## 12                                     Oral Health     1
## 13                          Overarching Conditions    77
## 14                             Reproductive Health     6
## 15                                         Tobacco    92

Data set 2 : Mortality Rates

Getting data

mort <- read.csv("C:/Users/Dora/CUNY/DATA607-Data Acquisition/Project2/MORT_READM_July2017.csv")

plt1 <- mort %>% 
  group_by(Measure.as.Posted.on.Hospital.Compare) %>% 
  summarise(facSum = sum(VHA.National.Rate, na.rm = T))

ggplot(plt1, aes(Measure.as.Posted.on.Hospital.Compare, facSum))+
  geom_bar(stat = 'identity') +
  theme(axis.text.x=element_text(angle=90, hjust=1))+ coord_flip()+
  ggtitle('Mortality Rates')

mort$Number.Deaths.Readmissions <-  as.numeric(mort$Number.Deaths.Readmissions)
plt2 <- mort %>% 
  group_by(Measure.as.Posted.on.Hospital.Compare) %>% 
  summarise(facSum1 = sum(Number.Deaths.Readmissions,na.rm = T))

ggplot(plt2, aes(Measure.as.Posted.on.Hospital.Compare, facSum1))+
  geom_bar(stat = 'identity') +
  theme(axis.text.x=element_text(angle=90, hjust=1))+ coord_flip()+
  ggtitle('Hosp. Readmission Deaths ')

Data set 3 : HCAHPS Survey Results for Adult Inpatients

Quality of Care

A list of the state averages for the HCAHPS survey responses. HCAHPS (the Hospital Consumer Assessment of Healthcare Providers and Systems) is a patient satisfaction survey required by CMS (the Centers for Medicare and Medicaid Services) for all hospitals in the United States. The Survey is for adult inpatients, excluding psychiatric patients. Getting data

Qcare <- read.csv("C:/Users/Dora/CUNY/DATA607-Data Acquisition/Project2/Patient_survey_HCAHPS_State.csv")

Qcare <- Qcare %>% mutate(timeDiff = difftime(as.POSIXct(as.Date(Qcare$Measure.End.Date,format = '%m/%d/%Y')), 
         as.POSIXct(as.Date(Qcare$Measure.Start.Date,format = '%m/%d/%Y')), 
         units = 'days'))

Qcare$HCAHPS.Answer.PercentPercent <-  as.numeric(Qcare$HCAHPS.Answer.Percent)
plt <- Qcare %>% 
  group_by(HCAHPS.Question) %>% 
  summarise(facSum = sum(HCAHPS.Answer.PercentPercent, na.rm = T))

ggplot(plt, aes(HCAHPS.Question, facSum))+
geom_bar(stat = 'identity') +  theme(axis.text.x=element_text(angle=90, hjust=1))+ coord_flip() +
  ggtitle('Quality of Treatment')

For a large plot as our peers indicated I had to call ggplot() from the console and save the image from the ‘plot’ tab on the right bottom pane of RStudio. Change the size. Paste the image in the RMD file. Make the chunk option for the ggplot call in the RMD file set as eval=FALSE.

X<-dbDisconnect(con)
X

## [1] TRUE

Note I used the same plot for all three datasets because I found this format more meaninful to display several categories in a wide dataset.