Github Link
Web Link
Deployment app for Question1

Assignment

Data about mortality from all 50 states and the District of Columbia. Please access it at https://github.com/charleyferrari/CUNY_DATA608/tree/master/module3/data.

You are invited to gather more data from our provider, the CDC WONDER system, at https://wonder.cdc.gov

This assignment must be done in R. It must be done using the ‘shiny’ package. It is recommended you use an R package that supports interactive graphing such as plotly, or vegalite, but this is not required. Your apps must be deployed, I won’t be accepting raw files. Luckily, you can pretty easily deploy apps with a free account at shinyapps.io

Question 1: As a researcher, you frequently compare mortality rates from particular causes across different States. You need a visualization that will let you see (for 2010 only) the crude mortality rate, across all States, from one cause (for example, Neoplasms, which are effectively cancers). Create a visualization that allows you to rank States by crude mortality for each cause of death.

Data Acquisition

The dataset was recorded by the Centers for Disease Control and Prevention (CDC) and covers mortality in U.S. states from 1999 to 2010. It was provided by the instructor, Charley Ferrari, as a CSV file stored in a GitHub repository, and we read it directly from there with R.

mortality_df <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module3/data/cleaned-cdc-mortality-1999-2010-2.csv", header = TRUE, stringsAsFactors=FALSE)
#write.csv(mortality_df,"~/R/Data608_Module3\\mortality.csv", row.names = FALSE)

head(mortality_df)
##                                 ICD.Chapter State Year Deaths Population
## 1 Certain infectious and parasitic diseases    AL 1999   1092    4430141
## 2 Certain infectious and parasitic diseases    AL 2000   1188    4447100
## 3 Certain infectious and parasitic diseases    AL 2001   1211    4467634
## 4 Certain infectious and parasitic diseases    AL 2002   1215    4480089
## 5 Certain infectious and parasitic diseases    AL 2003   1350    4503491
## 6 Certain infectious and parasitic diseases    AL 2004   1251    4530729
##   Crude.Rate
## 1       24.6
## 2       26.7
## 3       27.1
## 4       27.1
## 5       30.0
## 6       27.6

Data Structure

The dataset includes 9961 observations and 6 variables. “ICD.Chapter” and “State” are character variables, “Year”, “Deaths” and “Population” are integers, and “Crude.Rate” is numeric. Luckily, there are no missing values, so no imputation or filtering of incomplete rows is needed.

str(mortality_df)
## 'data.frame':    9961 obs. of  6 variables:
##  $ ICD.Chapter: chr  "Certain infectious and parasitic diseases" "Certain infectious and parasitic diseases" "Certain infectious and parasitic diseases" "Certain infectious and parasitic diseases" ...
##  $ State      : chr  "AL" "AL" "AL" "AL" ...
##  $ Year       : int  1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
##  $ Deaths     : int  1092 1188 1211 1215 1350 1251 1303 1312 1241 1385 ...
##  $ Population : int  4430141 4447100 4467634 4480089 4503491 4530729 4569805 4628981 4672840 4718206 ...
##  $ Crude.Rate : num  24.6 26.7 27.1 27.1 30 27.6 28.5 28.3 26.6 29.4 ...
#view(mortality_df)
sum(is.na(mortality_df))
## [1] 0
summary(mortality_df)
##  ICD.Chapter           State                Year          Deaths     
##  Length:9961        Length:9961        Min.   :1999   Min.   :   10  
##  Class :character   Class :character   1st Qu.:2002   1st Qu.:  177  
##  Mode  :character   Mode  :character   Median :2005   Median :  667  
##                                        Mean   :2005   Mean   : 2929  
##                                        3rd Qu.:2008   3rd Qu.: 2474  
##                                        Max.   :2010   Max.   :96511  
##    Population         Crude.Rate    
##  Min.   :  491780   Min.   :  0.00  
##  1st Qu.: 1728292   1st Qu.:  4.60  
##  Median : 4219239   Median : 24.00  
##  Mean   : 5937896   Mean   : 52.15  
##  3rd Qu.: 6562231   3rd Qu.: 50.50  
##  Max.   :37253956   Max.   :478.40
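The Crude.Rate column appears to be deaths per 100,000 population. As a quick sanity check, the sketch below recomputes it from the six Alabama rows printed by head() above. The vector names are illustrative, and the check covers only these six rows, not the whole file.

```r
# Values copied from the head(mortality_df) output above (AL, 1999-2004).
deaths     <- c(1092, 1188, 1211, 1215, 1350, 1251)
population <- c(4430141, 4447100, 4467634, 4480089, 4503491, 4530729)
reported   <- c(24.6, 26.7, 27.1, 27.1, 30.0, 27.6)

# Crude rate = deaths per 100,000 population, rounded to one decimal place.
recomputed <- round(deaths / population * 100000, 1)

# TRUE when every recomputed rate matches the reported one.
isTRUE(all.equal(recomputed, reported))
```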

Connecting with Shiny App

## 
## Attaching package: 'shiny'
## The following object is masked from 'package:rsconnect':
## 
##     serverInfo
## Warning: package 'plotly' was built under R version 4.0.5
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Warning: package 'RCurl' was built under R version 4.0.5
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
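
The attach messages above suggest that rsconnect, shiny, ggplot2, plotly, tidyr and RCurl were loaded, and the later code also uses dplyr verbs such as filter() and %>%. The loading chunk itself is not shown, so the package list and order below are an inference, not the original code. A minimal sketch that loads whichever of these packages are installed:

```r
# Package list inferred from the attach messages and the functions used later;
# the original loading chunk is not shown, so treat this list as an assumption.
pkgs <- c("rsconnect", "shiny", "ggplot2", "plotly", "tidyr", "dplyr", "RCurl")

# Find packages that are not installed, without attaching anything yet.
is_installed  <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!is_installed]
if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}

# Attach the installed packages, hiding the start-up messages.
for (p in pkgs[is_installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```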

Data Visualization

Let’s explore the CDC data for the state of Oregon from 1999 to 2010. The data could also be converted to a time series for forecasting, and other variables or subsets can be examined with filter() or select().

#mortality_df %>%
  #group_by(State) %>%
  #mutate() %>%
  #arrange(desc()) %>%
  #top_n(15)%>%
  #filter() %>%# adjusting the legend
  #autoplot(Deaths) +   labs(title= "GDP per capital", y = "Currency in US Dollars")

df1 <- mortality_df %>%
       filter(State == "OR")

ggplot(df1, aes(x = Year, y = Deaths,
                      group = interaction(State, ICD.Chapter),
                      colour = ICD.Chapter)) +
 geom_line() + labs(title= "Centers for Disease Control and Prevention (CDC) Report on Diseases in Oregon 1999-2010", y = "Number of Deaths")

# Neoplasms deaths for the three West Coast states
df2s <- mortality_df %>%
       filter(State %in% c("OR", "WA", "CA")) %>%
       filter(ICD.Chapter == "Neoplasms")

head(df2s)
##   ICD.Chapter State Year Deaths Population Crude.Rate
## 1   Neoplasms    CA 1999  54197   33499204      161.8
## 2   Neoplasms    CA 2000  54338   33871648      160.4
## 3   Neoplasms    CA 2001  55095   34479458      159.8
## 4   Neoplasms    CA 2002  55400   34871843      158.9
## 5   Neoplasms    CA 2003  55607   35253159      157.7
## 6   Neoplasms    CA 2004  54911   35574576      154.4
ggplot(df2s, aes(x = Year, y = Deaths,
                      group = interaction(State, ICD.Chapter),
                      colour = State)) +
                      geom_line() + labs(title= "Comparing Centers for Disease Control and Prevention (CDC) Report on Neoplasms in California, Oregon and Washington 1999-2010", y = "Number of Deaths")

# df1as <- mortality_df %>%
#   dplyr::select(Year, Deaths, Population, Crude.Rate) %>% ## can remove some variables
#   gather(key = "variable", value = "value", -Year)
# ggplot(df1as, aes(x = Year, y = value)) + 
#   geom_line(aes(color = variable, linetype = variable)) + 
#   scale_color_manual(values = c("darkred", "steelblue"))

#my1 <- ts (name of the data frame, [,2], start = year, 
#           month, date, frequency = in my case it was 31)
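
The commented-out ts() call above hints at the time-series conversion mentioned earlier. As a minimal sketch with base R only, the yearly death counts for one state and cause can be turned into an annual series. The values below are the first ten Deaths entries shown in the str() output, which appear to be Alabama, infectious and parasitic diseases, 1999-2008; the object names are illustrative.

```r
# First ten Deaths values from the str() output (apparently AL, 1999-2008).
al_deaths <- c(1092, 1188, 1211, 1215, 1350, 1251, 1303, 1312, 1241, 1385)

# Annual data: one observation per year, so frequency = 1.
al_ts <- ts(al_deaths, start = 1999, frequency = 1)

# Year-over-year change in deaths, a common first look before forecasting.
diff(al_ts)
```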


# Neoplasms crude rates for 2010, sorted from highest to lowest
df3 <- mortality_df %>%
       filter(ICD.Chapter == "Neoplasms" & Year == 2010) %>%
       arrange(desc(Crude.Rate))
# Static version of the ranking chart (stored in plot1, not printed here)
plot1 <- ggplot(df3, aes(x = reorder(State, -Crude.Rate), y = Crude.Rate)) +
            geom_col(fill = rainbow(51)) +   # one colour per bar; assumes 51 rows (50 states + DC)
            coord_flip() +
            geom_text(aes(label = Crude.Rate), size = 4, hjust = -0.1, color = 'blue') +
            labs(x = "State", y = "Crude Rate", title = "Centers for Disease Control and Prevention (CDC) Report on Neoplasms by U.S. State, 2010") +
            theme(axis.text.x = element_text(angle = 0, vjust = 0.2))
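
How reorder() and coord_flip() interact decides which state ends up at the top of the chart. The sketch below uses base R only and three invented rates (not taken from the dataset) to show the factor levels each call produces. In ggplot2 the first level of a discrete axis is drawn nearest the origin, so after coord_flip() it becomes the bottom bar.

```r
# Toy data: placeholder state codes with invented crude rates, for illustration only.
toy <- data.frame(State = c("AA", "BB", "CC"),
                  Crude.Rate = c(10, 30, 20))

# Negated rate: the highest-rate state becomes the first level
# (the bottom bar after coord_flip()).
levels_desc <- levels(reorder(toy$State, -toy$Crude.Rate))

# Plain rate: the highest-rate state becomes the last level
# (the top bar after coord_flip()).
levels_asc <- levels(reorder(toy$State, toy$Crude.Rate))

levels_desc
levels_asc
```

If the highest rate should appear at the top of the flipped chart, reorder(State, Crude.Rate) would be the form to use.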


# UI built with fluidPage
ui <- fluidPage(
    # Application title
    titlePanel("Centers for Disease Control and Prevention (CDC) Report on Crude Mortality Rate by U.S. State, 2010"),
    # Sidebar with a drop-down to select the cause of death
    sidebarLayout(
        sidebarPanel(
            # Cause of death (ICD chapter) to rank the states by
            selectInput('Infections', 'Cause of Death', unique(mortality_df$ICD.Chapter))),

        mainPanel(
            # Text describing the current selection
            htmlOutput(outputId = 'Select'),
            # Plot to be displayed
            plotOutput('trend')
            )
    )
)

# Server logic
server <- shinyServer(function(input, output, session){
    # Rows for the selected cause in 2010, sorted by crude rate
    df <- reactive({
        mortality_df %>%
            filter(ICD.Chapter == input$Infections & Year == 2010) %>%
            arrange(desc(Crude.Rate))
    })
    # Describe the selected cause instead of a hard-coded one
    output$Select <- renderText({
        paste("Crude mortality rate by state in 2010 for:", input$Infections)
    })
    output$trend <- renderPlot({
        ggplot(df(), aes(x = reorder(State, -Crude.Rate), y = Crude.Rate)) +
            geom_col(fill = "#FF8000FF") +
            coord_flip() +
            geom_text(aes(label = Crude.Rate), size = 4, hjust = -0.1) +
            labs(x = "State", y = "Crude Rate", title = paste("CDC Report on", input$Infections, "by U.S. State, 2010")) +
            theme(axis.text.x = element_text(angle = 0, vjust = 0.2))
    })
})
#shinyApp(ui = ui, server = server, options = list(height = 500, width = 960))
#runApp()
#deployApp()

Question 2:

Often you are asked whether particular States are improving their mortality rates (per cause) faster than, or slower than, the national average. Create a visualization that lets your clients see this for themselves for one cause of death at the time. Keep in mind that the national average should be weighted by the national population.

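We observed that Crude.Rate = (Deaths / Population) * 100000, that is, deaths per 100,000 population. It is a rate per state, per year and per cause of death (ICD chapter). To find the national average, we group by cause and year, divide the sum of deaths by the sum of population, multiply by 100000, and assign the result to a new variable. This weights each state by its population. The new variable is repeated on the row of every state within a cause and year, which is redundant, but it keeps the national rate available once the data are filtered by state.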
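
To make the weighting concrete, the sketch below contrasts the population-weighted national rate with a plain average of state rates. It uses base R and two invented states (the numbers are not from the CDC file), so only the arithmetic carries over.

```r
# Toy data for one cause and one year; values are invented for illustration.
toy <- data.frame(State      = c("Big", "Small"),
                  Deaths     = c(2000, 30),
                  Population = c(1000000, 10000))

# State crude rates per 100,000 population.
toy$Crude.Rate <- toy$Deaths / toy$Population * 100000

# Population-weighted national rate: total deaths over total population.
national_rate <- sum(toy$Deaths) / sum(toy$Population) * 100000

# Unweighted mean of state rates, which gives the small state too much influence.
naive_rate <- mean(toy$Crude.Rate)

round(c(weighted = national_rate, unweighted = naive_rate), 1)
```

The weighted figure is the same value that weighted.mean(toy$Crude.Rate, toy$Population) returns, since each state rate is weighted by its population.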

#view(mortality_df)
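# Keep the state rate under an explicit name
mortality_df0 <- mortality_df %>%
                 mutate(Crude.Rate.State = Crude.Rate)

# National rate per year and cause: total deaths over total population, per 100,000
mortality_df1 <- mortality_df0 %>%
                 group_by(Year, ICD.Chapter) %>%
                 mutate(Crude.Rate.USA = round(((sum(Deaths)*100000)/sum(Population)), 1)) %>%
                 ungroup() %>%
                 dplyr::select(ICD.Chapter, State, Year, Crude.Rate.State, Crude.Rate.USA) %>%
                 # Long format: one row per cause, state, year and rate type
                 gather(key = "variable", value = "value", -ICD.Chapter, -State, -Year)
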

#view(mortality_df1)

df4 <- mortality_df1 %>%
       filter(ICD.Chapter == "Neoplasms" & State == "OR" )#%>%
       #arrange(desc(Crude.Rate))
# ggplot(df4, aes(x=Year)) + 
#   geom_line(aes(y = Crude.Rate, color = "Crude.Rate")) + #, color = "EEA236"
#   geom_line(aes(y = Crude.RateUSA, color = "Crude.RateUSA")) + #, color ="darkred" 
#     #scale_color_manual(values = c("darkred", "steelblue"))+
#   labs(title= "Comparing Centers for Disease Control and Prevention (CDC) Report \n on Neoplasms Disease in Oregon against Nationwide  1990-2010", y = "Crude Rate")+
#   theme(legend.position = "right", legend.text = element_text(size = 8), legend.title = element_text(face = "bold"))

#view(df4)
# df4a <- df4 %>%
#         dplyr::select(Year,Crude.Rate,Crude.RateUSA) %>%
#         gather(key = "variable", value = "value", -Year)
# 
# Visualization
ggplot(df4, aes(x = Year, y = value)) +
  geom_line(aes(color = variable, linetype = variable)) +
  scale_color_manual(values = c("darkred", "steelblue")) +
  labs(title= "Comparing Centers for Disease Control and Prevention (CDC) Report \n on Neoplasms in Oregon against Nationwide 1999-2010", y = "Crude Rate")+
  theme(legend.position = "right", legend.text = element_text(size = 8), legend.title = element_text(face = "bold"))
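
The chart shows both lines, but "faster or slower than the national average" can also be summarised as a percent change over a period. A minimal sketch in base R: the California values are the 1999 and 2004 Neoplasms rates from the head(df2s) output above, while the national values are placeholders invented for illustration, not computed from the data. The function name is also hypothetical.

```r
# Percent change between a start and an end rate (negative means improvement).
pct_change <- function(start_rate, end_rate) {
  (end_rate - start_rate) / start_rate * 100
}

# California Neoplasms crude rates from the head(df2s) output: 1999 and 2004.
state_change <- pct_change(161.8, 154.4)

# Hypothetical national rates for the same years (placeholders, not real values).
national_change <- pct_change(200.0, 195.0)

# TRUE when the state's rate fell by a larger percentage than the national rate.
state_change < national_change
```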

# UI built with fluidPage
ui <- fluidPage(
    # Application title
    titlePanel("Centers for Disease Control and Prevention (CDC) Report on Crude Mortality Rate in each State against Nationwide 1999-2010"),
    # Sidebar with drop-downs to select the cause of death and the state
    sidebarLayout(
        sidebarPanel(
            # Cause of death (ICD chapter) and state to compare with the national rate
            selectInput('Infections', 'Cause of Death', unique(mortality_df1$ICD.Chapter)),
            selectInput('States', 'State', unique(mortality_df1$State))),

        mainPanel(
            htmlOutput(outputId = 'Selects'),
            # Plot to be displayed
            plotOutput('trends')
            )
    )
)

# Server logic
server <- shinyServer(function(input, output, session){
    # State and national rates for the selected cause and state
    df5 <- reactive({
        mortality_df1 %>%
            filter(ICD.Chapter == input$Infections & State == input$States)
    })
    # The description is shown in the plot title, so this text output stays blank
    output$Selects <- renderText({
        " "
    })

    output$trends <- renderPlot({
        # Build the plot title from the current selection
        a1 <- paste("Comparing Death Rate caused by", input$Infections)
        a2 <- paste("in the state of", input$States)
        a3 <- "against Nationwide Rate 1999-2010"

        ggplot(df5(), aes(x = Year, y = value)) +
            geom_line(aes(color = variable, linetype = variable), size = 1) +
            geom_point(aes(color = variable)) +
            ggtitle(paste0(a1, " ", a2, "\n", a3)) +
            scale_color_manual(values = c("darkred", "darkblue")) +
            labs(y = "Crude Rate") +
            theme(legend.position = "right", legend.text = element_text(size = 8), legend.title = element_text(face = "bold"))
    })
})
#shinyApp(ui = ui, server = server, options = list(height = 500, width = 960))
#runApp()
#deployApp()