CUNY MSDS DATA 607 Project 2

Nicholas Schettini

March 11, 2018

Task:

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Datasets:

In this project, I’ve looked at three different datasets:

1.) Marriage and Divorce Rates, found by classmate Jiadi Li.

2.) 2014-15 To 2016-17 School-Level NYC Regents, found by Nicholas Schettini.

3.) CDC Chronic Health Diseases, found by Niteen Kumar.

Dataset 1 - Marriage and Divorce Rates

Libraries:

library(tidyverse)
library(readr)

Read data: the data was uploaded to GitHub, then served through rawgit.

The data has issues with column names, formatting, and types (numbers stored as characters). Showing an elongated head to display the issues with the data.

data <- read.csv("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/national_marriage_divorce_rates_00-16.csv")

head(data, 30)
##                                  ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016
## 1                                                                                                              
## 2                                                                                                          Year
## 3                                                                                                          2016
## 4                                                                                                          2015
## 5                                                                                                        2014/1
## 6                                                                                                        2013/1
## 7                                                                                                          2012
## 8                                                                                                          2011
## 9                                                                                                          2010
## 10                                                                                                         2009
## 11                                                                                                         2008
## 12                                                                                                         2007
## 13                                                                                                       2006/2
## 14                                                                                                         2005
## 15                                                                                                         2004
## 16                                                                                                         2003
## 17                                                                                                         2002
## 18                                                                                                         2001
## 19                                                                                                         2000
## 20                                                                                                             
## 21                                                                                 1/Excludes data for Georgia.
## 22                                                                               2/Excludes data for Louisiana.
## 23                                                                                                             
## 24 Note: Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000 
## 25                                  and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 26                                                           Source: CDC/NCHS National Vital Statistics System.
## 27                                                                                                             
## 28                                                                                                             
## 29                             Provisional number of divorces and annulments and rate: United States, 2000-2016
## 30                                                                                                             
##            X         X.1                             X.2 X.3 X.4 X.5 X.6
## 1                                                         NA  NA  NA  NA
## 2  Marriages  Population Rate per 1,000 total population  NA  NA  NA  NA
## 3  2,245,404 323,127,513                             6.9  NA  NA  NA  NA
## 4  2,221,579 321,418,820                             6.9  NA  NA  NA  NA
## 5  2,140,272 308,759,713                             6.9  NA  NA  NA  NA
## 6  2,081,301 306,136,672                             6.8  NA  NA  NA  NA
## 7  2,131,000 313,914,040                             6.8  NA  NA  NA  NA
## 8  2,118,000 311,591,917                             6.8  NA  NA  NA  NA
## 9  2,096,000 308,745,538                             6.8  NA  NA  NA  NA
## 10 2,080,000 306,771,529                             6.8  NA  NA  NA  NA
## 11 2,157,000 304,093,966                             7.1  NA  NA  NA  NA
## 12 2,197,000 301,231,207                             7.3  NA  NA  NA  NA
## 13 2,193,000 294,077,247                             7.5  NA  NA  NA  NA
## 14 2,249,000 295,516,599                             7.6  NA  NA  NA  NA
## 15 2,279,000 292,805,298                             7.8  NA  NA  NA  NA
## 16 2,245,000 290,107,933                             7.7  NA  NA  NA  NA
## 17 2,290,000 287,625,193                             8.0  NA  NA  NA  NA
## 18 2,326,000 284,968,955                             8.2  NA  NA  NA  NA
## 19 2,315,000 281,421,906                             8.2  NA  NA  NA  NA
## 20                                                        NA  NA  NA  NA
## 21                                                        NA  NA  NA  NA
## 22                                                        NA  NA  NA  NA
## 23                                                        NA  NA  NA  NA
## 24                                                        NA  NA  NA  NA
## 25                                                        NA  NA  NA  NA
## 26                                                        NA  NA  NA  NA
## 27                                                        NA  NA  NA  NA
## 28                                                        NA  NA  NA  NA
## 29                                                        NA  NA  NA  NA
## 30                                                        NA  NA  NA  NA
##    X.7 X.8
## 1   NA  NA
## 2   NA  NA
## 3   NA  NA
## 4   NA  NA
## 5   NA  NA
## 6   NA  NA
## 7   NA  NA
## 8   NA  NA
## 9   NA  NA
## 10  NA  NA
## 11  NA  NA
## 12  NA  NA
## 13  NA  NA
## 14  NA  NA
## 15  NA  NA
## 16  NA  NA
## 17  NA  NA
## 18  NA  NA
## 19  NA  NA
## 20  NA  NA
## 21  NA  NA
## 22  NA  NA
## 23  NA  NA
## 24  NA  NA
## 25  NA  NA
## 26  NA  NA
## 27  NA  NA
## 28  NA  NA
## 29  NA  NA
## 30  NA  NA
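
As an aside, since the marriage block sits at a fixed position in the file, readr can slice it out at read time. A minimal sketch, assuming the block starts after a title row, a blank row, and a header row (the number columns still arrive as character and need cleaning either way):

url <- "https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/national_marriage_divorce_rates_00-16.csv"

marriage_raw <- read_csv(url, skip = 3, n_max = 17, col_names = FALSE) %>%
  select(X1:X4) %>%                  # drop the trailing empty columns
  setNames(c("year", "marriages", "population", "rate_per_1000"))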

Converted the data into a tibble. Tibbles print more compactly, which makes the data easier to scan. (as.tbl() and tbl_df() below are equivalent ways to do this; newer tibble releases prefer as_tibble().)

as.tbl(data)
## # A tibble: 60 x 10
##    ï..Provisional.n~ X     X.1   X.2   X.3   X.4   X.5   X.6   X.7   X.8  
##    <fct>             <fct> <fct> <fct> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 ""                ""    ""    ""    NA    NA    NA    NA    NA    NA   
##  2 Year              Marr~ Popu~ Rate~ NA    NA    NA    NA    NA    NA   
##  3 2016              2,24~ 323,~ 6.9   NA    NA    NA    NA    NA    NA   
##  4 2015              2,22~ 321,~ 6.9   NA    NA    NA    NA    NA    NA   
##  5 2014/1            2,14~ 308,~ 6.9   NA    NA    NA    NA    NA    NA   
##  6 2013/1            2,08~ 306,~ 6.8   NA    NA    NA    NA    NA    NA   
##  7 2012              2,13~ 313,~ 6.8   NA    NA    NA    NA    NA    NA   
##  8 2011              2,11~ 311,~ 6.8   NA    NA    NA    NA    NA    NA   
##  9 2010              2,09~ 308,~ 6.8   NA    NA    NA    NA    NA    NA   
## 10 2009              2,08~ 306,~ 6.8   NA    NA    NA    NA    NA    NA   
## # ... with 50 more rows
tbl_df(data)
## # A tibble: 60 x 10
##    ï..Provisional.n~ X     X.1   X.2   X.3   X.4   X.5   X.6   X.7   X.8  
##    <fct>             <fct> <fct> <fct> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
##  1 ""                ""    ""    ""    NA    NA    NA    NA    NA    NA   
##  2 Year              Marr~ Popu~ Rate~ NA    NA    NA    NA    NA    NA   
##  3 2016              2,24~ 323,~ 6.9   NA    NA    NA    NA    NA    NA   
##  4 2015              2,22~ 321,~ 6.9   NA    NA    NA    NA    NA    NA   
##  5 2014/1            2,14~ 308,~ 6.9   NA    NA    NA    NA    NA    NA   
##  6 2013/1            2,08~ 306,~ 6.8   NA    NA    NA    NA    NA    NA   
##  7 2012              2,13~ 313,~ 6.8   NA    NA    NA    NA    NA    NA   
##  8 2011              2,11~ 311,~ 6.8   NA    NA    NA    NA    NA    NA   
##  9 2010              2,09~ 308,~ 6.8   NA    NA    NA    NA    NA    NA   
## 10 2009              2,08~ 306,~ 6.8   NA    NA    NA    NA    NA    NA   
## # ... with 50 more rows

Select the relevant columns from the data and rename the first column to something understandable.

data <- data %>%
  as.tbl() %>%
  select(ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016,
         X,
         X.1,
         X.2) %>%
  rename(num_marrage_rate = ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016)

data
## # A tibble: 60 x 4
##    num_marrage_rate X         X.1         X.2                            
##    <fct>            <fct>     <fct>       <fct>                          
##  1 ""               ""        ""          ""                             
##  2 Year             Marriages Population  Rate per 1,000 total population
##  3 2016             2,245,404 323,127,513 6.9                            
##  4 2015             2,221,579 321,418,820 6.9                            
##  5 2014/1           2,140,272 308,759,713 6.9                            
##  6 2013/1           2,081,301 306,136,672 6.8                            
##  7 2012             2,131,000 313,914,040 6.8                            
##  8 2011             2,118,000 311,591,917 6.8                            
##  9 2010             2,096,000 308,745,538 6.8                            
## 10 2009             2,080,000 306,771,529 6.8                            
## # ... with 50 more rows

Remove the “/1”-style footnote markers in the data, using a regular expression to search the column for what needs to be removed.

data1 <- data
data1$num_marrage_rate <- gsub("/\\d", "", data$num_marrage_rate)
data1
## # A tibble: 60 x 4
##    num_marrage_rate X         X.1         X.2                            
##    <chr>            <fct>     <fct>       <fct>                          
##  1 ""               ""        ""          ""                             
##  2 Year             Marriages Population  Rate per 1,000 total population
##  3 2016             2,245,404 323,127,513 6.9                            
##  4 2015             2,221,579 321,418,820 6.9                            
##  5 2014             2,140,272 308,759,713 6.9                            
##  6 2013             2,081,301 306,136,672 6.8                            
##  7 2012             2,131,000 313,914,040 6.8                            
##  8 2011             2,118,000 311,591,917 6.8                            
##  9 2010             2,096,000 308,745,538 6.8                            
## 10 2009             2,080,000 306,771,529 6.8                            
## # ... with 50 more rows

Remove the first two rows. The first two rows contain: 1. a blank row, and 2. a row that holds the column names. We’ll rename the columns next.

data1 <- data1[-c(1:2), ]

Rename the columns to something understandable: X becomes marriages, and so on.

data1 <- rename(data1, marriages = X, population = X.1, rate_per_1000 = X.2)
data1 <- rename(data1, year = num_marrage_rate)
data1
## # A tibble: 58 x 4
##    year  marriages population  rate_per_1000
##    <chr> <fct>     <fct>       <fct>        
##  1 2016  2,245,404 323,127,513 6.9          
##  2 2015  2,221,579 321,418,820 6.9          
##  3 2014  2,140,272 308,759,713 6.9          
##  4 2013  2,081,301 306,136,672 6.8          
##  5 2012  2,131,000 313,914,040 6.8          
##  6 2011  2,118,000 311,591,917 6.8          
##  7 2010  2,096,000 308,745,538 6.8          
##  8 2009  2,080,000 306,771,529 6.8          
##  9 2008  2,157,000 304,093,966 7.1          
## 10 2007  2,197,000 301,231,207 7.3          
## # ... with 48 more rows

Remove the commas. When trying to run calculations on the data, the commas were interfering; it turns out the numbers are actually stored as characters, not numbers.

data1$marriages <- gsub(",", "", data1$marriages)
data1$population <- gsub(",", "", data1$population)
data1
## # A tibble: 58 x 4
##    year  marriages population rate_per_1000
##    <chr> <chr>     <chr>      <fct>        
##  1 2016  2245404   323127513  6.9          
##  2 2015  2221579   321418820  6.9          
##  3 2014  2140272   308759713  6.9          
##  4 2013  2081301   306136672  6.8          
##  5 2012  2131000   313914040  6.8          
##  6 2011  2118000   311591917  6.8          
##  7 2010  2096000   308745538  6.8          
##  8 2009  2080000   306771529  6.8          
##  9 2008  2157000   304093966  7.1          
## 10 2007  2197000   301231207  7.3          
## # ... with 48 more rows

Convert the columns from character into numeric using as.integer().

data1$marriages <- as.integer(data1$marriages)
data1$population <- as.integer(data1$population)
head(data1)
## # A tibble: 6 x 4
##   year  marriages population rate_per_1000
##   <chr>     <int>      <int> <fct>        
## 1 2016    2245404  323127513 6.9          
## 2 2015    2221579  321418820 6.9          
## 3 2014    2140272  308759713 6.9          
## 4 2013    2081301  306136672 6.8          
## 5 2012    2131000  313914040 6.8          
## 6 2011    2118000  311591917 6.8
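
As an aside, readr’s parse_number() collapses the comma-stripping and type conversion into a single step; a sketch, applied to the character columns from before the gsub() step:

# equivalent to the gsub() + as.integer() pair above
data1$marriages  <- parse_number(data1$marriages)   # "2,245,404" -> 2245404
data1$population <- parse_number(data1$population)

(parse_number() returns doubles rather than integers, which is fine for these calculations.)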

Create a dataframe for just marriages. It’s easier to manipulate the marriage and divorce figures when they’re not stacked in the same columns.

df_marrages <- data1[1:17,]
df_marrages
## # A tibble: 17 x 4
##    year  marriages population rate_per_1000
##    <chr>     <int>      <int> <fct>        
##  1 2016    2245404  323127513 6.9          
##  2 2015    2221579  321418820 6.9          
##  3 2014    2140272  308759713 6.9          
##  4 2013    2081301  306136672 6.8          
##  5 2012    2131000  313914040 6.8          
##  6 2011    2118000  311591917 6.8          
##  7 2010    2096000  308745538 6.8          
##  8 2009    2080000  306771529 6.8          
##  9 2008    2157000  304093966 7.1          
## 10 2007    2197000  301231207 7.3          
## 11 2006    2193000  294077247 7.5          
## 12 2005    2249000  295516599 7.6          
## 13 2004    2279000  292805298 7.8          
## 14 2003    2245000  290107933 7.7          
## 15 2002    2290000  287625193 8.0          
## 16 2001    2326000  284968955 8.2          
## 17 2000    2315000  281421906 8.2

Create a dataframe for divorces.

df_divorce <- data1[30:46,]
df_divorce
## # A tibble: 17 x 4
##    year  marriages population rate_per_1000
##    <chr>     <int>      <int> <fct>        
##  1 2016     827261  257904548 3.2          
##  2 2015     800909  258518265 3.1          
##  3 2014     813862  256483624 3.2          
##  4 2013     832157  254408815 3.3          
##  5 2012     851000  248041986 3.4          
##  6 2011     877000  246273366 3.6          
##  7 2010     872000  244122529 3.6          
##  8 2009     840000  242610561 3.5          
##  9 2008     844000  240545163 3.5          
## 10 2007     856000  238352850 3.6          
## 11 2006     872000  236094277 3.7          
## 12 2005     847000  233495163 3.6          
## 13 2004     879000  236402656 3.7          
## 14 2003     927000  243902090 3.8          
## 15 2002     955000  243108303 3.9          
## 16 2001     940000  236416762 4.0          
## 17 2000     944000  233550143 4.0

Rename the marriages column to divorce.

df_divorce <- rename(df_divorce, divorce = marriages)
df_divorce
## # A tibble: 17 x 4
##    year  divorce population rate_per_1000
##    <chr>   <int>      <int> <fct>        
##  1 2016   827261  257904548 3.2          
##  2 2015   800909  258518265 3.1          
##  3 2014   813862  256483624 3.2          
##  4 2013   832157  254408815 3.3          
##  5 2012   851000  248041986 3.4          
##  6 2011   877000  246273366 3.6          
##  7 2010   872000  244122529 3.6          
##  8 2009   840000  242610561 3.5          
##  9 2008   844000  240545163 3.5          
## 10 2007   856000  238352850 3.6          
## 11 2006   872000  236094277 3.7          
## 12 2005   847000  233495163 3.6          
## 13 2004   879000  236402656 3.7          
## 14 2003   927000  243902090 3.8          
## 15 2002   955000  243108303 3.9          
## 16 2001   940000  236416762 4.0          
## 17 2000   944000  233550143 4.0

Crude marriage and divorce rates: the number of marriages (or divorces) per 1,000 people in the total population.

This does not take into account the number of people who can’t marry (kids, etc.), so it isn’t that accurate. We’ll see a better measure below, the divorce-to-marriage ratio.

df_marrages <- df_marrages %>%
  mutate(marrages_rate = marriages/population *1000)

df_marrages
## # A tibble: 17 x 5
##    year  marriages population rate_per_1000 marrages_rate
##    <chr>     <int>      <int> <fct>                 <dbl>
##  1 2016    2245404  323127513 6.9                    6.95
##  2 2015    2221579  321418820 6.9                    6.91
##  3 2014    2140272  308759713 6.9                    6.93
##  4 2013    2081301  306136672 6.8                    6.80
##  5 2012    2131000  313914040 6.8                    6.79
##  6 2011    2118000  311591917 6.8                    6.80
##  7 2010    2096000  308745538 6.8                    6.79
##  8 2009    2080000  306771529 6.8                    6.78
##  9 2008    2157000  304093966 7.1                    7.09
## 10 2007    2197000  301231207 7.3                    7.29
## 11 2006    2193000  294077247 7.5                    7.46
## 12 2005    2249000  295516599 7.6                    7.61
## 13 2004    2279000  292805298 7.8                    7.78
## 14 2003    2245000  290107933 7.7                    7.74
## 15 2002    2290000  287625193 8.0                    7.96
## 16 2001    2326000  284968955 8.2                    8.16
## 17 2000    2315000  281421906 8.2                    8.23
df_divorce <- df_divorce %>%
  mutate(divorse_rate = divorce/population*1000)

df_divorce
## # A tibble: 17 x 5
##    year  divorce population rate_per_1000 divorse_rate
##    <chr>   <int>      <int> <fct>                <dbl>
##  1 2016   827261  257904548 3.2                   3.21
##  2 2015   800909  258518265 3.1                   3.10
##  3 2014   813862  256483624 3.2                   3.17
##  4 2013   832157  254408815 3.3                   3.27
##  5 2012   851000  248041986 3.4                   3.43
##  6 2011   877000  246273366 3.6                   3.56
##  7 2010   872000  244122529 3.6                   3.57
##  8 2009   840000  242610561 3.5                   3.46
##  9 2008   844000  240545163 3.5                   3.51
## 10 2007   856000  238352850 3.6                   3.59
## 11 2006   872000  236094277 3.7                   3.69
## 12 2005   847000  233495163 3.6                   3.63
## 13 2004   879000  236402656 3.7                   3.72
## 14 2003   927000  243902090 3.8                   3.80
## 15 2002   955000  243108303 3.9                   3.93
## 16 2001   940000  236416762 4.0                   3.98
## 17 2000   944000  233550143 4.0                   4.04

Combine divorce and marriage dataframes into a single variable: df_combine

df_combine <- data.frame(c(df_marrages, df_divorce))

df_combine
##    year marriages population rate_per_1000 marrages_rate year.1 divorce
## 1  2016   2245404  323127513           6.9      6.948972   2016  827261
## 2  2015   2221579  321418820           6.9      6.911789   2015  800909
## 3  2014   2140272  308759713           6.9      6.931837   2014  813862
## 4  2013   2081301  306136672           6.8      6.798601   2013  832157
## 5  2012   2131000  313914040           6.8      6.788483   2012  851000
## 6  2011   2118000  311591917           6.8      6.797352   2011  877000
## 7  2010   2096000  308745538           6.8      6.788762   2010  872000
## 8  2009   2080000  306771529           6.8      6.780290   2009  840000
## 9  2008   2157000  304093966           7.1      7.093202   2008  844000
## 10 2007   2197000  301231207           7.3      7.293401   2007  856000
## 11 2006   2193000  294077247           7.5      7.457224   2006  872000
## 12 2005   2249000  295516599           7.6      7.610402   2005  847000
## 13 2004   2279000  292805298           7.8      7.783329   2004  879000
## 14 2003   2245000  290107933           7.7      7.738499   2003  927000
## 15 2002   2290000  287625193           8.0      7.961750   2002  955000
## 16 2001   2326000  284968955           8.2      8.162293   2001  940000
## 17 2000   2315000  281421906           8.2      8.226083   2000  944000
##    population.1 rate_per_1000.1 divorse_rate
## 1     257904548             3.2     3.207625
## 2     258518265             3.1     3.098075
## 3     256483624             3.2     3.173154
## 4     254408815             3.3     3.270944
## 5     248041986             3.4     3.430871
## 6     246273366             3.6     3.561083
## 7     244122529             3.6     3.571977
## 8     242610561             3.5     3.462339
## 9     240545163             3.5     3.508697
## 10    238352850             3.6     3.591314
## 11    236094277             3.7     3.693440
## 12    233495163             3.6     3.627484
## 13    236402656             3.7     3.718232
## 14    243902090             3.8     3.800705
## 15    243108303             3.9     3.928290
## 16    236416762             4.0     3.976029
## 17    233550143             4.0     4.041959

Select columns from the combined dataset, and rename some columns that were ‘duplicate’ names.

df_combine <- df_combine %>%
  select(year, marriages, population, marrages_rate, divorce, population.1, divorse_rate) %>%
  rename(population_m = population, population_d = population.1)
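
An alternative to data.frame(c(...)) that avoids the duplicated-name cleanup entirely is a keyed join; a sketch, assuming the year values line up across the two frames:

df_combine_alt <- inner_join(df_marrages, df_divorce,
                             by = "year", suffix = c("_m", "_d"))

The suffix argument labels the overlapping population and rate_per_1000 columns automatically.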

Visualization of marriages and divorces over the 2000-2016 period in the dataset.

Looking at the data, there seems to be an overall relationship: as the number of marriages increases, so does the number of divorces. But what does this actually tell us? It could just be that the total population has increased over those years…

ggplot(df_combine) + 
  geom_point(aes(year, population_m), color = 'blue') +
  geom_point(aes(year, population_d), color = 'red') +
  facet_grid(~year) +
  theme_minimal()

Divorce-to-marriage ratio.

The number of divorces relative to the number of marriages in a given year. This takes into account how many people actually got married!

d_2_m_ratio <- df_combine$divorce/df_combine$marriages
d_2_m_ratio
##  [1] 0.3684241 0.3605134 0.3802610 0.3998254 0.3993430 0.4140699 0.4160305
##  [8] 0.4038462 0.3912842 0.3896222 0.3976288 0.3766118 0.3856955 0.4129176
## [15] 0.4170306 0.4041273 0.4077754

Add ratio column to dataframe for our visualization.

df_combine <- df_combine %>%
  mutate(dm_ratio = divorce / marriages)

ggplot(df_combine, aes(year, dm_ratio)) +
  geom_point(aes(color = dm_ratio)) +
  theme_dark()

It looks like the divorce-to-marriage ratio has actually decreased over the 2000-2016 period in the dataset, from roughly 41% to 37%. It would be interesting if the data had more information about the individuals surveyed, for example the age at which they married, their education status, and their income level. All of these factors could have an impact on whether a couple gets divorced.

Doing some research online, it seems that couples who have a college degree have a much lower chance of being divorced.

Dataset 2 - 2014-15 To 2016-17 School-Level NYC Regents

Libraries:

library(tidyverse)
library(readr)
library(knitr)
library(kableExtra)

Read Data:

school_data <- read_csv("2014-15_To_2016-17_School-_Level_NYC_Regents_Report_For_All_Variables.csv")
## Parsed with column specification:
## cols(
##   `School DBN` = col_character(),
##   `School Name` = col_character(),
##   `School Type` = col_character(),
##   `School Level` = col_character(),
##   `Regents Exam` = col_character(),
##   Year = col_integer(),
##   `Demographic Category` = col_character(),
##   `Demographic Variable` = col_character(),
##   `Total Tested` = col_integer(),
##   `Mean Score` = col_character(),
##   `Number Scoring Below 65` = col_character(),
##   `Percent Scoring Below 65` = col_character(),
##   `Number Scoring 65 or Above` = col_character(),
##   `Percent Scoring 65 or Above` = col_character(),
##   `Number Scoring 80 or Above` = col_character(),
##   `Percent Scoring 80 or Above` = col_character(),
##   `Number Scoring CR` = col_character(),
##   `Percent Scoring CR` = col_character()
## )

Viewing the data to see what variables we have, using the head() function since there are over 212k rows.

head(as.tbl(school_data))
## # A tibble: 6 x 18
##   `School DBN` `School Name`   `School Type` `School Level` `Regents Exam`
##   <chr>        <chr>           <chr>         <chr>          <chr>         
## 1 01M034       P.S. 034 Frank~ General Acad~ K-8            Common Core A~
## 2 01M034       P.S. 034 Frank~ General Acad~ K-8            Living Enviro~
## 3 01M034       P.S. 034 Frank~ General Acad~ K-8            Living Enviro~
## 4 01M140       P.S. 140 Natha~ General Acad~ K-8            Common Core A~
## 5 01M140       P.S. 140 Natha~ General Acad~ K-8            Common Core A~
## 6 01M140       P.S. 140 Natha~ General Acad~ K-8            Living Enviro~
## # ... with 13 more variables: Year <int>, `Demographic Category` <chr>,
## #   `Demographic Variable` <chr>, `Total Tested` <int>, `Mean
## #   Score` <chr>, `Number Scoring Below 65` <chr>, `Percent Scoring Below
## #   65` <chr>, `Number Scoring 65 or Above` <chr>, `Percent Scoring 65 or
## #   Above` <chr>, `Number Scoring 80 or Above` <chr>, `Percent Scoring 80
## #   or Above` <chr>, `Number Scoring CR` <chr>, `Percent Scoring CR` <chr>

Looking at some of the variables, I notice that there are different schools, Regents exams (Living Environment, Common Core, etc.), demographics, and test score data. I used to be a science teacher in a public middle school, so it would be interesting to look at the science data.

unique(school_data$`Regents Exam`)
##  [1] "Common Core Algebra"             "Living Environment"             
##  [3] "Common Core English"             "Algebra2/Trigonometry"          
##  [5] "Common Core Algebra2"            "Common Core Geometry"           
##  [7] "English"                         "Geometry"                       
##  [9] "Global History and Geography"    "Integrated Algebra"             
## [11] "Physical Settings/Chemistry"     "Physical Settings/Earth Science"
## [13] "U.S. History and Government"     "Physical Settings/Physics"      
## [15] NA

Living Environment was a subject I taught in middle school, so I’m going to break the data down to only Living Environment exams.

This dataset covers various grade levels; I’m going to explore junior high/middle school.

unique(school_data$`School Level`)
## [1] "K-8"                             "High school"                    
## [3] "Junior High-Intermediate-Middle" "Secondary School"               
## [5] "K-12 all grades"                 "Elementary"

This data includes different student demographics; it might be interesting to compare their overall scores.

unique(school_data$`Demographic Variable`) 
##  [1] "All Students"                            
##  [2] "SWD"                                     
##  [3] "Non-SWD"                                 
##  [4] "ELL"                                     
##  [5] "English Proficient"                      
##  [6] "Former ELL"                              
##  [7] "Male"                                    
##  [8] "Female"                                  
##  [9] "White"                                   
## [10] "Black"                                   
## [11] "Hispanic"                                
## [12] "Asian"                                   
## [13] "Multiple Race Categories Not Represented"

Looking through the different schools listed… there are too many to display, but it would be interesting to see if my old school is listed…

head(unique(school_data$`School Name`))
## [1] "P.S. 034 Franklin D. Roosevelt"       
## [2] "P.S. 140 Nathan Straus"               
## [3] "P.S. 184m Shuang Wen"                 
## [4] "P.S. 188 The Island School"           
## [5] "Orchard Collegiate Academy"           
## [6] "Technology, Arts, and Sciences Studio"

Let’s see…

school_data %>%
  filter(`School Name` == "Marsh Avenue School for Expeditionary Learning")
## # A tibble: 83 x 18
##    `School DBN` `School Name`  `School Type` `School Level` `Regents Exam`
##    <chr>        <chr>          <chr>         <chr>          <chr>         
##  1 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
##  2 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
##  3 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
##  4 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ Living Enviro~
##  5 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ Living Enviro~
##  6 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ U.S. History ~
##  7 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ U.S. History ~
##  8 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ U.S. History ~
##  9 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
## 10 31R063       Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
## # ... with 73 more rows, and 13 more variables: Year <int>, `Demographic
## #   Category` <chr>, `Demographic Variable` <chr>, `Total Tested` <int>,
## #   `Mean Score` <chr>, `Number Scoring Below 65` <chr>, `Percent Scoring
## #   Below 65` <chr>, `Number Scoring 65 or Above` <chr>, `Percent Scoring
## #   65 or Above` <chr>, `Number Scoring 80 or Above` <chr>, `Percent
## #   Scoring 80 or Above` <chr>, `Number Scoring CR` <chr>, `Percent
## #   Scoring CR` <chr>

OK, now that I have some information on my dataset, let’s start taking a deeper look.

I noticed that the numeric data is entered as character type, so I need to fix that:

school_data1 <- school_data
school_data1$`Mean Score` <- as.double(school_data1$`Mean Score`)
## Warning: NAs introduced by coercion
school_data1$`Number Scoring Below 65` <- as.double(school_data1$`Number Scoring Below 65`)
## Warning: NAs introduced by coercion
school_data1$`Percent Scoring Below 65` <- as.double(school_data1$`Percent Scoring Below 65`)
## Warning: NAs introduced by coercion
school_data1$`Number Scoring 65 or Above` <- as.double(school_data1$`Number Scoring 65 or Above`)
## Warning: NAs introduced by coercion
school_data1$`Percent Scoring 65 or Above` <- as.double(school_data1$`Percent Scoring 65 or Above`)
## Warning: NAs introduced by coercion
school_data1$`Number Scoring 80 or Above` <- as.double(school_data1$`Number Scoring 80 or Above`)
## Warning: NAs introduced by coercion
school_data1$`Percent Scoring 80 or Above` <- as.double(school_data1$`Percent Scoring 80 or Above`)
## Warning: NAs introduced by coercion
head(school_data1)
## # A tibble: 6 x 18
##   `School DBN` `School Name`   `School Type` `School Level` `Regents Exam`
##   <chr>        <chr>           <chr>         <chr>          <chr>         
## 1 01M034       P.S. 034 Frank~ General Acad~ K-8            Common Core A~
## 2 01M034       P.S. 034 Frank~ General Acad~ K-8            Living Enviro~
## 3 01M034       P.S. 034 Frank~ General Acad~ K-8            Living Enviro~
## 4 01M140       P.S. 140 Natha~ General Acad~ K-8            Common Core A~
## 5 01M140       P.S. 140 Natha~ General Acad~ K-8            Common Core A~
## 6 01M140       P.S. 140 Natha~ General Acad~ K-8            Living Enviro~
## # ... with 13 more variables: Year <int>, `Demographic Category` <chr>,
## #   `Demographic Variable` <chr>, `Total Tested` <int>, `Mean
## #   Score` <dbl>, `Number Scoring Below 65` <dbl>, `Percent Scoring Below
## #   65` <dbl>, `Number Scoring 65 or Above` <dbl>, `Percent Scoring 65 or
## #   Above` <dbl>, `Number Scoring 80 or Above` <dbl>, `Percent Scoring 80
## #   or Above` <dbl>, `Number Scoring CR` <chr>, `Percent Scoring CR` <chr>
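
The seven one-off as.double() calls above can be collapsed into a single step; a sketch using dplyr’s across() (dplyr 1.0+; older versions would use mutate_at()):

school_data1 <- school_data %>%
  mutate(across(`Mean Score`:`Percent Scoring 80 or Above`, as.numeric))
# the same "NAs introduced by coercion" warnings apply: non-numeric entries become NA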

So the data has a lot of NAs. This makes things… frustrating, because some columns have more NAs than others. So I either need to remove every row that has an NA, or remove the NAs only when I work with those variables. I also notice that Number Scoring CR and Percent Scoring CR have many more NAs than the rest, so I’m going to drop those columns completely.

science_data <- school_data1 %>%
  filter(`Regents Exam` == "Living Environment", `School Level` == c("Junior High-Intermediate-Middle", "Secondary School")) %>%
  select(`School Name`, `School Level`, `Regents Exam`, Year, `Demographic Variable`, `Total Tested`, `Mean Score`,
         `Number Scoring 65 or Above`, `Number Scoring Below 65`, `Percent Scoring 80 or Above`, `Percent Scoring Below 65`)

science_data <- na.omit(science_data)  
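
One caution on the filter above: comparing `School Level` with == against a two-element vector recycles that vector down the column, so a row is kept only when its level happens to line up with the alternating pattern, and roughly half the matching rows are silently dropped. Testing membership with %in% keeps every row in either level; a sketch (the math_data chunk below has the same pattern):

science_data_all <- school_data1 %>%
  filter(`Regents Exam` == "Living Environment",
         `School Level` %in% c("Junior High-Intermediate-Middle", "Secondary School"))

Note this returns more rows than the == version, so the summaries below would shift somewhat.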

Let’s take a look at mean test scores on the Living Environment exam for middle school students by demographic variable.

science_data %>%
  group_by(`Demographic Variable`, `Mean Score`) %>%
  ggplot(aes(`Mean Score`)) + 
  geom_histogram(aes(fill = `Demographic Variable`)) +
  ggtitle("Living Environment - Middle School")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
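
The stat_bin() message is just ggplot2 asking for an explicit bin width; for scores on a roughly 0-100 scale, something like binwidth = 5 quiets it:

science_data %>%
  ggplot(aes(`Mean Score`)) +
  geom_histogram(aes(fill = `Demographic Variable`), binwidth = 5) +
  ggtitle("Living Environment - Middle School")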

science_data %>%
  group_by(`Demographic Variable`, `Mean Score`) %>%
  ggplot(aes(`Mean Score`, `Demographic Variable`)) + 
  geom_point(aes(fill = `Demographic Variable`),  alpha = 0.5) +
  ggtitle("Living Environment - Middle School")

science_data1 <- science_data %>%
  group_by(`Demographic Variable`) %>%
  summarize(mean(`Mean Score`))

kable(science_data1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  row_spec(2, bold = T, color = "white", background = "green")  %>%
  row_spec(13, bold = T, color = "white", background = "green")  %>%
  row_spec(4, bold = T, color = "white", background = "red")
Demographic Variable mean(Mean Score)
All Students 73.64864
Asian 81.71721
Black 71.81100
ELL 55.63714
English Proficient 74.56829
Female 73.21759
Former ELL 76.12270
Hispanic 72.59544
Male 75.53846
Multiple Race Categories Not Represented 75.24500
Non-SWD 74.38908
SWD 60.45306
White 81.65682

Conclusion:

Looking at this data, it seems that the White and Asian demographics have the highest mean Living Environment scores for middle school students. The lowest mean score belongs to English Language Learners, at 55.6; on average they score about 15 points lower than the other groups. It’s interesting that former ELL students are on par with the other demographic groups.

The Living Environment exam is broken into different sections, with a lot of reading comprehension. One section is a hands-on lab, where students must follow a set of instructions without any guidance from the teacher. This could be one reason why ELL students struggle with this exam.

I’m interested in how these scores compare on an exam that is less reading-heavy, like math. (Though with the Common Core, math is now quite reading-heavy too.)

Let’s take a look!

Math has many different exams, and Living Environment is technically a 9th-grade exam that students can take in 8th grade. Let’s break down the Integrated Algebra exam, which I believe is likewise a 9th-grade exam that can be taken in middle school.

math_data <- school_data1 %>%
  filter(`Regents Exam` == "Integrated Algebra", `School Level` == c("Junior High-Intermediate-Middle", "Secondary School")) %>%
  select(`School Name`, `School Level`, `Regents Exam`, Year, `Demographic Variable`, `Total Tested`, `Mean Score`,
         `Number Scoring 65 or Above`, `Number Scoring Below 65`, `Percent Scoring 80 or Above`, `Percent Scoring Below 65`)

math_data <- na.omit(math_data)  
math_data %>%
  group_by(`Demographic Variable`, `Mean Score`) %>%
  ggplot(aes(`Mean Score`)) + 
  geom_histogram(aes(fill = `Demographic Variable`)) +
  ggtitle("Intermediate Algebra - Middle School")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

math_data %>%
  group_by(`Demographic Variable`, `Mean Score`) %>%
  ggplot(aes(`Mean Score`, `Demographic Variable`)) + 
  geom_point(aes(fill = `Demographic Variable`),  alpha = 0.5) +
  ggtitle("Intermediate Algebra - Middle School")

math_data1 <- math_data %>%
  group_by(`Demographic Variable`) %>%
  summarize(mean(`Mean Score`))

kable(math_data1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  row_spec(2, bold = T, color = "white", background = "green") %>%
  row_spec(11, bold = T, color = "white", background = "red") %>%
  row_spec(4, bold = T, color = "white", background = "lightblue")
Demographic Variable mean(Mean Score)
All Students 67.71231
Asian 76.13333
Black 65.85000
ELL 61.24375
English Proficient 67.97083
Female 67.71091
Former ELL 70.15714
Hispanic 67.10541
Male 67.82000
Non-SWD 68.57347
SWD 58.88000
White 73.11429

Conclusions:

It looks like the ELL mean score has increased by about 5 points, while the mean scores for Asian and White students decreased by about 5 and 8 points, respectively. The lowest-scoring group on Integrated Algebra is now SWD: students with disabilities.

It’s also interesting to look at the spread of the data in the point graphs. The math scores are much more spread out than the science scores.

Let’s compare my old school to the whole NYC dataset.

former_school <- school_data1 %>%
  filter(`Regents Exam` == "Living Environment", `School Name` == "Marsh Avenue School for Expeditionary Learning") %>%
  select(`School Name`, `School Level`, `Regents Exam`, Year, `Demographic Variable`, `Total Tested`, `Mean Score`,
         `Number Scoring 65 or Above`, `Number Scoring Below 65`, `Percent Scoring 80 or Above`, `Percent Scoring Below 65`)

former_school <- na.omit(former_school)  

former_school
## # A tibble: 11 x 11
##    `School Name`    `School Level`   `Regents Exam`  Year `Demographic Va~
##    <chr>            <chr>            <chr>          <int> <chr>           
##  1 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2016 All Students    
##  2 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2017 All Students    
##  3 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2017 Non-SWD         
##  4 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2016 Female          
##  5 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2016 Male            
##  6 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2017 Female          
##  7 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2017 Male            
##  8 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2016 Asian           
##  9 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2016 White           
## 10 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2017 Asian           
## 11 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~  2017 White           
## # ... with 6 more variables: `Total Tested` <int>, `Mean Score` <dbl>,
## #   `Number Scoring 65 or Above` <dbl>, `Number Scoring Below 65` <dbl>,
## #   `Percent Scoring 80 or Above` <dbl>, `Percent Scoring Below 65` <dbl>
former_school %>%
  group_by(`Demographic Variable`, `Mean Score`) %>%
  ggplot(aes(`Mean Score`)) + 
  geom_histogram(aes(fill = `Demographic Variable`)) +
  ggtitle("Living Environment - Marsh Avenue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

former_school %>%
  group_by(`Demographic Variable`, `Mean Score`) %>%
  ggplot(aes(`Mean Score`, `Demographic Variable`)) + 
  geom_point( alpha = 0.5) +
  ggtitle("Living Environment - Marsh Avenue")

former_school %>%
  group_by(`Mean Score`) %>%
  ggplot(aes(`Mean Score`, `Demographic Variable`)) + 
  geom_density(aes(fill = `Demographic Variable`),  alpha = 0.5) +
  ggtitle("Living Environment - Marsh Avenue")

former_school1 <- former_school %>%
  group_by(`Demographic Variable`) %>%
  summarize(mean(`Mean Score`))

kable(former_school1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T)
Demographic Variable mean(Mean Score)
All Students 83.30
Asian 87.30
Female 82.35
Male 84.45
Non-SWD 81.70
White 81.65

Based on the students who took the Living Environment Regents, this school’s “All Students” mean of 83.30 was higher than the citywide middle-school mean of 73.65 computed earlier.

Let’s take a look at how students’ Regents scores have progressed over the years in our dataset:

science_data %>%
  ggplot(aes(Year, `Mean Score`)) + 
  geom_point() +
  aes(colour = factor(Year)) + 
  stat_summary()
## No summary function supplied, defaulting to `mean_se()

It looks like, as a whole, students’ Living Environment scores have increased over the three years in the dataset. As a former science teacher, I know the science standards have only increased in complexity, and teachers are continually held to a high standard.

math_data %>%
  ggplot(aes(Year, `Mean Score`)) + 
  geom_point() +
  aes(colour = factor(Year)) + 
  stat_summary()
## No summary function supplied, defaulting to `mean_se()

Dataset 3 - CDC Chronic Disease Data

Libraries:

library(geofacet)
library(tidyverse)

Read in data:

CDC <- read.csv("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA607/Week5/dataset/U.S._Chronic_Disease_Indicators__CDI.csv", stringsAsFactors = F)

The data consists of many variables and over 50k rows. I want to break the data down with a tidyverse filter to only the year 2013.

colnames(CDC)
##  [1] "ï..Year"                 "LocationAbbr"           
##  [3] "LocationDesc"            "Category"               
##  [5] "Indicator"               "Datasource"             
##  [7] "DataValueUnit"           "DataValueType"          
##  [9] "DataValue"               "DataValueAlt"           
## [11] "DataValueFootnoteSymbol" "DataValueFootnote"      
## [13] "Gender"                  "StratificationID1"      
## [15] "IndicatorID"             "LocationID"             
## [17] "LowConfidenceInterval"   "HighConfidenceInterval" 
## [19] "GeoLocation"
unique(CDC$ï..Year)
##  [1] "2013"      "2011"      "2012"      "2010"      "2007-2011"
##  [6] "2006-2010" "2009"      "2009-2011" "2008"      "2011-2012"
## [11] "2001"      "2007"
unique(CDC$Category)
##  [1] "Alcohol"                                        
##  [2] "Arthritis"                                      
##  [3] "Asthma"                                         
##  [4] "Cancer"                                         
##  [5] "Chronic Kidney Disease"                         
##  [6] "Chronic Obstructive Pulmonary Disease"          
##  [7] "Cardiovascular Disease"                         
##  [8] "Diabetes"                                       
##  [9] "Disability"                                     
## [10] "Immunization"                                   
## [11] "Mental Health"                                  
## [12] "Nutrition, physical activity, and weight status"
## [13] "Nutrition, Physical Activity, and Weight Status"
## [14] "Older Adults"                                   
## [15] "Oral Health"                                    
## [16] "Overarching Conditions"                         
## [17] "Reproductive Health"                            
## [18] "Tobacco"

In this chunk of code, I’m saving the result as CDC1, selecting the variables I want to work with, filtering for the year 2013, and renaming some columns to more user-friendly names.

CDC1 <- CDC %>%
  select(ï..Year, LocationAbbr, Category, Indicator, IndicatorID, LowConfidenceInterval, HighConfidenceInterval) %>%
  filter(ï..Year == 2013) %>%
  rename(year = ï..Year, LocationID = LocationAbbr, LowCI = LowConfidenceInterval, 
         HighCI = HighConfidenceInterval)

head(CDC1)
##   year LocationID Category               Indicator IndicatorID LowCI
## 1 2013         AL  Alcohol Alcohol use among youth      ALC1_1  30.1
## 2 2013         AK  Alcohol Alcohol use among youth      ALC1_1  19.3
## 3 2013         AZ  Alcohol Alcohol use among youth      ALC1_1  31.4
## 4 2013         AR  Alcohol Alcohol use among youth      ALC1_1  32.3
## 5 2013         CA  Alcohol Alcohol use among youth      ALC1_1    NA
## 6 2013         CO  Alcohol Alcohol use among youth      ALC1_1    NA
##   HighCI
## 1   40.3
## 2   26.1
## 3   40.9
## 4   40.4
## 5     NA
## 6     NA

For the year 2013 there are still over 24k rows of data.

The next chunk of code breaks down all of the categories in the dataset. As you can see below, many different categories of chronic-disease indicators are included.

#Entire 2013
ggplot(CDC1, aes(Category)) + geom_bar(aes(fill = Category)) + 
  coord_flip()

According to this data, for the year 2013 diabetes is the most frequently reported disease category in this dataset.
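
Rather than eyeballing the bar chart, a quick tally makes the ranking explicit; a sketch:

CDC1 %>%
  count(Category, sort = TRUE) %>%
  head()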

I wonder how this relates to each state in the USA. Do certain states have higher percentages of certain types of diseases?

ggplot(CDC1, aes(Category, LocationID)) + 
  geom_col(aes(fill=Category)) +
  facet_geo(~LocationID, grid = "us_state_grid2") +
  coord_flip() +
  theme(axis.text.y=element_blank()) +
  theme(legend.position = "none")
## Some values in the specified facet_geo column 'LocationID' do not
##   match the 'code' column of the specified grid and will be
##   removed: GU, PR, US, VI

I removed the legend so the graph is easier to read (the color mapping is the same as in the chart above).

CDC1 %>%
  group_by(Category) %>%
  filter(Category == c("Alcohol", "Tobacco", "Cardiovascular Disease")) %>%
  summarise(min(LocationID), max(LocationID))
## # A tibble: 3 x 3
##   Category               `min(LocationID)` `max(LocationID)`
##   <chr>                  <chr>             <chr>            
## 1 Alcohol                AK                WY               
## 2 Cardiovascular Disease AK                WY               
## 3 Tobacco                AK                WY

Careful here: min() and max() on the character column LocationID just return the alphabetically first and last state codes (AK and WY), not the states with the fewest or most reported chronic-disease records for Alcohol, Tobacco, and Cardiovascular Disease. To see which state actually reports the most, we need to count rows per state, as sketched below.
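
A sketch of that per-state count (sort = TRUE puts the most-reported states first):

CDC1 %>%
  filter(Category %in% c("Alcohol", "Tobacco", "Cardiovascular Disease")) %>%
  count(LocationID, sort = TRUE) %>%
  head()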

The dataset has lots of variables that affect one’s health. There are a few I’m most interested in looking at, starting with tobacco.

This code chunk breaks down 2013 into only tobacco-related variables.

# 2013 Tobacco
CDC_Tobacco <- CDC1 %>%
  group_by(LocationID, Category) %>%
  filter(Category %in% c("Tobacco")) 

CDC_Tobacco
## # A tibble: 2,090 x 7
## # Groups:   LocationID, Category [55]
##    year  LocationID Category Indicator            IndicatorID LowCI HighCI
##    <chr> <chr>      <chr>    <chr>                <chr>       <dbl>  <dbl>
##  1 2013  AL         Tobacco  Current cigarette s~ TOB1_1      15.5    20.8
##  2 2013  AK         Tobacco  Current cigarette s~ TOB1_1       8.20   13.4
##  3 2013  AZ         Tobacco  Current cigarette s~ TOB1_1      11.8    16.6
##  4 2013  AR         Tobacco  Current cigarette s~ TOB1_1      16.4    22.2
##  5 2013  CA         Tobacco  Current cigarette s~ TOB1_1      NA      NA  
##  6 2013  CO         Tobacco  Current cigarette s~ TOB1_1      NA      NA  
##  7 2013  CT         Tobacco  Current cigarette s~ TOB1_1      11.1    16.3
##  8 2013  DE         Tobacco  Current cigarette s~ TOB1_1      12.5    16.0
##  9 2013  DC         Tobacco  Current cigarette s~ TOB1_1      NA      NA  
## 10 2013  FL         Tobacco  Current cigarette s~ TOB1_1       9.70   12.0
## # ... with 2,080 more rows

How does the reported tobacco data compare to that of cardiovascular health?

# Tobacco record count
CDC_Tobaccocompare <- CDC1 %>%
  group_by(LocationID, Category) %>%
  filter(Category %in% c("Tobacco"))

CDC_Tobaccocompare %>%
  group_by(Category) %>%
  filter(Category == c("Tobacco")) %>%
  summarise(n=n())
## # A tibble: 1 x 2
##   Category     n
##   <chr>    <int>
## 1 Tobacco   2090

In our filtered data there are a total of 2,090 tobacco-related indicator records. Looking into the data, it doesn’t seem that the records indicate whether an individual who reported chronic tobacco use also had cardiovascular disease. The data is reported on a record-by-record basis, which makes it hard to draw relationships between the two, e.g. whether people who suffered from tobacco-related issues also had cardiovascular issues.

The following code chunk attempts to break down the tobacco number (2,090) alongside the cardiovascular disease records.

CDC_Tobaccocompare <- CDC1 %>%
  group_by(LocationID, Category) %>%
  filter(Category %in% c("Tobacco", "Cardiovascular Disease" ))

CDC_Tobaccocompare %>%
  group_by(Category) %>%
  filter(Category == c("Tobacco", "Cardiovascular Disease")) %>%
  summarise(n=n())
## Warning in Category == c("Tobacco", "Cardiovascular Disease"): longer
## object length is not a multiple of shorter object length
## # A tibble: 2 x 2
##   Category                   n
##   <chr>                  <int>
## 1 Cardiovascular Disease  1347
## 2 Tobacco                 1045
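
The warning above is the tell: Category == c("Tobacco", "Cardiovascular Disease") recycles the two-element vector down the column, so roughly every other row is dropped and the Tobacco count falls from 2,090 to 1,045. Since CDC_Tobaccocompare was already restricted to the two categories with %in%, a plain count gives the true totals; a sketch:

CDC_Tobaccocompare %>%
  ungroup() %>%        # drop the LocationID grouping before counting
  count(Category)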

What indicator most closely relates to cardiovascular issues?

Within the 2013 data, in pure number of reported incidents, Cardiovascular Disease and Nutrition are the most comparable. In today’s society, we know there is a link between being overweight and having heart problems. Most reported incidents for 2013 were in fact in those two categories, with tobacco following closely behind. It would have been nice to know whether an individual suffered from more than one disease; the way the data is broken down is reported cases by state, each in its own category. Based on that, it’s hard to say that nutrition issues “cause” cardiovascular disease, but intuitively one can say that they are related.

CDC_heartdisease <- CDC1 %>%
  group_by(LocationID, Category) %>%
  filter(Category %in% c("Nutrition, physical activity, and weight status", "Tobacco", "Alcohol", "Cardiovascular Disease"))

ggplot(CDC_heartdisease, aes(Category)) + 
  geom_bar(aes(fill = Category))