Dataset 1: New York City Leading Causes of Death

DataSet Information:

The leading causes of death by sex and ethnicity in New York City since 2007.

The data can be downloaded from https://data.cityofnewyork.us/api/views/jb7j-dtam/rows.csv?accessType=DOWNLOAD

Attribute Information:

  • Year

  • Leading.Cause

  • Sex

  • Race.Ethnicity

  • Deaths

  • Death.Rate

  • Age.Adjusted.Death.Rate

Load the file from the URL.

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(tidyr)
## Loading required package: tidyr
require(knitr)
## Loading required package: knitr
require(ggplot2)
## Loading required package: ggplot2
nyc_data<-read.csv("https://data.cityofnewyork.us/api/views/jb7j-dtam/rows.csv?accessType=DOWNLOAD", sep = ",")

str(nyc_data)
## 'data.frame':    1094 obs. of  7 variables:
##  $ Year                   : int  2014 2011 2008 2010 2012 2007 2011 2007 2012 2009 ...
##  $ Leading.Cause          : Factor w/ 26 levels "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)",..: 12 7 19 19 7 21 2 10 14 13 ...
##  $ Sex                    : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 1 1 1 2 ...
##  $ Race.Ethnicity         : Factor w/ 6 levels "Asian and Pacific Islander",..: 5 6 4 3 2 4 4 2 3 3 ...
##  $ Deaths                 : Factor w/ 465 levels ".","10","1007",..: 24 269 409 10 132 1 87 122 4 85 ...
##  $ Death.Rate             : Factor w/ 442 levels ".","10","10.1",..: 1 189 1 409 167 1 1 100 393 62 ...
##  $ Age.Adjusted.Death.Rate: Factor w/ 427 levels ".","10","10.1",..: 1 156 1 424 231 1 1 77 412 225 ...
nyc_data <- subset(nyc_data, nyc_data$Deaths != '.')

str(nyc_data)
## 'data.frame':    956 obs. of  7 variables:
##  $ Year                   : int  2014 2011 2008 2010 2012 2011 2007 2012 2009 2008 ...
##  $ Leading.Cause          : Factor w/ 26 levels "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)",..: 12 7 19 19 7 2 10 14 13 7 ...
##  $ Sex                    : Factor w/ 2 levels "F","M": 1 2 2 1 2 1 1 1 2 1 ...
##  $ Race.Ethnicity         : Factor w/ 6 levels "Asian and Pacific Islander",..: 5 6 4 3 2 4 2 3 3 4 ...
##  $ Deaths                 : Factor w/ 465 levels ".","10","1007",..: 24 269 409 10 132 87 122 4 85 64 ...
##  $ Death.Rate             : Factor w/ 442 levels ".","10","10.1",..: 1 189 1 409 167 1 100 393 62 1 ...
##  $ Age.Adjusted.Death.Rate: Factor w/ 427 levels ".","10","10.1",..: 1 156 1 424 231 1 77 412 225 1 ...

There are 956 observations after removing rows with “.” and 7 columns.

Analysis:

1. Find the leading cause of death since 2007 for both males and female

        analysis1 <- nyc_data  %>%
      select(Year, Sex, Leading.Cause, DeathCount = ((as.numeric(as.character(Deaths)) ))) %>% 
      group_by(Year, Sex, Leading.Cause) %>% 
      summarise ( Total = sum(as.numeric(as.character(DeathCount))) )  %>%
            select(Year, Sex, Leading.Cause,Total) %>%
           arrange((Year), Sex ) %>%
           top_n(1,Total)   
           

           kable(analysis1)
Year Sex Leading.Cause Total
2007 F Diseases of Heart (I00-I09, I11, I13, I20-I51) 11743
2007 M Diseases of Heart (I00-I09, I11, I13, I20-I51) 9698
2008 F Diseases of Heart (I00-I09, I11, I13, I20-I51) 11605
2008 M Diseases of Heart (I00-I09, I11, I13, I20-I51) 9587
2009 F Diseases of Heart (I00-I09, I11, I13, I20-I51) 10780
2009 M Diseases of Heart (I00-I09, I11, I13, I20-I51) 9304
2010 F Diseases of Heart (I00-I09, I11, I13, I20-I51) 9463
2010 M Diseases of Heart (I00-I09, I11, I13, I20-I51) 8464
2011 F Diseases of Heart (I00-I09, I11, I13, I20-I51) 9083
2011 M Diseases of Heart (I00-I09, I11, I13, I20-I51) 7817
2012 F Diseases of Heart (I00-I09, I11, I13, I20-I51) 8777
2012 M Diseases of Heart (I00-I09, I11, I13, I20-I51) 7954
2013 F Diseases of Heart (I00-I09, I11, I13, I20-I51) 8745
2013 M Diseases of Heart (I00-I09, I11, I13, I20-I51) 8014
2014 F Diseases of Heart (I00-I09, I11, I13, I20-I51) 8556
2014 M Diseases of Heart (I00-I09, I11, I13, I20-I51) 7961

2. Find the leading cause of death since 2007 by Race.Ethnicity

           analysis2 <- nyc_data  %>%
select(Year, Ethnicity=Race.Ethnicity, Leading.Cause, DeathCount=(as.numeric(as.character(Deaths)) )) %>% 
           group_by(Year, Ethnicity,Leading.Cause) %>% 
    summarise ( Total = sum(as.numeric(as.character(DeathCount))) )  %>%
            select(Year, Ethnicity, Leading.Cause, Total) %>%
           arrange(desc(Year), Ethnicity,Leading.Cause) %>%
           top_n(1,Total) 
           
          kable(analysis2)
Year Ethnicity Leading.Cause Total
2014 Asian and Pacific Islander Malignant Neoplasms (Cancer: C00-C97) 1159
2014 Black Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 4152
2014 Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 2511
2014 Not Stated/Unknown Diseases of Heart (I00-I09, I11, I13, I20-I51) 210
2014 Other Race/ Ethnicity All Other Causes 133
2014 White Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 8497
2013 Asian and Pacific Islander Malignant Neoplasms (Cancer: C00-C97) 1100
2013 Black Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 4233
2013 Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 2568
2013 Not Stated/Unknown Diseases of Heart (I00-I09, I11, I13, I20-I51) 257
2013 Other Race/ Ethnicity Diseases of Heart (I00-I09, I11, I13, I20-I51) 113
2013 White Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 8620
2012 Asian and Pacific Islander Malignant Neoplasms (Cancer: C00-C97) 1083
2012 Black Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 4209
2012 Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 2514
2012 Not Stated/Unknown Diseases of Heart (I00-I09, I11, I13, I20-I51) 175
2012 Other Race/ Ethnicity Diseases of Heart (I00-I09, I11, I13, I20-I51) 87
2012 White Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 8875
2011 Asian and Pacific Islander Malignant Neoplasms (Cancer: C00-C97) 1001
2011 Black Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 4083
2011 Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 2549
2011 Not Stated/Unknown Diseases of Heart (I00-I09, I11, I13, I20-I51) 105
2011 Other Race/ Ethnicity Diseases of Heart (I00-I09, I11, I13, I20-I51) 73
2011 White Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 9236
2010 Asian and Pacific Islander Malignant Neoplasms (Cancer: C00-C97) 940
2010 Black Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 4297
2010 Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 2670
2010 Not Stated/Unknown Diseases of Heart (I00-I09, I11, I13, I20-I51) 140
2010 Other Race/ Ethnicity Diseases of Heart (I00-I09, I11, I13, I20-I51) 102
2010 White Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 9846
2009 Asian and Pacific Islander Diseases of Heart (I00-I09, I11, I13, I20-I51) 1002
2009 Black Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 4603
2009 Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 2731
2009 Not Stated/Unknown Diseases of Heart (I00-I09, I11, I13, I20-I51) 205
2009 Other Race/ Ethnicity Diseases of Heart (I00-I09, I11, I13, I20-I51) 78
2009 White Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 11465
2008 Asian and Pacific Islander Diseases of Heart (I00-I09, I11, I13, I20-I51) 1002
2008 Black Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 4802
2008 Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 2775
2008 Not Stated/Unknown Diseases of Heart (I00-I09, I11, I13, I20-I51) 200
2008 Other Race/ Ethnicity Diseases of Heart (I00-I09, I11, I13, I20-I51) 74
2008 White Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 12339
2007 Asian and Pacific Islander Diseases of Heart (I00-I09, I11, I13, I20-I51) 924
2007 Black Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 4843
2007 Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 2745
2007 Not Stated/Unknown Diseases of Heart (I00-I09, I11, I13, I20-I51) 168
2007 Other Race/ Ethnicity Diseases of Heart (I00-I09, I11, I13, I20-I51) 79
2007 White Non-Hispanic Diseases of Heart (I00-I09, I11, I13, I20-I51) 12682

3. Since heart disease was the leading cause, find the count of this disease since 2007.

        wide <- nyc_data %>% 
        select(Year, Leading.Cause, Deaths) %>% 
        filter(Leading.Cause == "Diseases of Heart (I00-I09, I11, I13, I20-I51)") %>%
        group_by(Year, Leading.Cause) %>%
        summarise ( Total = sum(as.numeric(as.character(Deaths) )  )) %>%
        spread(Year, Total)  
   
        kable(wide)
Leading.Cause 2007 2008 2009 2010 2011 2012 2013 2014
Diseases of Heart (I00-I09, I11, I13, I20-I51) 21441 21192 20084 17927 16900 16731 16759 16517

Conclusion: The disease rate is decreasing in a very slower phase.

Dataset 2: Projected Population 2010-2040 - Total By Age Groups

DataSet Information:

Projected total New York City population for five intervals from 2010 through 2040 by Borough, broken down by 18 age cohorts. (Age groups may not add up to the total due to rounding.)

The data set can be downloaded from https://data.cityofnewyork.us/api/views/97pn-acdf/rows.csv?accessType=DOWNLOAD

Attribute Information:

  • Borough

  • Age

  • 2010

  • 2015

  • 2020

  • 2025

  • 2030

  • 2035

  • 2040

Load the file from the URL.

population<-read.csv("https://data.cityofnewyork.us/api/views/97pn-acdf/rows.csv?accessType=DOWNLOAD", sep = ",")

str(population)
## 'data.frame':    114 obs. of  9 variables:
##  $ Borough: Factor w/ 6 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Age    : Factor w/ 19 levels "0-4","10-14",..: 1 3 4 5 6 7 8 9 11 12 ...
##  $ X2010  : int  521990 539844 647483 736105 667657 592299 571825 570273 546204 479661 ...
##  $ X2015  : int  535209 505783 646075 770396 707726 611239 550097 535998 552074 493997 ...
##  $ X2020  : int  545778 492532 606203 763956 743916 649594 569628 517668 520597 501239 ...
##  $ X2025  : int  547336 519298 591683 715824 740268 684249 606185 537516 504322 474319 ...
##  $ X2030  : int  542426 535024 625253 698195 693684 682964 638148 571723 523815 459574 ...
##  $ X2035  : int  540523 546062 643728 740437 675497 639237 637517 600792 556586 477052 ...
##  $ X2040  : int  546426 546750 657403 762757 715486 621899 596493 600514 584164 506390 ...

This data set has 114 observations and 9 columns.

Analysis:

1. Analyze the projected population in 2020 for all 5 Boroughs in different age groups.

    head(population)
##     Borough   Age  X2010  X2015  X2020  X2025  X2030  X2035  X2040
## 1 NYC Total   0-4 521990 535209 545778 547336 542426 540523 546426
## 2 NYC Total 15-19 539844 505783 492532 519298 535024 546062 546750
## 3 NYC Total 20-24 647483 646075 606203 591683 625253 643728 657403
## 4 NYC Total 25-29 736105 770396 763956 715824 698195 740437 762757
## 5 NYC Total 30-34 667657 707726 743916 740268 693684 675497 715486
## 6 NYC Total 35-39 592299 611239 649594 684249 682964 639237 621899
# remove the rows with total in columns Borough and Age.

    analysis2.1 <- population  %>%
        select(Borough,Age,X2020 ) %>%      
         filter( Borough != "NYC Total" ) %>% 
         filter( Age != "Total" ) %>% 
         spread(Age,X2020)

         kable(analysis2.1)
Borough 0-4 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 5-9 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85+
Bronx 109972 95703 98987 113827 127995 109797 93530 87217 83496 105775 89495 85632 73506 56988 43380 31168 20292 20028
Brooklyn 186886 166761 158561 180072 221421 221544 205525 181452 159615 180252 152501 144647 137606 113940 92570 61041 41717 42341
Manhattan 82096 60331 69487 118301 180070 190129 146480 110439 94732 68243 94242 88669 84257 76334 64227 47159 31115 31971
Queens 138141 129351 134198 162692 200730 189015 172928 160306 149550 132342 151019 147504 137219 107634 82663 57287 38275 39441
Staten Island 28683 30307 31299 31311 33740 33431 31131 30214 30275 29394 33340 34787 31599 25968 21232 13581 8259 8604

2. Analyze the projected percentage change in population between 2020 and 2040 for all 5 Boroughs.

    analysis2.2 <- population  %>%
        select(Borough,Age,X2020,X2040 ) %>%      
         filter( Borough != "NYC Total" ) %>% 
         filter( Age == "Total" ) %>% 
         select(Borough,X2020,X2040) %>% 
          mutate(Pct.Change = round(((X2040 - X2020)/X2040) * 100, 2)) %>%
           rename(Year_2020=X2020, Year_2040=X2040) 
       
           kable(analysis2.2)
Borough Year_2020 Year_2040 Pct.Change
Bronx 1446788 1579245 8.39
Brooklyn 2648452 2840525 6.76
Manhattan 1638282 1691617 3.15
Queens 2330295 2412649 3.41
Staten Island 487155 501109 2.78

Dataset 3: SAT Results

DataSet Information:

The most recent school level results for New York City on the SAT. Results are available at the school level for the graduating seniors of 2012. Records contain 2012 College-bound seniors mean SAT scores taken during SY 2012.

The data set can be downloaded from the below URL: https://data.cityofnewyork.us/api/views/f9bf-2cp4/rows.csv?accessType=DOWNLOAD

Attribute Information:

  • DBN

  • SCHOOL NAME

  • Num of SAT Test Takers

  • SAT Critical Reading Avg. Score

  • SAT Math Avg. Score

  • SAT Writing Avg. Score

Load the file from the URL.

sat_result<-read.csv("https://data.cityofnewyork.us/api/views/f9bf-2cp4/rows.csv?accessType=DOWNLOAD", sep = ",")

str(sat_result)
## 'data.frame':    478 obs. of  6 variables:
##  $ DBN                            : Factor w/ 478 levels "01M292","01M448",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ SCHOOL.NAME                    : Factor w/ 478 levels "47 THE AMERICAN SIGN LANGUAGE AND ENGLISH SECONDARY SCHOOL",..: 204 438 141 169 309 296 333 110 38 1 ...
##  $ Num.of.SAT.Test.Takers         : Factor w/ 175 levels "10","101","102",..: 68 166 136 135 98 11 35 42 20 36 ...
##  $ SAT.Critical.Reading.Avg..Score: Factor w/ 164 levels "279","287","300",..: 34 62 56 93 69 17 145 95 159 74 ...
##  $ SAT.Math.Avg..Score            : Factor w/ 173 levels "312","315","317",..: 70 87 68 67 94 153 158 82 164 66 ...
##  $ SAT.Writing.Avg..Score         : Factor w/ 163 levels "286","291","297",..: 45 48 52 41 66 11 147 92 158 69 ...

Tidy the data set.

# remove the rows that does not have scores
sat_result1 <- sat_result %>% 
               filter(Num.of.SAT.Test.Takers != 's' )

str(sat_result1)
## 'data.frame':    421 obs. of  6 variables:
##  $ DBN                            : Factor w/ 478 levels "01M292","01M448",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ SCHOOL.NAME                    : Factor w/ 478 levels "47 THE AMERICAN SIGN LANGUAGE AND ENGLISH SECONDARY SCHOOL",..: 204 438 141 169 309 296 333 110 38 1 ...
##  $ Num.of.SAT.Test.Takers         : Factor w/ 175 levels "10","101","102",..: 68 166 136 135 98 11 35 42 20 36 ...
##  $ SAT.Critical.Reading.Avg..Score: Factor w/ 164 levels "279","287","300",..: 34 62 56 93 69 17 145 95 159 74 ...
##  $ SAT.Math.Avg..Score            : Factor w/ 173 levels "312","315","317",..: 70 87 68 67 94 153 158 82 164 66 ...
##  $ SAT.Writing.Avg..Score         : Factor w/ 163 levels "286","291","297",..: 45 48 52 41 66 11 147 92 158 69 ...
# split the DBN column to School district and code
school_District <- substring(sat_result1$DBN,1,3 )
school_code <- substring(sat_result1$DBN,4,6 )

#change the name of the columns
names(sat_result1)[3:6] <- c('No_of_SAT_takers','Reading_Score','Math_Score','Writing_Score')

sat_result2 <- sat_result1 %>%
  mutate(school_District, school_code) %>%
  select(school_District, school_code,SCHOOL.NAME,No_of_SAT_takers,Reading_Score,Math_Score,Writing_Score )

str(sat_result1) 
## 'data.frame':    421 obs. of  6 variables:
##  $ DBN             : Factor w/ 478 levels "01M292","01M448",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ SCHOOL.NAME     : Factor w/ 478 levels "47 THE AMERICAN SIGN LANGUAGE AND ENGLISH SECONDARY SCHOOL",..: 204 438 141 169 309 296 333 110 38 1 ...
##  $ No_of_SAT_takers: Factor w/ 175 levels "10","101","102",..: 68 166 136 135 98 11 35 42 20 36 ...
##  $ Reading_Score   : Factor w/ 164 levels "279","287","300",..: 34 62 56 93 69 17 145 95 159 74 ...
##  $ Math_Score      : Factor w/ 173 levels "312","315","317",..: 70 87 68 67 94 153 158 82 164 66 ...
##  $ Writing_Score   : Factor w/ 163 levels "286","291","297",..: 45 48 52 41 66 11 147 92 158 69 ...

This data set has 421 observations and 6 columns.

1. Rank the schools by math score, reading and writing score. Find the top 10 schools.

sat_result2$school_District = as.factor(sat_result2$school_District)
sat_result2$Reading_Score = as.numeric(sat_result2$Reading_Score)
sat_result2$Math_Score  = as.numeric(sat_result2$Math_Score )
sat_result2$Writing_Score = as.numeric(sat_result2$Writing_Score)

school_Rank<-round(rank( -(sat_result2$Reading_Score+sat_result2$Math_Score+sat_result2$Writing_Score )))

sat_result3 <- sat_result2 %>%
  mutate(school_Rank) %>%
  select(school_District, school_code,SCHOOL.NAME,No_of_SAT_takers,Reading_Score,Math_Score,Writing_Score,school_Rank ) %>%
  arrange(school_Rank) %>%
  top_n(-10,school_Rank)

kable(sat_result3)
school_District school_code SCHOOL.NAME No_of_SAT_takers Reading_Score Math_Score Writing_Score school_Rank
02M 475 STUYVESANT HIGH SCHOOL 832 163 172 162 1
10X 445 BRONX HIGH SCHOOL OF SCIENCE 731 160 171 161 2
31R 605 STATEN ISLAND TECHNICAL HIGH SCHOOL 227 161 170 159 3
10X 696 HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE 92 162 165 159 4
25Q 525 TOWNSEND HARRIS HIGH SCHOOL 278 158 166 160 5
28Q 687 QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK COLLEGE 121 157 169 157 6
01M 696 BARD HIGH SCHOOL EARLY COLLEGE 130 159 164 158 7
05M 692 HIGH SCHOOL FOR MATHEMATICS, SCIENCE AND ENGINEERING AT CITY COLLEGE 101 156 167 155 8
13K 430 BROOKLYN TECHNICAL HIGH SCHOOL 1277 155 168 154 9
02M 416 ELEANOR ROOSEVELT HIGH SCHOOL 127 152 163 156 10

2. Find the Top5 school district by aggregating the math score, reading and writing score.

Total_score = (sat_result2$Reading_Score + sat_result2$Math_Score + sat_result2$Writing_Score)

sat_result4 <- sat_result2 %>%
 select( school_District, Reading_Score,Math_Score ,Writing_Score  ) %>%
  mutate(Total_score) %>%  
   group_by ( school_District)  %>% 
      summarise(Avg_Reading_Score = round(mean(Reading_Score)), Avg_math_Score = round(mean(Math_Score)), Avg_writing_Score=round(mean(Writing_Score)), Totals = round(mean(Total_score))) %>%
       top_n(5,Totals)


kable(sat_result4)
school_District Avg_Reading_Score Avg_math_Score Avg_writing_Score Totals
22K 114 119 114 347
26Q 112 121 113 346
30Q 98 117 98 313
31R 112 106 106 324
75Q 104 103 114 321