Overview

DATA 607 Project 2 will take three wide datasets identified in the class discussion forum and use tidyr and dplyr as needed to tidy and transform the data into an analyzable format.

The datasets chosen for this project are:
1. Credit Card Default Data – UCI Machine Learning Repository
2. New York City Leading Causes of Death – NYC OpenData
3. Total Population By Country – World Bank

Setup

This analysis requires the following R packages:

  1. dplyr
  2. tidyr
  3. stringr
  4. RCurl
  5. knitr
  6. data.tree
  7. ggplot2

Dataset 1: Credit Card Default Data – UCI Machine Learning Repository

DataSet Information:

This research aimed at the case of customers default payments in Taiwan and was used to compare the predictive accuracy of probability of default among six data mining methods.

The Credit Card Default Data on the UCI Machine Learning Repository can be found directly below:

http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable.

This study reviewed the literature and used the following 23 variables as explanatory variables:
LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX : Gender (1 = male; 2 = female).
EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE : Marital status (1 = married; 2 = single; 3 = others).
AGE : Age (year).

(X6 - X11)
PAY_0 - PAY_6: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

X12-X17
BILL_AMT1 - BILL_AMT6: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.

X18-X23
PAY_AMT1 - PAY_AMT6: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.

Load the Credit Card Default Data in CSV format from GitHub

The dataset contains 30,000 observations and has 25 variables.

## 'data.frame':    30000 obs. of  25 variables:
##  $ ID                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LIMIT_BAL                 : int  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ SEX                       : int  2 2 2 2 1 1 1 2 2 1 ...
##  $ EDUCATION                 : int  2 2 2 2 2 1 1 2 3 3 ...
##  $ MARRIAGE                  : int  1 2 2 1 1 2 2 2 1 2 ...
##  $ AGE                       : int  24 26 34 37 57 37 29 23 28 35 ...
##  $ PAY_0                     : int  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ PAY_2                     : int  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ PAY_3                     : int  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ PAY_4                     : int  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ PAY_5                     : int  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ PAY_6                     : int  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ BILL_AMT1                 : int  3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
##  $ BILL_AMT2                 : int  3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
##  $ BILL_AMT3                 : int  689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
##  $ BILL_AMT4                 : int  0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
##  $ BILL_AMT5                 : int  0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
##  $ BILL_AMT6                 : int  0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
##  $ PAY_AMT1                  : int  0 0 1518 2000 2000 2500 55000 380 3329 0 ...
##  $ PAY_AMT2                  : int  689 1000 1500 2019 36681 1815 40000 601 0 0 ...
##  $ PAY_AMT3                  : int  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ PAY_AMT4                  : int  0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
##  $ PAY_AMT5                  : int  0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
##  $ PAY_AMT6                  : int  0 2000 5000 1000 679 800 13770 1542 1000 0 ...
##  $ default.payment.next.month: int  1 1 0 0 0 0 0 0 0 0 ...

Start tidying the data

  1. Fix what appears to be a typo in the field header PAY_0
names(credit_data_df)[names(credit_data_df) == "PAY_0"] <- "PAY_1"  
  1. Change codes to values for sex, education, and marriage. Any observations associated with undocument code values will be removed.
## [1] "Female" "Male"
## [1] University      Graduate School High School    
## Levels: High School University Graduate School
## [1] "Married" "Single"
  1. Rename the columns in order to use tidyr to convert the dataset from wide to long format.
##  [1] "ID"                       "LIMIT_BAL"               
##  [3] "SEX"                      "EDUCATION"               
##  [5] "MARRIAGE"                 "AGE"                     
##  [7] "PAY_1"                    "PAY_2"                   
##  [9] "PAY_3"                    "PAY_4"                   
## [11] "PAY_5"                    "PAY_6"                   
## [13] "BILLAMT_1"                "BILLAMT_2"               
## [15] "BILLAMT_3"                "BILLAMT_4"               
## [17] "BILLAMT_5"                "BILLAMT_6"               
## [19] "PAYAMT_1"                 "PAYAMT_2"                
## [21] "PAYAMT_3"                 "PAYAMT_4"                
## [23] "PAYAMT_5"                 "PAYAMT_6"                
## [25] "DEFAULT_NEXT_PAYMENT_IND"
  1. Convert from wide to long using tidyr
tidy_credit_data_df <- credit_data_df_s %>%
                       gather(col, "VAL", -ID:-AGE, -DEFAULT_NEXT_PAYMENT_IND) %>%
                       separate(col, c("L1", "MTH"), sep="_") %>%
                       spread(L1, VAL) %>%
                       select(ID:MTH, BILLAMT, PAYAMT, PAY) 

The resulting dataset looks like this:

ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE DEFAULT_NEXT_PAYMENT_IND MTH BILLAMT PAYAMT PAY
1 20000 Female University Married 24 1 1 3913 0 2
2 120000 Female University Single 26 1 1 2682 0 -1
3 90000 Female University Single 34 0 1 29239 1518 0
4 50000 Female University Married 37 0 1 46990 2000 0
5 50000 Male University Married 57 0 1 8617 2000 -1
6 50000 Male Graduate School Single 37 0 1 64400 2500 0
7 500000 Male Graduate School Single 29 0 1 367965 55000 0
8 100000 Female University Single 23 0 1 11876 380 0
9 140000 Female High School Married 28 0 1 11285 3329 0
10 20000 Male High School Single 35 0 1 0 0 -2
11 200000 Female High School Single 34 0 1 11073 2306 0
12 260000 Female Graduate School Single 51 0 1 12261 21818 -1
13 630000 Female University Single 41 0 1 12137 1000 -1
14 70000 Male University Single 30 1 1 65802 3200 1
15 250000 Male Graduate School Single 29 0 1 70887 3000 0
17 20000 Male Graduate School Single 24 1 1 15376 3200 0
18 320000 Male Graduate School Married 49 0 1 253286 10358 0
19 360000 Female Graduate School Married 49 0 1 0 0 1
20 180000 Female Graduate School Single 29 0 1 0 0 1
21 130000 Female High School Single 39 0 1 38358 3000 0
  1. With the dataset in long format, create some derived fields:
Field Name Description
AMT_OWED Running or cumulative sum of bill amount - payment amount for each individual
AVG_6MTH_BAL Mean value of AMT_OWED over a 6 month period
MISSED_PAYMENTS Maxium number of missed payments recorded for the individual
BALANCE_TO_LIMIT_RATIO Average 6 month balance divided by the individual’s credit limit; note anything <= .3 is considered good
AGE_RANGE Groups individuals into 10 year age grouping [20-29] through [70-79]

The resulting dataset looks like this:

ID CREDIT_LIMIT SEX EDUCATION MARITAL_STATUS DEFAULT_NEXT_PAYMENT_IND AVG_6MTH_BAL MISSED_PAYMENTS BALANCE_TO_LIMIT_RATIO AGE_RANGE
1 20000 Female University Married 1 6383.167 2 0.319 20-29
2 120000 Female University Single 1 6905.333 2 0.058 20-29
3 90000 Female University Single 0 59605.833 0 0.662 30-39
4 50000 Female University Married 0 143223.833 0 2.864 30-39
5 50000 Male University Married 0 13195.500 0 0.264 50-59
6 50000 Male Graduate School Single 0 164519.667 0 3.290 30-39
7 500000 Male Graduate School Single 0 1388642.500 0 2.777 20-29
8 100000 Female University Single 0 10754.667 0 0.108 20-29
9 140000 Female High School Married 0 37143.000 2 0.265 20-29
10 20000 Male High School Single 0 -223.167 -1 -0.011 30-39
  1. Using the dataset just created and stored in credit_data_individual, create an aggregate dataset for the different group combinations of Sex, Age Range, Marital Status, and Education.
Aggregation Level Description
LVL1 Top Level - Aggregates to Age Range
LVL2 Aggregates to Age Range and Sex
LVL3 Aggregates to Age Range, Sex, Marital Satus
LVL4 Bottom Level - Aggregates to Age Range, Sex, Marital Satus, Education

Visualize the groups using the data.tree package:

Field Definition:

Aggregation Level Description
n Count of individuals in the groups
Avg Limit Average Credit Limit of the group
% Good Credit Percentage representation of how many individuals in the group have a <= .3 limit-to-balance ratio
% Default Percentage representation of how many individuals in the group are predicted to default

Level 1: Summarized to Age Range

##   levelName     n Avg Limit % Good Credit % Default
## 1 All          NA                      NA        NA
## 2  ¦--20-29  9435  $124,157          0.32      0.23
## 3  ¦--30-39 10981  $197,420          0.46      0.21
## 4  ¦--40-49  6197  $182,750          0.42      0.23
## 5  ¦--50-59  2222  $165,729          0.37      0.25
## 6  ¦--60-69   303  $186,832          0.42      0.29
## 7  °--70-79    25  $218,800          0.40      0.28

Level 2: Summarized to Age Range and Sex

##         levelName    n Avg Limit % Good Credit % Default
## 1  All              NA                      NA        NA
## 2   ¦--20-29        NA                      NA        NA
## 3   ¦   ¦--Female 6219  $131,269          0.34      0.22
## 4   ¦   °--Male   3216  $110,404          0.28      0.24
## 5   ¦--30-39        NA                      NA        NA
## 6   ¦   ¦--Female 6501  $203,130          0.50      0.19
## 7   ¦   °--Male   4480  $189,134          0.39      0.23
## 8   ¦--40-49        NA                      NA        NA
## 9   ¦   ¦--Female 3534  $184,020          0.47      0.22
## 10  ¦   °--Male   2663  $181,066          0.37      0.25
## 11  ¦--50-59        NA                      NA        NA
## 12  ¦   ¦--Female 1181  $159,670          0.40      0.23
## 13  ¦   °--Male   1041  $172,603          0.33      0.27
## 14  ¦--60-69        NA                      NA        NA
## 15  ¦   ¦--Female  141  $160,355          0.43      0.31
## 16  ¦   °--Male    162  $209,877          0.41      0.27
## 17  °--70-79        NA                      NA        NA
## 18      ¦--Female   12  $213,333          0.50      0.25
## 19      °--Male     13  $223,846          0.31      0.31

Level 3: Summarized to Age Range, Sex, and Marital Status

##              levelName    n Avg Limit % Good Credit % Default
## 1  All                   NA                      NA        NA
## 2   ¦--20-29             NA                      NA        NA
## 3   ¦   ¦--Female        NA                      NA        NA
## 4   ¦   ¦   ¦--Married 1154  $122,626          0.27      0.27
## 5   ¦   ¦   °--Single  5065  $133,238          0.36      0.22
## 6   ¦   °--Male          NA                      NA        NA
## 7   ¦       ¦--Married  295  $117,085          0.26      0.27
## 8   ¦       °--Single  2921  $109,730          0.29      0.24
## 9   ¦--30-39             NA                      NA        NA
## 10  ¦   ¦--Female        NA                      NA        NA
## 11  ¦   ¦   ¦--Married 3510  $191,148          0.47      0.20
## 12  ¦   ¦   °--Single  2991  $217,192          0.53      0.17
## 13  ¦   °--Male          NA                      NA        NA
## 14  ¦       ¦--Married 1916  $196,200          0.42      0.26
## 15  ¦       °--Single  2564  $183,853          0.37      0.21
## 16  ¦--40-49             NA                      NA        NA
## 17  ¦   ¦--Female        NA                      NA        NA
## 18  ¦   ¦   ¦--Married 2645  $184,021          0.48      0.22
## 19  ¦   ¦   °--Single   889  $184,016          0.42      0.21
## 20  ¦   °--Male          NA                      NA        NA
## 21  ¦       ¦--Married 1931  $195,831          0.41      0.25
## 22  ¦       °--Single   732  $142,114          0.25      0.24
## 23  ¦--50-59             NA                      NA        NA
## 24  ¦   ¦--Female        NA                      NA        NA
## 25  ¦   ¦   ¦--Married  880  $165,250          0.42      0.25
## 26  ¦   ¦   °--Single   301  $143,355          0.36      0.19
## 27  ¦   °--Male          NA                      NA        NA
## 28  ¦       ¦--Married  821  $188,197          0.37      0.28
## 29  ¦       °--Single   220  $114,409          0.21      0.21
## 30  ¦--60-69             NA                      NA        NA
## 31  ¦   ¦--Female        NA                      NA        NA
## 32  ¦   ¦   ¦--Married  118  $172,712          0.45      0.28
## 33  ¦   ¦   °--Single    23  $ 96,957          0.30      0.48
## 34  ¦   °--Male          NA                      NA        NA
## 35  ¦       ¦--Married  132  $216,818          0.44      0.30
## 36  ¦       °--Single    30  $179,333          0.27      0.13
## 37  °--70-79             NA                      NA        NA
## 38      °--Male          NA                      NA        NA
## 39          °--Married   13  $223,846          0.31      0.31

Level 4: Summarized to Age Range, Sex, Marital Status, and Education

##                          levelName    n Avg Limit % Good Credit % Default
## 1  All                               NA                      NA        NA
## 2   ¦--20-29                         NA                      NA        NA
## 3   ¦   ¦--Female                    NA                      NA        NA
## 4   ¦   ¦   ¦--Married               NA                      NA        NA
## 5   ¦   ¦   ¦   ¦--High School      192  $ 99,688          0.21      0.29
## 6   ¦   ¦   ¦   ¦--University       814  $118,206          0.25      0.28
## 7   ¦   ¦   ¦   °--Graduate School  148  $176,689          0.45      0.16
## 8   ¦   ¦   °--Single                NA                      NA        NA
## 9   ¦   ¦       ¦--High School      365  $110,219          0.31      0.24
## 10  ¦   ¦       ¦--University      2337  $119,718          0.28      0.24
## 11  ¦   ¦       °--Graduate School 2363  $150,165          0.45      0.19
## 12  ¦   °--Male                      NA                      NA        NA
## 13  ¦       ¦--Married               NA                      NA        NA
## 14  ¦       ¦   ¦--High School       53  $103,962          0.23      0.32
## 15  ¦       ¦   ¦--University       197  $110,457          0.25      0.26
## 16  ¦       ¦   °--Graduate School   45  $161,556          0.36      0.27
## 17  ¦       °--Single                NA                      NA        NA
## 18  ¦           ¦--High School      325  $ 70,062          0.22      0.30
## 19  ¦           ¦--University      1462  $ 91,826          0.20      0.27
## 20  ¦           °--Graduate School 1134  $144,180          0.41      0.19
## 21  ¦--30-39                         NA                      NA        NA
## 22  ¦   ¦--Female                    NA                      NA        NA
## 23  ¦   ¦   ¦--Married               NA                      NA        NA
## 24  ¦   ¦   ¦   ¦--High School      500  $141,260          0.34      0.22
## 25  ¦   ¦   ¦   ¦--University      1948  $173,850          0.40      0.21
## 26  ¦   ¦   ¦   °--Graduate School 1062  $246,365          0.67      0.19
## 27  ¦   ¦   °--Single                NA                      NA        NA
## 28  ¦   ¦       ¦--High School      266  $185,677          0.49      0.16
## 29  ¦   ¦       ¦--University      1220  $192,311          0.39      0.18
## 30  ¦   ¦       °--Graduate School 1505  $242,930          0.65      0.16
## 31  ¦   °--Male                      NA                      NA        NA
## 32  ¦       ¦--Married               NA                      NA        NA
## 33  ¦       ¦   ¦--High School      271  $150,701          0.30      0.31
## 34  ¦       ¦   ¦--University      1010  $170,178          0.34      0.26
## 35  ¦       ¦   °--Graduate School  635  $257,008          0.59      0.22
## 36  ¦       °--Single                NA                      NA        NA
## 37  ¦           ¦--High School      279  $136,022          0.26      0.24
## 38  ¦           ¦--University      1045  $142,737          0.27      0.23
## 39  ¦           °--Graduate School 1240  $229,266          0.49      0.19
## 40  ¦--40-49                         NA                      NA        NA
## 41  ¦   ¦--Female                    NA                      NA        NA
## 42  ¦   ¦   ¦--Married               NA                      NA        NA
## 43  ¦   ¦   ¦   ¦--High School      688  $133,808          0.38      0.26
## 44  ¦   ¦   ¦   ¦--University      1331  $176,255          0.45      0.22
## 45  ¦   ¦   ¦   °--Graduate School  626  $255,719          0.67      0.18
## 46  ¦   ¦   °--Single                NA                      NA        NA
## 47  ¦   ¦       ¦--High School      235  $139,617          0.34      0.25
## 48  ¦   ¦       ¦--University       407  $167,052          0.35      0.20
## 49  ¦   ¦       °--Graduate School  247  $254,211          0.60      0.20
## 50  ¦   °--Male                      NA                      NA        NA
## 51  ¦       ¦--Married               NA                      NA        NA
## 52  ¦       ¦   ¦--High School      414  $137,222          0.29      0.27
## 53  ¦       ¦   ¦--University       824  $165,534          0.32      0.27
## 54  ¦       ¦   °--Graduate School  693  $266,869          0.59      0.23
## 55  ¦       °--Single                NA                      NA        NA
## 56  ¦           ¦--High School      197  $100,902          0.21      0.24
## 57  ¦           ¦--University       334  $110,329          0.18      0.27
## 58  ¦           °--Graduate School  201  $235,323          0.41      0.20
## 59  ¦--50-59                         NA                      NA        NA
## 60  ¦   ¦--Female                    NA                      NA        NA
## 61  ¦   ¦   ¦--Married               NA                      NA        NA
## 62  ¦   ¦   ¦   ¦--High School      369  $119,837          0.32      0.25
## 63  ¦   ¦   ¦   ¦--University       339  $168,112          0.36      0.27
## 64  ¦   ¦   ¦   °--Graduate School  172  $257,035          0.73      0.19
## 65  ¦   ¦   °--Single                NA                      NA        NA
## 66  ¦   ¦       ¦--High School      133  $103,684          0.27      0.21
## 67  ¦   ¦       ¦--University       112  $145,714          0.34      0.17
## 68  ¦   ¦       °--Graduate School   56  $232,857          0.59      0.18
## 69  ¦   °--Male                      NA                      NA        NA
## 70  ¦       ¦--Married               NA                      NA        NA
## 71  ¦       ¦   ¦--High School      267  $136,180          0.32      0.31
## 72  ¦       ¦   ¦--University       294  $157,619          0.27      0.29
## 73  ¦       ¦   °--Graduate School  260  $276,192          0.52      0.25
## 74  ¦       °--Single                NA                      NA        NA
## 75  ¦           ¦--High School       81  $ 95,926          0.20      0.20
## 76  ¦           ¦--University        86  $ 90,930          0.16      0.24
## 77  ¦           °--Graduate School   53  $180,755          0.32      0.19
## 78  °--60-69                         NA                      NA        NA
## 79      ¦--Female                    NA                      NA        NA
## 80      ¦   ¦--Married               NA                      NA        NA
## 81      ¦   ¦   ¦--High School       56  $124,464          0.30      0.25
## 82      ¦   ¦   ¦--University        38  $184,211          0.50      0.29
## 83      ¦   ¦   °--Graduate School   24  $267,083          0.71      0.33
## 84      ¦   °--Single                NA                      NA        NA
## 85      ¦       °--High School       15  $ 86,000          0.20      0.40
## 86      °--Male                      NA                      NA        NA
## 87          ¦--Married               NA                      NA        NA
## 88          ¦   ¦--High School       40  $173,000          0.22      0.22
## 89          ¦   ¦--University        42  $173,095          0.50      0.45
## 90          ¦   °--Graduate School   50  $288,600          0.56      0.24
## 91          °--Single                NA                      NA        NA
## 92              ¦--High School       12  $149,167          0.33      0.17
## 93              °--University        13  $192,308          0.15      0.08

Analysis

Which group has the highest average credit limit?

SEX MARITAL_STATUS EDUCATION AGE_RANGE FORMATTED_AVG_CREDIT_LIMIT
Male Married Graduate School 60-69 $288,600

Which group has the lowest average credit limit?

SEX MARITAL_STATUS EDUCATION AGE_RANGE FORMATTED_AVG_CREDIT_LIMIT
Male Single High School 20-29 $ 70,062

Which group is comprised of highest percentage of people who have a balance-to-limit rating less than or equal to 30%?

SEX MARITAL_STATUS EDUCATION AGE_RANGE PCT_IN_GROUP_GOOD_CREDIT
Female Married Graduate School 50-59 0.73

Which group has the lowest utilization or balance-to-limit rating?

SEX MARITAL_STATUS EDUCATION AGE_RANGE PCT_IN_GROUP_GOOD_CREDIT
Male Single University 60-69 0.15

Which group is the most likely to predicted to default?

SEX MARITAL_STATUS EDUCATION AGE_RANGE PCT_IN_GROUP_DEFAULT_RATE
Male Married University 60-69 0.45

Which group has the highest amount of debt, is the most likely to default, and is the most likely to miss a payment?

SEX MARITAL_STATUS EDUCATION AGE_RANGE
Female Single High School 60-69

Which group has the lowest amount of debt, is the least predicted to default, and is not likely to miss a payment?

SEX MARITAL_STATUS EDUCATION AGE_RANGE
Female Single Graduate School 30-39

Dataset 2: New York City Leading Causes of Death – NYC OpenData

DataSet Information:

This dataset collects the leading causes of death for New York City by ethnicity and sex from 2007 - 2011.

The New York City Leading Causes of Death dataset can be found directly below:

https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam

Attribute Information:

The following attributes are captured in the dataset:

Year
Ethnicity
Sex
Cause of Death
Count
Percent

Load the NYC Leading Causes of Death datset in CSV format from GitHub

Looking at the dataset, there are 2880 observations with 6 variables.

## 'data.frame':    2880 obs. of  6 variables:
##  $ Year          : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Ethnicity     : chr  "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" ...
##  $ Sex           : chr  "MALE" "MALE" "MALE" "MALE" ...
##  $ Cause.of.Death: chr  "HUMAN IMMUNODEFICIENCY VIRUS DISEASE" "INFLUENZA AND PNEUMONIA" "INTENTIONAL SELF-HARM (SUICIDE)" "MALIGNANT NEOPLASMS" ...
##  $ Count         : int  297 201 64 1540 50 70 13 111 36 35 ...
##  $ Percent       : int  5 3 1 23 1 1 0 2 1 1 ...

However, there appears to be duplicate entries in the data.

Year Ethnicity Sex Cause.of.Death Count Percent rn
2007 ASIAN & PACIFIC ISLANDER FEMALE ACCIDENTS EXCEPT DRUG POISONING 32 2 2
2007 ASIAN & PACIFIC ISLANDER FEMALE ACCIDENTS EXCEPT DRUG POISONING 32 2 3
2007 ASIAN & PACIFIC ISLANDER FEMALE ALZHEIMERS DISEASE 7 1 2
2007 ASIAN & PACIFIC ISLANDER FEMALE ALZHEIMERS DISEASE 7 1 3
2007 ASIAN & PACIFIC ISLANDER FEMALE AORTIC ANEURYSM AND DISSECTION 9 1 2
2007 ASIAN & PACIFIC ISLANDER FEMALE AORTIC ANEURYSM AND DISSECTION 9 1 3
2007 ASIAN & PACIFIC ISLANDER FEMALE ASSAULT (HOMICIDE) 6 0 2
2007 ASIAN & PACIFIC ISLANDER FEMALE ASSAULT (HOMICIDE) 6 0 3
2007 ASIAN & PACIFIC ISLANDER FEMALE ATHEROSCLEROSIS 5 0 2
2007 ASIAN & PACIFIC ISLANDER FEMALE ATHEROSCLEROSIS 5 0 3
2007 ASIAN & PACIFIC ISLANDER FEMALE BENIGN AND UNCERTAIN NEOPLASMS 6 0 2
2007 ASIAN & PACIFIC ISLANDER FEMALE BENIGN AND UNCERTAIN NEOPLASMS 6 0 3
2007 ASIAN & PACIFIC ISLANDER FEMALE CEREBROVASCULAR DISEASE 70 5 2
2007 ASIAN & PACIFIC ISLANDER FEMALE CEREBROVASCULAR DISEASE 70 5 3
2007 ASIAN & PACIFIC ISLANDER FEMALE CHRONIC LIVER DISEASE AND CIRRHOSIS 6 0 2
2007 ASIAN & PACIFIC ISLANDER FEMALE CHRONIC LIVER DISEASE AND CIRRHOSIS 6 0 3
2007 ASIAN & PACIFIC ISLANDER FEMALE CHRONIC LOWER RESPIRATORY DISEASES 29 2 2
2007 ASIAN & PACIFIC ISLANDER FEMALE CHRONIC LOWER RESPIRATORY DISEASES 29 2 3
2007 ASIAN & PACIFIC ISLANDER FEMALE DIABETES MELLITUS 34 3 2
2007 ASIAN & PACIFIC ISLANDER FEMALE DIABETES MELLITUS 34 3 3

Remove the duplicate entries:

nyc_deaths_df <-  nyc_deaths_df %>% 
                  group_by(Year, Ethnicity, Sex, Cause.of.Death, Count, Percent) %>% 
                  filter(row_number() == 1)

dim(nyc_deaths_df)
## [1] 960   6

Removing the duplicate reduces the data down to 960 observations.

Analysis:

Show the leading causes of death each year for men and women.

kable(nyc_deaths_df %>% 
           select(Year, Sex, Cause.of.Death, Count) %>% 
           group_by(Year, Sex) %>% 
           select(Year, Sex, Cause.of.Death, Count) %>% arrange(Year, Sex) %>%
           top_n(1) )
## Selecting by Count
Year Sex Cause.of.Death Count
2007 FEMALE DISEASES OF HEART 7050
2007 MALE DISEASES OF HEART 5632
2008 FEMALE DISEASES OF HEART 6836
2008 MALE DISEASES OF HEART 5503
2009 FEMALE DISEASES OF HEART 6297
2009 MALE DISEASES OF HEART 5168
2010 FEMALE DISEASES OF HEART 5351
2010 MALE DISEASES OF HEART 4495
2011 FEMALE DISEASES OF HEART 5016
2011 MALE DISEASES OF HEART 4220

Show the leading causes of death each year for each ethnic group.

kable(nyc_deaths_df %>% 
         select(Year, Ethnicity, Cause.of.Death, Count) %>% 
         group_by(Year, Ethnicity) %>%
         select(Year, Ethnicity, Cause.of.Death, Count) %>% arrange(Year, Ethnicity) %>%
         top_n(1))
## Selecting by Count
Year Ethnicity Cause.of.Death Count
2007 ASIAN & PACIFIC ISLANDER MALIGNANT NEOPLASMS 528
2007 HISPANIC DISEASES OF HEART 1418
2007 NON-HISPANIC BLACK DISEASES OF HEART 2722
2007 NON-HISPANIC WHITE DISEASES OF HEART 7050
2008 ASIAN & PACIFIC ISLANDER DISEASES OF HEART 546
2008 HISPANIC DISEASES OF HEART 1445
2008 NON-HISPANIC BLACK DISEASES OF HEART 2725
2008 NON-HISPANIC WHITE DISEASES OF HEART 6836
2009 ASIAN & PACIFIC ISLANDER DISEASES OF HEART 555
2009 HISPANIC DISEASES OF HEART 1382
2009 NON-HISPANIC BLACK DISEASES OF HEART 2535
2009 NON-HISPANIC WHITE DISEASES OF HEART 6297
2010 ASIAN & PACIFIC ISLANDER MALIGNANT NEOPLASMS 527
2010 HISPANIC DISEASES OF HEART 1355
2010 NON-HISPANIC BLACK DISEASES OF HEART 2282
2010 NON-HISPANIC WHITE DISEASES OF HEART 5351
2011 ASIAN & PACIFIC ISLANDER MALIGNANT NEOPLASMS 550
2011 HISPANIC DISEASES OF HEART 1348
2011 NON-HISPANIC BLACK DISEASES OF HEART 2243
2011 NON-HISPANIC WHITE DISEASES OF HEART 5016

Calculate which cause of death has declined the most and which has increased the most in the years given.

wide <- nyc_deaths_df %>% 
        select(Year, Cause.of.Death, Count) %>% 
        group_by(Year, Cause.of.Death) %>%
        summarise(Total = sum(Count)) %>% 
        spread(Year, Total)  %>% 
        mutate(Pct.Change = round((`2011`-`2007`)/`2007`, 2) *100)

wide <- wide[complete.cases(wide), ]

kable(wide %>% 
           arrange(Pct.Change) %>% 
           filter(row_number() == 1))
Cause.of.Death 2007 2008 2009 2010 2011 Pct.Change
CARDIOVASCULAR DISORDERS IN PERINATAL PERIOD 76 55 33 29 24 -68
kable(wide %>% 
          arrange(desc(Pct.Change)) %>% 
          filter(row_number() == 1))
Cause.of.Death 2007 2008 2009 2010 2011 Pct.Change
ALZHEIMERS DISEASE 276 363 514 574 619 124

Calculate which cause of death has remained stable over the years given.

The answer to this question uses the percentage change when comparing year 2007 to year 2011. There some fluctuations up and down in the amount of change in the years between.

kable(wide %>% filter(Pct.Change == 0))
Cause.of.Death 2007 2008 2009 2010 2011 Pct.Change
CONGENITAL MALFORMATIONS,DEFORMATIONS 232 283 238 228 232 0
SHORT GESTATION/LBW 92 102 106 139 92 0

Dataset 3: Total Population By Country

DataSet Information:

Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship–except for refugees not permanently settled in the country of asylum, who are generally considered part of the population of their country of origin.

The population data is captured for each country from 1960 - 2014.

The World Bank Total Populdation dataset can be found directly below:

http://data.worldbank.org/indicator/SP.POP.TOTL

Attribute Information:

The following attributes are captured in the dataset:

Country Name: Name of the Country Country Code: Country’s 3-digit ISO code
Indicator Name: Constant value - “Population, total” Indicator Code: Constant value - SP.POP.TOTL Year [1960 - 2015]: Population of the country for the given year; 2015 does not contain any values for population

Load the World Bank Total Population Data in CSV format from GitHub

The dataset contains 248 observations and has 61 variables.

## 'data.frame':    248 obs. of  61 variables:
##  $ Country.Name  : chr  "Aruba" "Andorra" "Afghanistan" "Angola" ...
##  $ Country.Code  : chr  "ABW" "AND" "AFG" "AGO" ...
##  $ Indicator.Name: chr  "Population, total" "Population, total" "Population, total" "Population, total" ...
##  $ Indicator.Code: chr  "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" ...
##  $ X1960         : num  54208 13414 8994793 5270844 1608800 ...
##  $ X1961         : num  55435 14376 9164945 5367287 1659800 ...
##  $ X1962         : num  56226 15376 9343772 5465905 1711319 ...
##  $ X1963         : num  56697 16410 9531555 5565808 1762621 ...
##  $ X1964         : num  57029 17470 9728645 5665701 1814135 ...
##  $ X1965         : num  57360 18551 9935358 5765025 1864791 ...
##  $ X1966         : num  57712 19646 10148841 5863568 1914573 ...
##  $ X1967         : num  58049 20755 10368600 5962831 1965598 ...
##  $ X1968         : num  58385 21888 10599790 6066094 2022272 ...
##  $ X1969         : num  58724 23061 10849510 6177703 2081695 ...
##  $ X1970         : num  59065 24279 11121097 6300969 2135479 ...
##  $ X1971         : num  59438 25560 11412821 6437645 2187853 ...
##  $ X1972         : num  59849 26892 11716896 6587647 2243126 ...
##  $ X1973         : num  60239 28231 12022514 6750215 2296752 ...
##  $ X1974         : num  60525 29514 12315553 6923749 2350124 ...
##  $ X1975         : num  60655 30706 12582954 7107334 2404831 ...
##  $ X1976         : num  60589 31781 12831361 7299508 2458526 ...
##  $ X1977         : num  60366 32769 13056499 7501320 2513546 ...
##  $ X1978         : num  60106 33746 13222547 7717139 2566266 ...
##  $ X1979         : num  59978 34819 13283279 7952882 2617832 ...
##  $ X1980         : num  60096 36063 13211412 8211950 2671997 ...
##  $ X1981         : num  60567 37502 12996923 8497950 2726056 ...
##  $ X1982         : num  61344 39112 12667001 8807511 2784278 ...
##  $ X1983         : num  62204 40862 12279095 9128655 2843960 ...
##  $ X1984         : num  62831 42704 11912510 9444918 2904429 ...
##  $ X1985         : num  63028 44597 11630498 9745209 2964762 ...
##  $ X1986         : num  62644 46515 11438949 10023700 3022635 ...
##  $ X1987         : num  61835 48458 11337932 10285712 3083605 ...
##  $ X1988         : num  61077 50431 11375768 10544904 3142336 ...
##  $ X1989         : num  61032 52449 11608351 10820992 3227943 ...
##  $ X1990         : num  62148 54511 12067570 11127870 3286542 ...
##  $ X1991         : num  64623 56674 12789374 11472173 3266790 ...
##  $ X1992         : num  68235 58904 13745630 11848971 3247039 ...
##  $ X1993         : num  72498 61003 14824371 12246786 3227287 ...
##  $ X1994         : num  76700 62707 15869967 12648483 3207536 ...
##  $ X1995         : num  80326 63854 16772522 13042666 3187784 ...
##  $ X1996         : num  83195 64291 17481800 13424813 3168033 ...
##  $ X1997         : num  85447 64147 18034130 13801868 3148281 ...
##  $ X1998         : num  87276 63888 18511480 14187710 3128530 ...
##  $ X1999         : num  89004 64161 19038420 14601983 3108778 ...
##  $ X2000         : num  90858 65399 19701940 15058638 3089027 ...
##  $ X2001         : num  92894 67770 20531160 15562791 3060173 ...
##  $ X2002         : num  94995 71046 21487079 16109696 3051010 ...
##  $ X2003         : num  97015 74783 22507368 16691395 3039616 ...
##  $ X2004         : num  98742 78337 23499850 17295500 3026939 ...
##  $ X2005         : num  100031 81223 24399948 17912942 3011487 ...
##  $ X2006         : num  100830 83373 25183615 18541467 2992547 ...
##  $ X2007         : num  101218 84878 25877544 19183907 2970017 ...
##  $ X2008         : num  101342 85616 26528741 19842251 2947314 ...
##  $ X2009         : num  101416 85474 27207291 20520103 2927519 ...
##  $ X2010         : num  101597 84419 27962207 21219954 2913021 ...
##  $ X2011         : num  101936 82326 28809167 21942296 2904780 ...
##  $ X2012         : num  102393 79316 29726803 22685632 2900489 ...
##  $ X2013         : num  102921 75902 30682500 23448202 2897366 ...
##  $ X2014         : num  103441 72786 31627506 24227524 2894475 ...
##  $ X2015         : logi  NA NA NA NA NA NA ...
##  $ X             : logi  NA NA NA NA NA NA ...

Remove variables that will not be used or are not valued – Indicator Name, Indicator Code, X2015, and X (which appears after processing in the CSV file).

population_df$Indicator.Name <- NULL
population_df$Indicator.Code <- NULL
population_df$X  <- NULL
population_df$X2015  <- NULL

Remove incomplete cases

population_df <- population_df[complete.cases(population_df), ] 

Convert the Total Population dataset from wide to long format:

tidy_population_df <- population_df %>%
                      gather(col, "Population", -Country.Code, -Country.Name) %>%
                      separate(col, c("X", "Year"), sep="X") %>%
                      select(-X) %>% 
                      arrange(Country.Code, Year)

tidy_population_df$Year <- as.numeric(tidy_population_df$Year)

tidy_population_df <- as.data.frame(tidy_population_df)

The resulting dataframe now has one row per year and population value for a country. In total, there are 13,640 observations with just four variables.

Country.Name Country.Code Year Population
Aruba ABW 1960 54208
Aruba ABW 1961 55435
Aruba ABW 1962 56226
Aruba ABW 1963 56697
Aruba ABW 1964 57029
Aruba ABW 1965 57360
Aruba ABW 1966 57712
Aruba ABW 1967 58049
Aruba ABW 1968 58385
Aruba ABW 1969 58724
Aruba ABW 1970 59065
Aruba ABW 1971 59438
Aruba ABW 1972 59849
Aruba ABW 1973 60239
Aruba ABW 1974 60525
Aruba ABW 1975 60655
Aruba ABW 1976 60589
Aruba ABW 1977 60366
Aruba ABW 1978 60106

Analysis:

Determine which countries have had the most significant change in terms of straight percent population change, comparing 1960 to 2014.

population_aggr <- tidy_population_df %>%
                   na.omit() %>% 
                   group_by(Country.Code) %>% 
                   mutate(min_year = min(Year),
                   max_year = max(Year)) %>%
                   filter(Year == min_year | Year == max_year) %>%  
                   mutate(rn = str_c("M", row_number())) %>%
                   select(Country.Name, Country.Code, Population, rn) %>% 
                   spread(rn, Population) %>% 
                   rename(Min.Year=M1, Max.Year=M2) %>%
                   mutate(Pct.Change = round(((Max.Year - Min.Year)/Min.Year) * 100, 2)) 


population_aggr <- as.data.frame(population_aggr)

Using this resultset, determine which countries have experiences the most total growth and which experienced the least amount of growth:

countries <- filter(population_aggr, Pct.Change==min(Pct.Change) |Pct.Change==max(Pct.Change))

kable(countries)
Country.Name Country.Code Min.Year Max.Year Pct.Change
Bulgaria BGR 7867374 7223938 -8.18
United Arab Emirates ARE 92612 9086139 9710.97

We see that Bulgaria has seen its population drop the most significantly over the years, while United Arab Emirates has seen a population explosion.

To drill into these two countries more, let’s look at the percent population change by decade.

The resulting dataset looks like:

Country.Name Country.Code Pct.Change.1960s Pct.Change.1970s Pct.Change.1980s Pct.Change.1990s Pct.Change.2000s
Afghanistan AFG 20.62 19.44 -12.13 57.77 38.09
Albania ALB 29.39 22.59 20.81 -5.41 -5.23
Algeria DZA 27.14 28.94 30.61 18.73 13.53
American Samoa ASM 33.00 16.04 40.47 20.67 -2.25
Andorra AND 71.92 43.41 45.44 17.70 30.70
Angola AGO 17.21 26.22 31.77 31.22 36.27
Antigua and Barbuda ATG 17.60 7.93 -11.85 22.83 11.14
Arab World ARB 28.34 30.63 30.09 24.14 22.51
Argentina ARG 14.49 15.48 14.79 11.97 10.10
Armenia ARM 31.54 20.89 14.42 -12.72 -3.58

Looking at the change for Bulgaria and United Arab Emirates specifically:

Country.Name Country.Code Pct.Change.1960s Pct.Change.1970s Pct.Change.1980s Pct.Change.1990s Pct.Change.2000s
Bulgaria BGR 7.20 3.96 0.17 -5.82 -8.88
United Arab Emirates ARE 121.97 293.87 68.38 59.22 152.63

Plotting the population by year for both countries:

For relative growth by decade: