DATA 607: Project 2

Overview

DATA 607 Project 2 will take three wide datasets identified in the class discussion forum and use tidyr and dplyr as needed to tidy and transform the data into an analyzable format.

The datasets chosen for this project are:
1. Credit Card Default Data – UCI Machine Learning Repository
2. New York City Leading Causes of Death – NYC OpenData
3. Total Population By Country – World Bank

Setup

This analysis requires the following R packages:

dplyr
tidyr
stringr
RCurl
knitr
data.tree
ggplot2

Dataset 1: Credit Card Default Data – UCI Machine Learning Repository

DataSet Information:

This research aimed at the case of customers default payments in Taiwan and was used to compare the predictive accuracy of probability of default among six data mining methods.

The Credit Card Default Data on the UCI Machine Learning Repository can be found directly below:

http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable.

This study reviewed the literature and used the following 23 variables as explanatory variables:
LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX : Gender (1 = male; 2 = female).
EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE : Marital status (1 = married; 2 = single; 3 = others).
AGE : Age (year).

(X6 - X11)
PAY_0 - PAY_6: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

X12-X17
BILL_AMT1 - BILL_AMT6: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.

X18-X23
PAY_AMT1 - PAY_AMT6: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.

Load the Credit Card Default Data in CSV format from GitHub

The dataset contains 30,000 observations and has 25 variables.

## 'data.frame':    30000 obs. of  25 variables:
##  $ ID                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LIMIT_BAL                 : int  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ SEX                       : int  2 2 2 2 1 1 1 2 2 1 ...
##  $ EDUCATION                 : int  2 2 2 2 2 1 1 2 3 3 ...
##  $ MARRIAGE                  : int  1 2 2 1 1 2 2 2 1 2 ...
##  $ AGE                       : int  24 26 34 37 57 37 29 23 28 35 ...
##  $ PAY_0                     : int  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ PAY_2                     : int  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ PAY_3                     : int  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ PAY_4                     : int  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ PAY_5                     : int  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ PAY_6                     : int  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ BILL_AMT1                 : int  3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
##  $ BILL_AMT2                 : int  3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
##  $ BILL_AMT3                 : int  689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
##  $ BILL_AMT4                 : int  0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
##  $ BILL_AMT5                 : int  0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
##  $ BILL_AMT6                 : int  0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
##  $ PAY_AMT1                  : int  0 0 1518 2000 2000 2500 55000 380 3329 0 ...
##  $ PAY_AMT2                  : int  689 1000 1500 2019 36681 1815 40000 601 0 0 ...
##  $ PAY_AMT3                  : int  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ PAY_AMT4                  : int  0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
##  $ PAY_AMT5                  : int  0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
##  $ PAY_AMT6                  : int  0 2000 5000 1000 679 800 13770 1542 1000 0 ...
##  $ default.payment.next.month: int  1 1 0 0 0 0 0 0 0 0 ...

Start tidying the data

Fix what appears to be a typo in the field header PAY_0

names(credit_data_df)[names(credit_data_df) == "PAY_0"] <- "PAY_1"

Change codes to values for sex, education, and marriage. Any observations associated with undocument code values will be removed.

## [1] "Female" "Male"

## [1] University      Graduate School High School    
## Levels: High School University Graduate School

## [1] "Married" "Single"

Rename the columns in order to use tidyr to convert the dataset from wide to long format.

##  [1] "ID"                       "LIMIT_BAL"               
##  [3] "SEX"                      "EDUCATION"               
##  [5] "MARRIAGE"                 "AGE"                     
##  [7] "PAY_1"                    "PAY_2"                   
##  [9] "PAY_3"                    "PAY_4"                   
## [11] "PAY_5"                    "PAY_6"                   
## [13] "BILLAMT_1"                "BILLAMT_2"               
## [15] "BILLAMT_3"                "BILLAMT_4"               
## [17] "BILLAMT_5"                "BILLAMT_6"               
## [19] "PAYAMT_1"                 "PAYAMT_2"                
## [21] "PAYAMT_3"                 "PAYAMT_4"                
## [23] "PAYAMT_5"                 "PAYAMT_6"                
## [25] "DEFAULT_NEXT_PAYMENT_IND"

Convert from wide to long using tidyr

tidy_credit_data_df <- credit_data_df_s %>%
                       gather(col, "VAL", -ID:-AGE, -DEFAULT_NEXT_PAYMENT_IND) %>%
                       separate(col, c("L1", "MTH"), sep="_") %>%
                       spread(L1, VAL) %>%
                       select(ID:MTH, BILLAMT, PAYAMT, PAY)

The resulting dataset looks like this:

ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	DEFAULT_NEXT_PAYMENT_IND	MTH	BILLAMT	PAYAMT	PAY
1	20000	Female	University	Married	24	1	1	3913	0	2
2	120000	Female	University	Single	26	1	1	2682	0	-1
3	90000	Female	University	Single	34	0	1	29239	1518	0
4	50000	Female	University	Married	37	0	1	46990	2000	0
5	50000	Male	University	Married	57	0	1	8617	2000	-1
6	50000	Male	Graduate School	Single	37	0	1	64400	2500	0
7	500000	Male	Graduate School	Single	29	0	1	367965	55000	0
8	100000	Female	University	Single	23	0	1	11876	380	0
9	140000	Female	High School	Married	28	0	1	11285	3329	0
10	20000	Male	High School	Single	35	0	1	0	0	-2
11	200000	Female	High School	Single	34	0	1	11073	2306	0
12	260000	Female	Graduate School	Single	51	0	1	12261	21818	-1
13	630000	Female	University	Single	41	0	1	12137	1000	-1
14	70000	Male	University	Single	30	1	1	65802	3200	1
15	250000	Male	Graduate School	Single	29	0	1	70887	3000	0
17	20000	Male	Graduate School	Single	24	1	1	15376	3200	0
18	320000	Male	Graduate School	Married	49	0	1	253286	10358	0
19	360000	Female	Graduate School	Married	49	0	1	0	0	1
20	180000	Female	Graduate School	Single	29	0	1	0	0	1
21	130000	Female	High School	Single	39	0	1	38358	3000	0

With the dataset in long format, create some derived fields:

Field Name	Description
AMT_OWED	Running or cumulative sum of bill amount - payment amount for each individual
AVG_6MTH_BAL	Mean value of AMT_OWED over a 6 month period
MISSED_PAYMENTS	Maxium number of missed payments recorded for the individual
BALANCE_TO_LIMIT_RATIO	Average 6 month balance divided by the individual’s credit limit; note anything <= .3 is considered good
AGE_RANGE	Groups individuals into 10 year age grouping [20-29] through [70-79]

The resulting dataset looks like this:

ID	CREDIT_LIMIT	SEX	EDUCATION	MARITAL_STATUS	DEFAULT_NEXT_PAYMENT_IND	AVG_6MTH_BAL	MISSED_PAYMENTS	BALANCE_TO_LIMIT_RATIO	AGE_RANGE
1	20000	Female	University	Married	1	6383.167	2	0.319	20-29
2	120000	Female	University	Single	1	6905.333	2	0.058	20-29
3	90000	Female	University	Single	0	59605.833	0	0.662	30-39
4	50000	Female	University	Married	0	143223.833	0	2.864	30-39
5	50000	Male	University	Married	0	13195.500	0	0.264	50-59
6	50000	Male	Graduate School	Single	0	164519.667	0	3.290	30-39
7	500000	Male	Graduate School	Single	0	1388642.500	0	2.777	20-29
8	100000	Female	University	Single	0	10754.667	0	0.108	20-29
9	140000	Female	High School	Married	0	37143.000	2	0.265	20-29
10	20000	Male	High School	Single	0	-223.167	-1	-0.011	30-39

Using the dataset just created and stored in credit_data_individual, create an aggregate dataset for the different group combinations of Sex, Age Range, Marital Status, and Education.

Aggregation Level	Description
LVL1	Top Level - Aggregates to Age Range
LVL2	Aggregates to Age Range and Sex
LVL3	Aggregates to Age Range, Sex, Marital Satus
LVL4	Bottom Level - Aggregates to Age Range, Sex, Marital Satus, Education

Visualize the groups using the `data.tree` package:

Field Definition:

Aggregation Level	Description
n	Count of individuals in the groups
Avg Limit	Average Credit Limit of the group
% Good Credit	Percentage representation of how many individuals in the group have a <= .3 limit-to-balance ratio
% Default	Percentage representation of how many individuals in the group are predicted to default

Level 1: Summarized to Age Range

##   levelName     n Avg Limit % Good Credit % Default
## 1 All          NA                      NA        NA
## 2  ¦--20-29  9435  $124,157          0.32      0.23
## 3  ¦--30-39 10981  $197,420          0.46      0.21
## 4  ¦--40-49  6197  $182,750          0.42      0.23
## 5  ¦--50-59  2222  $165,729          0.37      0.25
## 6  ¦--60-69   303  $186,832          0.42      0.29
## 7  °--70-79    25  $218,800          0.40      0.28

Level 2: Summarized to Age Range and Sex

##         levelName    n Avg Limit % Good Credit % Default
## 1  All              NA                      NA        NA
## 2   ¦--20-29        NA                      NA        NA
## 3   ¦   ¦--Female 6219  $131,269          0.34      0.22
## 4   ¦   °--Male   3216  $110,404          0.28      0.24
## 5   ¦--30-39        NA                      NA        NA
## 6   ¦   ¦--Female 6501  $203,130          0.50      0.19
## 7   ¦   °--Male   4480  $189,134          0.39      0.23
## 8   ¦--40-49        NA                      NA        NA
## 9   ¦   ¦--Female 3534  $184,020          0.47      0.22
## 10  ¦   °--Male   2663  $181,066          0.37      0.25
## 11  ¦--50-59        NA                      NA        NA
## 12  ¦   ¦--Female 1181  $159,670          0.40      0.23
## 13  ¦   °--Male   1041  $172,603          0.33      0.27
## 14  ¦--60-69        NA                      NA        NA
## 15  ¦   ¦--Female  141  $160,355          0.43      0.31
## 16  ¦   °--Male    162  $209,877          0.41      0.27
## 17  °--70-79        NA                      NA        NA
## 18      ¦--Female   12  $213,333          0.50      0.25
## 19      °--Male     13  $223,846          0.31      0.31

Level 3: Summarized to Age Range, Sex, and Marital Status

##              levelName    n Avg Limit % Good Credit % Default
## 1  All                   NA                      NA        NA
## 2   ¦--20-29             NA                      NA        NA
## 3   ¦   ¦--Female        NA                      NA        NA
## 4   ¦   ¦   ¦--Married 1154  $122,626          0.27      0.27
## 5   ¦   ¦   °--Single  5065  $133,238          0.36      0.22
## 6   ¦   °--Male          NA                      NA        NA
## 7   ¦       ¦--Married  295  $117,085          0.26      0.27
## 8   ¦       °--Single  2921  $109,730          0.29      0.24
## 9   ¦--30-39             NA                      NA        NA
## 10  ¦   ¦--Female        NA                      NA        NA
## 11  ¦   ¦   ¦--Married 3510  $191,148          0.47      0.20
## 12  ¦   ¦   °--Single  2991  $217,192          0.53      0.17
## 13  ¦   °--Male          NA                      NA        NA
## 14  ¦       ¦--Married 1916  $196,200          0.42      0.26
## 15  ¦       °--Single  2564  $183,853          0.37      0.21
## 16  ¦--40-49             NA                      NA        NA
## 17  ¦   ¦--Female        NA                      NA        NA
## 18  ¦   ¦   ¦--Married 2645  $184,021          0.48      0.22
## 19  ¦   ¦   °--Single   889  $184,016          0.42      0.21
## 20  ¦   °--Male          NA                      NA        NA
## 21  ¦       ¦--Married 1931  $195,831          0.41      0.25
## 22  ¦       °--Single   732  $142,114          0.25      0.24
## 23  ¦--50-59             NA                      NA        NA
## 24  ¦   ¦--Female        NA                      NA        NA
## 25  ¦   ¦   ¦--Married  880  $165,250          0.42      0.25
## 26  ¦   ¦   °--Single   301  $143,355          0.36      0.19
## 27  ¦   °--Male          NA                      NA        NA
## 28  ¦       ¦--Married  821  $188,197          0.37      0.28
## 29  ¦       °--Single   220  $114,409          0.21      0.21
## 30  ¦--60-69             NA                      NA        NA
## 31  ¦   ¦--Female        NA                      NA        NA
## 32  ¦   ¦   ¦--Married  118  $172,712          0.45      0.28
## 33  ¦   ¦   °--Single    23  $ 96,957          0.30      0.48
## 34  ¦   °--Male          NA                      NA        NA
## 35  ¦       ¦--Married  132  $216,818          0.44      0.30
## 36  ¦       °--Single    30  $179,333          0.27      0.13
## 37  °--70-79             NA                      NA        NA
## 38      °--Male          NA                      NA        NA
## 39          °--Married   13  $223,846          0.31      0.31

Level 4: Summarized to Age Range, Sex, Marital Status, and Education

##                          levelName    n Avg Limit % Good Credit % Default
## 1  All                               NA                      NA        NA
## 2   ¦--20-29                         NA                      NA        NA
## 3   ¦   ¦--Female                    NA                      NA        NA
## 4   ¦   ¦   ¦--Married               NA                      NA        NA
## 5   ¦   ¦   ¦   ¦--High School      192  $ 99,688          0.21      0.29
## 6   ¦   ¦   ¦   ¦--University       814  $118,206          0.25      0.28
## 7   ¦   ¦   ¦   °--Graduate School  148  $176,689          0.45      0.16
## 8   ¦   ¦   °--Single                NA                      NA        NA
## 9   ¦   ¦       ¦--High School      365  $110,219          0.31      0.24
## 10  ¦   ¦       ¦--University      2337  $119,718          0.28      0.24
## 11  ¦   ¦       °--Graduate School 2363  $150,165          0.45      0.19
## 12  ¦   °--Male                      NA                      NA        NA
## 13  ¦       ¦--Married               NA                      NA        NA
## 14  ¦       ¦   ¦--High School       53  $103,962          0.23      0.32
## 15  ¦       ¦   ¦--University       197  $110,457          0.25      0.26
## 16  ¦       ¦   °--Graduate School   45  $161,556          0.36      0.27
## 17  ¦       °--Single                NA                      NA        NA
## 18  ¦           ¦--High School      325  $ 70,062          0.22      0.30
## 19  ¦           ¦--University      1462  $ 91,826          0.20      0.27
## 20  ¦           °--Graduate School 1134  $144,180          0.41      0.19
## 21  ¦--30-39                         NA                      NA        NA
## 22  ¦   ¦--Female                    NA                      NA        NA
## 23  ¦   ¦   ¦--Married               NA                      NA        NA
## 24  ¦   ¦   ¦   ¦--High School      500  $141,260          0.34      0.22
## 25  ¦   ¦   ¦   ¦--University      1948  $173,850          0.40      0.21
## 26  ¦   ¦   ¦   °--Graduate School 1062  $246,365          0.67      0.19
## 27  ¦   ¦   °--Single                NA                      NA        NA
## 28  ¦   ¦       ¦--High School      266  $185,677          0.49      0.16
## 29  ¦   ¦       ¦--University      1220  $192,311          0.39      0.18
## 30  ¦   ¦       °--Graduate School 1505  $242,930          0.65      0.16
## 31  ¦   °--Male                      NA                      NA        NA
## 32  ¦       ¦--Married               NA                      NA        NA
## 33  ¦       ¦   ¦--High School      271  $150,701          0.30      0.31
## 34  ¦       ¦   ¦--University      1010  $170,178          0.34      0.26
## 35  ¦       ¦   °--Graduate School  635  $257,008          0.59      0.22
## 36  ¦       °--Single                NA                      NA        NA
## 37  ¦           ¦--High School      279  $136,022          0.26      0.24
## 38  ¦           ¦--University      1045  $142,737          0.27      0.23
## 39  ¦           °--Graduate School 1240  $229,266          0.49      0.19
## 40  ¦--40-49                         NA                      NA        NA
## 41  ¦   ¦--Female                    NA                      NA        NA
## 42  ¦   ¦   ¦--Married               NA                      NA        NA
## 43  ¦   ¦   ¦   ¦--High School      688  $133,808          0.38      0.26
## 44  ¦   ¦   ¦   ¦--University      1331  $176,255          0.45      0.22
## 45  ¦   ¦   ¦   °--Graduate School  626  $255,719          0.67      0.18
## 46  ¦   ¦   °--Single                NA                      NA        NA
## 47  ¦   ¦       ¦--High School      235  $139,617          0.34      0.25
## 48  ¦   ¦       ¦--University       407  $167,052          0.35      0.20
## 49  ¦   ¦       °--Graduate School  247  $254,211          0.60      0.20
## 50  ¦   °--Male                      NA                      NA        NA
## 51  ¦       ¦--Married               NA                      NA        NA
## 52  ¦       ¦   ¦--High School      414  $137,222          0.29      0.27
## 53  ¦       ¦   ¦--University       824  $165,534          0.32      0.27
## 54  ¦       ¦   °--Graduate School  693  $266,869          0.59      0.23
## 55  ¦       °--Single                NA                      NA        NA
## 56  ¦           ¦--High School      197  $100,902          0.21      0.24
## 57  ¦           ¦--University       334  $110,329          0.18      0.27
## 58  ¦           °--Graduate School  201  $235,323          0.41      0.20
## 59  ¦--50-59                         NA                      NA        NA
## 60  ¦   ¦--Female                    NA                      NA        NA
## 61  ¦   ¦   ¦--Married               NA                      NA        NA
## 62  ¦   ¦   ¦   ¦--High School      369  $119,837          0.32      0.25
## 63  ¦   ¦   ¦   ¦--University       339  $168,112          0.36      0.27
## 64  ¦   ¦   ¦   °--Graduate School  172  $257,035          0.73      0.19
## 65  ¦   ¦   °--Single                NA                      NA        NA
## 66  ¦   ¦       ¦--High School      133  $103,684          0.27      0.21
## 67  ¦   ¦       ¦--University       112  $145,714          0.34      0.17
## 68  ¦   ¦       °--Graduate School   56  $232,857          0.59      0.18
## 69  ¦   °--Male                      NA                      NA        NA
## 70  ¦       ¦--Married               NA                      NA        NA
## 71  ¦       ¦   ¦--High School      267  $136,180          0.32      0.31
## 72  ¦       ¦   ¦--University       294  $157,619          0.27      0.29
## 73  ¦       ¦   °--Graduate School  260  $276,192          0.52      0.25
## 74  ¦       °--Single                NA                      NA        NA
## 75  ¦           ¦--High School       81  $ 95,926          0.20      0.20
## 76  ¦           ¦--University        86  $ 90,930          0.16      0.24
## 77  ¦           °--Graduate School   53  $180,755          0.32      0.19
## 78  °--60-69                         NA                      NA        NA
## 79      ¦--Female                    NA                      NA        NA
## 80      ¦   ¦--Married               NA                      NA        NA
## 81      ¦   ¦   ¦--High School       56  $124,464          0.30      0.25
## 82      ¦   ¦   ¦--University        38  $184,211          0.50      0.29
## 83      ¦   ¦   °--Graduate School   24  $267,083          0.71      0.33
## 84      ¦   °--Single                NA                      NA        NA
## 85      ¦       °--High School       15  $ 86,000          0.20      0.40
## 86      °--Male                      NA                      NA        NA
## 87          ¦--Married               NA                      NA        NA
## 88          ¦   ¦--High School       40  $173,000          0.22      0.22
## 89          ¦   ¦--University        42  $173,095          0.50      0.45
## 90          ¦   °--Graduate School   50  $288,600          0.56      0.24
## 91          °--Single                NA                      NA        NA
## 92              ¦--High School       12  $149,167          0.33      0.17
## 93              °--University        13  $192,308          0.15      0.08

Analysis

Which group has the highest average credit limit?

SEX	MARITAL_STATUS	EDUCATION	AGE_RANGE	FORMATTED_AVG_CREDIT_LIMIT
Male	Married	Graduate School	60-69	$288,600

Which group has the lowest average credit limit?

SEX	MARITAL_STATUS	EDUCATION	AGE_RANGE	FORMATTED_AVG_CREDIT_LIMIT
Male	Single	High School	20-29	$ 70,062

Which group is comprised of highest percentage of people who have a balance-to-limit rating less than or equal to 30%?

SEX	MARITAL_STATUS	EDUCATION	AGE_RANGE	PCT_IN_GROUP_GOOD_CREDIT
Female	Married	Graduate School	50-59	0.73

Which group has the lowest utilization or balance-to-limit rating?

SEX	MARITAL_STATUS	EDUCATION	AGE_RANGE	PCT_IN_GROUP_GOOD_CREDIT
Male	Single	University	60-69	0.15

Which group is the most likely to predicted to default?

SEX	MARITAL_STATUS	EDUCATION	AGE_RANGE	PCT_IN_GROUP_DEFAULT_RATE
Male	Married	University	60-69	0.45

Which group has the highest amount of debt, is the most likely to default, and is the most likely to miss a payment?

SEX	MARITAL_STATUS	EDUCATION	AGE_RANGE
Female	Single	High School	60-69

Which group has the lowest amount of debt, is the least predicted to default, and is not likely to miss a payment?

SEX	MARITAL_STATUS	EDUCATION	AGE_RANGE
Female	Single	Graduate School	30-39

Dataset 2: New York City Leading Causes of Death – NYC OpenData

DataSet Information:

This dataset collects the leading causes of death for New York City by ethnicity and sex from 2007 - 2011.

The New York City Leading Causes of Death dataset can be found directly below:

https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam

Attribute Information:

The following attributes are captured in the dataset:

Year
Ethnicity
Sex
Cause of Death
Count
Percent

Load the NYC Leading Causes of Death datset in CSV format from GitHub

Looking at the dataset, there are 2880 observations with 6 variables.

## 'data.frame':    2880 obs. of  6 variables:
##  $ Year          : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Ethnicity     : chr  "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" ...
##  $ Sex           : chr  "MALE" "MALE" "MALE" "MALE" ...
##  $ Cause.of.Death: chr  "HUMAN IMMUNODEFICIENCY VIRUS DISEASE" "INFLUENZA AND PNEUMONIA" "INTENTIONAL SELF-HARM (SUICIDE)" "MALIGNANT NEOPLASMS" ...
##  $ Count         : int  297 201 64 1540 50 70 13 111 36 35 ...
##  $ Percent       : int  5 3 1 23 1 1 0 2 1 1 ...

However, there appears to be duplicate entries in the data.

Year	Ethnicity	Sex	Cause.of.Death	Count	Percent	rn
2007	ASIAN & PACIFIC ISLANDER	FEMALE	ACCIDENTS EXCEPT DRUG POISONING	32	2	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	ACCIDENTS EXCEPT DRUG POISONING	32	2	3
2007	ASIAN & PACIFIC ISLANDER	FEMALE	ALZHEIMERS DISEASE	7	1	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	ALZHEIMERS DISEASE	7	1	3
2007	ASIAN & PACIFIC ISLANDER	FEMALE	AORTIC ANEURYSM AND DISSECTION	9	1	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	AORTIC ANEURYSM AND DISSECTION	9	1	3
2007	ASIAN & PACIFIC ISLANDER	FEMALE	ASSAULT (HOMICIDE)	6	0	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	ASSAULT (HOMICIDE)	6	0	3
2007	ASIAN & PACIFIC ISLANDER	FEMALE	ATHEROSCLEROSIS	5	0	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	ATHEROSCLEROSIS	5	0	3
2007	ASIAN & PACIFIC ISLANDER	FEMALE	BENIGN AND UNCERTAIN NEOPLASMS	6	0	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	BENIGN AND UNCERTAIN NEOPLASMS	6	0	3
2007	ASIAN & PACIFIC ISLANDER	FEMALE	CEREBROVASCULAR DISEASE	70	5	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	CEREBROVASCULAR DISEASE	70	5	3
2007	ASIAN & PACIFIC ISLANDER	FEMALE	CHRONIC LIVER DISEASE AND CIRRHOSIS	6	0	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	CHRONIC LIVER DISEASE AND CIRRHOSIS	6	0	3
2007	ASIAN & PACIFIC ISLANDER	FEMALE	CHRONIC LOWER RESPIRATORY DISEASES	29	2	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	CHRONIC LOWER RESPIRATORY DISEASES	29	2	3
2007	ASIAN & PACIFIC ISLANDER	FEMALE	DIABETES MELLITUS	34	3	2
2007	ASIAN & PACIFIC ISLANDER	FEMALE	DIABETES MELLITUS	34	3	3

Remove the duplicate entries:

nyc_deaths_df <-  nyc_deaths_df %>% 
                  group_by(Year, Ethnicity, Sex, Cause.of.Death, Count, Percent) %>% 
                  filter(row_number() == 1)

dim(nyc_deaths_df)

## [1] 960   6

Removing the duplicate reduces the data down to 960 observations.

Analysis:

Show the leading causes of death each year for men and women.

kable(nyc_deaths_df %>% 
           select(Year, Sex, Cause.of.Death, Count) %>% 
           group_by(Year, Sex) %>% 
           select(Year, Sex, Cause.of.Death, Count) %>% arrange(Year, Sex) %>%
           top_n(1) )

## Selecting by Count

Year	Sex	Cause.of.Death	Count
2007	FEMALE	DISEASES OF HEART	7050
2007	MALE	DISEASES OF HEART	5632
2008	FEMALE	DISEASES OF HEART	6836
2008	MALE	DISEASES OF HEART	5503
2009	FEMALE	DISEASES OF HEART	6297
2009	MALE	DISEASES OF HEART	5168
2010	FEMALE	DISEASES OF HEART	5351
2010	MALE	DISEASES OF HEART	4495
2011	FEMALE	DISEASES OF HEART	5016
2011	MALE	DISEASES OF HEART	4220

Show the leading causes of death each year for each ethnic group.

kable(nyc_deaths_df %>% 
         select(Year, Ethnicity, Cause.of.Death, Count) %>% 
         group_by(Year, Ethnicity) %>%
         select(Year, Ethnicity, Cause.of.Death, Count) %>% arrange(Year, Ethnicity) %>%
         top_n(1))

## Selecting by Count

Year	Ethnicity	Cause.of.Death	Count
2007	ASIAN & PACIFIC ISLANDER	MALIGNANT NEOPLASMS	528
2007	HISPANIC	DISEASES OF HEART	1418
2007	NON-HISPANIC BLACK	DISEASES OF HEART	2722
2007	NON-HISPANIC WHITE	DISEASES OF HEART	7050
2008	ASIAN & PACIFIC ISLANDER	DISEASES OF HEART	546
2008	HISPANIC	DISEASES OF HEART	1445
2008	NON-HISPANIC BLACK	DISEASES OF HEART	2725
2008	NON-HISPANIC WHITE	DISEASES OF HEART	6836
2009	ASIAN & PACIFIC ISLANDER	DISEASES OF HEART	555
2009	HISPANIC	DISEASES OF HEART	1382
2009	NON-HISPANIC BLACK	DISEASES OF HEART	2535
2009	NON-HISPANIC WHITE	DISEASES OF HEART	6297
2010	ASIAN & PACIFIC ISLANDER	MALIGNANT NEOPLASMS	527
2010	HISPANIC	DISEASES OF HEART	1355
2010	NON-HISPANIC BLACK	DISEASES OF HEART	2282
2010	NON-HISPANIC WHITE	DISEASES OF HEART	5351
2011	ASIAN & PACIFIC ISLANDER	MALIGNANT NEOPLASMS	550
2011	HISPANIC	DISEASES OF HEART	1348
2011	NON-HISPANIC BLACK	DISEASES OF HEART	2243
2011	NON-HISPANIC WHITE	DISEASES OF HEART	5016

Calculate which cause of death has declined the most and which has increased the most in the years given.

wide <- nyc_deaths_df %>% 
        select(Year, Cause.of.Death, Count) %>% 
        group_by(Year, Cause.of.Death) %>%
        summarise(Total = sum(Count)) %>% 
        spread(Year, Total)  %>% 
        mutate(Pct.Change = round((`2011`-`2007`)/`2007`, 2) *100)

wide <- wide[complete.cases(wide), ]

kable(wide %>% 
           arrange(Pct.Change) %>% 
           filter(row_number() == 1))

Cause.of.Death	2007	2008	2009	2010	2011	Pct.Change
CARDIOVASCULAR DISORDERS IN PERINATAL PERIOD	76	55	33	29	24	-68

kable(wide %>% 
          arrange(desc(Pct.Change)) %>% 
          filter(row_number() == 1))

Cause.of.Death	2007	2008	2009	2010	2011	Pct.Change
ALZHEIMERS DISEASE	276	363	514	574	619	124

Calculate which cause of death has remained stable over the years given.

The answer to this question uses the percentage change when comparing year 2007 to year 2011. There some fluctuations up and down in the amount of change in the years between.

kable(wide %>% filter(Pct.Change == 0))

Cause.of.Death	2007	2008	2009	2010	2011	Pct.Change
CONGENITAL MALFORMATIONS,DEFORMATIONS	232	283	238	228	232	0
SHORT GESTATION/LBW	92	102	106	139	92	0

Dataset 3: Total Population By Country

DataSet Information:

Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship–except for refugees not permanently settled in the country of asylum, who are generally considered part of the population of their country of origin.

The population data is captured for each country from 1960 - 2014.

The World Bank Total Populdation dataset can be found directly below:

http://data.worldbank.org/indicator/SP.POP.TOTL

Attribute Information:

The following attributes are captured in the dataset:

Country Name: Name of the Country Country Code: Country’s 3-digit ISO code
Indicator Name: Constant value - “Population, total” Indicator Code: Constant value - SP.POP.TOTL Year [1960 - 2015]: Population of the country for the given year; 2015 does not contain any values for population

Load the World Bank Total Population Data in CSV format from GitHub

The dataset contains 248 observations and has 61 variables.

## 'data.frame':    248 obs. of  61 variables:
##  $ Country.Name  : chr  "Aruba" "Andorra" "Afghanistan" "Angola" ...
##  $ Country.Code  : chr  "ABW" "AND" "AFG" "AGO" ...
##  $ Indicator.Name: chr  "Population, total" "Population, total" "Population, total" "Population, total" ...
##  $ Indicator.Code: chr  "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" ...
##  $ X1960         : num  54208 13414 8994793 5270844 1608800 ...
##  $ X1961         : num  55435 14376 9164945 5367287 1659800 ...
##  $ X1962         : num  56226 15376 9343772 5465905 1711319 ...
##  $ X1963         : num  56697 16410 9531555 5565808 1762621 ...
##  $ X1964         : num  57029 17470 9728645 5665701 1814135 ...
##  $ X1965         : num  57360 18551 9935358 5765025 1864791 ...
##  $ X1966         : num  57712 19646 10148841 5863568 1914573 ...
##  $ X1967         : num  58049 20755 10368600 5962831 1965598 ...
##  $ X1968         : num  58385 21888 10599790 6066094 2022272 ...
##  $ X1969         : num  58724 23061 10849510 6177703 2081695 ...
##  $ X1970         : num  59065 24279 11121097 6300969 2135479 ...
##  $ X1971         : num  59438 25560 11412821 6437645 2187853 ...
##  $ X1972         : num  59849 26892 11716896 6587647 2243126 ...
##  $ X1973         : num  60239 28231 12022514 6750215 2296752 ...
##  $ X1974         : num  60525 29514 12315553 6923749 2350124 ...
##  $ X1975         : num  60655 30706 12582954 7107334 2404831 ...
##  $ X1976         : num  60589 31781 12831361 7299508 2458526 ...
##  $ X1977         : num  60366 32769 13056499 7501320 2513546 ...
##  $ X1978         : num  60106 33746 13222547 7717139 2566266 ...
##  $ X1979         : num  59978 34819 13283279 7952882 2617832 ...
##  $ X1980         : num  60096 36063 13211412 8211950 2671997 ...
##  $ X1981         : num  60567 37502 12996923 8497950 2726056 ...
##  $ X1982         : num  61344 39112 12667001 8807511 2784278 ...
##  $ X1983         : num  62204 40862 12279095 9128655 2843960 ...
##  $ X1984         : num  62831 42704 11912510 9444918 2904429 ...
##  $ X1985         : num  63028 44597 11630498 9745209 2964762 ...
##  $ X1986         : num  62644 46515 11438949 10023700 3022635 ...
##  $ X1987         : num  61835 48458 11337932 10285712 3083605 ...
##  $ X1988         : num  61077 50431 11375768 10544904 3142336 ...
##  $ X1989         : num  61032 52449 11608351 10820992 3227943 ...
##  $ X1990         : num  62148 54511 12067570 11127870 3286542 ...
##  $ X1991         : num  64623 56674 12789374 11472173 3266790 ...
##  $ X1992         : num  68235 58904 13745630 11848971 3247039 ...
##  $ X1993         : num  72498 61003 14824371 12246786 3227287 ...
##  $ X1994         : num  76700 62707 15869967 12648483 3207536 ...
##  $ X1995         : num  80326 63854 16772522 13042666 3187784 ...
##  $ X1996         : num  83195 64291 17481800 13424813 3168033 ...
##  $ X1997         : num  85447 64147 18034130 13801868 3148281 ...
##  $ X1998         : num  87276 63888 18511480 14187710 3128530 ...
##  $ X1999         : num  89004 64161 19038420 14601983 3108778 ...
##  $ X2000         : num  90858 65399 19701940 15058638 3089027 ...
##  $ X2001         : num  92894 67770 20531160 15562791 3060173 ...
##  $ X2002         : num  94995 71046 21487079 16109696 3051010 ...
##  $ X2003         : num  97015 74783 22507368 16691395 3039616 ...
##  $ X2004         : num  98742 78337 23499850 17295500 3026939 ...
##  $ X2005         : num  100031 81223 24399948 17912942 3011487 ...
##  $ X2006         : num  100830 83373 25183615 18541467 2992547 ...
##  $ X2007         : num  101218 84878 25877544 19183907 2970017 ...
##  $ X2008         : num  101342 85616 26528741 19842251 2947314 ...
##  $ X2009         : num  101416 85474 27207291 20520103 2927519 ...
##  $ X2010         : num  101597 84419 27962207 21219954 2913021 ...
##  $ X2011         : num  101936 82326 28809167 21942296 2904780 ...
##  $ X2012         : num  102393 79316 29726803 22685632 2900489 ...
##  $ X2013         : num  102921 75902 30682500 23448202 2897366 ...
##  $ X2014         : num  103441 72786 31627506 24227524 2894475 ...
##  $ X2015         : logi  NA NA NA NA NA NA ...
##  $ X             : logi  NA NA NA NA NA NA ...

Remove variables that will not be used or are not valued – Indicator Name, Indicator Code, X2015, and X (which appears after processing in the CSV file).

population_df$Indicator.Name <- NULL
population_df$Indicator.Code <- NULL
population_df$X  <- NULL
population_df$X2015  <- NULL

Remove incomplete cases

population_df <- population_df[complete.cases(population_df), ]

Convert the Total Population dataset from wide to long format:

tidy_population_df <- population_df %>%
                      gather(col, "Population", -Country.Code, -Country.Name) %>%
                      separate(col, c("X", "Year"), sep="X") %>%
                      select(-X) %>% 
                      arrange(Country.Code, Year)

tidy_population_df$Year <- as.numeric(tidy_population_df$Year)

tidy_population_df <- as.data.frame(tidy_population_df)

The resulting dataframe now has one row per year and population value for a country. In total, there are 13,640 observations with just four variables.

Country.Name	Country.Code	Year	Population
Aruba	ABW	1960	54208
Aruba	ABW	1961	55435
Aruba	ABW	1962	56226
Aruba	ABW	1963	56697
Aruba	ABW	1964	57029
Aruba	ABW	1965	57360
Aruba	ABW	1966	57712
Aruba	ABW	1967	58049
Aruba	ABW	1968	58385
Aruba	ABW	1969	58724
Aruba	ABW	1970	59065
Aruba	ABW	1971	59438
Aruba	ABW	1972	59849
Aruba	ABW	1973	60239
Aruba	ABW	1974	60525
Aruba	ABW	1975	60655
Aruba	ABW	1976	60589
Aruba	ABW	1977	60366
Aruba	ABW	1978	60106

Analysis:

Determine which countries have had the most significant change in terms of straight percent population change, comparing 1960 to 2014.

population_aggr <- tidy_population_df %>%
                   na.omit() %>% 
                   group_by(Country.Code) %>% 
                   mutate(min_year = min(Year),
                   max_year = max(Year)) %>%
                   filter(Year == min_year | Year == max_year) %>%  
                   mutate(rn = str_c("M", row_number())) %>%
                   select(Country.Name, Country.Code, Population, rn) %>% 
                   spread(rn, Population) %>% 
                   rename(Min.Year=M1, Max.Year=M2) %>%
                   mutate(Pct.Change = round(((Max.Year - Min.Year)/Min.Year) * 100, 2)) 


population_aggr <- as.data.frame(population_aggr)

Using this resultset, determine which countries have experiences the most total growth and which experienced the least amount of growth:

countries <- filter(population_aggr, Pct.Change==min(Pct.Change) |Pct.Change==max(Pct.Change))

kable(countries)

Country.Name	Country.Code	Min.Year	Max.Year	Pct.Change
Bulgaria	BGR	7867374	7223938	-8.18
United Arab Emirates	ARE	92612	9086139	9710.97

We see that Bulgaria has seen its population drop the most significantly over the years, while United Arab Emirates has seen a population explosion.

To drill into these two countries more, let’s look at the percent population change by decade.

The resulting dataset looks like:

Country.Name	Country.Code	Pct.Change.1960s	Pct.Change.1970s	Pct.Change.1980s	Pct.Change.1990s	Pct.Change.2000s
Afghanistan	AFG	20.62	19.44	-12.13	57.77	38.09
Albania	ALB	29.39	22.59	20.81	-5.41	-5.23
Algeria	DZA	27.14	28.94	30.61	18.73	13.53
American Samoa	ASM	33.00	16.04	40.47	20.67	-2.25
Andorra	AND	71.92	43.41	45.44	17.70	30.70
Angola	AGO	17.21	26.22	31.77	31.22	36.27
Antigua and Barbuda	ATG	17.60	7.93	-11.85	22.83	11.14
Arab World	ARB	28.34	30.63	30.09	24.14	22.51
Argentina	ARG	14.49	15.48	14.79	11.97	10.10
Armenia	ARM	31.54	20.89	14.42	-12.72	-3.58

Looking at the change for Bulgaria and United Arab Emirates specifically:

Country.Name	Country.Code	Pct.Change.1960s	Pct.Change.1970s	Pct.Change.1980s	Pct.Change.1990s	Pct.Change.2000s
Bulgaria	BGR	7.20	3.96	0.17	-5.82	-8.88
United Arab Emirates	ARE	121.97	293.87	68.38	59.22	152.63

Plotting the population by year for both countries:

For relative growth by decade:

DATA 607: Project 2

Keith Folsom

March 12, 2016

Overview

Setup

Dataset 1: Credit Card Default Data – UCI Machine Learning Repository

DataSet Information:

Attribute Information:

Load the Credit Card Default Data in CSV format from GitHub

Start tidying the data

Visualize the groups using the `data.tree` package:

Analysis

Dataset 2: New York City Leading Causes of Death – NYC OpenData

DataSet Information:

Attribute Information:

Load the NYC Leading Causes of Death datset in CSV format from GitHub

Analysis:

Dataset 3: Total Population By Country

DataSet Information:

Attribute Information:

Load the World Bank Total Population Data in CSV format from GitHub

Analysis:

DATA 607: Project 2

Keith Folsom

March 12, 2016

Overview

Setup

Dataset 1: Credit Card Default Data – UCI Machine Learning Repository

DataSet Information:

Attribute Information:

Load the Credit Card Default Data in CSV format from GitHub

Start tidying the data

Visualize the groups using the data.tree package:

Analysis

Dataset 2: New York City Leading Causes of Death – NYC OpenData

DataSet Information:

Attribute Information:

Load the NYC Leading Causes of Death datset in CSV format from GitHub

Analysis:

Dataset 3: Total Population By Country

DataSet Information:

Attribute Information:

Load the World Bank Total Population Data in CSV format from GitHub

Analysis:

Visualize the groups using the `data.tree` package: