DATA 607 Project 2 will take three wide datasets identified in the class discussion forum and use tidyr
and dplyr
as needed to tidy and transform the data into an analyzable format.
The datasets chosen for this project are:
1. Credit Card Default Data – UCI Machine Learning Repository
2. New York City Leading Causes of Death – NYC OpenData
3. Total Population By Country – World Bank
This analysis requires the following R packages:
This research aimed at the case of customers default payments in Taiwan and was used to compare the predictive accuracy of probability of default among six data mining methods.
The Credit Card Default Data on the UCI Machine Learning Repository can be found directly below:
http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable.
This study reviewed the literature and used the following 23 variables as explanatory variables:
LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX : Gender (1 = male; 2 = female).
EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE : Marital status (1 = married; 2 = single; 3 = others).
AGE : Age (year).
(X6 - X11)
PAY_0 - PAY_6: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12-X17
BILL_AMT1 - BILL_AMT6: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
X18-X23
PAY_AMT1 - PAY_AMT6: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
The dataset contains 30,000 observations and has 25 variables.
## 'data.frame': 30000 obs. of 25 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ LIMIT_BAL : int 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ SEX : int 2 2 2 2 1 1 1 2 2 1 ...
## $ EDUCATION : int 2 2 2 2 2 1 1 2 3 3 ...
## $ MARRIAGE : int 1 2 2 1 1 2 2 2 1 2 ...
## $ AGE : int 24 26 34 37 57 37 29 23 28 35 ...
## $ PAY_0 : int 2 -1 0 0 -1 0 0 0 0 -2 ...
## $ PAY_2 : int 2 2 0 0 0 0 0 -1 0 -2 ...
## $ PAY_3 : int -1 0 0 0 -1 0 0 -1 2 -2 ...
## $ PAY_4 : int -1 0 0 0 0 0 0 0 0 -2 ...
## $ PAY_5 : int -2 0 0 0 0 0 0 0 0 -1 ...
## $ PAY_6 : int -2 2 0 0 0 0 0 -1 0 -1 ...
## $ BILL_AMT1 : int 3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
## $ BILL_AMT2 : int 3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
## $ BILL_AMT3 : int 689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
## $ BILL_AMT4 : int 0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
## $ BILL_AMT5 : int 0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
## $ BILL_AMT6 : int 0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
## $ PAY_AMT1 : int 0 0 1518 2000 2000 2500 55000 380 3329 0 ...
## $ PAY_AMT2 : int 689 1000 1500 2019 36681 1815 40000 601 0 0 ...
## $ PAY_AMT3 : int 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ PAY_AMT4 : int 0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
## $ PAY_AMT5 : int 0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
## $ PAY_AMT6 : int 0 2000 5000 1000 679 800 13770 1542 1000 0 ...
## $ default.payment.next.month: int 1 1 0 0 0 0 0 0 0 0 ...
names(credit_data_df)[names(credit_data_df) == "PAY_0"] <- "PAY_1"
## [1] "Female" "Male"
## [1] University Graduate School High School
## Levels: High School University Graduate School
## [1] "Married" "Single"
tidyr
to convert the dataset from wide to long format.## [1] "ID" "LIMIT_BAL"
## [3] "SEX" "EDUCATION"
## [5] "MARRIAGE" "AGE"
## [7] "PAY_1" "PAY_2"
## [9] "PAY_3" "PAY_4"
## [11] "PAY_5" "PAY_6"
## [13] "BILLAMT_1" "BILLAMT_2"
## [15] "BILLAMT_3" "BILLAMT_4"
## [17] "BILLAMT_5" "BILLAMT_6"
## [19] "PAYAMT_1" "PAYAMT_2"
## [21] "PAYAMT_3" "PAYAMT_4"
## [23] "PAYAMT_5" "PAYAMT_6"
## [25] "DEFAULT_NEXT_PAYMENT_IND"
tidyr
tidy_credit_data_df <- credit_data_df_s %>%
gather(col, "VAL", -ID:-AGE, -DEFAULT_NEXT_PAYMENT_IND) %>%
separate(col, c("L1", "MTH"), sep="_") %>%
spread(L1, VAL) %>%
select(ID:MTH, BILLAMT, PAYAMT, PAY)
The resulting dataset looks like this:
ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | DEFAULT_NEXT_PAYMENT_IND | MTH | BILLAMT | PAYAMT | PAY |
---|---|---|---|---|---|---|---|---|---|---|
1 | 20000 | Female | University | Married | 24 | 1 | 1 | 3913 | 0 | 2 |
2 | 120000 | Female | University | Single | 26 | 1 | 1 | 2682 | 0 | -1 |
3 | 90000 | Female | University | Single | 34 | 0 | 1 | 29239 | 1518 | 0 |
4 | 50000 | Female | University | Married | 37 | 0 | 1 | 46990 | 2000 | 0 |
5 | 50000 | Male | University | Married | 57 | 0 | 1 | 8617 | 2000 | -1 |
6 | 50000 | Male | Graduate School | Single | 37 | 0 | 1 | 64400 | 2500 | 0 |
7 | 500000 | Male | Graduate School | Single | 29 | 0 | 1 | 367965 | 55000 | 0 |
8 | 100000 | Female | University | Single | 23 | 0 | 1 | 11876 | 380 | 0 |
9 | 140000 | Female | High School | Married | 28 | 0 | 1 | 11285 | 3329 | 0 |
10 | 20000 | Male | High School | Single | 35 | 0 | 1 | 0 | 0 | -2 |
11 | 200000 | Female | High School | Single | 34 | 0 | 1 | 11073 | 2306 | 0 |
12 | 260000 | Female | Graduate School | Single | 51 | 0 | 1 | 12261 | 21818 | -1 |
13 | 630000 | Female | University | Single | 41 | 0 | 1 | 12137 | 1000 | -1 |
14 | 70000 | Male | University | Single | 30 | 1 | 1 | 65802 | 3200 | 1 |
15 | 250000 | Male | Graduate School | Single | 29 | 0 | 1 | 70887 | 3000 | 0 |
17 | 20000 | Male | Graduate School | Single | 24 | 1 | 1 | 15376 | 3200 | 0 |
18 | 320000 | Male | Graduate School | Married | 49 | 0 | 1 | 253286 | 10358 | 0 |
19 | 360000 | Female | Graduate School | Married | 49 | 0 | 1 | 0 | 0 | 1 |
20 | 180000 | Female | Graduate School | Single | 29 | 0 | 1 | 0 | 0 | 1 |
21 | 130000 | Female | High School | Single | 39 | 0 | 1 | 38358 | 3000 | 0 |
Field Name | Description |
---|---|
AMT_OWED | Running or cumulative sum of bill amount - payment amount for each individual |
AVG_6MTH_BAL | Mean value of AMT_OWED over a 6 month period |
MISSED_PAYMENTS | Maxium number of missed payments recorded for the individual |
BALANCE_TO_LIMIT_RATIO | Average 6 month balance divided by the individual’s credit limit; note anything <= .3 is considered good |
AGE_RANGE | Groups individuals into 10 year age grouping [20-29] through [70-79] |
The resulting dataset looks like this:
ID | CREDIT_LIMIT | SEX | EDUCATION | MARITAL_STATUS | DEFAULT_NEXT_PAYMENT_IND | AVG_6MTH_BAL | MISSED_PAYMENTS | BALANCE_TO_LIMIT_RATIO | AGE_RANGE |
---|---|---|---|---|---|---|---|---|---|
1 | 20000 | Female | University | Married | 1 | 6383.167 | 2 | 0.319 | 20-29 |
2 | 120000 | Female | University | Single | 1 | 6905.333 | 2 | 0.058 | 20-29 |
3 | 90000 | Female | University | Single | 0 | 59605.833 | 0 | 0.662 | 30-39 |
4 | 50000 | Female | University | Married | 0 | 143223.833 | 0 | 2.864 | 30-39 |
5 | 50000 | Male | University | Married | 0 | 13195.500 | 0 | 0.264 | 50-59 |
6 | 50000 | Male | Graduate School | Single | 0 | 164519.667 | 0 | 3.290 | 30-39 |
7 | 500000 | Male | Graduate School | Single | 0 | 1388642.500 | 0 | 2.777 | 20-29 |
8 | 100000 | Female | University | Single | 0 | 10754.667 | 0 | 0.108 | 20-29 |
9 | 140000 | Female | High School | Married | 0 | 37143.000 | 2 | 0.265 | 20-29 |
10 | 20000 | Male | High School | Single | 0 | -223.167 | -1 | -0.011 | 30-39 |
credit_data_individual
, create an aggregate dataset for the different group combinations of Sex, Age Range, Marital Status, and Education.Aggregation Level | Description |
---|---|
LVL1 | Top Level - Aggregates to Age Range |
LVL2 | Aggregates to Age Range and Sex |
LVL3 | Aggregates to Age Range, Sex, Marital Satus |
LVL4 | Bottom Level - Aggregates to Age Range, Sex, Marital Satus, Education |
data.tree
package:Field Definition:
Aggregation Level | Description |
---|---|
n | Count of individuals in the groups |
Avg Limit | Average Credit Limit of the group |
% Good Credit | Percentage representation of how many individuals in the group have a <= .3 limit-to-balance ratio |
% Default | Percentage representation of how many individuals in the group are predicted to default |
Level 1: Summarized to Age Range
## levelName n Avg Limit % Good Credit % Default
## 1 All NA NA NA
## 2 ¦--20-29 9435 $124,157 0.32 0.23
## 3 ¦--30-39 10981 $197,420 0.46 0.21
## 4 ¦--40-49 6197 $182,750 0.42 0.23
## 5 ¦--50-59 2222 $165,729 0.37 0.25
## 6 ¦--60-69 303 $186,832 0.42 0.29
## 7 °--70-79 25 $218,800 0.40 0.28
Level 2: Summarized to Age Range and Sex
## levelName n Avg Limit % Good Credit % Default
## 1 All NA NA NA
## 2 ¦--20-29 NA NA NA
## 3 ¦ ¦--Female 6219 $131,269 0.34 0.22
## 4 ¦ °--Male 3216 $110,404 0.28 0.24
## 5 ¦--30-39 NA NA NA
## 6 ¦ ¦--Female 6501 $203,130 0.50 0.19
## 7 ¦ °--Male 4480 $189,134 0.39 0.23
## 8 ¦--40-49 NA NA NA
## 9 ¦ ¦--Female 3534 $184,020 0.47 0.22
## 10 ¦ °--Male 2663 $181,066 0.37 0.25
## 11 ¦--50-59 NA NA NA
## 12 ¦ ¦--Female 1181 $159,670 0.40 0.23
## 13 ¦ °--Male 1041 $172,603 0.33 0.27
## 14 ¦--60-69 NA NA NA
## 15 ¦ ¦--Female 141 $160,355 0.43 0.31
## 16 ¦ °--Male 162 $209,877 0.41 0.27
## 17 °--70-79 NA NA NA
## 18 ¦--Female 12 $213,333 0.50 0.25
## 19 °--Male 13 $223,846 0.31 0.31
Level 3: Summarized to Age Range, Sex, and Marital Status
## levelName n Avg Limit % Good Credit % Default
## 1 All NA NA NA
## 2 ¦--20-29 NA NA NA
## 3 ¦ ¦--Female NA NA NA
## 4 ¦ ¦ ¦--Married 1154 $122,626 0.27 0.27
## 5 ¦ ¦ °--Single 5065 $133,238 0.36 0.22
## 6 ¦ °--Male NA NA NA
## 7 ¦ ¦--Married 295 $117,085 0.26 0.27
## 8 ¦ °--Single 2921 $109,730 0.29 0.24
## 9 ¦--30-39 NA NA NA
## 10 ¦ ¦--Female NA NA NA
## 11 ¦ ¦ ¦--Married 3510 $191,148 0.47 0.20
## 12 ¦ ¦ °--Single 2991 $217,192 0.53 0.17
## 13 ¦ °--Male NA NA NA
## 14 ¦ ¦--Married 1916 $196,200 0.42 0.26
## 15 ¦ °--Single 2564 $183,853 0.37 0.21
## 16 ¦--40-49 NA NA NA
## 17 ¦ ¦--Female NA NA NA
## 18 ¦ ¦ ¦--Married 2645 $184,021 0.48 0.22
## 19 ¦ ¦ °--Single 889 $184,016 0.42 0.21
## 20 ¦ °--Male NA NA NA
## 21 ¦ ¦--Married 1931 $195,831 0.41 0.25
## 22 ¦ °--Single 732 $142,114 0.25 0.24
## 23 ¦--50-59 NA NA NA
## 24 ¦ ¦--Female NA NA NA
## 25 ¦ ¦ ¦--Married 880 $165,250 0.42 0.25
## 26 ¦ ¦ °--Single 301 $143,355 0.36 0.19
## 27 ¦ °--Male NA NA NA
## 28 ¦ ¦--Married 821 $188,197 0.37 0.28
## 29 ¦ °--Single 220 $114,409 0.21 0.21
## 30 ¦--60-69 NA NA NA
## 31 ¦ ¦--Female NA NA NA
## 32 ¦ ¦ ¦--Married 118 $172,712 0.45 0.28
## 33 ¦ ¦ °--Single 23 $ 96,957 0.30 0.48
## 34 ¦ °--Male NA NA NA
## 35 ¦ ¦--Married 132 $216,818 0.44 0.30
## 36 ¦ °--Single 30 $179,333 0.27 0.13
## 37 °--70-79 NA NA NA
## 38 °--Male NA NA NA
## 39 °--Married 13 $223,846 0.31 0.31
Level 4: Summarized to Age Range, Sex, Marital Status, and Education
## levelName n Avg Limit % Good Credit % Default
## 1 All NA NA NA
## 2 ¦--20-29 NA NA NA
## 3 ¦ ¦--Female NA NA NA
## 4 ¦ ¦ ¦--Married NA NA NA
## 5 ¦ ¦ ¦ ¦--High School 192 $ 99,688 0.21 0.29
## 6 ¦ ¦ ¦ ¦--University 814 $118,206 0.25 0.28
## 7 ¦ ¦ ¦ °--Graduate School 148 $176,689 0.45 0.16
## 8 ¦ ¦ °--Single NA NA NA
## 9 ¦ ¦ ¦--High School 365 $110,219 0.31 0.24
## 10 ¦ ¦ ¦--University 2337 $119,718 0.28 0.24
## 11 ¦ ¦ °--Graduate School 2363 $150,165 0.45 0.19
## 12 ¦ °--Male NA NA NA
## 13 ¦ ¦--Married NA NA NA
## 14 ¦ ¦ ¦--High School 53 $103,962 0.23 0.32
## 15 ¦ ¦ ¦--University 197 $110,457 0.25 0.26
## 16 ¦ ¦ °--Graduate School 45 $161,556 0.36 0.27
## 17 ¦ °--Single NA NA NA
## 18 ¦ ¦--High School 325 $ 70,062 0.22 0.30
## 19 ¦ ¦--University 1462 $ 91,826 0.20 0.27
## 20 ¦ °--Graduate School 1134 $144,180 0.41 0.19
## 21 ¦--30-39 NA NA NA
## 22 ¦ ¦--Female NA NA NA
## 23 ¦ ¦ ¦--Married NA NA NA
## 24 ¦ ¦ ¦ ¦--High School 500 $141,260 0.34 0.22
## 25 ¦ ¦ ¦ ¦--University 1948 $173,850 0.40 0.21
## 26 ¦ ¦ ¦ °--Graduate School 1062 $246,365 0.67 0.19
## 27 ¦ ¦ °--Single NA NA NA
## 28 ¦ ¦ ¦--High School 266 $185,677 0.49 0.16
## 29 ¦ ¦ ¦--University 1220 $192,311 0.39 0.18
## 30 ¦ ¦ °--Graduate School 1505 $242,930 0.65 0.16
## 31 ¦ °--Male NA NA NA
## 32 ¦ ¦--Married NA NA NA
## 33 ¦ ¦ ¦--High School 271 $150,701 0.30 0.31
## 34 ¦ ¦ ¦--University 1010 $170,178 0.34 0.26
## 35 ¦ ¦ °--Graduate School 635 $257,008 0.59 0.22
## 36 ¦ °--Single NA NA NA
## 37 ¦ ¦--High School 279 $136,022 0.26 0.24
## 38 ¦ ¦--University 1045 $142,737 0.27 0.23
## 39 ¦ °--Graduate School 1240 $229,266 0.49 0.19
## 40 ¦--40-49 NA NA NA
## 41 ¦ ¦--Female NA NA NA
## 42 ¦ ¦ ¦--Married NA NA NA
## 43 ¦ ¦ ¦ ¦--High School 688 $133,808 0.38 0.26
## 44 ¦ ¦ ¦ ¦--University 1331 $176,255 0.45 0.22
## 45 ¦ ¦ ¦ °--Graduate School 626 $255,719 0.67 0.18
## 46 ¦ ¦ °--Single NA NA NA
## 47 ¦ ¦ ¦--High School 235 $139,617 0.34 0.25
## 48 ¦ ¦ ¦--University 407 $167,052 0.35 0.20
## 49 ¦ ¦ °--Graduate School 247 $254,211 0.60 0.20
## 50 ¦ °--Male NA NA NA
## 51 ¦ ¦--Married NA NA NA
## 52 ¦ ¦ ¦--High School 414 $137,222 0.29 0.27
## 53 ¦ ¦ ¦--University 824 $165,534 0.32 0.27
## 54 ¦ ¦ °--Graduate School 693 $266,869 0.59 0.23
## 55 ¦ °--Single NA NA NA
## 56 ¦ ¦--High School 197 $100,902 0.21 0.24
## 57 ¦ ¦--University 334 $110,329 0.18 0.27
## 58 ¦ °--Graduate School 201 $235,323 0.41 0.20
## 59 ¦--50-59 NA NA NA
## 60 ¦ ¦--Female NA NA NA
## 61 ¦ ¦ ¦--Married NA NA NA
## 62 ¦ ¦ ¦ ¦--High School 369 $119,837 0.32 0.25
## 63 ¦ ¦ ¦ ¦--University 339 $168,112 0.36 0.27
## 64 ¦ ¦ ¦ °--Graduate School 172 $257,035 0.73 0.19
## 65 ¦ ¦ °--Single NA NA NA
## 66 ¦ ¦ ¦--High School 133 $103,684 0.27 0.21
## 67 ¦ ¦ ¦--University 112 $145,714 0.34 0.17
## 68 ¦ ¦ °--Graduate School 56 $232,857 0.59 0.18
## 69 ¦ °--Male NA NA NA
## 70 ¦ ¦--Married NA NA NA
## 71 ¦ ¦ ¦--High School 267 $136,180 0.32 0.31
## 72 ¦ ¦ ¦--University 294 $157,619 0.27 0.29
## 73 ¦ ¦ °--Graduate School 260 $276,192 0.52 0.25
## 74 ¦ °--Single NA NA NA
## 75 ¦ ¦--High School 81 $ 95,926 0.20 0.20
## 76 ¦ ¦--University 86 $ 90,930 0.16 0.24
## 77 ¦ °--Graduate School 53 $180,755 0.32 0.19
## 78 °--60-69 NA NA NA
## 79 ¦--Female NA NA NA
## 80 ¦ ¦--Married NA NA NA
## 81 ¦ ¦ ¦--High School 56 $124,464 0.30 0.25
## 82 ¦ ¦ ¦--University 38 $184,211 0.50 0.29
## 83 ¦ ¦ °--Graduate School 24 $267,083 0.71 0.33
## 84 ¦ °--Single NA NA NA
## 85 ¦ °--High School 15 $ 86,000 0.20 0.40
## 86 °--Male NA NA NA
## 87 ¦--Married NA NA NA
## 88 ¦ ¦--High School 40 $173,000 0.22 0.22
## 89 ¦ ¦--University 42 $173,095 0.50 0.45
## 90 ¦ °--Graduate School 50 $288,600 0.56 0.24
## 91 °--Single NA NA NA
## 92 ¦--High School 12 $149,167 0.33 0.17
## 93 °--University 13 $192,308 0.15 0.08
Which group has the highest average credit limit?
SEX | MARITAL_STATUS | EDUCATION | AGE_RANGE | FORMATTED_AVG_CREDIT_LIMIT |
---|---|---|---|---|
Male | Married | Graduate School | 60-69 | $288,600 |
Which group has the lowest average credit limit?
SEX | MARITAL_STATUS | EDUCATION | AGE_RANGE | FORMATTED_AVG_CREDIT_LIMIT |
---|---|---|---|---|
Male | Single | High School | 20-29 | $ 70,062 |
Which group is comprised of highest percentage of people who have a balance-to-limit rating less than or equal to 30%?
SEX | MARITAL_STATUS | EDUCATION | AGE_RANGE | PCT_IN_GROUP_GOOD_CREDIT |
---|---|---|---|---|
Female | Married | Graduate School | 50-59 | 0.73 |
Which group has the lowest utilization or balance-to-limit rating?
SEX | MARITAL_STATUS | EDUCATION | AGE_RANGE | PCT_IN_GROUP_GOOD_CREDIT |
---|---|---|---|---|
Male | Single | University | 60-69 | 0.15 |
Which group is the most likely to predicted to default?
SEX | MARITAL_STATUS | EDUCATION | AGE_RANGE | PCT_IN_GROUP_DEFAULT_RATE |
---|---|---|---|---|
Male | Married | University | 60-69 | 0.45 |
Which group has the highest amount of debt, is the most likely to default, and is the most likely to miss a payment?
SEX | MARITAL_STATUS | EDUCATION | AGE_RANGE |
---|---|---|---|
Female | Single | High School | 60-69 |
Which group has the lowest amount of debt, is the least predicted to default, and is not likely to miss a payment?
SEX | MARITAL_STATUS | EDUCATION | AGE_RANGE |
---|---|---|---|
Female | Single | Graduate School | 30-39 |
This dataset collects the leading causes of death for New York City by ethnicity and sex from 2007 - 2011.
The New York City Leading Causes of Death dataset can be found directly below:
https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam
The following attributes are captured in the dataset:
Year
Ethnicity
Sex
Cause of Death
Count
Percent
Looking at the dataset, there are 2880 observations with 6 variables.
## 'data.frame': 2880 obs. of 6 variables:
## $ Year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Ethnicity : chr "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" "NON-HISPANIC BLACK" ...
## $ Sex : chr "MALE" "MALE" "MALE" "MALE" ...
## $ Cause.of.Death: chr "HUMAN IMMUNODEFICIENCY VIRUS DISEASE" "INFLUENZA AND PNEUMONIA" "INTENTIONAL SELF-HARM (SUICIDE)" "MALIGNANT NEOPLASMS" ...
## $ Count : int 297 201 64 1540 50 70 13 111 36 35 ...
## $ Percent : int 5 3 1 23 1 1 0 2 1 1 ...
However, there appears to be duplicate entries in the data.
Year | Ethnicity | Sex | Cause.of.Death | Count | Percent | rn |
---|---|---|---|---|---|---|
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | ACCIDENTS EXCEPT DRUG POISONING | 32 | 2 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | ACCIDENTS EXCEPT DRUG POISONING | 32 | 2 | 3 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | ALZHEIMERS DISEASE | 7 | 1 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | ALZHEIMERS DISEASE | 7 | 1 | 3 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | AORTIC ANEURYSM AND DISSECTION | 9 | 1 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | AORTIC ANEURYSM AND DISSECTION | 9 | 1 | 3 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | ASSAULT (HOMICIDE) | 6 | 0 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | ASSAULT (HOMICIDE) | 6 | 0 | 3 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | ATHEROSCLEROSIS | 5 | 0 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | ATHEROSCLEROSIS | 5 | 0 | 3 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | BENIGN AND UNCERTAIN NEOPLASMS | 6 | 0 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | BENIGN AND UNCERTAIN NEOPLASMS | 6 | 0 | 3 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | CEREBROVASCULAR DISEASE | 70 | 5 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | CEREBROVASCULAR DISEASE | 70 | 5 | 3 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | CHRONIC LIVER DISEASE AND CIRRHOSIS | 6 | 0 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | CHRONIC LIVER DISEASE AND CIRRHOSIS | 6 | 0 | 3 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | CHRONIC LOWER RESPIRATORY DISEASES | 29 | 2 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | CHRONIC LOWER RESPIRATORY DISEASES | 29 | 2 | 3 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | DIABETES MELLITUS | 34 | 3 | 2 |
2007 | ASIAN & PACIFIC ISLANDER | FEMALE | DIABETES MELLITUS | 34 | 3 | 3 |
Remove the duplicate entries:
nyc_deaths_df <- nyc_deaths_df %>%
group_by(Year, Ethnicity, Sex, Cause.of.Death, Count, Percent) %>%
filter(row_number() == 1)
dim(nyc_deaths_df)
## [1] 960 6
Removing the duplicate reduces the data down to 960 observations.
Show the leading causes of death each year for men and women.
kable(nyc_deaths_df %>%
select(Year, Sex, Cause.of.Death, Count) %>%
group_by(Year, Sex) %>%
select(Year, Sex, Cause.of.Death, Count) %>% arrange(Year, Sex) %>%
top_n(1) )
## Selecting by Count
Year | Sex | Cause.of.Death | Count |
---|---|---|---|
2007 | FEMALE | DISEASES OF HEART | 7050 |
2007 | MALE | DISEASES OF HEART | 5632 |
2008 | FEMALE | DISEASES OF HEART | 6836 |
2008 | MALE | DISEASES OF HEART | 5503 |
2009 | FEMALE | DISEASES OF HEART | 6297 |
2009 | MALE | DISEASES OF HEART | 5168 |
2010 | FEMALE | DISEASES OF HEART | 5351 |
2010 | MALE | DISEASES OF HEART | 4495 |
2011 | FEMALE | DISEASES OF HEART | 5016 |
2011 | MALE | DISEASES OF HEART | 4220 |
Show the leading causes of death each year for each ethnic group.
kable(nyc_deaths_df %>%
select(Year, Ethnicity, Cause.of.Death, Count) %>%
group_by(Year, Ethnicity) %>%
select(Year, Ethnicity, Cause.of.Death, Count) %>% arrange(Year, Ethnicity) %>%
top_n(1))
## Selecting by Count
Year | Ethnicity | Cause.of.Death | Count |
---|---|---|---|
2007 | ASIAN & PACIFIC ISLANDER | MALIGNANT NEOPLASMS | 528 |
2007 | HISPANIC | DISEASES OF HEART | 1418 |
2007 | NON-HISPANIC BLACK | DISEASES OF HEART | 2722 |
2007 | NON-HISPANIC WHITE | DISEASES OF HEART | 7050 |
2008 | ASIAN & PACIFIC ISLANDER | DISEASES OF HEART | 546 |
2008 | HISPANIC | DISEASES OF HEART | 1445 |
2008 | NON-HISPANIC BLACK | DISEASES OF HEART | 2725 |
2008 | NON-HISPANIC WHITE | DISEASES OF HEART | 6836 |
2009 | ASIAN & PACIFIC ISLANDER | DISEASES OF HEART | 555 |
2009 | HISPANIC | DISEASES OF HEART | 1382 |
2009 | NON-HISPANIC BLACK | DISEASES OF HEART | 2535 |
2009 | NON-HISPANIC WHITE | DISEASES OF HEART | 6297 |
2010 | ASIAN & PACIFIC ISLANDER | MALIGNANT NEOPLASMS | 527 |
2010 | HISPANIC | DISEASES OF HEART | 1355 |
2010 | NON-HISPANIC BLACK | DISEASES OF HEART | 2282 |
2010 | NON-HISPANIC WHITE | DISEASES OF HEART | 5351 |
2011 | ASIAN & PACIFIC ISLANDER | MALIGNANT NEOPLASMS | 550 |
2011 | HISPANIC | DISEASES OF HEART | 1348 |
2011 | NON-HISPANIC BLACK | DISEASES OF HEART | 2243 |
2011 | NON-HISPANIC WHITE | DISEASES OF HEART | 5016 |
Calculate which cause of death has declined the most and which has increased the most in the years given.
wide <- nyc_deaths_df %>%
select(Year, Cause.of.Death, Count) %>%
group_by(Year, Cause.of.Death) %>%
summarise(Total = sum(Count)) %>%
spread(Year, Total) %>%
mutate(Pct.Change = round((`2011`-`2007`)/`2007`, 2) *100)
wide <- wide[complete.cases(wide), ]
kable(wide %>%
arrange(Pct.Change) %>%
filter(row_number() == 1))
Cause.of.Death | 2007 | 2008 | 2009 | 2010 | 2011 | Pct.Change |
---|---|---|---|---|---|---|
CARDIOVASCULAR DISORDERS IN PERINATAL PERIOD | 76 | 55 | 33 | 29 | 24 | -68 |
kable(wide %>%
arrange(desc(Pct.Change)) %>%
filter(row_number() == 1))
Cause.of.Death | 2007 | 2008 | 2009 | 2010 | 2011 | Pct.Change |
---|---|---|---|---|---|---|
ALZHEIMERS DISEASE | 276 | 363 | 514 | 574 | 619 | 124 |
Calculate which cause of death has remained stable over the years given.
The answer to this question uses the percentage change when comparing year 2007 to year 2011. There some fluctuations up and down in the amount of change in the years between.
kable(wide %>% filter(Pct.Change == 0))
Cause.of.Death | 2007 | 2008 | 2009 | 2010 | 2011 | Pct.Change |
---|---|---|---|---|---|---|
CONGENITAL MALFORMATIONS,DEFORMATIONS | 232 | 283 | 238 | 228 | 232 | 0 |
SHORT GESTATION/LBW | 92 | 102 | 106 | 139 | 92 | 0 |
Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship–except for refugees not permanently settled in the country of asylum, who are generally considered part of the population of their country of origin.
The population data is captured for each country from 1960 - 2014.
The World Bank Total Populdation dataset can be found directly below:
The following attributes are captured in the dataset:
Country Name: Name of the Country Country Code: Country’s 3-digit ISO code
Indicator Name: Constant value - “Population, total” Indicator Code: Constant value - SP.POP.TOTL Year [1960 - 2015]: Population of the country for the given year; 2015 does not contain any values for population
The dataset contains 248 observations and has 61 variables.
## 'data.frame': 248 obs. of 61 variables:
## $ Country.Name : chr "Aruba" "Andorra" "Afghanistan" "Angola" ...
## $ Country.Code : chr "ABW" "AND" "AFG" "AGO" ...
## $ Indicator.Name: chr "Population, total" "Population, total" "Population, total" "Population, total" ...
## $ Indicator.Code: chr "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" ...
## $ X1960 : num 54208 13414 8994793 5270844 1608800 ...
## $ X1961 : num 55435 14376 9164945 5367287 1659800 ...
## $ X1962 : num 56226 15376 9343772 5465905 1711319 ...
## $ X1963 : num 56697 16410 9531555 5565808 1762621 ...
## $ X1964 : num 57029 17470 9728645 5665701 1814135 ...
## $ X1965 : num 57360 18551 9935358 5765025 1864791 ...
## $ X1966 : num 57712 19646 10148841 5863568 1914573 ...
## $ X1967 : num 58049 20755 10368600 5962831 1965598 ...
## $ X1968 : num 58385 21888 10599790 6066094 2022272 ...
## $ X1969 : num 58724 23061 10849510 6177703 2081695 ...
## $ X1970 : num 59065 24279 11121097 6300969 2135479 ...
## $ X1971 : num 59438 25560 11412821 6437645 2187853 ...
## $ X1972 : num 59849 26892 11716896 6587647 2243126 ...
## $ X1973 : num 60239 28231 12022514 6750215 2296752 ...
## $ X1974 : num 60525 29514 12315553 6923749 2350124 ...
## $ X1975 : num 60655 30706 12582954 7107334 2404831 ...
## $ X1976 : num 60589 31781 12831361 7299508 2458526 ...
## $ X1977 : num 60366 32769 13056499 7501320 2513546 ...
## $ X1978 : num 60106 33746 13222547 7717139 2566266 ...
## $ X1979 : num 59978 34819 13283279 7952882 2617832 ...
## $ X1980 : num 60096 36063 13211412 8211950 2671997 ...
## $ X1981 : num 60567 37502 12996923 8497950 2726056 ...
## $ X1982 : num 61344 39112 12667001 8807511 2784278 ...
## $ X1983 : num 62204 40862 12279095 9128655 2843960 ...
## $ X1984 : num 62831 42704 11912510 9444918 2904429 ...
## $ X1985 : num 63028 44597 11630498 9745209 2964762 ...
## $ X1986 : num 62644 46515 11438949 10023700 3022635 ...
## $ X1987 : num 61835 48458 11337932 10285712 3083605 ...
## $ X1988 : num 61077 50431 11375768 10544904 3142336 ...
## $ X1989 : num 61032 52449 11608351 10820992 3227943 ...
## $ X1990 : num 62148 54511 12067570 11127870 3286542 ...
## $ X1991 : num 64623 56674 12789374 11472173 3266790 ...
## $ X1992 : num 68235 58904 13745630 11848971 3247039 ...
## $ X1993 : num 72498 61003 14824371 12246786 3227287 ...
## $ X1994 : num 76700 62707 15869967 12648483 3207536 ...
## $ X1995 : num 80326 63854 16772522 13042666 3187784 ...
## $ X1996 : num 83195 64291 17481800 13424813 3168033 ...
## $ X1997 : num 85447 64147 18034130 13801868 3148281 ...
## $ X1998 : num 87276 63888 18511480 14187710 3128530 ...
## $ X1999 : num 89004 64161 19038420 14601983 3108778 ...
## $ X2000 : num 90858 65399 19701940 15058638 3089027 ...
## $ X2001 : num 92894 67770 20531160 15562791 3060173 ...
## $ X2002 : num 94995 71046 21487079 16109696 3051010 ...
## $ X2003 : num 97015 74783 22507368 16691395 3039616 ...
## $ X2004 : num 98742 78337 23499850 17295500 3026939 ...
## $ X2005 : num 100031 81223 24399948 17912942 3011487 ...
## $ X2006 : num 100830 83373 25183615 18541467 2992547 ...
## $ X2007 : num 101218 84878 25877544 19183907 2970017 ...
## $ X2008 : num 101342 85616 26528741 19842251 2947314 ...
## $ X2009 : num 101416 85474 27207291 20520103 2927519 ...
## $ X2010 : num 101597 84419 27962207 21219954 2913021 ...
## $ X2011 : num 101936 82326 28809167 21942296 2904780 ...
## $ X2012 : num 102393 79316 29726803 22685632 2900489 ...
## $ X2013 : num 102921 75902 30682500 23448202 2897366 ...
## $ X2014 : num 103441 72786 31627506 24227524 2894475 ...
## $ X2015 : logi NA NA NA NA NA NA ...
## $ X : logi NA NA NA NA NA NA ...
Remove variables that will not be used or are not valued – Indicator Name, Indicator Code, X2015, and X (which appears after processing in the CSV file).
population_df$Indicator.Name <- NULL
population_df$Indicator.Code <- NULL
population_df$X <- NULL
population_df$X2015 <- NULL
Remove incomplete cases
population_df <- population_df[complete.cases(population_df), ]
Convert the Total Population dataset from wide to long format:
tidy_population_df <- population_df %>%
gather(col, "Population", -Country.Code, -Country.Name) %>%
separate(col, c("X", "Year"), sep="X") %>%
select(-X) %>%
arrange(Country.Code, Year)
tidy_population_df$Year <- as.numeric(tidy_population_df$Year)
tidy_population_df <- as.data.frame(tidy_population_df)
The resulting dataframe now has one row per year and population value for a country. In total, there are 13,640 observations with just four variables.
Country.Name | Country.Code | Year | Population |
---|---|---|---|
Aruba | ABW | 1960 | 54208 |
Aruba | ABW | 1961 | 55435 |
Aruba | ABW | 1962 | 56226 |
Aruba | ABW | 1963 | 56697 |
Aruba | ABW | 1964 | 57029 |
Aruba | ABW | 1965 | 57360 |
Aruba | ABW | 1966 | 57712 |
Aruba | ABW | 1967 | 58049 |
Aruba | ABW | 1968 | 58385 |
Aruba | ABW | 1969 | 58724 |
Aruba | ABW | 1970 | 59065 |
Aruba | ABW | 1971 | 59438 |
Aruba | ABW | 1972 | 59849 |
Aruba | ABW | 1973 | 60239 |
Aruba | ABW | 1974 | 60525 |
Aruba | ABW | 1975 | 60655 |
Aruba | ABW | 1976 | 60589 |
Aruba | ABW | 1977 | 60366 |
Aruba | ABW | 1978 | 60106 |
Determine which countries have had the most significant change in terms of straight percent population change, comparing 1960 to 2014.
population_aggr <- tidy_population_df %>%
na.omit() %>%
group_by(Country.Code) %>%
mutate(min_year = min(Year),
max_year = max(Year)) %>%
filter(Year == min_year | Year == max_year) %>%
mutate(rn = str_c("M", row_number())) %>%
select(Country.Name, Country.Code, Population, rn) %>%
spread(rn, Population) %>%
rename(Min.Year=M1, Max.Year=M2) %>%
mutate(Pct.Change = round(((Max.Year - Min.Year)/Min.Year) * 100, 2))
population_aggr <- as.data.frame(population_aggr)
Using this resultset, determine which countries have experiences the most total growth and which experienced the least amount of growth:
countries <- filter(population_aggr, Pct.Change==min(Pct.Change) |Pct.Change==max(Pct.Change))
kable(countries)
Country.Name | Country.Code | Min.Year | Max.Year | Pct.Change |
---|---|---|---|---|
Bulgaria | BGR | 7867374 | 7223938 | -8.18 |
United Arab Emirates | ARE | 92612 | 9086139 | 9710.97 |
We see that Bulgaria has seen its population drop the most significantly over the years, while United Arab Emirates has seen a population explosion.
To drill into these two countries more, let’s look at the percent population change by decade.
The resulting dataset looks like:
Country.Name | Country.Code | Pct.Change.1960s | Pct.Change.1970s | Pct.Change.1980s | Pct.Change.1990s | Pct.Change.2000s |
---|---|---|---|---|---|---|
Afghanistan | AFG | 20.62 | 19.44 | -12.13 | 57.77 | 38.09 |
Albania | ALB | 29.39 | 22.59 | 20.81 | -5.41 | -5.23 |
Algeria | DZA | 27.14 | 28.94 | 30.61 | 18.73 | 13.53 |
American Samoa | ASM | 33.00 | 16.04 | 40.47 | 20.67 | -2.25 |
Andorra | AND | 71.92 | 43.41 | 45.44 | 17.70 | 30.70 |
Angola | AGO | 17.21 | 26.22 | 31.77 | 31.22 | 36.27 |
Antigua and Barbuda | ATG | 17.60 | 7.93 | -11.85 | 22.83 | 11.14 |
Arab World | ARB | 28.34 | 30.63 | 30.09 | 24.14 | 22.51 |
Argentina | ARG | 14.49 | 15.48 | 14.79 | 11.97 | 10.10 |
Armenia | ARM | 31.54 | 20.89 | 14.42 | -12.72 | -3.58 |
Looking at the change for Bulgaria and United Arab Emirates specifically:
Country.Name | Country.Code | Pct.Change.1960s | Pct.Change.1970s | Pct.Change.1980s | Pct.Change.1990s | Pct.Change.2000s |
---|---|---|---|---|---|---|
Bulgaria | BGR | 7.20 | 3.96 | 0.17 | -5.82 | -8.88 |
United Arab Emirates | ARE | 121.97 | 293.87 | 68.38 | 59.22 | 152.63 |
Plotting the population by year for both countries:
For relative growth by decade: