rmarkdown output
We are using prettydoc with cayman theme, for our rmarkdown this week. prettydoc has great documentation in this link https://prettydoc.statr.me/index.html
Problem Statement
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to: (1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
- Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
- Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
- Perform the analysis requested in the discussion item.
- Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
- Please include in your homework submission, for each of the three chosen datasets:
- The URL to the .Rmd file in your GitHub repository, and
- The URL for your rpubs.com web page.
Git-Hub
The datasets and .rmd file used in this project can be found at: https://github.com/forhadakbar/data607fall2019/tree/master/Week%2006 and https://github.com/ShovanBiswas/DATA607/tree/master/Week6-Project2
Loading package
The tidyverse includes both tidyr and dplyr. There’s Amazing documentation on tidyverse site: https://www.tidyverse.org/packages/.
Dataset-1: Population Migrations to USA
This is a wide dataset, in UN’s website, containing data on migrations, between countries, from 1990 to 2019. The link is as follows: https://www.un.org/en/development/desa/population/migration/data/estimates2/estimates15.asp
The questions, we tried to address here are:
- Top 10 countries from which migrations happned to USA.
- Bottom 10 countries from which migrations happned to USA.
- Trend changes in population migrations to USA, over time.
Create a .csv file
After downloading dataset from original link, it was uploaded to github, which has been read below.
migrantstock <- read.csv(paste("https://raw.githubusercontent.com/forhadakbar/data607fall2019/master/Week%2006/UN_MigrantStockByOriginAndDestination_2019.csv"),stringsAsFactors = F)
head(migrantstock, n = 5)## Year Sort.order Major.area..region..country.or.area.of.destination Notes
## 1 1990 1990001 WORLD
## 2 1990 1990002 UN development groups
## 3 1990 1990003 More developed regions b
## 4 1990 1990004 Less developed regions c
## 5 1990 1990005 Least developed countries d
## Code Type.of.data..a. Afghanistan Albania Algeria American.Samoa Andorra
## 1 900 6823350 180284 921727 2041 3792
## 2 NA .. .. .. .. ..
## 3 901 119386 177986 867015 1027 3737
## 4 902 6703964 2298 54712 1014 55
## 5 941 0 0 5622 0 0
## Angola Anguilla Antigua.and.Barbuda Argentina Armenia Aruba Australia
## 1 824942 2047 21753 430169 899649 10596 303696
## 2 .. .. .. .. .. .. ..
## 3 167381 540 14561 223308 654111 4639 245256
## 4 657561 1507 7192 206861 245538 5957 58440
## 5 608108 0 0 526 0 0 1465
## Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium
## 1 506088 1634081 25182 12820 5451831 84931 1767606 365360
## 2 .. .. .. .. .. .. .. ..
## 3 471950 1056543 23376 1783 161091 79563 1570292 338069
## 4 34138 577538 1806 11037 5290740 5368 197314 27291
## 5 190 0 0 89 2046 0 0 973
## Belize Benin Bermuda Bhutan Bolivia..Plurinational.State.of.
## 1 36117 234314 71702 28465 224693
## 2 .. .. .. .. ..
## 3 32183 14838 71550 554 43272
## 4 3934 219476 152 27911 181421
## 5 0 41733 0 16805 0
## Bonaire..Sint.Eustatius.and.Saba Bosnia.and.Herzegovina Botswana Brazil
## 1 3206 861766 26053 500392
## 2 .. .. .. ..
## 3 2900 854349 2843 282969
## 4 306 7417 23210 217423
## 5 0 0 652 1631
## British.Virgin.Islands Brunei.Darussalam Bulgaria Burkina.Faso Burundi
## 1 3094 26323 613093 1021332 337118
## 2 .. .. .. .. ..
## 3 158 6623 122758 6386 4620
## 4 2936 19700 490335 1014946 332498
## 5 0 0 0 43162 329604
## Cabo.Verde Cambodia Cameroon Canada Cayman.Islands
## 1 91368 355430 115853 998163 373
## 2 .. .. .. .. ..
## 3 80227 237274 50166 938586 269
## 4 11141 118156 65687 59577 104
## 5 10796 14322 33380 1876 0
## Central.African.Republic Chad Channel.Islands Chile China
## 1 46362 336802 18726 493026 4231648
## 2 .. .. .. .. ..
## 3 9114 3608 18625 191694 1460345
## 4 37248 333194 101 301332 2771303
## 5 26440 208161 0 333 245923
## China..Hong.Kong.SAR China..Macao.SAR Colombia Comoros Congo
## 1 551080 95648 1009935 40083 96372
## 2 .. .. .. .. ..
## 3 516749 14382 365140 18946 61595
## 4 34331 81266 644795 21137 34777
## 5 54 0 828 8549 25168
## Cook.Islands Costa.Rica Côte.d.Ivoire Croatia Cuba Curaçao Cyprus
## 1 17488 69711 366348 425807 835796 43190 174378
## 2 .. .. .. .. .. .. ..
## 3 17441 48040 49237 422110 774534 39067 163985
## 4 47 21671 317111 3697 61262 4123 10393
## 5 0 274 308481 0 1493 0 5
## Czechia Dem..People.s.Republic.of.Korea Democratic.Republic.of.the.Congo
## 1 277260 39784 436526
## 2 .. .. ..
## 3 268472 15643 95528
## 4 8788 24141 340998
## 5 0 78 257210
## Denmark Djibouti Dominica Dominican.Republic Ecuador Egypt El.Salvador
## 1 201761 5308 42437 466216 214008 1322178 1242075
## 2 .. .. .. .. .. .. ..
## 3 194363 3466 23814 387566 165601 269821 507652
## 4 7398 1842 18623 78650 48407 1052357 734423
## 5 189 904 0 2406 0 30582 0
## Equatorial.Guinea Eritrea Estonia Eswatini Ethiopia
## 1 36178 170603 113905 35181 1689955
## 2 .. .. .. .. ..
## 3 8443 25565 110168 546 116900
## 4 27735 145038 3737 34635 1573055
## 5 355 134471 0 102 1432950
## Falkland.Islands..Malvinas. Faroe.Islands Fiji Finland France
## 1 260 7520 90166 250765 1215895
## 2 .. .. .. .. ..
## 3 237 7518 85997 247088 881133
## 4 23 2 4169 3677 334762
## 5 0 0 721 0 38541
## French.Guiana French.Polynesia Gabon Gambia Georgia Germany Ghana
## 1 2844 3149 15352 36280 919454 2929448 371162
## 2 .. .. .. .. .. .. ..
## 3 61 324 8717 12656 801710 2488935 127620
## 4 2783 2825 6635 23624 117744 440513 243542
## 5 0 0 5273 13064 0 4776 58990
## Gibraltar Greece Greenland Grenada Guadeloupe Guam Guatemala Guinea
## 1 11994 1022459 9510 43249 5828 1376 348332 352763
## 2 .. .. .. .. .. .. .. ..
## 3 11920 890569 8997 23826 162 43 245297 14712
## 4 74 131890 513 19423 5666 1333 103035 338051
## 5 0 40 0 0 0 0 0 200952
## Guinea.Bissau Guyana Haiti Holy.See Honduras Hungary Iceland India
## 1 55409 233731 528873 31 156594 387514 17621 6623177
## 2 .. .. .. .. .. .. .. ..
## 3 16106 206865 294766 10 114335 360160 17533 1232954
## 4 39303 26866 234107 21 42259 27354 88 5390223
## 5 37630 0 0 0 0 10 0 462470
## Indonesia Iran..Islamic.Republic.of. Iraq Ireland Isle.of.Man Israel
## 1 1638365 631339 1506702 917286 10735 281597
## 2 .. .. .. .. .. ..
## 3 309159 518437 145177 908952 10735 169372
## 4 1329206 112902 1361525 8334 0 112225
## 5 75246 0 5020 1 0 0
## Italy Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait
## 1 3351006 589010 608921 313997 2971639 250340 4053 81611
## 2 .. .. .. .. .. .. .. ..
## 3 2789415 575132 430558 70531 2833828 154625 989 16425
## 4 561591 13878 178363 243466 137811 95715 3064 65186
## 5 2401 0 9729 0 0 82931 1054 29
## Kyrgyzstan Lao.People.s.Democratic.Republic Latvia Lebanon Lesotho
## 1 522578 483021 215134 509323 191339
## 2 .. .. .. .. ..
## 3 483043 251777 200654 370568 365
## 4 39535 231244 14480 138755 190974
## 5 0 58843 0 7337 2954
## Liberia Libya Liechtenstein Lithuania Luxembourg Madagascar Malawi
## 1 516886 76256 3428 341050 36141 59424 143437
## 2 .. .. .. .. .. .. ..
## 3 18354 25322 3250 310298 35779 43500 11584
## 4 498532 50934 178 30752 362 15924 131853
## 5 477733 3757 0 0 0 10810 47615
## Malaysia Maldives Mali Malta Marshall.Islands Martinique Mauritania
## 1 562762 2193 647436 110746 1426 11041 134488
## 2 .. .. .. .. .. .. ..
## 3 188217 304 51354 110502 1115 316 13255
## 4 374545 1889 596082 244 311 10725 121233
## 5 92880 0 132257 0 0 0 107801
## Mauritius Mayotte Mexico Micronesia..Fed..States.of. Monaco Mongolia
## 1 110708 1835 4395365 7714 4479 24466
## 2 .. .. .. .. .. ..
## 3 101617 0 4350586 2764 4200 24300
## 4 9091 1835 44779 4950 279 166
## 5 113 0 306 0 0 0
## Montenegro Montserrat Morocco Mozambique Myanmar Namibia Nauru Nepal
## 1 77384 7188 1748251 2222369 685310 16079 1419 748060
## 2 .. .. .. .. .. .. .. ..
## 3 77143 5174 1567742 78957 44698 1304 465 7177
## 4 241 2014 180509 2143412 640612 14775 954 740883
## 5 0 0 3711 1265353 226295 3080 928 10025
## Netherlands New.Caledonia New.Zealand Nicaragua Niger Nigeria Niue
## 1 723638 4151 388173 442126 149779 446806 5860
## 2 .. .. .. .. .. .. ..
## 3 660809 1268 371488 179003 3433 148202 5821
## 4 62829 2883 16685 263123 146346 298604 39
## 5 548 292 487 0 55934 114276 0
## North.Macedonia Northern.Mariana.Islands Norway Oman Pakistan Palau
## 1 432296 2525 138536 12535 3343328 2958
## 2 .. .. .. .. .. ..
## 3 380767 274 130212 714 447344 12
## 4 51529 2251 8324 11821 2895984 2946
## 5 0 0 4249 0 17315 0
## Panama Papua.New.Guinea Paraguay Peru Philippines Poland Portugal
## 1 134743 3111 297979 314854 2033684 1510415 1873457
## 2 .. .. .. .. .. .. ..
## 3 90144 1845 14571 228073 1349642 1346970 1475456
## 4 44599 1266 283408 86781 684042 163445 398001
## 5 0 1160 0 351 818 0 8194
## Puerto.Rico Qatar Republic.of.Korea Republic.of.Moldova Réunion Romania
## 1 1200821 12204 1624729 625683 3087 813066
## 2 .. .. .. .. .. ..
## 3 1180927 904 1382392 567165 125 661082
## 4 19894 11300 242337 58518 2962 151984
## 5 1546 0 354 0 957 0
## Russian.Federation Rwanda Saint.Helena Saint.Kitts.and.Nevis Saint.Lucia
## 1 12662893 547718 884 20714 22005
## 2 .. .. .. .. ..
## 3 7566200 8427 539 9886 10108
## 4 5096693 539291 345 10828 11897
## 5 1465 532395 0 0 0
## Saint.Pierre.and.Miquelon Saint.Vincent.and.the.Grenadines Samoa
## 1 485 37049 74861
## 2 .. .. ..
## 3 433 18424 59558
## 4 52 18625 15303
## 5 0 0 21
## San.Marino Sao.Tome.and.Principe Saudi.Arabia Senegal Serbia Seychelles
## 1 1419 13977 107166 370263 742547 35633
## 2 .. .. .. .. .. ..
## 3 1376 5881 24905 130627 738976 28990
## 4 43 8096 82261 239636 3571 6643
## 5 0 3219 2132 196154 0 6426
## Sierra.Leone Singapore Sint.Maarten..Dutch.part. Slovakia Slovenia
## 1 61854 156468 14823 133006 91496
## 2 .. .. .. .. ..
## 3 19718 90554 13911 132447 89464
## 4 42136 65914 912 559 2032
## 5 39039 10678 0 0 15
## Solomon.Islands Somalia South.Africa South.Sudan Spain Sri.Lanka
## 1 2212 848067 308303 514943 1439019 885951
## 2 .. .. .. .. .. ..
## 3 1115 67402 224304 1 917001 260120
## 4 1097 780665 83999 514942 522018 625831
## 5 83 757421 21041 498608 1778 115
## State.of.Palestine Sudan Suriname Sweden Switzerland
## 1 1813063 584940 179870 206848 326276
## 2 .. .. .. .. ..
## 3 35600 15083 160380 197121 288203
## 4 1777463 569857 19490 9727 38073
## 5 2697 244607 0 1170 686
## Syrian.Arab.Republic Tajikistan Thailand Timor.Leste Togo Tokelau
## 1 621881 537701 311308 11261 193830 1684
## 2 .. .. .. .. .. ..
## 3 126353 471233 206019 10514 19358 1523
## 4 495528 66468 105289 747 174472 161
## 5 2393 40537 32076 0 32868 0
## Tonga Trinidad.and.Tobago Tunisia Turkey Turkmenistan
## 1 32666 197521 465576 2640033 259987
## 2 .. .. .. .. ..
## 3 29974 182904 416484 2548456 249213
## 4 2692 14617 49092 91577 10774
## 5 24 0 221 6 0
## Turks.and.Caicos.Islands Tuvalu Uganda Ukraine United.Arab.Emirates
## 1 2311 2350 311602 5545760 79545
## 2 .. .. .. .. ..
## 3 221 1171 71129 4668356 5600
## 4 2090 1179 240473 877404 73945
## 5 0 377 149308 84 62
## United.Kingdom United.Republic.of.Tanzania United.States.of.America
## 1 3794333 203070 1739233
## 2 .. .. ..
## 3 3462531 61033 889414
## 4 331802 142037 849819
## 5 15760 72693 38316
## United.States.Virgin.Islands Uruguay Uzbekistan Vanuatu
## 1 2362 237486 1428020 5060
## 2 .. .. .. ..
## 3 70 56838 1078563 1017
## 4 2292 180648 349457 4043
## 5 0 286 2027 9
## Venezuela..Bolivarian.Republic.of. Viet.Nam Wallis.and.Futuna.Islands
## 1 185946 1237873 6484
## 2 .. .. ..
## 3 114991 1085310 884
## 4 70955 152563 5600
## 5 2510 71579 0
## Western.Sahara Yemen Zambia Zimbabwe X
## 1 168239 455492 85203 204365 NA
## 2 .. .. .. .. NA
## 3 333 11457 26062 40957 NA
## 4 167906 444035 59141 163408 NA
## 5 0 357 26254 75122 NA
## [1] 1981 239
## [1] "data.frame"
Converting to tibbles
Tibbles is an enhanced dataframe, which inherits data frame class, and improves three behaviors, which are:
- Subsetting: Always returns a new tibble, [[ and $ return a vector.
- No partial matching: One must use full column names when subsetting.
- Display: When one prints a tibble, R provides a concise view of the data, which fits on one screen.
This is good site for tibbles documentation: https://tibble.tidyverse.org/
Converting dataframe to tibble
Check if data frame to tibble conversion worked
## [1] TRUE
Our Approach
In Migrantion dataset, columns 6 to 239 represent values of country variable, and each row represents 239 observations. We used the most recent functions pivot_longer(), introduced in tidyr 1.0.0, replacing the older functions spread() and gather().
More information is available at: https://tidyr.tidyverse.org/reference/pivot_longer.html
Transformation, using pivot_longer()
long_tidy_migrantstock <- migrantstock_tibble %>%
rename(destination = Major.area..region..country.or.area.of.destination, year = Year) %>%
pivot_longer(7:239, names_to = "country_of_origin", values_to = "count_in_thousand") # Transform the data from its wide to a more normalized long format
head(long_tidy_migrantstock, n = 5)## # A tibble: 5 x 8
## year Sort.order destination Notes Code Type.of.data..a.
## <int> <int> <chr> <chr> <int> <chr>
## 1 1990 1990001 WORLD "" 900 ""
## 2 1990 1990001 WORLD "" 900 ""
## 3 1990 1990001 WORLD "" 900 ""
## 4 1990 1990001 WORLD "" 900 ""
## 5 1990 1990001 WORLD "" 900 ""
## # ... with 2 more variables: country_of_origin <chr>,
## # count_in_thousand <chr>
We are selecting necessary columns, and filtering rows for migrations to USA, based on the conditions for our main analysis.
migrantstock_USA <- long_tidy_migrantstock %>%
na.omit() %>% # remove blank row
select(year, destination, country_of_origin, count_in_thousand) %>%
filter(destination == "United States of America")
head(migrantstock_USA)## # A tibble: 6 x 4
## year destination country_of_origin count_in_thousand
## <int> <chr> <chr> <chr>
## 1 1990 United States of America Afghanistan 28444
## 2 1990 United States of America Albania 5627
## 3 1990 United States of America Algeria 4629
## 4 1990 United States of America American.Samoa ""
## 5 1990 United States of America Andorra ""
## 6 1990 United States of America Angola 2252
We need to change data type of count_in_thousand to numeric
Top 10 countries from where people migrated to USA
top10_migrantstock_USA <- migrantstock_USA %>%
na.omit(migrantstock_USA) %>%
group_by(country_of_origin) %>%
summarise(count_in_thousand = sum(count_in_thousand)) %>%
arrange(desc(count_in_thousand)) %>%
top_n(10)## Selecting by count_in_thousand
| country_of_origin | count_in_thousand |
|---|---|
| Mexico | 66449196 |
| China | 12287459 |
| Puerto.Rico | 10995652 |
| Philippines | 10941865 |
| India | 10471258 |
| Viet.Nam | 7344249 |
| Cuba | 7010218 |
| El.Salvador | 6925286 |
| Republic.of.Korea | 6441503 |
| Canada | 5746699 |
ggplot(top10_migrantstock_USA, aes(x = reorder(country_of_origin,count_in_thousand), y = count_in_thousand, fill = country_of_origin)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
coord_flip() +
ggtitle("Top 10 Countries of Origin Migrated to USA") +
xlab("Countries of Origin") + ylab("Count in Thousand")So, at this point, having seen the top 10 countries, from where migrations happnened to USA, now we’ll see the bottom 10.
Bottom 10 countries from where people migrated to USA
bottom10_migrantstock_USA <- migrantstock_USA %>%
na.omit(migrantstock_USA) %>%
group_by(country_of_origin) %>%
summarise(count_in_thousand = sum(count_in_thousand)) %>%
arrange((count_in_thousand)) %>%
top_n(-10)## Selecting by count_in_thousand
| country_of_origin | count_in_thousand |
|---|---|
| Turkmenistan | 10208 |
| Luxembourg | 17226 |
| Tajikistan | 23662 |
| Kyrgyzstan | 27087 |
| Aruba | 34887 |
| Montserrat | 35796 |
| Iceland | 38299 |
| Marshall.Islands | 38505 |
| Micronesia..Fed..States.of. | 45496 |
| Slovenia | 46807 |
ggplot(bottom10_migrantstock_USA, aes(x = reorder(country_of_origin,count_in_thousand), y = count_in_thousand, fill = country_of_origin)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
coord_flip() +
ggtitle("Bottom 10 Countries of Origin Migrated to USA") +
xlab("Countries of Origin") + ylab("Count in Thousand")Trend changes in population migrations to USA, over time
migrantstock_USA <- na.omit(migrantstock_USA)
migrantstock_USA_year <- migrantstock_USA %>%
group_by(year) %>%
summarise(count_in_thousand = sum(count_in_thousand))
kable(migrantstock_USA_year)| year | count_in_thousand |
|---|---|
| 1990 | 20134790 |
| 1995 | 26593925 |
| 2000 | 33157941 |
| 2005 | 36996835 |
| 2010 | 42071829 |
| 2015 | 45775575 |
| 2019 | 48065741 |
ggplot(migrantstock_USA_year, aes(x = year, y = count_in_thousand)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
ggtitle("Yearly Migration Estimates to USA") +
xlab("Year") + ylab("Count in Thousand")ggplot() + geom_line(aes(y = count_in_thousand, x = year), size = 1.5,
data = migrantstock_USA_year, stat = "identity")Conclusion
So, we answered all the three questions, we wanted to answer in the begining, and observed that the migrations to USA has been progressively increasing, over the years.
Dataset-2: Educational attainment
This is a wide dataset, at Census Bureau website, containing data on educatioanl attainments and incomes, for 2017. The link is as follows: https://www.census.gov/data/datasets/time-series/demo/income-poverty/cps-asec-design.html
The questions, we tried to address here are:
- Percentage wise distribution of levels of education (e.g. GED, Bachelors etc), in 2017.
- Correlation between educational attainment and being without earning.
- Correlation between educational attainment and high inclome, over $100,000.
After downloading dataset from original link, it was uploaded to github, which has been read below.
raw_data<-read.csv(paste("https://raw.githubusercontent.com/forhadakbar/data607fall2019/master/Week%2006/Educational%20Attainment.csv"),stringsAsFactors = FALSE,header = TRUE)
head(raw_data)## Characteristic Less.Than.9th.Grade Nongrad.9th.to.12th
## 1 Without Earnings 4943 7159
## 2 $1 to $2,499 or loss 135 250
## 3 $2,500 to $4,999 92 201
## 4 $5,000 to $7,499 130 283
## 5 $7,500 to $9,999 92 190
## 6 $10,000 to $12,499 291 388
## High.School.Grad..Incl.GED. Some.College.No.Degree Associate.Degree
## 1 26111 12574 6394
## 2 889 682 398
## 3 628 494 277
## 4 1013 579 283
## 5 761 389 244
## 6 1612 829 575
## Bachelor.Degree Master.Degree Professional.Degree Doctorate.Degree
## 1 11592 5003 676 915
## 2 663 249 36 43
## 3 417 127 17 25
## 4 633 282 16 42
## 5 311 106 9 33
## 6 864 291 47 24
names(raw_data)<-c("Character", gsub("\\.", " ", names(raw_data)[2:length(raw_data)]))
head(raw_data)## Character Less Than 9th Grade Nongrad 9th to 12th
## 1 Without Earnings 4943 7159
## 2 $1 to $2,499 or loss 135 250
## 3 $2,500 to $4,999 92 201
## 4 $5,000 to $7,499 130 283
## 5 $7,500 to $9,999 92 190
## 6 $10,000 to $12,499 291 388
## High School Grad Incl GED Some College No Degree Associate Degree
## 1 26111 12574 6394
## 2 889 682 398
## 3 628 494 277
## 4 1013 579 283
## 5 761 389 244
## 6 1612 829 575
## Bachelor Degree Master Degree Professional Degree Doctorate Degree
## 1 11592 5003 676 915
## 2 663 249 36 43
## 3 417 127 17 25
## 4 633 282 16 42
## 5 311 106 9 33
## 6 864 291 47 24
Transformation, using pivot_longer() and melt (member of reshape2)
Reshape using data.table
raw_data_table <- as.data.table(raw_data)
education_attainment <- melt(raw_data_table,id.vars = "Character", variable.name = "Education", value.name = 'Count_in_Thousands', na.rm = TRUE)
head(education_attainment)## Character Education Count_in_Thousands
## 1: Without Earnings Less Than 9th Grade 4943
## 2: $1 to $2,499 or loss Less Than 9th Grade 135
## 3: $2,500 to $4,999 Less Than 9th Grade 92
## 4: $5,000 to $7,499 Less Than 9th Grade 130
## 5: $7,500 to $9,999 Less Than 9th Grade 92
## 6: $10,000 to $12,499 Less Than 9th Grade 291
Reshape using tidyr
education_attainment_long <- raw_data %>%
pivot_longer (2:10, names_to = "Education", values_to = "Count_in_Thousands" )
education_attainment_long## # A tibble: 378 x 3
## Character Education Count_in_Thousands
## <chr> <chr> <int>
## 1 Without Earnings Less Than 9th Grade 4943
## 2 Without Earnings Nongrad 9th to 12th 7159
## 3 Without Earnings "High School Grad Incl GED " 26111
## 4 Without Earnings Some College No Degree 12574
## 5 Without Earnings Associate Degree 6394
## 6 Without Earnings Bachelor Degree 11592
## 7 Without Earnings Master Degree 5003
## 8 Without Earnings Professional Degree 676
## 9 Without Earnings Doctorate Degree 915
## 10 $1 to $2,499 or loss Less Than 9th Grade 135
## # ... with 368 more rows
Percentage wise distribution of levels of education (e.g. GED, Bachelors etc), in 2017.
education <- education_attainment %>%
group_by(Education) %>%
summarise(Count_in_Thousands = sum(Count_in_Thousands)) %>%
arrange(desc(Count_in_Thousands))
kable(education)| Education | Count_in_Thousands |
|---|---|
| High School Grad Incl GED | 62669 |
| Bachelor Degree | 48220 |
| Some College No Degree | 35457 |
| Associate Degree | 22370 |
| Master Degree | 21054 |
| Nongrad 9th to 12th | 13685 |
| Less Than 9th Grade | 8723 |
| Doctorate Degree | 4473 |
| Professional Degree | 3176 |
education$Percent_total <- round((education$Count_in_Thousands / sum(education$Count_in_Thousands))*100,2)
education## # A tibble: 9 x 3
## Education Count_in_Thousands Percent_total
## <fct> <int> <dbl>
## 1 "High School Grad Incl GED " 62669 28.5
## 2 Bachelor Degree 48220 21.9
## 3 Some College No Degree 35457 16.1
## 4 Associate Degree 22370 10.2
## 5 Master Degree 21054 9.58
## 6 Nongrad 9th to 12th 13685 6.23
## 7 Less Than 9th Grade 8723 3.97
## 8 Doctorate Degree 4473 2.03
## 9 Professional Degree 3176 1.44
Let’s see it visually using ggplot2
ggplot(education, aes(x = reorder(Education,Percent_total), y = Percent_total,fill = Education)) +
geom_bar(stat = "identity", position = position_dodge(), colour = "black") +
geom_text(aes(label = Percent_total), vjust = .5, hjust = 1, position = position_dodge(width = 0.9), color = "black") +
ggtitle("Percentage of Total, by education") +
xlab("Education") + ylab("Percent of Total") +
coord_flip() Correlation between educational attainment and being without earning.
education_attainment_wide <- education_attainment_long %>%
pivot_wider(names_from = Character, values_from = Count_in_Thousands)
kable(education_attainment_wide)| Education | Without Earnings | $1 to $2,499 or loss | $2,500 to $4,999 | $5,000 to $7,499 | $7,500 to $9,999 | $10,000 to $12,499 | $12,500 to $14,999 | $15,000 to $17,499 | $17,500 to $19,999 | $20,000 to $22,499 | $22,500 to $24,999 | $25,000 to $27,499 | $27,500 to $29,999 | $30,000 to $32,499 | $32,500 to $34,999 | $35,000 to $37,499 | $37,500 to $39,999 | $40,000 to $42,499 | $42,500 to $44,999 | $45,000 to $47,499 | $47,500 to $49,999 | $50,000 to $52,499 | $52,500 to $54,999 | $55,000 to $57,499 | $57,500 to $59,999 | $60,000 to $62,499 | $62,500 to $64,999 | $65,000 to $67,499 | $67,500 to $69,999 | $70,000 to $72,499 | $72,500 to $74,999 | $75,000 to $77,499 | $77,500 to $79,999 | $80,000 to $82,499 | $82,500 to $84,999 | $85,000 to $87,499 | $87,500 to $89,999 | $90,000 to $92,499 | $92,500 to $94,999 | $95,000 to $97,499 | $97,500 to $99,999 | $100,000 and over |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Less Than 9th Grade | 4943 | 135 | 92 | 130 | 92 | 291 | 156 | 290 | 195 | 406 | 205 | 346 | 89 | 291 | 81 | 159 | 76 | 156 | 24 | 92 | 31 | 129 | 6 | 47 | 17 | 64 | 5 | 27 | 3 | 16 | 4 | 25 | 4 | 24 | 2 | 3 | 2 | 15 | 4 | 6 | 1 | 39 |
| Nongrad 9th to 12th | 7159 | 250 | 201 | 283 | 190 | 388 | 228 | 399 | 288 | 675 | 318 | 521 | 121 | 484 | 148 | 284 | 142 | 325 | 45 | 142 | 90 | 229 | 31 | 83 | 24 | 118 | 27 | 61 | 25 | 63 | 13 | 47 | 22 | 54 | 1 | 15 | 7 | 35 | 1 | 6 | 6 | 136 |
| High School Grad Incl GED | 26111 | 889 | 628 | 1013 | 761 | 1612 | 868 | 1697 | 1097 | 2406 | 1244 | 2451 | 1058 | 2724 | 700 | 2114 | 788 | 2041 | 396 | 1475 | 599 | 2011 | 321 | 937 | 305 | 1158 | 165 | 694 | 187 | 649 | 126 | 446 | 92 | 522 | 88 | 223 | 94 | 245 | 33 | 144 | 30 | 1527 |
| Some College No Degree | 12574 | 682 | 494 | 579 | 389 | 829 | 333 | 914 | 594 | 1209 | 626 | 1283 | 604 | 1698 | 443 | 1239 | 469 | 1253 | 269 | 852 | 356 | 1204 | 280 | 591 | 213 | 870 | 129 | 545 | 150 | 533 | 132 | 388 | 98 | 374 | 65 | 211 | 73 | 179 | 23 | 138 | 59 | 1513 |
| Associate Degree | 6394 | 398 | 277 | 283 | 244 | 575 | 196 | 557 | 326 | 740 | 409 | 828 | 389 | 1058 | 334 | 859 | 383 | 940 | 246 | 597 | 306 | 825 | 261 | 463 | 175 | 652 | 125 | 420 | 139 | 341 | 94 | 323 | 78 | 319 | 52 | 214 | 32 | 221 | 21 | 98 | 42 | 1136 |
| Bachelor Degree | 11592 | 663 | 417 | 633 | 311 | 864 | 297 | 735 | 383 | 1048 | 505 | 1094 | 559 | 1473 | 413 | 1403 | 559 | 1881 | 446 | 1424 | 697 | 2090 | 500 | 1192 | 411 | 1640 | 296 | 1012 | 282 | 1247 | 194 | 998 | 266 | 1042 | 230 | 580 | 183 | 740 | 114 | 380 | 138 | 7288 |
| Master Degree | 5003 | 249 | 127 | 282 | 106 | 291 | 135 | 208 | 113 | 334 | 142 | 273 | 108 | 421 | 99 | 367 | 137 | 477 | 138 | 474 | 245 | 881 | 218 | 474 | 189 | 717 | 166 | 408 | 165 | 621 | 155 | 542 | 145 | 689 | 89 | 387 | 92 | 441 | 66 | 247 | 107 | 4526 |
| Professional Degree | 676 | 36 | 17 | 16 | 9 | 47 | 14 | 18 | 4 | 23 | 11 | 32 | 8 | 36 | 12 | 39 | 8 | 40 | 9 | 38 | 19 | 102 | 25 | 42 | 12 | 87 | 16 | 67 | 14 | 45 | 10 | 59 | 8 | 80 | 12 | 36 | 11 | 60 | 11 | 41 | 16 | 1310 |
| Doctorate Degree | 915 | 43 | 25 | 42 | 33 | 24 | 37 | 49 | 32 | 38 | 10 | 33 | 22 | 57 | 19 | 46 | 19 | 50 | 14 | 52 | 45 | 152 | 38 | 73 | 19 | 119 | 18 | 71 | 15 | 67 | 24 | 103 | 32 | 121 | 20 | 53 | 22 | 114 | 12 | 61 | 41 | 1693 |
withoutearning <- education_attainment_wide %>%
select(Education, "Without Earnings") %>%
mutate(Percent_total = round((`Without Earnings` / sum(`Without Earnings`)) * 100, 2)) %>%
arrange(desc(Percent_total))
kable(withoutearning)| Education | Without Earnings | Percent_total |
|---|---|---|
| High School Grad Incl GED | 26111 | 34.65 |
| Some College No Degree | 12574 | 16.68 |
| Bachelor Degree | 11592 | 15.38 |
| Nongrad 9th to 12th | 7159 | 9.50 |
| Associate Degree | 6394 | 8.48 |
| Master Degree | 5003 | 6.64 |
| Less Than 9th Grade | 4943 | 6.56 |
| Doctorate Degree | 915 | 1.21 |
| Professional Degree | 676 | 0.90 |
Let’s see it visually using ggplot2
ggplot(withoutearning, aes(x = reorder(Education,Percent_total), y = Percent_total)) +
geom_bar(stat = "identity", position = position_dodge(), colour = "black") +
geom_text(aes(label = Percent_total), vjust = .5, hjust = 1, position = position_dodge(width = 0.9), color = "black") +
ggtitle("Percentage of educational attainment without earnings") +
xlab("Education") + ylab("Percentage of total") +
coord_flip()Correlation between educational attainment and high inclome, over $100,000.
morethan100k<- education_attainment_wide %>%
select(Education,"$100,000 and over") %>%
mutate(Percent_total = round((`$100,000 and over` / sum(`$100,000 and over`)) * 100, 2))%>%
arrange(desc(Percent_total))
kable(morethan100k)| Education | $100,000 and over | Percent_total |
|---|---|---|
| Bachelor Degree | 7288 | 38.02 |
| Master Degree | 4526 | 23.61 |
| Doctorate Degree | 1693 | 8.83 |
| High School Grad Incl GED | 1527 | 7.97 |
| Some College No Degree | 1513 | 7.89 |
| Professional Degree | 1310 | 6.83 |
| Associate Degree | 1136 | 5.93 |
| Nongrad 9th to 12th | 136 | 0.71 |
| Less Than 9th Grade | 39 | 0.20 |
Let’s see it visually using ggplot2
ggplot(morethan100k, aes(x = reorder(Education,Percent_total), y = Percent_total,fill = Education)) +
geom_bar(stat = "identity", position = position_dodge(), colour = "black") +
geom_text(aes(label = Percent_total), vjust = .5, hjust = 1,position = position_dodge(width = 0.9), color = "black") +
ggtitle("Percentage of Education level who is making more than $100K") +
xlab("Education") + ylab("Percentage of Education") +
coord_flip() Conclusion
- Of the people surveyed, 28.51% graduated from high school (incl GED), 21.94% attained bachelor’s degree and 16.13% had some college degree.
- There’s a correlation between educational attainment and being without earning, because 34.65% of incomeless people, had only high school degree.
- There’s a correlation between educational attainment and high inclome (over $100,000), because 34.02% of those who earn 100k or more, have bachlor’s degree, 23.61% have master’s degree, and 8.83% have PhD.
Dataset-3: population, by global entities, from 1960 to 2010.
This is a wide dataset, in Worldbank’s website, containing data on populations by entities (including countries, regions etc), from 1960 to 2010. The link is as follows: https://databank.worldbank.org/home.aspx
The questions, we tried to address here are:
- Populations of top 20 entities in the year 2010.
- Populations of bottom 20 entities in the year 2010.
- Trend changes in world (i.e. sum total of entities, countries, regions etc) population, over time.
Reading the dataset.
After downloading dataset from original link, it was uploaded to github, which has been read below.
url <- "https://raw.githubusercontent.com/forhadakbar/data607fall2019/master/Week%2006/API_SP.POP.TOTL_DS2_en_csv_v2_247892.csv"
tbl <- read.csv(url, skip = 4)Selecting relevant columns and omitting NA
Tidying the data, with pivot_longer
Populations of top 20 entities in the year 2010
tbl.pivot_2010.df <- tbl.pivot.df %>%
filter(Year == "X2010") %>%
select(Country.Name, Population) %>%
arrange(desc(Population))
tbl.pivot_2010.top20.df <- top_n(tbl.pivot_2010.df, 20)## Selecting by Population
| Country.Name | Population |
|---|---|
| World | 6922947261 |
| IDA & IBRD total | 5793479782 |
| Low & middle income | 5765121055 |
| Middle income | 5187847044 |
| IBRD only | 4421851199 |
| Early-demographic dividend | 2908287643 |
| Lower middle income | 2688436551 |
| Upper middle income | 2499410493 |
| East Asia & Pacific | 2206884624 |
| Late-demographic dividend | 2181529363 |
| East Asia & Pacific (excluding high income) | 1965964309 |
| East Asia & Pacific (IDA & IBRD countries) | 1941377349 |
| South Asia | 1638792934 |
| South Asia (IDA & IBRD) | 1638792934 |
| IDA total | 1371628583 |
| China | 1337705000 |
| OECD members | 1242309585 |
| India | 1234281170 |
| High income | 1157826206 |
| Post-demographic dividend | 1074869481 |
ggplot(tbl.pivot_2010.top20.df, aes(x = reorder(Country.Name, Population), y = Population)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
coord_flip() +
ggtitle("Top 20 entities in 2010") +
xlab("Entity name") + ylab("Population")Populations of bottom 20 entities in the year 2010
tbl.pivot_2010.df <- tbl.pivot.df %>%
filter(Year == "X2010") %>%
select(Country.Name, Population) %>%
arrange(desc(Population))
tbl.pivot_2010.bottom20.df <- top_n(tbl.pivot_2010.df, -20)## Selecting by Population
| Country.Name | Population |
|---|---|
| Andorra | 84449 |
| Dominica | 70878 |
| Bermuda | 65124 |
| Greenland | 56905 |
| Cayman Islands | 56672 |
| Marshall Islands | 56366 |
| American Samoa | 56079 |
| Northern Mariana Islands | 53971 |
| St. Kitts and Nevis | 49016 |
| Faroe Islands | 47814 |
| St. Martin (French part) | 37582 |
| Liechtenstein | 35994 |
| Monaco | 35612 |
| Gibraltar | 33585 |
| Turks and Caicos Islands | 32660 |
| San Marino | 31229 |
| British Virgin Islands | 27794 |
| Palau | 17955 |
| Tuvalu | 10530 |
| Nauru | 10005 |
ggplot(tbl.pivot_2010.bottom20.df, aes(x = reorder(Country.Name, Population), y = Population)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
coord_flip() +
ggtitle("Bottom 20 entities in 2010") +
xlab("Entity name") + ylab("Population")Trend changes in world (i.e. sum total of entities, countries, regions etc) population, over time
tbl.pivot.grp_by_yr.df <- tbl.pivot.df %>%
group_by(Year) %>%
summarise("Total Population" = sum(Population))
kable(tbl.pivot.grp_by_yr.df)| Year | Total Population |
|---|---|
| X1960 | 30696994944 |
| X1961 | 31108353431 |
| X1962 | 31658294754 |
| X1963 | 32342820687 |
| X1964 | 33032632731 |
| X1965 | 33739453133 |
| X1966 | 34483819203 |
| X1967 | 35225755472 |
| X1968 | 35981282059 |
| X1969 | 36774110168 |
| X1970 | 37578380420 |
| X1971 | 38403294546 |
| X1972 | 39218037696 |
| X1973 | 40027236911 |
| X1974 | 40838904173 |
| X1975 | 41634929535 |
| X1976 | 42418106464 |
| X1977 | 43198308074 |
| X1978 | 43991861452 |
| X1979 | 44804464513 |
| X1980 | 45627321658 |
| X1981 | 46472179835 |
| X1982 | 47351062716 |
| X1983 | 48240338882 |
| X1984 | 49128810634 |
| X1985 | 50034855935 |
| X1986 | 50967920664 |
| X1987 | 51925263797 |
| X1988 | 52893291009 |
| X1989 | 53860012489 |
| X1990 | 54840819416 |
| X1991 | 55795466259 |
| X1992 | 56723821804 |
| X1993 | 57652320273 |
| X1994 | 58572105249 |
| X1995 | 59492289648 |
| X1996 | 60399571928 |
| X1997 | 61304671961 |
| X1998 | 62200715609 |
| X1999 | 63084689314 |
| X2000 | 63961898183 |
| X2001 | 64834636987 |
| X2002 | 65706004220 |
| X2003 | 66581568814 |
| X2004 | 67462607601 |
| X2005 | 68349669516 |
| X2006 | 69242979693 |
| X2007 | 70140938566 |
| X2008 | 71051310936 |
| X2009 | 71967689579 |
| X2010 | 72885851687 |
ggplot(tbl.pivot.grp_by_yr.df, aes(x=Year, y=`Total Population`)) +
geom_bar(stat="identity", position=position_dodge(), color="black", width = 0.5) +
ggtitle("Total Population by Year") +
xlab("Year") + ylab("Population")Conclusion
- We selected the top 20 entities. We saw that among them are China, India, and there South Asia region.
- Among the bottom 20 entities, we saw Palau, Tuvalu, Nauru.
- Finally, we saw that the world population has been progressivley increasing from 1960 to 2010.