Transform Data with Pvot Tibl ggplot2 Reshape
rmarkdown output
I am using prettydoc with cayman theme, for my rmarkdown this week. prettydoc has great documentation in this link https://prettydoc.statr.me/index.html
Problem Statement
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to: (1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
- Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
- Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
- Perform the analysis requested in the discussion item.
- Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
- Please include in your homework submission, for each of the three chosen datasets:
- The URL to the .Rmd file in your GitHub repository, and
- The URL for your rpubs.com web page.
Git-Hub
The datasets and .rmd file used in this project can be found at: https://github.com/ShovanBiswas/DATA607/tree/master/Week6-Project2
Loading package
The tidyverse includes both tidyr and dplyr. There’s Amazing documentation on tidyverse site: https://www.tidyverse.org/packages/.
Dataset-1: Population Migrations to USA
This is a wide dataset, in UN’s website, containing data on migrations, between countries, from 1990 to 2019. The link is as follows: https://www.un.org/en/development/desa/population/migration/data/estimates2/estimates15.asp
The questions I tried to address here, are:
- Top 10 countries from which migrations happned to USA.
- Bottom 10 countries from which migrations happned to USA.
- Trend changes in population migrations to USA, over time.
Create a .csv file
After downloading dataset from original link, it was uploaded to github, which has been read below.
migrantstock <- read.csv(paste("./UN_MigrantStockByOriginAndDestination_2019.csv"),stringsAsFactors = F)
head(migrantstock, n = 5)## Year Sort.order Major.area..region..country.or.area.of.destination Notes Code
## 1 1990 1990001 WORLD 900
## 2 1990 1990002 UN development groups NA
## 3 1990 1990003 More developed regions b 901
## 4 1990 1990004 Less developed regions c 902
## 5 1990 1990005 Least developed countries d 941
## Type.of.data..a. Afghanistan Albania Algeria American.Samoa Andorra Angola
## 1 6823350 180284 921727 2041 3792 824942
## 2 .. .. .. .. .. ..
## 3 119386 177986 867015 1027 3737 167381
## 4 6703964 2298 54712 1014 55 657561
## 5 0 0 5622 0 0 608108
## Anguilla Antigua.and.Barbuda Argentina Armenia Aruba Australia Austria
## 1 2047 21753 430169 899649 10596 303696 506088
## 2 .. .. .. .. .. .. ..
## 3 540 14561 223308 654111 4639 245256 471950
## 4 1507 7192 206861 245538 5957 58440 34138
## 5 0 0 526 0 0 1465 190
## Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin
## 1 1634081 25182 12820 5451831 84931 1767606 365360 36117 234314
## 2 .. .. .. .. .. .. .. .. ..
## 3 1056543 23376 1783 161091 79563 1570292 338069 32183 14838
## 4 577538 1806 11037 5290740 5368 197314 27291 3934 219476
## 5 0 0 89 2046 0 0 973 0 41733
## Bermuda Bhutan Bolivia..Plurinational.State.of.
## 1 71702 28465 224693
## 2 .. .. ..
## 3 71550 554 43272
## 4 152 27911 181421
## 5 0 16805 0
## Bonaire..Sint.Eustatius.and.Saba Bosnia.and.Herzegovina Botswana Brazil
## 1 3206 861766 26053 500392
## 2 .. .. .. ..
## 3 2900 854349 2843 282969
## 4 306 7417 23210 217423
## 5 0 0 652 1631
## British.Virgin.Islands Brunei.Darussalam Bulgaria Burkina.Faso Burundi
## 1 3094 26323 613093 1021332 337118
## 2 .. .. .. .. ..
## 3 158 6623 122758 6386 4620
## 4 2936 19700 490335 1014946 332498
## 5 0 0 0 43162 329604
## Cabo.Verde Cambodia Cameroon Canada Cayman.Islands Central.African.Republic
## 1 91368 355430 115853 998163 373 46362
## 2 .. .. .. .. .. ..
## 3 80227 237274 50166 938586 269 9114
## 4 11141 118156 65687 59577 104 37248
## 5 10796 14322 33380 1876 0 26440
## Chad Channel.Islands Chile China China..Hong.Kong.SAR China..Macao.SAR
## 1 336802 18726 493026 4231648 551080 95648
## 2 .. .. .. .. .. ..
## 3 3608 18625 191694 1460345 516749 14382
## 4 333194 101 301332 2771303 34331 81266
## 5 208161 0 333 245923 54 0
## Colombia Comoros Congo Cook.Islands Costa.Rica Côte.d.Ivoire Croatia Cuba
## 1 1009935 40083 96372 17488 69711 366348 425807 835796
## 2 .. .. .. .. .. .. .. ..
## 3 365140 18946 61595 17441 48040 49237 422110 774534
## 4 644795 21137 34777 47 21671 317111 3697 61262
## 5 828 8549 25168 0 274 308481 0 1493
## Curaçao Cyprus Czechia Dem..People.s.Republic.of.Korea
## 1 43190 174378 277260 39784
## 2 .. .. .. ..
## 3 39067 163985 268472 15643
## 4 4123 10393 8788 24141
## 5 0 5 0 78
## Democratic.Republic.of.the.Congo Denmark Djibouti Dominica Dominican.Republic
## 1 436526 201761 5308 42437 466216
## 2 .. .. .. .. ..
## 3 95528 194363 3466 23814 387566
## 4 340998 7398 1842 18623 78650
## 5 257210 189 904 0 2406
## Ecuador Egypt El.Salvador Equatorial.Guinea Eritrea Estonia Eswatini
## 1 214008 1322178 1242075 36178 170603 113905 35181
## 2 .. .. .. .. .. .. ..
## 3 165601 269821 507652 8443 25565 110168 546
## 4 48407 1052357 734423 27735 145038 3737 34635
## 5 0 30582 0 355 134471 0 102
## Ethiopia Falkland.Islands..Malvinas. Faroe.Islands Fiji Finland France
## 1 1689955 260 7520 90166 250765 1215895
## 2 .. .. .. .. .. ..
## 3 116900 237 7518 85997 247088 881133
## 4 1573055 23 2 4169 3677 334762
## 5 1432950 0 0 721 0 38541
## French.Guiana French.Polynesia Gabon Gambia Georgia Germany Ghana Gibraltar
## 1 2844 3149 15352 36280 919454 2929448 371162 11994
## 2 .. .. .. .. .. .. .. ..
## 3 61 324 8717 12656 801710 2488935 127620 11920
## 4 2783 2825 6635 23624 117744 440513 243542 74
## 5 0 0 5273 13064 0 4776 58990 0
## Greece Greenland Grenada Guadeloupe Guam Guatemala Guinea Guinea.Bissau
## 1 1022459 9510 43249 5828 1376 348332 352763 55409
## 2 .. .. .. .. .. .. .. ..
## 3 890569 8997 23826 162 43 245297 14712 16106
## 4 131890 513 19423 5666 1333 103035 338051 39303
## 5 40 0 0 0 0 0 200952 37630
## Guyana Haiti Holy.See Honduras Hungary Iceland India Indonesia
## 1 233731 528873 31 156594 387514 17621 6623177 1638365
## 2 .. .. .. .. .. .. .. ..
## 3 206865 294766 10 114335 360160 17533 1232954 309159
## 4 26866 234107 21 42259 27354 88 5390223 1329206
## 5 0 0 0 0 10 0 462470 75246
## Iran..Islamic.Republic.of. Iraq Ireland Isle.of.Man Israel Italy Jamaica
## 1 631339 1506702 917286 10735 281597 3351006 589010
## 2 .. .. .. .. .. .. ..
## 3 518437 145177 908952 10735 169372 2789415 575132
## 4 112902 1361525 8334 0 112225 561591 13878
## 5 0 5020 1 0 0 2401 0
## Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan
## 1 608921 313997 2971639 250340 4053 81611 522578
## 2 .. .. .. .. .. .. ..
## 3 430558 70531 2833828 154625 989 16425 483043
## 4 178363 243466 137811 95715 3064 65186 39535
## 5 9729 0 0 82931 1054 29 0
## Lao.People.s.Democratic.Republic Latvia Lebanon Lesotho Liberia Libya
## 1 483021 215134 509323 191339 516886 76256
## 2 .. .. .. .. .. ..
## 3 251777 200654 370568 365 18354 25322
## 4 231244 14480 138755 190974 498532 50934
## 5 58843 0 7337 2954 477733 3757
## Liechtenstein Lithuania Luxembourg Madagascar Malawi Malaysia Maldives Mali
## 1 3428 341050 36141 59424 143437 562762 2193 647436
## 2 .. .. .. .. .. .. .. ..
## 3 3250 310298 35779 43500 11584 188217 304 51354
## 4 178 30752 362 15924 131853 374545 1889 596082
## 5 0 0 0 10810 47615 92880 0 132257
## Malta Marshall.Islands Martinique Mauritania Mauritius Mayotte Mexico
## 1 110746 1426 11041 134488 110708 1835 4395365
## 2 .. .. .. .. .. .. ..
## 3 110502 1115 316 13255 101617 0 4350586
## 4 244 311 10725 121233 9091 1835 44779
## 5 0 0 0 107801 113 0 306
## Micronesia..Fed..States.of. Monaco Mongolia Montenegro Montserrat Morocco
## 1 7714 4479 24466 77384 7188 1748251
## 2 .. .. .. .. .. ..
## 3 2764 4200 24300 77143 5174 1567742
## 4 4950 279 166 241 2014 180509
## 5 0 0 0 0 0 3711
## Mozambique Myanmar Namibia Nauru Nepal Netherlands New.Caledonia New.Zealand
## 1 2222369 685310 16079 1419 748060 723638 4151 388173
## 2 .. .. .. .. .. .. .. ..
## 3 78957 44698 1304 465 7177 660809 1268 371488
## 4 2143412 640612 14775 954 740883 62829 2883 16685
## 5 1265353 226295 3080 928 10025 548 292 487
## Nicaragua Niger Nigeria Niue North.Macedonia Northern.Mariana.Islands Norway
## 1 442126 149779 446806 5860 432296 2525 138536
## 2 .. .. .. .. .. .. ..
## 3 179003 3433 148202 5821 380767 274 130212
## 4 263123 146346 298604 39 51529 2251 8324
## 5 0 55934 114276 0 0 0 4249
## Oman Pakistan Palau Panama Papua.New.Guinea Paraguay Peru Philippines
## 1 12535 3343328 2958 134743 3111 297979 314854 2033684
## 2 .. .. .. .. .. .. .. ..
## 3 714 447344 12 90144 1845 14571 228073 1349642
## 4 11821 2895984 2946 44599 1266 283408 86781 684042
## 5 0 17315 0 0 1160 0 351 818
## Poland Portugal Puerto.Rico Qatar Republic.of.Korea Republic.of.Moldova
## 1 1510415 1873457 1200821 12204 1624729 625683
## 2 .. .. .. .. .. ..
## 3 1346970 1475456 1180927 904 1382392 567165
## 4 163445 398001 19894 11300 242337 58518
## 5 0 8194 1546 0 354 0
## Réunion Romania Russian.Federation Rwanda Saint.Helena Saint.Kitts.and.Nevis
## 1 3087 813066 12662893 547718 884 20714
## 2 .. .. .. .. .. ..
## 3 125 661082 7566200 8427 539 9886
## 4 2962 151984 5096693 539291 345 10828
## 5 957 0 1465 532395 0 0
## Saint.Lucia Saint.Pierre.and.Miquelon Saint.Vincent.and.the.Grenadines Samoa
## 1 22005 485 37049 74861
## 2 .. .. .. ..
## 3 10108 433 18424 59558
## 4 11897 52 18625 15303
## 5 0 0 0 21
## San.Marino Sao.Tome.and.Principe Saudi.Arabia Senegal Serbia Seychelles
## 1 1419 13977 107166 370263 742547 35633
## 2 .. .. .. .. .. ..
## 3 1376 5881 24905 130627 738976 28990
## 4 43 8096 82261 239636 3571 6643
## 5 0 3219 2132 196154 0 6426
## Sierra.Leone Singapore Sint.Maarten..Dutch.part. Slovakia Slovenia
## 1 61854 156468 14823 133006 91496
## 2 .. .. .. .. ..
## 3 19718 90554 13911 132447 89464
## 4 42136 65914 912 559 2032
## 5 39039 10678 0 0 15
## Solomon.Islands Somalia South.Africa South.Sudan Spain Sri.Lanka
## 1 2212 848067 308303 514943 1439019 885951
## 2 .. .. .. .. .. ..
## 3 1115 67402 224304 1 917001 260120
## 4 1097 780665 83999 514942 522018 625831
## 5 83 757421 21041 498608 1778 115
## State.of.Palestine Sudan Suriname Sweden Switzerland Syrian.Arab.Republic
## 1 1813063 584940 179870 206848 326276 621881
## 2 .. .. .. .. .. ..
## 3 35600 15083 160380 197121 288203 126353
## 4 1777463 569857 19490 9727 38073 495528
## 5 2697 244607 0 1170 686 2393
## Tajikistan Thailand Timor.Leste Togo Tokelau Tonga Trinidad.and.Tobago
## 1 537701 311308 11261 193830 1684 32666 197521
## 2 .. .. .. .. .. .. ..
## 3 471233 206019 10514 19358 1523 29974 182904
## 4 66468 105289 747 174472 161 2692 14617
## 5 40537 32076 0 32868 0 24 0
## Tunisia Turkey Turkmenistan Turks.and.Caicos.Islands Tuvalu Uganda Ukraine
## 1 465576 2640033 259987 2311 2350 311602 5545760
## 2 .. .. .. .. .. .. ..
## 3 416484 2548456 249213 221 1171 71129 4668356
## 4 49092 91577 10774 2090 1179 240473 877404
## 5 221 6 0 0 377 149308 84
## United.Arab.Emirates United.Kingdom United.Republic.of.Tanzania
## 1 79545 3794333 203070
## 2 .. .. ..
## 3 5600 3462531 61033
## 4 73945 331802 142037
## 5 62 15760 72693
## United.States.of.America United.States.Virgin.Islands Uruguay Uzbekistan
## 1 1739233 2362 237486 1428020
## 2 .. .. .. ..
## 3 889414 70 56838 1078563
## 4 849819 2292 180648 349457
## 5 38316 0 286 2027
## Vanuatu Venezuela..Bolivarian.Republic.of. Viet.Nam Wallis.and.Futuna.Islands
## 1 5060 185946 1237873 6484
## 2 .. .. .. ..
## 3 1017 114991 1085310 884
## 4 4043 70955 152563 5600
## 5 9 2510 71579 0
## Western.Sahara Yemen Zambia Zimbabwe X
## 1 168239 455492 85203 204365 NA
## 2 .. .. .. .. NA
## 3 333 11457 26062 40957 NA
## 4 167906 444035 59141 163408 NA
## 5 0 357 26254 75122 NA
## [1] 1981 239
## [1] "data.frame"
Converting to tibbles
Tibbles is an enhanced dataframe, which inherits data frame class, and improves three behaviors, which are:
- Subsetting: Always returns a new tibble, [[ and $ return a vector.
- No partial matching: One must use full column names when subsetting.
- Display: When one prints a tibble, R provides a concise view of the data, which fits on one screen.
This is good site for tibbles documentation: https://tibble.tidyverse.org/
Converting dataframe to tibble
Check if data frame to tibble conversion worked
## [1] TRUE
my Approach
In Migrantion dataset, columns 6 to 239 represent values of country variable, and each row represents 239 observations. I used the most recent functions pivot_longer(), introduced in tidyr 1.0.0, replacing the older functions spread() and gather().
More information is available at: https://tidyr.tidyverse.org/reference/pivot_longer.html
Transformation, using pivot_longer()
long_tidy_migrantstock <- migrantstock_tibble %>%
rename(destination = Major.area..region..country.or.area.of.destination, year = Year) %>%
pivot_longer(7:239, names_to = "country_of_origin", values_to = "count_in_thousand") # Transform the data from its wide to a more normalized long format
head(long_tidy_migrantstock, n = 5)## # A tibble: 5 x 8
## year Sort.order destination Notes Code Type.of.data..a. country_of_orig~
## <int> <int> <chr> <chr> <int> <chr> <chr>
## 1 1990 1990001 WORLD "" 900 "" Afghanistan
## 2 1990 1990001 WORLD "" 900 "" Albania
## 3 1990 1990001 WORLD "" 900 "" Algeria
## 4 1990 1990001 WORLD "" 900 "" American.Samoa
## 5 1990 1990001 WORLD "" 900 "" Andorra
## # ... with 1 more variable: count_in_thousand <chr>
I am selecting necessary columns, and filtering rows for migrations to USA, based on the conditions for my main analysis.
migrantstock_USA <- long_tidy_migrantstock %>%
na.omit() %>% # remove blank row
select(year, destination, country_of_origin, count_in_thousand) %>%
filter(destination == "United States of America")
head(migrantstock_USA)## # A tibble: 6 x 4
## year destination country_of_origin count_in_thousand
## <int> <chr> <chr> <chr>
## 1 1990 United States of America Afghanistan "28444"
## 2 1990 United States of America Albania "5627"
## 3 1990 United States of America Algeria "4629"
## 4 1990 United States of America American.Samoa ""
## 5 1990 United States of America Andorra ""
## 6 1990 United States of America Angola "2252"
I need to change data type of count_in_thousand to numeric
Top 10 countries from where people migrated to USA
top10_migrantstock_USA <- migrantstock_USA %>%
na.omit(migrantstock_USA) %>%
group_by(country_of_origin) %>%
summarise(count_in_thousand = sum(count_in_thousand)) %>%
arrange(desc(count_in_thousand)) %>%
top_n(10)## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by count_in_thousand
| country_of_origin | count_in_thousand |
|---|---|
| Mexico | 66449196 |
| China | 12287459 |
| Puerto.Rico | 10995652 |
| Philippines | 10941865 |
| India | 10471258 |
| Viet.Nam | 7344249 |
| Cuba | 7010218 |
| El.Salvador | 6925286 |
| Republic.of.Korea | 6441503 |
| Canada | 5746699 |
ggplot(top10_migrantstock_USA, aes(x = reorder(country_of_origin,count_in_thousand), y = count_in_thousand, fill = country_of_origin)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
coord_flip() +
ggtitle("Top 10 Countries of Origin Migrated to USA") +
xlab("Countries of Origin") + ylab("Count in Thousand")So, at this point, having seen the top 10 countries, from where migrations happnened to USA, now I’ll see the bottom 10.
Bottom 10 countries from where people migrated to USA
bottom10_migrantstock_USA <- migrantstock_USA %>%
na.omit(migrantstock_USA) %>%
group_by(country_of_origin) %>%
summarise(count_in_thousand = sum(count_in_thousand)) %>%
arrange((count_in_thousand)) %>%
top_n(-10)## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by count_in_thousand
| country_of_origin | count_in_thousand |
|---|---|
| Turkmenistan | 10208 |
| Luxembourg | 17226 |
| Tajikistan | 23662 |
| Kyrgyzstan | 27087 |
| Aruba | 34887 |
| Montserrat | 35796 |
| Iceland | 38299 |
| Marshall.Islands | 38505 |
| Micronesia..Fed..States.of. | 45496 |
| Slovenia | 46807 |
ggplot(bottom10_migrantstock_USA, aes(x = reorder(country_of_origin,count_in_thousand), y = count_in_thousand, fill = country_of_origin)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
coord_flip() +
ggtitle("Bottom 10 Countries of Origin Migrated to USA") +
xlab("Countries of Origin") + ylab("Count in Thousand")Trend changes in population migrations to USA, over time
migrantstock_USA <- na.omit(migrantstock_USA)
migrantstock_USA_year <- migrantstock_USA %>%
group_by(year) %>%
summarise(count_in_thousand = sum(count_in_thousand))## `summarise()` ungrouping output (override with `.groups` argument)
| year | count_in_thousand |
|---|---|
| 1990 | 20134790 |
| 1995 | 26593925 |
| 2000 | 33157941 |
| 2005 | 36996835 |
| 2010 | 42071829 |
| 2015 | 45775575 |
| 2019 | 48065741 |
ggplot(migrantstock_USA_year, aes(x = year, y = count_in_thousand)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
ggtitle("Yearly Migration Estimates to USA") +
xlab("Year") + ylab("Count in Thousand")ggplot() + geom_line(aes(y = count_in_thousand, x = year), size = 1.5,
data = migrantstock_USA_year, stat = "identity")Conclusion
So, I answered the three questions, which I wanted to answer in the begining, and observed that the migrations to USA has been progressively increasing, over the years.
Dataset-2: Educational attainment
This is a wide dataset, at Census Bureau website, containing data on educatioanl attainments and incomes, for 2017. The link is as follows: https://www.census.gov/data/datasets/time-series/demo/income-poverty/cps-asec-design.html
The questions, I tried to address here are:
- Percentage wise distribution of levels of education (e.g. GED, Bachelors etc), in 2017.
- Correlation between educational attainment and being without earning.
- Correlation between educational attainment and high inclome, over $100,000.
After downloading dataset from original link, it was uploaded to github, which has been read below.
raw_data<-read.csv(paste("./Educational_Attainment.csv"),stringsAsFactors = FALSE,header = TRUE)
head(raw_data)## Characteristic Less.Than.9th.Grade Nongrad.9th.to.12th
## 1 Without Earnings 4943 7159
## 2 $1 to $2,499 or loss 135 250
## 3 $2,500 to $4,999 92 201
## 4 $5,000 to $7,499 130 283
## 5 $7,500 to $9,999 92 190
## 6 $10,000 to $12,499 291 388
## High.School.Grad..Incl.GED. Some.College.No.Degree Associate.Degree
## 1 26111 12574 6394
## 2 889 682 398
## 3 628 494 277
## 4 1013 579 283
## 5 761 389 244
## 6 1612 829 575
## Bachelor.Degree Master.Degree Professional.Degree Doctorate.Degree
## 1 11592 5003 676 915
## 2 663 249 36 43
## 3 417 127 17 25
## 4 633 282 16 42
## 5 311 106 9 33
## 6 864 291 47 24
names(raw_data)<-c("Character", gsub("\\.", " ", names(raw_data)[2:length(raw_data)]))
head(raw_data)## Character Less Than 9th Grade Nongrad 9th to 12th
## 1 Without Earnings 4943 7159
## 2 $1 to $2,499 or loss 135 250
## 3 $2,500 to $4,999 92 201
## 4 $5,000 to $7,499 130 283
## 5 $7,500 to $9,999 92 190
## 6 $10,000 to $12,499 291 388
## High School Grad Incl GED Some College No Degree Associate Degree
## 1 26111 12574 6394
## 2 889 682 398
## 3 628 494 277
## 4 1013 579 283
## 5 761 389 244
## 6 1612 829 575
## Bachelor Degree Master Degree Professional Degree Doctorate Degree
## 1 11592 5003 676 915
## 2 663 249 36 43
## 3 417 127 17 25
## 4 633 282 16 42
## 5 311 106 9 33
## 6 864 291 47 24
Transformation, using pivot_longer() and melt (member of reshape2)
Reshape using data.table
raw_data_table <- as.data.table(raw_data)
education_attainment <- melt(raw_data_table,id.vars = "Character", variable.name = "Education", value.name = 'Count_in_Thousands', na.rm = TRUE)
head(education_attainment)## Character Education Count_in_Thousands
## 1: Without Earnings Less Than 9th Grade 4943
## 2: $1 to $2,499 or loss Less Than 9th Grade 135
## 3: $2,500 to $4,999 Less Than 9th Grade 92
## 4: $5,000 to $7,499 Less Than 9th Grade 130
## 5: $7,500 to $9,999 Less Than 9th Grade 92
## 6: $10,000 to $12,499 Less Than 9th Grade 291
Reshape using tidyr
education_attainment_long <- raw_data %>%
pivot_longer (2:10, names_to = "Education", values_to = "Count_in_Thousands" )
education_attainment_long## # A tibble: 378 x 3
## Character Education Count_in_Thousands
## <chr> <chr> <int>
## 1 Without Earnings "Less Than 9th Grade" 4943
## 2 Without Earnings "Nongrad 9th to 12th" 7159
## 3 Without Earnings "High School Grad Incl GED " 26111
## 4 Without Earnings "Some College No Degree" 12574
## 5 Without Earnings "Associate Degree" 6394
## 6 Without Earnings "Bachelor Degree" 11592
## 7 Without Earnings "Master Degree" 5003
## 8 Without Earnings "Professional Degree" 676
## 9 Without Earnings "Doctorate Degree" 915
## 10 $1 to $2,499 or loss "Less Than 9th Grade" 135
## # ... with 368 more rows
Percentage wise distribution of levels of education (e.g. GED, Bachelors etc), in 2017.
education <- education_attainment %>%
group_by(Education) %>%
summarise(Count_in_Thousands = sum(Count_in_Thousands)) %>%
arrange(desc(Count_in_Thousands)) ## `summarise()` ungrouping output (override with `.groups` argument)
| Education | Count_in_Thousands |
|---|---|
| High School Grad Incl GED | 62669 |
| Bachelor Degree | 48220 |
| Some College No Degree | 35457 |
| Associate Degree | 22370 |
| Master Degree | 21054 |
| Nongrad 9th to 12th | 13685 |
| Less Than 9th Grade | 8723 |
| Doctorate Degree | 4473 |
| Professional Degree | 3176 |
education$Percent_total <- round((education$Count_in_Thousands / sum(education$Count_in_Thousands))*100,2)
education## # A tibble: 9 x 3
## Education Count_in_Thousands Percent_total
## <fct> <int> <dbl>
## 1 "High School Grad Incl GED " 62669 28.5
## 2 "Bachelor Degree" 48220 21.9
## 3 "Some College No Degree" 35457 16.1
## 4 "Associate Degree" 22370 10.2
## 5 "Master Degree" 21054 9.58
## 6 "Nongrad 9th to 12th" 13685 6.23
## 7 "Less Than 9th Grade" 8723 3.97
## 8 "Doctorate Degree" 4473 2.03
## 9 "Professional Degree" 3176 1.44
Let’s see it visually using ggplot2
ggplot(education, aes(x = reorder(Education,Percent_total), y = Percent_total,fill = Education)) +
geom_bar(stat = "identity", position = position_dodge(), colour = "black") +
geom_text(aes(label = Percent_total), vjust = .5, hjust = 1, position = position_dodge(width = 0.9), color = "black") +
ggtitle("Percentage of Total, by education") +
xlab("Education") + ylab("Percent of Total") +
coord_flip() Correlation between educational attainment and being without earning.
education_attainment_wide <- education_attainment_long %>%
pivot_wider(names_from = Character, values_from = Count_in_Thousands)
kable(education_attainment_wide)| Education | Without Earnings | $1 to $2,499 or loss | $2,500 to $4,999 | $5,000 to $7,499 | $7,500 to $9,999 | $10,000 to $12,499 | $12,500 to $14,999 | $15,000 to $17,499 | $17,500 to $19,999 | $20,000 to $22,499 | $22,500 to $24,999 | $25,000 to $27,499 | $27,500 to $29,999 | $30,000 to $32,499 | $32,500 to $34,999 | $35,000 to $37,499 | $37,500 to $39,999 | $40,000 to $42,499 | $42,500 to $44,999 | $45,000 to $47,499 | $47,500 to $49,999 | $50,000 to $52,499 | $52,500 to $54,999 | $55,000 to $57,499 | $57,500 to $59,999 | $60,000 to $62,499 | $62,500 to $64,999 | $65,000 to $67,499 | $67,500 to $69,999 | $70,000 to $72,499 | $72,500 to $74,999 | $75,000 to $77,499 | $77,500 to $79,999 | $80,000 to $82,499 | $82,500 to $84,999 | $85,000 to $87,499 | $87,500 to $89,999 | $90,000 to $92,499 | $92,500 to $94,999 | $95,000 to $97,499 | $97,500 to $99,999 | $100,000 and over |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Less Than 9th Grade | 4943 | 135 | 92 | 130 | 92 | 291 | 156 | 290 | 195 | 406 | 205 | 346 | 89 | 291 | 81 | 159 | 76 | 156 | 24 | 92 | 31 | 129 | 6 | 47 | 17 | 64 | 5 | 27 | 3 | 16 | 4 | 25 | 4 | 24 | 2 | 3 | 2 | 15 | 4 | 6 | 1 | 39 |
| Nongrad 9th to 12th | 7159 | 250 | 201 | 283 | 190 | 388 | 228 | 399 | 288 | 675 | 318 | 521 | 121 | 484 | 148 | 284 | 142 | 325 | 45 | 142 | 90 | 229 | 31 | 83 | 24 | 118 | 27 | 61 | 25 | 63 | 13 | 47 | 22 | 54 | 1 | 15 | 7 | 35 | 1 | 6 | 6 | 136 |
| High School Grad Incl GED | 26111 | 889 | 628 | 1013 | 761 | 1612 | 868 | 1697 | 1097 | 2406 | 1244 | 2451 | 1058 | 2724 | 700 | 2114 | 788 | 2041 | 396 | 1475 | 599 | 2011 | 321 | 937 | 305 | 1158 | 165 | 694 | 187 | 649 | 126 | 446 | 92 | 522 | 88 | 223 | 94 | 245 | 33 | 144 | 30 | 1527 |
| Some College No Degree | 12574 | 682 | 494 | 579 | 389 | 829 | 333 | 914 | 594 | 1209 | 626 | 1283 | 604 | 1698 | 443 | 1239 | 469 | 1253 | 269 | 852 | 356 | 1204 | 280 | 591 | 213 | 870 | 129 | 545 | 150 | 533 | 132 | 388 | 98 | 374 | 65 | 211 | 73 | 179 | 23 | 138 | 59 | 1513 |
| Associate Degree | 6394 | 398 | 277 | 283 | 244 | 575 | 196 | 557 | 326 | 740 | 409 | 828 | 389 | 1058 | 334 | 859 | 383 | 940 | 246 | 597 | 306 | 825 | 261 | 463 | 175 | 652 | 125 | 420 | 139 | 341 | 94 | 323 | 78 | 319 | 52 | 214 | 32 | 221 | 21 | 98 | 42 | 1136 |
| Bachelor Degree | 11592 | 663 | 417 | 633 | 311 | 864 | 297 | 735 | 383 | 1048 | 505 | 1094 | 559 | 1473 | 413 | 1403 | 559 | 1881 | 446 | 1424 | 697 | 2090 | 500 | 1192 | 411 | 1640 | 296 | 1012 | 282 | 1247 | 194 | 998 | 266 | 1042 | 230 | 580 | 183 | 740 | 114 | 380 | 138 | 7288 |
| Master Degree | 5003 | 249 | 127 | 282 | 106 | 291 | 135 | 208 | 113 | 334 | 142 | 273 | 108 | 421 | 99 | 367 | 137 | 477 | 138 | 474 | 245 | 881 | 218 | 474 | 189 | 717 | 166 | 408 | 165 | 621 | 155 | 542 | 145 | 689 | 89 | 387 | 92 | 441 | 66 | 247 | 107 | 4526 |
| Professional Degree | 676 | 36 | 17 | 16 | 9 | 47 | 14 | 18 | 4 | 23 | 11 | 32 | 8 | 36 | 12 | 39 | 8 | 40 | 9 | 38 | 19 | 102 | 25 | 42 | 12 | 87 | 16 | 67 | 14 | 45 | 10 | 59 | 8 | 80 | 12 | 36 | 11 | 60 | 11 | 41 | 16 | 1310 |
| Doctorate Degree | 915 | 43 | 25 | 42 | 33 | 24 | 37 | 49 | 32 | 38 | 10 | 33 | 22 | 57 | 19 | 46 | 19 | 50 | 14 | 52 | 45 | 152 | 38 | 73 | 19 | 119 | 18 | 71 | 15 | 67 | 24 | 103 | 32 | 121 | 20 | 53 | 22 | 114 | 12 | 61 | 41 | 1693 |
withoutearning <- education_attainment_wide %>%
select(Education, "Without Earnings") %>%
mutate(Percent_total = round((`Without Earnings` / sum(`Without Earnings`)) * 100, 2)) %>%
arrange(desc(Percent_total))
kable(withoutearning)| Education | Without Earnings | Percent_total |
|---|---|---|
| High School Grad Incl GED | 26111 | 34.65 |
| Some College No Degree | 12574 | 16.68 |
| Bachelor Degree | 11592 | 15.38 |
| Nongrad 9th to 12th | 7159 | 9.50 |
| Associate Degree | 6394 | 8.48 |
| Master Degree | 5003 | 6.64 |
| Less Than 9th Grade | 4943 | 6.56 |
| Doctorate Degree | 915 | 1.21 |
| Professional Degree | 676 | 0.90 |
Let’s see it visually using ggplot2
ggplot(withoutearning, aes(x = reorder(Education,Percent_total), y = Percent_total)) +
geom_bar(stat = "identity", position = position_dodge(), colour = "black") +
geom_text(aes(label = Percent_total), vjust = .5, hjust = 1, position = position_dodge(width = 0.9), color = "black") +
ggtitle("Percentage of educational attainment without earnings") +
xlab("Education") + ylab("Percentage of total") +
coord_flip()Correlation between educational attainment and high inclome, over $100,000.
morethan100k<- education_attainment_wide %>%
select(Education,"$100,000 and over") %>%
mutate(Percent_total = round((`$100,000 and over` / sum(`$100,000 and over`)) * 100, 2))%>%
arrange(desc(Percent_total))
kable(morethan100k)| Education | $100,000 and over | Percent_total |
|---|---|---|
| Bachelor Degree | 7288 | 38.02 |
| Master Degree | 4526 | 23.61 |
| Doctorate Degree | 1693 | 8.83 |
| High School Grad Incl GED | 1527 | 7.97 |
| Some College No Degree | 1513 | 7.89 |
| Professional Degree | 1310 | 6.83 |
| Associate Degree | 1136 | 5.93 |
| Nongrad 9th to 12th | 136 | 0.71 |
| Less Than 9th Grade | 39 | 0.20 |
Let’s see it visually using ggplot2
ggplot(morethan100k, aes(x = reorder(Education,Percent_total), y = Percent_total,fill = Education)) +
geom_bar(stat = "identity", position = position_dodge(), colour = "black") +
geom_text(aes(label = Percent_total), vjust = .5, hjust = 1,position = position_dodge(width = 0.9), color = "black") +
ggtitle("Percentage of Education level who is making more than $100K") +
xlab("Education") + ylab("Percentage of Education") +
coord_flip() Conclusion
- Of the people surveyed, 28.51% graduated from high school (incl GED), 21.94% attained bachelor’s degree and 16.13% had some college degree.
- There’s a correlation between educational attainment and being without earning, because 34.65% of incomeless people, had only high school degree.
- There’s a correlation between educational attainment and high inclome (over $100,000), because 34.02% of those who earn 100k or more, have bachlor’s degree, 23.61% have master’s degree, and 8.83% have PhD.
Dataset-3: population, by global entities, from 1960 to 2010.
This is a wide dataset, in Worldbank’s website, containing data on populations by entities (including countries, regions etc), from 1960 to 2010. The link is as follows: https://databank.worldbank.org/home.aspx
The questions, I tried to address here are:
- Populations of top 20 entities from in the year 2010.
- Populations of bottom 20 entities from in the year 2010.
- Trend changes in world (i.e. sum total of entities, countries, regions etc) population, over time.
Reading the dataset.
After downloading dataset from original link, it was uploaded to github, which has been read below.
Selecting relevant columns and omitting NA
Tidying the data, with pivot_longer
Populations of top 20 entities in the year 2010
tbl.pivot_2010.df <- tbl.pivot.df %>%
filter(Year == "X2010") %>%
select(Country.Name, Population) %>%
arrange(desc(Population))
tbl.pivot_2010.top20.df <- top_n(tbl.pivot_2010.df, 20)## Selecting by Population
| Country.Name | Population |
|---|---|
| World | 6922947261 |
| IDA & IBRD total | 5793479782 |
| Low & middle income | 5765121055 |
| Middle income | 5187847044 |
| IBRD only | 4421851199 |
| Early-demographic dividend | 2908287643 |
| Lower middle income | 2688436551 |
| Upper middle income | 2499410493 |
| East Asia & Pacific | 2206884624 |
| Late-demographic dividend | 2181529363 |
| East Asia & Pacific (excluding high income) | 1965964309 |
| East Asia & Pacific (IDA & IBRD countries) | 1941377349 |
| South Asia | 1638792934 |
| South Asia (IDA & IBRD) | 1638792934 |
| IDA total | 1371628583 |
| China | 1337705000 |
| OECD members | 1242309585 |
| India | 1234281170 |
| High income | 1157826206 |
| Post-demographic dividend | 1074869481 |
ggplot(tbl.pivot_2010.top20.df, aes(x = reorder(Country.Name, Population), y = Population)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
coord_flip() +
ggtitle("Top 20 entities in 2010") +
xlab("Entity name") + ylab("Population")Populations of bottom 20 entities in the year 2010
tbl.pivot_2010.df <- tbl.pivot.df %>%
filter(Year == "X2010") %>%
select(Country.Name, Population) %>%
arrange(desc(Population))
tbl.pivot_2010.bottom20.df <- top_n(tbl.pivot_2010.df, -20)## Selecting by Population
| Country.Name | Population |
|---|---|
| Andorra | 84449 |
| Dominica | 70878 |
| Bermuda | 65124 |
| Greenland | 56905 |
| Cayman Islands | 56672 |
| Marshall Islands | 56366 |
| American Samoa | 56079 |
| Northern Mariana Islands | 53971 |
| St. Kitts and Nevis | 49016 |
| Faroe Islands | 47814 |
| St. Martin (French part) | 37582 |
| Liechtenstein | 35994 |
| Monaco | 35612 |
| Gibraltar | 33585 |
| Turks and Caicos Islands | 32660 |
| San Marino | 31229 |
| British Virgin Islands | 27794 |
| Palau | 17955 |
| Tuvalu | 10530 |
| Nauru | 10005 |
ggplot(tbl.pivot_2010.bottom20.df, aes(x = reorder(Country.Name, Population), y = Population)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black", width = 0.5) +
coord_flip() +
ggtitle("Bottom 20 entities in 2010") +
xlab("Entity name") + ylab("Population")Trend changes in world (i.e. sum total of entities, countries, regions etc) population, over time
tbl.pivot.grp_by_yr.df <- tbl.pivot.df %>%
group_by(Year) %>%
summarise("Total Population" = sum(Population))## `summarise()` ungrouping output (override with `.groups` argument)
| Year | Total Population |
|---|---|
| X1960 | 30696994944 |
| X1961 | 31108353431 |
| X1962 | 31658294754 |
| X1963 | 32342820687 |
| X1964 | 33032632731 |
| X1965 | 33739453133 |
| X1966 | 34483819203 |
| X1967 | 35225755472 |
| X1968 | 35981282059 |
| X1969 | 36774110168 |
| X1970 | 37578380420 |
| X1971 | 38403294546 |
| X1972 | 39218037696 |
| X1973 | 40027236911 |
| X1974 | 40838904173 |
| X1975 | 41634929535 |
| X1976 | 42418106464 |
| X1977 | 43198308074 |
| X1978 | 43991861452 |
| X1979 | 44804464513 |
| X1980 | 45627321658 |
| X1981 | 46472179835 |
| X1982 | 47351062716 |
| X1983 | 48240338882 |
| X1984 | 49128810634 |
| X1985 | 50034855935 |
| X1986 | 50967920664 |
| X1987 | 51925263797 |
| X1988 | 52893291009 |
| X1989 | 53860012489 |
| X1990 | 54840819416 |
| X1991 | 55795466259 |
| X1992 | 56723821804 |
| X1993 | 57652320273 |
| X1994 | 58572105249 |
| X1995 | 59492289648 |
| X1996 | 60399571928 |
| X1997 | 61304671961 |
| X1998 | 62200715609 |
| X1999 | 63084689314 |
| X2000 | 63961898183 |
| X2001 | 64834636987 |
| X2002 | 65706004220 |
| X2003 | 66581568814 |
| X2004 | 67462607601 |
| X2005 | 68349669516 |
| X2006 | 69242979693 |
| X2007 | 70140938566 |
| X2008 | 71051310936 |
| X2009 | 71967689579 |
| X2010 | 72885851687 |
ggplot(tbl.pivot.grp_by_yr.df, aes(x=Year, y=`Total Population`)) +
geom_bar(stat="identity", position=position_dodge(), color="black", width = 0.5) +
ggtitle("Total Population by Year") +
xlab("Year") + ylab("Population")Conclusion
- I selected the top 20 entities. I saw that among them are China, India, and there South Asia region.
- Among the bottom 20 entities, I saw Palau, Tuvalu, Nauru.
- Finally, I saw that the world population has been progressivley increasing from 1960 to 2010.
Marker: 607-06_p