One of my intentions for project 2 was to try something different, and in this case, the thing that I tried was an abject failure. The data sample that I posted to the forum was in .pdf format, and Jeremy pointed out that he’s never worked with PDF data from R before (nor had I for that matter…). So I decided to try it out as a challenge. My intention was to try to extract data from the .pdf “as is” and do some work with it, but after several hours of effort, and in the interest of getting the project finished on time, I decided to just choose another data source as I wasn’t getting any traction with the pdf. It wasn’t a waste of time as I familiarized myself with several packages and techniques (pdfTools, Tabulizer <- this one isn’t on CRAN, so I also familiarized myself with non-CRAN package installation)… but in the end, I was not even able to parse the .pdf, let alone work with the data. This is how we learn - and having found a few things that don’t work, I’m better prepared for next time. :)
I took the extra time/extention given for this assignment and figured out how to parse the .pdf! As such, i’ve included 4 data-sets rather than 3. I didn’t have sufficient time to perform any major analysis, but I managed to get it tidy, which appears to have been the point of the exercise.
You can find this doc on rpubs here and as always, you can find all the relevant data on my github here
Clear the workspace
rm(list = ls())Load the required libraries
## -- Attaching packages ------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
For my first dataset, I chose to work with the electricity data posted by Rose. Let’s load it and take a look
electricity.data <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project2/us_daily_electric_system_operating_data.csv",
skip = 4)
electricity.data <- rename(electricity.data,"region" = "megawatthours")
head(electricity.data,5)## region X01.01.2018 X01.02.2018 X01.03.2018
## 1 California (region) NA NA
## 2 Demand 645599 704074 727216
## 3 Net generation 463629 502312 504456
## 4 Total net actual interchange -179525 -201654 -222762
## 5 Carolinas (region) NA NA
## X01.04.2018 X01.05.2018 X01.06.2018 X01.07.2018 X01.08.2018 X01.09.2018
## 1 NA NA NA NA NA NA
## 2 714305 712021 668299 655802 731881 729380
## 3 505148 484644 447821 447952 500797 493884
## 4 -209163 -213905 -219968 -207861 -237089 -246264
## 5 NA NA NA NA NA NA
## X01.10.2018 X01.11.2018 X01.12.2018 X01.13.2018 X01.14.2018 X01.15.2018
## 1 NA NA NA NA NA NA
## 2 716698 675260 687504 665136 642080 703473
## 3 482843 435107 460105 452264 428364 474015
## 4 -234361 -240146 -227413 -231833 -213729 -229468
## 5 NA NA NA NA NA NA
## X01.16.2018 X01.17.2018 X01.18.2018 X01.19.2018 X01.20.2018 X01.21.2018
## 1 NA NA NA NA NA NA
## 2 713050 715733 718298 712862 669421 665340
## 3 487101 484557 488237 484276 458660 437506
## 4 -225955 -231182 -230069 -228598 -210774 -227842
## 5 NA NA NA NA NA NA
## X01.22.2018 X01.23.2018 X01.24.2018 X01.25.2018 X01.26.2018 X01.27.2018
## 1 NA NA NA NA NA NA
## 2 727071 722833 728743 732552 699533 666415
## 3 502041 510543 500446 501674 464103 432315
## 4 -225042 -212304 -228303 -230401 -233814 -231742
## 5 NA NA NA NA NA NA
## X01.28.2018 X01.29.2018 X01.30.2018 X01.31.2018
## 1 NA NA NA NA
## 2 647844 725483 720384 717446
## 3 433092 491871 480451 492836
## 4 -213582 -233589 -239945 -224620
## 5 NA NA NA NA
Initial inspection of the data shows that a few changes are required in preparation for moving to a “tidy” format. I note that for each “region” (eg: California) there are 3 data-rows (“Demand”,“Net generation”,“Total net actual interchange”) and that the region row itself is actually blank. I’m going to copy the “region” column into a new column called “metric” and use this row to preserve the labels of the actual quantities being measured. Then, in the “region” col, I’ll convert the non-region labels to NAs and forward-fill to make sure that every row has an associated region.
electricity.data$metric <- electricity.data$region
electricity.data$region <- str_replace(electricity.data$region,"Demand","")
electricity.data$region <- str_replace(electricity.data$region,"Net generation","")
electricity.data$region <- str_replace(electricity.data$region,"Total net actual interchange","")
is.na(electricity.data$region ) <- electricity.data$region ==''
electricity.data <- electricity.data %>%
fill(region) %>%
na.omit(electricity.data)
head(electricity.data,5)## region X01.01.2018 X01.02.2018 X01.03.2018 X01.04.2018
## 2 California (region) 645599 704074 727216 714305
## 3 California (region) 463629 502312 504456 505148
## 4 California (region) -179525 -201654 -222762 -209163
## 6 Carolinas (region) 847133 938726 948954 901939
## 7 Carolinas (region) 802908 914879 920055 866924
## X01.05.2018 X01.06.2018 X01.07.2018 X01.08.2018 X01.09.2018 X01.10.2018
## 2 712021 668299 655802 731881 729380 716698
## 3 484644 447821 447952 500797 493884 482843
## 4 -213905 -219968 -207861 -237089 -246264 -234361
## 6 939699 932742 946665 840094 646932 590003
## 7 895604 884842 913783 820248 639454 581165
## X01.11.2018 X01.12.2018 X01.13.2018 X01.14.2018 X01.15.2018 X01.16.2018
## 2 675260 687504 665136 642080 703473 713050
## 3 435107 460105 452264 428364 474015 487101
## 4 -240146 -227413 -231833 -213729 -229468 -225955
## 6 561119 513368 587356 761043 823663 778424
## 7 553005 506277 582277 739821 780965 760210
## X01.17.2018 X01.18.2018 X01.19.2018 X01.20.2018 X01.21.2018 X01.22.2018
## 2 715733 718298 712862 669421 665340 727071
## 3 484557 488237 484276 458660 437506 502041
## 4 -231182 -230069 -228598 -210774 -227842 -225042
## 6 779099 858007 755994 663462 575237 588743
## 7 764523 831241 739271 642171 558728 578841
## X01.23.2018 X01.24.2018 X01.25.2018 X01.26.2018 X01.27.2018 X01.28.2018
## 2 722833 728743 732552 699533 666415 647844
## 3 510543 500446 501674 464103 432315 433092
## 4 -212304 -228303 -230401 -233814 -231742 -213582
## 6 523900 603365 676274 685613 575060 511872
## 7 520099 592705 659392 665487 564307 506623
## X01.29.2018 X01.30.2018 X01.31.2018 metric
## 2 725483 720384 717446 Demand
## 3 491871 480451 492836 Net generation
## 4 -233589 -239945 -224620 Total net actual interchange
## 6 567562 697175 752846 Demand
## 7 555393 676237 731116 Net generation
Now we’ll transfor the data from a wide format to a long format, rename a few things, and format the dates.
electricity.tidy <- electricity.data %>%
gather(electricity.data,power,-region,-metric)
electricity.tidy<- rename(electricity.tidy,"date" = "electricity.data")
electricity.tidy <- transform(electricity.tidy, power = as.numeric(power))
electricity.tidy$date <- as.Date(gsub("X","",electricity.tidy$date),"%m.%d.%Y")
electricity.tidy$region <- gsub("\\(region\\)","",electricity.tidy$region)
head(electricity.tidy,5)## region metric date power
## 1 California Demand 2018-01-01 645599
## 2 California Net generation 2018-01-01 463629
## 3 California Total net actual interchange 2018-01-01 -179525
## 4 Carolinas Demand 2018-01-01 847133
## 5 Carolinas Net generation 2018-01-01 802908
The data is now in a format that we should be able to work with relatively easily. The questions that rose wanted to look at were: Daily demand per region Daily net generation per region Daily total net actual Proportion of demand / total demand by region.
totals.byRegion <- electricity.tidy %>%
group_by(metric) %>%
filter(!(region =="United States Lower 48 ")) %>%
summarize(power = sum(power))
#ungroup()
power.byRegion <- electricity.tidy %>%
group_by(region,metric) %>%
filter(!(region =="United States Lower 48 ")) %>%
summarize(power = mean(power)) %>%
ungroup()
power.byRegion$proportions = power.byRegion$power / totals.byRegion$power
#interchange is on a totally different scale from demand and generation
#so lets split them.
power.supplyDemand <- power.byRegion %>%
filter(!(metric =="Total net actual interchange"))
power.interchange <- power.byRegion %>%
filter((metric =="Total net actual interchange"))
ggplot(power.supplyDemand, aes(x = region, y=power,fill=metric)) +
geom_bar(stat="identity",position="dodge")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
xlab("Region") +
ylab("Power Supply and Demand (MWh)") +
ggtitle("Power Supply and Demand by Region") +
theme(plot.title = element_text(hjust = 0.5))ggplot(power.supplyDemand, aes(x = region, y=proportions,fill=metric)) +
geom_bar(stat="identity",position="dodge")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
xlab("Region") +
ylab("Proportional Supply and Demand") +
ggtitle("Proportional Power Supply and Demand by Region") +
theme(plot.title = element_text(hjust = 0.5))ggplot(power.interchange, aes(x = region, y=power,fill=metric)) +
geom_bar(stat="identity",position="dodge")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
xlab("Region") +
ylab("Interchange (MWh)") +
ggtitle("Interchange by Region") +
theme(plot.title = element_text(hjust = 0.5))ggplot(power.interchange, aes(x = region, y=proportions*-1,fill=metric)) +
geom_bar(stat="identity",position="dodge")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
xlab("Region") +
ylab("Proportional Interchange") +
ggtitle("Proportional Interchange Region") +
theme(plot.title = element_text(hjust = 0.5))Whether we look at the data proportionally, or in absolute terms, we see more or less the same picture. The Mid-Atlantic and Midwest regions demand and generate the most electricity. New-England and the Southwest regions produce and demand the least, respectively.
California has the biggest electricity deficite as measured by total net interchange, and the northwest has the highest surplus.
demand.timeseries <- electricity.tidy %>%
filter((metric =="Demand") & !(region =="United States Lower 48 "))
ggplot(demand.timeseries, aes(x = date, y=power,color=region)) +
geom_line() +
scale_y_log10()+
xlab("Time") +
ylab("Electricity Demand (MWh)") +
ggtitle("Demand By Region over Time") +
theme(plot.title = element_text(hjust = 0.5))interchange.timeseries <- electricity.tidy %>%
filter((metric =="Total net actual interchange") & !(region =="United States Lower 48 "))
ggplot(interchange.timeseries, aes(x = date, y=power,color=region)) +
geom_line()+
xlab("Time") +
ylab("Interchange (MWh)") +
ggtitle("Interchange by Region over Time") +
theme(plot.title = element_text(hjust = 0.5))If we look at demand and interchange by region as a timeseries, we can see that they vary considerably. In terms of demans, we can see that the majority of them seem to follow a similar pattern, namely, that they start high at the first of the year, then trail down until about Jan 10, where they abruptly reverse, spike, and then flatten out.
The interchange curves seem to be reasonably stable with the occasional violent spike
This dataset (posted by Albert) shows the prevalence of Dengue fever by various geographies in Belize. In hindsight, this dataset proved to be much easier than expected.
belize.data <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project2/Belize.csv",
header=TRUE)
head(belize.data)## Community Type X2007 X2008 X2009 X2010 X2011 X2012
## 1 Belize City Urban 1.28 0.69 2.68 5.13 3.92 6.53
## 2 San Pedro Town- Amberg. Caye Urban 0.02 0.00 0.32 0.18 0.18 0.22
## 3 Caye Caulker Urban 0.07 0.00 0.09 0.04 0.02 0.11
## 4 Ladyville Rural 0.06 0.04 0.64 0.43 0.36 0.45
## 5 Hattieville Rural 0.01 0.02 0.08 0.12 0.09 0.13
## 6 Burrel Boom Rural 0.02 0.00 0.07 0.07 0.11 0.09
## X2013 X2014
## 1 11.69 17.87
## 2 1.22 0.64
## 3 0.20 0.14
## 4 0.72 0.88
## 5 0.17 0.21
## 6 0.40 0.32
We’ve got a wide dataset that we’ll need to melt into a tidy format. We’ll also need to clean up column names and we’ll change blank Types of regions to “unknown”
belize.tidy <- belize.data %>%
gather(year,fever,-Community,-Type)
belize.tidy$year <- gsub("X","",belize.tidy$year)
belize.tidy$Type <- gsub("^$","Unknown",belize.tidy$Type)
head(belize.tidy)## Community Type year fever
## 1 Belize City Urban 2007 1.28
## 2 San Pedro Town- Amberg. Caye Urban 2007 0.02
## 3 Caye Caulker Urban 2007 0.07
## 4 Ladyville Rural 2007 0.06
## 5 Hattieville Rural 2007 0.01
## 6 Burrel Boom Rural 2007 0.02
The data is now more or less tidy. One issue that remains is that there are NAs in the data. It would appear that we can just ignore the NAs using “na.rm=TRUE”. It may or may not be ideal to do so in the real world - I think that would be dependent on context/domain knowledge. I also tried interpolation here (acknowleding that it may not make sense in practice), but the data was much too sparse to allow for it. It appears as though an NA required 2 adjscent data points in order to be interpolated (at least as I attempted it). Lets look at a few plots:
fever.type <- belize.tidy %>%
group_by(year,Type) %>%
summarize(avg = mean(fever,na.rm=TRUE))
fever.type## # A tibble: 24 x 3
## # Groups: year [?]
## year Type avg
## <chr> <chr> <dbl>
## 1 2007 Rural 0.0240
## 2 2007 Unknown 0.0400
## 3 2007 Urban 0.457
## 4 2008 Rural 0.0140
## 5 2008 Unknown 0.0200
## 6 2008 Urban 0.230
## 7 2009 Rural 0.155
## 8 2009 Unknown 0.0150
## 9 2009 Urban 1.03
## 10 2010 Rural 0.111
## # ... with 14 more rows
ggplot(fever.type, aes(x = year, y=avg,group=Type,color=Type)) +
geom_line()+
scale_y_log10()+
xlab("Time") +
ylab("Fever Prevalence") +
ggtitle("Fever Prevalance by Location Type") +
theme(plot.title = element_text(hjust = 0.5)) Firstly, in the charts we can see the spots where we’re lacking data, which I feel is okay as no data exists.
fever.Community <- belize.tidy %>%
group_by(year,Community) %>%
summarize(avg = mean(fever,na.rm=TRUE))
fever.Community## # A tibble: 112 x 3
## # Groups: year [?]
## year Community avg
## <chr> <fct> <dbl>
## 1 2007 Belize City 1.28
## 2 2007 Bermudan Landing NaN
## 3 2007 Biscayne 0.0200
## 4 2007 Boston NaN
## 5 2007 Burrel Boom 0.0200
## 6 2007 Caye Caulker 0.0700
## 7 2007 Hattieville 0.0100
## 8 2007 Ladyville 0.0600
## 9 2007 Lords Bank NaN
## 10 2007 Mahogany Heights NaN
## # ... with 102 more rows
ggplot(fever.Community, aes(x = year, y=avg,group=Community,color=Community)) +
geom_line() +
scale_y_log10()+
xlab("Time") +
ylab("Fever Prevalence") +
ggtitle("Fever Prevalance by Community") +
theme(plot.title = element_text(hjust = 0.5))Secondly we can see that fever prevalance is much higher in an urban type of environment and particularly in Belize City. I suspect that this is a known phenomenon that has to do with population density.
Let’s check and see if any years showed a particurlarly large spike in Dengue fever prevalence:
fever.diff <- belize.tidy %>%
group_by(year) %>%
summarize(avg = mean(fever,na.rm=TRUE)) %>%
mutate(diff = (avg - lag(avg))/avg)
head(fever.diff,5)## # A tibble: 5 x 3
## year avg diff
## <chr> <dbl> <dbl>
## 1 2007 0.170 NA
## 2 2008 0.0867 - 0.962
## 3 2009 0.368 0.765
## 4 2010 0.567 0.351
## 5 2011 0.396 - 0.432
ggplot(fever.diff, aes(x = year, y=diff, group=1)) +
geom_line() +
xlab("Time") +
ylab("Percentage Change") +
ggtitle("YoY Percentage Change in Fever Prevalence") +
theme(plot.title = element_text(hjust = 0.5))We can see that the average year-over-year change in fever prevalence shows no real trend. To the eye, it looks as if it may be mean-reverting around zero, but I don’t think that we have enough data to make any strong statements.
The third set that I’ve chosen is the UN Migration data as posted by Brian. I’ve used specific data that he posted and also found an “Annex” data table which has mappings for all the country codes and Sort.order columns, in case I want to use that data as well.
UN.data <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project2/UN_migrationStockByAg.csv",
skip=14)
UN.countryCodes <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project2/UN_Country_codes.csv")
head(UN.data,5)## Sort.order Major.area..region..country.or.area.of.destination Notes
## 1 NA
## 2 1 WORLD
## 3 2 Developed regions (b)
## 4 3 Developing regions (c)
## 5 4 Least developed countries (d)
## Country.code Type.of.data..a.
## 1 NA
## 2 900
## 3 901
## 4 902
## 5 941
## International.migrant.stock.at.mid.year.by.age..both.sexes. X
## 1 0-4 5-9
## 2 5 444 102 6 705 750
## 3 1 251 818 2 326 676
## 4 4 192 284 4 379 074
## 5 1 063 284 1 130 597
## X.1 X.2 X.3 X.4 X.5 X.6
## 1 10-14 15-19 20-24 25-29 30-34 35-39
## 2 8 031 158 9 873 556 12 619 191 15 051 127 15 400 571 14 337 175
## 3 3 475 269 4 720 303 6 478 826 7 974 562 8 461 189 8 226 083
## 4 4 555 889 5 153 253 6 140 365 7 076 565 6 939 382 6 111 092
## 5 1 094 935 1 200 615 1 236 369 1 195 461 981 953 769 658
## X.7 X.8 X.9 X.10 X.11 X.12
## 1 40-44 45-49 50-54 55-59 60-64 65-69
## 2 12 753 493 10 559 275 8 763 500 7 596 248 6 786 967 5 568 653
## 3 7 624 421 6 433 452 5 382 155 4 705 909 4 197 180 3 463 728
## 4 5 129 072 4 125 823 3 381 345 2 890 339 2 589 787 2 104 925
## 5 581 996 451 789 382 834 308 221 256 041 155 596
## X.13 X.14 X.15
## 1 70-74 75+ Total
## 2 4 401 727 8 669 566 152 563 212
## 3 2 753 490 4 902 797 82 378 628
## 4 1 648 237 3 766 769 70 184 584
## 5 104 773 161 844 11 075 966
## International.migrant.stock.at.mid.year.by.age..male. X.16
## 1 0-4 5-9
## 2 2 742 131 3 400 509
## 3 646 176 1 198 838
## 4 2 095 955 2 201 671
## 5 533 519 572 990
## X.17 X.18 X.19 X.20 X.21 X.22
## 1 10-14 15-19 20-24 25-29 30-34 35-39
## 2 4 122 431 5 133 205 6 638 950 8 064 063 8 291 309 7 658 562
## 3 1 796 700 2 449 450 3 353 011 4 111 003 4 325 619 4 171 376
## 4 2 325 731 2 683 755 3 285 939 3 953 060 3 965 690 3 487 186
## 5 575 276 648 111 674 067 644 368 520 922 407 103
## X.23 X.24 X.25 X.26 X.27 X.28
## 1 40-44 45-49 50-54 55-59 60-64 65-69
## 2 6 737 139 5 519 992 4 481 023 3 745 593 3 191 398 2 528 192
## 3 3 834 793 3 211 968 2 635 497 2 228 293 1 872 690 1 496 823
## 4 2 902 346 2 308 024 1 845 526 1 517 300 1 318 708 1 031 369
## 5 307 998 240 027 202 824 164 230 132 550 80 593
## X.29 X.30 X.31
## 1 70-74 75+ Total
## 2 1 889 719 3 602 696 77 747 510
## 3 1 117 749 1 813 035 40 263 397
## 4 771 970 1 789 661 37 484 113
## 5 53 638 83 120 5 841 336
## International.migrant.stock.at.mid.year.by.age..female. X.32
## 1 0-4 5-9
## 2 2 701 971 3 305 241
## 3 605 642 1 127 838
## 4 2 096 329 2 177 403
## 5 529 765 557 607
## X.33 X.34 X.35 X.36 X.37 X.38
## 1 10-14 15-19 20-24 25-29 30-34 35-39
## 2 3 908 727 4 740 351 5 980 241 6 987 064 7 109 262 6 678 613
## 3 1 678 569 2 270 853 3 125 815 3 863 559 4 135 570 4 054 707
## 4 2 230 158 2 469 498 2 854 426 3 123 505 2 973 692 2 623 906
## 5 519 659 552 504 562 302 551 093 461 031 362 555
## X.39 X.40 X.41 X.42 X.43 X.44
## 1 40-44 45-49 50-54 55-59 60-64 65-69
## 2 6 016 354 5 039 283 4 282 477 3 850 655 3 595 569 3 040 461
## 3 3 789 628 3 221 484 2 746 658 2 477 616 2 324 490 1 966 905
## 4 2 226 726 1 817 799 1 535 819 1 373 039 1 271 079 1 073 556
## 5 273 998 211 762 180 010 143 991 123 491 75 003
## X.45 X.46 X.47
## 1 70-74 75+ Total
## 2 2 512 008 5 066 870 74 815 702
## 3 1 635 741 3 089 762 42 115 231
## 4 876 267 1 977 108 32 700 471
## 5 51 135 78 724 5 234 630
This data-set appears to need a whole bunch of work. I’m going to focus on the region,age and migration data which I wont be needing, then renaming the colums as appropriate.
colnames(UN.data)## [1] "Sort.order"
## [2] "Major.area..region..country.or.area.of.destination"
## [3] "Notes"
## [4] "Country.code"
## [5] "Type.of.data..a."
## [6] "International.migrant.stock.at.mid.year.by.age..both.sexes."
## [7] "X"
## [8] "X.1"
## [9] "X.2"
## [10] "X.3"
## [11] "X.4"
## [12] "X.5"
## [13] "X.6"
## [14] "X.7"
## [15] "X.8"
## [16] "X.9"
## [17] "X.10"
## [18] "X.11"
## [19] "X.12"
## [20] "X.13"
## [21] "X.14"
## [22] "X.15"
## [23] "International.migrant.stock.at.mid.year.by.age..male."
## [24] "X.16"
## [25] "X.17"
## [26] "X.18"
## [27] "X.19"
## [28] "X.20"
## [29] "X.21"
## [30] "X.22"
## [31] "X.23"
## [32] "X.24"
## [33] "X.25"
## [34] "X.26"
## [35] "X.27"
## [36] "X.28"
## [37] "X.29"
## [38] "X.30"
## [39] "X.31"
## [40] "International.migrant.stock.at.mid.year.by.age..female."
## [41] "X.32"
## [42] "X.33"
## [43] "X.34"
## [44] "X.35"
## [45] "X.36"
## [46] "X.37"
## [47] "X.38"
## [48] "X.39"
## [49] "X.40"
## [50] "X.41"
## [51] "X.42"
## [52] "X.43"
## [53] "X.44"
## [54] "X.45"
## [55] "X.46"
## [56] "X.47"
so it looks like we want to get rid of columns 3, and 5
dropCols <- colnames(UN.data)[c(3,5)]
UN.data <- UN.data[ , !(names(UN.data) %in% dropCols)]
head(colnames(UN.data),5)## [1] "Sort.order"
## [2] "Major.area..region..country.or.area.of.destination"
## [3] "Country.code"
## [4] "International.migrant.stock.at.mid.year.by.age..both.sexes."
## [5] "X"
That appears to have worked. Now we want to rename the columns. As we see above, region-related data are contained in cols 1-4 and should be easy to manually rename. For the actual migration data, both sexes are in columns 4-20, males in 21-37 and femails in 38-54. I note also that column-header information that i’m looking for is actually contained in the first row. I’m going to try to rename the columns by pulling out the 1st row values and appending it with a gender-prefix (eg: “b0-4” would be both sexes age 0-4). Finally, we’ll drop the 1st row as it will no longer be of any use.
names(UN.data)[2] <- "region"
col.names <- as.matrix(UN.data[1, 4:20])
colnames(UN.data)[4:20] <- colnames(UN.data[4:20]) <- paste("b",col.names,sep = "_")
colnames(UN.data)[21:37] <- colnames(UN.data[21:37]) <- paste("m",col.names,sep = "_")
colnames(UN.data)[38:54] <- colnames(UN.data[38:54]) <- paste("f",col.names,sep = "_")
UN.data <- UN.data[-1, ]
head(UN.data,5)## Sort.order region
## 2 1 WORLD
## 3 2 Developed regions
## 4 3 Developing regions
## 5 4 Least developed countries
## 6 5 Less developed regions excluding least developed countries
## Country.code b_0-4 b_5-9 b_10-14 b_ 15-19 b_20-24
## 2 900 5 444 102 6 705 750 8 031 158 9 873 556 12 619 191
## 3 901 1 251 818 2 326 676 3 475 269 4 720 303 6 478 826
## 4 902 4 192 284 4 379 074 4 555 889 5 153 253 6 140 365
## 5 941 1 063 284 1 130 597 1 094 935 1 200 615 1 236 369
## 6 934 3 129 000 3 248 477 3 460 954 3 952 638 4 903 996
## b_25-29 b_30-34 b_35-39 b_40-44 b_45-49 b_50-54
## 2 15 051 127 15 400 571 14 337 175 12 753 493 10 559 275 8 763 500
## 3 7 974 562 8 461 189 8 226 083 7 624 421 6 433 452 5 382 155
## 4 7 076 565 6 939 382 6 111 092 5 129 072 4 125 823 3 381 345
## 5 1 195 461 981 953 769 658 581 996 451 789 382 834
## 6 5 881 104 5 957 429 5 341 434 4 547 076 3 674 034 2 998 511
## b_55-59 b_60-64 b_65-69 b_70-74 b_75+ b_Total
## 2 7 596 248 6 786 967 5 568 653 4 401 727 8 669 566 152 563 212
## 3 4 705 909 4 197 180 3 463 728 2 753 490 4 902 797 82 378 628
## 4 2 890 339 2 589 787 2 104 925 1 648 237 3 766 769 70 184 584
## 5 308 221 256 041 155 596 104 773 161 844 11 075 966
## 6 2 582 118 2 333 746 1 949 329 1 543 464 3 604 925 59 108 618
## m_0-4 m_5-9 m_10-14 m_ 15-19 m_20-24 m_25-29
## 2 2 742 131 3 400 509 4 122 431 5 133 205 6 638 950 8 064 063
## 3 646 176 1 198 838 1 796 700 2 449 450 3 353 011 4 111 003
## 4 2 095 955 2 201 671 2 325 731 2 683 755 3 285 939 3 953 060
## 5 533 519 572 990 575 276 648 111 674 067 644 368
## 6 1 562 436 1 628 681 1 750 455 2 035 644 2 611 872 3 308 692
## m_30-34 m_35-39 m_40-44 m_45-49 m_50-54 m_55-59
## 2 8 291 309 7 658 562 6 737 139 5 519 992 4 481 023 3 745 593
## 3 4 325 619 4 171 376 3 834 793 3 211 968 2 635 497 2 228 293
## 4 3 965 690 3 487 186 2 902 346 2 308 024 1 845 526 1 517 300
## 5 520 922 407 103 307 998 240 027 202 824 164 230
## 6 3 444 768 3 080 083 2 594 348 2 067 997 1 642 702 1 353 070
## m_60-64 m_65-69 m_70-74 m_75+ m_Total f_0-4
## 2 3 191 398 2 528 192 1 889 719 3 602 696 77 747 510 2 701 971
## 3 1 872 690 1 496 823 1 117 749 1 813 035 40 263 397 605 642
## 4 1 318 708 1 031 369 771 970 1 789 661 37 484 113 2 096 329
## 5 132 550 80 593 53 638 83 120 5 841 336 529 765
## 6 1 186 158 950 776 718 332 1 706 541 31 642 777 1 566 564
## f_5-9 f_10-14 f_ 15-19 f_20-24 f_25-29 f_30-34
## 2 3 305 241 3 908 727 4 740 351 5 980 241 6 987 064 7 109 262
## 3 1 127 838 1 678 569 2 270 853 3 125 815 3 863 559 4 135 570
## 4 2 177 403 2 230 158 2 469 498 2 854 426 3 123 505 2 973 692
## 5 557 607 519 659 552 504 562 302 551 093 461 031
## 6 1 619 796 1 710 499 1 916 994 2 292 124 2 572 412 2 512 661
## f_35-39 f_40-44 f_45-49 f_50-54 f_55-59 f_60-64
## 2 6 678 613 6 016 354 5 039 283 4 282 477 3 850 655 3 595 569
## 3 4 054 707 3 789 628 3 221 484 2 746 658 2 477 616 2 324 490
## 4 2 623 906 2 226 726 1 817 799 1 535 819 1 373 039 1 271 079
## 5 362 555 273 998 211 762 180 010 143 991 123 491
## 6 2 261 351 1 952 728 1 606 037 1 355 809 1 229 048 1 147 588
## f_65-69 f_70-74 f_75+ f_Total
## 2 3 040 461 2 512 008 5 066 870 74 815 702
## 3 1 966 905 1 635 741 3 089 762 42 115 231
## 4 1 073 556 876 267 1 977 108 32 700 471
## 5 75 003 51 135 78 724 5 234 630
## 6 998 553 825 132 1 898 384 27 465 841
Now the data can be converted from “wide” to “long” and into a tidy format. Then we’ll separate the “group” column into “sex” and “age.range”.
Also… it is at this point where I have realized that my variable containing the “tidy” data in this case is actually named “UNtidy”… ugh.
UN.tidy <- UN.data %>%
gather(group,count,-Sort.order,-region,-Country.code)
UN.tidy <- separate(UN.tidy, group,c("sex","age.range"), sep="_")
head(UN.tidy,5)## Sort.order region
## 1 1 WORLD
## 2 2 Developed regions
## 3 3 Developing regions
## 4 4 Least developed countries
## 5 5 Less developed regions excluding least developed countries
## Country.code sex age.range count
## 1 900 b 0-4 5 444 102
## 2 901 b 0-4 1 251 818
## 3 902 b 0-4 4 192 284
## 4 941 b 0-4 1 063 284
## 5 934 b 0-4 3 129 000
Looking better, but there are still a few things that we need to do to the data. We can delete all the rows where sex=b because anything in those rows can be computed as the sum of sex=m + sex=f. The same is true for age.range=Total - it’s the sum of all age-ranges for that specific country code, and thus can be easily computed, so it doesn’t represent individual observations, but the sum of individual observations.
UN.tidy <- UN.tidy %>%
filter(!(sex== "b")) %>%
filter(!(age.range == "Total"))
head(UN.tidy,5)## Sort.order region
## 1 1 WORLD
## 2 2 Developed regions
## 3 3 Developing regions
## 4 4 Least developed countries
## 5 5 Less developed regions excluding least developed countries
## Country.code sex age.range count
## 1 900 m 0-4 2 742 131
## 2 901 m 0-4 646 176
## 3 902 m 0-4 2 095 955
## 4 941 m 0-4 533 519
## 5 934 m 0-4 1 562 436
And, finally, we’ll remove the whitespace from the “count” and convert it to a numeric type and also replace the blanks with NAs.
UN.tidy[UN.tidy == ".."] <- NA
#first we have to remove the spaces...
UN.tidy$count <- gsub('\\s+', '', UN.tidy$count)
UN.tidy$count <- as.numeric(UN.tidy$count)
head(UN.tidy,5)## Sort.order region
## 1 1 WORLD
## 2 2 Developed regions
## 3 3 Developing regions
## 4 4 Least developed countries
## 5 5 Less developed regions excluding least developed countries
## Country.code sex age.range count
## 1 900 m 0-4 2742131
## 2 901 m 0-4 646176
## 3 902 m 0-4 2095955
## 4 941 m 0-4 533519
## 5 934 m 0-4 1562436
I notice also that our “region” column has non-country entries contained in it, such as “WORLD” and “Developed regions”. This is a problem and could screw up any summing or averaging work by introducing a double-counting issue. We don’t have enough data in the original set to rectify this however, What we’re going to do is add some data from the “ANNEX” file to associate every country with a “Major.Area” and a “Region”. Then we can drop all non-country entries in the data and perform some analysis.
UN.countryCodes <- UN.countryCodes[,c(1,4,7)]
head(UN.countryCodes,5)## Country.code Major.area Region
## 1 4 Asia Southern Asia
## 2 8 Europe Southern Europe
## 3 12 Africa Northern Africa
## 4 16 Oceania Polynesia
## 5 20 Europe Southern Europe
The data looks good, so now we’ll join and remove the non-country-specific rows:
UN.tidy <- left_join(UN.tidy, UN.countryCodes, by= c('Country.code' = 'Country.code'))
UN.tidy <- UN.tidy %>%
filter(!is.na(Major.area))
head(UN.tidy,5)## Sort.order region Country.code sex age.range count Major.area
## 1 9 Burundi 108 m 0-4 9574 Africa
## 2 10 Comoros 174 m 0-4 247 Africa
## 3 11 Djibouti 262 m 0-4 3543 Africa
## 4 12 Eritrea 232 m 0-4 385 Africa
## 5 13 Ethiopia 231 m 0-4 41272 Africa
## Region
## 1 Eastern Africa
## 2 Eastern Africa
## 3 Eastern Africa
## 4 Eastern Africa
## 5 Eastern Africa
Now we have what appears to be complete and proper “tidy” data, so lets look at a few things.
migration.majorArea <- UN.tidy %>%
group_by(Major.area) %>%
summarise(migrants = sum(count,na.rm = TRUE)) %>%
ungroup()
ggplot(migration.majorArea, aes(x = Major.area , y=migrants/sum(UN.tidy$count,na.rm=TRUE), fill=Major.area)) +
geom_bar(stat="identity",position="dodge") +
xlab("Major Area") +
ylab("Migrant Proportion") +
ggtitle("Migration by Major Areas") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))migration.majorAreaSex <- UN.tidy %>%
group_by(Major.area,sex) %>%
summarise(migrants = sum(count,na.rm = TRUE)) %>%
ungroup()
ggplot(migration.majorAreaSex, aes(x = Major.area , y=migrants/sum(UN.tidy$count,na.rm=TRUE), fill=sex)) +
geom_bar(stat="identity",position="dodge") +
xlab("Major Area") +
ylab("Migrant Proportion") +
ggtitle("Migration by Major Areas & Sex") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))Migration is dominated by Asia, Europe and North America (about 80%). This makes sense as these are the most populus places. There does not appear to be huge disparity by sex, but only small differences in the proportions with Asia showing the largest migration.
migration.age <- UN.tidy %>%
group_by(age.range) %>%
summarise(migrants = sum(count,na.rm = TRUE)) %>%
ungroup()
ggplot(migration.age, aes(x = age.range, y=migrants/sum(UN.tidy$count,na.rm=TRUE), fill=age.range)) +
geom_bar(stat="identity",position="dodge") +
xlab("Age") +
ylab("Migrant Proportion") +
ggtitle("Migration by Age") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))migration.ageSex <- UN.tidy %>%
group_by(age.range,sex) %>%
summarise(migrants = sum(count,na.rm = TRUE)) %>%
ungroup()
ggplot(migration.ageSex, aes(x = age.range , y=migrants/sum(UN.tidy$count,na.rm=TRUE), fill=sex)) +
geom_bar(stat="identity",position="dodge") +
xlab("Age") +
ylab("Migrant Proportion") +
ggtitle("Migration by Age and Sex") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))If we look at migration by age, we see a hump between ages 20 and 50. Interestingly, when we break it down by sex, we see that in this high-density region, men are signigicantly more likley migrants, but women are vastly over-represented in the 75+ category. It’s unclear why this may be, but potentially, men move for work and maybe women move to be with family when they reach an older age.
My original intention for this project was to parse and tidy a .pdf document, but initial attempt was unsuccesssful. Given the extra time, however, I was able to get it to work, so here goes
library(tabulizer)
tax.data <- extract_tables("https://github.com/plb2018/DATA607/raw/master/DATA_607_Project2/canada_tax.pdf")
typeof(tax.data[1])## [1] "list"
typeof(tax.data[[1]])## [1] "character"
I used a package called Tabulizer. The parsing seem to have worked but it appears to have been returned as nested lists - a list for each page of the pdf. Let’s take a look at what we’ve got:
tax.data[[1]][1:5,1:5]## [,1]
## [1,] ""
## [2,] ""
## [3,] "Province or Territory"
## [4,] ""
## [5,] "Newfoundland and Labrador"
## [,2] [,3] [,4] [,5]
## [1,] "" "" "Less than" ""
## [2,] "Grand Total of GST/HST Credit Recipients" "" "" ""
## [3,] "" "" "$5,000" ""
## [4,] "(#) ($)" "(#)" "" "($)"
## [5,] "160,300 $65,756" "" "26,410" "$7,044"
This looks pretty messy in it’s current state, so let’s see if we can tidy it up a bit. Firstly, i note that a few of the columns contain data for multiple salary-ranges. We’ll re-arrange the data to deal with this:
tax.p1 <- data.frame(tax.data[[1]],stringsAsFactors = FALSE)
tax.p2 <- data.frame(tax.data[[2]],stringsAsFactors = FALSE)
#put the data into proper "wide" format
tax.p2[14:28] <- tax.p2[19:nrow(tax.p2) , ]
#we'll now remove the extra data... and the "total" column while we're at it.
tax.p2 <- tax.p2[1:17, ]
head(tax.p2)## X1 X2 X3 X4 X5 X6 X7 X8
## 1 $40,000 to $45,000 to
## 2 Province or Territory $44,999 $49,999
## 3 (#) ($) (#) ($)
## 4 Newfoundland and Labrador 3,980 $1,274 550
## 5 Prince Edward Island 1,260 $422 290
## 6 Nova Scotia 7,530 $2,501 1,280
## X9 X10 X11 X12 X13 X1.1 X2.1 X3.1
## 1 $50,000 to $55,000 to
## 2 $54,999 Province or Territory $59,999
## 3 (#) ($) (#)
## 4 $169 60 $18 Newfoundland and Labrador
## 5 $102 50 $14 Prince Edward Island 10
## 6 $406 190 $57 Nova Scotia 20
## X4.1 X5.1 X6.1 X7.1 X8.1 X9.1 X10.1 X11.1 X12.1 X13.1
## 1 $60,000
## 2 and over
## 3 ($) (#) ($)
## 4
## 5 $4
## 6 $7
## X1.2 X2.2
## 1
## 2 Province or Territory
## 3 (#)
## 4 Newfoundland and Labrador
## 5 Prince Edward Island
## 6 Nova Scotia
Now we’re going to rename the columns and drop everything that we don’t need. The column names are contained in the first few rows. We’ll paste them together where required. Note that I’m only going to focus on page 2 of the document from here on for the sake of simplicity.
colnames(tax.p2) <- paste(tax.p2[1, ],tax.p2[2, ],sep = " ")
head(tax.p2)## Province or Territory $40,000 to $44,999
## 1 $40,000 to
## 2 Province or Territory $44,999
## 3 (#) ($) (#)
## 4 Newfoundland and Labrador 3,980 $1,274
## 5 Prince Edward Island 1,260 $422
## 6 Nova Scotia 7,530 $2,501
## $45,000 to $49,999 $50,000 to $54,999
## 1 $45,000 to $50,000 to
## 2 $49,999 $54,999
## 3 ($) (#) ($)
## 4 550 $169 60 $18
## 5 290 $102 50 $14
## 6 1,280 $406 190 $57
## Province or Territory $55,000 to $59,999
## 1 $55,000 to
## 2 Province or Territory $59,999
## 3 (#) ($) (#)
## 4 Newfoundland and Labrador
## 5 Prince Edward Island 10 $4
## 6 Nova Scotia 20 $7
## $60,000 and over Province or Territory
## 1 $60,000
## 2 and over Province or Territory
## 3 ($) (#)
## 4 Newfoundland and Labrador
## 5 Prince Edward Island
## 6 Nova Scotia
#drop all the columns that we dont need
tax.p2 <- tax.p2[-14]
tax.p2 <- tax.p2[seq(1,28,2)]
tax.p2 <- tax.p2[-c(12:14)]
head(tax.p2)## Province or Territory $40,000 to $44,999 .2 $45,000 to $49,999
## 1 $40,000 to $45,000 to
## 2 Province or Territory $44,999 $49,999
## 3
## 4 Newfoundland and Labrador 3,980 $1,274 550
## 5 Prince Edward Island 1,260 $422 290
## 6 Nova Scotia 7,530 $2,501 1,280
## .5 $50,000 to $54,999 .8 $55,000 to $59,999 .11 $60,000 and over
## 1 $50,000 to $55,000 to $60,000
## 2 $54,999 $59,999 and over
## 3
## 4 $169 60 $18
## 5 $102 50 $14 10 $4
## 6 $406 190 $57 20 $7
## .14
## 1
## 2
## 3
## 4
## 5
## 6
Now some of the columns are properly named. Also, we no longer need the first few rows of the data, so we’ll drop them.
tax.p2 <- tax.p2[-1:-3,]
head(tax.p2,5)## Province or Territory $40,000 to $44,999 .2 $45,000 to $49,999
## 4 Newfoundland and Labrador 3,980 $1,274 550
## 5 Prince Edward Island 1,260 $422 290
## 6 Nova Scotia 7,530 $2,501 1,280
## 7 New Brunswick 6,670 $2,243 1,240
## 8 Quebec 73,060 $25,216 15,330
## .5 $50,000 to $54,999 .8 $55,000 to $59,999 .11 $60,000 and over
## 4 $169 60 $18
## 5 $102 50 $14 10 $4
## 6 $406 190 $57 20 $7
## 7 $395 200 $59 20 $4
## 8 $5,039 2,390 $662 170 $51 30
## .14
## 4
## 5
## 6
## 7
## 8 $8
The data is starting to look much better. We can see that we need to rename every second column. We’ll basically copy previous column names to the right, and append a “_dollars" suffix which will help us out in the next steps.
colnames(tax.p2)[seq(3,11,2)] <- paste(colnames(tax.p2[seq(2,10,2)]),"_dollars",sep = " ")
colnames(tax.p2)[1] <- "Region"
head(tax.p2,5)## Region $40,000 to $44,999 $40,000 to $44,999 _dollars
## 4 Newfoundland and Labrador 3,980 $1,274
## 5 Prince Edward Island 1,260 $422
## 6 Nova Scotia 7,530 $2,501
## 7 New Brunswick 6,670 $2,243
## 8 Quebec 73,060 $25,216
## $45,000 to $49,999 $45,000 to $49,999 _dollars $50,000 to $54,999
## 4 550 $169 60
## 5 290 $102 50
## 6 1,280 $406 190
## 7 1,240 $395 200
## 8 15,330 $5,039 2,390
## $50,000 to $54,999 _dollars $55,000 to $59,999
## 4 $18
## 5 $14 10
## 6 $57 20
## 7 $59 20
## 8 $662 170
## $55,000 to $59,999 _dollars $60,000 and over $60,000 and over _dollars
## 4
## 5 $4
## 6 $7
## 7 $4
## 8 $51 30 $8
We’re now ready to tidy the data. We’re going to use gather(), and then separate() the “range” column on the “_" we added in a previous step. This will cause all rows with the “_dollar" appendix to have corresponding “dollar” entries in the “value” column, whereas we’ll be left with NAs for all rows where “range” didn’t have a suffix - we’ll replace those NAs with “count”.
tax.tidy <- gather(tax.p2,range,quantity,-Region) %>%
separate(range,c("range","value"), sep="_")
tax.tidy$value[is.na(tax.tidy$value)] <- "count"
head(tax.tidy,5)## Region range value quantity
## 1 Newfoundland and Labrador $40,000 to $44,999 count 3,980
## 2 Prince Edward Island $40,000 to $44,999 count 1,260
## 3 Nova Scotia $40,000 to $44,999 count 7,530
## 4 New Brunswick $40,000 to $44,999 count 6,670
## 5 Quebec $40,000 to $44,999 count 73,060
We’re almost done. One final thing that we’re going to do is clean up the quantity column and make it numeric so that we can perform some analysis.
tax.tidy$quantity <- as.numeric(gsub('\\$|,', '', tax.tidy$quantity))
head(tax.tidy,5)## Region range value quantity
## 1 Newfoundland and Labrador $40,000 to $44,999 count 3980
## 2 Prince Edward Island $40,000 to $44,999 count 1260
## 3 Nova Scotia $40,000 to $44,999 count 7530
## 4 New Brunswick $40,000 to $44,999 count 6670
## 5 Quebec $40,000 to $44,999 count 73060
The data is now tidy and we can do a few quick plots:
tax.range <- tax.tidy %>%
filter(value=="count") %>%
group_by(Region,range) %>%
summarise(quantity = mean(quantity,na.rm = TRUE)) %>%
ungroup()
ggplot(tax.range, aes(x = range, y=quantity,group=Region,color=Region)) +
geom_line() +
scale_y_log10()+
xlab("Income Range") +
ylab("Number of Recipients") +
ggtitle("# of Tax Credit Recipients by Income and Region") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))tax.rangeD <- tax.tidy %>%
filter(value=="dollars") %>%
group_by(Region,range) %>%
summarise(quantity = mean(quantity,na.rm = TRUE)) %>%
ungroup()
ggplot(tax.rangeD, aes(x = range, y=quantity,group=Region,color=Region)) +
geom_line() +
scale_y_log10()+
xlab("Income Range") +
ylab("Tax Credits Collected") +
ggtitle("$ Amt of Tax Credit Collected by Income and Region") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))We can see that tax credits, in both number of recipients, and amount tend to decline as income range increases. This finding appears to be true in all cases.