Introduction

One of my intentions for project 2 was to try something different, and in this case, the thing that I tried was an abject failure. The data sample that I posted to the forum was in .pdf format, and Jeremy pointed out that he’s never worked with PDF data from R before (nor had I for that matter…). So I decided to try it out as a challenge. My intention was to try to extract data from the .pdf “as is” and do some work with it, but after several hours of effort, and in the interest of getting the project finished on time, I decided to just choose another data source as I wasn’t getting any traction with the pdf. It wasn’t a waste of time as I familiarized myself with several packages and techniques (pdfTools, Tabulizer <- this one isn’t on CRAN, so I also familiarized myself with non-CRAN package installation)… but in the end, I was not even able to parse the .pdf, let alone work with the data. This is how we learn - and having found a few things that don’t work, I’m better prepared for next time. :)

Update to introduction

I took the extra time/extention given for this assignment and figured out how to parse the .pdf! As such, i’ve included 4 data-sets rather than 3. I didn’t have sufficient time to perform any major analysis, but I managed to get it tidy, which appears to have been the point of the exercise.

You can find this doc on rpubs here and as always, you can find all the relevant data on my github here

Clear the workspace

rm(list = ls())

Load the required libraries

## -- Attaching packages ------------------------------------------------------ tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts --------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Data Set 1 - Electricity Data

For my first dataset, I chose to work with the electricity data posted by Rose. Let’s load it and take a look

electricity.data <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project2/us_daily_electric_system_operating_data.csv",
                      skip = 4)

electricity.data <- rename(electricity.data,"region" = "megawatthours")

head(electricity.data,5)

##                         region X01.01.2018 X01.02.2018 X01.03.2018
## 1          California (region)                      NA          NA
## 2                       Demand      645599      704074      727216
## 3               Net generation      463629      502312      504456
## 4 Total net actual interchange     -179525     -201654     -222762
## 5           Carolinas (region)                      NA          NA
##   X01.04.2018 X01.05.2018 X01.06.2018 X01.07.2018 X01.08.2018 X01.09.2018
## 1          NA          NA          NA          NA          NA          NA
## 2      714305      712021      668299      655802      731881      729380
## 3      505148      484644      447821      447952      500797      493884
## 4     -209163     -213905     -219968     -207861     -237089     -246264
## 5          NA          NA          NA          NA          NA          NA
##   X01.10.2018 X01.11.2018 X01.12.2018 X01.13.2018 X01.14.2018 X01.15.2018
## 1          NA          NA          NA          NA          NA          NA
## 2      716698      675260      687504      665136      642080      703473
## 3      482843      435107      460105      452264      428364      474015
## 4     -234361     -240146     -227413     -231833     -213729     -229468
## 5          NA          NA          NA          NA          NA          NA
##   X01.16.2018 X01.17.2018 X01.18.2018 X01.19.2018 X01.20.2018 X01.21.2018
## 1          NA          NA          NA          NA          NA          NA
## 2      713050      715733      718298      712862      669421      665340
## 3      487101      484557      488237      484276      458660      437506
## 4     -225955     -231182     -230069     -228598     -210774     -227842
## 5          NA          NA          NA          NA          NA          NA
##   X01.22.2018 X01.23.2018 X01.24.2018 X01.25.2018 X01.26.2018 X01.27.2018
## 1          NA          NA          NA          NA          NA          NA
## 2      727071      722833      728743      732552      699533      666415
## 3      502041      510543      500446      501674      464103      432315
## 4     -225042     -212304     -228303     -230401     -233814     -231742
## 5          NA          NA          NA          NA          NA          NA
##   X01.28.2018 X01.29.2018 X01.30.2018 X01.31.2018
## 1          NA          NA          NA          NA
## 2      647844      725483      720384      717446
## 3      433092      491871      480451      492836
## 4     -213582     -233589     -239945     -224620
## 5          NA          NA          NA          NA

Initial inspection of the data shows that a few changes are required in preparation for moving to a “tidy” format. I note that for each “region” (eg: California) there are 3 data-rows (“Demand”,“Net generation”,“Total net actual interchange”) and that the region row itself is actually blank. I’m going to copy the “region” column into a new column called “metric” and use this row to preserve the labels of the actual quantities being measured. Then, in the “region” col, I’ll convert the non-region labels to NAs and forward-fill to make sure that every row has an associated region.

electricity.data$metric <- electricity.data$region

electricity.data$region  <- str_replace(electricity.data$region,"Demand","")
electricity.data$region  <- str_replace(electricity.data$region,"Net generation","")
electricity.data$region  <- str_replace(electricity.data$region,"Total net actual interchange","")
is.na(electricity.data$region ) <- electricity.data$region ==''

electricity.data <- electricity.data %>% 
  fill(region) %>% 
  na.omit(electricity.data)


head(electricity.data,5)

##                region X01.01.2018 X01.02.2018 X01.03.2018 X01.04.2018
## 2 California (region)      645599      704074      727216      714305
## 3 California (region)      463629      502312      504456      505148
## 4 California (region)     -179525     -201654     -222762     -209163
## 6  Carolinas (region)      847133      938726      948954      901939
## 7  Carolinas (region)      802908      914879      920055      866924
##   X01.05.2018 X01.06.2018 X01.07.2018 X01.08.2018 X01.09.2018 X01.10.2018
## 2      712021      668299      655802      731881      729380      716698
## 3      484644      447821      447952      500797      493884      482843
## 4     -213905     -219968     -207861     -237089     -246264     -234361
## 6      939699      932742      946665      840094      646932      590003
## 7      895604      884842      913783      820248      639454      581165
##   X01.11.2018 X01.12.2018 X01.13.2018 X01.14.2018 X01.15.2018 X01.16.2018
## 2      675260      687504      665136      642080      703473      713050
## 3      435107      460105      452264      428364      474015      487101
## 4     -240146     -227413     -231833     -213729     -229468     -225955
## 6      561119      513368      587356      761043      823663      778424
## 7      553005      506277      582277      739821      780965      760210
##   X01.17.2018 X01.18.2018 X01.19.2018 X01.20.2018 X01.21.2018 X01.22.2018
## 2      715733      718298      712862      669421      665340      727071
## 3      484557      488237      484276      458660      437506      502041
## 4     -231182     -230069     -228598     -210774     -227842     -225042
## 6      779099      858007      755994      663462      575237      588743
## 7      764523      831241      739271      642171      558728      578841
##   X01.23.2018 X01.24.2018 X01.25.2018 X01.26.2018 X01.27.2018 X01.28.2018
## 2      722833      728743      732552      699533      666415      647844
## 3      510543      500446      501674      464103      432315      433092
## 4     -212304     -228303     -230401     -233814     -231742     -213582
## 6      523900      603365      676274      685613      575060      511872
## 7      520099      592705      659392      665487      564307      506623
##   X01.29.2018 X01.30.2018 X01.31.2018                       metric
## 2      725483      720384      717446                       Demand
## 3      491871      480451      492836               Net generation
## 4     -233589     -239945     -224620 Total net actual interchange
## 6      567562      697175      752846                       Demand
## 7      555393      676237      731116               Net generation

Now we’ll transfor the data from a wide format to a long format, rename a few things, and format the dates.

electricity.tidy <- electricity.data %>%  
  gather(electricity.data,power,-region,-metric) 

electricity.tidy<- rename(electricity.tidy,"date" = "electricity.data")

electricity.tidy <- transform(electricity.tidy, power = as.numeric(power))

electricity.tidy$date <- as.Date(gsub("X","",electricity.tidy$date),"%m.%d.%Y")

electricity.tidy$region <- gsub("\\(region\\)","",electricity.tidy$region)

head(electricity.tidy,5)

##        region                       metric       date   power
## 1 California                        Demand 2018-01-01  645599
## 2 California                Net generation 2018-01-01  463629
## 3 California  Total net actual interchange 2018-01-01 -179525
## 4  Carolinas                        Demand 2018-01-01  847133
## 5  Carolinas                Net generation 2018-01-01  802908

The data is now in a format that we should be able to work with relatively easily. The questions that rose wanted to look at were: Daily demand per region Daily net generation per region Daily total net actual Proportion of demand / total demand by region.

totals.byRegion <- electricity.tidy %>% 
  group_by(metric) %>% 
  filter(!(region =="United States Lower 48 ")) %>%
  summarize(power = sum(power))
  #ungroup() 

power.byRegion <- electricity.tidy %>% 
  group_by(region,metric) %>% 
  filter(!(region =="United States Lower 48 ")) %>%
  summarize(power = mean(power)) %>%
  ungroup() 

power.byRegion$proportions = power.byRegion$power / totals.byRegion$power

#interchange is on a totally different scale from demand and generation
#so lets split them.
power.supplyDemand <- power.byRegion %>%
  filter(!(metric =="Total net actual interchange"))

power.interchange <- power.byRegion %>%
  filter((metric =="Total net actual interchange"))


ggplot(power.supplyDemand, aes(x = region, y=power,fill=metric)) +
  geom_bar(stat="identity",position="dodge")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Region") + 
  ylab("Power Supply and Demand (MWh)") +
  ggtitle("Power Supply and Demand by Region") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(power.supplyDemand, aes(x = region, y=proportions,fill=metric)) +
 geom_bar(stat="identity",position="dodge")+
 theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Region") + 
  ylab("Proportional Supply and Demand") +
  ggtitle("Proportional Power Supply and Demand by Region") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(power.interchange, aes(x = region, y=power,fill=metric)) +
 geom_bar(stat="identity",position="dodge")+
 theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Region") + 
  ylab("Interchange (MWh)") +
  ggtitle("Interchange by Region") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(power.interchange, aes(x = region, y=proportions*-1,fill=metric)) +
 geom_bar(stat="identity",position="dodge")+
 theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Region") + 
  ylab("Proportional Interchange") +
  ggtitle("Proportional Interchange Region") +
  theme(plot.title = element_text(hjust = 0.5))

Whether we look at the data proportionally, or in absolute terms, we see more or less the same picture. The Mid-Atlantic and Midwest regions demand and generate the most electricity. New-England and the Southwest regions produce and demand the least, respectively.

California has the biggest electricity deficite as measured by total net interchange, and the northwest has the highest surplus.

demand.timeseries <- electricity.tidy %>% 
  filter((metric =="Demand") & !(region =="United States Lower 48 "))

ggplot(demand.timeseries, aes(x = date, y=power,color=region)) +
  geom_line() +
  scale_y_log10()+
  xlab("Time") + 
  ylab("Electricity Demand (MWh)") +
  ggtitle("Demand By Region over Time") +
  theme(plot.title = element_text(hjust = 0.5))

interchange.timeseries <- electricity.tidy %>% 
  filter((metric =="Total net actual interchange") & !(region =="United States Lower 48 "))

ggplot(interchange.timeseries, aes(x = date, y=power,color=region)) +
  geom_line()+
  xlab("Time") + 
  ylab("Interchange (MWh)") +
  ggtitle("Interchange by Region over Time") +
  theme(plot.title = element_text(hjust = 0.5))

If we look at demand and interchange by region as a timeseries, we can see that they vary considerably. In terms of demans, we can see that the majority of them seem to follow a similar pattern, namely, that they start high at the first of the year, then trail down until about Jan 10, where they abruptly reverse, spike, and then flatten out.

The interchange curves seem to be reasonably stable with the occasional violent spike

Data Set 2 - Belize

This dataset (posted by Albert) shows the prevalence of Dengue fever by various geographies in Belize. In hindsight, this dataset proved to be much easier than expected.

belize.data <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project2/Belize.csv",
                        header=TRUE)

head(belize.data)

##                      Community  Type X2007 X2008 X2009 X2010 X2011 X2012
## 1                  Belize City Urban  1.28  0.69  2.68  5.13  3.92  6.53
## 2 San Pedro Town- Amberg. Caye Urban  0.02  0.00  0.32  0.18  0.18  0.22
## 3                 Caye Caulker Urban  0.07  0.00  0.09  0.04  0.02  0.11
## 4                    Ladyville Rural  0.06  0.04  0.64  0.43  0.36  0.45
## 5                  Hattieville Rural  0.01  0.02  0.08  0.12  0.09  0.13
## 6                  Burrel Boom Rural  0.02  0.00  0.07  0.07  0.11  0.09
##   X2013 X2014
## 1 11.69 17.87
## 2  1.22  0.64
## 3  0.20  0.14
## 4  0.72  0.88
## 5  0.17  0.21
## 6  0.40  0.32

We’ve got a wide dataset that we’ll need to melt into a tidy format. We’ll also need to clean up column names and we’ll change blank Types of regions to “unknown”

belize.tidy <- belize.data %>% 
  gather(year,fever,-Community,-Type)

belize.tidy$year <- gsub("X","",belize.tidy$year)
belize.tidy$Type <- gsub("^$","Unknown",belize.tidy$Type)

head(belize.tidy)

##                      Community  Type year fever
## 1                  Belize City Urban 2007  1.28
## 2 San Pedro Town- Amberg. Caye Urban 2007  0.02
## 3                 Caye Caulker Urban 2007  0.07
## 4                    Ladyville Rural 2007  0.06
## 5                  Hattieville Rural 2007  0.01
## 6                  Burrel Boom Rural 2007  0.02

The data is now more or less tidy. One issue that remains is that there are NAs in the data. It would appear that we can just ignore the NAs using “na.rm=TRUE”. It may or may not be ideal to do so in the real world - I think that would be dependent on context/domain knowledge. I also tried interpolation here (acknowleding that it may not make sense in practice), but the data was much too sparse to allow for it. It appears as though an NA required 2 adjscent data points in order to be interpolated (at least as I attempted it). Lets look at a few plots:

fever.type <- belize.tidy %>%
  group_by(year,Type) %>%
  summarize(avg = mean(fever,na.rm=TRUE))

fever.type

## # A tibble: 24 x 3
## # Groups:   year [?]
##    year  Type       avg
##    <chr> <chr>    <dbl>
##  1 2007  Rural   0.0240
##  2 2007  Unknown 0.0400
##  3 2007  Urban   0.457 
##  4 2008  Rural   0.0140
##  5 2008  Unknown 0.0200
##  6 2008  Urban   0.230 
##  7 2009  Rural   0.155 
##  8 2009  Unknown 0.0150
##  9 2009  Urban   1.03  
## 10 2010  Rural   0.111 
## # ... with 14 more rows

ggplot(fever.type, aes(x = year, y=avg,group=Type,color=Type)) +
  geom_line()+
  scale_y_log10()+ 
  xlab("Time") + 
  ylab("Fever Prevalence") +
  ggtitle("Fever Prevalance by Location Type") +
  theme(plot.title = element_text(hjust = 0.5))

Firstly, in the charts we can see the spots where we’re lacking data, which I feel is okay as no data exists.

fever.Community <- belize.tidy %>%
  group_by(year,Community) %>%
  summarize(avg = mean(fever,na.rm=TRUE))


fever.Community

## # A tibble: 112 x 3
## # Groups:   year [?]
##    year  Community             avg
##    <chr> <fct>               <dbl>
##  1 2007  Belize City        1.28  
##  2 2007  Bermudan Landing NaN     
##  3 2007  Biscayne           0.0200
##  4 2007  Boston           NaN     
##  5 2007  Burrel Boom        0.0200
##  6 2007  Caye Caulker       0.0700
##  7 2007  Hattieville        0.0100
##  8 2007  Ladyville          0.0600
##  9 2007  Lords Bank       NaN     
## 10 2007  Mahogany Heights NaN     
## # ... with 102 more rows

ggplot(fever.Community, aes(x = year, y=avg,group=Community,color=Community)) +
  geom_line() +
  scale_y_log10()+ 
  xlab("Time") + 
  ylab("Fever Prevalence") +
  ggtitle("Fever Prevalance by Community") +
  theme(plot.title = element_text(hjust = 0.5))

Secondly we can see that fever prevalance is much higher in an urban type of environment and particularly in Belize City. I suspect that this is a known phenomenon that has to do with population density.

Let’s check and see if any years showed a particurlarly large spike in Dengue fever prevalence:

fever.diff <- belize.tidy %>%
  group_by(year) %>%
  summarize(avg = mean(fever,na.rm=TRUE)) %>%
  mutate(diff = (avg - lag(avg))/avg)


head(fever.diff,5)

## # A tibble: 5 x 3
##   year     avg    diff
##   <chr>  <dbl>   <dbl>
## 1 2007  0.170   NA    
## 2 2008  0.0867 - 0.962
## 3 2009  0.368    0.765
## 4 2010  0.567    0.351
## 5 2011  0.396  - 0.432

ggplot(fever.diff, aes(x = year, y=diff, group=1)) +
  geom_line() +
  xlab("Time") + 
  ylab("Percentage Change") +
  ggtitle("YoY Percentage Change in Fever Prevalence") +
  theme(plot.title = element_text(hjust = 0.5))

We can see that the average year-over-year change in fever prevalence shows no real trend. To the eye, it looks as if it may be mean-reverting around zero, but I don’t think that we have enough data to make any strong statements.

Data Set 3 - UN Migration

The third set that I’ve chosen is the UN Migration data as posted by Brian. I’ve used specific data that he posted and also found an “Annex” data table which has mappings for all the country codes and Sort.order columns, in case I want to use that data as well.

UN.data <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project2/UN_migrationStockByAg.csv",
                        skip=14)

UN.countryCodes <- read.csv("https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project2/UN_Country_codes.csv")


head(UN.data,5)

##   Sort.order Major.area..region..country.or.area.of.destination Notes
## 1         NA                                                         
## 2          1                                              WORLD      
## 3          2                                  Developed regions   (b)
## 4          3                                 Developing regions   (c)
## 5          4                          Least developed countries   (d)
##   Country.code Type.of.data..a.
## 1           NA                 
## 2          900                 
## 3          901                 
## 4          902                 
## 5          941                 
##   International.migrant.stock.at.mid.year.by.age..both.sexes.          X
## 1                                                         0-4        5-9
## 2                                                   5 444 102  6 705 750
## 3                                                   1 251 818  2 326 676
## 4                                                   4 192 284  4 379 074
## 5                                                   1 063 284  1 130 597
##          X.1        X.2         X.3         X.4         X.5         X.6
## 1      10-14      15-19       20-24       25-29       30-34       35-39
## 2  8 031 158  9 873 556  12 619 191  15 051 127  15 400 571  14 337 175
## 3  3 475 269  4 720 303   6 478 826   7 974 562   8 461 189   8 226 083
## 4  4 555 889  5 153 253   6 140 365   7 076 565   6 939 382   6 111 092
## 5  1 094 935  1 200 615   1 236 369   1 195 461     981 953     769 658
##           X.7         X.8        X.9       X.10       X.11       X.12
## 1       40-44       45-49      50-54      55-59      60-64      65-69
## 2  12 753 493  10 559 275  8 763 500  7 596 248  6 786 967  5 568 653
## 3   7 624 421   6 433 452  5 382 155  4 705 909  4 197 180  3 463 728
## 4   5 129 072   4 125 823  3 381 345  2 890 339  2 589 787  2 104 925
## 5     581 996     451 789    382 834    308 221    256 041    155 596
##         X.13       X.14         X.15
## 1      70-74        75+        Total
## 2  4 401 727  8 669 566  152 563 212
## 3  2 753 490  4 902 797   82 378 628
## 4  1 648 237  3 766 769   70 184 584
## 5    104 773    161 844   11 075 966
##   International.migrant.stock.at.mid.year.by.age..male.       X.16
## 1                                                   0-4        5-9
## 2                                             2 742 131  3 400 509
## 3                                               646 176  1 198 838
## 4                                             2 095 955  2 201 671
## 5                                               533 519    572 990
##         X.17       X.18       X.19       X.20       X.21       X.22
## 1      10-14      15-19      20-24      25-29      30-34      35-39
## 2  4 122 431  5 133 205  6 638 950  8 064 063  8 291 309  7 658 562
## 3  1 796 700  2 449 450  3 353 011  4 111 003  4 325 619  4 171 376
## 4  2 325 731  2 683 755  3 285 939  3 953 060  3 965 690  3 487 186
## 5    575 276    648 111    674 067    644 368    520 922    407 103
##         X.23       X.24       X.25       X.26       X.27       X.28
## 1      40-44      45-49      50-54      55-59      60-64      65-69
## 2  6 737 139  5 519 992  4 481 023  3 745 593  3 191 398  2 528 192
## 3  3 834 793  3 211 968  2 635 497  2 228 293  1 872 690  1 496 823
## 4  2 902 346  2 308 024  1 845 526  1 517 300  1 318 708  1 031 369
## 5    307 998    240 027    202 824    164 230    132 550     80 593
##         X.29       X.30        X.31
## 1      70-74        75+       Total
## 2  1 889 719  3 602 696  77 747 510
## 3  1 117 749  1 813 035  40 263 397
## 4    771 970  1 789 661  37 484 113
## 5     53 638     83 120   5 841 336
##   International.migrant.stock.at.mid.year.by.age..female.       X.32
## 1                                                     0-4        5-9
## 2                                               2 701 971  3 305 241
## 3                                                 605 642  1 127 838
## 4                                               2 096 329  2 177 403
## 5                                                 529 765    557 607
##         X.33       X.34       X.35       X.36       X.37       X.38
## 1      10-14      15-19      20-24      25-29      30-34      35-39
## 2  3 908 727  4 740 351  5 980 241  6 987 064  7 109 262  6 678 613
## 3  1 678 569  2 270 853  3 125 815  3 863 559  4 135 570  4 054 707
## 4  2 230 158  2 469 498  2 854 426  3 123 505  2 973 692  2 623 906
## 5    519 659    552 504    562 302    551 093    461 031    362 555
##         X.39       X.40       X.41       X.42       X.43       X.44
## 1      40-44      45-49      50-54      55-59      60-64      65-69
## 2  6 016 354  5 039 283  4 282 477  3 850 655  3 595 569  3 040 461
## 3  3 789 628  3 221 484  2 746 658  2 477 616  2 324 490  1 966 905
## 4  2 226 726  1 817 799  1 535 819  1 373 039  1 271 079  1 073 556
## 5    273 998    211 762    180 010    143 991    123 491     75 003
##         X.45       X.46        X.47
## 1      70-74        75+       Total
## 2  2 512 008  5 066 870  74 815 702
## 3  1 635 741  3 089 762  42 115 231
## 4    876 267  1 977 108  32 700 471
## 5     51 135     78 724   5 234 630

This data-set appears to need a whole bunch of work. I’m going to focus on the region,age and migration data which I wont be needing, then renaming the colums as appropriate.

colnames(UN.data)

##  [1] "Sort.order"                                                 
##  [2] "Major.area..region..country.or.area.of.destination"         
##  [3] "Notes"                                                      
##  [4] "Country.code"                                               
##  [5] "Type.of.data..a."                                           
##  [6] "International.migrant.stock.at.mid.year.by.age..both.sexes."
##  [7] "X"                                                          
##  [8] "X.1"                                                        
##  [9] "X.2"                                                        
## [10] "X.3"                                                        
## [11] "X.4"                                                        
## [12] "X.5"                                                        
## [13] "X.6"                                                        
## [14] "X.7"                                                        
## [15] "X.8"                                                        
## [16] "X.9"                                                        
## [17] "X.10"                                                       
## [18] "X.11"                                                       
## [19] "X.12"                                                       
## [20] "X.13"                                                       
## [21] "X.14"                                                       
## [22] "X.15"                                                       
## [23] "International.migrant.stock.at.mid.year.by.age..male."      
## [24] "X.16"                                                       
## [25] "X.17"                                                       
## [26] "X.18"                                                       
## [27] "X.19"                                                       
## [28] "X.20"                                                       
## [29] "X.21"                                                       
## [30] "X.22"                                                       
## [31] "X.23"                                                       
## [32] "X.24"                                                       
## [33] "X.25"                                                       
## [34] "X.26"                                                       
## [35] "X.27"                                                       
## [36] "X.28"                                                       
## [37] "X.29"                                                       
## [38] "X.30"                                                       
## [39] "X.31"                                                       
## [40] "International.migrant.stock.at.mid.year.by.age..female."    
## [41] "X.32"                                                       
## [42] "X.33"                                                       
## [43] "X.34"                                                       
## [44] "X.35"                                                       
## [45] "X.36"                                                       
## [46] "X.37"                                                       
## [47] "X.38"                                                       
## [48] "X.39"                                                       
## [49] "X.40"                                                       
## [50] "X.41"                                                       
## [51] "X.42"                                                       
## [52] "X.43"                                                       
## [53] "X.44"                                                       
## [54] "X.45"                                                       
## [55] "X.46"                                                       
## [56] "X.47"

so it looks like we want to get rid of columns 3, and 5

dropCols <- colnames(UN.data)[c(3,5)]
UN.data <- UN.data[ , !(names(UN.data) %in% dropCols)] 

head(colnames(UN.data),5)

## [1] "Sort.order"                                                 
## [2] "Major.area..region..country.or.area.of.destination"         
## [3] "Country.code"                                               
## [4] "International.migrant.stock.at.mid.year.by.age..both.sexes."
## [5] "X"

That appears to have worked. Now we want to rename the columns. As we see above, region-related data are contained in cols 1-4 and should be easy to manually rename. For the actual migration data, both sexes are in columns 4-20, males in 21-37 and femails in 38-54. I note also that column-header information that i’m looking for is actually contained in the first row. I’m going to try to rename the columns by pulling out the 1st row values and appending it with a gender-prefix (eg: “b0-4” would be both sexes age 0-4). Finally, we’ll drop the 1st row as it will no longer be of any use.

names(UN.data)[2] <- "region"

col.names <- as.matrix(UN.data[1, 4:20])

colnames(UN.data)[4:20] <- colnames(UN.data[4:20]) <- paste("b",col.names,sep = "_")
colnames(UN.data)[21:37] <- colnames(UN.data[21:37]) <- paste("m",col.names,sep = "_")
colnames(UN.data)[38:54] <- colnames(UN.data[38:54]) <- paste("f",col.names,sep = "_")

UN.data <- UN.data[-1, ] 

head(UN.data,5)

##   Sort.order                                                     region
## 2          1                                                      WORLD
## 3          2                                          Developed regions
## 4          3                                         Developing regions
## 5          4                                  Least developed countries
## 6          5 Less developed regions excluding least developed countries
##   Country.code      b_0-4      b_5-9    b_10-14   b_ 15-19     b_20-24
## 2          900  5 444 102  6 705 750  8 031 158  9 873 556  12 619 191
## 3          901  1 251 818  2 326 676  3 475 269  4 720 303   6 478 826
## 4          902  4 192 284  4 379 074  4 555 889  5 153 253   6 140 365
## 5          941  1 063 284  1 130 597  1 094 935  1 200 615   1 236 369
## 6          934  3 129 000  3 248 477  3 460 954  3 952 638   4 903 996
##       b_25-29     b_30-34     b_35-39     b_40-44     b_45-49    b_50-54
## 2  15 051 127  15 400 571  14 337 175  12 753 493  10 559 275  8 763 500
## 3   7 974 562   8 461 189   8 226 083   7 624 421   6 433 452  5 382 155
## 4   7 076 565   6 939 382   6 111 092   5 129 072   4 125 823  3 381 345
## 5   1 195 461     981 953     769 658     581 996     451 789    382 834
## 6   5 881 104   5 957 429   5 341 434   4 547 076   3 674 034  2 998 511
##      b_55-59    b_60-64    b_65-69    b_70-74      b_75+      b_Total
## 2  7 596 248  6 786 967  5 568 653  4 401 727  8 669 566  152 563 212
## 3  4 705 909  4 197 180  3 463 728  2 753 490  4 902 797   82 378 628
## 4  2 890 339  2 589 787  2 104 925  1 648 237  3 766 769   70 184 584
## 5    308 221    256 041    155 596    104 773    161 844   11 075 966
## 6  2 582 118  2 333 746  1 949 329  1 543 464  3 604 925   59 108 618
##        m_0-4      m_5-9    m_10-14   m_ 15-19    m_20-24    m_25-29
## 2  2 742 131  3 400 509  4 122 431  5 133 205  6 638 950  8 064 063
## 3    646 176  1 198 838  1 796 700  2 449 450  3 353 011  4 111 003
## 4  2 095 955  2 201 671  2 325 731  2 683 755  3 285 939  3 953 060
## 5    533 519    572 990    575 276    648 111    674 067    644 368
## 6  1 562 436  1 628 681  1 750 455  2 035 644  2 611 872  3 308 692
##      m_30-34    m_35-39    m_40-44    m_45-49    m_50-54    m_55-59
## 2  8 291 309  7 658 562  6 737 139  5 519 992  4 481 023  3 745 593
## 3  4 325 619  4 171 376  3 834 793  3 211 968  2 635 497  2 228 293
## 4  3 965 690  3 487 186  2 902 346  2 308 024  1 845 526  1 517 300
## 5    520 922    407 103    307 998    240 027    202 824    164 230
## 6  3 444 768  3 080 083  2 594 348  2 067 997  1 642 702  1 353 070
##      m_60-64    m_65-69    m_70-74      m_75+     m_Total      f_0-4
## 2  3 191 398  2 528 192  1 889 719  3 602 696  77 747 510  2 701 971
## 3  1 872 690  1 496 823  1 117 749  1 813 035  40 263 397    605 642
## 4  1 318 708  1 031 369    771 970  1 789 661  37 484 113  2 096 329
## 5    132 550     80 593     53 638     83 120   5 841 336    529 765
## 6  1 186 158    950 776    718 332  1 706 541  31 642 777  1 566 564
##        f_5-9    f_10-14   f_ 15-19    f_20-24    f_25-29    f_30-34
## 2  3 305 241  3 908 727  4 740 351  5 980 241  6 987 064  7 109 262
## 3  1 127 838  1 678 569  2 270 853  3 125 815  3 863 559  4 135 570
## 4  2 177 403  2 230 158  2 469 498  2 854 426  3 123 505  2 973 692
## 5    557 607    519 659    552 504    562 302    551 093    461 031
## 6  1 619 796  1 710 499  1 916 994  2 292 124  2 572 412  2 512 661
##      f_35-39    f_40-44    f_45-49    f_50-54    f_55-59    f_60-64
## 2  6 678 613  6 016 354  5 039 283  4 282 477  3 850 655  3 595 569
## 3  4 054 707  3 789 628  3 221 484  2 746 658  2 477 616  2 324 490
## 4  2 623 906  2 226 726  1 817 799  1 535 819  1 373 039  1 271 079
## 5    362 555    273 998    211 762    180 010    143 991    123 491
## 6  2 261 351  1 952 728  1 606 037  1 355 809  1 229 048  1 147 588
##      f_65-69    f_70-74      f_75+     f_Total
## 2  3 040 461  2 512 008  5 066 870  74 815 702
## 3  1 966 905  1 635 741  3 089 762  42 115 231
## 4  1 073 556    876 267  1 977 108  32 700 471
## 5     75 003     51 135     78 724   5 234 630
## 6    998 553    825 132  1 898 384  27 465 841

Now the data can be converted from “wide” to “long” and into a tidy format. Then we’ll separate the “group” column into “sex” and “age.range”.

Also… it is at this point where I have realized that my variable containing the “tidy” data in this case is actually named “UNtidy”… ugh.

UN.tidy <- UN.data %>% 
  gather(group,count,-Sort.order,-region,-Country.code)
    
UN.tidy <- separate(UN.tidy, group,c("sex","age.range"), sep="_")

head(UN.tidy,5)

##   Sort.order                                                     region
## 1          1                                                      WORLD
## 2          2                                          Developed regions
## 3          3                                         Developing regions
## 4          4                                  Least developed countries
## 5          5 Less developed regions excluding least developed countries
##   Country.code sex age.range      count
## 1          900   b       0-4  5 444 102
## 2          901   b       0-4  1 251 818
## 3          902   b       0-4  4 192 284
## 4          941   b       0-4  1 063 284
## 5          934   b       0-4  3 129 000

Looking better, but there are still a few things that we need to do to the data. We can delete all the rows where sex=b because anything in those rows can be computed as the sum of sex=m + sex=f. The same is true for age.range=Total - it’s the sum of all age-ranges for that specific country code, and thus can be easily computed, so it doesn’t represent individual observations, but the sum of individual observations.

UN.tidy <- UN.tidy %>% 
  filter(!(sex== "b")) %>% 
  filter(!(age.range == "Total"))

    
head(UN.tidy,5)

##   Sort.order                                                     region
## 1          1                                                      WORLD
## 2          2                                          Developed regions
## 3          3                                         Developing regions
## 4          4                                  Least developed countries
## 5          5 Less developed regions excluding least developed countries
##   Country.code sex age.range      count
## 1          900   m       0-4  2 742 131
## 2          901   m       0-4    646 176
## 3          902   m       0-4  2 095 955
## 4          941   m       0-4    533 519
## 5          934   m       0-4  1 562 436

And, finally, we’ll remove the whitespace from the “count” and convert it to a numeric type and also replace the blanks with NAs.

UN.tidy[UN.tidy == ".."] <- NA

#first we have to remove the spaces... 
UN.tidy$count <- gsub('\\s+', '', UN.tidy$count)

UN.tidy$count <- as.numeric(UN.tidy$count)

head(UN.tidy,5)

##   Sort.order                                                     region
## 1          1                                                      WORLD
## 2          2                                          Developed regions
## 3          3                                         Developing regions
## 4          4                                  Least developed countries
## 5          5 Less developed regions excluding least developed countries
##   Country.code sex age.range   count
## 1          900   m       0-4 2742131
## 2          901   m       0-4  646176
## 3          902   m       0-4 2095955
## 4          941   m       0-4  533519
## 5          934   m       0-4 1562436

I notice also that our “region” column has non-country entries contained in it, such as “WORLD” and “Developed regions”. This is a problem and could screw up any summing or averaging work by introducing a double-counting issue. We don’t have enough data in the original set to rectify this however, What we’re going to do is add some data from the “ANNEX” file to associate every country with a “Major.Area” and a “Region”. Then we can drop all non-country entries in the data and perform some analysis.

UN.countryCodes <- UN.countryCodes[,c(1,4,7)]
head(UN.countryCodes,5)

##   Country.code Major.area          Region
## 1            4       Asia   Southern Asia
## 2            8     Europe Southern Europe
## 3           12     Africa Northern Africa
## 4           16    Oceania       Polynesia
## 5           20     Europe Southern Europe

The data looks good, so now we’ll join and remove the non-country-specific rows:

UN.tidy <- left_join(UN.tidy, UN.countryCodes, by= c('Country.code' = 'Country.code')) 


UN.tidy <- UN.tidy %>% 
  filter(!is.na(Major.area))

head(UN.tidy,5)

##   Sort.order   region Country.code sex age.range count Major.area
## 1          9  Burundi          108   m       0-4  9574     Africa
## 2         10  Comoros          174   m       0-4   247     Africa
## 3         11 Djibouti          262   m       0-4  3543     Africa
## 4         12  Eritrea          232   m       0-4   385     Africa
## 5         13 Ethiopia          231   m       0-4 41272     Africa
##           Region
## 1 Eastern Africa
## 2 Eastern Africa
## 3 Eastern Africa
## 4 Eastern Africa
## 5 Eastern Africa

Now we have what appears to be complete and proper “tidy” data, so lets look at a few things.

migration.majorArea <- UN.tidy %>% 
  group_by(Major.area) %>% 
  summarise(migrants = sum(count,na.rm = TRUE))  %>%  
  ungroup() 

ggplot(migration.majorArea, aes(x = Major.area , y=migrants/sum(UN.tidy$count,na.rm=TRUE), fill=Major.area)) +
  geom_bar(stat="identity",position="dodge") + 
  xlab("Major Area") + 
  ylab("Migrant Proportion") +
  ggtitle("Migration by Major Areas") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  theme(plot.title = element_text(hjust = 0.5))

migration.majorAreaSex <- UN.tidy %>% 
  group_by(Major.area,sex) %>% 
  summarise(migrants = sum(count,na.rm = TRUE))  %>%  
  ungroup() 

ggplot(migration.majorAreaSex, aes(x = Major.area , y=migrants/sum(UN.tidy$count,na.rm=TRUE), fill=sex)) +
  geom_bar(stat="identity",position="dodge") + 
  xlab("Major Area") + 
  ylab("Migrant Proportion") +
  ggtitle("Migration by Major Areas & Sex") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  theme(plot.title = element_text(hjust = 0.5))

Migration is dominated by Asia, Europe and North America (about 80%). This makes sense as these are the most populus places. There does not appear to be huge disparity by sex, but only small differences in the proportions with Asia showing the largest migration.

migration.age <- UN.tidy %>% 
  group_by(age.range) %>% 
  summarise(migrants = sum(count,na.rm = TRUE))  %>%  
  ungroup() 


ggplot(migration.age, aes(x = age.range, y=migrants/sum(UN.tidy$count,na.rm=TRUE), fill=age.range)) +
  geom_bar(stat="identity",position="dodge") + 
  xlab("Age") + 
  ylab("Migrant Proportion") +
  ggtitle("Migration by Age") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  theme(plot.title = element_text(hjust = 0.5))

migration.ageSex <- UN.tidy %>% 
  group_by(age.range,sex) %>% 
  summarise(migrants = sum(count,na.rm = TRUE))  %>%  
  ungroup() 


ggplot(migration.ageSex, aes(x = age.range , y=migrants/sum(UN.tidy$count,na.rm=TRUE), fill=sex)) +
  geom_bar(stat="identity",position="dodge") + 
  xlab("Age") + 
  ylab("Migrant Proportion") +
  ggtitle("Migration by Age and Sex") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  theme(plot.title = element_text(hjust = 0.5))

If we look at migration by age, we see a hump between ages 20 and 50. Interestingly, when we break it down by sex, we see that in this high-density region, men are signigicantly more likley migrants, but women are vastly over-represented in the 75+ category. It’s unclear why this may be, but potentially, men move for work and maybe women move to be with family when they reach an older age.

Data Set 4 - Tax data (from a PDF)

My original intention for this project was to parse and tidy a .pdf document, but initial attempt was unsuccesssful. Given the extra time, however, I was able to get it to work, so here goes

library(tabulizer)

tax.data <- extract_tables("https://github.com/plb2018/DATA607/raw/master/DATA_607_Project2/canada_tax.pdf")

typeof(tax.data[1])

## [1] "list"

typeof(tax.data[[1]])

## [1] "character"

I used a package called Tabulizer. The parsing seem to have worked but it appears to have been returned as nested lists - a list for each page of the pdf. Let’s take a look at what we’ve got:

tax.data[[1]][1:5,1:5]

##      [,1]                       
## [1,] ""                         
## [2,] ""                         
## [3,] "Province or Territory"    
## [4,] ""                         
## [5,] "Newfoundland and Labrador"
##      [,2]                                       [,3]  [,4]        [,5]    
## [1,] ""                                         ""    "Less than" ""      
## [2,] "Grand Total of GST/HST Credit Recipients" ""    ""          ""      
## [3,] ""                                         ""    "$5,000"    ""      
## [4,] "(#) ($)"                                  "(#)" ""          "($)"   
## [5,] "160,300 $65,756"                          ""    "26,410"    "$7,044"

This looks pretty messy in it’s current state, so let’s see if we can tidy it up a bit. Firstly, i note that a few of the columns contain data for multiple salary-ranges. We’ll re-arrange the data to deal with this:

tax.p1 <- data.frame(tax.data[[1]],stringsAsFactors = FALSE)
tax.p2 <- data.frame(tax.data[[2]],stringsAsFactors = FALSE)

#put the data into proper "wide" format
tax.p2[14:28] <- tax.p2[19:nrow(tax.p2) , ]

#we'll now remove the extra data... and the "total" column while we're at it.
tax.p2 <- tax.p2[1:17, ]

head(tax.p2)

##                          X1  X2         X3  X4     X5  X6         X7  X8
## 1                               $40,000 to                $45,000 to    
## 2     Province or Territory        $44,999                   $49,999    
## 3                           (#)            ($)        (#)            ($)
## 4 Newfoundland and Labrador          3,980     $1,274            550    
## 5      Prince Edward Island          1,260       $422            290    
## 6               Nova Scotia          7,530     $2,501          1,280    
##     X9 X10        X11 X12 X13                      X1.1 X2.1       X3.1
## 1          $50,000 to                                        $55,000 to
## 2             $54,999             Province or Territory         $59,999
## 3      (#)            ($)                                (#)           
## 4 $169             60     $18 Newfoundland and Labrador                
## 5 $102             50     $14      Prince Edward Island              10
## 6 $406            190     $57               Nova Scotia              20
##   X4.1 X5.1 X6.1     X7.1 X8.1 X9.1 X10.1 X11.1 X12.1 X13.1
## 1                 $60,000                                  
## 2                and over                                  
## 3  ($)       (#)           ($)                             
## 4                                                          
## 5        $4                                                
## 6        $7                                                
##                        X1.2 X2.2
## 1                               
## 2     Province or Territory     
## 3                            (#)
## 4 Newfoundland and Labrador     
## 5      Prince Edward Island     
## 6               Nova Scotia

Now we’re going to rename the columns and drop everything that we don’t need. The column names are contained in the first few rows. We’ll paste them together where required. Note that I’m only going to focus on page 2 of the document from here on for the sake of simplicity.

colnames(tax.p2) <- paste(tax.p2[1, ],tax.p2[2, ],sep = " ")

head(tax.p2)

##       Province or Territory     $40,000 to $44,999               
## 1                                       $40,000 to               
## 2     Province or Territory                $44,999               
## 3                           (#)                    ($)        (#)
## 4 Newfoundland and Labrador                  3,980     $1,274    
## 5      Prince Edward Island                  1,260       $422    
## 6               Nova Scotia                  7,530     $2,501    
##   $45,000 to $49,999              $50,000 to $54,999        
## 1         $45,000 to                      $50,000 to        
## 2            $49,999                         $54,999        
## 3                    ($)      (#)                    ($)    
## 4                550     $169                     60     $18
## 5                290     $102                     50     $14
## 6              1,280     $406                    190     $57
##       Province or Territory     $55,000 to $59,999           
## 1                                       $55,000 to           
## 2     Province or Territory                $59,999           
## 3                           (#)                    ($)    (#)
## 4 Newfoundland and Labrador                                  
## 5      Prince Edward Island                     10     $4    
## 6               Nova Scotia                     20     $7    
##   $60,000 and over                   Province or Territory    
## 1          $60,000                                            
## 2         and over                   Province or Territory    
## 3                  ($)                                     (#)
## 4                                Newfoundland and Labrador    
## 5                                     Prince Edward Island    
## 6                                              Nova Scotia

#drop all the columns that we dont need
tax.p2 <- tax.p2[-14]
tax.p2 <- tax.p2[seq(1,28,2)]
tax.p2 <- tax.p2[-c(12:14)]

head(tax.p2)

##       Province or Territory $40,000 to $44,999     .2 $45,000 to $49,999
## 1                                   $40,000 to                $45,000 to
## 2     Province or Territory            $44,999                   $49,999
## 3                                                                       
## 4 Newfoundland and Labrador              3,980 $1,274                550
## 5      Prince Edward Island              1,260   $422                290
## 6               Nova Scotia              7,530 $2,501              1,280
##     .5 $50,000 to $54,999  .8 $55,000 to $59,999  .11 $60,000 and over
## 1              $50,000 to             $55,000 to               $60,000
## 2                 $54,999                $59,999              and over
## 3                                                                     
## 4 $169                 60 $18                                         
## 5 $102                 50 $14                 10   $4                 
## 6 $406                190 $57                 20   $7                 
##    .14
## 1     
## 2     
## 3     
## 4     
## 5     
## 6

Now some of the columns are properly named. Also, we no longer need the first few rows of the data, so we’ll drop them.

tax.p2 <- tax.p2[-1:-3,]

head(tax.p2,5)

##       Province or Territory $40,000 to $44,999      .2 $45,000 to $49,999
## 4 Newfoundland and Labrador              3,980  $1,274                550
## 5      Prince Edward Island              1,260    $422                290
## 6               Nova Scotia              7,530  $2,501              1,280
## 7             New Brunswick              6,670  $2,243              1,240
## 8                    Quebec             73,060 $25,216             15,330
##       .5 $50,000 to $54,999   .8 $55,000 to $59,999  .11 $60,000 and over
## 4   $169                 60  $18                                         
## 5   $102                 50  $14                 10   $4                 
## 6   $406                190  $57                 20   $7                 
## 7   $395                200  $59                 20   $4                 
## 8 $5,039              2,390 $662                170  $51               30
##    .14
## 4     
## 5     
## 6     
## 7     
## 8   $8

The data is starting to look much better. We can see that we need to rename every second column. We’ll basically copy previous column names to the right, and append a “_dollars" suffix which will help us out in the next steps.

colnames(tax.p2)[seq(3,11,2)] <- paste(colnames(tax.p2[seq(2,10,2)]),"_dollars",sep = " ")

colnames(tax.p2)[1] <- "Region"

head(tax.p2,5)

##                      Region $40,000 to $44,999 $40,000 to $44,999 _dollars
## 4 Newfoundland and Labrador              3,980                      $1,274
## 5      Prince Edward Island              1,260                        $422
## 6               Nova Scotia              7,530                      $2,501
## 7             New Brunswick              6,670                      $2,243
## 8                    Quebec             73,060                     $25,216
##   $45,000 to $49,999 $45,000 to $49,999 _dollars $50,000 to $54,999
## 4                550                        $169                 60
## 5                290                        $102                 50
## 6              1,280                        $406                190
## 7              1,240                        $395                200
## 8             15,330                      $5,039              2,390
##   $50,000 to $54,999 _dollars $55,000 to $59,999
## 4                         $18                   
## 5                         $14                 10
## 6                         $57                 20
## 7                         $59                 20
## 8                        $662                170
##   $55,000 to $59,999 _dollars $60,000 and over $60,000 and over _dollars
## 4                                                                       
## 5                          $4                                           
## 6                          $7                                           
## 7                          $4                                           
## 8                         $51               30                        $8

We’re now ready to tidy the data. We’re going to use gather(), and then separate() the “range” column on the “_" we added in a previous step. This will cause all rows with the “_dollar" appendix to have corresponding “dollar” entries in the “value” column, whereas we’ll be left with NAs for all rows where “range” didn’t have a suffix - we’ll replace those NAs with “count”.

tax.tidy <- gather(tax.p2,range,quantity,-Region) %>%
  separate(range,c("range","value"), sep="_")  

tax.tidy$value[is.na(tax.tidy$value)] <- "count"


head(tax.tidy,5)

##                      Region              range value quantity
## 1 Newfoundland and Labrador $40,000 to $44,999 count    3,980
## 2      Prince Edward Island $40,000 to $44,999 count    1,260
## 3               Nova Scotia $40,000 to $44,999 count    7,530
## 4             New Brunswick $40,000 to $44,999 count    6,670
## 5                    Quebec $40,000 to $44,999 count   73,060

We’re almost done. One final thing that we’re going to do is clean up the quantity column and make it numeric so that we can perform some analysis.

tax.tidy$quantity <- as.numeric(gsub('\\$|,', '', tax.tidy$quantity))

head(tax.tidy,5)

##                      Region              range value quantity
## 1 Newfoundland and Labrador $40,000 to $44,999 count     3980
## 2      Prince Edward Island $40,000 to $44,999 count     1260
## 3               Nova Scotia $40,000 to $44,999 count     7530
## 4             New Brunswick $40,000 to $44,999 count     6670
## 5                    Quebec $40,000 to $44,999 count    73060

The data is now tidy and we can do a few quick plots:

tax.range <- tax.tidy %>% 
  filter(value=="count") %>% 
  group_by(Region,range) %>% 
  summarise(quantity = mean(quantity,na.rm = TRUE))  %>%  
  ungroup() 


ggplot(tax.range, aes(x = range, y=quantity,group=Region,color=Region)) +
  geom_line() +
  scale_y_log10()+ 
  xlab("Income Range") + 
  ylab("Number of Recipients") +
  ggtitle("# of Tax Credit Recipients by Income and Region") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  theme(plot.title = element_text(hjust = 0.5))

tax.rangeD <- tax.tidy %>% 
  filter(value=="dollars") %>% 
  group_by(Region,range) %>% 
  summarise(quantity = mean(quantity,na.rm = TRUE))  %>%  
  ungroup() 


ggplot(tax.rangeD, aes(x = range, y=quantity,group=Region,color=Region)) +
  geom_line() +
  scale_y_log10()+ 
  xlab("Income Range") + 
  ylab("Tax Credits Collected") +
  ggtitle("$ Amt of Tax Credit Collected by Income and Region") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  theme(plot.title = element_text(hjust = 0.5))

We can see that tax credits, in both number of recipients, and amount tend to decline as income range increases. This finding appears to be true in all cases.

DATA 607 - Project 2