Project 2

By Brian Weinfeld

March 5th, 2018

For my first data set, I used the set provided by Rose. This data set identifies a series of regions in the United States and, for each day in January 2018, shows the amount of elecriticy generated, the amount consumed, and the difference between these values.

raw.data <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/us_daily_electric_system_operating_data.csv', 
                     skip=4)
kable(raw.data[, 1:5] %>% head(10))
megawatthours 01/01/2018 01/02/2018 01/03/2018 01/04/2018
California (region) NA NA NA NA
Demand 645599 704074 727216 714305
Net generation 463629 502312 504456 505148
Total net actual interchange -179525 -201654 -222762 -209163
Carolinas (region) NA NA NA NA
Demand 847133 938726 948954 901939
Net generation 802908 914879 920055 866924
Total net actual interchange -43153 -22922 -29015 -34530
Central (region) NA NA NA NA
Demand 890403 916535 861352 847712

In order to tidy this data, I would first need to extract the locations present in every fourth row and create a new column in the original data set in which to store the information. Annoyingly, one of the location names features a comma and thus half the name is placed in the 01/01/2018 column. I developed a generic solution to this issue in case it came up in a bigger data set.

locations <- raw.data[seq(1, length(raw.data), 4), 1:2] %>% 
  replace_na(list(`01/01/2018`='')) %>%
  unite(location, 1, 2, sep='') %>%
  select(1) %>%
  unlist()
kable(locations)
x
location1 California (region)
location2 Carolinas (region)
location3 Central (region)
location4 Electric Reliability Council of TexasInc. (region)
location5 Florida (region)
location6 New England ISO (region)
location7 Mid-Atlantic (region)
location8 Midwest (region)

With all the necessary information, I tidied up the data and made a few additional changes to make this particular data set easier to work with. That included changing the names of some columns and splitting the date into a month, day and year column.

tidy.data <- raw.data[which(1:length(raw.data) %% 4 != 1), ] %>% 
  mutate(location=locations %>% 
              rep(each=3)
            ) %>%
  gather(date, mega.watt.hours, 2:32) %>%
  spread(megawatthours, mega.watt.hours) %>%
  rename(generated=`Net generation`, net=`Total net actual interchange`, demand=Demand) %>%
  separate(date, c('month', 'day', 'year'))

tidy.data[, 2:7] %<>%
  map(~as.numeric(.))
kable(tidy.data %>% head(10))
location month day year demand generated net
California (region) 1 1 2018 645599 463629 -179525
California (region) 1 2 2018 704074 502312 -201654
California (region) 1 3 2018 727216 504456 -222762
California (region) 1 4 2018 714305 505148 -209163
California (region) 1 5 2018 712021 484644 -213905
California (region) 1 6 2018 668299 447821 -219968
California (region) 1 7 2018 655802 447952 -207861
California (region) 1 8 2018 731881 500797 -237089
California (region) 1 9 2018 729380 493884 -246264
California (region) 1 10 2018 716698 482843 -234361

I plotted the proportion of energy generated compared to demand for each day and location. Any points above 1.0 indicated excess or wasted energy and any points beneath 1.0 indicate that not enough energy was genered.

tidy.data %>%
  ggplot(aes(x=day, y= generated / demand, color=location)) + 
  geom_point() +
  geom_smooth(se=FALSE) + 
  scale_color_brewer(palette='Set1') +
  theme(legend.position='bottom') +
  guides(color=guide_legend(nrow=3, override.aes=list(size=2))) +
  labs(x='Day in January 2018', 
       y='Proportion of Required Energy Generated',
       color='Location',
       title='California and New England Not Generating Enough Electricity to Meet Need'
       )

California and New England are consistantly not producing enough energy to meet the needs of their populations. Every other measured locations is hovering near the required amount.

For my second data set, I selected the world populations csv file discussed by Steven. The file contains a large number of countries and their populations in millions for each year from 1980 through 2010.

raw.data <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/populationbycountry19802010millions.csv', 
                     na=c('NA'))
kable(raw.data[, 1:5] %>% head())
X1 1980 1981 1982 1983
North America 320.27638 324.44694 328.62014 332.72487
Bermuda 0.05473 0.05491 0.05517 0.05551
Canada 24.5933 24.9 25.2019 25.4563
Greenland 0.05021 0.05103 0.05166 0.05211
Mexico 68.34748 69.96926 71.6409 73.36288
Saint Pierre and Miquelon 0.00599 0.00601 0.00605 0.00607

In order to make the data tidy I gathered all the year columns and put them in a single column and then arranged the data alphabetically by country. Finally, I wrote the file out for future use.

raw.data %<>%
  rename(country=X1) %>%
  gather(year, population, 2:32) %>%
  arrange(country) %T>%
  write.csv('C:\\Users\\Brian\\Desktop\\GradClasses\\Spring18\\607\\607project2\\population_answer.csv')
kable(raw.data %>% head())
country year population
Afghanistan 1980 15.0436
Afghanistan 1981 13.67368
Afghanistan 1982 12.57743
Afghanistan 1983 12.43058
Afghanistan 1984 12.75384
Afghanistan 1985 13.09371

Working off of Steven’s suggestion about the prevelance of missing data, I decided to highlight the countries with the most missing data. I created a subset that contained all the missing data in the original raw data and read in another csv file I found online that contained a list of countries and their longitude and latitude.

missing.data <- raw.data %>%
  subset(population == '--' | is.na(population))

country.locations <- read.csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/country_locations.csv', stringsAsFactors=FALSE)
kable(country.locations %>% head())
country latitude longitude name
AD 42.54624 1.601554 Andorra
AE 23.42408 53.847818 United Arab Emirates
AF 33.93911 67.709953 Afghanistan
AG 17.06082 -61.796428 Antigua and Barbuda
AI 18.22055 -63.068615 Anguilla
AL 41.15333 20.168331 Albania

I created data to plot by counting the number of missing entries and joining that information with the country locations. I created another column to dictate the size of the circles on the plot named mod.pop.

plotting.data <- missing.data %>%
  count(country, population) %>%
  inner_join(country.locations, by=c('country'='name')) %>%
  mutate(mod.pop = sigmoid((n / max(n)))*5)
kable(plotting.data %>% head())
country population n country.y latitude longitude mod.pop
Antarctica NA 31 AQ -75.25097 -0.071389 3.655293
Armenia 12 AM 40.06910 45.038189 2.977918
Aruba 6 AW 12.52111 -69.968338 2.741183
Azerbaijan 12 AZ 40.14310 47.576927 2.977918
Belarus 12 BY 53.70981 27.953389 2.977918
Bosnia and Herzegovina 12 BA 43.91589 17.679076 2.977918

Finally, I plotted the data. I used hollow cirlces for NA data and solid circles for missing (dash-dash) data.

ggplot(plotting.data) + 
  borders('world', color='gray50', fill='gray50') + 
  geom_point(aes(x=longitude, y=latitude, color=n), 
             shape=ifelse(is.na(plotting.data$population), 21, 16),
             size = plotting.data$mod.pop
             ) +
  geom_label_repel(aes(x=longitude, y=latitude, label=country)) +
  labs(x='Longitude', 
     y='Latitude',
     color='Missing Data Count',
     title='Eastern European Countries are Missing Most Data'
  )

The graph shows that most of the missing data comes from eastern Europe. It should be noted that some missing elements were removed by the inner join on country locations as those countries were not part of the data set. The most common reason is that the countries no longger exist. For example, ‘Former USSR’ was listed as a country in the original data set.’

For my last data set, I’m using UN data on migrant data rates per country in 2010. This is my data set. The data set has numerous changes that need to be made to it. The following issues exist for this data set:

example <- read.csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/UN_MigrantStockByAge_2015.csv', 
                     stringsAsFactors=FALSE, header=FALSE, na.strings='..', encoding='utf-8')
kable(example[15:27, 1:6])
V1 V2 V3 V4 V5 V6
15 Sort
order Major area , region, country or area of destination Notes Co untry co de Type of da ta (a) Internatio nal migrant stock at mid-year by age (both sexes)
16 0-4
17 1 WORLD 900 5 444 102
18 2 Developed regions (b) 901 1 251 818
19 3 Developing regions (c) 902 4 192 284
20 4 Least developed countries (d) 941 1 063 284
21 5 Less developed regions excluding least developed countries 934 3 129 000
22 6 Sub-Saharan Africa (e) 947 1 305 579
23 7 Africa 903 1 381 650
24 8 Eastern Africa 910 547 697
25 9 Burundi 108 B R 19 599
26 10 Comoros 174 B 600
27 11 Djibouti 262 B R 5 435

I began by reading in the data and removing the unneeded rows and columns. I set the column names using the information stored in row 1 and then removed that row. Finally, I omitted all NA pieces of data. Part of the naming sceme involved adding an ‘m’ or ‘f’ to the beginning of the columns for males and females respectively. This will eventually be used to split the data and is used her to differentiate between the two columns.

raw.data <- read.csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/UN_MigrantStockByAge_2015.csv', 
                     stringsAsFactors=FALSE, header=FALSE, na.strings='..', encoding='utf-8') %>%
  .[-c(1:15, 17:22), -c(1, 3:4, 6:22, 39, 56)] %>%
  setNames(c('country', 'code', paste('m', .[1, 3:18]), paste('f', .[1, 19:34])) %>% unlist()) %>%
  .[-1, ] %>%
  na.omit()
kable(raw.data[1:10, 1:10])
country code m 0-4 m 5-9 m 10-14 m 15-19 m 20-24 m 25-29 m 30-34 m 35-39
23 Africa 676 971 697 939 654 031 740 174 871 195 961 607 859 150 712 199
24 Eastern Africa 267 929 269 020 255 309 289 807 310 260 326 438 297 497 246 256
25 Burundi B R 9 574 8 651 8 288 9 030 12 316 18 285 23 164 20 116
26 Comoros B 247 352 446 516 577 758 821 777
27 Djibouti B R 3 543 4 438 6 463 8 560 9 115 8 962 6 794 4 737
28 Eritrea I 385 335 437 612 801 884 787 623
29 Ethiopia B R 41 272 44 185 54 443 71 389 77 813 74 627 61 129 46 834
30 Kenya B R 9 432 11 378 14 750 22 655 17 884 17 509 14 519 11 915
31 Madagascar C 467 500 475 665 951 893 779 765
32 Malawi B R 76 753 65 246 59 561 64 862 58 236 51 855 41 527 32 424

The next step was the remove the header and subheader rows so that they could be applied to each row that contained a country. I grabbed all the rows that had no code in the code column (noticing that only countries had codes) and then seperated them based on whether they were a major grouping or a minor grouping. The major grouping locations had to be selected by hand. I then created a pairing of all major locations and minor locations.

all.groupings <- raw.data %>% 
  subset(code == '', select=c('country')) 

major.group.locs <- c(1, 7, 13, 18, 23)
major.group <- all.groupings[major.group.locs, ]
sub.group <- all.groupings[-major.group.locs, ]

all.groupings %<>% mutate(major.group = replace(.$country, !.$country %in% major.group, NA) %>% na.locf(), 
                          sub.group = replace(.$country, !.$country %in% sub.group, NA)
                         ) %>%
  subset(!is.na(sub.group), -1)
kable(all.groupings %>% head(10))
major.group sub.group
2 Africa Eastern Africa
3 Africa Middle Africa
4 Africa Northern Africa
5 Africa Southern Africa
6 Africa Western Africa
8 Asia Central Asia
9 Asia Eastern Asia
10 Asia South-Eastern Asia
11 Asia Southern Asia
12 Asia Western Asia

The all.groupings data contained every combination of major and minor groups and they are in the same order that they need to appear in the final data frame. However, each row appears some unique number of times in the final data set. For example, “Africa - Eastern Africa” represents the first 19 countries while “Africa - Middle Africa” represents the next 9. I decided to create what I called a mapping. This is a list that is the same length as the raw.data but with each row in all.groupings repeated the necessary number of times so that it aligns with the raw data.

I accommplished this by again searching for empty codes. This would indicate a new subheading and thus a transition of grouping. If there was a code, then the same grouping was repeated.

raw.data %<>% subset(!.$country %in% major.group) %>% .[-1, ]

country.mapping <- data.frame(group=rep(NA, nrow(raw.data)), subgroup=rep(NA, nrow(raw.data)))
for(i in 1:nrow(raw.data)){
  if(raw.data$code[i] == ''){
    all.groupings %<>% .[-1, ]
  }else{
    country.mapping[i, ] <- all.groupings[1, ]
  }
}

The end result of this mapping is that I could simply cbind the mapping with the raw data to create my answer data set. At the same time I also removed the now unneeded empty code rows and gave an id number column that would become necessary soon.

answer <- cbind(country.mapping, raw.data) %>%
  subset(code != '') %>% 
  rowid_to_column('id')
kable(answer[1:10, 1:7])
id group subgroup country code m 0-4 m 5-9
1 Africa Eastern Africa Burundi B R 9 574 8 651
2 Africa Eastern Africa Comoros B 247 352
3 Africa Eastern Africa Djibouti B R 3 543 4 438
4 Africa Eastern Africa Eritrea I 385 335
5 Africa Eastern Africa Ethiopia B R 41 272 44 185
6 Africa Eastern Africa Kenya B R 9 432 11 378
7 Africa Eastern Africa Madagascar C 467 500
8 Africa Eastern Africa Malawi B R 76 753 65 246
9 Africa Eastern Africa Mauritius C 155 187
10 Africa Eastern Africa Mayotte B 455 440

The data in this table really should be split into two. Once the data is made tidy, this would result in repeating the id, group, subgroup and country over a dozen times for each country. I seperated the first four columns out into their own data frame.

country.id.table <- answer[, 1:4]
kable(country.id.table %>% head())
id group subgroup country
1 Africa Eastern Africa Burundi
2 Africa Eastern Africa Comoros
3 Africa Eastern Africa Djibouti
4 Africa Eastern Africa Eritrea
5 Africa Eastern Africa Ethiopia
6 Africa Eastern Africa Kenya

I completed tidying the data by removing the unneeded columns and then gathering all the columns representing different migrant rates. I split the column into two seperating the gener and age ranges. I then spread the genders back out. Finally, I changed all the migrant rates to numerics.

migration.data <- answer[, -c(2:5)] %>%  
  gather(category, quantity, 2:33) %>%
  separate(category, c('gender', 'age'), ' ') %>%
  transform(age = ifelse(age == '5-9', '05-09', age)) %>%
  spread(gender, quantity)%>%
  arrange(id)

migration.data[, 3:4] %<>% purrr::map(~str_replace_all(., '\\s', '') %>% as.numeric())
kable(migration.data %>% head(10))
id age f m
1 0-4 10025 9574
1 05-09 9359 8651
1 10-14 9207 8288
1 15-19 10560 9030
1 20-24 15134 12316
1 25-29 21456 18285
1 30-34 23746 23164
1 35-39 18945 20116
1 40-44 14279 15031
1 45-49 10325 10537

The data is now successfully tidy, spread across two data frames. The first contains a list of all major groupings, minor groupings and country. The other contains male and female migrant rates divided by age group.

I decided to examine which countries had the highest migrant rates. To do this, I needed the populations of each country in the year 2010. Thankfully, the first data set I tidied up for this project contained such information. I read in that tidy data along with the country locations information I used with that data set. I joined all four data frames together in order to create a frame that contained each countries migrant population as a proportion of their total population. The data was then ordered from highest to lowest and the top 10 were subset.

populations <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/population_answer.csv') %>%
  subset(year == 2010, select=-1) %>%
  mutate(population = population %>% as.numeric() * 1000000) %>%
  na.omit()

country.locations <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/country_locations.csv')

combined <- inner_join(country.id.table, populations, by='country') %>%
  inner_join(migration.data, by='id') %>%
  group_by(country) %>%
  summarise(total = (sum(f) + sum(m)) / sum(population)) %>%
  arrange(desc(total)) %>%
  .[1:10, ] %>%
  inner_join(country.locations, by=c('country'='name'))
kable(combined[, c(1, 4:5, 2)])
country latitude longitude total
Montserrat 16.742498 -62.187366 0.0245728
Guam 13.444304 144.793731 0.0241040
Kuwait 29.311660 47.481766 0.0240754
Qatar 25.354826 51.183884 0.0230216
French Guiana 3.933889 -53.125782 0.0207817
American Samoa -14.270972 -170.132217 0.0200239
Nauru -0.522778 166.931503 0.0189792
Gibraltar 36.137741 -5.345374 0.0188474
Estonia 58.595272 25.013607 0.0184908
Latvia 56.879635 24.603189 0.0182038

Finally, I plotted the data on a world map for viewing.

ggplot(combined) +
  borders('world', color='gray50', fill='gray50') + 
  geom_point(aes(x=longitude, y=latitude, color=country), size=4) +
  geom_label_repel(aes(x=longitude, y=latitude, label=country)) +
  scale_color_brewer(palette='Set3')  +
  labs(x='Longitude', 
     y='Latitude',
     color='Country',
     title='Smaller Countries have Higher Migrant Rates'
  )

The top 10 countries with the highest migrant rates are displayed on the above world map. The only link they appear to have is that they are generally from countries with small populations. It makes sense that smaller countries would have higher rates while having overall lower number of migrants.