For my first data set, I used the set provided by Rose. This data set identifies a series of regions in the United States and, for each day in January 2018, shows the amount of elecriticy generated, the amount consumed, and the difference between these values.
raw.data <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/us_daily_electric_system_operating_data.csv',
skip=4)
kable(raw.data[, 1:5] %>% head(10))
| megawatthours | 01/01/2018 | 01/02/2018 | 01/03/2018 | 01/04/2018 |
|---|---|---|---|---|
| California (region) | NA | NA | NA | NA |
| Demand | 645599 | 704074 | 727216 | 714305 |
| Net generation | 463629 | 502312 | 504456 | 505148 |
| Total net actual interchange | -179525 | -201654 | -222762 | -209163 |
| Carolinas (region) | NA | NA | NA | NA |
| Demand | 847133 | 938726 | 948954 | 901939 |
| Net generation | 802908 | 914879 | 920055 | 866924 |
| Total net actual interchange | -43153 | -22922 | -29015 | -34530 |
| Central (region) | NA | NA | NA | NA |
| Demand | 890403 | 916535 | 861352 | 847712 |
In order to tidy this data, I would first need to extract the locations present in every fourth row and create a new column in the original data set in which to store the information. Annoyingly, one of the location names features a comma and thus half the name is placed in the 01/01/2018 column. I developed a generic solution to this issue in case it came up in a bigger data set.
locations <- raw.data[seq(1, length(raw.data), 4), 1:2] %>%
replace_na(list(`01/01/2018`='')) %>%
unite(location, 1, 2, sep='') %>%
select(1) %>%
unlist()
kable(locations)
| x | |
|---|---|
| location1 | California (region) |
| location2 | Carolinas (region) |
| location3 | Central (region) |
| location4 | Electric Reliability Council of TexasInc. (region) |
| location5 | Florida (region) |
| location6 | New England ISO (region) |
| location7 | Mid-Atlantic (region) |
| location8 | Midwest (region) |
With all the necessary information, I tidied up the data and made a few additional changes to make this particular data set easier to work with. That included changing the names of some columns and splitting the date into a month, day and year column.
tidy.data <- raw.data[which(1:length(raw.data) %% 4 != 1), ] %>%
mutate(location=locations %>%
rep(each=3)
) %>%
gather(date, mega.watt.hours, 2:32) %>%
spread(megawatthours, mega.watt.hours) %>%
rename(generated=`Net generation`, net=`Total net actual interchange`, demand=Demand) %>%
separate(date, c('month', 'day', 'year'))
tidy.data[, 2:7] %<>%
map(~as.numeric(.))
kable(tidy.data %>% head(10))
| location | month | day | year | demand | generated | net |
|---|---|---|---|---|---|---|
| California (region) | 1 | 1 | 2018 | 645599 | 463629 | -179525 |
| California (region) | 1 | 2 | 2018 | 704074 | 502312 | -201654 |
| California (region) | 1 | 3 | 2018 | 727216 | 504456 | -222762 |
| California (region) | 1 | 4 | 2018 | 714305 | 505148 | -209163 |
| California (region) | 1 | 5 | 2018 | 712021 | 484644 | -213905 |
| California (region) | 1 | 6 | 2018 | 668299 | 447821 | -219968 |
| California (region) | 1 | 7 | 2018 | 655802 | 447952 | -207861 |
| California (region) | 1 | 8 | 2018 | 731881 | 500797 | -237089 |
| California (region) | 1 | 9 | 2018 | 729380 | 493884 | -246264 |
| California (region) | 1 | 10 | 2018 | 716698 | 482843 | -234361 |
I plotted the proportion of energy generated compared to demand for each day and location. Any points above 1.0 indicated excess or wasted energy and any points beneath 1.0 indicate that not enough energy was genered.
tidy.data %>%
ggplot(aes(x=day, y= generated / demand, color=location)) +
geom_point() +
geom_smooth(se=FALSE) +
scale_color_brewer(palette='Set1') +
theme(legend.position='bottom') +
guides(color=guide_legend(nrow=3, override.aes=list(size=2))) +
labs(x='Day in January 2018',
y='Proportion of Required Energy Generated',
color='Location',
title='California and New England Not Generating Enough Electricity to Meet Need'
)
California and New England are consistantly not producing enough energy to meet the needs of their populations. Every other measured locations is hovering near the required amount.
For my second data set, I selected the world populations csv file discussed by Steven. The file contains a large number of countries and their populations in millions for each year from 1980 through 2010.
raw.data <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/populationbycountry19802010millions.csv',
na=c('NA'))
kable(raw.data[, 1:5] %>% head())
| X1 | 1980 | 1981 | 1982 | 1983 |
|---|---|---|---|---|
| North America | 320.27638 | 324.44694 | 328.62014 | 332.72487 |
| Bermuda | 0.05473 | 0.05491 | 0.05517 | 0.05551 |
| Canada | 24.5933 | 24.9 | 25.2019 | 25.4563 |
| Greenland | 0.05021 | 0.05103 | 0.05166 | 0.05211 |
| Mexico | 68.34748 | 69.96926 | 71.6409 | 73.36288 |
| Saint Pierre and Miquelon | 0.00599 | 0.00601 | 0.00605 | 0.00607 |
In order to make the data tidy I gathered all the year columns and put them in a single column and then arranged the data alphabetically by country. Finally, I wrote the file out for future use.
raw.data %<>%
rename(country=X1) %>%
gather(year, population, 2:32) %>%
arrange(country) %T>%
write.csv('C:\\Users\\Brian\\Desktop\\GradClasses\\Spring18\\607\\607project2\\population_answer.csv')
kable(raw.data %>% head())
| country | year | population |
|---|---|---|
| Afghanistan | 1980 | 15.0436 |
| Afghanistan | 1981 | 13.67368 |
| Afghanistan | 1982 | 12.57743 |
| Afghanistan | 1983 | 12.43058 |
| Afghanistan | 1984 | 12.75384 |
| Afghanistan | 1985 | 13.09371 |
Working off of Steven’s suggestion about the prevelance of missing data, I decided to highlight the countries with the most missing data. I created a subset that contained all the missing data in the original raw data and read in another csv file I found online that contained a list of countries and their longitude and latitude.
missing.data <- raw.data %>%
subset(population == '--' | is.na(population))
country.locations <- read.csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/country_locations.csv', stringsAsFactors=FALSE)
kable(country.locations %>% head())
| country | latitude | longitude | name |
|---|---|---|---|
| AD | 42.54624 | 1.601554 | Andorra |
| AE | 23.42408 | 53.847818 | United Arab Emirates |
| AF | 33.93911 | 67.709953 | Afghanistan |
| AG | 17.06082 | -61.796428 | Antigua and Barbuda |
| AI | 18.22055 | -63.068615 | Anguilla |
| AL | 41.15333 | 20.168331 | Albania |
I created data to plot by counting the number of missing entries and joining that information with the country locations. I created another column to dictate the size of the circles on the plot named mod.pop.
plotting.data <- missing.data %>%
count(country, population) %>%
inner_join(country.locations, by=c('country'='name')) %>%
mutate(mod.pop = sigmoid((n / max(n)))*5)
kable(plotting.data %>% head())
| country | population | n | country.y | latitude | longitude | mod.pop |
|---|---|---|---|---|---|---|
| Antarctica | NA | 31 | AQ | -75.25097 | -0.071389 | 3.655293 |
| Armenia | – | 12 | AM | 40.06910 | 45.038189 | 2.977918 |
| Aruba | – | 6 | AW | 12.52111 | -69.968338 | 2.741183 |
| Azerbaijan | – | 12 | AZ | 40.14310 | 47.576927 | 2.977918 |
| Belarus | – | 12 | BY | 53.70981 | 27.953389 | 2.977918 |
| Bosnia and Herzegovina | – | 12 | BA | 43.91589 | 17.679076 | 2.977918 |
Finally, I plotted the data. I used hollow cirlces for NA data and solid circles for missing (dash-dash) data.
ggplot(plotting.data) +
borders('world', color='gray50', fill='gray50') +
geom_point(aes(x=longitude, y=latitude, color=n),
shape=ifelse(is.na(plotting.data$population), 21, 16),
size = plotting.data$mod.pop
) +
geom_label_repel(aes(x=longitude, y=latitude, label=country)) +
labs(x='Longitude',
y='Latitude',
color='Missing Data Count',
title='Eastern European Countries are Missing Most Data'
)
The graph shows that most of the missing data comes from eastern Europe. It should be noted that some missing elements were removed by the inner join on country locations as those countries were not part of the data set. The most common reason is that the countries no longger exist. For example, ‘Former USSR’ was listed as a country in the original data set.’
For my last data set, I’m using UN data on migrant data rates per country in 2010. This is my data set. The data set has numerous changes that need to be made to it. The following issues exist for this data set:
example <- read.csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/UN_MigrantStockByAge_2015.csv',
stringsAsFactors=FALSE, header=FALSE, na.strings='..', encoding='utf-8')
kable(example[15:27, 1:6])
| V1 | V2 | V3 | V4 | V5 | V6 | |
|---|---|---|---|---|---|---|
| 15 | Sort | |||||
| order | Major area | , region, country or area of destination Notes Co | untry co | de Type of da | ta (a) Internatio | nal migrant stock at mid-year by age (both sexes) |
| 16 | 0-4 | |||||
| 17 | 1 | WORLD | 900 | 5 444 102 | ||
| 18 | 2 | Developed regions | (b) | 901 | 1 251 818 | |
| 19 | 3 | Developing regions | (c) | 902 | 4 192 284 | |
| 20 | 4 | Least developed countries | (d) | 941 | 1 063 284 | |
| 21 | 5 | Less developed regions excluding least developed countries | 934 | 3 129 000 | ||
| 22 | 6 | Sub-Saharan Africa | (e) | 947 | 1 305 579 | |
| 23 | 7 | Africa | 903 | 1 381 650 | ||
| 24 | 8 | Eastern Africa | 910 | 547 697 | ||
| 25 | 9 | Burundi | 108 | B R | 19 599 | |
| 26 | 10 | Comoros | 174 | B | 600 | |
| 27 | 11 | Djibouti | 262 | B R | 5 435 |
I began by reading in the data and removing the unneeded rows and columns. I set the column names using the information stored in row 1 and then removed that row. Finally, I omitted all NA pieces of data. Part of the naming sceme involved adding an ‘m’ or ‘f’ to the beginning of the columns for males and females respectively. This will eventually be used to split the data and is used her to differentiate between the two columns.
raw.data <- read.csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/UN_MigrantStockByAge_2015.csv',
stringsAsFactors=FALSE, header=FALSE, na.strings='..', encoding='utf-8') %>%
.[-c(1:15, 17:22), -c(1, 3:4, 6:22, 39, 56)] %>%
setNames(c('country', 'code', paste('m', .[1, 3:18]), paste('f', .[1, 19:34])) %>% unlist()) %>%
.[-1, ] %>%
na.omit()
kable(raw.data[1:10, 1:10])
| country | code | m 0-4 | m 5-9 | m 10-14 | m 15-19 | m 20-24 | m 25-29 | m 30-34 | m 35-39 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 23 | Africa | 676 971 | 697 939 | 654 031 | 740 174 | 871 195 | 961 607 | 859 150 | 712 199 | |
| 24 | Eastern Africa | 267 929 | 269 020 | 255 309 | 289 807 | 310 260 | 326 438 | 297 497 | 246 256 | |
| 25 | Burundi | B R | 9 574 | 8 651 | 8 288 | 9 030 | 12 316 | 18 285 | 23 164 | 20 116 |
| 26 | Comoros | B | 247 | 352 | 446 | 516 | 577 | 758 | 821 | 777 |
| 27 | Djibouti | B R | 3 543 | 4 438 | 6 463 | 8 560 | 9 115 | 8 962 | 6 794 | 4 737 |
| 28 | Eritrea | I | 385 | 335 | 437 | 612 | 801 | 884 | 787 | 623 |
| 29 | Ethiopia | B R | 41 272 | 44 185 | 54 443 | 71 389 | 77 813 | 74 627 | 61 129 | 46 834 |
| 30 | Kenya | B R | 9 432 | 11 378 | 14 750 | 22 655 | 17 884 | 17 509 | 14 519 | 11 915 |
| 31 | Madagascar | C | 467 | 500 | 475 | 665 | 951 | 893 | 779 | 765 |
| 32 | Malawi | B R | 76 753 | 65 246 | 59 561 | 64 862 | 58 236 | 51 855 | 41 527 | 32 424 |
The next step was the remove the header and subheader rows so that they could be applied to each row that contained a country. I grabbed all the rows that had no code in the code column (noticing that only countries had codes) and then seperated them based on whether they were a major grouping or a minor grouping. The major grouping locations had to be selected by hand. I then created a pairing of all major locations and minor locations.
all.groupings <- raw.data %>%
subset(code == '', select=c('country'))
major.group.locs <- c(1, 7, 13, 18, 23)
major.group <- all.groupings[major.group.locs, ]
sub.group <- all.groupings[-major.group.locs, ]
all.groupings %<>% mutate(major.group = replace(.$country, !.$country %in% major.group, NA) %>% na.locf(),
sub.group = replace(.$country, !.$country %in% sub.group, NA)
) %>%
subset(!is.na(sub.group), -1)
kable(all.groupings %>% head(10))
| major.group | sub.group | |
|---|---|---|
| 2 | Africa | Eastern Africa |
| 3 | Africa | Middle Africa |
| 4 | Africa | Northern Africa |
| 5 | Africa | Southern Africa |
| 6 | Africa | Western Africa |
| 8 | Asia | Central Asia |
| 9 | Asia | Eastern Asia |
| 10 | Asia | South-Eastern Asia |
| 11 | Asia | Southern Asia |
| 12 | Asia | Western Asia |
The all.groupings data contained every combination of major and minor groups and they are in the same order that they need to appear in the final data frame. However, each row appears some unique number of times in the final data set. For example, “Africa - Eastern Africa” represents the first 19 countries while “Africa - Middle Africa” represents the next 9. I decided to create what I called a mapping. This is a list that is the same length as the raw.data but with each row in all.groupings repeated the necessary number of times so that it aligns with the raw data.
I accommplished this by again searching for empty codes. This would indicate a new subheading and thus a transition of grouping. If there was a code, then the same grouping was repeated.
raw.data %<>% subset(!.$country %in% major.group) %>% .[-1, ]
country.mapping <- data.frame(group=rep(NA, nrow(raw.data)), subgroup=rep(NA, nrow(raw.data)))
for(i in 1:nrow(raw.data)){
if(raw.data$code[i] == ''){
all.groupings %<>% .[-1, ]
}else{
country.mapping[i, ] <- all.groupings[1, ]
}
}
The end result of this mapping is that I could simply cbind the mapping with the raw data to create my answer data set. At the same time I also removed the now unneeded empty code rows and gave an id number column that would become necessary soon.
answer <- cbind(country.mapping, raw.data) %>%
subset(code != '') %>%
rowid_to_column('id')
kable(answer[1:10, 1:7])
| id | group | subgroup | country | code | m 0-4 | m 5-9 |
|---|---|---|---|---|---|---|
| 1 | Africa | Eastern Africa | Burundi | B R | 9 574 | 8 651 |
| 2 | Africa | Eastern Africa | Comoros | B | 247 | 352 |
| 3 | Africa | Eastern Africa | Djibouti | B R | 3 543 | 4 438 |
| 4 | Africa | Eastern Africa | Eritrea | I | 385 | 335 |
| 5 | Africa | Eastern Africa | Ethiopia | B R | 41 272 | 44 185 |
| 6 | Africa | Eastern Africa | Kenya | B R | 9 432 | 11 378 |
| 7 | Africa | Eastern Africa | Madagascar | C | 467 | 500 |
| 8 | Africa | Eastern Africa | Malawi | B R | 76 753 | 65 246 |
| 9 | Africa | Eastern Africa | Mauritius | C | 155 | 187 |
| 10 | Africa | Eastern Africa | Mayotte | B | 455 | 440 |
The data in this table really should be split into two. Once the data is made tidy, this would result in repeating the id, group, subgroup and country over a dozen times for each country. I seperated the first four columns out into their own data frame.
country.id.table <- answer[, 1:4]
kable(country.id.table %>% head())
| id | group | subgroup | country |
|---|---|---|---|
| 1 | Africa | Eastern Africa | Burundi |
| 2 | Africa | Eastern Africa | Comoros |
| 3 | Africa | Eastern Africa | Djibouti |
| 4 | Africa | Eastern Africa | Eritrea |
| 5 | Africa | Eastern Africa | Ethiopia |
| 6 | Africa | Eastern Africa | Kenya |
I completed tidying the data by removing the unneeded columns and then gathering all the columns representing different migrant rates. I split the column into two seperating the gener and age ranges. I then spread the genders back out. Finally, I changed all the migrant rates to numerics.
migration.data <- answer[, -c(2:5)] %>%
gather(category, quantity, 2:33) %>%
separate(category, c('gender', 'age'), ' ') %>%
transform(age = ifelse(age == '5-9', '05-09', age)) %>%
spread(gender, quantity)%>%
arrange(id)
migration.data[, 3:4] %<>% purrr::map(~str_replace_all(., '\\s', '') %>% as.numeric())
kable(migration.data %>% head(10))
| id | age | f | m |
|---|---|---|---|
| 1 | 0-4 | 10025 | 9574 |
| 1 | 05-09 | 9359 | 8651 |
| 1 | 10-14 | 9207 | 8288 |
| 1 | 15-19 | 10560 | 9030 |
| 1 | 20-24 | 15134 | 12316 |
| 1 | 25-29 | 21456 | 18285 |
| 1 | 30-34 | 23746 | 23164 |
| 1 | 35-39 | 18945 | 20116 |
| 1 | 40-44 | 14279 | 15031 |
| 1 | 45-49 | 10325 | 10537 |
The data is now successfully tidy, spread across two data frames. The first contains a list of all major groupings, minor groupings and country. The other contains male and female migrant rates divided by age group.
I decided to examine which countries had the highest migrant rates. To do this, I needed the populations of each country in the year 2010. Thankfully, the first data set I tidied up for this project contained such information. I read in that tidy data along with the country locations information I used with that data set. I joined all four data frames together in order to create a frame that contained each countries migrant population as a proportion of their total population. The data was then ordered from highest to lowest and the top 10 were subset.
populations <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/population_answer.csv') %>%
subset(year == 2010, select=-1) %>%
mutate(population = population %>% as.numeric() * 1000000) %>%
na.omit()
country.locations <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/country_locations.csv')
combined <- inner_join(country.id.table, populations, by='country') %>%
inner_join(migration.data, by='id') %>%
group_by(country) %>%
summarise(total = (sum(f) + sum(m)) / sum(population)) %>%
arrange(desc(total)) %>%
.[1:10, ] %>%
inner_join(country.locations, by=c('country'='name'))
kable(combined[, c(1, 4:5, 2)])
| country | latitude | longitude | total |
|---|---|---|---|
| Montserrat | 16.742498 | -62.187366 | 0.0245728 |
| Guam | 13.444304 | 144.793731 | 0.0241040 |
| Kuwait | 29.311660 | 47.481766 | 0.0240754 |
| Qatar | 25.354826 | 51.183884 | 0.0230216 |
| French Guiana | 3.933889 | -53.125782 | 0.0207817 |
| American Samoa | -14.270972 | -170.132217 | 0.0200239 |
| Nauru | -0.522778 | 166.931503 | 0.0189792 |
| Gibraltar | 36.137741 | -5.345374 | 0.0188474 |
| Estonia | 58.595272 | 25.013607 | 0.0184908 |
| Latvia | 56.879635 | 24.603189 | 0.0182038 |
Finally, I plotted the data on a world map for viewing.
ggplot(combined) +
borders('world', color='gray50', fill='gray50') +
geom_point(aes(x=longitude, y=latitude, color=country), size=4) +
geom_label_repel(aes(x=longitude, y=latitude, label=country)) +
scale_color_brewer(palette='Set3') +
labs(x='Longitude',
y='Latitude',
color='Country',
title='Smaller Countries have Higher Migrant Rates'
)
The top 10 countries with the highest migrant rates are displayed on the above world map. The only link they appear to have is that they are generally from countries with small populations. It makes sense that smaller countries would have higher rates while having overall lower number of migrants.