Project 2

By Brian Weinfeld

March 5th, 2018

Electricity Demand

For my first data set, I used the set provided by Rose. This data set identifies a series of regions in the United States and, for each day in January 2018, shows the amount of elecriticy generated, the amount consumed, and the difference between these values.

raw.data <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/us_daily_electric_system_operating_data.csv', 
                     skip=4)
kable(raw.data[, 1:5] %>% head(10))

megawatthours	01/01/2018	01/02/2018	01/03/2018	01/04/2018
California (region)	NA	NA	NA	NA
Demand	645599	704074	727216	714305
Net generation	463629	502312	504456	505148
Total net actual interchange	-179525	-201654	-222762	-209163
Carolinas (region)	NA	NA	NA	NA
Demand	847133	938726	948954	901939
Net generation	802908	914879	920055	866924
Total net actual interchange	-43153	-22922	-29015	-34530
Central (region)	NA	NA	NA	NA
Demand	890403	916535	861352	847712

In order to tidy this data, I would first need to extract the locations present in every fourth row and create a new column in the original data set in which to store the information. Annoyingly, one of the location names features a comma and thus half the name is placed in the 01/01/2018 column. I developed a generic solution to this issue in case it came up in a bigger data set.

locations <- raw.data[seq(1, length(raw.data), 4), 1:2] %>% 
  replace_na(list(`01/01/2018`='')) %>%
  unite(location, 1, 2, sep='') %>%
  select(1) %>%
  unlist()
kable(locations)

	x
location1	California (region)
location2	Carolinas (region)
location3	Central (region)
location4	Electric Reliability Council of TexasInc. (region)
location5	Florida (region)
location6	New England ISO (region)
location7	Mid-Atlantic (region)
location8	Midwest (region)

With all the necessary information, I tidied up the data and made a few additional changes to make this particular data set easier to work with. That included changing the names of some columns and splitting the date into a month, day and year column.

tidy.data <- raw.data[which(1:length(raw.data) %% 4 != 1), ] %>% 
  mutate(location=locations %>% 
              rep(each=3)
            ) %>%
  gather(date, mega.watt.hours, 2:32) %>%
  spread(megawatthours, mega.watt.hours) %>%
  rename(generated=`Net generation`, net=`Total net actual interchange`, demand=Demand) %>%
  separate(date, c('month', 'day', 'year'))

tidy.data[, 2:7] %<>%
  map(~as.numeric(.))
kable(tidy.data %>% head(10))

location	month	day	year	demand	generated	net
California (region)	1	1	2018	645599	463629	-179525
California (region)	1	2	2018	704074	502312	-201654
California (region)	1	3	2018	727216	504456	-222762
California (region)	1	4	2018	714305	505148	-209163
California (region)	1	5	2018	712021	484644	-213905
California (region)	1	6	2018	668299	447821	-219968
California (region)	1	7	2018	655802	447952	-207861
California (region)	1	8	2018	731881	500797	-237089
California (region)	1	9	2018	729380	493884	-246264
California (region)	1	10	2018	716698	482843	-234361

Analysis

I plotted the proportion of energy generated compared to demand for each day and location. Any points above 1.0 indicated excess or wasted energy and any points beneath 1.0 indicate that not enough energy was genered.

tidy.data %>%
  ggplot(aes(x=day, y= generated / demand, color=location)) + 
  geom_point() +
  geom_smooth(se=FALSE) + 
  scale_color_brewer(palette='Set1') +
  theme(legend.position='bottom') +
  guides(color=guide_legend(nrow=3, override.aes=list(size=2))) +
  labs(x='Day in January 2018', 
       y='Proportion of Required Energy Generated',
       color='Location',
       title='California and New England Not Generating Enough Electricity to Meet Need'
       )

California and New England are consistantly not producing enough energy to meet the needs of their populations. Every other measured locations is hovering near the required amount.

World Populations

For my second data set, I selected the world populations csv file discussed by Steven. The file contains a large number of countries and their populations in millions for each year from 1980 through 2010.

raw.data <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/populationbycountry19802010millions.csv', 
                     na=c('NA'))
kable(raw.data[, 1:5] %>% head())

X1	1980	1981	1982	1983
North America	320.27638	324.44694	328.62014	332.72487
Bermuda	0.05473	0.05491	0.05517	0.05551
Canada	24.5933	24.9	25.2019	25.4563
Greenland	0.05021	0.05103	0.05166	0.05211
Mexico	68.34748	69.96926	71.6409	73.36288
Saint Pierre and Miquelon	0.00599	0.00601	0.00605	0.00607

In order to make the data tidy I gathered all the year columns and put them in a single column and then arranged the data alphabetically by country. Finally, I wrote the file out for future use.

raw.data %<>%
  rename(country=X1) %>%
  gather(year, population, 2:32) %>%
  arrange(country) %T>%
  write.csv('C:\\Users\\Brian\\Desktop\\GradClasses\\Spring18\\607\\607project2\\population_answer.csv')
kable(raw.data %>% head())

country	year	population
Afghanistan	1980	15.0436
Afghanistan	1981	13.67368
Afghanistan	1982	12.57743
Afghanistan	1983	12.43058
Afghanistan	1984	12.75384
Afghanistan	1985	13.09371

Working off of Steven’s suggestion about the prevelance of missing data, I decided to highlight the countries with the most missing data. I created a subset that contained all the missing data in the original raw data and read in another csv file I found online that contained a list of countries and their longitude and latitude.

missing.data <- raw.data %>%
  subset(population == '--' | is.na(population))

country.locations <- read.csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/country_locations.csv', stringsAsFactors=FALSE)
kable(country.locations %>% head())

country	latitude	longitude	name
AD	42.54624	1.601554	Andorra
AE	23.42408	53.847818	United Arab Emirates
AF	33.93911	67.709953	Afghanistan
AG	17.06082	-61.796428	Antigua and Barbuda
AI	18.22055	-63.068615	Anguilla
AL	41.15333	20.168331	Albania

I created data to plot by counting the number of missing entries and joining that information with the country locations. I created another column to dictate the size of the circles on the plot named mod.pop.

plotting.data <- missing.data %>%
  count(country, population) %>%
  inner_join(country.locations, by=c('country'='name')) %>%
  mutate(mod.pop = sigmoid((n / max(n)))*5)
kable(plotting.data %>% head())

country	population	n	country.y	latitude	longitude	mod.pop
Antarctica	NA	31	AQ	-75.25097	-0.071389	3.655293
Armenia	–	12	AM	40.06910	45.038189	2.977918
Aruba	–	6	AW	12.52111	-69.968338	2.741183
Azerbaijan	–	12	AZ	40.14310	47.576927	2.977918
Belarus	–	12	BY	53.70981	27.953389	2.977918
Bosnia and Herzegovina	–	12	BA	43.91589	17.679076	2.977918

Finally, I plotted the data. I used hollow cirlces for NA data and solid circles for missing (dash-dash) data.

ggplot(plotting.data) + 
  borders('world', color='gray50', fill='gray50') + 
  geom_point(aes(x=longitude, y=latitude, color=n), 
             shape=ifelse(is.na(plotting.data$population), 21, 16),
             size = plotting.data$mod.pop
             ) +
  geom_label_repel(aes(x=longitude, y=latitude, label=country)) +
  labs(x='Longitude', 
     y='Latitude',
     color='Missing Data Count',
     title='Eastern European Countries are Missing Most Data'
  )

The graph shows that most of the missing data comes from eastern Europe. It should be noted that some missing elements were removed by the inner join on country locations as those countries were not part of the data set. The most common reason is that the countries no longger exist. For example, ‘Former USSR’ was listed as a country in the original data set.’

Migration Rates

For my last data set, I’m using UN data on migrant data rates per country in 2010. This is my data set. The data set has numerous changes that need to be made to it. The following issues exist for this data set:

14 header rows
Column names spllit over 2 rows for heading and sub heading
Numbers contain spaces instead of commas
Some rows are headings or subheadings in addition to countries. For example, Burundi is in Africa and Eastern Africa.
Columns contain data for males, females and male and female combined.

example <- read.csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/UN_MigrantStockByAge_2015.csv', 
                     stringsAsFactors=FALSE, header=FALSE, na.strings='..', encoding='utf-8')
kable(example[15:27, 1:6])

	V1	V2	V3	V4	V5	V6
15	Sort
order	Major area	, region, country or area of destination Notes Co	untry co	de Type of da	ta (a) Internatio	nal migrant stock at mid-year by age (both sexes)
16						0-4
17	1	WORLD		900		5 444 102
18	2	Developed regions	(b)	901		1 251 818
19	3	Developing regions	(c)	902		4 192 284
20	4	Least developed countries	(d)	941		1 063 284
21	5	Less developed regions excluding least developed countries		934		3 129 000
22	6	Sub-Saharan Africa	(e)	947		1 305 579
23	7	Africa		903		1 381 650
24	8	Eastern Africa		910		547 697
25	9	Burundi		108	B R	19 599
26	10	Comoros		174	B	600
27	11	Djibouti		262	B R	5 435

I began by reading in the data and removing the unneeded rows and columns. I set the column names using the information stored in row 1 and then removed that row. Finally, I omitted all NA pieces of data. Part of the naming sceme involved adding an ‘m’ or ‘f’ to the beginning of the columns for males and females respectively. This will eventually be used to split the data and is used her to differentiate between the two columns.

raw.data <- read.csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/UN_MigrantStockByAge_2015.csv', 
                     stringsAsFactors=FALSE, header=FALSE, na.strings='..', encoding='utf-8') %>%
  .[-c(1:15, 17:22), -c(1, 3:4, 6:22, 39, 56)] %>%
  setNames(c('country', 'code', paste('m', .[1, 3:18]), paste('f', .[1, 19:34])) %>% unlist()) %>%
  .[-1, ] %>%
  na.omit()
kable(raw.data[1:10, 1:10])

	country	code	m 0-4	m 5-9	m 10-14	m 15-19	m 20-24	m 25-29	m 30-34	m 35-39
23	Africa		676 971	697 939	654 031	740 174	871 195	961 607	859 150	712 199
24	Eastern Africa		267 929	269 020	255 309	289 807	310 260	326 438	297 497	246 256
25	Burundi	B R	9 574	8 651	8 288	9 030	12 316	18 285	23 164	20 116
26	Comoros	B	247	352	446	516	577	758	821	777
27	Djibouti	B R	3 543	4 438	6 463	8 560	9 115	8 962	6 794	4 737
28	Eritrea	I	385	335	437	612	801	884	787	623
29	Ethiopia	B R	41 272	44 185	54 443	71 389	77 813	74 627	61 129	46 834
30	Kenya	B R	9 432	11 378	14 750	22 655	17 884	17 509	14 519	11 915
31	Madagascar	C	467	500	475	665	951	893	779	765
32	Malawi	B R	76 753	65 246	59 561	64 862	58 236	51 855	41 527	32 424

The next step was the remove the header and subheader rows so that they could be applied to each row that contained a country. I grabbed all the rows that had no code in the code column (noticing that only countries had codes) and then seperated them based on whether they were a major grouping or a minor grouping. The major grouping locations had to be selected by hand. I then created a pairing of all major locations and minor locations.

all.groupings <- raw.data %>% 
  subset(code == '', select=c('country')) 

major.group.locs <- c(1, 7, 13, 18, 23)
major.group <- all.groupings[major.group.locs, ]
sub.group <- all.groupings[-major.group.locs, ]

all.groupings %<>% mutate(major.group = replace(.$country, !.$country %in% major.group, NA) %>% na.locf(), 
                          sub.group = replace(.$country, !.$country %in% sub.group, NA)
                         ) %>%
  subset(!is.na(sub.group), -1)
kable(all.groupings %>% head(10))

	major.group	sub.group
2	Africa	Eastern Africa
3	Africa	Middle Africa
4	Africa	Northern Africa
5	Africa	Southern Africa
6	Africa	Western Africa
8	Asia	Central Asia
9	Asia	Eastern Asia
10	Asia	South-Eastern Asia
11	Asia	Southern Asia
12	Asia	Western Asia

The all.groupings data contained every combination of major and minor groups and they are in the same order that they need to appear in the final data frame. However, each row appears some unique number of times in the final data set. For example, “Africa - Eastern Africa” represents the first 19 countries while “Africa - Middle Africa” represents the next 9. I decided to create what I called a mapping. This is a list that is the same length as the raw.data but with each row in all.groupings repeated the necessary number of times so that it aligns with the raw data.

I accommplished this by again searching for empty codes. This would indicate a new subheading and thus a transition of grouping. If there was a code, then the same grouping was repeated.

raw.data %<>% subset(!.$country %in% major.group) %>% .[-1, ]

country.mapping <- data.frame(group=rep(NA, nrow(raw.data)), subgroup=rep(NA, nrow(raw.data)))
for(i in 1:nrow(raw.data)){
  if(raw.data$code[i] == ''){
    all.groupings %<>% .[-1, ]
  }else{
    country.mapping[i, ] <- all.groupings[1, ]
  }
}

The end result of this mapping is that I could simply cbind the mapping with the raw data to create my answer data set. At the same time I also removed the now unneeded empty code rows and gave an id number column that would become necessary soon.

answer <- cbind(country.mapping, raw.data) %>%
  subset(code != '') %>% 
  rowid_to_column('id')
kable(answer[1:10, 1:7])

id	group	subgroup	country	code	m 0-4	m 5-9
1	Africa	Eastern Africa	Burundi	B R	9 574	8 651
2	Africa	Eastern Africa	Comoros	B	247	352
3	Africa	Eastern Africa	Djibouti	B R	3 543	4 438
4	Africa	Eastern Africa	Eritrea	I	385	335
5	Africa	Eastern Africa	Ethiopia	B R	41 272	44 185
6	Africa	Eastern Africa	Kenya	B R	9 432	11 378
7	Africa	Eastern Africa	Madagascar	C	467	500
8	Africa	Eastern Africa	Malawi	B R	76 753	65 246
9	Africa	Eastern Africa	Mauritius	C	155	187
10	Africa	Eastern Africa	Mayotte	B	455	440

The data in this table really should be split into two. Once the data is made tidy, this would result in repeating the id, group, subgroup and country over a dozen times for each country. I seperated the first four columns out into their own data frame.

country.id.table <- answer[, 1:4]
kable(country.id.table %>% head())

id	group	subgroup	country
1	Africa	Eastern Africa	Burundi
2	Africa	Eastern Africa	Comoros
3	Africa	Eastern Africa	Djibouti
4	Africa	Eastern Africa	Eritrea
5	Africa	Eastern Africa	Ethiopia
6	Africa	Eastern Africa	Kenya

I completed tidying the data by removing the unneeded columns and then gathering all the columns representing different migrant rates. I split the column into two seperating the gener and age ranges. I then spread the genders back out. Finally, I changed all the migrant rates to numerics.

migration.data <- answer[, -c(2:5)] %>%  
  gather(category, quantity, 2:33) %>%
  separate(category, c('gender', 'age'), ' ') %>%
  transform(age = ifelse(age == '5-9', '05-09', age)) %>%
  spread(gender, quantity)%>%
  arrange(id)

migration.data[, 3:4] %<>% purrr::map(~str_replace_all(., '\\s', '') %>% as.numeric())
kable(migration.data %>% head(10))

id	age	f	m
1	0-4	10025	9574
1	05-09	9359	8651
1	10-14	9207	8288
1	15-19	10560	9030
1	20-24	15134	12316
1	25-29	21456	18285
1	30-34	23746	23164
1	35-39	18945	20116
1	40-44	14279	15031
1	45-49	10325	10537

The data is now successfully tidy, spread across two data frames. The first contains a list of all major groupings, minor groupings and country. The other contains male and female migrant rates divided by age group.

Analysis

I decided to examine which countries had the highest migrant rates. To do this, I needed the populations of each country in the year 2010. Thankfully, the first data set I tidied up for this project contained such information. I read in that tidy data along with the country locations information I used with that data set. I joined all four data frames together in order to create a frame that contained each countries migrant population as a proportion of their total population. The data was then ordered from highest to lowest and the top 10 were subset.

populations <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/population_answer.csv') %>%
  subset(year == 2010, select=-1) %>%
  mutate(population = population %>% as.numeric() * 1000000) %>%
  na.omit()

country.locations <- read_csv('https://raw.githubusercontent.com/brian-cuny/migrantdata/master/country_locations.csv')

combined <- inner_join(country.id.table, populations, by='country') %>%
  inner_join(migration.data, by='id') %>%
  group_by(country) %>%
  summarise(total = (sum(f) + sum(m)) / sum(population)) %>%
  arrange(desc(total)) %>%
  .[1:10, ] %>%
  inner_join(country.locations, by=c('country'='name'))
kable(combined[, c(1, 4:5, 2)])

country	latitude	longitude	total
Montserrat	16.742498	-62.187366	0.0245728
Guam	13.444304	144.793731	0.0241040
Kuwait	29.311660	47.481766	0.0240754
Qatar	25.354826	51.183884	0.0230216
French Guiana	3.933889	-53.125782	0.0207817
American Samoa	-14.270972	-170.132217	0.0200239
Nauru	-0.522778	166.931503	0.0189792
Gibraltar	36.137741	-5.345374	0.0188474
Estonia	58.595272	25.013607	0.0184908
Latvia	56.879635	24.603189	0.0182038

Finally, I plotted the data on a world map for viewing.

ggplot(combined) +
  borders('world', color='gray50', fill='gray50') + 
  geom_point(aes(x=longitude, y=latitude, color=country), size=4) +
  geom_label_repel(aes(x=longitude, y=latitude, label=country)) +
  scale_color_brewer(palette='Set3')  +
  labs(x='Longitude', 
     y='Latitude',
     color='Country',
     title='Smaller Countries have Higher Migrant Rates'
  )

The top 10 countries with the highest migrant rates are displayed on the above world map. The only link they appear to have is that they are generally from countries with small populations. It makes sense that smaller countries would have higher rates while having overall lower number of migrants.