Data 607_Project 2

The goal of this assignment is to prepare different datasets for analysis. For the 3 datasets required, I selected: 1. US Chronic Disease Idnicators (CDI), c/o Niteen Kumar 2. US Electric Grid, c/o Rose Koh 3. World population, c/o Steven Tipton

First we configure our toolset.

knitr::opts_chunk$set(echo = TRUE)
library(tidyr)
library(stringr)

## Warning: package 'stringr' was built under R version 3.2.5

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.5

library(ggalt)

## Warning: package 'ggalt' was built under R version 3.2.5

library(magrittr)

## Warning: package 'magrittr' was built under R version 3.2.5

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

library(knitr)

1. US Chronic Disease Indicators (CDI), c/o Niteen Kumar

As Niteen explained in his post, the chronic disease indicator (CDI) dataset is compiled by the federal goverment of the US. The chronic disease indicators (CDIs) are a set of surveillance indicators developed by consensus among CDC, the Council of State and Territorial Epidemiologists (CSTE), and the National Association of Chronic Disease Directors (NACDD) and are available on the Internet. A glossary of the variables and sources is available here: https://www.cdc.gov/mmwr/pdf/rr/rr6401.pdf

The purpose of the analysis is filter the data by categories and indicators to conclude how each state and location are impacted by several types of chronic diseases. That’s a lot of ground to cover, so we’ll focus this later on in the analysis section.

Data ingestion: First, we pull in the data, which is already captured in a CSV.

# Niteen collected data on Github so we call it from there.

cdi.sourcefile <- "https://raw.githubusercontent.com/niteen11/MSDS/master/DATA607/Week5/dataset/U.S._Chronic_Disease_Indicators__CDI.csv"
cdi <- read.csv(cdi.sourcefile, stringsAsFactors = T, row.names = NULL) # read.table defaulted to header = T and sep = "."

Data cleanup: We begin with some cleanup.

Per https://www.cdc.gov/healthyyouth/data/yrbs/index.htm, the Youth Risk Behavior Surveillance System (YRBSS) monitors six types of health-risk behaviors that contribute to the leading causes of death and disability among youth and adults, including: * Behaviors that contribute to unintentional injuries and violence * Sexual behaviors related to unintended pregnancy and sexually transmitted diseases, including HIV infection * Alcohol and other drug use * Tobacco use * Unhealthy dietary behaviors * Inadequate physical activity

The YRBSS dataset seems like an intersting one to try to analyze. The first step is to subset the CDI data for YRBSS data alone - doing this up front reduces the size of the dataset (CDI is quite large) and improves performance.

# First, we make sure the column values read cleanly.  We coerce Datasource to character to ease subset operation.  Confused as to why this hack is necessary, but could not work with the factor as is.

cdi$Datasource <- as.character(str_trim(cdi$Datasource, side = "both"))
levels(cdi$Datasource)

## NULL

# Next, we filter based on the three data sources.  Tried both filter and subset, and they only cooperated after coercing to character.

target.sources <- c("YRBSS")
cdi.1 <- filter(cdi, Datasource %in% target.sources)

# We clean up a few column names using rename and check the dimensions of the dataframe.

cdi.1 <- cdi.1 %>% rename("Year" = "ï..Year", "Location" = "LocationDesc")
dim(cdi.1)

## [1] 765  19

Next we check different variables to see if they provide information valuable to our aims Specifically, we’ll look at Year, DataValueUnit, DataValueType, DataValueAlt, DataValueFootnoteSymbol, and DataValueFootnote.

We first look at to see what the YRBSS will provide.

table(cdi.1$Year, cdi.1$Datasource) %>% tbl_df

## # A tibble: 12 x 3
##    Var1      Var2      n
##    <chr>     <chr> <int>
##  1 2001      YRBSS     0
##  2 2006-2010 YRBSS     0
##  3 2007      YRBSS     0
##  4 2007-2011 YRBSS     0
##  5 2008      YRBSS     0
##  6 2009      YRBSS     0
##  7 2009-2011 YRBSS     0
##  8 2010      YRBSS     0
##  9 2011      YRBSS     0
## 10 2011-2012 YRBSS     0
## 11 2012      YRBSS     0
## 12 2013      YRBSS   765

table(cdi.1$Category, cdi.1$Datasource) %>% tbl_df

## # A tibble: 18 x 3
##    Var1                                            Var2      n
##    <chr>                                           <chr> <int>
##  1 Alcohol                                         YRBSS   110
##  2 Arthritis                                       YRBSS     0
##  3 Asthma                                          YRBSS     0
##  4 Cancer                                          YRBSS     0
##  5 Cardiovascular Disease                          YRBSS     0
##  6 Chronic Kidney Disease                          YRBSS     0
##  7 Chronic Obstructive Pulmonary Disease           YRBSS     0
##  8 Diabetes                                        YRBSS     0
##  9 Disability                                      YRBSS     0
## 10 Immunization                                    YRBSS     0
## 11 Mental Health                                   YRBSS     0
## 12 Nutrition, physical activity, and weight status YRBSS     0
## 13 Nutrition, Physical Activity, and Weight Status YRBSS   545
## 14 Older Adults                                    YRBSS     0
## 15 Oral Health                                     YRBSS     0
## 16 Overarching Conditions                          YRBSS     0
## 17 Reproductive Health                             YRBSS     0
## 18 Tobacco                                         YRBSS   110

YRBSS is all from 2013, and treats alcohol, tobacco, and nutrition / activity / weight.

DataValueUnit provide valuable information, so we won’t remove that. We should de-duplicate up " Number“,”cases per million“,”per 100,000“.

levels(cdi.1$DataValueType)%>% tbl_df

## # A tibble: 23 x 1
##    value                                                           
##    <chr>                                                           
##  1 Age-adjusted Mean                                               
##  2 Age-adjusted Prevalence                                         
##  3 Age-Adjusted Prevalence                                         
##  4 Age-adjusted rate                                               
##  5 Age-adjusted Rate                                               
##  6 Average Annual Age-adjusted Rate                                
##  7 Average Annual Crude Rate                                       
##  8 Average Annual Number                                           
##  9 Commercial host (dram shop) liability status for alcohol service
## 10 Crude Mean                                                      
## # ... with 13 more rows

DataValueType could provide valuable information, so we won’t remove that. We should de-duplicate up “Age-adjusted Prevalence”, Age-adjusted Rate“,”Crude rate" down below.

Next, we test how similar DataValue and DataValueAlt are for our data sources.

# We coerce DataValueAlt to character and then numeric.  Tried going straight from factor to numeric, but that proved very buggy.

cdi.1$DataValueAlt <- as.numeric(levels(cdi.1$DataValueAlt))[cdi.1$DataValueAlt]

## Warning: NAs introduced by coercion

# We subtract DataValue from DataValueAlt and correct for any NAs returned (i.e. setting to 0).

Val.Alt <- cdi.1$DataValue - cdi.1$DataValueAlt
Val.Alt[is.na(Val.Alt)] <- 0
Val.Alt[Val.Alt != 0]

## numeric(0)

For the selected data sources, there are no cases where there are discrepancies between the DataValue and DataValueAlt. We can remove DataValueAlt.

levels(cdi.1$DataValueFootnote) %>% tbl_df

## # A tibble: 21 x 1
##    value                                                                  
##    <chr>                                                                  
##  1 ""                                                                     
##  2 50 states + DC:  Median Percent; NSCH Data provided by Data Resource C~
##  3 50 States + DC: US Median                                              
##  4 A confidence interval for this location is not available because a cen~
##  5 CI was calculated using Poisson variable                               
##  6 Data are from cancer registries that meet the data quality criteria fo~
##  7 Data are not available because cancer registry did not meet the data q~
##  8 Data are not available because of state legislation and regulations wh~
##  9 Data not shown because of too few respondents or cases                 
## 10 DC was not rated for this measure because the measure addresses state ~
## # ... with 11 more rows

levels(cdi.1$DataValueFootnoteSymbol) %>% tbl_df

## # A tibble: 19 x 1
##    value    
##    <chr>    
##  1 ""       
##  2 #        
##  3 *        
##  4 **       
##  5 ***      
##  6 ****     
##  7 *, #     
##  8 ^^       
##  9 ^^^      
## 10 ^^^^     
## 11 ~        
## 12 ~~       
## 13 ~~~      
## 14 ~~~~     
## 15 Â¯       
## 16 â<U+0080> â<U+0080> â<U+0080> 
## 17 â<U+0080>¡â<U+0080>¡   
## 18 Â§Â§     
## 19 Â§Â§Â§Â§

DataValueFootnote and DataValueFootnoteSymbol don’t seem to help much for our aims, so we’ll add them to the list to remove. To summarize: * We can take LocationAbbr in favor of LocationDesc.
* As indicated above, we’ll remove DataValueAlt, DataValueFootnote and DataValueFootnoteType. * StratificationID is just another take on Gender.
* IndicatorID is duplicative with Indicator, and LocationID is duplicative with LocationAbbr.
* We won’t need geolocation data for state level analysis.

# We remove duplicative and non-additive variables.  

cdi.1 <- subset(cdi, select = -c(LocationAbbr, DataValueAlt, DataValueFootnote, DataValueFootnoteSymbol, StratificationID1, IndicatorID, LocationID, GeoLocation))

# We clean up a few duplicative levels in variables using mapvalues from dplyr.  These next steps are hashed out as they have not been successful irrespective of approach ( frustrating, as this seems like it should be elementary).

# unit.old <- c(" Number", "cases per million", "per 100,000")
# unit.new <- c("Number", "cases per 1,000,000", "per 100,000 residents")
# cdi.1 %>% mutate(DataValueUnit = replace(DataValueUnit, unit.old, unit.new))

# mapvalues(cdi.1$DataValueUnit, from = c(" Number", "cases per million", "per 100,000"), to = c("Number", "cases per 1,000,000", "per 100,000 residents"))
# levels(cdi.1$DataValueUnit)
# mapvalues(cdi.1$DataValueType, from = c("Age-adjusted Prevalence", "Age-adjusted Rate", "Crude rate"), to = C("Age-Adjusted Prevalence", "Age-Adjusted Rate", "Crude Rate"))

Data Melting: Now we can melt the data and structure it for tidy analysis.

# We melt the data into a tidy, "long" structure.  Either because of misconfiguration of the gather operation or large dimensionality of the dataset, I did not find this structure conducive in this case to experimenting with or developing analytical intuitions - quite the contrary.

cdi.tidy <- gather(cdi.1, "key", "n", 3:11)

## Warning: attributes are not identical across measure variables;
## they will be dropped

tbl_df(cdi.tidy)

## # A tibble: 481,221 x 4
##    ï..Year LocationDesc         key      n      
##    <fct>   <fct>                <chr>    <chr>  
##  1 2013    Alabama              Category Alcohol
##  2 2013    Alaska               Category Alcohol
##  3 2013    Arizona              Category Alcohol
##  4 2013    Arkansas             Category Alcohol
##  5 2013    California           Category Alcohol
##  6 2013    Colorado             Category Alcohol
##  7 2013    Connecticut          Category Alcohol
##  8 2013    Delaware             Category Alcohol
##  9 2013    District of Columbia Category Alcohol
## 10 2013    Florida              Category Alcohol
## # ... with 481,211 more rows

Perform analysis: Nitteen suggested filtering by category and indicators to conclude that how each state is impacted by several types of chronic diseases. As this dataset and its analysis proved something of a time-suck, I cut my losses and move on to the others. Hope to glean some insights on better approaches from my classmates.

2. US Electric Grid, c/o Rose Koh

The dataset contains demand, net generation, and total net actual interchange for the entire region in the US on a daily basis. I believe this was aggregated by the US Energy Information Administration, which collects, analyzes, and disseminates independent and impartial energy information to promote sound policymaking, efficient markets, and public understanding of energy and its interaction with the economy and the environment. Value in this dataset are in megawatthours, a unit of measure of power about equivalent to the electricity consumption of 330 home during one hour (https://www.cleanenergyauthority.com/solar-energy-resources/what-is-a-megawatt-and-a-megawatt-hour).

Data ingestion: First, we pull in the data, which is already captured in a CSV.

# Rose provided a CSV which we've saved to Github - we call it from there.

electricity.sourcefile <- ("https://raw.githubusercontent.com/JeremyOBrien16/DATA-607/master/us_daily_electric_system_operating_data.csv")
zap <- read.table(electricity.sourcefile, header = F, sep= ",", fill = T, na.strings = c("", " ", "NA"), stringsAsFactors = F)

Data cleanup: We begin with some cleanup. Per Rose, we transform and tidy for ease of manipulation of dates, regions, and variables. The target table configuration includes columns for date, region, and category.

# We capture the observations in the dataset, eliminating the blank rows.  An NA in V3 (the third column) is a good indicator of whether there's any information to capture in a given row.

zap.1 <- na.omit(zap, cols = V3)

# We capture the regions in the first column at four-row intervals.  We create a column vector tripling each region (i.e. California X3, then Carolinas X3, etc.) that can be mapped to each of the three variables (demand, net generation, total net actual interchange).

region.rows <- c(seq(from = 6, to = nrow(zap), by = 4))
region.name <- zap[region.rows, 1]
region.1 <- rep(region.name, each = 3)

# As the first several rows of the soure file did not include headers, we read the data without any.  Now that we've purged the noisy labels we make the new first row column headers.

colnames(zap.1) = zap.1[1, ]
zap.1 <- zap.1[-1, ]

# We concatenate the region vector we created with the observations.

zap.2 <- cbind(region.1, zap.1)

# We clean up a few column headers, renaming them so they're easy to call for analysis.

zap.2 <- zap.2 %>% rename("region" = "region.1", "cat" = "megawatthours")

# We replace types with less wordy labels; again, so they're easy to call.

zap.2$cat <- zap.2$cat %>% str_replace_all("Demand", "dem") %>% 
  str_replace_all("Net generation", "gen") %>% 
  str_replace_all("Total net actual interchange", "net")

# We adjust the data type for category.  We'll convert days from characters to dates and amounts from character to numerics after melting the dataset, as this is a simpler operation on a single column.

zap.2$cat <- as.factor(zap.2$cat)

Data Melting: Now we can melt the data and structure it for tidy analysis.

# We melt the data into a tidy structure.

zap.tidy <- gather(zap.2, "date", "amount", 3:33)

# Finally, we convert day column data type to date and amount column data type to numeric.

zap.tidy$date <- as.Date(zap.tidy$date, "%m/%d/%Y")
zap.tidy$amount <- as.numeric(zap.tidy$amount)

# We check the setup - as expected.

head(tbl_df(zap.tidy), 6)

## # A tibble: 6 x 4
##   region              cat   date         amount
##   <fct>               <fct> <date>        <dbl>
## 1 California (region) dem   2018-01-01  645599.
## 2 California (region) gen   2018-01-01  463629.
## 3 California (region) net   2018-01-01 -179525.
## 4 Carolinas (region)  dem   2018-01-01  847133.
## 5 Carolinas (region)  gen   2018-01-01  802908.
## 6 Carolinas (region)  net   2018-01-01  -43153.

Perform analysis: Rose suggested producing the following cuts: * Daily demand by region * Daily net generation by region * Daily total net actual interchange * Overall gap / surplus over the days captured in the dataset (slighlty adapted, same intent)

First, we leverage our tidied data to output daily demand by region.

zap.tidy %>%
  filter(cat %in% c("dem")) %>%
  group_by(region) %>% 
  separate(date, c("year", "month", "day"), sep = "-") %>% 
  spread(day, amount) %>%
  group_by(region) %>% 
  select(-c(cat, year, month))

## # A tibble: 14 x 32
## # Groups:   region [14]
##    region     `01`   `02`   `03`   `04`   `05`   `06`   `07`   `08`   `09`
##    <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Califor~ 6.46e5 7.04e5 7.27e5 7.14e5 7.12e5 6.68e5 6.56e5 7.32e5 7.29e5
##  2 Carolin~ 8.47e5 9.39e5 9.49e5 9.02e5 9.40e5 9.33e5 9.47e5 8.40e5 6.47e5
##  3 Central~ 8.90e5 9.17e5 8.61e5 8.48e5 8.47e5 8.57e5 7.19e5 7.23e5 7.30e5
##  4 Electri~ 1.29e6 1.36e6 1.26e6 1.13e6 9.98e5 9.12e5 8.35e5 8.76e5 9.42e5
##  5 Florida~ 5.89e5 7.27e5 7.43e5 8.49e5 8.45e5 7.65e5 7.06e5 5.95e5 5.32e5
##  6 Mid-Atl~ 2.83e6 3.00e6 2.96e6 2.94e6 3.11e6 3.05e6 2.92e6 2.73e6 2.46e6
##  7 Midwest~ 2.42e6 2.59e6 2.50e6 2.47e6 2.43e6 2.34e6 2.13e6 2.05e6 2.04e6
##  8 New Eng~ 4.15e5 4.33e5 4.14e5 4.12e5 4.06e5 4.32e5 4.25e5 4.06e5 3.74e5
##  9 New Yor~ 4.89e5 5.12e5 4.99e5 5.00e5 5.26e5 5.16e5 5.02e5 4.97e5 4.68e5
## 10 Northwe~ 9.22e5 1.08e6 1.07e6 1.05e6 1.01e6 9.66e5 9.61e5 1.01e6 9.94e5
## 11 Southea~ 8.74e5 9.67e5 9.15e5 9.39e5 9.39e5 8.68e5 8.38e5 7.84e5 6.56e5
## 12 Southwe~ 2.39e5 2.47e5 2.50e5 2.45e5 2.44e5 2.37e5 2.29e5 2.41e5 2.37e5
## 13 Tenness~ 6.52e5 6.73e5 6.12e5 6.50e5 6.38e5 6.15e5 5.64e5 5.18e5 4.64e5
## 14 United ~ 1.32e7 1.42e7 1.37e7 1.37e7 1.36e7 1.31e7 1.23e7 1.19e7 1.12e7
## # ... with 22 more variables: `10` <dbl>, `11` <dbl>, `12` <dbl>,
## #   `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>,
## #   `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>, `22` <dbl>,
## #   `23` <dbl>, `24` <dbl>, `25` <dbl>, `26` <dbl>, `27` <dbl>,
## #   `28` <dbl>, `29` <dbl>, `30` <dbl>, `31` <dbl>

The demand trends are highest in the Mid-Atlantic, Midwest, Texas, Northwest, and Southeast.

Next, we implement the same workflow to output daily generation by region.

zap.tidy %>%
  filter(cat %in% c("gen")) %>%
  group_by(region) %>% 
  separate(date, c("year", "month", "day"), sep = "-") %>% 
  spread(day, amount) %>% 
  select(-c(cat, year, month)) %>% 
  arrange(region)

## # A tibble: 14 x 32
## # Groups:   region [14]
##    region     `01`   `02`   `03`   `04`   `05`   `06`   `07`   `08`   `09`
##    <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Califor~ 4.64e5 5.02e5 5.04e5 5.05e5 4.85e5 4.48e5 4.48e5 5.01e5 4.94e5
##  2 Carolin~ 8.03e5 9.15e5 9.20e5 8.67e5 8.96e5 8.85e5 9.14e5 8.20e5 6.39e5
##  3 Central~ 8.98e5 9.22e5 8.67e5 8.63e5 8.62e5 8.61e5 7.54e5 7.51e5 7.56e5
##  4 Electri~ 1.30e6 1.37e6 1.26e6 1.14e6 1.01e6 9.13e5 8.39e5 8.78e5 9.45e5
##  5 Florida~ 5.77e5 7.13e5 7.11e5 8.16e5 8.02e5 7.36e5 6.93e5 5.82e5 5.17e5
##  6 Mid-Atl~ 2.89e6 3.05e6 2.99e6 3.01e6 3.13e6 3.01e6 2.90e6 2.74e6 2.50e6
##  7 Midwest~ 2.43e6 2.59e6 2.50e6 2.48e6 2.44e6 2.36e6 2.16e6 2.06e6 2.03e6
##  8 New Eng~ 3.56e5 3.82e5 3.51e5 3.37e5 3.32e5 3.61e5 3.58e5 3.32e5 2.92e5
##  9 New Yor~ 4.19e5 4.49e5 4.38e5 4.46e5 4.92e5 4.92e5 4.65e5 4.41e5 4.11e5
## 10 Northwe~ 1.18e6 1.31e6 1.31e6 1.29e6 1.25e6 1.23e6 1.20e6 1.27e6 1.25e6
## 11 Southea~ 8.72e5 9.30e5 8.91e5 9.10e5 9.32e5 8.70e5 8.23e5 7.79e5 6.70e5
## 12 Southwe~ 2.92e5 3.10e5 3.28e5 3.05e5 3.02e5 2.96e5 2.87e5 3.13e5 3.20e5
## 13 Tenness~ 6.31e5 6.58e5 6.22e5 6.64e5 6.55e5 6.31e5 5.70e5 5.33e5 4.80e5
## 14 United ~ 1.32e7 1.41e7 1.37e7 1.37e7 1.36e7 1.31e7 1.23e7 1.19e7 1.13e7
## # ... with 22 more variables: `10` <dbl>, `11` <dbl>, `12` <dbl>,
## #   `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>,
## #   `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>, `22` <dbl>,
## #   `23` <dbl>, `24` <dbl>, `25` <dbl>, `26` <dbl>, `27` <dbl>,
## #   `28` <dbl>, `29` <dbl>, `30` <dbl>, `31` <dbl>

The generation trends are highest in many of the same places - Mid-Atlantic, Midwest, Texas, Northwest, and Southeast. This means the regions with the biggest demand meet alot of it with their own production. A look at net actual interchange will reveal whether any of these lead to surpluses; as well as a where demand is not met by regional generation.

We output daily total net actual interchange by region.

zap.tidy %>%
  filter(cat %in% c("net")) %>%
  group_by(region) %>% 
  separate(date, c("year", "month", "day"), sep = "-") %>% 
  spread(day, amount) %>% 
  select(-c(cat, year, month)) %>% 
  arrange(region)

## # A tibble: 14 x 32
## # Groups:   region [14]
##    region     `01`    `02`    `03`    `04`    `05`    `06`    `07`    `08`
##    <fct>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Califo~ -1.80e5 -2.02e5 -2.23e5 -2.09e5 -2.14e5 -2.20e5 -2.08e5 -2.37e5
##  2 Caroli~ -4.32e4 -2.29e4 -2.90e4 -3.45e4 -4.33e4 -4.71e4 -3.19e4 -1.93e4
##  3 Centra~  7.73e3  5.79e3  5.30e3  1.56e4  1.52e4  3.38e3  3.49e4  2.81e4
##  4 Electr~  6.98e3  9.62e3  1.74e3  4.71e3  1.52e4  1.42e3  3.40e3  2.16e3
##  5 Florid~  1.03e3 -2.65e3 -1.64e4 -1.94e4 -2.79e4 -1.48e4  1.14e3 -5.85e2
##  6 Mid-At~  6.46e4  4.51e4  3.18e4  6.43e4  1.76e4 -3.62e4 -2.32e4  2.92e3
##  7 Midwes~  2.91e4  3.60e4  5.02e4  7.39e3  3.26e4  5.89e4  6.23e4  1.89e4
##  8 New En~ -5.89e4 -5.13e4 -6.31e4 -7.43e4 -7.43e4 -7.16e4 -6.70e4 -7.40e4
##  9 New Yo~ -7.03e4 -6.26e4 -6.19e4 -5.35e4 -3.41e4 -2.49e4 -3.68e4 -5.61e4
## 10 Northw~  2.32e5  2.00e5  2.11e5  2.13e5  2.13e5  2.37e5  2.14e5  2.29e5
## 11 Southe~ -2.39e3 -3.72e4 -2.36e4 -2.88e4 -6.36e3  1.74e3 -1.52e4 -5.31e3
## 12 Southw~  5.20e4  6.20e4  7.39e4  5.83e4  5.84e4  5.88e4  5.75e4  7.20e4
## 13 Tennes~ -2.08e4 -1.42e4  1.05e4  1.39e4  1.71e4  1.61e4  6.76e3  1.59e4
## 14 United~  1.83e4 -3.78e4 -3.34e4 -4.31e4 -3.74e4 -3.42e4  4.31e3 -1.90e4
## # ... with 23 more variables: `09` <dbl>, `10` <dbl>, `11` <dbl>,
## #   `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>,
## #   `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>,
## #   `22` <dbl>, `23` <dbl>, `24` <dbl>, `25` <dbl>, `26` <dbl>,
## #   `27` <dbl>, `28` <dbl>, `29` <dbl>, `30` <dbl>, `31` <dbl>

Parity between demand / generation for regions on the large end of those scales means those regions are not prominent in terms of net interchange. Instead, the biggest shortfalls are in California, New York, New England, and the Carolinas. These are not the top users of power, but neither do they generate sufficiently to meet their own needs.

We finalize our analysis with a look at overall gap / surplus over January (the period captured in the dataset).

zap.tidy %>% 
  spread(cat, amount) %>%
  group_by(region) %>% 
  summarise(net.avg.daily = round(mean(net), 1)) %>%
  select(region, net.avg.daily) %>% 
  arrange(desc(net.avg.daily)) %>% 
  ggplot(aes(x = region, y = net.avg.daily, label = round(net.avg.daily))) +
  geom_bar(aes(fill = net.avg.daily > 0), stat = "identity") +
  scale_fill_manual(guide = F, breaks = c(TRUE, FALSE), values = c("red", "gray")) +
  geom_text(size = 3, position = position_stack(vjust = .5)) +
  coord_flip()

Based on this view of net demand / generation over the month of January, 2018, California is far and away the largest energy sink and the Northwest is a proportionately large energy source. The Mid-Atlantic and Southwest are also clearly net producers, while New York and New England are significant net users, followed by the Midwest and Carolinas.

As Rose observed, as a follow up we could isolate the day on which the gap is largest (i.e. peak grid load) and try to pull in finer time grain to identify when during the day there is the highest likelihood of brownout.

Conclusions: * Demand is highest in the Mid-Atlantic, Midwest, Texas, Northwest, and Southeast. * Supply is highest in the same places. * The biggest shortfalls are in California, New York, New England, and the Carolinas. * California’s defecit dwarfs all others combined, but is about the same scale as the Northwest’s surplus.

3. World population, c/o Steven Tipton

As Steve described it, this is a simple yet interesting population data set by country from 1980 to 2010. The values are stored with one row per country and population is per year in columns, a form similar to Hadley Wickham’s untidy song rankings data set. The dataset was compiled by the Energy Information Administration for the National Renewable Energy Laboratory, and can be found here: https://catalog.data.gov/dataset/population-by-country-1980-2010. Source information was somewhat opaque, but I believe population estimates come from the World Bank.

Data ingestion: First, we pull in the data, which is already captured in a CSV.

# Steven provided a CSV which we've saved to Github - we call it from there.

worldpop.sourcefile <- ("https://raw.githubusercontent.com/JeremyOBrien16/DATA-607/master/populationbycountry19802010millions.csv")
wp <- read.csv(worldpop.sourcefile, header = T, sep= ",", stringsAsFactors = F)

Data cleanup: We begin with some cleanup. As Steven observed, continents like Eurasia, North America, Antarctica, etc. - regional geographic groupings, really - are mixed in with countries. We should remove these from the countries column and capture information regional / grouping hierarchy in a separately. * North America is found in row 1, and should group rows 2:7 * Central & South America is found in row 8, and should group rows 10:53 * Antarctica is found in row 9 (but is NA for every year) * Europe is found in row 54, and should group rows 55:95 * Eurasia is found in row 96, and should group rows 97:112 * Middle East is found in row 113, and should group rows 114:127 * Africa is found in row 128, and should group rows 129:183 * Asia & Oceania is found in row 184, and should group rows 185:231 * World is found in row 232, and is it’s own thang.

We extract region totals from the observations and label each country with its continent (cont) in a new column.

# First, we create a labeling vector for each continent that has the same length as the number of countries (i.e. rows) on that continent. We exclude world, as it's just a supergrouping of all continents.  the row numbers come from the bulleted above.

# We could loop this over a vector with cont names
# i increments until the total length of the name vector is reached (8)
# the function called is rep
# the first argument is the sequenced value in the name vector
# input the row numbers for length.out based on a str_detect of first and last row number for the original names, subtracting 1
# assign to vector for each named by cont

NAm <- rep("NAm", length.out = (7-(2-1)))
CSAm <- rep("CSAm", length.out = (53-(10-1)))
Ant <- c("Ant")
Eur <- rep("Eur", length.out = (95-(55-1)))
EurAs <- rep("EurAs", length.out = (112-(97-1))) 
ME <- rep("ME", length.out = (127-(114-1)))
Afr <- rep("Afr", length.out = (183-(129-1)))
AsOc <- rep("AsOc", length.out = (231-(185-1)))

# We concatenate the continent labeling vectors into a single vector.

cont <- c(NAm, CSAm, Ant, Eur, EurAs, ME, Afr, AsOc)

# We eliminate the continent total rows from the observations (excluding world).

country.rows <- c(2:7, 10:53, 9, 55:95, 97:112, 114:127, 129:184, 186:231)
wp.obs <- wp[country.rows,]

# We confirm the continent labeling vector and observations line up, comparing length of cont vector with number of rows of wp.obs dataframe.

(nrow(wp.obs) - length(cont)) == 0

## [1] TRUE

# We concatenate the continent vector with the observations.

wp.1 <- cbind(cont, wp.obs)

# Relabel columns meaningfully, removing X from the start of each header name

wp.1 <- wp.1 %>% rename("Xcont" = "cont", "Xcountry" = "X")
colnames(wp.1) <- colnames(wp.1) %>% 
  unlist(colnames(wp.1)) %>%
  str_sub(2, length(colnames(wp.1)))

Data Melting: Now we can melt the data and structure it for tidy analysis.

Per Steven, we should create a table with three columns: country, year, and population (pop); to which we’ve enriched with continent (cont).

# Melt dataset to country, year, and population

wp.tidy <- gather(wp.1, key = "year", value = "pop", 3:33)
head(tbl_df(wp.tidy))

## # A tibble: 6 x 4
##   cont  country                   year  pop      
##   <fct> <chr>                     <chr> <chr>    
## 1 NAm   Bermuda                   1980  0.05473  
## 2 NAm   Canada                    1980  24.5933  
## 3 NAm   Greenland                 1980  0.05021  
## 4 NAm   Mexico                    1980  68.34748 
## 5 NAm   Saint Pierre and Miquelon 1980  0.00599  
## 6 NAm   United States             1980  227.22468

Perform analysis: Steven suggested a few interesting avenues of analysis. * Explore the differences between “NA” and “–” in the dataset.
* At first glance, the “NA” seems to mean “0” (given that it’s used for Antarctica), but it’s also seen with Wake Island and the Hawaiian trade zone (which flips from “NA” to “–” in 1987).
* The dashes appear to be used for years when countries did not exist (e.g., after East Germany and West Germany reunite, their separate listings have dashes, and Germany goes from dashes to values).
* Study changes in population in different areas of the world.

We look into the NAs first.

# We compile rows marked NA by country and year.

wp.tidy %>% 
  group_by(country) %>% 
  filter(is.na(pop)) %>%
  summarise(total.NA = n_distinct(year), first.NA = first(year), last.NA = last(year)) %>% 
  select(country, total.NA, first.NA, last.NA) %>%
  arrange(desc(total.NA)) %>% 
  tbl_df()

## # A tibble: 3 x 4
##   country             total.NA first.NA last.NA
##   <chr>                  <int> <chr>    <chr>  
## 1 Antarctica                31 1980     2010   
## 2 Wake Island               31 1980     2010   
## 3 Hawaiian Trade Zone        7 1980     1986

Three countries have NA - Antarctica and Wake Island span the entire 31 year period, but Hawaii is only NA between 1980 and 1986. Neither Antarctica nor Wake Island have indigenous populations, so in that case we can interpret NA as 0. A web search did not suggest meaningful reasons for why Hawaii’s population would have been marked NA prior to 1987.

We look into the dashes next.

# We compile rose marked with dashes by country and year.

wp.tidy %>% 
  group_by(country) %>% 
  filter(pop == "--") %>% 
  summarise(total.dash = n_distinct(year), first.dash = first(year), last.dash = last(year)) %>%
  select(country, total.dash, first.dash, last.dash) %>% 
  arrange(desc(total.dash)) %>% 
  tbl_df()

## # A tibble: 36 x 4
##    country                      total.dash first.dash last.dash
##    <chr>                             <int> <chr>      <chr>    
##  1 Montenegro                           26 1980       2005     
##  2 Serbia                               26 1980       2005     
##  3 Hawaiian Trade Zone                  24 1987       2010     
##  4 Timor-Leste (East Timor)             23 1980       2002     
##  5 Germany, East                        20 1991       2010     
##  6 Germany, West                        20 1991       2010     
##  7 Former U.S.S.R.                      19 1992       2010     
##  8 Former Yugoslavia                    19 1992       2010     
##  9 Former Czechoslovakia                18 1993       2010     
## 10 Former Serbia and Montenegro         17 1980       2010     
## # ... with 26 more rows

The following helps to explain why country-level reporting changed based on the years in question:

With the dissolution of the USSR former Soviet Socialist Republics: * The following gained independence and began reporting separately: include Armenia, Azerbaijan, Belarus, Estonia, Georgia, Kazakhstan, Kyrgyzstan, Latvia, Lithuania, Moldova, Russia, Tajikstan, Turkmenistan, Ukraine, and Uzbekistan. * East Germany and West Germany united to become Germany in 1991. * Czech Republic and Slovakia united to become Czechoslavakia in 1993.

With the dissolution of Socialist Federal Republic of Yugoslavia in the late 1980s: * Croatia, Slovenia, and Macedonia left the SFRY in 1991 and began reporting independently. * Bosnia-Herzegovinia proclaimed independence in 1992 and began reporting independently. * Yugoslavia comprised Serbia and Montenegro from 1992; Montenegro reported independently from 2005, and Serbia from 2006.

In the Middle East and Africa: * Palestine held a general election in 1996 - it’s possible that population censuses were conducted to facilitate this purpose. * Eritrea declared independence from Ethiopia in a 1993 referendum * Namibia gained independence from South Africa in 1990.

In Asia: * Timor-Leste gained independence from Indonesia in 2002.

In the Americas: * It’s not clear why Hawaii in NA until 1986 and “–” thereafter. * Aruba seceded from Netherland Antilles and gained independence in 1996.

Now that we’ve explored the NAs and dashes and have investigated the reasons for the missing data, we look shifts in continental populations. First, some data refinement.

# With the NAs / dashes behind us, we can convert pop and years to numeric values.

wp.calc <- wp.tidy 
wp.calc$pop <- as.numeric(wp.calc$pop)

## Warning: NAs introduced by coercion

wp.calc$year <- as.numeric(wp.calc$year)

# As Antarctica has no permanent population, we convert its NAs to 0 and confirm.

wp.calc$pop[wp.calc$cont == "Ant"] <- 0
wp.calc %>% 
  filter(cont %in% c("Ant")) %>% 
  select(country, year, pop) %>% 
  arrange(year)

##       country year pop
## 1  Antarctica 1980   0
## 2  Antarctica 1981   0
## 3  Antarctica 1982   0
## 4  Antarctica 1983   0
## 5  Antarctica 1984   0
## 6  Antarctica 1985   0
## 7  Antarctica 1986   0
## 8  Antarctica 1987   0
## 9  Antarctica 1988   0
## 10 Antarctica 1989   0
## 11 Antarctica 1990   0
## 12 Antarctica 1991   0
## 13 Antarctica 1992   0
## 14 Antarctica 1993   0
## 15 Antarctica 1994   0
## 16 Antarctica 1995   0
## 17 Antarctica 1996   0
## 18 Antarctica 1997   0
## 19 Antarctica 1998   0
## 20 Antarctica 1999   0
## 21 Antarctica 2000   0
## 22 Antarctica 2001   0
## 23 Antarctica 2002   0
## 24 Antarctica 2003   0
## 25 Antarctica 2004   0
## 26 Antarctica 2005   0
## 27 Antarctica 2006   0
## 28 Antarctica 2007   0
## 29 Antarctica 2008   0
## 30 Antarctica 2009   0
## 31 Antarctica 2010   0

Before examining specific regions / countries and exploring how political changes affected their populations, we set a little high-level context.

# We examine population growth between 1980 and 2010.  This does not account for the impact of unreported countries (NAs) in either year bracketing the analysis.

wp.calc %>% 
  filter(year %in% c("1980", "2010")) %>% 
  spread(year, pop) %>% 
  group_by(cont) %>%
  summarise(start = sum(`1980`, na.rm = T), end = sum(`2010`, na.rm = T), change.net = end - start, RoG = change.net / start, expon = (1 / (2010 - 1992)), CAGR = (end / start)^(expon) - 1) %>% 
  select(cont, start, end, change.net, RoG, CAGR) %>% 
  arrange(desc(RoG)) %>% 
  tbl_df

## # A tibble: 8 x 6
##   cont   start   end change.net      RoG      CAGR
##   <fct>  <dbl> <dbl>      <dbl>    <dbl>     <dbl>
## 1 ME      93.8  212.      119.    1.26     0.0464 
## 2 Afr    472.  1004.      532.    1.13     0.0428 
## 3 CSAm   293.   480.      187.    0.638    0.0278 
## 4 AsOc  2477.  3811.     1334.    0.539    0.0242 
## 5 NAm    320.   457.      136.    0.426    0.0199 
## 6 Eur    530.   606.       76.5   0.144    0.00753
## 7 EurAs  266.   283.       17.0   0.0639   0.00345
## 8 Ant      0.     0.        0.  NaN      NaN

In terms of rate of population growth, the Middle East increased by 126.4% (for a cumulative average growth rate of 4.6%) adding 118 million people over the 30 year period. While that sounds impressive, it pales in terms of tonnage: over the same period of time, North America grew by 136 million people, Central / South America by 187 million, Africa by 532 million peopl, and Asia / Oceania by…1.3 billion people!

We look at how the populations of former Soviet states fared after they become independent countries in 1992.

# We define the former Soviet bloc, visualize growth from 1992 to 2010, and calculate overall and annualized growth rates. 

soviet <- c("Armenia", "Belarus", "Estonia", "Georgia", "Kazakhstan", "Kyrgyzstan", "Latvia", "Lithuania", "Moldova", "Russia", "Tajikstan", "Turkmenistan", "Ukraine", "Uzbekistan")

wp.calc %>% 
  filter(country %in% soviet & year >= 1992) %>%
  select(country, year, pop) %>% 
  ggplot(aes(x = year, y= pop, group = country, colour = country)) +
  geom_line() +
  geom_point(size = 1.5) +
    labs(title = "Population of Ex-SSR States",
       subtitle = "1992-2010") +
  scale_x_continuous(breaks = seq(from = 1990, to = 2010, by = 5))

wp.calc %>% 
  filter(country %in% soviet) %>% 
  group_by(country) %>% 
  spread(year, pop) %>% 
  summarise(start = sum(`1992`, na.rm = T), end = sum(`2010`, na.rm = T), change.net = end - start, RoG = change.net / start, expon = (1 / (2010 - 1992)), CAGR = (end / start)^(expon) - 1) %>% 
  select(country, start, end, change.net, RoG, CAGR) %>% 
  arrange(desc(RoG)) %>% 
  tbl_df()

## # A tibble: 13 x 6
##    country       start    end change.net     RoG     CAGR
##    <chr>         <dbl>  <dbl>      <dbl>   <dbl>    <dbl>
##  1 Turkmenistan   3.83   4.94      1.11   0.290   0.0142 
##  2 Uzbekistan    21.6   27.9       6.26   0.290   0.0142 
##  3 Kyrgyzstan     4.50   5.51      1.01   0.225   0.0114 
##  4 Moldova        4.44   4.32     -0.121 -0.0272 -0.00153
##  5 Lithuania      3.70   3.55     -0.155 -0.0419 -0.00237
##  6 Belarus       10.2    9.61     -0.615 -0.0602 -0.00344
##  7 Russia       148.   139.       -9.01  -0.0607 -0.00347
##  8 Kazakhstan    16.5   15.5      -1.08  -0.0654 -0.00375
##  9 Armenia        3.38   2.97     -0.412 -0.122  -0.00719
## 10 Ukraine       51.9   45.4      -6.45  -0.124  -0.00735
## 11 Georgia        5.36   4.60     -0.754 -0.141  -0.00840
## 12 Latvia         2.62   2.22     -0.398 -0.152  -0.00913
## 13 Estonia        1.53   1.29     -0.238 -0.155  -0.00934

The Central Asian former SSRs - save Kazakhstan - all saw annualized growth averaging over 1% - all other former Soviet bloc countries shrunk in size. The most dramatic percentage declines are seen in Armenia (12.2%), Ukraine (12.4%), Georgia (14.1%), Latvia (15.2%), and Estonia (15.5%). The largest absolute population decreases are found in the biggest states - Russia lost over 9 million people, and the Ukraine nearly 6.5 million. That’s remarkable, and attributed to lower birth rates and abnormally high death rates.

Next, we try to find the impact of the drug war on Central American populations over the last decade. The whole region has been destablized, but we’ll focus a few countries directly impacted by changes in drug trade - Colombia, Peru, Guatemala, and Mexico.

# We define the four countries of interest, visualize growth over the last decade, and calculate overall and annualized growth rates.

drugs <- c("Colombia", "Peru", "Guatemala", "Mexico")

wp.calc %>% 
  filter(country %in% drugs & year >= 2000) %>%
  select(country, year, pop) %>% 
  ggplot(aes(x = year, y= pop, group = country, colour = country)) +
  geom_line() +
  geom_point(size = 1.5) +
    labs(title = "Population of Central American States Affected by the Drug War",
       subtitle = "2000-2010") +
  scale_x_continuous(breaks = seq(from = 2000, to = 2010, by = 1)) +
  scale_y_continuous(breaks = seq(from = 0, to = 125, by = 25)) +
  expand_limits(y = 0)

wp.calc %>% 
  filter(country %in% drugs) %>% 
  group_by(country) %>% 
  spread(year, pop) %>% 
  summarise(start = sum(`2000`, na.rm = T), end = sum(`2010`, na.rm = T), change.net = end - start, RoG = change.net / start, expon = (1 / (2010 - 2000)), CAGR = (end / start)^(expon) - 1) %>% 
  select(country, start, end, change.net, RoG, CAGR) %>% 
  arrange(desc(RoG)) %>% 
  tbl_df()

## # A tibble: 4 x 6
##   country   start   end change.net   RoG   CAGR
##   <chr>     <dbl> <dbl>      <dbl> <dbl>  <dbl>
## 1 Guatemala  11.1  13.6       2.47 0.222 0.0203
## 2 Peru       26.1  29.9       3.82 0.146 0.0138
## 3 Colombia   38.9  44.2       5.29 0.136 0.0128
## 4 Mexico     99.9 112.       12.5  0.126 0.0119

It’s difficult to observe the impact in country populations trended over 30 years. We likely need to explore mortality rates and dive into causes of death (violent, as well as the general health impact of conflict and criminality), with particular attention paid to different age strata.

Lastly, we look at China and India, population giants who responsible for a considerable portion of global growth in human headcount.

# We examine population growth in India and China - first graphically, and then by calculating growth rates over the three decades and the equivalent CAGR.

Sin.Ind <- c("China", "India")

wp.calc %>% 
  filter(country %in% Sin.Ind) %>% 
  select(country, year, pop) %>% 
  ggplot(aes(x = year, y= pop, group = country, colour = country)) +
  geom_line() +
  geom_point(size = 1.5) +
    labs(title = "Population of China and India",
       subtitle = "1980-2010") +
  scale_x_continuous(breaks = seq(from = 1980, to = 2010, by = 5)) +
  scale_y_continuous(breaks = seq(from = 0, to = 1500, by = 250)) +
  expand_limits(y = 0)

wp.calc %>% 
  filter(country %in% Sin.Ind) %>% 
  spread(year, pop) %>% 
  group_by(country) %>% 
  summarise(start = sum(`1980`, na.rm = T), end = sum(`2010`, na.rm = T), change.net = end - start, RoG = change.net / start, expon = (1 / (2010 - 1980)), CAGR = (end / start)^(expon) - 1) %>% 
  select(country, start, end, change.net, RoG, CAGR) %>% 
  arrange(desc(CAGR)) %>% 
  tbl_df

## # A tibble: 2 x 6
##   country start   end change.net   RoG   CAGR
##   <chr>   <dbl> <dbl>      <dbl> <dbl>  <dbl>
## 1 India    685. 1173.       488. 0.713 0.0181
## 2 China    985. 1330.       345. 0.351 0.0101

Both countries evidence a smooth, linear growth rate responsible for their dramatic stature viz. total world population. Health infrastructure improvements combined with the high birth rates that result from traditions of early marriage and preference for male children are reponsbile for this growth. Compared with India, China’s efforts to manage population growth (i.e. it’s one-child policy) start to demonstrate a change in slope in the early to mid-90s.

Conclusions: * The Middle East grew fastest over 1980-2010, but it’s smaller population base means other regions grew more in absolute terms - with Asia in the lead having added 1.3 billion people. * All former SSRs save Turkmenistan, Kyrgyzstan, and Uzbekistan shrunk in size since the dissolution of the USSR. This is mor pronounced in Russia and the Ukraine. * The impact of the drug war on populations in Colombia, Peru, Guatemala, and Mexico is not evident in these statistics. * While China and India are the world’s giants, China’s one-child policy has served as a governer on its population growth.

Data 607_Project 2

Jeremy O’Brien

March 7, 2018

1. US Chronic Disease Indicators (CDI), c/o Niteen Kumar

2. US Electric Grid, c/o Rose Koh

3. World population, c/o Steven Tipton