Data Set 1: Agricultural Productivity in the US

Description

The USDA publishes data on Agricultural Productivity in the US. Table 1a of this dataset contains price indices and implicit quantities of 29 farm outputs and inputs between 1948 and 2015. The data-set is untidy as it has multiple columns for both price and quantity for the various inputs and outputs. The file is also decorated with merged columns and footnotes. Looking at the data before ingestion will help determine where the names are and how to reshape it. The result of the ETL will be a tidy data-set where every column is a distinct variable, aggregate columns will be identified, as well as the direction of the commodity flow (input or output).

ETL

Process Names

The first step will be to process the names. They are stored in a combination of merged columns and multiple rows, so the process will be very manual and based on inspection. There are also many columns with the same names, as they are total or aggregate columns. Giving these columns unique names will greatly ease the ETL process, as it will allow the separation of price and quantity data, to be later joined on commodity name. These aggregate columns will be identified so that analysis can be done without the fear of overcounting. Lastly, the type of commodity will be added as well, also for ease of downstream analysis.

# Ingest
T1RawN <- unlist(fread("./table01a.csv", skip = 5L, nrows = 1L, drop = 60:90))
# Remove Empty entries due to merged columns
T1RawN <- T1RawN[nzchar(T1RawN)]
# Remove names
names(T1RawN) <- NULL
# Split into Input and Output to handle names on PRIOR row
T1RawNa <- T1RawN[1:11]
T1RawNb <- T1RawN[12:length(T1RawN)]
# Some of these names are added by inspection and not explicit ingestion
T1RawNa <- c("Total output", T1RawNa, "Farm-related output")
# Give unique names to totals for later joins or add multi-layered headers
T1RawNa[2] <- "All Livestock"
T1RawNa[6] <- "All Crops"
T1RawNa[12] <- "Other crops"
T1RawNb[1] <- "All Capital"
T1RawNb[6] <- "All Labor"
T1RawNb[9] <- "All Intermediate Goods"
T1RawNb <- c("Total farm input", T1RawNb)
# Combine proper input and output names
T1RawNames <- c(T1RawNa, T1RawNb)
# Identify the inputs and the outputs
T1Flow <- c(rep("Output", length(T1RawNa)), rep("Input", length(T1RawNb)))
# Identify the commodity type. This is by inspection.
T1Types <- c("Farm Output", rep("Livestock", 4L), rep("Crops", 7L), "Other",
             "Farm Input", rep("Capital", 5L), rep("Labor", 3L),
             rep("Intermediate goods", 7L))
# Identify if the column is an aggregate or total column—by inspection
T1Agg <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,
           FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
           TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
# Create a table for later joining
T1Names <- data.table(Commodity = T1RawNames,
                      Flow = T1Flow, Type = T1Types, Aggregate = T1Agg)

Process Data

The next step is to process the data. Now that each commodity is uniquely named, the data-set can be split into two tables: one for price and one for quantity. Once split, these are both valid wide tables, since every row means something unique and every column is also a unique data element, although most are the same kind of variable. These will both be reshaped to long format, so that each column is a unique variable type, and the commodity identification is in the rows as well. The to long tables will be inner-joined to each other on the year and the commodity type, and the resulting table will be joined to the table with the flow, commodity type, and aggregate identifier table.

# Process Data - Split into Price and Quantity
T1Raw <- fread("./table01a.csv", skip = 6L, nrows = 68L, drop = 60:90,
                  data.table = FALSE)
# Take advantage that every other column is a price or a quantity
T1RawPrice <- T1Raw[, c(1, seq(2, 58, 2))]
T1RawQuant <- T1Raw[, c(1, seq(3, 59, 2))]
# Same order and same names
names(T1RawPrice) <- names(T1RawQuant) <- c("Year", T1RawNames)
# Turn data frames into data tables by reference
setDT(T1RawPrice)
setDT(T1RawQuant)
# Now reshape the Price & Quantity tables from "WIDE" to "LONG"
T1P <- melt(T1RawPrice, id.vars = "Year", variable.name = "Commodity",
            value.name = "Price", variable.factor = FALSE)
T1Q <- melt(T1RawQuant, id.vars = "Year", variable.name = "Commodity",
            value.name = "Quantity", variable.factor = FALSE)
# Now Inner Join the Price to the Quantity on both YEAR and COMMODITY
T1 <- T1P[T1Q, on = c(Year = "Year", Commodity = "Commodity"),
          nomatch = NULL][T1Names, on = "Commodity==Commodity", nomatch = NULL]
setkey(T1, Year)

Analysis

The suggested analyses in the post included:

Comparing the price of food types over time
Identifying the food types which have jumped the most in price over time
1. Is there a relationship between jump and quantity
Consider the relationship between labor and capital costs

The dataset is already on an equivalent dollar basis pegged to 2005, so prices are comparable over time.

Food Prices

The analysis is clearly restricted to farm outputs. The first question is to decide which output is considered a “food”. Looking at the commodities, it is reasonable to consider all non-aggregate outputs as food except for Feed Crops, Oil Crops, and Farm-related output.

Below is a plot of the annual prices of the food commodities with a locally smoothed (loess) trend overlay. The slopes and inflection points look similar for most commodities, with Food grains having a greater amplitude. Interestingly, the smoothed trendline for Fruits and nuts increased its slope around 1991 and continues to climb faster than other food commodities.

Food <- T1RawNa[c(3, 4, 5, 7, 10, 11, 12)]
ggplot(T1[Commodity %chin% Food], aes(x = Year, y = Price, color = Commodity)) +
  geom_line() + stat_smooth(method = "loess", se = FALSE) +
  scale_colour_brewer(type = "qual", palette = "Dark2") +
  scale_x_continuous(breaks = seq(1950, 2015, 5))

Price Jumps

This question depends on what is meant by a jump. Total change from 1948 to 2015 is shown below:

knitr::kable(T1[Commodity %chin% Food,
                {Growth = Price[.N] / Price[1];
                .(`Total Growth` = Growth,
                  `Annual Trend` = Growth ^ (1 / 68) - 1)},
                by = Commodity][order(-`Annual Trend`)])

Commodity	Total Growth	Annual Trend
Fruits and nuts	7.335386	0.0297382
Meat animals	5.402912	0.0251182
Vegetables and melons	4.621757	0.0227667
Other crops	3.943467	0.0203823
Dairy	3.658259	0.0192564
Food grains	2.255163	0.0120309
Poultry and eggs	1.652258	0.0074118

The only two comestible commodities with average annual trend rates of over 2.5% are Fruits and nuts and Meat animals. We can compare those to their respective quantities on the same graph by scaling them to be mean 0 and variance 1.

Expensive <- T1[Commodity %chin% c("Fruits and nuts", "Meat animals"),
                c("Year", "Commodity", "Price", "Quantity")]
Expensive[,
          c("Scaled Price", "Scaled Quantity") :=
            lapply(.SD, function(x) as.vector(scale(x))),
          by = Commodity,
          .SDcols = c("Price", "Quantity")
          ][, `:=`(c("Price", "Quantity"), NULL)]
Expensive <- melt(Expensive, id.vars = c("Year", "Commodity"))
ggplot(Expensive, aes(x = Year, y = value, color = Commodity, lty = variable)) +
  geom_path() + scale_colour_brewer(type = "qual", palette = "Dark2") +
  scale_x_continuous(breaks = seq(1950, 2015, 5))

The plot shows that there is a pretty strong correlation between the prices and available quantities of these items. This would imply that demand is driving supply. For if supply was leading demand, prices should drop.

knitr::kable(T1[Commodity %chin% c("Fruits and nuts", "Meat animals"),
                .(`Price-Quantity Correlation` = cor(Price, Quantity)),
                by = Commodity])

Commodity	Price-Quantity Correlation
Meat animals	0.7208833
Fruits and nuts	0.9510602

An analysis of price jumping is more complicated. It is related to change-point analysis, about which the interested reader may find more information in Adler (2015)¹ and the bibliography therein.

Labor and Capital

The simplest approach to answering this question would be to compare the aggregate labor price and quantity with that of the aggregate capital variable. This will be similar to the analysis done on the expensive foods.

LabCap <- T1[Commodity %chin% c("All Labor", "All Capital"),
                c("Year", "Commodity", "Price", "Quantity")]
LabCap[,
          c("Scaled Price", "Scaled Quantity") :=
            lapply(.SD, function(x) as.vector(scale(x))),
          by = Commodity,
          .SDcols = c("Price", "Quantity")
          ][, `:=`(c("Price", "Quantity"), NULL)]
LabCap <- melt(LabCap, id.vars = c("Year", "Commodity"))
ggplot(LabCap, aes(x = Year, y = value, color = Commodity, lty = variable)) +
  geom_path() + scale_colour_brewer(type = "qual", palette = "Dark2") +
  scale_x_continuous(breaks = seq(1950, 2015, 5))

Here we see an inverse relationship between price and quantity. As alluded to earlier, this makes sense when supply drives demand, so when supply drops the cost of labor and capital increase. There is also a positive correlation between labor and capital of 86%.

Data Set 2: United Nations Migration Data

Description

There are a number of datasets and tables provided by the UN on worldwide immigration. I will be focusing on data in the “Total international migrant stock” files—2019 revision.

The questions posed in the post suggesting this dataset cannot be answered in their entirety from this data. For example, how many immigrants can the US accept is not something the UN has, or even can, determine. What the first table in the file contains is the quinquennial estimates of non-native populations within countries—denoted as the migrant stock—with an additional estimate for 2019, broken into males and females. It also contains the total population estimates by country in the same Excel file. I will save the tabs as individual CSV files and work with those

ETL

From a tidy perspective this data is awful. Not only does it repeat column headings for years three times, it also has a three-to-four level heading structure for the countries. Moreover, I can discern no logic in the structure that could be utilized for a programmatic decision as to what row is a major area, a World Bank income group, a geographic region, a Sustainable Development Goal (SDG) region, a sub-region, or a country. Therefore, I deemed it easiest and most efficient to manually create a table of countries in tidy format to be used for all the data, and to extract the numerical data and tidy them accordingly.

Another problem is that not every country has data broken into male and female. Therefore, some questions will not be able to be answered by sex.

# Manually created country list from UN file. This file has the SDG region and
# subregion as additional variables.
Countries <- fread("Countries.csv", encoding = 'UTF-8')
# Read Migrant Data (Excel file table 1 saved as CSV)
RawMig <- fread("UN_MigrantStockTotal_2019.csv", skip = 17L, encoding = 'UTF-8')
# Isolate the male and female numbers, together with region
RawMigBoth <- RawMig[, c(2, 6:12)]
RawMigMale <- RawMig[, c(2, 13:19)]
RawMigFemale <- RawMig[, c(2, 20:26)]
# Rename them
names(RawMigBoth) <- names(RawMigMale) <- names(RawMigFemale) <- c(
  "Country", seq(1990, 2015, 5), 2019)
# Remove aggregates keeping only individual countries
MigMale <- RawMigMale[Country %chin% Countries$Country]
# Reshape from wide to long
MigMale <- melt(MigMale, id.vars = "Country", variable.name = "Year",
                value.name = "MigCount")
# Add Sex in preparation for joining to female
MigMale[, Sex := "Male"]

# Perform same adjustments to female data
MigFemale <- RawMigFemale[Country %chin% Countries$Country]
MigFemale <- melt(MigFemale, id.vars = "Country", variable.name = "Year",
                  value.name = "MigCount")
MigFemale[, Sex := "Female"]

# Perform same adjustments to combined data
MigBoth <- RawMigBoth[Country %chin% Countries$Country]
MigBoth <- melt(MigBoth, id.vars = "Country", variable.name = "Year",
                  value.name = "MigCount")
MigBoth[, Sex := "Combined"]

# Combine the data
MigrantData <- rbind(MigMale, MigFemale, MigBoth)
# Ensure that count is a number and not character; missing will be NA
MigrantData$MigCount <- as.integer(MigrantData$MigCount)
# Join to countries table
MigrantData <- Countries[MigrantData, on = "Country", nomatch = NULL]

# Perform equivalent ETL on the population data
RawPop <- fread("UN_PopulationTotal_2019.csv", skip = 17L, encoding = 'UTF-8')
RawPopBoth <- RawPop[, c(2, 5:11)]
RawPopMale <- RawPop[, c(2, 12:18)]
RawPopFemale <- RawPop[, c(2, 19:25)]
names(RawPopBoth) <- names(RawPopMale) <- names(RawPopFemale) <- c(
  "Country", seq(1990, 2015, 5), 2019)
PopMale <- RawPopMale[Country %chin% Countries$Country]
PopMale <- melt(PopMale, id.vars = "Country", variable.name = "Year",
                value.name = "PopCount")
PopMale[, Sex := "Male"]
PopFemale <- RawPopFemale[Country %chin% Countries$Country]
PopFemale <- melt(PopFemale, id.vars = "Country", variable.name = "Year",
                  value.name = "PopCount")
PopFemale[, Sex := "Female"]
PopBoth <- RawPopBoth[Country %chin% Countries$Country]
PopBoth <- melt(PopBoth, id.vars = "Country", variable.name = "Year",
                  value.name = "PopCount")
PopBoth[, Sex := "Combined"]
PopData <- rbind(PopMale, PopFemale, PopBoth)
# Population data was in 1000s so need to multiply by 1000 too
PopData$PopCount <- as.integer(PopData$PopCount) * 1000
# Inner Join the migrant data to the population data
UNData <- PopData[MigrantData, on = c(Year = "Year", Country = "Country",
                                      Sex = "Sex"), nomatch = NULL]
# Order the columns---really unnecessary
setcolorder(UNData, c("Year", "Country", "SDG Region", "SubRegion", "Continent",
                      "Sex", "PopCount", "MigCount"))
# Index by Year and Country. This breaks the UN ordering, but should make
# searches more efficient
setkey(UNData, Year, Country)

Analysis

The only question in the original post that can be answered from this data is the one asking in which survey year was there the largest migrant stock (highest number of foreign-born immigrants). The first question cannot be answered from this file, although there is another file which does contain the necessary information. The third and fifth questions are identical, and cannot be answered by this data, or the UN at all for this matter. The fourth question is one of flow and this data is one of stock. The data does not reflect the year of entry, in only reflects the total number of foreign-born immigrants living in the country at that time. Whether they have been there for 2 or 20 years is not in the data. Therefore, I will answer some other questions. Note that for all of these questions, missing data will be ignored (treated as 0).

In which survey year was there the largest migrant stock?
In which survey year was there the greatest disparity between males and females?
Which country has the largest population of migrant stock for each survey year?
Which country has the smallest population of migrant stock for each survey year?
Which country has the highest population percentage of migrant stock for each survey year?
Which country has the lowest population percentage of migrant stock for each survey year?
Which SDG Region has the highest population percentage of migrant stock for each survey year?
Which SDG Region has the lowest population percentage of migrant stock for each survey year?

Question 1

In which survey year was there the largest migrant stock? Looking at the total migrant stock numbers, the answer is 2019.

# Add up all migrant stock values for which there is data, group and order by
# year, and pass it to kable to make a prettier table
knitr::kable(x = {UNData[Sex == "Combined",
                    .(TotMigStock = sum(MigCount, na.rm = TRUE)),
                    keyby = Year]}, format.args = list(big.mark = ","))

Year	TotMigStock
1990	159,207,561
1995	166,953,713
2000	179,006,588
2005	197,633,172
2010	227,828,148
2015	256,932,403
2019	281,914,164

Question 2

In which survey year was there the greatest disparity between males and females? Looking at the migrant stock numbers by survey year, the answer is 2019.

# First add up all the migrant stock, this time by sex and by year, ignoring the
# combined numbers. Then cast it to a wide format. Then calculate the absolute
# difference between the female and male numbers and return that via kable
knitr::kable(x = dcast(UNData[Sex != "Combined",
                              .(MigStock = sum(MigCount, na.rm = TRUE)),
                              keyby = c("Year", "Sex")],
                       Year ~ Sex, value.var = "MigStock"
                       )[, Diff := abs(Female - Male)],
             format.args = list(big.mark = ","))

Year	Female	Male	Diff
1990	78,451,915	80,755,646	2,303,731
1995	82,460,396	84,493,317	2,032,921
2000	88,285,870	90,720,718	2,434,848
2005	96,795,340	100,837,832	4,042,492
2010	110,276,690	117,551,458	7,274,768
2015	124,077,275	132,855,128	8,777,853
2019	135,326,598	146,587,566	11,260,968

Question 3

Which country has the largest population of migrant stock for each survey year?

# Limited to combined data, return a sub-data table of Country & Migrant Count
# for the country with the largest migrant count, by survey year, to kable
knitr::kable(UNData[Sex == "Combined",
                    .SD[which.max(MigCount)],
                    by = Year,
                    .SDcols = c("Country", "MigCount")],
             format.args = list(big.mark = ","))

Year	Country	MigCount
1990	United States of America	23,251,026
1995	United States of America	28,451,053
2000	United States of America	34,814,053
2005	United States of America	39,258,293
2010	United States of America	44,183,643
2015	United States of America	48,178,877
2019	United States of America	50,661,149

Question 4

Which country has the smallest population of migrant stock for each survey year?

# Limited to combined data, return a sub-data table of Country & Migrant Count
# for the country with the smallest migrant count, by survey year, to kable
knitr::kable(UNData[Sex == "Combined",
                    .SD[which.min(MigCount)],
                    by = Year,
                    .SDcols = c("Country", "MigCount")],
             format.args = list(big.mark = ","))

Year	Country	MigCount
1990	Saint Helena	178
1995	Saint Helena	117
2000	Saint Helena	108
2005	Saint Helena	163
2010	Tuvalu	220
2015	Tuvalu	230
2019	Tuvalu	238

Question 5

Which country has the highest population percentage of migrant stock for each survey year?

# Limited to combined data, return a sub-data table of Country & Migrant Count
# for the country with the smallest migrant count, by survey year, to kable
knitr::kable(UNData[Sex == "Combined" & PopCount > 0,
                    .SD[which.max(MigCount / PopCount)],
                    keyby = Year,
                    .SDcols = c("Country", "MigCount", "PopCount")
                    ][, Ratio := MigCount / PopCount],
             format.args = list(big.mark = ","))

Year	Country	MigCount	PopCount	Ratio
1990	Bonaire, Sint Eustatius and Saba	38,410	13,000	2.9546154
1995	Bonaire, Sint Eustatius and Saba	42,258	15,000	2.8172000
2000	Bonaire, Sint Eustatius and Saba	46,105	14,000	3.2932143
2005	Bonaire, Sint Eustatius and Saba	62,205	14,000	4.4432143
2010	United Arab Emirates	7,316,697	8,549,000	0.8558541
2015	United Arab Emirates	7,995,126	9,262,000	0.8632181
2019	United Arab Emirates	8,587,256	9,770,000	0.8789412

Question 6

Which country has the lowest population percentage of migrant stock for each survey year?

# Limited to combined data, return a sub-data table of Country & Migrant Count
# for the country with the smallest migrant count, by survey year, to kable
knitr::kable(UNData[Sex == "Combined" & PopCount > 0,
                    .SD[which.min(MigCount / PopCount)],
                    keyby = Year,
                    .SDcols = c("Country", "MigCount", "PopCount")
                    ][, Ratio := MigCount / PopCount],
             format.args = list(big.mark = ","))

Year	Country	MigCount	PopCount	Ratio
1990	China	376,361	1,197,362,000	0.0003143
1995	China	442,198	1,262,276,000	0.0003503
2000	China	508,034	1,312,517,000	0.0003871
2005	China	678,947	1,353,482,000	0.0005016
2010	Cuba	6,640	11,225,000	0.0005915
2015	Cuba	5,312	11,324,000	0.0004691
2019	Cuba	4,886	11,333,000	0.0004311

Question 7

Which SDG Region has the highest population percentage of migrant stock for each survey year?

# Limited to combined data, first calculate a ratio by year and by SDG region.
# Then chain that to a call returning the largest ratio by year piped to kable.
knitr::kable(UNData[
  Sex == "Combined" & PopCount > 0,
  .(Ratio = sum(MigCount, na.rm = TRUE) / sum(PopCount, na.rm = TRUE)),
  by = c("Year", "SDG Region")
  ][,.SD[which.max(Ratio)], by = "Year"])

Year	SDG Region	Ratio
1990	Oceania	0.1450105
1995	Oceania	0.1419453
2000	Oceania	0.1403243
2005	Oceania	0.1462045
2010	Oceania	0.1571951
2015	Oceania	0.1638907
2019	Oceania	0.1706701

Question 8

Which SDG Region has the lowest population percentage of migrant stock for each survey year?

# Limited to combined data, first calculate a ratio by year and by SDG region.
# Then chain that to a call returning the smallest ratio by year piped to kable.
knitr::kable(UNData[
  Sex == "Combined" & PopCount > 0,
  .(Ratio = sum(MigCount, na.rm = TRUE) / sum(PopCount, na.rm = TRUE)),
  by = c("Year", "SDG Region")
  ][,.SD[which.min(Ratio)], by = "Year"])

Year	SDG Region	Ratio
1990	Eastern and South-Eastern Asia	0.0037196
1995	Eastern and South-Eastern Asia	0.0042764
2000	Eastern and South-Eastern Asia	0.0051381
2005	Eastern and South-Eastern Asia	0.0060918
2010	Eastern and South-Eastern Asia	0.0071524
2015	Eastern and South-Eastern Asia	0.0078413
2019	Eastern and South-Eastern Asia	0.0078371

Data Set 3: Renewable Energy Consumption Data

Description

This dataset is collated by OpenEI, and contains the sources of renewable energy by sector for the years 1989–2008. Like most of these untidy datasets, it is in wide format with multi-level headings. However, like the others, it is small, so a combination of intelligent ingestion and programmatic cleaning will be the quickest and most efficient way to perform the ETL.

ETL

The file will be read in its entirety, and then parsed to individual sector tables by inspection. The sectors will be added to the data, and then all will be combined into a single wide-format table. At this point, the data will be cleaned from footnotes and markers which are effectively zero finally, the data will be transformed into tidy-long format and any remaining numbers stored as character will be converted back to numbers. At this point the data is ready for analysis.

# Read
Raw <- fread("historicalrenewableenergyconsumptionbysectorandenergysource19892008.csv",
      skip = 6L)
# Set column names
names(Raw) <- c("Source", 1989:2008)
# Parse by sector and add sector identifier
Residential <- Raw[15:17, ]
Residential[, Sector := "Residential"]
Commercial <- Raw[21:26, ]
Commercial[, Sector := "Commercial"]
Industrial <- Raw[30:36, ]
Industrial[, Sector := "Industrial"]
Transportation <- Raw[39, ]
Transportation[, Sector := "Transportation"]
ElecUtilities <- Raw[46:51, ]
ElecUtilities[, Sector := "ElecUtilities"]
ElecIPP <- Raw[55:60, ]
ElecIPP[, Sector := "ElecIPP"]
# Recombine
Power <- rbind(Residential, Commercial, Industrial, Transportation,
               ElecUtilities, ElecIPP)
# Clean Values from * & - and extraneous footnotes
Power <- Power[, lapply(.SD, function(x)  gsub("[*-]", "0", x))]
Power$Source <- gsub("[[:digit:]]", "", Power$Source)
# Reshape to long format
Power <- melt(Power, id.vars = c("Sector", "Source"),
              variable.name = "Year",
              variable.factor = FALSE,
              value.name = "BTU_Q")
# Ensure numbers are numbers
Power$Year <- as.double(Power$Year)
Power$BTU_Q <- as.double(Power$BTU_Q)

Analysis

The analysis asked for in the post was to “…analyze the data to figure out what type of renewable energy is used the most or measure the amount of renewable energy that each sector consumes per year.”

Renewable Use by Sector

ggplot(Power[, .(BTU_Quad = sum(BTU_Q)), keyby = c("Year", "Sector")],
       aes(x = Year, y = BTU_Quad, color = Sector)) + geom_path() +
  scale_color_brewer(type = "qual", palette = "Dark2")

Renewable Use by Source

ggplot(Power[, .(BTU_Quad = sum(BTU_Q)), keyby = c("Year", "Source")],
       aes(x = Year, y = BTU_Quad, color = Source)) + geom_path() +
  scale_color_brewer(type = "qual", palette = "Set1")

Observation

The heaviest using sector is the electrical power utilities The heaviest used resource is hydroelectricity—whose usage tracks the utility usage closely. This stands to reason as the data indicates that the power utilities use hydro far and away as their largest source.

Another interesting point is that the rise in biomass usage mirrors the rise in the transportation sector. What was surprising to me was that the correspondence is 1:1. Only the transportation sector uses biomass, and it uses nothing else. This may be a miscoding or an older nomenclature, as more recent documentation indicates that biofuels are used in the transportation industry. See this briefing by the Environmental and Energy Study Institute, as an example.

Avraham Adler, A Survey of Approaches to a Changepoint Problem in an Actuarial Context, Variance, 9(1): 64–100, 2015. http://www.variancejournal.org/issues/?fa=article&abstrID=7179 ↩︎

DT 607—Project 2

CUNY MSDS—Fall 2019

Avraham Adler

9/26/2019

Data Set 1: Agricultural Productivity in the US

Description

ETL

Process Names

Process Data

Analysis

Food Prices

Price Jumps

Labor and Capital

Data Set 2: United Nations Migration Data

Description

ETL

Analysis

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Data Set 3: Renewable Energy Consumption Data

Description

ETL

Analysis

Renewable Use by Sector

Renewable Use by Source

Observation