The goal of this assignment is to prepare different datasets1 (databases) for downstream analysis work. The following three datasets will be used for this project; each dataset source was identified by the team member listed below.

1. World Development Indicators (WDI) - Kyle Gilde

2. IRS’s Statistics of Income (SOI) - Pavan Akula

3. United States Census Bureau - Imports of Crude Oil - Nnaemezue Obi-eyisi

Libraries used for the project

#install.packages("xlsx")
library(xlsx)     # used for loading files with extension xlsx and xls
library(utils)    # loading csv files
library(tidyr)    # formatting data 
library(dplyr)    # subsetting and changing data layout
library(stringr)  # regular expressions and data manipulation
library(ggplot2)  # graph generation
library(knitr)    # report display, table format
library(readr)    # reading text files data

1. World Development Indicators (WDI)

The World Bank maintains the World Development Indicators (WDI) database, a collection of development indicators compiled from officially recognized international sources. It presents the most current and accurate global development data available and includes national, regional and global estimates. The database covers a wide range of indicator topics, including Agriculture & Rural Development, Aid Effectiveness, Climate Change, Economy & Growth, Education, Energy & Mining, Environment, External Debt, Financial Sector, Gender, Health, Infrastructure, Labor & Social Protection, Poverty, Private Sector, Public Sector, Science & Technology, Social Development, Trade and Urban Development.

The World Bank classifies the data by economy, region, income group and lending group. The database covers 1960 to the present and is updated quarterly. The complete database and data dictionary are available as a Zip file on the World Bank website: http://data.worldbank.org/data-catalog/world-development-indicators.
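
The analysis below assumes that WDI_Series.csv and WDI_Data.csv from that archive are already present in the working directory. A minimal sketch for fetching and extracting them is shown here; the ZIP URL and the file names inside the archive are assumptions based on the data catalog page, not confirmed values.

# Hedged sketch: download and extract the WDI bulk archive
# The URL below is an assumption; check the data catalog page for the current link
wdi.zip.url <- "http://databank.worldbank.org/data/download/WDI_csv.zip"   # assumed URL
wdi.zip.file <- file.path(getwd(), "WDI_csv.zip")

download.file(wdi.zip.url, destfile = wdi.zip.file, mode = "wb")

# Extract only the two files used in this analysis (file names inside the ZIP are assumed)
unzip(wdi.zip.file, files = c("WDI_Series.csv", "WDI_Data.csv"), exdir = getwd())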

More information about data classification and collection can be found at

  1. Country classification: https://datahelpdesk.worldbank.org/knowledgebase/articles/378834-how-does-the-world-bank-classify-countries
  2. Data Compilation Methodology: https://datahelpdesk.worldbank.org/knowledgebase/articles/201203
  3. Country and Lending Groups: https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups

Using the WDI database, development indicators from the last 10 years that fall under the Financial Sector: Access topic will be analysed for the G20 nations. More specifically, only one indicator will be analysed: the average transaction cost of sending and receiving remittances from or to a specific country.

# Load World Indicators from WDI_Series.csv
# This file contains information about each indicator
# There are about 17 indicators that fall under Financial Sector: Access
wd <- getwd()
wdi.indicator.data <- read.csv(file.path(wd, "WDI_Series.csv"), stringsAsFactors = FALSE, header = TRUE)
wdi.indicator.finsec.access <- wdi.indicator.data %>% 
  filter(Topic == "Financial Sector: Access") %>% 
  select(Indicator.Name)

# G20 nations
#nations <- c("Argentina","Australia","Brazil","Canada","China","France","Germany","India","Indonesia","Italy","Japan","Mexico","Russian Federation","Saudi Arabia","South Africa","South Korea","Turkey","United Kingdom","United States","Austria","Belgium","Bulgaria","Croatia","Cyprus","Czech Republic","Denmark","Estonia","Finland","Greece","Hungary","Ireland","Latvia","Lithuania","Luxembourg","Malta","Netherlands","Poland","Portugal","Romania","Slovak Republic","Slovenia","Spain","Sweden")

nations <- c("Argentina","Australia","Brazil","Canada","China","France","Germany","India","Indonesia","Italy","Japan","Mexico","Russia","Saudi Arabia","South Africa","South Korea","Turkey","United Kingdom","United States","European Union")

g20.nations <- data.frame(nations, stringsAsFactors = FALSE)

# Assign unique id's to nations
g20.nations <- g20.nations %>% 
  mutate(id = row_number())

# Load World Development Indicators (WDI) database
# Database has 383328 observations with 61 variables
wdi.raw.data <- read.csv(file.path(wd, "WDI_Data.csv"), stringsAsFactors = FALSE, header = TRUE)

#str(wdi.raw.data)
# Clean the data; some indicator descriptions end with [ts] or [w2]
wdi.raw.data$Indicator.Name <- gsub("\\[ts\\]", "" ,wdi.raw.data$Indicator.Name)
wdi.raw.data$Indicator.Name <- gsub("\\[w2\\]", "" ,wdi.raw.data$Indicator.Name)
wdi.raw.data$Indicator.Name <- str_trim(wdi.raw.data$Indicator.Name)
wdi.raw.data$Country.Name <- str_replace(wdi.raw.data$Country.Name, "Russian Federation", "Russia")
wdi.raw.data$Country.Name <- str_replace(wdi.raw.data$Country.Name, "Korea, Rep\\.", "South Korea")

# Filter data for last 10 years excluding 2016
# Get data only for G20 nations
wdi.g20.nations.raw <- wdi.raw.data %>% 
  inner_join(wdi.indicator.finsec.access, by = "Indicator.Name") %>%
  inner_join(g20.nations, by = c("Country.Name" = "nations")) %>%
  select (id, Country.Name, Country.Code, Indicator.Name, X2006:X2015)

#unique(wdi.g20.nations.raw$Indicator.Name)
#unique(wdi.g20.nations.raw$Country.Name)
# Tidy the data set

wdi.g20.nations.tidy.data <- wdi.g20.nations.raw %>%
  gather(Year, value = Measure, X2006:X2015, na.rm = FALSE)

wdi.g20.nations.tidy.data$Measure = ifelse(is.na(wdi.g20.nations.tidy.data$Measure), 0, round(wdi.g20.nations.tidy.data$Measure, 2))

wdi.g20.nations.tidy.data$Year <- gsub("[[:alpha:]]", "" ,wdi.g20.nations.tidy.data$Year)

# Inspect and shorten the indicator names
unique(wdi.g20.nations.tidy.data$Indicator.Name)
##  [1] "Account at a financial institution (% age 15+)"                             
##  [2] "Account at a financial institution, female (% age 15+)"                     
##  [3] "Account at a financial institution, income, poorest 40% (% ages 15+)"       
##  [4] "Account at a financial institution, income, richest 60% (% ages 15+)"       
##  [5] "Account at a financial institution, male (% age 15+)"                       
##  [6] "Automated teller machines (ATMs) (per 100,000 adults)"                      
##  [7] "Average transaction cost of sending remittances from a specific country (%)"
##  [8] "Average transaction cost of sending remittances to a specific country (%)"  
##  [9] "Borrowers from commercial banks (per 1,000 adults)"                         
## [10] "Commercial bank branches (per 100,000 adults)"                              
## [11] "Depositors with commercial banks (per 1,000 adults)"                        
## [12] "Mobile account (% age 15+)"                                                 
## [13] "Mobile account, female (% age 15+)"                                         
## [14] "Mobile account, income, poorest 40% (% ages 15+)"                           
## [15] "Mobile account, income, richest 60% (% ages 15+)"                           
## [16] "Mobile account, male (% age 15+)"                                           
## [17] "Point-of-sale terminals (per 100,000 adults)"
wdi.g20.nations.tidy.data$Indicator.Name <- str_replace(wdi.g20.nations.tidy.data$Indicator.Name, "Average transaction cost of sending remittances to a specific country \\(\\%\\)", "Sending")

wdi.g20.nations.tidy.data$Indicator.Name <- str_replace(wdi.g20.nations.tidy.data$Indicator.Name, "Average transaction cost of sending remittances from a specific country \\(\\%\\)", "Receiving")

# Get the average transaction cost consumers paid for sending and/or receiving remittances to/from a specific country in 2015
remit.to.from <- wdi.g20.nations.tidy.data %>% 
  filter(Indicator.Name %in% c("Sending", "Receiving") & Year == 2015) %>% 
  select (Country.Name,id, Measure, Indicator.Name) %>% 
  spread(Indicator.Name, Measure)

# Rearrange the data
remit.to.from <- remit.to.from %>% arrange(id)

# Keep order of countries
remit.to.from <- transform(remit.to.from, Country.Name=factor(Country.Name,levels=unique(Country.Name)))

The table below shows the cost paid by consumers for sending and/or receiving money in 2015.

Avg. cost paid by consumers for remittances in 2015 (%)
Country Sending Receiving
Argentina 0.00 0.00
Australia 0.00 9.26
Brazil 8.13 6.26
Canada 0.00 8.93
China 10.37 0.00
France 0.00 7.26
Germany 0.00 9.34
India 6.74 0.00
Indonesia 6.86 0.00
Italy 0.00 6.27
Japan 0.00 13.22
Mexico 5.02 0.00
Russia 0.00 2.30
Saudi Arabia 0.00 4.49
South Africa 8.43 16.09
South Korea 0.00 5.81
Turkey 7.40 0.00
United Kingdom 0.00 7.36
United States 0.00 6.03
European Union 0.00 0.00

Graphical representation.

# Displays the average transaction cost of sending remittances in 2015 (%)

ggplot(data=remit.to.from, aes(x=`Country.Name`, y=Sending)) + geom_bar(width=.9,stat="identity",position = "dodge") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Country Name", y="Percentage", title = "Avg. cost paid by consumer for sending money to other countries in 2015 (%)") + geom_text(aes(label=Sending), vjust=0, color="black") 

# Displays the average transaction cost of receiving remittances in 2015 (%)

ggplot(data=remit.to.from, aes(x=`Country.Name`, y=Receiving)) + geom_bar(width=.9,stat="identity",position = "dodge") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Country Name", y="Percentage", title = "Avg. cost paid by consumer for receiving money from other countries in 2015 (%)") + geom_text(aes(label=Receiving), vjust=0, color="black")  

# Tidy the data and get countries that charge consumers for sending and receiving remittances
remit.to.from.charge <- remit.to.from %>% 
  #filter(Sending>0 & Receving >0) %>% 
  select (`Country.Name`,Sending,Receiving) %>% 
  gather(Type, value = Measure, -Country.Name,  na.rm = FALSE)

ggplot(remit.to.from.charge, aes(x = Country.Name,y = Measure)) + 
geom_bar(aes(fill = Type), position = "dodge", stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Country Name", y="Percentage", title = "Avg. cost paid by consumer for remittances in 2015 (%)") 

Conclusion:

  1. Consumers residing in Argentina and the European Union did not pay any extra cost for sending or receiving remittances in 2015.

  2. Consumers residing in South Africa and Brazil paid extra costs for both sending and receiving remittances in 2015.

  3. Consumers residing in Australia, Canada, France, Germany, Italy, Japan, Russia, Saudi Arabia, South Korea, United Kingdom and United States paid extra costs only for receiving remittances in 2015.

  4. Consumers residing in China, India, Indonesia, Mexico and Turkey paid extra costs only for sending remittances in 2015.

  5. Consumers residing in China paid the highest cost for sending remittances, while consumers residing in South Africa paid the highest cost for receiving remittances in 2015.

2. IRS’s Statistics of Income (SOI)

The Internal Revenue Service (IRS) maintains Statistics of Income (SOI) - County Data. SOI bases its county income data on the addresses reported on individual income tax returns filed with the IRS. The database includes counts of returns by filing status and adjusted gross income range, along with taxable income and total tax payment amounts, aggregated by county.

More information about data classification and collection can be found at

  1. Data information: https://www.irs.gov/uac/soi-tax-stats-county-data
  2. Data collection and processing: https://www.irs.gov/uac/soi-tax-stats-purpose-and-function-of-statistics-of-income-soi-program
  3. Data Dictionary and Users Guide: https://www.irs.gov/uac/soi-tax-stats-county-income-data-users-guide-and-record-layouts

Using the SOI Tax Stats - County Data for 2014, metrics for California and Texas will be analysed. For the scope of this project, returns will be analysed by adjusted gross income and filing status for both states.
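
The county CSV and the data dictionary referenced below are assumed to already be in the working directory. A hedged sketch for fetching the CSV is shown here; the direct download URL is an assumption inferred from the IRS county-data pages listed above, not a confirmed link.

# Hedged sketch: fetch the 2014 county CSV (the direct URL is an assumption)
# The data dictionary workbook, saved locally as SOI_Data_Dictionary_County2014.xlsx,
# can be downloaded manually from the users guide page listed above
soi.csv.url <- "https://www.irs.gov/pub/irs-soi/14incyallagi.csv"   # assumed URL
download.file(soi.csv.url, destfile = file.path(getwd(), "14incyallagi.csv"), mode = "wb")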

wd <- getwd()
# Data Dictionary
soi.data.dic <- read.xlsx2(file.path(wd, "SOI_Data_Dictionary_County2014.xlsx"), sheetName = "Sheet1", startRow=2, endRow = 138, startColumn = 1, endColumn = 6, as.data.frame = T, header = T, colClasses = "character", stringsAsFactors = FALSE)

# Get agi range
agi_range <- soi.data.dic %>% 
  filter(TYPE == "") %>% 
  select (VALUE.LINE.REFERENCE) %>% 
  separate(VALUE.LINE.REFERENCE, c("agi_id","agi_desc"), sep = "\\s=\\s")

agi_range$agi_id <- as.numeric(as.character(agi_range$agi_id))

# Load Statistics of Income (SOI) - County Data from 14incyallagi.csv
# This file contains county level tax filing details and also has state level rollup
# There are about 25536 observations and 128 variables that covers all states
# California (CA) and Texas (TX) states data will be used for this project

soi.raw.data <- read.csv(file.path(wd, "14incyallagi.csv"), stringsAsFactors = FALSE, header = TRUE)

# agi_stub = Adjusted gross income (1-Under $1, 2-$1 under $10,000, 3-$10,000 under $25,000, 4-$25,000 under $50,000, 5-$50,000 under $75,000, 6-$75,000 under $100,000, 7-$100,000 under $200,000, 8-$200,000 or more)
# numberOfTIReturns = Number of returns with taxable income
# tIAmount = Taxable income amount
# numberOfTPReturns = Number of returns with total tax payments
# tPAmount = Total tax payments amount

# Tidy the data and get required data
# Since numberOfSingleReturns + numberOfJointReturns + numberOfHOHReturns != numberOfReturns,
# add an otherReturns column to capture the remaining returns
soi.ca.tx.data <- soi.raw.data %>% 
  filter(STATE %in% c("CA","TX") & COUNTYFIPS == 0) %>% 
  mutate(otherReturns = (N1 - (mars1+MARS2+MARS4))) %>% 
  select(agi_stub,state = COUNTYNAME, numberOfReturns = N1, numberOfSingleReturns = mars1, numberOfJointReturns = MARS2, numberOfHOHReturns = MARS4, otherReturns, numberOfTIReturns = N04800, tIAmount = A04800, numberOfTPReturns = N10600, tPAmount = A10600) %>% 
  inner_join(agi_range, by = c("agi_stub" = "agi_id"))

# Calculate percentages
soi.ca.tx.data <- soi.ca.tx.data %>%
  group_by(state) %>% 
  summarise(totalReturns = sum(numberOfReturns)) %>%
  select (state,totalReturns) %>% 
  inner_join(soi.ca.tx.data, by = "state")

soi.ca.tx.data <- mutate(soi.ca.tx.data, "SingleReturns" = round(numberOfSingleReturns*100/totalReturns,2))
soi.ca.tx.data <- mutate(soi.ca.tx.data, "JointReturns" = round(numberOfJointReturns*100/totalReturns,2))
soi.ca.tx.data <- mutate(soi.ca.tx.data, "HOHReturns" = round(numberOfHOHReturns*100/totalReturns,2))
soi.ca.tx.data <- mutate(soi.ca.tx.data, "OtherReturns" = round(otherReturns*100/totalReturns,2))
soi.ca.tx.data <- mutate(soi.ca.tx.data, "Total" = SingleReturns+JointReturns+HOHReturns+OtherReturns)

# Generate report to compare the percentage of tax returns filed
soi.ca.tx.data %>%
  select (state, agi_stub, agi_desc, Total) %>% 
  arrange(agi_stub) %>% 
  spread(state, Total) %>% 
  select (agi_desc, California, Texas) %>% 
  kable(digits = 2, col.names = c("Adjusted Gross Income", "California(%)", "Texas(%)"), format='pandoc', caption = "Percentage of Tax Returns filed based on Adjusted Gross Income for year 2014")
Percentage of Tax Returns filed based on Adjusted Gross Income for year 2014
Adjusted Gross Income California(%) Texas(%)
Under $1 1.70 1.24
$1 under $10,000 13.50 14.46
$10,000 under $25,000 22.90 24.06
$25,000 under $50,000 22.78 23.93
$50,000 under $75,000 12.67 12.69
$75,000 under $100,000 8.11 7.93
$100,000 under $200,000 12.71 11.28
$200,000 or more 5.63 4.42

Graphical representation.

# Rearrange the data
soi.ca.tx.data <- arrange(soi.ca.tx.data, agi_stub, agi_desc)
soi.ca.tx.data <- transform(soi.ca.tx.data, agi_desc=factor(agi_desc,levels=unique(agi_desc)))

# Generate graph to compare tax returns filed based on Adjusted Gross Income
ggplot(soi.ca.tx.data, aes(x = agi_desc,y = Total)) + 
geom_bar(aes(fill = state), position = "dodge", stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Adjusted Gross Income", y="Percentage", title = "Percentage of Tax Returns filed based on Adjusted Gross Income for year 2014")

# Generate report based on filing status
ca.tx.c1 <- soi.ca.tx.data %>%
  group_by(state) %>% 
  summarise(Single = sum(SingleReturns), Joint = sum(JointReturns), "Head Of Household" = sum(HOHReturns), Other=sum(OtherReturns)) %>%
  select (state, Single, Joint, `Head Of Household`, Other) %>% 
  gather(ReturnType, "n", Single, Joint, `Head Of Household`, Other) %>% 
  select(state, ReturnType, Total=n) %>% 
  spread(state, Total)

# Arrange data
ca.tx.c1 <- arrange(ca.tx.c1, desc(California))
ca.tx.c1 <- mutate(ca.tx.c1, id = row_number())
ca.tx.c1 <- transform(ca.tx.c1, id=factor(id,levels=unique(id)))
ca.tx.c1 %>% 
  select(ReturnType, California, Texas) %>% 
kable(digits = 2, col.names = c("Return Type", "California(%)", "Texas(%)"), format='pandoc', caption = "Percentage of Tax Returns filed based on Status for year 2014")
Percentage of Tax Returns filed based on Status for year 2014
Return Type California(%) Texas(%)
Single 47.93 44.51
Joint 35.85 36.83
Head Of Household 14.85 17.27
Other 1.37 1.40

Graphical representation.

# Get the dataset
ca.tx.c2 <-
soi.ca.tx.data %>%
  group_by(state) %>% 
  summarise(Single = sum(SingleReturns), Joint = sum(JointReturns), "Head Of Household" = sum(HOHReturns), Other=sum(OtherReturns)) %>%
  select (state, Single, Joint, `Head Of Household`, Other) %>% 
  gather(ReturnType, "n", Single, Joint, `Head Of Household`, Other) %>% 
  select(state, ReturnType, Total=n)

# Returns filing order
filing <- c("Single","Joint","Head Of Household","Other")
id <- c(1,2,3,4)
filing.order <- data.frame(filing, id, stringsAsFactors = FALSE)

ca.tx.c2 <- ca.tx.c2 %>% 
  inner_join(filing.order, by = c("ReturnType" = "filing"))

ca.tx.c2 <- arrange(ca.tx.c2, state, id)
ca.tx.c2 <- transform(ca.tx.c2, ReturnType=factor(ReturnType,levels=unique(ReturnType)))


# Generate graph to compare tax returns filed based on Filing Status
ggplot(ca.tx.c2, aes(x = ReturnType,y = Total)) + 
geom_bar(aes(fill = state), position = "dodge", stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Filing Status", y="Percentage", title = "Percentage of Tax Returns filed based on Filing Status for year 2014")

Conclusion:

  1. Texas has a slightly higher percentage of tax returns with adjusted gross income between $1 and $50,000.

  2. California and Texas have approximately the same percentage of returns with adjusted gross income between $50,000 and $75,000.

  3. California has a slightly higher percentage of returns with adjusted gross income above $75,000.

3. United States Census Bureau - Imports of Crude Oil

The United States Census Bureau is a principal agency of the U.S. Federal Statistical System, responsible for producing data about the American people and economy. The statistics on U.S. imports of crude oil are compiled with the help of U.S. Customs and Border Protection and the Bureau of Economic Analysis. The purpose of this database is to provide detailed statistics on goods, and estimates of services, entering the U.S. from foreign countries.

More information about data and program can be found at

  1. Program details: https://www.census.gov/foreign-trade/about/index.html

  2. Foreign Trade historical data: https://www.census.gov/foreign-trade/statistics/historical/index.html

  3. U.S. Imports of Crude Oil: https://www.census.gov/foreign-trade/statistics/historical/petr.txt

Using the U.S. Imports of Crude Oil database, yearly import quantities and costs will be analysed. Since 2017 and 2018 do not have complete data, those years are excluded from the analysis.
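
The code below reads petr.txt from the working directory; it can be fetched directly from the URL listed above, for example:

# Fetch the historical crude oil imports text file from the Census Bureau site
petr.url <- "https://www.census.gov/foreign-trade/statistics/historical/petr.txt"
download.file(petr.url, destfile = file.path(getwd(), "petr.txt"), mode = "wb")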

# Get local directory and load the file
localDir <- getwd()

# Skip top 24 lines as it contains header info
crude.oil.text <- read_lines(file.path(str_trim(localDir), "petr.txt"), skip = 24)

# Remove Spaces and replace with pipe(|)
pattern = "\\s+"
replacement = "|"
crude.oil.text <- gsub(pattern = pattern, replacement = replacement, crude.oil.text)

# Remove unneeded rows (lines containing the abbreviation Jan.)
pattern = "Jan\\."
crude.noneed.data <- crude.oil.text[str_detect(crude.oil.text, pattern = pattern)]
crude.oil.text <- crude.oil.text [! crude.oil.text %in% crude.noneed.data]

# Create data frame and add id column
# Split the data into 15 columns
crude.oil.data <- data.frame(unlist(str_split_fixed(crude.oil.text, "\\|", 15)), stringsAsFactors = F)
crude.oil.data <- mutate(crude.oil.data, id = row_number())

# The data is plain text with two years' data stored side-by-side, so convert it to a single table
# The file has about 46 yearly data sets; the 2017 and 2018 data sets are not used for the analysis
start = 1
end = 13
# filter(id == 0) matches no rows, so this creates an empty data frame
# with columns X2:X8 that the reshaped yearly blocks are appended to
crude.oil.tidy.data <- crude.oil.data %>% 
    filter(id == 0) %>% 
    select(X2:X8)
    
for (i in 1:22)
{
  # Get first set
  temp1 <- crude.oil.data %>% 
    filter(id >= start & id <= end) %>% 
    select(id, X1:X8)
  
  # Get the year for the left-hand block (a scalar from column X2 of the header row)
  year <- (crude.oil.data %>% 
    filter(id == start))$X2
  
  # Fill the year into the empty X1 column of the data rows
  temp1$X1[temp1$X1 == ""] <- year
  
  # Get second set of data which is same side-by-side
  temp2 <- crude.oil.data %>% 
    filter(id >= start & id <= end) %>% 
    select(id, X1, X2=X9, X3=X10, X4=X11, X5=X12, X6=X13, X7=X14, X8=X15)
  
  # Get the year for the right-hand block (a scalar from column X3 of the header row)
  year <- (crude.oil.data %>% 
    filter(id == start))$X3
  
  temp2$X1[temp2$X1 == ""] <- year
  
  # Remove extra rows from temp data frames
  temp1 <- temp1[-c(1), ]
  temp2 <- temp2[-c(1), ]
  
  # Store data into tidy table
  crude.oil.tidy.data <- bind_rows(crude.oil.tidy.data,temp1)
  crude.oil.tidy.data <- bind_rows(crude.oil.tidy.data,temp2)

  start = end + 1
  end = end + 13
}

# Add a sequence column and remove unused columns
crude.oil.tidy.data$X1 <- as.numeric(unlist(crude.oil.tidy.data$X1))
crude.oil.tidy.data <- crude.oil.tidy.data %>% 
  arrange(X1, id) %>% 
  mutate(seq = row_number()) %>% 
  select(-id, -X4, -X6, -X7, -X8)

# Format data and replace missing values with zeros
crude.oil.tidy.data$X3 <- as.numeric(ifelse(crude.oil.tidy.data$X3 == "N/A", "0",str_replace_all(crude.oil.tidy.data$X3,",","")))

crude.oil.tidy.data$X5 <- as.numeric(ifelse(crude.oil.tidy.data$X5 == "N/A", "0",str_replace_all(crude.oil.tidy.data$X5,",","")))

crude.oil.tidy.data <-  rename(crude.oil.tidy.data, Month = X2, Barrels = X3, Value = X5, Year = X1)

# Summarise data
crude.oil.yearly.data <- crude.oil.tidy.data %>% 
  group_by(Year) %>% 
  summarise(TotalBarrels = round(sum(Barrels)/10000,2), TotalValue = round(sum(Value)/10000,2), AverageBarrels = round(mean(Barrels)/10000,2), AverageValue = round(mean(Value)/10000,2))

# Generate report
crude.oil.yearly.data %>% 
  select (Year, AverageBarrels) %>% 
kable(digits = 2, col.names = c("Year", "Quantity (10 thousands of Barrels)"), format='pandoc', caption = "U.S. Yearly Average Imports of Crude Oil 1973 - 2016")
U.S. Yearly Average Imports of Crude Oil 1973 - 2016
Year Quantity (10 thousands of Barrels)
1973 11.61
1974 12.20
1975 13.21
1976 17.09
1977 21.00
1978 19.94
1979 20.56
1980 16.48
1981 14.69
1982 11.84
1983 10.78
1984 11.00
1985 10.51
1986 13.62
1987 14.54
1988 15.73
1989 17.89
1990 18.47
1991 17.88
1992 19.12
1993 21.19
1994 22.53
1995 23.06
1996 24.11
1997 25.58
1998 27.02
1999 26.90
2000 28.33
2001 28.93
2002 28.48
2003 30.63
2004 31.84
2005 31.29
2006 31.12
2007 30.75
2008 29.92
2009 27.62
2010 28.14
2011 27.68
2012 25.81
2013 23.45
2014 22.51
2015 22.18
2016 23.39
# Graphical representation
ggplot(data=crude.oil.yearly.data, aes(x=Year, y=AverageBarrels, group=1)) +
    geom_line() +
    geom_point() + labs(x="Year", y="Quantity (10 thousands of Barrels)", title = "U.S. Yearly Average Imports of Crude Oil 1973 - 2016")

crude.oil.yearly.data %>% 
  select (Year, AverageValue) %>% 
kable(digits = 2, col.names = c("Year", "Value (10 thousands of Dollars)"), format='pandoc', caption = "U.S. Yearly Average Imports of Crude Oil 1973 - 2016")
U.S. Yearly Average Imports of Crude Oil 1973 - 2016
Year Value (10 thousands of Dollars)
1973 38.27
1974 137.35
1975 153.12
1976 212.33
1977 279.85
1978 267.84
1979 384.17
1980 516.79
1981 516.17
1982 395.38
1983 318.20
1984 304.41
1985 275.29
1986 189.34
1987 244.34
1988 223.70
1989 295.00
1990 364.87
1991 312.20
1992 321.28
1993 320.58
1994 320.66
1995 364.58
1996 457.76
1997 451.88
1998 310.44
1999 424.08
2000 748.97
2001 619.11
2002 644.03
2003 826.39
2004 1097.86
2005 1464.63
2006 1805.23
2007 1976.76
2008 2849.27
2009 1572.60
2010 2101.34
2011 2763.18
2012 2609.79
2013 2273.39
2014 2053.40
2015 1048.48
2016 844.10
# Graphical representation
ggplot(data=crude.oil.yearly.data, aes(x=Year, y=AverageValue, group=1)) +
    geom_line() +
    geom_point() + labs(x="Year", y="Value (10 thousands of Dollars)", title = "U.S. Yearly Average Imports of Crude Oil 1973 - 2016")

# Generate data for seasons
seasons <- c("Winter", "Winter", "Winter", "Spring", "Spring", "Spring", "Summer", "Summer", "Summer", "Fall", "Fall", "Fall")
m <- c("December", "January", "February", "March","April", "May", "June", "July", "August", "September", "October", "November")
id <- c(4,4,4,1,1,1,2,2,2,3,3,3)

seasons.data <- data.frame(seasons, m, id)
seasons.data$seasons <- as.character(seasons.data$seasons)
seasons.data$m <- as.character(seasons.data$m)

crude.oil.season.data <- crude.oil.tidy.data %>% 
  filter(Year >= 2007) %>% 
  inner_join(seasons.data, c("Month" = "m")) %>% 
  group_by(Year,seasons,id) %>% 
  summarise(TotalBarrels = round(sum(Barrels)/10000,2), TotalValue = round(sum(Value)/10000,2), AverageBarrels = round(mean(Barrels)/10000,2), AverageValue = round(mean(Value)/10000,2))

# Rearrange the data
crude.oil.season.data <- arrange(crude.oil.season.data, Year, id, seasons)
crude.oil.season.data <- transform(crude.oil.season.data, seasons=factor(seasons,levels=unique(seasons)))

crude.oil.season.data %>% 
  arrange(Year, id) %>%
  select (Year, seasons, TotalBarrels) %>% 
kable(digits = 2, col.names = c("Year", "Season", "Quantity (10 thousands of Dollars)"), format='pandoc', caption = "U.S. Imports of Crude Oil per Season 2007 - 2016")
U.S. Imports of Crude Oil per Season 2007 - 2016
Year Season Quantity (10 thousands of Barrels)
2007 Spring 94.70
2007 Summer 94.99
2007 Fall 91.79
2007 Winter 87.59
2008 Spring 87.85
2008 Summer 95.12
2008 Fall 83.32
2008 Winter 92.77
2009 Spring 84.38
2009 Summer 84.56
2009 Fall 79.05
2009 Winter 83.49
2010 Spring 87.49
2010 Summer 92.84
2010 Fall 80.27
2010 Winter 77.11
2011 Spring 82.36
2011 Summer 87.77
2011 Fall 80.93
2011 Winter 81.14
2012 Spring 81.43
2012 Summer 81.50
2012 Fall 74.93
2012 Winter 71.87
2013 Spring 69.29
2013 Summer 73.81
2013 Fall 68.47
2013 Winter 69.80
2014 Spring 67.60
2014 Summer 66.85
2014 Fall 64.05
2014 Winter 71.60
2015 Spring 66.12
2015 Summer 67.94
2015 Fall 65.10
2015 Winter 67.04
2016 Spring 69.66
2016 Summer 72.69
2016 Fall 70.32
2016 Winter 67.97
ggplot(data=crude.oil.season.data, aes(x=Year, y=TotalBarrels, fill=seasons)) +
    geom_bar(stat="identity") + labs(x="Year", y="Quantity (10 thousands of Barrels)", title = "U.S. Imports of Crude Oil per Season 2007 - 2016")

crude.oil.season.data %>% 
  arrange(Year, id) %>%
  select (Year, seasons, TotalValue) %>% 
kable(digits = 2, col.names = c("Year", "Season", "Value (10 thousands of Dollars)"), format='pandoc', caption = "U.S. Imports of Crude Oil per Season 2007 - 2016")
U.S. Imports of Crude Oil per Season 2007 - 2016
Year Season Value (10 thousands of Dollars)
2007 Spring 5358.51
2007 Summer 6163.29
2007 Fall 6756.07
2007 Winter 5443.30
2008 Spring 8594.62
2008 Summer 11492.08
2008 Fall 7379.61
2008 Winter 6724.93
2009 Spring 3902.07
2009 Summer 5253.04
2009 Fall 5476.85
2009 Winter 4239.22
2010 Spring 6658.94
2010 Summer 6744.47
2010 Fall 5968.91
2010 Winter 5843.73
2011 Spring 8376.87
2011 Summer 9151.45
2011 Fall 8157.05
2011 Winter 7472.83
2012 Spring 8838.33
2012 Summer 7824.15
2012 Fall 7395.09
2012 Winter 7259.93
2013 Spring 6727.45
2013 Summer 7236.61
2013 Fall 6779.38
2013 Winter 6537.28
2014 Spring 6432.45
2014 Summer 6476.85
2014 Fall 5652.31
2014 Winter 6079.17
2015 Spring 3156.12
2015 Summer 3565.80
2015 Fall 2651.97
2015 Winter 3207.90
2016 Spring 2115.18
2016 Summer 2899.84
2016 Fall 2809.48
2016 Winter 2304.68
ggplot(data=crude.oil.season.data, aes(x=Year, y=TotalValue, fill=seasons)) +
    geom_bar(stat="identity") + labs(x="Year", y="Value (10 thousands of Dollars)", title = "U.S. Imports of Crude Oil per Season 2007 - 2016")

Conclusion:

  1. Imports of crude oil increased gradually until about 2005 and peaked between 2004 and 2006.

  2. Imports of crude oil started to fall gradually after 2006; by 2015, imports were similar to 1994 levels.

  3. The cost of crude oil imports increased sharply between 2000 and 2008.

  4. The cost of crude oil imports dropped sharply in 2009.

  5. The cost of crude oil imports increased sharply between 2010 and 2012.

  6. The cost of crude oil imports has been falling since 2012.

  7. The seasonal data for the last 10 years also shows a general declining trend in crude oil imports.

References:

  1. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
  2. https://github.com/rstudio/webinars/tree/master/05-Data-Wrangling-with-R-and-RStudio
  3. https://www.r-bloggers.com/importexport-data-to-and-from-xlsx-files/

1: For the course of this project, the words dataset and database are used interchangeably.