The goal of this assignment is to prepare different datasets1 (databases) for downstream analysis work. Following are the 3 datasets that will be used for this project. Dataset sources was identified by
1. World Development Indicators (WDI) - Kyle Gilde
2. IRS’s Statistics of Income (SOI) - Pavan Akula
3. United States Census Bureau - Imports of Crude Oil - Nnaemezue Obi-eyisi
Libraries used for the project
#install.packages("xlsx")
library(xlsx) # used for loading files with extension xlsx and xls
library(utils) # loading csv files
library(tidyr) # formating data
library(dplyr) # subseting, changing data layout
library(stringr) # regular expression and data manupilation
library(ggplot2) # graphs generation
library(knitr) # report display, table format
library(readr) # reading text files data
The World Bank maintains World Development Indicators (WDI) database. Collection of development indicators are compiled from officially-recognized international sources by World Bank. It presents the most current and accurate global development data available, and includes national, regional and global estimates. Database contains various indicator topics including Agriculture & Rural Development, Aid Effectiveness, Climate Change, Economy & Growth, Education, Energy & Mining, Environment, External Debt, Financial Sector, Gender, Health, Infrastructure, Labor & Social Protection, Poverty, Private Sector, Public Sector, Science & Technology, Social Development, Trade and Urban Development.
Data is classified by World Bank into economies, regions, income groups and lending groups. Database has data from year 1960 to current. Update frequency of the database is quarterly. Complete database and data dictionary is available in Zip file format at World Bank website http://data.worldbank.org/data-catalog/world-development-indicators.
More information about data classification and collection can be found at
Using WDI database, development indicators for last 10 years that fall under Financial Sector: Access topic will be analysed for G20 nations. To be more specific only one indicator Average transaction cost of sending and receving remittances from or to a specific country will be analysis.
# Load World Indicators from WDI_Series.csv
# This file contains information about each indicator
# There are about 17 indicators that fall under Financial Sector: Access
wd <- getwd()
wdi.indicator.data <- read.csv(paste(wd , "\\WDI_Series.csv", sep = ""), stringsAsFactors = FALSE, header = TRUE)
wdi.indicator.finsec.access <- wdi.indicator.data %>%
filter(Topic == "Financial Sector: Access") %>%
select(Indicator.Name)
# G20 nations
#nations <- c("Argentina","Australia","Brazil","Canada","China","France","Germany","India","Indonesia","Italy","Japan","Mexico","Russian Federation","Saudi Arabia","South Africa","South Korea","Turkey","United Kingdom","United States","Austria","Belgium","Bulgaria","Croatia","Cyprus","Czech Republic","Denmark","Estonia","Finland","Greece","Hungary","Ireland","Latvia","Lithuania","Luxembourg","Malta","Netherlands","Poland","Portugal","Romania","Slovak Republic","Slovenia","Spain","Sweden")
nations <- c("Argentina","Australia","Brazil","Canada","China","France","Germany","India","Indonesia","Italy","Japan","Mexico","Russia","Saudi Arabia","South Africa","South Korea","Turkey","United Kingdom","United States","European Union")
g20.nations <- data.frame(nations, stringsAsFactors = FALSE)
# Assign unique id's to nations
g20.nations <- g20.nations %>%
mutate(id = row_number())
# Load World Development Indicators (WDI) database
# Database has 383328 observations with 61 variables
wdi.raw.data <- read.csv(paste(wd, "\\WDI_Data.csv", sep = ""), stringsAsFactors = FALSE, header = TRUE)
#str(wdi.raw.data)
# Format data, some rows have [ts] and [w2] at the end of indicator description
wdi.raw.data$Indicator.Name <- gsub("\\[ts\\]", "" ,wdi.raw.data$Indicator.Name)
wdi.raw.data$Indicator.Name <- gsub("\\[w2\\]", "" ,wdi.raw.data$Indicator.Name)
wdi.raw.data$Indicator.Name <- str_trim(wdi.raw.data$Indicator.Name)
wdi.raw.data$Country.Name <- str_replace(wdi.raw.data$Country.Name, "Russian Federation", "Russia")
wdi.raw.data$Country.Name <- str_replace(wdi.raw.data$Country.Name, "Korea, Rep\\.", "South Korea")
# Filter data for last 10 years excluding 2016
# Get data only for G20 nations
wdi.g20.nations.raw <- wdi.raw.data %>%
inner_join(wdi.indicator.finsec.access, by = "Indicator.Name") %>%
inner_join(g20.nations, by = c("Country.Name" = "nations")) %>%
select (id, Country.Name, Country.Code, Indicator.Name, X2006:X2015)
#unique(wdi.g20.nations.raw$Indicator.Name)
#unique(wdi.g20.nations.raw$Country.Name)
# Tidy the data set
wdi.g20.nations.tidy.data <- wdi.g20.nations.raw %>%
gather(Year, value = Measure, X2006:X2015, na.rm = FALSE)
wdi.g20.nations.tidy.data$Measure = ifelse(is.na(wdi.g20.nations.tidy.data$Measure), 0, round(wdi.g20.nations.tidy.data$Measure, 2))
wdi.g20.nations.tidy.data$Year <- gsub("[[:alpha:]]", "" ,wdi.g20.nations.tidy.data$Year)
# Format rows data
unique(wdi.g20.nations.tidy.data$Indicator.Name)
## [1] "Account at a financial institution (% age 15+)"
## [2] "Account at a financial institution, female (% age 15+)"
## [3] "Account at a financial institution, income, poorest 40% (% ages 15+)"
## [4] "Account at a financial institution, income, richest 60% (% ages 15+)"
## [5] "Account at a financial institution, male (% age 15+)"
## [6] "Automated teller machines (ATMs) (per 100,000 adults)"
## [7] "Average transaction cost of sending remittances from a specific country (%)"
## [8] "Average transaction cost of sending remittances to a specific country (%)"
## [9] "Borrowers from commercial banks (per 1,000 adults)"
## [10] "Commercial bank branches (per 100,000 adults)"
## [11] "Depositors with commercial banks (per 1,000 adults)"
## [12] "Mobile account (% age 15+)"
## [13] "Mobile account, female (% age 15+)"
## [14] "Mobile account, income, poorest 40% (% ages 15+)"
## [15] "Mobile account, income, richest 60% (% ages 15+)"
## [16] "Mobile account, male (% age 15+)"
## [17] "Point-of-sale terminals (per 100,000 adults)"
wdi.g20.nations.tidy.data$Indicator.Name <- str_replace(wdi.g20.nations.tidy.data$Indicator.Name, "Average transaction cost of sending remittances to a specific country \\(\\%\\)", "Sending")
wdi.g20.nations.tidy.data$Indicator.Name <- str_replace(wdi.g20.nations.tidy.data$Indicator.Name, "Average transaction cost of sending remittances from a specific country \\(\\%\\)", "Receving")
# Get Average transaction cost consumer paid for sending and/or receving remittances to/from a specific country in 2015
remit.to.from <- wdi.g20.nations.tidy.data %>%
filter(Indicator.Name %in% c("Sending", "Receving") & Year == 2015) %>%
select (Country.Name,id, Measure, Indicator.Name) %>%
spread(Indicator.Name, Measure)
# Rearrange the data
remit.to.from <- remit.to.from %>% arrange(id)
# Keep order of countries
remit.to.from <- transform(remit.to.from, Country.Name=factor(Country.Name,levels=unique(Country.Name)))
Information shows cost paid by the consumer for sending and/or receiving money in year 2015
| Country | Sending | Receving |
|---|---|---|
| Argentina | 0.00 | 0.00 |
| Australia | 0.00 | 9.26 |
| Brazil | 8.13 | 6.26 |
| Canada | 0.00 | 8.93 |
| China | 10.37 | 0.00 |
| France | 0.00 | 7.26 |
| Germany | 0.00 | 9.34 |
| India | 6.74 | 0.00 |
| Indonesia | 6.86 | 0.00 |
| Italy | 0.00 | 6.27 |
| Japan | 0.00 | 13.22 |
| Mexico | 5.02 | 0.00 |
| Russia | 0.00 | 2.30 |
| Saudi Arabia | 0.00 | 4.49 |
| South Africa | 8.43 | 16.09 |
| South Korea | 0.00 | 5.81 |
| Turkey | 7.40 | 0.00 |
| United Kingdom | 0.00 | 7.36 |
| United States | 0.00 | 6.03 |
| European Union | 0.00 | 0.00 |
Graphical representation.
# Displays Average transaction cost of sending remittances from a specific country (%) in 2015
ggplot(data=remit.to.from, aes(x=`Country.Name`, y=Sending)) + geom_bar(width=.9,stat="identity",position = "dodge") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Country Name", y="Percentage", title = "Avg. cost paid by consumer for sending money to other countries in 2015 (%)") + geom_text(aes(label=Sending), vjust=0, color="black")
# Displays Average transaction cost of receving remittances from a specific country (%) in 2015
ggplot(data=remit.to.from, aes(x=`Country.Name`, y=Receving)) + geom_bar(width=.9,stat="identity",position = "dodge") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Country Name", y="Percentage", title = "Avg. cost paid by consumer for receving money to other countries in 2015 (%)") + geom_text(aes(label=Receving), vjust=0, color="black")
# Tidy the data and get countries that charge consumers for sending and receving remittances
remit.to.from.charge <- remit.to.from %>%
#filter(Sending>0 & Receving >0) %>%
select (`Country.Name`,Sending,Receving) %>%
gather(Type, value = Measure, -Country.Name, na.rm = FALSE)
ggplot(remit.to.from.charge, aes(x = Country.Name,y = Measure)) +
geom_bar(aes(fill = Type), position = "dodge", stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Country Name", y="Percentage", title = "Avg. cost paid by consumer for remittances in 2015 (%)")
Conclusion:
Consumers residing in Argentina and European Union did not pay extra cost for sending and receiving remittances in 2015.
Consumers residing in South Africa and Brazil paid extra cost while sending and receiving remittances in 2015.
Consumers residing in Australia, Brazil, Canada, France, Germany, Italy, Japan, Russia, Saudi Arabia, South Africa, South Korea, United Kingdom and United States only paid extra cost while sending remittances in 2015.
Consumers residing in Brazil, China, India, Mexico, South Africa and Turkey only paid extra cost while receiving remittances in 2015.
Consumers residing South Africa paid highest cost for sending remittances, while Consumers residing in China paid highest cost for receiving remittances in 2015.
Internal Revenue Service (IRS) maintains Statistics of Income (SOI) - County Data. SOI bases its county income data on the addresses reported on individual income tax returns filed with the IRS. Database includes:
More information about data classification and collection can be found at
Using SOI Tax Stats - County Database 2014, metrics for California and Texas will be analysed. For the scope of the project returns will be analysis based on Adjusted gross income and Filing status for both states.
wd <- getwd()
# Data Dictionary
soi.data.dic <- read.xlsx2(paste(wd, "\\SOI_Data_Dictionary_County2014.xlsx", sep = ""), sheetName = "Sheet1", startRow=2, endRow = 138, startColumn = 1, endColumn = 6, as.data.frame = T, header = T, colClasses = "character", stringsAsFactors = FALSE)
# Get agi range
agi_range <- soi.data.dic %>%
filter(TYPE == "") %>%
select (VALUE.LINE.REFERENCE) %>%
separate(VALUE.LINE.REFERENCE, c("agi_id","agi_desc"), sep = "\\s=\\s")
agi_range$agi_id <- as.numeric(as.character(agi_range$agi_id))
# Load Statistics of Income (SOI) - County Data from 14incyallagi.csv
# This file contains county level tax filing details and also has state level rollup
# There are about 25536 observations and 128 variables that covers all states
# California (CA) and Texas (TX) states data will be used for this project
soi.raw.data <- read.csv(paste(wd , "\\14incyallagi.csv", sep = ""), stringsAsFactors = FALSE, header = TRUE)
# agi_stub = Adjusted gross income (1-Under $1, 2-$1 under $10,000, 3-$10,000 under $25,000,4-$25,000 under $50,000,5-$50,000 under $75,000,6-$75,000 under $100,000,7-$100,000 under $200,000,8-$200,000 or more
# numberOfTIReturns = Number of returns with taxable income
# tIAmount = Taxable income amount
# numberOfTPReturns = Number of returns with total tax payments
# tPAmount = Total tax payments amount
# Tidy the data and get required data
# Missing values, since numberOfSingleReturns + numberOfJointReturns + numberOfHOHReturns != numberOfReturns
# add otherReturns
soi.ca.tx.data <- soi.raw.data %>%
filter(STATE %in% c("CA","TX") & COUNTYFIPS == 0) %>%
mutate(otherReturns = (N1 - (mars1+MARS2+MARS4))) %>%
select(agi_stub,state = COUNTYNAME, numberOfReturns = N1, numberOfSingleReturns = mars1, numberOfJointReturns = MARS2, numberOfHOHReturns = MARS4, otherReturns, numberOfTIReturns = N04800, tIAmount = A04800, numberOfTPReturns = N10600, tPAmount = A10600) %>%
inner_join(agi_range, by = c("agi_stub" = "agi_id"))
# Calculate percentages
soi.ca.tx.data %<>%
group_by(state) %>%
summarise(totalReturns = sum(numberOfReturns)) %>%
select (state,totalReturns) %>%
inner_join(soi.ca.tx.data, by = "state")
soi.ca.tx.data <- mutate(soi.ca.tx.data, "SingleReturns" = round(numberOfSingleReturns*100/totalReturns,2))
soi.ca.tx.data <- mutate(soi.ca.tx.data, "JointReturns" = round(numberOfJointReturns*100/totalReturns,2))
soi.ca.tx.data <- mutate(soi.ca.tx.data, "HOHReturns" = round(numberOfHOHReturns*100/totalReturns,2))
soi.ca.tx.data <- mutate(soi.ca.tx.data, "OtherReturns" = round(otherReturns*100/totalReturns,2))
soi.ca.tx.data <- mutate(soi.ca.tx.data, "Total" = SingleReturns+JointReturns+HOHReturns+OtherReturns)
# Generate report to compare percentage of tax returns files
soi.ca.tx.data %>%
select (state, agi_stub, agi_desc, Total) %>%
arrange(agi_stub) %>%
spread(state, Total) %>%
select (agi_desc, California, Texas) %>%
kable(digits = 2, col.names = c("Adjusted Gross Income", "California(%)", "Texas(%)"), format='pandoc', caption = "Percentage of Tax Returns filed based on Adjusted Gross Income for year 2014")
| Adjusted Gross Income | California(%) | Texas(%) |
|---|---|---|
| Under $1 | 1.70 | 1.24 |
| $1 under $10,000 | 13.50 | 14.46 |
| $10,000 under $25,000 | 22.90 | 24.06 |
| $25,000 under $50,000 | 22.78 | 23.93 |
| $50,000 under $75,000 | 12.67 | 12.69 |
| $75,000 under $100,000 | 8.11 | 7.93 |
| $100,000 under $200,000 | 12.71 | 11.28 |
| $200,000 or more | 5.63 | 4.42 |
Graphical representation.
# Rearrange the data
soi.ca.tx.data <- arrange(soi.ca.tx.data, agi_stub, agi_desc)
soi.ca.tx.data <- transform(soi.ca.tx.data, agi_desc=factor(agi_desc,levels=unique(agi_desc)))
# Generate graph to compare taxes returns files based on Adjusted Gross Income
ggplot(soi.ca.tx.data, aes(x = agi_desc,y = Total)) +
geom_bar(aes(fill = state), position = "dodge", stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Adjusted Gross Income", y="Percentage", title = "Percentage of Tax Returns filed based on Adjusted Gross Income for year 2014")
# Generate report based on filing status
ca.tx.c1 <- soi.ca.tx.data %>%
group_by(state) %>%
summarise(Single = sum(SingleReturns), Joint = sum(JointReturns), "Head Of Household" = sum(HOHReturns), Other=sum(OtherReturns)) %>%
select (state, Single, Joint, `Head Of Household`, Other) %>%
gather(ReturnType, "n", Single, Joint, `Head Of Household`, Other) %>%
select(state, ReturnType, Total=n) %>%
spread(state, Total)
# Arrange data
ca.tx.c1 <- arrange(ca.tx.c1, desc(California))
ca.tx.c1 <- mutate(ca.tx.c1, id = row_number())
ca.tx.c1 <- transform(ca.tx.c1, id=factor(id,levels=unique(id)))
ca.tx.c1 %>%
select(ReturnType, California, Texas) %>%
kable(digits = 2, col.names = c("Return Type", "California(%)", "Texas(%)"), format='pandoc', caption = "Percentage of Tax Returns filed based on Status for year 2014")
| Return Type | California(%) | Texas(%) |
|---|---|---|
| Single | 47.93 | 44.51 |
| Joint | 35.85 | 36.83 |
| Head Of Household | 14.85 | 17.27 |
| Other | 1.37 | 1.40 |
Graphical representation.
# Get the dataset
ca.tx.c2 <-
soi.ca.tx.data %>%
group_by(state) %>%
summarise(Single = sum(SingleReturns), Joint = sum(JointReturns), "Head Of Household" = sum(HOHReturns), Other=sum(OtherReturns)) %>%
select (state, Single, Joint, `Head Of Household`, Other) %>%
gather(ReturnType, "n", Single, Joint, `Head Of Household`, Other) %>%
select(state, ReturnType, Total=n)
# Returns filing order
filing <- c("Single","Joint","Head Of Household","Other")
id <- c(1,2,3,4)
filing.order <- data.frame(filing, id, stringsAsFactors = FALSE)
ca.tx.c2 <- ca.tx.c2 %>%
inner_join(filing.order, by = c("ReturnType" = "filing"))
ca.tx.c2 <- arrange(ca.tx.c2, state, id)
ca.tx.c2 <- transform(ca.tx.c2, ReturnType=factor(ReturnType,levels=unique(ReturnType)))
# Generate graph to compare taxes returns files based on Adjusted Gross Income
ggplot(ca.tx.c2, aes(x = ReturnType,y = Total)) +
geom_bar(aes(fill = state), position = "dodge", stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x="Filing Status", y="Percentage", title = "Percentage of Tax Returns filed based on Filing Status for year 2014")
Conclusion:
Texas has slightly more percentage of population that earns between $1 under $50,000.
California and Texas has approxmimately same percentage of population that earns between $50,000 under $75,000.
California has slightly more percentage of population that earns above $75,000.
The United States Census Bureau is a principal agency of the U.S. Federal Statistical System, responsible for producing data about the American people and economy. Statistics of U.S. Imports of Crude Oil is compiled with the help of U.S. Customs and Border Protection and the Bureau of Economic Analysis. Purpose of this database is to provide detailed statistics on goods and estimates of services entering the U.S. from foreign countries.
More information about data and program can be found at
Program details: https://www.census.gov/foreign-trade/about/index.html
Foreign Trade historical data: https://www.census.gov/foreign-trade/statistics/historical/index.html
U.S. Imports of Crude Oil: https://www.census.gov/foreign-trade/statistics/historical/petr.txt
Using the U.S. Imports of Crude Oil database, metrics for yearly imports of crude oil and costs will be analysed. Since years 2017 and 2018 does not have complete data, years will not be used for analysis.
# Get local directory and load the file
localDir <- getwd()
# Skip top 24 lines as it contains header info
crude.oil.text <- read_lines(paste(str_trim(localDir),"//petr.txt",sep=""),skip = 24)
# Remove Spaces and replace with pipe(|)
pattern = "\\s+"
replacement = "|"
crude.oil.text <- gsub(pattern = pattern, replacement = replacement, crude.oil.text)
# Remove unnecessary data
pattern = "Jan\\."
crude.noneed.data <- crude.oil.text[str_detect(crude.oil.text, pattern = pattern)]
crude.oil.text <- crude.oil.text [! crude.oil.text %in% crude.noneed.data]
# Create data frame and add id column
# Split the data into 15 columns
crude.oil.data <- data.frame(unlist(str_split_fixed(crude.oil.text, "\\|", 15)), stringsAsFactors = F)
crude.oil.data <- mutate(crude.oil.data, id = row_number())
# Since data is text format and yearly data is stored side-by-side, convert it to single table
# File has about 46 yearly data sets, 2017 and 2018 data set is not used for analysis
start = 1
end = 13
crude.oil.tidy.data <- crude.oil.data %>%
filter(id == 0) %>%
select(X2:X8)
for (i in 1:22)
{
# Get first set
temp1 <- crude.oil.data %>%
filter(id >= start & id <= end) %>%
select(id, X1:X8)
year <- crude.oil.data %>%
filter (id == start) %>%
select(X2)
# Get year
temp1$X1[temp1$X1 == ""] <- year
# Get second set of data which is same side-by-side
temp2 <- crude.oil.data %>%
filter(id >= start & id <= end) %>%
select(id, X1, X2=X9, X3=X10, X4=X11, X5=X12, X6=X13, X7=X14, X8=X15)
# Get year
year <- crude.oil.data %>%
filter (id == start) %>%
select(X3)
temp2$X1[temp2$X1 == ""] <- year
# Remove extra rows from temp data frames
temp1 <- temp1[-c(1), ]
temp2 <- temp2[-c(1), ]
# Store data into tidy table
crude.oil.tidy.data <- bind_rows(crude.oil.tidy.data,temp1)
crude.oil.tidy.data <- bind_rows(crude.oil.tidy.data,temp2)
start = end + 1
end = end + 13
}
# Add new id column and remove un-used columns
crude.oil.tidy.data$X1 <- as.numeric(unlist(crude.oil.tidy.data$X1))
crude.oil.tidy.data <- arrange(crude.oil.tidy.data, X1,id) %>% mutate(seq = row_number())
crude.oil.tidy.data$id <- NULL
crude.oil.tidy.data$X4 <- NULL
crude.oil.tidy.data$X6 <- NULL
crude.oil.tidy.data$X7 <- NULL
crude.oil.tidy.data$X8 <- NULL
# Format data and replace missing values with zeros
crude.oil.tidy.data$X3 <- as.numeric(ifelse(crude.oil.tidy.data$X3 == "N/A", "0",str_replace_all(crude.oil.tidy.data$X3,",","")))
crude.oil.tidy.data$X5 <- as.numeric(ifelse(crude.oil.tidy.data$X5 == "N/A", "0",str_replace_all(crude.oil.tidy.data$X5,",","")))
crude.oil.tidy.data <- rename(crude.oil.tidy.data, Month = X2, Barrels = X3, Value = X5, Year = X1)
# Summarise data
crude.oil.yearly.data <- crude.oil.tidy.data %>%
group_by(Year) %>%
summarise(TotalBarrels = round(sum(Barrels)/10000,2), TotalValue = round(sum(Value)/10000,2), AverageBarrels = round(mean(Barrels)/10000,2), AverageValue = round(mean(Value)/10000,2))
# Generate report
crude.oil.yearly.data %>%
select (Year, AverageBarrels) %>%
kable(digits = 2, col.names = c("Year", "Quantity (10 thousands of Barrels)"), format='pandoc', caption = "U.S. Yearly Average Imports of Crude Oil 1973 - 2016")
| Year | Quantity (10 thousands of Barrels) |
|---|---|
| 1973 | 11.61 |
| 1974 | 12.20 |
| 1975 | 13.21 |
| 1976 | 17.09 |
| 1977 | 21.00 |
| 1978 | 19.94 |
| 1979 | 20.56 |
| 1980 | 16.48 |
| 1981 | 14.69 |
| 1982 | 11.84 |
| 1983 | 10.78 |
| 1984 | 11.00 |
| 1985 | 10.51 |
| 1986 | 13.62 |
| 1987 | 14.54 |
| 1988 | 15.73 |
| 1989 | 17.89 |
| 1990 | 18.47 |
| 1991 | 17.88 |
| 1992 | 19.12 |
| 1993 | 21.19 |
| 1994 | 22.53 |
| 1995 | 23.06 |
| 1996 | 24.11 |
| 1997 | 25.58 |
| 1998 | 27.02 |
| 1999 | 26.90 |
| 2000 | 28.33 |
| 2001 | 28.93 |
| 2002 | 28.48 |
| 2003 | 30.63 |
| 2004 | 31.84 |
| 2005 | 31.29 |
| 2006 | 31.12 |
| 2007 | 30.75 |
| 2008 | 29.92 |
| 2009 | 27.62 |
| 2010 | 28.14 |
| 2011 | 27.68 |
| 2012 | 25.81 |
| 2013 | 23.45 |
| 2014 | 22.51 |
| 2015 | 22.18 |
| 2016 | 23.39 |
# Graphical representation
ggplot(data=crude.oil.yearly.data, aes(x=Year, y=AverageBarrels, group=1)) +
geom_line() +
geom_point() + labs(x="Year", y="Quantity (10 thousands of Barrels)", title = "U.S. Yearly Average Imports of Crude Oil 1973 - 2016")
crude.oil.yearly.data %>%
select (Year, AverageValue) %>%
kable(digits = 2, col.names = c("Year", "Value (10 thousands of Dollars)"), format='pandoc', caption = "U.S. Yearly Average Imports of Crude Oil 1973 - 2016")
| Year | Value (10 thousands of Dollars) |
|---|---|
| 1973 | 38.27 |
| 1974 | 137.35 |
| 1975 | 153.12 |
| 1976 | 212.33 |
| 1977 | 279.85 |
| 1978 | 267.84 |
| 1979 | 384.17 |
| 1980 | 516.79 |
| 1981 | 516.17 |
| 1982 | 395.38 |
| 1983 | 318.20 |
| 1984 | 304.41 |
| 1985 | 275.29 |
| 1986 | 189.34 |
| 1987 | 244.34 |
| 1988 | 223.70 |
| 1989 | 295.00 |
| 1990 | 364.87 |
| 1991 | 312.20 |
| 1992 | 321.28 |
| 1993 | 320.58 |
| 1994 | 320.66 |
| 1995 | 364.58 |
| 1996 | 457.76 |
| 1997 | 451.88 |
| 1998 | 310.44 |
| 1999 | 424.08 |
| 2000 | 748.97 |
| 2001 | 619.11 |
| 2002 | 644.03 |
| 2003 | 826.39 |
| 2004 | 1097.86 |
| 2005 | 1464.63 |
| 2006 | 1805.23 |
| 2007 | 1976.76 |
| 2008 | 2849.27 |
| 2009 | 1572.60 |
| 2010 | 2101.34 |
| 2011 | 2763.18 |
| 2012 | 2609.79 |
| 2013 | 2273.39 |
| 2014 | 2053.40 |
| 2015 | 1048.48 |
| 2016 | 844.10 |
# Graphical representation
ggplot(data=crude.oil.yearly.data, aes(x=Year, y=AverageValue, group=1)) +
geom_line() +
geom_point() + labs(x="Year", y="Value (10 thousands of Dollars)", title = "U.S. Yearly Average Imports of Crude Oil 1973 - 2016")
# Generate data for seasons
seasons <- c("Winter", "Winter", "Winter", "Spring", "Spring", "Spring", "Summer", "Summer", "Summer", "Fall", "Fall", "Fall")
m <- c("December", "January", "February", "March","April", "May", "June", "July", "August", "September", "October", "November")
id <- c(4,4,4,1,1,1,2,2,2,3,3,3)
seasons.data <- data.frame(seasons, m, id)
seasons.data$seasons <- as.character(seasons.data$seasons)
seasons.data$m <- as.character(seasons.data$m)
crude.oil.season.data <- crude.oil.tidy.data %>%
filter(Year >= 2007) %>%
inner_join(seasons.data, c("Month" = "m")) %>%
group_by(Year,seasons,id) %>%
summarise(TotalBarrels = round(sum(Barrels)/10000,2), TotalValue = round(sum(Value)/10000,2), AverageBarrels = round(mean(Barrels)/10000,2), AverageValue = round(mean(Value)/10000,2))
# Rearrange the data
crude.oil.season.data <- arrange(crude.oil.season.data, Year, id, seasons)
crude.oil.season.data <- transform(crude.oil.season.data, seasons=factor(seasons,levels=unique(seasons)))
crude.oil.season.data %>%
arrange(Year, id) %>%
select (Year, seasons, TotalBarrels) %>%
kable(digits = 2, col.names = c("Year", "Season", "Quantity (10 thousands of Dollars)"), format='pandoc', caption = "U.S. Imports of Crude Oil per Season 2007 - 2016")
| Year | Season | Quantity (10 thousands of Dollars) |
|---|---|---|
| 2007 | Spring | 94.70 |
| 2007 | Summer | 94.99 |
| 2007 | Fall | 91.79 |
| 2007 | Winter | 87.59 |
| 2008 | Spring | 87.85 |
| 2008 | Summer | 95.12 |
| 2008 | Fall | 83.32 |
| 2008 | Winter | 92.77 |
| 2009 | Spring | 84.38 |
| 2009 | Summer | 84.56 |
| 2009 | Fall | 79.05 |
| 2009 | Winter | 83.49 |
| 2010 | Spring | 87.49 |
| 2010 | Summer | 92.84 |
| 2010 | Fall | 80.27 |
| 2010 | Winter | 77.11 |
| 2011 | Spring | 82.36 |
| 2011 | Summer | 87.77 |
| 2011 | Fall | 80.93 |
| 2011 | Winter | 81.14 |
| 2012 | Spring | 81.43 |
| 2012 | Summer | 81.50 |
| 2012 | Fall | 74.93 |
| 2012 | Winter | 71.87 |
| 2013 | Spring | 69.29 |
| 2013 | Summer | 73.81 |
| 2013 | Fall | 68.47 |
| 2013 | Winter | 69.80 |
| 2014 | Spring | 67.60 |
| 2014 | Summer | 66.85 |
| 2014 | Fall | 64.05 |
| 2014 | Winter | 71.60 |
| 2015 | Spring | 66.12 |
| 2015 | Summer | 67.94 |
| 2015 | Fall | 65.10 |
| 2015 | Winter | 67.04 |
| 2016 | Spring | 69.66 |
| 2016 | Summer | 72.69 |
| 2016 | Fall | 70.32 |
| 2016 | Winter | 67.97 |
ggplot(data=crude.oil.season.data, aes(x=Year, y=TotalBarrels, fill=seasons)) +
geom_bar(stat="identity") + labs(x="Year", y="Quantity (10 thousands of Barrels)", title = "U.S. Imports of Crude Oil per Season 2007 - 2016")
crude.oil.season.data %>%
arrange(Year, id) %>%
select (Year, seasons, TotalValue) %>%
kable(digits = 2, col.names = c("Year", "Season", "Value (10 thousands of Dollars)"), format='pandoc', caption = "U.S. Imports of Crude Oil per Season 2007 - 2016")
| Year | Season | Value (10 thousands of Dollars) |
|---|---|---|
| 2007 | Spring | 5358.51 |
| 2007 | Summer | 6163.29 |
| 2007 | Fall | 6756.07 |
| 2007 | Winter | 5443.30 |
| 2008 | Spring | 8594.62 |
| 2008 | Summer | 11492.08 |
| 2008 | Fall | 7379.61 |
| 2008 | Winter | 6724.93 |
| 2009 | Spring | 3902.07 |
| 2009 | Summer | 5253.04 |
| 2009 | Fall | 5476.85 |
| 2009 | Winter | 4239.22 |
| 2010 | Spring | 6658.94 |
| 2010 | Summer | 6744.47 |
| 2010 | Fall | 5968.91 |
| 2010 | Winter | 5843.73 |
| 2011 | Spring | 8376.87 |
| 2011 | Summer | 9151.45 |
| 2011 | Fall | 8157.05 |
| 2011 | Winter | 7472.83 |
| 2012 | Spring | 8838.33 |
| 2012 | Summer | 7824.15 |
| 2012 | Fall | 7395.09 |
| 2012 | Winter | 7259.93 |
| 2013 | Spring | 6727.45 |
| 2013 | Summer | 7236.61 |
| 2013 | Fall | 6779.38 |
| 2013 | Winter | 6537.28 |
| 2014 | Spring | 6432.45 |
| 2014 | Summer | 6476.85 |
| 2014 | Fall | 5652.31 |
| 2014 | Winter | 6079.17 |
| 2015 | Spring | 3156.12 |
| 2015 | Summer | 3565.80 |
| 2015 | Fall | 2651.97 |
| 2015 | Winter | 3207.90 |
| 2016 | Spring | 2115.18 |
| 2016 | Summer | 2899.84 |
| 2016 | Fall | 2809.48 |
| 2016 | Winter | 2304.68 |
ggplot(data=crude.oil.season.data, aes(x=Year, y=TotalValue, fill=seasons)) +
geom_bar(stat="identity") + labs(x="Year", y="Value (10 thousands of Dollars)", title = "U.S. Imports of Crude Oil per Season 2007 - 2016")
Conclusion:
Import of crude oil incresed gradually over a period of time util 2005. Imports peaked between 2004 and 2006.
Imports of crude oil started to fall gradually after year 2006. In 2015 imports where similar to 1994.
Cost of crude oil imports increased sharply between 2000 and 2008.
Cost of crude oil imports dropped sharply in 2009.
Cost of crude oil imports increased sharply between 2010 and 2012.
Cost of crude oil imports started to fall since 2012.
General trend for last 10 year seasonly crude oil import shows declining.
References: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html https://github.com/rstudio/webinars/tree/master/05-Data-Wrangling-with-R-and-RStudio https://www.r-bloggers.com/importexport-data-to-and-from-xlsx-files/
1: For the course of this project words dataset and database are interchangeable.