Project sub-parts

Introduction

In the ever more connected (read globalized) world, Foreign direct investments (FDI) are an important source of revenue for any Nation. FDIs spur economic growth and also generate employment. In addition to these advantages, they promote technology and knowledge sharing and help bolster international trade relationships. Cross cultural interaction are another positive side effect of FDIs.

FDIs can be made in any sector/industry - Retail, Wholesale, Oil and Gas, Manufacturing etc.. The amount of investment that a nation receives in a particular sector is dependent on many factors. For instance a country like Japan, which has no oil wells of it’s own, is self sufficient in power generation but is dependent on oil producing countries to meet it’s oil and gas needs. Middle East although self sufficient to meet it’s oil demands, is dependent on other countries to equip its military as it does not have the infrastructure required to produce weapons.

The dataset that has been chosen for this study captures US’s FDIs in other countries over 16 years - 1982 to 1998. There are over 70 countries that have been captured across numerous industries. The source of this data is The Bureau of Economic Analysis (BEA) which promotes better understanding of the US economy by providing the most timely, relevant and accurate economic accounts. The data can be found here

Objective Oil has been described as ‘Liquid Gold’ because it has become so essential for our day to day sustenance that we have almost taken its supply for granted. But what goes unnoticed is the intricately and impeccably planned supply chain that keeps the gas stations always ready to pump gas from. In order to maintain this supply chain, the US has to make investments in the Petroleum industry all across the world. From a risk mitigation perspective such investments should be always diversified to avoid single point of dependency. These investments not only ensure a virtually limitless supply of oil to the citizens of the US but also serve in furthering American strategic interests in the world.

The US Energy Information Administration in it’s January 2018 shows that Canada exported more than one million barrels per day to the United States1. Even though we have data for 1982 - 1998, we are going to attempt to try challenge this finding by comparing the investments to the petroleum imported.

Packages needed

The packages needed for this study are

# loading the packages
library(dplyr) # transforming data
library(tidyr) # creating tidy data
library(stringr) #For manipulating strings
library(tibble) # coercing data to tibbles
library(DT) # for printing nice HTML output tablesinstall.packages()
library(ggplot2) # visualizing data

Data Preparation

Steps to retrieve data

  • On the landing page of the URL, select Investment type as United States direct investment abroad and Data Type as Balance of payments and direct investment position data.
  • Hit Next step and for Choose type of series select U.S. Direct Investment Position Abroad on a Historical-Cost Basis.
  • Move to the next page and for Classification select By Country and Industry.
  • On the next page select 1999 forward - Industry classification based on NAICS for Year Range.
  • For Industries select the following items:
    • Petrol
    • Total Manufacturing
    • Food and Kindered Products
    • Chemicals and Allied Products
    • Primary and fabricated metals
    • Industrial machinery and equipment
    • Electrical equipment, appliances, and components
    • Transportation Equipment
    • Other Manufacturing
    • Wholesale Trade
    • Finance (except depository institutions), insurance, and real estate
    • Services
  • For Countries select any country that seem relevant to your case. For the purpose of this study we have restricted to selecting-
    • Canada
    • Europe
    • Latin America
    • Africa
    • Middle East
    • Asia Pacific

The number of variables contained in the data is:

#reading the csv file
invest <- read.csv("US_investments_abroad_industry_wise.csv", header = TRUE, sep = ",",
                skip = 6)
## [1]  79 205

There are 79 rows and over 200 columns in this dataset. The names of a list of the columns aer the same as the industries we chose in the data set. The next step of the process is to import and then clean the data.

Data dictionary

  • D - Indicates that the data in the cell have been suppressed to avoid disclosure of data of individual companies
  • n.s - Stands for Not shown. Data may not be shown for several reasons.
  • * - Indicates a non-zero value between -$500,000 and +$500,000 or fewer than 50 employees, as appropriate.

The data type of the varaibles are:

#Listing the classes of the variables.
str(invest)
## 'data.frame':    79 obs. of  205 variables:
##  $ X                                                                      : Factor w/ 74 levels "","    Argentina",..: 1 69 71 72 24 25 28 29 31 32 ...
##  $ Petroleum                                                              : Factor w/ 57 levels "(D)","1024","10421",..: 15 41 3 20 1 55 57 47 1 4 ...
##  $ Petroleum.1                                                            : Factor w/ 59 levels "-2","(D)","1020",..: 17 41 4 20 2 50 59 54 2 55 ...
##  $ Petroleum.2                                                            : Factor w/ 59 levels "-28","(D)","10324",..: 19 41 4 23 2 43 59 47 2 36 ...
##  $ Petroleum.3                                                            : Factor w/ 63 levels "-10","-168","-5",..: 19 48 6 25 5 50 63 54 5 46 ...
##  $ Petroleum.4                                                            : Factor w/ 60 levels "-11","(D)","10173",..: 17 46 4 20 57 37 60 33 2 55 ...
##  $ Petroleum.5                                                            : Factor w/ 62 levels "-13","-14","(D)",..: 19 48 6 24 7 44 62 16 3 61 ...
##  $ Petroleum.6                                                            : Factor w/ 59 levels "-2","(D)","1",..: 19 46 7 23 8 48 59 2 2 57 ...
##  $ Petroleum.7                                                            : Factor w/ 55 levels "-10","-8","-96",..: 21 44 7 17 5 41 55 5 5 5 ...
##  $ Petroleum.8                                                            : Factor w/ 51 levels "-8","(D)","102",..: 18 42 5 23 2 40 51 2 2 2 ...
##  $ Petroleum.9                                                            : Factor w/ 54 levels "-1","-8","-90",..: 18 42 5 22 4 27 54 4 4 4 ...
##  $ Petroleum.10                                                           : Factor w/ 53 levels "-45","(D)","1051",..: 21 44 51 26 2 18 53 2 2 2 ...
##  $ Petroleum.11                                                           : Factor w/ 55 levels "-39","(D)","1011",..: 19 47 52 25 21 17 55 2 2 54 ...
##  $ Petroleum.12                                                           : Factor w/ 55 levels "-89","(*)","(D)",..: 18 42 5 21 3 14 55 3 3 6 ...
##  $ Petroleum.13                                                           : Factor w/ 58 levels "(D)","1","108",..: 18 50 56 23 1 14 58 12 1 6 ...
##  $ Petroleum.14                                                           : Factor w/ 56 levels "(*)","(D)","10131",..: 21 48 3 25 2 13 56 36 2 6 ...
##  $ Petroleum.15                                                           : Factor w/ 57 levels "(*)","(D)","10423",..: 22 54 5 30 2 15 57 38 2 56 ...
##  $ Petroleum.16                                                           : Factor w/ 59 levels "(D)","1048","1083",..: 22 55 5 35 13 8 59 46 39 3 ...
##  $ Total.Manufacturing                                                    : Factor w/ 69 levels "-112","(D)","0",..: 30 61 29 48 31 35 69 21 11 51 ...
##  $ Total.Manufacturing.1                                                  : Factor w/ 73 levels "-280","(*)","10215",..: 28 65 27 46 15 31 73 24 18 47 ...
##  $ Total.Manufacturing.2                                                  : Factor w/ 72 levels "-262","-5","(D)",..: 21 65 24 46 23 33 72 19 22 47 ...
##  $ Total.Manufacturing.3                                                  : Factor w/ 68 levels "-256","-3","(D)",..: 20 66 24 43 22 30 68 19 3 50 ...
##  $ Total.Manufacturing.4                                                  : Factor w/ 75 levels "-227","-3","-70",..: 28 9 35 54 24 36 75 23 34 60 ...
##  $ Total.Manufacturing.5                                                  : Factor w/ 71 levels "-105","(D)","1",..: 24 12 34 59 68 42 71 33 35 66 ...
##  $ Total.Manufacturing.6                                                  : Factor w/ 71 levels "-8","(D)","1070",..: 19 9 31 55 61 37 71 30 32 62 ...
##  $ Total.Manufacturing.7                                                  : Factor w/ 72 levels "(D)","10119",..: 21 15 31 62 36 41 72 37 45 3 ...
##  $ Total.Manufacturing.8                                                  : Factor w/ 72 levels "(*)","(D)","1",..: 24 22 38 68 35 43 72 40 69 8 ...
##  $ Total.Manufacturing.9                                                  : Factor w/ 74 levels "(*)","(D)","10",..: 26 22 36 70 43 50 74 41 54 13 ...
##  $ Total.Manufacturing.10                                                 : Factor w/ 73 levels "(*)","(D)","10167",..: 26 24 37 68 53 50 73 35 54 12 ...
##  $ Total.Manufacturing.11                                                 : Factor w/ 71 levels "(*)","(D)","10",..: 28 26 39 67 52 51 71 22 27 13 ...
##  $ Total.Manufacturing.12                                                 : Factor w/ 72 levels "(D)","0","1000",..: 27 28 38 71 63 57 72 1 33 13 ...
##  $ Total.Manufacturing.13                                                 : Factor w/ 71 levels "(D)","101","1024",..: 27 33 45 13 3 68 71 1 50 19 ...
##  $ Total.Manufacturing.14                                                 : Factor w/ 68 levels "-14","(D)","1003",..: 29 36 49 15 5 66 68 2 54 23 ...
##  $ Total.Manufacturing.15                                                 : Factor w/ 75 levels "-36","(D)","100",..: 30 37 51 10 73 70 75 59 69 23 ...
##  $ Total.Manufacturing.16                                                 : Factor w/ 74 levels "-48","(D)","1020",..: 22 36 45 9 4 70 74 51 73 21 ...
##  $ Food.and.Kindred.Products                                              : Factor w/ 47 levels "(*)","(D)","0",..: 22 44 12 33 19 6 47 41 3 30 ...
##  $ Food.and.Kindred.Products.1                                            : Factor w/ 50 levels "(*)","(D)","0",..: 21 48 13 35 15 10 50 47 3 34 ...
##  $ Food.and.Kindred.Products.2                                            : Factor w/ 51 levels "(*)","(D)","0",..: 19 50 14 34 2 12 51 46 3 33 ...
##  $ Food.and.Kindred.Products.3                                            : Factor w/ 50 levels "(*)","(D)","0",..: 17 48 13 36 2 14 50 49 3 30 ...
##  $ Food.and.Kindred.Products.4                                            : Factor w/ 53 levels "(*)","(D)","0",..: 19 5 22 45 11 25 53 7 3 42 ...
##  $ Food.and.Kindred.Products.5                                            : Factor w/ 51 levels "(*)","(D)","0",..: 15 9 14 47 2 27 51 11 3 42 ...
##  $ Food.and.Kindred.Products.6                                            : Factor w/ 51 levels "-11","(*)","(D)",..: 18 10 16 49 23 28 51 14 4 30 ...
##  $ Food.and.Kindred.Products.7                                            : Factor w/ 49 levels "-6","(*)","(D)",..: 15 7 18 38 11 20 49 17 5 29 ...
##  $ Food.and.Kindred.Products.8                                            : Factor w/ 51 levels "(*)","(D)","0",..: 17 12 25 46 24 32 51 21 4 35 ...
##  $ Food.and.Kindred.Products.9                                            : Factor w/ 51 levels "(D)","0","1",..: 16 12 25 49 29 32 51 22 3 4 ...
##  $ Food.and.Kindred.Products.10                                           : Factor w/ 51 levels "(D)","0","1",..: 17 19 26 5 29 33 51 9 3 7 ...
##  $ Food.and.Kindred.Products.11                                           : Factor w/ 53 levels "-15","-8","(D)",..: 15 24 35 8 9 40 53 3 5 20 ...
##  $ Food.and.Kindred.Products.12                                           : Factor w/ 51 levels "(*)","(D)","0",..: 14 19 33 4 2 35 51 2 36 10 ...
##  $ Food.and.Kindred.Products.13                                           : Factor w/ 56 levels "-2","(*)","(D)",..: 17 26 42 12 31 48 56 3 50 21 ...
##  $ Food.and.Kindred.Products.14                                           : Factor w/ 57 levels "-13","(D)","0",..: 17 25 37 13 26 5 57 2 49 22 ...
##  $ Food.and.Kindred.Products.15                                           : Factor w/ 60 levels "-1","-10","-4",..: 21 29 38 14 27 51 60 16 53 28 ...
##  $ Food.and.Kindred.Products.16                                           : Factor w/ 59 levels "-5","-9","(*)",..: 24 36 44 18 30 14 59 27 11 35 ...
##  $ Chemicals.and.Allied.Products                                          : Factor w/ 56 levels "(D)","0","1092",..: 19 17 40 51 29 5 56 1 43 53 ...
##  $ Chemicals.and.Allied.Products.1                                        : Factor w/ 58 levels "-3","(*)","(D)",..: 25 23 47 55 42 6 58 3 49 53 ...
##  $ Chemicals.and.Allied.Products.2                                        : Factor w/ 59 levels "-17","(*)","(D)",..: 24 22 46 51 49 8 59 3 3 50 ...
##  $ Chemicals.and.Allied.Products.3                                        : Factor w/ 61 levels "-25","(*)","(D)",..: 23 27 45 57 52 10 61 3 46 54 ...
##  $ Chemicals.and.Allied.Products.4                                        : Factor w/ 63 levels "-2","-3","-8",..: 30 35 48 10 8 18 63 5 51 60 ...
##  $ Chemicals.and.Allied.Products.5                                        : Factor w/ 59 levels "-1","-10","-8",..: 26 37 51 12 2 23 59 4 48 21 ...
##  $ Chemicals.and.Allied.Products.6                                        : Factor w/ 59 levels "-2","-6","(D)",..: 24 40 49 16 3 35 59 3 43 32 ...
##  $ Chemicals.and.Allied.Products.7                                        : Factor w/ 63 levels "(D)","0","1",..: 21 45 53 15 42 28 63 32 20 35 ...
##  $ Chemicals.and.Allied.Products.8                                        : Factor w/ 62 levels "(*)","(D)","0",..: 22 43 53 24 16 33 62 2 2 40 ...
##  $ Chemicals.and.Allied.Products.9                                        : Factor w/ 60 levels "-1","(*)","(D)",..: 18 42 46 25 22 34 60 3 6 40 ...
##  $ Chemicals.and.Allied.Products.10                                       : Factor w/ 61 levels "-1","(*)","(D)",..: 23 47 51 32 26 38 61 3 14 46 ...
##  $ Chemicals.and.Allied.Products.11                                       : Factor w/ 58 levels "-1","-3","(*)",..: 23 45 50 33 35 38 58 4 17 42 ...
##  $ Chemicals.and.Allied.Products.12                                       : Factor w/ 65 levels "-3","(*)","(D)",..: 17 47 53 31 33 49 65 59 13 45 ...
##  $ Chemicals.and.Allied.Products.13                                       : Factor w/ 60 levels "(*)","(D)","0",..: 18 49 54 37 24 52 60 11 21 46 ...
##  $ Chemicals.and.Allied.Products.14                                       : Factor w/ 60 levels "-13","(D)","0",..: 17 54 53 43 36 49 60 4 20 35 ...
##  $ Chemicals.and.Allied.Products.15                                       : Factor w/ 66 levels "-39","(D)","0",..: 18 59 60 45 44 47 66 54 27 37 ...
##  $ Chemicals.and.Allied.Products.16                                       : Factor w/ 68 levels "-17","(D)","0",..: 20 63 64 50 47 54 68 57 33 45 ...
##  $ Primary.and.fabricated.metals                                          : Factor w/ 35 levels "(D)","0","1",..: 11 23 8 16 12 34 35 1 2 14 ...
##  $ Primary.and.fabricated.metals.1                                        : Factor w/ 46 levels "-3","(*)","(D)",..: 16 30 12 15 17 7 46 3 2 14 ...
##  $ Primary.and.fabricated.metals.2                                        : Factor w/ 44 levels "-7","(*)","(D)",..: 17 31 14 16 18 43 44 3 5 12 ...
##  $ Primary.and.fabricated.metals.3                                        : Factor w/ 42 levels "-10","-18","(*)",..: 15 30 12 14 16 8 42 25 6 11 ...
##  $ Primary.and.fabricated.metals.4                                        : Factor w/ 44 levels "-1","-14","-30",..: 16 38 20 21 17 5 44 43 17 15 ...
##  $ Primary.and.fabricated.metals.5                                        : Factor w/ 43 levels "-1","-19","(*)",..: 18 38 24 21 19 12 43 4 19 16 ...
##  $ Primary.and.fabricated.metals.6                                        : Factor w/ 49 levels "-1","-2","-25",..: 25 46 36 29 26 17 49 6 35 20 ...
##  $ Primary.and.fabricated.metals.7                                        : Factor w/ 43 levels "-2","(*)","(D)",..: 16 39 21 30 17 11 43 28 26 29 ...
##  $ Primary.and.fabricated.metals.8                                        : Factor w/ 44 levels "-141","-2","-3",..: 21 9 27 34 22 20 44 33 5 37 ...
##  $ Primary.and.fabricated.metals.9                                        : Factor w/ 41 levels "-1","-187","-3",..: 18 40 24 33 25 19 41 30 34 31 ...
##  $ Primary.and.fabricated.metals.10                                       : Factor w/ 45 levels "-1","-127","-3",..: 18 44 27 34 19 24 45 39 31 32 ...
##  $ Primary.and.fabricated.metals.11                                       : Factor w/ 45 levels "-1","-187","-2",..: 24 44 29 37 30 21 45 7 1 38 ...
##  $ Primary.and.fabricated.metals.12                                       : Factor w/ 49 levels "-141","(*)","(D)",..: 22 48 26 35 33 20 49 3 3 34 ...
##  $ Primary.and.fabricated.metals.13                                       : Factor w/ 51 levels "-1","-99","(*)",..: 25 9 32 42 26 18 51 40 4 44 ...
##  $ Primary.and.fabricated.metals.14                                       : Factor w/ 46 levels "-127","(*)","(D)",..: 23 17 37 45 2 19 46 38 3 32 ...
##  $ Primary.and.fabricated.metals.15                                       : Factor w/ 50 levels "-1","-143","-2",..: 25 17 32 49 8 24 50 40 6 36 ...
##  $ Primary.and.fabricated.metals.16                                       : Factor w/ 48 levels "-43","-5","(*)",..: 22 18 32 9 23 11 48 44 4 41 ...
##  $ Industrial.machinery.and.equipment                                     : Factor w/ 35 levels "(*)","(D)","0",..: 17 8 15 33 2 24 35 4 3 10 ...
##  $ Industrial.machinery.and.equipment.1                                   : Factor w/ 36 levels "(*)","(D)","0",..: 19 11 22 34 2 25 36 4 4 9 ...
##  $ Industrial.machinery.and.equipment.2                                   : Factor w/ 39 levels "(*)","(D)","0",..: 19 14 24 37 26 25 39 4 4 11 ...
##  $ Industrial.machinery.and.equipment.3                                   : Factor w/ 38 levels "(*)","(D)","0",..: 15 13 20 7 27 21 38 1 4 14 ...
##  $ Industrial.machinery.and.equipment.4                                   : Factor w/ 39 levels "(*)","(D)","0",..: 16 19 21 10 2 11 39 2 38 20 ...
##  $ Industrial.machinery.and.equipment.5                                   : Factor w/ 41 levels "-24","-4","(*)",..: 15 21 27 13 34 18 41 4 37 26 ...
##  $ Industrial.machinery.and.equipment.6                                   : Factor w/ 42 levels "-3","(*)","(D)",..: 17 24 30 15 36 21 42 7 3 28 ...
##  $ Industrial.machinery.and.equipment.7                                   : Factor w/ 42 levels "-2","-3","(*)",..: 16 22 23 13 37 20 42 1 40 4 ...
##  $ Industrial.machinery.and.equipment.8                                   : Factor w/ 41 levels "-10","-4","-5",..: 16 24 22 14 5 15 41 2 40 28 ...
##  $ Industrial.machinery.and.equipment.9                                   : Factor w/ 42 levels "-5","(*)","(D)",..: 14 25 20 17 37 7 42 3 3 27 ...
##  $ Industrial.machinery.and.equipment.10                                  : Factor w/ 40 levels "-133","-5","(*)",..: 15 24 18 13 31 4 40 4 4 23 ...
##  $ Industrial.machinery.and.equipment.11                                  : Factor w/ 40 levels "-2","-4","-59",..: 18 24 16 12 27 39 40 5 5 20 ...
##  $ Industrial.machinery.and.equipment.12                                  : Factor w/ 40 levels "-71","(*)","(D)",..: 12 19 15 9 30 17 40 4 3 11 ...
##   [list output truncated]

Importing and cleaing

In its original form the data is very messy and inconsistent. There are several variables that have been mistaken to be observations. The values themselves contain some anamolies namely, D, NS, *. The objective of the data cleaning process is to rid the data set of these inconsistencies.

Data cleaning operations being performed

  1. Transposing the data and converting to a data frame. This is being done to interchange the variables and the observations.
  2. Assining titles to columns. Also, creating new columns for Industry and Year
  3. Removing the trailing spaces
  4. Removing the unneccessary suffix Industry columns
  5. Removing the ‘D’, ‘NS’ and the ’*’ in the data
invest <- read.csv("US_investments_abroad_industry_wise.csv", header = TRUE, sep = ",",
                skip = 6)
#loading the data as a dataframe
invest <- as_data_frame(invest)

#transposing the data
invest_t <- as.data.frame(t(invest))
#previewing the data frame
invest_t[1:10 , 1:5]

#assigning headers
colnames(invest_t) <- as.character(unlist(invest_t[1, ]))

#deleting the first row from the data frame
invest_t = invest_t[-1 ,  ] 

#previewing the dataframe
invest_t[1:10 , 1:5]

#Renaming the first column to year
names(invest_t)[names(invest_t) == ""] <- "Year"

#Use the 'rownames_to_column in dplyr package to carryout conversion.
invest_t <- rownames_to_column(invest_t,"Industry")


#removing the [1] from the column named Europe
names(invest_t)[names(invest_t) == "Europe[1]"] <- "Europe"

#removing the trailing spaces.
colnames(invest_t) <- str_trim(colnames(invest_t), "left")

#displaying output of column names
head(names(invest_t),6)

#renaming the values within the column 'Industry' using regular expressions
invest_t$Industry <- str_replace( invest_t$Industry, "^(Petroleum).*$","Petroleum") %>%
str_replace("^(Total.Manufacturing).*$","Total Manufacturing") %>%
str_replace("^(Food.and.Kindred.Products).*$","Food and Kindred Products") %>%
str_replace("^(Chemicals.and.Allied.Products).*$","Chemicals and Allied Products") %>%
str_replace("^(Primary.and.fabricated.metals).*$","Primary and fabricated metals") %>%
str_replace("^(Industrial.machinery.and.equipment).*$","Industrial machinery and equipment") %>%
str_replace("^(Electrical.equipment..appliances..and.components).*$","Electrical equipment appliances and components") %>% 
str_replace("^(Transportation.Equipment).*$","Transportation Equipment") %>%
str_replace("^(Other.Manufacturing).*$","Other Manufacturing") %>%
str_replace("^(Wholesale.Trade).*$","Wholesale Trade") %>%
str_replace("^(Finance..except.depository.institutions...insurance..and.real.estate).*$",
             "Finance except depository institutions insurance and real estate") %>%
str_replace("^(Services).*$","Services")

Cleaning activities thus far:

  1. Transposing the data and converting to a data frame
  2. Assining titles to columns. Also, creating new columns for Industry and Year
  3. Removing the trailing spaces
  4. Removing the unneccessary suffix Industry columns

Cleaning activities remaining:

  1. Removing the ‘D’, ‘NS’ and the ’*’ in the data
  2. Renaming the columns with “Other” in their names to appropriate and more comprehendable names
  3. Coercing the factors to character and numeric respectively.
#replacing the value of "(D)" with NA
invest_t[invest_t == "(D)"] <- NA

#replacing the value of "(*)" with NA
invest_t[invest_t == "(*)"] <- NA

#replacing the value of "n.s." with NA
invest_t[invest_t == "n.s."] <- NA

#renaming those columns that have names as other
names(invest_t)[names(invest_t) == "Other"] <- "Other European Countries" 
names(invest_t)[names(invest_t) == "Other1"] <- "Other Latin American Countries" 
names(invest_t)[names(invest_t) == "Other2"] <- "Other Western Hemisphere Countries"
names(invest_t)[names(invest_t) == "Other3"] <- "Other Carrebean Countries"
names(invest_t)[names(invest_t) == "Other4"] <- "Other African Countries"
names(invest_t)[names(invest_t) == "Other5"] <- "Other Middle East Countries"
names(invest_t)[names(invest_t) == "Other6"] <- "Other Asia Pacific Countries"

#Final and scrubbed data set
invest_final <- invest_t

#removing the aggregate columns
drops <- c("All Countries Total","Europe","Latin America and Other Western Hemisphere","Africa",
           "Other Western Hemisphere", "Asia and Pacific", "Middle East")
invest_final <- invest_t[ , !(names(invest_t) %in% drops)]
View(invest_final)

#Trying to convert for multiple columns in one go.
#invest_final[,2:80] <- apply(invest_final[,2:80], 2, as.character)
#invest_final[,2:80] <- apply(invest_final[,2:80], 2, as.numeric)

#Data table clean using the datatable package. Note to make the output scrollable we are using the scrollX = "200 px""
datatable(invest_final, options = list(pageLength = 10, scrollX = "200px"), caption = 'Table 1: Clean and tidy data.')

Investments in Petroleum industry - Region wise

Now that the data is scurbbed, it is ready for analysis. Our first order of business is to identify the amount of investment that US makes accross the world - region wise in the Petroleum industry. Once this is done we can compare the plot to see if it is in keeping with the table we saw in the introductory tab.

#EDA

#sum of investment in petroleum across all years
sub <- invest_final[1:17, ]

#transposing the dataset
sub_t <- as.data.frame(t(sub))

#assigning headers
colnames(sub_t) <- as.character(unlist(sub_t[1, ]))

#deleting the first row from the data frame
sub_t = sub_t[-1 ,  ] 

#adding country row
sub_t <- add_rownames(sub_t, "Country")

sub_num <- sub_t[,-1]

#before calculating aggregate, we convert from factor type to numeric type
sub_num[,1:17] <- apply(sub_num[, 1:17], 2, as.character)
sub_num[,1:17] <- apply(sub_num[, 1:17], 2, as.numeric)

#calculating aggregate sum
sub_sum <- sub_num %>%
  replace(is.na(.), 0) %>%
  mutate(Total = rowSums(.[1:7]))

# combining the country and total columns
x1 <- sub_sum$Total
x2 <- sub_t$Country

#binding the two vectors to form a data frame.
country_total <- cbind(x2, x1)


#removing the duplicated row names
country_total <- remove_rownames(as.data.frame(country_total))

#setting the column names
colnames(country_total) <- c("Country", "Total_investment")

#removing the duplicated row names
country_total <- remove_rownames(as.data.frame(country_total))
colnames(country_total) <- c("Country", "Total_investment")

#removing the all countries, year and europe row
#country_total <- country_total[-c(1, 2, 4, 28, 29, 38, 45, 55, 60, 65) , ]
#View(country_total)

#converting country_total to dataframe and converting to numeric
country_total <- as_data_frame(country_total)
country_total[,2] <- apply(country_total[, 2], 2, as.numeric)

#sorting the data and extracting top 10 entries
sorted <- country_total[ order(-country_total[,2]),]
top_10 <- head(sorted, 10)
#View(top_10)

#converting top_10 to a dataframe
top_10 <- as_data_frame(top_10)

#converting the total investment column to numeric
top_10[,2] <- apply(top_10[, 2], 2, as.numeric)

#################################

#incorporating the ggplot
#Creating a ggplot for comparing investment accross Canda over the years
df <- data.frame(top_10$Country, top_10$Total_investment)
ggplot(top_10, aes(top_10$Country, top_10$Total_investment)) + geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Countries", y = "Investments in Millions", title = "Investments in Petroleum - World over")

The country that comes on top is Canada. Let us drill a little deeper to understand how the US invests in Canada.

Investements in Canada

It would be really helpful to understand which industry on the whole has received the most investments over the years. For this purpose we will be making use of the ggplot package in R. We will be using the regular bar charts as we are dealing with both discrete and continous variables.

#creating an aggregated subset of total investment in canada over the years
invest_final <- as_data_frame(invest_final)
invest_final[,2:73] <- apply(invest_final[, 2:73], 2, as.numeric)

#Petroleum industry
invest_can <- c(sum(invest_final[1:17, 3]),sum(invest_final[18:34,3]), sum(invest_final[35:51, 3]), sum(invest_final[52:68, 3]), sum(invest_final[69:85, 3]),
                sum(invest_final[86:102, 3]), sum(invest_final[103:119, 3]), sum(invest_final[120:136, 3]), sum(invest_final[137:153, 3]), sum(invest_final[154:170, 3]), sum(invest_final[171:187, 3]), sum(invest_final[188:204, 3]))

#Creating a vector containing industry names
invest_industry <- c("Petroleum", "Manufacturing", "Food Products", "Chemical Products", "Fabricated metals", "Industry machinery", "Electrical equipment", "Transportation Equipment", "Other Manufacturing", "Wholesale Trade", "Finance", "Services")

#Creating a ggplot for comparing investment accross Canda over the years
df <- data.frame(invest_industry, invest_can)
ggplot(df, aes(invest_industry, invest_can)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Industry", y = "Investments in Millions", title = "Investments in Canada")

Even though the US invests the most in Canadian petroleum industry, it is the manufacturing industry that has received the most investments. It would be useful to study the trend of investment in petroleum industry. We see, that apat from a dip during 1993, the trend has been generally upward.

#creaing a subset of Petoleum industry
invest_pet <- invest_final[1:17,2:65]

#This graph compares the investments made in different Middle Eastern countries.
ggplot(invest_pet , aes(Year, Canada, group = 1)) + 
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Year", y = "Investments in Millions", title = "Investment in petroleum industry, Canada")

Investments in United Kingdom

The next most heavily invested country is UK. Analysing the amounts invested by the USA in United Kingdom across all industries by creating a plot for UK

#Creating an a vector containg the investments made in UK across all industries
invest_uk <- c(sum(invest_final[1:17, 25]),sum(invest_final[18:34,25]), sum(invest_final[35:51, 25]), sum(invest_final[52:68, 25]), sum(invest_final[69:85, 25]),sum(invest_final[86:102, 25]), sum(invest_final[103:119, 25]), sum(invest_final[120:136, 25]), sum(invest_final[137:153, 25]), sum(invest_final[154:170, 25]), sum(invest_final[171:187, 25]), sum(invest_final[188:204, 25]))

#ggplot to show investments in UK
df2 <- data.frame(invest_industry, invest_uk)
ggplot(df, aes(invest_industry, invest_uk)) + geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Industry", y = "Investments in Millions", title = "Investments in United Kingdom")

And, we see that it is the finance industry that gets the most investment. This makes sense as the economies of both these countries are connected and dependent.

Identiyfing the trend of US investments in the petroleum industry over the years.

#renaming the United kingdom column to United_Kingdom
names(invest_final)[names(invest_final) == 'United Kingdom' ] <- 'UK'
View(invest_final)

#creating a subset of Finance industry
invest_pet <- invest_final[1:17,2:65]

#This graph compares the investments made in different Middle Eastern countries.
ggplot(invest_pet , aes(Year, UK, group = 1)) + 
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Year", y = "Investments in Millions", title = "Investment in Petroleum industry, United Kingdom")

Investments in Indonesia

Creating a similar plot for Latin American countries

#Creating an a vector containg the investments made in Indonesia across all industries
invest_indo <- c(sum(invest_final[1:17, 64]),sum(invest_final[18:34,64]), sum(invest_final[35:51, 64]), sum(invest_final[52:68, 64]), sum(invest_final[69:85, 64]),
sum(invest_final[86:102, 64]), sum(invest_final[103:119, 64]), sum(invest_final[120:136, 64]), sum(invest_final[137:153, 64]), sum(invest_final[154:170, 64]), sum(invest_final[171:187, 64]), sum(invest_final[188:204, 64]))

#using ggplot to plot a barchart to illustrate the investment amounts across industries.
df2 <- data.frame(invest_industry, invest_indo)
ggplot(df, aes(invest_industry, invest_indo)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Industry", y = "Investments in Millions", title = "Investments in indonesia")

Interstingly Indonesia receives most of the US investment in the petroleum industry. Plotting the trend of investment in the petroleum industry in Indonesia

#creating a subset of Petroleum industry
invest_pet <- invest_final[1:17,2:65]

#This graph compares the investments made in Indonesia.
ggplot(invest_pet , aes(Year, Indonesia, group = 1)) + 
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Year", y = "Investments in Millions", title = "Investment in Petroleum industry, Indonesia")

Investments in Mexico

It would really interesting to analyze how the US invests in its Southern neighbour, Mexico!!! Creating a similar plot for Mexico

#Creating an a vector containg the investments made in Mexico across all industries

invest_mex <- c(sum(invest_final[1:17, 40]),sum(invest_final[18:34,40]), sum(invest_final[35:51, 40]), sum(invest_final[52:68, 40]), sum(invest_final[69:85, 40]),sum(invest_final[86:102, 40]), sum(invest_final[103:119, 40]), sum(invest_final[120:136, 40]), sum(invest_final[137:153, 40]), sum(invest_final[154:170, 40]), sum(invest_final[171:187, 40]), sum(invest_final[188:204, 40]))

#ggplot
df2 <- data.frame(invest_industry, invest_mex)
ggplot(df, aes(invest_industry, invest_mex)) + geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Industry", y = "Investments in Millions", title = "Investments in Mexico")

We notice that Mexico receives very neglible amount of investment in the Petroleum industry. On the contrary the manufacutring industry seems to be heavily invested in. This could be attributed to the affordable labour in Mexico, that attracts US based companies to shift manufacturing bases. Let us study the trend of investment in the manufacturing industry.

#creating a subset of Manufacturing industry
invest_man <- invest_final[18:34,2:43]

#Creating the ggplot-bar graph. 
#This graph compares the investments made in Canada across 16 years in the petroleum industry.
ggplot(invest_man , aes(Year, Mexico)) + 
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Year", y = "Investments in Millions", title = "Investment in manufacturing industry, Mexico")

Conclusion

Based on our study we learnt that contrary to popular belief of the Middle East being US’s primary oil supplier, it is Canada. Infact, none of the Middle Eastern Countries even come in the top 10 most heavily invested countries. Surpringly though, from the APAC region, it is Indonesia that supplies the most oil to the US. Apart from a brief slump in investment, the overall trend of US investment in oil has been upwards. What will be intersting to see is how the US’s investment strategy for renewable energy ( a potential replacement of fossil fuels) would be. Unfortunately, the data set we worked with did not provide any visibility on that front. In closing we would like to point out that in times of growing dependence on oil to meet our energy needs, a slump in overall investment in the petroleum industry can be expected.

References

1 - US Energy Information Administration - https://www.eia.gov/petroleum/imports/companylevel/