Project sub-parts

Introduction

In the ever more connected (read globalized) world, Foreign direct investments (FDI) are an important source of revenue for any Nation. FDIs spur economic growth and also generate employment. In addition to these advantages, they promote technology and knowledge sharing and help bolster international trade relationships. Cross cultural interaction are another positive side effect of FDIs.

FDIs can be made in any sector/industry - Retail, Wholesale, Oil and Gas, Manufacturing etc.. The amount of investment that a nation receives in a particular sector is dependent on many factors. For instance a country like Japan, which has no oil wells of it’s own, is self sufficient in power generation but is dependent on oil producing countries to meet it’s oil and gas needs. Middle East although self sufficient to meet it’s oil demands, is dependent on other countries to equip its military as it does not have the infrastructure required to produce weapons.

The dataset that has been chosen for this study captures US’s FDIs in other countries over 16 years - 1982 to 1998. There are over 70 countries that have been captured across numerous industries. The source of this data is The Bureau of Economic Analysis (BEA) which promotes better understanding of the US economy by providing the most timely, relevant and accurate economic accounts. The data can be found here

Objective Oil has been described as ‘Liquid Gold’ because it has become so essential for our day to day sustenance that we have almost taken its supply for granted. But what goes unnoticed is the intricately and impeccably planned supply chain that keeps the gas stations always ready to pump gas from. In order to maintain this supply chain, the US has to make investments in the Petroleum industry all across the world. From a risk mitigation perspective such investments should be always diversified to avoid single point of dependency. These investments not only ensure a virtually limitless supply of oil to the citizens of the US but also serve in furthering American strategic interests in the world.

The US Energy Information Administration in it’s January 2018 shows that Canada exported more than one million barrels per day to the United States1. Even though we have data for 1982 - 1998, we are going to attempt to try challenge this finding by comparing the investments to the petroleum imported.

Packages needed

The packages needed for this study are

# loading the packages
library(dplyr) # transforming data
library(tidyr) # creating tidy data
library(stringr) #For manipulating strings
library(tibble) # coercing data to tibbles
library(DT) # for printing nice HTML output tablesinstall.packages()
library(ggplot2) # visualizing data

Data Preparation

Steps to retrieve data

  • On the landing page of the URL, select Investment type as United States direct investment abroad and Data Type as Balance of payments and direct investment position data.
  • Hit Next step and for Choose type of series select U.S. Direct Investment Position Abroad on a Historical-Cost Basis.
  • Move to the next page and for Classification select By Country and Industry.
  • On the next page select 1999 forward - Industry classification based on NAICS for Year Range.
  • For Industries select the following items:
    • Petrol
    • Total Manufacturing
    • Food and Kindered Products
    • Chemicals and Allied Products
    • Primary and fabricated metals
    • Industrial machinery and equipment
    • Electrical equipment, appliances, and components
    • Transportation Equipment
    • Other Manufacturing
    • Wholesale Trade
    • Finance (except depository institutions), insurance, and real estate
    • Services
  • For Countries select any country that seem relevant to your case. For the purpose of this study we have restricted to selecting-
    • Canada
    • Europe
    • Latin America
    • Africa
    • Middle East
    • Asia Pacific

The number of variables contained in the data is:

#reading the csv file
invest <- read.csv("US_investments_abroad_industry_wise.csv", header = TRUE, sep = ",",
                skip = 6)
## [1]  79 205

There are 79 rows and over 200 columns in this dataset. The names of a list of the columns aer the same as the industries we chose in the data set. The next step of the process is to import and then clean the data.

Data dictionary

  • D - Indicates that the data in the cell have been suppressed to avoid disclosure of data of individual companies
  • n.s - Stands for Not shown. Data may not be shown for several reasons.
  • * - Indicates a non-zero value between -$500,000 and +$500,000 or fewer than 50 employees, as appropriate.

The data type of the varaibles are:

#Listing the classes of the variables.
str(invest)
## 'data.frame':    79 obs. of  205 variables:
##  $ X                                                                      : Factor w/ 74 levels "","    Argentina",..: 1 69 71 72 24 25 28 29 31 32 ...
##  $ Petroleum                                                              : Factor w/ 57 levels "(D)","1024","10421",..: 15 41 3 20 1 55 57 47 1 4 ...
##  $ Petroleum.1                                                            : Factor w/ 59 levels "-2","(D)","1020",..: 17 41 4 20 2 50 59 54 2 55 ...
##  $ Petroleum.2                                                            : Factor w/ 59 levels "-28","(D)","10324",..: 19 41 4 23 2 43 59 47 2 36 ...
##  $ Petroleum.3                                                            : Factor w/ 63 levels "-10","-168","-5",..: 19 48 6 25 5 50 63 54 5 46 ...
##  $ Petroleum.4                                                            : Factor w/ 60 levels "-11","(D)","10173",..: 17 46 4 20 57 37 60 33 2 55 ...
##  $ Petroleum.5                                                            : Factor w/ 62 levels "-13","-14","(D)",..: 19 48 6 24 7 44 62 16 3 61 ...
##  $ Petroleum.6                                                            : Factor w/ 59 levels "-2","(D)","1",..: 19 46 7 23 8 48 59 2 2 57 ...
##  $ Petroleum.7                                                            : Factor w/ 55 levels "-10","-8","-96",..: 21 44 7 17 5 41 55 5 5 5 ...
##  $ Petroleum.8                                                            : Factor w/ 51 levels "-8","(D)","102",..: 18 42 5 23 2 40 51 2 2 2 ...
##  $ Petroleum.9                                                            : Factor w/ 54 levels "-1","-8","-90",..: 18 42 5 22 4 27 54 4 4 4 ...
##  $ Petroleum.10                                                           : Factor w/ 53 levels "-45","(D)","1051",..: 21 44 51 26 2 18 53 2 2 2 ...
##  $ Petroleum.11                                                           : Factor w/ 55 levels "-39","(D)","1011",..: 19 47 52 25 21 17 55 2 2 54 ...
##  $ Petroleum.12                                                           : Factor w/ 55 levels "-89","(*)","(D)",..: 18 42 5 21 3 14 55 3 3 6 ...
##  $ Petroleum.13                                                           : Factor w/ 58 levels "(D)","1","108",..: 18 50 56 23 1 14 58 12 1 6 ...
##  $ Petroleum.14                                                           : Factor w/ 56 levels "(*)","(D)","10131",..: 21 48 3 25 2 13 56 36 2 6 ...
##  $ Petroleum.15                                                           : Factor w/ 57 levels "(*)","(D)","10423",..: 22 54 5 30 2 15 57 38 2 56 ...
##  $ Petroleum.16                                                           : Factor w/ 59 levels "(D)","1048","1083",..: 22 55 5 35 13 8 59 46 39 3 ...
##  $ Total.Manufacturing                                                    : Factor w/ 69 levels "-112","(D)","0",..: 30 61 29 48 31 35 69 21 11 51 ...
##  $ Total.Manufacturing.1                                                  : Factor w/ 73 levels "-280","(*)","10215",..: 28 65 27 46 15 31 73 24 18 47 ...
##  $ Total.Manufacturing.2                                                  : Factor w/ 72 levels "-262","-5","(D)",..: 21 65 24 46 23 33 72 19 22 47 ...
##  $ Total.Manufacturing.3                                                  : Factor w/ 68 levels "-256","-3","(D)",..: 20 66 24 43 22 30 68 19 3 50 ...
##  $ Total.Manufacturing.4                                                  : Factor w/ 75 levels "-227","-3","-70",..: 28 9 35 54 24 36 75 23 34 60 ...
##  $ Total.Manufacturing.5                                                  : Factor w/ 71 levels "-105","(D)","1",..: 24 12 34 59 68 42 71 33 35 66 ...
##  $ Total.Manufacturing.6                                                  : Factor w/ 71 levels "-8","(D)","1070",..: 19 9 31 55 61 37 71 30 32 62 ...
##  $ Total.Manufacturing.7                                                  : Factor w/ 72 levels "(D)","10119",..: 21 15 31 62 36 41 72 37 45 3 ...
##  $ Total.Manufacturing.8                                                  : Factor w/ 72 levels "(*)","(D)","1",..: 24 22 38 68 35 43 72 40 69 8 ...
##  $ Total.Manufacturing.9                                                  : Factor w/ 74 levels "(*)","(D)","10",..: 26 22 36 70 43 50 74 41 54 13 ...
##  $ Total.Manufacturing.10                                                 : Factor w/ 73 levels "(*)","(D)","10167",..: 26 24 37 68 53 50 73 35 54 12 ...
##  $ Total.Manufacturing.11                                                 : Factor w/ 71 levels "(*)","(D)","10",..: 28 26 39 67 52 51 71 22 27 13 ...
##  $ Total.Manufacturing.12                                                 : Factor w/ 72 levels "(D)","0","1000",..: 27 28 38 71 63 57 72 1 33 13 ...
##  $ Total.Manufacturing.13                                                 : Factor w/ 71 levels "(D)","101","1024",..: 27 33 45 13 3 68 71 1 50 19 ...
##  $ Total.Manufacturing.14                                                 : Factor w/ 68 levels "-14","(D)","1003",..: 29 36 49 15 5 66 68 2 54 23 ...
##  $ Total.Manufacturing.15                                                 : Factor w/ 75 levels "-36","(D)","100",..: 30 37 51 10 73 70 75 59 69 23 ...
##  $ Total.Manufacturing.16                                                 : Factor w/ 74 levels "-48","(D)","1020",..: 22 36 45 9 4 70 74 51 73 21 ...
##  $ Food.and.Kindred.Products                                              : Factor w/ 47 levels "(*)","(D)","0",..: 22 44 12 33 19 6 47 41 3 30 ...
##  $ Food.and.Kindred.Products.1                                            : Factor w/ 50 levels "(*)","(D)","0",..: 21 48 13 35 15 10 50 47 3 34 ...
##  $ Food.and.Kindred.Products.2                                            : Factor w/ 51 levels "(*)","(D)","0",..: 19 50 14 34 2 12 51 46 3 33 ...
##  $ Food.and.Kindred.Products.3                                            : Factor w/ 50 levels "(*)","(D)","0",..: 17 48 13 36 2 14 50 49 3 30 ...
##  $ Food.and.Kindred.Products.4                                            : Factor w/ 53 levels "(*)","(D)","0",..: 19 5 22 45 11 25 53 7 3 42 ...
##  $ Food.and.Kindred.Products.5                                            : Factor w/ 51 levels "(*)","(D)","0",..: 15 9 14 47 2 27 51 11 3 42 ...
##  $ Food.and.Kindred.Products.6                                            : Factor w/ 51 levels "-11","(*)","(D)",..: 18 10 16 49 23 28 51 14 4 30 ...
##  $ Food.and.Kindred.Products.7                                            : Factor w/ 49 levels "-6","(*)","(D)",..: 15 7 18 38 11 20 49 17 5 29 ...
##  $ Food.and.Kindred.Products.8                                            : Factor w/ 51 levels "(*)","(D)","0",..: 17 12 25 46 24 32 51 21 4 35 ...
##  $ Food.and.Kindred.Products.9                                            : Factor w/ 51 levels "(D)","0","1",..: 16 12 25 49 29 32 51 22 3 4 ...
##  $ Food.and.Kindred.Products.10                                           : Factor w/ 51 levels "(D)","0","1",..: 17 19 26 5 29 33 51 9 3 7 ...
##  $ Food.and.Kindred.Products.11                                           : Factor w/ 53 levels "-15","-8","(D)",..: 15 24 35 8 9 40 53 3 5 20 ...
##  $ Food.and.Kindred.Products.12                                           : Factor w/ 51 levels "(*)","(D)","0",..: 14 19 33 4 2 35 51 2 36 10 ...
##  $ Food.and.Kindred.Products.13                                           : Factor w/ 56 levels "-2","(*)","(D)",..: 17 26 42 12 31 48 56 3 50 21 ...
##  $ Food.and.Kindred.Products.14                                           : Factor w/ 57 levels "-13","(D)","0",..: 17 25 37 13 26 5 57 2 49 22 ...
##  $ Food.and.Kindred.Products.15                                           : Factor w/ 60 levels "-1","-10","-4",..: 21 29 38 14 27 51 60 16 53 28 ...
##  $ Food.and.Kindred.Products.16                                           : Factor w/ 59 levels "-5","-9","(*)",..: 24 36 44 18 30 14 59 27 11 35 ...
##  $ Chemicals.and.Allied.Products                                          : Factor w/ 56 levels "(D)","0","1092",..: 19 17 40 51 29 5 56 1 43 53 ...
##  $ Chemicals.and.Allied.Products.1                                        : Factor w/ 58 levels "-3","(*)","(D)",..: 25 23 47 55 42 6 58 3 49 53 ...
##  $ Chemicals.and.Allied.Products.2                                        : Factor w/ 59 levels "-17","(*)","(D)",..: 24 22 46 51 49 8 59 3 3 50 ...
##  $ Chemicals.and.Allied.Products.3                                        : Factor w/ 61 levels "-25","(*)","(D)",..: 23 27 45 57 52 10 61 3 46 54 ...
##  $ Chemicals.and.Allied.Products.4                                        : Factor w/ 63 levels "-2","-3","-8",..: 30 35 48 10 8 18 63 5 51 60 ...
##  $ Chemicals.and.Allied.Products.5                                        : Factor w/ 59 levels "-1","-10","-8",..: 26 37 51 12 2 23 59 4 48 21 ...
##  $ Chemicals.and.Allied.Products.6                                        : Factor w/ 59 levels "-2","-6","(D)",..: 24 40 49 16 3 35 59 3 43 32 ...
##  $ Chemicals.and.Allied.Products.7                                        : Factor w/ 63 levels "(D)","0","1",..: 21 45 53 15 42 28 63 32 20 35 ...
##  $ Chemicals.and.Allied.Products.8                                        : Factor w/ 62 levels "(*)","(D)","0",..: 22 43 53 24 16 33 62 2 2 40 ...
##  $ Chemicals.and.Allied.Products.9                                        : Factor w/ 60 levels "-1","(*)","(D)",..: 18 42 46 25 22 34 60 3 6 40 ...
##  $ Chemicals.and.Allied.Products.10                                       : Factor w/ 61 levels "-1","(*)","(D)",..: 23 47 51 32 26 38 61 3 14 46 ...
##  $ Chemicals.and.Allied.Products.11                                       : Factor w/ 58 levels "-1","-3","(*)",..: 23 45 50 33 35 38 58 4 17 42 ...
##  $ Chemicals.and.Allied.Products.12                                       : Factor w/ 65 levels "-3","(*)","(D)",..: 17 47 53 31 33 49 65 59 13 45 ...
##  $ Chemicals.and.Allied.Products.13                                       : Factor w/ 60 levels "(*)","(D)","0",..: 18 49 54 37 24 52 60 11 21 46 ...
##  $ Chemicals.and.Allied.Products.14                                       : Factor w/ 60 levels "-13","(D)","0",..: 17 54 53 43 36 49 60 4 20 35 ...
##  $ Chemicals.and.Allied.Products.15                                       : Factor w/ 66 levels "-39","(D)","0",..: 18 59 60 45 44 47 66 54 27 37 ...
##  $ Chemicals.and.Allied.Products.16                                       : Factor w/ 68 levels "-17","(D)","0",..: 20 63 64 50 47 54 68 57 33 45 ...
##  $ Primary.and.fabricated.metals                                          : Factor w/ 35 levels "(D)","0","1",..: 11 23 8 16 12 34 35 1 2 14 ...
##  $ Primary.and.fabricated.metals.1                                        : Factor w/ 46 levels "-3","(*)","(D)",..: 16 30 12 15 17 7 46 3 2 14 ...
##  $ Primary.and.fabricated.metals.2                                        : Factor w/ 44 levels "-7","(*)","(D)",..: 17 31 14 16 18 43 44 3 5 12 ...
##  $ Primary.and.fabricated.metals.3                                        : Factor w/ 42 levels "-10","-18","(*)",..: 15 30 12 14 16 8 42 25 6 11 ...
##  $ Primary.and.fabricated.metals.4                                        : Factor w/ 44 levels "-1","-14","-30",..: 16 38 20 21 17 5 44 43 17 15 ...
##  $ Primary.and.fabricated.metals.5                                        : Factor w/ 43 levels "-1","-19","(*)",..: 18 38 24 21 19 12 43 4 19 16 ...
##  $ Primary.and.fabricated.metals.6                                        : Factor w/ 49 levels "-1","-2","-25",..: 25 46 36 29 26 17 49 6 35 20 ...
##  $ Primary.and.fabricated.metals.7                                        : Factor w/ 43 levels "-2","(*)","(D)",..: 16 39 21 30 17 11 43 28 26 29 ...
##  $ Primary.and.fabricated.metals.8                                        : Factor w/ 44 levels "-141","-2","-3",..: 21 9 27 34 22 20 44 33 5 37 ...
##  $ Primary.and.fabricated.metals.9                                        : Factor w/ 41 levels "-1","-187","-3",..: 18 40 24 33 25 19 41 30 34 31 ...
##  $ Primary.and.fabricated.metals.10                                       : Factor w/ 45 levels "-1","-127","-3",..: 18 44 27 34 19 24 45 39 31 32 ...
##  $ Primary.and.fabricated.metals.11                                       : Factor w/ 45 levels "-1","-187","-2",..: 24 44 29 37 30 21 45 7 1 38 ...
##  $ Primary.and.fabricated.metals.12                                       : Factor w/ 49 levels "-141","(*)","(D)",..: 22 48 26 35 33 20 49 3 3 34 ...
##  $ Primary.and.fabricated.metals.13                                       : Factor w/ 51 levels "-1","-99","(*)",..: 25 9 32 42 26 18 51 40 4 44 ...
##  $ Primary.and.fabricated.metals.14                                       : Factor w/ 46 levels "-127","(*)","(D)",..: 23 17 37 45 2 19 46 38 3 32 ...
##  $ Primary.and.fabricated.metals.15                                       : Factor w/ 50 levels "-1","-143","-2",..: 25 17 32 49 8 24 50 40 6 36 ...
##  $ Primary.and.fabricated.metals.16                                       : Factor w/ 48 levels "-43","-5","(*)",..: 22 18 32 9 23 11 48 44 4 41 ...
##  $ Industrial.machinery.and.equipment                                     : Factor w/ 35 levels "(*)","(D)","0",..: 17 8 15 33 2 24 35 4 3 10 ...
##  $ Industrial.machinery.and.equipment.1                                   : Factor w/ 36 levels "(*)","(D)","0",..: 19 11 22 34 2 25 36 4 4 9 ...
##  $ Industrial.machinery.and.equipment.2                                   : Factor w/ 39 levels "(*)","(D)","0",..: 19 14 24 37 26 25 39 4 4 11 ...
##  $ Industrial.machinery.and.equipment.3                                   : Factor w/ 38 levels "(*)","(D)","0",..: 15 13 20 7 27 21 38 1 4 14 ...
##  $ Industrial.machinery.and.equipment.4                                   : Factor w/ 39 levels "(*)","(D)","0",..: 16 19 21 10 2 11 39 2 38 20 ...
##  $ Industrial.machinery.and.equipment.5                                   : Factor w/ 41 levels "-24","-4","(*)",..: 15 21 27 13 34 18 41 4 37 26 ...
##  $ Industrial.machinery.and.equipment.6                                   : Factor w/ 42 levels "-3","(*)","(D)",..: 17 24 30 15 36 21 42 7 3 28 ...
##  $ Industrial.machinery.and.equipment.7                                   : Factor w/ 42 levels "-2","-3","(*)",..: 16 22 23 13 37 20 42 1 40 4 ...
##  $ Industrial.machinery.and.equipment.8                                   : Factor w/ 41 levels "-10","-4","-5",..: 16 24 22 14 5 15 41 2 40 28 ...
##  $ Industrial.machinery.and.equipment.9                                   : Factor w/ 42 levels "-5","(*)","(D)",..: 14 25 20 17 37 7 42 3 3 27 ...
##  $ Industrial.machinery.and.equipment.10                                  : Factor w/ 40 levels "-133","-5","(*)",..: 15 24 18 13 31 4 40 4 4 23 ...
##  $ Industrial.machinery.and.equipment.11                                  : Factor w/ 40 levels "-2","-4","-59",..: 18 24 16 12 27 39 40 5 5 20 ...
##  $ Industrial.machinery.and.equipment.12                                  : Factor w/ 40 levels "-71","(*)","(D)",..: 12 19 15 9 30 17 40 4 3 11 ...
##   [list output truncated]

Importing and cleaing

In its original form the data is very messy and inconsistent. There are several variables that have been mistaken to be observations. The values themselves contain some anamolies namely, D, NS, *. The objective of the data cleaning process is to rid the data set of these inconsistencies.

Data cleaning operations being performed

  1. Transposing the data and converting to a data frame. This is being done to interchange the variables and the observations.
  2. Assining titles to columns. Also, creating new columns for Industry and Year
  3. Removing the trailing spaces
  4. Removing the unneccessary suffix Industry columns
  5. Removing the ‘D’, ‘NS’ and the ’*’ in the data
invest <- read.csv("US_investments_abroad_industry_wise.csv", header = TRUE, sep = ",",
                skip = 6)
#loading the data as a dataframe
invest <- as_data_frame(invest)

#transposing the data
invest_t <- as.data.frame(t(invest))
#previewing the data frame
invest_t[1:10 , 1:5]

#assigning headers
colnames(invest_t) <- as.character(unlist(invest_t[1, ]))

#deleting the first row from the data frame
invest_t = invest_t[-1 ,  ] 

#previewing the dataframe
invest_t[1:10 , 1:5]

#Renaming the first column to year
names(invest_t)[names(invest_t) == ""] <- "Year"

#Use the 'rownames_to_column in dplyr package to carryout conversion.
invest_t <- rownames_to_column(invest_t,"Industry")


#removing the [1] from the column named Europe
names(invest_t)[names(invest_t) == "Europe[1]"] <- "Europe"

#removing the trailing spaces.
colnames(invest_t) <- str_trim(colnames(invest_t), "left")

#displaying output of column names
head(names(invest_t),6)

#renaming the values within the column 'Industry' using regular expressions
invest_t$Industry <- str_replace( invest_t$Industry, "^(Petroleum).*$","Petroleum") %>%
str_replace("^(Total.Manufacturing).*$","Total Manufacturing") %>%
str_replace("^(Food.and.Kindred.Products).*$","Food and Kindred Products") %>%
str_replace("^(Chemicals.and.Allied.Products).*$","Chemicals and Allied Products") %>%
str_replace("^(Primary.and.fabricated.metals).*$","Primary and fabricated metals") %>%
str_replace("^(Industrial.machinery.and.equipment).*$","Industrial machinery and equipment") %>%
str_replace("^(Electrical.equipment..appliances..and.components).*$","Electrical equipment appliances and components") %>% 
str_replace("^(Transportation.Equipment).*$","Transportation Equipment") %>%
str_replace("^(Other.Manufacturing).*$","Other Manufacturing") %>%
str_replace("^(Wholesale.Trade).*$","Wholesale Trade") %>%
str_replace("^(Finance..except.depository.institutions...insurance..and.real.estate).*$",
             "Finance except depository institutions insurance and real estate") %>%
str_replace("^(Services).*$","Services")

Cleaning activities thus far:

  1. Transposing the data and converting to a data frame
  2. Assining titles to columns. Also, creating new columns for Industry and Year
  3. Removing the trailing spaces
  4. Removing the unneccessary suffix Industry columns

Cleaning activities remaining:

  1. Removing the ‘D’, ‘NS’ and the ’*’ in the data
  2. Renaming the columns with “Other” in their names to appropriate and more comprehendable names
  3. Coercing the factors to character and numeric respectively.
#replacing the value of "(D)" with NA
invest_t[invest_t == "(D)"] <- NA

#replacing the value of "(*)" with NA
invest_t[invest_t == "(*)"] <- NA

#replacing the value of "n.s." with NA
invest_t[invest_t == "n.s."] <- NA

#renaming those columns that have names as other
names(invest_t)[names(invest_t) == "Other"] <- "Other European Countries" 
names(invest_t)[names(invest_t) == "Other1"] <- "Other Latin American Countries" 
names(invest_t)[names(invest_t) == "Other2"] <- "Other Western Hemisphere Countries"
names(invest_t)[names(invest_t) == "Other3"] <- "Other Carrebean Countries"
names(invest_t)[names(invest_t) == "Other4"] <- "Other African Countries"
names(invest_t)[names(invest_t) == "Other5"] <- "Other Middle East Countries"
names(invest_t)[names(invest_t) == "Other6"] <- "Other Asia Pacific Countries"

#Final and scrubbed data set
invest_final <- invest_t

#Trying to convert for multiple columns in one go.
invest_final[,2:80] <- apply(invest_final[,2:80], 2, as.character)
invest_final[,2:80] <- apply(invest_final[,2:80], 2, as.numeric)

#Data table clean using the datatable package. Note to make the output scrollable we are using the scrollX = "200 px""
datatable(invest_final, options = list(pageLength = 10, scrollX = "200px"), caption = 'Table 1: Clean and tidy data.')

Proposed Exploratory Data Analysis

Investments in Petroleum industry - Region wise

Now that the data is scurbbed, it is ready for analysis. Our first order of business is to identify the amount of investment that US makes accross the world - region wise in the Petroleum industry. Once this is done we can compare the plot to see if it is in keeping with the table we saw in the introductory tab.

#EDA
retain <- c("Year", "Canada", "Europe","Latin America and Other Western Hemisphere","Africa",
     "Other Western Hemisphere", "Asia and Pacific", "Middle East")
#invest_final <- invest_t[ , !(names(invest_t) %in% retain)]
invest_regions <- invest_t[ , (names(invest_t) %in% retain)]


#convert to numeric 
invest_regions[,1:8] <- apply(invest_regions[, 1:8], 2, as.character)
invest_regions[,1:8] <- apply(invest_regions[, 1:8], 2, as.numeric)

#sum of investment in petroleum across all years
sub <- invest_regions[1:17, ]

#transposing the dataset
sub_t <- as.data.frame(t(sub))

#assigning headers
colnames(sub_t) <- as.character(unlist(sub_t[1,]))
#deleting the first row from the data frame
sub_t = sub_t[-1 ,  ] 

#adding country row
sub_t <- add_rownames(sub_t, "Country")

sub_num <- sub_t[,-1]
#calculating aggregate sum
sub_sum <- sub_num %>%
  replace(is.na(.), 0) %>%
  mutate(Total = rowSums(.[1:7]))

# combining the country and total columns
x1 <- sub_sum$Total
x2 <- sub_t$Country

#binding the two vectors to form a data frame.
country_total <- cbind(x2, x1)

#removing the duplicated row names
country_total <- remove_rownames(as.data.frame(country_total))
colnames(country_total) <- c("Country", "Total_investment")

#sorting and then using the top 10
sorted <- country_total[order(x1, decreasing = TRUE),]

#################################

#incorporating the ggplot
#Creating a ggplot for comparing investment accross Canda over the years
df <- data.frame(sorted$Country, sorted$Total_investment)
ggplot(df, aes(sorted$Country, sorted$Total_investment)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Countries", y = "Investments in Millions", title = "Investments in Petroleum - World over")

The country that comes on top is Canada (as expected). Let us drill a little deeper to understand how the US invests in Canada.

Investements in Canada

It would be really helpful to understand which industry on the whole has received the most investments over the years. For this purpose we will be making use of the ggplot package in R. We will be using the regular bar charts as we are dealing with both discrete and continous variables.

#creating an aggregated subset of total investment in canada over the years

#Petroleum industry
invest_can <- c(sum(invest_final[1:17, 4]),sum(invest_final[18:34,4]), sum(invest_final[35:51, 4]), sum(invest_final[52:68, 4]), sum(invest_final[69:85, 4]),
sum(invest_final[86:102, 4]), sum(invest_final[103:119, 4]), sum(invest_final[120:136, 4]), sum(invest_final[137:153, 4]), sum(invest_final[154:170, 4]), sum(invest_final[171:187, 4]), sum(invest_final[188:204, 4]))

#Creating a vector containing industry names
invest_industry <- c("Petroleum", "Manufacturing", "Food Products", "Chemical Products", "Fabricated metals", "Industry machinery", "Electrical equipment", "Transportation Equipment", "Other Manufacturing", "Wholesale Trade", "Finance", "Services")

#Creating a ggplot for comparing investment accross Canda over the years
df <- data.frame(invest_industry, invest_can)
ggplot(df, aes(invest_industry, invest_can)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Industry", y = "Investments in Millions", title = "Investments in Canada")

Even though the US invests the most in Canada when talking about the petroleum industry, it is the manufacturing industry that has received the most investments. It would be useful to study the trend of investment in petroleum industry. For this we will be using the line plot.

#creaing a subset of Petoleum industry
invest_pet <- invest_final[1:17,2:65]

#This graph compares the investments made in different Middle Eastern countries.
ggplot(invest_pet , aes(Year, Canada, group = 1)) + 
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Year", y = "Investments in Millions", title = "Investment in petroleum industry, Canada")

Investments in Middle East

Middle east would be an interesting region to analyse.

Creating a similar plot for Middle East

#Creating an a vector containg the investments made in Middle East across all industries
invest_mde <- c(sum(invest_final[1:17, 61]),sum(invest_final[18:34,61]), sum(invest_final[35:51, 61]), sum(invest_final[52:68, 61]), sum(invest_final[69:85, 61]),
sum(invest_final[86:102, 61]), sum(invest_final[103:119, 61]), sum(invest_final[120:136, 61]), sum(invest_final[137:153, 61]), sum(invest_final[154:170, 61]), sum(invest_final[171:187, 61]), sum(invest_final[188:204, 61]))

#ggplot for ploting bar chart to illustrate investments in Mn vs investments in Middle East.
df2 <- data.frame(invest_industry, invest_mde)
ggplot(df, aes(invest_industry, invest_mde)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Industry", y = "Investments in Millions", title = "Investments in Middle East")

#

And, we immediately get our answer that, YES! the petroleum industry is the one that receives the most investment from the US.

Identiyfing the trend over the years.

#renaming the middle east column to Middle_East
names(invest_final)[names(invest_final) == 'Middle East' ] <- 'Middle_East'
View(invest_final)

#creating a subset of Petoleum industry
invest_pet <- invest_final[1:17,2:65]

#This graph compares the investments made in different Middle Eastern countries.
ggplot(invest_pet , aes(Year, Middle_East, group = 1)) + 
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Year", y = "Investments in Millions", title = "Investment in petroleum industry, Middle East")

Investments in all Latin American countries

Creating a similar plot for Latin American countries

#Creating an a vector containg the investments made in Mexico across all industries
invest_latam <- c(sum(invest_final[1:17, 29]),sum(invest_final[18:34,29]), sum(invest_final[35:51, 29]), sum(invest_final[52:68, 29]), sum(invest_final[69:85, 29]),
sum(invest_final[86:102, 29]), sum(invest_final[103:119, 29]), sum(invest_final[120:136, 29]), sum(invest_final[137:153, 29]), sum(invest_final[154:170, 29]), sum(invest_final[171:187, 29]), sum(invest_final[188:204, 29]))

#using ggplot to plot a barchart to illustrate the investment amounts across industries.
df2 <- data.frame(invest_industry, invest_latam)
ggplot(df, aes(invest_industry, invest_latam)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Industry", y = "Investments in Millions", title = "Investments in Lat-Am")

Plotting the trend of investment in the petroleum industry in the LATAM region.

#renaming the 'Latin America and Other Western Hemisphere' column to 'LATAM'
names(invest_final)[names(invest_final) == 'Latin America and Other Western Hemisphere' ] <- 'LATAM'
View(invest_final)

#creating a subset of Petoleum industry
invest_pet <- invest_final[1:17,2:30]

#This graph compares the investments made in different Middle Eastern countries.
ggplot(invest_pet , aes(Year, LATAM, group = 1)) + 
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Year", y = "Investments in Millions", title = "Investment in petroleum industry, LATAM")

Investments in Mexico

We can now analyze US’s investment patterns in it’s neighbouring countries’ markets. For that let us look at it’s closest neighbour, Mexico!

Creating a similar plot for Mexico

#Creating an a vector containg the investments made in Mexico across all industries
invest_mex <- c(sum(invest_final[1:17, 43]),sum(invest_final[18:34,43]), sum(invest_final[35:51, 43]), sum(invest_final[52:68, 43]), sum(invest_final[69:85, 43]),
sum(invest_final[86:102, 43]), sum(invest_final[103:119, 43]), sum(invest_final[120:136, 43]), sum(invest_final[137:153, 43]), sum(invest_final[154:170, 43]), sum(invest_final[171:187, 43]), sum(invest_final[188:204, 43]))

#ggplot for ploting bar chart to illustrate investments in Mn vs investments in Mexico.
df2 <- data.frame(invest_industry, invest_mex)
ggplot(df, aes(invest_industry, invest_mex)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Industry", y = "Investments in Millions", title = "Investments in Mexico")

We notice that Mexico receives very neglible amount of investment in the Petroleum industry. On the contrary the manufacutring industry seems to be heavily invested in. This could be attributed to the affordable labour in Mexico, that attracts US based companies to shift manufacturing bases. Let us study the trend of investment in the manufacturing industry.

#creating a subset of Manufacturing industry
invest_man <- invest_final[18:34,2:43]

#Creating the ggplot-bar graph. 
#This graph compares the investments made in Canada across 16 years in the petroleum industry.
ggplot(invest_man , aes(Year, Mexico)) + 
geom_line() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Year", y = "Investments in Millions", title = "Investment in manufacturing industry, Mexico")

Under Construction

Now that we see that there has been a peculiar dip in investments between 1985 to 1992. The new plot which will be implemented in the final project.

Conclusion

References

1 - US Energy Information Administration - https://www.eia.gov/petroleum/imports/companylevel/