Project Details

The Data

For this project, I have selected the wide data set by Jason Joseph in Forum: Discussion 6/7: Untidy Data. This data comes from the US Census International Trade Data. Jason posted the csv file which has been downloaded into my working directory. The csv file is also posted on GitHub.

Analysis Question

Jason is very specific about the type of analysis to conduct using this data set. As a result, this makes my task a lot easier. Below are the analysis questions that Jason asked and that I attempt to answer.

  1. Calculate the average goods export for each month over the three-year span
  2. Calculate the average goods import for each month over the three-year span
  3. Calculate the average services export for each month over the three-year span
  4. Calculate the average services import for each month over the three-year span

Loading Required Libraries

library(stringr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Importing the Data

In this section, I import the data from my working directory. I will control several parameters of the read.csv function to have greater control on what is imported. I will also take a peak at the data and determine the structure.

mydata <- read.csv("international_trade.csv", header = TRUE, nrows = 13, stringsAsFactors = FALSE)

head(mydata)
##     Period        X   X2013     X.1   X2014     X.2   X2015     X.3
## 1                   Exports Imports Exports Imports Exports Imports
## 2                                                                  
## 3  January   Goods  131,228 191,745 133,738 193,706 129,292 192,242
## 4          Services   56883  37,995  59,141  38,635  59,733  40,371
## 5                                                                  
## 6 February   Goods   132084 193,731 131,768 193,060 126,329 184,370

Subsetting

There are several rows with no values in the imported data set. I remove those rows using the deplyr filter function and determine the structure of the data.

I looks like all the values are characters.

mydata <- filter(mydata, X != "")

head(mydata)
##     Period        X   X2013     X.1   X2014     X.2   X2015     X.3
## 1  January   Goods  131,228 191,745 133,738 193,706 129,292 192,242
## 2          Services   56883  37,995  59,141  38,635  59,733  40,371
## 3 February   Goods   132084 193,731 131,768 193,060 126,329 184,370
## 4          Services  56,828  37,935  57,726  39,270  59,560  40,057
## 5    March   Goods  130,093 186,326 135,923 198,973 127,183 198,347
## 6          Services  56,647  37,938  58,837  38,908  59,583  40,581
str(mydata)
## 'data.frame':    8 obs. of  8 variables:
##  $ Period: chr  "January" "" "February" "" ...
##  $ X     : chr  "Goods " "Services" "Goods " "Services" ...
##  $ X2013 : chr  "131,228" "56883" "132084" "56,828" ...
##  $ X.1   : chr  "191,745" "37,995" "193,731" "37,935" ...
##  $ X2014 : chr  "133,738" "59,141" "131,768" "57,726" ...
##  $ X.2   : chr  "193,706" "38,635" "193,060" "39,270" ...
##  $ X2015 : chr  "129,292" "59,733" "126,329" "59,560" ...
##  $ X.3   : chr  "192,242" "40,371" "184,370" "40,057" ...

New Column

Because the Period column has nested values, I will remove this column in a later step. For now I will create a new column by subsetting by Recycling and using the rep function to generate a specific pattern.

column <- as.character(mydata$Period [c(TRUE,FALSE)])

column
## [1] "January"  "February" "March"    "April"
column1 <- as.character(rep(column, each = 2))

head(column1)
## [1] "January"  "January"  "February" "February" "March"    "March"

Appending Column1

The previously created column will be appended to the data frame. Before this is done, I need to remove the the “Old” Period column.

mydata[ ,"Period"] <- NULL
mydata[ ,"Period"] <- column1

head(mydata)
##          X   X2013     X.1   X2014     X.2   X2015     X.3   Period
## 1   Goods  131,228 191,745 133,738 193,706 129,292 192,242  January
## 2 Services   56883  37,995  59,141  38,635  59,733  40,371  January
## 3   Goods   132084 193,731 131,768 193,060 126,329 184,370 February
## 4 Services  56,828  37,935  57,726  39,270  59,560  40,057 February
## 5   Goods  130,093 186,326 135,923 198,973 127,183 198,347    March
## 6 Services  56,647  37,938  58,837  38,908  59,583  40,581    March

Column Names

This section renames the existing column names with more descriptive and user-friendly names. Because there were nested values for the some column names, I will use flags to distinguish between an export/import year. For example, 2013E = 2013 Exports and 2013I = 2013 Imports.

colnames(mydata) <- as.character(c("Type", "2013E", "2013I", "2014E", "2014I", "2015E", "2015I", "Period"))

head(mydata)
##       Type   2013E   2013I   2014E   2014I   2015E   2015I   Period
## 1   Goods  131,228 191,745 133,738 193,706 129,292 192,242  January
## 2 Services   56883  37,995  59,141  38,635  59,733  40,371  January
## 3   Goods   132084 193,731 131,768 193,060 126,329 184,370 February
## 4 Services  56,828  37,935  57,726  39,270  59,560  40,057 February
## 5   Goods  130,093 186,326 135,923 198,973 127,183 198,347    March
## 6 Services  56,647  37,938  58,837  38,908  59,583  40,581    March

Data Manipulations

Gather

Although the data set is much cleaner and organized, it still has a wide format. To make it tidy, I will use the tidyr function gather to collapse the years into a column called Years.

mydata1 <- gather(mydata,"Years", "Dollars", 2:7)

head(mydata1)
##       Type   Period Years Dollars
## 1   Goods   January 2013E 131,228
## 2 Services  January 2013E   56883
## 3   Goods  February 2013E  132084
## 4 Services February 2013E  56,828
## 5   Goods     March 2013E 130,093
## 6 Services    March 2013E  56,647

Converting Dollars Column to Numeric

So far, I have been working with characters; I need the analysis variables to be numeric to perform my analysis. This section converts the Dollars to numeric in two steps. The first step extracts the commas from the Dollar values and adds the extracted values to the Dollars vector. In the second step, the values in the Dollars vector are converted to numeric.

#step1
Dollars <- unlist(str_replace_all(mydata1$Dollars, ",", ""))

#step2
Dollars <- as.numeric(Dollars)

Append Dollars Column

This step removes the old Dollars column and appends the new Dollars vector to the data frame.

mydata1[ ,"Dollars"] <- NULL

mydata1[ ,"Dollars"] <- Dollars

head(mydata1)
##       Type   Period Years Dollars
## 1   Goods   January 2013E  131228
## 2 Services  January 2013E   56883
## 3   Goods  February 2013E  132084
## 4 Services February 2013E   56828
## 5   Goods     March 2013E  130093
## 6 Services    March 2013E   56647

Spread

In the prior step, the months were in a single column. This section uses the dplyr spread function to make each month a single column. I need this step in order to add an Import/Export flag to the data. The Import/Export flag will allow me to better subset the data.

mydata2 <- spread(mydata1, Period, Dollars)

head(mydata2)
##     Type Years  April February January  March
## 1 Goods  2013E 131468   132084  131228 130093
## 2 Goods  2013I 189507   193731  191745 186326
## 3 Goods  2014E 135556   131768  133738 135923
## 4 Goods  2014I 199877   193060  193706 198973
## 5 Goods  2015E 129376   126329  129292 127183
## 6 Goods  2015I 190967   184370  192242 198347

Creating Import/Export Flag

As mentioned earlier, this flag is needed to better subset the data.

exim <- as.character(rep(c("Export", "Import"), 6))

Appending EXIMP Column

This step appends the previously created Import/Export flag to the data frame.

mydata2[ ,"EXIMP"] <- exim

Reversing the Prior Spread Action

This step essentially reverses the previous spread action and makes the data easier to subset.

mydata3 <- gather(mydata2, "Months", "Dollars", 3:6)

head(mydata3)
##     Type Years  EXIMP Months Dollars
## 1 Goods  2013E Export  April  131468
## 2 Goods  2013I Import  April  189507
## 3 Goods  2014E Export  April  135556
## 4 Goods  2014I Import  April  199877
## 5 Goods  2015E Export  April  129376
## 6 Goods  2015I Import  April  190967

Data Analysis

The data is almost ready for analysis. It still needs a few manipulations, but we can now begin to tackle the questions.

First, let’s start with defining the scenarios…

Defining Scenarios for Analysis

Below are the new variables created for each type of scenario required by Jason.

  1. Scenario1 = Goods & Export for the past three years
  2. Scenario2 = Goods & Import for the past three years
  3. Scenario3 = Services & Export for the past three years
  4. Scenario4 = Services & Export for the past three years

I use the dplyr filter function to subset the mydata3 data frame for each scenario.

scenario1 <- filter(mydata3, Type == "Goods " & EXIMP == "Export")
scenario2 <- filter(mydata3, Type == "Goods " & EXIMP == "Import")
scenario3 <- filter(mydata3, Type == "Services" & EXIMP == "Export")
scenario4 <- filter(mydata3, Type == "Services" & EXIMP == "Import")

Calculating the Mean for Each Scenario

In this section, I calculate the mean value for each of the scenarios described above. For this task, I use the dplyr group_by and summarise functions to get the mean value by month.

  1. Mean1 or Avg1 = Mean of Scenario1
  2. Mean2 or Avg2 = Mean of Scenario2
  3. Mean3 or Avg3 = Mean of Scenario3
  4. Mean4 or Avg4 = Mean of Scenario4
mean1 <- scenario1 %>% group_by(Months) %>% summarise(average = mean(Dollars))
colnames(mean1) <- c("Months", "Avg1")
mean1
## Source: local data frame [4 x 2]
## 
##     Months     Avg1
##     (fctr)    (dbl)
## 1    April 132133.3
## 2 February 130060.3
## 3  January 131419.3
## 4    March 131066.3
mean2 <- scenario2 %>% group_by(Months) %>% summarise(average = mean(Dollars))
colnames(mean2) <- c("Months", "Avg2")
mean2
## Source: local data frame [4 x 2]
## 
##     Months     Avg2
##     (fctr)    (dbl)
## 1    April 193450.3
## 2 February 190387.0
## 3  January 192564.3
## 4    March 194548.7
mean3 <- scenario3 %>% group_by(Months) %>% summarise(average = mean(Dollars))
colnames(mean3) <- c("Months", "Avg3")
mean3
## Source: local data frame [4 x 2]
## 
##     Months     Avg3
##     (fctr)    (dbl)
## 1    April 58608.67
## 2 February 58038.00
## 3  January 58585.67
## 4    March 58355.67
mean4 <- scenario4 %>% group_by(Months) %>% summarise(average = mean(Dollars))
colnames(mean4) <- c("Months", "Avg4")
mean4
## Source: local data frame [4 x 2]
## 
##     Months     Avg4
##     (fctr)    (dbl)
## 1    April 39389.00
## 2 February 39087.33
## 3  January 39000.33
## 4    March 39142.33

Joining Scenario Averages

Now that I have a mean for each scenario, I then use the dplyr inner_join function to joing the averages for each scenario into a single table. This table is be part of the Conclusion.

 comb1 <- inner_join(mean1, mean2, by = "Months")

 comb2 <- inner_join(comb1, mean3, by = "Months")

 mean <- inner_join(comb2, mean4, by = "Months")

Conclusion

I was able to answer the questions posed by Jason Joseph. The table below shows the monthly average for each scenario described in the initial section.

  1. Avg1 = avearge goods export for each month over the three-year span
  2. Avg2 = average goods import for each month over the three-year span
  3. Avg3 = average services export for each month over the three-year span
  4. Avg4 = average services import for each month over the three-year span
 mean
## Source: local data frame [4 x 5]
## 
##     Months     Avg1     Avg2     Avg3     Avg4
##     (fctr)    (dbl)    (dbl)    (dbl)    (dbl)
## 1    April 132133.3 193450.3 58608.67 39389.00
## 2 February 130060.3 190387.0 58038.00 39087.33
## 3  January 131419.3 192564.3 58585.67 39000.33
## 4    March 131066.3 194548.7 58355.67 39142.33