The Data
For this project, I have selected the wide data set by Jason Joseph in Forum: Discussion 6/7: Untidy Data. This data comes from the US Census International Trade Data. Jason posted the csv file which has been downloaded into my working directory. The csv file is also posted on GitHub.
Analysis Question
Jason is very specific about the type of analysis to conduct using this data set. As a result, this makes my task a lot easier. Below are the analysis questions that Jason asked and that I attempt to answer.
library(stringr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
In this section, I import the data from my working directory. I will control several parameters of the read.csv function to have greater control on what is imported. I will also take a peak at the data and determine the structure.
mydata <- read.csv("international_trade.csv", header = TRUE, nrows = 13, stringsAsFactors = FALSE)
head(mydata)
## Period X X2013 X.1 X2014 X.2 X2015 X.3
## 1 Exports Imports Exports Imports Exports Imports
## 2
## 3 January Goods 131,228 191,745 133,738 193,706 129,292 192,242
## 4 Services 56883 37,995 59,141 38,635 59,733 40,371
## 5
## 6 February Goods 132084 193,731 131,768 193,060 126,329 184,370
There are several rows with no values in the imported data set. I remove those rows using the deplyr filter function and determine the structure of the data.
I looks like all the values are characters.
mydata <- filter(mydata, X != "")
head(mydata)
## Period X X2013 X.1 X2014 X.2 X2015 X.3
## 1 January Goods 131,228 191,745 133,738 193,706 129,292 192,242
## 2 Services 56883 37,995 59,141 38,635 59,733 40,371
## 3 February Goods 132084 193,731 131,768 193,060 126,329 184,370
## 4 Services 56,828 37,935 57,726 39,270 59,560 40,057
## 5 March Goods 130,093 186,326 135,923 198,973 127,183 198,347
## 6 Services 56,647 37,938 58,837 38,908 59,583 40,581
str(mydata)
## 'data.frame': 8 obs. of 8 variables:
## $ Period: chr "January" "" "February" "" ...
## $ X : chr "Goods " "Services" "Goods " "Services" ...
## $ X2013 : chr "131,228" "56883" "132084" "56,828" ...
## $ X.1 : chr "191,745" "37,995" "193,731" "37,935" ...
## $ X2014 : chr "133,738" "59,141" "131,768" "57,726" ...
## $ X.2 : chr "193,706" "38,635" "193,060" "39,270" ...
## $ X2015 : chr "129,292" "59,733" "126,329" "59,560" ...
## $ X.3 : chr "192,242" "40,371" "184,370" "40,057" ...
New Column
Because the Period column has nested values, I will remove this column in a later step. For now I will create a new column by subsetting by Recycling and using the rep function to generate a specific pattern.
column <- as.character(mydata$Period [c(TRUE,FALSE)])
column
## [1] "January" "February" "March" "April"
column1 <- as.character(rep(column, each = 2))
head(column1)
## [1] "January" "January" "February" "February" "March" "March"
Appending Column1
The previously created column will be appended to the data frame. Before this is done, I need to remove the the “Old” Period column.
mydata[ ,"Period"] <- NULL
mydata[ ,"Period"] <- column1
head(mydata)
## X X2013 X.1 X2014 X.2 X2015 X.3 Period
## 1 Goods 131,228 191,745 133,738 193,706 129,292 192,242 January
## 2 Services 56883 37,995 59,141 38,635 59,733 40,371 January
## 3 Goods 132084 193,731 131,768 193,060 126,329 184,370 February
## 4 Services 56,828 37,935 57,726 39,270 59,560 40,057 February
## 5 Goods 130,093 186,326 135,923 198,973 127,183 198,347 March
## 6 Services 56,647 37,938 58,837 38,908 59,583 40,581 March
Column Names
This section renames the existing column names with more descriptive and user-friendly names. Because there were nested values for the some column names, I will use flags to distinguish between an export/import year. For example, 2013E = 2013 Exports and 2013I = 2013 Imports.
colnames(mydata) <- as.character(c("Type", "2013E", "2013I", "2014E", "2014I", "2015E", "2015I", "Period"))
head(mydata)
## Type 2013E 2013I 2014E 2014I 2015E 2015I Period
## 1 Goods 131,228 191,745 133,738 193,706 129,292 192,242 January
## 2 Services 56883 37,995 59,141 38,635 59,733 40,371 January
## 3 Goods 132084 193,731 131,768 193,060 126,329 184,370 February
## 4 Services 56,828 37,935 57,726 39,270 59,560 40,057 February
## 5 Goods 130,093 186,326 135,923 198,973 127,183 198,347 March
## 6 Services 56,647 37,938 58,837 38,908 59,583 40,581 March
Gather
Although the data set is much cleaner and organized, it still has a wide format. To make it tidy, I will use the tidyr function gather to collapse the years into a column called Years.
mydata1 <- gather(mydata,"Years", "Dollars", 2:7)
head(mydata1)
## Type Period Years Dollars
## 1 Goods January 2013E 131,228
## 2 Services January 2013E 56883
## 3 Goods February 2013E 132084
## 4 Services February 2013E 56,828
## 5 Goods March 2013E 130,093
## 6 Services March 2013E 56,647
Converting Dollars Column to Numeric
So far, I have been working with characters; I need the analysis variables to be numeric to perform my analysis. This section converts the Dollars to numeric in two steps. The first step extracts the commas from the Dollar values and adds the extracted values to the Dollars vector. In the second step, the values in the Dollars vector are converted to numeric.
#step1
Dollars <- unlist(str_replace_all(mydata1$Dollars, ",", ""))
#step2
Dollars <- as.numeric(Dollars)
Append Dollars Column
This step removes the old Dollars column and appends the new Dollars vector to the data frame.
mydata1[ ,"Dollars"] <- NULL
mydata1[ ,"Dollars"] <- Dollars
head(mydata1)
## Type Period Years Dollars
## 1 Goods January 2013E 131228
## 2 Services January 2013E 56883
## 3 Goods February 2013E 132084
## 4 Services February 2013E 56828
## 5 Goods March 2013E 130093
## 6 Services March 2013E 56647
Spread
In the prior step, the months were in a single column. This section uses the dplyr spread function to make each month a single column. I need this step in order to add an Import/Export flag to the data. The Import/Export flag will allow me to better subset the data.
mydata2 <- spread(mydata1, Period, Dollars)
head(mydata2)
## Type Years April February January March
## 1 Goods 2013E 131468 132084 131228 130093
## 2 Goods 2013I 189507 193731 191745 186326
## 3 Goods 2014E 135556 131768 133738 135923
## 4 Goods 2014I 199877 193060 193706 198973
## 5 Goods 2015E 129376 126329 129292 127183
## 6 Goods 2015I 190967 184370 192242 198347
Creating Import/Export Flag
As mentioned earlier, this flag is needed to better subset the data.
exim <- as.character(rep(c("Export", "Import"), 6))
Appending EXIMP Column
This step appends the previously created Import/Export flag to the data frame.
mydata2[ ,"EXIMP"] <- exim
Reversing the Prior Spread Action
This step essentially reverses the previous spread action and makes the data easier to subset.
mydata3 <- gather(mydata2, "Months", "Dollars", 3:6)
head(mydata3)
## Type Years EXIMP Months Dollars
## 1 Goods 2013E Export April 131468
## 2 Goods 2013I Import April 189507
## 3 Goods 2014E Export April 135556
## 4 Goods 2014I Import April 199877
## 5 Goods 2015E Export April 129376
## 6 Goods 2015I Import April 190967
The data is almost ready for analysis. It still needs a few manipulations, but we can now begin to tackle the questions.
First, let’s start with defining the scenarios…
Defining Scenarios for Analysis
Below are the new variables created for each type of scenario required by Jason.
I use the dplyr filter function to subset the mydata3 data frame for each scenario.
scenario1 <- filter(mydata3, Type == "Goods " & EXIMP == "Export")
scenario2 <- filter(mydata3, Type == "Goods " & EXIMP == "Import")
scenario3 <- filter(mydata3, Type == "Services" & EXIMP == "Export")
scenario4 <- filter(mydata3, Type == "Services" & EXIMP == "Import")
Calculating the Mean for Each Scenario
In this section, I calculate the mean value for each of the scenarios described above. For this task, I use the dplyr group_by and summarise functions to get the mean value by month.
mean1 <- scenario1 %>% group_by(Months) %>% summarise(average = mean(Dollars))
colnames(mean1) <- c("Months", "Avg1")
mean1
## Source: local data frame [4 x 2]
##
## Months Avg1
## (fctr) (dbl)
## 1 April 132133.3
## 2 February 130060.3
## 3 January 131419.3
## 4 March 131066.3
mean2 <- scenario2 %>% group_by(Months) %>% summarise(average = mean(Dollars))
colnames(mean2) <- c("Months", "Avg2")
mean2
## Source: local data frame [4 x 2]
##
## Months Avg2
## (fctr) (dbl)
## 1 April 193450.3
## 2 February 190387.0
## 3 January 192564.3
## 4 March 194548.7
mean3 <- scenario3 %>% group_by(Months) %>% summarise(average = mean(Dollars))
colnames(mean3) <- c("Months", "Avg3")
mean3
## Source: local data frame [4 x 2]
##
## Months Avg3
## (fctr) (dbl)
## 1 April 58608.67
## 2 February 58038.00
## 3 January 58585.67
## 4 March 58355.67
mean4 <- scenario4 %>% group_by(Months) %>% summarise(average = mean(Dollars))
colnames(mean4) <- c("Months", "Avg4")
mean4
## Source: local data frame [4 x 2]
##
## Months Avg4
## (fctr) (dbl)
## 1 April 39389.00
## 2 February 39087.33
## 3 January 39000.33
## 4 March 39142.33
Joining Scenario Averages
Now that I have a mean for each scenario, I then use the dplyr inner_join function to joing the averages for each scenario into a single table. This table is be part of the Conclusion.
comb1 <- inner_join(mean1, mean2, by = "Months")
comb2 <- inner_join(comb1, mean3, by = "Months")
mean <- inner_join(comb2, mean4, by = "Months")
I was able to answer the questions posed by Jason Joseph. The table below shows the monthly average for each scenario described in the initial section.
mean
## Source: local data frame [4 x 5]
##
## Months Avg1 Avg2 Avg3 Avg4
## (fctr) (dbl) (dbl) (dbl) (dbl)
## 1 April 132133.3 193450.3 58608.67 39389.00
## 2 February 130060.3 190387.0 58038.00 39087.33
## 3 January 131419.3 192564.3 58585.67 39000.33
## 4 March 131066.3 194548.7 58355.67 39142.33