Project Details

The Data

For this project, I have selected the wide data set posted by Mohan Kandaraj in Forum: Discussion 6/7: Untidy Data. This data source comes from the US Department of Agriculture and it records Grain Stocks by Position and Month in Domestic Units in the United States for 2014 and 2015.

The data posted by Mohan is in the form of a text file. As a result, I use Regular Expressions to extract the relevant data values from this file and prepare the data for downstream analysis.

Analysis Question

As per Mohan, the objective is to “find which grain is worst hit in 2015. That is the grain which has maximum decline in stock in 2015”. For this analysis the Year-Over-Year trend between 2015 and 2014 are used to answer this question. The analysis variable is the Off Farms.

As per the USDA, this variable “includes stocks at mills, elevators, warehouses, terminals, and processors”.

Loading Required Libraries

library(stringr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Getting the Data from Github

project_url = url("https://raw.githubusercontent.com/diegomdiaz/IS607/master/Project%202/grains_datasource.txt")
project_data <- readLines(project_url, n = 51, skipNul = TRUE)

Taking a Peak at the Imported Data

So far, it looks like this data set will require several cleanups and manipulations before it is ready for downstream analysis.

head(project_data)
## [1] "Grain Stocks by Position and Month in Domestic Units - United States: 2014 and 2015"           
## [2] "[Blank data cells indicate estimation period has not yet begun]"                               
## [3] "----------------------------------------------------------------------------------------------"
## [4] "                :                 2014                 :                 2015                 "
## [5] "                :-----------------------------------------------------------------------------"
## [6] "      Date      :     On     :    Off     : Total all  :     On     :    Off     : Total all  "
typeof(project_data)
## [1] "character"

Subsetting

Subsetting

Having reviewed the data contents above, I will only keep rows from 11:51

project_data <- project_data [c(11:51)]

head(project_data)
## [1] "Corn            :                                                                             "
## [2] "March 1 ........: 3,860,500    3,147,623     7,008,123   4,380,000    3,369,806    7,749,806  "
## [3] "June 1 .........: 1,863,200    1,988,516     3,851,716   2,275,000    2,177,992    4,452,992  "
## [4] "September 1 ....:   462,000      769,904     1,231,904     593,000    1,138,310    1,731,310  "
## [5] "December 1 .....: 7,087,000    4,124,380    11,211,380                                        "
## [6] "                :                                                                             "

Subsetting by Recycling

In the current data structure, several values are nested in a structure that is not ideal for analysis; however, the structure appears to be consistent.

I exploit this pattern by subsetting by Recycling. In the second subsetting below, I remove the rows for December. These are only available for 2014 and not 2015. As a result, they are not relevant for this analysis.

#First subsetting. 
project_data <- project_data [c(FALSE, TRUE,TRUE,TRUE,TRUE,FALSE)]

head(project_data)
## [1] "March 1 ........: 3,860,500    3,147,623     7,008,123   4,380,000    3,369,806    7,749,806  "
## [2] "June 1 .........: 1,863,200    1,988,516     3,851,716   2,275,000    2,177,992    4,452,992  "
## [3] "September 1 ....:   462,000      769,904     1,231,904     593,000    1,138,310    1,731,310  "
## [4] "December 1 .....: 7,087,000    4,124,380    11,211,380                                        "
## [5] "March 1 ........:    15,950      159,784       175,734       9,000      110,856      119,856  "
## [6] "June 1 .........:     4,500       87,924        92,424       2,960       31,329       34,289  "
#Second subsetting removes the rows for December. 
project_data <- project_data [c(TRUE, TRUE, TRUE, FALSE)]

head(project_data)
## [1] "March 1 ........: 3,860,500    3,147,623     7,008,123   4,380,000    3,369,806    7,749,806  "
## [2] "June 1 .........: 1,863,200    1,988,516     3,851,716   2,275,000    2,177,992    4,452,992  "
## [3] "September 1 ....:   462,000      769,904     1,231,904     593,000    1,138,310    1,731,310  "
## [4] "March 1 ........:    15,950      159,784       175,734       9,000      110,856      119,856  "
## [5] "June 1 .........:     4,500       87,924        92,424       2,960       31,329       34,289  "
## [6] "September 1 ....:     1,945       32,087        34,032       1,895       16,519       18,414  "
typeof(project_data)
## [1] "character"

Extracting

Extracting Digits

In this section, I extract the digit strings using the str_extract_all function. I then review the output, structure and type of data structure.

digit1 <- str_extract_all(project_data, "[[:digit:],]{2,}+")

head(digit1)
## [[1]]
## [1] "3,860,500" "3,147,623" "7,008,123" "4,380,000" "3,369,806" "7,749,806"
## 
## [[2]]
## [1] "1,863,200" "1,988,516" "3,851,716" "2,275,000" "2,177,992" "4,452,992"
## 
## [[3]]
## [1] "462,000"   "769,904"   "1,231,904" "593,000"   "1,138,310" "1,731,310"
## 
## [[4]]
## [1] "15,950"  "159,784" "175,734" "9,000"   "110,856" "119,856"
## 
## [[5]]
## [1] "4,500"  "87,924" "92,424" "2,960"  "31,329" "34,289"
## 
## [[6]]
## [1] "1,945"  "32,087" "34,032" "1,895"  "16,519" "18,414"
str(digit1)
## List of 21
##  $ : chr [1:6] "3,860,500" "3,147,623" "7,008,123" "4,380,000" ...
##  $ : chr [1:6] "1,863,200" "1,988,516" "3,851,716" "2,275,000" ...
##  $ : chr [1:6] "462,000" "769,904" "1,231,904" "593,000" ...
##  $ : chr [1:6] "15,950" "159,784" "175,734" "9,000" ...
##  $ : chr [1:6] "4,500" "87,924" "92,424" "2,960" ...
##  $ : chr [1:6] "1,945" "32,087" "34,032" "1,895" ...
##  $ : chr [1:6] "19,800" "15,323" "35,123" "20,810" ...
##  $ : chr [1:6] "9,710" "15,029" "24,739" "15,120" ...
##  $ : chr [1:6] "41,400" "32,910" "74,310" "47,200" ...
##  $ : chr [1:6] "43,830" "77,734" "121,564" "41,990" ...
##  $ : chr [1:6] "19,110" "63,145" "82,255" "20,940" ...
##  $ : chr [1:6] "97,820" "81,997" "179,817" "135,590" ...
##  $ : chr [1:6] "237,530" "819,435" "1,056,965" "278,710" ...
##  $ : chr [1:6] "96,995" "493,288" "590,283" "155,170" ...
##  $ : chr [1:6] "713,450" "1,193,770" "1,907,220" "647,000" ...
##  $ : chr [1:6] "20,700" "17,430" "38,130" "16,200" ...
##  $ : chr [1:6] "12,800" "8,724" "21,524" "10,250" ...
##  $ : chr [1:6] "38,700" "19,121" "57,821" "44,400" ...
##  $ : chr [1:6] "381,900" "611,928" "993,828" "609,200" ...
##  $ : chr [1:6] "109,100" "295,945" "405,045" "246,300" ...
##  $ : chr [1:6] "21,325" "70,666" "91,991" "49,700" ...
typeof(digit1)
## [1] "list"

Find and Replace

The values have a comma and this does not allow me to coerce these characters to numeric. To fix this, I first need to get rid of the comma and then coerce it as a number.

digit2 <- str_replace_all(digit1, ",","")

head(digit2)
## [1] "c(\"3860500\" \"3147623\" \"7008123\" \"4380000\" \"3369806\" \"7749806\")"
## [2] "c(\"1863200\" \"1988516\" \"3851716\" \"2275000\" \"2177992\" \"4452992\")"
## [3] "c(\"462000\" \"769904\" \"1231904\" \"593000\" \"1138310\" \"1731310\")"   
## [4] "c(\"15950\" \"159784\" \"175734\" \"9000\" \"110856\" \"119856\")"         
## [5] "c(\"4500\" \"87924\" \"92424\" \"2960\" \"31329\" \"34289\")"              
## [6] "c(\"1945\" \"32087\" \"34032\" \"1895\" \"16519\" \"18414\")"
typeof(digit2)
## [1] "character"

I am not sure why the output from the str_replace_all function is so different from the input. As a result of this, I will now extract the digit characters one more time.

digit3 <- str_extract_all(digit2, "[[:digit:]]+")

head(digit3)
## [[1]]
## [1] "3860500" "3147623" "7008123" "4380000" "3369806" "7749806"
## 
## [[2]]
## [1] "1863200" "1988516" "3851716" "2275000" "2177992" "4452992"
## 
## [[3]]
## [1] "462000"  "769904"  "1231904" "593000"  "1138310" "1731310"
## 
## [[4]]
## [1] "15950"  "159784" "175734" "9000"   "110856" "119856"
## 
## [[5]]
## [1] "4500"  "87924" "92424" "2960"  "31329" "34289"
## 
## [[6]]
## [1] "1945"  "32087" "34032" "1895"  "16519" "18414"
str(digit3)
## List of 21
##  $ : chr [1:6] "3860500" "3147623" "7008123" "4380000" ...
##  $ : chr [1:6] "1863200" "1988516" "3851716" "2275000" ...
##  $ : chr [1:6] "462000" "769904" "1231904" "593000" ...
##  $ : chr [1:6] "15950" "159784" "175734" "9000" ...
##  $ : chr [1:6] "4500" "87924" "92424" "2960" ...
##  $ : chr [1:6] "1945" "32087" "34032" "1895" ...
##  $ : chr [1:6] "19800" "15323" "35123" "20810" ...
##  $ : chr [1:6] "9710" "15029" "24739" "15120" ...
##  $ : chr [1:6] "41400" "32910" "74310" "47200" ...
##  $ : chr [1:6] "43830" "77734" "121564" "41990" ...
##  $ : chr [1:6] "19110" "63145" "82255" "20940" ...
##  $ : chr [1:6] "97820" "81997" "179817" "135590" ...
##  $ : chr [1:6] "237530" "819435" "1056965" "278710" ...
##  $ : chr [1:6] "96995" "493288" "590283" "155170" ...
##  $ : chr [1:6] "713450" "1193770" "1907220" "647000" ...
##  $ : chr [1:6] "20700" "17430" "38130" "16200" ...
##  $ : chr [1:6] "12800" "8724" "21524" "10250" ...
##  $ : chr [1:6] "38700" "19121" "57821" "44400" ...
##  $ : chr [1:6] "381900" "611928" "993828" "609200" ...
##  $ : chr [1:6] "109100" "295945" "405045" "246300" ...
##  $ : chr [1:6] "21325" "70666" "91991" "49700" ...

I finally have the output I need to coerce the digit characters into numeric. This is a must for downstream analysis.

Coersing to Numeric

In this sections, I use the lapply function for coercing digit characters into numeric.

digit4 <- lapply(digit3,as.numeric)

head(digit4)
## [[1]]
## [1] 3860500 3147623 7008123 4380000 3369806 7749806
## 
## [[2]]
## [1] 1863200 1988516 3851716 2275000 2177992 4452992
## 
## [[3]]
## [1]  462000  769904 1231904  593000 1138310 1731310
## 
## [[4]]
## [1]  15950 159784 175734   9000 110856 119856
## 
## [[5]]
## [1]  4500 87924 92424  2960 31329 34289
## 
## [[6]]
## [1]  1945 32087 34032  1895 16519 18414
str(digit4)
## List of 21
##  $ : num [1:6] 3860500 3147623 7008123 4380000 3369806 ...
##  $ : num [1:6] 1863200 1988516 3851716 2275000 2177992 ...
##  $ : num [1:6] 462000 769904 1231904 593000 1138310 ...
##  $ : num [1:6] 15950 159784 175734 9000 110856 ...
##  $ : num [1:6] 4500 87924 92424 2960 31329 ...
##  $ : num [1:6] 1945 32087 34032 1895 16519 ...
##  $ : num [1:6] 19800 15323 35123 20810 38609 ...
##  $ : num [1:6] 9710 15029 24739 15120 38625 ...
##  $ : num [1:6] 41400 32910 74310 47200 46091 ...
##  $ : num [1:6] 43830 77734 121564 41990 76247 ...
##  $ : num [1:6] 19110 63145 82255 20940 57630 ...
##  $ : num [1:6] 97820 81997 179817 135590 82925 ...
##  $ : num [1:6] 237530 819435 1056965 278710 861697 ...
##  $ : num [1:6] 96995 493288 590283 155170 598102 ...
##  $ : num [1:6] 713450 1193770 1907220 647000 1441957 ...
##  $ : num [1:6] 20700 17430 38130 16200 21454 ...
##  $ : num [1:6] 12800 8724 21524 10250 15407 ...
##  $ : num [1:6] 38700 19121 57821 44400 29442 ...
##  $ : num [1:6] 381900 611928 993828 609200 717399 ...
##  $ : num [1:6] 109100 295945 405045 246300 380770 ...
##  $ : num [1:6] 21325 70666 91991 49700 141725 ...

Creating a Data Frame

At this point, I needed to convert my numeric list into a data frame. Searching online, I found an extremely helpful function to accomplish this task. This comes from Prof. Jason Bryer PhD.

To access additional details click this link [http://www.r-bloggers.com/converting-a-list-to-a-data-frame/]

Loading the Required Functions

require(devtools)
## Loading required package: devtools
source_gist(4676064)
## Sourcing https://gist.githubusercontent.com/jbryer/4676064/raw/646d720a782efd1cdd0ead48ec476196278fb5f0/as.data.frame.list.R
## SHA-1 hash of file is 0079a0f5e96ee15d2515312485d7865d8cdf798d

Converting the Number List to a Data Frame

df <- as.data.frame(digit4)

head(df)
##        X1      X2      X3      X4      X5      X6
## 1 3860500 3147623 7008123 4380000 3369806 7749806
## 2 1863200 1988516 3851716 2275000 2177992 4452992
## 3  462000  769904 1231904  593000 1138310 1731310
## 4   15950  159784  175734    9000  110856  119856
## 5    4500   87924   92424    2960   31329   34289
## 6    1945   32087   34032    1895   16519   18414
str(df)
## 'data.frame':    21 obs. of  6 variables:
##  $ X1: num  3860500 1863200 462000 15950 4500 ...
##  $ X2: num  3147623 1988516 769904 159784 87924 ...
##  $ X3: num  7008123 3851716 1231904 175734 92424 ...
##  $ X4: num  4380000 2275000 593000 9000 2960 ...
##  $ X5: num  3369806 2177992 1138310 110856 31329 ...
##  $ X6: num  7749806 4452992 1731310 119856 34289 ...

Additional Data Manipulations

Adding Back Additional Details

At this point, the data frame does not have column headers, month or grain information. I needed to strip that level of detail in order to create a better data structure that’s more conducive for downstream analysis. In the next sections, I recreate the missing details and “glue” it back to the data frame.

Creating the Month Vectors

I can easily add back the month information because it is a repeating sequence of the following months: March, June and September. This sequence is repeated seven times in the new months column. This month column is created by a vector using the rep function with seven repetitions.

months <- rep(c("March", "June", "September"), 7)

months
##  [1] "March"     "June"      "September" "March"     "June"     
##  [6] "September" "March"     "June"      "September" "March"    
## [11] "June"      "September" "March"     "June"      "September"
## [16] "March"     "June"      "September" "March"     "June"     
## [21] "September"

Creating the Grain Vectors

Similarly to the new months vector, I create a vector with the types of grains in the data. In this case, each element is in the vector is repeated three times.

grain <- rep(c("Corn", "Sorghum", "Oats", "Barley", "All Wheat", "Durum Wheat", "Soybeans"), each=3)

grain
##  [1] "Corn"        "Corn"        "Corn"        "Sorghum"     "Sorghum"    
##  [6] "Sorghum"     "Oats"        "Oats"        "Oats"        "Barley"     
## [11] "Barley"      "Barley"      "All Wheat"   "All Wheat"   "All Wheat"  
## [16] "Durum Wheat" "Durum Wheat" "Durum Wheat" "Soybeans"    "Soybeans"   
## [21] "Soybeans"

Creating the Year Vectors

As in the months and grain vectors, I also create vectors for the 2014 and 2015 years. I will continue to use the rep function to repeat each element 21x. I then append these vectors to the data frames below.

#2014
Y14d <- rep(c("Y2014"), 21)
Y14d
##  [1] "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014"
##  [9] "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014"
## [17] "Y2014" "Y2014" "Y2014" "Y2014" "Y2014"
#2015
Y15d <- rep(c("Y2015"), 21)
Y15d
##  [1] "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015"
##  [9] "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015"
## [17] "Y2015" "Y2015" "Y2015" "Y2015" "Y2015"

Appending the Months and Grain Vectors to the Data Frame

In this step, I append the months and grain vectors to the df data frame.

df[,"Months"] <- months
df[,"Grain"] <- grain

Tyding Up

After thinking about the type of downstream analysis required, I determined that subsetting the data set and then later appending them by years would facilitate this analysis. For this step, I used the dplyr select function to subset the data set into separate 2014 and 2015 data frames named Y14 and Y15.

Subsetting the Data Frame

Y14 <- select(df, X1:X3, Months, Grain)

Y15 <- select(df, X4:X6, Months, Grain)

head(Y14)
##        X1      X2      X3    Months   Grain
## 1 3860500 3147623 7008123     March    Corn
## 2 1863200 1988516 3851716      June    Corn
## 3  462000  769904 1231904 September    Corn
## 4   15950  159784  175734     March Sorghum
## 5    4500   87924   92424      June Sorghum
## 6    1945   32087   34032 September Sorghum
head(Y15)
##        X4      X5      X6    Months   Grain
## 1 4380000 3369806 7749806     March    Corn
## 2 2275000 2177992 4452992      June    Corn
## 3  593000 1138310 1731310 September    Corn
## 4    9000  110856  119856     March Sorghum
## 5    2960   31329   34289      June Sorghum
## 6    1895   16519   18414 September Sorghum

Adding the Year Columns

Each subsetted data set now represents a single year. At this point, I can now add the year columns to the data frames for the respective year.

Y14[, "Year"] <- Y14d
Y15[, "Year"] <- Y15d

head(Y14)
##        X1      X2      X3    Months   Grain  Year
## 1 3860500 3147623 7008123     March    Corn Y2014
## 2 1863200 1988516 3851716      June    Corn Y2014
## 3  462000  769904 1231904 September    Corn Y2014
## 4   15950  159784  175734     March Sorghum Y2014
## 5    4500   87924   92424      June Sorghum Y2014
## 6    1945   32087   34032 September Sorghum Y2014
head(Y15)
##        X4      X5      X6    Months   Grain  Year
## 1 4380000 3369806 7749806     March    Corn Y2015
## 2 2275000 2177992 4452992      June    Corn Y2015
## 3  593000 1138310 1731310 September    Corn Y2015
## 4    9000  110856  119856     March Sorghum Y2015
## 5    2960   31329   34289      June Sorghum Y2015
## 6    1895   16519   18414 September Sorghum Y2015

Renaming the Columns

Up until this point, the column names have not been too important. Going forward, the column names for the two data sets needs to match because I will append the Y14 and Y15 data frames. In this step, I name the columns.

colnames(Y14) <- c("On_Farms", "Off_Farms", "Total", "Month", "Grain", "Year")
colnames(Y15) <- c("On_Farms", "Off_Farms", "Total", "Month", "Grain", "Year")

head(Y14)
##   On_Farms Off_Farms   Total     Month   Grain  Year
## 1  3860500   3147623 7008123     March    Corn Y2014
## 2  1863200   1988516 3851716      June    Corn Y2014
## 3   462000    769904 1231904 September    Corn Y2014
## 4    15950    159784  175734     March Sorghum Y2014
## 5     4500     87924   92424      June Sorghum Y2014
## 6     1945     32087   34032 September Sorghum Y2014
head(Y15)
##   On_Farms Off_Farms   Total     Month   Grain  Year
## 1  4380000   3369806 7749806     March    Corn Y2015
## 2  2275000   2177992 4452992      June    Corn Y2015
## 3   593000   1138310 1731310 September    Corn Y2015
## 4     9000    110856  119856     March Sorghum Y2015
## 5     2960     31329   34289      June Sorghum Y2015
## 6     1895     16519   18414 September Sorghum Y2015

Appending the Datasets

Appending the Y14 and Y15 data frames is the last step before we can analyse the data and answer Mohan’s question. For this task I use the dplyr bind_rows function.

ytotal <- bind_rows(Y14,Y15)

ytotal
## Source: local data frame [42 x 6]
## 
##    On_Farms Off_Farms   Total     Month   Grain  Year
##       (dbl)     (dbl)   (dbl)     (chr)   (chr) (chr)
## 1   3860500   3147623 7008123     March    Corn Y2014
## 2   1863200   1988516 3851716      June    Corn Y2014
## 3    462000    769904 1231904 September    Corn Y2014
## 4     15950    159784  175734     March Sorghum Y2014
## 5      4500     87924   92424      June Sorghum Y2014
## 6      1945     32087   34032 September Sorghum Y2014
## 7     19800     15323   35123     March    Oats Y2014
## 8      9710     15029   24739      June    Oats Y2014
## 9     41400     32910   74310 September    Oats Y2014
## 10    43830     77734  121564     March  Barley Y2014
## ..      ...       ...     ...       ...     ...   ...

Downstream Analysis

After all the cleaning, manipulations and tidying up, the data is finally ready for analysis!

Groupign and Summarizing

The data set has both monthly and yearly time periods. In earlier steps, I removed the month of December since we have not reached that quarter for 2015. At this point, I only need the following months: March, June and September. To answer Mohan’s question, I need summarize the data and group by both Grain and Year. The analysis variable here will be Off_Farms Domestic Units.

ytotal1 <- ytotal %>% group_by(Grain, Year) %>% summarize(Total_Sum = sum(Off_Farms))

head(ytotal1)
## Source: local data frame [6 x 3]
## Groups: Grain [3]
## 
##       Grain  Year Total_Sum
##       (chr) (chr)     (dbl)
## 1 All Wheat Y2014   2506493
## 2 All Wheat Y2015   2901756
## 3    Barley Y2014    222876
## 4    Barley Y2015    216802
## 5      Corn Y2014   5906043
## 6      Corn Y2015   6686108

Spreading

Although keeping all of my annual time periods in a column made some of manipulations easier, now I need to make the annual time period values their own columns. This facilitates the Year-Over-Year calculation in the next step. For this step, I use the tidyr spread function to accomplish this.

ytotal2 <- spread(ytotal1, Year, Total_Sum)

head(ytotal2)
## Source: local data frame [6 x 3]
## 
##         Grain   Y2014   Y2015
##         (chr)   (dbl)   (dbl)
## 1   All Wheat 2506493 2901756
## 2      Barley  222876  216802
## 3        Corn 5906043 6686108
## 4 Durum Wheat   45275   66303
## 5        Oats   63262  123325
## 6     Sorghum  279795  158704

Adding the YOY Trend Calculation

In this section, I am now ready to do the Year-Over-Year calculation between Y2015 and Y2014. For this step, I use the dplyr mutate function to calculate a new field called YOY_Trend.

I also arrange the results so I can see maximum decrease at the top. For this step, I use the dplyr arrange function.

ytotal3 <- ytotal2 %>% mutate(YOY_Trend = (Y2015 / Y2014)-1)

arrange(ytotal3, (YOY_Trend))
## Source: local data frame [7 x 4]
## 
##         Grain   Y2014   Y2015   YOY_Trend
##         (chr)   (dbl)   (dbl)       (dbl)
## 1     Sorghum  279795  158704 -0.43278472
## 2      Barley  222876  216802 -0.02725282
## 3        Corn 5906043 6686108  0.13207913
## 4   All Wheat 2506493 2901756  0.15769563
## 5    Soybeans  978539 1239894  0.26708695
## 6 Durum Wheat   45275   66303  0.46445058
## 7        Oats   63262  123325  0.94943252

Conclusion

Based on this analysis, I can conclude that Sorghum was worst hit in 2015. This grain experienced the maximum Year-Over-Year decline, -43.28%, for comparable time periods.