The Data
For this project, I have selected the wide data set posted by Mohan Kandaraj in Forum: Discussion 6/7: Untidy Data. This data source comes from the US Department of Agriculture and it records Grain Stocks by Position and Month in Domestic Units in the United States for 2014 and 2015.
The data posted by Mohan is in the form of a text file. As a result, I use Regular Expressions to extract the relevant data values from this file and prepare the data for downstream analysis.
Analysis Question
As per Mohan, the objective is to “find which grain is worst hit in 2015. That is the grain which has maximum decline in stock in 2015”. For this analysis the Year-Over-Year trend between 2015 and 2014 are used to answer this question. The analysis variable is the Off Farms.
As per the USDA, this variable “includes stocks at mills, elevators, warehouses, terminals, and processors”.
library(stringr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
project_url = url("https://raw.githubusercontent.com/diegomdiaz/IS607/master/Project%202/grains_datasource.txt")
project_data <- readLines(project_url, n = 51, skipNul = TRUE)
Taking a Peak at the Imported Data
So far, it looks like this data set will require several cleanups and manipulations before it is ready for downstream analysis.
head(project_data)
## [1] "Grain Stocks by Position and Month in Domestic Units - United States: 2014 and 2015"
## [2] "[Blank data cells indicate estimation period has not yet begun]"
## [3] "----------------------------------------------------------------------------------------------"
## [4] " : 2014 : 2015 "
## [5] " :-----------------------------------------------------------------------------"
## [6] " Date : On : Off : Total all : On : Off : Total all "
typeof(project_data)
## [1] "character"
Subsetting
Having reviewed the data contents above, I will only keep rows from 11:51
project_data <- project_data [c(11:51)]
head(project_data)
## [1] "Corn : "
## [2] "March 1 ........: 3,860,500 3,147,623 7,008,123 4,380,000 3,369,806 7,749,806 "
## [3] "June 1 .........: 1,863,200 1,988,516 3,851,716 2,275,000 2,177,992 4,452,992 "
## [4] "September 1 ....: 462,000 769,904 1,231,904 593,000 1,138,310 1,731,310 "
## [5] "December 1 .....: 7,087,000 4,124,380 11,211,380 "
## [6] " : "
Subsetting by Recycling
In the current data structure, several values are nested in a structure that is not ideal for analysis; however, the structure appears to be consistent.
I exploit this pattern by subsetting by Recycling. In the second subsetting below, I remove the rows for December. These are only available for 2014 and not 2015. As a result, they are not relevant for this analysis.
#First subsetting.
project_data <- project_data [c(FALSE, TRUE,TRUE,TRUE,TRUE,FALSE)]
head(project_data)
## [1] "March 1 ........: 3,860,500 3,147,623 7,008,123 4,380,000 3,369,806 7,749,806 "
## [2] "June 1 .........: 1,863,200 1,988,516 3,851,716 2,275,000 2,177,992 4,452,992 "
## [3] "September 1 ....: 462,000 769,904 1,231,904 593,000 1,138,310 1,731,310 "
## [4] "December 1 .....: 7,087,000 4,124,380 11,211,380 "
## [5] "March 1 ........: 15,950 159,784 175,734 9,000 110,856 119,856 "
## [6] "June 1 .........: 4,500 87,924 92,424 2,960 31,329 34,289 "
#Second subsetting removes the rows for December.
project_data <- project_data [c(TRUE, TRUE, TRUE, FALSE)]
head(project_data)
## [1] "March 1 ........: 3,860,500 3,147,623 7,008,123 4,380,000 3,369,806 7,749,806 "
## [2] "June 1 .........: 1,863,200 1,988,516 3,851,716 2,275,000 2,177,992 4,452,992 "
## [3] "September 1 ....: 462,000 769,904 1,231,904 593,000 1,138,310 1,731,310 "
## [4] "March 1 ........: 15,950 159,784 175,734 9,000 110,856 119,856 "
## [5] "June 1 .........: 4,500 87,924 92,424 2,960 31,329 34,289 "
## [6] "September 1 ....: 1,945 32,087 34,032 1,895 16,519 18,414 "
typeof(project_data)
## [1] "character"
Extracting Digits
In this section, I extract the digit strings using the str_extract_all function. I then review the output, structure and type of data structure.
digit1 <- str_extract_all(project_data, "[[:digit:],]{2,}+")
head(digit1)
## [[1]]
## [1] "3,860,500" "3,147,623" "7,008,123" "4,380,000" "3,369,806" "7,749,806"
##
## [[2]]
## [1] "1,863,200" "1,988,516" "3,851,716" "2,275,000" "2,177,992" "4,452,992"
##
## [[3]]
## [1] "462,000" "769,904" "1,231,904" "593,000" "1,138,310" "1,731,310"
##
## [[4]]
## [1] "15,950" "159,784" "175,734" "9,000" "110,856" "119,856"
##
## [[5]]
## [1] "4,500" "87,924" "92,424" "2,960" "31,329" "34,289"
##
## [[6]]
## [1] "1,945" "32,087" "34,032" "1,895" "16,519" "18,414"
str(digit1)
## List of 21
## $ : chr [1:6] "3,860,500" "3,147,623" "7,008,123" "4,380,000" ...
## $ : chr [1:6] "1,863,200" "1,988,516" "3,851,716" "2,275,000" ...
## $ : chr [1:6] "462,000" "769,904" "1,231,904" "593,000" ...
## $ : chr [1:6] "15,950" "159,784" "175,734" "9,000" ...
## $ : chr [1:6] "4,500" "87,924" "92,424" "2,960" ...
## $ : chr [1:6] "1,945" "32,087" "34,032" "1,895" ...
## $ : chr [1:6] "19,800" "15,323" "35,123" "20,810" ...
## $ : chr [1:6] "9,710" "15,029" "24,739" "15,120" ...
## $ : chr [1:6] "41,400" "32,910" "74,310" "47,200" ...
## $ : chr [1:6] "43,830" "77,734" "121,564" "41,990" ...
## $ : chr [1:6] "19,110" "63,145" "82,255" "20,940" ...
## $ : chr [1:6] "97,820" "81,997" "179,817" "135,590" ...
## $ : chr [1:6] "237,530" "819,435" "1,056,965" "278,710" ...
## $ : chr [1:6] "96,995" "493,288" "590,283" "155,170" ...
## $ : chr [1:6] "713,450" "1,193,770" "1,907,220" "647,000" ...
## $ : chr [1:6] "20,700" "17,430" "38,130" "16,200" ...
## $ : chr [1:6] "12,800" "8,724" "21,524" "10,250" ...
## $ : chr [1:6] "38,700" "19,121" "57,821" "44,400" ...
## $ : chr [1:6] "381,900" "611,928" "993,828" "609,200" ...
## $ : chr [1:6] "109,100" "295,945" "405,045" "246,300" ...
## $ : chr [1:6] "21,325" "70,666" "91,991" "49,700" ...
typeof(digit1)
## [1] "list"
Find and Replace
The values have a comma and this does not allow me to coerce these characters to numeric. To fix this, I first need to get rid of the comma and then coerce it as a number.
digit2 <- str_replace_all(digit1, ",","")
head(digit2)
## [1] "c(\"3860500\" \"3147623\" \"7008123\" \"4380000\" \"3369806\" \"7749806\")"
## [2] "c(\"1863200\" \"1988516\" \"3851716\" \"2275000\" \"2177992\" \"4452992\")"
## [3] "c(\"462000\" \"769904\" \"1231904\" \"593000\" \"1138310\" \"1731310\")"
## [4] "c(\"15950\" \"159784\" \"175734\" \"9000\" \"110856\" \"119856\")"
## [5] "c(\"4500\" \"87924\" \"92424\" \"2960\" \"31329\" \"34289\")"
## [6] "c(\"1945\" \"32087\" \"34032\" \"1895\" \"16519\" \"18414\")"
typeof(digit2)
## [1] "character"
I am not sure why the output from the str_replace_all function is so different from the input. As a result of this, I will now extract the digit characters one more time.
digit3 <- str_extract_all(digit2, "[[:digit:]]+")
head(digit3)
## [[1]]
## [1] "3860500" "3147623" "7008123" "4380000" "3369806" "7749806"
##
## [[2]]
## [1] "1863200" "1988516" "3851716" "2275000" "2177992" "4452992"
##
## [[3]]
## [1] "462000" "769904" "1231904" "593000" "1138310" "1731310"
##
## [[4]]
## [1] "15950" "159784" "175734" "9000" "110856" "119856"
##
## [[5]]
## [1] "4500" "87924" "92424" "2960" "31329" "34289"
##
## [[6]]
## [1] "1945" "32087" "34032" "1895" "16519" "18414"
str(digit3)
## List of 21
## $ : chr [1:6] "3860500" "3147623" "7008123" "4380000" ...
## $ : chr [1:6] "1863200" "1988516" "3851716" "2275000" ...
## $ : chr [1:6] "462000" "769904" "1231904" "593000" ...
## $ : chr [1:6] "15950" "159784" "175734" "9000" ...
## $ : chr [1:6] "4500" "87924" "92424" "2960" ...
## $ : chr [1:6] "1945" "32087" "34032" "1895" ...
## $ : chr [1:6] "19800" "15323" "35123" "20810" ...
## $ : chr [1:6] "9710" "15029" "24739" "15120" ...
## $ : chr [1:6] "41400" "32910" "74310" "47200" ...
## $ : chr [1:6] "43830" "77734" "121564" "41990" ...
## $ : chr [1:6] "19110" "63145" "82255" "20940" ...
## $ : chr [1:6] "97820" "81997" "179817" "135590" ...
## $ : chr [1:6] "237530" "819435" "1056965" "278710" ...
## $ : chr [1:6] "96995" "493288" "590283" "155170" ...
## $ : chr [1:6] "713450" "1193770" "1907220" "647000" ...
## $ : chr [1:6] "20700" "17430" "38130" "16200" ...
## $ : chr [1:6] "12800" "8724" "21524" "10250" ...
## $ : chr [1:6] "38700" "19121" "57821" "44400" ...
## $ : chr [1:6] "381900" "611928" "993828" "609200" ...
## $ : chr [1:6] "109100" "295945" "405045" "246300" ...
## $ : chr [1:6] "21325" "70666" "91991" "49700" ...
I finally have the output I need to coerce the digit characters into numeric. This is a must for downstream analysis.
Coersing to Numeric
In this sections, I use the lapply function for coercing digit characters into numeric.
digit4 <- lapply(digit3,as.numeric)
head(digit4)
## [[1]]
## [1] 3860500 3147623 7008123 4380000 3369806 7749806
##
## [[2]]
## [1] 1863200 1988516 3851716 2275000 2177992 4452992
##
## [[3]]
## [1] 462000 769904 1231904 593000 1138310 1731310
##
## [[4]]
## [1] 15950 159784 175734 9000 110856 119856
##
## [[5]]
## [1] 4500 87924 92424 2960 31329 34289
##
## [[6]]
## [1] 1945 32087 34032 1895 16519 18414
str(digit4)
## List of 21
## $ : num [1:6] 3860500 3147623 7008123 4380000 3369806 ...
## $ : num [1:6] 1863200 1988516 3851716 2275000 2177992 ...
## $ : num [1:6] 462000 769904 1231904 593000 1138310 ...
## $ : num [1:6] 15950 159784 175734 9000 110856 ...
## $ : num [1:6] 4500 87924 92424 2960 31329 ...
## $ : num [1:6] 1945 32087 34032 1895 16519 ...
## $ : num [1:6] 19800 15323 35123 20810 38609 ...
## $ : num [1:6] 9710 15029 24739 15120 38625 ...
## $ : num [1:6] 41400 32910 74310 47200 46091 ...
## $ : num [1:6] 43830 77734 121564 41990 76247 ...
## $ : num [1:6] 19110 63145 82255 20940 57630 ...
## $ : num [1:6] 97820 81997 179817 135590 82925 ...
## $ : num [1:6] 237530 819435 1056965 278710 861697 ...
## $ : num [1:6] 96995 493288 590283 155170 598102 ...
## $ : num [1:6] 713450 1193770 1907220 647000 1441957 ...
## $ : num [1:6] 20700 17430 38130 16200 21454 ...
## $ : num [1:6] 12800 8724 21524 10250 15407 ...
## $ : num [1:6] 38700 19121 57821 44400 29442 ...
## $ : num [1:6] 381900 611928 993828 609200 717399 ...
## $ : num [1:6] 109100 295945 405045 246300 380770 ...
## $ : num [1:6] 21325 70666 91991 49700 141725 ...
At this point, I needed to convert my numeric list into a data frame. Searching online, I found an extremely helpful function to accomplish this task. This comes from Prof. Jason Bryer PhD.
To access additional details click this link [http://www.r-bloggers.com/converting-a-list-to-a-data-frame/]
Loading the Required Functions
require(devtools)
## Loading required package: devtools
source_gist(4676064)
## Sourcing https://gist.githubusercontent.com/jbryer/4676064/raw/646d720a782efd1cdd0ead48ec476196278fb5f0/as.data.frame.list.R
## SHA-1 hash of file is 0079a0f5e96ee15d2515312485d7865d8cdf798d
Converting the Number List to a Data Frame
df <- as.data.frame(digit4)
head(df)
## X1 X2 X3 X4 X5 X6
## 1 3860500 3147623 7008123 4380000 3369806 7749806
## 2 1863200 1988516 3851716 2275000 2177992 4452992
## 3 462000 769904 1231904 593000 1138310 1731310
## 4 15950 159784 175734 9000 110856 119856
## 5 4500 87924 92424 2960 31329 34289
## 6 1945 32087 34032 1895 16519 18414
str(df)
## 'data.frame': 21 obs. of 6 variables:
## $ X1: num 3860500 1863200 462000 15950 4500 ...
## $ X2: num 3147623 1988516 769904 159784 87924 ...
## $ X3: num 7008123 3851716 1231904 175734 92424 ...
## $ X4: num 4380000 2275000 593000 9000 2960 ...
## $ X5: num 3369806 2177992 1138310 110856 31329 ...
## $ X6: num 7749806 4452992 1731310 119856 34289 ...
Adding Back Additional Details
At this point, the data frame does not have column headers, month or grain information. I needed to strip that level of detail in order to create a better data structure that’s more conducive for downstream analysis. In the next sections, I recreate the missing details and “glue” it back to the data frame.
Creating the Month Vectors
I can easily add back the month information because it is a repeating sequence of the following months: March, June and September. This sequence is repeated seven times in the new months column. This month column is created by a vector using the rep function with seven repetitions.
months <- rep(c("March", "June", "September"), 7)
months
## [1] "March" "June" "September" "March" "June"
## [6] "September" "March" "June" "September" "March"
## [11] "June" "September" "March" "June" "September"
## [16] "March" "June" "September" "March" "June"
## [21] "September"
Creating the Grain Vectors
Similarly to the new months vector, I create a vector with the types of grains in the data. In this case, each element is in the vector is repeated three times.
grain <- rep(c("Corn", "Sorghum", "Oats", "Barley", "All Wheat", "Durum Wheat", "Soybeans"), each=3)
grain
## [1] "Corn" "Corn" "Corn" "Sorghum" "Sorghum"
## [6] "Sorghum" "Oats" "Oats" "Oats" "Barley"
## [11] "Barley" "Barley" "All Wheat" "All Wheat" "All Wheat"
## [16] "Durum Wheat" "Durum Wheat" "Durum Wheat" "Soybeans" "Soybeans"
## [21] "Soybeans"
Creating the Year Vectors
As in the months and grain vectors, I also create vectors for the 2014 and 2015 years. I will continue to use the rep function to repeat each element 21x. I then append these vectors to the data frames below.
#2014
Y14d <- rep(c("Y2014"), 21)
Y14d
## [1] "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014"
## [9] "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014" "Y2014"
## [17] "Y2014" "Y2014" "Y2014" "Y2014" "Y2014"
#2015
Y15d <- rep(c("Y2015"), 21)
Y15d
## [1] "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015"
## [9] "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015" "Y2015"
## [17] "Y2015" "Y2015" "Y2015" "Y2015" "Y2015"
Appending the Months and Grain Vectors to the Data Frame
In this step, I append the months and grain vectors to the df data frame.
df[,"Months"] <- months
df[,"Grain"] <- grain
After thinking about the type of downstream analysis required, I determined that subsetting the data set and then later appending them by years would facilitate this analysis. For this step, I used the dplyr select function to subset the data set into separate 2014 and 2015 data frames named Y14 and Y15.
Subsetting the Data Frame
Y14 <- select(df, X1:X3, Months, Grain)
Y15 <- select(df, X4:X6, Months, Grain)
head(Y14)
## X1 X2 X3 Months Grain
## 1 3860500 3147623 7008123 March Corn
## 2 1863200 1988516 3851716 June Corn
## 3 462000 769904 1231904 September Corn
## 4 15950 159784 175734 March Sorghum
## 5 4500 87924 92424 June Sorghum
## 6 1945 32087 34032 September Sorghum
head(Y15)
## X4 X5 X6 Months Grain
## 1 4380000 3369806 7749806 March Corn
## 2 2275000 2177992 4452992 June Corn
## 3 593000 1138310 1731310 September Corn
## 4 9000 110856 119856 March Sorghum
## 5 2960 31329 34289 June Sorghum
## 6 1895 16519 18414 September Sorghum
Adding the Year Columns
Each subsetted data set now represents a single year. At this point, I can now add the year columns to the data frames for the respective year.
Y14[, "Year"] <- Y14d
Y15[, "Year"] <- Y15d
head(Y14)
## X1 X2 X3 Months Grain Year
## 1 3860500 3147623 7008123 March Corn Y2014
## 2 1863200 1988516 3851716 June Corn Y2014
## 3 462000 769904 1231904 September Corn Y2014
## 4 15950 159784 175734 March Sorghum Y2014
## 5 4500 87924 92424 June Sorghum Y2014
## 6 1945 32087 34032 September Sorghum Y2014
head(Y15)
## X4 X5 X6 Months Grain Year
## 1 4380000 3369806 7749806 March Corn Y2015
## 2 2275000 2177992 4452992 June Corn Y2015
## 3 593000 1138310 1731310 September Corn Y2015
## 4 9000 110856 119856 March Sorghum Y2015
## 5 2960 31329 34289 June Sorghum Y2015
## 6 1895 16519 18414 September Sorghum Y2015
Renaming the Columns
Up until this point, the column names have not been too important. Going forward, the column names for the two data sets needs to match because I will append the Y14 and Y15 data frames. In this step, I name the columns.
colnames(Y14) <- c("On_Farms", "Off_Farms", "Total", "Month", "Grain", "Year")
colnames(Y15) <- c("On_Farms", "Off_Farms", "Total", "Month", "Grain", "Year")
head(Y14)
## On_Farms Off_Farms Total Month Grain Year
## 1 3860500 3147623 7008123 March Corn Y2014
## 2 1863200 1988516 3851716 June Corn Y2014
## 3 462000 769904 1231904 September Corn Y2014
## 4 15950 159784 175734 March Sorghum Y2014
## 5 4500 87924 92424 June Sorghum Y2014
## 6 1945 32087 34032 September Sorghum Y2014
head(Y15)
## On_Farms Off_Farms Total Month Grain Year
## 1 4380000 3369806 7749806 March Corn Y2015
## 2 2275000 2177992 4452992 June Corn Y2015
## 3 593000 1138310 1731310 September Corn Y2015
## 4 9000 110856 119856 March Sorghum Y2015
## 5 2960 31329 34289 June Sorghum Y2015
## 6 1895 16519 18414 September Sorghum Y2015
Appending the Datasets
Appending the Y14 and Y15 data frames is the last step before we can analyse the data and answer Mohan’s question. For this task I use the dplyr bind_rows function.
ytotal <- bind_rows(Y14,Y15)
ytotal
## Source: local data frame [42 x 6]
##
## On_Farms Off_Farms Total Month Grain Year
## (dbl) (dbl) (dbl) (chr) (chr) (chr)
## 1 3860500 3147623 7008123 March Corn Y2014
## 2 1863200 1988516 3851716 June Corn Y2014
## 3 462000 769904 1231904 September Corn Y2014
## 4 15950 159784 175734 March Sorghum Y2014
## 5 4500 87924 92424 June Sorghum Y2014
## 6 1945 32087 34032 September Sorghum Y2014
## 7 19800 15323 35123 March Oats Y2014
## 8 9710 15029 24739 June Oats Y2014
## 9 41400 32910 74310 September Oats Y2014
## 10 43830 77734 121564 March Barley Y2014
## .. ... ... ... ... ... ...
After all the cleaning, manipulations and tidying up, the data is finally ready for analysis!
Groupign and Summarizing
The data set has both monthly and yearly time periods. In earlier steps, I removed the month of December since we have not reached that quarter for 2015. At this point, I only need the following months: March, June and September. To answer Mohan’s question, I need summarize the data and group by both Grain and Year. The analysis variable here will be Off_Farms Domestic Units.
ytotal1 <- ytotal %>% group_by(Grain, Year) %>% summarize(Total_Sum = sum(Off_Farms))
head(ytotal1)
## Source: local data frame [6 x 3]
## Groups: Grain [3]
##
## Grain Year Total_Sum
## (chr) (chr) (dbl)
## 1 All Wheat Y2014 2506493
## 2 All Wheat Y2015 2901756
## 3 Barley Y2014 222876
## 4 Barley Y2015 216802
## 5 Corn Y2014 5906043
## 6 Corn Y2015 6686108
Spreading
Although keeping all of my annual time periods in a column made some of manipulations easier, now I need to make the annual time period values their own columns. This facilitates the Year-Over-Year calculation in the next step. For this step, I use the tidyr spread function to accomplish this.
ytotal2 <- spread(ytotal1, Year, Total_Sum)
head(ytotal2)
## Source: local data frame [6 x 3]
##
## Grain Y2014 Y2015
## (chr) (dbl) (dbl)
## 1 All Wheat 2506493 2901756
## 2 Barley 222876 216802
## 3 Corn 5906043 6686108
## 4 Durum Wheat 45275 66303
## 5 Oats 63262 123325
## 6 Sorghum 279795 158704
Adding the YOY Trend Calculation
In this section, I am now ready to do the Year-Over-Year calculation between Y2015 and Y2014. For this step, I use the dplyr mutate function to calculate a new field called YOY_Trend.
I also arrange the results so I can see maximum decrease at the top. For this step, I use the dplyr arrange function.
ytotal3 <- ytotal2 %>% mutate(YOY_Trend = (Y2015 / Y2014)-1)
arrange(ytotal3, (YOY_Trend))
## Source: local data frame [7 x 4]
##
## Grain Y2014 Y2015 YOY_Trend
## (chr) (dbl) (dbl) (dbl)
## 1 Sorghum 279795 158704 -0.43278472
## 2 Barley 222876 216802 -0.02725282
## 3 Corn 5906043 6686108 0.13207913
## 4 All Wheat 2506493 2901756 0.15769563
## 5 Soybeans 978539 1239894 0.26708695
## 6 Durum Wheat 45275 66303 0.46445058
## 7 Oats 63262 123325 0.94943252
Based on this analysis, I can conclude that Sorghum was worst hit in 2015. This grain experienced the maximum Year-Over-Year decline, -43.28%, for comparable time periods.