The URL for all reviews of the Hutzler Banana Slicer can be found here: http://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending.
By using the element inspector of Chrome, I can peek into the html structure and discover that each review is structured in this way:
div with class "reviews" holds all reviews
div with class "review" holds a single review
div with class "helpful-votes-count" has text describing how helpful the review has been to other users
span with class "a-icon-alt" has text describing the number of stars
\<a\> with class "review-title" has title of the review
\<a\> with class "author" has the username of the reviewer
span with class "review-date" has text containing the date of the review
span with class "review-text" contains the review text
First, I need to install and load rvest, dplyr, and stringi. I’ll also load a few packages even though I’m not using them in the Amazon dataset. They’ll come in handy later! Note that you may have to install packages that are not present in your own R environment.
library(rvest)
## Loading required package: xml2
library(stringi)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
library(ggplot2)
Now I’ll load the page and choose just the “reviews” section. Note that classes are prefixed with a period, so I use “.reviews”.
amazon <- read_html("http://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending")
review_section<-amazon %>% html_node(".reviews")
Within the “reviews” section, I have any number of nodes that I need to parse. I first create a list called “reviews”, then create vectors for each element I’m going to pull out of each review.
reviews<-review_section %>% html_nodes(".review")
helpful_votes<-reviews %>% html_nodes(".helpful-votes-count") %>% html_text()
helpful_votes
## [1] "54,292 of 55,056 people found the following review helpful"
## [2] "32,320 of 33,353 people found the following review helpful"
## [3] "12,670 of 13,219 people found the following review helpful"
## [4] "18,895 of 19,752 people found the following review helpful"
## [5] "6,371 of 6,885 people found the following review helpful"
## [6] "318 of 338 people found the following review helpful"
## [7] "5,135 of 5,567 people found the following review helpful"
## [8] "5,606 of 6,098 people found the following review helpful"
## [9] "735 of 808 people found the following review helpful"
## [10] "370 of 404 people found the following review helpful"
stars<-reviews %>% html_nodes(".a-icon-alt") %>% html_text()
stars
## [1] "5.0 out of 5 stars" "5.0 out of 5 stars" "2.0 out of 5 stars"
## [4] "5.0 out of 5 stars" "5.0 out of 5 stars" "2.0 out of 5 stars"
## [7] "3.0 out of 5 stars" "1.0 out of 5 stars" "5.0 out of 5 stars"
## [10] "5.0 out of 5 stars"
title<-reviews %>% html_nodes(".review-title") %>% html_text()
title
## [1] "No more winning for you, Mr. Banana!"
## [2] "Saved my marriage"
## [3] "Angle is wrong"
## [4] "GREAT Gift"
## [5] "Kirk Cameron's banana slicer"
## [6] "NOT wifi capable???"
## [7] "Just okay"
## [8] "Confusing"
## [9] "In a city of a thousand bananas there is always a story"
## [10] "no more stares"
author<-reviews %>% html_nodes(".author") %>% html_text()
author
## [1] "SW3K" "Mrs Toledo" "Jim Anderson" "Uncle Pookie"
## [5] "Noah" "Hubert" "IWonder" "Q-Tip"
## [9] "Freddie" "Jess"
date<-reviews %>% html_nodes(".review-date") %>% html_text()
date
## [1] "on March 3, 2011" "on July 30, 2012"
## [3] "on August 1, 2012" "on August 3, 2012"
## [5] "on August 7, 2012" "on December 30, 2012"
## [7] "on August 21, 2012" "on September 11, 2012"
## [9] "on January 26, 2013" "on August 11, 2012"
text<-reviews %>% html_nodes(".review-text") %>% html_text()
head(text)
## [1] "For decades I have been trying to come up with an ideal way to slice a banana. \"Use a knife!\" they say. Well...my parole officer won't allow me to be around knives. \"Shoot it with a gun!\" Background check...HELLO! I had to resort to carefully attempt to slice those bananas with my bare hands. 99.9% of the time, I would get so frustrated that I just ended up squishing the fruit in my hands and throwing it against the wall in anger. Then, after a fit of banana-induced rage, my parole officer introduced me to this kitchen marvel and my life was changed. No longer consumed by seething anger and animosity towards thick-skinned yellow fruit, I was able to concentrate on my love of theatre and am writing a musical play about two lovers from rival gangs that just try to make it in the world. I think I'll call it South Side Story.Banana slicer...thanks to you, I see greatness on the horizon."
## [2] "What can I say about the 571B Banana Slicer that hasn't already been said about the wheel, penicillin, or the iPhone.... this is one of the greatest inventions of all time. My husband and I would argue constantly over who had to cut the day's banana slices. It's one of those chores NO ONE wants to do! You know, the old \"I spent the entire day rearing OUR children, maybe YOU can pitch in a little and cut these bananas?\" and of course, \"You think I have the energy to slave over your damn bananas? I worked a 12 hour shift just to come home to THIS?!\" These are the things that can destroy an entire relationship. It got to the point where our children could sense the tension. The minute I heard our 6-year-old girl in her bedroom, re-enacting our daily banana fight with her Barbie dolls, I knew we had to make a change. That's when I found the 571B Banana Slicer. Our marriage has never been healthier, AND we've even incorporated it into our lovemaking. THANKS 571B BANANA SLICER!"
## [3] "I tried the banana slicer and found it unacceptable. As shown in the picture, the slicer is curved from left to right. All of my bananas are bent the other way."
## [4] "Once I figured out I had to peel the banana before using - it works much better.Ordering one for my nephew who's in the air force in California. He's been using an old slinky to slice his banana's. He should really enjoy this product!"
## [5] "If God does not exist, then how is it that a banana fits so perfectly in this banana slicer? CHECKMATE, ATHEISTS!"
## [6] "Maybe I'm doing it wrong, but I can't get this thing hooked up to my wifi network. Am I supposed to slice my bananas over an ethernet cable? What is this, 2005?"
Now I’ll column bind those vectors:
amazon_reviews<-data.frame(cbind(date, title, author, text, stars, helpful_votes))
head(amazon_reviews)
## date title author
## 1 on March 3, 2011 No more winning for you, Mr. Banana! SW3K
## 2 on July 30, 2012 Saved my marriage Mrs Toledo
## 3 on August 1, 2012 Angle is wrong Jim Anderson
## 4 on August 3, 2012 GREAT Gift Uncle Pookie
## 5 on August 7, 2012 Kirk Cameron's banana slicer Noah
## 6 on December 30, 2012 NOT wifi capable??? Hubert
## text
## 1 For decades I have been trying to come up with an ideal way to slice a banana. "Use a knife!" they say. Well...my parole officer won't allow me to be around knives. "Shoot it with a gun!" Background check...HELLO! I had to resort to carefully attempt to slice those bananas with my bare hands. 99.9% of the time, I would get so frustrated that I just ended up squishing the fruit in my hands and throwing it against the wall in anger. Then, after a fit of banana-induced rage, my parole officer introduced me to this kitchen marvel and my life was changed. No longer consumed by seething anger and animosity towards thick-skinned yellow fruit, I was able to concentrate on my love of theatre and am writing a musical play about two lovers from rival gangs that just try to make it in the world. I think I'll call it South Side Story.Banana slicer...thanks to you, I see greatness on the horizon.
## 2 What can I say about the 571B Banana Slicer that hasn't already been said about the wheel, penicillin, or the iPhone.... this is one of the greatest inventions of all time. My husband and I would argue constantly over who had to cut the day's banana slices. It's one of those chores NO ONE wants to do! You know, the old "I spent the entire day rearing OUR children, maybe YOU can pitch in a little and cut these bananas?" and of course, "You think I have the energy to slave over your damn bananas? I worked a 12 hour shift just to come home to THIS?!" These are the things that can destroy an entire relationship. It got to the point where our children could sense the tension. The minute I heard our 6-year-old girl in her bedroom, re-enacting our daily banana fight with her Barbie dolls, I knew we had to make a change. That's when I found the 571B Banana Slicer. Our marriage has never been healthier, AND we've even incorporated it into our lovemaking. THANKS 571B BANANA SLICER!
## 3 I tried the banana slicer and found it unacceptable. As shown in the picture, the slicer is curved from left to right. All of my bananas are bent the other way.
## 4 Once I figured out I had to peel the banana before using - it works much better.Ordering one for my nephew who's in the air force in California. He's been using an old slinky to slice his banana's. He should really enjoy this product!
## 5 If God does not exist, then how is it that a banana fits so perfectly in this banana slicer? CHECKMATE, ATHEISTS!
## 6 Maybe I'm doing it wrong, but I can't get this thing hooked up to my wifi network. Am I supposed to slice my bananas over an ethernet cable? What is this, 2005?
## stars
## 1 5.0 out of 5 stars
## 2 5.0 out of 5 stars
## 3 2.0 out of 5 stars
## 4 5.0 out of 5 stars
## 5 5.0 out of 5 stars
## 6 2.0 out of 5 stars
## helpful_votes
## 1 54,292 of 55,056 people found the following review helpful
## 2 32,320 of 33,353 people found the following review helpful
## 3 12,670 of 13,219 people found the following review helpful
## 4 18,895 of 19,752 people found the following review helpful
## 5 6,371 of 6,885 people found the following review helpful
## 6 318 of 338 people found the following review helpful
I’ve still got to do some cleanup: date, stars, and helpful_votes need to have extraneous text removed.
amazon_reviews$date<-as.Date(amazon_reviews$date, "on %B %d, %Y")
amazon_reviews$stars<-as.numeric(gsub(" out.+", "", amazon_reviews$stars))
amazon_reviews$helpful_votes<-gsub(" of.+", "", amazon_reviews$helpful_votes)
amazon_reviews$helpful_votes<-as.numeric(gsub(",", "", amazon_reviews$helpful_votes))
head(amazon_reviews)
## date title author
## 1 2011-03-03 No more winning for you, Mr. Banana! SW3K
## 2 2012-07-30 Saved my marriage Mrs Toledo
## 3 2012-08-01 Angle is wrong Jim Anderson
## 4 2012-08-03 GREAT Gift Uncle Pookie
## 5 2012-08-07 Kirk Cameron's banana slicer Noah
## 6 2012-12-30 NOT wifi capable??? Hubert
## text
## 1 For decades I have been trying to come up with an ideal way to slice a banana. "Use a knife!" they say. Well...my parole officer won't allow me to be around knives. "Shoot it with a gun!" Background check...HELLO! I had to resort to carefully attempt to slice those bananas with my bare hands. 99.9% of the time, I would get so frustrated that I just ended up squishing the fruit in my hands and throwing it against the wall in anger. Then, after a fit of banana-induced rage, my parole officer introduced me to this kitchen marvel and my life was changed. No longer consumed by seething anger and animosity towards thick-skinned yellow fruit, I was able to concentrate on my love of theatre and am writing a musical play about two lovers from rival gangs that just try to make it in the world. I think I'll call it South Side Story.Banana slicer...thanks to you, I see greatness on the horizon.
## 2 What can I say about the 571B Banana Slicer that hasn't already been said about the wheel, penicillin, or the iPhone.... this is one of the greatest inventions of all time. My husband and I would argue constantly over who had to cut the day's banana slices. It's one of those chores NO ONE wants to do! You know, the old "I spent the entire day rearing OUR children, maybe YOU can pitch in a little and cut these bananas?" and of course, "You think I have the energy to slave over your damn bananas? I worked a 12 hour shift just to come home to THIS?!" These are the things that can destroy an entire relationship. It got to the point where our children could sense the tension. The minute I heard our 6-year-old girl in her bedroom, re-enacting our daily banana fight with her Barbie dolls, I knew we had to make a change. That's when I found the 571B Banana Slicer. Our marriage has never been healthier, AND we've even incorporated it into our lovemaking. THANKS 571B BANANA SLICER!
## 3 I tried the banana slicer and found it unacceptable. As shown in the picture, the slicer is curved from left to right. All of my bananas are bent the other way.
## 4 Once I figured out I had to peel the banana before using - it works much better.Ordering one for my nephew who's in the air force in California. He's been using an old slinky to slice his banana's. He should really enjoy this product!
## 5 If God does not exist, then how is it that a banana fits so perfectly in this banana slicer? CHECKMATE, ATHEISTS!
## 6 Maybe I'm doing it wrong, but I can't get this thing hooked up to my wifi network. Am I supposed to slice my bananas over an ethernet cable? What is this, 2005?
## stars helpful_votes
## 1 5 54292
## 2 5 32320
## 3 2 12670
## 4 5 18895
## 5 5 6371
## 6 2 318
Now we’re ready to do some analysis. Let’s calculate the word count of the review text.
library(dplyr)
library(stringi)
amazon_reviews<-mutate(amazon_reviews, word_count=stri_count(text, regex="\\S+"))
head(amazon_reviews)
## date title author
## 1 2011-03-03 No more winning for you, Mr. Banana! SW3K
## 2 2012-07-30 Saved my marriage Mrs Toledo
## 3 2012-08-01 Angle is wrong Jim Anderson
## 4 2012-08-03 GREAT Gift Uncle Pookie
## 5 2012-08-07 Kirk Cameron's banana slicer Noah
## 6 2012-12-30 NOT wifi capable??? Hubert
## text
## 1 For decades I have been trying to come up with an ideal way to slice a banana. "Use a knife!" they say. Well...my parole officer won't allow me to be around knives. "Shoot it with a gun!" Background check...HELLO! I had to resort to carefully attempt to slice those bananas with my bare hands. 99.9% of the time, I would get so frustrated that I just ended up squishing the fruit in my hands and throwing it against the wall in anger. Then, after a fit of banana-induced rage, my parole officer introduced me to this kitchen marvel and my life was changed. No longer consumed by seething anger and animosity towards thick-skinned yellow fruit, I was able to concentrate on my love of theatre and am writing a musical play about two lovers from rival gangs that just try to make it in the world. I think I'll call it South Side Story.Banana slicer...thanks to you, I see greatness on the horizon.
## 2 What can I say about the 571B Banana Slicer that hasn't already been said about the wheel, penicillin, or the iPhone.... this is one of the greatest inventions of all time. My husband and I would argue constantly over who had to cut the day's banana slices. It's one of those chores NO ONE wants to do! You know, the old "I spent the entire day rearing OUR children, maybe YOU can pitch in a little and cut these bananas?" and of course, "You think I have the energy to slave over your damn bananas? I worked a 12 hour shift just to come home to THIS?!" These are the things that can destroy an entire relationship. It got to the point where our children could sense the tension. The minute I heard our 6-year-old girl in her bedroom, re-enacting our daily banana fight with her Barbie dolls, I knew we had to make a change. That's when I found the 571B Banana Slicer. Our marriage has never been healthier, AND we've even incorporated it into our lovemaking. THANKS 571B BANANA SLICER!
## 3 I tried the banana slicer and found it unacceptable. As shown in the picture, the slicer is curved from left to right. All of my bananas are bent the other way.
## 4 Once I figured out I had to peel the banana before using - it works much better.Ordering one for my nephew who's in the air force in California. He's been using an old slinky to slice his banana's. He should really enjoy this product!
## 5 If God does not exist, then how is it that a banana fits so perfectly in this banana slicer? CHECKMATE, ATHEISTS!
## 6 Maybe I'm doing it wrong, but I can't get this thing hooked up to my wifi network. Am I supposed to slice my bananas over an ethernet cable? What is this, 2005?
## stars helpful_votes word_count
## 1 5 54292 163
## 2 5 32320 182
## 3 2 12670 31
## 4 5 18895 44
## 5 5 6371 21
## 6 2 318 32
Are there any correlations? Let’s create a correlation matrix:
cor(select(amazon_reviews, stars, helpful_votes, word_count))
## stars helpful_votes word_count
## stars 1.0000000 0.3633555 0.3369425
## helpful_votes 0.3633555 1.0000000 -0.1205206
## word_count 0.3369425 -0.1205206 1.0000000
There seems to be a weak but noticable positive correlation between the number of stars and the helpful votes. Amazon users seem to find positive reviews (with higher number of stars) more helpful than less positive reviews.
It would be interesting to do this analysis with 50, or 100, or 1000 reviews, but only 10 are shown in a single page. An improved algorithm would retrieve additional pages of reviews and unite them in a single data frame for analysis.
In this section, we’ll be working with popularity data found here: http://www.gallup.com/poll/168098/americans-pope-favorable-light.aspx?g_source=position2&g_medium=related&g_campaign=tiles This is a dataset with lots of untidy features. The first and second rows form a single header. The first row contains two different data types: name and date range. The date variable does not have its own column… this data set needs some serious help!
download.file("https://raw.githubusercontent.com/pm0kjp/IS607/master/papal_popularity.csv", "papal_popularity.csv", method="curl")
papal_popularity<-read.csv("papal_popularity.csv")
head(papal_popularity)
## X Favorable Unfavorable Never.heard.of...no.opinion
## 1 % % %
## 2 Pope Francis
## 3 Feb 6-9, 2014 76 9 16
## 4 Apr 11-14, 2013 58 10 31
## 5
## 6 Pope Benedict XVI
First off, we notice that the first column has two kinds of data, and that our column names leave much to be desired. We’ll duplicate the first column and create better column names by using transmute().
papal_popularity<-transmute(papal_popularity, name=X, date=X, favorable=Favorable, unfavorable=Unfavorable, no_knowledge_or_opinion = Never.heard.of...no.opinion)
We should remove any rows in which the first column is empty. That will take care of that weird top row with just the percentage symbol as well as separator rows. We’ll also clean the first and second column so that only names appear in the first column and only dates appear in the second:
papal_popularity<-filter(papal_popularity, name!="")
papal_popularity<-mutate(papal_popularity, name=replace(name, !grepl("Pope", name), NA))
papal_popularity<-mutate(papal_popularity, date=replace(date, grepl("Pope", date), NA))
head(papal_popularity)
## name date favorable unfavorable
## 1 Pope Francis <NA>
## 2 <NA> Feb 6-9, 2014 76 9
## 3 <NA> Apr 11-14, 2013 58 10
## 4 Pope Benedict XVI <NA>
## 5 <NA> Mar 26-28, 2010 40 35
## 6 <NA> Apr 18-20, 2008 63 15
## no_knowledge_or_opinion
## 1
## 2 16
## 3 31
## 4
## 5 25
## 6 22
At this point, we have to fill in name rows. The blanks below each Pope’s name represent rows pertaining to him, so we should extend the most recent non-NA down through subsequent NA fields in that first column. This was tricky and I had to resort to the Google Machine, where I found a great solution at http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value.
repeat.before = function(x) { # repeats the last non NA value. Keeps leading NA
ind = which(!is.na(x)) # get positions of nonmissing values
if(is.na(x[1])) # if it begins with a missing, add the
ind = c(1,ind) # first position to the indices
rep(x[ind], times = diff( # repeat the values at these indices
c(ind, length(x) + 1) )) # diffing the indices + length yields how often
}
papal_popularity$name<-repeat.before(papal_popularity$name)
head(papal_popularity)
## name date favorable unfavorable
## 1 Pope Francis <NA>
## 2 Pope Francis Feb 6-9, 2014 76 9
## 3 Pope Francis Apr 11-14, 2013 58 10
## 4 Pope Benedict XVI <NA>
## 5 Pope Benedict XVI Mar 26-28, 2010 40 35
## 6 Pope Benedict XVI Apr 18-20, 2008 63 15
## no_knowledge_or_opinion
## 1
## 2 16
## 3 31
## 4
## 5 25
## 6 22
We’ll now also get rid of that original per-pope header, which we don’t need any more. We can do this by filtering on dates.
papal_popularity<-filter(papal_popularity, !is.na(date))
head(papal_popularity)
## name date favorable unfavorable
## 1 Pope Francis Feb 6-9, 2014 76 9
## 2 Pope Francis Apr 11-14, 2013 58 10
## 3 Pope Benedict XVI Mar 26-28, 2010 40 35
## 4 Pope Benedict XVI Apr 18-20, 2008 63 15
## 5 Pope Benedict XVI Jun 1-3, 2007 52 16
## 6 Pope Benedict XVI Dec 16-18,2005 50 11
## no_knowledge_or_opinion
## 1 16
## 2 31
## 3 25
## 4 22
## 5 32
## 6 39
Now let’s clean up the dates by just grabbing the start date, and converting that to a date type:
papal_popularity$date<-gsub("\\-.+\\,\\s*", " ", papal_popularity$date)
papal_popularity$date<-as.Date(papal_popularity$date, "%b %d %Y")
If we use str() we can see that there are problems with some of our variable types – factor variables with unused factor levels and factor variables that should be numeric. We’ll fix that.
droplevels(papal_popularity$name)
## [1] Pope Francis Pope Francis Pope Benedict XVI
## [4] Pope Benedict XVI Pope Benedict XVI Pope Benedict XVI
## [7] Pope Benedict XVI Pope John Paul II Pope John Paul II
## [10] Pope John Paul II Pope John Paul II Pope John Paul II
## Levels: Pope Benedict XVI Pope Francis Pope John Paul II
papal_popularity$favorable<-as.numeric(as.character(papal_popularity$favorable))
papal_popularity$unfavorable<-as.numeric(as.character(papal_popularity$unfavorable))
papal_popularity$no_knowledge_or_opinion<-as.numeric(as.character(papal_popularity$no_knowledge_or_opinion))
Now that we have numerical vectors, we can use summarise() to find the mean favorable on a per-pope basis or plot their favorability over time:
group_by(papal_popularity,name) %>% summarise(mean(favorable))
## Source: local data frame [3 x 2]
##
## name mean(favorable)
## 1 Pope Benedict XVI 52.0
## 2 Pope Francis 67.0
## 3 Pope John Paul II 72.4
plot(favorable ~ date, filter(papal_popularity, grepl("John Paul", name)))
plot(favorable ~ date, filter(papal_popularity, grepl("Benedict", name)))
plot(favorable ~ date, filter(papal_popularity, grepl("Francis", name)))
One concern I have with data like this is that I don’t account for the possibility that a date range spans the turn of a year, which would cause some badly parsed data. This is an obvious area for improvement!
This (http://usda.mannlib.cornell.edu/usda/current/AgriPric/AgriPric-09-29-2015.txt) is a complex text file that has visually delimited tables like this:
Prices Received for Field Crops and Fruits - United States: August 2015 with Comparisons
------------------------------------------------------------------------------------------------------
: 2011 : : :
Commodity : Base Price : August 2014 : July 2015 : August 2015
------------------------------------------------------------------------------------------------------
:
Field crops :
Austrian winter peas .........dollars/cwt: 19.50 23.10 (D) (D)
Barley, all ...............dollars/bushel: 4.79 5.60 5.19 5.59
Feed .....................dollars/bushel: 4.59 3.31 2.98 3.00
Malting ..................dollars/bushel: 4.83 6.11 5.72 5.90
Beans, dry edible ............dollars/cwt: 34.60 35.90 27.50 29.00
Canola .......................dollars/cwt: 23.10 17.80 18.10 15.60
Chickpeas, all ...............dollars/cwt: 35.70 32.70 29.90 28.60
In this example, I’m going to pull the table I’ve just shown (in its entirety), the Prices Received for Field Crops and Fruits - United States: August 2015 with Comparisons.
First I have to read the data in and toss the rows I’m not concerned with!
download.file("http://usda.mannlib.cornell.edu/usda/current/AgriPric/AgriPric-09-29-2015.txt", "usda.txt", method="curl")
usda_text<-readLines("usda.txt")
head(usda_text, 20)
## [1] ""
## [2] "Agricultural Prices"
## [3] ""
## [4] "ISSN: 1937-4216"
## [5] ""
## [6] "Released September 29, 2015, by the National Agricultural Statistics Service "
## [7] "(NASS), Agricultural Statistics Board, United States Department of "
## [8] "Agriculture (USDA)."
## [9] ""
## [10] "August Farm Prices Received Index Up 3.0 percent "
## [11] ""
## [12] "The August Prices Received Index (Agricultural Production), at 102, increased "
## [13] "3.0 percent from July. At 88, the Crop Production Index increased "
## [14] "2.3 percent. At 117, the Livestock Production Index increased 1.7 percent. "
## [15] "Producers received higher prices for market eggs, lettuce, strawberries, and "
## [16] "apples but lower prices for broilers, wheat, onions, and hay. In addition to "
## [17] "prices, the indexes are influenced by the monthly mix of commodities "
## [18] "producers market. Increased monthly movement of cattle, grapes, calves, and "
## [19] "hogs offset the decreased marketing of wheat, soybeans, corn, and hay."
## [20] ""
Where the heck is the table I’m concerned about?
which(grepl("Prices Received for Field Crops and Fruits - United States", usda_text))
## [1] 61 320
The first occurrence is in a Table of Contents (note the dotted line that leads to “page 10”). The second occurrence is what we want. We’ll lop off the top of the text that doesn’t matter to us, then we’ll cut the bottom off after the third line of hyphens (we know that the first two rows of hyphens constitute part of the table header, and that the third row of all hyphens is the last line of the data table).
usda_text<-usda_text[320:length(usda_text)]
head(usda_text, 10)
## [1] "Prices Received for Field Crops and Fruits - United States: August 2015 with Comparisons"
## [2] "------------------------------------------------------------------------------------------------------"
## [3] " : 2011 : : : "
## [4] " Commodity : Base Price : August 2014 : July 2015 : August 2015 "
## [5] "------------------------------------------------------------------------------------------------------"
## [6] " : "
## [7] "Field crops : "
## [8] " Austrian winter peas .........dollars/cwt: 19.50 23.10 (D) (D) "
## [9] " Barley, all ...............dollars/bushel: 4.79 5.60 5.19 5.59 "
## [10] " Feed .....................dollars/bushel: 4.59 3.31 2.98 3.00 "
cutoff<-which(grepl("\\-{20,}", usda_text))[3] # Where's the third line of 20+ hyphens? That's our cutoff.
usda_text<-usda_text[1:cutoff]
head(usda_text)
## [1] "Prices Received for Field Crops and Fruits - United States: August 2015 with Comparisons"
## [2] "------------------------------------------------------------------------------------------------------"
## [3] " : 2011 : : : "
## [4] " Commodity : Base Price : August 2014 : July 2015 : August 2015 "
## [5] "------------------------------------------------------------------------------------------------------"
## [6] " : "
Time to get our rows into a data frame! This is a fixed-width table, so we don’t have to rely on regex or delimiters. We know that row 3 has some colons that can help us figure out where to cut the text:
str_locate_all(pattern="\\:", usda_text[3])
## [[1]]
## start end
## [1,] 43 43
## [2,] 58 58
## [3,] 73 73
## [4,] 88 88
col_widths<-c(43,58-43,73-58, 88-73, str_length(usda_text[3])-88)
We’ll need to write the R object back out to file in order to use read.fwf:
write(usda_text,"usda_text.txt")
usda_text<-read.fwf("usda_text.txt", col_widths, skip=6)
head(usda_text)
## V1 V2
## 1 Field crops :
## 2 Austrian winter peas .........dollars/cwt: 19.50
## 3 Barley, all ...............dollars/bushel: 4.79
## 4 Feed .....................dollars/bushel: 4.59
## 5 Malting ..................dollars/bushel: 4.83
## 6 Beans, dry edible ............dollars/cwt: 34.60
## V3 V4 V5
## 1
## 2 23.10 (D) (D)
## 3 5.60 5.19 5.59
## 4 3.31 2.98 3.00
## 5 6.11 5.72 5.90
## 6 35.90 27.50 29.00
Now to clean things up. First, we’ll rename the columns. Then, given that our crops are nested in categories of varying specificity, we’ll toss the broadest category (Field crops and Fruits). In the Field crops, we’ll keep the next indented level (with one space), while in the Fruits section, we’ll keep the third intented level (with two spaces). This is because while field crops like soybeans have an “all” category with prices, in addition to the more specific level indented below, the same is not true of fruits.
colnames(usda_text)<-c('commodity', 'base_2011', 'aug_2014', 'jul_2015', 'aug_2015')
head(usda_text)
## commodity base_2011
## 1 Field crops :
## 2 Austrian winter peas .........dollars/cwt: 19.50
## 3 Barley, all ...............dollars/bushel: 4.79
## 4 Feed .....................dollars/bushel: 4.59
## 5 Malting ..................dollars/bushel: 4.83
## 6 Beans, dry edible ............dollars/cwt: 34.60
## aug_2014 jul_2015 aug_2015
## 1
## 2 23.10 (D) (D)
## 3 5.60 5.19 5.59
## 4 3.31 2.98 3.00
## 5 6.11 5.72 5.90
## 6 35.90 27.50 29.00
filter(usda_text, grepl("^[[:alpha:]]+ ",commodity)) # no spaces -- this is the broadest category of crops
## commodity base_2011
## 1 Field crops :
## 2 Fruits :
## aug_2014 jul_2015 aug_2015
## 1
## 2
filter(usda_text, grepl("^\\s{1}[[:alpha:]]+ ",commodity)) # one space -- a bit more specific
## commodity base_2011
## 1 Austrian winter peas .........dollars/cwt: 19.50
## 2 Canola .......................dollars/cwt: 23.10
## 3 Corn ......................dollars/bushel: 6.02
## 4 Cottonseed ...................dollars/ton: 131.00
## 5 Flaxseed ..................dollars/bushel: 14.20
## 6 Lentils ......................dollars/cwt: 27.50
## 7 Oats ......................dollars/bushel: 3.42
## 8 Potatoes .....................dollars/cwt: 9.78
## 9 Sorghum grain ................dollars/cwt: 10.70
## 10 Soybeans ..................dollars/bushel: 12.50
## aug_2014 jul_2015 aug_2015
## 1 23.10 (D) (D)
## 2 17.80 18.10 15.60
## 3 3.63 3.80 3.68
## 4 182.00 (NA) 192.00
## 5 13.30 11.50 10.00
## 6 20.00 30.10 25.80
## 7 3.24 2.33 2.08
## 8 8.88 9.04 8.54
## 9 7.13 7.68 7.90
## 10 12.40 9.95 9.71
filter(usda_text, grepl("^\\s{2}[[:alpha:]]+ ",commodity)) # two spaces -- more specific still.
## commodity base_2011
## 1 Feed .....................dollars/bushel: 4.59
## 2 Malting ..................dollars/bushel: 4.83
## 3 Large .......................dollars/cwt: 39.20
## 4 Small .......................dollars/cwt: 23.40
## 5 Alfalfa .....................dollars/ton: 176.00
## 6 Other .......................dollars/ton: 119.00
## 7 Long ........................dollars/cwt: 12.30
## 8 Medium and short ............dollars/cwt: 18.40
## 9 Winter ...................dollars/bushel: 6.93
## 10 Durum ....................dollars/bushel: 9.22
## 11 Other spring .............dollars/bushel: 8.33
## 12 Hard red winter .........dollars/bushel: 7.07
## 13 Soft red winter .........dollars/bushel: 6.77
## 14 Hard red spring .........dollars/bushel: 8.38
## 15 White ...................dollars/bushel: 6.58
## 16 Grapefruit ..................dollars/box: 7.47
## 17 Lemons ......................dollars/box: 12.30
## 18 Oranges .....................dollars/box: 8.16
## 19 Tangelos ....................dollars/box: 3.50
## 20 Tangerines and mandarins ....dollars/box: 15.30
## 21 Apples 1/ .................dollars/pound: 0.325
## 22 Grapes 1/ ...................dollars/ton: 628.00
## 23 Peaches 1/ ..................dollars/ton: 507.00
## 24 Pears 1/ ....................dollars/ton: 559.00
## 25 Strawberries ................dollars/cwt: 104.00
## aug_2014 jul_2015 aug_2015
## 1 3.31 2.98 3.00
## 2 6.11 5.72 5.90
## 3 33.30 30.60 29.30
## 4 (D) 22.00 20.70
## 5 207.00 169.00 159.00
## 6 132.00 129.00 115.00
## 7 14.30 9.96 10.40
## 8 20.30 16.60 17.20
## 9 5.90 5.15 4.82
## 10 8.03 8.74 7.30
## 11 5.93 5.15 4.72
## 12 6.03 5.21 4.57
## 13 5.13 4.69 4.54
## 14 5.89 5.13 4.69
## 15 6.40 6.34 6.00
## 16 8.05 9.33 5.71
## 17 33.68 37.34 34.95
## 18 14.78 10.79 12.24
## 19 (S) (S) (S)
## 20 (S) (S) (S)
## 21 0.388 0.184 0.307
## 22 1,360.00 1,340.00 1,470.00
## 23 1,070.00 933.00 1,010.00
## 24 531.00 706.00 700.00
## 25 92.70 60.20 86.70
crop_price_simplified<-rbind(filter(usda_text, grepl("^\\s{1}[[:alpha:]]+ ",commodity)), filter(usda_text, grepl("^\\s{2}[[:alpha:]]+ ",commodity))[16:25,])
Now I’ll remove the extra text including units that happens after the commodity name, make sure my prices are considered character data, and strip spaces from across the data frame
crop_price_simplified$commodity<-gsub("1*/*\\s*\\..+","",crop_price_simplified$commodity)
crop_price_simplified$base_2011<-as.character(crop_price_simplified$base_2011)
crop_price_simplified$aug_2014<-as.character(crop_price_simplified$aug_2014)
crop_price_simplified$jul_2015<-as.character(crop_price_simplified$jul_2015)
crop_price_simplified$aug_2015<-as.character(crop_price_simplified$aug_2015)
crop_price_simplified<-data.frame(mapply(str_trim, crop_price_simplified))
I’ll get rid of commas in my numbers and replace (alpha) with blanks. Then I can make my prices into numerical vectors. Note they’ve become factor variables thanks to the mapply/data frame, so I need to be careful!
crop_price_simplified<-data.frame(gsub("\\,","",as.matrix(crop_price_simplified)))
crop_price_simplified<-data.frame(gsub("\\([[:alpha:]]+\\)","",as.matrix(crop_price_simplified)))
crop_price_simplified$base_2011<-as.numeric(as.character(crop_price_simplified$base_2011))
crop_price_simplified$aug_2014<-as.numeric(as.character(crop_price_simplified$aug_2014))
crop_price_simplified$jul_2015<-as.numeric(as.character(crop_price_simplified$jul_2015))
crop_price_simplified$aug_2015<-as.numeric(as.character(crop_price_simplified$aug_2015))
str(crop_price_simplified)
## 'data.frame': 20 obs. of 5 variables:
## $ commodity: Factor w/ 20 levels "Apples","Austrian winter peas",..: 2 3 4 5 6 10 11 15 16 17 ...
## $ base_2011: num 19.5 23.1 6.02 131 14.2 27.5 3.42 9.78 10.7 12.5 ...
## $ aug_2014 : num 23.1 17.8 3.63 182 13.3 20 3.24 8.88 7.13 12.4 ...
## $ jul_2015 : num NA 18.1 3.8 NA 11.5 30.1 2.33 9.04 7.68 9.95 ...
## $ aug_2015 : num NA 15.6 3.68 192 10 25.8 2.08 8.54 7.9 9.71 ...
head(crop_price_simplified)
## commodity base_2011 aug_2014 jul_2015 aug_2015
## 1 Austrian winter peas 19.50 23.10 NA NA
## 2 Canola 23.10 17.80 18.1 15.60
## 3 Corn 6.02 3.63 3.8 3.68
## 4 Cottonseed 131.00 182.00 NA 192.00
## 5 Flaxseed 14.20 13.30 11.5 10.00
## 6 Lentils 27.50 20.00 30.1 25.80
Looks like we have to gather the date data into its own column!
crop_price_simplified<-gather(crop_price_simplified, "date", "price", 2:5)
Let’s just find some simple summary stats.
group_by(crop_price_simplified, commodity) %>% summarise(mean(price, na.rm = TRUE))
## Source: local data frame [20 x 2]
##
## commodity mean(price, na.rm = TRUE)
## 1 Apples 0.3010
## 2 Austrian winter peas 21.3000
## 3 Canola 18.6500
## 4 Corn 4.2825
## 5 Cottonseed 168.3333
## 6 Flaxseed 12.2500
## 7 Grapefruit 7.6400
## 8 Grapes 1199.5000
## 9 Lemons 29.5675
## 10 Lentils 25.8500
## 11 Oats 2.7675
## 12 Oranges 11.4925
## 13 Peaches 880.0000
## 14 Pears 624.0000
## 15 Potatoes 9.0600
## 16 Sorghum grain 8.3525
## 17 Soybeans 11.1400
## 18 Strawberries 85.9000
## 19 Tangelos 3.5000
## 20 Tangerines and mandarins 15.3000
group_by(crop_price_simplified, date) %>% summarise(mean(price, na.rm = TRUE))
## Source: local data frame [4 x 2]
##
## date mean(price, na.rm = TRUE)
## 1 base_2011 105.1388
## 2 aug_2014 189.0043
## 3 jul_2015 199.3340
## 4 aug_2015 211.4834
And graph it!
ggplot(crop_price_simplified, aes(x=factor(commodity), fill=factor(date), y=price)) + geom_bar(position="dodge", stat="identity") + ylab("Price in Dollars") + xlab ("") + scale_fill_discrete(name="Time")
It would be interesting indeed to work with the various layers of specificity / subcrops – a bit complicated for my time commitment, but a cool project to work on!