Project 2 for IS 607

Amazon Reviews (suggested by Joy Payton)

Introduction to the Data

The URL for all reviews of the Hutzler Banana Slicer can be found here: http://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending.

By using the element inspector of Chrome, I can peek into the html structure and discover that each review is structured in this way:

div with class "reviews" holds all reviews
  div with class "review" holds a single review
    div with class "helpful-votes-count" has text describing how helpful the review has been to other users
    span with class "a-icon-alt" has text describing the number of stars
    \<a\> with class "review-title" has title of the review
    \<a\> with class "author" has the username of the reviewer
    span with class "review-date" has text containing the date of the review
    span with class "review-text" contains the review text

Tool Preparation

First, I need to install and load rvest, dplyr, and stringi. I’ll also load a few packages even though I’m not using them in the Amazon dataset. They’ll come in handy later! Note that you may have to install packages that are not present in your own R environment.

library(rvest)

## Loading required package: xml2

library(stringi)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(stringr)
library(ggplot2)

Getting the Amazon Reviews

Now I’ll load the page and choose just the “reviews” section. Note that classes are prefixed with a period, so I use “.reviews”.

amazon <- read_html("http://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending")
review_section<-amazon %>% html_node(".reviews")

HTML Parsing

Within the “reviews” section, I have any number of nodes that I need to parse. I first create a list called “reviews”, then create vectors for each element I’m going to pull out of each review.

reviews<-review_section %>% html_nodes(".review")
helpful_votes<-reviews %>% html_nodes(".helpful-votes-count") %>% html_text()
helpful_votes

##  [1] "54,292 of 55,056 people found the following review helpful"
##  [2] "32,320 of 33,353 people found the following review helpful"
##  [3] "12,670 of 13,219 people found the following review helpful"
##  [4] "18,895 of 19,752 people found the following review helpful"
##  [5] "6,371 of 6,885 people found the following review helpful"  
##  [6] "318 of 338 people found the following review helpful"      
##  [7] "5,135 of 5,567 people found the following review helpful"  
##  [8] "5,606 of 6,098 people found the following review helpful"  
##  [9] "735 of 808 people found the following review helpful"      
## [10] "370 of 404 people found the following review helpful"

stars<-reviews %>% html_nodes(".a-icon-alt") %>% html_text()
stars

##  [1] "5.0 out of 5 stars" "5.0 out of 5 stars" "2.0 out of 5 stars"
##  [4] "5.0 out of 5 stars" "5.0 out of 5 stars" "2.0 out of 5 stars"
##  [7] "3.0 out of 5 stars" "1.0 out of 5 stars" "5.0 out of 5 stars"
## [10] "5.0 out of 5 stars"

title<-reviews %>% html_nodes(".review-title") %>% html_text()
title

##  [1] "No more winning for you, Mr. Banana!"                   
##  [2] "Saved my marriage"                                      
##  [3] "Angle is wrong"                                         
##  [4] "GREAT Gift"                                             
##  [5] "Kirk Cameron's banana slicer"                           
##  [6] "NOT wifi capable???"                                    
##  [7] "Just okay"                                              
##  [8] "Confusing"                                              
##  [9] "In a city of a thousand bananas there is always a story"
## [10] "no more stares"

author<-reviews %>% html_nodes(".author") %>% html_text()
author

##  [1] "SW3K"         "Mrs Toledo"   "Jim Anderson" "Uncle Pookie"
##  [5] "Noah"         "Hubert"       "IWonder"      "Q-Tip"       
##  [9] "Freddie"      "Jess"

date<-reviews %>% html_nodes(".review-date") %>% html_text()
date

##  [1] "on March 3, 2011"      "on July 30, 2012"     
##  [3] "on August 1, 2012"     "on August 3, 2012"    
##  [5] "on August 7, 2012"     "on December 30, 2012" 
##  [7] "on August 21, 2012"    "on September 11, 2012"
##  [9] "on January 26, 2013"   "on August 11, 2012"

text<-reviews %>% html_nodes(".review-text") %>% html_text()
head(text)

## [1] "For decades I have been trying to come up with an ideal way to slice a banana.  \"Use a knife!\" they say.  Well...my parole officer won't allow me to be around knives.  \"Shoot it with a gun!\"  Background check...HELLO!  I had to resort to carefully attempt to slice those bananas with my bare hands.  99.9% of the time, I would get so frustrated that I just ended up squishing the fruit in my hands and throwing it against the wall in anger. Then, after a fit of banana-induced rage, my parole officer introduced me to this kitchen marvel and my life was changed.  No longer consumed by seething anger and animosity towards thick-skinned yellow fruit, I was able to concentrate on my love of theatre and am writing a musical play about two lovers from rival gangs that just try to make it in the world.  I think I'll call it South Side Story.Banana slicer...thanks to you, I see greatness on the horizon."                                                                                   
## [2] "What can I say about the 571B Banana Slicer that hasn't already been said about the wheel, penicillin, or the iPhone.... this is one of the greatest inventions of all time. My husband and I would argue constantly over who had to cut the day's banana slices. It's one of those chores NO ONE wants to do! You know, the old \"I spent the entire day rearing OUR children, maybe YOU can pitch in a little and cut these bananas?\" and of course, \"You think I have the energy to slave over your damn bananas? I worked a 12 hour shift just to come home to THIS?!\" These are the things that can destroy an entire relationship. It got to the point where our children could sense the tension. The minute I heard our 6-year-old girl in her bedroom, re-enacting our daily banana fight with her Barbie dolls, I knew we had to make a change. That's when I found the 571B Banana Slicer. Our marriage has never been healthier, AND we've even incorporated it into our lovemaking. THANKS 571B BANANA SLICER!"
## [3] "I tried the banana slicer and found it unacceptable. As shown in the picture, the slicer is curved from left to right. All of my bananas are bent the other way."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [4] "Once I figured out I had to peel the banana before using - it works much better.Ordering one for my nephew who's in the air force in California. He's been using an old slinky to slice his banana's. He should really enjoy this product!"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [5] "If God does not exist, then how is it that a banana fits so perfectly in this banana slicer?  CHECKMATE, ATHEISTS!"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [6] "Maybe I'm doing it wrong, but I can't get this thing hooked up to my wifi network.  Am I supposed to slice my bananas over an ethernet cable?  What is this, 2005?"

Cleaning and Integration

Now I’ll column bind those vectors:

amazon_reviews<-data.frame(cbind(date, title, author, text, stars, helpful_votes))
head(amazon_reviews)

##                   date                                title       author
## 1     on March 3, 2011 No more winning for you, Mr. Banana!         SW3K
## 2     on July 30, 2012                    Saved my marriage   Mrs Toledo
## 3    on August 1, 2012                       Angle is wrong Jim Anderson
## 4    on August 3, 2012                           GREAT Gift Uncle Pookie
## 5    on August 7, 2012         Kirk Cameron's banana slicer         Noah
## 6 on December 30, 2012                  NOT wifi capable???       Hubert
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         text
## 1                                                                                    For decades I have been trying to come up with an ideal way to slice a banana.  "Use a knife!" they say.  Well...my parole officer won't allow me to be around knives.  "Shoot it with a gun!"  Background check...HELLO!  I had to resort to carefully attempt to slice those bananas with my bare hands.  99.9% of the time, I would get so frustrated that I just ended up squishing the fruit in my hands and throwing it against the wall in anger. Then, after a fit of banana-induced rage, my parole officer introduced me to this kitchen marvel and my life was changed.  No longer consumed by seething anger and animosity towards thick-skinned yellow fruit, I was able to concentrate on my love of theatre and am writing a musical play about two lovers from rival gangs that just try to make it in the world.  I think I'll call it South Side Story.Banana slicer...thanks to you, I see greatness on the horizon.
## 2 What can I say about the 571B Banana Slicer that hasn't already been said about the wheel, penicillin, or the iPhone.... this is one of the greatest inventions of all time. My husband and I would argue constantly over who had to cut the day's banana slices. It's one of those chores NO ONE wants to do! You know, the old "I spent the entire day rearing OUR children, maybe YOU can pitch in a little and cut these bananas?" and of course, "You think I have the energy to slave over your damn bananas? I worked a 12 hour shift just to come home to THIS?!" These are the things that can destroy an entire relationship. It got to the point where our children could sense the tension. The minute I heard our 6-year-old girl in her bedroom, re-enacting our daily banana fight with her Barbie dolls, I knew we had to make a change. That's when I found the 571B Banana Slicer. Our marriage has never been healthier, AND we've even incorporated it into our lovemaking. THANKS 571B BANANA SLICER!
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           I tried the banana slicer and found it unacceptable. As shown in the picture, the slicer is curved from left to right. All of my bananas are bent the other way.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Once I figured out I had to peel the banana before using - it works much better.Ordering one for my nephew who's in the air force in California. He's been using an old slinky to slice his banana's. He should really enjoy this product!
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         If God does not exist, then how is it that a banana fits so perfectly in this banana slicer?  CHECKMATE, ATHEISTS!
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Maybe I'm doing it wrong, but I can't get this thing hooked up to my wifi network.  Am I supposed to slice my bananas over an ethernet cable?  What is this, 2005?
##                stars
## 1 5.0 out of 5 stars
## 2 5.0 out of 5 stars
## 3 2.0 out of 5 stars
## 4 5.0 out of 5 stars
## 5 5.0 out of 5 stars
## 6 2.0 out of 5 stars
##                                                helpful_votes
## 1 54,292 of 55,056 people found the following review helpful
## 2 32,320 of 33,353 people found the following review helpful
## 3 12,670 of 13,219 people found the following review helpful
## 4 18,895 of 19,752 people found the following review helpful
## 5   6,371 of 6,885 people found the following review helpful
## 6       318 of 338 people found the following review helpful

I’ve still got to do some cleanup: date, stars, and helpful_votes need to have extraneous text removed.

amazon_reviews$date<-as.Date(amazon_reviews$date, "on %B %d, %Y")
amazon_reviews$stars<-as.numeric(gsub(" out.+", "", amazon_reviews$stars))
amazon_reviews$helpful_votes<-gsub(" of.+", "", amazon_reviews$helpful_votes)
amazon_reviews$helpful_votes<-as.numeric(gsub(",", "", amazon_reviews$helpful_votes))
head(amazon_reviews)

##         date                                title       author
## 1 2011-03-03 No more winning for you, Mr. Banana!         SW3K
## 2 2012-07-30                    Saved my marriage   Mrs Toledo
## 3 2012-08-01                       Angle is wrong Jim Anderson
## 4 2012-08-03                           GREAT Gift Uncle Pookie
## 5 2012-08-07         Kirk Cameron's banana slicer         Noah
## 6 2012-12-30                  NOT wifi capable???       Hubert
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         text
## 1                                                                                    For decades I have been trying to come up with an ideal way to slice a banana.  "Use a knife!" they say.  Well...my parole officer won't allow me to be around knives.  "Shoot it with a gun!"  Background check...HELLO!  I had to resort to carefully attempt to slice those bananas with my bare hands.  99.9% of the time, I would get so frustrated that I just ended up squishing the fruit in my hands and throwing it against the wall in anger. Then, after a fit of banana-induced rage, my parole officer introduced me to this kitchen marvel and my life was changed.  No longer consumed by seething anger and animosity towards thick-skinned yellow fruit, I was able to concentrate on my love of theatre and am writing a musical play about two lovers from rival gangs that just try to make it in the world.  I think I'll call it South Side Story.Banana slicer...thanks to you, I see greatness on the horizon.
## 2 What can I say about the 571B Banana Slicer that hasn't already been said about the wheel, penicillin, or the iPhone.... this is one of the greatest inventions of all time. My husband and I would argue constantly over who had to cut the day's banana slices. It's one of those chores NO ONE wants to do! You know, the old "I spent the entire day rearing OUR children, maybe YOU can pitch in a little and cut these bananas?" and of course, "You think I have the energy to slave over your damn bananas? I worked a 12 hour shift just to come home to THIS?!" These are the things that can destroy an entire relationship. It got to the point where our children could sense the tension. The minute I heard our 6-year-old girl in her bedroom, re-enacting our daily banana fight with her Barbie dolls, I knew we had to make a change. That's when I found the 571B Banana Slicer. Our marriage has never been healthier, AND we've even incorporated it into our lovemaking. THANKS 571B BANANA SLICER!
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           I tried the banana slicer and found it unacceptable. As shown in the picture, the slicer is curved from left to right. All of my bananas are bent the other way.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Once I figured out I had to peel the banana before using - it works much better.Ordering one for my nephew who's in the air force in California. He's been using an old slinky to slice his banana's. He should really enjoy this product!
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         If God does not exist, then how is it that a banana fits so perfectly in this banana slicer?  CHECKMATE, ATHEISTS!
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Maybe I'm doing it wrong, but I can't get this thing hooked up to my wifi network.  Am I supposed to slice my bananas over an ethernet cable?  What is this, 2005?
##   stars helpful_votes
## 1     5         54292
## 2     5         32320
## 3     2         12670
## 4     5         18895
## 5     5          6371
## 6     2           318

Analysis

Now we’re ready to do some analysis. Let’s calculate the word count of the review text.

library(dplyr)
library(stringi)
amazon_reviews<-mutate(amazon_reviews, word_count=stri_count(text, regex="\\S+"))
head(amazon_reviews)

##         date                                title       author
## 1 2011-03-03 No more winning for you, Mr. Banana!         SW3K
## 2 2012-07-30                    Saved my marriage   Mrs Toledo
## 3 2012-08-01                       Angle is wrong Jim Anderson
## 4 2012-08-03                           GREAT Gift Uncle Pookie
## 5 2012-08-07         Kirk Cameron's banana slicer         Noah
## 6 2012-12-30                  NOT wifi capable???       Hubert
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         text
## 1                                                                                    For decades I have been trying to come up with an ideal way to slice a banana.  "Use a knife!" they say.  Well...my parole officer won't allow me to be around knives.  "Shoot it with a gun!"  Background check...HELLO!  I had to resort to carefully attempt to slice those bananas with my bare hands.  99.9% of the time, I would get so frustrated that I just ended up squishing the fruit in my hands and throwing it against the wall in anger. Then, after a fit of banana-induced rage, my parole officer introduced me to this kitchen marvel and my life was changed.  No longer consumed by seething anger and animosity towards thick-skinned yellow fruit, I was able to concentrate on my love of theatre and am writing a musical play about two lovers from rival gangs that just try to make it in the world.  I think I'll call it South Side Story.Banana slicer...thanks to you, I see greatness on the horizon.
## 2 What can I say about the 571B Banana Slicer that hasn't already been said about the wheel, penicillin, or the iPhone.... this is one of the greatest inventions of all time. My husband and I would argue constantly over who had to cut the day's banana slices. It's one of those chores NO ONE wants to do! You know, the old "I spent the entire day rearing OUR children, maybe YOU can pitch in a little and cut these bananas?" and of course, "You think I have the energy to slave over your damn bananas? I worked a 12 hour shift just to come home to THIS?!" These are the things that can destroy an entire relationship. It got to the point where our children could sense the tension. The minute I heard our 6-year-old girl in her bedroom, re-enacting our daily banana fight with her Barbie dolls, I knew we had to make a change. That's when I found the 571B Banana Slicer. Our marriage has never been healthier, AND we've even incorporated it into our lovemaking. THANKS 571B BANANA SLICER!
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           I tried the banana slicer and found it unacceptable. As shown in the picture, the slicer is curved from left to right. All of my bananas are bent the other way.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Once I figured out I had to peel the banana before using - it works much better.Ordering one for my nephew who's in the air force in California. He's been using an old slinky to slice his banana's. He should really enjoy this product!
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         If God does not exist, then how is it that a banana fits so perfectly in this banana slicer?  CHECKMATE, ATHEISTS!
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Maybe I'm doing it wrong, but I can't get this thing hooked up to my wifi network.  Am I supposed to slice my bananas over an ethernet cable?  What is this, 2005?
##   stars helpful_votes word_count
## 1     5         54292        163
## 2     5         32320        182
## 3     2         12670         31
## 4     5         18895         44
## 5     5          6371         21
## 6     2           318         32

Are there any correlations? Let’s create a correlation matrix:

cor(select(amazon_reviews, stars, helpful_votes, word_count))

##                   stars helpful_votes word_count
## stars         1.0000000     0.3633555  0.3369425
## helpful_votes 0.3633555     1.0000000 -0.1205206
## word_count    0.3369425    -0.1205206  1.0000000

There seems to be a weak but noticable positive correlation between the number of stars and the helpful votes. Amazon users seem to find positive reviews (with higher number of stars) more helpful than less positive reviews.

Future Improvements

It would be interesting to do this analysis with 50, or 100, or 1000 reviews, but only 10 are shown in a single page. An improved algorithm would retrieve additional pages of reviews and unite them in a single data frame for analysis.

Papal Popularity (suggested by Andrew Goldberg)

Introduction to the Data

In this section, we’ll be working with popularity data found here: http://www.gallup.com/poll/168098/americans-pope-favorable-light.aspx?g_source=position2&g_medium=related&g_campaign=tiles This is a dataset with lots of untidy features. The first and second rows form a single header. The first row contains two different data types: name and date range. The date variable does not have its own column… this data set needs some serious help!

Getting Papal Popularity

download.file("https://raw.githubusercontent.com/pm0kjp/IS607/master/papal_popularity.csv", "papal_popularity.csv", method="curl")
papal_popularity<-read.csv("papal_popularity.csv")
head(papal_popularity)

##                   X Favorable Unfavorable Never.heard.of...no.opinion
## 1                           %           %                           %
## 2      Pope Francis                                                  
## 3     Feb 6-9, 2014        76           9                          16
## 4   Apr 11-14, 2013        58          10                          31
## 5                                                                    
## 6 Pope Benedict XVI

Cleaning

First off, we notice that the first column has two kinds of data, and that our column names leave much to be desired. We’ll duplicate the first column and create better column names by using transmute().

papal_popularity<-transmute(papal_popularity, name=X, date=X, favorable=Favorable, unfavorable=Unfavorable, no_knowledge_or_opinion = Never.heard.of...no.opinion)

We should remove any rows in which the first column is empty. That will take care of that weird top row with just the percentage symbol as well as separator rows. We’ll also clean the first and second column so that only names appear in the first column and only dates appear in the second:

papal_popularity<-filter(papal_popularity, name!="")
papal_popularity<-mutate(papal_popularity, name=replace(name, !grepl("Pope", name), NA))
papal_popularity<-mutate(papal_popularity, date=replace(date, grepl("Pope", date), NA))
head(papal_popularity)

##                name            date favorable unfavorable
## 1      Pope Francis            <NA>                      
## 2              <NA>   Feb 6-9, 2014        76           9
## 3              <NA> Apr 11-14, 2013        58          10
## 4 Pope Benedict XVI            <NA>                      
## 5              <NA> Mar 26-28, 2010        40          35
## 6              <NA> Apr 18-20, 2008        63          15
##   no_knowledge_or_opinion
## 1                        
## 2                      16
## 3                      31
## 4                        
## 5                      25
## 6                      22

At this point, we have to fill in name rows. The blanks below each Pope’s name represent rows pertaining to him, so we should extend the most recent non-NA down through subsequent NA fields in that first column. This was tricky and I had to resort to the Google Machine, where I found a great solution at http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value.

repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
    ind = which(!is.na(x))      # get positions of nonmissing values
    if(is.na(x[1]))             # if it begins with a missing, add the 
          ind = c(1,ind)        # first position to the indices
    rep(x[ind], times = diff(   # repeat the values at these indices
       c(ind, length(x) + 1) )) # diffing the indices + length yields how often 
}
papal_popularity$name<-repeat.before(papal_popularity$name)
head(papal_popularity)

##                name            date favorable unfavorable
## 1      Pope Francis            <NA>                      
## 2      Pope Francis   Feb 6-9, 2014        76           9
## 3      Pope Francis Apr 11-14, 2013        58          10
## 4 Pope Benedict XVI            <NA>                      
## 5 Pope Benedict XVI Mar 26-28, 2010        40          35
## 6 Pope Benedict XVI Apr 18-20, 2008        63          15
##   no_knowledge_or_opinion
## 1                        
## 2                      16
## 3                      31
## 4                        
## 5                      25
## 6                      22

We’ll now also get rid of that original per-pope header, which we don’t need any more. We can do this by filtering on dates.

papal_popularity<-filter(papal_popularity, !is.na(date))
head(papal_popularity)

##                name            date favorable unfavorable
## 1      Pope Francis   Feb 6-9, 2014        76           9
## 2      Pope Francis Apr 11-14, 2013        58          10
## 3 Pope Benedict XVI Mar 26-28, 2010        40          35
## 4 Pope Benedict XVI Apr 18-20, 2008        63          15
## 5 Pope Benedict XVI   Jun 1-3, 2007        52          16
## 6 Pope Benedict XVI  Dec 16-18,2005        50          11
##   no_knowledge_or_opinion
## 1                      16
## 2                      31
## 3                      25
## 4                      22
## 5                      32
## 6                      39

Now let’s clean up the dates by just grabbing the start date, and converting that to a date type:

papal_popularity$date<-gsub("\\-.+\\,\\s*", " ", papal_popularity$date)
papal_popularity$date<-as.Date(papal_popularity$date, "%b %d %Y")

If we use str() we can see that there are problems with some of our variable types – factor variables with unused factor levels and factor variables that should be numeric. We’ll fix that.

droplevels(papal_popularity$name)

##  [1] Pope Francis      Pope Francis      Pope Benedict XVI
##  [4] Pope Benedict XVI Pope Benedict XVI Pope Benedict XVI
##  [7] Pope Benedict XVI Pope John Paul II Pope John Paul II
## [10] Pope John Paul II Pope John Paul II Pope John Paul II
## Levels: Pope Benedict XVI Pope Francis Pope John Paul II

papal_popularity$favorable<-as.numeric(as.character(papal_popularity$favorable))
papal_popularity$unfavorable<-as.numeric(as.character(papal_popularity$unfavorable))
papal_popularity$no_knowledge_or_opinion<-as.numeric(as.character(papal_popularity$no_knowledge_or_opinion))

Analysis

Now that we have numerical vectors, we can use summarise() to find the mean favorable on a per-pope basis or plot their favorability over time:

group_by(papal_popularity,name) %>% summarise(mean(favorable))

## Source: local data frame [3 x 2]
## 
##                name mean(favorable)
## 1 Pope Benedict XVI            52.0
## 2      Pope Francis            67.0
## 3 Pope John Paul II            72.4

plot(favorable ~ date, filter(papal_popularity, grepl("John Paul", name)))

plot(favorable ~ date, filter(papal_popularity, grepl("Benedict", name)))

plot(favorable ~ date, filter(papal_popularity, grepl("Francis", name)))

Future Improvements

One concern I have with data like this is that I don’t account for the possibility that a date range spans the turn of a year, which would cause some badly parsed data. This is an obvious area for improvement!

USDA Crop Prices (suggested by Mohan Kandaraj)

Introduction to the data

This (http://usda.mannlib.cornell.edu/usda/current/AgriPric/AgriPric-09-29-2015.txt) is a complex text file that has visually delimited tables like this:

Prices Received for Field Crops and Fruits - United States: August 2015 with Comparisons
------------------------------------------------------------------------------------------------------
                                          :     2011     :              :              :              
                 Commodity                :  Base Price  : August 2014  :  July 2015   : August 2015  
------------------------------------------------------------------------------------------------------
                                          :                                                           
Field crops                               :                                                           
 Austrian winter peas .........dollars/cwt:     19.50           23.10            (D)            (D)   
 Barley, all ...............dollars/bushel:      4.79            5.60           5.19           5.59   
  Feed .....................dollars/bushel:      4.59            3.31           2.98           3.00   
  Malting ..................dollars/bushel:      4.83            6.11           5.72           5.90   
 Beans, dry edible ............dollars/cwt:     34.60           35.90          27.50          29.00   
 Canola .......................dollars/cwt:     23.10           17.80          18.10          15.60   
 Chickpeas, all ...............dollars/cwt:     35.70           32.70          29.90          28.60

In this example, I’m going to pull the table I’ve just shown (in its entirety), the Prices Received for Field Crops and Fruits - United States: August 2015 with Comparisons.

Getting the data

First I have to read the data in and toss the rows I’m not concerned with!

download.file("http://usda.mannlib.cornell.edu/usda/current/AgriPric/AgriPric-09-29-2015.txt", "usda.txt", method="curl")
usda_text<-readLines("usda.txt")
head(usda_text, 20)

##  [1] ""                                                                              
##  [2] "Agricultural Prices"                                                           
##  [3] ""                                                                              
##  [4] "ISSN: 1937-4216"                                                               
##  [5] ""                                                                              
##  [6] "Released September 29, 2015, by the National Agricultural Statistics Service " 
##  [7] "(NASS), Agricultural Statistics Board, United States Department of "           
##  [8] "Agriculture (USDA)."                                                           
##  [9] ""                                                                              
## [10] "August Farm Prices Received Index Up 3.0 percent "                             
## [11] ""                                                                              
## [12] "The August Prices Received Index (Agricultural Production), at 102, increased "
## [13] "3.0 percent from July. At 88, the Crop Production Index increased "            
## [14] "2.3 percent. At 117, the Livestock Production Index increased 1.7 percent. "   
## [15] "Producers received higher prices for market eggs, lettuce, strawberries, and " 
## [16] "apples but lower prices for broilers, wheat, onions, and hay. In addition to " 
## [17] "prices, the indexes are influenced by the monthly mix of commodities "         
## [18] "producers market. Increased monthly movement of cattle, grapes, calves, and "  
## [19] "hogs offset the decreased marketing of wheat, soybeans, corn, and hay."        
## [20] ""

Where the heck is the table I’m concerned about?

which(grepl("Prices Received for Field Crops and Fruits - United States", usda_text))

## [1]  61 320

The first occurrence is in a Table of Contents (note the dotted line that leads to “page 10”). The second occurrence is what we want. We’ll lop off the top of the text that doesn’t matter to us, then we’ll cut the bottom off after the third line of hyphens (we know that the first two rows of hyphens constitute part of the table header, and that the third row of all hyphens is the last line of the data table).

usda_text<-usda_text[320:length(usda_text)]
head(usda_text, 10)

##  [1] "Prices Received for Field Crops and Fruits - United States: August 2015 with Comparisons"              
##  [2] "------------------------------------------------------------------------------------------------------"
##  [3] "                                          :     2011     :              :              :              "
##  [4] "                 Commodity                :  Base Price  : August 2014  :  July 2015   : August 2015  "
##  [5] "------------------------------------------------------------------------------------------------------"
##  [6] "                                          :                                                           "
##  [7] "Field crops                               :                                                           "
##  [8] " Austrian winter peas .........dollars/cwt:     19.50           23.10            (D)            (D)   "
##  [9] " Barley, all ...............dollars/bushel:      4.79            5.60           5.19           5.59   "
## [10] "  Feed .....................dollars/bushel:      4.59            3.31           2.98           3.00   "

cutoff<-which(grepl("\\-{20,}", usda_text))[3]  # Where's the third line of 20+ hyphens?  That's our cutoff.
usda_text<-usda_text[1:cutoff]
head(usda_text)

## [1] "Prices Received for Field Crops and Fruits - United States: August 2015 with Comparisons"              
## [2] "------------------------------------------------------------------------------------------------------"
## [3] "                                          :     2011     :              :              :              "
## [4] "                 Commodity                :  Base Price  : August 2014  :  July 2015   : August 2015  "
## [5] "------------------------------------------------------------------------------------------------------"
## [6] "                                          :                                                           "

Cleaning the Data

Time to get our rows into a data frame! This is a fixed-width table, so we don’t have to rely on regex or delimiters. We know that row 3 has some colons that can help us figure out where to cut the text:

str_locate_all(pattern="\\:", usda_text[3])

## [[1]]
##      start end
## [1,]    43  43
## [2,]    58  58
## [3,]    73  73
## [4,]    88  88

col_widths<-c(43,58-43,73-58, 88-73, str_length(usda_text[3])-88)

We’ll need to write the R object back out to file in order to use read.fwf:

write(usda_text,"usda_text.txt")
usda_text<-read.fwf("usda_text.txt", col_widths, skip=6)
head(usda_text)

##                                            V1              V2
## 1 Field crops                               :                
## 2  Austrian winter peas .........dollars/cwt:      19.50     
## 3  Barley, all ...............dollars/bushel:       4.79     
## 4   Feed .....................dollars/bushel:       4.59     
## 5   Malting ..................dollars/bushel:       4.83     
## 6  Beans, dry edible ............dollars/cwt:      34.60     
##                V3              V4             V5
## 1                                               
## 2       23.10             (D)             (D)   
## 3        5.60            5.19            5.59   
## 4        3.31            2.98            3.00   
## 5        6.11            5.72            5.90   
## 6       35.90           27.50           29.00

Now to clean things up. First, we’ll rename the columns. Then, given that our crops are nested in categories of varying specificity, we’ll toss the broadest category (Field crops and Fruits). In the Field crops, we’ll keep the next indented level (with one space), while in the Fruits section, we’ll keep the third intented level (with two spaces). This is because while field crops like soybeans have an “all” category with prices, in addition to the more specific level indented below, the same is not true of fruits.

colnames(usda_text)<-c('commodity', 'base_2011', 'aug_2014', 'jul_2015', 'aug_2015')
head(usda_text)

##                                     commodity       base_2011
## 1 Field crops                               :                
## 2  Austrian winter peas .........dollars/cwt:      19.50     
## 3  Barley, all ...............dollars/bushel:       4.79     
## 4   Feed .....................dollars/bushel:       4.59     
## 5   Malting ..................dollars/bushel:       4.83     
## 6  Beans, dry edible ............dollars/cwt:      34.60     
##          aug_2014        jul_2015       aug_2015
## 1                                               
## 2       23.10             (D)             (D)   
## 3        5.60            5.19            5.59   
## 4        3.31            2.98            3.00   
## 5        6.11            5.72            5.90   
## 6       35.90           27.50           29.00

filter(usda_text, grepl("^[[:alpha:]]+ ",commodity)) # no spaces -- this is the broadest category of crops

##                                     commodity       base_2011
## 1 Field crops                               :                
## 2 Fruits                                    :                
##          aug_2014        jul_2015       aug_2015
## 1                                               
## 2

filter(usda_text, grepl("^\\s{1}[[:alpha:]]+ ",commodity)) # one space -- a bit more specific

##                                      commodity       base_2011
## 1   Austrian winter peas .........dollars/cwt:      19.50     
## 2   Canola .......................dollars/cwt:      23.10     
## 3   Corn ......................dollars/bushel:       6.02     
## 4   Cottonseed ...................dollars/ton:     131.00     
## 5   Flaxseed ..................dollars/bushel:      14.20     
## 6   Lentils ......................dollars/cwt:      27.50     
## 7   Oats ......................dollars/bushel:       3.42     
## 8   Potatoes .....................dollars/cwt:       9.78     
## 9   Sorghum grain ................dollars/cwt:      10.70     
## 10  Soybeans ..................dollars/bushel:      12.50     
##           aug_2014        jul_2015       aug_2015
## 1        23.10             (D)             (D)   
## 2        17.80           18.10           15.60   
## 3         3.63            3.80            3.68   
## 4       182.00            (NA)          192.00   
## 5        13.30           11.50           10.00   
## 6        20.00           30.10           25.80   
## 7         3.24            2.33            2.08   
## 8         8.88            9.04            8.54   
## 9         7.13            7.68            7.90   
## 10       12.40            9.95            9.71

filter(usda_text, grepl("^\\s{2}[[:alpha:]]+ ",commodity)) # two spaces -- more specific still.

##                                      commodity       base_2011
## 1    Feed .....................dollars/bushel:       4.59     
## 2    Malting ..................dollars/bushel:       4.83     
## 3    Large .......................dollars/cwt:      39.20     
## 4    Small .......................dollars/cwt:      23.40     
## 5    Alfalfa .....................dollars/ton:     176.00     
## 6    Other .......................dollars/ton:     119.00     
## 7    Long ........................dollars/cwt:      12.30     
## 8    Medium and short ............dollars/cwt:      18.40     
## 9    Winter ...................dollars/bushel:       6.93     
## 10   Durum ....................dollars/bushel:       9.22     
## 11   Other spring .............dollars/bushel:       8.33     
## 12   Hard red winter  .........dollars/bushel:       7.07     
## 13   Soft red winter  .........dollars/bushel:       6.77     
## 14   Hard red spring  .........dollars/bushel:       8.38     
## 15   White  ...................dollars/bushel:       6.58     
## 16   Grapefruit ..................dollars/box:       7.47     
## 17   Lemons ......................dollars/box:      12.30     
## 18   Oranges .....................dollars/box:       8.16     
## 19   Tangelos ....................dollars/box:       3.50     
## 20   Tangerines and mandarins ....dollars/box:      15.30     
## 21   Apples 1/ .................dollars/pound:      0.325     
## 22   Grapes 1/ ...................dollars/ton:     628.00     
## 23   Peaches 1/ ..................dollars/ton:     507.00     
## 24   Pears 1/ ....................dollars/ton:     559.00     
## 25   Strawberries ................dollars/cwt:     104.00     
##           aug_2014        jul_2015       aug_2015
## 1         3.31            2.98            3.00   
## 2         6.11            5.72            5.90   
## 3        33.30           30.60           29.30   
## 4          (D)           22.00           20.70   
## 5       207.00          169.00          159.00   
## 6       132.00          129.00          115.00   
## 7        14.30            9.96           10.40   
## 8        20.30           16.60           17.20   
## 9         5.90            5.15            4.82   
## 10        8.03            8.74            7.30   
## 11        5.93            5.15            4.72   
## 12        6.03            5.21            4.57   
## 13        5.13            4.69            4.54   
## 14        5.89            5.13            4.69   
## 15        6.40            6.34            6.00   
## 16        8.05            9.33            5.71   
## 17       33.68           37.34           34.95   
## 18       14.78           10.79           12.24   
## 19         (S)             (S)             (S)   
## 20         (S)             (S)             (S)   
## 21       0.388           0.184           0.307   
## 22    1,360.00        1,340.00        1,470.00   
## 23    1,070.00          933.00        1,010.00   
## 24      531.00          706.00          700.00   
## 25       92.70           60.20           86.70

crop_price_simplified<-rbind(filter(usda_text, grepl("^\\s{1}[[:alpha:]]+ ",commodity)), filter(usda_text, grepl("^\\s{2}[[:alpha:]]+ ",commodity))[16:25,])

Now I’ll remove the extra text including units that happens after the commodity name, make sure my prices are considered character data, and strip spaces from across the data frame

crop_price_simplified$commodity<-gsub("1*/*\\s*\\..+","",crop_price_simplified$commodity)
crop_price_simplified$base_2011<-as.character(crop_price_simplified$base_2011)
crop_price_simplified$aug_2014<-as.character(crop_price_simplified$aug_2014)
crop_price_simplified$jul_2015<-as.character(crop_price_simplified$jul_2015)
crop_price_simplified$aug_2015<-as.character(crop_price_simplified$aug_2015)
crop_price_simplified<-data.frame(mapply(str_trim, crop_price_simplified))

I’ll get rid of commas in my numbers and replace (alpha) with blanks. Then I can make my prices into numerical vectors. Note they’ve become factor variables thanks to the mapply/data frame, so I need to be careful!

crop_price_simplified<-data.frame(gsub("\\,","",as.matrix(crop_price_simplified)))
crop_price_simplified<-data.frame(gsub("\\([[:alpha:]]+\\)","",as.matrix(crop_price_simplified)))
crop_price_simplified$base_2011<-as.numeric(as.character(crop_price_simplified$base_2011))
crop_price_simplified$aug_2014<-as.numeric(as.character(crop_price_simplified$aug_2014))
crop_price_simplified$jul_2015<-as.numeric(as.character(crop_price_simplified$jul_2015))
crop_price_simplified$aug_2015<-as.numeric(as.character(crop_price_simplified$aug_2015))
str(crop_price_simplified)

## 'data.frame':    20 obs. of  5 variables:
##  $ commodity: Factor w/ 20 levels "Apples","Austrian winter peas",..: 2 3 4 5 6 10 11 15 16 17 ...
##  $ base_2011: num  19.5 23.1 6.02 131 14.2 27.5 3.42 9.78 10.7 12.5 ...
##  $ aug_2014 : num  23.1 17.8 3.63 182 13.3 20 3.24 8.88 7.13 12.4 ...
##  $ jul_2015 : num  NA 18.1 3.8 NA 11.5 30.1 2.33 9.04 7.68 9.95 ...
##  $ aug_2015 : num  NA 15.6 3.68 192 10 25.8 2.08 8.54 7.9 9.71 ...

head(crop_price_simplified)

##              commodity base_2011 aug_2014 jul_2015 aug_2015
## 1 Austrian winter peas     19.50    23.10       NA       NA
## 2               Canola     23.10    17.80     18.1    15.60
## 3                 Corn      6.02     3.63      3.8     3.68
## 4           Cottonseed    131.00   182.00       NA   192.00
## 5             Flaxseed     14.20    13.30     11.5    10.00
## 6              Lentils     27.50    20.00     30.1    25.80

Looks like we have to gather the date data into its own column!

crop_price_simplified<-gather(crop_price_simplified, "date", "price", 2:5)

Analysis

Let’s just find some simple summary stats.

group_by(crop_price_simplified, commodity) %>%  summarise(mean(price, na.rm = TRUE))

## Source: local data frame [20 x 2]
## 
##                   commodity mean(price, na.rm = TRUE)
## 1                    Apples                    0.3010
## 2      Austrian winter peas                   21.3000
## 3                    Canola                   18.6500
## 4                      Corn                    4.2825
## 5                Cottonseed                  168.3333
## 6                  Flaxseed                   12.2500
## 7                Grapefruit                    7.6400
## 8                    Grapes                 1199.5000
## 9                    Lemons                   29.5675
## 10                  Lentils                   25.8500
## 11                     Oats                    2.7675
## 12                  Oranges                   11.4925
## 13                  Peaches                  880.0000
## 14                    Pears                  624.0000
## 15                 Potatoes                    9.0600
## 16            Sorghum grain                    8.3525
## 17                 Soybeans                   11.1400
## 18             Strawberries                   85.9000
## 19                 Tangelos                    3.5000
## 20 Tangerines and mandarins                   15.3000

group_by(crop_price_simplified, date) %>%  summarise(mean(price, na.rm = TRUE))

## Source: local data frame [4 x 2]
## 
##        date mean(price, na.rm = TRUE)
## 1 base_2011                  105.1388
## 2  aug_2014                  189.0043
## 3  jul_2015                  199.3340
## 4  aug_2015                  211.4834

And graph it!

ggplot(crop_price_simplified, aes(x=factor(commodity), fill=factor(date), y=price)) + geom_bar(position="dodge", stat="identity") + ylab("Price in Dollars") +  xlab ("") + scale_fill_discrete(name="Time")

Future Improvements

It would be interesting indeed to work with the various layers of specificity / subcrops – a bit complicated for my time commitment, but a cool project to work on!

Project 2 for IS 607

Joy Payton

October 6, 2015

Amazon Reviews (suggested by Joy Payton)

Introduction to the Data

Tool Preparation

Getting the Amazon Reviews

HTML Parsing

Cleaning and Integration

Analysis

Future Improvements

Papal Popularity (suggested by Andrew Goldberg)

Introduction to the Data

Getting Papal Popularity

Cleaning

Analysis

Future Improvements

USDA Crop Prices (suggested by Mohan Kandaraj)

Introduction to the data

Getting the data

Cleaning the Data

Analysis

Future Improvements